
\section{Data Structures}


\cvclass{gpu::DevMem2D\_}
This is a simple lightweight class that encapsulates pitched memory on the GPU. It is intended to be passed to nvcc-compiled code, i.e. CUDA kernels. Its members can be called from both host and device code.

\begin{lstlisting}
template <typename T> struct DevMem2D_
{            
    int cols;
    int rows;
    T* data;
    size_t step;
	
    DevMem2D_() : cols(0), rows(0), data(0), step(0){};	
    DevMem2D_(int rows_, int cols_, T *data_, size_t step_);
			
    template <typename U>            
    explicit DevMem2D_(const DevMem2D_<U>& d);
	
    typedef T elem_type;
    enum { elem_size = sizeof(elem_type) };

    __CV_GPU_HOST_DEVICE__ size_t elemSize() const;

    /* returns pointer to the beginning of the given image row */
    __CV_GPU_HOST_DEVICE__ T* ptr(int y = 0);
    __CV_GPU_HOST_DEVICE__ const T* ptr(int y = 0) const;
};
\end{lstlisting}
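
The row addressing implied by \texttt{step} can be sketched on the host as follows. This is a minimal illustration and not the library's implementation: \texttt{PitchedView}, \texttt{rowPtr} and \texttt{fillWithRowIndex} are hypothetical names, the buffer lives in ordinary CPU memory, and \texttt{step} is assumed to be the row pitch in bytes, which may exceed \texttt{cols * sizeof(T)} because of padding.

\begin{lstlisting}
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-only stand-in for a pitched 2D view (not an OpenCV type).
template <typename T> struct PitchedView
{
    int cols;
    int rows;
    T* data;
    std::size_t step; // assumed: distance between consecutive rows in BYTES

    // Pointer to the first element of row y: advance y * step bytes.
    T* rowPtr(int y) const
    {
        assert(y >= 0 && y < rows);
        return reinterpret_cast<T*>(
            reinterpret_cast<unsigned char*>(data) + y * step);
    }
};

// Writes the row index into every element and returns the sum of all
// written values; padding at the end of each row is never touched.
inline long fillWithRowIndex(PitchedView<int> view)
{
    long sum = 0;
    for (int y = 0; y < view.rows; ++y)
    {
        int* row = view.rowPtr(y);
        for (int x = 0; x < view.cols; ++x)
        {
            row[x] = y;
            sum += row[x];
        }
    }
    return sum;
}
\end{lstlisting}

In this sketch a view of 3 rows and 3 columns over a buffer with a pitch of 4 \texttt{int} values leaves the fourth value of every row untouched, which is the role padding plays in pitched GPU allocations.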


\cvclass{gpu::PtrStep\_}
This class is similar to DevMem2D\_ but contains only the data pointer and the row step. Image sizes are excluded for performance reasons.

\begin{lstlisting}
template<typename T> struct PtrStep_
{
	T* data;
	size_t step;

	PtrStep_();
	PtrStep_(const DevMem2D_<T>& mem);

	typedef T elem_type;
	enum { elem_size = sizeof(elem_type) };

	__CV_GPU_HOST_DEVICE__ size_t elemSize() const;
	__CV_GPU_HOST_DEVICE__ T* ptr(int y = 0);
	__CV_GPU_HOST_DEVICE__ const T* ptr(int y = 0) const;
};

\end{lstlisting}

\cvclass{gpu::PtrElemStep\_}
This class is similar to DevMem2D\_ but contains only the data pointer and the row step expressed in elements rather than bytes. Image sizes are excluded for performance reasons. This class can only be constructed if \texttt{sizeof(T)} evenly divides 256.

\begin{lstlisting}
template<typename T> struct PtrElemStep_ : public PtrStep_<T>
{                   
	PtrElemStep_(const DevMem2D_<T>& mem);
	__CV_GPU_HOST_DEVICE__ T* ptr(int y = 0);
	__CV_GPU_HOST_DEVICE__ const T* ptr(int y = 0) const;
};
\end{lstlisting}
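
The difference between a byte step and an element step can be sketched on the host as follows. \texttt{ElemStepView} and \texttt{rowPtr} are hypothetical names used only for illustration. The sketch assumes the byte pitch is a whole number of elements, so that dividing it by \texttt{sizeof(T)} loses nothing.

\begin{lstlisting}
#include <cassert>
#include <cstddef>

// Hypothetical host-only view whose step is counted in ELEMENTS, not bytes.
template <typename T> struct ElemStepView
{
    T* data;
    std::size_t step; // row pitch in elements of type T

    // Converts a byte pitch to an element pitch once, at construction time.
    ElemStepView(T* data_, std::size_t stepInBytes)
        : data(data_), step(stepInBytes / sizeof(T))
    {
        // The conversion is exact only when the byte pitch is a whole
        // number of elements.
        assert(stepInBytes % sizeof(T) == 0);
    }

    // Row access is plain pointer arithmetic; no byte casts are needed.
    T* rowPtr(int y) const { return data + y * step; }
};
\end{lstlisting}

Paying for the division once in the constructor keeps the per-row address computation down to a single multiply and add.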


\cvclass{gpu::GpuMat}

The base storage class for GPU memory with reference counting. Its interface closely mirrors that of \cvCppCross{Mat}, so it should feel familiar, with the following limitations: no support for arbitrary dimensions (only 2D); no functions that return references to its data (because references to GPU memory are not valid on the CPU); and no support for expression templates. Because of the last limitation, take care with overloaded matrix operators, since they cause memory allocations. The GpuMat class is convertible to cv::gpu::DevMem2D\_ and cv::gpu::PtrStep\_, so it can be passed directly to a kernel.

\textbf{Please note:} In contrast with \cvCppCross{Mat}, in most cases \texttt{GpuMat::isContinuous() == false}, i.e. rows are aligned to a size that depends on the hardware.
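
The effect of row alignment on continuity comes down to a little arithmetic, sketched below. The helper names \texttt{alignedStep} and \texttt{isContinuous} are hypothetical, and the 512-byte alignment used in the discussion is an assumed value for illustration only; the real pitch is chosen by the CUDA allocator for the device in use.

\begin{lstlisting}
#include <cassert>
#include <cstddef>

// Rounds a row of rowBytes bytes up to the next multiple of alignment.
// The alignment is an assumed, illustrative parameter.
inline std::size_t alignedStep(std::size_t rowBytes, std::size_t alignment)
{
    assert(alignment > 0);
    return (rowBytes + alignment - 1) / alignment * alignment;
}

// Rows are stored back to back only when the step equals the number of
// payload bytes in one row.
inline bool isContinuous(int cols, std::size_t elemSize, std::size_t step)
{
    return step == static_cast<std::size_t>(cols) * elemSize;
}
\end{lstlisting}

For a 640-column image with 3 bytes per pixel and the assumed 512-byte alignment, a row holds 1920 bytes of data but the step is 2048 bytes, so the matrix is not continuous.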

\begin{lstlisting}
class CV_EXPORTS GpuMat
{
public:
	//! default constructor
	GpuMat();

	GpuMat(int rows, int cols, int type);
	GpuMat(Size size, int type);

        .....

	//! builds GpuMat from Mat. Performs blocking upload to device.
	explicit GpuMat (const Mat& m);

	//! returns lightweight DevMem2D_ structure for passing
	//! to nvcc-compiled code. Contains size, data ptr and step.
	template <class T> operator DevMem2D_<T>() const;
	template <class T> operator PtrStep_<T>() const;

	//! performs blocking upload of data to GpuMat.
	void upload(const cv::Mat& m);
	//! upload async
	void upload(const CudaMem& m, Stream& stream);

	//! downloads data from device to host memory. Blocking calls.
	operator Mat() const;
	void download(cv::Mat& m) const;

	//! download async
	void download(CudaMem& m, Stream& stream) const;
};
\end{lstlisting}

\textbf{Please note:} It is bad practice to leave static or global GpuMat variables allocated, i.e. to rely on their destructors. The destruction order of such variables and the CUDA context is undefined, and the GPU memory release function returns an error if the CUDA context has already been destroyed.


See also: \cvCppCross{Mat}


\cvclass{gpu::CudaMem}
This is a reference-counted class that wraps CUDA's special memory allocation functions. Its interface is also \cvCppCross{Mat}-like, but with an additional memory type parameter:
\begin{itemize}
    \item \texttt{ALLOC\_PAGE\_LOCKED} Sets the page-locked memory type, commonly used for fast and asynchronous upload/download of data to/from the GPU.
    \item \texttt{ALLOC\_ZEROCOPY} Specifies zero-copy memory allocation, i.e. host memory that can be mapped into the GPU address space if supported.
    \item \texttt{ALLOC\_WRITE\_COMBINED} Sets a write-combined buffer, which is not cached by the CPU. Such buffers are used to supply the GPU with data when the GPU only reads it. The advantage is better CPU cache utilization.
\end{itemize}
Please note that the allocation size of such memory types is usually limited. For more details, see the ``CUDA 2.2 Pinned Memory APIs'' document or the ``CUDA C Programming Guide''.
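
As a rough sketch of how such a wrapper might translate the memory type into an allocation flag, the function below maps each type to the corresponding \texttt{cudaHostAlloc} flag. This mapping is an assumption made for illustration, not a statement about the library's source. \texttt{toHostAllocFlag} and the \texttt{HOST\_ALLOC\_*} constants are hypothetical stand-ins, defined locally so that the sketch compiles without the CUDA toolkit.

\begin{lstlisting}
#include <stdexcept>

// Values copied from the CudaMem declaration in this section.
enum AllocType
{
    ALLOC_PAGE_LOCKED = 1, ALLOC_ZEROCOPY = 2, ALLOC_WRITE_COMBINED = 4
};

// Local stand-ins for the CUDA runtime flags accepted by cudaHostAlloc.
enum HostAllocFlag
{
    HOST_ALLOC_DEFAULT        = 0x00, // mirrors cudaHostAllocDefault
    HOST_ALLOC_MAPPED         = 0x02, // mirrors cudaHostAllocMapped
    HOST_ALLOC_WRITE_COMBINED = 0x04  // mirrors cudaHostAllocWriteCombined
};

// Assumed mapping from a memory type to the flag a wrapper would pass
// to cudaHostAlloc; unknown types are rejected.
inline HostAllocFlag toHostAllocFlag(int allocType)
{
    switch (allocType)
    {
    case ALLOC_PAGE_LOCKED:    return HOST_ALLOC_DEFAULT;
    case ALLOC_ZEROCOPY:       return HOST_ALLOC_MAPPED;
    case ALLOC_WRITE_COMBINED: return HOST_ALLOC_WRITE_COMBINED;
    default: throw std::invalid_argument("invalid alloc type");
    }
}
\end{lstlisting}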

\begin{lstlisting}
class CV_EXPORTS CudaMem
{
public:
	enum  { ALLOC_PAGE_LOCKED = 1, ALLOC_ZEROCOPY = 2,
                 ALLOC_WRITE_COMBINED = 4 };

	CudaMem(Size size, int type, int alloc_type = ALLOC_PAGE_LOCKED);

	//! creates from cv::Mat, copying the data
	explicit CudaMem(const Mat& m, int alloc_type = ALLOC_PAGE_LOCKED);

	 ......

	void create(Size size, int type, int alloc_type = ALLOC_PAGE_LOCKED);

	//! returns matrix header with disabled ref. counting for CudaMem data.
	Mat createMatHeader() const;
	operator Mat() const;

	//! maps host memory into device address space
	GpuMat createGpuMatHeader() const;
	operator GpuMat() const;

	//! returns true if host memory can be mapped to GPU address space
	static bool canMapHostMemory();

	int alloc_type;
};

\end{lstlisting}

\cvCppFunc{gpu::CudaMem::createMatHeader}
Creates a \cvCppCross{Mat} header without reference counting for CudaMem data.

\cvdefCpp{
Mat CudaMem::createMatHeader() const; \newline
CudaMem::operator Mat() const;
}

\cvCppFunc{gpu::CudaMem::createGpuMatHeader}
Maps CPU memory to the GPU address space and creates a \cvCppCross{gpu::GpuMat} header without reference counting for it. This can be done only if the memory was allocated with the \texttt{ALLOC\_ZEROCOPY} flag and if it is supported by the hardware (laptops often share video and CPU memory, so the address spaces can be mapped, which eliminates an extra copy).

\cvdefCpp{
GpuMat CudaMem::createGpuMatHeader() const; \newline
CudaMem::operator GpuMat() const;
}

\cvCppFunc{gpu::CudaMem::canMapHostMemory}
Returns true if the current hardware supports address space mapping and \texttt{ALLOC\_ZEROCOPY} memory allocation.
\cvdefCpp{static bool CudaMem::canMapHostMemory();}


\cvclass{gpu::Stream}


This class encapsulates a queue of asynchronous calls. Some functions have overloads with an additional \cvCppCross{gpu::Stream} parameter. The overloads do initialization work (allocate output buffers, upload constants, etc.), start the GPU kernel, and return before the results are ready. Whether all operations are complete can be checked via \cvCppCross{gpu::Stream::queryIfComplete()}. Asynchronous upload/download has to be performed from/to page-locked buffers, i.e. using \cvCppCross{gpu::CudaMem} or a \cvCppCross{Mat} header that points to a region of \cvCppCross{gpu::CudaMem}.

\textbf{Please note the limitation}: currently it is not guaranteed that everything will work properly if one operation is enqueued twice with different data. Some functions use constant GPU memory, and the next call may update that memory before the previous call has finished. Calling different operations asynchronously is safe, however, because each operation has its own constant buffer. Memory copy/upload/download/set operations on buffers held by the user are also safe.
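
The hazard can be reproduced with a small host-only model. Everything in it is hypothetical: \texttt{ToyStream} stands in for a stream by running queued tasks in order when \texttt{waitForCompletion} is called, and \texttt{ScaleOp} stands in for an operation that keeps its parameter in a single shared slot, playing the role of a per-operation constant buffer.

\begin{lstlisting}
#include <functional>
#include <queue>
#include <utility>

// Hypothetical model of a stream: tasks are queued and run later, in order.
class ToyStream
{
public:
    void enqueue(std::function<void()> task) { tasks_.push(std::move(task)); }
    bool queryIfComplete() const { return tasks_.empty(); }
    void waitForCompletion()
    {
        while (!tasks_.empty())
        {
            tasks_.front()();
            tasks_.pop();
        }
    }
private:
    std::queue<std::function<void()> > tasks_;
};

// Hypothetical operation with ONE shared parameter slot for all its calls.
struct ScaleOp
{
    static int& constantBuffer() { static int value = 0; return value; }

    // The constant is written immediately, but the work is deferred, so a
    // later call can overwrite the constant before the earlier work runs.
    static void enqueue(ToyStream& stream, int factor, const int& src, int& dst)
    {
        constantBuffer() = factor;
        stream.enqueue([&src, &dst] { dst = src * constantBuffer(); });
    }
};
\end{lstlisting}

In this model, enqueuing \texttt{ScaleOp} twice with factors 2 and 3 before waiting applies the factor 3 to both results, while waiting for completion between the two calls gives each call its own factor.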

\begin{lstlisting}
class CV_EXPORTS Stream
{
public:
	Stream();
	~Stream();

	Stream(const Stream&);
	Stream& operator=(const Stream&);

	bool queryIfComplete();
	void waitForCompletion();

	//! downloads asynchronously.
	// Warning! cv::Mat must point to page locked memory
	// (i.e. to CudaMem data or to its subMat)
	void enqueueDownload(const GpuMat& src, CudaMem& dst);
	void enqueueDownload(const GpuMat& src, Mat& dst);

	//! uploads asynchronously.
	// Warning! cv::Mat must point to page locked memory
	// (i.e. to CudaMem data or to its ROI)
	void enqueueUpload(const CudaMem& src, GpuMat& dst);
	void enqueueUpload(const Mat& src, GpuMat& dst);

	void enqueueCopy(const GpuMat& src, GpuMat& dst);

	void enqueueMemSet(const GpuMat& src, Scalar val);
	void enqueueMemSet(const GpuMat& src, Scalar val, const GpuMat& mask);

	// converts matrix type, e.g. from float to uchar, depending on type
	void enqueueConvert(const GpuMat& src, GpuMat& dst, int type, 
                double a = 1, double b = 0);
};

\end{lstlisting}

\cvCppFunc{gpu::Stream::queryIfComplete}
Returns true if current stream queue is finished, otherwise false.
\cvdefCpp{bool Stream::queryIfComplete();}

\cvCppFunc{gpu::Stream::waitForCompletion}
Blocks until all operations in the stream are complete.
\cvdefCpp{void Stream::waitForCompletion();}


\cvclass{gpu::StreamAccessor}

This class makes it possible to get a \texttt{cudaStream\_t} from a \cvCppCross{gpu::Stream}. It is declared in \texttt{stream\_accessor.hpp} because that is the only public header that depends on the CUDA Runtime API. Including it will bring that dependency into your code.

\begin{lstlisting}
struct StreamAccessor
{
	CV_EXPORTS static cudaStream_t getStream(const Stream& stream);
};
\end{lstlisting}

\cvCppFunc{gpu::createContinuous}
Creates a continuous matrix in GPU memory.

\cvdefCpp{void createContinuous(int rows, int cols, int type, GpuMat\& m);}
\begin{description}
\cvarg{rows}{Row count.}
\cvarg{cols}{Column count.}
\cvarg{type}{Type of the matrix.}
\cvarg{m}{Destination matrix. It is only reshaped, without reallocation, if it already has the proper type and area (\texttt{rows} $\times$ \texttt{cols}).}
\end{description}

Also the following wrappers are available:
\cvdefCpp{GpuMat createContinuous(int rows, int cols, int type);\newline
void createContinuous(Size size, int type, GpuMat\& m);\newline
GpuMat createContinuous(Size size, int type);}

A matrix is called continuous if its elements are stored contiguously, i.e. without gaps at the end of each row.
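
The reuse rule for the destination matrix can be modelled on the host as follows. This is a sketch of the behaviour described above, not the library's code: \texttt{ShapeOnlyMat} and \texttt{createContinuousModel} are hypothetical names, no pixel data is stored, and an allocation is modelled as a single row, which by construction has no row padding.

\begin{lstlisting}
#include <cassert>

// Hypothetical host-only header that records shape only; no pixels are stored.
struct ShapeOnlyMat
{
    int rows, cols, type;
    bool continuous;
    int allocations; // how many times a new buffer had to be allocated

    ShapeOnlyMat()
        : rows(0), cols(0), type(-1), continuous(true), allocations(0) {}

    // Models the allocation of one long row of area elements.
    void createRow(int area, int type_)
    {
        rows = 1; cols = area; type = type_;
        continuous = true;
        ++allocations;
    }
};

// Allocate only if the type or the area differs, or if the matrix has
// gaps; otherwise just reinterpret the existing buffer with a new shape.
inline void createContinuousModel(int rows, int cols, int type, ShapeOnlyMat& m)
{
    assert(rows >= 0 && cols >= 0);
    const int area = rows * cols;
    if (!m.continuous || m.type != type || m.rows * m.cols != area)
        m.createRow(area, type);
    m.rows = rows; // reshape: same elements, new row count
    m.cols = cols;
}
\end{lstlisting}

In this model, asking for a $3 \times 8$ matrix after a $4 \times 6$ matrix of the same type triggers no new allocation, because both have an area of 24 elements.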


\cvCppFunc{gpu::ensureSizeIsEnough}
Ensures that the matrix is big enough and has the proper type. The function does not reallocate memory if the matrix already has the proper attributes.

\cvdefCpp{void ensureSizeIsEnough(int rows, int cols, int type, GpuMat\& m);}
\begin{description}
\cvarg{rows}{Minimum desired number of rows.}
\cvarg{cols}{Minimum desired number of columns.}
\cvarg{type}{Desired matrix type.}
\cvarg{m}{Destination matrix.}
\end{description}

Also the following wrapper is available:
\cvdefCpp{void ensureSizeIsEnough(Size size, int type, GpuMat\& m);}
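
The check described above can be modelled on the host as follows, for example to see why a scratch buffer passed through this function is allocated only once inside a loop. \texttt{SizedMat} and \texttt{ensureSizeIsEnoughModel} are hypothetical names. The model covers only the decision whether to reallocate; anything else the real function may do to the matrix header is outside its scope.

\begin{lstlisting}
#include <cassert>

// Hypothetical host-only header that records shape only.
struct SizedMat
{
    int rows, cols, type;
    int allocations; // how many times a new buffer had to be allocated

    SizedMat() : rows(0), cols(0), type(-1), allocations(0) {}

    void create(int rows_, int cols_, int type_)
    {
        rows = rows_; cols = cols_; type = type_;
        ++allocations;
    }
};

// Keep the existing buffer when it already has the requested type and is
// at least as large as requested in both dimensions.
inline void ensureSizeIsEnoughModel(int rows, int cols, int type, SizedMat& m)
{
    assert(rows >= 0 && cols >= 0);
    if (m.type == type && m.rows >= rows && m.cols >= cols)
        return;
    m.create(rows, cols, type);
}
\end{lstlisting}

In this model a request for a smaller size of the same type leaves the buffer alone, while a larger size or a different type triggers a new allocation.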