:param arr:Destination array (host or device memory, can be :ocv:class:`Mat` , :ocv:class:`gpu::GpuMat` , :ocv:class:`ogl::Buffer` or ``ogl::Texture2D`` ).
:param arr:Destination array (host or device memory, can be :ocv:class:`Mat` , :ocv:class:`cuda::GpuMat` , :ocv:class:`ogl::Buffer` or ``ogl::Texture2D`` ).
:param ddepth:Destination depth.
@ -532,12 +532,12 @@ Render OpenGL texture or primitives.
gpu::setGlDevice
----------------
cuda::setGlDevice
-----------------
Sets a CUDA device and initializes it for the current thread with OpenGL interoperability.
..ocv:function:: void gpu::setGlDevice( int device = 0 )
..ocv:function:: void cuda::setGlDevice( int device = 0 )
:param device:System index of a GPU device starting with 0.
:param device:System index of a CUDA device starting with 0.
This function should be explicitly called after OpenGL context creation and before any CUDA calls.
Lightweight class encapsulating pitched memory on a GPU and passed to nvcc-compiled code (CUDA kernels). Typically, it is used internally by OpenCV and by users who write device code. You can call its members from both host and device code. ::
@ -30,11 +30,11 @@ Lightweight class encapsulating pitched memory on a GPU and passed to nvcc-compi
gpu::PtrStep
------------
..ocv:class::gpu::PtrStep
cuda::PtrStep
-------------
..ocv:class::cuda::PtrStep
Structure similar to :ocv:class:`gpu::PtrStepSz` but containing only a pointer and row step. Width and height fields are excluded due to performance reasons. The structure is intended for internal use or for users who write device code. ::
Structure similar to :ocv:class:`cuda::PtrStepSz` but containing only a pointer and row step. Width and height fields are excluded due to performance reasons. The structure is intended for internal use or for users who write device code. ::
template <typename T> struct PtrStep : public DevPtr<T>
{
@ -57,9 +57,9 @@ Structure similar to :ocv:class:`gpu::PtrStepSz` but containing only a pointer a
gpu::GpuMat
-----------
..ocv:class::gpu::GpuMat
cuda::GpuMat
------------
..ocv:class::cuda::GpuMat
Base storage class for GPU memory with reference counting. Its interface matches the :ocv:class:`Mat` interface with the following limitations:
@ -67,7 +67,7 @@ Base storage class for GPU memory with reference counting. Its interface matches
* no functions that return references to their data (because references on GPU are not valid for CPU)
* no expression templates technique support
Beware that the latter limitation may lead to overloaded matrix operators that cause memory allocations. The ``GpuMat`` class is convertible to :ocv:class:`gpu::PtrStepSz` and :ocv:class:`gpu::PtrStep` so it can be passed directly to the kernel.
Beware that the latter limitation may lead to overloaded matrix operators that cause memory allocations. The ``GpuMat`` class is convertible to :ocv:class:`cuda::PtrStepSz` and :ocv:class:`cuda::PtrStep` so it can be passed directly to the kernel.
..note:: In contrast with :ocv:class:`Mat`, in most cases ``GpuMat::isContinuous() == false`` . This means that rows are aligned to a size depending on the hardware. Single-row ``GpuMat`` is always a continuous matrix.
@ -76,34 +76,34 @@ Beware that the latter limitation may lead to overloaded matrix operators that c
class CV_EXPORTS GpuMat
{
public:
//! default constructor
GpuMat();
//! default constructor
GpuMat();
//! constructs GpuMat of the specified size and type
GpuMat(int rows, int cols, int type);
GpuMat(Size size, int type);
//! constructs GpuMat of the specified size and type
GpuMat(int rows, int cols, int type);
GpuMat(Size size, int type);
.....
.....
//! builds GpuMat from host memory (Blocking call)
explicit GpuMat(InputArray arr);
//! builds GpuMat from host memory (Blocking call)
explicit GpuMat(InputArray arr);
//! returns lightweight PtrStepSz structure for passing
//to nvcc-compiled code. Contains size, data ptr and step.
template <class T> operator PtrStepSz<T>() const;
template <class T> operator PtrStep<T>() const;
//! returns lightweight PtrStepSz structure for passing
//to nvcc-compiled code. Contains size, data ptr and step.
template <class T> operator PtrStepSz<T>() const;
template <class T> operator PtrStep<T>() const;
//! performs data upload to GpuMat (Blocking call)
void upload(InputArray arr);
//! performs data upload to GpuMat (Blocking call)
void upload(InputArray arr);
//! performs data upload to GpuMat (Non-Blocking call)
void upload(InputArray arr, Stream& stream);
//! performs data upload to GpuMat (Non-Blocking call)
void upload(InputArray arr, Stream& stream);
//! performs data download from device to host memory (Blocking call)
void download(OutputArray dst) const;
//! performs data download from device to host memory (Blocking call)
void download(OutputArray dst) const;
//! performs data download from device to host memory (Non-Blocking call)
@ -113,11 +113,11 @@ Beware that the latter limitation may lead to overloaded matrix operators that c
gpu::createContinuous
---------------------
cuda::createContinuous
----------------------
Creates a continuous matrix.
..ocv:function:: void gpu::createContinuous(int rows, int cols, int type, OutputArray arr)
..ocv:function:: void cuda::createContinuous(int rows, int cols, int type, OutputArray arr)
:param rows:Row count.
@ -131,11 +131,11 @@ Matrix is called continuous if its elements are stored continuously, that is, wi
gpu::ensureSizeIsEnough
-----------------------
cuda::ensureSizeIsEnough
------------------------
Ensures that the size of a matrix is big enough and the matrix has a proper type.
..ocv:function:: void gpu::ensureSizeIsEnough(int rows, int cols, int type, OutputArray arr)
..ocv:function:: void cuda::ensureSizeIsEnough(int rows, int cols, int type, OutputArray arr)
:param rows:Minimum desired number of rows.
@ -149,9 +149,9 @@ The function does not reallocate memory if the matrix has proper attributes alre
gpu::CudaMem
------------
..ocv:class::gpu::CudaMem
cuda::CudaMem
-------------
..ocv:class::cuda::CudaMem
Class with reference counting wrapping special memory type allocation functions from CUDA. Its interface is also :ocv:func:`Mat`-like but with additional memory type parameters.
@ -191,47 +191,47 @@ Class with reference counting wrapping special memory type allocation functions
gpu::CudaMem::createMatHeader
-----------------------------
Creates a header without reference counting to :ocv:class:`gpu::CudaMem` data.
cuda::CudaMem::createMatHeader
------------------------------
Creates a header without reference counting to :ocv:class:`cuda::CudaMem` data.
..ocv:function:: Mat gpu::CudaMem::createMatHeader() const
..ocv:function:: Mat cuda::CudaMem::createMatHeader() const
gpu::CudaMem::createGpuMatHeader
--------------------------------
Maps CPU memory to GPU address space and creates the :ocv:class:`gpu::GpuMat` header without reference counting for it.
cuda::CudaMem::createGpuMatHeader
---------------------------------
Maps CPU memory to GPU address space and creates the :ocv:class:`cuda::GpuMat` header without reference counting for it.
This can be done only if memory was allocated with the ``SHARED`` flag and if it is supported by the hardware. Laptops often share video and CPU memory, so address spaces can be mapped, which eliminates an extra copy.
gpu::registerPageLocked
-----------------------
cuda::registerPageLocked
------------------------
Page-locks the memory of matrix and maps it for the device(s).
..ocv:function:: void gpu::registerPageLocked(Mat& m)
..ocv:function:: void cuda::registerPageLocked(Mat& m)
:param m:Input matrix.
gpu::unregisterPageLocked
-------------------------
cuda::unregisterPageLocked
--------------------------
Unmaps the memory of matrix and makes it pageable again.
..ocv:function:: void gpu::unregisterPageLocked(Mat& m)
..ocv:function:: void cuda::unregisterPageLocked(Mat& m)
:param m:Input matrix.
gpu::Stream
-----------
..ocv:class::gpu::Stream
cuda::Stream
------------
..ocv:class::cuda::Stream
This class encapsulates a queue of asynchronous calls.
@ -265,45 +265,45 @@ This class encapsulates a queue of asynchronous calls.
gpu::Stream::queryIfComplete
----------------------------
cuda::Stream::queryIfComplete
-----------------------------
Returns ``true`` if the current stream queue is finished. Otherwise, it returns ``false``.
..note:: Callbacks must not make any CUDA API calls. Callbacks must not perform any synchronization that may depend on outstanding device work or other callbacks that are not mandated to run earlier. Callbacks without a mandated order (in independent streams) execute in undefined order and may be serialized.
gpu::StreamAccessor
-------------------
..ocv:struct::gpu::StreamAccessor
cuda::StreamAccessor
--------------------
..ocv:struct::cuda::StreamAccessor
Class that enables getting ``cudaStream_t`` from :ocv:class:`gpu::Stream` and is declared in ``stream_accessor.hpp`` because it is the only public header that depends on the CUDA Runtime API. Including it brings a dependency to your code. ::
Class that enables getting ``cudaStream_t`` from :ocv:class:`cuda::Stream` and is declared in ``stream_accessor.hpp`` because it is the only public header that depends on the CUDA Runtime API. Including it brings a dependency to your code. ::
The OpenCV CUDA module is a set of classes and functions to utilize CUDA computational capabilities. It is implemented using the NVIDIA* CUDA* Runtime API and supports only NVIDIA GPUs. The OpenCV CUDA module includes utility functions, low-level vision primitives, and high-level algorithms. The utility functions and low-level primitives provide a powerful infrastructure for developing fast vision algorithms that take advantage of CUDA, whereas the high-level functionality includes some state-of-the-art algorithms (such as stereo correspondence, face and people detectors, and others) ready to be used by application developers.
The CUDA module is designed as a host-level API. This means that if you have pre-compiled OpenCV CUDA binaries, you are not required to have the CUDA Toolkit installed or to write any extra code to make use of CUDA.
The OpenCV CUDA module is designed for ease of use and does not require any knowledge of CUDA. However, such knowledge is useful for handling non-trivial cases or achieving the highest performance. It is helpful to understand the cost of various operations, what the GPU does, what the preferred data formats are, and so on. The CUDA module is an effective instrument for quick implementation of CUDA-accelerated computer vision algorithms. However, if your algorithm involves many simple operations, then, for the best possible performance, you may still need to write your own kernels to avoid extra write and read operations on the intermediate results.
To enable CUDA support, configure OpenCV using ``CMake`` with ``WITH_CUDA=ON`` . When the flag is set and if CUDA is installed, the full-featured OpenCV CUDA module is built. Otherwise, the module is still built but at runtime all functions from the module throw
:ocv:class:`Exception` with ``CV_GpuNotSupported`` error code, except for
:ocv:func:`cuda::getCudaEnabledDeviceCount()`. The latter function returns a GPU count of zero in this case. Building OpenCV without CUDA support does not perform device code compilation, so it does not require the CUDA Toolkit to be installed. Therefore, using the
:ocv:func:`cuda::getCudaEnabledDeviceCount()` function, you can implement a high-level algorithm that will detect GPU presence at runtime and choose an appropriate implementation (CPU or GPU) accordingly.
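The runtime choice described above can be sketched as follows. This is a minimal, library-free illustration rather than OpenCV code: ``Backend`` and ``chooseBackend`` are hypothetical names, and the ``deviceCount`` callable is an assumed stand-in for :ocv:func:`cuda::getCudaEnabledDeviceCount()` so that the dispatch logic can be read and tested without a CUDA build.

```cpp
#include <cassert>
#include <functional>

// Hypothetical tag for the implementation selected at runtime.
enum class Backend { CPU, GPU };

// deviceCount is an assumed stand-in for cuda::getCudaEnabledDeviceCount().
// A build without CUDA support reports zero devices, so the CPU path is chosen.
Backend chooseBackend(const std::function<int()>& deviceCount)
{
    assert(deviceCount && "a device-count callable is required");
    return deviceCount() > 0 ? Backend::GPU : Backend::CPU;
}
```

In an application, the callable would simply forward to the real device-count function, and the returned value would select between the CPU and CUDA variants of the algorithm.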
Compilation for Different NVIDIA* Platforms
-------------------------------------------
The NVIDIA* compiler can generate binary code (cubin and fatbin) and intermediate code (PTX). Binary code often implies a specific GPU architecture and generation, so compatibility with other GPUs is not guaranteed. PTX is targeted for a virtual platform that is defined entirely by the set of capabilities or features. Depending on the selected virtual platform, some of the instructions are emulated or disabled, even if the real hardware supports all the features.
At the first call, the PTX code is compiled to binary code for the particular GPU using a JIT compiler. When the target GPU has a compute capability (CC) lower than the one the PTX code targets, JIT compilation fails.
By default, the OpenCV CUDA module includes:
*
Binaries for compute capabilities 1.3 and 2.0 (controlled by ``CUDA_ARCH_BIN`` in ``CMake``)
*
PTX code for compute capabilities 1.1 and 1.3 (controlled by ``CUDA_ARCH_PTX`` in ``CMake``)
This means that for devices with CC 1.3 and 2.0 binary images are ready to run. For all newer platforms, the PTX code for 1.3 is JIT'ed to a binary image. For devices with CC 1.1 and 1.2, the PTX for 1.1 is JIT'ed. For devices with CC 1.0, no code is available and the functions throw
:ocv:class:`Exception`. On platforms that require JIT compilation, the first run is slow.
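The selection rules in this paragraph can be modelled with a short sketch. It is a simplified illustration of the text above, not the logic used by the CUDA driver or by OpenCV: ``CodePath``, ``Selection``, and ``selectCode`` are hypothetical names, compute capabilities are encoded as ``major * 10 + minor`` (an assumption made for brevity), and a binary is treated as usable only on an exact CC match.

```cpp
#include <cassert>
#include <vector>

// Hypothetical outcome of looking up device code for a GPU.
enum class CodePath { Binary, JitFromPtx, Unsupported };

struct Selection {
    CodePath path;
    int version;  // CC encoded as major * 10 + minor; 0 when unsupported
};

// bins and ptxs mirror the CUDA_ARCH_BIN and CUDA_ARCH_PTX lists.
Selection selectCode(int deviceCc,
                     const std::vector<int>& bins,
                     const std::vector<int>& ptxs)
{
    assert(deviceCc > 0);

    // A prebuilt binary for exactly this CC is ready to run.
    for (int b : bins)
        if (b == deviceCc)
            return {CodePath::Binary, b};

    // Otherwise JIT-compile the newest PTX that does not exceed the device CC.
    int best = 0;
    for (int p : ptxs)
        if (p <= deviceCc && p > best)
            best = p;
    if (best > 0)
        return {CodePath::JitFromPtx, best};

    // No binary and no usable PTX: the functions would throw.
    return {CodePath::Unsupported, 0};
}
```

With the default lists (binaries for 1.3 and 2.0, PTX for 1.1 and 1.3), this model reproduces the cases listed above: CC 1.3 and 2.0 use binaries, CC 1.1 and 1.2 JIT the 1.1 PTX, newer devices JIT the 1.3 PTX, and CC 1.0 has no code available.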
On a GPU with CC 1.0, you can still compile the CUDA module and most of the functions will run flawlessly. To achieve this, add "1.0" to the list of binaries, for example, ``CUDA_ARCH_BIN="1.0 1.3 2.0"`` . The functions that cannot be run on CC 1.0 GPUs throw an exception.
You can always determine at runtime whether the OpenCV GPU-built binaries (or PTX code) are compatible with your GPU. The function
:ocv:func:`cuda::DeviceInfo::isCompatible` returns the compatibility status (true/false).
Utilizing Multiple GPUs
-----------------------
In the current version, each of the OpenCV CUDA algorithms can use only a single GPU. So, to utilize multiple GPUs, you have to manually distribute the work between GPUs.
The active device can be switched using the :ocv:func:`cuda::setDevice()` function. For more details, see the CUDA C Programming Guide.
While developing algorithms for multiple GPUs, keep the data transfer overhead in mind. For primitive functions and small images, it can be significant, which may eliminate all the advantages of having multiple GPUs. But for high-level algorithms, consider using multi-GPU acceleration. For example, the Stereo Block Matching algorithm has been successfully parallelized using the following algorithm:
1. Split each image of the stereo pair into two horizontal overlapping stripes.
2. Process each pair of stripes (from the left and right images) on a separate Fermi* GPU.
3. Merge the results into a single disparity map.
With this algorithm, a dual GPU gave a 180% performance increase compared to a single Fermi GPU. For a source code example, see http://code.opencv.org/projects/opencv/repository/revisions/master/entry/samples/gpu/.
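The first step of the algorithm above, splitting an image into overlapping horizontal stripes, can be sketched without any OpenCV dependency. ``RowRange`` and ``splitIntoStripes`` are hypothetical names, and the overlap size is an assumed parameter: the text does not state how many rows the stripes share, so a real implementation would derive it from the matcher settings.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Half-open row interval [begin, end) of one stripe.
struct RowRange {
    int begin;
    int end;
};

// Splits `rows` image rows into `parts` horizontal stripes and extends each
// stripe by `overlap` rows on both sides, clamped to the image bounds.
std::vector<RowRange> splitIntoStripes(int rows, int parts, int overlap)
{
    assert(rows > 0 && parts > 0 && overlap >= 0);

    std::vector<RowRange> stripes;
    for (int i = 0; i < parts; ++i) {
        int begin = rows * i / parts;        // stripe start without overlap
        int end = rows * (i + 1) / parts;    // stripe end without overlap
        stripes.push_back({std::max(0, begin - overlap),
                           std::min(rows, end + overlap)});
    }
    return stripes;
}
```

For a 480-row image, two parts, and a 16-row overlap, this yields the row ranges [0, 256) and [224, 480). Each range would then be processed on its own GPU, and the overlapping rows would be discarded when merging the disparity maps.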
:param filename:Name of the file from which the classifier is loaded. Only the old ``haar`` classifier (trained by the ``haar`` training application) and NVIDIA's ``nvbin`` are supported for HAAR and only new type of OpenCV XML cascade supported for LBP.
:param filename:Name of the file from which the classifier is loaded. Only the old ``haar`` classifier (trained by the ``haar`` training application) and NVIDIA's ``nvbin`` are supported for HAAR and only new type of OpenCV XML cascade supported for LBP.
@ -123,11 +123,11 @@ Creates implementation for :ocv:class:`gpu::LookUpTable` .
gpu::copyMakeBorder
-----------------------
cuda::copyMakeBorder
--------------------
Forms a border around an image.
..ocv:function:: void gpu::copyMakeBorder(InputArray src, OutputArray dst, int top, int bottom, int left, int right, int borderType, Scalar value = Scalar(), Stream& stream = Stream::Null())
..ocv:function:: void cuda::copyMakeBorder(InputArray src, OutputArray dst, int top, int bottom, int left, int right, int borderType, Scalar value = Scalar(), Stream& stream = Stream::Null())
:param src:Source image. ``CV_8UC1`` , ``CV_8UC4`` , ``CV_32SC1`` , and ``CV_32FC1`` types are supported.