Merge pull request #20361 from r0hit2005:master

Tutorial for parallel_for_ and Universal Intrinsic (GSoC '21) * New parallel_for tutorial * Universal Intrinsics Draft Tutorial * Added draft of universal intrinsic tutorial * * Added final markdown for parallel_for_new * Added first half of universal intrinsic tutorial * Fixed warnings in documentation and sample code for parallel_for_new tutorial * Restored original parallel_for_ tutorial and table_of_content_core * Minor changes * Added demonstration of 1-D vectorized convolution * * Added 2-D convolution implementation and tutorial * Minor changes in vectorized implementation of 1-D and 2-D convolution * Minor changes to univ_intrin tutorial. Added new tutorials to the table of contents * Minor changes * Removed variable sized array initializations * Fixed conversion warnings * Added doxygen references, minor fixes * Added jpg image for parallel_for_ doc
4 years ago · 41a2eb5245
parent 3d7670a6ba
commit 41a2eb5245
8 changed files with 1065 additions and 2 deletions
--- a/doc/tutorials/core/file_input_output_with_xml_yml/file_input_output_with_xml_yml.markdown
+++ b/doc/tutorials/core/file_input_output_with_xml_yml/file_input_output_with_xml_yml.markdown
@ -4,7 +4,7 @@ File Input and Output using XML and YAML files {#tutorial_file_input_output_with
@tableofcontents

@prev_tutorial{tutorial_discrete_fourier_transform}
-@next_tutorial{tutorial_how_to_use_OpenCV_parallel_for_}
+@next_tutorial{tutorial_how_to_use_OpenCV_parallel_for_new}

 |    |    |
 | -: | :- |
--- a/doc/tutorials/core/how_to_use_OpenCV_parallel_for_new/how_to_use_OpenCV_parallel_for_new.markdown
+++ b/doc/tutorials/core/how_to_use_OpenCV_parallel_for_new/how_to_use_OpenCV_parallel_for_new.markdown
@ -0,0 +1,166 @@
+How to use the OpenCV parallel_for_ to parallelize your code {#tutorial_how_to_use_OpenCV_parallel_for_new}
+==================================================================
+
+@tableofcontents
+
+@prev_tutorial{tutorial_file_input_output_with_xml_yml}
+@next_tutorial{tutorial_univ_intrin}
+
+|    |    |
+| -: | :- |
+| Compatibility | OpenCV >= 3.0 |
+
+Goal
+----
+
+The goal of this tutorial is to demonstrate the use of the OpenCV `parallel_for_` framework to easily parallelize your code. To illustrate the concept, we will write a program to perform convolution operation over an image.
+The full tutorial code is [here](https://github.com/opencv/opencv/tree/master/samples/cpp/tutorial_code/how_to_use_OpenCV_parallel_for_/how_to_use_OpenCV_parallel_for_new.cpp).
+
+Precondition
+----
+
+### Parallel Frameworks
+The first precondition is to have OpenCV built with a parallel framework.
+In OpenCV 4.5, the following parallel frameworks are available in that order:
+
+*   Intel Threading Building Blocks (3rdparty library, should be explicitly enabled)
+*   OpenMP (integrated to compiler, should be explicitly enabled)
+*   APPLE GCD (system wide, used automatically (APPLE only))
+*   Windows RT concurrency (system wide, used automatically (Windows RT only))
+*   Windows concurrency (part of runtime, used automatically (Windows only - MSVC++ >= 10))
+*   Pthreads
+
+As you can see, several parallel frameworks can be used in the OpenCV library. Some parallel libraries are third party libraries and have to be explicitly enabled in CMake before building, while others are automatically available with the platform (e.g. APPLE GCD).
+
+### Race Conditions
+Race conditions occur when more than one thread try to write *or* read and write to a particular memory location simultaneously.
+Based on that, we can broadly classify algorithms into two categories:-
+1. Algorithms in which only a single thread writes data to a particular memory location.
+    * In *convolution*, for example, even though multiple  threads may read from a pixel at a particular time, only a single thread *writes* to a particular pixel.
+
+2. Algorithms in which multiple threads may write to a single memory location.
+    * Finding contours, features, etc. Such algorithms may require each thread to add data to a global variable simultaneously. For example, when detecting features, each thread will add features of their respective parts of the image to a common vector, thus creating a race condition.
+
+Convolution
+-----------
+
+We will use the example of performing a convolution to demonstrate the use of `parallel_for_` to parallelize the computation. This is an example of an algorithm which does not lead to a race condition.
+
+Theory
+------
+Convolution is a simple mathematical operation widely used in image processing. Here, we slide a smaller matrix, called the *kernel*, over an image and a sum of the product of pixel values and corresponding values in the kernel gives us the value of the particular pixel in the output (called the anchor point of the kernel).  Based on the values in the kernel, we get different results.
+In the example below, we use a 3x3 kernel (anchored at its center) and convolve over a 5x5 matrix to produce a 3x3 matrix. The size of the output can be altered by padding the input with suitable values.
+![Convolution Animation](images/convolution-example-matrix.gif)
+
+For more information about different kernels and what they do, look [here](https://en.wikipedia.org/wiki/Kernel_(image_processing))
+
+For the purpose of this tutorial, we will implement the simplest form of the function which takes a grayscale image (1 channel) and an odd length square kernel and produces an output image.
+The operation will not be performed in-place.
+@note We can store a few of the relevant pixels temporarily to make sure we use the original values during the convolution and then do it in-place. However, the purpose of this tutorial is to introduce parallel_for_ function and an inplace implementation may be too complicated.
+
+Pseudocode
+-----------
+
+    InputImage src, OutputImage dst, kernel(size n)
+    makeborder(src, n/2)
+    for each pixel (i, j) strictly inside borders, do:
+    {
+        value := 0
+        for k := -n/2 to n/2, do:
+            for l := -n/2 to n/2, do:
+                value += kernel[n/2 + k][n/2 + l]*src[i + k][j + l]
+
+        dst[i][j] := value
+    }
+
+For an *n-sized kernel*, we will add a border of size *n/2* to handle edge cases.
+We then run two loops to move along the kernel and add the products to sum
+
+Implementation
+--------------
+
+### Sequential implementation
+
+@snippet how_to_use_OpenCV_parallel_for_new.cpp convolution-sequential
+
+We first make an output matrix(dst) with the same size as src and add borders to the src image(to handle edge cases).
+@snippet how_to_use_OpenCV_parallel_for_new.cpp convolution-make-borders
+
+We then sequentially iterate over the pixels in the src image and compute the value over the kernel and the neighbouring pixel values.
+We then fill value to the corresponding pixel in the dst image.
+@snippet how_to_use_OpenCV_parallel_for_new.cpp convolution-kernel-loop
+
+### Parallel implementation
+
+When looking at the sequential implementation, we can notice that each pixel depends on multiple neighbouring pixels but only one pixel is edited at a time. Thus, to optimize the computation, we can split the image into stripes and parallely perform convolution on each, by exploiting the multi-core architecture of modern processor. The OpenCV @ref cv::parallel_for_ framework automatically decides how to split the computation efficiently and does most of the work for us.
+
+@note Although values of a pixel in a particular stripe may depend on pixel values outside the stripe, these are only read only operations and hence will not cause undefined behaviour.
+
+
+We first declare a custom class that inherits from @ref cv::ParallelLoopBody and override the `virtual void operator ()(const cv::Range& range) const`.
+@snippet how_to_use_OpenCV_parallel_for_new.cpp convolution-parallel
+
+The range in the `operator ()` represents the subset of values that will be treated by an individual thread. Based on the requirement, there may be different ways of splitting the range which in turn changes the computation.
+
+For example, we can either
+1. Split the entire traversal of the image and obtain the [row, col] coordinate in the following way (as shown in the above code):
+
+    @snippet how_to_use_OpenCV_parallel_for_new.cpp overload-full
+
+    We would then call the parallel_for_ function in the following way:
+    @snippet how_to_use_OpenCV_parallel_for_new.cpp convolution-parallel-function
+<br>
+
+2. Split the rows and compute for each row:
+
+    @snippet how_to_use_OpenCV_parallel_for_new.cpp overload-row-split
+
+    In this case, we call the parallel_for_ function with a different range:
+    @snippet how_to_use_OpenCV_parallel_for_new.cpp convolution-parallel-function-row
+
+@note In our case, both implementations perform similarly. Some cases may allow better memory access patterns or other performance benefits.
+
+To set the number of threads, you can use: @ref cv::setNumThreads. You can also specify the number of splitting using the nstripes parameter in @ref cv::parallel_for_. For instance, if your processor has 4 threads, setting `cv::setNumThreads(2)` or setting `nstripes=2` should be the same as by default it will use all the processor threads available but will split the workload only on two threads.
+
+@note C++ 11 standard allows to simplify the parallel implementation by get rid of the `parallelConvolution` class and replacing it with lambda expression:
+
+@snippet how_to_use_OpenCV_parallel_for_new.cpp convolution-parallel-cxx11
+
+Results
+-----------
+
+The resulting time taken for execution of the two implementations on a
+* *512x512 input* with a *5x5 kernel*:
+
+        This program shows how to use the OpenCV parallel_for_ function and
+        compares the performance of the sequential and parallel implementations for a
+        convolution operation
+        Usage:
+        ./a.out [image_path -- default lena.jpg]
+
+        Sequential Implementation: 0.0953564s
+        Parallel Implementation: 0.0246762s
+        Parallel Implementation(Row Split): 0.0248722s
+
+<br>
+
+* *512x512 input with a 3x3 kernel*
+
+        This program shows how to use the OpenCV parallel_for_ function and
+        compares the performance of the sequential and parallel implementations for a
+        convolution operation
+        Usage:
+        ./a.out [image_path -- default lena.jpg]
+
+        Sequential Implementation: 0.0301325s
+        Parallel Implementation: 0.0117053s
+        Parallel Implementation(Row Split): 0.0117894s
+
+The performance of the parallel implementation depends on the type of CPU you have. For instance, on 4 cores - 8 threads CPU, runtime may be 6x to 7x faster than a sequential implementation. There are many factors to explain why we do not achieve a speed-up of 8x:
+    * the overhead to create and manage the threads,
+    * background processes running in parallel,
+    * the difference between 4 hardware cores with 2 logical threads for each core and 8 hardware cores.
+
+In the tutorial, we used a horizontal gradient filter(as shown in the animation above), which produces an image highlighting the vertical edges.
+
+![result image](images/resimg.jpg)
--- a/doc/tutorials/core/how_to_use_OpenCV_parallel_for_new/images/convolution-example-matrix.gif
+++ b/doc/tutorials/core/how_to_use_OpenCV_parallel_for_new/images/convolution-example-matrix.gif
--- a/doc/tutorials/core/how_to_use_OpenCV_parallel_for_new/images/resimg.jpg
+++ b/doc/tutorials/core/how_to_use_OpenCV_parallel_for_new/images/resimg.jpg
--- a/doc/tutorials/core/table_of_content_core.markdown
+++ b/doc/tutorials/core/table_of_content_core.markdown
@ -9,4 +9,5 @@ The Core Functionality (core module) {#tutorial_table_of_content_core}
 -   @subpage tutorial_basic_linear_transform
 -   @subpage tutorial_discrete_fourier_transform
 -   @subpage tutorial_file_input_output_with_xml_yml
-   @subpage tutorial_how_to_use_OpenCV_parallel_for_
+-   @subpage tutorial_how_to_use_OpenCV_parallel_for_new
+-   @subpage tutorial_univ_intrin
--- a/doc/tutorials/core/univ_intrin/univ_intrin.markdown
+++ b/doc/tutorials/core/univ_intrin/univ_intrin.markdown
@ -0,0 +1,334 @@
+Vectorizing your code using Universal Intrinsics {#tutorial_univ_intrin}
+==================================================================
+
+@tableofcontents
+
+@prev_tutorial{tutorial_how_to_use_OpenCV_parallel_for_new}
+
+|    |    |
+| -: | :- |
+| Compatibility | OpenCV >= 3.0 |
+
+Goal
+----
+
+The goal of this tutorial is to provide a guide to using the @ref core_hal_intrin feature to vectorize your C++ code for a faster runtime.
+We'll briefly look into _SIMD intrinsics_ and how to work with wide _registers_, followed by a tutorial on the basic operations using wide registers.
+
+Theory
+------
+
+In this section, we will briefly look into a few concepts to better help understand the functionality.
+
+### Intrinsics
+Intrinsics are functions which are separately handled by the compiler. These functions are often optimized to perform in the most efficient ways possible and hence run faster than normal implementations. However, since these functions depend on the compiler, it makes it difficult to write portable applications.
+
+### SIMD
+SIMD stands for **Single Instruction, Multiple Data**. SIMD Intrinsics allow the processor to vectorize calculations. The data is stored in what are known as *registers*. A *register* may be *128-bits*, *256-bits* or *512-bits* wide. Each *register* stores **multiple values** of the **same data type**. The size of the register and the size of each value determines the number of values stored in total.
+
+Depending on what *Instruction Sets* your CPU supports, you may be able to use the different registers. To learn more, look [here](https://en.wikipedia.org/wiki/Instruction_set_architecture)
+
+Universal Intrinsics
+--------------------
+
+OpenCVs universal intrinsics provides an abstraction to SIMD vectorization methods and allows the user to use intrinsics without the need to write system specific code.
+
+OpenCV Universal Intrinsics support the following instruction sets:
+* *128 bit* registers of various types support is implemented for a wide range of architectures including
+    * x86(SSE/SSE2/SSE4.2),
+    * ARM(NEON),
+    * PowerPC(VSX),
+    * MIPS(MSA).
+* *256 bit* registers are supported on x86(AVX2) and
+* *512 bit* registers are supported on x86(AVX512)
+
+**We will now introduce the available structures and functions:**
+* Register structures
+* Load and store
+* Mathematical Operations
+* Reduce and Mask
+
+### Register Structures
+
+The Universal Intrinsics set implements every register as a structure based on the particular SIMD register.
+All types contain the `nlanes` enumeration which gives the exact number of values that the type can hold. This eliminates the need to hardcode the number of values during implementations.
+
+@note Each register structure is under the `cv` namespace.
+
+There are **two types** of registers:
+
+* **Variable sized registers**: These structures do not have a fixed size and their exact bit length is deduced during compilation, based on the available SIMD capabilities. Consequently, the value of the `nlanes` enum is determined in compile time.
+    <br>
+
+    Each structure follows the following convention:
+
+        v_[type of value][size of each value in bits]
+
+    For instance, **v_uint8 holds 8-bit unsigned integers** and **v_float32 holds 32-bit floating point values**. We then declare a register like we would declare any object in C++
+
+    Based on the available SIMD instruction set, a particular register will hold different number of values.
+    For example: If your computer supports a maximum of 256bit registers,
+    * *v_uint8* will hold 32 8-bit unsigned integers
+    * *v_float64* will hold 4 64-bit floats (doubles)
+
+            v_uint8 a;                            // a is a register supporting uint8(char) data
+            int n = a.nlanes;                     // n holds 32
+
+    Available data type and sizes:
+    |Type|Size in bits|
+    |-:|:-|
+    |uint| 8, 16, 32, 64|
+    |int | 8, 16, 32, 64|
+    |float | 32, 64|
+
+* **Constant sized registers**: These structures have a fixed bit size and hold a constant number of values. We need to know what SIMD instruction set is supported by the system and select compatible registers. Use these only if exact bit length is necessary.
+    <br>
+
+    Each structure follows the convention:
+
+        v_[type of value][size of each value in bits]x[number of values]
+
+    Suppose we want to store
+    * 32-bit(*size in bits*) signed integers in a **128 bit register**. Since the register size is already known, we can find out the *number of data points in register* (*128/32 = 4*):
+
+            v_int32x8 reg1                       // holds 8 32-bit signed integers.
+
+    * 64-bit floats in 512 bit register:
+
+            v_float64x8 reg2                     // reg2.nlanes = 8
+
+### Load and Store operations
+
+Now that we know how registers work, let us look at the functions used for filling these registers with values.
+
+* **Load**: Load functions allow you to *load* values into a register.
+    * *Constructors* - When declaring a register structure, we can either provide a memory address from where the register will pick up contiguous values, or provide the values explicitly as multiple arguments (Explicit multiple arguments is available only for Constant Sized Registers):
+
+            float ptr[32] = {1, 2, 3 ..., 32};   // ptr is a pointer to a contiguous memory block of 32 floats
+
+            // Variable Sized Registers //
+            int x = v_float32().nlanes;          // set x as the number of values the register can hold
+
+            v_float32 reg1(ptr);                 // reg1 stores first x values according to the maximum register size available.
+            v_float32 reg2(ptr + x);             // reg stores the next x values
+
+            // Constant Sized Registers //
+            v_float32x4 reg1(ptr);               // reg1 stores the first 4 floats (1, 2, 3, 4)
+            v_float32x4 reg2(ptr + 4);           // reg2 stores the next 4 floats (5, 6, 7, 8)
+
+            // Or we can explicitly write down the values.
+            v_float32x4(1, 2, 3, 4);
+
+    <br>
+    * *Load Function* - We can use the load method and provide the memory address of the data:
+
+            float ptr[32] = {1, 2, 3, ..., 32};
+            v_float32 reg_var;
+            reg_var = vx_load(ptr);              // loads values from ptr[0] upto ptr[reg_var.nlanes - 1]
+
+            v_float32x4 reg_128;
+            reg_128 = v_load(ptr);               // loads values from ptr[0] upto ptr[3]
+
+            v_float32x8 reg_256;
+            reg_256 = v256_load(ptr);            // loads values from ptr[0] upto ptr[7]
+
+            v_float32x16 reg_512;
+            reg_512 = v512_load(ptr);            // loads values from ptr[0] upto ptr[15]
+
+        @note The load function assumes data is unaligned. If your data is aligned, you may use the `vx_load_aligned()` function.
+    <br>
+
+* **Store**: Store functions allow you to *store* the values from a register into a particular memory location.
+    * To store values from a register into a memory location, you may use the *v_store()* function:
+
+            float ptr[4];
+            v_store(ptr, reg); // store the first 128 bits(interpreted as 4x32-bit floats) of reg into ptr.
+<br>
+@note Ensure **ptr** has the same type as register. You can also cast the register into the proper type before carrying out operations. Simply typecasting the pointer to a particular type will lead wrong interpretation of data.
+
+### Binary and Unary Operators
+
+The universal intrinsics set provides element wise binary and unary operations.
+
+* **Arithmetics**: We can add, subtract, multiply and divide two registers element-wise. The registers must be of the same width and hold the same type. To multiply two registers, for example:
+
+        v_float32 a, b;                          // {a1, ..., an}, {b1, ..., bn}
+        v_float32 c;
+        c = a + b                                // {a1 + b1, ..., an + bn}
+        c = a * b;                               // {a1 * b1, ..., an * bn}
+
+<br>
+
+* **Bitwise Logic and Shifts**: We can left shift or right shift the bits of each element of the register. We can also apply bitwise &, |, ^ and ~ operators between two registers element-wise:
+
+        v_int32 as;                              // {a1, ..., an}
+        v_int32 al = as << 2;                    // {a1 << 2, ..., an << 2}
+        v_int32 bl = as >> 2;                    // {a1 >> 2, ..., an >> 2}
+
+        v_int32 a, b;
+        v_int32 a_and_b = a & b;                 // {a1 & b1, ..., an & bn}
+
+<br>
+
+* **Comparison Operators**: We can compare values between two registers using the <, >, <= , >=, == and != operators. Since each register contains multiple values, we don't get a single bool for these operations. Instead, for true values, all bits are converted to one (0xff for 8 bits, 0xffff for 16 bits, etc), while false values return bits converted to zero.
+
+        // let us consider the following code is run in a 128-bit register
+        v_uint8 a;                               // a = {0, 1, 2, ..., 15}
+        v_uint8 b;                               // b = {15, 14, 13, ..., 0}
+
+        v_uint8 c = a < b;
+
+        /*
+            let us look at the first 4 values in binary
+
+            a = |00000000|00000001|00000010|00000011|
+            b = |00001111|00001110|00001101|00001100|
+            c = |11111111|11111111|11111111|11111111|
+
+            If we store the values of c and print them as integers, we will get 255 for true values and 0 for false values.
+        */
+        ---
+        // In a computer supporting 256-bit registers
+        v_int32 a;                               // a = {1, 2, 3, 4, 5, 6, 7, 8}
+        v_int32 b;                               // b = {8, 7, 6, 5, 4, 3, 2, 1}
+
+        v_int32 c = (a < b);                     // c = {-1, -1, -1, -1, 0, 0, 0, 0}
+
+        /*
+            The true values are 0xffffffff, which in signed 32-bit integer representation is equal to -1.
+        */
+<br>
+
+* **Min/Max operations**: We can use the *v_min()* and *v_max()* functions to return registers containing element-wise min, or max, of the two registers:
+
+        v_int32 a;                               // {a1, ..., an}
+        v_int32 b;                               // {b1, ..., bn}
+
+        v_int32 mn = v_min(a, b);                // {min(a1, b1), ..., min(an, bn)}
+        v_int32 mx = v_max(a, b);                // {max(a1, b1), ..., max(an, bn)}
+<br>
+
+@note Comparison and Min/Max operators are not available for 64 bit integers. Bitwise shift and logic operators are available only for integer values. Bitwise shift is available only for 16, 32 and 64 bit registers.
+
+### Reduce and Mask
+
+* **Reduce Operations**: The *v_reduce_min()*, *v_reduce_max()* and *v_reduce_sum()* return a single value denoting the min, max or sum of the entire register:
+
+        v_int32 a;                                //  a = {a1, ..., a4}
+        int mn = v_reduce_min(a);                 // mn = min(a1, ..., an)
+        int sum = v_reduce_sum(a);                // sum = a1 + ... + an
+<br>
+
+* **Mask Operations**: Mask operations allow us to replicate conditionals in wide registers. These include:
+    * *v_check_all()* - Returns a bool, which is true if all the values in the register are less than zero.
+    * *v_check_any()* - Returns a bool, which is true if any value in the register is less than zero.
+    * *v_select()* - Returns a register, which blends two registers, based on a mask.
+
+            v_uint8 a;                           // {a1, .., an}
+            v_uint8 b;                           // {b1, ..., bn}
+
+            v_int32x4 mask:                      // {0xff, 0, 0, 0xff, ..., 0xff, 0}
+
+            v_uint8 Res = v_select(mask, a, b)   // {a1, b2, b3, a4, ..., an-1, bn}
+
+            /*
+                "Res" will contain the value from "a" if mask is true (all bits set to 1),
+                and value from "b" if mask is false (all bits set to 0)
+
+                We can use comparison operators to generate mask and v_select to obtain results based on conditionals.
+                It is common to set all values of b to 0. Thus, v_select will give values of "a" or 0 based on the mask.
+            */
+
+## Demonstration
+In the following section, we will vectorize a simple convolution function for single channel and compare the results to a scalar implementation.
+@note Not all algorithms are improved by manual vectorization. In fact, in certain cases, the compiler may *autovectorize* the code, thus producing faster results for scalar implementations.
+
+You may learn more about convolution from the previous tutorial. We use the same naive implementation from the previous tutorial and compare it to the vectorized version.
+
+The full tutorial code is [here](https://github.com/opencv/opencv/tree/master/samples/cpp/tutorial_code/univ_intrin/univ_intrin.cpp).
+
+### Vectorizing Convolution
+
+We will first implement a 1-D convolution and then vectorize it. The 2-D vectorized convolution will perform 1-D convolution across the rows to produce the correct results.
+
+#### 1-D Convolution: Scalar
+@snippet univ_intrin.cpp convolution-1D-scalar
+
+1. We first set up variables and make a border on both sides of the src matrix, to take care of edge cases.
+    @snippet univ_intrin.cpp convolution-1D-border
+
+2. For the main loop, we select an index *i* and offset it on both sides along with the kernel, using the k variable. We store the value in *value* and add it to the *dst* matrix.
+    @snippet univ_intrin.cpp convolution-1D-scalar-main
+
+#### 1-D Convolution: Vector
+
+We will now look at the vectorized version of 1-D convolution.
+@snippet univ_intrin.cpp convolution-1D-vector
+
+1. In our case, the kernel is a float. Since the kernel's datatype is the largest, we convert src to float32, forming *src_32*. We also make a border like we did for the naive case.
+    @snippet univ_intrin.cpp convolution-1D-convert
+
+2. Now, for each column in the *kernel*, we calculate the scalar product of the value with all *window* vectors of length `step`. We add these values to the already stored values in ans
+    @snippet univ_intrin.cpp convolution-1D-main
+
+    * We declare a pointer to the src_32 and kernel and run a loop for each kernel element
+        @snippet univ_intrin.cpp convolution-1D-main-h1
+
+    * We load a register with the current kernel element. A window is shifted from *0* to *len - step* and its product with the kernel_wide array is added to the values stored in *ans*. We store the values back into *ans*
+        @snippet univ_intrin.cpp convolution-1D-main-h2
+
+    * Since the length might not be divisible by steps, we take care of the remaining values directly. The number of *tail* values will always be less than *step* and will not affect the performance significantly. We store all the values to *ans* which is a float pointer. We can also directly store them in a `Mat` object
+        @snippet univ_intrin.cpp convolution-1D-main-h3
+
+    * Here is an iterative example:
+
+            For example:
+            kernel: {k1, k2, k3}
+            src:           ...|a1|a2|a3|a4|...
+
+
+            iter1:
+            for each idx i in (0, len), 'step' idx at a time
+                kernel_wide:          |k1|k1|k1|k1|
+                window:               |a0|a1|a2|a3|
+                ans:               ...| 0| 0| 0| 0|...
+                sum =  ans + window * kernel_wide
+                    =  |a0 * k1|a1 * k1|a2 * k1|a3 * k1|
+
+            iter2:
+                kernel_wide:          |k2|k2|k2|k2|
+                window:               |a1|a2|a3|a4|
+                ans:               ...|a0 * k1|a1 * k1|a2 * k1|a3 * k1|...
+                sum =  ans + window * kernel_wide
+                    =  |a0 * k1 + a1 * k2|a1 * k1 + a2 * k2|a2 * k1 + a3 * k2|a3 * k1 + a4 * k2|
+
+            iter3:
+                kernel_wide:          |k3|k3|k3|k3|
+                window:               |a2|a3|a4|a5|
+                ans:               ...|a0 * k1 + a1 * k2|a1 * k1 + a2 * k2|a2 * k1 + a3 * k2|a3 * k1 + a4 * k2|...
+                sum =  sum + window * kernel_wide
+                    =  |a0*k1 + a1*k2 + a2*k3|a1*k1 + a2*k2 + a3*k3|a2*k1 + a3*k2 + a4*k3|a3*k1 + a4*k2 + a5*k3|
+
+
+@note The function parameters also include *row*, *rowk* and *len*. These values are used when using the function as an intermediate step of 2-D convolution
+
+#### 2-D Convolution
+
+Suppose our kernel has *ksize* rows. To compute the values for a particular row, we compute the 1-D convolution of the previous *ksize/2* and the next *ksize/2* rows, with the corresponding kernel row. The final values is simply the sum of the individual 1-D convolutions
+@snippet univ_intrin.cpp convolution-2D
+
+1. We first initialize variables and make a border above and below the *src* matrix. The left and right sides are handled by the 1-D convolution function.
+    @snippet univ_intrin.cpp convolution-2D-init
+
+2. For each row, we calculate the 1-D convolution of the rows above and below it. we then add the values to the *dst* matrix.
+    @snippet univ_intrin.cpp convolution-2D-main
+
+3. We finally convert the *dst* matrix to a *8-bit* `unsigned char` matrix
+    @snippet univ_intrin.cpp convolution-2D-conv
+
+Results
+-------
+
+In the tutorial, we used a horizontal gradient kernel. We obtain the same output image for both methods.
+
+Improvement in runtime varies and will depend on the SIMD capabilities available in your CPU.
--- a/samples/cpp/tutorial_code/core/how_to_use_OpenCV_parallel_for_/how_to_use_OpenCV_parallel_for_new.cpp
+++ b/samples/cpp/tutorial_code/core/how_to_use_OpenCV_parallel_for_/how_to_use_OpenCV_parallel_for_new.cpp
@ -0,0 +1,332 @@
+#include <iostream>
+#include <opencv2/core.hpp>
+#include <opencv2/imgcodecs.hpp>
+#include <opencv2/imgproc.hpp>
+#include <opencv2/highgui.hpp>
+
+using namespace std;
+using namespace cv;
+
+namespace
+{
+//! [convolution-sequential]
+void conv_seq(Mat src, Mat &dst, Mat kernel)
+{
+    //![convolution-make-borders]
+    int rows = src.rows, cols = src.cols;
+    dst = Mat(rows, cols, src.type());
+
+    // Taking care of edge values
+    // Make border = kernel.rows / 2;
+
+    int sz = kernel.rows / 2;
+    copyMakeBorder(src, src, sz, sz, sz, sz, BORDER_REPLICATE);
+    //![convolution-make-borders]
+
+    //! [convolution-kernel-loop]
+    for (int i = 0; i < rows; i++)
+    {
+        uchar *dptr = dst.ptr(i);
+        for (int j = 0; j < cols; j++)
+        {
+            double value = 0;
+
+            for (int k = -sz; k <= sz; k++)
+            {
+                // slightly faster results when we create a ptr due to more efficient memory access.
+                uchar *sptr = src.ptr(i + sz + k);
+                for (int l = -sz; l <= sz; l++)
+                {
+                    value += kernel.ptr<double>(k + sz)[l + sz] * sptr[j + sz + l];
+                }
+            }
+            dptr[j] = saturate_cast<uchar>(value);
+        }
+    }
+    //! [convolution-kernel-loop]
+}
+//! [convolution-sequential]
+
+#ifdef CV_CXX11
+void conv_parallel(Mat src, Mat &dst, Mat kernel)
+{
+    int rows = src.rows, cols = src.cols;
+
+    dst = Mat(rows, cols, CV_8UC1, Scalar(0));
+
+    // Taking care of edge values
+    // Make border = kernel.rows / 2;
+
+    int sz = kernel.rows / 2;
+    copyMakeBorder(src, src, sz, sz, sz, sz, BORDER_REPLICATE);
+
+    //! [convolution-parallel-cxx11]
+    parallel_for_(Range(0, rows * cols), [&](const Range &range)
+                    {
+                        for (int r = range.start; r < range.end; r++)
+                        {
+                            int i = r / cols, j = r % cols;
+
+                            double value = 0;
+                            for (int k = -sz; k <= sz; k++)
+                            {
+                                uchar *sptr = src.ptr(i + sz + k);
+                                for (int l = -sz; l <= sz; l++)
+                                {
+                                    value += kernel.ptr<double>(k + sz)[l + sz] * sptr[j + sz + l];
+                                }
+                            }
+                            dst.ptr(i)[j] = saturate_cast<uchar>(value);
+                        }
+                    });
+    //! [convolution-parallel-cxx11]
+}
+
+void conv_parallel_row_split(Mat src, Mat &dst, Mat kernel)
+{
+    int rows = src.rows, cols = src.cols;
+
+    dst = Mat(rows, cols, CV_8UC1, Scalar(0));
+
+    // Taking care of edge values
+    // Make border = kernel.rows / 2;
+
+    int sz = kernel.rows / 2;
+    copyMakeBorder(src, src, sz, sz, sz, sz, BORDER_REPLICATE);
+
+    //! [convolution-parallel-cxx11-row-split]
+    parallel_for_(Range(0, rows), [&](const Range &range)
+                    {
+                        for (int i = range.start; i < range.end; i++)
+                        {
+
+                            uchar *dptr = dst.ptr(i);
+                            for (int j = 0; j < cols; j++)
+                            {
+                                double value = 0;
+                                for (int k = -sz; k <= sz; k++)
+                                {
+                                    uchar *sptr = src.ptr(i + sz + k);
+                                    for (int l = -sz; l <= sz; l++)
+                                    {
+                                        value += kernel.ptr<double>(k + sz)[l + sz] * sptr[j + sz + l];
+                                    }
+                                }
+                                dptr[j] = saturate_cast<uchar>(value);
+                            }
+                        }
+                    });
+    //! [convolution-parallel-cxx11-row-split]
+}
+#else
+
+//! [convolution-parallel]
+class parallelConvolution : public ParallelLoopBody
+{
+private:
+    Mat m_src, &m_dst;
+    Mat m_kernel;
+    int sz;
+
+public:
+    parallelConvolution(Mat src, Mat &dst, Mat kernel)
+        : m_src(src), m_dst(dst), m_kernel(kernel)
+    {
+        sz = kernel.rows / 2;
+    }
+
+    //! [overload-full]
+    virtual void operator()(const Range &range) const CV_OVERRIDE
+    {
+        for (int r = range.start; r < range.end; r++)
+        {
+            int i = r / m_src.cols, j = r % m_src.cols;
+
+            double value = 0;
+            for (int k = -sz; k <= sz; k++)
+            {
+                uchar *sptr = m_src.ptr(i + sz + k);
+                for (int l = -sz; l <= sz; l++)
+                {
+                    value += m_kernel.ptr<double>(k + sz)[l + sz] * sptr[j + sz + l];
+                }
+            }
+            m_dst.ptr(i)[j] = saturate_cast<uchar>(value);
+        }
+    }
+    //! [overload-full]
+};
+//! [convolution-parallel]
+
+void conv_parallel(Mat src, Mat &dst, Mat kernel)
+{
+    int rows = src.rows, cols = src.cols;
+
+    dst = Mat(rows, cols, CV_8UC1, Scalar(0));
+
+    // Taking care of edge values
+    // Make border = kernel.rows / 2;
+
+    int sz = kernel.rows / 2;
+    copyMakeBorder(src, src, sz, sz, sz, sz, BORDER_REPLICATE);
+
+    //! [convolution-parallel-function]
+    parallelConvolution obj(src, dst, kernel);
+    parallel_for_(Range(0, rows * cols), obj);
+    //! [convolution-parallel-function]
+}
+
+//! [conv-parallel-row-split]
+class parallelConvolutionRowSplit : public ParallelLoopBody
+{
+private:
+    Mat m_src, &m_dst;
+    Mat m_kernel;
+    int sz;
+
+public:
+    parallelConvolutionRowSplit(Mat src, Mat &dst, Mat kernel)
+        : m_src(src), m_dst(dst), m_kernel(kernel)
+    {
+        sz = kernel.rows / 2;
+    }
+
+    //! [overload-row-split]
+    virtual void operator()(const Range &range) const CV_OVERRIDE
+    {
+        for (int i = range.start; i < range.end; i++)
+        {
+
+            uchar *dptr = dst.ptr(i);
+            for (int j = 0; j < cols; j++)
+            {
+                double value = 0;
+                for (int k = -sz; k <= sz; k++)
+                {
+                    uchar *sptr = src.ptr(i + sz + k);
+                    for (int l = -sz; l <= sz; l++)
+                    {
+                        value += kernel.ptr<double>(k + sz)[l + sz] * sptr[j + sz + l];
+                    }
+                }
+                dptr[j] = saturate_cast<uchar>(value);
+            }
+        }
+    }
+    //! [overload-row-split]
+};
+//! [conv-parallel-row-split]
+
+void conv_parallel_row_split(Mat src, Mat &dst, Mat kernel)
+{
+    int rows = src.rows, cols = src.cols;
+
+    dst = Mat(rows, cols, CV_8UC1, Scalar(0));
+
+    // Taking care of edge values
+    // Make border = kernel.rows / 2;
+
+    int sz = kernel.rows / 2;
+    copyMakeBorder(src, src, sz, sz, sz, sz, BORDER_REPLICATE);
+
+    //! [convolution-parallel-function-row]
+    parallelConvolutionRowSplit obj(src, dst, kernel);
+    parallel_for_(Range(0, rows), obj);
+    //! [convolution-parallel-function-row]
+}
+
+#endif
+
+static void help(char *progName)
+{
+    cout << endl
+            << " This program shows how to use the OpenCV parallel_for_ function and \n"
+            << " compares the performance of the sequential and parallel implementations for a \n"
+            << " convolution operation\n"
+            << " Usage:\n "
+            << progName << " [image_path -- default lena.jpg] " << endl
+            << endl;
+}
+}
+
+int main(int argc, char *argv[])
+{
+
+    help(argv[0]);
+    const char *filepath = argc >= 2 ? argv[1] : "../../../../data/lena.jpg";
+
+    Mat src, dst, kernel;
+    src = imread(filepath, IMREAD_GRAYSCALE);
+
+    if (src.empty())
+    {
+        cerr << "Can't open [" << filepath << "]" << endl;
+        return EXIT_FAILURE;
+    }
+    namedWindow("Input", 1);
+    namedWindow("Output1", 1);
+    namedWindow("Output2", 1);
+    namedWindow("Output3", 1);
+    imshow("Input", src);
+
+    kernel = (Mat_<double>(3, 3) << 1, 0, -1,
+              1, 0, -1,
+              1, 0, -1);
+
+    /*
+    Uncomment the kernels you want to use or write your own kernels to test out
+    performance.
+    */
+
+    /*
+    kernel = (Mat_<double>(5, 5) <<   1, 1, 1, 1, 1,
+                                      1, 1, 1, 1, 1,
+                                      1, 1, 1, 1, 1,
+                                      1, 1, 1, 1, 1,
+                                      1, 1, 1, 1, 1);
+    kernel /= 100;
+    */
+
+    /*
+    kernel = (Mat_<double>(3, 3) <<  1,  1,  1,
+                                     0,  0,  0,
+                                    -1, -1, -1);
+
+    */
+
+    double t = (double)getTickCount();
+
+    conv_seq(src, dst, kernel);
+
+    t = ((double)getTickCount() - t) / getTickFrequency();
+    cout << " Sequential implementation: " << t << "s" << endl;
+
+    imshow("Output1", dst);
+    waitKey(0);
+
+    t = (double)getTickCount();
+
+    conv_parallel(src, dst, kernel);
+
+    t = ((double)getTickCount() - t) / getTickFrequency();
+    cout << " Parallel Implementation: " << t << "s" << endl;
+
+    imshow("Output2", dst);
+    waitKey(0);
+
+    t = (double)getTickCount();
+
+    conv_parallel_row_split(src, dst, kernel);
+
+    t = ((double)getTickCount() - t) / getTickFrequency();
+    cout << " Parallel Implementation(Row Split): " << t << "s" << endl
+         << endl;
+
+    imshow("Output3", dst);
+    waitKey(0);
+
+    // imwrite("src.png", src);
+    // imwrite("dst.png", dst);
+
+    return 0;
+}
--- a/samples/cpp/tutorial_code/core/univ_intrin/univ_intrin.cpp
+++ b/samples/cpp/tutorial_code/core/univ_intrin/univ_intrin.cpp
@ -0,0 +1,230 @@
+#include <iostream>
+#include <opencv2/core.hpp>
+#include <opencv2/imgcodecs.hpp>
+#include <opencv2/imgproc.hpp>
+#include <opencv2/highgui.hpp>
+#include <opencv2/core/simd_intrinsics.hpp>
+
+using namespace std;
+using namespace cv;
+
+const int N = 100005, K = 2000;
+
+namespace
+{
+
+void conv_seq(Mat src, Mat &dst, Mat kernel)
+{
+    int rows = src.rows, cols = src.cols;
+    dst = Mat(rows, cols, CV_8UC1);
+
+    int sz = kernel.rows / 2;
+    copyMakeBorder(src, src, sz, sz, sz, sz, BORDER_REPLICATE);
+    for (int i = 0; i < rows; i++)
+    {
+        uchar *dptr = dst.ptr<uchar>(i);
+        for (int j = 0; j < cols; j++)
+        {
+            float value = 0;
+
+            for (int k = -sz; k <= sz; k++)
+            {
+                // slightly faster results when we create a ptr due to more efficient memory access.
+                uchar *sptr = src.ptr<uchar>(i + sz + k);
+                for (int l = -sz; l <= sz; l++)
+                {
+                    value += kernel.ptr<float>(k + sz)[l + sz] * sptr[j + sz + l];
+                }
+            }
+            dptr[j] = saturate_cast<uchar>(value);
+        }
+    }
+}
+
+//! [convolution-1D-scalar]
+void conv1d(Mat src, Mat &dst, Mat kernel)
+{
+
+    //! [convolution-1D-border]
+    int len = src.cols;
+    dst = Mat(1, len, CV_8UC1);
+
+    int sz = kernel.cols / 2;
+    copyMakeBorder(src, src, 0, 0, sz, sz, BORDER_REPLICATE);
+    //! [convolution-1D-border]
+
+    //! [convolution-1D-scalar-main]
+    for (int i = 0; i < len; i++)
+    {
+        double value = 0;
+        for (int k = -sz; k <= sz; k++)
+            value += src.ptr<uchar>(0)[i + k + sz] * kernel.ptr<float>(0)[k + sz];
+
+        dst.ptr<uchar>(0)[i] = saturate_cast<uchar>(value);
+    }
+    //! [convolution-1D-scalar-main]
+}
+//! [convolution-1D-scalar]
+
+//! [convolution-1D-vector]
+void conv1dsimd(Mat src, Mat kernel, float *ans, int row = 0, int rowk = 0, int len = -1)
+{
+    if (len == -1)
+        len = src.cols;
+
+    //! [convolution-1D-convert]
+    Mat src_32, kernel_32;
+
+    const int alpha = 1;
+    src.convertTo(src_32, CV_32FC1, alpha);
+
+    int ksize = kernel.cols, sz = kernel.cols / 2;
+    copyMakeBorder(src_32, src_32, 0, 0, sz, sz, BORDER_REPLICATE);
+    //! [convolution-1D-convert]
+
+
+    //! [convolution-1D-main]
+    //! [convolution-1D-main-h1]
+    int step = v_float32().nlanes;
+    float *sptr = src_32.ptr<float>(row), *kptr = kernel.ptr<float>(rowk);
+    for (int k = 0; k < ksize; k++)
+    {
+    //! [convolution-1D-main-h1]
+    //! [convolution-1D-main-h2]
+        v_float32 kernel_wide = vx_setall_f32(kptr[k]);
+        int i;
+        for (i = 0; i + step < len; i += step)
+        {
+            v_float32 window = vx_load(sptr + i + k);
+            v_float32 sum = vx_load(ans + i) + kernel_wide * window;
+            v_store(ans + i, sum);
+        }
+    //! [convolution-1D-main-h2]
+
+    //! [convolution-1D-main-h3]
+        for (; i < len; i++)
+        {
+            *(ans + i) += sptr[i + k]*kptr[k];
+        }
+    //! [convolution-1D-main-h3]
+    }
+    //! [convolution-1D-main]
+}
+//! [convolution-1D-vector]
+
+//! [convolution-2D]
+void convolute_simd(Mat src, Mat &dst, Mat kernel)
+{
+    //! [convolution-2D-init]
+    int rows = src.rows, cols = src.cols;
+    int ksize = kernel.rows, sz = ksize / 2;
+    dst = Mat(rows, cols, CV_32FC1);
+
+    copyMakeBorder(src, src, sz, sz, 0, 0, BORDER_REPLICATE);
+
+    int step = v_float32().nlanes;
+    //! [convolution-2D-init]
+
+    //! [convolution-2D-main]
+    for (int i = 0; i < rows; i++)
+    {
+        for (int k = 0; k < ksize; k++)
+        {
+            float ans[N] = {0};
+            conv1dsimd(src, kernel, ans, i + k, k, cols);
+            int j;
+            for (j = 0; j + step < cols; j += step)
+            {
+                v_float32 sum = vx_load(&dst.ptr<float>(i)[j]) + vx_load(&ans[j]);
+                v_store(&dst.ptr<float>(i)[j], sum);
+            }
+
+            for (; j < cols; j++)
+                dst.ptr<float>(i)[j] += ans[j];
+        }
+    }
+    //! [convolution-2D-main]
+
+    //! [convolution-2D-conv]
+    const int alpha = 1;
+    dst.convertTo(dst, CV_8UC1, alpha);
+    //! [convolution-2D-conv]
+}
+//! [convolution-2D]
+
+static void help(char *progName)
+{
+    cout << endl
+            << " This program shows how to use the OpenCV parallel_for_ function and \n"
+            << " compares the performance of the sequential and parallel implementations for a \n"
+            << " convolution operation\n"
+            << " Usage:\n "
+            << progName << " [image_path -- default lena.jpg] " << endl
+            << endl;
+}
+}
+
+int main(int argc, char *argv[])
+{
+
+    //  1-D Convolution  //
+    Mat vsrc(1, N, CV_8UC1), k(1, K, CV_32FC1), vdst;
+    RNG rng(time(0));
+    rng.RNG::fill(vsrc, RNG::UNIFORM, Scalar(0), Scalar(255));
+    rng.RNG::fill(k, RNG::UNIFORM, Scalar(-50), Scalar(50));
+
+    double t = (double)getTickCount();
+    conv1d(vsrc, vdst, k);
+    t = ((double)getTickCount() - t) / getTickFrequency();
+    cout << " Sequential 1-D convolution implementation: " << t << "s" << endl;
+
+    t = (double)getTickCount();
+    float ans[N] = {0};
+    conv1dsimd(vsrc, k, ans);
+    t = ((double)getTickCount() - t) / getTickFrequency();
+    cout << " Vectorized 1-D convolution implementation: " << t << "s" << endl;
+
+    //  2-D Convolution  //
+    help(argv[0]);
+
+    const char *filepath = argc >= 2 ? argv[1] : "../../../../data/lena.jpg";
+
+    Mat src, dst1, dst2, kernel;
+    src = imread(filepath, IMREAD_GRAYSCALE);
+
+    if (src.empty())
+    {
+        cerr << "Can't open [" << filepath << "]" << endl;
+        return EXIT_FAILURE;
+    }
+    namedWindow("Input", 1);
+    namedWindow("Output", 1);
+    imshow("Input", src);
+
+    kernel = (Mat_<float>(3, 3) << 1, 0, -1,
+                                   2, 0, -2,
+                                   1, 0, -1);
+
+    t = (double)getTickCount();
+
+    conv_seq(src, dst1, kernel);
+
+    t = ((double)getTickCount() - t) / getTickFrequency();
+    cout << " Sequential 2-D convolution implementation: " << t << "s" << endl;
+
+    imshow("Output", dst1);
+    waitKey(0);
+
+    t = (double)getTickCount();
+
+    convolute_simd(src, dst2, kernel);
+
+    t = ((double)getTickCount() - t) / getTickFrequency();
+    cout << " Vectorized 2-D convolution implementation: " << t << "s" << endl
+         << endl;
+
+    imshow("Output", dst2);
+    waitKey(0);
+
+    return 0;
+}