The previous version could only use the default stream, i.e.
cv::cuda::Stream::Null().
With this change, HOG::compute() can run in parallel over different
cuda::Stream objects.
The code has been reordered so that all data allocation is completed
first, then all the kernels are run in parallel over streams.
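
The "allocate first, then launch on streams" ordering described above can be sketched in plain CUDA as follows. This is only an illustration, not the code from this change: scale_kernel and run_on_streams are hypothetical names, and the kernel is a trivial stand-in for the HOG stages. The motivation assumed here is that cudaMalloc() can implicitly synchronize the device, so interleaving allocations with kernel launches tends to serialize work that could otherwise overlap.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for one stage of the HOG pipeline.
__global__ void scale_kernel(const float* in, float* out, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * factor;
}

// Runs the same kernel on n_streams independent buffers, one stream each.
// Returns true if every CUDA call succeeded and every result is as expected.
bool run_on_streams(int n_streams, int n)
{
    const size_t bytes = static_cast<size_t>(n) * sizeof(float);
    std::vector<cudaStream_t> streams(n_streams);   // value-initialized to nullptr
    std::vector<float*> d_in(n_streams, nullptr), d_out(n_streams, nullptr);
    std::vector<float> h_in(n, 1.0f);
    std::vector<std::vector<float>> h_out(n_streams, std::vector<float>(n, 0.0f));
    bool ok = true;

    // Phase 1: create all streams and allocate all device buffers up front.
    for (int s = 0; s < n_streams && ok; ++s) {
        ok = cudaStreamCreate(&streams[s]) == cudaSuccess
          && cudaMalloc(&d_in[s], bytes) == cudaSuccess
          && cudaMalloc(&d_out[s], bytes) == cudaSuccess;
    }

    // Phase 2: enqueue copy-in, kernel, copy-out on each stream, no allocation.
    // Host buffers here are pageable, so the copies may not fully overlap;
    // pinned memory (cudaMallocHost) would be needed for truly async copies.
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    for (int s = 0; s < n_streams && ok; ++s) {
        ok = cudaMemcpyAsync(d_in[s], h_in.data(), bytes,
                             cudaMemcpyHostToDevice, streams[s]) == cudaSuccess;
        scale_kernel<<<blocks, threads, 0, streams[s]>>>(d_in[s], d_out[s], n,
                                                         float(s + 1));
        ok = ok && cudaGetLastError() == cudaSuccess
                && cudaMemcpyAsync(h_out[s].data(), d_out[s], bytes,
                                   cudaMemcpyDeviceToHost, streams[s]) == cudaSuccess;
    }

    // Phase 3: wait for each stream, then verify its output on the host.
    for (int s = 0; s < n_streams && ok; ++s) {
        ok = cudaStreamSynchronize(streams[s]) == cudaSuccess;
        for (int i = 0; i < n && ok; ++i)
            ok = (h_out[s][i] == float(s + 1));
    }

    // Cleanup: cudaFree(nullptr) is a no-op; skip streams that were never created.
    for (int s = 0; s < n_streams; ++s) {
        cudaFree(d_in[s]);
        cudaFree(d_out[s]);
        if (streams[s] != nullptr) cudaStreamDestroy(streams[s]);
    }
    return ok;
}

int main()
{
    bool ok = run_on_streams(4, 1 << 16);
    std::printf("%s\n", ok ? "ok" : "failed");
    return ok ? 0 : 1;
}
```

Whether the kernels actually overlap depends on the device and on how much of the GPU each launch occupies; a profiler timeline is the usual way to confirm it.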
Fix #8177
See https://github.com/Itseez/opencv/issues/5721
COMMENTS:
* The second __syncthreads() is necessary; I am sure of that.
* The code also works without the first __syncthreads(), but I added it for symmetry. It does not affect run time; I checked this by profiling with nvvp.