Resize reworked using wide universal intrinsics (#13781)
* Added wide universal intrinsics optimized implementation for 3 channel bit-exact linear resize
* Reworked linear resize using new wide LUT intrinsics
* Fix for VSX intrinsics
Due to size limit of shared memory, histogram is built on
the global memory for CV_16UC1 case.
The amount of memory needed for building histogram is:
65536 * 4byte = 256KB
and shared memory limit is 48KB typically.
Added test cases for CV_16UC1 and various clip limits.
Added perf tests for CV_16UC1 on both CPU and CUDA code.
There was also a bug in CV_8UC1 case when redistributing
"residual" clipped pixels. Adding the test case where clip
limit is 5.0 exposes this bug.
* Add Operator override for multi-channel Mat with literal constant.
* simple test
* Operator overloading channel constraint for primitive types
* fix some test for #13586
* added performance test for compareHist
* compareHist reworked to use wide universal intrinsics
* Disabled vectorization for CV_COMP_CORREL and CV_COMP_BHATTACHARYYA if f64 is unsupported
* Added performance tests for hal::norm functions
* Added sum of absolute differences intrinsic
* norm implementation updated to use wide universal intrinsics
* improve and fix v_reduce_sad on VSX
- add infrastructure support for Power9/VSX3
- fix missing VSX flags on GCC4.9 and CLANG4(#13210, #13222)
- fix disable VSX optimzation on GCC by using flag ENABLE_VSX
- flag ENABLE_VSX is deprecated now, use CPU_BASELINE, CPU_DISPATCH instead
- add VSX3 to arithmetic dispatchable flags
* Updated boxFilter implementations to use wide universal intrinsics
* boxFilter implementation moved to separate file
* Replaced ROUNDUP macro with roundUp() function
- initialize arithmetic dispatcher
- add new universal intrinsic v_absdiffs
- add new universal intrinsic v_pack_b
- add accumulate version of universal intrinsic v_round
- fix sse/avx2:uint8 multiplication overflow
- reimplement arithmetic, logic and comparison operations into wide universal intrinsics
with full support for all types
- reimplement IPP arithmetic, logic and comparison operations in a sperate file arithm_ipp.hpp
- avoid scalar multiplication if scaling factor eq 1 and use integer multiplication
- move C arithmetic operations to precomp.hpp and delete [arithm_simd|arithm_core].hpp
- add compatibility with new opencv4 divide policy
* js: update build script
- support emscipten 1.38.12 (wasm is ON by default)
- verbose build messages
* js: use builtin Math functions
* js: disable tracing code completelly
Fixes for instrumentation of IPP and OCL (#12637)
* fixed warning about re-declaring variable when both IPP and instrumentation are enabled
* fixed segfault when no funName provided
* compilation fixed when both OCL and instrumentation are enabled
* Remove isIntel check from deep learning layers
* Remove fp16->fp32 fallbacks where it's not necessary
* Fix Kernel::run to prevent localsize > globalsize