* use v_float16x4 (universal intrinsic) instead of raw SSE/NEON implementation
* define v_load_f16/v_store_f16 since v_load can't be distinguished when short pointer passed
* brush up implementation on old compiler (guard correctly)
* add test for v_load_f16 and round trip conversion of v_float16x4
* fix conversion error
* use universal intrinsic for accumulate series using float/double
* accumulate, accumulateSquare, accumulateProduct and accumulateWeighted
* add v_cvt_f64_high in both SSE/NEON
* add test for conversion v_cvt_f64_high in test_intrin.cpp
* improve some existing universal intrinsic by using new instructions in Aarch64
* add workaround for Android build in intrin_neon.hpp
* Added 2-channel ops to match existing 3-channel and 4-channel ops
* v_load_deinterleave() and v_store_interleave()
* Implements float32x4 only on SSE (but all types on NEON and CPP)
* Includes tests
* Will be used to vectorize 2D functions, such as estimateAffine2D()