Provide neon implementation for sse4 function.
Performance comparison tests are shown below.
- sse_2_c: 80.7
- sse_2_neon: 31.0
Benchmarks and tests are run with checkasm tool on AWS Graviton 3.
Signed-off-by: Hubert Mazur <hum@semihalf.com>
Signed-off-by: Martin Storsjö <martin@martin.st>