Provide optimized implementation of vsad16 function for arm64.
Performance comparison tests are shown below.
- vsad_0_c: 285.2
- vsad_0_neon: 39.5
Benchmarks and tests are run with checkasm tool on AWS Graviton 3.
Co-authored-by: Martin Storsjö <martin@martin.st>
Signed-off-by: Hubert Mazur <hum@semihalf.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
Provide optimized implementation of pix_abs8 function for arm64.
Performance comparison tests are shown below.
- pix_abs_1_0_c: 101.2
- pix_abs_1_0_neon: 22.5
- sad_1_c: 101.2
- sad_1_neon: 22.5
Benchmarks and tests are run with checkasm tool on AWS Graviton 3.
Signed-off-by: Martin Storsjö <martin@martin.st>
Provide optimized implementation of sse8 function for arm64.
Performance comparison tests are shown below.
- sse_1_c: 130.7
- sse_1_neon: 29.7
Benchmarks and tests run with checkasm tool on AWS Graviton 3.
Signed-off-by: Hubert Mazur <hum@semihalf.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
Provide optimized implementation of pix_abs16_y2 function for arm64.
Performance comparison tests are shown below.
pix_abs_0_2_c: 317.2
pix_abs_0_2_neon: 37.5
Benchmarks and tests run with checkasm tool on AWS Graviton 3.
Signed-off-by: Hubert Mazur <hum@semihalf.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
Provide neon implementation for sse4 function.
Performance comparison tests are shown below.
- sse_2_c: 80.7
- sse_2_neon: 31.0
Benchmarks and tests are run with checkasm tool on AWS Graviton 3.
Signed-off-by: Hubert Mazur <hum@semihalf.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
Provide neon implementation for sse16 function.
Performance comparison tests are shown below.
- sse_0_c: 268.2
- sse_0_neon: 43.5
Benchmarks and tests run with checkasm tool on AWS Graviton 3.
Signed-off-by: Hubert Mazur <hum@semihalf.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
The max height is currently documented as 16; the max difference per
pixel is 255, and a .8h element can easily contain 16*255, thus keep
accumulating in two .8h vectors, and just do the final accumulationat the
end. This should work for heights up to 256.
This requires a minor register renumbering in ff_pix_abs16_xy2_neon.
Before: Cortex A53 A72 A73 Graviton 3
pix_abs_0_0_neon: 97.7 47.0 37.5 22.7
pix_abs_0_1_neon: 154.0 59.0 52.0 25.0
pix_abs_0_3_neon: 179.7 96.7 87.5 41.2
After:
pix_abs_0_0_neon: 96.0 39.2 31.2 22.0
pix_abs_0_1_neon: 150.7 59.7 46.2 23.7
pix_abs_0_3_neon: 175.7 83.7 81.7 38.2
Signed-off-by: Martin Storsjö <martin@martin.st>
Using absolute-difference-accumulate does use twice the amount of
absolute-difference instructions, but avoids the need for the
uaddl and add instructions, reducing the total number of instructions
by 3.
These can be interleaved in the rest of the calculation, to avoid
tight dependencies at the end. Unfortunately, this is marginally
slower on Cortex A53, but faster on A72 and A73.
Before: Cortex A53 A72 A73 Graviton 3
pix_abs_0_3_neon: 175.7 109.2 92.0 41.2
After:
pix_abs_0_3_neon: 179.7 96.7 87.5 41.2
Signed-off-by: Martin Storsjö <martin@martin.st>
Provide neon implementation for pix_abs16_x2 function.
Performance tests of implementation are below.
- pix_abs_0_1_c: 283.5
- pix_abs_0_1_neon: 39.0
Benchmarks and tests run with checkasm tool on AWS Graviton 3.
Signed-off-by: Hubert Mazur <hum@semihalf.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
- ff_pix_abs16_neon
- ff_pix_abs16_xy2_neon
In direct micro benchmarks of these ff functions verses their C implementations,
these functions performed as follows on AWS Graviton 3.
ff_pix_abs16_neon:
pix_abs_0_0_c: 141.1
pix_abs_0_0_neon: 19.6
ff_pix_abs16_xy2_neon:
pix_abs_0_3_c: 269.1
pix_abs_0_3_neon: 39.3
Tested with:
./tests/checkasm/checkasm --test=motion --bench --disable-linux-perf
Signed-off-by: Jonathan Swinney <jswinney@amazon.com>
Signed-off-by: Martin Storsjö <martin@martin.st>