This work is sponsored by, and copyright, Google.
This has mostly got the same differences to the 8 bit version as
in the arm version. For the horizontal filters, we do 16 pixels
in parallel as well. For the 8 pixel wide vertical filters, we can
accumulate 4 rows before storing, just as in the 8 bit version.
Examples of runtimes vs the 32 bit version, on a Cortex A53:
ARM AArch64
vp9_avg4_10bpp_neon: 35.7 30.7
vp9_avg8_10bpp_neon: 93.5 84.7
vp9_avg16_10bpp_neon: 324.4 296.6
vp9_avg32_10bpp_neon: 1236.5 1148.2
vp9_avg64_10bpp_neon: 4639.6 4571.1
vp9_avg_8tap_smooth_4h_10bpp_neon: 130.0 128.0
vp9_avg_8tap_smooth_4hv_10bpp_neon: 440.0 440.5
vp9_avg_8tap_smooth_4v_10bpp_neon: 114.0 105.5
vp9_avg_8tap_smooth_8h_10bpp_neon: 327.0 314.0
vp9_avg_8tap_smooth_8hv_10bpp_neon: 918.7 865.4
vp9_avg_8tap_smooth_8v_10bpp_neon: 330.0 300.2
vp9_avg_8tap_smooth_16h_10bpp_neon: 1187.5 1155.5
vp9_avg_8tap_smooth_16hv_10bpp_neon: 2663.1 2591.0
vp9_avg_8tap_smooth_16v_10bpp_neon: 1107.4 1078.3
vp9_avg_8tap_smooth_64h_10bpp_neon: 17754.6 17454.7
vp9_avg_8tap_smooth_64hv_10bpp_neon: 33285.2 33001.5
vp9_avg_8tap_smooth_64v_10bpp_neon: 16066.9 16048.6
vp9_put4_10bpp_neon: 25.5 21.7
vp9_put8_10bpp_neon: 56.0 52.0
vp9_put16_10bpp_neon/armv8: 183.0 163.1
vp9_put32_10bpp_neon/armv8: 678.6 563.1
vp9_put64_10bpp_neon/armv8: 2679.9 2195.8
vp9_put_8tap_smooth_4h_10bpp_neon: 120.0 118.0
vp9_put_8tap_smooth_4hv_10bpp_neon: 435.2 435.0
vp9_put_8tap_smooth_4v_10bpp_neon: 107.0 98.2
vp9_put_8tap_smooth_8h_10bpp_neon: 303.0 290.0
vp9_put_8tap_smooth_8hv_10bpp_neon: 893.7 828.7
vp9_put_8tap_smooth_8v_10bpp_neon: 305.5 263.5
vp9_put_8tap_smooth_16h_10bpp_neon: 1089.1 1059.2
vp9_put_8tap_smooth_16hv_10bpp_neon: 2578.8 2452.4
vp9_put_8tap_smooth_16v_10bpp_neon: 1009.5 933.5
vp9_put_8tap_smooth_64h_10bpp_neon: 16223.4 15918.6
vp9_put_8tap_smooth_64hv_10bpp_neon: 32153.0 31016.2
vp9_put_8tap_smooth_64v_10bpp_neon: 14516.5 13748.1
These are generally about as fast as the corresponding ARM
routines on the same CPU (at least on the A53), in most cases
marginally faster.
The speedup vs C code is around 4-9x.
Signed-off-by: Martin Storsjö <martin@martin.st>
This work is sponsored by, and copyright, Google.
The plain pixel put/copy functions are used from the 8 bit version,
for the double size (e.g. put16 uses ff_vp9_copy32_neon), and a new
copy128 is added.
Compared with the 8 bit version, the filters can no longer use the
trick to accumulate in 16 bit with only saturation at the end, but now
the accumulators need to be 32 bit. This avoids the need to keep track
of which filter index is the largest though, reducing the size of the
executable code for these filters.
For the horizontal filters, we only do 4 or 8 pixels wide in parallel
(while doing two rows at a time), since we don't have enough register
space to filter 16 pixels wide.
For the vertical filters, we still do 4 and 8 pixels in parallel just
as in the 8 bit case, but we need to store the output after every 2
rows instead of after every 4 rows.
Examples of relative speedup compared to the C version, from checkasm:
Cortex A7 A8 A9 A53
vp9_avg4_10bpp_neon: 2.25 2.44 3.05 2.16
vp9_avg8_10bpp_neon: 3.66 8.48 3.86 3.50
vp9_avg16_10bpp_neon: 3.39 8.26 3.37 2.72
vp9_avg32_10bpp_neon: 4.03 10.20 4.07 3.42
vp9_avg64_10bpp_neon: 4.15 10.01 4.13 3.70
vp9_avg_8tap_smooth_4h_10bpp_neon: 3.38 6.22 3.41 4.75
vp9_avg_8tap_smooth_4hv_10bpp_neon: 3.89 6.39 4.30 5.32
vp9_avg_8tap_smooth_4v_10bpp_neon: 5.32 9.73 6.34 7.31
vp9_avg_8tap_smooth_8h_10bpp_neon: 4.45 9.40 4.68 6.87
vp9_avg_8tap_smooth_8hv_10bpp_neon: 4.64 8.91 5.44 6.47
vp9_avg_8tap_smooth_8v_10bpp_neon: 6.44 13.42 8.68 8.79
vp9_avg_8tap_smooth_64h_10bpp_neon: 4.66 9.02 4.84 7.71
vp9_avg_8tap_smooth_64hv_10bpp_neon: 4.61 9.14 4.92 7.10
vp9_avg_8tap_smooth_64v_10bpp_neon: 6.90 14.13 9.57 10.41
vp9_put4_10bpp_neon: 1.33 1.46 2.09 1.33
vp9_put8_10bpp_neon: 1.57 3.42 1.83 1.84
vp9_put16_10bpp_neon: 1.55 4.78 2.17 1.89
vp9_put32_10bpp_neon: 2.06 5.35 2.14 2.30
vp9_put64_10bpp_neon: 3.00 2.41 1.95 1.66
vp9_put_8tap_smooth_4h_10bpp_neon: 3.19 5.81 3.31 4.63
vp9_put_8tap_smooth_4hv_10bpp_neon: 3.86 6.22 4.32 5.21
vp9_put_8tap_smooth_4v_10bpp_neon: 5.40 9.77 6.08 7.21
vp9_put_8tap_smooth_8h_10bpp_neon: 4.22 8.41 4.46 6.63
vp9_put_8tap_smooth_8hv_10bpp_neon: 4.56 8.51 5.39 6.25
vp9_put_8tap_smooth_8v_10bpp_neon: 6.60 12.43 8.17 8.89
vp9_put_8tap_smooth_64h_10bpp_neon: 4.41 8.59 4.54 7.49
vp9_put_8tap_smooth_64hv_10bpp_neon: 4.43 8.58 5.34 6.63
vp9_put_8tap_smooth_64v_10bpp_neon: 7.26 13.92 9.27 10.92
For the larger 8tap filters, the speedup vs C code is around 4-14x.
Signed-off-by: Martin Storsjö <martin@martin.st>
Consistently apply this rule: the guard name is obtained from the
filename by stripping the leading "lib", converting '/' and '.' to
'_' and uppercasing the resulting name. Guard names in the root
directory have to be prefixed by "FFMPEG_".
Originally committed as revision 15120 to svn://svn.ffmpeg.org/ffmpeg/trunk