This work is sponsored by, and copyright, Google.
This has mostly got the same differences to the 8 bit version as
in the arm version. For the horizontal filters, we do 16 pixels
in parallel as well. For the 8 pixel wide vertical filters, we can
accumulate 4 rows before storing, just as in the 8 bit version.
Examples of runtimes vs the 32 bit version, on a Cortex A53:
ARM AArch64
vp9_avg4_10bpp_neon: 35.7 30.7
vp9_avg8_10bpp_neon: 93.5 84.7
vp9_avg16_10bpp_neon: 324.4 296.6
vp9_avg32_10bpp_neon: 1236.5 1148.2
vp9_avg64_10bpp_neon: 4639.6 4571.1
vp9_avg_8tap_smooth_4h_10bpp_neon: 130.0 128.0
vp9_avg_8tap_smooth_4hv_10bpp_neon: 440.0 440.5
vp9_avg_8tap_smooth_4v_10bpp_neon: 114.0 105.5
vp9_avg_8tap_smooth_8h_10bpp_neon: 327.0 314.0
vp9_avg_8tap_smooth_8hv_10bpp_neon: 918.7 865.4
vp9_avg_8tap_smooth_8v_10bpp_neon: 330.0 300.2
vp9_avg_8tap_smooth_16h_10bpp_neon: 1187.5 1155.5
vp9_avg_8tap_smooth_16hv_10bpp_neon: 2663.1 2591.0
vp9_avg_8tap_smooth_16v_10bpp_neon: 1107.4 1078.3
vp9_avg_8tap_smooth_64h_10bpp_neon: 17754.6 17454.7
vp9_avg_8tap_smooth_64hv_10bpp_neon: 33285.2 33001.5
vp9_avg_8tap_smooth_64v_10bpp_neon: 16066.9 16048.6
vp9_put4_10bpp_neon: 25.5 21.7
vp9_put8_10bpp_neon: 56.0 52.0
vp9_put16_10bpp_neon/armv8: 183.0 163.1
vp9_put32_10bpp_neon/armv8: 678.6 563.1
vp9_put64_10bpp_neon/armv8: 2679.9 2195.8
vp9_put_8tap_smooth_4h_10bpp_neon: 120.0 118.0
vp9_put_8tap_smooth_4hv_10bpp_neon: 435.2 435.0
vp9_put_8tap_smooth_4v_10bpp_neon: 107.0 98.2
vp9_put_8tap_smooth_8h_10bpp_neon: 303.0 290.0
vp9_put_8tap_smooth_8hv_10bpp_neon: 893.7 828.7
vp9_put_8tap_smooth_8v_10bpp_neon: 305.5 263.5
vp9_put_8tap_smooth_16h_10bpp_neon: 1089.1 1059.2
vp9_put_8tap_smooth_16hv_10bpp_neon: 2578.8 2452.4
vp9_put_8tap_smooth_16v_10bpp_neon: 1009.5 933.5
vp9_put_8tap_smooth_64h_10bpp_neon: 16223.4 15918.6
vp9_put_8tap_smooth_64hv_10bpp_neon: 32153.0 31016.2
vp9_put_8tap_smooth_64v_10bpp_neon: 14516.5 13748.1
These are generally about as fast as the corresponding ARM
routines on the same CPU (at least on the A53), in most cases
marginally faster.
The speedup vs C code is around 4-9x.
Signed-off-by: Martin Storsjö <martin@martin.st>
This work is sponsored by, and copyright, Google.
The plain pixel put/copy functions are used from the 8 bit version,
for the double size (e.g. put16 uses ff_vp9_copy32_neon), and a new
copy128 is added.
Compared with the 8 bit version, the filters can no longer use the
trick to accumulate in 16 bit with only saturation at the end, but now
the accumulators need to be 32 bit. This avoids the need to keep track
of which filter index is the largest though, reducing the size of the
executable code for these filters.
For the horizontal filters, we only do 4 or 8 pixels wide in parallel
(while doing two rows at a time), since we don't have enough register
space to filter 16 pixels wide.
For the vertical filters, we still do 4 and 8 pixels in parallel just
as in the 8 bit case, but we need to store the output after every 2
rows instead of after every 4 rows.
Examples of relative speedup compared to the C version, from checkasm:
Cortex A7 A8 A9 A53
vp9_avg4_10bpp_neon: 2.25 2.44 3.05 2.16
vp9_avg8_10bpp_neon: 3.66 8.48 3.86 3.50
vp9_avg16_10bpp_neon: 3.39 8.26 3.37 2.72
vp9_avg32_10bpp_neon: 4.03 10.20 4.07 3.42
vp9_avg64_10bpp_neon: 4.15 10.01 4.13 3.70
vp9_avg_8tap_smooth_4h_10bpp_neon: 3.38 6.22 3.41 4.75
vp9_avg_8tap_smooth_4hv_10bpp_neon: 3.89 6.39 4.30 5.32
vp9_avg_8tap_smooth_4v_10bpp_neon: 5.32 9.73 6.34 7.31
vp9_avg_8tap_smooth_8h_10bpp_neon: 4.45 9.40 4.68 6.87
vp9_avg_8tap_smooth_8hv_10bpp_neon: 4.64 8.91 5.44 6.47
vp9_avg_8tap_smooth_8v_10bpp_neon: 6.44 13.42 8.68 8.79
vp9_avg_8tap_smooth_64h_10bpp_neon: 4.66 9.02 4.84 7.71
vp9_avg_8tap_smooth_64hv_10bpp_neon: 4.61 9.14 4.92 7.10
vp9_avg_8tap_smooth_64v_10bpp_neon: 6.90 14.13 9.57 10.41
vp9_put4_10bpp_neon: 1.33 1.46 2.09 1.33
vp9_put8_10bpp_neon: 1.57 3.42 1.83 1.84
vp9_put16_10bpp_neon: 1.55 4.78 2.17 1.89
vp9_put32_10bpp_neon: 2.06 5.35 2.14 2.30
vp9_put64_10bpp_neon: 3.00 2.41 1.95 1.66
vp9_put_8tap_smooth_4h_10bpp_neon: 3.19 5.81 3.31 4.63
vp9_put_8tap_smooth_4hv_10bpp_neon: 3.86 6.22 4.32 5.21
vp9_put_8tap_smooth_4v_10bpp_neon: 5.40 9.77 6.08 7.21
vp9_put_8tap_smooth_8h_10bpp_neon: 4.22 8.41 4.46 6.63
vp9_put_8tap_smooth_8hv_10bpp_neon: 4.56 8.51 5.39 6.25
vp9_put_8tap_smooth_8v_10bpp_neon: 6.60 12.43 8.17 8.89
vp9_put_8tap_smooth_64h_10bpp_neon: 4.41 8.59 4.54 7.49
vp9_put_8tap_smooth_64hv_10bpp_neon: 4.43 8.58 5.34 6.63
vp9_put_8tap_smooth_64v_10bpp_neon: 7.26 13.92 9.27 10.92
For the larger 8tap filters, the speedup vs C code is around 4-14x.
Signed-off-by: Martin Storsjö <martin@martin.st>
The values of {FLT,DBL}_{MAX,MIN} macros on some systems (older musl
libc, some BSD flavours) are not exactly representable, i.e.
(double)DBL_MAX == DBL_MAX is false
This violates (at least some interpretations of) the C99 standard and
breaks code (e.g. in vf_fps) like
double f = DBL_MAX;
[...]
if (f == DBL_MAX) { // f has not been changed yet
[....]
}
Iterative implementation of 32 bit fixed point split-radix FFT.
Max FFT that can be calculated currently is 2^12.
Signed-off-by: Nedeljko Babic <nbabic@mips.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
These windows do not really belong in fft/mdct files and were
easily confused with the similarly named tables used by rdft.
Signed-off-by: Mans Rullgard <mans@mansr.com>
Instead of defining functions in per-arch header files included
by the main cpu.c, define them normally and call them from the
generic one.
Originally committed as revision 25084 to svn://svn.ffmpeg.org/ffmpeg/trunk
It contains optimizations that are not specific to i386 and
libavutil uses this naming scheme already.
Originally committed as revision 16270 to svn://svn.ffmpeg.org/ffmpeg/trunk
c is 1.9x faster than previous c (on various x86 cpus), sse is 1.6x faster than previous sse.
Originally committed as revision 14698 to svn://svn.ffmpeg.org/ffmpeg/trunk
include paths in the source files.
mostly from a patch by Ronald S. Bultje, rbultje ronald.bitfreak net
Originally committed as revision 9034 to svn://svn.ffmpeg.org/ffmpeg/trunk
Patch by Zuxy Meng, zuxy <<dot>> meng >>at<< gmail <<dot>> com
Minor non-functional diff-related fixes by me.
Originally committed as revision 5125 to svn://svn.ffmpeg.org/ffmpeg/trunk