Fixes multiple runtime error: shift exponent 792 is too large for 32-bit type 'unsigned int'
Found-by: continuous fuzzing process https://github.com/google/oss-fuzz/tree/master/targets/ffmpeg
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
Benchmarks with START_TIMER indicate that the code is faster with unsigned, (that is
with the patch), there was quite some fluctuation in the numbers so this may be just
random
Fixes: 811/clusterfuzz-testcase-6465493076541440
Found-by: continuous fuzzing process https://github.com/google/oss-fuzz/tree/master/targets/ffmpeg
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
add kVTCompressionPropertyKey_DataRateLimits support by rc_max_bitrate
Reviewed-by: Rick Kern <kernrj@gmail.com>
Signed-off-by: Steven Liu <lq@chinaffmpeg.org>
Fixes: 755/clusterfuzz-testcase-5369072516595712
See: [FFmpeg-devel] [PATCH 1/2] avcodec/h264_direct: Fix runtime error: signed integer overflow: 2147483647 - -14133 cannot be represented in type 'int'
Found-by: continuous fuzzing process https://github.com/google/oss-fuzz/tree/master/targets/ffmpeg
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
Fixes: 732/clusterfuzz-testcase-4872990070145024
See: [FFmpeg-devel] [PATCH 2/6] avcodec/dca_xll: Fix runtime error: signed integer overflow: 2147286116 + 6298923 cannot be represented in type 'int'
Found-by: continuous fuzzing process https://github.com/google/oss-fuzz/tree/master/targets/ffmpeg
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
This matches the order they are in the 16 bpp version.
There they are in this order, to make sure we access them in the
same order they are declared, easing loading only half of the
coefficients at a time.
This makes the 8 bpp version match the 16 bpp version better.
This is cherrypicked from libav commit
b8f66c0838.
Signed-off-by: Martin Storsjö <martin@martin.st>
This matches the order they are in the 16 bpp version.
There they are in this order, to make sure we access them in the
same order they are declared, easing loading only half of the
coefficients at a time.
This makes the 8 bpp version match the 16 bpp version better.
This is cherrypicked from libav commit
08074c092d.
Signed-off-by: Martin Storsjö <martin@martin.st>
All elements are used pairwise, except for the first one.
Previously, the 16th element was unused. Move the unused element
to the second slot, to make the later element pairs not split
across registers.
This simplifies loading only parts of the coefficients,
reducing the difference to the 16 bpp version.
This is cherrypicked from libav commit
09eb88a12e.
Signed-off-by: Martin Storsjö <martin@martin.st>
All elements are used pairwise, except for the first one.
Previously, the 16th element was unused. Move the unused element
to the second slot, to make the later element pairs not split
across registers.
This simplifies loading only parts of the coefficients,
reducing the difference to the 16 bpp version.
This is cherrypicked from libav commit
de06bdfe6c.
Signed-off-by: Martin Storsjö <martin@martin.st>
The idct32x32 function actually pushed d8-d15 onto the stack even
though it didn't clobber them; there are plenty of registers that
can be used to allow keeping all the idct coefficients in registers
without having to reload different subsets of them at different
stages in the transform.
After this, we still can skip pushing d12-d15.
Before:
vp9_inv_dct_dct_32x32_sub32_add_neon: 8128.3
After:
vp9_inv_dct_dct_32x32_sub32_add_neon: 8053.3
This is cherrypicked from libav commit
65aa002d54.
Signed-off-by: Martin Storsjö <martin@martin.st>
The idct32x32 function actually pushed q4-q7 onto the stack even
though it didn't clobber them; there are plenty of registers that
can be used to allow keeping all the idct coefficients in registers
without having to reload different subsets of them at different
stages in the transform.
Since the idct16 core transform avoids clobbering q4-q7 (but clobbers
q2-q3 instead, to avoid needing to back up and restore q4-q7 at all
in the idct16 function), and the lanewise vmul needs a register in
the q0-q3 range, we move the stored coefficients from q2-q3 into q4-q5
while doing idct16.
While keeping these coefficients in registers, we still can skip pushing
q7.
Before: Cortex A7 A8 A9 A53
vp9_inv_dct_dct_32x32_sub32_add_neon: 18553.8 17182.7 14303.3 12089.7
After:
vp9_inv_dct_dct_32x32_sub32_add_neon: 18470.3 16717.7 14173.6 11860.8
This is cherrypicked from libav commit
402546a172.
Signed-off-by: Martin Storsjö <martin@martin.st>
For this case, with 8 inputs but only changing 4 of them, we can fit
all 16 input pixels into a q register, and still have enough temporary
registers for doing the loop filter.
The wd=8 filters would require too many temporary registers for
processing all 16 pixels at once though.
Before: Cortex A7 A8 A9 A53
vp9_loop_filter_mix2_v_44_16_neon: 289.7 256.2 237.5 181.2
After:
vp9_loop_filter_mix2_v_44_16_neon: 221.2 150.5 177.7 138.0
This is cherrypicked from libav commit
575e31e931.
Signed-off-by: Martin Storsjö <martin@martin.st>
This is one cycle faster in total, and three instructions fewer.
Before:
vp9_loop_filter_mix2_v_44_16_neon: 123.2
After:
vp9_loop_filter_mix2_v_44_16_neon: 122.2
This is cherrypicked from libav commit
3bf9c48320.
Signed-off-by: Martin Storsjö <martin@martin.st>
This adds lots of extra .ifs, but speeds it up by a couple cycles,
by avoiding stalls.
This is cherrypicked from libav commit
b0806088d3.
Signed-off-by: Martin Storsjö <martin@martin.st>
This adds lots of extra .ifs, but speeds it up by a couple cycles,
by avoiding stalls.
This is cherrypicked from libav commit
e18c39005a.
Signed-off-by: Martin Storsjö <martin@martin.st>
Previously we first calculated hev, and then negated it.
Since we were able to schedule the negation in the middle
of another calculation, we don't see any gain in all cases.
Before: Cortex A7 A8 A9 A53 A53/AArch64
vp9_loop_filter_v_4_8_neon: 147.0 129.0 115.8 89.0 88.7
vp9_loop_filter_v_8_8_neon: 242.0 198.5 174.7 140.0 136.7
vp9_loop_filter_v_16_8_neon: 500.0 419.5 382.7 293.0 275.7
vp9_loop_filter_v_16_16_neon: 971.2 825.5 731.5 579.0 453.0
After:
vp9_loop_filter_v_4_8_neon: 143.0 127.7 114.8 88.0 87.7
vp9_loop_filter_v_8_8_neon: 241.0 197.2 173.7 140.0 136.7
vp9_loop_filter_v_16_8_neon: 497.0 419.5 379.7 293.0 275.7
vp9_loop_filter_v_16_16_neon: 965.2 818.7 731.4 579.0 452.0
This is cherrypicked from libav commit
e1f9de86f4.
Signed-off-by: Martin Storsjö <martin@martin.st>
This work is sponsored by, and copyright, Google.
Before: Cortex A53
vp9_inv_dct_dct_16x16_sub1_add_neon: 235.3
vp9_inv_dct_dct_32x32_sub1_add_neon: 555.1
After:
vp9_inv_dct_dct_16x16_sub1_add_neon: 180.2
vp9_inv_dct_dct_32x32_sub1_add_neon: 475.3
This is cherrypicked from libav commit
3fcf788fbb.
Signed-off-by: Martin Storsjö <martin@martin.st>
No measured speedup on a Cortex A53, but other cores might benefit.
This is cherrypicked from libav commit
388e0d2515.
Signed-off-by: Martin Storsjö <martin@martin.st>