checkasm benchmarks on 1.5 GHz Cortex-A72 are as follows.
vc1dsp.vc1_unescape_buffer_c: 918624.7
vc1dsp.vc1_unescape_buffer_neon: 142958.0
Signed-off-by: Ben Avison <bavison@riscosopen.org>
Signed-off-by: Martin Storsjö <martin@martin.st>
checkasm benchmarks on 1.5 GHz Cortex-A72 are as follows.
vc1dsp.vc1_unescape_buffer_c: 655617.7
vc1dsp.vc1_unescape_buffer_neon: 118237.0
Signed-off-by: Ben Avison <bavison@riscosopen.org>
Signed-off-by: Martin Storsjö <martin@martin.st>
checkasm benchmarks on 1.5 GHz Cortex-A72 are as follows. Note that the C
version can still outperform the NEON version in specific cases. The balance
between different code paths is stream-dependent, but in practice the best
case happens about 5% of the time, the worst case happens about 40% of the
time, and the complexity of the remaining cases fall somewhere in between.
Therefore, taking the average of the best and worst case timings is
probably a conservative estimate of the degree by which the NEON code
improves performance.
vc1dsp.vc1_h_loop_filter4_bestcase_c: 19.0
vc1dsp.vc1_h_loop_filter4_bestcase_neon: 48.5
vc1dsp.vc1_h_loop_filter4_worstcase_c: 144.7
vc1dsp.vc1_h_loop_filter4_worstcase_neon: 76.2
vc1dsp.vc1_h_loop_filter8_bestcase_c: 41.0
vc1dsp.vc1_h_loop_filter8_bestcase_neon: 75.0
vc1dsp.vc1_h_loop_filter8_worstcase_c: 294.0
vc1dsp.vc1_h_loop_filter8_worstcase_neon: 102.7
vc1dsp.vc1_h_loop_filter16_bestcase_c: 54.7
vc1dsp.vc1_h_loop_filter16_bestcase_neon: 130.0
vc1dsp.vc1_h_loop_filter16_worstcase_c: 569.7
vc1dsp.vc1_h_loop_filter16_worstcase_neon: 186.7
vc1dsp.vc1_v_loop_filter4_bestcase_c: 20.2
vc1dsp.vc1_v_loop_filter4_bestcase_neon: 47.2
vc1dsp.vc1_v_loop_filter4_worstcase_c: 164.2
vc1dsp.vc1_v_loop_filter4_worstcase_neon: 68.5
vc1dsp.vc1_v_loop_filter8_bestcase_c: 43.5
vc1dsp.vc1_v_loop_filter8_bestcase_neon: 55.2
vc1dsp.vc1_v_loop_filter8_worstcase_c: 316.2
vc1dsp.vc1_v_loop_filter8_worstcase_neon: 72.7
vc1dsp.vc1_v_loop_filter16_bestcase_c: 62.2
vc1dsp.vc1_v_loop_filter16_bestcase_neon: 103.7
vc1dsp.vc1_v_loop_filter16_worstcase_c: 646.5
vc1dsp.vc1_v_loop_filter16_worstcase_neon: 110.7
Signed-off-by: Ben Avison <bavison@riscosopen.org>
Signed-off-by: Martin Storsjö <martin@martin.st>
checkasm benchmarks on 1.5 GHz Cortex-A72 are as follows. Note that the C
version can still outperform the NEON version in specific cases. The balance
between different code paths is stream-dependent, but in practice the best
case happens about 5% of the time, the worst case happens about 40% of the
time, and the complexity of the remaining cases fall somewhere in between.
Therefore, taking the average of the best and worst case timings is
probably a conservative estimate of the degree by which the NEON code
improves performance.
vc1dsp.vc1_h_loop_filter4_bestcase_c: 10.7
vc1dsp.vc1_h_loop_filter4_bestcase_neon: 43.5
vc1dsp.vc1_h_loop_filter4_worstcase_c: 184.5
vc1dsp.vc1_h_loop_filter4_worstcase_neon: 73.7
vc1dsp.vc1_h_loop_filter8_bestcase_c: 31.2
vc1dsp.vc1_h_loop_filter8_bestcase_neon: 62.2
vc1dsp.vc1_h_loop_filter8_worstcase_c: 358.2
vc1dsp.vc1_h_loop_filter8_worstcase_neon: 88.2
vc1dsp.vc1_h_loop_filter16_bestcase_c: 51.0
vc1dsp.vc1_h_loop_filter16_bestcase_neon: 107.7
vc1dsp.vc1_h_loop_filter16_worstcase_c: 722.7
vc1dsp.vc1_h_loop_filter16_worstcase_neon: 140.5
vc1dsp.vc1_v_loop_filter4_bestcase_c: 9.7
vc1dsp.vc1_v_loop_filter4_bestcase_neon: 43.0
vc1dsp.vc1_v_loop_filter4_worstcase_c: 178.7
vc1dsp.vc1_v_loop_filter4_worstcase_neon: 69.0
vc1dsp.vc1_v_loop_filter8_bestcase_c: 30.2
vc1dsp.vc1_v_loop_filter8_bestcase_neon: 50.7
vc1dsp.vc1_v_loop_filter8_worstcase_c: 353.0
vc1dsp.vc1_v_loop_filter8_worstcase_neon: 69.2
vc1dsp.vc1_v_loop_filter16_bestcase_c: 60.0
vc1dsp.vc1_v_loop_filter16_bestcase_neon: 90.0
vc1dsp.vc1_v_loop_filter16_worstcase_c: 714.2
vc1dsp.vc1_v_loop_filter16_worstcase_neon: 97.2
Signed-off-by: Ben Avison <bavison@riscosopen.org>
Signed-off-by: Martin Storsjö <martin@martin.st>
This test deliberately doesn't exercise the full range of inputs described in
the committee draft VC-1 standard. It says:
input coefficients in frequency domain, D, satisfy -2048 <= D < 2047
intermediate coefficients, E, satisfy -4096 <= E < 4095
fully inverse-transformed coefficients, R, satisfy -512 <= R < 511
For one thing, the inequalities look odd. Did they mean them to go the
other way round? That would make more sense because the equations generally
both add and subtract coefficients multiplied by constants, including powers
of 2. Requiring the most-negative values to be valid extends the number of
bits to represent the intermediate values just for the sake of that one case!
For another thing, the extreme values don't look to occur in real streams -
both in my experience and supported by the following comment in the AArch32
decoder:
tNhalf is half of the value of tN (as described in vc1_inv_trans_8x8_c).
This is done because sometimes files have input that causes tN + tM to
overflow. To avoid this overflow, we compute tNhalf, then compute
tNhalf + tM (which doesn't overflow), and then we use vhadd to compute
(tNhalf + (tNhalf + tM)) >> 1 which does not overflow because it is
one instruction.
My AArch64 decoder goes further than this. It calculates tNhalf and tM
then does an SRA (essentially a fused halve and add) to compute
(tN + tM) >> 1 without ever having to hold (tNhalf + tM) in a 16-bit element
without overflowing. It only encounters difficulties if either tNhalf or
tM overflow in isolation.
I haven't had sight of the final standard, so it's possible that these
issues were dealt with during finalisation, which could explain the lack
of usage of extreme inputs in real streams. Or a preponderance of decoders
that only support 16-bit intermediate values in their inverse transforms
might have caused encoders to steer clear of such cases.
I have effectively followed this approach in the test, and limited the
scale of the coefficients sufficient that both the existing AArch32 decoder
and my new AArch64 decoder both pass.
Signed-off-by: Ben Avison <bavison@riscosopen.org>
Signed-off-by: Martin Storsjö <martin@martin.st>
Note that the benchmarking results for these functions are highly dependent
upon the input data. Therefore, each function is benchmarked twice,
corresponding to the best and worst case complexity of the reference C
implementation. The performance of a real stream decode will fall somewhere
between these two extremes.
Signed-off-by: Ben Avison <bavison@riscosopen.org>
Signed-off-by: Martin Storsjö <martin@martin.st>
Upstream gained a new tone-mapping API, which we never switched to. We
don't need a version bump for this because it was included as part of
the v4.192 release we currently already depend on.
Some of the old options can be moderately approximated with the new API,
but specifically "desaturation_base" and "max_boost" cannot. Remove
these entirely, rather than deprecating them. They have actually been
non-functional for a while as a result of the upstream deprecation.
Signed-off-by: Niklas Haas <git@haasn.dev>
Fixes: Out of array read
Fixes: 45137/clusterfuzz-testcase-minimized-ffmpeg_BSF_VP9_SUPERFRAME_SPLIT_fuzzer-4984270639202304
Found-by: continuous fuzzing process https://github.com/google/oss-fuzz/tree/master/projects/ffmpeg
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
They are invalid in VP9. If any of the frames inside a superframe
had a size of zero, the code would either read into the next frame
or into the superframe index; so check for the length to stop this.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Packets without data need to be handled specially in order to avoid
undefined reads. Pass these packets through unchanged in case there
are no cached packets* and error out in case there are cached packets:
Returning the packet would mess with the order of the packets;
if one returned the zero-sized packet before the superframe that will
be created from the packets in the cache, the zero-sized packet would
overtake the packets in the cache; if one returned the packet later,
the packets that complete the superframe will overtake the zero-sized
packet.
*: This case e.g. encompasses the scenario of updated extradata
side-data at the end.
Fixes: Out of array read
Fixes: 45722/clusterfuzz-testcase-minimized-ffmpeg_BSF_VP9_SUPERFRAME_fuzzer-5173378975137792
Found-by: continuous fuzzing process https://github.com/google/oss-fuzz/tree/master/projects/ffmpeg
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
tiny_ssim is built for the build host, not for the target platform.
Therefore, it mustn't include the config.h header, which is set up
specifically for the target platform and compiler.
This fixes cross building for older WinStore platforms, where
config.h contains "#define getenv(x) NULL".
Signed-off-by: Martin Storsjö <martin@martin.st>
The existing x86 assembly for loop filters uses the stride as a
full register without clearing/sign extending the upper half
of the registers on x86_64.
This avoids crashes if the caller would have passed nonzero bits
in the previously undefined upper 32 bits of the parameters.
Signed-off-by: Martin Storsjö <martin@martin.st>
The upper limit of strlen(streamid) is 512. Use a larger buffer for
future proof, for example, deal with percent-encoding.
Reviewed-by: Zhao Jun <barryjzhao@tencent.com>
Signed-off-by: Steven Liu <liuqi05@kuaishou.com>
a1c4929f accidentally undid part of d9a9b4c8, so the bug in ticket #9420
resurfaced. Fixing again.
Signed-off-by: Diederick Niehorster <dcnieho@gmail.com>
Reviewed-by: Roger Pack <rogerdpack2@gmail.com>
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
Fixes: Out of array write
Fixes: 45613/clusterfuzz-testcase-minimized-ffmpeg_AV_CODEC_ID_WMALOSSLESS_fuzzer-4539073606320128
Fixes: 46008/clusterfuzz-testcase-minimized-ffmpeg_AV_CODEC_ID_WMALOSSLESS_fuzzer-4681245747970048
Found-by: continuous fuzzing process https://github.com/google/oss-fuzz/tree/master/projects/ffmpeg
Reviewed-by: Michael Niedermayer <michael@niedermayer.cc>
Signed-off-by: James Almer <jamrial@gmail.com>
Fixes: NULL pointer dereference
Fixes: 45955/clusterfuzz-testcase-minimized-ffmpeg_AV_CODEC_ID_BINKAUDIO_DCT_fuzzer-4842044192849920
Found-by: continuous fuzzing process https://github.com/google/oss-fuzz/tree/master/projects/ffmpeg
Reviewed-by: Paul B Mahol <onemda@gmail.com>
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
So I can merge my own changes to this filter after they pass peer
review, as well as keeping it in sync with upstream API changes / new
features.
Signed-off-by: Niklas Haas <git@haasn.dev>
Signed-off-by: James Almer <jamrial@gmail.com>
Fixes: division by zero
Fixes: 45811/clusterfuzz-testcase-minimized-ffmpeg_AV_CODEC_ID_VMDAUDIO_fuzzer-6412592581574656
Fixes: 45979/clusterfuzz-testcase-minimized-ffmpeg_AV_CODEC_ID_VMDAUDIO_fuzzer-5362043060879360
Found-by: continuous fuzzing process https://github.com/google/oss-fuzz/tree/master/projects/ffmpeg
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
av_buffersrc_parameters_set() can be called to set paramenters after the filter
was initialized with for example avfilter_graph_create_filter().
Signed-off-by: James Almer <jamrial@gmail.com>
Fixes: signed integer overflow: -5 - 9223372036854775807 cannot be represented in type 'long'
Fixes: 45665/clusterfuzz-testcase-minimized-ffmpeg_DEMUXER_fuzzer-475618463934054
Found-by: continuous fuzzing process https://github.com/google/oss-fuzz/tree/master/projects/ffmpeg
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
Fixes: division by 0
Fixes: 45643/clusterfuzz-testcase-minimized-ffmpeg_DEMUXER_fuzzer-4957777905188864.fuzz
Found-by: continuous fuzzing process https://github.com/google/oss-fuzz/tree/master/projects/ffmpeg
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
This search takes alot of time especially when compared with small packets
46631 decicycles -> 15719 decicycles in read_frame_internal() for amr-nb in 3gp
Reviewed-by: Paul B Mahol <onemda@gmail.com>
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
7-bit PictureIDs are not supported by WebRTC:
https://groups.google.com/g/discuss-webrtc/c/333-L02vuWA
In practice, 15-bit PictureIDs offer better compatibility.
Signed-off-by: Kevin Wang <kevin@muxable.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
No point running all 64 iterations in the loop to never write anything to ret.
Also make ambisonic layouts check its mask too while at it.
Signed-off-by: James Almer <jamrial@gmail.com>
This comment only applies to the scenario in which one uses
the AVCodecContexts embedded in AVStreams. Yet this code sample
stopped doing so in 9897d9f4e074cdc6c7f2409885ddefe300f18dc7;
and the last major version bump even removed the public
AVCodecContexts in AVStreams. So just remove this comment.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Modifying the main context by a slice thread is racy;
so constify the pointer to it in H264SliceContext.
The code itself was already compatible with this change.
Reviewed-by: Michael Niedermayer <michael@niedermayer.cc>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Since 7be2d2a70c only one context
is used. Moving it to H264Context from H264SliceContext is natural.
One could access the ERContext from H264SliceContext
via H264SliceContext.h264->er; yet H264SliceContext.h264 should
naturally be const-qualified, because slice threads should not
modify the main context. The ERContext is an exception
to this, as ff_er_add_slice() is intended to be called simultaneously
by multiple threads. And for this one needs a pointer whose
pointed-to-type is not const-qualified.
Reviewed-by: Michael Niedermayer <michael@niedermayer.cc>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
ff_er_frame_start() initializes ERContext.error_count
to three times the number of macroblocks to decode.
Later ff_er_add_slice() reduces this number by the amount
of macroblocks whose AC resp. DC resp. MV have been finished
(so every correctly decoded MB counts three times).
So the frame has been decoded correctly if error_count is zero
at the end.
The H.264 decoder uses multiple ERContexts when using
slice threading and therefore combines these error counts:
The first slice's ERContext is intended to be initialized
by ff_er_frame_start(), error_count of all the other
slice contexts is intended to be zeroed initially and
all afterwards all the error_counts are summed.
Yet commit 43b434210e
(probably unintentionally) changed the code to set
the first slice's error_count to zero as well.
This leads to bogus error messages in case one decodes
an input video using multiple slices with slice threading
with error concealment enabled (which is not the default)
("concealing 0 DC, 0 AC, 0 MV errors in [IPB] frame");
furthermore the returned frame is marked as corrupt as well
(ffmpeg reports "corrupt decoded frame in stream %d" for this).
This can be fixed easily given that only the first ERContext
is really used since 7be2d2a70cd20d88fd826a83f87037d14681a579:
Don't reset the error_count; and don't sum the error counts as well.
Reviewed-by: Michael Niedermayer <michael@niedermayer.cc>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>