This avoids SEI and IDR recovery flags affecting each other
Also eliminate litteral numbers from recovery handling
This should make the code clearer
Improves: tickets/4738/tickets_cut.ts
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
In this case, the inner loop computing the scalar product can be reduced
to just one multiplication and one sum even with 128-bit vectors. The
result is a lot simpler, but also brings more modest performance gains:
flac_lpc_16_13_c: 15241.0
flac_lpc_16_13_rvv_i32: 11230.0
flac_lpc_16_16_c: 17884.0
flac_lpc_16_16_rvv_i32: 12125.7
flac_lpc_16_29_c: 27847.7
flac_lpc_16_29_rvv_i32: 10494.0
flac_lpc_16_32_c: 30051.5
flac_lpc_16_32_rvv_i32: 10355.0
The entire set of 32 coefficients and corresponding past 32 samples can
fit in a single vector (with LMUL=8) exactly, but... since widening
double the needed vector sizes, we still end up too short with 128-bit
vectors. This adds a very simple version for future 256+-bit hardware,
and for pred_orders values up to 16, and a bit more involved loop for
for 128-bit hardware with pred_orders between 17 and 32.
With 128-bit hardware, the benchmarks look like this:
flac_lpc_32_13_c: 30152.0
flac_lpc_32_13_rvv_i32: 10244.7
flac_lpc_32_16_c: 37314.2
flac_lpc_32_16_rvv_i32: 10126.2
flac_lpc_32_29_c: 61910.0
flac_lpc_32_29_rvv_i32: 14495.2
flac_lpc_32_32_c: 68204.0
flac_lpc_32_32_rvv_i32: 13273.7
This commit adds support for VP8 bitstream read methods to the cbs
codec. This enables the trace_headers bitstream filter to support VP8,
in addition to AV1, H.264, H.265, and VP9. This can be useful for
debugging VP8 stream issues.
The CBS VP8 implements a simple VP8 boolean decoder using GetBitContext
to read the bitstream.
Only the read methods `read_unit` and `split_fragment` are implemented.
The write methods `write_unit` and `assemble_fragment` return the error
code AVERROR_PATCHWELCOME. This is because CBS VP8 write is unlikely to
be used by any applications at the moment. The write methods can be
added later if there is a real need for them.
TESTS: ffmpeg -i fate-suite/vp8/frame_size_change.webm -vcodec copy
-bsf:v trace_headers -f null -
Signed-off-by: Jianhui Dai <jianhui.j.dai@intel.com>
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
This commit exports the `vp8_token_update_probs` variable to internal
library scope to facilitate its reuse within the library.
Signed-off-by: Jianhui Dai <jianhui.j.dai@intel.com>
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
Better performance can probably be achieved with a more intricate
unrolled loop, but this is a start:
add_hfyu_left_pred_bgr32_c: 15084.0
add_hfyu_left_pred_bgr32_rvv_i32: 10280.2
This would actually be cleaner with the RISC-V P extension, but that is
not ratified yet (I think?) and usually not supported if V is supported.
With explicit unrolling, we can skip half of the sign bit flips, and
the compiler is then better able to optimise the scalar loop:
predictor_c: 31376.0 (before)
predictor_c: 23703.0 (after)
This is restricted to 128-bit vectors as larger vector sizes could read
past the end of the noise array. Support for future hardware with larger
vector sizes is left for some other time.
hf_apply_noise_0_c: 2319.7
hf_apply_noise_0_rvv_f32: 1229.0
hf_apply_noise_1_c: 2539.0
hf_apply_noise_1_rvv_f32: 1244.7
hf_apply_noise_2_c: 2319.7
hf_apply_noise_2_rvv_f32: 1232.7
hf_apply_noise_3_c: 2541.2
hf_apply_noise_3_rvv_f32: 1244.2
This commit fixes issue with missing SPS/PPS headers in video
encoded by AMF AV1 encoder.
Missing headers leads to broken seek in MPV video player.
Default value for property AV1_HEADER_INSERTION_MODE shouldn't be setup
to NONE (no headers insertion). We need to skip definition of this property,
because default value depends on USAGE property.
Signed-off-by: Dmitrii Ovchinnikov <ovchinnikov.dmitrii@gmail.com>
With 5 accumulator vectors and 6 inputs, this can only use LMUL=2.
Also the number of vector loop iterations is small, just 5 on 128-bit
vector hardware.
The vector loop is somewhat unusual in that it processes data in
descending memory order, in order to save on vector slides:
in descending order, we can extract elements to carry over to the next
iteration from the bottom of the vectors directly. With ascending order
(see in the Opus postfilter function), there are no ways to get the top
elements directly. On the downside, this requires the use of separate
shift and sub (the would-be SH3SUB instruction does not exist), with
a small pipeline stall on the vector load address.
The edge cases in scalar are done in scalar as this saves on loads
and remains significantly faster than C.
autocorrelate_c: 669.2
autocorrelate_rvv_f32: 421.0
Given the size of the data set, strided memory accesses cannot be avoided.
We can still do better than the current code.
ps_hybrid_synthesis_deint_c: 12065.5
ps_hybrid_synthesis_deint_rvv_i32: 13650.2 (before)
ps_hybrid_synthesis_deint_rvv_i64: 8181.0 (after)
Segmented loads may be slower than not. So this advantageously uses a
unit-strided load and narrowing shifts instead.
Before:
ps_add_squares_c: 60757.7
ps_add_squares_rvv_f32: 22242.5
After:
ps_add_squares_c: 60516.0
ps_add_squares_rvv_i64: 17067.7
Fixes: index -1 out of bounds for type 'CFrameBuffer [100]'
Fixes: 63877/clusterfuzz-testcase-minimized-ffmpeg_AV_CODEC_ID_FOURXM_fuzzer-5854263397711872
Found-by: continuous fuzzing process https://github.com/google/oss-fuzz/tree/master/projects/ffmpeg
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
PGMYUV seems to be always limited range. This was a format originally
invented by FFmpeg at a time when YUVJ distinguished limited from full
range YUV, and this codec never appeared to output YUVJ in any
circumstance, so hard-coding limited range preserves the status quo.
The other formats are explicitly documented to be full range RGB/gray
formats. That said, don't tag them yet, due to outstanding bugs w.r.t
grayscale formats and color range handling.
This change in behavior updates a bunch of FATE tests in trivial ways
(added tagging being the only difference).
Partially fixes ticket #798
Reviewed-by: James Almer <jamrial@gmail.com>
Reviewed-by: Michael Niedermayer <michael@niedermayer.cc>
Reviewed-by: Paul B Mahol <onemda@gmail.com>
Signed-off-by: Peter Ross <pross@xvid.org>
This uses a more traditional approach allowing up processing of up to
period minus two elements per iteration. This also allows the algorithm
to work for all and any vector length.
As the T-Head C908 device under test can load 16 elements loop, there is
unsurprisingly a little performance drop when the period is minimal and
the parallelism is capped at 13 elements:
Before:
postfilter_15_c: 21222.2
postfilter_15_rvv_f32: 22007.7
postfilter_512_c: 20189.7
postfilter_512_rvv_f32: 22004.2
postfilter_1022_c: 20189.7
postfilter_1022_rvv_f32: 22004.2
After:
postfilter_15_c: 20189.5
postfilter_15_rvv_f32: 7057.2
postfilter_512_c: 20189.5
postfilter_512_rvv_f32: 5667.2
postfilter_1022_c: 20192.7
postfilter_1022_rvv_f32: 5667.2
As in the aligned case, we can use VLSE64.V, though the way of doing so
gets more convoluted, so the performance gains are more modest:
get_pixels_unaligned_c: 126.7
get_pixels_unaligned_rvv_i32: 145.5 (before)
get_pixels_unaligned_rvv_i64: 62.2 (after)
For the reference, those are the aligned benchmarks (unchanged) on the
same T-Head C908 hardware:
get_pixels_c: 126.7
get_pixels_rvi: 85.7
get_pixels_rvv_i64: 33.2
Fixes: out of array write
Fixes: 63520/clusterfuzz-testcase-minimized-ffmpeg_AV_CODEC_ID_FLIC_fuzzer-4876198087622656
Regression since: c7f8d42c12 (was not posted to ffmpeg-devel)
Found-by: continuous fuzzing process https://github.com/google/oss-fuzz/tree/master/projects/ffmpeg
Reviewed-by: Sean McGovern <gseanmcg@gmail.com>
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
With 128-bit vectors, this is mostly pointless but also harmless.
Performance gains should be more noticeable with larger vector sizes.
neg_odd_64_c: 76.2
neg_odd_64_rvv_i64: 74.7
Up until now each thread had its own buffer pool for extradata
buffers when using frame-threading. Each thread can have at most
three references to extradata and in the long run, each thread's
bufferpool seems to fill up with three entries. But given
that at any given time there can be at most 2 + number of threads
entries used (the oldest thread can have two references to preceding
frames that are not currently decoded and each thread has its own
current frame, but there can be no references to any other frames),
this is wasteful. This commit therefore uses a single buffer pool
that is synced across threads.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>