Partially fixes ticket #798
Reviewed-by: James Almer <jamrial@gmail.com>
Reviewed-by: Michael Niedermayer <michael@niedermayer.cc>
Reviewed-by: Paul B Mahol <onemda@gmail.com>
Signed-off-by: Peter Ross <pross@xvid.org>
This uses a more traditional approach allowing up processing of up to
period minus two elements per iteration. This also allows the algorithm
to work for all and any vector length.
As the T-Head C908 device under test can load 16 elements loop, there is
unsurprisingly a little performance drop when the period is minimal and
the parallelism is capped at 13 elements:
Before:
postfilter_15_c: 21222.2
postfilter_15_rvv_f32: 22007.7
postfilter_512_c: 20189.7
postfilter_512_rvv_f32: 22004.2
postfilter_1022_c: 20189.7
postfilter_1022_rvv_f32: 22004.2
After:
postfilter_15_c: 20189.5
postfilter_15_rvv_f32: 7057.2
postfilter_512_c: 20189.5
postfilter_512_rvv_f32: 5667.2
postfilter_1022_c: 20192.7
postfilter_1022_rvv_f32: 5667.2
As in the aligned case, we can use VLSE64.V, though the way of doing so
gets more convoluted, so the performance gains are more modest:
get_pixels_unaligned_c: 126.7
get_pixels_unaligned_rvv_i32: 145.5 (before)
get_pixels_unaligned_rvv_i64: 62.2 (after)
For the reference, those are the aligned benchmarks (unchanged) on the
same T-Head C908 hardware:
get_pixels_c: 126.7
get_pixels_rvi: 85.7
get_pixels_rvv_i64: 33.2
Without this flag, timestamps were embedded into the final
binary if CUDA was enabled.
Signed-off-by: Rob Hall <robxnanocode@outlook.com>
Signed-off-by: Anton Khirnov <anton@khirnov.net>
We currently mostly do not empty the MMX state in our MMX
DSP functions; instead we only do so before code that might
be using x87 code. This is a violation of the System V i386 ABI
(and maybe of other ABIs, too):
"The CPU shall be in x87 mode upon entry to a function. Therefore,
every function that uses the MMX registers is required to issue an
emms or femms instruction after using MMX registers, before returning
or calling another function." (See 2.2.1 in [1])
This patch does not intend to change all these functions to abide
by the ABI; it only does so for ff_pixelutils_sad_8x8_mmxext, as this
function can by called by external users, because it is exported
via the pixelutils API. Without this, the following fragment will
assert (on x86/x64):
uint8_t src1[8 * 8], src2[8 * 8];
av_pixelutils_sad_fn fn = av_pixelutils_get_sad_fn(3, 3, 0, NULL);
fn(src1, 8, src2, 8);
av_assert0_fpu();
[1]: https://raw.githubusercontent.com/wiki/hjl-tools/x86-psABI/intel386-psABI-1.1.pdf
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Fixes: out of array write
Fixes: 63520/clusterfuzz-testcase-minimized-ffmpeg_AV_CODEC_ID_FLIC_fuzzer-4876198087622656
Regression since: c7f8d42c12 (was not posted to ffmpeg-devel)
Found-by: continuous fuzzing process https://github.com/google/oss-fuzz/tree/master/projects/ffmpeg
Reviewed-by: Sean McGovern <gseanmcg@gmail.com>
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
Reviewed-by: Sean McGovern <gseanmcg@gmail.com>
Reviewed-by: Nicolas George <george@nsup.org>
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
Reviewed-by: Sean McGovern <gseanmcg@gmail.com>
Reviewed-by: Nicolas George <george@nsup.org>
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
This is cleaner, but it is also a workaround for when
the header exists, but cannot be compiled.
This will happen when the compiler has no inline asm
support.
Possibly the configure check should be improved as well.
The test can currently pass when _Pragma is not supported, since
_Pragma might be treated as a implicitly declared function.
This happens e.g. with tinycc.
Extending the check to 2 pragmas both matches the actual use
better and avoids this misdetection.
Fixes ticket #10638 (and should also fix ticket #10482)
by restoring the behaviour from before
3c7dd5ed37.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
The `-caption_encoding` option was reported as having a default value of
'ass', whereas it's actually 'auto'.
Signed-off-by: zheng qian <xqq@xqq.im>
Signed-off-by: Gyan Doshi <ffmpeg@gyani.pro>
With 128-bit vectors, this is mostly pointless but also harmless.
Performance gains should be more noticeable with larger vector sizes.
neg_odd_64_c: 76.2
neg_odd_64_rvv_i64: 74.7
Up until now each thread had its own buffer pool for extradata
buffers when using frame-threading. Each thread can have at most
three references to extradata and in the long run, each thread's
bufferpool seems to fill up with three entries. But given
that at any given time there can be at most 2 + number of threads
entries used (the oldest thread can have two references to preceding
frames that are not currently decoded and each thread has its own
current frame, but there can be no references to any other frames),
this is wasteful. This commit therefore uses a single buffer pool
that is synced across threads.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Up until now, the VAAPI encoder uses fake data with the
AVBuffer-API: The data pointer does not point to real memory,
but is instead just a VABufferID converted to a pointer.
This has probably been copied from the VAAPI-hwcontext-API
(which presumably does it to avoid allocations).
This commit changes this without causing additional allocations
by switching to the RefStruct-pool API. This also fixes an
unchecked av_buffer_ref().
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
This is in preparation for the following commit.
Reviewed-by: Anton Khirnov <anton@khirnov.net>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
It involves less allocations, in particular no allocations
after the entry has been created. Therefore creating a new
reference from an existing one can't fail and therefore
need not be checked. It also avoids indirections and casts.
Also note that nvdec_decoder_frame_init() (the callback
to initialize new entries from the pool) does not use
atomics to read and replace the number of entries
currently used by the pool. This relies on nvdec (like
most other hwaccels) not being run in a truely frame-threaded
way.
Tested-by: Timo Rothenpieler <timo@rothenpieler.org>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
It involves less allocations and therefore has the nice property
that deriving a reference from a reference can't fail,
simplifying hevc_ref_frame().
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>