This code was simply incorrect through and through. It did not
protect what actually has to be protected in a multi-threaded setup.
Perhaps it was used to silence threading errors?
Either way, remove it, and document the correct way to use execution
pools in a threaded environment.
Originally, the decoder had a single execution pool, with one
execution context per thread. Execution pools were always intended
to be thread-safe, as long as there were enough execution contexts
in the pool to satisfy all threads.
Due to synchronization issues, the threading part was removed at some
point, and, for decoding, each thread had its own execution pool.
Having a single execution pool per context is hacky, not to mention
wasteful.
Most importantly, we *cannot* associate single shaders across multiple
execution pools for a single application. This means that we cannot
use shaders to either apply film grain, or use this framework for
software-defined decoders.
The recent commits added threading capabilities back to the execution
pool, and the number of contexts in each pool was increased. This was
done with the assumption that the execution pool was singular, which
it was not. This led to increased parallelism and number of frames
in flight, which is taxing on memory.
This commit finally restores proper threading behaviour.
The validation layer has issues that are reported and addressed in the
earlier commit.
Some drivers are stricter about the size of the reference lists given
(e.g. VAOn12 [1]). The next_prev list is used to handle multiple "L0"
references in AV1 encode. Restrict the size of next_prev based on the
value of ref_l0 when the GOP structure is initialized.
[1] https://github.com/intel/cartwheel-ffmpeg/issues/278
v2: fix indentation issues
These functions were divided into two special cases; one assuming that
uvalpha == 0, and the other assuming that uvalpha == 2048. This worked fine
for simple 2x chroma upscaling but broke for e.g. yuv410p, non-centered chroma,
or other special cases that involved non-aligned chroma filters.
Fix it by instead dividing this check into two cases, a uvalpha==0 fast path
and a uvalpha>0 general path. Instead of (A+B)/2 the general path now multiplies
in the true uvalpha weight.
I tried preserving the old fast path for the case of uvalpha == 2048, but this
was significantly slower in practice than having just one general path.
However, we still need a uvalpha == 0 path for the unscaled case.
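The split between the two paths can be sketched as below. This is an
illustration only, not the actual swscale code: the helper names are
hypothetical, and it assumes uvalpha is a 12-bit fixed-point weight in
[0, 4096), so that 2048 reproduces the old (A+B)/2 result.

```c
#include <stdint.h>

/* Hypothetical fast path for uvalpha == 0: only the first chroma line
 * contributes, so no multiplication is needed. */
static int blend_chroma_fast(const int16_t *a, int i)
{
    return a[i];
}

/* Hypothetical general path: weight both chroma lines by the true
 * uvalpha instead of assuming an even split. Assumes a 12-bit weight. */
static int blend_chroma_general(const int16_t *a, const int16_t *b,
                                int i, int uvalpha)
{
    int uvalpha1 = 4096 - uvalpha; /* weight of the first line */
    return (a[i] * uvalpha1 + b[i] * uvalpha) >> 12;
}

/* Dispatch: uvalpha == 0 takes the fast path, anything else the
 * general path. */
static int blend_chroma(const int16_t *a, const int16_t *b,
                        int i, int uvalpha)
{
    return uvalpha ? blend_chroma_general(a, b, i, uvalpha)
                   : blend_chroma_fast(a, i);
}
```

With uvalpha == 2048 the general path gives the same value as the old
averaging code; with e.g. uvalpha == 1024 the second line only
contributes a quarter of the result.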
Fixes: ticket #5083
Signed-off-by: Niklas Haas <git@haasn.dev>
Sponsored-by: Sovereign Tech Fund
As per section 3.11.1 of the IAMF spec, the sample rate used in Codec Config
for Opus shall be 48kHz, regardless of the original sample rate used during
encoding.
Signed-off-by: James Almer <jamrial@gmail.com>
This option, which is also available on other FFmpeg hardware encoders,
allows the user to trade throughput for reduced output latency. This is
useful for ultra low latency applications like game streaming.
Signed-off-by: Cameron Gutman <aicommander@gmail.com>
Since its introduction, this function has claimed to return 0 on success, yet
never actually did so (until the introduction of the new graph based API). It
always returned the number of scaled lines, and continues to do so.
To avoid confusion, but also avoid regressing possible clients that relied on
the existing semantics, simply update the documentation to reflect the actual
behavior. Remain ambiguous about the exact interpretation of the return value
on account of the unfortunate difference in behavior between the legacy and
new scaling APIs.
Signed-off-by: Niklas Haas <git@haasn.dev>
Sponsored-by: Sovereign Tech Fund
This logic was inverted, but || was not replaced by &&.
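As a reminder of why the operator has to change as well: negating a
disjunction yields a conjunction of the negated operands (De Morgan's
law). A minimal sketch with hypothetical predicate names, unrelated to
the identifiers in the code being fixed:

```c
#include <stdbool.h>

/* Hypothetical original condition: true if either flag is set. */
static bool needs_work(bool a, bool b)
{
    return a || b;
}

/* Correct inversion: both operands negated AND || replaced by &&. */
static bool can_skip(bool a, bool b)
{
    return !a && !b;
}

/* Buggy inversion: operands negated, but the operator was kept.
 * This is true whenever at least one flag is clear. */
static bool can_skip_buggy(bool a, bool b)
{
    return !a || !b;
}
```

The buggy form disagrees with the intended one whenever exactly one of
the two flags is set.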
Fixes: ed5dd67562
Fixes: ticket #11353
Signed-off-by: Niklas Haas <git@haasn.dev>
Sponsored-by: Sovereign Tech Fund
It is not actually used in MSVC builds at the moment, since
6e49b86996.
Older versions of MSVC (or, more precisely, older versions of the UCRT)
don't have stdalign.h; it has been available since WinSDK
10.0.20348.0, and a WinSDK that new is installed by default only with
MSVC 2022 17.4 and newer.
With this change, ffmpeg can still be built with MSVC 2019 16.8
(v19.28).
Signed-off-by: Martin Storsjö <martin@martin.st>
Explicitly use ldur for unaligned offsets; newer versions of
armasm64 implicitly convert ldr to ldur as necessary, but older
versions require it explicitly written out.
This fixes these build errors:
ffmpeg\libavcodec\aarch64\vvc\inter.o.asm(2039) :
error A2518: operand 2: Memory offset must be aligned
ldr s5, [x1, #1]
ffmpeg\libavcodec\aarch64\vvc\inter.o.asm(2250) :
error A2518: operand 2: Memory offset must be aligned
ldr d7, [x1, #2]
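For background, the two rejected loads use offsets that are not a
multiple of the access size (#1 with a 4 byte s register, #2 with an
8 byte d register). The rule can be sketched in C as below, assuming
the usual A64 immediate forms: ldr takes an unsigned 12-bit offset
scaled by the access size, ldur a signed 9-bit unscaled offset. The
function names are hypothetical.

```c
#include <stdbool.h>

/* True if offset can be encoded by ldr with an immediate offset:
 * non-negative, a multiple of the access size, and at most 4095
 * units of that size. access_size is in bytes (4 for s, 8 for d). */
static bool fits_ldr_scaled(int offset, int access_size)
{
    return offset >= 0 &&
           offset % access_size == 0 &&
           offset / access_size <= 4095;
}

/* True if offset can be encoded by ldur: any byte offset in the
 * signed 9-bit range, regardless of alignment. */
static bool fits_ldur_unscaled(int offset)
{
    return offset >= -256 && offset <= 255;
}
```

Both failing cases fit ldur but not ldr, which is why older armasm64
needs the instruction spelled out.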
Signed-off-by: Martin Storsjö <martin@martin.st>
Fix test failure on aarch64:
./tests/checkasm/checkasm --test=h264pred 367840
Signed-off-by: Peng Bin <pengbin@visionular.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
Fix test failure on aarch64:
./tests/checkasm/checkasm --test=h264pred 479612
The mismatch between neon and C functions can also be reproduced using
the following bitstream and command line:
wget https://streams.videolan.org/ffmpeg/incoming/intra8x8pred_10bit.264
./ffmpeg -cpuflags 0 -threads 1 -i intra8x8pred_10bit.264 -f framemd5 -y md5_ref
./ffmpeg -threads 1 -i intra8x8pred_10bit.264 -f framemd5 -y md5_neon
Signed-off-by: Bin Peng <pengbin@visionular.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
In input.c and output.c and many other places, swscale follows the rule of using
15-bit intermediate if output bpc is <= 8, and 19-bit (inside int32_t)
intermediate otherwise. See e.g. the comments on hyScale() in
swscale_internal.h. These are also the coefficients that yuv2gbrpf32_full_X_c()
is using.
In contrast to this, the plane init code in slice.c (function fill_ones) is
assuming that we use 35-bit intermediates (inside 64-bit integers) for this
case, seemingly added by commit b4967fc71c with
no further justification.
This causes a mismatch whenever the implicitly initialized plane contents leak
out to the output, e.g. when converting from grayscale to RGB.
Fixes: ticket #10716
Signed-off-by: Niklas Haas <git@haasn.dev>
Sponsored-by: Sovereign Tech Fund
pthread_t is currently defined as a struct, which gets placed into
caller's memory and filled by pthread_create() (which accepts a
pthread_t*).
The problem with this approach is that pthread_join() accepts pthread_t
itself rather than a pointer to it, so it gets a _copy_ of this
structure. This causes non-deterministic failures of pthread_join() to
produce the correct return value - depending on whether the thread
already finished before pthread_join() is called (and thus the copy
contains the correct value), or not (then it contains 0).
Change the definition of pthread_t into a pointer to a struct that is
malloced by pthread_create() and freed by pthread_join().
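The difference between the two definitions can be shown without any
real threads. The sketch below is a single-threaded simulation with
hypothetical names; it is not the actual w32pthreads code, and it
models "the thread finishes while join is waiting" as a plain store.

```c
#include <stdlib.h>

/* Hypothetical stand-in for the per-thread bookkeeping struct. */
typedef struct thread_state {
    void *ret; /* return value, written when the thread finishes */
} thread_state;

/* Old scheme: the handle is the struct itself, so join gets a copy. */
static void *simulate_join_by_value(void *result)
{
    thread_state st   = { NULL };
    thread_state copy = st;  /* passing st by value copies it       */
    st.ret = result;         /* thread finishes during the wait     */
    return copy.ret;         /* stale: the copy never sees result   */
}

/* New scheme: the handle is a pointer to heap-allocated state, so
 * the joiner and the thread look at the same object. */
static void *simulate_join_by_pointer(void *result)
{
    thread_state *st = calloc(1, sizeof(*st)); /* done at creation  */
    thread_state *handle = st;  /* passing the handle copies only
                                 * the pointer                      */
    void *ret;

    if (!st)
        return NULL;
    st->ret = result;           /* thread finishes during the wait  */
    ret = handle->ret;          /* same object: result is visible   */
    free(handle);               /* join releases the state          */
    return ret;
}
```

The by-value variant returns NULL when the thread finishes after the
copy was taken, which matches the non-deterministic behaviour
described above.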
Fixes random failures of fate-ffmpeg-error-rate-fail on Windows after
433cf391f5.
See also [1] for an alternative approach that does not require dynamic
allocation, but relies on an assumption that the pthread_t value
remains in a fixed memory location.
[1] 23829dd2b2
Reviewed-By: Martin Storsjö <martin@martin.st>
When subblock durations are constant, the last block may be smaller and the
value needs to be calculated.
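A minimal sketch of that calculation, assuming the number of subblocks
is the parameter block duration divided by the constant subblock
duration, rounded up. The function names are hypothetical and the
constant subblock duration must be non-zero.

```c
/* Number of subblocks covering `duration` when every subblock except
 * possibly the last is `constant_subblock_duration` long. */
static unsigned iamf_num_subblocks(unsigned duration,
                                   unsigned constant_subblock_duration)
{
    return (duration + constant_subblock_duration - 1) /
           constant_subblock_duration;
}

/* Duration of subblock `index` (0-based, index < number of
 * subblocks). Only the last one may be shorter: it gets whatever
 * remains after the preceding full-length subblocks. */
static unsigned iamf_subblock_duration_at(unsigned duration,
                                          unsigned constant_subblock_duration,
                                          unsigned index)
{
    unsigned n = iamf_num_subblocks(duration, constant_subblock_duration);

    if (index + 1 < n)
        return constant_subblock_duration;
    return duration - (n - 1) * constant_subblock_duration;
}
```

For a duration of 10 and a constant subblock duration of 4 this gives
three subblocks of 4, 4 and 2.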
Signed-off-by: James Almer <jamrial@gmail.com>
Section 3.6.1 of the IAMF spec states "When constant_subblock_duration is
equal to 0, the summation of all subblock_duration in this parameter block
SHALL be equal to duration.".
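The requirement quoted above amounts to a simple sum check. A sketch
with a hypothetical function name and signature, not the actual
demuxer or muxer code:

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns true if the explicitly coded subblock durations add up to
 * the parameter block duration. Only meaningful when
 * constant_subblock_duration == 0. A wider accumulator avoids
 * overflow when summing many unsigned values. */
static bool iamf_subblock_sum_matches(const unsigned *subblock_durations,
                                      size_t nb_subblocks,
                                      unsigned duration)
{
    unsigned long long sum = 0;

    for (size_t i = 0; i < nb_subblocks; i++)
        sum += subblock_durations[i];

    return sum == duration;
}
```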
Signed-off-by: James Almer <jamrial@gmail.com>
The queue needs to track each frame/packet's stream index; this is
achieved by maintaining a parallel AVFifo instance for that purpose.
This is simpler than implementing custom AVContainerFifo callbacks.
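The idea of a parallel queue can be sketched with plain arrays. This
does not use AVFifo or AVContainerFifo; the type and function names
are hypothetical, and a small fixed capacity stands in for the real
growable FIFOs.

```c
#include <stddef.h>

#define QUEUE_CAP 8 /* arbitrary capacity for the sketch */

typedef struct item_queue {
    void  *items[QUEUE_CAP];      /* main queue: frames or packets  */
    int    stream_idx[QUEUE_CAP]; /* parallel queue: stream indices */
    size_t head, count;
} item_queue;

/* Push an item and its stream index into the same slot of both
 * queues. Returns 0 on success, -1 if the queue is full. */
static int queue_push(item_queue *q, void *item, int stream_idx)
{
    size_t tail;

    if (q->count == QUEUE_CAP)
        return -1;
    tail = (q->head + q->count) % QUEUE_CAP;
    q->items[tail]      = item;
    q->stream_idx[tail] = stream_idx;
    q->count++;
    return 0;
}

/* Pop the oldest item together with its stream index. Returns 0 on
 * success, -1 if the queue is empty. */
static int queue_pop(item_queue *q, void **item, int *stream_idx)
{
    if (!q->count)
        return -1;
    *item       = q->items[q->head];
    *stream_idx = q->stream_idx[q->head];
    q->head = (q->head + 1) % QUEUE_CAP;
    q->count--;
    return 0;
}
```

Because both arrays are always written and read at the same position,
an item can never be returned with another item's stream index.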