This change will save approximately 531 MB for an 8K clip when processed with 16 threads.
The calculation is as follows:
7680 * 4320 * sizeof(int) * 2 * 2 * 16 / (4 * 4).
The deblock boundary strength stage utilizes ~5% of CPU resources for 8K clips.
It's worth considering it as a standalone stage. This stage has been relocated
to follow the parser process, allowing us to reuse CUs and TUs before releasing them.
Due to the nature of multithreading, using a "ready check" mechanism may introduce a deadlock. For example:
Suppose all tasks have been submitted to the executor, and the last thread checks the entire list and finds
no ready tasks. It then goes to sleep, waiting for a new task. However, for some multithreading-related reason,
a task becomes ready after the check. Since no other thread is aware of this and no new tasks are being added to
the executor, a deadlock occurs.
In VVC, this function is unnecessary because we use a scoreboard. All tasks submitted to the executor are ready tasks.
We still need several refactors to improve the current VVC decoder's performance,
which will frequently break the API/ABI. To mitigate this, we've copied the executor from
avutil to avcodec. Once the API/ABI is stable, we will move this class back to avutil
ff_vvc_output_frame is called before actually decoding. It's possible
for ff_vvc_output_frame to select current frame to output. If current
frame is nonref frame, it will be released by ff_vvc_unref_frame.
Fix this by always marking the current frame with
VVC_FRAME_FLAG_SHORT_REF, as is done by the HEVC decoder.
This reverts commit 110d8549d5.
I have been working through fixing bugs, particularly crashes I've
found using a fuzzer, in the VVC decoder for the past few months.
While I won't claim it is now bug-free, it is considerably more
resilient than it was and I think in a position to have the
experimental flag removed for release 7.1.
Additionally, most of the Main 10 features of VVC which were missing
version of the decoder released in 7.0 have now been implemented.
This includes the most major missing features: IBC, subpictures and RPR.
Signed-off-by: Frank Plowman <post@frankplowman.com>
Fixes:
https://fate.ffmpeg.org/report.cgi?slot=x86_64-archlinux-gcc-tsan&time=20240823175808
Reproduction steps:
./configure --enable-memory-poisoning --toolchain=gcc-tsan --disable-stripping && make fate-vvc
Root cause:
We hold the current frame's lock while updating progress for other frames,
which also requires acquiring other frame locks. This could potentially lead to a deadlock.
However, I don't think this will happen in practice because progress updates are one-way, with no cyclic dependencies.
But we need this patch to make FATE happy.
The previous logic relied on the subpicture boundaries coinciding with
the tile boundaries. Per 6.3.1 of H.266 (V3), vertical subpicture
boundaries are always tile boundaries however the same cannot be said
for horizontal subpicture boundaries. Furthermore, it is possible to
construct an illegal bitstream where vertical subpicture boundaries are
not coincident with tile boundaries. In these cases, the condition of
the while loop would never be satisfied resulting in an OOB read on
col_bd/row_bd.
Patch fixes this issue by replacing != with <, thereby not requiring
subpicture boundaries and tile boundaries to be coincident.
Signed-off-by: Frank Plowman <post@frankplowman.com>
fix
==135000== Conditional jump or move depends on uninitialised value(s)
==135000== at 0x169FF95: vvc_deblock_bs (filter.c:699)
and
==135000== Conditional jump or move depends on uninitialised value(s)
==135000== at 0x16A2E72: ff_vvc_alf_filter (filter.c:1217)
Reported-by: James Almer <jamrial@gmail.com>
From Jun Zhao <mypopydev@gmail.com>:
> Should we relocate this to the decoder? Other codecs typically set this
> parameter in the decoder.
Signed-off-by: Wu Jianhua <toqsxw@outlook.com>
memset tables in the main thread can become a bottleneck for the decoder.
For example, if it takes 1% of the processing time for one core, the maximum achievable FPS will be 100.
Move the memeset to worker threads will fix the issue.
For luma, qp can only change at the CU level, so the qp tab size is related to the CU.
For chroma, considering the joint CbCr, the QP tab size is related to the TU.
The parser stage is not parallelizable.
We need to schedule it as soon as possible to create later stages, which are more parallelizable
clips | before | after | delta
--------------------------------------------|--------|-------|------
RitualDance_1920x1080_60_10_420_37_RA.266 | 342.7 | 365.3 | 6.59%
NovosobornayaSquare_1920x1080.bin | 321.7 | 400 | 24.34%
Tango2_3840x2160_60_10_420_27_LD.266 | 82.3 | 91.7 | 11.42%
RitualDance_1920x1080_60_10_420_32_LD.266 | 323.7 | 319.3 | -1.36%
Chimera_8bit_1080P_1000_frames.vvc | 364 | 411.3 | 12.99%
BQTerrace_1920x1080_60_10_420_22_RA.vvc | 162.7 | 185.7 | 14.14%
This simplification assumes that the code is correct
Fixes: CID1560036 Logically dead code
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
Not a bugfix, but might fix CID1604361 Overflowed constant
Sponsored-by: Sovereign Tech Fund
Reviewed-by: Nuo Mi <nuomi2021@gmail.com>
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
Fixes: signed integer overflow: 1107820800 + 1107820800 cannot be represented in type 'int'
Fixes: left shift of 1091059712 by 6 places cannot be represented in type 'int'
Fixes: 69910/clusterfuzz-testcase-minimized-ffmpeg_AV_CODEC_ID_VVC_fuzzer-5162839971528704
Found-by: continuous fuzzing process https://github.com/google/oss-fuzz/tree/master/projects/ffmpeg
Reviewed-by: Nuo Mi <nuomi2021@gmail.com>
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
ff_vvc_frame_rpl uses the flags to detect whether a frame is in use.
Therefore, in the case of a CVSS AU (RASL/GDR with
NoOutputBeforeRecoveryFlag) with ph_non_ref_pic_flag = 1, the frame
would be freed before it is used. Fix this by always marking the
current frame with VVC_FRAME_FLAG_SHORT_REF, as is done by the HEVC
decoder.
Signed-off-by: Frank Plowman <post@frankplowman.com>
From H.266 (V3) (09/2023) p. 321:
It is a requirement of bitstream conformance that the luma block
vector bvL shall obey the following constraints:
- CtbSizeY is greater than or equal to
((yCb + (bvL[ 1 ] >> 4)) & (CtbSizeY − 1)) + cbHeight
This patch checks this is true, which fixes crashes on fuzzed
bitstreams.
Signed-off-by: Frank Plowman <post@frankplowman.com>