This loop correctly assumes that VLMAX=16 (4x128-bit vectors
with 32-bit elements) and 32 >= pred_order > 16. We need to alternate
between VL=16 and VL=t2=pred_order-16 elements to add up to pred_order.
The current code requests AVL=a2=pred_order elements. In QEMU and on
thte K230 hardware, this sets VL=16 as we need. But the specification
merely guarantees that we get: ceil(AVL / 2) <= VL <= VLMAX. For
instance, if pred_order equals 27, we could end up with VL=14 or VL=15
instead of VL=16. So instead, request literally VLMAX=16.
X86ASM libavcodec/x86/vvc/vvc_sad.o
libavcodec/x86/vvc/vvc_sad.asm:85: error: invalid number of operands
libavcodec/x86/vvc/vvc_sad.asm:87: error: invalid number of operands
Signed-off-by: Haihao Xiang <haihao.xiang@intel.com>
Signed-off-by: James Almer <jamrial@gmail.com>
Implements AVX2 DMVR (decoder-side motion vector refinement) SAD functions. DMVR SAD is only calculated if w >= 8, h >= 8, and w * h > 128. To reduce complexity, SAD is only calculated on even rows. This is calculated for all video bitdepths, but the values passed to the function are always 16bit (even if the original video bitdepth is 8). The AVX2 implementation uses min/max/sub.
Additionally this changes parameters dx and dy from int to intptr_t. This allows dx & dy to be used as pointer offsets without needing to use movsxd.
Benchmarks ( AMD 7940HS )
Before:
BQTerrace_1920x1080_60_10_420_22_RA.vvc | 106.0 |
Chimera_8bit_1080P_1000_frames.vvc | 204.3 |
NovosobornayaSquare_1920x1080.bin | 197.3 |
RitualDance_1920x1080_60_10_420_37_RA.266 | 174.0 |
After:
BQTerrace_1920x1080_60_10_420_22_RA.vvc | 109.3 |
Chimera_8bit_1080P_1000_frames.vvc | 216.0 |
NovosobornayaSquare_1920x1080.bin | 204.0|
RitualDance_1920x1080_60_10_420_37_RA.266 | 181.7 |
Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: James Almer <jamrial@gmail.com>
Since the horizontal and vertical filters are identical except for a
transposition, this uses a common subprocedure with an ad-hoc ABI.
To preserve return-address stack prediction, a link register has to be
used (c.f. the "Control Transfer Instructions" from the
RISC-V ISA Manual). The alternate/temporary link register T0 is used
here, so that the normal RA is preserved (something Arm cannot do!).
To load the strength value based on `qscale`, the shortest possible
and PIC-compatible sequence is used: AUIPC; ADD; LBU. The classic
LLA; ADD; LBU sequence would add one more instruction since LLA is a
convenience alias for AUIPC; ADDI. To ensure that this trick works,
relocation relaxation is disabled.
To implement the two signed divisions by a power of two toward zero:
(x / (1 << SHIFT))
the code relies on the small range of integers involved, computing:
(x + (x >> (16 - SHIFT))) >> SHIFT
rather than the more general:
(x + ((x >> (16 - 1)) & ((1 << SHIFT) - 1))) >> SHIFT
Thus one ANDI instruction is avoided.
T-Head C908:
h263dsp.h_loop_filter_c: 228.2
h263dsp.h_loop_filter_rvv_i32: 144.0
h263dsp.v_loop_filter_c: 242.7
h263dsp.v_loop_filter_rvv_i32: 114.0
(C is probably worse in real use due to less predictible branches.)
While this function can easily be written with vectors, it just fails to
get any performance improvement.
For reference, this is a simpler loop-free implementation that does get
better performance than the current one depending on hardware, but still
more or less the same metrics as the C code:
func ff_sbr_neg_odd_64_rvv, zve64x
li a1, 32
addi a0, a0, 7
li t0, 8
vsetvli zero, a1, e8, m2, ta, ma
li t1, 0x80
vlse8.v v8, (a0), t0
vxor.vx v8, v8, t1
vsse8.v v8, (a0), t0
ret
endfunc
This reverts commit d06fd18f8f.
Notes:
- The loop is biased toward no unescaped bytes as that should be most common.
- The input byte array is slid rather than the (8 times smaller) bit-mask,
as RISC-V V does not provide a bit-mask (or bit-wise) slide instruction.
- There are two comparisons with 0 per iteration, for the same reason.
- In case of match, bytes are copied until the first match, and the loop is
restarted after the escape byte. Vector compression (vcompress.vm) could
discard all escape bytes but that is slower if escape bytes are rare.
Further optimisations should be possible, e.g.:
- processing 2 bytes fewer per iteration to get rid of a 2 slides,
- taking a short cut if the input vector contains less than 2 zeroes.
But this is a good starting point:
T-Head C908:
vc1dsp.vc1_unescape_buffer_c: 12749.5
vc1dsp.vc1_unescape_buffer_rvv_i32: 6009.0
SpacemiT X60:
vc1dsp.vc1_unescape_buffer_c: 11038.0
vc1dsp.vc1_unescape_buffer_rvv_i32: 2061.0
For RPR, the current frame may reference a frame with a different resolution.
Therefore, we need to consider frame scaling when we wait for reference pixels.
Most users of ff_adts_header_parse() don't already have
an opened GetBitContext for the header, so add a convenience
function for them.
Also use a forward declaration of GetBitContext in adts_header.h
as this avoids (implicit) inclusion of get_bits.h in some of
the users that now no longer use a GetBitContext of their own.
Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Also remove the (unused) AAC_AC3_PARSE_ERROR_CHANNEL_CFG while at it;
furthermore, fix the documentation of ff_ac3_parse_header()
and (ff|avpriv)_adts_header_parse().
Reviewed-by: Andrew Sayers <ffmpeg-devel@pileofstuff.org>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
The documentation of av_adts_header_parse() does not require
the buffer to be padded at all.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
When the external allocator is used for dynamic frame allocation, only
video memory is supported, the SDK doesn't lock/unlock the memory block
via Lock()/Unlock() calls.
Signed-off-by: Haihao Xiang <haihao.xiang@intel.com>
This could cause timeouts
Fixes: CID1439568 Untrusted loop bound
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
I dont think this can actually overflow but 64bit seems reasonable to use
Fixes: CID1521983 Unintentional integer overflow
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
Fixes: CID1516090 Unchecked return value
Sponsored-by: Sovereign Tech Fund
Reviewed-by: Peter Ross <pross@xvid.org>
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
Fixes: CID1560042 Unchecked return value
Sponsored-by: Sovereign Tech Fund
Reviewed-by: Nuo Mi <nuomi2021@gmail.com>
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
Fixes: CID1461482 Improper use of negative value
Sponsored-by: Sovereign Tech Fund
Reviewed-.by: "Xiang, Haihao" <haihao.xiang@intel.com>
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
Fixes: CID1452425 Logically dead code
Sponsored-by: Sovereign Tech Fund
Reviewed-by: Peter Ross <pross@xvid.org>
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
Fixes: CID1507483 Unchecked return value
Sponsored-by: Sovereign Tech Fund
Reviewed-by: "Ronald S. Bultje" <rsbultje@gmail.com>
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>