This is almost the same story as vp7_idct_add4y. We just have to use
strided loads of 2 64-bit elements to account for the different data
layout in memory.
T-Head C908:
vp7_idct_dc_add4uv_c: 7.5
vp7_idct_dc_add4uv_rvv_i64: 2.0
vp8_idct_dc_add4uv_c: 6.2
vp8_idct_dc_add4uv_rvv_i32: 2.2 (before)
vp8_idct_dc_add4uv_rvv_i64: 2.0
SpacemiT X60:
vp7_idct_dc_add4uv_c: 6.7
vp7_idct_dc_add4uv_rvv_i64: 2.2
vp8_idct_dc_add4uv_c: 5.7
vp8_idct_dc_add4uv_rvv_i32: 2.5 (before)
vp8_idct_dc_add4uv_rvv_i64: 2.0
DCT-related FFmpeg functions often add an unsigned 8-bit sample to a
signed 16-bit coefficient, then clip the result back to an unsigned
8-bit value. RISC-V has no signed 16-bit to unsigned 8-bit clip, so
instead our most common sequence is:
VWADDU.WV
set SEW to 16 bits
VMAX.VV zero # clip negative values to 0
set SEW to 8 bits
VNCLIPU.WI # clip values over 255 to 255 and narrow
Here we use a different sequence which does not require toggling the
vector type. This assumes that the wide addend vector is biased by
-128:
VWADDU.WV
VNCLIP.WI # clip values to signed 8-bit and narrow
VXOR.VX 0x80 # flip sign bit (convert signed to unsigned)
Also the VMAX is effectively replaced by a VXOR of half-width. In this
function, this comes for free as we anyway add a constant to the wide
vector in the prologue.
On C908, this has no observable effects. On X60, this improves
microbenchmarks by about 20%.
As with idct_dc_add, most of the code is shared with, and replaces, the
previous VP8 function. To improve performance, we break down the 16x4
matrix into 4 rows, rather than 4 squares. Thus strided loads and
stores are avoided, and the 4 DC calculations are vectored.
Unfortunately this requires a vector gather to splat the DC values, but
overall this is still a win for performance:
T-Head C908:
vp7_idct_dc_add4y_c: 7.2
vp7_idct_dc_add4y_rvv_i32: 2.2
vp8_idct_dc_add4y_c: 6.2
vp8_idct_dc_add4y_rvv_i32: 2.2 (before)
vp8_idct_dc_add4y_rvv_i32: 1.7
SpacemiT X60:
vp7_idct_dc_add4y_c: 6.2
vp7_idct_dc_add4y_rvv_i32: 2.0
vp8_idct_dc_add4y_c: 5.5
vp8_idct_dc_add4y_rvv_i32: 2.5 (before)
vp8_idct_dc_add4y_rvv_i32: 1.7
I also tried to provision the DC values using indexed loads. It ends up
slower overall, especially for VP7, as we then have to compute 16 DC's
instead of just 4.
This just computes the direct coefficient and hands over to code shared
with VP8. Accordingly the bulk of changes are just rewriting the VP8
code to share.
Nothing to write home about:
vp7_idct_dc_add_c: 1.7
vp7_idct_dc_add_rvv_i32: 1.2
Allocations in the following lines depend on the pixel shift, and so
these buffers must be reallocated if the pixel shift changes. Patch
fixes segmentation faults in fuzzed bitstreams.
Signed-off-by: Frank Plowman <post@frankplowman.com>
In all HEVCLocalContext instances except the first one, the bitreader is
never used for actually reading bits, but merely for passing the buffer
to ff_init_cabac_decoder(), which is better done directly.
The instance that actually is used for bitreading gets moved to stack in
decode_nal_unit(), which makes its lifetime clearer.
Do it in hls_slice_header() rather than cabac_init_decoder() - the
former is a more logical place as according the spec the byte alignment
is a part of the slice header, not slice data. Avoids a second instance
of alignment handling in vaapi_hevc.
Also, check that alignment_bit_equal_to_one is, in fact, equal to one.
USAC supports up to 64 audio channels, but puts no limit on the total
number of extensions that may be present. Which may mean that there's
a single audio channel, with 65 thousand extension elements.
We assume that 64 elements is the maximum for now. So check the value.
Some calls to get_escaped_value() specify 0 bits as the third value.
This would result in get_bits(0), which is not a correct usage of the
get_bits API.
Codec IDs have split from `avcodec.h` into `codec_id.h` after commit
c6978418b8.
General documentation contents (which are now in
`general_contents.texi`) have split from the header in `general.texi`
after commit 6accb7718a.
Update the developer documentation to match these changes.
Signed-off-by: Marcus B Spencer <marcus@marcusspencer.xyz>
Fixes "libavcodec/aac/aacdec_usac.c(543): error C2440: 'type cast': cannot convert from 'GetBitContext' to 'GetBitContext'"
from msvc.
Signed-off-by: James Almer <jamrial@gmail.com>
If its not replaced we would have a negative index used in an array potentially
Helps: CID1440385 Negative array index read
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
It seems nothing prevents such overflow even though odd
Fixes: CID1441934 Unintentional integer overflow
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
This might not be needed for correctness but it could
help general reproducability of issues
Related to: CID1560037 Uninitialized scalar variable
Related to: CID1560044 Uninitialized scalar variable
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
Found while reviewing: CID1500309 Unintentional integer overflow
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
This issue cannot happen with the current function parameters
Fixes: CID1500309 Unintentional integer overflow
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>