Returning a void is not allowed by the spec. Just return instead.
Reviewed-by: Nuo Mi <nuomi2021@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
It is unnecessary to first pad to 32bits; the memset later
will pad everything will with zeroes anyway.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
If any of these files (say A) would be changed in such a way
that A acquires a new dependency on another file B, building B
would need to be added to all the rules that lead to A being built.
Yet currently the rules for several files are spread over
the lavc Makefile and the Makefile of the lavc/hevc subdir, making
it more likely to be forgotten. So move the rules for these files
to the lavc/Makefile.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
A 360 video specific tool
see https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9503377
passed files:
DMVR_A_Huawei_3.bit
WRAP_D_InterDigital_4.bit
WRAP_A_InterDigital_4.bit
WRAP_B_InterDigital_4.bit
WRAP_C_InterDigital_4.bit
ERP_A_MediaTek_3.bit
Require that there is a valid layout with a valid number of channels
before accepting nb_elems.
The value is required when flushing.
Thanks to kasper93 for figuring it out.
The issue is that AOT 45 isn't defined anywhere, and looking at the git
blame, it seems to have sprung up through a reordering of the enum,
and adding a hole.
The spec does not define an explicit AOT for SBR and no SBR, and only
uses AOT 42 (previously AOT_USAC_NOSBR), so just rename AOT_USAC to
it and replace its use everywhere.
Yes the same dead code is in "iLBC Speech Coder ANSI-C Source Code"
Fixes: CID1509370 Logically dead code
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
T-Head C908 (cycles):
vc1dsp.vc1_inv_trans_4x4_c: 310.7
vc1dsp.vc1_inv_trans_4x4_rvv_i32: 120.0
We could use 1 `vlseg4e64.v` instead of 4 `vle16.v`, but that seems to
be about 7% slower.
There is no bsf for other codecs to modify crop info except H.265.
For H.265, the assumption that FFALIGN(width, 16)xFFALIGN(height, 16)
is the video resolution can be wrong, since the encoder can use CTU
larger than 16x16. In that case, use FFALIGN(width, 16) - width
as crop_right is incorrect. So disable the workaround for H.265 now.
Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>
It's a common usecase to request a video size after crop. Before
this patch, user must know the video size before crop, then set
crop_right/crop_bottom accordingly. Since HEVC can have different
CTU size, it's not easy to get/deduce the video size before crop.
With the new width/height options, there is no such requirement.
Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>
This is almost the same story as vp7_idct_add4y. We just have to use
strided loads of 2 64-bit elements to account for the different data
layout in memory.
T-Head C908:
vp7_idct_dc_add4uv_c: 7.5
vp7_idct_dc_add4uv_rvv_i64: 2.0
vp8_idct_dc_add4uv_c: 6.2
vp8_idct_dc_add4uv_rvv_i32: 2.2 (before)
vp8_idct_dc_add4uv_rvv_i64: 2.0
SpacemiT X60:
vp7_idct_dc_add4uv_c: 6.7
vp7_idct_dc_add4uv_rvv_i64: 2.2
vp8_idct_dc_add4uv_c: 5.7
vp8_idct_dc_add4uv_rvv_i32: 2.5 (before)
vp8_idct_dc_add4uv_rvv_i64: 2.0
DCT-related FFmpeg functions often add an unsigned 8-bit sample to a
signed 16-bit coefficient, then clip the result back to an unsigned
8-bit value. RISC-V has no signed 16-bit to unsigned 8-bit clip, so
instead our most common sequence is:
VWADDU.WV
set SEW to 16 bits
VMAX.VV zero # clip negative values to 0
set SEW to 8 bits
VNCLIPU.WI # clip values over 255 to 255 and narrow
Here we use a different sequence which does not require toggling the
vector type. This assumes that the wide addend vector is biased by
-128:
VWADDU.WV
VNCLIP.WI # clip values to signed 8-bit and narrow
VXOR.VX 0x80 # flip sign bit (convert signed to unsigned)
Also the VMAX is effectively replaced by a VXOR of half-width. In this
function, this comes for free as we anyway add a constant to the wide
vector in the prologue.
On C908, this has no observable effects. On X60, this improves
microbenchmarks by about 20%.
As with idct_dc_add, most of the code is shared with, and replaces, the
previous VP8 function. To improve performance, we break down the 16x4
matrix into 4 rows, rather than 4 squares. Thus strided loads and
stores are avoided, and the 4 DC calculations are vectored.
Unfortunately this requires a vector gather to splat the DC values, but
overall this is still a win for performance:
T-Head C908:
vp7_idct_dc_add4y_c: 7.2
vp7_idct_dc_add4y_rvv_i32: 2.2
vp8_idct_dc_add4y_c: 6.2
vp8_idct_dc_add4y_rvv_i32: 2.2 (before)
vp8_idct_dc_add4y_rvv_i32: 1.7
SpacemiT X60:
vp7_idct_dc_add4y_c: 6.2
vp7_idct_dc_add4y_rvv_i32: 2.0
vp8_idct_dc_add4y_c: 5.5
vp8_idct_dc_add4y_rvv_i32: 2.5 (before)
vp8_idct_dc_add4y_rvv_i32: 1.7
I also tried to provision the DC values using indexed loads. It ends up
slower overall, especially for VP7, as we then have to compute 16 DC's
instead of just 4.
This just computes the direct coefficient and hands over to code shared
with VP8. Accordingly the bulk of changes are just rewriting the VP8
code to share.
Nothing to write home about:
vp7_idct_dc_add_c: 1.7
vp7_idct_dc_add_rvv_i32: 1.2