Judging by the coefficients, the last round of add/sub can overflow
to 17 bits with a very small probability just as with the 8-point
transform. This is not observed under FATE, but better safe than sorry.
The last set of additions/subtractions can break the 16-bit limit, and
require 17 bits of precision. Accordingly, this uses widening adds,
which fixes the MSS2 FATE tests.
The problem potentially also affects inv_trans_4 with a very low
probability, but this is not reproducible under FATE.
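For illustration, this is the kind of change involved, as a generic C
sketch (not the actual RVV code):

    #include <stdint.h>

    /* Generic sketch: the last add/sub pair of the transform, performed with
     * widening so that a result needing 17 bits cannot wrap around. */
    static void final_butterfly(const int16_t *a, const int16_t *b,
                                int32_t *sum, int32_t *dif, int n)
    {
        for (int i = 0; i < n; i++) {
            /* e.g. a[i] = b[i] = 30000 sums to 60000, which no longer fits
             * in an int16_t */
            sum[i] = (int32_t)a[i] + b[i];  /* maps to a widening add (vwadd.vv) */
            dif[i] = (int32_t)a[i] - b[i];  /* maps to a widening sub (vwsub.vv) */
        }
    }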
Due to hysterical raisins, most RISC-V Linux distributions target a
RV64GC baseline excluding the Bit-manipulation ISA extensions, most
notably:
- Zba: address generation extension and
- Zbb: basic bit manipulation extension.
Most CPUs that it makes sense to run FFmpeg on support Zba and Zbb
(including the current FATE runner), so it is worth optimising for them.
In fact, a large chunk of the existing assembler optimisations relies on
Zba and/or Zbb.
Since we cannot patch shared library code, the next best thing is to
carry a flag initialised at load time and check it on an as-needed basis.
This results in an overhead of 3 instructions per isolated use, e.g.:
1: AUIPC rd, %pcrel_hi(ff_rv_zbb_supported)
LBU rd, %pcrel_lo(1b)(rd)
BEQZ rd, non_Zbb_fallback_code
// Zbb code here
The C compiler will typically load the flag ahead of time to reduce
latency, and can also keep it around if Zbb is used multiple times in a
single optimisation scope. For this to work, the flag symbol must be
hidden; otherwise the optimisation degrades with a GOT look-up to
support interposition:
1: AUIPC rd, GOT_OFFSET_HI
LD rd, GOT_OFFSET_LO(rd)
LBU rd, (rd)
BEQZ rd, non_Zbb_fallback_code
// Zbb code here
This patch adds code to provision the flag in libraries using bit
manipulation functions from libavutil: byte-swap, bit-weight and
counting leading or trailing zeroes.
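As a rough sketch of what the provisioning can look like in C
(hypothetical code, not the actual libavutil implementation; the
detection helper is a placeholder):

    #include <stdbool.h>

    /* Hidden visibility lets the flag be read with AUIPC+LBU rather than a
     * GOT load. */
    __attribute__((visibility("hidden")))
    bool ff_rv_zbb_supported;

    static bool detect_zbb(void)
    {
        return false; /* placeholder: real code would query the kernel/CPU */
    }

    /* Load-time initialisation of the per-library flag. */
    __attribute__((constructor))
    static void init_zbb_flag(void)
    {
        ff_rv_zbb_supported = detect_zbb();
    }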
Although checkasm does not verify this, the decoder requires that the
transform updates the input block exactly like the C code does.
This fixes vc1-ism, vc1_ilaced_twomv, vc1_sa00040, vc1_sa10091,
vc1_sa10143, vc1_sa20021, vc1test_smm0005 and wmv3-drm-dec tests.
T-Head C908 (cycles):
vc1dsp.vc1_inv_trans_4x4_c: 310.7
vc1dsp.vc1_inv_trans_4x4_rvv_i32: 120.0
We could use 1 `vlseg4e64.v` instead of 4 `vle16.v`, but that seems to
be about 7% slower.
This is almost the same story as vp7_idct_add4y. We just have to use
strided loads of 2 64-bit elements to account for the different data
layout in memory.
T-Head C908:
vp7_idct_dc_add4uv_c: 7.5
vp7_idct_dc_add4uv_rvv_i64: 2.0
vp8_idct_dc_add4uv_c: 6.2
vp8_idct_dc_add4uv_rvv_i32: 2.2 (before)
vp8_idct_dc_add4uv_rvv_i64: 2.0
SpacemiT X60:
vp7_idct_dc_add4uv_c: 6.7
vp7_idct_dc_add4uv_rvv_i64: 2.2
vp8_idct_dc_add4uv_c: 5.7
vp8_idct_dc_add4uv_rvv_i32: 2.5 (before)
vp8_idct_dc_add4uv_rvv_i64: 2.0
DCT-related FFmpeg functions often add an unsigned 8-bit sample to a
signed 16-bit coefficient, then clip the result back to an unsigned
8-bit value. RISC-V has no signed 16-bit to unsigned 8-bit clip, so
instead our most common sequence is:
VWADDU.WV
set SEW to 16 bits
VMAX.VV zero # clip negative values to 0
set SEW to 8 bits
VNCLIPU.WI # clip values over 255 to 255 and narrow
Here we use a different sequence which does not require toggling the
vector type. This assumes that the wide addend vector is biased by
-128:
VWADDU.WV
VNCLIP.WI # clip values to signed 8-bit and narrow
VXOR.VX 0x80 # flip sign bit (convert signed to unsigned)
Also the VMAX is effectively replaced by a VXOR of half width. In this
function, the bias comes for free, since we add a constant to the wide
vector in the prologue anyway.
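The equivalence behind this trick can be checked with a small standalone
C program (illustration only, not FFmpeg code):

    #include <assert.h>
    #include <stdint.h>

    /* Reference: add a signed coefficient to an unsigned sample, clip to
     * [0, 255]. */
    static uint8_t clip_ref(int16_t coef, uint8_t sample)
    {
        int v = coef + sample;
        return v < 0 ? 0 : v > 255 ? 255 : v;
    }

    /* Biased variant mirroring the vector sequence: the wide sum carries a
     * -128 bias (folded into VWADDU.WV's addend), is clipped to signed 8-bit
     * (VNCLIP.WI) and has its sign bit flipped (VXOR.VX 0x80). */
    static uint8_t clip_biased(int16_t coef, uint8_t sample)
    {
        int v = coef + sample - 128;
        v = v < -128 ? -128 : v > 127 ? 127 : v;
        return (uint8_t)v ^ 0x80;
    }

    int main(void)
    {
        for (int coef = -32768; coef <= 32767; coef++)
            for (int sample = 0; sample < 256; sample++)
                assert(clip_ref(coef, sample) == clip_biased(coef, sample));
        return 0;
    }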
On C908, this has no observable effects. On X60, this improves
microbenchmarks by about 20%.
As with idct_dc_add, most of the code is shared with, and replaces, the
previous VP8 function. To improve performance, we break down the 16x4
matrix into 4 rows, rather than 4 squares. Thus strided loads and
stores are avoided, and the 4 DC calculations are vectored.
Unfortunately this requires a vector gather to splat the DC values, but
overall this is still a win for performance:
T-Head C908:
vp7_idct_dc_add4y_c: 7.2
vp7_idct_dc_add4y_rvv_i32: 2.2
vp8_idct_dc_add4y_c: 6.2
vp8_idct_dc_add4y_rvv_i32: 2.2 (before)
vp8_idct_dc_add4y_rvv_i32: 1.7 (after)
SpacemiT X60:
vp7_idct_dc_add4y_c: 6.2
vp7_idct_dc_add4y_rvv_i32: 2.0
vp8_idct_dc_add4y_c: 5.5
vp8_idct_dc_add4y_rvv_i32: 2.5 (before)
vp8_idct_dc_add4y_rvv_i32: 1.7 (after)
I also tried to provision the DC values using indexed loads. It ends up
slower overall, especially for VP7, as we then have to compute 16 DCs
instead of just 4.
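As a scalar model of that row-wise layout (illustration only, not the
lavc code; the dc[] values are assumed to be computed and rounded
already):

    #include <stddef.h>
    #include <stdint.h>

    /* Each row of the 16x4 luma strip is processed whole; pixel i of a row
     * takes the DC value of block i / 4, which is what the vector gather
     * splats. */
    static void dc_add4y_model(uint8_t *dst, ptrdiff_t stride, const int dc[4])
    {
        for (int y = 0; y < 4; y++, dst += stride)
            for (int i = 0; i < 16; i++) {
                int v = dst[i] + dc[i / 4];
                dst[i] = v < 0 ? 0 : v > 255 ? 255 : v;
            }
    }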
This just computes the DC coefficient and hands over to code shared
with VP8. Accordingly, the bulk of the changes consists of rewriting the
VP8 code so that it can be shared.
Nothing to write home about:
vp7_idct_dc_add_c: 1.7
vp7_idct_dc_add_rvv_i32: 1.2
The 8x8 pixel arrays are not necessarily aligned to 64 bits, so the
current code leads to a bus error on real hardware. This is reproducible
with FATE's vc1_ilaced_twomv test case.
The new "pessimist" code can trivially be shared for 16x16 pixel
arrays so we also do that. FWIW, this also nominally reduces the
hardware requirement from Zve64x to Zve32x.
T-Head C908:
vc1dsp.avg_vc1_mspel_pixels_tab[0][0]_c: 14.7
vc1dsp.avg_vc1_mspel_pixels_tab[0][0]_rvv_i32: 3.5
vc1dsp.avg_vc1_mspel_pixels_tab[1][0]_c: 3.7
vc1dsp.avg_vc1_mspel_pixels_tab[1][0]_rvv_i32: 1.5
SpacemiT X60:
vc1dsp.avg_vc1_mspel_pixels_tab[0][0]_c: 13.0
vc1dsp.avg_vc1_mspel_pixels_tab[0][0]_rvv_i32: 3.0
vc1dsp.avg_vc1_mspel_pixels_tab[1][0]_c: 3.2
vc1dsp.avg_vc1_mspel_pixels_tab[1][0]_rvv_i32: 1.2
hf_apply_noise_0_c: 35.7
hf_apply_noise_0_rvv_f32: 9.5
hf_apply_noise_1_c: 38.5
hf_apply_noise_1_rvv_f32: 10.0
hf_apply_noise_2_c: 35.5
hf_apply_noise_2_rvv_f32: 9.7
hf_apply_noise_3_c: 38.5
hf_apply_noise_3_rvv_f32: 10.0
Maybe extending the noise table manually is not such a great idea, but I
am not quite sure how to deal with it otherwise. Allocating the table
dynamically is possible, but would require an ELF destructor to clean up.
This works out a bit more favourably than VP8's due to:
- additional multiplications that can be vectored,
- hardware-supported fixed-point rounding mode.
vp7_luma_dc_wht_c: 3.2
vp7_luma_dc_wht_rvv_i64: 2.0
This saves one instruction and frees up A5, which will be repurposed in
later changes. Unfortunately, we need to add quite a lot of alternative
code for this.
128-bit is the maximum, not the minimum here. Larger vector sizes can
result in reads past the end of the noise value table.
This partially reverts commit cdcb4b98b7.
This loop correctly assumes that VLMAX=16 (4x128-bit vectors
with 32-bit elements) and 32 >= pred_order > 16. We need to alternate
between VL=16 and VL=t2=pred_order-16 elements to add up to pred_order.
The current code requests AVL=a2=pred_order elements. In QEMU and on
the K230 hardware, this sets VL=16 as we need. But the specification
merely guarantees that we get: ceil(AVL / 2) <= VL <= VLMAX. For
instance, if pred_order equals 27, we could end up with VL=14 or VL=15
instead of VL=16. So instead, we literally request VLMAX=16 elements.
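Spelling out the quoted constraint for this loop's parameters (a
throwaway C illustration, not FFmpeg code):

    #include <stdio.h>

    int main(void)
    {
        const unsigned vlmax = 16;

        /* ceil(AVL / 2) <= VL <= VLMAX when VLMAX < AVL < 2 * VLMAX */
        for (unsigned pred_order = 17; pred_order <= 31; pred_order++) {
            unsigned lo = (pred_order + 1) / 2;  /* ceil(AVL / 2) */

            printf("pred_order=%2u: VL may be anywhere in [%2u, %u]\n",
                   pred_order, lo, vlmax);
        }
        return 0;
    }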
Since the horizontal and vertical filters are identical except for a
transposition, this uses a common subprocedure with an ad-hoc ABI.
To preserve return-address stack prediction, a link register has to be
used (c.f. the "Control Transfer Instructions" from the
RISC-V ISA Manual). The alternate/temporary link register T0 is used
here, so that the normal RA is preserved (something Arm cannot do!).
To load the strength value based on `qscale`, the shortest possible
and PIC-compatible sequence is used: AUIPC; ADD; LBU. The classic
LLA; ADD; LBU sequence would add one more instruction since LLA is a
convenience alias for AUIPC; ADDI. To ensure that this trick works,
relocation relaxation is disabled.
To implement the two signed divisions by a power of two toward zero:
(x / (1 << SHIFT))
the code relies on the small range of integers involved, computing:
(x + (x >> (16 - SHIFT))) >> SHIFT
rather than the more general:
(x + ((x >> (16 - 1)) & ((1 << SHIFT) - 1))) >> SHIFT
Thus one ANDI instruction is avoided.
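The identity can be verified with a standalone C program; this assumes
that the inner shift is performed as a logical shift of the 16-bit
value, which is what the small-range argument relies on, and that right
shifts of negative values are arithmetic, as the C code in lavc already
assumes:

    #include <assert.h>
    #include <stdint.h>

    #define SHIFT 3  /* example; the filter uses two different shift amounts */

    /* Reference: signed division by a power of two, rounding toward zero. */
    static int div_ref(int x)
    {
        return x / (1 << SHIFT);
    }

    /* Trick: a logical right shift of the 16-bit value yields the
     * (1 << SHIFT) - 1 bias for in-range negative inputs, and 0 otherwise. */
    static int div_trick(int16_t x)
    {
        int bias = (uint16_t)x >> (16 - SHIFT);

        return (x + bias) >> SHIFT;
    }

    int main(void)
    {
        for (int x = -(1 << (16 - SHIFT)); x < 1 << (16 - SHIFT); x++)
            assert(div_ref(x) == div_trick(x));
        return 0;
    }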
T-Head C908:
h263dsp.h_loop_filter_c: 228.2
h263dsp.h_loop_filter_rvv_i32: 144.0
h263dsp.v_loop_filter_c: 242.7
h263dsp.v_loop_filter_rvv_i32: 114.0
(C is probably worse in real use due to less predictable branches.)
While this function can easily be written with vectors, it just fails to
get any performance improvement.
For reference, this is a simpler loop-free implementation that, depending
on the hardware, does get better performance than the current one, but
still yields more or less the same metrics as the C code:
func ff_sbr_neg_odd_64_rvv, zve64x
        li       a1, 32            # 32 odd-indexed floats to process
        addi     a0, a0, 7         # point at the sign byte of x[1] (little endian)
        li       t0, 8             # stride of 8 bytes: one float out of two
        vsetvli  zero, a1, e8, m2, ta, ma
        li       t1, 0x80          # sign-bit mask
        vlse8.v  v8, (a0), t0      # gather the 32 sign bytes
        vxor.vx  v8, v8, t1        # flip the sign bits
        vsse8.v  v8, (a0), t0      # scatter them back
        ret
endfunc
This reverts commit d06fd18f8f.
Notes:
- The loop is biased toward finding no escape bytes, as that should be the
most common case.
- The input byte array is slid rather than the (8 times smaller) bit-mask,
as RISC-V V does not provide a bit-mask (or bit-wise) slide instruction.
- There are two comparisons with 0 per iteration, for the same reason.
- In case of a match, bytes are copied up to the first match, and the loop is
restarted after the escape byte. Vector compression (vcompress.vm) could
discard all escape bytes, but that is slower when escape bytes are rare.
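For reference, the scalar operation being vectorised is roughly the
following (a simplified sketch, not the exact lavc routine, which applies
additional boundary and look-ahead checks):

    #include <stddef.h>
    #include <stdint.h>

    /* Simplified: drop an escape byte that follows two zero bytes. */
    static size_t unescape_sketch(const uint8_t *src, size_t size, uint8_t *dst)
    {
        size_t dsize = 0;

        for (size_t i = 0; i < size; i++) {
            if (i >= 2 && src[i] == 0x03 && !src[i - 1] && !src[i - 2])
                continue;  /* escape byte: skip it */
            dst[dsize++] = src[i];
        }
        return dsize;
    }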
Further optimisations should be possible, e.g.:
- processing 2 bytes fewer per iteration to get rid of 2 slides,
- taking a shortcut if the input vector contains fewer than 2 zeroes.
But this is a good starting point:
T-Head C908:
vc1dsp.vc1_unescape_buffer_c: 12749.5
vc1dsp.vc1_unescape_buffer_rvv_i32: 6009.0
SpacemiT X60:
vc1dsp.vc1_unescape_buffer_c: 11038.0
vc1dsp.vc1_unescape_buffer_rvv_i32: 2061.0
This is pretty much the same as for lpc16, though it only improves
prediction orders half as large. With 128-bit vectors, this gives:
order       C    V old    V new
    1    69.2    181.5     95.5
    2   107.7    180.7     95.2
    3   145.5    180.0    103.5
    4   183.0    179.2    102.7
    5   220.7    178.5    128.0
    6   257.7    194.0    127.5
    7   294.5    193.7    126.7
    8   331.0    193.0    126.5
Larger prediction orders see no significant changes at that size.