FFmpeg

Commit Graph

Author	SHA1	Message	Date
Rémi Denis-Courmont	3152c684cb	lavc/vc1dsp: R-V V vc1_inv_trans_4x4 T-Head C908 (cycles): vc1dsp.vc1_inv_trans_4x4_c: 310.7 vc1dsp.vc1_inv_trans_4x4_rvv_i32: 120.0 We could use 1 `vlseg4e64.v` instead of 4 `vle16.v`, but that seems to be about 7% slower.	7 months ago
Rémi Denis-Courmont	6ffa639c8a	lavc/vc1dsp: R-V V vc1_inv_trans_4x8 T-Head C908 (cycles): vc1dsp.vc1_inv_trans_4x8_c: 653.2 vc1dsp.vc1_inv_trans_4x8_rvv_i32: 234.0	7 months ago
Rémi Denis-Courmont	a169f3bca5	lavc/vc1dsp: R-V V vc1_inv_trans_8x4 T-Head C908 (cycles): vc1dsp.vc1_inv_trans_8x4_c: 626.2 vc1dsp.vc1_inv_trans_8x4_rvv_i32: 215.2	7 months ago
Rémi Denis-Courmont	04397a29de	lavc/vc1dsp: R-V V vc1_inv_trans_8x8 T-Head C908 (cycles): vc1dsp.vc1_inv_trans_8x8_c: 871.7 vc1dsp.vc1_inv_trans_8x8_rvv_i32: 286.7	7 months ago
Rémi Denis-Courmont	c3dbbb316e	lavc/flacdsp: fix sign extension in R-V V wasted33 We need to use either VWCVT.X.X.V or VSEXT.VF2. The later is preferable to avoid changing VTYPE.	7 months ago
Rémi Denis-Courmont	0415bb74c8	lavc/vp8dsp: remove no longer used macros	7 months ago
Rémi Denis-Courmont	121fb846b9	lavc/vp7dsp: add R-V V vp7_idct_dc_add4uv This is almost the same story as vp7_idct_add4y. We just have to use strided loads of 2 64-bit elements to account for the different data layout in memory. T-Head C908: vp7_idct_dc_add4uv_c: 7.5 vp7_idct_dc_add4uv_rvv_i64: 2.0 vp8_idct_dc_add4uv_c: 6.2 vp8_idct_dc_add4uv_rvv_i32: 2.2 (before) vp8_idct_dc_add4uv_rvv_i64: 2.0 SpacemiT X60: vp7_idct_dc_add4uv_c: 6.7 vp7_idct_dc_add4uv_rvv_i64: 2.2 vp8_idct_dc_add4uv_c: 5.7 vp8_idct_dc_add4uv_rvv_i32: 2.5 (before) vp8_idct_dc_add4uv_rvv_i64: 2.0	7 months ago
Rémi Denis-Courmont	225de53c9d	lavc/vp8dsp: rework R-V V idct_dc_add4y DCT-related FFmpeg functions often add an unsigned 8-bit sample to a signed 16-bit coefficient, then clip the result back to an unsigned 8-bit value. RISC-V has no signed 16-bit to unsigned 8-bit clip, so instead our most common sequence is: VWADDU.WV set SEW to 16 bits VMAX.VV zero # clip negative values to 0 set SEW to 8 bits VNCLIPU.WI # clip values over 255 to 255 and narrow Here we use a different sequence which does not require toggling the vector type. This assumes that the wide addend vector is biased by -128: VWADDU.WV VNCLIP.WI # clip values to signed 8-bit and narrow VXOR.VX 0x80 # flip sign bit (convert signed to unsigned) Also the VMAX is effectively replaced by a VXOR of half-width. In this function, this comes for free as we anyway add a constant to the wide vector in the prologue. On C908, this has no observable effects. On X60, this improves microbenchmarks by about 20%.	7 months ago
Rémi Denis-Courmont	4e120fbbbd	lavc/vp8dsp: add R-V V vp7_idct_dc_add4y As with idct_dc_add, most of the code is shared with, and replaces, the previous VP8 function. To improve performance, we break down the 16x4 matrix into 4 rows, rather than 4 squares. Thus strided loads and stores are avoided, and the 4 DC calculations are vectored. Unfortunately this requires a vector gather to splat the DC values, but overall this is still a win for performance: T-Head C908: vp7_idct_dc_add4y_c: 7.2 vp7_idct_dc_add4y_rvv_i32: 2.2 vp8_idct_dc_add4y_c: 6.2 vp8_idct_dc_add4y_rvv_i32: 2.2 (before) vp8_idct_dc_add4y_rvv_i32: 1.7 SpacemiT X60: vp7_idct_dc_add4y_c: 6.2 vp7_idct_dc_add4y_rvv_i32: 2.0 vp8_idct_dc_add4y_c: 5.5 vp8_idct_dc_add4y_rvv_i32: 2.5 (before) vp8_idct_dc_add4y_rvv_i32: 1.7 I also tried to provision the DC values using indexed loads. It ends up slower overall, especially for VP7, as we then have to compute 16 DC's instead of just 4.	7 months ago
Rémi Denis-Courmont	30797e4ff6	lavc/vp8dsp: add R-V V vp7_idct_dc_add This just computes the direct coefficient and hands over to code shared with VP8. Accordingly the bulk of changes are just rewriting the VP8 code to share. Nothing to write home about: vp7_idct_dc_add_c: 1.7 vp7_idct_dc_add_rvv_i32: 1.2	7 months ago
Rémi Denis-Courmont	fd4977c876	lavc/aacencdsp: R-V V quant_bands T-Head C908: quant_bands_signed_c: 576.0 quant_bands_signed_rvv_f32: 48.7 quant_bands_unsigned_c: 414.2 quant_bands_unsigned_rvv_f32: 31.7 SpacemiT X60: quant_bands_signed_c: 497.7 quant_bands_signed_rvv_f32: 23.0 quant_bands_unsigned_c: 353.5 quant_bands_unsigned_rvv_f32: 16.2	7 months ago
Rémi Denis-Courmont	6c6bec04f3	lavc/vc1dsp: fix R-V V avg_mspel_pixels The 8x8 pixel arrays are not necessarily aligned to 64 bits, so the current code leads to Bus error on real hardware. This reproducible with FATE's vc1_ilaced_twomv test case. The new "pessimist" code can trivially be shared for 16x16 pixel arrays so we also do that. FWIW, this also nominally reduces the hardware requirement from Zve64x to Zve32x. T-Head C908: vc1dsp.avg_vc1_mspel_pixels_tab[0][0]_c: 14.7 vc1dsp.avg_vc1_mspel_pixels_tab[0][0]_rvv_i32: 3.5 vc1dsp.avg_vc1_mspel_pixels_tab[1][0]_c: 3.7 vc1dsp.avg_vc1_mspel_pixels_tab[1][0]_rvv_i32: 1.5 SpacemiT X60: vc1dsp.avg_vc1_mspel_pixels_tab[0][0]_c: 13.0 vc1dsp.avg_vc1_mspel_pixels_tab[0][0]_rvv_i32: 3.0 vc1dsp.avg_vc1_mspel_pixels_tab[1][0]_c: 3.2 vc1dsp.avg_vc1_mspel_pixels_tab[1][0]_rvv_i32: 1.2	7 months ago
Rémi Denis-Courmont	06fc919aad	lavc/sbrdsp: add support for 256-bit vectors hf_apply_noise_0_c: 35.7 hf_apply_noise_0_rvv_f32: 9.5 hf_apply_noise_1_c: 38.5 hf_apply_noise_1_rvv_f32: 10.0 hf_apply_noise_2_c: 35.5 hf_apply_noise_2_rvv_f32: 9.7 hf_apply_noise_3_c: 38.5 hf_apply_noise_3_rvv_f32: 10.0 Maybe extending the noise table manually is not such great idea, but I not quite sure how to deal with that otherwise? Allocating the table dynamically is possible but would require an ELF destructor to clean up.	7 months ago
sunyuechi	544acfa2c0	lavc/vp9dsp: R-V V rename ff_avg to ff_vp9_avg Avoid potential naming conflicts Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>	7 months ago
Rémi Denis-Courmont	4fe8f2cc43	riscv: allow passing addend to vtype_vli macro A constant (-1) is added to the length value, so we can have an added for free, and optimise the addition away if the addend is exactly 1.	7 months ago
Rémi Denis-Courmont	fa3b153cb1	lavc/vp7dsp: R-V V vp7_idct_add Most of the code is shared with DC, thanks to minor earlier changes. vp7_idct_add_c: 5.2 vp7_idct_add_rvv_i32: 2.5	7 months ago
Rémi Denis-Courmont	4a0e629b6f	lavc/vp7dsp: revector ff_vp7_dc_wht_rvv This prepares for some code reuse.	7 months ago
Rémi Denis-Courmont	fd39997f72	lavc/vp7dsp: add R-V V vp7_luma_dc_wht This works out a bit more favourably than VP8's due to: - additional multiplications that can be vectored, - hardware-supported fixed-point rounding mode. vp7_luma_dc_wht_c: 3.2 vp7_luma_dc_wht_rvv_i64: 2.0	7 months ago
Rémi Denis-Courmont	91b5ea7bb9	lavc/vp8dsp: R-V V vp8_luma_dc_wht This is not great as transposition is poorly supported, but it works: vp8_luma_dc_wht_c: 2.5 vp8_luma_dc_wht_rvv_i32: 1.7	7 months ago
Rémi Denis-Courmont	c53d42380d	lavc/lpc: optimise RVV vector type for compute_autocorr On SpacemiT X60 (with len == 4000): autocorr_10_c: 2303.7 autocorr_10_rvv_f64: 1411.5 (before) autocorr_10_rvv_f64: 842.2 (after)	7 months ago
Rémi Denis-Courmont	a11122f9c6	lavc/vp8dsp: save one R-V GPR This saves one instruction and frees up A5, which will be repurposed in later changes. Unfortunately, we need to add quite a lot of alternative code for this.	8 months ago
Rémi Denis-Courmont	4e56455d36	lavc/vp8dsp: avoid one multiplication on RISC-V Use shifts rather than multiply, and save one instruction.	8 months ago
Rémi Denis-Courmont	0aad5b9bf5	lavc/vp8dsp: factor R-V V bilin functions For a given type, only the first VSETVLI instruction varies depending on the size.	8 months ago
Rémi Denis-Courmont	b248d7c319	lavc/sbrdsp: fold immediate offset into relocation This results in AUIPC; ADDI instead of AUIPC; ADDI; ... ADDI.	8 months ago
Rémi Denis-Courmont	8444115262	lavc/startcode: fix RVV return value on no match If there are no zero bytes, t2 equals -1. The code cannot simply fall through to the match case.	8 months ago
Rémi Denis-Courmont	af20fb9c4e	lavc/lpc: fix off-by-one in R-V V compute_autocorr	8 months ago
Rémi Denis-Courmont	a535ce2ac0	lavc/flacdsp: R-V Zvl256b lpc33 flac_lpc_33_13_c: 499.7 flac_lpc_33_13_rvv_i64: 197.7 flac_lpc_33_16_c: 601.5 flac_lpc_33_16_rvv_i64: 195.2 flac_lpc_33_29_c: 1011.5 flac_lpc_33_29_rvv_i64: 300.7 flac_lpc_33_32_c: 1099.0 flac_lpc_33_32_rvv_i64: 296.7	8 months ago
Rémi Denis-Courmont	5ebb071d79	lavc/vp8dsp: disable EPEL HV on RV128 RV128 is mostly scifi at this point, so we can just disable it here (the EPEL HV prologue/epilogue do not save 128-bit registers).	8 months ago
Rémi Denis-Courmont	25a33665a0	lavc/vp8dsp: remove unused macro parameter	8 months ago
Rémi Denis-Courmont	728a1dd3b6	lavc/rv34dsp: remove stray load immediate	8 months ago
sunyuechi	63697d3350	lavc/vp8dsp: R-V V put_epel hv C908: vp8_put_epel4_h4v4_c: 20.0 vp8_put_epel4_h4v4_rvv_i32: 11.0 vp8_put_epel4_h4v6_c: 25.2 vp8_put_epel4_h4v6_rvv_i32: 13.5 vp8_put_epel4_h6v4_c: 22.2 vp8_put_epel4_h6v4_rvv_i32: 14.5 vp8_put_epel4_h6v6_c: 29.0 vp8_put_epel4_h6v6_rvv_i32: 15.7 vp8_put_epel8_h4v4_c: 73.0 vp8_put_epel8_h4v4_rvv_i32: 22.2 vp8_put_epel8_h4v6_c: 90.5 vp8_put_epel8_h4v6_rvv_i32: 26.7 vp8_put_epel8_h6v4_c: 85.0 vp8_put_epel8_h6v4_rvv_i32: 27.2 vp8_put_epel8_h6v6_c: 104.7 vp8_put_epel8_h6v6_rvv_i32: 29.5 vp8_put_epel16_h4v4_c: 145.5 vp8_put_epel16_h4v4_rvv_i32: 26.5 vp8_put_epel16_h4v6_c: 190.7 vp8_put_epel16_h4v6_rvv_i32: 47.5 vp8_put_epel16_h6v4_c: 173.7 vp8_put_epel16_h6v4_rvv_i32: 33.2 vp8_put_epel16_h6v6_c: 222.2 vp8_put_epel16_h6v6_rvv_i32: 35.5 Amended to disable unsupported RV128. Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>	8 months ago
Rémi Denis-Courmont	0b2316e37f	lavc/sbrdsp: fix inverted boundary check 128-bit is the maximum, not the minimum here. Larger vector sizes can result in reads past the end of the noise value table. This partially reverts commit `cdcb4b98b7`.	8 months ago
Rémi Denis-Courmont	f883746587	lavc/flacdsp: do not assume maximum R-V VL This loop correctly assumes that VLMAX=16 (4x128-bit vectors with 32-bit elements) and 32 >= pred_order > 16. We need to alternate between VL=16 and VL=t2=pred_order-16 elements to add up to pred_order. The current code requests AVL=a2=pred_order elements. In QEMU and on thte K230 hardware, this sets VL=16 as we need. But the specification merely guarantees that we get: ceil(AVL / 2) <= VL <= VLMAX. For instance, if pred_order equals 27, we could end up with VL=14 or VL=15 instead of VL=16. So instead, request literally VLMAX=16.	8 months ago
Rémi Denis-Courmont	ba38d0e328	lavc/pixblockdsp: add scalar get_pixels_unaligned The code is already there, we just need to use it. get_pixels_unaligned_c: 2.2 get_pixels_unaligned_misaligned: 1.7	8 months ago
Rémi Denis-Courmont	910d281b21	lavc/h263dsp: R-V V {h,v}_loop_filter Since the horizontal and vertical filters are identical except for a transposition, this uses a common subprocedure with an ad-hoc ABI. To preserve return-address stack prediction, a link register has to be used (c.f. the "Control Transfer Instructions" from the RISC-V ISA Manual). The alternate/temporary link register T0 is used here, so that the normal RA is preserved (something Arm cannot do!). To load the strength value based on `qscale`, the shortest possible and PIC-compatible sequence is used: AUIPC; ADD; LBU. The classic LLA; ADD; LBU sequence would add one more instruction since LLA is a convenience alias for AUIPC; ADDI. To ensure that this trick works, relocation relaxation is disabled. To implement the two signed divisions by a power of two toward zero: (x / (1 << SHIFT)) the code relies on the small range of integers involved, computing: (x + (x >> (16 - SHIFT))) >> SHIFT rather than the more general: (x + ((x >> (16 - 1)) & ((1 << SHIFT) - 1))) >> SHIFT Thus one ANDI instruction is avoided. T-Head C908: h263dsp.h_loop_filter_c: 228.2 h263dsp.h_loop_filter_rvv_i32: 144.0 h263dsp.v_loop_filter_c: 242.7 h263dsp.v_loop_filter_rvv_i32: 114.0 (C is probably worse in real use due to less predictible branches.)	8 months ago
sunyuechi	0c1304ae11	lavc/vp9dsp: R-V V mc avg C908: vp9_avg4_8bpp_c: 1.2 vp9_avg4_8bpp_rvv_i64: 1.0 vp9_avg8_8bpp_c: 3.7 vp9_avg8_8bpp_rvv_i64: 1.5 vp9_avg16_8bpp_c: 14.7 vp9_avg16_8bpp_rvv_i64: 3.5 vp9_avg32_8bpp_c: 57.7 vp9_avg32_8bpp_rvv_i64: 10.0 vp9_avg64_8bpp_c: 229.0 vp9_avg64_8bpp_rvv_i64: 31.7 Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>	8 months ago
Rémi Denis-Courmont	7591eb4055	Revert "lavc/sbrdsp: R-V V neg_odd_64" While this function can easily be written with vectors, it just fails to get any performance improvement. For reference, this is a simpler loop-free implementation that does get better performance than the current one depending on hardware, but still more or less the same metrics as the C code: func ff_sbr_neg_odd_64_rvv, zve64x li a1, 32 addi a0, a0, 7 li t0, 8 vsetvli zero, a1, e8, m2, ta, ma li t1, 0x80 vlse8.v v8, (a0), t0 vxor.vx v8, v8, t1 vsse8.v v8, (a0), t0 ret endfunc This reverts commit `d06fd18f8f`.	8 months ago
Rémi Denis-Courmont	d452db8410	lavc/vc1dsp: R-V V vc1_unescape_buffer Notes: - The loop is biased toward no unescaped bytes as that should be most common. - The input byte array is slid rather than the (8 times smaller) bit-mask, as RISC-V V does not provide a bit-mask (or bit-wise) slide instruction. - There are two comparisons with 0 per iteration, for the same reason. - In case of match, bytes are copied until the first match, and the loop is restarted after the escape byte. Vector compression (vcompress.vm) could discard all escape bytes but that is slower if escape bytes are rare. Further optimisations should be possible, e.g.: - processing 2 bytes fewer per iteration to get rid of a 2 slides, - taking a short cut if the input vector contains less than 2 zeroes. But this is a good starting point: T-Head C908: vc1dsp.vc1_unescape_buffer_c: 12749.5 vc1dsp.vc1_unescape_buffer_rvv_i32: 6009.0 SpacemiT X60: vc1dsp.vc1_unescape_buffer_c: 11038.0 vc1dsp.vc1_unescape_buffer_rvv_i32: 2061.0	8 months ago
Rémi Denis-Courmont	463c573e6b	lavc/huffyuvdsp: optimise RVV vtype for add_hfyu_left_pred_bgr32 T-Head C908: add_hfyu_left_pred_bgr32_c: 237.5 add_hfyu_left_pred_bgr32_rvv_i32: 173.5 (before) add_hfyu_left_pred_bgr32_rvv_i32: 110.0 (after)	8 months ago
Rémi Denis-Courmont	233066e85a	lavc/flacdsp: optimise RVV vector type for lpc32 This is pretty much the same as for lpc16, though it only improves half as large prediction orders. With 128-bit vectors, this gives: C V old V new 1 69.2 181.5 95.5 2 107.7 180.7 95.2 3 145.5 180.0 103.5 4 183.0 179.2 102.7 5 220.7 178.5 128.0 6 257.7 194.0 127.5 7 294.5 193.7 126.7 8 331.0 193.0 126.5 Larger prediction orders see no significant changes at that size.	8 months ago
Rémi Denis-Courmont	6ab4b92e82	lavc/flacdsp: optimise RVV vector type for lpc16 This calculates the optimal vector type value at run-time based on the hardware vector length and the FLAC LPC prediction order. In this particular case, the additional computation is easily amortised over the loop iterations: T-Head C908: C V before V after 1 48.0 214.7 95.2 2 64.7 214.2 94.7 3 79.7 213.5 94.5 4 96.2 196.5 94.2 # 5 111.0 195.7 118.5 6 127.0 211.2 102.0 7 143.7 194.2 101.5 8 175.7 193.2 101.2 # 9 176.2 224.2 126.0 10 191.5 192.0 125.5 11 224.5 191.2 124.7 12 223.0 190.2 124.2 13 239.2 189.5 123.7 14 253.7 188.7 139.5 15 286.2 188.0 122.7 16 284.0 187.0 122.5 # 17 300.2 186.5 186.5 18 314.0 185.5 185.7 19 329.7 184.7 185.0 20 343.0 184.2 184.2 21 358.7 199.2 183.7 22 371.7 182.7 182.7 23 387.5 181.7 182.0 24 400.7 181.0 181.2 25 431.5 180.2 196.5 26 443.7 195.5 196.0 27 459.0 178.7 196.2 28 470.7 177.7 194.2 29 470.0 177.0 193.5 30 481.2 176.2 176.5 31 496.2 175.5 175.7 32 507.2 174.7 191.0 # # Power of two boundary. With 128-bit vectors, improvements are expected for the first two test cases only. For the other two, there is overhead but below noise. Improvements should be better observable with prediction order of 8 and less, or on hardware with larger vector sizes.	8 months ago
Rémi Denis-Courmont	259c639137	lavc/vp9_intra: fix another .irp use with LLVM as	8 months ago
Rémi Denis-Courmont	8cea66a73c	lavc/vp9_intra: fix .irp use with LLVM as	8 months ago
Rémi Denis-Courmont	cbe51ebf93	lavc/vp8dsp: fix .irp use with LLVM as	8 months ago
Rémi Denis-Courmont	fa47299516	lavc/startcode: add R-V V startcode_find_candidate	8 months ago
Rémi Denis-Courmont	4ad5b9c8db	lavc/startcode: add R-V Zbb startcode_find_candidate The main loop processes 8 bytes in 5 instructions. For comparison, the optimal plain strnlen() requires 4 instructions per byte (6.4x worse): LBU; ADDI; BEQZ; BNE. The current libavcodec C code involves 5 instructions per byte (8x worse). Actual benchmarks may be slightly less favourable due to latency from ORC.B to BNE.	8 months ago
sunyuechi	d521b7280c	lavc/vp9dsp: R-V V ipred tm C908: vp9_tm_4x4_8bpp_c: 116.5 vp9_tm_4x4_8bpp_rvv_i32: 43.5 vp9_tm_8x8_8bpp_c: 416.2 vp9_tm_8x8_8bpp_rvv_i32: 86.0 vp9_tm_16x16_8bpp_c: 1665.5 vp9_tm_16x16_8bpp_rvv_i32: 187.2 vp9_tm_32x32_8bpp_c: 6974.2 vp9_tm_32x32_8bpp_rvv_i32: 625.7 Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>	8 months ago
Rémi Denis-Courmont	88d973a5d6	lavc/flacdsp: R-V V flac_wasted33 T-Head C908: flac_wasted_33_c: 786.2 flac_wasted_33_rvv_i64: 486.5	8 months ago
sunyuechi	d4083ecb7c	lavc/vc1dsp: R-V V mspel_pixels C908 X60 vc1dsp.avg_vc1_mspel_pixels_tab[0][0]_c : 14.7 13.2 vc1dsp.avg_vc1_mspel_pixels_tab[0][0]_rvv_i32 : 2.5 2.2 vc1dsp.avg_vc1_mspel_pixels_tab[1][0]_c : 3.7 3.5 vc1dsp.avg_vc1_mspel_pixels_tab[1][0]_rvv_i64 : 1.0 1.2 vc1dsp.put_vc1_mspel_pixels_tab[0][0]_c : 9.0 8.0 vc1dsp.put_vc1_mspel_pixels_tab[0][0]_rvi : 1.0 1.0 vc1dsp.put_vc1_mspel_pixels_tab[1][0]_c : 2.5 2.2 vc1dsp.put_vc1_mspel_pixels_tab[1][0]_rvi : 0.5 0.5 Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>	8 months ago
Rémi Denis-Courmont	7b47099bc0	lavc/flacdsp: R-V V flac_wasted32 T-Head C908: flac_wasted_32_c: 949.0 flac_wasted_32_rvv_i32: 278.7	8 months ago

1 2 3 4

196 Commits (9b41cc04300e8d00ae3a6326639e975712e21bb6)