This is similar to h264, but here we use manual_avg instead of vaaddu
because rv40's OP differs from h264's. If we used vaaddu,
rv40 would need to repeatedly switch between vxrm=0 and vxrm=2,
and switching vxrm is very slow.
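In scalar terms, the issue is roughly as follows (an illustrative sketch only;
the real code is RVV assembly, and the exact rounding rv40 needs is defined
there): vaaddu takes its rounding from the vxrm CSR, whereas a manually
computed average hard-codes the rounding bias in the arithmetic, so no CSR
write is needed between calls.

    #include <stdint.h>

    /* What vaaddu computes under the two rounding modes in question: */
    static inline uint8_t vaaddu_rnu(uint8_t a, uint8_t b) { return (a + b + 1) >> 1; } /* vxrm=0 */
    static inline uint8_t vaaddu_rdn(uint8_t a, uint8_t b) { return (a + b) >> 1;     } /* vxrm=2 */

    /* A manual average with an explicit bias: the bias (0 or 1) can change
     * from one call to the next without touching any CSR. */
    static inline uint8_t manual_avg(uint8_t a, uint8_t b, unsigned bias)
    {
        return (a + b + bias) >> 1;
    }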
C908:
avg_chroma_mc4_c: 2330.0
avg_chroma_mc4_rvv_i32: 602.7
avg_chroma_mc8_c: 1211.0
avg_chroma_mc8_rvv_i32: 602.7
put_chroma_mc4_c: 1825.0
put_chroma_mc4_rvv_i32: 414.7
put_chroma_mc8_c: 932.0
put_chroma_mc8_rvv_i32: 414.7
Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
As we do not need to widen accumulators to 64 bits, we effectively get
double capacity for unrolling compared to the integer function. This
explains the slightly better performance gains.
ac3_sum_square_butterfly_float_c: 65.2
ac3_sum_square_butterfly_float_rvv_f32: 12.2
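For reference, here is a scalar sketch of what the float butterfly sum computes
(written for illustration; treat the exact signature as an assumption). All four
running sums stay in single-precision floats, which is why no widening is needed:

    static void sum_square_butterfly_float(float sum[4], const float *coef0,
                                           const float *coef1, int len)
    {
        float s0 = 0, s1 = 0, s2 = 0, s3 = 0;

        for (int i = 0; i < len; i++) {
            float l = coef0[i], r = coef1[i];
            float m = l + r, s = l - r;
            s0 += l * l; /* left energy       */
            s1 += r * r; /* right energy      */
            s2 += m * m; /* mid (L+R) energy  */
            s3 += s * s; /* side (L-R) energy */
        }
        sum[0] = s0; sum[1] = s1; sum[2] = s2; sum[3] = s3;
    }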
Otherwise aacenc.o gets pulled in by the aacencdsp checkasm
test, and it in turn pulls the rest of lavc in.
Besides being bad size-wise, this also has the downside that
it pulls in avpriv_(cga|vga16)_font from libavutil. These are
marked as being imported from another library when building
libavcodec as a DLL, and this breaks checkasm because it links
both lavc and lavu statically.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
The penultimate loop iteration could pick any vl such that:
vlenb/4 < vl <= vlenb/2
Thus if the total length is not a multiple of vlenb/2, the vfadd.vf
on the penultimate iteration would yield corrupt values for the last
iteration.
To avoid this, force vl = vlenb/2 until the last iteration. Unfortunately
this latent bug is not reproducible with either hardware or QEMU as of now.
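A scalar sketch of the fix, with hypothetical names (the real code selects vl
through vsetvli in assembly): keep requesting full vlenb/2-element batches until
only the tail remains, so the penultimate pass never produces a partial result.

    #include <stddef.h>

    static void process(const float *buf, size_t len, size_t vlenb)
    {
        while (len > 0) {
            /* Force vl = vlenb/2 until the last iteration. */
            size_t vl = (len > vlenb / 2) ? vlenb / 2 : len;
            /* ... vector-process exactly vl elements of buf here ... */
            buf += vl;
            len -= vl;
        }
    }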
This skips the round-trip to a scalar register for the sliding 'x'
coefficients, improving performance by about 5%. The trick here is that
the vector slide-up instruction preserves the elements of the destination
vector below the slide offset.
The switch from vfslide1up.vf to vslideup.vi also allows the elimination
of data dependencies on consecutive slides. Since the specifications
recommend sticking to power-of-two offsets, we could slide as follows:
vslideup.vi v8, v0, 2
vslideup.vi v4, v0, 1
vslideup.vi v12, v8, 1
vslideup.vi v16, v8, 2
However, on the device under test, this seems to make performance slightly
worse, so this is left for (in)validation on future, better hardware.
The 8x4 and 4x4 cases use a needlessly large multiplier (unless/until we care
about embedded 64-bit-vector hardware). This is merely suboptimal.
The 8x4 case also uses an incorrect vector length, which leads to incorrect
behaviour on future/hypothetical hardware with 256-bit or larger vectors.
Pointed-out-by: Martin Storsjö <martin@martin.st>
We can't call ff_get_rv_vlenb() if we don't have RVV available
at all.
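A minimal sketch of the intended ordering, following the usual FFmpeg RISC-V
CPU-flag pattern (the function name and the vlenb threshold are hypothetical):

    #include "config.h"
    #include "libavutil/attributes.h"
    #include "libavutil/cpu.h"
    #include "libavutil/riscv/cpu.h"

    av_cold void init_dsp_riscv(void)
    {
    #if HAVE_RVV
        int flags = av_get_cpu_flags();

        /* Only query the vector length once RVV is known to be present;
         * ff_get_rv_vlenb() must not be reached otherwise. */
        if (flags & AV_CPU_FLAG_RVV_I32) {
            if (ff_get_rv_vlenb() >= 16) {
                /* install the RVV function pointers here */
            }
        }
    #endif
    }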
Acked-by: Rémi Denis-Courmont <remi@remlab.net>
Signed-off-by: Martin Storsjö <martin@martin.st>
The loop iterates over the length of the vector, not the order. This is
to avoid reloading the same data for each lag value. However, this means
the loop only works if the maximum order is no larger than VLENB.
The loop is roughly equivalent to:
    for (size_t j = 0; j < lag; j++)
        autoc[j] = 1.;

    while (len > lag) {
        for (ptrdiff_t j = 0; j < lag; j++)
            autoc[j] += data[j] * *data;
        data++;
        len--;
    }

    while (len > 0) {
        for (ptrdiff_t j = 0; j < len; j++)
            autoc[j] += data[j] * *data;
        data++;
        len--;
    }
Since register pressure is only at 50%, it should be possible to implement
the same loop for order up to 2xVLENB. But this is left for future work.
Performance numbers are all over the place, from ~1.25x to ~4x speedups,
but at least they are always noticeably better than nothing.
The input is laid out in 16 segments, of which 13 actually need to be
loaded. There are no really efficient ways to deal with this:
1) If we load 8 segments with unit stride, then narrow to 16 segments with
right shifts, we can only get one half-size vector per segment, or just 2
elements per vector (EMUL=1/2), at least with 128-bit vectors.
This ends up, unsurprisingly, about as fast as the C code.
2) The current approach is to load with strides. We keep that approach,
but improve it using three 4-segmented loads instead of 12 single-segment
loads (see the sketch after this list). This divides the number of distinct
loaded addresses by 4.
3) A potential third approach would be to avoid segmentation altogether
and splat the scalar coefficient into vectors. Then we can use a
unit-stride load and maximum EMUL. But the downside then is that we have
to multiply the 3 (of 16) unused segments by zero as part of the
multiply-accumulate operations.
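Below is a rough scalar picture of the load strategies in 2) (illustrative only;
the indexing assumes the 16-segment interleaved layout described above, not the
exact AAC PS data layout, and the array bounds are arbitrary):

    /* 12 single-segment strided loads: one address stream per segment. */
    static void load_single_segments(float v[16][64], const float *in, int n)
    {
        for (int seg = 0; seg < 12; seg++)
            for (int i = 0; i < n; i++)
                v[seg][i] = in[16 * i + seg];
    }

    /* Three 4-segment strided loads (vlsseg4e32 in RVV terms): each pass
     * pulls 4 consecutive segments from a single address stream, so the
     * number of distinct loaded addresses drops by a factor of 4. */
    static void load_four_segments(float v[16][64], const float *in, int n)
    {
        for (int grp = 0; grp < 3; grp++)
            for (int i = 0; i < n; i++)
                for (int k = 0; k < 4; k++)
                    v[4 * grp + k][i] = in[16 * i + 4 * grp + k];
    }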
In addition, we reuse vectors mid-loop so as to increase the EMUL
from 1 to 2, which also improves performance a little bit.
Overall the gains are quite small with the device under test, as it does
not deal with segmented loads very well. But at least the code is tidier,
and should enjoy bigger speed-ups on better hardware implementations.
ps_hybrid_analysis_c: 1819.2
ps_hybrid_analysis_rvv_f32: 1037.0 (before)
ps_hybrid_analysis_rvv_f32: 990.0 (after)
This stores the constant coefficients deinterleaved, so that they can be
loaded directly with NF=0. Unfortunately, we cannot optimise loading the
input, due to insufficient memory alignment (not 32-bit).
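A minimal sketch of the idea behind the deinterleaved tables (hypothetical names;
the real tables live in the G.722 DSP code): splitting interleaved taps into two
contiguous arrays lets each half be fetched with a plain unit-stride load (NF=0)
instead of a 2-field segmented load.

    #include <stdint.h>

    static void deinterleave_taps(int16_t even[12], int16_t odd[12],
                                  const int16_t interleaved[24])
    {
        for (int i = 0; i < 12; i++) {
            even[i] = interleaved[2 * i];     /* one phase of the filter   */
            odd[i]  = interleaved[2 * i + 1]; /* the other phase           */
        }
    }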
Before:
g722_apply_qmf_c: 82.5
g722_apply_qmf_rvv_i32: 78.2
After:
g722_apply_qmf_c: 82.5
g722_apply_qmf_rvv_i32: 65.2
In this case, the inner loop computing the scalar product can be reduced
to just one multiplication and one sum even with 128-bit vectors. The
result is a lot simpler, but also brings more modest performance gains:
flac_lpc_16_13_c: 15241.0
flac_lpc_16_13_rvv_i32: 11230.0
flac_lpc_16_16_c: 17884.0
flac_lpc_16_16_rvv_i32: 12125.7
flac_lpc_16_29_c: 27847.7
flac_lpc_16_29_rvv_i32: 10494.0
flac_lpc_16_32_c: 30051.5
flac_lpc_16_32_rvv_i32: 10355.0
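As a reminder of what is being vectorised, a scalar sketch of the 16-bit LPC
prediction (semantics only; FFmpeg's actual C implementation is unrolled
differently): the products and the running sum stay within 32 bits in this mode.

    #include <stdint.h>

    static void lpc16_ref(int32_t *decoded, const int coeffs[32],
                          int pred_order, int qlevel, int len)
    {
        for (int i = pred_order; i < len; i++) {
            int32_t sum = 0; /* 32-bit accumulation suffices for the _16 variant */
            for (int j = 0; j < pred_order; j++)
                sum += coeffs[j] * decoded[i - pred_order + j];
            decoded[i] += sum >> qlevel;
        }
    }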
The entire set of 32 coefficients and the corresponding 32 past samples can
fit in a single vector (with LMUL=8) exactly, but... since widening
doubles the needed vector sizes, we still end up too short with 128-bit
vectors. This adds a very simple version for future 256+-bit hardware
and for pred_order values up to 16, and a somewhat more involved loop
for 128-bit hardware with pred_order values between 17 and 32.
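For contrast, a scalar sketch of the 32-bit prediction (again semantics only, not
FFmpeg's exact C): each product needs a 64-bit accumulator, which is what forces
widening multiply-accumulate in the vector code and doubles the needed vector
sizes.

    #include <stdint.h>

    static void lpc32_ref(int32_t *decoded, const int coeffs[32],
                          int pred_order, int qlevel, int len)
    {
        for (int i = pred_order; i < len; i++) {
            int64_t sum = 0; /* 32-bit x 32-bit products require 64 bits */
            for (int j = 0; j < pred_order; j++)
                sum += (int64_t)coeffs[j] * decoded[i - pred_order + j];
            decoded[i] += (int32_t)(sum >> qlevel);
        }
    }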
With 128-bit hardware, the benchmarks look like this:
flac_lpc_32_13_c: 30152.0
flac_lpc_32_13_rvv_i32: 10244.7
flac_lpc_32_16_c: 37314.2
flac_lpc_32_16_rvv_i32: 10126.2
flac_lpc_32_29_c: 61910.0
flac_lpc_32_29_rvv_i32: 14495.2
flac_lpc_32_32_c: 68204.0
flac_lpc_32_32_rvv_i32: 13273.7