This reuses the DC bypass functions from the multiple IDCT functions, to
leverage vector code.
As an added bonus, the caller functions can now rely on the callee functions
to preserve their parameters, thus cutting down on stack spills.
No functional changes. This just moves the assembler so that it can be
referenced by other functions in h264idct_rvv.S with local jumps.
Edited-by: Rémi Denis-Courmont <remi@remlab.net>
Unlike the 8-bit version, we need two iterations to process this within
128-bit vectors. This adds some extra complexity for pointer arithmetic
and counting down, which is unnecessary in the 8-bit variant.
Accordingly, the gains relative to C are only slightly better than half
as large with 128-bit vectors as with 256-bit ones.
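For illustration, the column pass boils down to a strip-mined loop along
these lines (a C sketch only, not the actual assembler; transform_columns()
is a hypothetical stand-in for the vectorised per-pass work):

    #include <stdint.h>

    /* Hypothetical helper standing in for the vector column transform;
     * the real work happens in h264idct_rvv.S. */
    static void transform_columns(int16_t *block, int ncols)
    {
        (void)block; (void)ncols;
    }

    /* With 32-bit intermediates, 128-bit vectors cover 4 of the 8 columns
     * per pass, so this loop body runs twice; 256-bit vectors need a single
     * pass and none of the pointer/counter book-keeping. */
    static void idct8_vertical_pass(int16_t *block, int cols_per_pass)
    {
        int left = 8;                        /* columns still to transform */
        while (left > 0) {
            int n = left < cols_per_pass ? left : cols_per_pass;
            transform_columns(block, n);
            block += n;                      /* advance to the next columns */
            left  -= n;                      /* count down the remainder */
        }
    }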
T-Head C908 (2 iterations):
h264_idct8_add_9bpp_c: 17.5
h264_idct8_add_9bpp_rvv_i32: 10.0
h264_idct8_add_10bpp_c: 17.5
h264_idct8_add_10bpp_rvv_i32: 9.7
h264_idct8_add_12bpp_c: 17.7
h264_idct8_add_12bpp_rvv_i32: 9.7
h264_idct8_add_14bpp_c: 17.7
h264_idct8_add_14bpp_rvv_i32: 9.7
SpacemiT X60 (single iteration):
h264_idct8_add_9bpp_c: 15.2
h264_idct8_add_9bpp_rvv_i32: 5.0
h264_idct8_add_10bpp_c: 15.2
h264_idct8_add_10bpp_rvv_i32: 5.0
h264_idct8_add_12bpp_c: 14.7
h264_idct8_add_12bpp_rvv_i32: 5.0
h264_idct8_add_14bpp_c: 14.7
h264_idct8_add_14bpp_rvv_i32: 4.7
There are two implementations here:
- a generic scalable one processing two columns at a time,
- a specialised one processing one (fixed-size) row at a time.
Unsurprisingly, the generic one works out better with smaller widths.
With larger widths, the gains from filling vectors are outweighed by
the extra cost of strided loads and stores. In other words, memory
accesses become the bottleneck.
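Both variants evaluate the standard H.264 explicit weighted-prediction
formula per sample; a reference-style C sketch of the 8-bit case (not the
actual FFmpeg template) makes the work per element explicit:

    #include <stddef.h>
    #include <stdint.h>

    /* For each sample: scale, round, shift, add the offset, clip to 8 bits.
     * The generic RVV variant walks this two columns at a time with strided
     * accesses; the specialised one handles a full fixed-size row per step. */
    static void weight_pixels(uint8_t *block, ptrdiff_t stride, int width,
                              int height, int log2_denom, int weight,
                              int offset)
    {
        for (int y = 0; y < height; y++, block += stride)
            for (int x = 0; x < width; x++) {
                int v = block[x] * weight;
                if (log2_denom > 0)
                    v = (v + (1 << (log2_denom - 1))) >> log2_denom;
                v += offset;
                block[x] = v < 0 ? 0 : v > 255 ? 255 : v;
            }
    }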
T-Head C908:
h264_weight2_8_c: 54.5
h264_weight2_8_rvv_i32: 13.7
h264_weight4_8_c: 101.7
h264_weight4_8_rvv_i32: 27.5
h264_weight8_8_c: 197.0
h264_weight8_8_rvv_i32: 75.5
h264_weight16_8_c: 385.0
h264_weight16_8_rvv_i32: 74.2
SpacemiT X60:
h264_weight2_8_c: 48.5
h264_weight2_8_rvv_i32: 8.2
h264_weight4_8_c: 90.7
h264_weight4_8_rvv_i32: 16.5
h264_weight8_8_c: 175.0
h264_weight8_8_rvv_i32: 37.7
h264_weight16_8_c: 342.2
h264_weight16_8_rvv_i32: 66.0
While this *tends* to be faster than plain C, the performance numbers
are all over the place, presumably due to the conditional character of
the main loop.
Some additional micro-optimisations should be feasible after the
underlying h264_idct_add and h264_idct_dc_add functions are also
implemented. Then it will no longer be necessary to strictly abide by
the C ABI.
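For context, the main loop mirrors the per-block dispatch of the C
reference; a simplified 8-bit sketch (the scan8 table and the two callees
are declarations standing in for the real FFmpeg symbols):

    #include <stdint.h>

    /* Stand-ins for the real FFmpeg symbols (scan8 table, ff_h264_idct_add
     * and ff_h264_idct_dc_add); declarations only, for illustration. */
    extern const uint8_t scan8[];
    void idct_add(uint8_t *dst, int16_t *block, int stride);
    void idct_dc_add(uint8_t *dst, int16_t *block, int stride);

    /* Each of the 16 4x4 blocks is skipped, DC-only added or fully
     * transformed depending on its non-zero coefficient count, so the
     * amount of work varies wildly with the input. */
    static void idct_add16_sketch(uint8_t *dst, const int *block_offset,
                                  int16_t *block, int stride,
                                  const uint8_t nnzc[5 * 8])
    {
        for (int i = 0; i < 16; i++) {
            int nnz = nnzc[scan8[i]];
            if (!nnz)
                continue;
            if (nnz == 1 && block[i * 16])
                idct_dc_add(dst + block_offset[i], block + i * 16, stride);
            else
                idct_add(dst + block_offset[i], block + i * 16, stride);
        }
    }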
Performance is (unfortunately) the same as with non-MBAFF, since the
hardware under test does not short-circuit vector tail calculations.
(IMO, a generic solution or work-around should be agreed on, rather
than bespoke approaches all over the place.)
T-Head C908 (cycles):
h264_h_loop_filter_luma_8bpp_c: 297.5
h264_h_loop_filter_luma_8bpp_rvv_i32: 369.2
h264_v_loop_filter_luma_8bpp_c: 862.7
h264_v_loop_filter_luma_8bpp_rvv_i32: 199.7
Performance in the horizontal scenario seems worse than scalar. x86
SSE2 and AVX optimisations are similarly affected. This is presumably
caused by unlucky inputs from checkasm, such that the C code
short-circuits almost all filter calculations.
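The short-circuit in question is the standard per-line gate of the H.264
deblocking filter; a minimal C sketch:

    #include <stdlib.h>

    /* A line of samples across the edge is filtered only when all three
     * thresholds hold. checkasm's random inputs rarely satisfy them, so
     * the scalar code skips nearly all of the arithmetic, while the vector
     * code always computes every lane and merely masks the result. */
    static int needs_filtering(int p1, int p0, int q0, int q1,
                               int alpha, int beta)
    {
        return abs(p0 - q0) < alpha &&
               abs(p1 - p0) < beta  &&
               abs(q1 - q0) < beta;
    }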
Judging by the coefficients, the last round of add/sub can overflow
to 17 bits with a very small probability just as with the 8-point
transform. This is not observed under FATE, but better safe than sorry.
The last set of additions/subtractions can break the 16-bit limit and
requires 17 bits of precision. This uses widening adds accordingly to fix
the MSS2 FATE tests.
The problem potentially also affects inv_trans_4 with a very low
probability, but this is not reproducible under FATE.
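A trivial C illustration of why 16 bits are not enough for that final step
(values chosen arbitrarily; the wrap-around assumes the usual
two's-complement behaviour):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        int16_t a = 20000, b = 15000;       /* both fit in 16 bits */
        int32_t wide   = (int32_t)a + b;    /* widening add: 35000, correct */
        int16_t narrow = (int16_t)(a + b);  /* kept narrow: wraps to -30536 */
        printf("%d vs %d\n", (int)wide, (int)narrow);
        return 0;
    }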
Due to hysterical raisins, most RISC-V Linux distributions target a
RV64GC baseline excluding the Bit-manipulation ISA extensions, most
notably:
- Zba: address generation extension and
- Zbb: basic bit manipulation extension.
Most CPUs that FFmpeg is likely to run on support Zba and Zbb (including
the current FATE runner), so it makes sense to optimise for them. In
fact, a large chunk of the existing assembler optimisations relies on
Zba and/or Zbb.
Since we cannot patch shared library code, the next best thing is to
carry a flag initialised at load time and check it on an as-needed basis.
This results in 3 instructions overhead on isolated use, e.g.:
1: AUIPC rd, %pcrel_hi(ff_rv_zbb_supported)
   LBU   rd, %pcrel_lo(1b)(rd)
   BEQZ  rd, non_Zbb_fallback_code
   // Zbb code here
The C compiler will typically load the flag ahead of time to reduce
latency, and can also keep it around if Zbb is used multiple times in a
single optimisation scope. For this to work, the flag symbol must be
hidden; otherwise the optimisation degrades with a GOT look-up to
support interposition:
1: AUIPC rd, GOT_OFFSET_HI
   LD    rd, GOT_OFFSET_LO(rd)
   LBU   rd, (rd)
   BEQZ  rd, non_Zbb_fallback_code
   // Zbb code here
This patch adds code to provision the flag in libraries using bit
manipulation functions from libavutil: byte-swap, bit-weight and
counting leading or trailing zeroes.
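From C, the intended usage pattern looks roughly like this (a sketch only:
the exact declaration of ff_rv_zbb_supported and the two helpers shown are
illustrative, not FFmpeg's actual code):

    #include <stdbool.h>

    /* Hidden visibility keeps the access PC-relative (the 3-instruction
     * sequence above) rather than forcing a GOT look-up. */
    extern bool ff_rv_zbb_supported __attribute__((visibility("hidden")));

    /* Hypothetical helpers: one wraps the Zbb CPOP instruction, the other
     * is a portable fallback. */
    int zbb_cpop(unsigned int x);
    int generic_popcount(unsigned int x);

    static inline int popcount32(unsigned int x)
    {
        if (ff_rv_zbb_supported)
            return zbb_cpop(x);
        return generic_popcount(x);
    }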
Although checkasm does not verify this, the decoder requires that the
transform updates the input block exactly like the C code does.
This fixes vc1-ism, vc1_ilaced_twomv, vc1_sa00040, vc1_sa10091,
vc1_sa10143, vc1_sa20021, vc1test_smm0005 and wmv3-drm-dec tests.
Although checkasm does not verify this, the decoder requires that the
transform updates the input block exactly like the C code does.
This fixes vc1-ism, vc1_ilaced_twomv, vc1_sa00040, vc1_sa10091,
vc1_sa10143, vc1_sa20021, vc1test_smm0005 and wmv3-drm-dec tests.
T-Head C908 (cycles):
vc1dsp.vc1_inv_trans_4x4_c: 310.7
vc1dsp.vc1_inv_trans_4x4_rvv_i32: 120.0
We could use 1 `vlseg4e64.v` instead of 4 `vle16.v`, but that seems to
be about 7% slower.