FFmpeg

Author	SHA1	Message	Date
Rémi Denis-Courmont	8009581912	lavc/opusdsp: RISC-V V (128-bit) postfilter This is implemented for a vector size of 128 bits. Since the scalar product in the inner loop covers 5 samples, or 160 bits, we need a group multiplier of 2. To avoid reconfiguring the vector type, the outer loop, which loads multiple input samples, sticks to the same multiplier. Consequently, the outer loop loads 8 samples per iteration. This is safe since the minimum period of the CELT codec is 15 samples. The same code would also work, albeit needlessly inefficiently, with a vector length of 256 bits. A proper implementation will follow instead.	2 years ago
Rémi Denis-Courmont	d7528af4df	lavc/bswapdsp: RISC-V V bswap_buf	2 years ago
Rémi Denis-Courmont	f0ef11ea83	lavc/bswapdsp: RISC-V B bswap_buf Simply taking the Zbb REV8 instruction into use in a simple loop gives some significant savings: bswap_buf_c: 1081.0, bswap_buf_rvb_b: 771.0. But we can also use the 64-bit REV8 as a pseudo-SIMD instruction with just one additional shift, and one fewer load, effectively doubling the bandwidth. Consequently, this patch is useful even if the compile-time target has Zbb enabled for C code: bswap_buf_c: 1081.0, bswap_buf_rvb_b: 341.0 (this patch). On the other hand, this approach fails miserably for bswap16_buf as the ratio of shifts and stores becomes unfavorable compared to naïve C: bswap16_buf_c: 1542.0, bswap16_buf_rvb_b: 1803.7. Unrolling to process 128 bits (4 samples) at a time actually worsens performance ever so slightly: bswap_buf_c: 1081.0, bswap_buf_rvb_b: 408.5.	2 years ago
Rémi Denis-Courmont	64ab577954	lavc/alacdsp: RISC-V V decorrelate_stereo To avoid data dependencies, this does the following unroll, which requires one extra but probably free addition: coeff = (b * left_weight) >> decorr_shift; b += a; a -= coeff; b -= coeff; swap(a, b);	2 years ago
Rémi Denis-Courmont	676b08cb70	lavc/pixblockdsp: RISC-V V 8-bit get_pixels & get_pixels_unaligned	2 years ago
Rémi Denis-Courmont	b29ee63a1b	lavc/idctdsp: RISC-V V put_pixels_clamped function	2 years ago
Rémi Denis-Courmont	b0cacf4c3f	lavc/aacpsdsp: RISC-V V add_squares	2 years ago
Rémi Denis-Courmont	453aba71e6	lavc/vorbisdsp: RISC-V V inverse_coupling This uses the following vectorisation: for (i = 0; i < blocksize; i++) { ang[i] = mag[i] - copysignf(fmaxf(ang[i], 0.f), mag[i]); mag[i] = mag[i] - copysignf(fminf(ang[i], 0.f), mag[i]); }	2 years ago
Rémi Denis-Courmont	47a10b9a99	lavc/fmtconvert: RISC-V V int32_to_float_fmul_scalar	2 years ago
Rémi Denis-Courmont	27da9514c3	lavc/audiodsp: RISC-V V vector_clip_int32	2 years ago
Rémi Denis-Courmont	1edac8eb46	lavc/pixblockdsp: RISC-V I get_pixels Benchmarks on SiFive U74-MC (courtesy of Shanghai StarFive Tech): get_pixels_c: 180.0 get_pixels_rvi: 136.7	2 years ago
Rémi Denis-Courmont	04d092e7d5	lavc/audiodsp: RISC-V F vector_clipf RV64G supports MIN & MAX instructions natively only on floating point registers, not general purpose ones. The latter would require the Zbb extension. Due to that, it is actually faster to perform the clipping "properly" in the FPU. Benchmarks on SiFive U74-MC (courtesy of Shanghai StarFive Tech): audiodsp.vector_clipf_c: 29551.5 audiodsp.vector_clipf_rvf: 17871.0 Also tried unrolling with 2 or 8 elements but it gets worse either way.	2 years ago

12 Commits (ec2b07db79dbbd58329bf5ec19ecf867b21a38b7)
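
The pseudo-SIMD idea described in commit f0ef11ea83 (one 64-bit byte reversal handling two 32-bit words, at the cost of one extra shift) can be sketched in portable C. This is only an illustration under stated assumptions, not the assembly from the commit: `bswap64_sketch` and `bswap_buf_sketch` are hypothetical names, the 64-bit value is assembled arithmetically to model a single little-endian 64-bit load, and the byte reversal is written out by hand where the real code would use the Zbb REV8 instruction.

```c
#include <stdint.h>

/* Stand-in for a 64-bit REV8: reverses all eight bytes of x. */
static uint64_t bswap64_sketch(uint64_t x)
{
    x = ((x & 0x00FF00FF00FF00FFULL) << 8)  | ((x >> 8)  & 0x00FF00FF00FF00FFULL);
    x = ((x & 0x0000FFFF0000FFFFULL) << 16) | ((x >> 16) & 0x0000FFFF0000FFFFULL);
    return (x << 32) | (x >> 32);
}

/* Hypothetical helper: byte-swaps len 32-bit words, two per iteration. */
static void bswap_buf_sketch(uint32_t *dst, const uint32_t *src, int len)
{
    int i = 0;

    for (; i + 1 < len; i += 2) {
        /* Models one little-endian 64-bit load of src[i] and src[i + 1]. */
        uint64_t pair = ((uint64_t)src[i + 1] << 32) | src[i];
        uint64_t rev  = bswap64_sketch(pair);

        /* The full reversal also exchanges the two words, so the
         * swapped src[i] ends up in the upper half: one extra shift. */
        dst[i]     = (uint32_t)(rev >> 32);
        dst[i + 1] = (uint32_t)rev;
    }

    /* Odd trailing word: reuse the 64-bit reversal and keep the top half. */
    if (i < len)
        dst[i] = (uint32_t)(bswap64_sketch(src[i]) >> 32);
}
```

The same trick applied to 16-bit samples needs three extra shifts and four stores per 64-bit reversal, which is consistent with the commit's remark that the ratio of shifts and stores becomes unfavorable for bswap16_buf.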
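
The reordering quoted in commit 64ab577954 can be checked against a serial formulation in plain C. The sketch below is an assumption-laden illustration rather than the FFmpeg source: the function names and the split into separate `left`/`right` arrays are hypothetical, and the "serial" variant is simply the dependent form that the commit's sequence appears to replace (subtract the weighted term from `a`, then add the new `a` to `b`).

```c
#include <stdint.h>

/* Serial form (assumed baseline): the update of b must wait for the new a. */
static void decorrelate_serial(int32_t *left, int32_t *right, int n,
                               int shift, int weight)
{
    for (int i = 0; i < n; i++) {
        int32_t a = left[i], b = right[i];

        a -= (b * weight) >> shift;
        b += a;                     /* depends on the line above */
        left[i]  = b;
        right[i] = a;
    }
}

/* Ordering from the commit message: b += a no longer waits for the
 * updated a, at the price of one extra subtraction. */
static void decorrelate_unrolled(int32_t *left, int32_t *right, int n,
                                 int shift, int weight)
{
    for (int i = 0; i < n; i++) {
        int32_t a = left[i], b = right[i];
        int32_t coeff = (b * weight) >> shift;

        b += a;
        a -= coeff;
        b -= coeff;
        left[i]  = b;               /* swap(a, b) on store */
        right[i] = a;
    }
}
```

Both variants compute the same `coeff`, so they produce identical outputs whenever the intermediate sums do not overflow; the sketch makes no attempt to handle overflow.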