sunyuechi
c3a96f97f8
lavc/vp9dsp: R-V V ipred dc
...
C908:
vp9_dc_8x8_8bpp_c: 46.0
vp9_dc_8x8_8bpp_rvv_i64: 41.0
vp9_dc_16x16_8bpp_c: 109.2
vp9_dc_16x16_8bpp_rvv_i32: 72.7
vp9_dc_32x32_8bpp_c: 365.2
vp9_dc_32x32_8bpp_rvv_i32: 165.5
vp9_dc_127_8x8_8bpp_c: 23.0
vp9_dc_127_8x8_8bpp_rvv_i64: 22.0
vp9_dc_127_16x16_8bpp_c: 70.2
vp9_dc_127_16x16_8bpp_rvv_i32: 50.2
vp9_dc_127_32x32_8bpp_c: 295.2
vp9_dc_127_32x32_8bpp_rvv_i32: 136.7
vp9_dc_128_8x8_8bpp_c: 23.0
vp9_dc_128_8x8_8bpp_rvv_i64: 22.0
vp9_dc_128_16x16_8bpp_c: 70.2
vp9_dc_128_16x16_8bpp_rvv_i32: 50.2
vp9_dc_128_32x32_8bpp_c: 295.2
vp9_dc_128_32x32_8bpp_rvv_i32: 136.7
vp9_dc_129_8x8_8bpp_c: 23.0
vp9_dc_129_8x8_8bpp_rvv_i64: 22.0
vp9_dc_129_16x16_8bpp_c: 70.2
vp9_dc_129_16x16_8bpp_rvv_i32: 50.2
vp9_dc_129_32x32_8bpp_c: 295.2
vp9_dc_129_32x32_8bpp_rvv_i32: 136.7
vp9_dc_left_8x8_8bpp_c: 38.0
vp9_dc_left_8x8_8bpp_rvv_i64: 36.0
vp9_dc_left_16x16_8bpp_c: 93.2
vp9_dc_left_16x16_8bpp_rvv_i32: 67.7
vp9_dc_left_32x32_8bpp_c: 333.2
vp9_dc_left_32x32_8bpp_rvv_i32: 158.5
vp9_dc_top_8x8_8bpp_c: 38.7
vp9_dc_top_8x8_8bpp_rvv_i64: 36.0
vp9_dc_top_16x16_8bpp_c: 93.2
vp9_dc_top_16x16_8bpp_rvv_i32: 67.7
vp9_dc_top_32x32_8bpp_c: 333.2
vp9_dc_top_32x32_8bpp_rvv_i32: 156.2
Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
10 months ago
sunyuechi
b41e115dde
lavc/me_cmp: R-V V pix_abs
...
C908:
pix_abs_0_0_c: 534.0
pix_abs_0_0_rvv_i32: 136.2
pix_abs_1_0_c: 287.7
pix_abs_1_0_rvv_i32: 125.2
sad_0_c: 534.0
sad_0_rvv_i32: 136.2
sad_1_c: 287.7
sad_1_rvv_i32: 125.2
Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
1 year ago
sunyuechi
c12053cefc
lavc/vp8dsp: R-V V vp8_idct_dc_add
...
C908:
vp8_idct_dc_add_c: 102.2
vp8_idct_dc_add_rvv_i32: 42.0
Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
1 year ago
sunyuechi
ee08974f90
lavc/rv34dsp: R-V V rv34_inv_transform_dc
...
C908:
rv34_inv_transform_dc_c: 35.5
rv34_inv_transform_dc_rvv_i32: 27.0
Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
1 year ago
sunyuechi
0748d2bbc7
lavc/blockdsp: R-V V clear_block
...
C908:
blockdsp.clear_block_c: 47.2
blockdsp.clear_block_rvv_i64: 28.5
Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
1 year ago
sunyuechi
8e23ebe6f9
lavc/svq1enc: R-V V ssd_int8_vs_int16
...
C908:
ssd_int8_vs_int16_c: 207.7
ssd_int8_vs_int16_rvv_i32: 14.2
Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
1 year ago
sunyuechi
864174dd00
lavc/takdsp: R-V V decorrelate_ls
...
C908:
decorrelate_ls_c: 69.7
decorrelate_ls_rvv_i32: 27.2
Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
1 year ago
sunyuechi
98596f90f4
lavc/aacencdsp: R-V V abs_pow34
...
C908:
abs_pow34_c: 535.5
abs_pow34_rvv_f32: 337.2
Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
1 year ago
Rémi Denis-Courmont
272d0c164d
lavc/lpc: R-V V apply_welch_window
...
apply_welch_window_even_c: 617.5
apply_welch_window_even_rvv_f64: 235.0
apply_welch_window_odd_c: 709.0
apply_welch_window_odd_rvv_f64: 256.5
1 year ago
Rémi Denis-Courmont
b3825bbe45
riscv: test for assembler support
...
This should fix the build on LLVM 16 and earlier, at the cost of turning
all non-RVV optimisations off.
1 year ago
sunyuechi
0b9d009b4a
lavc/vc1dsp: R-V V inv_trans
...
C908:
vc1dsp.vc1_inv_trans_4x4_dc_c: 125.7
vc1dsp.vc1_inv_trans_4x4_dc_rvv_i32: 53.5
vc1dsp.vc1_inv_trans_4x8_dc_c: 230.7
vc1dsp.vc1_inv_trans_4x8_dc_rvv_i32: 65.5
vc1dsp.vc1_inv_trans_8x4_dc_c: 228.7
vc1dsp.vc1_inv_trans_8x4_dc_rvv_i64: 64.5
vc1dsp.vc1_inv_trans_8x8_dc_c: 476.5
vc1dsp.vc1_inv_trans_8x8_dc_rvv_i64: 80.2
Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
1 year ago
sunyuechi
8bdb663062
lavc/ac3dsp: R-V V float_to_fixed24
...
C910:
float_to_fixed24_c: 2207.2
float_to_fixed24_rvv_f32: 696.2
Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
1 year ago
Rémi Denis-Courmont
0fa421c8f1
lavc/llvidencdsp: add R-V V diff_bytes
...
diff_bytes_c: 163.0
diff_bytes_rvv_i32: 52.7
1 year ago
Rémi Denis-Courmont
fbc7adba67
lavc/llviddsp: R-V V add_bytes
...
add_bytes_c: 2077.2
add_bytes_rvv_i32: 105.0
1 year ago
Rémi Denis-Courmont
636ae0e0bc
lavc/flacdsp: R-V V packed decorrelate_{l,r}s
...
flac_decorrelate_ms_16_c: 457.2
flac_decorrelate_ms_16_rvv_i32: 203.0
flac_decorrelate_ms_32_c: 457.2
flac_decorrelate_ms_32_rvv_i32: 203.5
flac_decorrelate_rs_16_c: 456.2
flac_decorrelate_rs_16_rvv_i32: 207.0
flac_decorrelate_rs_32_c: 456.2
flac_decorrelate_rs_32_rvv_i32: 210.5
1 year ago
Rémi Denis-Courmont
45d0eb3f70
lavc/llauddsp: R-V V scalarproduct_and_madd_int16
...
scalarproduct_and_madd_int16_c: 10355.7
scalarproduct_and_madd_int16_rvv_i32: 1480.0
1 year ago
Rémi Denis-Courmont
86bee42473
lavc/sbrdsp: R-V V sum64x5
...
sum64x5_c: 385.0
sum64x5_rvv_f32: 116.0
1 year ago
Rémi Denis-Courmont
73dea2bb91
lavc/jpeg2000dsp: R-V V ict_float
...
jpeg2000_ict_float_c: 3112.2
jpeg2000_ict_float_rvv_f32: 1225.0
1 year ago
Rémi Denis-Courmont
424c8ceb08
lavc/huffyuvdsp: R-V V add_int16
...
add_int16_128_c: 2390.5
add_int16_128_rvv_i32: 832.0
add_int16_rnd_width_c: 2390.2
add_int16_rnd_width_rvv_i32: 832.5
1 year ago
Rémi Denis-Courmont
4aea0da230
lavc/utvideodsp: R-V V restore_rgb_planes
...
restore_rgb_planes_c: 133065.7
restore_rgb_planes_rvv_i32: 33317.2
1 year ago
Rémi Denis-Courmont
3c6516330f
lavc/exrdsp: R-V V reorder_pixels
1 year ago
Rémi Denis-Courmont
89c10d8d20
lavc/ac3: add R-V Zbb extract_exponents
1 year ago
Rémi Denis-Courmont
9bc5676e40
lavc/g722dsp: add RISC-V V DSP function
2 years ago
Arnie Chang
c5508f60c2
lavc/h264chroma: RISC-V V add motion compensation for 8x8 chroma blocks
...
Optimize the put and avg filtering for 8x8 chroma blocks
Signed-off-by: Arnie Chang <arnie.chang@sifive.com>
2 years ago
Rémi Denis-Courmont
8009581912
lavc/opusdsp: RISC-V V (128-bit) postfilter
...
This is implemented for a vector size of 128 bits. Since the scalar
product in the inner loop covers 5 samples or 160 bits, we need a group
multiplier of 2.
To avoid reconfiguring the vector type, the outer loop, which loads
multiple input samples, sticks to the same multiplier. Consequently, the
outer loop loads 8 samples per iteration. This is safe since the minimum
period of the CELT codec is 15 samples.
The same code would also work, albeit needlessly inefficiently, with a
vector length of 256 bits. A proper implementation will follow instead.
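The group-multiplier arithmetic in this message can be checked with a small sketch. The helper names `min_lmul` and `elems_per_group` are hypothetical, and 32-bit samples are assumed (5 samples = 160 bits); this only mirrors the reasoning, not the actual assembly.

```c
#include <assert.h>

/* Hypothetical helper: smallest power-of-two register group (LMUL) able to
 * hold `elems` elements of `elem_bits` bits each with `vlen`-bit vectors. */
static unsigned min_lmul(unsigned elems, unsigned elem_bits, unsigned vlen)
{
    unsigned bits = elems * elem_bits;
    unsigned lmul = 1;

    assert(vlen > 0 && elem_bits > 0);
    while (lmul * vlen < bits)
        lmul *= 2; /* integer group multipliers are powers of two */
    return lmul;
}

/* Hypothetical helper: elements covered by one register group. */
static unsigned elems_per_group(unsigned lmul, unsigned elem_bits,
                                unsigned vlen)
{
    assert(elem_bits > 0);
    return lmul * vlen / elem_bits;
}
```

With 5 samples of 32 bits and 128-bit vectors, `min_lmul` gives 2, and a group of 2 registers holds 8 samples, matching the outer loop. With 256-bit vectors a single register would be enough, which is why reusing the same multiplier there is wasteful.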
2 years ago
Rémi Denis-Courmont
d7528af4df
lavc/bswapdsp: RISC-V V bswap_buf
2 years ago
Rémi Denis-Courmont
f0ef11ea83
lavc/bswapdsp: RISC-V B bswap_buf
...
Simply using the Zbb REV8 instruction in a simple loop gives
significant savings:
bswap_buf_c: 1081.0
bswap_buf_rvb_b: 771.0
But we can also use the 64-bit REV8 as a pseudo-SIMD instruction with
just one additional shift, and one fewer load, effectively doubling the
bandwidth. Consequently, this patch is useful even if the compile-time
target has Zbb enabled for C code:
bswap_buf_c: 1081.0
bswap_buf_rvb_b: 341.0 (this patch)
On the other hand, this approach fails miserably for bswap16_buf as the
ratio of shifts and stores becomes unfavorable compared to naïve C:
bswap16_buf_c: 1542.0
bswap16_buf_rvb_b: 1803.7
Unrolling to process 128 bits (4 samples) at a time actually worsens
performance ever so slightly:
bswap_buf_c: 1081.0
bswap_buf_rvb_b: 408.5
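The pseudo-SIMD trick described in this message can be sketched in portable C. This is an illustration under assumptions, not the committed assembly: `rev8_64` is a hypothetical stand-in for the 64-bit REV8 instruction, `bswap32_buf_pairs` is a hypothetical name, and the word count is assumed to be even.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for 64-bit REV8: reverse the order of all 8 bytes. */
static uint64_t rev8_64(uint64_t x)
{
    x = ((x & 0x00ff00ff00ff00ffULL) << 8)  | ((x >> 8)  & 0x00ff00ff00ff00ffULL);
    x = ((x & 0x0000ffff0000ffffULL) << 16) | ((x >> 16) & 0x0000ffff0000ffffULL);
    return (x << 32) | (x >> 32);
}

/* Byte-swap `len` 32-bit words, two at a time (len assumed even). */
static void bswap32_buf_pairs(uint32_t *dst, const uint32_t *src, int len)
{
    assert(len % 2 == 0);
    for (int i = 0; i < len; i += 2) {
        /* Models what a single 64-bit load yields on little-endian RISC-V:
         * src[i] in the low half, src[i + 1] in the high half. */
        uint64_t pair = (uint64_t)src[i] | ((uint64_t)src[i + 1] << 32);

        /* Reversing 8 bytes swaps the bytes of each word and also
         * exchanges the two words. */
        pair = rev8_64(pair);

        dst[i]     = (uint32_t)(pair >> 32); /* the one additional shift */
        dst[i + 1] = (uint32_t)pair;
    }
}
```

Each iteration handles two words with one reversal and one shift, which is where the roughly doubled throughput would come from.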
2 years ago
Rémi Denis-Courmont
64ab577954
lavc/alacdsp: RISC-V V decorrelate_stereo
...
To avoid data dependencies, this does the following unroll, which
requires one extra but probably free addition:
    coeff = (b * left_weight) >> decorr_shift;
    b += a;
    a -= coeff;
    b -= coeff;
    swap(a, b);
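To see why the unrolled sequence helps, it can be compared with a serial form in which each statement depends on the previous one. The serial form below is an assumption reconstructed from the unrolled steps, and both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed serial form: the addition must wait for the subtraction. */
static void decorrelate_serial(int32_t *a, int32_t *b,
                               int left_weight, int decorr_shift)
{
    int32_t x = *a, y = *b;

    assert(decorr_shift >= 0 && decorr_shift < 32);
    x -= (y * left_weight) >> decorr_shift;
    y += x;          /* depends on the updated x */
    *a = y;
    *b = x;
}

/* Unrolled form from the message: `b += a` does not depend on coeff,
 * so it can proceed while the multiply and shift are still in flight. */
static void decorrelate_unrolled(int32_t *a, int32_t *b,
                                 int left_weight, int decorr_shift)
{
    int32_t x = *a, y = *b, coeff, tmp;

    assert(decorr_shift >= 0 && decorr_shift < 32);
    coeff = (y * left_weight) >> decorr_shift;
    y += x;
    x -= coeff;
    y -= coeff;      /* the one extra addition */
    tmp = x; x = y; y = tmp; /* swap(a, b) */
    *a = x;
    *b = y;
}
```

Both forms end with `a = a + b - coeff` and `b = a - coeff` in terms of the input values; the unrolled one simply shortens the dependency chain.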
2 years ago
Rémi Denis-Courmont
676b08cb70
lavc/pixblockdsp: RISC-V V 8-bit get_pixels & get_pixels_unaligned
2 years ago
Rémi Denis-Courmont
b29ee63a1b
lavc/idctdsp: RISC-V V put_pixels_clamped function
2 years ago
Rémi Denis-Courmont
b0cacf4c3f
lavc/aacpsdsp: RISC-V V add_squares
2 years ago
Rémi Denis-Courmont
453aba71e6
lavc/vorbisdsp: RISC-V V inverse_coupling
...
This uses the following vectorisation:
    for (i = 0; i < blocksize; i++) {
        float a = ang[i]; /* keep the original angle for both updates */
        ang[i] = mag[i] - copysignf(fmaxf(a, 0.f), mag[i]);
        mag[i] = mag[i] - copysignf(fminf(a, 0.f), mag[i]);
    }
2 years ago
Rémi Denis-Courmont
47a10b9a99
lavc/fmtconvert: RISC-V V int32_to_float_fmul_scalar
2 years ago
Rémi Denis-Courmont
27da9514c3
lavc/audiodsp: RISC-V V vector_clip_int32
2 years ago
Rémi Denis-Courmont
1edac8eb46
lavc/pixblockdsp: RISC-V I get_pixels
...
Benchmarks on SiFive U74-MC (courtesy of Shanghai StarFive Tech):
get_pixels_c: 180.0
get_pixels_rvi: 136.7
2 years ago
Rémi Denis-Courmont
04d092e7d5
lavc/audiodsp: RISC-V F vector_clipf
...
RV64G supports MIN & MAX instructions natively only on floating point
registers, not general purpose ones. The latter would require the Zbb
extension. Due to that, it is actually faster to perform the clipping
"properly" in the FPU.
Benchmarks on SiFive U74-MC (courtesy of Shanghai StarFive Tech):
audiodsp.vector_clipf_c: 29551.5
audiodsp.vector_clipf_rvf: 17871.0
Unrolling by 2 or 8 elements was also tried, but it gets worse either way.
2 years ago