sunyuechi
0cc8661499
lavc/vp9dsp: R-V V ipred hor
...
C908:
vp9_hor_8x8_8bpp_c: 74.7
vp9_hor_8x8_8bpp_rvv_i32: 35.7
vp9_hor_16x16_8bpp_c: 175.5
vp9_hor_16x16_8bpp_rvv_i32: 80.2
vp9_hor_32x32_8bpp_c: 510.2
vp9_hor_32x32_8bpp_rvv_i32: 264.0
Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
9 months ago
sunyuechi
b82d9f55d1
lavc/vp9dsp: R-V mc copy
...
C908:
vp9_put4_8bpp_c: 0.7
vp9_put4_8bpp_rvi: 0.5
vp9_put8_8bpp_c: 2.5
vp9_put8_8bpp_rvi: 0.5
vp9_put16_8bpp_c: 16.7
vp9_put16_8bpp_rvi: 1.5
vp9_put32_8bpp_c: 37.2
vp9_put32_8bpp_rvi: 5.7
vp9_put64_8bpp_c: 107.5
vp9_put64_8bpp_rvi: 21.7
Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
9 months ago
sunyuechi
aa9dbd91cf
lavc/vp9dsp: R-V ipred vert
...
C908:
vp9_vert_8x8_8bpp_c: 22.0
vp9_vert_8x8_8bpp_rvi: 15.7
vp9_vert_16x16_8bpp_c: 71.2
vp9_vert_16x16_8bpp_rvi: 39.0
vp9_vert_32x32_8bpp_c: 300.2
vp9_vert_32x32_8bpp_rvi: 135.2
Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
9 months ago
Rémi Denis-Courmont
a3e45063c0
lavc/flacdsp: fix CPU requirement for 32-bit LPC
9 months ago
Rémi Denis-Courmont
9d3f561721
lavc/vp8dsp: restrict RVI optimisations
...
They are actually awfully slow if the CPU does not support misaligned
accesses natively, so only use them if misaligned accesses are fast.
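A minimal sketch of that restriction in plain C. The flag bits, the function-pointer type and all `sk_` names are hypothetical stand-ins, not FFmpeg identifiers; the sketch only shows the wide routine being selected when both the base ISA flag and a fast-misaligned flag are present.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical CPU flag bits, for illustration only. */
enum {
    SK_FLAG_RVI             = 1 << 0, /* base integer ISA available */
    SK_FLAG_FAST_MISALIGNED = 1 << 1, /* misaligned accesses are fast */
};

typedef void (*sk_put_fn)(uint8_t *dst, const uint8_t *src, size_t n);

/* Portable fallback: byte-wise copy, no alignment assumptions. */
static void sk_put_c(uint8_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Stand-in for a routine built on wide, possibly misaligned accesses. */
static void sk_put_wide(uint8_t *dst, const uint8_t *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    memcpy(dst, src, n);
}

/* Pick the wide routine only if misaligned accesses are known to be fast. */
static sk_put_fn sk_select_put(int flags)
{
    const int need = SK_FLAG_RVI | SK_FLAG_FAST_MISALIGNED;

    if ((flags & need) == need)
        return sk_put_wide;
    return sk_put_c;
}
```

With only `SK_FLAG_RVI` set, the selector keeps the C fallback, which is the behaviour the commit message asks for on CPUs that handle misaligned accesses slowly.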
9 months ago
Rémi Denis-Courmont
cdcb4b98b7
lavc/riscv: use ff_rv_vlen_least()
10 months ago
Rémi Denis-Courmont
38e7b0ecf8
lavc/vp9dsp: fix indentation
10 months ago
Rémi Denis-Courmont
0d9591841b
lavc/ac3dsp: add R-V Zvbb extract_exponents
10 months ago
Rémi Denis-Courmont
c07af340ae
lavc/riscv: explicitly require Zbb for MIN
10 months ago
sunyuechi
6e77af1c22
lavc/vp8dsp: R-V V put_epel v
...
C908:
vp8_put_epel4_v4_c: 11.0
vp8_put_epel4_v4_rvv_i32: 5.0
vp8_put_epel4_v6_c: 16.5
vp8_put_epel4_v6_rvv_i32: 6.2
vp8_put_epel8_v4_c: 43.7
vp8_put_epel8_v4_rvv_i32: 11.2
vp8_put_epel8_v6_c: 68.7
vp8_put_epel8_v6_rvv_i32: 13.2
vp8_put_epel16_v4_c: 92.5
vp8_put_epel16_v4_rvv_i32: 13.7
vp8_put_epel16_v6_c: 135.7
vp8_put_epel16_v6_rvv_i32: 16.5
Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
10 months ago
sunyuechi
109daea619
lavc/vp8dsp: R-V V put_epel h
...
C908:
vp8_put_epel4_h4_c: 10.7
vp8_put_epel4_h4_rvv_i32: 5.0
vp8_put_epel4_h6_c: 15.0
vp8_put_epel4_h6_rvv_i32: 6.2
vp8_put_epel8_h4_c: 43.2
vp8_put_epel8_h4_rvv_i32: 11.2
vp8_put_epel8_h6_c: 57.5
vp8_put_epel8_h6_rvv_i32: 13.5
vp8_put_epel16_h4_c: 92.5
vp8_put_epel16_h4_rvv_i32: 13.7
vp8_put_epel16_h6_c: 139.0
vp8_put_epel16_h6_rvv_i32: 16.5
Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
10 months ago
sunyuechi
538f217bbb
lavc/vp8dsp: R-V V put_bilin_hv
...
C908:
vp8_put_bilin4_hv_c: 561.0
vp8_put_bilin4_hv_rvv_i32: 232.7
vp8_put_bilin8_hv_c: 2162.7
vp8_put_bilin8_hv_rvv_i32: 506.7
vp8_put_bilin16_hv_c: 4769.7
vp8_put_bilin16_hv_rvv_i32: 556.7
Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
10 months ago
sunyuechi
bb5039b3cb
lavc/vp8dsp: R-V V put_bilin_h v
...
C908:
vp8_put_bilin4_h_c: 367.0
vp8_put_bilin4_h_rvv_i32: 137.7
vp8_put_bilin4_v_c: 377.0
vp8_put_bilin4_v_rvv_i32: 137.7
vp8_put_bilin8_h_c: 1431.0
vp8_put_bilin8_h_rvv_i32: 297.5
vp8_put_bilin8_v_c: 1449.0
vp8_put_bilin8_v_rvv_i32: 297.5
vp8_put_bilin16_h_c: 2839.0
vp8_put_bilin16_h_rvv_i32: 344.7
vp8_put_bilin16_v_c: 2857.0
vp8_put_bilin16_v_rvv_i32: 344.7
Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
10 months ago
sunyuechi
0b8e5e5a00
lavc/vp8dsp: R-V put_vp8_pixels
...
C908:
vp8_put_pixels4_c: 78.0
vp8_put_pixels4_rvi: 33.7
vp8_put_pixels8_c: 278.0
vp8_put_pixels8_rvi: 55.0
vp8_put_pixels16_c: 999.0
vp8_put_pixels16_rvi: 86.7
Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
10 months ago
Rémi Denis-Courmont
f8715d0300
lavc/vp9dsp: fix compilation with llvm-as
10 months ago
Rémi Denis-Courmont
9e77188cba
lavc/ac3dsp: R-V Zbb ac3_exponent_min
...
SiFive U74:
ac3_exponent_min_reuse0_c: 10.0
ac3_exponent_min_reuse0_rvb_b: 8.0
ac3_exponent_min_reuse1_c: 2924.7
ac3_exponent_min_reuse1_rvb_b: 1803.0
ac3_exponent_min_reuse2_c: 5043.0
ac3_exponent_min_reuse2_rvb_b: 2827.5
ac3_exponent_min_reuse3_c: 7028.7
ac3_exponent_min_reuse3_rvb_b: 3872.0
ac3_exponent_min_reuse4_c: 8824.2
ac3_exponent_min_reuse4_rvb_b: 5122.2
ac3_exponent_min_reuse5_c: 10487.5
ac3_exponent_min_reuse5_rvb_b: 6412.2
10 months ago
Rémi Denis-Courmont
38f67a32b3
lavc/ac3dsp: R-V V min_exponents
...
T-Head C908:
ac3_exponent_min_reuse0_c: 7.5
ac3_exponent_min_reuse0_rvv_i32: 7.5
ac3_exponent_min_reuse1_c: 1820.7
ac3_exponent_min_reuse1_rvv_i32: 102.5
ac3_exponent_min_reuse2_c: 3088.5
ac3_exponent_min_reuse2_rvv_i32: 138.7
ac3_exponent_min_reuse3_c: 5073.7
ac3_exponent_min_reuse3_rvv_i32: 174.7
ac3_exponent_min_reuse4_c: 4624.2
ac3_exponent_min_reuse4_rvv_i32: 204.2
ac3_exponent_min_reuse5_c: 5138.7
ac3_exponent_min_reuse5_rvv_i32: 238.0
10 months ago
sunyuechi
5bc3b7f513
lavc/rv40dsp: R-V V chroma_mc
...
This is similar to h264, but here we use manual_avg instead of vaaddu
because rv40's OP differs from h264. If we use vaaddu,
rv40 would need to repeatedly switch between vxrm=0 and vxrm=2,
and switching vxrm is very slow.
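The rounding issue can be shown in scalar terms. In the RISC-V vector fixed-point rounding modes, vxrm=0 is round-to-nearest-up and vxrm=2 is round-down (truncate). The sketch below models an averaging add under both modes, plus a "manual" average that adds the rounding bias explicitly and therefore does not depend on the rounding mode. The `sk_` function names are hypothetical, and this models the arithmetic only; it is not the code from the commit.

```c
#include <assert.h>
#include <stdint.h>

/* Averaging add as under vxrm=0 (rnu): the bit shifted out rounds up. */
static uint8_t sk_avg_rnu(uint8_t a, uint8_t b)
{
    unsigned sum = (unsigned)a + b;

    return (uint8_t)((sum >> 1) + (sum & 1));
}

/* Averaging add as under vxrm=2 (rdn): the bit shifted out is dropped. */
static uint8_t sk_avg_rdn(uint8_t a, uint8_t b)
{
    return (uint8_t)(((unsigned)a + b) >> 1);
}

/* Manual average: widen, add the +1 bias explicitly, then truncate. */
static uint8_t sk_manual_avg(uint8_t a, uint8_t b)
{
    unsigned wide = (unsigned)a + b + 1;

    assert(wide <= 511); /* two 8-bit inputs plus the bias */
    return (uint8_t)(wide >> 1);
}
```

The manual form matches the round-to-nearest-up result while using only a truncating shift, so no rounding-mode switch is needed around it.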
C908:
avg_chroma_mc4_c: 2330.0
avg_chroma_mc4_rvv_i32: 602.7
avg_chroma_mc8_c: 1211.0
avg_chroma_mc8_rvv_i32: 602.7
put_chroma_mc4_c: 1825.0
put_chroma_mc4_rvv_i32: 414.7
put_chroma_mc8_c: 932.0
put_chroma_mc8_rvv_i32: 414.7
Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
10 months ago
sunyuechi
7d0673db7e
lavc/blockdsp: R-V V fill_block
...
C908:
blockdsp.fill_block_tab[0]_c: 549.7
blockdsp.fill_block_tab[0]_rvv_i64: 48.2
blockdsp.fill_block_tab[1]_c: 77.0
blockdsp.fill_block_tab[1]_rvv_i64: 19.7
Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
10 months ago
Rémi Denis-Courmont
6cd97cd797
lavc/ac3dsp: R-V V sum_square_butterfly_float
...
As we do not need to widen accumulators to 64 bits, we effectively get
double capacity for unrolling compared to the integer function. This
explains the slightly better performance gains.
ac3_sum_square_bufferfly_float_c: 65.2
ac3_sum_square_bufferfly_float_rvv_f32: 12.2
10 months ago
Rémi Denis-Courmont
6459966beb
lavc/ac3dsp: R-V V sum_square_butterfly_int32
...
ac3_sum_square_bufferfly_int32_c: 61.0
ac3_sum_square_bufferfly_int32_rvv_i64: 14.7
10 months ago
Andreas Rheinhardt
08781ebe1a
avcodec/riscv/vp9dsp: Fix inclusion guard
...
Fixes fate-source.
Reviewed-by: Jan Ekström <jeebjp@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
10 months ago
sunyuechi
c3a96f97f8
lavc/vp9dsp: R-V V ipred dc
...
C908:
vp9_dc_8x8_8bpp_c: 46.0
vp9_dc_8x8_8bpp_rvv_i64: 41.0
vp9_dc_16x16_8bpp_c: 109.2
vp9_dc_16x16_8bpp_rvv_i32: 72.7
vp9_dc_32x32_8bpp_c: 365.2
vp9_dc_32x32_8bpp_rvv_i32: 165.5
vp9_dc_127_8x8_8bpp_c: 23.0
vp9_dc_127_8x8_8bpp_rvv_i64: 22.0
vp9_dc_127_16x16_8bpp_c: 70.2
vp9_dc_127_16x16_8bpp_rvv_i32: 50.2
vp9_dc_127_32x32_8bpp_c: 295.2
vp9_dc_127_32x32_8bpp_rvv_i32: 136.7
vp9_dc_128_8x8_8bpp_c: 23.0
vp9_dc_128_8x8_8bpp_rvv_i64: 22.0
vp9_dc_128_16x16_8bpp_c: 70.2
vp9_dc_128_16x16_8bpp_rvv_i32: 50.2
vp9_dc_128_32x32_8bpp_c: 295.2
vp9_dc_128_32x32_8bpp_rvv_i32: 136.7
vp9_dc_129_8x8_8bpp_c: 23.0
vp9_dc_129_8x8_8bpp_rvv_i64: 22.0
vp9_dc_129_16x16_8bpp_c: 70.2
vp9_dc_129_16x16_8bpp_rvv_i32: 50.2
vp9_dc_129_32x32_8bpp_c: 295.2
vp9_dc_129_32x32_8bpp_rvv_i32: 136.7
vp9_dc_left_8x8_8bpp_c: 38.0
vp9_dc_left_8x8_8bpp_rvv_i64: 36.0
vp9_dc_left_16x16_8bpp_c: 93.2
vp9_dc_left_16x16_8bpp_rvv_i32: 67.7
vp9_dc_left_32x32_8bpp_c: 333.2
vp9_dc_left_32x32_8bpp_rvv_i32: 158.5
vp9_dc_top_8x8_8bpp_c: 38.7
vp9_dc_top_8x8_8bpp_rvv_i64: 36.0
vp9_dc_top_16x16_8bpp_c: 93.2
vp9_dc_top_16x16_8bpp_rvv_i32: 67.7
vp9_dc_top_32x32_8bpp_c: 333.2
vp9_dc_top_32x32_8bpp_rvv_i32: 156.2
Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
10 months ago
Andreas Rheinhardt
88b3b09afa
avcodec/aacenc: Move initializing DSP out of aacenc.c
...
Otherwise aacenc.o gets pulled in by the aacencdsp checkasm
test and it in turn pulls the rest of lavc in.
Besides being bad size-wise this also has the downside that
it pulls in avpriv_(cga|vga16)_font from libavutil which are
marked as being imported from another library when building
libavcodec as a DLL and this breaks checkasm because it links
both lavc and lavu statically.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
12 months ago
sunyuechi
a7ad76fbbf
lavc/me_cmp: R-V V nsse
...
C908:
nsse_0_c: 1990.0
nsse_0_rvv_i32: 572.0
nsse_1_c: 910.0
nsse_1_rvv_i32: 456.0
Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
1 year ago
sunyuechi
9b90d0d36a
lavc/me_cmp: R-V V vsse vsad intra
...
C908:
vsad_4_c: 681.0
vsad_4_rvv_i32: 182.2
vsad_5_c: 278.0
vsad_5_rvv_i32: 145.2
vsse_4_c: 595.0
vsse_4_rvv_i32: 125.2
vsse_5_c: 281.0
vsse_5_rvv_i32: 101.2
Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
1 year ago
sunyuechi
925b55a5e8
lavc/me_cmp: R-V V vsse vsad
...
C908:
vsad_0_c: 936.0
vsad_0_rvv_i32: 236.2
vsad_1_c: 424.0
vsad_1_rvv_i32: 190.2
vsse_0_c: 877.0
vsse_0_rvv_i32: 204.2
vsse_1_c: 439.0
vsse_1_rvv_i32: 140.2
Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
1 year ago
sunyuechi
9cb8f262f2
lavc/me_cmp: R-V V sse
...
C908:
sse_0_c: 614.7
sse_0_rvv_i32: 138.2
sse_1_c: 302.7
sse_1_rvv_i32: 107.2
sse_2_c: 175.7
sse_2_rvv_i32: 104.2
Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
1 year ago
sunyuechi
37463d7979
lavc/me_cmp: R-V V pix_abs_y2
...
C908:
pix_abs_0_2_c: 904.0
pix_abs_0_2_rvv_i32: 172.2
pix_abs_1_2_c: 460.0
pix_abs_1_2_rvv_i32: 168.2
Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
1 year ago
sunyuechi
f1ec475f66
lavc/me_cmp: R-V V pix_abs_x2
...
C908:
pix_abs_0_1_c: 767.0
pix_abs_0_1_rvv_i32: 196.2
pix_abs_1_1_c: 388.0
pix_abs_1_1_rvv_i32: 185.2
Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
1 year ago
sunyuechi
b41e115dde
lavc/me_cmp: R-V V pix_abs
...
C908:
pix_abs_0_0_c: 534.0
pix_abs_0_0_rvv_i32: 136.2
pix_abs_1_0_c: 287.7
pix_abs_1_0_rvv_i32: 125.2
sad_0_c: 534.0
sad_0_rvv_i32: 136.2
sad_1_c: 287.7
sad_1_rvv_i32: 125.2
Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
1 year ago
sunyuechi
d897bbb48d
lavc/vp8dsp: R-V V vp8_idct_dc_add4uv
...
C908:
vp8_idct_dc_add4uv_c: 387.7
vp8_idct_dc_add4uv_rvv_i32: 134.5
Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
1 year ago
sunyuechi
e74e18cae4
lavc/vp8dsp: R-V V vp8_idct_dc_add4y
...
C908:
vp8_idct_dc_add4y_c: 368.5
vp8_idct_dc_add4y_rvv_i32: 134.5
Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
1 year ago
sunyuechi
c12053cefc
lavc/vp8dsp: R-V V vp8_idct_dc_add
...
C908:
vp8_idct_dc_add_c: 102.2
vp8_idct_dc_add_rvv_i32: 42.0
Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
1 year ago
sunyuechi
89189dd9e7
lavc/rv34dsp: R-V V rv34_idct_dc_add
...
C908:
rv34_idct_dc_add_c: 134.7
rv34_idct_dc_add_rvv_i32: 45.5
Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
1 year ago
sunyuechi
ee08974f90
lavc/rv34dsp: R-V V rv34_inv_transform_dc
...
C908:
rv34_inv_transform_dc_c: 35.5
rv34_inv_transform_dc_rvv_i32: 27.0
Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
1 year ago
sunyuechi
fdebde817c
lavc/blockdsp: R-V V clear_blocks
...
C908:
blockdsp.clear_blocks_c: 128.2
blockdsp.clear_blocks_rvv_i64: 102.5
Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
1 year ago
sunyuechi
0748d2bbc7
lavc/blockdsp: R-V V clear_block
...
C908:
blockdsp.clear_block_c: 47.2
blockdsp.clear_block_rvv_i64: 28.5
Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
1 year ago
sunyuechi
8e23ebe6f9
lavc/svq1enc: R-V V ssd_int8_vs_int16
...
C908:
ssd_int8_vs_int16_c: 207.7
ssd_int8_vs_int16_rvv_i32: 14.2
Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
1 year ago
Rémi Denis-Courmont
278b4b60d6
lavc/takdsp: R-V V decorrelate_sf
...
decorrelate_sf_c: 259.2
decorrelate_sf_rvv_i32: 45.5
1 year ago
sunyuechi
3d39b8d4e7
lavc/takdsp: R-V V decorrelate_sm
...
C908:
decorrelate_sm_c: 130.0
decorrelate_sm_rvv_i32: 43.2
Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
(with minor changes)
1 year ago
James Almer
46775e64f8
avcodec/takdsp: fix const correctness
...
Signed-off-by: James Almer <jamrial@gmail.com>
1 year ago
sunyuechi
c933ff2779
lavc/takdsp: R-V V decorrelate_sr
...
C908:
decorrelate_sr_c: 95.5
decorrelate_sr_rvv_i32: 28.2
Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
1 year ago
sunyuechi
864174dd00
lavc/takdsp: R-V V decorrelate_ls
...
C908:
decorrelate_ls_c: 69.7
decorrelate_ls_rvv_i32: 27.2
Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
1 year ago
Rémi Denis-Courmont
cdd38a2ffe
lavc/aacpsdsp: fix R-V V stereo interpolate
...
The penultimate loop iteration could pick any vl such that:
    vlenb/4 < vl <= vlenb/2
Thus if the total length is not a multiple of vlenb/2, the vfadd.vf
on the penultimate iteration would yield corrupt values for the last
iteration.
To avoid this, force vl = vlenb/2 until the last iteration. Unfortunately
this latent bug is not reproducible with either hardware or QEMU as of now.
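The vector-length rule behind this fix can be modelled in scalar C. The RISC-V vector specification lets `vsetvli` return any vl with ceil(AVL/2) <= vl <= VLMAX when VLMAX < AVL < 2*VLMAX. The sketch below contrasts a hypothetical implementation that picks the smallest permitted value with a loop that uses VLMAX on every iteration but the last. All `sk_` names are illustrative, and `vlmax` stands in for the upper bound quoted in the message.

```c
#include <assert.h>
#include <stddef.h>

/* Model of a vsetvli that balances the last two iterations by
 * returning the smallest vl the specification permits. */
static size_t sk_vsetvl_balanced(size_t avl, size_t vlmax)
{
    assert(vlmax > 0);
    if (avl <= vlmax)
        return avl;
    if (avl < 2 * vlmax)
        return (avl + 1) / 2; /* ceil(avl / 2) */
    return vlmax;
}

/* Model of the fix: vlmax on every iteration except the last. */
static size_t sk_vsetvl_forced(size_t avl, size_t vlmax)
{
    assert(vlmax > 0);
    return avl < vlmax ? avl : vlmax;
}

/* Record the vl chosen on each iteration; returns the iteration count. */
static size_t sk_trace(size_t len, size_t vlmax,
                       size_t (*pick)(size_t, size_t),
                       size_t *vls, size_t max_iters)
{
    size_t n = 0;

    while (len > 0 && n < max_iters) {
        size_t vl = pick(len, vlmax);

        vls[n++] = vl;
        len -= vl;
    }
    return n;
}
```

For a length of 21 and a `vlmax` of 8, the balanced model yields 8, 7, 6 while the forced one yields 8, 8, 5. Code that assumes every non-final iteration processes `vlmax` elements only holds in the second case.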
1 year ago
Rémi Denis-Courmont
db32f75c63
lavc/opusdsp: simplify R-V V postfilter
...
This skips the round-trip to scalar register for the sliding 'x'
coefficients, improving performance by about 5%. The trick here is that
the vector slide-up instruction preserves elements in destination vector
until the slide offset.
The switch from vfslide1up.vf to vslideup.vi also allows the elimination
of data dependencies on consecutive slides. Since the specifications
recommend sticking to power of two offsets, we could slide as follows:
    vslideup.vi v8, v0, 2
    vslideup.vi v4, v0, 1
    vslideup.vi v12, v8, 1
    vslideup.vi v16, v8, 2
However in the device under test, this seems to make performance slightly
worse, so this is left for (in)validation with future better hardware.
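The property of the slide-up instruction that this relies on can be written as a scalar model. This is a sketch of unmasked `vslideup` semantics only, with a hypothetical function name, and it ignores tail handling.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar model of vslideup: elements below `offset` in dst are kept,
 * the rest are taken from src shifted up by `offset`. */
static void sk_slideup(float *dst, const float *src, size_t vl, size_t offset)
{
    assert(dst != src); /* destination and source must not overlap */
    for (size_t i = offset; i < vl; i++)
        dst[i] = src[i - offset];
}
```

Sliding {1, 2, 3, 4} up by 2 into a destination holding {9, 9, 9, 9} gives {9, 9, 1, 2}: the two lowest elements survive, which is what allows values already in the destination to be kept without a round-trip through a scalar register.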
1 year ago
Rémi Denis-Courmont
419145c11b
lavc/vc1dsp: fix R-V V vector lengths
...
The 8x4 and 4x4 use a needlessly large multiplier (unless/until we care
about embedded 64-bit-vector hardware). This is merely suboptimal.
The 8x4 case also uses an incorrect vector length, which leads to incorrect
behaviour on future/hypothetical hardware with 256-bit or larger vectors.
Pointed-out-by: Martin Storsjö <martin@martin.st>
1 year ago
Martin Storsjö
b51d9eb58e
riscv: vc1dsp: Don't check vlenb before checking the CPU flags
...
We can't call ff_get_rv_vlenb() if we don't have RVV available
at all.
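A sketch of the ordering, assuming a hypothetical flag bit and a stand-in for the vector-length query (neither is the FFmpeg API). The call counter exists only to show that the query never runs when the flag is absent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { SK_FLAG_RVV = 1 << 0 }; /* hypothetical flag bit */

static int sk_vlenb_calls; /* counts queries, to show when the helper runs */

/* Stand-in for a helper that reads the vector length in bytes; a real
 * one is only valid to call when the vector extension is present. */
static size_t sk_get_vlenb(void)
{
    sk_vlenb_calls++;
    return 16;
}

/* Flags first: && short-circuits, so the length is never queried
 * when the vector extension is absent. */
static bool sk_can_use_vector(int flags, size_t min_vlenb)
{
    assert(min_vlenb > 0);
    return (flags & SK_FLAG_RVV) && sk_get_vlenb() >= min_vlenb;
}
```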
Acked-by: Rémi Denis-Courmont <remi@remlab.net>
Signed-off-by: Martin Storsjö <martin@martin.st>
1 year ago
Rémi Denis-Courmont
918b3ed2d5
lavc/lpc: R-V V compute_autocorr
...
The loop iterates over the length of the vector, not the order. This is
to avoid reloading the same data for each lag value. However this means
the loop only works if the maximum order is no larger than VLENB.
The loop is roughly equivalent to:
    for (size_t j = 0; j < lag; j++)
        autoc[j] = 1.;

    while (len > lag) {
        for (ptrdiff_t j = 0; j < lag; j++)
            autoc[j] += data[j] * *data;
        data++;
        len--;
    }

    while (len > 0) {
        for (ptrdiff_t j = 0; j < len; j++)
            autoc[j] += data[j] * *data;
        data++;
        len--;
    }
Since register pressure is only at 50%, it should be possible to implement
the same loop for order up to 2xVLENB. But this is left for future work.
Performance numbers are all over the place from ~1.25x to ~4x speedups,
but at least they are always noticeably better than nothing.
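For reference, the loop from the message can be wrapped in a function and checked against a direct per-lag sum. This is a scalar sketch with hypothetical `sk_` names, where `lag` is the number of lag values computed, as in the pseudocode; it is not the vector implementation.

```c
#include <assert.h>
#include <stddef.h>

/* The loop from the message: each pass multiplies the current sample by
 * the samples that follow it and accumulates one product per lag. */
static void sk_autocorr_loop(const double *data, ptrdiff_t len,
                             ptrdiff_t lag, double *autoc)
{
    assert(len >= 0 && lag >= 0);
    for (ptrdiff_t j = 0; j < lag; j++)
        autoc[j] = 1.;

    while (len > lag) {
        for (ptrdiff_t j = 0; j < lag; j++)
            autoc[j] += data[j] * *data;
        data++;
        len--;
    }

    /* Tail: fewer than lag samples remain, so fewer lags are updated. */
    while (len > 0) {
        for (ptrdiff_t j = 0; j < len; j++)
            autoc[j] += data[j] * *data;
        data++;
        len--;
    }
}

/* Direct definition for one lag j, for comparison. */
static double sk_autocorr_direct(const double *data, ptrdiff_t len, ptrdiff_t j)
{
    double sum = 1.;

    for (ptrdiff_t i = 0; i + j < len; i++)
        sum += data[i] * data[i + j];
    return sum;
}
```

Both forms visit the same products; the loop form reads each sample once as the multiplier instead of once per lag, which is the reload the message says it avoids.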
1 year ago
sunyuechi
98596f90f4
lavc/aacencdsp: R-V V abs_pow34
...
C908:
abs_pow34_c: 535.5
abs_pow34_rvv_f32: 337.2
Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
1 year ago