Anton Khirnov
3f9ca51015
lavc/opus*: move to opus/ subdir
5 months ago
Ramiro Polla
6aafe61285
avcodec/mpegvideoencdsp: convert stride parameters from int to ptrdiff_t
5 months ago
Rémi Denis-Courmont
7d1dda4892
lavc/h264dsp: R-V V loop_filter_chroma
...
T-Head C908:
h264_v_loop_filter_chroma_8bpp_c: 137.4
h264_v_loop_filter_chroma_8bpp_rvv_i32: 54.2
5 months ago
Rémi Denis-Courmont
3a53656837
lavc/h264dsp: do not write back unmodified rows in R-V V loop filter
5 months ago
Rémi Denis-Courmont
d8fb44c0aa
lavc/mpegvideoencdsp: R-V V add_8x8basis
...
T-Head C908:
add_8x8basis_c: 440.6
add_8x8basis_rvv_i32: 70.3
SpacemiT X60:
add_8x8basis_c: 436.3
add_8x8basis_rvv_i32: 40.5
5 months ago
Rémi Denis-Courmont
1907dd7f23
lavc/mpegvideoencdsp: R-V V try_8x8basis
...
T-Head C908:
try_8x8basis_c: 922.5
try_8x8basis_rvv_i32: 135.3
SpacemiT X60:
try_8x8basis_c: 926.1
try_8x8basis_rvv_i32: 103.1
5 months ago
Rémi Denis-Courmont
0fd37c00d7
lavc/mpegvideoencdsp: R-V V pix_norm1
...
T-Head C908:
pix_norm1_c: 480.2
pix_norm1_rvv_i64: 146.9
SpacemiT X60:
pix_norm1_c: 478.2
pix_norm1_rvv_i64: 92.7
5 months ago
Rémi Denis-Courmont
63d016aea5
lavc/mpegvideoencdsp: R-V V pix_sum
...
T-Head C908:
pix_sum_c: 332.2
pix_sum_rvv_i64: 91.2
SpacemiT X60:
pix_sum_c: 321.2
pix_sum_rvv_i64: 60.9
5 months ago
sunyuechi
4e7b5ac48f
lavc/vp9dsp: R-V V mc bilin hv
...
                                    C908     X60
vp9_avg_bilin_4hv_8bpp_c        :   10.7     9.5
vp9_avg_bilin_4hv_8bpp_rvv_i32  :    4.0     3.5
vp9_avg_bilin_8hv_8bpp_c        :   38.5    34.2
vp9_avg_bilin_8hv_8bpp_rvv_i32  :    7.2     6.5
vp9_avg_bilin_16hv_8bpp_c       :  147.2   130.5
vp9_avg_bilin_16hv_8bpp_rvv_i32 :   14.5    12.7
vp9_avg_bilin_32hv_8bpp_c       :  574.2   509.7
vp9_avg_bilin_32hv_8bpp_rvv_i32 :   42.5    38.0
vp9_avg_bilin_64hv_8bpp_c       : 2321.2  2017.7
vp9_avg_bilin_64hv_8bpp_rvv_i32 :  163.5   131.0
vp9_put_bilin_4hv_8bpp_c        :   10.0     8.7
vp9_put_bilin_4hv_8bpp_rvv_i32  :    3.5     3.0
vp9_put_bilin_8hv_8bpp_c        :   35.2    31.2
vp9_put_bilin_8hv_8bpp_rvv_i32  :    6.5     5.7
vp9_put_bilin_16hv_8bpp_c       :  134.0   119.0
vp9_put_bilin_16hv_8bpp_rvv_i32 :   12.7    11.5
vp9_put_bilin_32hv_8bpp_c       :  538.5   464.2
vp9_put_bilin_32hv_8bpp_rvv_i32 :   39.7    35.2
vp9_put_bilin_64hv_8bpp_c       : 2111.7  1833.2
vp9_put_bilin_64hv_8bpp_rvv_i32 :  138.5   122.5
Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
5 months ago
sunyuechi
9edd2e723b
lavc/vp9dsp: R-V V mc bilin h v
...
                                    C908     X60
vp9_avg_bilin_4h_8bpp_c         :    5.5     4.7
vp9_avg_bilin_4h_8bpp_rvv_i32   :    1.7     1.5
vp9_avg_bilin_4v_8bpp_c         :    5.5     4.7
vp9_avg_bilin_4v_8bpp_rvv_i32   :    1.5     1.2
vp9_avg_bilin_8h_8bpp_c         :   20.0    17.7
vp9_avg_bilin_8h_8bpp_rvv_i32   :    3.0     2.7
vp9_avg_bilin_8v_8bpp_c         :   20.7    18.7
vp9_avg_bilin_8v_8bpp_rvv_i32   :    3.0     2.7
vp9_avg_bilin_16h_8bpp_c        :   78.2    69.7
vp9_avg_bilin_16h_8bpp_rvv_i32  :    7.0     6.2
vp9_avg_bilin_16v_8bpp_c        :   98.5    73.2
vp9_avg_bilin_16v_8bpp_rvv_i32  :    7.0     6.0
vp9_avg_bilin_32h_8bpp_c        :  325.5   275.5
vp9_avg_bilin_32h_8bpp_rvv_i32  :   23.0    20.5
vp9_avg_bilin_32v_8bpp_c        :  342.2   290.0
vp9_avg_bilin_32v_8bpp_rvv_i32  :   21.7    19.5
vp9_avg_bilin_64h_8bpp_c        : 1263.7  1095.7
vp9_avg_bilin_64h_8bpp_rvv_i32  :   91.2    81.2
vp9_avg_bilin_64v_8bpp_c        : 1331.7  1155.2
vp9_avg_bilin_64v_8bpp_rvv_i32  :   91.2    81.0
vp9_put_bilin_4h_8bpp_c         :    4.5     4.0
vp9_put_bilin_4h_8bpp_rvv_i32   :    1.0     1.0
vp9_put_bilin_4v_8bpp_c         :    4.7     4.2
vp9_put_bilin_4v_8bpp_rvv_i32   :    1.0     1.0
vp9_put_bilin_8h_8bpp_c         :   16.7    15.0
vp9_put_bilin_8h_8bpp_rvv_i32   :    2.2     2.0
vp9_put_bilin_8v_8bpp_c         :   17.5    15.7
vp9_put_bilin_8v_8bpp_rvv_i32   :    2.2     2.0
vp9_put_bilin_16h_8bpp_c        :   65.2    58.0
vp9_put_bilin_16h_8bpp_rvv_i32  :    6.0     5.5
vp9_put_bilin_16v_8bpp_c        :   69.2    61.7
vp9_put_bilin_16v_8bpp_rvv_i32  :    5.7     5.2
vp9_put_bilin_32h_8bpp_c        :  273.2   229.0
vp9_put_bilin_32h_8bpp_rvv_i32  :   19.7    17.7
vp9_put_bilin_32v_8bpp_c        :  290.5   243.7
vp9_put_bilin_32v_8bpp_rvv_i32  :   18.7    16.7
vp9_put_bilin_64h_8bpp_c        : 1040.5   910.5
vp9_put_bilin_64h_8bpp_rvv_i32  :   82.5    73.0
vp9_put_bilin_64v_8bpp_c        : 1108.5   971.0
vp9_put_bilin_64v_8bpp_rvv_i32  :   82.2    73.2
Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
5 months ago
Rémi Denis-Courmont
616fdeaea3
lavc/riscv: depend on RVB and simplify accordingly
...
There is no known (real) hardware with V and without the complete B
extension. B was indeed required in the RISC-V application profile from
2022, earlier than V. There should not be any relevant hardware in the
future either.
In practice, different R-V Vector optimisations in FFmpeg already depend on
every constituent of the B extension anyhow, so a build with V but without B
would not work well in any case.
6 months ago
Rémi Denis-Courmont
4edfc11a28
lavc/h264dsp: R-V V idct4_add8 (all depths)
...
These are really just wrappers for idct4_add16intra functions, which are in
turn mostly wrappers for idct4_add and idct4_dc_add functions.
For benchmarks refer to the later two sets.
6 months ago
Rémi Denis-Courmont
de7f999481
lavc/videodsp: work-around LLVM-as
...
For some reason, it can't handle the normal syntax for an address operand
without an offset, so add a dummy zero offset.
6 months ago
Rémi Denis-Courmont
677f28b310
lavc/h264dsp: stick R-V V weight to 16-bit precision
...
T-Head C908 (ns):
h264_weight2_8_c: 1607.8
h264_weight2_8_rvv_i32: 515.0 (before)
h264_weight2_8_rvv_i32: 348.5 (after)
h264_weight4_8_c: 2255.8
h264_weight4_8_rvv_i32: 1015.0 (before)
h264_weight4_8_rvv_i32: 691.0 (after)
h264_weight8_8_c: 3857.5
h264_weight8_8_rvv_i32: 2218.8 (before)
h264_weight8_8_rvv_i32: 1561.3 (after)
h264_weight16_8_c: 7431.5
h264_weight16_8_rvv_i32: 2737.3 (before)
h264_weight16_8_rvv_i32: 1848.3 (after)
SpacemiT X60 (ns):
h264_weight2_8_c: 1624.1
h264_weight2_8_rvv_i32: 352.6 (before)
h264_weight2_8_rvv_i32: 259.3 (after)
h264_weight4_8_c: 2259.3
h264_weight4_8_rvv_i32: 685.8 (before)
h264_weight4_8_rvv_i32: 530.3 (after)
h264_weight8_8_c: 4103.3
h264_weight8_8_rvv_i32: 1581.8 (before)
h264_weight8_8_rvv_i32: 1238.6 (after)
h264_weight16_8_c: 7624.3
h264_weight16_8_rvv_i32: 2738.1 (before)
h264_weight16_8_rvv_i32: 1853.3 (after)
6 months ago
Rémi Denis-Courmont
afd45c7ff7
lavc/h264dsp: stick R-V V biweight to 16-bit
...
T-Head C908 (ns):
h264_biweight2_8_c: 2414.5
h264_biweight2_8_rvv_i32: 701.8 (before)
h264_biweight2_8_rvv_i32: 468.5 (after)
h264_biweight4_8_c: 4655.3
h264_biweight4_8_rvv_i32: 1377.5 (before)
h264_biweight4_8_rvv_i32: 931.8 (after)
h264_biweight8_8_c: 9701.5
h264_biweight8_8_rvv_i32: 2896.0 (before)
h264_biweight8_8_rvv_i32: 2070.5 (after)
h264_biweight16_8_c: 18025.0
h264_biweight16_8_rvv_i32: 3460.8 (before)
h264_biweight16_8_rvv_i32: 1978.0 (after)
SpacemiT X60 (ns):
h264_biweight2_8_c: 2415.5
h264_biweight2_8_rvv_i32: 478.2 (before)
h264_biweight2_8_rvv_i32: 362.8 (after)
h264_biweight4_8_c: 4655.3
h264_biweight4_8_rvv_i32: 946.7 (before)
h264_biweight4_8_rvv_i32: 727.3 (after)
h264_biweight8_8_c: 9061.8
h264_biweight8_8_rvv_i32: 2071.7 (before)
h264_biweight8_8_rvv_i32: 1685.8 (after)
h264_biweight16_8_c: 18020.5
h264_biweight16_8_rvv_i32: 3457.2 (before)
h264_biweight16_8_rvv_i32: 1935.8 (after)
6 months ago
Rémi Denis-Courmont
2f083fd581
lavc/audiodsp: drop R-V F vector_clipf
...
This is now firmly slower than C.
SiFive-U74 (cycles):
audiodsp.vector_clipf_c: 31.2
audiodsp.vector_clipf_rvf: 39.5
6 months ago
Rémi Denis-Courmont
54ae270213
lavc/rv34dsp: use saturating add/sub for R-V V DC add
...
T-Head C908 (cycles):
rv34_idct_dc_add_c: 113.2
rv34_idct_dc_add_rvv_i32: 48.5 (before)
rv34_idct_dc_add_rvv_i32: 39.5 (after)
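
The commit body gives only numbers, so the following is a scalar C sketch of one plausible reading of "saturating add/sub" for an 8-bit DC add, not the actual assembler. It assumes the DC term has already been clamped to [-255, 255]; all function names are hypothetical. The point it illustrates is that widening, adding and clipping gives the same result as an unsigned saturating add (for a non-negative DC) or an unsigned saturating subtract (for a negative DC), which avoids the widen/narrow round trip.

```c
#include <assert.h>
#include <stdint.h>

/* Reference: widen to int, add the DC term, clip back to [0, 255]. */
static uint8_t dc_add_widen(uint8_t pix, int dc)
{
    int v = pix + dc;
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Scalar models of unsigned 8-bit saturating add and subtract
 * (hypothetical helper names, for illustration only). */
static uint8_t sat_addu8(uint8_t a, uint8_t b)
{
    unsigned s = (unsigned)a + b;
    return s > 255 ? 255 : (uint8_t)s;
}

static uint8_t sat_subu8(uint8_t a, uint8_t b)
{
    return a > b ? (uint8_t)(a - b) : 0;
}

/* Same result without widening. Assumption: dc is clamped to [-255, 255]. */
static uint8_t dc_add_saturating(uint8_t pix, int dc)
{
    assert(dc >= -255 && dc <= 255);
    return dc >= 0 ? sat_addu8(pix, (uint8_t)dc)
                   : sat_subu8(pix, (uint8_t)-dc);
}
```

Clamping the DC term loses nothing for 8-bit pixels: any value beyond ±255 saturates every pixel to 0 or 255 anyway.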
6 months ago
Rémi Denis-Courmont
952b426f3b
lavc/bswapdsp: add RV Zvbb bswap16 and bswap32
6 months ago
Rémi Denis-Courmont
262168b04e
lavc/videodsp: RISC-V zicbop prefetch
...
There is currently no way to detect this CPU capability at run time, so we
take it for granted (in the worst case, the prefetch hints execute as NOPs).
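
For comparison, a portable C sketch of a row prefetch helper follows. It is not the assembler from this commit: it relies on the GCC/Clang `__builtin_prefetch` builtin, and whether the compiler lowers that to a Zicbop prefetch instruction depends on the target flags, which is an assumption here. The function name and signature are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: hint that h rows starting at buf will be read soon.
 * A prefetch is only a hint, so it never changes program results. */
static void prefetch_rows(const uint8_t *buf, ptrdiff_t stride, int h)
{
    assert(h >= 0);
    for (int y = 0; y < h; y++)
        __builtin_prefetch(buf + y * stride, 0 /* read */, 3 /* keep in cache */);
}
```

Because the operation is a pure hint, taking hardware support for granted is safe from a correctness point of view; only the performance benefit is at stake.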
6 months ago
Rémi Denis-Courmont
324eba69f7
lavc/vc1dsp: use saturating arithmetic for RVV inv_trans_dc
...
T-Head C908 (cycles):
vc1dsp.vc1_inv_trans_4x4_dc_c: 113.7
vc1dsp.vc1_inv_trans_4x4_dc_rvv_i32: 46.5 (before)
vc1dsp.vc1_inv_trans_4x4_dc_rvv_i32: 45.5 (after)
vc1dsp.vc1_inv_trans_4x8_dc_c: 230.7
vc1dsp.vc1_inv_trans_4x8_dc_rvv_i32: 65.7 (before)
vc1dsp.vc1_inv_trans_4x8_dc_rvv_i32: 52.5 (after)
vc1dsp.vc1_inv_trans_8x4_dc_c: 246.7
vc1dsp.vc1_inv_trans_8x4_dc_rvv_i64: 56.7 (before)
vc1dsp.vc1_inv_trans_8x4_dc_rvv_i64: 45.5 (after)
vc1dsp.vc1_inv_trans_8x8_dc_c: 419.7
vc1dsp.vc1_inv_trans_8x8_dc_rvv_i64: 81.2 (before)
vc1dsp.vc1_inv_trans_8x8_dc_rvv_i64: 53.5 (after)
6 months ago
Rémi Denis-Courmont
784a72a116
lavc/vc1dsp: unify R-V V DC bypass functions
6 months ago
Rémi Denis-Courmont
bd0c3edb13
lavu/riscv: count bytes rather than words for bswap32
...
This removes the dependency on Zba at essentially zero cost.
6 months ago
Rémi Denis-Courmont
5171baa228
lavc/ac3dsp: fix R-V CPU requirements
...
It probably will not matter on any real hardware, but the Zbb optimisations
do not require Zba. And then, we need HAVE_RVV to build the RVV stuff.
6 months ago
Rémi Denis-Courmont
7b24f96c87
lavc/vp9dsp: remove R-V I intra functions
...
At this point, they are identical to the C code, except for instruction
ordering. In fact, they are typically slower or no faster than the C code.
6 months ago
Rémi Denis-Courmont
b0b3bea10b
lavc/h264dsp: use saturating add/sub for R-V V 8-bit DC add
...
T-Head C908 (cycles):
h264_idct4_dc_add_8bpp_c: 109.2
h264_idct4_dc_add_8bpp_rvv_i32: 34.5 (before)
h264_idct4_dc_add_8bpp_rvv_i32: 25.5 (after)
h264_idct8_dc_add_8bpp_c: 418.7
h264_idct8_dc_add_8bpp_rvv_i64: 69.5 (before)
h264_idct8_dc_add_8bpp_rvv_i64: 33.5 (after)
6 months ago
Rémi Denis-Courmont
9b4655c3a1
lavc/vp8dsp: use saturating add/sub for R-V V DC add
...
T-Head C908 (cycles):
vp7_idct_dc_add_c: 108.5
vp7_idct_dc_add_rvv_i32: 56.2 (before)
vp7_idct_dc_add_rvv_i32: 47.2 (after)
vp8_idct_dc_add_c: 96.2
vp8_idct_dc_add_rvv_i32: 43.0 (before)
vp8_idct_dc_add_rvv_i32: 34.0 (after)
6 months ago
Rémi Denis-Courmont
bbfc0ac9ca
lavc/riscv: don't set vxrm if unnecessary
...
While narrowing clip is nominally a rounding operation, the rounding mode
has no arithmetic consequence if the right shift is by zero bits.
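
The claim can be checked with a scalar C model of the four fixed-point rounding modes (round-to-nearest-up, round-to-nearest-even, round-down, round-to-odd), written here from the rounding increment definitions in the RISC-V vector specification. The model and its names are illustrative assumptions, not FFmpeg code.

```c
#include <assert.h>
#include <stdint.h>

/* The four fixed-point rounding modes selectable through vxrm. */
enum vxrm { RNU, RNE, RDN, ROD };

/* Rounding increment for a right shift of v by d bits. */
static int rounding_increment(uint32_t v, unsigned d, enum vxrm mode)
{
    if (d == 0)
        return 0; /* no bits are shifted out, so there is nothing to round */

    unsigned half = (v >> (d - 1)) & 1;               /* v[d-1] */
    unsigned low  = (v & ((1u << (d - 1)) - 1)) != 0; /* v[d-2:0] != 0 */
    unsigned lsb  = (v >> d) & 1;                     /* v[d] */

    switch (mode) {
    case RNU: return half;
    case RNE: return half & (low | lsb);
    case RDN: return 0;
    default:  return !lsb & (half | low);             /* ROD */
    }
}

/* Scalar model of a signed narrowing clip to 8 bits. */
static int8_t narrowing_clip_model(int32_t v, unsigned shift, enum vxrm mode)
{
    assert(shift < 31);
    /* Assumes >> on a negative int is an arithmetic shift (GCC, Clang). */
    int32_t r = (v >> shift) + rounding_increment((uint32_t)v, shift, mode);
    return r < INT8_MIN ? INT8_MIN : r > INT8_MAX ? INT8_MAX : (int8_t)r;
}
```

With a shift of zero every mode yields the same clipped value, so the instruction that sets the rounding mode can be dropped; with a non-zero shift the modes diverge (for example, 3 shifted right by one bit gives 2 under RNU and 1 under RDN).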
6 months ago
Rémi Denis-Courmont
f2c30fe15a
lavc/riscv: add forward-edge CFI landing pads
6 months ago
Rémi Denis-Courmont
b62586e310
lavc/h264dsp: use RISC-V B extension
...
This saves one register and one instruction per transform.
add16 and add16intra thus become stack-less.
6 months ago
Rémi Denis-Courmont
187d4d066a
lavc/riscv: require B or zba explicitly
6 months ago
Rémi Denis-Courmont
896c22ef00
lavc/vp8dsp: fix RV32 stack alignment
...
SP must be a multiple of 16 bytes at all times on POSIX - even in leaf
functions - so that signal handlers have a properly aligned stack.
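
As a worked illustration of the rule (the exact defect fixed in vp8dsp is not shown in this log), the sketch below rounds a frame size up to the next multiple of 16 bytes. On RV32, where registers are 4 bytes wide, simply summing the saved registers easily yields a size such as 8 that violates the rule. The helper name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define STACK_ALIGN 16 /* bytes */

/* Round the number of bytes a function needs on the stack up to the
 * next multiple of STACK_ALIGN, so that SP stays 16-byte aligned. */
static size_t aligned_frame_size(size_t bytes_needed)
{
    assert(bytes_needed <= (size_t)-1 - (STACK_ALIGN - 1));
    return (bytes_needed + STACK_ALIGN - 1) & ~(size_t)(STACK_ALIGN - 1);
}
```

For instance, saving two 4-byte registers needs 8 bytes, but the frame still has to reserve 16.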
6 months ago
Rémi Denis-Courmont
9135dffd17
lavc/h264dsp: reduce spills in R-V V idct_add16
6 months ago
Rémi Denis-Courmont
245f76ad74
lavc/h264dsp: reuse the R-V V IDCT DC add functions
...
This reuses the DC bypass functions from the multiple IDCT functions, to
leverage vector code.
As an added bonus, the caller functions can now rely on the callee functions
to preserve their parameters, thus cutting down on stack spills.
6 months ago
Rémi Denis-Courmont
0a5b5bae89
lavc/h264dsp: correct VL and LMUL in idct_dc_add
...
T-Head C908 (cycles):
h264_idct4_dc_add_8bpp_c: 94.7
h264_idct4_dc_add_8bpp_rvv_i32: 55.0 (before)
h264_idct4_dc_add_8bpp_rvv_i32: 34.5 (after)
h264_idct4_dc_add_9bpp_c: 94.7
h264_idct4_dc_add_9bpp_rvv_i32: 43.5 (before)
h264_idct4_dc_add_9bpp_rvv_i32: 38.2 (after)
h264_idct4_dc_add_10bpp_c: 94.7
h264_idct4_dc_add_10bpp_rvv_i32: 43.5 (before)
h264_idct4_dc_add_10bpp_rvv_i32: 38.2 (after)
h264_idct4_dc_add_12bpp_c: 94.7
h264_idct4_dc_add_12bpp_rvv_i32: 43.7 (before)
h264_idct4_dc_add_12bpp_rvv_i32: 38.5 (after)
h264_idct4_dc_add_14bpp_c: 94.7
h264_idct4_dc_add_14bpp_rvv_i32: 43.7 (before)
h264_idct4_dc_add_14bpp_rvv_i32: 38.5 (after)
6 months ago
J. Dekker
c9dc2ad09b
lavc/h264dsp: move R-V V idct_dc_add
...
No functional changes. This just moves the assembler so that it can be
referenced by other functions in h264idct_rvv.S with local jumps.
Edited-by: Rémi Denis-Courmont <remi@remlab.net>
6 months ago
Rémi Denis-Courmont
d15169c51f
lavc/h264dsp: factor some mostly identical R-V V code
6 months ago
Rémi Denis-Courmont
483fd732ab
lavc/h264dsp: R-V V high-depth idct_add{,intra}16, idct8_add4
...
As with 8-bit, this tends to be faster, but results are all over the
place due to the variable distribution of non-zero coefficients.
6 months ago
J. Dekker
fa5a605542
avcodec/riscv: add h264 dc idct rvv
...
checkasm: bench runs 131072 (1 << 17)
h264_idct4_add_dc_8bpp_c: 1.5
h264_idct4_add_dc_8bpp_rvv_i64: 0.7
h264_idct4_add_dc_9bpp_c: 1.5
h264_idct4_add_dc_9bpp_rvv_i64: 0.7
h264_idct4_add_dc_10bpp_c: 1.5
h264_idct4_add_dc_10bpp_rvv_i64: 0.7
h264_idct4_add_dc_12bpp_c: 1.2
h264_idct4_add_dc_12bpp_rvv_i64: 0.7
h264_idct4_add_dc_14bpp_c: 1.2
h264_idct4_add_dc_14bpp_rvv_i64: 0.7
h264_idct8_add_dc_8bpp_c: 5.2
h264_idct8_add_dc_8bpp_rvv_i64: 1.5
h264_idct8_add_dc_9bpp_c: 5.5
h264_idct8_add_dc_9bpp_rvv_i64: 1.2
h264_idct8_add_dc_10bpp_c: 5.5
h264_idct8_add_dc_10bpp_rvv_i64: 1.2
h264_idct8_add_dc_12bpp_c: 4.2
h264_idct8_add_dc_12bpp_rvv_i64: 1.2
h264_idct8_add_dc_14bpp_c: 4.2
h264_idct8_add_dc_14bpp_rvv_i64: 1.2
Signed-off-by: J. Dekker <jdek@itanimul.li>
6 months ago
Rémi Denis-Courmont
3002310b70
lavc/h264dsp: R-V V high-depth add_pixels8
...
T-Head C908 (cycles):
h264_add_pixels8_9bpp_c: 270.5
h264_add_pixels8_9bpp_rvv_i32: 164.2
h264_add_pixels8_10bpp_c: 270.5
h264_add_pixels8_10bpp_rvv_i32: 164.2
h264_add_pixels8_12bpp_c: 270.5
h264_add_pixels8_12bpp_rvv_i32: 164.2
h264_add_pixels8_14bpp_c: 270.5
h264_add_pixels8_14bpp_rvv_i32: 164.2
6 months ago
Rémi Denis-Courmont
7744c08240
lavc/h264dsp: R-V V add_pixels4 and 8-bit add_pixels8
...
T-Head C908 (cycles):
h264_add_pixels4_8bpp_c: 93.5
h264_add_pixels4_8bpp_rvv_i32: 39.5
h264_add_pixels4_9bpp_c: 87.5
h264_add_pixels4_9bpp_rvv_i64: 50.5
h264_add_pixels4_10bpp_c: 87.5
h264_add_pixels4_10bpp_rvv_i64: 50.5
h264_add_pixels4_12bpp_c: 87.5
h264_add_pixels4_12bpp_rvv_i64: 50.5
h264_add_pixels4_14bpp_c: 87.5
h264_add_pixels4_14bpp_rvv_i64: 50.5
h264_add_pixels8_8bpp_c: 265.2
h264_add_pixels8_8bpp_rvv_i64: 84.5
6 months ago
Rémi Denis-Courmont
c654e37254
lavc/h264dsp: R-V V high-depth h264_idct8_add
...
Unlike the 8-bit version, we need two iterations to process this within
128-bit vectors. This adds some extra complexity for pointer arithmetic
and counting down which is unnecessary in the 8-bit variant.
Accordingly, the gains relative to C are just slightly better than half as
good with 128-bit vectors as with 256-bit ones.
T-Head C908 (2 iterations):
h264_idct8_add_9bpp_c: 17.5
h264_idct8_add_9bpp_rvv_i32: 10.0
h264_idct8_add_10bpp_c: 17.5
h264_idct8_add_10bpp_rvv_i32: 9.7
h264_idct8_add_12bpp_c: 17.7
h264_idct8_add_12bpp_rvv_i32: 9.7
h264_idct8_add_14bpp_c: 17.7
h264_idct8_add_14bpp_rvv_i32: 9.7
SpacemiT X60 (single iteration):
h264_idct8_add_9bpp_c: 15.2
h264_idct8_add_9bpp_rvv_i32: 5.0
h264_idct8_add_10bpp_c: 15.2
h264_idct8_add_10bpp_rvv_i32: 5.0
h264_idct8_add_12bpp_c: 14.7
h264_idct8_add_12bpp_rvv_i32: 5.0
h264_idct8_add_14bpp_c: 14.7
h264_idct8_add_14bpp_rvv_i32: 4.7
6 months ago
Rémi Denis-Courmont
4e0e872881
lavc/h264dsp: R-V V high-depth h264_idct_add
...
T-Head C908 (cycles):
h264_idct4_add_9bpp_c: 248.2
h264_idct4_add_9bpp_rvv_i32: 128.7
h264_idct4_add_10bpp_c: 256.7
h264_idct4_add_10bpp_rvv_i32: 128.7
h264_idct4_add_12bpp_c: 252.5
h264_idct4_add_12bpp_rvv_i32: 129.7
h264_idct4_add_14bpp_c: 258.0
h264_idct4_add_14bpp_rvv_i32: 129.7
7 months ago
Rémi Denis-Courmont
d28a7e8eb7
lavc/h264dsp: avoid \+ expansion
...
This seems to be unsupported by LLVM-as.
7 months ago
Rémi Denis-Courmont
f1ed351d3b
lavc/h264dsp: R-V V 8-bit h264_biweight_pixels
...
T-Head C908:
h264_biweight2_8_c: 58.0
h264_biweight2_8_rvv_i32: 11.2
h264_biweight4_8_c: 106.0
h264_biweight4_8_rvv_i32: 22.7
h264_biweight8_8_c: 205.7
h264_biweight8_8_rvv_i32: 50.0
h264_biweight16_8_c: 403.5
h264_biweight16_8_rvv_i32: 83.2
SpacemiT X60:
h264_weight2_8_c: 48.2
h264_weight2_8_rvv_i32: 8.2
h264_weight4_8_c: 90.5
h264_weight4_8_rvv_i32: 16.5
h264_weight8_8_c: 175.2
h264_weight8_8_rvv_i32: 38.0
h264_weight16_8_c: 342.2
h264_weight16_8_rvv_i32: 66.0
7 months ago
Rémi Denis-Courmont
3606e592ea
lavc/h264dsp: R-V V 8-bit h264_weight_pixels
...
There are two implementations here:
- a generic scalable one processing two columns at a time,
- a specialised one processing one (fixed-size) row at a time.
Unsurprisingly, the generic one works out better with smaller widths.
With larger widths, the gains from filling vectors are outweighed by
the extra cost of strided loads and stores. In other words, memory
accesses become the bottleneck.
T-Head C908:
h264_weight2_8_c: 54.5
h264_weight2_8_rvv_i32: 13.7
h264_weight4_8_c: 101.7
h264_weight4_8_rvv_i32: 27.5
h264_weight8_8_c: 197.0
h264_weight8_8_rvv_i32: 75.5
h264_weight16_8_c: 385.0
h264_weight16_8_rvv_i32: 74.2
SpacemiT X60:
h264_weight2_8_c: 48.5
h264_weight2_8_rvv_i32: 8.2
h264_weight4_8_c: 90.7
h264_weight4_8_rvv_i32: 16.5
h264_weight8_8_c: 175.0
h264_weight8_8_rvv_i32: 37.7
h264_weight16_8_c: 342.2
h264_weight16_8_rvv_i32: 66.0
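
The memory access trade-off described in this commit can be pictured with a scalar C sketch. It is not the vector code: it only contrasts walking a block row by row (adjacent bytes) with walking it column by column (accesses `stride` bytes apart, the scalar analogue of strided loads and stores). The weighting formula is modelled on the usual 8-bit H.264 explicit weighting, and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint8_t clip_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Pre-scale the offset and fold in the rounding term. */
static int scaled_offset(int log2_denom, int offset)
{
    int off = offset * (1 << log2_denom);
    if (log2_denom)
        off += 1 << (log2_denom - 1);
    return off;
}

/* Assumes >> on a negative int is an arithmetic shift (GCC, Clang). */
static uint8_t weight_px(uint8_t p, int log2_denom, int weight, int off)
{
    return clip_u8((p * weight + off) >> log2_denom);
}

/* Row order: consecutive accesses are adjacent in memory (unit stride). */
static void weight_by_rows(uint8_t *block, ptrdiff_t stride, int width,
                           int height, int log2_denom, int weight, int offset)
{
    assert(log2_denom >= 0 && log2_denom < 8);
    int off = scaled_offset(log2_denom, offset);
    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++) {
            uint8_t *p = block + y * stride + x;
            *p = weight_px(*p, log2_denom, weight, off);
        }
}

/* Column order: consecutive accesses are `stride` bytes apart. */
static void weight_by_columns(uint8_t *block, ptrdiff_t stride, int width,
                              int height, int log2_denom, int weight, int offset)
{
    assert(log2_denom >= 0 && log2_denom < 8);
    int off = scaled_offset(log2_denom, offset);
    for (int x = 0; x < width; x++)
        for (int y = 0; y < height; y++) {
            uint8_t *p = block + y * stride + x;
            *p = weight_px(*p, log2_denom, weight, off);
        }
}
```

Both orders compute identical pixels; they differ only in how memory is touched, which is where the cost of strided accesses comes from.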
7 months ago
Rémi Denis-Courmont
f9d1230224
lavc/h264dsp: R-V V 8-bit h264_idct8_add
...
T-Head C908 (cycles):
h264_idct8_add_8bpp_c: 1072.0
h264_idct8_add_8bpp_rvv_i32: 318.5
7 months ago
Rémi Denis-Courmont
f447189b0c
lavc/h264dsp: R-V V 8-bit h264_idct_add
...
T-Head C908 (cycles):
h264_idct4_add_8bpp_c: 271.5
h264_idct4_add_8bpp_rvv_i32: 91.5
7 months ago
Rémi Denis-Courmont
e0eff64ed1
lavc/h264dsp: R-V V 8-bit h264_idct8_add4
7 months ago
Rémi Denis-Courmont
d1f0c1fbf8
lavc/h264dsp: R-V V 8-bit h264_idct_add16intra
7 months ago
Rémi Denis-Courmont
30475c95ba
lavc/h264dsp: R-V V 8-bit h264_idct_add16
...
While this *tends* to be faster than plain C, the performance numbers
are all over the place, presumably due to the conditional character of
the main loop.
Some additional micro-optimisations should be feasible after the
underlying h264_idct_add and h264_idct_dc_add functions are also
implemented. Then it will no longer be necessary to strictly abide by
the C ABI.
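
To make the "conditional character" concrete, here is a simplified C model of such a loop. It is a sketch under assumptions: the real function looks up non-zero counts through a scan table, whereas this model takes a flat 16-entry array, and the callback type and counting stubs are hypothetical. Each 4x4 block is skipped, sent to the DC-only path, or sent to the full transform, so the run time depends on how the non-zero coefficients are distributed.

```c
#include <assert.h>
#include <stdint.h>

typedef void (*block_fn)(uint8_t *dst, int16_t *coeffs, int stride);

/* Simplified model: nnz[i] is the non-zero coefficient count of block i. */
static void idct_add16_model(uint8_t *dst, const int *block_offset,
                             int16_t *coeffs, int stride,
                             const uint8_t nnz[16],
                             block_fn idct_add, block_fn idct_dc_add)
{
    assert(idct_add && idct_dc_add);
    for (int i = 0; i < 16; i++) {
        int16_t *blk = coeffs + i * 16;

        if (nnz[i] == 0)
            continue;                                        /* empty block */
        if (nnz[i] == 1 && blk[0])
            idct_dc_add(dst + block_offset[i], blk, stride); /* DC only */
        else
            idct_add(dst + block_offset[i], blk, stride);    /* full IDCT */
    }
}

/* Counting stubs, used only to observe which path each block takes. */
static int full_calls, dc_calls;

static void count_full(uint8_t *dst, int16_t *coeffs, int stride)
{
    (void)dst; (void)coeffs; (void)stride;
    full_calls++;
}

static void count_dc(uint8_t *dst, int16_t *coeffs, int stride)
{
    (void)dst; (void)coeffs; (void)stride;
    dc_calls++;
}
```

A benchmark fed with random coefficient data exercises a different mix of the three paths on each run, which is consistent with the scattered timings mentioned above.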
7 months ago