T-Head C908 (cycles):
vc1dsp.vc1_inv_trans_4x4_c: 310.7
vc1dsp.vc1_inv_trans_4x4_rvv_i32: 120.0
We could use 1 `vlseg4e64.v` instead of 4 `vle16.v`, but that seems to
be about 7% slower.
The 8x8 pixel arrays are not necessarily aligned to 64 bits, so the
current code leads to Bus error on real hardware. This reproducible
with FATE's vc1_ilaced_twomv test case.
The new "pessimist" code can trivially be shared for 16x16 pixel
arrays so we also do that. FWIW, this also nominally reduces the
hardware requirement from Zve64x to Zve32x.
T-Head C908:
vc1dsp.avg_vc1_mspel_pixels_tab[0][0]_c: 14.7
vc1dsp.avg_vc1_mspel_pixels_tab[0][0]_rvv_i32: 3.5
vc1dsp.avg_vc1_mspel_pixels_tab[1][0]_c: 3.7
vc1dsp.avg_vc1_mspel_pixels_tab[1][0]_rvv_i32: 1.5
SpacemiT X60:
vc1dsp.avg_vc1_mspel_pixels_tab[0][0]_c: 13.0
vc1dsp.avg_vc1_mspel_pixels_tab[0][0]_rvv_i32: 3.0
vc1dsp.avg_vc1_mspel_pixels_tab[1][0]_c: 3.2
vc1dsp.avg_vc1_mspel_pixels_tab[1][0]_rvv_i32: 1.2
Notes:
- The loop is biased toward no unescaped bytes as that should be most common.
- The input byte array is slid rather than the (8 times smaller) bit-mask,
as RISC-V V does not provide a bit-mask (or bit-wise) slide instruction.
- There are two comparisons with 0 per iteration, for the same reason.
- In case of match, bytes are copied until the first match, and the loop is
restarted after the escape byte. Vector compression (vcompress.vm) could
discard all escape bytes but that is slower if escape bytes are rare.
Further optimisations should be possible, e.g.:
- processing 2 bytes fewer per iteration to get rid of a 2 slides,
- taking a short cut if the input vector contains less than 2 zeroes.
But this is a good starting point:
T-Head C908:
vc1dsp.vc1_unescape_buffer_c: 12749.5
vc1dsp.vc1_unescape_buffer_rvv_i32: 6009.0
SpacemiT X60:
vc1dsp.vc1_unescape_buffer_c: 11038.0
vc1dsp.vc1_unescape_buffer_rvv_i32: 2061.0
The main loop processes 8 bytes in 5 instructions.
For comparison, the optimal plain strnlen() requires 4 instructions per
byte (6.4x worse): LBU; ADDI; BEQZ; BNE. The current libavcodec C code
involves 5 instructions per byte (8x worse). Actual benchmarks may be
slightly less favourable due to latency from ORC.B to BNE.
We can't call ff_get_rv_vlenb() if we don't have RVV available
at all.
Acked-by: Rémi Denis-Courmont <remi@remlab.net>
Signed-off-by: Martin Storsjö <martin@martin.st>
Segmented loads are slow, so here we use unit-strided load and narrowing shifts.
c910:
fcmul_add_c: 2179
fcmul_add_rvv_f64: 1652
c908:
fcmul_add_c: 4891.2
fcmul_add_rvv_f64: 2399.5
Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
The code was blindly assuming that Zbb or V implied Zba. While the
earlier is practically always true, the later broke some QEMU setups,
as V was introduced earlier than Zba.
RV64G supports MIN & MAX instructions natively only on floating point
registers, not general purpose ones. The later would require the Zbb
extension. Due to that, it is actually faster to perform the clipping
"properly" in FPU.
Benchmarks on SiFive U74-MC (courtesy of Shanghai StarFive Tech):
audiodsp.vector_clipf_c: 29551.5
audiodsp.vector_clipf_rvf: 17871.0
Also tried unrolling with 2 or 8 elements but it gets worse either way.
Initialise VC1DSPContext for parser as well as for decoder.
Note, the VC-1 code doesn't actually use the function pointer yet.
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
For:
ff_vc1_inv_trans_{8,4}x{8,4}_{dc_,}neon
ff_put_pixels8x8_neon
ff_put_vc1_mspel_mc{0,1,2,3}{0,1,2,3}_neon (except for 00)
Based on ARM assembly code in libavcodec/arm by Rob Clark and Mans
Rullgard.
Signed-off-by: Martin Storsjö <martin@martin.st>
This allows masking CPU features with the -cpuflags avconv option
which is useful for testing different optimisations without rebuilding.
Signed-off-by: Mans Rullgard <mans@mansr.com>
From 52.503s (~40fps) to 27.973sec (~80fps) decoding of 480p sintel
trailer, i.e. a ~2x speedup overall, on a Nexus S.
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
This adds NEON optimised versions of all functions in VP8DSPContext.
Based on initial work by Rob Clark.
Signed-off-by: Mans Rullgard <mans@mansr.com>
(cherry picked from commit a1c1d3c003)