This works out a bit more favourably than VP8's due to:
- additional multiplications that can be vectorised,
- hardware-supported fixed-point rounding mode.
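As an illustration of the latter point, RVV has a fixed-point rounding mode
register (vxrm) that the scaling shifts honour, so a rounding right shift does
not need a separate add of the rounding bias. A minimal sketch, with the shift
amount, element type and register number chosen arbitrarily rather than taken
from the actual transform:

        csrwi    vxrm, 0                    # round-to-nearest-up
        vsetivli zero, 16, e16, m2, ta, ma
        vssra.vi v8, v8, 3                  # (x + 4) >> 3 in a single instruction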
vp7_luma_dc_wht_c: 3.2
vp7_luma_dc_wht_rvv_i64: 2.0
This saves one instruction and frees up A5, which will be repurposed in
later changes. Unfortunately, we need to add quite a lot of alternative
code for this.
128-bit is the maximum, not the minimum here. Larger vector sizes can
result in reads past the end of the noise value table.
This partially reverts commit cdcb4b98b7.
This loop correctly assumes that VLMAX=16 (4x128-bit vectors
with 32-bit elements) and 32 >= pred_order > 16. We need to alternate
between VL=16 and VL=t2=pred_order-16 elements to add up to pred_order.
The current code requests AVL=a2=pred_order elements. In QEMU and on
the K230 hardware, this sets VL=16 as we need. But the specification
merely guarantees that we get: ceil(AVL / 2) <= VL <= VLMAX. For
instance, if pred_order equals 27, we could end up with VL=14 or VL=15
instead of VL=16. So instead, literally request VLMAX=16 elements.
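A minimal sketch of the difference, assuming the e32/m4 configuration implied
above (the tail/mask policy is shown only for completeness):

        # before: VL is merely guaranteed to lie in [ceil(AVL / 2), VLMAX]
        vsetvli  zero, a2, e32, m4, ta, ma   # AVL = pred_order, VL may be < 16
        # after: request exactly 16 elements, so VL = 16 whenever VLEN >= 128
        vsetivli zero, 16, e32, m4, ta, ma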
Since the horizontal and vertical filters are identical except for a
transposition, this uses a common subprocedure with an ad-hoc ABI.
To preserve return-address stack prediction, a link register has to be
used (cf. the "Control Transfer Instructions" section of the
RISC-V ISA Manual). The alternate/temporary link register T0 is used
here, so that the normal RA is preserved (something Arm cannot do!).
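A minimal sketch of the pattern (the labels and the body are placeholders, not
the actual filter code):

entry:
        jal     t0, body           # JAL with rd=T0 pushes onto the RAS
        ret                        # RA still points at the original caller

body:
        # ... shared filtering work, using the ad-hoc register ABI ...
        jr      t0                 # JALR through T0 pops the matching RAS entry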
To load the strength value based on `qscale`, the shortest possible
and PIC-compatible sequence is used: AUIPC; ADD; LBU. The classic
LLA; ADD; LBU sequence would add one more instruction since LLA is a
convenience alias for AUIPC; ADDI. To ensure that this trick works,
relocation relaxation is disabled.
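The resulting sequence looks roughly as follows, assuming the strength table is
ff_h263_loop_filter_strength and that qscale arrives in A2:

1:      auipc   t1, %pcrel_hi(ff_h263_loop_filter_strength)
        add     t1, t1, a2                   # index the table by qscale
        lbu     t1, %pcrel_lo(1b)(t1)        # strength = table[qscale]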
To implement the two signed divisions by a power of two toward zero:
(x / (1 << SHIFT))
the code relies on the small range of integers involved, computing:
(x + (x >> (16 - SHIFT))) >> SHIFT
rather than the more general:
(x + ((x >> (16 - 1)) & ((1 << SHIFT) - 1))) >> SHIFT
Thus one ANDI instruction is avoided.
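One way this can work out in 16-bit lanes, for SHIFT = 3 and assuming the inner
shift is unsigned so that it yields 0 for non-negative values and 7 for the
small negative values that occur here (register numbers are made up):

        vsrl.vi  v10, v8, 13       # x >> (16 - SHIFT), unsigned: 0 or 7
        vadd.vv  v10, v10, v8      # add the truncation bias
        vsra.vi  v10, v10, 3       # == x / 8, rounded toward zero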
T-Head C908:
h263dsp.h_loop_filter_c: 228.2
h263dsp.h_loop_filter_rvv_i32: 144.0
h263dsp.v_loop_filter_c: 242.7
h263dsp.v_loop_filter_rvv_i32: 114.0
(C is probably worse in real use due to less predictable branches.)
While this function can easily be written with vectors, it just fails to
get any performance improvement.
For reference, this is a simpler loop-free implementation that, depending on
the hardware, does perform better than the current one, yet still ends up with
more or less the same metrics as the C code:
func ff_sbr_neg_odd_64_rvv, zve64x
        li      a1, 32                  # 32 odd-indexed floats to negate
        addi    a0, a0, 7               # sign byte of x[1] (little endian)
        li      t0, 8                   # stride: one sign byte per pair of floats
        vsetvli zero, a1, e8, m2, ta, ma
        li      t1, 0x80
        vlse8.v v8, (a0), t0            # gather the 32 sign bytes
        vxor.vx v8, v8, t1              # flip the sign bits
        vsse8.v v8, (a0), t0            # scatter them back
        ret
endfunc
This reverts commit d06fd18f8f.
Notes:
- The loop is biased toward the case where no bytes need unescaping, as that
should be the most common one.
- The input byte array is slid rather than the (8 times smaller) bit-mask,
as RISC-V V does not provide a bit-mask (or bit-wise) slide instruction.
- There are two comparisons with 0 per iteration, for the same reason.
- In case of match, bytes are copied until the first match, and the loop is
restarted after the escape byte. Vector compression (vcompress.vm) could
discard all escape bytes but that is slower if escape bytes are rare.
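For illustration, the detection step could look roughly like this; it only
shows the 0x00 0x00 0x03 test (the follow-up byte check and the copy/restart
logic are omitted, and the registers as well as the slid-in scalars a3/a4 are
hypothetical):

        vsetvli         t0, a2, e8, m4, ta, ma
        vle8.v          v8, (a0)            # current chunk of input bytes
        vslide1down.vx  v12, v8, a3         # same bytes shifted by one position
        vslide1down.vx  v16, v12, a4        # same bytes shifted by two positions
        vmseq.vi        v0, v8, 0           # byte[i]     == 0x00 ?
        vmseq.vi        v1, v12, 0          # byte[i + 1] == 0x00 ? (second zero compare)
        vmseq.vi        v2, v16, 3          # byte[i + 2] == 0x03 ?
        vmand.mm        v0, v0, v1
        vmand.mm        v0, v0, v2          # set where an escape byte follows
        vfirst.m        t1, v0              # index of first match, or -1 if none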
Further optimisations should be possible, e.g.:
- processing 2 bytes fewer per iteration to get rid of 2 slides,
- taking a short cut if the input vector contains fewer than 2 zeroes.
But this is a good starting point:
T-Head C908:
vc1dsp.vc1_unescape_buffer_c: 12749.5
vc1dsp.vc1_unescape_buffer_rvv_i32: 6009.0
SpacemiT X60:
vc1dsp.vc1_unescape_buffer_c: 11038.0
vc1dsp.vc1_unescape_buffer_rvv_i32: 2061.0
This is pretty much the same as for lpc16, though it only improves prediction
orders half as large. With 128-bit vectors, this gives:
order       C   V old   V new
    1    69.2   181.5    95.5
    2   107.7   180.7    95.2
    3   145.5   180.0   103.5
    4   183.0   179.2   102.7
    5   220.7   178.5   128.0
    6   257.7   194.0   127.5
    7   294.5   193.7   126.7
    8   331.0   193.0   126.5
Larger prediction orders see no significant changes at that size.
This calculates the optimal vector type value at run-time based on the
hardware vector length and the FLAC LPC prediction order. In this
particular case, the additional computation is easily amortised over
the loop iterations:
T-Head C908:
order       C   V before   V after
    1    48.0      214.7      95.2
    2    64.7      214.2      94.7
    3    79.7      213.5      94.5
    4    96.2      196.5      94.2  #
    5   111.0      195.7     118.5
    6   127.0      211.2     102.0
    7   143.7      194.2     101.5
    8   175.7      193.2     101.2  #
    9   176.2      224.2     126.0
   10   191.5      192.0     125.5
   11   224.5      191.2     124.7
   12   223.0      190.2     124.2
   13   239.2      189.5     123.7
   14   253.7      188.7     139.5
   15   286.2      188.0     122.7
   16   284.0      187.0     122.5  #
   17   300.2      186.5     186.5
   18   314.0      185.5     185.7
   19   329.7      184.7     185.0
   20   343.0      184.2     184.2
   21   358.7      199.2     183.7
   22   371.7      182.7     182.7
   23   387.5      181.7     182.0
   24   400.7      181.0     181.2
   25   431.5      180.2     196.5
   26   443.7      195.5     196.0
   27   459.0      178.7     196.2
   28   470.7      177.7     194.2
   29   470.0      177.0     193.5
   30   481.2      176.2     176.5
   31   496.2      175.5     175.7
   32   507.2      174.7     191.0  #
# Power of two boundary.
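The run-time selection itself could be sketched as follows: pick the smallest
LMUL whose VLMAX covers the prediction order, then apply the computed vtype
through the register form of vsetvl. This is only an outline of the idea, not
the actual code; the 32-bit element width and the register assignments are
assumptions:

        csrr    t0, vlenb              # VLEN in bytes
        srli    t0, t0, 2              # VLMAX at e32, m1
        li      t1, 0xd0               # vtype = e32, m1, ta, ma
1:
        bgeu    t0, a2, 2f             # enough lanes for pred_order (a2)?
        slli    t0, t0, 1              # VLMAX doubles along with LMUL
        addi    t1, t1, 1              # m1 -> m2 -> m4 -> m8
        j       1b
2:
        vsetvl  zero, a2, t1           # AVL = pred_order, computed vtype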
With 128-bit vectors, improvements are expected for the first two
test cases only. For the other two, there is overhead, but it stays below the
noise. Improvements should be more readily observable with prediction orders
of 8 or less, or on hardware with larger vector sizes.
The main loop processes 8 bytes in 5 instructions.
For comparison, the optimal plain strnlen() requires 4 instructions per
byte (6.4x worse): LBU; ADDI; BEQZ; BNE. The current libavcodec C code
involves 5 instructions per byte (8x worse). Actual benchmarks may be
slightly less favourable due to latency from ORC.B to BNE.
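The kind of main loop being described, with ORC.B doing the zero-byte
detection, can be sketched like this (register choices and the loop bound in
A1 are illustrative):

        li      t2, -1                 # all-ones: no zero byte in the chunk
1:
        ld      t1, (a0)               # load 8 bytes
        orc.b   t1, t1                 # 0x00 per zero byte, 0xff otherwise
        addi    a0, a0, 8
        bne     t1, t2, 2f             # a zero byte was found, leave the loop
        bne     a0, a1, 1b             # otherwise continue to the buffer end
2: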