Take vector reduction out of the loop and unroll.
Before:
audiodsp.scalarproduct_int16_c: 12321.0
audiodsp.scalarproduct_int16_rvv_i32: 4175.7
After:
audiodsp.scalarproduct_int16_c: 12320.5
audiodsp.scalarproduct_int16_rvv_i32: 1230.2
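As a rough illustration of the idea (plain scalar C with made-up names and a
fixed lane count, not the actual RVV code): partial sums are kept per lane
across the whole loop and reduced a single time at the end, instead of doing
a full vector reduction inside every iteration.

    #include <stddef.h>
    #include <stdint.h>

    #define LANES 8  /* stand-in for whatever vector length the hardware gives */

    static int32_t scalarproduct_int16_sketch(const int16_t *v1, const int16_t *v2,
                                              size_t len)
    {
        int32_t partial[LANES] = { 0 };
        int32_t sum = 0;

        /* len is assumed to be a multiple of LANES to keep the sketch short. */
        for (size_t i = 0; i < len; i += LANES)
            for (size_t l = 0; l < LANES; l++)   /* stays vectorisable         */
                partial[l] += v1[i + l] * v2[i + l];

        for (size_t l = 0; l < LANES; l++)       /* one reduction, hoisted out */
            sum += partial[l];                   /* of the main loop           */
        return sum;
    }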
This cannot beat the Zbb implementation, and it is unlikely that any
meaningful real-world CPU design would support V but not Zbb. The best
loop rewrite
that I could come up with (4 shifts, 2 ands, 3 ors) is still ~40% slower
than Zbb.
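The operation counts match the classic 32-bit byte swap built from plain
shifts, ANDs and ORs; presumably the rewrite was something along these lines
per element (an assumption, not the exact code that was tried):

    #include <stdint.h>

    /* 4 shifts, 2 ANDs, 3 ORs per 32-bit word */
    static uint32_t bswap32_shift_and_or(uint32_t x)
    {
        return  (x << 24)               |
               ((x <<  8) & 0x00FF0000) |
               ((x >>  8) & 0x0000FF00) |
                (x >> 24);
    }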
A proper faster vector implementation should be feasible with the
cryptographic vector extensions, but that is a story for another time.
The code was blindly assuming that Zbb or V implied Zba. While the
former is practically always true, the latter broke some QEMU setups,
as V was introduced earlier than Zba.
This increases the group multiplier as per T-Head C910 benchmarks:
inverse_coupling_c: 4597.0
inverse_coupling_rvv_i32: 1312.7 (m1)
inverse_coupling_rvv_i32: 1116.7 (m2)
inverse_coupling_rvv_i32: 732.2 (m4)
inverse_coupling_rvv_i32: 898.0 (m8)
'VSETVLI xd, x0, ...' has rather nonobvious semantics:
- If xd is x0, then it preserves the current vector length.
- If xd is not x0, it sets the vector length to the supported maximum.
Also somewhat confusingly, while VMV.X.S always does its thing
regardless of the selected vector length, VMV.S.X does _nothing_ if the
selected vector length is zero.
So the current code fails to initialise the accumulator if we are
unlucky enough to have a selected vector length of zero on entry. Fix it
by forcing the vector length to one.
Although the DSP function only uses single precision from RISC-V F, the
caller may leave double precision values in the spilled registers if the
calling convention supports double precision hardware floats. Then, we
need to save and restore FS registers as double precision.
Conversely, we do not need to save anything at all if an integer calling
convention is in use. However, we can assume that single precision floats
are supported, since the Zve32f extension implies the F extension.
So for the sake of simplicity, we always save at least single precision
values.
In theory, we should even save quadruple precision values if the LP64Q
ABI is in use. I have yet to see a compiler that supports it, though.
This adds a variant of the postfilter for use with 512-bit vectors.
Half a vector is enough to perform the scalar product. Normally a whole
vector would be used anyhow. Indeed fractional multipliers are no faster
than the unit multiplier.
But in this particular function, a full vector makes up 16 samples,
which would be loaded at each iteration of the outer loop. The minimum
guaranteed CELT postfilter period is only 15. Accounting for the edges,
we can only safely preload up to 13 samples.
The fractional multiplier is thus used to cap the selected vector length
to a safe value of 8 elements or 256 bits.
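Spelled out (the edge accounting here is my reading of the constraint, not
lifted from the patch itself):

    enum {
        VLEN_BITS   = 512,
        SAMPLE_BITS = 32,                       /* float samples                     */
        FULL_VL     = VLEN_BITS / SAMPLE_BITS,  /* 16 samples in a whole vector      */
        MIN_PERIOD  = 15,                       /* minimum CELT postfilter period    */
        EDGE_TAPS   = 2,                        /* kernel reaches 2 samples past lag */
        MAX_PRELOAD = MIN_PERIOD - EDGE_TAPS,   /* only 13 samples can be preloaded  */
        CAPPED_VL   = 8                         /* <= 13, and 8 x 32 = 256 bits      */
    };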
Likewise, we have the 1024-bit variant with the quarter multiplier. In
theory, a 2048-bit one would be possible with the eighth multiplier, but
that length is not even defined in the specifications as of yet, nor is
it supported by any emulator - forget actual hardware.
This adds a variant of the postfilter for use with 256-bit vectors.
As a single vector is then large enough to perform the scalar product,
the group multiplier is reduced to just one at run-time.
The different vector type is passed via register. Unfortunately,
there is no VSETIVL instruction (i.e. no form taking an immediate vector
length together with a vector type from a register), so the constant
vector size (5) also needs to be passed via a register.
This is implemented for a vector size of 128 bits. Since the scalar
product in the inner loop covers 5 samples or 160 bits, we need a group
multiplier of 2.
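For reference, the scalar product in question is the symmetric 5-tap comb
filter of the CELT postfilter, roughly as below (a sketch with illustrative
names, not FFmpeg's exact reference code):

    /* Each output accumulates five delayed 32-bit floats centred "period"
     * samples back, hence 5 x 32 bits = 160 bits per inner-loop iteration. */
    static void postfilter_step(float *data, int i, int period,
                                float g0, float g1, float g2)
    {
        data[i] += g0 *  data[i - period]
                 + g1 * (data[i - period - 1] + data[i - period + 1])
                 + g2 * (data[i - period - 2] + data[i - period + 2]);
    }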
To avoid reconfiguring the vector type, the outer loop, which loads
multiple input samples, sticks to the same multiplier. Consequently, the
outer loop loads 8 samples per iteration. This is safe since the minimum
period of the CELT codec is 15 samples.
The same code would also work, albeit needlessly inefficiently, with a
vector length of 256 bits. A proper implementation will follow instead.
Simply using the Zbb REV8 instruction in a plain loop already gives
significant savings:
bswap_buf_c: 1081.0
bswap_buf_rvb_b: 771.0
But we can also use the 64-bit REV8 as a pseudo-SIMD instruction with
just one additional shift, and one fewer load, effectively doubling the
bandwidth. Consequently, this patch is useful even if the compile-time
target has Zbb enabled for C code:
bswap_buf_c: 1081.0
bswap_buf_rvb_b: 341.0 (this patch)
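In C terms, the trick looks roughly like this (assuming a little-endian
host; rev8_64() stands in for the REV8 instruction, and this is only a model
of the idea, not the actual assembly):

    #include <stdint.h>
    #include <string.h>

    static uint64_t rev8_64(uint64_t x)  /* stand-in for Zbb's 64-bit REV8 */
    {
        x = ((x & 0x00FF00FF00FF00FFULL) <<  8) | ((x >>  8) & 0x00FF00FF00FF00FFULL);
        x = ((x & 0x0000FFFF0000FFFFULL) << 16) | ((x >> 16) & 0x0000FFFF0000FFFFULL);
        return (x << 32) | (x >> 32);
    }

    /* Byte-swap two 32-bit words at once: reversing all 8 bytes of a 64-bit
     * load swaps the bytes within each word but also swaps the two words'
     * positions, so one extra shift puts the first word back in place. */
    static void bswap_two(uint32_t *dst, const uint32_t *src)
    {
        uint64_t pair;

        memcpy(&pair, src, 8);            /* one 64-bit load instead of two    */
        pair = rev8_64(pair);
        dst[1] = (uint32_t)pair;          /* low half is the swapped 2nd word  */
        dst[0] = (uint32_t)(pair >> 32);  /* the extra shift recovers the 1st  */
    }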
On the other hand, this approach fails miserably for bswap16_buf as the
ratio of shifts and stores becomes unfavorable compared to naïve C:
bswap16_buf_c: 1542.0
bswap16_buf_rvb_b: 1803.7
Unrolling to process 128 bits (4 samples) at a time actually worsens
performance ever so slightly:
bswap_buf_c: 1081.0
bswap_buf_rvb_b: 408.5
To avoid data dependencies, this does the following unroll, which
requires one extra but probably free addition:
    coeff = (b * left_weight) >> decorr_shift;
    b += a;
    a -= coeff;
    b -= coeff;
    swap(a, b);
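For contrast, the straightforward serial formulation (as in the C reference,
to the best of my reading) chains every operation on the previous one,
whereas above the multiply/shift chain and the b += a addition can issue
independently:

    coeff = (b * left_weight) >> decorr_shift;
    a -= coeff;   /* depends on coeff                 */
    b += a;       /* depends on the freshly updated a */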
This starts with one-time initialisation of the 26 constant factors,
as in commit 08edacc248. That is done with
the scalar instruction set. While the formula can readily be vectorised,
the gains would (probably) be more than lost in transferring the results
back to FP registers (or suitably reshuffling them into vector
registers).
Note that the main loop could likely be scheduled slightly better by
expanding the filter macro and interleaving loads with arithmetic.
It is not clear yet if that would be relevant for vector processing (as
opposed to traditional SIMD).
We could also use fewer vectors, but there is not much point in sparing
them (they are *all* callee-clobbered).
This uses the following vectorisation:
    for (i = 0; i < blocksize; i++) {
        float m = mag[i], a = ang[i];  /* both outputs need the original values */

        ang[i] = m - copysignf(fmaxf(a, 0.f), m);
        mag[i] = m - copysignf(fminf(a, 0.f), m);
    }