This starts with one-time initialisation of the 26 constant factors
like 08edacc248. That is done with
the scalar instruction set. While the formula can readily be vectored,
the gains would (probably) be more than lost in transfering the results
back to FP registers (or suitably reshuffling them into vector
registers).
Note that the main loop could likely be scheduled sligthly better by
expanding the filter macro and interleaving loads with arithmetic.
It is not clear yet if that would be relevant for vector processing (as
opposed to traditional SIMD).
We could also use fewer vectors, but there is not much point in sparing
them (they are *all* callee-clobbered).
This uses the following vectorisation:
for (i = 0; i < blocksize; i++) {
ang[i] = mag[i] - copysignf(fmaxf(ang[i], 0.f), mag[i]);
mag[i] = mag[i] - copysignf(fminf(ang[i], 0.f), mag[i]);
}
RV64G supports MIN & MAX instructions natively only on floating point
registers, not general purpose ones. The later would require the Zbb
extension. Due to that, it is actually faster to perform the clipping
"properly" in FPU.
Benchmarks on SiFive U74-MC (courtesy of Shanghai StarFive Tech):
audiodsp.vector_clipf_c: 29551.5
audiodsp.vector_clipf_rvf: 17871.0
Also tried unrolling with 2 or 8 elements but it gets worse either way.