This uses a more traditional approach, processing up to period minus
two elements per iteration. It also allows the algorithm to work with
any vector length.
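
The period-minus-two cap can be illustrated with a plain C model. This
is only a sketch, not the RVV code from this patch: it assumes the
filter is the usual in-place 5-tap comb centred on data[i - period],
and MAX_VL, taps(), postfilter_ref() and postfilter_chunked() are
hypothetical names chosen for the illustration. Each chunk is computed
entirely from memory as it stood before the chunk, as a vector load
followed by a vector store would do. Since the newest sample read for
output i is data[i - period + 2], a chunk of at most period - 2 outputs
never reads a sample it has yet to write.

```c
#include <assert.h>

#define MAX_VL 16 /* hypothetical: elements per vector on the test device */

/* Five taps centred on p[0], which the callers set to &data[i - period]. */
static float taps(const float *p, const float *g)
{
    return g[0] * p[0] + g[1] * (p[1] + p[-1]) + g[2] * (p[2] + p[-2]);
}

/* Scalar reference: one output at a time, in place. */
static void postfilter_ref(float *data, int period, const float *gains, int len)
{
    for (int i = 0; i < len; i++)
        data[i] += taps(data + i - period, gains);
}

/* Chunked model: up to min(MAX_VL, period - 2) outputs per iteration. */
static void postfilter_chunked(float *data, int period, const float *gains,
                               int len)
{
    assert(period > 2);

    int cap    = period - 2; /* largest chunk with no intra-chunk dependency */
    int vl_max = cap < MAX_VL ? cap : MAX_VL;

    for (int i = 0; i < len; ) {
        int   vl = len - i < vl_max ? len - i : vl_max;
        float tmp[MAX_VL];

        /* "Vector" compute: every lane reads only samples below index i. */
        for (int k = 0; k < vl; k++)
            tmp[k] = data[i + k] + taps(data + i + k - period, gains);

        /* "Vector" store: write the whole chunk back at once. */
        for (int k = 0; k < vl; k++)
            data[i + k] = tmp[k];

        i += vl;
    }
}
```

With period = 15 and MAX_VL = 16, vl_max comes out as 13, which matches
the cap mentioned for the minimal period.
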
As the T-Head C908 device under test can load 16 elements per loop
iteration, there is unsurprisingly a small performance drop when the
period is minimal and the parallelism is capped at 13 elements:
Before:
postfilter_15_c: 21222.2
postfilter_15_rvv_f32: 22007.7
postfilter_512_c: 20189.7
postfilter_512_rvv_f32: 22004.2
postfilter_1022_c: 20189.7
postfilter_1022_rvv_f32: 22004.2
After:
postfilter_15_c: 20189.5
postfilter_15_rvv_f32: 7057.2
postfilter_512_c: 20189.5
postfilter_512_rvv_f32: 5667.2
postfilter_1022_c: 20192.7
postfilter_1022_rvv_f32: 5667.2