There is no known (real) hardware with V and without the complete B
extension. B was indeed required in the RISC-V application profile from
2022, earlier than V. There should not be any relevant hardware in the
future either.
In practice, different R-V Vector optimisations in FFmpeg already depend on
every constituent of the B extension anyhow, so it would not work well.
hf_apply_noise_0_c: 35.7
hf_apply_noise_0_rvv_f32: 9.5
hf_apply_noise_1_c: 38.5
hf_apply_noise_1_rvv_f32: 10.0
hf_apply_noise_2_c: 35.5
hf_apply_noise_2_rvv_f32: 9.7
hf_apply_noise_3_c: 38.5
hf_apply_noise_3_rvv_f32: 10.0
Maybe extending the noise table manually is not such great idea, but I
not quite sure how to deal with that otherwise? Allocating the table
dynamically is possible but would require an ELF destructor to clean up.
128-bit is the maximum, not the minimum here. Larger vector sizes can
result in reads past the end of the noise value table.
This partially reverts commit cdcb4b98b7.
While this function can easily be written with vectors, it just fails to
get any performance improvement.
For reference, this is a simpler loop-free implementation that does get
better performance than the current one depending on hardware, but still
more or less the same metrics as the C code:
func ff_sbr_neg_odd_64_rvv, zve64x
li a1, 32
addi a0, a0, 7
li t0, 8
vsetvli zero, a1, e8, m2, ta, ma
li t1, 0x80
vlse8.v v8, (a0), t0
vxor.vx v8, v8, t1
vsse8.v v8, (a0), t0
ret
endfunc
This reverts commit d06fd18f8f.
This is restricted to 128-bit vectors as larger vector sizes could read
past the end of the noise array. Support for future hardware with larger
vector sizes is left for some other time.
hf_apply_noise_0_c: 2319.7
hf_apply_noise_0_rvv_f32: 1229.0
hf_apply_noise_1_c: 2539.0
hf_apply_noise_1_rvv_f32: 1244.7
hf_apply_noise_2_c: 2319.7
hf_apply_noise_2_rvv_f32: 1232.7
hf_apply_noise_3_c: 2541.2
hf_apply_noise_3_rvv_f32: 1244.2
With 5 accumulator vectors and 6 inputs, this can only use LMUL=2.
Also the number of vector loop iterations is small, just 5 on 128-bit
vector hardware.
The vector loop is somewhat unusual in that it processes data in
descending memory order, in order to save on vector slides:
in descending order, we can extract elements to carry over to the next
iteration from the bottom of the vectors directly. With ascending order
(see in the Opus postfilter function), there are no ways to get the top
elements directly. On the downside, this requires the use of separate
shift and sub (the would-be SH3SUB instruction does not exist), with
a small pipeline stall on the vector load address.
The edge cases in scalar are done in scalar as this saves on loads
and remains significantly faster than C.
autocorrelate_c: 669.2
autocorrelate_rvv_f32: 421.0
With 128-bit vectors, this is mostly pointless but also harmless.
Performance gains should be more noticeable with larger vector sizes.
neg_odd_64_c: 76.2
neg_odd_64_rvv_i64: 74.7
The code was blindly assuming that Zbb or V implied Zba. While the
earlier is practically always true, the later broke some QEMU setups,
as V was introduced earlier than Zba.
RV64G supports MIN & MAX instructions natively only on floating point
registers, not general purpose ones. The later would require the Zbb
extension. Due to that, it is actually faster to perform the clipping
"properly" in FPU.
Benchmarks on SiFive U74-MC (courtesy of Shanghai StarFive Tech):
audiodsp.vector_clipf_c: 29551.5
audiodsp.vector_clipf_rvf: 17871.0
Also tried unrolling with 2 or 8 elements but it gets worse either way.
Initialise VC1DSPContext for parser as well as for decoder.
Note, the VC-1 code doesn't actually use the function pointer yet.
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
For:
ff_vc1_inv_trans_{8,4}x{8,4}_{dc_,}neon
ff_put_pixels8x8_neon
ff_put_vc1_mspel_mc{0,1,2,3}{0,1,2,3}_neon (except for 00)
Based on ARM assembly code in libavcodec/arm by Rob Clark and Mans
Rullgard.
Signed-off-by: Martin Storsjö <martin@martin.st>
This allows masking CPU features with the -cpuflags avconv option
which is useful for testing different optimisations without rebuilding.
Signed-off-by: Mans Rullgard <mans@mansr.com>
From 52.503s (~40fps) to 27.973sec (~80fps) decoding of 480p sintel
trailer, i.e. a ~2x speedup overall, on a Nexus S.
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
This adds NEON optimised versions of all functions in VP8DSPContext.
Based on initial work by Rob Clark.
Signed-off-by: Mans Rullgard <mans@mansr.com>
(cherry picked from commit a1c1d3c003)