This cannot beat the Zbb implementation, and it is unlikely that a real
meaningful CPU design would support V and not Zbb. The best loop rewrite
that I could come up with (4 shifts, 2 ands, 3 ors) is still ~40% slower
than Zbb.
A proper faster vector implementation should be feasible with the
cryptographic vector extensions, but that is a story for another time.
The code was blindly assuming that Zbb or V implied Zba. While the
earlier is practically always true, the later broke some QEMU setups,
as V was introduced earlier than Zba.
Simply taking the Zbb REV8 instruction into use in a simple loop gives
some significant savings:
bswap_buf_c: 1081.0
bswap_buf_rvb_b: 771.0
But we can also use the 64-bit REV8 as a pseudo-SIMD instruction with
just one additional shift, and one fewer load, effectively doubling the
bandwidth. Consequently, this patch is useful even if the compile-time
target has Zbb enabled for C code:
bswap_buf_c: 1081.0
bswap_buf_rvb_b: 341.0 (this patch)
On the other hand, this approach fails miserably for bswap16_buf as the
ratio of shifts and stores becomes unfavorable compared to naïve C:
bswap16_buf_c: 1542.0
bswap16_buf_rvb_b: 1803.7
Unrolling to process 128 bits (4 samples) at a time actually worsens
performance ever so slightly:
bswap_buf_c: 1081.0
bswap_buf_rvb_b: 408.5
RV64G supports MIN & MAX instructions natively only on floating point
registers, not general purpose ones. The later would require the Zbb
extension. Due to that, it is actually faster to perform the clipping
"properly" in FPU.
Benchmarks on SiFive U74-MC (courtesy of Shanghai StarFive Tech):
audiodsp.vector_clipf_c: 29551.5
audiodsp.vector_clipf_rvf: 17871.0
Also tried unrolling with 2 or 8 elements but it gets worse either way.
Initialise VC1DSPContext for parser as well as for decoder.
Note, the VC-1 code doesn't actually use the function pointer yet.
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
For:
ff_vc1_inv_trans_{8,4}x{8,4}_{dc_,}neon
ff_put_pixels8x8_neon
ff_put_vc1_mspel_mc{0,1,2,3}{0,1,2,3}_neon (except for 00)
Based on ARM assembly code in libavcodec/arm by Rob Clark and Mans
Rullgard.
Signed-off-by: Martin Storsjö <martin@martin.st>
This allows masking CPU features with the -cpuflags avconv option
which is useful for testing different optimisations without rebuilding.
Signed-off-by: Mans Rullgard <mans@mansr.com>
From 52.503s (~40fps) to 27.973sec (~80fps) decoding of 480p sintel
trailer, i.e. a ~2x speedup overall, on a Nexus S.
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
This adds NEON optimised versions of all functions in VP8DSPContext.
Based on initial work by Rob Clark.
Signed-off-by: Mans Rullgard <mans@mansr.com>
(cherry picked from commit a1c1d3c003)