This reuses the DC bypass functions from the multiple IDCT functions, to
leverage vector code.
As an added bonus, the caller functions can now rely on the callee functions
to preserve their parameters, thus cutting down on stack spills.
No functional changes. This just moves the assembler so that it can be
referenced by other functions in h264idct_rvv.S with local jumps.
Edited-by: Rémi Denis-Courmont <remi@remlab.net>
Unlike the 8-bit version, we need two iterations to process this within
128-bit vectors. This adds some extra complexity for pointer arithmetic
and counting down, which is unnecessary in the 8-bit variant.
Accordingly, the gains relative to C are only slightly better than half
as large with 128-bit vectors as with 256-bit ones.
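For illustration, the column pass boils down to a strip-mined loop along
these lines (a C sketch only, not the actual assembler; transform_columns()
is a hypothetical stand-in for the vectorised per-pass work):

    #include <stdint.h>

    /* Hypothetical helper standing in for the vector column transform;
     * the real work happens in h264idct_rvv.S. */
    static void transform_columns(int16_t *block, int ncols)
    {
        (void)block; (void)ncols;
    }

    /* With 32-bit intermediates, 128-bit vectors cover 4 of the 8 columns
     * per pass, so this loop body runs twice; 256-bit vectors need a single
     * pass and none of the pointer/counter book-keeping. */
    static void idct8_vertical_pass(int16_t *block, int cols_per_pass)
    {
        int left = 8;                        /* columns still to transform */
        while (left > 0) {
            int n = left < cols_per_pass ? left : cols_per_pass;
            transform_columns(block, n);
            block += n;                      /* advance to the next columns */
            left  -= n;                      /* count down the remainder */
        }
    }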
T-Head C908 (2 iterations):
h264_idct8_add_9bpp_c: 17.5
h264_idct8_add_9bpp_rvv_i32: 10.0
h264_idct8_add_10bpp_c: 17.5
h264_idct8_add_10bpp_rvv_i32: 9.7
h264_idct8_add_12bpp_c: 17.7
h264_idct8_add_12bpp_rvv_i32: 9.7
h264_idct8_add_14bpp_c: 17.7
h264_idct8_add_14bpp_rvv_i32: 9.7
SpacemiT X60 (single iteration):
h264_idct8_add_9bpp_c: 15.2
h264_idct8_add_9bpp_rvv_i32: 5.0
h264_idct8_add_10bpp_c: 15.2
h264_idct8_add_10bpp_rvv_i32: 5.0
h264_idct8_add_12bpp_c: 14.7
h264_idct8_add_12bpp_rvv_i32: 5.0
h264_idct8_add_14bpp_c: 14.7
h264_idct8_add_14bpp_rvv_i32: 4.7
There are two implementations here:
- a generic scalable one processing two columns at a time,
- a specialised one processing one (fixed-size) row at a time.
Unsurprisingly, the generic one works out better with smaller widths.
With larger widths, the gains from filling vectors are outweighed by
the extra cost of strided loads and stores. In other words, memory
accesses become the bottleneck.
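Both variants evaluate the standard H.264 explicit weighted-prediction
formula per sample; a reference-style C sketch of the 8-bit case (not the
actual FFmpeg template) makes the work per element explicit:

    #include <stddef.h>
    #include <stdint.h>

    /* For each sample: scale, round, shift, add the offset, clip to 8 bits.
     * The generic RVV variant walks this two columns at a time with strided
     * accesses; the specialised one handles a full fixed-size row per step. */
    static void weight_pixels(uint8_t *block, ptrdiff_t stride, int width,
                              int height, int log2_denom, int weight,
                              int offset)
    {
        for (int y = 0; y < height; y++, block += stride)
            for (int x = 0; x < width; x++) {
                int v = block[x] * weight;
                if (log2_denom > 0)
                    v = (v + (1 << (log2_denom - 1))) >> log2_denom;
                v += offset;
                block[x] = v < 0 ? 0 : v > 255 ? 255 : v;
            }
    }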
T-Head C908:
h264_weight2_8_c: 54.5
h264_weight2_8_rvv_i32: 13.7
h264_weight4_8_c: 101.7
h264_weight4_8_rvv_i32: 27.5
h264_weight8_8_c: 197.0
h264_weight8_8_rvv_i32: 75.5
h264_weight16_8_c: 385.0
h264_weight16_8_rvv_i32: 74.2
SpacemiT X60:
h264_weight2_8_c: 48.5
h264_weight2_8_rvv_i32: 8.2
h264_weight4_8_c: 90.7
h264_weight4_8_rvv_i32: 16.5
h264_weight8_8_c: 175.0
h264_weight8_8_rvv_i32: 37.7
h264_weight16_8_c: 342.2
h264_weight16_8_rvv_i32: 66.0
While this *tends* to be faster than plain C, the performance numbers
are all over the place, presumably due to the conditional character of
the main loop.
Some additional micro-optimisations should be feasible after the
underlying h264_idct_add and h264_idct_dc_add functions are also
implemented. Then it will no longer be necessary to strictly abide by
the C ABI.
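For context, the main loop mirrors the per-block dispatch of the C
reference; a simplified 8-bit sketch (the scan8 table and the two callees
are declarations standing in for the real FFmpeg symbols):

    #include <stdint.h>

    /* Stand-ins for the real FFmpeg symbols (scan8 table, ff_h264_idct_add
     * and ff_h264_idct_dc_add); declarations only, for illustration. */
    extern const uint8_t scan8[];
    void idct_add(uint8_t *dst, int16_t *block, int stride);
    void idct_dc_add(uint8_t *dst, int16_t *block, int stride);

    /* Each of the 16 4x4 blocks is skipped, DC-only added or fully
     * transformed depending on its non-zero coefficient count, so the
     * amount of work varies wildly with the input. */
    static void idct_add16_sketch(uint8_t *dst, const int *block_offset,
                                  int16_t *block, int stride,
                                  const uint8_t nnzc[5 * 8])
    {
        for (int i = 0; i < 16; i++) {
            int nnz = nnzc[scan8[i]];
            if (!nnz)
                continue;
            if (nnz == 1 && block[i * 16])
                idct_dc_add(dst + block_offset[i], block + i * 16, stride);
            else
                idct_add(dst + block_offset[i], block + i * 16, stride);
        }
    }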
Performance is (unfortunately) the same as with non-MBAFF, since the
hardware under test does not short-circuit vector tail calculations.
(IMO, a generic solution or work-around should be agreed on, rather
than bespoke approaches all over the place.)
T-Head C908 (cycles):
h264_h_loop_filter_luma_8bpp_c: 297.5
h264_h_loop_filter_luma_8bpp_rvv_i32: 369.2
h264_v_loop_filter_luma_8bpp_c: 862.7
h264_v_loop_filter_luma_8bpp_rvv_i32: 199.7
Performance in the horizontal scenario seems worse than scalar. x86
SSE2 and AVX optimisations are similarly affected. This is presumably
caused by unlucky inputs from checkasm, such that the C code
short-circuits almost all filter calculations.
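The short-circuit in question is the standard per-line gate of the H.264
deblocking filter; a minimal C sketch:

    #include <stdlib.h>

    /* A line of samples across the edge is filtered only when all three
     * thresholds hold. checkasm's random inputs rarely satisfy them, so
     * the scalar code skips nearly all of the arithmetic, while the vector
     * code always computes every lane and merely masks the result. */
    static int needs_filtering(int p1, int p0, int q0, int q1,
                               int alpha, int beta)
    {
        return abs(p0 - q0) < alpha &&
               abs(p1 - p0) < beta  &&
               abs(q1 - q0) < beta;
    }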
Judging by the coefficients, the last round of add/sub can overflow
to 17 bits with a very small probability just as with the 8-point
transform. This is not observed under FATE, but better safe than sorry.
The last set of additions/subtractions can break the 16-bit limit and
requires 17 bits of precision. This uses widening adds accordingly to fix
the MSS2 FATE tests.
The problem potentially also affects inv_trans_4 with a very low
probability, but this is not reproducible under FATE.
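A trivial C illustration of why 16 bits are not enough for that final step
(values chosen arbitrarily; the wrap-around assumes the usual
two's-complement behaviour):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        int16_t a = 20000, b = 15000;       /* both fit in 16 bits */
        int32_t wide   = (int32_t)a + b;    /* widening add: 35000, correct */
        int16_t narrow = (int16_t)(a + b);  /* kept narrow: wraps to -30536 */
        printf("%d vs %d\n", (int)wide, (int)narrow);
        return 0;
    }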
Due to hysterical raisins, most RISC-V Linux distributions target a
RV64GC baseline excluding the Bit-manipulation ISA extensions, most
notably:
- Zba: address generation extension and
- Zbb: basic bit manipulation extension.
Most CPUs that FFmpeg is likely to run on support Zba and Zbb (including
the current FATE runner), so it makes sense to optimise for them. In
fact, a large chunk of the existing assembler optimisations relies on
Zba and/or Zbb.
Since we cannot patch shared library code, the next best thing is to
carry a flag initialised at load time and check it on an as-needed basis.
This results in 3 instructions overhead on isolated use, e.g.:
1: AUIPC rd, %pcrel_hi(ff_rv_zbb_supported)
   LBU   rd, %pcrel_lo(1b)(rd)
   BEQZ  rd, non_Zbb_fallback_code
   // Zbb code here
The C compiler will typically load the flag ahead of time to reduce
latency, and can also keep it around if Zbb is used multiple times in a
single optimisation scope. For this to work, the flag symbol must be
hidden; otherwise the optimisation degrades with a GOT look-up to
support interposition:
1: AUIPC rd, GOT_OFFSET_HI
   LD    rd, GOT_OFFSET_LO(rd)
   LBU   rd, (rd)
   BEQZ  rd, non_Zbb_fallback_code
   // Zbb code here
This patch adds code to provision the flag in libraries using bit
manipulation functions from libavutil: byte-swap, bit-weight and
counting leading or trailing zeroes.
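From C, the intended usage pattern looks roughly like this (a sketch only:
the exact declaration of ff_rv_zbb_supported and the two helpers shown are
illustrative, not FFmpeg's actual code):

    #include <stdbool.h>

    /* Hidden visibility keeps the access PC-relative (the 3-instruction
     * sequence above) rather than forcing a GOT look-up. */
    extern bool ff_rv_zbb_supported __attribute__((visibility("hidden")));

    /* Hypothetical helpers: one wraps the Zbb CPOP instruction, the other
     * is a portable fallback. */
    int zbb_cpop(unsigned int x);
    int generic_popcount(unsigned int x);

    static inline int popcount32(unsigned int x)
    {
        if (ff_rv_zbb_supported)
            return zbb_cpop(x);
        return generic_popcount(x);
    }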
Although checkasm does not verify this, the decoder requires that the
transform updates the input block exactly like the C code does.
This fixes vc1-ism, vc1_ilaced_twomv, vc1_sa00040, vc1_sa10091,
vc1_sa10143, vc1_sa20021, vc1test_smm0005 and wmv3-drm-dec tests.
Although checkasm does not verify this, the decoder requires that the
transform updates the input block exactly like the C code does.
This fixes vc1-ism, vc1_ilaced_twomv, vc1_sa00040, vc1_sa10091,
vc1_sa10143, vc1_sa20021, vc1test_smm0005 and wmv3-drm-dec tests.
T-Head C908 (cycles):
vc1dsp.vc1_inv_trans_4x4_c: 310.7
vc1dsp.vc1_inv_trans_4x4_rvv_i32: 120.0
We could use 1 `vlseg4e64.v` instead of 4 `vle16.v`, but that seems to
be about 7% slower.