T-Head C908 (cycles):
h264_h_loop_filter_luma_8bpp_c: 297.5
h264_h_loop_filter_luma_8bpp_rvv_i32: 369.2
h264_v_loop_filter_luma_8bpp_c: 862.7
h264_v_loop_filter_luma_8bpp_rvv_i32: 199.7
Performance in the horizontal scenario seems worse than scalar. x86
SSE2 and AVX optimisations are similarly affected. This is presumably
caused by unlucky inputs from checkasm, such that the C code
short-circuits almost all filter calculations.