Better performance can probably be achieved with a more intricate
unrolled loop, but this is a start:
add_hfyu_left_pred_bgr32_c: 15084.0
add_hfyu_left_pred_bgr32_rvv_i32: 10280.2
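For context, a hedged plain-C model of what this function computes
(signature and argument names approximated, not FFmpeg's actual code):
a per-component running sum across 4-byte BGRA pixels, which the vector
version carries across loop iterations.

    #include <stdint.h>

    /* Approximate reference: each of the 4 byte lanes accumulates
     * independently, with the running value carried in left[]. */
    static void left_pred_bgr32_ref(uint8_t *dst, const uint8_t *src,
                                    intptr_t w, uint8_t left[4])
    {
        for (intptr_t i = 0; i < w; i++)
            for (int c = 0; c < 4; c++)
                dst[4 * i + c] = left[c] = left[c] + src[4 * i + c];
    }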
This would actually be cleaner with the RISC-V P extension, but that is
not ratified yet (I think?) and usually not supported if V is supported.
This is restricted to 128-bit vectors as larger vector sizes could read
past the end of the noise array. Support for future hardware with larger
vector sizes is left for some other time.
hf_apply_noise_0_c: 2319.7
hf_apply_noise_0_rvv_f32: 1229.0
hf_apply_noise_1_c: 2539.0
hf_apply_noise_1_rvv_f32: 1244.7
hf_apply_noise_2_c: 2319.7
hf_apply_noise_2_rvv_f32: 1232.7
hf_apply_noise_3_c: 2541.2
hf_apply_noise_3_rvv_f32: 1244.2
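A minimal standalone sketch of that restriction, using only standard RVV
intrinsics (FFmpeg's actual dispatch code is not reproduced here): the
vector version is only safe when vector registers are exactly 128 bits
wide.

    #include <riscv_vector.h>

    /* True iff vector registers are exactly 128 bits: VLMAX for e8/m1
     * equals VLEN/8, i.e. 16 on 128-bit hardware. */
    static int vectors_are_128bit(void)
    {
        return __riscv_vsetvlmax_e8m1() == 16;
    }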
With 5 accumulator vectors and 6 inputs, this can only use LMUL=2.
Also the number of vector loop iterations is small, just 5 on 128-bit
vector hardware.
The vector loop is somewhat unusual in that it processes data in
descending memory order, in order to save on vector slides:
in descending order, we can extract elements to carry over to the next
iteration from the bottom of the vectors directly. With ascending order
(see the Opus postfilter function), there is no way to get the top
elements directly. On the downside, this requires the use of separate
shift and sub (the would-be SH3SUB instruction does not exist), with
a small pipeline stall on the vector load address.
The edge cases are done in scalar, as this saves on loads and remains
significantly faster than C.
autocorrelate_c: 669.2
autocorrelate_rvv_f32: 421.0
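An intrinsics-level sketch of the carry trick (shapes assumed, not the
actual autocorrelate code): lane 0 of a vector can be read directly
whatever the selected vector length, so a descending walk leaves the
element to carry over in lane 0.

    #include <riscv_vector.h>

    /* Load one vector of the descending walk and return the element
     * carried into the next iteration: it sits in lane 0, so a plain
     * vfmv.f.s suffices and no vector slide is needed. */
    static float load_and_carry(const float *p, size_t avl)
    {
        size_t vl = __riscv_vsetvl_e32m2(avl);
        vfloat32m2_t v = __riscv_vle32_v_f32m2(p, vl);
        return __riscv_vfmv_f_s_f32m2_f32(v);
    }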
Given the size of the data set, strided memory accesses cannot be avoided.
We can still do better than the current code.
ps_hybrid_synthesis_deint_c: 12065.5
ps_hybrid_synthesis_deint_rvv_i32: 13650.2 (before)
ps_hybrid_synthesis_deint_rvv_i64: 8181.0 (after)
Segmented loads can be slower than plain loads, so this advantageously
uses a unit-strided load and narrowing shifts instead.
Before:
ps_add_squares_c: 60757.7
ps_add_squares_rvv_f32: 22242.5
After:
ps_add_squares_c: 60516.0
ps_add_squares_rvv_i64: 17067.7
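A hedged sketch of that trick, assuming an interleaved {re, im} float
pair layout: the pairs are loaded as 64-bit units and split with
narrowing right shifts by 0 and 32 bits, instead of a segmented load
(VLSEG2E32.V).

    #include <riscv_vector.h>
    #include <stdint.h>

    /* Deinterleave n {re, im} float pairs without a segmented load:
     * one unit-strided 64-bit load, two narrowing shifts. */
    static void split_pairs(const uint64_t *src, float *re, float *im,
                            size_t n)
    {
        for (size_t vl; n > 0; n -= vl, src += vl, re += vl, im += vl) {
            vl = __riscv_vsetvl_e64m2(n);
            vuint64m2_t v  = __riscv_vle64_v_u64m2(src, vl);
            vuint32m1_t lo = __riscv_vnsrl_wx_u32m1(v, 0, vl);  /* even */
            vuint32m1_t hi = __riscv_vnsrl_wx_u32m1(v, 32, vl); /* odd  */
            __riscv_vse32_v_f32m1(re,
                __riscv_vreinterpret_v_u32m1_f32m1(lo), vl);
            __riscv_vse32_v_f32m1(im,
                __riscv_vreinterpret_v_u32m1_f32m1(hi), vl);
        }
    }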
This uses a more traditional approach, allowing processing of up to
(period minus two) elements per iteration. This also allows the
algorithm to work with any and all vector lengths.
As the T-Head C908 device under test can load 16 elements per loop
iteration, there is unsurprisingly a small performance drop when the
period is minimal and the parallelism is capped at 13 elements:
Before:
postfilter_15_c: 21222.2
postfilter_15_rvv_f32: 22007.7
postfilter_512_c: 20189.7
postfilter_512_rvv_f32: 22004.2
postfilter_1022_c: 20189.7
postfilter_1022_rvv_f32: 22004.2
After:
postfilter_15_c: 20189.5
postfilter_15_rvv_f32: 7057.2
postfilter_512_c: 20189.5
postfilter_512_rvv_f32: 5667.2
postfilter_1022_c: 20192.7
postfilter_1022_rvv_f32: 5667.2
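The resulting loop structure, sketched with a hypothetical
process_chunk() standing in for the vectorised inner pass: output
sample i depends on samples back to i - (period - 2), so only
period - 2 consecutive outputs are independent.

    #include <stddef.h>

    /* Hypothetical vectorised pass over n independent samples. */
    void process_chunk(float *x, size_t n);

    static void postfilter_outer(float *x, size_t len, size_t period)
    {
        size_t max = period - 2; /* largest dependency-safe chunk */
        while (len > 0) {
            size_t chunk = len < max ? len : max;
            process_chunk(x, chunk);
            x   += chunk;
            len -= chunk;
        }
    }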
As in the aligned case, we can use VLSE64.V, though the way of doing so
gets more convoluted, so the performance gains are more modest:
get_pixels_unaligned_c: 126.7
get_pixels_unaligned_rvv_i32: 145.5 (before)
get_pixels_unaligned_rvv_i64: 62.2 (after)
For reference, these are the aligned benchmarks (unchanged) on the
same T-Head C908 hardware:
get_pixels_c: 126.7
get_pixels_rvi: 85.7
get_pixels_rvv_i64: 33.2
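For the aligned case, the core idea reduces to this illustration (not
FFmpeg's actual code): every 8-byte scan line is one 64-bit element, so
a single strided load fetches all eight rows of the block.

    #include <riscv_vector.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Load an aligned 8x8 block of 8-bit pixels as eight 64-bit
     * lanes, one scan line each, with a single VLSE64.V. */
    static vuint64m1_t load_block_8x8(const uint8_t *pixels,
                                      ptrdiff_t stride)
    {
        size_t vl = __riscv_vsetvl_e64m1(8);
        return __riscv_vlse64_v_u64m1((const uint64_t *)pixels,
                                      stride, vl);
    }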
With 128-bit vectors, this is mostly pointless but also harmless.
Performance gains should be more noticeable with larger vector sizes.
neg_odd_64_c: 76.2
neg_odd_64_rvv_i64: 74.7
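One plausible shape for such a function, as a guess rather than the
actual code: treat each {even, odd} float pair as one 64-bit lane and
flip the sign bit of the odd element, which on little-endian RISC-V is
the lane's top bit.

    #include <riscv_vector.h>
    #include <stdint.h>

    /* Negate the odd-indexed float of every pair; n counts pairs. */
    static void neg_odd(uint64_t *x, size_t n)
    {
        for (size_t vl; n > 0; n -= vl, x += vl) {
            vl = __riscv_vsetvl_e64m8(n);
            vuint64m8_t v = __riscv_vle64_v_u64m8(x, vl);
            v = __riscv_vxor_vx_u64m8(v, UINT64_C(1) << 63, vl);
            __riscv_vse64_v_u64m8(x, v, vl);
        }
    }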
If the scan lines are aligned, we can load each row as a 64-bit value,
thus avoiding segmentation. We can then factor the conversion or
subtraction.
In principle, the same optimisation should be possible for high depth,
but would require 128-bit elements, for which no FFmpeg CPU flag
exists.
Take vector reduction out of the loop and unroll.
Before:
audiodsp.scalarproduct_int16_c: 12321.0
audiodsp.scalarproduct_int16_rvv_i32: 4175.7
After:
audiodsp.scalarproduct_int16_c: 12320.5
audiodsp.scalarproduct_int16_rvv_i32: 1230.2
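In intrinsics terms the rework amounts to the sketch below (unrolling
omitted for brevity): a widening multiply-accumulate keeps per-lane
partial sums in vector registers across the whole loop, and the
comparatively slow reduction runs exactly once at the end.

    #include <riscv_vector.h>
    #include <stdint.h>

    static int32_t scalarproduct_i16(const int16_t *a,
                                     const int16_t *b, size_t n)
    {
        vint32m2_t acc =
            __riscv_vmv_v_x_i32m2(0, __riscv_vsetvlmax_e32m2());

        for (size_t vl; n > 0; n -= vl, a += vl, b += vl) {
            vl = __riscv_vsetvl_e16m1(n);
            vint16m1_t va = __riscv_vle16_v_i16m1(a, vl);
            vint16m1_t vb = __riscv_vle16_v_i16m1(b, vl);
            /* Tail-undisturbed (_tu) so a short last iteration cannot
             * clobber lanes accumulated so far. */
            acc = __riscv_vwmacc_vv_i32m2_tu(acc, va, vb, vl);
        }
        /* Single reduction, outside the loop. */
        vint32m1_t sum = __riscv_vredsum_vs_i32m2_i32m1(
            acc, __riscv_vmv_s_x_i32m1(0, 1),
            __riscv_vsetvlmax_e32m2());
        return __riscv_vmv_x_s_i32m1_i32(sum);
    }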
This cannot beat the Zbb implementation, and it is unlikely that a
meaningful real-world CPU design would support V but not Zbb. The best
loop rewrite
that I could come up with (4 shifts, 2 ands, 3 ors) is still ~40% slower
than Zbb.
A proper faster vector implementation should be feasible with the
cryptographic vector extensions, but that is a story for another time.
The code was blindly assuming that Zbb or V implied Zba. While the
former is practically always true, the latter broke some QEMU setups,
as V was introduced before Zba.
This increases the group multiplier as per T-Head C910 benchmarks:
inverse_coupling_c: 4597.0
inverse_coupling_rvv_i32: 1312.7 (m1)
inverse_coupling_rvv_i32: 1116.7 (m2)
inverse_coupling_rvv_i32: 732.2 (m4)
inverse_coupling_rvv_i32: 898.0 (m8)
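With intrinsics, the group multiplier is encoded in the types and the
vsetvl variant, so the change is mechanical; an illustrative m4
operation (not the inverse_coupling code itself):

    #include <riscv_vector.h>

    /* e32/m4: each instruction operates on a group of 4 vector
     * registers, four times the work per instruction of m1. */
    static vfloat32m4_t scale_m4(const float *p, float k, size_t avl)
    {
        size_t vl = __riscv_vsetvl_e32m4(avl);
        return __riscv_vfmul_vf_f32m4(__riscv_vle32_v_f32m4(p, vl),
                                      k, vl);
    }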
`VSETVLI xd, x0, ...' has rather nonobvious semantics:
- If xd is x0, then it preserves the current vector length.
- If xd is not x0, it sets the vector length to the supported maximum.
Also somewhat confusingly, while VMV.X.S always does its thing
regardless of the selected vector length, VMV.S.X does _nothing_ if the
selected vector length is zero.
So the current code fails to initialise the accumulator if we are
unlucky enough to have a selected vector length of zero on entry. Fix
this by forcing the vector length to one.
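Restated as an intrinsics sketch: the vector length in effect when the
scalar is inserted must be nonzero for element 0 to be written at all.

    #include <riscv_vector.h>

    /* Zero the accumulator's element 0. Forcing vl = 1 guarantees the
     * underlying VMV.S.X actually writes; with a selected vector
     * length of 0 it would silently do nothing. */
    static vint32m1_t zeroed_accumulator(void)
    {
        return __riscv_vmv_s_x_i32m1(0, 1);
    }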
Although the DSP function only uses single precision from RISC-V F, the
caller may leave double precision values in the spilled registers if the
calling convention supports double precision hardware floats. Then, we
need to save and restore FS registers as double precision.
Conversely, we do not need to save anything at all if an integer calling
convention is in use. However we can assume that single precision floats
are supported, since the Zve32f extension implies the F extension.
So for the sake of simplicity, we always save at least single precision
values.
In theory, we should even save quadruple precision values if the LP64Q
ABI is in use. I have yet to see a compiler that supports it though.
This adds a variant of the postfilter for use with 512-bit vectors.
Half a vector is enough to perform the scalar product. Normally a whole
vector would be used anyway; indeed, fractional multipliers are no
faster than the unit multiplier.
But in this particular function, a full vector makes up 16 samples,
which would be loaded at each iteration of the outer loop. The minimum
guaranteed CELT postfilter period is only 15. Accounting for the edges,
we can only safely preload up to 13 samples.
The fractional multiplier is thus used to cap the selected vector
length to a safe value of 8 elements, or 256 bits.
Likewise, we have the 1024-bit variant with the quarter multiplier. In
theory, a 2048-bit one would be possible with the eighth multiplier,
but that length is not even defined in the specifications as of yet,
nor is it supported by any emulator, let alone actual hardware.
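For the 512-bit case, the cap falls out of the type system directly; a
minimal sketch: with 32-bit elements and the half multiplier, VLMAX is
VLEN/64, i.e. 8 elements on 512-bit hardware.

    #include <riscv_vector.h>

    /* e32/mf2: VLMAX = VLEN/64, so 512-bit vectors yield at most
     * 8 elements (256 bits), within the 13 safely preloadable
     * samples. */
    static size_t capped_vl(size_t avl)
    {
        return __riscv_vsetvl_e32mf2(avl);
    }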