FFmpeg

Commit Graph

Author	SHA1	Message	Date
Rémi Denis-Courmont	cdd38a2ffe	lavc/aacpsdsp: fix R-V V stereo interpolate The penultimate loop iteration could pick any vl such that: vlenb/4 < vl <= vlenb/2 Thus if the total length is not a multiple of vlenb/2, the vfadd.vf on the penultimate iteration would yield corrupt values for the last iteration. To avoid this, force vl = vlen/2 until the last iteration. Unfortunately this latent bug is not reproducible with either hardware or QEMU as of now.	11 months ago
Rémi Denis-Courmont	0183c2c830	lavc/aacpsdsp: use LMUL=2 and amortise strides The input is laid out in 16 segments, of which 13 actually need to be loaded. There are no really efficient ways to deal with this: 1) If we load 8 segments wit unit stride, then narrow to 16 segments with right shifts, we can only get one half-size vector per segment, or just 2 elements per vector (EMUL=1/2) - at least with 128-bit vectors. This ends up unsurprisingly about as fas as the C code. 2) The current approach is to load with strides. We keep that approach, but improve it using three 4-segmented loads instead of 12 single-segment loads. This divides the number of distinct loaded addresses by 4. 3) A potential third approach would be to avoid segmentation altogether and splat the scalar coefficient into vectors. Then we can use a unit-stride and maximum EMUL. But the downside then is that we have to multiply the 3 (of 16) unused segments with zero as part of the multiply-accumulate operations. In addition, we also reuse vectors mid-loop so as to increase the EMUL from 1 to 2, which also improves performance a little bit. Oeverall the gains are quite small with the device under test, as it does not deal with segmented loads very well. But at least the code is tidier, and should enjoy bigger speed-ups on better hardware implementation. Before: ps_hybrid_analysis_c: 1819.2 ps_hybrid_analysis_rvv_f32: 1037.0 (before) ps_hybrid_analysis_rvv_f32: 990.0 (after)	1 year ago
Rémi Denis-Courmont	f576a0835b	lavc/aacpsdsp: rework R-V V hybrid_synthesis_deint Given the size of the data set, strided memory accesses cannot be avoided. We can still do better than the current code. ps_hybrid_synthesis_deint_c: 12065.5 ps_hybrid_synthesis_deint_rvv_i32: 13650.2 (before) ps_hybrid_synthesis_deint_rvv_i64: 8181.0 (after)	1 year ago
Rémi Denis-Courmont	eb508702a8	lavc/aacpsdsp: rework R-V V add_squares Segmented loads may be slower than not. So this advantageously uses a unit-strided load and narrowing shifts instead. Before: ps_add_squares_c: 60757.7 ps_add_squares_rvv_f32: 22242.5 After: ps_add_squares_c: 60516.0 ps_add_squares_rvv_i64: 17067.7	1 year ago
Rémi Denis-Courmont	f3dfd4ccf2	lavc/aacpsdsp: unroll RISC-V V hybrid_synthesis_deint	1 year ago
Rémi Denis-Courmont	0f1336b285	lavc/aacpsdsp: unroll RISC-V V hybrid_analysis_ileave	1 year ago
Rémi Denis-Courmont	69d7486e59	lavc/aacpsdsp: unroll RISC-V V mul_pair_single	1 year ago
Rémi Denis-Courmont	c270928cc0	lavc/aacpsdsp: unroll R-V V stereo interpolate	1 year ago
Rémi Denis-Courmont	27d74fc1ef	lavc/aacpsdsp: simplify R-V V stereo interpolate Remove some useless vector splat.	1 year ago
Rémi Denis-Courmont	2eb55157aa	lavc/aacpsdsp: unroll RISC-V V add_squares This slightly improves performance with the Device Under Test.	1 year ago
Rémi Denis-Courmont	105921251a	lavc/aacpsdsp: fix clobber on RISC-V LP64D/ILP32D Although the DSP function only uses single precision from RISC-V F, the caller may leave double precision values in the spilled registers if the calling convention supports double precision hardware floats. Then, we need to save and restore FS registers as double precision. Conversely, we do not need to save anything at all if an integer calling convention is in use. However we can assume that single precision floats are supported, since the Zve32f extension implies the F extension. So for the sake of simplicity, we always save at least single precision values. In theory, we should even save quadruple precision values if the LP64Q ABI is in use. I have yet to see a compiler that supports it though.	2 years ago
Rémi Denis-Courmont	c03f9654c9	lavc/aacpsdsp: RISC-V V stereo_interpolate[0]	2 years ago
Rémi Denis-Courmont	a15edb0bc0	lavc/aacpsdsp: RISC-V V hybrid_synthesis_deint	2 years ago
Rémi Denis-Courmont	09f907999f	lavc/aacpsdsp: RISC-V V hybrid_analysis_ileave	2 years ago
Rémi Denis-Courmont	15c3a0bd6e	lavc/aacpsdsp: RISC-V V hybrid_analysis This starts with one-time initialisation of the 26 constant factors like `08edacc248`. That is done with the scalar instruction set. While the formula can readily be vectored, the gains would (probably) be more than lost in transfering the results back to FP registers (or suitably reshuffling them into vector registers). Note that the main loop could likely be scheduled sligthly better by expanding the filter macro and interleaving loads with arithmetic. It is not clear yet if that would be relevant for vector processing (as opposed to traditional SIMD). We could also use fewer vectors, but there is not much point in sparing them (they are all callee-clobbered).	2 years ago
Rémi Denis-Courmont	e180326a0b	lavc/aacpsdsp: RISC-V V mul_pair_single	2 years ago
Rémi Denis-Courmont	b0cacf4c3f	lavc/aacpsdsp: RISC-V V add_squares	2 years ago
Rémi Denis-Courmont	c1bb19e263	lavu/fixeddsp: RISC-V V butterflies_fixed	2 years ago
Rémi Denis-Courmont	04d092e7d5	lavc/audiodsp: RISC-V F vector_clipf RV64G supports MIN & MAX instructions natively only on floating point registers, not general purpose ones. The later would require the Zbb extension. Due to that, it is actually faster to perform the clipping "properly" in FPU. Benchmarks on SiFive U74-MC (courtesy of Shanghai StarFive Tech): audiodsp.vector_clipf_c: 29551.5 audiodsp.vector_clipf_rvf: 17871.0 Also tried unrolling with 2 or 8 elements but it gets worse either way.	2 years ago
Diego Biurrun	9a9e2f1c8a	dsputil: Split audio operations off into a separate context	11 years ago
Ben Avison	9d8ecdd8ca	vc-1: Add platform-specific start code search routine to VC1DSPContext. Initialise VC1DSPContext for parser as well as for decoder. Note, the VC-1 code doesn't actually use the function pointer yet. Signed-off-by: Michael Niedermayer <michaelni@gmx.at>	11 years ago
Mason Carter	832e190632	vc1: arm: Add NEON assembly For: ff_vc1_inv_trans_{8,4}x{8,4}_{dc_,}neon ff_put_pixels8x8_neon ff_put_vc1_mspel_mc{0,1,2,3}{0,1,2,3}_neon (except for 00) Based on ARM assembly code in libavcodec/arm by Rob Clark and Mans Rullgard. Signed-off-by: Martin Storsjö <martin@martin.st>	11 years ago
Diego Biurrun	73b704ac60	arm: Add some missing header #includes	12 years ago
Mans Rullgard	b692d246ea	vp8: arm: separate ARMv6 functions from NEON This is a preparation for complete ARMv6 optimisations. Signed-off-by: Mans Rullgard <mans@mansr.com>	13 years ago
Mans Rullgard	d526c5338d	ARM: allow runtime masking of CPU features This allows masking CPU features with the -cpuflags avconv option which is useful for testing different optimisations without rebuilding. Signed-off-by: Mans Rullgard <mans@mansr.com>	13 years ago
Michael Niedermayer	c266eb1928	arm: Fix 10l typo Signed-off-by: Michael Niedermayer <michaelni@gmx.at>	13 years ago
Ronald S. Bultje	bd66f073fe	vp8: change int stride to ptrdiff_t stride. On 64bit platforms with 32bit int, this means we won't have to sign- extend the integer anymore.	13 years ago
Diego Biurrun	32f3c541bc	doxygen: Do not include license boilerplates in Doxygen comment blocks.	13 years ago
Ronald S. Bultje	a5dfeb612e	VP8: armv6 optimizations. From 52.503s (~40fps) to 27.973sec (~80fps) decoding of 480p sintel trailer, i.e. a ~2x speedup overall, on a Nexus S. Signed-off-by: Michael Niedermayer <michaelni@gmx.at>	13 years ago
Mans Rullgard	2912e87a6c	Replace FFmpeg with Libav in licence headers Signed-off-by: Mans Rullgard <mans@mansr.com>	14 years ago
Mans Rullgard	ef15d71c1f	VP8: ARM NEON optimisations for dsp functions This adds NEON optimised versions of all functions in VP8DSPContext. Based on initial work by Rob Clark. Signed-off-by: Mans Rullgard <mans@mansr.com> (cherry picked from commit `a1c1d3c003`)	14 years ago
Mans Rullgard	a1c1d3c003	VP8: ARM NEON optimisations for dsp functions This adds NEON optimised versions of all functions in VP8DSPContext. Based on initial work by Rob Clark. Signed-off-by: Mans Rullgard <mans@mansr.com>	14 years ago

17 Commits (3371250c328857cb1b55f7a7f6e4bd0f566adcc4)