FFmpeg

Commit Graph

Author	SHA1	Message	Date
Martin Storsjö	a76b409dd0	aarch64: Reindent all assembly to 8/24 column indentation libavcodec/aarch64/vc1dsp_neon.S is skipped here, as it intentionally uses a layered indentation style to visually show how different unrolled/interleaved phases fit together. Signed-off-by: Martin Storsjö <martin@martin.st>	1 year ago
Martin Storsjö	93cda5a9c2	aarch64: Lowercase UXTW/SXTW and similar flags Signed-off-by: Martin Storsjö <martin@martin.st>	1 year ago
Martin Storsjö	184103b310	aarch64: Consistently use lowercase for vector element specifiers Signed-off-by: Martin Storsjö <martin@martin.st>	1 year ago
Hubert Mazur	2537fdc510	sw_scale: Add specializations for hscale 16 to 19 Provide arm64 neon optimized implementations for hscale16To19 with filter sizes 4, 8 and X4. The tests and benchmarks run on AWS Graviton 2 instances. The results from a checkasm tool are shown below. hscale_16_to_19__fs_4_dstW_512_c: 6216.0 hscale_16_to_19__fs_4_dstW_512_neon: 2257.0 hscale_16_to_19__fs_8_dstW_512_c: 10417.7 hscale_16_to_19__fs_8_dstW_512_neon: 3112.5 hscale_16_to_19__fs_12_dstW_512_c: 14890.5 hscale_16_to_19__fs_12_dstW_512_neon: 3899.0 hscale_16_to_19__fs_16_dstW_512_c: 19006.5 hscale_16_to_19__fs_16_dstW_512_neon: 5341.2 hscale_16_to_19__fs_32_dstW_512_c: 36629.5 hscale_16_to_19__fs_32_dstW_512_neon: 9502.7 hscale_16_to_19__fs_40_dstW_512_c: 45477.5 hscale_16_to_19__fs_40_dstW_512_neon: 11552.0 (Note, the checkasm tests for these functions haven't been merged since they fail on x86.) Signed-off-by: Hubert Mazur <hum@semihalf.com> Signed-off-by: Martin Storsjö <martin@martin.st>	2 years ago
Hubert Mazur	9ccf8c5bfc	sw_scale: Add specializations for hscale 16 to 15 Add arm64 neon implementations for hscale 16 to 15 with filter sizes 4, 8 and X4. The tests and benchmarks run on AWS Graviton 2 instances. The results from a checkasm tool are shown below. hscale_16_to_15__fs_4_dstW_512_c: 6703.5 hscale_16_to_15__fs_4_dstW_512_neon: 2298.0 hscale_16_to_15__fs_8_dstW_512_c: 10983.0 hscale_16_to_15__fs_8_dstW_512_neon: 3216.5 hscale_16_to_15__fs_12_dstW_512_c: 15526.0 hscale_16_to_15__fs_12_dstW_512_neon: 3993.0 hscale_16_to_15__fs_16_dstW_512_c: 20183.5 hscale_16_to_15__fs_16_dstW_512_neon: 5369.7 hscale_16_to_15__fs_32_dstW_512_c: 39315.2 hscale_16_to_15__fs_32_dstW_512_neon: 9511.2 hscale_16_to_15__fs_40_dstW_512_c: 48995.7 hscale_16_to_15__fs_40_dstW_512_neon: 11570.0 (Note, the checkasm tests for these functions haven't been merged since they fail on x86.) Signed-off-by: Hubert Mazur <hum@semihalf.com> Signed-off-by: Martin Storsjö <martin@martin.st>	2 years ago
Hubert Mazur	1e9cfa5bb0	sw_scale: Add specializations for hscale 8 to 19 Add arm64 neon implementations for hscale 8 to 19 with filter sizes 4, 4X and 8. Both implementations are based on very similar ones dedicated to hscale 8 to 15. The major changes refer to saving the data - instead of writing the result as int16_t it is done with int32_t. These functions are heavily inspired on patches provided by J. Swinney and M. Storsjö for hscale8to15 which were slightly adapted for hscale8to19. The tests and benchmarks run on AWS Graviton 2 instances. The results from a checkasm tool shown below. hscale_8_to_19__fs_4_dstW_512_c: 5663.2 hscale_8_to_19__fs_4_dstW_512_neon: 1259.7 hscale_8_to_19__fs_8_dstW_512_c: 9306.0 hscale_8_to_19__fs_8_dstW_512_neon: 2020.2 hscale_8_to_19__fs_12_dstW_512_c: 12932.7 hscale_8_to_19__fs_12_dstW_512_neon: 2462.5 hscale_8_to_19__fs_16_dstW_512_c: 16844.2 hscale_8_to_19__fs_16_dstW_512_neon: 4671.2 hscale_8_to_19__fs_32_dstW_512_c: 32803.7 hscale_8_to_19__fs_32_dstW_512_neon: 5474.2 hscale_8_to_19__fs_40_dstW_512_c: 40948.0 hscale_8_to_19__fs_40_dstW_512_neon: 6669.7 Signed-off-by: Hubert Mazur <hum@semihalf.com> Signed-off-by: Martin Storsjö <martin@martin.st>	2 years ago
Swinney, Jonathan	75ffca7eef	libswscale/aarch64: add another hscale specialization This specialization handles the case where filtersize is 4 mod 8, e.g. 12, 20, etc. Aarch64 was previously using the c function for this case. This implementation speeds up that case significantly. hscale_8_to_15__fs_12_dstW_512_c: 6234.1 hscale_8_to_15__fs_12_dstW_512_neon: 1505.6 Signed-off-by: Jonathan Swinney <jswinney@amazon.com> Signed-off-by: Martin Storsjö <martin@martin.st>	2 years ago
Swinney, Jonathan	0ea61725b1	swscale/aarch64: add hscale specializations This patch adds code to support specializations of the hscale function and adds a specialization for filterSize == 4. ff_hscale8to15_4_neon is a complete rewrite. Since the main bottleneck here is loading the data from src, this data is loaded a whole block ahead and stored back to the stack to be loaded again with ld4. This arranges the data for most efficient use of the vector instructions and removes the need for completion adds at the end. The number of iterations of the C per iteration of the assembly is increased from 4 to 8, but because of the prefetching, there must be a special section without prefetching when dstW < 16. This improves speed on Graviton 2 (Neoverse N1) dramatically in the case where previously fs=8 would have been required. before: hscale_8_to_15__fs_8_dstW_512_neon: 1962.8 after : hscale_8_to_15__fs_4_dstW_512_neon: 1220.9 Signed-off-by: Jonathan Swinney <jswinney@amazon.com> Signed-off-by: Martin Storsjö <martin@martin.st>	3 years ago
Martin Storsjö	70db14376c	swscale: aarch64: Optimize the final summation in the hscale routine Before: Cortex A53 A72 A73 Graviton 2 Graviton 3 hscale_8_to_15_width8_neon: 8273.0 4602.5 4289.5 2429.7 1629.1 hscale_8_to_15_width16_neon: 12405.7 6803.0 6359.0 3549.0 2378.4 hscale_8_to_15_width32_neon: 21258.7 11491.7 11469.2 5797.2 3919.6 hscale_8_to_15_width40_neon: 25652.0 14173.7 12488.2 6893.5 4810.4 After: hscale_8_to_15_width8_neon: 7633.0 3981.5 3350.2 1980.7 1261.1 hscale_8_to_15_width16_neon: 11666.7 5951.0 5512.0 3080.7 2131.4 hscale_8_to_15_width32_neon: 20900.7 10733.2 9481.7 5275.2 3862.1 hscale_8_to_15_width40_neon: 24826.0 13536.2 11502.0 6397.2 4731.9 Thus, this gives overall a 8-29% speedup for the smaller filter sizes, around 1-8% for the larger filter sizes. Inspired by a patch by Jonathan Swinney <jswinney@amazon.com>. Signed-off-by: Martin Storsjö <martin@martin.st>	3 years ago
Martin Storsjö	9025d5c5ce	swscale: aarch64: Don't clobber callee-saved registers v8-v15 Signed-off-by: Martin Storsjö <martin@martin.st>	5 years ago
Martin Storsjö	872790b1f9	swscale: aarch64: Avoid using the x18 register The x18 is a reserved platform register on Darwin and Windows. x8/w8 seems to be unused in this function though (and same about x10 and x14), so there's really no reason to use x18 here - just change the uses of x18/w18 into x8/w8 instead without any further rewrites. Signed-off-by: Martin Storsjö <martin@martin.st>	5 years ago
Sebastian Pop	bd83191271	swscale/aarch64: use multiply accumulate and increase vector factor to 4 This patch implements ff_hscale_8_to_15_neon with NEON fused multiply accumulate and bumps the vectorization factor from 2 to 4. The speedup is of 25% on Graviton1 A1 instances based on A-72 cpus: $ ffmpeg -nostats -f lavfi -i testsrc2=4k:d=2 -vf bench=start,scale=1024x1024,bench=stop -f null - before: t:0.040303 avg:0.040287 max:0.040371 min:0.039214 after: t:0.032168 avg:0.032215 max:0.033081 min:0.032146 The speedup is of 39% on Graviton2 m6g instances based on Neoverse-N1 cpus: $ ffmpeg -nostats -f lavfi -i testsrc2=4k:d=2 -vf bench=start,scale=1024x1024,bench=stop -f null - before: t:0.019446 avg:0.019423 max:0.019493 min:0.019181 after: t:0.014015 avg:0.014096 max:0.015018 min:0.013971 Tested with `make check` on aarch64-linux. Signed-off-by: Sebastian Pop <spop@amazon.com> Reviewed-by: Jean-Baptiste Kempf <jb@videolan.org> Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>	5 years ago
Clément Bœsch	040598218f	sws/aarch64: restore ff_hscale_8_to_15_neon() Fix final scaling and required filter alignment. Pass FATE.	9 years ago
Clément Bœsch	263eb76bdf	sws/aarch64: add ff_hscale_8_to_15_neon ./ffmpeg -nostats -f lavfi -i testsrc2=4k:d=2 -vf bench=start,scale=1024x1024,bench=stop -f null - before: t:0.489726 avg:0.489883 max:0.491852 min:0.489482 after: t:0.256515 avg:0.256458 max:0.256999 min:0.253755	9 years ago

14 Commits (a9e55f409f2f7e6b32d5b6cc1dd67ed454cafbca)