FFmpeg

Commit Graph

Author	SHA1	Message	Date
Martin Storsjö	b252178321	libavcodec: arm: Add a NEON implementation of pixblockdsp Cortex A7 A8 A9 A53 A72 get_pixels_c: 144.7 146.0 143.0 137.7 69.0 get_pixels_armv6: 112.0 106.7 90.2 95.0 72.5 get_pixels_neon: 69.0 29.7 68.7 40.2 19.0 get_pixels_unaligned_c: 144.7 146.2 143.0 137.7 69.0 get_pixels_unaligned_neon: 77.0 36.5 72.5 48.5 19.0 diff_pixels_c: 376.7 319.7 265.5 307.7 148.0 diff_pixels_armv6: 179.0 159.5 205.5 139.0 142.0 diff_pixels_neon: 69.0 40.2 77.5 53.2 26.0 diff_pixels_unaligned_c: 376.7 319.7 265.5 307.7 148.0 diff_pixels_unaligned_neon: 85.0 54.5 93.5 66.7 26.0 Signed-off-by: Martin Storsjö <martin@martin.st>	5 years ago
qoroliang	cacdac819f	lavc/hevcdec: fix the HEVC decoder crash when memory over-read Fix an occasional crash for hevc decoder in ARM 32 platform, the root cause is the memory over read(read cross the memory boundary) in SAO NENO functions ff_hevc_sao_band_filter_neon_8 and ff_hevc_sao_edge_filter_neon_8. After this fix, the crash disapper in the massive Android phone test. Signed-off-by: qoroliang <qoroliang@tencent.com>	5 years ago
Aman Gupta	0e49560806	avcodec/arm/mlpdsp: add missing dependency for truehd Signed-off-by: Aman Gupta <aman@tmm1.net>	5 years ago
Martin Storsjö	0676de935b	arm: Implement a NEON version of 422 h264_h_loop_filter_chroma Previously, the 420 version was used even for 422. This fixes occasional checkasm failures. Signed-off-by: Martin Storsjö <martin@martin.st>	6 years ago
James Almer	7b9ca44cbc	arm/h264dsp: change loop filter stride argument to ptrdiff_t This was missed in `d5d699ab6e` Signed-off-by: James Almer <jamrial@gmail.com>	6 years ago
Martin Storsjö	cef914e083	arm: vp8: Optimize put_epel16_h6v6 with vp8_epel8_v6_y2 This makes it similar to put_epel16_v6, and gives a 10-25% speedup of this function. Before: Cortex A7 A8 A9 A53 A72 vp8_put_epel16_h6v6_neon: 3058.0 2218.5 2459.8 2183.0 1572.2 After: vp8_put_epel16_h6v6_neon: 2670.8 1934.2 2244.4 1729.4 1503.9 Signed-off-by: Martin Storsjö <martin@martin.st>	6 years ago
Meng Wang	3b2fd96048	avcodec/arm/hevcdsp_sao : add NEON optimization for sao Signed-off-by: Meng Wang <wangmeng.kids@bytedance.com> Reviewed-by: Shengbin Meng <shengbinmeng@gmail.com> Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>	7 years ago
Martin Storsjö	5f83935de4	arm: hevcdsp: Add commas between macro arguments When targeting darwin, clang requires commas between arguments, while the no-comma form is allowed for other targets. Since Xcode 9.3, the bundled clang supports altmacro and doesn't require using gas-preprocessor any longer. Signed-off-by: Martin Storsjö <martin@martin.st>	7 years ago
Martin Storsjö	6660bc034d	arm: hevcdsp: Avoid using macro expansion counters Clang supports the macro expansion counter (used for making unique labels within macro expansions), but not when targeting darwin. Convert uses of the counter into normal local labels, as used elsewhere. Since Xcode 9.3, the bundled clang supports altmacro and doesn't require using gas-preprocessor any longer. Signed-off-by: Martin Storsjö <martin@martin.st>	7 years ago
Martin Storsjö	ab05d3934d	arm: vc1dsp: Add commas between macro arguments When targeting darwin, clang requires commas between arguments, while the no-comma form is allowed for other targets. Since Xcode 9.3, the bundled clang supports altmacro and doesn't require using gas-preprocessor any longer. Signed-off-by: Martin Storsjö <martin@martin.st>	7 years ago
Aurelien Jacobs	f677718bc8	sbcenc: add armv6 and neon asm optimizations This was originally based on libsbc, and was fully integrated into ffmpeg.	7 years ago
Michael Niedermayer	7dbbb75ee3	avcodec/arm/sbrdsp_neon: Use a free register instead of putting 2 things in one Fixes high pitched shriek Fixes: 25420848_1478428308873746_4255813235963330560_n.mp4 Reported-by: Dale Curtis <dalecurtis@google.com> Reviewed-by: Dale Curtis <dalecurtis@chromium.org> Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>	7 years ago
James Almer	36de24d5b7	arm/hevc_idct: fix compilation on Android Compilation error "out of range" fixed for armeabi-v7a. Compilation failed trying to build libvlc.aar for ARM7 android on ubuntu 16.04 host. Error messages is "Offset out of range". The reason of the error is assembler LDR directives in function "ff_hevc_transform_luma_4x4_neon_8" need local storage in range <1k, but no such storage provided. Based on a patch by Ihor Bobalo <bob@eleks.com> Suggested-by: wbs Signed-off-by: James Almer <jamrial@gmail.com>	7 years ago
Alexandra Hájková	7993ec19af	hevc: Add hevc_get_pixel_4/8/12/16/24/32/48/64 Checkasm timings: block size bitdepth C NEON 4 8 bit: 146.7 48.7 10 bit: 146.7 52.7 8 8 bit: 430.3 84.4 10 bit: 430.4 119.5 12 8 bit: 812.8 141.0 10 bit: 812.8 195.0 16 8 bit: 1499.1 268.0 10 bit: 1498.9 368.4 24 8 bit: 4394.2 574.8 10 bit: 3696.3 804.8 32 8 bit: 5108.6 568.9 10 bit: 4249.6 918.8 48 8 bit: 16819.6 2304.9 10 bit: 13882.0 3178.5 64 8 bit: 13490.8 1799.5 10 bit: 11018.5 2519.4 Signed-off-by: Martin Storsjö <martin@martin.st>	7 years ago
Martin Storsjö	b487add7ec	arm: Remove a redundant check in fmtconvert_init_arm.c This was missed in `e2710e790c`, where have_vfp && !have_vfpv3 were converted into have_vfp_vm. Signed-off-by: Martin Storsjö <martin@martin.st>	7 years ago
Martin Storsjö	9dde6ab06c	arm: Fix SIGBUS on ARM when compiled with binutils 2.29 In binutils 2.29, the behavior of the ADR instruction changed so that 1 is added to the address of a Thumb function (previously nothing was added). This allows the loaded address to be passed to a BLX instruction and the correct mode change will occur. See: https://sourceware.org/bugzilla/show_bug.cgi?id=21458 By using adr with a label that isn't annotated as a thumb function, we avoid the new behaviour in binutils 2.29 and get the same behaviour as in prior releases, and as in other assemblers (ms armasm.exe, clang's built in assembler) - an idea that Janne Grunau came up with. Signed-off-by: Martin Storsjö <martin@martin.st>	7 years ago
Muhammad Faiz	0780ad9c68	avcodec/rdft: remove sintable It is redundant with costable. The first half of sintable is identical with the second half of costable. The second half of sintable is negative value of the first half of sintable. The computation is changed to handle sign of sin values, in C code and ARM assembly code. Signed-off-by: Muhammad Faiz <mfcc64@gmail.com>	7 years ago
Clément Bœsch	b12a36170b	lavc/aacpsdsp: use ptrdiff_t for stride in hybrid_analysis	7 years ago
Clément Bœsch	e4a27e2f2d	lavc/arm: fix lack of precision in ff_ps_stereo_interpolate_neon The code originally pre-multiply by 2 the steps, causing the running sum of the h factors to drift away due to the lack of precision. It quickly causes an inaccuracy > 0.01. I tried diverse approaches such as multiply by 2.0 (instead of adding the value itself) without success. I'm unable to bench the impact of this change, feel free to compare. This commit fixes the incoming aacpsdsp tests. Following is an alternative simplified function (matching the incoming AArch64 code) that may be used: function ff_ps_stereo_interpolate_neon, export=1 vld1.32 {q0}, [r2] vld1.32 {q1}, [r3] ldr r12, [sp] vmov.f32 q8, q0 vmov.f32 q9, q1 vzip.32 q8, q0 vzip.32 q9, q1 1: vld1.32 {d4}, [r0,:64] vld1.32 {d6}, [r1,:64] vadd.f32 q8, q8, q9 vadd.f32 q0, q0, q1 vmov.f32 d5, d4 vmov.f32 d7, d6 vmul.f32 q2, q2, q8 vmla.f32 q2, q3, q0 vst1.32 {d4}, [r0,:64]! vst1.32 {d5}, [r1,:64]! subs r12, r12, #1 bgt 1b bx lr endfunc	7 years ago
Martin Storsjö	d7320ca3ed	arm: Avoid using .dn register aliases clang now (in the upcoming 5.0 version) is capable of building our arm assembly without relying on gas-preprocessor, although clang/LLVM doesn't support .dn register aliases. The VC1 MC assembly was only built and used if the chosen assembler supported the .dn directives though. This was supported as long as gas-preprocessor was used. This means that VC1 decoding got a speed regression on clang 5.0, unless the user manually chose using gas-preprocessor again. By avoiding using the .dn register aliases, we can build the VC1 MC assembly with the latest clang version. Support for the .dn/.qn directives in clang/LLVM isn't actively planned, see https://bugs.llvm.org/show_bug.cgi?id=18199. This partially reverts `896a5bff64`. Signed-off-by: Martin Storsjö <martin@martin.st>	8 years ago
Alexandra Hájková	ce080f47b8	hevc: Add NEON 32x32 IDCT Signed-off-by: Martin Storsjö <martin@martin.st>	8 years ago
Alexandra Hájková	118dd4a321	hevc: 16x16 NEON idct: Use the right element size for loads/stores This doesn't change the actual behaviour of the code but improves readability. Signed-off-by: Martin Storsjö <martin@martin.st>	8 years ago
Alexandra Hájková	edbf0fffb1	hevc: Add NEON add_residual for bitdepth 10 Signed-off-by: Martin Storsjö <martin@martin.st>	8 years ago
Martin Storsjö	e1c2453a4f	arm: hevc_idct: Tune the add_res_8x8 and add_res_32x32 functions Before: Cortex A7 A8 A9 A53 hevc_add_res_8x8_8_neon: 116.0 58.7 80.2 90.7 hevc_add_res_32x32_8_neon: 1230.0 737.5 1187.5 974.4 After: hevc_add_res_8x8_8_neon: 97.7 57.0 73.7 80.0 hevc_add_res_32x32_8_neon: 1216.0 698.7 1127.5 827.1 Signed-off-by: Martin Storsjö <martin@martin.st>	8 years ago
Seppo Tomperi	0d4d435137	hevc: Add NEON add_residual for bitdepth 8 Optimized by Alexandra Hájková. Signed-off-by: Martin Storsjö <martin@martin.st>	8 years ago
Alexandra Hájková	3d69dd65c6	hevc: Add support for bitdepth 10 for IDCT DC Signed-off-by: Martin Storsjö <martin@martin.st>	8 years ago
Seppo Tomperi	358adef030	hevc: Add NEON IDCT DC functions for bitdepth 8 Signed-off-by: Alexandra Hájková <alexandra@khirnov.net> Signed-off-by: Martin Storsjö <martin@martin.st>	8 years ago
Alexandra Hájková	89d9869d24	hevc: Add NEON 16x16 IDCT The speedup vs C code is around 6-13x. Signed-off-by: Martin Storsjö <martin@martin.st>	8 years ago
Ronald S. Bultje	40cbd686dc	idct_arm: remove use of ff_put/add_pixels_clamped function pointer. Instead, hardcode the use of the _arm implementation of add_pixels, and use the C version for put_pixels (as no arm-optimized version exists). Since there's separate implementations of idct{,_put,_add} for neon, this has no practical impact on performance.	8 years ago
Ronald S. Bultje	0c46641784	vp9: split out generic decoding skeleton interface API from VP9 types. This allows vp9dsp.h to only include the VP9 types header, and not the decoder skeleton interface which is for hardware decoders (dxva2/vaapi).	8 years ago
Ronald S. Bultje	f8c019944d	vp9: re-split the decoder/format/dsp interface header files. The advantage here is that the internal software decoder interface is not exposed to the DSP functions or the hardware accelerations.	8 years ago
Martin Storsjö	fbc6f190a6	arm: Always build the hevcdsp_init_arm.c file The main hevcdsp.c file calls this init function if HAVE_ARM is set, regardless of whether neon support is available or not. This fixes builds where neon isn't supported by the build tools at all. Signed-off-by: Martin Storsjö <martin@martin.st>	8 years ago
Alexandra Hájková	0b9a237b23	hevc: Add NEON 4x4 and 8x8 IDCT Optimized by Martin Storsjö <martin@martin.st>. The speedup vs C code is around 3.2-4.4x. Signed-off-by: Martin Storsjö <martin@martin.st>	8 years ago
Clément Bœsch	1c9f4b5078	lavc/vp9: split into vp9{block,data,mvs} This is following Libav layout to ease merges.	8 years ago
Clément Bœsch	b78243c504	lavc/arm: fix indent in blockdsp_init_neon	8 years ago
Martin Storsjö	eabc5abf94	arm: vp9itxfm16: Do a simpler half/quarter idct16/idct32 when possible This work is sponsored by, and copyright, Google. This avoids loading and calculating coefficients that we know will be zero, and avoids filling the temp buffer with zeros in places where we know the second pass won't read. This gives a pretty substantial speedup for the smaller subpartitions. The code size increases from 14516 bytes to 22484 bytes. The idct16/32_end macros are moved above the individual functions; the instructions themselves are unchanged, but since new functions are added at the same place where the code is moved from, the diff looks rather messy. Before: Cortex A7 A8 A9 A53 vp9_inv_dct_dct_16x16_sub1_add_10_neon: 454.0 270.7 418.5 295.4 vp9_inv_dct_dct_16x16_sub2_add_10_neon: 3840.2 3244.8 3700.1 2337.9 vp9_inv_dct_dct_16x16_sub4_add_10_neon: 4212.5 3575.4 3996.9 2571.6 vp9_inv_dct_dct_16x16_sub8_add_10_neon: 5174.4 4270.5 4615.5 3031.9 vp9_inv_dct_dct_16x16_sub12_add_10_neon: 5676.0 4908.5 5226.5 3491.3 vp9_inv_dct_dct_16x16_sub16_add_10_neon: 6403.9 5589.0 5839.8 3948.5 vp9_inv_dct_dct_32x32_sub1_add_10_neon: 1710.7 944.7 1582.1 1045.4 vp9_inv_dct_dct_32x32_sub2_add_10_neon: 21040.7 16706.1 18687.7 13193.1 vp9_inv_dct_dct_32x32_sub4_add_10_neon: 22197.7 18282.7 19577.5 13918.6 vp9_inv_dct_dct_32x32_sub8_add_10_neon: 24511.5 20911.5 21472.5 15367.5 vp9_inv_dct_dct_32x32_sub12_add_10_neon: 26939.5 24264.3 23239.1 16830.3 vp9_inv_dct_dct_32x32_sub16_add_10_neon: 29419.5 26845.1 25020.6 18259.9 vp9_inv_dct_dct_32x32_sub20_add_10_neon: 31146.4 29633.5 26803.3 19721.7 vp9_inv_dct_dct_32x32_sub24_add_10_neon: 33376.3 32507.8 28642.4 21174.2 vp9_inv_dct_dct_32x32_sub28_add_10_neon: 35629.4 35439.6 30416.5 22625.7 vp9_inv_dct_dct_32x32_sub32_add_10_neon: 37269.9 37914.9 32271.9 24078.9 After: vp9_inv_dct_dct_16x16_sub1_add_10_neon: 454.0 276.0 418.5 295.1 vp9_inv_dct_dct_16x16_sub2_add_10_neon: 2336.2 1886.0 2251.0 1458.6 vp9_inv_dct_dct_16x16_sub4_add_10_neon: 2531.0 2054.7 2402.8 1591.1 vp9_inv_dct_dct_16x16_sub8_add_10_neon: 3848.6 3491.1 3845.7 2554.8 vp9_inv_dct_dct_16x16_sub12_add_10_neon: 5703.8 4831.6 5230.8 3493.4 vp9_inv_dct_dct_16x16_sub16_add_10_neon: 6399.5 5567.0 5832.4 3951.5 vp9_inv_dct_dct_32x32_sub1_add_10_neon: 1722.1 938.5 1577.3 1044.5 vp9_inv_dct_dct_32x32_sub2_add_10_neon: 15003.5 11576.8 13105.8 9602.2 vp9_inv_dct_dct_32x32_sub4_add_10_neon: 15768.5 12677.2 13726.0 10138.1 vp9_inv_dct_dct_32x32_sub8_add_10_neon: 17278.8 14825.4 14907.5 11185.7 vp9_inv_dct_dct_32x32_sub12_add_10_neon: 22335.7 21544.5 20379.5 15019.8 vp9_inv_dct_dct_32x32_sub16_add_10_neon: 24165.6 23881.7 21938.6 16308.2 vp9_inv_dct_dct_32x32_sub20_add_10_neon: 31082.2 30860.9 26835.3 19711.3 vp9_inv_dct_dct_32x32_sub24_add_10_neon: 33102.6 31922.8 28638.3 21161.0 vp9_inv_dct_dct_32x32_sub28_add_10_neon: 35104.9 34867.5 30411.7 22621.2 vp9_inv_dct_dct_32x32_sub32_add_10_neon: 37438.1 39103.4 32217.8 24067.6 Signed-off-by: Martin Storsjö <martin@martin.st>	8 years ago
Martin Storsjö	0ea603203d	arm: vp9itxfm16: Make the larger core transforms standalone functions This work is sponsored by, and copyright, Google. This reduces the code size of libavcodec/arm/vp9itxfm_16bpp_neon.o from 17500 to 14516 bytes. This gives a small slowdown of a couple tens of cycles, up to around 150 cycles for the full case of the largest transform, but makes it more feasible to add more optimized versions of these transforms. Before: Cortex A7 A8 A9 A53 vp9_inv_dct_dct_16x16_sub4_add_10_neon: 4237.4 3561.5 3971.8 2525.3 vp9_inv_dct_dct_16x16_sub16_add_10_neon: 6371.9 5452.0 5779.3 3910.5 vp9_inv_dct_dct_32x32_sub4_add_10_neon: 22068.8 17867.5 19555.2 13871.6 vp9_inv_dct_dct_32x32_sub32_add_10_neon: 37268.9 38684.2 32314.2 23969.0 After: vp9_inv_dct_dct_16x16_sub4_add_10_neon: 4375.1 3571.9 4283.8 2567.2 vp9_inv_dct_dct_16x16_sub16_add_10_neon: 6415.6 5578.9 5844.6 3948.3 vp9_inv_dct_dct_32x32_sub4_add_10_neon: 22653.7 18079.7 19603.7 13905.3 vp9_inv_dct_dct_32x32_sub32_add_10_neon: 37593.2 38862.2 32235.8 24070.9 Signed-off-by: Martin Storsjö <martin@martin.st>	8 years ago
Martin Storsjö	32e273c111	arm: vp9itxfm16: Avoid reloading the idct32 coefficients Keep the idct32 coefficients in narrow form in q6-q7, and idct16 coefficients in lengthened 32 bit form in q0-q3. Avoid clobbering q0-q3 in the pass1 function, and squeeze the idct16 coefficients into q0-q1 in the pass2 function to avoid reloading them. The idct16 coefficients are clobbered and reloaded within idct32_odd though, since that turns out to be faster than narrowing them and swapping them into q6-q7. Before: Cortex A7 A8 A9 A53 vp9_inv_dct_dct_32x32_sub4_add_10_neon: 22653.8 18268.4 19598.0 14079.0 vp9_inv_dct_dct_32x32_sub32_add_10_neon: 37699.0 38665.2 32542.3 24472.2 After: vp9_inv_dct_dct_32x32_sub4_add_10_neon: 22270.8 18159.3 19531.0 13865.0 vp9_inv_dct_dct_32x32_sub32_add_10_neon: 37523.3 37731.6 32181.7 24071.2 Signed-off-by: Martin Storsjö <martin@martin.st>	8 years ago
Martin Storsjö	c1619318e5	arm: vp9itxfm16: Fix vertical alignment Signed-off-by: Martin Storsjö <martin@martin.st>	8 years ago
Martin Storsjö	b46d37e93a	arm: vp9itxfm16: Use the right lane size This makes the code slightly clearer, but doesn't make any functional difference. Signed-off-by: Martin Storsjö <martin@martin.st>	8 years ago
Martin Storsjö	21c89f3a26	arm/aarch64: vp9: Fix vertical alignment Align the second/third operands as they usually are. Due to the wildly varying sizes of the written out operands in aarch64 assembly, the column alignment is usually not as clear as in arm assembly. This is cherrypicked from libav commit `7995ebfad1`. Signed-off-by: Martin Storsjö <martin@martin.st>	8 years ago
Martin Storsjö	70317b25aa	arm/aarch64: vp9itxfm: Skip loading the min_eob pointer when it won't be used In the half/quarter cases where we don't use the min_eob array, defer loading the pointer until we know it will be needed. This is cherrypicked from libav commit `3a0d5e206d`. Signed-off-by: Martin Storsjö <martin@martin.st>	8 years ago
Martin Storsjö	b7a565fe71	arm: vp9itxfm: Template the quarter/half idct32 function This reduces the number of lines and reduces the duplication. Also simplify the eob check for the half case. If we are in the half case, we know we at least will need to do the first three slices, we only need to check eob for the fourth one, so we can hardcode the value to check against instead of loading from the min_eob array. Since at most one slice can be skipped in the first pass, we can unroll the loop for filling zeros completely, as it was done for the quarter case before. This allows skipping loading the min_eob pointer when using the quarter/half cases. This is cherrypicked from libav commit `98ee855ae0`. Signed-off-by: Martin Storsjö <martin@martin.st>	8 years ago
Martin Storsjö	7995ebfad1	arm/aarch64: vp9: Fix vertical alignment Align the second/third operands as they usually are. Due to the wildly varying sizes of the written out operands in aarch64 assembly, the column alignment is usually not as clear as in arm assembly. Signed-off-by: Martin Storsjö <martin@martin.st>	8 years ago
Martin Storsjö	3a0d5e206d	arm/aarch64: vp9itxfm: Skip loading the min_eob pointer when it won't be used In the half/quarter cases where we don't use the min_eob array, defer loading the pointer until we know it will be needed. Signed-off-by: Martin Storsjö <martin@martin.st>	8 years ago
Martin Storsjö	98ee855ae0	arm: vp9itxfm: Template the quarter/half idct32 function This reduces the number of lines and reduces the duplication. Also simplify the eob check for the half case. If we are in the half case, we know we at least will need to do the first three slices, we only need to check eob for the fourth one, so we can hardcode the value to check against instead of loading from the min_eob array. Since at most one slice can be skipped in the first pass, we can unroll the loop for filling zeros completely, as it was done for the quarter case before. This allows skipping loading the min_eob pointer when using the quarter/half cases. Signed-off-by: Martin Storsjö <martin@martin.st>	8 years ago
Martin Storsjö	b2e20d8984	arm: vp9itxfm: Reorder iadst16 coeffs This matches the order they are in the 16 bpp version. There they are in this order, to make sure we access them in the same order they are declared, easing loading only half of the coefficients at a time. This makes the 8 bpp version match the 16 bpp version better. This is cherrypicked from libav commit `08074c092d`. Signed-off-by: Martin Storsjö <martin@martin.st>	8 years ago
Martin Storsjö	4f693b56bd	arm: vp9itxfm: Reorder the idct coefficients for better pairing All elements are used pairwise, except for the first one. Previously, the 16th element was unused. Move the unused element to the second slot, to make the later element pairs not split across registers. This simplifies loading only parts of the coefficients, reducing the difference to the 16 bpp version. This is cherrypicked from libav commit `de06bdfe6c`. Signed-off-by: Martin Storsjö <martin@martin.st>	8 years ago
Martin Storsjö	600f4c9b03	arm: vp9itxfm: Avoid reloading the idct32 coefficients The idct32x32 function actually pushed q4-q7 onto the stack even though it didn't clobber them; there are plenty of registers that can be used to allow keeping all the idct coefficients in registers without having to reload different subsets of them at different stages in the transform. Since the idct16 core transform avoids clobbering q4-q7 (but clobbers q2-q3 instead, to avoid needing to back up and restore q4-q7 at all in the idct16 function), and the lanewise vmul needs a register in the q0-q3 range, we move the stored coefficients from q2-q3 into q4-q5 while doing idct16. While keeping these coefficients in registers, we still can skip pushing q7. Before: Cortex A7 A8 A9 A53 vp9_inv_dct_dct_32x32_sub32_add_neon: 18553.8 17182.7 14303.3 12089.7 After: vp9_inv_dct_dct_32x32_sub32_add_neon: 18470.3 16717.7 14173.6 11860.8 This is cherrypicked from libav commit `402546a172`. Signed-off-by: Martin Storsjö <martin@martin.st>	8 years ago
Martin Storsjö	a88db8b9a0	arm: vp9lpf: Implement the mix2_44 function with one single filter pass For this case, with 8 inputs but only changing 4 of them, we can fit all 16 input pixels into a q register, and still have enough temporary registers for doing the loop filter. The wd=8 filters would require too many temporary registers for processing all 16 pixels at once though. Before: Cortex A7 A8 A9 A53 vp9_loop_filter_mix2_v_44_16_neon: 289.7 256.2 237.5 181.2 After: vp9_loop_filter_mix2_v_44_16_neon: 221.2 150.5 177.7 138.0 This is cherrypicked from libav commit `575e31e931`. Signed-off-by: Martin Storsjö <martin@martin.st>	8 years ago

1 2 3 4 5 ...

923 Commits (d12cfa17a501ffd9f191093d1e06de226279dbd0)