FFmpeg

Commit Graph

Author	SHA1	Message	Date
Matthieu Bouron	0a24d7ca83	lavc/aarch64: add sbrdsp neon implementation autocorrelate_c: 644.0 autocorrelate_neon: 420.0 hf_apply_noise_0_c: 1688.5 hf_apply_noise_0_neon: 1498.6 hf_apply_noise_1_c: 1691.2 hf_apply_noise_1_neon: 1500.6 hf_apply_noise_2_c: 1688.1 hf_apply_noise_2_neon: 1500.3 hf_apply_noise_3_c: 1696.6 hf_apply_noise_3_neon: 1502.2 hf_g_filt_c: 2117.8 hf_g_filt_neon: 1218.7 hf_gen_c: 4573.4 hf_gen_neon: 2461.0 neg_odd_64_c: 72.0 neg_odd_64_neon: 64.7 qmf_deint_bfly_c: 1107.6 qmf_deint_bfly_neon: 291.6 qmf_deint_neg_c: 210.4 qmf_deint_neg_neon: 107.4 qmf_post_shuffle_c: 163.0 qmf_post_shuffle_neon: 107.7 qmf_pre_shuffle_c: 120.5 qmf_pre_shuffle_neon: 110.7 sum64x5_c: 1361.6 sum64x5_neon: 435.4 sum_square_c: 1686.4 sum_square_neon: 787.2	7 years ago
Clément Bœsch	b12a36170b	lavc/aacpsdsp: use ptrdiff_t for stride in hybrid_analysis	7 years ago
Clément Bœsch	ff0ecef624	lavc/aarch64: add a few SIMD functions for AAC PS ☭ tests/checkasm/checkasm --bench --test=aacpsdsp checkasm: using random seed 3318985180 MMX implied by specified flags MMX implied by specified flags NEON: - aacpsdsp.add_squares [OK] - aacpsdsp.mul_pair_single [OK] - aacpsdsp.hybrid_analysis [OK] - aacpsdsp.stereo_interpolate [OK] checkasm: all 5 tests passed nop: 10.0 ps_add_squares_c: 63221.2 ps_add_squares_neon: 22311.7 ps_hybrid_analysis_c: 2466.6 ps_hybrid_analysis_neon: 1521.9 ps_mul_pair_single_c: 68592.0 ps_mul_pair_single_neon: 17426.6 ps_stereo_interpolate_c: 72344.3 ps_stereo_interpolate_neon: 72308.8 ps_stereo_interpolate_ipdopd_c: 117415.2 ps_stereo_interpolate_ipdopd_neon: 113386.3	7 years ago
Memphiz	9e85c5d6a7	aarch64: vp9 16bpp: Fix assembling with Xcode 6.2 and older Properly use the b.eq form instead of the nonstandard form (which both gas and newer clang accept though), and expand the register lists that used a range (which the Xcode 6.2 clang, based on clang 3.5 svn, didn't support). Signed-off-by: Martin Storsjö <martin@martin.st>	8 years ago
Memphiz	998609ddb8	aarch64: vp9: Fix assembling with Xcode 6.2 and older Properly use the b.eq/b.ge forms instead of the nonstandard forms (which both gas and newer clang accept though), and expand the register list that used a range (which the Xcode 6.2 clang, based on clang 3.5 svn, didn't support). This is cherrypicked from libav commit `a970f9de86`. Signed-off-by: Martin Storsjö <martin@martin.st>	8 years ago
Memphiz	a970f9de86	aarch64: vp9: Fix assembling with Xcode 6.2 and older Properly use the b.eq/b.ge forms instead of the nonstandard forms (which both gas and newer clang accept though), and expand the register list that used a range (which the Xcode 6.2 clang, based on clang 3.5 svn, didn't support). Signed-off-by: Martin Storsjö <martin@martin.st>	8 years ago
Matthieu Bouron	204008354f	lavc/aarch64/simple_idct: fix build with Xcode 7.2	8 years ago
Matthieu Bouron	8aa60606fb	lavc/aarch64/simple_idct: fix idct_col4_top coefficient Fixes regression introduced by `5d0b8b1ae3`.	8 years ago
Matthieu Bouron	5d0b8b1ae3	lavc/aarch64/simple_idct: fix iOS build without gas-preprocessor Separates macro arguments with commas and passes .4H/.8H as macro arguments instead of 4H/8H (the later form being interpreted as an hexadecimal value). Fixes ticket #6324. Suggested-by: Martin Storsjö <martin@martin.st>	8 years ago
James Almer	c31cbeef58	aarch64/vp9dsp: add missing header includes	8 years ago
Ronald S. Bultje	f8c019944d	vp9: re-split the decoder/format/dsp interface header files. The advantage here is that the internal software decoder interface is not exposed to the DSP functions or the hardware accelerations.	8 years ago
Clément Bœsch	1c9f4b5078	lavc/vp9: split into vp9{block,data,mvs} This is following Libav layout to ease merges.	8 years ago
Martin Storsjö	61b8a9ea29	aarch64: vp9itxfm16: Do a simpler half/quarter idct16/idct32 when possible This work is sponsored by, and copyright, Google. This avoids loading and calculating coefficients that we know will be zero, and avoids filling the temp buffer with zeros in places where we know the second pass won't read. This gives a pretty substantial speedup for the smaller subpartitions. The code size increases from 21512 bytes to 31400 bytes. The idct16/32_end macros are moved above the individual functions; the instructions themselves are unchanged, but since new functions are added at the same place where the code is moved from, the diff looks rather messy. Before: vp9_inv_dct_dct_16x16_sub1_add_10_neon: 284.6 vp9_inv_dct_dct_16x16_sub2_add_10_neon: 1902.7 vp9_inv_dct_dct_16x16_sub4_add_10_neon: 1903.0 vp9_inv_dct_dct_16x16_sub8_add_10_neon: 2201.1 vp9_inv_dct_dct_16x16_sub12_add_10_neon: 2510.0 vp9_inv_dct_dct_16x16_sub16_add_10_neon: 2821.3 vp9_inv_dct_dct_32x32_sub1_add_10_neon: 1011.6 vp9_inv_dct_dct_32x32_sub2_add_10_neon: 9716.5 vp9_inv_dct_dct_32x32_sub4_add_10_neon: 9704.9 vp9_inv_dct_dct_32x32_sub8_add_10_neon: 10641.7 vp9_inv_dct_dct_32x32_sub12_add_10_neon: 11555.7 vp9_inv_dct_dct_32x32_sub16_add_10_neon: 12499.8 vp9_inv_dct_dct_32x32_sub20_add_10_neon: 13403.7 vp9_inv_dct_dct_32x32_sub24_add_10_neon: 14335.8 vp9_inv_dct_dct_32x32_sub28_add_10_neon: 15253.6 vp9_inv_dct_dct_32x32_sub32_add_10_neon: 16179.5 After: vp9_inv_dct_dct_16x16_sub1_add_10_neon: 282.8 vp9_inv_dct_dct_16x16_sub2_add_10_neon: 1142.4 vp9_inv_dct_dct_16x16_sub4_add_10_neon: 1139.0 vp9_inv_dct_dct_16x16_sub8_add_10_neon: 1772.9 vp9_inv_dct_dct_16x16_sub12_add_10_neon: 2515.2 vp9_inv_dct_dct_16x16_sub16_add_10_neon: 2823.5 vp9_inv_dct_dct_32x32_sub1_add_10_neon: 1012.7 vp9_inv_dct_dct_32x32_sub2_add_10_neon: 6944.4 vp9_inv_dct_dct_32x32_sub4_add_10_neon: 6944.2 vp9_inv_dct_dct_32x32_sub8_add_10_neon: 7609.8 vp9_inv_dct_dct_32x32_sub12_add_10_neon: 9953.4 vp9_inv_dct_dct_32x32_sub16_add_10_neon: 10770.1 vp9_inv_dct_dct_32x32_sub20_add_10_neon: 13418.8 vp9_inv_dct_dct_32x32_sub24_add_10_neon: 14330.7 vp9_inv_dct_dct_32x32_sub28_add_10_neon: 15257.1 vp9_inv_dct_dct_32x32_sub32_add_10_neon: 16190.6 Signed-off-by: Martin Storsjö <martin@martin.st>	8 years ago
Martin Storsjö	d564c9018f	aarch64: vp9itxfm16: Move the load_add_store macro out from the itxfm16 pass2 function This allows reusing the macro for a separate implementation of the pass2 function. Signed-off-by: Martin Storsjö <martin@martin.st>	8 years ago
Martin Storsjö	0f2705e66b	aarch64: vp9itxfm16: Make the larger core transforms standalone functions This work is sponsored by, and copyright, Google. This reduces the code size of libavcodec/aarch64/vp9itxfm_16bpp_neon.o from 26288 to 21512 bytes. This gives a small slowdown of a couple of tens of cycles, but makes it more feasible to add more optimized versions of these transforms. Before: vp9_inv_dct_dct_16x16_sub4_add_10_neon: 1887.4 vp9_inv_dct_dct_16x16_sub16_add_10_neon: 2801.5 vp9_inv_dct_dct_32x32_sub4_add_10_neon: 9691.4 vp9_inv_dct_dct_32x32_sub32_add_10_neon: 16154.9 After: vp9_inv_dct_dct_16x16_sub4_add_10_neon: 1899.5 vp9_inv_dct_dct_16x16_sub16_add_10_neon: 2827.2 vp9_inv_dct_dct_32x32_sub4_add_10_neon: 9714.7 vp9_inv_dct_dct_32x32_sub32_add_10_neon: 16175.9 Signed-off-by: Martin Storsjö <martin@martin.st>	8 years ago
Martin Storsjö	b76533f105	aarch64: vp9itxfm16: Restructure the idct32 store macros This avoids concatenation, which can't be used if the whole macro is wrapped within another macro. Signed-off-by: Martin Storsjö <martin@martin.st>	8 years ago
Martin Storsjö	d613251622	aarch64: vp9itxfm16: Avoid .irp when it doesn't save any lines This makes the code a bit more readable. Signed-off-by: Martin Storsjö <martin@martin.st>	8 years ago
Martin Storsjö	25ced1eb1c	aarch64: vp9itxfm16: Fix a typo in a comment Signed-off-by: Martin Storsjö <martin@martin.st>	8 years ago
Martin Storsjö	21c89f3a26	arm/aarch64: vp9: Fix vertical alignment Align the second/third operands as they usually are. Due to the wildly varying sizes of the written out operands in aarch64 assembly, the column alignment is usually not as clear as in arm assembly. This is cherrypicked from libav commit `7995ebfad1`. Signed-off-by: Martin Storsjö <martin@martin.st>	8 years ago
Martin Storsjö	70317b25aa	arm/aarch64: vp9itxfm: Skip loading the min_eob pointer when it won't be used In the half/quarter cases where we don't use the min_eob array, defer loading the pointer until we know it will be needed. This is cherrypicked from libav commit `3a0d5e206d`. Signed-off-by: Martin Storsjö <martin@martin.st>	8 years ago
Martin Storsjö	7995ebfad1	arm/aarch64: vp9: Fix vertical alignment Align the second/third operands as they usually are. Due to the wildly varying sizes of the written out operands in aarch64 assembly, the column alignment is usually not as clear as in arm assembly. Signed-off-by: Martin Storsjö <martin@martin.st>	8 years ago
Matthieu Bouron	4c8e528d19	lavc/aarch64: add ff_simple_idct{,_add,_put}_neon functions	8 years ago
Martin Storsjö	3a0d5e206d	arm/aarch64: vp9itxfm: Skip loading the min_eob pointer when it won't be used In the half/quarter cases where we don't use the min_eob array, defer loading the pointer until we know it will be needed. Signed-off-by: Martin Storsjö <martin@martin.st>	8 years ago
Martin Storsjö	26ee83acc4	aarch64: vp9itxfm: Reorder iadst16 coeffs This matches the order they are in the 16 bpp version. There they are in this order, to make sure we access them in the same order they are declared, easing loading only half of the coefficients at a time. This makes the 8 bpp version match the 16 bpp version better. This is cherrypicked from libav commit `b8f66c0838`. Signed-off-by: Martin Storsjö <martin@martin.st>	8 years ago
Martin Storsjö	f952273019	aarch64: vp9itxfm: Reorder the idct coefficients for better pairing All elements are used pairwise, except for the first one. Previously, the 16th element was unused. Move the unused element to the second slot, to make the later element pairs not split across registers. This simplifies loading only parts of the coefficients, reducing the difference to the 16 bpp version. This is cherrypicked from libav commit `09eb88a12e`. Signed-off-by: Martin Storsjö <martin@martin.st>	8 years ago
Martin Storsjö	2905657b90	aarch64: vp9itxfm: Avoid reloading the idct32 coefficients The idct32x32 function actually pushed d8-d15 onto the stack even though it didn't clobber them; there are plenty of registers that can be used to allow keeping all the idct coefficients in registers without having to reload different subsets of them at different stages in the transform. After this, we still can skip pushing d12-d15. Before: vp9_inv_dct_dct_32x32_sub32_add_neon: 8128.3 After: vp9_inv_dct_dct_32x32_sub32_add_neon: 8053.3 This is cherrypicked from libav commit `65aa002d54`. Signed-off-by: Martin Storsjö <martin@martin.st>	8 years ago
Martin Storsjö	f32690a298	aarch64: vp9lpf: Use dup+rev16+uzp1 instead of dup+lsr+dup+trn1 This is one cycle faster in total, and three instructions fewer. Before: vp9_loop_filter_mix2_v_44_16_neon: 123.2 After: vp9_loop_filter_mix2_v_44_16_neon: 122.2 This is cherrypicked from libav commit `3bf9c48320`. Signed-off-by: Martin Storsjö <martin@martin.st>	8 years ago
Martin Storsjö	3fbbad2984	arm/aarch64: vp9lpf: Keep the comparison to E within 8 bit The theoretical maximum value of E is 193, so we can just saturate the addition to 255. Before: Cortex A7 A8 A9 A53 A53/AArch64 vp9_loop_filter_v_4_8_neon: 143.0 127.7 114.8 88.0 87.7 vp9_loop_filter_v_8_8_neon: 241.0 197.2 173.7 140.0 136.7 vp9_loop_filter_v_16_8_neon: 497.0 419.5 379.7 293.0 275.7 vp9_loop_filter_v_16_16_neon: 965.2 818.7 731.4 579.0 452.0 After: vp9_loop_filter_v_4_8_neon: 136.0 125.7 112.6 84.0 83.0 vp9_loop_filter_v_8_8_neon: 234.0 195.5 171.5 136.0 133.7 vp9_loop_filter_v_16_8_neon: 490.0 417.5 377.7 289.0 271.0 vp9_loop_filter_v_16_16_neon: 951.2 814.7 732.3 571.0 446.7 This is cherrypicked from libav commit `c582cb8537`. Signed-off-by: Martin Storsjö <martin@martin.st>	8 years ago
Martin Storsjö	c8d6eec85d	aarch64: vp9lpf: Fix broken indentation/vertical alignment This is cherrypicked from libav commit `07b5136c48`. Signed-off-by: Martin Storsjö <martin@martin.st>	8 years ago
Martin Storsjö	9f3a886364	aarch64: vp9lpf: Interleave the start of flat8in into the calculation above This adds lots of extra .ifs, but speeds it up by a couple cycles, by avoiding stalls. This is cherrypicked from libav commit `b0806088d3`. Signed-off-by: Martin Storsjö <martin@martin.st>	8 years ago
Martin Storsjö	f0ecbb13cf	arm/aarch64: vp9lpf: Calculate !hev directly Previously we first calculated hev, and then negated it. Since we were able to schedule the negation in the middle of another calculation, we don't see any gain in all cases. Before: Cortex A7 A8 A9 A53 A53/AArch64 vp9_loop_filter_v_4_8_neon: 147.0 129.0 115.8 89.0 88.7 vp9_loop_filter_v_8_8_neon: 242.0 198.5 174.7 140.0 136.7 vp9_loop_filter_v_16_8_neon: 500.0 419.5 382.7 293.0 275.7 vp9_loop_filter_v_16_16_neon: 971.2 825.5 731.5 579.0 453.0 After: vp9_loop_filter_v_4_8_neon: 143.0 127.7 114.8 88.0 87.7 vp9_loop_filter_v_8_8_neon: 241.0 197.2 173.7 140.0 136.7 vp9_loop_filter_v_16_8_neon: 497.0 419.5 379.7 293.0 275.7 vp9_loop_filter_v_16_16_neon: 965.2 818.7 731.4 579.0 452.0 This is cherrypicked from libav commit `e1f9de86f4`. Signed-off-by: Martin Storsjö <martin@martin.st>	8 years ago
Martin Storsjö	148cc0bb89	aarch64: vp9itxfm: Optimize 16x16 and 32x32 idct dc by unrolling This work is sponsored by, and copyright, Google. Before: Cortex A53 vp9_inv_dct_dct_16x16_sub1_add_neon: 235.3 vp9_inv_dct_dct_32x32_sub1_add_neon: 555.1 After: vp9_inv_dct_dct_16x16_sub1_add_neon: 180.2 vp9_inv_dct_dct_32x32_sub1_add_neon: 475.3 This is cherrypicked from libav commit `3fcf788fbb`. Signed-off-by: Martin Storsjö <martin@martin.st>	8 years ago
Martin Storsjö	045e33ae3f	aarch64: vp9mc: Calculate less unused data in the 4 pixel wide horizontal filter No measured speedup on a Cortex A53, but other cores might benefit. This is cherrypicked from libav commit `388e0d2515`. Signed-off-by: Martin Storsjö <martin@martin.st>	8 years ago
Martin Storsjö	ac6cb8ae5b	aarch64: vp9mc: Simplify the extmla macro parameters Fold the field lengths into the macro. This makes the macro invocations much more readable, when the lines are shorter. This also makes it easier to use only half the registers within the macro. This is cherrypicked from libav commit `5e0c2158fb`. Signed-off-by: Martin Storsjö <martin@martin.st>	8 years ago
Martin Storsjö	16ef000799	aarch64: vp9itxfm: Fix incorrect vertical alignment This is cherrypicked from libav commit `0c0b87f12d`. Signed-off-by: Martin Storsjö <martin@martin.st>	8 years ago
Martin Storsjö	d0fbf7f34e	aarch64: vp9itxfm: Update a comment to refer to a register with a different name This is cherrypicked from libav commit `8476eb0d3a`. Signed-off-by: Martin Storsjö <martin@martin.st>	8 years ago
Martin Storsjö	6752318c73	aarch64: vp9itxfm: Use the right lane sizes in 8x8 for improved readability This is cherrypicked from libav commit `3dd7827258`. Signed-off-by: Martin Storsjö <martin@martin.st>	8 years ago
Martin Storsjö	19a0f9529c	aarch64: vp9itxfm: Use a single lane ld1 instead of ld1r where possible The ld1r is a leftover from the arm version, where this trick is beneficial on some cores. Use a single-lane load where we don't need the semantics of ld1r. This is cherrypicked from libav commit `ed8d293306`. Signed-off-by: Martin Storsjö <martin@martin.st>	8 years ago
Martin Storsjö	3006e5253a	aarch64: vp9itxfm: Share instructions for loading idct coeffs in the 8x8 function This is cherrypicked from libav commit `4da4b2b87f`. Signed-off-by: Martin Storsjö <martin@martin.st>	8 years ago
Martin Storsjö	9532a7d4d0	aarch64: vp9itxfm: Do separate functions for half/quarter idct16 and idct32 This work is sponsored by, and copyright, Google. This avoids loading and calculating coefficients that we know will be zero, and avoids filling the temp buffer with zeros in places where we know the second pass won't read. This gives a pretty substantial speedup for the smaller subpartitions. The code size increases from 14740 bytes to 24292 bytes. The idct16/32_end macros are moved above the individual functions; the instructions themselves are unchanged, but since new functions are added at the same place where the code is moved from, the diff looks rather messy. Before: vp9_inv_dct_dct_16x16_sub1_add_neon: 236.7 vp9_inv_dct_dct_16x16_sub2_add_neon: 1051.0 vp9_inv_dct_dct_16x16_sub4_add_neon: 1051.0 vp9_inv_dct_dct_16x16_sub8_add_neon: 1051.0 vp9_inv_dct_dct_16x16_sub12_add_neon: 1387.4 vp9_inv_dct_dct_16x16_sub16_add_neon: 1387.6 vp9_inv_dct_dct_32x32_sub1_add_neon: 554.1 vp9_inv_dct_dct_32x32_sub2_add_neon: 5198.5 vp9_inv_dct_dct_32x32_sub4_add_neon: 5198.6 vp9_inv_dct_dct_32x32_sub8_add_neon: 5196.3 vp9_inv_dct_dct_32x32_sub12_add_neon: 6183.4 vp9_inv_dct_dct_32x32_sub16_add_neon: 6174.3 vp9_inv_dct_dct_32x32_sub20_add_neon: 7151.4 vp9_inv_dct_dct_32x32_sub24_add_neon: 7145.3 vp9_inv_dct_dct_32x32_sub28_add_neon: 8119.3 vp9_inv_dct_dct_32x32_sub32_add_neon: 8118.7 After: vp9_inv_dct_dct_16x16_sub1_add_neon: 236.7 vp9_inv_dct_dct_16x16_sub2_add_neon: 640.8 vp9_inv_dct_dct_16x16_sub4_add_neon: 639.0 vp9_inv_dct_dct_16x16_sub8_add_neon: 842.0 vp9_inv_dct_dct_16x16_sub12_add_neon: 1388.3 vp9_inv_dct_dct_16x16_sub16_add_neon: 1389.3 vp9_inv_dct_dct_32x32_sub1_add_neon: 554.1 vp9_inv_dct_dct_32x32_sub2_add_neon: 3685.5 vp9_inv_dct_dct_32x32_sub4_add_neon: 3685.1 vp9_inv_dct_dct_32x32_sub8_add_neon: 3684.4 vp9_inv_dct_dct_32x32_sub12_add_neon: 5312.2 vp9_inv_dct_dct_32x32_sub16_add_neon: 5315.4 vp9_inv_dct_dct_32x32_sub20_add_neon: 7154.9 vp9_inv_dct_dct_32x32_sub24_add_neon: 7154.5 vp9_inv_dct_dct_32x32_sub28_add_neon: 8126.6 vp9_inv_dct_dct_32x32_sub32_add_neon: 8127.2 This is cherrypicked from libav commit `a63da4511d`. Signed-off-by: Martin Storsjö <martin@martin.st>	8 years ago
Martin Storsjö	a681c793a3	aarch64: vp9itxfm: Move the load_add_store macro out from the itxfm16 pass2 function This allows reusing the macro for a separate implementation of the pass2 function. This is cherrypicked from libav commit `79d332ebbd`. Signed-off-by: Martin Storsjö <martin@martin.st>	8 years ago
Martin Storsjö	dc47bf3872	aarch64: vp9itxfm: Make the larger core transforms standalone functions This work is sponsored by, and copyright, Google. This reduces the code size of libavcodec/aarch64/vp9itxfm_neon.o from 19496 to 14740 bytes. This gives a small slowdown of a couple of tens of cycles, but makes it more feasible to add more optimized versions of these transforms. Before: vp9_inv_dct_dct_16x16_sub4_add_neon: 1036.7 vp9_inv_dct_dct_16x16_sub16_add_neon: 1372.2 vp9_inv_dct_dct_32x32_sub4_add_neon: 5180.0 vp9_inv_dct_dct_32x32_sub32_add_neon: 8095.7 After: vp9_inv_dct_dct_16x16_sub4_add_neon: 1051.0 vp9_inv_dct_dct_16x16_sub16_add_neon: 1390.1 vp9_inv_dct_dct_32x32_sub4_add_neon: 5199.9 vp9_inv_dct_dct_32x32_sub32_add_neon: 8125.8 This is cherrypicked from libav commit `115476018d`. Signed-off-by: Martin Storsjö <martin@martin.st>	8 years ago
Martin Storsjö	52c7366c83	aarch64: vp9itxfm: Restructure the idct32 store macros This avoids concatenation, which can't be used if the whole macro is wrapped within another macro. This is also arguably more readable. This is cherrypicked from libav commit `58d87e0f49`. Signed-off-by: Martin Storsjö <martin@martin.st>	8 years ago
Martin Storsjö	b8f66c0838	aarch64: vp9itxfm: Reorder iadst16 coeffs This matches the order they are in the 16 bpp version. There they are in this order, to make sure we access them in the same order they are declared, easing loading only half of the coefficients at a time. This makes the 8 bpp version match the 16 bpp version better. Signed-off-by: Martin Storsjö <martin@martin.st>	8 years ago
Martin Storsjö	09eb88a12e	aarch64: vp9itxfm: Reorder the idct coefficients for better pairing All elements are used pairwise, except for the first one. Previously, the 16th element was unused. Move the unused element to the second slot, to make the later element pairs not split across registers. This simplifies loading only parts of the coefficients, reducing the difference to the 16 bpp version. Signed-off-by: Martin Storsjö <martin@martin.st>	8 years ago
Martin Storsjö	65aa002d54	aarch64: vp9itxfm: Avoid reloading the idct32 coefficients The idct32x32 function actually pushed d8-d15 onto the stack even though it didn't clobber them; there are plenty of registers that can be used to allow keeping all the idct coefficients in registers without having to reload different subsets of them at different stages in the transform. After this, we still can skip pushing d12-d15. Before: vp9_inv_dct_dct_32x32_sub32_add_neon: 8128.3 After: vp9_inv_dct_dct_32x32_sub32_add_neon: 8053.3 Signed-off-by: Martin Storsjö <martin@martin.st>	8 years ago
Martin Storsjö	3bf9c48320	aarch64: vp9lpf: Use dup+rev16+uzp1 instead of dup+lsr+dup+trn1 This is one cycle faster in total, and three instructions fewer. Before: vp9_loop_filter_mix2_v_44_16_neon: 123.2 After: vp9_loop_filter_mix2_v_44_16_neon: 122.2 Signed-off-by: Martin Storsjö <martin@martin.st>	8 years ago
Martin Storsjö	c582cb8537	arm/aarch64: vp9lpf: Keep the comparison to E within 8 bit The theoretical maximum value of E is 193, so we can just saturate the addition to 255. Before: Cortex A7 A8 A9 A53 A53/AArch64 vp9_loop_filter_v_4_8_neon: 143.0 127.7 114.8 88.0 87.7 vp9_loop_filter_v_8_8_neon: 241.0 197.2 173.7 140.0 136.7 vp9_loop_filter_v_16_8_neon: 497.0 419.5 379.7 293.0 275.7 vp9_loop_filter_v_16_16_neon: 965.2 818.7 731.4 579.0 452.0 After: vp9_loop_filter_v_4_8_neon: 136.0 125.7 112.6 84.0 83.0 vp9_loop_filter_v_8_8_neon: 234.0 195.5 171.5 136.0 133.7 vp9_loop_filter_v_16_8_neon: 490.0 417.5 377.7 289.0 271.0 vp9_loop_filter_v_16_16_neon: 951.2 814.7 732.3 571.0 446.7 Signed-off-by: Martin Storsjö <martin@martin.st>	8 years ago
Martin Storsjö	07b5136c48	aarch64: vp9lpf: Fix broken indentation/vertical alignment Signed-off-by: Martin Storsjö <martin@martin.st>	8 years ago
Martin Storsjö	b0806088d3	aarch64: vp9lpf: Interleave the start of flat8in into the calculation above This adds lots of extra .ifs, but speeds it up by a couple cycles, by avoiding stalls. Signed-off-by: Martin Storsjö <martin@martin.st>	8 years ago

161 Commits (69218b41980883a7e75656f3058171939f5729ef)