FFmpeg

Commit Graph

Author	SHA1	Message	Date
Logan Lyu	8fa83ad70f	lavc/aarch64: new optimization for 8-bit hevc_qpel_uni_hv checkasm bench: put_hevc_qpel_uni_hv4_8_c: 489.2 put_hevc_qpel_uni_hv4_8_i8mm: 105.7 put_hevc_qpel_uni_hv6_8_c: 852.7 put_hevc_qpel_uni_hv6_8_i8mm: 268.7 put_hevc_qpel_uni_hv8_8_c: 1345.7 put_hevc_qpel_uni_hv8_8_i8mm: 300.4 put_hevc_qpel_uni_hv12_8_c: 2757.4 put_hevc_qpel_uni_hv12_8_i8mm: 581.4 put_hevc_qpel_uni_hv16_8_c: 4458.9 put_hevc_qpel_uni_hv16_8_i8mm: 860.2 put_hevc_qpel_uni_hv24_8_c: 9582.2 put_hevc_qpel_uni_hv24_8_i8mm: 2086.7 put_hevc_qpel_uni_hv32_8_c: 16401.9 put_hevc_qpel_uni_hv32_8_i8mm: 3217.4 put_hevc_qpel_uni_hv48_8_c: 36402.4 put_hevc_qpel_uni_hv48_8_i8mm: 7082.7 put_hevc_qpel_uni_hv64_8_c: 62713.2 put_hevc_qpel_uni_hv64_8_i8mm: 12408.9 Co-Authored-By: J. Dekker <jdek@itanimul.li> Signed-off-by: Martin Storsjö <martin@martin.st>	1 year ago
Logan Lyu	23ca61b7de	lavc/aarch64: new optimization for 8-bit hevc_qpel_uni_v checkasm bench: put_hevc_qpel_uni_v4_8_c: 146.2 put_hevc_qpel_uni_v4_8_neon: 43.2 put_hevc_qpel_uni_v6_8_c: 303.9 put_hevc_qpel_uni_v6_8_neon: 69.7 put_hevc_qpel_uni_v8_8_c: 495.2 put_hevc_qpel_uni_v8_8_neon: 74.7 put_hevc_qpel_uni_v12_8_c: 1100.9 put_hevc_qpel_uni_v12_8_neon: 222.4 put_hevc_qpel_uni_v16_8_c: 1955.2 put_hevc_qpel_uni_v16_8_neon: 269.2 put_hevc_qpel_uni_v24_8_c: 4571.9 put_hevc_qpel_uni_v24_8_neon: 832.4 put_hevc_qpel_uni_v32_8_c: 8226.4 put_hevc_qpel_uni_v32_8_neon: 1035.7 put_hevc_qpel_uni_v48_8_c: 18324.2 put_hevc_qpel_uni_v48_8_neon: 2321.2 put_hevc_qpel_uni_v64_8_c: 37659.4 put_hevc_qpel_uni_v64_8_neon: 4122.2 Co-Authored-By: J. Dekker <jdek@itanimul.li> Signed-off-by: Martin Storsjö <martin@martin.st>	1 year ago
Logan Lyu	b7a3150bc5	lavc/aarch64: new optimization for 8-bit hevc_epel_uni_hv checkasm bench: put_hevc_epel_uni_hv4_8_c: 204.7 put_hevc_epel_uni_hv4_8_i8mm: 70.2 put_hevc_epel_uni_hv6_8_c: 378.2 put_hevc_epel_uni_hv6_8_i8mm: 131.9 put_hevc_epel_uni_hv8_8_c: 637.7 put_hevc_epel_uni_hv8_8_i8mm: 137.9 put_hevc_epel_uni_hv12_8_c: 1301.9 put_hevc_epel_uni_hv12_8_i8mm: 314.2 put_hevc_epel_uni_hv16_8_c: 2203.4 put_hevc_epel_uni_hv16_8_i8mm: 454.7 put_hevc_epel_uni_hv24_8_c: 4848.2 put_hevc_epel_uni_hv24_8_i8mm: 1065.2 put_hevc_epel_uni_hv32_8_c: 8517.4 put_hevc_epel_uni_hv32_8_i8mm: 1898.4 put_hevc_epel_uni_hv48_8_c: 19591.7 put_hevc_epel_uni_hv48_8_i8mm: 4107.2 put_hevc_epel_uni_hv64_8_c: 33880.2 put_hevc_epel_uni_hv64_8_i8mm: 6568.7 Co-Authored-By: J. Dekker <jdek@itanimul.li> Signed-off-by: Martin Storsjö <martin@martin.st>	1 year ago
Logan Lyu	c0374f77f4	lavc/aarch64: move macros calc_epelh, calc_epelh2, load_epel_filterh Signed-off-by: Martin Storsjö <martin@martin.st>	1 year ago
Logan Lyu	7ce5a2f640	lavc/aarch64: new optimization for 8-bit hevc_epel_uni_v checkasm bench: put_hevc_epel_uni_hv64_8_i8mm: 6568.7 put_hevc_epel_uni_v4_8_c: 88.7 put_hevc_epel_uni_v4_8_neon: 32.7 put_hevc_epel_uni_v6_8_c: 185.4 put_hevc_epel_uni_v6_8_neon: 44.9 put_hevc_epel_uni_v8_8_c: 333.9 put_hevc_epel_uni_v8_8_neon: 44.4 put_hevc_epel_uni_v12_8_c: 728.7 put_hevc_epel_uni_v12_8_neon: 119.7 put_hevc_epel_uni_v16_8_c: 1224.2 put_hevc_epel_uni_v16_8_neon: 139.7 put_hevc_epel_uni_v24_8_c: 2531.2 put_hevc_epel_uni_v24_8_neon: 329.9 put_hevc_epel_uni_v32_8_c: 4739.9 put_hevc_epel_uni_v32_8_neon: 562.7 put_hevc_epel_uni_v48_8_c: 10618.7 put_hevc_epel_uni_v48_8_neon: 1256.2 put_hevc_epel_uni_v64_8_c: 19169.9 put_hevc_epel_uni_v64_8_neon: 2179.2 Co-Authored-By: J. Dekker <jdek@itanimul.li> Signed-off-by: Martin Storsjö <martin@martin.st>	1 year ago
Casey Smalley	b98ee1a355	aarch64/hevc: Replace br return with ret This patch changes the return instruction in the tr_32x4 macro from BR to RET. Function returns should always use the RET instruction instead of BR, to avoid interfering with branch prediction. On devices that support BTI, this is observable as a landing pad is required when branching with BR. The change fixes fate-hevc-hdr-vivid-metadata when on hardware with BTI support. Signed-off-by: Casey Smalley <casey.smalley@arm.com> Signed-off-by: Martin Storsjö <martin@martin.st>	1 year ago
Reimar Döffinger	dcff15692d	hevcdsp_idct_neon.S: Avoid unnecessary mov. ret can be given an argument instead. This is also consistent with how other assembler code in FFmpeg does it. Signed-off-by: Reimar Döffinger <Reimar.Doeffinger@gmx.de>	1 year ago
Rémi Denis-Courmont	82cb4b1c05	lavc/aarch64: remove bogus HAVE_VFP guard The IMDCT offset is only relevant for NEON optimisations. There are no VFP optimisations here that would justify the HAVE_VFP flag. In practice, this makes no difference because HAVE_NEON is practically always true for standard Armv8 platforms.	1 year ago
Logan Lyu	9557bf26b3	lavc/aarch64: new optimization for 8-bit hevc_epel_uni_w_hv put_hevc_epel_uni_w_hv4_8_c: 254.6 put_hevc_epel_uni_w_hv4_8_i8mm: 102.9 put_hevc_epel_uni_w_hv6_8_c: 411.6 put_hevc_epel_uni_w_hv6_8_i8mm: 221.6 put_hevc_epel_uni_w_hv8_8_c: 669.4 put_hevc_epel_uni_w_hv8_8_i8mm: 214.9 put_hevc_epel_uni_w_hv12_8_c: 1412.6 put_hevc_epel_uni_w_hv12_8_i8mm: 481.4 put_hevc_epel_uni_w_hv16_8_c: 2425.4 put_hevc_epel_uni_w_hv16_8_i8mm: 647.4 put_hevc_epel_uni_w_hv24_8_c: 5384.1 put_hevc_epel_uni_w_hv24_8_i8mm: 1450.6 put_hevc_epel_uni_w_hv32_8_c: 9470.9 put_hevc_epel_uni_w_hv32_8_i8mm: 2497.1 put_hevc_epel_uni_w_hv48_8_c: 20930.1 put_hevc_epel_uni_w_hv48_8_i8mm: 5635.9 put_hevc_epel_uni_w_hv64_8_c: 36682.9 put_hevc_epel_uni_w_hv64_8_i8mm: 9712.6 Signed-off-by: Martin Storsjö <martin@martin.st>	1 year ago
Logan Lyu	d48c89701c	lavc/aarch64: new optimization for 8-bit hevc_epel_h put_hevc_epel_h4_8_c: 67.1 put_hevc_epel_h4_8_i8mm: 21.1 put_hevc_epel_h6_8_c: 147.1 put_hevc_epel_h6_8_i8mm: 45.1 put_hevc_epel_h8_8_c: 237.4 put_hevc_epel_h8_8_i8mm: 72.1 put_hevc_epel_h12_8_c: 527.4 put_hevc_epel_h12_8_i8mm: 115.4 put_hevc_epel_h16_8_c: 943.6 put_hevc_epel_h16_8_i8mm: 153.9 put_hevc_epel_h24_8_c: 2105.4 put_hevc_epel_h24_8_i8mm: 384.4 put_hevc_epel_h32_8_c: 3631.4 put_hevc_epel_h32_8_i8mm: 519.9 put_hevc_epel_h48_8_c: 8082.1 put_hevc_epel_h48_8_i8mm: 1110.4 put_hevc_epel_h64_8_c: 14400.6 put_hevc_epel_h64_8_i8mm: 2057.1 put_hevc_qpel_h4_8_c: 124.9 put_hevc_qpel_h4_8_neon: 43.1 put_hevc_qpel_h4_8_i8mm: 33.1 put_hevc_qpel_h6_8_c: 269.4 put_hevc_qpel_h6_8_neon: 90.6 put_hevc_qpel_h6_8_i8mm: 61.4 put_hevc_qpel_h8_8_c: 477.6 put_hevc_qpel_h8_8_neon: 82.1 put_hevc_qpel_h8_8_i8mm: 99.9 put_hevc_qpel_h12_8_c: 1062.4 put_hevc_qpel_h12_8_neon: 226.9 put_hevc_qpel_h12_8_i8mm: 170.9 put_hevc_qpel_h16_8_c: 1880.6 put_hevc_qpel_h16_8_neon: 302.9 put_hevc_qpel_h16_8_i8mm: 251.4 put_hevc_qpel_h24_8_c: 4221.9 put_hevc_qpel_h24_8_neon: 893.9 put_hevc_qpel_h24_8_i8mm: 626.1 put_hevc_qpel_h32_8_c: 7437.6 put_hevc_qpel_h32_8_neon: 1189.9 put_hevc_qpel_h32_8_i8mm: 959.1 put_hevc_qpel_h48_8_c: 16838.4 put_hevc_qpel_h48_8_neon: 2727.9 put_hevc_qpel_h48_8_i8mm: 2163.9 put_hevc_qpel_h64_8_c: 29982.1 put_hevc_qpel_h64_8_neon: 4777.6 Signed-off-by: Martin Storsjö <martin@martin.st>	1 year ago
Logan Lyu	668eb4c00e	lavc/aarch64: new optimization for 8-bit hevc_epel_uni_w_v put_hevc_epel_uni_w_v4_8_c: 116.1 put_hevc_epel_uni_w_v4_8_neon: 48.6 put_hevc_epel_uni_w_v6_8_c: 248.9 put_hevc_epel_uni_w_v6_8_neon: 80.6 put_hevc_epel_uni_w_v8_8_c: 383.9 put_hevc_epel_uni_w_v8_8_neon: 91.9 put_hevc_epel_uni_w_v12_8_c: 806.1 put_hevc_epel_uni_w_v12_8_neon: 202.9 put_hevc_epel_uni_w_v16_8_c: 1411.1 put_hevc_epel_uni_w_v16_8_neon: 289.9 put_hevc_epel_uni_w_v24_8_c: 3168.9 put_hevc_epel_uni_w_v24_8_neon: 619.4 put_hevc_epel_uni_w_v32_8_c: 5632.9 put_hevc_epel_uni_w_v32_8_neon: 1161.1 put_hevc_epel_uni_w_v48_8_c: 12406.1 put_hevc_epel_uni_w_v48_8_neon: 2476.4 put_hevc_epel_uni_w_v64_8_c: 22001.4 put_hevc_epel_uni_w_v64_8_neon: 4343.9 Signed-off-by: Martin Storsjö <martin@martin.st>	1 year ago
Logan Lyu	0c604b1913	lavc/aarch64: new optimization for 8-bit hevc_epel_uni_w_h put_hevc_epel_uni_w_h4_8_c: 126.1 put_hevc_epel_uni_w_h4_8_i8mm: 41.6 put_hevc_epel_uni_w_h6_8_c: 222.9 put_hevc_epel_uni_w_h6_8_i8mm: 91.4 put_hevc_epel_uni_w_h8_8_c: 374.4 put_hevc_epel_uni_w_h8_8_i8mm: 102.1 put_hevc_epel_uni_w_h12_8_c: 806.1 put_hevc_epel_uni_w_h12_8_i8mm: 225.6 put_hevc_epel_uni_w_h16_8_c: 1414.4 put_hevc_epel_uni_w_h16_8_i8mm: 333.4 put_hevc_epel_uni_w_h24_8_c: 3128.6 put_hevc_epel_uni_w_h24_8_i8mm: 713.1 put_hevc_epel_uni_w_h32_8_c: 5519.1 put_hevc_epel_uni_w_h32_8_i8mm: 1118.1 put_hevc_epel_uni_w_h48_8_c: 12364.4 put_hevc_epel_uni_w_h48_8_i8mm: 2541.1 put_hevc_epel_uni_w_h64_8_c: 21925.9 put_hevc_epel_uni_w_h64_8_i8mm: 4383.6 Signed-off-by: Martin Storsjö <martin@martin.st>	1 year ago
Logan Lyu	e652e7dcda	lavc/aarch64: new optimization for 8-bit hevc_pel_uni_pixels put_hevc_pel_uni_pixels4_8_c: 35.9 put_hevc_pel_uni_pixels4_8_neon: 7.6 put_hevc_pel_uni_pixels6_8_c: 46.1 put_hevc_pel_uni_pixels6_8_neon: 20.6 put_hevc_pel_uni_pixels8_8_c: 53.4 put_hevc_pel_uni_pixels8_8_neon: 11.6 put_hevc_pel_uni_pixels12_8_c: 89.1 put_hevc_pel_uni_pixels12_8_neon: 25.9 put_hevc_pel_uni_pixels16_8_c: 106.4 put_hevc_pel_uni_pixels16_8_neon: 20.4 put_hevc_pel_uni_pixels24_8_c: 137.6 put_hevc_pel_uni_pixels24_8_neon: 47.1 put_hevc_pel_uni_pixels32_8_c: 173.6 put_hevc_pel_uni_pixels32_8_neon: 54.1 put_hevc_pel_uni_pixels48_8_c: 268.1 put_hevc_pel_uni_pixels48_8_neon: 117.1 put_hevc_pel_uni_pixels64_8_c: 346.1 put_hevc_pel_uni_pixels64_8_neon: 205.9 Signed-off-by: Martin Storsjö <martin@martin.st>	1 year ago
Logan Lyu	e79686be96	lavc/aarch64: new optimization for 8-bit hevc_qpel_h hevc_qpel_uni_w_hv Signed-off-by: Martin Storsjö <martin@martin.st>	1 year ago
Logan Lyu	15972cce8c	lavc/aarch64: new optimization for 8-bit hevc_qpel_uni_w_h Signed-off-by: Martin Storsjö <martin@martin.st>	1 year ago
Logan Lyu	0b7356c1b4	lavc/aarch64: new optimization for 8-bit hevc_pel_uni_w_pixels and qpel_uni_w_v Signed-off-by: Martin Storsjö <martin@martin.st>	1 year ago
xufuji456	bd2f00f665	codec/aarch64/hevc: add transform_luma_neon got 56% speed up (run_count=1000, CPU=Cortex A53) transform_4x4_luma_neon: 45 transform_4x4_luma_c: 103 Signed-off-by: xufuji456 <839789740@qq.com> Signed-off-by: Martin Storsjö <martin@martin.st>	2 years ago
xufuji456	00a062b8d5	codec/aarch64/hevc:add idct_32x32_neon got 73% speed up (run_count=1000, CPU=Cortex A53) idct_32x32_neon: 4826 idct_32x32_c: 18236 idct_32x32_neon: 4824 idct_32x32_c: 18149 idct_32x32_neon: 4937 idct_32x32_c: 18333 Signed-off-by: Martin Storsjö <martin@martin.st>	2 years ago
J. Dekker	b564ad8eac	lavc/aarch64: add hevc deblock chroma 8-12bit Benched on Ampere Altra: hevc_h_loop_filter_chroma8_c: 367.7 hevc_h_loop_filter_chroma8_neon: 31.0 hevc_h_loop_filter_chroma10_c: 396.7 hevc_h_loop_filter_chroma10_neon: 27.5 hevc_h_loop_filter_chroma12_c: 377.0 hevc_h_loop_filter_chroma12_neon: 31.7 hevc_v_loop_filter_chroma8_c: 369.0 hevc_v_loop_filter_chroma8_neon: 55.0 hevc_v_loop_filter_chroma10_c: 389.0 hevc_v_loop_filter_chroma10_neon: 54.0 hevc_v_loop_filter_chroma12_c: 389.5 hevc_v_loop_filter_chroma12_neon: 53.0 Signed-off-by: J. Dekker <jdek@itanimul.li>	2 years ago
J. Dekker	37cde570bc	lavc/aarch64: add clip N macro Signed-off-by: J. Dekker <jdek@itanimul.li>	2 years ago
xufuji456	4b4de07721	libavcodec/hevc: add hevc idct4x4 neon of aarch64 Signed-off-by: Martin Storsjö <martin@martin.st>	2 years ago
Martin Storsjö	ec7fa13eb0	aarch64: hevcdsp_idct: Reuse preexisting macros for transposes Signed-off-by: Martin Storsjö <martin@martin.st>	2 years ago
Lynne	e0661fc805	dca_core: convert to lavu/tx Thanks to Martin Storsjö <martin@martin.st> for fixing and testing the arm32 and aarch64 changes.	2 years ago
J. Dekker	9bed814e1d	lavc/aarch64: add hevc horizontal qpel/uni/bi checkasm --benchmark on Ampere Altra (Neoverse N1): put_hevc_qpel_bi_h4_8_c: 170.7 put_hevc_qpel_bi_h4_8_neon: 64.5 put_hevc_qpel_bi_h6_8_c: 373.7 put_hevc_qpel_bi_h6_8_neon: 130.2 put_hevc_qpel_bi_h8_8_c: 662.0 put_hevc_qpel_bi_h8_8_neon: 138.5 put_hevc_qpel_bi_h12_8_c: 1529.5 put_hevc_qpel_bi_h12_8_neon: 422.0 put_hevc_qpel_bi_h16_8_c: 2735.5 put_hevc_qpel_bi_h16_8_neon: 560.5 put_hevc_qpel_bi_h24_8_c: 6015.7 put_hevc_qpel_bi_h24_8_neon: 1636.0 put_hevc_qpel_bi_h32_8_c: 10779.0 put_hevc_qpel_bi_h32_8_neon: 2204.5 put_hevc_qpel_bi_h48_8_c: 24375.0 put_hevc_qpel_bi_h48_8_neon: 4984.0 put_hevc_qpel_bi_h64_8_c: 42768.0 put_hevc_qpel_bi_h64_8_neon: 8795.7 put_hevc_qpel_h4_8_c: 149.0 put_hevc_qpel_h4_8_neon: 55.7 put_hevc_qpel_h6_8_c: 321.2 put_hevc_qpel_h6_8_neon: 106.0 put_hevc_qpel_h8_8_c: 578.7 put_hevc_qpel_h8_8_neon: 133.2 put_hevc_qpel_h12_8_c: 1279.0 put_hevc_qpel_h12_8_neon: 391.7 put_hevc_qpel_h16_8_c: 2286.2 put_hevc_qpel_h16_8_neon: 519.7 put_hevc_qpel_h24_8_c: 5100.7 put_hevc_qpel_h24_8_neon: 1546.2 put_hevc_qpel_h32_8_c: 9022.0 put_hevc_qpel_h32_8_neon: 2060.2 put_hevc_qpel_h48_8_c: 20293.5 put_hevc_qpel_h48_8_neon: 4656.7 put_hevc_qpel_h64_8_c: 36037.0 put_hevc_qpel_h64_8_neon: 8262.7 put_hevc_qpel_uni_h4_8_c: 162.2 put_hevc_qpel_uni_h4_8_neon: 61.7 put_hevc_qpel_uni_h6_8_c: 355.2 put_hevc_qpel_uni_h6_8_neon: 114.2 put_hevc_qpel_uni_h8_8_c: 651.0 put_hevc_qpel_uni_h8_8_neon: 135.7 put_hevc_qpel_uni_h12_8_c: 1412.5 put_hevc_qpel_uni_h12_8_neon: 402.7 put_hevc_qpel_uni_h16_8_c: 2551.0 put_hevc_qpel_uni_h16_8_neon: 533.5 put_hevc_qpel_uni_h24_8_c: 5782.2 put_hevc_qpel_uni_h24_8_neon: 1578.7 put_hevc_qpel_uni_h32_8_c: 10586.5 put_hevc_qpel_uni_h32_8_neon: 2102.2 put_hevc_qpel_uni_h48_8_c: 23812.0 put_hevc_qpel_uni_h48_8_neon: 4739.5 put_hevc_qpel_uni_h64_8_c: 42958.7 put_hevc_qpel_uni_h64_8_neon: 8366.5 Signed-off-by: J. Dekker <jdek@itanimul.li>	2 years ago
Reimar Döffinger	38cd829dce	aarch64: Implement stack spilling in a consistent way. Currently it is done in several different ways, which might cause needless dependencies or in case of tx_float_neon.S is incorrect. Reviewed-by: Martin Storsjö <martin@martin.st> Signed-off-by: Reimar Döffinger <Reimar.Doeffinger@gmx.de>	2 years ago
Grzegorz Bernacki	8f4b000c37	lavc/aarch64: Add neon implementation for vsse_intra8 Provide optimized implementation for vsse_intra8 for arm64. Performance tests are shown below. - vsse_5_c: 87.7 - vsse_5_neon: 26.2 Benchmarks and tests are run with checkasm tool on AWS Graviton 3. Co-authored-by: Martin Storsjö <martin@martin.st> Signed-off-by: Martin Storsjö <martin@martin.st>	2 years ago
Grzegorz Bernacki	bad67cb9fd	lavc/aarch64: Provide optimized implementation of vsse8 for arm64. Provide optimized implementation of vsse8 for arm64. Performance comparison tests are shown below. - vsse_1_c: 141.5 - vsse_1_neon: 32.5 Benchmarks and tests are run with checkasm tool on AWS Graviton 3. Signed-off-by: Grzegorz Bernacki <gjb@semihalf.com> Signed-off-by: Martin Storsjö <martin@martin.st>	2 years ago
Grzegorz Bernacki	faea56c9c7	lavc/aarch64: Provide neon implementation of nsse8 Add vectorized implementation of nsse8 function. Performance comparison tests are shown below. - nsse_1_c: 256.0 - nsse_1_neon: 82.7 Benchmarks and tests run with checkasm tool on AWS Graviton 3. Signed-off-by: Grzegorz Bernacki <gjb@semihalf.com> Signed-off-by: Martin Storsjö <martin@martin.st>	2 years ago
Grzegorz Bernacki	f401a2af21	lavc/aarch64: Add neon implementation for pix_abs8 functions. Provide optimized implementation of pix_abs8 function for arm64. Performance comparison tests are shown below: pix_abs_1_1_c: 162.5 pix_abs_1_1_neon: 27.0 pix_abs_1_2_c: 174.0 pix_abs_1_2_neon: 23.5 pix_abs_1_3_c: 203.2 pix_abs_1_3_neon: 34.7 Benchmarks and tests are run with checkasm tool on AWS Graviton 3. Co-authored-by: Martin Storsjö <martin@martin.st> Signed-off-by: Grzegorz Bernacki <gjb@semihalf.com> Signed-off-by: Martin Storsjö <martin@martin.st>	2 years ago
Martin Storsjö	8089fe072e	aarch64: me_cmp: Avoid using the non-unrolled codepath for the minimum unroll size Signed-off-by: Martin Storsjö <martin@martin.st>	2 years ago
Martin Storsjö	6f2ad7f951	aarch64: me_cmp: Avoid redundant loads in ff_pix_abs16_y2_neon This avoids one redundant load per row; pix3 from the previous iteration can be used as pix2 in the next one. Before: Cortex A53 A72 A73 pix_abs_0_2_neon: 138.0 59.7 48.0 After: pix_abs_0_2_neon: 109.7 50.2 39.5 Signed-off-by: Martin Storsjö <martin@martin.st>	2 years ago
Andreas Rheinhardt	9beba05311	avcodec/fmtconvert: Remove unused AVCodecContext parameter Unused since `d74a8cb7e4`. Reviewed-by: Rémi Denis-Courmont <remi@remlab.net> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2 years ago
Hubert Mazur	b2732115dd	lavc/aarch64: Add neon implementation for pix_median_abs8 Provide optimized implementation for pix_median_abs8 function. Performance comparison tests are shown below. - median_sad_1_c: 277.0 - median_sad_1_neon: 82.0 Benchmarks and tests run with checkasm tool on AWS Graviton 3. Signed-off-by: Hubert Mazur <hum@semihalf.com> Signed-off-by: Martin Storsjö <martin@martin.st>	2 years ago
Hubert Mazur	e9a6170213	lavc/aarch64: Add neon implementation for vsad8_intra Provide optimized implementation for vsad8_intra function. Performance comparison tests are shown below. - vsad_5_c: 94.7 - vsad_5_neon: 20.7 Benchmarks and tests run with checkasm tool on AWS Graviton 3. Signed-off-by: Hubert Mazur <hum@semihalf.com> Signed-off-by: Martin Storsjö <martin@martin.st>	2 years ago
Hubert Mazur	0ee535b1db	lavc/aarch64: Add neon implementation for pix_median_abs16 Provide optimized implementation for pix_median_abs16 function. Performance comparison tests are shown below. - median_sad_0_c: 720.5 - median_sad_0_neon: 127.2 Benchmarks and tests run with checkasm tool on AWS Graviton 3. Signed-off-by: Hubert Mazur <hum@semihalf.com> Signed-off-by: Martin Storsjö <martin@martin.st>	2 years ago
Rémi Denis-Courmont	b52034270a	lavc/vorbisdsp: use ptrdiff_t rather than intptr_t ... for a difference between pointers.	2 years ago
Andreas Rheinhardt	a54e53a1c4	avcodec/vp8dsp: Constify src in vp8_mc_func Reviewed-by: Peter Ross <pross@xvid.org> Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2 years ago
Hubert Mazur	06b98e396a	lavc/aarch64: Provide neon implementation of nsse16 Add vectorized implementation of nsse16 function. Performance comparison tests are shown below. - nsse_0_c: 682.2 - nsse_0_neon: 116.5 Benchmarks and tests run with checkasm tool on AWS Graviton 3. Co-authored-by: Martin Storsjö <martin@martin.st> Signed-off-by: Hubert Mazur <hum@semihalf.com> Signed-off-by: Martin Storsjö <martin@martin.st>	2 years ago
Hubert Mazur	908abe8032	lavc/aarch64: Add neon implementation for vsse_intra16 Provide optimized implementation for vsse_intra16 for arm64. Performance tests are shown below. - vsse_4_c: 155.2 - vsse_4_neon: 36.2 Benchmarks and tests are run with checkasm tool on AWS Graviton 3. Signed-off-by: Hubert Mazur <hum@semihalf.com> Signed-off-by: Martin Storsjö <martin@martin.st>	2 years ago
Hubert Mazur	ce03ea3e79	lavc/aarch64: Add neon implementation for vsad_intra16 Provide optimized implementation for vsad_intra16 function for arm64. Performance comparison tests are shown below. - vsad_4_c: 177.5 - vsad_4_neon: 23.5 Benchmarks and tests are run with checkasm tool on AWS Graviton 3. Signed-off-by: Hubert Mazur <hum@semihalf.com> Signed-off-by: Martin Storsjö <martin@martin.st>	2 years ago
Hubert Mazur	c495a4b32d	lavc/aarch64: Add neon implementation of vsse16 Provide optimized implementation of vsse16 for arm64. Performance comparison tests are shown below. - vsse_0_c: 257.7 - vsse_0_neon: 59.2 Benchmarks and tests are run with checkasm tool on AWS Graviton 3. Signed-off-by: Hubert Mazur <hum@semihalf.com> Signed-off-by: Martin Storsjö <martin@martin.st>	2 years ago
Hubert Mazur	200f5e578f	lavc/aarch64: Add neon implementation for vsad16 Provide optimized implementation of vsad16 function for arm64. Performance comparison tests are shown below. - vsad_0_c: 285.2 - vsad_0_neon: 39.5 Benchmarks and tests are run with checkasm tool on AWS Graviton 3. Co-authored-by: Martin Storsjö <martin@martin.st> Signed-off-by: Hubert Mazur <hum@semihalf.com> Signed-off-by: Martin Storsjö <martin@martin.st>	2 years ago
Lynne	f99d15cca0	arm/fft: disable NEON optimizations for 131072pt transforms This has been broken since the start, and it was only discovered when I started testing my replacement for the FFT. Disable it, since there's no point in fixing slower code that's about to be removed anyway. The vfp version is not affected.	2 years ago
J. Dekker	ce2f47318b	lavc/aarch64: hevc_add_res add 12bit variants hevc_add_res_4x4_12_c: 46.0 hevc_add_res_4x4_12_neon: 18.7 hevc_add_res_8x8_12_c: 194.7 hevc_add_res_8x8_12_neon: 25.2 hevc_add_res_16x16_12_c: 716.0 hevc_add_res_16x16_12_neon: 69.7 hevc_add_res_32x32_12_c: 3820.7 hevc_add_res_32x32_12_neon: 261.0 Signed-off-by: J. Dekker <jdek@itanimul.li>	2 years ago
Martin Storsjö	48be6616d0	aarch64: me_cmp: Remove a leftover unnecessary instruction This was missed in `a2e45ad407`. Signed-off-by: Martin Storsjö <martin@martin.st>	2 years ago
Hubert Mazur	70efa4d011	lavc/aarch64: Add neon implementation for pix_abs8 Provide optimized implementation of pix_abs8 function for arm64. Performance comparison tests are shown below. - pix_abs_1_0_c: 101.2 - pix_abs_1_0_neon: 22.5 - sad_1_c: 101.2 - sad_1_neon: 22.5 Benchmarks and tests are run with checkasm tool on AWS Graviton 3. Signed-off-by: Martin Storsjö <martin@martin.st>	2 years ago
Hubert Mazur	74312e80d7	lavc/aarch64: Add neon implementation for sse8 Provide optimized implementation of sse8 function for arm64. Performance comparison tests are shown below. - sse_1_c: 130.7 - sse_1_neon: 29.7 Benchmarks and tests run with checkasm tool on AWS Graviton 3. Signed-off-by: Hubert Mazur <hum@semihalf.com> Signed-off-by: Martin Storsjö <martin@martin.st>	2 years ago
Hubert Mazur	a2e45ad407	lavc/aarch64: Add neon implementation for pix_abs16_y2 Provide optimized implementation of pix_abs16_y2 function for arm64. Performance comparison tests are shown below. pix_abs_0_2_c: 317.2 pix_abs_0_2_neon: 37.5 Benchmarks and tests run with checkasm tool on AWS Graviton 3. Signed-off-by: Hubert Mazur <hum@semihalf.com> Signed-off-by: Martin Storsjö <martin@martin.st>	2 years ago
Hubert Mazur	d7abb7d143	lavc/aarch64: Add neon implementation for sse4 Provide neon implementation for sse4 function. Performance comparison tests are shown below. - sse_2_c: 80.7 - sse_2_neon: 31.0 Benchmarks and tests are run with checkasm tool on AWS Graviton 3. Signed-off-by: Hubert Mazur <hum@semihalf.com> Signed-off-by: Martin Storsjö <martin@martin.st>	2 years ago
Hubert Mazur	ad251fd262	lavc/aarch64: Add neon implementation for sse16 Provide neon implementation for sse16 function. Performance comparison tests are shown below. - sse_0_c: 268.2 - sse_0_neon: 43.5 Benchmarks and tests run with checkasm tool on AWS Graviton 3. Signed-off-by: Hubert Mazur <hum@semihalf.com> Signed-off-by: Martin Storsjö <martin@martin.st>	2 years ago

349 Commits (378f1b6a393e7bf0ceb50a9454e3664a599d84d1)