FFmpeg

Commit Graph

Author	SHA1	Message	Date
James Almer	c16e99e3b3	x86: check for AV_CPU_FLAG_AVXSLOW where useful Signed-off-by: James Almer <jamrial@gmail.com> Signed-off-by: Michael Niedermayer <michaelni@gmx.at>	10 years ago
Michael Niedermayer	cc77bb09e4	avcodec/x86/vp9dsp_init: Fix mix of declaration and statement Reviewed-by: "Ronald S. Bultje" <rsbultje@gmail.com> Signed-off-by: Michael Niedermayer <michaelni@gmx.at>	10 years ago
Ronald S. Bultje	b224b165cb	vp9: add keyframe profile 2/3 support.	10 years ago
Ronald S. Bultje	afd8c464b7	vp9/x86: make filter_16_h work on 32-bit.	10 years ago
Ronald S. Bultje	b26bc3520f	vp9/x86: make filter_48/84/88_h work on 32-bit.	10 years ago
Ronald S. Bultje	8a1cff1c35	vp9/x86: make filter_44_h work on 32-bit.	10 years ago
Ronald S. Bultje	047088b8c6	vp9/x86: make filter_16_v work on 32-bit.	10 years ago
Ronald S. Bultje	0cc9c23ea1	vp9/x86: make filter_48/84_v work on 32-bit.	10 years ago
Ronald S. Bultje	6433a9133f	vp9/x86: make filter_88_v work on 32-bit.	10 years ago
Ronald S. Bultje	75f8e52089	vp9/x86: make filter_44_v work on 32-bit.	10 years ago
James Almer	32c836cb11	x86/vp9: remove duplicate function prototypes Fixes "redundant redeclaration" warnings. Signed-off-by: James Almer <jamrial@gmail.com>	10 years ago
Ronald S. Bultje	bdc1e3e3b2	vp9/x86: intra prediction sse2/32bit support. Signed-off-by: Michael Niedermayer <michaelni@gmx.at>	10 years ago
Ronald S. Bultje	cae893f692	vp9/x86: sse2 MC assembly. Also a slight change to the ssse3 code, which prevents a theoretical overflow in the sharp filter. Signed-off-by: Michael Niedermayer <michaelni@gmx.at>	10 years ago
Ronald S. Bultje	fd77fbb390	vp9/x86: 32bit and sse2 support for vp9 inverse transform assembly Signed-off-by: Michael Niedermayer <michaelni@gmx.at>	10 years ago
James Almer	6b2caa321f	x86/vp9: add AVX and AVX2 MC Roughly 25% faster MC than ssse3 for blocksizes 32 and 64. Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com> Signed-off-by: James Almer <jamrial@gmail.com>	10 years ago
James Almer	fc8db12a73	x86/vp9: inital AVX2 intra_pred tos3k-vp9-b10000.webm on a Core i5-4200U @1.6GHz 1219 decicycles in ff_vp9_ipred_dc_32x32_ssse3, 131070 runs, 2 skips 439 decicycles in ff_vp9_ipred_dc_32x32_avx2, 131070 runs, 2 skips 3570 decicycles in ff_vp9_ipred_dc_top_32x32_ssse3, 4096 runs, 0 skips 2494 decicycles in ff_vp9_ipred_dc_top_32x32_avx2, 4096 runs, 0 skips 1419 decicycles in ff_vp9_ipred_dc_left_32x32_ssse3, 16384 runs, 0 skips 717 decicycles in ff_vp9_ipred_dc_left_32x32_avx2, 16384 runs, 0 skips 2737 decicycles in ff_vp9_ipred_tm_32x32_avx, 1024 runs, 0 skips 2088 decicycles in ff_vp9_ipred_tm_32x32_avx2, 1024 runs, 0 skips 3090 decicycles in ff_vp9_ipred_v_32x32_avx, 512 runs, 0 skips 2226 decicycles in ff_vp9_ipred_v_32x32_avx2, 512 runs, 0 skips 1565 decicycles in ff_vp9_ipred_h_32x32_avx, 1024 runs, 0 skips 922 decicycles in ff_vp9_ipred_h_32x32_avx2, 1024 runs, 0 skips Signed-off-by: James Almer <jamrial@gmail.com> Signed-off-by: Michael Niedermayer <michaelni@gmx.at>	11 years ago
Clément Bœsch	c4148a6668	x86/vp9mc: add vp9 namespace.	11 years ago
Ronald S. Bultje	fdb093c4e4	vp9/x86: intra prediction SIMD. Partially based on h264_intrapred. (I hope to eventually merge these two intrapred implementations back together.)	11 years ago
Clément Bœsch	91d85bb167	x86/vp9lpf: add ff_vp9_loop_filter_[vh]_44_16_{sse2,ssse3,avx}.	11 years ago
Clément Bœsch	c5dd73b890	x86/vp9lpf: add ff_vp9_loop_filter_h_{48,84}_16_{sse2,ssse3,avx}(). 5.40s → 5.30s overall decode time with -threads 1 on ped1080p.webm (i7 920, ssse3)	11 years ago
James Almer	644c32ea4b	x86/vp9lpf: add ff_vp9_loop_filter_[vh]_88_16_sse2() Similar gains as the ssse3 version once again Signed-off-by: James Almer <jamrial@gmail.com>	11 years ago
Clément Bœsch	222c46c531	x86/vp9lpf: add ff_vp9_loop_filter_[vh]_88_16_{ssse3,avx}. 9680 decicycles in loop_filter_v_88_16_c, 4193765 runs, 539 skips 9233 decicycles in loop_filter_h_88_16_c, 4193751 runs, 553 skips 1929 decicycles in ff_vp9_loop_filter_v_88_16_ssse3, 4194118 runs, 186 skips 2738 decicycles in ff_vp9_loop_filter_h_88_16_ssse3, 4193861 runs, 443 skips 5.978 → 5.417 overall decode time on ped1080p.webm (-threads 1) Adding SSE2 support should be relatively trivial (just a matter of changing the pshufb [mask_mix] with something else), patch welcome.	11 years ago
Ronald S. Bultje	97474d527f	vp9/x86: iwht4x4 (lossless) mmx.	11 years ago
Ronald S. Bultje	d43efa68bd	vp9/x86: 4x4 iadst SIMD (ssse3) variants. Cycle measurements for intra itxfm_4x4_add on ped1080p.webm: idct_idct: 66 -> 67 cycles (noise measurement) idct_iadst: 199 -> 79 cycles iadst_idct: 165 -> 70 cycles iadst_iadst: 183 -> 82 cycles	11 years ago
Ronald S. Bultje	baf47020cd	vp9/x86: 8x8 iadst SIMD (ssse3/avx) variants. Cycle measurements for intra itxfm_8x8_add on ped1080p.webm: idct_idct: 133 -> 135 cycles (noise measurement) idct_iadst: 900 -> 241 cycles iadst_idct: 864 -> 215 cycles iadst_iadst: 973 -> 310 cycles	11 years ago
James Almer	26800e3864	vp9/x86: rename ff_avg[48]_sse to ff_avg[48]_mmxext pavgb is an sse integer instruction, so the mmxext flag is enough Signed-off-by: James Almer <jamrial@gmail.com> Reviewed-by: "Ronald S. Bultje" <rsbultje@gmail.com> Signed-off-by: Michael Niedermayer <michaelni@gmx.at>	11 years ago
James Almer	d2a7314f1e	vp9/x86: add ff_vp9_loop_filter_[vh]_16_16_sse2(). Similar gains in performance as the SSSE3 version Signed-off-by: James Almer <jamrial@gmail.com>	11 years ago
Ronald S. Bultje	8173d1ffc0	vp9/x86: 16x16 iadst_idct, idct_iadst and iadst_iadst (ssse3+avx). Sample timings on ped1080p.webm (of the ssse3 functions): iadst_idct: 4672 -> 1175 cycles idct_iadst: 4736 -> 1263 cycles iadst_iadst: 4924 -> 1438 cycles Total decoding time changed from 6.565s to 6.413s.	11 years ago
Clément Bœsch	8b4190da93	vp9/x86: add AVX for itxfm and lpf. 4412 decicycles in ff_vp9_loop_filter_h_16_16_ssse3, 4193462 runs, 842 skips 3600 decicycles in ff_vp9_loop_filter_h_16_16_avx, 4193621 runs, 683 skips 3010 decicycles in ff_vp9_loop_filter_v_16_16_ssse3, 4193528 runs, 776 skips 2678 decicycles in ff_vp9_loop_filter_v_16_16_avx, 4193742 runs, 562 skips 23025 decicycles in ff_vp9_idct_idct_32x32_add_ssse3, 2096871 runs, 281 skips 19943 decicycles in ff_vp9_idct_idct_32x32_add_avx, 2096815 runs, 337 skips 4675 decicycles in ff_vp9_idct_idct_16x16_add_ssse3, 4194018 runs, 286 skips 3980 decicycles in ff_vp9_idct_idct_16x16_add_avx, 4194022 runs, 282 skips 967 decicycles in ff_vp9_idct_idct_8x8_add_ssse3, 16776972 runs, 244 skips 887 decicycles in ff_vp9_idct_idct_8x8_add_avx, 16777002 runs, 214 skips	11 years ago
Clément Bœsch	af68bd1c06	vp9/x86: add ff_vp9_loop_filter_[vh]_16_16_ssse3(). 16662 decicycles in loop_filter_h_16_16_c, 8387355 runs, 1253 skips 17510 decicycles in loop_filter_v_16_16_c, 8387516 runs, 1092 skips 4941 decicycles in ff_vp9_loop_filter_h_16_16_ssse3, 8387887 runs, 721 skips 3899 decicycles in ff_vp9_loop_filter_v_16_16_ssse3, 8387980 runs, 628 skips Overall decode time goes from: ./ffmpeg -v 0 -nostats -threads 1 -i ~/samples/vp9/ped1080p.webm -f null - 8.10s user 0.02s system 99% cpu 8.126 total to: ./ffmpeg -v 0 -nostats -threads 1 -i ~/samples/vp9/ped1080p.webm -f null - 6.15s user 0.04s system 99% cpu 6.199 total (46 to 61 fps)	11 years ago
Diego Biurrun	b0be1ae792	x86: avcodec: Add a bunch of missing #includes for av_cold	11 years ago
Ronald S. Bultje	e84d14df10	vp9/x86: idct_32x32_add_ssse3. Sub-IDCTs will follow later. ped1080.webm goes from 9.295s to 8.191s (13.5% faster). The IDCT itself goes from 4372 (intra) or 4337 (inter) to 403 (intra) or 329 (inter) cycles for the DC-only form, 23755 (intra) or 23723 (inter) to 3497 (intra) or 3607 (inter) cycles for the no-DC form, which averages from 23393 (intra) or 16612 (inter) to 3449 (intra) or 2392 (inter) for all 32x32s together, i.e. about ~7x faster (all tests done on ped1080p.webm).	11 years ago
Ronald S. Bultje	18175baa54	vp9/x86: 16px MC functions (64bit only). Cycle counts for large MCs (old -> new on ped1080p.webm, mx!=0&&my!=0): 16x8: 876 -> 870 (0.7%) 16x16: 1444 -> 1435 (0.7%) 16x32: 2784 -> 2748 (1.3%) 32x16: 2455 -> 2349 (4.5%) 32x32: 4641 -> 4084 (13.6%) 32x64: 9200 -> 7834 (17.4%) 64x32: 8980 -> 7197 (24.8%) 64x64: 17330 -> 13796 (25.6%) Total decoding time goes from 9.326sec to 9.182sec.	11 years ago
Ronald S. Bultje	8d4c616fc0	vp9/x86: idct_add_16x16_ssse3. Currently only dc-only and full 16x16. Other subforms will follow in the near future. Total decoding time of ped1080p.webm goes from 9.7 to 9.3 seconds. DC-only goes from 957 -> 131 cycles, and the full IDCT goes from ~4050 to ~745 cycles.	11 years ago
Clément Bœsch	d28c79b003	avcodec/x86/vp9dsp: use EXTERNAL_* macros. Original fix by one of these developers: Anton Khirnov <anton@khirnov.net> Diego Biurrun <diego@biurrun.de> Luca Barbato <lu_zero@gentoo.org> Martin Storsjö <martin@martin.st> See `97962b2` / `72ca830` Personnal guess is Diego Biurrun.	11 years ago
Ronald S. Bultje	72ca830f51	lavc: VP9 decoder Originally written by Ronald S. Bultje <rsbultje@gmail.com> and Clément Bœsch <u@pkh.me> Further contributions by: Anton Khirnov <anton@khirnov.net> Diego Biurrun <diego@biurrun.de> Luca Barbato <lu_zero@gentoo.org> Martin Storsjö <martin@martin.st> Signed-off-by: Luca Barbato <lu_zero@gentoo.org> Signed-off-by: Anton Khirnov <anton@khirnov.net>	11 years ago
Clément Bœsch	87434cf373	avcodec/vp9: add ff_vp9_idct_idct_{4x4,8x8}_ssse3(). 1789 decicycles in idct_idct_4x4_add_c, 262136 runs, 8 skips 1839 decicycles in idct_idct_4x4_add_c, 524270 runs, 18 skips 1864 decicycles in idct_idct_4x4_add_c, 1048548 runs, 28 skips 529 decicycles in ff_vp9_idct_idct_4x4_add_ssse3, 262138 runs, 6 skips 516 decicycles in ff_vp9_idct_idct_4x4_add_ssse3, 524282 runs, 6 skips 474 decicycles in ff_vp9_idct_idct_4x4_add_ssse3, 1048565 runs, 11 skips (~3.9x faster) 7726 decicycles in idct_idct_8x8_add_c, 1048433 runs, 143 skips 7732 decicycles in idct_idct_8x8_add_c, 2096882 runs, 270 skips 7731 decicycles in idct_idct_8x8_add_c, 4193772 runs, 532 skips 1145 decicycles in ff_vp9_idct_idct_8x8_add_ssse3, 1048549 runs, 27 skips 1137 decicycles in ff_vp9_idct_idct_8x8_add_ssse3, 2097097 runs, 55 skips 1086 decicycles in ff_vp9_idct_idct_8x8_add_ssse3, 4194188 runs, 116 skips (~7.1x faster) Overall decode time before commit: 16.48s user 0.03s system 99% cpu 16.526 total 16.54s user 0.01s system 99% cpu 16.566 total 16.46s user 0.03s system 99% cpu 16.511 total Overall decode time after commit: 16.34s user 0.02s system 99% cpu 16.378 total 16.28s user 0.02s system 99% cpu 16.315 total 16.32s user 0.03s system 99% cpu 16.366 total Tested on i7 920 with 40s 1080p footage.	11 years ago
Ronald S. Bultje	f1548c008f	Full-pixel MC functions. Decoding time of ped1080p.webm goes from 11.3sec to 11.1sec.	11 years ago
Ronald S. Bultje	c07ac8d467	VP9 MC (ssse3) optimizations. Decoding time of ped1080p.webm goes from 20.7sec to 11.3sec.	11 years ago

40 Commits (58ed7b632842f3fedbe737c3945cabc56bab2f47)