FFmpeg

Commit Graph

Author	SHA1	Message	Date
Chip Kerchner	3a557c5d88	lsws/ppc/yuv2rgb_altivec: Replace vec_lvsl/vec_perm with vec_xl gcc 6.x and 7.x generate wrong code for little endian machines for the vec_lvsl/vec_perm instruction combos in some cases. The bug was fixed in version 8.x. If these instructions are replaced with vec_xl, the problem goes away for all versions of the compilers. Fixes ticket #7124.	5 years ago
Philip Langdale	cd48318035	swscale: Add support for NV24 and NV42 The implementation is pretty straight-forward. Most of the existing NV12 codepaths work regardless of subsampling and are re-used as is. Where necessary I wrote the slightly different NV24 versions. Finally, the one thing that confused me for a long time was the asm specific x86 path that did an explicit exclusion check for NV12. I replaced that with a semi-planar check and also updated the equivalent PPC code, which Lauri kindly checked.	6 years ago
Lauri Kasanen	e25bddf5fc	swscale/ppc: Shorten power8 tests via a var	6 years ago
Lauri Kasanen	a2a16206aa	swscale/ppc: VSX-optimize hScale16To* ./ffmpeg -loop 1 -s 1200x1440 -i tux16.png \ -s 2400x720 -f rawvideo -y -vframes 5 -pix_fmt yuv420p16le -nostats test.raw ./ffmpeg -loop 1 -s 1200x1440 -i tux16.png \ -s 2400x720 -f rawvideo -y -vframes 5 -pix_fmt yuv420p -nostats test.raw 32-bit mul, power8 only 2x speedup for hScale8To19_vsx (x86 SSE2 is 2.37): 30896 UNITS in hscale, 8192 runs, 0 skips 63956 UNITS in hscale, 8192 runs, 0 skips 2.06 for hScale16To15_vsx: 30531 UNITS in hscale, 8192 runs, 0 skips 63161 UNITS in hscale, 8192 runs, 0 skips	6 years ago
Lauri Kasanen	3437111f17	swscale/ppc: Indent	6 years ago
Lauri Kasanen	9456adc223	swscale/ppc: VSX-optimize hScale8To19 ./ffmpeg -f lavfi -i yuvtestsrc=duration=1:size=1200x1440 \ -s 2400x720 -f rawvideo -y -vframes 5 -pix_fmt yuv420p16le -nostats test.raw 2.26 speedup (x86 SSE2 is 2.32): 23772 UNITS in hscale, 4096 runs, 0 skips 53862 UNITS in hscale, 4096 runs, 0 skips	6 years ago
Lauri Kasanen	d0e4d0429e	swscale/ppc: VSX-optimize hscale_fast ./ffmpeg -f lavfi -i yuvtestsrc=duration=1:size=1200x1440 -sws_flags fast_bilinear \ -s 2400x720 -f rawvideo -vframes 5 -pix_fmt abgr -nostats test.raw 4.27 speedup for hyscale_fast: 24796 UNITS in hyscale_fast, 4096 runs, 0 skips 5797 UNITS in hyscale_fast, 4096 runs, 0 skips 4.48 speedup for hcscale_fast: 19911 UNITS in hcscale_fast, 4095 runs, 1 skips 4437 UNITS in hcscale_fast, 4096 runs, 0 skips	6 years ago
Lauri Kasanen	ce92ee4b4f	swscale/ppc: VSX-optimize non-full-chroma yuv2rgb_2 ./ffmpeg -f lavfi -i yuvtestsrc=duration=1:size=1200x1440 -sws_flags fast_bilinear \ -s 1200x720 -f null -vframes 100 -pix_fmt $i -nostats \ -cpuflags 0 -v error - 32-bit mul, power8 only. ~2x speedup: rgb24 24431 UNITS in yuv2packed2, 16384 runs, 0 skips 13783 UNITS in yuv2packed2, 16383 runs, 1 skips bgr24 24396 UNITS in yuv2packed2, 16384 runs, 0 skips 14059 UNITS in yuv2packed2, 16384 runs, 0 skips rgba 26815 UNITS in yuv2packed2, 16383 runs, 1 skips 12797 UNITS in yuv2packed2, 16383 runs, 1 skips bgra 27060 UNITS in yuv2packed2, 16384 runs, 0 skips 13138 UNITS in yuv2packed2, 16384 runs, 0 skips argb 26998 UNITS in yuv2packed2, 16384 runs, 0 skips 12728 UNITS in yuv2packed2, 16381 runs, 3 skips bgra 26651 UNITS in yuv2packed2, 16384 runs, 0 skips 13124 UNITS in yuv2packed2, 16384 runs, 0 skips This is a low speedup, but the x86 mmx version also gets only ~2x. The mmx version is also heavily inaccurate, while the vsx version has high accuracy.	6 years ago
Lauri Kasanen	8607e29fa3	swscale/ppc: VSX-optimize yuv2rgb_full_X ./ffmpeg -f lavfi -i yuvtestsrc=duration=1:size=1200x1440 \ -s 1200x720 -f null -vframes 100 -pix_fmt $i -nostats \ -cpuflags 0 -v error - 32-bit mul, power8 only. ~6.4x speedup: rgb24 214278 UNITS in yuv2packedX, 16384 runs, 0 skips 33249 UNITS in yuv2packedX, 16384 runs, 0 skips bgr24 214616 UNITS in yuv2packedX, 16384 runs, 0 skips 33233 UNITS in yuv2packedX, 16384 runs, 0 skips rgba 214517 UNITS in yuv2packedX, 16384 runs, 0 skips 33271 UNITS in yuv2packedX, 16384 runs, 0 skips bgra 214973 UNITS in yuv2packedX, 16384 runs, 0 skips 33397 UNITS in yuv2packedX, 16384 runs, 0 skips argb 214613 UNITS in yuv2packedX, 16384 runs, 0 skips 33310 UNITS in yuv2packedX, 16384 runs, 0 skips bgra 214637 UNITS in yuv2packedX, 16384 runs, 0 skips 33330 UNITS in yuv2packedX, 16384 runs, 0 skips	6 years ago
Lauri Kasanen	3256e949be	swscale/ppc: VSX-optimize yuv2rgb_full_2 ./ffmpeg -f lavfi -i yuvtestsrc=duration=1:size=1200x1440 -sws_flags area \ -s 1200x720 -f null -vframes 100 -pix_fmt $i -nostats \ -cpuflags 0 -v error - 32-bit mul, power8 only. ~4x speedup: rgb24 52763 UNITS in yuv2packed2, 16384 runs, 0 skips 13453 UNITS in yuv2packed2, 16384 runs, 0 skips bgr24 53144 UNITS in yuv2packed2, 16384 runs, 0 skips 13616 UNITS in yuv2packed2, 16384 runs, 0 skips rgba 52796 UNITS in yuv2packed2, 16384 runs, 0 skips 12904 UNITS in yuv2packed2, 16384 runs, 0 skips bgra 52732 UNITS in yuv2packed2, 16384 runs, 0 skips 13262 UNITS in yuv2packed2, 16384 runs, 0 skips argb 52661 UNITS in yuv2packed2, 16384 runs, 0 skips 12879 UNITS in yuv2packed2, 16384 runs, 0 skips bgra 52662 UNITS in yuv2packed2, 16384 runs, 0 skips 12932 UNITS in yuv2packed2, 16384 runs, 0 skips	6 years ago
Lauri Kasanen	50e672bc54	swscale/ppc: VSX-optimize non-full-chroma yuv2rgb_1 ./ffmpeg -f lavfi -i yuvtestsrc=duration=1:size=1200x1440 -sws_flags fast_bilinear \ -s 1200x1440 -f null -vframes 100 -pix_fmt $i -nostats \ -cpuflags 0 -v error - 32-bit mul, power8 only. 1.8-2.3x speedup: rgb24 18192 UNITS in yuv2packed1, 32767 runs, 1 skips 9983 UNITS in yuv2packed1, 32760 runs, 8 skips bgr24 18665 UNITS in yuv2packed1, 32766 runs, 2 skips 9925 UNITS in yuv2packed1, 32763 runs, 5 skips rgba 20239 UNITS in yuv2packed1, 32767 runs, 1 skips 8794 UNITS in yuv2packed1, 32759 runs, 9 skips bgra 20354 UNITS in yuv2packed1, 32768 runs, 0 skips 8770 UNITS in yuv2packed1, 32761 runs, 7 skips argb 20185 UNITS in yuv2packed1, 32768 runs, 0 skips 8761 UNITS in yuv2packed1, 32761 runs, 7 skips bgra 20360 UNITS in yuv2packed1, 32766 runs, 2 skips 8759 UNITS in yuv2packed1, 32764 runs, 4 skips This is a low speedup, but the x86 mmx version also gets only ~2x. The mmx version is also heavily inaccurate, while the vsx version has high accuracy.	6 years ago
Lauri Kasanen	7adce3e64c	swscale/ppc: VSX-optimize yuv2422_X ./ffmpeg -f lavfi -i yuvtestsrc=duration=1:size=1200x1440 \ -s 1200x720 -f null -vframes 100 -pix_fmt $i -nostats \ -cpuflags 0 -v error - 7.2x speedup: yuyv422 126354 UNITS in yuv2packedX, 16384 runs, 0 skips 16383 UNITS in yuv2packedX, 16382 runs, 2 skips yvyu422 117669 UNITS in yuv2packedX, 16384 runs, 0 skips 16271 UNITS in yuv2packedX, 16379 runs, 5 skips uyvy422 117310 UNITS in yuv2packedX, 16384 runs, 0 skips 16226 UNITS in yuv2packedX, 16382 runs, 2 skips	6 years ago
Lauri Kasanen	9a2db4dc61	swscale/ppc: VSX-optimize yuv2422_2 ./ffmpeg -f lavfi -i yuvtestsrc=duration=1:size=1200x1440 -sws_flags area \ -s 1200x720 -f null -vframes 100 -pix_fmt $i -nostats \ -cpuflags 0 -v error - 5.1x speedup: yuyv422 19339 UNITS in yuv2packed2, 16384 runs, 0 skips 3718 UNITS in yuv2packed2, 16383 runs, 1 skips yvyu422 19438 UNITS in yuv2packed2, 16384 runs, 0 skips 3800 UNITS in yuv2packed2, 16380 runs, 4 skips uyvy422 19128 UNITS in yuv2packed2, 16384 runs, 0 skips 3721 UNITS in yuv2packed2, 16380 runs, 4 skips	6 years ago
Lauri Kasanen	a6a31ca3d9	swscale/ppc: VSX-optimize yuv2422_1 ./ffmpeg -f lavfi -i yuvtestsrc=duration=1:size=1200x1440 \ -s 1200x1440 -f null -vframes 100 -pix_fmt $i -nostats \ -cpuflags 0 -v error - 15.3x speedup: yuyv422 14513 UNITS in yuv2packed1, 32768 runs, 0 skips 949 UNITS in yuv2packed1, 32767 runs, 1 skips yvyu422 14516 UNITS in yuv2packed1, 32767 runs, 1 skips 943 UNITS in yuv2packed1, 32767 runs, 1 skips uyvy422 14530 UNITS in yuv2packed1, 32767 runs, 1 skips 941 UNITS in yuv2packed1, 32766 runs, 2 skips	6 years ago
Lauri Kasanen	681957b88d	swscale/ppc: VSX-optimize yuv2rgb_full ./ffmpeg -f lavfi -i yuvtestsrc=duration=1:size=1200x1440 \ -s 1200x1440 -f null -vframes 100 -pix_fmt $i -nostats \ -cpuflags 0 -v error - This uses 32-bit mul, so POWER8 only. The following output formats get about 4.5x speedup: rgb24 39980 UNITS in yuv2packed1, 32768 runs, 0 skips 8774 UNITS in yuv2packed1, 32768 runs, 0 skips bgr24 40069 UNITS in yuv2packed1, 32768 runs, 0 skips 8772 UNITS in yuv2packed1, 32766 runs, 2 skips rgba 39759 UNITS in yuv2packed1, 32768 runs, 0 skips 8681 UNITS in yuv2packed1, 32767 runs, 1 skips bgra 39729 UNITS in yuv2packed1, 32768 runs, 0 skips 8696 UNITS in yuv2packed1, 32766 runs, 2 skips argb 39766 UNITS in yuv2packed1, 32768 runs, 0 skips 8672 UNITS in yuv2packed1, 32766 runs, 2 skips bgra 39784 UNITS in yuv2packed1, 32768 runs, 0 skips 8659 UNITS in yuv2packed1, 32767 runs, 1 skips	6 years ago
Lauri Kasanen	6b5ea90eac	swscale/ppc: Add av_unused to template vars only used in one includer	6 years ago
Lauri Kasanen	ac3062f1a4	swscale/ppc: Clean up some mixed decl warnings	6 years ago
Lauri Kasanen	8522d219ce	libswscale/ppc: VSX-optimize 9-16 bit yuv2planeX ./ffmpeg_g -f rawvideo -pix_fmt rgb24 -s hd1080 -i /dev/zero -pix_fmt yuv420p16be \ -s 1920x1728 -f null -vframes 100 -v error -nostats - 9-14 bit funcs get about 6x speedup, 16-bit gets about 15x. Fate passes, each format tested with an image to video conversion. Only POWER8 includes 32-bit vector multiplies, so POWER7 is locked out of the 16-bit function. This includes the vec_mulo/mule functions too, not just vmuluwm. With TIMER_REPORT skips disabled: yuv420p9le 12412 UNITS in planarX, 131072 runs, 0 skips 73136 UNITS in planarX, 131072 runs, 0 skips yuv420p9be 12481 UNITS in planarX, 131072 runs, 0 skips 73410 UNITS in planarX, 131072 runs, 0 skips yuv420p10le 12322 UNITS in planarX, 131072 runs, 0 skips 72546 UNITS in planarX, 131072 runs, 0 skips yuv420p10be 12291 UNITS in planarX, 131072 runs, 0 skips 72935 UNITS in planarX, 131072 runs, 0 skips yuv420p12le 12316 UNITS in planarX, 131072 runs, 0 skips 72708 UNITS in planarX, 131072 runs, 0 skips yuv420p12be 12319 UNITS in planarX, 131072 runs, 0 skips 72577 UNITS in planarX, 131072 runs, 0 skips yuv420p14le 12259 UNITS in planarX, 131072 runs, 0 skips 72516 UNITS in planarX, 131072 runs, 0 skips yuv420p14be 12440 UNITS in planarX, 131072 runs, 0 skips 72962 UNITS in planarX, 131072 runs, 0 skips yuv420p16le 10548 UNITS in planarX, 131072 runs, 0 skips 73429 UNITS in planarX, 131072 runs, 0 skips yuv420p16be 10634 UNITS in planarX, 131072 runs, 0 skips 150959 UNITS in planarX, 131072 runs, 0 skips Signed-off-by: Lauri Kasanen <cand@gmx.com>	6 years ago
Lauri Kasanen	8dd9df9ecd	swscale/output: Altivec-optimize float yuv2plane1 This function wouldn't benefit from VSX instructions, so I put it under altivec. ./ffmpeg_g -f rawvideo -pix_fmt rgb24 -s hd1080 -i /dev/zero -pix_fmt grayf32le \ -f null -vframes 100 -v error -nostats - 3743 UNITS in planar1, 65495 runs, 41 skips -cpuflags 0 23511 UNITS in planar1, 65530 runs, 6 skips grayf32be 4647 UNITS in planar1, 65449 runs, 87 skips -cpuflags 0 28608 UNITS in planar1, 65530 runs, 6 skips The native speedup is 6.28133, and the bswapping one 6.15623. Fate passes, each format tested with an image to video conversion. Signed-off-by: Lauri Kasanen <cand@gmx.com> Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>	6 years ago
Lauri Kasanen	b4c8c03b00	swscale/output: VSX-optimize 16-bit yuv2plane1 ./ffmpeg_g -f rawvideo -pix_fmt rgb24 -s hd1080 -i /dev/zero -pix_fmt yuv420p16le \ -f null -vframes 100 -v error -nostats - 2120 UNITS in planar1, 65393 runs, 143 skips -cpuflags 0 19157 UNITS in planar1, 65512 runs, 24 skips 9.03632 speedup, 16be similarly. Fate passes, each format tested with an image to video conversion. Signed-off-by: Lauri Kasanen <cand@gmx.com> Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>	6 years ago
Lauri Kasanen	1046cba24b	swscale/output: VSX-optimize nbps yuv2plane1 ./ffmpeg_g -f rawvideo -pix_fmt rgb24 -s hd1080 -i /dev/zero -pix_fmt yuv420p9le \ -f null -vframes 100 -v error -nostats - Speedups: yuv2plane1_9BE_vsx 11.2042 yuv2plane1_9LE_vsx 11.156 yuv2plane1_10BE_vsx 9.89428 yuv2plane1_10LE_vsx 10.3637 yuv2plane1_12BE_vsx 9.71923 yuv2plane1_12LE_vsx 11.0404 yuv2plane1_14BE_vsx 10.1763 yuv2plane1_14LE_vsx 11.2728 Fate passes, each format tested with an image to video conversion. Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>	6 years ago
Lauri Kasanen	78c7ff7d25	swscale/ppc: Move VSX-using code to its own file Passes fate on LE (with "lavc/jrevdct: Avoid an aliasing violation" applied). Signed-off-by: Lauri Kasanen <cand@gmx.com> Tested-by: Michael Kostylev on BE Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>	6 years ago
Lauri Kasanen	46c5693ea3	swscale/output: Altivec-optimize yuv2plane1_8 ./ffmpeg_g -f rawvideo -pix_fmt rgb24 -s hd1080 -i /dev/zero -pix_fmt yuv420p \ -f null -vframes 100 -v error -nostats - 1158 UNITS in planar1, 65528 runs, 8 skips -cpuflags 0 19082 UNITS in planar1, 65533 runs, 3 skips 16.48 speedup ratio. On x86, SSE2 is ~7. Curiously, the Power C version takes as many cycles as the x86 SSE2 version, yikes it's fast. Note that this function uses VSX instructions, but is not marked so. This is because several existing functions also make that mistake. I'll submit a patch moving them once this is reviewed. Signed-off-by: Lauri Kasanen <cand@gmx.com> Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>	6 years ago
Sergey Lavrushkin	582bc5a348	libswscale: Adds conversions from/to float gray format. Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>	6 years ago
Michael Niedermayer	d736b52a04	swscale: Drop is9_OR_10BPS() use, its name is not correct Found-by: Luca Barbato Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>	8 years ago
Michael Niedermayer	328ea6a9a5	swscale: Add input support for 12-bit formats Implemented for AV_PIX_FMT_GBRP12. Signed-off-by: Vittorio Giovara <vittorio.giovara@gmail.com>	8 years ago
Luca Barbato	2b5b1e1e9b	swscale: Rename is9_OR_10 to match what it does It is used to select functions that work with 9-15bits.	8 years ago
Ronald S. Bultje	70d418c7e6	Revert "PPC64: Add versions of functions in libswscale/input.c optimized for POWER8 VSX SIMD." This reverts commit `1df908f33f`. The expected performance improvements are essentially non-existent.	9 years ago
Dan Parrot	1df908f33f	PPC64: Add versions of functions in libswscale/input.c optimized for POWER8 VSX SIMD. This patch addresses Trac ticket #5570. The optimized functions are in file libswscale/ppc/input_vsx.c. Each optimized function name is a concatenation of the corresponding name in libswscale/input.c with suffix _vsx. Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>	9 years ago
Diego Biurrun	0f40c90984	Drop pointless assert.h #includes	9 years ago
Pedro Arthur	6de58b4903	swscale: cleanup unused code Removed previous swscale code under '#ifndef NEW_FILTER' and removed unused fields of SwsContext	9 years ago
Diego Biurrun	29c2d06d67	cosmetics: Drop empty comment lines	9 years ago
Rong Yan	2af180bf1b	swscale/ppc/yuv2rgb_altivec: POWER LE support in the macros vec_unh() and vec_unl() Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>	10 years ago
Luca Barbato	da60b99a88	ppc: Restrict some Altivec implementations to Big Endian In Little Endian the vec_ld/vec_st operations work as expected only for byte-vectors.	10 years ago
Rong Yan	603c839398	swscale/ppc/swscale_altivec.c: POWER LE support in yuv2planeX_8() Delete the macro GET_VF(); it was wrong. GCC had a bug in how it interpreted PPC intrinsics, which has been fixed in GCC 4.9.1. This bug led to errors in two of our previous patches. We found this when we updated our GCC tools to 4.9.1 and read the related info on the GCC website. We fix our previous errors in two separate commits. Signed-off-by: Michael Niedermayer <michaelni@gmx.at>	10 years ago
Christophe Gisquet	5d38c628b0	ppc: libswscale: use LOCAL_ALIGNED instead of DECLARE_ALIGNED The latter may yield incorrect code for on-stack variables. Signed-off-by: Michael Niedermayer <michaelni@gmx.at>	10 years ago
Rong Yan	e74e14608f	libswscale/ppc/swscale_altivec.c : fix hScale_altivec_real() yuv2planeX_16_altivec() yuv2planeX_8() for little endian add macros GET_LS() GET_VF() LOAD_FILTER() LOAD_L1() GET_VF4() FIRST_LOAD() UPDATE_PTR() LOAD_SRCV() LOAD_SRCV8() GET_VFD() for POWER LE Signed-off-by: Michael Niedermayer <michaelni@gmx.at>	10 years ago
Sean McGovern	01a82f1dc5	ppc: don't return a value from a function declared void Signed-off-by: Martin Storsjö <martin@martin.st>	11 years ago
Sean McGovern	f1f728cbe4	ppc: don't return a value from a function declared void Signed-off-by: Martin Storsjö <martin@martin.st>	11 years ago
Diego Biurrun	a6b6501185	ppc: cosmetics: Consistently format CPU flag detection invocations	11 years ago
Diego Biurrun	1909f6b1b6	swscale: cosmetics: Drop silly camelCase from swScale function pointer name	11 years ago
Diego Biurrun	4e0799a4d0	swscale: Add some missing av_cold to arch-specific init functions	11 years ago
Diego Biurrun	3aa682f253	swscale: consistent names for arch-specific acceleration functions	11 years ago
Diego Biurrun	c2503d9c8a	swscale: ppc: Hide arch-specific initialization details Also give consistent names to init functions.	11 years ago
Diego Biurrun	c011ceef78	swscale: ppc: Remove commented-out define cruft	12 years ago
Diego Biurrun	7f75f2f2bd	ppc: Drop unnecessary ff_ name prefixes from static functions	12 years ago
Diego Biurrun	511cf612ac	miscellaneous typo fixes	12 years ago
Anton Khirnov	716d413c13	Replace PIX_FMT_* -> AV_PIX_FMT_*, PixelFormat -> AVPixelFormat	12 years ago
Mans Rullgard	07eb7e20af	ppc: swscale: rework yuv2planeX_altivec() This gets rid of the variable-length scratch buffer by filtering 16 pixels at a time and writing directly to the destination. The extra loads this requires to load the source values are compensated by not doing a round-trip to memory before shifting. Signed-off-by: Mans Rullgard <mans@mansr.com>	12 years ago
Diego Biurrun	5a6e3c039c	swscale: Mark all init functions as av_cold	13 years ago

147 Commits (c62a1db0ac16a4e7c87f31ec077d5ff69a4f6760)