FFmpeg

Commit Graph

Author	SHA1	Message	Date
Ramiro Polla	87052c0933	swscale/x86: add sse4 and avx2 {lum,chr}ConvertRange16 chrRangeFromJpeg16_1920_c: 3153.9 chrRangeFromJpeg16_1920_sse4: 1770.0 (1.78x) chrRangeFromJpeg16_1920_avx2: 891.5 (3.54x) chrRangeToJpeg16_1920_c: 3165.0 chrRangeToJpeg16_1920_sse4: 1953.2 (1.62x) chrRangeToJpeg16_1920_avx2: 973.0 (3.25x) lumRangeFromJpeg16_1920_c: 1298.5 lumRangeFromJpeg16_1920_sse4: 886.5 (1.46x) lumRangeFromJpeg16_1920_avx2: 447.7 (2.90x) lumRangeToJpeg16_1920_c: 1905.0 lumRangeToJpeg16_1920_sse4: 993.0 (1.92x) lumRangeToJpeg16_1920_avx2: 498.9 (3.82x)	3 weeks ago
Ramiro Polla	be108ebcf4	swscale/x86/range_convert: update sse2 and avx2 range_convert functions to new API chrRangeFromJpeg8_1920_c: 2127.4 (1.00x) chrRangeFromJpeg8_1920_sse2: 816.0 (2.61x) 813.5 (2.62x) chrRangeFromJpeg8_1920_avx2: 408.9 (5.20x) 405.4 (5.25x) chrRangeToJpeg8_1920_c: 3166.9 (1.00x) chrRangeToJpeg8_1920_sse2: 815.0 (3.89x) 815.0 (3.89x) chrRangeToJpeg8_1920_avx2: 404.5 (7.83x) 405.5 (7.81x) lumRangeFromJpeg8_1920_c: 1263.0 (1.00x) lumRangeFromJpeg8_1920_sse2: 411.0 (3.07x) 413.2 (3.06x) lumRangeFromJpeg8_1920_avx2: 200.5 (6.30x) 201.9 (6.26x) lumRangeToJpeg8_1920_c: 1886.8 (1.00x) lumRangeToJpeg8_1920_sse2: 412.0 (4.58x) 408.9 (4.61x) lumRangeToJpeg8_1920_avx2: 208.5 (9.05x) 205.7 (9.17x)	3 weeks ago
Ramiro Polla	384fe39623	swscale/range_convert: fix mpeg ranges in yuv range conversion for non-8-bit pixel formats There is an issue with the constants used in YUV to YUV range conversion, where the upper bound is not respected when converting to mpeg range. With this commit, the constants are calculated at runtime, depending on the bit depth. This approach also allows us to more easily understand how the constants are derived. For bit depths <= 14, the number of fixed point bits has been set to 14 for all conversions, to simplify the code. For bit depths > 14, the number of fixed points bits has been raised and set to 18, to allow for the conversion to be accurate enough for the mpeg range to be respected. The convert functions now take the conversion constants (coeff and offset) as function arguments. For bit depths <= 14, coeff is unsigned 16-bit and offset is 32-bit. For bit depths > 14, coeff is unsigned 32-bit and offset is 64-bit. x86_64: chrRangeFromJpeg8_1920_c: 2127.4 2125.0 (1.00x) chrRangeFromJpeg16_1920_c: 2325.2 2127.2 (1.09x) chrRangeToJpeg8_1920_c: 3166.9 3168.7 (1.00x) chrRangeToJpeg16_1920_c: 2152.4 3164.8 (0.68x) lumRangeFromJpeg8_1920_c: 1263.0 1302.5 (0.97x) lumRangeFromJpeg16_1920_c: 1080.5 1299.2 (0.83x) lumRangeToJpeg8_1920_c: 1886.8 2112.2 (0.89x) lumRangeToJpeg16_1920_c: 1077.0 1906.5 (0.56x) aarch64 A55: chrRangeFromJpeg8_1920_c: 28835.2 28835.6 (1.00x) chrRangeFromJpeg16_1920_c: 28839.8 32680.8 (0.88x) chrRangeToJpeg8_1920_c: 23074.7 23075.4 (1.00x) chrRangeToJpeg16_1920_c: 17318.9 24996.0 (0.69x) lumRangeFromJpeg8_1920_c: 15389.7 15384.5 (1.00x) lumRangeFromJpeg16_1920_c: 15388.2 17306.7 (0.89x) lumRangeToJpeg8_1920_c: 19227.8 19226.6 (1.00x) lumRangeToJpeg16_1920_c: 15387.0 21146.3 (0.73x) aarch64 A76: chrRangeFromJpeg8_1920_c: 6324.4 6268.1 (1.01x) chrRangeFromJpeg16_1920_c: 6339.9 11521.5 (0.55x) chrRangeToJpeg8_1920_c: 9656.0 9612.8 (1.00x) chrRangeToJpeg16_1920_c: 6340.4 11651.8 (0.54x) lumRangeFromJpeg8_1920_c: 4422.0 4420.8 (1.00x) lumRangeFromJpeg16_1920_c: 4420.9 5762.0 (0.77x) lumRangeToJpeg8_1920_c: 5949.1 5977.5 (1.00x) lumRangeToJpeg16_1920_c: 4446.8 5946.2 (0.75x) NOTE: all simd optimizations for range_convert have been disabled. they will be re-enabled when they are fixed for each architecture. NOTE2: the same issue still exists in rgb2yuv conversions, which is not addressed in this commit.	3 weeks ago
Ramiro Polla	2d1358a84d	swscale/range_convert: saturate output instead of limiting input For bit depths <= 14, the result is saturated to 15 bits. For bit depths > 14, the result is saturated to 19 bits. x86_64: chrRangeFromJpeg8_1920_c: 2126.5 2127.4 (1.00x) chrRangeFromJpeg16_1920_c: 2331.4 2325.2 (1.00x) chrRangeToJpeg8_1920_c: 3163.0 3166.9 (1.00x) chrRangeToJpeg16_1920_c: 3163.7 2152.4 (1.47x) lumRangeFromJpeg8_1920_c: 1262.2 1263.0 (1.00x) lumRangeFromJpeg16_1920_c: 1079.5 1080.5 (1.00x) lumRangeToJpeg8_1920_c: 1860.5 1886.8 (0.99x) lumRangeToJpeg16_1920_c: 1910.2 1077.0 (1.77x) aarch64 A55: chrRangeFromJpeg8_1920_c: 28836.2 28835.2 (1.00x) chrRangeFromJpeg16_1920_c: 28840.1 28839.8 (1.00x) chrRangeToJpeg8_1920_c: 44196.2 23074.7 (1.92x) chrRangeToJpeg16_1920_c: 36527.3 17318.9 (2.11x) lumRangeFromJpeg8_1920_c: 15388.5 15389.7 (1.00x) lumRangeFromJpeg16_1920_c: 15389.3 15388.2 (1.00x) lumRangeToJpeg8_1920_c: 23069.7 19227.8 (1.20x) lumRangeToJpeg16_1920_c: 19227.8 15387.0 (1.25x) aarch64 A76: chrRangeFromJpeg8_1920_c: 6334.7 6324.4 (1.00x) chrRangeFromJpeg16_1920_c: 6336.0 6339.9 (1.00x) chrRangeToJpeg8_1920_c: 11474.5 9656.0 (1.19x) chrRangeToJpeg16_1920_c: 9640.5 6340.4 (1.52x) lumRangeFromJpeg8_1920_c: 4453.2 4422.0 (1.01x) lumRangeFromJpeg16_1920_c: 4414.2 4420.9 (1.00x) lumRangeToJpeg8_1920_c: 6645.0 5949.1 (1.12x) lumRangeToJpeg16_1920_c: 6005.2 4446.8 (1.35x) NOTE: all simd optimizations for range_convert have been disabled except for x86, which already had the same behaviour. they will be re-enabled when they are fixed for each architecture.	3 weeks ago
Niklas Haas	2a091d4f2e	swscale: introduce new, dynamic scaling API As part of a larger, ongoing effort to modernize and partially rewrite libswscale, it was decided and generally agreed upon to introduce a new public API for libswscale. This API is designed to be less stateful, more explicitly defined, and considerably easier to use than the existing one. Most of the API work has been already accomplished in the previous commits, this commit merely introduces the ability to use sws_scale_frame() dynamically, without prior sws_init_context() calls. Instead, the new API takes frame properties from the frames themselves, and the implementation is based on the new SwsGraph API, which we simply reinitialize as needed. This high-level wrapper also recreates the logic that used to live inside vf_scale for scaling interlaced frames, enabling it to be reused more easily by end users. Finally, this function is designed to simply copy refs directly when nothing needs to be done, substantially improving throughput of the noop fast path. Sponsored-by: Sovereign Tech Fund Signed-off-by: Niklas Haas <git@haasn.dev>	1 month ago
Niklas Haas	2d077f9acd	swscale/internal: group user-facing options together This is a preliminary step to separating these into a new struct. This commit contains no functional changes, it is a pure search-and-replace. Sponsored-by: Sovereign Tech Fund Signed-off-by: Niklas Haas <git@haasn.dev>	1 month ago
Niklas Haas	10d1be2621	swscale/internal: use static_assert for enforcing offsets Instead of sprinkling av_assert0 into random init functions. Sponsored-by: Sovereign Tech Fund Signed-off-by: Niklas Haas <git@haasn.dev>	1 month ago
James Almer	2eb9c35010	x86/swscale: disable AVX2 yuv2nv12cX functions if accurate_rnd is requested Signed-off-by: James Almer <jamrial@gmail.com>	2 months ago
James Almer	78ba06928a	swscale/x86/rgb2rgb: add optimized versions of the remaining shuffle_bytes functions Signed-off-by: James Almer <jamrial@gmail.com>	2 months ago
Ramiro Polla	8b30daedf7	swscale/range_convert: indent after previous commit	2 months ago
Ramiro Polla	f7ee0195df	swscale/range_convert: drop redundant conditionals from arch-specific init functions These conditions are already checked for in the main init function.	2 months ago
Ramiro Polla	7728b3357d	swscale/range_convert: call arch-specific init functions from main init function This commit also fixes the issue that the call to ff_sws_init_range_convert() from sws_init_swscale() was not setting up the arch-specific optimizations.	2 months ago
Niklas Haas	67adb30322	swscale: rename SwsContext to SwsInternal And preserve the public SwsContext as separate name. The motivation here is that I want to turn SwsContext into a public struct, while keeping the internal implementation hidden. Additionally, I also want to be able to use multiple internal implementations, e.g. for GPU devices. This commit does not include any functional changes. For the most part, it is a simple rename. The only complications arise from the public facing API functions, which preserve their current type (and hence require an additional unwrapping step internally), and the checkasm test framework, which directly accesses SwsInternal. For consistency, the affected functions that need to maintain a distionction have generally been changed to refer to the SwsContext as sws, and the SwsInternal as c. In an upcoming commit, I will provide a backing definition for the public SwsContext, and update `sws_internal()` to dereference the internal struct instead of merely casting it. Sponsored-by: Sovereign Tech Fund Signed-off-by: Niklas Haas <git@haasn.dev>	2 months ago
Martin Storsjö	b9145fcab2	swscale: Fix aarch64 and i386 compilation failures This unbreaks builds after `c1a0e65763`, which broke with errors like src/libswscale/aarch64/rgb2rgb.c:66:25: error: incompatible function pointer types assigning to 'void ()(const uint8_t , uint8_t , uint8_t , uint8_t , int, int, int, int, int, const int32_t )' (aka 'void ()(const unsigned char , unsigned char , unsigned char , unsigned char , int, int, int, int, int, const int )') from 'void (const uint8_t , uint8_t , uint8_t , uint8_t , int, int, int, int, int, int32_t )' (aka 'void (const unsigned char , unsigned char , unsigned char , unsigned char , int, int, int, int, int, int )') [-Wincompatible-function-pointer-types] 66 \| ff_rgb24toyv12 = rgb24toyv12; \| ^ ~~~~~~~~~~~ and src/libswscale/aarch64/swscale_unscaled.c:213:29: error: incompatible function pointer types assigning to 'SwsFunc' (aka 'int ()(struct SwsContext , const unsigned char const , const int , int, int, unsigned char const , const int )') from 'int (SwsContext , const uint8_t const , const int , int, int, const uint8_t *, const int )' (aka 'int (struct SwsContext , const unsigned char const , const int , int, int, const unsigned char *, const int )') [-Wincompatible-function-pointer-types] 213 \| c->convert_unscaled = nv24_to_yuv420p_neon_wrapper; \| ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Signed-off-by: Martin Storsjö <martin@martin.st>	3 months ago
Niklas Haas	c1a0e65763	swscale/internal: constify SwsFunc I want to move away from having random leaf processing functions mutate plane pointers, and while we're at it, we might as well make the strides and tables const as well. Sponsored-by: Sovereign Tech Fund Signed-off-by: Niklas Haas <git@haasn.dev>	3 months ago
Ramiro Polla	caaec2ea95	swscale/x86/rgb2rgb: disable rgb24toyv12_mmxext for x86_64 The mmxext implementation is slower than the C version in x86_64. m32 m64 rgb24toyv12_16_200_c: 24942.7 14812.6 rgb24toyv12_16_200_mmxext: 17857.2 ( 1.40x) 17400.4 ( 0.85x) rgb24toyv12_128_60_c: 56892.9 35616.9 rgb24toyv12_128_60_mmxext: 40730.9 ( 1.40x) 39610.4 ( 0.90x) rgb24toyv12_512_16_c: 58402.7 37209.4 rgb24toyv12_512_16_mmxext: 44842.4 ( 1.30x) 41136.2 ( 0.90x) rgb24toyv12_1920_4_c: 54827.4 34737.4 rgb24toyv12_1920_4_mmxext: 51169.9 ( 1.07x) 34818.9 ( 1.00x)	4 months ago
Ramiro Polla	4c824ad391	swscale/x86/rgb2rgb: fix deinterleaveBytes writing past the end of the buffers	4 months ago
Ramiro Polla	f17a6bd200	swscale/x86/rgb2rgb: fix deinterleaveBytes for unaligned dst pointers	4 months ago
Ramiro Polla	8744764a4c	swscale/x86/yuv2rgb: add ssse3 yuv42{0,2}p -> gbrp unscaled colorspace converters Note: this implementation is limited to x86_64 due to general purpose register pressure. checkasm --bench on an Intel(R) Core(TM) i5-5300U CPU @ 2.30GHz: yuv420p_gbrp_8_c: 118.5 yuv420p_gbrp_8_ssse3: 93.3 yuv420p_gbrp_128_c: 1068.3 yuv420p_gbrp_128_ssse3: 319.3 yuv420p_gbrp_1080_c: 8841.8 yuv420p_gbrp_1080_ssse3: 2211.8 yuv420p_gbrp_1920_c: 15903.8 yuv420p_gbrp_1920_ssse3: 3814.3 yuv422p_gbrp_8_c: 144.8 yuv422p_gbrp_8_ssse3: 93.8 yuv422p_gbrp_128_c: 1395.8 yuv422p_gbrp_128_ssse3: 313.0 yuv422p_gbrp_1080_c: 11551.5 yuv422p_gbrp_1080_ssse3: 2240.8 yuv422p_gbrp_1920_c: 20585.3 yuv422p_gbrp_1920_ssse3: 5249.5 yuva420p_gbrp_8_c: 117.5 yuva420p_gbrp_8_ssse3: 92.0 yuva420p_gbrp_128_c: 1593.0 yuva420p_gbrp_128_ssse3: 319.3 yuva420p_gbrp_1080_c: 8694.5 yuva420p_gbrp_1080_ssse3: 2186.0 yuva420p_gbrp_1920_c: 15946.5 yuva420p_gbrp_1920_ssse3: 3805.3	4 months ago
Ramiro Polla	ac6263945a	swscale/x86/yuv2rgb: Detemplatize Every function in yuv2rgb_template.c is only compiled exactly once, so detemplatize it.	6 months ago
Ramiro Polla	4f7f9b1026	swscale: remove unconditional #define DITHER1XBPP This seems to have had an use in the past, but it is now defined unconditionally.	6 months ago
Ramiro Polla	61e851381f	swscale/yuv2rgb/x86: remove mmx/mmxext yuv2rgb functions These functions are either slower or barely faster than the C LUT yuv2rgb code.	6 months ago
James Almer	fcf72966a5	swscale/x86/range_convert: add missing AVX2 preprocessor wrapper Fixes compilation with old yasm. Signed-off-by: James Almer <jamrial@gmail.com>	6 months ago
James Almer	8a4c9d6bd3	swscale/x86/range_convert: reduce amount of xmm regs clobbered in luma functions Signed-off-by: James Almer <jamrial@gmail.com>	6 months ago
Ramiro Polla	f6859cade3	swscale/x86: add sse2 and avx2 {lum,chr}ConvertRange chrRangeFromJpeg_8_c: 22.3 chrRangeFromJpeg_8_sse2: 13.3 chrRangeFromJpeg_8_avx2: 13.3 chrRangeFromJpeg_24_c: 72.8 chrRangeFromJpeg_24_sse2: 22.3 chrRangeFromJpeg_24_avx2: 17.5 chrRangeFromJpeg_128_c: 345.5 chrRangeFromJpeg_128_sse2: 106.0 chrRangeFromJpeg_128_avx2: 57.8 chrRangeFromJpeg_144_c: 380.5 chrRangeFromJpeg_144_sse2: 118.5 chrRangeFromJpeg_144_avx2: 62.3 chrRangeFromJpeg_256_c: 646.3 chrRangeFromJpeg_256_sse2: 218.8 chrRangeFromJpeg_256_avx2: 109.0 chrRangeFromJpeg_512_c: 1461.5 chrRangeFromJpeg_512_sse2: 426.5 chrRangeFromJpeg_512_avx2: 211.5 chrRangeToJpeg_8_c: 37.8 chrRangeToJpeg_8_sse2: 10.5 chrRangeToJpeg_8_avx2: 14.0 chrRangeToJpeg_24_c: 114.3 chrRangeToJpeg_24_sse2: 23.5 chrRangeToJpeg_24_avx2: 16.3 chrRangeToJpeg_128_c: 633.5 chrRangeToJpeg_128_sse2: 107.5 chrRangeToJpeg_128_avx2: 55.0 chrRangeToJpeg_144_c: 758.3 chrRangeToJpeg_144_sse2: 132.0 chrRangeToJpeg_144_avx2: 64.5 chrRangeToJpeg_256_c: 1345.0 chrRangeToJpeg_256_sse2: 218.0 chrRangeToJpeg_256_avx2: 105.3 chrRangeToJpeg_512_c: 2524.0 chrRangeToJpeg_512_sse2: 417.0 chrRangeToJpeg_512_avx2: 218.8 lumRangeFromJpeg_8_c: 11.8 lumRangeFromJpeg_8_sse2: 11.0 lumRangeFromJpeg_8_avx2: 10.3 lumRangeFromJpeg_24_c: 38.5 lumRangeFromJpeg_24_sse2: 15.5 lumRangeFromJpeg_24_avx2: 12.5 lumRangeFromJpeg_128_c: 232.3 lumRangeFromJpeg_128_sse2: 60.0 lumRangeFromJpeg_128_avx2: 26.8 lumRangeFromJpeg_144_c: 259.5 lumRangeFromJpeg_144_sse2: 65.3 lumRangeFromJpeg_144_avx2: 29.0 lumRangeFromJpeg_256_c: 464.5 lumRangeFromJpeg_256_sse2: 107.5 lumRangeFromJpeg_256_avx2: 54.0 lumRangeFromJpeg_512_c: 897.5 lumRangeFromJpeg_512_sse2: 224.5 lumRangeFromJpeg_512_avx2: 109.8 lumRangeToJpeg_8_c: 17.8 lumRangeToJpeg_8_sse2: 11.0 lumRangeToJpeg_8_avx2: 11.8 lumRangeToJpeg_24_c: 56.3 lumRangeToJpeg_24_sse2: 11.0 lumRangeToJpeg_24_avx2: 12.5 lumRangeToJpeg_128_c: 333.8 lumRangeToJpeg_128_sse2: 53.3 lumRangeToJpeg_128_avx2: 26.5 lumRangeToJpeg_144_c: 375.5 lumRangeToJpeg_144_sse2: 60.8 lumRangeToJpeg_144_avx2: 29.0 lumRangeToJpeg_256_c: 652.0 lumRangeToJpeg_256_sse2: 109.5 lumRangeToJpeg_256_avx2: 53.5 lumRangeToJpeg_512_c: 1284.3 lumRangeToJpeg_512_sse2: 218.0 lumRangeToJpeg_512_avx2: 108.3	6 months ago
James Almer	17c3cc5bb6	swscale/x86/rgb_2_rgb: add missing wrap to ff_uyvytoyuv422_avx2 Fixes old yasm. Signed-off-by: James Almer <jamrial@gmail.com>	7 months ago
James Almer	03546f49a3	swscale/x86/rgb2rgb: add missing wrap for ff_uyvytoyuv422_avx2 Signed-off-by: James Almer <jamrial@gmail.com>	7 months ago
James Almer	e8cef5e152	swscale/x86/rgb2rgb: remove mmxext version of shuffle_bytes_2103 Signed-off-by: James Almer <jamrial@gmail.com>	7 months ago
James Almer	c578bb9864	swscale/x86/input: add AVX2 optimized uyvytoyuv422 uyvytoyuv422_c: 23991.8 uyvytoyuv422_sse2: 2817.8 uyvytoyuv422_avx: 2819.3 uyvytoyuv422_avx2: 1972.3 Signed-off-by: James Almer <jamrial@gmail.com>	7 months ago
James Almer	e9cfd53257	swscale/x86/input: add AVX2 optimized RGB32 to YUV functions abgr_to_uv_8_c: 43.3 abgr_to_uv_8_sse2: 14.3 abgr_to_uv_8_avx: 15.3 abgr_to_uv_8_avx2: 18.8 abgr_to_uv_128_c: 650.3 abgr_to_uv_128_sse2: 110.8 abgr_to_uv_128_avx: 112.3 abgr_to_uv_128_avx2: 64.8 abgr_to_uv_1080_c: 5456.3 abgr_to_uv_1080_sse2: 888.8 abgr_to_uv_1080_avx: 900.8 abgr_to_uv_1080_avx2: 518.3 abgr_to_uv_1920_c: 9692.3 abgr_to_uv_1920_sse2: 1593.8 abgr_to_uv_1920_avx: 1613.3 abgr_to_uv_1920_avx2: 864.8 abgr_to_y_8_c: 23.3 abgr_to_y_8_sse2: 12.8 abgr_to_y_8_avx: 13.3 abgr_to_y_8_avx2: 17.3 abgr_to_y_128_c: 308.3 abgr_to_y_128_sse2: 67.3 abgr_to_y_128_avx: 66.8 abgr_to_y_128_avx2: 44.8 abgr_to_y_1080_c: 2371.3 abgr_to_y_1080_sse2: 512.8 abgr_to_y_1080_avx: 505.8 abgr_to_y_1080_avx2: 314.3 abgr_to_y_1920_c: 4177.3 abgr_to_y_1920_sse2: 915.8 abgr_to_y_1920_avx: 926.8 abgr_to_y_1920_avx2: 519.3 bgra_to_uv_8_c: 37.3 bgra_to_uv_8_sse2: 13.3 bgra_to_uv_8_avx: 14.8 bgra_to_uv_8_avx2: 19.8 bgra_to_uv_128_c: 563.8 bgra_to_uv_128_sse2: 111.3 bgra_to_uv_128_avx: 112.3 bgra_to_uv_128_avx2: 64.8 bgra_to_uv_1080_c: 4691.8 bgra_to_uv_1080_sse2: 893.8 bgra_to_uv_1080_avx: 899.8 bgra_to_uv_1080_avx2: 517.8 bgra_to_uv_1920_c: 8332.8 bgra_to_uv_1920_sse2: 1590.8 bgra_to_uv_1920_avx: 1605.8 bgra_to_uv_1920_avx2: 867.3 bgra_to_y_8_c: 22.3 bgra_to_y_8_sse2: 12.8 bgra_to_y_8_avx: 12.8 bgra_to_y_8_avx2: 17.3 bgra_to_y_128_c: 291.3 bgra_to_y_128_sse2: 67.8 bgra_to_y_128_avx: 69.3 bgra_to_y_128_avx2: 45.3 bgra_to_y_1080_c: 2357.3 bgra_to_y_1080_sse2: 508.3 bgra_to_y_1080_avx: 518.3 bgra_to_y_1080_avx2: 399.8 bgra_to_y_1920_c: 4202.8 bgra_to_y_1920_sse2: 906.8 bgra_to_y_1920_avx: 907.3 bgra_to_y_1920_avx2: 526.3 Signed-off-by: James Almer <jamrial@gmail.com>	7 months ago
James Almer	d5fe99dc5f	swscale/x86/input: add AVX2 optimized RGB24 to YUV functions rgb24_to_uv_8_c: 39.3 rgb24_to_uv_8_sse2: 14.3 rgb24_to_uv_8_ssse3: 13.3 rgb24_to_uv_8_avx: 12.8 rgb24_to_uv_8_avx2: 14.3 rgb24_to_uv_128_c: 582.8 rgb24_to_uv_128_sse2: 127.3 rgb24_to_uv_128_ssse3: 107.3 rgb24_to_uv_128_avx: 111.3 rgb24_to_uv_128_avx2: 62.3 rgb24_to_uv_1080_c: 4981.3 rgb24_to_uv_1080_sse2: 1048.3 rgb24_to_uv_1080_ssse3: 876.8 rgb24_to_uv_1080_avx: 887.8 rgb24_to_uv_1080_avx2: 492.3 rgb24_to_uv_1280_c: 5906.8 rgb24_to_uv_1280_sse2: 1263.3 rgb24_to_uv_1280_ssse3: 1048.3 rgb24_to_uv_1280_avx: 1045.8 rgb24_to_uv_1280_avx2: 579.8 rgb24_to_uv_1920_c: 8665.3 rgb24_to_uv_1920_sse2: 1888.8 rgb24_to_uv_1920_ssse3: 1571.8 rgb24_to_uv_1920_avx: 1558.8 rgb24_to_uv_1920_avx2: 869.3 rgb24_to_y_8_c: 20.3 rgb24_to_y_8_sse2: 11.8 rgb24_to_y_8_ssse3: 10.3 rgb24_to_y_8_avx: 10.3 rgb24_to_y_8_avx2: 10.8 rgb24_to_y_128_c: 284.8 rgb24_to_y_128_sse2: 83.3 rgb24_to_y_128_ssse3: 66.8 rgb24_to_y_128_avx: 64.8 rgb24_to_y_128_avx2: 39.3 rgb24_to_y_1080_c: 2451.3 rgb24_to_y_1080_sse2: 696.3 rgb24_to_y_1080_ssse3: 516.8 rgb24_to_y_1080_avx: 518.8 rgb24_to_y_1080_avx2: 301.8 rgb24_to_y_1280_c: 2892.8 rgb24_to_y_1280_sse2: 816.8 rgb24_to_y_1280_ssse3: 623.3 rgb24_to_y_1280_avx: 616.3 rgb24_to_y_1280_avx2: 350.8 rgb24_to_y_1920_c: 4338.8 rgb24_to_y_1920_sse2: 1210.8 rgb24_to_y_1920_ssse3: 928.3 rgb24_to_y_1920_avx: 920.3 rgb24_to_y_1920_avx2: 534.8 Signed-off-by: James Almer <jamrial@gmail.com>	7 months ago
Andreas Rheinhardt	8b62fb231a	swscale/x86/rgb2rgb: Detemplatize Every function in rgb2rgb_template.c is only compiled exactly once; there is no overlap at all between the MMXEXT and the SSE2 functions, so detemplatize it. Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	7 months ago
Andreas Rheinhardt	5421dee0e7	swscale/x86/rgb2rgb_template: Remove unused uyvytoyv12 Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	7 months ago
Andreas Rheinhardt	c1c35380a7	swscale/x86/rgb2rgb: Don't unnecessarily check for inline ASM The SSE2 and AVX versions of deinterleaveBytes are external ASM. Move them out of the inline ASM template. Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	7 months ago
Andreas Rheinhardt	f7305eb3b3	swscale/x86/rgb2rgb_template: Remove unnecessary SFENCE The ff_nv12ToUV_* functions don't use non-temporal stores at all. Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	7 months ago
Ramiro Polla	5939f7228a	libswscale/x86/yuv_2_rgb: fix some comments	7 months ago
Michael Niedermayer	3f9daf1c18	swscale/x86/swscale: use a clearer name for INPUT_PLANER_RGB_A_FUNC_CASE related: CID1497114 Missing break in switch Sponsored-by: Sovereign Tech Fund Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>	7 months ago
Henrik Gramner	c3d3f0e697	avutil/x86util: Fix broken pre-SSE4.1 PMINSD emulation Fixes yadif-16 which allows FATE to pass. Broken since `2904db9045` (2017).	9 months ago
Alfred Wingate	e5ce473040	swscale/x86/rgb_2_rgb: Add opaque pointer to missed definitions of ff_nv12ToUV Opaque parameters were previously added to the original definition of ff_nv12ToUV, leading to gcc noticing a type mismatch with -Wlto-type-mismatch. `f2de911818` https://bugs.gentoo.org/907484 Signed-off-by: Alfred Wingate <parona@protonmail.com> Signed-off-by: Anton Khirnov <anton@khirnov.net>	1 year ago
Lynne	bbe95f7353	x86: replace explicit REP_RETs with RETs From x86inc: > On AMD cpus <=K10, an ordinary ret is slow if it immediately follows either > a branch or a branch target. So switch to a 2-byte form of ret in that case. > We can automatically detect "follows a branch", but not a branch target. > (SSSE3 is a sufficient condition to know that your cpu doesn't have this problem.) x86inc can automatically determine whether to use REP_RET rather than REP in most of these cases, so impact is minimal. Additionally, a few REP_RETs were used unnecessary, despite the return being nowhere near a branch. The only CPUs affected were AMD K10s, made between 2007 and 2011, 16 years ago and 12 years ago, respectively. In the future, everyone involved with x86inc should consider dropping REP_RETs altogether.	2 years ago
Michael Niedermayer	b74f89caae	swscale/output: Bias 16bps output calculations to improve non overflowing range for GBRP16/GBRPF32 Fixes: integer overflow Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>	2 years ago
Andreas Rheinhardt	de33506e4b	swscale/x86/rgb_2_rgb: Empty MMX state in ff_shuffle_bytes_2103_mmxext Fixes FATE-failures with the the filter-2xbr filter-3xbr filter-4xbr filter-ep2x filter-ep3x filter-hq2x filter-hq3x filter-hq4x filter-paletteuse-bayer filter-paletteuse-bayer0 filter-paletteuse-nodither and filter-paletteuse-sierra2_4a tests when using 32bit x86 with CPUFLAGS ranging from "mmx+mmxext" to "mmx+mmxext+sse+sse2+sse3" (the relevant function is only overwritten when using SSSE3). Reviewed-by: Lynne <dev@lynne.ee> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2 years ago
Timo Rothenpieler	f2de911818	swscale: add opaque parameter to input functions	2 years ago
Andreas Rheinhardt	8bec225c3c	swscale/x86/yuv2yuvX: Remove unused ff_yuv2yuvX_mmx() Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2 years ago
Alan Kelly	a38293e444	libswscale: Enable hscale_avx2 for all input sizes. ff_shuffle_filter_coefficients shuffles the tail as required. Signed-off-by: Anton Khirnov <anton@khirnov.net>	2 years ago
Alan Kelly	a6724285fd	sws: allow avx2 hscale to process inputs of any size. The main loop processes blocks of 16 pixels. The tail processes blocks of size 4. Signed-off-by: Anton Khirnov <anton@khirnov.net>	2 years ago
Alan Kelly	51a34e8525	sws: Replace call to yuv2yuvX_mmx by yuv2yuvX_mmxext Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2 years ago
Swinney, Jonathan	4dcd191a50	checkasm: updated tests for sw_scale Change the reference to exactly match the C reference in swscale, instead of exactly matching the x86 SIMD implementations (which differs slightly). Test with and without SWS_ACCURATE_RND - if this flag isn't set, the output must match the C reference exactly, otherwise it is allowed to be off by 2. Mark a couple x86 functions as unavailable when SWS_ACCURATE_RND is set - apparently this discrepancy hasn't been noticed in other exact tests before. Add a test for yuv2plane1. Signed-off-by: Jonathan Swinney <jswinney@amazon.com> Signed-off-by: Martin Storsjö <martin@martin.st>	2 years ago
Andreas Rheinhardt	81d3472031	swscale/x86/swscale: Simplify macro This is possible now that it is no longer used by MMX. Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	3 years ago
Andreas Rheinhardt	a05f22eaf3	swscale/x86/swscale: Remove obsolete and harmful MMX(EXT) functions x64 always has MMX, MMXEXT, SSE and SSE2 and this means that some functions for MMX, MMXEXT, SSE and 3dnow are always overridden by other functions (unless one e.g. explicitly disables SSE2). So given that the only systems that benefit from these functions are truely ancient 32bit x86s they are removed. Moreover, some of the removed code was buggy/not bitexact and lead to failures involving the f32le and f32be versions of gray, gbrp and gbrap on x86-32 when SSE2 was not disabled. See e.g. https://fate.ffmpeg.org/report.cgi?time=20220609221253&slot=x86_32-debian-kfreebsd-gcc-4.4-cpuflags-mmx Notice that yuv2yuvX_mmx is not removed, because it is used by SSE3 and AVX2 as fallback in case of unaligned data and also for tail processing. I don't know why yuv2yuvX_mmxext isn't being used for this; an earlier version [1] of `554c2bc708` used it, but the version that was eventually applied does not. [1]: https://ffmpeg.org/pipermail/ffmpeg-devel/2020-November/272124.html Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	3 years ago

1 2 3 4 5 ...

442 Commits (6bf9252807c8877ca385279e328516f573a0a38b)