FFmpeg

Commit Graph

Author	SHA1	Message	Date
Christophe GISQUET	2784d18791	SBR DSP x86: implement SSE sbr_hf_g_filt Unrolling the main loop to process, instead of 4 elements: - 8: minor gain of 2 cycles (not worth the extra object size) - 2: loss of 8 cycles. Assigning STEP to a register is a loss. Output address (Y) is almost always unaligned. Timings: - C (32/64 bits): 117/109 cycles - SSE: 57 cycles Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>	13 years ago
Christophe GISQUET	34454c761f	SBR DSP x86: implement SSE sbr_sum_square_sse The 32bits targets have been compiled with -mfpmath=sse for proper reference. sbr_sum_square C /32bits: 82c (unrolled)/102c C /64bits: 69c (unrolled)/82c SSE/32bits: 42c SSE/64bits: 31c Use of SSE4.1 dpps to perform the final sum is slower. Not unrolling to perform 8 operations in a loop yields 10 more cycles. Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>	13 years ago
Ronald S. Bultje	3ab9a2a557	rv34: change most "int stride" into "ptrdiff_t stride". This prevents having to sign-extend on 64-bit systems with 32-bit ints, such as x86-64. Also fixes crashes on systems where we don't do it and arguments are not in registers, such as Win64 for all weight functions.	13 years ago
Ronald S. Bultje	8fb26950ed	h264: don't use redzone in loopfilter on win64. Red zone usage is not allowed in the Win64 ABI.	13 years ago
Michael Niedermayer	f9caec0cf9	h264: change deblock_h_chroma_8_mmxext() to prevent valgrind confusion. Signed-off-by: Michael Niedermayer <michaelni@gmx.at>	13 years ago
Christophe GISQUET	f3e084909b	mpegaudio: replace memcpy by SIMD code By replacing memcpy with an unrolled loop using the alignment knowledge it has, some speedup can be obtained. Before (gcc 4.6.1): ~400 cycles After: ~370 cycles Overall, around 2% speed increase when decoding a 2400s mp3 to f32le. Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>	13 years ago
Martin Storsjö	efd29844eb	mpegvideo: Add ff_ prefix to nonstatic functions Signed-off-by: Martin Storsjö <martin@martin.st>	13 years ago
Martin Storsjö	873c89e2a6	dsputil: Add ff_ prefix to inv_zigzag_direct16 Signed-off-by: Martin Storsjö <martin@martin.st>	13 years ago
Martin Storsjö	9cf0841ef3	dsputil: Add ff_ prefix to the dsputil_init functions Signed-off-by: Martin Storsjö <martin@martin.st>	13 years ago
Reimar Döffinger	f51a072160	Fix compilation without HAVE_AVX. %ifdef HAVE_AVX must now be %if HAVE_AVX. Signed-off-by: Reimar Döffinger <Reimar.Doeffinger@gmx.de>	13 years ago
Reimar Döffinger	b223035511	Detect and check for CMOV. Some MMX-only CPUs do not have support for CMOV. All SSE/MMX2 CPUs should be fine, thus no check was added to those functions. See also https://sourceforge.net/tracker/?func=detail&aid=3358347&group_id=205275&atid=992986 Signed-off-by: Reimar Döffinger <Reimar.Doeffinger@gmx.de>	13 years ago
Reimar Döffinger	394d41ee30	Partially revert "Fix png decoding on x86." This partially reverts commit `58dabf7bf2`. It is no longer necessary to use unaligned mov. The swapped mov argument fix remains though.	13 years ago
Justin Ruggles	d483bb58c3	ac3dsp: do not use pshufb in ac3_extract_exponents_ssse3() We need to do unsigned saturation in order to cover the corner case when the absolute coefficient value is 16777215 (the maximum value). Fixes Bug #216	13 years ago
Diego Biurrun	0bba26466f	cosmetics: Delete empty lines at end of file.	13 years ago
Ronald S. Bultje	ce1e250ee9	h264: manually save/restore XMM registers for functions using INIT_MMX. On Win64, these registers are callee-save, so not saving/restoring them correctly is a violation of ABI and can lead to crashes or corrupt data.	13 years ago
Ronald S. Bultje	4ff6dea390	pngdsp: swap argument inversion.	13 years ago
Michael Kostylev	3206cccc0e	h264: mark h264_idct_add8_10 with number of XMM registers. This fixes XMM register clobber problems on Win64. Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>	13 years ago
Reimar Döffinger	58dabf7bf2	Fix png decoding on x86. Line sizes are only 8-byte aligned, so use unaliged loads for add_bytes_l2 pointers. Increasing the alignment requirement to 16 seemed a bit extreme (png may be used for rather small sizes). Also fix a mov that had its arguments swapped, leading add_bytes_l2 being applied on up to 8 bytes too few. Signed-off-by: Reimar Döffinger <Reimar.Doeffinger@gmx.de>	13 years ago
Reimar Döffinger	da1ba4e88b	Fix NASM compilation. movd needs explicit register size prefix for NASM. Signed-off-by: Reimar Döffinger <Reimar.Doeffinger@gmx.de>	13 years ago
KO Myung-Hun	c853124fb0	Use SECTION_TEXT instead of section .text for the compatibility aout does not support 'align='. Signed-off-by: Michael Niedermayer <michaelni@gmx.at>	13 years ago
Ronald S. Bultje	7e4d9d5d45	win64: add a XMM clobber test configure option. This will be useful to test more aggressively for failures to mark XMM registers as clobbered in Win64 builds, and prevent regressions thereof. Based on a patch by Ramiro Polla <ramiro.polla@gmail.com>	13 years ago
Justin Ruggles	236a550c3f	Fix a typo in the x86 asm version of ff_vector_clip_int32() Specifies the correct number of xmm registers used so that they can be saved and restored on Win64 if necessary.	13 years ago
Christophe Gisquet	e5c9de2ab7	rv40: x86 SIMD for biweight Provide MMX, SSE2 and SSSE3 versions, with a fast-path when the weights are multiples of 512 (which is often the case when the values round up nicely). *_TIMER report for the 16x16 and 8x8 cases: C: 9015 decicycles in 16, 524257 runs, 31 skips 2656 decicycles in 8, 524271 runs, 17 skips MMX: 4156 decicycles in 16, 262090 runs, 54 skips 1206 decicycles in 8, 262131 runs, 13 skips MMX on fast-path: 2760 decicycles in 16, 524222 runs, 66 skips 995 decicycles in 8, 524252 runs, 36 skips SSE2: 2163 decicycles in 16, 262131 runs, 13 skips 832 decicycles in 8, 262137 runs, 7 skips SSE2 with fast path: 1783 decicycles in 16, 524276 runs, 12 skips 711 decicycles in 8, 524283 runs, 5 skips SSSE3: 2117 decicycles in 16, 262136 runs, 8 skips 814 decicycles in 8, 262143 runs, 1 skips SSSE3 with fast path: 1315 decicycles in 16, 524285 runs, 3 skips 578 decicycles in 8, 524286 runs, 2 skips This means around a 4% speedup for some sequences. Signed-off-by: Diego Biurrun <diego@biurrun.de>	13 years ago
Diego Biurrun	91bafb52ae	x86: Give RV40 init file a more suitable name.	13 years ago
Diego Biurrun	c30b198381	x86: Place mm_flags variable declaration below the appropriate #ifdef. This fixes some unused variable warnings with YASM disabled.	13 years ago
Christophe Gisquet	6b03900382	x86 dsputil: provide SSE2/SSSE3 versions of bswap_buf While pshufb allows emulating bswap on XMM registers for SSSE3, more shuffling is needed for SSE2. Alignment is critical, so specific codepaths are provided for this case. For the huffyuv sequence "angels_480-huffyuvcompress.avi": C (using bswap instruction): ~ 55k cycles SSE2: ~ 40k cycles SSSE3 using unaligned loads: ~ 35k cycles SSSE3 using aligned loads: ~ 30k cycles Signed-off-by: Diego Biurrun <diego@biurrun.de>	13 years ago
Ronald S. Bultje	af79a0c48a	png: add support for bpp>4 to paeth x86 SIMD code. This fixes playback of e.g. RGB48 (bpp=6) content on x86 CPUs. Fixes bug 214.	13 years ago
Ronald S. Bultje	f91c4b7824	png: add SSE2 version for add_bytes_l2.	13 years ago
Ronald S. Bultje	59f474b49d	png: convert DSP functions to yasm.	13 years ago
Ronald S. Bultje	20a7d3178f	png: add missing #if HAVE_SSSE3 around function pointer assignment.	13 years ago
Ronald S. Bultje	331e7c4cb3	imdct36: mark SSE functions as using all 16 XMM registers. On x86-64, it indeed uses all 16 registers (and on x86-32, this gets clipped to 8). Not marking it properly causes callers of this function to fail randomly because of XMM register clobbering.	13 years ago
Ronald S. Bultje	e92003514d	png: move DSP functions to their own DSP context.	13 years ago
Michael Niedermayer	81ab42a334	dirac_yasm: fix linking failure due to %ifndef Signed-off-by: Michael Niedermayer <michaelni@gmx.at>	13 years ago
Ronald S. Bultje	3b15a6d742	config.asm: change %ifdef directives to %if directives. This allows combining multiple conditionals in a single statement.	13 years ago
Ronald S. Bultje	c3af52fa8b	dsputil: use vertical component for drawing bottom edge. Current code only writes 8 pixels of vertical edge for YUV422, which causes MC artifacts when subsequent frames use data from that edge.	13 years ago
Reimar Döffinger	7e62315c91	Use correct register size. Fixes compilation with NASM. Signed-off-by: Reimar Döffinger <Reimar.Doeffinger@gmx.de>	13 years ago
Christophe GISQUET	9ba9c34024	rv34: 1-pass inter MB reconstruction Implement 1-pass inverse transform and reconstruction for inter blocks.	13 years ago
Christophe GISQUET	d78062386e	rv34: Intra 16x16 handling Extract processing of intra 16x16 blocks from intra macroblock processing. Also implement a function performing inverse transform and block reconstruction for DC-only blocks in 1 pass instead of 2.	13 years ago
Reimar Döffinger	7a1723086a	Fix compilation without HAVE_AVX, HAVE_YASM etc. At the very least this should fix warnings about unused static functions if one or more of these is not defined. However even compilation might be broken if the compiler does not optimize the function away completely. This actually happens in case of the AVX function, since the function pointer is used in an assignment that is not under an #if and thus probably only optimized away after the function was already marked as used. Signed-off-by: Reimar Döffinger <Reimar.Doeffinger@gmx.de>	13 years ago
Reimar Döffinger	83b12c16af	Use correct register size, fixes compilation with NASM. Signed-off-by: Reimar Döffinger <Reimar.Doeffinger@gmx.de>	13 years ago
Carl Eugen Hoyos	ef3a19d595	Fix compilation with yasm-0.6.2	13 years ago
Christophe GISQUET	3faa303a47	rv34: DC-only inverse transform When decoding coefficients, detect whether the block is DC-only, and take advantage of this knowledge to perform DC-only inverse transform. This is achieved by: - first, changing the 108x4 element modulo_three_table into a 108 element table (kind of base4), and accessing each value using mask and shifts. - then, checking low bits for 0 (as they represent the presence of higher frequency coefficients) Also provide x86 SIMD code for the DC-only inverse transform. Signed-off-by: Kostya Shishkov <kostya.shishkov@gmail.com>	13 years ago
Michael Niedermayer	5387f9917f	cabac: Try to disable problematic ASM for gcc-llvm 4.2.1 This should fix compilation with gcc-llvm (see darwin fate box) Signed-off-by: Michael Niedermayer <michaelni@gmx.at>	13 years ago
Henrik Gramner	e7d02b04dc	fft: init functions with INIT_XMM/YMM. This is required to handle clobbering of XMM registers on Win64 correctly. Fixes FFT and all tests depending on FFT on Win64. Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com> Signed-off-by: Janne Grunau <janne-libav@jannau.net>	13 years ago
Michael Niedermayer	f247f4cf47	cabac: 3rd try at working around a compiler bug in clang. Switch to a broader detection of versions. Signed-off-by: Michael Niedermayer <michaelni@gmx.at>	13 years ago
Michael Niedermayer	444632eae6	cabac: Disable get_cabac_inline_x86() for clang 2.9 on x86_32 This should finally fix the compilation issue on darwin Signed-off-by: Michael Niedermayer <michaelni@gmx.at>	13 years ago
Michael Niedermayer	2138a89e71	Revert "Revert commit 599b4c6efddaed33b1667c386b34b07729ba732b" This reverts commit `c4f237a981`. This didnt fix compilation on darwin with current clang.	13 years ago
Vitor Sessak	39df0c434c	mpegaudiodec: optimized iMDCT transform Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>	13 years ago
Michael Niedermayer	c4f237a981	Revert commit `599b4c6efd` Author: Mans Rullgard <mans@mansr.com> Date: Sun Dec 11 21:41:59 2011 +0000 x86: cabac: replace explicit memory references with "m" operands This replaces the explicit offset(reg) memory references with "m" operands for the same locations. As a result, one fewer register operand is needed for these inline asm statements. This change appears to have broken compilation on darwin, and subsequent fixes by martin (which did not fix compilation) removed the register advantage, thus this change seems not a good idea to keep. See: http://fate.ffmpeg.org/log.cgi?time=20120103122446&log=compile&slot=i386-darwin-llvm-gcc-4.2.1 Signed-off-by: Michael Niedermayer <michaelni@gmx.at>	13 years ago
Martin Storsjö	676a9ee1d2	x86: Fix constraints for decode_significance*_x86 Originally, prior to `8742a4ff8`, the caller code was compiled within this condition: ARCH_X86 && HAVE_7REGS && HAVE_EBX_AVAILABLE && !defined(BROKEN_RELOCATIONS) Since HAVE_7REGS is defined as (ARCH_X86_64 \|\| (HAVE_EBX_AVAILABLE && HAVE_EBP_AVAILABLE)) the subcondition HAVE_7REGS && HAVE_EBX_AVAILABLE is equal to HAVE_7REGS (for 32 bit at least). The correct simplification of the original condition thus is HAVE_7REGS, not HAVE_EBX_AVAILABLE. This fixes compilation in some cases where HAVE_EBP_AVAILABLE = 0 and HAVE_EBX_AVAILABLE = 1. Signed-off-by: Martin Storsjö <martin@martin.st>	13 years ago

... 3 4 5 6 7 ...

905 Commits (f398617b197d9a44b27d3313df8cd6c72b6a168a)