Ronald S. Bultje
f0a2b6249b
vp9: add 16x16 idct avx2 (8-bit).
...
checkasm --bench, 10k runs, for *_add_${bpc}_${sub_idct}_${opt}, shows
that it's about 1.65x as fast as the AVX version for the full IDCT, and
similar speedups for the sub-IDCTs:
nop: 24.6
vp9_inv_dct_dct_16x16_add_8_1_c: 6444.8
vp9_inv_dct_dct_16x16_add_8_1_sse2: 638.6
vp9_inv_dct_dct_16x16_add_8_1_ssse3: 484.4
vp9_inv_dct_dct_16x16_add_8_1_avx: 661.2
vp9_inv_dct_dct_16x16_add_8_1_avx2: 311.5
vp9_inv_dct_dct_16x16_add_8_2_c: 6665.7
vp9_inv_dct_dct_16x16_add_8_2_sse2: 646.9
vp9_inv_dct_dct_16x16_add_8_2_ssse3: 455.2
vp9_inv_dct_dct_16x16_add_8_2_avx: 521.9
vp9_inv_dct_dct_16x16_add_8_2_avx2: 304.3
vp9_inv_dct_dct_16x16_add_8_4_c: 7022.7
vp9_inv_dct_dct_16x16_add_8_4_sse2: 647.4
vp9_inv_dct_dct_16x16_add_8_4_ssse3: 467.1
vp9_inv_dct_dct_16x16_add_8_4_avx: 446.1
vp9_inv_dct_dct_16x16_add_8_4_avx2: 297.0
vp9_inv_dct_dct_16x16_add_8_8_c: 6800.4
vp9_inv_dct_dct_16x16_add_8_8_sse2: 598.6
vp9_inv_dct_dct_16x16_add_8_8_ssse3: 465.7
vp9_inv_dct_dct_16x16_add_8_8_avx: 440.9
vp9_inv_dct_dct_16x16_add_8_8_avx2: 290.2
vp9_inv_dct_dct_16x16_add_8_16_c: 6626.6
vp9_inv_dct_dct_16x16_add_8_16_sse2: 599.5
vp9_inv_dct_dct_16x16_add_8_16_ssse3: 475.0
vp9_inv_dct_dct_16x16_add_8_16_avx: 469.9
vp9_inv_dct_dct_16x16_add_8_16_avx2: 286.4
9 years ago
James Almer
172af20852
x86/showcqt: use three operand format for some instructions
...
Fixes failures with yasm 1.1.0 and older
Signed-off-by: James Almer <jamrial@gmail.com>
9 years ago
James Almer
99b899483e
avutil/x86util: move haddps sse emulation from showcqt
...
Signed-off-by: James Almer <jamrial@gmail.com>
9 years ago
James Almer
d5f8a642f6
x86: port PSIGNW to cpuflags
...
Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: James Almer <jamrial@gmail.com>
9 years ago
James Almer
5750d6c5e9
x86: move XOP emulation code back to x86inc
...
Only two functions that use xop multiply-accumulate instructions where the
first operand is the same as the fourth actually took advantage of the macros.
This further reduces differences with x264's x86inc.
Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: James Almer <jamrial@gmail.com>
10 years ago
James Almer
37b35feb64
x86/swr: add SSE2/AVX pack_8ch functions
...
Reviewed-by: Michael Niedermayer <michaelni@gmx.at>
Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: James Almer <jamrial@gmail.com>
10 years ago
Kieran Kunhya
9a738c27dc
v210enc: Add SIMD optimised 8-bit and 10-bit encoders
...
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
Signed-off-by: Vittorio Giovara <vittorio.giovara@gmail.com>
10 years ago
Kieran Kunhya
36091742d1
v210enc: Add SIMD optimised 8-bit and 10-bit encoders
...
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
10 years ago
James Almer
d0f56ca071
x86/hevc_deblock: improve 8bit transpose store macros
...
Up to four instructions less depending on function and instruction set.
Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
11 years ago
James Almer
1ace9573dc
x86/hevc_idct: replace old and unused idct functions
...
Only 8-bit and 10-bit idct_dc() functions are included (adding others should be trivial).
Benchmarks on an Intel Core i5-4200U:
idct8x8_dc
SSE2 MMXEXT C
cycles 22 26 57
idct16x16_dc
AVX2 SSE2 C
cycles 27 32 249
idct32x32_dc
AVX2 SSE2 C
cycles 62 126 1375
Signed-off-by: James Almer <jamrial@gmail.com>
Reviewed-by: Mickaël Raulet <mraulet@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
11 years ago
Christophe Gisquet
9107612818
x86util: add and use RSHIFT/LSHIFT macros
...
Those macros take a byte number as shift argument, as this argument
differs between MMX and SSE2 instructions.
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
11 years ago
Christophe Gisquet
2267003981
x86: hpeldsp: better factorization
...
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
11 years ago
James Almer
561bfc85eb
x86/dsputilenc: implement SSE2 versions of pix_{sum16, norm1}
...
Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
11 years ago
James Almer
76ed71a72b
x86: move horizontal add macros to x86util
...
Also port relevant AVX2/XOP optimizations from x264 with permission
to relicense to LGPL from the corresponding authors
Signed-off-by: James Almer <jamrial@gmail.com>
Reviewed-by: "Ronald S. Bultje" <rsbultje@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
11 years ago
James Almer
3f3d748cab
x86: Move XOP emulation to x86util
...
We need the emulation to support the cases where the first
argument is the same as the fourth. To achieve this a fifth
argument working as a temporary may be needed.
Emulation that doesn't obey the original instruction semantics
can't be in x86inc.
Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
11 years ago
Jason Garrett-Glaser
c6908d6b4b
x86inc: FMA3/4 Support
...
Signed-off-by: Derek Buitenhuis <derek.buitenhuis@gmail.com>
11 years ago
Derek Buitenhuis
206895708e
x86inc: Remove our FMA4 support
...
This is so we can sync to x264's version of FMA4 support.
This partialy reverts commit 79687079a9
.
Signed-off-by: Derek Buitenhuis <derek.buitenhuis@gmail.com>
11 years ago
Diego Biurrun
d633d12b2c
x86inc: Add cvisible macro for C functions with public prefix
...
This allows defining externally visible library symbols.
Signed-off-by: Diego Biurrun <diego@biurrun.de>
12 years ago
Diego Biurrun
ef5d41a553
x86inc: Rename "program_name" to "private_prefix"
...
The new name is more descriptive and will allow defining a separate
public prefix for externally visible library symbols.
Signed-off-by: Diego Biurrun <diego@biurrun.de>
12 years ago
Diego Biurrun
dae1d507af
x86: Add PAVGB macro to abstract pavgb/pavgusb instruction via cpuflags
12 years ago
Diego Biurrun
320e1d0df3
x86: ABSB2: port to cpuflags
12 years ago
Diego Biurrun
094a7405e5
x86: ABSB: port to cpuflags
12 years ago
Diego Biurrun
51969a652c
x86: ABS2: port to cpuflags
12 years ago
Diego Biurrun
5b4dfbffc2
x86: ABS1: port to cpuflags
12 years ago
Justin Ruggles
ac7eb4cb20
float_dsp: add vector_dmul_scalar() to multiply a vector of doubles
...
Include x86-optimized versions for SSE2 and AVX.
12 years ago
Diego Biurrun
87af05c575
x86: SPLATD: port to cpuflags
12 years ago
Diego Biurrun
26301caaa1
x86: mmx2 ---> mmxext in asm constructs
12 years ago
Diego Biurrun
f0d124f005
x86inc: Set program_name outside of x86inc.asm
...
This reduces the local difference to the x264 upstream version.
12 years ago
Diego Biurrun
4b60fac419
x86: PALIGNR: port to cpuflags
12 years ago
Diego Biurrun
dbb37e7711
x86: PABSW: port to cpuflags
12 years ago
Diego Biurrun
0a7a94f2e5
x86: Refactor PSWAPD fallback implementations and port to cpuflags
12 years ago
Diego Biurrun
26f01bd106
x86: PMINUB: port to cpuflags
12 years ago
Diego Biurrun
61bc2bc7d4
x86util: Add cpuflags_mmxext alias for cpuflags_mmx2
...
"mmxext" is a more sensible name and more common in outside projects.
12 years ago
Dave Yeo
264f12342c
x86: Fix assembly with NASM
...
Unlike YASM, NASM only looks for include files in the current
directory, not in the directory that included files reside in.
Signed-off-by: Diego Biurrun <diego@biurrun.de>
12 years ago
Dave Yeo
9c167914a1
x86: Fix assembly with NASM
...
Unlike YASM, NASM only looks for include files in the current
directory, not in the directory that included files reside in.
Signed-off-by: Diego Biurrun <diego@biurrun.de>
12 years ago
Diego Biurrun
588fafe7f3
x86: MMX2 ---> MMXEXT in macro names
12 years ago
Diego Biurrun
6860b4081d
x86: include x86inc.asm in x86util.asm
...
This is necessary to allow refactoring some x86util macros with cpuflags.
12 years ago
Justin Ruggles
6092dafb5a
lavr: x86: optimized 6-channel s16 to fltp conversion
13 years ago
Jason Garrett-Glaser
85a3c19ed1
dsputil: x86: add SHUFFLE_MASK_W macro
...
Simplifies pshufb masks that operate on words.
13 years ago
Loren Merritt
4d4752366f
x86inc: add SPLATB_LOAD, SPLATB_REG, PSHUFLW macros
...
Signed-off-by: Diego Biurrun <diego@biurrun.de>
13 years ago
Vitor Sessak
4a301706fd
x86: Avoid movs on BUTTERFLYPS when in AVX mode
...
Signed-off-by: Janne Grunau <janne-libav@jannau.net>
13 years ago
Justin Ruggles
5cc6d5244d
lavr: replace the SSE version of ff_conv_fltp_to_flt_6ch() with SSE4 and AVX
...
The current SSE version is slower than the MMX version on Athlon64 and Sandy
Bridge, but the SSE4 and AVX versions are faster on Sandy Bridge.
13 years ago
Justin Ruggles
c8af852b97
Add libavresample
...
This is a new library for audio sample format, channel layout, and sample rate
conversion.
13 years ago
Ronald S. Bultje
3b15a6d742
config.asm: change %ifdef directives to %if directives.
...
This allows combining multiple conditionals in a single statement.
13 years ago
Justin Ruggles
4e8e262476
fmtconvert: port int32_to_float_fmul_scalar() x86 inline asm to yasm
13 years ago
Ronald S. Bultje
38e06c2969
Move clipd macros to x86util.asm.
...
This allows sharing them between multiple .asm files.
14 years ago
Ronald S. Bultje
b2c087871d
Move x86util.asm from libavcodec/ to libavutil/.
...
This allows using it in swscale also.
14 years ago
Jason Garrett-Glaser
a3bf7b864a
H.264: tweak some other x86 asm for Atom
14 years ago
Daniel Kang
c0483d0c7a
H.264: Add x86 assembly for 10-bit H.264 predict functions
...
Mainly ported from 8-bit H.264 predict.
Some code ported from x264. LGPL ok by author.
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
14 years ago
Loren Merritt
422b2362fc
dct32_sse: eliminate some spills
...
125->104 cycles on penryn (x86_64 only)
14 years ago