James Almer
aa945dc112
x86/hevcdsp: add missing vzeroupper in ff_hevc_sao_band_filter_48_*_avx2
...
Signed-off-by: James Almer <jamrial@gmail.com>
10 years ago
James Almer
71e2cb4706
x86/hevcdsp: add missing guards to ff_hevc_sao_band_filter_avx2
...
Signed-off-by: James Almer <jamrial@gmail.com>
10 years ago
Christophe Gisquet
bff7feb328
x86: hevc/sao: aligned source buffers
...
Useful for at least the band filter, for which:
- Band filter call only:
            32      64
  Before:   16556   54015
  After:    16497   52355
- Whole case:
            32      64
  Before:   37031   103008
  After:    32045   93952
10 years ago
James Almer
fa3eccb4f9
x86/hevc: add ff_hevc_sao_band_filter_{8,10,12}_{sse2,avx,avx2}
...
Original x86 intrinsics code and initial 8bit yasm port by Pierre-Edouard Lepere.
10/12bit yasm ports, refactoring and optimizations by James Almer
Benchmarks of BQTerrace_1920x1080_60_qp22.bin with an Intel Core i5-4200U
width 32
40338 decicycles in sao_band_filter_0_8, 2048 runs, 0 skips
8056 decicycles in ff_hevc_sao_band_filter_8_32_sse2, 2048 runs, 0 skips
7458 decicycles in ff_hevc_sao_band_filter_8_32_avx, 2048 runs, 0 skips
4504 decicycles in ff_hevc_sao_band_filter_8_32_avx2, 2048 runs, 0 skips
width 64
136046 decicycles in sao_band_filter_0_8, 16384 runs, 0 skips
28576 decicycles in ff_hevc_sao_band_filter_8_32_sse2, 16384 runs, 0 skips
26707 decicycles in ff_hevc_sao_band_filter_8_32_avx, 16384 runs, 0 skips
14387 decicycles in ff_hevc_sao_band_filter_8_32_avx2, 16384 runs, 0 skips
Reviewed-by: Christophe Gisquet <christophe.gisquet@gmail.com>
Signed-off-by: James Almer <jamrial@gmail.com>
10 years ago
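For readers unfamiliar with what the functions benchmarked above compute, here is a minimal scalar sketch of an 8-bit SAO band filter. It follows the general HEVC band-offset scheme (the sample range is split into 32 bands, and four consecutive bands receive an offset). The function name, parameter names, and layout are assumptions made for illustration; this is not the code from the commit.

```c
#include <stddef.h>
#include <stdint.h>

/* Clamp an int to the 8-bit pixel range. */
static uint8_t clip_u8(int v)
{
    return (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v);
}

/*
 * Hypothetical scalar 8-bit SAO band filter (illustrative only).
 * offsets[0..3] apply to four consecutive bands starting at left_class;
 * all other bands keep an offset of zero.
 */
static void sao_band_filter_8_sketch(uint8_t *dst, const uint8_t *src,
                                     ptrdiff_t dst_stride, ptrdiff_t src_stride,
                                     const int16_t offsets[4], int left_class,
                                     int width, int height)
{
    int table[32] = { 0 };
    const int shift = 8 - 5; /* 8-bit samples: 32 bands of 8 values each */

    for (int k = 0; k < 4; k++)
        table[(k + left_class) & 31] = offsets[k];

    for (int y = 0; y < height; y++) {
        const uint8_t *s = src + y * src_stride;
        uint8_t *d = dst + y * dst_stride;
        for (int x = 0; x < width; x++)
            d[x] = clip_u8(s[x] + table[s[x] >> shift]);
    }
}
```

The per-sample table lookup followed by a clamped add is the part a SIMD version would process many pixels at a time.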
Christophe Gisquet
7aeafacfd0
x86/sbrdsp: Use different mem moves
...
Before
2843 decicycles in ff_sbr_autocorrelate_sse3, 262086 runs, 58 skips
After
2693 decicycles in ff_sbr_autocorrelate_sse3, 262117 runs, 27 skips
Signed-off-by: James Almer <jamrial@gmail.com>
10 years ago
James Almer
449b21bfab
x86/sbrdsp: add ff_sbr_autocorrelate_{sse,sse3}
...
2 to 2.5 times faster.
Signed-off-by: James Almer <jamrial@gmail.com>
10 years ago
James Almer
08810a8895
x86/flacdsp: remove unneeded ifdeffery
...
x86inc can translate r*m into a register or stack on its own
Reviewed-by: Michael Niedermayer <michaelni@gmx.at>
Signed-off-by: James Almer <jamrial@gmail.com>
10 years ago
James Almer
37b35feb64
x86/swr: add SSE2/AVX pack_8ch functions
...
Reviewed-by: Michael Niedermayer <michaelni@gmx.at>
Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: James Almer <jamrial@gmail.com>
10 years ago
Ronald S. Bultje
3aefca68ca
vp9/x86: add myself to copyright holders for loopfilter assembly.
10 years ago
Ronald S. Bultje
afd8c464b7
vp9/x86: make filter_16_h work on 32-bit.
10 years ago
Ronald S. Bultje
b26bc3520f
vp9/x86: make filter_48/84/88_h work on 32-bit.
10 years ago
Ronald S. Bultje
8a1cff1c35
vp9/x86: make filter_44_h work on 32-bit.
10 years ago
Ronald S. Bultje
047088b8c6
vp9/x86: make filter_16_v work on 32-bit.
10 years ago
Ronald S. Bultje
0cc9c23ea1
vp9/x86: make filter_48/84_v work on 32-bit.
10 years ago
Ronald S. Bultje
6433a9133f
vp9/x86: make filter_88_v work on 32-bit.
10 years ago
Ronald S. Bultje
75f8e52089
vp9/x86: make filter_44_v work on 32-bit.
10 years ago
Ronald S. Bultje
7f80c3344c
vp8/x86: save one register in SIGN_ADD/SUB.
10 years ago
Ronald S. Bultje
8ea2194ebb
vp9/x86: store unpacked intermediates for filter6/14 on stack.
...
filter16 goes from 508 to 482 (h) or 346 to 314 (v) cycles; filter88
goes from 240 to 238 (h) or 174 to 165 (v) cycles, measured on TOS.
10 years ago
Ronald S. Bultje
e42409479f
vp8/x86: move variable assigned inside macro branch.
...
The value is not used outside the branch.
10 years ago
Ronald S. Bultje
418c202c63
vp9/x86: simplify ABSSUM_CMP by inverting the comparison meaning.
10 years ago
Ronald S. Bultje
d1c55654e1
vp8/x86: remove unused register from ABSSUB_CMP macro.
10 years ago
Ronald S. Bultje
e59bd08986
vp9/x86: slightly simplify 44/48/84/88 h stores.
10 years ago
Ronald S. Bultje
8132629bd5
vp9/x86: make cglobal statement more conservative in register allocation.
10 years ago
Ronald S. Bultje
c013ca58c5
vp9/x86: save one register in loopfilter surface coverage.
10 years ago
James Almer
32c836cb11
x86/vp9: remove duplicate function prototypes
...
Fixes "redundant redeclaration" warnings.
Signed-off-by: James Almer <jamrial@gmail.com>
10 years ago
James Almer
7696e429c7
x86/vp3dsp: port put_vp_no_rnd_pixels8_l2_mmx to yasm
...
Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
10 years ago
James Almer
a4d62f7775
x86/constants: fix alignment of pw_255
...
Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
10 years ago
Ronald S. Bultje
bdc1e3e3b2
vp9/x86: intra prediction sse2/32bit support.
...
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
10 years ago
Ronald S. Bultje
b6e1711223
vp9/x86: invert hu_ipred left array ordering.
...
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
10 years ago
Ronald S. Bultje
0a7964dca5
vp9/x86: save one register on 32bit idct32x32.
...
Fixes build on win32.
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
10 years ago
Ronald S. Bultje
cae893f692
vp9/x86: sse2 MC assembly.
...
Also a slight change to the ssse3 code, which prevents a theoretical
overflow in the sharp filter.
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
10 years ago
Ronald S. Bultje
fd77fbb390
vp9/x86: 32bit and sse2 support for vp9 inverse transform assembly
...
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
10 years ago
Michael Niedermayer
a03f72e744
avcodec/x86/hevc_mc: fix sse register counts
...
These fix failures of --enable-xmm-clobber-test.
It would be better to change the code to use fewer registers, but until
someone does, the declared register count must not be too small.
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
10 years ago
Michael Niedermayer
d43d5c5707
avcodec/x86/hevc_mc: remove dead branch from EPEL_FILTER
...
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
10 years ago
Michael Niedermayer
ed9be7dd47
avcodec/x86/pngdsp: fix off by 1 error
...
This fixes artifacts in the last pixel of rows with some widths and pixel formats
Found-by: Dominique Leroux <Dominique.Leroux@autodesk.com>
Tested-by: Dominique Leroux <Dominique.Leroux@autodesk.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
10 years ago
Kieran Kunhya
9a738c27dc
v210enc: Add SIMD optimised 8-bit and 10-bit encoders
...
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
Signed-off-by: Vittorio Giovara <vittorio.giovara@gmail.com>
10 years ago
Reimar Döffinger
49d9cbe55d
h264_i386: Fix operand size
...
Fixes fate failure on macosx clang x86-64
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
10 years ago
Christophe Gisquet
9fa056ba75
pngdsp x86: use unaligned access
...
For test images manually generated to contain only up prediction,
timing results:
           8380x3032   255x185
before:    138635      1992
after:     139232      1996
Actually jumping to the proper version depending on the alignment:
8380x3032: 138767
A 0.5% speed improvement for gigantic images is not worth the code
duplication.
Fixes ticket #4148
Signed-off-by: Christophe Gisquet <christophe.gisquet@gmail.com>
Tested-by: Benoit Fouet <benoit.fouet@free.fr>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
10 years ago
Kieran Kunhya
36091742d1
v210enc: Add SIMD optimised 8-bit and 10-bit encoders
...
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
10 years ago
Vittorio Giovara
9c12c6ff95
motion_est: convert stride to ptrdiff_t
...
CC: libav-stable@libav.org
Bug-Id: CID 700556 / CID 700557 / CID 700558
10 years ago
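A minimal sketch of what a ptrdiff_t stride looks like in a block comparison routine. The function below is hypothetical and is not taken from motion_est; it only illustrates that a signed, pointer-sized stride lets the same loop walk rows either top-down or bottom-up.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical sum of absolute differences over an 8-pixel-wide block.
 * The stride is a ptrdiff_t so that row offsets are computed in a
 * pointer-sized signed type, including negative strides.
 */
static int sad8_sketch(const uint8_t *a, const uint8_t *b,
                       ptrdiff_t stride, int h)
{
    int sum = 0;

    for (int y = 0; y < h; y++) {
        const uint8_t *ra = a + y * stride; /* y is promoted to ptrdiff_t */
        const uint8_t *rb = b + y * stride;
        for (int x = 0; x < 8; x++)
            sum += abs(ra[x] - rb[x]);
    }
    return sum;
}
```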
Carl Eugen Hoyos
600e38f563
Fix standalone compilation of the apng decoder on x86.
10 years ago
Michael Niedermayer
65ce8f8895
avcodec/x86/Makefile: fix order
...
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
10 years ago
Michael Niedermayer
d3512a0e89
avcodec/x86/lossless_audiodsp: fix fallback code for 32bit
...
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
10 years ago
Michael Niedermayer
4327088da3
avcodec/x86/lossless_audiodsp: support len %16 == 8 in scalarproduct_and_madd_int16()
...
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
10 years ago
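The length constraint named in the subject above can be illustrated in scalar C. This sketch assumes the operation is "return the dot product of v1 and v2 while updating v1 += mul * v3", processed in blocks of 16 with one extra half block of 8 when len % 16 == 8. The helper names are hypothetical, and the actual assembly is not reproduced here.

```c
#include <stdint.h>

/* Process n elements: accumulate v1[i] * v2[i], then v1[i] += mul * v3[i]. */
static int32_t madd_block_sketch(int16_t *v1, const int16_t *v2,
                                 const int16_t *v3, int n, int mul)
{
    int32_t res = 0;

    for (int i = 0; i < n; i++) {
        res += v1[i] * v2[i];
        v1[i] = (int16_t)(v1[i] + mul * v3[i]);
    }
    return res;
}

/* Hypothetical driver: full blocks of 16, plus a tail of exactly 8. */
static int32_t scalarproduct_and_madd_int16_sketch(int16_t *v1,
                                                   const int16_t *v2,
                                                   const int16_t *v3,
                                                   int len, int mul)
{
    int32_t res = 0;
    int i = 0;

    for (; i + 16 <= len; i += 16)
        res += madd_block_sketch(v1 + i, v2 + i, v3 + i, 16, mul);

    /* Without this branch the last 8 elements would be skipped. */
    if (len - i == 8)
        res += madd_block_sketch(v1 + i, v2 + i, v3 + i, 8, mul);

    return res;
}
```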
Reimar Döffinger
478c61ccb2
h264_i386: Optimize decode_significance_8x8_x86 for 64 bit.
...
11674 -> 10877 decicycles on my Phenom II.
Overall speedup was unfortunately within measurement error.
Signed-off-by: Reimar Döffinger <Reimar.Doeffinger@gmx.de>
10 years ago
James Almer
3cec54b7d7
x86/flacdsp: add SSE2 and AVX decorrelate functions
...
Two to four times faster depending on instruction set, block size and channel count.
10 years ago
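As background for what a decorrelate function does, here is a minimal scalar sketch of mid/side decorrelation for a stereo pair. The function name and the 32-bit interleaved output layout are assumptions for illustration, and the sketch relies on >> being an arithmetic shift for negative values, as it is with gcc and clang on x86.

```c
#include <stdint.h>

/*
 * Hypothetical mid/side to left/right conversion (illustrative only).
 * out is interleaved: out[2*i] = left, out[2*i + 1] = right.
 * Assumes >> on negative values is an arithmetic shift.
 */
static void decorrelate_ms_sketch(int32_t *out, const int32_t *mid,
                                  const int32_t *side, int len)
{
    for (int i = 0; i < len; i++) {
        int32_t m = mid[i];
        int32_t s = side[i];

        m -= s >> 1;          /* m now holds the right channel */
        out[2 * i]     = m + s; /* left  = right + side */
        out[2 * i + 1] = m;     /* right */
    }
}
```

For example, left 5 and right 2 are stored as mid 3 and side 3, and the loop above recovers 5 and 2.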
James Almer
84ccc317ce
x86/flacdsp: separate decoder and encoder dsp initialization
...
Signed-off-by: James Almer <jamrial@gmail.com>
10 years ago
James Almer
7292b0477a
x86/hpeldsp: fix loop in {avg,avg_no_rnd}_pixels16_x2_mmx
...
Handle it inside the __asm__() block.
Fixes fate-vc1_ilaced_twomv when using the gcc-usan toolchain.
Reviewed-by: Michael Niedermayer <michaelni@gmx.at>
Signed-off-by: James Almer <jamrial@gmail.com>
10 years ago
Henrik Gramner
2d91abade2
x86: h264_intrapred: Don't treat 32-bit integers as 64-bit
...
The upper halves are not guaranteed to be zero in x86-64.
Signed-off-by: Anton Khirnov <anton@khirnov.net>
10 years ago
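The issue described above can be modelled in C by treating a 64-bit register as a uint64_t whose low half carries a 32-bit int argument and whose upper half holds leftover bits. Both helper names are hypothetical. The sketch assumes the usual two's-complement conversion behaviour of gcc and clang.

```c
#include <stdint.h>

/* Wrong: interprets all 64 bits of the register as the value. */
static int64_t use_full_register_sketch(uint64_t reg)
{
    return (int64_t)reg;
}

/*
 * Right: keep only the low 32 bits and sign-extend them, which is what
 * an explicit sign-extending move does in assembly.
 */
static int64_t sign_extend_low32_sketch(uint64_t reg)
{
    return (int64_t)(int32_t)(uint32_t)reg;
}
```

With a register value of 0xDEADBEEF00000010, the first helper returns a large garbage number while the second returns 16, the intended argument.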
Mickaël Raulet
4ba6371a83
x86/hevc: get rid of packusdw for ssse3 compatibility
...
cherry picked from commit df8ebe304df453f26c28ff8f11d607f49b90a4c2
Fixes out of array access
Fixes: asan_stack-oob_1046454_9_asan_stack-oob_15a9e7c_170_WP_MAIN10_B_Toshiba_3.bit
Found-by: Mateusz "j00ru" Jurczyk and Gynvael Coldwind
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
10 years ago