Michael Niedermayer
bc488ec28a
avcodec/me_cmp: Fix crashes on ARM due to misalignment
...
Adds a diff_pixels_unaligned()
Fixes: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=872503
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
8 years ago
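The commit above adds a diff_pixels_unaligned() variant because the existing routine could crash on ARM when handed misaligned pointers. A minimal C sketch of an alignment-safe pixel difference follows. It assumes an 8x8 block and a shared stride for both sources; the name diff_pixels_unaligned_sketch and the signature are hypothetical and are not taken from the FFmpeg source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference routine, not FFmpeg's implementation.
 * Computes block[y * 8 + x] = s1[x] - s2[x] for an 8x8 block.
 * Every load is a single byte, so no alignment of s1 or s2 is assumed. */
static void diff_pixels_unaligned_sketch(int16_t *block, const uint8_t *s1,
                                         const uint8_t *s2, ptrdiff_t stride)
{
    assert(block && s1 && s2);
    for (int y = 0; y < 8; y++) {
        for (int x = 0; x < 8; x++)
            block[y * 8 + x] = (int16_t)(s1[x] - s2[x]);
        s1 += stride; /* advance both sources by one row */
        s2 += stride;
    }
}
```

An optimized version that loads several pixels at once has to use unaligned load instructions (or a byte-wise fallback such as this one) whenever the caller cannot guarantee alignment.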
Ivan Kalvachev
43dab86bcd
opus_pvq_search: Restore the proper use of conditional define and simplify the function name suffix handling.
...
Using named define properly documents the code paths.
It also avoids passing additional numbered arguments through
multiple levels of macro templates.
The suffix handling is done by concatenation, like in
other asm functions, and avoids having two separate
"cglobal" defines.
Signed-off-by: Ivan Kalvachev <ikalvachev@gmail.com>
8 years ago
Rostislav Pehlivanov
3c99523a28
opus_pvq_search: split functions into exactness and only use the exact if it's faster
...
This splits the asm function into exact and non-exact version. The exact
version is as fast or faster on newer CPUs (which EXTERNAL_AVX_FAST describes
well) whilst the non-exact version is faster than the exact on older CPUs.
Also fixes yasm compilation which doesn't accept !cpuflags(avx) syntax.
Signed-off-by: Rostislav Pehlivanov <atomnuker@gmail.com>
8 years ago
Rostislav Pehlivanov
f386dd70ac
opus_pvq_search: only use rsqrtps approximation on CPUs with avx
...
Makes the search produce identical results with the C version.
Signed-off-by: Rostislav Pehlivanov <atomnuker@gmail.com>
8 years ago
Rostislav Pehlivanov
8e53cd1fab
opus_pvq_search: remove dead macro
...
There's no point in toggling it, even for debugging. It's just worse.
Signed-off-by: Rostislav Pehlivanov <atomnuker@gmail.com>
8 years ago
Ivan Kalvachev
7205513f8f
SIMD opus pvq_search implementation
...
An explanation of the workings and methods used by the
Pyramid Vector Quantization Search function
can be found in the following Work-In-Progress mail threads:
http://ffmpeg.org/pipermail/ffmpeg-devel/2017-June/212146.html
http://ffmpeg.org/pipermail/ffmpeg-devel/2017-June/212816.html
http://ffmpeg.org/pipermail/ffmpeg-devel/2017-July/213030.html
http://ffmpeg.org/pipermail/ffmpeg-devel/2017-July/213436.html
Signed-off-by: Ivan Kalvachev <ikalvachev@gmail.com>
8 years ago
Rostislav Pehlivanov
70eb77b34e
mdct15: add inverse transform postrotation SIMD
...
2.5ms frames:
Before (c): 2638 decicycles in postrotate, 2097040 runs, 112 skips
After (sse3): 1467 decicycles in postrotate, 2097083 runs, 69 skips
After (avx2): 1244 decicycles in postrotate, 2097085 runs, 67 skips
5ms frames:
Before (c): 4987 decicycles in postrotate, 1048371 runs, 205 skips
After (sse3): 2644 decicycles in postrotate, 1048509 runs, 67 skips
After (avx2): 2031 decicycles in postrotate, 1048523 runs, 53 skips
10ms frames:
Before (c): 9153 decicycles in postrotate, 523575 runs, 713 skips
After (sse3): 5110 decicycles in postrotate, 523726 runs, 562 skips
After (avx2): 3738 decicycles in postrotate, 524223 runs, 65 skips
20ms frames:
Before (c): 17857 decicycles in postrotate, 261866 runs, 278 skips
After (sse3): 10041 decicycles in postrotate, 261746 runs, 398 skips
After (avx2): 7050 decicycles in postrotate, 262116 runs, 28 skips
Improves total decoding performance for real world content by 9% with avx2.
Signed-off-by: Rostislav Pehlivanov <atomnuker@gmail.com>
8 years ago
Wan-Teh Chang
ea1ca17be2
avcodec/x86/cavsdsp: Delete #include "libavcodec/x86/idctdsp.h".
...
This file already has #include "idctdsp.h", which compilers resolve to the
idctdsp.h header in the directory where this file resides.
Two other files in this directory, libavcodec/x86/idctdsp_init.c and
libavcodec/x86/xvididct_init.c, also rely on #include "idctdsp.h"
working this way.
Signed-off-by: Wan-Teh Chang <wtc@google.com>
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
8 years ago
James Almer
9d5e81d3b1
Revert "x86/sbrdsp: remove unnecessary sign extend instruction in apply_noise_main"
...
This reverts commit 24bb7db403.
noise has to be sign extended after all, not zero extended, in tests
other than checkasm.
Fixes most aac tests broken by the now reverted commit.
8 years ago
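The revert above turns on the difference between sign extension and zero extension of a 32-bit value into a 64-bit register. A short C sketch of the two operations follows. The helper names are hypothetical and the code only illustrates the arithmetic; it is not the sbrdsp assembly.

```c
#include <assert.h>
#include <stdint.h>

/* Sign extension: the upper 32 bits are copies of bit 31. */
static int64_t sign_extend32(int32_t v)
{
    return (int64_t)v;
}

/* Zero extension: the upper 32 bits are cleared. */
static int64_t zero_extend32(int32_t v)
{
    return (int64_t)(uint32_t)v;
}

/* Usage example: the two agree for non-negative inputs only. */
static void demo_extension(void)
{
    assert(sign_extend32(5) == zero_extend32(5));
    assert(sign_extend32(-1) == -1);
    assert(zero_extend32(-1) == INT64_C(4294967295));
}
```

A test that only feeds non-negative values cannot tell the two apart, which is one plausible reason a change like this can pass one test suite and fail another.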
James Almer
24bb7db403
x86/sbrdsp: remove unnecessary sign extend instruction in apply_noise_main
...
noise needs to be zero extended and it can be done implicitly as a side effect
in a subsequent instruction.
Signed-off-by: James Almer <jamrial@gmail.com>
8 years ago
James Almer
bcbe9e4447
x86/sbrdsp: zero extend m_max in apply_noise_main
...
Tested-by: Michael Niedermayer <michael@niedermayer.cc>
Signed-off-by: James Almer <jamrial@gmail.com>
8 years ago
James Almer
440285474b
x86/utvideodsp: make restore_rgb_planes functions work on x86_32
...
Reviewed-by: Paul B Mahol <onemda@gmail.com>
Signed-off-by: James Almer <jamrial@gmail.com>
8 years ago
James Almer
ac8ad8d098
x86/sbrdsp: sign extend start and end gprs in ff_sbr_hf_gen_sse
...
Tested-by: Michael Niedermayer <michael@niedermayer.cc>
Signed-off-by: James Almer <jamrial@gmail.com>
8 years ago
James Darnley
0c2acccd4b
avcodec/x86: use new x86-64 functions for -idct simple
...
They now match according to FATE, barring any further bugs with untested
parts
8 years ago
James Darnley
d7246ea9f2
avcodec/x86: add an 8-bit simple IDCT function based on the x86-64 high depth functions
...
Includes add/put functions
Rounding contributed by Ronald S. Bultje
8 years ago
James Darnley
8b19467d07
avcodec/x86: allow future 8-bit simple idct to have "DC only hack"
...
Created by Ronald S. Bultje
8 years ago
Clément Bœsch
b12a36170b
lavc/aacpsdsp: use ptrdiff_t for stride in hybrid_analysis
8 years ago
Michael Niedermayer
516c213f08
avcodec/x86/vp9dsp_init_16bpp: Fix linking to missing ff_vp9_ipred_dr_32x32_16_avx2() on 32bit
...
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
8 years ago
Ilia Valiakhmetov
35a5d9715d
avcodec/vp9: add 64-bit ipred_dr_32x32_16 avx2 implementation
...
vp9_diag_downright_32x32_12bpp_c: 429.7
vp9_diag_downright_32x32_12bpp_sse2: 158.9
vp9_diag_downright_32x32_12bpp_ssse3: 144.6
vp9_diag_downright_32x32_12bpp_avx: 141.0
vp9_diag_downright_32x32_12bpp_avx2: 73.8
Almost 50% faster than avx implementation
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
8 years ago
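The "almost 50% faster" figure above can be checked against the listed numbers. The sketch below assumes the benchmark values are times (lower is better) and uses two hypothetical helpers: one gives the speedup ratio, the other the percentage of time saved. With the 12bpp avx (141.0) and avx2 (73.8) values, the ratio is about 1.91 and the time saved is about 47.7%, so the commit's figure reads as a reduction in time rather than a throughput gain.

```c
#include <assert.h>

/* How many times faster the new version is: old time / new time. */
static double speedup_ratio(double before, double after)
{
    assert(after > 0.0);
    return before / after;
}

/* Percentage of the original time that was saved. */
static double time_reduction_percent(double before, double after)
{
    assert(before > 0.0);
    return 100.0 * (before - after) / before;
}
```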
Paul B Mahol
4ed7c2bbc3
avcodec/utvideodec: add SIMD for restore_rgb_planes
...
Signed-off-by: Paul B Mahol <onemda@gmail.com>
8 years ago
Matthieu Bouron
db5bf64b21
lavc/x86: clear r2 higher bits in ff_sbr_sum_square
...
Suggested-by: James Almer <jamrial@gmail.com>
8 years ago
James Almer
349446e36f
x86/mdct15: use three operand form for some instructions
...
Fixes compilation with old yasm
8 years ago
Rostislav Pehlivanov
e1120b1c54
mdct15: add assembly optimizations for the 15-point FFT
...
c: 1802 decicycles in fft15, 16774635 runs, 2581 skips
avx: 865 decicycles in fft15, 16776378 runs, 838 skips
Signed-off-by: Rostislav Pehlivanov <atomnuker@gmail.com>
8 years ago
Diego Biurrun
fd502f4f5f
build: Generalize yasm/nasm-related variable names
...
None of them are specific to the YASM assembler.
(Cherry-picked from libav commit 39e208f4d4)
Signed-off-by: James Almer <jamrial@gmail.com>
8 years ago
James Darnley
8221c71703
avcodec/x86: allow future 8-bit simple idct to use slightly different coefficients
8 years ago
James Darnley
d2597fb0c1
avcodec/x86: modify simple_idct10 macros to add an action parameter
8 years ago
James Darnley
8781330d80
avcodec/x86: cleanup simple_idct10
...
Use named arguments for the functions so we can remove a define. The
stride/linesize argument is now ptrdiff_t type so we no longer need to
sign extend the register.
8 years ago
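The cleanup above notes that a ptrdiff_t stride removes the need to sign extend the register. A minimal C-side sketch of why the type matters follows. It assumes a 64-bit target where int is 32 bits and ptrdiff_t is pointer-sized; sum_column is a hypothetical helper, not an FFmpeg function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sums one pixel per row, stepping by a signed stride.
 * Because stride is ptrdiff_t, y * stride is computed at pointer width,
 * so a negative stride (bottom-up image) indexes backwards correctly. */
static int sum_column(const uint8_t *first_row, ptrdiff_t stride, int rows)
{
    assert(first_row && rows >= 0);
    int sum = 0;
    for (int y = 0; y < rows; y++)
        sum += first_row[y * stride];
    return sum;
}
```

With a 32-bit int stride, hand-written 64-bit assembly would have to widen the value itself before adding it to a pointer; a pointer-sized argument arrives already widened.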
James Darnley
e3db94302c
avcodec/x86/mpegenc: support transpose permutation type
8 years ago
James Darnley
fa30a0a548
avcodec/x86/mpegenc: check IDCT permutation type is a valid value
8 years ago
Michael Niedermayer
ae6f6d4e34
avcodec/x86/mpegvideo: Use intra scantable in dct_unquantize_h263_intra_mmx()
...
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
8 years ago
James Almer
8bb59e6742
x86/aacpsdsp: add ff_ps_hybrid_analysis_ileave_sse
...
About 2x faster than the c version.
8 years ago
James Almer
e229df9478
x86/aacpsdsp: add ff_ps_hybrid_synthesis_deint_{sse,sse4}
...
About 2x faster than the c version.
8 years ago
James Almer
623d217ed1
avcodec/aacps: move checks for valid length outside the stereo_interpolate dsp function
...
Signed-off-by: James Almer <jamrial@gmail.com>
8 years ago
James Almer
b3446862bf
x86/vorbisdsp: optimize ff_vorbis_inverse_coupling_sse
...
About 7% faster.
8 years ago
Ronald S. Bultje
d35ff98e27
vp9: fix overwrite in ff_vp9_ipred_dr_16x16_16_avx2.
...
Fixes trac issue 6459.
8 years ago
Ilia Valiakhmetov
81fc617c12
avcodec/vp9: ipred_dr_16x16_16 avx2 implementation
...
Signed-off-by: Ilia Valiakhmetov <zakne0ne@gmail.com>
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
8 years ago
James Almer
497a4b554c
x86/aacpsdsp: fix output of ff_ps_stereo_interpolate_ipdopd_sse3
...
The fate-aac-al_sbr_ps_04_ur test did not detect this mistake.
8 years ago
Ilia Valiakhmetov
73d9a9a6af
libavcodec/vp9: ipred_dl_32x32_16 avx2 implementation
...
vp9_diag_downleft_32x32_8bpp_c: 580.2
vp9_diag_downleft_32x32_8bpp_sse2: 75.6
vp9_diag_downleft_32x32_8bpp_ssse3: 73.7
vp9_diag_downleft_32x32_8bpp_avx: 72.7
vp9_diag_downleft_32x32_10bpp_c: 1101.2
vp9_diag_downleft_32x32_10bpp_sse2: 145.4
vp9_diag_downleft_32x32_10bpp_ssse3: 137.5
vp9_diag_downleft_32x32_10bpp_avx: 134.8
vp9_diag_downleft_32x32_10bpp_avx2: 94.0
vp9_diag_downleft_32x32_12bpp_c: 1108.5
vp9_diag_downleft_32x32_12bpp_sse2: 145.5
vp9_diag_downleft_32x32_12bpp_ssse3: 137.3
vp9_diag_downleft_32x32_12bpp_avx: 135.2
vp9_diag_downleft_32x32_12bpp_avx2: 94.0
~30% faster than avx implementation
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
8 years ago
James Almer
933dd62288
x86/aacpsdsp: optimize ff_ps_mul_pair_single_sse
...
~2% faster.
8 years ago
James Almer
be3809a521
x86/aacpsdsp: optimize ff_ps_stereo_interpolate_sse3
...
Move the unpacking outside of the loop. 5% to 10% faster.
Suggested-by: ubitux
Signed-off-by: James Almer <jamrial@gmail.com>
8 years ago
James Almer
b5a0971ff0
x86/aacps: add ff_ps_stereo_interpolate_ipdopd_sse3()
...
About 2x faster than the c version.
Signed-off-by: James Almer <jamrial@gmail.com>
8 years ago
James Darnley
0dea0114fb
avcodec/x86/idctdsp_init: reindent
8 years ago
James Darnley
8e89f6fd37
avcodec/x86: move simple_idct to external assembly
8 years ago
Clément Bœsch
584366a436
lavc/mpegvideoenc: reformat inv_zigzag_direct16 so the zigzag pattern is visible
8 years ago
James Darnley
7aa90b4e94
avcodec/h264: add sse2 versions of previous idct functions
...
Kaby Lake Pentium:
- ff_h264_idct_add_8_sse2: ~1.18x faster than mmxext
- ff_h264_idct_dc_add_8_sse2: ~1.07x faster than mmxext
8 years ago
James Darnley
27460dfebc
avcodec/h264: add avx 8-bit h264_idct_dc_add
...
Haswell:
- 1.02x faster (405±0.7 vs. 397±0.8 decicycles) compared with mmxext
Skylake-U:
- 1.06x faster (498±1.8 vs. 470±1.3 decicycles) compared with mmxext
8 years ago
James Darnley
f61d454ca1
avcodec/h264: add avx 8-bit h264_idct_add
...
Haswell:
- 1.11x faster (522±0.4 vs. 469±1.8 decicycles) compared with mmxext
Skylake-U:
- 1.21x faster (671±5.5 vs. 555±1.4 decicycles) compared with mmxext
8 years ago
James Darnley
b5325c6711
avcodec/h264: use some 3 operand forms
8 years ago
James Darnley
060ba9e5e3
avcodec/h264: change RETs into REP_RETs where appropriate
8 years ago
Michael Niedermayer
fa8fd0808f
avcodec/x86/vc1dsp_init: Fix build failure with --disable-optimizations and clang
...
Compilers doing DCE at -O0 do not necessarily understand "complex" boolean expressions.
The build succeeds with this change; this was the only failure.
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
8 years ago
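The failure described above comes from relying on dead-code elimination (DCE) to drop a call to a function that was never built. The sketch below shows one way to avoid that dependency with a preprocessor guard, so the result does not depend on the optimization level. This is an illustration under assumptions and not necessarily the change made in the commit; HAVE_HYPOTHETICAL_SIMD, DSPContext, scale_c and scale_simd are all made-up names.

```c
#include <assert.h>
#include <stddef.h>

#define HAVE_HYPOTHETICAL_SIMD 0 /* hypothetical build-time flag */

typedef struct DSPContext {
    int (*scale)(int);
} DSPContext;

/* Portable fallback that is always compiled. */
static int scale_c(int x)
{
    return 2 * x;
}

#if HAVE_HYPOTHETICAL_SIMD
/* Would be defined in a file that is only built when the flag is 1. */
int scale_simd(int x);
#endif

static void dsp_init(DSPContext *c, int cpu_has_simd)
{
    assert(c != NULL);
    c->scale = scale_c;
#if HAVE_HYPOTHETICAL_SIMD
    /* The reference to scale_simd exists only when the symbol does. */
    if (cpu_has_simd)
        c->scale = scale_simd;
#else
    (void)cpu_has_simd;
#endif
}
```

Writing the guard as `if (HAVE_HYPOTHETICAL_SIMD && cpu_has_simd)` instead leaves the reference in the source and depends on the compiler folding the condition away, which is the behaviour the commit message says is not guaranteed at -O0.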