Ivan Kalvachev
7205513f8f
SIMD opus pvq_search implementation
...
An explanation of the workings and methods used by the
Pyramid Vector Quantization Search function
can be found in the following Work-In-Progress mail threads:
http://ffmpeg.org/pipermail/ffmpeg-devel/2017-June/212146.html
http://ffmpeg.org/pipermail/ffmpeg-devel/2017-June/212816.html
http://ffmpeg.org/pipermail/ffmpeg-devel/2017-July/213030.html
http://ffmpeg.org/pipermail/ffmpeg-devel/2017-July/213436.html
Signed-off-by: Ivan Kalvachev <ikalvachev@gmail.com>
8 years ago
Rostislav Pehlivanov
70eb77b34e
mdct15: add inverse transform postrotation SIMD
...
2.5ms frames:
Before (c): 2638 decicycles in postrotate, 2097040 runs, 112 skips
After (sse3): 1467 decicycles in postrotate, 2097083 runs, 69 skips
After (avx2): 1244 decicycles in postrotate, 2097085 runs, 67 skips
5ms frames:
Before (c): 4987 decicycles in postrotate, 1048371 runs, 205 skips
After (sse3): 2644 decicycles in postrotate, 1048509 runs, 67 skips
After (avx2): 2031 decicycles in postrotate, 1048523 runs, 53 skips
10ms frames:
Before (c): 9153 decicycles in postrotate, 523575 runs, 713 skips
After (sse3): 5110 decicycles in postrotate, 523726 runs, 562 skips
After (avx2): 3738 decicycles in postrotate, 524223 runs, 65 skips
20ms frames:
Before (c): 17857 decicycles in postrotate, 261866 runs, 278 skips
After (sse3): 10041 decicycles in postrotate, 261746 runs, 398 skips
After (avx2): 7050 decicycles in postrotate, 262116 runs, 28 skips
Improves total decoding performance for real world content by 9% with avx2.
Signed-off-by: Rostislav Pehlivanov <atomnuker@gmail.com>
8 years ago
Wan-Teh Chang
ea1ca17be2
avcodec/x86/cavsdsp: Delete #include "libavcodec/x86/idctdsp.h".
...
This file already has #include "idctdsp.h", which compilers resolve to the
idctdsp.h header in the directory where this file resides.
Two other files in this directory, libavcodec/x86/idctdsp_init.c and
libavcodec/x86/xvididct_init.c, also rely on #include "idctdsp.h"
working this way.
Signed-off-by: Wan-Teh Chang <wtc@google.com>
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
8 years ago
James Almer
9d5e81d3b1
Revert "x86/sbrdsp: remove unnecessary sign extend instruction in apply_noise_main"
...
This reverts commit 24bb7db403.
After all, noise has to be sign extended, not zero extended, in tests
other than checkasm.
Fixes most aac tests broken by the now reverted commit.
8 years ago
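To illustrate why sign extension and zero extension are not interchangeable, here is a minimal C sketch. It is not taken from the FFmpeg sources, and the helper names zero_extend32 and sign_extend32 are hypothetical; they only model what happens when a 32-bit value is widened to 64 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Zero extension: the upper 32 bits become 0, whatever bit 31 was. */
static uint64_t zero_extend32(int32_t v)
{
    return (uint64_t)(uint32_t)v;
}

/* Sign extension: the upper 32 bits are copies of bit 31. */
static int64_t sign_extend32(int32_t v)
{
    return (int64_t)v;
}
```

For a non-negative input both helpers return the same number, so a test that only feeds non-negative values cannot tell them apart. For -1 the zero-extended result is 4294967295 while the sign-extended result stays -1, which is one plausible way for a mistake of this kind to go unnoticed in one test and break another.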
James Almer
24bb7db403
x86/sbrdsp: remove unnecessary sign extend instruction in apply_noise_main
...
noise needs to be zero extended and it can be done implicitly as a side effect
in a subsequent instruction.
Signed-off-by: James Almer <jamrial@gmail.com>
8 years ago
James Almer
bcbe9e4447
x86/sbrdsp: zero extend m_max in apply_noise_main
...
Tested-by: Michael Niedermayer <michael@niedermayer.cc>
Signed-off-by: James Almer <jamrial@gmail.com>
8 years ago
James Almer
440285474b
x86/utvideodsp: make restore_rgb_planes functions work on x86_32
...
Reviewed-by: Paul B Mahol <onemda@gmail.com>
Signed-off-by: James Almer <jamrial@gmail.com>
8 years ago
James Almer
ac8ad8d098
x86/sbrdsp: sign extend start and end gprs in ff_sbr_hf_gen_sse
...
Tested-by: Michael Niedermayer <michael@niedermayer.cc>
Signed-off-by: James Almer <jamrial@gmail.com>
8 years ago
James Darnley
0c2acccd4b
avcodec/x86: use new x86-64 functions for -idct simple
...
They now match according to FATE, barring any further bugs with untested
parts
8 years ago
James Darnley
d7246ea9f2
avcodec/x86: add an 8-bit simple IDCT function based on the x86-64 high depth functions
...
Includes add/put functions
Rounding contributed by Ronald S. Bultje
8 years ago
James Darnley
8b19467d07
avcodec/x86: allow future 8-bit simple idct to have "DC only hack"
...
Created by Ronald S. Bultje
8 years ago
Clément Bœsch
b12a36170b
lavc/aacpsdsp: use ptrdiff_t for stride in hybrid_analysis
8 years ago
Michael Niedermayer
516c213f08
avcodec/x86/vp9dsp_init_16bpp: Fix linking to missing ff_vp9_ipred_dr_32x32_16_avx2() on 32bit
...
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
8 years ago
Ilia Valiakhmetov
35a5d9715d
avcodec/vp9: add 64-bit ipred_dr_32x32_16 avx2 implementation
...
vp9_diag_downright_32x32_12bpp_c: 429.7
vp9_diag_downright_32x32_12bpp_sse2: 158.9
vp9_diag_downright_32x32_12bpp_ssse3: 144.6
vp9_diag_downright_32x32_12bpp_avx: 141.0
vp9_diag_downright_32x32_12bpp_avx2: 73.8
Almost 50% faster than avx implementation
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
8 years ago
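The benchmark figures above can be turned into a percentage in more than one way, and "almost 50% faster" matches the reduction in measured time most closely. A small C sketch of both calculations follows; the function names are hypothetical and the code is only an arithmetic aid, not part of the commit.

```c
#include <assert.h>

/* Percentage of time saved when going from `before` to `after`. */
static double time_reduction_percent(double before, double after)
{
    assert(before > 0.0);
    return (before - after) / before * 100.0;
}

/* How many times faster `after` is relative to `before`. */
static double speedup_ratio(double before, double after)
{
    assert(after > 0.0);
    return before / after;
}
```

With the 12bpp numbers above, time_reduction_percent(141.0, 73.8) is about 47.7 and speedup_ratio(141.0, 73.8) is about 1.91.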
Paul B Mahol
4ed7c2bbc3
avcodec/utvideodec: add SIMD for restore_rgb_planes
...
Signed-off-by: Paul B Mahol <onemda@gmail.com>
8 years ago
Matthieu Bouron
db5bf64b21
lavc/x86: clear r2 higher bits in ff_sbr_sum_square
...
Suggested-by: James Almer <jamrial@gmail.com>
8 years ago
James Almer
349446e36f
x86/mdct15: use three operand form for some instructions
...
Fixes compilation with old yasm
8 years ago
Rostislav Pehlivanov
e1120b1c54
mdct15: add assembly optimizations for the 15-point FFT
...
c: 1802 decicycles in fft15,16774635 runs, 2581 skips
avx: 865 decicycles in fft15,16776378 runs, 838 skips
Signed-off-by: Rostislav Pehlivanov <atomnuker@gmail.com>
8 years ago
Diego Biurrun
fd502f4f5f
build: Generalize yasm/nasm-related variable names
...
None of them are specific to the YASM assembler.
(Cherry-picked from libav commit 39e208f4d4)
Signed-off-by: James Almer <jamrial@gmail.com>
8 years ago
James Darnley
8221c71703
avcodec/x86: allow future 8-bit simple idct to use slightly different coefficients
8 years ago
James Darnley
d2597fb0c1
avcodec/x86: modify simple_idct10 macros to add an action parameter
8 years ago
James Darnley
8781330d80
avcodec/x86: cleanup simple_idct10
...
Use named arguments for the functions so we can remove a define. The
stride/linesize argument is now ptrdiff_t type so we no longer need to
sign extend the register.
8 years ago
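As a C-level illustration of why a ptrdiff_t stride removes the need for explicit sign extension: on x86-64, ptrdiff_t is already 64 bits wide, so a negative stride arrives in a form that pointer arithmetic can use directly. The sketch below uses a hypothetical function name and is not code from simple_idct10.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Sum one column of a picture. `stride` is the distance in bytes between
 * consecutive rows and may be negative for a bottom-up layout.
 */
static int sum_column(const uint8_t *first_row, ptrdiff_t stride, int rows)
{
    int sum = 0;

    assert(rows >= 0);
    for (int y = 0; y < rows; y++)
        sum += first_row[y * stride];   /* y is promoted to ptrdiff_t */
    return sum;
}
```

Had stride been declared as a 32-bit int, hand-written 64-bit assembly receiving it would have to widen it with sign extension before adding it to a pointer; with ptrdiff_t that step is not needed.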
James Darnley
e3db94302c
avcodec/x86/mpegenc: support transpose permutation type
8 years ago
James Darnley
fa30a0a548
avcodec/x86/mpegenc: check IDCT permutation type is a valid value
8 years ago
Michael Niedermayer
ae6f6d4e34
avcodec/x86/mpegvideo: Use intra scantable in dct_unquantize_h263_intra_mmx()
...
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
8 years ago
James Almer
8bb59e6742
x86/aacpsdsp: add ff_ps_hybrid_analysis_ileave_sse
...
About 2x faster than the c version.
8 years ago
James Almer
e229df9478
x86/aacpsdsp: add ff_ps_hybrid_synthesis_deint_{sse,sse4}
...
About 2x faster than the c version.
8 years ago
James Almer
623d217ed1
avcodec/aacps: move checks for valid length outside the stereo_interpolate dsp function
...
Signed-off-by: James Almer <jamrial@gmail.com>
8 years ago
James Almer
b3446862bf
x86/vorbisdsp: optimize ff_vorbis_inverse_coupling_sse
...
About 7% faster.
8 years ago
Ronald S. Bultje
d35ff98e27
vp9: fix overwrite in ff_vp9_ipred_dr_16x16_16_avx2.
...
Fixes trac issue 6459.
8 years ago
Ilia Valiakhmetov
81fc617c12
avcodec/vp9: ipred_dr_16x16_16 avx2 implementation
...
Signed-off-by: Ilia Valiakhmetov <zakne0ne@gmail.com>
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
8 years ago
James Almer
497a4b554c
x86/aacpsdsp: fix output of ff_ps_stereo_interpolate_ipdopd_sse3
...
The fate-aac-al_sbr_ps_04_ur test did not detect this mistake.
8 years ago
Ilia Valiakhmetov
73d9a9a6af
libavcodec/vp9: ipred_dl_32x32_16 avx2 implementation
...
vp9_diag_downleft_32x32_8bpp_c: 580.2
vp9_diag_downleft_32x32_8bpp_sse2: 75.6
vp9_diag_downleft_32x32_8bpp_ssse3: 73.7
vp9_diag_downleft_32x32_8bpp_avx: 72.7
vp9_diag_downleft_32x32_10bpp_c: 1101.2
vp9_diag_downleft_32x32_10bpp_sse2: 145.4
vp9_diag_downleft_32x32_10bpp_ssse3: 137.5
vp9_diag_downleft_32x32_10bpp_avx: 134.8
vp9_diag_downleft_32x32_10bpp_avx2: 94.0
vp9_diag_downleft_32x32_12bpp_c: 1108.5
vp9_diag_downleft_32x32_12bpp_sse2: 145.5
vp9_diag_downleft_32x32_12bpp_ssse3: 137.3
vp9_diag_downleft_32x32_12bpp_avx: 135.2
vp9_diag_downleft_32x32_12bpp_avx2: 94.0
~30% faster than avx implementation
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
8 years ago
James Almer
933dd62288
x86/aacpsdsp: optimize ff_ps_mul_pair_single_sse
...
~2% faster.
8 years ago
James Almer
be3809a521
x86/aacpsdsp: optimize ff_ps_stereo_interpolate_sse3
...
Move the unpacking outside of the loop. 5% to 10% faster.
Suggested-by: ubitux
Signed-off-by: James Almer <jamrial@gmail.com>
8 years ago
James Almer
b5a0971ff0
x86/aacps: add ff_ps_stereo_interpolate_ipdopd_sse3()
...
About 2x faster than the c version.
Signed-off-by: James Almer <jamrial@gmail.com>
8 years ago
James Darnley
0dea0114fb
avcodec/x86/idctdsp_init: reindent
8 years ago
James Darnley
8e89f6fd37
avcodec/x86: move simple_idct to external assembly
8 years ago
Clément Bœsch
584366a436
lavc/mpegvideoenc: reformat inv_zigzag_direct16 so the zigzag pattern is visible
8 years ago
James Darnley
7aa90b4e94
avcodec/h264: add sse2 versions of previous idct functions
...
Kaby Lake Pentium:
- ff_h264_idct_add_8_sse2: ~1.18x faster than mmxext
- ff_h264_idct_dc_add_8_sse2: ~1.07x faster than mmxext
8 years ago
James Darnley
27460dfebc
avcodec/h264: add avx 8-bit h264_idct_dc_add
...
Haswell:
- 1.02x faster (405±0.7 vs. 397±0.8 decicycles) compared with mmxext
Skylake-U:
- 1.06x faster (498±1.8 vs. 470±1.3 decicycles) compared with mmxext
8 years ago
James Darnley
f61d454ca1
avcodec/h264: add avx 8-bit h264_idct_add
...
Haswell:
- 1.11x faster (522±0.4 vs. 469±1.8 decicycles) compared with mmxext
Skylake-U:
- 1.21x faster (671±5.5 vs. 555±1.4 decicycles) compared with mmxext
8 years ago
James Darnley
b5325c6711
avcodec/h264: use some 3 operand forms
8 years ago
James Darnley
060ba9e5e3
avcodec/h264: change RETs into REP_RETs where appropriate
8 years ago
Michael Niedermayer
fa8fd0808f
avcodec/x86/vc1dsp_init: Fix build failure with --disable-optimizations and clang
...
compilers doing DCE at -O0 do not necessarily understand "complex" boolean expressions
Build succeeds with this change, this was the only failure
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
8 years ago
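The kind of failure described above arises when a reference to a function that was never built is guarded only by a run-time-looking condition, and the compiler is expected to remove the dead branch. The sketch below shows one way to avoid depending on dead code elimination, by guarding the reference with the preprocessor. It is an assumption-laden illustration, not the change made in this commit: HAVE_FAST_IMPL, add_fast, add_c and select_add are all hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical configuration macro: 0 means the fast version is not built. */
#define HAVE_FAST_IMPL 0

typedef int (*add_fn)(int, int);

/* Portable fallback that is always available. */
static int add_c(int a, int b)
{
    return a + b;
}

#if HAVE_FAST_IMPL
/* Would live in a separate object file that is only built when enabled. */
int add_fast(int a, int b);
#endif

/* Pick an implementation without ever naming a symbol that is not built. */
static add_fn select_add(int cpu_has_feature)
{
#if HAVE_FAST_IMPL
    if (cpu_has_feature)
        return add_fast;
#else
    (void)cpu_has_feature;   /* unused when the fast path is compiled out */
#endif
    return add_c;
}
```

Because the preprocessor removes the reference before compilation, the link succeeds at any optimization level. The trade-off is that the guarded branch is no longer type-checked when the feature is disabled.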
Ronald S. Bultje
83ae7e6350
x86/idctdsp_init: reindent.
8 years ago
Ronald S. Bultje
e0c205677f
x86/simple_idct: add explicit sse2 simple_idct_put/add versions.
...
These use the mmx IDCT, but sse2 put/add_pixels_clamped implementations.
This way we don't need to use the ff_put/add_pixels_clamped function
pointers.
8 years ago
Ronald S. Bultje
2f0591cfa3
cavs: add a sse2 idct implementation.
...
This makes using the function pointer ff_add_pixels_clamped() unnecessary,
since we always know what the best implementation is at compile-time.
8 years ago
Ronald S. Bultje
c9d98c5649
cavs: convert idct from inline asm to yasm.
8 years ago
Ronald S. Bultje
b51d7d89f8
x86/xvididct: remove use of ff_put/add_pixels_clamped function pointer.
...
Since there are separate SSE2 implementations of xvid_idct_put/add, this
patch has no practical impact on performance.
8 years ago