James Almer
497a4b554c
x86/aacpsdsp: fix output of ff_ps_stereo_interpolate_ipdopd_sse3
...
The fate-aac-al_sbr_ps_04_ur test did not detect this mistake.
8 years ago
Ilia Valiakhmetov
73d9a9a6af
libavcodec/vp9: ipred_dl_32x32_16 avx2 implementation
...
vp9_diag_downleft_32x32_8bpp_c: 580.2
vp9_diag_downleft_32x32_8bpp_sse2: 75.6
vp9_diag_downleft_32x32_8bpp_ssse3: 73.7
vp9_diag_downleft_32x32_8bpp_avx: 72.7
vp9_diag_downleft_32x32_10bpp_c: 1101.2
vp9_diag_downleft_32x32_10bpp_sse2: 145.4
vp9_diag_downleft_32x32_10bpp_ssse3: 137.5
vp9_diag_downleft_32x32_10bpp_avx: 134.8
vp9_diag_downleft_32x32_10bpp_avx2: 94.0
vp9_diag_downleft_32x32_12bpp_c: 1108.5
vp9_diag_downleft_32x32_12bpp_sse2: 145.5
vp9_diag_downleft_32x32_12bpp_ssse3: 137.3
vp9_diag_downleft_32x32_12bpp_avx: 135.2
vp9_diag_downleft_32x32_12bpp_avx2: 94.0
~30% faster than avx implementation
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
8 years ago
James Almer
933dd62288
x86/aacpsdsp: optimize ff_ps_mul_pair_single_sse
...
~2% faster.
8 years ago
James Almer
be3809a521
x86/aacpsdsp: optimize ff_ps_stereo_interpolate_sse3
...
Move the unpacking outside of the loop. 5% to 10% faster.
Suggested-by: ubitux
Signed-off-by: James Almer <jamrial@gmail.com>
8 years ago
James Almer
b5a0971ff0
x86/aacps: add ff_ps_stereo_interpolate_ipdopd_sse3()
...
About 2x faster than the C version.
Signed-off-by: James Almer <jamrial@gmail.com>
8 years ago
James Darnley
0dea0114fb
avcodec/x86/idctdsp_init: reindent
8 years ago
James Darnley
8e89f6fd37
avcodec/x86: move simple_idct to external assembly
8 years ago
Clément Bœsch
584366a436
lavc/mpegvideoenc: reformat inv_zigzag_direct16 so the zigzag pattern is visible
8 years ago
James Darnley
7aa90b4e94
avcodec/h264: add sse2 versions of previous idct functions
...
Kaby Lake Pentium:
- ff_h264_idct_add_8_sse2: ~1.18x faster than mmxext
- ff_h264_idct_dc_add_8_sse2: ~1.07x faster than mmxext
8 years ago
James Darnley
27460dfebc
avcodec/h264: add avx 8-bit h264_idct_dc_add
...
Haswell:
- 1.02x faster (405±0.7 vs. 397±0.8 decicycles) compared with mmxext
Skylake-U:
- 1.06x faster (498±1.8 vs. 470±1.3 decicycles) compared with mmxext
8 years ago
James Darnley
f61d454ca1
avcodec/h264: add avx 8-bit h264_idct_add
...
Haswell:
- 1.11x faster (522±0.4 vs. 469±1.8 decicycles) compared with mmxext
Skylake-U:
- 1.21x faster (671±5.5 vs. 555±1.4 decicycles) compared with mmxext
8 years ago
James Darnley
b5325c6711
avcodec/h264: use some 3 operand forms
8 years ago
James Darnley
060ba9e5e3
avcodec/h264: change RETs into REP_RETs where appropriate
8 years ago
Michael Niedermayer
fa8fd0808f
avcodec/x86/vc1dsp_init: Fix build failure with --disable-optimizations and clang
...
Compilers doing DCE at -O0 do not necessarily understand "complex" boolean expressions.
The build succeeds with this change; this was the only failure.
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
8 years ago
Ronald S. Bultje
83ae7e6350
x86/idctdsp_init: reindent.
8 years ago
Ronald S. Bultje
e0c205677f
x86/simple_idct: add explicit sse2 simple_idct_put/add versions.
...
These use the mmx IDCT, but sse2 put/add_pixels_clamped implementations.
This way we don't need to use the ff_put/add_pixels_clamped function
pointers.
8 years ago
Ronald S. Bultje
2f0591cfa3
cavs: add a sse2 idct implementation.
...
This makes using the function pointer ff_add_pixels_clamped() unnecessary,
since we always know what the best implementation is at compile-time.
8 years ago
Ronald S. Bultje
c9d98c5649
cavs: convert idct from inline asm to yasm.
8 years ago
Ronald S. Bultje
b51d7d89f8
x86/xvididct: remove use of ff_put/add_pixels_clamped function pointer.
...
Since there are separate SSE2 implementations of xvid_idct_put/add, this
patch has no practical impact on performance.
8 years ago
James Almer
6171f178e7
x86/hevc_add_res: merge last remaining changes from 3d65359832
...
See https://lists.libav.org/pipermail/libav-devel/2016-October/079829.html
8 years ago
Ronald S. Bultje
f8c019944d
vp9: re-split the decoder/format/dsp interface header files.
...
The advantage here is that the internal software decoder interface is
not exposed to the DSP functions or the hardware accelerations.
8 years ago
Clément Bœsch
1c9f4b5078
lavc/vp9: split into vp9{block,data,mvs}
...
This is following Libav layout to ease merges.
8 years ago
Michael Niedermayer
73fb40dc87
avcodec/x86/idctdsp: Remove duplicate include
...
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
8 years ago
James Almer
ac42f08099
x86/hevc_add_res: merge missing changes from 3d65359832
...
Unrolling the loops triples the size of the assembled output
while not generating any gain in performance.
8 years ago
Clément Bœsch
40ac226014
lavc/x86/hevc: rename hevc_res_add to hevc_add_res
...
This will simplify incoming merge.
8 years ago
Diego Biurrun
dcc39ee10e
lavc: Remove deprecated XvMC support hacks
...
Deprecated in 11/2013.
8 years ago
James Almer
30cadfe071
avcodec/lossless_videodsp: use ptrdiff_t for length parameters
...
Signed-off-by: James Almer <jamrial@gmail.com>
8 years ago
Clément Bœsch
af607b7e07
lavc/huffyuvdsp: only transmit the pix_fmt instead of the whole avctx
...
Only the pixel format is required in that init function. This will also
simplify the incoming merge.
8 years ago
James Almer
aee046a895
x86/audiodsp: remove an unnecessary movss
8 years ago
Ilia
2f3d10a01a
avcodec/vp9: avx2 implementation of ipred_dl_16x16_16
...
vp9_diag_downleft_16x16_10bpp_c: 263.0
vp9_diag_downleft_16x16_10bpp_sse2: 44.7
vp9_diag_downleft_16x16_10bpp_ssse3: 32.5
vp9_diag_downleft_16x16_10bpp_avx: 31.9
vp9_diag_downleft_16x16_10bpp_avx2: 25.7
vp9_diag_downleft_16x16_12bpp_c: 264.7
vp9_diag_downleft_16x16_12bpp_sse2: 44.4
vp9_diag_downleft_16x16_12bpp_ssse3: 32.0
vp9_diag_downleft_16x16_12bpp_avx: 32.4
vp9_diag_downleft_16x16_12bpp_avx2: 25.5
Benchmarked with 10000 runs
Signed-off-by: Ilia <zakne0ne@gmail.com>
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
8 years ago
Mirage Abeysekara
5eb4f95bef
h264pred: added AVX2 implementation for tm_vp8 16x16.
...
checkasm --bench results with 5000 runs
pred16x16_tm_vp8_c: 302.8
pred16x16_tm_vp8_mmx: 101.4
pred16x16_tm_vp8_mmxext: 95.5
pred16x16_tm_vp8_sse2: 95.1
pred16x16_tm_vp8_avx2: 38.2
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
8 years ago
Diego Biurrun
681a86aba6
x86: fft: Port to cpuflags
8 years ago
Diego Biurrun
e9bb77fb10
x86: h264: Simplify DEQUANT macro with cpuflags
8 years ago
Diego Biurrun
307eb1a8ee
x86: vp8dsp: port FILTER_BILINEAR macro to cpuflags
8 years ago
Diego Biurrun
994c4bc107
x86util: Port all macros to cpuflags
...
Also make some small cosmetic changes: drop the pointless _MMX suffix from the ABSD2
macro name, drop the pointless check for MMX support (we always assume MMX is
available in our SIMD code), and fix spelling.
8 years ago
Michael Niedermayer
835d9f299c
avcodec/x86/cavsdsp: Put MMX code under mmx check
...
Without this the FPU state becomes trashed and causes mysterious
fate failures with cpuflags=0
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
8 years ago
Diego Biurrun
6eef263aca
x86: Merge align directives into SECTION_RODATA declarations where possible
8 years ago
Diego Biurrun
39e208f4d4
build: Generalize yasm/nasm-related variable names
...
None of them are specific to the YASM assembler.
8 years ago
Diego Biurrun
fde7ee8710
x86: hevc: Add missing colons after assembly labels
...
This fixes several warnings of the sort
warning: label alone on a line without a colon might be in error
8 years ago
James Darnley
33de0fee2c
avcodec/h264: enable sse2 chroma deblock/loop filter functions
...
Between 1.00 and 1.16 times faster on Intel Yorkfield Core 2 Quad.
Between 1.11 and 1.39 times faster on Intel Kaby Lake Pentium.
8 years ago
James Darnley
cd893b9307
avcodec/h264: add avx 8-bit 4:2:2 chroma h intra deblock/loop filter
...
~1.37x faster (147 vs. 108 cycles) compared to mmxext function
8 years ago
James Darnley
0e16b3e2be
avcodec/h264: add avx 8-bit 4:2:0 chroma h intra deblock/loop filter
...
~1.10x faster (69 vs. 63 cycles) compared to mmxext function
8 years ago
James Darnley
987ffe4b8d
avcodec/h264: add avx 8-bit chroma v intra deblock/loop filter
...
~1.14x faster (90 vs. 78 cycles) compared with mmxext
8 years ago
James Darnley
88307b3eec
avcodec/h264: add avx 8-bit 4:2:2 chroma h deblock/loop filter
...
~1.21x faster (68 vs. 56 cycles) compared with mmxext function
8 years ago
James Darnley
ac096fc82d
avcodec/h264: add avx 8-bit 4:2:0 chroma h deblock/loop filter
...
~1.14x faster (93 vs. 81 cycles) compared with mmxext function
8 years ago
James Darnley
5c56758843
avcodec/h264: add avx 8-bit chroma v deblock/loop filter
...
~1.24x faster (101 vs. 81 cycles) compared with mmxext function
8 years ago
James Darnley
5336887867
avcodec/h264: sse2, avx h luma mbaff deblock/loop filter
...
x86-64 only
Yorkfield:
- sse2: ~2.17x (434 vs. 200 cycles)
Nehalem:
- sse2: ~2.94x (409 vs. 139 cycles)
Skylake:
- sse2: ~3.10x (370 vs. 119 cycles)
- avx: ~3.29x (370 vs. 112 cycles)
8 years ago
James Darnley
e18bc2114f
avcodec/h264: add named parameters to x86 function
8 years ago
James Darnley
9d815b7424
avcodec/x86: deduplicate PASS8ROWS macro
8 years ago
Diego Biurrun
7abdd026df
asm: Consistently uppercase SECTION markers
8 years ago