While the inline cabac assembly has worked correctly in i386 builds
historically, modern compiler updates has started showing issues
with it, when the function gets inlined into larger contexts that
fail to provide the amount of free registers as this function
requires.
This was an issue with Clang on Windows on i386, which was fixed
in c6d284b945324a7bc70ea8b9056040c8148aa835. However, recently
the same issues also have started showing up with GCC (both for
Windows and Linux). Whether the issue appears seems dependent on
a lot of optimizer tuning (e.g. the issue appears or goes away
depenent on the combinaton of -march= and -mtune= options),
potentially due to the compiler making different decisions on
how much to inline.
Fixes: https://trac.ffmpeg.org/ticket/8903
Signed-off-by: Martin Storsjö <martin@martin.st>
From x86inc:
> On AMD cpus <=K10, an ordinary ret is slow if it immediately follows either
> a branch or a branch target. So switch to a 2-byte form of ret in that case.
> We can automatically detect "follows a branch", but not a branch target.
> (SSSE3 is a sufficient condition to know that your cpu doesn't have this problem.)
x86inc can automatically determine whether to use REP_RET rather than
REP in most of these cases, so impact is minimal. Additionally, a few
REP_RETs were used unnecessary, despite the return being nowhere near a
branch.
The only CPUs affected were AMD K10s, made between 2007 and 2011, 16
years ago and 12 years ago, respectively.
In the future, everyone involved with x86inc should consider dropping
REP_RETs altogether.
simple_idct.asm is 32 bit-only since
bfb28b5ce8,
whereas simple_idct10.asm is x64-only. So don't build
the ultimately unneeded and empty files, as some linkers
complain about this: "ranlib: file:
libavcodec/libavcodec.a(simple_idct.o) has no symbols"
(this is from an Xcode toolchain as reported by Ronald S. Bultje).
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
avx512 on Skylake-X (Xeon D-2123IT):
1.19x faster (970±91.2 vs. 817±104.4 decicycles) compared with avx2
avx512icl on Ice Lake (Xeon Silver 4316):
2.52x faster (1350±5.3 vs. 535±9.5 decicycles) compared with avx2
Negligible speed difference for avx2 on Zen 2 (Ryzen 5700X) and
Broadwell (Xeon E5-2620 v4):
1690±4.3 decicycles vs. 1693±78.4
1439±31.1 decicycles vs 1429±16.7
Moderate speedup with avx512 on Skylake-X (Xeon D-2123IT):
1.22x faster (793±0.8 vs. 649±5.5 decicycles) compared with avx2
Better speedup with avx512icl on Ice Lake (Xeon Silver 4316):
1.77x faster (784±1.8 vs. 442±11.6 decicycles) compared with avx2
Co-authors:
Henrik Gramner <henrik@gramner.com>
Kieran Kunhya <kierank@obe.tv>
It is only used by gmc/gmc1 which is only used by the MPEG-4
decoder, so move it to Mpeg4DecContext and rename it
to Mpeg4VideoDSP. Also compile it iff the MPEG-4 decoder is compiled.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Currently, SVQ1EncContext is defined in a header that is also
included by the arch-specific code that initializes the one
and only dsp function that this encoder uses directly.
But the arch-specific functions to set this dsp function
do not need anything from SVQ1EncContext. This commit therefore
adds a small SVQ1EncDSPContext whose only member is said
function pointer and renames svq1enc.h to svq1encdsp.h
to avoid exposing unnecessary internals to these init
functions (and the whole mpegvideo with it).
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Pointers to void can be converted to any pointer to incomplete or
object type and back; but they are nevertheless not completely generic
pointers: There is no provision in the C standard that guarantees their
convertibility with function pointers. C90 lacks a generic function
pointer, C99 made every function pointer a generic function pointer and
still disallows the convertibility with void *. Both GCC as well as
Clang warn about this when using -pedantic.
Therefore use unions to avoid these conversions.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Old one was written with the assumption only even inputs would be given.
This very messy replacement supports even and odd inputs, and supports
AVX2 for extra speed. The buffers given are usually quite big (4k samples),
so the speedup is worth it.
The new SSE version is still faster than the old inline asm version by 33%.
Also checkasm is provided to make sure this monstrosity works.
This fixes some FATE tests.
Reviewed-by: Peter Ross <pross@xvid.org>
Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Inside a function, the second ';' in ";;" is just a null statement,
but it is actually illegal outside of functions. Compilers
nevertheless accept it without warning, except when in -pedantic
mode when e.g. Clang emits a -Wextra-semi warning. Therefore
remove the unnecessary ';'.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
The height is hardcoded in some of the me_cmp functions, but not
in all of them. But in the case of all other functions, it's hardcoded
in the same place in SIMD functions as in the C reference functions,
while this one function differs from the behaviour of the C code.
(Before 542765ce3e, there were a
couple other sad8_*_mmx functions with similar hardcoded height.)
Signed-off-by: Martin Storsjö <martin@martin.st>