From x86inc:
> On AMD cpus <=K10, an ordinary ret is slow if it immediately follows either
> a branch or a branch target. So switch to a 2-byte form of ret in that case.
> We can automatically detect "follows a branch", but not a branch target.
> (SSSE3 is a sufficient condition to know that your cpu doesn't have this problem.)
x86inc can automatically determine whether to use REP_RET rather than
RET in most of these cases, so impact is minimal. Additionally, a few
REP_RETs were used unnecessarily, despite the return being nowhere near a
branch.
The only CPUs affected were AMD K10s, made between 2007 and 2011, 16
years ago and 12 years ago, respectively.
In the future, everyone involved with x86inc should consider dropping
REP_RETs altogether.
This commit enables assembly code using Intel AVX512 VNNI and adds a
unit test for the Sobel filter:
sobel_c: 4537
sobel_avx512icl: 2136
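For reference, a minimal scalar sketch of the 3x3 Sobel operator being
vectorized (illustrative only, not the exact FFmpeg implementation; the
padding and clipping conventions are assumptions):

    #include <math.h>
    #include <stddef.h>
    #include <stdint.h>

    /* src points at the current pixel of a padded 8-bit plane with the
     * given stride; returns the clipped gradient magnitude. */
    static uint8_t sobel_pixel(const uint8_t *src, ptrdiff_t stride)
    {
        int gx = -src[-stride - 1] - 2 * src[-1]       - src[stride - 1]
                 + src[-stride + 1] + 2 * src[ 1]      + src[stride + 1];
        int gy = -src[-stride - 1] - 2 * src[-stride]  - src[-stride + 1]
                 + src[ stride - 1] + 2 * src[ stride] + src[ stride + 1];
        int mag = (int)lrint(sqrt((double)gx * gx + (double)gy * gy));
        return mag > 255 ? 255 : mag;
    }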
Signed-off-by: bwang30 <bin.wang@intel.com>
Signed-off-by: Haihao Xiang <haihao.xiang@intel.com>
There is no MMX code for (add|put|put_signed)_pixels_clamped
since commit bfb28b5ce8, so use
declare_func instead of declare_func_emms() to also test that
we are not in MMX mode after return.
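A hedged sketch of what this change looks like inside the checkasm test
(the function signature and declare_func_emms() argument list here are
illustrative, not copied from the tree):

    /* before: tolerated the function returning with the FPU in MMX state */
    declare_func_emms(AV_CPU_FLAG_MMX, void, int16_t *block,
                      uint8_t *pixels, ptrdiff_t line_size);
    /* after: no MMX implementations remain, so the test also fails if the
     * function returns in MMX mode without having executed emms */
    declare_func(void, int16_t *block, uint8_t *pixels, ptrdiff_t line_size);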
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
There is no MMX code for diff_bytes since commit
230ea38de1, so use declare_func
instead of declare_func_emms() to also test that we are not
in MMX mode after return.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
There is no MMX code for add_int16 since commit
4b6ffc2880, so use declare_func
instead of declare_func_emms() to also test that we are not
in MMX mode after return.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
There is no MMX code for llviddsp after commit
fed07efcde, so use declare_func
instead of declare_func_emms() to also test that we are not
in MMX mode after return.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
There is no MMX code for pixblockdsp after commit
92b5800277, so use declare_func
instead of declare_func_emms() to also test that we are not
in MMX mode after return.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
There is no MMX code for audiodsp after commit
3d716d38ab, so use declare_func
instead of declare_func_emms() to also test that we are not
in MMX mode after return.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
There is no MMX code for blockdsp after commit
ee551a21dd, so use declare_func
instead of declare_func_emms() to also test that we are not
in MMX mode after return.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
There is no MMX code for vc1_inv_trans_8x8 or
vc1_unescape_buffer, so use declare_func instead of
declare_func_emms() to also test that we are not in MMX
mode after return.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
There is no MMX code for loop filters since commit
6a551f1405, so use declare_func
instead of declare_func_emms() to also test that we are not
in MMX mode after return.
Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Unfortunately, it is common, and will remain so, that the bit
manipulation extensions are not enabled at compilation time. It is
official policy for Debian ports in general (though they do not
officially support RISC-V as of yet) to stick to the minimal target
baseline, which does not include the B extension or even its Zbb
subset.
For inline helpers (CPOP, REV8), compiler builtins (CTZ, CLZ) or
even plain C code (MIN, MAX, MINU, MAXU), run-time detection seems
impractical. But at least it can work for the byte-swap DSP functions.
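A hedged sketch of what run-time dispatch for the byte-swap DSP could
look like (the assembly symbol is hypothetical; the init function and
CPU flag names follow the usual FFmpeg pattern but are illustrative):

    #include "libavutil/cpu.h"
    #include "libavcodec/bswapdsp.h"

    /* hypothetical assembly routine implementing bswap_buf with Zbb rev8 */
    void ff_bswap32_buf_zbb(uint32_t *dst, const uint32_t *src, int len);

    void ff_bswapdsp_init_riscv(BswapDSPContext *c)
    {
        int flags = av_get_cpu_flags();

        if (flags & AV_CPU_FLAG_RVB_BASIC)   /* Zbb detected at run time */
            c->bswap_buf = ff_bswap32_buf_zbb;
    }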
RVV defines a total of 12 different extensions, including:
- 5 different instruction subsets:
- Zve32x: 8-, 16- and 32-bit integers,
- Zve32f: Zve32x plus single precision floats,
- Zve64x: Zve32x plus 64-bit integers,
- Zve64f: Zve32f plus Zve64x,
- Zve64d: Zve64f plus double precision floats.
- 6 different vector lengths:
- Zvl32b (embedded only),
- Zvl64b (embedded only),
- Zvl128b,
- Zvl256b,
- Zvl512b,
- Zvl1024b,
- and the V extension proper: equivalent to Zve64d and Zvl128b.
In total, there are 6 different possible sets of supported instructions
(including the empty set), but for convenience we allocate one bit per
supported type: up-to-32-bit integers (RVV_I32), single-precision floats
(RVV_F32), 64-bit integers (RVV_I64) and double-precision floats (RVV_F64).
Where the vector size is needed, it can be retrieved by reading the
unprivileged read-only vlenb CSR. This should probably be a separate
helper macro if needed at a later point.
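As a rough illustration (a sketch, not the actual header), the per-set
bits described above and the vlenb read could look like this; the macro
names mirror the flag names mentioned in this message:

    #include <stddef.h>

    /* one bit per supported set */
    #define RVV_I32  (1 << 0)   /* Zve32x: 8/16/32-bit integer vectors   */
    #define RVV_F32  (1 << 1)   /* Zve32f: adds single-precision floats  */
    #define RVV_I64  (1 << 2)   /* Zve64x: adds 64-bit integer vectors   */
    #define RVV_F64  (1 << 3)   /* Zve64d: adds double-precision floats  */

    /* Read the unprivileged read-only vlenb CSR, which holds the vector
     * register length in bytes. */
    static inline size_t rvv_vlenb(void)
    {
        size_t vlenb;
        __asm__ ("csrr %0, vlenb" : "=r" (vlenb));
        return vlenb;
    }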
This introduces compile-time and run-time CPU detection on RISC-V. In
practice, I doubt that FFmpeg will ever see a RISC-V CPU without all of
I, F and D extensions, and if it does, it probably won't have run-time
detection. So the flags are essentially always set.
But as things stand, checkasm wants them that way. Compare the ARMV8
flag on AArch64. We are nowhere near running short on CPU flag bits.
~4x faster than the C version.
The shuffles in the 15pt dim1 are seriously expensive. Not happy with it,
but I'm content.
Can be easily converted to pure AVX by removing all vpermpd/vpermps
instructions.
The old one was written with the assumption that only even inputs would be given.
This very messy replacement supports even and odd inputs, and supports
AVX2 for extra speed. The buffers given are usually quite big (4k samples),
so the speedup is worth it.
The new SSE version is still faster than the old inline asm version by 33%.
A checkasm test is also provided to make sure this monstrosity works.
This fixes some FATE tests.
This avoids triggering overflows in the filters, and avoids stray
test failures in the approximate functions on x86; due to rounding
differences, one implementation might overflow while another one
doesn't.
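A hedged sketch of the idea (the buffer name, size and mask value are
placeholders, not the exact values in the patch):

    /* In the checkasm test: fill the input with reduced-range random
     * values so that intermediate sums in the filters cannot overflow.
     * rnd() is checkasm's random helper. */
    for (int i = 0; i < INPUT_SIZE; i++)
        src[i] = rnd() & 0x3fff;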
Signed-off-by: Martin Storsjö <martin@martin.st>
Previously, the checkasm test always passed h=8, so no other cases
were tested.
Out of the me_cmp functions, some are in practice hardcoded to always
assume an 8x8 block (ignoring the h parameter), while others
do use the parameter. For those with hardcoded height, both the
reference C function and the assembly implementations ignore the
parameter similarly.
The documentation for the functions indicates that heights between
w/2 and 2*w, within the range of 4 to 16, should be supported. This
patch just tests random heights in that range, without knowing what
width the current function actually uses.
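A hedged sketch of the height selection (not the exact patch; the call
arguments are illustrative, following the me_cmp_func signature):

    /* pick a random height in the documented 4..16 range; rnd() is
     * checkasm's random helper */
    int h = 4 + (rnd() % 13);
    int d_ref = call_ref(NULL, src, ref, stride, h);
    int d_new = call_new(NULL, src, ref, stride, h);
    if (d_ref != d_new)
        fail();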
Signed-off-by: Martin Storsjö <martin@martin.st>
Change the reference to exactly match the C reference in swscale,
instead of exactly matching the x86 SIMD implementations (which
differ slightly). Test with and without SWS_ACCURATE_RND - if this
flag is set, the output must match the C reference exactly;
otherwise it is allowed to be off by 2.
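A hedged sketch of that comparison rule (the helper name, buffer types
and the accurate/width variables are illustrative):

    #include <stddef.h>
    #include <stdint.h>
    #include <stdlib.h>

    static int within_tolerance(const uint8_t *ref, const uint8_t *test,
                                size_t n, int tolerance)
    {
        for (size_t i = 0; i < n; i++)
            if (abs(ref[i] - test[i]) > tolerance)
                return 0;
        return 1;
    }

    /* exact match required with SWS_ACCURATE_RND, off-by-2 allowed without */
    if (!within_tolerance(dst_ref, dst_new, width, accurate ? 0 : 2))
        fail();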
Mark a couple of x86 functions as unavailable when SWS_ACCURATE_RND
is set - apparently this discrepancy hasn't been noticed in other
exact tests before.
Add a test for yuv2plane1.
Signed-off-by: Jonathan Swinney <jswinney@amazon.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
This codepath is enabled by default on ARM, if the Linux perf API
is available, unless disabled with --disable-linux-perf.
Signed-off-by: Martin Storsjö <martin@martin.st>
- ff_pix_abs16_neon
- ff_pix_abs16_xy2_neon
In direct micro-benchmarks of these ff_ functions versus their C
implementations, they performed as follows on an AWS Graviton 3:
ff_pix_abs16_neon:
pix_abs_0_0_c: 141.1
pix_abs_0_0_neon: 19.6
ff_pix_abs16_xy2_neon:
pix_abs_0_3_c: 269.1
pix_abs_0_3_neon: 39.3
Tested with:
./tests/checkasm/checkasm --test=motion --bench --disable-linux-perf
Signed-off-by: Jonathan Swinney <jswinney@amazon.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
The movdqa instruction used in ff_yuv2yuvX_sse3() expects a 16-byte-aligned
memory address, or else a segfault is generated.
The src_pixels buffer below was not necessarily aligned to 16 bytes on the
stack, so we got segfaults during fate-checkasm-sw_scale.
Therefore, 16-byte align all of these local variables; over-aligning them
shouldn't hurt.
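A hedged sketch of the fix (the buffer name and size macros are
illustrative):

    #include "libavutil/mem_internal.h"

    /* force 16-byte alignment of the on-stack buffer so movdqa is safe */
    LOCAL_ALIGNED_16(int16_t, src_pixels, [LARGEST_FILTER * LARGEST_INPUT_SIZE]);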