SSE-optimized hScale() scales up to 4 pixels at once, so we need to
allocate up to 3 padding pixels to prevent overreads. This fixes
valgrind errors in various swscale-tests on fate.
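
As a rough sketch of the idea (names and constants here are illustrative, not
the actual allocation code), each scaler input line simply gets over-allocated
by SIMD-width minus one pixels:

    #include <stdint.h>
    #include <stdlib.h>

    #define HSCALE_SIMD_PIXELS 4                      /* pixels per SSE iteration */
    #define HSCALE_PADDING     (HSCALE_SIMD_PIXELS - 1)

    /* Over-allocate so a 4-pixel load starting at the last valid pixel
     * still lands inside the buffer. */
    static int16_t *alloc_hscale_line(int width)
    {
        return calloc(width + HSCALE_PADDING, sizeof(int16_t));
    }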
Speed: between 3.9x and 9.6x improvement over the C code, plus some small
(up to 15%) gains over the existing MMX code (particularly for bigger
filters).
This allows using more specific implementations for chroma/luma, e.g.
we can assume filterSize is constant, thus avoiding that test at
runtime.
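
A sketch of the kind of specialization meant (the function names, the
fixed-point shift and the filterSize value of 4 are all illustrative):

    /* Generic scaler: filterSize is only known at runtime. */
    static void hscale_generic(int16_t *dst, const uint8_t *src,
                               const int16_t *filter, const int32_t *filterPos,
                               int dstW, int filterSize)
    {
        for (int i = 0; i < dstW; i++) {
            int val = 0;
            for (int j = 0; j < filterSize; j++)    /* bound tested per pixel */
                val += src[filterPos[i] + j] * filter[filterSize * i + j];
            dst[i] = val >> 7;
        }
    }

    /* Specialized variant: filterSize is known to be 4, so the inner loop
     * can be fully unrolled (or vectorized) with no runtime test. */
    static void hscale_filter4(int16_t *dst, const uint8_t *src,
                               const int16_t *filter, const int32_t *filterPos,
                               int dstW)
    {
        for (int i = 0; i < dstW; i++) {
            const int16_t *f = filter + 4 * i;
            int            p = filterPos[i];
            dst[i] = (src[p]     * f[0] + src[p + 1] * f[1] +
                      src[p + 2] * f[2] + src[p + 3] * f[3]) >> 7;
        }
    }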
It just does that part in scalar form; I doubt that using a vector store
across two arrays would speed it up much.
The function should be written to not use a scratch buffer.
The logged information may be incorrect, and it tends to go stale after
each change since the logging code has to be updated by hand.
Simplify, and prevent confusing, incorrect debug messages.
Signed-off-by: Luca Barbato <lu_zero@gentoo.org>
Also remove the unnecessary isSupportedIn/Out macros.
Make the code more compact and readable, and simplify access to
lsws-specific pixel format information.
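
For illustration, descriptor-based access could look roughly like this;
the exact accessor used by the patch is an assumption, av_pix_fmt_desc_get()
is the current libavutil entry point:

    #include <libavutil/pixdesc.h>

    /* Query descriptor flags instead of maintaining per-format macros. */
    static int fmt_is_planar_yuv(enum AVPixelFormat fmt)
    {
        const AVPixFmtDescriptor *desc = av_pix_fmt_desc_get(fmt);
        return desc &&
               (desc->flags & AV_PIX_FMT_FLAG_PLANAR) &&
              !(desc->flags & AV_PIX_FMT_FLAG_RGB);
    }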
Signed-off-by: Luca Barbato <lu_zero@gentoo.org>
When converting an RGB format to an RGB format with the same bits per
sample, the unscaled path performs the conversion on the whole buffer at
once. For a non-multiple-of-16 BGR24 to RGB24 conversion this means that
the padding at the end of each line is converted too. Since that padding
may be of arbitrary length (e.g. 8 bytes), operating on the whole buffer
produces obviously wrong results.
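
Presumably the byte swap has to be done line by line, honouring the strides,
rather than over stride * height bytes: an 8-byte padding is not a multiple
of the 3-byte pixel size, so it desynchronizes the swap for all following
lines. A sketch of the per-line approach (names are illustrative):

    #include <stdint.h>

    static void bgr24_to_rgb24_lines(uint8_t *dst, int dst_stride,
                                     const uint8_t *src, int src_stride,
                                     int width, int height)
    {
        for (int y = 0; y < height; y++) {
            const uint8_t *s = src + y * src_stride;
            uint8_t       *d = dst + y * dst_stride;
            for (int x = 0; x < width; x++) {      /* visible pixels only */
                d[3 * x + 0] = s[3 * x + 2];
                d[3 * x + 1] = s[3 * x + 1];
                d[3 * x + 2] = s[3 * x + 0];
            }
        }
    }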
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
ptrdiff_t can be 4 bytes, which leads to the next element being 4-byte
aligned and thus at a different offset than intended. Forcing 8-byte
alignment guarantees that dither16/32 sit at the same offset on x86-32
and x86-64.
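
A minimal illustration of the layout issue (member names are hypothetical,
not the real SwsContext):

    #include <stdalign.h>
    #include <stddef.h>
    #include <stdint.h>

    typedef struct ExampleCtx {
        ptrdiff_t           off;          /* 4 bytes on x86-32, 8 on x86-64 */
        alignas(8) uint16_t dither16[8];  /* without alignas(8) this would  */
        uint32_t            dither32[8];  /* start at offset 4 vs. 8        */
    } ExampleCtx;

    /* Hand-written asm can now hardcode the same offset on both ABIs. */
    _Static_assert(offsetof(ExampleCtx, dither16) == 8, "offset differs");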
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
We operated on 31 bits, but with e.g. lanczos scaling, values can
add up to beyond 0x80000000, thus leading to output of zeroes. Dropping
one bit of precision fixes this.
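
Back-of-the-envelope illustration (the 9% overshoot factor is made up;
lanczos taps overshoot by a filter-dependent amount, since the positive
taps sum to more than 1.0 to compensate for the negative lobes):

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* A bright flat area scales to more than the working maximum. */
        int64_t scaled31 = (int64_t)0x7FFFFFFF * 109 / 100;  /* 31-bit scale */
        int64_t scaled30 = (int64_t)0x3FFFFFFF * 109 / 100;  /* 30-bit scale */
        printf("31-bit: 0x%" PRIx64 " fits int32: %d\n",
               scaled31, scaled31 <= INT32_MAX);              /* wraps: no   */
        printf("30-bit: 0x%" PRIx64 " fits int32: %d\n",
               scaled30, scaled30 <= INT32_MAX);              /* fits: yes   */
        return 0;
    }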
Remove the unused variables "flags" and "dstFormat" in yuv2packed1,
merge source rows per plane for yuv2packed[12], and make every
source argument int16_t (some were incorrectly declared as uint16_t).
This prevents stack pollution and is part of the Great Evil Plan
to simplify swscale.
This will likely lead to a considerable performance boost,
since it removes a branch from the inner loop. Part of the
Great Evil Plan to simplify swscale.
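
A generic illustration of the pattern (these are not the real swscale output
functions): the per-format branch is resolved once when picking a function
pointer, instead of once per pixel inside the loop:

    #include <stdint.h>

    typedef void (*output_fn)(uint8_t *dst, const int32_t *src, int w);

    static void output_16le(uint8_t *dst, const int32_t *src, int w)
    {
        for (int i = 0; i < w; i++) {              /* no endianness test here */
            unsigned v = src[i] >> 15;             /* clipping omitted        */
            dst[2 * i]     = v & 0xff;
            dst[2 * i + 1] = v >> 8;
        }
    }

    static void output_16be(uint8_t *dst, const int32_t *src, int w)
    {
        for (int i = 0; i < w; i++) {
            unsigned v = src[i] >> 15;
            dst[2 * i]     = v >> 8;
            dst[2 * i + 1] = v & 0xff;
        }
    }

    /* The branch lives here, outside the inner loop. */
    static output_fn pick_output(int big_endian)
    {
        return big_endian ? output_16be : output_16le;
    }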
On architectures such as x86 (both 32-bit and 64-bit), the stack element
size is fixed, which maintains alignment, so there this change does not
break anything. However, we also support other architectures where this
property does not hold, and on those, applications will crash horribly.
This change effectively forces all applications to be recompiled against
libswscale.