While this technically compiles in current ffmpeg, this is only
because ffmpeg is compiled in strict ISO C mode, which disables
the builtin 'vector' keyword for AltiVec/VSX. Instead this gets
replaced with a macro inside altivec.h, which defines vector to
be actually __vector, which accepts arbitrary types.
Normally, the vector keyword should be used only with plain
scalar non-typedef types, such as unsigned int. But we have the
vec_(s|u)(8|16|32) macros, which can be used in a portable manner,
in util_altivec.h in libavutil.
This is also consistent with other AltiVec/VSX code elsewhere in
the tree.
Fixes #7861.
Signed-off-by: Daniel Kolesa <daniel@octaforge.org>
Signed-off-by: Lauri Kasanen <cand@gmx.com>
The implementation is pretty straightforward. Most of the existing
NV12 codepaths work regardless of subsampling and are re-used as is.
Where necessary I wrote the slightly different NV24 versions.
Finally, the one thing that confused me for a long time was the
asm specific x86 path that did an explicit exclusion check for NV12.
I replaced that with a semi-planar check and also updated the
equivalent PPC code, which Lauri kindly checked.
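
To illustrate why an NV12-only exclusion stops being sufficient once NV24
exists, here is a minimal sketch. The enum and function names are
hypothetical stand-ins, not the identifiers used in swscale; the only thing
taken from the description above is the idea of replacing a per-format
exclusion with a semi-planar predicate.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for pixel formats; not FFmpeg's real enum. */
enum sketch_fmt { SK_YUV420P, SK_YUV444P, SK_NV12, SK_NV24 };

/* True for formats with a luma plane plus one interleaved chroma plane. */
bool sketch_is_semi_planar(enum sketch_fmt f)
{
    return f == SK_NV12 || f == SK_NV24;
}

/* Old-style gate: the planar asm path is rejected only for NV12. */
bool planar_asm_ok_old(enum sketch_fmt dst)
{
    return dst != SK_NV12;
}

/* New-style gate: reject every semi-planar format, present or future. */
bool planar_asm_ok_new(enum sketch_fmt dst)
{
    return !sketch_is_semi_planar(dst);
}
```

With the old gate, SK_NV24 would be handed to a planar-only routine; the
predicate-based gate rejects it without listing each format separately.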
This function wouldn't benefit from VSX instructions, so I put it
under altivec.
./ffmpeg_g -f rawvideo -pix_fmt rgb24 -s hd1080 -i /dev/zero -pix_fmt grayf32le \
-f null -vframes 100 -v error -nostats -
3743 UNITS in planar1, 65495 runs, 41 skips
-cpuflags 0
23511 UNITS in planar1, 65530 runs, 6 skips
grayf32be
4647 UNITS in planar1, 65449 runs, 87 skips
-cpuflags 0
28608 UNITS in planar1, 65530 runs, 6 skips
The native speedup is 6.28133, and the bswapping one 6.15623.
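
For reference, the ratios are just the -cpuflags 0 figure divided by the
optimized figure for the same format. A small sketch of that arithmetic
(the function name is made up for illustration):

```c
/* Speedup = units taken by the C fallback / units taken by the SIMD path. */
double speedup_ratio(double baseline_units, double optimized_units)
{
    return baseline_units / optimized_units;
}
```

Calling speedup_ratio(23511, 3743) gives the grayf32le figure and
speedup_ratio(28608, 4647) the grayf32be one.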
Fate passes, each format tested with an image to video conversion.
Signed-off-by: Lauri Kasanen <cand@gmx.com>
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
Passes fate on LE (with "lavc/jrevdct: Avoid an aliasing violation" applied).
Signed-off-by: Lauri Kasanen <cand@gmx.com>
Tested-by: Michael Kostylev on BE
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
./ffmpeg_g -f rawvideo -pix_fmt rgb24 -s hd1080 -i /dev/zero -pix_fmt yuv420p \
-f null -vframes 100 -v error -nostats -
1158 UNITS in planar1, 65528 runs, 8 skips
-cpuflags 0
19082 UNITS in planar1, 65533 runs, 3 skips
16.48 speedup ratio. On x86, SSE2 is ~7. Curiously, the Power C version
takes as many cycles as the x86 SSE2 version, yikes it's fast.
Note that this function uses VSX instructions, but is not marked so.
This is because several existing functions also make that mistake.
I'll submit a patch moving them once this is reviewed.
Signed-off-by: Lauri Kasanen <cand@gmx.com>
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
GCC had a bug in how it interpreted PPC intrinsics, which was fixed in GCC 4.9.1. This bug led to
errors in two of our previous patches. We found this when we updated our GCC tools to 4.9.1 and
read the related information on the GCC website. We fix our previous errors in two separate commits.
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
Add macros GET_LS(), GET_VF(), LOAD_FILTER(), LOAD_L1(), GET_VF4(), FIRST_LOAD(), UPDATE_PTR(), LOAD_SRCV(), LOAD_SRCV8() and GET_VFD() for POWER LE
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
This gets rid of the variable-length scratch buffer by filtering 16
pixels at a time and writing directly to the destination. The extra
loads this requires to load the source values are compensated by not
doing a round-trip to memory before shifting.
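
A scalar sketch of the idea, not the AltiVec code itself: accumulate one
block of 16 output pixels across all filter taps, then shift, clip and
store straight into dst, so no full-width scratch array is needed. The
function names, the 19-bit shift and the rounding constant are assumptions
chosen for illustration.

```c
#include <stdint.h>

/* Drop `shift` fractional bits and clamp the result to 0..255. */
static uint8_t scale_clip_u8(int32_t acc, int shift)
{
    if (acc < 0)
        return 0;
    acc >>= shift;
    return acc > 255 ? 255 : (uint8_t)acc;
}

/* Vertical filter: dst[x] = clip((sum_j src[j][x] * filter[j]) >> 19). */
void vfilter_blocks16(const int16_t *filter, int filter_size,
                      const int16_t *const *src, uint8_t *dst, int dst_w)
{
    for (int x = 0; x < dst_w; x += 16) {
        int32_t acc[16];                      /* fixed-size, one block only */
        int n = dst_w - x < 16 ? dst_w - x : 16;

        for (int i = 0; i < n; i++)
            acc[i] = 1 << 18;                 /* rounding term (assumed) */

        /* Each tap re-reads its source row for this block. */
        for (int j = 0; j < filter_size; j++)
            for (int i = 0; i < n; i++)
                acc[i] += src[j][x + i] * filter[j];

        /* Shift and store directly, without a round-trip through memory. */
        for (int i = 0; i < n; i++)
            dst[x + i] = scale_clip_u8(acc[i], 19);
    }
}
```

The per-block accumulator has a fixed size of 16 regardless of dst_w, which
is what removes the need for a variable-length buffer.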
Signed-off-by: Mans Rullgard <mans@mansr.com>
Use uintptr_t instead of plain int. Without this change, the
comparisons will come out wrong for pointers in certain ranges.
Fixes random failures on ppc64. Also fixes some compiler warnings.
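
A minimal sketch of the pattern, with hypothetical helper names: on a
64-bit target a cast to int keeps only the low 32 bits of an address and
may turn it negative, so ordering and remainder tests on the result are
unreliable. uintptr_t holds the full address.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes to skip before p reaches the next `align` boundary.
 * `align` must be a power of two. */
size_t bytes_to_alignment(const void *p, uintptr_t align)
{
    uintptr_t addr = (uintptr_t)p;        /* full-width, unsigned */
    return (size_t)((align - (addr & (align - 1))) & (align - 1));
}

/* Compare two addresses without truncating them to int first. */
int ptr_before(const void *a, const void *b)
{
    return (uintptr_t)a < (uintptr_t)b;
}
```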
Signed-off-by: Mans Rullgard <mans@mansr.com>
This allows using more specific implementations for chroma/luma, e.g.
we can make assumptions on filterSize being constant, thus avoiding
that test at runtime.
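
As a rough sketch of what such a specialization could look like (all names
here are hypothetical, and the two-tap case is only an example): the
generic routine loops over filter_size, the specialized one hardwires the
tap count, and the choice is made once when the function pointer is set up.

```c
#include <stdint.h>

typedef void (*vsum_fn)(const int16_t *filter, int filter_size,
                        const int16_t *const *src, int32_t *dst, int w);

/* Generic version: the tap count is only known at run time. */
void vsum_generic(const int16_t *filter, int filter_size,
                  const int16_t *const *src, int32_t *dst, int w)
{
    for (int i = 0; i < w; i++) {
        int32_t acc = 0;
        for (int j = 0; j < filter_size; j++)
            acc += src[j][i] * filter[j];
        dst[i] = acc;
    }
}

/* Specialized version: assumes exactly two taps, so no inner loop. */
void vsum_taps2(const int16_t *filter, int filter_size,
                const int16_t *const *src, int32_t *dst, int w)
{
    (void)filter_size;                    /* fixed at 2 by construction */
    for (int i = 0; i < w; i++)
        dst[i] = src[0][i] * filter[0] + src[1][i] * filter[1];
}

/* Decide once at init time instead of testing inside the hot loop. */
vsum_fn pick_vsum(int filter_size)
{
    return filter_size == 2 ? vsum_taps2 : vsum_generic;
}
```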
It just does that part in scalar form; I doubt that using a vector store
over two arrays would speed it up much.
The function should be written to not use a scratch buffer.
Remove unused variables "flags" and "dstFormat" in yuv2packed1,
merge source rows per plane for yuv2packed[12], and make every
source argument int16_t (some were invalidly set to uint16_t).
This prevents stack pollution and is part of the Great Evil Plan
to simplify swscale.
This will likely lead to a considerable performance boost,
since it removes a branch from the inner loop. Part of the
Great Evil Plan to simplify swscale.
commit 93681fbd50
Author: Ronald S. Bultje <rsbultje@gmail.com>
Date: Thu May 26 11:32:32 2011 -0400
swscale: fix compile on ppc.
commit e758573a88
Author: Ronald S. Bultje <rsbultje@gmail.com>
Date: Thu May 26 10:36:47 2011 -0400
swscale: fix compile on x86-32.
commit 0f4eb8b043
Author: Ronald S. Bultje <rsbultje@gmail.com>
Date: Thu May 26 09:17:52 2011 -0400
swscale: remove VOF/VOFW.
commit b4a224c5e4
Author: Ronald S. Bultje <rsbultje@gmail.com>
Date: Wed May 25 14:30:09 2011 -0400
swscale: split chroma buffers into separate U/V planes.
Preparatory step to implement support for sizes > VOFW.
It seems sws-PPC hardcoded 2048 in various places instead of using VOFW.
This also means that all past VOFW benchmarks on PPC are meaningless.
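
A small sketch of the problem, assuming a combined chroma buffer in which
the V samples sit VOFW entries after the U samples. The function names and
the value 4096 are made up for illustration. Code that spells out 2048
keeps reading at the old offset when the macro changes, so varying VOFW
never affects it.

```c
#include <stddef.h>
#include <stdint.h>

#define VOFW 4096   /* illustrative: pretend the offset was raised from 2048 */

/* Follows the macro: reads V at the configured offset. */
int16_t read_v_macro(const int16_t *uvbuf, size_t i)
{
    return uvbuf[i + VOFW];
}

/* Hardcoded offset: silently ignores any change to VOFW. */
int16_t read_v_hardcoded(const int16_t *uvbuf, size_t i)
{
    return uvbuf[i + 2048];
}

/* Separate V plane: no offset constant to get wrong. */
int16_t read_v_split(const int16_t *vbuf, size_t i)
{
    return vbuf[i];
}
```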
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>