./ffmpeg_g -f rawvideo -pix_fmt rgb24 -s hd1080 -i /dev/zero -pix_fmt yuv420p16be \
-s 1920x1728 -f null -vframes 100 -v error -nostats -
9-14 bit funcs get about 6x speedup, 16-bit gets about 15x.
Fate passes, each format tested with an image to video conversion.
Only POWER8 includes 32-bit vector multiplies, so POWER7 is locked out
of the 16-bit function. This includes the vec_mulo/mule functions too,
not just vmuluwm.
With TIMER_REPORT skips disabled:
yuv420p9le
12412 UNITS in planarX, 131072 runs, 0 skips
73136 UNITS in planarX, 131072 runs, 0 skips
yuv420p9be
12481 UNITS in planarX, 131072 runs, 0 skips
73410 UNITS in planarX, 131072 runs, 0 skips
yuv420p10le
12322 UNITS in planarX, 131072 runs, 0 skips
72546 UNITS in planarX, 131072 runs, 0 skips
yuv420p10be
12291 UNITS in planarX, 131072 runs, 0 skips
72935 UNITS in planarX, 131072 runs, 0 skips
yuv420p12le
12316 UNITS in planarX, 131072 runs, 0 skips
72708 UNITS in planarX, 131072 runs, 0 skips
yuv420p12be
12319 UNITS in planarX, 131072 runs, 0 skips
72577 UNITS in planarX, 131072 runs, 0 skips
yuv420p14le
12259 UNITS in planarX, 131072 runs, 0 skips
72516 UNITS in planarX, 131072 runs, 0 skips
yuv420p14be
12440 UNITS in planarX, 131072 runs, 0 skips
72962 UNITS in planarX, 131072 runs, 0 skips
yuv420p16le
10548 UNITS in planarX, 131072 runs, 0 skips
73429 UNITS in planarX, 131072 runs, 0 skips
yuv420p16be
10634 UNITS in planarX, 131072 runs, 0 skips
150959 UNITS in planarX, 131072 runs, 0 skips
Signed-off-by: Lauri Kasanen <cand@gmx.com>
Passes fate on LE (with "lavc/jrevdct: Avoid an aliasing violation" applied).
Signed-off-by: Lauri Kasanen <cand@gmx.com>
Tested-by: Michael Kostylev on BE
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
./ffmpeg_g -f rawvideo -pix_fmt rgb24 -s hd1080 -i /dev/zero -pix_fmt yuv420p \
-f null -vframes 100 -v error -nostats -
1158 UNITS in planar1, 65528 runs, 8 skips
-cpuflags 0
19082 UNITS in planar1, 65533 runs, 3 skips
16.48 speedup ratio. On x86, SSE2 is ~7. Curiously, the Power C version
takes as many cycles as the x86 SSE2 version, yikes it's fast.
Note that this function uses VSX instructions, but is not marked so.
This is because several existing functions also make that mistake.
I'll submit a patch moving them once this is reviewed.
Signed-off-by: Lauri Kasanen <cand@gmx.com>
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
GCC tool had a bug of PPC intrinsic interpret, which has been fixed in GCC 4.9.1. This bug lead to
errors in two of our previous patches. We found this when we update our GCC tools to 4.9.1 and by
reading the related info on GCC website. We fix our previous error in two separate commits
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
add marcos GET_LS() GET_VF() LOAD_FILTER() LOAD_L1() GET_VF4() FIRST_LOAD() UPDATE_PTR() LOAD_SRCV() LOAD_SRCV8() GET_VFD() for POWER LE
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
This gets rid of the variable-length scratch buffer by filtering 16
pixels at a time and writing directly to the destination. The extra
loads this requires to load the source values are compensated by not
doing a round-trip to memory before shifting.
Signed-off-by: Mans Rullgard <mans@mansr.com>
Use uintptr_t instead of plain int. Without this change, the
comparisons will come out wrong for pointers in certain ranges.
Fixes random failures on ppc64. Also fixes some compiler warnings.
Signed-off-by: Mans Rullgard <mans@mansr.com>
This allows using more specific implementations for chroma/luma, e.g.
we can make assumptions on filterSize being constant, thus avoiding
that test at runtime.
It just does that part in scalar form, I doubt using a vector store
over 2 array would speed it up particularly.
The function should be written to not use a scratch buffer.
Remove unused variables "flags" and "dstFormat" in yuv2packed1,
merge source rows per plane for yuv2packed[12], and make every
source argument int16_t (some where invalidly set to uint16_t).
This prevents stack pollution and is part of the Great Evil Plan
to simplify swscale.
This will likely lead to a considerable performance boost,
since it removes a branch from the inner loop. Part of the
Great Evil Plan to simplify swscale.
commit 93681fbd50
Author: Ronald S. Bultje <rsbultje@gmail.com>
Date: Thu May 26 11:32:32 2011 -0400
swscale: fix compile on ppc.
commit e758573a88
Author: Ronald S. Bultje <rsbultje@gmail.com>
Date: Thu May 26 10:36:47 2011 -0400
swscale: fix compile on x86-32.
commit 0f4eb8b043
Author: Ronald S. Bultje <rsbultje@gmail.com>
Date: Thu May 26 09:17:52 2011 -0400
swscale: remove VOF/VOFW.
commit b4a224c5e4
Author: Ronald S. Bultje <rsbultje@gmail.com>
Date: Wed May 25 14:30:09 2011 -0400
swscale: split chroma buffers into separate U/V planes.
Preparatory step to implement support for sizes > VOFW.
It seems sws-PPC did hardcode 2048 at various places instead of using VOFW.
This also means that all past VOFW benchmarks on PPC are meaningless
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>