get_ue_golomb_long() is only tested for values up to 2^15 - 2 since
we can not write larger values.
Silence the test on success and return a non-zero value on error.
Use an heap scratch buffer instead of large stack buffer.
Remove unneeded includes.
This way, if the AVCodecContext is allocated for a specific codec, the
caller doesn't need to store this codec separately and then pass it
again to avcodec_open2().
It also allows to set codec private options using av_opt_set_* before
opening the codec.
It allows to check whether an AVCodecContext is open in a documented
way. Right now the undocumented way this check is done in lavf/lavc is
by checking whether AVCodecContext.codec is NULL. However it's desirable
to be able to set AVCodecContext.codec before avcodec_open2().
In some cases, what is left to read from ptr is smaller than EXTRABYTES.
Based on a patch by Thierry Foucu <tfoucu@gmail.com>.
Signed-off-by: Alex Converse <alex.converse@gmail.com>
Provide MMX, SSE2 and SSSE3 versions, with a fast-path when the weights are
multiples of 512 (which is often the case when the values round up nicely).
*_TIMER report for the 16x16 and 8x8 cases:
C:
9015 decicycles in 16, 524257 runs, 31 skips
2656 decicycles in 8, 524271 runs, 17 skips
MMX:
4156 decicycles in 16, 262090 runs, 54 skips
1206 decicycles in 8, 262131 runs, 13 skips
MMX on fast-path:
2760 decicycles in 16, 524222 runs, 66 skips
995 decicycles in 8, 524252 runs, 36 skips
SSE2:
2163 decicycles in 16, 262131 runs, 13 skips
832 decicycles in 8, 262137 runs, 7 skips
SSE2 with fast path:
1783 decicycles in 16, 524276 runs, 12 skips
711 decicycles in 8, 524283 runs, 5 skips
SSSE3:
2117 decicycles in 16, 262136 runs, 8 skips
814 decicycles in 8, 262143 runs, 1 skips
SSSE3 with fast path:
1315 decicycles in 16, 524285 runs, 3 skips
578 decicycles in 8, 524286 runs, 2 skips
This means around a 4% speedup for some sequences.
Signed-off-by: Diego Biurrun <diego@biurrun.de>
Currently, any samples in the final frame are not decoded because they are
only represented by one frame instead of two. So we encode two final frames to
cover both the analysis delay and the MDCT delay.
While pshufb allows emulating bswap on XMM registers for SSSE3, more
shuffling is needed for SSE2. Alignment is critical, so specific codepaths
are provided for this case.
For the huffyuv sequence "angels_480-huffyuvcompress.avi":
C (using bswap instruction): ~ 55k cycles
SSE2: ~ 40k cycles
SSSE3 using unaligned loads: ~ 35k cycles
SSSE3 using aligned loads: ~ 30k cycles
Signed-off-by: Diego Biurrun <diego@biurrun.de>
On x86-64, it indeed uses all 16 registers (and on x86-32, this gets
clipped to 8). Not marking it properly causes callers of this function
to fail randomly because of XMM register clobbering.
Note: This fixes the following GCC warning :-
libavcodec/sunrast.c:94: warning: comparison of unsigned expression < 0 is always false.
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
10l: Forgot to adjust deinterleave for new location of incoming samples in 7946a5a.
This produced incorrect, but surprisingly listenable results.
Thanks to Justin Ruggles for the report.
Signed-off-by: Anton Khirnov <anton@khirnov.net>