FFmpeg

Author	SHA1	Message	Date
Andreas Rheinhardt	259234b46f	avcodec/vp9: Simplify replacing VP9Frame ff_thread_progress_replace() can handle a blank ProgressFrame as src (in which case it simply unreferences dst), but not a NULL one. So add a blank frame to be used as source for this case, so that we can use the replace functions to simplify vp9_frame_replace(). Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	11 months ago
Andreas Rheinhardt	7bd3b73716	avcodec/vp9: Switch to ProgressFrames This already fixes a race in the vp9-encparams test. In this test, side data is added to the current frame after having been decoded (and therefore after ff_thread_finish_setup() has been called). Yet the update_thread_context callback called ff_thread_ref_frame() and therefore av_frame_ref() with this frame as source frame and the ensuing read was unsynchronised with adding the side data, i.e. there was a data race. By switching to the ProgressFrame API the implicit av_frame_ref() is removed and the race fixed except if this frame is later reused by a show-existing-frame which uses an explicit av_frame_ref(). The vp9-encparams test does not cover this, so this commit already fixes all the races in this test. This decoder kept multiple references to the same ThreadFrames in the same context and therefore had lots of implicit av_frame_ref() even when decoding single-threaded. This incurred lots of small allocations: When decoding an ordinary 10s video in single-threaded mode the number of allocations reported by Valgrind went down from 57,814 to 20,908; for 10 threads it went down from 84,223 to 21,901. Reviewed-by: Anton Khirnov <anton@khirnov.net> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	11 months ago
Andreas Rheinhardt	8c0350f57e	avcodec/vp9: Use RefStruct-pool API for extradata It avoids allocations and corresponding error checks. Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	1 year ago
Andreas Rheinhardt	f8252d6ce3	avcodec/decode: Use RefStruct API for hwaccel_picture_private Avoids allocations and therefore error checks: Syncing hwaccel_picture_private across threads can't fail any more. Also gets rid of an unnecessary pointer in structures and in the parameter list of ff_hwaccel_frame_priv_alloc(). Reviewed-by: Anton Khirnov <anton@khirnov.net> Reviewed-by: Lynne <dev@lynne.ee> Tested-by: Lynne <dev@lynne.ee> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	1 year ago
Andreas Rheinhardt	6f7d3bde11	avcodec/vp8, vp9: Avoid using VP56mv and VP56Frame in VP8/9 Instead replace VP56mv by new and identical structures VP8mv and VP9mv. Also replace VP56Frame by VP8FrameType in vp8.h and use that in VP8 code. Also remove VP56_FRAME_GOLDEN2, as this has only been used by VP8, and use VP8_FRAME_ALTREF as replacement for its usage in VP8 as this is more in line with VP8 verbiage. This allows to remove all inclusions of vp56.h from everything that is not VP5/6. This also removes implicit inclusions of hpeldsp.h, h264chroma.h, vp3dsp.h and vp56dsp.h from all VP8/9 files. (This also fixes a build issue: If one compiles with -O0 and disables everything except the VP8-VAAPI encoder, the file containing ff_vpx_norm_shift is not compiled, yet this is used implicitly by vp56_rac_gets_nn() which is defined in vp56.h; it is unused by the VP8-VAAPI encoder and declared as av_unused, yet with -O0 unused noninline functions are not optimized away, leading to linking failures. With this patch, said function is not included in vaapi_encode_vp8.c any more.) Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	3 years ago
Andreas Rheinhardt	02220b88fc	avcodec/thread: Don't use ThreadFrame when unnecessary The majority of frame-threaded decoders (mainly the intra-only) need exactly one part of ThreadFrame: The AVFrame. They don't need the owners nor the progress, yet they had to use it because ff_thread_(get\|release)_buffer() requires it. This commit changes this and makes these functions work with ordinary AVFrames; the decoders that need the extra fields for progress use ff_thread_(get\|release)_ext_buffer() which work exactly as ff_thread_(get\|release)_buffer() used to do. This also avoids some unnecessary allocations of progress AVBuffers, namely for H.264 and HEVC film grain frames: These frames are not used for synchronization and therefore don't need a ThreadFrame. Also move the ThreadFrame structure as well as ff_thread_ref_frame() to threadframe.h, the header for frame-threaded decoders with inter-frame dependencies. Reviewed-by: Anton Khirnov <anton@khirnov.net> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	3 years ago
Ronald S. Bultje	0c46641784	vp9: split out generic decoding skeleton interface API from VP9 types. This allows vp9dsp.h to only include the VP9 types header, and not the decoder skeleton interface which is for hardware decoders (dxva2/vaapi).	8 years ago
Ronald S. Bultje	f8c019944d	vp9: re-split the decoder/format/dsp interface header files. The advantage here is that the internal software decoder interface is not exposed to the DSP functions or the hardware accelerations.	8 years ago
Clément Bœsch	e6ffdc9582	lavc/vp9: shuffle header declaration This reduces diff with Libav.	8 years ago
Clément Bœsch	37814a21cb	lavc/vp9: consistent use of typedef instead of struct	8 years ago
Clément Bœsch	1c9f4b5078	lavc/vp9: split into vp9{block,data,mvs} This is following Libav layout to ease merges.	8 years ago
Mathieu Velten	b1f630f1a6	avcodec/vp9: move bpp to the shared context for use in hwaccel Signed-off-by: Mark Thompson <sw@jkqxz.net>	8 years ago
Martin Storsjö	383d96aa22	aarch64: vp9: Add NEON optimizations of VP9 MC functions This work is sponsored by, and copyright, Google. These are ported from the ARM version; it is essentially a 1:1 port with no extra added features, but with some hand tuning (especially for the plain copy/avg functions). The ARM version isn't very register starved to begin with, so there's not much to be gained from having more spare registers here - we only avoid having to clobber callee-saved registers. Examples of runtimes vs the 32 bit version, on a Cortex A53: ARM AArch64 vp9_avg4_neon: 27.2 23.7 vp9_avg8_neon: 56.5 54.7 vp9_avg16_neon: 169.9 167.4 vp9_avg32_neon: 585.8 585.2 vp9_avg64_neon: 2460.3 2294.7 vp9_avg_8tap_smooth_4h_neon: 132.7 125.2 vp9_avg_8tap_smooth_4hv_neon: 478.8 442.0 vp9_avg_8tap_smooth_4v_neon: 126.0 93.7 vp9_avg_8tap_smooth_8h_neon: 241.7 234.2 vp9_avg_8tap_smooth_8hv_neon: 690.9 646.5 vp9_avg_8tap_smooth_8v_neon: 245.0 205.5 vp9_avg_8tap_smooth_64h_neon: 11273.2 11280.1 vp9_avg_8tap_smooth_64hv_neon: 22980.6 22184.1 vp9_avg_8tap_smooth_64v_neon: 11549.7 10781.1 vp9_put4_neon: 18.0 17.2 vp9_put8_neon: 40.2 37.7 vp9_put16_neon: 97.4 99.5 vp9_put32_neon/armv8: 346.0 307.4 vp9_put64_neon/armv8: 1319.0 1107.5 vp9_put_8tap_smooth_4h_neon: 126.7 118.2 vp9_put_8tap_smooth_4hv_neon: 465.7 434.0 vp9_put_8tap_smooth_4v_neon: 113.0 86.5 vp9_put_8tap_smooth_8h_neon: 229.7 221.6 vp9_put_8tap_smooth_8hv_neon: 658.9 621.3 vp9_put_8tap_smooth_8v_neon: 215.0 187.5 vp9_put_8tap_smooth_64h_neon: 10636.7 10627.8 vp9_put_8tap_smooth_64hv_neon: 21076.8 21026.9 vp9_put_8tap_smooth_64v_neon: 9635.0 9632.4 These are generally about as fast as the corresponding ARM routines on the same CPU (at least on the A53), in most cases marginally faster. The speedup vs C code is pretty much the same as for the 32 bit case; on the A53 it's around 6-13x for ther larger 8tap filters. The exact speedup varies a little, since the C versions generally don't end up exactly as slow/fast as on 32 bit. Signed-off-by: Martin Storsjö <martin@martin.st>	8 years ago
Martin Storsjö	a4cfcddcb0	vp9: Make the subpel filters non-static Make them aligned, to allow efficient access to them from simd. Signed-off-by: Martin Storsjö <martin@martin.st>	8 years ago
Martin Storsjö	ffbd1d2b00	arm: vp9: Add NEON optimizations of VP9 MC functions This work is sponsored by, and copyright, Google. The filter coefficients are signed values, where the product of the multiplication with one individual filter coefficient doesn't overflow a 16 bit signed value (the largest filter coefficient is 127). But when the products are accumulated, the resulting sum can overflow the 16 bit signed range. Instead of accumulating in 32 bit, we accumulate the largest product (either index 3 or 4) last with a saturated addition. (The VP8 MC asm does something similar, but slightly simpler, by accumulating each half of the filter separately. In the VP9 MC filters, each half of the filter can also overflow though, so the largest component has to be handled individually.) Examples of relative speedup compared to the C version, from checkasm: Cortex A7 A8 A9 A53 vp9_avg4_neon: 1.71 1.15 1.42 1.49 vp9_avg8_neon: 2.51 3.63 3.14 2.58 vp9_avg16_neon: 2.95 6.76 3.01 2.84 vp9_avg32_neon: 3.29 6.64 2.85 3.00 vp9_avg64_neon: 3.47 6.67 3.14 2.80 vp9_avg_8tap_smooth_4h_neon: 3.22 4.73 2.76 4.67 vp9_avg_8tap_smooth_4hv_neon: 3.67 4.76 3.28 4.71 vp9_avg_8tap_smooth_4v_neon: 5.52 7.60 4.60 6.31 vp9_avg_8tap_smooth_8h_neon: 6.22 9.04 5.12 9.32 vp9_avg_8tap_smooth_8hv_neon: 6.38 8.21 5.72 8.17 vp9_avg_8tap_smooth_8v_neon: 9.22 12.66 8.15 11.10 vp9_avg_8tap_smooth_64h_neon: 7.02 10.23 5.54 11.58 vp9_avg_8tap_smooth_64hv_neon: 6.76 9.46 5.93 9.40 vp9_avg_8tap_smooth_64v_neon: 10.76 14.13 9.46 13.37 vp9_put4_neon: 1.11 1.47 1.00 1.21 vp9_put8_neon: 1.23 2.17 1.94 1.48 vp9_put16_neon: 1.63 4.02 1.73 1.97 vp9_put32_neon: 1.56 4.92 2.00 1.96 vp9_put64_neon: 2.10 5.28 2.03 2.35 vp9_put_8tap_smooth_4h_neon: 3.11 4.35 2.63 4.35 vp9_put_8tap_smooth_4hv_neon: 3.67 4.69 3.25 4.71 vp9_put_8tap_smooth_4v_neon: 5.45 7.27 4.49 6.52 vp9_put_8tap_smooth_8h_neon: 5.97 8.18 4.81 8.56 vp9_put_8tap_smooth_8hv_neon: 6.39 7.90 5.64 8.15 vp9_put_8tap_smooth_8v_neon: 9.03 11.84 8.07 11.51 vp9_put_8tap_smooth_64h_neon: 6.78 9.48 4.88 10.89 vp9_put_8tap_smooth_64hv_neon: 6.99 8.87 5.94 9.56 vp9_put_8tap_smooth_64v_neon: 10.69 13.30 9.43 14.34 For the larger 8tap filters, the speedup vs C code is around 5-14x. This is significantly faster than libvpx's implementation of the same functions, at least when comparing the put_8tap_smooth_64 functions (compared to vpx_convolve8_horiz_neon and vpx_convolve8_vert_neon from libvpx). Absolute runtimes from checkasm: Cortex A7 A8 A9 A53 vp9_put_8tap_smooth_64h_neon: 20150.3 14489.4 19733.6 10863.7 libvpx vpx_convolve8_horiz_neon: 52623.3 19736.4 21907.7 25027.7 vp9_put_8tap_smooth_64v_neon: 14455.0 12303.9 13746.4 9628.9 libvpx vpx_convolve8_vert_neon: 42090.0 17706.2 17659.9 16941.2 Thus, on the A9, the horizontal filter is only marginally faster than libvpx, while our version is significantly faster on the other cores, and the vertical filter is significantly faster on all cores. The difference is especially large on the A7. The libvpx implementation does the accumulation in 32 bit, which probably explains most of the differences. Signed-off-by: Martin Storsjö <martin@martin.st>	8 years ago
Martin Storsjö	2e55e26b40	vp9: Flip the order of arguments in MC functions This makes it match the pattern already used for VP8 MC functions. This also makes the signature match ffmpeg's version of these functions, easing porting of code in both directions. Signed-off-by: Martin Storsjö <martin@martin.st>	8 years ago
Ronald S. Bultje	1730a67ab9	vp9: add frame threading Signed-off-by: Anton Khirnov <anton@khirnov.net>	9 years ago
Ronald S. Bultje	5b995452a6	vp9: allocate 'b', 'block/uvblock' and 'eob/uveob' dynamically. This will be needed for frame threading. Signed-off-by: Anton Khirnov <anton@khirnov.net>	9 years ago
Ronald S. Bultje	bc6e0b64a9	vp9: split last/cur_frame from the reference buffers. We need more information from last/cur_frame than from reference buffers, so we can use a simplified structure for reference buffers, and then store mvs and segmentation map information in last/cur. This prepares the decoder for frame threading support. Signed-off-by: Anton Khirnov <anton@khirnov.net>	9 years ago
Ronald S. Bultje	0df4801105	vp9: make mv bounds 32bit. The frame dimensions are 16bit, so the mv bounds can easily overflow int16 for large videos. Bug-Id: Handbrake/46 CC: libav-stable@libav.org Signed-off-by: Anton Khirnov <anton@khirnov.net>	9 years ago
Vittorio Giovara	41ed7ab45f	cosmetics: Fix spelling mistakes Signed-off-by: Diego Biurrun <diego@biurrun.de>	9 years ago
Hendrik Leppkes	585083dd1f	vp9: add hwaccel hooks	9 years ago
Hendrik Leppkes	6e719dc6bb	vp9: expose reference frames in VP9SharedContext	10 years ago
Ronald S. Bultje	b95f241b6e	vp9: split header into separate struct and expose in vp9.h This allows hwaccels to access the bitstream header information.	10 years ago
Vittorio Giovara	3f38d4b816	vp9: Parse subsampling and report missing feature	10 years ago
Luca Barbato	312daa1589	vp9: Use the correct upper bound for seg_id And use a macro to make apparent why the value. Bug-Id: CID 1108595	10 years ago
Ronald S. Bultje	72ca830f51	lavc: VP9 decoder Originally written by Ronald S. Bultje <rsbultje@gmail.com> and Clément Bœsch <u@pkh.me> Further contributions by: Anton Khirnov <anton@khirnov.net> Diego Biurrun <diego@biurrun.de> Luca Barbato <lu_zero@gentoo.org> Martin Storsjö <martin@martin.st> Signed-off-by: Luca Barbato <lu_zero@gentoo.org> Signed-off-by: Anton Khirnov <anton@khirnov.net>	11 years ago
Ronald S. Bultje	848826f527	Native VP9 decoder. Authors: Ronald S. Bultje <rsbultje gmail com>, Clement Boesch <u pkh me>	12 years ago

7 Commits (54ae270213b5a98f923bfd4506e450b2e764ede2)