FFmpeg

Commit Graph

Author	SHA1	Message	Date
Andreas Rheinhardt	f8efd890bf	avutil/tx_template: Move function pointers to const memory This can be achieved by moving the AVOnce out of the structure containing the function pointers; the latter can then be made const. This also has the advantage of eliminating padding in the structure (sizeof(AVOnce) is four here) and allowing the AVOnces to be put into .bss (dependening upon the implementation). Reviewed-by: Lynne <dev@lynne.ee> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2 years ago
Andreas Rheinhardt	188216581b	avutil/tx_template: Avoid code duplication Reviewed-by: Lynne <dev@lynne.ee> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2 years ago
Andreas Rheinhardt	2af5f55b2e	avutil/tx_template: Don't waste space for inexistent factors It is possible to avoid the factors array for the power-of-two tables for which said array is unused by using a different structure for initialization for power-of-two tables than for non-power-of-two-tables. This saves 31516B from .data. Reviewed-by: Lynne <dev@lynne.ee> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2 years ago
Lynne	ace42cf581	x86/tx_float: add 15xN PFA FFT AVX SIMD ~4x faster than the C version. The shuffles in the 15pt dim1 are seriously expensive. Not happy with it, but I'm contempt. Can be easily converted to pure AVX by removing all vpermpd/vpermps instructions.	2 years ago
Lynne	7e7baf8ab8	lavu/tx: do not steal lookup tables of subcontexts in the iMDCT As it happens, some still need their contexts.	2 years ago
Lynne	f1b35fc8f0	lavu/tx: remove av_cold from table definitions How did this get here?	2 years ago
Lynne	c92edd969a	lavu/tx: rotate 3 & 15-point exptabs This just inverts their signs. Simplifies SIMD.	2 years ago
Lynne	51172223fd	lavu/tx: generalize MDCTs The same code can perform any-length MDCTs with minimal changes.	2 years ago
Lynne	645a1f4422	lavu/tx: add the inplace flag to PFA FFTs They support in-place, because they have to use a temporary buffer.	2 years ago
Lynne	ae66a9db7b	lavu/tx: optimize and simplify inverse MDCTs Convert the input from a scatter to a gather instead, which is faster and better for SIMD. Also, add a pre-shuffled exptab version to avoid gathering there at all. This doubles the exptab size, but the speedup makes it worth it. In SIMD, the exptab will likely be purged to a higher cache anyway because of the FFT in the middle, and the amount of loads stays identical. For a 960-point inverse MDCT, the speedup is 10%. This makes it possible to write sane and fast SIMD versions of inverse MDCTs.	2 years ago
Lynne	af94ab7c7c	lavu/tx: add an RDFT implementation RDFTs are full of conventions that vary between implementations. What I've gone for here is what's most common between both fftw, avcodec's rdft and what we use, the equivalent of which is DFT_R2C for forward and IDFT_C2R for inverse. The other 2 conventions (IDFT_R2C and DFT_C2R) were not used at all in our code, and their names are also not appropriate. If there's a use for either, we can easily add a flag which would just flip the sign on one exptab. For some unknown reason, possibly to allow reusing FFT's exp tables, av_rdft's C2R output is 0.5x lower than what it should be to ensure a proper back-and-forth conversion. This code outputs its real samples at the correct level, which matches FFTW's level, and allows the user to change the level and insert arbitrary multiplies for free by setting the scale option.	3 years ago
Lynne	ef4bd81615	lavu/tx: rewrite internal code as a tree-based codelet constructor This commit rewrites the internal transform code into a constructor that stitches transforms (codelets). This allows for transforms to reuse arbitrary parts of other transforms, and allows transforms to be stacked onto one another (such as a full iMDCT using a half-iMDCT which in turn uses an FFT). It also permits for each step to be individually replaced by assembly or a custom implementation (such as an ASIC).	3 years ago
Lynne	1978b143eb	checkasm: add av_tx FFT SIMD testing code This sadly required making changes to the code itself, due to the same context needing to be reused for both versions. The lookup table had to be duplicated for both versions.	4 years ago
Lynne	0072a42388	lavu/tx: add full-sized iMDCT transform flag	4 years ago
Lynne	8c55c82583	lavu/tx: add a 9-point FFT and (i)MDCT	4 years ago
Lynne	bd9ea917a3	lavu/tx: add a 7-point FFT and (i)MDCT	4 years ago
Lynne	89da62f2fc	lavu/tx: refactor power-of-two FFT This commit refactors the power-of-two FFT, making it faster and halving the size of all tables, making the code much smaller on all systems. This removes the big/small pass split, because on modern systems the "big" pass is always faster, and even on older machines there is no measurable speed difference.	4 years ago
Lynne	e20a39a375	lavu/tx: do not invert permutes on MDCTs	4 years ago
Lynne	8e94b7cff0	lavu/tx: invert permutation lookups out[lut[i]] = in[i] lookups were 4.04 times(!) slower than out[i] = in[lut[i]] lookups for an out-of-place FFT of length 4096. The permutes remain unchanged for anything but out-of-place monolithic FFT, as those benefit quite a lot from the current order (it means there's only 1 lookup necessary to add to an offset, rather than a full gather). The code was based around non-power-of-two FFTs, so this wasn't benchmarked early on.	4 years ago
Lynne	10341743d2	lavu/tx: require output argument to match input for inplace transforms This simplifies some assembly code by a lot, by either saving a branch or saving an entire duplicated function.	4 years ago
Lynne	5ca40d6d94	lavu/tx: support in-place FFT transforms This commit adds support for in-place FFT transforms. Since our internal transforms were all in-place anyway, this only changes the permutation on the input. Unfortunately, research papers were of no help here. All focused on dry hardware implementations, where permutes are free, or on software implementations where binary bloat is of no concern so storing dozen times the transforms for each permutation and version is not considered bad practice. Still, for a pure C implementation, it's only around 28% slower than the multi-megabyte FFTW3 in unaligned mode. Unlike a closed permutation like with PFA, split-radix FFT bit-reversals contain multiple NOPs, multiple simple swaps, and a few chained swaps, so regular single-loop single-state permute loops were not possible. Instead, we filter out parts of the input indices which are redundant. This allows for a single branch, and with some clever AVX512 asm, could possibly be SIMD'd without refactoring. The inplace_idx array is guaranteed to never be larger than the revtab array, and in practice only requires around log2(len) entries. The power-of-two MDCTs can be done in-place as well. And it's possible to eliminate a copy in the compound MDCTs too, however it'll be slower than doing them out of place, and we'd need to dirty the input array.	4 years ago
James Almer	f6477ac9f4	avutil/tx: use ENOSYS instead of ENOTSUP It's the standard error code used across the codebase to signal unimplemented or unsupported features. Signed-off-by: James Almer <jamrial@gmail.com>	4 years ago
Lynne	06a8596825	lavu: support arbitrary-point FFTs and all even (i)MDCT transforms This patch adds support for arbitrary-point FFTs and all even MDCT transforms. Odd MDCTs are not supported yet as they're based on the DCT-II and DCT-III and they're very niche. With this we can now write tests.	4 years ago
Lynne	2465fe1302	lavu/tx: add 2-point FFT transform By itself, this allows 6-point, 10-point and 30-point transforms. When the 9-point transform is added it allows for 18-point FFT, and also for a 36-point MDCT (used by MP3).	5 years ago
Lynne	e1c84856bb	lavu/tx: improve 3-point fixed precision There's just no reason not to when its so easy (albeit messy) and its also reducing the precision of all non-power-of-two transforms that use it.	5 years ago
Lynne	223b58c74b	lavu/tx: slightly optimize fft15 Saves 2 additions.	5 years ago
Lynne	a38c6f47c9	lavu/tx: undef the correct macro It was renamed and no warning was given for undeffing a nonexisting one.	5 years ago
Lynne	e8f054b095	lavu/tx: implement 32 bit fixed point FFT and MDCT Required minimal changes to the code so made sense to implement. FFT and MDCT tested, the output of both was properly rounded. Fun fact: the non-power-of-two fixed-point FFT and MDCT are the fastest ever non-power-of-two fixed-point FFT and MDCT written. This can replace the power of two integer MDCTs in aac and ac3 if the MIPS optimizations are ported across. Unfortunately the ac3 encoder uses a 16-bit fixed point forward transform, unlike the encoder which uses a 32bit inverse transform, so some modifications might be required there. The 3-point FFT is somewhat less accurate than it otherwise could be, having minor rounding errors with bigger transforms. However, this could be improved later, and the way its currently written is the way one would write assembly for it. Similar rounding errors can also be found throughout the power of two FFTs as well, though those are more difficult to correct. Despite this, the integer transforms are more than accurate enough.	5 years ago
Lynne	42e2319ba9	lavu/tx: add support for double precision FFT and MDCT Simply moves and templates the actual transforms to support an additional data type. Unlike the float version, which is equal or better than libfftw3f, double precision output is bit identical with libfftw3.	5 years ago
Ruiling Song	65646db8e8	avutil/tx: should check against (*ctx) ctx is a pointer to pointer here. Signed-off-by: Ruiling Song <ruiling.song@intel.com>	6 years ago
Lynne	6044534964	avutil/tx: fix forward compound non-mod-15 based MDCTs There was a hardcoded value left. Wasn't caught earlier as no code uses compound forward mod-3/5 MDCTs yet.	6 years ago
Lynne	b79b29ddb1	libavutil: add an FFT & MDCT implementation This commit adds a new API to libavutil to allow for arbitrary transformations on various types of data. This is a partly new implementation, with the power of two transforms taken from libavcodec/fft_template, the 5 and 15-point FFT taken from mdct15, while the 3-point FFT was written from scratch. The (i)mdct folding code is taken from mdct15 as well, as the mdct_template code was somewhat old, messy and not easy to separate. A notable feature of this implementation is that it allows for 3xM and 5xM based transforms, where M is a power of two, e.g. 384, 640, 768, 1280, etc. AC-4 uses 3xM transforms while Siren uses 5xM transforms, so the code will allow for decoding of such streams. A non-exaustive list of supported sizes: 4, 8, 12, 16, 20, 24, 32, 40, 48, 60, 64, 80, 96, 120, 128, 160, 192, 240, 256, 320, 384, 480, 512, 640, 768, 960, 1024, 1280, 1536, 1920, 2048, 2560... The API was designed such that it allows for not only 1D transforms but also 2D transforms of certain block sizes. This was partly on accident as the stride argument is required for Opus MDCTs, but can be used in the context of a 2D transform as well. Also, various data types would be implemented eventually as well, such as "double" and "int32_t". Some performance comparisons with libfftw3f (SIMD disabled for both): 120: 22353 decicycles in fftwf_execute, 1024 runs, 0 skips 21836 decicycles in compound_fft_15x8, 1024 runs, 0 skips 128: 22003 decicycles in fftwf_execute, 1024 runs, 0 skips 23132 decicycles in monolithic_fft_ptwo, 1024 runs, 0 skips 384: 75939 decicycles in fftwf_execute, 1024 runs, 0 skips 73973 decicycles in compound_fft_3x128, 1024 runs, 0 skips 640: 104354 decicycles in fftwf_execute, 1024 runs, 0 skips 149518 decicycles in compound_fft_5x128, 1024 runs, 0 skips 768: 109323 decicycles in fftwf_execute, 1024 runs, 0 skips 164096 decicycles in compound_fft_3x256, 1024 runs, 0 skips 960: 186210 decicycles in fftwf_execute, 1024 runs, 0 skips 215256 decicycles in compound_fft_15x64, 1024 runs, 0 skips 1024: 163464 decicycles in fftwf_execute, 1024 runs, 0 skips 199686 decicycles in monolithic_fft_ptwo, 1024 runs, 0 skips With SIMD we should be faster than fftw for 15xM transforms as our fft15 SIMD is around 2x faster than theirs, even if our ptwo SIMD is slightly slower. The goal is to remove the libavcodec/mdct15 code and deprecate the libavcodec/avfft interface once aarch64 and x86 SIMD code has been ported. New code throughout the project should use this API. The implementation passes fate when used in Opus, AAC and Vorbis, and the output is identical with ATRAC9 as well.	6 years ago

29 Commits (0d0d67ce3640f01395e6d3c1ab3fc2d68fe78e2e)