FFmpeg

Commit Graph

Author

SHA1

Message

Date

Lynne

87bae6b018

lavu/tx: refactor to explicitly track and convert lookup table order

Necessary for generalizing PFAs.

2 years ago

Lynne

a89025f74d

aarch64/tx_float: fix compilation

Forgot to add the new function arguments.

2 years ago

Lynne

f932b89ea3

lavu/tx: implement aarch64 NEON SIMD FFT

The fastest fast Fourier transform in not just the west, but the world,
now for the most popular toy ISA.

On a high level, it follows the design of the AVX2 version closely,
with the exception that the input is slightly less permuted as we don't have
to do lane switching with the input on double 4pt and 8pt.

On a low level, the lack of subadd/addsub instructions REALLY penalizes
any attempt at writing an FFT. That single register matters a lot,
and reloading it simply takes unacceptably long.
In x86 land, vendors would've noticed developers need this.
In ARM land, you get a badly designed complex multiplication instruction
we cannot use, that's not present on 95% of devices. Because only
compilers matter, right?
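
To make the addsub complaint concrete, here is a minimal Python sketch; it is not the assembly from this commit. It assumes interleaved (re, im) storage and models an x86-style addsub operation (subtract in even lanes, add in odd lanes), which is the pattern a complex multiply needs. The names `addsub`, `addsub_via_sign_mask`, `SIGN_MASK` and `complex_mul_interleaved` are hypothetical, and reading "that single register" as a sign-mask constant is an assumption of this sketch, not something the message states.

```python
# Illustrative model only: lanes are plain Python floats, not SIMD registers.

def addsub(a, b):
    """Model of an x86-style addsub: even lanes a - b, odd lanes a + b."""
    return [x - y if i % 2 == 0 else x + y
            for i, (x, y) in enumerate(zip(a, b))]

# Assumed stand-in for the constant a target without addsub has to keep in a
# register: multiplying by it negates the even lanes so a plain add suffices.
SIGN_MASK = [-1.0, 1.0, -1.0, 1.0]

def addsub_via_sign_mask(a, b, mask=SIGN_MASK):
    """Same result as addsub() for 4 lanes, built from a multiply and an add."""
    return [x + m * y for x, y, m in zip(a, b, mask)]

def complex_mul_interleaved(a, b, addsub_fn=addsub):
    """Multiply complex pairs stored as [re0, im0, re1, im1].

    (ar + i*ai) * (br + i*bi) = (ar*br - ai*bi) + i*(ai*br + ar*bi)
    """
    b_re = [b[i - i % 2] for i in range(len(b))]      # [br0, br0, br1, br1]
    b_im = [b[i - i % 2 + 1] for i in range(len(b))]  # [bi0, bi0, bi1, bi1]
    a_swap = [a[i ^ 1] for i in range(len(a))]        # [ai0, ar0, ai1, ar1]
    t0 = [x * y for x, y in zip(a, b_re)]             # [ar*br, ai*br, ...]
    t1 = [x * y for x, y in zip(a_swap, b_im)]        # [ai*bi, ar*bi, ...]
    return addsub_fn(t0, t1)                          # even: t0 - t1, odd: t0 + t1
```

Passing `addsub_fn=addsub_via_sign_mask` gives the same products, at the cost of one extra constant that has to stay resident for the whole transform.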

Future optimization options are very few, perhaps better register
management to use more ld1/st1s.

All timings below are in cycles:
A53:
Length |       C | New (lavu) | Old (lavc) |    FFTW
------ | ------- | ---------- | ---------- | -------
     4 |     842 |        420 |       1210 |    1460
     8 |    1538 |       1020 |       1850 |    2520
    16 |    3717 |       1900 |       3700 |    3990
    32 |    9156 |       4070 |       8289 |    8860
    64 |   21160 |       9931 |      18600 |   19625
   128 |   49180 |      23278 |      41922 |   41922
   256 |  112073 |      53876 |      93202 |  101092
   512 |  252864 |     122884 |     205897 |  207868
  1024 |  560512 |     278322 |     458071 |  453053
  2048 | 1295402 |     775835 |    1038205 | 1020265
  4096 | 3281263 |    2021221 |    2409718 | 2577554
  8192 | 8577845 |    4780526 |    5673041 | 6802722
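
The cycle counts are easier to compare as ratios. The following Python sketch copies the A53 numbers from the table above and divides each baseline by the new lavu figure; the names `A53_CYCLES` and `speedups` are hypothetical helpers introduced only for this illustration.

```python
# Cycle counts copied from the A53 table: (C, New lavu, Old lavc, FFTW).
A53_CYCLES = {
    4:    (842, 420, 1210, 1460),
    8:    (1538, 1020, 1850, 2520),
    16:   (3717, 1900, 3700, 3990),
    32:   (9156, 4070, 8289, 8860),
    64:   (21160, 9931, 18600, 19625),
    128:  (49180, 23278, 41922, 41922),
    256:  (112073, 53876, 93202, 101092),
    512:  (252864, 122884, 205897, 207868),
    1024: (560512, 278322, 458071, 453053),
    2048: (1295402, 775835, 1038205, 1020265),
    4096: (3281263, 2021221, 2409718, 2577554),
    8192: (8577845, 4780526, 5673041, 6802722),
}

def speedups(length):
    """Return how many times faster the new code is than each baseline."""
    c, new, old, fftw = A53_CYCLES[length]
    return {"C": c / new, "Old (lavc)": old / new, "FFTW": fftw / new}

for n in sorted(A53_CYCLES):
    row = "  ".join(f"{name}: {ratio:.2f}x" for name, ratio in speedups(n).items())
    print(f"{n:5d}  {row}")
```

Every ratio comes out above 1, so on these figures the new code needs fewer cycles than all three baselines at every listed length.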

Apple M1
New  - Total for len 512 reps 2097152 = 1.459141 s
Old  - Total for len 512 reps 2097152 = 2.251344 s
FFTW - Total for len 512 reps 2097152 = 1.868429 s

New  - Total for len 1024 reps 4194304 = 6.490080 s
Old  - Total for len 1024 reps 4194304 = 9.604949 s
FFTW - Total for len 1024 reps 4194304 = 7.889281 s

New  - Total for len 16384 reps 262144 = 10.374001 s
Old  - Total for len 16384 reps 262144 = 15.266713 s
FFTW - Total for len 16384 reps 262144 = 12.341745 s

New  - Total for len 65536 reps 8192 = 1.769812 s
Old  - Total for len 65536 reps 8192 = 4.209413 s
FFTW - Total for len 65536 reps 8192 = 3.012365 s

New  - Total for len 131072 reps 4096 = 1.942836 s
Old  - Segfaults
FFTW - Total for len 131072 reps 4096 = 3.713713 s
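
The Apple M1 totals cover different repetition counts per length, so a per-transform time is a fairer comparison. This Python sketch divides each total by its repetition count; `M1_RUNS` and `ns_per_transform` are hypothetical names, and the segfaulting run is recorded as `None` rather than given a made-up value.

```python
# Totals in seconds copied from the Apple M1 listing: length -> (reps, totals).
M1_RUNS = {
    512:    (2097152, {"New": 1.459141, "Old": 2.251344, "FFTW": 1.868429}),
    1024:   (4194304, {"New": 6.490080, "Old": 9.604949, "FFTW": 7.889281}),
    16384:  (262144,  {"New": 10.374001, "Old": 15.266713, "FFTW": 12.341745}),
    65536:  (8192,    {"New": 1.769812, "Old": 4.209413, "FFTW": 3.012365}),
    131072: (4096,    {"New": 1.942836, "Old": None, "FFTW": 3.713713}),  # Old segfaults
}

def ns_per_transform(length, impl):
    """Average time of one transform in nanoseconds, or None if the run failed."""
    reps, totals = M1_RUNS[length]
    total = totals[impl]
    if total is None:
        return None
    return total / reps * 1e9

for n in sorted(M1_RUNS):
    for impl in ("New", "Old", "FFTW"):
        t = ns_per_transform(n, impl)
        label = "failed" if t is None else f"{t:.1f} ns"
        print(f"len {n:6d}  {impl:4s}  {label}")
```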

Thanks to wbs for some simplifications, assembler fixes and a review
and to jannau for giving it a look.

2 years ago

3 Commits (3dc7a9f6239d909b3e66712b32380822466882af)