A Fast Fourier Transform Compiler Matteo Frigo Supercomputing Technologies Group MIT Laboratory for Computer Science
1 genfft loih Srne ta. 97 automatically. 1987] al., real-number et a [Sorensen derives algorithm it algorithm FFT complex-number a From algorithms. FFT ML. unknown of previously dialect “Discovered” a 1998], [Leroy 2.0 Caml Objective size. in arbitrary Written of transforms Fourier for code C fast Generates sadomain-speci a is fi F compiler. FFT c genfft 2 genfft bility oe ouetto n ecmr eut at results benchmark and MPI. documentation and Code, threads Posix Cilk, for available are versions Parallel FFTW size. Fourier arbitrary discrete of complex comput- transforms and for real library multidimensional a and one- is ing 1997–1999] Johnson, and [Frigo FFTW rdcdte“ne op fFT 9%o h code). the of (95% FFTW of loop” “inner the produced and adapts genfft speed http theory lcs mit ed genfft teft h adaeatmtcly giving automatically, hardware the to itself ttesm time. same the at ae FWpossible. FFTW makes sdi FFTW in used u fftw porta- 3 FFTW’s performance FFTW is faster than all other portable FFT codes, and it is compara- ble with machine-specific libraries provided by vendors.
250 B J B FFTW Singleton
B J J SUNPERF Krukar 200 J J B J B J H Ooura M Temperton B F FFTPACK Bailey J B B 150 B B B F J Green NR F Sorensen 7 QFT J H B H HF F mflops F 100 J H F F B H H H H HF F HB HF M M J H B H M M M M M M B 50 J 7 FM H B HB B HB B F 7 7 7 7 J J H J HJ J H HB H M 7 F F J F J J FJ 7 7 7 F F F F F B M 7 7 M M H M 7 7 M M M M M M FJ M 7 7 7 7 7 7 0 M 2 4 8 16 32 64 128 256 512 1024 2048 4096 8192 16384 32768 65536 131072 262144 524288 1048576 2097152 4194304 Transform Size Benchmark of FFT codes on a 167-MHz Sun UltraSPARC I, double precision.
4 FFTW’s performance
250 B B B B B FFTW B B B 200 E SUNPERF B B E E E B 150 E B E B E E E B E 100 E B E B E E B E B B B E E 50 E E B E 0 E
transform size
FFTW versus Sun’s Performance library on a 167-MHz Sun Ultra- SPARC I, single precision.
5 Why FFTW is fast FFTW does not implement a single fixed algorithm. Instead, the transform is computed by an executor, composed of highly opti- mized, composable blocks of C code called codelets. Roughly speak- ing, a codelet computes a Fourier transform of small size. At runtime, a planner finds a fast composition of codelets. The plan- ner measures the speed of different plans and chooses the fastest.
FFTW contains 120 codelets for a total of 56215 lines of op- timized code.
6 Real FFTs are complex The data-flow graph of the Fourier transform is not a butterfly, it is a mess! Complex butterfly Real and imaginary parts
The FFT is even messier when the size is
not a power of , or when the input con- sists of real numbers.
7 4. 3. 2. 1. guages.) Unparse Schedule Simplify optimizing. bother not Do dag a Create genfft h dag. the oC Cudas nas oFRRNo te lan- other or FORTRAN to unparse also (Could C. to h a,uigargse-biiu schedule. register-oblivious a using dag, the o nFTagrtmi h ipetpsil way. possible simplest the in algorithm FFT an for scmiainstrategy compilation ’s 8 The case for genfft Why not rely on the C compiler to optimize codelets?