A Compiler Matteo Frigo Supercomputing Technologies Group MIT Laboratory for Computer Science

1 genfft loih Srne ta. 97 automatically. 1987] al., real-number et a [Sorensen derives algorithm it algorithm FFT complex-number a From algorithms. FFT ML. unknown of previously dialect “Discovered” a 1998], [Leroy 2.0 Caml Objective size. in arbitrary Written of transforms Fourier for code fast Generates sadomain-speci a is fi F compiler. FFT c genfft 2 genfft bility oe ouetto n ecmr eut at results benchmark and MPI. documentation and Code, threads Posix Cilk, for available are versions Parallel FFTW size. Fourier arbitrary discrete of complex comput- transforms and for real library multidimensional a and one- is ing 1997–1999] Johnson, and [Frigo FFTW rdcdte“ne op fFT 9%o h code). the of (95% FFTW of loop” “inner the produced and adapts genfft speed httptheorylcsmited genfft teft h adaeatmtcly giving automatically, hardware the to itself ttesm time. same the at ae FWpossible. FFTW makes sdi FFTW in used u fftw porta- 3 FFTW’s performance FFTW is faster than all other portable FFT codes, and it is compara- ble with machine-specific libraries provided by vendors.

250 B J B FFTW Singleton

B J J SUNPERF Krukar 200 J J B J B J H Ooura M Temperton B F FFTPACK Bailey J B B 150 B B B F J Green NR F Sorensen 7 QFT J H B H HF F mflops F 100 J H F F B H H H H HF F HB HF M M J H B H M M M M M M B 50 J 7 FM H B HB B HB B F 7 7 7 7 J J H J HJ J H HB H M 7 F F J F J J FJ 7 7 7 F F F F F B M 7 7 M M H M 7 7 M M M M M M FJ M 7 7 7 7 7 7 0 M 2 4 8 16 32 64 128 256 512 1024 2048 4096 8192 16384 32768 65536 131072 262144 524288 1048576 2097152 4194304 Transform Size Benchmark of FFT codes on a 167-MHz Sun UltraSPARC I, double precision.

4 FFTW’s performance

250 B B B B B FFTW B B B 200 E SUNPERF B B E E E B 150 E B E B E E E B E 100 E B E B E E B E B B B E E 50 E E B E 0 E

transform size

FFTW versus Sun’s Performance library on a 167-MHz Sun Ultra- SPARC I, single precision.

5 Why FFTW is fast FFTW does not implement a single fixed algorithm. Instead, the transform is computed by an executor, composed of highly opti- mized, composable blocks of C code called codelets. Roughly speak- ing, a codelet computes a Fourier transform of small size. At runtime, a planner finds a fast composition of codelets. The plan- ner measures the speed of different plans and chooses the fastest.

FFTW contains 120 codelets for a total of 56215 lines of op- timized code.

6 Real FFTs are complex The data-flow graph of the Fourier transform is not a butterfly, it is a mess! Complex butterfly Real and imaginary parts

The FFT is even messier when the size is

not a power of , or when the input con- sists of real numbers.

7 4. 3. 2. 1. guages.) Unparse Schedule Simplify optimizing. bother not Do dag a Create genfft h dag. the oC Cudas nas oFRRNo te lan- other or to unparse also (Could C. to h a,uigargse-biiu schedule. register-oblivious a using dag, the o nFTagrtmi h ipetpsil way. possible simplest the in algorithm FFT an for scmiainstrategy compilation ’s 8 The case for genfft Why not rely on the C compiler to optimize codelets?

FFT algorithms are best expressed recursively, but C compilers are not good at unrolling recursion.

The FFT needs very long code sequences to use all registers ef- fectively. (FFTW on Alpha EV56 computes FFT(64) with 2400 lines of straight C code.)

Register allocators choke if you feed them with long code se- quences. In the case of the FFT, however, we know the (asymp- totically) optimal register allocation.

Certain optimizations only make sense for FFTs. Since C compilers are good at instruction selection and instruction

scheduling, I did not need to implement these steps in genfft .

9 Outline

Dag creation.

The simplifier.

Register-oblivious scheduling.

Related work.

Conclusion.

10 Dag representation

A dag is represented as a data type node . (Think of a syntax tree or node node of Times list node of Plus a symbolic algebra system.)

Variablevariable of Load Numbernumber of Num

v

node type



v

  

v v v Num Times Plus v



v

 

v Num Times v Plus v



v



v

(All numbers are real.)

An abstraction layer implements complex arithmetic on pairs of dag nodes. 11 Dag creation

The function fftgen performs a complete symbolic evaluation of an FFT algorithm, yielding a dag.

No single FFT algorithm is good for all input sizes n . genfft con-

tains many algorithms, and fftgen dispatches the most appropriate.

Cooley-Tukey [1965] (workspq for n ).

Prime factor [Good, 1958] (workspq for n q whenp gcd ).

Split-radix [Duhamel and Hollman, 1984] (worksk for n ).

Rader [1968] (works when n is prime).

12 Cooley-Tukey FFT algorithm DSP book:



 j  j  j

q k q mod k outer k fun



j  pj j k

p n q n

x x y

A

  

in k twiddle p fftgen k outer let

k j k j k j jk

X X X

 n

 q  p

in k j inner k j n omega

 

j k twiddle let wherepq n qk k and k .

in j j p x j fun

q fftgen j inner OCamllet code:

x q p n cooleytukey let

13 Outline

Dag creation.

The simplifier.

Register-oblivious scheduling.

Related work.

Conclusion.

14 The simplifier Well-known optimizations:

Algebraic simplification,a  e.g., a .

Constant folding.

Common-subexpression elimination. (In the current implemen- tation, CSE is encapsulated into a monad.) Specific to the FFT:

(Elimination of negative constants.)

Network transposition.

Why did I implement these well-known optimizations, which are implemented in C compilers anyway? Because the the specific optimizations can only be done after the well-known optimizations.

15 y t t y

Network transposition

x s s x

A network is a dag that computes a linear function [Crochiere and

t y t y

Oppenheim, 1975]. (Recall that the FFT is linear.) Reversing all edges yields a transposed network.

s x s x

If a network computesAY X , the transposed network computes

X A Y .

T 16 Optimize the transposed dag!

genfft employs this dag optimization algorithm:



G

OPTIMIZE( G )

SIMPLIFY( G )

 

G SIMPLIFY( G )

T T

 RETURN SIMPLIFY( G ) (Iterating the transposition has no effect.)

17 Benefits of network transposition

genfft literature size adds muls muls savings adds muls original transposed complex to complex 5 32 16 12 25% 34 10 10 84 32 24 25% 88 20 13 176 88 68 23% 214 76 15 156 68 56 18% 162 50 real to complex 5 12 8 6 25% 17 5 10 34 16 12 25% ?? 13 76 44 34 23% ?? 15 64 31 25 19% 67 25

18 Outline

Dag creation.

The simplifier.

Register-oblivious scheduling.

Related work.

Conclusion.

19 How to help register allocators

genfft ’s scheduler reorders the dag so that the register allocator of the C compiler can minimize the number of register spills.

genfft ’s recursive partitioning heuristic is asymptotically optimal

for n .

k

genfft ’s schedule is register-oblivious: This schedule does not de-

pend on the number R of registers, and yet it is optimal for all R .

Certain C compilers insist on imposing their own schedule.

With egcs /SPARC, FFTW is 50-100% faster if I use

fnoscheduleinsns . (How do I persuade the SGI compiler to respect my sched- ule?)

20 Recursive partitioning The scheduler partitions the dag by cutting it in the “middle.” This operation creates connected components that are scheduled recur- sively.

21 Register-oblivious scheduling

Assume that we want to compute the FFT of n points, and a computer

has R registers, whereR n .

Lower bound [Hong and Kung, 1981]. If n ,then -point FFT

k

requires R lg atn leastlg n register spills.

Upper bound. If n , genfft ’s output program for n -point FFT

k

incurs R lg atn mostlg n O register spills. A more general theory of cache-oblivious algorithms exists [Frigo, Leiserson, Prokop, and Ramachandran, 1999].

22 Related work

FOURGEN [Maruhn, 1976]. Written in PL/I, generates FOR-

TRAN transforms of size .

k

[Johnson and Burrus, 1983] automate the design of DFT pro- grams using dynamic programming.

[Selesnick and Burrus, 1986] generate MATLAB programs for FFTs of certain sizes.

[Perez and Takaoka, 1987] generated prime-factor FFT programs.

[Veldhuizen, 1995] uses a template metaprograms technique to generate C++.

EXTENT [Gupta et al., 1996] compiles a tensor-product lan- guage into FORTRAN.

23 Conclusion A domain-specific compiler is a valuable tool.

For the FFT on modern processors, performance requires straight- line sequences with hundreds of instructions.

Correctness was surprisingly easy.

Rapid turnaround. (About 15 minutes to regenerate FFTW.)

We can implement problem-specific code improvements.

genfft derives new algorithms.

24