A Fast Fourier Transform Compiler
Matteo Frigo MIT Laboratory for Computer Science 545 Technology Square NE43-203
Cambridge, MA 02139 athena theory lcs mit ed u
February 16, 1999
Abstract 250 B B B B B FFTW The FFTW library for computing the discrete Fourier trans- B B B 200 E SUNPERF form (DFT) has gained a wide acceptance in both academia B B E and industry, because it provides excellent performance on E B 150 E E B E B a variety of machines (even competitive with or faster than E E E B E equivalent libraries supplied by vendors). In FFTW, most of 100 E B the performance-critical code was generated automatically E B E E B E B B B by a special-purpose compiler, called genfft, that outputs E E 50 E
C code. Written in Objective Caml, genfft can produce E B DFT programs for any input length, and it can specialize E 0 E the DFT program for the common case where the input data
are real instead of complex. Unexpectedly, genfft “discov- ered” algorithms that were previously unknown, and it was transform size able to reduce the arithmetic complexity of some other ex- isting algorithms. This paper describes the internals of this Figure 1: Graph of the performance of FFTW versus Sun’s Per- special-purpose compiler in some detail, and it argues that a formance Library on a 167 MHz UltraSPARC processor in single specialized compiler is a valuable tool. precision. The graph plots the speed in “mflops” (higher is better) versus the size of the transform. This figure shows sizes that are powers of two, while Figure 2 shows other sizes that can be fac- 1 Introduction tored into powers of 2, 3, 5, and 7. This distinction is important because the DFT algorithm depends on the factorization of the size, Recently, Steven G. Johnson and I released Version 2.0 of and most implementations of the DFT are optimized for the case the FFTW library [FJ98, FJ], a comprehensive collection of of powers of two. See [FJ97] for additional experimental results. fast C routines for computing the discrete Fourier transform FFTW was compiled with Sun’s C compiler (WorkShop Compilers (DFT) in one or more dimensions, of both real and complex 4.2 30 Oct 1996 C 4.2). data, and of arbitrary input size. The DFT [DV90] is one of the most important computational problems, and many (which amounts to 95% of the total code) was generated au- real-world applications require that the transform be com- tomatically by a special-purpose compiler written in Objec- puted as quickly as possible. FFTW is one of the fastest tive Caml [Ler98]. This paper explains how this compiler DFT programs available (see Figures 1 and 2) because of two works. unique features. First, FFTW automatically adapts the com- FFTW does not implement a single DFT algorithm, but it putation to the hardware. Second, the inner loop of FFTW is structured as a library of codelets—sequences of C code that can be composed in many ways. In FFTW’s lingo, a The author was supported in part by the Defense Advanced Research composition of codelets is called a plan. You can imagine Projects Agency (DARPA) under Grant F30602-97-1-0270, and by a Digital Equipment Corporation fellowship. the plan as a sort of bytecode that dictates which codelets should be executed in what order. (In the current FFTW To appear in Proceedings of the 1999 ACM SIGPLAN implementation, however, a plan has a tree structure.) The Conference on Programming Language Design and Im- precise plan used by FFTW depends on the size of the in- plementation (PLDI), Atlanta, Georgia, May 1999. put (where “the input” is an array of n complex numbers), and on which codelets happen to run fast on the underly-
1 250 besides noticing common subexpressions, the simplifier B also attempts to create them. The simplifier is written in B B FFTW B B 200 E SUNPERF monadic style [Wad97], which allowed me to deal with B the dag as if it were a tree, making the implementation B 150 B B B E E E E B B much easier. E B B B B E
E genfft E 3. In the scheduler, produces a topological sort of
100 E B k
E B the dag (a “schedule”) that, for transforms of size , E B E E B provably minimizes the asymptotic number of register 50 E E E E E spills, no matter how many registers the target machine E has. This truly remarkable fact can be derived from 0 the analysis of the red-blue pebbling game of Hong and Kung [HK81], as we shall see in Section 6. For trans- transform size forms of other sizes the scheduling strategy is no longer provably good, but it still works well in practice. Again, Figure 2: See caption of Figure 1. the scheduler depends heavily on the topological struc- ture of DFT dags, and it would not be appropriate in a ing hardware. The user need not choose a plan by hand, general-purpose compiler. however, because FFTW chooses the fastest plan automat- ically. The machinery required for this choice is described 4. Finally, the schedule is unparsed to C. (It would be easy elsewhere [FJ97, FJ98]. to produce FORTRAN or other languages by changing Codelets form the computational kernel of FFTW. You the unparser.) The unparser is rather obvious and unin- can imagine a codelet as a fragment of C code that computes teresting, except for one subtlety discussed in Section 7. a Fourier transform of a fixed size (say, 16, or 19).1 FFTW’s codelets are produced automatically by the FFTW codelet Although the creation phase uses algorithms that have
been known for several years, the output of genfft is at genfft generator, unimaginatively called genfft. is an
unusual special-purpose compiler. While a normal compiler times completely unexpected. For example, for a complex