The Design and Implementation of FFTW3
Matteo Frigo and Steven G. Johnson
(Invited Paper)
Abstract— FFTW is an implementation of the discrete Fourier transform (DFT) that adapts to the hardware in order to maximize performance. This paper shows that such an approach can yield an implementation that is competitive with hand-optimized libraries, and describes the software structure that makes our current FFTW3 version flexible and adaptive. We further discuss a new algorithm for real-data DFTs of prime size, a new way of implementing DFTs by means of machine-specific "SIMD" instructions, and how a special-purpose compiler can derive optimized implementations of the discrete cosine and sine transforms automatically from a DFT algorithm.

Index Terms— FFT, adaptive software, Fourier transform, cosine transform, Hartley transform, I/O tensor.

M. Frigo is with the IBM Austin Research Laboratory, 11501 Burnet Road, Austin, TX 78758. He was supported in part by the Defense Advanced Research Projects Agency (DARPA) under contract No. NBCH30390004. S. G. Johnson is with the Massachusetts Institute of Technology, 77 Mass. Ave. Rm. 2-388, Cambridge, MA 02139. He was supported in part by the Materials Research Science and Engineering Center program of the National Science Foundation under award DMR-9400334. Published in Proc. IEEE, vol. 93, no. 2, pp. 216–231 (2005).

I. INTRODUCTION

FFTW [1] is a widely used free-software library that computes the discrete Fourier transform (DFT) and its various special cases. Its performance is competitive even with vendor-optimized programs, but unlike these programs, FFTW is not tuned to a fixed machine. Instead, FFTW uses a planner to adapt its algorithms to the hardware in order to maximize performance. The input to the planner is a problem, a multi-dimensional loop of multi-dimensional DFTs. The planner applies a set of rules to recursively decompose a problem into simpler sub-problems of the same type. "Sufficiently simple" problems are solved directly by optimized, straight-line code that is automatically generated by a special-purpose compiler. This paper describes the overall structure of FFTW as well as the specific improvements in FFTW3, our latest version.

FFTW is fast, but its speed does not come at the expense of flexibility. In fact, FFTW is probably the most flexible DFT library available:

• FFTW is written in portable C and runs well on many architectures and operating systems.
• FFTW computes DFTs in O(n log n) time for any length n. (Most other DFT implementations are either restricted to a subset of sizes or they become Θ(n²) for certain values of n, for example when n is prime.)
• FFTW imposes no restrictions on the rank (dimensionality) of multi-dimensional transforms. (Most other implementations are limited to one-dimensional, or at most two- and three-dimensional data.)
• FFTW supports multiple and/or strided DFTs; for example, to transform a 3-component vector field or a portion of a multi-dimensional array. (Most implementations support only a single DFT of contiguous data.)
• FFTW supports DFTs of real data, as well as of real symmetric/antisymmetric data (also called discrete cosine/sine transforms).

The interaction of the user with FFTW occurs in two stages: planning, in which FFTW adapts to the hardware, and execution, in which FFTW performs useful work for the user. To compute a DFT, the user first invokes the FFTW planner, specifying the problem to be solved. The problem is a data structure that describes the "shape" of the input data—array sizes and memory layouts—but does not contain the data itself. In return, the planner yields a plan, an executable data structure that accepts the input data and computes the desired DFT. Afterwards, the user can execute the plan as many times as desired.

The FFTW planner works by measuring the actual run time of many different plans and by selecting the fastest one. This process is analogous to what a programmer would do by hand when tuning a program to a fixed machine, but in FFTW's case no manual intervention is required. Because of the repeated performance measurements, however, the planner tends to be time-consuming. In performance-critical applications, many transforms of the same size are typically required, and therefore a large one-time cost is usually acceptable. Otherwise, FFTW provides a mode of operation where the planner quickly returns a "reasonable" plan that is not necessarily the fastest.

The planner generates plans according to rules that recursively decompose a problem into simpler sub-problems. When the problem becomes "sufficiently simple," FFTW produces a plan that calls a fragment of optimized straight-line code that solves the problem directly. These fragments are called codelets in FFTW's lingo. You can envision a codelet as computing a "small" DFT, but many variations and special cases exist. For example, a codelet might be specialized to compute the DFT of real input (as opposed to complex).

FFTW's speed depends therefore on two factors. First, the decomposition rules must produce a space of plans that is rich enough to contain "good" plans for most machines. Second, the codelets must be fast, since they ultimately perform all the real work.

FFTW's codelets are generated automatically by a special-purpose compiler called genfft. Most users do not interact with genfft at all: the standard FFTW distribution contains a set of about 150 pre-generated codelets that cover the most common uses. Users with special needs can use genfft to generate their own codelets. genfft is useful because of the following features. From a high-level mathematical description of a DFT algorithm, genfft derives an optimized implementation automatically. From a complex-DFT algorithm, genfft automatically derives an optimized algorithm for the real-input DFT. We take advantage of this property to implement real-data DFTs (Section VII), as well as to exploit machine-specific "SIMD" instructions (Section IX). Similarly, genfft automatically derives codelets for the discrete cosine (DCT) and sine (DST) transforms (Section VIII). We summarize genfft in Section VI, while a full description appears in [2].

We have produced three major implementations of FFTW, each building on the experience of the previous system. FFTW1 [3] (1997) introduced the idea of generating codelets automatically, and of letting a planner search for the best combination of codelets. FFTW2 (1998) incorporated a new version of genfft [2]. genfft did not change much in FFTW3 (2003), but the runtime structure was completely rewritten to allow for a much larger space of plans. This paper describes the main ideas common to all FFTW systems, the runtime structure of FFTW3, and the modifications to genfft since FFTW2.

Previous work on adaptive systems includes [3]–[11]. In particular, SPIRAL [9], [10] is another system focused on optimization of Fourier transforms and related algorithms, but it has distinct differences from FFTW. SPIRAL searches at compile-time over a space of mathematically equivalent formulas expressed in a "tensor-product" language, whereas FFTW searches at runtime over the formalism discussed in Section IV, which explicitly includes low-level details, such as strides and memory alignments, that are not as easily expressed using tensor products. SPIRAL generates machine-dependent code, whereas FFTW's codelets are machine-independent. FFTW's search uses dynamic programming [12, chapter 16], while the SPIRAL project has experimented with a wider range of search strategies including machine-learning techniques [13].

The remainder of this paper is organized as follows. We begin with a general overview of fast Fourier transforms in Section II. Then, in Section III, we compare the performance of FFTW with that of other DFT implementations.

II. FFT OVERVIEW

The (forward, one-dimensional) DFT of an array X of n complex numbers is the array Y given by

Y[k] = \sum_{j=0}^{n-1} X[j] \, \omega_n^{jk},   (1)

where 0 ≤ k < n and ω_n = exp(−2πi/n) is a primitive nth root of unity. Implemented directly, Eq. (1) would require Θ(n²) operations; fast Fourier transforms are O(n log n) algorithms to compute the same result. The most important FFT (and the one primarily used in FFTW) is known as the "Cooley-Tukey" algorithm, after the two authors who rediscovered and popularized it in 1965 [14], although it had been previously known as early as 1805 by Gauss as well as by later re-inventors [15]. The basic idea behind this FFT is that a DFT of a composite size n = n1n2 can be re-expressed in terms of smaller DFTs of sizes n1 and n2—essentially, as a two-dimensional DFT of size n1 × n2 where the output is transposed. The choices of factorizations of n, combined with the many different ways to implement the data re-orderings of the transpositions, have led to numerous implementation strategies for the Cooley-Tukey FFT, with many variants distinguished by their own names [16], [17]. FFTW implements a space of many such variants, as described later, but here we derive the basic algorithm, identify its key features, and outline some important historical variations and their relation to FFTW.

The Cooley-Tukey algorithm can be derived as follows. If n can be factored into n = n1n2, Eq. (1) can be rewritten by letting j = j1n2 + j2 and k = k1 + k2n1. We then have:

Y[k_1 + k_2 n_1] = \sum_{j_2=0}^{n_2-1} \left[ \left( \sum_{j_1=0}^{n_1-1} X[j_1 n_2 + j_2] \, \omega_{n_1}^{j_1 k_1} \right) \omega_n^{j_2 k_1} \right] \omega_{n_2}^{j_2 k_2}.   (2)

Thus, the algorithm computes n2 DFTs of size n1 (the inner sum), multiplies the result by the so-called twiddle factors ω_n^{j2 k1}, and finally computes n1 DFTs of size n2 (the outer sum). This decomposition is then continued recursively. The literature uses the term radix to describe an n1 or n2 that is bounded (often constant); the small DFT of the radix is traditionally called a butterfly.

Many well-known variations are distinguished by the radix alone. A decimation in time (DIT) algorithm uses n2 as the radix, while a decimation in frequency (DIF) algorithm uses n1 as the radix. If multiple radices are used, e.g. for n composite but not a prime power, the algorithm is called mixed radix.