Fastest Fourier Transform in the West

Devendra Ghate and Nachi Gupta [email protected], [email protected]

Oxford University Computing Laboratory

MT 2006 Internal Seminar – p. 1/33 Cooley-Tukey Algorithm for FFT

N 1 2πi − nk Xk = X xne− N , k = 0, . . . , N − 1 n=0 2 If we solve for Xk directly then O(N ) operations required for the solution.

If N = p × q, then this transformation can be split into two separate transformations...... recursively, we get the FFT algorithm.

N If N is split into r intervals of r each then the complexity is r × N × logr N.

This can be generalised to mixed-radix algorithms.

MT 2006 Internal Seminar – p. 2/33 Overview of FFT algorithms

Cooley-Tukey FFT algorithm (popularized in 1965), known by Gauss in 1805.

Prime-factor FFT algorithm ± a.k.a. Good-Thomas (1958-1963), N = N1N2 becomes 2D N1 by N2 DFT for only relatively prime N1, N2.

Bruun's FFT algorithm (1978, generalized to arbitrary even composite sizes by H. Murakami in 1996), recursive polynomial factorization approach.

Rader's FFT algorithm (1968), for prime size by expressing DFT as a convolution.

Bluestein's FFT algorithm (1968) ± a.k.a. chirp z-transform algorithm (1969), for prime sizes by expressing DFT as a convolution.

Rader-Brenner (1976) ± Cooley-Tukey like, but purely imag twiddle factors

MT 2006 Internal Seminar – p. 3/33 What is FFTW?

FFTW is a package for computing a one or multidimensional complex discrete Fourier transform (DFT) of arbitrary size.

At the heart of FFTW lies a philosophy geared towards speed, portability, and elegance – achieved via an adaptive software architecture.

Code, documentation, and some benchmarking results at www.fftw.org.

MT 2006 Internal Seminar – p. 4/33 Features

Speed. Both one-dimensional and multi-dimensional transforms. Arbitrary size (small prime factors are best, but FFTW uses O(N log N) algorithms even for prime sizes). Fast transforms of purely real input or output data. Discrete Cosine Transform and Discrete Sine Transform. Parallel Transforms (an MPI version of distributed memory transforms). Portable to any platform with a compiler. Both C and interfaces.

GPL MT 2006 Internal Seminar – p. 5/33 Wilkinson Prize

FFTW received the 1999 J. H. Wilkinson Prize for Numerical Software, which is awarded every four years to the software that “best addresses all phases of the preparation of high quality numerical software.”

Wilkinson, a seminal figure in modern , was a key proponent of the notion of reusable common libraries for scientific computing.

MT 2006 Internal Seminar – p. 6/33 Why FFTW is fast!

The transform is computed by an executor, composed of highly optimized composable blocks of C code called codelets.

At runtime, a planner finds an efficient way to mix these codelets: it measures the speeds of different plans and chooses the best using a dynamic programming algorithm.

The executor then interprets this plan with negligible overhead.

Codelets are generated automatically and fast.

MT 2006 Internal Seminar – p. 7/33 FFTW is easy to use

FFTW’s internal complexity is not visible to the user. int n = 1024;

// allocate memory in = fftw_malloc( sizeof( fftw_complex ) * n ); out = fftw_malloc( sizeof( fftw_complex ) * n );

// Create plan fftw_plan plan = fftw_plan_dft_1d( n, in, out, FFTW_FORWARD, FFTW_ESTIMATE );

// ... fill in ...

// Execute fftw_execute( plan );

// Destroy plan fftw_destroy_plan( plan );

MT 2006 Internal Seminar – p. 8/33 The Executor

The executor implements the recursive divide and conquer Cooley-Tukey FFT algorithm.

The executor is composed of many optimized code sequences called codelets. Codelets come in two flavors: Nontwiddle codelets compute transforms of small sizes. Twiddle codelets combine small transforms to compute bigger transforms. The executor invokes the codelets as dictated by the plan.

MT 2006 Internal Seminar – p. 9/33 The Planner

A planner tries out various combinations of codelets to decide on the best algorithm.

The planner can produce many plans, measure the speed of them, and then pick the best. This is accomplished via dynamic programming techniques. The planner can produce a “reasonable” plan quickly if desired. The planner collects information about the machines, which can be stored on disk and reused at a later time.

MT 2006 Internal Seminar – p. 10/33 What a plan looks like

FFT( 128 ) = DIVIDE-AND-CONQUER( 128, 4 ) DIVIDE-AND-CONQUER( 32, 8 ) SOLVE( 4 ) The plan is a sequence of instructions, either (SOLVE)(n): compute the FFT of size n using a nontwiddle codelet. (DIVIDE-AND-CONQUER)(n, p): solve p problems of size n/p using the rest of the plan, and then combine the results using a twiddle codelet.

MT 2006 Internal Seminar – p. 11/33 What a real plan looks like

Plan for an array of size 247

(dft-ct-dit/13

(dftw-generic-dit-13-19

(dft-direct-13-x19 "n1\_13"))

(dft-vrank>=1-x13/1

(dft-generic-19)))

MT 2006 Internal Seminar – p. 12/33 FFTW Plans

FFTW_ESTIMATE - no run-time tuning, probably suboptimal FFTW_MEASURE - experiment with different algorithms and choose fastest FFTW_PATIENT - experiment with more algorithms, ... FFTW_EXHAUSTIVE - even more, ...

Creating a plan for an array of size 823467 in FFTW_EXHAUSTIVE modes takes 25 minutes.

MT 2006 Internal Seminar – p. 13/33 Wisdom

A central or local database of plans for various array sizes for a given hardware configuration.

Helps greatly in the estimate mode Reduces computation time for the exhaustive mode Wisdom from the other array sizes is also utilised Can be exported from one architecture to another

Matlab uses a wisdom database by default.

MT 2006 Internal Seminar – p. 14/33 FFTW Compiler

Various basic algorithms written in CamL (a functional language) Directed acyclic graph (DAG) created for all these algorithms Optimisation rules defined especially suited for FFT algorithms (these also include standard optimisations carried out by a compiler) Small snippets of code, “codelets”, generated for FFT algorithms of size N = 2 : 16

MT 2006 Internal Seminar – p. 15/33 Multi-dimensional Arrays

FFTW does not accept multidimensional arrays.

A single dimensional array in the "row-major" format (C standard) Use of fftw_malloc for contiguous array allocation

MKL routines accept multi-dimensinal arrays. Various compact formats are also defined.

MT 2006 Internal Seminar – p. 16/33 NAG FFT routines

Complex-Complex, Real-Complex routines FFT algorithm (Brigham 1974) Scaling factor of 1 √n Array dimension n s.t. the largest prime factor of n does not exceed 19 and total number of factors of n including repetitions, does not exceed 20.

MT 2006 Internal Seminar – p. 17/33 MKL FFT routines

Complex-Complex & real-complex transforms Multi-dimensional arrays upto order 7 Interface for both FORTRAN and C Mixed-redix transforms !

MKL also provides an API for FFTW 3.x !!!

MATLAB uses FFTW3 internally for all FFT calculations. Interface for setting plan options available.

MT 2006 Internal Seminar – p. 18/33 Henrici icc used with following flags:

-O3 : Level-3 opitization -c99 : Support for complex numbers using C9X standards -ipo : Multi file inlining -funroll-loops : Unrolling the loops -xN : Compile and optimise code specifically for Pentium-4 -align: Analyse and reorder the memory layout

MT 2006 Internal Seminar – p. 19/33 Benchmark

Time calculated by average over 1000 runs for each array size N Minimum time from 8 trials

FLOPS = 5 × N × log2 N Compiler: icc

MT 2006 Internal Seminar – p. 20/33 FLOPS void fftw_flops(plan, add, mul, fma)

8 10

7 10

6 10 FLOPs 5 10 5Nlog2(N) FLOPS 2Nlog2(N) 4 10

3 10 0 1 2 3 4 5 6 7 4 N x 10

MT 2006 Internal Seminar – p. 21/33 1D Complex Bench - Powers of 2

Benchmark for complex case 3000 FFTW3−Estimate FFTW3−Measure FFTW3−Patient FFTW3−Exhaustive 2500 MKL NAG MATLAB−Estimate MATLAB−Measure MATLAB−Patient 2000 MATLAB−Exhaustive

1500 MFLOPS/s

1000

500

0 0 1 2 3 4 5 10 10 10 10 10 10 N (Array Size) MT 2006 Internal Seminar – p. 22/33 1D Complex Bench - General

Benchmark for complex case 2500 FFTW3−Estimate FFTW3−Measure FFTW3−Patient FFTW3−Exhaustive MKL 2000 NAG MATLAB−Estimate MATLAB−Measure MATLAB−Patient MATLAB−Exhaustive 1500 MFLOPS/s

1000

500

0 0 1 2 3 4 5 6 10 10 10 10 10 10 10 N (Array Size)

MT 2006 Internal Seminar – p. 23/33 1D Complex Bench - Combined

Benchmark for complex case 2500 FFTW3−Estimate FFTW3−Measure FFTW3−Patient FFTW3−Exhaustive MKL 2000 NAG MATLAB−Estimate MATLAB−Measure MATLAB−Patient MATLAB−Exhaustive 1500 MFLOPS/s

1000

500

0 0 1 2 3 4 5 10 10 10 10 10 10 N (Array Size)

MT 2006 Internal Seminar – p. 24/33 BenchFFT Results on Henrici1

double-precision complex, 1d transforms powers of two 3000 fftw3 out-of-place fftw3 in-place ooura-sg spiral-egner-fft 2500 ffte green dfftpack rmayer-lookup 2000 bloodworth gsl-mixed-radix harm arprec 1500 sciport fxt-fht numutils kissfft speed (mflops) monnier 1000 mpfun77 mixfft cross fxt-matrixfft jmfftc 500 esrfft valkenburg

02 4 8 16 32 64 128 256 512 1024 2048 4096 8192 16384 32768 65536 131072 262144

MT 2006 Internal Seminar – p. 25/33 BenchFFT Results on Henrici1

double-precision complex, 1d transforms non-powers of two 2000 fftw3 out-of-place fftw3 in-place dfftpack ffte gsl-mixed-radix kissfft 1500 monnier mixfft jmfftc numutils valkenburg

1000 speed (mflops)

500

06 9 12 15 18 24 36 80 108 210 504 1000 1960 4725 10368 27000 75600 165375

MT 2006 Internal Seminar – p. 26/33 BenchFFT Results on Henrici1

double-precision real-data, 1d transforms powers of two 2000 fftw3 out-of-place fftw3-r2r fftw3 in-place ooura-sg fftreal sciport 1500 green rmayer-lookup dfftpack bloodworth fxt-fht-real gsl-mixed-radix 1000 jmfftc speed (mflops)

500

02 4 8 16 32 64 128 256 512 1024 2048 4096 8192 16384 32768 65536 131072 262144

MT 2006 Internal Seminar – p. 27/33 BenchFFT Results on Henrici1

double-precision real-data, 1d transforms non-powers of two 2000 fftw3-r2r fftw3 out-of-place fftw3 in-place dfftpack gsl-mixed-radix jmfftc 1500

1000 speed (mflops)

500

06 9 12 15 18 24 36 80 108 210 504 1000 1960 4725 10368 27000 75600 165375

MT 2006 Internal Seminar – p. 28/33 BenchFFT Results on Henrici1

double-precision complex, 2d transforms powers of two 2500 fftw3 in-place fftw3 out-of-place ooura-4f2d green ffte 2000 harm fxt-twodim jmfftc

1500

speed (mflops) 1000

500

04x4 8x4 4x8 8x8 16x16 32x32 64x64 16x512 128x64 128x128 256x128 512x64 64x1024 256x256 512x512 1024x1024

MT 2006 Internal Seminar – p. 29/33 BenchFFT Results on Henrici1

double-precision complex, 2d transforms non-powers of two 2500 fftw3 in-place fftw3 out-of-place ffte jmfftc 2000

1500

speed (mflops) 1000

500

05x5 6x6 7x7 9x9 10x10 11x11 12x12 13x13 14x14 15x15 25x24 48x48 49x49 60x60 72x56 75x75 80x80 84x84 96x96 100x100 105x105 112x112 120x120 144x144 180x180 240x240 360x360 1000x1000

MT 2006 Internal Seminar – p. 30/33 BenchFFT Results on Henrici1

double-precision complex, 3d transforms powers of two 2500 fftw3 in-place fftw3 out-of-place ooura-sg3d green harm 2000 ffte fxt-ndim jmfftc

1500

speed (mflops) 1000

500

0 4x4x4 8x8x8 4x8x16 16x16x16 32x32x32 64x64x64 256x64x32 16x1024x64 128x128x128

MT 2006 Internal Seminar – p. 31/33 BenchFFT Results on Henrici1

double-precision complex, 3d transforms non-powers of two 2500 fftw3 out-of-place fftw3 in-place ffte jmfftc 2000

1500

speed (mflops) 1000

500

0 5x5x5 6x6x6 7x7x7 9x9x9 10x10x10 11x11x11 12x12x12 13x13x13 14x14x14 15x15x15 24x25x28 48x48x48 49x49x49 60x60x60 72x60x56 75x75x75 80x80x80 84x84x84 96x96x96 100x100x100 105x105x105 112x112x112 120x120x120

MT 2006 Internal Seminar – p. 32/33 Conclusions

Based on the experiments we conducted on Henrici1

MKL DFT routines are faster than FFTW3. FFTW has competitive performance. NAG performed poorly.

We prefer FFTW for C/C++/Fortran because of its ease of use, portability, additional routines, documentation, and the many other features. This is the package that we would recommend.

In general, we find the Matlab routine to be very easy to use and recommend this if we are able to accept the performance overhead.

MT 2006 Internal Seminar – p. 33/33