The FFT Via Factorizations

A Key to Designing High Performance Implementations

Charles Van Loan Department of Computer Science Cornell University A High Level Perspective... Blocking For Performance

A11 A12 A1q n ··· } 1 A21 A22 A2q n A =  ···  } 2 ......    Ap1 Ap2 Apq  nq  ···  }   n1 n2 nq |{z} |{z} |{z} A well known strategy for high-performance Ax = b and Ax = λx solvers. Factoring for Performance

One way to execute a matrix-vector product

y = Fnx when Fn = At A A is as follows: ··· 2 1 y = x for k = 1:t y = Akx end

˜ ˜ A different factorization Fn = At˜ A1 would yield a different algorithm. ··· The Discrete (n = 8)

0 0 0 0 0 0 0 0 ω8 ω8 ω8 ω8 ω8 ω8 ω8 ω8 0 1 2 3 4 5 6 7  ω8 ω8 ω8 ω8 ω8 ω8 ω8 ω8   ω0 ω2 ω4 ω6 ω8 ω10 ω12 ω14   8 8 8 8 8 8 8 8     ω0 ω3 ω6 ω9 ω12 ω15 ω18 ω21   8 8 8 8 8 8 8 8  y = F8x =   x  ω0 ω4 ω8 ω12 ω16 ω20 ω24 ω28   8 8 8 8 8 8 8 8     ω0 ω5 ω10 ω15 ω20 ω25 ω30 ω35   8 8 8 8 8 8 8 8     ω0 ω6 ω12 ω18 ω24 ω30 ω36 ω42   8 8 8 8 8 8 8 8     ω0 ω7 ω14 ω21 ω28 ω35 ω42 ω49   8 8 8 8 8 8 8 8    ω = cos(2π/8) i sin(2π/8) 8 − · The DFT Matrix In General...

If ωn = cos(2π/n) i sin(2π/n) then − ·

pq [Fn]pq = ωn

= (cos(2π/n) i sin(2π/n))pq − · = cos(2pqπ/n) i sin(2pqπ/n) − ·

Fact: H Fn Fn = nIn

Thus, Fn/√n is unitary. Data Sparse Matrices

An n-by-n matrix A is data sparse if it can be represented with many fewer than n2 numbers.

Example 1. A has lots of zeros. (“Traditional Sparse”)

Example 2. A is Toeplitz... abcd ea bc A =  feab   gf ea      More Examples of Data Sparse Matrices

A is a B C, e.g., ⊗

b C b C A = 11 12 " b21C b22C #

m1 m1 m2 m2 2 2 If B IR × and C IR × then A = B C has m1m2 ∈ ∈ 2 2 ⊗ entries but is parameterized by just m1 + m2 numbers. Extreme Data Sparsity

n n n n A = S(i,j,k,`) (2-by-2) (2-by-2) · ⊗ ··· ⊗ i=1 j=1 k=1 `=1 X X X X d times | {z } A is 2d -by-2d but is parameterized by O(dn4) numbers. Factorization of Fn

The DFT matrix can be factored into a short product of sparse matrices, e.g.,

F = A A A P 1024 10 ··· 2 1 1024 where each A-matrix has 2 nonzeros per row and P1024 is a per- mutation. From Factorization to Algorithm

If n = 210 and

Fn = A A A Pn 10 ··· 2 1 then

y = Pnx for k = 1:10 y = A x 2n flops. k ← end computes y = Fnx and requires O(n log n) flops. Recursive Block Structure

F8(:, [02461357]) = 1 00 0 1 0 0 0 0 100 0 ω8 0 0  2  0 010 0 0 ω8 0  3   0 001 0 0 0 ω8  F4 0    1 00 0 10 0 0  0 F4  −     0 100 0 ω8 0 0   − 2   0 010 0 0 ω8 0   − 3   0 001 0 0 0 ω   − 8    Fn/2 “shows up” when you permute the columns of Fn so that the odd-indexed columns come first. Recursion...

We build an 8-point DFT from two 4-point DFTs...

1 00 0 1 0 0 0 0 100 0 ω8 0 0  2  0 010 0 0 ω8 0  3  F x(0:2:7)  0 001 0 0 0 ω8  4 F8 x =    1 00 0 10 0 0  " F x(1:2:7) #  −  4  0 100 0 ω8 0 0   − 2   0 010 0 0 ω8 0   − 3   0 001 0 0 0 ω   − 8    Radix-2 FFT: Recursive Implementation

function y =fft(x, n) if n = 1 y = x else m = n/2; ω = exp( 2πi/n) − m 1 Ω = diag(1,ω,...,ω − ) z = fft(x(0:2:n 1), m) T − z = Ω fft(x(1:2:n 1), m) B · − I I z y = m m T Overall: 5n log n flops. Im Im zB  −   end The Divide-and-Conquer Picture

(0:1:15) ¨¨HH ¨ H ¨ H ¨¨ HH ¨ H ¨ H (0:2:15) (1:2:15) Q Q  Q  Q  Q  Q  Q  Q (0:4:15) (2:4:15) (1:4:15) (3:4:15) @ @ @ @ @ @ @ @ (0:8:15) (4:8:15) (2:8:15) (6:8:15) (1:8:15) (5:8:15) (3:8:15) (7:8:15) ¡A ¡A ¡A ¡A ¡A ¡A ¡A ¡A ¡ A ¡ A ¡ A ¡ A ¡ A ¡ A ¡ A ¡ A [0] [8] [4] [12] [2] [10] [6] [14] [1] [9] [5] [13] [3] [11] [7] [15] Towards a Nonrecursive Implementation

The Radix-2 Factorization...

If n = 2m and m 1 Ωm = diag(1,ωn,...,ωn − ), then Fm ΩmFm Im Ωm FnΠn = = (I2 Fm). " Fm ΩmFm # " Im Ωm # ⊗ − − where Πn = In(:, [0:2:n 1:2:n]).

Fm 0 Note: I2 Fm = . ⊗ 0 Fm   The Cooley-Tukey Factorization n = 2t

Fn = At A Pn ··· 1

Pn = the n-by-n “bit reversal ”

I Ω L/2 L/2 q Aq = Ir L = 2 , r = n/L ⊗ " IL/2 ΩL/2 # −

L/2 1 Ω = diag(1,ω ,...,ω − ) ω = exp( 2πi/L) L/2 L L L − The Bit Reversal Permutation

(0:1:15) ¨¨HH ¨ H ¨ H ¨¨ HH ¨ H ¨ H (0:2:15) (1:2:15) Q Q  Q  Q  Q  Q  Q  Q (0:4:15) (2:4:15) (1:4:15) (3:4:15) @ @ @ @ @ @ @ @ (0:8:15) (4:8:15) (2:8:15) (6:8:15) (1:8:15) (5:8:15) (3:8:15) (7:8:15) ¡A ¡A ¡A ¡A ¡A ¡A ¡A ¡A ¡ A ¡ A ¡ A ¡ A ¡ A ¡ A ¡ A ¡ A [0] [8] [4] [12] [2] [10] [6] [14] [1] [9] [5] [13] [3] [11] [7] [15] Bit Reversal

x(0) x(0000) x(0000) x(0) x(1) x(0001) x(1000) x(8)  x(2)   x(0010)   x(0100)   x(4)   x(3)   x(0011)   x(1100)   x(12)           x(4)   x(0100)   x(0010)   x(2)           x(5)   x(0101)   x(1010)   x(10)           x(6)   x(0110)   x(0110)   x(6)           x(7)   x(0111)   x(1110)   x(14)    =     =    x(8)   x(1000)  →  x(0001)   x(1)           x(9)   x(1001)   x(1001)   x(9)           x(10)   x(1010)   x(0101)   x(5)           x(11)   x(1011)   x(1101)   x(13)           x(12)   x(1100)   x(0011)   x(3)           x(13)   x(1101)   x(1011)   x(11)           x(14)   x(1110)   x(0111)   x(7)           x(15)   x(1111)   x(1111)   x(15)                  Butterfly Operations

This matrix is block diagonal...

I Ω L/2 L/2 q Aq = Ir L = 2 , r = n/L ⊗ " IL/2 ΩL/2 # − r copies of things like this

1 1 × ×  1  ×  1   ×   1   ×   1   ×   1   ×   1   ×    At the Scalar Level...

a Hs ¨s a + ωb H ¨ H ¨ ω ¨¨ HH b s¨ Hs a ωb − Signal Flow Graph (n = 8)

x0 Hs ¨s s s y0 HH ¨¨ @ A ¡ @ A ¡ ω0 ¨ 8 H @ A ¡ ¨ H @ ¨ H 0 A ¡ x4 s s ω s s y1 @ 8 A A ¡ ¡ @ @ @ A A ¡ ¡ @ @ A A ¡ ¡ @ A A ¡ ¡ x s s 2 @s 0 s y 2 HH ¨¨ ω8 A A ω8 ¡ ¡ 2 H ¨ @ A ¡ A ¡ 0 @ A ¡ A ¡ ω8 A A ¡ ¡ ¨¨ HH @ ¡ A ¡ A ¨ H @ A 1 ¡ x6 s s s ¡ ω A s y3 A ¡A 8 ¡A ¡ A ¡ A ¡ A¡ ¡ A A¡ A A ¡ ¡ ¡ ¡ A ¡ A A ¡ A 2 ¡ A x1 Hs ¨s s ¡ ω A s y4 H ¨ @ A 8 ¡ H ¨ ¡ A @ A ¡ A ¡ ω0 ¡ ¡ A A ¨ 8 H @ ¡ A ¡ A ¨ H @ ¡ A ¡ A x ¨s Hs 0 s¡ ¡ 3 A As y 5 @ ω8 ω8 5 @ ¡ ¡ A A @ @ ¡ ¡ A A @ @ @ ¡ ¡ A A 2 @¡ A x3 Hs ¨s ω s ¡ A s y6 H ¨ 8 H ¨ @ ¡ A ω0 @ ¡ A ¨ 8 H @ ¡ A ¨ H ¨ H @¡ A x7 s s s s y7 The Transposed Stockham Factorization

If n = 2t, then

Fn = S S S , t ··· 2 1 where for q = 1:t the factor Sq = AqΓq 1 is defined by − A = I B ,L = 2q, r = n/L, q r ⊗ L

Γq 1 = Πr IL ,L = L/2, r = 2r, − ∗ ⊗ ∗ ∗ ∗

IL ΩL BL = ∗ ∗ , IL ΩL  ∗ − ∗  L 1 ΩL = diag(1, ωL,...,ωL∗− ). ∗ Perfect Shuffle

x0 x0 x1 x1     x2 x4  x   x  (Π I )  3  =  5  4 ⊗ 2  x   x   4   2   x   x   5   3   x   x   6   6   x   x   7   7      Cooley-Tukey Array Interpretation

Step q:

k 2k 2k+1

− 8 =2q 1  q L∗ <>  L=2 −→  :> 

r∗ =n/L∗   | {z } r=n/L  | {z } Reshaping

×  ×  ×    ×  x =   x = ××××  ×  → 2 4   ×  ××××   ×     ×     ×     ×    Transposed Stockham Array Interp

k k+r

(q−1) T 9 q−1 = = L∗=2 . xL∗×r∗ FL∗ xr∗×L∗ =>

;>

r∗ =n/L∗ (q) (q 1) x = Sqx − | k {z }

9 > > > (q) = T = > =2q xL×r FLxr×L => L > > > > ;>

r=n/L | {z } 2 2 2 Basic Radix-2 Versions × ×

Store intermediate DFTs by row or column

Intermediate DFTs adjacent or not.

How the two butterfly loops are ordered.

I Ω L/2 L/2 q x = Ir x L = 2 , r = n/L ⊗ " IL/2 ΩL/2 #! − The Gentleman-Sande Idea

T It can be shown that Fn = Fn and so if

T Fn = At A P ··· 1 n then T T T Fn = F = PnA A n 1 ··· t and we can compute y = Fnx as follows... y = x for k = t: 1:1 −T y = Ak x end y = Pny and Other Aps

From “problem space” to “DFT space” via for k = t: 1:1 −T x = Ak x end x = Pnx

Do your thing in DFT space. Then inverse transform back to Problem space via T x = Pn x for k = 1:t x = Akx end x = x/n

Can avoid the Pn ops by working in “scrambled” DFT space. Radix-4

Can combine four quarter-length DFTs to produce a single full- length DFT:

IIII a (a + c)+ (b + d) I iI I iI b (a c) i(b d) v = = ,  I − II− I  c   (a +− c)− (b −+ d)  − − −  I iI I iI  d   (a c)+i(b d)   − −    − −       The radix-4 butterfly. Better re-use of data. Fewer flops. Radix-4 FFT is 4.25n log n (instead of 5n log n). Mixed Radix

96P #cPP  # c PP  # c PP  # c PP 24 24 24 24 @ @ @ @ @ @ @ @ 8 8 8 8 8 8 8 8 8 8 8 8 Multiple DFTs

Given: n1-by-n2 matrix X.

Multicolumn DFT Problem...

X Fn X ← 1 Multirow DFT Problem...

X XFn ← 2 Blocked Multiple DFTs

X Fn X becomes ← 1

X X Xp Fn X Fn X Fn Xp 1 | 2 |···| ← 1 1 | 1 2 |···| 1     The 4-Step Framework

A matrix reshaping of the x Fnx operation when n = n n : ← 1 2 xn n xn n Fn Multiple row DFT 1× 2 ← 1× 2 2 xn n Fn(0:n1 1, 0:n2 1). xn n Pointwise multiply 1× 2 ← − − ∗ 1× 2 T xn2 n1 xn n Transpose × ← 1× 2 xn n xn n Fn Multiple row DFT . 2× 1 ← 2× 1 1

Can be arranged so communication is concentrated in the trans- pose step. Distributed Transpose: Example

Initial: X00 X01 X02 X03 X X X X X =  10 11 12 13  . X20 X21 X22 X23  X X X X   30 31 32 33  Transpose each block:   T T T T X00 X01 X02 X03  XT XT XT XT  X 10 11 12 13 . ←  XT XT XT XT   20 21 22 23     XT XT XT XT   30 31 32 33    Now regard as 2-by-2 and block transpose each block:

T T T T X00 X10 X02 X12  XT XT XT XT  X 01 11 03 13 . ←    XT XT XT XT   20 30 22 32     XT XT XT XT   21 31 23 33  Now do a 2-by-2 block transpose: 

T T T T X00 X10 X20 X30  XT XT XT XT  X 01 11 21 31 . ←    XT XT XT XT   02 12 22 32     XT XT XT XT   03 13 23 33    Factorization and Transpose

T xn m xm n × ← × corresponds to x P (m, n)x ← where P (m, n) is a perfect shuffle permutation, e.g.,

P (3, 4) = I12(:, [03691471025811])

Different multi-pass transposition algorithms correspond to differ- ent factorizations of P (m, n). Two-Dimensional FFTs

If X is an n1-by-n2 matrix then is 2D DFT is X Fn XFn ← 1 2 Option 1.

X Fn X ← 1 X XFn ← 2 Option 2. Assume n = n and Fn = At A . 1 2 1 ··· 1 for q = 1:t T X AqXA ← q end Interminlgling the column and row butterfly computations can result in better locality. 3-Dimensional DFTs

Given X(1:n1, 1:n2, 1:n3), apply DFT in each of the three dimen- sions. If

x = reshape(X(1:n1, 1:n2, 1:n3), n1n2n3, 1) then the problem is to compute

x (Fn Fn Fn )x ← 3 ⊗ 2 ⊗ 1 i.e., x (In In Fn )x ← 3 ⊗ 2 ⊗ 1 x (In Fn In )x ← 3 ⊗ 2 ⊗ 1 x (Fn In In )x ← 3 ⊗ 2 ⊗ 1 d-Dimensional DFTs

Sample for d = 5: X(α , α , α , α , α ) F µ = 1 1 2 3 4 5 n1 X(α , α , α , α , α ) T 2 3 4 5 1 Πn1,n X(α , α , α , α , α ) F µ = 2 2 3 4 5 1 n2 X(α , α , α , α , α ) T 3 4 5 1 2 Πn2,n X(α , α , α , α , α ) F µ = 3 3 4 5 1 2 n3 X(α , α , α , α , α ) T 4 5 1 2 3 Πn3,n X(α , α , α , α , α ) F µ = 4 4 5 1 2 3 n4 X(α , α , α , α , α ) T 5 1 2 3 4 Πn4,n X(α , α , α , α , α ) F µ = 5 5 1 2 3 4 n5 X(α , α , α , α , α ) T 1 2 3 4 5 Πn5,n Intemingling of component DFTs and tensor transpositions. References

FFTW: http:www.fftw.org

C. Van Loan (1992). Computational Frameworks for the , SIAM Publications, Philadelphia, PA.