The FFT Via Matrix Factorizations
A Key to Designing High Performance Implementations
Charles Van Loan Department of Computer Science Cornell University A High Level Perspective... Blocking For Performance
A11 A12 A1q n ··· } 1 A21 A22 A2q n A = ··· } 2 ...... Ap1 Ap2 Apq nq ··· } n1 n2 nq |{z} |{z} |{z} A well known strategy for high-performance Ax = b and Ax = λx solvers. Factoring for Performance
One way to execute a matrix-vector product
y = Fnx when Fn = At A A is as follows: ··· 2 1 y = x for k = 1:t y = Akx end
˜ ˜ A different factorization Fn = At˜ A1 would yield a different algorithm. ··· The Discrete Fourier Transform (n = 8)
0 0 0 0 0 0 0 0 ω8 ω8 ω8 ω8 ω8 ω8 ω8 ω8 0 1 2 3 4 5 6 7 ω8 ω8 ω8 ω8 ω8 ω8 ω8 ω8 ω0 ω2 ω4 ω6 ω8 ω10 ω12 ω14 8 8 8 8 8 8 8 8 ω0 ω3 ω6 ω9 ω12 ω15 ω18 ω21 8 8 8 8 8 8 8 8 y = F8x = x ω0 ω4 ω8 ω12 ω16 ω20 ω24 ω28 8 8 8 8 8 8 8 8 ω0 ω5 ω10 ω15 ω20 ω25 ω30 ω35 8 8 8 8 8 8 8 8 ω0 ω6 ω12 ω18 ω24 ω30 ω36 ω42 8 8 8 8 8 8 8 8 ω0 ω7 ω14 ω21 ω28 ω35 ω42 ω49 8 8 8 8 8 8 8 8 ω = cos(2π/8) i sin(2π/8) 8 − · The DFT Matrix In General...
If ωn = cos(2π/n) i sin(2π/n) then − ·
pq [Fn]pq = ωn
= (cos(2π/n) i sin(2π/n))pq − · = cos(2pqπ/n) i sin(2pqπ/n) − ·
Fact: H Fn Fn = nIn
Thus, Fn/√n is unitary. Data Sparse Matrices
An n-by-n matrix A is data sparse if it can be represented with many fewer than n2 numbers.
Example 1. A has lots of zeros. (“Traditional Sparse”)
Example 2. A is Toeplitz... abcd ea bc A = feab gf ea More Examples of Data Sparse Matrices
A is a Kronecker Product B C, e.g., ⊗
b C b C A = 11 12 " b21C b22C #
m1 m1 m2 m2 2 2 If B IR × and C IR × then A = B C has m1m2 ∈ ∈ 2 2 ⊗ entries but is parameterized by just m1 + m2 numbers. Extreme Data Sparsity
n n n n A = S(i,j,k,`) (2-by-2) (2-by-2) · ⊗ ··· ⊗ i=1 j=1 k=1 `=1 X X X X d times | {z } A is 2d -by-2d but is parameterized by O(dn4) numbers. Factorization of Fn
The DFT matrix can be factored into a short product of sparse matrices, e.g.,
F = A A A P 1024 10 ··· 2 1 1024 where each A-matrix has 2 nonzeros per row and P1024 is a per- mutation. From Factorization to Algorithm
If n = 210 and
Fn = A A A Pn 10 ··· 2 1 then
y = Pnx for k = 1:10 y = A x 2n flops. k ← end computes y = Fnx and requires O(n log n) flops. Recursive Block Structure
F8(:, [02461357]) = 1 00 0 1 0 0 0 0 100 0 ω8 0 0 2 0 010 0 0 ω8 0 3 0 001 0 0 0 ω8 F4 0 1 00 0 10 0 0 0 F4 − 0 100 0 ω8 0 0 − 2 0 010 0 0 ω8 0 − 3 0 001 0 0 0 ω − 8 Fn/2 “shows up” when you permute the columns of Fn so that the odd-indexed columns come first. Recursion...
We build an 8-point DFT from two 4-point DFTs...
1 00 0 1 0 0 0 0 100 0 ω8 0 0 2 0 010 0 0 ω8 0 3 F x(0:2:7) 0 001 0 0 0 ω8 4 F8 x = 1 00 0 10 0 0 " F x(1:2:7) # − 4 0 100 0 ω8 0 0 − 2 0 010 0 0 ω8 0 − 3 0 001 0 0 0 ω − 8 Radix-2 FFT: Recursive Implementation
function y =fft(x, n) if n = 1 y = x else m = n/2; ω = exp( 2πi/n) − m 1 Ω = diag(1,ω,...,ω − ) z = fft(x(0:2:n 1), m) T − z = Ω fft(x(1:2:n 1), m) B · − I I z y = m m T Overall: 5n log n flops. Im Im zB − end The Divide-and-Conquer Picture
(0:1:15) ¨¨HH ¨ H ¨ H ¨¨ HH ¨ H ¨ H (0:2:15) (1:2:15) Q Q Q Q Q Q Q Q (0:4:15) (2:4:15) (1:4:15) (3:4:15) @ @ @ @ @ @ @ @ (0:8:15) (4:8:15) (2:8:15) (6:8:15) (1:8:15) (5:8:15) (3:8:15) (7:8:15) ¡A ¡A ¡A ¡A ¡A ¡A ¡A ¡A ¡ A ¡ A ¡ A ¡ A ¡ A ¡ A ¡ A ¡ A [0] [8] [4] [12] [2] [10] [6] [14] [1] [9] [5] [13] [3] [11] [7] [15] Towards a Nonrecursive Implementation
The Radix-2 Factorization...
If n = 2m and m 1 Ωm = diag(1,ωn,...,ωn − ), then Fm ΩmFm Im Ωm FnΠn = = (I2 Fm). " Fm ΩmFm # " Im Ωm # ⊗ − − where Πn = In(:, [0:2:n 1:2:n]).
Fm 0 Note: I2 Fm = . ⊗ 0 Fm The Cooley-Tukey Factorization n = 2t
Fn = At A Pn ··· 1
Pn = the n-by-n “bit reversal ” permutation matrix
I Ω L/2 L/2 q Aq = Ir L = 2 , r = n/L ⊗ " IL/2 ΩL/2 # −
L/2 1 Ω = diag(1,ω ,...,ω − ) ω = exp( 2πi/L) L/2 L L L − The Bit Reversal Permutation
(0:1:15) ¨¨HH ¨ H ¨ H ¨¨ HH ¨ H ¨ H (0:2:15) (1:2:15) Q Q Q Q Q Q Q Q (0:4:15) (2:4:15) (1:4:15) (3:4:15) @ @ @ @ @ @ @ @ (0:8:15) (4:8:15) (2:8:15) (6:8:15) (1:8:15) (5:8:15) (3:8:15) (7:8:15) ¡A ¡A ¡A ¡A ¡A ¡A ¡A ¡A ¡ A ¡ A ¡ A ¡ A ¡ A ¡ A ¡ A ¡ A [0] [8] [4] [12] [2] [10] [6] [14] [1] [9] [5] [13] [3] [11] [7] [15] Bit Reversal
x(0) x(0000) x(0000) x(0) x(1) x(0001) x(1000) x(8) x(2) x(0010) x(0100) x(4) x(3) x(0011) x(1100) x(12) x(4) x(0100) x(0010) x(2) x(5) x(0101) x(1010) x(10) x(6) x(0110) x(0110) x(6) x(7) x(0111) x(1110) x(14) = = x(8) x(1000) → x(0001) x(1) x(9) x(1001) x(1001) x(9) x(10) x(1010) x(0101) x(5) x(11) x(1011) x(1101) x(13) x(12) x(1100) x(0011) x(3) x(13) x(1101) x(1011) x(11) x(14) x(1110) x(0111) x(7) x(15) x(1111) x(1111) x(15) Butterfly Operations
This matrix is block diagonal...
I Ω L/2 L/2 q Aq = Ir L = 2 , r = n/L ⊗ " IL/2 ΩL/2 # − r copies of things like this
1 1 × × 1 × 1 × 1 × 1 × 1 × 1 × At the Scalar Level...
a Hs ¨s a + ωb H ¨ H ¨ ω ¨¨ HH b s¨ Hs a ωb − Signal Flow Graph (n = 8)
x0 Hs ¨s s s y0 HH ¨¨ @ A ¡ @ A ¡ ω0 ¨ 8 H @ A ¡ ¨ H @ ¨ H 0 A ¡ x4 s s ω s s y1 @ 8 A A ¡ ¡ @ @ @ A A ¡ ¡ @ @ A A ¡ ¡ @ A A ¡ ¡ x s s 2 @s 0 s y 2 HH ¨¨ ω8 A A ω8 ¡ ¡ 2 H ¨ @ A ¡ A ¡ 0 @ A ¡ A ¡ ω8 A A ¡ ¡ ¨¨ HH @ ¡ A ¡ A ¨ H @ A 1 ¡ x6 s s s ¡ ω A s y3 A ¡A 8 ¡A ¡ A ¡ A ¡ A¡ ¡ A A¡ A A ¡ ¡ ¡ ¡ A ¡ A A ¡ A 2 ¡ A x1 Hs ¨s s ¡ ω A s y4 H ¨ @ A 8 ¡ H ¨ ¡ A @ A ¡ A ¡ ω0 ¡ ¡ A A ¨ 8 H @ ¡ A ¡ A ¨ H @ ¡ A ¡ A x ¨s Hs 0 s¡ ¡ 3 A As y 5 @ ω8 ω8 5 @ ¡ ¡ A A @ @ ¡ ¡ A A @ @ @ ¡ ¡ A A 2 @¡ A x3 Hs ¨s ω s ¡ A s y6 H ¨ 8 H ¨ @ ¡ A ω0 @ ¡ A ¨ 8 H @ ¡ A ¨ H ¨ H @¡ A x7 s s s s y7 The Transposed Stockham Factorization
If n = 2t, then
Fn = S S S , t ··· 2 1 where for q = 1:t the factor Sq = AqΓq 1 is defined by − A = I B ,L = 2q, r = n/L, q r ⊗ L
Γq 1 = Πr IL ,L = L/2, r = 2r, − ∗ ⊗ ∗ ∗ ∗
IL ΩL BL = ∗ ∗ , IL ΩL ∗ − ∗ L 1 ΩL = diag(1, ωL,...,ωL∗− ). ∗ Perfect Shuffle
x0 x0 x1 x1 x2 x4 x x (Π I ) 3 = 5 4 ⊗ 2 x x 4 2 x x 5 3 x x 6 6 x x 7 7 Cooley-Tukey Array Interpretation
Step q:
k 2k 2k+1
− 8 =2q 1 q L∗ <> L=2 −→ :>
r∗ =n/L∗ | {z } r=n/L | {z } Reshaping
× × × × x = x = ×××× × → 2 4 × ×××× × × × × Transposed Stockham Array Interp
k k+r
(q−1) T 9 q−1 = = L∗=2 . xL∗×r∗ FL∗ xr∗×L∗ =>
;>
r∗ =n/L∗ (q) (q 1) x = Sqx − | k {z }
9 > > > (q) = T = > =2q xL×r FLxr×L => L > > > > ;>
r=n/L | {z } 2 2 2 Basic Radix-2 Versions × ×
Store intermediate DFTs by row or column
Intermediate DFTs adjacent or not.
How the two butterfly loops are ordered.
I Ω L/2 L/2 q x = Ir x L = 2 , r = n/L ⊗ " IL/2 ΩL/2 #! − The Gentleman-Sande Idea
T It can be shown that Fn = Fn and so if
T Fn = At A P ··· 1 n then T T T Fn = F = PnA A n 1 ··· t and we can compute y = Fnx as follows... y = x for k = t: 1:1 −T y = Ak x end y = Pny Convolution and Other Aps
From “problem space” to “DFT space” via for k = t: 1:1 −T x = Ak x end x = Pnx
Do your thing in DFT space. Then inverse transform back to Problem space via T x = Pn x for k = 1:t x = Akx end x = x/n
Can avoid the Pn ops by working in “scrambled” DFT space. Radix-4
Can combine four quarter-length DFTs to produce a single full- length DFT:
IIII a (a + c)+ (b + d) I iI I iI b (a c) i(b d) v = = , I − II− I c (a +− c)− (b −+ d) − − − I iI I iI d (a c)+i(b d) − − − − The radix-4 butterfly. Better re-use of data. Fewer flops. Radix-4 FFT is 4.25n log n (instead of 5n log n). Mixed Radix
96P #cPP # c PP # c PP # c PP 24 24 24 24 @ @ @ @ @ @ @ @ 8 8 8 8 8 8 8 8 8 8 8 8 Multiple DFTs
Given: n1-by-n2 matrix X.
Multicolumn DFT Problem...
X Fn X ← 1 Multirow DFT Problem...
X XFn ← 2 Blocked Multiple DFTs
X Fn X becomes ← 1
X X Xp Fn X Fn X Fn Xp 1 | 2 |···| ← 1 1 | 1 2 |···| 1 The 4-Step Framework
A matrix reshaping of the x Fnx operation when n = n n : ← 1 2 xn n xn n Fn Multiple row DFT 1× 2 ← 1× 2 2 xn n Fn(0:n1 1, 0:n2 1). xn n Pointwise multiply 1× 2 ← − − ∗ 1× 2 T xn2 n1 xn n Transpose × ← 1× 2 xn n xn n Fn Multiple row DFT . 2× 1 ← 2× 1 1
Can be arranged so communication is concentrated in the trans- pose step. Distributed Transpose: Example
Initial: X00 X01 X02 X03 X X X X X = 10 11 12 13 . X20 X21 X22 X23 X X X X 30 31 32 33 Transpose each block: T T T T X00 X01 X02 X03 XT XT XT XT X 10 11 12 13 . ← XT XT XT XT 20 21 22 23 XT XT XT XT 30 31 32 33 Now regard as 2-by-2 and block transpose each block:
T T T T X00 X10 X02 X12 XT XT XT XT X 01 11 03 13 . ← XT XT XT XT 20 30 22 32 XT XT XT XT 21 31 23 33 Now do a 2-by-2 block transpose:
T T T T X00 X10 X20 X30 XT XT XT XT X 01 11 21 31 . ← XT XT XT XT 02 12 22 32 XT XT XT XT 03 13 23 33 Factorization and Transpose
T xn m xm n × ← × corresponds to x P (m, n)x ← where P (m, n) is a perfect shuffle permutation, e.g.,
P (3, 4) = I12(:, [03691471025811])
Different multi-pass transposition algorithms correspond to differ- ent factorizations of P (m, n). Two-Dimensional FFTs
If X is an n1-by-n2 matrix then is 2D DFT is X Fn XFn ← 1 2 Option 1.
X Fn X ← 1 X XFn ← 2 Option 2. Assume n = n and Fn = At A . 1 2 1 ··· 1 for q = 1:t T X AqXA ← q end Interminlgling the column and row butterfly computations can result in better locality. 3-Dimensional DFTs
Given X(1:n1, 1:n2, 1:n3), apply DFT in each of the three dimen- sions. If
x = reshape(X(1:n1, 1:n2, 1:n3), n1n2n3, 1) then the problem is to compute
x (Fn Fn Fn )x ← 3 ⊗ 2 ⊗ 1 i.e., x (In In Fn )x ← 3 ⊗ 2 ⊗ 1 x (In Fn In )x ← 3 ⊗ 2 ⊗ 1 x (Fn In In )x ← 3 ⊗ 2 ⊗ 1 d-Dimensional DFTs
Sample for d = 5: X(α , α , α , α , α ) F µ = 1 1 2 3 4 5 n1 X(α , α , α , α , α ) T 2 3 4 5 1 Πn1,n X(α , α , α , α , α ) F µ = 2 2 3 4 5 1 n2 X(α , α , α , α , α ) T 3 4 5 1 2 Πn2,n X(α , α , α , α , α ) F µ = 3 3 4 5 1 2 n3 X(α , α , α , α , α ) T 4 5 1 2 3 Πn3,n X(α , α , α , α , α ) F µ = 4 4 5 1 2 3 n4 X(α , α , α , α , α ) T 5 1 2 3 4 Πn4,n X(α , α , α , α , α ) F µ = 5 5 1 2 3 4 n5 X(α , α , α , α , α ) T 1 2 3 4 5 Πn5,n Intemingling of component DFTs and tensor transpositions. References
FFTW: http:www.fftw.org
C. Van Loan (1992). Computational Frameworks for the Fast Fourier Transform, SIAM Publications, Philadelphia, PA.