The FFT Via Matrix Factorizations

The FFT Via Matrix Factorizations A Key to Designing High Performance Implementations Charles Van Loan Department of Computer Science Cornell University A High Level Perspective... Blocking For Performance A11 A12 A1q n ··· } 1 A21 A22 A2q n A = ··· } 2 . ... Ap1 Ap2 Apq nq ··· } n1 n2 nq |{z} |{z} |{z} A well known strategy for high-performance Ax = b and Ax = λx solvers. Factoring for Performance One way to execute a matrix-vector product y = Fnx when Fn = At A A is as follows: ··· 2 1 y = x for k = 1:t y = Akx end ˜ ˜ A different factorization Fn = At˜ A1 would yield a different algorithm. ··· The Discrete Fourier Transform (n = 8) 0 0 0 0 0 0 0 0 ω8 ω8 ω8 ω8 ω8 ω8 ω8 ω8 0 1 2 3 4 5 6 7 ω8 ω8 ω8 ω8 ω8 ω8 ω8 ω8 ω0 ω2 ω4 ω6 ω8 ω10 ω12 ω14 8 8 8 8 8 8 8 8 ω0 ω3 ω6 ω9 ω12 ω15 ω18 ω21 8 8 8 8 8 8 8 8 y = F8x = x ω0 ω4 ω8 ω12 ω16 ω20 ω24 ω28 8 8 8 8 8 8 8 8 ω0 ω5 ω10 ω15 ω20 ω25 ω30 ω35 8 8 8 8 8 8 8 8 ω0 ω6 ω12 ω18 ω24 ω30 ω36 ω42 8 8 8 8 8 8 8 8 ω0 ω7 ω14 ω21 ω28 ω35 ω42 ω49 8 8 8 8 8 8 8 8 ω = cos(2π/8) i sin(2π/8) 8 − · The DFT Matrix In General... If ωn = cos(2π/n) i sin(2π/n) then − · pq [Fn]pq = ωn = (cos(2π/n) i sin(2π/n))pq − · = cos(2pqπ/n) i sin(2pqπ/n) − · Fact: H Fn Fn = nIn Thus, Fn/√n is unitary. Data Sparse Matrices An n-by-n matrix A is data sparse if it can be represented with many fewer than n2 numbers. Example 1. A has lots of zeros. (“Traditional Sparse”) Example 2. A is Toeplitz... abcd ea bc A = feab gf ea More Examples of Data Sparse Matrices A is a Kronecker Product B C, e.g., ⊗ b C b C A = 11 12 " b21C b22C # m1 m1 m2 m2 2 2 If B IR × and C IR × then A = B C has m1m2 ∈ ∈ 2 2 ⊗ entries but is parameterized by just m1 + m2 numbers. Extreme Data Sparsity n n n n A = S(i,j,k,`) (2-by-2) (2-by-2) · ⊗ ··· ⊗ i=1 j=1 k=1 `=1 X X X X d times | {z } A is 2d -by-2d but is parameterized by O(dn4) numbers. Factorization of Fn The DFT matrix can be factored into a short product of sparse matrices, e.g., F = A A A P 1024 10 ··· 2 1 1024 where each A-matrix has 2 nonzeros per row and P1024 is a permutation. From Factorization to Algorithm If n = 210 and Fn = A A A Pn 10 ··· 2 1 then y = Pnx for k = 1:10 y = A x 2n flops. k ← end computes y = Fnx and requires O(n log n) flops. Recursive Block Structure F8(:, [02461357]) = 1 00 0 1 0 0 0 0 100 0 ω8 0 0 2 0 010 0 0 ω8 0 3 0 001 0 0 0 ω8 F4 0 1 00 0 10 0 0 0 F4 − 0 100 0 ω8 0 0 − 2 0 010 0 0 ω8 0 − 3 0 001 0 0 0 ω − 8 Fn/2 “shows up” when you permute the columns of Fn so that the odd-indexed columns come first. Recursion... We build an 8-point DFT from two 4-point DFTs... 1 00 0 1 0 0 0 0 100 0 ω8 0 0 2 0 010 0 0 ω8 0 3 F x(0:2:7) 0 001 0 0 0 ω8 4 F8 x = 1 00 0 10 0 0 " F x(1:2:7) # − 4 0 100 0 ω8 0 0 − 2 0 010 0 0 ω8 0 − 3 0 001 0 0 0 ω − 8 Radix-2 FFT: Recursive Implementation function y =fft(x, n) if n = 1 y = x else m = n/2; ω = exp( 2πi/n) − m 1 Ω = diag(1,ω,...,ω − ) z = fft(x(0:2:n 1), m) T − z = Ω fft(x(1:2:n 1), m) B · − I I z y = m m T Overall: 5n log n flops. Im Im zB − end The Divide-and-Conquer Picture (0:1:15) ¨¨HH ¨ H ¨ H ¨¨ HH ¨ H ¨ H (0:2:15) (1:2:15) Q Q Q Q Q Q Q Q (0:4:15) (2:4:15) (1:4:15) (3:4:15) @ @ @ @ @ @ @ @ (0:8:15) (4:8:15) (2:8:15) (6:8:15) (1:8:15) (5:8:15) (3:8:15) (7:8:15) ¡A ¡A ¡A ¡A ¡A ¡A ¡A ¡A ¡ A ¡ A ¡ A ¡ A ¡ A ¡ A ¡ A ¡ A [0] [8] [4] [12] [2] [10] [6] [14] [1] [9] [5] [13] [3] [11] [7] [15] Towards a Nonrecursive Implementation The Radix-2 Factorization... If n = 2m and m 1 Ωm = diag(1,ωn,...,ωn − ), then Fm ΩmFm Im Ωm FnΠn = = (I2 Fm). " Fm ΩmFm # " Im Ωm # ⊗ − − where Πn = In(:, [0:2:n 1:2:n]). Fm 0 Note: I2 Fm = . ⊗ 0 Fm The Cooley-Tukey Factorization n = 2t Fn = At A Pn ··· 1 Pn = the n-by-n “bit reversal ” permutation matrix I Ω L/2 L/2 q Aq = Ir L = 2 , r = n/L ⊗ " IL/2 ΩL/2 # − L/2 1 Ω = diag(1,ω ,...,ω − ) ω = exp( 2πi/L) L/2 L L L − The Bit Reversal Permutation (0:1:15) ¨¨HH ¨ H ¨ H ¨¨ HH ¨ H ¨ H (0:2:15) (1:2:15) Q Q Q Q Q Q Q Q (0:4:15) (2:4:15) (1:4:15) (3:4:15) @ @ @ @ @ @ @ @ (0:8:15) (4:8:15) (2:8:15) (6:8:15) (1:8:15) (5:8:15) (3:8:15) (7:8:15) ¡A ¡A ¡A ¡A ¡A ¡A ¡A ¡A ¡ A ¡ A ¡ A ¡ A ¡ A ¡ A ¡ A ¡ A [0] [8] [4] [12] [2] [10] [6] [14] [1] [9] [5] [13] [3] [11] [7] [15] Bit Reversal x(0) x(0000) x(0000) x(0) x(1) x(0001) x(1000) x(8) x(2) x(0010) x(0100) x(4) x(3) x(0011) x(1100) x(12) x(4) x(0100) x(0010) x(2) x(5) x(0101) x(1010) x(10) x(6) x(0110) x(0110) x(6) x(7) x(0111) x(1110) x(14) = = x(8) x(1000) → x(0001) x(1) x(9) x(1001) x(1001) x(9) x(10) x(1010) x(0101) x(5) x(11) x(1011) x(1101) x(13) x(12) x(1100) x(0011) x(3) x(13) x(1101) x(1011) x(11) x(14) x(1110) x(0111) x(7) x(15) x(1111) x(1111) x(15) Butterfly Operations This matrix is block diagonal... I Ω L/2 L/2 q Aq = Ir L = 2 , r = n/L ⊗ " IL/2 ΩL/2 # − r copies of things like this 1 1 × × 1 × 1 × 1 × 1 × 1 × 1 × At the Scalar Level... a Hs ¨s a + ωb H ¨ H ¨ ω ¨¨ HH b s¨ Hs a ωb − Signal Flow Graph (n = 8) x0 Hs ¨s s s y0 HH ¨¨ @ A ¡ @ A ¡ ω0 ¨ 8 H @ A ¡ ¨ H @ ¨ H 0 A ¡ x4 s s ω s s y1 @ 8 A A ¡ ¡ @ @ @ A A ¡ ¡ @ @ A A ¡ ¡ @ A A ¡ ¡ x s s 2 @s 0 s y 2 HH ¨¨ ω8 A A ω8 ¡ ¡ 2 H ¨ @ A ¡ A ¡ 0 @ A ¡ A ¡ ω8 A A ¡ ¡ ¨¨ HH @ ¡ A ¡ A ¨ H @ A 1 ¡ x6 s s s ¡ ω A s y3 A ¡A 8 ¡A ¡ A ¡ A ¡ A¡ ¡ A A¡ A A ¡ ¡ ¡ ¡ A ¡ A A ¡ A 2 ¡ A x1 Hs ¨s s ¡ ω A s y4 H ¨ @ A 8 ¡ H ¨ ¡ A @ A ¡ A ¡ ω0 ¡ ¡ A A ¨ 8 H @ ¡ A ¡ A ¨ H @ ¡ A ¡ A x s¨ Hs 0 s¡ ¡ 3 A As y 5 @ ω8 ω8 5 @ ¡ ¡ A A @ @ ¡ ¡ A A @ @ @ ¡ ¡ A A 2 @¡ A x3 Hs ¨s ω s ¡ A s y6 H ¨ 8 H ¨ @ ¡ A ω0 @ ¡ A ¨ 8 H @ ¡ A ¨ H ¨ H @¡ A x7 s s s s y7 The Transposed Stockham Factorization If n = 2t, then Fn = S S S , t ··· 2 1 where for q = 1:t the factor Sq = AqΓq 1 is defined by − A = I B , L = 2q, r = n/L, q r ⊗ L Γq 1 = Πr IL , L = L/2, r = 2r, − ∗ ⊗ ∗ ∗ ∗ IL ΩL BL = ∗ ∗ , IL ΩL ∗ − ∗ L 1 ΩL = diag(1, ωL,...,ωL∗− ). ∗ Perfect Shuffle x0 x0 x1 x1 x2 x4 x x (Π I ) 3 = 5 4 ⊗ 2 x x 4 2 x x 5 3 x x 6 6 x x 7 7 Cooley-Tukey Array Interpretation Step q: k 2k 2k+1 − 8 =2q 1 q L∗ <> L=2 −→ :> r∗ =n/L∗ | {z } r=n/L | {z } Reshaping × × × × x = x = ×××× × → 2 4 × ×××× × × × × Transposed Stockham Array Interp k k+r (q−1) T 9 q−1 = = L∗=2 . xL∗×r∗ FL∗ xr∗×L∗ => ;> r∗ =n/L∗ (q) (q 1) x = Sqx − | k {z } 9 > > > (q) = T = > =2q xL×r FLxr×L => L > > > > ;> r=n/L | {z } 2 2 2 Basic Radix-2 Versions × × Store intermediate DFTs by row or column Intermediate DFTs adjacent or not. How the two butterfly loops are ordered. I Ω L/2 L/2 q x = Ir x L = 2 , r = n/L ⊗ " IL/2 ΩL/2 #! − The Gentleman-Sande Idea T It can be shown that Fn = Fn and so if T Fn = At A P ··· 1 n then T T T Fn = F = PnA A n 1 ··· t and we can compute y = Fnx as follows..

Load more