COOLEY-TUKEY FFT LIKE ALGORITHMS FOR THE DCT
Markus Pusc¨ hel
Department of Electrical and Computer Engineering Carnegie Mellon University Pittsburgh, U.S.A.
ABSTRACT All algorithms in this class have a large degree of parallelism, low arithmetic cost, and a regular, flexible, recursive structure, The Cooley-Tukey FFT algorithm decomposes a discrete Fourier which makes them suitable for efficient hardware implementation transform (DFT) of size n = km into smaller DFTs of size k and or for automatic software generation and adaptation [2, 3]. Only m. In this paper we present a theorem that decomposes a polyno- few special cases in this class are known from the literature. We mial transform into smaller polynomial transforms, and show that present the mathematical framework in Section 2; the algorithm the FFT is obtained as a special case. Then we use this theorem derivation and analysis is in Section 3. to derive a new class of recursive algorithms for the discrete co- sine transforms (DCTs) of type II and type III. In contrast to other 2. POLYNOMIAL ALGEBRAS AND TRANSFORMS approaches, we manipulate polynomial algebras instead of trans- form matrix entries, which makes the derivation transparent, con- A vector space A that permits multiplication of elements such that cise, and gives insight into the algorithms’ structure. The derived the distributive law holds is called an algebra. Examples include algorithms have a regular structure and, for 2-power size, minimal C and the set C[x] of polynomials with complex coefficients. arithmetic cost (among known DCT algorithms). Polynomial Algebra. Let p(x) be a polynomial of degree deg(p) = n. Then, A = C[x]/p(x) = {q(x) | deg(q) < n}, 1. INTRODUCTION the set of residue classes modulo p, is an n-dimensional algebra with respect to the addition of polynomials, and the polynomial The celebrated Cooley-Tukey fast Fourier transform (FFT) algo- multiplication modulo p. We call A a polynomial algebra. rithm [1] decomposes a discrete Fourier transform (DFT) of size Polynomial Transform. For the remainder of this paper, we n = km into smaller DFTs of size k and m. The algorithm’s in- assume that the polynomial p(x) in A = C[x]/p(x) has pair- wise distinct zeros α = (α0, . . . , αn 1). Then, the Chinese Re- herent versatility—due to the degree of freedom in factoring n and − due to its structure—has proven very useful for its implementation mainder Theorem decomposes A into a Cartesian product of one- on a variety of diverse computing platforms. A recent example is dimensional algebras as automatic platform adaptation of FFT software through algorith- C[x]/p(x) → C[x]/(x − α0) ⊕ . . . ⊕ C[x]/(x − αn 1), mic search [2, 3, 4]. − q(x) 7→ (q(α0), . . . , q(αn 1)). (1) In this paper we derive a new class of algorithms, analogous to − the FFT, for the discrete cosine transforms (DCTs) of type II and If we choose b = (p0, . . . , pn 1), deg(pi) < n, as a basis for 0 − III. The discovery of this class and its derivation is made possible A, and bk = (x ) as the basis in each C[x]/(x − αk), then the through an approach using polynomial algebras. decomposition in (1) is given by the polynomial transform Algebraic Approach to the DCTs. There is a large number Pb,α = [p`(αk)]0 k,`r(x)), then Pb,α can be factorized as briefly explained this situation. We associate to each DCT (or DST) a polynomial next (see [9, 10] for details). We assume deg(q) = k, deg(r) = algebra, and derive known algorithms by manipulating these al- m, i.e., n = km, and denote with β = (β0, . . . , βk 1) the zeros gebras instead of transform matrix entries. This approach makes − of q and with αi0 = (αi,0, . . . , αi,m 1) the zeros of r(x)−βi, i.e., − the derivation transparent and provides the mathematical under- each αi,j is a zero α` of p. pinning for these algorithms. Then C[x]/p(x) decomposes in the following steps. A New Class of DCT Algorithms. In this paper we present a general theorem for decomposing polynomial algebras and show C[x]/p(x) → C[x]/(r(x) − βi) (3) that the Cooley-Tukey FFT is obtained as a special case. Then we 0Mirecursion is used for r0qk 1(r), . . . , rm 1qk 1(r)) − − − automatic software optimization [2, 3, 4]. is a basis of C[x]/p(x). The final factorization of Pb,α is given by the following theorem, in which we use the tensor (or Kronecker) 3. RECURSIVE DCT II AND III ALGORITHMS product and the direct sum of matrices A, B, respectively defined by In [9, 10] we have shown that all 16 types of discrete cosine and sine transforms are (possibly scaled) polynomial transforms, and A ⊗ B = [a · B], for A = [a ], A ⊕ B = [ A ] . k,` k,` B used this connection to derive and explain many of their known algorithms by manipulating polynomial algebras rather than ma- Theorem 1 Using previous notation, trices. In this section we extend this approach and use Theorem 1 to derive an entire class of recursive DCT type II and III algorithms Pb,α = P P 0 (Pc,β ⊗ Im)B, (7) d,αi based on Theorem 1, which are thus mathematically equivalent to 0Mitranspose, DCTn = (DCTn ) . (III) (7). The DCTn is a polynomial transform as has been recognized Example: Cooley-Tukey FFT. The discrete Fourier trans- already in [6]. To make this explicit we introduce the Chebyshev form (DFT) of size n is defined by the matrix polynomials defined by the recurrence
k` 2π√ 1/n DFTn = [ωn ]0 k,` n (III) Lm : jk + i 7→ im + j, 0 ≤ i < k, 0 ≤ j < m, DCTn = [T`(cos(k + 1/2)π/n)]0 k,` n n Definition 1 Let p(x) = Tn(x) − cos rπ with list of zeros α = DFTn = Lm(Ik ⊗ DFTm) Tm(DFTk ⊗ Im), (10) (cos r1π, . . . , cos rn 1π), computed from (16), normalized to sat- − n isfy 0 ≤ ri ≤ 1, and ordered as ri < rj for i < j. Further, where Tm is the diagonal twiddle matrix. By transposing (10) we let d = (T0, . . . , Tn 1) be the basis of C[x]/p(x). Then we obtain a different recursion (III) − call DCTn (rπ) = Pd,α a skew DCT of type III. In particular, n n (III) (III) DFTn = (DFTk ⊗ Im) Tm(Ik ⊗ DFTm) Lk (11) DCTn (π/2) = DCTn . We start the derivation with the matrix B = Bn,k in (7). The basis with the bidiagonal matrix b0 in (6) is given by 1 1 b0 = (T0T0(Tm), . . . , Tm 1T0(Tm), 1 1 − . . . . . Sk = .. .. . (20) T0Tk 1(Tm), . . . , Tm 1Tk 1(Tm)) 1 1 − − − = (Tjm i/2 + Tjm+i/2)0 isequence, which increases the critical path of computa- 1 Bn− = Cn,k · Dn,k, tion and thus runtime or latency (compare to Sk in (20) where all k − 1 additions can be done in parallel). To overcome this prob- with lem, we can either consider only small values of k, or invert (19), I Z m m which we do next. I Z m m Inversion. To invert (19), we define a skew DCT(II), moti- .. .. (III) 1 (II) Cn,k = . . , vated by (DCTn )− = 2/n · diag(1/2, 1, . . . , 1) · DCTn . I Z m m Im Definition 2 We define the skew DCT of type II by (II) (III) 1 0 DCTn (rπ) = n/2 · diag(2, 1, . . . , 1) · (DCTn (rπ))− . 0 1 Z = . . , (17) (II) (II) m . . . . In particular, DCTn = DCTn (π/2). 0 1 It turns out that with this definition, by inverting (19) the mul- and the diagonal matrix tiplications cancel each other, and we get the beautifully simple form Dn,k = Im ⊕(Ik 1 ⊗ diag(1, 1/2, . . . , 1/2)). − DCT(II)(rπ) = C (DCT(II)(rπ) ⊗ I ) The matrices for steps (3) and (4) are straightforward; the final n n,k k m (II) n permutation P in (7) is given by (without proof) DCTm (riπ) Mk , (21) 0Mirecursions, two for the DCT of type III and two for type II. We list these recursions in the (III) n (III) DCTn = Km DCTn ((i + 1/2)π/k) following using previous notation. The numbers r, ri are related 0Mi(III) (II) T (III) n (III) DCTn = DCTn (π/2) , DCTn (rπ) = Km DCTm (riπ) (II) T n (II) T 0MiQn,k (III) T n = (Ik ⊕(Im 1 ⊗Sk)) , DCTm (riπ) Mk , (24) − 0MiConvolution, Springer, 1997. The algorithms (23) for the DCT(III) and its transpose (25) for the DCT(II) have a large degree of parallelism, and a simple, [2] M. Frigo and S. G. Johnson, “FFTW: An adaptive software very regular structure. For a 2-power n, the arithmetic cost of the architecture for the FFT,” in Proc. ICASSP, 1998, vol. 3, pp. algorithms is independent of the chosen recursion and given by 1381–1384, http://www.fftw.org. 3n n [3] J. M. F. Moura, J. Johnson, R. Johnson, D. Padua, A(n) = log (n) − n + 1, M(n) = log (n) (27) V. Prasanna, M. Puschel,¨ and M. M. Veloso, “SPIRAL: 2 2 2 2 Automatic Library Generation and Platform-Adaptation for additions and multiplications, respectively, and is thus among the DSP Algorithms,” 1998, http://www.ece.cmu.edu/∼spiral. best ones known. The only possible drawback (for fixed point im- [4] M. Puschel,¨ B. Singer, J. Xiong, J. M. F. Moura, J. John- plementations) is the large dynamic range of the occurring inverse son, D. Padua, M. Veloso, and R. W. Johnson, “SPIRAL: A cosines from the base cases (26). We found only the special case Generator for Platform-Adapted Libraries of Signal Process- m = 2 (in which only skew DCTs of size 2 occur) in [11] in a n ing Algorithms,” 2003, to appear in Journal of High Perfor- slightly different form. Namely, using (A ⊗ I )Lk = (I ⊗A) m m mance Computing and Applications. for any k × k matrix A, we can write (23) also as [5] Y. Morikawa, H. Hamada, and N. Yamane, “A Fast Algo- Ln (II) T n (II) T k rithm for the Cosine Transform Based on Successive Or- DCTn (rπ) = Rm DCTm (riπ) der Reduction of the Chebyshev Polynomial,” Elec. and 0MiMathematics in Computation, vol. 56, no. 193, pp. 281–296, 1991. (II) n (II) DCTn (rπ) = Cn,k Lk (Im ⊗ DCTk (rπ)) [7] E. Feig and S. Winograd, “Fast Algorithms for the Discrete Ln Cosine Transform,” IEEE Trans. on Signal Processing, vol. (II) k n DCTm (riπ) Rm . (29) 40, no. 9, pp. 2174–2193, 1992. 0Mi(III) n (III) [13] G. Bi and L. W. Yu, “DCT Algorithms for Composite Se- DCTn (rπ) = Km(DCTm (rπ/2) quence Lengths,” IEEE Trans. on Signal Processing, vol. 46, (III) no. 3, pp. 554–562, 1998. ⊕ DCTm ((1 − r)π/2))(F2 ⊗ Im)An, (30)