arXiv:cs/0703150v2 [cs.NA] 29 Jan 2009

Type-II/III DCT/DST algorithms with reduced number of arithmetic operations

Xuancheng Shao and Steven G. Johnson*
*Department of Mathematics, Massachusetts Institute of Technology, Cambridge MA 02139

Abstract—We present algorithms for the type-II and type-III discrete cosine transform (DCT) and discrete sine transform (DST) that achieve a lower count of real multiplications and additions than previously published algorithms, without sacrificing numerical accuracy. Asymptotically, the operation count is reduced from 2N log2 N + O(N) to (17/9) N log2 N + O(N) for a power-of-two transform size N. Furthermore, we show that an additional N multiplications may be saved by a certain rescaling of the inputs or outputs, generalizing a well-known technique for N = 8 by Arai et al. These results are derived by considering the DCT to be a special case of a DFT of length 4N, with certain symmetries, and then pruning redundant operations from a recent improved fast Fourier transform algorithm (based on a recursive rescaling of the conjugate-pair split-radix algorithm). The improved algorithms for the DCT-III, DST-II, and DST-III follow immediately from the improved count for the DCT-II.

Index Terms—discrete cosine transform; fast Fourier transform; arithmetic complexity

I. INTRODUCTION

In this paper, we describe recursive algorithms for the type-II and type-III discrete cosine and sine transforms (DCT-II, DCT-III, DST-II, and DST-III) of power-of-two size N that require fewer total real additions and multiplications than previously published algorithms (an asymptotic reduction of about 5.6%), without sacrificing numerical accuracy. Our DCT and DST algorithms are based on a recently published fast Fourier transform (FFT) algorithm, which reduced the operation count for the discrete Fourier transform (DFT) compared to the previous-best split-radix algorithm [1]. The new gains for the DCTs and DSTs stem from a rescaling of certain internal multiplicative factors within an FFT variant called a "conjugate-pair" split-radix algorithm [2]–[5], so as to simplify some of the recursive multiplications. In order to derive a DCT algorithm from this FFT, we simply consider the DCT-II to be a special case of a DFT of real input with a certain even symmetry, and discard the redundant operations [1], [6]–[10]. The DCT-III, DST-II, and DST-III have operation counts identical to that of a DCT-II of the same size, since the algorithms are related by simple transpositions, permutations, and sign flips [11]–[13].

Since 1968, the lowest total count of real additions and multiplications, herein called "flops" (floating-point operations), for the DFT of a power-of-two size N was 4N log2 N − 6N + 8, achieved by the split-radix algorithm [6], [8], [14]–[16]. This count was recently surpassed by new algorithms achieving a flop count of (34/9) N log2 N + O(N) [1], [17]. Similarly, the lowest-known flop count for the DCT-II of size N = 2^m (m > 1) was previously 2N log2 N − N + 2 for a unitary normalization (with the additive constant depending on the normalization) [6], [7], [12], [13], [18]–[26], and could be achieved by starting with a split-radix FFT and discarding redundant operations [6], [7]. (Many DCT algorithms with larger or unreported flop counts have also been described [27]–[40].) Based on our new FFT, reduced flop counts for the various DCT types were obtained automatically by an FFT code generator [10], [41] that pruned the redundant operations from a DFT of symmetric inputs, but neither an explicit general algorithm nor a formula for the flop count was presented [1]. In this paper, we start from the same point and derive a general DCT-II algorithm "manually" from a real-even FFT by pruning redundant operations, and we give the general formula for the new flop count (for N = 2^m, m > 1):

  (17/9) N log2 N − (17/27) N − (1/9)(−1)^{log2 N} log2 N + (7/54)(−1)^{log2 N} + 3/2.    (1)

The first savings over the previous record occur for N = 16, and the flop counts are summarized in Table I for several N. We also consider the effect of normalization on this flop count: the count above was obtained for a unitary transform, but slightly different counts are obtained by other normalization choices. Moreover, we show that a further N multiplications can be saved by individually rescaling every output of the DCT-II (or input of the DCT-III). In doing so, we generalize a result by Arai et al. [42], who showed that eight multiplications could be saved for the size-8 DCT-II, a result commonly applied to JPEG compression [43].

TABLE I
FLOP COUNTS (REAL ADDS + MULTS) OF PREVIOUS BEST DCT-II AND OUR NEW ALGORITHM

    N        previous best DCT-II    New algorithm
    16              114                   112
    32              290                   284
    64              706                   686
    128            1666                  1614
    256            3842                  3708
    512            8706                  8384
    1024          19458                 18698
    2048          43010                 41266
    4096          94210                 90264
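As a quick numerical sanity check (not part of the original paper), the closed-form count of eq. (1) and the previous best count 2N log2 N − N + 2 can be evaluated directly and compared against Table I. The sketch below assumes only the two formulas above; the function names are ours.

    import math

    def previous_best(N):      # 2 N log2(N) - N + 2 (unitary normalization)
        m = int(math.log2(N))
        return 2 * N * m - N + 2

    def new_count(N):          # eq. (1)
        m = int(math.log2(N))
        s = (-1) ** m
        return 17 / 9 * N * m - 17 / 27 * N - s * m / 9 + 7 * s / 54 + 1.5

    for N in (16, 32, 64, 128, 256, 512, 1024, 2048, 4096):
        print(N, previous_best(N), round(new_count(N)))
    # reproduces both columns of Table I, e.g. N = 16 gives 114 and 112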

If we merely wished to show that the DCT-II/III could be computed in (17/9) N log2 N + O(N) operations, we could do so by using well-known techniques to re-express the DCT in terms of a real-input DFT of length N plus O(N) pre/post-processing operations [29], [30], [33], [44]–[46], and then substituting our new FFT that requires (17/9) N log2 N + O(N) operations for real inputs [1]. However, with FFT and DCT algorithms, there is great interest in obtaining not only the best possible asymptotic constant factor, but also the best possible exact count of arithmetic operations (which, for the DCT-II of power-of-two sizes, had not changed by even one operation for over 20 years). Our result (1) is intended as a new upper bound on this (still unknown) minimum exact count, and therefore we have done our best with the O(N) terms as well as the asymptotic constant. It turns out, in fact, that our algorithm to achieve (1) is closely related to well-known algorithms for expressing the DCT-II in terms of a real-input FFT of the same length, but it does not seem obvious a priori that this is what one obtains by pruning our FFT for symmetric data.

In the following sections, we first review how a DCT-II may be expressed as a special case of a DFT, and how the previous optimum flop count can be achieved by pruning redundant operations from a conjugate-pair split-radix FFT. Then, we briefly review the new FFT algorithm presented in [1], and derive the new DCT-II algorithm. We follow by considering the effect of normalization and scaling. Finally, we consider the extension of this algorithm to algorithms for the DCT-III, DST-II, and DST-III. We close with some concluding remarks about future directions. We emphasize that no proven tight lower bound on the DCT-II flop count is currently known, and we make no claim that eq. (1) is the lowest possible (although we have endeavored not to miss any obvious optimizations).

II. DCT-II IN TERMS OF DFT

Various forms of discrete cosine transform have been defined, corresponding to different boundary conditions on the transform [47]. Perhaps the most common form is the type-II DCT, used in image compression [43] and many other applications. The DCT-II is typically defined as a real, orthogonal (unitary), linear transformation by the formula (for k = 0, ..., N−1):

  C^II_k = √((2 − δ_{k,0})/N) Σ_{n=0}^{N−1} x_n cos[π(n + 1/2)k/N],    (2)

where δ_{k,0} is the Kronecker delta. The DFT of size N, in turn, is defined by

  X_k = Σ_{n=0}^{N−1} x_n ω_N^{nk},    (3)

where ω_N = e^{−2πi/N} is an Nth primitive root of unity. In order to relate this to the DCT-II, it is convenient to choose a different normalization for the latter transform:

  C_k = 2 Σ_{n=0}^{N−1} x_n cos[π(n + 1/2)k/N].    (4)

This normalization is not unitary, but it is more directly related to the DFT and therefore more convenient for the development of algorithms. Of course, any fast algorithm for C_k trivially yields a fast algorithm for C^II_k, although the exact count of required multiplications depends on the normalization, an issue we discuss in more detail in section V.

In order to derive C_k from the DFT formula, one can use the identity 2 cos(πl/N) = ω_{4N}^{2l} + ω_{4N}^{4N−2l} to write:

  C_k = 2 Σ_{n=0}^{N−1} x_n cos[π(n + 1/2)k/N]
      = Σ_{n=0}^{N−1} x_n ω_{4N}^{(2n+1)k} + Σ_{n=0}^{N−1} x_n ω_{4N}^{(4N−2n−1)k}    (5)
      = Σ_{n=0}^{4N−1} x̃_n ω_{4N}^{nk},

where x̃_n is a real-even sequence of length 4N (i.e. x̃_{4N−n} = x̃_n), defined as follows for 0 ≤ n < 4N:

  x̃_{2n} = 0,    (6)
  x̃_{2n+1} = x̃_{4N−2n−1} = x_n  for 0 ≤ n < N.    (7)

That is, the DCT-II of x_n is given by the first N outputs of a DFT of length 4N whose input is x_n interleaved with zeros and extended to an even, periodic sequence.

[Fig. 1. A DCT-II of size N = 8 (open dots, x_0, ..., x_7) is equivalent to a size-4N DFT via interleaving with zeros (black dots) and extending in an even (squares) periodic (gray) sequence.]
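A short numerical check of eqs. (4)–(7) (ours, not from the paper; it assumes NumPy and uses the standard np.fft.fft, which follows the same ω_N = e^{−2πi/N} convention as eq. (3)):

    import numpy as np

    N = 8
    rng = np.random.default_rng(0)
    x = rng.standard_normal(N)
    n = np.arange(N)

    # direct DCT-II, eq. (4)
    C_direct = np.array([2 * np.sum(x * np.cos(np.pi * (n + 0.5) * k / N)) for k in range(N)])

    # real-even, zero-interleaved sequence of eqs. (6)-(7)
    xt = np.zeros(4 * N)
    xt[2 * n + 1] = x             # x~_{2n+1}    = x_n
    xt[4 * N - 2 * n - 1] = x     # x~_{4N-2n-1} = x_n

    C_dft = np.fft.fft(xt)[:N].real   # first N outputs of the size-4N DFT, eq. (5)
    print(np.allclose(C_direct, C_dft))   # True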

Algorithm 1  Standard conjugate-pair split-radix FFT algorithm of size N. Special-case optimizations for k = 0 and k = N/8, as well as the base cases, are omitted for simplicity.

  function X_{k=0..N−1} ← splitfft_N(x_n):
    U_{k2=0..N/2−1} ← splitfft_{N/2}(x_{2n2})
    Z_{k4=0..N/4−1} ← splitfft_{N/4}(x_{4n4+1})
    Z′_{k4=0..N/4−1} ← splitfft_{N/4}(x_{4n4−1})
    for k = 0 to N/4 − 1 do
      X_k ← U_k + (ω_N^k Z_k + ω_N^{−k} Z′_k)
      X_{k+N/2} ← U_k − (ω_N^k Z_k + ω_N^{−k} Z′_k)
      X_{k+N/4} ← U_{k+N/4} − i (ω_N^k Z_k − ω_N^{−k} Z′_k)
      X_{k+3N/4} ← U_{k+N/4} + i (ω_N^k Z_k − ω_N^{−k} Z′_k)
    end for

Algorithm 2  Standard conjugate-pair split-radix FFT algorithm of size N as in Algorithm 1, but specialized for the case of real inputs. Special-case optimizations for k = 0 and k = N/8, as well as the base cases, are omitted for simplicity.

  function X_{k=0..N/2} ← rsplitfft_N(x_n):
    U_{k2=0..N/4} ← rsplitfft_{N/2}(x_{2n2})
    Z_{k4=0..N/8} ← rsplitfft_{N/4}(x_{4n4+1})
    Z′_{k4=0..N/8} ← rsplitfft_{N/4}(x_{4n4−1})
    for k = 0 to N/8 do
      X_k ← U_k + (ω_N^k Z_k + ω_N^{−k} Z′_k)
      X_{k+N/4} ← U*_{N/4−k} − i (ω_N^k Z_k − ω_N^{−k} Z′_k)
      X_{N/4−k} ← U_{N/4−k} − i (ω_N^k Z_k − ω_N^{−k} Z′_k)*
      X_{N/2−k} ← U*_k − (ω_N^k Z_k + ω_N^{−k} Z′_k)*
    end for
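The following is our own direct NumPy transcription of Algorithm 1 above (with the N = 1 and N = 2 base cases added); it is only meant to make the recurrence concrete and to check it against a library FFT, not to minimize flops.

    import numpy as np

    def splitfft(x):
        """Conjugate-pair split-radix DFT of x (len a power of two), per Algorithm 1."""
        N = len(x)
        if N == 1:
            return x.astype(complex)
        if N == 2:
            return np.array([x[0] + x[1], x[0] - x[1]], dtype=complex)
        U = splitfft(x[0::2])                              # size N/2, even-indexed inputs
        Z = splitfft(x[1::4])                              # size N/4, x_{4n+1}
        Zp = splitfft(x[(4 * np.arange(N // 4) - 1) % N])  # size N/4, x_{4n-1 mod N}
        X = np.empty(N, dtype=complex)
        k = np.arange(N // 4)
        w = np.exp(-2j * np.pi * k / N)                    # omega_N^k
        a = w * Z + np.conj(w) * Zp
        b = 1j * (w * Z - np.conj(w) * Zp)
        X[k] = U[k] + a
        X[k + N // 2] = U[k] - a
        X[k + N // 4] = U[k + N // 4] - b
        X[k + 3 * N // 4] = U[k + N // 4] + b
        return X

    x = np.random.randn(32)
    print(np.allclose(splitfft(x), np.fft.fft(x)))   # True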

III. CONJUGATE-PAIR FFT AND DCT-II

Although the previous minimum flop count for DCT-II algorithms can be derived from the ordinary split-radix FFT algorithm [6], [7] (and it can also be derived in other ways [12], [13], [18]–[26]), here we will do the same thing using a variant dubbed the "conjugate-pair" split-radix FFT. This algorithm was originally proposed to reduce the number of flops [2], but was later shown to have the same flop count as ordinary split-radix [3]–[5]. It turns out, however, that the conjugate-pair algorithm exposes symmetries in the multiplicative factors that can be exploited to reduce the flop count by an appropriate rescaling [1], which we will employ in the following sections.

A. Conjugate-pair split-radix FFT

Starting with the DFT of equation (3), the decimation-in-time conjugate-pair FFT splits it into three smaller DFTs: one of size N/2 of the even-indexed inputs, and two of size N/4:

  X_k = Σ_{n2=0}^{N/2−1} ω_{N/2}^{n2 k} x_{2n2} + ω_N^k Σ_{n4=0}^{N/4−1} ω_{N/4}^{n4 k} x_{4n4+1} + ω_N^{−k} Σ_{n4=0}^{N/4−1} ω_{N/4}^{n4 k} x_{4n4−1},    (8)

where the indices 4n4 ± 1 are computed modulo N. [In contrast, the ordinary split-radix FFT uses x_{4n4+3} for the third sum (a cyclic shift of x_{4n4−1}), with a corresponding multiplicative "twiddle" factor of ω_N^{3k}.] This decomposition is repeated recursively (until base cases of size N = 1 or N = 2, not shown, are reached), as shown by the pseudo-code in Algorithm 1. Here, we denote the results of the three sub-transforms of size N/2, N/4, and N/4 by U_k, Z_k, and Z′_k, respectively. The number of flops required by this algorithm, after certain simplifications (common subexpression elimination and constant folding, and special-case simplifications of the constants for k = 0 and k = N/8) and not counting multiplications by ±1, is the best-known count 4N log2 N − 6N + 8 flops for N > 1. For real inputs, the flop count can be reduced, in the ordinary split-radix [7], [48] and in the conjugate-pair split-radix [1], by eliminating redundant operations, to achieve a flop count of 2N log2 N − 4N + 6. Specifically, one need only compute the outputs for 0 ≤ k ≤ N/2 (where the k = 0 and k = N/2 outputs are purely real), and the corresponding algorithm is shown in Algorithm 2. Again, for simplicity we do not show special-case optimizations for k = 0 and k = N/8, so it may appear that the X_{N/8}, X_{N/4}, and X_{3N/8} outputs are computed twice. To derive Algorithm 2, one exploits two facts. First, the sub-transforms operate on real data and therefore have conjugate-symmetric output. Second, the twiddle factors in Algorithm 1 have some redundancy because of the identity ω_N^{±(N/4−k)} = ∓i ω_N^{∓k}, which allows us to share twiddle factors between k and N/4 − k in the original loop.

B. Fast DCT-II via split-radix FFT

Before we derive our fast DCT-II with a reduced flop count, we first derive a DCT-II algorithm with the same flop count as previously published algorithms, but starting with the conjugate-pair split-radix algorithm. This algorithm will then be modified in section IV-B, below, to reduce the number of multiplications.

In eq. (5), we showed that a DCT-II of size N, with inputs x_n and outputs C_k, can be expressed as the first N outputs of a DFT of size Ñ = 4N of real-even inputs x̃_n. [Therefore, N in eq. (8) above and in the surrounding text are replaced below by Ñ.] Now, consider what happens when we evaluate this DFT via the conjugate-pair split decomposition in eq. (8) and Algorithm 2. In this algorithm, we compute three smaller DFTs. First, U_k is the DFT of size Ñ/2 = 2N of the x̃_{2n} inputs, but these are all zero and so U_k is zero. Second, Z_k is the DFT of size Ñ/4 = N of the x̃_{4n+1} inputs (where 4n + 1 is evaluated modulo Ñ), which by eq. (7) correspond to the original data x_n by the formula:

  ỹ_n = x̃_{4n+1} = { x_{2n}        for 0 ≤ n < N/2
                     x_{2(N−n)−1}  for N/2 ≤ n < N }.    (9)

These ỹ_n are shown circled in figure 2; they are clearly the even-indexed x_n followed by the odd-indexed x_n in reverse. The second DFT, Z′_k, of size Ñ/4 = N, is redundant: it is the DFT of x̃_{4n−1}, but by the even symmetry of x̃_n this is equal to x̃_{4(−n)+1}, and therefore Z′_k = Z*_k (the complex conjugate of Z_k).

Therefore, a fast DCT-II is obtained merely by computing a single real-input DFT of size N to obtain Z_k = Z*_{N−k}, and then combining this according to Algorithm 2 to obtain C_k = X_k for k = 0, ..., N−1. In particular, in the loop of Algorithm 2, all but the X_k and X_{Ñ/4−k} terms correspond to subscripts ≥ Ñ/4 and are not needed. Moreover, we obtain

  X_k = ω_{4N}^k Z_k + ω_{4N}^{−k} Z*_k = 2 ℜ[ω_{4N}^k Z_k],
  X_{N−k} = −i (ω_{4N}^k Z_k − ω_{4N}^{−k} Z*_k)* = −2 ℑ[ω_{4N}^k Z_k].

The resulting algorithm, including the special-case optimizations for k = 0 (where ω_{4N}^0 = 1 and Z_0 is real) and k = N/2 (where ω_{4N}^{N/2} = (1 − i)/√2 and Z_{N/2} is real), is shown in Algorithm 3.

In fact, Algorithm 3 is equivalent to an algorithm derived in a different way by various authors [30], [33] to express a DCT-II in terms of a real-input DFT of the same length. Here, we see that this algorithm is exactly equivalent to a conjugate-pair split-radix FFT of length 4N. Previous authors obtained a suboptimal flop count (5/2) N log2 N + O(N) for the DCT-II using this algorithm [33] only because they used a suboptimal real-input DFT for the subtransform (split-radix being then almost unknown). Using a real-input split-radix DFT to compute Z_k, the flop count for Z_k is 2N log2 N − 4N + 6 from above. To get the total flop count for Algorithm 3, we need to add N/2 − 1 general complex multiplications by 2ω_{4N}^k (6 flops each) plus two real multiplications (2 flops), for a total of 2N log2 N − N + 2 flops. This matches the best-known flop count in the literature (where the +2 can be removed merely by choosing a different normalization as discussed in section V).

[Fig. 2. The DCT-II of N = 8 points x_n (open dots) is computed, in a conjugate-pair FFT of the Ñ = 32 extended data x̃_n (squares) from figure 1, via the DFT Z_k of the circled points x̃_{4n+1}, corresponding to the even-indexed x_n followed by the odd-indexed x_n.]

Algorithm 3  Fast DCT-II algorithm, matching previous best flop count, derived from Algorithm 2 by discarding redundant operations.

  function C_{k=0..N−1} ← splitdctII_N(x_n):
    for n = 0 to N/2 − 1 do
      ỹ_n ← x_{2n}
      ỹ_{N−1−n} ← x_{2n+1}
    end for
    Z_{k=0..N/2} ← rsplitfft_N(ỹ_n)
    C_0 ← 2 Z_0
    for k = 1 to N/2 − 1 do
      C_k ← 2 ℜ(ω_{4N}^k Z_k)
      C_{N−k} ← −2 ℑ(ω_{4N}^k Z_k)
    end for
    C_{N/2} ← √2 Z_{N/2}
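For concreteness, here is our own sketch of Algorithm 3 in NumPy, with np.fft.rfft standing in for the real-input split-radix subroutine rsplitfft_N; it reproduces the structure (permute, one size-N real-input DFT, O(N) post-processing) but of course not the exact flop count. The name dct2_via_rfft is ours.

    import numpy as np

    def dct2_via_rfft(x):
        N = len(x)
        y = np.empty(N)
        y[: N // 2] = x[0::2]          # even-indexed inputs ...
        y[N // 2 :] = x[1::2][::-1]    # ... then odd-indexed inputs in reverse, eq. (9)
        Z = np.fft.rfft(y)             # Z_k for k = 0 .. N/2
        C = np.empty(N)
        k = np.arange(1, N // 2)
        w = np.exp(-1j * np.pi * k / (2 * N))   # omega_{4N}^k
        C[0] = 2 * Z[0].real
        C[k] = 2 * (w * Z[k]).real
        C[N - k] = -2 * (w * Z[k]).imag
        C[N // 2] = np.sqrt(2) * Z[N // 2].real
        return C

    N = 16
    x = np.random.randn(N)
    n, k = np.arange(N), np.arange(N)[:, None]
    C_direct = 2 * np.cos(np.pi * (n + 0.5) * k / N) @ x   # eq. (4)
    print(np.allclose(dct2_via_rfft(x), C_direct))          # True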
In particular, in the loop of k k This scale factor is defined for N = 2m by the following Algorithm 2, all but the X −and X terms correspond to k N/˜ 4−k recurrence, where k = k mod N : subscripts N and are not needed. Moreover, we obtain 4 4 ≥ X = ωk Z + ω−kZ∗ =2 ωk Z , 1 for N 4 k N˜ k N˜ k 4N k ℜ∗ m ≤ k −k ∗  k sN=2 ,k =  sN/4,k4 cos(2πk4/N) for k4 N/8 . XN−k = i ω ˜ Zk ω ˜ Zk = 2 ω4N Zk .  ≤ − N − N − ℑ sN/4,k4 sin(2πk4/N) otherwise    (10) The resulting algorithm, including the special-case optimiza-  0 This definition has the properties: sN,0 =1, sN,k+N/4 = sN,k, tions for k =0 (where ω4N =1 and Z0 is real) and k = N/2 N/2 and sN,N/4−k = sN,k. Also, sN,k > 0 and decays rather (where ω = (1 i)/√2 and Z 2 is real), is shown in log4 cos(π/5) 4N − N/ slowly with N: sN,k has an Ω(N ) asymptotic Algorithm 3. lower bound [1]. When these scale factors are combined with In fact, Algorithm 3 is equivalent to an algorithm derived k the twiddle factors ωN , we obtain terms of the form in a different way by various authors [30], [33] to express a s DCT-II in terms of a real-input DFT of the same length. Here, t = ωk N/4,k , (11) N,k N s we see that this algorithm is exactly equivalent to a conjugate- N,k pair split-radix FFT of length 4N. Previous authors obtained a which is always a of the form 1 i tan 2πk ± ± N suboptimal flop count 5 N log N +O(N) for the DCT-II using or cot 2πk i and can therefore be multiplied with two 2 2 ± N ± 5

Algorithm 4  New FFT algorithm [1] of length N (divisible by 4). The sub-transforms newfftS^ℓ_N(x) for ℓ ≠ 0 are scaled by s_{ℓN,k}, respectively, while ℓ = 0 is the final unscaled DFT (s_{0,k} = 1). Special-case optimizations for k = 0 and k = N/8 are not shown.

  function X_{k=0..N−1} ← newfftS^ℓ_N(x_n):
    {computes DFT / s_{ℓN,k} for ℓ = 0, 1, 2}
    U_{k2=0..N/2−1} ← newfftS^{2ℓ}_{N/2}(x_{2n2})
    Z_{k4=0..N/4−1} ← newfftS^1_{N/4}(x_{4n4+1})
    Z′_{k4=0..N/4−1} ← newfftS^1_{N/4}(x_{4n4−1})
    for k = 0 to N/4 − 1 do
      X_k ← [U_k + (t_{N,k} Z_k + t*_{N,k} Z′_k)] · (s_{N,k}/s_{ℓN,k})
      X_{k+N/2} ← [U_k − (t_{N,k} Z_k + t*_{N,k} Z′_k)] · (s_{N,k}/s_{ℓN,k})
      X_{k+N/4} ← [U_{k+N/4} − i (t_{N,k} Z_k − t*_{N,k} Z′_k)] · (s_{N,k}/s_{ℓN,k+N/4})
      X_{k+3N/4} ← [U_{k+N/4} + i (t_{N,k} Z_k − t*_{N,k} Z′_k)] · (s_{N,k}/s_{ℓN,k+N/4})
    end for

  function X_{k=0..N−1} ← newfftS4_N(x_n):
    {computes DFT / s_{4N,k}}
    U_{k2=0..N/2−1} ← newfftS^2_{N/2}(x_{2n2})
    Z_{k4=0..N/4−1} ← newfftS^1_{N/4}(x_{4n4+1})
    Z′_{k4=0..N/4−1} ← newfftS^1_{N/4}(x_{4n4−1})
    for k = 0 to N/4 − 1 do
      X_k ← [U_k + (t_{N,k} Z_k + t*_{N,k} Z′_k)] · (s_{N,k}/s_{4N,k})
      X_{k+N/2} ← [U_k − (t_{N,k} Z_k + t*_{N,k} Z′_k)] · (s_{N,k}/s_{4N,k+N/2})
      X_{k+N/4} ← [U_{k+N/4} − i (t_{N,k} Z_k − t*_{N,k} Z′_k)] · (s_{N,k}/s_{4N,k+N/4})
      X_{k+3N/4} ← [U_{k+N/4} + i (t_{N,k} Z_k − t*_{N,k} Z′_k)] · (s_{N,k}/s_{4N,k+3N/4})
    end for

Algorithm 5  New FFT algorithm of size N (divisible by 4), as in Algorithm 4 but specialized for the case of real inputs. The sub-transforms rnewfftS^ℓ_N(x) for ℓ ≠ 0 are scaled by s_{ℓN,k}, respectively, while ℓ = 0 is the final unscaled DFT (s_{0,k} = 1). Special-case optimizations for k = 0 and k = N/8 are not shown.

  function X_{k=0..N/2} ← rnewfftS^ℓ_N(x_n):
    {computes DFT / s_{ℓN,k} for ℓ = 0, 1, 2}
    U_{k2=0..N/4} ← rnewfftS^{2ℓ}_{N/2}(x_{2n2})
    Z_{k4=0..N/8} ← rnewfftS^1_{N/4}(x_{4n4+1})
    Z′_{k4=0..N/8} ← rnewfftS^1_{N/4}(x_{4n4−1})
    for k = 0 to N/8 do
      X_k ← [U_k + (t_{N,k} Z_k + t*_{N,k} Z′_k)] · (s_{N,k}/s_{ℓN,k})
      X_{k+N/4} ← [U*_{N/4−k} − i (t_{N,k} Z_k − t*_{N,k} Z′_k)] · (s_{N,k}/s_{ℓN,k+N/4})
      X_{N/4−k} ← [U_{N/4−k} − i (t_{N,k} Z_k − t*_{N,k} Z′_k)*] · (s_{N,k}/s_{ℓN,k+N/4})
      X_{N/2−k} ← [U*_k − (t_{N,k} Z_k + t*_{N,k} Z′_k)*] · (s_{N,k}/s_{ℓN,k})
    end for

  function X_{k=0..N/2} ← rnewfftS4_N(x_n):
    {computes DFT / s_{4N,k}}
    U_{k2=0..N/4} ← rnewfftS^2_{N/2}(x_{2n2})
    Z_{k4=0..N/8} ← rnewfftS^1_{N/4}(x_{4n4+1})
    Z′_{k4=0..N/8} ← rnewfftS^1_{N/4}(x_{4n4−1})
    for k = 0 to N/8 do
      X_k ← [U_k + (t_{N,k} Z_k + t*_{N,k} Z′_k)] · (s_{N,k}/s_{4N,k})
      X_{k+N/4} ← [U*_{N/4−k} − i (t_{N,k} Z_k − t*_{N,k} Z′_k)] · (s_{N,k}/s_{4N,k+N/4})
      X_{N/4−k} ← [U_{N/4−k} − i (t_{N,k} Z_k − t*_{N,k} Z′_k)*] · (s_{N,k}/s_{4N,N/4−k})
      X_{N/2−k} ← [U*_k − (t_{N,k} Z_k + t*_{N,k} Z′_k)*] · (s_{N,k}/s_{4N,N/2−k})
    end for

Because of the symmetry s_{N,N/4−k} = s_{N,k}, it is possible to specialize Algorithm 4 for real data in the same way as in Algorithm 2, because the scale factor preserves the conjugate symmetry X_k = X*_{N−k} of the outputs of all subtransforms [1].¹ The resulting flop count, for arbitrary complex data x_n, is then reduced from 4N log2 N − 6N + 8 to

  (34/9) N log2 N − (124/27) N − 2 log2 N − (2/9)(−1)^{log2 N} log2 N + (16/27)(−1)^{log2 N} + 8.    (12)

In particular, assuming that complex multiplications are implemented as four real multiplications and two real additions, then the savings are purely in the number of real multiplications. The number M(N) of real multiplications saved over ordinary split radix is given by:

  M(N) = (2/9) N log2 N − (38/27) N + 2 log2 N + (2/9)(−1)^{log2 N} log2 N − (16/27)(−1)^{log2 N}.    (13)

If the data are purely real, it was shown that M(N)/2 multiplications are saved over the corresponding real-input split-radix algorithm [1]. These flop counts are to compute the original, unscaled definition of the DFT. If one is allowed to scale the outputs by any factor desired, then scaling by 1/s_{N,k} [corresponding to newfftS^1_N(x_n) in Algorithm 4] saves an additional M_S(N) − M(N) ≥ 0 multiplications for complex inputs, where:

  M_S(N) = (2/9) N log2 N − (20/27) N + (2/9)(−1)^{log2 N} log2 N − (7/27)(−1)^{log2 N} + 1.    (14)

For real inputs, one similarly saves M_S(N)/2 flops for rnewfftS^1_N(x_n) operating on real x_n in Algorithm 5.

¹In Algorithm 5, we have also utilized various symmetries of the scale factors, such as s_{2N,N/4−k} = s_{2N,k+N/4}, as described in our previous work [1], to make explicit the shared multiplicative factors between the different terms.

Although the division by a cosine or sine in the scale factor may, at first glance, seem as if it may exacerbate floating-point errors, this is not the case. Cooley–Tukey based FFT algorithms have O(√log N) root-mean-square error growth, and O(log N) error bounds, in finite-precision floating-point arithmetic (assuming that the trigonometric constants are precomputed accurately) [49]–[51], and the new FFT is no different [1]. The reason for this is straightforward: it never adds or subtracts scaled and unscaled values. Instead, wherever the original FFT would compute a + b, the new FFT computes s·(a + b) for some fixed scale factor s.

B. Fast DCT-II from new FFT

To derive the new DCT-II algorithm based on the new FFT of the previous section, we follow exactly the same process as in Sec. III-B. We express the DCT-II of length N in terms of a DFT of length 4N, apply the FFT algorithm 5, and discard the redundant operations. As before, the U_k subtransform is identically zero, Z_k is the transform of the even-indexed elements of x_n followed by the odd-indexed elements in reverse order, and Z′_k = Z*_{N−k} is redundant.

The resulting algorithm is shown in Algorithm 6, and differs from the original fast DCT-II of Algorithm 3 in only two ways. First, instead of calling the ordinary split-radix (or conjugate-pair) FFT for the subtransform, it calls the scaled subroutine rnewfftS^1_N. Second, because this subtransform is scaled by 1/s_{N,k}, the twiddle factor ω_{4N}^k in Algorithm 6 is multiplied by s_{N,k}.

Algorithm 6  New DCT-II algorithm of size N = 2^m, based on discarding redundant operations from the new FFT algorithm of size 4N, achieving new record flop count.

  function C_{k=0..N−1} ← newdctII_N(x_n):
    for n = 0 to N/2 − 1 do
      ỹ_n ← x_{2n}
      ỹ_{N−1−n} ← x_{2n+1}
    end for
    Z_{k=0..N/2} ← rnewfftS^1_N(ỹ_n)
    C_0 ← 2 Z_0
    for k = 1 to N/2 − 1 do
      C_k ← 2 ℜ(ω_{4N}^k s_{N,k} Z_k)
      C_{N−k} ← −2 ℑ(ω_{4N}^k s_{N,k} Z_k)
    end for
    C_{N/2} ← √2 Z_{N/2}

To derive the flop count for Algorithm 6, we merely need to add the flop count for the subtransform [which saves M_S(N)/2 = (1/9) N log2 N + O(N) flops, from eq. (14), compared to ordinary real-input split radix] with the number of flops in the loop, where the latter is exactly the same as for Algorithm 3 because the products ω_{4N}^k s_{N,k} can be precomputed. Therefore, we save a total of M_S(N)/2 flops compared to the previous best flop count of 2N log2 N − N + 2, resulting in the flop count of eq. (1). This formula matches the flop count that was achieved by an automatic code generator in Ref. [1].

Because the new DCT algorithm is mathematically merely a special set of inputs for the underlying FFT, it will have the same favorable logarithmic error bounds as discussed in the previous section.

V. NORMALIZATIONS

In the above definition of DCT-II, we use a scale factor "2" in front of the summation in order to make the DCT-II directly equivalent to the corresponding DFT. But in some circumstances, it is useful to multiply by other factors, and different normalizations lead to slightly different flop counts. For example, one could use the unitary normalization from eq. (2), which replaces 2 by √(2/N) or √(1/N) (for k = 0) and requires the same number of flops. If one uses the unitary normalization multiplied by √N, then one saves two multiplications in the k = 0 and k = N/2 terms (in this normalization, C_0 = Z_0 and C_{N/2} = Z_{N/2}) for both Algorithm 3 and Algorithm 6. (This leads to a commonly quoted 2N log2 N − N formula for the previous flop count.)

It is also common to compute a DCT-II with scaled outputs, e.g. for the JPEG image-compression standard where an arbitrary scaling can be absorbed into a subsequent quantization step [43], and in this case the scaling can save 8 multiplications [42] over the 42 flops required for a unitary 8-point DCT-II. Since our newfftS^1_N(x_n) attempts to be the optimally scaled FFT, we should be able to derive this scaled DCT-II by using rnewfftS^1_N(x_n)/2 for our length-4N DFT instead of rnewfftS^0_N(x), and again pruning the redundant operations. Doing so, we obtain an algorithm identical to Algorithm 6 except that 2ω_{4N}^k s_{N,k} is replaced by t_{4N,k}, which requires fewer multiplications, and also we now obtain C_0 = Z_0 and C_{N/2} = Z_{N/2}. This algorithm computes C_k/(2 s_{4N,k}) instead of C_k. In particular, we save exactly N multiplications over Algorithm 6, which matches the result by Ref. [42] for N = 8 but generalizes it to all N. This savings of N multiplications for a scaled DCT-II also matches the flop count that was achieved by an automatic code generator in Ref. [1].
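To illustrate that the normalization choice is only an output rescaling (and therefore does not affect the structure of any of the algorithms above), here is a small sketch of ours converting the unnormalized C_k of eq. (4) into the unitary C^II_k of eq. (2); the function name is hypothetical.

    import numpy as np

    def unitary_from_unnormalized(C):
        # C^II_k = C_k / sqrt(2N) for k > 0, and C_0 / (2 sqrt(N)) for k = 0
        N = len(C)
        CII = C / np.sqrt(2.0 * N)
        CII[0] = C[0] / (2.0 * np.sqrt(N))
        return CII

    N = 8
    x = np.random.randn(N)
    n = np.arange(N)
    C = np.array([2 * np.sum(x * np.cos(np.pi * (n + 0.5) * k / N)) for k in range(N)])   # eq. (4)
    scale = np.where(np.arange(N) == 0, np.sqrt(1.0 / N), np.sqrt(2.0 / N))
    CII_direct = scale * np.array([np.sum(x * np.cos(np.pi * (n + 0.5) * k / N)) for k in range(N)])  # eq. (2)
    print(np.allclose(unitary_from_unnormalized(C), CII_direct))   # True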
VI. FAST DCT-III, DST-II, AND DST-III

Given any algorithm for the DCT-II, one can immediately obtain algorithms for the DCT-III, DST-II, and DST-III with exactly the same number of arithmetic operations. In this way, any improvement in the DCT-II, such as the one described in this paper, immediately leads to improved algorithms for those transforms and vice versa. In this section, we review the equivalence between those transforms.

A. DCT-III

To obtain a DCT-III algorithm from a DCT-II algorithm, one can exploit two facts. First, the DCT-III, viewed as a matrix, is simply the transpose of the DCT-II matrix. Second, any linear algorithm can be viewed as a linear network, and the transpose operation is computed by the network transposition of this algorithm [52].

To review, a linear network represents the algorithm by a directed acyclic graph, where the edges represent multiplications by constants and the vertices represent additions of the incoming edges. Network transposition simply consists of reversing the direction of every edge. We prove below that the transposed network requires the same number of additions and multiplications for networks with the same number of inputs and outputs, and therefore the DCT-III can be computed in the same number of flops as the DCT-II by the transposed algorithm. The DCT-III corresponding to the transpose of eq. (4) is

  C^T_k = 2 Σ_{n=0}^{N−1} x_n cos[π n (k + 1/2)/N].    (15)

Since the DCT-III is the transpose of the DCT-II, a fast DCT-III algorithm is obtained by network transposition of a fast DCT-II algorithm. For example, network transposition of a size-2 DCT-II is shown in figure 3.

[Fig. 3. Example of network transposition for a size-2 DCT-II (left). When the linear network is transposed (all edges are reversed), the resulting network computes the transposed-matrix operation: a size-2 DCT-III (right). The size-2 DCT-II computes C_0 = 2(x_0 + x_1) and C_1 = 2(x_0 − x_1); its transpose computes x_0 = 2C_0 + 2C_1 and x_1 = 2C_0 − 2C_1.]

A proof that the transposed network requires the same number of flops is as follows. Clearly, the number of multiplications, i.e. the number of edges with weight ≠ ±1, is unchanged by transposition. The number of additions is given by the sum of indegree − 1 over all the vertices, except for the N input vertices which have indegree zero. That is, for a set V of vertices and a set E of edges:

  # adds = N + Σ_{v∈V} [indegree(v) − 1] = N + |E| − |V|,    (16)

using the fact that the sum of the indegree over all vertices is simply the number |E| of edges. Because the above expression is obviously invariant under network transposition as long as N is unchanged (i.e. same number of inputs and outputs), the number of additions is invariant.

More explicitly, a fast DCT-III algorithm derived from the transpose of our new Algorithm 6 is outlined in Algorithm 7. Whereas for the DCT-II we computed the real-input DFT (with conjugate-symmetric output) of the rearranged inputs ỹ_n and then post-processed the complex outputs Z_k = Z*_{N−k}, now we do the reverse. We first preprocess the inputs to form a complex array Z_k = Z*_{N−k}, then perform a real-output, scaled-input DFT to obtain ỹ_k, and finally assign the even and odd elements of the result C^T_k from the first and second halves of ỹ_k. Without the scale factors, this is equivalent to a well-known algorithm to express a DCT-III in terms of a real-output DFT of the same size [30], [33]. In order to minimize the flop count once the scale factors are included, however, it is important that this real-output DFT operate on scaled inputs rather than scaled outputs, so it is intrinsically different (transposed) from Algorithm 4. The real-output (scaled, conjugate-symmetric input) version of rnewfftS^1_N(x_n) can be derived by network transposition of the real-input case (since network transposition interchanges inputs and outputs). Equivalently, whereas rnewfftS^1_N(x_n) was a scaled-output decimation-in-time algorithm specialized for real inputs, the transposed algorithm rnewfftS(T)^1_N(x_n) is a scaled-input decimation-in-frequency algorithm specialized for real outputs. We do not list rnewfftS(T)^1_N(x_n) explicitly here, however; our main point is simply to establish that an algorithm for the DCT-III with exactly the same flop count (1) follows immediately from the new DCT-II.

Algorithm 7  New DCT-III algorithm of size N = 2^m, based on network transposition of Algorithm 6, with the same flop count.

  function C^T_{k=0..N−1} ← newdctIII_N(x_n):
    Z_0 ← 2 x_0
    for n = 1 to N/2 − 1 do
      Z_n ← 2 ω_{4N}^n s_{N,n} (x_n − i x_{N−n})
      Z_{N−n} ← Z*_n
    end for
    Z_{N/2} ← √2 x_{N/2}
    ỹ_k ← rnewfftS(T)^1_N(Z_n)
    for k = 0 to N/2 − 1 do
      C^T_{2k} ← ỹ_k
      C^T_{2k+1} ← ỹ_{N−1−k}
    end for

There are also other ways to derive a DCT-III algorithm with the same flop count without using network transposition, of course. For example, one can again consider the DCT-III to be a real-input DFT of size 4N with certain symmetries (different from the DCT-II's symmetries), and prune redundant operations from Algorithm 5, resulting in a decimation-in-time algorithm. Such a derivation is useful because it allows one to derive a scaled-output DCT-III as well, which turns out to be a subroutine of a new DCT-IV algorithm that we describe in another manuscript, currently in press [53]. However, this derivation is more complicated than the one above, resulting in four mutually recursive DCT-III subroutines, and does not lower the flop count, so we omit it here.

B. DST-II and DST-III

The DST-II is defined, using a unitary normalization, as:

  S^II_k = √((2 − δ_{k,N})/N) Σ_{n=0}^{N−1} x_n sin[π(n + 1/2)k/N]    (17)

for k = 1, ..., N (not 0, ..., N−1). As in section II, it is more convenient to develop algorithms for an unnormalized variation:

  S_k = 2 Σ_{n=0}^{N−1} x_n sin[π(n + 1/2)k/N],    (18)

similar to our normalization of C_k and C^T_k above.

Although we could derive fast algorithms for S_k directly by treating it as a DFT of length 4N with odd symmetry, interleaved with zeros, and discarding redundant operations similar to above, it turns out there is a simpler technique. The DST-II is exactly equivalent to a DCT-II in which the outputs are reversed and every other input is multiplied by −1 [11]–[13]:

  S_{N−k} = 2 Σ_{n=0}^{N−1} (−1)^n x_n cos[π(n + 1/2)k/N]    (19)

for k = 0, ..., N−1. It therefore follows that a DST-II can be computed with the same number of flops as a DCT-II of the same size, assuming that multiplications by −1 are free—the reason for this is that sign flips can be absorbed at no cost by converting additions into subtractions or vice versa in the subsequent algorithmic steps. Therefore, our new flop count (1) immediately applies to the DST-II.

Similarly, a DST-III is given by the transpose of the DST-II (which is the inverse, for the unitary normalization). In unnormalized form, this is

  S^T_k = 2 Σ_{n=1}^{N} x_n sin[π n (k + 1/2)/N]    (20)

for k = 0, ..., N−1. This must have the same number of flops as the DST-II by the network transposition argument above. Alternatively, it can also be obtained from the DCT-III by reversing the inputs and multiplying every other output by −1 [12]:

  S^T_k = 2 (−1)^k Σ_{n=0}^{N−1} x_{N−n} cos[π n (k + 1/2)/N],    (21)

which can be obtained from C^T_k with the same number of flops.
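A numerical check of eq. (19) (ours, assuming NumPy): the unnormalized DST-II of eq. (18) equals a DCT-II of the sign-alternated input with its outputs reversed, so any DCT-II flop count indeed carries over to the DST-II.

    import numpy as np

    N = 16
    x = np.random.randn(N)
    n = np.arange(N)

    def dct2(v):   # eq. (4)
        return np.array([2 * np.sum(v * np.cos(np.pi * (n + 0.5) * k / N)) for k in range(N)])

    def dst2(v):   # eq. (18), k = 1 .. N
        return np.array([2 * np.sum(v * np.sin(np.pi * (n + 0.5) * k / N)) for k in range(1, N + 1)])

    S = dst2(x)                       # S_1 .. S_N
    C = dct2(((-1) ** n) * x)         # DCT-II of the sign-flipped input
    print(np.allclose(S[::-1], C))    # S_{N-k} == C_k  ->  True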

VII. CONCLUSION

We have shown that an improved count of real additions and multiplications (flops), eq. (1), can be obtained for the DCT/DST types II/III by pruning redundant operations from a recent improved FFT algorithm. We have also shown how to save N multiplications by rescaling the outputs of a DCT-II (or the inputs of a DCT-III), generalizing a well-known technique for N = 8 [42]. However, we do not claim that our new count is optimal—it may be that further gains can be obtained by, for example, extending the rescaling technique of [1] to greater generality. In particular, as has been pointed out by other authors [8], any improvement in the arithmetic complexity or flop counts for the DFT immediately leads to corresponding improvements in DCTs/DSTs, and vice versa. Our most important result, we believe, is the fact that there suddenly appears to be new room for improvement in problems that had seen no reductions in flop counts for many years.

The question of the minimal operation counts for basic linear transformations such as DCTs and DSTs is of longstanding theoretical interest. The practical impact of a few percent improvement in flop counts, admittedly, is less clear because computation time on modern hardware is often not dominated by arithmetic [10]. Nevertheless, minimal-arithmetic hard-coded DCTs of small sizes are often used in audio and image compression [43], [47], and the availability of any new algorithm with regular structure amenable to implementation leads to rich new opportunities for performance optimization.

Similar arithmetic improvements also occur for other types of DCT and DST [1], as well as for the modified DCT (MDCT, closely related to a DCT-IV), and the explicit description of those algorithms is the subject of another manuscript currently in press [53]. We have already shown that the new exact count for the DCT-I is 2N log2 N − 3N + 2 log2 N + 5 − M(2N)/4 = (17/9) N log2 N + O(N), where M(N) is given by eq. (13) [1]. However, no new exact count for arbitrary N = 2^m has yet been published for the DCT-IV and related transforms. Again, it immediately follows that the asymptotic cost of a DCT-IV (and hence an MDCT) is reduced to (17/9) N log2 N + O(N) simply by applying known algorithms to express a DCT-IV in terms of a DCT-II of size N [12] (although this algorithm has large numerical errors [10]) or in terms of two DCT-II/III transforms of size N/2 [19]. However, again we wish to establish a new lowest-known upper bound for the exact count of operations of the DCT-IV, and not just the asymptotic constant factor. (It turns out that the DCT-IV algorithm obtained by pruning redundant computations from our new FFT, this time of symmetric data with size 8N [10], is closely related to the algorithm by Wang et al. [19], just as the algorithm for the DCT-II in this paper was closely related to a known algorithm derived by other means [30], [33].)

This work was supported in part by a grant from the MIT Undergraduate Research Opportunities Program. The authors are also grateful to M. Frigo, co-author of FFTW with S. G. Johnson [10], for many helpful discussions.

REFERENCES

[1] S. G. Johnson and M. Frigo, "A modified split-radix FFT with fewer arithmetic operations," IEEE Trans. Signal Processing, vol. 55, no. 1, pp. 111–119, 2007.
[2] I. Kamar and Y. Elcherif, "Conjugate pair fast Fourier transform," Electron. Lett., vol. 25, no. 5, pp. 324–325, 1989.
[3] R. A. Gopinath, "Comment: Conjugate pair fast Fourier transform," Electron. Lett., vol. 25, no. 16, p. 1084, 1989.
[4] H.-S. Qian and Z.-J. Zhao, "Comment: Conjugate pair fast Fourier transform," Electron. Lett., vol. 26, no. 8, pp. 541–542, 1990.
[5] A. M. Krot and H. B. Minervina, "Comment: Conjugate pair fast Fourier transform," Electron. Lett., vol. 28, no. 12, pp. 1143–1144, 1992.
[6] M. Vetterli and H. J. Nussbaumer, "Simple FFT and DCT algorithms with reduced number of operations," Signal Processing, vol. 6, no. 4, pp. 267–278, 1984.
[7] P. Duhamel, "Implementation of 'split-radix' FFT algorithms for complex, real, and real-symmetric data," IEEE Trans. Acoust., Speech, Signal Processing, vol. 34, no. 2, pp. 285–295, 1986.
[8] P. Duhamel and M. Vetterli, "Fast Fourier transforms: a tutorial review and a state of the art," Signal Processing, vol. 19, pp. 259–299, April 1990.
[9] R. Vuduc and J. Demmel, "Code generators for automatic tuning of numerical kernels: experiences with FFTW," in Proc. Semantics, Application, and Implementation of Code Generators Workshop, (Montreal), Sept. 2000.
[10] M. Frigo and S. G. Johnson, "The design and implementation of FFTW3," Proc. IEEE, vol. 93, no. 2, pp. 216–231, 2005.
[11] Z. Wang, "A fast algorithm for the discrete sine transform implemented by the fast cosine transform," IEEE Trans. Acoust., Speech, Signal Processing, vol. 30, no. 5, pp. 814–815, 1982.

[12] S. C. Chan and K. L. Ho, "Direct methods for computing discrete sinusoidal transforms," IEE Proceedings F, vol. 137, no. 6, pp. 433–442, 1990.
[13] P. Lee and F.-Y. Huang, "Restructured recursive DCT and DST algorithms," IEEE Trans. Signal Processing, vol. 42, no. 7, pp. 1600–1609, 1994.
[14] R. Yavne, "An economical method for calculating the discrete Fourier transform," in Proc. AFIPS Fall Joint Computer Conf., vol. 33, pp. 115–125, 1968.
[15] P. Duhamel and H. Hollmann, "Split-radix FFT algorithm," Electron. Lett., vol. 20, no. 1, pp. 14–16, 1984.
[16] J. B. Martens, "Recursive cyclotomic factorization—a new algorithm for calculating the discrete Fourier transform," IEEE Trans. Acoust., Speech, Signal Processing, vol. 32, no. 4, pp. 750–761, 1984.
[17] T. Lundy and J. Van Buskirk, "A new matrix approach to real FFTs and convolutions of length 2^k," Computing, vol. 80, no. 1, pp. 23–45, 2007.
[18] B. G. Lee, "A new algorithm to compute the discrete cosine transform," IEEE Trans. Acoust., Speech, Signal Processing, vol. 32, no. 6, pp. 1243–1245, 1984.
[19] Z. Wang, "On computing the discrete Fourier and cosine transforms," IEEE Trans. Acoust., Speech, Signal Processing, vol. 33, no. 4, pp. 1341–1344, 1985.
[20] N. Suehiro and M. Hatori, "Fast algorithms for the DFT and other sinusoidal transforms," IEEE Trans. Acoust., Speech, Signal Processing, vol. 34, no. 3, pp. 642–644, 1986.
[21] H. S. Hou, "A fast algorithm for computing the discrete cosine transform," IEEE Trans. Acoust., Speech, Signal Processing, vol. 35, no. 10, pp. 1455–1461, 1987.
[22] F. Arguello and E. L. Zapata, "Fast cosine transform based on the successive doubling method," Electron. Lett., vol. 26, no. 19, pp. 1616–1618, 1990.
[23] C. W. Kok, "Fast algorithm for computing discrete cosine transform," IEEE Trans. Signal Processing, vol. 45, no. 3, pp. 757–760, 1997.
[24] J. Takala, D. Akopian, J. Astola, and J. Saarinen, "Constant geometry algorithm for discrete cosine transform," IEEE Trans. Signal Processing, vol. 48, no. 6, pp. 1840–1843, 2000.
[25] M. Püschel and J. M. F. Moura, "The algebraic approach to the discrete cosine and sine transforms and their fast algorithms," SIAM J. Computing, vol. 32, no. 5, pp. 1280–1316, 2003.
[26] M. Püschel, "Cooley–Tukey FFT like algorithms for the DCT," in Proc. IEEE Int'l Conf. Acoustics, Speech, and Signal Processing, vol. 2, (Hong Kong), pp. 501–504, April 2003.
[27] N. Ahmed, T. Natarajan, and K. R. Rao, "Discrete cosine transform," IEEE Trans. Comput., vol. 23, no. 1, pp. 90–93, 1974.
[28] R. M. Haralick, "A storage efficient way to implement the discrete cosine transform," IEEE Trans. Comput., vol. 25, pp. 764–765, 1976.
[29] W.-H. Chen, C. H. Smith, and S. C. Fralick, "A fast computational algorithm for the discrete cosine transform," IEEE Trans. Commun., vol. 25, no. 9, pp. 1004–1009, 1977.
[30] M. J. Narasimha and A. M. Peterson, "On the computation of the discrete cosine transform," IEEE Trans. Commun., vol. 26, no. 6, pp. 934–936, 1978.
[31] B. D. Tseng and W. C. Miller, "On computing the discrete cosine transform," IEEE Trans. Comput., vol. 27, no. 10, pp. 966–968, 1978.
[32] L. P. Yaroslavskii, "Shifted discrete Fourier transforms," Problems of Information Transmission, vol. 15, no. 4, pp. 324–327, 1979.
[33] J. Makhoul, "A fast cosine transform in one and two dimensions," IEEE Trans. Acoust., Speech, Signal Processing, vol. 28, no. 1, pp. 27–34, 1980.
[34] H. S. Malvar, "Fast computation of the discrete cosine transform and the discrete Hartley transform," IEEE Trans. Acoust., Speech, Signal Processing, vol. 35, no. 10, pp. 1484–1485, 1987.
[35] W. Li, "A new algorithm to compute the DCT and its inverse," IEEE Trans. Signal Processing, vol. 39, no. 6, pp. 1305–1313, 1991.
[36] G. Steidl and M. Tasche, "A polynomial approach to fast algorithms for discrete Fourier-cosine and Fourier-sine transforms," Math. Comp., vol. 56, no. 193, pp. 281–296, 1991.
[37] E. Feig and S. Winograd, "On the multiplicative complexity of discrete cosine transforms," IEEE Trans. Info. Theory, vol. 38, no. 4, pp. 1387–1391, 1992.
[38] J. Astola and D. Akopian, "Architecture-oriented regular algorithms for discrete sine and cosine transforms," IEEE Trans. Signal Processing, vol. 47, no. 4, pp. 1109–1124, 1999.
[39] Z. Guo, B. Shi, and N. Wang, "Two new algorithms based on product system for discrete cosine transform," Signal Processing, vol. 81, pp. 1899–1908, 2001.
[40] G. Plonka and M. Tasche, "Fast and numerically stable algorithms for discrete cosine transforms," Lin. Algebra and its Appl., vol. 394, pp. 309–345, 2005.
[41] M. Frigo, "A fast Fourier transform compiler," in Proc. ACM SIGPLAN'99 Conference on Programming Language Design and Implementation (PLDI), vol. 34, (Atlanta, Georgia), pp. 169–180, ACM, May 1999.
[42] Y. Arai, T. Agui, and M. Nakajima, "A fast DCT-SQ scheme for images," Trans. IEICE, vol. 71, no. 11, pp. 1095–1097, 1988.
[43] W. B. Pennebaker and J. L. Mitchell, JPEG Still Image Data Compression Standard. New York: Van Nostrand Reinhold, 1993.
[44] P. N. Swarztrauber, "Vectorizing the FFTs," in Parallel Computations (G. Rodrigue, ed.), pp. 51–83, New York: Academic Press, 1982.
[45] W. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling, Numerical Recipes in C: The Art of Scientific Computing. New York, NY: Cambridge Univ. Press, 2nd ed., 1992.
[46] C. van Loan, Computational Frameworks for the Fast Fourier Transform. Philadelphia: SIAM, 1992.
[47] K. R. Rao and P. Yip, Discrete Cosine Transform: Algorithms, Advantages, Applications. Boston, MA: Academic Press, 1990.
[48] H. V. Sorensen, D. L. Jones, M. T. Heideman, and C. S. Burrus, "Real-valued fast Fourier transform algorithms," IEEE Trans. Acoust., Speech, Signal Processing, vol. 35, no. 6, pp. 849–863, 1987.
[49] W. M. Gentleman and G. Sande, "Fast Fourier transforms—for fun and profit," Proc. AFIPS, vol. 29, pp. 563–578, 1966.
[50] J. C. Schatzman, "Accuracy of the discrete Fourier transform and the fast Fourier transform," SIAM J. Scientific Computing, vol. 17, no. 5, pp. 1150–1166, 1996.
[51] M. Tasche and H. Zeuner, Handbook of Analytic-Computational Methods in Applied Mathematics, ch. 8, pp. 357–406. Boca Raton, FL: CRC Press, 2000.
[52] R. E. Crochiere and A. V. Oppenheim, "Analysis of linear digital networks," Proc. IEEE, vol. 63, no. 4, pp. 581–595, 1975.
[53] X. Shao and S. G. Johnson, "Type-IV DCT, DST, and MDCT algorithms with reduced numbers of arithmetic operations," Signal Processing, 2008. In press.