A NOVEL HIGH-THROUGHPUT
FFT ARCHITECTURE FOR WIRELESS COMMUNICATION SYSTEMS
A Thesis
Presented to the
Faculty of
San Diego State University
In Partial Fulfillment
of the Requirements for the Degree
Master of Science
in
Electrical Engineering
by
Nikhilesh Vinayak Bhagat
Spring 2016
Copyright © 2016 by Nikhilesh Vinayak Bhagat
DEDICATION
To Aai and Pappa.
ABSTRACT OF THE THESIS
A Novel High-Throughput FFT Architecture for Wireless Communication Systems
by
Nikhilesh Vinayak Bhagat
Master of Science in Electrical Engineering
San Diego State University, 2016
The design of the physical layer (PHY) of the Long Term Evolution (LTE) standard is heavily influenced by the requirements for higher data transmission rates, greater spectral efficiency, and wider channel bandwidths. To fulfill these requirements, orthogonal frequency division multiplexing (OFDM) was selected as the modulation scheme at the PHY layer. The discrete Fourier transform (DFT) and the inverse discrete Fourier transform (IDFT) are fundamental building blocks of an OFDM system, and the fast Fourier transform (FFT) is an efficient implementation of the DFT. This thesis focuses on a novel high-throughput hardware architecture for the FFT computation used in wireless communication systems, particularly in the LTE standard. We implement a fully-pipelined FFT architecture that requires fewer computations. In particular, we discuss a novel approach to implementing the FFT using the combined Good-Thomas and Winograd algorithms. It is found that the combined Good-Thomas and Winograd FFT algorithm provides a significantly more efficient FFT solution for a wide range of applications. A detailed analysis and comparison of different FFT algorithms and potential architectures suitable for the requirements of the LTE standard is presented. Theoretical results have been validated by the implementation of the proposed approach on a field-programmable gate array (FPGA). As demonstrated by the mathematical analysis, a significant reduction has been achieved in all the design parameters, such as the computational delay and the number of arithmetic operations, as compared to conventional FFT architectures currently used in various wireless communication standards. It is concluded that the proposed algorithm and its hardware architecture can be efficiently used as an enhanced alternative in LTE wireless communication systems.
TABLE OF CONTENTS

ABSTRACT
LIST OF TABLES
LIST OF FIGURES
ACKNOWLEDGMENTS

CHAPTER
1 INTRODUCTION
   1.1 Review of the FFT Algorithms
   1.2 Motivation
   1.3 Contribution of Thesis
   1.4 Organization of Thesis
2 FAST FOURIER TRANSFORM ALGORITHMS
   2.1 Mapping to Two Dimensions
   2.2 The Cooley-Tukey FFT Algorithm
      2.2.1 Workload Computation
   2.3 Radix-2 Cooley-Tukey FFT
      2.3.1 Architecture of the Radix-2 FFT
      2.3.2 Workload Computation
   2.4 Radix-4 Cooley-Tukey FFT
      2.4.1 Architecture of Radix-4 FFT
      2.4.2 Workload Computation
   2.5 The Good-Thomas Prime-Factor Algorithm
      2.5.1 Workload Computation
      2.5.2 Comparison and Summary of the FFT Algorithms
3 FAST FOURIER TRANSFORMS VIA CONVOLUTION
   3.1 Rader's Algorithm
      3.1.1 Workload Computation
   3.2 Winograd Short Convolution
   3.3 Winograd Fourier Transform Algorithm
   3.4 Summary
4 GOOD-THOMAS AND WINOGRAD PRIME-FACTOR FFT ALGORITHMS
   4.1 Introduction
   4.2 Data Format
   4.3 The Winograd FFT Modules
   4.4 The Prime-Factor FFT Algorithm
   4.5 Architecture
   4.6 Hardware Design
      4.6.1 Design Considerations
      4.6.2 Parallel Processing Architecture
      4.6.3 Matlab Design
      4.6.4 Verilog Design
   4.7 Testing
   4.8 Hardware Cost and Implementation Results
      4.8.1 Latency and Throughput
   4.9 Analysis and Comparison
5 APPLICATION OF FFT IN WIRELESS COMMUNICATION SYSTEMS
   5.1 Overview
   5.2 OFDM Technique
   5.3 LTE Physical Layer
      5.3.1 Generic Frame Structure
      5.3.2 LTE Parameters
6 CONCLUSION
   6.1 Future Work
      6.1.1 VLSI Layout
BIBLIOGRAPHY
LIST OF TABLES

Table 2.1. Cooley-Tukey multiplications compared to one-dimensional DFT calculation
Table 2.2. Comparison of the radix-2 and radix-4 algorithms
Table 2.3. Good-Thomas products savings ratio with respect to direct DFT calculation
Table 2.4. Comparison of the Cooley-Tukey, Radix-2, Radix-4, and Good-Thomas FFT algorithms
Table 3.1. Determination of N_k(x)
Table 3.2. Multiplier coefficients for the 3-point WFT
Table 3.3. Multiplier coefficients for the 5-point WFT
Table 3.4. Multiplier coefficients for the 7-point WFT
Table 3.5. Multiplier coefficients for the 8-point WFT
Table 3.6. Multiplier coefficients for the 9-point WFT
Table 3.7. Multiplier coefficients for the 16-point WFT
Table 3.8. Computational requirements of Radix-2 and Winograd FFT algorithms
Table 4.1. Resource utilization of the Winograd FFT algorithms
Table 4.2. Resource utilization of the combined Good-Thomas and Winograd FFT algorithm
Table 4.3. Performance of the Winograd FFT algorithms
Table 4.4. Performance of the combined Good-Thomas and Winograd FFT algorithms
Table 4.5. Resource utilization comparison with other fixed-point FFT processors
Table 4.6. Performance comparison with other fixed-point FFT processors
Table 5.1. FFT sizes and other physical parameters used in the current LTE standard
Table 5.2. Available resource blocks and occupied sub-carriers
Table 5.3. Minimum required FFT lengths and other physical parameters for LTE
Table 5.4. Possible FFT lengths using the Good-Thomas algorithm and other physical parameters for LTE
Table 5.5. Maximum possible FFT lengths using the Good-Thomas algorithm and other physical parameters for LTE
LIST OF FIGURES

Figure 2.1. The two-dimensional mapping of a 15-point Cooley-Tukey algorithm.
Figure 2.2. Radix-2 butterfly architecture.
Figure 2.3. Radix-2 N-point flowgraph.
Figure 2.4. Radix-2 8-point flowgraph.
Figure 2.5. The top-level architecture of the Radix-2 FFT algorithm.
Figure 2.6. The Radix-2 butterfly architecture.
Figure 2.7. The shuffling unit.
Figure 2.8. Radix-4 butterfly structure.
Figure 2.9. Radix-4 FFT architecture.
Figure 2.10. The simplified Radix-4 dragonfly architecture.
Figure 2.11. The block diagram of the Good-Thomas mapping for a 15-point Fourier transform.
Figure 3.1. Rader's algorithm.
Figure 3.2. The SFG for the 2-point Winograd Fourier transform algorithm.
Figure 3.3. The SFG for the 3-point Winograd Fourier transform algorithm.
Figure 3.4. The SFG for the 4-point Winograd Fourier transform algorithm.
Figure 3.5. The SFG for the 5-point Winograd Fourier transform algorithm.
Figure 3.6. The SFG for the 7-point Winograd Fourier transform algorithm.
Figure 3.7. The SFG for the 8-point Winograd Fourier transform algorithm.
Figure 3.8. The SFG for the 9-point Winograd Fourier transform algorithm.
Figure 3.9. The SFG for the 16-point Winograd Fourier transform algorithm.
Figure 3.10. Required number of multiplications for different FFT algorithms.
Figure 3.11. Required number of additions for different FFT algorithms.
Figure 4.1. The IO interface of the FFT processor.
Figure 4.2. The top-level block diagram of the WFT algorithm.
Figure 4.3. The Good-Thomas FFT architecture.
Figure 4.4. The FSM for the combined Good-Thomas and Winograd FFT algorithm.
Figure 4.5. The design flow.
Figure 4.6. FFT architecture.
Figure 4.7. Temporal parallelism of sub-tasks.
Figure 4.8. Flowchart of the operations.
Figure 4.9. The datapath of the CRT input mapping.
Figure 4.10. The top-level organization of the N × WL block RAM, where N = N1 × N2.
Figure 4.11. The datapath of the RCM output mapping.
Figure 4.12. Fourier transform output of the shifted impulse input.
Figure 4.13. Absolute error plots of fixed-point and floating-point Fourier transforms.
Figure 5.1. OFDM architecture.
Figure 5.2. OFDM sub-carrier spacing.
Figure 5.3. LTE symbol and frame structure.
ACKNOWLEDGMENTS
This thesis owes its existence to the help, support, and inspiration of several people. I am grateful to God for the good health and well-being that were necessary to complete my research. Firstly, I would like to gratefully and sincerely thank Dr. Amir Alimohammad and Dr. fred harris for their guidance, understanding, patience, and most importantly, their mentoring during my graduate research at San Diego State University. Their mentorship was paramount in providing a well-rounded experience consistent with my long-term career goals. They have encouraged me to grow not only as an experimentalist and a technologist but also as an instructor and an independent thinker. I am not sure many graduate students are given the opportunity to develop their own individuality and self-sufficiency by being allowed to work with such independence. For everything you've done for me, I thank you. Prof. Amir, my sincere gratitude for your continuous support and encouragement. Your excellent input to the research and great expertise in these fields helped me achieve better results. You have supported me academically by providing me with all the necessary facilities for the research. During the most difficult times of my thesis, you gave me the moral support and the freedom I needed to move on. I could not have imagined having a better advisor and mentor for my research. I am extremely thankful and indebted to Dr. fred harris, who has been a constant source of encouragement and enthusiasm. It is a great honor to have him as one of my thesis advisors. Many thanks to Dr. Marie A. Roch; I am grateful to her for her very valuable comments on this thesis. I wish to express my sincere gratitude for your valuable time and for readily accepting to be my thesis committee member.
I would also like to thank all of the members of the VLSI research group, especially Fang, Jai, Abhinaya, and Vamsi, for the stimulating discussions, for the sleepless nights we were working together before deadlines, and for all the fun we have had in the last two years. I thank my friends Ameya, Mihir, Kaushik, Akshay B., Akshay D., Rohit, Amruta, Payal, Bela, Prasad, Disha, Kanak, and Pratish for some much-needed humor and entertainment. I will never forget the chats and beautiful moments I shared with them. They were fundamental in supporting me during these stressful and difficult moments. A special thanks goes to Nikita for always inspiring and mentoring me, and for always being there for me. My little sister, no matter where you are around the world, you are always with me. I am very grateful to all the people I have met along the way who have contributed to the development of my research. In particular, I would like to show my gratitude to the National Science Foundation for funding this work. Finally, my deepest gratitude goes to my parents Sangita and Vinayak for their unflagging love and unconditional support throughout my life and my studies. You made me live the most unique, magical, and carefree childhood that has made me who I am now. And most importantly, I would like to thank my girlfriend Snehal. Her support, encouragement, quiet patience, and unwavering love were undeniably the bedrock upon which the past few years of my life have been built. Her tolerance of my occasional vulgar moods is a testament in itself to her unyielding devotion and love.
CHAPTER 1 INTRODUCTION
The DFT plays a key role in a wide range of signal processing applications because it can be used as a mathematical tool to describe the relationship between the time-domain and frequency-domain representations of discrete signals. The DFT X[k] of a sequence x[n] of N terms is defined as:

X[k] = \sum_{n=0}^{N-1} x[n] \omega^{nk},    (1.1)

where the sequence x[n] is viewed as N consecutive samples, x[nT], of a continuous signal x(t), and \omega = e^{-j2\pi/N} is the twiddle factor. Similar to the DFT, there exists an inverse DFT (IDFT), which maps a frequency-domain sequence to its corresponding time-domain sequence. The IDFT is useful for circular convolution: the inverse DFT of the product of the DFTs of two time sequences corresponds to circularly convolving the two time sequences.
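Equation (1.1) can be computed directly in a few lines (an illustrative software model, not part of the thesis's hardware design; the function name `dft` is ours):

```python
import cmath

def dft(x):
    """Direct DFT of Equation (1.1): X[k] = sum_n x[n] * w**(n*k),
    with twiddle factor w = exp(-j*2*pi/N).  Costs O(N^2) operations."""
    N = len(x)
    w = cmath.exp(-2j * cmath.pi / N)
    return [sum(x[n] * w ** (n * k) for n in range(N)) for k in range(N)]

# A shifted impulse x[n] = delta[n-1] transforms to X[k] = w**k.
X = dft([0, 1, 0, 0])
```

The O(N^2) cost of this direct evaluation is exactly what the FFT algorithms discussed next are designed to avoid.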
1.1 REVIEW OF THE FFT ALGORITHMS
Around 1805, Carl Friedrich Gauss invented a revolutionary technique for efficiently computing the coefficients of what is now called the discrete Fourier series [1]. Unfortunately, Gauss never published his work and it was lost for over one hundred years. In the early twentieth century, Carl Runge derived an algorithm similar to that of Gauss that could compute the coefficients for an input whose size is a power of two; it was later generalized to powers of three. Then, in 1965, James W. Cooley and John W. Tukey discovered an algorithm for the computation of the Fourier transform that required significantly fewer computations than direct computation using Equation (2.1) [2]. This algorithm became widely known as the fast Fourier transform (FFT). It is characterized by a butterfly structure, which is used recursively to efficiently implement the DFT. Using the Cooley-Tukey FFT algorithm, the computational load was reduced from O(N^2) arithmetic operations to O(N log_2 N) arithmetic operations. "The development of the FFT originally by Cooley and Tukey followed by various enhancements/modifications by other researchers has provided the incentive and the impetus for its rapid and widespread utilization in a number of diverse disciplines" [3]. Special versions of these algorithms are popularly known as the "Radix-2" and "Radix-4" FFTs.
In addition to the Cooley-Tukey approach, several algorithms, such as the prime-factor [4], split-radix [5], and Winograd Fourier transform algorithm (WFTA) [6], have been proposed over the last several decades. In 1960, I. J. Good [7] showed that the Thomas algorithm [8] could be generalized for an arbitrary number of prime factors. This came to be known as the Good-Thomas algorithm. The algorithm remained obscure until a method was found to efficiently compute the Fourier transforms of sequences whose length is a prime or a power of a prime [9]. In 1975, S. Winograd published a paper in which he announced a new algorithm for computing DFTs [6]. This algorithm became known as the Winograd Fourier transform (WFT) algorithm. The WFT algorithm is optimal from the standpoint of minimizing the number of multiplications required. The WFT algorithm combined with the prime-factor algorithm (PFA) became collectively known as the Good-Thomas and Winograd prime-factor algorithm. The Good-Thomas algorithm promised a drastic reduction in the amount of time a computer or hardware requires to perform the DFT. However, this reduction is achieved at the cost of increased design complexity.
1.2 MOTIVATION
A common module in recent wireless communication standards, such as 3GPP-LTE, is the FFT and its inverse. Although many architectures exist for traditional power-of-two FFT lengths, non-power-of-two transform lengths may also be defined for the recent 3GPP LTE standards. The main motivation behind this work is to design and implement a high-performance FFT architecture for wireless communication systems. It has been observed that efficient hardware implementation of an FFT algorithm requires a relatively large number of multipliers. Many of the commonly used algorithms require a large number of multiplications by twiddle factors, which directly impacts the silicon area and power consumption. This research focuses on the development of a novel high-performance FFT algorithm that does not require any twiddle factor multiplications, and hence significantly reduces the computational complexity. To achieve this goal, our work is oriented towards the Good-Thomas and Winograd FFT algorithms. Unfortunately, the Good-Thomas and Winograd algorithms have received very little attention among VLSI designers compared to the Cooley-Tukey algorithms, perhaps due to the mathematical complexity of these techniques.
1.3 CONTRIBUTION OF THESIS
While it is unlikely that the completion of this thesis will trigger a revolution similar to that which followed the publication of the Cooley and Tukey paper, it is hoped that this thesis will demonstrate the effectiveness of the Good-Thomas and Winograd FFT algorithms. In FFT designs for LTE systems, research is concentrated on variable-length and low-power FFT architectures. Effort is also being directed towards finding a universal computation structure for FFTs, one that can easily compute other transforms, such as the discrete cosine transform (DCT) and the discrete Hartley transform (DHT), with a single computation block. This work presents the analysis and design of various FFT algorithms suitable for wireless communication systems. A major contribution of this thesis is to show that the combined Good-Thomas and Winograd FFT is a preferred alternative for practical cases in LTE, as compared to various other widely used FFT algorithms. It also identifies and demonstrates various critical design issues which need to be considered while designing FFT blocks for LTE. The theoretical analysis is supported with FPGA implementation results for the proposed architectures. It is shown that the combined Good-Thomas and Winograd FFT algorithm provides several advantages over other conventional algorithms when used in reconfigurable systems, although this comes with some trade-offs.
1.4 ORGANIZATION OF THESIS This thesis is organized as follows:
• Chapter 2 provides detailed information on the implementation and complexity of the above-discussed FFT algorithms. A comparison based on the computational complexity of these algorithms is presented and discussed. Next, the software implementations of these algorithms are presented and their performance is evaluated in terms of timing, memory, and complexity.
• Chapter 3 discusses different Fourier transform algorithms that can be constructed using the convolution property.
• Chapter 4 presents the hardware implementation of the Good-Thomas and Winograd algorithms and their hardware architectures. The design of an FFT processor is discussed. Different abstraction levels, block diagrams, trade-offs, and simulation results are also discussed.
• Chapter 5 focuses on the LTE physical layer parameters. A brief discussion of the LTE standard and the application of our combined Good-Thomas and Winograd FFT architecture in LTE wireless communication systems is presented.
• Chapter 6 provides some concluding remarks and presents a few suggestions for future work.
CHAPTER 2 FAST FOURIER TRANSFORM ALGORITHMS
There are two basic strategies for computing the discrete Fourier transform. One strategy is to change a one-dimensional Fourier transform into a two-dimensional Fourier transform, which is more efficient to compute. The second strategy is to change a one-dimensional Fourier transform into a small convolution, which can be computed by using the techniques described in Chapter 3. These strategies can be used effectively to minimize the computational complexity of the FFT calculation. This chapter discusses the Cooley-Tukey and Good-Thomas FFT algorithms and compares their computational complexities. Also, the Winograd Fourier transform algorithm (WFTA) and the Rader algorithm are discussed in detail. A peculiar feature of the WFTA and Rader algorithm is that the transform is based on convolution. As shown, this reduces the number of multiplications required for FFT computation significantly. Formal derivations of the algorithms used in Radix-2 and the Winograd FFTs will be presented.
2.1 MAPPINGTO TWO DIMENSIONS The goal of index mapping is to convert a single complex problem into multiple simple ones so that the number of multiplications and additions are reduced. This can be achieved by mapping the one-dimensional DFT into a two-dimensional transform. Further computational savings are achieved if this process is carried on iteratively to higher dimensions. The mapping to a multi-dimensional transform results in a significantly smaller number of data transfers and arithmetic operations relative to those required for the direct implementation of the DFT. If the length N of DFT is not prime, N can be factored as
N = N1 × N2. The two new independent variables are defined as n1 = 0, 1, 2, ..., N1 − 1 and n2 = 0, 1, 2, ..., N2 − 1, where the linear equation that maps n1 and n2 onto n is given by n = (K1 n1 + K2 n2) mod N, and where K1 and K2 can be defined using the following two classes of index mapping:
• Common factor mapping (CFM), where K1 = aN2 or K2 = bN1 but not both.
• Prime factor mapping (PFM), where K1 = aN2 and K2 = bN1. The choice of "and/or" creates the difference between the two mappings. Note that the PFM can be used only if the factors are relatively prime, while the CFM can be used for any factors.
The one-dimensional DFT is expressed as:
X[k] = \sum_{n=0}^{N-1} x[n] W_N^{nk},    (2.1)
where n = 0, 1, ..., N − 1, k = 0, 1, ..., N − 1, and W_N = e^{-j2\pi/N}. The objective is to convert this equation into a two-dimensional form by applying substitution mapping techniques on the indices, which results in the following two-dimensional representation:
X_{k1,k2} = \sum_{n1=0}^{N1-1} \sum_{n2=0}^{N2-1} W^{K1 K3 n1 k1} W^{K1 K4 n1 k2} W^{K2 K3 n2 k1} W^{K2 K4 n2 k2} x[n1, n2],    (2.2)

where K3 and K4 define the corresponding output index mapping, k = (K3 k1 + K4 k2) mod N. The Cooley-Tukey FFT algorithm relies on the CFM, while the PFM is used by the Good-Thomas FFT algorithm.
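The two mapping classes can be illustrated concretely for N = 15 = 3 × 5 (a small sketch; the helper name `index_map` is ours and not part of the thesis). Both maps cover every index exactly once because gcd(3, 5) = 1:

```python
def index_map(K1, K2, N1, N2):
    """Evaluate n = (K1*n1 + K2*n2) mod N for all (n1, n2) pairs."""
    N = N1 * N2
    return {(n1, n2): (K1 * n1 + K2 * n2) % N
            for n1 in range(N1) for n2 in range(N2)}

N1, N2 = 3, 5
cfm = index_map(1, N1, N1, N2)   # common-factor map: K2 = b*N1 only
pfm = index_map(N2, N1, N1, N2)  # prime-factor map: K1 = a*N2 and K2 = b*N1

# Each mapping visits every index 0..14 exactly once (a bijection).
assert sorted(cfm.values()) == list(range(15))
assert sorted(pfm.values()) == list(range(15))
```

The PFM here is the Chinese-remainder-style map n = (5n1 + 3n2) mod 15 used by the Good-Thomas algorithm; it is a bijection only because the two factors are relatively prime.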
2.2 THE COOLEY-TUKEY FFTALGORITHM This section investigates the particular properties that result from using the CFM to calculate the DFT in an efficient way. Algorithms of this kind are known collectively as fast Fourier transform (FFT) algorithms. To derive the general form of the Cooley-Tukey FFT algorithm, suppose that the length N of the transform is composite, i.e., N = N1 × N2, where
N1 and N2 are integer factors of N. Next, the index mapping is defined as n = n1 + n2 N1, where n1 = 0, 1, ..., N1 − 1, n2 = 0, 1, ..., N2 − 1, k = k2 + k1 N2, k1 = 0, 1, ..., N1 − 1, and k2 = 0, 1, ..., N2 − 1. These index mappings are evaluated modulo N, but this is not written explicitly since n does not exceed N. By substituting the indices, the DFT in (2.2) can be represented in two dimensions as:
X_{k1,k2} = \sum_{n1=0}^{N1-1} \sum_{n2=0}^{N2-1} W_{N1}^{n1 k1} W_N^{n1 k2} W_{N2}^{n2 k2} x[n1, n2].    (2.3)
Fig. 2.1 shows a fifteen-point transform in two dimensions with N1 = 3 and N2 = 5. For instance, suppose one needs to map the index n = 4 from the one-dimensional array to the two-dimensional array. The equation n = 4 = n1 + 3n2 is satisfied by n1 = 1 and n2 = 1, which are the coordinates indicated in Fig. 2.1. Using this indexing scheme, the input and output data vectors can be mapped into two-dimensional arrays. Note that the components of the transform X[k] are arranged differently than the components of the signal x[n]. This is also known as address shuffling. The first step of the Cooley-Tukey algorithm derivation is to substitute the indices in
(2.1) along with the relation N = N1 × N2 to obtain the following equation:

X_{k1,k2} = \sum_{n1=0}^{N1-1} \sum_{n2=0}^{N2-1} W_{N1 N2}^{(n1 + n2 N1)(k2 + k1 N2)} x[n1, n2],    (2.4)
Figure 2.1. The two-dimensional mapping of a 15-point Cooley-Tukey algorithm.
where n1 = 0, 1,...,N1 − 1, n2 = 0, 1,...,N2 − 1, k1 = 0, 1,...,N1 − 1, and k2 = 0, 1,...,N2 − 1. The twiddle factor W can be expanded as:
W_{N1 N2}^{(n1 + n2 N1)(k2 + k1 N2)} = W_{N1 N2}^{n1 k2} W_{N1 N2}^{n1 k1 N2} W_{N1 N2}^{n2 k2 N1} W_{N1 N2}^{n2 k1 N1 N2}.    (2.5)
However, W_{N1 N2}^{n2 k1 N1 N2} = e^{-j2\pi n2 k1} = 1, W_{N1 N2}^{n1 k1 N2} = e^{-j2\pi N2 n1 k1 / (N1 N2)} = W_{N1}^{n1 k1}, and W_{N1 N2}^{n2 k2 N1} = W_{N2}^{n2 k2}. Therefore, the two-dimensional Cooley-Tukey Fourier transform equation is given by:

X_{k1,k2} = \sum_{n1=0}^{N1-1} \sum_{n2=0}^{N2-1} W_{N1}^{n1 k1} W_N^{n1 k2} W_{N2}^{n2 k2} x[n1, n2].    (2.6)
The transform is first applied to the n1 dimension, i.e., the columns, and the result is then multiplied by the "twiddle factors". The twiddle factor, being a function of n1 and n2, prevents the two one-dimensional transforms from being independent of each other.
The Cooley-Tukey transform can be summarized as follows:
(i) Map the indices into two dimensions n1 × n2.
(ii) Transform the columns k1 × n2.
(iii) Multiply by the twiddle factors W^{n1 n2}.
(iv) Transform the rows k1 × k2.
(v) Map the indices into one dimension k.
It can be observed that the one-dimensional input sequence n is mapped into n1 × n2 using the CFM mentioned earlier. We then perform the column transformations using N1-point FFTs and multiply the resulting two-dimensional array with the twiddle factors. An N2-point FFT is then performed on each row to form the k1 × k2 array. The resulting two-dimensional array is then unloaded column-wise to form the one-dimensional transform k. The procedure and pseudo-code for the Cooley-Tukey FFT algorithm are described in Algorithm 2.1.
2.2.1 Workload Computation
The computation of the Cooley-Tukey FFT can be seen as a mapping of a one-dimensional signal-domain array into a two-dimensional transform-domain array. To count the number of multiplications and additions, it is assumed that the row and column transforms are computed using DFTs. The computation consists of an N1-point DFT on each column, followed by element-by-element complex multiplications of the entire new array by W_N^{n1 k2}, followed by an N2-point DFT on each row. This results in N2 column transforms, each requiring approximately N1^2 multiplications and additions, and N1 row transforms, each requiring N2^2 multiplications and additions. The workload count must also include the twiddle factor multiplications. Hence,
• the number of multiplications is N2 N1^2 + N1 N2^2 + N1 N2 = N1 N2 (N1 + N2 + 1);
• the number of additions is N2 N1^2 + N1 N2^2 = N1 N2 (N1 + N2).
For the direct one-dimensional approach, the number of additions is equal to the number of multiplications, which is N^2 = (N1 N2)^2. The computational saving in terms of the required multiplications can be written as:

R = Multiplications(Cooley-Tukey) / Multiplications(Direct) = (N1 + N2 + 1) / (N1 N2).

Table 2.1 shows some examples of the savings ratio for different FFT lengths. By using the Cooley-Tukey algorithm, a transform of length N can be decomposed into a form requiring fewer complex multiplications. It can be observed that the Cooley-Tukey FFT algorithm provides a better approach for Fourier transform computation compared to the direct computation. Note that the computational saving is not directly proportional to the actual reduction in program execution time, since it does not take the addressing overhead into account; it merely represents the reduction in arithmetic computation.
Algorithm 2.1 Cooley-Tukey algorithm

procedure INPUT MAPPING
    for n1 = 0 to N1 − 1 do
        for n2 = 0 to N2 − 1 do
            a(n1, n2) = x(n1 + n2 × N1)
        end for
    end for
end procedure

procedure COLUMN TRANSFORMATION
    for n2 = 0 to N2 − 1 do
        Select all the elements of column n2
        Perform an N1-point Fourier transform over the N1 elements of column n2
    end for
    The result is a two-dimensional array holding the Fourier transforms of all N2 columns
end procedure

procedure MULTIPLICATION WITH TWIDDLE FACTORS
    The twiddle factors W_N^{nk} are calculated from the index values n and k
    for i = 0 to N1 − 1 do
        for j = 0 to N2 − 1 do
            a(i, j) = a(i, j) × W_N^{ij}
        end for
    end for
end procedure

procedure ROW TRANSFORMATION
    for n1 = 0 to N1 − 1 do
        Select all the elements of row n1
        Perform an N2-point Fourier transform over the N2 elements of row n1
    end for
    The result is a two-dimensional array holding the Fourier transforms of all N1 rows
end procedure

procedure OUTPUT MAPPING
    k = 0
    for k1 = 0 to N1 − 1 do
        for k2 = 0 to N2 − 1 do
            X(k) = a(k1, k2)
            k = k + 1
        end for
    end for
end procedure
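The two-dimensional Cooley-Tukey procedure can be sketched in software as follows (an illustrative model with our own function names, not the thesis's hardware implementation; the N2-point transforms are applied first here, which is an equivalent ordering of the same factorization):

```python
import cmath

def dft(x):
    """Direct O(N^2) DFT used for the short row/column transforms."""
    N = len(x)
    w = cmath.exp(-2j * cmath.pi / N)
    return [sum(x[n] * w ** (n * k) for n in range(N)) for k in range(N)]

def cooley_tukey(x, N1, N2):
    """N = N1*N2 DFT via the common-factor maps n = n1 + N1*n2
    and k = k2 + N2*k1 of Equation (2.6)."""
    N = N1 * N2
    W = cmath.exp(-2j * cmath.pi / N)
    # Input mapping into an N1-by-N2 array: a[n1][n2] = x[n1 + N1*n2].
    a = [[x[n1 + N1 * n2] for n2 in range(N2)] for n1 in range(N1)]
    # N2-point transforms along n2, then the twiddle factors W_N^(n1*k2).
    b = [[t * W ** (n1 * k2) for k2, t in enumerate(dft(row))]
         for n1, row in enumerate(a)]
    # N1-point transforms along n1.
    c = [dft([b[n1][k2] for n1 in range(N1)]) for k2 in range(N2)]
    # Output mapping back to one dimension: k = k2 + N2*k1.
    return [c[k2][k1] for k1 in range(N1) for k2 in range(N2)]
```

For a 15-point input with N1 = 3 and N2 = 5, `cooley_tukey(x, 3, 5)` agrees with the direct `dft(x)` to within floating-point rounding.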
Table 2.1. Cooley-Tukey multiplications compared to one-dimensional DFT calculation

   N      N1    N2    R
   6      3     2     1.0
   20     5     4     0.5
   120    15    8     0.20
   250    25    10    0.14
   1008   16    63    0.08
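The ratio R is straightforward to tabulate (a quick sketch with our own helper name; note that 1008 factors as 16 × 63):

```python
def savings_ratio(N1, N2):
    """R = (N1 + N2 + 1) / (N1*N2): the Cooley-Tukey multiplication
    count relative to the direct N^2 one-dimensional DFT."""
    return (N1 + N2 + 1) / (N1 * N2)

# Reproduce the rows of Table 2.1 (N = 1008 factored as 16 x 63).
for N1, N2 in [(3, 2), (5, 4), (15, 8), (25, 10), (16, 63)]:
    print(f"N={N1*N2:5d}  N1={N1:3d}  N2={N2:3d}  R={savings_ratio(N1, N2):.2f}")
```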
The number of computations required for the Cooley-Tukey algorithm has been further reduced by the introduction of the radix algorithms [3]. The Radix-2 and Radix-4 algorithms, which are based on the Cooley-Tukey algorithm, are among the most widely used algorithms and are discussed in the subsequent sections.
2.3 RADIX-2 COOLEY-TUKEY FFT
The direct computation of an N-point DFT requires nearly O(N^2) complex arithmetic operations (i.e., multiplications and additions). This complexity is significantly reduced by the Cooley-Tukey FFT, which requires O(N log N) complex arithmetic operations. Undoubtedly, the Cooley-Tukey FFT is one of the most widely used FFT algorithms. Many applications of the Cooley-Tukey FFT use a blocklength N that is a power of two or of four. The blocklength 2^m can be factored either as 2^{m-1} × 2 or as 2 × 2^{m-1}, which is called a Radix-2 Cooley-Tukey FFT. Similarly, in the Radix-4 Cooley-Tukey FFT, the blocklength 4^m is factored either as 4^{m-1} × 4 or as 4 × 4^{m-1}. In addition to the radix-2 and radix-4 algorithms, several other FFT algorithms are available, such as the radix-8, mixed-radix, split-radix, vector-radix, and vector-split-radix algorithms [10]. However, we will only discuss the popular and widely preferred radix-2 and radix-4 FFT algorithms in detail. In this section, we first discuss the decimation-in-time (DIT) and decimation-in-frequency (DIF) FFT algorithms. The detailed development will be based on radix-2 and will then be extended to other radices, such as Radix-4. The 2^m-point Cooley-Tukey algorithm with N1 = 2 and N2 = 2^{m-1} = N/2 is known as the Radix-2 DIT algorithm. Starting with Equation (2.6) we can write:
X_{k1,k2} = \sum_{n1=0}^{N1-1} \sum_{n2=0}^{N2-1} W_{N1}^{n1 k1} W_N^{n1 k2} W_{N2}^{n2 k2} x[n1, n2],    (2.7)

which represents the N-point Cooley-Tukey FFT. The first stage of the FFT can be defined by
choosing the values of N1 and N2. This is equivalent to splitting an N-point sequence into two N/2-point sequences, x_{2m} and x_{2m+1}, corresponding to the even and odd samples of x_n, respectively. By definition, n2 = 0, 1, ..., N/2 − 1 and k2 = 0, 1, ..., N/2 − 1. Also, n1 = 0, 1 and k1 = 0, 1.
Under these conditions, X_{k1,k2}, referred to as X_k and X_{k+N/2}, becomes:
X_k = \sum_{n2=0}^{N/2-1} W_N^{2 n2 k2} x[2n2] + W_N^{k2} \sum_{n2=0}^{N/2-1} W_N^{2 n2 k2} x[2n2 + 1].    (2.8)
Since W_N^{N/2} = −1, we can write:
X_{k+N/2} = \sum_{n2=0}^{N/2-1} W_N^{2 n2 k2} x[2n2] − W_N^{k2} \sum_{n2=0}^{N/2-1} W_N^{2 n2 k2} x[2n2 + 1].    (2.9)

Equations (2.8) and (2.9) can be written as follows:
X_k = A(k) + W_N^k B(k),    (2.10)

X_{k+N/2} = A(k) − W_N^k B(k).    (2.11)

This is illustrated by the DIT butterfly structure in Fig. 2.2. The basic processing element (PE) of the FFT operation is known as the butterfly. The butterfly structure consists of an addition, a subtraction, and a complex multiplication by the twiddle factor. The DIT and DIF algorithms differ in structure and in the sequence of the computations; however, the number of operations in both algorithms is the same. The performance of both the DIT and DIF FFTs is the same, but a user may prefer one of them because of implementation considerations. The DIT or DIF algorithm is applied recursively: at each level, an N-point Fourier transform is replaced by two N/2-point
Figure 2.2. Radix-2 butterfly architecture.
Figure 2.3. Radix-2 N-point flowgraph.
Fourier transforms. For larger values of N, the flow graph appears as in Fig. 2.3, where the blocks are replaced similarly until only two-point transforms remain. The entire signal flow graph of an 8-point DIT radix-2 FFT is shown in Fig. 2.4. It can be seen that the signal flow has 3 stages, and each stage consists of 2-point butterfly structures. The pseudo-code for the radix-2 DIT algorithm is described in Algorithm 2.2:
2.3.1 Architecture of the Radix-2 FFT A block diagram of the N-point radix-2 FFT architecture implemented in [11] is shown in Fig. 2.5. The circuit consists of the FFT processor, which performs the operations required for the computation of the FFT (butterfly and shuffling units), and two adaptation units: the input mapping unit, which maps the input stream of samples onto two separate streams, and the output mapping unit, which maps the two output streams of the FFT processor onto one stream of samples in natural order. The FFT processor consists of n butterfly units and (n - 1) shuffling units, where n = log2(N). The two inputs of the butterfly unit of a stage are generated from the output of the butterfly unit of the previous stage at different time points. These inputs are obtained from either the upper or the lower output of the butterfly unit of the previous stage. A shuffling unit is inserted between two successive butterfly units to route these outputs to the corresponding inputs.
Figure 2.4. Radix-2 8-point flowgraph.
Figure 2.5. The top-level architecture of the Radix-2 FFT algorithm.
2.3.1.1 INPUT AND OUTPUT MAPPING The input mapping unit is needed so that the input data stream can be separated into even and odd samples. The input mapping unit can be a de-multiplexer that separates the input data stream. The output mapping unit maps the output sequence of the last butterfly unit onto one stream of samples in natural order. The reverse operation for the formation of the output stream of samples is performed by a multiplexer.
2.3.1.2 BUTTERFLY STRUCTURE The term "butterfly" appears in the context of the Cooley-Tukey FFT algorithm, which recursively breaks down a DFT of composite size n into r smaller transforms of size m, where r is the "radix" of the transform. These smaller DFTs are then combined via size-r butterflies, which themselves are DFTs of size r pre-multiplied by roots of unity, known as twiddle factors. The datapath of the radix-2 butterfly can be constructed from the signal flow graph shown in Fig. 2.2. Let X_n = a + ib, X_{n+N/2} = c + id, and the twiddle factor W_N = cos θ - i sin θ, where a, b, c, d, cos θ, sin θ are real numbers. Thus, the output is given by
X_k = (a + c cos θ + d sin θ) + i(b + d cos θ - c sin θ) and
X_{k+N/2} = (a - c cos θ - d sin θ) + i(b - d cos θ + c sin θ). The real part and the imaginary
Algorithm 2.2 Radix-2 DIT algorithm
  x is the input vector of complex samples;
  Nin is the number of input samples;
  m is the number of stages, given by m = ceil(log2(Nin));
  N is the number of FFT points, given by N = 2^m;
  Append N - Nin zeros to the input vector as: x = [x, zeros(N - Nin)];
  The twiddle factor vector WN is calculated as:
      WN = cos(2π/N × [0 : N/2 - 1]) - j sin(2π/N × [0 : N/2 - 1]);
  The intermediate butterfly values are computed as follows:
  for each stage i ≤ m do
      For every stage calculate A(k) and B(k) for the even and odd samples;
      for k ≤ N/2 do
          A(k) = x(n + k + 1);
          B(k) = x(n + k + N/2 + 1) × WN(ik);
          X(k) = A(k) + B(k);
          X(k + N/2) = A(k) - B(k);
      end for
  end for
parts are all expressed as the addition of three real numbers. Thus, we can use 2-input adders and multipliers to implement the butterfly unit as shown in Fig. (2.6), where “R” is a register.
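The butterfly arithmetic above can be sketched with real-valued operations only, matching the adder/multiplier datapath of Fig. 2.6 (a minimal sketch; the function name and variables are illustrative, and the twiddle convention W = cos θ - i sin θ from the text is assumed):

```python
import math

def butterfly_real(a, b, c, d, theta):
    """Radix-2 butterfly using only real additions and multiplications.
    Inputs: X_n = a + ib, X_{n+N/2} = c + id, twiddle W = cos(theta) - i sin(theta).
    Returns (X_k, X_{k+N/2}) as (real, imag) pairs."""
    # W * (c + id) expanded into real arithmetic:
    tr = c * math.cos(theta) + d * math.sin(theta)   # real part of W * X_{n+N/2}
    ti = d * math.cos(theta) - c * math.sin(theta)   # imag part of W * X_{n+N/2}
    Xk = (a + tr, b + ti)        # X_k       = X_n + W * X_{n+N/2}
    Xk2 = (a - tr, b - ti)       # X_{k+N/2} = X_n - W * X_{n+N/2}
    return Xk, Xk2
```

Each output coordinate is the sum or difference of at most three real numbers, so the unit maps onto 2-input adders and four real multipliers, as in Fig. 2.6.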
2.3.1.3 SHUFFLING UNIT STRUCTURE The shuffling unit is implemented using the design shown in Fig. 2.7. This circuit consists of delay elements for the synchronization of the outputs of the previous butterfly unit, and two multiplexers that perform the routing of the outputs to the corresponding inputs of the next butterfly unit. When the control signal of the multiplexers takes the value 0, the upper output of the butterfly unit of stage s is routed to the upper input of the butterfly unit of stage s + 1, and the lower output of the butterfly unit of stage s is routed to the lower input of the butterfly unit of stage s + 1. When the control signal takes the value 1, the upper output is routed to the lower input, while the lower output of the butterfly unit of stage s is routed to the upper input of the butterfly unit of stage s + 1.
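The overall radix-2 DIT procedure of Algorithm 2.2 can be cross-checked with a short recursive Python sketch (Python is used here purely for illustration; names are not from the thesis), verified against a direct O(N^2) DFT:

```python
import cmath

def dft_direct(x):
    """Direct O(N^2) DFT, used as a reference."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N)
                for n in range(N)) for k in range(N)]

def fft_radix2_dit(x):
    """Recursive radix-2 DIT FFT; len(x) must be a power of two."""
    N = len(x)
    if N == 1:
        return x[:]
    A = fft_radix2_dit(x[0::2])          # even samples -> A(k)
    B = fft_radix2_dit(x[1::2])          # odd samples  -> B(k)
    X = [0j] * N
    for k in range(N // 2):
        t = cmath.exp(-2j * cmath.pi * k / N) * B[k]   # W_N^k B(k)
        X[k] = A[k] + t                   # Equation (2.10)
        X[k + N // 2] = A[k] - t          # Equation (2.11)
    return X
```

The recursion mirrors the flow graph of Fig. 2.3: each level splits into even and odd halves until 1-point transforms remain, then combines them with the butterfly of Equations (2.10) and (2.11).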
2.3.2 Workload Computation
The computation of an N-point DFT is replaced by that of two DFTs of length N/2, plus N additions and N/2 multiplications by the twiddle factors W_N^{k_2}. The same procedure can be applied repeatedly to replace the two DFTs of length N/2 by four DFTs of length N/4 at the cost
Figure 2.6. The Radix-2 butterfly architecture.
Figure 2.7. The shuffling unit.
of N additions and N/2 multiplications by the twiddle factors. Thus, a systematic application of this method computes a DFT of length 2^m in m = log2(N) stages, each stage converting 2^i DFTs of length 2^(m-i) into 2^(i+1) DFTs of length 2^(m-i-1) at the cost of N additions and N/2 multiplications by the twiddle factors. Consequently, the number of complex additions and complex multiplications required to compute a DFT of length N by the radix-2 FFT algorithm is:
• Number of multiplications: (N/2) log2(N)
• Number of additions: N log2(N)
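These counts can be confirmed by instrumenting a recursive radix-2 FFT (a sketch; the counters simply tally complex twiddle multiplications and complex additions/subtractions, including trivial multiplications by W^0, which the formulas also include):

```python
import cmath

def fft_count(x, counts):
    """Radix-2 DIT FFT that tallies complex operations in `counts`."""
    N = len(x)
    if N == 1:
        return x[:]
    A = fft_count(x[0::2], counts)
    B = fft_count(x[1::2], counts)
    X = [0j] * N
    for k in range(N // 2):
        t = cmath.exp(-2j * cmath.pi * k / N) * B[k]
        counts["mul"] += 1            # one twiddle multiplication
        X[k] = A[k] + t
        X[k + N // 2] = A[k] - t
        counts["add"] += 2            # one addition and one subtraction
    return X
```

For N = 1024 the tallies come out to (N/2) log2(N) = 5120 multiplications and N log2(N) = 10240 additions, matching the formulas above.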
2.4 RADIX-4 COOLEY-TUKEY FFT The radix-4 Cooley-Tukey FFT algorithm has also been widely adopted in various applications. It can be used when the blocklength N is a power of four; 4^m is then factored either as 4^(m-1) × 4 or as 4 × 4^(m-1). The radix-4 algorithm is derived in a similar manner to the radix-2 algorithm. The equations of this FFT can be obtained by setting N1 = 4 and N2 = N/4 in the general Equation (2.6) for the Cooley-Tukey FFT. In the radix-2 DIF FFT, the DFT equation is expressed as the summation of two calculations: one summation over the first half and one over the second half of the input sequence. Similarly, the radix-4 DIF FFT expresses the DFT equation as four summations, then divides it into four equations, each of which computes every fourth output sample. The radix-4 DIF can be written as:
\[
X_k = \sum_{n=0}^{N/4-1} x[n]\, W_N^{nk} + \sum_{n=N/4}^{N/2-1} x[n]\, W_N^{nk} + \sum_{n=N/2}^{3N/4-1} x[n]\, W_N^{nk} + \sum_{n=3N/4}^{N-1} x[n]\, W_N^{nk}. \qquad (2.12)
\]
Substituting n → n + N/4, n → n + N/2, and n → n + 3N/4 in the second, third, and fourth summations, respectively, Equation (2.12) can be re-written as:
\[
X_k = \sum_{n=0}^{N/4-1} x[n]\, W_N^{nk} + \sum_{n=0}^{N/4-1} x[n+\tfrac{N}{4}]\, W_N^{(n+N/4)k} + \sum_{n=0}^{N/4-1} x[n+\tfrac{N}{2}]\, W_N^{(n+N/2)k} + \sum_{n=0}^{N/4-1} x[n+\tfrac{3N}{4}]\, W_N^{(n+3N/4)k}, \qquad (2.13)
\]
which represents four N/4-point DFTs. Using \(W_N^{(N/4)k} = (-j)^k\), \(W_N^{(N/2)k} = (-1)^k\), and \(W_N^{(3N/4)k} = (j)^k\), Equation (2.13) can be written as:
\[
X_k = \sum_{n=0}^{N/4-1} \left[ x[n] + (-j)^k\, x[n+\tfrac{N}{4}] + (-1)^k\, x[n+\tfrac{N}{2}] + (j)^k\, x[n+\tfrac{3N}{4}] \right] W_N^{nk}. \qquad (2.14)
\]
Let \(W_N^4 = W_{N/4}\). Equation (2.14) can then be written as:
\[
X_{4k} = \sum_{n=0}^{N/4-1} \left[ x[n] + x[n+\tfrac{N}{4}] + x[n+\tfrac{N}{2}] + x[n+\tfrac{3N}{4}] \right] W_{N/4}^{nk}, \qquad (2.15)
\]
\[
X_{4k+1} = \sum_{n=0}^{N/4-1} \left[ x[n] - j\, x[n+\tfrac{N}{4}] - x[n+\tfrac{N}{2}] + j\, x[n+\tfrac{3N}{4}] \right] W_{N/4}^{nk}\, W_N^{n}, \qquad (2.16)
\]
\[
X_{4k+2} = \sum_{n=0}^{N/4-1} \left[ x[n] - x[n+\tfrac{N}{4}] + x[n+\tfrac{N}{2}] - x[n+\tfrac{3N}{4}] \right] W_{N/4}^{nk}\, W_N^{2n}, \qquad (2.17)
\]
\[
X_{4k+3} = \sum_{n=0}^{N/4-1} \left[ x[n] + j\, x[n+\tfrac{N}{4}] - x[n+\tfrac{N}{2}] - j\, x[n+\tfrac{3N}{4}] \right] W_{N/4}^{nk}\, W_N^{3n}, \qquad (2.18)
\]
for k = 0, 1, ..., (N/4) - 1. Equations (2.15) through (2.18) represent a decomposition process yielding four N/4-point DFTs. The butterfly of a radix-4 algorithm consists of four inputs and four outputs with intermediate adders and complex twiddle-factor multipliers. The signal flow graph of the radix-4 butterfly structure is shown in Fig. 2.8. The radix-4 FFT essentially combines two stages of the radix-2 FFT into one, so that half as many stages are required.
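Equations (2.15) through (2.18) can be checked numerically with a sketch of one radix-4 DIF stage (helper names are illustrative; the four returned sub-sequences are the bracketed terms with their twiddle factors, whose N/4-point DFTs give the four output sets):

```python
import cmath

def dft_direct(x):
    """Direct O(N^2) reference DFT."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N)
                for n in range(N)) for k in range(N)]

def radix4_stage(x):
    """One radix-4 DIF stage: returns four N/4-point sub-sequences
    whose DFTs give X[4k], X[4k+1], X[4k+2], X[4k+3]."""
    N = len(x)
    W = lambda e: cmath.exp(-2j * cmath.pi * e / N)
    q = N // 4
    a = x[0:q]; b = x[q:2*q]; c = x[2*q:3*q]; d = x[3*q:4*q]
    s0 = [a[n] + b[n] + c[n] + d[n] for n in range(q)]                        # (2.15)
    s1 = [(a[n] - 1j * b[n] - c[n] + 1j * d[n]) * W(n) for n in range(q)]     # (2.16)
    s2 = [(a[n] - b[n] + c[n] - d[n]) * W(2 * n) for n in range(q)]           # (2.17)
    s3 = [(a[n] + 1j * b[n] - c[n] - 1j * d[n]) * W(3 * n) for n in range(q)] # (2.18)
    return s0, s1, s2, s3
```

For a 16-point input, the k-th bin of the DFT of s_j equals X[4k + j] of the full transform, which is exactly the decimation the derivation claims.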
Figure 2.8. Radix-4 butterfly structure.
The pseudo-code for the radix-4 FFT algorithm is described in Algorithm 2.3:
2.4.1 Architecture of Radix-4 FFT The architecture for the radix-4 FFT in [12] is based on analyzing the signal flow graph (SFG) of the FFT. The radix-4 FFT divides an N-point DFT into four N/4-point DFTs, then into sixteen N/16-point DFTs, and so on for every stage. The radix-4 FFT expresses the DFT equation as four summations and then divides it into four equations, each of which computes every fourth output sample, as described in Equations (2.15) through (2.18). An example of a 1024-point FFT architecture that processes 1024 complex samples with a 16-bit word length is shown in Fig. 2.9. These samples are stored in the input memory "RAM1". The radix-4 butterfly takes 4 samples, and the result is stored in the output memory "RAM2". This process is repeated for all the butterfly computations of every stage.
2.4.1.1 BUTTERFLY STRUCTURE The butterfly structure takes 4 signed fixed-point data words from memory and performs the butterfly computation of the FFT algorithm. The datapath of the radix-4 butterfly can be constructed from the SFG, as shown in Fig. 2.10.
2.4.1.2 RAM MODULES The FFT processor has three memory blocks: two 1024 × 16 BRAMs, one for storing the input samples and one for storing the output results, and a 768 × 16 ROM for storing
Algorithm 2.3 Radix-4 FFT algorithm
  Input x is a vector of complex input samples;
  Nin is the number of input samples;
  m is the number of stages, given by m = ceil(log4(Nin));
  N is the number of FFT points, given by N = 4^m;
  Append N - Nin zeros to the input vector as: x = [x, zeros(N - Nin)];
  The twiddle factor vector WN is calculated as:
      WN = cos(2π/N × [0 : N - 1]) - j sin(2π/N × [0 : N - 1]);
  The intermediate radix-4 butterflies for the m stages are computed as below:
  for each stage i ≤ m do
      For every stage calculate the four butterfly outputs as follows:
      for k = 1 : 4 : N/4 do
          Let ak = x(n + k), bk = x(n + k + 1), ck = x(n + k + 2), dk = x(n + k + 3);
          X(k) = ak + bk + ck + dk;
          X(k + N/4) = (ak - bk + ck - dk) × WN(2 × ik);
          X(k + N/2) = (ak - j × bk - ck + j × dk) × WN(ik);
          X(k + 3N/4) = (ak + j × bk - ck - j × dk) × WN(3 × ik);
      end for
  end for
Figure 2.9. Radix-4 FFT architecture.
Figure 2.10. The simplified Radix-4 dragonfly architecture.
the twiddle factors (also referred to as the sine and cosine look-up table). The real and imaginary signals are stored in different block memories.
2.4.2 Workload Computation The radix-4 DIT algorithm rearranges the DFT equation into four parts. If the FFT length N is 4^M, the shorter DFTs can be further decomposed recursively to produce the full radix-4 DIT FFT. The radix-4 transform is also more efficient in terms of the number of memory accesses. In the radix-4 DIT FFT, each decomposition stage creates additional savings in computation. To determine the total computational cost of the radix-4 FFT, note that there are M = log4(N) = (log2(N))/2 stages, each with N/4 butterflies per stage. Each radix-4 butterfly requires 3 complex multiplications and 8 complex additions. The total cost is as given below.
• Number of multiplications: (3N/8) log2(N)
• Number of additions: N log2(N)
The computational requirements of the radix-2 and radix-4 algorithms are shown in Table 2.2 [13]. It can be seen that the radix-4 algorithm reduces the number of multiplications by about 25 percent. A slight further improvement can be obtained by using radix-8 or radix-16 algorithms. When N is not a power of a single radix, one is prompted to use the mixed-radix approach [14]. One main drawback of these radix algorithms is that they are limited to blocklengths that are a power of 4 for radix-4 or a power of 2 for radix-2.
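The 25-percent figure follows directly from the two multiplication formulas; a quick check (complex multiplication counts only, for lengths that are powers of four):

```python
import math

def muls_radix2(N):
    """Complex multiplications for a radix-2 FFT of length N: (N/2) log2(N)."""
    return N // 2 * int(math.log2(N))

def muls_radix4(N):
    """Complex multiplications for a radix-4 FFT of length N: (3N/8) log2(N)."""
    return 3 * N // 8 * int(math.log2(N))
```

The ratio (3N/8) / (N/2) = 3/4 is independent of N, i.e., a 25 percent reduction for every valid length.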
Table 2.2. Comparison of the radix-2 and radix-4 algorithms

                    Radix-2                          Radix-4
DFT Size N    Multiplications   Additions     Multiplications   Additions
4             16                24            12                16
16            128               192           96                146
64            768               1152          576               930
256           4096              6144          3072              5122
1024          20480             30720         15360             26114
2.5 THE GOOD-THOMAS PRIME-FACTOR ALGORITHM This section discusses the FFT algorithm that results from using the prime-factor index mapping: the Good-Thomas prime-factor algorithm. Good and Thomas suggested another method of converting a Fourier transform from one dimension to two dimensions. It is based on factorization of the blocklength into distinct prime powers. It requires somewhat less computation than the Cooley-Tukey algorithm, while being conceptually more complicated. The Good-Thomas algorithm differs from the Cooley-Tukey algorithm in that it requires N1 and N2 to be relatively prime, i.e., GCD(N1, N2) = 1, and that it maps into a true two-dimensional transform (no twiddle factors are necessary). The idea is very different from the idea of the Cooley-Tukey algorithm: N1 and N2 must be co-prime, and the mapping is accomplished by applying residue reduction and the Chinese remainder theorem [15].
As an example shown in Fig. 2.11, consider a 15-point FFT, where N1 = 5 and
N2 = 3. The input data is stored in a two-dimensional array by starting in the upper left corner and listing the components down the extended diagonal. Because the number of rows and the number of columns are co-prime, the extended diagonal passes through every element
Figure 2.11. The block diagram of the Good-Thomas mapping for 15-point Fourier transform.
of the array. The order of the output array, however, differs from the order of the components in the input array and is mapped by using the Ruritanian correspondence mapping [15]. For example, to map the input index n = 8, we write n1 = 8 mod 5 = 3 and n2 = 8 mod 3 = 2. Hence, n = 8 is mapped onto n1 = 3 and n2 = 2. The derivation of the Good-Thomas FFT algorithm is based on the Chinese remainder theorem for integers. The first step in deriving the Good-Thomas algorithm is to describe the input indices by their residues. The input index is given by n1 = n mod N1 and n2 = n mod N2. If N1, N2, n1 and n2 are known, then n may be calculated by the Chinese remainder theorem, which uses the following equations:
n = [n1N2M2 + n2N1M1] mod N, (2.19) and
[N1M1 + N2M2] = 1. (2.20)
Equation (2.20) can be solved using the Euclidean algorithm, where N1 and N2 are the integer coefficients of the given linear Diophantine equation [16], so that N2M2 ≡ 1 mod N1 and N1M1 ≡ 1 mod N2. For example, if N = 15, N1 = 5, N2 = 3, n1 = 3, and n2 = 2, the values of M1 and M2 can be determined by solving the above equations. A set of solutions is M1 = -1 and M2 = 2. There are other possible solution pairs, but M1 is unique modulo N2 and M2 is unique modulo N1. Taking M1 modulo N2 = 3 to obtain a positive value yields M1 = 2. The output indices are defined using the Ruritanian correspondence method. This is defined as follows:
k1 = kM2 mod N1, (2.21)
k2 = kM1 mod N2, (2.22)
where M1 and M2 are the integers determined during the input mapping. For example, if
k = 13, then k1 = (2)(13)mod 5 = 1 and k2 = (−1)(13)mod 3 = 2, which represent the
coordinates of the output map. Knowing k1, k2, N1 and N2, the output index k is calculated
using k = [k1N2 + k2N1] mod N. By substituting these indices in the DFT equation in Equation (2.1), we can write:
\[
X_{k_1 k_2} = \sum_{n_1=0}^{N_1-1} \sum_{n_2=0}^{N_2-1} W_{N_1 N_2}^{(n_1 M_2 N_2 + n_2 M_1 N_1)(k_1 N_2 + k_2 N_1)}\, x[n_1 n_2]. \qquad (2.23)
\]
The product in the exponent can be written as:
\[
W_N^{nk} = W_N^{n_1 k_1 M_2 N_2 N_2}\, W_N^{n_2 k_2 M_1 N_1 N_1}\, W_N^{n_1 k_2 M_2 N_1 N_2}\, W_N^{n_2 k_1 M_1 N_1 N_2}. \qquad (2.24)
\]
According to Equation (2.20), M2N2 = 1 - M1N1 and M1N1 = 1 - M2N2; also, N1N2 = N and \(W_N^{N_1 N_2} = 1\). Substituting these terms in Equation (2.24), the exponent reduces to \(W_N^{nk} = W_{N_1}^{n_1 k_1} W_{N_2}^{n_2 k_2}\). Thus, the Good-Thomas Fourier transform can be written as:
\[
X_{k_1 k_2} = \sum_{n_1=0}^{N_1-1} \sum_{n_2=0}^{N_2-1} W_{N_1}^{n_1 k_1}\, W_{N_2}^{n_2 k_2}\, x[n_1 n_2]. \qquad (2.25)
\]
It can be observed that Equation (2.25) is a decoupled two-dimensional N1 × N2-point transform, i.e., either the rows or the columns can be transformed in any order. There are no twiddle factor multiplications involved, which is perhaps the greatest strength of the Good-Thomas algorithm. The Good-Thomas algorithm can be summarized as follows:
(i) Map the input sequence using the Chinese Remainder Theorem n1 × n2.
(ii) Transform the rows or columns k1 × n2.
(iii) Transform the columns or rows k1 × k2.
(iv) Map the output sequence based on the Ruritanian correspondence mapping k.
It can be observed that the one-dimensional input sequence n is mapped into n1 × n2 using the prime-factor mapping discussed earlier. We then perform the column transformation using N1-point FFTs. Unlike the Cooley-Tukey transform, we do not need any twiddle factor multiplications, because of the inherent property of the prime-factor mapping. An N2-point FFT is then performed over each of the N1 rows to form the k1 × k2 result. The resulting two-dimensional array is then mapped using the Ruritanian correspondence mapping to form the one-dimensional transform k. The procedure and pseudo-code for the Good-Thomas FFT algorithm is described in Algorithm 2.4.
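The two index maps can be sketched as follows (the N = 15 example from above; function names are illustrative). The input map uses residue reduction, and the output map the Ruritanian correspondence of Equations (2.21) and (2.22); note that Python's `%` already returns non-negative residues, which matches the modular arithmetic here:

```python
def gt_input_map(n, N1, N2):
    """Residue (CRT) input mapping: n -> (n1, n2)."""
    return n % N1, n % N2

def gt_output_map(k, N1, N2, M1, M2):
    """Ruritanian output mapping, Equations (2.21)-(2.22): k -> (k1, k2)."""
    return (k * M2) % N1, (k * M1) % N2

def gt_output_unmap(k1, k2, N1, N2):
    """Inverse of the Ruritanian map: k = [k1*N2 + k2*N1] mod N."""
    return (k1 * N2 + k2 * N1) % (N1 * N2)
```

With N1 = 5, N2 = 3, M1 = -1, M2 = 2 (from 5·M1 + 3·M2 = 1), the input index n = 8 maps to (3, 2) and the output index k = 13 maps to (1, 2), reproducing the worked examples in the text.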
2.5.1 Workload Computation In order to count the number of multiplications and additions, it is assumed that the row and column transforms are computed directly. This results in N1 transforms of length N2, each requiring approximately N2^2 multiplications and additions, and N2 transforms of length N1, each requiring N1^2 multiplications and additions. Thus,
• Number of multiplications: N1(N2)^2 + N2(N1)^2.
• Number of additions: N1(N2)^2 + N2(N1)^2.
The savings ratio for multiplications is given as:
\[
R = \frac{\text{Number of multiplications (Good-Thomas)}}{\text{Number of multiplications (direct)}} = \frac{N_1 + N_2}{N_1 N_2}. \qquad (2.26)
\]
Some examples of the savings ratio are listed in Table 2.3.
Table 2.3. Good-Thomas products savings ratio with respect to direct DFT calculation.

N       N1    N2    R
6       3     2     0.83
20      5     4     0.45
120     15    8     0.19
252     28    9     0.14
1008    16    63    0.08
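Equation (2.26) reproduces the table entries directly (a small sketch; note that the last row requires N1 = 16, since 16 × 63 = 1008 and GCD(16, 63) = 1):

```python
from math import gcd

def gt_savings_ratio(N1, N2):
    """Multiplication savings ratio of Good-Thomas vs. direct DFT, Eq. (2.26)."""
    assert gcd(N1, N2) == 1, "factors must be co-prime"
    return (N1 + N2) / (N1 * N2)
```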
One can choose either the Cooley-Tukey algorithm or the Good-Thomas algorithm to calculate Fourier transforms. It is also possible to build a Fourier transform algorithm by using both the Cooley-Tukey FFT and the Good-Thomas FFT. For example, a 63-point transform can be broken into a 7-point transform and a 9-point transform by using the Good-Thomas FFT. The 9-point transform can then be decomposed into two 3-point transforms by using the Cooley-Tukey FFT. One then has a computation in a form similar to a three-dimensional 3 × 3 × 7 Fourier transform. The Good-Thomas FFT is more efficient than the direct DFT computation and than the Cooley-Tukey FFT algorithm. However, the radix algorithms prove to be superior to the Good-Thomas FFT in terms of arithmetic computations. The Good-Thomas FFT has been
Algorithm 2.4 Good-Thomas algorithm
  procedure InputMapping:
      for i ∈ N1 do
          for j ∈ N2 do
              index(i, j) = remainder((j - 1) × N2 × M2 + (i - 1) × N1 × M1, N);
          end for
      end for
  end procedure
  procedure ColumnTransformation:
      for i ∈ N2 do
          Select all the elements of column i;
          Perform an N1-point Fourier transform on the N1 elements of column i;
      end for
      The result is a two-dimensional array holding the transforms of all N2 column vectors;
  end procedure
  procedure RowTransformation:
      for i ∈ N1 do
          Select all the elements of row i;
          Perform an N2-point Fourier transform on the N2 elements of row i;
      end for
      The result is a two-dimensional array holding the transforms of all N1 row vectors;
  end procedure
  procedure OutputMapping:
      Initialize k = 0;
      for i ∈ N1 do
          for j ∈ N2 do
              index(k) = remainder((j - 1) × N2 + (i - 1) × N1, N);
              k = k + 1;
          end for
      end for
  end procedure
The Good-Thomas FFT has long been known in the signal processing industry, but hardware implementations of it are hard to find. This is mainly because the Good-Thomas FFT failed to make an impression against the radix algorithms.
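The whole procedure of Algorithm 2.4 can be sketched compactly in Python (a 15-point Good-Thomas FFT checked against a direct DFT; 0-based indices are used here rather than the 1-based ones of the pseudo-code, and the sub-transforms are computed directly for clarity):

```python
import cmath

def dft_direct(x):
    """Direct O(N^2) reference DFT."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N)
                for n in range(N)) for k in range(N)]

def good_thomas_fft(x, N1, N2, M1, M2):
    """Good-Thomas FFT for N = N1*N2 with GCD(N1, N2) = 1.
    M1, M2 satisfy N1*M1 + N2*M2 = 1 (Equation (2.20))."""
    N = N1 * N2
    # Input mapping (CRT): a[n1][n2] = x[(n1*N2*M2 + n2*N1*M1) mod N]
    a = [[x[(n1 * N2 * M2 + n2 * N1 * M1) % N] for n2 in range(N2)]
         for n1 in range(N1)]
    # Column transforms (N1-point DFTs; no twiddle factors are needed)
    for n2 in range(N2):
        col = dft_direct([a[n1][n2] for n1 in range(N1)])
        for k1 in range(N1):
            a[k1][n2] = col[k1]
    # Row transforms (N2-point DFTs)
    for k1 in range(N1):
        a[k1] = dft_direct(a[k1])
    # Output mapping (Ruritanian): X[(k1*N2 + k2*N1) mod N] = a[k1][k2]
    X = [0j] * N
    for k1 in range(N1):
        for k2 in range(N2):
            X[(k1 * N2 + k2 * N1) % N] = a[k1][k2]
    return X
```

The row and column passes are fully decoupled, exactly as Equation (2.25) predicts, and could be swapped without changing the result.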
2.5.2 Comparison and Summary of the FFT Algorithms Table 2.4 [13] gives a comparison of the Cooley-Tukey, radix-2, radix-4, and Good-Thomas FFT algorithms in terms of the number of multiplications and additions required to perform the Fourier transform. In general, direct computation of the DFT requires on the order of N^2 operations, where N is the transform length. The breakthrough of the Cooley-Tukey FFT comes from the fact that it brings the complexity down to the order of N log2(N) operations. The Cooley-Tukey FFT algorithm is suited to any composite length. For N = 2^n or 4^n or r^n in general, it can be easily specialized into what are now called the radix FFT algorithms. A different approach, based on an index transformation, was suggested by Good and Thomas to compute the DFT when the factors of the transform length are co-prime (i.e., N = N1 × N2 and GCD(N1, N2) = 1). This approach uses a multi-dimensional mapping and does not require any twiddle factor multiplications. Its drawback, at first sight, is that it requires efficiently computable DFTs of co-prime lengths: a set of relatively small-length DFTs that seemed at first difficult to compute in fewer than N^2 operations. After many developments that followed Cooley and Tukey's original contribution, the FFT introduced in 1976 by Winograd stands out for achieving a new theoretical reduction in multiplicative complexity. Interestingly, Winograd's algorithm uses convolutions to compute DFTs, which is just the converse of the conventional method of computing convolutions by means of DFTs. Winograd utilized results proposed by Good and Rader in a more efficient way by combining them with his fast convolution algorithm. The Winograd FFT uses Rader's algorithm and Winograd's circular convolution to compute the Fourier transform. Rader's algorithm, the Winograd circular convolution, and the Winograd FFT are explained in the next chapter.
Table 2.4. Comparison of the Cooley-Tukey, Radix-2, Radix-4, and Good-Thomas FFT algorithms

                        Cooley-Tukey FFT               Good-Thomas FFT
DFT Size          Multiplications   Additions    Multiplications   Additions
N = N1 × N2
30 = 2 × 3 × 5    360               384          330               384
60 = 3 × 4 × 5    1080              888          1020              888
120 = 3 × 5 × 8   2880              2076         2760              2076
240 = 3 × 5 × 16  8176              4812         7936              4812
504 = 7 × 8 × 9   12544             13164        12040             13164
1008 = 7 × 9 × 16 44928             29100        43920             29100

                        Radix-2 FFT                    Radix-4 FFT
DFT Size N        Multiplications   Additions    Multiplications   Additions
32                320               480          —                 —
64                768               1152         576               1056
128               1792              2688         —                 —
256               4096              6144         3072              5632
512               9216              6144         —                 —
1024              20480             30720        15360             28160
CHAPTER 3
FAST FOURIER TRANSFORMS VIA CONVOLUTION
The Fourier transform can be performed efficiently by using the convolution theorem along with an FFT algorithm. When the blocklength is relatively small, one of the most efficient FFT algorithms, as measured by the number of multiplications and additions, is the Winograd FFT algorithm. The Winograd FFT is derived from Rader's algorithm and the Winograd short circular convolution.
3.1 RADER'S ALGORITHM When the blocklength N is an odd prime (or a power of an odd prime), Rader's algorithm [17] can be used to compute the FFT efficiently. In this case, the zeroth time component and the zero-frequency component are treated specially. Rader's algorithm is based on converting the DFT into a convolution. The convolution is a cyclic operation with no pre- or post-multiplications. Rader's algorithm allows us to represent a prime-length DFT as a cyclic convolution, which in turn can be expressed as a polynomial product modulo a third polynomial. The DFT and the cyclic convolution operations can be written as:
N−1 X X[k] = x[n]W nk, (3.1) n=0
\[
X[k] = \sum_{n=0}^{N-1} x[n]\, h[n-k], \qquad (3.2)
\]
respectively, where the indices are evaluated modulo N. To convert the DFT in Equation (3.1) into the convolution represented in Equation (3.2), the n × k product must be changed to k - n differences. When N is a prime number, this change of indices can be accomplished by permuting the order of the original sequence. Hence, to transform the index k × n to k - n when N is prime, we can use the following property of integers modulo prime numbers. For any prime number N, the non-negative integers less than N form a Galois field GF(N) under modulo-N addition and modulo-N multiplication. From number theory, it can be shown that if the modulus N is a prime number, a base called a primitive root r exists such that Equation (3.3) creates a one-to-one map of the N - 1 members as follows:
\[
n = r^m \bmod N, \qquad (3.3)
\]
where m = 0, 1, ..., N - 2. Note that all terms in which n = 0 or k = 0 must be excluded from consideration. For instance, if n = 0, the index kn addresses a single element irrespective of k, and vice-versa, whereas k - n addresses N different elements. Therefore, when n = 0 or k = 0 it is not possible to permute the given sequence. We shall treat these terms as special cases by rewriting the DFT in Equation (3.1) as:
\[
X[0] = \sum_{n=0}^{N-1} x[n], \qquad (3.4)
\]
\[
X[k] = x[0] + X'[k], \qquad (3.5)
\]
\[
X'[k] = \sum_{n=1}^{N-1} x[n]\, W^{nk}, \qquad (3.6)
\]
where k = 1, 2, ..., N - 1. Rader's algorithm uses the circular convolution properties of prime-length DFTs. The algorithm can be summarized as follows:
Zeroth sample separation Separate the first input sample x[0] from the others and prepare to compute the remaining output frequency components, i.e., X'[k] = X[k] - x[0] for k = 1, 2, ..., N - 1.
Input mapping Reorder the input data sequence: for any prime number N, there is at least one primitive root that can be used to reorder the input data sequence. If r is the primitive root, then the reordered sequence can be calculated using Equation (3.3).
(N − 1)-point FFT computation Compute the (N − 1)-point DFT of the new sequence using any of the FFT algorithms discussed in Chapter 2.
Reorder the twiddle factors For every primitive root, there exists another primitive root so that the product of the two is 1 modulo N. Use this root to reorder the twiddle factors.
FFT computation of the twiddle factor matrix Compute the N − 1 point DFT of this sequence. These values are computed and stored in the memory.
Complex multiplication Perform complex multiplication of the above two sequences. This stage requires N − 1 complex multiplications. Inverse Fourier transform Compute the inverse Fourier transform of the output sequence of the previous stage.
Output frequency component computation Add x[0] to each element of the resulting (N − 1)-point sequence. Then X[0], computed using Equation (3.4), is prepended to form the complete N-point transform.
The entire process mentioned above is similar to performing an N − 1 point circular convolution. Fig. 3.1 shows a top-level description of the Rader’s algorithm.
Figure 3.1. The Rader's algorithm.
The pseudo-code for the Rader’s FFT algorithm is described in Algorithm 3.1.
Algorithm 3.1 Rader's FFT algorithm
  Input x is a vector of complex input samples. The first sample is separated from the
  input vector: x0 = x(0) and xi = x(i) for i ≥ 1;
  Let g1 and g2 be the two primitive roots;
  procedure InputMapping/Scrambling:
      for i = 1, 2, ..., N − 1 do
          index1(i) ≡ g1^i mod N;
          xr(i) = x(index1(i));      // reordered sequence xr
      end for
  end procedure
  procedure (N−1)-PointFFT Xr(k):
      Calculate the Fourier transform of the sequence xr(i);
  end procedure
  procedure TwiddleFactorScrambling:
      for i = 1, 2, ..., N − 1 do
          index2(i) ≡ g2^i mod N;
          Wr(i) = W(index2(i));      // reordered twiddle factors Wr
      end for
  end procedure
  procedure (N−1)-PointFFT Wkr(k):
      Calculate the Fourier transform of the sequence Wr(i);
  end procedure
  procedure ComplexMultiplication:
      for k = 1, 2, ..., N − 1 do
          X'(k) = Xr(k) × Wkr(k);
      end for
  end procedure
  procedure (N−1)-PointInverseFFT x'(i):
      Calculate the inverse Fourier transform of the sequence X'(k);
  end procedure
  procedure OutputFrequencyComponents:
      Calculate X(0) as: X0 = x(0) + x(1) + · · · + x(N − 1);
      Add the zeroth input sample to the computed sequence:
      for k = 1, 2, ..., N − 1 do
          X(k) = x(0) + X'(k);
      end for
      The final output sequence is formed by prepending X0 to X(k): Xk = [X0 X(k)];
  end procedure
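The core idea of Algorithm 3.1 can be sketched numerically for N = 7 with primitive root 3 (a sketch only: the length-(N−1) cyclic convolution is evaluated here as a direct correlation sum instead of with (N−1)-point FFTs, purely to keep the code short):

```python
import cmath

def dft_direct(x):
    """Direct O(N^2) reference DFT."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N)
                for n in range(N)) for k in range(N)]

def rader_fft(x, g):
    """Rader DFT for prime N with primitive root g."""
    N = len(x)
    W = lambda e: cmath.exp(-2j * cmath.pi * (e % N) / N)
    M = N - 1
    # input permutation: a[m] = x[g^m mod N], m = 0..N-2
    perm = [pow(g, m, N) for m in range(M)]
    a = [x[p] for p in perm]
    # permuted twiddle sequence: c[j] = W^(g^j mod N)
    c = [W(pow(g, j, N)) for j in range(M)]
    X = [0j] * N
    X[0] = sum(x)                          # Equation (3.4)
    for p in range(M):                     # output index k = g^p mod N
        conv = sum(a[m] * c[(m + p) % M] for m in range(M))
        X[perm[p]] = x[0] + conv           # Equation (3.5)
    return X
```

Because g^(m+p) mod N depends only on (m + p) mod (N − 1), the inner sum is a length-(N−1) cyclic operation, which is exactly what allows it to be computed with a pair of (N−1)-point FFTs in the full algorithm.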
3.1.1 Workload Computation Note that Rader's algorithm can only be used if the length N is a prime or a power of a prime. Since N − 1 is composite, the convolution can be performed via conventional FFT algorithms. However, if N − 1 has large prime factors, this approach may not be efficient, as it requires recursive use of Rader's algorithm. Instead, a length-(N − 1) cyclic convolution can be computed exactly by zero-padding it to a length of at least 2(N − 1) − 1, say to a power of two, which can then be evaluated in O(N log N) time without the recursive application of Rader's algorithm. If Rader's algorithm is performed by using FFTs of size N − 1 to compute the convolution, rather than by zero-padding as mentioned above, the efficiency depends strongly upon N and the number of times that Rader's algorithm must be applied recursively. In general, the number of arithmetic operations is given by:
• Number of multiplications: 2 × M_(N−1) + 4 × (N − 1).
• Number of additions: 2 × A_(N−1) + 6 × (N − 1),
where M_(N−1) is the number of multiplications required for an (N − 1)-point FFT and A_(N−1) is the number of additions required for an (N − 1)-point FFT. In general, Rader's algorithm requires O(N) additions and O(N log N) time for the convolution. If the convolution is performed by a pair of FFTs, this algorithm intrinsically requires more operations than FFTs of close composite sizes. Rader's algorithm is efficient if the length of the FFT is comparatively small; as the length increases, the number of multiplications and additions required increases significantly. Rader succeeded in mapping a DFT of prime length N into a circular convolution of length N − 1. This property was later used to form the Winograd Fourier transforms.
3.2 WINOGRAD SHORT CONVOLUTION The Winograd short convolution [9] provides a general strategy for computing linear or cyclic convolutions by applying the polynomial version of the Chinese remainder theorem. For polynomials g(x) and d(x) of degrees N − 1 and M − 1, respectively, the linear convolution s(x) has degree L − 1, where L = M + N − 1. The linear convolution is described as the polynomial product s(x) = g(x)d(x). This is equivalent to Equation (3.7), because the reduction mod m(x) has no effect on s(x) when the degree of m(x) exceeds that of s(x).
s(x) = g(x)d(x) mod m(x). (3.7)
For cyclic convolution, let m(x) = x^N − 1, where N is the length of the convolution. The first step in the derivation of the Winograd algorithm is to factor m(x) into its relatively prime factors, i.e., m(x) = m^0(x) m^1(x) · · · m^(K−1)(x). The residual polynomials associated with each of these factors can then be computed as:
d^(k)(x) = d(x) mod m^(k)(x), (3.8)
g^(k)(x) = g(x) mod m^(k)(x), (3.9)
s^(k)(x) = s(x) mod m^(k)(x), (3.10)
s^(k)(x) = d^(k)(x) g^(k)(x) mod m^(k)(x). (3.11)
The final step is to compute s(x) by applying the Chinese remainder theorem for polynomials as:

s(x) = Σ_{k=0}^{K−1} s^(k)(x) N^(k)(x) M^(k)(x) [mod m(x)], (3.12)

where M^(k)(x) = m(x)/m^(k)(x) and N^(k)(x) is determined by N^(k)(x) M^(k)(x) = 1 [mod m^(k)(x)]. Only the short convolutions represented by the polynomial products of the second step require multiplications. The number of multiplications is given by Σ_{k=0}^{K−1} deg m^(k)(x). As an example, consider the cyclic convolution of 2-point sequences. Let g(x) = g0 + g1x and d(x) = d0 + d1x be the two degree-1 polynomials. Let m(x) = x² − 1 = (x − 1)(x + 1). The choice of factors is arbitrary; the only requirements are that the degree of m(x) is greater than the degree of s(x) and that the factors of m(x) are relatively prime. The next step is to find the residual polynomials associated with each of these factors, as shown in (3.8) through (3.10). Thus,
g^(0)(x) = g(x) [mod m^(0)(x)] = (g0 + g1x) [mod (x − 1)] = g0 + g1, and
g^(1)(x) = (g0 + g1x) [mod (x + 1)] = g0 − g1.
Similarly, we can calculate the residual polynomials for d(x). Thus,
d^(0)(x) = d0 + d1 and d^(1)(x) = d0 − d1.
Let g^(0) = g0 + g1, g^(1) = g0 − g1, d^(0) = d0 + d1 and d^(1) = d0 − d1. A matrix representation of these equations is given below,

[d^(0); d^(1)] = [1 1; 1 −1] [d0; d1] and [g^(0); g^(1)] = [1 1; 1 −1] [g0; g1].
We now need to determine the residues of s(x) given by,
s^(k)(x) = g^(k)(x) d^(k)(x) [mod m^(k)(x)]. Thus, s^(0) = (g0 + g1)(d0 + d1) and s^(1) = (g0 − g1)(d0 − d1).
Finally, we use Equation (3.12) to solve for s(x); N^(k)(x) can be computed with the help of Table (3.1).
Table 3.1. Determination of N^(k)(x).
k    m^(k)(x)    M^(k)(x)
0    x − 1       x + 1
1    x + 1       x − 1
For k = 0, N^(0)(x) × (x + 1) = 1 [mod (x − 1)]. Similarly, for k = 1, N^(1)(x) × (x − 1) = 1 [mod (x + 1)]. In general we would use Euclid's division algorithm for polynomials [Ref] to solve these congruences, but since the terms here are small we can solve them by inspection. Substituting x = 1 in the first congruence and x = −1 in the second yields N^(0)(x) = 1/2 and N^(1)(x) = −1/2. Thus s(x) can be written as,
s(x) = (1/2)[s^(0) + s^(1)] + (1/2)[s^(0) − s^(1)]x, where s^(0) = (g0 + g1)(d0 + d1) and s^(1) = (g0 − g1)(d0 − d1).
Thus, this algorithm requires two multiplications and four additions, compared to the four multiplications and two additions of the direct computation. The Winograd cyclic convolution reduces the number of multiplications at the cost of an increased number of additions. Additional information about the Winograd cyclic convolution can be found in [18].
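The derivation above maps directly to code. This sketch (ours, not from the thesis) performs the 2-point cyclic convolution with exactly two multiplications, following the residue computation and the CRT reconstruction with the ±1/2 factors:

```python
def winograd_cconv2(g, d):
    """2-point Winograd cyclic convolution: 2 multiplications instead of 4."""
    s0 = (g[0] + g[1]) * (d[0] + d[1])   # residue modulo (x - 1)
    s1 = (g[0] - g[1]) * (d[0] - d[1])   # residue modulo (x + 1)
    # CRT reconstruction: s(x) = (s0 + s1)/2 + ((s0 - s1)/2) x
    return [(s0 + s1) / 2, (s0 - s1) / 2]
```

In a filtering context, the factors of 1/2 can be folded into precomputed coefficients of g, so they need not count as general multiplications.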
3.3 WINOGRAD FOURIER TRANSFORM ALGORITHM

In 1976, S. Winograd introduced a new approach to efficiently compute the Fourier transform that uses substantially fewer multiplications at the expense of slightly more additions. The motivation for the development of this algorithm was that multiplication was extremely expensive in computation time. The Winograd small FFT is a method of efficiently computing the DFT for small block lengths; the algorithm was designed to minimize the number of multiplications required to implement the FFT. It is based on Rader's prime algorithm and the Winograd cyclic convolution. The first step in the WFT is to change the Fourier transform into a convolution. If N is a small prime number, we can use Rader's prime algorithm to express the Fourier transform as a convolution. The Rader prime algorithm changes the DFT into a convolution using permuted input sequences; this stage does not require any additions or multiplications. The next step is to use the Winograd cyclic convolution algorithm to compute the Fourier transform of the (N − 1)-point sequence. The convolution algorithm itself consists of a set of additions, followed by a set of multiplications, followed by a set of additions again. Winograd showed that the DFT can be computed with only O(N) irrational multiplications, an achievable lower bound on the number of multiplications. However, this comes at the cost of significantly more additions, a trade-off no longer favorable on conventional FPGAs with dedicated multipliers. The theory and derivation of these algorithms is quite elegant, but requires substantial background in number theory and abstract algebra. Fortunately for VLSI designers, the smaller-length FFT algorithms have already been derived. The most popular block lengths are 2, 3, 4, 5, 7, 8, 9 and 16.
In this section, the dataflow and construction of the 2-, 3-, 4-, 5-, 7-, 8-, 9-, and 16-point Winograd FFT algorithms are discussed as building blocks used in larger-length FFTs. Note that the red lines in the SFGs indicate the pipeline stages, the details of which are explained in Chapter 5.
• Two-point FFT: The SFG for the 2-point Winograd Fourier transform is shown in Fig. 3.2. This is the simplest of the FFTs and requires only 4 additions and no multiplications.

Figure 3.2. The SFG for the 2-point Winograd Fourier transform algorithm.

• Three-point FFT: The SFG for the 3-point Winograd Fourier transform is shown in Fig. 3.3. The intermediate multiplier coefficients are given in Table 3.2. The 3-point Winograd Fourier transform requires 6 additions and only 2 multiplications.

Table 3.2. Multiplier coefficients for the 3-point WFT.
C1 = cos(2π/3) − 1
C2 = j sin(2π/3)

Figure 3.3. The SFG for the 3-point Winograd Fourier transform algorithm.

• Four-point FFT: The SFG for the 4-point Winograd Fourier transform is shown in Fig. 3.4. The 4-point Winograd FFT does not require any multiplications: the complex multiplier j can be easily replaced by a demultiplexer that switches the real data to imaginary and the imaginary data to real. The 4-point transform requires only 4 additions.

Figure 3.4. The SFG for the 4-point Winograd Fourier transform algorithm.
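As an illustration of the 3-point SFG and the Table 3.2 coefficients, the sketch below (ours; the pipeline registers are omitted, and the sign placement of C2 is assumed from the SFG's add/subtract branches) computes the 3-point WFT with exactly 2 multiplications and 6 additions:

```python
import cmath
import math

C1 = math.cos(2 * math.pi / 3) - 1     # Table 3.2
C2 = 1j * math.sin(2 * math.pi / 3)    # Table 3.2

def wft3(x):
    """3-point Winograd Fourier transform: 2 multiplications, 6 additions."""
    t1 = x[1] + x[2]                   # addition 1
    t2 = x[1] - x[2]                   # addition 2
    X0 = x[0] + t1                     # addition 3
    m1 = C1 * t1                       # multiplication 1
    m2 = C2 * t2                       # multiplication 2
    s = X0 + m1                        # addition 4
    return [X0, s - m2, s + m2]        # additions 5 and 6
```

The intermediate sums t1 and t2 are shared between outputs Xk[1] and Xk[2], which is where the saving over the direct 3-point DFT comes from.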
• Five-point FFT: The SFG for the 5-point Winograd Fourier transform is shown in Fig. 3.5. The 5-point Fourier transform requires 5 multiplications and 17 additions. The intermediate multiplier coefficients are given in Table 3.3.

Table 3.3. Multiplier coefficients for the 5-point WFT.
C1 = (1/2)cos(2π/5) + (1/2)cos(4π/5) − 1
C2 = (1/2)cos(2π/5) − (1/2)cos(4π/5)
C3 = sin(2π/5)
C4 = sin(2π/5) + sin(4π/5)
C5 = sin(2π/5) − sin(4π/5)
• Seven-point FFT: The SFG for the 7-point Winograd Fourier transform is shown in Fig. 3.6. The 7-point Fourier transform requires 8 multiplications and 35 additions. The intermediate multiplier coefficients are given in Table 3.4.

• Eight-point FFT: The SFG for the 8-point Winograd Fourier transform is shown in Fig. 3.7. The 8-point Fourier transform requires 2 multiplications and 26 additions. The intermediate multiplier coefficients are given in Table 3.5.
Figure 3.5. The SFG for the 5-point Winograd Fourier transform algorithm.
Table 3.4. Multiplier coefficients for the 7-point WFT.
C1 = (1/3)[cos(2π/7) + cos(4π/7) + cos(6π/7)] − 1
C2 = (1/3)[2 cos(2π/7) − cos(4π/7) − cos(6π/7)]
C3 = (1/3)[cos(2π/7) − 2 cos(4π/7) + cos(6π/7)]
C4 = (1/3)[cos(2π/7) + cos(4π/7) − 2 cos(6π/7)]
C5 = (1/3)[sin(2π/7) + sin(4π/7) − sin(6π/7)]
C6 = (1/3)[2 sin(2π/7) − sin(4π/7) + sin(6π/7)]
C7 = (1/3)[sin(2π/7) − 2 sin(4π/7) − sin(6π/7)]
C8 = (1/3)[sin(2π/7) + sin(4π/7) + 2 sin(6π/7)]

Table 3.5. Multiplier coefficients for the 8-point WFT.
C1 = cos(2π/8)
C2 = sin(2π/8)
Figure 3.6. The SFG for the 7-point Winograd Fourier transform algorithm.
• Nine-point FFT: The SFG for the 9-point Winograd Fourier transform is shown in Fig. 3.8. The 9-point Fourier transform requires 10 multiplications and 44 additions. The intermediate multiplier coefficients are given in Table 3.6.

• Sixteen-point FFT: The SFG for the 16-point Winograd Fourier transform is shown in Fig. 3.9. The 16-point Fourier transform requires 10 multiplications and 74 additions. The intermediate multiplier coefficients are given in Table 3.7.
Figure 3.7. The SFG for the 8-point Winograd Fourier transform algorithm.
Table 3.6. Multiplier coefficients for the 9-point WFT.
C1 = −1/2
C2 = sin(6π/9)
C3 = cos(6π/9) − 1
C4 = sin(6π/9)
C5 = (1/3)[2 cos(2π/9) − cos(4π/9) − cos(8π/9)]
C6 = (1/3)[cos(2π/9) + cos(4π/9) − 2 cos(8π/9)]
C7 = (1/3)[cos(2π/9) − 2 cos(4π/9) + cos(8π/9)]
C8 = (1/3)[2 sin(2π/9) + sin(4π/9) − sin(8π/9)]
C9 = (1/3)[sin(2π/9) − sin(4π/9) − 2 sin(8π/9)]
C10 = (1/3)[sin(2π/9) + 2 sin(4π/9) + sin(8π/9)]
Figure 3.8. The SFG for the 9-point Winograd Fourier transform algorithm.
Table 3.7. Multiplier coefficients for the 16-point WFT.
C1 = sin(2π/16)
C2 = cos(2π/16)
C3 = sin(3π/16)
C4 = sin(π/16) − sin(3π/16)
C5 = sin(π/16) + sin(3π/16)
C6 = cos(3π/16)
C7 = cos(π/16) − cos(3π/16)
C8 = cos(π/16) + cos(3π/16)
Figure 3.9. The SFG for the 16-point Winograd Fourier transform algorithm.
3.4 SUMMARY

The FFT introduced by Winograd significantly reduces the number of multiplications required to perform the Fourier transform. Winograd utilized the results proposed by Rader in a more efficient way by combining them with his fast convolution algorithm. However, Winograd's FFT algorithm has failed to replace the popular Cooley-Tukey type algorithms, mainly because of the significant rise in the number of additions required for Fourier transforms of higher order. The radix algorithms are superior to the Winograd FFT algorithms for large transform lengths. The number of additions and multiplications required to perform the Fourier transform using the Winograd FFT algorithm is shown in Table (3.8). It can be
Table 3.8. Computational requirements of Radix-2 and Winograd FFT algorithms.

                         Radix-2                Winograd FFT
DFT Size                 Mults     Adds         Mults     Adds
N = N1 × N2
2                        4         6            0         2
3                        —         —            2         6
4                        16        24           0         8
5                        —         —            5         17
7                        —         —            8         38
8                        48        72           2         26
9                        —         —            10        44
16                       128       192          10        74
30 = 5 × 6               —         —            72        384
60 = 5 × 12              —         —            144       888
64                       576       1056         —         —
120 = 3 × 5 × 8          —         —            288       2076
180 = 4 × 5 × 9          —         —            528       3936
240 = 3 × 5 × 16         —         —            648       5136
256                      3072      5632         —         —
420 = 3 × 4 × 5 × 7      —         —            1296      11352
504 = 7 × 8 × 9          —         —            1584      14428
1008 = 7 × 9 × 16        —         —            3564      34416
1024                     15360     28160        —         —
observed that the Winograd FFT algorithm proves to be better than the radix algorithms when the transform length is relatively small, since its arithmetic complexity is significantly lower for short transforms. As the block length increases, however, the Winograd FFT algorithm loses its advantage due to the growth in the number of additions: for example, the 16-point transform requires fewer adders than the radix-2 FFT, whereas the 1008-point transform requires more. In 1977, D. P. Kolba and T. W. Parks proposed a method of computing the Fourier transform based on the Good-Thomas FFT and the Winograd FFT algorithms [4]. In their approach, the Good-Thomas FFT algorithm was utilized to decompose the global computation into smaller Fourier transform computations. These smaller transforms are then implemented by the Winograd FFT algorithm. It was observed that the combined Good-Thomas and Winograd FFT, also referred to as the Good-Thomas, Winograd prime-factor algorithm or simply the prime factor algorithm (PFA), requires the minimum number of multiplications at the cost of some additions. Fig. (3.10) and Fig. (3.11) plot the number of multiplications and the number of additions required by various FFT algorithms, respectively.
Figure 3.10. Required number of multiplications for different FFT algorithms (Radix-2, Radix-4, Cooley-Tukey, Good-Thomas, and Good-Thomas Winograd FFT) as a function of the transform size.
The computation of the Fourier transform using the PFA proved to be more computationally efficient than all the other FFT algorithms discussed, primarily because (i) it does not require twiddle factor multiplications, and (ii) the use of the small Winograd FFT algorithms inside the Good-Thomas FFT algorithm significantly reduces the arithmetic complexity due to the cyclic convolution properties of the Winograd FFT algorithms. This was the main motivation for choosing the combined Good-Thomas and Winograd FFT as the candidate FFT algorithm for VLSI implementation.

Figure 3.11. Required number of additions for different FFT algorithms (Radix-2, Radix-4, Cooley-Tukey, Good-Thomas, and Good-Thomas Winograd FFT) as a function of the transform size.

The hardware construction and implementation results of the combined Good-Thomas and Winograd FFT algorithm are described in Chapter 5.
CHAPTER 4
GOOD-THOMAS AND WINOGRAD PRIME-FACTOR FFT ALGORITHMS
This chapter discusses the design and implementation of the Good-Thomas and Winograd prime-factor FFT algorithm (PFA). The FFT algorithms are first modeled in floating-point Matlab and then in fixed-point Matlab. The fixed-point Matlab model is then converted to a hardware description for FPGA implementation.
4.1 INTRODUCTION

The divide-and-conquer strategy has few requirements for feasibility: the Fourier transform length N needs only to be composite, and the whole DFT is computed from DFTs on numbers of points that are factors of N. This exploits the redundancy in the computation of the DFT of the signal. Good's mapping, which can be used when the factors of the transform length are co-prime, is a true mono- to multi-dimensional mapping and thus has the advantage of not producing any twiddle factors. However, it requires efficient computation of the DFTs of the co-prime lengths. For instance, a DFT of length 240 will be decomposed as 240 = 16 × 3 × 5, and a DFT of length 1008 will be decomposed as 1008 = 16 × 9 × 7. Thus, the intermediate DFT computations of lengths 16, 3, 5, 9 and 7 have to be efficient. This prime factor method requires a set of relatively small-length DFTs that seemed challenging to compute in fewer than N² operations. Rader showed how to map a DFT of length N, where N is a prime number, into a circular convolution of length N − 1. Winograd introduced a new approach to efficiently compute the Fourier transform that uses substantially fewer multiplications at the expense of more additions. The Winograd small FFT is a method of efficiently computing the discrete Fourier transform for small block lengths. It is based on Rader's prime algorithm and the Winograd cyclic convolution. For instance, a 16-point sequence requires only 10 multiplications and 74 additions. The algorithm was designed to minimize the number of multiplications required to implement the FFT. For example, we can now efficiently compute the 240-point DFT or the 1008-point DFT by using 16-point, 5-point, 3-point and 16-point, 9-point, 7-point Winograd FFTs, respectively.
4.2 DATA FORMAT

The data format has a direct impact on both the accuracy of the outputs and the size of the design. The input and output word-lengths of the FFT processor designed in this thesis are parametric. We use fixed-point arithmetic to implement the FFT architectures. The following strategies are used to deal with the problem of overflow in fixed-point computations.

• Headroom provision: The input-output word-length is increased so that only a fraction of the available word-length range is used.
• Fixed scaling: The outputs of the adders and multipliers that generate overflow are scaled down; the least significant bit (LSB) is dropped at given stages of the computation.
• Overflow flag: Overflow flags can be used to indicate that the word-length has to be increased so as to provide a sufficient number of bits.

Fig. 4.1 shows the input and output data signals of the implemented FFT processor. Each complex sample has a real and an imaginary part, whose values are represented in two's complement fixed-point form. The N complex samples all enter the FFT circuit synchronously, with their real and imaginary parts each on a separate input data bus. Similarly, all the N Fourier transform samples are available at the output, each on separate real and imaginary output data buses. The handshaking control signal “DVI” indicates that all valid samples have been applied to the input, and the control signal “DVO” indicates that the Fourier transform has been completed and the transform samples are available at the output. The Fourier multiplier coefficients, which have the same format as the inputs, may be applied to the input coefficient data bus along with the inputs or may be stored in the LUT.
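The overflow strategies above can be sketched as follows (our illustration, not the thesis Verilog; `quantize` and `scale_down` are hypothetical helper names):

```python
def quantize(value, wl, frac_bits):
    """Two's-complement fixed point with wl total bits and frac_bits
    fractional bits. Returns (code, overflow_flag); saturates on overflow."""
    code = round(value * (1 << frac_bits))
    lo, hi = -(1 << (wl - 1)), (1 << (wl - 1)) - 1
    if code < lo or code > hi:
        return max(lo, min(hi, code)), True    # overflow flag asserted
    return code, False

def scale_down(code):
    """Fixed scaling after an adder/multiplier stage: drop the LSB."""
    return code >> 1                           # arithmetic shift right
```

Headroom provision corresponds to calling `quantize` with a word-length larger than the nominal input width, so that intermediate sums stay inside the representable range.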
Figure 4.1. The IO interface of the N-point Good-Thomas, Winograd prime-factor FFT processor.
4.3 THE WINOGRAD FFT MODULES

The Winograd FFT algorithm forms an integral part of our prime-factor FFT algorithm. In this section we discuss the pipelined architectures and the implementation results of the Winograd FFT modules. These FFTs are used to perform the N1-point and N2-point row and column transforms, respectively. No twiddle factor multiplication is required to perform the WFT, which in turn reduces the number of multiplications needed. WFTs with block lengths of 2, 3, 5, 7, 8, 9 and 16 are constructed from their SFGs discussed in Chapter 3. Note that each input xn[0], xn[1], ..., and output Xk[0], Xk[1], ..., has a bit-width of WL bits, as specified by the user. The multiplier coefficients (C1, C2, ...) depend on the underlying WFT and are precomputed in Matlab and stored in a LUT at initialization. The bit-width of the multiplier coefficients is parametric; however, we use 16-bit coefficients for acceptable precision. The address for the input memory module is calculated using a counter that stores the input samples sequentially at consecutive memory locations. The top-level architecture of the WFT module is shown in Fig. (4.2). Note that the SFGs for the Winograd Fourier transforms are
Figure 4.2. The top-level block diagram of the WFT algorithm.
already discussed in Chapter 3. The WFTs are pipelined to reduce the critical path to one multiplier. The pipeline stages are indicated by the red lines in the SFGs.
4.4 THE PRIME-FACTOR FFT ALGORITHM

The prime-factor FFT algorithm uses the CRT mapping technique, similar to the Good-Thomas FFT algorithm, to map the input samples into a two-dimensional array. This mapping turns the original transform of length N into several small DFTs of lengths N1 and N2, where N1 and N2 are co-prime numbers. Thus, the one-dimensional input is transformed into N1-point rows and N2-point columns. After mapping the DFT into a true multi-dimensional DFT by Good's method, the Winograd Fourier transform algorithm can be applied to evaluate the prime-length DFTs. Finally, the RCM technique maps the two-dimensional sequence back to a one-dimensional array of N samples at the output. The pseudo-code for the prime-factor FFT algorithm is described in Algorithm 4.1.
4.5 ARCHITECTURE

The top-level architecture of the combined Good-Thomas and Winograd prime-factor FFT algorithm is shown in Fig. 4.3; it is a 3-stage pipelined design. This section describes the hardware architecture of the Good-Thomas and Winograd prime-factor FFT algorithm. The FFT block diagram shown in Fig. 4.3 uses an 8-state finite state machine (FSM) to control the sequence of FFT operations, as shown in Fig. 4.4. The FFT process begins in the “idle” state, where the FFT processor waits for the valid input signal, represented by “DVI”, to arrive at the input. The process then goes to the “reset” state, where all the buffer registers, adders and multipliers are reset to zero. After initializing the registers, the processor goes to the “input mapping” state to map the input signal using the Chinese remainder theorem. The CRT mapping process takes N2 clock cycles to map the N2 samples, and the processor then goes to the “row transformation” state. The row transformation state performs an N2-point Winograd Fourier transform on the CRT-mapped samples and goes back to the “input mapping” state to map the next N2 samples. This process is repeated N1 times until all the input samples have been row-transformed. Note that the output samples of the N2-point FFT are stored in a block memory. After performing the row transformation, the registers and counters are initialized to perform the column transformation. The processor then goes to the “column transformation” state to perform the N1-point Winograd Fourier transform. The column transformation is repeated N2 times. We now need to perform the Ruritanian correspondence mapping of the row- and column-transformed sequence. The RCM is performed when the processor goes to the “output mapping” state. Once all samples are ready, the “DVO” signal is asserted, indicating that the Fourier-transformed sequence is ready at the output.
4.6 HARDWARE DESIGN

To design the FFT in hardware, we created a software model that accurately represents the combined Good-Thomas and Winograd FFT algorithm. Using this model, the designer can change various parameters and specifications, such as data bit-width and coefficient bit-width, to meet the design specifications. Our design approach is shown in Fig. 4.5. First, the software model was developed in floating-point format in Matlab. The floating-point model was then converted into the fixed-point model. The hardware description of the FFT was then
Algorithm 4.1 Good-Thomas and Winograd Prime-factor FFT Algorithm

procedure CRT INPUT MAPPING:
    for i ∈ N1 do
        for j ∈ N2 do
            index(i, j) = index(remainder((j − 1) × N2 × M2 + (i − 1) × N1 × M1, N));
        end for
    end for
end procedure

procedure N1-POINT WINOGRAD FFT:
    for i ∈ N2 do
        Select all the elements of column i;
        Perform the N1-point Winograd Fourier transform over the N1 elements of column i;
        The result is a two-dimensional array with the Fourier transform of all the N2 column vectors;
    end for
end procedure

procedure N2-POINT WINOGRAD FFT:
    for i ∈ N1 do
        Select all the elements of row i;
        Perform the N2-point Winograd Fourier transform over the N2 elements of row i;
        The result is a two-dimensional array with the Fourier transform of all the N1 row vectors and N2 column vectors;
    end for
end procedure

procedure OUTPUT MAPPING:
    Initialize k = 0
    for i ∈ N1 do
        for j ∈ N2 do
            index(k) = index(remainder((j − 1) × N2 + (i − 1) × N1, N));
            k = k + 1;
        end for
    end for
end procedure
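The structure of Algorithm 4.1 can be prototyped in a few lines of Python (our sketch, not the thesis Matlab/Verilog model). The small row/column transforms are shown as direct DFTs rather than Winograd modules, and this sketch uses the standard Ruritanian map on the input with the CRT map on the output; the thesis pairs the maps the other way around, but either pairing yields a twiddle-factor-free decomposition:

```python
import cmath

def dft(x):
    """Direct DFT; stands in for the small Winograd FFT modules."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]

def pfa_fft(x, N1, N2):
    """Prime-factor FFT of length N = N1 * N2 with gcd(N1, N2) = 1."""
    N = N1 * N2
    M1 = pow(N1, -1, N2)               # N1^{-1} mod N2
    M2 = pow(N2, -1, N1)               # N2^{-1} mod N1
    # Ruritanian input map onto an N1 x N2 array
    grid = [[x[(N1 * n2 + N2 * n1) % N] for n2 in range(N2)]
            for n1 in range(N1)]
    grid = [dft(row) for row in grid]  # N2-point row transforms
    for k2 in range(N2):               # N1-point column transforms
        col = dft([grid[n1][k2] for n1 in range(N1)])
        for k1 in range(N1):
            grid[k1][k2] = col[k1]
    # CRT output map back to one dimension; no twiddle factors anywhere
    X = [0] * N
    for k1 in range(N1):
        for k2 in range(N2):
            X[(N1 * M1 * k2 + N2 * M2 * k1) % N] = grid[k1][k2]
    return X
```

Because the cross terms in the exponent vanish modulo N for co-prime N1 and N2, the row and column transforms are plain short DFTs with no twiddle multiplications between the two stages.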
Figure 4.3. The Good-Thomas FFT architecture: input data sequence → input mapping (CRT) → data register → N1-point FFT → N1 × N2-point block RAM → N2-point FFT → data register → output mapping (RCM) → output data sequence.
Figure 4.4. The FSM for the combined Good-Thomas and Winograd FFT algorithm (states: idle, reset, input mapping, row transform, initialize, column transform, output mapping, and exit).

developed in Verilog HDL based on the fixed-point model. Finally, the design was synthesized and implemented on an FPGA.
Figure 4.5. The design flow: design specification → floating-point Matlab model → fixed-point Matlab model → Verilog description → functional simulations → synthesis → FPGA implementation.
4.6.1 Design Considerations

We need to carefully consider the design parameters before beginning the implementation in Matlab or Verilog. These parameters are: the precision of the data, the number of points N used for computation, and the desired operating frequency fop. The precision WL of the data refers to the number of bits per input, which we call the bit-width; both the real and imaginary data values are WL bits wide. The bit-width parameter has a direct impact on the accuracy of the results and also on the design size, performance, and power consumption. The second parameter is the number of points N used for the computation. Increasing N improves the resolution of the results. While we want to maximize this parameter, we have to keep in mind that each additional point directly increases the memory requirement of the design, as more input samples need to be stored. Thus, the upper limit on the number of points is bounded by the size of the memory available on the device. The final parameter is the circuit operating frequency, fop. This parameter defines the slowest speed at which the circuit can work and still process the incoming data without loss.
4.6.2 Parallel Processing Architecture

The combined Good-Thomas and Winograd prime-factor FFT that we have designed supports parallel processing of data, i.e., temporal parallelism between the sub-blocks. We achieve this by implementing hand-shaking signals between the different modules. Fig. 4.6 shows the hand-shaking signals across the FFT processor.
Figure 4.6. The FFT architecture with the hand-shaking (“ready”/“done”), counter-value, and DVI/DVO signals between the input mapping (CRT), N2-point FFT, N1 × N2-point memory, N1-point FFT, and output mapping (RCM) modules.
The hand-shaking signals indicate to the current module whether the previous task has been completed. If the previous task is still in progress, the subsequent processing block waits until the output of the previous task is ready.
In the FFT architecture shown in Fig. 4.6, the input module maps the first set of N2 samples, sets the “ready” flag when the N2-point counter reaches N2, and notifies the N2-point FFT module to process the data. Once this set of N2-point data is ready and passed to the FFT module, it restarts to map the next set of N2 samples and waits until the FFT process is completed. The N2-point FFT module processes the input samples as soon as the “ready” flag is set by the CRT mapping module. The FFT module processes the N2-point data and sets the “done” flag, indicating that the N2-point transform of the N2 samples is completed. The processed samples are then passed to the N1 × N2-point block memory along with the value of the N1-point counter. This counter acts as the address line to store the processed data at the corresponding location of the block RAM. The N2-point FFT module also indicates to the input mapping module that it is ready to process the next set of N2-point samples. The above process of CRT mapping and row transformation is repeated N1 times, so that all N input samples are row-transformed and stored in the N1 rows of the block RAM.
We now have to perform the column transformation on the N1-point samples. We select the N1 samples from the block RAM and set the “ready” flag. The N1 points are formed by selecting a 2 × WL-bit complex sample from every row, and the resulting N1-point sequence is passed to the N1-point FFT module to perform the column transformation. Once this set of N1-point data is ready and passed to the FFT module, the next set of N1-point samples is formed. However, this set of N1 samples is passed to the N1-point FFT module only if it is ready to accept it, as indicated by the “ready” flag. Once the FFT output is ready, it is passed to the next module to perform the output mapping. The output mapping module starts to map the received input and sets the “mapped” flag to indicate that it is ready to accept the next set of N1 samples. After all the samples are mapped, the “DVO” flag is set to indicate that the FFT output is ready.
Fig. 4.7 shows the temporal overlap of the sub-tasks of FFT computation.