A NOVEL HIGH-THROUGHPUT

FFT ARCHITECTURE FOR WIRELESS COMMUNICATION SYSTEMS

A Thesis

Presented to the

Faculty of

San Diego State University

In Partial Fulfillment

of the Requirements for the Degree

Master of Science

in

Electrical Engineering

by

Nikhilesh Vinayak Bhagat

Spring 2016


Copyright © 2016 by Nikhilesh Vinayak Bhagat

DEDICATION

To Aai and Pappa.

ABSTRACT OF THE THESIS

A Novel High-Throughput FFT Architecture for Wireless Communication Systems
by
Nikhilesh Vinayak Bhagat
Master of Science in Electrical Engineering
San Diego State University, 2016

The design of the physical layer (PHY) of the Long Term Evolution (LTE) standard is heavily influenced by the requirements for higher data transmission rates, greater spectral efficiency, and wider channel bandwidths. To fulfill these requirements, orthogonal frequency division multiplexing (OFDM) was selected as the modulation scheme at the PHY layer. The discrete Fourier transform (DFT) and the inverse discrete Fourier transform (IDFT) are fundamental building blocks of an OFDM system, and the fast Fourier transform (FFT) is an efficient implementation of the DFT. This thesis focuses on a novel high-throughput hardware architecture for FFT computation utilized in wireless communication systems, particularly in the LTE standard. We implement a fully-pipelined FFT architecture that requires fewer computations. In particular, we discuss a novel approach to implementing the FFT using the combined Good-Thomas and Winograd algorithms. It is found that the combined Good-Thomas and Winograd FFT algorithm provides a significantly more efficient FFT solution for a wide range of applications. A detailed analysis and comparison between different FFT algorithms and potential architectures suitable for the requirements of the LTE standard is presented. Theoretical results have been validated by the implementation of the proposed approach on a field-programmable gate array (FPGA). As demonstrated by the mathematical analysis, a significant reduction has been achieved in all the design parameters, such as computational delay and the number of arithmetic operations, as compared to conventional FFT architectures currently used in various wireless communication standards. It is concluded that the proposed algorithm and its hardware architecture can be efficiently used as an enhanced alternative in LTE wireless communication systems.

TABLE OF CONTENTS

ABSTRACT
LIST OF TABLES
LIST OF FIGURES
ACKNOWLEDGMENTS

CHAPTER
1 INTRODUCTION
    1.1 Review of the FFT Algorithms
    1.2 Motivation
    1.3 Contribution of Thesis
    1.4 Organization of Thesis
2 FAST FOURIER TRANSFORM ALGORITHMS
    2.1 Mapping to Two Dimensions
    2.2 The Cooley-Tukey FFT Algorithm
        2.2.1 Workload Computation
    2.3 Radix-2 Cooley-Tukey FFT
        2.3.1 Architecture of the Radix-2 FFT
        2.3.2 Workload Computation
    2.4 Radix-4 Cooley-Tukey FFT
        2.4.1 Architecture of Radix-4 FFT
        2.4.2 Workload Computation
    2.5 The Good-Thomas Prime-Factor Algorithm
        2.5.1 Workload Computation
        2.5.2 Comparison and Summary of the FFT Algorithms
3 FAST FOURIER TRANSFORMS VIA CONVOLUTION
    3.1 Rader's Algorithm
        3.1.1 Workload Computation
    3.2 Winograd Short Convolution
    3.3 Winograd Fourier Transform Algorithm
    3.4 Summary
4 GOOD-THOMAS AND WINOGRAD PRIME-FACTOR FFT ALGORITHMS
    4.1 Introduction
    4.2 Data Format
    4.3 The Winograd FFT Modules
    4.4 The Prime-Factor FFT Algorithm
    4.5 Architecture
    4.6 Hardware Design
        4.6.1 Design Considerations
        4.6.2 Parallel Processing Architecture
        4.6.3 Matlab Design
        4.6.4 Verilog Design
    4.7 Testing
    4.8 Hardware Cost and Implementation Results
        4.8.1 Latency and Throughput
    4.9 Analysis and Comparison
5 APPLICATION OF FFT IN WIRELESS COMMUNICATION SYSTEMS
    5.1 Overview
    5.2 OFDM Technique
    5.3 LTE Physical Layer
        5.3.1 Generic Frame Structure
        5.3.2 LTE Parameters
6 CONCLUSION
    6.1 Future Work
        6.1.1 VLSI Layout
BIBLIOGRAPHY

LIST OF TABLES

Table 2.1. Cooley-Tukey multiplications compared to one-dimensional DFT calculation.
Table 2.2. Comparison of the radix-2 and radix-4 algorithms.
Table 2.3. Good-Thomas products savings ratio with respect to direct DFT calculation.
Table 2.4. Comparison of the Cooley-Tukey, Radix-2, Radix-4, and Good-Thomas FFT algorithms.
Table 3.1. Determination of Nk(x).
Table 3.2. Multiplier coefficients for the 3-point WFT.
Table 3.3. Multiplier coefficients for the 5-point WFT.
Table 3.4. Multiplier coefficients for the 7-point WFT.
Table 3.5. Multiplier coefficients for the 8-point WFT.
Table 3.6. Multiplier coefficients for the 9-point WFT.
Table 3.7. Multiplier coefficients for the 16-point WFT.
Table 3.8. Computational requirements of Radix-2 and Winograd FFT algorithms.
Table 4.1. Resource utilization of the Winograd FFT algorithms.
Table 4.2. Resource utilization of the combined Good-Thomas and Winograd FFT algorithm.
Table 4.3. Performance of the Winograd FFT algorithms.
Table 4.4. Performance of the combined Good-Thomas and Winograd FFT algorithms.
Table 4.5. Resource utilization comparison with other fixed-point FFT processors.
Table 4.6. Performance comparison with other fixed-point FFT processors.
Table 5.1. FFT sizes and other physical parameters used in the current LTE standard.
Table 5.2. Available resource blocks and occupied sub-carriers.
Table 5.3. Minimum required FFT lengths and other physical parameters for LTE.
Table 5.4. Possible FFT lengths using Good-Thomas algorithm and other physical parameters for LTE.
Table 5.5. Maximum possible FFT lengths using Good-Thomas algorithm and other physical parameters for LTE.

LIST OF FIGURES

Figure 2.1. The two-dimensional mapping of a 15-point Cooley-Tukey algorithm.
Figure 2.2. Radix-2 butterfly architecture.
Figure 2.3. Radix-2 N-point flowgraph.
Figure 2.4. Radix-2 8-point flowgraph.
Figure 2.5. The top-level architecture of the Radix-2 FFT algorithm.
Figure 2.6. The Radix-2 butterfly architecture.
Figure 2.7. The shuffling unit.
Figure 2.8. Radix-4 butterfly structure.
Figure 2.9. Radix-4 FFT architecture.
Figure 2.10. The simplified Radix-4 dragonfly architecture.
Figure 2.11. The block diagram of the Good-Thomas mapping for 15-point Fourier transform.
Figure 3.1. The Rader's algorithm.
Figure 3.2. The SFG for 2-point Winograd Fourier transform algorithm.
Figure 3.3. The SFG for the 3-point Winograd Fourier transform algorithm.
Figure 3.4. The SFG for the 4-point Winograd Fourier transform algorithm.
Figure 3.5. The SFG for the 5-point Winograd Fourier transform algorithm.
Figure 3.6. The SFG for the 7-point Winograd Fourier transform algorithm.
Figure 3.7. The SFG for the 8-point Winograd Fourier transform algorithm.
Figure 3.8. The SFG for the 9-point Winograd Fourier transform algorithm.
Figure 3.9. The SFG for the 16-point Winograd Fourier transform algorithm.
Figure 3.10. Required number of multiplications for different FFT algorithms.
Figure 3.11. Required number of additions for different FFT algorithms.
Figure 4.1. The IO interface of the FFT processor.
Figure 4.2. The top-level block diagram of the WFT algorithm.
Figure 4.3. The Good-Thomas FFT architecture.
Figure 4.4. The FSM for the combined Good-Thomas and Winograd FFT algorithm.
Figure 4.5. The design flow.
Figure 4.6. FFT architecture.
Figure 4.7. Temporal parallelism of sub-tasks.
Figure 4.8. Flowchart of the operations.
Figure 4.9. The datapath of the CRT input mapping.
Figure 4.10. The top-level organization of the N × WL block RAM, where N = N1 × N2.
Figure 4.11. The datapath of the RCM output mapping.
Figure 4.12. Fourier transform output of the shifted impulse input.
Figure 4.13. Absolute error plots of fixed-point and floating-point Fourier transforms.
Figure 5.1. OFDM architecture.
Figure 5.2. OFDM sub-carrier spacing.
Figure 5.3. LTE symbol and frame structure.

ACKNOWLEDGMENTS

This thesis owes its existence to the help, support, and inspiration of several people. I am grateful to God for the good health and well-being that were necessary to complete my research.

Firstly, I would like to gratefully and sincerely thank Dr. Amir Alimohammad and Dr. fred harris for their guidance, understanding, patience, and, most importantly, their mentoring during my graduate research at San Diego State University. Their mentorship was paramount in providing a well-rounded experience consistent with my long-term career goals. They have encouraged me to grow not only as an experimentalist and a technologist but also as an instructor and an independent thinker. I am not sure many graduate students are given the opportunity to develop their own individuality and self-sufficiency by being allowed to work with such independence. For everything you've done for me, I thank you.

Prof. Amir, my sincere gratitude for your continuous support and encouragement. Your excellent input to the research and great expertise in these fields helped me achieve better results. You supported me academically by providing all the necessary facilities for the research. During the most difficult times of my thesis, you gave me the moral support and the freedom I needed to move on. I could not have imagined having a better advisor and mentor for my research.

I am extremely thankful and indebted to Dr. fred harris, who has been a constant source of encouragement and enthusiasm. It is a great honor to have him as one of my thesis advisors.

Many thanks to Dr. Marie A. Roch; I am grateful for her very valuable comments on this thesis, for her valuable time, and for readily accepting to be my thesis committee member.

I would also like to thank all of the members of the VLSI research group, especially Fang, Jai, Abhinaya, and Vamsi, for the stimulating discussions, for the sleepless nights we worked together before deadlines, and for all the fun we have had in the last two years. I thank my friends Ameya, Mihir, Kaushik, Akshay B., Akshay D., Rohit, Amruta, Payal, Bela, Prasad, Disha, Kanak, and Pratish for some much-needed humor and entertainment. I will never forget the chats and beautiful moments I shared with them. They were fundamental in supporting me during these stressful and difficult moments. A special thanks goes to Nikita for always inspiring and mentoring me, and for always being there for me. My little sister, no matter where you are around the world, you are always with me.

I am very grateful to all the people I have met along the way who have contributed to the development of my research. In particular, I would like to show my gratitude to the National Science Foundation for funding this work.

Finally, my deepest gratitude goes to my parents, Sangita and Vinayak, for their unflagging love and unconditional support throughout my life and my studies. You gave me the most unique, magical, and carefree childhood, which has made me who I am now. And most importantly, I would like to thank my girlfriend Snehal. Her support, encouragement, quiet patience, and unwavering love were undeniably the bedrock upon which the past few years of my life have been built. Her tolerance of my occasional vulgar moods is a testament in itself to her unyielding devotion and love.

CHAPTER 1 INTRODUCTION

The DFT plays a key role in a wide range of signal processing applications because it can be used as a mathematical tool to describe the relationship between the time-domain and frequency-domain representations of discrete signals. The DFT X[k] of a sequence x[n] of N terms is defined as:

$$X[k] = \sum_{n=0}^{N-1} x[n]\,\omega^{nk}, \qquad (1.1)$$

where the sequence x[n] is viewed as N consecutive samples, x[nT], of a continuous signal x(t), and $\omega = e^{-j2\pi/N}$ is the twiddle factor. Similar to the DFT, there exists an inverse DFT (IDFT), which maps a frequency-domain sequence back to its corresponding time-domain sequence. The IDFT is useful in circular convolution: the inverse DFT of the product of the DFTs of two time sequences corresponds to circularly convolving the two time sequences.
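As a concrete illustration of Equation (1.1), the following sketch (in Python with NumPy; an illustrative example, not code from this thesis) evaluates the DFT directly from the definition and checks it against a library FFT:

import numpy as np

def dft_direct(x):
    """Direct O(N^2) evaluation of the DFT in Equation (1.1)."""
    N = len(x)
    n = np.arange(N)
    # Twiddle-factor matrix: W[k, n] = exp(-j*2*pi*n*k/N)
    W = np.exp(-2j * np.pi * np.outer(n, n) / N)
    return W @ x

x = np.random.randn(16) + 1j * np.random.randn(16)
assert np.allclose(dft_direct(x), np.fft.fft(x))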

1.1 REVIEW OF THE FFT ALGORITHMS

Around 1805, Carl Friedrich Gauss invented a revolutionary technique for efficiently computing the coefficients of what is now called the discrete Fourier series [1]. Unfortunately, Gauss never published his work and it was lost for over one hundred years. In the early twentieth century, Carl Runge derived an algorithm similar to that of Gauss that could compute the coefficients of an input with size equal to a power of two, and it was later generalized to powers of three. Then in 1965, James W. Cooley and John W. Tukey discovered an algorithm for the computation of the Fourier transform that required significantly fewer computations than direct computation using Equation (2.1) [2]. This algorithm became widely known as the fast Fourier transform (FFT). It is characterized by a butterfly architecture, which is used recursively to efficiently implement the DFT. Utilizing the Cooley-Tukey FFT algorithm, the computational load was reduced from O(N^2) arithmetic operations to O(N log2 N) arithmetic operations. "The development of the FFT originally by Cooley and Tukey followed by various enhancements/modifications by other researchers has provided the incentive and the impetus for its rapid and widespread utilization in a number of diverse disciplines" [3]. Special versions of these algorithms are popularly known as the "Radix-2" and "Radix-4" FFTs.

In addition to the Cooley-Tukey approach, several algorithms, such as the prime-factor [4], split-radix [5], and Winograd Fourier transform algorithm (WFTA) [6], have been proposed over the last several decades. In 1960, I. J. Good [7] showed that the Thomas algorithm [8] could be generalized for an arbitrary number of prime factors. This came to be known as the Good-Thomas algorithm. This algorithm remained obscure until a method was found to efficiently compute the Fourier transforms of sequences whose length is a prime or a power of a prime [9]. In 1975, S. Winograd published a paper in which he announced a new algorithm for computing DFTs [6]. This algorithm became known as the Winograd Fourier transform (WFT) algorithm. The WFT algorithm is optimal from the standpoint of minimizing the number of multiplications required. The WFT algorithm combined with the prime factor algorithm (PFA) became collectively known as the Good-Thomas and Winograd prime factor algorithm. The Good-Thomas algorithm promised a drastic reduction in the amount of time a computer or hardware requires to perform the DFT. However, this reduction is achieved at the cost of increased design complexity.

1.2 MOTIVATION

A common module in recent wireless communication standards, such as 3GPP LTE, is the FFT and its inverse. Although many architectures exist for traditional power-of-two FFT lengths, the recent 3GPP LTE standard also defines non-power-of-two transform lengths. The main motivation behind this work is to design and implement a high-performance FFT architecture for wireless communication systems. It has been observed that efficient hardware implementation of an FFT algorithm requires a relatively large number of multipliers. Many of the commonly used algorithms require a large number of multiplications by twiddle factors, which directly impacts the silicon area and power consumption. This research focuses on the development of a novel high-performance FFT algorithm that does not require any twiddle factor multiplications, and hence significantly reduces the computational complexity. In order to achieve this goal, our work is oriented towards the Good-Thomas and Winograd FFT algorithms. Unfortunately, the Good-Thomas and Winograd algorithms have received very little attention among VLSI designers compared to the Cooley-Tukey algorithms, perhaps due to the mathematical complexity of these techniques.

1.3 CONTRIBUTION OF THESIS

While it is unlikely that the completion of this thesis will trigger a revolution similar to that which followed the publication of the Cooley and Tukey paper, it is hoped that this thesis will demonstrate the effectiveness of the combined Good-Thomas and Winograd FFT algorithms. In FFT designs for LTE systems, research is concentrated on variable-length and low-power FFT architectures. Effort is also being directed towards finding a universal computation structure for FFTs, which can also compute other transforms, such as the discrete cosine transform (DCT) and the discrete Hartley transform (DHT), with a single computation block. This work presents the analysis and design of various FFT algorithms suitable for wireless communication systems. A major contribution of this thesis is to show that the combined Good-Thomas and Winograd FFT is a preferred alternative for practical cases in LTE as compared to various other widely used FFT algorithms. It also identifies and demonstrates various critical design issues that need to be considered while designing FFT blocks for LTE. This theoretical analysis has been supported with the FPGA implementation results for the proposed architectures. It is shown that the combined Good-Thomas and Winograd FFT algorithm provides several advantages over other conventional algorithms when used in reconfigurable systems, although this comes with some trade-offs.

1.4 ORGANIZATION OF THESIS

This thesis is organized as follows:

• Chapter 2: provides detailed information on the implementation and complexity of the FFT algorithms discussed above. A comparison based on the computational complexity of these algorithms is presented and discussed. Next, the software implementations of these algorithms are presented and their performance is evaluated in terms of timing, memory, and complexity.

• Chapter 3: discusses different Fourier transform algorithms that can be constructed using the convolution property.

• Chapter 4: presents the hardware implementation of the Good-Thomas and Winograd algorithms and their hardware architectures. The design of an FFT processor is discussed. Different abstraction levels, block diagrams, trade-offs, and simulation results are also discussed.

• Chapter 5: focuses on the LTE physical layer parameters. A brief discussion of the LTE standard and the application of our combined Good-Thomas and Winograd FFT architecture in LTE wireless communication systems is presented.

• Chapter 6: provides some concluding remarks and presents a few suggestions for future work.

CHAPTER 2 FAST FOURIER TRANSFORM ALGORITHMS

There are two basic strategies for computing the discrete Fourier transform. One strategy is to change a one-dimensional Fourier transform into a two-dimensional Fourier transform, which is more efficient to compute. The second strategy is to change a one-dimensional Fourier transform into a small convolution, which can be computed using the techniques described in Chapter 3. These strategies can be used effectively to minimize the computational complexity of the FFT calculation. This chapter discusses the Cooley-Tukey and Good-Thomas FFT algorithms and compares their computational complexities. The Winograd Fourier transform algorithm (WFTA) and Rader's algorithm are also discussed in detail. A distinguishing feature of the WFTA and Rader's algorithm is that the transform is based on convolution. As shown, this reduces the number of multiplications required for the FFT computation significantly. Formal derivations of the algorithms used in the Radix-2 and Winograd FFTs will be presented.

2.1 MAPPING TO TWO DIMENSIONS

The goal of index mapping is to convert a single complex problem into multiple simple ones so that the number of multiplications and additions is reduced. This can be achieved by mapping the one-dimensional DFT into a two-dimensional transform. Further computational savings are achieved if this process is carried on iteratively to higher dimensions. The mapping to a multi-dimensional transform results in a significantly smaller number of data transfers and arithmetic operations relative to those required for the direct implementation of the DFT. If the length N of the DFT is not prime, N can be factored as

N = N1 × N2. The two new independent variables are defined as n1 = 0, 1, 2, ..., N1 − 1 and n2 = 0, 1, 2, ..., N2 − 1, where the linear equation that maps n1 and n2 onto n is given by n = K1n1 + K2n2 mod N. The constants K1 and K2 can be defined using the following two classes of index mapping:

• Common factor mapping (CFM), where K1 = aN2 or K2 = bN1 but not both.

• Prime factor mapping (PFM), where K1 = aN2 and K2 = bN1.

The choice of "and/or" creates the difference between the two mappings. Note that the PFM can be used only if the factors are relatively prime, while the CFM can be used for any factors. The one-dimensional DFT is expressed as:

$$X[k] = \sum_{n=0}^{N-1} x[n]\, W_N^{nk}, \qquad (2.1)$$

where n = 0, 1, ..., N − 1, k = 0, 1, ..., N − 1, and $W_N = e^{-j2\pi/N}$. The objective is to convert this equation into a two-dimensional form by applying substitution mapping techniques on the indices, with the output index mapped analogously as k = K3k1 + K4k2 mod N. This results in the following two-dimensional representation:

$$X_{k_1k_2} = \sum_{n_1=0}^{N_1-1} \sum_{n_2=0}^{N_2-1} W^{K_1K_3n_1k_1}\, W^{K_1K_4n_1k_2}\, W^{K_2K_3n_2k_1}\, W^{K_2K_4n_2k_2}\, x[n_1n_2]. \qquad (2.2)$$

The Cooley-Tukey FFT algorithm relies on the CFM, while the PFM is used by the Good-Thomas FFT algorithm.
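The following short Python sketch (illustrative only) verifies that both classes of mapping are one-to-one for N = 15 with N1 = 3 and N2 = 5, the example used throughout this chapter:

def index_map(N, N1, N2, K1, K2):
    """Set of indices produced by n = (K1*n1 + K2*n2) mod N."""
    return {(K1 * n1 + K2 * n2) % N for n1 in range(N1) for n2 in range(N2)}

N, N1, N2 = 15, 3, 5
# CFM: K2 = b*N1 but K1 != a*N2 (here K1 = 1, K2 = N1), as used in Section 2.2
assert index_map(N, N1, N2, 1, N1) == set(range(N))
# PFM: K1 = a*N2 and K2 = b*N1; valid here because gcd(N1, N2) = 1
assert index_map(N, N1, N2, N2, N1) == set(range(N))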

2.2 THE COOLEY-TUKEY FFT ALGORITHM

This section investigates the particular properties that result from using the CFM to calculate the DFT in an efficient way. Algorithms of this kind are known collectively as fast Fourier transform (FFT) algorithms. To derive the general form of the Cooley-Tukey FFT algorithm, suppose that the length N of the transform is composite, i.e., N = N1 × N2, where

N1 and N2 are integer factors of N. Next, the index mapping is defined as n = n1 + n2N1, where n1 = 0, 1, ..., N1 − 1 and n2 = 0, 1, ..., N2 − 1, and k = k2 + k1N2, where k1 = 0, 1, ..., N1 − 1 and k2 = 0, 1, ..., N2 − 1. These index mappings are implicitly evaluated modulo N; the reduction is not written explicitly because n does not exceed N. By substituting the indices, the DFT in (2.2) can be represented in two dimensions as:

$$X_{k_1k_2} = \sum_{n_1=0}^{N_1-1} \sum_{n_2=0}^{N_2-1} W_{N_1}^{n_1k_1}\, W_{N}^{n_1k_2}\, W_{N_2}^{n_2k_2}\, x[n_1n_2]. \qquad (2.3)$$

Fig. 2.1 shows a fifteen-point transform as two dimensions with N1 = 3 and N2 = 5. For instance, suppose one needs to map the index n = 4 from the one-dimensional array to the two-dimensional array. The equation n = 4 = n1 + 3n2 is satisfied by n1 = 1 and n2 = 1, which are the coordinates indicated in Fig. 2.1. Using this indexing scheme, the input and output data vectors can be mapped into two-dimensional arrays. Note that the components of the transform X[k] are arranged differently than the components of the signal x[n]. This is also known as address shuffling. The first step of the Cooley-Tukey algorithm derivation is to substitute the indices in

(2.1) along with the relation N = N1 × N2 to obtain the following equation:

$$X_{k_1k_2} = \sum_{n_1=0}^{N_1-1} \sum_{n_2=0}^{N_2-1} W_{N_1N_2}^{(n_1+n_2N_1)(k_2+k_1N_2)}\, x[n_1n_2], \qquad (2.4)$$

Figure 2.1. The two-dimensional mapping of a 15-point Cooley-Tukey algorithm.

where n1 = 0, 1,...,N1 − 1, n2 = 0, 1,...,N2 − 1, k1 = 0, 1,...,N1 − 1, and k2 = 0, 1,...,N2 − 1. The twiddle factor W can be expanded as:

$$W_{N_1N_2}^{(n_1+n_2N_1)(k_2+k_1N_2)} = W_{N_1N_2}^{n_1k_2}\, W_{N_1N_2}^{n_1k_1N_2}\, W_{N_1N_2}^{n_2k_2N_1}\, W_{N_1N_2}^{n_2k_1N_1N_2}. \qquad (2.5)$$

However, $W_{N_1N_2}^{n_2k_1N_1N_2} = W_N^{N(n_2k_1)} = 1$, $W_{N_1N_2}^{n_1k_1N_2} = e^{-j2\pi N_2(n_1k_1)/(N_1N_2)} = W_{N_1}^{n_1k_1}$, and $W_{N_1N_2}^{n_2k_2N_1} = W_{N_2}^{n_2k_2}$. Therefore, the two-dimensional Cooley-Tukey Fourier transform equation is given by:

$$X_{k_1k_2} = \sum_{n_1=0}^{N_1-1} \sum_{n_2=0}^{N_2-1} W_{N_1}^{n_1k_1}\, W_{N}^{n_1k_2}\, W_{N_2}^{n_2k_2}\, x[n_1n_2]. \qquad (2.6)$$

The transform is first applied to the n1 dimension, i.e., the columns, and the result is then multiplied by the twiddle factors. The twiddle factor, being a function of n1 and n2, prevents the two one-dimensional transforms from being independent of each other.

The Cooley-Tukey transform can be summarized as follows:

(i) Map the indices into two dimensions n1 × n2.

(ii) Transform the columns k1 × n2.

(iii) Multiply by the twiddle factors $W_N^{n_1k_2}$.

(iv) Transform the rows k1 × k2.

(v) Map the indices into one dimension k.

It can be observed that the one-dimensional input sequence n is mapped into n1 × n2 using the CFM mentioned earlier. We then perform the column transformation using an N1-point FFT and multiply the resulting two-dimensional array with the twiddle factors. An N2-point FFT is then performed on each of the N1 rows to form the k1 × k2 array. The resulting two-dimensional array is then unloaded column-wise to form the one-dimensional transform k. The procedure and pseudo-code for the Cooley-Tukey FFT algorithm are described in Algorithm 2.1.

2.2.1 Workload Computation

The computation of the Cooley-Tukey FFT can be seen as a mapping of a one-dimensional signal-domain array into a two-dimensional transform-domain array. To count the number of multiplications and additions, it is assumed that the row and column transforms are computed using direct DFTs. The computation consists of an N1-point DFT on each column, followed by element-by-element complex multiplications throughout the new array by $W_N^{n_1k_2}$, followed by an N2-point DFT on each row. This results in N1 column transforms, each requiring approximately $N_2^2$ multiplications and additions, and N2 row transforms, each requiring $N_1^2$ multiplications and additions. The workload count must also include the twiddle factor multiplications. Hence,

• Number of multiplications: $N_1N_2^2 + N_2N_1^2 + N_1N_2 = N_1N_2(N_1 + N_2 + 1)$.
• Number of additions: $N_1N_2^2 + N_2N_1^2 = N_1N_2(N_1 + N_2)$.

However, for the direct one-dimensional approach, the number of additions is equal to the number of multiplications and is $N^2 = N_1N_2 \times N_1N_2$. The computational saving in terms of the required multiplications can be written as:

$$R = \frac{\text{Number of multiplications (Cooley-Tukey)}}{\text{Number of multiplications (direct)}} = \frac{N_1 + N_2 + 1}{N_1N_2}.$$

Table 2.1 shows some examples of the savings ratio for different FFT lengths. By using the Cooley-Tukey algorithm, a transform of length N can be decomposed into a form requiring fewer complex multiplications. It can be observed that the Cooley-Tukey FFT algorithm provides a better approach for Fourier transform computation compared to the direct computation. Note that the computational saving is not directly proportional to the actual time reduction of the program execution, as it does not take the addressing overhead into consideration; it merely represents the amount of arithmetic computation saved.

Algorithm 2.1 Cooley-Tukey algorithm

procedure INPUT MAPPING
    for i ≤ N1 do
        for j ≤ N2 do
            index(i, j) = index(i + j × N2)
        end for
    end for
end procedure

procedure COLUMN TRANSFORMATION
    for i ≤ N2 do
        Select all the elements of column i
        Perform an N1-point Fourier transform over the N1 elements of column i
        The result is a two-dimensional array with the Fourier transform of all the N2 columns
    end for
end procedure

procedure MULTIPLICATION WITH TWIDDLE FACTORS
    The twiddle factors WN^{nk} are calculated based on the index values n and k
    for i ≤ N1 do
        for j ≤ N2 do
            element(i, j) = element(i, j) × WN^{ij}
        end for
    end for
end procedure

procedure ROW TRANSFORMATION
    for i ≤ N1 do
        Select all the elements of row i
        Perform an N2-point Fourier transform over the N2 elements of row i
        The result is a two-dimensional array with the Fourier transforms of all the N1 row vectors and N2 column vectors
    end for
end procedure

procedure OUTPUT MAPPING
    Initialize k = 0
    for i ≤ N1 do
        for j ≤ N2 do
            index(k) = index(i + j × N2)
            k = k + 1
        end for
    end for
end procedure
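A compact NumPy sketch of this procedure, following Equation (2.6) with n = n1 + N1·n2 and k = k2 + N2·k1, is given below (illustrative only, not the thesis code; np.fft.fft stands in for the short row and column DFTs, and the n2 dimension is transformed before the twiddle multiplication, which yields the same transform as the column-first phrasing above):

import numpy as np

def cooley_tukey(x, N1, N2):
    """DFT of x (length N = N1*N2) via the 2-D decomposition of Eq. (2.6)."""
    N = N1 * N2
    # Input mapping: A[n1, n2] = x[n1 + N1*n2]
    A = np.asarray(x, dtype=complex).reshape(N2, N1).T
    # N2-point DFTs over n2 (one per value of n1)
    C = np.fft.fft(A, axis=1)                       # C[n1, k2]
    # Element-by-element multiplication by the twiddle factors W_N^{n1*k2}
    n1 = np.arange(N1)[:, None]
    k2 = np.arange(N2)[None, :]
    C *= np.exp(-2j * np.pi * n1 * k2 / N)
    # N1-point DFTs over n1 (one per value of k2)
    X2 = np.fft.fft(C, axis=0)                      # X2[k1, k2]
    # Output mapping: X[k2 + N2*k1] = X2[k1, k2]
    return X2.reshape(N)

x = np.random.randn(15) + 1j * np.random.randn(15)
assert np.allclose(cooley_tukey(x, N1=3, N2=5), np.fft.fft(x))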

Table 2.1. Cooley-Tukey multiplications compared to one-dimensional DFT calculation

N      N1   N2   R
6      3    2    1.0
20     5    4    0.5
120    15   8    0.20
250    25   10   0.14
1008   16   63   0.08

The number of computations required for the Cooley-Tukey algorithm has been further reduced by the introduction of the radix algorithms [3]. The Radix-2 and Radix-4 algorithms, which are based on the Cooley-Tukey algorithm, are among the most widely used algorithms and are discussed in the subsequent sections.

2.3 RADIX-2 COOLEY-TUKEY FFT

The direct computation of an N-point DFT requires nearly O(N^2) complex arithmetic operations (i.e., multiplications and additions). This complexity has been significantly reduced by the Cooley-Tukey FFT, which requires O(N log2 N) complex arithmetic operations. Undoubtedly, the Cooley-Tukey FFT algorithm is one of the most widely used FFT algorithms. Many applications of the Cooley-Tukey FFT use a blocklength N that is a power of two or of four. The blocklength 2^m can be factored either as 2^{m−1} × 2 or as 2 × 2^{m−1}, which is called a Radix-2 Cooley-Tukey FFT. Similarly, in the Radix-4 Cooley-Tukey FFT, the blocklength 4^m is factored either as 4^{m−1} × 4 or as 4 × 4^{m−1}. In addition to the radix-2 and radix-4 algorithms, several other FFT algorithms are available, such as the radix-8, mixed-radix, split-radix, vector-radix, and vector-split-radix algorithms [10]. However, we will only discuss the popular and widely preferred radix-2 and radix-4 FFT algorithms in detail. In this section, we will initially discuss the decimation-in-time (DIT) and decimation-in-frequency (DIF) FFT algorithms. The detailed development will be based on radix-2 and will then be extended to other radices, such as radix-4. The 2^m-point Cooley-Tukey algorithm with N1 = 2 and N2 = 2^{m−1} = N/2 is known as the Radix-2 DIT algorithm. Starting with Equation (2.6), we can write:

$$X_{k_1k_2} = \sum_{n_1=0}^{N_1-1} \sum_{n_2=0}^{N_2-1} W_{N_1}^{n_1k_1}\, W_{N}^{n_1k_2}\, W_{N_2}^{n_2k_2}\, x[n_1n_2], \qquad (2.7)$$

which represents the N-point Cooley-Tukey FFT. The first stage of the FFT can be defined by

choosing the values of N1 and N2. This is equivalent to splitting the N-point sequence into two N/2-point sequences, x[2m] and x[2m + 1], corresponding to the even and odd samples of x[n]. By definition, n2 = 0, 1, ..., N/2 − 1 and k2 = 0, 1, ..., N/2 − 1. Also, n1 = 0, 1 and k1 = 0, 1.

Under these conditions, $X_{k_1k_2}$, which is referred to as $X_k$ and $X_{k+N/2}$, becomes:

$$X_k = \sum_{n_2=0}^{N/2-1} W_N^{2n_2k_2}\, x[2n_2] + W_N^{k_2} \sum_{n_2=0}^{N/2-1} W_N^{2n_2k_2}\, x[2n_2+1]. \qquad (2.8)$$

Since $W_N^{N/2} = -1$, we can write:

$$X_{k+N/2} = \sum_{n_2=0}^{N/2-1} W_N^{2n_2k_2}\, x[2n_2] - W_N^{k_2} \sum_{n_2=0}^{N/2-1} W_N^{2n_2k_2}\, x[2n_2+1]. \qquad (2.9)$$

Equations (2.8) and (2.9) can be written as follows:

$$X_k = A(k) + W_N^{k}\, B(k), \qquad (2.10)$$

$$X_{k+N/2} = A(k) - W_N^{k}\, B(k). \qquad (2.11)$$

This is illustrated by the DIT butterfly structure in Fig. 2.2. The basic processing element (PE) of the FFT operation is known as the butterfly. The butterfly structure consists of an addition, a subtraction, and a complex multiplication by the twiddle factor. The DIT and DIF algorithms differ in structure and in the sequence of the computations; however, the number of operations in both algorithms is the same. The performance of the DIT and DIF FFTs is the same, but a user may prefer one of them because of implementation considerations. The DIT or DIF algorithm is used recursively, in which at each level an N-point Fourier transform is replaced by two N/2-point Fourier transforms.

Figure 2.2. Radix-2 butterfly architecture.
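In code, the butterfly of Fig. 2.2, i.e., Equations (2.10) and (2.11), is simply (an illustrative Python sketch):

import numpy as np

def radix2_butterfly(a, b, k, N):
    """One radix-2 DIT butterfly combining A(k) and B(k), Eqs. (2.10)-(2.11)."""
    w = np.exp(-2j * np.pi * k / N)   # twiddle factor W_N^k
    return a + w * b, a - w * b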

Figure 2.3. Radix-2 N-point flowgraph.

For larger values of N, the flow graph would appear as in Fig. 2.3, where the blocks are replaced similarly until only two-point transforms remain. The entire signal flow graph of an 8-point DIT Radix-2 FFT is shown in Fig. 2.4. It can be seen that the signal flow has 3 stages and each stage consists of 2-point butterfly structures. The pseudo-code for the radix-2 DIT algorithm is described in Algorithm 2.2.

2.3.1 Architecture of the Radix-2 FFT

A block diagram of the N-point radix-2 FFT architecture implemented in [11] is shown in Fig. 2.5. The circuit consists of the FFT processor, which performs the operations required for the computation of the FFT (butterfly and shuffling units), and two adaptation units: the input mapping unit, which maps the input stream of samples onto two separate ones, and the output mapping unit, which maps the two streams of outputs of the FFT processor onto one stream of samples in natural order. The FFT processor consists of n butterfly units and (n − 1) shuffling units, where n = log2 N. The two inputs of the butterfly unit of a stage are generated from the output of the butterfly unit of the previous stage at different time points. These inputs are obtained from either the upper or the lower output of the butterfly unit of the previous stage. A shuffling unit is inserted between two successive butterfly units to route these outputs to the corresponding inputs.

Figure 2.4. Radix-2 8-point flowgraph.

Figure 2.5. The top-level architecture of the Radix-2 FFT algorithm.

2.3.1.1 INPUT AND OUTPUT MAPPING

The input mapping unit is needed so that the input data stream can be separated into even and odd samples. The input mapping unit can be a de-multiplexer that separates the input data stream. The output mapping unit maps the output sequence of the last butterfly unit onto one stream of samples in natural order. The reverse operation for the formation of the output stream of samples is performed by a multiplexer.

2.3.1.2 BUTTERFLY STRUCTURE

The term "butterfly" appears in the context of the Cooley-Tukey FFT algorithm, which recursively breaks down a DFT of composite size n into r smaller transforms of size m, where r is the "radix" of the transform. These smaller DFTs are then combined via size-r butterflies, which are themselves DFTs of size r pre-multiplied by roots of unity, known as twiddle factors. The datapath of the radix-2 butterfly can be constructed from the signal flow graph shown in Fig. 2.2. Let $X_n = a + ib$, $X_{n+N/2} = c + id$, and the twiddle factor $W_N = \cos\theta - i\sin\theta$, where a, b, c, d, cos θ, and sin θ are real numbers. Thus, the output is given by

$X_k = (a + c\cos\theta + d\sin\theta) + i(b + d\cos\theta - c\sin\theta)$ and $X_{k+N/2} = (a - c\cos\theta - d\sin\theta) + i(b - d\cos\theta + c\sin\theta)$. The real and imaginary parts are each expressed as the sum of three real numbers.

Algorithm 2.2 Radix-2 DIT algorithm

x is the input vector of complex samples
Nin is the number of input samples
m is the number of stages, given by m = ⌈log2(Nin)⌉
N is the number of FFT points, given by N = 2^m
Append N − Nin zeros to the input vector: x = [x, zeros(N − Nin)]
The twiddle factor vector WN is calculated as:
    WN = cos(2π/N × [0 : (N/2 − 1)]) − j × sin(2π/N × [0 : (N/2 − 1)])
The intermediate butterfly matrix is computed as follows:
for i ≤ m do
    For every stage calculate A(k) and B(k) for the even and odd samples
    for k ≤ N/2 do
        A(k) = x(n + k + 1)
        B(k) = x(n + k + N/2 + 1) × WN(ik)
        X(k) = A(k) + B(k)
        X(k + N/2) = A(k) − B(k)
    end for
end for

Thus, we can use 2-input adders and multipliers to implement the butterfly unit as shown in Fig. 2.6, where "R" denotes a register.

2.3.1.3 SHUFFLING UNIT STRUCTURE

The shuffling unit is implemented using the design shown in Fig. 2.7. This circuit consists of delay elements for the synchronization of the outputs of the previous butterfly unit and two multiplexers that perform the routing of the outputs to the corresponding inputs of the next butterfly unit. The control signal of the multiplexers first takes the value 0, so that the upper output of the butterfly unit of stage s is routed to the upper input of the butterfly unit of stage s + 1, and the lower output of the butterfly unit of stage s is routed to the lower input of the butterfly unit of stage s + 1. Next, the control signal takes the value 1, so that the upper output is routed to the lower input, while the lower output of the butterfly unit of stage s is routed to the upper input of the butterfly unit of stage s + 1.

2.3.2 Workload Computation

The computation of an N-point DFT is replaced by that of two DFTs of length N/2, plus N additions and N/2 multiplications by the twiddle factors $W_N^{k_2}$. The same procedure can be applied repeatedly to replace the two DFTs of length N/2 by four DFTs of length N/4 at the cost

Figure 2.6. The Radix-2 butterfly architecture.

Figure 2.7. The shuffling unit.

of N additions and N/2 multiplications by the twiddle factors. Thus, a systematic application of this method computes a DFT of length 2^m in m = log2 N stages, each stage converting 2^i DFTs of length 2^{m−i} into 2^{i+1} DFTs of length 2^{m−i−1} at the cost of N additions and N/2 multiplications by the twiddle factors. Consequently, the number of complex additions and complex multiplications required to compute a DFT of length N by the radix-2 FFT algorithm is:

• Number of multiplications: (N/2) log2 N

• Number of additions: N log2 N
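These counts can be tabulated with a few lines of code (an illustrative check):

import math

for N in [4, 16, 64, 256, 1024]:
    m = int(math.log2(N))
    print(f"N={N:5d}: {N // 2 * m:6d} complex mults, "
          f"{N * m:6d} complex adds (direct DFT: {N * N} mults)")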

2.4 RADIX-4 COOLEY-TUKEY FFT

The radix-4 Cooley-Tukey FFT algorithm has also been widely adopted in various applications. It can be used when the blocklength N is a power of four: 4^m is factored either as 4^{m−1} × 4 or as 4 × 4^{m−1}. The radix-4 algorithm is derived in a similar manner to the radix-2 algorithm. The equations of this FFT can be obtained by setting N1 = 4

and N2 = N/4 in the general Equation (2.6) for the Cooley-Tukey FFT. In the radix-2 DIF FFT, the DFT equation is expressed as the summation of two calculations: one over the first half and one over the second half of the input sequence. Similarly, the radix-4 DIF FFT expresses the DFT equation as four summations, then divides it into four equations, each of which computes every fourth output sample. The radix-4 DIF can be written as:

$$X_k = \sum_{n=0}^{\frac{N}{4}-1} x[n]\, W_N^{nk} + \sum_{n=\frac{N}{4}}^{\frac{2N}{4}-1} x[n]\, W_N^{nk} + \sum_{n=\frac{2N}{4}}^{\frac{3N}{4}-1} x[n]\, W_N^{nk} + \sum_{n=\frac{3N}{4}}^{N-1} x[n]\, W_N^{nk}. \qquad (2.12)$$

Let n = n + N/4, n = n + N/2, and n = n + 3N/4 in the second, third, and fourth summations, respectively. Equation (2.12) can be re-written as:

$$X_k = \sum_{n=0}^{\frac{N}{4}-1} x[n]\, W_N^{nk} + \sum_{n=0}^{\frac{N}{4}-1} x\!\left[n+\frac{N}{4}\right] W_N^{(n+\frac{N}{4})k} + \sum_{n=0}^{\frac{N}{4}-1} x\!\left[n+\frac{2N}{4}\right] W_N^{(n+\frac{2N}{4})k} + \sum_{n=0}^{\frac{N}{4}-1} x\!\left[n+\frac{3N}{4}\right] W_N^{(n+\frac{3N}{4})k}, \qquad (2.13)$$

which represents four N/4-point DFTs. Using $W_N^{\frac{N}{4}k} = (-j)^k$, $W_N^{\frac{N}{2}k} = (-1)^k$, and $W_N^{\frac{3N}{4}k} = (j)^k$, Equation (2.13) can be written as:

$$X_k = \sum_{n=0}^{\frac{N}{4}-1} \left[x[n] + (-j)^k\, x\!\left[n+\frac{N}{4}\right] + (-1)^k\, x\!\left[n+\frac{N}{2}\right] + (j)^k\, x\!\left[n+\frac{3N}{4}\right]\right] W_N^{nk}. \qquad (2.14)$$

Let $W_N^4 = W_{N/4}$. Equation (2.14) can then be written as:

$$X_{4k} = \sum_{n=0}^{\frac{N}{4}-1} \left[x[n] + x\!\left[n+\frac{N}{4}\right] + x\!\left[n+\frac{N}{2}\right] + x\!\left[n+\frac{3N}{4}\right]\right] W_{N/4}^{nk}, \qquad (2.15)$$

$$X_{4k+1} = \sum_{n=0}^{\frac{N}{4}-1} \left[x[n] - j\, x\!\left[n+\frac{N}{4}\right] - x\!\left[n+\frac{N}{2}\right] + j\, x\!\left[n+\frac{3N}{4}\right]\right] W_{N/4}^{nk}\, W_N^{n}, \qquad (2.16)$$

$$X_{4k+2} = \sum_{n=0}^{\frac{N}{4}-1} \left[x[n] - x\!\left[n+\frac{N}{4}\right] + x\!\left[n+\frac{N}{2}\right] - x\!\left[n+\frac{3N}{4}\right]\right] W_{N/4}^{nk}\, W_N^{2n}, \qquad (2.17)$$

$$X_{4k+3} = \sum_{n=0}^{\frac{N}{4}-1} \left[x[n] + j\, x\!\left[n+\frac{N}{4}\right] - x\!\left[n+\frac{N}{2}\right] - j\, x\!\left[n+\frac{3N}{4}\right]\right] W_{N/4}^{nk}\, W_N^{3n}, \qquad (2.18)$$

for k = 0, 1, ..., (N/4) − 1. Equations (2.15) through (2.18) represent a decomposition process yielding four N/4-point DFTs. The butterfly of a radix-4 algorithm consists of four inputs and four outputs with intermediate adders and complex twiddle factor multipliers. The signal flow graph of the radix-4 butterfly structure is shown in Fig. 2.8. The radix-4 FFT essentially combines two stages of the radix-2 FFT into one, so that half as many stages are required.

Figure 2.8. Radix-4 butterfly structure.

The pseudo-code for the radix-4 FFT algorithm is described in Algorithm 2.3:

2.4.1 Architecture of Radix-4 FFT

The architecture for the radix-4 FFT in [12] is based on analyzing the SFG of the FFT. The radix-4 FFT divides an N-point DFT into four N/4-point DFTs, then into sixteen N/16-point DFTs, and so on at every stage. The radix-4 FFT expresses the DFT equation as four summations and then divides it into four equations, each of which computes every fourth output sample, as described in Equations (2.15) through (2.18). An example of a 1024-point FFT architecture that processes 1024 complex samples with a 16-bit word length is shown in Fig. 2.9. These samples are stored in the input memory "RAM1". The radix-4 butterfly takes 4 samples, and the result is stored in the output memory "RAM2". This process is repeated for all the butterfly computations of every stage.

2.4.1.1 BUTTERFLY STRUCTURE

The butterfly structure takes 4 signed fixed-point samples from memory and computes the radix-4 butterfly operation. The datapath of the radix-4 butterfly can be constructed from the SFG, as shown in Fig. 2.10.

2.4.1.2 RAM MODULES

The FFT processor has three memory blocks: two 1024 × 16 BRAMs, one for storing the input samples and one for storing the output results, and a 768 × 16 ROM for storing the twiddle factors (also referred to as the sine and cosine look-up table). The real and imaginary signals are stored in different block memories.

Algorithm 2.3 Radix-4 FFT algorithm

Input x is a vector of complex input samples
Nin is the number of input samples
m is the number of stages, given by m = ⌈log4(Nin)⌉
N is the number of FFT points, given by N = 4^m
Append N − Nin zeros to the input vector: x = [x, zeros(N − Nin)]
The twiddle factor vector WN is calculated as:
    WN = cos(2π/N × [0 : (N − 1)]) − j × sin(2π/N × [0 : (N − 1)])
The intermediate radix-4 butterfly for m stages is computed as below:
for i ≤ m do
    For every stage calculate A(k), B(k), C(k), and D(k) as follows:
    for k = 1 : 4 : N/4 do
        Let ak = x(n + k), bk = x(n + k + 1), ck = x(n + k + 2), and dk = x(n + k + 3)
        X(k) = ak + bk + ck + dk
        X(k + N/4) = (ak − bk + ck − dk) × WN(2 × ik)
        X(k + N/2) = (ak − j × bk − ck + j × dk) × WN(ik)
        X(k + 3N/4) = (ak + j × bk − ck − j × dk) × WN(3 × ik)
    end for
end for
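An illustrative NumPy sketch of the radix-4 DIF decomposition in Equations (2.15) through (2.18), recursing on the four quarter-length DFTs (not the thesis implementation; N is assumed to be a power of four):

import numpy as np

def fft_radix4(x):
    """Recursive radix-4 decimation-in-frequency FFT."""
    N = len(x)
    if N == 1:
        return x
    q = N // 4
    a, b, c, d = x[:q], x[q:2*q], x[2*q:3*q], x[3*q:]
    W = np.exp(-2j * np.pi * np.arange(q) / N)          # twiddle factors W_N^n
    # Four N/4-point sub-DFTs, Equations (2.15)-(2.18)
    X0 = fft_radix4(a + b + c + d)                      # X[4k]
    X1 = fft_radix4((a - 1j*b - c + 1j*d) * W)          # X[4k+1]
    X2 = fft_radix4((a - b + c - d) * W**2)             # X[4k+2]
    X3 = fft_radix4((a + 1j*b - c - 1j*d) * W**3)       # X[4k+3]
    X = np.empty(N, dtype=complex)                      # interleave the outputs
    X[0::4], X[1::4], X[2::4], X[3::4] = X0, X1, X2, X3
    return X

x = np.random.randn(16) + 1j * np.random.randn(16)
assert np.allclose(fft_radix4(np.asarray(x, dtype=complex)), np.fft.fft(x))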

Figure 2.9. Radix-4 FFT architecture.

Figure 2.10. The simplified Radix-4 dragonfly architecture.

2.4.2 Workload Computation

The radix-4 DIT algorithm rearranges the DFT equation into four parts. If the FFT length N is 4^m, the shorter-length DFTs can be further decomposed recursively to produce the full radix-4 DIT FFT. The radix-4 transform is also more efficient in terms of the number of accesses to the memory. In the radix-4 DIT FFT, each decomposition stage creates additional savings in computation. To determine the total computational cost of the radix-4 FFT, note that there are M = log4 N = (log2 N)/2 stages, each with N/4 butterflies per stage. Each radix-4 butterfly requires 3 complex multiplications and 8 complex additions. The total cost is given below.

• Number of multiplications: (3N/8) log2 N

• Number of additions: 8(N/8) log2 N = N log2 N

The computational requirements of the radix-2 and radix-4 algorithms are shown in Table 2.2 [13]. It can be seen that the radix-4 algorithm reduces the number of multiplications by about 25 percent. A slight improvement can also be obtained by using radix-8 or radix-16 algorithms. When N is not a power of a single radix, one is prompted to use the mixed-radix approach [14]. One main drawback of these radix algorithms is that they are limited to block lengths that are a power of 4 for radix-4 or a power of 2 for radix-2.

Table 2.2. Comparison of the radix-2 and radix-4 algorithms

                     Radix-2                       Radix-4
DFT size N   Multiplications  Additions    Multiplications  Additions
4                  16             24             12             16
16                128            192             96            146
64                768           1152            576            930
256              4096           6144           3072           5122
1024            20480          30720          15360          26114

2.5 THE GOOD-THOMAS PRIME-FACTOR ALGORITHM

This section discusses the FFT algorithms that result from using the prime-factor index mapping. One such algorithm is the Good-Thomas prime factor algorithm. Good and Thomas suggested one more method of converting a Fourier transform from one dimension to two dimensions, which came to be regarded as the Good-Thomas prime factor algorithm. It is based on the factorization of the blocklength into distinct prime powers. It requires significantly less computation than the Cooley-Tukey algorithm, while being conceptually more complicated. The Good-Thomas algorithm differs from the Cooley-Tukey algorithm in that it requires N1 and N2 to be relatively prime, i.e., GCD(N1, N2) = 1, and in that it maps into a true two-dimensional transform (no twiddle factors are necessary). The idea is very different from

the idea of the Cooley-Tukey algorithm. Now N1 and N2 must be co-prime and the mapping is accomplished by applying residue reduction and the Chinese remainder theorem [15].

As an example shown in Fig. 2.11, consider a 15-point FFT, where N1 = 5 and

N2 = 3. The input data is stored in a two-dimensional array by starting in the upper left corner and listing the components down the extended diagonal. Because the number of rows and the number of columns are co-prime, the extended diagonal passes through every element of the array.

Figure 2.11. The block diagram of the Good-Thomas mapping for 15-point Fourier transform.

The output array, however, is ordered differently from the input array and is mapped using the Ruritanian correspondence mapping [15]. For example, to map the input index n = 8, we can write n1 = 8 mod 5 = 3 and n2 = 8 mod 3 = 2. Hence, n = 8 is mapped onto (n1, n2) = (3, 2). The derivation of the Good-Thomas FFT algorithm is based on the Chinese remainder theorem for integers. The first step in deriving the Good-Thomas algorithm is to describe the input indices by their residues. The input index is given by n1 = n mod N1 and n2 = n mod N2. If N1, N2, n1, and n2 are known, then n may be calculated by the Chinese remainder theorem, which uses the following equations:

n = [n1N2M2 + n2N1M1] mod N, (2.19) and

[N1M1 + N2M2] = 1. (2.20)

Equation (2.20) can be solved using the Euclidean algorithm, where N1 and N2 are the scalar, integer coefficients of the given linear Diophantine equation [16], with N2M2 = 1 mod N1 and N1M1 = 1 mod N2. For example, if N = 15, N1 = 5, N2 = 3, n1 = 3, and n2 = 2, the values of M1 and M2 can be determined by solving the above equations. A set of solutions is

M1 = −1 and M2 = 2. There are other possible solution pairs, but the solution for M1 is unique modulo N2 = 3 (and that for M2 is unique modulo N1 = 5). After using the modulus operation to obtain a positive value for M1 (−1 mod 3 = 2), this yields M1 = 2.
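The pair (M1, M2) can be found with the extended Euclidean algorithm. The following Python sketch (illustrative, not thesis code) reproduces this example and verifies the CRT input mapping of Equation (2.19) for every index:

def ext_gcd(a, b):
    """Return (g, s, t) such that a*s + b*t = g = gcd(a, b)."""
    if b == 0:
        return a, 1, 0
    g, s, t = ext_gcd(b, a % b)
    return g, t, s - (a // b) * t

N1, N2 = 5, 3
N = N1 * N2
g, M1, M2 = ext_gcd(N1, N2)        # solves N1*M1 + N2*M2 = 1, Equation (2.20)
assert (g, M1, M2) == (1, -1, 2)

# CRT input mapping, Equation (2.19): n is recovered from its residues
for n in range(N):
    n1, n2 = n % N1, n % N2
    assert n == (n1 * N2 * M2 + n2 * N1 * M1) % N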

The output indices are defined using the Ruritanian correspondence method, as follows:

$$k_1 = kM_2 \bmod N_1, \qquad (2.21)$$

$$k_2 = kM_1 \bmod N_2, \qquad (2.22)$$

where M1 and M2 are the integers determined during the input mapping. For example, if

k = 13, then k1 = (2)(13)mod 5 = 1 and k2 = (−1)(13)mod 3 = 2, which represent the

coordinates of the output map. Knowing k1, k2, N1 and N2, the output index k is calculated

using k = [k1N2 + k2N1] mod N. By substituting these indices in the DFT equation in Equation (2.1), we can write:

$$X_{k_1k_2} = \sum_{n_1=0}^{N_1-1} \sum_{n_2=0}^{N_2-1} W_{N_1N_2}^{(n_1M_2N_2+n_2M_1N_1)(k_1N_2+k_2N_1)}\, x[n_1n_2]. \qquad (2.23)$$

The product in the exponent can be written as:

$$W_N^{nk} = W_N^{n_1k_1M_2N_2N_2}\, W_N^{n_2k_2M_1N_1N_1}\, W_N^{n_1k_2M_2N_1N_2}\, W_N^{n_2k_1M_1N_1N_2}. \qquad (2.24)$$

According to Equation (2.20), $M_2N_2 = 1 - M_1N_1$, $M_1N_1 = 1 - M_2N_2$, $N_1N_2 = N$, and $W_N^{N_1N_2} = 1$. Substituting these terms into Equation (2.24), the exponent reduces to $W_N^{nk} = W_{N_1}^{n_1k_1}\, W_{N_2}^{n_2k_2}$. Thus, the Good-Thomas Fourier transform can be written as:

$$X_{k_1k_2} = \sum_{n_1=0}^{N_1-1} \sum_{n_2=0}^{N_2-1} W_{N_1}^{n_1k_1}\, W_{N_2}^{n_2k_2}\, x[n_1n_2]. \qquad (2.25)$$

It can be observed that Equation (2.25) is a decoupled two-dimensional N1 × N2-point transform, i.e., either the rows or the columns can be transformed first, in any order. There are no twiddle factor multiplications involved, which is perhaps the greatest strength of the Good-Thomas algorithm. The Good-Thomas algorithm can be summarized as follows:

(i) Map the input sequence using the Chinese Remainder Theorem n1 × n2.

(ii) Transform the rows or columns k1 × n2.

(iii) Transform the columns or rows k1 × k2.

(iv) Map the output sequence based on the Ruritanian correspondence mapping k.

It can be observed that the one-dimensional input sequence n is mapped into n1 × n2 using prime factor mapping discussed earlier. We then perform column transformation using

N1-point FFTs. Unlike the Cooley-Tukey transform, we do not need any twiddle factor multiplications. This is because of the inherent property of the prime factor mapping. An N2-point FFT is then performed over each of the N1 rows to form the k1 × k2 result. The resulting two-dimensional array is then mapped using the Ruritanian correspondence mapping to form the one-dimensional transform k. The procedure and pseudo-code for the Good-Thomas FFT algorithm are described in Algorithm 2.4.

2.5.1 Workload Computation

In order to count the number of multiplications and additions, it is assumed that the

row and column transforms are computed directly. This results in N1 column transforms, each requiring approximately $N_2^2$ multiplications and additions, and N2 row transforms, each requiring $N_1^2$ multiplications and additions. Thus,

• Number of multiplications: $N_1N_2^2 + N_2N_1^2$.
• Number of additions: $N_1N_2^2 + N_2N_1^2$.

The savings ratio for multiplications is given as:

$$R = \frac{\text{Number of multiplications (Good-Thomas)}}{\text{Number of multiplications (direct)}} = \frac{N_1 + N_2}{N_1N_2}. \qquad (2.26)$$

Some examples of the savings ratio are listed in Table 2.3.

Table 2.3. Good-Thomas products savings ratio with respect to direct DFT calculation.

N      N1   N2   R
6      3    2    0.83
20     5    4    0.45
120    15   8    0.19
252    28   9    0.14
1008   16   63   0.08

One can choose either the Cooley-Tukey algorithm or the Good-Thomas algorithm to calculate Fourier transforms. It is also possible to build a Fourier transform algorithm by using both the Cooley-Tukey FFT and the Good-Thomas FFT. For example, a 63-point transform can be broken into a 7-point transform and a 9-point transform by using the Good-Thomas FFT. The 9-point transform can then be decomposed into two 3-point transforms by using the Cooley-Tukey FFT. One then has a computation in a form similar to a three-dimensional 3 × 3 × 7 Fourier transform.

The Good-Thomas FFT is more efficient than the direct DFT computation and the Cooley-Tukey FFT algorithm. However, the radix algorithms prove to be superior to the Good-Thomas FFT in terms of arithmetic computations. The Good-Thomas FFT has long been known to people in the signal processing industry, but hardware implementations of it are hard to find. This is mainly because the Good-Thomas FFT failed to make its impression over the radix algorithms.

Algorithm 2.4 Good-Thomas algorithm

procedure INPUT MAPPING
    for i ∈ N1 do
        for j ∈ N2 do
            index(i, j) = index(remainder((j − 1) × N2 × M2 + (i − 1) × N1 × M1, N))
        end for
    end for
end procedure

procedure COLUMN TRANSFORMATION
    for i ∈ N2 do
        Select all the elements of column i
        Perform an N1-point Fourier transform for the N1 elements of column i
        The result is a two-dimensional array with the Fourier transform of all the N2 column vectors
    end for
end procedure

procedure ROW TRANSFORMATION
    for i ∈ N1 do
        Select all the elements of row i
        Perform an N2-point Fourier transform for the N2 elements of row i
        The result is a two-dimensional array with the Fourier transforms of all the N1 row vectors and N2 column vectors
    end for
end procedure

procedure OUTPUT MAPPING
    Initialize k = 0
    for i ∈ N1 do
        for j ∈ N2 do
            index(k) = index(remainder((j − 1) × N2 + (i − 1) × N1, N))
            k = k + 1
        end for
    end for
end procedure
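A NumPy sketch of Algorithm 2.4 (illustrative only; np.fft.fft stands in for the short co-prime-length DFTs) is given below. Note the complete absence of twiddle factor multiplications between the two transform stages:

import numpy as np

def good_thomas(x, N1, N2):
    """Good-Thomas PFA for N = N1*N2 with gcd(N1, N2) = 1."""
    N = N1 * N2
    # CRT input mapping: A[n1, n2] = x[n], n1 = n mod N1, n2 = n mod N2
    A = np.empty((N1, N2), dtype=complex)
    for n in range(N):
        A[n % N1, n % N2] = x[n]
    # Independent column and row DFTs, Equation (2.25) -- no twiddle factors
    A = np.fft.fft(A, axis=0)       # N1-point transforms over n1
    A = np.fft.fft(A, axis=1)       # N2-point transforms over n2
    # Ruritanian output mapping: X[(k1*N2 + k2*N1) mod N] = A[k1, k2]
    X = np.empty(N, dtype=complex)
    for k1 in range(N1):
        for k2 in range(N2):
            X[(k1 * N2 + k2 * N1) % N] = A[k1, k2]
    return X

x = np.random.randn(15) + 1j * np.random.randn(15)
assert np.allclose(good_thomas(x, N1=5, N2=3), np.fft.fft(x))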

2.5.2 Comparison and Summary of the FFT Algorithms

Table 2.4 [13] gives a comparison of the Cooley-Tukey, radix-2, radix-4, and Good-Thomas FFT algorithms in terms of the required number of multiplications and additions to perform the Fourier transform. In general, direct computation of the DFT requires on the order of N^2 operations, where N is the transform length. The breakthrough of the Cooley-Tukey FFT comes from the fact that it brings the complexity down to an order of

N log2 N operations. The Cooley-Tukey FFT algorithm is suited to any composite length. For N = 2^n or 4^n, or r^n in general, it can easily be specialized into what are now called the radix FFT algorithms. A different approach, based on an index transformation, was suggested by Good and Thomas to compute the DFT when the factors of the transform length are co-prime (i.e.,

N = N1 × N2 and GCD(N1, N2) = 1). This approach uses a multi-dimensional mapping and does not require any twiddle factor multiplications. But its drawback, at first sight, is that it requires efficiently computable DFTs of co-prime lengths. This method thus requires a set of relatively small-length DFTs that at first seemed difficult to compute in fewer than N^2 operations. After the many developments that followed Cooley and Tukey's original contribution, the FFT introduced in 1976 by Winograd stands out for achieving a new theoretical reduction in multiplicative complexity. Interestingly, Winograd's algorithm uses convolutions to compute DFTs, an approach that is just the converse of the conventional method of computing convolutions by means of DFTs. Winograd utilized the results proposed by Good and Rader in a more efficient way by mixing them with his fast convolution algorithm. The Winograd FFT uses Rader's algorithm and Winograd's circular convolution to compute the Fourier transform. Rader's algorithm, the Winograd circular convolution, and the Winograd FFT are explained in the next chapter.

Table 2.4. Comparison of the Cooley-Tukey, Radix-2, Radix-4, and Good-Thomas FFT algorithms

                          Cooley-Tukey FFT             Good-Thomas FFT
DFT size N = N1 × N2   Multiplications  Additions   Multiplications  Additions
30 = 2 × 3 × 5               360            384           330            384
60 = 3 × 4 × 5              1080            888          1020            888
120 = 3 × 5 × 8             2880           2076          2760           2076
240 = 3 × 5 × 16            8176           4812          7936           4812
504 = 7 × 8 × 9            12544          13164         12040          13164
1008 = 7 × 9 × 16          44928          29100         43920          29100

                     Radix-2 FFT                   Radix-4 FFT
DFT size N   Multiplications  Additions    Multiplications  Additions
32                 320            480            --             --
64                 768           1152           576           1056
128               1792           2688            --             --
256               4096           6144          3072           5632
512               9216          13824            --             --
1024             20480          30720         15360          28160

CHAPTER 3

FAST FOURIER TRANSFORMS VIA CONVOLUTION

The Fourier transform can be performed efficiently using convolution along with an FFT algorithm. When the blocklength is relatively small, one of the most efficient FFT algorithms, as measured by the number of multiplications and additions, is the Winograd FFT algorithm. The Winograd FFT is derived from Rader's algorithm and the Winograd short circular convolution.

3.1 RADER'S ALGORITHM

When the blocklength N is a power of an odd prime, Rader's algorithm [17] can be used to compute the FFT efficiently. In this case, the zeroth time component and the zero-frequency component are treated specially. Rader's algorithm is based on converting the DFT into a convolution. The convolution is a cyclic operation with no pre- or post-multiplications. The Rader algorithm allows us to represent a prime-length DFT as a cyclic convolution, which in turn can be expressed as a polynomial product modulo a third polynomial. The DFT and the cyclic convolution operations can be written as:

$$X[k] = \sum_{n=0}^{N-1} x[n]\, W^{nk}, \qquad (3.1)$$

N−1 X X[k] = x[n]h[n − k], (3.2) n=0 respectively, where the indices are evaluated modulo N. To convert the DFT in Equation (3.1) into convolution represented in Equation (3.2), the n × k product must be changed to k − n differences. When N is a prime number, this change of indices can be accomplished by permuting the order of the original sequence. Hence, to transform the index k × n to k − n when N is prime, we can use the the following property of integers modulo prime numbers. For any prime number N, the non-negative integers less than N form a Galois-field, GF(N) under the operation modulo N addition and modulo N multiplication. From the number theory, it can be shown that if the modulus N is a prime number, a base called primitive root r exist such that the integer Equation (3.3) creates one-to-one map of N − 1 member as follows: n = rm mod N, (3.3) 27 where m = 0, 1,...,N − 2. Note that all terms in which n = 0 or k = 0 must be excluded from consideration. For instance, if n = 0 the index kn will address a single element irrespective of k and vice-versa. However, k − n will address N different elements. Therefore, when n = 0 or k = 0 it is not possible to permute the given sequence. We shall treat these terms as special cases by rewriting the DFT in Equation (3.1) as:

N−1 X X[0] = x[n], (3.4) n=0 X[k] = X[0] + X0[k], (3.5) N−1 X X0[k] = x[n]W nk, (3.6) n=1 where k = 1, 2,...,N − 1. The Rader’s algorithm uses the circular convolution properties of the prime number DFTs. The algorithm can be summarized as follows:

Zeroth sample separation. Separate the first input sample x[0] from the others and prepare to compute the remaining output frequency components, i.e., X[k] − x[0] for k = 1, 2, ..., N − 1.

Input mapping. Reorder the input data sequence: for any prime number N, there is at least one primitive root that can be used to reorder the input data sequence. If r is the primitive root, the reordered indices n are calculated using Equation (3.3).

(N − 1)-point FFT computation. Compute the (N − 1)-point DFT of the new sequence using any of the FFT algorithms discussed in Chapter 2.

Reorder the twiddle factors. For every primitive root, there exists another primitive root such that the product of the two is 1 modulo N. Use this second root to reorder the twiddle factors.

FFT computation of the twiddle-factor sequence. Compute the (N − 1)-point DFT of the reordered twiddle factors. These values can be computed once and stored in memory.

Complex multiplication. Perform an element-wise complex multiplication of the two transformed sequences. This stage requires N − 1 complex multiplications.

Inverse Fourier transform. Compute the inverse Fourier transform of the product sequence from the previous stage.

Output frequency component computation. Add x[0] to each of the N − 1 resulting samples. Then X[0], computed using Equation (3.4), is appended to form the complete N-point transform.

The entire process described above is equivalent to performing an (N − 1)-point circular convolution. Fig. 3.1 shows a top-level description of Rader's algorithm.

[Block diagram: input sequence permutation → (N − 1)-point convolution → output sequence permutation, with the DC component X[0] computed separately and added in.]

Figure 3.1. The Rader's algorithm.

The pseudo-code for Rader's FFT algorithm is given in Algorithm 3.1.

Algorithm 3.1 Rader's FFT algorithm
Input: x, a vector of N complex input samples, N prime. The first sample is separated from the input vector: x0 = x(0) and xi = x(i) for i ≥ 1. Let g1 and g2 be two primitive roots such that g1 · g2 ≡ 1 (mod N).

procedure INPUT MAPPING / SCRAMBLING
    for i = 1, 2, ..., N − 1 do
        index1(i) = g1^i mod N
        xr(i) = x(index1(i))            // reordered sequence xr
    end for
end procedure

procedure (N − 1)-POINT FFT OF xr
    Compute the Fourier transform Xr(k) of the sequence xr(i).
end procedure

procedure TWIDDLE FACTOR SCRAMBLING
    for i = 1, 2, ..., N − 1 do
        index2(i) = g2^i mod N
        Wr(i) = W(index2(i))            // reordered twiddle factors Wr
    end for
end procedure

procedure (N − 1)-POINT FFT OF Wr
    Compute the Fourier transform Wkr(k) of the sequence Wr(i).
end procedure

procedure COMPLEX MULTIPLICATION
    for k = 1, 2, ..., N − 1 do
        X'(k) = Xr(k) · Wkr(k)
    end for
end procedure

procedure (N − 1)-POINT INVERSE FFT
    Compute the inverse Fourier transform x'(i) of the sequence X'(k).
end procedure

procedure OUTPUT FREQUENCY COMPONENTS
    X(0) = x(0) + \sum_{i=1}^{N-1} x(i)     // Equation (3.4)
    for i = 1, 2, ..., N − 1 do
        X(index2(i)) = x(0) + x'(i)          // add the zeroth input sample; indexing by
    end for                                  // index2 realizes the output permutation of Fig. 3.1
    The final output sequence is Xout = [X(0), X(1), ..., X(N − 1)].
end procedure

3.1.1 Workload Computation
Note that Rader's algorithm can only be used if the length N is a prime or a power of a prime. Since N − 1 is composite for any prime N > 3, the convolution can be performed directly via conventional FFT algorithms. However, if N − 1 has large prime factors, this approach may not be efficient, as it requires recursive use of Rader's algorithm. Instead, a length-(N − 1) cyclic convolution can be computed exactly by zero-padding it to a length of at least 2(N − 1) − 1, say to a power of two, which can then be evaluated in O(N log N) operations without the recursive application of Rader's algorithm. If Rader's algorithm is performed using FFTs of size N − 1 to compute the convolution, rather than by zero-padding as mentioned above, the efficiency depends strongly upon N and the number of times that Rader's algorithm must be applied recursively. In general, the number of arithmetic operations is given by:

• Number of multiplications: 2 × M_{N−1} + 4 × (N − 1).

• Number of additions: 2 × A_{N−1} + 6 × (N − 1).

where M_{N−1} is the number of multiplications and A_{N−1} the number of additions required for an (N − 1)-point FFT. In general, Rader's algorithm requires O(N) additions and O(N log N) time for the convolution. If the convolution is performed by a pair of FFTs, the algorithm intrinsically requires more operations than FFTs of nearby composite sizes. Rader's algorithm is efficient if the FFT length is comparatively small; as the length increases, the required numbers of multiplications and additions grow significantly. Rader succeeded in mapping a DFT of prime length N into a circular convolution of length N − 1, a property that was later used to form the Winograd Fourier transforms.
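To make the data flow of Algorithm 3.1 concrete, the following Matlab fragment is a minimal floating-point sketch of Rader's algorithm, assuming N = 7 and a brute-force primitive-root search; it is an illustration only, not the fixed-point implementation developed later in this thesis. The result is checked against Matlab's built-in fft.

```matlab
N = 7;                                    % prime transform length
x = randn(1, N) + 1j*randn(1, N);         % complex test input

% Find a primitive root g of N by exhaustive search.
for g = 2:N-1
    if numel(unique(mod(g.^(0:N-2), N))) == N-1, break; end
end
ginv = find(mod(g*(1:N-1), N) == 1, 1);   % g*ginv = 1 (mod N)

% Index permutations n = g^m mod N and k = g^(-m) mod N, m = 0..N-2,
% computed iteratively to avoid large intermediate powers.
p1 = zeros(1, N-1); p2 = zeros(1, N-1);
p1(1) = 1; p2(1) = 1;
for m = 2:N-1
    p1(m) = mod(p1(m-1)*g,    N);
    p2(m) = mod(p2(m-1)*ginv, N);
end

% (N-1)-point cyclic convolution of the permuted input with the permuted
% twiddle factors, done here with a pair of FFTs.
a = x(p1 + 1);                            % Matlab arrays are 1-indexed
b = exp(-2j*pi*p2/N);                     % W^(g^(-m) mod N)
c = ifft(fft(a) .* fft(b));

% Reassemble: X[0] is a plain sum; X[g^(-m)] = x[0] + c(m).
X = zeros(1, N);
X(1) = sum(x);
X(p2 + 1) = x(1) + c;
max(abs(X - fft(x)))                      % ~1e-15: matches the direct FFT
```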

3.2 WINOGRAD SHORT CONVOLUTION
The Winograd short convolution [9] provides a general strategy for computing linear or cyclic convolutions by applying the polynomial version of the Chinese remainder theorem. For polynomials g(x) and d(x) of degrees N − 1 and M − 1, respectively, the linear convolution s(x) has degree L − 1, where L = M + N − 1. The linear convolution is described by the polynomial product s(x) = g(x)d(x). This is equivalent to Equation (3.7), because the reduction mod m(x) has no effect on s(x) when the degree of m(x) exceeds that of s(x).

s(x) = g(x)\,d(x) \bmod m(x).    (3.7)

For cyclic convolution, let m(x) = x^N − 1, where N is the sequence length (one more than the largest degree of g(x) or d(x)). The first step in the derivation of the Winograd algorithm is to factor m(x) into relatively prime factors, i.e., m(x) = m^{(0)}(x) m^{(1)}(x) ... m^{(K−1)}(x). Then the residue polynomials associated with each of these factors can be computed as:

d^{(k)}(x) = d(x) \bmod m^{(k)}(x),    (3.8)

g^{(k)}(x) = g(x) \bmod m^{(k)}(x),    (3.9)

s^{(k)}(x) = s(x) \bmod m^{(k)}(x),    (3.10)

s^{(k)}(x) = d^{(k)}(x)\, g^{(k)}(x) \bmod m^{(k)}(x).    (3.11)

The final step is to compute s(x) by applying the Chinese remainder theorem for polynomials:

s(x) = \sum_{k=0}^{K-1} s^{(k)}(x)\, N^{(k)}(x)\, M^{(k)}(x) \pmod{m(x)},    (3.12)

where M^{(k)}(x) = m(x)/m^{(k)}(x) and N^{(k)}(x) is determined by N^{(k)}(x) M^{(k)}(x) ≡ 1 (mod m^{(k)}(x)). Only the short polynomial products of the second step require multiplications; the number of multiplications is given by \sum_{k=0}^{K-1} \deg m^{(k)}(x). As an example, consider the cyclic convolution of 2-point sequences. Let g(x) = g_0 + g_1 x and d(x) = d_0 + d_1 x be two degree-1 polynomials, and let m(x) = x^2 − 1 = (x − 1)(x + 1). The choice of factors is arbitrary; the only requirements are that the degree of m(x) is greater than the degree of s(x) and that the factors of m(x) are relatively prime. The next step is to find the residue polynomials associated with each of these factors, as shown in Equations (3.8) through (3.10). Thus,

g^{(0)}(x) = g(x) \bmod m^{(0)}(x) = (g_0 + g_1 x) \bmod (x - 1) = g_0 + g_1,

g^{(1)}(x) = (g_0 + g_1 x) \bmod (x + 1) = g_0 - g_1.

Similarly, we can calculate the residue polynomials for d(x):

d^{(0)}(x) = d_0 + d_1, \qquad d^{(1)}(x) = d_0 - d_1.

In matrix form,

\begin{bmatrix} d^{(0)} \\ d^{(1)} \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} \begin{bmatrix} d_0 \\ d_1 \end{bmatrix} \quad\text{and}\quad \begin{bmatrix} g^{(0)} \\ g^{(1)} \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} \begin{bmatrix} g_0 \\ g_1 \end{bmatrix}.

We now need to determine the residues of s(x), given by

s^{(k)}(x) = g^{(k)}(x)\, d^{(k)}(x) \bmod m^{(k)}(x).

Thus,

s^{(0)} = (g_0 + g_1)(d_0 + d_1) \quad\text{and}\quad s^{(1)} = (g_0 - g_1)(d_0 - d_1).

Finally, we use Equation (3.12) to solve for s(x); the N^{(k)}(x) can be computed with the help of Table 3.1.

Table 3.1. Determination of N^{(k)}(x)

k    m^{(k)}(x)    M^{(k)}(x)
0    x − 1         x + 1
1    x + 1         x − 1

For k = 0, N^{(0)}(x) × (x + 1) ≡ 1 (mod (x − 1)); similarly, for k = 1, N^{(1)}(x) × (x − 1) ≡ 1 (mod (x + 1)). In general we would use Euclid's division algorithm for polynomials [Ref] to solve these congruences, but as the terms here are small we can solve them by inspection: substituting x = 1 and x = −1, respectively, yields N^{(0)}(x) = 1/2 and N^{(1)}(x) = −1/2. Thus s(x) can be written as

s(x) = \tfrac{1}{2}\left[s^{(0)} + s^{(1)}\right] + \tfrac{1}{2}\left[s^{(0)} - s^{(1)}\right] x,

with s^{(0)} = (g_0 + g_1)(d_0 + d_1) and s^{(1)} = (g_0 - g_1)(d_0 - d_1).

Thus, this algorithm requires two multiplications and four additions, compared with four multiplications and two additions for the direct computation. The Winograd cyclic convolution reduces the number of multiplications at the cost of an increased number of additions. Additional information about the Winograd cyclic convolution can be found in [18].
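The derivation above can be checked numerically. The following Matlab fragment is a minimal sketch of the 2-point Winograd cyclic convolution with arbitrary test values, compared against the direct computation:

```matlab
g = [3 5];  d = [2 7];                   % test sequences [g0 g1] and [d0 d1]

% Residue step: only two multiplications, on +/- combinations of the inputs.
s0 = (g(1) + g(2)) * (d(1) + d(2));      % s^(0) = (g0 + g1)(d0 + d1)
s1 = (g(1) - g(2)) * (d(1) - d(2));      % s^(1) = (g0 - g1)(d0 - d1)

% Chinese-remainder reconstruction: s(x) = (s0 + s1)/2 + ((s0 - s1)/2) x.
s = [(s0 + s1)/2, (s0 - s1)/2];

% Direct 2-point cyclic convolution for comparison (four multiplications).
s_direct = [g(1)*d(1) + g(2)*d(2), g(1)*d(2) + g(2)*d(1)];
disp([s; s_direct])                      % both rows: 41 31
```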

3.3 WINOGRAD FOURIER TRANSFORM ALGORITHM
In 1976, S. Winograd introduced a new approach to efficiently compute the Fourier transform that uses substantially fewer multiplications at the expense of slightly more additions. The motivation for the development of this algorithm was that, at the time, multiplication was extremely expensive in computation time. The Winograd small FFT is a method of efficiently computing the DFT for small block lengths; the algorithm was designed to minimize the number of multiplications required to implement the FFT. It is based on Rader's prime algorithm and the Winograd cyclic convolution. The first step in the Winograd Fourier transform (WFT) is to change the Fourier transform into a convolution: if N is a small prime number, we can use Rader's prime algorithm to express the Fourier transform as a convolution using permuted input sequences. This stage does not require any additions or multiplications. The next step is to use the Winograd cyclic convolution algorithm to compute the Fourier transform of the (N − 1)-point sequence. The convolution algorithm itself consists of a set of additions, followed by a set of multiplications, followed by another set of additions. Winograd showed that the DFT can be computed with only O(N) irrational multiplications, which leads to an achievable lower bound on the number of multiplications; however, this comes at the cost of significantly more additions. The theory and derivation of these algorithms is quite elegant, but requires substantial background in number theory and abstract algebra. Fortunately for VLSI designers, the small-length FFT algorithms have already been derived. The most popular block lengths are 2, 3, 4, 5, 7, 8, 9, and 16. In this section, the dataflow and construction of the 2-, 3-, 4-, 5-, 7-, 8-, 9-, and 16-point Winograd FFT algorithms are discussed; they serve as building blocks for larger-length FFTs. Their signal-flow graphs (SFGs) are shown in Figs. 3.2 through 3.9; the red lines indicate the pipeline stages, the details of which are explained in Chapter 4.

• Two-point FFT. The SFG for the 2-point Winograd Fourier transform is shown in Fig. 3.2. This is the simplest of the FFTs and requires only 4 additions and no multiplications.

• Three-point FFT. The SFG for the 3-point Winograd Fourier transform is shown in Fig. 3.3. The intermediate multiplier coefficients are given in Table 3.2. The 3-point Winograd Fourier transform requires 6 additions and only 2 multiplications.

• Four-point FFT. The SFG for the 4-point Winograd Fourier transform is shown in Fig. 3.4. The 4-point Winograd FFT does not require any multiplications: the complex multiplier j can easily be replaced by a demultiplexer that switches the real data to imaginary and the imaginary data to real. The 4-point transform requires only 4 additions.

Figure 3.2. The SFG for the 2-point Winograd Fourier transform algorithm.

Table 3.2. Multiplier coefficients for the 3-point WFT

C1 = cos(2π/3) − 1
C2 = j sin(2π/3)

Figure 3.3. The SFG for the 3-point Winograd Fourier transform algorithm.

Figure 3.4. The SFG for the 4-point Winograd Fourier transform algorithm.

• Five-point FFT. The SFG for the 5-point Winograd Fourier transform is shown in Fig. 3.5. The 5-point Fourier transform requires 5 multiplications and 17 additions. The intermediate multiplier coefficients are given in Table 3.3.

Table 3.3. Multiplier coefficients for the 5-point WFT

C1 = (1/2)cos(2π/5) + (1/2)cos(4π/5) − 1
C2 = (1/2)cos(2π/5) − (1/2)cos(4π/5)
C3 = sin(2π/5)
C4 = sin(2π/5) + sin(4π/5)
C5 = sin(2π/5) − sin(4π/5)

• Seven-point FFT. The SFG for the 7-point Winograd Fourier transform is shown in Fig. 3.6. The 7-point Fourier transform requires 8 multiplications and 35 additions. The intermediate multiplier coefficients are given in Table 3.4.

• Eight-point FFT. The SFG for the 8-point Winograd Fourier transform is shown in Fig. 3.7. The 8-point Fourier transform requires 2 multiplications and 26 additions. The intermediate multiplier coefficients are given in Table 3.5.

Figure 3.5. The SFG for the 5-point Winograd Fourier transform algorithm.

Table 3.4. Multiplier coefficients for the 7-point WFT

C1 = (1/3)[cos(2π/7) + cos(4π/7) + cos(6π/7)] − 1
C2 = (1/3)[2cos(2π/7) − cos(4π/7) − cos(6π/7)]
C3 = (1/3)[cos(2π/7) − 2cos(4π/7) + cos(6π/7)]
C4 = (1/3)[cos(2π/7) + cos(4π/7) − 2cos(6π/7)]
C5 = (1/3)[sin(2π/7) + sin(4π/7) − sin(6π/7)]
C6 = (1/3)[2sin(2π/7) − sin(4π/7) + sin(6π/7)]
C7 = (1/3)[sin(2π/7) − 2sin(4π/7) − sin(6π/7)]
C8 = (1/3)[sin(2π/7) + sin(4π/7) + 2sin(6π/7)]

Table 3.5. Multiplier coefficients for the 8-point WFT

C1 = cos(2π/8)
C2 = sin(2π/8)

Figure 3.6. The SFG for the 7-point Winograd Fourier transform algorithm.

• Nine-point FFT. The SFG for the 9-point Winograd Fourier transform is shown in Fig. 3.8. The 9-point Fourier transform requires 10 multiplications and 44 additions. The intermediate multiplier coefficients are given in Table 3.6.

• Sixteen-point FFT. The SFG for the 16-point Winograd Fourier transform is shown in Fig. 3.9. The 16-point Fourier transform requires 10 multiplications and 74 additions. The intermediate multiplier coefficients are given in Table 3.7.

Figure 3.7. The SFG for the 8-point Winograd Fourier transform algorithm.

Table 3.6. Multiplier coefficients for the 9-point WFT

C1 = −1/2
C2 = sin(6π/9)
C3 = cos(6π/9) − 1
C4 = sin(6π/9)
C5 = (1/3)[2cos(2π/9) − cos(4π/9) − cos(8π/9)]
C6 = (1/3)[cos(2π/9) + cos(4π/9) − 2cos(8π/9)]
C7 = (1/3)[cos(2π/9) − 2cos(4π/9) + cos(8π/9)]
C8 = (1/3)[2sin(2π/9) + sin(4π/9) − sin(8π/9)]
C9 = (1/3)[sin(2π/9) − sin(4π/9) − 2sin(8π/9)]
C10 = (1/3)[sin(2π/9) + 2sin(4π/9) + sin(8π/9)]

Figure 3.8. The SFG for the 9-point Winograd Fourier transform algorithm.

Table 3.7. Multiplier coefficients for the 16-point WFT

C1 = sin(2π/16)
C2 = cos(2π/16)
C3 = sin(3π/16)
C4 = sin(π/16) − sin(3π/16)
C5 = sin(π/16) + sin(3π/16)
C6 = cos(3π/16)
C7 = cos(π/16) − cos(3π/16)
C8 = cos(π/16) + cos(3π/16)

Figure 3.9. The SFG for the 16-point Winograd Fourier transform algorithm.
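As an illustration of how these SFGs translate into arithmetic, the following Matlab fragment evaluates the 3-point Winograd FFT of Fig. 3.3 using the coefficients of Table 3.2 and compares it with the direct DFT. It is a minimal floating-point sketch whose structure mirrors the SFG; it is not the pipelined hardware module itself.

```matlab
x  = randn(3,1) + 1j*randn(3,1);     % complex input [xn0; xn1; xn2]
C1 = cos(2*pi/3) - 1;                % coefficients from Table 3.2
C2 = 1j*sin(2*pi/3);

t1 = x(2) + x(3);                    % input additions
t2 = x(3) - x(2);
X0 = x(1) + t1;                      % Xk[0]
m1 = C1*t1;                          % the only two multiplications
m2 = C2*t2;
s1 = X0 + m1;                        % output additions
X  = [X0; s1 + m2; s1 - m2];         % [Xk0; Xk1; Xk2]
max(abs(X - fft(x)))                 % ~1e-16: matches the direct DFT
```

Counting the operations confirms the figures quoted above: six additions and two multiplications.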

3.4 SUMMARY
The FFT introduced by Winograd significantly reduces the number of multiplications required to perform the Fourier transform. Winograd utilized the results of Rader more efficiently by combining them with his fast convolution algorithm. Nevertheless, Winograd's FFT algorithm has failed to replace the popular Cooley-Tukey type algorithms, mainly because of the significant rise in the number of additions required for Fourier transforms of higher order; the radix algorithms are superior to the Winograd FFT algorithms for large transform lengths. The numbers of additions and multiplications required to perform the Fourier transform using the Winograd FFT algorithm are shown in Table 3.8.

Table 3.8. Computational requirements of the Radix-2 and Winograd FFT algorithms

DFT size                 Radix-2 FFT             Winograd FFT
N = N1 × N2            Mults       Adds        Mults       Adds
2                          4          6            0          2
3                         --         --            2          6
4                         16         24            0          8
5                         --         --            5         17
7                         --         --            8         38
8                         48         72            2         26
9                         --         --           10         44
16                       128        192           10         74
30 = 5 × 6                --         --           72        384
60 = 5 × 12               --         --          144        888
64                       576       1056           --         --
120 = 3 × 5 × 8           --         --          288       2076
180 = 4 × 5 × 9           --         --          528       3936
240 = 3 × 5 × 16          --         --          648       5136
256                     3072       5632           --         --
420 = 3 × 4 × 5           --         --         1296      11352
504 = 7 × 8 × 9           --         --         1584      14428
1008 = 7 × 9 × 16         --         --         3564      34416
1024                   15360      28160           --         --

It can be observed that the Winograd FFT algorithm is better than the radix algorithms when the transform length is relatively small, as its arithmetic complexity is significantly lower for small lengths. As the block length increases, however, the Winograd FFT loses its advantage due to the growth in the number of additions: the 16-point Winograd transform requires fewer additions than its radix-2 counterpart, whereas the 1008-point transform requires more additions than a radix-2 FFT of comparable size. In 1977, D. P. Kolba and T. W. Parks proposed a method of computing the Fourier transform based on the Good-Thomas FFT and the Winograd FFT algorithms [4]. In their approach, the Good-Thomas FFT algorithm is used to decompose the global computation into smaller Fourier transform computations, and these smaller transforms are then implemented by the Winograd FFT algorithm. It was observed that the combined Good-Thomas and Winograd FFT, also referred to as the Good-Thomas Winograd prime-factor algorithm, or simply the prime-factor algorithm (PFA), requires the minimum number of multiplications at the cost of some extra additions. Fig. 3.10 and Fig. 3.11 plot the number of multiplications and the number of additions required by the various FFT algorithms, respectively.

[Plot: number of multiplications versus transform size (up to 5000 points) for the Radix-2, Radix-4, Cooley-Tukey, Good-Thomas, and Good-Thomas Winograd FFT algorithms.]

Figure 3.10. Required number of multiplications for different FFT algorithms.

The computation of the Fourier transform using the PFA proved to be more computationally efficient than all the other FFT algorithms discussed, primarily because (i) it does not require twiddle-factor multiplications, and (ii) the use of the small Winograd FFT algorithms inside the Good-Thomas FFT algorithm significantly reduces the arithmetic complexity due to the cyclic convolution properties of the Winograd FFT algorithms. This was the main motivation for choosing the combined Good-Thomas and Winograd FFT as the candidate FFT algorithm for VLSI implementation. The hardware construction and implementation results of the combined Good-Thomas and Winograd FFT algorithm are described in Chapter 4.

[Plot: number of additions versus transform size (up to 5000 points) for the Radix-2, Radix-4, Cooley-Tukey, Good-Thomas, and Good-Thomas Winograd FFT algorithms.]

Figure 3.11. Required number of additions for different FFT algorithms.

CHAPTER 4

GOOD-THOMAS AND WINOGRAD PRIME-FACTOR FFT ALGORITHMS

This chapter discusses the design and implementation of the Good-Thomas and Winograd prime-factor FFT algorithm (PFA). The FFT algorithms were first modeled in floating-point Matlab and then in fixed-point Matlab; the fixed-point Matlab model was then converted to a hardware description for FPGA implementation.

4.1 INTRODUCTION
The divide-and-conquer strategy has few requirements for feasibility: the Fourier transform length N need only be composite, and the whole DFT is computed from DFTs on numbers of points that are factors of N, exploiting the redundancy in the computation of the DFT of the signal. Good's mapping, which can be used when the factors of the transform length are co-prime, is a true one-dimensional to multi-dimensional mapping and therefore has the advantage of not producing any twiddle factors. However, it requires the efficient computation of DFTs of co-prime lengths. For instance, a DFT of length 240 is decomposed as 240 = 16 × 3 × 5, and a DFT of length 1008 as 1008 = 16 × 9 × 7; the intermediate DFT computations of lengths 16, 3, 5, 9, and 7 must therefore be efficient. This prime-factor method requires a set of relatively small-length DFTs that at first seemed challenging to compute in fewer than N^2 operations. Rader showed how to map a DFT of length N, where N is a prime number, into a circular convolution of length N − 1. Winograd then introduced a new approach to efficiently compute the Fourier transform that uses substantially fewer multiplications at the expense of more additions. The Winograd small FFT is a method of efficiently computing the discrete Fourier transform for small block lengths; it is based on Rader's prime algorithm and the Winograd cyclic convolution, and it was designed to minimize the number of multiplications required to implement the FFT. For instance, a 16-point sequence requires only 10 multiplications and 74 additions. We can thus efficiently compute the 240-point DFT using 16-, 5-, and 3-point Winograd FFTs, and the 1008-point DFT using 16-, 9-, and 7-point Winograd FFTs.

4.2 DATA FORMAT
The data format has a direct impact both on the accuracy of the outputs and on the size of the design. The input and output word lengths of the FFT processor designed in this thesis are parametric. We use fixed-point arithmetic to implement the FFT architectures, and the following strategies are used to deal with the problem of overflow in fixed-point computations.

• Headroom provision. The input-output word length is increased so that only a fraction of the available word-length range is used.

• Fixed scaling. The outputs of adders and multipliers that can generate overflow are scaled down: the least significant bit (LSB) is dropped at given stages of the computation.

• Overflow flag. Overflow flags can be used to indicate that the word length has to be increased so as to provide a sufficient number of bits.

Fig. 4.1 shows the input and output data signals of the implemented FFT processor. Each complex sample has a real and an imaginary part, whose values are represented in two's-complement fixed-point form. The N complex samples all enter the FFT circuit synchronously, with their real and imaginary parts each on a separate input data bus. Similarly, all N Fourier transform samples are available at the output, each on separate real and imaginary output data buses. The handshaking control signal "DVI" indicates that all valid samples have been applied to the input, and the control signal "DVO" indicates that the Fourier transform has been completed and the transform samples are available at the output. The Fourier multiplier coefficients, which have the same format as the inputs, may be applied on the coefficient data bus along with the inputs or may be stored in a LUT.

Figure 4.1. The IO interface of the FFT processor.
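The fragment below is a minimal Matlab sketch of these conventions: it quantizes complex samples to a WL-bit two's-complement fractional format and applies the fixed-scaling strategy (dropping one LSB, i.e., halving) after an addition that could overflow. The quantizer q is illustrative and is not the custom fixed-point library used in this thesis.

```matlab
WL = 16;                                    % word length per real/imaginary part

% Saturating WL-bit two's-complement quantizer for real values in [-1, 1).
qr = @(v) max(min(round(v*2^(WL-1)), 2^(WL-1)-1), -(2^(WL-1))) / 2^(WL-1);
q  = @(v) qr(real(v)) + 1j*qr(imag(v));     % quantize real and imaginary parts

a = q(0.71 - 0.42j);
b = q(0.66 + 0.30j);

% Fixed scaling: halve the sum (drop one LSB) so it stays in [-1, 1).
s = q((a + b)/2);

% Overflow flag for the unscaled sum, as described above.
ovf = abs(real(a + b)) >= 1 || abs(imag(a + b)) >= 1;
```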

4.3 THE WINOGRAD FFT MODULES
The Winograd FFT algorithm forms an integral part of our prime-factor FFT algorithm. In this section we discuss the pipelined architectures and the implementation results of the Winograd FFT modules. These FFTs are used to perform the N2-point row and N1-point column transforms, respectively. No twiddle-factor multiplications are required to perform the WFT, which in turn reduces the number of multiplications needed. WFTs with block lengths of 2, 3, 4, 5, 7, 8, 9, and 16 are constructed from their SFGs discussed in Chapter 3. Note that each input xn[0], xn[1], ..., and each output Xk[0], Xk[1], ..., has a bit-width of WL bits, as specified by the user. The multiplier coefficients (C1, C2, ...) depend on the underlying WFT; they are precomputed in Matlab and stored in a LUT at initialization. The bit-width of the multiplier coefficients is parametric; we use 16-bit coefficients for acceptable precision. The address for the input memory module is generated by a counter that stores the input samples sequentially at consecutive memory locations. The top-level architecture of the WFT module is shown in Fig. 4.2.

[Block diagram: clocked input memory (BRAM) feeding the complex input data bus of the WFT module, with an address counter and a complex output data bus.]

Figure 4.2. The top-level block diagram of the WFT algorithm.

The WFTs are pipelined to reduce the critical path to a single multiplier; the pipeline stages are indicated by the red lines in the SFGs of Chapter 3.

4.4 THE PRIME-FACTOR FFT ALGORITHM
The prime-factor FFT algorithm uses the CRT mapping technique, as in the Good-Thomas FFT algorithm, to map the input samples into a two-dimensional array. This mapping turns the original transform of length N into several small DFTs of lengths N1 and N2, where N1 and N2 are co-prime numbers; the one-dimensional input is thus transformed into rows of N2 points and columns of N1 points. After the DFT is mapped into a true multi-dimensional DFT by Good's method, the Winograd Fourier transform algorithm is applied to evaluate the prime-length DFTs. Finally, the Ruritanian correspondence mapping (RCM) technique maps the two-dimensional sequence back to a one-dimensional array of N samples at the output. The pseudo-code for the prime-factor FFT algorithm is described in Algorithm 4.1.

4.5 ARCHITECTURE
This section describes the hardware architecture of the Good-Thomas and Winograd prime-factor FFT algorithm. The top-level architecture of the combined Good-Thomas and Winograd prime-factor FFT is shown in Fig. 4.3; it is a 3-stage pipelined design. The FFT block diagram in Fig. 4.3 uses an 8-state finite state machine (FSM) to control the sequence of FFT operations, as shown in Fig. 4.4. The FFT process begins in the "idle" state, in which the FFT processor waits for valid input, indicated by the "DVI" signal. The process then goes to the "reset" state, where all the buffer registers, adders, and multipliers are reset to zero. After the registers are initialized, the processor goes to the "input mapping" state to map the input samples using the Chinese remainder theorem. The CRT mapping process takes N2 clock cycles to map N2 samples, after which the processor goes to the "row transformation" state. The row transformation state performs an N2-point Winograd Fourier transform on the CRT-mapped samples and then returns to the "input mapping" state to map the next N2 samples. This process is repeated N1 times until all the input samples have been row-transformed; the output samples of the N2-point FFT are stored in a block memory. After the row transformation, the registers and counters are initialized for the column transformation, and the processor goes to the "column transformation" state to perform the N1-point Winograd Fourier transform. The column transformation is repeated N2 times. We then perform the Ruritanian correspondence mapping of the row- and column-transformed sequence: the RCM is performed when the processor goes to the "output mapping" state. Once all samples are ready, the "DVO" signal is asserted, indicating that the Fourier-transformed sequence is available at the output.

4.6 HARDWARE DESIGN
To design the FFT in hardware, we created a software model that accurately represents the combined Good-Thomas and Winograd FFT algorithm. Using this model, the designer can change various parameters and specifications, such as the data bit-width and the coefficient bit-width, to meet the design specifications. Our design approach is shown in Fig. 4.5. First, the software model was developed in floating-point format in Matlab; the floating-point model was then converted into a fixed-point model. The hardware description of the FFT was then developed in Verilog HDL based on the fixed-point model. Finally, the design was synthesized and implemented on an FPGA.

Algorithm 4.1 Good-Thomas and Winograd prime-factor FFT algorithm

procedure CRT INPUT MAPPING
    for i = 1, ..., N1 do
        for j = 1, ..., N2 do
            index(i, j) = remainder((j − 1) × N2 × M2 + (i − 1) × N1 × M1, N)
        end for
    end for
end procedure

procedure N1-POINT WINOGRAD FFT
    for i = 1, ..., N2 do
        Select all the elements of column i;
        Perform the N1-point Winograd Fourier transform over the N1 elements of column i.
    end for
    The result is a two-dimensional array containing the Fourier transforms of all N2 column vectors.
end procedure

procedure N2-POINT WINOGRAD FFT
    for i = 1, ..., N1 do
        Select all the elements of row i;
        Perform the N2-point Winograd Fourier transform over the N2 elements of row i.
    end for
    The result is a two-dimensional array containing the Fourier transforms of all N1 row vectors and N2 column vectors.
end procedure

procedure OUTPUT MAPPING
    Initialize k = 0
    for i = 1, ..., N1 do
        for j = 1, ..., N2 do
            index(k) = remainder((j − 1) × N2 + (i − 1) × N1, N)
            k = k + 1
        end for
    end for
end procedure
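The complete prime-factor flow of Algorithm 4.1 can be prototyped in a few lines of Matlab. The sketch below, assuming N = 15 = 3 × 5 and using Matlab's built-in fft in place of the Winograd modules, applies the CRT map on the input and the Ruritanian (RCM) map on the output and checks the result against the direct FFT. Index conventions vary in the literature (either map may be placed on either side); this sketch follows the pairing of Algorithm 4.1.

```matlab
N1 = 3; N2 = 5; N = N1*N2;                 % co-prime factors: gcd(N1,N2) = 1
x  = randn(1, N) + 1j*randn(1, N);

% Modular inverses: N1*M1 = 1 (mod N2) and N2*M2 = 1 (mod N1).
M1 = find(mod(N1*(1:N2), N2) == 1, 1);
M2 = find(mod(N2*(1:N1), N1) == 1, 1);

[n2, n1] = meshgrid(0:N2-1, 0:N1-1);       % n1 indexes rows, n2 columns
in_map  = mod(N2*M2*n1 + N1*M1*n2, N);     % CRT input map
out_map = mod(N2*n1    + N1*n2,    N);     % Ruritanian (RCM) output map

x2 = x(in_map + 1);                        % N1-by-N2 two-dimensional array
X2 = fft(fft(x2, N1, 1), N2, 2);           % column then row DFTs, no twiddles

X = zeros(1, N);
X(out_map + 1) = X2;                       % scatter back to one dimension
max(abs(X - fft(x)))                       % ~1e-15: matches the direct FFT
```

For N = 15 these maps reproduce the index tables quoted later in Sections 4.6.4.1 and 4.6.4.5 (where the factors are labeled N1 = 5 and N2 = 3).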

[Block diagram: input data sequence → input mapping (CRT) → data register → N1-point FFT → N1 × N2-point block RAM → N2-point FFT → data register → output mapping (RCM) → output data sequence.]

Figure 4.3. The Good-Thomas FFT architecture.

[State diagram with states: idle, reset, input mapping, row transform, initialize, column transform, output mapping, and exit.]

Figure 4.4. The FSM for the combined Good-Thomas and Winograd FFT algorithm.

[Flow chart: design specification → floating-point Matlab model → fixed-point Matlab model → Verilog description → functional simulations → synthesis → FPGA implementation.]

Figure 4.5. The design flow.

4.6.1 Design Considerations
We need to carefully consider the design parameters before beginning the implementation in Matlab or Verilog. These parameters are the precision of the data, the number of points N used for the computation, and the desired operating frequency f_op. The precision WL of the data is the number of bits per input, which we refer to as the bit-width; both the real and imaginary data values are WL bits wide. The bit-width parameter has a direct impact on the accuracy of the results as well as on the design size, performance, and power consumption. The second parameter is the number of points N used for the computation. Increasing N improves the resolution of the results; while we want to maximize this parameter, each additional point directly increases the memory requirement of the design, as more input samples must be stored. The upper limit on the number of points is thus bounded by the size of the memory available on the device. The final parameter is the circuit operating frequency f_op, which defines the slowest speed at which the circuit can run and still process the incoming data without loss.

4.6.2 Parallel Processing Architecture
The combined Good-Thomas and Winograd prime-factor FFT that we have designed supports parallel processing of data, i.e., temporal parallelism between the sub-blocks. We achieve this by implementing hand-shaking signals between the different modules. Fig. 4.6 shows the hand-shaking signals across the FFT processor.

[Block diagram: input mapping (CRT) → N2-point FFT → N1 × N2-point memory → N1-point FFT → output mapping (RCM), with "ready"/"done" hand-shaking signals and counter values passed between modules, and DVI/DVO at the input and output.]

Figure 4.6. FFT architecture.

The hand-shaking signals indicate to the current module whether the previous task has been completed. If the previous task is still in progress, the subsequent processing block waits until the output of the previous task is ready.

In the FFT architecture shown in Fig. 4.6, the input module maps the first set of N2 samples, sets the "ready" flag when the N2-point counter reaches N2, and notifies the N2-point FFT module to process the data. Once this set of N2-point data is ready and passed to the FFT module, the input module restarts to map the next set of N2 samples and waits until the FFT process is completed. The N2-point FFT module processes the input samples as soon as the "ready" flag is set by the CRT mapping module. The FFT module processes the N2-point data and sets the "done" flag, indicating that the N2-point transform of the N2 samples is completed. The processed samples are then passed to the N1 × N2 block memory along with the value of the N1-point counter; this counter acts as the address line to store the processed data at the corresponding location of the block RAM. The N2-point FFT module also indicates to the input mapping module that it is ready to process the next set of N2-point samples. The above process of CRT mapping and row transformation is repeated N1 times, so that all N input samples are row-transformed and stored in the N1 rows of the block RAM.

We then perform the column transformation on the N1-point samples. We select N1 samples from the block RAM and set the "ready" flag; the N1 points are formed by selecting a 2 × WL-bit complex sample from every row. The N1-point sequence is then passed to the N1-point FFT module to perform the column transformation. Once this set of N1-point data is ready and passed to the FFT module, the next set of N1-point samples is formed; however, this set of N1 samples is passed to the N1-point FFT module only when it is ready to accept it, as indicated by the "ready" flag. Once the FFT output is ready, it is passed to the next module to perform the output mapping. The output mapping module starts to map the received input and sets the "mapped" flag to indicate that it is ready to accept the next set of N1 samples. After all the samples are mapped, the "DVO" flag is set to indicate that the FFT output is ready.

Fig. 4.7 shows the temporal overlap of the sub-tasks of FFT computation.

[Task-versus-time chart showing the overlap of the sub-tasks: reset and input mapping (CRT), row transform, writing to and reading from the block RAM, column transform, and output mapping (RCM).]

Figure 4.7. Temporal parallelism of sub-tasks.

Fig. 4.8 shows the flowchart of operations for the combined Good-Thomas and Winograd prime-factor FFT.

4.6.3 Matlab Design
The design of the FFT processor begins with the development of the FFT algorithm in Matlab. The advantage of starting the design process in Matlab is that software modeling offers a higher level of abstraction for the rapid development and verification of algorithms. This means that, in a relatively short time, different models can be developed and compared in

[Flowchart: input sequence → wait for DVI → input mapping → (counter = N2?) → row transformation → write data to block RAM → (counter = N1?) → read data from block RAM → (data ready?) → column transformation → (counter = N2?) → output mapping → (DVO = 1?) → output ready.]

Figure 4.8. Flowchart of the operations.

Figure 4.8. Flowchart of the operations. 55 terms of number of required operations. We have designed the combined Good-Thomas and Winograd FFT algorithms in both fixed-point and floating-point representations. The output samples of our design were compared with the output values of the built-in Matlab FFT function. After verifying the FFT designs in floating-point representation, these designs were then converted into the fixed-point representation using our custom fixed-point library.

4.6.4 Verilog Design
The Verilog design started once the model of the parameterizable FFT architecture had been verified using Matlab simulations. We used Xilinx ISE for synthesizing our designs and Mentor Graphics ModelSim for functional simulation. This section describes the datapath of every module used in the combined Good-Thomas and Winograd prime-factor FFT processor. Let N denote the length of the transform, and let N1 and N2 be its co-prime factors.

4.6.4.1 INPUT MAPPING
The input samples arrive in order and are stored in the sequence of their arrival. The one-dimensional sequence must be mapped onto a two-dimensional array based on the Chinese remainder theorem. According to the CRT, the index values are given by index(i, j) = remainder((j − 1) × N2 × M2 + (i − 1) × N1 × M1, N), where M1 and M2 are calculated from the associated linear Diophantine equations. Note that the index values are pre-computed in Matlab and stored in a ".coe" file for the hardware implementation; this file contains all N index values used to initialize the memory that stores the CRT indices. The block RAM is a configurable memory module with a data bus and an address bus of log2 N bits.

As an example, if N = 15, then N1 = 5, N2 = 3, and the index values are 0, 6, 12, 3, 9, 10, 1, 7, 13, 4, 5, 11, 2, 8, and 14. Each of these index values has a bit-width of log2 15, rounded up to 4 bits, and they are stored in the RAM at locations 0, 1, 2, ..., 14, respectively. Thus, address 0 corresponds to index value 0, address 1 corresponds to index value 6, and so on. The address for the block RAM is generated by a counter that counts up to N2 and then resets to 0. The datapath of the input module is shown in Fig. 4.9. The data bus of the block RAM acts as the select line of a multiplexer and routes each input to the output location given by the stored index value. The output is an N1-point sequence, which is then used to perform the row transformations.
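The following Matlab fragment is a minimal sketch of how such an index LUT can be pre-computed; it reproduces the N = 15 sequence quoted above (the loop ranges follow the quoted table: N2 rows of N1 entries). The ".coe" file name is illustrative.

```matlab
N1 = 5; N2 = 3; N = N1*N2;
M1 = find(mod(N1*(1:N2), N2) == 1, 1);     % N1*M1 = 1 (mod N2) -> M1 = 2
M2 = find(mod(N2*(1:N1), N1) == 1, 1);     % N2*M2 = 1 (mod N1) -> M2 = 2

idx = zeros(N2, N1);
for i = 1:N2                               % N2 rows
    for j = 1:N1                           % of N1 entries each
        idx(i, j) = mod((i-1)*N1*M1 + (j-1)*N2*M2, N);
    end
end
idx_flat = reshape(idx.', 1, []);          % 0 6 12 3 9 10 1 7 13 4 5 11 2 8 14

% Write a Xilinx .coe-style initialization file (file name illustrative).
fid = fopen('crt_index.coe', 'w');
fprintf(fid, 'memory_initialization_radix=10;\n');
fprintf(fid, 'memory_initialization_vector=%s;\n', ...
        regexprep(num2str(idx_flat), '\s+', ','));
fclose(fid);
```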

4.6.4.2 N2-POINT ROW TRANSFORM
The Winograd Fourier transform (WFT) algorithm is used to perform the row transformation of the N2-point sequence. These WFTs are constructed with different numbers of pipelining stages, as mentioned earlier. The output of the N2-point row transform is stored in a block memory for further processing.

[Block diagram: block RAM holding the CRT index values, addressed by a counter; the data bus drives the select line of a multiplexer that routes each input sample to its mapped position.]

Figure 4.9. The datapath of the CRT input mapping.

Note that the row transformation is performed N1 times, once for each set of N2 samples.

4.6.4.3 N1 × N2-POINT BRAM
Each of the N2-point row-transformed outputs is stored in a block RAM that consists of N elements, each with a bit-width of WL bits. All the outputs of the row transformation module are stored sequentially in the N elements of the memory. The datapath and the input/output signals are shown in Fig. 4.10. The "read/write" signal is driven low to write each set of N2 samples to the corresponding addresses of the memory; the write operation takes one clock cycle per element. To perform the column transformation, we need to select the column elements of the two-dimensional array. In hardware, this is done by selecting every sample at a distance of N2 locations: we first select the zeroth sample, then the N2-th sample, then the (2 × N2)-th sample, and so on, to form an N1-point sequence. Once this set of N1 points is created, we form the next set of N1 points by selecting the 1st sample, the (N2 + 1)-th sample, the (2 × N2 + 1)-th sample, and so on. This process is repeated N2 times until all the column elements have been transformed. To read each of these elements, the "read/write" signal is set high.

[Block diagram: N-element, WL-bit-wide block RAM with input and output data buses, a read/write control signal, and an address bus formed from the N1 and N2 counters.]

Figure 4.10. The top-level organization of the N × WL block RAM, where N = N1 × N2.

4.6.4.4 N1-POINT COLUMN TRANSFORM
We use the WFT algorithm to perform the column transformation of the N1-point sequence. The N1-point column transform is constructed in the same manner as the N1-point WFT discussed earlier. Note that N1 and N2 are co-prime numbers.

4.6.4.5 OUTPUT MAPPING
The output mapping is based on the Ruritanian correspondence mapping. The output index values are given by index(k) = remainder((j − 1) × N2 + (i − 1) × N1, N); they are pre-calculated in Matlab and stored in a block memory, similar to the CRT values. The data bus and address bus are log2 N bits wide. For instance, if N = 15, then N1 = 5, N2 = 3, and the index values are 0, 3, 6, 9, 12, 5, 8, 11, 14, 2, 10, 13, 1, 4, and 7. Each of these index values has a bit-width of log2 15, rounded up to 4 bits, and they are stored in the RAM at locations 0, 1, 2, ..., 14, respectively.

The mapping of the transformed samples to their locations at the output begins by selecting WL bits starting from the MSB. A counter over the N samples acts as the address bus for the block RAM, which selects the appropriate output location, and the sample is routed to the location given by the value read from the block RAM. The next set of WL bits is then selected and routed accordingly, and so on, for the given set of N2 samples. This process is repeated for all N1 × N2 samples, ensuring that every sample is mapped. The output signal "DVO" is then set high, indicating that the Fourier transform is complete. The datapath of the output module is shown in Fig. 4.11.

Figure 4.11. The datapath of the RCM output mapping.

4.7 TESTING
Testing is an important part of the project, and testing is done at all levels. Random testing is the most frequently used method: random sequences are generated by Matlab, written to a test file, and read by the Verilog testbenches. We also performed corner testing. Corner cases are input sequences that the designer thinks can cause errors, e.g., overflows in adders and multipliers; in our design, a corner case could be an input sequence of only maximum or only minimum input values. In this thesis, corner testing is mostly done to ensure that the safe scaling works correctly without overflows. Block testing is the lowest level of testing: before a block is built out of sub-blocks, the sub-blocks are run through a block test so that each sub-block is verified. When it is known that all sub-blocks work correctly, any block-level errors can be attributed to the interconnection of the sub-blocks. The FFT block as a whole has been successfully tested with impulse, sine, and cosine input values.
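As an illustration of the random-test flow, the following Matlab sketch generates WL-bit two's-complement test vectors and writes them to a text file that a Verilog testbench could read (e.g., with $readmemh); the file name and the hex line format are illustrative, not the thesis' actual testbench interface. A golden reference is computed with Matlab's fft for comparison.

```matlab
WL = 16; N = 1008;
re = randi([-2^(WL-1), 2^(WL-1)-1], N, 1);   % random WL-bit integers
im = randi([-2^(WL-1), 2^(WL-1)-1], N, 1);

fid = fopen('fft_test_vectors.txt', 'w');
for n = 1:N
    % one sample per line: real then imaginary, as two's-complement hex
    fprintf(fid, '%04X %04X\n', mod(re(n), 2^WL), mod(im(n), 2^WL));
end
fclose(fid);

X_ref = fft(complex(re, im));                % golden reference output
```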

4.8 HARDWARE COST AND IMPLEMENTATION RESULTS
To demonstrate our implementation of the Good-Thomas and Winograd prime-factor FFT algorithm, we plot the output of a 1008-point transform for a shifted impulse input, as shown in Fig. 4.12, for the fixed-point and floating-point implementations. Fig. 4.13 shows the absolute error between the fixed-point and floating-point implementations. The implementation results for the 2-, 3-, 4-, 5-, 7-, 8-, 9-, and 16-point Winograd FFTs with 16-bit word length are listed in Table 4.1.

[Two panels: amplitude versus output index (0-1000) for the fixed-point and floating-point implementations.]

Figure 4.12. Fourier transform output of the shifted impulse input.

[Plot: absolute error (up to about 0.04) versus output index (0-1000).]

Figure 4.13. Absolute error plots of the fixed-point and floating-point Fourier transforms.

Table 4.1. Resource utilization of the Winograd FFT algorithms

FFT points          2     3     4     5     7     8     9    16
Multipliers         0     2     0     5     8     2    10    10
Adders              2     6     8    17    38    26    44    74
Pipeline stages     0     5     3     7     8     5    10     7
Registers         136   458   227   774  1698  1029  2316  3091
LUTs              208   629   416  1636  3569  2509  4543  7106

The implementation results for the combined Good-Thomas and Winograd prime-factor FFT algorithm for FFT lengths of 80, 120, 180, 240, 504, 720, and 1008 points are given in Table 4.2.

Table 4.2. Resource utilization of the combined Good-Thomas and Winograd FFT algorithm

FFT points      80    120    180    240    504    720   1008
Registers     2316   2529   3644   4390   5336   6305   7229
LUTs          5966   4974   7224   9653  10863  13646  15550
Block RAMs       5      9      9      9      9      9      9
DSP48E1s        16     18     30     34     40     50     56

4.8.1 Latency and Throughput
For a k-stage fully pipelined structure, the latency is k clock periods and the throughput equals the clock frequency. Our design has several pipeline stages, depending on the SFG. The performance parameters (latency, throughput, and transform time) for the WFT modules and for the combined Good-Thomas and Winograd FFT are given in Tables 4.3 and 4.4, respectively.

Table 4.3. Performance of the Winograd FFT algorithms

FFT points                     2      3      4      5      7      8      9     16
Clock frequency (MHz)        369    350    360    320    296    310    296    296
Latency (cycles)               2      8      7     12     15     13     19     23
Transform time (cycles)        1      5      3      7      8      5     10      7
Transform time (µs)       0.0027  0.014  0.008  0.021  0.027  0.016  0.033  0.023
Throughput (Msamples/s)      369    350    360    320    296    310    296    296

Table 4.4. Performance of the combined Good-Thomas and Winograd FFT algorithms

FFT points                  80    120    180    240    504    720   1008
Clock frequency (MHz)      296    296    296    296    296    296    296
Latency (cycles)           335   1283   2207   2521   4979   7163   9176
Transform time (µs)        1.1   4.33    7.4   8.51   16.8   24.2     31
Throughput (Msamples/s)    296    296    296    296    296    296    296
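As a consistency check of Table 4.4, the transform time follows directly from the latency and the clock frequency; for the 1008-point transform,

t = 9176 cycles / 296 MHz ≈ 31 µs,

and since the fully pipelined design accepts one complex sample per clock cycle, the throughput equals the 296 MHz clock rate, i.e., 296 Msamples/s.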

4.9 ANALYSIS AND COMPARISON
Tables 4.5 and 4.6 give the implementation characteristics of various published FFT implementations along with our implementation results.

Table 4.5. Resource utilization comparison with other fixed-point FFT processors

Design  FFT Alg.    N     WL   Device     Regs   LUTs   BRAMs  DSP48s  Clock (MHz)
[19]    Radix-2^2   1024  16   Spartan-3   --     5916    --     16       92
[20]    Radix-2     1024  32   --          --    27403    19     32      250
[21]    Radix       var   --   --         2372    4278    19     --      200
[22]    Radix-2     1024  16   Virtex-4    --    10353    --     10      100
[23]    Radix-2     1024  32   --          372     --     --     --      100
[12]    Radix-4     1024  16   Spartan    9742    3911    --     --      158
[24]    R4SDC       1024  16   Spartan    3064     --      8     --      219
[24]    R2^2SDF     1024  16   Virtex     2256     --      8     --      235
[25]    Radix-2     1024  16   Virtex-6   4699    6298    17     16      366
Ours    GTW         1008  16   Virtex-6   7229   15550     9     56      296

It can be observed that, except for our work, most of the other FFT implementations use radix-based algorithms for the FFT computation. In this thesis, we have successfully implemented the combined Good-Thomas and Winograd prime-factor FFT algorithm, achieving a throughput higher than all but one of the other FFT implementations listed. In [19], the author implements a 1K-point FFT algorithm for the OFDM systems used in WiMAX technology; the algorithm has the same multiplicative complexity as the radix-4 algorithm but retains the butterfly structure of the radix-2 algorithm. The FFT architecture in [20], developed by Sundance Technologies, is claimed to be the most efficient off-the-shelf IEEE-754 floating-point FFT/IFFT core available in the FPGA world; its architecture allows the user to change the transform length on the fly without reconfiguring the programmable device, which makes it a very flexible component for systems with complex algorithms.

Table 4.6. Performance comparison with other fixed-point FFT processors

Design   Clock frequency (MHz)   Latency (cycles)   Transform time (µs)   Throughput (Msamples/s)
[19]        92                      6086               65.89                  --
[20]       250                      2850               11.4                   90
[21]       200                       --                 --                   200
[22]       100                       --                26                     --
[23]       100                      6295               62.95                  --
[12]       158                       --                 --                   158
[24]       219                      1041                4.67                 219
[24]       235                      1042                4.35                 235
[25]       366                     12453               34.02                 366
Ours       296                      9176               31                    296

The author in [21] implements a hardware-efficient variable-length FFT processor for wireless communication standards. A flexible solution is presented, applicable to the useful FFT processor lengths 2^n (n = 6, 7, ..., 13), and optimized for efficient resource usage and high throughput in wireless communication applications. The key features of the design include a conflict-free in-place memory replacement scheme for intermediate data storage, a dynamic address generator scheme, and the Coordinate Rotation Digital Computer (CORDIC) technique for twiddle-factor multiplication. The Xilinx LogiCORE IP FFT implements the Cooley-Tukey FFT algorithm; all memory is on-chip, using either block RAM or distributed RAM, and a run-time configurable scaling schedule is supported for scaled fixed-point cores. The implementation results shown in Table 4.5 are for the 3GPP-LTE wireless communication standard. Note that these results are for version 7.0 of the core, published in 2011; the present Xilinx IP core is more advanced and allows the user to select an operating frequency from 50 to 500 MHz.

CHAPTER 5

APPLICATION OF FFT IN WIRELESS COMMUNICATION SYSTEMS

5.1 OVERVIEW
Long Term Evolution (LTE) [26] is the next step forward from 3G cellular services. Cellular networks require flexibility to support a variety of wireless communication system standards. The design of the LTE physical layer (PHY) is significantly influenced by the requirements for a high data transmission rate (100 Mbps downlink / 50 Mbps uplink), spectral efficiency, and multiple channel bandwidths (1.25-20 MHz). To fulfill these requirements, orthogonal frequency division multiplexing (OFDM) was chosen as the basis for the PHY layer. OFDM-based wireless communication uses the FFT and its inverse, as shown in Fig. 5.1. Various architectures have been proposed for the implementation of FFTs with power-of-two lengths; the various FFT and IFFT lengths require support for radix algorithms, most commonly radix-2. In this chapter, we present the suitability of the combined Good-Thomas and Winograd FFT architecture for LTE cellular networks.

[Block diagram. Transmitter: bit stream → modulation → constellation mapping → inverse FFT → cyclic prefix → D/A conversion → RF module → power amplifier. Receiver: low-noise amplifier → baseband module → A/D conversion → remove cyclic prefix → FFT → constellation mapping → demodulation → bit stream.]

Figure 5.1. OFDM architecture.

5.2 OFDM TECHNIQUE
The OFDM [27] technique is based on using multiple narrow-band sub-carriers spread over a wide channel bandwidth: OFDM systems divide the available bandwidth into many narrow sub-carriers and transmit the data in parallel streams. Each sub-carrier experiences flat fading, as it has a bandwidth smaller than the coherence bandwidth of the mobile channel. This obviates the need for the complex frequency equalizers featured in 3G technologies. Each sub-carrier is modulated using a varying level of QAM modulation, e.g., QPSK, QAM, 64-QAM, or possibly higher orders, depending on signal quality. Each OFDM symbol is therefore a linear combination of the instantaneous signals on each of the sub-carriers in the channel. Because data is transmitted in parallel rather than serially, OFDM symbols are generally much longer than the symbols of single-carrier systems of equivalent data rate. The sub-carriers are mutually orthogonal in the frequency domain, which mitigates intersymbol interference (ISI): OFDM communication systems do not rely on increased symbol rates to achieve higher data rates, which makes the task of managing ISI much simpler.
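The modulation path of Fig. 5.1 can be sketched in a few lines of Matlab. The fragment below maps bits to QPSK symbols, applies the inverse FFT, and prepends a cyclic prefix; the sizes follow the LTE normal-CP numerology quoted later in this chapter (N = 2048, first-symbol CP of 160 samples), while the random bit source and the ideal channel are illustrative assumptions.

```matlab
N = 2048;  Ncp = 160;
bits = randi([0 1], 2*N, 1);
sym  = (1 - 2*bits(1:2:end) + 1j*(1 - 2*bits(2:2:end))) / sqrt(2);  % QPSK
tx   = ifft(sym, N);                    % one OFDM symbol in the time domain
tx_cp = [tx(end-Ncp+1:end); tx];        % cyclic prefix: copy of the symbol tail

% Receiver: strip the CP and take the FFT to recover the QPSK symbols.
rx = fft(tx_cp(Ncp+1:end), N);
max(abs(rx - sym))                      % ~1e-15 over an ideal channel
```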

[Plot: normalized amplitude versus frequency for overlapping orthogonal sub-carriers.]

Figure 5.2. OFDM sub-carrier spacing.

The mobile propagation channel is typically time-dispersive: multiple replicas of a transmitted signal are received with various time delays due to multipath, resulting from the reflections the signal incurs along the path between the transmitter and receiver. Time dispersion is equivalent to a frequency-selective channel frequency response. This leads to at least a partial loss of orthogonality between sub-carriers, which causes ISI not only within a sub-carrier but also between sub-carriers. To prevent the overlapping of symbols and reduce intersymbol interference, a guard interval Tg is added at the beginning of each OFDM symbol. The guard interval, or cyclic prefix (CP), is a duplicate of a fraction of the symbol end; the total symbol length becomes Ts = Tu + Tg, where Tu is the useful symbol duration. This makes the OFDM symbol insensitive to time dispersion.

The downlink physical layer of LTE is based on orthogonal frequency division multiple access (OFDMA). However, despite its many advantages, OFDMA has certain drawbacks, such as high sensitivity to frequency offset (resulting from the instability of the electronics and from Doppler spread due to mobility) and a high peak-to-average power ratio (PAPR). A high PAPR occurs due to the random constructive addition of sub-carriers and results in spectral spreading of the signal, leading to adjacent-channel interference. This problem can be overcome with power amplifiers having a high compression point and with amplifier linearization techniques; while these methods can be used at the base station, they become expensive in the user equipment (UE). Hence, LTE uses single-carrier FDMA (SC-FDMA) with a cyclic prefix on the uplink, which reduces the PAPR since there is only a single carrier as opposed to N carriers.

5.3 LTE PHYSICAL LAYER

5.3.1 Generic Frame Structure
In the time domain, the different time intervals within LTE [28] are expressed as multiples of a basic time unit Ts = 1/30,720,000 s. The radio frame has a length Tframe = 307200 × Ts = 10 ms. Each frame is divided into 10 equally sized sub-frames of length Tsubframe = 30720 × Ts = 1 ms. Scheduling is done on a sub-frame basis for both the downlink and the uplink. Each sub-frame consists of two equally sized slots of length Tslot = 15360 × Ts = 0.5 ms, and each slot in turn consists of a number of OFDM symbols, either seven (normal cyclic prefix) or six (extended cyclic prefix). Fig. 5.3 shows the frame structure for LTE in frequency division duplex (FDD) mode. The useful symbol time is Tu = 2048 × Ts ≈ 66.7 µs.

[Diagram: one 10 ms radio frame of 10 sub-frames (1 ms each), each containing two 0.5 ms slots (20 slots numbered #1-#20); within a slot, the first OFDM symbol carries a 5.2 µs CP and the remaining symbols a 4.7 µs CP ahead of the useful symbol duration.]

Figure 5.3. LTE symbol and frame structure.

For the normal mode, the first symbol in a slot has a cyclic prefix (CP) of length TCP = 160 × Ts ≈ 5.2 µs, and the remaining six symbols have a cyclic prefix of length TCP = 144 × Ts ≈ 4.7 µs. The first symbol carries a longer CP so that the overall slot length comes to exactly 15360 time units: (160 + 2048) + 6 × (144 + 2048) = 15360. For the extended mode, the cyclic prefix is TCP−e = 512 × Ts ≈ 16.7 µs. In either case, the CP is longer than the delay spread of a few microseconds typically encountered in practice. The normal cyclic prefix is used in urban cells and for high data rate applications, while the extended cyclic prefix is used in special cases, such as multi-cell broadcast and very large cells.
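These timing relations are easy to verify numerically; the MATLAB fragment below (a minimal sketch using only the constants quoted above) reproduces them:

% Minimal sketch: verify the LTE timing relations of this section.
Ts = 1/30720000;                             % basic time unit, in seconds
Tu = 2048*Ts;                                % useful symbol time, ~66.7 us
slot_units = (160 + 2048) + 6*(144 + 2048);  % normal CP: 15360 time units
Tslot  = slot_units*Ts;                      % = 0.5 ms
Tframe = 20*Tslot;                           % = 10 ms (20 slots per frame)
assert(slot_units == 15360);                 % slot divisible as claimed
assert(6*(512 + 2048) == 15360);             % extended CP also fits exactly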

5.3.2 LTE Parameters

In the frequency domain, the number of sub-carriers N ranges from 128 to 2048, depending on the channel bandwidth, with 512 and 1024 (for 5 and 10 MHz, respectively) being most commonly used in practice. The sub-carrier spacing is ∆f = 1/Tu = 15 kHz, and the sampling rate is fs = ∆f × N = 15000 × N. The LTE parameters have been chosen such that FFT lengths and sampling rates are easily obtained for all operation modes, while at the same time ensuring the easy implementation of dual-mode devices with a common clock reference. Table 5.1 summarizes some of the main physical layer parameters currently used for LTE in the FDD mode [29].

Table 5.1. FFT sizes and other physical parameters used in the current LTE standard
Channel Bandwidth (MHz)     1.25    2.5     5       10      15      20
Sub-carrier Spacing (kHz)   15      15      15      15      15      15
Number of Resource Blocks   6       12      25      50      75      100
Occupied Sub-carriers       76      151     301     601     901     1201
Guard Sub-carriers          52      105     211     423     635     847
FFT Size                    128     256     512     1024    1536    2048
Sampling Frequency (MHz)    1.92    3.84    7.68    15.36   23.04   30.72
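The last row of Table 5.1 follows directly from fs = ∆f × N; a one-line MATLAB check (a minimal sketch):

% Minimal sketch: the sampling frequencies of Table 5.1 are 15 kHz * N.
N  = [128 256 512 1024 1536 2048];   % FFT sizes per channel bandwidth
fs = 15e3*N/1e6                      % 1.92 3.84 7.68 15.36 23.04 30.72 MHz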

It can be observed that the FFT lengths are chosen to be powers of two, since radix-2 FFT algorithms have been the most widely utilized. However, a variety of other FFT lengths can be implemented using the Good-Thomas and Winograd prime-factor FFT algorithms. Note that these are not a direct replacement for the power-of-2 FFTs: the lengths supported by the prime-factor algorithms differ from those of the radix algorithms, so the sampling frequency must be designed to match the chosen FFT length. The available number of resource blocks and the occupied sub-carriers for the LTE standard are fixed for a given channel bandwidth. For instance, the 1.25 MHz channel has 6 resource blocks and 72 occupied sub-carriers. Table 5.2 shows the resource blocks and the occupied sub-carriers for the supported channel bandwidths.

Table 5.2. Available resource blocks and occupied sub-carriers
Channel Bandwidth (MHz)     1.25    2.5     5       10      15      20
Sub-carrier Spacing (kHz)   15      15      15      15      15      15
Number of Resource Blocks   6       12      25      50      75      100
Occupied Sub-carriers       72      144     300     600     900     1200

Note that the guard sub-carriers should amount to at least 10% of the occupied sub-carriers to avoid interference; the required number of guard sub-carriers can therefore be used to determine an appropriate FFT size. Suppose, for instance, that we use an FFT of size 80 for the 1.25 MHz bandwidth, for which 72 sub-carriers are occupied. Adding 8 guard sub-carriers (8 ≥ 0.1 × 72 = 7.2) gives a total of 80 sub-carriers, equal to the desired FFT size. Tables 5.3, 5.4, and 5.5 show the physical layer parameters required for implementing the LTE standard over the different channel bandwidths; these FFT lengths can be implemented using the combined Good-Thomas and Winograd prime-factor algorithm. A short sketch of this sizing procedure is given after the tables.

Table 5.3. Minimum required FFT lengths and other physical parameters for LTE
Channel Bandwidth (MHz)     1.25    2.5     5       10      15      20
Guard Sub-carriers          8       16      30      60      90      120
FFT Size                    80      160     330     660     990     1320
Sampling Frequency (MHz)    1.2     2.4     4.95    9.9     14.85   19.8

Table 5.4. Possible FFT lengths using the Good-Thomas algorithm and other physical parameters for LTE
Channel Bandwidth (MHz)     1.25    2.5     5       10      15      20
Guard Sub-carriers          18      36      60      120     108     1320
FFT Size                    90      180     360     720     1008    2520
Sampling Frequency (MHz)    1.35    2.70    5.4     10.8    15.12   37.8

Table 5.5. Maximum possible FFT lengths using the Good-Thomas algorithm and other physical parameters for LTE
Channel Bandwidth (MHz)     1.25    2.5     5       10      15      20
Guard Sub-carriers          48      96      204     408     780     1320
FFT Size                    120     240     504     1008    1680    2520
Sampling Frequency (MHz)    1.80    3.60    7.56    15.12   25.2    37.8
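The sizing procedure referenced above can be sketched in a few lines of MATLAB. The module set used below, {2, 3, 4, 5, 7, 8, 9, 11, 13, 16}, is an assumed example chosen for illustration; the lengths reported in Tables 5.3–5.5 depend on exactly which Winograd modules are implemented, so the output of this sketch need not match every table entry.

% Minimal sketch: smallest prime-factor FFT length that leaves at least
% 10% of the occupied sub-carriers as guard band. The module set is an
% assumed example, not the exact set used in this thesis.
occupied = [72 144 300 600 900 1200];   % per channel bandwidth (Table 5.2)
modules  = [2 3 4 5 7 8 9 11 13 16];    % candidate Winograd module lengths
lengths  = [];                          % valid Good-Thomas lengths
for mask = 1:2^numel(modules)-1         % enumerate subsets of the modules
    sel = modules(bitget(mask, 1:numel(modules)) == 1);
    ok = true;                          % keep only pairwise-coprime subsets
    for a = 1:numel(sel)
        for b = a+1:numel(sel)
            if gcd(sel(a), sel(b)) ~= 1, ok = false; end
        end
    end
    if ok, lengths(end+1) = prod(sel); end
end
lengths = unique(lengths);
for occ = occupied
    N = lengths(find(lengths >= 1.1*occ, 1));   % smallest valid length
    fprintf('occupied %4d -> N = %4d, guard = %4d, fs = %5.2f MHz\n', ...
            occ, N, N - occ, 15e3*N/1e6);
end

With this example module set, the sketch reproduces the FFT sizes of Table 5.3 for all but the 2.5 MHz column, where the smallest pairwise-coprime product is 165 = 3 × 5 × 11 rather than the 160 listed; this again illustrates that the achievable lengths are tied to the available modules.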

CHAPTER 6 CONCLUSION

The fast Fourier transform is one of the most important blocks in today's wireless communication systems, such as 3GPP-LTE. In this thesis, various FFT algorithms have been discussed with respect to their computational requirements, arithmetic complexity, and hardware architectures. The majority of FFT implementations in hardware are based on the Cooley-Tukey algorithm, or some derivative of it, owing to its simplicity and relatively straightforward hardware realization. We demonstrated, to the best of our knowledge, the first high-throughput hardware architecture for the combined Good-Thomas and Winograd FFT algorithms, and we successfully developed FFT modules for different FFT lengths on a Xilinx Virtex-6 FPGA. We briefly reviewed the most commonly used FFT algorithms, including the Cooley-Tukey algorithm, the Good-Thomas algorithm, different radix-based algorithms, and the Winograd FFT algorithm, and compared them in terms of their arithmetic complexity. We showed that the Good-Thomas and Winograd FFTs require the fewest arithmetic computations and are hence well suited for hardware implementation. We implemented three different architectures for the selected FFT algorithm and chose to present the most efficient one, which is able to operate at a very high clock frequency. We focused on improving performance, which is determined by the frequency of operation, at the expense of slightly increased utilization of hardware resources. We pipelined the FFT architecture and presented implementation results for different numbers of pipeline stages. Finally, we compared our implementation with several other state-of-the-art FFT architectures and showed that our pipelined architecture operates at a higher frequency at the expense of greater resource utilization.

6.1 FUTURE WORK

In general, optimizing a hardware structure is a trade-off between various design constraints, such as performance, resource utilization, power consumption, and precision. How these parameters should be weighed against each other depends, of course, on the application. Usually, for a given precision, the goal is to implement an FFT module that meets certain performance and power constraints. In this thesis we aimed for the highest possible operating frequency; future work could instead investigate more compact architectures that dissipate minimal power. Another potential direction is the development of an FFT processor that supports various FFT lengths.

6.1.1 VLSI layout

The work in this thesis ends with FPGA-synthesizable code. FPGAs have many advantages; in particular, they allow a short development cycle, because no layout has to be completed and sent for fabrication. However, FPGAs cannot compete with dedicated VLSI implementations in speed or power consumption, and for large production volumes they cannot compete in unit price. Given the better speed and power performance that a VLSI design offers, a VLSI layout of this FFT processor would be a worthwhile next step.

BIBLIOGRAPHY

[1] M. T. Heideman, D. H. Johnson, and C. S. Burrus, “Gauss and the history of the fast Fourier transform,” IEEE Mag. Acoust. Speech Signal Process., vol. 1, no. 3, pp. 14–21, 1984.
[2] J. W. Cooley and J. W. Tukey, “An algorithm for the machine calculation of complex Fourier series,” Math. of Comput., vol. 19, pp. 297–301, 1965.
[3] P. Duhamel and M. Vetterli, “Fast Fourier transforms: A tutorial review and a state of the art,” IEEE Trans. Signal Process., vol. 19, pp. 259–299, 1990.
[4] D. P. Kolba and T. W. Parks, “A prime factor FFT algorithm using high-speed convolution,” IEEE Trans. Acoust. Speech Signal Process., vol. 25, pp. 281–294, 1977.
[5] P. Duhamel and H. Hollmann, “Split radix FFT algorithm,” Electron. Lett., vol. 20, pp. 14–16, 1984.
[6] S. Winograd, “On the multiplicative complexity of the discrete Fourier transform,” Advances in Math., vol. 32, pp. 83–117, 1979.
[7] I. J. Good, “The interaction algorithm and practical Fourier analysis,” J. Roy. Statist. Soc., vol. 32, pp. 361–372, 1958.
[8] L. H. Thomas, Using a Computer to Solve Problems in Physics. Boston, MA: Ginn and Co., 1963.
[9] R. E. Blahut, Fast Algorithms for Digital Signal Processing. New York, NY: Cambridge University Press, 2010.
[10] D. Sundararajan and M. Ahmad, “Vector split-radix algorithm for DFT computation,” in IEEE Inter. Symp. Circuits Syst., 1996, pp. 532–535.
[11] C. Meletis, P. Bougas, G. Economakos, P. Kalivas, and K. Pekmestzi, “High-speed pipeline implementation of radix-2 DIF algorithm,” Int. J. Signal Process., vol. 4, no. 1, pp. 66–69, 2004.
[12] R. B. Alapure and K. M. Ghadge, “FPGA implementation of 1024-point radix-4 FFT core using Xilinx VHDL,” Int. J. Eng. Tech. Res., vol. 3, pp. 180–181, 2015.
[13] C. Burrus and T. W. Parks, DFT/FFT and Convolution Algorithms: Theory and Implementation. New York, NY: John Wiley & Sons, Inc., 1991.
[14] S. He and M. Torkelson, “A new approach to pipeline FFT processor,” in IEEE Proc. Parallel Process. Symp., 1996, pp. 766–770.
[15] E. Chu, Discrete and Continuous Fourier Transforms: Analysis, Applications and Fast Algorithms. Boca Raton, FL: CRC Press, 2008.
[16] E. Weisstein, “Diophantine equation,” 2016. [Online]. Available: http://mathworld.wolfram.com/DiophantineEquation
[17] C. Rader, “Discrete Fourier transforms when the number of data samples is prime,” Proc. IEEE, vol. 56, no. 6, pp. 1107–1108, 1968.
[18] M. Vetterli and H. Nussbaumer, “Simple FFT and DCT algorithms with reduced number of operations,” IEEE J. Signal Process., vol. 6, no. 4, pp. 267–278, 1984.
[19] K. Harikrishna, T. R. Rao, and V. A. Labay, “FPGA implementation of FFT algorithm for IEEE 802.16e (mobile WiMAX),” Int. J. Comput. Theory Eng., vol. 3, no. 2, pp. 197–202, 2011.
[20] Sundance Technologies, “FFT/IFFT floating point core,” 2016. [Online]. Available: http://www.sundance.com/product-range/sundance-products/ip-cores/fc100-ieee-754-fft-ifft-floating-point-core-for-fpga
[21] V. Gautam, K. C. Ray, and P. Haddow, “Hardware efficient design of variable length FFT processor,” in IEEE Int. Symp. Des. Diagnostics Electron. Circuits Syst., 2011, pp. 309–312.
[22] Z. H. Derafshi, J. Frounchi, and H. Taghipour, “A high speed FPGA implementation of a 1024-point complex FFT processor,” in IEEE Int. Conf. Comput. Netw. Technol., 2010, pp. 312–315.
[23] S. Zhou, X. Wang, J. Ji, and Y. Wang, “Design and implementation of a 1024-point high-speed FFT processor based on the FPGA,” in IEEE Int. Conf. Image Signal Process., vol. 2, 2013, pp. 1112–1116.
[24] B. Zhou, Y. Peng, and D. Hwang, “Pipeline FFT architectures optimized for FPGAs,” Int. J. Reconfigurable Comput., p. 1, 2009.
[25] Xilinx Inc., “Fast Fourier Transform v7.1 LogiCORE IP,” 2016. [Online]. Available: http://www.xilinx.com/support/documentation/ip_documentation/xfft_ds260
[26] C. Johnson, Long Term Evolution in Bullets. Seattle, WA: CreateSpace Independent Pub., 2010.
[27] W. Kabir, “Orthogonal frequency division multiplexing (OFDM),” in IEEE Conf. Microw., 2008, pp. 178–184.
[28] J. Zyren and W. McCoy, “Overview of the 3GPP long term evolution physical layer,” Freescale Semiconductor, Inc., white paper, 2007.
[29] Telesystem Innovations, “LTE in a nutshell,” white paper, p. 6, 2010.