Adv. Radio Sci., 6, 89–94, 2008 Advances in www.adv-radio-sci.net/6/89/2008/ © Author(s) 2008. This work is distributed under Radio Science the Creative Commons Attribution 3.0 License.

Matrix-Vector Based Fast Fourier Transformations on SDR Architectures

Y. He2, K. Hueske1, J. Gotze¨ 1, and E. Coersmeier2 1University of Dortmund, Information Processing Lab, Otto-Hahn-Str. 4, 44221 Dortmund, Germany 2Nokia Research Center, Bochum, Meesmannstr. 103, 44807 Bochum, Germany

Abstract. Today Discrete Fourier Transforms (DFTs) are ap- optimum in terms of mathematical complexity, but as the plied in various radio standards based on OFDM (Orthogonal used architectures are not necessarily optimized for butterfly Frequency Division Multiplex). It is important to gain a fast structures, the question arises: Are there any efficient pro- computational speed for the DFT, which is usually achieved cedures, which reduce the computational complexity of the by using specialized Fast (FFT) engines. DFT significantly (as far as possible to the FFT complex- However, in face of the Software Defined Radio (SDR) de- ity), but at the same time are more tailor-made to specific velopment, more general (parallel) processor architectures hardware architectures than the FFT?. In this paper a - are often desirable, which are not tailored to FFT compu- vector based formulation of the FFT algorithm (Van Loan, tations. Therefore, alternative approaches are required to 1992) is derived from the respective matrix factorizations of reduce the complexity of the DFT. Starting from a matrix- the DFT matrix. Due to the structure of the butterfly matrices vector based description of the FFT idea, we will present dif- and the matrix factorizations, the matrix-vector based FFT ferent factorizations of the DFT matrix, which allow a reduc- allows several separations of the resulting matrix product. tion of the complexity that lies between the original DFT and These separations have an increased computational complex- the minimum FFT complexity. The computational complexi- ity compared to the FFT, but because of their mathematical ties of these factorizations and their suitability for implemen- structure they allow more flexible implementations on dif- tation on different processor architectures are investigated. ferent architectures (e.g. multi processors, vector processors, accelerated systems). The paper is organized as follows: In Sect. 2 we will clarify the definition and notation of the DFT. A short repetition of the FFT idea and the corresponding fac- 1 Introduction torization is described in Sect. 3. Then, in Sect. 4 differ- ent separations of the butterfly matrices are presented, which Recently multi-carrier modulation techniques, such as lead to different algorithmic structures and different compu- OFDM, have received great attention in high-speed data tational complexities. Implementation issues of these separa- communication systems and have been selected for several tions and their suitability for different hardware architectures communication standards (e.g. WLAN, DVB and WiMax). are discussed. Conclusions are given in Sect. 5. A central baseband signal processing task required in OFDM transceivers is the Fourier transform of the received data, which is generally realized as FFT (Cooley, 1965). To meet 2 Discrete Fourier Transform (DFT) the real-time processing requirements, various specialized m FFT processors have been proposed (Son, 2002). On the The sequence of N (with N=2 , m∈ N) complex numbers other hand Software Defined Radio (SDR) approaches for x1, ···, xN is transformed into the sequence of N complex mobile terminals are gaining momentum (Dillinger, 2003). numbers y1, ···, yN by the DFT according to To meet the demand for high computing power and flexi- N bility, specific reconfigurable accelerators (Heyne, 2004) or = X (t−1)(n−1) = ··· yt xnwN , t 1, ,N, (1) parallel processor architectures (Berthold, 2004) are used. n=1 The DFT can be implemented as FFT, which defines the −j 2π where wN =e N is a primitive N-th root of the unity. The DFT can also be formulated as a matrix vector prod- Correspondence to: K. Hueske T N uct (Van Loan, 1992). With xN =[x1···xN ] ∈ C and ([email protected]) T N yN =[y1···yN ] ∈ C , the DFT can be described as matrix

Published by Copernicus Publications on behalf of the URSI Landesausschuss in der Bundesrepublik Deutschland e.V. 90 Y. He et al.: Matrix-Vector Based Fast Fourier Transformations on SDR Architectures

N×N vector product yN =FN ·xN , where FN ∈ C is the DFT- − · − Matrix with elements f =w(t 1) (n 1) and n, t=1, ···,N. nt N   y1 T yN = = FN · xN = FN · PN · PN ·xN (5) y2 | {z } 3 Matrix-vector based FFT IN       I N D N F N 0 xo 2 2 2 To give a repetition of the known radix-2 FFT algo- =     ·   rithms (Cooley, 1965), this section will show the connection     I N −D N 0 F N xe 2 2 2 between the matrix FN and the matrix FN/2, which enables us to compute an N-point DFT from a pair of (N/2)-point Using yo=F N ·xo and ye=F N ·xe we obtain: DFTs. To transfer the FFT idea to the DFT matrix, we in- 2 2 troduce a so-called PN of size (N×N), which performs an odd-even ordering of the rows (multiplied y1 = yo + D N · ye = yo + d N · ∗ye (6) 2 2 from the left with PN ) or columns (multiplied from the right T T · = with PN ) of a matrix. Furthermore PN PN IN holds. The following factorizations and separations are shown for the y2 = yo − D N · ye = yo − d N · ∗ye. (7) 2 2 decimation in time (DIT) FFT. They can be easily transferred to the decimation in frequency (DIF) FFT. With d N the vectorized diagonal of D N , with (·∗) the 2 2 component-wise vector multiplication and taking into ac- 3.1 FFT-Matrix decomposition count that the two products are identical, Eq. (6) and Eq. (7) need altogether N multiplications and 2 · N additions. It is T 2 2 Applying PN to the right of the DFT-matrix FN yields easy to see that a DFT of size N can be executed by two N = m   DFTs of size 2 . For N 2 this process can be recursively F N D N · F N = 2 2 2 repeated m log2 N times. The required number of oper- F0 = F · PT =   . (2) N N N N   ations then will be 2 log2 N multiplications and N log2 N F N −D N · F N additions, which represents the complexity of the FFT. 2 2 2

0 The resulting product FN has the following structure: The 3.3 Factorization of the DFT matrix 0 odd indexed columns, forming the left half of FN , are com- posed of FN/2, while the even indexed columns, form- To obtain alternative factorizations of the DFT matrix, the 0 · ing the right half of FN , are composed of DN/2 FN/2 and number of recursions described in the previous section can be −DN/2·FN/2. This matrix can be decomposed into the prod- modified. Introducing a new parameter k we can define the uct of two matrices (as shown in Eq. 3), number of performed recursions as m−k with N=2m giving the size of the initial DFT matrix. For arbitrary m and k the     k k I N D N F N 0 resulting block submatrices in FB are of size (2 ×2 ) and 2 2 2 0     can be executed as independent DFTs (or FFTs). Because the FN = · (3)     m I N −D N 0 F N length of the input vector xN is given by N=2 the number 2 2 2 of block submatrices equals 2m/2k=2m−k. After calculating − − 2m k independent DFTs, m−k butterfly matrices B have to where D is a with elements d =w(n 1) i N/2 nn N be applied to finish the calculation. The entire DFT is then and n=1, ···,N/2. The first matrix represents FFT butterfly given as operations, while the second matrix contains two indepen- dent DFTs of length N/2. Note that the second matrix is a y = B · B ··· B · F · x , (8) block diagonal matrix. N m−k m−k−1 1 B P with x the (m−k)-times permuted input vector x . An ex- 3.2 Matrix-Vector based computation of the FFT p N ample for m=4 and k=2 is given below. 0 Example 1: To apply the modified DFT FN to an input vector xN , a per- mutation of xN is required, i.e. ⎡ ⎤ ⎡ ⎤ I4 D4 00 F N 000 T ⎢ ⎥ ⎢ 4 ⎥ P · x = [x x ] , (4) ⎡ ⎤ ⎢ ⎥ ⎢ ⎥ N N o e ⎢ ⎥ ⎢ 0 F 00⎥ I8 D8 ⎢ I4 −D4 00⎥ ⎢ N ⎥ ⎣ ⎦ ⎢ ⎥ ⎢ 4 ⎥ y = · · ·xp N N N ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ 00F N 0 ⎥ 2 2 I8 −D8 ⎢ 00I4 D4 ⎥ ⎢ ⎥ with xo ∈ C the odd and xe ∈ C the even indexed part of   ⎣ ⎦ ⎣ 4 ⎦ N B 2 000F , ∈ 2 00I4 −D4 N xN . Introducing y1 y2 C we can formulate the DFT as     4 B follows: 1 F B

Adv. Radio Sci., 6, 89–94, 2008 www.adv-radio-sci.net/6/89/2008/ Y. He et al.: Matrix-Vector Based Fast Fourier TransformationsYuheng He on SDR et al.: Architectures Matrix-Vector Based Fast Fourier 91 Transformations on SDR Architectures 3

3.4 Construction of butterfly matrices B1 B1 F4 Increasing the order of recursion m−k will also increase the number of butterfly matrices B . While the independent DFT = F4 i y16 xp submatrices in FB can be processed independently, the but- F4 terfly matrices can be treated in different ways: F4 ·As sequence of matrix multiplications a) m−k−1 nz = 512 nz = 512 Y B2 ⋅ Bm−k−i B2 B1 B1′ F4 i=0 B2′ F4 y16 = Pl Pr xp ·As one with the butterfly product ma- B3′ F4 trix B4′ F4 m−k−1 b) Y B = Bm−k−i i=0 nz = 512 nz = 1024 F4 F4 B3 B3 ⋅ B2 ⋅ B1 ·As partly combined matrix multiplication with the products F4 F4 y16 = Pl WPr xp p p m−k−1 F4 F4 Y1 Y2 Y B B B m−k−i m−k−i m−k−i F4 F4 i=0 i=p1−1 i=pr −1 c) with arbitrarily chosen integer numbers p fulfilling (1

Complexity Xp B1 B1 100 F4 80 60 x7 x8 4 · · · = F 40 x1 x2 y16 xp F4 20 F4 a) 10 nz = 512 nz = 512 8 DFT B2 B2 ⋅ B1 B1′ F4 6 FFT B2′ F4 4 SepaA y16 = Pl Pr xp SepaB F 4 SepaC 4 Multiplications normalized to FFT B3′ F 2 SepaD B4′ F4 b) 1 0 1 2 3 4 5 6 7 8

k ±d4 · ∗ ±d4 ·∗ ±d4 · ∗ ±d4 · ∗ nz = 512 nz = 1024 F4 F4 ⇓ ⇓ ⇓ ⇓ B3 ⋅ ⋅ B3 B2 B1 y11 y y y14 F4 F4 Fig. 3. Complexities for DFT, FFT and various separations. y21 12 y22 13 y23 y24 y16 = Pl WPr xp ±d8 · ∗ ±d8 · ∗ F4 F4 ⇓ ⇓

butterfly matrices are separately executed one by one. The ′ ′ ′ ′ = = y11 y21 y12 y22 F4 F4 clarify this, an example is given in Fig. 2m−fork m 4 and k 2. c) numberExample of 3 the: In DFT a) we blocks can see in F theB is graphical2 and representation each DFT k k k ±d16 · ∗ needs 2 2 2 2 + 1 multiplications. The number of ⇓ of example· 2,− with· the butterfly product matrix B on them left−1 butterfly matrices is m k and each butterfly needs 2 yN and the block FFT matrix FB on the right hand side. By nz = 512 nz = 2048 Fig. 2. ParallelFig. 2. Parallel FFT FFT following following Separation Separation D. D. multiplications. This leads− to B4 B4 ⋅ B3 ⋅ B2 ⋅ B1 sorting rows and columns of B we can transform the sparse Fig. 4. Matrix-vector processor based implementation structure. diagonal matrix B tom a−1 blockk diagonalk matrix,k as shownm−k in MA = (m k) 2 + (2 2 2 2 + 1) 2 . This leads to b). This leads− to a· matrix based· notation− · of the Jeon-Reeves· butterfly matrices for increasing i is shown on the right hand example 2, with the butterfly product matrix B on the left FFT algorithms (Jeon (1986)). As the blocks (B0 , ···, B0 ) 8 8 Separation B. We use B FB, where the submatrices1 of4 m−1 k k k m−k and the block FFT matrix FB on the right hand side. By side. TheMA matrices= (m − k) are· 2 all+ of(2 dimension· 2 − 2 · 2 +21) · 2 2 ., the variableare not identical, constant twiddle· factors are extracted from × FB are still executed as independent DFTs, and B is the sorting rows and columns of B we can transform the sparse nz in the Figure shows the number of non-zero elements.butterflythe Fo- matrices, product which matrix. transfers Since the only block every submatrices2k-th diagonal to DFT is · diagonal matrix B to a block diagonal matrix, as shown in Separation B. We use B FB , where the submatrices of FB occupied,matrices, like the shown number in of c). required The extracted multiplications twiddle factors for B areis cusing on the butterfly product matrix in the bottom right, we m b). This leads to a matrix based notation of the Jeon-Reeves nz = 512 nz = 4096 are still executed as independent DFTs, and B is the but- collectedm in 2W. The computational complexity of this sep- (2 1) ( k 1). The number of multiplications for FB ′ ′ 8 8k 2 k FFT algorithms [Jeon (1986)]. As the blocks (B , , B ) can findterfly that product only nz matrix.= 4096 Sinceof only all every2 2 2 -th= diagonal 65536 iselementsisaration identical− is· determined to separation− by A the , such number that of FFTs (2·2 ) and the 1 · · · 4 occupied, the number of required multiplications· for B is = m are not identical, constant twiddle factors are extracted from Fig. 1. The single butterfly matrices Bi on the left hand side; their are non-zero. Thism means, every 16-th diagonal of theadditional but- twiddle factor multiplications (N 2 ): (2m − 1)·( 2 − 1). The number of multiplications for F m the matrices, which transfers the block submatrices to DFT 2k B m 2 k k k m−k products on the right hand side. terfly product matrix is non-zero, since the other multiplicMBa-= (2 k 1) (k−1 1)m + (2m 2 2 2 + 1) 2 . matrices, like shown in c). The extracted twiddle factors are is identical to separation A , such that MD = 2 · 2−· k ··2 2k +−2 = 2 (·1 +−m/2·) · (10) tions have already been performed by the block diagonal ma- collected in W. m B F m 2 k k k k m−k TheSeparation number of C. requiredWe again multiplications use B, for but separations now the subma- A–D The computational complexity of this separation is deter- trix F M.B In= general,(2 − 1) · every( − 12) +-th(2 diagonal· 2 − 2 · 2 is+ non-zero,1) · 2 . which · k 3.4 Construction of Butterfly Matrices B 2k tricesare given of F inB Fig.are executed3 for m= as8 independentand varying k FFTs.as multiples There are of mined by the number of FFTs (2 2 ) and the additional k−1 multiplications required for each FFT, and totally · m means, that the inter-space will be enlarged with increasinkMFFT2g. The upper bound is defined by the DFT complexity, twiddle factor multiplications (N = 2 ): Separation C. We again use B·F , but now the subma- · m−k B therethe lower are 2 boundFFT shows blocks, the FFT so the complexity. number of Notemultiplications that sepa- Increasing the order of recursion m k will also increase k. trices of F are executed as independent FFTs. There are k k−1 m m B isration given D as is only defined for k=m/2=4. MD = 2 2 k 2 + 2 = 2 (1 + m/2) (10) the number of butterfly matrices B .− While the independent k·2k−1 multiplications required for each FFT, and totally · · · i m−k m there are 2 FFT blocks, so the number of multiplications m 2 k−1 m−k The number of required multiplications for separations A-D DFT submatrices in F can be processed independently, the 4.2 ImplementationMC = (2 1) issues( 1) + k 2 2 . B 4 Implementationis given as − · 2k − · · are gi-ven in Figure 3 for m = 8 and varying k as multiples butterfly matrices can be treated in different ways: m HereSeparation we will discussD. A special possible case implementations of separation C is of separation the pre- of MFFT . The upper bound is defined by the DFT complex- m 2 k−1 m−k As sequence of matrix multiplications MC = (2 − 1) · ( − 1) + k · 2 · 2 . D,sented which separations describes on a partlydifferent parallel hardware implementation architectures. of The the ity, the lower bound shows the FFT complexity. Note that In this section we will2k compare different separations of the · originaldiscussion FFT is forof a the qualitative case k = character,m/2. The i.e.implementation factorization is separation D is only defined for k = m/2 = 4. m−k−1 butterfly matrixSeparation in D. termsA special of complexity case of separation and C implementation is separa- givendetails as like memory access, interconnection complexity or issues. tion D, which describes a partly parallel implementation of inter processor communication time are not considered here. 4.2 Implementation Issues Bm−k−i the original FFT for the case k=m/2. The factorization is SeparationyN A.=Matrix-vectorPl FB W processorsPr FB x asp, special kind(9) · · · · · Here we will discuss possible implementations of the pre- iY=0 given as of SIMD (Single Instruction Multiple Data) architectures are 4.1 Complexity withdiscussedW a twiddleas possible factor platforms diagonal for matrix SDR and (SchoenesPl and(P2003r per-)). sented separations on different hardware architectures. The mutation matrices that sort the rows and columns of a matrix, discussion is of a qualitative character, i.e. implementation As one matrix multiplication with the butterfly product ma- yN = Pl · FB · W · Pr · FB · xp, (9) As an example we will show an implementation of sepa- · The application of the DFT matrix requiressuchration that AP usingl FB aW generalPr = systolicB is the array butterfly structure product for matrix. matrix details like memory access, interconnection complexity or trix To clarify this,· an· example· is given in Figure 2 for=m = 4 and= inter processor communication time are not considered here. m−k−1 (N 1)with(WN a twiddle1) multiplications, factor diagonal matrix i.e. and Pl and thePr productper- multiplications, of like depicted in Fig. 4. With m 5 and k 2 mutation matrices that sort the rows and columns of a matrix, kwe= have 2. to perform 8 DFTs of length 4 in FB . To obtain Separation A. Matrix-vector processors as special kind B B − − − · − = m k i all rowssuch and that columns,Pl·FB ·W·P exceptr =B is the the butterfly first product row/column, matrix. To whichExamplethe is matrix-matrix 3: In a) we structure can see we the separate graphical the representation input vector x ofp of SIMD (Single Instruction Multiple Data) architectures are iY=0 all ones: As partly combined matrix multiplication with the products Adv. Radio Sci., 6, 89–94, 2008 www.adv-radio-sci.net/6/89/2008/ · M = N 2 2N +1=2m 2m 2 2m + 1. DFT − · − · p1 p2 m−k−1 The FFT is based on butterfly operations, which require Bm−k−i Bm−k−i Bm−k−i N/2 = 2m−1 multiplications each (see section 3.2). Since iY=0 i=Yp1−1 i=Ypr −1 we have m butterfly matrices, we obtain: with arbitrarily chosen integer numbers p fulfilling (1 < m−1 N p1

into 8 parts (x1, x2, ···, x8 Complexity) of length 4, which means that Xp 100 · j= , ..., all 80F4 xj with 1 8 can be calculated independently. The60 multiple matrix-vector products can then be described x7 x8 · · · as matrix-matrix40 product F4·Xp with Xp=[x1 x2·x8]. The x1 x2 remaining butterfly operations are performed according to Eq.20 (6) and Eq. (7). The even numbered columns of the product F4·Xp are multiplied component-wise with d4 and then10 added/subtracted to the corresponding odd numbered 8 DFT columns.6 For example, the second column is multiplied by FFT d4 and4 then added/subtracted to the first column,SepaA resulting in the intermediate vectors y11 and y21 that are storedSepaB in the F 4 SepaC Multiplications normalized to FFT registers2 of the first and second column of the processorSepaD ar- ray. In the next step these intermediate results are multiplied by d1 and added/subtracted to the other intermediate results. 80 1 2 3 4 5 6 7 8

After m − k steps the registersk contain the Fourier transform ±d4 · ∗ ±d4 ·∗ ±d4 · ∗ ±d4 · ∗ ⇓ of x. ⇓ ⇓ ⇓

y11 y y y14 Fig.The 3. Complexities computational for complexityDFT, FFT and for various this separations. approach is only y21 12 y22 13 y23 y24

slightly increased for small values of k (compare Fig. 3), but ±d8 · ∗ ±d8 · ∗ a high speed partly parallel Fourier transformation can be re- ⇓ ⇓ butterfly matrices are separately executed one by one. The ′ ′ ′ ′ y y y y alized. m−k 11 21 12 22 number of the DFT blocks in FB is 2 and each DFT Separationk k B. Fork generic matrix-vector architectures the ±d16 · ∗ needs 2 2 2 2 + 1 multiplications. The number of ⇓ DFT can· be described− · as simple matrix-vector product usingm−1 butterfly matrices is m k and each butterfly needs 2 yN multiplications.separation B. In This this leads case− to the number of required multipli- cations is increased considerably compared to the FFT, but Fig. 4. Matrix-vector processor based implementation structure. − − Fig. 4. Matrix-vector processor based implementation structure. nevertheless,M = (m itk) is2 interestingm 1 + (2k to2 seek the2 significant2k + 1) 2 decreasem k. A − · · − · · of the number of multiplications (e.g. for k=3, 4, 5 in Fig. 3) example 2, with the butterfly product matrix B on the left comparedSeparation to the B. We DFT, use justB byF separatingB, where the the DFT submatrices matrix into of · andcomputed the block using FFT the matrix specializedFB on processor the right cores, hand the side. results√ By FaB productare still of executed the two matrices as independentB and FB DFTs,. and B is the sortingare then rows permuted and columns and multiplied of B we by canW. transform The remaining the sparseN k butterflySeparation product C. matrix.To match Since the only real every time2 requirements-th diagonal is in diagonalblocks are matrix againB computedto a block using diagonal the FFT matrix, cores. as shown in occupied,SDR systems, the number general of purpose required processors multiplications are often for B com-is m b). This leads to a matrix based notation of the Jeon-Reeves m 2 bined with hardwarek . The accelerators, number of like multiplications e.g. for the for FourierF ′ ′ (2 1) ( 2 1) B FFT algorithms [Jeon (1986)]. As the blocks (B , , B ) − · − 5 Conclusions 1 4 istransform. identical to As separation the flexibility A , such of these that accelerators regarding are not identical, constant twiddle factors are extracted· · · from to the DFT size might be limited, separation C can be used m the matrices, which transfers the block submatrices to DFT to map DFTsm of any2 size onto thesek k engines.k For examplem−k The presented separations of the DFT matrix provide differ- MB = (2 1) ( 1) + (2 2 2 2 + 1) 2 . matrices, like shown in c). The extracted twiddle factors are in a systems− with· FFT2k − accelerators· of− size· 64 and a· required ent algorithmic structures, which allow a flexibility in terms collectedof the used in W hardware. architectures for the implementation FFT length of 256 (this correspondsB F to m=8 and k=6), the Separation C. We again use B, but now the subma- Theof the computational DFT. Depending complexity on the of architecture, this separation the most is deter- suit- block DFTs in FB can be realized· on the accelerators while k trices of FB are executed as independent FFTs. There are minedable separation by the number can be of chosen, FFTs e.g. (2 separation2 ) and the A for additional matrix thek remaining−1 2 butterfly operations are executed as a sim- · m k 2 multiplications required for each FFT, and totally twiddleprocessors factor or separationmultiplications C for (N architectures= 2 ): using FFT hard- thereple· matrix are 2m− vectork FFT multiplication blocks, so the on number a generic of multiplications host processor. This leads to a major flexibility regarding the DFT size of the ware accelerators. Depending− on the chosen parameters the is given as M = 2 2k k 2k 1 + 2m = 2m(1 + m/2) (10) used accelerators in context of the required DFT size of the computationalD · complexity· · in terms of multiplications is only m slightly increased, but some architectures may benefit from specific application.m Furthermore2 existingk− architectures1 m−k with The number of required multiplications for separations A-D MC = (2 1) ( 1) + k 2 2 . the modified algorithmic structure (e.g. addressing, bus com- fixed length FFT− accelerators· 2k − can be easily· reused.· are gi-ven in Figure 3 for m = 8 and varying k as multiples Depending on the difference of m and k, the mathematical munication). The behaviour of the presented separations on Separation D. A special case of separation C is separation ofspecificMFFT existing. The upper hardware bound architectures is defined by has the yet DFT to complex-be exam- complexity is increased (compare Fig. 3). For the example ity, the lower bound shows the FFT complexity. Note that D,the which number describes of required a partly multiplications parallel implementation is increased by of a thefac- ined in future work. separation D is only defined for k = m/2 = 4. originaltor 1.5, FFT but the for 4 the FFTs case of sizek = 64m/ can2. be The executed factorization in parallel, is given as if more than one accelerator is available in the system. 4.2 Implementation Issues Separation D. The FFT proposed by Jeon and Reeves yN = Pl FB W Pr FB xp, (9) is especially suitable· for multi· processor· · architectures· (Jeon Here we will discuss possible implementations of the pre- (1986)). However the extraction of the twiddle factors in sep- with W a twiddle factor diagonal matrix and Pl and Pr per- sented separations on different hardware architectures. The mutationaration D matrices also enables that sort the the use rows of homogeneous and columns parallelof a matrix, FFT discussion is of a qualitative character, i.e. implementation cores in an SDR system, as the resulting block submatrices such that Pl FB W Pr = B is the butterfly product matrix. details like memory access, interconnection complexity or Toare clarify all identical. this,· √ an· example This· means is given an√ N-pointin Figure DFT 2 for√ canm = be 4 sepa-and inter processor communication time are not considered here. rated into 2 N FFTs of size N. The first N FFTs are k = 2. Separation A. Matrix-vector processors as special kind Example 3: In a) we can see the graphical representation of of SIMD (Single Instruction Multiple Data) architectures are www.adv-radio-sci.net/6/89/2008/ Adv. Radio Sci., 6, 89–94, 2008 94 Y. He et al.: Matrix-Vector Based Fast Fourier Transformations on SDR Architectures

References Schoenes, M., Eberli, S., Burg, A., et al.: A novel SIMD DSP archi- tecture for software defined radio, Proc. of the 46th IEEE Inter- Berthold, U., Riehmeier, A.-R., and Jondral, F. K.: Spectral parti- national Midwest Symposium on Circuits and Systems (MWS- tioning for modular software defined radio, 59th Vehicular Tech- CAS2003), 3, 1443–1446, December 2003. nology Conference (VTC2004), 2, 1218–1222, May 2004. Son, B. S., Jo, B. G., Sunwoo, M. H., and Yong, S. K.: A high- Cooley, J. W. and Tukey, J. W.: An algorithm for the machine cal- speed FFT processor for OFDM systems, Proc. of the IEEE In- culation of complex Fourier series, Math. Comput., 19, 297–301, ternational Symposium on Circuits and Systems (ISCAS2002), 1965. 3, 281–284, May 2002. Dillinger, M., Madani, K., and Alonistioti, N.: Software Defined Van Loan, C.: Computational Frameworks for the Fast Fourier Radio-Architectures, Systems and Functions, Wiley Series in Transform, Frontiers in applied mathematics, Society for Indus- Software Radio, John Wiley & Sons Ltd., 2003. trial and Applied Mathematics, Philadelphia, 1992. Heyne, B. and Gotze,¨ J.: A Pure Cordic Based FFT for Reconfig- urable Digital Signal Processing, 12th European Signal Process- ing Conference (Eusipco2004), Vienna, September 2004. Jeon, C. H. and Reeves, A. P.: The on a mul- ticluster computer architecture, Proceedings of the 19th Annual Hawaii International Conference on System Sciences, 103–110, 1986.

Adv. Radio Sci., 6, 89–94, 2008 www.adv-radio-sci.net/6/89/2008/