<<

Authorized licensed use limited to: Hong Kong University of Science and Technology. Downloaded on April 19,2010 at 02:55:58 UTC from IEEE Xplore. Restrictions apply. n the past decade, the unprecedented advances in VLSI CORDIC technology have stimulated great interests in developing special purpose, parallel arrays to facilitate real The basic concept of the CORDIC computation is to time digital processing. systems decompose the desired rotation angle into the weighted sum such as systolic arrays [57] and wavefront arrays [59],[60] of a set of predefined elementary rotation angles such that the have been extensively studied. The basic arithmetic computa- rotation through each of them can be accomplished with tion of these parallel VLSI arrays has often been implemented simple shift-and-add operations. For this, we let the rotation with a multiplication and accumulation (MAC) unit, because angle, 0, be represented as: these operations arise frequently in DSP . The reduction in hardware cost also motivated the development of more sophisticated DSP algorithms to enhance the perfor- mance of modern digital systems. Many of these new algorithms require the evaluation of elementary functions, such as trigonometric, exponential, and logarithm where the iPrhelementary rotation angle, am(i) is defined by: functions, which cannot be evaluated efficiently with MAC based arithmetic units. Consequently, when DSP algorithms incorporate these elementary functions, it is not unusual to observe significant performance degradation. On the other hand, an alternative arithmetic computing algorithm known as CORDIC (Coordinate Rotation DIgital Computer) has received renewed attention, as it offers a unified iterative formulation to efficiently evaluate each of these elementary functions. Specifically, all the evaluation In the above equations, m = 1, -1, and 0 corresponds to, tasks in CORDIC are formulated as a rotation of a 2 x 1 vector respectively, the rotation operation in a circular coordinate in various coordinate systems. By varying a few simple system, a hyperbolic coordinate system, and a linear coor- parameters, the same CORDIC processor is capable of itera- dinate system. The norm of a vector [ x y 1' in these three tively evaluating these elementary functions using the same coordinate systems are defined as The term hardware within the same amount of time. This regular, w. {mu(i);0 5 i I n - 1; } is a sequence of +Is which deter- unified formulation makes the CORDIC based architecture mines the rotation angle, and the modes of operations. The very appealing for implementation with pipelined VLSI array term { s(m,i) ; 0 I i I n - 1 ) is a non-decreasing integer shift processors. sequence which is usually determined in advance. In this context, many research efforts have been directed With these definitions, the basic CORDIC algorithm can to the application of CORDIC based architectures for DSP be described as follows: applications 121,131,141, [181, [191,l211,l221,1231, [271, 1411, 1471,l521,l531,l631,[641 [701, l721,1741,1761,[771,1791, and 1801. The primary objective of this article is to provide a brief survey of these recent research efforts. Initiation: Given x(O), y(O), z(0). We will first review the state-of-the-art of the evolution of For i = 0 to n - I, Do the CORDIC algorithm, and CORDIC processors. Then we I* CORDIC iteration equation *I will survey a number of typical DSP applications suitable for implementation with CORDIC based hardware. 
CORDIC ALGORITHM AND CORDIC BASED PROCESSOR ARRAY I" Angle updating equation "I CORDIC is an iterative arithmetic algorithm introduced z(i+l) = z(i) -pi a,(i) (4) by Volder [81] in 1956 and later refined by Walther 1821 and End i-loop many others [I], VI, 1101, 1141, 1161, 1171, [201, [IS], 1251, 1301, [37], 1421, [44], 1451, 1401, 1751, 1771. The CORDIC I* Sca algorithm has found a wide range of applications, including discrete transformations such as discrete Hartley transform [ 1 I], discrete cosine transform (DCT) 1141, fast Fourier trans- form (FFT) [25], [26], Chirp Z transform (CZT) [44], solving i=U eigenvalue and singular value problems [IS], 13 I], 1721, digital filters [21], 1221, 1801, Toeplitz system and linear system solvers 1471, [48],[52], 1741, 1.531, and Kalman filters [771.

JULY 1992 IEEE SIGNAL PROCESSING MAGAZINE 17

..- -1 n -----

Authorized licensed use limited to: Hong Kong University of Science and Technology. Downloaded on April 19,2010 at 02:55:58 UTC from IEEE Xplore. Restrictions apply. the norm of the vector. Hence this scaling operation is needed to move the end point of the final vector back to the desired trajectory (on the dotted line). A main research issue, to be discussed later, is to reduce the computation overhead due to the scaling operation.

Modes of Operations The set of p; (Al)in the CORDIC algorithm determine the rotation angle depending on the modes of operations. The CORDIC algorithm can be operated in either a vector rotation mode, or an angle accumulation mode. In the vector rotation mode, which is also known as the vectoring mode [82], or forward rotation mode [44], the desired rotation angle 8 is given. The objective is to compute the final coordinate [,y yf]'. Usually, we set z(0)= 8 at the beginning. After n iterations, the total angle rotated is: Linear Rotation

Clearly, we should choose p; = sign of ~(i)in Eq. 4 such that after the i'h iteration, I z(i+l) I < I z(i) I. Eventually, we want to make Iz(n)l 4 0. For many DSP problems, 8 is known in advance. Often, the same 8 will be used over and over. In these situations, one may choose to evaluate Eq. 4 in advance (off line), and store the corresponding set of p;, instead of 8, in the memory. A particular advantage is that there will be no need to implement the angle updating formula (Eq. 4), resulting in a one-third cost saving of hardware. In the angle accumulation mode, which is also known as the rotation mode [82], Y-reduction mode [lo], or the back- wurd rotation mode [47], the desired rotation angle, 8, is not given. The objective is to rotate the given initial vector [x(O) y(O)]' back to the x-axis so that the angle between them can be accrued. For this purpose, we let z(0)= 0, and select p; = sign of x(i).y(i). (In the circular rotation case, an alter- native criterion to select pi = - sign ofy(i) will guarantee xf> 0 regardless the sign of x(0). In summary, the selection criteria for pi are:

sign of z(i) vector rotution mode (Cl (7) ~ -~ ______~____~_- = { - sign of x(i).y(i) angle accumulation mode I. Trajectory ofcirculur (a).linear (b),and hyberbolic (c)COR- DIC rotutions. Note that the set of {pi ; i = 0 to n - 1 } can be regarded as an alternative representation of the rotation angle 8. Hence, even The purpose of the scaling operation (Eq. 5) is to make in the angle accumulation mode, often it is not necessary to sure that after rotation, the final coordinate [ xf y]'has the explicitly compute 8 using Eq. 5. same m-norm as the initial coordinate [ x(0) ~(0)1'. That is, Shift Sequence (xf+ + m(yjl2= x2(0) + my2(0) m = -I, or I The shift sequence { s(m,i);0 Ii 5 n-1 ) determines the (For m = 0, &(n) = 1. No scaling operation is needed). convergence of the CORDIC iteration, as well as the mag- From Figure l(a)-(c), it is clear that the rotation changes nitude of the scaling factor Km(n).Since there are only n finite

18 IEEE SIGNAL PROCESSING MAGAZINE JULY 1992

Authorized licensed use limited to: Hong Kong University of Science and Technology. Downloaded on April 19,2010 at 02:55:58 UTC from IEEE Xplore. Restrictions apply. elementary rotation angles { am(i);i = 0, n - 1 }, it is impos- P sible to represent arbitrary rotation angle 8 without error. Let us define an angle approximation error as:

where ~p = fl, and ip are positive integers. Multiplication with I/K,(n), then will take P-1 shift-and-add operations. Using the canonical multiplier recoding method [511, the value of P can be reduced to bn, where b is the number of In the vector rotation mode, since z(0) = 8, from Eq. (6), we bits in the internal registers. On the average, PI=bA. have 6 = z(n).In the angle accumulation mode, one can easily A second approach is to represent l/Km(n) as a product of show that: Q factors such that:

tan-'y(nyx(n) rn = + 1 .Q 6= y(n)/x(O) rn+O 1tanh-ly(n)/x(n) rn = - 1 The effects of the angle approximation error and the rounding where = f I, i, are positive integers, and I I < 2-b. Given error on the accuracy of the CORDIC computation have been l/Km(n), similar to the first method, the design goal is to analyzed [43]. In general, it is desired that for any given minimize Q. It has been observed [4], [37] that if some of the rotation angle, 8: elementary rotation angles are repeated, the resulting scaling factor, Km(n)can be approximated by a power of 2. Thus the I 6 I 5 a,(n - 1) (9) scaling operation can be accomplished with several addition- al (repeated) CORDIC iterations with a simple shift operation This condition imposes two constraints on 8. First, the cor- at the end. Delosme later suggested use of a simulated anneal- responding set of elementary rotation angles must satisfy: ing technique to identify the best set of elementary angles to repeat 1181. Deprettere et a1 [22], [20] proposed a variant of that method in which some product terms have the form (1 + q 2-jk + ~k2-j~)where yk = f 1 for a three-term product, and yk = 0 for a regular two-term product. In singular value decomposition and eigenvalue decom- Next, 8 must lie within the domain of convergence. That is, position problems, an important operation is to diagonalize a 2 x 2 matrix with simultaneous pre- and post- rotation opera- tions. Observing that the scaling factor is Ki(n)in this case, Cavallaro and Luk [IO] have proposed a simple scheme (discussed later) to perform exact scaling. This method was Walther [82] has proposed a set of shift sequence for each of recently studied and enhanced by Delosme [18], [19], and Yang and Bohme [84]. the three coordinate systems. In particular, for rn = 0 or I, s(rn,i) = i, 0 I i 5 n - 1. For rn = -1, s(-l,i) = I, 2, 3,4, 4, 5, ..., 12, 13, 13, 14, ..., ARCHITECTURAL DESIGN

The purpose of repeating elements { 4, 13, 40, . . . ] in the Basic CORDIC Processor hyperbolic coordinate is to assure that Eq. 10 is satisfied. Along this line of research, X. Hu et al. [40] have proposed a A basic CORDIC processor should contain function method to expand the range of Amup. modules which realize (a) the CORDIC iterations specified in Eq. 3; (b) the angle update iteration specified in Eq. 4; and (c) the scaling operation specified in Eq. 5. The functional Scaling Operation The multiplication of l/Km(n) in the scaling operation (Eq. block diagrams for each of these operations are straightfor- ward (Fig. 2). 5)imposes significant computation overhead on the CORDIC The modules in Fig. 2a support a single CORDIC iteration. algorithm. This is because to perform a multiplication, n It contains dual barrel shifters and dual adders to facilitate the shift-and-add operations will need to be performed. For- updating of both x(i) and y(i)simultaneously. The number of tunately, if we constrain Ipil= 1, and assume [ s(rn,i))is given, bits to be shifted is controlled by the shift sequence {s(rn,i)). Km(n) can be computed in advance. As such, various techni- The shift sequence can be stored on chip using ROM (read- ques have been proposed to reduce computation overhead. only memory) or RAM (random-access memory), or A first approach is to convert I/K,(n) into a canonical generated on chip with a simple and additional con- sign-digit representation [5 I]: trol devices. The addsubtract operation is determined by the

JULY 1992 IEEE SIGNAL PROCESSING MAGAZINE 19

- r -~--r II --T

Authorized licensed use limited to: Hong Kong University of Science and Technology. Downloaded on April 19,2010 at 02:55:58 UTC from IEEE Xplore. Restrictions apply. {G-y-

!-i -:sign cit riij vector rotation mcrde

Parafft.1 and Pipelined Arrays

23 IEEE SIGNAL PROCESSING MAGAZINE JULY 1992

Authorized licensed use limited to: Hong Kong University of Science and Technology. Downloaded on April 19,2010 at 02:55:58 UTC from IEEE Xplore. Restrictions apply. processors will propagate the result to on, also in a pipelmed iiimie~'We

time unita. Nei the luwer-level ion results of the CURDlC rotatlotis WI ter (~+.Y)T,time units.

1/(/f+S)fC,, Doublj p1 ion was first proposed by Deprettere Ct al. [22j. [ ipIenzcnting CORUlC-based digital

JULY 1992 IEEE SIGNAL PROCESSING MAGAZINE 21

Authorized licensed use limited to: Hong Kong University of Science and Technology. Downloaded on April 19,2010 at 02:55:58 UTC from IEEE Xplore. Restrictions apply. filters. It enables a fine grain pipelining, which helps reduces the pipelining . For Toeplitz system and linear system Initiation: Y(0,k) = 0 for 0 I k I N - 1 solvers, it also facilitates saving total computation time by For k=0, 1, ..., N-1 Do almost 50 percent [41].It was called implicit rotation when Form = 0, 1, ..., N-1 Do applied to the eigenvalue decomposition problem [ 181. IMPLEMENTING ALGORITHMS

In this section, we will illustrate how to utilize a CORDIC processor array to implement digital signal processing algo- rithms. Our approach is to reformulate existing DSP algo- rithms so that they are suitable for implementation with a CORDIC processor array performing circular, or hyperbolic End m-loop rotation operations. The linear rotation operation will not be Y(k)= ~ Y(N*k) /* scaling operation */ emphasized because they can also be implemented, more K,(n) efficiently, with MAC based processors. There usually exists End k-loop a more efficient scheme for implementing a particular func- tion. However, the utility of the CORDIC based architecture lies in its generality andflexibility. With these in mind, we (Note that Ki(n)is defined in Eq. 5.) will survey three categories of DSP algorithm: In Figure 4, the sequence {X(n);0 5 n 5 N - 1 ] is taken from the left end of this processor array. Each X(n), which (a) Linear transformations, including discrete Fourier contains two real numbers x,(n) and xi(n), will be propagated transform, Chirp-Z transform, discrete Hartley transform, and ; from the current processor to the next nearest neighbor (b) Digitalfilters, including orthogonal digital filters, and processor in a pipelined manner. Since X(n) is not to be adaptive lattice filters; modified during propagation, they can be piped at a rate of (c) Matrix based digital signal processing algorithms, to time unit (1 clock cycle) per stage. Each of the N CORDIC including QR factorization, with applications to Kalman fil- processors in this processor array will be responsible for tering, linear system solvers, Toeplitz and covariance system evaluating a particular Y(k). This implies that N different solvers, with applications to least square deconvolution, and rotation angles { 2xmWN; 0 2 m 5 N - 1 } must be stored in eigenvalue and singular value decompositions with applica- each processor. Since each of each rotation angle requires n tions to array processing. bits to store, the total storage requirement, including the s scaling iterations, will be nN+s bits. Linear Transformations As Despain pointed out [25],there is no need to perform scaling operation in the DFT algorithm. This is emphasized Discrete Fourier Transform (DFT) by placing Kl(n) in front of the rotation matrix. In doing so, Given a complex-valued discrete sequence {X(n); the output will be Ki(n)Y(k),which is a scaled version of the 0 < n 5 N- I}, its discrete Fourier transform { Y(k); desired output, Y(k). Hence, there is no need to perform the OSk5N- 11 isdefinedas: scaling operation, which takes s iterations, for each of the N CORDIC rotation operation in the m-loop. Instead, only one final scaling operation is needed to compute Y(k) at the end of the m-loop as listed in the algorithm. With this scheme, the __-j2xO __-j27rk -j27r(N- 1)k N average scaling overhead is reduced from s shift-and-add Y(k)= X(O)e N + X( 1)e + . . . + X(N-l)e scaling iterations per CORDIC rotation down to s/N shift- and-add scaling iterations per CORDIC rotation. Conse- quently, the computation for each Y(k) will be (nN+s)t, time units; and the total computing time will be (nN+s+N-l)t, time units. However, since the scaling is done for 0 5 k I N - 1. To avoid notational confusion, in this article we use x(i), y(i) to denote the rotation coordinates in at the end, the dynamic range of Y(m,k)will be increased by the CORDIC algorithm, andX(n), Y(n)to denote the time and a factor of Ki(n), whose magnitude is usually less than 2. 
To frequency domain sequences in the DSP algorithm. We note, avoid overflow and loss of accuracy, one extra bit is needed however, that in most DSP literatures, the time domain se- to store the intermediate result, Y(m,k). quences are denoted by lower case letters, e.g., x(n). Eq. 14 can be reformulated in a recurrent algorithm format to Discrete Hurtley Transform (DHT) facilitate implementation with a CORDIC processor array. The Discrete Hartley transform is proposed by Bracewell The algorithm is as follows: [6] as an alternative to discrete Fourier transformation. It has the following formulation:

22 IEEE SIGNAL PROCESSING MAGAZINE JULY 1992

Authorized licensed use limited to: Hong Kong University of Science and Technology. Downloaded on April 19,2010 at 02:55:58 UTC from IEEE Xplore. Restrictions apply. OmsN-I vector ... vector vector rotation rotation rotation

Y(N-1)t

{p: }, representation of the same 0 (k) and composite scaling factor

,A,...... i---+-l

t -#- Xr(N-1) Xi(N-1) Xr(N-l-rn) Xi (N-1-rn) &(O) Xi(0)

4. A pipelined CORDIC DFT array (a).X(m) will be propagated from processor i to processor i+l every nt, time units. Scaling is done only once in each processor ufter N CORDIC rotations are all conipleted. Each processor will store the same set of N rotation angles in the form of {p,). Which angle to use in the mth iteration depends on k. Pipelined CZT array (b). The rotation angle and composite scal- ing factor will be,fed into each basic CORDICprocessor, one by one. Each processor will cycle through n+s cycles before feeding its result to the next processor. The input datu re applied to each processor in parallel. The output is available at processor N-1.

Initiation: Y(0,k) = 0 for 0 5 k 5 N - 1. Fork = 0, 1, ..., N-I, Do (15) Form = 0, 1, ..., N-1, Do n=O

where {X(n))and { Y(k))are both real sequence. To realize DHT in a CORDIC based processor array, we set the initial 1v End m-loop; coordinates of the CORDIC algorithm to (X(n), X(n)). That Y(N,k) is, x(0) = y(0) = X(n),and we rotate through an angle of Y(k)= ~ I* scaling operation *I 27Ckrl/N. Each term in the above summation will appear as yy. Kl@) End k-loop Again, similar to the case of the DFT, only one scaling operation is required for every Y(k). The recurrent formula- tion of the DHT is as follows:

JULY 1992 IEEE SIGNAL PROCESSING MAGAZINE 23

Authorized licensed use limited to: Hong Kong University of Science and Technology. Downloaded on April 19,2010 at 02:55:58 UTC from IEEE Xplore. Restrictions apply. Note that the DFTarray presented in Fig. 4 can also be applied to the DHT with x(0) and y(0) tied together. A implementation of the DHT based on the CORDIC processor has been reported recently [ 1 I].

Chirped Z-Transform (CZT) The Chirped Z transform evaluates the Z-transform of a discrete, complex valued sequence {X(n)) at the points k :k = AlV for k = 0, 1, . . ., K-1, where W = Woe-Jcp0,and A =A,e"" with WO,A(,, (po, and 0o being real numbers. Hence,

N- I N- I

Traditionally, CZT can be evaluated with fast Fourier trans- formation using O(N+K-l)I o g(N+K-1) operations [66]. However, when K << N, direct evaluation of Y(ZQwould be more efficient. A CORDIC processor array of the CZT was proposed [44] which is based on the following recurrence algorithm for- mulation:

Initiation: Y(0,k) = 0 for 0 Ik I K - 1. Fork = 0, I, ... K-1 Do Form = 0, I, ..., N-I Do

5. Signa/,flowgraph of the FFT (decimation in time) algorithm (a).CORDIC implementation (b)of the DIT FFT algorithm in (a). Each bo.w is a CORDICprocessor. Thick arcs in (a)are realized bv the shaded box in (b).

Xr(N- 1-m) +[ X;(N- 1 -m) 1 End m-loop other than fl. Whenever p; = 0, the corresponding CORDIC End k-loop iteration can be skipped. Hence, under the constraint of Output: Y(k) = Y(N,k) convergence condition (Eq. 9), the objective of FAR is to n- I minimize C, I pi I . As such, the total number of CORDIC r=O iterations can be minimized. Of course, according to Eq. 5, In each iteration of the m-loop, each Y(m,k) will be multiplied setting certain p; = 0 will cause the scaling factor, K1 (n). to with a scaling constant, A;'$. This scaling constant can be be dependent on the rotation angle. Specifically, using FAR, combined with the scaling operation in the CORDIC algo- each rotation angle, 8, will have a different scaling factor. rithm. That is, instead of multiplying l/K~(n),and then fol- However, the radial scaling factor,A,' wk, is dependent on the lowed by A,' M/3, we will multiply acomposite scaling factor, rotation angle (indexed by k) anyway. Hence, a different A,' $JKl(n). A side effect, however, is that when CZT is Kl(n)for each different index, k, will pose no extra computa- realized with a CORDIC processor array, different Y(k) may tion or storage overhead. The architecture of a pipelined need different numbers of CORDIC iterations for scaling CORDIC processor array with forward angle recoding is during each m-loop. This will cause scheduling difficulties in illustrated in Fig. 4b. a pipelined implementation. On the other hand, it also presents a special opportunity to further reduce the number Fast Fourier Transformation (FFT) of CORDlC iterations using a,forwardangle recoding (FAR) The fast Fourier transformation is perhaps the most fre- method (441, (461. quently used DSP computing algorithm in modern digital Briefly speaking, in the FAR method, the internal repre- signal processing systems. Let N be the total number of points sentation of the rotation angle (p;)can take integer values in the DFT. If N = N1.N2, then, we may let index

24 IEEE SIGNAL PROCESSING MAGAZINE JULY 1992

Authorized licensed use limited to: Hong Kong University of Science and Technology. Downloaded on April 19,2010 at 02:55:58 UTC from IEEE Xplore. Restrictions apply. Complex Add/Sub. CORDIC rotations Dual shift register C,D, C', D' are all complex numbers.

8 stages 4 stages 2 stages 1 stage n

i-p-p 6. Pipelined CORDIC FFT array [26].

n = nl + Nin2, with 0 Ini <=NI - 1, and 0 I n2 I N2 - 1. C'= C+ D Note that 2xk D' = (C - D)exp( -Jy )

Note that C, D, C' and D ' are all complex numbers. The above equations require a complex number multiplication (with a Therefore, for 0 I: k IN -1, uni-modulus complex number), and two complex number additions. The complex number multiplication can be done with n+ s CORDIC iterations and scaling iterations. The two complex additions will need just one more clock cycle. For small N, a FFT computation can be realized directly with a network of CORDIC processors (Fig. 5). A signal flow graph of an 8-point decimation-in-time FFT [66] is shown in If NI = 2, and N2 = Nn,the above equation becomes: Fig. 5a. In Figure Sb, each Butterfly operation is replaced by a rectangular box representing a CORDIC rotation followed by a complex addition or subtraction. To see how the structure in Figure Sb is obtained from Figure Sa, we highlighted the corresponding Butterfly operation in the last stage in Figure Sa, and the CORDIC processor in the shaded box in Fig. 5b. In other words, the original N-point DFT problem now be- For large N, the complicated global interconnection re- comes two N/2 point DFT sub-problems for each Y(k). quirements unfortunately make it difficult to implement effi- Recursively applying this procedure will lead to the popular ciently with a pipelined VLSI array, without severe radix-2 FFT algorithm. performance penalty [60]. Despain [26] has reported an im- The basic operation in a FFT algorithm is called a Butterfly plementation of a 16-point FFT building block using a linear- operation, which may have several different formats. For ly array of specialized hardware to minimize the number of example, consider the decimation-in-time Butterfly opera- addsubtract stages. A block diagram of that processor is tion: shown in Fig. 6. This processor array is able to achieve a pipelining period of 1. Also, the number of delay elements 2nk vary from stage to stage. It has N/2 delays in the first stage, C' = C + D.c~p(-j-) N NI4 in the stage, and so on. 2xk D' = C - D cxp(-j-) N Digital Filtering I Most of the existing digital filtering structures are imple- and the decimation-in-frequency Butterfly operation: mented with multiply-and-add operations. However, the

JULY 1992 IEEE SIGNAL PROCESSING MAGAZINE 25

Authorized licensed use limited to: Hong Kong University of Science and Technology. Downloaded on April 19,2010 at 02:55:58 UTC from IEEE Xplore. Restrictions apply. based on the Schur algorithm to realize a uniform two-rotor- per-section ODF for stable transfer functions (Fig. 9). It was reported later [12] that the Schur algorithm is numerically sensitive to rounding error, and a remedy was developed. A different remedy using state space realization has sub- sequently been advanced by Desai [24]. Vaidyanathan [80] has presented a unified framework showing that a digital Lossless Bounded Real (LBR) two port pair can be realized with two plane rotations. This structure is also a normalized version of the series adaptor for wave digital filters [32].

Adaptive Lattice Filter Whereas an ODF is used to implement a given transfer function as a fixed-weight , a lattice filter has found most applications in adaptive signal processing [34]. In other words, most popular lattice filtering algorithms have ~~ -2 7. Basic rotor in orthogonal digitalfilters. been focused on the adaptive updating of the reflection coef- ficients so that it is able to keep track of the time varying CORDIC structure is useful in implementing a large class of parameters in the incoming signal. The building block of a digital filtering functions with some unique advantages. In lattice filter is a hyperbolic rotation unit (cf. Fig. IO), which particular, two types of digital filtering structures will be can be described as: discussed in this section: the orthogonal digitalfilter (ODF), based on circular rotations; and the lattice filter (LF), based on both hyperbolic and circular rotations. In fact, the basic building blocks of these two types of filter structures are intimatedly related.

Orthogonul Digitul Filter An orthogonal digital filter is a digital filter of which the oLerall numerical computation at each sample instance can be described by an orthogonal or unitary matrix. There are a number of advantages of an ODF [27] compared with other filter structures. First, it has low sensitivity both in the passband and the stopband, and is invariant under frequency transformation. In addition, it is stable in spite of parameter quantization. It is also free from the limit cycle and ovefflow oscillations. Finally, the pipelining, modular structure also makes it viable for VLSI implementation. The basic building block of an orthogonal digital filter is the so-called rotor (cf. Fig. 7)- a circular rotation unit which realizes the following two-port transfer function:

wherejand y are inputs, and x, and g is the output. Gray and Markel [36] proposed a cascaded structure together with a tapped-sum output to realize a given transfer function. Henrot and Mullis [38] proposed a pr0ce;dure based on the generalized Levinson algorithm to realize an ODF. Dewilde and Deprettere [271 based on an embedding technique, developed several pipelinable, cascaded ODF structures. In summary, three types of pipelined building blocks, based on different inter-connections of one or more rotors have been proposed 1281 (cf. Fig. 8). Any transfer function can be realized by a of these three building blocks, 8. Three gpes of basic structures of pipelined orthogonal digital Rao and Kailath [70] proposed a synthesis procedure filter [281.

26 IEEE SIGNAL PROCESSING MAGAZINE JULY 1992

___. .

Authorized licensed use limited to: Hong Kong University of Science and Technology. Downloaded on April 19,2010 at 02:55:58 UTC from IEEE Xplore. Restrictions apply. - Sinh e In the linear prediction theory, k in both of these equations are known as the reflection coefficient, or sometimes the PARtial CORrelation (PARCOR) coefficient. To ensure Cosh stability of the filter, the magnitude of every reflection coef- ficient must be less than unity. That is, Ikl < 1. A basic cascaded lattice filter structure, implemented with Hyper- bolic rotation CORDIC processors, is shown in Fig. 1 I. If the ~ __ input y(t) is an auto-regressive random of order p, IO. Basic lattice filter section. then the output of theprhstage,b(r) and b,(j) will be two random processes. The rotation angles { Bi; 1 I i I N] can be estimated in a are suitable for implementation with a CORDIC processor batch processing manner or a data adaptive manner. If the array, the reflection coefficient adaptation algorithms them- correlation function (RY(n))of the input (y(t))is available, a selves usually can not be easily implemented. Previously, Toeplitz system equation called the Yule-Walker equation Ahmed et al. [2] had implemented a CORDIC based ladder can be solved to find out the N reflection coefficients; or filter for speech analysis using a normalized lattice algorithm. equivalently, the N hyperbolic rotation angles. We will show Sibul and Fogelsanger [72] have proposed a modified COR- later that a CORDIC processor array is able to solve a Teoplitz DIC algorithm to implement RLSLF. Chen and Lin [13] system of equations efficiently proposed a dedicated hardware implementation of a normal- If the sample data sequence (~(t))is to be used, we may ized RLS lattice algorithm which they claim to be more compute the reflection coefficients using, say, the Burg's efficient than CORDIC implementation. maximum entropy method [9]: Recently, Hu and Liao [49] have proposed a CORDIC- based adaptive lattice filtering (CALF) algorithm. CALF is a simplified gradient-based algorithm that is able to directly .r;( t' ) .h; (r/- 1 ) c update the hyperbolic rotation angle without explicitly update the reflection coefficients. Note that there are only 2'* possible combinations of {pi; pi = fl, 0 I i I n-1 ] . Hence, at most t' I' 2n distinct rotation angles can be represented using the given If we want to update k;(t)at each time instant, two classes of set of shift sequence. adaptive lattice filtering algorithms are available to do so: the Let us order these 2n angles in an ascending order, and gradient lattice filters (GLF), and the recursive least square label each of them with an index I, 0 I I I 2n - 1, such that: lattice filters (RLSLF) [34]. The GLF algorithms updating the reflection coefficients use various gradient based formula- tions. For example, the reflection coefficient k;(t+l) at the i'" stage at time r+l can be computed as: With this set of indices, the updating formula of CALF can now be easily described: where a; is the step size. The RLSLF algorithms update the reflection coefficients using time and order updating by recur- sively solving a least square problem. Many fast RLSLF algorithms have been reported taking advantage of the special structure between successive data vectors. A comprehensive survey on adaptive lattice filter algorithms can be found in [34]. In Fig. 12, several building blocks of these lattice filter structures are shown. I/. 
Basic 1atticr.filter structure implemented with hyherbolic rota Although the resulting prediction lattice filter structures tion CORDIC processors.

JULY 1992 IEEE SIGNAL PROCESSING MAGAZINE 27

Authorized licensed use limited to: Hong Kong University of Science and Technology. Downloaded on April 19,2010 at 02:55:58 UTC from IEEE Xplore. Restrictions apply. decomposition have been used in high resolution spectrum estimation, array processing and related problems. In this section, we will present CORDIC based processor array structures for implementing these matrix related DSP com- puting algorithms. Since we have to deal with portions of a matrix quite often, we shall adopt the following conventions to simplify the notations used in the algorithms: A(1:5,:)will be the submatrix of A containing the first five rows, and A(2:p,3:5)will be the submatrix of A containing elements in the intersection of the second to thepfhrows with the third to the fifth columns.

Elementary Row and Column Operations The purpose of the elementary row operation is to rotate corresponding pairs of elements in two row vectors so that one element in the second vector is annihilated. (For column vectors, a column operation can be performed by post-multi- plying a 2 x 2 matrix.) Let A(i,:)and A(j,:)be the irhand the jfhrows of the A matrix. Then, the row operation is equivalent to pre-multiplying these two row vectors by a 2 by 2 matrix M:

We may choose M to be a unitary matrix so that circular rotation can be applied. In this case, we denote M by C(ij, - ACj,l)) to emphasize that A(j.l) is to be nullified, and to 12. Variance normalized pre-windowed least square lattice predictor perform the row operation between the ith and the j‘“ rows of ,filter ((a),see [34]).Variance normalized sliding windowed the A matrix. It is easy to verify that: coinriance lattice predictor filter (b). Note that the second stage is a rotor implemented with circular rotation. Variance normalized pre- windowed time and order update lattice predictor filter (c). L J

with 0 = -tan-’[ A(j,l)/A(i,l)]. Similarly, one may choose M so that a hyperbolic rotation between elements of A(i,:)and I,(t+l) = Ik(f) + sign of (fk(t).hk(t)) (22) A(j,:)are performed. In this case, we denote

When s(-1,k) = k+l, the corresponding set of (pi) can be coshv sinhv M=H(ij,A(j,l))= found directly using a simple up/down binary counter. This sinhv coshv makes the implementation of CALF extremely simple com- 1 pared with previous approaches. In Figure 13, a single sec- with y~ = -tanh-’[ A(j,l)/A(i,l)]. tion of the CALF structure is depicted where the adaptation The row operation can easily be realized with a doubly of the rotation angle is realized with a simple up/down binary pipelined CORDIC array (Fig. 3c). A(i,k) and A(j,k) will be counter.

Matrix Based DSP Computing Algorithms I I Up/Down Counter Matrix based linear algebra computing algorithms such as linear system solvers, least square system solvers, QR fac- torization, eigenvalue or singular value decompositions arise quite often in digital signal processing applications. For example, linear system solvers can be used in covariance system equations in the linear prediction problem; QR fac- torization can be used in least square adaptive filter applica- I tions; and eigenvalue decomposition or singular value 13. CORDIC adaptive lattice filter (CALF) [49].

28 IEEE SIGNAL PROCESSING MAGAZINE JULY 1992

~

Authorized licensed use limited to: Hong Kong University of Science and Technology. Downloaded on April 19,2010 at 02:55:58 UTC from IEEE Xplore. Restrictions apply. loaded to the local memory of the kfhprocessor initially. The head processor will perform angle accumulation operation by evaluating p; in the if” iteration. The computed pj then will be propagated to each remaining processors. Note that we do not have to wait all the pis are computed before initiate the vector rotation operations in the remaining processors. Thus, the second processor will be able to commence its operation only one clock cycle (to)behind that of the head processor.

QR Factorization In QR factorization, a matrix is factorized into the product of an orthonormal matrix and an upper triangular matrix. This operation can be accomplished by a sequence of circular row operations to nullify elements below the main diagonal of the matrix. The algorithm is described as follows: I I . vector rotation Initiation: Given a Al Fori = 1+1 top Do

14. q-by-q triangular CORDIC array for QRfactorization. The matrix A is p-by-q (usually, p 2 y).

-Z(r) = Q(t)g(r)- Q(t)X(t)Y(r) = - ‘f’]._W(r). Output: U(1,:)=A@’([,:) [z‘”]go) [ End 1-loop (27)

Thus, the least square solution of E(t)can be found as:

A triangular array of CORDIC processors (Fig. 14) is capable -k(t) = rr’(t).?(t) of performing the QR factorization algorithm effectively. Assuming that a new (row) data vector x(t+l) = [xl(t+l) Examnple: Adaptive Antenna Nulling x2(1+1) ... xp(f+l)] is now received. Our goal is to update To illustrate the DSP applications of QR factorization, -~(t) to k(t.1). This can be accomplished by performing p consider the least square adaptive antenna nulling problem: row operations between each row of the U(t)matrix and the Let the row vector x(t)= [xl(t), x2(t), ...,xp(T)] ] be the array x(t+l) data vector to nullify E(f+l). This procedure will lead input at time t; and X(r) = [ x‘(t) <(+I) . . . ~‘(0)1‘ be the data to an updated upper triangular matrix U(r+l), and the updated matrix. (Here, the superscript “t” refers to the matrix (vector) coefficient vector: transpose. The index “(I)” refers to the tth time instant.) The objective is to find a p by 1 coefficient vector E([)for each t to minimize the square error: Rader et a1 [69] have implemented a wafer scale linear t f CORDIC processor array called MUSE (Fig. 1.5) to imple- ment this adaptive nulling algorithm. They incorporate a E(r) = c I (f)l2 = c [ &’) - X(tr)E(tt$ (26) t’=O t’=O forgetting factor into the error formulation so that the scaling operation does not have to be exact. As a result, they achieve where g(r) = d(t)- X(t)_W(f).Our approach is to apply QR significant savings in the scaling operation. factorization to X(t) such that: Much research has been reported on using triangular sys- tolic array for QR factorization [3.5], and recursive least -- square (RLS) processing [62]. Recently, Cioffi [ 1.51 reported a fast QR-RLS algorithm which requires O(N) rotations in- LA stead of O(N2)rotation. where Q(t) is an orthonormal matrix, U([)is a p by p upper triangular matrix, and 0 is a (t+ 1-p) by p zero matrix. Pre-mul- Example: Kulman Filtering (KF) tiplying the matrix Q(t)on g(t), we have: QR factorization can also be applied to solve the Kalman

JULY 1992 IEEE SIGNAL PROCESSING MAGAZINE 29

-~ ~

Authorized licensed use limited to: Hong Kong University of Science and Technology. Downloaded on April 19,2010 at 02:55:58 UTC from IEEE Xplore. Restrictions apply. where "*" are terms which are of no concern. If QR factoriza- tion is used, M will be an orthonormal matrix. If Gaussian elimination is used, M will be a lower triangular matrix. Note that the last row of the M matrix in above equation is such thatp'A' + 4.b' = 0.Hence, the solution to Eq. 30 can be found -1 as:&=--. 4 U(3 3 U(3 4 u(3 5 This algorithm is a slightly modified version of the classi- cal Fadeev algorithm. As a variation of the above algorithm, it is suggested [52] that when A = U' U and A-' = L'L are both given, the above augmented matrix can be modified as:

Processor in the shaded area has been "lolded" back so that all the processors can be kept busy all lhe time I Now we have p'U + q b' = e', and = p'L. Therefore, ip ----L 15. Linear CORDIC for adaptive antenna nulling (MUSE).

filtering problem. A least square formula due to Paige and Toeplitz System and Covariance System Solvers Saunders [67] embedded the KF recursion formula into a In many advanced signal processing algorithms, the coef- matrix triangularization formulation: ficient matrix A in the linear system of equations has a special low displacement rank [33], [55] structure. Displacement rank is a measure of closeness between a given square matrix and a certain Toeplitz matrix. A Toeplitz matrix has a struc- ture that all elements along the diagonal direction are the same. That is,

t..=ti+=tk for-N+l

Linear System Solver Given a NxN matrix A, and a Nxl vector b, a linear systems of equations: A matrix has a structure which is close to a Toeplitz matrix if it has low displacement rank. Let us denote a shift matrix as:

0...0 0 arises quite often in digital signal applications. Conventional Z=[l,v-, methods involves the LU factorization of the A matrix, fol- 01 lowed by a back-substitution step. On the other hand, Eq. (30) The (+) displacement rank of a matrix A, denoted by a, is can be rewritten as [A b] 1]'= 0. Hence the solution [x' then defined as the rank of a displaced matrix, A - ZAP. In vector lies within the null space of the (A b] matrix. To find fact, the displacement matrix can be factorized (using, say, &explicitly,consider the triangularization of the [A bl'matrix LDU factorization) into the form: which is embedded in a (2N+l)x(N+l) matrix:

A - a2 = GCG' = G,G', - G,G; (33)

where C = diag [I,, - l~c-~],and GNX~is called a generator matrix because the original A matrix can be recovered from G according to the equation:

30 IEEE SIGNAL PROCESSING MAGAZINE JULY 1992

Authorized licensed use limited to: Hong Kong University of Science and Technology. Downloaded on April 19,2010 at 02:55:58 UTC from IEEE Xplore. Restrictions apply. (34) (35)

where L@;) = [ Ilk 1 is a lower triangular Toeplitz matrix with for k = 1.2, ..., N.Given [ h(k)], and [ z(k)], the objective of gi being its first column. If the generator. G. is also permis- deconvolution i\ to recover the input { u(k)) in a least square sihlr, that is. there exist an ax1 vector p such that Gp = sense. The solution requires the solution of a linear system of [ I 0 ... 0 1'. then thc L L' factorization of A-' can also be equations of the form: computed explicitly. Any generator matrix can be made per- missable by adding two more redundant vectors 1611. With the generator matrix. a fast Cholesky factorization algorithm 2 then can be applied to compute the U'U Cholesky factoriza- where V' = E/u'(k)/, (5 = E/n'(k)/, and H is a lower trian- tion of the A matrix as well as the L L' Cholesky factorization gular Toeplitz matrix with its first column being 3 of the A-' matrix, using O(aN') operations instead of O(N) [h(O)/I( 1) . . . h(N- 1 )]I.It is easy to verify that a = 2, p = 2. operations. Thus. this method will be very appealing when Furthermore. since only one row of the U matrix (U'U fac- u<

is permissible, with p = [0 110 1'. These observations lead to the following fast deconvolu- tion algorithm, where the solution, w, is computed explicitly: . F("'( 1 :p,1 :N+ 1) Initiation:

' F(")(pil:a,l:Nil) /* Perform Hyperbolic rotation on the first and the p+lrhrows of F*/

Form = I toNDo

output: [Urn,1:m) U(m,m:N)] I* Right shift the first row ofthe F matrix */ F(m+')(~,l:~+l)= [o ~rn,~:m>~(m,m:~-l)l End m-loop

E.miipIe: Fcr.ct L>rc,ori~,olutionAlgoritlim [47/ Consider ;I discrete time, causal linear system with known finite impulse response sequence [ h(k)). random input se- quence { u(k)], and random observation noise sequence {n(k)].Let { z(k))be the output of this system, then:

JULY 1992 IEEE SIGNAL PROCESSING MAGAZINE 31

-- -

Authorized licensed use limited to: Hong Kong University of Science and Technology. Downloaded on April 19,2010 at 02:55:58 UTC from IEEE Xplore. Restrictions apply. Eigenvalue and Singular Value Decomposition processing [7],[lo], [18], [39]. For real-valued matrices, a CORDIC processor can be applied to efficiently compute Eigenvalue decomposition (EVD) and singular value both singular value and eigenvalue decompositions. decomposition (SVD) have found extensive applications in Basically, a given matrix is diagonalized by pre- and modem digital signal processing. High resolution spectral post-multiplying with a sequence of unitary transformations. estimation techniques, such as the Pisarenko's method [68], In other words, SVD and EVD can be accomplished with a high resolution array processing algorithms such as the M- sequence of row and column operations. Thus, a CORDIC IJltiple SIgnal Classification (MUSIC) method [7I], all re- processor array is very suitable for the implementation of quire the computation of the eigenvalue decomposition or SVD or EVD of real-valued matrices. the singular value decomposition of the spatial covariance Below, we present an approach based on the classical matrix of array inputs. Jacobi method [83].This method was proposed by Brent and Given a pxq real matrix A, its singular value decomposi- Luk [7],and later implemented by Delosme [ 181, Cavallaro tion has the form: [IO] using CORDIC processor arrays. The Brent-Luk method is applicable both for SVD and EVD (symmetric matrix only) of real, square matrices. In the following discussion of this (37) algorithm, we shall assume the A matrix is asymmetric, and comment on the changes need to be made when A become symmetric. where U and V are, respectively, pxr (r< min (p,q)) and For convenience, we shall assume the dimension of the A rxq orthonormal matrices satisfying U'U = V'V = Zr. C = diag matrix p is an even number (if this is not the case, we simply append the A matrix with one row and one column of 0s so that it has even dimension). We will partition the A matrix into 2 x 2 blocks. For each of these pL2 blocks along the diagonal of the A matrix, the All A12 A13 A14 1 -*8 Propagate following 2 x 2 two-sided rotation operation can be per- A23 A24 - formed to nullify the two off-diagonal elements simul- taneously: exchange

In Eq. (39),the left and right rotation angles, 01 and 0, can be solved from the following two equations: ' A51 A52

L- 1 c+h €4 + 0, = tan- a-d - . ~~ ______~~~ 16 Two dirnen5ronal CORDlCproceywr array for SVD and EVD [IOI.

[GI,. . ., or] is a rxr diagonal matrix containing the r singular In symmetric eigenvalue computation problems, b = c on values (~i;i = 1 to r}. By definition, we have Gi > 0, each diagonal block. Hence, Eqs. 40a and 40b reduce to: I

r A = EAE' = hiviv: Each 81 computed in the ith diagonal block will be propagated ;= 1 to all the blocks in the ith row. Each 0r computed will be propagated to all the blocks in the ifh column. Thus, the ij where E is a pxp orthonormal matrix such that E E' = E' E = off-diagonal block will receive 01 from the irhdiagonal block, Zp. A= diag[hi, . . ., $1 contains the p real eigenvalues. Sequential algorithms for EVD and SVD has been well and 9, from the jfhdiagonal block. After every block has developed and ready to run Fortran routines can be found in completed this two-side CORDIC iteration, every pair of the public domain packages such as EISPAK [73],or LINPAK 2ifh and the 2i+lth column, as well as the rows, will be [29].A number of efforts have also been made to map these interchanged. This enables the beginning of the next iteration. sequential algorithms to a systolic array for parallel, pipelined Obviously, the two-sided rotation is very suitable for

32 IEEE SIGNAL PROCESSING MAGAZINE JULY 1992

.

Authorized licensed use limited to: Hong Kong University of Science and Technology. Downloaded on April 19,2010 at 02:55:58 UTC from IEEE Xplore. Restrictions apply. CORDIC implementation [IO]. Figure 16 shows a 3 x 3 Angeles, in 1980, and 1982, respectively. CORDIC processor array for implementing this algorithm for From 1983 to 1987, he was assistant a 6 x 6 matrix. Using the two-sided rotation method, the professor in the Department of Electrical scaling operation can be significantly simplified: since two Engineering at Southern Methodist CORDIC operations are performed at a time, the scaling University, Dallas, TX. He joined the factor will become: Department of Electrical and Computer t7- 1 n- 1 f Engineering at the University of Wiscon- 1 1 1 sin, Madison, in 1987 as an assistant = + 2-2“( I ,i) = “1.22 a n professor. Currently, he serves as associate professor. His r=O id research interests include VLSI signal processing, artificial Now,ifJ=( l,3,5,..., [:]-l], neural networks, spectrum estimation, fast algorithms and parallel computer architectures, and computer aided design tools for VLSI. He has published more than 100 journal and ti- I conference papers in these areas. He is a former associate 2-2j1.n [ 1 + 2-2i ] = 1 + 0(2Xn-’) editor (1988-90) for the IEEE Transactions of Acoustic, je J id Speech and Signal Processing in the areas of system iden- tification and fast algorithms. Dr. Hu is a member of IEEE, where 0(2-n-1) are terms which are smaller than 2?. Assum- SIAM, and Eta Kappa Nu. ing the register length is also n bits, these terms will not affect the accuracy of the result. Hence, up to n-bit accuracy, we REFERENCES have l/K:(n) = 1 - 2-2j]. In other words, for every two n[ [I] Abuzzo, J., “Applicability of CORDIC algorithm to arithmetic process- 0 ing”, IEEE Eighteenth Asilomar Conf: on Circuits, Systems and Computers, CORDIC operations(1eft and right), approximately n/4 addi- Pacific Grove, CA, USA, Nov. 5-7, 1984. tional iterations are used for scaling. Thus the savings is [2] Ahmed, H.M., and M. Morf. “Synthesis and control of signal processing significant. architectures based on rotations”, Proc. qf’ the Firsf Int’l Con$ on VLSI, In the original formulation, as seen in Eqs. 40a and 40b, Edinburgh, Scotland, Aug. 18-21, 1981, the rotation angles Or + Or and Or - 01 must be computed 131 Ahmed, H.M., M. Morf, D.T.L. Lee, and P.H. Ang, “A VLSI speech analysis chip set based on square-root normalized ladder forms”. Proc. 1981 explicitly using Eq. 4 in order to compute Or and Or. Thus, the ICASSP, Atlanta. GA, I98 I, pp. 648-653. double pipelining technique is not applicable. Delosme [I 81 [4] Ahmed, H.M.. “Signal Processing Algorithms and Architectures”, Ph.D has proposed some modifications to enable the use of double dissertation, Department of , Stanford University, pipelining. Stanford. CA, June 1982. [5] Ahmed, H.M., “Alternative arithmetic unit architectures for VLSI digital signal processors”, VLSl and modern signal processing, Prentice-Hall, CONCLUSION Englewood Cliffs, NJ, 1985. [SI Bracewell. R. N., “Discrete Hartley Transform”, J. Opt. Soc. Arner., Vol. In this article, we presented the basic CORDIC algorithm, 73. No. 12, pp. 1832- 1835, Dec. 1983. and a partial list of potential applications of a CORDIC-based [71 Brent, R. P, and F. T. Luk, “The solution of singular-value and symmetric processor array to digital signal processing. Due to space eigenvalue problems on multiprocessor arrays,” SIAM .I.Sci. Star. 
Comut., limitations, many exciting on-going research projects could Vol. 6, No. I, pp. 69-84, 1985. only be broadly addressed in the discussion. Details such as [8] Bu,J., E.F.A. Deprettere, and F. De-Lange, “On the optimization of QR least square adaptive filters [ 151, and on-linear implemen- pipelined silicon CORDIC algorithm”, Proc. of’ EUSIPCO-86 Signal Processing 111: Theories and Applications, Hague, Netherlands, pp. 1227- tation of CORDIC algorithms [30], for example, have been 30, Sept. 2-5, 1986. left out. We hope that this article stimulates further interest in [91 Burg, J. P., Maximum Entropy Spectral Ancrlysis, Ph. D. dissertation, the development of CORDIC-based digital signal processing Stanford University, Stanford, CA, 1975. algorithms and CORDIC-based special purpose array proces- [IO] Cavallaro, J.R., and F.T. Luk, “Architectures for a CORDIC SVD sors. processor”, Proc. SPlE Int. Soc. Opt. Eng. (USA),vol. 698, pp. 45-53, 1987. 11 I J Chang, L.W., and S. W. Lee, “Systolic Arrays for the discrete Hartley ACKNOWLEDGMENT Transform” IEEE Truns. on Signal Processing, Vol. 29, No. 11, Nov. 1991, 241 1-2418. The author would like to thank Dr. J. Deller, editor-in- [I21 Chapmann, R, and M. A. Rahman, “A generalized design method for orthogonal IIR lattice filters” Proc. ICASSP88, Glasglow, Scotland, 1988, chief of the SP Magazine, for his patience during the prepara- pp. 829-832. tion of this manuscript. Many useful comments from the [ 131 Chen, S. G. and J.-F. Lin, “Efficient implementation of the normalized anonymous reviewers are also deeply appreciated. recursive least square lattice filter”,Priic. ICASSPYI, pp. 1565- 1568, Toron- to, CANADA, 199 1, Yu Hen Hu received his B.S.E.E. degree from National [I41 Chen, W.H., C.H. Smith, and S.C. Fralick, “A fast computational Taiwan University, Taipei, Taiwan, ROC in 1976. He algorithm for the discrete cosine transform”, IEEE Trans. on Cornrnunica- tion.r. Vol. COM-25, pp. 1004.9, Sept. 1977. received the M.S.E.E. and Ph.D. degrees in Electrical En- [ 151Cioffi, J. M., “The fast adaptive ROTOR’SRLS algorithm”,IEEE Trans. gineering from the University of Southern California, Los on Signid Processing, Vol. 38, No. 4, Apr. 1990, pp. 63 1-653.

JULY 1992 IEEE SIGNAL PROCESSING MAGAZINE 33

[16] Cosnard, M., A. Guyot, B. Hochet, J.M. Muller, H. Ouaouicha, P. Paul, and E. Zysman, "The FELIN arithmetic chip", Proc. of the 8th IEEE Symp. on Computer Arithmetic, Como, Italy, May 19-21, 1987.
[17] Delosme, J.M., "VLSI implementation of rotations in pseudo-Euclidean spaces", Proc. ICASSP Int'l Conf. on Acoustics, Speech and Signal Processing, Boston, MA, pp. 927-930, April 14-16, 1983.
[18] Delosme, J.M., "A processor for two-dimensional symmetric eigenvalue and singular value arrays", Proc. 21st Asilomar Conf. on Circ., Syst., and Compu., pp. 217-221, Nov. 1988.
[19] Delosme, J.M., "CORDIC algorithms: theory and extensions", Proc. SPIE, Advanced Algorithms and Architectures for Signal Processing IV, Vol. 1152, 1989.
[20] Delange, A.A.J., Ed.F. Deprettere, A.J. van der Veen, and J. Bu, "Real time applications of the floating point CORDIC processor in massive-parallel pipelined DSP algorithms", Proc. ICASSP Int'l Conf. on Acoustics, Speech, and Signal Processing, pp. 1013-1016, April 1990.
[21] Deprettere, Ed.F., "Synthesis and fixed point implementation of pipelined true orthogonal filters", IEEE Int'l Conf. on ASSP, Boston, MA, Vol. 1, pp. 217-220, 1983.
[22] Deprettere, Ed.F., P. Dewilde, and R. Udo, "Pipelined CORDIC architectures for fast VLSI filtering", IEEE Int'l Conf. on ASSP, Florida, pp. 41A.6.1-4, April 1984.
[23] Deprettere, Ed.F., and K. Jainandunsing, "Orthogonal and J-orthogonal matrix inversion techniques", IEEE Int'l Symp. on Circuits and Systems, Philadelphia, PA, USA, May 4-7, 1987.
[24] Desai, U.B., "A state-space approach to orthogonal digital filters for VLSI implementation", Proc. ICASSP89, New York, NY, pp. 1255-1258, 1990.
[25] Despain, A.M., "Fourier transform computers using CORDIC iterations", IEEE Trans. on Computers, Vol. C-23, pp. 993-1001, Oct. 1974.
[26] Despain, A.M., "Very fast Fourier transform algorithms for hardware implementation", IEEE Trans. on Computers, Vol. C-28, pp. 333-341, May 1979.
[27] Dewilde, P., E.F. Deprettere, and C.V.K. Prabhakara Rao, "Orthogonal digital filters", Proc. ICASSP84, Vol. 1, pp. 230-233, San Diego, CA, 1984.
[28] Dewilde, P., Ed.F. Deprettere, and R. Nouta, "Parallel and pipelined VLSI implementation of signal processing algorithms", in VLSI and Modern Signal Processing, S.Y. Kung et al., Eds., Prentice Hall, 1985.
[29] Dongarra, J., J.R. Bunch, C.B. Moler, and G.W. Stewart, LINPACK User's Guide, SIAM Publications, Philadelphia, 1978.
[30] Ercegovac, M.D., and T. Lang, "Implementation of fast angle calculation and rotation using on-line CORDIC", Proc. 1988 IEEE Int'l Symp. on Circuits and Systems, Espoo, Finland, June 7-9, 1988.
[31] Ercegovac, M.D., and T. Lang, "Redundant and on-line CORDIC: applications to matrix triangularization and SVD", IEEE Trans. on Computers, Vol. 39, No. 6, pp. 725-740.
[32] Fettweis, A., "Pseudopassivity, sensitivity, and stability of wave digital filters", IEEE Trans. Circuit Theory, Vol. 19, pp. 668-673, Nov. 1972.
[33] Friedlander, B., T. Kailath, M. Morf, and L. Ljung, "Extended Levinson and Chandrasekhar equations for general discrete-time linear estimation problems", IEEE Trans. on Automatic Control, Vol. AC-23, No. 4, pp. 653-659, August 1978.
[34] Friedlander, B., "Lattice filtering for adaptive signal processing", IEEE Proceedings, Vol. 70, No. 8, pp. 829-867, Aug. 1982.
[35] Gentleman, W.M., and H.T. Kung, "Matrix triangularization by systolic array", Proc. SPIE, Vol. 298, Real-Time Signal Processing IV, pp. 19-26, Bellingham, Washington, 1981.
[36] Gray, A.H. Jr., and J.D. Markel, "A normalized digital filter structure", IEEE Trans. on ASSP, Vol. 23, No. 3, pp. 268-277, June 1975.
[37] Haviland, G.L., and A.A. Tuszynski, "A CORDIC arithmetic processor chip", IEEE Trans. on Computers, Vol. C-29, No. 2, pp. 68-79, Feb. 1980.
[38] Henrot, D., and C.T. Mullis, "A modular and orthogonal digital filter structure for parallel processing", Proc. ICASSP83, Vol. 2, pp. 623-626, Boston, MA, 1983.
[39] Heller, D.E., and I.C.F. Ipsen, "Systolic networks for orthogonal equivalence transformations and their applications", Proc. 1982 Conf. on Advanced Research in VLSI, MIT, pp. 113-122, 1982.
[40] Hu, X., R.G. Harber, and S.C. Bass, "Expanding the range of convergence of the CORDIC algorithm", IEEE Trans. on Computers, Vol. 40, No. 1, pp. 13-21, Jan. 1991.
[41] Hu, Y.H., "Pipelined CORDIC architecture for the implementation of rotation-based algorithms", Int'l Symp. on VLSI Technology, Systems, and Applications, Taipei, Taiwan, May 1985.
[42] Hu, Y.H., and T.Y. Sung, "The optimal design of VLSI CORDIC processor", Proc. Int'l Symp. on VLSI Tech., Syst., and Appl., Taipei, Taiwan, R.O.C., pp. 31-35, May 1987.
[43] Hu, Y.H., "The quantization effects of the CORDIC algorithm", IEEE Trans. on Signal Processing, Vol. 40, No. 4, pp. 834-844, April 1992.
[44] Hu, Y.H., and S. Naganathan, "Efficient implementation of the Chirp Z-transform using a CORDIC processor", Asilomar Conf. on Circuits, Systems and Computers, CA, Oct. 1988. Also in IEEE Trans. ASSP, Vol. 38, No. 2, pp. 352-354, Feb. 1990.
[45] Hu, Y.H., "Parallel eigenvalue decomposition for Toeplitz and related matrices", Proc. ICASSP, Glasgow, Scotland, pp. 1107-1110, May 1989.
[46] Hu, Y.H., and S. Naganathan, "An angle recoding method for CORDIC algorithm implementation", IEEE Trans. on Computers, (to appear).
[47] Hu, Y.H., and P.H. Milenkovic, "A fast least square deconvolution algorithm for vocal tract cross-section estimation", IEEE Trans. on ASSP, pp. 921-924, June 1990.
[48] Hu, Y.H., and H.M. Chern, "VLSI CORDIC array structure implementation of Toeplitz eigensystem solvers", Proc. ICASSP, pp. 1575-1578, April 1990.
[49] Hu, Y.H., and H.-E. Liao, "CALF: a CORDIC adaptive lattice filter", IEEE Trans. on Signal Processing, Vol. 40, No. 4, pp. 990-993, April 1992.
[50] Hu, Y.H., and H.M. Chern, "An efficient CORDIC backward angle recoding algorithm and its applications to digital signal processing", (submitted).
[51] Hwang, K., Computer Arithmetic: Principles, Architecture, and Design, John Wiley & Sons, Inc., New York, 1979.
[52] Jainandunsing, K., and E.F. Deprettere, "A new class of parallel algorithms for solving systems of linear equations", SIAM J. Sci. Stat. Comput., Vol. 10, No. 5, pp. 880-912, Sept. 1989.
[53] Jou, I.C., T. Sung, Y. Hu, and T. Parng, "A CORDIC implementation of a pipelined Toeplitz system solver", IEEE Int'l Symp. on Circuits and Systems, Kyoto, Japan, June 1985.
[54] Jou, I.C., Y.H. Hu, and T.M. Parng, "VLSI algorithms and pipelined architectures for solving structured linear systems", Proc. VLSI Algorithms and Architectures, Aegean Workshop on Computing, Loutraki, Greece, July 1986.
[55] Kailath, T., S.Y. Kung, and M. Morf, "Displacement ranks of matrices and linear equations", J. Math. Anal. Appl., Vol. 68, No. 2, pp. 395-407, 1979.
[56] Kalman, R.E., "A new approach to linear filtering and prediction problems", J. of Basic Engineering, Vol. 82D, pp. 34-45, March 1960.
[57] Kung, H.T., "Why systolic architectures?", IEEE Computer, Vol. 15, No. 1, Jan. 1982.
[58] Kung, S.Y., "A Toeplitz approximation method and some applications", Proc. Int'l Symp. on Mathematical Theory of Networks and Systems, Santa Monica, CA, pp. 262-266, Aug. 1981.
[59] Kung, S.Y., and Y.H. Hu, "A highly concurrent algorithm and pipelined architecture for solving Toeplitz systems", IEEE Trans. on ASSP, Vol. 31, No. 1, pp. 66-76, Feb. 1983.
[60] Kung, S.Y., VLSI Array Processors, Prentice Hall, 1987.
[61] Lev-Ari, H., and T. Kailath, "Lattice filter parameterizations and modeling of nonstationary processes", IEEE Trans. Inform. Theory, Vol. IT-30, pp. 2-16, 1984.
[62] McWhirter, J.G., "Recursive least squares minimization using a systolic array", Proc. SPIE, Vol. 431, Real-Time Signal Processing VI, pp. 105-112, 1983.


[63] Naganathan, S., and Y.H. Hu, "Architectural design styles in the VLSI implementation of real discrete Fourier transform", Proc. ISCAS, IEEE Int'l Symp. on Circuits and Systems, New Orleans, LA, May 1-3, 1990.
[64] Naganathan, S., "Design Methodology for the Implementation of Rotation-Based Algorithms Using a CORDIC Processor", Ph.D. dissertation, Department of Electrical Engineering, Southern Methodist University, Dallas, TX, Mar. 1990.
[65] Note, S., J. Van Meerbergen, F. Catthoor, and H. De Man, "Automated synthesis of a high speed CORDIC algorithm with the Cathedral-III compilation system", Proc. 1988 IEEE ISCAS, Espoo, Finland, June 7-9, 1988.
[66] Oppenheim, A.V., and R.W. Schafer, Digital Signal Processing, Prentice Hall, 1975.
[67] Paige, C., and M. Saunders, "Least squares estimation of discrete linear dynamic systems using orthogonal transformations", SIAM J. Numerical Analysis, pp. 180-183, 1977.
[68] Pisarenko, V.F., "The retrieval of harmonics from a covariance function", Geophys. J. Roy. Astronom. Soc., Vol. 33, pp. 347-366, 1973.
[69] Rader, C.M., D.L. Allen, D.B. Glasco, and C.E. Woodward, "MUSE - a systolic array for adaptive nulling with 64 degrees of freedom, using Givens transformations and wafer scale integration", MIT Lincoln Laboratory, Tech. Report 886, Lexington, MA, May 1990.
[70] Rao, S.K., and T. Kailath, "Orthogonal digital filters for VLSI implementation", IEEE Trans. on Circuits and Systems, Vol. 31, No. 11, pp. 933-945, Nov. 1984.
[71] Schmidt, R., "Multiple emitter location and signal parameter estimation", Proc. RADC Spectral Estimation Workshop, Rome, NY, pp. 243-258, 1979.
[72] Sibul, L.H., and A.L. Fogelsanger, "Application of coordinate rotation algorithm to singular value decomposition", Proc. of IEEE Int'l Symp. on Circuits and Systems, Montreal, Que., pp. 821-824, May 1984.
[73] Smith, B.T., J.M. Boyle, Y. Ikebe, V.C. Klema, and C.B. Moler, Matrix Eigensystem Routines: EISPACK Guide, 2nd ed., Springer-Verlag, New York, 1970.
[74] Sung, T.Y., Y.H. Hu, and H.J. Yu, "Doubly pipelined CORDIC array processor for solving Toeplitz systems", Proc. of the Twenty-Third Annual Allerton Conf. on Communication, Control, and Computing, Monticello, IL, Oct. 1985.
[75] Sung, T.Y., T. Parng, Y. Hu, and P. Chou, "Design and implementation of a VLSI CORDIC processor", IEEE Int'l Symp. on Circuits and Systems, San Jose, CA, pp. 934-935, May 1986.
[76] Sung, T.Y., Y.H. Hu, and H.J. Yu, "Doubly pipelined CORDIC array for digital signal processing algorithms", IEEE ICASSP Int'l Conf. on Acoustics, Speech and Signal Processing, Tokyo, Japan, pp. 69-72, Apr. 1986.
[77] Sung, T.Y., and Y.H. Hu, "Parallel VLSI implementation of Kalman filter", IEEE Trans. on Aerospace and Electronic Systems, Vol. AES-23, No. 2, pp. 215-224, March 1987.
[78] Timmermann, D., H. Hahn, B.J. Hosticka, and G. Schmidt, "A programmable CORDIC chip for digital signal processing applications", IEEE J. of Solid-State Circuits, Vol. 26, No. 9, pp. 1317-1321, Sep. 1991.
[79] Udo, R., E. Deprettere, and P. Dewilde, "On the implementation of orthogonal and orthogonalizing algorithms using pipelined CORDIC architectures", Proc. EUSIPCO, 2nd European Signal Processing Conference, Erlangen, Germany, Sept. 1983.
[80] Vaidyanathan, P.P., "A unified approach to orthogonal digital filters and wave digital filters based on the LBR two-pair extraction", IEEE Trans. on Circuits and Systems, pp. 673-686, July 1985.
[81] Volder, J.E., "The CORDIC trigonometric computing technique", IRE Trans. on Electronic Computers, Vol. EC-8, No. 3, pp. 330-334, Sept. 1959.
[82] Walther, J.S., "A unified algorithm for elementary functions", Spring Joint Computer Conf., pp. 379-385, 1971.
[83] Wilkinson, J.H., The Algebraic Eigenvalue Problem, Oxford University Press, 1965.
[84] Yang, B., and J.F. Böhme, "Reducing the computations of the SVD array given by Brent and Luk", Proc. SPIE, Advanced Algorithms and Architectures for Signal Processing, Vol. 1152, pp. 92-102, Aug. 1989.
