<<

A survey of CORDIC for FPGA based computers Ray Andraka Andraka Consulting Group, Inc 16 Arcadia Drive North Kingstown, RI 02852 401/884-7930 FAX 401/884-7950 email:[email protected] 1. ABSTRACT efficient algorithms is a class of iterative solutions for trigonometric and other transcendental functions that use The current trend back toward hardware only shifts and adds to perform. The trigonometric intensive has uncovered a functions are based on vector rotations, while other relative lack of understanding of hardware functions such as are implemented using an signal processing architectures. Many incremental expression of the desired function. The trigonometric is called CORDIC, an acronym for hardware efficient algorithms exist, but these Coordinate &on&ion DIgital computer. The incremental are generally not weil known due to the functions are performedwith a very simple extensionto the dominance of software systems over the past hardware architecture, and while not CORDIC in the strict quarter century. Among these algorithms is a sense, are often included becauseof the close similarity. set of shift-add algorithms collectively known The CORDIC algorithms generally produce one additional bit of accuracyfor each iteration. as CORDIC for computing a wide range of functions including certain trigonometric, The trigonometric CORDIC algorithms were originally hyperbolic, linear and logarithmic functions. developed as a digital solution for real-time navigation problems. The original work is credited to Jack Volder While there are numerous articles covering [4,9]. Extensions to the CORDIC theory basedon work by various aspects of CORDIC algorithms, very John Wahher[l] and others provide solutions to a broader few survey more than one or two, and even class of functions. The CORDIC algorithm has found its fewer concentrate on implementation in way into diverse applications including the 8087 math FPGAs. This paper attempts to survey [7], the HP-35 , radar signal processors[3]and robotics. CORDIC has also been commonly used functions that may be proposed for computing Discrete Fourier[4], Discrete accomplished using a CORDIC architecture, Cosine[4], Discrete Hartley[lO] and Chirp-Z [9] explain how the algorithms work, and explore transforms, filtering[4], Singular Value implementation specific to FPGAs. Decomposition[141, and solving linear systems[11. 1.1 Keywords This paper attempts to survey the existing CORDIC and CORDIC, , cosine, vector magnitude,polar conversion CORDIC-like algorithms with an eye toward implementation in Field Programmable Gate Arrays 2. INTRODUCTION (FPGAs). First a brief description of the theory behind the The digital signal processing landscape has long been algorithm and the derivation of several functions is dominated by microprocessorswith enhancementssuch as presented. Then the theory is extended to the so-called single cycle multiply-accumulate instructions and special unified CORDIC algorithms, after which implementation addressingmodes. While these processorsare low cost of FPGA CORDIC processorsis discussed. and offer extreme flexiblility, they are often not fast enough for truly demanding DSP tasks. The advent of Permissionto make digitalhrd copies of all or pti oft& material for reconfigurable logic computers permits the higher speeds Personal or chsmom use is mted without fee provided that the copies of dedicated hardware solutions at costs that are are not made or distributed for profit or commercial advantage, &e copy- competitive with the traditional software approach. right notice, the title oflhe publication and its date appear, and notice is &‘en that mpy&ht is by permission of the ACM, Inc. To copy otherwise, Unfortunately, algorithms optimized for these to republish, to post on servers or to redistribute to lists, requires specific based systems do not usually map well permission and/or fee. into hardware. While hardware-efficient solutions often FPGA 98 Monterey CA USA exist, the dominance of the softwaresystems has kept those %yri&t 1998 ACM 0-89791-978~5/98101..$5.00 solutions out of the spotlight. Among these hardware-

191 3. CORDIC THEORY: AN ALGORITHM The set of all possible decision vectors is an angular FOR VECTOR ROTATION measurement system based on binary arctangents. Conversionsbetween this angular systemand any other can All of the can be computed or be accomplished using a look-up. A better conversion derived from functions using vector rotations, as will be method uses an additional -subtractor that discussedin the following sections. Vector rotation can accumulates the elementary rotation angles at each also be used for polar to rectangular and rectangular to iteration. The elementary angles can be expressedin any polar conversions,for vector magnitude, and as a building convenient angular unit. Those angular values are supplied block in certain transformssuch as the DFT and DCT. The by a small (one entry per iteration) or are CORDIC algorithm provides an iterative method of hardwired, depending on the implementation. The angle performing vector rotations by arbitrary angles using only accumulator adds a third difference equation to the shifts and adds. The algorithm, credited to Volder[4], is CORDIC algorithm: derived from the general (Givens) rotation transform: - di . &-‘(2~‘) x1= xcos$ - ysiIl(l 2.r+l =q y’= ycos$+xsin+ Obviously, in cases where the angle is useful in the arctangentbase, this extra elementis not needed. , which rotates a vector in a Cartesianplane by the angle $. Thesecan be rearrangedso that: The CORDIC rotator is normally operated in one of two modes. The first, called rotation by Volder[4], rotates the x’=cost#b[x-ytang]* input vector by a specified angle (given as an argument). The secondmode, called vectoring, rotatesthe input vector y’=cos~.[y+xtan~] to the x axis while recording the angle required to make So far, nothing is simplified. However, if the rotation that rotation. angles are restricted so that tan(~$)=EZ’,the In rotation mode, the angle accumulator is initialized with by the tangent term is reduced to simple shift operation. the desired rotation angle. The rotation decision at each Arbitrary angles of rotation are obtainable by performing a iteration is made to diminish the magnitude of the residual series of successivelysmaller elementary rotations. If the angle in the angle accumulator. The decision at each decision at each iteration, i, is which direction to rotate iteration is therefore basedon the sign of the residual angle rather than whether or not to rotate, then the cos(SJ term after each step. Naturally, if the input angle is already becomes a constant (because cos(Si) = cos(-6J). The expressed in the binary arctangent base, the angle iterative rotation can now be expressedas: accumulator may be eliminated. For rotation mode, the CORDIC equationsare: x,+1 &,[x, -yi .di .2-j] x.1+1 =xi -y/d,.2-’ yj+* = Ki[yi t xi -di -2”] yi+l =yj+x,-di.2-’ zj+* = zj - dj 4af’(2-‘) Ki = cos(tan-’ 2-‘) = l/J= where di =+I dF -1 if z, c 0, +l otherwise Removing the scale constant from the iterative equations which provides the following result: yields a shift-add algorithm for vector rotation. The product of the Kr’s can be applied elsewherein the system = A&, cosz,‘-y, sinz,] or treatedas part of a systemprocessing gain. That product 58 approaches 0.6073 as the number of iterations goes to Y, = An[Yo coszo + x0 sinz, infmity. Therefore, the rotation algorithm has a gain, A,,, 1 of approximately 1.647. The exact gain depends on the Zll=o number of iterations, and obeysthe relation A, =nm A, = nJl+z”” n n In the vectoring mode, the CORDIC rotator rotates the The angle of a composite rotation is uniquely defmed by input vector through whatever angle is necessaryto align the sequenceof the directions of the elementary rotations. the result vector with the x axis. The result of the vectoring That sequencecan be representedby a decision vector. operation is a rotation angle and the scaled magnitude of

192 the original vector (the x component of the result). The reduction may be more convenient when wiring is vectoring function works by seeking to minimize the y restricted, as is often the casewith FPGAs. componentof the residual vector at eachrotation. The sign The CORDIC rotator described is usable to compute of the residual y component is used to determine which several trigonometric functions directly and others direction to rotate next. If the angle accumulator is indirectly. Judicious choice of initial values and modes initialized with zero, it will contain the traversed angle at permits direct computation of sine, cosine, arctangent, the end of the iterations. In vectoring mode, the CORDIC vector magnitude and transformations between polar and equationsare: Cartesiancoordinates. =xi-y;.d,.2-’ xi+1 3.1 Sine and Cosine yi+, = yj +xi.di.2-’ The rotational mode CORDIC operation can simultaneously compute the sine and cosine of the input z.I+1 = zi - di. tan-‘(2-7 angle. Setting the y componentof the input vector to zero reducesthe rotation mode result to: where d,= +I ifx, < 0, -1 otherwise. x, = A, -x0 cosz, Then: y,, = A, ‘x0 sinz 0 By setting x0 equal to l/.-A, the rotation produces the x,, = Andxi +yi unscaled sine and cosine of the angle argument,z,. Very Y, =o often, the sine and cosine values modulate a magnitude value. Using other techniques (e.g., a look up table) requires a pair of multipliers to obtain the modulation. The CORDIC technique performs the multiply as part of the rotation operation, and therefore eliminates the need for a pair of explicit multipliers. The output of the CORDIC n rotator is scaled by the rotator gain. If the gain is not The CORDIC rotation and vectoring algorithms as stated acceptable,a single multiply by the reciprocal of the gain are limited to rotation angles between -7ri2 and 7~12.This constant placed before the CORDIC rotator will yield limitation is due to the use of 2” for the tangent in the fmt unscaled results. It is worth noting that the hardware iteration. For compositerotation angles larger than n/2, an complexity of the CORDIC rotator is approximately additional rotation is required. Volder[4] describes an equivalent to that of a single multiplier with the sameword initial rotation &r/2. This gives the correction iteration: size. x’= -&y 3.2 Polar to Rectangular Transformation y’= d-x A logical extension to the sine and cosine computer is a polar to Cartesian coordinate transformer. The z’= z+dJ2 / transformation from polar to Cartesianspace is defmedby: where d = +l if r

193 3.3 General vector rotation rotator gain, and should be accountedfor elsewherein the The rotation mode CORDIC rotator is also useful for system. If the gain is unacceptable,it can be correctedby performing general vector rotations, as are often multiplying the resulting magnitude by the reciprocal of the encounteredin motion correction and control systems. For gain constant. general rotation, the 2 dimensional input vector is presented to the rotator inputs. The rotator rotates the 3.7 Inverse CORDIC functions vector through the desired angle. The output is scaled by In most cases,if a function can be generatedby a CORDIC the CORDIC rotator gain, which must be accounted for style computer, its inverse can also be computed. Unless elsewherein the system. If the scaling is unacceptable,a the inverse is calculable by changing the mode of the pair of constant multipliers is required to compensatefor rotator, its computation normally involves comparing the the gain. CORDIC rotators may be cascadedin a tree output to a target value. The CORDIC inverse is illustrated architecture for general rotation in n-dimensions. Some by the Arcsine function. optimization of multidimensional rotation is possible to 32 Arcsine and Arccohne permit computational savings over the general n- dimensionedcase, as reported by Hsiao et al. [4] The Arcsine can be computedby starting with a unit vector on the positive x axis, then rotating it so that its y 3.4 Arctangent component is equal to the input argument. The arcsine is The arctangent, 8=Atan(y/x), is directly computed using then the angle subtendedto causethe y component of the the vectoring mode CORDIC- rotator if the angle rotated vector to match the argument. The decision accumulatoris initialized with zero. The argumentmust be function in this caseis the result of a comparisonbetween provided as a ratio expressedas a vector (x, y). Presenting the input value and the y componentof the rotated vector at the argument as a ratio has the advantageof being able to eachiteration: represent infinity (by setting x=0). Since the arctangent xi+1=xi-yj.di.2-’ result is taken from the angle accumulator, the CORDIC rotator growth doesnot affect the result. yi+l =yi+x,.d,s2-’ z. =zj - dj 4an-’ (2-‘) Z”=zo itan-’ 0 yoxo 1+1 where 3.5 Vector Magnitude . dF +l ifvr < c, -1 otherwise, and The vectoring mode CORDIC rotator produces the magnitude of the input vector as a byproduct of computing c = input argument./ the arctangent. After the vectoring mode rotation, the Rotation producesthe following result: vector is aligned with the x axis. The magnitude of the vector is therefore the same as the x component of the rotated vector. This result is apparent in the result x, =J/(A,.xo)2-c2 equationsfor the vector mode rotator: YII =c

XII =A,dm The magnitude result is scaledby the processorgain, which needs to be accounted for elsewherein the system. This implementation of vector magnitude has a hardware complexity of roughly one multiplier of the same width. n The CORDIC implementation represents a significant The arcsine function as stated above returns correct angles hardware savings ’ over an equivalent Pythagorean for inputs -1 < c/kQo < 1, although the accuracysuffers as processor.,The accuracy of the magnitude result improves the input approaches+I (the error increasesrapidly for by 2 bits for each iteration performed. inputs larger than about 0.98). This loss of accuracy is due to the gain of the rotator. For angles near the y axis, the 3.6 Cartesian to Polar transformation rotator gain causesthe rotated vector to be shorter than the The Cartesian to Polar transformation consists of fmding reference (input), so the decisions are made improperly. the magnitude (x=sqrt(x2+y2)) and phase angle The gain problems can be corrected using a “double ($=atan[y/x]) of the input vector, (x, y). The reader will iteration algorithn?[9] at the cost of an increase in immediately recognize that both functions are provided complexity. simultaneously by the vectoring mode CORDIC rotator. The Arccosine computation is similar, except the difference The magnitude of the result will be scaledby the CORDIC between the x component and the input is used as the

194 decision function. Without modification, the arccosine d,= -1 if zi < 0, +l otherwise. algorithm works only for inputs less than l/A,, making the double iteration algorithm a necessity. The Arccosine Then: could also be computed by using the arcsine function and x, = A, [x0 cash z. + y. sinh zo] subtracting x/2 from the result, followed by an angular reduction if the result is in the fourth quadrant. y, = A,[y, cash z. + x0 sinh zo] 3.9 Extension to Linear functions Zll =o A simple modification to the CORDIC equation permits the computationof linear functions: A, =J$i7~0.80 n X j+, =x;-O.y;.d;.2-‘=x; In vectoring mode cd,= +l if Y; < 0, -1 otherwise) the rotation produces: Yi+l =y;+xj.dj.2-’

= z; - d; . x, = %Ja -Ylf Z.r+l (2-9 =o For rotation mode (d,= -1 if Z; < 0, +l otherwise) the linear YIZ rotation produces: Z”=z,+tanh-’ yoxo x, =x0 (A Y, = Yo +xozo A=&/- n =o ,_ Zll The elemental rotations in the hyperbolic coordinate This operation is similar to the shift-add implementation of system do not converge. However, it can be shown[I] that a multiplier, and as multipliers go is not an optimal convergenceis achieved if certain iterations (1=4, 13,40,..., solution. The multiplication is handy in applications where k, 3k+l,...) are repeated. a CORDIC structure is already available. The vectoring The hyperbolic equivalents of all the functions discussed mode (d,= +I ify; < 0, -1 otherwise) is more interesting, as for the circular coordinate system can be computed in a it provides a method for evaluating ratios: similar fashion. Additionally, as Walther[l] points out, the X” = x0 following functions can be derived from the CORDIC functions: Yn = 0 tana = sinakosa .?I =zo-YolXo tanha = sinha/cosha The rotations in the linear coordinate system have a unity expa = sinha + cosha gain, so no scaling corrections are required. lna = 2tanh-‘[y/x] where x=a +l and y==a-I 3.10 Extension to (a)ln= (x2-y2)mwhere ~=a+114and y=a-l/4 The close relationship between the trigonometric and It is worth noting the similarities between the CORDIC hyperbolic functions suggeststhe samearchitecture can be equationsfor circular, linear, and hyperbolic systems. The used to compute the hyperbolic functions. While, there is selection of coordinate system can be made by introducing early mention of using the CORDIC structure for a mode variable that takes on values l,O, or -1 for circular, hyperbolic coordinate transforms [4], the first description linear and hyperbolic systemsrespectively. The unified [l] of the algorithm is that by Walther [l]. The CORDIC CORDIC iteration equationsare then: equations for hyperbolic rotations are derived using the samemanipulations as those used to derive the rotation in xi+1 = x; -m-y, .dj .2-’ the circular coordinate system. For rotation mode these are: yj+, =y;+xj.dj.2-’ = z,-d,.e, Xi+1 =x;+y;-d;.2-i ‘;+I yj+l =y;+x;.d,.2-’ where e; is the elementaryangle of rotation for iteration i ‘in the selectedcoordinate system. Specifically, e; r tam1~'(2~') =z;-d;-tanh-’ 2-’ Zi+l ( > for m=l, e; = 2” for m=O, and e; = &I,&-*(~‘) for m=l. where

195 This unification, due to Walther, permits the design of a will typically require several layers of logic (i.e., the signal generalpurpose CORDIC . will need to pass through a number of FPGA cells). The result is a slow design that uses a large number of logic 3.11 Short cuts cells. For fixed angle rotations, as are encounteredin such places as fast Fourier Transforms (FFTs), the arctangent base representation of the angle can be pre-computed and applied directly to the CORDIC rotator. This hardwiring of a fixed angle(s) eliminates the need for the angle accumulator,which reducesthe circuit complexity by about 25 percent. If the constraints on the decision variable are relaxed to allow that variable to take on values of {-l,O,l} instead of just {-l,l}, the number of iterations can also be reduced. Iterations for which the decision variable is zero passthe data unrotated, and can thus be eliminated. This f modification causesthe gain to become a function of the register rotated angle, so it is only useful if the rotation angle is fixed. Hu and Naganathan[lO] propose a method of pre- Yo computing the recoded angles for the ternary decision m variable. This technique can significantly reduce the complexity of on-line CORDIC processorsused for fmed angle rotations. 4. IMPLEMENTATION IN AN FPGA There are a number of ways to implement a CORDIC zo processor. The ideal architecture depends on the speed versus area tradeoffs in the intended application. First we Figure 1. Iterative CORDIC structure will examine an iterative architecture that is a direct translation from the CORDIC equations. From there, we A considerably more compact design is possible using bit will look at a minimum hardware solution and a maximum serial arithmetic. The simplified interconnect and logic in a performancesolution. bit serial design allows it to work at a much higher than the equivalent bit parallel design. Of course,the 4.1 Iterative CORDIC Processors design also needsto clocked w times for each iteration (w is An iterative CORDIC architecture can be obtained simply the width of the data word). The bit serial design consists by duplicating each of the three difference equations in of three bit serial adder-subtractors,three shift registersand hardware as shown in Figure 1. The decision function, dr, a serial ReadOnly Memory (ROM). Each shift register has is driven by the sign of the y or z register depending on a length equal to the word width. There is also some whether it is operated in rotation or vectoring mode. In gating or to select taps off the shift registers operation,the initial values are loaded via multiplexers into for the right shifted cross terms (shifting is accomplished the x, y and z registers. Then on each of the next n clock using bit delays in bit serial systems). The bit serial cycles, the values from the registers are passedthrough the CORDIC architecture is shown in Figure 2. In this design, shifters and adder-subtractorsand the results placed back in w clocks are required for each of the n iterations, where w the registers. The shifters are modified on each iteration to is precision of the adders. In operation, the load cause the desired shift for the iteration. Likewise, the multiplexers on the left are opened for w clock periods to ROM addressis incremented on each iteration so that the initialize the x, y and z registers (these registers could also appropriate elementary angle value is presentedto the z be parallel loaded to initialize). Once loaded, the data is adder-subtractor. On the last iteration, the results are read shifted right through the serial adder-subtractors and directly from the adder-subtractors. Obviously, a simple returned to the left end of the register. Each iteration statemachine is required keep track of the current iteration, requires w clocks to return the result to the register. At the and to select the degreeof shift and ROM addressfor each beginning of each iteration, the control state machine reads iteration. . the sign of the y (or z) register and sets the add/subtract The design depicted in Figure 1 usesword-wide data paths controls accordingly. The appropriate tap off the register (called bit-parallel design). The bit-parallel variable shift for the crossterms is also selectedat the beginning of each shifters do not map well to FPGA architecturesbecause of iteration. During the tih iteration, the results can be read the high fan-in required. If implemented, those shifters

196 from the outputs of the serial adders while the next significant simplifications. First the shifters are each a initialization data is shifted into the registers. fured shift, which means that they can be implemented in the wiring. Second, the lookup values for the angle accumulator are distributed as constantsto each adder in the angle accumulator chain. Those constants can be hardwired instead of requiring storage space. The entire CORDIC processoris reducedto an array of interconnected adder-subtractors.The need for registersis also eliminated, making the unrolled processorstrictly combinatorial. The delay through the resulting circuit would be substantial,but the processing time is reduced from that required by the iterative circuit (if by nothing else than the set-up and hold times of the register). Most times, especially in an FPGA, it does not make senseto use such a large combinatorial circuit. The unrolled processor is easily pipelimed by Figure 2 Bit serial iterative CORDIC inserting registers between the adder-subtractors. In the caseof most FPGA architecturesthere are already registers The simplicity of the bit serial design is apparent from present in each logic cell, so the of the pipeline figure 2. Even in this case, the wiring of the shift tap registershas no hardware cost. multiplexers can presentproblems in some FPGAs (this is one place where t&state long lines can come in handy). Even so, the interconnect is minimal and the logic between registers is simple. This combination permits bit clock rates near the maximum toggle frequency of the FPGA. +I- R The possibility of using extreme bit clock frequencies makesup for the large number of clock cycles required to T T’ completeeach rotation. Yo G Dual Port AddlSubt yn Now, if the design is in a Xilhrx 4000E seriespart, the shift +syncRam ,+/- R registers can be implemented in the CLB RAM[2]. The RAM emulates a shift register by incrementing the- P t t-l read/write address after each access. The dual port capability of the CLB RAM provides the capability to read two locations in the 16x1 RAM simultaneously [9]. By properly sequencingthe second address,the effect of the shift tap is achieved without a physical multiplexer. The result is the shift register and multiplexer for word lengths up to 16 bits are implemented in a single CLB (plus 8 CLBs for the 2 address sequencers and iteration , which are shared by the three shifters). The serial ROM also usesthe CLB for data storage. One CLB is required for every two iterations. The 16 bit, 8 iteration CORDIC processorshown in Figure 3 uses only 21 CLBs, and will run at bit rates up to about 90 MHz (mainly limited by the RAM write cycle). This translates to about a 1.5~1sprocessing time, which is only about three and a half times longer than the best one could expect from the much larger bit parallel iterative solution. Figure 3 Iterativebitign fomE series 4.2 On-Line CORDIC Processors FPGA uses21 CLBs The CORDIG processors discussed so far are iterative, which means the processorhas to perform iterations at-n times the data rate. The iteration processcan unrolled[l8] so that each of n processingelements always performs the sameiteration. An unrolled CORDIC processor is shown in Figure 4. Unrolling the processor results in two

197 shows two iterations of a bit serial CORDIC processor implemented in an Atmel 6005 or NSC Clay31 FPGA. Notice the cross term is taken from different taps off the shift register at each iteration. This particular processoris used to compute vector magnitude. Since this is a vector mode processand the result angle is not required, there is no need for an angle accumulator. Figure 6 shows the detail of the adder-subtractorfor that design. The adder subtractor in this case includes logic to extend the sign of the shifted cross term and to reset the adder subtractor between words. The entire 7 iteration design occupies approximately 20% of the FPGA and runs at bit ratesup to 125 Mhz [3]. Higher performance requires either multiple bit serial processors running in parallel, or an unrolled parallel pipeline. Until recently, FPGAs did not have the required combination of logic and routing resource to build a parallel processor. This barrier is mostly due to the large amount of cross routing required between the x and y registers at each pipeline stage. Additionally, the performance diminishes as the word width is increased becauseof the carry propagation times acrossthe adders. The Xilinx 4000E serieshas sufficient routing to realize a reasonably compact parallel CORDIC pipeline. Its Figure 4 Unrolled CORDIC processor dedicated carry logic provides acceptableperformance for The unrolled processorcan also be converted to a bit serial the adders. Figure 7 shows a 14 bit, 5 iteration pipelined design. Each adder subtractor is replacedby a serial adder- CORDIC processor that fits comfortably in half of a subtractor, separatedby w bit shift registers. The shift 4013E. That design,used for polar to Cartesiancoordinate registers are necessaryto extract the sign of the y or 2 transformations in a radar target generator,runs at 52 MHz element before the fast bits (lsbs) reach the next adder- (clock rate and datarate) in an XC4013E-2. subtractors. The right shifted cross terms are taken from fmed taps in the shift registers. Some method of sign extension for the shifted terms is required too. Figure 5 ADD/SUBTRACT ADD/SUBTRACT X--r= SH,FTER D=‘-” ”

Figure 5 two iterations of bit serial CORDIC pipeline in AtmeUlWC FPGA

198 ASNI , RNF m FI m

FOWX I mF0 “s”:.:=q-k X

I ASNO RS,

Fo m AS0 ,

Figure 6 detail of pipelined bit serial adder-subtractor in Atniel/NSC FPGA

PHS-COAR AOSUlZX

PHS,P -01 - EXLL

GCll* 01 . OA 13’0 1 FO34 3.01 FO34

01 F(11 FO2

01

I3.01 AOSUl4X :11.

FZSt33.01 GO1 F25[13.01 GO2

c .

Figure 7 section of parallel pipelined CORDIC can run at over 50 Megasamplesper second in a Xilinx XC4013E2

199 5. CONCLUSIONS The CORDIC algorithms presentedin this paper are well [8] Hsiao, S.F. and Delosme,J.M., “The CORDIC known in the researchand super-computing circles. It is, HouseholderAlgorithm,” Proceedingsof the 10th however, my experience that the majority of today’s Symposiumon ComputerArithmetic, pp. 256-263, 1991. hardware DSP designs are being done by engineers with [9] Hu, Y.H., and Naganathan,S., “A Novel little or no background in hardware efficient DSP Implementation of Chirp Z-Transformation Using a algorithms. The new DSP designersmust becomefamiliar CORDIC Processor,” IEEE Transactionson ASSP, Vol. with these algorithms and the techniques for implementing 38, pp. 352-354,199O. them in FPGAs in order to remain competitive. The [ 101Hu, Y.H., and Naganathan,S., “An Angle Recoding CORDIC algorithm is a powerful tool in the DSP toolbox. Method for CORDIC Algorithm Implementation”, IEEE This paper shows that tool is available for use in FPGA Transactionson Computers,Vol. 42, pp. 99-102, January based computing machines, which are the likely basis for 1993 the next generationDSP systems. [ 1l] Knapp, S. K., “XC4000E Edge triggered and Dual Port RAM Capability,” Xilinx application note, August 11, 6. REFERENCES 1995 [ 12]Marchesi,M., Orlandi, G., and Piazza,F., “Systolic [l] Ahmed, H. M., Delosme;J.M., and Morf, M., “Highly Circuit for Fast Hartley Transform,” Proceedings- IEEE Concurrent Computing Structurefor Matrix Arithmetic and International Symposiumon Circuits and Systems,Espoo, Signal Processing,” IEEE Comput. Mag., Vol. 15, 1982, Finland, June 1988,pp. 2685-2688 pp. 65-82. [13] Mazenc, C., Merrheim, X., and Muller, J.M., [2] Alfke, P., “Efficient Shift Registers,LFSR Counters, “Computing Functions Arccos and Arcsin Using and Long PseudoRandom Sequence Generators,” Xilinx CORDIC,” IEEE Transactionson Computers,Vol. 42, pp. application note, August, 1995. 118-122,1993. [3] Andraka, R. J., “Building a High PerformanceBit- [ 141Sibul, L.H. and Fogelsanger,A.L., “Application of Serial Processorin an FPGA,” Proceedingsof Design CoordinateRotation Algorithm to Singular Value SuperCon‘96, Jan 1996.~~5.1 - 5.21 Decomposition,” IEEE Int. Symp. Circuits and Systems, [4] Deprettere,E., Dewilde, P., and Udo, R., “Pipelined pp. 821-824,1984. CORDIC Architecture for Fast VLSI Filtering and Array [15] Volder, J., “Binary computation algorithms for Processing,” Proc. ICASSP’S4,1984, pp. 41.A.6.1- coordinaterotation and function generation,” 41.A.6.4 Report IAR-1 148 Aeroelectrics Group, June 1956. [5] Despain,A.M., “Fourier Transform Computations 1161Volder, J., “The CORDIC Trigonometric Computing Using CORDIC Iterations,” IEEE Transactionson Technique,” IRE Trans. Electronic Computing, Vol EC-8, Computers,Vo1.23, 1974, pp. 993-1001. ~~330-334Sept 1959. [6] Duh, W.J., and Wu, J.L., “Implementing the Discrete [ 171Walther, J.S., “A unified algorithm for elementary CosineTransform by Using CORDIC Techniques,” functions,” Spring Joint ComputerConf., 1971, proc., pp. Proceedingsthe International Symposiumon VLSI 379-385. Technology, Systemsand Applications, Taipei, Taiwan, [ 181Wang, S. and Piuri, V., “A Unified View of CORDIC 1989,pp. 281-285 ProcessorDesign”, Application Specific Processors,Edited [7] Duprat, J. and Muller, J.M., “The CORDIC Algorithm: by Earl E.Swartzlander,Jr., Ch. 5, pp. 121-160,Kluwer New Resultsfor Fast VLSI Implementation,” IEEE Academic Press,November 1996. Transactionson Computers,Vol. 42, pp. 168-178, 1993.

290