<<

electronics

Article Real-Time and High-Accuracy Arctangent Computation Using CORDIC and Fast Magnitude Estimation

Luca Pilato, Luca Fanucci and Sergio Saponara *

Dipartimento Ingegneria della Informazione, Università di Pisa, via G. Caruso 16, 56122 Pisa, Italy; [email protected] (L.P.); [email protected] (L.F.) * Correspondence: [email protected]; Tel.: +39-050-221-7511; Fax: +39-050-221-7522

Academic Editors: Christos Koulamas and Mihai Lazarescu Received: 11 February 2017; Accepted: 13 March 2017; Published: 16 March 2017

Abstract: This paper presents an improved VLSI (Very Large Scale of Integration) architecture for real-time and high-accuracy computation of with fixed-point arithmetic, particularly arctangent using CORDIC (Coordinate Digital Computer) and fast magnitude estimation. The standard CORDIC implementation suffers of a loss of accuracy when the magnitude of the input vector becomes small. Using a fast magnitude estimator before running the standard , a pre-processing magnification is implemented, shifting the input coordinates by a proper factor. The entire architecture does not use a multiplier, it uses only shift and add primitives as the original CORDIC, and it does not change the data path precision of the CORDIC core. A bit-true case study is presented showing a reduction of the maximum phase error from 414 LSB (angle error of 0.6355 rad) to 4 LSB (angle error of 0.0061 rad), with small overheads of complexity and speed. Implementation of the new architecture in 0.18 µm CMOS technology allows for real-time and low-power processing of CORDIC and arctangent, which are key functions in many embedded DSP systems. The proposed macrocell has been verified by integration in a system-on-chip, called SENSASIP (Sensor Application Specific Instruction-set ), for position sensor in automotive measurement applications.

Keywords: real-time; Digital Signal Processing (DSP); Embedded Systems; CORDIC (Coordinate Rotation Digital Computer); ASIC (Application Specific ); FPGA (Field Programmable ); IP (Intellectual Property); automotive sensors

1. Introduction The CORDIC (Coordinate Rotation Digital Computer) algorithm [1] is used in many embedded signal-processing applications, where data has to be processed with a trigonometric computation. Rotary encoders, positioning systems, kinematics computation, phase trackers, coordinate transformation and rotation are only a few examples [2–4]. One of those is the use of CORDIC as an arctangent computation block, from here referenced as atan-CORDIC. For example, it is at the core of the phase calculation of complex-envelope signals from the in-phase and quadrature components in communication systems, or for angular position measurement in automotive applications, or for phase computation in power systems control and in AC circuit analysis [5–9]. Although the standard CORDIC algorithm is well known in literature, its efficient implementation in terms of computation accuracy and with low overheads in terms of circuit complexity and power consumption is still an open issue [6–16]. To this aim, both full-custom (ASIC in [6]) and semi-custom (IP macrocells synthesizable on FPGA or standard-cell technologies in [8,9,16]) approaches have been

Electronics 2017, 6, 22; doi:10.3390/electronics6010022 www.mdpi.com/journal/electronics Electronics 2017, 6, 22 2 of 10 followed in the literature. Due to the need for flexibility in modern embedded systems, the design of a synthesizable IP is proposed in this work. The iterative algorithm of atan-CORDIC, even if implemented in an IPA (Infinite Precision Arithmetic), suffers from approximation errors due to the n finite iteration. Using the ArcTangent Radix (ATR) angles set, and using vectors compliant with the CORDIC convergence theorem [10], the phase error Eθ is bounded by the expression in Equation (1), which depends on the number of iterations n. This means that ideally we can reduce the angle computation error using more iterations of the CORDIC algorithm. −1 −n+1 Eθ ≤ tan (2 ) (1) The most accurate way to approximate IPA is by relying on floating-point computation, e.g., using the IEEE 754 double-precision floating-point format. However, when targeting real-time and low-complexity/low-power implementations, the design of fixed-point macrocells is more suited. Moreover, many applications using CORDIC, such as position sensor conditioning on-board vehicles, typically do not adopt floating-point units. For example, in automotive or industrial applications, at maximum 32-bit fixed-point processors are used [5,17]. The implementation of the fixed-point algorithm in a Finite Precision Arithmetic (FPA) implies an accuracy degradation of the phase calculation. The contribution to the errors in FPA is the quantization of the ATR angles set and the quantization on the x, y data path, where rounding or truncation action is taken after the shift operation in each step. Using a similar approach to [11,12], we can estimate a coarse upper-bound of the phase error in FPA, as shown in Equation (2). ε √ E ≤ sin−1(2−n+1 + 2(n − 1)) + nδ (2) θ r In Equation (2), n is the number of iterations, ε is the fixed-point precision of the coordinates’ data path where the quantization is taken at each loop, r is the magnitude of the starting point and δ is the angle data path precision. If the initial vector has a magnitude r great enough to neglect the second argument of the sin−1 function in Equation (2), then the error bound depends mainly on the number of iterations. Due to the quantization in the data paths of the fixed-point implementation, every loop of the algorithm has a truncation error that accumulates itself. This is the meaning of the terms (n − 1) and nδ in Equation (2). Particularly, the latter can lead to an increase of the error upper bound in each iteration. Instead, the 2−n + 1 term indicates the improvement of the error due to more iterations. An example of the dependence of the error upper bound on the number n, once the parameters ε, r and δ are fixed, will be shown in Section3. It must be noted that Equation (2) expresses an upper bound to the error, and not the real error (see the difference between the red lines and blue lines in Figures 6–9 in Section3). For vectors with a magnitude near zero, the error bound and the estimated error increase ideally up to π/2 due the sin−1 term in Equation (2). The idea proposed in this work is to have a fast estimation of the magnitude of the initial vector and to design a circuit able to magnify the coordinates to let the standard atan-CORDIC work in the area where the phase error is small. Hereafter, Section2 presents the fast magnitude estimator. Section3 discusses the achieved performance in terms of the computation accuracy of the new algorithm. Section4 discusses the performance of the IP macrocell implementation in CMOS (Complementary Metal Oxide Semiconductor) technology, and its integration in an automotive system-on-chip for sensor signal processing, called SENSASIP (Sensor Application Specific Instruction-set Processor). A comparison of the achieved results to the state-of-the-art in terms of processing accuracy, circuit complexity, power consumption and processing speed is also shown. Conclusions are drawn in Section5.

2. Fast Magnitude Estimator and Improved CORDIC Architecture The circuit for a fast magnitude estimation is based on the alphamax-betamin technique [13]. The magnitude estimation is a linear combination of the max and min absolute values of the input Electronics 2017, 6, 22 3 of 10 coordinates,ElectronicsElectronics 20172017 as, shown, 66,, 2222 in Equation (3). The choice of a and b values depends on the metric33 ofof 1010 that the approximation in Equation (3) wants to minimize. If the max absolute error of the entire input approximationapproximation inin EquationEquation (3)(3) wantswants toto minimize.minimize. IfIf thethe maxmax absoluteabsolute errorerror ofof thethe entireentire inputinput spacespace space is the selected metric, the optimum numerical values are a ≈ 0.9551 and b ≈ 0.4142. A coarse isis thethe selectedselected metric,metric, thethe optimumoptimum numericalnumerical valuesvalues areare aa ≈ ≈ 0.95510.9551 andand bb ≈ ≈ 0.4142.0.4142. AA coarsecoarse approximationapproximationapproximation for a forfor simplified aa simplifiedsimplified and andand fast fastfast implementation implementationimplementation can can bebe aaa === 11 1andand and bb ==b 0.5.0.5.= 0.5. ThisThis This meansmeans means thatthat only thatonly only an absanan block, absabs block,block, a comparison aa comparisoncomparison block blockblock and a andand virtual aa virtualvirtual shift plus shiftshift plusplus additionaddition have tohavehave be implementedtoto bebe implementedimplemented in hardware, inin whilehardware,hardware, avoiding whilewhile the use avoidingavoiding of multipliers. thethe useuse ofof Using multipliers.multipliers. the architecture UsingUsing thethe inarchitecturearchitecture Figure1, whichinin FigureFigure implements 1,1, whichwhich the calculationimplementsimplements in Equation thethe calculationcalculation (3), the estimatedinin EquationEquation magnitude (3),(3), thethe estimatedestimatedm* is always magnitudemagnitude higher mm* than* isis alwaysalways the real higherhigher one. than Forthan example, thethe realreal one.one. ForFor example,example, ifif thethe coordinatescoordinates| | = | | areare equal,equal, i.e.,i.e., xx  yy thethe resultresult∗ =ofof |EquationEquation| · (3)(3) isis if the coordinates are equal, i.e.,√x0 y0 the√ result of Equation00 00 (3) is m x0 3/2 while the real magnitude is m = |x | · 2 = m ∗ ·2 2/3. In the case of |x | = 0 then m = m∗ = |y |, mm** xx00 33//22 whilewhilereal thethe 0 realreal magnitudemagnitude isis mmrealreal  xx00  22 mm**022 22//33 .. InIn realthethe casecase ofof 0 while if |y0| = 0 then mreal = m∗ = |x0|. These considerations are shown in Figure2, where the light xx00 00thenthenmmrealreal mm** yy00 ,, whilewhile ifif yy 00thenthen mm mm** xx .. TheseThese considerationsconsiderations areare shownshown inin grey circumference arc includes the possible00 positionsrealreal of the points00 with the estimated magnitude m*. FigureFigure 2,2, wherewhere thethe lightlight greygrey circumferencecircumference arcarc includesincludes thethe possiblepossible positionspositions ofof thethe pointspoints withwith thethe Figure2 also shows the real position corresponding to the case |x | = |y | with a black point and two estimatedestimated magnitudemagnitude mm*.*. FigureFigure 22 alsoalso showsshows thethe realreal positionposition correspondingcorresponding0 0 toto thethe casecase xx  yy segments, drawn with a black line, connecting this point to the points corresponding to0 the0 cases00 of withwith aa blackblack pointpoint andand twotwo segments,segments, drawndrawn withwith aa blackblack line,line, connectingconnecting thisthis pointpoint toto thethe pointspoints mreal = m∗ = |x0| or|y0| . In Figure2 the two segments are always under the computed magnitude correspondingcorresponding toto thethe casescases ofofmm mm** xx ororyy .. InIn FigureFigure 22 thethe twotwo segmentssegments areare alwaysalways underunder arc. It should be noted that for allrealreal the input points00 00 belonging to the two segments, the corresponding magnitudethethe computedcomputedm*, estimated magnitudemagnitude with arc.arc. Equation ItIt shouldshould (3) bebe andnotednoted the thatthat circuit forfor allall in thethe Figure inputinput1 ,pointspoints is the belongingbelonging radius of toto the thethe light twotwo grey circumferencesegments,segments, the arcthe corresponding incorresponding Figure2. magnitudemagnitude mm*,*, estimatedestimated withwith EquationEquation (3)(3) andand thethe circuitcircuit inin FigureFigure 1,1, isis thethe radiusradius ofof thethe lightlight greygrey circumferencecircumference arcarc inin FigureFigure 2.2. m∗ = amax(|x |, |y |) + bmin(|x |, |y |) (3) mm**aamax(max(0xx00 ,, yy000 ))bbmin(min(xx000,, yy00 )0) (3)(3)

FigureFigureFigure 1. 1.1.The TheThe fast fastfast magnitude magnitudemagnitude estimatorestimator architecture.architecture. architecture.

FigureFigureFigure 2. 2.2.Segments SegmentsSegments for forfor a aa computedcomputed magnitudemagnitude magnitude ofof of mm*.*.m *.

IfIf thethe xx andand yy datadata pathspaths areare quantizedquantized onon thethe MM bitbit signed,signed, thethe magnitudemagnitude estimatorestimator inin FigureFigure 11 If the x and y data paths are quantized on the M bit signed, the magnitude estimator in Figure1 needsneeds MM bitbit unsignedunsigned toto representrepresent mm*.*. TheThe ideaidea isis toto segmentsegment thethe inputinput space,space, basedbased onon mm*.*. WeWe labellabel needs M bit unsigned to represent m*. The idea is to segment the input space, based on m*. We label asas M1M1 areaarea thethe setset ofof pointspoints withwith thethe MSBMSB (Most(Most SignificantSignificant Bit)Bit) ofof mm*,*, saysay mm*[M*[M‐‐1],1], equalequal toto one.one. IfIf aa as M1pointpoint area isis the notnot in setin thethe of M1M1 points areaarea then withthen wewe the looklook MSB forfor (Mostthethe mm*[M*[M Significant‐‐2]2] bitbit toto markmark Bit) itit of inin m thethe*, M2M2 say area,area,m*[M-1], andand soso equal on,on, untiluntil to one. If a pointthethe desireddesired is not level inlevel the ofof M1 segmentationsegmentation area then is weis reached.reached. look for FigureFigure the m33 *[M-2]showsshows the bitthe tofirstfirst mark areaarea itsegmentssegments in the M2ofof thethe area, inputinput and so on, until the desired level of segmentation is reached. Figure3 shows the first area segments of the input coordinates while Figure4 shows a simple circuit able to extract the logical information of the segmented area as a function of m*. Electronics 2017, 6, 22 4 of 10 Electronics 2017, 6, 22 4 of 10 ElectronicsElectronics2017, 20176, 22, 6, 22 4 of 104 of 10 coordinatescoordinates while while Figure Figure 4 4 shows shows a a simple simple circuit circuit able able to to extract extract the the logical logical information information of of the the segmented segmented areacoordinatesarea as as a a function function while of ofFigure m m*.*. 4 shows a simple circuit able to extract the logical information of the segmented area as a function of m*.

Figure 3. Segmentation of the input coordinates. FigureFigure 3. 3.Segmentation Segmentation of of the the input input coordinates. coordinates. Figure 3. Segmentation of the input coordinates.

FigureFigure 4. 4.Segmentation Segmentation flags flags circuit. circuit. Figure 4. Segmentation flags circuit. Figure 4. Segmentation flags circuit.

WhenWhen the Mkthe Mk area area for for a point a point is is known, known, a a shift shift factor factor (SFk) is is selected selected and and applied applied to the to the x0 xandand y When the Mk area for a point is known, a shift factor (SFk) is selected and applied to the x0 and0 0 When the Mk area for a point is known, a shift factor (SFk) is selected and applied to the x0 and inputy coordinates0 input coordinates in a way in a that way thethat magnifiedthe magnified coordinates coordinates reach reach the the M1 M1 area area or orthe the M2 M2area. area. In other In other y0 input coordinates in a way that the magnified coordinates reach the M1 area or the M2 area. In other words,y we input want coordinates to work in with a way points that the that magnified are always coordinates in the firstreach two the M1 segments, area or the far M2 from area. the In zeroother point. words,words,0 we we want want to to work work with with points points that that are are always always in in the the first first two two segments, segments, far far from from the the zero zero point. point. Equation (4) shows the doubling rule to correctly shift the inputs starting from the M3 area (SF1 = 0, Equationwords,Equation we (4)(4) want showsshows to work thethe doubling doubling with points rulerule that toto correctly correctlyare always shiftshift in the the first inputsinputs two startingstarting segments, from from far thethe from M3M3 the area area zero (SF1(SF1 point. = = 0, 0, SF2 =SF2EquationSF2 0). == Figure0).0). Figure(4)Figure shows5 shows 55 showsshows the thedoubling thethe architecture architecturearchitecture rule to correctly of ofof the thethe improved improved improvedshift the inputs atanatan atan-CORDIC‐‐CORDIC CORDICstarting from withwith with the thethe M3 the fastfast area fast magnitudemagnitude (SF1 magnitude = 0, estimatorestimatorSF2estimator = and0). andFigureand the thethe input 5 input inputshows magnification. magnification.magnification. the architecture A AA difference difference differenceof the improved comparedcompared compared atan toto‐ toCORDIC otherother other CORDICCORDIC CORDIC with the optimizationsoptimizations fast optimizations magnitude andand and CORDICCORDICestimatorCORDIC architectures architectures architecturesand the input in thein in magnification.the the literature literature literature [7 [7–9,14–16] – [7–9,14–16]A9, 14difference–16] is is is the the comparedthe maintenance maintenance to other of of of the CORDICthe the standard standard standard optimizations CORDIC CORDIC CORDIC core, core, and core, to whichtoCORDICto wewhichwhich add we wearchitectures a add low-complexityadd aa lowlow‐ ‐complexityincomplexity the literature pre-processing prepre‐‐processing processing[7–9,14–16] unit, unit,unit, is the working workingworking maintenance onon on thethe the ofinputinput input the ranges,standardranges, ranges, thusthus CORDIC thus minimizingminimizing minimizing core, tothe which overall we circuit add a complexity low‐complexity overhead. pre‐processing unit, working on the input ranges, thus minimizing the overallthe overall circuit circuit complexity complexity overhead. overhead. the overall circuit complexity overhead.SF SF SF k SF k x0 2 k ; y0 2 k : SFk  k  2 (4) x0 2 SF ; y0 2 SF : SFk  k  2 k ≥ 3 (4) x 2xSF2k ; yk ; 2ySF2k :k SF: SF=kk−22 kk ≥ ≥3 3 (4) 0 0 0 0 k k k ≥ 3 (4)

Figure 5. Improved atan‐CORDIC architecture. Figure 5. Improved atan‐CORDIC architecture. Figure 5. Improved atan‐CORDIC architecture. Figure 5. Improved atan-CORDIC architecture.

3. Computation Accuracy Evaluation The architecture proposed in Figure5 has been designed as a parametric HDL macrocell where the bit-width of the x, y and angle data paths can be customized at synthesis time, as well as the ElectronicsElectronics2017 2017, 6,, 22 6, 22 5 of 105 of 10 Electronics 2017, 6, 22 5 of 10 3. Computation Accuracy Evaluation number3. Computationn of CORDIC Accuracy iterations Evaluation and the number of input segments in Figure3. For the atan-CORDIC The architecture proposed in Figure 5 has been designed as a parametric HDL macrocell where core, a pipelined fully sequential architecture has been implemented. The choice of these parameters the bit‐Thewidth architecture of the x, yproposed and angle in dataFigure paths 5 has can been be designedcustomized as aat parametric synthesis HDLtime, macrocellas well as where the implies different optimization levels. An empirical parameter exploration is mandatory to characterize numberthe bit n‐width of CORDIC of the iterations x, y and angleand the data number paths of can input be segmentscustomized in Figureat synthesis 3. For time,the atan as ‐CORDICwell as the the sub-optimum architecture depending on the application requirements. core,number a pipelined n of CORDIC fully sequential iterations architecture and the number has been of input implemented. segments inThe Figure choice 3. ofFor these the atan parameters‐CORDIC impliesThecore, quantization adifferent pipelined optimization fully parameters sequential levels. in thearchitecture An fixed-point empirical has parameter implementationbeen implemented. exploration are The is the mandatory choice number of these ofto bitscharacterize parameters of the x and y coordinates,theimplies sub‐optimum different the number architecture optimization of bits depending levels. of the An angle onempirical the quantization, application parameter requirements. the exploration number of is mandatory iterations ofto thecharacterize algorithm and thetheThe number sub quantization‐optimum of segmented architecture parameters areas. depending in Givingthe fixed on the‐ pointthe angle application implementation resolution requirements. in are K bits,the number we know of bits that of the the useful x numberand y of coordinates,The iterations quantization comesthe number parameters from of the bits in relation the of fixedthe where angle‐point thequantization, implementation quantized the angle arenumber the on number the of ATRiterations of set bits goes ofof thethe under x and y coordinates, the number of bits of the angleK quantization,−1 the number of iterations of −thei halfalgorithm of its LSB and (Last the Significantnumber of segmented Bit), being areas.LSB =Givingπ/2 the. angle Overestimating resolution in the K bits, angle weα knowi with that2 , we canthe computealgorithm useful thenumber and useful the of number limititerations of of iterations comessegmented from with areas. the relation Giving where the angle the resolutionquantized inangle K bits, on thewe knowATR set that goesthe underuseful half number of its of LSB iterations (Last Significant comes from Bit), the being relation LSB where ⁄ 2 the .quantized Overestimating angle onthe the angle ATR set withgoes 2 under, we can half compute of its LSB the −(Last iuseful Significant limit of iterations Bit), being with LSB ⁄ 2 . Overestimating the angle 2 < LSB/2 ⇒ i > K − log2 π ≈ K − 1.65 (5) with 2, we can compute the useful limit of iterations with i 2  LSB / 2  i  K  log   K 1.65 (5) This means that the iteration indexi goes from 0 to K −2 1 (K iterations). As a case study, we present 2  LSB / 2  i  K  log2   K 1.65 (5) an atan-CORDICThis means with that an thex iteration, y data path index on goes 14-bit from signed, 0 to K − an1 (K angle iterations). data path As a on case 12-bit study, signed we present and n = 12 iterationsan atanThis‐ (iterationCORDIC means with indexthat anthe from x iteration, y data 0 to path index 11). on To goes 14 understand‐bit from signed, 0 to K an − the 1angle (K phase iterations). data error path As behavior,on a 12 case‐bit study, signed a first we and analysis present n = is done12an holding iterations atan‐CORDIC the (iteration x coordinate with index an x from (,x y =data 16 0 to inpath 11). the on To example) 14 understand‐bit signed, and the sweeping an phaseangle errordata the entirepathbehavior, on y 12 space. a‐bit first signed Figure analysis and6 shows is n = thedone absolute12 iterations holding error the (iteration in x coordinate index of ( fromx the = 16 computed 0 in to the 11). example) To phaseunderstand and with sweeping the the phase standard the errorentire CORDICbehavior, y space. Figurea with first analysis a6 shows blue trace. is done holding the x coordinate (x = 16 in the example) and sweeping the entire y space. Figure 6 shows Wethe see absolute that for error points in nearradians zero, of the the computed phase error phase increases with the significantly. standard CORDIC The max with error a blue is trace. ~0.10865 We rad seethe that absolute for points error near in radians zero, the of the phase computed error increases phase with significantly. the standard The CORDIC max error with is a ~0.10865 blue trace. rad We (about 71 LSB). Using the improved atan-CORDIC with nine area segments (max pre-magnification (aboutsee that 71 LSB). for points Using nearthe improved zero, the atanphase‐CORDIC error increases with nine significantly. area segments The (max max pre error‐magnification is ~0.10865 ofrad of 128), we get the result in Figure7. In Figure7, the blue trace is the absolute error. In Figure7, the max 128),(about we get71 LSB).the result Using in theFigure improved 7. In Figure atan‐CORDIC 7, the blue with trace nine is thearea absolute segments error. (max In preFigure‐magnification 7, the max of error is ~0.0035 rad (less than 3 LSB). The same analysis can be extended through the x coordinates, error128), is we ~0.0035 get the rad result (less in than Figure 3 LSB). 7. In TheFigure same 7, theanalysis blue tracecan beis theextended absolute through error. Inthe Figure x coordinates, 7, the max collectingcollectingerror theis the~0.0035 maximum maximum rad (less error than of of each each3 LSB). run. run. The Figure Figure same 8 shows 8analysis shows the canstandard the be standard extended atan‐CORDIC atan-CORDIC through phase the xerrors coordinates, phase with errors witha blue acollecting blue trace trace wherethe wheremaximum the maximum the error maximum of iseach ~0.6355 run. is ~0.6355 Figure rad (about 8 rad shows (about414 the LSB). standard 414 Figure LSB). atan 9 Figure shows‐CORDIC9 the shows improvedphase the errors improved atan with‐ atan-CORDICCORDICa blue trace errors errors where with with thea blue maximum a blue trace trace where is ~0.6355 where the maximum therad maximum(about is ~0.0052414 LSB). is ~0.0052rad Figure (less radthan 9 shows (less 4 LSB). thanthe The improved 4 red LSB). line The atan in red‐ lineFigures inCORDIC Figures 6–9 errorsis6– the9 is coarsewith the coarse a upper blue upper-boundtrace‐bound where error the evaluated error maximum evaluated with is ~0.0052Equation with rad Equation (2). (less than (2). 4 LSB). The red line in Figures 6–9 is the coarse upper‐bound error evaluated with Equation (2).

Figure 6. Standard atan‐CORDIC phase error of x = 16 and y. The red line is the coarse upper ‐bound Figure 6. Standard atan-CORDIC phase error of x = 16 and y. The red line is the coarse upper-bound errorFigure evaluated 6. Standard with Equationatan‐CORDIC (2); the phase blue errortrace ofis thex = standard16 and y .atan The‐ CORDICred line is phase the coarse error upper‐bound error evaluatederror evaluated with with Equation Equation (2); (2); the the blue blue trace trace is is the the standard standard atan-CORDICatan‐CORDIC phase phase error error.

Figure 7. Improved atan‐CORDIC phase error of x = 16 and y. The red line is the coarse upper ‐bound Figure 7. Improved atan‐CORDIC phase error of x = 16 and y. The red line is the coarse upper‐bound Figureerror 7. evaluatedImproved with atan-CORDIC Equation (2); phase the blue error trace of isx the= 16 standard and y. atan The‐CORDIC red line isphase the coarseerror upper-bound error evaluated with Equation (2); the blue trace is the standard atan‐CORDIC phase error error evaluated with Equation (2); the blue trace is the standard atan-CORDIC phase error.

ElectronicsElectronics2017 2017, 6, 22, 6, 22 6 of 106 of 10 Electronics 2017, 6, 22 6 of 10

Figure 8. Standard atan‐CORDIC max phase error on all the input space. The red line is the coarse FigureupperFigure 8. ‐Standard bound8. Standard error atan-CORDIC atanevaluated‐CORDIC with max max Equation phase phase error(2);error the on blue all the trace input input is the space. space. standard The The redatan red line‐CORDIC line is the is the coarsephase coarse upper-bounderrorupper ‐bound error error evaluated evaluated with with Equation Equation (2); (2); the the blue blue trace trace is the is standardthe standard atan-CORDIC atan‐CORDIC phase phase error. error

FigureFigure 9. Improved 9. Improved atan-CORDIC atan‐CORDIC max max phase phase errorerror on all the the input input space. space. The The red red line line is the is thecoarse coarse upper-boundupperFigure‐ bound9. Improved error error evaluated evaluatedatan‐CORDIC with with Equation maxEquation phase (2); (2); theerror the blue on blue traceall tracethe is input the is the standard space. standard The atan-CORDIC redatan line‐CORDIC is the phase coarsephase error. errorupper ‐bound error evaluated with Equation (2); the blue trace is the standard atan‐CORDIC phase error To evaluate further the performance of the proposed architecture, the maximum error on the To evaluate further the performance of the proposed architecture, the maximum error on the angleangle is analyzedTo is analyzedevaluate by furtherby varying varying the some someperformance quantization quantization of the parameters, proposed architecture, as as reported reported inthe inTables maximum Tables 1 and1 and error2. 2In. InTable on Table the 1 1 the lighttheangle light cells is analyzed cells show show the by the standardvarying standard some CORDIC CORDIC quantization maximum maximum parameters, error, error, fixed asfixed reported to to 414 414 LSB in LSB Tables (410 (410 LSB 1 andLSB in 2.in Table In Table Table2), 2), while 1 the greywhilethe light cells the cells grey represent show cells therepresent the standard improved the CORDICimproved CORDIC maximum CORDIC maximum error,maximum angle fixed error.angle to 414 error. From LSB (410From the analysisLSB the inanalysis Table of Tables 2),of 1 and2Tableswhile, it is the clear1 and grey that2, itcells is increasing clear represent that increasing the the number improved the of number MCORDIC areas of and/orM maximum areas increasingand/or angle increasing error. the bitFrom the width bitthe width ofanalysis the of xthe ofand y dataxTables paths and y 1 allows data and paths2, reducingit is allowsclear that the reducing error.increasing Instead, the error.the number if Instead, we compare of ifM we areas thecompare and/or peak errorthe increasing peak as a error function the asbit a width function of the of number the of of iterationsthex and number y datan (12 ofpaths iterations iterations allows n reducing in(12 Table iterations 1the vs. error. in nine Table Instead, iterations 1 vs. if ninewe in compare iterations Table2), the thein peakTable minimum error 2), the as is aminimum function obtained ofis for Tableobtainedthe2, whichnumber for uses ofTable iterations a number 2, which n (12 ofuses iterations iterations a number lowerin ofTable iterations than 1 vs. the lowernine maximum iterations than the 12 maximumin estimated Table 2), 12 the from estimated minimum Equation from is (5), withEquationobtainedK = 12. This(5),for Tablewith can be2,K justifiedwhich= 12. Thisuses by acan the number factbe justified that, of iterations as by discussed the lower fact in thanthat, Section theas maximumdiscussed1, the quantization 12in estimatedSection in 1, thefrom the data Equation (5), with K = 12. This can be justified by the fact that, as discussed in Section 1, the pathsquantization of the fixed-point in the data implementation paths of the fixed causes‐point a truncation implementation error that causes is accumulated a truncation inerror every that loop is of accumulatedquantization inin everythe data loop paths of the of algorithm. the fixed‐ point implementation causes a truncation error that is the algorithm. accumulated in every loop of the algorithm. Table 1. Max angle error expressed in LSB of the standard (std) vs. the proposed (our) CORDIC, Table 1. Max angle error expressed in LSB of the standard (std) vs. the proposed (our) CORDIC, K = 12, KTable = 12, 1.12 Max iterations angle from error 0 expressedto 11, bits forin LSBx and of y thefrom standard 10 to 14, (std) M areas vs. thefrom proposed 6 to 9. (our) CORDIC, 12 iterationsK = 12, 12 from iterations 0 to 11,from bits 0 to for 11,x bitsand fory from x and 10 y from to 14, 10 M to areas 14, M from areas 6 from to 9. 6 to 9. M Areas

6M Areas 7 8 9 M Areas 6 7 8 9 x, y Bits std6 our std 7our std 8our std our 9 10x, y Bits 414std 50our 414std 16our 414std 16our 414std 16our x, y Bits std our std our std our std our 1110 414 50 414 1316 414 1016 414 1016 101211 414414 5050 414414 13 16 414 41410 16414 414610 16 111312 414414 5050 414414 13 13 414 41410 10414 41446 10 12 1413 414414 5050 414414 13 13 414 41410 10414 4144 6 13 414 50 414 13 414 10 414 4 14 414 50 414 13 414 10 414 4 14 414 50 414 13 414 10 414 4

Electronics 2017, 6, 22 7 of 10

ElectronicsTable 2. 2017Max, 6, angle22 error expressed in LSB of the standard (std) vs. the proposed (our) CORDIC, K =7 12 of, 10 nine iterations from 0 to 8, bits for x and y from 10 to 14, M-areas from 6 to 9. Table 2. Max angle error expressed in LSB of the standard (std) vs. the proposed (our) CORDIC, K = 12, nine iterations from 0 to 8, bits for x and y fromM 10 areas to 14, M‐areas from 6 to 9. M 6areas 7 8 9

x, y Bits std6 our std7 our8 std our9 std our x10, y Bits 410std 45our 410std our 15std 410our 15std 410our 15 1011 410410 4545 410410 15 10 410 41015 10 410 41015 10 1112 410410 4545 410410 10 9 410 41010 9410 41010 8 1213 410410 4545 410410 9 9410 4109 9410 4108 7 1314 410410 4545 410410 9 9410 4109 9410 4107 7 14 410 45 410 9 410 9 410 7 Indeed, in Table3 we have reported, for both the standard and the proposed CORDIC, the maximumIndeed, angle in Table error, 3 expressed we have reported, in LSB, for as both a function the standard of the and number the proposed of iterations. CORDIC, For the example, maximum the minimumangle error, of the expressed maximum in LSB, angle as errora function is achieved of the number for the proposedof iterations. CORDIC For example, with seven the minimum iterations. Thisof behaviorthe maximum can beangle theoretically error is achieved justified, for since the proposed the maximum CORDIC angle with error seven is achievediterations. for This low magnitudebehavior can values, be theoretically i.e., low values justified, of the since radius the maximum r in Equation angle error (2). Figure is achieved 10 plots for low the magnitude error upper boundvalues, as ai.e., function low values of the of number the radius of iterations,r in Equation using (2). Equation Figure 10 (2) plots with thee =error1, δ upper= π/ 2boundK−1, K as= a12 , 1⁄ 2 12 andfunction values of for the the number radius ofr o iterations, f 8, 16, 32 usingand 64Equation. In Figure (2) with 10, similarly , to what, is observed, and in values Table 3, for the radius 8, 16, 32 64. In Figure 10, similarly to what is observed in Table 3, there is a there is a first part where the upper-bound error decreases and a second part where the upper-bound first part where the upper‐bound error decreases and a second part where the upper‐bound error error increases. increases. Table 3. Max angle error expressed in LSB of the standard vs. proposed CORDIC (K = 12, 12 bits for x andTabley data 3. paths,Max angle six M error areas). expressed in LSB of the standard vs. proposed CORDIC (K = 12, 12 bits for x and y data paths, six M areas).

Max. Number ofMax. Iterations Number of Iterations

22 3 3 44 55 66 7 8 89 910 1011 1112 12 StandardStandard CORDIC CORDIC512 512 302 302 252252 333333 374374 394394 404 404409 409412 412 413 413414 414 ProposedProposed CORDIC CORDIC512 512 302 302 160160 8282 4343 3030 40 40 45 45 48 4849 4950 50

FigureFigure 10. 10.Upper-bound Upper‐bound error error behaviorbehavior as function of of the the iteration iteration number. number.

4. VLSI4. VLSI Implementation Implementation and and Characterization Characterization TheThe improved improved CORDIC CORDIC architecture architecture proposedproposed in Section 2 has been designed designed in in VHDL VHDL as as a a parametricparametric macrocell macrocell that that can can be be integrated integrated inin anyany system-on-chip.system‐on‐chip. With With reference reference to to the the macrocell macrocell configurationconfiguration discussed discussed in in Section Section3 ,3, a a synthesis synthesis in in the the standard-cellstandard‐cell 180 nm nm 1.8 1.8 V V CMOS CMOS technology technology from AMS ag has been carried out. The macrocell design is characterized by a low circuit complexity from AMS ag has been carried out. The macrocell design is characterized by a low circuit complexity and low power consumption: about 1700 equivalent logic gates, with a max power consumption

Electronics 2017, 6, 22 8 of 10 and low power consumption: about 1700 equivalent logic gates, with a max power consumption (static and dynamic power) of about 0.45 mW when working with a clock frequency of 30 MHz. The overhead due to the innovative parts of this work (fast magnitude estimation and input coordinates magnification) vs. a standard CORDIC implementation in the same technology is limited to 24% in terms of equivalent logic gates and −36% in terms of max working frequency. The processing time of the atan function is n·Tclock, i.e., 0.4 µs when n = 12 with a clock of 30 MHz. As a possible improvement, a pipeline register can be inserted in Figure5 before the atan-CORDIC core. This way, the complexity increases by about 10%, resulting in 1900 equivalent logic gates, while the working frequency can be increased up to 46 MHz and the power consumption is 0.77 mW. The processing time is (n + 1)·Tclock; with n = 12 iterations of the atan function, the processing time is 0.28 µs. The comparison of the proposed solution vs. the standard state-of-the-art solution, implemented in the same 180 nm 1.8 V technology, is summarized in Table4. Table4 shows also the comparison with other known CORDIC computation architectures [6,8]. In [8], an embedded solution is proposed using a 32-bit ART7TDMI-S core (18,000 equivalent gates) plus 32 kB of RAM (Random Access Memory) and 512 kB of EEPROM (Electrically Erasable Programmable Read Only Memory). With a clock of 60 MHz, the solution in [8] implements an atan evaluation in 35 µs and an angle error of 2.42 × 10−5 rad with an optimized algorithm. At 60 MHz with 180 nm 1.8 V CMOS technology, the ARM7-TDMIs has a power consumption of about 24 mW. Although [8] is optimized to achieve a lower error, 2.42 × 10−5 rad vs. 6.14 × 103 rad in our implementation, with K = 12, the proposed design outperforms the design in [8] in terms of the reduced processing time, by a factor ×124, the reduced power consumption, by a factor ×31, and the reduced circuit complexity, by a factor ×9.5 . To achieve a similar calculation accuracy as in [8], our IP has been re-synthetized with K = 20. In such a case, the processing time is still within 1 µs and the complexity in the logic gates is still below 4 kgates.

Table 4. Performance of the proposed CORDIC solution vs. standard solution.

IP macrocell Standard CORDIC This work [6][8] 18,000 + 32kB RAM Equivalent gates 1400 1700 (1900 pipelined) 8000 + 512 kB E2PROM Clock frequency, MHz 46 30 (46 pipelined) 10 60 Max. angle error 0.6355 rad 6.1 × 10−3 rad N/A (K = 16, n = 20) 2.42 × 10−5 rad Processing latency, µs 0.28 0.4 (0.28 pipelined) N/A 35 Power consumption, mW 0.57 0.45 (0.77 pipelined) N/A 24

The 180 nm technology node and the target clock frequency have been selected following the constraints of automotive applications where, to face a harsh operating environment (over-voltage, over-current, temperature up to 150 ◦C), a 180 nm 1.8 V technology ensures higher robustness vs. scaled sub-100 nm CMOS technologies operating with a voltage supply of 1 V. The considered 180 nm technology is automotive-qualified and has a density of 118 kgates/mm2 [18]. The N-MOS supplied at 1.8 V has a voltage threshold of 0.35 mV and a current capability of 600 µA/µm2. The P-MOS supplied at 1.8 V has a voltage threshold of 0.42 mV and a current capability of 260 µA/µm2. The proposed improved CORDIC IP has been integrated in an application-specific processor, called SENSASIP [5], to improve the digital signal processing capability of a 32-bit CORTEX-M0 core. Particularly, the SENSASIP chip is used to implement the in [3,19–21] in real time for the sensor signal processing of a 3D Hall sensor and of an inductive position sensor in X-by-wire automotive applications. In these specific applications, the real-time application requirements were met already with a clock frequency for SENSASIP of 8 MHz. Hence, with a 46 MHz clock frequency, the improved CORDIC architecture is about six times faster than required. At 8 MHz, the power consumption of the proposed CORDIC architecture is below 0.1 mW. It should be noted that the CORDIC computation can be done via software on the 32-bit Cortex core, as is done in [15] for the same class of magnetic sensor position signal conditioning. However, as proved in [5], the use of the proposed CORDIC macrocell integrated on-chip allows for savings in Electronics 2017, 6, 22 9 of 10 power consumption up to 75%. It should be noted that the proposed macrocell can be also integrated in FPGA technology, as in [16], in case the application of interest requires an FPGA implementation rather than a system-on-chip one.

5. Conclusions In this work, we propose an improved architecture for arctangent computation using the CORDIC algorithm. A fast magnitude estimator works with a pre-processing magnification to rescale the input coordinates to an optimum range as a function of the estimated magnitude. After the scaling, the inputs are processed by the standard CORDIC algorithm implemented without changing any data path precision. The improved architecture is generic and can be applied to any atan-CORDIC implementation. The architecture improvement is achieved with a low-circuit complexity since it does not require any special operation, but only comparison, shift and add operations. Several case studies have been analyzed, considering a resolution for x and y from 10 to 14 bits, a resolution of 12 bits for the angle, a number of iterations from two to 12, and from six to nine area segments. The bit-true analysis shows that the maximum absolute error in the atan-CORDIC computation improves by at least two orders of magnitude, e.g., in Table1 from 414 LSB to 4 LSB. The proposed macrocell has been integrated in a system-on-chip, called SENSASIP, for automotive sensor signal processing. When implemented in an automotive-qualified 180 nm 1.8 V CMOS technology, two configurations are available. The first one has a circuit complexity of 1700 logic gates, a max clock frequency of 30 MHz, a CORDIC processing time of 0.4 µs and a power consumption of 0.45 mW. The second configuration has an extra pipeline stage, and has a circuit complexity of 1900 logic gates, a max clock frequency of 46 MHz, a CORDIC processing time of 0.28 µs and a power consumption of 0.77 mW. These values are compared well to state-of-the-art works in the same 180 nm technology node. The 46 MHz clock value is six times higher than what is required (8 MHz) by real-time automotive sensor signal conditioning applications. In these operating conditions, the power consumption of the proposed solution is below 0.1 mW.

Acknowledgments: This project was partially supported by the projects FP7 EU Athenis3D and the University of Pisa PRA E-TEAM. Author Contributions: All authors contributed to the research activity and to the manuscript writing and review. Conflicts of Interest: The authors declare no conflict of interest.

References

1. Volder, J.E. The CORDIC Trigonometric Computing Technique. IRE Trans. Electr. Comput. 1959, EC-8, 330–334. [CrossRef] 2. Meher, P.K.; Valls, J.; Juang, T.B.; Sridharan, K.; Maharatna, K. 50 Years of CORDIC: Algorithms, Architectures, and Applications. IEEE Trans. Circuit Syst. I 2009, 56, 1893–1907. [CrossRef] 3. Sisto, A.; Pilato, L.; Serventi, R.; Fanucci, L. A design platform for flexible programmable DSP for automotive sensor conditioning. In Proceedings of the Nordic Circuits and Systems Conference (NORCAS): NORCHIP & International Symposium on System-on-Chip (SoC), Oslo, Norway, 26–28 October 2015; pp. 1–4. 4. Zheng, D.; Zhang, S.; Zhang, Y.; Fan, C. Application of CORDIC in capacitive rotary encoder signal demodulation. In Proceedings of the 2012 8th IEEE International Symposium on Instrumentation and Control Technology (ISICT), London, UK, 11–13 July 2012; pp. 61–65. 5. Sisto, A.; Pilato, L.; Serventi, R.; Saponara, S.; Fanucci, L. Application specific instruction set processor for sensor conditioning in automotive applications. Microprocess. Microsyst. 2016, 47, 375–384. [CrossRef] 6. Timmermann, D.; Hahn, H.; Hosticka, B.; Schmidt, G. A Programmable CORDIC Chip for Digital Signal Processing Applications. IEEE J. Solid-State Circuits 1991, 26, 1317–1321. [CrossRef] 7. Rajan, S.; Wang, S.; Inkol, R.; Joyal, A. Efficient approximation for the arctangent function. IEEE Signal . Mag. 2006, 23, 108–111. [CrossRef] Electronics 2017, 6, 22 10 of 10

8. Ukil, A.; Shah, V.H.; Deck, B. Fast computation of arctangent functions for embedded applications: A comparative analysis. In Proceedings of the 2011 IEEE International Symposium on Industrial Electronics (ISIE), Gdansk, Poland, 27–30 June 2011; pp. 1206–1211. 9. Revathi, P.; Rao, M.V.N.; Locharla, G.R. Architecture design and FPGA implementation of CORDIC algorithm for fingerprint recognition applications. Procedia Technol. 2012, 6, 371–378. [CrossRef] 10. Walther, J.S. A unified algorithm for elementary functions. In Proceedings of the Spring Joint Computer Conference AFIPS ‘71 (Spring), Atlantic City, NJ, USA, 18–20 May 1971; pp. 379–385. 11. Hu, Y.H. The quantization effects of the CORDIC algorithm. IEEE Trans. Signal Process. 1992, 40, 834–844. [CrossRef] 12. Kota, K.; Cavallaro, J.R. A normalization scheme to reduce numerical errors in inverse tangent computations on a fixed-point CORDIC processor. In Proceedings of the 1992 IEEE International Symposium on Circuits and Systems, San Diego, CA, USA, 10–13 May 1992; Volume 1, pp. 244–247. 13. Fanucci, L.; Rovini, M. A Low-Complexity and High-Resolution Algorithm for the Magnitude Approximation of Complex Numbers. IEICE Trans. Fundam. Electron. Commun. Comput. Sci. 2002, E85-A, 1766–1769. 14. Muller, J.-M. The CORDIC Algorithm. In Elementary Functions; Birkhäuser Boston: Cambridge, MA, USA, 2016; pp. 165–184. 15. Glascott-Jones, A. Optimising Efficiency using Hardware Co-Processing for the CORDIC Algorithm. In Advanced Microsystems for Automotive Applications; Springer: Heidelberg, Germany, 2010; pp. 325–335. 16. Taylor, A.P. How to use the CORDIC algorithm in your FPGA design. XCell J. 2012, 79, 50–55. 17. Anoop, C.V.; Betta, C. Comparative Study of Fixed-Point and Floating-Point Code for a Fixed-Point Micro. In Proceedings of the dSPACE User Conference 2012, Bangalore, India, 14 September 2012. 18. Minixhofer, R.; Feilchenfeld, N.; Knaipp, M.; Röhrer, G.; Park, J.M.; Zierak, M.; Enichlmair, H.; Levy, M.; Loeffler, B.; Hershberger, D.; et al. A 120 V 180 nm High Voltage CMOS smart power technology for System-on-chip integration. In Proceedings of the 2010 22nd International Symposium on Power Semiconductor Devices & IC’s (ISPSD), Hiroshima, Japan, 6–10 June 2010. 19. Bretschneider, J.; Wilde, A.; Schneider, P.; Hohe, H.-P.; Koehler, U. Design of multidimensional magnetic position sensor systems based on hallinone technology. IEEE Int. Symp. Ind. Electron. (ISIE) 2010, 422–427. [CrossRef] 20. Sarti, L.; Sisto, A.; Pilato, L.; di Piro, L.; Fanucci, L. Platform based design of 3D Hall sensor conditioning electronics. In Proceedings of the 2015 11th Conference on IEEE Conference on Ph.D. Research in Micr. and Electronics (PRIME), Glasgow, UK, 29 June–2 July 2015; pp. 200–203. 21. Kim, S.; Abbasizadeh, H.; Ali, I.; Kim, H.; Cho, S.; Pu, Y.; Yoo, S.-S.; Lee, M.; Hwang, K.C.; Yang, Y.; et al. An Inductive 2-D Position Detection IC With 99.8% Accuracy for Automotive EMR Gear Control System. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2017.[CrossRef]

© 2017 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).