
Copyright © 2019 American Scientific Publishers. All rights reserved. Printed in the United States of America.
Journal of Low Power Electronics, Vol. 15, 338–350, 2019

A Novel Coordinate Rotation Digital Computer Method for Energy and Latency Saving by Trigonometric Operations Spatial Locality Principle

Giuseppe Visalli Independent Researcher, Via G. Guerzoni, 1 Padova, 35126, Italy

(Received: 23 June 2019; Accepted: 7 October 2019)

In this work, we propose an approximate and energy-efficient CORDIC method, based on a spatial locality principle of trigonometric functions derived from benchmark profiling. More than 50% of successive sine/cosine computation requests have an absolute phase difference of at most ten degrees. Consequently, this property suggests an optimized circuit implementation, either iterative or a succession of micro-rotation modules, where the last CORDIC requires fewer iterations, reducing the latency and the total energy budget at the same precision of two separate and independent instances. Thus, this simple design strategy allows significant area and energy savings in general-purpose VLSI architectures, and it also introduces dramatic optimizations in application-specific embedded systems used in signal processing and radio-frequency communication. In this contribution, we introduce the method, the hardware overhead and the energy budget per single cycle. Simulation results show that the total energy saving in the considered benchmarks is 40% in pipelined and iterative general-purpose CORDIC. Furthermore, our application-specific systems (fast Fourier transform and digital oscillators for radio-frequency down conversion) show remarkable cycle savings when the successive sine/cosine computation requests are more than 70%. Finally, in this work, we extend the proposed approach to any phase difference less than 26.56°, used as a variable for the number of angle rotations of the second CORDIC.

Keywords: CORDIC, Digital Arithmetic, Digital Signal Processing Chip, Energy Estimation, Fixed-Point Arithmetic, Very-Large Scale Integration Systems.

J. Low Power Electron. 2019, Vol. 15, No. 4, 1546-1998/2019/15/338/013, doi:10.1166/jolpe.2019.1619

1. INTRODUCTION AND BACKGROUND

Due to the rapid advances made in very-large-scale integration (VLSI) technology and digital signal processing (DSP), the design for real-time applications requires the computation of trigonometric functions as a primary task in image processing, signal processing, and telecommunication. The CORDIC (Coordinate Rotation Digital Computer) algorithm [1] is the most commonly used method to compute trigonometric, logarithmic, exponential and hyperbolic functions, real and complex multiplication, eigenvalue estimation, square root, division, singular value decomposition and many more [2]. Walther [3] extended the initial implementation, which was later refined by many others [4]. The CORDIC algorithm can be used in image and signal processing applications [5, 6], telecommunication (e.g., radio baseband processors) [7–9], 3D graphics manipulation [10] and the Internet of things (IoT) [11].

Moreover, this algorithm has applications in special-purpose systems for real-time signal processing [12, 13]. These applications require the computation of trigonometric functions with reduced hardware, to save area and energy and to achieve the minimum latency. The CORDIC algorithm calculates the sine and cosine of any angle by successive rotations until the total residual angle is very close to zero. As its principal advantage, this algorithm does not require signed integer multiplications; it performs the computation by successive shift-and-add operations. However, its most important drawback is the total latency, which is proportional to the number of iterations (N). For this reason, many approaches improve the conventional CORDIC, reducing the required cycles at the same result precision. Control CORDIC [14] and the Angle Re-coding (AR) CORDIC [15] reduce the number of iterations; the latter requires an average of N/3 and a maximum of N/2 iterations. Additionally, since the single CORDIC rotation is not a pure rotation but a rotation-extension, the number of rotations for any angle should be constant and independent of the operand so that a constant scale factor is used. Hence, the use of redundant number representations or carry-save adders (CSA) [16, 17] in addition/subtraction reduces the total latency, allowing a higher […] [18, 19].

In this work, we simulated several representative benchmarks in the fields of signal processing, computer arithmetic and telecommunication for general-purpose and application-specific embedded cores. We trace the accesses to two successive trigonometric (sine/cosine) functions when the angles differ by at most 10°. Our profile shows that 50% of the total successive sine/cosine computations fall into this region; this observation is equivalent to the spatial locality principle used in the context of cache memory design [20]. Low-power CORDIC implementations generally reduce the total energy at the price of increased latency, as depicted in [21]. Other research in the area of VLSI signal processing considers the precomputation of each CORDIC direction, accelerating every shift-and-add stage [22]. However, these timing optimizations are far from our approach, which excludes some unnecessary rotations under trigonometric operand constraints, removing a large portion of the required delay. Thus, our proposed approach is a reconfigurable hardware that reduces the energy dissipation and latency by removing unnecessary angle rotations when two successive sine/cosine requests fall into an angle difference of at most ten degrees. Otherwise, the hardware dissipation is almost aligned to the current state of the art and implementation technology. Our proposed hardware does not dissipate more energy than known […]; a miss of the locality principle does not cost additional energy or latency. Our idea is to compute the second CORDIC algorithm without resetting the rotation results from the previous instance. In this way, the hardware starts from a non-reset value and the next rotation begins from an angle slightly larger than our reference offset value of 10°. This approach can be used in any CORDIC implementation based on either a redundant or a non-redundant binary representation, and with the single or double rotation-extension algorithm, the latter introduced in Ref. [23].

The spatial locality of sine/cosine angles allows significant hardware savings in application-specific VLSI. We explored two representative benchmarks: the fast Fourier transform (FFT) processors [24–27] and the implementation of an accurate numerically controlled oscillator (NCO) used in spectroscopy [28, 29] or in the modulated wave down-conversion section of a typical digital telecommunication receiver [30]. We consider a discrete number of reference approaches illustrated in [31, 32], estimating the hardware costs (area and power) and the maximum operating frequency.

We translated the reference and the new hardware into a 90 nm CMOS technology, evaluating the energy efficiency and the quality of result in our considered benchmarks. The rest of the paper is organized as follows. Section 2 is an overview of the CORDIC algorithms. Subsequently, Section 3 enhances the conventional CORDIC to perform a second instance with a minimal incremental angle. Further, we formalize our approach in terms of algorithm, VLSI implementation and relative error measurement in Section 4, where we consider both general-purpose and application-specific problems. Section 5 summarizes our results in terms of energy savings in our considered benchmarks and the measured precision of the second CORDIC instance. Finally, Section 6 concludes our work.

2. THE CORDIC ALGORITHM

This section introduces the conventional CORDIC and its specializations for reducing the total iterations responsible for the full latency. This algorithm is a succession of phase rotations, translated into hardware by an iterative VLSI implementation or by a succession of micro-rotation modules; the latter is preferred to achieve a high throughput. However, in this paper, we focus on CORDIC algorithms with variable and unknown angles of rotation, a property used in the implementation of low-power DSP and […].

2.1. Conventional CORDIC

The CORDIC algorithm is a succession of rotations by angles of the form $e_n = \arctan 2^{-n}$, where n is an integer [4]. A read-only memory (ROM) stores these angles, measured in degrees, in a fixed-precision format. Hereafter, we consider a radix-2 32-bit fixed-point CORDIC algorithm. The trigonometric functions sine and cosine use a rotation starting from a pair of fixed-point numbers $(x_0, y_0)$, and we obtain the point $(x_{n+1}, y_{n+1})$ by a rotation of angle $d_n \cdot e_n$, where $d_n \in \{+1, -1\}$, so we have:

$$\begin{pmatrix} x_{n+1} \\ y_{n+1} \end{pmatrix} = \cos e_n \begin{pmatrix} 1 & -d_n \cdot 2^{-n} \\ d_n \cdot 2^{-n} & 1 \end{pmatrix} \cdot \begin{pmatrix} x_n \\ y_n \end{pmatrix} \tag{1}$$

Here, the CORDIC algorithm uses the iterative equation for $z_n$ as follows:

$$z_{n+1} = z_n - d_n \cdot \arctan 2^{-n} \tag{2}$$

This new variable is the accumulated partial sum of the step angles; its sign sets the rotation direction $d_n$ (+1 or −1). In hardware, the factor $\cos e_n$ is not applied at each step, so every iteration is a rotation-extension and the final result is scaled by a constant K. If $z_0$ is less than or equal to $\sum_{k=0}^{\infty} \arctan 2^{-k} = 1.743$ rad, we have:

$$\lim_{n \to \infty} \begin{pmatrix} x_n \\ y_n \\ z_n \end{pmatrix} = K \cdot \begin{pmatrix} x_0 \cdot \cos z_0 - y_0 \cdot \sin z_0 \\ x_0 \cdot \sin z_0 + y_0 \cdot \cos z_0 \\ 0 \end{pmatrix} \tag{3}$$

The scale factor K in Eq. (3) is equal to $\prod_{n=0}^{\infty} 1/\cos(d_n \cdot \arctan 2^{-n}) = \prod_{n=0}^{\infty} \sqrt{1 + 2^{-2n}} = 1.646$. We calculate the sine and cosine functions starting from $x_0 = 1/K = 0.607252$, $y_0 = 0$ and $z_0$ equal to our wished angle. A software profiling revealed that the maximum error percentage (E%) of the sine and cosine computation by conventional CORDIC is 3.33%, when the angles are 1° and 89° respectively.

2.2. Control CORDIC

Control CORDIC [14] reduces the total iterations at the same precision of the former implementation. The variable $z_k$ in the conventional CORDIC does not converge monotonically. Consequently, this algorithm introduces divergent rotations, similar to the overshoot of the underdamped response of a second-order filter. Control CORDIC removes the divergent rotations, with a monotonic residual that eliminates the overshoot, without sacrificing the wished precision. Thus, the idea is to restrict the rotation direction to the set (+1, 0) according to the formula below:

$$d_n = \begin{cases} +1 & \text{when } z_n \ge e_n \\ 0 & \text{otherwise} \end{cases} \tag{4}$$

Here, Eq. (4) applies when the input angle is positive. Similarly, for negative angles the rotation direction is restricted to (−1, 0).

2.3. Double Rotation CORDIC

The earlier redundant CORDIC computes the sine and cosine by using a few most significant digits of the remaining angle, represented in a redundant binary format, to determine the direction of the rotation [33]. Furthermore, this CORDIC evaluates the addition/subtraction in a redundant number system. However, since a non-rotation-extension could occur, the CORDIC algorithm must calculate a scale factor that depends on the operand, increasing the total latency and the amount of hardware. The Double Rotation CORDIC [23] uses the redundant binary representation with digits in the set {−1, 0, +1}. This algorithm performs a combination of two rotation-extensions for each type of step: positive rotation, negative rotation or non-rotation. This approach makes the scaling factor independent of the operand. We start with the operands $x_0$, $y_0$, $z_0$, and the scale factor (used as in the conventional algorithm) is $K_1 = \prod_{n=1}^{N} (1 + 2^{-2n-2})$. This CORDIC algorithm proceeds according to the following iterative equations:

$$\begin{aligned} x_n &= x_{n-1} - q_n \cdot 2^{-n} \cdot y_{n-1} - p_n \cdot 2^{-2n-2} \cdot x_{n-1} \\ y_n &= y_{n-1} + q_n \cdot 2^{-n} \cdot x_{n-1} - p_n \cdot 2^{-2n-2} \cdot y_{n-1} \\ z_n &= z_{n-1} - q_n \cdot 2 \arctan 2^{-n-1} \end{aligned} \tag{5}$$

We represent $z_n$ by a redundant binary fraction; this algorithm considers the most significant digits at the jth binary position and computes two rotation-extension coefficients, namely $q_n$ and $p_n$, as follows:

$$(q_n, p_n) = \begin{cases} (-1, 1) & [z_{n-1}^{j-1} z_{n-1}^{j} z_{n-1}^{j+1}] < 0 \\ (0, -1) & [z_{n-1}^{j-1} z_{n-1}^{j} z_{n-1}^{j+1}] = 0 \\ (+1, +1) & [z_{n-1}^{j-1} z_{n-1}^{j} z_{n-1}^{j+1}] > 0 \end{cases} \tag{6}$$

This algorithm is particularly efficient for the computation of the sine/cosine function without the use of a scaling factor during the computation, so the result is immediately available.

2.4. Correcting Rotation CORDIC

This approach, also introduced in Ref. [23], again uses the binary redundant representation, and the sequence of rotation-extensions is identical to that of the conventional CORDIC. This algorithm performs an extra rotation-extension every mth step, where m can be an arbitrary integer (the correcting period). Since the number of rotation-extensions performed for each angle is constant, the scaling factor is constant and independent of the operand. Equations (1) and (2) are still valid; every m steps the new equations are:

$$\begin{aligned} \hat{x}_n &= x_{n-1} - q_n \cdot 2^{-n} \cdot y_{n-1} \\ \hat{y}_n &= y_{n-1} + q_n \cdot 2^{-n} \cdot x_{n-1} \\ \hat{z}_n &= z_{n-1} - q_n \cdot \arctan 2^{-n} \\ x_n &= \hat{x}_n - \hat{q}_n \cdot 2^{-n} \cdot \hat{y}_n \\ y_n &= \hat{y}_n + \hat{q}_n \cdot 2^{-n} \cdot \hat{x}_n \\ z_n &= \hat{z}_n - \hat{q}_n \cdot \arctan 2^{-n} \end{aligned} \tag{7}$$

Again, the variables $z_n$ and $\hat{z}_n$ are represented by a redundant binary fraction whose most significant digit is located at the jth binary position. Additionally, $x_n$, $y_n$, $\hat{x}_n$, $\hat{y}_n$ are in the redundant binary format. The directions $q_n$ and $\hat{q}_n$ follow these equations:

$$q_n = \begin{cases} -1 & [z_{n-1}^{j-1} z_{n-1}^{j} z_{n-1}^{j+1}] < 0 \\ +1 & [z_{n-1}^{j-1} z_{n-1}^{j} z_{n-1}^{j+1}] \ge 0 \end{cases} \tag{8}$$

$$\hat{q}_n = \begin{cases} -1 & [\hat{z}_{n-1}^{j-1} \hat{z}_{n-1}^{j} \hat{z}_{n-1}^{j+1}] < 0 \\ +1 & [\hat{z}_{n-1}^{j-1} \hat{z}_{n-1}^{j} \hat{z}_{n-1}^{j+1}] \ge 0 \end{cases} \tag{9}$$

The choice of the correcting period influences the algorithm performance in terms of accuracy and precision. In particular, when m is small there is a large number of extra rotations, but the number of digits needed to determine the direction of the rotations becomes smaller, and vice versa.

2.5. Angle Rotation CORDIC

This angle rotation (AR) mode [15] reduces the total iterations in the CORDIC trigonometric computation, only in forward rotation mode (also known as vector rotation mode).


The former implementation is a software-based algorithm, although there are many attempts to translate this approach into hardware. The idea is to realize the wished rotation angle by a combination of a much-reduced set of angles. In this work, without loss of generality, we consider the Conventional, Control, Double Rotation, and Correcting Rotation CORDIC only. Our ultimate goal is to save both energy and latency by removing the unnecessary angle rotations. This approach is transversal to any VLSI implementation of known CORDIC algorithms based on angle rotations: higher radix and/or redundant number logic are still applicable because they solve the single CORDIC instance, whereas our solution addresses energy and latency saving in consecutive instances, under the constraint that the incremental angle is less than 26.56°.

3. A NOVEL ENERGY-EFFICIENT 2-PASS CORDIC METHOD

We start from the consideration that signal processing and multimedia benchmarks potentially admit a form of spatial locality when requiring the computation of basic trigonometric functions. If a particular angle is referenced by a trigonometric function at a particular time, then it is likely that a nearby angle will be referenced by the same function soon. We collected some interesting benchmarks, mostly for the embedded market, by using MiBench [34] and many others in the area of telecommunication (e.g., the soft-output Viterbi algorithm, SOVA [35], on 240 channel symbols) and signal processing, the latter as representative benchmarks for application-specific embedded processors and DSP. Next, we profiled a radar processing algorithm from NASA open-source software [36]: the Turbulence Detection Algorithm (TDA). This software, written in Python, estimates the eddy dissipation rate (a measure of turbulence) from Doppler weather radar data. We measure the occurrence of two consecutive accesses to a trigonometric function by the difference between the current (α + Δ) and the previous (α) angles. We fix the maximum of Δ = 10° as a hypothetical limit of our analysis. The results are in Table I; we see an average greater than 50%, which justifies our approach. Figure 1 shows the average rate of successive sine and cosine computations, varying the parameter Δ, obtained by simulating our considered benchmarks.

Table I. Successive sine/cosine when angles differ by at most ±10°.

Algorithm   Type                   Successive/total   Ratio (%)
FFT         Signal processing      45/61              73.77
NCO         Signal processing      512/512            100.00
Rsynth      Office                 6142/19540         31.43
SOVA        Soft-output Viterbi    9600/15360         62.50
Typeset     Consumer               340/469            72.49
Lame        MP3 player             201/1328           15.13
Fbench      Floating-point bench   320/351            91.16
Ffbench     FFT 2D                 560/1281           43.71
TDA         Radar                  7560/7560          100.00
Average     –                      2808/5162          54.41

Fig. 1. Average rate of successive sine/cos versus Δ [°].

We propose in this work a novel 2-pass CORDIC method that computes the sine/cosine function of the angle α + Δ when the same function is known for the angle α. Hereafter we suppose, without any limitation, that Δ is at most ±10°. Consequently, we define the past and current CORDIC instances as CORDIC-1 and CORDIC-2 respectively. We start from Eqs. (1) and (2); CORDIC-1 has N iterations. In our proposed approach, we do not clear $x_N$, $y_N$ and $z_N$, forcing a further rotation with M additional iterations. Additionally, the angle rotation $e_{N+1}$ is fixed according to the following rule:

$$e_{N+1} = \arctan 2^{-2} = 14.03^\circ \tag{10}$$

Here, 14.03° is the smallest angle of the form $\arctan 2^{-n}$, in 32-bit fixed-point, that is greater than our considered limit of 10 degrees. In this way, we optimize the angle dynamic according to the realistic results of the profile in Table I. In particular, Figure 2 shows the cycle savings varying the Δ range. We save one clock cycle if Δ is less than 26.56 degrees and greater than 14.03 degrees. When Δ is from 7.12 to 14.03 degrees we save two CORDIC iterations, and three cycles when the same parameter is from 3.57 to 7.12 degrees. We save four iterations when the absolute phase difference is from 1.78 to 3.57 degrees; finally, when Δ is more than 0.89 and less than 1.78 degrees we save five iterations.

Fig. 2. CORDIC-2 cycle savings varying the range Δ ∈ max–min.

As an example, we fix α = 30° and Δ = 2°. Equation (11) represents a 32-bit fixed-point CORDIC-1 with N = 8 main iterations and CORDIC-2 with M = 4 cycles.

$$\begin{aligned} 30 &\approx +45.02 - 26.58 + 14.04 - 7.13 + 3.58 + 1.79 - 0.89 + 0.45 = 30.28 \\ 32 &\approx 30.28 + 3.57 - 1.79 - 0.89 + 0.44 = 31.61 \end{aligned} \tag{11}$$
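The angle sequence of Eq. (11) can be reproduced with a short sketch. It is only an illustration under stated assumptions: it tracks the angle path alone (not x and y), uses floating point instead of the 32-bit fixed-point format, and picks the CORDIC-2 starting angle as the smallest arctan 2^-n that is still at least |Δ|, a rule inferred from the Δ ranges listed for Figure 2. The function names (`angle_table`, `start_index`, `rotate`, `two_pass`) are hypothetical, and the `control` flag applies the direction rule of Eq. (4).

```python
import math

def angle_table(count):
    """Micro-rotation angles e_n = arctan(2^-n), in degrees, for n = 0..count-1."""
    return [math.degrees(math.atan(2.0 ** -n)) for n in range(count)]

def start_index(delta_deg, count):
    """Largest n whose angle e_n is still >= |delta| (assumed CORDIC-2 starting rule)."""
    start = 0
    for n, e in enumerate(angle_table(count)):
        if e >= abs(delta_deg):
            start = n
    return start

def rotate(target, estimate, angles, control=False):
    """Move `estimate` toward `target` (degrees) with one pass over `angles`.

    control=False: conventional rule, d_n = +1 or -1 from the sign of the residual.
    control=True:  control rule, a step is taken only if it does not overshoot.
    Returns the new estimate and the signed steps that were applied.
    """
    steps = []
    for e in angles:
        residual = target - estimate
        if control:
            d = 1 if residual >= e else (-1 if residual <= -e else 0)
        else:
            d = 1 if residual >= 0 else -1
        if d != 0:
            estimate += d * e
            steps.append(d * e)
    return estimate, steps

def two_pass(alpha, delta, n_angles=8, control=False):
    """CORDIC-1 on alpha, then CORDIC-2 on alpha + delta without resetting the estimate."""
    table = angle_table(n_angles)
    est1, steps1 = rotate(alpha, 0.0, table, control)
    first = start_index(delta, n_angles)
    est2, steps2 = rotate(alpha + delta, est1, table[first:], control)
    return est1, steps1, est2, steps2

# Conventional and control variants for alpha = 30 degrees, delta = 2 degrees.
est1, steps1, est2, steps2 = two_pass(30.0, 2.0)
c_est1, c_steps1, c_est2, c_steps2 = two_pass(30.0, 2.0, control=True)
print(round(est1, 2), round(est2, 2), len(steps1), len(steps2))
print(round(c_est1, 2), round(c_est2, 2), len(c_steps1), len(c_steps2))
```

With exact floating-point angles the conventional path ends near 30.26° and 31.60° after 8 and 4 steps, with the same sign pattern as Eq. (11); the small differences from 30.28 and 31.61 come from the fixed-point angle values used in the paper. The control variant applies 4 and 2 non-zero rotations.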


Fig. 3. Error analysis of 2-pass CORDIC when M = 4, 8, 12 for sine and cosine functions.

With N = 16 main iterations and M = 8, the conventional CORDIC final estimations are 29.97° and 32.07° respectively. Finally, Eq. (12) shows the control CORDIC with the same reference angle when N = 4 and M = 2.

$$\begin{aligned} 30 &\approx +26.57 + 1.79 + 0.89 + 0.44 = 29.69 \\ 32 &\approx 29.69 + 1.79 + 0.44 = 31.92 \end{aligned} \tag{12}$$

In this example, control CORDIC halves the total latency for the first and second trigonometric computations. We analyzed the parameter E%, simulating the 2-pass conventional CORDIC when α = 1° (sine) and α = 89° (cosine), varying Δ from 1° to 10° in three different scenarios: M = 4, 8 and 12 iterations with N = 16. We plot the reference value of E% = 3.33, obtained from the worst scenario of the conventional CORDIC. The results in Figure 3 show how our approach limits the worst E% to 4% in several realistic scenarios.

4. VLSI DESIGN AND IMPLEMENTATION

The total latency represents the most important drawback of CORDIC, which requires several clock cycles to compute the correct result [37]. For this reason, a VLSI implementation considers, in addition to the iterative implementation, a pipelined (unrolled) architecture made of several single add-and-shift units, as depicted in Figure 4 (left). We implement the additional logic, also shown in Figure 4 (right), which evaluates the difference between α and α + Δ and compares the result to a particular threshold: ten degrees. The comparator output (C2_valid) validates the CORDIC-2 result. In the unrolled hardware architecture in 90 nm CMOS, this additional logic represents at most 10% of the area and 12% of the energy dissipation of the single rotation module.

Fig. 4. Unrolled CORDIC architecture: Single rotation sub-system (left), CORDIC-2 additional logic (right).

We propose a 2-pass CORDIC whose architecture is composed of a sequence of N elementary units for the CORDIC-1 implementation and a successive sequence of M identical subsystems to accomplish the CORDIC-2 instance. This approach, shown here with a conventional CORDIC, can be extended to any implementation. Figure 5 illustrates the proposed approximate conventional CORDIC as a succession of angle estimations when α = 28° and Δ = +3°, N = 12 and M = 4. Control CORDIC achieves the same precision, but with fewer iterations. Table II shows the implementation parameters, area and energy per single cycle, of the proposed CORDICs: conventional, control, double rotation and correcting rotation, all in the 90 nm CMOS process. An unrolled CORDIC-1 dissipates a multiple (N) of the area and energy shown in that table; CORDIC-2 dissipates a multiple of M plus the comparator cost. Thus, we use these numbers to compute the energy savings, by a complete regression of our considered benchmarks.

Fig. 5. The proposed method as a sequence of two CORDIC algorithms: Estimated angle versus current iteration.

Table II. Approximate CORDICs' area (∗) and energy (per cycle) dissipation in 90 nm CMOS.

CORDIC                Area (µm²)   Energy (pJ)
Conventional          129.19       1.18
Control               102.70       1.07
Double rotation       179.00       1.31
Correcting rotation   131.41       1.19

Note: (∗) Area measured in µm², energy measured in pJ.

Additionally, we plot some figures of merit (FoM) of the energy-delay product. We measured a first FoM, depicted in Figure 6, when the clock rate is 4.0 ns, comparing the conventional CORDIC when N = 16 with the 2-pass CORDIC, varying the CORDIC-2 pipes from 1 to 16. In this first scenario, the access to two consecutive trigonometric functions requires 2N pipes; instead, the 2-pass CORDIC requires N + M pipes. The first conclusion is that our approach is efficient in any scenario. Secondly, we plot another FoM, illustrated in Figure 7, varying the clock rate from 4.0 ns to 6.0 ns. We consider four different values for M; when M = 16 we obtain the performance of the conventional CORDIC. The values in Figure 7 are lower than those illustrated in Figure 6 since the delay path is shorter than the clock rate. As an example, when the clock rate is 4.0 ns the combinational logic has a delay path of 3.83 ns. When the clock rate is 4.2 ns the delay path of the combinational logic is 4.03 ns. In general, the delay path of the combinational logic is around 95% of the associated clock rate.

4.1. A Low-Complex Numerical Controlled Oscillator

Modulated-signal down conversion is a typical application in which successive trigonometric requests differ by a small value. Here, we use the CORDIC algorithm to address the problem of the memoryless generation of a basic trigonometric function. Therefore, our purpose is the computation of a dense distribution of numerical samples of the following function:

$$Y(t) = \sin(2\pi \cdot f_0 \cdot t + \varphi_0) \tag{13}$$

Here, in (13), the wave has frequency $f_0$ measured in Hertz, t is the continuous time measured in seconds and $\varphi_0$ is the initial angle in radians. Hereafter, without loss of generality, we assume the initial angle is zero. The sampling frequency is $f_c = 1/T_c$, the wave period is $T_0 = 1/f_0$, and the discrete time n changes the previous equation to a discrete model:

$$Y(n) = \sin\left(2\pi \cdot \frac{T_c}{T_0} \cdot n\right) \tag{14}$$

We call normalized frequency the ratio $f_{norm} = T_c/T_0$. The numerically controlled oscillator requires a simple rotation module used in an iterative architecture with two shifters and adders, as shown in Figure 8. Let N be the order of the combination shifters; our NCO performs the rotation of an angle Δ = arctan $2^{-N}$. This hardware receives the sine and cosine of the angle α and performs the same computation for the angle α + Δ. The sign wire is a logical one when the angle is in the interval {0, π/2} and zero in {π, 3π/2}. Finally, the produced wave is:

$$Y(n) = \sin(2\pi \cdot f_{norm} \cdot n) \tag{15}$$

This theory suggests that the parameter N has a role in the supported precision and the maximum operating frequency. Generally, $T_c$ is the main clock of the CORDIC hardware, hence:

$$N = -\log_2 \tan\left(2\pi \cdot \frac{T_c}{T_0}\right) \tag{16}$$

Finally, Figure 9 shows the algebraic relation between the required parameter N and the supported frequency in MHz (up) and kHz (down), for a gate-level implementation in 90 nm CMOS with an operating frequency of 600 MHz ($T_c$ ≈ 1.6 ns). The proposed low-complexity CORDIC enables NCO applications from low-frequency signals (medical, spectroscopy, etc.) up to low-range radio-frequency modulations.
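The relation between N, the clock period and the generated frequency in Eq. (16) can be checked numerically with a minimal sketch. It assumes floating-point arithmetic and a starting point (x, y) = (1, 0); the function names `nco_order`, `supported_frequency` and `nco_run` are hypothetical and not taken from the paper.

```python
import math

def nco_order(f0_hz, tc_s):
    """Eq. (16): N = -log2(tan(2*pi*Tc/T0)), with T0 = 1/f0 (N may be non-integer)."""
    return -math.log2(math.tan(2.0 * math.pi * tc_s * f0_hz))

def supported_frequency(n, tc_s):
    """Inverse of Eq. (16): the frequency whose phase step per clock is atan(2^-n)."""
    return math.atan(2.0 ** -n) / (2.0 * math.pi * tc_s)

def nco_run(n, count):
    """Rotate (x, y) = (1, 0) `count` times by atan(2^-n) using shift-and-add only."""
    x, y = 1.0, 0.0
    samples = []
    for _ in range(count):
        # Rotation-extension: no cos() correction is applied, so the amplitude
        # grows by sqrt(1 + 2^(-2n)) at every step (negligible for large n).
        x, y = x - y * 2.0 ** -n, y + x * 2.0 ** -n
        samples.append((x, y))
    return samples

tc = 1.6e-9                       # clock period quoted in the text (about 600 MHz)
f0 = supported_frequency(10, tc)  # frequency generated with N = 10
print(f0, nco_order(f0, tc))
```

Under these assumptions, N = 10 and a 1.6 ns clock correspond to a wave of roughly 97 kHz, and after k steps the phase of (x, y) equals k·arctan 2^-N, which is the angle step that N fixes.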


Fig. 6. Energy-delay product FoM, comparing the conventional CORDIC with the proposed 2-pass CORDIC (90 nm CMOS, 4.0 ns clock).

Fig. 8. The NCO that uses a single CORDIC stage.

Fig. 7. The energy-delay product FoM, when the 2-pass CORDIC has M pipes for CORDIC-2 computation, varying the clock delay (90 nm CMOS).

Fig. 9. NCO supported frequencies versus used iterations (that fix the angle step) in MHz (up) and kHz (down).


Fig. 12. CORDIC-2 sine and cosine measured precision when the angle is random with uniform distribution.

4.2. Two-Pass CORDIC Algorithm for FFT Processors

In this subsection, we illustrate the main advantages of using our 2-pass CORDIC for accelerating the trigonometric function computations. We consider an important application-specific hardware: the FFT processor. The FFT processor computes the discrete Fourier transform (DFT) by successive stages of intermediate calculations, starting from the input sequence x(n) and organizing the data using the well-known "butterfly" circuit. Let x(n) be the input data sequence with $N_p$ points, where n is the discrete time-step. The DFT result X(k) is, therefore:

$$X(k) = \sum_{n=0}^{N_p-1} x(n) \cdot W_{N_p}^{n \cdot k}, \quad k = 0, 1, \ldots, N_p - 1, \qquad W_{N_p}^{n \cdot k} = e^{-j \cdot 2\pi/N_p \cdot n \cdot k} \tag{17}$$

The product $x(n) \cdot W_{N_p}^{n \cdot k}$ is the FFT key operator, where the term $W_{N_p}$ is called the "twiddle factor." The constant $j = \sqrt{-1}$ and $k/N_p$ is the normalized frequency. Generally, $N_p$ is a power of two, so the FFT algorithm requires $\log_2(N_p)$ stages and each stage contains $N_p/2$ butterfly operations. An FFT processor receives the input data x(n); it organizes these numbers in $N_p/2$ butterfly inputs. Since the twiddle factor is periodic with period $N_p/2$, we can use a unique rotation to calculate two intermediate outputs:

$$W_{N_p}^{(n+N_p/2) \cdot k} = W_{N_p}^{n \cdot k} \cdot e^{-j \cdot \pi \cdot k} \tag{18}$$

It is easy to demonstrate that X(k) and X(k + $N_p$/2) require the computation of a subset of the DFT odd and even input members, namely O(k) and E(k) respectively, according to the formula below:

$$\begin{aligned} X(k) &= E(k) + e^{-j \cdot 2\pi/N_p \cdot k} \cdot O(k) \\ X\left(k + \frac{N_p}{2}\right) &= E(k) - e^{-j \cdot 2\pi/N_p \cdot k} \cdot O(k) \end{aligned} \tag{19}$$

Thus, the butterfly receives the terms E(k) and O(k) and outputs the numbers X(k) and X(k + $N_p$/2). The role of the CORDIC algorithm is to compute the product $O(k) \cdot e^{-j \cdot 2\pi/N_p \cdot k}$, which is a rotation, where the term O(k) is generally a complex number.

5. EXPERIMENTAL RESULTS

This section summarizes our basic results, obtained by accurately coding the proposed CORDIC in the ANSI C language for error performance evaluations and by translating the reconfigurable hardware, described in RTL Verilog, into 90 nm CMOS technology using Synopsys Design Compiler. In particular, we operate with a power supply of 0.9 Volt and an operating temperature of 40 Celsius. The clock rate in 90 nm CMOS is 4.0 ns and the input/output port delays are 0.8 ns each for the master clock. We estimate the energy consumption by pre-layout simulations using Mentor Modelsim as a Verilog simulator, estimating the power by the energy models of the target standard cell library. We simulated the gate-level netlist using SDF-file back annotation. Our tests generate random patterns for the CORDIC gate-level netlist to collect the switching activity used for the first level of activity-driven power optimization and, secondly, for the final power estimation. Also, to verify the functionality of the proposed CORDIC circuits, the simulator can report the exact activity of internal nodes, including glitching. Power analysis at this level is reliable because, in 90 nm CMOS technology, the estimation of delays by wire load models (WLM) is still valid; it provides an accurate power estimation before the power measurement on the post-layout netlist with real delays and parasitics. Figure 10 illustrates our flow; we perform a first logical synthesis with Synopsys Design Compiler and we simulate with the typical operating condition. We operate with 32-bit fixed-point precision and the CORDIC implementations illustrated in this paper (conventional, control, double rotation, correcting rotation).

Fig. 10. The used flow for area and energy estimations.

Table IV. Conventional CORDIC-1/2 computation when α = 30° and Δ = +10°.

CORDIC-1 order   ε_α    CORDIC-2 order   ε_(α+Δ)
N = 4            4.65   M = 2            6.53
                        M = 4            1.16
                        M = 6            0.19
                        M = 8            0.07
N = 8            0.28   M = 2            2.81
                        M = 4            1.02
                        M = 6            0.32
                        M = 8            0.02
N = 12           0.03   M = 2            3.20
                        M = 4            1.33
                        M = 6            0.01
                        M = 8            0.10

5.1. Analysis of Precision and Accuracy

In this section, we measure the accuracy and precision of the CORDIC-2 result by software simulations of our proposed method in different implementations. In particular, we issue the sine and cosine functions varying the input angle in a 32-bit fixed-point implementation, measuring the bias between the true value and the result. Additionally, we also measure the systematic errors (precision) by random simulations, using a very accurate ANSI C code. In this last scenario, we issue 512 random trigonometric function requests, with uniform distribution, measuring the relative frequency of any measured bias. Figure 11 shows our results, measuring the accuracy of our double-instance CORDIC hardware; we found that the absolute error for sine and cosine is on average 0.022 and 0.035 respectively. Finally, Figure 12 illustrates the measured precision, simulating some hundreds of random angles in the interval (0.1, 1.0). We found for the sine algorithm a mean value of 0.152 and a standard deviation of 0.026. Instead, the cosine introduces an error centered at 0.380 with a standard deviation of 0.0131. This last result means that sine and cosine introduce a fixed bias of 0.15 and 0.38 respectively, which could be corrected to improve the result accuracy.

Fig. 11. CORDIC-2 sine and cosine measured accuracy when α = 22°, N = 32, M = 8.

Table III. Energy savings of the considered benchmarks (Conventional CORDIC, 90 nm CMOS, N = 16, M = 4, Δ = 10°).

Algorithm   Initial energy (nJ)   Optimized energy (nJ)   Savings (%)
FFT         0.91                  0.44                    51.64
NCO         9.72                  2.43                    75.00
Rsynth      371.24                283.71                  23.57
SOVA        291.84                155.03                  46.87
Typeset     8.89                  4.04                    54.55
Lame        45.46                 43.46                   4.39
Fbench      0.05                  0.04                    20.00
Ffbench     0.02                  0.01                    50.00
TDA         143.61                35.92                   74.98
Total       871.74                525.08                  39.76
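The savings column of Table III can be approximated with a simple cycle-count model. This is a sketch under an explicit assumption that is not stated in the paper: every request that hits the locality window costs M cycles, every other request costs N cycles, and the energy per cycle is the same in both cases (the comparator overhead is ignored). The function name `two_pass_savings` is hypothetical.

```python
def two_pass_savings(successive, total, n_iter=16, m_iter=4):
    """Fractional energy saving of the 2-pass scheme, assuming equal energy per cycle.

    successive: requests within the locality window (served by CORDIC-2, m_iter cycles)
    total:      all trigonometric requests (a miss costs the full n_iter cycles)
    """
    baseline_cycles = total * n_iter
    optimized_cycles = successive * m_iter + (total - successive) * n_iter
    return 1.0 - optimized_cycles / baseline_cycles

# Request counts taken from Table I.
sova = two_pass_savings(9600, 15360)
nco = two_pass_savings(512, 512)
print(round(100 * sova, 2), round(100 * nco, 2))
```

For the SOVA counts (9600 of 15360 requests) the model gives about 46.9%, and for the NCO (512 of 512) it gives 75%, in line with the 46.87% and 75.00% listed in Table III. Other rows may deviate because the model does not include the comparator cost or the exact accounting used in the measurements.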

5.2. Energy Dissipation of Selected Benchmarks
We add to the system description the logic for computing the difference between the two successive input angles by a binary comparator. We use this information to run the benchmarks reported in Table I to estimate the energy savings in these CMOS technologies. When our algorithm estimates this difference, it starts a new instance from a fixed-point tangent aligned to Eq. (10). Energy savings average 40%, as depicted in Table III, for a total of 346.6 nJ, an impressive result based on the prior assumption of consecutive trigonometric function requests with two similar operands.

Table IV shows the absolute angle error achieved by CORDIC-1 and CORDIC-2, respectively, varying the parameters N and M; boldface blue marks the wished iteration, where the accuracy of the results is similar in both CORDIC-1 and CORDIC-2. The last absolute error, when N = 12 and M = 8, is worse than with the choice M = 6, since the next two iterations add −0.22° and +0.11° to 40.01°. Finally, Figure 13 shows our achievements, simulating some benchmarks and estimating the energy savings when the phase-difference parameter ranges from one to 10 degrees.

Fig. 13. Average energy saving varying the phase-difference parameter (conventional CORDIC, 90 nm CMOS, N = 16, M = 4).

5.3. The NCO Area and Energy Dissipation
The NCO, as depicted in Figure 8, has been translated into a 90 nm CMOS technology library with Tc = 16 ns, varying the oscillation frequency 1/T0 as illustrated in Figure 9, following the relation between T0 and N shown in (16). Figure 14 reports the two waveforms, sine and cosine, when N = 10. The area and energy estimations in 90 nm are 30 μm² and 0.17 pJ respectively. The hardware is very small compared to an iterative CORDIC architecture that uses a barrel shifter, which is very hungry in hardware complexity. The proposed NCO, instead, does not use shifters, resulting in a collection of a couple of adders and 64 D-type flip-flops.

Fig. 14. The proposed NCO's output when N = 10.

The waves in Figure 14 come from a post-processing algorithm that receives the 32-bit variables x (cosine of θ) and y

Table V. Twiddle angle generation table for 128-point radix-2 FFT using CORDIC algorithm.

Twiddle angle by stage:

| Butterfly counter | Stage 0 | Stage 1 | Stage 2 | Stage 3 | Stage 4 |
|---|---|---|---|---|---|
| 000000 | 0 | 0 | 0 | 0 | 0 |
| 000001 | π/64 | 2π/64 | 4π/64 | 8π/64 | 16π/64 |
| 000010 | 2π/64 | 4π/64 | 8π/64 | 16π/64 | 32π/64 |
| 000011 | 3π/64 | 6π/64 | 12π/64 | 24π/64 | 48π/64 |
| 000100 | 4π/64 | 8π/64 | 16π/64 | 32π/64 | 0 |
| 000101 | 5π/64 | 10π/64 | 20π/64 | 40π/64 | 16π/64 |
| 000110 | 6π/64 | 12π/64 | 24π/64 | 48π/64 | 32π/64 |
| 000111 | 7π/64 | 14π/64 | 28π/64 | 56π/64 | 48π/64 |
| 001000 | 8π/64 | 16π/64 | 32π/64 | 0 | 0 |
| 001001 | 9π/64 | 18π/64 | 36π/64 | 8π/64 | 16π/64 |
| 001010 | 10π/64 | 20π/64 | 40π/64 | 16π/64 | 32π/64 |
| 001011 | 11π/64 | 22π/64 | 44π/64 | 24π/64 | 48π/64 |
| 001100 | 12π/64 | 24π/64 | 48π/64 | 32π/64 | 0 |
| 001101 | 13π/64 | 26π/64 | 52π/64 | 40π/64 | 16π/64 |
| ... | ... | ... | ... | ... | ... |
| 111111 | 63π/64 | 62π/64 | 60π/64 | 56π/64 | 48π/64 |
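The entries of Table V follow a regular pattern that can be written compactly. The sketch below is an observation inferred from the listed rows, not a formula stated by the author: the multiplier of π/64 equals the butterfly counter shifted left by the stage index, taken modulo 64. The function names are hypothetical.

```python
import math

NP = 128                 # FFT length used in Table V
ANGLE_UNITS = NP // 2    # twiddle angles are multiples of pi/64

def twiddle_multiplier(counter: int, stage: int) -> int:
    """Return m such that the twiddle angle is m * pi / 64.

    Pattern inferred from the rows of Table V (assumption).
    """
    return (counter << stage) % ANGLE_UNITS

def twiddle_angle(counter: int, stage: int) -> float:
    """Twiddle angle in radians for a butterfly counter and stage."""
    return twiddle_multiplier(counter, stage) * math.pi / ANGLE_UNITS

def stage_step_degrees(stage: int) -> float:
    """Angle increment between consecutive counters in a stage,
    ignoring the wrap-around at multiples of 64."""
    return (1 << stage) * 180.0 / ANGLE_UNITS

if __name__ == "__main__":
    for stage in range(5):
        print(stage, stage_step_degrees(stage))
```

With this pattern, consecutive twiddle angles in stages 0 to 3 differ by 2.8125°, 5.625°, 11.25°, and 22.5°, all below the 26.56° bound, while stage 4 steps by 45°. This appears consistent with the zero savings reported for Stage 4 in Table VI.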


Table VI. Cycle (C) and energy (E) savings computing the 128-point radix-2 FFT using the two-step CORDIC algorithm in 90 nm CMOS.

C/E savings Butterfly counter Stage 0 Stage 1 Stage 2 Stage 3 Stage 4

000000 0 0 0 0 0 000001 4/4.72 pJ 3/3.54 pJ 2/2.36 pJ 1/1.18 pJ 0 000010 4/4.72 pJ 3/3.54 pJ 2/2.36 pJ 1/1.18 pJ 0 000011 4/4.72 pJ 3/3.54 pJ 2/2.36 pJ 1/1.18 pJ 0 000100 4/4.72 pJ 3/3.54 pJ 2/2.36 pJ 1/1.18 pJ 0 000101 4/4.72 pJ 3/3.54 pJ 2/2.36 pJ 1/1.18 pJ 0 000110 4/4.72 pJ 3/3.54 pJ 2/2.36 pJ 1/1.18 pJ 0 000111 4/4.72 pJ 3/3.54 pJ 2/2.36 pJ 1/1.18 pJ 0 001000 4/4.72 pJ 3/3.54 pJ 2/2.36 pJ 1/1.18 pJ 0 001001 4/4.72 pJ 3/3.54 pJ 2/2.36 pJ 1/1.18 pJ 0 001010 4/4.72 pJ 3/3.54 pJ 2/2.36 pJ 1/1.18 pJ 0 001011 4/4.72 pJ 3/3.54 pJ 2/2.36 pJ 1/1.18 pJ 0 001100 4/4.72 pJ 3/3.54 pJ 2/2.36 pJ 1/1.18 pJ 0 001101 4/4.72 pJ 3/3.54 pJ 2/2.36 pJ 1/1.18 pJ 0 111111 4/4.72 pJ 3/3.54 pJ 2/2.36 pJ 1/1.18 pJ 0 Total 252/297.3 pJ 189/223,0 pJ 126/148.6 pJ 63/74.3 pJ 0
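The totals in the last row of Table VI can be checked with a few lines of arithmetic. The sketch below assumes that all 63 butterfly counters from 000001 to 111111 carry the same per-stage savings (the elided rows are taken to match the listed ones) and that each saved cycle costs 1.18 pJ, as read from the Stage 3 column; both are assumptions drawn from the table's layout, and the names are hypothetical.

```python
CYCLES_SAVED = [4, 3, 2, 1, 0]   # saved cycles per butterfly, Stage 0..Stage 4
ENERGY_PER_CYCLE_PJ = 1.18       # assumed from the Stage 3 entry (1 cycle / 1.18 pJ)
ROWS_WITH_SAVINGS = 63           # counters 000001..111111 (assumed uniform)

def stage_totals():
    """Return a list of (total cycles, total energy in pJ) per stage."""
    totals = []
    for cycles in CYCLES_SAVED:
        total_cycles = cycles * ROWS_WITH_SAVINGS
        totals.append((total_cycles, total_cycles * ENERGY_PER_CYCLE_PJ))
    return totals

if __name__ == "__main__":
    for stage, (cycles, energy) in enumerate(stage_totals()):
        print(f"Stage {stage}: {cycles} cycles, {energy:.2f} pJ")
```

Under these assumptions the per-stage cycle totals are 252, 189, 126, 63, and 0, matching the table, and the energy totals agree with the listed values up to rounding. The four stage totals add up to 630 cycles, whereas the first three stages alone (252 + 189 + 126) give 567.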

(sine of θ), respectively. The real value is normalized with $2^{31} = 2147483648$, deriving the new variables $x_r$ and $y_r$. The post-processing algorithm for sine (20) and cosine (21) follows:

$$\sin\theta = \begin{cases} y_r, & \theta \in [0, \pi] \\ -2.0 + y_r, & \theta \in [\pi, 2\pi] \end{cases} \tag{20}$$

$$\cos\theta = \begin{cases} x_r, & \theta \in [0, \pi/2] \\ -2.0 + x_r, & \theta \in [\pi/2, \pi] \\ -x_r, & \theta \in [\pi, 3\pi/2] \\ 2.0 - x_r, & \theta \in [3\pi/2, 2\pi] \end{cases} \tag{21}$$

5.4. The 128-Point Radix-2 FFT Estimation of Cycles and Energy Savings
The work in [38] shows the very-large-scale integration architecture of a serial-in, serial-out FFT processor that uses the CORDIC algorithm to compute the product introduced above. Additionally, to save area and power, the required architecture generates the twiddles online according to Table V, for the $N_p$ = 128-point radix-2 format with an angle precision of 32-bit fixed-point arithmetic. In this scenario, the cycle savings illustrated in Table VI are globally 567 clocks, obtained mainly in the first four stages. Furthermore, the energy savings, using the data in Table II for a 90 nm CMOS technology and the conventional CORDIC, are 743.2 pJ. However, when $N_p$ is very high, significant cycle savings could be achieved, since the incremental angle is very small, which implies few iterations for the CORDIC-2 according to Figure 2.

6. CONCLUSION
In this work, we highlight the basic principle that most signal processing and multimedia benchmarks generate sequential trigonometric calculus requests in which two consecutive angles differ by a small value. This basic principle justifies our research, which proposes reconfigurable hardware for the CORDIC algorithm computation when two successive angles are very close to each other. We demonstrate how the proposed approach limits the relative error compared to two successive former CORDIC implementations. We operate in the context of conventional and control CORDIC, but our proposed method is valid for whichever CORDIC variant in the current literature and for future implementations based on angle rotations and optional scaling algorithms. Moreover, the constraint of 10° optimizes energy savings, where cycle savings are very close to the minimum allowable value $\arctan(2^{-1}) = 26.56^{\circ}$, but our solution is valid for any number. Finally, we demonstrate how our solution achieves significant energy savings. The proposed approximate CORDIC algorithm is useful in any area of digital computing. Similarly to the approximate computing paradigm [39], our conclusions lead to scalable-effort hardware that modulates the trigonometric sine/cosine computation toward the desired quality/efficiency tradeoff.

References
1. J. E. Volder, The CORDIC trigonometric computing technique. IRE Transactions on Electronic Computers EC-8, 330 (1959).
2. P. Choudhary and A. Karmakar, CORDIC based implementation of fast Fourier transform, 2011 2nd International Conference on Computer and Communication Technology (ICCCT-2011), Allahabad (2011), pp. 550–555.
3. J. S. Walther, A unified algorithm for elementary functions, Proceedings of the May 18–20, Spring Joint Computer Conference (AFIPS '71), ACM, New York, NY, USA (1971).
4. P. K. Meher, J. Valls, T. B. Juang, K. Sridharan, and K. Maharatna, 50 years of CORDIC: Algorithms, architectures, and applications. IEEE Transactions on Circuits and Systems I: Regular Papers 56, 1893 (2009).
5. X.-G. Jiang, J.-Y. Zhou, J.-H. Shi, and H.-H. Chen, FPGA implementation of image rotation using modified compensated CORDIC, 2005
6th International Conference on ASIC, Shanghai (2005), pp. 752–756.
6. S. R. Bhaisare, A. V. Gokhale, and P. K. Dakhole, CORDIC architecture based 2-D DCT and IDCT for image compression, 2015 International Conference on Communications and Signal Processing (ICCSP), Melmaruvathur (2015), pp. 1473–1477.
7. X. Li and W. Qin, CORDIC based algorithm for frequency offset estimation, 2010 IEEE 12th International Conference on Communication Technology, Nanjing (2010), pp. 817–820.
8. Y. D. Borole and C. G. Dethe, Mixed radix CORDIC FFT algorithm for OFDM WPAN applications, 2017 International Conference on Energy, Communication, Data Analytics and Soft Computing (ICECDS), Chennai (2017), pp. 2975–2979.
9. S. R. Penubolu and R. R. Gudheti, VLSI implementation of synchronizer and pipelined CORDIC in OFDM receiver for fourth generation wireless LAN applications, 2011 IEEE 3rd International Conference on Communication Software and Networks, Xi'an (2011), pp. 312–315.
10. T. Lang and E. Antelo, High-throughput CORDIC-based geometry operations for 3D computer graphics. IEEE Transactions on Computers 54, 347 (2005).
11. A. Syed and R. M. Lourde, Hardware security threats to DSP applications in an IoT network, 2016 IEEE International Symposium on Nano Electronic and Information Systems (iNIS), Gwalior (2016), pp. 62–66.
12. K. Kota and J. R. Cavallaro, Numerical accuracy and hardware tradeoffs for CORDIC arithmetic for special-purpose processors. IEEE Transactions on Computers 42, 769 (1993).
13. R. M. Jiang, An area-efficient FFT architecture for OFDM digital video broadcasting. IEEE Transactions on Consumer Electronics 53, 1322 (2007).
14. S. Wang and E. E. Swartzlander, Critically damped CORDIC algorithm, Proceedings of 1994 37th Midwest Symposium on Circuits and Systems, Lafayette, LA (1994), Vol. 1, pp. 253–256.
15. Y. H. Hu and S. Naganathan, An angle recoding method for CORDIC algorithm implementation. IEEE Transactions on Computers 42, 99 (1993).
16. R. Datta, J. A. Abraham, R. Montoye, W. Belluomini, H. Ngo, C. McDowell, J. B. Kuang, and K. Nowka, A low latency and low power dynamic carry save adder, 2004 IEEE International Symposium on Circuits and Systems (IEEE Cat. No. 04CH37512), Vancouver, BC (2004), pp. II–477.
17. R. Mahalakshmi and T. Sasilatha, A power efficient carry save adder and modified carry save adder using CMOS technology, 2013 IEEE International Conference on Computational Intelligence and Computing Research, Enathi (2013), pp. 1–5.
18. M. D. Ercegovac and T. Lang, Redundant and on-line CORDIC: Application to matrix triangularization and SVD. IEEE Transactions on Computers 39, 725 (1990).
19. M. D. Ercegovac and T. Lang, Implementation of fast angle calculation and rotation using on-line CORDIC, 1988 IEEE International Symposium on Circuits and Systems, Espoo, Finland (1988), Vol. 3, pp. 2703–2706.
20. A. Asaduzzaman, M. Rani, and F. N. Sibai, On the design of low-power cache memories for homogeneous multi-core processors, 2010 International Conference on Microelectronics, Cairo (2010), pp. 387–390.
21. N. K. Nawandar, B. Garg, and G. K. Sharma, RICO: A low power repetitive iteration CORDIC for DSP applications in portable devices. Journal of System Architecture 70, 82 (2016).
22. L. Chen, F. Lombardi, J. Han, and W. Liu, A fully parallel approximate CORDIC design, IEEE/ACM International Symposium on Nanoscale Architectures, Beijing (2016), pp. 197–202.
23. N. Takagi, T. Asada, and S. Yajima, Redundant CORDIC methods with a constant scale factor for sine and cosine computation. IEEE Transactions on Computers 40, 989 (1991).
24. M. Garrido and J. Grajal, Efficient memoryless CORDIC for FFT computation, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing-ICASSP '07, Honolulu, HI (2007), pp. II-113–II-116.
25. S. Y. Park, N. I. Cho, S. U. Lee, K. Kim, and J. Oh, Design of 2K/4K/8K-point FFT processor based on CORDIC algorithm in OFDM receiver, 2001 IEEE Pacific Rim Conference on Communications, Computers and Signal Processing (IEEE Cat. No. 01CH37233), Victoria, BC, Canada (2001), Vol. 2, pp. 457–460.
26. P. Choudhary and A. Karmakar, CORDIC based implementation of fast Fourier transform, 2011 2nd International Conference on Computer and Communication Technology (ICCCT-2011), Allahabad (2011), pp. 550–555.
27. V. Gautam, K. C. Ray, and P. Haddow, Hardware efficient design of variable length FFT processor, 14th IEEE International Symposium on Design and Diagnostics of Electronic Circuits and Systems, Cottbus (2011), pp. 309–312.
28. P. R. B. de Carvalho, J. A. A. Palacio, and W. Van Noije, Area optimized CORDIC-based numerically controlled oscillator for electrical bio-impedance spectroscopy, 2016 IEEE International Frequency Control Symposium (IFCS), New Orleans, LA (2016), pp. 1–6.
29. L. Zhihua and W. Weilian, The design of NCO based on CORDIC algorithm and implementation in FPGA, 2011 International Conference on Electronics, Communications and Control (ICECC), Ningbo (2011), pp. 2902–2905.
30. Y. Xie, C. Peng, X. Jiang, and S. Ouyang, Hardware design and implementation of DOA estimation algorithms for spherical array antennas, 2014 IEEE International Conference on Signal Processing, Communications and Computing (ICSPCC), Guilin (2014), pp. 219–223.
31. G. Zhang and F. Chen, Parallel FFT with CORDIC for ultra-wide band, 2004 IEEE 15th International Symposium on Personal, Indoor and Mobile Radio Communications (IEEE Cat. No. 04TH8754), Barcelona (2004), Vol. 2, pp. 1173–1177.
32. C. Zhang, J. Han, and K. Li, Design and implementation of hybrid CORDIC algorithm based on phase rotation estimation for NCO. The Scientific World Journal 2014, Article ID 897381 (2014).
33. W. Kamp, A. Bainbridge-Smith, and M. Hayes, Efficient implementation of fast redundant number adders for long word-lengths in FPGAs, IEEE International Conference on Field-Programmable Technology, Sydney (2009), pp. 239–246.
34. M. R. Guthaus, J. S. Ringenberg, D. Ernst, T. M. Austin, T. Mudge, and R. B. Brown, MiBench: A free, commercially representative embedded benchmark suite, Proceedings of the Fourth Annual IEEE International Workshop on Workload Characterization, WWC-4 (Cat. No. 01EX538) (2001), pp. 3–14.
35. C. D. Nguyen and J. Lee, Improving SOVA output using extrinsic information for bit patterned media recording, IEEE International Conference on Consumer Electronics, Las Vegas (2015), pp. 136–137.
36. NASA, NASA Open Source Software, https://code.nasa.gov.
37. J. Rudagi and S. Subbaraman, Comparative analysis of radix-2, radix-4, radix-8 CORDIC processors, IEEE International Conference on Inventive Computing and Informatics, Coimbatore (2017), pp. 378–382.
38. P. Bansal, B. S. Dhaliwal, and S. S. Gill, Memory-efficient radix-2 FFT processor using CORDIC algorithm, 2014 International Conference on Green Computing Communication and Electrical Engineering (ICGCCEE), Coimbatore, India (2014), pp. 1–5.
39. V. K. Chippa, S. Venkataramani, S. T. Chakradhar, K. Roy, and A. Raghunathan, Approximate computing: An integrated hardware approach, 2013 Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA (2013).


Giuseppe Visalli received a degree in Telecommunication Engineering from the University of Pisa, Italy, in 1996. From 2001 to 2007, he was a research engineer in the Advanced System Technology group at ST Microelectronics. In 2007, he moved into a flash memory group that was spun off with its Intel counterpart to form Numonyx, acquired by Micron Technology two years later. Currently, he is working in the Nonvolatile Engineering Group at Micron Semiconductor Italy. His current research interests focus on low-power architectures and methodologies, wireless sensor networks, computer architecture, and telecommunication. Dr. Visalli is an author of twelve granted US patents in his research area.

