Copyright © 2019 American Scientific Publishers. All rights reserved. Printed in the United States of America.
Journal of Low Power Electronics, Vol. 15, 338–350, 2019
A Novel Coordinate Rotation Digital Computer Method for Energy and Latency Saving by Trigonometric Operations Spatial Locality Principle
Giuseppe Visalli Independent Researcher, Via G. Guerzoni, 1 Padova, 35126, Italy
(Received: 23 June 2019; Accepted: 7 October 2019)
In this work, we propose an approximate and energy-efficient CORDIC method, based on a spatial locality principle of trigonometric functions derived from benchmark profiling. More than 50% of successive sine/cosine computation requests have an absolute phase difference of at most ten degrees. Consequently, this property suggests an optimized circuit implementation, either iterative or as a succession of micro-rotation modules, where the second CORDIC instance requires fewer iterations, reducing the latency and the total energy budget at the same precision as two separate and independent instances. Thus, this simple design strategy allows significant savings in area and energy dissipation in general-purpose VLSI architectures, and it also introduces dramatic optimizations in application-specific embedded systems used in signal processing and radio frequency communication. In this contribution, we present the method, its hardware overhead, and the energy budget per single cycle. Simulation results show that the total energy saving in the considered benchmarks is 40% in pipelined and iterative general-purpose CORDIC. Furthermore, our application-specific systems (fast Fourier transform and digital oscillators for radiofrequency down conversion) show remarkable cycle savings when the successive sine/cosine computation requests exceed 70%. Finally, we extend the proposed approach to any phase difference smaller than 26.56°, which becomes a variable that sets the number of angle rotations of the second CORDIC.
Keywords: CORDIC, Digital Arithmetic, Digital Signal Processing Chip, Energy Estimation, Fixed-Point Arithmetic, Very-Large Scale Integration Systems.
Emails: [email protected], [email protected]
J. Low Power Electron. 2019, Vol. 15, No. 4, 1546-1998/2019/15/338/013, doi:10.1166/jolpe.2019.1619

1. INTRODUCTION AND BACKGROUND
Due to the rapid advances in very-large-scale integration (VLSI) technology and digital signal processors (DSP), the design of real-time applications requires the computation of trigonometric functions as a primary task in image processing, signal processing, and telecommunication. The CORDIC (Coordinate Rotation Digital Computer) algorithm [1] is the most commonly used method to compute trigonometric, logarithmic, exponential, and hyperbolic functions, real and complex multiplication, eigenvalue estimation, square root, division, singular value decomposition, and many more [2]. Walther [3] extended the initial implementation, which many others later refined [4]. The CORDIC algorithm can be used in image and signal processing applications [5, 6], telecommunication (e.g., radio baseband processors) [7–9], 3D graphics manipulation [10], and the Internet of Things (IoT) [11]. Moreover, this algorithm has applications in special-purpose systems for real-time signal processing [12, 13]. These typical applications require the computation of trigonometric functions with reduced hardware, to save area and energy and to achieve the minimum latency.

The CORDIC algorithm calculates the sine and cosine of any angle by successive rotations until the residual angle is very close to zero. Its principal advantage is that it does not require signed integer multiplications: it performs the computation by successive shift-and-add operations. However, its most important drawback is the total latency, which is proportional to the number of iterations (N). For this reason, many approaches improve the conventional CORDIC by reducing the required cycles at the same result precision. Control CORDIC [14] and Angle Re-coding (AR) CORDIC [15] reduce the number of iterations; the latter requires an average of N/3 and a maximum of N/2 iterations. Additionally, since a single CORDIC rotation is not a pure rotation but a rotation-extension, the number of rotations for any angle should be constant and independent of the operand so that a constant scale factor is used. Hence, the use of redundant number representations or carry-save adders (CSA) [16, 17] in addition/subtraction reduces the total latency, allowing a higher clock rate [18, 19].

In this work, we simulated several representative benchmarks in the fields of signal processing, computer arithmetic, and telecommunication for general-purpose and application-specific embedded cores. We traced the accesses to two successive trigonometric (sine/cosine) functions whose angles differ by at most 10°. Our profile shows that 50% of all successive sine/cosine computations fall into this region; this property is equivalent to the spatial locality principle used in microprocessor cache memory design [20]. Low-power CORDIC implementations generally reduce the total energy at the price of increased latency, as shown in Ref. [21]. Other research in VLSI signal processing precomputes each CORDIC direction, accelerating every shift-and-add stage [22]. However, these timing optimizations are far from our approach, which excludes useless rotations under trigonometric operand constraints and thereby removes a large portion of the required delay. Thus, our proposed approach is a reconfigurable hardware that reduces energy dissipation and latency by removing unnecessary angle rotations when two successive sin/cos requests fall within an angle difference of at most ten degrees. Otherwise, the hardware dissipation is almost aligned with the current state of the art and implementation technology. Our proposed hardware does not dissipate more energy than known CORDICs: a miss of the locality principle costs no additional energy or latency. Our idea is to compute the second CORDIC instance without resetting the rotation results of the previous instance. In this way, the hardware starts from a non-reset value, and the next rotation begins from an angle slightly larger than our reference offset value of 10°. This approach can be used in any CORDIC implementation based on a binary representation, either redundant or non-redundant, and with the single or double rotation-extension algorithm, the latter introduced in Ref. [23].

The spatial locality of sine/cosine angles allows significant hardware savings in application-specific VLSI. We explored two representative benchmarks: fast Fourier transform (FFT) processors [24–27] and the implementation of an accurate numerically controlled oscillator (NCO) used in spectroscopy [28, 29] or in the modulated-wave down-conversion section of a typical digital telecommunication receiver [30]. We consider a discrete number of reference approaches illustrated in Refs. [31, 32], estimating the hardware costs (area and power) and the maximum operating frequency. We translated the reference and the new hardware into a 90 nm CMOS technology, evaluating the energy efficiency and the quality of result in our considered benchmarks.

The rest of the paper is organized as follows. Section 2 is an overview of the CORDIC algorithms. Section 3 enhances the conventional CORDIC to perform a second instance with a minimal incremental angle. We formalize our approach in terms of algorithm, VLSI implementation, and relative error measurement in Section 4, where we consider both general-purpose and application-specific problems. Section 5 summarizes our results in terms of energy savings in the considered benchmarks and the precision achieved by the second CORDIC instance. Finally, Section 6 concludes our work.

2. THE CORDIC ALGORITHM
This section introduces the conventional CORDIC and its specializations that reduce the total number of iterations responsible for the full latency. The algorithm is a succession of phase rotations, translated into hardware either by an iterative VLSI implementation or by a succession of micro-rotation modules; the latter is preferred to achieve a high throughput. In this paper, however, we focus on the CORDIC algorithm with variable and unknown angles of rotation, a property used in the implementation of low-power DSPs and microprocessors.

2.1. Conventional CORDIC
The CORDIC algorithm is a succession of rotations by angles of the form $e_n = \arctan 2^{-n}$, where $n$ is an integer [4]. A read-only memory (ROM) stores these angles, measured in degrees, in a fixed-precision format. Hereafter, we consider a radix-2 32-bit fixed-point CORDIC algorithm. The trigonometric functions sine and cosine use a rotation starting from a pair of fixed-point numbers $(x_0, y_0)$; we obtain the point $(x_{n+1}, y_{n+1})$ by a rotation of angle $d_n \cdot e_n$, where $d_n \in \{+1, -1\}$, so we have:

$$\begin{pmatrix} x_{n+1} \\ y_{n+1} \end{pmatrix} = \cos e_n \begin{pmatrix} 1 & -d_n \cdot 2^{-n} \\ d_n \cdot 2^{-n} & 1 \end{pmatrix} \begin{pmatrix} x_n \\ y_n \end{pmatrix} \tag{1}$$

The shift-and-add implementation omits the factor $\cos e_n$, so each step is a rotation-extension and the omitted factors accumulate into the scale factor $K$ of Eq. (3). Here, the CORDIC algorithm uses the iterative equation for $z_n$ as follows:

$$z_{n+1} = z_n - d_n \cdot \arctan 2^{-n} \tag{2}$$

This new variable tracks the accumulated partial sum of step angles, and its sign sets the rotation direction $d_n$ (+1 or −1). If $|z_0|$ is less than or equal to $\sum_{k=0}^{\infty} \arctan 2^{-k} = 1.743$ rad, we have:

$$\lim_{n \to \infty} \begin{pmatrix} x_n \\ y_n \\ z_n \end{pmatrix} = \begin{pmatrix} K \cdot (x_0 \cos z_0 - y_0 \sin z_0) \\ K \cdot (x_0 \sin z_0 + y_0 \cos z_0) \\ 0 \end{pmatrix} \tag{3}$$

The scale factor in Eq. (3) is $K = \prod_{n=0}^{\infty} 1/\cos(d_n \cdot \arctan 2^{-n}) = \prod_{n=0}^{\infty} \sqrt{1 + 2^{-2n}} = 1.646$. We calculate the sine and cosine functions starting from $x_0 = 1/K = 0.607252$, $y_0 = 0$, and $z_0$ equal to the desired angle. Software profiling revealed that the maximum error percentage (E%) of the sine and cosine computation by the conventional CORDIC is 3.33%, when the angles are 1° and 89° respectively.

2.2. Control CORDIC
Control CORDIC [14] reduces the total iterations at the same precision as the conventional implementation. The variable $z_k$ in the conventional CORDIC does not converge monotonically. Consequently, the algorithm introduces divergent rotations, similar to the overshoot in the underdamped response of a second-order filter. Control CORDIC removes the divergent rotations, yielding a monotonic residual without overshoot and without sacrificing the desired precision. The idea is to restrict the rotation direction to the set $\{+1, 0\}$ according to the formula below:

$$d_n = \begin{cases} +1 & \text{when } z_n \ge e_n \\ 0 & \text{otherwise} \end{cases} \tag{4}$$

Eq. (4) applies when the input angle is positive. Similarly, for negative angles the rotation direction is restricted to $\{-1, 0\}$.

2.3. Double Rotation CORDIC
The earlier redundant CORDIC computes the sine and cosine by using a few most significant bits of the remaining angle, represented in a redundant binary format, to determine the direction of the rotation [33]. Furthermore, this CORDIC evaluates the addition/subtraction in a redundant number system. However, since a non-rotation-extension could occur, the scale factor becomes a variable that depends on the operand and must be calculated, increasing the total latency and the amount of hardware. The Double Rotation CORDIC [23] uses the redundant binary representation with digits in the set $\{-1, 0, +1\}$. This algorithm performs a combination of two rotation-extensions for each type of step: positive rotation, negative rotation, or non-rotation. This approach makes the scaling factor independent of the operand. We start with the operands $x_0$, $y_0$, $z_0$, and the scale factor (used as in the conventional algorithm) is $K_1 = \prod_{n=1}^{N} (1 + 2^{-2n-2})$. This CORDIC algorithm proceeds according to the following iterative equations:

$$\begin{aligned} x_n &= x_{n-1} - q_n \cdot 2^{-n} \cdot y_{n-1} - p_n \cdot 2^{-2n-2} \cdot x_{n-1} \\ y_n &= y_{n-1} + q_n \cdot 2^{-n} \cdot x_{n-1} - p_n \cdot 2^{-2n-2} \cdot y_{n-1} \\ z_n &= z_{n-1} - q_n \cdot 2 \arctan 2^{-n-1} \end{aligned} \tag{5}$$

We represent $z_n$ by a redundant binary fraction; the algorithm considers the most significant digits at the $j$th binary position and computes two rotation-extension coefficients, $q_n$ and $p_n$, as follows:

$$(q_n, p_n) = \begin{cases} (-1, 1) & [z_{n-1}^{j-1}\, z_{n-1}^{j}\, z_{n-1}^{j+1}] < 0 \\ (0, -1) & [z_{n-1}^{j-1}\, z_{n-1}^{j}\, z_{n-1}^{j+1}] = 0 \\ (+1, +1) & [z_{n-1}^{j-1}\, z_{n-1}^{j}\, z_{n-1}^{j+1}] > 0 \end{cases} \tag{6}$$

This algorithm is particularly efficient for computing the sine/cosine functions, since no scaling factor has to be calculated during the computation and the result is immediately available.

2.4. Correcting Rotation CORDIC
This approach, also introduced in Ref. [23], again uses the redundant binary representation, and the sequence of rotation-extensions is identical to that of the conventional CORDIC. The algorithm performs an extra rotation-extension every $m$th step, where $m$ is an arbitrary integer (the correcting period). Since the number of rotation-extensions performed for each angle is constant, the scaling factor is constant and independent of the operand. Equations (1) and (2) are still valid; every $m$ steps the new equations are:

$$\begin{aligned} \hat{x}_n &= x_{n-1} - q_n \cdot 2^{-n} \cdot y_{n-1} \\ \hat{y}_n &= y_{n-1} + q_n \cdot 2^{-n} \cdot x_{n-1} \\ \hat{z}_n &= z_{n-1} - q_n \cdot \arctan 2^{-n} \\ x_n &= \hat{x}_n - \hat{q}_n \cdot 2^{-n} \cdot \hat{y}_n \\ y_n &= \hat{y}_n + \hat{q}_n \cdot 2^{-n} \cdot \hat{x}_n \\ z_n &= \hat{z}_n - \hat{q}_n \cdot \arctan 2^{-n} \end{aligned} \tag{7}$$

Again, the variables $z_n$ and $\hat{z}_n$ are represented by a redundant binary fraction whose most significant digit is located at the $j$th binary position. Additionally, $x_n$, $y_n$, $\hat{x}_n$, $\hat{y}_n$ are in the redundant binary format. The directions $q_n$ and $\hat{q}_n$ follow these equations:

$$q_n = \begin{cases} -1 & [z_{n-1}^{j-1}\, z_{n-1}^{j}\, z_{n-1}^{j+1}] < 0 \\ +1 & [z_{n-1}^{j-1}\, z_{n-1}^{j}\, z_{n-1}^{j+1}] \ge 0 \end{cases} \tag{8}$$

$$\hat{q}_n = \begin{cases} -1 & [\hat{z}_{n-1}^{j-1}\, \hat{z}_{n-1}^{j}\, \hat{z}_{n-1}^{j+1}] < 0 \\ +1 & [\hat{z}_{n-1}^{j-1}\, \hat{z}_{n-1}^{j}\, \hat{z}_{n-1}^{j+1}] \ge 0 \end{cases} \tag{9}$$

The choice of the correcting period influences the algorithm performance in terms of accuracy and precision. In particular, when $m$ is small there is a large number of extra rotations, but the number of digits needed to determine the direction of the rotations becomes smaller, and vice versa.

2.5. Angle Rotation CORDIC
The angle rotation (AR) mode [15] reduces the total iterations in the CORDIC trigonometric computation, but only in forward rotation mode (also known as vector rotation mode).
The former implementation is a software-based algorithm, although there have been many attempts to translate this approach into hardware. The idea is to realize the desired rotation angle by a combination of a much-reduced set of angles. In this work, without loss of generality, we consider the Conventional, Control, Double Rotation, and Correcting Rotation CORDIC only. Our ultimate goal is to save both energy and latency by removing the unnecessary angle rotations. This approach is transversal to any VLSI implementation of known CORDIC algorithms based on angle rotations: higher radix and/or redundant number logic are still applicable because they address the single CORDIC instance, whereas our solution addresses energy and latency saving in consecutive instances, under the constraint that the incremental angle is less than 26.56°.

3. A NOVEL ENERGY-EFFICIENT 2-PASS CORDIC METHOD
We start from the consideration that signal processing and multimedia benchmarks potentially admit a form of spatial locality when they require the computation of basic trigonometric functions. If a particular angle is referenced by a trigonometric function at a particular time, then it is likely that a nearby angle will be referenced by the same function soon. We collected some interesting benchmarks, mostly for the embedded market, by using MiBench [34] and many others in the areas of telecommunication (e.g., the soft-output Viterbi algorithm, SOVA [35], with 240 channel symbols) and signal processing, the latter as representative benchmarks for application-specific embedded processors and DSPs. Next, we profiled a radar processing algorithm from NASA open-source software [36]: the Turbulence Detection Algorithm (TDA). This software, written in Python, estimates the eddy dissipation rate (a measure of turbulence) from Doppler weather radar data. We measure the occurrence of two consecutive accesses to a trigonometric function by the difference between the current ($\theta + \Delta\theta$) and previous ($\theta$) angles. We fix a maximum of $\Delta\theta = 10°$ as a hypothetical limit of our analysis. The results are in Table I; the average is greater than 50%, which justifies our approach. Figure 1 shows the average rate of successive sine and cosine computations as a function of the parameter $\Delta\theta$ when simulating our considered benchmarks.

Fig. 1. Average rate of successive sine/cos versus $\Delta\theta$ [°].

We propose in this work a novel 2-pass CORDIC method that computes the sine/cosine function of the angle $\theta + \Delta\theta$ when the same function is known for the angle $\theta$. Hereafter we suppose, without any limitation, that $\Delta\theta$ is at most ±10°. Consequently, we define the past and current CORDIC instances as CORDIC-1 and CORDIC-2 respectively. We start from Eqs. (1) and (2); CORDIC-1 has $N$ iterations. In our proposed approach, we do not clear $x_N$, $y_N$, and $z_N$, forcing a further rotation with $M$ additional iterations. Additionally, the angle rotation $e_{N+1}$ is fixed according to the following rule:

$$e_{N+1} = \arctan 2^{-2} = 14.03° \tag{10}$$

Here 14.03° is the first angle of the form $\arctan 2^{-n}$, in 32-bit fixed-point, that is greater than 10 degrees, our considered limit. In this way, we optimize the angle dynamic according to the realistic results of the conducted profile in Table I. In particular, Figure 2 shows the cycle savings as the $\Delta\theta$ range varies. We save one clock cycle if $\Delta\theta$ is less than 26.56 degrees and greater than 14.03 degrees. When $\Delta\theta$ is from 7.12 to 14.03 degrees we save two CORDIC iterations, and three cycles when it is from 3.57 to 7.12 degrees. We save four iterations when the absolute phase difference is from 1.78 to 3.57 degrees; finally, when $\Delta\theta$ is more than 0.89 and less than 1.78 degrees we save five iterations. Below is an example in which we fix $\theta = 30°$ and $\Delta\theta = 2°$. Equation (11) represents a 32-bit fixed-point CORDIC-1 with $N = 8$ main iterations and a CORDIC-2 with $M = 4$ cycles.
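As a rough software illustration of the two-pass idea for the same angles (30° followed by an increment of 2°), the following Python sketch may help. It is not the hardware described in this paper: it uses floating point instead of 32-bit fixed point, and the names `cordic_rotate`, `gain`, `first_index`, and `two_pass` are hypothetical. It also makes two assumptions that the text above does not state: CORDIC-2 runs the micro-rotations from the chosen starting index up to index N−1, so that it reaches the same residual bound as CORDIC-1, and the extra constant gain of those rotations is removed by a final division.

```python
import math


def cordic_rotate(x, y, z_deg, start, stop):
    """Apply the rotation-extensions n = start .. stop-1 (shift-and-add form of Eqs. (1)-(2))."""
    for n in range(start, stop):
        d = 1 if z_deg >= 0 else -1          # direction from the sign of the residual angle
        x, y = x - d * y * 2.0 ** -n, y + d * x * 2.0 ** -n
        z_deg -= d * math.degrees(math.atan(2.0 ** -n))
    return x, y, z_deg


def gain(start, stop):
    """Constant scale factor accumulated by rotation-extensions start .. stop-1."""
    return math.prod(math.sqrt(1.0 + 2.0 ** (-2 * n)) for n in range(start, stop))


def first_index(delta_deg, n_iter):
    """Largest k with arctan(2^-k) >= |delta|; k is the number of skipped iterations."""
    k = 0
    while k + 1 < n_iter and math.degrees(math.atan(2.0 ** -(k + 1))) >= abs(delta_deg):
        k += 1
    return k


def two_pass(theta_deg, delta_deg, n_iter=16, first=2):
    """Return (cos, sin) of theta and of theta + delta.

    first=2 mirrors Eq. (10): CORDIC-2 restarts at arctan(2^-2) = 14.03 degrees.
    """
    # CORDIC-1: conventional instance starting from x0 = 1/K, y0 = 0, z0 = theta.
    x, y, z = cordic_rotate(1.0 / gain(0, n_iter), 0.0, theta_deg, 0, n_iter)
    first_result = (x, y)
    # CORDIC-2: keep x, y, z (no reset), add the increment, skip the first rotations.
    x, y, z = cordic_rotate(x, y, z + delta_deg, first, n_iter)
    g2 = gain(first, n_iter)                 # assumption: compensate the extra gain here
    return first_result, (x / g2, y / g2)


if __name__ == "__main__":
    (c1, s1), (c2, s2) = two_pass(30.0, 2.0)
    print("CORDIC-1:", c1, s1)
    print("CORDIC-2:", c2, s2)
    print("skipped iterations for 2 degrees:", first_index(2.0, 16))
```

In this sketch, `first_index` reproduces the saving ranges listed above (for example, an increment between 7.12° and 14.03° skips two iterations), while the default `first=2` corresponds to the fixed rule of Eq. (10).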