Computation of Decimal Transcendental Functions Using the CORDIC Algorithm

Álvaro Vázquez, Julio Villalba* and Elisardo Antelo
University of Santiago de Compostela
*University of Málaga
SPAIN

19th IEEE Symposium on Computer Arithmetic
Portland (USA), June 8-10, 2009

Motivation
• Renewed interest in decimal floating point (IEEE 754-2008; hardware implementations).
• Research is now mainly focused on basic floating-point operations (+/-, x, /).
• We consider transcendentals (sin, cos, exp, log…) to fully support the IEEE 754-2008 Decimal Number System for DPD formats.
• Slow multiplication -> CORDIC might be an option for transcendentals.

Commercial Hardware Implementations
Cycles required for execution for Dec128 operands (Power 6 - 13 FO4 cycle - 5GHz @ 65nm SOI)

Operation            Cycles
FP add/sub           11 to 19 (depending on specific case)
FP multiplication    21+2N (worst case: 89) (N: number of digits excluding leading zeros)
FP division          154
FXP add/sub          2

Transcendentals
• Transcendental functions: sin, cos, …
• We aim to implement transcendentals based on CORDIC (serial method).
• Why? Slow decimal multiplication -> table + polynomial approximation might be slow.
• Resembles the context in which CORDIC was used in x86 processors.

CORDIC
• Rotation mode (circular): inputs are a vector and an angle; output is the rotated vector.
• Vectoring mode (circular): input is a vector; outputs are its modulus and angle.
• Circular functions: sin, cos, tan, norm of a vector, $\tan^{-1}(y/x)$, …
• Extension to "hyperbolic rotation": sinh, cosh, sqrt, $\tanh^{-1}(y/x)$, ln, exp, …

Rotation decomposed into elementary rotations:
\[
\begin{pmatrix} \cos\alpha & \sin\alpha \\ -\sin\alpha & \cos\alpha \end{pmatrix}
=
\begin{pmatrix} \cos\alpha_1 & \sin\alpha_1 \\ -\sin\alpha_1 & \cos\alpha_1 \end{pmatrix}
\begin{pmatrix} \cos\alpha_2 & \sin\alpha_2 \\ -\sin\alpha_2 & \cos\alpha_2 \end{pmatrix}
\cdots
\]
Rotation angle decomposed as a sum of elementary rotation angles:
\[
\alpha = \sum_i \alpha_i
\]
Each elementary rotation has the form
\[
\cos(\alpha_i) \begin{pmatrix} 1 & \tan(\alpha_i) \\ -\tan(\alpha_i) & 1 \end{pmatrix}
\]

Radix-2 CORDIC: basic step
Basic step: elementary rotation with increase in the modulus, taking (x[i], y[i]) to (x[i+1], y[i+1]). It is the elementary rotation without the factor $\cos(\alpha_i)$:
\[
\begin{pmatrix} 1 & \tan(\alpha_i) \\ -\tan(\alpha_i) & 1 \end{pmatrix}
=
\begin{pmatrix} 1 & \sigma_i 2^{-i} \\ -\sigma_i 2^{-i} & 1 \end{pmatrix},
\qquad
\alpha_i = +\tan^{-1}(2^{-i}) \ \text{or} \ -\tan^{-1}(2^{-i}), \quad \sigma_i = +1 \ \text{or} \ -1
\]
• $\cos(\alpha_i)$ is independent of $\sigma_i$.
• $K = \prod_i \cos(\alpha_i)$ is a constant scale factor -> precomputed and compensated after the basic steps.

CORDIC Convergence
• Worst case: the vector is already on the target position (remainder angle = 0).
• The next elementary rotation moves it away, so the next residual angle value is $\alpha_i$.
• Convergence condition:
\[
\alpha_i < \sum_{k \ge i+1} \alpha_k
\]

Our Decimal CORDIC
• We want a constant scale factor -> precomputed and simple to compensate.
• A constant scale factor means two possible angles (same magnitude, different sign) for each elementary rotation.
• Conventional radix-2 CORDIC is not good for decimal operands: with x[i] and y[i] in BCD (digits of weight $10^{-k}$, $10^{-(k+1)}$, each coded with bit weights 8 4 2 1), the multiplication by $\sigma_i 2^{-i}$ in the basic step is not a simple right shift.

Elementary angles
Key issue: the elementary angle set.
Requirements:
• Two possible angles per iteration: $+\alpha_i$ and $-\alpha_i$ -> constant scale factor.
• $\tan(\sigma_i \alpha_i)$ proportional to a power of ten -> simple right shift for BCD operands.
\[
\cos(\alpha_i) \begin{pmatrix} 1 & \tan(\sigma_i \alpha_i) \\ -\tan(\sigma_i \alpha_i) & 1 \end{pmatrix}
\]

Angle set for Dec. CORDIC
RNC8 paper [3]:
\[
\alpha_i = \tan^{-1}\!\left(\sigma_i \, C[i] \, 10^{-\lceil i/4 \rceil}\right), \qquad \sigma_i = -1 \ \text{or} \ +1
\]
Candidate values for C[i], taken from binary weights of a decimal representation:
• 8, 4, 2, 1
• 5, 2, 1, 1: easier to implement (easy multiples for decimal).
Convergence:
\[
\alpha_i < \alpha_{i+1} + \alpha_{i+2} + \dots + \alpha_{i+k} + \dots
\]

Contributions in this work
• Extend the algorithm to support hyperbolic coordinates ($\tanh^{-1}$ angles).
• Extend to floating point.
• Map the algorithm to a state-of-the-art Decimal Floating Point Unit.
• Algorithm with redundant adder (carry-save).
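The rotation mode with the decimal angle set above can be sketched in software. The following is a minimal floating-point illustration, not the authors' hardware algorithm: it assumes the iteration index starts at i = 1, that the weights 5, 2, 1, 1 are applied in that order within each digit, and it uses the common textbook sign convention for the recurrences, which may differ from the slides. The names `tangent`, `scale_factor` and `rotate` are hypothetical.

```python
import math

# Per-digit weights 5, 2, 1, 1 from the slides; the indexing below
# (i = 1, 2, 3, ... with four iterations per digit) is an assumption.
WEIGHTS = (5, 2, 1, 1)

def tangent(i):
    """tan(alpha_i) = C[i] * 10**(-ceil(i/4))."""
    return WEIGHTS[(i - 1) % 4] * 10.0 ** (-math.ceil(i / 4))

def scale_factor(n):
    """K = product of cos(alpha_i); it does not depend on the signs sigma_i."""
    k = 1.0
    for i in range(1, n + 1):
        # cos(atan(t)) == 1 / sqrt(1 + t*t)
        k /= math.sqrt(1.0 + tangent(i) ** 2)
    return k

def rotate(theta, digits=16):
    """Rotation mode: return approximations of (cos(theta), sin(theta))."""
    n = 4 * digits                      # four elementary rotations per digit
    x, y, z = 1.0, 0.0, theta
    for i in range(1, n + 1):
        t = tangent(i)
        sigma = 1.0 if z >= 0.0 else -1.0   # rotate towards z = 0
        x, y = x - sigma * t * y, y + sigma * t * x
        z -= sigma * math.atan(t)
    k = scale_factor(n)                 # compensate the modulus growth once
    return k * x, k * y
```

With these assumptions the sketch only covers input angles whose magnitude is below the sum of all elementary angles, roughly 0.96 rad; larger arguments would first need range reduction. In a decimal datapath the product `t * y` would be a digit shift plus an easy multiple (x1, x2 or x5), which binary floats do not model.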
Angle set for circular and hyperbolic
\[
\alpha_i = \tan^{-1}\!\left(\sigma_i \, C[i] \, 10^{-\lceil i/4 \rceil}\right) \quad \text{(circular)}
\]
\[
\alpha_i = \tanh^{-1}\!\left(\sigma_i \, C[i] \, 10^{-\lceil i/4 \rceil}\right) \quad \text{(hyperbolic)}
\]
• C[i] = R[i mod 4] = 1, 5, 2, 1 for both: does not assure convergence.
• C[i] = R[i mod 4] = 1, 5, 2, 2: enough redundancy to assure convergence for hyperbolic coordinates.

Unified Algorithm
• Roughly one iteration per bit of the input operands (4 iterations/digit).
• c = 1 (circular); c = -1 (hyperbolic).
• C[i] takes values 5, 2 or 1.

Floating-Point Extension
Range reduction: standard methods (see paper).
Functions: sin/cos, $\tan^{-1}$, sinh/cosh, $e^f$, $10^f$, $\tanh^{-1}$, ln(f), log(f), sqrt.
Result after range reduction, for one of the inputs:
\[
y_{in} = M_{yin} \, 10^{-E_{yin}} \quad \text{or} \quad z_{in} = M_{zin} \, 10^{-E_{zin}}, \qquad |M_{zin}|, |M_{yin}| \in [1, 10), \quad E_{yin}, E_{zin} \ge 0
\]
We need a floating-point version of the iterations: scale the y or z iterations (or both) to remove the $E_{yin}$ or $E_{zin}$ leading zeros (or nines).

Floating-point iterations:
• Scale the y and/or z iterations by $10^{E_{yin}}$ or $10^{E_{zin}}$.
• Start the iterations with index $J = 4E_{zin} - 3$ or $J = 4E_{yin} - 3$.

Function                               P       Q
sin/cos; sinh/cosh (N = 0)             Ezin    Ezin
sinh/cosh (N != 0)                     0       Ezin
tan^-1, tanh^-1, ln, log               Eyin    Eyin
sqrt                                   0       0

For p decimal digits and 1 ulp accuracy: datapath of 1 integer and p+3 fractional digits, and m = p+1 (4(p+1) elementary rotations).

Mapping to a DFPU (for Decimal128 format)
[Figure: mapping of the algorithm to the DFPU datapath]

Pipelining the Algorithm
• Interleave x and y iterations for effective pipelining.
• Number of cycles for Decimal128: cycle counts exclude the pre- and post-processing due to range reduction.
• Optimization performed: after about half of the rotations, perform a single final rotation, since cos and sin can be linearly approximated. This requires a multiplication for rotation and a division for vectoring, both of about m/2 digits.

Algorithm for Carry-Save Representation
• Perform the iterations with a redundant (carry-save) adder instead of a CPA (may be used for division).
• Why use a redundant adder? Not for speed (the algorithm will actually be slower). Our aim is to use a simple adder for the iteration instead of the complex carry-propagate adder.
• Direction of rotation obtained as the sign of an estimation of y or z.
• Redundant representation: estimation of the sign from a few leading digits.
• The estimated value of y or z differs from the real value by an estimation error within an interval of width 2D[t], so a "wrong" direction of rotation may be chosen near 0.
• Convergence condition (without rotation repetition):
\[
\alpha_i + D[t] < \alpha_{i+1} + \alpha_{i+2} + \dots + \alpha_{i+k} + \dots
\]

Example for rotation mode:
• Scale the recurrence to have the sign information at the most significant digits (scaled recurrence for z).
• Estimation of one fractional digit is enough for convergence (see paper):
\[
D[t] < -\alpha_i + \alpha_{i+1} + \alpha_{i+2} + \dots + \alpha_{i+k} + \dots
\]
• The angular slack on the right-hand side is > 0 due to the redundant angular representation.

Pipelining the Redundant Algorithm
Schedule two shift-scale-add (3-2 CSA) operations for each x and y iteration due to the redundant representation.

Related Works
• Early versions of decimal CORDIC (60's-70's)
– Pocket calculators.
– Elementary angle set: $\tan^{-1}(10^{-j})$
– Convergence not achieved:
\[
\tan^{-1}(10^{-j}) > \sum_{k \ge j+1} \tan^{-1}(10^{-k})
\]
– Solution: repeat each elementary rotation 9 times:
\[
\tan^{-1}(10^{-j}) < \sum_{k \ge j+1} 9 \times \tan^{-1}(10^{-k})
\]
– Comparison: 9 rotations/digit vs 4 rotations/digit, i.e. 2.25 times more elementary rotations.
• Radix-10 BKM (ref [4])
– log and exp in the complex domain (similar to combined CORDIC + log + exp).
– More complex iterations: 1 iteration/digit; initial overhead; $d_i^x$, $d_i^y$ take values in {-6,…,0,…,6}.
– Main drawbacks: complex schedule in the DFPU and high storage requirements (380 Kbits vs 14 Kbits for fixed point).

Comparison with Table-based Polynomial Approx.

Scheme        # digits look-up table   Polynom. degree   Latency (# cycles)   Storage (K BCD digits)
LuT+Poly      2                        10                560                  120 (one function)
LuT+Poly      1                        14                880                  4.5 (one function)
Our CORDIC    -                        -                 300-400              9.4 (several functions)

Conclusions
• Floating-point decimal CORDIC to compute transcendentals on long formats (Dec128).
• Constant scale factor, and number of iterations = number of bits of the input operands (as in standard binary CORDIC).
• Novel decimal angle set to support a unified algorithm for circular and hyperbolic coordinates.
• Floating-point extension and mapping to a state-of-the-art DFPU.
• Redundant version of the algorithm to use a simple 3-2 decimal carry-save adder.
• Compares favourably with other decimal CORDIC proposals.
• Reduces latency and/or storage compared to table-driven polynomial implementations.
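The convergence statements made for the different angle sets can be checked numerically. The sketch below is only a floating-point sanity check under stated assumptions, not a proof: the infinite sum in the convergence condition is approximated by a 20-digit tail, only the angles of the first three digits are tested, and the per-digit weights are passed in iteration order (so R = 1, 5, 2, 2 indexed from i = 1 becomes 5, 2, 2, 1). The helper names `angle_set` and `converges` are hypothetical.

```python
import math

def angle_set(weights, inverse, digits=20):
    """Angles inverse(C * 10**-j) for j = 1..digits, in iteration order."""
    return [inverse(c * 10.0 ** -j)
            for j in range(1, digits + 1)
            for c in weights]

def converges(weights, inverse, check_digits=3):
    """Check alpha_i < sum of all later angles for the leading digits."""
    alphas = angle_set(weights, inverse)
    # suffix[i] = alphas[i] + alphas[i+1] + ... (summed from the small end)
    suffix = [0.0] * (len(alphas) + 1)
    for i in range(len(alphas) - 1, -1, -1):
        suffix[i] = suffix[i + 1] + alphas[i]
    n_check = check_digits * len(weights)
    return all(alphas[i] < suffix[i + 1] for i in range(n_check))

# Decimal angle sets from the slides.
circular_5211 = converges((5, 2, 1, 1), math.atan)
hyperbolic_5211 = converges((5, 2, 1, 1), math.atanh)
hyperbolic_5221 = converges((5, 2, 2, 1), math.atanh)

# Early decimal CORDIC: atan(10**-j) once per digit, then repeated 9 times.
single_rotation = converges((1,), math.atan)
nine_repeats = converges((1,) * 9, math.atan)
```

Under these assumptions the check agrees with the slides: the 5, 2, 1, 1 weights pass for circular angles but fail for hyperbolic ones, the extra redundancy of a second weight 2 restores convergence in the hyperbolic case, and the single-rotation set of the early calculators fails unless each rotation is repeated nine times.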
