Computation of Decimal Transcendental Functions Using the CORDIC Algorithm
Álvaro Vázquez, Julio Villalba* and Elisardo Antelo
University of Santiago de Compostela, *University of Málaga, Spain
19th IEEE Symposium on Computer Arithmetic, Portland (USA), June 8-10, 2009
Motivation
• Renewed interest in decimal floating point (IEEE 754-2008; hardware implementations).
• Research now mainly focused on basic floating point operations (+/-, x, /).
• We consider transcendentals (sin, cos, exp, log…) to fully support the IEEE 754-2008 Decimal Number System for DPD formats.
• Decimal multiplication is slow -> CORDIC, which needs only shifts and additions, might be an option for transcendentals.
Commercial Hardware Implementations
Cycles required for execution with Dec128 operands (Power6, 13 FO4 cycle, 5 GHz @ 65 nm SOI):
Operation            Cycles
FP add/sub           11 to 19 (depending on the specific case)
FP multiplication    21 + 2N (worst case: 89), N = number of digits excluding leading zeros
FP division          154
FXP add/sub          2
Transcendentals
• Transcendental functions: sin, cos, …
• We aim to implement transcendentals based on CORDIC (serial method).
• Why? Decimal multiplication is slow, so table look-up plus polynomial approximation might be slow.
• This resembles the context in which CORDIC was used in x86 processors.
CORDIC
Rotation mode (circular): inputs are a vector and an angle; output: rotated vector.
Vectoring mode (circular): input is a vector; output: modulus and angle.
Circular functions: sin, cos, tan, norm of a vector, $\tan^{-1}(y/x)$, …
Extension to "hyperbolic rotation": sinh, cosh, sqrt, $\tanh^{-1}(y/x)$, ln, exp, …
CORDIC (continued)
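Rotation decomposed into elementary rotations:
\[
\begin{pmatrix} \cos\alpha & \sin\alpha \\ -\sin\alpha & \cos\alpha \end{pmatrix}
=
\begin{pmatrix} \cos\alpha_1 & \sin\alpha_1 \\ -\sin\alpha_1 & \cos\alpha_1 \end{pmatrix}
\begin{pmatrix} \cos\alpha_2 & \sin\alpha_2 \\ -\sin\alpha_2 & \cos\alpha_2 \end{pmatrix}
\cdots
\]
Rotation angle decomposed as a sum of elementary rotation angles: $\alpha = \sum_i \alpha_i$
Each elementary rotation factors as
\[
\cos(\alpha_i) \begin{pmatrix} 1 & \tan(\alpha_i) \\ -\tan(\alpha_i) & 1 \end{pmatrix}
\]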
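Radix-2 CORDIC: basic step
Basic step: elementary rotation with increase in the modulus. The factor $\cos(\alpha_i)$ is left out, and (x[i], y[i]) is taken to (x[i+1], y[i+1]) by
\[
\begin{pmatrix} 1 & \tan(\alpha_i) \\ -\tan(\alpha_i) & 1 \end{pmatrix}
\]
With $\alpha_i = +\tan^{-1}(2^{-i})$ or $-\tan^{-1}(2^{-i})$ the basic step becomes
\[
\begin{pmatrix} 1 & \sigma_i 2^{-i} \\ -\sigma_i 2^{-i} & 1 \end{pmatrix}, \qquad \sigma_i = +1 \text{ or } -1
\]
$\cos(\alpha_i)$ is independent of $\sigma_i$, so $K = \prod_i \cos(\alpha_i)$ is a constant scale factor -> precomputed and compensated after the basic steps.
CORDIC Convergence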
Worst case: the vector is already on the target position (remainder angle = 0) when the next elementary rotation $\alpha_i$ takes (x[i], y[i]) to (x[i+1], y[i+1]). The next residual angle value is then $\alpha_i$.
Convergence condition: $\alpha_i < \sum_{k=i+1} \alpha_k$
Our Decimal CORDIC
• We want constant scale factor -> precomputed and simple to compensate.
• Constant scale factor means two possible angles (same magnitude, different sign) for each elementary rotation.
• Conventional radix-2 CORDIC is not well suited to decimal operands.
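The last point can be illustrated with packed BCD, where each decimal digit occupies four bits. The sketch below is only an illustration under that assumption; the helper names `to_bcd` and `from_bcd` are hypothetical and do not come from the slides. A shift by a whole digit (four bits) divides by ten, whereas the one-bit shift of a radix-2 CORDIC step can produce an invalid digit code.

```python
def to_bcd(n: int) -> int:
    """Pack a non-negative integer into BCD, one decimal digit per 4-bit nibble."""
    bcd, shift = 0, 0
    while True:
        n, digit = divmod(n, 10)
        bcd |= digit << shift
        shift += 4
        if n == 0:
            return bcd


def from_bcd(bcd: int) -> int:
    """Unpack a BCD-coded integer; raise ValueError on an invalid nibble (> 9)."""
    value, weight = 0, 1
    while bcd:
        nibble = bcd & 0xF
        if nibble > 9:
            raise ValueError(f"invalid BCD digit {nibble:#x}")
        value += nibble * weight
        weight *= 10
        bcd >>= 4
    return value


# A 4-bit shift drops one decimal digit: 1234 -> 123 (division by 10).
digit_shift = from_bcd(to_bcd(1234) >> 4)

# A 1-bit shift of BCD 18 (0001 1000) gives 0000 1100, not BCD 9 (0000 1001).
bit_shift = to_bcd(18) >> 1
```

Calling `from_bcd(bit_shift)` raises `ValueError`, because the nibble 1100 is not a decimal digit; halving a BCD operand needs a digit-wise correction rather than a plain shift.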
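BCD representation: x[i] and y[i] are stored as decimal digits of weights $10^{-k}$, $10^{-(k+1)}$, …, each coded in four bits of weights 8 4 2 1. The radix-2 basic step
\[
\begin{pmatrix} 1 & \sigma_i 2^{-i} \\ -\sigma_i 2^{-i} & 1 \end{pmatrix}
\]
multiplies by $2^{-i}$, which is not a simple right shift on such operands.
Elementary angles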
Key issue: elementary angles set
Requirements:
• Two possible angles per iteration: $+\alpha_i$ and $-\alpha_i$ -> constant scale factor.
• $\tan(\sigma_i \alpha_i)$ proportional to a power of ten -> simple right shift for BCD operands.
\[
\cos(\alpha_i) \begin{pmatrix} 1 & \tan(\sigma_i \alpha_i) \\ -\tan(\sigma_i \alpha_i) & 1 \end{pmatrix}
\]
Angle set for Dec. CORDIC
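Binary weights of a decimal representation (RNC8 paper [3]):
\[
\alpha_i = \tan^{-1}\!\left(\sigma_i\, C[i]\, 10^{-\lceil i/4 \rceil}\right), \qquad \sigma_i = -1 \text{ or } +1
\]
Values of C[i] within one decimal digit: 8, 4, 2, 1 or 5, 2, 1, 1. The second set is easier to implement (easy multiples for decimal).
Convergence: $\alpha_i < \alpha_{i+1} + \alpha_{i+2} + \cdots + \alpha_{i+k} + \cdots$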
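The convergence condition can be checked numerically for both digit codings. The following is a minimal sketch, not the authors' procedure: the function name `converges`, the truncation of the infinite sum after `digits` decimal digits, and the choice to test only the first few digits are assumptions made for illustration. Here $\alpha_i$ is taken as the magnitude $\tan^{-1}(C[i]\,10^{-\lceil i/4 \rceil})$.

```python
import math


def converges(weights, digits=12, check_digits=3):
    """Check alpha_i < sum of all later alpha_k for the first `check_digits` digits.

    `weights` holds the four C[i] values used within one decimal digit, in
    iteration order. The infinite sum is truncated after `digits` digits.
    """
    alphas = [math.atan(c * 10.0 ** -d)
              for d in range(1, digits + 1)
              for c in weights]
    return all(alphas[i] < math.fsum(alphas[i + 1:])
               for i in range(4 * check_digits))


ok_8421 = converges((8, 4, 2, 1))   # binary weights of a BCD digit
ok_5211 = converges((5, 2, 1, 1))   # weights built from easy decimal multiples
```

Both sets pass this check, while a set with too little redundancy, for example `converges((1, 1, 1, 1))`, fails it.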
Contributions in this work
• Extend the algorithm to support hyperbolic coordinates ($\tanh^{-1}$ angles).
• Extend it to floating point.
• Map the algorithm to a state-of-the-art decimal floating-point unit (DFPU).
• Algorithm with redundant adder (carry-save).
Angle set for circular and hyperbolic
Circular: $\alpha_i = \tan^{-1}\!\left(\sigma_i\, C[i]\, 10^{-\lceil i/4 \rceil}\right)$
Hyperbolic: $\alpha_i = \tanh^{-1}\!\left(\sigma_i\, C[i]\, 10^{-\lceil i/4 \rceil}\right)$
C[i] = R[i mod 4] = 1, 5, 2, 1 for both: it does not assure convergence.
C[i] = R[i mod 4] = 1, 5, 2, 2: enough redundancy to assure convergence for hyperbolic coord.
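A third-order estimate suggests why the extra redundancy is needed in hyperbolic coordinates. This is a sketch, not taken from the slides: it treats $\alpha_i$ as a magnitude, looks only at the last elementary rotation of decimal digit $d$ (where $C[i] = 1$), and drops terms of order $10^{-5d}$ and higher.

\[
\begin{aligned}
\tan^{-1}(t) &\approx t - \tfrac{t^3}{3}, \qquad \tanh^{-1}(t) \approx t + \tfrac{t^3}{3} \\
\text{weights } 5,2,1,1:\quad \sum_{e>d} (5+2+1+1)\,10^{-e} &= 10^{-d}, \qquad \sum_{e>d} \tfrac{5^3+2^3+1+1}{3}\,10^{-3e} = \tfrac{45}{999}\,10^{-3d} \\
\text{circular:}\quad 10^{-d} - \tfrac{1}{3}\,10^{-3d} &< 10^{-d} - \tfrac{45}{999}\,10^{-3d} \quad \text{(condition holds)} \\
\text{hyperbolic:}\quad 10^{-d} + \tfrac{1}{3}\,10^{-3d} &> 10^{-d} + \tfrac{45}{999}\,10^{-3d} \quad \text{(condition fails)} \\
\text{weights } 5,2,2,1:\quad \sum_{e>d} (5+2+2+1)\,10^{-e} &= \tfrac{10}{9}\,10^{-d} > \tanh^{-1}(10^{-d})
\end{aligned}
\]

For $d = 1$ this gives $\tanh^{-1}(0.1) \approx 0.10034$ against a sum of later angles of about $0.10005$ with weights 5, 2, 1, 1, so the residual angle could not be cancelled.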
Unified Algorithm
Roughly one iteration per bit of the input operands (4 iter./digit).
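The iteration equations of this slide are not reproduced in the text, so the following is only a hedged sketch of a unified rotation-mode iteration. It uses the textbook unified CORDIC recurrences with the decimal angle sets quoted above. The function name `decimal_cordic_rotate`, the sign convention, the use of binary floats instead of BCD operands, and the choice R = 1, 5, 2, 1 for circular and R = 1, 5, 2, 2 for hyperbolic coordinates are all assumptions.

```python
import math


def decimal_cordic_rotate(z, c=1, digits=10):
    """Rotation mode with 4 elementary rotations per decimal digit.

    c = 1: circular, returns (cos z, sin z).
    c = -1: hyperbolic, returns (cosh z, sinh z).
    Assumes |z| is within the convergence range (0.5 is safe in both cases).
    """
    R = (1, 5, 2, 1) if c == 1 else (1, 5, 2, 2)   # C[i] = R[i mod 4]
    x, y, k = 1.0, 0.0, 1.0
    for i in range(1, 4 * digits + 1):
        t = R[i % 4] * 10.0 ** -math.ceil(i / 4)   # C[i] * 10^-ceil(i/4)
        alpha = math.atan(t) if c == 1 else math.atanh(t)
        sigma = 1.0 if z >= 0 else -1.0            # direction from the sign of z
        x, y = x - c * sigma * t * y, y + sigma * t * x
        z -= sigma * alpha
        k *= math.sqrt(1.0 + c * t * t)            # scale factor, independent of sigma
    return x / k, y / k                            # compensate the constant scale factor


cos_z, sin_z = decimal_cordic_rotate(0.5)
cosh_z, sinh_z = decimal_cordic_rotate(0.5, c=-1)
```

Because every iteration rotates by either $+\alpha_i$ or $-\alpha_i$, the factor `k` does not depend on the input; in hardware it would be precomputed rather than accumulated as done here.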
c=1 (circular); c=-1 (hyperbolic); C[i] takes values 5, 2 or 1.
Floating-Point Extension
Range reduction: standard methods (see paper).
Functions: sin/cos, $\tan^{-1}$, sinh/cosh, $e^f$, $10^f$, $\tanh^{-1}$, ln(f), log(f), sqrt
Result after range reduction, for one of the inputs:
$y_{in} = M_{yin}\, 10^{-E_{yin}}$ or $z_{in} = M_{zin}\, 10^{-E_{zin}}$, with $|M_{zin}|, |M_{yin}| \in [1, 10)$ and $E_{yin}, E_{zin} \ge 0$
We need a floating-point version of the iterations:
Scale the y or z iterations (or both) to remove the $E_{yin}$ or $E_{zin}$ leading zeros (or nines).
Floating-Point Extension (continued)
Floating-point iterations:
Scale the y and/or z iterations by $10^{E_{yin}}$ or $10^{E_{zin}}$. Start the iterations with index $J = 4E_{zin} - 3$ or $J = 4E_{yin} - 3$.
Funct                          P       Q
sin/cos; sinh/cosh (N=0)       Ezin    Ezin
sinh/cosh (N!=0)               0       Ezin
tan^-1, tanh^-1, ln, log       Eyin    Eyin
sqrt                           0       0
p decimal digits and 1 ulp accuracy: datapath of 1 integer and p+3 fractional digits, and m = p+1 (4(p+1) elementary rotations).
Mapping to a DFPU (for Decimal128 format)
Pipelining the Algorithm
Interleave x and y iterations for effective pipelining.
Pipelining the Algorithm: number of cycles for Decimal128
- Number of cycles without pre- and post-processing due to range reduction.
- Optimization performed: after about half of the rotations, perform a single final rotation, since cos and sin can be linearly approximated. This requires a multiplication for rotation and a division for vectoring, both of about m/2 digits.
Algorithm for Carry-Save Representation
[Figure: carry-save adder and CPA]
Perform the iterations with a redundant adder (may be used for division).
Why use a redundant adder?
Not for speed (actually the algorithm will be slower).
Our aim is to use a simple adder for the iteration instead of the complex carry-propagate adder.
Algorithm for Carry-Save Representation (continued)
Direction of rotation obtained as the sign of an estimate of y or z.
Redundant representation: the sign is estimated from a few leading digits.
[Figure: real value and estimated value of y or z near 0, with elementary angle $\alpha_i$; 2D[t] -> error of estimation, which can lead to a "wrong" direction of rotation]
Convergence condition (without rotation repetition): $\alpha_i + D[t] < \alpha_{i+1} + \alpha_{i+2} + \cdots + \alpha_{i+k} + \cdots$
Algorithm for Carry-Save Representation (continued)
Example for rotation mode:
Scale the recurrence to have sign information at the most significant digits.
Scaled recurrence for z:
Estimation of one fractional digit is enough for convergence (see paper): $D[t] < -\alpha_i + \alpha_{i+1} + \alpha_{i+2} + \cdots + \alpha_{i+k} + \cdots$
Angular slack > 0 due to redundant angular representation.
Pipelining the Redundant Algorithm
Schedule two shift-scale-add (3-2 CSA) operations for each x and y iteration, due to the redundant representation.
Related Works
• Early versions of decimal CORDIC (60's-70's)
– Pocket calculators.
– Elementary angle set: $\tan^{-1}(10^{-j})$
– Convergence not achieved: $\tan^{-1}(10^{-j}) > \sum_{k=j+1} \tan^{-1}(10^{-k})$
– Solution: repeat each elementary rotation 9 times: $\tan^{-1}(10^{-j}) < \sum_{k=j+1} 9 \times \tan^{-1}(10^{-k})$
– Comparison: 9 rots./digit vs 4 rots./digit, i.e. 2.25 times more elementary rotations.
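As a quick numerical check of the two inequalities for $j = 1$ (rounded values computed for this note, not taken from the slides):

\[
\tan^{-1}(10^{-1}) \approx 0.099669, \qquad
\sum_{k \ge 2} \tan^{-1}(10^{-k}) \approx 0.011111, \qquad
9 \sum_{k \ge 2} \tan^{-1}(10^{-k}) \approx 0.099997
\]

The single-rotation set misses the condition by almost a factor of nine, while nine repetitions satisfy it only narrowly. The cost ratio is $9/4 = 2.25$ elementary rotations per digit.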
Related Works (continued)
• Radix-10 BKM (ref[4])
– log and exp in the complex domain (similar to combined CORDIC + log + exp).
– More complex iterations: 1 iteration/digit; initial overhead; $d_i^x$, $d_i^y$ take values in {-6,…,0,…,6}.
– Main drawbacks: complex schedule in DFPU and high storage requirements (380 Kbits vs 14 Kbits for fixed point).
Comparison with Table-based Polynomial Approx.
Scheme        # digits look-up table   Polynom. degree   Latency (# cycles)   Storage (K BCD digits)
LuT+Poly      2                        10                560                  120 (one funct.)
LuT+Poly      1                        14                880                  4.5 (one funct.)
Our CORDIC    -                        -                 300-400              9.4 (several functions)
Conclusions
• Floating point decimal CORDIC to compute transcendentals on long formats (Dec128).
• Constant scale factor and number of iterations=number of bits of input operands (as in binary standard CORDIC).
• Novel decimal angle set to support a unified algorithm for circular and hyperbolic coordinates.
• Floating-point extension and mapping to a state of the art DFPU.
• Redundant version of the algorithm to use simple 3-2 decimal carry-save adder.
• Compares favourably with other decimal CORDIC proposals.
• Allows reducing latency and/or storage compared to table-driven polynomial implementations.