<<

Floating-Point Operator based on CORDIC 79

Floating-Point Division Operator based on CORDIC Algorithm

Pongyupinpanich Surapong1 and Faizal Arya Samman2, Non-members

ABSTRACT the complexity to implement the operation, especially when using floating-point data operands. Design and evaluation of a CORDIC (COordinate DIgital Computer) algorithm for a floating- Compared to other arithmetic operator, designing point division operation is presented in this paper. In a divider unit requires a special attention because of general, division operation based on CORDIC algo- the complexity to implement the operation, especially rithm has a limitation in term of the range of inputs when using floating-point data operands. The special that can be processed by the CORDIC machine to attention opens a challenging issue for many scientists give proper convergence and precise division opera- and researchers to introduce new efficient tion result. A hardware architecture of CORDIC al- and methods to design the divider unit. [1] gorithm capable of processing broader input ranges Division operations are often used in many sci- is implemented and presented in this paper by using entific computations of image and a pre-processing and a post-processing stage. The algorithms. Image convolution and Gaussian Filter- performance as well as the calculation error statis- ing are example computations [2] that require divider tics over exhaustive sets of input tests are evaluated. operator as presented in Equ. (1) and Equ. (2), re- The results show that the CORDIC algorithm can spectively. Equ. (1) is the equation of pixel output of be well-convergence and gives precise division opera- a 3 × 3 convolution masks window, where wi is the tion results with broader input ranges. The proposed weight of the j adjacent input image pixel within the hardware architecture is modeled in VHDL and syn- window. thesized on a CMOS standard-cell technology and a FPGA device, resulting 1 GFlops on the CMOS and ∑ 210.812 MFlops on the FPGA device. 9 p w P = ∑i=0 i i (1) 9 w Keywords: Floating-Point Operators, Accelerator i=0 i , Product-of-Sum, Sum-of-Product, 32-bit IEEE Standard Single-Precision. Another important equation with division opera- tion is presented by the 2D Gaussian distribution as 1. INTRODUCTION shown in Equ. (2), where x and y is the two dimen- In modern digital , floating- sional image signal and σ is the standard deviation point arithmetic units have been important compo- of the distribution. The Gaussian filtering is used to nents to improve the performance of the digital com- “blur” images and remove detail and noise [2]. puter. Arithmetic components such as /sub- tractor, multiplier and divider units are generally 1 − x2+y2 basic operators in a scientific computations beside 2σ2 G(x, y) = 2 e (2) other such as , cosine, 2πσ , exponent, etc. Compared to fixed point arithmetic units, the floating-point arithmetic units In adaptive digital signal processing applications provides better accuracy and precision and it cov- for instance, division operations are used in a nor- ers larger data ranges, which is suitable for scien- malized adaptive least-mean-square (LMS) algorithm tific computations in engineering application areas. presented in Equ. (5) to update the parameters of an Compared to other arithmetic operator, designing a adaptive filter. The filter output signal, Equ. (3), is divider unit requires a special attention because of compared to the desired system model output sig- nal (d), resulting in an error signal, Equ. (4). This Manuscript received on July 2, 2012. error signal is then used to drive the adaptive filter 1 The author is with Ramkhamhaeng University, Fac- parameters, in such a way that finally the adaptive ulty of Engineering, Department of Computer Engineering, Ramkhamhaeng Road, Hua Mark, Bangkapi, Bangkok 10240, filter parameters (wj) will be equal (almost equal in Thailand , E-mail: [email protected] practice) to system model parameters. The signal di- 2 The author is with Universitas Hasanuddin at Makassar, visor in this case (γ + x(k − j)2) is used to improve Faculty of Engineering, Department of Electrical Engineering, Jl. Perintis Kemerdekaan Km. 10, Tamalanrea, Makassar the stability of the adaptive parameter identification 90245, Indonesia. , E-mail: [email protected] algorithm. 80 ECTI TRANSACTIONS ON COMPUTER AND INFORMATION TECHNOLOGY VOL.7, NO.1 May 2013

malization or division by convergence [12]. The dis- advantage of the Goldschmidt’s algorithm in term of N∑tap the area overhead is the need for two independent − ∀ ∈ y(k) = wj(k)x(k j), j Ψ (3) parallel . As we know, a multiplier re- j=1 quires large number of logic area, especially when it e(k) = d(k) − y(k) (4) is implemented in floating-point arithmetic. β e(k)x(k − j) 5. Newton-Raphson Algorithm: The Newton- w (k + 1) = w (k) + , ∀j ∈ Ψ (5) j j γ + x(k − j)2 Raphson division algorithm is almost similar with the Goldschmidt’s algorithm. In the Newton-Raphson 2. STATE-OF-THE-ARTS OF DIVISION method however, the iterative refinement is applied METHODS only to the reciprocal value of the divisor, which will be convergent after several iterations [13]. The divi- In this section, we will present brief descriptions sion operation of the Newton-Raphson method can on the state-of-the-art of the methodologies or algo- be divided into three steps, i.e. the initial estimation rithms to implement the binary division operation. of the divisor’s reciprocal, the iterative refinement of The methodologies are described as follows. the divisor’s reciprocal and the multiplication step be- 1. Adder-Cell-based Method: The design of divi- tween the divided and the final convergent divisor’s sion operator using the adder-cell-based method will reciprocal. The work in [14] has presented for ex- always result in a very compact divider architecture. ample a decimal floating-point divider using newton- This method is classified as non iterative technique, raphson iteration, where an accurate piece-wise linear where the divider unit consists of half-adder and full- approximation is used to obtain an initial estimate of adder cells as well as other units and sup- a divisor’s reciprocal. porting modules [3]. A binary divider that uses carry- 6. CORDIC Algorithm: Beside the aforemen- save adder units is presented for example in [4]. tioned method to implement the division operation, 2. Digit Recurrence Algorithm: In modern float- there is also another powerful algorithm to implement ing point arithmetic units the most common algo- the divider unit called CORDIC (COrdinate Rotation rithm employed to division function is a digit recur- DIgital Computer) algorithm [15]. Like digit recur- rence algorithm [5] [6] [7]. The algorithm performs rence method, CORDIC is also classified into iter- both operations based on shifting and ative method. The main powerful characteristic of as the fundamental operators. A combined floating- the CORDIC algorithm is the capability to imple- point square-root and division operation can also be ment several trigonometric function [16], [15], phase implemented by using a subtractive SRT (Sweeney, and magnitude functions [17], and hyperbolic func- Robertson and Tocher) algorithm [8], which can be tions [18] as well as linear operational function such classified as a digit recurrence algorithm. The sub- as multiplication and division functions. tractive SRT algorithm can be extended by using By using CORDIC algorithms, we can also easily im- Radix-8 IDS (Interleaved Digit Set) algorithm to im- plement all the function in a single CORDIC hard- prove the performance of the traditional digit re- ware architecture [19]. Some basic standard and currence algorithm. Another variant of the digit- non-standard operators such as sum-of-product and recurrence method is svoboda algorithm. A new product-of-sum [20], which can be used to accelerate Svoboda-Tung Division algorithm is for instance pro- floating-point operations, can also be implemented by posed in [9]. using CORDIC algorithm. The work in [21] presents 3. Taylor’s Series Expansion Algorithm: A Tay- for example a flexible FPGA implementation of a pa- lor’s Series Expansion Algorithm [10] for example can rameterizable floating-point library allowing to com- be used to calculate division operation using a sequen- pute the sine, cosine or arctangent functions. The tial series of a harmonic equation. However, the Tay- CORDIC algorithm can also even be used to solve lor’s Series Expansion algorithm is rarely used and problematic operation in a fuzzy logic controller cir- perform slow computation to calculate the division cuit [22]. Moreover, the CORDIC algorithm can operations. be used to implement a unified frequency analysis 4. Goldschmidt’s Algorithm: The basic idea be- or transformations functions such as DFT (Discrete- hind the Goldschmidt’s Algorithm is the iterative Fourier Transform), DHT (Discrete Hartley Trans- parallel multiplication of the dividend and divisor by form), DCT (Discrete Cosine Transform) and DST updated factors in such as a way that the final divi- (Discrete Sine Transform) [23]. sor will be driven to one. Thus, the final dividend gives the quotient (the division result). Oberman et There are two main issues, in which CORDIC al- al. for example [11] proposes a floating-point divider gorithm is preferable to design of the floating-point and for AMD-K7 by using Goldschmidt’s division operator i.e., algorithm. The Goldschmidt’s algorithm has been 1. The benefit of CORDIC Algorithm: The broadly used on many commercial CORDIC algorithm provides advantages in the per- and is also known as division by multiplicative nor- formability of fundamental function for scientific and Floating-Point Division Operator based on CORDIC Algorithm 81

engineering, the low algorithmic complexity, and the 20 simplicity for VLSI implementation. The VLSI im- Matlab plementation is simple because it uses only shift and 15 CORDIC add operations. The area cost of the CORDIC al- gorithm is certainly lower than the Newton-Raphson 10 Algorithm, Goldschmidt’s Algorithm and Taylor’s Se- ries Expansion Algorithm, which require multiply and 5

add operations. Compared to the Digit-Recurrence Y/X algorithm, CORDIC algorithm is slightly simpler. 0 2. Design alternative: From the previous works, −5 there is few research dealing with design and inves- tigation of a divider based on CORDIC algorithm. −10 Therefore, this paper proposes the algorithm, design, −15 and architecture of the floating-point divider based −1.5 −1 −0.5 0 0.5 1 1.5 on CORDIC algorithm and the analysis and investi- Θ gation on its error characteristics. The CORDIC algorithm is however lack in compu- Fig.1:: The CORDIC divergence problem. tational latency and convergence range. These disad- vantage can be alleviated by the techniques proposed convergence value, when the division results are out- in [15]. The convergence of the CORDIC algorithm − ≤ Y ≤ side the following ranges, i.e. 0.9647 X 2.9647. can be accelerated by duplicate and triplicate the mi- Based on the above mentioned facts, we propose cro (angle) rotation on each stage of the CORDIC a solution to improve the range of inputs that can iterative algorithm. guarantee the precision and the convergence of the The rest of the paper is organized as follows. Sec- CORDIC algorithm to perform a floating-point divi- tion 3.shows the architecture of a CORDIC algorithm sion operation. Firstly, we will introduce the basic of especially used for division operations. The main rea- the CORDIC algorithm as described in the following. sons why we propose a modification of CORDIC al- gorithm are also presented in this Section. Section 4. 3.1 Basic Radix-2 CORDIC Division Algo- describes the performance analysis of the CORDIC rithm divider by giving a wide range of inputs. The con- vergence and calculation errors are presented in this The basic radix-2 CORDIC iteration algorithm section. Synthesis results of the CORDIC divider core can be written as follows [19]. modeled in VHDL by using 130-nm CMOS standard- Xi+1 = Xi − m · µi · Yi · δm,i cell technology and by using an FPGA device are pre- Y = Y + µ · X · δ sented in this section. i+1 i i i m,i Zi+1 = Zi − µi · δm,i (6) 3. CORDIC-BASED DIVIDER ARCHITEC- The CORDIC implementation for Divider function TURE can be performed by configuring the CORDIC algo- There are two basic reasons why a new CORDIC rithm into linear mode of operation (linear modus), −i algorithm is proposed in this paper. i.e. by setting m = 0 in the Equ. (6) and δm,i = 2 . 1. The CORDIC algorithm gives correct convergence Hence, the CORDIC iterative equation for the Divi- when the expected division results are located in the sion operator is shown in Equ. (7). − ≤ Y ≤ following ranges: 0.9647 Q = X 2.9647 (based on our Matlab simulation results). The values outside Xi+1 = Xi the range will tend to saturate at unexpected division Y = Y + µ · X · 2−i operation results. Please see the experimental result i+1 i i i − · −i shown in Fig. 1. Zi+1 = Zi µi 2 (7) 2. We cannot identify, whether the division results The set of µ values is µ = {−1, +1}, which depends are in the aforementioned range or not, unless the i i on the value of Yi as shown in Equ. (8). division has been made. { Now let us see the divergence problem as presented +1 : Yi < 0 in Fig. 1, which is obtained from our simple experi- µi = (8) −1 : Y ≥ 0 mental. Two curves are presented in the figure, i.e. i the ideal (MATLAB) output and the division results using CORDIC algorithm. The figure presents the 3.2 The Proposed CORDIC Division Algo- θ-axis and the Y/X-ordinat, where Y = sin θ and rithm X = cos θ, θ in unit. The simulation results Based on our experience, the domain of well- present that the division operations cannot move to a convergence of the CORDIC inversion function is 82 ECTI TRANSACTIONS ON COMPUTER AND INFORMATION TECHNOLOGY VOL.7, NO.1 May 2013

Dividend, Y Divisor, X Algorithm 1 [EC ,EX0]=DivisorDetect(X), where ≥ S Y EY MY S X EXXM X 1 1: Yo=1, Zo=0, Xo=X, S0=-1 {Initialization} 2: EX = Exponent of the Input Divisor X 3: if EX < 126 then Exponent 4: EC = 126 − EX Detection 5: EX0 = 126 6: else if E > 127 then ECE XO X 7: EC = EX − 127 8: EX0 = 127 Inversion 9: else (CORDIC Algorithm) 10: EC = 0 11: EX0 = EX EMZZ 12: end if 13: return EC ,EX0 Alignment Algorithm 2 Z=Inverting(X,I), where X ≥ 1 ZA 1: Y0=1, Z0=0, X0=Xnew, S0=-1 {Initialization} Multiplication 2: for i = 0 to I − 1 do 3: Xi+1 = Xi −i Q 4: Yi+1 = Yi + (Xi × Si × 2 ) −i 5: Zi+1 = Zi − (Si × 2 ) Fig.2:: The floating-point divider architecture. 6: if Y < 0 then 7: Si+1=+1 8: else 9: Si+1=-1 shown in Equ. (9). If the domain is written in the 10: end if IEEE binary floating-point standard, the domain can 11: end for 12: return Z be described in Equ. (10) as well as Equ. (11) and Equ. (12) present the exponent and mantissa faction.

Yd Qd = = 2.77778. The floating point formats of Xd 134 0.5 ≤ X < 1.5 (9) the input signals are Xfp = 1.40625 × 2 , where × 126 ≤ × 127 SX = 0, MX = 1.40625 and EX = 134, meanwhile 1.0 2 X < 1.5 2 (10) 128 Yd = 1.38889 × 2 , where SY = 0, MY = 1.38889 and EY = 128. The step-by-step computations made by using our proposed algorithm is as follows: ≤ ≤ 1.0 MX 1.5 (11) 1. Divisor’s Exponent Detection: By using Alg. 1, ≤ ≤ 126 EX 127 (12) the credit exponent is obtained as EC = EX − 127 = 134 − 127 = 7. The complete division operation Q = Y and its X 2. Divisor Inversion: By using Alg. 2, inversion re- hardware architecture are proposed in this paper, 1 1 × 126 sult gives Z = = × 127 = 1.42222 2 . where Q is the division operation result, Y is the Xnew 1.40625 2 3. Alignment: In this third stage, we have ZA = dividend and X is the divisor, The hardware archi- − 1.42222 × 2EZ +EC = 1.42222 × 2126 7 = 1.42222 × tecture is classified into four main stages as presented 2119. in Fig. 2. The description of the floating-point hard- 4. Multiplication: Finally the division result is Q = ware architecture is described as follows. Y × Z = 1.9531250 × 1.42222 × 2135+119−127 = 1. Divisor’s Exponent Detection: In this first stage, A 2.77778 × 2127 or according to the normalized IEEE a credit exponent E and e new exponent E for X C X0 binary floating-point standard, Q = 1.38889×2128 as are computed by using Alg. 1. expected. 2. Divisor Inversion: In this second stage the inverse value of the new input divisor X = (−1)S ×M × new X X 4. PERFORMANCE ANALYSIS 2EX0 is computed by using CORDIC algorithm pre- sented in Alg. 2 to obtain the variable Z. The CORDIC Division function has been evalu- 3. Alignment: In this third stage, ZA is computed ated by feeding input signals ranging for low mag- from the Z variable (computed from Alg. 2) whose nitude until high magnitude. Fig. 3 shows how the exponent is aligned by using the credit exponent EC Divider Hardware unit gives output signals Q by in- 1 (computed from Alg. 1). By using a formal floating verting input signals X (Q = X , where Dividend Y 1 SZ point equation, then we have ZA = (−1) × 1.MZ × is set 1). The input signals are incremented linearly EZ −EC 2 , where SZ = SX . from X = 0.1 until X = 33. The figure shows also 4. Multiplication: Finally we will have the complete the comparison of the Divider Hardware and Matlab division result as Q = Y × ZA. Output (Real calculated output) as presented in the Example: If we have decimal numbers Xd = 180 upper diagram of Fig. 3. The absolute calculation and Yd = 500 then the CORDIC divider should give errors are shown in the bottom diagram of Fig. 3. Floating-Point Division Operator based on CORDIC Algorithm 83

Input Signal X,Y 3000 2000 1000 CORDIC Output Signal 0 Input X 1.8 Hardware Output −1000 Matlab Output 1.6 −2000 Input Y 1.4 −3000 1.2 Input Signal X,Y −4000 1 −5000 1000 1500 2000 2500 3000 3500 4000 0.8 Set Number of Input Signal (X,Y) 0.6 CORDIC Hardware Output Signal −1 0.4 −2 Cordic Inversion Output 0.2 −3 0 0 5 10 15 20 25 30 −4 Input Signal (X) −5 −6 CORDIC Error Calculation Signal −7 3.5e−05 −8 Hardware Output Cordic Division Output Q −9 3e−05 1000 1500 2000 2500 3000 3500 4000 Set Number of Input Signal (X,Y) 2.5e−05 CORDIC Matlab Output Signal 2e−05 −1 1.5e−05 −2 −3 1e−05 −4 5e−06 −5 Cordic Error Calculations −6 0 0 5 10 15 20 25 30 −7 Input Signal (X) −8 Matlab Output

Cordic Division Output Q −9 1000 1500 2000 2500 3000 3500 4000 Fig.3:: Measurements of the CORDIC Division with Set Number of Input Signal (X,Y) 1 CORDIC Error Calculation Signal unit Dividend (Q = X ) and the calculation errors by 0.0006 testing of the 33 sets of inputs X. 0.0005 0.0004 0.0003 0.0002 0.0001 0

Cordic Error Calculations −0.0001 1000 1500 2000 2500 3000 3500 4000 Set Number of Input Signal (X,Y) CORDIC Output Signal 0.01 Hardware Output Fig.5:: Evaluation of the CORDIC Division oper- Matlab Output Y 0.0095 ation (Q = X ) and the calculation errors by incre- menting both input operands. 0.009

0.0085 Fig. 4 presents also another simulation result with 0.008 different input ranges, i.e. from X = 100 until Cordic Inversion Output X = 133. The simulation results shown in Fig. 3 and 0.0075 100 105 110 115 120 125 130 Fig. 4 are presented in this case to show the evalua- Input Signal (X) tion result of the inversion step in the CORDIC-based CORDIC Error Calculation Signal division operation. The evaluation of the division op- 5e−07 4.5e−07 erations itself will be presented in the next figure. 4e−07 Fig. 5 presents a simulations result of division op- 3.5e−07 erations, i.e. Q = Y , where the input signal X as 3e−07 X 2.5e−07 the divisor and input signal Y as the dividend are 2e−07 shown in the first and second lines of the diagrams 1.5e−07 1e−07 shown in the figure. Both input signals are incre-

Cordic Error Calculations 5e−08 mented to evaluate the CORDIC division operations 0 100 105 110 115 120 125 130 in different input ranges. The third and the fourth Input Signal (X) lines of the diagrams are the calculation results of the Fig.4:: Measurements of the CORDIC Division with floating-point-based CORDIC Divider implemented 1 using VHDL model and the ideal Matlab output, re- unit Dividend (Q = X ) and the calculation errors with another testing of the 133 sets of input X. spectively. The diagram in the bottom line is the ab- solute calculation errors between the CORDIC Hard- ware and the Matlab output. Another simulation result is presented in Fig. 6 84 ECTI TRANSACTIONS ON COMPUTER AND INFORMATION TECHNOLOGY VOL.7, NO.1 May 2013

Table 1:: Statistical error of CORDIC-based and traditional inverse calculation algorithm varied, where X and Y are varied from 0.1 to 500.

Iter. Max. Min. Abs. Mean Std. Deviation 8 3.9030 -15.6220 3.6760×10−2 0.2999 16 0.0152 -0.0610 1.5841×10−4 1.4705×10−3 32 1.0×10−4 -4.4680×10−5 6.1798×10−7 3.3635×10−6 64 1.0×10−4 -4.4680×10−5 6.1798×10−7 3.3635×10−6 128 1.0694×10−6 -8.0123×10−7 1.1888×10−7 1.9968×10−7

Input Signal X, Y Input Signal X, Y 500 500 450 450 400 Input Y 400 350 350 300 300 Input X 250 250 200 200 Input Y 150 150 100 Input X Input Signal X, Y

Input Signal X, Y 100 50 50 0 0 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 0 500 1000 1500 2000 Set Number of Input Signal (X,Y) Set Number of Input Signal (X,Y) CORDIC Hardware Output Signal 9 CORDIC Hardware Output Signal Hardware Output 1 8 0.9 7 0.8 6 0.7 5 0.6 4 0.5 3 0.4 0.3 2 0.2 1 0.1 Cordic Division Output Q 0 Hardware Output 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 Cordic Division Output Q 0 0 500 1000 1500 2000 Set Number of Input Signal (X,Y) Set Number of Input Signal (X,Y) CORDIC Matlab Output Signal CORDIC Matlab Output Signal 1 9 Matlab Output 8 0.9 7 0.8 0.7 6 0.6 5 0.5 4 0.4 3 0.3 2 0.2 1 0.1 Matlab Output

Cordic Division Output Q 0 0 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 Cordic Division Output Q 0 500 1000 1500 2000 Set Number of Input Signal (X,Y) Set Number of Input Signal (X,Y) CORDIC Error Calculation Signal CORDIC Error Calculation Signal 0.00045 3e−05 0.0004 0.00035 2.5e−05 0.0003 2e−05 0.00025 0.0002 1.5e−05 0.00015 1e−05 0.0001 5e−05 5e−06 0 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 0 Cordic Error Calculations

Cordic Error Calculations 0 500 1000 1500 2000 Set Number of Input Signal (X,Y) Set Number of Input Signal (X,Y)

Fig.6:: Evaluation of the CORDIC Division oper- Fig.7:: Evaluation of the CORDIC Division oper- Y Y ation (Q = X ) and the calculation errors by incre- ation (Q = X ) and the calculation errors by decre- menting the divisor X and decrementing the dividend menting the divisor X and incrementing the dividend Y . Y . and Fig. 7. In the simulation result as presented in Y is incremented. As shown in the figure, the divi- Fig. 6, the input divisor X is ramp up, while the input sion result of the CORDIC hardware tends to increase dividend Y is decremented. About 4500 sets of input exponentially, which in accordance with the pairs are presented in the figure. As shown in the fig- simulation result. However, in this test case scenario, ure, the division results of both the CORDIC hard- the calculation errors tend to increase. ware and the matlab simulation tend to decrease ex- Table 1 shows the statistical analysis results of ponentially, and the absolute calculation errors tends the CORDIC hardware over the computational er- also to decrease. rors compared to MATLAB simulation results. At Fig. 7 shows another simulation, where the input each number of iteration, the maximum, the mini- divisor X is decremented, while the input dividend mum and the absolute average errors as well as the Floating-Point Division Operator based on CORDIC Algorithm 85

standard-deviation of the errors are evaluated over Table 4:: Synthesis result using Virtex-2 FPGA de- 5000 sets of input samples. It seems that the calcu- vice (2vp30ff896-7) from Xilinx Corporation. lation errors decrease as the number of iterations is increased. Utilization % of Total Number of slice 1254 of 13696 9% 5. SILICON-AND FPGA-BASED SYNTHE- Number of MULT18X18 4 of 136 2% SIZED RESULT AND COMPARISON blocks Minimum Delay 9.487 ns The synthesis result of the floating-point CORDIC Maximum Frequency 105.406 MHz) Divider by using a 130-nm CMOS standard-cell tech- nology library from Faraday Technology Corporation is presented in Table 2. The synthesis result is made can be used for performance and efficiency evaluation by using target frequency of 500 MHz, resulting in in top-level design. The proposed floating-point di- a slack-time of about 1.92 ns. The performance can vision based on CORDIC method is compared with be still improved by using a newest and faster CMOS Goldschmit’s [11] and Taylor-Series Expansion [10] standard-cell technology. methods. The computational latency includes the The logic cell area of components in the CORDIC time to form the initial approximation and the appro- Hardware Divider is presented in Table 3. The to- priate number of iterations. The latency of the pro- tal area of the CORDIC divider is 40612 µm2. Most posed method is higher than Taylor-Series Expansion of logic area is occupied by the multiplier and align- method, but lower than Goldschmit’s method. When ment units. The rest 5% logic-cell area is occupied the number of basic floating-point operators are con- by combinatorial blocks. Since two parallel floating- sidered, the proposed CORDIC-based method applies point operations are performed by CORDIC core, the #1FPMUL and #2FPADD/SUB, where as Taylor- CORDIC has about 2 × 500 MHz = 1 GF lops (Giga Series Expansion method and Goldschmit’s method floating-point operation per second). consume #2FPMUL, #1FPADD/SUB, and #2FP- Table 4 shows the synthesis result of the floating- MUL, respectively. point CORDIC Divider on an FPGA device from Xil- inx Corporation. By using the Virtex-2 FPGA de- vice, the maximum data frequency of the CORDIC core is slower than the synthesis result on CMOS standard-cell technology. The maximum performance of the CORDIC on the Virtex-2 FPGA is about 2 × 105.406 MHz = 210.812 MF lops. A brief survey of several floating-point division im- plementations in published articles is summarized in Table 5. It is difficult to compare the designs and ar- chitectures where different design methodologies were applied, such as full custom and standard cell silicon technology approach. However, the iterative archi- tectural comparison based on the number of basic floating-point operators and computational latency

Table 2:: Synthesis results using 130-nm CMOS standard-cell technology library with target fre- quency of 500 MHz.

Measurements Synth. result Total logic cell area 0.0406 mm2 Slack time (critical path) 1.92 ns Switching power (1.32V) 2.178 mW Internal power (1.32V) 5.994 mW

Table 3:: Logic cell area of the components.

Component Cell area Percent. (µm2) (%). Top Module 40612 100.0 Inversion Module 14264 35.0 Multiplier+Alignment 24222 59.6 86 ECTI TRANSACTIONS ON COMPUTER AND INFORMATION TECHNOLOGY VOL.7, NO.1 May 2013

Table 5:: Floating-Point Division comparison of published literatures in single-precision.

Methodology Latency Basic FP Operator Speed CMOS Technology size Goldschmit’s [11] 16/13 #2FPMUL 2.3 GHz 65 nm Taylor-Series Expansion [10] 12/5 #2FPMUL+#1FPADD/SUB 500 MHz 65 nm The proposed CORDIC-based 14/8 #1FPMUL+#2FPADD/SUB 500 MHz 130 nm FPMUL : Floating-Point MULtiplier, FPFPADD/SUB : Floating-Point ADDer/SUBtractor

6. CONCLUSIONS AND FUTURE WORKS mar Conference on Signals, Systems and Com- puters, vol. 2, 2004, pp. 1849–1855. A CORDIC core implementing a floating-point di- [7] L. Montalvo, K. Parhi, and A. Guyot, “A vider operation has been presented in this paper. The Radix-10 Digit-Recurrence Division Unit: Algo- new algorithm is proposed to solve the limited input rithm and Architecture,” IEEE Trans. Comput- domains of the input ranges that can be solved with ers, vol. 56, no. 6, pp. 727–739, June 2007. well convergence by the traditional CORDIC algo- [8] I. Rust and T. Noll, “A digit-set-interleaved rithms to implement the division operations. The radix-8 division/square root kernel for double- CORDIC operation is basically divided into four precision floating point,” in IEEE International stages, i.e. divider exponent detection, divisor inver- Symposium on System on Chip (SoC), 2010, pp. sion, inversion result alignment and multiplication. 150–153. The proposed algorithm shows that the CORDIC can [9] L. Montalvo, K. Parhi, and A. Guyot, “New increase the input ranges that can guarantee the con- Svoboda-Tung Division,” IEEE Trans. Comput- vergence of the CORDIC algorithm. In future work, ers, vol. 47, no. 9, pp. 1014–1020, Sep. 1997. the core will be integrated within a streaming proces- [10] T.-J. Kwon, J. Sondeen, and J. Draper, sor [24]. Due to the reconfigurability of the CORDIC “Floating-point division and square root using a core, the implementations of many trigonometric and Taylor-series expansion algorithm,” in the 50th logarithmic functions, including the division opera- Midwest Symposium on Circuits and Systems tion on top of a single programmable/reconfigurable (MWSCAS 2007), 2007, pp. 305–308. CORDIC core will be helpful to reduce area of the [11] S. F. Oberman, “Floating Point Division and processor arithmetic units. Square Root Algorithms and Implementation in the AMD-K7 ,” in The 14th References IEEE Symposium on Computer Arithmetic, 1999, pp. 106–115. [1] B. Parhami, Computer Arithmetic: Algorithms [12] M. J. Schulte, D. Tan, and C. L. Lemonds, and Hardware Design. Oxford University Press, “Floating-Point Division Algorithms for an New York, Oxford, 2000. Microprocessor with a Rectangular Multiplier,” [2] D. Rao, S. Patil, N. Babu, and V. Muthuka- in Proc. Int’l. Conf. on Computer Design mar, “Implementation and Evaluation of Image (ICCD), 2007, pp. 304–310. Processing Algorithms on Reconfigurable Ar- [13] W. Gallagher and E. Swartzlander, “Fault- chitecture using C-based Hardware Description Tolerance Newton-Raphson and Goldschmidt Languages,” International Journal of Theoreti- Dividers using Time Shared TMR,” IEEE Trans. cal and Applied Computer Sciences, vol. 1, no. 1, Computers, vol. 49, no. 6, pp. 588–595, June pp. 9–34, 2006. 2000. [3] A. Beaumont-Smith and S. Samudrala, “Method [14] L.-K. Wang and M. Schulte, “Processing Unit and System of a Microprocessor Subtraction- Having Decimal Floating-Point Divider Us- Division Floating-Point Divider,” Patent US ing Newton-Raphson Iteration,” Patent US 7,127,483 B2, Oct. 24, 2006. 7,467,174 B2, Dec. 16, 2008. [4] D. J. Desmonds, “Binary Divider with Carry [15] P. Surapong, F. A. Samman, and M. Glesner, Save Adders,” Patent US 4,320,464, March 16, “Design and Analysis of Extension-Rotation 1982. CORDIC Algorithms based on Non-Redundant [5] S. F. Oberman and M. J. Flynn, “Division Al- Method,” International Journal of Signal Pro- gorithms and Implementations,” IEEE Trans. cessing, Image Processing and Pattern Recogni- Computers, vol. 46, no. 8, pp. 833–854, Aug. tion, vol. 5, no. 1, pp. 65–84, March 2012. 1997. [16] K. Maharatna, S. Banerjee, E. Grass, M. Krstic, [6] J. Ebergen, I. Sutherland, and A. Chakraborty, and A. Troya, “Modified Virtually Scaling-Free “New Division Algorithms by Digit Recurrence,” Adaptive CORDIC Rotator Algorithm and Ar- in Conference Record of the Thirty-Eighth Asilo- chitecture,” IEEE Trans. on Circuits and Sys- Floating-Point Division Operator based on CORDIC Algorithm 87

tems for Video Technology, vol. 15, no. 11, pp. Pongyupinpanich Surapong was born 1463–1474, Nov. 2005. in Prachinburi, Thailand. He received his Bachelor and Master of Engineering degree in Electrical Engineering from King Mongkut’s Institute of Technol- [17] F. A. Sammany, P. Surapong, C. Spies, and ogy Ladkrabang (KMITL), Thailand in M. Glesner, “Floating-point-based Hardware 1998 and 2002, respectively. He received Accelerator of a Beam Phase-Magnitude De- his PhD degree in 2012 from Technische Universit¨atDarmstadt, Germany. Cur- tector and Filter for a Beam Phase Control rently, he is a lecturer and research fel- System in a Heavy-Ion Synchrotron Applica- low at Department of Computer Engi- tion,” in Proc. Int’l Conf. on Accelerator and neering, Faculty of Engineering, Ramkhamhaeng Universitas, in Bangkok, Thailand. His research interests include computer- Large Experimental Physics Control Systems aided VLSI design, hardware modeling, design optimization al- (ICALEPCS), 2011. gorithm, circuit simulation, digital signal processing, system- on-chip, all in the context of field-programmable gate-array devices and VLSI technology. [18] H. Hahn, D. Timmermann, B. Hosticka, and B. Rix, “A Unified and Division-Free CORDIC Argument Reduction Method with Unlimited Convergence Domain Including Inverse Hyper- Faizal Arya Samman was born in bolic Functions,” IEEE Trans. on computers, Makassar, Indonesia. He received his Bachelor of Engineering degree in vol. 43, no. 11, pp. 1339–1344, Nov. 1994. Electrical Engineering from Universitas Gadjah Mada (UGM), Yogyakarta in 1999 and his Master of Engineering de- [19] J. Zhou, Y. Dou, Y. Lei, and Y. Dong, gree from Institut Teknologi Bandung “Hybrid-Mode Floating-Point FPGA CORDIC (ITB) in 2002. Since 2002 he has been appointed to be a research and teach- Co-processor,” in The 4th international work- ing staff at Universitas Hasanuddin in shop on Reconfigurable Computing: Architec- Makassar, Indonesia. He received his tures, Tools and Applications, (ARC’08), 2008, PhD degree in 2010 from Technische Universit¨atDarmstadt, Germany with scholarship award (2006-2010) from Deutscher pp. 256–261. Akademischer Austausch-Dienst (DAAD, German Academic Exchange Service). From 2010 until 2012, he was a postdoc- toral fellow in the research project in LOEWE-Zentrum AdRIA [20] P. Surapong, F. Philipp, F. A. Samman, (Adaptronik-Research, Innovation, Application) within the re- and M. Glesner, ““Improvement of Standard search cooperation framework between Technische Universit¨at and Non-Standard Floating-Point Operators”,” Darmstadt and Fraunhofer Institut LBF in Darmstadt. He is now a lecturer and research fellow at Department of Electrical ECTI Trans. Computer and Information Tech- Engineering, Faculty of Engineering, Universitas Hasanuddin, nology, vol. 6, no. 1, pp. 9–22, May 2012. in Makassar, Indonesia. His research interests include network on-chip (NoC) , NoC-based multiprocessor system-on-chip, design and implementation of analog and dig- [21] D. Munoz, D. Sanchez, C. Llanos, and M. Ayala- ital electronic circuits for control system applications on FP- GA/ASIC as well as energy harvesting systems and wireless Rincon, “FPGA based floating-point library sensor networks. for CORDIC algorithms,” in IEEE Conf. Pro- grammable Logic Conference, 2010, pp. 55–60.

[22] T. Lund, M. Aguirre, and A. Torralba, ““Mak- ing use of CORDIC and distributed arithmetic to produce a field-programmable fuzzy logic con- troller in an FPGA”,” in The 28th IEEE An- nual Conf. of the Industrial Electronic Society (IECON’02), 2002, pp. 3205–3208.

[23] B. Das and S. Banerjee, “Unified CORDIC- based chip to realise DFT/DHT/DCT/DST,” IEE Proceedings, Computers and Digital Tech- niques, vol. 149, no. 4, pp. 121–127, Nov. 2002.

[24] F. A. Samman, P. Surapong, and M. Glesner, “Reconfigurable streaming processor core with interconnected floating-point arithmetic units for multicore adaptive signal processing sys- tems,” in Proc. of the 6th International Work- shop on Reconfigurable Communication-centric Systems-on-Chip (ReCoSoC), 2011, pp. 1–6.