High Precision Division and Square Root

High Precision Division and Square Ro ot Alan H. Karp and Peter Markstein Hewlett-Packard Lab oratories 1501 Page Mill Road Palo Alto, CA 94304 We present division and square ro ot algorithms for calculations with more bits than are handled by the oating p oint hardware. These algorithms avoid the need to multiply two high precision numb ers, sp eeding up the last iteration byasmuch as a factor of ten. Wealsoshowhow to pro duce the oating p ointnumb er closest to the exact result with relatively few additional op erations. Categories and Sub ject Descriptors: G.1.0 [Numerical Analysis]: Computer Arithmetic; G.4 [Mathematical Software]: Algorithm Analysis General Terms: Floating p oint arithmetic Additional Key Words and Phrases: Quad precision, square ro ot, division 1. INTRODUCTION Floating p oint division and square ro ot take considerably longer to compute than addition and multiplication. The latter two are computed directly while the former are usually computed with an iterative algorithm. The most common approach is to use a division-free Newton-Raphson iteration to get an approximation to the recipro cal of the denominator (division) or the recipro cal square ro ot, and then multiply bythenumerator (division) or input argument (square ro ot). Done care- fully, this approach can return a oating p ointnumb er within 1 or 2 ULPs (Units in the Last Place) of the oating p ointnumb er closest to the exact result. With more work, sometimes a lot more, the error can b e reduced to half an ULP or less. Name: Alan H. Karp Aliation: Future Systems Department Address: alan [email protected] Name: Peter Markstein Aliation: Computer Systems Lab oratory [email protected] Address: p eter Permission to make digital or hard copies of part or all of this work for p ersonal or classro om use is granted without fee provided that copies are not made or distributed for pro t or direct commercial advantage and that copies show this notice on the rst page or initial screen of a display along with the full citation. Copyrights for comp onents of this work owned by others than ACM must b e honored. Abstracting with credit is p ermitted. To copy otherwise, to republish, to p ost on servers, to redistribute to lists, or to use any comp onentofthiswork in other works, requires prior sp eci c p ermission and/or a fee. Permissions may b e requested from Publications Dept, ACM Inc., 1515 Broadway, New York, NY 10036 USA, fax +1 (212) 869-0481, or [email protected]. 2 Alan H. Karp and Peter Markstein The usual approachworks quite well when the results are computed to no more precision than the hardware addition and multiplication. Typically, division and square ro ot take10to20machine cycles on a pro cessor that do es a multiplication in the same precision in 2 or 3 cycles. The situation is not so go o d for higher precision results b ecause the cost of the nal multiplication is much higher. The key feature of our algorithm is that it avoids this exp ensivemultiplication. Section 2 discusses some metho ds used to do division and square ro ot. Section 3 describ es the oating p oint arithmetic and the assumptions we are making ab out the oating p oint hardware. Section 4 describ es our new algorithms. In Section 5 we present an analysis of the accuracy of the algorithm. Next, weshowhowto get the correctly rounded results with relatively few additional op erations, on av- erage, in Section 6. The pro cedures we use for testing are describ ed in Section 7. Programs written in the programming language of the Unix desk top calculator utility bc[Hewlett-Packard 1991] that implement these algorithms are included in an app endix. 2. ALGORITHMIC CHOICES There are a number of ways to p erform division and square ro ot. In this section we will discuss these metho ds with emphasis on results that exceed the precision of the oating p oint hardware. One approach istodoscho ol-b oy long division, a scheme identical to successive subtraction for base 2 arithmetic[Goldb erg 1990]. The pro cedure is straightforward. To compute B=A,we initialize a remainder to R =0.LetS b e a shift register containing R concatenated with B . Then, for each bit we do the following steps: (1) Shift S left one bit so that the high order bit of B b ecomes the low order bit of R. (2) Subtract A from from the R part of S . (3) If the result is negative, set the low order bit of the B part of S to zero; otherwise set it to 1. (4) If the result of the subtraction is negative, add A to the R part of S . At the end of N steps, the N-bit quotientisinB , and the remainder is in R. A similar scheme can b e used for square ro ot[Monuschi and Mezzalama 1990]. We start with the input value A and a rst guess Y =0.For each bit we compute 2 the residual, R = A Y . Insp ection of R at the k 'th step is used to select a value for the k 'th bit of Y . The selection rules give some exibilityintuningthe algorithm. There are improvements to these approaches. For example, non-restoring division avoids adding A to R.TheSRT algorithm[Patterson and Hennessy 1990] is an opti- mization based on a carry-sum intermediate representation. Higher radix metho ds compute several bits p er step. Unfortunately, these improvements can't overcome the drawbackthatwe only get a xed numb er of bits p er step, a particularly serious shortcoming for high precision computations. Polynomialapproximations can also b e used[Monuschi and Mezzalama 1990]. Chebyshev p olynomials are most often used b ecause they form an approximation with a guaranteed b ound on the maximum error. However, each term in the series High Precision Division and Square Ro ot 3 adds ab out the same numb er of bits to the precision of the result, so this approach is impractical for high precision results. Another approach is one of the CORDIC metho ds whichwere originally derived for trigonometric functions but can b e applied to square ro ot[Monuschi and Mez- zalama 1990]. These metho ds treat the input as a vector in a particular co ordinate system. The result is then computed by adding and shifting some tabulated numbers. However, the tables would havetobeinconveniently large for high precision results. There are two metho ds in common use that converge quadratically instead of linearly as do these other metho ds, Goldschmidt's metho d and Newton iteration. Quadratic convergence is particularly imp ortant for high precision calculations b e- cause the numb er of steps is prop ortional to the logarithm of the numberofdigits. Goldschmidt's algorithm[Patterson and Hennessy 1990] is based on a dual iter- 0 ation. TocomputeB=A we rst nd an approximationto1=A = A .Wethen 0 0 initialize x = BA and y = AA .Next,we iterate until convergence. (1) r =2 y . (2) y = ry. (3) x = rx. 0 If 0 AA < 2, the iteration converges quadratically. A similar algorithm can b e derived for square ro ot. Round-o errors will not b e damp ed out b ecause Goldschmidt's algorithm is not self-correcting. In order to get accurate results, we will havetomaintain throughout the calculation more digits than app ear in the nal result. While carrying an extra word of precision is not a problem in a multiprecision calculation, the extra digits required will slowdown a calculation that could have b een done with quad precision hardware. Newton's metho d for computing B=A is to approximate 1=A, apply several iterations of the Newton-Raphson metho d, and multiply the result by B . The iteration for the recipro cal of A is x = x + x (1 Ax ): (1) n+1 n n n p p Newton's metho d for computing A is to approximate 1= A, apply several iterations of the Newton-Raphson metho d, and multiply the result by A. The iteration for the recipro cal square ro ot of A is x n 2 x = x + (1 Ax ): (2) n+1 n n 2 Newton's metho d is quadratically convergent and self-correcting. Thus, round- o errors made in early iterations damp out. In particular, if we know that the rst k guess is accurate to N bits, the result of iteration k will b e accurate to almost 2 N bits. 3. FLOATING POINT ARITHMETIC Our mo di cation to the standard Newton metho d for division and square ro ot makes some assumptions ab out the oating p oint arithmetic. In this section we intro duce some terminology, describ e the hardware requirements to implementour algorithm, and showhow to use software to overcome any de ciencies. 4 Alan H. Karp and Peter Markstein We assume that wehave a oating p oint arithmetic that allows us to write any representable number N X e k x = f ; (3) k k =0 where is the numb er base, the integer e is the exp onent, and the integers f are k in the interval [0; ). As will b e seen in Section 4, we compute a 2N -bit approximation to our function from an N -bit approximation. We will refer to these as full precision and half precision numb ers, resp ectively. In the sp ecial case where the op erations on half precision numb ers are done in hardware, we will refer to 2N -bit numb ers as quad precision and N -bit numb ers as hardware precision.

High Precision Division and Square Root

Lesson 19: the Euclidean Algorithm As an Application of the Long Division Algorithm

Primality Testing for Beginners

So You Think You Can Divide?

Decimal Long Division EM3TLG1 G5 466Z-NEW.Qx 6/20/08 11:42 AM Page 557

Basics of Math 1 Logic and Sets the Statement a Is True, B Is False

Whole Numbers

Long Division, the GCD Algorithm, and More

Long Division

"In-Situ": a Case Study of the Long Division Algorithm

The Euclidean Algorithm

Solve 87 ÷ 5 by Using an Area Model. Use Long Division and the Distributive Property to Record Your Work. Lessons 14 Through 21

Blind Mathematicians