148 IEEE TRANSACTIONS ON COMPUTERS, VOL. 58, NO. 2, FEBRUARY 2009 A Software Implementation of the IEEE 754R Floating-Point Arithmetic Using the Binary Encoding Format

Marius Cornea, Member, IEEE, John Harrison, Member, IEEE, Cristina Anderson, Ping Tak Peter Tang, Member, IEEE, Eric Schneider, and Evgeny Gvozdev

Abstract—The IEEE Standard 754-1985 for Binary Floating-Point Arithmetic [19] was revised [20], and an important addition is the definition of decimal floating-point arithmetic [8], [24]. This is intended mainly to provide a robust reliable framework for financial applications that are often subject to legal requirements concerning rounding and precision of the results, because the binary floating-point arithmetic may introduce small but unacceptable errors. Using binary floating-point calculations to emulate decimal calculations in order to correct this issue has led to the existence of numerous proprietary software packages, each with its own characteristics and capabilities. The IEEE 754R decimal arithmetic should unify the ways decimal floating-point calculations are carried out on various platforms. New algorithms and properties are presented in this paper, which are used in a software implementation of the IEEE 754R decimal floating-point arithmetic, with emphasis on using binary operations efficiently. The focus is on rounding techniques for decimal values stored in binary format, but algorithms are outlined for the more important or interesting operations of addition, multiplication, and division, including the case of nonhomogeneous operands, as well as conversions between binary and decimal floating-point formats. Performance results are included for a wider range of operations, showing promise that our approach is viable for applications that require decimal floating-point calculations. This paper extends an earlier publication [6].

Index Terms—Computer arithmetic, multiple-precision arithmetic, floating-point arithmetic, decimal floating-point, computer arithmetic, correct rounding, binary-decimal conversion. Ç

1INTRODUCTION

HERE is increased interest in decimal floating-point 754R decimal floating-point software implementation Tarithmetic in both industry and academia as the IEEE based on the BID encoding. We include a discussion of 754R1 [20] revision becomes the new standard for floating- our motivation, selected algorithms, performance results, point arithmetic. The 754R Standard describes two different and future work. The most important or typical operations possibilities for encoding decimal floating-point values: the will be discussed: primarily decimal rounding but also binary encoding, based on using a Binary Integer [24] to addition, multiplication, division, and conversions between represent the (BID, or Binary Integer Decimal), binary and decimal floating-point formats. and the decimal encoding, which uses the (DPD) [7] method to represent groups of up to 1.1 Motivation and Previous Work three decimal digits from the significand as 10-bit declets.In An inherent problem of binary floating-point arithmetic this paper, we present results from our work toward a used in financial calculations is that most decimal floating- point numbers cannot be represented exactly in binary 1. Since the time of writing of this paper, the IEEE 754R Standard has floating-point formats, and errors that are not acceptable IEEE Standard 754TM-2008 become the for Floating-Point Arithmetic. may occur in the course of the computation. Decimal floating-point arithmetic addresses this problem, but a . M. Cornea, J. Harrison, and C. Anderson are with Intel Corp., JF1-13, degradation in performance will occur compared to binary 2111 NE 25th Avenue, Hillsboro, OR 97124. floating-point operations implemented in hardware. Despite E-mail: {Marius.Cornea, cristina.s.anderson}@intel.com, its performance disadvantage, decimal floating-point arith- [email protected]. . P.T.P. Tang is with D.E. Shaw Research, L.L.C., 120 West 45th Street, metic is required by certain applications that need results 39th Floor, New York, NY 10036. identical to those calculated by hand [3]. This is true for E-mail: [email protected]. currency conversion [13], banking, billing, and other . E. Schneider is with Intel Corp., RA2-406, 2501 NW 229th Avenue, Hillsboro, OR 97124. E-mail: [email protected]. financial applications. Sometimes, these requirements are . E. Gvozdev is with Intel Corp., SRV2-G, 2 Parkovaya Street, Satis, mandated by law [13]; other times, they are necessary to Diveevo District, Nizhny Novgorod Region 607328, Russia. avoid large accounting discrepancies [18]. E-mail: [email protected]. Because of the importance of this problem a number of Manuscript received 23 July 2007; revised 19 June 2008; accepted 10 Sept. 2008; published online 7 Nov. 2008. decimal solutions exist, both hardware and software. Recommended for acceptance by P. Kornerup, P. Montuschi, J.-M. Muller, Software solutions include C# [22], COBOL [21], and XML and E. Schwarz. [25], which provide decimal operations and datatypes. Also, For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TCSI-2007-07-0378. Java and C/C++ both have packages, called BigDecimal Digital Object Identifier no. 10.1109/TC.2008.209. [23] and decNumber [9], respectively. Hardware solutions

0018-9340/09/$25.00 ß 2009 IEEE Published by the IEEE Computer Society Authorized licensed use limited to: Intel Corporation via the Intel Library. Downloaded on March 27, 2009 at 16:30 from IEEE Xplore. Restrictions apply. CORNEA ET AL.: A SOFTWARE IMPLEMENTATION OF THE IEEE 754R DECIMAL FLOATING-POINT ARITHMETIC USING THE BINARY... 149 were more prominent earlier in the computer age with the implementations of decimal floating-point arithmetic on ENIAC [27] and UNIVAC [16]. However, more recent machines with binary hardware. For example, assume that examples include the CADAC [5], IBM’s z900 [4] and z9 [10] the exact result of a decimal floating-point operation has a architectures, and numerous other proposed hardware coefficient C ¼ 1234567890123456789 with q ¼ 19 decimal implementations [11], [12], [1]. More hardware examples digits that is too large to fit in the destination format can be found in [8], and a more in-depth discussion is found and needs to be rounded to the destination precision of in Wang’s work [26]. p ¼ 16 digits. C The implementation of nonhomogeneous decimal As mentioned, is available in binary format. To round C to 16 decimal digits, one has to remove the lower x ¼ 3 floating-point operations, whose operands and results have decimal digits ðx ¼ q p ¼ 19 16Þ and possibly to add mixed formats, is also discussed below. The method used is one unit to the next decimal place, depending on the based on replacing certain nonhomogeneous operations by rounding mode and on the value of the quantity that has a similar homogeneous operation and then doing a floating- been removed. If C is rounded to nearest, the result will be point conversion to the destination format. Double round- C ¼ 1234567890123457 103.IfC is rounded toward zero, ing errors that may occur in such cases are corrected using the result will be C ¼ 1234567890123456 103. simple logical equations. Double rounding errors have been The straightforward method to carry out this operation is discussed in other sources. A case of double rounding error to divide C by 1,000, to calculate and subtract the in conversions from binary to decimal (for printing binary remainder, and possibly to add 1,000 to the result at the values) is presented in [15] (in the proof of Theorem 15; a end, depending on the rounding mode and on the value of solution of how to avoid this is given based on status flags the remainder. and only for this particular case). A paper by Figueroa [14] A better method is to multiply C by 103 and to truncate discusses cases when double rounding errors will not occur the result so as to obtain the value 1234567890123456. but does not offer a solution for cases when they do occur. However, negative powers of 10 cannot be represented A paper by Boldo and Melquiond [2] offers a solution for exactly in binary, so an approximation will have to be used. 3 3 correcting double rounding errors when rounding to Let k3 10 be an approximation of 10 , and thus, nearest but only if a non-IEEE special rounding mode is C k3 1234567890123456.Ifk3 overestimates the value of 3 implemented first, named in the paper as rounding to odd 10 , then C k3 > 1234567890123456. Actually, if k3 is (which is not related to rounding to nearest-even). How- calculated with sufficient accuracy, it will be shown that ever, the solution presented here for correcting double bC k3c¼1234567890123456 with certainty. rounding errors seems to be new. 
(Note that the floorðxÞ, ceilingðxÞ, and fractionðxÞ Estimations have been made that hardware approaches functions are denoted here by bxc, dxe,andfxg, to decimal floating-point arithmetic will have average respectively.) speedups of 100-1,000 times over software [18]. However, Let us separate the rounding operation of a value C to the results from our implementation show that this is fewer decimal digits into two steps: first calculate the result unlikely, as the maximum clock cycle counts for decimal rounded to zero and then apply a correction if needed, by operations implemented in software are in the range of tens adding one unit to its least significant digit. Rounding or hundreds on a variety of platforms. Hardware imple- overflow (when after applying the correction, the result mentations would undoubtedly yield a significant speedup requires p þ 1 decimal digits) is not considered here. but not as dramatic, and that will make a difference only if The first step in which the result rounded toward zero is applications spend a large percentage of their time in calculated can be carried out by any of the following four decimal floating-point computations. methods. 3 3 Method 1. Calculate k3 10 as a y-bit approximation of 10 2DECIMAL ROUNDING rounded up (where y will be determined later). Then, bC k3c¼1234567890123456, which is exactly the same A decimal floating-point number n is encoded using three value as bC=103c. fields: sign s, exponent e, and significand with at most p decimal digits, where p is the precision (p is 7, 16, or 34 in It can be noticed, however, that 103 ¼ 53 23 and that IEEE 754R, but p ¼ 7 for the 32-bit format is for storage multiplication by 23 is simply a shift to the right by 3 bits. only). The significand can be scaled up to an integer C, Three more methods to calculate the result rounded toward referred to as the coefficient (and the exponent is decreased 0 zero are then possible: accordingly): n ¼ð1Þs 10e ¼ð1Þs 10e C. h 53 53 The need to round an exact result to the precision p of the Method 1a. Calculate 3 as a y-bit approximation of destination format occurs frequently for the most common rounded up (where y will be determined later). Then, 3 decimal floating-point operations: addition, subtraction, bðC h3Þ2 c¼1234567890123456, which is the same as 3 multiplication, fused multiply-add, and several conversion bC=10 c. However, this method is no different from Method 1, 3 operations. For division and square root, this happens only so the approximation of 5 has to be identical in the number of 3 in certain corner cases. If the decimal floating-point operands significant bits to the approximation of 10 from Method 1 3 are encoded using the IEEE 754R binary format, the (but it will be scaled up by a factor of 2 ). rounding operation can be reduced to the rounding of an integer binary value C to p decimal digits. Performing this The third method is obtained if instead of multiplying by operation efficiently on decimal numbers stored in binary 23 at the end, C is shifted to the right and truncated before format is very important, as it enables good software multiplication by h3.

Authorized licensed use limited to: Intel Corporation via the Intel Library. Downloaded on March 27, 2009 at 16:30 from IEEE Xplore. Restrictions apply. 150 IEEE TRANSACTIONS ON COMPUTERS, VOL. 58, NO. 2, FEBRUARY 2009

3 3 x Method 2. Calculate h3 5 as a y-bit approximation of 5 For this, assume that such a power of two exists: 10 < s x 10x s x 2x s x rounded up (where y will be determined later). Then, 2 10 ð1 þ "Þ,ð1þ"Þ 2 <10 , ð1þ"Þ 2 <2 . 3 bbC 2 ch3c¼1234567890123456, which is identical to Even for those values of x where x is slightly 3 bC=10 c. It will be shown that h3 can be calculated to fewer larger than an integer (x ¼ 22, x ¼ 25, and x ¼ 28), 2x s bits than k3 from Method 1. " would have to be too large in order to satisfy ð1þ"Þ 2 . x The conclusion is that kx is in the same binade as 10 , bxcy The final and fourth method is obtained by multiplying which will let us determine ulpðkxÞ¼2 . In order C by h3 and then truncating, followed by a multiplication to satisfy inequality (1), it is sufficient to have the by 23 and a second truncation. following (increasing the left-hand side and decreasing 3 3 the right-hand side of the inequality): ulpðkxÞ Method 2a. Calculate h3 5 as a y-bit approximation of 5 102x q x x y df xgþ qe roundedup(wherey can be determined later). Then, ð10 10 Þ . This leads eventually to , 3 as required. tu bbC h3c2 c¼1234567890123456. However, it can be shown that in this last case, h3 has to be calculated to the same In our example from Method 1 above, y ¼df 3gþ number of bits as k3 in Method 1, which does not represent an 19 65 103 improvement over Method 1. e¼ (so needs to be rounded up to 65 bits). The values kx for all x of interest are precalculated and K ;e K e Only Method 1 and Method 2 shall be examined next. are stored as pairs ð x xÞ, with x and x positive ex We will solve next the problem of rounding correctly a integers, and kx ¼ Kx 2 . This allows for efficient number with q digits to p ¼ q x digits, using approxima- implementations, exclusively in the integer domain, of tions to negative powers of 10 or 5. We will also solve the several decimal floating-point operations, particularly ad- problem of determining the shortest such approximations, dition, subtraction, multiplication, fused multiply-add, and i.e., with the least number of bits. certain conversions. The second step in rounding correctly a value C to fewer 2.1 Method 1 for Rounding a Decimal Coefficient decimal digits consists of applying a correction to the value For rounding a decimal coefficient represented in binary rounded to zero, if necessary. The inexact status flag has to using Method 1, the following property specifies the be set correctly as well. For this, we have to know whether x minimum accuracy y for the approximation kx 10 , the original value that we rounded was exact (did it have x which ensures that a value C with q decimal digits trailing zeros in decimal?) or if it was a midpoint when represented in binary can be truncated without error to rounding to nearest (were its x least significant decimal p ¼ q x digits by multiplying C by kx and truncating. digits a 5 followed by x 1 zeros?). Property 1 states that if y dflog210 xgþlog210 qe and kx Once we have the result truncated to zero, the straight- is a y-bit approximation of 10x rounded up, then we can forward method to check for exactness and for midpoints x calculate bC kxc in the binary domain, and we will obtain could be to calculate the remainder C percent 10 and to the same result as for bC=10xc calculated in decimal. compare it with 0 and with 50...0(x decimal digits), which Property 1. Let q 2 IN , q>0, C 2 IN , 10q1 C 10q 1, is very costly. 
However, this is not necessary because the information about exact values and midpoints is contained x 2f1; 2; 3; ...;q 1g, and ¼ log210.Ify 2 IN satisfies x in the fractional part f of the product C kx. y df xgþ qe and kx is a y-bit approximation of 10 x rounded up (the subscript RP, y indicates rounding up to y The following property states that if C 10 covers x x an interval [H, H þ 1) of length 1, where H is an integer, bits), i.e., kx ¼ð10 ÞRP;y ¼ 10 ð1 þ "Þ, where yþ1 x then C k belongs to [H, H þ 1) as well, and the fraction f 0 <"<2 , then bC kxc¼bC=10 c. x discarded by the truncation operation belongs to (0, 1) (so f Proof outline. The proof can be carried out starting with the cannot be zero). More importantly, we can identify the C C ¼ d 10q1 þ d representation of in decimal, 0 1 exact cases C 10x ¼ H (then, rounding C to p ¼ q x 10q2 þþd 101 þ d ,whered ;d ; ...d 2 q2 q1 0 1 q1 digits is an exact operation) and the midpoint cases f0; 1; ...; 9g, and d0 6¼ 0. The value of C can be expressed x 1 x x qx1 C 10 ¼ H þ 2 . as C ¼ 10 HþL, where H ¼bC=10 c¼d0 10 þ d1 qx2 qx3 1 qx1 q 2 IN q>0 x 2f1; 2; 3; ...;q 1g C 2 IN 10 þd2 10 þ ...þdqx2 10 þdqx1 2½10 ; Property 2. Let , , , , qx x x1 10q1 C 10q 1 C 10x H L H 10qx1; 10 1, and L ¼ C percent 10 ¼ dqx 10 þ dqx1 , ¼ þ , where 2½ x2 1 x 10qx 1 L 0; 10x 1 H L IN f C k 10 þ ...þdq2 10 þdq1 2½0; 10 1. Then, the mini- , 2½ ,and , 2 , ¼ x x mum value of y can be determined such that bC kxc¼ bC kxc, ¼ log210, y 2 IN , y 1 þd qe, and kx ¼ 10 yþ1 H ,H ðHþ10x LÞð1 þ "Þ

Authorized licensed use limited to: Intel Corporation via the Intel Library. Downloaded on March 27, 2009 at 16:30 from IEEE Xplore. Restrictions apply. CORNEA ET AL.: A SOFTWARE IMPLEMENTATION OF THE IEEE 754R DECIMAL FLOATING-POINT ARITHMETIC USING THE BINARY... 151

Another important result of Property 2 was that it helped pseudocode fragment describes the algorithm for adding two refine the results of Property 1. Although the accuracy y of kx decimal floating-point numbers encoded using the binary determined in Property 1 is very good, it is not the best format, n1 ¼ C1 10e1 and n2 ¼ C2 10e2, whose possible. For a given pair q and x ¼ q p, starting with the can be represented with p decimal digits (C1, C2 2 ZZ, value y ¼df xgþ qe, one can try to reduce it 1 bit at a 0 0, C 2 IN , 10q1 C 10q 1, C00 ¼ C0 þ 1=2 10x x d2pe x 2f1; 2; 3; ...;q 1g,and ¼ log210.Ify 2 IN , kx ¼ 10 ð1 þ Þ, 0 <<2 00 00 Ex y df xgþ qex, and hx is a y-bit approximation of C ¼ C kx ¼ C Kx 2 x x x 5 rounded up, i.e., hx ¼ð5 ÞRP;y ¼ 5 ð1 þ "Þ, f = the fractional part of C // lower Ex bits of yþ1 x x 00 0 <"<2 , then bbC 2 chxc¼bC=10 c. product C Kx if 0

Authorized licensed use limited to: Intel Corporation via the Intel Library. Downloaded on March 27, 2009 at 16:30 from IEEE Xplore. Restrictions apply. 152 IEEE TRANSACTIONS ON COMPUTERS, VOL. 58, NO. 2, FEBRUARY 2009 for addition, then the smaller operand is just a rounding e ¼ e1 e2 scale // expected exponent error compared to the larger one, and rounding is trivial). endif This also means that the number x of decimal digits to be Q1 ¼bC1=C2c, R ¼ C1 Q1 C2 // long integer removed from a result of length q 2½p þ 1; 2 p is between 1 divide and remainder and p. It is not difficult to prove that the comparisons for Q ¼ Q0 þ Q1 p midpoint and exact case detection can use the constant 10 if ðR ¼¼ 0Þ x instead of 10 , thus saving a table read. eliminate trailing zeros from Q: p x Note that 10 (or 10 ) cannot be represented exactly in find largest integer d s.t. Q=10d is exact binary format, so approximations of these values have to be Q ¼ Q=10d 1 used. It is sufficient to compare f or f 2 (both have a e ¼ e þ d // adjust expected exponent finite number of bits when represented in binary) with a if ðe EMINÞ p truncation t of 10 whose unit in the last place is no larger return Q 10e f f 1 t than that of or 2 . The values are calculated such endif that they always align perfectly with the lower bits of the if ðe EMINÞ f e fraction , which makes the tests for midpoints and round Q 10 according to current rounding mode exactness relatively simple in practice. // rounding to nearest based on comparing C2 The IEEE status flags are set correctly in all rounding and 2 R modes. The results in rounding modes other than to nearest else are obtained by applying a correction (if needed) to the result compute correct result based on Property 1 rounded to nearest, using information on whether the precise // underflow result was an exact result in the IEEE sense, a midpoint less endif than an even floating-point number, a midpoint greater than an even number, an inexact result less than a midpoint, or an The square root algorithm (not presented here) is some- inexact result greater than a midpoint. whatsimilar, in the sense thatit requirescomputing an integer square root of 2 p digits (while division required an integer division of 2 p by p digits). Rounding of the square root result 4DIVISION AND SQUARE ROOT is based on a test involving long integer multiplication. The sign and exponent fields are easily computed for these two operations. The approach used for calculating the 5CONVERSIONS BETWEEN BINARY AND DECIMAL correctly rounded significand of the result was to scale the FORMATS significands of the operands to the integer domain and to use a combination of integer and floating-point operations. Conversions between IEEE 754 binary formats and the new The is summarized below, where p is decimal formats may sometimes be needed in user applica- the maximum digit size of the decimal coefficient, as tions, and the revised standard specifies that these should specified by the following format: p ¼ 16 for a 64-bit be correctly rounded. We will first consider the theoretical decimal and p ¼ 34 for a 128-bit decimal format. EMIN is analysis determining the accuracy required in intermediate the minimum decimal exponent allowed by the format. calculations and then describe more details of the algo- Overflow situations are not explicitly treated in the rithms for some typical source and target formats. 
algorithm description below; they can be handled in the 5.1 Required Accuracy return sequence, when the result is encoded in BID format. We analyze here how close, in relative terms, a number x in if C1 , then we will be able to make a rounding scale ¼ scale þ 1 decision based on an approximation good to a relative error endif of order , as spelled out in more detail below. C1 ¼ C10 10scale Consider converting binary floating-point numbers to a Q0 ¼ 0 decimal format; essentially, similar reasoning works the e ¼ e1 e2 scale nd // expected exponent other way round. We are interested in how close, in relative else terms, the answer can be to a decimal floating-point number Q0 ¼bC1=C2c, R ¼ C1 Q C2 // long integer (for directed rounding modes) or a midpoint between them divide and remainder (for rounding to nearest), i.e., either of if ðR ¼¼ 0Þ e e 2 m 2 m return Q 10e1e2 // result is exact 1 or 1; 10dn 10dðn þ 1=2Þ endif scale ¼ p digitsðQÞ where integers m and n are constrained by the precision of C1 ¼ R 10scale the formats, and d and e are constrained by their exponent Q0 ¼ Q0 10scale ranges. We can encompass both together by considering

Authorized licensed use limited to: Intel Corporation via the Intel Library. Downloaded on March 27, 2009 at 16:30 from IEEE Xplore. Restrictions apply. CORNEA ET AL.: A SOFTWARE IMPLEMENTATION OF THE IEEE 754R DECIMAL FLOATING-POINT ARITHMETIC USING THE BINARY... 153

2e1m problems of the form j 10dn 1j, where n is bounded by and the most difficult for conversion from Decimal64 to twice the usual bound on the decimal coefficient. Binary64 is a relative difference of 2114:62 for 1 5.1.1 Naive Analysis 3743626360493413 10165 6898586531774200 2549: Based on the most straightforward analytical reasoning, 2 we can deduce that either the two numbers are exactly equal or else 5.2 Implementations We will now show the general pattern of the actual e e d 2 m j2 m 10 nj 1 algorithms for conversions in both directions; in fact, both 1 ¼ 10dn j10dnj j10dnj directions have a lot of stylistic similarities. Generally, we consider the two problems “parametrically,” using a because all quantities are integers. However, this bound is general p for the precision of the binary format and d for somewhat discouraging, because the exponent d may be the number of decimal digits in the decimal format. very large, e.g., d ¼ 6;111 for Decimal128. However, some structural differences arise depending on On the other hand, it seems plausible on statistical the parameters, two of which we mention now. grounds that the actual worst case is much less bad. First, for conversions between certain pairs of formats For example, imagine distributing N input numbers (say, (e.g., from any binary format considered into the N ¼ 264 for the two 64-bit formats) randomly over a format Decimal128 format), no overflow or underflow can occur, with M possible significands. The fractional part of the so certain tests or special methods noted below can be resulting significand will probably be distributed more or omitted and are not present in the actual implementations. less randomly in the range 0 x<1, so the probability of Second, many of the algorithms involve scaled prestored e f its being closer than to a floating-point number will be approximations to values such as 10 =2 . In the presentation about . Therefore, the probability that none of them will be below, we assume that such values are stored for all possible N N 1 N input exponents (the output exponent is determined closer than is about ð1 Þ ¼ð1 Þ e .Arelative consequentially). For several of our implementations, this difference of typically corresponds to an absolute error in is indeed the case; doing so ensures high speed and allows the integer significand of about ¼ M,andso,the MN us to build appropriate underflow of the output into the probability of relative closeness is about 1 e . Thus, tables without requiring special program logic. However, we expect the minimum to be of order 1=ðMNÞ. for some conversions, in particular those into Decimal128, the storage requirement seems excessive, so the necessary 5.1.2 Diophantine Approximation Analysis values are instead derived internally from a bipartite table. The above argument is really just statistical hand-waving and may fail for particular cases. However, we can establish 5.2.1 Decimal-Binary Conversion Algorithm such bounds rigorously by rearranging our problem. We Consider converting from a decimal format with d decimal are interested in p digits to a binary format with precision . The basic steps of e e d 2 m 2 =10 n=m the algorithm are the following: d 1 ¼ : 10 n n=m . Handle NaN, infinity and zero, and extremely large We can therefore consider all the permissible pairs of and small inputs. e exponent values e and d, finding the n=m with numerator . Strip off the sign, so the input is now 10 m, where d and denominator within appropriate bounds that gives the 0

Authorized licensed use limited to: Intel Corporation via the Intel Library. Downloaded on March 27, 2009 at 16:30 from IEEE Xplore. Restrictions apply. 154 IEEE TRANSACTIONS ON COMPUTERS, VOL. 58, NO. 2, FEBRUARY 2009

That is, x lies strictly between the smallest normalized results), i.e., q 117.ToconvertfromDecimal32to q 107 binary floating-point number with exponent f 1 and the Binary32, we need 2 222254:81 , i.e., q 57. In fact, we next beyond the largest with exponent f. We therefore almost always make q slightly larger than necessary so that have only two possible choices for the “provisional it is a multiple of 32 or 64. This makes the shift by q more exponent” f. (We say provisional because rounding may efficient since it just amounts to picking words of a still bump it up to the next binade, but this is easy to check multiword result. at the end.) And that is without any knowledge of m, When q is chosen satisfying these constraints, we just except the assumption that it is normalized. Moreover, we need to consider the product mr to make our rounding can pick the exact mmin that acts as the breakpoint between decisions. The product with the lowest q bits removed is the these possibilities: correctly truncated significand. Bit q 1 is the rounding bit, and the bottom q 1 bits (0 to q 2) can be considered as a fþp1 e d q mmin ¼d2 =10 e1; sticky bit: test if it is at least 10 =2 . Using this information, we can correctly round in all rounding modes. Where the so 10em < 2fþp1 10eðm þ 1Þ. Thus, we can pick the min min input is smaller in magnitude than the smallest possible provisional exponent to be f 1 for all m m and f for min number in the output format, we artificially limit the all m>m , and we are sure that 2f 2p1 x<2f 2p. In the min effective shift so that the simple interpretation of “sticky” future, we will just use f for this correctly chosen bits does not need changing. provisional exponent. The mechanics of rounding uses a table of “boundaries” Now, we aim to compute n with 2f n 10em, i.e., the against which the lowest q bits of the reciprocal product left-hand side should be a correctly rounded approxima- (i.e., the “round” and “sticky” data) is compared. If this tion of the right-hand side. Staying with the provisional value is strictly greater than the boundary value, the exponent f, this entails finding a correctly rounded integer e truncated product is incremented. (In the case where it n 10 m 2f . We do this by multiplying by an approximation overflows the available significands, the output exponent is 10e=2f 2q q to scaled by for some suitably chosen and one greater than the provisional exponent.) To avoid rounded up, i.e., lengthy tests, these boundaries are indexed by a composi- q e f 4r þ 2s þ l r s r ¼d2 10 =2 e; tion of bitfields , where is the rounding mode, is the sign of the input, and l is the least significant bit of the and truncating by shifting to the right by q bits. How do truncated exponent. we choose q? From the results above, given the input and 10e 5.2.2 Binary-Decimal Conversion Algorithm output formats, we know an >0 such that if n 6¼ 2f m 1 10e p or n þ 2 6¼ 2f , then their relative difference is at least . Consider converting from a binary format with precision We have to a decimal format with d decimal digits. An important difference from decimal-binary conversion is that the 2q10e 2q10e r< þ 1; exponent in the output is supposed to be chosen as close 2f 2f to zero as possible for exact results and as small as possible and therefore (since m<10d) for inexact results. The core algorithm is given as follows:

10e mr 10e 10d . Handle NaN, infinity and zero, and extremely large m < m þ : 2f 2q 2f 2q and small inputs. . 2em 10e Stripoffthesign,sotheinputis with We want to make a rounding decision for 2f m based on 0

Authorized licensed use limited to: Intel Corporation via the Intel Library. Downloaded on March 27, 2009 at 16:30 from IEEE Xplore. Restrictions apply. CORNEA ET AL.: A SOFTWARE IMPLEMENTATION OF THE IEEE 754R DECIMAL FLOATING-POINT ARITHMETIC USING THE BINARY... 155

200 2e 1 2e range such as 2 . To maximize code sharing in this part of n 6¼ 10f m or n þ 2 6¼ 10f , then their relative difference is at the algorithm, we always treat the input as a quad precision least . We have number, so 2112 m<2113, and the exponent is adjusted 2qþe 2qþe accordingly. We can pass through any inputs with e>0 r< þ 1; 10f 10f since in this case, the input is 2113, which is too large for the significand of any of the decimal formats. So we can and therefore (since m<2p) e 0 t assume that ; let us write for the number of trailing 2e mr 2e zeros in m. Now, there are two cases: m < m þ 2pq: 10f 2q 10f . If e þ t 0, the input is an integer, so we treat it 2e We want to make a rounding decision for 10f m based on mr specially iff it fits in the coefficient range. We just q . First, since the latter is an overestimate, it is immediate 0 e 2 need to test if the shifted coefficient m ¼ m=2 is that if the true result is a rounding boundary, so is the within range for the coefficient range of the output estimate. So we just need to check that if the true answer is format. If so, that is the output coefficient, and the < a rounding boundary, so is the estimate. We know that if output exponent is zero. Otherwise, we can pass the true answer is < a rounding boundary, then so is through to the main path. 2e 10f mð1 þ Þ, so it suffices to show that . If a ¼ðe þ tÞ > 0, then we have a noninteger 2e 2e input. If special treatment is necessary, the result m þ 2pq mð1 þ Þ; will have a coefficient of m0 ¼ 5aðm=2tÞ and an 10f 10f e exponent of a.Ifa>48, this can be dismissed 2pq 2 m f 49 34 i.e., that 10f . By the choice of made, we have since 5 > 10 , which is too big for any decimal 10f 10d1 2em, so it suffices to establish that 2pq 10d1. format’s coefficient range. Otherwise, we determine We also want to be able to test whether a conversion is whether we are indeed in range by a table based mr exact. If it is, then the fractional part of 2q is between 0 and on a and, if so, get the multiplier from another 2pq by the bound noted above. If it is not, then the table indexed by a. fractional part is at least 10d1 2pq. To discriminate Now, we consider the main path. If we set cleanly, we want this to be at least 2pq, so we actually want f1þd eþp fþd pqþ1 d1 f ¼dðe þ pÞ log10 2ed,wehave10 < 2 10 . to pick a q such that 2 10 or, in other words, x q 2pþ1 This means that for any floating-point number with 2 10d1 . exponent e, we have For example, to convert from Binary64 to Decimal64, q 254 we need 2 15 115:53 (based on the above best approx- f1 d1 f1þd eþp 10 2 10 10 ¼ 10 =10 < 2 =10 imation results), i.e., q 120. To convert from Binary32 to eþp1 eþp fþd q 225 < 2 x<2 10 : Decimal32, we need 2 106251:32 , i.e., q 57. When q is chosen satisfying these constraints, we just That is, x lies strictly between the smallest normalized need to consider the product mr to make our rounding decimal number with exponent f 1 and the next beyond decisions. The product with the lowest q bits removed is the the largest with exponent f. We therefore have only two correctly truncated significand. Bit q 1 is the rounding bit, possible choices for the “provisional exponent” f. (We say and the bottom q 1 bits (0 to q 2) can be considered as a provisional because rounding may still bump it up to the sticky bit: zero out the lowest p bits and test whether the next decade, but this is easy to check at the end.) 
And that is result is zero or nonzero or, in other words, test whether the without any knowledge of m, except the assumption that it bitfield from bit p to bit q 2 is nonzero. Using this is normalized. Moreover, we can pick the exact mmin that information, we can correctly round in all rounding modes. acts as the breakpoint between these possibilities: The approach uses exactly the same boundary tables as in the decimal-binary case discussed previously. fþd1 e mmin ¼d10 =2 e1; e fþd1 e so 2 mmin < 10 2 ðmmin þ 1Þ. Thus, we can pick the 6NONHOMOGENEOUS DECIMAL FLOATING-POINT provisional exponent to be f 1 for all m mmin and f for OPERATIONS all m>m , and we are sure that 10f 10d1 x<10f 10d. min The IEEE 754R Standard mandates that the binary and the f In the future, we will just use for this correctly chosen decimal floating-point operations of addition, subtraction, provisional exponent. f e multiplication, division, square root, and fused multiply- Now, we aim to compute n with 10 n 2 m, i.e., the left- add have to be capable of rounding results correctly to any hand side should be a correctly rounded approximation of supported floating-point format, for operands in any the right-hand side. Staying with the provisional exponent f, supported floating-point format in the same base (2 or 10). 2e this entails finding a correctly rounded n 10f m. We do this The more interesting cases are those where at least some of e f q multiplying by an approximation to 2 =10 scaled by 2 for the operands are represented in a wider precision than that of some suitably chosen q and rounded up, i.e., the result (such a requirement did not exist in IEEE 754-1985). There are thus a relatively large number of operations that r ¼d2qþe=10f e; need to be implemented in order to comply with this and truncating by shifting to the right by q bits. How do we requirement. Part of these are redundant (because of the choose q? From the results above, we know an such that if commutative property), and others are trivial, but several

Authorized licensed use limited to: Intel Corporation via the Intel Library. Downloaded on March 27, 2009 at 16:30 from IEEE Xplore. Restrictions apply. 156 IEEE TRANSACTIONS ON COMPUTERS, VOL. 58, NO. 2, FEBRUARY 2009 where at least some of the operands have a larger precision N0 with the result rounded correctly to 128-bit format than the precision N of the result require a more careful ðN0 ¼ 34Þ, as specified by the IEEE 754R Standard: implementation. One solution is to implement each operation a ¼ 10000000000000015 1016; separately, for example, for Decimal128 þ Decimal64 ¼ 34 Decimal64, Decimal128 þ Decimal128 ¼ Decimal64, etc. How- b ¼1 10 ; ever, since these nonhomogeneous operations will occur res0 ¼ða þ bÞRN;34 quite rarely in practice, an alternate solution was chosen. If a ¼ 1000000000000001500000000000000000 1033: floating-point operation has to be performed on operands of x x precision N0 and the result has to be rounded to precision (The notation RN;34 indicates the rounding of to nearest-even to 34 decimal digits.) Next, the result res0 is NN significand by the latter). To correct against the possibility where 0 . This extra information consists of four of such errors, a third component is added to the logical indicators for each of the two rounding operations N N N N implementation: ( 0 to 0 and then 0 to ), which can be used for correcting the result when a double rounding occurs 3. correction logic against double rounding errors. (C notation is used; MP denotes a midpoint between two The conditions that cause double rounding errors and consecutive floating-point numbers): the logical equations for their correction will be presented inexact lt MP; inexact gt MP; MP lt even; MP gt even: next—first for rounding to nearest-even and then for rounding to nearest-away. A fifth indicator can be derived from the first four: In rounding-to-nearest-even mode, double rounding exact ¼ !inexact lt MP && !inexact gt MP && errors may occur when the exact result res0 of a floating- point operation with operands of precision N0 is rounded !MP lt even && !MP gt even: res correctly (in the IEEE 754R sense) first to a result 0 of This information is used to either add 1 or subtract 1 to/ N res precision 0, and then, 0 is rounded again to a narrower from the significand of the second result, in case a double precision N. Sometimes, the result res does not represent rounding error occurred. This can be an efficient solution 0 the IEEE-correct result res that would have been obtained for implementing correctly rounded operations with oper- res were the original exact result 0 rounded directly to ands of precision N0 and results of precision N ðN0 >NÞ, 0 precision N. In such cases, res differs from res by 1 ulp, instead of having complete implementations for these less and the error that occurs is called a double rounding error. frequently used but mandatory operations defined in Only positive results will be considered in this context, as IEEE 754R. (It can be noted, however, that for certain the treatment of negative results is similar because round- floating-point operations, under special circumstances, ing to nearest is symmetric with respect to zero. A double double rounding errors will not occur; in most cases, rounding error for rounding to nearest-even can be upward however, when performing two consecutive rounding (when the result res is too large by 1 ulp) or downward operations as explained above, double rounding errors (when the result res is too small by 1 ulp). can occur. 
For example, if a binary floating-point division For example, consider the addition operation of two operation or square root operation is performed on an decimal floating-point values in 128-bit format ðN0 ¼ 34Þ, operand(s) of precision N0 and the result of precision N0 is

Authorized licensed use limited to: Intel Corporation via the Intel Library. Downloaded on March 27, 2009 at 16:30 from IEEE Xplore. Restrictions apply. CORNEA ET AL.: A SOFTWARE IMPLEMENTATION OF THE IEEE 754R DECIMAL FLOATING-POINT ARITHMETIC USING THE BINARY... 157 then rounded a second time to a smaller precision N, then However, in certain situations, it is useful to know also double rounding errors are not possible if N0 > 2 N þ 1.) the correct rounding indicators from the second rounding. The logic for correcting double rounding errors when For example, they can be used for correct denormalization rounding results to nearest-even is based on the observation of the result res0 if its exponent falls below the minimum that such errors occur when exponent in the destination format. In this case, the complete logic is expressed by the following pseudocode, 1. the first rounding pulls up the result to a value that where in addition to correcting the result, the four rounding is a midpoint situated to the left of an even floating- indicators from the second rounding are also corrected if point number (as seen on the real axis) for the double rounding errors occur: second rounding (double rounding error upward) or 2. the first rounding pulls down the result to a value that // avoid a double rounding error and adjust rounding indicators is a midpoint situated to the right of an even floating- // for rounding to nearest-even point number (on the real axis) for the second if MP lt even && ðinexact gt MP0kMP lt even0Þ rounding (double rounding error downward). // double rounding error upward In C-like pseudocode, the correction method can be significand--// significand becomes odd ^ expressed as follows (0 identifies the first rounding): if significand ¼¼ b ðN 1Þ1 // falls below the smallest N-digit significand // avoid double rounding for rounding to nearest-even significand ¼ b^N 1 // largest significand in base b sub1 ¼ MP lt even && ðinexact gt MP0kMP lt even0Þ unbiased_exp--// decrease exponent by 1 add1 ¼ MP gt even && ðinexact lt MP0kMP gt even0Þ endif if sub1 // double rounding error upward MP lt even ¼ 0 significand--// significand becomes odd inexact lt MP ¼ 1 ^ if significand ¼¼ b ðN 1Þ1 else if MP gt even && ðinexact lt MP0kMP gt even0Þ // falls below the smallest N-digit significand // double rounding error downward ^ significand ¼ b N 1 // largest significand in base b significand++ unbiased_exp--// decrease exponent by 1 // significand becomes odd (so it cannot be b^N for b ¼ 2 or endif b ¼ 10) else if add1 // double rounding error downward MP gt even ¼ 0 significand++ inexact gt MP ¼ 1 ^ // significand becomes odd (so it cannot be b N for b ¼ 2 else if !MP lt even && !MP gt even && !inexact lt MP & or b ¼ 10) &!inexact gt MP endif // if this second rounding was exact the result may still be // Otherwise no double rounding error; result is already correct // inexact because of the previous rounding If l, r, and s represent the least significant decimal digit if ðinexact gt MP0kMP lt even0Þ before rounding, the rounding digit, and the sticky bit inexact gt MP ¼ 1 (which is 0 if all digits to the right of r are 0 and 1 endif otherwise), then if ðinexact lt MP0kMP gt even0Þ inexact lt MP ¼ 1 inexact lt MP ¼ððr ¼¼ 0&&s ! ¼ 0Þk ð1 <¼ r&&r<¼ 4ÞÞ; endif inexact gt MP ¼ððr ¼¼ 5&&s ! 
¼ 0Þk ðr>5ÞÞ; else if ðMP gt even && ðinexact gt MP0kMP lt even0ÞÞ MP lt even¼ðððl percent 2Þ¼¼1Þ&&ðr¼¼5Þ&&ðs¼¼0ÞÞ; // pulled up to a midpoint greater than an even floating-point MP gt even¼ðððl percent 2Þ¼¼0Þ&&ðr¼¼5Þ&&ðs¼¼0ÞÞ; number MP gt even ¼ 0 exact ¼ððr ¼¼ 0Þ && ðs ¼¼ 0ÞÞ: inexact lt MP ¼ 1 The correction logic modifies the result res of the second else if MP lt even && ðinexact lt MP0kMP gt even0Þ rounding only if a double rounding error has occurred; // pulled down to a midpoint less than an even floating-point otherwise, res0 ¼ res is already correct. Note that after number correction, the significand of res0 can be smaller by 1 ulp MP lt even ¼ 0 than the smallest N-digit significand in base b.The inexact gt MP ¼ 1 correction logic has to take care of this corner case too. else For the example considered above, l0 ¼ 9, r0 ¼ 9, s0 ¼ 0, // the result and the rounding indicators are already correct l ¼ 1, r ¼ 5, and s ¼ 0. It follows that add1 ¼ 0 and sub1 ¼ 1, endif and so, the double rounding error upward will be corrected by subtracting 1 from the significand of the result. (But in In rounding-to-nearest-away mode, double rounding practice, for the BID library implementation, it was not errors may also occur just as they do in rounding-to- necessary to determine l, r, and s in order to calculate the nearest-even mode: when the exact result res0 of a floating- values of these logical indicators.) point operation with operands of precision N0 is rounded The logic shown above remedies a double rounding correctly (in the IEEE 754R sense) first to a result res0 of error that might occur when rounding to nearest-even and precision N0, and then, res0 is rounded again to a narrower generates the correct value of the result in all cases. precision N. Sometimes, the result res does not represent

Authorized licensed use limited to: Intel Corporation via the Intel Library. Downloaded on March 27, 2009 at 16:30 from IEEE Xplore. Restrictions apply. 158 IEEE TRANSACTIONS ON COMPUTERS, VOL. 58, NO. 2, FEBRUARY 2009 the IEEE-correct result res0 that would have been obtained correct were the original exact result res0 rounded directly to endif precision N. In this case too, only positive results will be Just as for rounding to nearest-even, the values of the considered, because rounding to nearest-away is symmetric rounding digit r and of the sticky bit s can be used to with respect to zero. A double rounding error for rounding calculate to nearest-away can only be upward. The same example considered above for rounding to inexact lt MP ¼ððr¼¼0 && s ! ¼0Þjjð1 <¼ r && r <¼4ÞÞ; nearest-even can be used for rounding to nearest-away. In inexact gt MP ¼ððr¼¼5 && s ! ¼0Þjjðr > 5ÞÞ; this case, we have MP ¼ððr¼¼5Þ && ðs¼¼0ÞÞ; exact ¼ððr¼¼0Þ && ðs¼¼0ÞÞ: res0 ¼ða þ bÞRA;34 ¼ 1000000000000001500000000000000000 1034: The correction is applied to the result res of the second rounding based on the value of sub1 but only if a double (The notation x indicates the rounding of x to RA;34 rounding error has occurred. Otherwise, res0 ¼ res is res nearest-away to 34 decimal digits.) The result 0 rounded already correct. Note that after correction, the significand N 16 to 64-bit format ð ¼ Þ is of res0 can be smaller by 1 ulp than the smallest N-digit 16 significand in base b. Again, the correction logic has to take res ¼ðres0Þ ¼ 1000000000000002 10 : RA;16 care of this corner case too. For the example shown above, This result is different from what would have been the correction signal sub1 ¼ 1 is easily calculated given obtained by rounding a + b directly to precision N ¼ 16;a r0 ¼ 9, s0 ¼ 0, r ¼ 5, and s ¼ 0. The double rounding error is double rounding error of 1 ulp, upward, has occurred, corrected by subtracting 1 from the significand of the result. because Again, it may be useful to know also the correct rounding indicators from the second rounding. The com- 0 res plete correction logic is similar to that obtained for round- 34 ing to nearest-even. ¼ð10000000000000014999999999999999999 10 ÞRA;16 This method was successfully applied for implementing ¼ 1000000000000001 1016: various nonhomogeneous decimal floating-point opera- Again, correction of double rounding errors can be tions, with minimal overhead. It is worth mentioning again achieved by using minimal additional information from the that the same method can be applied for binary floating- point operations in hardware, software, or a combination of floating-point operation applied to operands of precision N0 the two. and result rounded to precision N0 and the conversion of the result of precision N0 to precision N, where N0 >N. The additional information consists in this case of just 7PERFORMANCE DATA three logical indicators for each of the two rounding In this section, we present performance data in terms of N N N N operations ( 0 to 0 and then 0 to ): clock cycle counts needed to execute several library inexact lt MP; inexact gt MP; MP: functions. The median and maximum values are shown. The minimum values were not considered very interesting A fourth indicator can be derived from the first three: (usually, a few clock cycles) because they reflect just a quick exact ¼ !inexact lt MP && !inexact gt MP && !MP: exit from the function call for some special-case operands. 
To obtain the results shown here, each function was run on The method for correction of double rounding errors a set of tests covering corner cases and ordinary cases. The when rounding results to nearest-away is based on a similar mix of data has been chosen to exercise the library observation that such errors occur when the first rounding (including special cases such as treatment of and pulls up the result to a value that is a midpoint between two infinities) rather than to be a representative of a specific consecutive floating-point numbers, and therefore, the decimal floating-point workload. second rounding causes an error of 1 ulp upward. These preliminary results give a reasonable estimate of The correction method can be expressed as follows worst case behavior, with the median information being a (0 identifies again the first rounding): good measure of the performance of the library. Test runs were carried out on four different platforms to // avoid a double rounding error for rounding to nearest-away get a wide sample of performance data. Each system was sub1 ¼ MP && ðMP0kinexact gt MP0Þ running Linux. The best performance was on the Xeon 5100 if sub1 // double rounding error upward Intel164 platform, where the numbers are within one to significand-- two orders of magnitude from those for possible hardware ^ if significand ¼¼ b ðN 1Þ1 solutions. // falls below the smallest N-digit significand Table 1 contains clock cycle counts for a sample of significand ¼ b^N 1 arithmetic functions. These are preliminary results as the unbiased_exp--// decrease exponent by 1 library is in the pre-beta development stage. A small endif number of optimizations were applied, but significant // Otherwise no double rounding error; the result is already improvements may still be possible through exploitation

Authorized licensed use limited to: Intel Corporation via the Intel Library. Downloaded on March 27, 2009 at 16:30 from IEEE Xplore. Restrictions apply. CORNEA ET AL.: A SOFTWARE IMPLEMENTATION OF THE IEEE 754R DECIMAL FLOATING-POINT ARITHMETIC USING THE BINARY... 159

TABLE 1 TABLE 3 Clock Cycle Counts for a Subset Clock Cycle Counts for a Subset of Arithmetic Functions {Max/Median Values} of Miscellaneous Functions {Max/Median Values}

example is that based on its median latency (71 clock cycles for add64 in Table 1), the 64-bit BID add is less than four of specific hardware features or careful analysis and times slower than a four-clock-cycle single-precision binary reorganization of the code. floating-point add operation in hardware on a 600-MHz Table 2 contains clock cycle counts for conversion Ultra Sparc III CPU of just a few years ago. functions. It is worth noting that the BID library performs well even for conversion routines to and from string format. 8CONCLUSION Table 3 contains clock cycle counts for other miscellaneous IEEE 754R functions, which can be found in [20] under In this paper, we presented some mathematical properties slightly different names. For example, rnd_integral_ and algorithms used in the first implementation in software of away128 and q_cmp_lt_unord128 are our implementa- the IEEE 754R decimal floating-point arithmetic, based on the tions of, respectively, the IEEE 754R operations round- BID encoding. We concentrated on the problem of rounding ToIntegralTiesAway and compareQuietLess correctly decimal values that are stored in binary format Unordered on the Decimal128 type. while using binary operations efficiently and also presented It is interesting to compare these latencies with those for briefly other important or interesting algorithms used in the other approaches. For example, the decNumber package [9] library implementation. Finally, we provided a wide sample run on the same Xeon 5100 system has a maximum latency of performance numbers that demonstrate that the possible of 684 clock cycles and a median latency of 486 clock cycles speedup of hardware implementations over software may for the add64 function. As a comparison, the maximum and not be as dramatic as previously estimated. The implementa- median latencies of a 64-bit addition on the 3.0-GHz Xeon tion was validated through testing against reference func- 5100 are 133 and 71 clocks cycles, respectively. Another tions implemented independently, which used, in some cases, existing multiprecision arithmetic functions. As we look toward the future, we expect further TABLE 2 improvements in performance through algorithm and code Clock Cycle Counts for a Subset of Conversion Functions {Max/Median Values} optimizations, as well as enhanced functionality, for example, through addition of transcendental function support. Furthermore, we believe that hardware support can be added incrementally to improve decimal floating- point performance as demand for it grows.

APPENDIX ALGORITHMS FOR BEST RATIONAL APPROXIMATIONS The standard problem in Diophantine approximation is to investigate how well a number x can be approximated by rationals subject to an upper bound on the size of the denominator. That is, what is the minimal value of jx p=qj

Authorized licensed use limited to: Intel Corporation via the Intel Library. Downloaded on March 27, 2009 at 16:30 from IEEE Xplore. Restrictions apply. 160 IEEE TRANSACTIONS ON COMPUTERS, VOL. 58, NO. 2, FEBRUARY 2009 for jqjN? The usual approach to finding such approx- and so, since we are dealing with integers, imations is to generate sequences of rationals p1=q1 N, we know, respectively, that p1=q1 is and N 1, since otherwise, the problem is of little interest. the best left approximation with denominator bound N and But note that even with this assumption, there may be no p2=q2 is the best right approximation with denominator best right approximation within some bounds. For example, bound N. If on the other hand, q1 þ q2

p1=q1

p1=q1 x

Authorized licensed use limited to: Intel Corporation via the Intel Library. Downloaded on March 27, 2009 at 16:30 from IEEE Xplore. Restrictions apply. CORNEA ET AL.: A SOFTWARE IMPLEMENTATION OF THE IEEE 754R DECIMAL FLOATING-POINT ARITHMETIC USING THE BINARY... 161 next mediant exceeds the denominator limit. In practice, REFERENCES one can sometimes reach pathological situations where an [1] G. Bohlender and T. Teufel, “A Decimal Floating-Point Processor infeasible number of mediant steps are to be taken. So for Optimal Arithmetic,” Computer Arithmetic: Scientific Computa- rather than make these steps one at a time, we can condense tion and Programming Languages, pp. 31-58, Teubner Stuttgart, 1987. multiple “go left” and “go right” operations into one. (This [2] S. Boldo and G. Melquiond, “When Double Rounding Is Odd,” is effectively like a traditional continued fraction algorithm, Proc. 17th IMACS World Congress, Scientific Computation, Applied except that we are more delicate about the limits on the Math. and Simulation, 2005. [3] W. Buchholz, “Fingers or Fists? (The Choice of Decimal or Binary numerator and the denominator.) The following defines the Representation),” Comm. ACM, vol. 2, no. 12, pp. 3-11, 1959. number of left and right steps we can take when finding left [4] F.Y. Busaba, C.A. Krygowski, W.H. Li, E.M. Schwarz, and approximations before we either need a step of the other S.R. Carlough, “The IBM z900 Decimal Arithmetic Unit,” Proc. 35th Asilomar Conf. Signals, Systems and Computers, vol. 2, p. 1335, kind or exceed the denominator bound: Nov. 2001. 8 jk [5] M.S. Cohen, T.E. Hull, and V.C. Hamacher, “CADAC: A < Nq2 Controlled-Precision Decimal Arithmetic Unit,” IEEE Trans. q ; if q1x ¼ p1; k ¼ 1jkjk Computers, vol. 32, no. 4, pp. 370-377, Apr. 1983. left : N q p q x min 2 ; 2 2 ; otherwise; [6] M. Cornea, C. Anderson, J. Harrison, P.T.P. Tang, E. Schneider, q1 q1xp1 and C. Tsen, “An Implementation of the IEEE 754R Decimal Floating-Point Arithmetic Using the Binary Encoding Format,” and Proc. 18th IEEE Symp. Computer Arithmetic (ARITH ’07), pp. 29-37, 8 jk 2007. < Nq1 ; if q2x ¼ p1; [7] M.F. Cowlishaw, “Densely Packed Decimal Encoding,” IEE q2jklm kright ¼ Proc.—Computers and Digital Techniques, vol. 149, pp. 102-104, : Nq1 q1xp1 May 2002. min q ; p q x 1 ; otherwise: 2 2 2 [8] M.F. Cowlishaw, “Decimal Floating-Point: Algorism for Compu- ters,” Proc. 16th IEEE Symp. Computer Arithmetic (ARITH ’03), pp. 104-111, June 2003. Now, if kleft > 0, our next left convergent pair will be [9] M.F. Cowlishaw, The decNumber Library, http://www2.hursley. ðp1=q1; p=qÞ,wherep ¼ p1kleft þ p2,andq ¼ q1kleft þ q2. ibm.com/decimal/decnumber.pdf, 2006. k > 0 [10] A.Y. Duale, M.H. Decker, H.-G. Zipperer, M. Aharoni, and Otherwise, if right , our next left convergent pair will T.J. Bohizic, “Decimal Floating-Point in z9: An Implementation be ðp=q; p2=q2Þ, where p ¼ p1 þ p2kright, and q ¼ q1 þ q2kright. and Testing Perspective,” IBM J. Research and Development, Otherwise, we have reached the denominator limit, and the http://www.research.ibm.com/journal/rd/511/duale.html, p =q 2007. best left approximation is 1 1. When finding right [11] M.A. Erle, J.M. Linebarger, and M.J. Schulte, “Potential Speedup approximations, the corresponding kleft and kright are just Using Decimal Floating-Point Hardware,” Proc. 36th Asilomar a little different: Conf. Signals, Systems, and Computers, pp. 1073-1077, Nov. 2002. 8 jk [12] M.A. Erle and M.J. Schulte, “Decimal Multiplication via Carry- Save Addition,” Proc. 
$$k_{left} = \begin{cases} \left\lfloor \dfrac{N - q_2}{q_1} \right\rfloor, & \text{if } q_1 x = p_1, \\[1ex] \min\left( \left\lfloor \dfrac{N - q_2}{q_1} \right\rfloor,\; \left\lceil \dfrac{p_2 - q_2 x}{q_1 x - p_1} \right\rceil - 1 \right), & \text{otherwise,} \end{cases}$$

and

$$k_{right} = \begin{cases} \left\lfloor \dfrac{N - q_1}{q_2} \right\rfloor, & \text{if } q_2 x = p_2, \\[1ex] \min\left( \left\lfloor \dfrac{N - q_1}{q_2} \right\rfloor,\; \left\lfloor \dfrac{q_1 x - p_1}{p_2 - q_2 x} \right\rfloor \right), & \text{otherwise.} \end{cases}$$

Of course, to find the best approximation to a number, we see which of its best left and right approximations is closer and pick either one if they are equally close. (If the number is itself rational, we may, depending on context, also include the number itself if its denominator is within range.)

Another advantage of distinguishing left and right approximations is that we can very easily generate the $k$ closest left or right approximations. For example, let $p_1/q_1$ be the best left approximation to $x$ with denominator bound $N$. The next-best left approximation to $x$ with denominator bound $N$ must in fact be the best left approximation to $p_1/q_1$ (since if it were $> p_1/q_1$, it would contradict the best approximation status of $p_1/q_1$), and so forth. In this way, we can also enumerate all rationals subject to the denominator bound that lie within some given $\epsilon > 0$ of $x$. Of course, depending on $x$, $N$, and $\epsilon$, the number of such approximations may mean that they are not feasibly enumerable.
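To make the condensed stepping concrete, here is a minimal C sketch (ours, not the paper's implementation) that finds the best left and right approximations to a rational input x = a/b under a denominator bound N. The function name best_approx, the use of 64-bit integers, and the absence of overflow guards are illustrative assumptions; production conversion code must operate on much wider intermediate values.

```c
#include <stdio.h>
#include <stdint.h>

/* Illustrative sketch: best rational approximation to x = a/b (a, b > 0)
 * with denominator at most N, via condensed "go left"/"go right" mediant
 * steps.  find_left != 0 returns the best left approximation (<= x);
 * find_left == 0 returns the best right approximation (>= x).
 * No overflow guards: a real implementation needs wider arithmetic.     */
static void best_approx(int64_t a, int64_t b, int64_t N, int find_left,
                        int64_t *p_out, int64_t *q_out)
{
    int64_t p1 = a / b, q1 = 1;      /* left convergent,  p1/q1 <= x */
    int64_t p2 = a / b + 1, q2 = 1;  /* right convergent, p2/q2 >= x */

    for (;;) {
        int64_t k, u, v;

        /* Condensed "go left" steps: right endpoint descends toward x. */
        if (q1 * a == p1 * b) {      /* x = p1/q1: only the q-bound matters */
            k = (N - q2) / q1;
        } else {
            u = p2 * b - q2 * a;     /* numerator of p2/q2 - x, scaled */
            v = q1 * a - p1 * b;     /* numerator of x - p1/q1, scaled */
            int64_t cross = find_left ? u / v                 /* floor  */
                                      : (u + v - 1) / v - 1;  /* ceil-1 */
            int64_t bound = (N - q2) / q1;
            k = cross < bound ? cross : bound;
        }
        if (k > 0) { p2 += k * p1; q2 += k * q1; continue; }

        /* Condensed "go right" steps: left endpoint ascends toward x. */
        if (q2 * a == p2 * b) {      /* x = p2/q2 */
            k = (N - q1) / q2;
        } else {
            u = q1 * a - p1 * b;
            v = p2 * b - q2 * a;
            int64_t cross = find_left ? (u + v - 1) / v - 1   /* ceil-1 */
                                      : u / v;                /* floor  */
            int64_t bound = (N - q1) / q2;
            k = cross < bound ? cross : bound;
        }
        if (k > 0) { p1 += k * p2; q1 += k * q2; continue; }

        break;                       /* denominator limit reached */
    }

    *p_out = find_left ? p1 : p2;
    *q_out = find_left ? q1 : q2;
}

int main(void)
{
    int64_t pl, ql, pr, qr;
    best_approx(3, 10, 7, 1, &pl, &ql);  /* best left  approx of 3/10: 2/7 */
    best_approx(3, 10, 7, 0, &pr, &qr);  /* best right approx of 3/10: 1/3 */
    printf("%lld/%lld and %lld/%lld\n",
           (long long)pl, (long long)ql, (long long)pr, (long long)qr);
    return 0;
}
```

For x = 3/10 and N = 7, the sketch returns 2/7 (left) and 1/3 (right); 2/7 is the closer of the two and hence the best approximation overall. Calling best_approx(2, 7, 7, 1, ...) on that result returns 1/4, the next-best left approximation to 3/10, illustrating the enumeration process described above.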
REFERENCES

[1] G. Bohlender and T. Teufel, "A Decimal Floating-Point Processor for Optimal Arithmetic," Computer Arithmetic: Scientific Computation and Programming Languages, pp. 31-58, Teubner Stuttgart, 1987.
[2] S. Boldo and G. Melquiond, "When Double Rounding Is Odd," Proc. 17th IMACS World Congress, Scientific Computation, Applied Math. and Simulation, 2005.
[3] W. Buchholz, "Fingers or Fists? (The Choice of Decimal or Binary Representation)," Comm. ACM, vol. 2, no. 12, pp. 3-11, 1959.
[4] F.Y. Busaba, C.A. Krygowski, W.H. Li, E.M. Schwarz, and S.R. Carlough, "The IBM z900 Decimal Arithmetic Unit," Proc. 35th Asilomar Conf. Signals, Systems and Computers, vol. 2, p. 1335, Nov. 2001.
[5] M.S. Cohen, T.E. Hull, and V.C. Hamacher, "CADAC: A Controlled-Precision Decimal Arithmetic Unit," IEEE Trans. Computers, vol. 32, no. 4, pp. 370-377, Apr. 1983.
[6] M. Cornea, C. Anderson, J. Harrison, P.T.P. Tang, E. Schneider, and C. Tsen, "An Implementation of the IEEE 754R Decimal Floating-Point Arithmetic Using the Binary Encoding Format," Proc. 18th IEEE Symp. Computer Arithmetic (ARITH '07), pp. 29-37, 2007.
[7] M.F. Cowlishaw, "Densely Packed Decimal Encoding," IEE Proc.—Computers and Digital Techniques, vol. 149, pp. 102-104, May 2002.
[8] M.F. Cowlishaw, "Decimal Floating-Point: Algorism for Computers," Proc. 16th IEEE Symp. Computer Arithmetic (ARITH '03), pp. 104-111, June 2003.
[9] M.F. Cowlishaw, The decNumber Library, http://www2.hursley.ibm.com/decimal/decnumber.pdf, 2006.
[10] A.Y. Duale, M.H. Decker, H.-G. Zipperer, M. Aharoni, and T.J. Bohizic, "Decimal Floating-Point in z9: An Implementation and Testing Perspective," IBM J. Research and Development, http://www.research.ibm.com/journal/rd/511/duale.html, 2007.
[11] M.A. Erle, J.M. Linebarger, and M.J. Schulte, "Potential Speedup Using Decimal Floating-Point Hardware," Proc. 36th Asilomar Conf. Signals, Systems, and Computers, pp. 1073-1077, Nov. 2002.
[12] M.A. Erle and M.J. Schulte, "Decimal Multiplication via Carry-Save Addition," Proc. IEEE 14th Int'l Conf. Application-Specific Systems, Architectures, and Processors (ASAP '03), pp. 348-358, June 2003.
[13] European Commission, The Introduction of the Euro and the Rounding of Currency Amounts, http://europa.eu.int/comm/economy_finance/publications/euro_papers/2001/eup22en.pdf, Mar. 1998.
[14] S.A. Figueroa, "When Is Double Rounding Innocuous?" ACM SIGNUM Newsletter, vol. 20, no. 3, pp. 21-26, 1995.
[15] D. Goldberg, "What Every Computer Scientist Should Know about Floating-Point Arithmetic," ACM Computing Surveys, vol. 23, pp. 5-48, 1991.
[16] G. Gray, "UNIVAC I Instruction Set," Unisys History Newsletter, vol. 2, no. 3, 2001.
[17] T. Horel and G. Lauterbach, "UltraSPARC-III: Designing Third-Generation 64-Bit Performance," IEEE Micro, vol. 19, no. 3, pp. 73-85, May/June 1999.
[18] IBM Corp., The Telco Benchmark, http://www2.hursley.ibm.com/decimal/telco.html, Mar. 1998.
[19] ANSI/IEEE Standard for Binary Floating-Point Arithmetic, ANSI/IEEE Std 754-1985, IEEE, 1985.
[20] Draft Standard for Floating-Point Arithmetic P754, Draft 1.2.5, IEEE, http://www.validlab.com/754R/drafts/archive/2006-10-04.pdf, Oct. 2006.
[21] ISO 1989:2002 Programming Languages—COBOL, ISO Standards, JTC 1/SC 22, 2002.
[22] C# Language Specification, Standard ECMA-334, http://www.ecma-international.org/publications/files/ECMA-ST/Ecma-334.pdf, 2005.
[23] Sun Corp., Class BigDecimal, http://java.sun.com/j2se/1.4.2/docs/api/java/math/BigDecimal.html, 2008.
[24] P. Tang, "Binary-Integer Decimal Encoding for Decimal Floating Point," technical report, Intel Corp., http://754r.ucbtest.org/issues/decimal/bid_rationale.pdf.
[25] XML Schema Part 2: Datatypes Second Edition, World Wide Web Consortium (W3C) recommendation, http://www.w3.org/TR/2004/REC-xmlschema-2-20041028/, Oct. 2004.
[26] L. Wang, "Processor Support for Decimal Floating-Point Arithmetic," technical report, Electrical and Computer Eng. Dept., Univ. of Wisconsin, Madison, year?
[27] M.H. Weik, The ENIAC Story, http://ftp.arl.mil/~mike/comphist/eniac-story.html, 2007.

Marius Cornea received the MSc degree in electrical engineering from the Polytechnic Institute of Cluj, Romania, and the PhD degree in computer science from Purdue University, West Lafayette, Indiana. He joined Intel Corp., Hillsboro, Oregon, in 1994 and is now a principal engineer and the manager of the Numerics Team. His work has involved many aspects of scientific computation, design and development of numerical algorithms, floating-point emulation, exception handling, and new floating-point instruction definition and analysis. He is a member of the IEEE.

John Harrison received the bachelor's degree in electronic engineering, the diploma in computer science, and the PhD degree in computer science from the University of Cambridge, England. He joined Intel Corp., Hillsboro, Oregon, in 1998 and is now a principal engineer. He specializes in formal verification, automated theorem proving, floating-point arithmetic, and mathematical algorithms. He is a member of the IEEE.

Cristina Anderson received the MS degree in computer science from the Polytechnic University of Bucharest, Romania, in 1995 and the PhD degree in computer engineering from Southern Methodist University, Dallas, in 1999. She has been with Intel Corp., Hillsboro, Oregon, since 1999. Her topics of interest include numerical algorithms and computer architecture, with a focus on arithmetic unit design.

Ping Tak Peter Tang received the PhD degree in mathematics from the University of California, Berkeley. He was, until recently, a senior principal engineer with Intel Corp., Hillsboro, Oregon. He is currently with D.E. Shaw Research, L.L.C., New York. His main interest is in numerical algorithms, as well as error analysis of such algorithms. He is a member of the IEEE.

Eric Schneider received the MS degree in computer science from Washington State University. He has been with Intel Corp., Hillsboro, Oregon, since 1995. He is interested in all aspects of CPU validation, specializing in ALU validation along with random instruction test generation.

Evgeny Gvozdev graduated from Moscow State University. He formerly worked as a software engineer in the Russian Federal Nuclear Center, Sarov. He joined Intel Corp., Nizhny Novgorod Region, Russia, in 1996 and currently works for the Intel MKL team. His special interest is in testing methodologies for numerical functions, which he has applied to support the quality assurance of Intel mathematical libraries.
