
Comparison between binary64 and decimal64 floating-point numbers

Nicolas Brisebarre, Marc Mezzarobba, Christoph Lauter, Jean-Michel Muller

Laboratoire LIP, CNRS, ENS Lyon, INRIA, Univ. Claude Bernard Lyon 1, Lyon, France
Université Pierre et Marie Curie, Laboratoire d'Informatique de Paris 6, Paris, France

[email protected], fi[email protected]

Abstract—We introduce a software-oriented algorithm that allows one to quickly compare a binary64 floating-point (FP) number and a decimal64 FP number, assuming the "binary encoding" of the formats specified by the IEEE 754-2008 standard for FP arithmetic is used. It is a two-step algorithm: a first pass, based on the exponents only, makes it possible to quickly eliminate most cases; then, when the first pass does not suffice, a more accurate second pass is required. We provide an implementation of several variants of our algorithm, and compare them.

I. INTRODUCTION

The IEEE 754-2008 Standard for Floating-Point Arithmetic [3] specifies several binary and decimal formats. The "basic interchange formats" of the standard are presented in Table I.

TABLE I
THE BASIC BINARY AND DECIMAL INTERCHANGE FORMATS SPECIFIED BY THE 754-2008 STANDARD.

                       binary32   binary64   binary128
  precision (bits)        24         53         113
  e_max                 +127      +1023      +16383
  e_min                 −126      −1022      −16382

                       decimal64  decimal128
  precision (digits)      16         34
  e_max                 +384       +6144
  e_min                 −383       −6143

Although the standard does not require that a binary and a decimal number can be compared (it says that floating-point data represented in different formats shall be comparable as long as the operands' formats have the same radix), such comparisons may nevertheless offer several advantages: it is not infrequent to read decimal data from some database and to have to compare it to some binary floating-point number. That comparison may be slightly inaccurate if the decimal number is preliminarily converted to binary, or, respectively, if the binary number is first converted to decimal.

While it does not require comparisons between floating-point numbers of different radices, the IEEE 754-2008 standard does not forbid them for languages or systems. As the technical report on decimal floating-point arithmetic in C [4] is currently a mere draft, compilers supporting decimal floating-point arithmetic handle code sequences such as the following one at their discretion and often in an unsatisfactory way:

  double x = ...;
  _Decimal64 y = ...;
  if (x <= y) {
    ...
  }

As it occurs, this sequence is translated, for example by Intel's icc 12.1.3, into a conversion from binary to decimal followed by a decimal comparison. The compiler emits no warning that the boolean result might not be the expected one because of rounding in the comparison.

This kind of strategy may lead to inconsistencies. Imagine such a "naïve" approach built as follows: when comparing a binary floating-point number x2 of format F2 and a decimal floating-point number x10 of format F10, we first convert x10 to the binary format F2 (that is, we replace it by the F2 number nearest x10), and then we perform the comparison in binary. Denote by ⊙<, ⊙≤, ⊙>, and ⊙≥ the comparison operators so defined. Consider the following variables (all exactly represented in their respective formats):

  ∙ x = 3602879701896397/2^55, declared as a binary64 number;
  ∙ y = 13421773/2^27, declared as a binary32 number;
  ∙ z = 1/10, declared as a decimal64 number.

Then it holds that x ⊙< y, but also y ⊙≤ z and z ⊙≤ x. Such an inconsistent result might suffice to prevent a sorting program from terminating. A direct mixed-radix binary-decimal comparison might be an answer.

Also, in the long run, allowing "exact" comparison of decimal and binary floating-point numbers will certainly be the only rigorous way of executing program instructions of the form

  if x > 0.1 then ...

unless compilers learn how to round (binary or decimal) constants figuring in comparisons in a way that does not affect the comparison.

In the following, we aim at introducing an algorithm for comparing a binary64 floating-point number

  x2 = M2 · 2^(e2−52)

and a decimal64 floating-point number

  x10 = M10 · 10^(e10−15).

Here M2 and M10 are integers, with |M2| ≤ 2^53 − 1 and |M10| ≤ 10^16 − 1, and e2 and e10 are the floating-point exponents of x2 and x10.

Although most of what will be presented in this paper is generalizable to other formats, we restrict ourselves to these two formats for the sake of simplicity. We will also assume that the so-called binary encoding [3], [6] of IEEE 754-2008 is used for the decimal64 format, so that the integer M10 is easily accessible in binary. Since comparisons are straightforward if x2 and x10 have different signs or are zero, we assume that M2 > 0 and M10 > 0. Also, to leave the IEEE 754 flags untouched (unless when required), we will use integer arithmetic rather than floating-point arithmetic to perform our tests.

When x2 and x10 have significantly different orders of magnitude, examining their exponents will suffice to compare them. Hence, we first address the problem of performing a first, exponent-based test. We then show how to perform the comparison when the first test does not suffice.

II. FIRST STEP: ELIMINATING THE "SIMPLE CASES" BY EXAMINING THE EXPONENTS

Through a possible preliminary binary shift of M2 (when x2 is subnormal), we may assume that 2^52 ≤ M2. This gives the following constraints on e2 (taking that possible shift into account):

  −1074 ≤ e2 ≤ 1023.    (1)

The constraints on e10, read from Table I, are

  −383 ≤ e10 ≤ 384.    (2)

Also, from 1 ≤ M10 ≤ 10^16 − 1, we easily deduce that there exists a unique ν ∈ {0, 1, 2, ..., 53} such that

  2^53 ≤ 2^ν · M10 ≤ 2^54 − 1.

Hence our initial problem of comparing x2 and x10 reduces to comparing M2 · 2^(e2−52+ν) and (2^ν · M10) · 10^(e10−15).

The fact that we "normalize" the decimal significand M10 by a binary shift between two consecutive powers of two is of course questionable; M10 could also be normalized into the range 10^15 ≤ 10^t · M10 ≤ 10^16 − 1. However, as hardware support for binary encoded decimal floating-point arithmetic is not widespread, a decimal normalization would require a loop multiplying by 10 and testing, whereas the proposed binary normalization can exploit an existing hardware leading-zero counter with straight-line code. The pipeline stalls in the loop are hence avoided.

We hence define

  m = M2,
  n = M10 · 2^ν,
  h = ν + e2 − e10 − 37,    (3)
  g = e10 − 15,

so that

  x2 = m · 2^h · 2^(g−ν),
  x10 = n · 5^g · 2^(g−ν).    (4)

Our comparison problem becomes:

  Compare m · 2^h with n · 5^g.

We have

  2^52 ≤ m ≤ 2^53 − 1,
  2^53 ≤ n ≤ 2^54 − 1.    (5)

From the ranges of e2, e10 and ν, one easily deduces that

  −398 ≤ g ≤ 369,
  −1495 ≤ h ≤ 1422.    (6)

Note that once the first step, eliminating comparison cases purely based on the exponents, is complete, we will be able to reduce these domains (see Lemma 1). It makes no sense, for instance, to assume that e2 is tiny and e10 is huge: in such a case, the comparison is straightforward.

Now define two functions φ and ψ by

  φ(h) = ⌊h · log5 2⌋,  ψ(g) = ⌊g · log2 5⌋.    (7)

The function φ is appropriate for performing the first comparison step, as stated in Property 1 below. The other function will be useful in the sequel. We will propose an efficient and easy-to-implement way of computing these functions φ and ψ.

Property 1. We have

  g < φ(h) ⟹ x2 > x10,
  g > φ(h) ⟹ x2 < x10.

Proof: If g < φ(h), then g ≤ φ(h) − 1, hence g ≤ h · log5 2 − 1. This implies that 5^g ≤ (1/5) · 2^h, therefore 2^54 · 5^g ≤ (4/5) · 2^52 · 2^h. As a consequence, n · 5^g < m · 2^h. If g > φ(h), then g ≥ φ(h) + 1, hence g > h · log5 2, so that 5^g > 2^h. This implies 2^53 · 5^g > (2^53 − 1) · 2^h, and hence n · 5^g > m · 2^h.

Property 2. Denoting by ⌊·⌉ the nearest integer function, let

  L_φ = ⌊2^19 · log5 2⌉ = 225799,
  L_ψ = ⌊2^12 · log2 5⌉ = 9511.

For all h in the range (6), we have

  φ(h) = ⌊h · L_φ · 2^(−19)⌋.

Similarly, we have

  ψ(g) = ⌊L_ψ · g · 2^(−12)⌋ for |g| ≤ 204,

and, in the special case g = 16q,

  ψ(16q) = ⌊L_ψ · q · 2^(−8)⌋ for |q| ≤ 32.

The products L_φ · h and L_ψ · g, for g, h in the indicated ranges, can all be computed exactly in (signed or unsigned) 32-bit integer arithmetic. Computing ⌊ξ · 2^(−β)⌋ of course reduces to a right-shift by β bits.
Proof: Using Maple, we exhaustively check that

  φ(h) = ⌊h · L_φ · 2^(−19)⌋ for |h| ≤ 1831,

and that

  ψ(g) = ⌊L_ψ · g · 2^(−12)⌋

when either |g| ≤ 204 or g = 16q with |q| ≤ 32. The corresponding quantities L_φ · h and L_ψ · g all fit on at most 29 bits in magnitude, plus one sign bit.

From Properties 1 and 2, we easily derive the following algorithm, which performs the first step, based on the exponents of x2 and x10.

Algorithm 1. First step.
  ∙ compute h = ν + e2 − e10 − 37 and g = e10 − 15;
  ∙ compute φ(h) = ⌊L_φ · h · 2^(−19)⌋ using 32-bit integer arithmetic;
  ∙ if g < φ(h) then return "x2 > x10", else if g > φ(h) then return "x2 < x10", else perform the second step.

Note that, when x10 admits multiple distinct representations in decimal64 format (i.e., when its cohort is non-trivial [3]), the success of the first step may depend on the specific representation passed as input. For instance, both A = {M10 = 10^15, e10 = 0} and B = {M10 = 1, e10 = 15} are valid representations of the integer 1. Assume we are trying to compare x10 = 1 to x2 = 2. Using representation A, we have ν = 4, h = −32, and φ(h) = −14 > g = −15, hence the test from Algorithm 1 shows that x10 < x2. In contrast, if x10 is given in the form B, we get ν = 53, φ(h) = φ(2) = 0 = g, and the test is inconclusive.

We may quantify the quality of this first filter as follows. We say that Algorithm 1 succeeds if it answers "x2 > x10" or "x2 < x10" without proceeding to the second step, and fails otherwise. Let X2 and X10 denote the sets of representations of positive, finite numbers, respectively in binary64 and in decimal64 format. (In the case of X2, each number has a single representation.) Assuming zeros, infinities, and NaNs have been handled before, the input of Algorithm 1 may be thought of as a pair (ξ2, ξ10) ∈ X2 × X10.

Proposition 1. Algorithm 1 succeeds for more than 99.9% of the input pairs (ξ2, ξ10) ∈ X2 × X10.

Proof: The first step succeeds if and only if φ(h) ≠ g, that is, iff φ(ν + e2 − e10 − 37) ≠ e10 − 15. The value of ν depends only on M10. Knowing ν, the number n_ν of possible values of M10 is

  n_ν = 2^(53−ν) for ν > 0,  n_0 = 10^16 − 2^53.

In addition, for each fixed ν, we can easily compute the number k_ν of pairs (e2, e10) such that −1022 ≤ e2 ≤ 1023, −383 ≤ e10 ≤ 384, and φ(h) = g. Let X2^norm be the subset of X2 consisting of normal numbers. The number of pairs (e2, ξ10) such that φ(h) = g and ξ2 ∈ X2^norm is

  N1 = Σ_{ν=0}^{53} n_ν · k_ν = 14 292 575 372 220 927 484 ≤ 1.55 · 2^63.

If now x2 is subnormal, then h < −623 and φ(h) < −268. A rough bound on the number of pairs (ξ2, ξ10) ∈ (X2 ∖ X2^norm) × X10 such that g = φ(h) is hence

  N2 = 2^52 · 10^16 · (398 − 268) ≤ 1.12 · 2^112.

Thus, the number of elements of X2 × X10 for which the first step fails is bounded by

  N = 2^52 · N1 + N2 ≤ 1.691 · 2^115.

This is to be compared with

  |X2 × X10| = (2^52 − 1) · 2047 · (10^16 − 1) · 768 ≥ 1.66 · 2^125.

We obtain N/|X2 × X10| ≤ 10^(−3).

As a matter of course, pairs (ξ2, ξ10) will almost never be equidistributed in practice. Hence the previous estimate should not be interpreted as a probability of success of Step 1. It seems more realistic to assume that a well-written numerical algorithm will mostly perform comparisons between numbers which are suspected to be close to each other. For instance, in an iterative algorithm where comparisons are used as part of a convergence test, it is to be expected that most comparisons need to proceed to the second step. Conversely, there are scenarios, e.g., checking for out-of-range data, where the first step should almost always be enough.
III. SECOND STEP: A CLOSER LOOK AT THE SIGNIFICANDS

In the following we assume that g = φ(h) (otherwise, the first step already allowed us to compare x2 and x10). Define a function

  f(h) = 5^φ(h) · 2^(−h) = 2^(⌊h·log5 2⌋·log2 5 − h).

We have

  f(h) · n > m ⟹ x10 > x2,
  f(h) · n < m ⟹ x10 < x2,    (8)
  f(h) · n = m ⟹ x10 = x2.

The second test consists in performing this comparison, with f(h) · n replaced by an approximation that is accurate enough not to introduce false results.

Ensuring that an approximate test is indeed equivalent to (8) requires finding a lower bound η on the minimum nonzero value of |m/n − 5^φ(h)/2^h| that may appear at this stage. We want η to be as tight as possible in order to avoid unduly costly computations when approximating f(h) · n.
The search space is constrained by the following observations.

Lemma 1. The equality g = φ(h), where g and h are defined by (3), can hold only if −787 ≤ h ≤ 716. Additionally,
  ∙ if n > 10^16, then n is necessarily even;
  ∙ if h ≥ 680, then ν′ = h + φ(h) − 971 > 0 and n is divisible by 2^(ν′).

Proof: Substituting e2 = h + g − ν + 52 and g = φ(h) into inequality (1) yields

  −1126 ≤ e2 − 52 = h + φ(h) − ν ≤ 971.    (9)

Since we have 0 ≤ ν ≤ 53 and h · log5 2 − 1 ≤ φ(h) ≤ h · log5 2, this implies

  −1126 ≤ h · (1 + log5 2) ≤ 1025.

The bound on h follows. Both conditions on n come from the fact that 2^ν divides n (which we denote 2^ν | n in the sequel of the paper) by definition of n. If n > 10^16, then ν ≥ 1 and hence n is even. In general, the last inequality from (9) implies ν ≥ ν′. This last bound is nontrivial when h ≥ 680.

We now deal with the problem of finding η. This problem is very close to that, considered by Cornea et al. in [1], of finding the required accuracy for performing correctly rounded conversions between the binary and decimal formats of the IEEE 754-2008 Standard. The constraints on m, n and h below follow from inequalities (5) and Lemma 1.

Problem 1. Find the smallest nonzero value of

  d_h(m, n) = |5^φ(h)/2^h − m/n|

under the following constraints:

  2^52 ≤ m ≤ 2^53 − 1,
  2^53 ≤ n ≤ 2^54 − 1,
  −787 ≤ h ≤ 716,
  n is even if n > 10^16,
  if h ≥ 680, then ν′ = h + φ(h) − 971 > 0 and 2^(ν′) | n.

For each integer h ∈ [0, 716], a numerator of d_h(m, n) is |5^φ(h) · n − m · 2^h| and a denominator is 2^h · n. If 0 ≤ h ≤ 53 and d_h(m, n) ≠ 0, we have |5^φ(h) · n − m · 2^h| ≥ 1, hence d_h(m, n) ≥ 1/(2^h · n) ≥ 2^(−107). This easy argument actually means that, for the range of h where equality cases might occur, the nonzero minimum is no less than 2^(−107). If h ≥ 54, since 5^φ(h) and 2^h are coprime and n ≤ 2^54 − 1, we have d_h(m, n) ≠ 0.

Minimizing d_h(m, n) is a classical question related to the theory of continued fractions [2], [5], [7]. We now give a short reminder on the results we need in our setting. Let α ∈ ℚ. We build two finite sequences (a_i)_{0≤i≤n} and (r_i)_{0≤i≤n} defined by

  r_0 = α,
  a_i = ⌊r_i⌋,
  r_{i+1} = 1/(r_i − a_i) if a_i ≠ r_i.

Note that this process is actually Euclid's algorithm. For all 0 ≤ i ≤ n, the rational number

  p_i/q_i = a_0 + 1/(a_1 + 1/(a_2 + ··· + 1/a_i))

is the i-th convergent of the continued fraction expansion of α. We write α = [a_0; a_1, ..., a_n]. The a_i are the partial quotients of the expansion. Note that, if we assume p_{−1} = 1, q_{−1} = 0, p_0 = a_0, q_0 = 1, we have

  p_{i+1} = a_{i+1} · p_i + p_{i−1},  q_{i+1} = a_{i+1} · q_i + q_{i−1},

and α = p_n/q_n.

First assume that the minimum nonzero value will be less than or equal to 2^(−109). Then the integers 2^52 ≤ m ≤ 2^53 − 1 and 2^53 ≤ n ≤ 2^54 − 1 that we want to determine necessarily satisfy |5^φ(h)/2^h − m/n| ≤ 2^(−109) < 1/(2n^2): a classical result (see Theorem 19 of [5] for instance) yields that m/n is a convergent of 5^φ(h)/2^h. Hence, we only have to compute, for each h ∈ [54, 716], the convergents of 5^φ(h)/2^h whose denominators are smaller than 2^54. Moreover, we only consider the convergents p_i/q_i which improve the minimum nonzero value and for which there exists k ∈ ℕ such that m = k·p_i and n = k·q_i satisfy the constraints of Problem 1. We stress that this pruning trick is valid if and only if one eventually obtains a minimum nonzero value lesser than or equal to 2^(−109). If not, one can use the techniques presented in the Appendix of [1] to get worst and bad cases for the approximation problem we consider (that is, to get values x2 and x10 that are very near). We have favored a different approach here, because it leads to a less computationally intensive and mathematically simpler method, which is easier to check either manually or with a proof assistant. However, if one wishes to generate many cases for which x2 and x10 are very close (typically, for the purpose of testing comparison algorithms), the techniques of [1] become of huge interest. We will take advantage of them in Section VI.

The computation shows that the smallest admissible value of d_h(m, n) for h ≥ 0 is 6.3566... × 10^(−35) < 2^(−113.59), attained for

  h = 612,  m/n = 3521275406171921/8870461176410409.

We address the case h ∈ [−787, 0] in a similar way. In this case, a numerator of d_h(m, n) is |2^(−h) · n − 5^(−φ(h)) · m| and a denominator is 5^(−φ(h)) · n. We notice that 5^(−φ(h)) > 2^54 if and only if h ≤ −54. If −53 ≤ h ≤ 0 and d_h(m, n) ≠ 0, we have |2^(−h) · n − 5^(−φ(h)) · m| ≥ 1, hence d_h(m, n) ≥ 1/(5^(−φ(h)) · n) ≥ 2^(−108). For −787 ≤ h ≤ −54, we use the same continued fraction tools as above, leading to the following result.
Theorem 1. The minimum nonzero value of d_h(m, n) under the constraints from Problem 1 is

  6.0485... × 10^(−35) < 2^(−113.67),

attained for

  h = −275,  m/n = 4988915232824583/12364820988483254.

We may thus take η = 2^(−113.7).

IV. INEQUALITY TESTING

As already mentioned, the first step of the full comparison algorithm is straightforwardly implementable in 32-bit integer arithmetic. The second one reduces to comparing m and f(h) · n, and we have seen in Section III that it is enough to know f(h) with relative accuracy 2^(−113.67). We now discuss several ways to perform this comparison. The main difficulty is to efficiently evaluate f(h) · n with just enough accuracy.

A. Direct method

Perhaps the first thing that comes to mind is to implement the criterion from Equation (8) directly, only with f(h) replaced by a sufficiently accurate approximation, read from a table indexed with h. Theorem 2 below implies that it is more than enough to compute the product f(h) · n in 128-bit arithmetic, using a 128-bit approximation of f(h).
By Lemma 1, there are 1504 values of h to consider, which corresponds to a 23.5 KB table with 128-bit precision entries for f(h). The resulting table is quite large but could possibly be reused, for instance, in conversions between binary and decimal formats. One advantage of this method is that the table value f(h) only depends on h, which is available early in the algorithm flow. Thus, the memory latency of the table access can be hidden behind the first step.

Theorem 2. Assume that μ approximates f(h) · n with relative accuracy ε < η/4 or better, that is,

  |f(h) · n − μ| ≤ ε · f(h) · n < (η/4) · f(h) · n.    (10)

The following implications hold:

  μ > m + ε · 2^54 ⟹ x10 > x2,
  μ < m − ε · 2^54 ⟹ x10 < x2,    (11)
  |m − μ| ≤ ε · 2^54 ⟹ x10 = x2.

Proof: First notice that f(h) ≤ 1 for all h. Since n < 2^54, condition (10) implies |μ − f(h) · n| < 2^54 · ε, and hence |f(h) · n − m| > |μ − m| − 2^54 · ε. Now consider each of the possible cases that appear in (11). If |μ − m| > 2^54 · ε, then f(h) · n − m and μ − m have the same sign. The first two implications from (8) then translate into the corresponding cases from (11). If finally |μ − m| ≤ 2^54 · ε, then the triangle inequality yields |f(h) · n − m| ≤ 2^55 · ε, which implies |f(h) − m/n| < 2^55 · ε/n < 2^53 · η/n ≤ η. But by definition of η, this cannot happen unless m/n = f(h). This accounts for the last case.

Even though a 23.5 KB table for the direct method may still be acceptable in size, let us mention a way this table could be reduced. The function φ(h) is a staircase function that grows more slowly than the identity, so several consecutive h map to a same value g = φ(h). The corresponding values of f(h) = 2^(φ(h)·log2 5 − h) are binary shifts of each other. Hence, we may reduce the table size (at the price of a few additional operations) by storing the most significant bits of f(h) as a function of g, and then shifting the value read off the table by an appropriate amount to recover f(h). The number of elements to tabulate goes down from 1504 to 647, that is, by a factor of about 2.3. This would correspond to a 10.4 KB table (for a memory alignment on 8-byte boundaries).

More precisely, define F(g) = 5^g · 2^(−ψ(g)) (cf. Sec. II). We then have

  f(h) = F(φ(h)) · 2^(−ρ(h)),    (12)

where

  ρ(h) = h − ψ(φ(h)).

The shift ρ(h) is easily computed based on Property 2. In addition, we may check that

  1/2 < F(φ(h)) ≤ 1 and 0 ≤ ρ(h) ≤ 2

for all h. Since ρ(h) is nonnegative, the multiplication by 2^ρ(h) can be implemented with a bitshift, not requiring branches. Additionally, since there is enough headroom in the 64-bit variable holding m (which fits on 54 bits), we can use a left shift on m rather than a two-word right shift on F(g). Note that we did not implement this size reduction and hence could not check its suitability in practice. Instead, we concentrated on another size reduction opportunity, which we shall describe now.

B. Bipartite table

The table size can be reduced still more, using an alternative method based on a bipartite table. The technique takes advantage of the fact that the exact value of 5^g fits on 64 bits for 0 ≤ g ≤ 27. Assuming that the entries are 8-byte aligned, the table uses only 800 B of storage. Compared to the straightforward method from Section IV-A, this second method requires quite a few more arithmetic operations for the sake of a considerably smaller table. Most of the additional overhead in arithmetic operations can however be hidden on current processors through increased instruction-level parallelism. The reduction in table size also helps decrease the probability of cache misses, as the table fits on only a few cache lines.
Recall that we suppose g = φ(h) and hence −787 ≤ h ≤ 716. Write

  g = φ(h) = 16q − r,  q = ⌊g/16⌋ + 1,    (13)

so that f(h) = 5^(16q) · 5^(−r) · 2^(−h). We thus have

  −20 ≤ q ≤ 21,  1 ≤ r ≤ 16.    (14)

In particular, 5^r is an integer and fits on 38 bits.

Instead of tabulating f(h) directly, we will use two tables: one containing the (exact) value 5^r for 1 ≤ r ≤ 16, the other containing the most significant bits of 5^(16q) to a precision of roughly 128 bits. It will prove convenient to store these values with their leading nonzero bit left-aligned, in a fashion similar to floating-point significands. They will be stored on one and two 64-bit words, respectively. Thus, we set

  θ1(q) = 5^(16q) · 2^(−ψ(16q)+127),
  θ2(r) = 5^r · 2^(−ψ(r)+63),

where the power-of-two factors provide for the desired left-alignment.
We can check that

  2^127 ≤ θ1(q) < 2^128 − 1,  2^63 < θ2(r) < 2^64    (15)

for all h. The value f(h) now decomposes as

  f(h) = (θ1(q)/θ2(r)) · 2^(−64−σ(h)),

where

  σ(h) = ψ(r) − ψ(16q) + h.    (16)

Equations (7) and (16) imply that

  h − g · log2 5 − 1 < σ(h) < h − g · log2 5 + 1.

As we also have h · log5 2 − 1 ≤ g = φ(h) < h · log5 2 by definition of φ, it follows that

  0 ≤ σ(h) ≤ 3.    (17)

In particular, the quantity m · 2^σ(h) fits on 56 bits, leaving headroom for 8 additional bits in a 64-bit variable.

Now let

  Δ = θ1(q) · n · 2^(−64+8) − θ2(r) · m · 2^(8+σ(h)).

Obviously, we have |x2| > |x10|, |x2| = |x10| or |x2| < |x10| depending on whether Δ < 0, Δ = 0 or Δ > 0, respectively. The case |x2| = |x10| implies x2 = x10, because this second stage of the algorithm is used only when x2 and x10 have the same sign.

Lemma 2. Unless x2 = x10, we have |Δ| > 2^124 · η.

Proof: The definition of Δ rewrites as

  Δ = 2^(8+σ(h)) · θ2(r) · n · (f(h) − m/n).

The bounds (5), (15), (17) together imply that 2^(8+σ(h)) · θ2(r) · n > 2^124. We know from Theorem 1 that either f(h) = m/n, which is equivalent to x2 = x10, or |f(h) − m/n| ≥ η.

As already explained, the values of θ2(r) are all integers and can be tabulated exactly as-is. In contrast, only an approximation can be tabulated for θ1(q). We chose to represent these values as ⌈θ1(q)⌉, hence replacing Δ by the easy-to-compute

  Δ̃ = ⌊⌈θ1(q)⌉ · n · 2^(8−64)⌋ − θ2(r) · m · 2^(8+σ(h)).    (18)

Here n, θ2(r) and m · 2^(8+σ(h)) are all nonnegative integers of at most 64 bits, and ⌈θ1(q)⌉ is a positive integer of at most 128 bits. Computing the floor function in the first term of Δ̃ comes down to dropping the low-order 64-bit word in a three-word integer. This is also the reason why we chose to take the ceiling ⌈θ1(q)⌉ when representing θ1(q) in the table: by dropping words, we can only compute under-approximations; by choosing over-approximations for the table value, we can manage for the two approximations to cancel out, at least partially. As we will see in the sequel, in particular with Theorem 3, this effect allows equality cases (cases where x2 = x10) to be handled extremely easily.

The multiplications by 2^8 in (18) are not essential. They serve to shift the significant bits of m and n as much as possible to the left, so that the two-word comparison finishes after only one branching in as many cases as possible.

Lemma 2 is in principle enough to compare x2 to x10, using a criterion similar to that from Theorem 2. But we can actually prove finer properties that allow for a more efficient final decision step. Indeed, the use of Theorem 2 would require adding and subtracting a particular constant ε. Here, we show that the sign of Δ̃ already provides the answer to the comparison.

Theorem 3. Assume g = φ(h), and let Δ̃ be the 128-bit signed integer defined by (18). Then the following equivalences hold:

  Δ̃ < 0 ⟺ |x10| < |x2|,
  Δ̃ = 0 ⟺ x10 = x2,
  Δ̃ > 0 ⟺ |x10| > |x2|.

Proof: Write

  ⌈θ1(q)⌉ = θ1(q) + δ_tbl,
  ⌊⌈θ1(q)⌉ · n · 2^(−56)⌋ = ⌈θ1(q)⌉ · n · 2^(−56) + δ_rnd

for some

  0 ≤ δ_tbl < 1,  −1 < δ_rnd ≤ 0.

Setting δ = δ_rnd + n · 2^(−56) · δ_tbl, we have Δ̃ = Δ + δ and, by (5), −1 < δ < 1/4. If Δ ≠ 0, it follows that

  |Δ̃| ≥ 2^124 · η − |δ| > 2^10

by Lemma 2. In particular, Δ and Δ̃ have the same sign, and Δ̃ < 0 ⟺ |x10| < |x2|. If Δ = 0, we get |Δ̃| = |δ| < 1, and we can conclude that Δ̃ = 0 because Δ̃ is an integer.

The conclusion Δ̃ = 0 = Δ when x2 = x10 means that, in this case, there is no error in the approximate computation of Δ. Specifically, the tabulation error δ_tbl and the rounding error δ_rnd cancel out, thanks to our choice of tabulating the ceiling of θ1(q).
Theorem 3 implies that the following algorithm correctly computes the sign of x2 − x10.

Algorithm 2. Step 2 (second method).
  ∙ Compute q and r as defined in (13), and compute σ(h) using Property 2.
  ∙ Read the 128 bits of ⌈θ1(q)⌉ as two 64-bit words.
  ∙ Compute the 128 high-order bits α of ⌈θ1(q)⌉ · (n · 2^8) with one 128 × 64-bit multiplication, dropping the 64 low-order bits.
  ∙ Compute β = θ2(r) · (m · 2^(8+σ(h))) with one full 64 × 64-bit multiplication, keeping all 128 bits.
  ∙ Compute the signed difference Δ̃ = α − β.
  ∙ Return "x2 > x10" if Δ̃ < 0, "x2 = x10" if Δ̃ = 0, and "x2 < x10" if Δ̃ > 0.

V. EQUALITY CASES

As seen in the last sections, the direct and bipartite methods are able to precisely determine the cases when x2 is equal to x10, besides deciding the two possible inequalities. However, when it comes to solely determining such equality cases, additional properties may simplify and accelerate the previous tests. The condition x2 = x10 is equivalent to

  m · 2^h = n · 5^g.
Lemma 3. Assume x2 = x10. Then
  ∙ either 0 ≤ g ≤ 22 and 0 ≤ h ≤ 53, and then m is divisible by 5^g and n is divisible by 2^h;
  ∙ or −22 ≤ g ≤ 0 and −51 ≤ h ≤ 0, in which case we have 2^(−h) | m and 5^(−g) | M10.

Proof: Notice that h and g = φ(h) always have the same sign. When they are nonnegative, n must be a multiple of 2^h and m must be a multiple of 5^g. It follows that 5^g ≤ m < 2^53, hence g ≤ log5(2^53) < 22.9, which in turn implies h ≤ 53. When g and h are negative, m is a multiple of 2^(−h), while n = M10 · 2^ν, and hence M10, are multiples of 5^(−g). We get −g < log5(10^16) < 22.9, whence −h ≤ 51.

We deduce the following algorithm, which decides whether x2 = x10 using only 64-bit integer operations. Note that here we do not need to compute φ(h).

Algorithm 3. Equality test.
  ∙ if 0 ≤ h ≤ 53 and 0 ≤ g ≤ 22 and 2^h | n then
    – read 5^g in a table or compute it on the fly;
    – compute m′ = 5^g · (n · 2^(−h)) with a 64-bit integer multiplication, yielding 64 output bits;
    – return (m′ = m);
  ∙ else if h ≥ −51 and −22 ≤ g ≤ 0 and 2^(−h) | m then
    – read 5^(−g) in a table or compute it on the fly;
    – compute n′ = 5^(−g) · (m · 2^h) with a 64-bit integer multiplication;
    – return (n′ = n);
  ∙ else return false.

The constants 5^g for 0 ≤ g ≤ 22 used in the algorithm all fit on 51 bits. They can, for instance, be read from a small table or computed on the fly (which requires at most 5 squarings and multiplications by 5 using binary powering). The tests on g in Algorithm 3 are there only to make sure that the values of 5^(±g) used later are in the allowable range.
VI. EXPERIMENTAL RESULTS

Both algorithms, the direct method presented in Section IV-A and the bipartite table method presented in Section IV-B, have been implemented and thoroughly tested. Our implementation is available at

  http://hal.archives-ouvertes.fr/ensl-00737881/

along with tables of worst and bad cases for the approximation problem from Section III.

To the extent possible when using decimal floating-point numbers, we gave priority to portability over performance in the main source code. It is written in plain C, using the decimal floating-point type support offered by the compiler we are using, gcc 4.6.3-1 [8]. No particular effort at optimization was made. For instance, multiple-word operations are implemented in portable C, with no use of hardware support for operations such as the 64 × 64 → 128-bit multiplication, except for that automatically inserted by the compiler. Nevertheless, we also provide an alternative implementation of the bipartite table method featuring some manual optimization. This optimized version is written using gcc extensions and compiler intrinsics besides plain C.

Testing was done using test vectors that extensively cover all floating-point input classes, such as normal numbers, subnormals, Not-a-Numbers, zeros, and infinities, for both the binary input x2 and the decimal input x10. The test vectors also exercise the worst cases (m, n) for the critical value d_h(m, n) = |5^φ(h)/2^h − m/n| for each admissible value of h, as explained in Section III. Additionally, the 50 next-to-worst cases (m, n) for each value of h were computed using the method described in the Appendix of [1] and added to the test vectors. The rationale behind exercising not only the worst case for each h but also some bad cases is that an implementation might work for the worst-case input just by chance, whereas chances decrease rapidly for a larger number of difficult test cases. The test vectors were finally completed with a reasonably large set of random inputs that fully exercise all possible values of the normalization 2^ν · M10 of the decimal mantissa (cf. Section II), as well as both the first step succeeding and the second step being necessary.

Both implemented methods have been compared for performance with the naïve comparison method where the binary or decimal input is converted to decimal or binary, respectively, before getting compared. These experimental results are reported in Table II. In the last two rows, "easy" cases are cases where the first step of our algorithm succeeds, while "hard" cases refer to those for which running the second step is necessary.

The code is executed on a system equipped with a quad-core Intel Core i7 M 620 processor clocked at 2.67 GHz, running Linux 3.2.0 in 64-bit mode. The comparison functions are compiled at optimization level 3, with the -march=native flag set. The timing measurements were done using the Read-Time-Stamp-Counter instruction after pipeline serialization.

                                        Naïve method    Naïve method    Direct method   Bipartite table  Bipartite table
                                        converting x2   converting x10  described in    method described method (optimized
                                        to decimal64    to binary64     Section IV-A    in Section IV-B  implementation)
                                        (incorrect)     (incorrect)
                                        min/avg/max     min/avg/max     min/avg/max     min/avg/max      min/avg/max
  Special cases (±0, NaNs, Inf)         14/–/254        7/–/281         8/–/114         7/–/132          7/–/112
  x2 subnormal, x10 of same sign        87/137/297      192/226/300     173/199/307     196/219/298      23/54/158
  x2 normal, x10 of opposite sign       50/144/278      21/107/321      5/32/118        5/30/121         4/38/122
  x2 normal, same sign, "easy" cases    84/156/294      21/107/309      15/47/144       15/46/122        6/47/122
  x2 normal, same sign, "hard" cases    32/95/283       56/86/207       34/58/212       52/70/226        19/44/198

TABLE II
TIMINGS (IN CYCLES) FOR BOTH PRESENTED METHODS AND FOR TWO NAÏVE METHODS.
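The measurement protocol behind Table II can be sketched as follows. This is an illustrative reconstruction, not the authors' harness, and the helper names are hypothetical: on x86 it serializes and reads the time-stamp counter with compiler intrinsics, and on other architectures it falls back to a generic monotonic clock so that the sketch remains compilable.

```c
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
/* Serialized time-stamp read: RDTSCP waits for preceding
   instructions to retire; LFENCE keeps later ones from
   starting before the counter is read. */
static uint64_t ticks(void)
{
    unsigned aux;
    uint64_t t = __rdtscp(&aux);
    _mm_lfence();
    return t;
}
#else
#include <time.h>
/* Portable stand-in for RDTSC (nanoseconds, not cycles). */
static uint64_t ticks(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}
#endif

struct timing { uint64_t min, avg, max; };

/* Time f() repeatedly; the first 'warmup' calls preheat the
   caches and are discarded, mirroring the protocol in the text. */
static struct timing measure(void (*f)(void), unsigned warmup, unsigned runs)
{
    struct timing r = { UINT64_MAX, 0, 0 };
    uint64_t sum = 0;
    for (unsigned i = 0; i < warmup + runs; i++) {
        uint64_t t0 = ticks();
        f();
        uint64_t dt = ticks() - t0;
        if (i < warmup)
            continue;               /* preheating run: discard */
        if (dt < r.min) r.min = dt;
        if (dt > r.max) r.max = dt;
        sum += dt;
    }
    r.avg = sum / runs;
    return r;
}

static void empty(void) {}  /* calibrates call + serialization overhead */
```

Timing empty with measure(empty, 100, 1000) yields the overhead to subtract from subsequent measurements; in the protocol described here, outlier cycle counts (above about 370) would additionally be discarded and the corresponding tests rerun.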

The serialization and function-call overhead was subtracted off after timing an empty function. Measurements were taken once the caches were preheated by previous comparison function calls, the results of which were discarded. The indicated numbers are in cycles and give the minimum/average/maximum values that were observed. Cycle counts larger than about 370 (depending on the test run) were discarded, and the corresponding tests run again, in order not to account in the results for long delays likely due to exceptional external events. No average values are given for cases involving special inputs, such as zeros, Not-a-Numbers (NaNs) or infinities.

As can be seen from Table II, our implementations of both the Direct Method presented in Section IV-A and the Bipartite Table Method described in Section IV-B outperform both naïve methods in most cases. Let us mention again that this naïve comparison is what some C compilers currently do (cf. Section I). Besides being faster, the methods presented in this paper have the advantage of always providing correct boolean answers.

The (unoptimized) Direct Method, based on direct tabulation of 푓(ℎ) (cf. Section IV-A), is slightly faster than the (unoptimized) Bipartite Table Method. This is mainly due to the more computationally expensive second step of the Bipartite Table Method. The difference in performance might however be too small to justify a table about 30 times larger for the Direct Method. We do not have experimental data for the Direct Method using the table size reduction technique through tabulation of 퐹(푔) (cf. Section IV-A).

VII. CONCLUSION

Though not foreseen in the IEEE 754-2008 standard, exact comparisons between floating-point formats of different radices would enrich the current floating-point environment and enhance the safety and provability of numerical software. This paper has investigated the feasibility of such comparisons in the case of the binary64 and decimal64 formats.

A simple test has been presented that eliminates most of the comparison inputs. For the remaining cases, two algorithms were proposed: a more direct method, and a technique based on a bipartite table. The bipartite table uses only 800 bytes of table space. Both methods have been proven, implemented and thoroughly tested. They outperform the naïve comparison technique consisting in conversion of one of the inputs to the respective other format. Furthermore, they always return a correct answer, which is not the case for the naïve technique.

Future work will have to address the other common IEEE 754-2008 binary and decimal formats, such as binary32 and decimal128. Since the algorithmic problems of exact binary-to-decimal comparison and correct rounding of binary-to-decimal conversions (and vice versa) are similar, future investigations should also consider the possible reuse of tables for both problems. Finally, one should mention that considerable additional work is required in order to enable mixed-radix comparisons in the case when the decimal floating-point number is stored in the densely packed decimal representation.

ACKNOWLEDGEMENT

This work is partly supported by the TaMaDi project of the French Agence Nationale de la Recherche.

REFERENCES

[1] M. Cornea, J. Harrison, C. Anderson, P. T. P. Tang, E. Schneider, and E. Gvozdev, A software implementation of the IEEE 754R decimal floating-point arithmetic using the binary encoding format, IEEE Transactions on Computers 58 (2009), no. 2, 148–162.
[2] G. H. Hardy and E. M. Wright, An introduction to the theory of numbers, Oxford University Press, London, 1979.
[3] IEEE Computer Society, IEEE standard for floating-point arithmetic, IEEE Standard 754-2008, August 2008, available at http://ieeexplore.ieee.org/servlet/opac?punumber=4610933.
[4] ISO/IEC JTC 1/SC 22/WG 14, Extension for the programming language C to support decimal floating-point arithmetic, Proposed Draft Technical Report, May 2008.
[5] A. Ya. Khinchin, Continued fractions, Dover, New York, 1997.
[6] J.-M. Muller, N. Brisebarre, F. de Dinechin, C.-P. Jeannerod, V. Lefèvre, G. Melquiond, N. Revol, D. Stehlé, and S. Torres, Handbook of floating-point arithmetic, Birkhäuser, Boston, 2010.
[7] O. Perron, Die Lehre von den Kettenbrüchen, 3rd ed., Teubner, Stuttgart, 1954–57.
[8] The GNU project, GNU compiler collection, 1987–2013, available at http://gcc.gnu.org/.