Comparison Between Binary64 and Decimal64 Floating-Point Numbers

Nicolas Brisebarre, Marc Mezzarobba, Jean-Michel Muller
Laboratoire LIP, CNRS, ENS Lyon, INRIA, Univ. Claude Bernard Lyon 1, Lyon, France

Christoph Lauter
Université Pierre et Marie Curie, Laboratoire d’Informatique de Paris 6, Paris, France

Abstract: We introduce a software-oriented algorithm that allows one to quickly compare a binary64 floating-point (FP) number and a decimal64 FP number, assuming the “binary encoding” of the decimal formats specified by the IEEE 754-2008 standard for FP arithmetic is used. It is a two-step algorithm: a first pass, based on the exponents only, makes it possible to quickly eliminate most cases, then when the first pass does not suffice, a more accurate second pass is required. We provide an implementation of several variants of our algorithm, and compare them.

TABLE I
THE BASIC BINARY AND DECIMAL INTERCHANGE FORMATS SPECIFIED BY THE 754-2008 STANDARD.

                      binary32    binary64    binary128
  precision (bits)    24          53          113
  emax                +127        +1023       +16383
  emin                −126        −1022       −16382

                      decimal64   decimal128
  precision (digits)  16          34
  emax                +384        +6144
  emin                −383        −6143

I. INTRODUCTION

The IEEE 754-2008 Standard for Floating-Point Arithmetic [3] specifies several binary and decimal formats. The “basic interchange formats” of the standard are presented in Table I.

Although the standard does not require that a binary and a decimal number can be compared (it says floating-point data represented in different formats shall be comparable as long as the operands’ formats have the same radix), such comparisons may nevertheless offer several advantages: it is not infrequent to read decimal data from some database and to have to compare it to some binary floating-point number. That comparison may be slightly inaccurate if the decimal number is preliminarily converted to binary, or, respectively, if the binary number is first converted to decimal.

While it does not require comparisons between floating-point numbers of different radices, the IEEE 754-2008 standard does not forbid them for languages or systems. As the technical report on decimal floating-point arithmetic in C [4] is currently a mere draft, compilers supporting decimal floating-point arithmetic handle code sequences such as the following one at their discretion and often in an unsatisfactory way:

    double x = ...;
    _Decimal64 y = ...;
    if (x <= y) {
        ...
    }

As it occurs, this sequence is translated, for example by Intel’s icc 12.1.3, into a conversion from binary to decimal followed by a decimal comparison. The compiler emits no warning that the boolean result might not be the expected one because of rounding in the comparison.

This kind of strategy may lead to inconsistencies. Imagine such a “naïve” approach built as follows: when comparing a binary floating-point number $x_2$ of format $F_2$, and a decimal floating-point number $x_{10}$ of format $F_{10}$, we first convert $x_{10}$ to the binary format $F_2$ (that is, we replace it by the $F_2$ number nearest $x_{10}$), and then we perform the comparison in binary. Denote ○<, ○≤, ○>, and ○≥ as the comparison operators so defined. Consider the following variables (all exactly represented in their respective formats):

  • $x = 3602879701896397/2^{55}$, declared as a binary64 number;
  • $y = 13421773/2^{27}$, declared as a binary32 number;
  • $z = 1/10$, declared as a decimal64 number.

Then it holds that x ○< y, but also y ○≤ z and z ○≤ x. Such an inconsistent result might suffice to prevent a sorting program from terminating.

A direct mixed-radix binary-decimal-comparison might be an answer.

Also, in the long run, allowing “exact” comparison of decimal and binary floating-point numbers will certainly be the only rigorous way of executing program instructions of the form

    if x > 0.1 then ...

unless compilers learn how to round (binary or decimal) constants figuring in comparisons in a way that does not affect the comparison.

In the following, we aim at introducing an algorithm for comparing a binary64 floating-point number

$$x_2 = M_2 \cdot 2^{e_2 - 52},$$

and a decimal64 floating-point number

$$x_{10} = M_{10} \cdot 10^{e_{10} - 15}.$$

Here $M_2$ and $M_{10}$ are integers, with $|M_2| \le 2^{53} - 1$ and $|M_{10}| \le 10^{16} - 1$, and $e_2$ and $e_{10}$ are the floating-point exponents of $x_2$ and $x_{10}$.

Although most of what will be presented in this paper is generalizable to other formats, we restrict ourselves to these two formats, for the sake of simplicity. We will also assume that the so-called binary encoding [3], [6] of IEEE 754-2008 is used for the decimal64 format, so that the integer $M_{10}$ is easily accessible in binary. Since comparisons are straightforward if $x_2$ and $x_{10}$ have different signs or are zero, we assume $M_2 > 0$ and $M_{10} > 0$. Also, to leave the IEEE 754 flags untouched (unless when required), we will use integer arithmetic rather than floating-point arithmetic to perform our tests.

When $x_2$ and $x_{10}$ have significantly different orders of magnitude, examining their exponents will suffice to compare them. Hence, we first address the problem of performing a first, exponent-based, test. We then show how to perform the comparison when the first test does not suffice.

II. FIRST STEP: ELIMINATING THE “SIMPLE CASES” BY EXAMINING THE EXPONENTS

Through a possible preliminary binary shift of $M_2$ (when $x_2$ is subnormal), we may assume that $2^{52} \le M_2$. This gives the following constraints on $e_2$ (taking into account that possible shift):

$$-1074 \le e_2 \le 1023. \tag{1}$$

The constraints on $e_{10}$, read from Table I, are

$$-383 \le e_{10} \le 384. \tag{2}$$

Also, from

$$1 \le M_{10} \le 10^{16} - 1,$$

we easily deduce that there exists a unique $\nu \in \{0, 1, 2, \ldots, 53\}$ such that

$$2^{53} \le 2^{\nu} M_{10} \le 2^{54} - 1.$$

Hence our initial problem of comparing $x_2$ and $x_{10}$ reduces to comparing $M_2 \cdot 2^{e_2 - 52 + \nu}$ and $(2^{\nu} M_{10}) \cdot 10^{e_{10} - 15}$.

The fact that we “normalize” the decimal significand $M_{10}$ by a binary shift between two consecutive powers of two is of course questionable; $M_{10}$ could also be normalized into the range $10^{15} \le 10^{t} \cdot M_{10} \le 10^{16} - 1$. However, as hardware support for binary encoded decimal floating-point arithmetic is not widespread, a decimal normalization would require a loop multiplying by 10 and testing, whereas the proposed binary normalization can exploit an existing hardware leading-zero counter with straight-line code. The pipeline stalls in the loop are hence avoided.

We hence define

$$m = M_2, \qquad h = \nu + e_2 - e_{10} - 37, \qquad n = M_{10} \cdot 2^{\nu}, \qquad g = e_{10} - 15, \tag{3}$$

so that

$$x_2 = m \cdot 2^{h} \cdot 2^{g - \nu}, \qquad x_{10} = n \cdot 5^{g} \cdot 2^{g - \nu}. \tag{4}$$

Our comparison problem becomes:

    Compare $m \cdot 2^{h}$ with $n \cdot 5^{g}$.

We have

$$2^{52} \le m \le 2^{53} - 1, \qquad 2^{53} \le n \le 2^{54} - 1. \tag{5}$$

From the ranges of $e_2$, $e_{10}$ and $\nu$, one easily deduces that

$$-398 \le g \le 369, \qquad -1495 \le h \le 1422. \tag{6}$$

Note that once the first step, eliminating comparison cases purely based on the exponents, is complete, we will be able to reduce these domains (see Lemma 1). It makes no sense, for instance, to assume that $e_2$ is tiny and $e_{10}$ is huge: in such a case, the comparison is straightforward.

Now define two functions $\varphi$ and $\psi$ by

$$\varphi(h) = \lfloor h \cdot \log_5 2 \rfloor, \qquad \psi(g) = \lfloor g \cdot \log_2 5 \rfloor. \tag{7}$$

The function $\varphi$ is appropriate to perform the first comparison step, as stated in Property 1 below. The other function will be useful in the sequel. We will propose an efficient and easy-to-implement way to compute these functions $\varphi$ and $\psi$.

Property 1. We have

$$g < \varphi(h) \Rightarrow x_2 > x_{10}, \qquad g > \varphi(h) \Rightarrow x_2 < x_{10}.$$

Proof: If $g < \varphi(h)$ then $g \le \varphi(h) - 1$, hence $g \le h \log_5 2 - 1$. This implies that $5^{g} \le (1/5) \cdot 2^{h}$, therefore $2^{54} \cdot 5^{g} \le (4/5) \cdot 2^{52} \cdot 2^{h}$. As a consequence, $n \cdot 5^{g} < m \cdot 2^{h}$. If $g > \varphi(h)$ then $g \ge \varphi(h) + 1$, hence $g > h \log_5 2$, so that $5^{g} > 2^{h}$. This implies $2^{53} \cdot 5^{g} > (2^{53} - 1) \cdot 2^{h}$, and hence $n \cdot 5^{g} > m \cdot 2^{h}$.

Property 2. Denoting by $\lfloor \cdot \rceil$ the nearest integer function, let

$$L_{\varphi} = \lfloor 2^{19} \log_5 2 \rceil = 225799, \qquad L_{\psi} = \lfloor 2^{12} \log_2 5 \rceil = 9511.$$

For all $h$ in the range (6), we have

$$\varphi(h) = \lfloor h \cdot L_{\varphi} \cdot 2^{-19} \rfloor.$$

Similarly, we have

$$\psi(g) = \lfloor L_{\psi} \cdot g \cdot 2^{-12} \rfloor \quad \text{for } |g| \le 204,$$

and in the special case $g = 16q$,

$$\psi(16q) = \lfloor L_{\psi} \cdot q \cdot 2^{-8} \rfloor \quad \text{for } |q| \le 32.$$

The products $L_{\varphi} \cdot h$ and $L_{\psi} \cdot g$, for $g$, $h$ in the indicated ranges, can all be computed exactly in (signed or unsigned) 32-bit integer arithmetic. Computing $\lfloor \xi \cdot 2^{-\beta} \rfloor$ of course reduces to a right-shift by $\beta$ bits.

Proof: Using Maple, we exhaustively check that

$$\varphi(h) = \lfloor h \cdot L_{\varphi} \cdot 2^{-19} \rfloor$$

for $|h| \le 1831$, and that

$$\psi(g) = \lfloor L_{\psi} \cdot g \cdot 2^{-12} \rfloor$$

when either $|g| \le 204$ or $g = 16q$ with $|q| \le 32$.

In addition, for each fixed $\nu$, we can easily compute the number $k_{\nu}$ of pairs $(e_2, e_{10})$ such that $-1022 \le e_2 \le 1023$, $-383 \le e_{10} \le 384$, and $\varphi(h) = g$.

Let $X_2^{\mathrm{norm}}$ be the subset of $X_2$ consisting of normal numbers. The number of pairs $(\xi_2, \xi_{10})$ such that $\varphi(h) = g$ and $\xi_2 \in X_2^{\mathrm{norm}}$ is

$$N_1 = \sum_{\nu = 0}^{53} n_{\nu} k_{\nu} = 14\,292\,575\,372\,220\,927\,484 \le 1.55 \cdot 2^{63}.$$

If now $x_2$ is subnormal, then $h < -623$ and $\varphi(h) < -268$. A rough bound on the number of $(\xi_2, \xi_{10}) \in (X_2 \setminus X_2^{\mathrm{norm}}) \times X_{10}$ such that $g = \varphi(h)$ is hence

$$N = 2^{52} \cdot 10^{16} \cdot (398 - 268) \approx 1.12 \cdot 2^{112}.$$
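The inconsistency of the “naïve” comparison described in the introduction can be reproduced in plain C without any decimal type. The sketch below is only an illustration under assumptions that are not in the paper: the decimal64 value z = 1/10 is stood in for by the string "0.1", the conversion to the binary format is delegated to `strtof` and `strtod` (assumed here to round to nearest), and all function names are hypothetical.

```c
#include <stdlib.h>

/* x = 3602879701896397 / 2^55, exactly representable in binary64. */
static double example_x(void) { return 3602879701896397.0 / 0x1p55; }

/* y = 13421773 / 2^27, exactly representable in binary32. */
static float example_y(void) { return 13421773.0f / 0x1p27f; }

/* Naive "y <= z": the decimal operand is first rounded to binary32
   (assumption: strtof rounds to nearest), then compared in binary. */
static int naive_le_f32_dec(float y, const char *z) {
    return y <= strtof(z, NULL);
}

/* Naive "z <= x": the decimal operand is first rounded to binary64
   (assumption: strtod rounds to nearest), then compared in binary. */
static int naive_le_dec_f64(const char *z, double x) {
    return strtod(z, NULL) <= x;
}

/* Returns 1 when x < y, y <= z and z <= x all hold at once,
   which is the inconsistent cycle described in the text. */
static int naive_cycle(void) {
    double x = example_x();
    float y = example_y();
    return (x < y)                        /* exact: float widens to double */
        && naive_le_f32_dec(y, "0.1")
        && naive_le_dec_f64("0.1", x);
}
```

Under these assumptions `naive_cycle()` reports the cycle x < y ≤ z ≤ x, since both binary values are the nearest neighbours of 1/10 in their own formats.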
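The exponent-based first step of Section II, that is, the definitions (3), Property 1 and the shift formula for φ of Property 2, can be sketched in C as follows. This is a minimal sketch, not the authors' implementation: it assumes the operands have already been decoded into positive integer significands and exponents, it relies on the GCC/Clang builtin `__builtin_clzll` as the leading-zero counter, it assumes that `>>` on a negative `int32_t` is an arithmetic shift (implementation-defined in C), and the names `phi` and `first_step` are hypothetical.

```c
#include <stdint.h>

/* L_phi = round(2^19 * log5(2)), the constant given in Property 2. */
#define L_PHI 225799

/* phi(h) = floor(h * log5(2)) for h in the range (6).
   The product fits in 32 bits; the floor is a right shift by 19.
   Assumption: >> on a negative value is an arithmetic shift. */
static int32_t phi(int32_t h) {
    return (h * (int32_t)L_PHI) >> 19;
}

/* First step for x2 = M2 * 2^(e2-52) and x10 = M10 * 10^(e10-15),
   with 0 < M2 <= 2^53 - 1 and 0 < M10 <= 10^16 - 1.
   Returns +1 if x2 > x10, -1 if x2 < x10, and 0 when the exponents
   alone cannot decide (g == phi(h)) and a second step is needed. */
static int first_step(uint64_t M2, int32_t e2, uint64_t M10, int32_t e10) {
    /* Shift that brings M2 into [2^52, 2^53 - 1]; zero for normal inputs. */
    int32_t s = __builtin_clzll(M2) - 11;
    /* nu such that 2^53 <= 2^nu * M10 <= 2^54 - 1. */
    int32_t nu = __builtin_clzll(M10) - 10;
    int32_t h = nu + (e2 - s) - e10 - 37;   /* definition (3) */
    int32_t g = e10 - 15;                   /* definition (3) */
    int32_t p = phi(h);
    if (g < p) return +1;                   /* Property 1 */
    if (g > p) return -1;                   /* Property 1 */
    return 0;
}
```

For instance, with x2 = 1 (M2 = 2^52, e2 = 0) and x10 = 0.1 (M10 = 1, e10 = 14), one gets ν = 53, h = 2, g = −1 and φ(h) = 0, so the first step alone already reports x2 > x10. With x10 = 1 encoded as M10 = 10^15, e10 = 0, one gets g = φ(h) = −15 and the function returns 0, leaving the decision to the second step.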
