
Emulating round-to-nearest ties-to-zero "augmented" floating-point operations using round-to-nearest ties-to-even arithmetic
Sylvie Boldo, Christoph Lauter, Jean-Michel Muller
IEEE Transactions on Computers, Institute of Electrical and Electronics Engineers, in press, 10.1109/TC.2020.3002702.
HAL Id: hal-02137968v4, https://hal.archives-ouvertes.fr/hal-02137968v4, submitted on 13 Mar 2020.

Emulating Round-to-Nearest Ties-to-Zero "Augmented" Floating-Point Operations Using Round-to-Nearest Ties-to-Even Arithmetic

Sylvie Boldo*, Christoph Lauter†, Jean-Michel Muller‡

* Université Paris-Saclay, Univ. Paris-Sud, CNRS, Inria, Laboratoire de recherche en informatique, 91405 Orsay, France
† University of Alaska Anchorage, College of Engineering, Computer Science Department, Anchorage, AK, USA
‡ Univ Lyon, CNRS, ENS de Lyon, Inria, Université Claude Bernard Lyon 1, LIP UMR 5668, F-69007 Lyon, France

Abstract—The 2019 version of the IEEE 754 Standard for Floating-Point Arithmetic recommends that new "augmented" operations should be provided for the binary formats.
These operations use a new "rounding direction": round-to-nearest ties-to-zero. We show how they can be implemented using the currently available operations, using round-to-nearest ties-to-even, with a partial formal proof of correctness.

Keywords. Floating-point arithmetic, Numerical reproducibility, Rounding error analysis, Error-free transforms, Rounding mode, Formal proof.

I. INTRODUCTION AND NOTATION

The new IEEE 754-2019 Standard for Floating-Point (FP) Arithmetic [8] supersedes the 2008 version. It recommends that new "augmented" operations should be provided for the binary formats (see [15] for history and motivation). These operations are called augmentedAddition, augmentedSubtraction, and augmentedMultiplication. They use a new "rounding direction": round-to-nearest ties-to-zero. The reason behind this recommendation is that these operations would significantly help to implement reproducible summation and dot product, using an algorithm due to Demmel, Ahrens, and Nguyen [5]. Obtaining very fast reproducible summation with that algorithm may require a direct hardware implementation of these operations. However, having these operations available on common processors will certainly take time, and they may not be available on all platforms. The purpose of this paper is to show that, in the meantime, one can emulate these operations with conventional FP operations (with the usual round-to-nearest ties-to-even rounding direction), with reasonable efficiency. In this paper, we present the first proposed emulation algorithms, with proof of their correctness and experimental results. This allows, for instance, the design of programs that use these operations and that will be ready for use with full efficiency as soon as the augmented operations are available in hardware. Also, when these operations are available in hardware on some systems, this will improve the portability of programs using these operations by allowing them to still work (with degraded performance, however) on other systems.

In the following, we assume radix-2, precision-p floating-point arithmetic [13]. The minimum floating-point exponent is emin < 0 and the maximum exponent is emax. A floating-point number is a number of the form

    x = Mx · 2^(ex − p + 1),    (1)

where Mx is an integer satisfying

    |Mx| ≤ 2^p − 1,    (2)

and

    emin ≤ ex ≤ emax.    (3)

If |Mx| is maximum under the constraints (1), (2), and (3), then ex is the floating-point exponent of x. The number 2^emin is the smallest positive normal number (a FP number of absolute value less than 2^emin is called subnormal), and 2^(emin − p + 1) is the smallest positive FP number. The largest positive FP number is

    Ω = (2 − 2^(−p + 1)) · 2^emax.

We will assume

    3p ≤ emax + 1,    (4)

which is satisfied by all binary formats of the IEEE 754 Standard, with the exception of binary16 (which is an interchange format but not a basic format [8]). The usual round-to-nearest, ties-to-even function (which is the default in the IEEE-754 Standard) will be noted RNe. We recall its definition [8]:

    RNe(t) (where t is a real number) is the floating-point number nearest to t. If the two nearest floating-point numbers bracketing t are equally near, RNe(t) is the one whose least significant bit is zero. If |t| ≥ Ω + 2^(emax − p) then RNe(t) = ∞, with the same sign as t.

We will also assume that an FMA (fused multiply-add) instruction is available. This is the case on all recent FP units.

As said above, the new recommended operations use a new "rounding direction": round-to-nearest ties-to-zero. It corresponds to the rounding function RN0 defined as follows [8]:

    RN0(t) (where t is a real number) is the floating-point number nearest t. If the two nearest floating-point numbers bracketing t are equally near, RN0(t) is the one with smaller magnitude. If |t| > Ω + 2^(emax − p) then RN0(t) = ∞, with the same sign as t.

This is illustrated in Fig. 1. As one can infer from the definitions, RNe(t) and RN0(t) can differ in only two circumstances (called halfway cases): when t is halfway between two consecutive floating-point numbers, and when t = ±(Ω + 2^(emax − p)).

[Figure] Fig. 1: Round-to-nearest ties-to-zero (assuming we are in the positive range). Number x is rounded to the (unique) FP number nearest to x. Number y is a halfway case: it is exactly halfway between two consecutive FP numbers; it is rounded to the one that has the smallest magnitude.

If x is a floating-point number different from −Ω, first define pred(x) as the floating-point predecessor of x, i.e., the largest floating-point number < x. We define ulpH(x) as follows ("ulp" means "unit in the last place").

Definition 1 (Harrison's ulp). If x is a floating-point number, then ulpH(x) is |x| − pred(|x|).

Notation ulpH is to avoid confusion with the usual definition of function ulp. The usual ulp and function ulpH differ at powers of 2, except in the subnormal domain. For instance, ulp(1) = 2^(−p + 1), whereas ulpH(1) = 2^(−p). One easily checks that if |t| is not a power of 2, then ulp(t) = ulpH(t), and if |t| = 2^k, then ulp(t) = 2^(k − p + 1) = 2 ulpH(t), except in the subnormal range where ulp(t) = ulpH(t) = 2^(emin − p + 1). The reason for choosing function ulpH instead of function ulp is twofold:

∙ if t > 0 is a real number, each time RN0(t) differs from RNe(t), RN0(t) will be the floating-point predecessor of RNe(t), because RN0(t) ≠ RNe(t) implies that t is what we call a "halfway case" in Section II: it is exactly halfway between two consecutive floating-point numbers, and in that case, RN0(t) is the one of these two FP numbers which is closest to zero and RNe(t) is the other one. Hence, in these cases, to obtain RN0(t) we will have to subtract from RNe(t) a number which is exactly ulpH(RNe(t)) (for negative t, for symmetry reasons, we will have to add ulpH(RNe(t)) to RNe(t));

∙ there is a very simple algorithm for computing ulpH(RNe(t)) in the range where we need it (Algorithm 4 below).

The augmented operations are required to behave as follows [8], [15]:

∙ augmentedAddition(x, y) delivers (a0, b0) such that a0 = RN0(x + y) and, when a0 ∉ {±∞, NaN}, b0 = (x + y) − a0. When b0 = 0, it is required to have the same sign as a0. One easily shows that b0 is a FP number. For special rules when a0 ∈ {±∞, NaN}, see [15];

∙ augmentedSubtraction(x, y) is exactly the same as augmentedAddition(x, −y), so we will not discuss that operation further;

∙ augmentedMultiplication(x, y) delivers (a0, b0) such that a0 = RN0(x · y) and, when a0 ∉ {±∞, NaN}, b0 = RN0((x · y) − a0). When (x · y) − a0 = 0, the floating-point number b0 (equal to zero) is required to have the same sign as a0. Note that in some corner cases (an example is given in Section IV-A), b0 may differ from (x · y) − a0 (in other words, (x · y) − a0 is not always a floating-point number). Again, rules for handling infinities, NaNs and the signs of zeroes are given in [8], [15].

Because of the different rounding function, these augmented operations differ from the well-known Fast2Sum, 2Sum, and Fast2Mult algorithms (Algorithms 1, 2 and 3 below). As said above, the goal of this paper is to show that one can implement these augmented operations just by using round-to-nearest ties-to-even FP operations and with reasonable efficiency on

ALGORITHM 1: Fast2Sum(x, y). The Fast2Sum algorithm [4].
    ae ← RNe(x + y)
    y′ ← RNe(ae − x)
    be ← RNe(y − y′)

Lemma 1.
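Algorithm 1 (Fast2Sum) translates directly into floating-point code: under its usual hypothesis that the exponent of x is at least that of y (for instance |x| ≥ |y|), the two subtractions are exact and be is the exact rounding error of the addition. A minimal sketch in Python (binary64, so p = 53):

```python
def fast2sum(x: float, y: float):
    """Fast2Sum [4]: assumes the exponent of x is >= that of y
    (e.g., |x| >= |y|). Returns (ae, be) with ae = RNe(x + y) and
    be the rounding error of the addition, so x + y = ae + be exactly."""
    ae = x + y            # ae = RNe(x + y): hardware addition rounds to nearest even
    y_prime = ae - x      # exact under the exponent hypothesis
    be = y - y_prime      # exact rounding error
    return ae, be

# The term 2^-60 is absorbed by the rounding of the sum,
# but recovered exactly in the second component:
a, b = fast2sum(1.0, 2.0**-60)
assert a == 1.0 and b == 2.0**-60
```

The pair (ae, be) is an error-free transform of the sum; the paper's point is precisely that the augmented operations require a different second component, rounded with ties-to-zero rather than this exact ties-to-even residual.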
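The definitions of RNe and RN0 given in this section differ only on halfway cases. The sketch below (Python, binary64) is purely illustrative and is not one of the paper's emulation algorithms: the hypothetical helper `rn0` uses exact rational arithmetic to detect a halfway case, then picks the neighbor of smaller magnitude, and the last lines exhibit the specified behavior of augmentedAddition on such a case:

```python
import math
from fractions import Fraction

def rn0(t: Fraction) -> float:
    """Round the exact rational t to binary64 with ties-to-zero.
    Illustrative only: detects halfway cases with exact arithmetic."""
    a = float(t)                    # RNe(t): conversion rounds ties-to-even
    p = math.nextafter(a, 0.0)      # neighbor of a with smaller magnitude
    # t is a halfway case iff it lies exactly midway between p and a;
    # RN0 then selects the candidate closest to zero.
    if a != 0.0 and Fraction(a) - t == t - Fraction(p):
        return p
    return a

# t = 1 + 3*2^-53 is exactly halfway between 1 + 2^-52 and 1 + 2^-51.
t = Fraction(1) + 3 * Fraction(1, 2**53)
assert float(t) == 1.0 + 2.0**-51   # RNe: ties-to-even picks the even LSB
assert rn0(t) == 1.0 + 2.0**-52     # RN0: ties-to-zero picks the smaller

# augmentedAddition's specification: a0 = RN0(x + y), b0 = (x + y) - a0.
x, y = 1.0, 3 * 2.0**-53
exact = Fraction(x) + Fraction(y)
a0 = rn0(exact)
b0 = float(exact - Fraction(a0))    # representable here, as the text notes
assert a0 == 1.0 + 2.0**-52 and b0 == 2.0**-53
```

The overflow rule (|t| > Ω + 2^(emax − p) giving ∞) is deliberately not handled in this sketch.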
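Definition 1 also has a direct executable reading: pred(|x|) is one step toward zero, so ulpH(x) = |x| − pred(|x|). A small sketch (Python 3.9+, binary64), showing the factor-of-2 discrepancy with the usual ulp at a power of 2 described in the text:

```python
import math

def ulp_h(x: float) -> float:
    """Harrison's ulp of Definition 1: |x| - pred(|x|).
    Sketch for finite nonzero binary64 x (x = -Omega excluded)."""
    ax = abs(x)
    return ax - math.nextafter(ax, 0.0)   # exact: operands within a factor 2

# At a power of 2 the two notions differ by a factor of 2 (p = 53):
assert math.ulp(1.0) == 2.0**-52          # ulp(1)  = 2^(-p+1)
assert ulp_h(1.0) == 2.0**-53             # ulpH(1) = 2^-p
# Away from powers of 2 they agree:
assert ulp_h(1.5) == math.ulp(1.5) == 2.0**-52
# In the subnormal range both equal 2^(emin - p + 1) = 2^-1074:
assert ulp_h(2.0**-1070) == math.ulp(2.0**-1070) == 2.0**-1074
```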