Emulating round-to-nearest ties-to-zero "augmented" floating-point operations using round-to-nearest ties-to-even arithmetic
Sylvie Boldo, Christoph Lauter, Jean-Michel Muller

To cite this version:

Sylvie Boldo, Christoph Lauter, Jean-Michel Muller. Emulating round-to-nearest ties-to-zero "augmented" floating-point operations using round-to-nearest ties-to-even arithmetic. IEEE Transactions on Computers, Institute of Electrical and Electronics Engineers, in press. doi:10.1109/TC.2020.3002702. hal-02137968v4, submitted on 13 Mar 2020.


Emulating Round-to-Nearest Ties-to-Zero “Augmented” Floating-Point Operations Using Round-to-Nearest Ties-to-Even Arithmetic Sylvie Boldo*, Christoph Lauter†, Jean-Michel Muller‡ * Université Paris-Saclay, Univ. Paris-Sud, CNRS, Inria, Laboratoire de recherche en informatique, 91405 Orsay, France † University of Alaska Anchorage, College of Engineering, Computer Science Department, Anchorage, AK, USA ‡ Univ Lyon, CNRS, ENS de Lyon, Inria, Université Claude Bernard Lyon 1, LIP UMR 5668, F-69007 Lyon, France

Abstract—The 2019 version of the IEEE 754 Standard for Floating-Point Arithmetic recommends that new "augmented" operations should be provided for the binary formats. These operations use a new "rounding direction": round-to-nearest ties-to-zero. We show how they can be implemented using the currently available operations, using round-to-nearest ties-to-even, with a partial formal proof of correctness.

Keywords: Floating-point arithmetic, Numerical reproducibility, Rounding error analysis, Error-free transforms, Rounding mode, Formal proof.

I. INTRODUCTION AND NOTATION

The new IEEE 754-2019 Standard for Floating-Point (FP) Arithmetic [8] supersedes the 2008 version. It recommends that new "augmented" operations should be provided for the binary formats (see [15] for history and motivation). These operations are called augmentedAddition, augmentedSubtraction, and augmentedMultiplication. They use a new "rounding direction": round-to-nearest ties-to-zero. The reason behind this recommendation is that these operations would significantly help to implement reproducible summation and dot product, using an algorithm due to Demmel, Ahrens, and Nguyen [5]. Obtaining very fast reproducible summation with that algorithm may require a direct hardware implementation of these operations. However, having these operations available on common processors will certainly take time, and they may not be available on all platforms. The purpose of this paper is to show that, in the meantime, one can emulate these operations with conventional FP operations (with the usual round-to-nearest ties-to-even rounding direction), with reasonable efficiency. In this paper, we present the first proposed emulation algorithms, with proof of their correctness and experimental results. This allows, for instance, the design of programs that use these operations, and that will be ready for use with full efficiency as soon as the augmented operations are available in hardware. Also, when these operations are available in hardware on some systems, this will improve the portability of programs using these operations by allowing them to still work (with degraded performance, however) on other systems.

In the following, we assume radix-2, precision-p floating-point arithmetic [13]. The minimum floating-point exponent is e_min < 0 and the maximum exponent is e_max. A floating-point number is a number of the form

    x = M_x × 2^(e_x − p + 1),   (1)

where M_x is an integer satisfying

    |M_x| ≤ 2^p − 1,   (2)

and

    e_min ≤ e_x ≤ e_max.   (3)

If |M_x| is maximum under the constraints (1), (2), and (3), then e_x is the floating-point exponent of x. The number 2^(e_min) is the smallest positive normal number (a FP number of absolute value less than 2^(e_min) is called subnormal), and 2^(e_min−p+1) is the smallest positive FP number. The largest positive FP number is Ω = (2 − 2^(−p+1)) · 2^(e_max).

We will assume

    3p ≤ e_max + 1,   (4)

which is satisfied by all binary formats of the IEEE 754 Standard, with the exception of binary16 (which is an interchange format but not a basic format [8]).

The usual round-to-nearest, ties-to-even function (which is the default in the IEEE-754 Standard) will be noted RN_e. We recall its definition [8]:

    RN_e(t) (where t is a real number) is the floating-point number nearest to t. If the two nearest floating-point numbers bracketing t are equally near, RN_e(t) is the one whose least significant bit is zero. If |t| ≥ Ω + 2^(e_max−p), then RN_e(t) = ∞, with the same sign as t.

We will also assume that an FMA (fused multiply-add) instruction is available. This is the case on all recent FP units.

As said above, the new recommended operations use a new "rounding direction": round-to-nearest ties-to-zero. It corresponds to the rounding function RN_0 defined as follows [8]:

    RN_0(t) (where t is a real number) is the floating-point number nearest t. If the two nearest floating-point numbers bracketing t are equally near, RN_0(t) is the one with smaller magnitude. If |t| > Ω + 2^(e_max−p), then RN_0(t) = ∞, with the same sign as t.

This is illustrated in Fig. 1. As one can infer from the definitions, RN_e(t) and RN_0(t) can differ in only two circumstances (called halfway cases): when t is halfway between two consecutive floating-point numbers, and when t = ±(Ω + 2^(e_max−p)).

Fig. 1: Round-to-nearest ties-to-zero (assuming we are in the positive range). Number x is rounded to the (unique) FP number nearest to x. Number y is a halfway case: it is exactly halfway between two consecutive FP numbers; it is rounded to the one that has the smallest magnitude.

The augmented operations are required to behave as follows [8], [15]:
∙ augmentedAddition(x, y) delivers (a_0, b_0) such that a_0 = RN_0(x + y) and, when a_0 ∉ {±∞, NaN}, b_0 = (x + y) − a_0. When b_0 = 0, it is required to have the same sign as a_0. One easily shows that b_0 is a FP number. For special rules when a_0 ∈ {±∞, NaN}, see [15];
∙ augmentedSubtraction(x, y) is exactly the same as augmentedAddition(x, −y), so we will not discuss that operation further;
∙ augmentedMultiplication(x, y) delivers (a_0, b_0) such that a_0 = RN_0(x · y) and, when a_0 ∉ {±∞, NaN}, b_0 = RN_0((x · y) − a_0). When (x · y) − a_0 = 0, the floating-point number b_0 (equal to zero) is required to have the same sign as a_0. Note that in some corner cases (an example is given in Section IV-A), b_0 may differ from (x · y) − a_0 (in other words, (x · y) − a_0 is not always a floating-point number). Again, rules for handling infinities and the signs of zeroes are given in [8], [15].

Because of the different rounding function, these augmented operations differ from the well-known Fast2Sum, 2Sum, and Fast2Mult algorithms (Algorithms 1, 2, and 3 below). As said above, the goal of this paper is to show that one can implement these augmented operations just by using rounded-to-nearest ties-to-even FP operations, with reasonable efficiency, on a system compliant with IEEE 754-2008.

Let t be the exact sum x + y (if we consider implementing augmentedAddition) or the exact product x · y (for augmentedMultiplication). To implement the augmented operations, in the general case (i.e., the sum or product does not overflow, and, in the case of augmentedMultiplication, x and y satisfy the requirements of Lemma 2 below), we first use the classical Fast2Sum, 2Sum, or Fast2Mult algorithms to generate two FP numbers a_e and b_e such that a_e = RN_e(t) and b_e = t − a_e. We explain how augmentedAddition(x, y) and augmentedMultiplication(x, y) can be obtained from a_e and b_e in Sections III and IV, respectively, using a "recomposition" algorithm presented in Section II.

In the following, we need a definition inspired from Harrison's definition [6] of function ulp ("unit in the last place"). If x is a floating-point number different from −Ω, first define pred(x) as the floating-point predecessor of x, i.e., the largest floating-point number < x. We define ulp_H(x) as follows.

Definition 1 (Harrison's ulp). If x is a floating-point number, then ulp_H(x) is |x| − pred(|x|).

The notation ulp_H is used to avoid confusion with the usual definition of function ulp. The usual function ulp and function ulp_H differ at powers of 2, except in the subnormal domain. For instance, ulp(1) = 2^(−p+1), whereas ulp_H(1) = 2^(−p). One easily checks that if |t| is not a power of 2, then ulp(t) = ulp_H(t), and if |t| = 2^k, then ulp(t) = 2^(k−p+1) = 2·ulp_H(t), except in the subnormal range, where ulp(t) = ulp_H(t) = 2^(e_min−p+1). The reason for choosing function ulp_H instead of function ulp is twofold:
∙ if t > 0 is a real number, each time RN_0(t) differs from RN_e(t), RN_0(t) will be the floating-point predecessor of RN_e(t), because RN_0(t) ≠ RN_e(t) implies that t is what we call a "halfway case" in Section II: it is exactly halfway between two consecutive floating-point numbers, and in that case RN_0(t) is the one of these two FP numbers which is closest to zero and RN_e(t) is the other one. Hence, in these cases, to obtain RN_0(t) we will have to subtract from RN_e(t) a number which is exactly ulp_H(RN_e(t)) (for negative t, for symmetry reasons, we will have to add ulp_H(RN_e(t)) to RN_e(t));
∙ there is a very simple algorithm for computing ulp_H(t) in the range where we need it (Algorithm 4 below).

Let us now briefly recall the classical Algorithms Fast2Sum, 2Sum, and Fast2Mult.

ALGORITHM 1: Fast2Sum(x, y). The Fast2Sum algorithm [4].
a_e ← RN_e(x + y)
y′ ← RN_e(a_e − x)
b_e ← RN_e(y − y′)

Lemma 1. If x = 0 or y = 0, or if the floating-point exponents e_x and e_y of x and y satisfy e_x ≥ e_y, then
1) the two variables a_e and b_e returned by Algorithm 1 (Fast2Sum) satisfy a_e + b_e = x + y;
2) the operations performed at lines 2 and 3 of Algorithm 1 are exact operations: y′ = a_e − x and b_e = y − y′.

See for instance [13] for a proof. Hence, the variable b_e returned by Algorithm 1 is the error of the floating-point addition a_e ← RN_e(x + y). The second property given in Lemma 1 will be useful in Section IV. In practice, condition "e_x ≥ e_y" may be hard to check. However, if |x| ≥ |y| then that condition is satisfied. Algorithm 1 is immune to spurious overflow: it was proved in [1] that if the addition RN_e(x + y) does not overflow, then the other two operations cannot overflow.

ALGORITHM 2: 2Sum(x, y). The 2Sum algorithm [12], [11].
a_e ← RN_e(x + y)
x′ ← RN_e(a_e − y)
y′ ← RN_e(a_e − x′)
δ_x ← RN_e(x − x′)
δ_y ← RN_e(y − y′)
b_e ← RN_e(δ_x + δ_y)

Algorithm 2 (2Sum) gives the same results a_e and b_e as Algorithm 1, but without any requirement on the exponents of x and y. It is almost immune to spurious overflow: if |x| ≠ Ω and the addition RN_e(x + y) does not overflow, then the other five operations cannot overflow [1].

A similar algorithm, Algorithm 3 (Fast2Mult), makes it possible to express the exact product of two floating-point numbers x and y as the sum of the rounded product RN_e(xy) and an error term. It requires the availability of an FMA (fused multiply-add) instruction. To be exactly representable as the sum of two floating-point numbers, the exact product must not be too tiny. Several sufficient conditions appear in the literature (such as the exponents e_x and e_y satisfying e_x + e_y ≥ e_min + p − 1, see [14] for a proof). We will use a slightly different condition, given by Lemma 2 below.

ALGORITHM 3: Fast2Mult(x, y). The Fast2Mult algorithm (see for instance [10], [14], [13]). It requires the availability of a fused multiply-add (FMA) instruction for computing RN_e(x · y − a_e).
a_e ← RN_e(x · y)
b_e ← RN_e(x · y − a_e)

Lemma 2. If 2^(e_min+p) ≤ |x · y| < Ω + 2^(e_max−p) (which in particular holds if 2^(e_min+p) + 2^(e_min+1) ≤ |RN_e(x · y)| ≤ Ω), then the numbers a_e and b_e returned by Algorithm 3 satisfy a_e + b_e = x · y.

Proof. Let k be the integer such that 2^k ≤ |xy| < 2^(k+1). Since xy is a 2p-bit number, it is a multiple of 2^(k−2p+1), so that xy − RN_e(xy) is a multiple of 2^(k−2p+1) too. |xy − RN_e(xy)| is less than or equal to (1/2)·ulp(xy) = 2^(k−p). Also, condition 2^(e_min+p) ≤ |x · y| implies k ≥ e_min + p. All this implies that xy − RN_e(xy) can be written M × 2^(k−2p+1), where M is an integer of absolute value less than or equal to 2^(p−1), and k − 2p + 1 ≥ e_min − p + 1, which implies that xy − RN_e(xy) is a floating-point number, so that b_e = xy − RN_e(xy) = xy − a_e.

We will also use the following classical results, due to Hauser [7] and Sterbenz [17] (the proofs are straightforward, see for instance [13]).

Lemma 3 (Hauser Lemma). If x and y are floating-point numbers, and if the number RN_e(x + y) is subnormal, then x + y is a floating-point number, which implies RN_e(x + y) = x + y.

Lemma 4 (Sterbenz Lemma). If x and y are floating-point numbers that satisfy x/2 ≤ y ≤ 2x, then x − y is a floating-point number, which implies RN_e(x − y) = x − y.

Finally, we will sometimes use the following lemmas, whose proofs are straightforward.

Lemma 5. Let a be a nonzero floating-point number. If t is a real number such that |t| ≤ |a| and t is a multiple of ulp(a), then t is a floating-point number.

Lemma 6. If t_1 and t_2 are real numbers such that
1) 2^(e_min) ≤ |t_1|, |t_2| < Ω + 2^(e_max−p);
2) there exists an integer k such that t_1 = 2^k · t_2;
then RN_e(t_1) = 2^k · RN_e(t_2) and RN_0(t_1) = 2^k · RN_0(t_2).

As explained in Section II (where it corresponds to "Halfway case 1"), when RN_0(t) and RN_e(t) differ, RN_0(t) is obtained by subtracting sign(t)·ulp_H(RN_e(t)) from RN_e(t). Therefore, we need to be able to compute sign(a)·ulp_H(a). If |a| > 2^(e_min), this can be done using Algorithm 4 below, which is a variant of an algorithm introduced by Rump [16].

ALGORITHM 4: MyulpH(a): computes sign(a)·pred(|a|) and sign(a)·ulp_H(a) for |a| > 2^(e_min). Uses the FP constant ψ = 1 − 2^(−p).
z ← RN_e(ψ·a)
δ ← RN_e(a − z)
return (z, δ)

Lemma 7. The numbers z and δ returned by Algorithm 4 satisfy:
∙ if |a| > 2^(e_min), then z = sign(a)·pred(|a|) and δ = sign(a)·ulp_H(a);
∙ if |a| ≤ 2^(e_min), then z = a and δ = 0.

Proof.
∙ The fact that when |a| > 2^(e_min) the number z returned by Algorithm 4 equals sign(a)·pred(|a|) is a direct consequence of [16, Lemma 3.6] (see also [9]). The value of δ immediately follows from that.
∙ If |a| < 2^(e_min) (i.e., a is subnormal or zero), then |2^(−p)·a| < 2^(e_min−p) = (1/2)·ulp(a), from which we obtain |ψa − a| < (1/2)·ulp(a), thus z = RN_e(ψa) = a and δ = 0.
∙ Finally, if |a| = 2^(e_min), the ties-to-even rule implies z = RN_e(ψa) = a and δ = 0.

The fact that the radix is 2 is important here (a counterexample in radix 10 is p = 3 and a = 101). This means that our work cannot be straightforwardly generalized to decimal floating-point arithmetic.

II. RECOMPOSITION

In this section, we start from two FP numbers a_e and b_e that satisfy a_e = RN_e(t), with t = a_e + b_e, and we assume |a_e| > 2^(e_min). These numbers may have been preliminarily generated by the 2Sum, Fast2Sum, or Fast2Mult algorithms (Algorithms 1, 2, and 3).

We want to obtain from a_e and b_e two FP numbers a_0 and b_0 such that a_0 = RN_0(t) and a_0 + b_0 = t. Before giving the algorithm, let us present the basic principle in the case 2^(e_min) < t < Ω (t is thus assumed positive to simplify the presentation). If t is not halfway between two consecutive FP numbers, we know that a_0 = a_e and b_0 = b_e. If t is halfway between two FP numbers (one of them being a_e), then two cases may occur:
∙ Halfway case 1: t = a_e − (1/2)·ulp_H(a_e) (i.e., b_e = −(1/2)·ulp_H(a_e));
∙ Halfway case 2: t = a_e + (1/2)·ulp_H(a_e) (i.e., b_e = +(1/2)·ulp_H(a_e)).
In the second case, a_e is already equal to t rounded to zero, so we must choose a_0 = a_e and b_0 = b_e. In the first case, a_0 is the floating-point predecessor of a_e, and b_0 = (1/2)·ulp_H(a_e) = −b_e. Hence, to find a_0 and b_0 we must first detect if we are in Halfway case 1: it is the only case where (a_0, b_0) differs from (a_e, b_e). That detection is done using Algorithm 4 (MyulpH).

Fig. 2: Halfway case 1: t = a_e − (1/2)·ulp_H(a_e), where t = a_e + b_e. We have a_0 = a_e − ulp_H(a_e) and b_0 = −b_e.

Fig. 3: Halfway case 2: t = a_e + (1/2)·ulp_H(a_e), where t = a_e + b_e. We have a_0 = a_e and b_0 = b_e.

This is illustrated by Figures 2 and 3, and this leads to Algorithm 5 below. In Algorithm 5, when the number −2·b_e is equal to δ (i.e., when Halfway case 1 occurs), we must return a_0 = a_e − δ = sign(a_e)·pred(|a_e|). This explains why in that case the value of a_0 returned by the algorithm is z. We obtain Lemma 8 below.

ALGORITHM 5: Recomp(a_e, b_e). From two FP numbers a_e and b_e such that a_e = RN_e(a_e + b_e) and |a_e| > 2^(e_min), computes a_0 and b_0 such that a_0 + b_0 = a_e + b_e and a_0 = RN_0(a_e + b_e).
(z, δ) ← MyulpH(a_e)
if −2·b_e = δ then
    a_0 ← z
    b_0 ← −b_e
else
    a_0 ← a_e
    b_0 ← b_e
end if
return (a_0, b_0)

Lemma 8. If 2^(e_min) < |a_e| ≤ Ω, then the two floating-point numbers a_0 and b_0 returned by Algorithm 5 satisfy
a_0 = RN_0(a_e + b_e),
a_0 + b_0 = a_e + b_e.

Condition 2^(e_min) < |a_e| in Lemma 8 is necessary: if |a_e| ≤ 2^(e_min), an immediate consequence of Lemma 7 is that Algorithm 5 returns a_0 = a_e and b_0 = b_e. This is not a problem for implementing augmentedAddition, thanks to Lemma 3, as we are going to see in Section III. For augmentedMultiplication this will require a special handling (see Sections IV-C and IV-D).

In the next two sections, we examine how Algorithm 5 can be used to compute augmentedAddition(x, y) and augmentedMultiplication(x, y).

III. USE OF ALGORITHM RECOMP FOR IMPLEMENTING AUGMENTED ADDITION

From two input floating-point numbers x and y, we wish to compute RN_0(x + y) and (x + y) − RN_0(x + y). We recall that when (x + y) − RN_0(x + y) equals zero, the IEEE 754-2019 Standard requires that it should be returned with the sign of RN_0(x + y). Let us first give a simple algorithm (Algorithm 6, below) that returns a correct result (possibly with a wrong sign for b_0 when it is zero) when no exception occurs (i.e., the returned values are finite floating-point numbers).

ALGORITHM 6: AA-Simple(x, y): computes augmentedAddition(x, y) when no exception occurs.
1: if |y| > |x| then
2:     swap(x, y)
3: end if
4: (a_e, b_e) ← Fast2Sum(x, y)
5: (a_0, b_0) ← Recomp(a_e, b_e)
6: return (a_0, b_0)

Theorem 1. The values a_0 and b_0 returned by Algorithm 6 satisfy:
1) if |x + y| < Ω + 2^(e_max−p) = (2 − 2^(−p)) · 2^(e_max), then (a_0, b_0) is equal to augmentedAddition(x, y), with the possible exception that if b_0 = 0 it may have a sign that differs from the one specified in the IEEE 754-2019 Standard;
2) if |x + y| = Ω + 2^(e_max−p), then a_0 = ±∞ and b_0 is ±∞ (with a sign different from the one of a_0), whereas the correct values would have been a_0 = ±Ω and b_0 = ±2^(e_max−p) (with the appropriate signs);
3) if |x + y| > Ω + 2^(e_max−p), then a_0 = ±∞ (with the appropriate sign) and b_0 is either NaN or ±∞ (possibly with a wrong sign), whereas the standard requires a_0 = b_0 = ∞ (with the same sign as x + y).

Note that if we are certain that |x| ≠ Ω (so that 2Sum(x, y) can be called without any risk of spurious overflow) we can replace lines 1 to 4 of the algorithm by a simple call to 2Sum(x, y). Note also that Theorem 1 implies that each time a_0 is a finite floating-point number, Algorithm 6 returns a correct result (with a possible wrong sign for b_0 when it is zero).

The first item in Theorem 1 is an immediate consequence of the properties of the Fast2Sum and Recomp algorithms. Let us momentarily ignore the signs of zero variables. We have a_e = RN_e(x + y) and a_e + b_e = x + y. Hence,
∙ if |a_e| > 2^(e_min), then Recomp(a_e, b_e) gives the expected result;
∙ if |a_e| ≤ 2^(e_min), then from Lemma 3, we know that the floating-point addition of x and y is exact, hence b_e = 0. We easily deduce that Recomp(a_e, b_e) = (a_e, b_e), which is the expected result. In particular, if a_e = 0 then we obtain a_0 = b_0 = 0.

Now, let us reason about the signs of zero variables. Note that a_0 = 0 is possible only when x + y = 0. A quick look at Fast2Sum and MyulpH shows that when x + y = 0, a_0 = 0 with the same sign as a_e, which corresponds to what is requested by IEEE 754-2019. Hence, when a_0 = 0, it has the right sign.

When b_0 = 0, this may come from two possible cases: either x + y is a nonzero floating-point number (in which case, a_0 is that number), or x + y = 0. In both cases b_0 should be zero with the same sign as a_0. Tables I and II give the values of b_0 returned by Algorithm 6 in these two cases. One can see that when b_0 = 0, its sign is not always correct.

TABLE I: Value of b_0 computed by Algorithm 6 and value of b_0 specified by the IEEE-754 Standard when x + y is a nonzero floating-point number (i.e., b_e = ±0).

Case                                    | computed b_0 | correct b_0
y ≠ 0 and |a_e| > 2^(e_min)             | +0           | +0 × sign(a_e)
y ≠ 0 and |a_e| ≤ 2^(e_min)             | −0           | +0 × sign(a_e)
y = +0 and |a_e| > 2^(e_min)            | +0           | +0 × sign(a_e)
y = +0 and |a_e| ≤ 2^(e_min)            | −0           | +0 × sign(a_e)
y = −0 and |x| = |a_e| > 2^(e_min)      | −0           | +0 × sign(a_e)
y = −0 and |x| = |a_e| ≤ 2^(e_min)      | +0           | +0 × sign(a_e)

TABLE II: Value of b_0 computed by Algorithm 6 and value of b_0 specified by the IEEE-754 Standard when x + y = 0.

Case                    | computed b_0 | correct b_0
x = −y and x ≠ 0        | −0           | +0
x = +0 and y = +0       | −0           | +0
x = +0 and y = −0       | +0           | +0
x = −0 and y = +0       | −0           | +0
x = −0 and y = −0       | +0           | −0

However, if the signs of the zero variables matter in the target application, there is a simple solution. Since the sign of a_0 is always correct, and since when b_0 = 0 it must be returned with the sign of a_0, it suffices to add the following lines to Algorithm 6 after Line 5:
if b_0 = 0 then
    b_0 ← (+0) × a_0
end if
Alternatively, one can also use the copySign instruction specified by the IEEE 754 Standard [8] if it is faster than a floating-point multiplication on the system being used: copySign(x, y) has the absolute value of x and the sign of y.

The second item in Theorem 1 follows immediately by applying Algorithm 6 to the corresponding input value. Concerning the third item in Theorem 1, Table III gives the values returned by Algorithm 6 when |x + y| > Ω + 2^(e_max−p), and compares them with the correct values.

TABLE III: Values obtained using Algorithm 6 (possibly with a replacement of Fast2Sum by 2Sum) when |x + y| > 2^(e_max) · (2 − 2^(−p)).

     | Algorithm 6 with (a_e, b_e)  | Variant of Algorithm 6 in which    | Result required
     | obtained through Fast2Sum    | (a_e, b_e) is obtained with 2Sum   | by the standard
a_0  | +∞ · sign(x + y)             | +∞ · sign(x + y)                   | +∞ · sign(x + y)
b_0  | −∞ · sign(x + y)             | NaN                                | +∞ · sign(x + y)

If the considered applications only require augmentedAddition to follow the specifications when no exception occurs, Algorithm 6 (possibly with the additional lines given above if the signs of zeroes matter) is a good candidate. If we wish to always follow the specifications, we suggest using Algorithm 7 below.

ALGORITHM 7: AA-Full(x, y): computes augmentedAddition(x, y) in all cases.
1: if |y| > |x| then
2:     swap(x, y)
3: end if
4: (a_e, b_e) ← Fast2Sum(x, y)
5: (a_0, b_0) ← Recomp(a_e, b_e)
6: if b_0 = 0 then
7:     b_0 ← (+0) × a_0
8: else if |a_e| = +∞ then
9:     (a′_e, b′_e) ← Fast2Sum(0.5·x, 0.5·y)
10:    if (a′_e = 2^(e_max) and b′_e = −2^(e_max−p−1)) or (a′_e = −2^(e_max) and b′_e = +2^(e_max−p−1)) then
11:        a_0 ← RN_e(a′_e · (2 − 2^(−p+1)))
12:        b_0 ← −2·b′_e
13:    else
14:        a_0 ← a_e    (infinity with the right sign)
15:        b_0 ← a_e
16:    end if
17: end if
18: return (a_0, b_0)

Theorem 2. The output (a_0, b_0) of Algorithm 7 is equal to augmentedAddition(x, y).

Proof.
1) If |x + y| < Ω + 2^(e_max−p), then Item 1 of Theorem 1 tells us that the values a_0 and b_0 computed at Line 5 of Algorithm 7 are equal to augmentedAddition(x, y), with the possible exception that if b_0 = 0 it may have a sign that differs from the one specified in the IEEE 754-2019 Standard. This possible error in the sign of b_0 is corrected at Lines 6-7.
2) If |x + y| = Ω + 2^(e_max−p), then |a_e| = +∞. In that case, since (1/2)·|x + y| = Ω/2 + 2^(e_max−p−1) = 2^(e_max) − 2^(e_max−p−1), at Line 9 we obtain
a′_e = sign(x + y) · 2^(e_max)
and
b′_e = −sign(x + y) · 2^(e_max−p−1).
In that case, Lines 10-12 of the algorithm return the correct values
a_0 = sign(x + y) · Ω
and
b_0 = sign(x + y) · 2^(e_max−p).
3) If |x + y| > Ω + 2^(e_max−p), then |a_e| = +∞, and the sum (x + y)/2 computed using Fast2Sum at Line 9 (without overflow, since |x + y|/2 is less than or equal to the maximum of |x| and |y|) will be of absolute value (strictly) larger than 2^(e_max) − 2^(e_max−p−1), hence Lines 14-15 of the algorithm will be executed, and we will obtain a_0 = b_0 = sign(x + y) · ∞, as expected.

IV. USE OF ALGORITHM RECOMP FOR IMPLEMENTING AUGMENTED MULTIPLICATION

A. General case

From two input floating-point numbers x and y, we wish to compute RN_0(x · y) and x · y − RN_0(x · y) (or, merely, RN_0[x · y − RN_0(x · y)] when x · y − RN_0(x · y) is not a floating-point number). As we did for augmentedAddition, let us first present a simple algorithm (Algorithm 8 below). Unfortunately, it will be less general than the simple addition algorithm: this is due to the fact that when the absolute value of the product of two floating-point numbers is less than or equal to 2^(e_min+p), it may not be exactly representable by the sum of two floating-point numbers (an example is x = 1 + 2^(−p+1) and y = 2^(e_min) + 2^(e_min−p+1): their product 2^(e_min) + 2^(e_min−p+2) + 2^(e_min−2p+2) cannot be a sum of two FP numbers, since such a sum is necessarily a multiple of 2^(e_min−p+1)).

ALGORITHM 8: AM-Simple(x, y): computes augmentedMultiplication(x, y) when 2^(e_min+p) + 2^(e_min) < |x · y| < Ω + 2^(e_max−p).
1: (a_e, b_e) ← Fast2Mult(x, y)
2: (a_0, b_0) ← Recomp(a_e, b_e)
3: return (a_0, b_0)

Theorem 3. The values a_0 and b_0 returned by Algorithm 8 satisfy:
1) if 2^(e_min+p) + 2^(e_min) < |x · y| < Ω + 2^(e_max−p) (which implies 2^(e_min+p) + 2^(e_min+1) ≤ |RN_e(x · y)| ≤ Ω), then (a_0, b_0) is equal to augmentedMultiplication(x, y), with the possible exception that if b_0 = 0 it may have a sign that differs from the one specified in the IEEE 754-2019 Standard;
2) if |x · y| = Ω + 2^(e_max−p) = (2 − 2^(−p)) · 2^(e_max), then a_0 = ±∞ (with the sign of x · y) and b_0 = ±∞ (with the opposite sign), whereas the correct values would have been a_0 = ±Ω and b_0 = ±2^(e_max−p) (both with the sign of x · y);
3) if |x · y| > Ω + 2^(e_max−p), then a_0 = ±∞ (with the sign of x · y) and b_0 = ±∞ (with the opposite sign), whereas the correct values would have been a_0 = b_0 = ±∞ (with the sign of x · y).

Proof. The first item in Theorem 3 is a consequence of Lemma 2 and Lemma 8. If
2^(e_min+p) + 2^(e_min) < |x · y| < Ω + 2^(e_max−p)
then 2^(e_min+p) + 2^(e_min+1) ≤ |RN_e(x · y)| ≤ Ω, therefore
∙ (a_e, b_e) = Fast2Mult(x, y) gives a_e + b_e = x · y;
∙ |a_e| > 2^(e_min);
therefore Recomp(a_e, b_e) returns the expected result.
The second item in Theorem 3 follows immediately by applying Algorithm 8 to the corresponding input value. Concerning the third item in Theorem 3, Table IV gives the values returned by Algorithm 8 when |x · y| > Ω + 2^(e_max−p).

TABLE IV: Values obtained using Algorithm 8 when |x · y| > 2^(e_max) · (2 − 2^(−p)).

     | Algorithm 8        | Result required by the standard
a_0  | +∞ · sign(x · y)   | +∞ · sign(x · y)
b_0  | −∞ · sign(x · y)   | +∞ · sign(x · y)

As with the addition algorithm, if the signs of the zero variables matter in the target application and if Condition 1 of Theorem 3 is satisfied, it suffices to add the following lines to Algorithm 8 after Line 2:
if b_0 = 0 then
    b_0 ← (+0) × a_0
end if
(and, again, function copySign can be used if it is faster than a floating-point multiplication). Another solution is to notice that in Case 1 of Theorem 3, xy − a_0 is a floating-point number, therefore one can just compute a_0 with Algorithm 8 and obtain b_0 with one FMA instruction, as RN_e(xy − a_0). As we are going to see, that solution is useful even when Condition 1 of Theorem 3 is not satisfied.

Let us now build another augmented multiplication algorithm, Algorithm 9 below, that returns a correct result even if the condition of Case 1 of Theorem 3 (i.e., 2^(e_min+p) + 2^(e_min) < |x · y| < Ω + 2^(e_max−p)) is not satisfied. We need to be able to address the cases
RN_e(x · y) = ±∞
(that correspond to items 2 and 3 in Theorem 3) and
|RN_e(x · y)| ≤ 2^(e_min+p).

This will be done by scaling the calculation, i.e., by finding a suitable power of 2, say 2^k, such that 2^k·x is computed without over/underflow, and the requested calculations (in particular those in Algorithm 8) can be done safely with inputs x′ = 2^k·x and y. We need to consider three cases:
∙ when RN_e(x · y) = ±∞, we choose 2^k = 1/2. This case is dealt with in Section IV-B;
∙ when |RN_e(x · y)| ≤ 2^(e_min+1) − 2^(e_min−p+1), we choose 2^k = 2^(2p). This case is dealt with in Section IV-C;
∙ when 2^(e_min+1) ≤ |RN_e(x · y)| ≤ 2^(e_min+p), we choose 2^k = 2^p. This case is dealt with in Section IV-D.

Note that when |RN_e(x · y)| ≤ 2^(e_min+p) (i.e., in the last two cases mentioned just above), RN_0(xy − RN_0(xy)) = 0 will not mean that xy − RN_0(xy) = 0. This slightly complicates the choice of the sign of b_0 when it is equal to zero. More precisely, when b_0 = 0, the sign of b_0 must be:
∙ the sign of a_0 (i.e., the sign of xy) when xy − a_0 = 0;
∙ the real sign of xy − a_0 otherwise.
Fortunately, in both cases, this is the same sign as the one of RN_e(xy − a_0), which is obtained using an FMA instruction. Hence, when b_0 = 0, one can return (+0) · RN_e(xy − a_0) (the multiplication by (+0) is necessary to handle the case xy − a_0 = 2^(e_min−p), for which functions RN_0 and RN_e differ).

B. First special case: if RN_e(x · y) = ±∞

In this case, which corresponds to Lines 2–11 in Algorithm 9, we need to know if we are in Case 2 (i.e., |x · y| = Ω + 2^(e_max−p)) or Case 3 (i.e., |x · y| > Ω + 2^(e_max−p)) of Theorem 3. Hence our problem reduces to checking if |x · y| = Ω + 2^(e_max−p). That problem is addressed easily. It suffices to compute (a′_e, b′_e) = Fast2Mult(0.5 · x, y):
∙ if |x · y| = Ω + 2^(e_max−p), then (x/2) · y is computed by Fast2Mult without overflow, which allows one to check its equality with ±(Ω + 2^(e_max−p))/2;
∙ if it turns out that |x · y/2| ≠ (Ω + 2^(e_max−p))/2, it suffices to return a_0 = b_0 = RN_e(x · y): they will be infinities with the right sign.

C. Second special case: if |RN_e(x · y)| ≤ 2^(e_min+1) − 2^(e_min−p+1)

In that case,

    |x · y − RN_0(x · y)| ≤ 2^(e_min−p),   (5)

and thus RN_0(x · y − RN_0(x · y)) = 0, so we have to return b_0 = 0 (with the sign of RN_e(xy − a_0)), and we only have to focus on the computation of a_0 = RN_0(x · y). We also assume that RN_e(x · y) ≠ 0 (otherwise, it suffices to return the pair (0, 0), with the sign of xy). We therefore have

    2^(e_min−p) < |x · y| < 2^(e_min+1) − 2^(e_min−p).   (6)

Note that (6) implies

    2^(e_min+p) < |2^(2p)·x · y| < 2^(e_min+2p+1) − 2^(e_min+p).   (7)

Let us first give the general reasoning behind the calculations of Lines 16–25 of Algorithm 9 (a detailed proof will follow). Let a_e be RN_e(x · y). Since the distance between consecutive floating-point numbers in the vicinity of xy is 2^(e_min−p+1) (we are in the subnormal range), we have the following property:
∙ if xy = a_e − sign(a_e) · 2^(e_min−p) (i.e., we are in what we call "Halfway case 1" in Section II), then
a_0 = a_e − sign(a_e) · 2^(e_min−p+1);
∙ otherwise, a_0 = a_e.

Therefore, we need to compare xy with a_e − sign(a_e) · 2^(e_min−p). This cannot be done straightforwardly, because xy is not necessarily representable exactly as the sum of two FP numbers (Lemma 2 does not hold). Instead, we will compare 2^(2p)·xy with 2^(2p)·a_e − sign(a_e) · 2^(e_min+p). The first step for doing that will be to express 2^(2p)·xy as the sum of two FP numbers t_1 and t_2 using Algorithm Fast2Mult. Then, to compare t_1 + t_2 with 2^(2p)·a_e − sign(a_e) · 2^(e_min+p), we will first show that the subtraction t_3 = t_1 − 2^(2p)·a_e is performed exactly, so that it will suffice to compare t_2 + t_3 with −sign(a_e) · 2^(e_min+p).

So, we successively compute (using FMA instructions)

    t_1 = RN_e((2^(2p) · x) · y)
    t_2 = RN_e((2^(2p) · x) · y − t_1) = x · y · 2^(2p) − t_1
    t_3 = RN_e(t_1 − a_e · 2^(2p)).

First, t_1 can be computed without overflow:
∙ |x · y| < 2^(e_min+1) and, since x · y ≠ 0, |y| ≥ 2^(e_min−p+1). Therefore, |x| ≤ 2^(e_min+1)/2^(e_min−p+1) = 2^p, therefore |x| · 2^(2p) < 2^(3p). Using (4), this implies that |x · 2^(2p)| < 2^(e_max+1), hence x · 2^(2p) is a floating-point number;
∙ now, |(2^(2p) · x) · y| < 2^(e_min+1+2p) < 2^(3p) < 2^(e_max+1), since e_min < 0.

Therefore, |2^(2p) · x · y| is below the overflow threshold, and

    |t_1 − 2^(2p)·xy| ≤ (1/2)·ulp(t_1).   (8)

The fact that t_2 = x · y · 2^(2p) − t_1 comes from Lemma 2 and (7).

Let us show that θ_3 = t_1 − a_e · 2^(2p) is a floating-point number. This will imply
t_3 = θ_3 = t_1 − a_e · 2^(2p)
(hence, θ_3 can be computed with an FMA, or with an exact multiplication by 2^(2p) followed by a subtraction). From (7) we obtain

    2^(e_min+p) ≤ |t_1| ≤ 2^(e_min+2p+1) − 2^(e_min+p+1)

and ulp(t_1) ≤ 2^(e_min+p+1). Since a_e (as any FP number) is a multiple of 2^(e_min−p+1), the number 2^(2p) · a_e is a multiple of 2^(e_min+p+1). Therefore, θ_3 is a multiple of ulp(t_1).

Now, (5) gives x · y − 2^(e_min−p) ≤ a_e ≤ x · y + 2^(e_min−p), from which we deduce

    x · y · 2^(2p) − 2^(e_min+p) ≤ a_e · 2^(2p) ≤ x · y · 2^(2p) + 2^(e_min+p),

which implies, using (8),

    t_1 − (1/2)·ulp(t_1) − 2^(e_min+p) ≤ a_e · 2^(2p) ≤ t_1 + (1/2)·ulp(t_1) + 2^(e_min+p).
ALGORITHM 9: AM-Full(푥, 푦): computes Now, we can compute 푎0 = RN0(푥 · 푦). If

augmentedMultiplication(푥, 푦) in all cases. 푒min−푝 푥 · 푦 = 푎푒 − sign(푎푒) · 2 1: 푎푒 ← RN푒(푥 · 푦) then 푎 = 푎 − sign(푎 ) · 2푒min−푝+1 (computed without 2: if |푎푒| = +∞ then 0 푒 푒 3: 푥′ ← 0.5 · 푥 error), otherwise 푎0 = 푎푒. Hence we have to decide whether 푒min−푝 ′ ′ ′ 푥 · 푦 = 푎푒 − sign(푎푒) · 2 . This is equivalent to checking 4: (푎푒, 푏푒) ← Fast2Mult (푥 , 푦) 푒min+푝 ′ 푒max ′ 푒max−푝+1 if 푡2 + 푡3 = −sign(푎푒) · 2 . This can be done as 5: if (푎푒 = 2 and 푏푒 = −2 ) or (푎′ = −2푒max and 푏′ = +2푒max−푝+1) then follows: first note that since 푡3 is a multiple of ulp(푡1) and 푒 푒 1 ′ −푝+1 |푡2| ≤ ulp(푡1), either 푡3 = 0 or |푡3| > |푡2|. Therefore, 6: 푎0 ← RN푒(푎푒 · (2 − 2 )) 2 ′ Lemma 1 can be applied to the addition of 푡2 and 푡3. Item 7: 푏0 ← −2푏푒 8: else 2 of that lemma tells us that if we define 푧 = RN푒(푡2 + 푡3), then RN (푧 − 푡 ) = 푧 − 푡 . Therefore, checking if 9: 푎0 ← 푎푒 (infinity with right sign) 푒 3 3 10: 푏 ← 푎 푒min+푝 0 푒 푡2 + 푡3 = −sign(푎푒) · 2 11: end if 푒min+푝 12: else if |푎푒| ≤ 2 then is equivalent to checking if 13: 푒min+푝 if 푎푒 = 0 then 푧 = −sign(푎푒) · 2 14: 푎0 ← 푎푒 and 15: 푏0 ← 푎푒 RN푒(푧 − 푡3) = 푡2. 푒min+1 푒min−푝+1 16: else if |푎푒| ≤ 2 − 2 then

17: 푏0 ← 0 푒min+1 푒min+푝 (︀ 2푝 )︀ D. Last special case: if 2 ≤ |RN푒(푥 · 푦)| ≤ 2 18: (푡1, 푡2) ← Fast2Mult (푥 · 2 ), 푦 2푝 That case corresponds to Lines 26–34 of Algorithm 9. In 19: 푡3 ← RN푒(푡1 − 푎푒 · 2 ) that case, we know that 푥 · 푦 − RN0(푥 · 푦) is of magnitude less 20: 푧 ← RN푒(푡2 + 푡3) 푒min 푒min+푝 than or equal to 2 , but is not necessarily a floating-point 21: if (푧 = −sign(푎푒) · 2 ) and number. The standard requires that we return 푎0 = RN0(푥 · 푦) (RN푒(푧 − 푡3) = 푡2) then 푒min−푝+1 and 푏0 = RN0 (푥 · 푦 − RN0(푥 · 푦)). 22: 푎0 ← 푎푒 − sign(푎푒) · 2 (2푝푥) · 푦 23: else We start by applying Algorithm 8 to the product . That product can be computed without overflow: 24: 푎0 ← 푎푒 푒min+푝 25: end if ∙ first, |RN푒(푥 · 푦)| ≤ 2 implies 26: else |푥푦| ≤ 2푒min+푝 + 2푒min . 27: (푎′, 푏′) ← AM-Simple(2푝푥, 푦) −푝 ′ 푒min+1 28: 푎0 ← RN푒(2 · 푎 ) Also, 2 ≤ |RN푒(푥 · 푦)| implies 푦 ̸= 0, therefore −푝 ′ 푒min−푝+1 29: 훽 ← RN푒(2 · 푏 ) |푦| ≥ 2 . Thus 푝 ′ 푒min 30: if RN푒(2 훽 − 푏 ) = sign(훽) · 2 then ⃒ ⃒ 푒min+푝 푒min ⃒푥푦 ⃒ 2 + 2 2푝−1 푝−1 31: 푏 ← 훽 − sign(훽) · 2푒min−푝+1 |푥| = ⃒ ⃒ ≤ = 2 + 2 . 0 ⃒ 푦 ⃒ 2푒min−푝+1 32: else 푝 3푝−1 2푝−1 푒max 33: 푏0 ← 훽 Therefore |2 푥| ≤ 2 +2 < 2 using (4). Thus 푝 34: end if (2 푥) is below the overflow threshold. 푒 +푝 푒 35: end if ∙ finally, |푥푦| ≤ 2 min + 2 min implies 36: else |(2푝푥) · 푦| ≤ 2푒min+2푝 + 2푒min+푝, 37: 푏푒 ← RN푒(푥 · 푦 − 푎푒) 푒max 38: (푎0, 푏0) ← Recomp(푎푒, 푏푒) which is less than 2 from (4) and the fact that 푒min 39: end if is negative. 푝 ′ 40: if 푏0 = 0 then Algorithm 8 applied to (2 푥) · 푦 returns two values, say 푎 ′ ′ 푝 ′ 푝 ′ 41: 푏0 ← (+0) · RN푒(푥푦 − 푎0) and 푏 , such that 푎 = RN0(2 푥 · 푦) and 푏 = 2 푥 · 푦 − 푎 . We 42: end if immediately deduce using Lemma 6 that 2−푝푎′ is the expected −푝 ′ −푝 ′ 43: return (푎0, 푏0) RN0(푥 · 푦). Obtaining RN0(푥 · 푦 − 2 푎 ) = RN0(2 푏 ) is slightly more tricky (Lemma 6 cannot be used because |2−푝푏′| can be strictly less than 2푒min ). We first compute −푝 ′ so that 훽 = RN푒(2 푏 ). 
The number 훽 is equal to the expected −푝 ′ 1 1 RN0(2 푏 ) unless we are in Halfway Case 1 of Section II, ⃒ 2푝⃒ 푒min+푝 ⃒푡1 − 푎푒 · 2 ⃒ ≤ ulp(푡1) + 2 ≤ ulp(푡1) + |푡1|. i.e., unless 2 2 −푝 ′ 푒min−푝 Hence, 휃3 is a multiple of ulp(푡1) of magnitude less than or 훽 − (2 푏 ) = sign(훽) · 2 (9) equal to 1 ulp(푡 ) + |푡 |. Since 푡 is a multiple of ulp(푡 ), we 2 1 1 1 1 in which case, one should replace 훽 by 훽−sign(훽)·2푒min−푝+1. deduce that 휃 is a multiple of ulp(푡 ) of magnitude less than 3 1 Equation (9) is equivalent to or equal to |푡1|. An immediate consequence (using Lemma 5) 푝 ′ 푒min is that 휃3 is a floating-point number, which implies 푡3 = 휃3. 2 훽 − 푏 = sign(훽) · 2 , 9

Definition Recomp := fun c1 c2ab ⇒ a condition which is easy to test since the subtraction is exact: letz := round_flt c1 (psi*a) in 2푝훽 − 푏′ is a multiple of 2푒min−푝+1, of magnitude less than or letd := round_flt c2 (z−a) in equal to 2푒min , hence it is a floating-point number. if (Req_bool (2*b) d) then (z,−b) else (a,b). All this gives Algorithm 9 and Theorem 4. Definition AA_Simple := fun c1 c2xy ⇒ let (x’,y’) := if (Rlt_bool (Rabsx )(Rabsy )) then (y,x) else (x,y) in Theorem 4. The output (푎0, 푏0) of Algorithm 9 is equal to let (ae,be) := Fast2Sumx ’ y’ in augmentedMultiplication(푥, 푦). Recomp c1 c2 ae be.

V. FORMAL PROOF

Arithmetic algorithms can be used in critical applications. The proof presented here is complex, with many particular cases to be considered. We have used the Coq proof assistant and the Flocq library [2] for our development towards Theorems 1 and 4. Our formal proof is available as an electronic appendix.

Note that we have aimed at genericity. In particular, we have tried to generalize the tie-breaking rule when possible. The precision and minimal exponent are hardly constrained, as we only require p > 1 and emin < 0. As explained above, the radix must be 2, as Algorithm 4 does not hold for radix 10 (the definitions and first properties of ulp_H and RN_0 are generic, though).

The formal proof closely follows the mathematical proof described above. Of course, we had to add several lemmas and to define RN_0 and its properties. This definition is very similar to the definition of rounding to nearest with tie-breaking away from zero defined by the standard for decimal arithmetic [8], and most of the proofs were nearly identical.

We then proved the correctness of Algorithm 4. In this case, for |a| > 2^emin, the two RN_e may be replaced with a rounding to nearest with any tie-breaking rule (they may even differ). Algorithm 5 is also proven. Similarly, the two RN_e roundings may in fact use any tie-breaking rule. The proof of Theorem 1 is then easily deduced, with Recomp using any two tie-breaking rules.

As on paper, the proof of Theorem 4 is more intricate, with many subcases, even if we handle only Cases A (without the zeroes), C, and D. Here, the case split depends on the tie-breaking rule: the equalities may be either strict or large depending upon the tie-breaking rule. For the sake of simplicity, we chose to stick to the pen-and-paper proof and share the same case split. We then require some roundings to use tie-breaking to even. We were not able to generalize the proof at a reasonable cost to handle all tie-breaking rules. Nevertheless, the proof was formally done and we were able to prove the correctness of Theorems 1 and 4 (without considering overflows and signs of zeroes). The Coq statements are as follows (with a few simplifications for the sake of readability). Note that c1 ... c7 are arbitrary tie-breaking rules.

    Definition AM_Full := fun c1 c2 c3 c4 c5 c6 c7 x y ⇒
      let ae := round_flt ZnearestE (x*y) in
      if (Rle_bool (Rabs ae) (bpow (emin+prec))) then
        (* zero *)
        if (Req_bool ae 0) then (0,0)
        (* very small *)
        else if (Rle_bool (Rabs ae) (bpow (emin+1) − bpow (emin−prec+1))) then
          let t1 := round_flt c1 (x*(y*bpow (2*prec))) in
          let t2 := round_flt c2 (x*(y*bpow (2*prec)) − t1) in
          let t3 := round_flt c3 (t1 − ae*bpow (2*prec)) in
          let z := round_flt ZnearestE (t2+t3) in
          if (andb (Req_bool z (−sign(ae)*bpow (emin+prec)))
                   (Req_bool (round_flt ZnearestE (z−t3)) t2))
          then (ae−sign(ae)*bpow (emin−prec+1),0)
          else (ae,0)
        (* medium small *)
        else
          let t1 := round_flt c1 (x*(y*bpow prec)) in
          let t2 := round_flt c2 (x*(y*bpow prec) − t1) in
          let A' := Recomp emin prec c3 c4 t1 t2 in
          let a0 := round_flt c5 (bpow (−prec)*fst A') in
          let beta := round_flt c6 (bpow (−prec)*snd A') in
          let z := round_flt c7 (bpow prec*beta−snd A') in
          if (Req_bool z (sign beta*bpow emin))
          then (a0, beta − sign(beta)*bpow (emin−prec+1))
          else (a0,beta)
      (* big *)
      else
        let be := round_flt ZnearestE (x*y−ae) in
        Recomp c1 c2 ae be.

    Lemma AA_Simple_correct : forall c1 c2 x y,
      format_flt x → format_flt y →
      let (a0,b0) := AA_Simple c1 c2 x y in
      x+y = a0 + b0 ∧ a0 = round_flt Znearest0 (x+y).

    Lemma AM_Full_correct : forall c1 c2 c3 c4 c5 c6 c7 x y,
      format_flt x → format_flt y →
      let (a0,b0) := AM_Full c1 c2 c3 c4 c5 c6 c7 x y in
      a0 = round_flt Znearest0 (x*y)
      ∧ b0 = round_flt Znearest0 (x*y−a0).

A very important limitation of these proofs is that overflows, infinite numbers, and the signs of zeroes are not considered. We relied on the Flocq formalization, which considers floating-point numbers as a subset of the real numbers. Therefore, zeroes are merged and there are neither infinities nor NaNs. This allows us to state the final theorems in the most understandable way: a_0 = RN_0(t) and a_0 + b_0 = t, or at least b_0 = RN_0(t − a_0) (with t being either the sum or the product of two floating-point numbers).

We have tried to develop additional formal proofs taking all exceptional behaviors into account (especially NaNs and overflows). We have relied on a modified version of the Binary definitions of Flocq and we have defined the full algorithms, with comparisons on FP numbers and possible overflows. This made both the algorithms and their specifications more complicated and less readable. Moreover, the comprehensive formal proofs were out of reach, both for lack of support lemmas and because of the combinatorial explosion of the subcases for each and every operation (NaN, overflow, signed zero, and so on). This really calls for automation in Coq for handling FP numbers with exceptional behaviors, which is out of the scope of this paper.

VI. IMPLEMENTATION AND COMPARISON

We have implemented the algorithms presented in this paper in binary64 (a.k.a. double-precision) arithmetic, as well as emulation algorithms based on integer arithmetic, described below. We used an Intel Core i7-7500U x86_64 processor clocked at 2.7 GHz under GNU/Linux (Debian 4.19.0-8-amd64), and the programs were compiled using GCC (Debian 8.3.0-6) 8.3.0, with the option -O3 -march=native. Our implementation, together with all testing and performance evaluation code, is available as additional material coming with this article; it can be downloaded at https://gitlab.com/cquirin/ieee754-2019-augmented-operations-reference-implementation and is archived at https://hal.archives-ouvertes.fr/hal-02137968.

Having an integer-based version of the augmented operations was important for comparison purposes, since there are no other implementations of these operations at the time we are writing this paper. Importantly enough, as we are going to see below, the integer-based algorithms are not simpler than the floating-point-based algorithms presented in this paper. This is because the floating-point operations somehow automatically handle most special cases.

The integer-based emulation code proceeds as follows (assuming here that we use the binary64 format):
1) The inputs x and y are checked for special values, such as NaNs or infinities. For these special values, a regular FP addition or multiplication is executed, in order to get a correct setting of the flags and special values for the high-order result a. For b, the same value is used. Similar logic is used for zero inputs.
2) Finite, non-zero, regular FP inputs x and y are decomposed into (−1)^(s_x)·2^(E_x)·M_x and (−1)^(s_y)·2^(E_y)·M_y, where M_x and M_y are normalized integer significands stored on 64-bit integer variables. This step includes a normalization step using the integer hardware lzc (leading-zero count) instruction whenever x or y is subnormal. The hidden bit is made explicit.
3) For addition, the two couples (s_x, E_x, M_x) and (s_y, E_y, M_y) are ordered by exponent: E_x ≥ E_y. When the exponent difference exceeds a certain limit, addition returns the ordered x and y as a and b as-is.
4) For all other cases, a temporary exact intermediate result (−1)^s·2^E·M is formed, where the integer significand M is stored on a 128-bit variable, which the compiler emulates on two 64-bit registers with the appropriate use of addition-with-carry and high/low-part multiplication machine instructions. For addition, this intermediate result is obtained by shifting the ordered M_x integer significand by E_x − E_y places to the left and adding or subtracting M_y. For multiplication, it suffices to multiply M_x and M_y with a 64-by-64-gives-128-bit multiplication. The integer significand M is normalized (unless it becomes zero due to cancellation), which requires a branch and a 1-bit shift for multiplication, and a leading-zero count and a larger, 128-bit shift (across 64-bit registers) for addition.
5) The intermediate result (−1)^s·2^E·M is rounded to the nearest IEEE 754 binary64 value a, applying round-to-nearest ties-to-zero rules. This rounding step is implemented as a "rounding to odd" [3] (with a sticky bit) to (−1)^(s')·2^E·M', where M' is a 64-bit integer significand, followed by the actual rounding to the binary64 format. This code sequence is quite complicated, as it must cope with a multitude of possible cases, such as overflow, gradual or complete underflow, as well as exact zeroes. A trace of overflow, underflow and inexact rounding is kept during this rounding step.
6) The high-order word a is decomposed again into (−1)^(s_a)·2^(E_a)·M_a. If it is finite, this value is subtracted from the intermediate result (−1)^s·2^E·M, which, after an appropriate leading-zero count and normalization shift, yields (−1)^(s_ℓ)·2^(E_ℓ)·M_ℓ, where M_ℓ is an integer significand stored on a 64-bit integer. This value is given to the same rounding code as the one used above, which yields b with round-to-nearest ties-to-zero and a trace of overflow (overflow being actually impossible in this step), underflow and inexact rounding.
7) Out of both traces for overflow, underflow and inexact, a global IEEE 754 flag setting is computed and applied to the IEEE 754 flag registers by executing a dummy FP operation that makes the appropriate flags be raised.

The emulation code has the advantage of being the only version of our algorithms that is able to set the IEEE 754 flags correctly and to be insensitive to the prevailing IEEE 754 rounding direction attribute. The FP-based algorithms may set the inexact flag as well as other flags spuriously, and they require the IEEE 754 rounding direction attribute to be round-to-nearest ties-to-even, which is the default. However, these advantages of the emulation code come at a significant cost: as we are going to see, the emulation code is on average 1.5 to 20 times slower than the FP-based algorithms. The emulation code also has the disadvantage of being rather complex. The precise rounding logic for round-to-nearest ties-to-zero, for example, is quite intricate, which required extra care during its development to overcome its error-prone nature.

The statistical distribution of the number of cycles used by our algorithms (using 10^6 samples, with the distributions of the inputs described below) is given:
∙ for the augmentedAddition algorithms, in Figure 4a for Algorithm 6 (AA-Simple), Figure 4b for Algorithm 7 (AA-Full), and Figure 4c for the integer-based emulation of augmentedAddition;
∙ for the augmentedMultiplication algorithms, in Figure 5a for Algorithm 8 (AM-Simple), Figure 5b for Algorithm 9 (AM-Full), and Figure 5c for the integer-based emulation of augmentedMultiplication.

For each input sample, formed by an input couple (x, y), the number of cycles is measured as follows: the function implementing one of our algorithms is run on x and y.

Fig. 4: Statistical distribution of the number of cycles for the addition algorithms. (a) Algorithm 6 (all simple inputs); (b) Algorithm 7 (all inputs); (c) Emulation algorithm (all inputs). [Three histograms; horizontal axis: 0–100 cycles.]

Its execution time is measured by reading off the x86 Time Step Counter with the rdtsc instruction before and after an execution of the function. Before reading the Time Step Counter, the CPU's pipeline is serialized by executing a dummy cpuid instruction, which Intel documents as having this serialization effect. As the time measurements obtained by subtracting the "before" Time Step Counter value from the "after" value also include the time for pipeline serialization, execution of the rdtsc instruction, and the function call itself, the same measurement is repeated with an empty function that has the same signature as the actual function to time. The empty function's measured execution time is subtracted from the actual function's measured execution time. This yields a raw execution-time sample in cycles. All raw execution times that are less than 1 cycle are discarded and the timing procedure is repeated. The measurement yielding positive raw execution times is repeated 100 times, and an average and a maximum value are computed. If the average value does not differ from the maximum value by more than a certain amount (typically 25%), the average value is taken as the execution time in cycles for this input sample (x, y). The measurements are taken on a CPU not executing any other heavy jobs, after a preheating phase for the instruction cache. The Linux scheduler is configured to keep the process on one CPU core as long as possible. Core migration is anyway filtered out by our testing strategy, as it yields either negative raw timings or raw timings that are clear outliers. Overall, 10^6 different samples are timed, which yields the histograms illustrated in Figures 4 and 5 as well as the average values (averaged over all 10^6 inputs) reported in Tables V and VI.

The different input samples (x, y) are produced as follows:
∙ The "all cases" input samples are produced with a pseudo-random number generator such that all signs, exponents and significands are uniformly distributed among all binary64 FP values (x, y) that are finite numbers such that the resulting outputs (a, b) are finite as well. The filtering of whether or not a candidate (x, y) produces finite outputs (a, b) uses our integer-based emulation code to compute (a, b) out of the candidate (x, y).
∙ The "all simple cases" input samples are produced by filtering, from the set of the "all cases" samples, the ones for which the simple FP-based Algorithms 6 and 8 produce bit-correct results (including correct signs for zeroes). The decision of whether or not a candidate sample (x, y) is an "all simple" case is taken by comparing the output of the respective FP-based simple algorithm with that of our integer-based emulation code.
∙ The "halfway cases" input samples (x, y) are such that the respective outputs (a, b) are finite FP numbers (zero and non-zero but no overflows) and x + y resp. x × y is precisely in the middle between two binary64 FP numbers. They are produced as follows:
  – For addition in binary64, a 54-bit odd integer N is produced (so only 52 bits are actually random, as the 53rd bit and the 0th bit must be set) along with a uniformly distributed exponent E. A uniformly distributed split point σ ∈ {1, ..., 53} is then produced. Using σ, N is cut into two parts which, along with E and σ, yield the exponents and significands of the candidates x and y. Half of the values x and y are swapped. The "cutting" process is not actually just a bit cut but is done in such a way that x and y can both be negative or positive.
  – For multiplication in binary64, two uniformly distributed 27-bit odd integers are produced along with uniformly distributed exponents.
  The candidate (x, y) samples are checked for finiteness and for whether or not they produce finite outputs (a, b); all candidates that do not satisfy these constraints are filtered out.
∙ The "halfway simple cases" are obtained by filtering, from the halfway cases, the ones for which the simple FP-based algorithms produce correct results, similarly to how the "all simple cases" are obtained.

The average timings are given in the first half of Table V for the augmentedAddition algorithms, and in the first half of Table VI for the augmentedMultiplication algorithms. In each of these tables, the second half gives average timings for the halfway cases.

Concerning augmentedAddition, Algorithm 7 is slightly better than the integer-based emulation in the general case, and significantly better in the bad cases. Concerning augmentedMultiplication, Algorithm 9 is significantly better, except in very rare cases (at the extreme right of Figure 5b). In all cases, the "simple" versions of the algorithms (Algorithms 6 and 8) are significantly faster in the cases when they work. They may also be significantly slower when they do not work, due to hardware constraints; for instance, Algorithm 8 may make the hardware produce and consume subnormal FP values, which is a slow process on some architectures.

Fig. 5: Statistical distribution of the number of cycles for the multiplication algorithms. (a) Algorithm 8 (all simple inputs); (b) Algorithm 9 (all inputs); (c) Emulation algorithm (all inputs). [Three histograms; horizontal axis: 0–300 cycles.]

TABLE V: Average timings in cycles for the augmentedAddition algorithms

    Algorithm                                              | # of cycles
    Algorithm 6 (addition, all simple cases)               | 7.66
    Algorithm 7 (addition, all cases)                      | 10.46
    Integer-based emulation of addition (all cases)        | 14.70
    Algorithm 6 (addition, halfway simple cases)           | 7.74
    Algorithm 7 (addition, halfway cases)                  | 9.82
    Integer-based emulation of addition (halfway cases)    | 4.99

TABLE VI: Average timings in cycles for the augmentedMultiplication algorithms

    Algorithm                                                   | # of cycles
    Algorithm 8 (multiplication, all simple cases)              | 7.62
    Algorithm 9 (multiplication, all cases)                     | 13.45
    Integer-based emulation of multiplication (all cases)       | 75.65
    Algorithm 8 (multiplication, halfway simple cases)          | 3.15
    Algorithm 9 (multiplication, halfway cases)                 | 4.99
    Integer-based emulation of multiplication (halfway cases)   | 58.34

CONCLUSION

We have presented and implemented algorithms that allow one to emulate the newly suggested "augmented" floating-point operations using the classical round-to-nearest ties-to-even operations. The algorithms are very simple in the general case. The special cases are slightly more involved but will remain infrequent in most applications. These algorithms compare favorably with an integer-based emulation of the augmented operations. Furthermore, the availability of both tests for special cases and formal proofs covering the normal and underflow cases (despite the limitations presented in Section V) gives much confidence in these algorithms.

REFERENCES

[1] S. Boldo, S. Graillat, and J.-M. Muller. On the robustness of the 2Sum and Fast2Sum algorithms. ACM Transactions on Mathematical Software, 44(1):4:1–4:14, July 2017.
[2] S. Boldo and G. Melquiond. Computer Arithmetic and Formal Proofs. ISTE Press – Elsevier, 2017.
[3] S. Boldo and G. Melquiond. Emulation of FMA and correctly rounded sums: proved algorithms using rounding to odd. IEEE Transactions on Computers, 57(4):462–471, April 2008.
[4] T. J. Dekker. A floating-point technique for extending the available precision. Numerische Mathematik, 18(3):224–242, 1971.
[5] J. Demmel, P. Ahrens, and H. D. Nguyen. Efficient reproducible floating point summation and BLAS. Technical Report UCB/EECS-2016-121, EECS Department, University of California, Berkeley, June 2016.
[6] J. Harrison. A machine-checked theory of floating point arithmetic. In 12th International Conference on Theorem Proving in Higher Order Logics (TPHOLs), volume 1690 of Lecture Notes in Computer Science, pages 113–130. Springer-Verlag, Berlin, September 1999.
[7] J. R. Hauser. Handling floating-point exceptions in numeric programs. ACM Transactions on Programming Languages and Systems, 18(2):139–174, 1996.
[8] IEEE. IEEE Standard for Floating-Point Arithmetic (IEEE Std 754-2019). July 2019.
[9] C.-P. Jeannerod, J.-M. Muller, and P. Zimmermann. On various ways to split a floating-point number. In 25th IEEE Symposium on Computer Arithmetic, Amherst, MA, USA, pages 53–60, June 2018.
[10] W. Kahan. Lecture notes on the status of IEEE-754. Available at http://www.cs.berkeley.edu/~wkahan/ieee754status/IEEE754.PDF, 1997.
[11] D. E. Knuth. The Art of Computer Programming, volume 2. Addison-Wesley, Reading, MA, 3rd edition, 1998.
[12] O. Møller. Quasi double-precision in floating-point addition. BIT, 5:37–50, 1965.
[13] J.-M. Muller, N. Brunie, F. de Dinechin, C.-P. Jeannerod, M. Joldes, V. Lefèvre, G. Melquiond, N. Revol, and S. Torres. Handbook of Floating-Point Arithmetic. Birkhäuser Boston, 2018.
[14] Y. Nievergelt. Scalar fused multiply-add instructions produce floating-point matrix arithmetic provably accurate to the penultimate digit. ACM Transactions on Mathematical Software, 29(1):27–48, 2003.
[15] E. J. Riedy and J. Demmel. Augmented arithmetic operations proposed for IEEE-754 2018. In 25th IEEE Symposium on Computer Arithmetic, Amherst, MA, USA, pages 45–52, June 2018.
[16] S. M. Rump. Ultimately fast accurate summation. SIAM Journal on Scientific Computing, 31(5):3466–3502, January 2009.
[17] P. H. Sterbenz. Floating-Point Computation. Prentice-Hall, Englewood Cliffs, NJ, 1974.

ACKNOWLEDGEMENT
We thank Claude-Pierre Jeannerod for his very useful suggestions.