
2017 IEEE 24th Symposium on Computer Arithmetic

Optimized Binary64 and Binary128 Arithmetic with GNU MPFR

Vincent Lefèvre (INRIA / LIP / Université de Lyon, France) and Paul Zimmermann (INRIA Nancy - Grand Est / LORIA, France)

Abstract—We describe algorithms used to optimize the GNU MPFR library when the operands fit into one or two words. On modern processors, this gives a speedup for a correctly rounded addition, subtraction, multiplication, division or square root in the standard binary64 format (resp. binary128) between 1.8 and 3.5 (resp. between 1.6 and 3.2). We also introduce a new faithful rounding mode, which enables even faster computations. Those optimizations will be available in version 4 of MPFR.

Index Terms—floating-point arithmetic; correct rounding; faithful rounding; binary64; binary128; GNU MPFR

I. INTRODUCTION

The IEEE 754-2008 standard [6] defines the binary floating-point formats binary64 and binary128, providing respectively a precision of 53 and 113 bits. Those standard formats are used in many applications, therefore it is very important to provide fast arithmetic for them, either in hardware or in software.

GNU MPFR [3] (MPFR for short) is a reference software implementation of the IEEE 754-2008 standard in arbitrary precision. MPFR guarantees correct rounding for all its operations, including elementary and special functions [6, Table 9.1]. It provides mixed-precision operations, i.e., the precision of the input operand(s) and the result may differ.

Since 2000, several authors have cited MPFR, either to use it for a particular application, or to compare their own library to MPFR [2], [4], [7]. Most of those comparisons are in the case when all operands have the same precision, which is usually one of the standard binary64 or binary128 formats.

Since 2008, GCC has used MPFR in its middle-end to generate correctly rounded compile-time results regardless of the math library implementation or floating-point precision of the host platform. The Sage computer algebra system and several libraries depend on MPFR, among which Moore [9], MPFI [11], and libieeep1788.

The contributions of this article are: (i) new algorithms for basic floating-point arithmetic in one- or two-word precision, (ii) an implementation in MPFR of those algorithms with corresponding timings for basic arithmetic and mathematical functions, and (iii) a description of a new faithful rounding mode, with corresponding implementation and timings in MPFR.

Notations: We use p to denote the precision in bits. A limb, following GMP terminology [5], is an unsigned integer that fits in a machine word, and is assumed to have 64 bits here.

II. BASIC ARITHMETIC

A. The MPFR Internal Format

Internally, a MPFR number is represented by a precision p, a sign s, an exponent e, denoted by EXP(·), and a significand m. The significand is a multiple-precision natural integer represented by an array of n = ⌈p/64⌉ limbs: we denote by m[0] the least significant limb, and by m[n−1] the most significant one. For a regular number — neither NaN, nor ±∞, nor ±0 — the most significant bit of m[n−1] must be set, i.e., the number is always normalized; and an underflow occurs when a non-zero result rounded with an unbounded exponent range would have an exponent less than the minimum one. The corresponding number is:

    (−1)^s · (m · 2^(−64n)) · 2^e,

i.e., the rational significand satisfies 1/2 ≤ m/2^(64n) < 1. When the precision p is not an exact multiple of 64, the least sh = 64n − p significant bits of m[0] are not used. By convention they must always be zero, like the 3 bits below (drawing the most significant limbs and bits on the left):

    1xxxxxxxxxxx xxxxxxxxxxxx ··· xxxxxxxxx000
       m[n−1]       m[n−2]          m[0]

In the description of the algorithms below, those sh trailing bits will be represented by 3 zeros.

In this section, we describe basic algorithms used to perform arithmetic on 1 or 2 limbs (addition, subtraction, multiplication, division and square root) for regular inputs, after non-regular inputs have been dealt with. Those algorithms assume the number sh of unused bits is non-zero. As a consequence, on a 64-bit processor, if efficiency is required, it is recommended to use precision p = 63 or p = 127 instead of p = 64 or p = 128, whenever possible. In the following, we assume that all input and output arguments have the same precision p.

In addition to correct rounding, all MPFR functions return a ternary value, giving the sign of the rounding error: 0 indicates that the computed correctly rounded result y exactly equals the infinite precision value f(x) (no rounding error occurred); a positive value indicates that y > f(x), and a negative value indicates that y < f(x).
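As an illustration of this format (not MPFR code), the sketch below packs a binary64 value into the representation just described, for p = 53 and n = 1: a sign, an exponent e, and one limb m whose most significant bit is set and whose low sh = 11 bits are zero. The function name and output layout are hypothetical.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>

/* Hypothetical illustration of the MPFR internal format for p = 53, n = 1:
   x = (-1)^s * (m / 2^64) * 2^e with 2^63 <= m < 2^64 and the low
   sh = 64 - 53 = 11 bits of m equal to zero. */
static void to_internal (double x, int *s, int64_t *e, uint64_t *m)
{
  int ex;
  double f;

  *s = x < 0;
  f = frexp (fabs (x), &ex);   /* 0.5 <= f < 1 and |x| = f * 2^ex */
  *e = ex;
  /* f * 2^64 is an integer multiple of 2^11, hence exactly representable */
  *m = (uint64_t) ldexp (f, 64);
}
```

For instance 1.0 is stored as m = 2^63 with e = 1, matching the convention that the rational significand m/2^64 lies in [1/2, 1).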

1063-6889/17 $31.00 © 2017 IEEE    DOI 10.1109/ARITH.2017.28

In this section, we describe the correctly rounded basic arithmetic routines in version 4 of MPFR for n = 1 and n = 2 limbs. Previous versions of MPFR were calling generic routines, valid for any number of limbs. For the division and square root, we are not aware of previous work using integer operations only, except [10] for division.

B. Addition

We describe here the internal mpfr_add1sp1 function for 0 < p < 64. The easiest case is bx = cx (bx and cx denoting the exponents of b and c): adding the two significands always produces a carry: 2^64 ≤ bp[0] + cp[0] < 2^65:

    bp[0]: 1xxxxxxxx000
    cp[0]: 1yyyyyyyy000

We thus simply add the two shifted significands — since p < 64, we lose no bits in doing this — and increase bx by 1 (this will be the result exponent):

    a0 = (bp[0] >> 1) + (cp[0] >> 1);
    bx ++;

Since sh = 64 − p is the number of unused bits, the round bit is bit sh − 1 of a0, which we set to 0 before storing in memory, and the sticky bit is always 0 in this case:

    rb = a0 & (ONE << (sh - 1));
    ap[0] = a0 ^ rb;
    sb = 0;

The case bx > cx is treated similarly: we have to shift cp[0] by d = bx − cx bits to the right. We distinguish three cases: d < sh, where cp[0] shifted fully overlaps with ap[0]; sh ≤ d < 64, where cp[0] shifted partly overlaps; and d ≥ 64, where cp[0] shifted does not overlap. In the case bx < cx, we simply swap the inputs and reduce to the bx > cx case.

Now consider the rounding. All rounding modes are first converted to nearest, toward zero or away from zero. At this point, ap[0] contains the current significand, bx the exponent, rb the round bit and sb the sticky bit. No underflow is possible in the addition, since the inputs have the same sign: |b + c| ≥ min(|b|, |c|). An overflow occurs when bx > emax. Otherwise we set the exponent of a to bx. If rb = sb = 0, we return 0 as ternary value, which means the result is exact, whatever the rounding mode. If rounding to nearest, we leave ap[0] unchanged when either rb = 0, or rb ≠ 0, sb = 0 and bit sh of ap[0] is 0 (even rule); otherwise we add one ulp, i.e., 2^sh, to ap[0]. If rounding toward zero, ap[0] is unchanged, and we return the opposite of the sign of a as ternary value, which means that the computed value is less than b + c when b, c > 0 (remember the exact case rb = sb = 0 was treated before). If rounding away from zero, we add one ulp to ap[0]; while doing this, we might have an exponent increase, which might in turn give an overflow (this might also occur in the multiplication).

C. Subtraction

We detail here the mpfr_sub1sp2 function, in the case where the exponent difference d = bx − cx satisfies 0 < d < 64. The significand of c shifted right by d bits yields 3 limbs: cp[1] >> d, t (the middle limb), and sb (the bits shifted out):

    bp[1]        bp[0]
    1xxxxxxxxxxx xxxxxxxxx000
      ← d bits → 1yyyyyyyyyyy yyyyyyyyy000
                               (sb)

The goal is to subtract these 3 limbs from b, word by word. While this 3-word subtraction could be done with only 3 instructions on a typical processor, we need portable C code, which is more complex, to express the borrow propagation (and unfortunately, compilers are not yet able to detect such patterns and generate only these 3 instructions):

    t = (cp[1] << (64 - d)) | (cp[0] >> d);
    a0 = bp[0] - t;
    a1 = bp[1] - (cp[1] >> d) - (bp[0] < t);
    sb = cp[0] << (64 - d);
    if (sb) {
      a1 -= (a0 == 0);
      a0 --;
      sb = -sb; /* 2^64 - sb */
    }

At this point the exact subtraction corresponds to a1 + 2^(−64)·a0 + 2^(−128)·sb, where a1 and a0 cannot both be zero: if d ≥ 2 then a1 ≥ 2^62; if d = 1, since bit 0 of cp[0] is 0 because p < 128, necessarily sb = 0. However, for d = 1 we can have a1 = 0, in which case we shift a by one word to the left, and decrease the exponent by 64:

    if (a1 == 0) {
      a1 = a0;
      a0 = 0;
      bx -= 64;
    }

Now a1 ≠ 0, and we shift a to the left by the number of leading zeros of a1:

    count_leading_zeros (cnt, a1);
    if (cnt) {
      ap[1] = (a1 << cnt) | (a0 >> (64 - cnt));
      a0 = (a0 << cnt) | (sb >> (64 - cnt));
      sb <<= cnt;
      bx -= cnt;
    }
    else
      ap[1] = a1;

We now compute the round and sticky bits, and set to zero the last sh bits of a0 before storing it, with mask = 2^sh − 1:

    rb = a0 & (ONE << (sh - 1));
    sb |= (a0 & mask) ^ rb;
    ap[0] = a0 & ~mask;

The rounding is done exactly like for the addition, except that in the subtraction we cannot have any overflow (remember all arguments have the same precision), but we can have an underflow.
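For the equal-exponent case, the steps above can be sketched as a self-contained function (a simplification, not the actual mpfr_add1sp1: same-sign operands, equal exponents, round to nearest even only; rb, the always-zero sticky bit and the even rule are exactly as described in the text):

```c
#include <assert.h>
#include <stdint.h>

/* Add two significands bp0, cp0 (top bit set, low sh bits zero) of
   precision p = 64 - sh, with equal exponents, rounding to nearest even.
   Returns the rounded significand; *exp_inc is the exponent increase.
   (A carry out of a0 when adding the ulp, which would increase the
   exponent again, is ignored in this sketch.) */
static uint64_t add1_rndn (uint64_t bp0, uint64_t cp0, int sh, int *exp_inc)
{
  uint64_t a0, rb, one = 1;

  a0 = (bp0 >> 1) + (cp0 >> 1);   /* the carry goes into the top bit */
  *exp_inc = 1;
  rb = a0 & (one << (sh - 1));    /* round bit; sticky bit is 0 here */
  a0 = a0 ^ rb;                   /* clear the round bit */
  /* even rule: rb != 0 and sb = 0 is an exact tie, so round up only
     when bit sh of a0 is 1 */
  if (rb && (a0 & (one << sh)))
    a0 += one << sh;              /* add one ulp = 2^sh */
  return a0;
}
```

With sh = 11 (binary64), a tie whose truncated significand is already even is left unchanged, while an odd one is rounded up by one ulp.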

D. Multiplication

The multiplication is easy to implement with the chosen internal format. For one limb, the full 128-bit product a0·2^64 + sb of the two significands is first computed; when its most significant bit is not set, the product is shifted left by one bit and the result exponent ax is decreased by 1:

    a0 = (a0 << 1) | (sb >> 63);
    sb = sb << 1;

The round and sticky bits are computed exactly like in the 2-limb subtraction, and the sign of the result is set:

    rb = a0 & (ONE << (sh - 1));
    sb |= (a0 & mask) ^ rb;
    ap[0] = a0 & ~mask;
    SIGN(a) = MULT_SIGN (SIGN (b), SIGN (c));

For the multiplication, both overflow and underflow can happen. Overflow is easy to handle. For underflow, let us recall that MPFR signals underflow after rounding. For example, when ax = emin − 1, rb or sb is non-zero, ap[0] = 111...111000 and we round away from zero, there is no underflow. Apart from this special case, the rounding is the same as for the subtraction. (This special case where ax = emin − 1 and nevertheless no underflow occurs cannot happen in the subtraction, since b and c are multiples of 2^(emin−p), likewise for b − c; thus if the exponent ax of the difference is smaller than emin, the difference b − c is exact, and rb = sb = 0.)

In the mpfr_mul_2 function, which multiplies b1·2^64 + b0 by c1·2^64 + c0, the product b0·c0 of the low limbs contributes less than 1 to the second most significant limb of the full 4-limb product (before a potential left shift by 1 bit to get a normalized result). Thus we only perform 3 calls to the 64×64 → 128-bit multiplication, for b1·c1, b1·c0 and b0·c1, and ignore b0·c0.

E. Division

Algorithm 1 DivApprox1
Input: integers u, v with 0 ≤ u < v and β/2 ≤ v < β (β = 2^64)
Output: integer q approximating uβ/v
1: compute an approximation i of the reciprocal of v satisfying
       0 < β^2 − (β + i)v < 2v    (1)
2: q = ⌊iu/β⌋ + u

The approximate reciprocal i in step 1 is obtained using the variable v3 from Algorithm 2 (RECIPROCAL_WORD) in [10]; indeed, Theorem 1 from [10] proves that — with our notation — 0 < β^2 − (β + v3)v < 2v, which yields the inequalities (1).

Theorem 1: The approximate quotient returned by Algorithm DivApprox1 satisfies

    q ≤ ⌊uβ/v⌋ ≤ q + 2.

Proof. Step 2 of DivApprox1 is simply step 1 of Algorithm 1 (DIV2BY1) from [10]. If i0 is the exact reciprocal i0 := ⌊(β^2 − 1)/v⌋ − β, and q0 := ⌊i0·u/β⌋ + u is the corresponding quotient, it is proven in [10] that the corresponding remainder r0 = βu − q0·v satisfies 0 ≤ r0 < 4v. However, the upper bound 4v includes 2v coming from a lower dividend term, which is zero here, thus we have r0 < 2v. In the case i = i0 − 1, then q ≤ q0 ≤ q + 1, thus r = βu − qv satisfies r < 3v.

As a consequence of Theorem 1, we can decide the correct rounding except when the last sh − 1 bits of q are 000...000, 111...111, or 111...110, which occurs with probability less than 0.15% for the binary64 format. In the rare cases where we cannot decide the correct rounding, we compute the remainder r = βu − qv, then subtract v from r and increment q (at most two times) until r < v.

Algorithm 2 DivApprox2
Input: integers u, v with 0 ≤ u < v and β^2/2 ≤ v < β^2
Output: integer q approximating uβ^2/v

Algorithm 3 RecSqrtApprox1
Input: integer d with 2^62 ≤ d < 2^64
Output: integer v3 approximating 2^96/√d

The approximation x in step 1 is obtained by Algorithm DivApprox1 (with v1 + 1 instead of v). Write r = r1·β + r0 with 0 ≤ r0 < β. As we will see later, the upper word satisfies r1 ≤ 4. Step 6 becomes q = q1·β + r1·x + ⌊r0·x/β⌋, where the computation of r1·x is done using a few additions instead of a multiplication.

Theorem 2: For β ≥ 5, the approximate quotient returned by Algorithm DivApprox2 satisfies

    q ≤ ⌊uβ^2/v⌋ ≤ q + 21.

Proof. The input condition u < v […] Dividing by v ≥ αβ^2 we get:

    0 ≤ uβ^2/v − q < 2/α^2 + 2/α + 7 + 2α + 4α^2 + (1/β)(2/α + 2α).    (2)

The right-hand side is bounded by 21 + 5/β for 1/2 ≤ α < 1, thus for β ≥ 5 we have q ≤ uβ^2/v < q + 22.

The largest error we could find is 20, for example for β = 4096, u = 8298491, v = 8474666. With β = 2^64, […]
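A quick numerical sanity check of the bound used in the proof of Theorem 2: the β-independent part of the right-hand side of (2) never exceeds 21 on [1/2, 1), with the maximum attained at α = 1/2 (function names are ours, for illustration only):

```c
#include <assert.h>

/* g(a) = 2/a^2 + 2/a + 7 + 2a + 4a^2, the beta-independent part of the
   right-hand side of (2); its maximum on [1/2, 1) is g(1/2) = 21. */
static double g (double a)
{
  return 2.0 / (a * a) + 2.0 / a + 7.0 + 2.0 * a + 4.0 * a * a;
}

/* sample g on a grid of [1/2, 1) and return the largest value seen */
static double gmax (void)
{
  double m = 0.0;
  for (double a = 0.5; a < 1.0; a += 1.0 / 1024.0)
    if (g (a) > m)
      m = g (a);
  return m;
}
```

All quantities at α = 1/2 are exact in binary floating point (8 + 4 + 7 + 1 + 1), so the maximum is exactly 21, consistent with the 21 + 5/β bound quoted in the proof.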

[…] approximating the root of a2 := d37/2^37, and v3 is a 65-bit value with x3 := v3/2^64 approximating the root of a3 := d/2^64.

Theorem 3: The value v3 returned by Algorithm 3 differs by at most 8 from the reciprocal square root:

    v3 ≤ s := ⌊2^96/√d⌋ ≤ v3 + 8.

Lemma 1: Assume positive real numbers x0, x1, ..., xn are computed using the following recurrence, with x0 ≤ a0^(−1/2):

    x(k+1) = xk + (xk/2)·(1 − a(k+1)·xk^2),    (3)

where a0 ≥ a1 ≥ ··· ≥ an ≥ a > 0; then:

    x1 ≤ x2 ≤ ··· ≤ xn ≤ a^(−1/2).

Proof. It follows from [1, Lemma 3.14] that x(k+1) ≤ a(k+1)^(−1/2). Together with x0 ≤ a0^(−1/2), this gives 1 − ak·xk^2 ≥ 0 for all k. Since a(k+1) ≤ ak, it follows that 1 − a(k+1)·xk^2 ≥ 1 − ak·xk^2 ≥ 0. Put into (3), this proves x(k+1) ≥ xk, and thus the lemma, since x(k+1) ≤ a(k+1)^(−1/2) and a(k+1) ≥ a imply x(k+1) ≤ a^(−1/2).

This lemma remains true when the correction term (xk/2)·(1 − a(k+1)·xk^2) in Eq. (3) — which is thus non-negative — is rounded down towards zero. By applying this result to x0 = v0/2^10, x1 = v1/2^21, x2 = v2/2^31, x3 = v3/2^64, with a0 = d10/2^10 ≥ a1 = d37/2^37 = a2 ≥ a3 = d/2^64, it follows that

    1 ≤ v0/2^10 ≤ v1/2^21 ≤ v2/2^31 ≤ v3/2^64 ≤ 2^32/√d ≤ 2,

thus 2^10 ≤ v0 < 2^11, 2^21 ≤ v1 < 2^22, 2^31 ≤ v2 < 2^32, 2^64 ≤ v3 < 2^65 (the right bounds are strict because in the only case where 2^32/√d = 2, i.e., d = 2^62, we have v3 = 2^65 − 3).

Proof of Theorem 3. The construction of the lookup table ensures that v0 < 2^11 and e0 ≥ 0 at step 4 of the algorithm. By exhaustive search on all possible d10 values, we get:

    0 ≤ e0 < 2^49.263.

Thus at step 4, we can compute 2^57 − v0^2·d37 in 64-bit arithmetic, which will result in a number of at most 50 bits, and step 5 can be done in 64-bit arithmetic, since the product v0·e0 will have at most 61 bits.

Let δ1 be the truncation error in step 5, with 0 ≤ δ1 < 1:

    ⌊2^(−47)·v0·e0⌋ = 2^(−47)·v0·e0 − δ1.

Then:

    e1 = 2^79 − v1^2·d37 = 2^79 − d37·((v1 + δ1) − δ1)^2
       = 2^79 − d37·(v1 + δ1)^2 + d37·δ1·(2v1 + δ1).

Since v1^2·d37 ≤ 2^79 — because e1 ≥ 0 — we have v1·d37 ≤ 2^79/v1 ≤ 2^58 because 2^21 ≤ v1. It follows that γ1 := d37·δ1·(2v1 + δ1) satisfies 0 ≤ γ1 < 2^59 + 2^37, and:

    e1 − γ1 = 2^79 − d37·(v1 + δ1)^2
            = 2^79 − d37·(2^11·v0 + 2^(−47)·v0·e0)^2
            = 2^79 − d37·v0^2·(2^11 + 2^(−47)·e0)^2.

Now, using v0^2·d37 = 2^57 − e0:

    e1 − γ1 = 2^79 − (2^57 − e0)·(2^22 + 2^(−35)·e0 + 2^(−94)·e0^2)
            = 2^79 − (2^79 − 2^(−35)·e0^2 + 2^(−37)·e0^2 − 2^(−94)·e0^3)
            = 2^(−35)·e0^2·(3/4 + 2^(−59)·e0).

Since e0 < 2^49.263, we deduce 3/4 + 2^(−59)·e0 < 2^(−0.412), and

    0 ≤ e1 < 2^(−35)·2^98.526·2^(−0.412) + 2^59 + 2^37 < 2^63.196.

Therefore e1 < 2^64 and e1 can be computed using integer arithmetic modulo 2^64. Since d10 = ⌊2^(−27)·(d37 − 1)⌋ + 1, e1 only depends on d37, so that we can perform an exhaustive search on the 2^37 − 2^35 possible values of d37. By doing this, we find that the largest value of e1 is obtained for d37 = 33·2^30 + 1, which corresponds to d10 = 265; this gives e1 = 10263103231123743388 < 2^63.155.

Let δ2 be the truncation error in step 7, with 0 ≤ δ2 < 1:

    ⌊2^(−70)·v1·e1⌋ = 2^(−70)·v1·e1 − δ2.

Then:

    e2 = 2^126 − v2^2·d = 2^126 − d·((v2 + δ2) − δ2)^2
       = 2^126 − d·(v2 + δ2)^2 + d·δ2·(2v2 + δ2).

Since v2 ≥ 2^31 and v2^2·d ≤ 2^126, we have v2·d ≤ 2^95, thus γ2 := d·δ2·(2v2 + δ2) < 2^96 + 2^64, and:

    e2 − (2^96 + 2^64) ≤ 2^126 − d·(v2 + δ2)^2
                       = 2^126 − d·(2^10·v1 + 2^(−70)·v1·e1)^2
                       = 2^126 − d·v1^2·(2^10 + 2^(−70)·e1)^2.

Now writing d = 2^27·d37 − ρ with 0 < ρ ≤ 2^27, using d37·v1^2 = 2^79 − e1, and writing ε = 2^96 + 2^64 + ρ·v1^2·(2^10 + 2^(−70)·e1)^2:

    e2 − ε ≤ 2^126 − 2^27·(2^79 − e1)·(2^10 + 2^(−70)·e1)^2
           = 2^126 − (2^79 − e1)·(2^47 + 2^(−32)·e1 + 2^(−113)·e1^2)
           = 2^126 − (2^126 − 2^(−32)·e1^2 + 2^(−34)·e1^2 − 2^(−113)·e1^3)
           = 2^(−32)·e1^2·(3/4 + 2^(−81)·e1).

Since e1 < 2^63.155, we deduce 3/4 + 2^(−81)·e1 < 2^(−0.415), and:

    ρ·v1^2·(2^10 + 2^(−70)·e1)^2 ≤ 2^71·(2^20 + 2^5 + 2^(−12)).

Thus ε ≤ 2^96 + 2^64 + 2^91 + 2^76 + 2^59, and:

    0 ≤ e2 < 2^(−32)·2^126.31·2^(−0.415) + ε < 2^96.338.    (4)

Here again, an exhaustive search is possible, since for a given value of d37, e2 is maximal when d = 2^27·(d37 − 1) + 1: the largest value of e2 is obtained for d37 = 132607222902, corresponding to d10 = 989, with e2 = 81611919949651931475229016064 < 2^96.043.

The final error is estimated using the truncation error δ3 in step 9:

    ⌊2^(−94)·v2·e2⌋ = 2^(−94)·v2·e2 − δ3.

Then:

    e3 = 2^192 − v3^2·d = 2^192 − d·((v3 + δ3) − δ3)^2
       = 2^192 − d·(v3 + δ3)^2 + d·δ3·(2v3 + δ3).

Since v3^2·d ≤ 2^192 and v3 ≥ 2^64, it follows that v3·d ≤ 2^128, thus γ3 := d·δ3·(2v3 + δ3) < 2^129 + 2^64, and:

    e3 − γ3 = 2^192 − d·(v3 + δ3)^2
            = 2^192 − d·(2^33·v2 + 2^(−94)·v2·e2)^2
            = 2^192 − d·v2^2·(2^33 + 2^(−94)·e2)^2.
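The recurrence (3) is Newton's iteration for the reciprocal square root; a short floating-point sketch (with all a_k equal to a fixed a, our simplification) illustrates the monotone convergence from below stated in Lemma 1:

```c
#include <assert.h>
#include <math.h>

/* One step of Eq. (3): x <- x + (x/2)(1 - a*x^2).
   Starting from x0 <= a^(-1/2), the iterates increase towards a^(-1/2). */
static double rsqrt_step (double x, double a)
{
  return x + 0.5 * x * (1.0 - a * x * x);
}
```

Starting from x0 = 0.5 with a = 2, six steps already reach 1/√2 to full double precision, the error roughly squaring at each step.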

Now since d·v2^2 = 2^126 − e2:

    e3 − γ3 ≤ 2^192 − (2^126 − e2)·(2^66 + 2^(−60)·e2 + 2^(−188)·e2^2)
            = 2^192 − (2^192 − 2^(−60)·e2^2 + 2^(−62)·e2^2 − 2^(−188)·e2^3)
            = 2^(−60)·e2^2·(3/4 + 2^(−128)·e2).

Since e2 < 2^96.043, we deduce 3/4 + 2^(−128)·e2 < 2^(−0.415), and:

    0 ≤ e3 < 2^(−60)·2^192.086·2^(−0.415) + 2^129 + 2^64 < 2^131.882.

Again by exhaustive search, restricted to the values of d37 that give a sufficiently large value of e2, we found that the maximal value of e3 satisfies e3 < 2^131.878.

Now, given v3 returned by Algorithm 3, let c ≥ 0 be such that (v3 + c)^2·d = 2^192. Since 2^192 − v3^2·d < 2^131.878, it follows that 2·v3·c·d < 2^131.878, thus c < 2^131.878/(2·v3·d). Since v3^2·d = 2^192 − e3 and v3 < 2^65, we have v3·d ≥ (2^192 − 2^131.878)/2^65 > 2^126.999, and c < 2^3.879 < 15.

By doing this error analysis again for a given value of d37, we get a tighter bound involving the δi truncation errors. For example, we can bound d37·δ1·(2v1 + δ1) by d37·(2·v1,max + 1), where v1,max = ⌊(2^79/d37)^(1/2)⌋. This yields a finer bound. By exhaustive search on all possible d37 values, we found that the maximal error is at most 8. The bound of 8 is optimal, since it is attained for d = 4755801239923458105.

Remark 1: One can replace ⌊2^(−94)·v2·e2⌋ in step 9 by ⌊2^(−29)·v2·e2′⌋, where e2′ = ⌊2^(−65)·e2⌋ has at most 32 bits, like v2, and thus the product v2·e2′ can be computed with a low 64-bit product, which is faster than a full product giving both the low and high words. If we write e2 = 2^65·e2′ + r with 0 ≤ r < 2^65, then:

    2^(−94)·v2·e2 − 2^(−29)·v2·e2′ = 2^(−94)·v2·r < 2^3.

This increases the maximal error from 8 to 15, since for the only value d37 = 35433480442 that can yield c ≥ 8 with the original step 9, we have c < 8.006 and v2 = 4229391409, thus c + 2^(−94)·v2·r < 15.9.

2) The mpfr_sqrt1 function: The mpfr_sqrt1 function computes the square root of a 1-limb number n using Algorithm 4 (SqrtApprox1). In step 1, it computes a 32-bit approximation x of 2^63/√n using the value v2 of Algorithm 3 (called with d = n), then uses this approximation to deduce the exact integer square root y of n (step 2) and the corresponding remainder z (step 3), and finally uses x again to approximate the correction term t (step 4), to form the approximation s in step 5.

Algorithm 4 SqrtApprox1
Input: integer n with 2^62 ≤ n < 2^64
Output: integer s approximating √(2^64·n)
1: compute an integer x approximating 2^63/√n with x ≤ 2^63/√n
2: y = ⌊√n⌋ (using the approximation x)
3: z = n − y^2
4: t = ⌊2^(−32)·x·z⌋
5: s = y·2^32 + t

Steps 2-3 can be implemented as follows. First compute an initial y = ⌊2^(−32)·x·⌊2^(−31)·n⌋⌋ and the corresponding remainder z = n − y^2. Then, as long as z ≥ 2y + 1, subtract 2y + 1 from z and increment y by one. Note that since x^2·n < 2^126, we have x·n/2^31 < 2^32·√n ≤ 2^64, thus x·⌊2^(−31)·n⌋ fits on 64 bits.

Theorem 4: If the approximation x in step 1 is the value v2 of Algorithm 3, Algorithm SqrtApprox1 returns s satisfying

    s ≤ ⌊√(2^64·n)⌋ ≤ s + 7.

Proof. Write n = α·2^64 with 1/4 ≤ α < 1, x = 2^63/√n − δx, y = √n − δy and t = 2^(−32)·x·z − δt with 0 ≤ δy, δt < 1. Since x^2·n > 2^126 − 2^96.338 from Eq. (4), x·√n > 2^63 − 2^32.339, thus δx·√n = 2^63 − x·√n < 2^32.339 and 2·δx·√n < 2^33.339 ≤ 2^32·2^2.339·√α (using √α ≥ 1/2). We have:

    2^64·n − s^2 = 2^64·n − (y·2^32 + t)^2
                 = 2^64·(n − y^2) − 2^33·y·t − t^2
                 = 2^64·z − 2^33·y·t − t^2.    (5)

We first bound 2^64·n − s^2 from above, assuming it is non-negative:

    2^33·y·t = 2·x·y·z − 2^33·δt·y
             = (2^64/√n − 2·δx)·(√n − δy)·z − 2^33·δt·y,    (6)

thus, since δy, δt < 1:

    2^64·z − 2^33·y·t ≤ (2·δx·√n + 2^64/√n)·z + 2^33·y
                      ≤ 2^32·z·(2^2.339·√α + 1/√α) + 2^65·√α,

where we used y ≤ √n = 2^32·√α. Also:

    z = n − (√n − δy)^2 ≤ 2·√n·δy ≤ 2^33·α^(1/2).

Substituting this in the bound gives:

    2^64·n − s^2 ≤ 2^64·z − 2^33·y·t ≤ 2^65·f(α),

with f(α) = 2^2.339·α + 1 + α^(1/2). Let c ≥ 0 be such that 2^64·n = (s + c)^2. Then 2^64·n − s^2 = 2sc + c^2, which implies 2sc < 2^65·f(α), thus

    c < 2^64·f(α)/s.    (7)

Since s ≥ 2^63 and the maximum of f(α) for 1/4 ≤ α ≤ 1 is less than 7.06 (attained at α = 1), we get c < 14.12. Now this gives s > √(2^64·n) − 15 > √α·2^64/1.01, therefore c < 1.01·f(α)/√α. The function f(α)/√α is bounded by 7.06 for 1/4 ≤ α ≤ 1, the maximum still attained at α = 1. Therefore c < 1.01·7.06 < 7.14, which proves the upper bound 7.

Now assume 2^64·n − s^2 < 0. Since 2^64·z − 2^33·y·t ≥ 0 by Eq. (6), we have 2^64·n − s^2 ≥ −t^2 by Eq. (5). Since x ≤ 2^63/√n and z ≤ 2^33·√α:

    t ≤ 2^(−32)·x·z < 2^32.

(The last inequality is strict because we could have t = 2^32 only when x = 2^63/√n, which can only happen when n = 2^62, but in that case z = 0.) This proves that the product x·z at step 4

can be computed modulo 2^64. Now write 2^64·n = (s − c)^2 with c > 0. From 2^64·n − s^2 ≥ −t^2 > −2^64:

    (s − c)^2 − s^2 > −2^64,
    2sc < 2^64 + c^2.

This inequality has no solutions for 2^63 ≤ s − c < 2^64. Indeed, since we assumed 2^64·n − s^2 < 0, this implies s > 2^63, because for s = 2^63 we have s^2 ≤ 2^64·n. But then, if s = 2^63 + u with u ≥ 1, we would need 2^64·c + 2uc < 2^64 + c^2, which we can rewrite as 2u < 2^64/c + c − 2^64. For 1 ≤ c ≤ 2^64, the expression 2^64/c + c is bounded by 2^64 + 1, which yields 2u < 1, having no integer solutions u ≥ 1.

3) The mpfr_sqrt2 function: The mpfr_sqrt2 function first computes a 128-bit approximation of the square root using Algorithm 5, where in step 1 we can use Algorithm 3, with the variant of Remark 1, achieving the bound δ = 16. Algorithm SqrtApprox2 is an integer version of Karp-Markstein's algorithm for the square root [8], which incorporates n/2^128 in Newton's iteration for the reciprocal square root.

Algorithm 5 SqrtApprox2
Input: integer n = n1·2^64 + n0 with 2^62 ≤ n1 < 2^64 and 0 ≤ n0 < 2^64
Output: integer s approximating √(2^128·n)
1: compute an integer x approximating 2^96/√n1 with 0 ≤ 2^96/√n1 − x < δ := 16
2: y = ⌊2^32·√n1⌋ (using the approximation x)
3: z = n − y^2
4: t = ⌊x·z/2^65⌋
5: s = y·2^64 + t

Theorem 5: Algorithm SqrtApprox2 returns s satisfying

    s − 4 ≤ ⌊√(2^128·n)⌋ ≤ s + 26.

By construction, we have x ≤ 2^96/√n1, therefore y ≤ n1·x/2^64 ≤ n·x/2^128 < √n. As a consequence, we have z > 0 at step 3. The proof of Theorem 5 is very similar to that of Theorem 4, and can be found in the appendix.

The largest difference we found is 24, for example with n = 2^64·18355027010654859995 − 1, taking x = ⌊2^96/√n1⌋ − 15, whereas the actual code might use a larger value. The smallest difference we found is −1, for example with n = 2^64·5462010773357419421 − 1, taking x = ⌊2^96/√n1⌋.

III. FAITHFUL ROUNDING

In addition to the usual rounding modes, faithful rounding (MPFR_RNDF) will be supported in MPFR 4 by the most basic operations. We say that an operation is faithfully rounded if its result is correctly rounded toward −∞ or toward +∞. The actual rounding direction is indeterminate: it may depend on the argument, on the platform, etc. (except when the true result is exactly representable, because in this case this representable value is always returned, whatever the rounding mode). From this definition, the error on the operation is bounded by 1 ulp, like in the directed rounding modes. Moreover, the ternary value is unspecified for MPFR_RNDF.

The advantage of faithful rounding is that it can be computed more quickly than the usual roundings: computing a good approximation (say, with an error less than 1/2 ulp) and rounding it to the nearest is sufficient. So, the main goal of MPFR_RNDF is to speed up the rounding process in internal computations. Indeed, we often do not need correct rounding at this level, just a good approximation and an error bound. In particular, MPFR_RNDF allows us to completely avoid the Table Maker's Dilemma (TMD), either for the rounding or for the ternary value.

MPFR provides a function mpfr_can_round taking as input: an approximation to some real, unknown value; a corresponding error bound (possibly with some given direction); a target rounding mode; a target precision. The goal of this function is to tell the caller whether rounding this approximation to the given target precision with the target rounding mode necessarily provides the correct rounding of the unknown value. For correctness, it is important that if the function returns true, then correct rounding is guaranteed (no false positives). Rare false negatives could be acceptable, but for reproducibility, it was chosen to never return a false negative with the usual rounding modes. When faithful rounding was introduced, a new semantic had to be chosen for the MPFR_RNDF rounding mode argument: true means that rounding the approximation in an adequate rounding mode can guarantee a faithful rounding of the unknown value. Reproducibility was no longer important, as in general with faithful rounding. One reason to possibly accept false negatives here was to avoid the equivalent of the TMD for this function. However, a false negative would often mean a useless recomputation in a higher precision. Since it is better to spend a bit more time (at most linear) to avoid a false negative than to incur a very costly reiteration in a higher precision, it was chosen to exclude false negatives entirely, like with the other rounding modes.

IV. CHECKING CORRECTNESS

To check the correctness of the routines during this development work, we relied both on the correctness of the theorems, which have been independently checked by four anonymous reviewers, and on all examples included in the MPFR test suite. While checking corner cases during this work, we have also added numerous new examples to the MPFR test suite. Nevertheless, the only way to ensure no bug remains would be to perform a formal proof of the algorithms of this paper and of their implementation in MPFR, as proposed in [13].

V. EXPERIMENTAL RESULTS

We used a 3.2GHz Intel Core i5-6500 for our experiments (Skylake microarchitecture), with turbo-boost disabled, and revision 11452 of MPFR (aka MPFR 4.0-dev), and GCC 6.3.0 under Debian GNU/Linux testing (stretch). MPFR was compiled with GMP 6.1.2 [5], both configured with --disable-shared. The number of cycles was measured with the tools/mbench utility distributed with MPFR, which measures the average number of cycles — which might
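The no-false-positive requirement of mpfr_can_round can be illustrated on a single limb with a simplified model (not MPFR's implementation; the function name and interface are ours): if every value in the interval [a − err, a + err] lies in the same half-ulp cell of the target precision, rounding to nearest is already determined.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified one-limb model of a can-round test: the unknown value lies
   in [a - err, a + err] (same exponent assumed, err <= a, no wrap).
   Rounding to nearest at precision p = 64 - sh is determined as soon as
   the whole interval lies in one cell of size 2^(sh-1), i.e. no multiple
   of 2^(sh-1) — a rounding boundary — is crossed. */
static int can_round_nearest (uint64_t a, uint64_t err, int sh)
{
  uint64_t cell = ~((((uint64_t) 1) << (sh - 1)) - 1); /* cell mask */
  return ((a - err) & cell) == ((a + err) & cell);
}
```

This is only a sufficient condition, mirroring the "no false positives, no false negatives beyond the boundary test" discussion above in miniature.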

TABLE I
AVERAGE NUMBER OF CYCLES FOR BASIC OPERATIONS.

                 MPFR 3.1.5         MPFR 4.0-dev (this work)
    precision    53       113       53           113
    mpfr_add     52       53        25           29
    mpfr_sub     49       52        28           33
    mpfr_mul     49       63        23           33
    mpfr_sqr     74       79        21           29
    mpfr_div     134      146       56 (64)      77 (102)
    mpfr_sqrt    171      268       55 (56)      84 (133)

TABLE III
NUMBER OF CYCLES FOR MPFR_SUM WITH THE USUAL ROUNDING MODES AND MPFR_RNDF.

    size    cancel  in-prec  out-prec  range e   RND*      RNDF
    10^1    0       10^1     10^1      7         411       399
    10^3    0       10^1     10^5      10^8      27216     20366
    10^3    1       10^1     10^5      10^8      39639     32898
    10^3    2       10^1     10^5      10^8      44025     35276
    10^5    0       10^1     10^1      10^8      1656988   1034802
    10^5    0       10^1     10^3      10^8      1393447   833711

TABLE II
AVERAGE NUMBER OF CYCLES FOR BASIC OPERATIONS IN 53 BITS / 113 BITS, WITH ROUNDING TO NEAREST IN THE TRUNK (T-RNDN), ROUNDING TO NEAREST IN THE FAITHFUL BRANCH (F-RNDN), AND FAITHFUL ROUNDING IN THE FAITHFUL BRANCH (F-RNDF).

                 T-RNDN         F-RNDN         F-RNDF
    mpfr_add     24.9 / 28.7    24.9 / 29.0    24.1 / 28.3
    mpfr_sub     28.3 / 32.6    28.3 / 32.5    28.3 / 32.7
    mpfr_mul     23.0 / 32.5    23.0 / 33.6    20.8 / 30.8
    mpfr_sqr     20.6 / 28.7    22.0 / 29.8    18.8 / 26.9
    mpfr_div     56.2 / 77.1    56.0 / 77.5    54.1 / 75.7
    mpfr_sqrt    55.0 / 84.1    56.6 / 81.6    54.7 / 79.4

TABLE IV
NUMBER OF CYCLES FOR MATHEMATICAL FUNCTIONS.

                 MPFR 3.1.5         MPFR 4.0-dev (this work)
    precision    53       113       53       113
    mpfr_exp     4432     7502      2651     4769
    mpfr_sin     4111     5414      3024     4303
    mpfr_cos     3119     3895      2109     3143
    mpfr_log     4977     7882      2610     6746
    mpfr_atan    15373    22268     6151     10328
    mpfr_pow     14252    20324     6393     12589

not be an integer — of 100 successive calls with different random inputs. The mbench utility uses the rdtsc/rdtscp instructions to compute the number of cycles, taking into account the overhead of calling these instructions, and for each of the 100 calls keeps only the minimum time, to avoid counting hardware interrupt cycles. To get stable results, we called the mbench binary with numactl --physcpubind=0, which binds it to cpu 0, and we ensured no other processes were running on the machine.

Table I compares the average number of cycles for basic arithmetic operations — with rounding to nearest — between MPFR 3 and the upcoming version 4, with 53 and 113 bits of precision, corresponding to the binary64 and binary128 formats respectively. (The numbers between parentheses for mpfr_div and mpfr_sqrt are obtained when we always take the "slow" branch corresponding to the case where the first approximation is not enough to get correct rounding; they thus correspond to the worst case.) The speedup goes from a factor 1.6 (for the 113-bit subtraction) to a factor 3.5 (for the 53-bit squaring). This improvement of basic arithmetic automatically speeds up the computation of mathematical functions, as demonstrated in Table IV (still for rounding to nearest). Indeed, the computation of a mathematical function reduces to basic arithmetic for the argument reduction or reconstruction, for the computation of Taylor series or asymptotic series, etc. Here the speedup goes from 17% (for the 113-bit logarithm) to a factor 2.5 (for the 53-bit arc-tangent). Note that we have chosen 53 and 113 bits, which correspond to the standard binary64 and binary128 formats, but the optimizations are not specific to these precisions; this is particularly important as the internal working precision may vary, e.g. for the implementation of the functions listed in Table IV.

Timings for the mpfr_sum function in the faithful branch (branches/faithful, revision 11121) are done with a special test (misc/sum-timings), because the input type, an array of MPFR numbers, is specific to this function, and we need to test various kinds of inputs. The inputs are chosen randomly with fixed parameters, corresponding to the first 5 columns of Table III (the arguments of position 2 to 6 of sum-timings): the size of the array (number of input MPFR numbers), the number of cancellation terms in the test (0 means no probable cancellation, 1 means some cancellation, 2 even more cancellation), the precision of the input numbers (here, all of them have the same precision), the precision of the result, and an exponent range e (1 means that the input numbers have the same order of magnitude, and a large value means that the magnitudes are very different: the exponents go from 1 to 2e − 1). A test consists in several timings on the same data, and we consider the average, the minimum and the maximum; if the difference between the maximum and the minimum is too large, the result is ignored. Due to differences in timings between invocations, 4 tests have been run.

On some parameters, MPFR_RNDF is faster than the other rounding modes (its maximum timing is less than the minimum timing for the usual rounding modes): for a small array size, small output precision and no cancellation (first line of Table III), there is a small gain; when the input precision is small and the range is large (so that the input values do not overlap in general, meaning that the TMD occurs in the usual rounding modes), we can typically notice a 25% to 40% gain when there is no cancellation (lines 2, 5 and 6), and a bit less when a cancellation occurs (lines 3 and 4). MPFR_RNDF is never slower. Note that the kind of inputs has a great influence on a possible gain, which could be much larger than the one observed here (these tests have not been specifically designed to trigger the TMD).

VI. CONCLUSION

This article presents algorithms for fast floating-point arithmetic with correct rounding on small precision numbers. All the algorithms presented in Section II are new, and to the best of our knowledge the faithful rounding presented in Section III is the first implementation in arbitrary-precision software. The implementation of those algorithms in version 4 of MPFR gives a speedup of up to 3.5 with respect to MPFR 3. As a consequence, the computation of mathematical functions is greatly improved (Table IV).

ACKNOWLEDGEMENTS. The authors thank Patrick Pélissier, who designed the mbench utility, which was very useful in comparing different algorithms and measuring their relative efficiency; Niels Möller and Keith Briggs for their feedback on this work; and the four anonymous reviewers. This work has been supported in part by FastRelax ANR-14-CE25-0018-01.

REFERENCES

[1] BRENT, R. P., AND ZIMMERMANN, P. Modern Computer Arithmetic. No. 18 in Cambridge Monographs on Applied and Computational Mathematics. Cambridge University Press, 2010. Electronic version freely available at https://members.loria.fr/PZimmermann/mca/pub226.html.
[2] DE DINECHIN, F., ERSHOV, A. V., AND GAST, N. Towards the post-ultimate libm. In Proceedings of the 17th IEEE Symposium on Computer Arithmetic (Washington, DC, USA, 2005), ARITH'17, IEEE Computer Society, pp. 288–295.
[3] FOUSSE, L., HANROT, G., LEFÈVRE, V., PÉLISSIER, P., AND ZIMMERMANN, P. MPFR: A multiple-precision binary floating-point library with correct rounding. ACM Trans. Math. Softw. 33, 2 (2007), article 13.
[4] GRAILLAT, S., AND MÉNISSIER-MORAIN, V. Accurate summation, dot product and polynomial evaluation in complex floating point arithmetic. Information and Computation 216 (2012), 57–71. Special Issue: 8th Conference on Real Numbers and Computers.
[5] GRANLUND, T., AND THE GMP DEVELOPMENT TEAM. GNU MP: The GNU Multiple Precision Arithmetic Library, 6.1.2 ed., 2016. https://gmplib.org/.
[6] IEEE standard for floating-point arithmetic, 2008. Revision of ANSI-IEEE Standard 754-1985, approved June 12, 2008: IEEE Standards Board.
[7] JOHANSSON, F. Efficient implementation of elementary functions in the medium-precision range. In Proceedings of the 22nd IEEE Symposium on Computer Arithmetic (Washington, DC, USA, 2015), ARITH'22, IEEE Computer Society, pp. 83–89.
[8] KARP, A. H., AND MARKSTEIN, P. High-precision division and square root. ACM Trans. Math. Softw. 23, 4 (Dec. 1997), 561–589.
[9] MASCARENHAS, W. Moore: Interval arithmetic in modern C++. https://arxiv.org/pdf/1611.09567.pdf, 2016. 8 pages.
[10] MÖLLER, N., AND GRANLUND, T. Improved division by invariant integers. IEEE Trans. Comput. 60, 2 (2011), 165–175.
[11] REVOL, N., AND ROUILLIER, F. Motivations for an arbitrary precision interval arithmetic and the MPFI library. In Reliable Computing (2002), pp. 23–25.
[12] ZIMMERMANN, P. Karatsuba square root. Research Report 3805, INRIA, 1999. https://hal.inria.fr/inria-00072854.
[13] ZIMMERMANN, P. Beyond double precision. https://members.loria.fr/PZimmermann/papers/bedop.pdf, 2014. ERC Advanced Grant proposal.

APPENDIX

Proof of Theorem 5. Write n_1 = alpha·2^64 with 1/4 <= alpha < 1. Write x = 2^96/sqrt(n_1) - delta_x, y = 2^32·sqrt(n_1) - delta_y and t = x·z/2^65 - delta_t, with 0 <= delta_x < delta and 0 <= delta_y, delta_t < 1. This implies

  2^128·n - s^2 = 2^128·n - (y·2^64 + t)^2
                = 2^128·(n - y^2) - 2^65·y·t - t^2
                = 2^128·z - 2^65·y·t - t^2.                              (8)

We first bound 2^128·n - s^2 by above, assuming 2^128·n - s^2 >= 0:

  2^65·y·t = x·y·z - delta_t·2^65·y
           = (2^96/sqrt(n_1) - delta_x)·(2^32·sqrt(n_1) - delta_y)·z - delta_t·2^65·y,   (9)

thus

  2^128·z - 2^65·y·t <= (2^32·sqrt(n_1)·delta_x + (2^96/sqrt(n_1))·delta_y)·z + 2^65·y
                      = 2^64·z·(sqrt(alpha)·delta_x + delta_y/sqrt(alpha)) + 2^65·y
                     <= 2^64·z·(delta·sqrt(alpha) + 1/sqrt(alpha)) + 2^129·sqrt(alpha),   (10)

where we used y <= 2^32·sqrt(n_1) = sqrt(alpha)·2^64. Now

  z = 2^64·n_1 + n_0 - (2^32·sqrt(n_1) - delta_y)^2
   <= n_0 + 2^33·sqrt(n_1)·delta_y
    < 2^64·(1 + 2·sqrt(alpha)·delta_y)
   <= 2^64·(1 + 2·sqrt(alpha)).                                          (11)

Substituting Eq. (11) in Eq. (10) yields

  2^128·z - 2^65·y·t <= 2^129·f(alpha),

with

  f(alpha) = (1/2)·(1 + 2·alpha^(1/2))·(delta·alpha^(1/2) + 1/alpha^(1/2)) + alpha^(1/2).

Substituting in Eq. (8) gives

  2^128·n - s^2 < 2^129·f(alpha).

Let c be the real number such that 2^128·n = (s + c)^2. Then 2^128·n - s^2 = (s + c)^2 - s^2 = 2·s·c + c^2, which implies 2·s·c < 2^129·f(alpha), thus

  c < 2^128·f(alpha)/s.                                                  (12)

Since s >= 2^127 and the maximum of f(alpha) for 1/4 <= alpha <= 1 is 26.5 (attained at alpha = 1), we get c < 53. Now this gives s > sqrt(2^128·n) - 53 > sqrt(2^128·n)/1.01 >= sqrt(alpha)·2^128/1.01, therefore Eq. (12) becomes c < 1.01·f(alpha)/sqrt(alpha). Now the function f(alpha)/sqrt(alpha) is bounded by 26.5 for 1/4 <= alpha <= 1, the maximum still being attained at alpha = 1. Therefore c < 1.01·26.5 < 26.8, which proves the upper bound 26.

Now assume 2^128·n - s^2 < 0. It follows from Eq. (9) that 2^128·z - 2^65·y·t >= 0; Eq. (8) therefore yields 2^128·n - s^2 >= -t^2, and we have to bound t^2. Using x <= 2^96/sqrt(n_1), Eq. (11) and n_1 = alpha·2^64:

  t^2 <= x^2·z^2/2^130 <= 2^62·z^2/n_1 <= 2^128·(1 + 2·sqrt(alpha))^2/(4·alpha).

Writing 2^128·n = (s - c)^2 with c > 0 yields:

  (s - c)^2 - s^2 > -2^128·(1 + 2·sqrt(alpha))^2/(4·alpha),
  2·s·c - c^2 < 2^128·(1 + 2·sqrt(alpha))^2/(4·alpha).

Now s^2 > 2^128·n implies s > 2^64·sqrt(n) >= sqrt(alpha)·2^128, thus:

  c < (1 + 2·sqrt(alpha))^2/(8·alpha^(3/2)) + c^2/(2^129·sqrt(alpha)).   (13)

This implies for all 1/4 <= alpha < 1:

  c < c^2/2^128 + 4,

which proves c < 4.01, thus the lower bound is s - 4.
