2017 IEEE 24th Symposium on Computer Arithmetic

ULPs and Relative Error

Marius Cornea, Intel Corporation, Hillsboro, OR, USA, [email protected]

The paper establishes several simple, but useful relationships between ulp (unit in the last place) errors and the corresponding relative errors. These can be used when converting between the two types of errors, ensuring that the least amount of information is lost in the process. The properties presented here were already useful in IEEE conformance proofs for iterative division and square root algorithms, and should be so again to numerical analysts both in 'hand' proofs of error bounds for floating-point computations, and in automated tools which carry out, or support deriving, such proofs. In most cases, the properties shown herein establish tighter bounds than found in the literature. They also provide 'conversion' rules of finer granularity, at floating-point value level instead of binade level, and take into account the special conditions which occur at binade ends. For this reason, the paper includes a small, but non-negligible element of novelty.

Keywords— floating-point; ulp (unit in the last place); quantum; approximation; error; relative error; ulp error; correctly rounded; IEEE 754-2008; numerical analysis

I. INTRODUCTION

"It is important to be able to establish links between errors expressed in ulps, and relative errors." This is the opening statement of the section on "Errors in ulps and relative errors" in [1] (p. 37; 'ulp' stands for 'unit in the last place', which is defined below, along with ulp errors and relative errors). The same topic was covered in [2]. Related topics are discussed in particular in [3], [4], [5], and [6], and also in [7] through [12].

Indeed, knowing this relationship allows for easier proofs in numerical analysis, including cases where we need to prove that certain functions or floating-point operators yield correctly rounded results. In these cases, accuracy is expressed ultimately in ulps, but calculating error bounds as relative errors is often much more convenient. Combining the two methods might represent the best solution for such proofs, especially when we need to analyze sequences of multiplications, divisions, and in some cases also fused multiply-add operations. A good example can be found in [3]. The present paper aims to make this process easier.

The IEEE Standard 754-2008 for Floating-Point Arithmetic [4] defines both the format of binary floating-point numbers and the value of an ulp (named quantum in the standard). We will use the same definition of floating-point numbers in this paper; however, we will consider the integer exponent as being unbounded. We should note that subnormal (also called denormalized) numbers are not considered here. The common notations are used for the sets of real, rational, and integer numbers respectively, as well as for the same sets without {0}: R, Q, Z, R*, Q*, Z*.

A floating-point number is a rational number f which can be represented as f = σ · s · 2^e, where σ = ±1 is the sign, s ∈ Q, 1 ≤ s < 2, is the significand representable on N bits (where N ∈ Z, N > 0), and e ∈ Z is the exponent. More precisely, s = 1 + k/2^(N–1), where k = 0, 1, 2, ..., 2^(N–1) – 1. One unit in the last place is then 1 ulp = 2^(e–N+1), which is also the 'weight' of the least significant bit of the floating-point number. Note that we do not specify here anything about the encoding of a floating-point number.

As an example, Fig. 1 illustrates the representation on the real axis for floating-point numbers having N = 4 bits in the significand.

Fig. 1. Floating-point numbers representable with N = 4 bits in the significand

Quite often it is useful to work with floating-point values having different numbers of bits in the significand, e.g. N and N+1 bits. To illustrate this case, Fig. 2 shows the positions on the real axis for floating-point numbers representable with N = 4, and respectively N = 5, bits in the significand.

Fig. 2. Floating-point numbers representable with significands of N = 4 and N = 5 bits
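To make the format in Fig. 1 and Fig. 2 concrete, the short Python sketch below (an illustration added here, not part of the original paper) enumerates the positive values with an N = 4-bit significand in two consecutive binades and shows that the spacing, 1 ulp = 2^(e–N+1), doubles when crossing from one binade into the next.

from fractions import Fraction

def binade_values(e, N=4):
    """All positive values with an N-bit significand in the binade [2^e, 2^(e+1))."""
    one_ulp = Fraction(2) ** (e - N + 1)
    return [Fraction(2) ** e + k * one_ulp for k in range(2 ** (N - 1))]

for e in (0, 1):
    vals = binade_values(e)
    print(f"e = {e}: 1 ulp = {float(vals[1] - vals[0])}, values = {[float(v) for v in vals]}")
# e = 0: 1 ulp = 0.125, values 1.0, 1.125, ..., 1.875
# e = 1: 1 ulp = 0.25,  values 2.0, 2.25,  ..., 3.75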

The ulp by itself does not imply evaluation of any error, but approximation, round-off, and other similar errors can be expressed in terms of ulps. This happens e.g. when considering an exact result or value x (most often not representable as a floating-point number with a given precision N) and an approximation f of it (often a floating-point number with a given precision N). The relative error is defined as the absolute error (f – x) divided by the exact value, i.e. (f – x)/x. Quite often, only the magnitudes are used in the calculation of the absolute or relative error; however, some information is lost that way, so we prefer preserving the sign as well. Errors expressed in ulps are thus related and similar to relative errors, the difference being that the relative error is the absolute error divided by the exact value, while the ulp error is the absolute error divided by the weight of the least significant bit of the floating-point approximation.

Although the IEEE Standard 754-2008 defines the meaning of a unit in the last place (or quantum), it does not fully explain how to work with ulps in all situations, e.g. when values of concern cross the boundary between two consecutive binades. This is clarified in section 2.1.4 of [5] (p. 14; also in [1], p. 32), where the definition of an ulp is extended also to real values. We use the same definition in the following properties.

The central part of this paper contains a simple but useful property, Theorem 1, and its proof. It establishes the relationship between approximation errors measured in ulps and the same errors expressed as relative errors. It is limited to ulp errors of up to 1 ulp, which are the most important in practice in the kinds of proofs this property is intended to support, but it could easily be extended to cover the (less interesting) cases of errors exceeding 1 ulp. This limitation is also alleviated by the fact that an approximation error of 1 ulp when floating-point numbers have N bits in the significand is an error of 2 ulps if we allow N+1 bits in the significand, of 4 ulps for N+2 bits, etc.

After introducing a few definitions in section II and basic properties in section III, we will introduce the main property in section IV, followed by seven corollaries in section V. A dependency graph relating the properties presented in the paper is shown in Fig. 3.

Fig. 3. Dependency graph for the properties presented in the paper

Assuming an ulp error of at most m ulps (0 < m ≤ 1), Theorem 1 establishes equivalent inequalities for the relative error. It thus provides necessary and sufficient conditions allowing for conversion between errors in ulps and relative errors, without loss of information. Corollary 1 represents the same property in the interesting case of m = 1 ulp. Both Theorem 1 and Corollary 1 are applicable to floating-point numbers whose significands are known. In order to determine similar properties for floating-point numbers with any significand, i.e. at binade level, Corollaries 2 and 3 are derived from Theorem 1. They establish conversion rules from errors in ulps to relative errors and back, and answer the following questions:

• If the ulp error is less than m ulps, how large is the relative error?
• How large can the relative error be, in order to ensure an ulp error of less than m ulps?

Known literature is closest to these corollaries in [6] (p. 8); however, the conditions presented in this paper are slightly tighter, as explained in section VI, Conclusion. Finally, Corollaries 4 and 5 are instantiations of Corollaries 2 and 3 respectively for m = 1 ulp, and Corollaries 6 and 7 for m = 1/2 ulp.

As some of these properties establish tighter bounds than found in the literature (e.g. in [1], [2], [3], [5], or [6]), it is hoped that the relationships established here between ulp errors and the corresponding relative errors will prove useful to numerical analysts both in 'hand' proofs of error bounds for floating-point computations, and in automated tools which carry out or support deriving such proofs.

II. DEFINITIONS

A few definitions for terms used frequently throughout the paper are included here.

D1. Approximation of a real number: floating-point number f that approximates the real number x with a specified relative error, or is within a specified number of ulps of x (or from x).

D2. Ulp, or unit in the last place, associated with a floating-point number f: the weight of the least significant bit in the significand of a floating-point number; if the significand of f has N bits and the exponent is e, then one ulp has magnitude 2^(e–N+1); also written as ulp(f).

D3. Ulp associated with a real number x: this has meaning only in the context of operating with floating-point numbers having N-bit significands, with N specified: if 2^e ≤ x < 2^(e+1), then one ulp has magnitude 2^(e–N+1); also written as ulp(x).

D4. Ulp associated with a pair of real numbers f and x: this has meaning herein only in the context of operating with floating-point numbers having N-bit significands, with N specified; it is the smaller of ulp(f) and ulp(x), i.e. min(ulp(f), ulp(x)); also written as ulp(f, x).

To understand better the importance of defining ulp(f, x) this way, consider e.g. the case where a real number x is very close to the boundary 2^e between two consecutive binades and just slightly larger than 2^e. Let f1 be the first floating-point number to the left of 2^e, and f2 the first floating-point number to the right of it, and also to the right of x (e.g. in Fig. 1 one could pick 2^e = 2.0; ulps immediately to the left of 2.0 are half the size of the ulps immediately to its right). Assume that 2^e < x < 2^e + 1/4 · ulp(x, f2). Assume also that we need to decide whether to approximate x by f1 or by f2. Although on the real axis x is closer to f1 than to f2, the distance between x and f1 measured in ulps is greater than 1 ulp(x, f1), while the distance between x and f2 measured in ulps is less than 1 ulp(x, f2). Using distances in ulps defined this way, we choose to approximate x by f2 and not by f1, which makes sense for floating-point numbers. We chose 2^e + 1/4 · ulp(x, f2) as the upper bound for x because it is the midpoint between f1 and f2.

Note that from here on, whenever we say "f ≈ x within m ulps of x" or "f approximates x within m ulps of x", where f, x ∈ R, by ulp we mean ulp(f, x).

D5. y is within 1 ulp of x: the real number y is "close" to the real number x such that |x – y| < 1 ulp(x, y); this has meaning only in the context of operating with floating-point numbers having N bits in the significand, for given N; note that if the relation is non-strict, the inequality above becomes |x – y| ≤ 1 ulp(x, y).

D6. f approximates x, within one ulp of x: this is also written as f ≈ x, within 1 ulp of x; the floating-point number f is "close" to the real number x, such that |x – f| < 1 ulp; note that if f and x have associated with them ulps of different magnitudes (whose ratio could only be 2:1 in this case), then the smaller of the two is used in the definition above, i.e. ulp(f, x) (this allows for f = 1.0 · 2^e to approximate real numbers both to its right and to its left); it is assumed that f and x have the same sign, and that they are both non-zero numbers; note also that if the relation is non-strict, the inequality above becomes |x – f| ≤ 1 ulp.

D7. Precision of a floating-point number: the number N of bits used in the significand; the exponent is assumed to be unbounded.
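The following Python sketch (added for illustration; the helper names are this sketch's, not the paper's) implements ulp(f), ulp(x), and ulp(f, x) from D2–D4 with exact rational arithmetic, and reproduces the binade-boundary example used above to motivate ulp(f, x).

from fractions import Fraction

def floor_log2(v):
    """Exact floor(log2(v)) for a positive rational v."""
    v = Fraction(v)
    e = v.numerator.bit_length() - v.denominator.bit_length()
    return e - 1 if Fraction(2) ** e > v else e

def ulp(v, N):
    """Ulp of a non-zero real (rational) value v for N-bit significands (D2/D3)."""
    return Fraction(2) ** (floor_log2(abs(Fraction(v))) - N + 1)

def ulp_pair(f, x, N):
    """Ulp associated with the pair (f, x): the smaller of ulp(f) and ulp(x) (D4)."""
    return min(ulp(f, N), ulp(x, N))

# Binade-boundary example from the text, with N = 4 and the boundary at 2^e = 2.0:
N = 4
f1 = Fraction(15, 8)                      # last representable value below 2.0 (1.875)
f2 = Fraction(9, 4)                       # first representable value above 2.0 (2.25)
x = Fraction(65, 32)                      # 2.03125, between 2.0 and 2.0 + 1/4 ulp(x, f2)
print(abs(x - f1) / ulp_pair(f1, x, N))   # 5/4: more than 1 ulp(f1, x) away from f1
print(abs(x - f2) / ulp_pair(f2, x, N))   # 7/8: within 1 ulp(f2, x) of f2

Even though x is closer to f1 on the real axis, measured in ulps it is closer to f2, matching the discussion above.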

III. BASIC PROPERTIES

Two basic properties are included in this section. The first one in particular will be useful in proving the main theorem of the paper, presented in the next section.

Note that in order to make reading of the properties in this and all subsequent sections easier, whenever we refer to f and x from now on, unless specified otherwise, we will assume that x ∈ R* is a real non-zero number, and f ∈ Q* is a floating-point non-zero number with N bits in the significand, f = σ · s · 2^e, σ = ±1, s = 1 + k/2^(N–1), k ∈ Z, 0 ≤ k ≤ 2^(N–1) – 1, e ∈ Z (unbounded).

The first property enumerates the three cases which are possible when a floating-point number f approximates a real number x which is within 1 ulp of f.

Property 1. Let f and x be as defined above.

f ≈ x within 1 ulp of x ⇔

(f = σ · s_f · 2^e, σ = ±1, s_f = 1 + k/2^(N–1), k ∈ Z, 1 ≤ k ≤ 2^(N–1) – 1, e ∈ Z, and x = σ · s_x · 2^e, 1 < s_x < 2, and |x – f| < 2^(e–N+1)), or

(f = σ · 1.0 · 2^e, σ = ±1, e ∈ Z, and x = σ · s_x · 2^e, 1 ≤ s_x < 1 + 2^(–N+1) (thus |x – f| < 2^(e–N+1))), or

(f = σ · 1.0 · 2^e, σ = ±1, e ∈ Z, and x = σ · s_x · 2^(e–1), 2 – 2^(–N+1) < s_x < 2 (thus |x – f| < 2^(e–N)))

Proof: If f is within 1 ulp of x, the three cases listed above cover all the possibilities regarding the relationship between f and x. We are only justifying here that these three cases are chosen correctly. Note that s_x is the absolute value of the real number x, scaled by the appropriate power of two. Also, in the context of talking about floating-point numbers having N bits in the significand, 1 ulp = 2^(e–N+1) for numbers in the range [2^e, 2^(e+1)), and 1 ulp = 2^(e–N) for numbers in the range [2^(e–1), 2^e) (these being the only two situations that have to be considered here). The direct implication is therefore true. The reciprocal also holds, as the three situations listed in the last part of Property 1 cover all cases in which f is within 1 ulp of x, for fixed exponent e of f. As there was no restriction imposed on e, Property 1 holds for any numbers f and x satisfying the specified conditions.

The second property reflects the fact that ulps scale with the powers of 2.

Property 2. Let f and x be as defined above, and E ∈ Z.

x ≈ f, within 1 ulp of f ⇔ x · 2^E ≈ f · 2^E, within 1 ulp of f · 2^E

Proof: This is straightforward, as 1 ulp(2^E · x, 2^E · f) = 2^E · 1 ulp(x, f).
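As a quick sanity check of Property 1 (an illustration added here, not part of the paper's proof), the sketch below compares the predicate |x – f| < 1 ulp(f, x) against a direct transcription of the three cases, for σ = +1, N = 4, and x sampled on a fine grid around each f; the function names are this sketch's own.

from fractions import Fraction

def floor_log2(v):
    """Exact floor(log2(v)) for a positive rational v."""
    e = v.numerator.bit_length() - v.denominator.bit_length()
    return e - 1 if Fraction(2) ** e > v else e

def within_one_ulp(f, x, N):
    """|x - f| < 1 ulp(f, x), where ulp(f, x) = min(ulp(f), ulp(x))."""
    return abs(x - f) < Fraction(2) ** (min(floor_log2(f), floor_log2(x)) - N + 1)

def one_of_the_three_cases(f, x, N, k, e):
    """Direct transcription of the right-hand side of Property 1, for sigma = +1."""
    ulp_f = Fraction(2) ** (e - N + 1)
    s_x = x / Fraction(2) ** floor_log2(x)
    case1 = k != 0 and floor_log2(x) == e and 1 < s_x < 2 and abs(x - f) < ulp_f
    case2 = k == 0 and floor_log2(x) == e and 1 <= s_x < 1 + Fraction(2) ** (1 - N)
    case3 = k == 0 and floor_log2(x) == e - 1 and 2 - Fraction(2) ** (1 - N) < s_x < 2
    return case1 or case2 or case3

N, e = 4, 0
for k in range(2 ** (N - 1)):
    f = (1 + Fraction(k, 2 ** (N - 1))) * Fraction(2) ** e
    for j in range(-130, 131):            # grid reaching about 2 ulp on either side of f
        x = f + Fraction(j, 64) * Fraction(2) ** (e - N + 1)
        assert within_one_ulp(f, x, N) == one_of_the_three_cases(f, x, N, k, e)
print("Property 1 equivalence confirmed on the sampled grid")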

IV. GENERAL RULES FOR CONVERSION BETWEEN ULPS AND RELATIVE ERROR

Theorem 1 (ulps versus relative error)

Let f and x be as defined above, and m ∈ R, 0 < m ≤ 1. Let ε be the relative error when approximating x by f, ε = (f – x)/x.

(a) For k ≠ 0, when f = σ · (1 + k/2^(N–1)) · 2^e, 1 ≤ k ≤ 2^(N–1) – 1:

f ≈ x within m ulps of x ⇔ f = x · (1 + ε), ε ∈ (–m/(2^(N–1) + k + m), m/(2^(N–1) + k – m))

(b) For k = 0, when f = σ · 2^e:

f ≈ x within m ulps of x ⇔ f = x · (1 + ε), ε ∈ (–m/(2^(N–1) + m), m/(2^N – m))

Proof:

Note that 1 ≤ s ≤ (2^N – 1)/2^(N–1), and that for f, 1 ulp = 2^(e–N+1). Note also that if f ≈ x, within 1 ulp of x or less, then f and x have the same sign.

Let the relative error function be ε = φ(x) = (f – x)/x = f/x – 1. Hence φ′(x) = –f/x², which means that φ(x) decreases with x when f > 0, and it increases with x when f < 0 (as seen also from Fig. 4 and Fig. 5).

Fig. 4. Relative error when approximating positive real numbers by floating-point numbers within 1 ulp, where the floating-point numbers have N = 4 bits in the significand (exponent not shown)

Fig. 5. Relative error when approximating negative real numbers by floating-point numbers within 1 ulp, where the floating-point numbers have N = 4 bits in the significand (exponent not shown)

Note that in Fig. 4, the outer envelope corresponds to the maximum relative error when approximating real numbers by other real numbers within 1 ulp (1 ulp defined in the context of operating with floating-point numbers with N bits in the significand).

We will consider two cases below: k ≠ 0 and k = 0, as in the latter case f is the boundary between two consecutive binades and the size of 1 ulp to the left of f is different from the size of an ulp to its right (the ulp on the right is twice as large).

(a) Prove first the direct implication in this theorem, for k ≠ 0.

Applying Property 1, the condition f = σ · (1 + k/2^(N–1)) · 2^e ≈ x within m ulps of x is equivalent to x ∈ (f – m · 2^(e–N+1), f + m · 2^(e–N+1)).

Consider ε = φ(x) = (f – x)/x, x ∈ (f – m · 2^(e–N+1), f + m · 2^(e–N+1)) = I. The infimum and the supremum of φ(x) when x ∈ I (regardless of the sign of f) are respectively:

ε_m = inf_{x ∈ I} φ(x) = lim_{x → f + σ·m·2^(e–N+1)} φ(x)
    = (f – f – σ·m·2^(e–N+1)) / (f + σ·m·2^(e–N+1))
    = (–σ·m·2^(e–N+1)) / (σ·(1 + k·2^(–N+1))·2^e + σ·m·2^(e–N+1))
    = (–m·2^(e–N+1)) / ((2^(N–1) + k)·2^(e–N+1) + m·2^(e–N+1))
    = –m/(2^(N–1) + k + m)

ε_M = sup_{x ∈ I} φ(x) = lim_{x → f – σ·m·2^(e–N+1)} φ(x)
    = (f – f + σ·m·2^(e–N+1)) / (f – σ·m·2^(e–N+1))
    = (σ·m·2^(e–N+1)) / (σ·(1 + k·2^(–N+1))·2^e – σ·m·2^(e–N+1))
    = (m·2^(e–N+1)) / ((2^(N–1) + k)·2^(e–N+1) – m·2^(e–N+1))
    = m/(2^(N–1) + k – m)

which proves the direct implication for k ≠ 0.

The reciprocal is also true: for k ≠ 0, φ(x) = (f – x)/x, φ: (f – m·2^(e–N+1), f + m·2^(e–N+1)) → (ε_m, ε_M), is continuous and monotonic, and therefore bijective. Its inverse, x = φ⁻¹(ε), is also continuous and monotonic, and is defined on the finite interval (ε_m, ε_M). It means that when ε covers all of its domain, x covers all of I, which proves the reciprocal part of the statement in Theorem 1, for k ≠ 0.

(b) Prove next the direct implication in the theorem, for k = 0.

Applying Property 1 again, the condition f = σ · 1.0 · 2^e ≈ x within m ulps of x is equivalent to x ∈ (2^e – m·2^(e–N), 2^e + m·2^(e–N+1)) if σ = 1, and equivalent to x ∈ (–2^e – m·2^(e–N+1), –2^e + m·2^(e–N)) if σ = –1.

First, for σ = 1 and f = 2^e, consider ε = φ(x) = (f – x)/x, x ∈ (2^e – m·2^(e–N), 2^e + m·2^(e–N+1)) = I. The infimum and the supremum of φ(x) when x ∈ I are respectively:

ε_m = inf_{x ∈ I} φ(x) = lim_{x → 2^e + m·2^(e–N+1)} φ(x) = (2^e – 2^e – m·2^(e–N+1)) / (2^e + m·2^(e–N+1)) = –m/(2^(N–1) + m)

ε_M = sup_{x ∈ I} φ(x) = lim_{x → 2^e – m·2^(e–N)} φ(x) = (2^e – 2^e + m·2^(e–N)) / (2^e – m·2^(e–N)) = m/(2^N – m)

This proves the direct implication for k = 0 and σ = 1.

The reciprocal is also true: for k = 0 and σ = 1, we have f = 2^e. Then φ(x) = (f – x)/x, φ: (2^e – m·2^(e–N), 2^e + m·2^(e–N+1)) → (ε_m, ε_M), is continuous and monotonic, and therefore bijective. Its inverse, x = φ⁻¹(ε), is also continuous and monotonic, and it is defined on the finite interval (ε_m, ε_M). It means that when ε covers all of its domain, x covers all of I, which proves the reciprocal part in Theorem 1 when k = 0 and σ = 1.

Second, for σ = –1 and f = –2^e, consider ε = φ(x) = (f – x)/x, x ∈ (–2^e – m·2^(e–N+1), –2^e + m·2^(e–N)) = I. The infimum and the supremum of φ(x) when x ∈ I are respectively:

ε_m = inf_{x ∈ I} φ(x) = lim_{x → –2^e – m·2^(e–N+1)} φ(x) = (–2^e + 2^e + m·2^(e–N+1)) / (–2^e – m·2^(e–N+1)) = –m/(2^(N–1) + m)

ε_M = sup_{x ∈ I} φ(x) = lim_{x → –2^e + m·2^(e–N)} φ(x) = (–2^e + 2^e – m·2^(e–N)) / (–2^e + m·2^(e–N)) = m/(2^N – m)

which proves the direct implication for k = 0 and σ = –1.

The reciprocal is also true: for k = 0 and σ = –1, we have f = –2^e. Then φ(x) = (f – x)/x, φ: (–2^e – m·2^(e–N+1), –2^e + m·2^(e–N)) → (ε_m, ε_M), is continuous and monotonic, and therefore bijective. Its inverse, x = φ⁻¹(ε), is also continuous and monotonic, and it is defined on the finite interval (ε_m, ε_M). It means that when ε covers all of its domain, x covers all of I, which proves the reciprocal part in Theorem 1 when k = 0 and σ = –1.
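As an illustration (not part of the paper's proof), the Python sketch below samples x inside the admissible interval around f for N = 4 and m = 3/4, with σ = +1, and confirms that the relative error (f – x)/x stays inside the open intervals stated in Theorem 1; the negative case is symmetric.

from fractions import Fraction

N, e = 4, 0
m = Fraction(3, 4)
one_ulp = Fraction(2) ** (e - N + 1)          # ulp of f inside the binade [2^e, 2^(e+1))

for k in range(2 ** (N - 1)):                 # k = 0, 1, ..., 2^(N-1) - 1
    f = (1 + Fraction(k, 2 ** (N - 1))) * Fraction(2) ** e
    if k != 0:                                # case (a)
        lo, hi = -m / (2 ** (N - 1) + k + m), m / (2 ** (N - 1) + k - m)
        x_lo, x_hi = f - m * one_ulp, f + m * one_ulp
    else:                                     # case (b): f = 2^e, ulps are halved below f
        lo, hi = -m / (2 ** (N - 1) + m), m / (2 ** N - m)
        x_lo, x_hi = f - m * one_ulp / 2, f + m * one_ulp
    for i in range(1, 1000):                  # strictly inside the open interval (x_lo, x_hi)
        x = x_lo + (x_hi - x_lo) * Fraction(i, 1000)
        eps = (f - x) / x
        assert lo < eps < hi, (k, float(x), float(eps))
print("Theorem 1 bounds hold at all sampled points")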

V. USEFUL COROLLARIES

Corollary 1

Let f and x be as defined above. Then:

(a) For k ≠ 0, when f = σ · (1 + k/2^(N–1)) · 2^e, 1 ≤ k ≤ 2^(N–1) – 1:

f ≈ x within 1 ulp of x ⇔ f = x · (1 + ε), ε ∈ (–1/(2^(N–1) + k + 1), 1/(2^(N–1) + k – 1))

(b) For k = 0, when f = σ · 2^e:

f ≈ x within 1 ulp of x ⇔ f = x · (1 + ε), ε ∈ (–1/(2^(N–1) + 1), 1/(2^N – 1))

Proof: To prove this property, just substitute m = 1 in Theorem 1.

Note: applying again Property 1 for k ≠ 0, the first condition of Corollary 1, f = σ · (1 + k/2^(N–1)) · 2^e ≈ x within 1 ulp of x, is equivalent to x ∈ (f – 2^(e–N+1), f + 2^(e–N+1)). Also from Property 1, for k = 0 the first condition in Corollary 1, f = σ · 1.0 · 2^e ≈ x within 1 ulp of x, is equivalent to x ∈ (2^e – 2^(e–N), 2^e + 2^(e–N+1)) if σ = 1, and to x ∈ (–2^e – 2^(e–N+1), –2^e + 2^(e–N)) if σ = –1.

Corollary 2

Let f and x be as defined above, and m ∈ R, 0 < m ≤ 1.

If f ≈ x, within m ulps of x, and m ∈ (0, 1/2), then f = x · (1 + ε), with |ε| < m/(2^(N–1) + m).

If f ≈ x, within m ulps of x, and m ∈ [1/2, 1], then f = x · (1 + ε), with |ε| < m/(2^(N–1) + 1 – m).

Proof:

From Theorem 1, if k ≠ 0, when f ≈ x within m ulps of x, the largest extremum of |ε| on the 2^(N–1) – 1 intervals determined by k is obtained for k = 1:

|ε| < m/(2^(N–1) + 1 – m) = ε1

This inequality holds if f is within m ulps of x for k ≠ 0. For k = 0, the largest |ε| is m/(2^(N–1) + m):

|ε| < m/(2^(N–1) + m) = ε2

This inequality holds if f is within m ulps of x for k = 0. Compare ε1 and ε2:

ε1 ≥ ε2 ⇔ m/(2^(N–1) + 1 – m) ≥ m/(2^(N–1) + m) ⇔ 2^(N–1) + 1 – m ≤ 2^(N–1) + m ⇔ m ≥ 1/2

The property in Corollary 2 follows from here.

Corollary 3

Let f and x be as defined above, and m ∈ R, 0 < m ≤ 1.

If m ∈ (0, 1/2) and f = x · (1 + ε), with |ε| < m/(2^N – m), then f ≈ x, within m ulps of x.

If m ∈ [1/2, 1] and f = x · (1 + ε), with |ε| < m/(2^N + m – 1), then f ≈ x, within m ulps of x.

Proof:

From Theorem 1, if k ≠ 0, the smallest extremum of |ε| on the 2^(N–1) – 1 intervals determined by k is obtained for k = 2^(N–1) – 1:

|ε| < m/(2^N + m – 1) = ε1

This inequality ensures that f is within m ulps of x for k ≠ 0. For k = 0, the smallest extremum of |ε| is m/(2^N – m):

|ε| < m/(2^N – m) = ε2

This inequality ensures that f is within m ulps of x for k = 0. Compare ε1 and ε2:

ε1 > ε2 ⇔ m/(2^N + m – 1) > m/(2^N – m) ⇔ 2^N + m – 1 < 2^N – m ⇔ m < 1/2

The property in Corollary 3 follows from here.
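The binade-level rules in Corollaries 2 and 3 can be packaged as conversion helpers. The sketch below (the function names are this illustration's, not the paper's) uses exact rational arithmetic; the example for N = 8 and m = 3/4 also shows that the two bounds differ by a factor slightly below two, a point revisited in the Conclusion.

from fractions import Fraction

def rel_error_bound(m, N):
    """Corollary 2: if f is within m ulps of x (0 < m <= 1), then |eps| is below this value."""
    m = Fraction(m)
    assert 0 < m <= 1
    return m / (2 ** (N - 1) + m) if m < Fraction(1, 2) else m / (2 ** (N - 1) + 1 - m)

def rel_error_ensuring_ulps(m, N):
    """Corollary 3: if |eps| is below this value, then f is within m ulps of x (0 < m <= 1)."""
    m = Fraction(m)
    assert 0 < m <= 1
    return m / (2 ** N - m) if m < Fraction(1, 2) else m / (2 ** N + m - 1)

print(rel_error_bound(Fraction(3, 4), 8))          # 1/171: guaranteed relative error bound
print(rel_error_ensuring_ulps(Fraction(3, 4), 8))  # 1/341: relative error certifying 3/4 ulp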

Corollary 4

Let f and x be as defined above.

If f ≈ x, within 1 ulp of x, then f = x · (1 + ε), with |ε| < 1/2^(N–1).

Extension: Let x, f ∈ R*. If f is within 1 ulp of x on N bits, then f = x · (1 + ε), with |ε| < 1/2^(N–1).

Proof:

To prove this property, substitute m = 1 in Corollary 2 above.

To prove the extension, consider the outer envelope in Fig. 4 (note that it represents the set of the largest relative errors), looking for the maximum relative error when approximating real numbers x by real numbers f, within 1 ulp of x (in the context of operating with floating-point numbers having N-bit significands). The relative error never exceeds 1/2^(N–1).

Corollary 5

Let f and x be as defined above.

If f = x · (1 + ε), with |ε| < 1/2^N, then f ≈ x, within 1 ulp of x.

Extension: Let x, f ∈ R*. If f = x · (1 + ε), with |ε| < 1/(2^N + 1), then f is within 1 ulp of x on N bits.

Proof:

To prove this property, substitute m = 1 in Corollary 3 above.

To prove the extension, consider the outer envelope in Fig. 4 (note that it represents the set of the largest relative errors), looking for the smallest maximum relative error when approximating real numbers x by real numbers f, within 1 ulp of x (in the context of operating with floating-point numbers having N-bit significands). The smallest relative error extremum on the envelope, in absolute value, is 1/(2^N + 1).

Corollary 6

Let f and x be as defined above.

If f ≈ x, within 1/2 ulp of x, then f = x · (1 + ε), with |ε| < 1/(2^N + 1).

Proof: To prove this property, substitute m = 1/2 in Corollary 2 above.

Corollary 7

Let f and x be as defined above.

If f = x · (1 + ε), with |ε| < 1/(2^(N+1) – 1), then f ≈ x, within 1/2 ulp of x.

Proof: To prove this property, substitute m = 1/2 in Corollary 3 above.
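For reference (an added illustration, not from the paper), the instances of Corollaries 4 through 7 for the IEEE binary32 and binary64 significand sizes, N = 24 and N = 53, can be printed directly:

from fractions import Fraction

for N in (24, 53):
    print("N =", N)
    print("  within 1 ulp    => |eps| <", Fraction(1, 2 ** (N - 1)))       # Corollary 4
    print("  |eps| <", Fraction(1, 2 ** N), "=> within 1 ulp")             # Corollary 5
    print("  within 1/2 ulp  => |eps| <", Fraction(1, 2 ** N + 1))         # Corollary 6
    print("  |eps| <", Fraction(1, 2 ** (N + 1) - 1), "=> within 1/2 ulp") # Corollary 7

For N = 53, for example, a relative error below 1/(2^54 – 1) is enough to place the approximation within 1/2 ulp of the exact value.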

VI. CONCLUSION

We presented and proved a few useful properties, establishing relationships between ulp errors and the corresponding relative errors in floating-point computations when real (or rational) numbers are approximated with floating-point numbers. The main property in the paper, Theorem 1, establishes equivalence relations between the two types of errors when the significand of the floating-point approximation is known. In this case, the conversions can be performed in either direction without loss of information. Corollaries 2 and 3, and also 4 through 7 derived from these, are all obtained from Theorem 1, and determine the relationships between ulp errors and relative errors at binade level, when the significand of the floating-point approximation can take any N-bit value. In this case, however, the equivalence property in conversions is lost. As pointed out in [6], 'going from the ulp error to the relative error and back, we lose a factor of two'. This is true in some cases, but in most cases we 'lose a factor' that is slightly less than two. For example, when converting from relative error to ulps according to [6], if the relative error (in absolute value) has an upper bound of 1/2^(N+1) then the ulp error is at most 1/2. Corollary 7 tells us that an ulp error of at most 1/2 ulp is ensured by a relative error with an upper bound of 1/(2^(N+1) – 1), which is slightly better than in [6] (we allow a slightly larger relative error and still ensure a maximum error of 1/2 ulp). There is no general conversion rule in [6] from ulps to relative error.

These properties can be useful to numerical analysts in proofs of error bounds for floating-point computations, and in particular in proofs of correct IEEE roundedness. It is easier to calculate first a tight bound on the relative error due to round-off after a chain of floating-point operations and then convert it to an ulp error using some of the properties presented here, rather than using ulp errors all the way. In the case of operations defined by IEEE Standard 754-2008 [4] which have to be correctly rounded, a final approximation error of less than 1/2 ulp ensures a correctly rounded result. Reference [3] offers such examples.

In most cases, the properties shown herein establish better bounds than found in the literature, e.g. in [1], [2], [3], [5], or [6]. They also provide conversion rules of finer granularity, at floating-point value level, not only at binade level. For these reasons, the paper makes some new contributions to the field.

ACKNOWLEDGMENT

The author would like to thank his colleagues John Harrison and Michael Ferry for their review of, and feedback on, this paper.

REFERENCES

[1] Jean-Michel Muller et al., Handbook of Floating-Point Arithmetic, Birkhauser, 2010.
[2] David Goldberg, What Every Computer Scientist Should Know About Floating-Point Arithmetic, Computing Surveys, 1991, https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html
[3] Marius Cornea, Roger Golliver, Peter Markstein, Correctness Proofs Outline for Newton-Raphson Based Floating-Point Divide and Square Root Algorithms, 14th IEEE Symposium on Computer Arithmetic, 1999.
[4] IEEE Standard for Floating-Point Arithmetic 754-2008, IEEE Computer Society, 2009.
[5] Jean-Michel Muller, Elementary Functions, Birkhauser, 2006.
[6] The MPFR Team, The MPFR Library: Algorithms and Proofs, www.mpfr.org/algorithms.pdf
[7] John Harrison, A Machine-Checked Theory of Floating Point Arithmetic, Proceedings of the 12th International Conference on Theorem Proving in Higher Order Logics, 1999.
[8] Jean-Michel Muller, On the definition of ulp(x), https://hal.inria.fr/file/index/docid/70503/filename/RR-5504.pdf
[9] J. Barlow, On the Distribution of Accumulated Roundoff Error in Floating-Point Arithmetic, 5th IEEE Symposium on Computer Arithmetic, 1981.
[10] Peter Kornerup, Vincent Lefèvre, Nicolas Louvet, Jean-Michel Muller, On the Computation of Correctly-Rounded Sums, 19th IEEE Symposium on Computer Arithmetic, 2009, https://hal.inria.fr/inria-00367584/document/
[11] Marius Cornea, Precision, Accuracy, and Error Propagation in Exascale Computing, 21st IEEE Symposium on Computer Arithmetic, 2013.
[12] Martin Brain, Cesare Tinelli, Philipp Rummer, Thomas Wahl, An Automatable Formal Semantics for IEEE-754 Floating-Point Arithmetic, 22nd IEEE Symposium on Computer Arithmetic, 2015.
