Ulps and Relative Error
2017 IEEE 24th Symposium on Computer Arithmetic
1063-6889/17 $31.00 © 2017 IEEE, DOI 10.1109/ARITH.2017.30

Marius Cornea
Intel Corporation
Hillsboro, OR, USA

The paper establishes several simple, but useful relationships between ulp (unit in the last place) errors and the corresponding relative errors. These can be used when converting between the two types of errors, ensuring that the least amount of information is lost in the process. The properties presented here were already useful in IEEE conformance proofs for iterative division and square root algorithms, and should be so again to numerical analysts both in ‘hand’ proofs of error bounds for floating-point computations, and in automated tools which carry out, or support deriving such proofs. In most cases, the properties shown herein establish tighter bounds than found in the literature. They also provide ‘conversion’ rules of finer granularity, at floating-point value level instead of binade level, and take into account the special conditions which occur at binade ends. For this reason, the paper includes a small, but non-negligible element of novelty.

Keywords: floating-point; ulp (unit in the last place); quantum; approximation; error; relative error; ulp error; correctly rounded; IEEE 754-2008; numerical analysis

I. INTRODUCTION

“It is important to be able to establish links between errors expressed in ulps, and relative errors”. This is the opening statement of the section on “Errors in ulps and relative errors” in [1] (p. 37; ‘ulp’ stands for ‘unit in the last place’, which is defined below, along with ulp errors and relative errors). The same topic was covered in [2]. Related topics are discussed in particular in [3], [4], [5], and [6] and also in [7] through [12]. Indeed, knowing this relationship allows for easier proofs in numerical analysis, including cases where we need to prove that certain functions or floating-point operators yield correctly rounded results. In these cases, accuracy is expressed ultimately in ulps, but calculating error bounds as relative errors is often much more convenient. Combining the two methods might represent the best solution for such proofs, especially when we need to analyze sequences of multiplications, divisions, and in some cases also fused multiply-add operations. A good example can be found in [3]. The present paper aims to make this process easier.

The IEEE Standard 754-2008 for Floating-Point Arithmetic [4] defines both the format of binary floating-point numbers, and the value of an ulp (named quantum in the standard). We will use the same definition of floating-point numbers in this paper; however, we will consider the integer exponent as being unbounded. We should note that subnormal (also called denormalized) numbers are not considered here. The common notations are used for the sets of real, rational and integer numbers respectively, as well as for the same without {0}: $\mathbb{R}$, $\mathbb{Q}$, $\mathbb{Z}$, $\mathbb{R}^*$, $\mathbb{Q}^*$, $\mathbb{Z}^*$.

A floating-point number is a rational number $f$ which can be represented as $f = \sigma \cdot s \cdot 2^e$, where $\sigma = \pm 1$ is the sign, $s \in \mathbb{Q}$, $1 \le s < 2$, is the significand representable on $N$ bits (where $N \in \mathbb{Z}$, $N > 0$), and $e \in \mathbb{Z}$ is the exponent. More precisely, the significand is $s = 1 + k / 2^{N-1}$, where $k = 0, 1, 2, \ldots, 2^{N-1} - 1$. One unit in the last place is then $1\ \mathrm{ulp} = 2^{e-N+1}$, which is also ‘the weight’ of the least significant bit of the floating-point number. Note that we do not specify here anything about the encoding of a floating-point number.

As an example, Fig. 1 illustrates the representation on the real axis for floating-point numbers having N = 4 bits in the significand.

Fig. 1. Floating-point numbers representable with N = 4-bit significands

Quite often it is useful to work with floating-point values having different numbers of bits in the significand, e.g. N and N+1 bits. To illustrate this case, Fig. 2 shows the positions on the real axis for floating-point numbers representable with N = 4, and respectively N = 5 bits in the significand.

Fig. 2. Floating-point numbers representable with significands of N = 4 and N = 5 bits

The ulp by itself does not imply evaluation of any error, but approximation, round-off, and other similar errors can be expressed in terms of ulps. This happens e.g. when considering an exact result or value $x$ (most often not representable as a floating-point number with a given precision N) and an approximation $f$ of it (often a floating-point number with a given precision N). The relative error is defined as the absolute error $(f - x)$ divided by the exact value, i.e. $(f - x)/x$. Quite often, only the magnitudes are used in the calculation of the absolute or relative error; however, some information is lost that way, so we prefer preserving the sign as well. Errors expressed in ulps are thus related and similar to relative errors, the difference being that the relative error is the absolute error divided by the exact value, while the ulp error is the absolute error divided by the weight of the least significant bit of the floating-point approximation.

Although the IEEE Standard 754-2008 defines the meaning of a unit in the last place (or quantum), it does not fully explain how to work with ulps in all situations, e.g. when values of concern cross the boundary between two consecutive binades. This is clarified in section 2.1.4 of [5] (p. 14; also in [1], p. 32), where the definition of an ulp is extended also to real values. We use the same definition in the following properties.

The central part of this paper contains a simple but useful property, Theorem 1, and its proof. It establishes the relationship between approximation errors measured in ulps, and the same errors expressed as relative errors. It is limited to ulp errors of up to 1 ulp, which are most important in practice in the kinds of proofs this property is intended to support, but it could easily be extended to cover the (less interesting) cases of errors exceeding 1 ulp. This limitation is also alleviated by the fact that an approximation error of 1 ulp when floating-point numbers have N bits in the significand is an error of 2 ulps if we allow N+1 bits in the significand, of 4 ulps for N+2 bits, etc.

After introducing a few definitions in section II and basic properties in section III, we will introduce the main property in section IV, followed by seven corollaries in section V. A dependency graph relating the properties presented in the paper is shown in Fig. 3.

Fig. 3. Dependency graph for properties presented in the paper

Assuming an ulp error of at most $m$ ulps ($0 < m \le 1$), Theorem 1 establishes equivalent inequalities for the relative error. Thus it provides necessary and sufficient conditions allowing for conversion between errors in ulps and relative errors, without loss of information. Corollary 1 represents the same property in the interesting case of m = 1 ulp. Both Theorem 1 and Corollary 1 are applicable to floating-point numbers whose significands are known. In order to determine similar properties for floating-point numbers with any significand, i.e. at binade level, Corollaries 2 and 3 are derived from Theorem 1. They establish conversion rules from errors in ulps to relative errors and back, and answer the following questions:

• If the ulp error is less than m ulps, how large is the relative error?
• How large can the relative error be, in order to ensure an ulp error of less than m ulps?

Known literature is closest to these corollaries in [6] (p. 8); however, the conditions presented in this paper are slightly tighter, as explained in section VI, Conclusion. Finally, Corollaries 4 and 5 are instantiations of Corollaries 2 and 3 respectively for m = 1 ulp, and Corollaries 6 and 7 for m = 1/2 ulp.

As some of these properties establish tighter bounds than found in the literature (e.g. in [1], [2], [3], [5], or [6]), it is hoped that the relationships established here between ulp errors and the corresponding relative errors will prove useful to numerical analysts both in ‘hand’ proofs of error bounds for floating-point computations, and in automated tools which carry out or support deriving such proofs.

II. DEFINITIONS

A few definitions for terms used frequently throughout the paper are included here.

D1. Approximation of a real number: floating-point number $f$ that approximates the real number $x$ with a specified relative error, or is within a specified number of ulps of $x$ (or from $x$).

D2. Ulp, or unit in the last place associated with a floating-point number $f$: the weight of the least significant bit in the significand of a floating-point number; if the significand of $f$ has $N$ bits and the exponent is $e$, then one ulp has magnitude $2^{e-N+1}$; also written as ulp($f$).

D3. Ulp associated with a real number $x$: this has meaning only in the context of operating with floating-point numbers having N-bit significands, with N specified: if $2^e \le x < 2^{e+1}$, then one ulp has magnitude $2^{e-N+1}$; also written as ulp($x$).

III. BASIC PROPERTIES

Two basic properties are included in this section. The first one in particular will be useful in proving the main theorem of the paper, presented in the next section.

Note that in order to make reading of the properties in this and all subsequent sections easier, whenever we refer to f and