Suppose That x and y Are NMNs (Both > 0 or Both < 0)


Floating-Point Systems – For β = base, e = exponent, and n = significand (or mantissa) length (as in Atkinson & Han), we consider only ‘0’ and numbers of the form

x = ± β^e (d1.d2d3d4…dn)_β, where d1 ≥ 1.

These are called the normalized machine numbers or (as here) NMNs.

For the floating-point system defined above with L ≤ e ≤ U, L & U integers, the smallest positive NMN is β^L while the largest is β^(U+1)(1 − β^(−n)), and the cardinality of all these NMNs (+, −, and 0) is

2(U − L + 1)(β^n − β^(n−1)) + 1.

Note: The k-th positive NMN, x_k, for k = 1 to (U − L + 1)·#m, is

β^(L+p−1)·(1 + (k − 1 − (p − 1)·#m)·β^(1−n)), where #m = β^n − β^(n−1) is the number of normalized mantissas and p is the ‘block’ number for x_k, i.e. p = Ceiling[k/#m] (so p increases from 1 to U − L + 1).

The IEEE base-2 double-precision standard calls for −1022 ≤ e ≤ 1023 and n = 53. In this arithmetic, the least positive NMN is 2^−1022 ≈ 2.225 × 10^−308 and the largest is 2^1024·(1 − 2^−53) ≈ 1.798 × 10^308. It also follows that there are

2046 × 2^52 = 9,214,364,837,600,034,816 (≈ 9.214 × 10^18) positive NMNs in the 53-bit arithmetic. The machine epsilon is 2^−52 ≈ 2.22 × 10^−16, and the largest integer in the arithmetic, M, with the property that all of M’s integer predecessors also belong to the arithmetic is 2^53 ≈ 9.0 × 10^15.

The maximum MN is obtained as follows: β^U·(β − 1)·(1 + β^−1 + β^−2 + … + β^−(n−1)) = β^U·(β − 1)·(1 − β^−n)/(1 − β^−1) = β^U·(β − β^(1−n)).
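As a quick check, a minimal Python sketch (assuming the host floats are IEEE binary64) confirms these counts and extremes with exact integer and rational arithmetic:

```python
import sys
from fractions import Fraction

beta, n, L, U = 2, 53, -1022, 1023
num_m = beta**n - beta**(n - 1)                      # #m = β^n − β^(n−1)
assert (U - L + 1) * num_m == 9214364837600034816    # 2046 · 2^52 positive NMNs
assert Fraction(sys.float_info.min) == Fraction(beta)**L                    # smallest NMN, β^L
assert Fraction(sys.float_info.max) == Fraction(beta)**(U + 1) * (1 - Fraction(beta)**(-n))
assert Fraction(sys.float_info.epsilon) == Fraction(beta)**(1 - n)          # machine eps, 2^−52
assert float(2**n) == 2**n and float(2**n - 1) == 2**n - 1                  # M = 2^53 and below are hit exactly
```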

The number of m’s (mantissas), #m, is obtained as follows: (Max m − Min m)/β^(1−n) + 1 = ((β − β^(1−n)) − 1)/β^(1−n) + 1 = β^n − 1 − β^(n−1) + 1 = β^n − β^(n−1).

Restated: the k-th positive NMN, x_k, for k = 1 to (U − L + 1)·#m, is

β^(L+p−1)·(1 + (k − 1 − (p − 1)·#m)·β^(1−n)), where #m = β^n − β^(n−1) is the number of normalized mantissas and p is the ‘block’ number for x_k, i.e. p = Ceiling[k/#m] (so that p increases from 1 to U − L + 1).

Proof: Let k = #m·q + r, 0 ≤ r < #m (using the Division Algorithm).

Case I. r = 0 (this means p = q).

This means x_k is the final MN in the p-th block: β^(L+p−1)·(β − β^(1−n)). But β^(L+p−1)·(1 + (#m·p − 1 − (p − 1)·#m)·β^(1−n)) = β^(L+p−1)·(1 + (#m − 1)·β^(1−n)) = β^(L+p−1)·(1 + (β^n − β^(n−1) − 1)·β^(1−n)) = β^(L+p−1)·(1 + β − 1 − β^(1−n)) = β^(L+p−1)·(β − β^(1−n)).

Case II. r > 0 (this means p = q + 1, i.e. q = p − 1).

This means x_k is the r-th MN in the p-th block, preceded by exactly q full blocks.

Therefore, x_k = β^(L+p−1)·(1 + (r − 1)·β^(1−n)) = β^(L+p−1)·(1 + (k − #m·q − 1)·β^(1−n)) = β^(L+p−1)·(1 + (k − 1 − q·#m)·β^(1−n)) = β^(L+p−1)·(1 + (k − 1 − (p − 1)·#m)·β^(1−n)).
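The formula can be sanity-checked against brute-force enumeration on a toy system; a Python sketch (the toy parameters β = 2, n = 3, 0 ≤ e ≤ 2 are illustrative choices, not from the text):

```python
from fractions import Fraction
from math import ceil

def kth_nmn(k, beta, n, L, U):
    """k-th positive NMN: β^(L+p−1)·(1 + (k − 1 − (p−1)·#m)·β^(1−n)), p = Ceiling[k/#m]."""
    num_m = beta**n - beta**(n - 1)      # #m, mantissas per block
    p = ceil(k / num_m)                  # block number, 1 ≤ p ≤ U − L + 1
    return Fraction(beta)**(L + p - 1) * (1 + (k - 1 - (p - 1) * num_m) * Fraction(beta)**(1 - n))

# Enumerate every positive NMN of the toy system directly and compare.
beta, n, L, U = 2, 3, 0, 2
brute = sorted(Fraction(m, beta**(n - 1)) * Fraction(beta)**e
               for e in range(L, U + 1)
               for m in range(beta**(n - 1), beta**n))   # integer mantissas β^(n−1) … β^n − 1
assert brute == [kth_nmn(k, beta, n, L, U) for k in range(1, len(brute) + 1)]
```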

A subnormal has the form ± β^L·(0.d2d3d4…dn)_β, where not all of the dj’s = 0. The number of positive subnormals is β^(n−1) − 1; in the 53-bit arithmetic there are 2^52 − 1 = 4,503,599,627,370,495 of these.
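In Python (binary64 assumed, so L = −1022 and n = 53), the subnormal range and count can be confirmed directly:

```python
import math

# Positive subnormals run from β^L·β^(1−n) = 2^−1074 up to just below β^L = 2^−1022.
assert math.ldexp(1.0, -1074) > 0.0           # smallest positive subnormal
assert math.ldexp(1.0, -1075) == 0.0          # anything smaller underflows to 0
assert 2**52 - 1 == 4_503_599_627_370_495     # count of positive subnormals, β^(n−1) − 1
```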

It should be noted that the MNs in a given finite arithmetic do not include every integer between that arithmetic’s least and greatest positive member. Assume a base-2 arithmetic with a ≤ e ≤ b, a ≤ 0 ≤ n ≤ b, and q an integer. Then if 1 ≤ q ≤ 2^n − 1, q must be a MN. Why? Since (a proof is in Gilbert)…

q = c1·2^(n−1) + c2·2^(n−2) + c3·2^(n−3) + … + cn·2^0,

= 2^(n−j)·(cj.c(j+1)c(j+2)…cn)_2 [if the first j − 1 c’s = 0],

where the ci’s are 0 or 1 and cj is the first nonzero ci (so it’s 1). Now, 2^n is clearly a MN (it’s 2^n·(1.000…0)_2), but 2^n + 1 is not one, while 2^n + 2 is.

Proof: 2^n + 1 = 2^n·(1.000…01)_2, where the trailing 1 is in the 2^(−n) place, one binary place beyond the n digits the arithmetic carries;

Proof: The next MN is 2^n + 2 = 2^n·(1.000…1)_2, with this 1 in the 2^−(n−1) place, i.e. 2^n + 1 is passed over (the integer gaps widen as we move to the right).
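This is easy to witness in binary64 doubles, where n = 53; Python compares floats against exact integers:

```python
M = 2**53
assert float(M) == M           # 2^53 is a MN
assert float(M + 1) == M       # 2^53 + 1 is passed over (the tie rounds to the even mantissa)
assert float(M + 2) == M + 2   # the next MN after 2^53
```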

Atkinson’s definition of ‘significant digits’ is as follows –

x_a has at least m significant digits of accuracy (or digits of accuracy) wrt x_t provided the magnitude of the error, |x_t − x_a|, is less than or equal to five units in the (m+1)-st digit, counting rightward from x_t’s first nonzero digit.

This is equivalent to requiring |x_t − x_a| ≤ 5 × 10^(e−m), for e = x_t’s normalization exponent, i.e. x_t = ± 10^e·(d1.d2d3d4…) with d1 ≥ 1. Atkinson claims (pg. 45):

If |relative error| ≤ 5 × 10^−(m+1), then x_a has (at least) m significant digits of accuracy wrt x_t.

Proof: Let x_t = ± 10^e·(d1.d2d3d4…dm d(m+1)…), where d1 ≥ 1. Then the bound on the relative error implies that

|x_t − x_a| ≤ 5 × 10^(e−(m+1))·(d1.d2d3d4…dm d(m+1)…) < 5 × 10^(e−m), and x_a has at least m significant digits of accuracy relative to x_t.

From Skeel & Keiper, we also have the following:

decimal places of accuracy, or d-acc, given by −log10 |x_t − x_a|;

digits of accuracy, given by −log10 (|x_t − x_a| / |x_t|).

From Conte & de Boor (pg. 10), the significant digits definition:

x_a approximates x_t to at least d significant digits if |x_t − x_a| / |x_t| ≤ 5 × 10^(−d).

Finally, from an old trigonometry textbook (Greenleaf), the definition:

a digit is significant provided it’s known to within 4 possibilities.

Examples using Atkinson’s definition –

(1) x_t = π = 3.1415926… vs. x_a = 22/7 = 3.142857142857… The absolute error = −.00126…, so |error| < 5 × 10^−3, and since e = 0, e − m = −3 implies that m is 3.

(2) x_t = 2/9 = .222… vs. x_a = .222. The absolute error = .000222…, so |error| < 5 × 10^−4, and since e = −1, e − m = −4 implies that m is 3.

(3) x_t = 23.496 vs. x_a = 23.494. The absolute error = −.002, so |error| < 5 × 10^−3, and since e = 1, e − m = −3 implies that m is 4.

(4) x_t = .02144 vs. x_a = .02138. The absolute error = .00006 = .6 × 10^−4 < 5 × 10^−4, and since e = −2, e − m = −4 implies that m is 2.

(5) x_t = 5 vs. x_a = 4.995. The absolute error = .005, so |error| = 5 × 10^−3, and since e = 0, e − m = −3 implies that m is 3.

Atkinson’s best m* satisfies e + log10 5 − log10|error| − 1 < m ≤ e + log10 5 − log10|error|.

Proof: Suppose that |error| ≤ 5 × 10^(e−m). Then m ≤ log10 5 + e − log10|error|, implying that log10 5 + e − log10|error| − 1 < m ≤ log10 5 + e − log10|error|, with m the unique integer satisfying 5 × 10^(e−(m+1)) < |error| ≤ 5 × 10^(e−m). Please note: m = Floor[−log10(|error| / (5 × 10^e))].

Suppose a > 0 and let m = Floor[−log a]. Then −log a = m + f, for some 0 ≤ f < 1, and −log a − 1 < m ≤ −log a. This implies 10^(−m−1) < a ≤ 10^(−m). So, if a = |error| / (5 × 10^e), then 5 × 10^(e−(m+1)) < |error| ≤ 5 × 10^(e−m), and thus no larger m can satisfy |error| ≤ 5 × 10^(e−m).
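A short Python sketch (function name is mine) that finds Atkinson’s best m by exact rational comparison rather than logarithms, which sidesteps boundary trouble in cases like (5), where |error| equals 5 × 10^(e−m) exactly:

```python
from fractions import Fraction

def atkinson_m(x_t, x_a, e):
    """Largest integer m with |x_t − x_a| ≤ 5·10^(e−m), found by exact comparison."""
    err = abs(x_t - x_a)
    m = 0
    while err <= 5 * Fraction(10)**(e - (m + 1)):
        m += 1
    return m

cases = [(Fraction('3.14159265358979'), Fraction(22, 7), 0),   # 15-digit stand-in for π
         (Fraction(2, 9), Fraction('0.222'), -1),
         (Fraction('23.496'), Fraction('23.494'), 1),
         (Fraction('0.02144'), Fraction('0.02138'), -2),
         (Fraction(5), Fraction('4.995'), 0)]
print([atkinson_m(t, a, e) for t, a, e in cases])   # [3, 3, 4, 2, 3]
```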

Applying this to each of the examples above gives –

(1) 2.6 < m ≤ 3.6, m = 3; Floor[−log(|relerr|)] = 3
(2) 2.4 < m ≤ 3.4, m = 3; Floor[−log(|relerr|)] = 3
(3) 3.4 < m ≤ 4.4, m = 4; Floor[−log(|relerr|)] = 4
(4) 1.9 < m ≤ 2.9, m = 2; Floor[−log(|relerr|)] = 2
(5) 2.0 < m ≤ 3.0, m = 3; Floor[−log(|relerr|)] = 3

*the largest integer m satisfying |absolute error| ≤ 5 × 10^(e−m).

Regarding Miscellaneous Calculators –

The TI-85 uses a 12+2-digit (2 internal) base-10 arithmetic with simple rounding, −999 ≤ e ≤ 999.

The TI-83 uses a 12+2-digit (2 internal) base-10 arithmetic with simple rounding, −99 ≤ e ≤ 99.

The TI-89 (& Voyage 200) uses a 12+2-digit (2 internal) base-10 arithmetic with simple rounding, −999 ≤ e ≤ 999.

The HP-48G uses a 12-digit (no internal digits) base-10 arithmetic with simple rounding, −499 ≤ e ≤ 499.

Example regarding Matlab’s “format hex” –

>> format hex
>> 0.1

yields 3fb999999999999a, which in binary is

0011 1111 1011 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1010

The first 0 is the algebraic sign (here a +), the next 11 bits code E = 1019, which is related to the exponent e via e = E − 1023 (here e = −4), and the final 52 bits code the mantissa (save for the 1 from normalization), which here is the exact mantissa 1.100110011001… rounded up in the 2^−52 place. Typing sym(0.1, ’e’) gives 1/10 + eps/40, so Matlab is telling us the floated form of 0.1 in the 53-bit arithmetic is 2^−52/40 more than 1/10. Finally, typing sym(0.1, ‘f’) will give (as a fraction) the floated form of 0.1.
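The same decoding can be reproduced outside Matlab; a Python sketch (binary64 assumed) that unpacks the bits and confirms the 1/10 + eps/40 claim exactly:

```python
import struct
from fractions import Fraction

bits = struct.unpack('>Q', struct.pack('>d', 0.1))[0]
print(f'{bits:016x}')                        # 3fb999999999999a, matching format hex
E = (bits >> 52) & 0x7FF                     # 11 biased-exponent bits: E = 1019, e = E − 1023 = −4
frac = bits & (2**52 - 1)                    # 52 stored mantissa bits (leading 1 implicit)
floated = Fraction(2)**(E - 1023) * (1 + Fraction(frac, 2**52))
assert floated == Fraction(1, 10) + Fraction(2)**-52 / 40   # 1/10 + eps/40
```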

Now assume x & y are NMNs with xy > 0 and 0 < |y| < |x|·b^(−1)·u, for b the base of the arithmetic and u the unit round-off (u = b^(1−s) if chopping, and u = .5·b^(1−s) if rounding).

Claim: y has no effect if added to x, i.e. x ⊕ y = x.

Proof: WLOG, take x & y > 0 and let x = m·b^e & y = n·b^f, with b > m, n ≥ 1. If chopping is used, then y < x·b^(−1)·u means that n·b^f < m·b^(e−1)·b^(1−s) = m·b^(e−s), or b^f < (m/n)·b^(e−s) < b^(e−s+1), implying f − e < 1 − s, or f − e ≤ −s. Therefore, x ⊕ y = x, since n·b^(f−e) ≤ n·b^(−s) < b^(1−s): the aligned y is less than one unit in the last (b^(1−s)) place of x’s mantissa, and chopping discards it. If rounding is used, then y < x·b^(−1)·u means n·b^f < (m/2)·b^(e−s). Therefore, x ⊕ y = x, since n·b^(f−e) < (m/2)·b^(−s) ≤ (b/2)·b^(−s), i.e. the aligned y is less than ½ unit in the b^(1−s) place.

Examples: Let b = 10 & s = 3, and take x = 1.24 & y = .00123 (these do fail the inequality for rounding). Then x ⊕ y = 1.24, so y is still ‘ignored’: the condition above is sufficient but not necessary.
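The same effect in binary64 (b = 2, s = 53, u = 2^−53 with rounding), including a value that fails the strict inequality yet is still absorbed:

```python
x = 1.0
assert x + 2.0**-55 == x    # y < x·b⁻¹·u = 2⁻⁵⁴: guaranteed to be ignored
assert x + 2.0**-53 == x    # well above the bound, yet still absorbed (the tie rounds to even)
assert x + 2.0**-52 != x    # one ulp of 1.0 does register
```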

Now suppose that x & y are MNs and 2 ≥ y/x ≥ 1/2. Then

Claim: x ⊖ y = x − y, i.e. the subtraction is exact (assuming x & y are MNs and the exponent of the difference lies within the L/U range of the arithmetic).

Note: For any normalized mantissas m & n, we have 1/b < n/m, m/n < b.

Proof: WLOG, take x & y > 0 and assume x > y. Let x = m·b^e & y = n·b^f, with b > m, n ≥ 1. We claim that f = e − 1, e, or e + 1. Suppose f − e = k. Then 2^(−1) ≤ (n/m)·b^k ≤ 2, or m/(2n) ≤ b^k ≤ 2m/n, or 1/(2b) < b^k < 2b. Taking logarithms base b shows k = −1, 0, or 1. Since x > y, f = e + 1 is not possible.

We now consider the cases f - e = 0 and f - e = -1.

If f − e = 0, then x − y = (m − n)·b^e = b^(1−s)·(M − N)·b^e, for M & N the ‘integer mantissas’ m·b^(s−1) and n·b^(s−1). Since M − N falls between 1 and (b^s − 1) − b^(s−1), the difference is expressible as an at-most-s-digit base-b integer (integers in [1, b^s − 1] may all be so expressed), implying that m − n is expressible as an at-most-s-digit fractional mantissa, i.e. x − y is exact.

Finally, if f − e = −1, then x − y = b^(e−s)·(bM − N) and, since x > y, bM − N ≥ 1. From ½ ≤ (n/m)·b^(−1), it follows that 2N ≥ bM, so bM − N ≤ N ≤ b^s − 1. Therefore, bM − N is an at-most-s-digit base-b integer and x − y sits in the arithmetic (provided that the associated exponent falls in the range of the arithmetic).
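In binary64 the exactness is easy to observe with exact rationals (Python’s Fraction converts a float exactly):

```python
from fractions import Fraction

x, y = 4.24, 2.60                                    # y/x ≈ 0.613, inside [1/2, 2]
assert Fraction(x) - Fraction(y) == Fraction(x - y)  # the computed difference is exact
# Outside the range, the subtraction can round:
assert Fraction(1.0) - Fraction(2.0**-60) != Fraction(1.0 - 2.0**-60)
```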

Examples: Let b = 10 and s = 3. Now take…

x = 4.24 & y = 2.60. Then y/x = .6132… and x − y = 1.64 (exact)
x = 2.46 & y = 4.28. Then y/x = 1.7398… and x − y = −1.82 (exact)
x = 5.24 & y = 12.78. Then y/x = 2.4389… and x − y = −7.54 (exact, even though y/x > 2; the condition is sufficient, not necessary)
x = 5.24 & y = 62.78. Then y/x = 11.9809… and x − y = −57.5 (not exact)

Regarding the LOS example in Atkinson on page 49 –

x_t = cos(.01) = 0.99995000041666… vs. x_a = 0.9999500004 (computed). The absolute error < 1.667 × 10^−11 < 5 × 10^−11, and since e = −1, e − m = −11 & x_a has the full 10 significant digits of accuracy allowed by the arithmetic.

x_t = 1 − cos(.01) = 0.00004999958333… vs. x_a = 0.0000499996 (computed). Then |absolute error| < 1.667 × 10^−11 < 5 × 10^−11, and since e = −5, e − m = −11 & x_a has only 6 significant digits of accuracy, a loss of 4 in this arithmetic.

x_t = (1 − cos(.01))/(.01)^2 = 0.4999958333… vs. x_a = 0.499996 (computed). Then |absolute error| < 1.667 × 10^−7 < 5 × 10^−7, and since e = −1, e − m = −7 & x_a has just 6 significant digits of accuracy, a loss of 4 in this arithmetic.

Remedy #1: Substitute the Taylor polynomial 1 − x^2/2! + x^4/4! − x^6/6! for cos(x). Then the problematic expression becomes ½ − x^2/24 + x^4/720.

Remedy #2: Rewrite it as sin^2(.01) / [(1 + cos(.01))·(.01)^2].
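Both remedies are visible numerically; a Python sketch at x = 10^−8 (a value chosen to make the cancellation total, since cos(x) then rounds to 1.0):

```python
from math import cos, sin

x = 1e-8
naive  = (1 - cos(x)) / x**2                   # catastrophic cancellation: cos(x) rounds to 1.0, giving 0.0
taylor = 0.5 - x**2/24 + x**4/720              # Remedy #1
stable = sin(x)**2 / ((1 + cos(x)) * x**2)     # Remedy #2
print(naive, taylor, stable)                   # 0.0 vs ≈ 0.5 from both remedies
```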

Suppose that f*(x) is the floated form of f(x). Claim: If |f*(x) − f(x)| < 1 ulp and f(x) is a MN, then f*(x) = f(x). Proof: Assume f*(x) = 2^e·m* and f(x) = 2^e·m (same exponent), with 1 ≤ m, m* < 2. Then |f*(x) − f(x)| = 2^e·|m* − m| < 2^e·2^(1−n), implying |m* − m| < 2^(1−n), i.e. m* = m, since distinct n-digit mantissas differ by at least 2^(1−n).
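The spacing fact the proof relies on, in binary64 terms (Python 3.9+ for math.ulp and math.nextafter):

```python
import math

fx = 1.5                                              # a MN with e = 0
assert math.ulp(fx) == 2.0**-52                       # 1 ulp = 2^e·2^(1−n) with n = 53
assert math.nextafter(fx, 2.0) - fx == math.ulp(fx)   # the nearest distinct MN is a full ulp away
```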
