<p>Floating-Point Systems - For \(\beta\) = base, \(e\) = exponent, and \(n\) = significand (or mantissa) length (as in Atkinson &amp; Han), we consider only 0 and numbers of the form</p>
<p>\[x = \pm\,\beta^{e}\,(d_1.d_2d_3d_4\ldots d_n)_\beta, \quad\text{where } d_1 \ge 1.\]</p>
<p>These are called the normalized machine numbers or (as here) NMNs.</p>
<p>For the floating-point system defined above with \(L \le e \le U\), \(L\) and \(U\) integers, the smallest positive NMN is \(\beta^{L}\) while the largest is \(\beta^{U+1}(1 - \beta^{-n})\), and the cardinality of all (+, -, and 0) these NMNs is</p>
<p>\[2(U - L + 1)(\beta^{n} - \beta^{n-1}) + 1.\]</p>
<p>Note: The \(k\)th positive NMN, \(x_k\) for \(k = 1\) to \((U - L + 1)\,\#m\), is</p>
<p>\[x_k = \beta^{L+p-1}\bigl(1 + (k - 1 - (p - 1)\,\#m)\,\beta^{1-n}\bigr),\] where \(\#m = \beta^{n} - \beta^{n-1}\) is the number of normalized mantissas and \(p\) is the 'block' number for \(x_k\), i.e. \(p = \lceil k/\#m \rceil\) (so \(p\) increases from 1 to \(U - L + 1\)).</p>
<p>The IEEE base-2 double-precision standard calls for \(-1022 \le e \le 1023\) and \(n = 53\). In this arithmetic, the least positive NMN is \(2^{-1022} \approx 2.225\times10^{-308}\) and the largest is \(2^{1024}(1 - 2^{-53}) \approx 1.798\times10^{308}\). It also follows that there are</p>
<p>\(2046\times2^{52} = 9{,}214{,}364{,}837{,}600{,}034{,}816\) \((\approx 9.214\times10^{18})\) positive NMNs in the 53-bit arithmetic. The machine epsilon is \(2^{-52} \approx 2.22\times10^{-16}\), and the largest integer \(M\) in the arithmetic with the property that all of \(M\)'s integer predecessors also belong to the arithmetic is \(2^{53} \approx 9.0\times10^{15}\).</p>
<p>The maximum MN is obtained as follows: \[\beta^{U}(\beta - 1)\bigl(1 + \beta^{-1} + \beta^{-2} + \cdots + \beta^{-(n-1)}\bigr) = \beta^{U}(\beta - 1)\,\frac{1 - \beta^{-n}}{1 - \beta^{-1}} = \beta^{U}(\beta - \beta^{1-n}).\]</p>
<p>The number of m's (mantissas), \(\#m\), is obtained as follows: \[\frac{\text{Max } m - \text{Min } m}{\beta^{1-n}} + 1 = \frac{(\beta - \beta^{1-n}) - 1}{\beta^{1-n}} + 1 = \beta^{n} - 1 - \beta^{n-1} + 1 = \beta^{n} - \beta^{n-1}.\]</p>
<p>Proof of the formula for the \(k\)th positive NMN, \(x_k = \beta^{L+p-1}\bigl(1 + (k - 1 - (p - 1)\,\#m)\,\beta^{1-n}\bigr)\), with \(\#m\) and \(p\) as in the Note: Let \(k = \#m\,q + r\), \(0 \le r &lt; \#m\) (using the Division Algorithm). Case I. 
\(r = 0\) (this means \(p = q\)).</p>
<p>This means \(x_k\) is the final MN in the \(p\)th block: \(\beta^{L+p-1}(\beta - \beta^{1-n})\). But \[\beta^{L+p-1}\bigl(1 + (\#m\,p - 1 - (p - 1)\,\#m)\,\beta^{1-n}\bigr) = \beta^{L+p-1}\bigl(1 + (\#m - 1)\,\beta^{1-n}\bigr) = \beta^{L+p-1}\bigl(1 + (\beta^{n} - \beta^{n-1} - 1)\,\beta^{1-n}\bigr) = \beta^{L+p-1}\bigl(1 + \beta - 1 - \beta^{1-n}\bigr) = \beta^{L+p-1}(\beta - \beta^{1-n}).\]</p>
<p>Case II. \(r &gt; 0\) (this means \(p = q + 1\), or \(q = p - 1\)).</p>
<p>This means \(x_k\) is the \(r\)th MN in the \(p\)th block, preceded by exactly \(q\) blocks.</p>
<p>Therefore, \[x_k = \beta^{L+p-1}\bigl(1 + (r - 1)\,\beta^{1-n}\bigr) = \beta^{L+p-1}\bigl(1 + (k - \#m\,q - 1)\,\beta^{1-n}\bigr) = \beta^{L+p-1}\bigl(1 + (k - 1 - q\,\#m)\,\beta^{1-n}\bigr) = \beta^{L+p-1}\bigl(1 + (k - 1 - (p - 1)\,\#m)\,\beta^{1-n}\bigr).\]</p>
<p>A subnormal has the form \(\pm\,\beta^{L}(0.d_2d_3d_4\ldots d_n)_\beta\), where not all of the \(d_j\) are 0. The number of positive subnormals is \(\beta^{n-1} - 1\). In the 53-bit arithmetic there are \(2^{52} - 1 = 4{,}503{,}599{,}627{,}370{,}495\) of these. It should be noted that the MNs in a given finite arithmetic do not include every integer between that arithmetic's least and greatest positive member:</p>
<p>Assume a base-2 arithmetic with \(a \le e \le b\), \(a \le 0 \le n \le b\), and \(q\) an integer. Then if \(1 \le q \le 2^{n} - 1\), \(q\) must be a MN. Why? Since (a proof is in Gilbert)</p>
<p>\[q = c_1 2^{n-1} + c_2 2^{n-2} + c_3 2^{n-3} + \cdots + c_n 2^{0}\]</p>
<p>\[\phantom{q} = 2^{n-j}\,(c_j.c_{j+1}c_{j+2}\ldots c_n)_2 \quad\text{(if the first } j-1 \text{ of the } c\text{'s are 0)},\]</p>
<p>where the \(c_i\) are 0 or 1 and \(c_j\) is the first nonzero \(c_i\) (so it's 1). Now, \(2^{n}\) is clearly a MN (it's \(2^{n}(1.000\ldots0)_2\)), but \(2^{n} + 1\) is not one, while \(2^{n} + 2\) is.</p>
<p>Proof that \(2^{n} + 1\) is not a MN: It's \(2^{n}(1.000\ldots01)_2\), where the trailing 1 is in the \(2^{-n}\) place, one place beyond the \(n\)-bit significand.</p>
<p>Proof that \(2^{n} + 2\) is a MN: The next MN is \(2^{n} + 2 = 2^{n}(1.000\ldots1)_2\) (this 1 is in the \(2^{-(n-1)}\) place), i.e. \(2^{n} + 1\) is passed over (the integer gaps widen as we move to the right).</p>
<p>Atkinson's definition of 'significant digits' is as follows: \(x_a\) has at least \(m\) significant digits of accuracy (or digits of accuracy) wrt \(x_t\) provided the magnitude of the error, \(|x_t - x_a|\), is less than or equal to five units in the \((m+1)\)st digit counting rightward from \(x_t\)'s first nonzero digit. 
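</p>
<p>A minimal sketch of this definition, assuming "five units in the (m+1)st digit" is read as the bound \(5\times10^{e-m}\), where \(10^{e}\) is the place value of \(x_t\)'s first nonzero digit. The function names normalization_exponent and significant_digits are hypothetical (they are not Atkinson's), and exact rational arithmetic is an assumption made here so that borderline cases are not blurred by binary rounding:</p>

```python
from fractions import Fraction


def normalization_exponent(x):
    """Return the integer e with 10**e <= |x| < 10**(e + 1), for x != 0."""
    x = abs(Fraction(x))
    e = 0
    while x >= 10:
        x /= 10
        e += 1
    while x < 1:
        x *= 10
        e -= 1
    return e


def significant_digits(xt, xa):
    """Largest integer m with |xt - xa| <= 5 * 10**(e - m).

    xt and xa may be ints, Fractions, or strings such as "23.496" or "2/9".
    Returns None when xa equals xt exactly (no largest m exists).
    """
    xt, xa = Fraction(xt), Fraction(xa)
    err = abs(xt - xa)
    if err == 0:
        return None
    e = normalization_exponent(xt)
    m = 0
    # Raise m while the next, tighter bound still holds.
    while err <= 5 * Fraction(10) ** (e - (m + 1)):
        m += 1
    # Lower m if even the current bound fails (very large errors).
    while err > 5 * Fraction(10) ** (e - m):
        m -= 1
    return m
```

<p>For instance, significant_digits("5", "4.995") evaluates to 3, the borderline case in which the error equals the bound exactly; the inputs are passed as strings so that Fraction parses the decimals exactly.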
</p><p>This is equivalent to requiring \(|x_t - x_a| \le 5\times10^{e-m}\), for \(e\) = \(x_t\)'s normalization exponent, i.e. \(x_t = 10^{e}(d_1.d_2d_3d_4\ldots)\) with \(d_1 \ge 1\). Atkinson claims (pg. 45):</p>
<p>If \(|\text{relative error}| \le 5\times10^{-(m+1)}\), then \(x_a\) has (at least) \(m\) significant digits of accuracy wrt \(x_t\).</p>
<p>Proof: Let \(x_t = 10^{e}(d_1.d_2d_3d_4\ldots d_m d_{m+1}\ldots)\), where \(d_1 \ge 1\). Then the bound on the relative error implies that</p>
<p>\[|x_t - x_a| \le 5\times10^{e-(m+1)}\,(d_1.d_2d_3d_4\ldots d_m d_{m+1}\ldots) &lt; 5\times10^{e-m},\] and \(x_a\) has at least \(m\) significant digits of accuracy relative to \(x_t\).</p>
<p>From Skeel &amp; Keiper, we also have the following:</p>
<p>decimal places of accuracy, or d-acc, given by \(-\log_{10}|x_t - x_a|\)</p>
<p>digits of accuracy, given by \(-\log_{10}\bigl(|x_t - x_a|/|x_t|\bigr)\)</p>
<p>From Conte &amp; de Boor (pg. 10) the significant digits definition:</p>
<p>\(x_a\) approximates \(x_t\) to at least \(d\) significant digits if \(|x_t - x_a|/|x_t| \le 5\times10^{-d}\)</p>
<p>Finally, from an old trigonometry textbook (Greenleaf), the definition:</p>
<p>a digit is significant provided it's known to within 4 possibilities</p>
<p>Examples using Atkinson's definition (the error is \(x_t - x_a\)):</p>
<p>(1) \(x_t = \pi = 3.1415926\ldots\) vs. \(x_a = 22/7 = 3.142857142857\ldots\) The absolute error = \(-.00126\ldots\), so \(|\text{error}| &lt; 5\times10^{-3}\), and since \(e = 0\), \(e - m = -3\) implies that \(m\) is 3.</p>
<p>(2) \(x_t = 2/9 = .222\ldots\) vs. \(x_a = .222\) The absolute error = \(.000222\ldots\), so \(|\text{error}| &lt; 5\times10^{-4}\), and since \(e = -1\), \(e - m = -4\) implies that \(m\) is 3.</p>
<p>(3) \(x_t = 23.496\) vs. \(x_a = 23.494\) The absolute error = \(.002\), so \(|\text{error}| &lt; 5\times10^{-3}\), and since \(e = 1\), \(e - m = -3\) implies that \(m\) is 4.</p>
<p>(4) \(x_t = .02144\) vs. \(x_a = .02138\) The absolute error = \(.00006 = .6\times10^{-4} &lt; 5\times10^{-4}\), and since \(e = -2\), \(e - m = -4\) implies that \(m\) is 2.</p>
<p>(5) \(x_t = 5\) vs. \(x_a = 4.995\) The absolute error = \(.005\), so \(|\text{error}| = 5\times10^{-3}\), and since \(e = 0\), \(e - m = -3\) implies that \(m\) is 3.</p>
<p>Atkinson's best \(m\)* satisfies \(e + \log_{10}5 - \log_{10}|\text{error}| - 1 &lt; m \le e + \log_{10}5 - \log_{10}|\text{error}|\).</p>
<p>Proof: Suppose that \(|\text{error}| \le 5\times10^{e-m}\). 
Then \(m \le \log_{10}5 + e - \log_{10}|\text{error}|\), and for the largest such \(m\) this gives \(\log_{10}5 + e - \log_{10}|\text{error}| - 1 &lt; m \le \log_{10}5 + e - \log_{10}|\text{error}|\), with \(m\) the unique integer satisfying \(5\times10^{e-(m+1)} &lt; |\text{error}| \le 5\times10^{e-m}\). Please note: \(m = \lfloor -\log_{10}\bigl(|\text{error}|/(5\times10^{e})\bigr) \rfloor\).</p>
<p>Suppose \(a &gt; 0\) and let \(m = \lfloor -\log_{10} a \rfloor\). Then \(-\log_{10} a = m + f\), for some \(0 \le f &lt; 1\), and \(-\log_{10} a - 1 &lt; m \le -\log_{10} a\). This implies \(10^{-m-1} &lt; a \le 10^{-m}\). So, if \(a = |\text{error}|/(5\times10^{e})\), then \(5\times10^{e-(m+1)} &lt; |\text{error}| \le 5\times10^{e-m}\), and thus no larger \(m\) can satisfy \(|\text{error}| \le 5\times10^{e-m}\).</p>
<p>Applying this to each of the examples above gives:</p>
<p>(1) \(2.6 &lt; m \le 3.6\), so \(m = 3\); \(\lfloor -\log_{10}|\text{relerr}| \rfloor = 3\)</p>
<p>(2) \(2.4 &lt; m \le 3.4\), so \(m = 3\); \(\lfloor -\log_{10}|\text{relerr}| \rfloor = 3\)</p>
<p>(3) \(3.4 &lt; m \le 4.4\), so \(m = 4\); \(\lfloor -\log_{10}|\text{relerr}| \rfloor = 4\)</p>
<p>(4) \(1.9 &lt; m \le 2.9\), so \(m = 2\); \(\lfloor -\log_{10}|\text{relerr}| \rfloor = 2\)</p>
<p>(5) \(2 &lt; m \le 3\), so \(m = 3\); \(\lfloor -\log_{10}|\text{relerr}| \rfloor = 3\)</p>
<p>*the largest integer \(m\) satisfying \(|\text{error}| \le 5\times10^{e-m}\).</p>
<p>Regarding Miscellaneous Calculators -</p>
<p>The TI-85 uses a 12+2-digit (2 internal) base 10 arithmetic with simple rounding, \(-999 \le e \le 999\).</p>
<p>The TI-83 uses a 12+2-digit (2 internal) base 10 arithmetic with simple rounding, \(-99 \le e \le 99\).</p>
<p>The TI-89 (&amp; Voyage 200) uses a 12+2-digit (2 internal) base 10 arithmetic with simple rounding, \(-999 \le e \le 999\).</p>
<p>The HP-48G uses a 12-digit (no internal) base 10 arithmetic with simple rounding, \(-499 \le e \le 499\).</p>
<p>Example regarding Matlab's "format hex":</p>
<p>&gt;&gt; format hex</p>
<p>&gt;&gt; 0.1</p>
<p>Yields 3fb999999999999a, which in binary is 0011 1111 1011 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1010. The first 0 is the algebraic sign (here a +), the next 11 bits code \(E = 1019\), which is related to the exponent \(e\) via \(e = E - 1023\) (here \(e = -4\)), and the final 52 bits code the mantissa (save for the 1 from normalization), which here is the exact mantissa \(1.100110011001\ldots\) rounded up to the \(2^{-52}\) place. Typing sym(0.1, 'e') gives 1/10 + eps/40, and Matlab is telling us the floated form of 0.1 in the 53-bit arithmetic is \(2^{-52}/40\) more than 1/10. Finally, typing sym(0.1, 'f') will give (as a fraction) the floated form of 0.1. 
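</p>
<p>The same decoding can be sketched outside Matlab. The following is an illustration in Python (a substitution made here, not part of the original notes), using only the standard struct and fractions modules; the helper name decode_double is hypothetical:</p>

```python
import struct
from fractions import Fraction


def decode_double(x):
    """Split an IEEE double into (sign bit, biased exponent E, 52-bit fraction field)."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]  # the 64 bits as an integer
    sign = bits >> 63
    E = (bits >> 52) & 0x7FF           # next 11 bits
    frac = bits & ((1 << 52) - 1)      # final 52 bits
    return sign, E, frac


hex_form = struct.pack(">d", 0.1).hex()        # big-endian hex, as in "format hex"
sign, E, frac = decode_double(0.1)
e = E - 1023                                    # unbiased exponent
mantissa = 1 + Fraction(frac, 2**52)            # restore the hidden leading 1
value = mantissa * Fraction(2) ** e             # exact value of the stored double
excess = value - Fraction(1, 10)                # amount by which fl(0.1) exceeds 1/10

print(hex_form, sign, E, e)
print(value)    # the floated form of 0.1 as an exact fraction
print(excess)   # compare with eps/40 = 2**-52 / 40
```

<p>Exact rational arithmetic is used so that the excess over 1/10 can be compared with eps/40 without further rounding.</p>
<p>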
Assume \(x\) and \(y\) are NMNs with \(xy &gt; 0\) and \(0 &lt; |y| &lt; |x|\,b^{-1}u\), for \(b\) the base of the arithmetic and \(u\) the unit round-off (\(= b^{1-s}\) if chopping, and \(.5\,b^{1-s}\) if rounding).</p>
<p>Claim: \(y\) has no effect if added to \(x\), i.e. \(x \oplus y = x\). Proof: Wlog, take \(x, y &gt; 0\) and let \(x = m\,b^{e}\) and \(y = n\,b^{f}\) with \(b &gt; m, n \ge 1\). If chopping is used, then \(y &lt; x\,b^{-1}u\) means that \(n\,b^{f} &lt; m\,b^{e-1}b^{1-s} = m\,b^{e-s}\), or \(b^{f} &lt; (m/n)\,b^{e-s} &lt; b^{e-s+1}\), implying \(f - e &lt; 1 - s\), or \(f - e \le -s\). Therefore \(x \oplus y = x\), since \(n\,b^{f-e} \le n\,b^{-s} &lt; b^{1-s}\), i.e. \(y\) contributes less than one unit in the \(b^{1-s}\) place of \(x\)'s mantissa and is chopped away. If rounding is used, then \(y &lt; x\,b^{-1}u\) means \(n\,b^{f} &lt; (m/2)\,b^{e-s}\). Therefore \(x \oplus y = x\), since \(n\,b^{f-e} &lt; (b/2)\,b^{-s}\), i.e. is less than ½ unit in the \(b^{1-s}\) place.</p>
<p>Examples: Let \(b = 10\) and \(s = 3\) with chopping. Take (these do fail the inequality for rounding) \(x = 1.24\) and \(y = .00123\). Then \(x \oplus y = 1.24\) (so \(y\) is 'ignored').</p>
<p>Now suppose that \(x\) and \(y\) are MNs and \(2 \ge y/x \ge 1/2\). Then</p>
<p>Claim: \(x \ominus y = x - y\), i.e. the subtraction is exact (assuming \(x\) and \(y\) are MNs and the exponent of the difference lies within the \(L\)/\(U\) range of the arithmetic).</p>
<p>Note: For any normalized mantissas \(m\) and \(n\), we have \(1/b &lt; n/m &lt; b\) and \(1/b &lt; m/n &lt; b\).</p>
<p>Proof: Wlog, take \(x, y &gt; 0\) and assume \(x &gt; y\). Let \(x = m\,b^{e}\) and \(y = n\,b^{f}\), with \(b &gt; m, n \ge 1\). We claim that \(f = e - 1\), \(e\), or \(e + 1\). Suppose \(f - e = k\). Then \(2^{-1} \le (n/m)\,b^{k} \le 2\), or \(m/(2n) \le b^{k} \le 2m/n\), or \(1/(2b) &lt; b^{k} &lt; 2b\). Taking logarithms base \(b\) gives \(-1 - \log_b 2 &lt; k &lt; 1 + \log_b 2\), and since \(\log_b 2 \le 1\), \(k = -1\), 0, or 1. Since \(x &gt; y\), \(f = e + 1\) is not possible.</p>
<p>We now consider the cases \(f - e = 0\) and \(f - e = -1\).</p>
<p>If \(f - e = 0\), then \(x - y = (m - n)\,b^{e} = b^{1-s}(M - N)\,b^{e}\) for \(M\) and \(N\) the 'integer mantissas' \(m\,b^{s-1}\) and \(n\,b^{s-1}\). Since \(M - N\) falls between 1 and \((b^{s} - 1) - b^{s-1}\), the difference is expressible as an at most \(s\)-digit base-\(b\) integer (integers in \([1, b^{s} - 1]\) may all be so expressed), implying that \(m - n\) is expressible as an at most \(s\)-digit fractional mantissa, i.e. \(x - y\) is exact.</p>
<p>Finally, if \(f - e = -1\), then \(x - y = b^{e-s}(bM - N)\) and, since \(x &gt; y\), \(bM - N \ge 1\). From \(\tfrac12 \le (n/m)\,b^{-1}\), it follows that \(2N \ge bM\). 
Therefore \(bM - N \le N \le b^{s} - 1\), so \(bM - N\) is an at most \(s\)-digit base-\(b\) integer and \(x - y\) sits in the arithmetic (provided that the associated exponent falls in the range of the arithmetic).</p>
<p>Examples: Let \(b = 10\) and \(s = 3\). Now take</p>
<p>\(x = 4.24\) and \(y = 2.60\). Then \(y/x = .6132\ldots\) and \(x - y = 1.64\) (is exact)</p>
<p>\(x = 2.46\) and \(y = 4.28\). Then \(y/x = 1.7398\ldots\) and \(x - y = -1.82\) (is exact)</p>
<p>\(x = 5.24\) and \(y = 12.78\). Then \(y/x = 2.4389\ldots\) and \(x - y = -7.54\) (is exact)</p>
<p>\(x = 5.24\) and \(y = 62.78\). Then \(y/x = 11.9809\ldots\) and \(x - y = -57.5\) (not exact)</p>
<p>Regarding the LOS (loss of significance) example in Atkinson on page 49:</p>
<p>\(x_t = \cos(.01) = 0.99995000041666\ldots\) vs. \(x_a = 0.9999500004\) (computed). The absolute error \(&lt; 1.667\times10^{-11} &lt; 5\times10^{-11}\), and since \(e = -1\), \(e - m = -11\) and \(x_a\) has the full 10 significant digits of accuracy allowed by the arithmetic.</p>
<p>\(x_t = 1 - \cos(.01) = 0.00004999958333\ldots\) vs. \(x_a = 0.0000499996\) (computed). Then \(|\text{absolute error}| &lt; 1.667\times10^{-11} &lt; 5\times10^{-11}\), and since \(e = -5\), \(e - m = -11\) and \(x_a\) has only 6 significant digits of accuracy, a loss of 4 in this arithmetic.</p>
<p>\(x_t = (1 - \cos(.01))/(.01)^{2} = 0.4999958333\ldots\) vs. \(x_a = 0.499996\) (computed). Then \(|\text{absolute error}| &lt; 1.667\times10^{-7} &lt; 5\times10^{-7}\), and since \(e = -1\), \(e - m = -7\) and \(x_a\) has just 6 significant digits of accuracy, a loss of 4 in this arithmetic.</p>
<p>Remedy #1: Substitute the Taylor polynomial \(1 - x^{2}/2! + x^{4}/4! - x^{6}/6!\) for \(\cos(x)\). Then the problematic expression becomes \(\tfrac12 - x^{2}/24 + x^{4}/720\).</p>
<p>Remedy #2: Rewrite it as \(\sin^{2}(.01)/[(1 + \cos(.01))(.01)^{2}]\).</p>
<p>Suppose that \(f^{*}(x)\) is the floated form of \(f(x)\). Claim: If \(|f^{*}(x) - f(x)| &lt; 1\) ulp and \(f(x)\) is a MN, then \(f^{*}(x) = f(x)\). Proof: Assume \(f^{*}(x) = 2^{e}m^{*}\) and \(f(x) = 2^{e}m\) with \(1 \le m &lt; 2\). Then \(|f^{*}(x) - f(x)| = 2^{e}|m^{*} - m| &lt; 2^{e}\,2^{1-n}\), implying \(|m^{*} - m| &lt; 2^{1-n}\); since \(m^{*}\) and \(m\) are both multiples of \(2^{1-n}\), \(m^{*} = m\).</p>
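<p>The two remedies for the loss-of-significance example can be compared numerically. The following is a rough sketch in Python's 53-bit binary arithmetic, which is an assumption made here (Atkinson's example uses a 10-digit decimal arithmetic), so the cancellation only becomes severe for much smaller \(x\); the function names naive, taylor and sin_form are hypothetical:</p>

```python
import math


def naive(x):
    """(1 - cos x) / x**2 evaluated directly; cancels badly for small |x|."""
    return (1.0 - math.cos(x)) / (x * x)


def taylor(x):
    """Remedy #1: 1/2 - x**2/24 + x**4/720, adequate only for small |x|."""
    x2 = x * x
    return 0.5 - x2 / 24.0 + x2 * x2 / 720.0


def sin_form(x):
    """Remedy #2: sin(x)**2 / ((1 + cos x) * x**2), which avoids the subtraction."""
    s = math.sin(x)
    return (s * s) / ((1.0 + math.cos(x)) * (x * x))


for x in (1e-2, 1e-5, 1e-8):
    print(x, naive(x), taylor(x), sin_form(x))
```

<p>For \(x\) near \(10^{-8}\) the direct form is expected to lose essentially every significant digit, because \(\cos(x)\) rounds to a value within one unit in the last place of 1, while both remedies stay close to ½.</p>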