
Introduction & Floating Point (NMMC §1.4.1, Atkinson §1.2)

1 Floating-point representation, IEEE 754. In binary, $11.01 = 1 \times 2^1 + 1 \times 2^0 + 0 \times 2^{-1} + 1 \times 2^{-2}$. The double-precision floating-point format has 64 bits:

$$\pm, \quad a_1, \ldots, a_{11}, \quad b_1, \ldots, b_{52}$$

The associated number is
$$\# = \pm 1.b_1 \cdots b_{52} \times 2^{a_1 \cdots a_{11} - 1023}.$$
If all the $a_i$ are zero, then the number is 'subnormal' and the representation is
$$\# = \pm 0.b_1 \cdots b_{52} \times 2^{-1022}.$$
If the $a_i$ are all 1 and the $b_i$ are all 0, then it's $\pm\infty$ ($\pm 1/0$). If the $a_i$ are all 1 and the $b_i$ are not all 0, then it's NaN ($\pm 0/0$). Machine epsilon is the difference between 1 and the smallest representable number larger than 1, i.e. $\epsilon = 2^{-52} \approx 2 \times 10^{-16}$. The smallest and largest positive normalized numbers are $\approx 10^{\pm 308}$. Numbers larger than the max are set to $\infty$.
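As a sanity check, the bit fields can be inspected directly; the following is a minimal Python sketch (the helper name `bits` is ours, using only the standard library) that unpacks a double into its sign, exponent, and fraction fields:

```python
import struct

def bits(x):
    """Split the 64 bits of a double into (sign, exponent, fraction) strings."""
    b = format(struct.unpack('>Q', struct.pack('>d', x))[0], '064b')
    return b[0], b[1:12], b[12:]

print(bits(1.0))           # exponent field 01111111111 = 1023, so 2^0; fraction all zeros
print(bits(float('inf')))  # exponent all ones, fraction all zeros
print(bits(float('nan')))  # exponent all ones, fraction nonzero
print(bits(5e-324))        # smallest subnormal: exponent field all zeros
print(2.0**-52)            # machine epsilon, approx. 2.22e-16
```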

2 Rounding & Arithmetic. Round-to-nearest rounds a number to the nearest floating-point number. For nonzero numbers $x$ within the min/max range, the relative error due to rounding satisfies
$$\frac{|\mathrm{round}(x) - x|}{|x|} \le 2^{-53} \approx 10^{-16} = u,$$
which is half of machine epsilon.
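To see the bound in action, here is a sketch using Python's exact rational arithmetic: round a non-representable number such as 1/3 and measure the relative error exactly.

```python
from fractions import Fraction

x = Fraction(1, 3)      # the exact value 1/3
fl_x = Fraction(1 / 3)  # nearest double to 1/3, converted exactly
rel_err = abs(fl_x - x) / x
print(float(rel_err))                 # approx. 1.85e-17
print(rel_err <= Fraction(1, 2**53))  # True: within u = 2^-53
```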

Floating-point arithmetic: add, subtract, multiply, divide. IEEE 754 guarantees that if you start with two exactly represented floating-point numbers $x$ and $y$ and compute any arithmetic operation 'op', the magnitude of the relative error in the result is at most half of machine epsilon. That is, there is some $\delta$ with $|\delta| \le u$ such that
$$\mathrm{fl}(x \text{ op } y) = (x \text{ op } y)(1 + \delta).$$
The exception is underflow/overflow, where the result is smaller or larger than can be represented.
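The guarantee can be checked directly. In this sketch, `Fraction` carries the exact values of the stored doubles, so $\delta$ is computed exactly (the operands 0.1 and 0.2 are arbitrary; as stored doubles they are exact binary values):

```python
from fractions import Fraction

x, y = 0.1, 0.2                    # the stored doubles (exact binary values)
exact = Fraction(x) + Fraction(y)  # exact sum of the stored operands
computed = Fraction(x + y)         # floating-point sum, converted exactly
delta = (computed - exact) / exact
print(abs(delta) <= Fraction(1, 2**53))  # True: fl(x + y) = (x + y)(1 + delta), |delta| <= u
```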

Example: $1 + 2^{-53} = 1$ in floating point, even though both numbers are exactly representable in double precision. Consider $1 + 2^{-53} + 2^{-53} = 1 + 2^{-52}$. If you add left to right you'll get 1; if you add right to left you'll get the right answer. Order matters.
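A two-line check in Python (any IEEE double environment behaves the same):

```python
left_to_right = (1.0 + 2.0**-53) + 2.0**-53  # first addition already rounds to 1.0
right_to_left = 1.0 + (2.0**-53 + 2.0**-53)  # 2^-53 + 2^-53 = 2^-52 exactly
print(left_to_right == 1.0)             # True
print(right_to_left == 1.0 + 2.0**-52)  # True: the right answer
```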

Catastrophic cancellation: if you subtract two numbers that are very close, the result will have very few significant digits. Consider
$$r = p - \sqrt{p^2 + q} = \frac{-q}{p + \sqrt{p^2 + q}}, \qquad p = 12345678, \quad q = 1.$$
In double precision the left formula produces $r \approx -4.0978 \times 10^{-8}$, while the right formula produces $r \approx -4.0500 \times 10^{-8}$. The left formula suffers from catastrophic cancellation, leaving very few significant digits.
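The two algebraically equivalent formulas, evaluated in double precision (a minimal sketch reproducing the numbers above):

```python
import math

p, q = 12345678.0, 1.0
r_left = p - math.sqrt(p**2 + q)          # catastrophic cancellation
r_right = -q / (p + math.sqrt(p**2 + q))  # rationalized form, no cancellation
print(r_left)   # approx. -4.0978e-08 (few correct digits)
print(r_right)  # approx. -4.0500e-08 (accurate)
```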

The easiest way to check whether your results are sensitive to roundoff errors is to compute the solution using different levels of precision (single, double, quad).
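For instance, rerunning the cancellation-prone formula above in single and double precision makes the sensitivity obvious; a sketch assuming NumPy is available:

```python
import numpy as np

p, q = 12345678.0, 1.0
for dtype in (np.float32, np.float64):
    p_, q_ = dtype(p), dtype(q)
    r = p_ - np.sqrt(p_**2 + q_)
    print(dtype.__name__, r)  # the two answers disagree wildly => roundoff-sensitive
```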

3 Types of error analysis. Let x be the input to some problem (e.g. the coefficient matrix or the RHS) and y be the solution.

• Forward error: Let $y$ be the true solution and $\tilde y$ the computed solution; one attempts to bound $|y - \tilde y|$.

• Backward error: One attempts to show that $\tilde y$ is the exact solution to a perturbed problem with data $\tilde x$, and then provide bounds for $|x - \tilde x|$.

Example: Solve $3y = x$ where $x = 2$. The computed solution will be $\tilde y = (2/3)(1 + \delta)$ for some $|\delta| \le u$. The forward error estimate is $|y - \tilde y| \le (2/3)u$. The backward error estimate says that $\tilde y$ is the exact solution for the perturbed data $\tilde x = x(1 + \delta)$, with error $|x - \tilde x| \le |x| u$.
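Both estimates can be verified exactly with rational arithmetic; a minimal sketch:

```python
from fractions import Fraction

u = Fraction(1, 2**53)
x = Fraction(2)
y = Fraction(2, 3)            # true solution of 3y = x
ytilde = Fraction(2.0 / 3.0)  # computed solution, converted exactly
print(abs(y - ytilde) <= Fraction(2, 3) * u)  # forward error bound holds
xtilde = 3 * ytilde           # ytilde exactly solves 3y = xtilde
print(abs(x - xtilde) <= abs(x) * u)          # backward error bound holds
```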
