
Lecture 4 Floating Point Representation and Errors

J. Chaudhry (Zeb)

Department of Mathematics and Statistics University of New Mexico

J. Chaudhry (Zeb) (UNM) Math/CS 375 1 / 21

What we'll do:

Machine precision or machine epsilon
Floating point errors
Catastrophic cancellation


Take x = 1.0 and add 1/2, 1/4, ..., 2^{-i}. Each add sets one bit of the 52-bit fraction (the leading 1 is the hidden bit):

Hidden bit ↓   ← 52 bits →
1 . 1 0 0 0 0 0 0 0 0 0 0 ... 0
1 . 0 1 0 0 0 0 0 0 0 0 0 ... 0
1 . 0 0 1 0 0 0 0 0 0 0 0 ... 0
......
1 . 0 0 0 0 0 0 0 0 0 0 0 ... 1      (x = 1 + 2^{-52})
1 . 0 0 0 0 0 0 0 0 0 0 0 ... 0      (2^{-53} falls off the end)

Oops! We use fl(x) to denote the floating point machine number representing x: fl(1 + 2^{-52}) ≠ 1, but fl(1 + 2^{-53}) = 1.


Machine Epsilon

The machine epsilon ε_m is the smallest positive number such that

fl(1 + ε_m) > 1

The double precision machine epsilon is ε_m = 2^{-52} ≈ 2.22 × 10^{-16}. The single precision machine epsilon is 2^{-23} ≈ 1.19 × 10^{-7}.

>> eps
ans = 2.2204e-16

>> eps('single')
ans = 1.1921e-07
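The same value can be recovered outside MATLAB. A minimal Python sketch: halve a candidate until adding half of it to 1 rounds back to 1.

```python
import sys

# Halve eps until 1 + eps/2 is rounded back to 1; the last surviving
# value is the double precision machine epsilon 2^-52.
eps = 1.0
while 1.0 + eps / 2.0 > 1.0:
    eps /= 2.0

print(eps)                             # 2.220446049250313e-16
print(eps == 2.0 ** -52)               # True
print(eps == sys.float_info.epsilon)   # True
```

Python's `sys.float_info.epsilon` reports the same constant directly.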

Floating Point Errors

Not all reals can be exactly represented as machine floating point numbers. Then what? Round-off error.
IEEE rounding options: round to next nearest FP (preferred), round toward 0, round up, and round down.

Let x− and x+ be the two machine floating point numbers closest to x

round to nearest: round(x) = x− or x+, whichever is closest

round toward 0: round(x) = x− or x+, whichever is between 0 and x

round toward −∞ (down): round(x) = x−

round toward +∞ (up): round(x) = x+

Floating Point Errors

How big is this error? Suppose x is closer to x−, and write (with a 24-bit fraction)

x  = (0.1 b2 b3 ... b24 b25 b26 ...)_2 × 2^m
x− = (0.1 b2 b3 ... b24)_2 × 2^m
x+ = [(0.1 b2 b3 ... b24)_2 + 2^{−24}] × 2^m

Then

|x − x−| ≤ |x+ − x−| / 2 = 2^{m−25}

|x − x−| / |x| ≤ 2^{m−25} / (1/2 × 2^m) = 2^{−24} = ε_m / 2
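The same ε_m/2 bound holds for double precision with round to nearest (there ε_m/2 = 2^{-53}), and it can be checked exactly with rational arithmetic. A Python sketch using `fractions.Fraction`:

```python
from fractions import Fraction

x = Fraction(1, 10)        # the real number 0.1 (not a machine number)
fx = float(x)              # fl(0.1): 0.1 rounded to the nearest double
rel = abs(Fraction(fx) - x) / x      # exact relative rounding error
print(float(rel))                    # ~5.6e-17
print(rel <= Fraction(1, 2 ** 53))   # True: within eps_m / 2
```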

Floating Point Arithmetic

Problem: The set of representable machine numbers is FINITE. So not all math operations are well defined! Basic algebra breaks down in floating point arithmetic

Example

a + (b + c) ≠ (a + b) + c
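This failure of associativity is easy to trigger. A small Python illustration built around ε_m:

```python
a, b, c = 1.0, 2.0 ** -53, 2.0 ** -53

left = a + (b + c)     # b + c = 2^-52 is large enough to survive the add to 1
right = (a + b) + c    # each 2^-53 is rounded away against 1
print(left == right)   # False
print(left, right)     # 1.0000000000000002 1.0
```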

Floating Point Arithmetic

Rule 1.

fl(x) = x(1 + ε), where |ε| ≤ ε_m

Rule 2. For all operations ⊙ (one of +, −, ∗, /)

fl(x ⊙ y) = (x ⊙ y)(1 + ε), where |ε| ≤ ε_m

Rule 3. For +, ∗ operations: fl(a ⊙ b) = fl(b ⊙ a)

There were many discussions about what conditions/rules floating point arithmetic should satisfy. The IEEE 754 standard is the set of rules adopted by most CPU manufacturers.
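Rule 2 can be verified exactly for a single operation with rational arithmetic. A Python sketch for one double precision multiply; the bound 2^{-53} used below is the round-to-nearest half-ulp, which is even tighter than ε_m:

```python
from fractions import Fraction

x, y = 0.1, 0.3
exact = Fraction(x) * Fraction(y)   # exact product of the two machine numbers
computed = x * y                    # fl(x * y)
delta = abs(Fraction(computed) - exact) / exact   # the |eps| in Rule 2
print(delta <= Fraction(1, 2 ** 53))              # True
```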

Floating Point Arithmetic

Consider the sum of 3 numbers: y = a + b + c.

Done as fl(fl(a + b) + c)

η = fl(a + b) = (a + b)(1 + ε1)

y1 = fl(η + c) = (η + c)(1 + ε2)

   = [(a + b)(1 + ε1) + c] (1 + ε2)

   = [(a + b + c) + (a + b)ε1] (1 + ε2)

   = (a + b + c) [1 + (a + b)/(a + b + c) · ε1] (1 + ε2)

So, disregarding the high order term ε1ε2:

fl(fl(a + b) + c) = (a + b + c)(1 + ε3) with ε3 ≈ (a + b)/(a + b + c) · ε1 + ε2

Floating Point Arithmetic

If we redid the computation as y2 = fl(a + fl(b + c)) we would find

fl(a + fl(b + c)) = (a + b + c)(1 + ε4) with ε4 ≈ (b + c)/(a + b + c) · ε1 + ε2

Main conclusion:

The first error is amplified by the factor (a + b)/y in the first case and (b + c)/y in the second case.

In order to sum n numbers more accurately, it is better to start with the small numbers first. [However, sorting before adding is usually not worth the cost!]
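The small-numbers-first advice can be demonstrated directly. A Python sketch with one large term and many tiny ones (the term sizes are chosen purely for illustration):

```python
n = 1_000_000
terms = [1.0] + [1e-16] * n   # one large term followed by n tiny ones

large_first = 0.0
for t in terms:               # 1.0 enters first; every 1e-16 is then rounded away
    large_first += t

small_first = 0.0
for t in reversed(terms):     # tiny terms accumulate to ~1e-10 before 1.0 arrives
    small_first += t

print(large_first)            # 1.0 exactly
print(small_first)            # ~1.0000000001
```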

Significant Digits

Significant digits: the digits in a number warranted by the accuracy of the means of measurement.

Example: Measurement from a ruler which goes from 0.0 to 9.0 (in cm):

x = 0.23 × 10^1 cm: two significant digits. 2 is the most significant digit, 3 is the least significant digit.

x = 0.40 × 10^1 cm: still two significant digits.

x = 0.400 × 10^1 cm: the last zero is meaningless. From now on, we will never write it.

Roundoff errors and floating-point arithmetic

One of the most serious problems in floating point arithmetic is cancellation. If two large and close-by numbers are subtracted, the result (a small number) carries very few accurate digits (why?). This is fine if the result is not reused. But if the result is part of another calculation, there may be a serious problem.

Example
Roots of the equation x^2 + 2px − q = 0. Assume we want the root with smallest absolute value when p > 0:

y = −p + √(p^2 + q) = q / (p + √(p^2 + q))

Loss of Significance

Listing 1: alg 1
1: s := p^2
2: t := s + q
3: u := √t
4: y := −p + u

Listing 2: alg 2
1: s := p^2
2: t := s + q
3: u := √t
4: v := p + u
5: y := q/v

Step 4 of Algorithm 1 can have severe cancellation when p >> q and p is positive.

>> p = 10000;
>> q = 1;
>> y = -p + sqrt(p^2 + q)
y = 5.000000055588316e-05

>> y2 = q/(p + sqrt(p^2 + q))
y2 = 4.999999987500000e-05

>> x = y;
>> x^2 + 2*p*x - q
ans = 1.361766321927860e-08

>> x = y2;
>> x^2 + 2*p*x - q
ans = -1.110223024625157e-16
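The same experiment ported to Python (whose doubles match MATLAB's), comparing the residual of each computed root in x^2 + 2px − q:

```python
import math

p, q = 10000.0, 1.0
y  = -p + math.sqrt(p ** 2 + q)        # Algorithm 1: subtracts nearby numbers
y2 = q / (p + math.sqrt(p ** 2 + q))   # Algorithm 2: no subtraction

r1 = y ** 2 + 2 * p * y - q            # residuals: how well each computed root
r2 = y2 ** 2 + 2 * p * y2 - q          # satisfies the original equation
print(abs(r1))   # ~1e-8
print(abs(r2))   # ~1e-16
```

Algorithm 2's residual is about eight orders of magnitude smaller.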

Catastrophic Cancellation

Subtracting c = a − b will result in large relative error if a ≈ b. For example:

a = x.xxxx xxxx xxx1 ssss...   (the trailing digits ssss... were already lost
b = x.xxxx xxxx xxx0 tttt...    to finite precision, likewise tttt...)

Then

a − b = 0.0000 0000 0001 ???? ????

The matching leading digits cancel, and the surviving digits sit exactly where precision was lost.
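A concrete Python instance of this diagram: subtracting 1.0 from 1 + 10^{-15} leaves a result whose leading surviving digits are rounding artifacts.

```python
a = 1.0 + 1e-15        # stored with a small representation error
b = 1.0
c = a - b              # the subtraction itself is exact, but...
print(c)               # 1.1102230246251565e-15, not 1e-15
rel_err = abs(c - 1e-15) / 1e-15
print(rel_err)         # ~0.11: the representation error now dominates
```

Relative error near 11%, from inputs each accurate to 16 digits.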

Summary

subtraction: c = a − b loses significance if a ≈ b
catastrophic: caused by a single operation, not by an accumulation of errors
can often be fixed by mathematical rearrangement

Loss of Significance

Example: x = 0.37214 48693 and y = 0.37202 14371. What is the relative error in x − y on a computer with 5 decimal digits of accuracy?

|(x − y) − (x̄ − ȳ)| / |x − y|
  = |0.37214 48693 − 0.37202 14371 − 0.37214 + 0.37202| / |0.37214 48693 − 0.37202 14371|
  ≈ 3 × 10^{−2}
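A quick Python check of this ratio (x̄, ȳ stand for x and y chopped to the machine's 5 decimal digits):

```python
x, y = 0.3721448693, 0.3720214371    # the exact values
xbar, ybar = 0.37214, 0.37202        # their 5-digit machine versions

rel = abs((x - y) - (xbar - ybar)) / abs(x - y)
print(rel)   # ~0.028, i.e. about 3e-2
```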

Loss of Significance

Loss of Precision Theorem
Let x and y be (normalized) floating point machine numbers with x > y > 0. If

2^{−p} ≤ 1 − y/x ≤ 2^{−q}

for positive integers p and q, then the number of significant binary digits lost in calculating x − y is between q and p.

Loss of Significance

Example
Consider x = 37.593621 and y = 37.584216.

2^{−12} < 1 − y/x = 0.00025 01754 < 2^{−11}

So we lose 11 or 12 bits in the computation of x − y. Yikes!

Example
Back to the other example (5 digits): x = 0.37214 and y = 0.37202.

10^{−4} < 1 − y/x = 0.00032 < 10^{−3}

So we lose 3 or 4 decimal digits in the computation of x − y. Here, x − y = 0.00012, which has only 1 significant digit that we can be sure about.
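Both examples of the theorem can be checked numerically. A Python sketch; the log2 line just locates 1 − y/x between the two powers of two:

```python
import math

# Binary example: 11 or 12 bits lost
x, y = 37.593621, 37.584216
r = 1.0 - y / x
print(2.0 ** -12 < r < 2.0 ** -11)   # True
print(-math.log2(r))                  # ~11.96

# Decimal example: 3 or 4 decimal digits lost
x2, y2 = 0.37214, 0.37202
r2 = 1.0 - y2 / x2
print(1e-4 < r2 < 1e-3)               # True
```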

Loss of Significance

So what to do? Mainly rearrangement.

f(x) = √(x^2 + 1) − 1

Problem at x ≈ 0. One type of fix:

f(x) = (√(x^2 + 1) − 1) · (√(x^2 + 1) + 1) / (√(x^2 + 1) + 1) = x^2 / (√(x^2 + 1) + 1)

no subtraction!
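The effect of the rearrangement near x ≈ 0, sketched in Python:

```python
import math

def f_naive(x):
    # sqrt(x^2 + 1) - 1: the subtraction cancels catastrophically near 0
    return math.sqrt(x * x + 1.0) - 1.0

def f_fixed(x):
    # x^2 / (sqrt(x^2 + 1) + 1): algebraically equal, no subtraction
    return x * x / (math.sqrt(x * x + 1.0) + 1.0)

x = 1e-8
print(f_naive(x))   # 0.0: all significance lost
print(f_fixed(x))   # 5e-17, essentially the true value x^2 / 2
```

At x = 10^{-8}, x^2 + 1 rounds to 1.0, so the naive form returns exactly 0; the rearranged form keeps full accuracy.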