Lecture 4 Floating Point Representation and Errors
J. Chaudhry (Zeb)
Department of Mathematics and Statistics University of New Mexico
J. Chaudhry (Zeb) (UNM) Math/CS 375

What we'll do:
- Machine precision (machine epsilon)
- Floating point errors
- Catastrophic cancellation
Take x = 1.0 and add 1/2, 1/4, ..., 2^-i. The bit being added marches across the mantissa (hidden bit, then 52 stored bits) until it falls off the end:

Hidden bit   52-bit mantissa
1 . 1 0 0 ... 0 0 0    (1 + 2^-1)
1 . 0 1 0 ... 0 0 0    (1 + 2^-2)
1 . 0 0 1 ... 0 0 0    (1 + 2^-3)
...
1 . 0 0 0 ... 0 0 1    (1 + 2^-52)
1 . 0 0 0 ... 0 0 0    (1 + 2^-53: the added bit no longer fits)
Oops! Write fl(x) for the floating point machine number representing the real number x. Then fl(1 + 2^-52) ≠ 1, but fl(1 + 2^-53) = 1.
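The slides use MATLAB; the same behavior is easy to check in any IEEE double precision language, e.g. this small Python sketch:

```python
# Double precision stores 52 mantissa bits, so 1 + 2^-52 is representable,
# but the bit in 1 + 2^-53 falls off the end and the sum rounds back to 1.
one_plus_small = 1.0 + 2.0**-52
one_plus_smaller = 1.0 + 2.0**-53

print(one_plus_small > 1.0)     # True
print(one_plus_smaller == 1.0)  # True
```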
Machine Epsilon
The machine epsilon ε_m is the smallest positive number such that

fl(1 + ε_m) > 1

The double precision machine epsilon is about 2^-52. The single precision machine epsilon is about 2^-23.
>> eps
ans = 2.2204e-16

>> eps('single')
ans = 1.1921e-07
Floating Point Errors
Not all reals can be exactly represented as a machine floating point number. Then what? Round-off error. The IEEE options:
- round to nearest FP number (preferred)
- round toward 0
- round up
- round down
Let x+ and x− be the two floating point machine numbers closest to x
round to nearest: round(x) = x− or x+, whichever is closest
round toward 0: round(x) = x− or x+, whichever is between 0 and x
round toward −∞ (down): round(x) = x−

round toward +∞ (up): round(x) = x+
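The four directions can be sketched with Python's decimal module, which exposes the same rounding modes (this is a 4-decimal-digit illustration, not actual binary floating point; the mode names are decimal's, mapped to the IEEE options above):

```python
# Round one value under each of the four IEEE-style rounding directions.
from decimal import Decimal, getcontext, ROUND_HALF_EVEN, ROUND_DOWN, ROUND_CEILING, ROUND_FLOOR

getcontext().prec = 4                 # keep 4 significant digits
x = Decimal("3.14159")
for mode in (ROUND_HALF_EVEN,         # round to nearest (preferred)
             ROUND_DOWN,              # round toward 0
             ROUND_CEILING,           # round toward +inf (up)
             ROUND_FLOOR):            # round toward -inf (down)
    getcontext().rounding = mode
    print(mode, +x)                   # unary + re-rounds x in the current context
```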
Floating Point Errors
How big is this error? Suppose x lies between the machine numbers x− and x+ and is closer to x−:

x  = (0.1 b_2 b_3 ... b_24 b_25 b_26 ...)_2 × 2^m
x− = (0.1 b_2 b_3 ... b_24)_2 × 2^m
x+ = ((0.1 b_2 b_3 ... b_24)_2 + 2^-24) × 2^m

Then

|x − x−| ≤ |x+ − x−| / 2 = 2^(m−25)

and the relative error satisfies

|x − x−| / |x| ≤ 2^(m−25) / (1/2 × 2^m) = 2^-24 = ε_m / 2

(here the 24-bit mantissa corresponds to single precision, where ε_m = 2^-23).
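An empirical check of this bound, sketched in Python (so in double precision, where ε_m/2 = 2^-53; the 24-bit example above gives 2^-24 instead). Exact rational arithmetic via Fraction measures the true rounding error:

```python
# Verify |x - fl(x)| / |x| <= eps/2 for round-to-nearest doubles.
import random
from fractions import Fraction

u = Fraction(1, 2**53)          # eps/2 for IEEE double precision
random.seed(0)
worst = Fraction(0)
for _ in range(1000):
    exact = Fraction(random.randint(1, 10**12), random.randint(1, 10**12))
    rel = abs(Fraction(float(exact)) - exact) / exact
    worst = max(worst, rel)

print(worst <= u)               # True: rounding error never exceeds eps/2
```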
Floating Point Arithmetic
Problem: The set of representable machine numbers is FINITE. So not all math operations are well defined! Basic algebra breaks down in floating point arithmetic
Example
a + (b + c) ≠ (a + b) + c
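A one-line Python demonstration of this non-associativity (the classic 0.1/0.2/0.3 example, not from the slides):

```python
# The two groupings round differently, so the results differ.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)   # False: left is 0.6000000000000001, right is 0.6
```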
Floating Point Arithmetic
Rule 1.

fl(x) = x(1 + ε), where |ε| ≤ ε_m

Rule 2. For each operation ⊙ (one of +, −, ∗, /),

fl(x ⊙ y) = (x ⊙ y)(1 + ε), where |ε| ≤ ε_m

Rule 3. For the + and ∗ operations,

fl(a ⊙ b) = fl(b ⊙ a)
There was much debate about which conditions/rules floating point arithmetic should satisfy. The IEEE standard is the set of rules now adopted by most CPU manufacturers.
Floating Point Arithmetic
Consider the sum of 3 numbers: y = a + b + c.
Done as fl(fl(a + b) + c):

η = fl(a + b) = (a + b)(1 + ε_1)

y1 = fl(η + c) = (η + c)(1 + ε_2)
   = [(a + b)(1 + ε_1) + c](1 + ε_2)
   = [(a + b + c) + (a + b)ε_1](1 + ε_2)
   = (a + b + c)[1 + (a + b)/(a + b + c) · ε_1(1 + ε_2) + ε_2]

So, disregarding the higher order term ε_1 ε_2,

fl(fl(a + b) + c) = (a + b + c)(1 + ε_3) with ε_3 ≈ (a + b)/(a + b + c) · ε_1 + ε_2
Floating Point Arithmetic
If we redid the computation as y2 = fl(a + fl(b + c)) we would find

fl(a + fl(b + c)) = (a + b + c)(1 + ε_4) with ε_4 ≈ (b + c)/(a + b + c) · ε_1 + ε_2

Main conclusion:
The first error is amplified by the factor (a + b)/y in the first case and (b + c)/y in the second case.
In order to sum n numbers more accurately, it is better to start with the small numbers first. [However, sorting before adding is usually not worth the cost!]
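The effect of summation order is easy to see in a small Python sketch (an illustrative setup, not from the slides): one large term plus many tiny ones.

```python
# Added large-first, each tiny term is below half an ulp of the running sum
# and vanishes; added small-first, the tiny terms accumulate before meeting
# the large one.
terms = [1.0] + [1e-16] * 10_000

big_first = 0.0
for t in terms:              # 1.0 first, then the tiny terms
    big_first += t

small_first = 0.0
for t in reversed(terms):    # tiny terms first, 1.0 last
    small_first += t

print(big_first)             # 1.0 exactly: every tiny term was lost
print(small_first)           # close to the true sum 1.000000000001
```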
Significant Digits
Significant digits: the digits in a number warranted by the accuracy of the means of measurement.

Example: measurement from a ruler which goes from 0.0 to 9.0 (in cm):

x = .23 × 10^1 cm — two significant digits; 2 is the most significant digit, 3 is the least significant digit

x = .40 × 10^1 cm — still two significant digits

x = .400 × 10^1 cm — the last zero is meaningless; from now on, we will never write it
Roundoff Errors and Floating-Point Arithmetic
One of the most serious problems in floating point arithmetic is cancellation. If two large, nearly equal numbers are subtracted, the result (a small number) carries very few accurate digits (why?). This is fine if the result is not reused; but if the result feeds into another calculation, there may be a serious problem.
Example. Roots of the equation x^2 + 2px − q = 0. Assume we want the root of smallest absolute value when p > 0:

y = −p + sqrt(p^2 + q) = q / (p + sqrt(p^2 + q))
Loss of Significance
Listing 1: alg 1
1: s := p^2
2: t := s + q
3: u := sqrt(t)
4: y := −p + u

Listing 2: alg 2
1: s := p^2
2: t := s + q
3: u := sqrt(t)
4: v := p + u
5: y := q/v

Step 4 of Algorithm 1 can have severe cancellation when p ≫ q and p is positive.
>> p = 10000;
>> q = 1;
>> y = -p + sqrt(p^2 + q)
y = 5.000000055588316e-05

>> y2 = q/(p + sqrt(p^2 + q))
y2 = 4.999999987500000e-05

>> x = y;
>> x^2 + 2*p*x - q
ans = 1.361766321927860e-08

>> x = y2;
>> x^2 + 2*p*x - q
ans = -1.110223024625157e-16
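The same experiment, sketched in Python: the residual of x^2 + 2px − q shows which computed root is more accurate.

```python
# Compare the cancelling and rearranged formulas for the small root.
import math

p, q = 10000.0, 1.0
y_bad = -p + math.sqrt(p * p + q)        # Algorithm 1: catastrophic cancellation
y_good = q / (p + math.sqrt(p * p + q))  # Algorithm 2: rearranged, no subtraction

r_bad = y_bad**2 + 2 * p * y_bad - q     # residual of the quadratic
r_good = y_good**2 + 2 * p * y_good - q

print(abs(r_good) < abs(r_bad))          # True: the rearranged form wins
```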
Catastrophic Cancellation
Subtracting c = a − b will result in large error if a ≈ b. For example
a = x.xxxx xxxx xxx1 | ssss ...   ← digits beyond finite precision are lost
b = x.xxxx xxxx xxx0 | tttt ...   ← digits beyond finite precision are lost

Stored to finite precision and subtracted:

    x.xxxx xxxx xxx1
−   x.xxxx xxxx xxx0
=   0.0000 0000 0001 ???? ????   ← the trailing digits are lost precision
Summary
- subtraction: c = a − b if a ≈ b
- catastrophic: caused by a single operation, not by an accumulation of errors
- can often be fixed by mathematical rearrangement
Loss of Significance
Example. x = 0.37214 48693 and y = 0.37202 14371. What is the relative error in x − y on a computer with 5 decimal digits of accuracy (so x̄ = 0.37214, ȳ = 0.37202)?

|(x − y) − (x̄ − ȳ)| / |x − y|
  = |0.37214 48693 − 0.37202 14371 − 0.37214 + 0.37202| / |0.37214 48693 − 0.37202 14371|
  ≈ 3 × 10^-2
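Reproducing this computation exactly, e.g. with Python's Fraction for exact rational arithmetic (xbar and ybar are x and y rounded to 5 decimal digits as in the example):

```python
# Exact relative error of the 5-digit subtraction.
from fractions import Fraction

x = Fraction("0.3721448693")
y = Fraction("0.3720214371")
xbar, ybar = Fraction("0.37214"), Fraction("0.37202")

rel_err = abs((x - y) - (xbar - ybar)) / abs(x - y)
print(float(rel_err))     # about 0.028, i.e. ~ 3 x 10^-2
```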
Loss of Significance
Loss of Precision Theorem. Let x and y be (normalized) floating point machine numbers with x > y > 0. If

2^-p ≤ 1 − y/x ≤ 2^-q

for positive integers p and q, then the number of significant binary digits lost in calculating x − y is between q and p.
Loss of Significance
Example. Consider x = 37.593621 and y = 37.584216. Then

2^-12 < 1 − y/x = 0.00025 01754 < 2^-11

so we lose 11 or 12 bits in the computation of x − y. Yikes!

Example. Back to the other example (5 digits): x = 0.37214 and y = 0.37202. Then

10^-4 < 1 − y/x = 0.00032 < 10^-3

so we lose 3 or 4 decimal digits in the computation of x − y. Here x − y = 0.00012, which has only 1 significant digit that we can be sure about.
Loss of Significance

So what to do? Mainly rearrangement.

f(x) = sqrt(x^2 + 1) − 1

Problem at x ≈ 0. One type of fix:

f(x) = (sqrt(x^2 + 1) − 1) · (sqrt(x^2 + 1) + 1)/(sqrt(x^2 + 1) + 1) = x^2 / (sqrt(x^2 + 1) + 1)

No subtraction!
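A Python sketch of the fix (function names are ours, for illustration): near x = 0 the naive form loses everything, while the rearranged form keeps full accuracy.

```python
# Naive vs. rearranged evaluation of f(x) = sqrt(x^2 + 1) - 1.
import math

def f_naive(x):
    return math.sqrt(x * x + 1.0) - 1.0            # subtraction cancels near 0

def f_stable(x):
    return x * x / (math.sqrt(x * x + 1.0) + 1.0)  # no subtraction

x = 1e-9                     # true value is about x^2/2 = 5e-19
print(f_naive(x))            # 0.0 -- all significance lost
print(f_stable(x))           # 5e-19
```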