
Floating-Point Arithmetic

ECS130 Winter 2017

January 27, 2017

Floating point

- Floating-point representation of numbers (scientific notation) has four components, for example,

$-3.1416 \times 10^{1}$ (sign: $-$, significand: $3.1416$, base: $10$, exponent: $1$)

- Computers use binary (base 2) representation; each digit is 0 or 1, e.g.

$(10110)_2 = 1 \times 2^4 + 0 \times 2^3 + 1 \times 2^2 + 1 \times 2^1 + 0 \times 2^0 = 22$ in base 10, and

$\frac{1}{10} = \frac{1}{16} + \frac{1}{32} + \frac{1}{256} + \frac{1}{512} + \frac{1}{4096} + \frac{1}{8192} + \cdots = \frac{1}{2^4} + \frac{1}{2^5} + \frac{1}{2^8} + \frac{1}{2^9} + \frac{1}{2^{12}} + \frac{1}{2^{13}} + \cdots = (1.100110011\ldots)_2 \times 2^{-4}$
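This repeating pattern can be inspected directly; a minimal Python sketch (the built-in float.hex prints the stored 53-bit significand in hex, four bits per digit):

    # Inspect the binary expansion of 1/10 as stored in an IEEE double.
    x = 0.1
    print(x.hex())            # 0x1.999999999999ap-4: hex 9 = 1001, a = 1010

    # The trailing 'a' shows that the infinite pattern ...0011 was rounded
    # to the nearest 53-bit significand, so 0.1 is not stored exactly:
    print(0.1 + 0.2 == 0.3)   # False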

Floating point numbers

- The representation of a floating-point number:

$x = \pm (b_0.b_1b_2 \cdots b_{p-1})_2 \times 2^E$

where

  - it is normalized, i.e., $b_0 = 1$ (the hidden bit);
  - precision $p$ is the number of bits in the significand (mantissa), including the hidden bit;
  - machine epsilon $\epsilon_m = 2^{-(p-1)}$ is the gap between the number 1 and the smallest floating-point number that is greater than 1.

- Special numbers: $0$, $-0$, Inf $= \infty$, $-$Inf $= -\infty$, NaN = "Not a Number".
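A small sketch of these definitions in Python (math.ulp assumes Python 3.9+):

    import math

    # Machine epsilon: the gap between 1 and the next float above 1.
    # For IEEE doubles p = 53, so eps_m = 2**-52.
    print(math.ulp(1.0) == 2.0**-52)      # True

    # Special values: signed zeros, infinities, NaN.
    print(0.0 == -0.0)                    # True: +0 and -0 compare equal
    print(math.copysign(1.0, -0.0))       # -1.0: the sign of -0 is kept
    print(float('inf'), -float('inf'))    # inf -inf
    print(float('nan') == float('nan'))   # False: NaN compares unequal to itself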

IEEE standard

- All computers designed since 1985 use the IEEE Standard for Binary Floating-Point Arithmetic (ANSI/IEEE Std 754-1985): each number is represented as a binary number, and binary arithmetic is used.

- Essentials of the IEEE standard:
  1. consistent representation of floating-point numbers by all machines adopting the standard;
  2. correctly rounded floating-point operations, using various rounding modes;
  3. consistent treatment of exceptional situations such as division by zero.

Citation of the 1989 Turing Award to William Kahan: "... his devotion to providing systems that can be safely and robustly used for numerical computations has impacted the lives of virtually anyone who will use a computer in the future."

IEEE single precision format

- The single format is 32 bits (= 4 bytes) long:

      | s    |    E     |    f     |
       sign    exponent   fraction
       1 bit   8 bits     23 bits

  with the binary point between the hidden bit and the fraction $f$.
- It represents the number $(-1)^s \cdot (1.f)_2 \times 2^{E-127}$
- The leading 1 need not be stored explicitly, since it is always 1 (the hidden bit).
- Precision $p = 24$ and machine epsilon $\epsilon_m = 2^{-23} \approx 1.2 \times 10^{-7}$.
- The bias in "$E - 127$" avoids the need to store a sign bit for the exponent. $E_{\min} = (00000001)_2 = (1)_{10}$, $E_{\max} = (11111110)_2 = (254)_{10}$.

- The range of positive normalized numbers:

$N_{\min} = (1.00\cdots0)_2 \times 2^{E_{\min}-127} = 2^{-126} \approx 1.2 \times 10^{-38}$

$N_{\max} = (1.11\cdots1)_2 \times 2^{E_{\max}-127} \approx 2^{128} \approx 3.4 \times 10^{38}.$

- Special representations for $0$, $\pm\infty$ and NaN.
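A sketch of how the three fields can be pulled out of a stored single, using Python's standard struct module (fields32 is a hypothetical helper name):

    import struct

    # Split the 32-bit single format into sign / exponent / fraction fields.
    def fields32(x):
        (bits,) = struct.unpack('>I', struct.pack('>f', x))
        s = bits >> 31               # 1 sign bit
        E = (bits >> 23) & 0xFF      # 8 stored exponent bits, bias 127
        f = bits & 0x7FFFFF          # 23 fraction bits
        return s, E, f

    s, E, f = fields32(-3.1416)
    print(s, E - 127)                                # 1 1: negative, exponent 1
    # Reconstruct the value as (-1)**s * (1.f)_2 * 2**(E-127):
    print((-1)**s * (1 + f / 2**23) * 2**(E - 127))  # ~ -3.1416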

IEEE double precision format

- The IEEE double format is 64 bits (= 8 bytes) long:

      | s    |    E     |    f     |
       sign    exponent   fraction
       1 bit   11 bits    52 bits

- It represents the number $(-1)^s \cdot (1.f)_2 \times 2^{E-1023}$
- Precision $p = 53$ and machine epsilon $\epsilon_m = 2^{-52} \approx 2.2 \times 10^{-16}$
- The range of positive normalized numbers is from

$N_{\min} = 2^{-1022} \approx 2.2 \times 10^{-308}$ to $N_{\max} = (1.11\cdots1)_2 \times 2^{1023} \approx 2^{1024} \approx 1.8 \times 10^{308}.$

- Special representations for $0$, $\pm\infty$ and NaN.
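Both formats' parameters can be cross-checked against NumPy's format descriptors (a sketch; assumes NumPy is available):

    import numpy as np

    for t in (np.float32, np.float64):
        fi = np.finfo(t)
        # nmant counts stored fraction bits, so p = nmant + 1 (hidden bit included);
        # tiny is the smallest normal (newer NumPy also calls it smallest_normal).
        print(t.__name__, 'p =', fi.nmant + 1, 'eps =', fi.eps,
              'Nmin =', fi.tiny, 'Nmax =', fi.max)
    # float32 p = 24 eps ~ 1.19e-07 Nmin ~ 1.18e-38  Nmax ~ 3.40e+38
    # float64 p = 53 eps ~ 2.22e-16 Nmin ~ 2.23e-308 Nmax ~ 1.80e+308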

Rounding modes

- Let a positive $x$ be in the normalized range, i.e., $N_{\min} \le x \le N_{\max}$, and write it in the normalized form

$x = (1.b_1b_2 \cdots b_{p-1}b_pb_{p+1}\ldots)_2 \times 2^E.$

- Then the closest floating-point number less than or equal to $x$ is

$x_- = (1.b_1b_2 \cdots b_{p-1})_2 \times 2^E,$

i.e., $x_-$ is obtained by truncating.

- The next floating-point number bigger than $x_-$ (and also the next one bigger than $x$) is

$x_+ = (1.b_1b_2 \cdots b_{p-1} + 0.00\cdots01)_2 \times 2^E.$

- If $x$ is negative, the situation is reversed.

Rounding modes, cont'd

Four rounding modes:

1. round down: $fl(x) = x_-$
2. round up: $fl(x) = x_+$
3. round towards zero: $fl(x) = x_-$ if $x \ge 0$; $fl(x) = x_+$ if $x \le 0$
4. round to nearest (the IEEE default rounding mode): $fl(x) = x_-$ or $x_+$, whichever is nearer to $x$.

Note: except that if $x > N_{\max}$, $fl(x) = \infty$, and if $x < -N_{\max}$, $fl(x) = -\infty$. In the case of a tie, i.e., when $x_-$ and $x_+$ are the same distance from $x$, the one with its least significant bit equal to zero is chosen.
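The tie-breaking rule is visible in double precision ($p = 53$); a small Python sketch:

    # 1 + 2**-53 is exactly halfway between 1 and 1 + 2**-52.
    # Round to nearest breaks the tie toward the neighbor whose last
    # significand bit is 0, which here is 1 itself:
    print(1.0 + 2.0**-53 == 1.0)               # True: tie goes to even
    print(1.0 + (2.0**-53 + 2.0**-60) > 1.0)   # True: just past the tie, rounds up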

Rounding error

- When round to nearest is in effect,

$\mathrm{relerr}(x) = \frac{|fl(x) - x|}{|x|} \le \frac{1}{2}\epsilon_m.$

- Therefore, we have

$\mathrm{relerr} \le \begin{cases} \frac{1}{2} \cdot 2^{1-24} = 2^{-24} \approx 5.96 \times 10^{-8}, & \text{single precision,} \\ \frac{1}{2} \cdot 2^{-52} \approx 1.11 \times 10^{-16}, & \text{double precision.} \end{cases}$

Floating-point arithmetic

- IEEE rules for correctly rounded floating-point operations: if $x$ and $y$ are correctly rounded floating-point numbers, then

$fl(x + y) = (x + y)(1 + \delta)$
$fl(x - y) = (x - y)(1 + \delta)$
$fl(x \times y) = (x \times y)(1 + \delta)$
$fl(x / y) = (x / y)(1 + \delta)$

where $|\delta| \le \frac{1}{2}\epsilon_m$ for round to nearest.

- The IEEE standard also requires that correctly rounded remainder and square root operations be provided.

Floating-point arithmetic, cont'd

IEEE standard response to exceptions:

    Event              | Example                    | Set result to
    -------------------|----------------------------|---------------------------
    Invalid operation  | 0/0, 0 × ∞                 | NaN
    Division by zero   | finite nonzero/0           | ±∞
    Overflow           | |x| > Nmax                 | ±∞ or ±Nmax
    Underflow          | x ≠ 0, |x| < Nmin          | ±0, ±Nmin, or subnormal
    Inexact            | whenever fl(x ◦ y) ≠ x ◦ y | correctly rounded value
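The default responses can be provoked from NumPy scalars (a sketch; np.errstate only silences the warnings so the IEEE results print cleanly):

    import numpy as np

    with np.errstate(divide='ignore', invalid='ignore', over='ignore', under='ignore'):
        print(np.float64(0.0) / np.float64(0.0))   # nan     (invalid operation)
        print(np.float64(1.0) / np.float64(0.0))   # inf     (division by zero)
        print(np.float64(1e308) * 10.0)            # inf     (overflow)
        print(np.float64(1e-308) / 1e10)           # 1e-318  (underflow: subnormal)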

Floating-point arithmetic error

- Let $\hat{x}$ and $\hat{y}$ be floating-point numbers such that

$\hat{x} = x(1 + \tau_1)$ and $\hat{y} = y(1 + \tau_2)$, for $|\tau_i| \le \tau \ll 1,$

where the $\tau_i$ could be the relative errors picked up in collecting the data from the original source, or from previous operations.

- Question: how do the four basic arithmetic operations behave?

Floating-point arithmetic error: +, −

Addition and subtraction:

$fl(\hat{x} + \hat{y}) = (\hat{x} + \hat{y})(1 + \delta), \quad |\delta| \le \tfrac{1}{2}\epsilon_m$
$= x(1 + \tau_1)(1 + \delta) + y(1 + \tau_2)(1 + \delta)$
$= x + y + x(\tau_1 + \delta + O(\tau\epsilon_m)) + y(\tau_2 + \delta + O(\tau\epsilon_m))$
$= (x + y)\Big(1 + \frac{x}{x+y}(\tau_1 + \delta + O(\tau\epsilon_m)) + \frac{y}{x+y}(\tau_2 + \delta + O(\tau\epsilon_m))\Big)$
$\equiv (x + y)(1 + \hat{\delta}),$

where $\hat{\delta}$ can be bounded as follows:

$|\hat{\delta}| \le \frac{|x| + |y|}{|x + y|}\Big(\tau + \tfrac{1}{2}\epsilon_m + O(\tau\epsilon_m)\Big).$

Floating-point arithmetic error: +, −

Three possible cases:

1. If $x$ and $y$ have the same sign, i.e., $xy > 0$, then $|x + y| = |x| + |y|$; this implies
   $|\hat{\delta}| \le \tau + \tfrac{1}{2}\epsilon_m + O(\tau\epsilon_m) \ll 1.$
   Thus $fl(\hat{x} + \hat{y})$ approximates $x + y$ well.
2. If $x \approx -y$, then $|x + y| \approx 0$ and $(|x| + |y|)/|x + y| \gg 1$; this implies that $|\hat{\delta}|$ could be nearly as large as, or much bigger than, 1. This is so-called catastrophic cancellation: it causes relative errors or uncertainties already present in $\hat{x}$ and $\hat{y}$ to be magnified.
3. In general, if $(|x| + |y|)/|x + y|$ is not too big, $fl(\hat{x} + \hat{y})$ provides a good approximation to $x + y$.

Catastrophic cancellation: example 1

- Computing $\sqrt{x+1} - \sqrt{x}$ straightforwardly causes substantial loss of significant digits for large $x$:

    x        | fl(√(x+1))              | fl(√x)                  | fl(fl(√(x+1)) − fl(√x))
    1.00e+10 | 1.00000000004999994e+05 | 1.00000000000000000e+05 | 4.99999441672116518e-06
    1.00e+11 | 3.16227766018419061e+05 | 3.16227766016837908e+05 | 1.58115290105342865e-06
    1.00e+12 | 1.00000000000050000e+06 | 1.00000000000000000e+06 | 5.00003807246685028e-07
    1.00e+13 | 3.16227766016853740e+06 | 3.16227766016837955e+06 | 1.57859176397323608e-07
    1.00e+14 | 1.00000000000000503e+07 | 1.00000000000000000e+07 | 5.02914190292358398e-08
    1.00e+15 | 3.16227766016838104e+07 | 3.16227766016837917e+07 | 1.86264514923095703e-08
    1.00e+16 | 1.00000000000000000e+08 | 1.00000000000000000e+08 | 0.00000000000000000e+00

Catastrophic cancellation: example 1, cont'd

- Catastrophic cancellation can sometimes be avoided if a formula is properly reformulated.
- For example, one can compute $\sqrt{x+1} - \sqrt{x}$ almost to full precision by using the equality

$\sqrt{x+1} - \sqrt{x} = \frac{1}{\sqrt{x+1} + \sqrt{x}}.$

Consequently, the computed results are

    x        | fl(1/(√(x+1) + √x))
    1.00e+10 | 4.999999999875000e-06
    1.00e+11 | 1.581138830080237e-06
    1.00e+12 | 4.999999999998749e-07
    1.00e+13 | 1.581138830084150e-07
    1.00e+14 | 4.999999999999987e-08
    1.00e+15 | 1.581138830084189e-08
    1.00e+16 | 5.000000000000000e-09
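Both tables can be reproduced in a few lines of Python (a sketch; the exact digits printed may vary with the platform's math library):

    import math

    for k in range(10, 17):
        x = 10.0**k
        naive = math.sqrt(x + 1) - math.sqrt(x)           # cancellation
        stable = 1.0 / (math.sqrt(x + 1) + math.sqrt(x))  # reformulated
        print(f'{x:.2e}  {naive:.15e}  {stable:.15e}')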

Catastrophic cancellation: example 2

- Consider the function

$h(x) = \frac{1 - \cos x}{x^2}.$

Note that $0 \le h(x) < 1/2$ for all $x \ne 0$.

- Let $x = 1.2 \times 10^{-8}$; then the computed

$fl(h(x)) = 0.770988\ldots$

is completely wrong!

- Alternatively, the function can be rewritten as

$h(x) = \frac{1}{2}\left(\frac{\sin(x/2)}{x/2}\right)^2.$

- Consequently, for $x = 1.2 \times 10^{-8}$, the computed $fl(h(x)) = 0.499999\ldots < 1/2$ is fine!
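A sketch comparing the two forms in Python, using the identity $1 - \cos x = 2\sin^2(x/2)$ behind the rewritten formula:

    import math

    # Naive vs. reformulated evaluation of h(x) = (1 - cos x)/x**2
    # near x = 1.2e-8, where 1 - cos x suffers catastrophic cancellation.
    x = 1.2e-8
    naive = (1.0 - math.cos(x)) / x**2
    s = math.sin(x / 2) / (x / 2)
    stable = 0.5 * s * s
    print(naive)    # ~0.77: completely wrong
    print(stable)   # ~0.4999999999999999: correct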

Floating-point arithmetic error: ×, /

Multiplication and division:

$fl(\hat{x} \times \hat{y}) = (\hat{x} \times \hat{y})(1 + \delta) = xy(1 + \tau_1)(1 + \tau_2)(1 + \delta) \equiv xy(1 + \hat{\delta}_\times),$
$fl(\hat{x} / \hat{y}) = (\hat{x} / \hat{y})(1 + \delta) = (x/y)(1 + \tau_1)(1 + \tau_2)^{-1}(1 + \delta) \equiv (x/y)(1 + \hat{\delta}_\div),$

where

$\hat{\delta}_\times = \tau_1 + \tau_2 + \delta + O(\tau\epsilon_m)$ and $\hat{\delta}_\div = \tau_1 - \tau_2 + \delta + O(\tau\epsilon_m).$

Thus $|\hat{\delta}_\times| \le 2\tau + \tfrac{1}{2}\epsilon_m + O(\tau\epsilon_m)$ and $|\hat{\delta}_\div| \le 2\tau + \tfrac{1}{2}\epsilon_m + O(\tau\epsilon_m)$. Multiplication and division are very well-behaved!
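A quick empirical check that multiplication is well-behaved, sketched with exact rational arithmetic (Python's fractions converts a double to its exact value):

    from fractions import Fraction
    import random

    # Spot-check: the relative error of fl(x*y) stays within eps_m/2
    # for many random doubles (no cancellation is possible in a product).
    random.seed(0)
    worst = Fraction(0)
    for _ in range(10_000):
        x = random.uniform(0.5, 2.0)
        y = random.uniform(0.5, 2.0)
        exact = Fraction(x) * Fraction(y)               # exact rational product
        err = abs(Fraction(x * y) - exact) / exact      # exact relative error
        worst = max(worst, err)
    print(worst <= Fraction(1, 2**53))                  # True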

Reading

- Section 1.7 of Numerical Computing with MATLAB by C. Moler

- Website discussions of numerical disasters:
  - T. Huckle, Collection of software bugs, http://www5.in.tum.de/~huckle/bugse.html
  - K. Vuik, Some disasters caused by numerical errors, http://ta.twi.tudelft.nl/nw/users/vuik/wi211/disasters.html
  - D. Arnold, Some disasters attributable to bad numerical computing, http://www.ima.umn.edu/~arnold/disasters/disasters.html
- In-depth material: D. Goldberg, What every computer scientist should know about floating-point arithmetic, ACM Computing Surveys, Vol. 23(1), pp. 5-48, 1991

LU factorization revisited: the need for pivoting, numerically

- LU factorization without pivoting:

$A = \begin{pmatrix} 0.0001 & 1 \\ 1 & 1 \end{pmatrix} = \begin{pmatrix} 10^{-4} & 1 \\ 1 & 1 \end{pmatrix} = LU = \begin{pmatrix} 1 & 0 \\ l_{21} & 1 \end{pmatrix} \begin{pmatrix} u_{11} & u_{12} \\ 0 & u_{22} \end{pmatrix}$

- In three-decimal-digit floating-point arithmetic, we have

$\hat{L} = \begin{pmatrix} 1 & 0 \\ fl(1/10^{-4}) & 1 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 10^4 & 1 \end{pmatrix}, \qquad \hat{U} = \begin{pmatrix} 10^{-4} & 1 \\ 0 & fl(1 - 10^4 \cdot 1) \end{pmatrix} = \begin{pmatrix} 10^{-4} & 1 \\ 0 & -10^4 \end{pmatrix},$

- Check:

$\hat{L}\hat{U} = \begin{pmatrix} 1 & 0 \\ 10^4 & 1 \end{pmatrix} \begin{pmatrix} 10^{-4} & 1 \\ 0 & -10^4 \end{pmatrix} = \begin{pmatrix} 10^{-4} & 1 \\ 1 & 0 \end{pmatrix} \not\approx A.$

LU factorization revisited: the need for pivoting, numerically

- Consider solving $Ax = \begin{pmatrix} 1 \\ 2 \end{pmatrix}$ for $x$ using this LU factorization.

- Solving

$\hat{L}\hat{y} = \begin{pmatrix} 1 \\ 2 \end{pmatrix} \implies \hat{y}_1 = fl(1/1) = 1, \quad \hat{y}_2 = fl(2 - 10^4 \cdot 1) = -10^4.$

Note that the value 2 has been "lost" by subtracting $10^4$ from it.

- Solving

$\hat{U}\hat{x} = \hat{y} = \begin{pmatrix} 1 \\ -10^4 \end{pmatrix} \implies \hat{x}_2 = fl((-10^4)/(-10^4)) = 1, \quad \hat{x}_1 = fl((1 - 1)/10^{-4}) = 0.$

- $\hat{x} = \begin{pmatrix} \hat{x}_1 \\ \hat{x}_2 \end{pmatrix} = \begin{pmatrix} 0 \\ 1 \end{pmatrix}$ is a completely erroneous solution; the correct answer is $x \approx \begin{pmatrix} 1 \\ 1 \end{pmatrix}$.
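The same breakdown can be reproduced in double precision; a sketch in which a hypothetical pivot of 1e-20 stands in for the slide's $10^{-4}$, since the slide works in three-decimal-digit arithmetic rather than a 53-bit format:

    import numpy as np

    A = np.array([[1e-20, 1.0], [1.0, 1.0]])
    b = np.array([1.0, 2.0])

    # LU without pivoting, written out for the 2x2 case.
    l21 = A[1, 0] / A[0, 0]           # 1e20
    u22 = A[1, 1] - l21 * A[0, 1]     # fl(1 - 1e20) = -1e20: the 1 is lost
    y2 = b[1] - l21 * b[0]            # fl(2 - 1e20) = -1e20: the 2 is lost
    x2 = y2 / u22                     # 1.0
    x1 = (b[0] - 1.0 * x2) / A[0, 0]  # 0.0: completely wrong
    print(x1, x2)                     # 0.0 1.0
    print(np.linalg.solve(A, b))      # [1. 1.] (LAPACK pivots internally)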

LU factorization revisited: the need for pivoting, numerically

- LU factorization with partial pivoting:

$PA = LU.$

- By exchanging the order of the rows of $A$, i.e., taking

$P = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix},$

the LU factorization of $PA$ is given by

$\hat{L} = \begin{pmatrix} 1 & 0 \\ fl(10^{-4}/1) & 1 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 10^{-4} & 1 \end{pmatrix}, \qquad \hat{U} = \begin{pmatrix} 1 & 1 \\ 0 & fl(1 - 10^{-4} \cdot 1) \end{pmatrix} = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}.$

- The computed $\hat{L}\hat{U}$ approximates $PA$ very accurately.

- As a result, the computed solution $\hat{x}$ is also perfect!
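A sketch of the pivoted factorization in code (assumes SciPy; scipy.linalg.lu returns $P$, $L$, $U$ with $A = PLU$, and the same exaggerated pivot 1e-20 stands in for $10^{-4}$):

    import numpy as np
    from scipy.linalg import lu, lu_factor, lu_solve

    A = np.array([[1e-20, 1.0], [1.0, 1.0]])
    b = np.array([1.0, 2.0])

    P, L, U = lu(A)                    # partial pivoting swaps the rows
    print(P)                           # the permutation chosen by pivoting
    print(np.allclose(P @ L @ U, A))   # True: the factorization is accurate

    x = lu_solve(lu_factor(A), b)      # solve with the pivoted factorization
    print(x)                           # ~[1. 1.], the correct answer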