Floating Point Objectives

• look at floating point representation in its basic form
• expose errors of a different form: rounding error
• highlight the IEEE-754 standard

Why this is important:

• errors come in two forms: truncation error and rounding error
• we always have them …
• case study: Intel
• our job as developers: reduce their impact

Intel Timeline

June 1994: Intel engineers discover the error. Managers decide the error will not impact many people and keep the issue internal.
June 1994: Dr. Nicely at Lynchburg College notices computation problems.
Oct 19, 1994: After months of testing, Nicely confirms that other errors are not the cause. The problem is in the Intel processor.
Oct 24, 1994: Nicely contacts Intel. Intel duplicates the error.
Oct 30, 1994: After no action from Intel, Nicely sends an email.

Intel Timeline

FROM: Dr. Thomas R. Nicely
Professor of Mathematics
Lynchburg College
1501 Lakeside Drive
Lynchburg, Virginia 24501-3199

Phone: 804-522-8374
Fax: 804-522-8499
Internet: [email protected]

TO: Whom it may concern

RE: Bug in the Pentium FPU

DATE: 30 October 1994

It appears that there is a bug in the floating point unit (numeric coprocessor) of many, and perhaps all, Pentium processors.

In short, the Pentium FPU is returning erroneous values for certain division operations. For example, 1/824633702441.0 is calculated incorrectly (all digits beyond the eighth significant digit are in error). This can be verified in compiled code, an ordinary spreadsheet such as Quattro Pro or Excel, or even the Windows calculator (use the scientific mode), by computing (824633702441.0)*(1/824633702441.0), which should equal 1 exactly (within some extremely small rounding error; in general, coprocessor results should contain 19 significant digits). However, the Pentiums tested return 0.999999996274709702 . . .

Intel Timeline

Nov 1, 1994: Software company Phar Lap Software receives Nicely's email and sends it to colleagues at Microsoft, Borland, Watcom, etc.
Nov 2, 1994: Email with the description goes global.
Nov 15, 1994: USC reverse-engineers the chip to expose the problem. Intel still denies a problem. Stock falls.
Nov 22, 1994: CNN Moneyline interviews Intel. Intel says the problem is minor.
Nov 23, 1994: The MathWorks develops a fix.
Nov 24, 1994: New York Times story. Intel still sending out flawed chips; will replace chips only if the flaw caused a problem in an important application.
Dec 12, 1994: IBM halts shipment of Pentium-based PCs.
Dec 16, 1994: Intel stock falls again.
Dec 19, 1994: More reports in the NYT: lawsuits, etc.
Dec 20, 1994: Intel admits the flaw and sets aside $420 million to fix it.

Numerical "bugs"

Obvious: Software has bugs.

Not-so-obvious: Numerical software has two unique kinds of bugs:

1. roundoff error
2. truncation error

Numerical Errors

Roundoff: Roundoff error occurs when digits in a decimal expansion (0.3333...) are lost (0.3333) due to a limit on the memory available for storing one numerical value.

Truncation: Truncation error occurs when discrete values are used to approximate a mathematical expression, e.g., when an infinite series or limit is cut off after finitely many terms.
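To see the two error types side by side, here is a small Python sketch (the function, point, and step size are illustrative choices, not from the lecture):

    import math

    # Roundoff: 0.1 has no finite binary expansion, so the stored
    # value is already slightly wrong.
    print(f"{0.1:.20f}")  # 0.10000000000000000555...

    # Truncation: a forward difference approximates f'(x) by cutting
    # off the Taylor series after the first-order term.
    f, x, h = math.sin, 1.0, 1e-5
    print((f(x + h) - f(x)) / h - math.cos(x))  # ~ -4.2e-06, mostly truncation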

Uncertainty: well- or ill-conditioned?

Errors in input data can cause uncertain results

• input data can be experimental or rounded; this leads to a certain variation in the results
• well-conditioned: numerical results are insensitive to small variations in the input
• ill-conditioned (a.k.a. poorly conditioned): small variations in the input lead to drastically different numerical results, as the sketch below shows
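A small Python illustration of ill-conditioning (the nearly singular system is an illustrative choice): perturbing one input in its seventh significant digit completely changes the solution.

    def solve2x2(a, b, c, d, e, f):
        # Solve [[a, b], [c, d]] [x, y] = [e, f] by Cramer's rule.
        det = a * d - b * c
        return (e * d - b * f) / det, (a * f - e * c) / det

    # The two rows are almost parallel, so the problem is ill-conditioned.
    print(solve2x2(1.0, 1.0, 1.0, 1.000001, 2.0, 2.0))       # ~ (2, 0)
    print(solve2x2(1.0, 1.0, 1.0, 1.000001, 2.0, 2.000001))  # ~ (1, 1)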

Our Job

As numerical analysts, we need to
1. solve a problem so that the calculation is not susceptible to large roundoff error
2. solve a problem so that the calculation has a tolerable truncation error

How? Incorporate roundoff-truncation knowledge into:
• the mathematical model
• the method
• the algorithm
• the software design

• awareness → correct interpretation of results

Wanted: Real Numbers... in a computer

Computers can represent integers, using bits:

23 = 1 · 2^4 + 0 · 2^3 + 1 · 2^2 + 1 · 2^1 + 1 · 2^0 = (10111)2

How would we represent fractions, e.g. 23.625?

Idea: Keep going down past zero exponent:

23.625 = 1 · 2^4 + 0 · 2^3 + 1 · 2^2 + 1 · 2^1 + 1 · 2^0 + 1 · 2^−1 + 0 · 2^−2 + 1 · 2^−3

So: Could store

• a fixed number of bits with exponents > 0
• a fixed number of bits with exponents < 0

This is called fixed-point arithmetic.
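A minimal sketch of fixed-point storage in Python, assuming 3 bits for the exponents < 0 (the helper names are mine): the number is kept as an integer count of 2^−3 units.

    FRAC_BITS = 3  # number of bits with exponents < 0

    def to_fixed(x):
        return round(x * 2**FRAC_BITS)   # store x as a scaled integer

    def from_fixed(n):
        return n / 2**FRAC_BITS

    n = to_fixed(23.625)
    print(n, bin(n))      # 189 0b10111101, i.e. (10111.101)2
    print(from_fixed(n))  # 23.625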

Fixed-Point Numbers

Suppose we use units of 64 bits, with 32 bits for exponents > 0 and 32 bits for exponents < 0. What numbers can we represent?

2^31 ··· 2^0 2^−1 ··· 2^−32

Smallest: 2^−32 ≈ 10^−10
Largest: 2^31 + ··· + 2^−32 ≈ 10^9

Fixed-Point Numbers

How many ‘digits’ of relative accuracy (think relative rounding error) are available for the smallest vs. the largest number with 64 bits of fixed point?

2^31 ··· 2^0 2^−1 ··· 2^−32

For large numbers: about 19 digits. For small numbers: few or none.

Idea: Don't fix the location of the 0 exponent. Let it float.

Floating Points

Normalized Floating-Point Representation
Real numbers are stored as

x = ±(0.d1 d2 d3 ... dm)β × β^e

• d1 d2 d3 ... dm is the significand, e is the exponent
• e is negative, positive or zero
• the general normalized form requires d1 ≠ 0

Floating Point

Example: In base 10,

• 1000.12345 can be written as (0.100012345)10 × 10^4
• 0.000812345 can be written as (0.812345)10 × 10^−3
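Python's math.frexp exposes the analogous normalized base-2 form, returning a significand in [0.5, 1) and an integer exponent:

    import math

    m, e = math.frexp(1000.12345)
    print(m, e)              # 0.97668..., 10, i.e. 1000.12345 = m × 2^10
    print(math.ldexp(m, e))  # 1000.12345: ldexp reassembles the number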

Floating Point

Suppose we have only 3 bits for a (non-normalized) significand and a 1-bit exponent, stored like

.d1 d2 d3 e1

All possible combinations give

(000)2 = 0 through (111)2 = 7, each scaled by 2^−1, 2^0, or 2^1.

So we get 0, 1/16, 2/16, ..., 7/16; 0, 1/4, 2/4, ..., 7/4; and 0, 1/8, 2/8, ..., 7/8.

(Figure: these values marked on the real line; they cluster near zero and spread out at larger magnitudes.)

Overflow, Underflow

• computations too close to zero may result in underflow
• computations too large may result in overflow
• overflow error is considered more severe
• underflow can often just fall back to 0

Normalizing

If we use the normalized form in our 4-bit case, we lose (0.001)2 × 2^−1,0,1 along with others. So we cannot represent 1/16, 1/8, and 3/16.

Next: Floating Point Numbers

• We’re familiar with base 10 representation of numbers: 1234 = 4 × 100 + 3 × 101 + 2 × 102 + 1 × 103 and .1234 = 1 × 10−1 + 2 × 10−2 + 3 × 10−3 + 4 × 10−4 • we write 1234.1234 as an part and a fractional part:

a3a2a1a0.b1b2b3b4 • For some (even simple) numbers, there may be an infinite number of digits to the right: π = 3.14159 ... 1/9 = 0.11111 ... √ 2 = 1.41421 ... 21 Other bases

• So far, we have just base 10. What about base β?
• binary (β = 2), octal (β = 8), hexadecimal (β = 16), etc.
• In the β-system we have

(an ... a2 a1 a0 . b1 b2 b3 b4 ...)β = Σ_{k=0}^{n} ak β^k + Σ_{k=1}^{∞} bk β^−k

Integer Conversion

An algorithm to compute the base-2 representation of a base-10 integer:

(N)10 = (aj aj−1 ... a1 a0)2 = aj · 2^j + ··· + a1 · 2^1 + a0 · 2^0

Compute (N)10 / 2 = Q + R/2:

N/2 = (aj · 2^(j−1) + ··· + a1 · 2^0) + a0/2 = Q + R/2

The remainder R is the digit a0; repeating with Q yields a1, then a2, and so on.

Example: compute (11)10 in base 2.

11/2 = 5R1 ⇒ a0 = 1

5/2 = 2R1 ⇒ a1 = 1

2/2 = 1R0 ⇒ a2 = 0

1/2 = 0R1 ⇒ a3 = 1

So (11)10 = (1011)2
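A direct Python transcription of this repeated-division scheme (the function name is mine):

    def to_binary(n):
        # Base-2 digits of a positive integer via repeated division by 2.
        bits = ""
        while n > 0:
            n, r = divmod(n, 2)    # n/2 = Q with remainder R
            bits = str(r) + bits   # remainders appear least significant first
        return bits or "0"

    print(to_binary(11))  # 1011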

The other way...

Convert a base-2 number to base-10:

(11000101)2 = 1 × 2^7 + 1 × 2^6 + 0 × 2^5 + 0 × 2^4 + 0 × 2^3 + 1 × 2^2 + 0 × 2^1 + 1 × 2^0
            = 1 + 2(0 + 2(1 + 2(0 + 2(0 + 2(0 + 2(1 + 2(1)))))))
            = 197
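The nested form is Horner's rule, which is also how one would code the conversion:

    def from_binary(bits):
        # Evaluate a base-2 digit string with Horner's rule.
        value = 0
        for b in bits:              # most significant digit first
            value = 2 * value + int(b)
        return value

    print(from_binary("11000101"))  # 197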

Converting fractions

• the straightforward way is not easy
• goal: for x ∈ [0, 1] write

x = 0.b1 b2 b3 b4 ··· = Σ_{k=1}^{∞} ck β^−k = (0.c1 c2 c3 ...)β

• then βx = (c1 . c2 c3 c4 ...)β
• multiplication by β in base β only shifts the radix point

Fraction Algorithm

An algorithm to compute the binary representation of a fraction x:

x = 0.b1 b2 b3 b4 ... = b1 · 2^−1 + b2 · 2^−2 + ...

Multiply x by 2. The integer part of 2x is b1:

2x = b1 · 2^0 + b2 · 2^−1 + b3 · 2^−2 + ...

Example: Compute the binary representation of 0.625.

2 · 0.625 = 1.25 ⇒ b1 = 1
2 · 0.25 = 0.5 ⇒ b2 = 0
2 · 0.5 = 1.0 ⇒ b3 = 1

So (0.625)10 = (0.101)2
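The same multiply-by-2 loop in Python (a sketch; the bit limit is an arbitrary cutoff for non-terminating expansions):

    def frac_to_binary(x, max_bits=20):
        # Binary digits of x in [0, 1): the integer part of 2x is the next bit.
        bits = ""
        while x > 0 and len(bits) < max_bits:
            x *= 2
            bit, x = divmod(x, 1)   # split integer and fractional parts
            bits += str(int(bit))
        return "0." + bits

    print(frac_to_binary(0.625))  # 0.101
    print(frac_to_binary(0.2))    # 0.00110011001100110011 (repeats forever)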

A problem with precision

    # Extract binary digits b_1 ... b_m of x in [0, 1)
    r = x
    b = []
    for k in range(1, m + 1):
        if r >= 2.0**-k:
            b.append(1)
            r -= 2.0**-k
        else:
            b.append(0)

Trace for x = 0.8125:

k   2^−k     b_k   r_k = r_(k−1) − b_k · 2^−k
0   —        —     0.8125
1   0.5      1     0.3125
2   0.25     1     0.0625
3   0.125    0     0.0625
4   0.0625   1     0.0000

Floating Point numbers

Convert 13.5 = (1101.1)2 into floating point representation.

13.5 = 2^3 + 2^2 + 2^0 + 2^−1 = (1.1011)2 · 2^3

What pieces do you need to store an FP number?

Significand: (1.1011)2
Exponent: 3

Idea: Notice that the leading digit (in binary) of the significand is always one. Only store '1011'.

Floating Point numbers

Final storage format:

13.5 = 2^3 + 2^2 + 2^0 + 2^−1 = (1.1011)2 · 2^3

Significand: 1011 – a fixed number of bits
Exponent: 3 – a (signed!) integer allowing a certain range

The exponent is most often stored as a positive 'offset' from a certain negative number. E.g.

3 = −1023 + 1026, where −1023 is the implicit offset and 1026 is what is stored.

Actually stored: 1026, a positive integer.
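A quick check of this offset storage using Python's standard struct module: for 13.5 (exponent 3), a double's stored exponent field is indeed 1026.

    import struct

    bits = int.from_bytes(struct.pack(">d", 13.5), "big")
    stored_exponent = (bits >> 52) & 0x7FF   # the 11-bit exponent field
    fraction = bits & ((1 << 52) - 1)        # the 52 stored significand bits
    print(stored_exponent)    # 1026 = 3 + 1023
    print(bin(fraction)[:6])  # 0b1011...: the stored '1011'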

Unrepresentable numbers?

Can you think of a somewhat central number that we cannot represent as x = (1.______)2 · 2^−p ?

Zero. Which is somewhat embarrassing.

Core problem: The implicit 1. It’s a great idea, were it not for this issue.

Unrepresentable numbers?

Have to break the pattern. Idea:

• Declare one exponent 'special', and turn off the leading one for that one (say, −1023, a.k.a. stored exponent 0).
• For all larger exponents, the leading one remains in effect.

Bonus Q: With this convention, what is the binary representation of a zero?

Web link: IEEE 754
Web link: What Every Computer Scientist Should Know About Floating-Point Arithmetic

IEEE-754: Why?!?!

• IEEE-754 is a widely used standard accepted by hardware/software makers
• defines the floating point distribution for our computations
• offers several rounding modes which affect accuracy
• floating point arithmetic emerges in nearly every piece of code
• even modest mathematical operations can lose significant bits
• there are several pitfalls in common mathematical expressions

IEEE Floating Point (v. 754)

• How much storage do we actually use in practice?
• 32-bit word lengths are typical
• IEEE single-precision floating-point numbers use 32 bits
• IEEE double-precision floating-point numbers use 64 bits
• Goal: use the 32 bits to best represent the normalized floating point number

IEEE Single Precision (Marc-32)

x = ±q × 2^m

• 1-bit sign: negatives (1) and positives (0)
• 8-bit exponent m
• 23-bit fraction q
• the leading one in the normalized significand q does not need to be represented: b1 = 1 is a hidden bit
• IEEE 754: put x in 1.f normalized form
• 0 < m + 127 = c < 255
• largest exponent = 127, smallest exponent = −126
• special cases: c = 0, 255

IEEE Single Precision

x = ±q × 2^m

Process for x = −52.125:

1. Convert both the integer and fractional parts to binary:

x = −(110100.00100000000)2

2. Convert to 1.f form:

x = −(1.10100001000...0)2 × 2^5   (1 hidden bit, 23 fraction bits)

3. Convert the exponent: 5 = c − 127 ⇒ c = 132 = (10000100)2

Stored layout: 1 | 10000100 | 10100001000...0   (1 sign bit, 8 exponent bits, 23 fraction bits)
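This layout can be verified in Python (the '>f' format packs an IEEE single-precision value):

    import struct

    # -52.125: sign 1, exponent 10000100, fraction 10100001 0...0
    print(struct.pack(">f", -52.125).hex())  # c2508000
    # 0xC2508000 = 1100 0010 0101 0000 1000 0000 0000 0000
    #            = 1 | 10000100 | 10100001000000000000000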

IEEE Single Precision

Special cases:

• denormalized/subnormal numbers: use one extra bit in the significand; the exponent is now −126 (less precision, more range), indicated by (00000000)2 in the exponent field
• two zeros: +0 and −0 (0 fraction, 0 exponent)
• two ∞'s: +∞ and −∞ (0 fraction, (11111111)2 exponent)
• NaN (nonzero fraction, (11111111)2 exponent)

IEEE Double Precision

• 1-bit sign
• 11-bit exponent
• 52-bit fraction
• single precision: about 6 decimal digits of precision
• double precision: about 15 decimal digits of precision
• m = c − 1023

Precision vs. Range

type     exponent range      approximate value range
single   2^−126 → 2^128      −3.40 × 10^38 ≤ x ≤ −1.18 × 10^−38, 0, and 1.18 × 10^−38 ≤ x ≤ 3.40 × 10^38
double   2^−1022 → 2^1024    −1.80 × 10^308 ≤ x ≤ −2.23 × 10^−308, 0, and 2.23 × 10^−308 ≤ x ≤ 1.80 × 10^308

small numbers example
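For instance, the double-precision limits can be queried at runtime in Python:

    import sys

    print(sys.float_info.max)  # 1.7976931348623157e+308, the largest double
    print(sys.float_info.min)  # 2.2250738585072014e-308, smallest normal double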


Subnormal Numbers

What is the smallest representable number in an FP system with 4 stored bits in the significand and an exponent range of [−7, 7]?

First attempt:

• significand as small as possible → all zeros after the implicit leading one
• exponent as small as possible: −7

This gives (1.0000)2 · 2^−7. Unfortunately: wrong. We can go way smaller by using the special exponent (which turns off the implicit leading one).

Subnormal Numbers

We’ll assume that the special exponent is −8. So:

(0.0001)2 · 2^−7

Numbers with the special exponent are called subnormal (or denormal) FP numbers. Technically, zero is also a subnormal. Note: It is thus quite natural to ‘park’ the special exponent at the low end of the exponent range.

Subnormal Numbers

Why would you want to know about subnormals? Because computing with them is often slow, because it is implemented using ‘FP assist’, i.e. not in actual hardware. Many C compilers support options to ‘flush subnormals to zero’.

• FP systems without subnormals will underflow (return 0) as soon as the exponent range is exhausted.
• The smallest representable normal number is called the underflow level, or UFL.

Subnormal Numbers

• Beyond the underflow level, subnormals provide for gradual underflow by 'keeping going' as long as there are bits in the significand (demonstrated below), but it is important to note that subnormals don't have as many accurate digits as normal numbers.
• Analogously (but much more simply – no 'supernormals'): the overflow level, OFL.
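Gradual underflow is easy to watch in Python (standard IEEE doubles):

    import sys

    ufl = sys.float_info.min  # 2.2250738585072014e-308: the UFL
    print(ufl / 2)            # 1.1125369292536007e-308: subnormal, not zero
    print(5e-324)             # the smallest positive subnormal
    print(5e-324 / 2)         # 0.0: now we truly underflow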

Summary: Translating Floating Point Numbers

To summarize: to translate a stored (double precision) floating point value, consisting of the stored fraction (a 52-bit integer) and the stored exponent value e_stored, into its (mathematical) value, use the following rule:

value = (1 + fraction · 2^−52) · 2^(−1023 + e_stored)   if e_stored ≠ 0,
value = (fraction · 2^−52) · 2^−1022                    if e_stored = 0.
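This rule translates directly into code; here is a sketch that pulls apart a double's bit fields and applies exactly the formula above (the function name is mine; infinities and NaNs, e_stored = 2047, are ignored):

    import struct

    def decode_double(x):
        bits = int.from_bytes(struct.pack(">d", x), "big")
        sign = -1.0 if bits >> 63 else 1.0
        e_stored = (bits >> 52) & 0x7FF
        fraction = bits & ((1 << 52) - 1)
        if e_stored != 0:    # normal number
            return sign * (1 + fraction * 2.0**-52) * 2.0**(e_stored - 1023)
        return sign * (fraction * 2.0**-52) * 2.0**-1022   # subnormal or zero

    for x in (13.5, 0.1, 5e-324):
        assert decode_double(x) == x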

A problem with precision

For other numbers, such as 1/5 = 0.2, an infinite length is needed:

0.2 → (.0011 0011 0011 ...)2

So 0.2 is stored just fine in base 10, but needs an infinite number of digits in base 2.

!!! This is roundoff error in its basic form...
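This is visible directly in Python:

    print(f"{0.2:.20f}")     # 0.20000000000000001110: the stored value
    print(0.1 + 0.2 == 0.3)  # False: each constant carries its own roundoff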

What’s the smallest FP number > 1024 in that same system?

10 (1.0001)2 · 2 = x · (1 + 0.0001)2

Can we give that number a name?

Floating Point and Rounding Error

What is the relative error produced by working with floating point numbers? What is smallest floating point number > 1? Assume 4 stored bits in the significand.

46 What’s the smallest FP number > 1024 in that same system?

10 (1.0001)2 · 2 = x · (1 + 0.0001)2

Can we give that number a name?

Floating Point and Rounding Error

What is the relative error produced by working with floating point numbers? What is smallest floating point number > 1? Assume 4 stored bits in the significand.

0 (1.0001)2 · 2 = x · (1 + 0.0001)2

46 10 (1.0001)2 · 2 = x · (1 + 0.0001)2

Can we give that number a name?

Floating Point and Rounding Error

What is the relative error produced by working with floating point numbers? What is smallest floating point number > 1? Assume 4 stored bits in the significand.

0 (1.0001)2 · 2 = x · (1 + 0.0001)2

What’s the smallest FP number > 1024 in that same system?

46 Floating Point and Rounding Error

What is the relative error produced by working with floating point numbers? What is smallest floating point number > 1? Assume 4 stored bits in the significand.

0 (1.0001)2 · 2 = x · (1 + 0.0001)2

What’s the smallest FP number > 1024 in that same system?

10 (1.0001)2 · 2 = x · (1 + 0.0001)2

Can we give that number a name?

46 Floating Point and Rounding Error

Unit roundoff, or machine precision:

εmach is a bound on the relative error in a floating point representation for a given rounding procedure. With chopping as the rounding procedure, it is the smallest number such that float(1 + ε) ≠ 1.

In the above system, εmach = (0.0001)2. Another related quantity is the ULP, or unit in the last place.

It is important to note that, if chopping is used, εmach is the distance between 1.0 and the next largest floating point number.

With chopping, take x = 1.0 and add 1/2, 1/4, ..., 2^−i:

Hidden bit ↓   ← 52 bits →
1 . 1 0 0 0 0 0 0 0 0 0 0 ...     (1 + 2^−1)
1 . 0 1 0 0 0 0 0 0 0 0 0 ...     (1 + 2^−2)
1 . 0 0 1 0 0 0 0 0 0 0 0 ...     (1 + 2^−3)
......
1 . 0 0 0 0 0 0 0 0 0 0 ... 1     (1 + 2^−52)
1 . 0 0 0 0 0 0 0 0 0 0 ... 0     (1 + 2^−53: the added bit falls off the end)

• Ooops!
• use fl(x) to represent the floating point machine number for the real number x
• fl(1 + 2^−52) ≠ 1, but fl(1 + 2^−53) = 1

εm: Machine Epsilon

Machine epsilon εm is the smallest number such that

fl(1 + εm) ≠ 1

With rounding-to-nearest,

• the double precision machine epsilon is about 2^−52
• the single precision machine epsilon is about 2^−23
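A classic loop to find machine epsilon empirically (Python uses doubles):

    eps = 1.0
    while 1.0 + eps / 2 != 1.0:  # halve until adding it no longer changes 1.0
        eps /= 2
    print(eps)                   # 2.220446049250313e-16, i.e. 2**-52
    print(1.0 + 2**-52 != 1.0)   # True
    print(1.0 + 2**-53 != 1.0)   # False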

Floating Point Errors

• Not all reals can be exactly represented as a machine floating point number. Then what?
• Round-off error
• IEEE options: round to next nearest FP (preferred), round toward 0, round up, and round down

Let x− and x+ be the two floating point machine numbers closest to x (from below and above).

• round to nearest: round(x) = x− or x+, whichever is closest
• round toward 0: round(x) = x− or x+, whichever is between 0 and x
• round toward −∞ (down): round(x) = x−
• round toward +∞ (up): round(x) = x+

Floating Point and Rounding Error

What does this say about the relative error incurred in floating point calculations?

• The factor to get from one FP number to the next larger one is (mostly) independent of magnitude: 1 + εmach.
• Since we can't represent any results between x and x · (1 + εmach), that's really the minimum error incurred.
• In terms of relative error:

(x̃ − x)/x = (x(1 + εmach) − x)/x = εmach.

At least theoretically, εmach is the maximum relative error in any FP operation. (Practical implementations do fall short of this.)

Floating Point and Rounding Error

What’s that same number for double-precision floating point? (52 bits in the significand)

2^−52 ≈ 10^−16

We can expect FP math to consistently introduce relative errors of about 10−16. Working in double precision gives you about 16 (decimal) accurate digits.
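For example, comparing the stored double value of π against more digits of the true value:

    import math

    print(f"{math.pi:.20f}")  # 3.14159265358979311600
    # true value:               3.14159265358979323846...
    # agreement ends after about 16 significant digits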

Implementing Arithmetic

How is floating point addition implemented? Consider adding a = (1.101)2 · 2^1 and b = (1.001)2 · 2^−1 in a system with three bits in the significand.

Rough algorithm:
1. Bring both numbers onto a common exponent.
2. Do grade-school addition from the front, until you run out of digits in your system.
3. Round the result.

a     = (1.101)2   · 2^1
b     = (0.01001)2 · 2^1
a + b ≈ (1.111)2   · 2^1
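A sketch of this align-add-round procedure in Python, mimicking a system with three bits after the leading one (the helper and its use of math.frexp are my own illustration):

    import math

    def fp_round(x, bits=3):
        # Keep only 1 + `bits` significant binary digits of x.
        m, e = math.frexp(x)       # x = m * 2**e with 0.5 <= m < 1
        scale = 2 ** (bits + 1)
        return round(m * scale) / scale * 2.0**e

    a = 1.625 * 2    # (1.101)2 * 2^1
    b = 1.125 / 2    # (1.001)2 * 2^-1
    print(fp_round(a + b))  # 3.75, i.e. (1.111)2 * 2^1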

Problems with FP Addition

What happens if you subtract two numbers of very similar magnitude? As an example, consider a = (1.1011)2 · 2^1 and b = (1.1010)2 · 2^1.

a     = (1.1011)2     · 2^1
b     = (1.1010)2     · 2^1
a − b ≈ (0.0001????)2 · 2^1

or, once we normalize,

(1.????)2 · 2^−3.

There is no data to indicate what the missing digits should be.

Problems with FP Addition

→ The machine fills them with its 'best guess', which is often not good.

This phenomenon is called Catastrophic Cancellation.

Floating Point Arithmetic

• Problem: the set of representable machine numbers is FINITE.
• So not all math operations are well defined!
• Basic algebra breaks down in floating point arithmetic

Example

a + (b + c) ≠ (a + b) + c
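For example, in Python:

    lhs = 0.1 + (0.2 + 0.3)
    rhs = (0.1 + 0.2) + 0.3
    print(lhs, rhs, lhs == rhs)  # 0.6 0.6000000000000001 False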

Floating Point Arithmetic

Rule 1.

fl(x) = x(1 + ε), where |ε| ≤ εm

Rule 2. For all operations ⊙ (one of +, −, ∗, /),

fl(x ⊙ y) = (x ⊙ y)(1 + ε⊙), where |ε⊙| ≤ εm

Rule 3. For the +, ∗ operations,

fl(a ⊙ b) = fl(b ⊙ a)

There were many discussions on what conditions/rules should be satisfied by floating point arithmetic. The IEEE standard is a set of standards adopted by many CPU manufacturers.

Floating Point Arithmetic

Consider the sum of 3 numbers: y = a + b + c.

Done as fl(fl(a + b) + c)

η = fl(a + b) = (a + b)(1 + ε1)

y1 = fl(η + c) = (η + c)(1 + ε2)
   = [(a + b)(1 + ε1) + c](1 + ε2)
   = [(a + b + c) + (a + b)ε1](1 + ε2)
   = (a + b + c) [ 1 + (a + b)/(a + b + c) · ε1(1 + ε2) + ε2 ]

So, disregarding the high-order term ε1ε2,

fl(fl(a + b) + c) = (a + b + c)(1 + ε3), with ε3 ≈ (a + b)/(a + b + c) · ε1 + ε2

Floating Point Arithmetic

If we redid the computation as y2 = fl(a + fl(b + c)) we would find

fl(a + fl(b + c)) = (a + b + c)(1 + ε4), with ε4 ≈ (b + c)/(a + b + c) · ε1 + ε2

Main conclusion:

The first error is amplified by the factor (a + b)/y in the first case and (b + c)/y in the second case.

In order to sum n numbers more accurately, it is better to start with the small numbers first. [However, sorting before adding is usually not worth the cost!]
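A quick experiment in Python (the series is an illustrative choice; math.fsum serves as an accurate reference):

    import math

    terms = [1.0 / k for k in range(1, 10**6)]

    fwd = sum(terms)            # biggest terms first
    bwd = sum(reversed(terms))  # smallest terms first

    print(fwd == bwd)           # False: the two orders round differently
    print(math.fsum(terms))     # correctly rounded reference sum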

Floating Point Arithmetic

One of the most serious problems in floating point arithmetic is that of cancellation. If two large and close-by numbers are subtracted, the result (a small number) carries very few accurate digits (why?). This is fine if the result is not reused. If the result is part of another calculation, then there may be a serious problem.

Example: Roots of the equation

x^2 + 2px − q = 0

Assume we want the root with smallest absolute value:

y = −p + √(p^2 + q) = q / (p + √(p^2 + q))
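Trying both formulas in Python with p large relative to q (the values are an illustrative choice):

    import math

    p, q = 1e8, 1.0                          # true root is about q/(2p) = 5e-09
    naive = -p + math.sqrt(p * p + q)        # subtracts two nearly equal numbers
    stable = q / (p + math.sqrt(p * p + q))  # same value, no subtraction
    print(naive)   # 0.0 (or similar noise): the digits cancel
    print(stable)  # 5e-09: accurate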

Catastrophic Cancellation

Adding c = a + b will result in a large error if
• a ≫ b
• a ≪ b

Let
a = x.xxx ··· × 10^0
b = y.yyy ··· × 10^−8

Then, with finite precision,

    x.xxx xxxx xxxx xxxx
  + 0.000 0000 yyyy yyyy yyyy yyyy
  = x.xxx xxxx zzzz zzzz ???? ????

where the trailing digits of b (the ???? positions) are lost.

Catastrophic Cancellation

Subtracting c = a − b will result in a large error if a ≈ b. For example, let

a = x.xxxx xxxx xxx1 ssss ...   (the ssss digits are lost to finite precision)
b = x.xxxx xxxx xxx0 tttt ...   (the tttt digits are lost to finite precision)

Then

    x.xxxx xxxx xxx1
  − x.xxxx xxxx xxx0
  = 0.0000 0000 0001 ???? ????

and almost all of the significant digits are lost.

Summary

• addition: c = a + b if a ≫ b or a ≪ b
• subtraction: c = a − b if a ≈ b
• catastrophic: caused by a single operation, not by an accumulation of errors
• can often be fixed by mathematical rearrangement

Loss of Significance

Example: x = 0.37214 48693 and y = 0.37202 14371. What is the relative error in x − y in a computer with 5 decimal digits of accuracy?

|x − y − (x̄ − ȳ)| / |x − y|
= |0.37214 48693 − 0.37202 14371 − 0.37214 + 0.37202| / |0.37214 48693 − 0.37202 14371|
≈ 3 × 10^−2

Loss of Significance

Loss of Precision Theorem
Let x and y be (normalized) floating point machine numbers with x > y > 0. If 2^−p ≤ 1 − y/x ≤ 2^−q for positive integers p and q, then the number of significant binary digits lost in calculating x − y is between q and p.

Loss of Significance

Example: Consider x = 37.593621 and y = 37.584216. Then

2^−12 < 1 − y/x = 0.00025 01754 < 2^−11

so we lose 11 or 12 bits in the computation of x − y. Yikes!

Example: Back to the other example (5 digits): x = 0.37214 and y = 0.37202. Then

10^−4 < 1 − y/x = 0.00032 < 10^−3

so we lose 3 or 4 decimal digits in the computation of x − y. Here, x − y = 0.00012, which has only 1 significant digit that we can be sure about.

Loss of Significance

So what to do? Mainly rearrangement:

f(x) = √(x^2 + 1) − 1

Problem at x ≈ 0.

One type of fix:

f(x) = (√(x^2 + 1) − 1) · (√(x^2 + 1) + 1)/(√(x^2 + 1) + 1) = x^2 / (√(x^2 + 1) + 1)

no subtraction!
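Comparing the two forms near x = 0 in Python:

    import math

    def f_naive(x):
        return math.sqrt(x * x + 1) - 1            # subtracts nearby numbers

    def f_fixed(x):
        return x * x / (math.sqrt(x * x + 1) + 1)  # rearranged, no subtraction

    x = 1e-8
    print(f_naive(x))  # 0.0: every significant digit cancelled
    print(f_fixed(x))  # 5e-17 ~ x**2/2: accurate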