
Floating Point Arithmetic
APM 6646, Suhan Zhong

Note: This note mainly uses materials from the book Applied Linear Algebra (Second Edition).

Limitations of Digital Representations

Q1: How can we represent a real number on a digital machine?
Q2: Can we represent all real numbers on a computer?

In fact, digital computers can only represent a finite subset of the real numbers. Let us use F to denote this finite set. Since F is finite, we can easily conclude that:

1. F is bounded: max_{x ∈ F} |x| < ∞.
2. Numbers in F have gaps: min_{x, y ∈ F, x ≠ y} |x − y| > 0.

Both constraints limit numerical computations. However, 1 is rarely a problem anymore, since modern computers can represent numbers that are sufficiently large and small, while 2 remains an important concern throughout scientific computing. Now let us take IEEE double precision arithmetic as an example. In this arithmetic, the interval [1, 2] is represented by the discrete subset

1, 1 + 2^{-52}, 1 + 2 × 2^{-52}, 1 + 3 × 2^{-52}, ..., 2.   (1)

The interval [2, 4] is represented by the same numbers multiplied by 2,

2, 2 + 2^{-51}, 2 + 2 × 2^{-51}, 2 + 3 × 2^{-51}, ..., 4,   (2)

and in general, the interval [2^j, 2^{j+1}] is represented by the numbers in (1) multiplied by 2^j. Note that in IEEE double precision arithmetic, the gaps between adjacent numbers are, in a relative sense, never larger than 2^{-52} ≈ 2.22 × 10^{-16}. Though this seems negligible, it is surprising how many carelessly constructed algorithms turn out to be unstable at this resolution; a classic example is the difference in behavior between the classical and modified Gram-Schmidt algorithms.
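Both limitations above can be observed directly. The following sketch (Python standard library only) prints the largest representable double and the gaps between adjacent floats in a few intervals, showing how the spacing doubles from one interval [2^j, 2^{j+1}] to the next:

```python
import math
import sys

# F is bounded: the largest finite double; doubling it overflows to inf.
print(sys.float_info.max)        # about 1.7976931348623157e+308
print(sys.float_info.max * 2)    # inf

# F has gaps: math.ulp(x) is the gap between x and the next larger float.
print(math.ulp(1.0))             # 2**-52, the spacing in [1, 2]
print(math.ulp(2.0))             # 2**-51, the spacing in [2, 4] (doubled)
print(math.ulp(2.0 ** 10))       # 2**-42, the spacing in [2**10, 2**11]
```

Note that `math.ulp` reports the full gap to the next float, which is why the value at 1.0 is 2^{-52} rather than the machine epsilon 2^{-53} discussed below.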

Machine Epsilon

IEEE arithmetic is an example of an arithmetic system based on a floating point representation of the real numbers. The resolution of F is traditionally summarized by a quantity known as machine epsilon:

    ε_machine = (1/2) β^{1−t},   (3)

where the integer β ≥ 2 is known as the base and the integer t ≥ 1 is known as the precision. ε_machine is half the distance between 1 and the next larger floating point number. It has the following property:

    For all x ∈ R, there exists x' ∈ F such that |x − x'| ≤ ε_machine |x|.   (4)

(How can we prove this property? Hint: use the definition of ε_machine.) For the values of β and t common on various computers, ε_machine usually lies between 10^{-6} and 10^{-35}. In IEEE single and double precision arithmetic, ε_machine is specified to be 2^{-24} ≈ 5.96 × 10^{-8} and 2^{-53} ≈ 1.11 × 10^{-16}, respectively. Let fl: R → F be the function giving the closest floating point approximation to a real number, its rounded equivalent in the floating point system. Then (4) can be restated as follows:

    For all x ∈ R, there exists ε with |ε| ≤ ε_machine such that fl(x) = x(1 + ε).   (5)

In other words, the difference between a real number and its closest floating point approximation is always no larger than ε_machine in relative terms.
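Property (5) can be checked numerically. In the sketch below (Python standard library; `fractions.Fraction` plays the role of exact real arithmetic), we measure the relative error of fl(1/10), since 1/10 is not exactly representable in binary:

```python
import sys
from fractions import Fraction

# In double precision, eps_machine = 2**-53; sys.float_info.epsilon is the
# gap between 1.0 and the next larger float, i.e. 2 * eps_machine.
eps_machine = Fraction(1, 2**53)
assert Fraction(sys.float_info.epsilon) == 2 * eps_machine

# The literal 0.1 stores fl(1/10), the closest double to 1/10.
exact = Fraction(1, 10)
stored = Fraction(0.1)            # exact rational value of the stored double
rel_err = abs(stored - exact) / exact
assert rel_err <= eps_machine     # |fl(x) - x| <= eps_machine * |x|, as in (5)
print(float(rel_err))             # about 5.55e-17, below 2**-53 ≈ 1.11e-16
```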

Floating Point Arithmetic

Now let us focus on the elementary arithmetic operations on F. Let x, y ∈ F, let ∗ be one of the operations +, −, ×, ÷ (on R), and let ⊛ be its floating point analogue (on F). Then x ⊛ y must be given exactly by

    x ⊛ y = fl(x ∗ y).   (6)

From this we can conclude that the computer has a simple and powerful property.

Fundamental Axiom of Floating Point Arithmetic

For all x, y ∈ F, there exists ε with |ε| ≤ ε_machine such that

    x ⊛ y = (x ∗ y)(1 + ε).

That is, every operation of floating point arithmetic is exact up to a relative error of size at most ε_machine.
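The axiom can be tested empirically. The sketch below (Python standard library; `fractions.Fraction` again supplies exact arithmetic) compares the computed x ⊛ y with the exact x ∗ y for + and × on random pairs, assuming no overflow or underflow occurs:

```python
import random
from fractions import Fraction

eps_machine = Fraction(1, 2**53)  # double precision

random.seed(0)
for _ in range(1000):
    x = random.uniform(-1e6, 1e6)
    y = random.uniform(-1e6, 1e6)
    for computed, exact in [
        (x + y, Fraction(x) + Fraction(y)),   # floating point vs. exact sum
        (x * y, Fraction(x) * Fraction(y)),   # floating point vs. exact product
    ]:
        if exact != 0:
            # |x (*) y - x * y| <= eps_machine * |x * y|
            rel_err = abs(Fraction(computed) - exact) / abs(exact)
            assert rel_err <= eps_machine
print("axiom holds on 1000 random pairs")
```

The check passes because IEEE arithmetic rounds each of +, −, ×, ÷ correctly, so the computed result is exactly fl(x ∗ y), as in (6).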
