<<

L. Vandenberghe ECE133A (Spring 2021) 17. IEEE floating point

floating point numbers with base 10 • floating point numbers with base 2 • IEEE floating point standard • machine precision • error •

17.1 Floating point numbers with base 10

4 G = .3 3 . . . 3= 10 ± ( 1 2 )10 ·

.3 3 . . . 3= is the mantissa (38 , 0 38 9, 3 ≠ 0 if G ≠ 0) • 1 2 ≤ ≤ 1 = is the mantissa length (or precision) • 4 is the exponent (4 4 4 ) • min ≤ ≤ max = 4 Interpretation: G = 3 10 1 3 10 2 3=10 10 ±( 1 − + 2 − + · · · + − )· Example (with = = 6):

12.625 = .126250 102 + ( )10 · 1 2 3 4 = 1 10− 2 10− 6 10− 2 10− + ( · + · + · + · 5 6 2 5 10− 0 10− 10 + · + · )· used in pocket calculators

IEEE floating point numbers 17.2 Properties

a finite set of numbers • unevenly spaced: distance between floating point numbers varies • = 1 – the smallest greater than 1 is 1 10− + + = – the smallest number greater than 10 is 10 10 2,... + − + largest positive number: • 4 = 4 .999 9 10 max = 1 10− 10 max +( ··· )10 · ( − )

smallest positive number: • 4 4 1 G = .100 0 10 min = 10 min− min +( ··· )10 ·

IEEE floating point numbers 17.3 Floating point numbers with base 2

4 G = .3 3 . . . 3= 2 ±( 1 2 )2 ·

.3 3 . . . 3= is the mantissa (38 0, 1 , 3 = 1 if G ≠ 0) • 1 2 ∈ { } 1 = is the mantissa length (or precision) • 4 is the exponent (4 4 4 ) • min ≤ ≤ max

= 4 Interpretation: G = 3 2 1 3 2 2 3=2 2 ±( 1 − + 2 − + · · · + − )· Example (with = = 8):

12.625 = .11001010 24 + ( )2 · 1 2 3 4 5 = 1 2− 1 2− 0 2− 0 2− 1 2− + ( · + · + · + · + · 6 7 8 4 0 2− 1 2− 0 2− 2 + · + · + · )· used in almost all computers

IEEE floating point numbers 17.4 Small example we enumerate all positive floating point numbers for

= = 3, 4 = 1, 4 = 2 min − max

0.25 0.5 1 2 2.5 3 3.5

.100 2 1 = 0.2500 .100 20 = 0.500 +( )2 · − +( )2 · .101 2 1 = 0.3125 .101 20 = 0.625 +( )2 · − +( )2 · .110 2 1 = 0.3750 .110 20 = 0.750 +( )2 · − +( )2 · .111 2 1 = 0.4375 .111 20 = 0.875 +( )2 · − +( )2 · .100 21 = 1.00 .100 22 = 2.0 +( )2 · +( )2 · .101 21 = 1.25 .101 22 = 2.5 +( )2 · +( )2 · .110 21 = 1.50 .110 22 = 3.0 +( )2 · +( )2 · .111 21 = 1.75 .111 22 = 3.5 +( )2 · +( )2 ·

IEEE floating point numbers 17.5 Properties for the floating point number system on page 17.4

a finite set of unevenly spaced numbers • the largest positive number is • 4 = 4 G = .111 1 2 max = 1 2− 2 max max +( ··· )2 · ( − )

the smallest positive number is • 4 4 1 G = .100 0 2 min = 2 min− min +( ··· )2 · in practice, the number system includes ‘subnormal’ numbers

unnormalized small numbers: 3 = 0, 4 = 4 • 1 min includes the number 0 •

IEEE floating point numbers 17.6 IEEE standard for binary arithmetic specifies two binary floating point number formats

IEEE standard single precision

= = 24, 4 = 125, 4 = 128 min − max requires 32 bits: 1 sign bit, 23 bits for mantissa, 8 bits for exponent

IEEE standard double precision

= = 53, 4 = 1021, 4 = 1024 min − max requires 64 bits: 1 sign bit, 52 bits for mantissa, 11 bits for exponent used in almost all modern computers

IEEE floating point numbers 17.7 Machine precision

Machine precision (of the binary floating point number system on page 17.4)

n = M = 2−

= is the mantissa length

Example: IEEE standard double precision

= , n 53 . 16 = 53 M = 2− 1 1102 10− ' ·

Interpretation

1 2n is the smallest floating point number greater than 1: + M . 1 1 = n 10 01 2 = 1 2 − = 1 2 M ( ··· )2 · + +

IEEE floating point numbers 17.8 Rounding

a floating-point number system is a finite set of numbers • all other numbers must be rounded • we use the notation fl G for the floating-point representation of G • ( )

Rounding rules

numbers are rounded to the nearest floating-point number • ties are resolved by rounding to the number with least significant bit 0 • (“round to nearest even”)

Example: numbers G 1, 1 2n are rounded to 1 or 1 2n ∈ ( + M) + M fl G = 1 for 1 G 1 n ( ) ≤ ≤ + M fl G = 1 2n for 1 n < G 1 2n ( ) + M + M ≤ + M therefore numbers between 1 and 1 n are indistinguishable from 1 + M IEEE floating point numbers 17.9 Rounding error and machine precision general bound on the rounding error:

fl G G | ( ) − | n G ≤ M | |

machine precision gives a bound on the relative error due to rounding • number of correct () digits in fl G is roughly • ( ) log n − 10 M

i.e., about 15 or 16 in IEEE double precision

fundamental limit on accuracy of numerical computations •

IEEE floating point numbers 17.10 Exercises

Exercise 1: explain the following results in MATLAB

>> (1 + 1e-16) - 1 ans = 0

>> (1 + 2e-16) - 1 ans = 2.2204e-16

>> (1 - 1e-16) - 1 ans = -1.1102e-16

>> 1 + (1e-16 - 1) ans = 1.1102e-16

IEEE floating point numbers 17.11 Exercise 2: run the following code in MATLAB and explain the result x = 2; for i=1:54 x = sqrt(x); end; for i=1:54 x = x^2; end

Exercise 3: explain the following results (log 1 G G 1 for small G) ( + )/ ≈ >> log(1 + 3e-16) / 3e-16 ans = 0.7401

>> log(1 + 3e-16) / ((1 + 3e-16) - 1) ans = 1.0000

IEEE floating point numbers 17.12 Exercise 4: the 5 G = 1 for G 10 16, 10 15 , evaluated as ( ) ∈ [ − − ] ((1 + x) - 1) / (1 + (x - 1))

2

1

0

16 15 10− x 10−

IEEE floating point numbers 17.13