L. Vandenberghe ECE133A (Spring 2021) 17. IEEE floating point numbers
floating point numbers with base 10 • floating point numbers with base 2 • IEEE floating point standard • machine precision • rounding error •
17.1 Floating point numbers with base 10
4 G = .3 3 . . . 3= 10 ± ( 1 2 )10 ·
.3 3 . . . 3= is the mantissa (38 integer, 0 38 9, 3 ≠ 0 if G ≠ 0) • 1 2 ≤ ≤ 1 = is the mantissa length (or precision) • 4 is the exponent (4 4 4 ) • min ≤ ≤ max = 4 Interpretation: G = 3 10 1 3 10 2 3=10 10 ±( 1 − + 2 − + · · · + − )· Example (with = = 6):
12.625 = .126250 102 + ( )10 · 1 2 3 4 = 1 10− 2 10− 6 10− 2 10− + ( · + · + · + · 5 6 2 5 10− 0 10− 10 + · + · )· used in pocket calculators
IEEE floating point numbers 17.2 Properties
a finite set of numbers • unevenly spaced: distance between floating point numbers varies • = 1 – the smallest number greater than 1 is 1 10− + + = – the smallest number greater than 10 is 10 10 2,... + − + largest positive number: • 4 = 4 .999 9 10 max = 1 10− 10 max +( ··· )10 · ( − )
smallest positive number: • 4 4 1 G = .100 0 10 min = 10 min− min +( ··· )10 ·
IEEE floating point numbers 17.3 Floating point numbers with base 2
4 G = .3 3 . . . 3= 2 ±( 1 2 )2 ·
.3 3 . . . 3= is the mantissa (38 0, 1 , 3 = 1 if G ≠ 0) • 1 2 ∈ { } 1 = is the mantissa length (or precision) • 4 is the exponent (4 4 4 ) • min ≤ ≤ max
= 4 Interpretation: G = 3 2 1 3 2 2 3=2 2 ±( 1 − + 2 − + · · · + − )· Example (with = = 8):
12.625 = .11001010 24 + ( )2 · 1 2 3 4 5 = 1 2− 1 2− 0 2− 0 2− 1 2− + ( · + · + · + · + · 6 7 8 4 0 2− 1 2− 0 2− 2 + · + · + · )· used in almost all computers
IEEE floating point numbers 17.4 Small example we enumerate all positive floating point numbers for
= = 3, 4 = 1, 4 = 2 min − max
0.25 0.5 1 2 2.5 3 3.5
.100 2 1 = 0.2500 .100 20 = 0.500 +( )2 · − +( )2 · .101 2 1 = 0.3125 .101 20 = 0.625 +( )2 · − +( )2 · .110 2 1 = 0.3750 .110 20 = 0.750 +( )2 · − +( )2 · .111 2 1 = 0.4375 .111 20 = 0.875 +( )2 · − +( )2 · .100 21 = 1.00 .100 22 = 2.0 +( )2 · +( )2 · .101 21 = 1.25 .101 22 = 2.5 +( )2 · +( )2 · .110 21 = 1.50 .110 22 = 3.0 +( )2 · +( )2 · .111 21 = 1.75 .111 22 = 3.5 +( )2 · +( )2 ·
IEEE floating point numbers 17.5 Properties for the floating point number system on page 17.4
a finite set of unevenly spaced numbers • the largest positive number is • 4 = 4 G = .111 1 2 max = 1 2− 2 max max +( ··· )2 · ( − )
the smallest positive number is • 4 4 1 G = .100 0 2 min = 2 min− min +( ··· )2 · in practice, the number system includes ‘subnormal’ numbers
unnormalized small numbers: 3 = 0, 4 = 4 • 1 min includes the number 0 •
IEEE floating point numbers 17.6 IEEE standard for binary arithmetic specifies two binary floating point number formats
IEEE standard single precision
= = 24, 4 = 125, 4 = 128 min − max requires 32 bits: 1 sign bit, 23 bits for mantissa, 8 bits for exponent
IEEE standard double precision
= = 53, 4 = 1021, 4 = 1024 min − max requires 64 bits: 1 sign bit, 52 bits for mantissa, 11 bits for exponent used in almost all modern computers
IEEE floating point numbers 17.7 Machine precision
Machine precision (of the binary floating point number system on page 17.4)
n = M = 2−
= is the mantissa length
Example: IEEE standard double precision
= , n 53 . 16 = 53 M = 2− 1 1102 10− ' ·
Interpretation
1 2n is the smallest floating point number greater than 1: + M . 1 1 = n 10 01 2 = 1 2 − = 1 2 M ( ··· )2 · + +
IEEE floating point numbers 17.8 Rounding
a floating-point number system is a finite set of numbers • all other numbers must be rounded • we use the notation fl G for the floating-point representation of G • ( )
Rounding rules
numbers are rounded to the nearest floating-point number • ties are resolved by rounding to the number with least significant bit 0 • (“round to nearest even”)
Example: numbers G 1, 1 2n are rounded to 1 or 1 2n ∈ ( + M) + M fl G = 1 for 1 G 1 n ( ) ≤ ≤ + M fl G = 1 2n for 1 n < G 1 2n ( ) + M + M ≤ + M therefore numbers between 1 and 1 n are indistinguishable from 1 + M IEEE floating point numbers 17.9 Rounding error and machine precision general bound on the rounding error:
fl G G | ( ) − | n G ≤ M | |
machine precision gives a bound on the relative error due to rounding • number of correct (decimal) digits in fl G is roughly • ( ) log n − 10 M
i.e., about 15 or 16 in IEEE double precision
fundamental limit on accuracy of numerical computations •
IEEE floating point numbers 17.10 Exercises
Exercise 1: explain the following results in MATLAB
>> (1 + 1e-16) - 1 ans = 0
>> (1 + 2e-16) - 1 ans = 2.2204e-16
>> (1 - 1e-16) - 1 ans = -1.1102e-16
>> 1 + (1e-16 - 1) ans = 1.1102e-16
IEEE floating point numbers 17.11 Exercise 2: run the following code in MATLAB and explain the result x = 2; for i=1:54 x = sqrt(x); end; for i=1:54 x = x^2; end
Exercise 3: explain the following results (log 1 G G 1 for small G) ( + )/ ≈ >> log(1 + 3e-16) / 3e-16 ans = 0.7401
>> log(1 + 3e-16) / ((1 + 3e-16) - 1) ans = 1.0000
IEEE floating point numbers 17.12 Exercise 4: the function 5 G = 1 for G 10 16, 10 15 , evaluated as ( ) ∈ [ − − ] ((1 + x) - 1) / (1 + (x - 1))
2
1
0
16 15 10− x 10−
IEEE floating point numbers 17.13