Floating-Point Numbers and Rounding Errors
FLOATING-POINT ARITHMETIC
Rounding errors and their analysis

Floating-Point Arithmetic on Computers
◦ Scientific notation: $5.1671 \times 10^{-3}$
◦ Floating-point notation: 5.1671e-3
◦ Advantage: easy to tell the magnitude of the quantity and to fix the precision of approximations: $\sqrt{777777} \approx 8.81916 \times 10^{2}$
◦ Keywords: mantissa, exponent, base, precision
◦ Applications in physics and engineering require huge amounts of numerical computations
◦ We need to fix a particular format in which to write down these huge quantities of numbers
◦ Solution: floating-point format with fixed precision
◦ Trade-off: effort for storing and calculating vs. accuracy of computations

We need to compute on a large scale… (Figure: human computers vs. digital computers; source: NASA)

Floating-Point Arithmetic on Computers
◦ Floating-point format: scientific notation, but we restrict the precision in all computations
◦ Mainboards and graphics cards have dedicated floating-point units only for such computations
◦ Phones, tablets, laptops, supercomputers
◦ Many programming languages (C, C++, Java, …) have datatypes for specific floating-point formats
◦ Most popular floating-point formats:
  ◦ Single precision (4 bytes, 23-bit mantissa, 8-bit exponent)
  ◦ Double precision (8 bytes, 52-bit mantissa, 11-bit exponent)
◦ Rounding errors are typically negligible, but they accumulate over many calculations

Floating-Point Arithmetic on Computers
◦ Floating-point units implement computation with certain floating-point formats
◦ Single and double precision format, and possibly other formats such as quadruple precision
◦ Built-in capacity for addition, subtraction, multiplication, division, square root, and possibly trigonometric functions

APPROXIMATION OF NUMBERS
Absolute and relative errors, unit roundoff, machine epsilon

Approximating numbers
Suppose that $x$ is some number and that $\hat{x}$ is an approximation to that number.
The absolute error is $|\hat{x} - x|$.
The relative error is $|\hat{x} - x| / |x|$.
In practice, only the relative error is a good measure of approximation quality: an absolute error of $10^{2}$ is unimportant if $x = 10^{12}$, but it is quite notable if $x = 10^{4}$.

Approximating numbers
For the discussion of the relative error, we sometimes write $\hat{x} = x(1 + \delta)$ for some $\delta$ that indicates the fraction by which the approximation $\hat{x}$ differs from the true value $x$. In that case, $|\delta|$ is the relative error:
$\dfrac{\hat{x} - x}{x} = \delta, \qquad \dfrac{|\hat{x} - x|}{|x|} = |\delta|.$

Approximating numbers (Example)
The quantity that we want to compute and the quantity that we use are typically different:
$\pi = 3.1415926535\ldots$ vs. $\hat{\pi} = 3.14159$
Absolute and relative error:
$|3.14159265359\ldots - 3.14159| \approx 2.65 \times 10^{-6}$
$\dfrac{|3.14159265359\ldots - 3.14159|}{|3.14159265359\ldots|} \approx 8.44 \times 10^{-7}$
The relative error tells us how the magnitude of the error compares to the magnitude of the actual value. The relative error stays the same if we change the scale (e.g., the units).

Unit Roundoff
The unit roundoff $u$ of a floating-point format is the maximum relative error of a number when approximated in the floating-point format:
$\dfrac{|x_{\mathrm{float}} - x|}{|x|} < u,$
where $x_{\mathrm{float}}$ is the closest approximation to $x$ in the floating-point format.
The unit roundoff depends only on the number of fractional digits that we keep in the floating-point format (and on the base that we use).
For example, if we use a decimal floating-point system with 5 fractional digits, then the closest approximation to $\pi$ is $\pi_{\mathrm{float}} = 3.14159$. The unit roundoff when using five fractional digits is then $u = \tfrac{1}{2} \times 10^{-5} = 5 \times 10^{-6}$.
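As a quick illustration of these error measures, here is a minimal Python sketch (the helper name relative_error is just an illustrative choice): it computes the absolute and relative error of the five-digit approximation of π and checks the relative error against the unit roundoff of that five-digit decimal format.

```python
import math

def relative_error(approx, exact):
    """Relative error |approx - exact| / |exact| (illustrative helper)."""
    return abs(approx - exact) / abs(exact)

# Five fractional decimal digits, as in the example above;
# math.pi (the closest double to pi) serves as the "exact" value here.
pi_hat = 3.14159
print(abs(pi_hat - math.pi))               # absolute error, ~2.65e-06
print(relative_error(pi_hat, math.pi))     # relative error, ~8.44e-07

# The relative error stays below the unit roundoff of the five-digit
# decimal format, u = 1/2 * 10**(-5) = 5e-06.
print(relative_error(pi_hat, math.pi) < 0.5e-5)    # True

# For IEEE double precision (52-bit mantissa, base 2), the analogous
# unit roundoff is u = 2**(-53), roughly 1.11e-16.
print(2.0 ** -53)
```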
Machine Epsilon
The machine epsilon $\epsilon$ of a floating-point format is the difference between 1 and the next larger floating-point number. Like the unit roundoff, the machine epsilon depends only on the precision (and the base) of the floating-point format. The machine epsilon and the unit roundoff are comparable to each other. Terminology may vary: some authors use machine epsilon and unit roundoff interchangeably. This has few consequences for practical purposes.

APPROXIMATE CALCULATIONS
Basic arithmetic operations

Basic arithmetic operations
We express all calculations in a few “basic” arithmetic operations
$+, \; -, \; \times, \; \div, \; \mathrm{sqrt}, \; \exp, \; \ldots$
These are replaced by approximate basic operations
$\widehat{+}, \; \widehat{-}, \; \widehat{\times}, \; \widehat{\div}, \; \widehat{\mathrm{sqrt}}, \; \widehat{\exp}, \; \ldots$
because the result is always approximated by some number in the floating-point format.

Basic arithmetic operations
Modern implementations of floating-point arithmetic require, e.g., that the result of $a \,\widehat{+}\, b$ is the same as $a + b$ after being approximated in the floating-point format:
$|(a \,\widehat{+}\, b) - (a + b)| \le u \, |a + b|.$
Similarly for the other elementary operations: the floating-point operation gives the best approximation to the exact result within the floating-point format. So there exists $-u \le \delta \le u$ such that
$a \,\widehat{+}\, b = (a + b)(1 + \delta).$
Analogously for the other operations.

Basic arithmetic operations
In long calculations, inexact intermediate results are used as input for other inexact calculations, and the errors accumulate in the long run:
$(a \,\widehat{+}\, b) \,\widehat{\times}\, (c \,\widehat{+}\, d) = \bigl((a+b)(1+\delta_1)\bigr) \,\widehat{\times}\, \bigl((c+d)(1+\delta_2)\bigr)$
$= \bigl((a+b)(1+\delta_1) \times (c+d)(1+\delta_2)\bigr)(1+\delta_3)$
$= (a+b) \times (c+d) \times (1+\delta_1)(1+\delta_2)(1+\delta_3),$
where $-u \le \delta_1, \delta_2, \delta_3 \le u$ indicate the approximation errors at each step. For the relative error this gives
$\dfrac{(a \,\widehat{+}\, b) \,\widehat{\times}\, (c \,\widehat{+}\, d) - (a+b) \times (c+d)}{(a+b) \times (c+d)} = (\delta_2 + \delta_3 + \delta_2\delta_3)(1+\delta_1) + \delta_1.$

Basic arithmetic operations
Another example:
$a \,\widehat{\times}\, b \,\widehat{-}\, c = (a \times b)(1+\delta_1) \,\widehat{-}\, c = \bigl((a \times b)(1+\delta_1) - c\bigr)(1+\delta_2),$
where $-u \le \delta_1, \delta_2 \le u$ are the approximation errors at each step. We isolate the true result from the last expression:
$a \,\widehat{\times}\, b \,\widehat{-}\, c = (a \times b - c) + \delta_2 (a \times b - c) + (a \times b)\,\delta_1 (1+\delta_2).$
Hence
$\dfrac{(a \,\widehat{\times}\, b \,\widehat{-}\, c) - (a \times b - c)}{a \times b - c} = \delta_2 + \dfrac{(a \times b)\,\delta_1 (1+\delta_2)}{a \times b - c}.$

Basic arithmetic operations
This effect is called cancellation: when subtracting two numbers $x$ and $y$ that are almost equal, small relative errors in $x$ and $y$ lead to relatively big errors in $x - y$. For the difference of the perturbed variables $x(1+\delta_1)$ and $y(1+\delta_2)$:
$x(1+\delta_1) - y(1+\delta_2) = (x - y) + \delta_1 x - \delta_2 y.$
Thus, for the relative error we find:
$\dfrac{\bigl(x(1+\delta_1) - y(1+\delta_2)\bigr) - (x - y)}{x - y} = \dfrac{\delta_1 x - \delta_2 y}{x - y}, \qquad \dfrac{|\delta_1 x - \delta_2 y|}{|x - y|} \le u \, \dfrac{|x| + |y|}{|x - y|}.$

Underflow/Overflow of Floating-Point Numbers
◦ Most implementations are binary and have the exponent constrained to a certain range.
◦ If an intermediate calculation produces a very small non-zero number, then the result may be rounded to zero (underflow).
◦ If an intermediate calculation produces a very large number, then the result may be rounded to $\pm\infty$ (overflow).
◦ Lastly, certain operations such as $0/0$, $\infty/\infty$, $\sqrt{-1}$, … produce NaN (Not-a-Number), which indicates an invalid computation.
◦ This has numerous consequences for the design of algorithms. For example, it is good practice to check the range of the operands involved in every division.
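A minimal Python sketch of this overflow, underflow, and NaN behaviour, assuming IEEE double precision (the constants 1e308 and 1e-320 are just illustrative choices):

```python
import math

# Overflow: the intermediate result is too large and becomes +inf.
big = 1e308 * 10.0
print(big, math.isinf(big))         # inf True

# Underflow: a very small non-zero intermediate result is rounded to zero.
tiny = 1e-320 / 1e10
print(tiny)                         # 0.0

# Invalid operations such as inf - inf or inf / inf produce NaN.
bad = math.inf - math.inf
print(bad, math.isnan(bad))         # nan True

# NaN compares unequal to everything, including itself.
print(bad == bad)                   # False

# Good practice before a division: check the range of the operands.
den = tiny
if den == 0.0 or math.isnan(den) or math.isinf(den):
    print("division by this denominator would be invalid or out of range")
```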
Basic arithmetic operations
◦ The discussion of relative errors applies to input errors (e.g., measurement errors) as well as to roundoff errors (e.g., errors in intermediate calculations).
◦ Roundoff errors are constrained by the machine epsilon / unit roundoff.
◦ The roundoff errors of a sequence of operations may accumulate, and their combined effect can be quite difficult to analyze.
◦ The analysis of relative errors is very complex even for algorithms that are otherwise well understood.
◦ The most important problem in practice is the loss of relative precision when taking the difference of numbers that are almost equal.
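To make the last point concrete, here is a minimal Python sketch of cancellation, assuming IEEE double precision: the stored inputs are accurate to about 16 significant digits, yet the difference of two nearly equal numbers retains only about 8 of them in this example.

```python
# Cancellation: (1 + d) - 1 should equal d, but the subtraction of the
# nearly equal numbers 1 + d and 1 exposes the rounding error that was
# made when 1 + d was stored in double precision.
d = 1e-8
computed = (1.0 + d) - 1.0
print(computed)                     # ~9.99999994e-09 rather than 1e-08
print(abs(computed - d) / d)        # relative error ~6e-09

# Bound from the cancellation analysis above: relative error up to about
# u * (|x| + |y|) / |x - y| with x = 1 + d and y = 1, i.e. an
# amplification factor of roughly 2 / d.
u = 2.0 ** -53
print(u * ((1.0 + d) + 1.0) / d)    # ~2.2e-08, consistent with the above
```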