An Introduction to Floating-Point Arithmetic and Computation
Total Page:16
File Type:pdf, Size:1020Kb
An Introduction to Floating-Point Arithmetic and Computation Jeff Arnold CERN openlab 9 May 2017 c 2017 Jeffrey M. Arnold Floating-Point Arithmetic and Computation 1 Agenda • Introduction • Standards • Properties • Error-Free Transformations • Summation Techniques • Dot Products • Polynomial Evaluation • Value Safety • Pitfalls and Gremlins • Tools • References and Bibliography c 2017 Jeffrey M. Arnold Floating-Point Arithmetic and Computation 2 Why is Floating-Point Arithmetic Important? • It is ubiquitous in scientific computing • Most research in HEP can't be done without it • Algorithms are needed which • Get the best answers • Get the best answers all the time • \Best"means the right answer for the situation and context • There is always a compromise between fast and accurate c 2017 Jeffrey M. Arnold Floating-Point Arithmetic and Computation 3 Important to Teach About Floating-Point Arithmetic • A rigorous approach to floating-point arithmetic is seldom taught in programming courses • Not enough physicists/programmers study numerical analysis • Many physicists/programmers think floating-point arithmetic is • inaccurate and ill-defined • filled with unpredictable behaviors and random errors • mysterious • Physicists/programmers need to be able to develop correct, accurate and robust algorithms • they need to be able to write good code to implement those algorithms c 2017 Jeffrey M. Arnold Floating-Point Arithmetic and Computation 4 Reasoning about Floating-Point Arithmetic Reasoning about floating-point arithmetic is important because • One can prove algorithms are correct without exhaustive evaluation • One can determine when they fail • One can prove algorithms are portable • One can estimate the errors in calculations • Hardware changes have made floating-point calculations appear to be less deterministic • SIMD instructions • hardware threading Accurate knowledge about these factors increases confidence in floating-point computations c 2017 Jeffrey M. Arnold Floating-Point Arithmetic and Computation 5 Classification of real numbers In mathematics, the set of real numbers R consists of • rational numbers Q fp=q : p; q 2 Z; q 6= 0g • integers Z fp : jpj 2 Wg • whole W fp : p 2 N [ 0g • natural N fp : p 2 f1; 2; :::gg • irrational numbers fx : x 2 R x2 = Qg • algebraic numbers A • transcendental numbers Dyadic rationals: ratio of an integer and 2b where b is a whole number c 2017 Jeffrey M. Arnold Floating-Point Arithmetic and Computation 6 Some Properties of Floating-Point Numbers Floating-point numbers do not behave as do the real numbers encountered in mathematics. While all floating-point numbers are rational numbers • The set of floating-point numbers does not form a field under the usual set of arithmetic operations • Some common rules of arithmetic are not always valid when applied to floating-point operations • There are only a finite number of floating-point numbers c 2017 Jeffrey M. Arnold Floating-Point Arithmetic and Computation 7 Floating-Point Numbers are Rational Numbers What does this imply? • Since there are only a finite number of floating-point numbers, there are rational numbers which are not floating-point numbers • The decimal equivalent of any finite floating-point value contains a finite number of non-zero digits p • The values of transcendentals such as π, e and 2 cannot be represented exactly by a floating-point value regardless of format or precision c 2017 Jeffrey M. Arnold Floating-Point Arithmetic and Computation 8 How Many Floating-Point Numbers Are There? p+1 • ∼ 2 (2emax + 1) • Single-precision: ∼ 4:3 × 109 • Double-precision: ∼ 1:8 × 1019 • Number of protons circulating in LHC: ∼ 6:7 × 1014 c 2017 Jeffrey M. Arnold Floating-Point Arithmetic and Computation 9 Standards There have been three major standards affecting floating-point arithmetic: • IEEE 754-1985 Standard for Binary Floating-Point Arithmetic • IEEE 854-1987 Standard for Radix-Independent Floating-Point Arithmetic • IEEE 754-2008 Standard for Floating-Point Arithmetic • This is the current standard • It is also an ISO standard (ISO/IEC/IEEE 60559:2011) c 2017 Jeffrey M. Arnold Floating-Point Arithmetic and Computation 10 IEEE 754-2008 • Merged IEEE 754-1985 and IEEE 854-1987 • Tried not to invalidate hardware which conformed to IEEE 754-1985 • Standardized larger formats • For example, quad-precision format • Standardized new instructions • For example, fused multiply-add (FMA) From now on, we will only talk about IEEE 754-2008 c 2017 Jeffrey M. Arnold Floating-Point Arithmetic and Computation 11 Operations Specified by IEEE 754-2008 All these operations must return the correct finite-precision result using the current rounding mode • Addition • Subtraction • Multiplication • Division • Remainder • Fused multiply add (FMA) • Square root • Comparison c 2017 Jeffrey M. Arnold Floating-Point Arithmetic and Computation 12 Other Operations Specified by IEEE 754-2008 • Conversions between different floating-point formats • Conversions between floating-point and integer formats • Conversion to integer must be correctly rounded • Conversion between floating-point formats and external representations as character sequences • Conversions must be monotonic • Under some conditions, binary ! decimal ! binary conversions must be exact (\round-trip"conversions) c 2017 Jeffrey M. Arnold Floating-Point Arithmetic and Computation 13 Special Values • Zero • zero is signed • Infinity • infinity is signed • Subnormals • NaN (Not a Number) • Quiet NaN • Signaling NaN • NaNs do not have a sign c 2017 Jeffrey M. Arnold Floating-Point Arithmetic and Computation 14 Rounding Modes in IEEE 754-2008 The result must be the infinity-precise result rounded to the desired floating-point format. Possible rounding modes are • Round to nearest • round to nearest even • in the case of ties, select the result with a significand which is even • required for binary and decimal • the default rounding mode for binary • round to nearest away • required only for decimal • round toward 0 • round toward +1 • round toward −∞ c 2017 Jeffrey M. Arnold Floating-Point Arithmetic and Computation 15 Exceptions Specified by IEEE 754-2008 • Underflow • Absolute value of a non-zero result is less than the smallest non-zero finite floating-point number • Result is 0 • Overflow • Absolute value of a result is greater than the largest finite floating-point number • Result is ±∞ • Division by Zero • x=y where x is finite and non-zero and y = 0 • Inexact • The result, after rounding, is different than the infinitely-precise result c 2017 Jeffrey M. Arnold Floating-Point Arithmetic and Computation 16 Exceptions Specified by IEEE 754-2008 • Invalid • pAn operand is a NaN • x where x < 0 p • however, −0 = −0 • (±∞) ± (±∞) • (±0) × (±∞) • (±0)=(±0) • (±∞)=(±∞) • some floating-point!integer or decimal conversions c 2017 Jeffrey M. Arnold Floating-Point Arithmetic and Computation 17 Formats Specified in IEEE 754-2008 Formats • Basic Formats: • Binary with sizes of 32, 64 and 128 bits • Decimal with sizes of 64 and 128 bits • Other formats: • Binary with a size of 16 bits • Decimal with a size of 32 bits c 2017 Jeffrey M. Arnold Floating-Point Arithmetic and Computation 18 Transcendental and Algebraic Functions The standard recommends the following functions be correctly rounded: • ex; ex − 1; 2x; 2x − 1; 10x; 10x − 1 • logα(Φ) for α = e; 2; 10 and Φ = x; 1 + x p • px2 + y2; 1= x; (1 + x)n; xn; x1=n • sin(x); cos(x); tan(x); sinh(x); cosh(x); tanh(x) and their inverse functions • sin(πx); cos(πx) • And more ::: c 2017 Jeffrey M. Arnold Floating-Point Arithmetic and Computation 19 We're Not Going to Consider Everything... The rest of this talk will be limited to the following aspects of IEEE 754-2008: • Binary32, Binary64 and Binary128 formats • The radix in these cases is always 2: β = 2 • This includes the formats handled by the SSE and AVX instruction sets on the x86 architecture • We will not consider any aspects of decimal arithmetic or the decimal formats • We will not consider \double extended"format • Also known as the \IA32 x87"format • The rounding mode is assumed to be round-to-nearest-even c 2017 Jeffrey M. Arnold Floating-Point Arithmetic and Computation 20 Storage Format of a Binary Floating-Point Number p − 1 bits w bits - - s E significand IEEE Name Format Size w p emin emax Binary32 Single 32 8 24 -126 +127 Binary64 Double 64 11 53 -1022 +1023 Binary128 Quad 128 15 113 -16382 +16383 Notes: • E = e − emin + 1 • emax = −emin + 1 • p − 1 will be addressed later c 2017 Jeffrey M. Arnold Floating-Point Arithmetic and Computation 21 The Value of a Floating-Point Number The format of a floating-point number is determined by the quantities: • radix β • sometimes called the \base" • sign s 2 f0; 1g • exponent e • an integer such that emin ≤ e ≤ emax • precision p • the number of \digits"in the number c 2017 Jeffrey M. Arnold Floating-Point Arithmetic and Computation 22 The Value of a Floating-Point Number The value of a floating-point number is determined by • the format of the number • the digits in the number: xi, 0 ≤ i < p, where 0 ≤ xi < β. The value of a floating-point number can be expressed as p−1 s e X −i x = (−) β xiβ i=0 where the significand is p−1 X −i m = xiβ i=0 with 0 ≤ m < β c 2017 Jeffrey M. Arnold Floating-Point Arithmetic and Computation 23 The Value of a Floating-Point Number The value of a floating-point number can also be written p−1 s e−p+1 X p−i−1 x = (−) β xiβ i=0 where the integral significand is p−1 X p−i−1 M = xiβ i=0 and M is an integer such that 0 ≤ M < βp c 2017 Jeffrey M. Arnold Floating-Point Arithmetic and Computation 24 The Value of a Floating-Point Number The value of a floating-point number can also be written as 8 M < (−)s if e − p + 1 < 0 x = β−(e−p+1) : (−)sβe−p+1M if e − p + 1 ≥ 0 where M is the integral significand.