An Introduction to Floating-Point Arithmetic and Computation

Jeff Arnold
CERN openlab
9 May 2017
© 2017 Jeffrey M. Arnold

Agenda

• Introduction
• Standards
• Properties
• Error-Free Transformations
• Summation Techniques
• Dot Products
• Polynomial Evaluation
• Value Safety
• Pitfalls and Gremlins
• Tools
• References and Bibliography

Why is Floating-Point Arithmetic Important?

• It is ubiquitous in scientific computing
  • Most research in HEP can't be done without it
• Algorithms are needed which
  • Get the best answers
  • Get the best answers all the time
• "Best" means the right answer for the situation and context
  • There is always a compromise between fast and accurate

Important to Teach About Floating-Point Arithmetic

• A rigorous approach to floating-point arithmetic is seldom taught in programming courses
• Not enough physicists/programmers study numerical analysis
• Many physicists/programmers think floating-point arithmetic is
  • inaccurate and ill-defined
  • filled with unpredictable behaviors and random errors
  • mysterious
• Physicists/programmers need to be able to develop correct, accurate and robust algorithms
  • They need to be able to write good code to implement those algorithms

Reasoning about Floating-Point Arithmetic

Reasoning about floating-point arithmetic is important because

• One can prove algorithms are correct without exhaustive evaluation
  • One can determine when they fail
• One can prove algorithms are portable
• One can estimate the errors in calculations
• Hardware changes have made floating-point calculations appear to be less deterministic
  • SIMD instructions
  • hardware threading

Accurate knowledge about these factors increases confidence in floating-point computations.

Classification of Real Numbers

In mathematics, the set of real numbers $\mathbb{R}$ consists of

• rational numbers $\mathbb{Q} = \{p/q : p, q \in \mathbb{Z},\ q \neq 0\}$
  • integers $\mathbb{Z} = \{p : |p| \in \mathbb{W}\}$
    • whole numbers $\mathbb{W} = \{p : p \in \mathbb{N} \cup \{0\}\}$
      • natural numbers $\mathbb{N} = \{p : p \in \{1, 2, \ldots\}\}$
• irrational numbers $\{x : x \in \mathbb{R},\ x \notin \mathbb{Q}\}$
  • algebraic numbers $\mathbb{A}$
  • transcendental numbers

Dyadic rationals: the ratio of an integer and $2^b$, where $b$ is a whole number.

Some Properties of Floating-Point Numbers

Floating-point numbers do not behave as do the real numbers encountered in mathematics. While all floating-point numbers are rational numbers:

• The set of floating-point numbers does not form a field under the usual set of arithmetic operations
• Some common rules of arithmetic are not always valid when applied to floating-point operations
• There are only a finite number of floating-point numbers

Floating-Point Numbers are Rational Numbers

What does this imply?

• Since there are only a finite number of floating-point numbers, there are rational numbers which are not floating-point numbers
• The decimal equivalent of any finite floating-point value contains a finite number of non-zero digits
• The values of irrational numbers such as $\pi$, $e$ and $\sqrt{2}$ cannot be represented exactly by a floating-point value regardless of format or precision

How Many Floating-Point Numbers Are There?

• $\sim 2^{p}\,(2 e_{\max} + 1)$
  • Single-precision: $\sim 4.3 \times 10^{9}$
  • Double-precision: $\sim 1.8 \times 10^{19}$
• Number of protons circulating in LHC: $\sim 6.7 \times 10^{14}$
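
The count can be checked directly. The sketch below assumes the estimate is read as $2^{p}(2 e_{\max} + 1)$ and compares it with a count of bit patterns: every pattern of the format, minus those whose exponent field is all ones (the infinities and NaNs). The function names are illustrative only, and both counts treat +0 and −0 as distinct values.

```python
def count_by_formula(p, emax):
    """Estimate of the number of finite values: 2^p * (2*emax + 1)."""
    return 2 ** p * (2 * emax + 1)


def count_by_bit_patterns(w, p):
    """All bit patterns minus those with an all-ones exponent field."""
    total = 2 ** (1 + w + (p - 1))   # sign bit + w exponent bits + (p - 1) significand bits
    reserved = 2 * 2 ** (p - 1)      # infinities and NaNs, for both signs
    return total - reserved


# (name, exponent field width w, precision p, emax)
for name, w, p, emax in [("binary32", 8, 24, 127), ("binary64", 11, 53, 1023)]:
    n = count_by_formula(p, emax)
    assert n == count_by_bit_patterns(w, p)
    print(f"{name}: {n} finite values (about {n:.1e})")
```

Both counts agree because $2 e_{\max} + 1 = 2^{w} - 1$ for these formats.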

Standards

There have been three major standards affecting floating-point arithmetic:

• IEEE 754-1985 Standard for Binary Floating-Point Arithmetic
• IEEE 854-1987 Standard for Radix-Independent Floating-Point Arithmetic
• IEEE 754-2008 Standard for Floating-Point Arithmetic
  • This is the current standard
  • It is also an ISO standard (ISO/IEC/IEEE 60559:2011)

IEEE 754-2008

• Merged IEEE 754-1985 and IEEE 854-1987
• Tried not to invalidate hardware which conformed to IEEE 754-1985
• Standardized larger formats
  • For example, quad-precision format
• Standardized new instructions
  • For example, fused multiply-add (FMA)

From now on, we will only talk about IEEE 754-2008.

Operations Specified by IEEE 754-2008

All these operations must return the correct finite-precision result using the current rounding mode:

• Addition
• Subtraction
• Multiplication
• Division
• Remainder
• Fused multiply-add (FMA)
• Square root
• Comparison

Other Operations Specified by IEEE 754-2008

• Conversions between different floating-point formats
• Conversions between floating-point and integer formats
  • Conversion to integer must be correctly rounded
• Conversion between floating-point formats and external representations as character sequences
  • Conversions must be monotonic
  • Under some conditions, binary → decimal → binary conversions must be exact ("round-trip" conversions)

Special Values

• Zero
  • zero is signed
• Infinity
  • infinity is signed
• Subnormals
• NaN (Not a Number)
  • Quiet NaN
  • Signaling NaN
  • NaNs do not have a sign
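
The special values listed above can be observed from Python, whose `float` is a binary64 value on common platforms (an assumption of this sketch). The variable names are illustrative; the quiet/signaling distinction for NaNs is not visible at this level and is not shown.

```python
import math
import sys

# Signed zeros: they compare equal, but the sign is still carried.
pos_zero, neg_zero = 0.0, -0.0
zeros_equal = (pos_zero == neg_zero)
neg_zero_sign = math.copysign(1.0, neg_zero)   # -1.0: the sign of -0.0 is negative

# Signed infinities lie beyond the largest finite value.
pos_inf, neg_inf = math.inf, -math.inf

# Subnormals fill the gap between zero and the smallest normal number.
min_normal = sys.float_info.min          # 2**-1022 for binary64
min_subnormal = math.ldexp(1.0, -1074)   # smallest positive subnormal of binary64

# A NaN compares unequal to everything, including itself.
nan = math.nan
nan_equals_itself = (nan == nan)

print(zeros_equal, neg_zero_sign)
print(pos_inf > sys.float_info.max, neg_inf < -sys.float_info.max)
print(0.0 < min_subnormal < min_normal)
print(nan_equals_itself, math.isnan(nan))
```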

Rounding Modes in IEEE 754-2008

The result must be the infinitely precise result rounded to the desired floating-point format. Possible rounding modes are

• Round to nearest
  • round to nearest even
    • in the case of ties, select the result whose significand is even
    • required for binary and decimal
    • the default rounding mode for binary
  • round to nearest away
    • required only for decimal
• round toward 0
• round toward +∞
• round toward −∞

Exceptions Specified by IEEE 754-2008

• Underflow
  • Absolute value of a non-zero result is less than the smallest non-zero finite floating-point number
  • Result is 0
• Overflow
  • Absolute value of a result is greater than the largest finite floating-point number
  • Result is ±∞
• Division by Zero
  • $x/y$ where $x$ is finite and non-zero and $y = 0$
• Inexact
  • The result, after rounding, differs from the infinitely precise result
• Invalid
  • An operand is a NaN
  • $\sqrt{x}$ where $x < 0$
    • however, $\sqrt{-0} = -0$
  • $(\pm\infty) \pm (\pm\infty)$
  • $(\pm 0) \times (\pm\infty)$
  • $(\pm 0)/(\pm 0)$
  • $(\pm\infty)/(\pm\infty)$
  • some floating-point → integer or decimal conversions

Formats Specified in IEEE 754-2008

• Basic formats:
  • Binary with sizes of 32, 64 and 128 bits
  • Decimal with sizes of 64 and 128 bits
• Other formats:
  • Binary with a size of 16 bits
  • Decimal with a size of 32 bits
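
Three of the binary formats listed above can be exercised with Python's standard `struct` module, whose format codes `e`, `f` and `d` correspond to the 16-, 32- and 64-bit binary formats. This is only a sketch: `struct` has no code for the 128-bit binary format or for the decimal formats, and the helper name `round_trip` is illustrative.

```python
import struct

# struct format codes for the IEEE 754 binary formats Python can pack directly.
CODES = {"binary16": "e", "binary32": "f", "binary64": "d"}


def round_trip(value, code):
    """Store a Python float in the given format and read it back as a float."""
    return struct.unpack("<" + code, struct.pack("<" + code, value))[0]


# Storage size of each format, in bits.
sizes = {name: struct.calcsize("<" + code) * 8 for name, code in CODES.items()}

# 0.1 is not exactly representable in binary; narrower formats keep fewer digits.
stored = {name: round_trip(0.1, code) for name, code in CODES.items()}

for name in CODES:
    print(f"{name}: {sizes[name]} bits, 0.1 is stored as {stored[name]!r}")
```

Values such as 0.5 that are dyadic rationals with a short significand survive the round trip unchanged in all three formats.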

Transcendental and Algebraic Functions

The standard recommends the following functions be correctly rounded:

• $e^x$, $e^x - 1$, $2^x$, $2^x - 1$, $10^x$, $10^x - 1$
• $\log_\alpha(\Phi)$ for $\alpha = e, 2, 10$ and $\Phi = x, 1 + x$
• $\sqrt{x^2 + y^2}$, $1/\sqrt{x}$, $(1 + x)^n$, $x^n$, $x^{1/n}$
• $\sin(x)$, $\cos(x)$, $\tan(x)$, $\sinh(x)$, $\cosh(x)$, $\tanh(x)$ and their inverse functions
• $\sin(\pi x)$, $\cos(\pi x)$
• And more ...

We're Not Going to Consider Everything...

The rest of this talk will be limited to the following aspects of IEEE 754-2008:

• Binary32, Binary64 and Binary128 formats
  • The radix in these cases is always 2: $\beta = 2$
  • This includes the formats handled by the SSE and AVX instruction sets on the x86 architecture
• We will not consider any aspects of decimal arithmetic or the decimal formats
• We will not consider the "double extended" format
  • Also known as the "IA32 x87" format
• The rounding mode is assumed to be round-to-nearest-even

Storage Format of a Binary Floating-Point Number

Field:   s       E        significand
Width:   1 bit   w bits   p − 1 bits

IEEE Name   Format   Size (bits)   w    p     e_min    e_max
Binary32    Single   32            8    24    −126     +127
Binary64    Double   64            11   53    −1022    +1023
Binary128   Quad     128           15   113   −16382   +16383

Notes:

• $E = e - e_{\min} + 1$
• $e_{\max} = -e_{\min} + 1$
• Why the significand field holds $p - 1$ bits will be addressed later

The Value of a Floating-Point Number

The format of a floating-point number is determined by the quantities:

• radix $\beta$
  • sometimes called the "base"
• sign $s \in \{0, 1\}$
• exponent $e$
  • an integer such that $e_{\min} \le e \le e_{\max}$
• precision $p$
  • the number of "digits" in the number

The value of a floating-point number is determined by

• the format of the number
• the digits in the number: $x_i$, $0 \le i < p$, where $0 \le x_i < \beta$
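
The storage layout above (sign s, biased exponent field E of w bits, and p − 1 stored significand bits) can be unpacked from a binary64 value. This is a minimal sketch that assumes Python's `float` is binary64; the names `fields` and `exponent` are illustrative, and `exponent` simply inverts the relation E = e − e_min + 1 for normal numbers.

```python
import struct

W, P = 11, 53             # binary64: exponent field width and precision
EMAX = 2 ** (W - 1) - 1   # +1023
EMIN = 1 - EMAX           # -1022


def fields(x):
    """Split a binary64 value into (s, E, stored significand bits)."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]   # reinterpret as a 64-bit integer
    s = bits >> (W + P - 1)                    # top bit
    E = (bits >> (P - 1)) & ((1 << W) - 1)     # next w bits
    t = bits & ((1 << (P - 1)) - 1)            # low p - 1 bits
    return s, E, t


def exponent(E):
    """Recover e from E = e - emin + 1 (valid for normal numbers only)."""
    return E + EMIN - 1


for x in (1.0, -2.0, 0.75):
    s, E, t = fields(x)
    print(f"{x!r}: s={s} E={E} e={exponent(E)} stored significand bits={t:#x}")
```

For 1.0 the fields come out as s = 0, E = 1023 and an all-zero stored significand, so e = 0.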

The value of a floating-point number can be expressed as

\[ x = (-1)^s \, \beta^{e} \sum_{i=0}^{p-1} x_i \beta^{-i} \]

where the significand is

\[ m = \sum_{i=0}^{p-1} x_i \beta^{-i} \]

with $0 \le m < \beta$.

The value of a floating-point number can also be written

\[ x = (-1)^s \, \beta^{e-p+1} \sum_{i=0}^{p-1} x_i \beta^{p-i-1} \]

where the integral significand is

\[ M = \sum_{i=0}^{p-1} x_i \beta^{p-i-1} \]

and $M$ is an integer such that $0 \le M < \beta^{p}$.

The value of a floating-point number can also be written as

\[ x = \begin{cases} (-1)^s \, \dfrac{M}{\beta^{-(e-p+1)}} & \text{if } e - p + 1 < 0 \\[1ex] (-1)^s \, \beta^{e-p+1} M & \text{if } e - p + 1 \ge 0 \end{cases} \]

where $M$ is the integral significand.
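
The integral-significand form can be checked with exact rational arithmetic. The sketch below assumes binary64 ($\beta = 2$, $p = 53$) and finite, non-zero, normal inputs; `decompose` and `value` are hypothetical helper names, and `value` follows the two-case form above.

```python
import math
from fractions import Fraction

P = 53  # precision of binary64


def decompose(x):
    """Return (s, M, e) such that x == (-1)**s * 2**(e - P + 1) * M."""
    s = 1 if math.copysign(1.0, x) < 0 else 0
    m, exp = math.frexp(abs(x))   # abs(x) == m * 2**exp with 0.5 <= m < 1
    M = int(m * 2 ** P)           # exact: scaling by a power of two loses no bits
    e = exp - 1                   # chosen so that M / 2**(P - 1) lies in [1, 2)
    return s, M, e


def value(s, M, e):
    """Rebuild the exact value using the two-case form."""
    k = e - P + 1
    sign = -1 if s else 1
    if k < 0:
        return Fraction(sign * M, 2 ** (-k))   # M divided by a power of two
    return Fraction(sign * M * 2 ** k)          # an integer


for x in (1.0, 0.1, -6.0, 1e300):
    s, M, e = decompose(x)
    assert value(s, M, e) == Fraction(x)   # Fraction(x) is the exact value of the float
    print(f"{x!r}: s={s} M={M} e={e}")
```

When $e - p + 1 < 0$ the value is an integer divided by a power of two, which is the dyadic-rational form; otherwise it is an integer.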
