Extended Precision Floating Point Arithmetic

Extended Precision Floating Point Numbers for Ill-Conditioned Problems Daniel Davis, Advisor: Dr. Scott Sarra Department of Mathematics, Marshall University Floating Point Number Systems Precision Extended Precision Patriot Missile Failure Floating point representation is based on scientific An increasing number of problems exist for which Patriot missile defense modules have been used by • The precision, p, of a floating point number notation, where a nonzero real decimal number, x, IEEE double is insufficient. These include modeling the U.S. Army since the mid-1960s. On February 21, system is the number of bits in the significand. is expressed as x = ±S × 10E, where 1 ≤ S < 10. of dynamical systems such as our solar system or the 1991, a Patriot protecting an Army barracks in Dha- This means that any normalized floating point The values of S and E are known as the significand • climate, and numerical cryptography. Several arbi- ran, Afghanistan failed to intercept a SCUD missile, number with precision p can be written as: and exponent, respectively. When discussing float- trary precision libraries have been developed, but leading to the death of 28 Americans. This failure E ing points, we are interested in the computer repre- x = ±(1.b1b2...bp−2bp−1)2 × 2 are too slow to be practical for many complex ap- was caused by floating point rounding error. The sentation of numbers, so we must consider base 2, or • The smallest x such that x > 1 is then: plications. David Bailey’s QD library may be used system measured time in tenths of seconds, using binary, rather than base 10. Thus, a non-zero num- −(p−1) in applications where two or four times double pre- only 24-bit registers. The algorithms employed by (1.00...01)2 = 1 + 2 ber, x, is written in the form, x = ±S × 2E, where cision is sufficient. Though much faster than ar- the Patriot were implemented in a way that allowed The gap between this number and 1 is called 1 ≤ S < 2. It follows, then, that the binary expan- • bitrary precision, the fact that these algorithms are the rounding error for the time to compound over machine epsilon, which we can write as: sion of the significand is given by S = (b0.b1b2b3...)2, implemented in software still forces them to be much time. When the Dharan system failed, it had been −(p−1) with b0 = 1. Consider x = 7.0. Then, in a floating m = (0.00...01)2 = 2 slower than double calculations. The table below left on for about 100 hours. 2 point system, we might have x = 1.75 × 2 . • Let xc be a floating point representation of a real displays a time comparison among double, double- number, x. Then the absolute error of xc is given double, and quad-double calculations for the basic Conclusion/Future Work |xc−x| IEEE Double Standard by |xc − x|. Its relative error is given by |x| . arithmetic operators, with the data normalized so • The maximum value of the rounding error for a that the times for doubles represent one time unit. Extended precision offers a solution for many prob- The ANSI/IEEE std 754 defines the double floating m lems in numerical mathematics that suffer from the binary floating point representation is equal to 2 . Type Add. Sub. Mult. Div. point format used in virtually all microprocessors. rounding error in IEEE double. However, that solu- Format Precision Machine Epsilon Double 1 1 1 1 tion comes at the cost of efficiency. Modern Graph- • Double values use a 64-bit word, with one bit −23 −7 Double-Double 21.45 11.87 13.83 15.53 ics Processing Units (GPUs) offer thousands of cores reserved for the sign, 11 for the exponent, and 52 Single p = 24 m = 2 ≈ 1.2 ∗ 10 −52 −16 Quad-Double 70.22 74.10 120.21 129.11 which can be highly advantageous for parallelizable for the significand. Double p = 53 m = 2 ≈ 2.2 ∗ 10 Table 2: Times Required for Arithmetic Operations with Data algorithms. A parallel implementation of the QD • The exponent field is stored using biased Table 1: Precision of Floating Point Formats Normalized so that Double Precision is 1 Unit For Each Operator library known as GQD was published in 2010, but representation, meaning that the value of the abandoned shortly after. In the future, we hope to exponent B is stored as E + B, where B is the implement the QD algorithms in OpenCL for exe- exponent bias, which is 1023 for double precision. Solutions of Chaotic Lorenz Equations cution across GPUs and other highly parallel plat- The 52-bit binary significand allows for about 16 • Chaotic systems are those which are highly sensitive to changes in initial conditions.. We seek time-step forms. decimal places of precision. independent solutions to the chaotic Lorenz equations: References 0 x = σ(y − x), x(0) = x0, 0 Sample Conversion y = rx − y − xz, y(0) = y0, (1) 0 [1] Yozo Hida, Xiaoye S Li, and David H Bailey. z = −bz + xy, z(0) = z0. Library for double-double and quad-double arithmetic. Let us convert x = 4.5 into an IEEE double. The We performed the experiment with step sizes ∆t = 1 × 10−6 and ∆t = 1 × 10−7. Figure 1 shows NERSC Division, Lawrence Berkeley National sign is positive, so the sign bit is 0. We use division 1 2 the resulting x-values of the solutions calculated in double precision, with the blue line representing the Laboratory, 2007. to find that x/22 = 1.125, giving us an exponent of solutions found using ∆t , and the green line using ∆t . The solutions visually diverge around t = 42. [2] Scott A Sarra and Clyde Meador. 2. Using biased representation, this value is stored 1 2 Figure 2 represents the same calculation performed in double-double precision. Here, the solutions do not On the numerical solution of chaotic dynamical systems as 1025, which has a binary string of 10000000001. using extend precision floating point arithmetic and very visually diverge for t ≤ 50. The decimal portion of .125 is converted to binary high order numerical methods. Nonlinear Analysis, 16(3):340–352, 2011. using the Repeated Multiply-by-2 method, append- ing the value left of the decimal point to the frac- tional part at each stage, until we reach the exact Acknowledgements value or the limit of the length of the significand. Thus, the double precision value is given by: Thank you to Dr. Sarra and Dr. Niese for your Sign Exponent Mantissa Figure 1: Solutions in double precision. Figure 2: Solutions in double-double precision. support and for answering so many questions during 0 10000000001 0010000000 ... this project..

Extended Precision Floating Point Arithmetic

Arithmetic Algorithms for Extended Precision Using Floating-Point Expansions Mioara Joldes, Olivier Marty, Jean-Michel Muller, Valentina Popescu

Floating Points

Hacking in C 2020 the C Programming Language Thom Wiggers

Floating Point Formats

Scilab Textbook Companion for Numerical Methods for Engineers by S

Introduction & Floating Point NMMC §1.4.1, Atkinson §1.2 1 Floating

1.2 Round-Off Errors and Computer Arithmetic

Floating Point Representation (Unsigned) Fixed-Point Representation

A Variable Precision Hardware Acceleration for Scientific Computing Andrea Bocco

Effectiveness of Floating-Point Precision on the Numerical Approximation by Spectral Methods

MIPS Floating Point

Absolute and Relative Error All Computations on a Computer Are Approximate by Nature, Due to the Limited Precision on the Computer