Revisiting" What Every Computer Scientist Should Know About
Total Page:16
File Type:pdf, Size:1020Kb
Revisiting “What Every Computer Scientist Should Know About Floating-point Arithmetic”

Vincent Lafage
Université Paris-Saclay, CNRS/IN2P3, IJCLab, 91405, Orsay, France. E-mail: [email protected]

Abstract The differences between the sets in which ideal arithmetic takes place and the sets of floating point numbers are outlined. A set of classical problems in correct numerical evaluation is presented, to increase the awareness of newcomers to the field. A self-defense, prophylactic approach to numerical computation is advocated.

1 Introduction

Precision, and its commoditization, particularly due to the plummeting of its cost and because of its reproducibility, has been the basis of our industrial societies since the mid-XVIIIth century. Numerical precision advanced even faster and farther than precision in mechanical manufacturing.

Thirty years after the paper by Goldberg [1], the global population of developers has arguably grown by a factor of about fifty, and if floating point was already deemed ubiquitous, the then-young IEEE 754 norm is now also ubiquitous, on every desk and in every portable device. Yet “floating-point arithmetic is still considered an esoteric subject by” numerous computer practitioners.

While numerical computations are pervasive nowadays, they may appear to be what they are not. Obviously, one needs computers, languages, compilers, and libraries for computing. Furthermore, one has to rely on some standard, such as IEEE 754 [2,3,4] and its evolutions [5,6], which quickly became ISO standards [7,8,9]. We note that this standard emerged in a rich environment of incompatible formats, and is now supported on all current major microprocessor architectures. Yet all this is but a model of the numbers. Our familiarity with the numbers typically used for computational purposes, i.e. the real numbers, may then blur our understanding of floating point numbers.

We will first discuss the common floating point representation of real numbers in a computer and list the differences between the properties of the numbers and the properties of their representation. We will then illustrate common examples where a naive implementation of a computation can lead to unexpectedly large uncertainties. In the next part, we delineate some approaches to understand, contain or mitigate these accuracy problems. We then conclude on the need for a broader awareness of floating point accuracy issues.

2 From real numbers to floating point numbers

Our “intuitive” understanding of numbers, for measurements, payments, arithmetic and higher computations, was developed over years of schooling and essentially builds on decimals: not only a base-ten positional numeral system, but also the set of all decimal fractions, $\mathbb{D} = \left\{ \frac{n}{10^p},\ n \in \mathbb{Z},\ p \in \mathbb{N} \right\}$ (D for decimal). This decimal notation, obvious for us, is rather new, dating back to François Viète in the XVIth century, who introduced the systematic use of decimal fractions and modern algebraic notations, and to John Napier in the early XVIIth century (improving on notations of Simon Stevin), who popularized the modern decimal separators (the full-stop dot as well as the comma) to separate the integral part of a number from its fractional part, in his Rabdologia.
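For instance, $3.14159 = \frac{314159}{10^{5}}$ belongs to $\mathbb{D}$, whereas $\frac{1}{3} = 0.333\ldots$ does not: no finite decimal fraction represents it exactly, only ever closer approximations such as $\frac{333333}{10^{6}}$.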
Decimal notation being central to the metric system, its use is widespread. But computing machines rely essentially on binary representations, and computers use subsets of the binary fractions, $\mathbb{B} = \left\{ \frac{n}{2^p},\ n \in \mathbb{Z},\ p \in \mathbb{N} \right\}$ (B for binary).

$\mathbb{B} = \mathbb{Z}[1/2]$ is generated by adjoining $1/2$ to $\mathbb{Z}$, in the same way that $\mathbb{D} = \mathbb{Z}[1/10]$. This construction is a rational extension of the ring $\mathbb{Z}$, or a localization of the ring of integers with respect to the powers of two. Therefore, algebraically, $(\mathbb{B}, +, \times)$ is a ring (like D). As it is not a field, division, as well as all rational functions, will require rounding. As B (like D) is only a small subset of the algebraic numbers, all roots will require rounding. As B (like D) is only a small subset of the real numbers $\mathbb{R}$, all transcendental functions¹ will also require rounding (even elementary ones: trigonometric functions, logarithms and the exponential; and special functions such as Bessel functions, elliptic functions, the Riemann ζ function, hypergeometric series, polylogarithms...). As B is a ring, we get most of the usual good properties we are used to in mathematics:

– commutativity: $\forall x \in \mathbb{B}, \forall y \in \mathbb{B}:\ x + y = y + x$ and $x \times y = y \times x$
– associativity: $\forall x \in \mathbb{B}, \forall y \in \mathbb{B}, \forall z \in \mathbb{B}:\ x + (y + z) = (x + y) + z$ and $x \times (y \times z) = (x \times y) \times z$
– distributivity: $\forall x \in \mathbb{B}, \forall y \in \mathbb{B}, \forall z \in \mathbb{B}:\ x \times (y + z) = x \times y + x \times z$

Unfortunately, in practice we are limited to elements of B with a finite representation. We call these “binary² floating point numbers”, with a given number of digits, in this case bits: $m$ bits for their mantissa (or significand) and $e$ bits for their exponent. Let us note these sets $\mathbb{F}_{m,e}$: they build an infinite lattice of subsets of B. The sizes $e$ of the exponent and $m$ of the mantissa include their respective sign bits. Furthermore, some implementations do not store the most significant bit of the mantissa, as it is one by default, thereby saving one bit of accuracy. Note that this saving is possible only because we use a base-two representation. Most of the time we rely on the hardware-embedded $\mathbb{F}_{24,8}$ for single precision and $\mathbb{F}_{53,11}$ for double precision.

Topologically, while B (like D) is dense in $\mathbb{R}$, which would allow for arbitrarily close approximations of real numbers, the finite counterparts of B (or D) used for floating point are nowhere dense sets. Similarly, while B (like D) is closed under addition and multiplication, their finite counterparts $\mathbb{F}_{m,e}$ are not:

– for multiplication, the number of significant digits of the product is usually the sum of the significant digits of the factors;
– for addition, the sum will be computed after shifting the less significant term relative to the most significant one, the exact resulting significand getting larger than the original ones.

If these results happen to fit into the fixed size, then the computation has been performed exactly; otherwise an inexact flag can be raised by the processor, and rounding is required. This is a source of trouble: we cannot catch, or trap in the specific case of IEEE 754, all the inexact exceptions that would appear in any realistic computation. Most of them appear even in the simplest operation (addition), and then we have no practical way to handle them: dynamically adaptable precision is possible only with a massive loss of speed, and is useless as soon as roots or transcendental functions¹ are used. These exceptions will appear at every level of precision.

When representing a real number $x$, one finds the two nearest representable floating point numbers: $x_-$ immediately below $x$ and $x_+$ immediately above $x$. Let us note $m$ the middle of the interval $[x_-, x_+]$ (a real number that cannot be represented in this floating point format). In order to measure the effect of rounding, one uses the smallest discrepancy between two consecutive floating point numbers, $x_-$ and $x_+$: they differ by one unit in the last place (ulp); see Fig. 1.

[Fig. 1: The basics of rounding — the real $x$ lies between the consecutive floating point numbers $x_-$ and $x_+$, which differ by 1 ulp; the middle $m$ of the interval lies at $\frac{1}{2}$ ulp from each.]

¹ As opposed to algebraic functions, i.e. functions that can be defined as the root of a polynomial with rational coefficients, in particular those involving only algebraic operations: addition, subtraction, multiplication, and division, as well as fractional or rational exponents.
² We will not deal here with other bases, except to illustrate the cancellation phenomenon and double rounding in the more natural base ten.
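These effects are easy to observe. The following C sketch (ours, not from the original paper; whether the flag is reliably visible can depend on the compiler and its optimization level) stores the decimal fraction 0.1, tests the IEEE 754 inexact flag through the standard fenv.h interface, and measures one ulp with nextafter:

    #include <stdio.h>
    #include <math.h>
    #include <fenv.h>

    /* Tell the compiler we access the floating point environment
       (not all compilers honor this pragma). */
    #pragma STDC FENV_ACCESS ON

    int main (void)
    {
        /* 0.1 is in D but not in B: merely storing it already rounds. */
        double tenth = 0.1;
        printf ("0.1 is stored as %.20f\n", tenth);

        /* The inexact flag records that an operation had to round. */
        feclearexcept (FE_ALL_EXCEPT);
        double sum = tenth + 0.2;
        if (fetestexcept (FE_INEXACT))
            printf ("FE_INEXACT raised: 0.1 + 0.2 was rounded\n");

        /* x+ is the next representable double above sum:
           the two differ by one unit in the last place (ulp). */
        double x_plus = nextafter (sum, INFINITY);
        printf ("1 ulp around 0.3 is about %.3g\n", x_plus - sum);
        return 0;
    }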
So when rounding to the nearest, the difference between the real number and the rounded one should be at most half an ulp: this is called correct rounding. As this proves difficult, a less strict requirement is to guarantee the rounding within one ulp (i.e. all bits correct but possibly the last one), which is called faithful rounding. In Fig. 1, both $x_-$ and $x_+$ are faithful roundings of $x$, but only $x_+$ is the correct rounding of $x$.

Thus, the behavior of floats is not the same as the behavior of binary numbers. Addition and multiplication inside the set of binary numbers are stable, while for floats they will be inexact most of the time, requiring round-off. From an object-oriented point of view, while the floats $\mathbb{F}_{m,e}$, being smaller sets, should be subtypes of the binary numbers, as subsets of B, the extra exceptions they bring violate the Liskov substitution principle (LSP) [10,11,12]³. We must not regard these subsets as behavioral subtypes, which goes against our intuition.

[…] stored in the NaN sign bit and in a payload of 23 bits for single precision (or 52 bits for double)⁴.

With these extensions, the lack of closure of bare float numbers is solved. But this solution comes at the cost of losing associativity and distributivity (a major behavioral shift), which are hard-wired in our ways of thinking about numbers, in the same way the implicit hierarchy of precisions is. Beyond algebraic properties, NaN also destroys the total order that exists in $\mathbb{R}$ and $\mathbb{B}$. For low-level programming and optimization, it also […]
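A short C sketch (ours, not the paper's) makes both losses tangible: with doubles, the grouping of additions changes the result once rounding occurs, and every comparison involving a NaN is false, so even $x = x$ fails:

    #include <stdio.h>
    #include <math.h>

    int main (void)
    {
        /* Associativity is lost: near 1.0e16 consecutive doubles are
           2.0 apart, so adding 1.0 twice differs from adding 2.0 once. */
        double a = 1.0e16, b = 1.0, c = 1.0;
        printf ("(a + b) + c = %.1f\n", (a + b) + c);  /* 10000000000000000.0 */
        printf ("a + (b + c) = %.1f\n", a + (b + c));  /* 10000000000000002.0 */

        /* NaN destroys the total order: all comparisons are false,
           including equality of a NaN with itself. */
        double x = NAN;
        printf ("x <  0.0: %d\n", x < 0.0);   /* 0 */
        printf ("x == x  : %d\n", x == x);    /* 0 */
        printf ("x >  0.0: %d\n", x > 0.0);   /* 0 */
        return 0;
    }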