arXiv:2012.02492v1 [math.NA] 4 Dec 2020

Computing and Software for Big Science manuscript No.
(will be inserted by the editor)

Revisiting "What Every Computer Scientist Should Know About Floating-point Arithmetic"

Vincent Lafage

Received:

Abstract The differences between the sets in which ideal arithmetics takes place and the sets of floating point numbers are outlined. A set of classical problems in correct numerical evaluation is presented, to increase the awareness of newcomers to the field. A self-defense, prophylactic approach to numerical computation is advocated.

Vincent Lafage
Université Paris-Saclay, CNRS/IN2P3, IJCLab, 91405 Orsay, France
E-mail: [email protected]

1 Introduction

Precision, and particularly its commoditization, due to the plummeting of its cost and because of its reproducibility, has been the base for our industrial societies since the mid XVIIIth century. Numerical precision went even faster and farther than precision in mechanical manufacturing.

Thirty years after the paper by Goldberg [1], the global population of developers has arguably grown by a factor of about fifty, and the then young IEEE 754 norm is now ubiquitous: floating-point is already deemed portable and is available on every desk and in every device. Yet floating-point arithmetic is still "considered an esoteric subject" by numerous computer practitioners.

While numerical computations are pervasive nowadays, they may appear to numerous practitioners as merely what one needs for computing: computers, languages, libraries, compilers, whether standard or not. Obviously, one has to rely on some standard, such as IEEE 754 [2,3,4] and its evolutions [5,6], that quickly became ISO standards [7,8,9]. We note that this standard emerged in a rich environment of incompatible formats, and is now supported on all major current microprocessor architectures.

We will first discuss the common floating point representation of real numbers in a computer and list the differences between the properties of the real numbers and the properties of their representation. We will then illustrate common examples where a naive implementation of the computation can lead to unexpectedly large uncertainties. In the next part, we delineate some approaches to understand, contain or mitigate these accuracy problems. We then conclude about the need for a broader awareness on floating point accuracy issues.

2 From real numbers to floating point numbers
Our "intuitive" understanding of numbers, for measurements, payments, arithmetics and higher computations, was developed over years of schooling and builds essentially on decimals, which is not only a base-ten positional numeral system, but also the set of all decimal fractions, called D = { n / 10^p , n ∈ Z , p ∈ N } for decimal. Yet it is all but the set of numbers typically used for modeling: our familiarity with the real numbers may then blur our understanding of the floating point numbers used for the computational purpose.

This decimal notation, obvious for us, is rather new, dating back to François Viète in the XVIth century, who introduced the systematic use of decimal fractions and modern algebraic notations, and to John Napier in the early XVIIth century, who popularized the modern decimal separators (full stop or dot, as well as comma, improving on notations of Simon Stevin) to separate the integral part of a number from its fractional part, in his Rabdologia.


Rabdologia (). Decimal notation being central for Topologically, while B (like D) is dense in R, which the metric system, its use is widely spread. would allow for arbitrarily close approximations of real But computing with machines rely essentially on bi- numbers, the finite counterparts of B (or D) used for nary representations, and computers are using subsets floating points, are nowhere dense sets. n B D of the binary fractions, called B = p ,n Z,p N Similarly, while (like ) are closed under addition  2 ∈ ∈ for binary. and multiplication, their finite counterparts Fm,e is not. B = Z[1/2] is generated by adjoining 1/2 to Z, in – for multiplication, the number of significant digits of the same way that D = Z[1/10]. This construction is a the product is usually the sum of significant digits rational extension of ring Z or localization of the ring of of factors. integers with respect to the powers of two. Therefore, al- – for addition, the sum will be computed after shift- gebraically, (B, +, ) is a ring (like D). As it is not a field, ∗ ing the less significant term relative to the most sig- division, as well as all rational functions, will require nificant one, the exact resulting significand getting . As B (like D) is only a small subset of the al- larger than the original one. gebraic numbers, all roots will require rounding. As B If these results happen to fit into the fixed size, then (like D) is only a small subset of the real numbers (R), the computation has been performed exactly, else an all transcendental functions1 will also require rounding inexact flag can be raised by the processor and this (even elementary ones: trigonometric functions, loga- requires rounding. This is a source of trouble: we can- rithms and exponential ; and special functions such as not catch, or trap in the specific case of IEEE 754, all Bessel functions, elliptic functions, Riemann ζ func- inexact exception that would appear in any realistic tion, hypergeometric series, polylogarithms. . . ). As B is computation. Most of them even appear in the simplest a ring, we get most usual good properties we are used operation (addition), and then we do not have a choice to in mathematics: to handle them: dynamically adaptable precision is pos- – commutativity: sible only with a massive loss of speed, and is useless x B, y B x + y = y + x, as soon as roots or transcendental functions1 are used. ∀ ∈ ∀ ∈ x B, y B x y = y x These exceptions will appear for all level of precision. ∀ ∈ ∀ ∈ × × – associativity: When representing a real number x, one find the x B, y B, z B x + (y + z) = (x + y)+ z, two nearest representable floating point numbers: x ∀ ∈ ∀ ∈ ∀ ∈ − x B, y B, z B x (y z) = (x y) z immediately below x and x+ immediately above x. Let ∀ ∈ ∀ ∈ ∀ ∈ × × × × – distributivity: us note m the middle of the interval [x , x+] (a real − x B, y B, z B x (y + z)= x y + x z number that cannot be represented in this floating point ∀ ∈ ∀ ∈ ∀ ∈ × × × format). In order to measure the effect of rounding, one Unfortunately, in practice we are limited to elements of B with a finite representation. m We call these “binary2 floating point numbers”, with x a given number of digits, in this case bits: m bits for x− x+ their mantissa or significand, and e bits for their expo- 1 1 nent. Let’s note these sets Fm,e: they build an infinite 2 ulp 2 ulp lattice of subsets of B. The size e of the exponent and m 1 ulp of the mantissa includes their respective sign bits. Fur- thermore some implementation do not store the most Fig. 
1: The basics of rounding significant bit of the mantissa as it is one by default, then saving one bit for accuracy. Note that, this saving is possible only as we use a base two representation. uses the smallest discrepancy between two consecutive Most of the time we rely on the hardware embedded floating point numbers, x and x+: they differ by one F F − 24,8 for single precision and 53,11 for double preci- unit in the last place (ulp): see Fig. 1. So when rounding sion. to the nearest, the difference between the real number 1 as opposed to algebraic functions, i.e. functions that can be and the rounded one should be at most half a ulp: this defined as the root of a polynomial with rational coefficients, in is called correct rounding. As this proves difficult, a less particular those involving only algebraic operations: addition, strict requirement is to guarantee the rounding within subtraction, multiplication, and division, as well as fractional one ulp (i.e. all bits correct but possibly the last one), or rational exponents 2 which is called faithful rounding. In Fig. 1, both x and We will not deal here with other bases, except to illustrate − the cancellation phenomenon and double rounding in the more x+ are faithful of x, but only x+ is the correct natural base ten rounding of x. Revisiting “What Every Computer Scientist Should Know About Floating-point Arithmetic” 3
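To make Fig. 1 concrete, the short C sketch below (an illustration added here, not part of the original paper) uses the C99 function nextafterf to exhibit the two single precision neighbours of a representable value and the behaviour of round-to-nearest-even at the midpoint; it assumes the default rounding mode and no extended intermediate precision.

#include <stdio.h>
#include <math.h>
#include <float.h>

int main(void)
{
    float x  = 1.5f;
    float xp = nextafterf(x, 2.0f);   /* x+ : nearest float above x */
    float xm = nextafterf(x, 0.0f);   /* x- : nearest float below x */

    printf("x- = %.9g  x = %.9g  x+ = %.9g\n", xm, x, xp);
    /* in the binade [1, 2) the spacing is one ulp = FLT_EPSILON = 2^-23 */
    printf("x+ - x = %.9g  FLT_EPSILON = %.9g\n", xp - x, FLT_EPSILON);

    /* adding half a ulp lands exactly on the midpoint m of Fig. 1:
       round-to-nearest-even sends it back to x (correct rounding) */
    float y = x + FLT_EPSILON / 2.0f;
    printf("x + half a ulp rounds back to x? %s\n", (y == x) ? "yes" : "no");
    return 0;
}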

Thus, the behavior of floats is not the same as the stored in the NaN sign bit and in a payload of 23 bits behavior of binary numbers. Addition and multiplica- for single precision (or 52 bits for double)4. tion inside the set of binary numbers are stable while With these extensions the lack of closure of bare for floats, it will be inexact most of the time, requiring float numbers is solved. But this solution comes at the round-off. From an object oriented point of view, while cost of losing associativity and distributivity (a major the floats Fm,e being smaller sets should be subtypes of behavioral shift), that are hard-wired in our ways of the binary number as a subset of B, the extra amount of thinking about numbers, in the same way the implicit exceptions they bring come in violation of the Liskov hierarchy of precisions is. Beyond algebraic properties, substitution principle (LSP) [10,11,12]3. NaN also destroys the total order that exists in R and B We must not regard these subsets as behavioral sub- . For low-level programming and optimization, it also types, which goes against our intuition. breaks the connection between logical equality and bi- nary representation equality. In the same way, while the defined exceptions for While the exceptional behaviors are usually appar- single and double are the same, they will not be trig- ent, even when replaced by their representations NaN gered as often for the highest precision, and in particu- or Infinity, and the inexact exception can be trig- lar for higher precision representation of the same pure gered with only half a ulp of loss of precision, a much number: a product of singles can overflow, where the more problematic case will not trigger any more specific product of the same values would be computed cor- exception: “catastrophic cancellation” [1], also known rectly, and even exactly, with doubles. So we do not get as “loss of significance” occurs when an operation on a tower or lattice of inheriting types, the smaller preci- two numbers increases relative error more than it in- sion deriving from the higher ones, as new behavior (in creases absolute error, for instance, when subtracting the form of more exceptions) would emerge from the close numbers. As a result of this closeness, the num- subtype, in violation of the LSP: in particular, precon- ber of significant digits in the outcome is less than dition on domain of functions are stricter in the subtype expected with the precision. For instance, using here (to prevent overflow or underflow). decimal floating points with three digits mantissa for Some may consider that subtypes extends the base the sake of clarity, when one considers a = 3.34 and type, with extra information. In this respect, higher pre- b =3.33, on the one hand the mathematical difference cision would be considered as subtypes of lower preci- a b =0.01 is computed exactly with the given floating − sion, extending it with extra digits, and keeping the point format. This computation, while exact, is a can- same methods. Yet, the exception being triggered more cellation: one recovers only one significant digit out of often with the lower precision, the higher precision sub- three. As this result could not be more exact, this can- type would lead to different behavior (at least better cellation is said to be benign. On the other hand, the correctness) of the same program. 
mathematical difference of the squares a2 b2 =0.0667 2 − In order to avoid too many exceptions, particularly is 6.67 10− in the given floating point format, but × 1 exception out of closure, extra elements are added to will be evaluated naively as 0.1=10− with the round- the set. First, to take overflow into account, B is affinely ing. No digit of the floating point result is correct, and extended by signed infinities: B = B , + . Then this result differ by 50% relative error or 333 units in ∪{−∞ ∞} signed zeros ( 0, +0) are included for consistency. One the last place: this cancellation is catastrophic. There is − more element is convenient to cover for undefined re- no systematic way to avoid catastrophic cancellation, sults such as functions on real numbers returning com- or to know how many digits have been lost. In particu- plex numbers, but also indeterminate forms: NaN for lar, catastrophic cancellation will not always lead to a “Not a Number”. With this we turn our number sys- zero, as illustrated with this example. tem into a broader representation. We then get another A corner case among the already pathological be- source of NaN, as the result of operations involving NaN. havior of catastrophic cancellation occurs when the ab- Which in turns represents different cases so that we may solute value of the computation is below the smallest want different NaN in our representation to give NaN a normal floating point number, i.e. 2 to the power of 128 more specific meaning: these extra informations can be the smallest exponent for the given format: 2− 38 1024 308∼ 0.29 10− for single precision, 2− 0.56 10− × ∼ × 3 Liskov’s notion of a behavioral subtype defines a notion of 4 as is the not well known and even less used case in IEEE 754. substitutability for objects; that is, if S is a subtype of T, then Language constructs or libraries to capitalize on this are unfor- objects of type T in a program may be replaced with objects of tunately absent. As these payloads are not well known, standard type S without altering any of the desirable properties of that libraries avoid the tedious work of correctly propagating them, program (e.g. correctness). which limits even more their usefulness. 4 Vincent Lafage for double precision. A strict application of the repre- provide us with elementary and approximate versions sentation would lead to an underflow exception, turning of functions atof or strtof. the result to 0, even if the mathematical result is not As for the binary to decimal conversion, although strictly null. This could occurs when subtracting close every binary floating point number has an exact and small numbers. To prevent this total loss of relative ac- finite decimal representation, it is usually not this ex- curacy, a special representation is included for subnor- act conversion that is used in practice, for instance in mal numbers (formerly known as denormal numbers). a printf. This faithful translation would require more While these numbers provide algorithms with more sta- digits than the effective precision used for the floating- bility, dealing with this anomalous representation can point format, leading to an absurd overprecision. The slow down computation by as much as 100: this fact decimal number that should be returned is then not the brands this part of the norm a shame to high perfor- exact one, but the decimal with the shortest expression mance computing. A first thing to take into account is that stand within one half ulp of the binary number. 
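The benign versus catastrophic cancellation just illustrated with a three-digit decimal mantissa can also be observed with binary floats. The C sketch below (an added illustration, not the paper's own code) takes the stored single precision values of a = 3.34 and b = 3.33 as exact inputs and compares the naive a² − b² with the factored form advocated in subsection 3.3; the double precision reference is exact for these inputs, since the products of 24-bit significands and their close subtraction are carried out exactly in double.

#include <stdio.h>
#include <math.h>

int main(void)
{
    float a = 3.34f, b = 3.33f;

    /* naive: both squares are rounded to single precision before the
       subtraction, and the cancellation amplifies these rounding errors */
    float naive = a * a - b * b;

    /* factored form (see subsection 3.3): only a benign cancellation is left */
    float factored = (a - b) * (a + b);

    /* exact value of a*a - b*b for the stored single precision inputs */
    double ref = (double)a * a - (double)b * b;

    printf("naive    a*a - b*b    = %.9g\n", naive);
    printf("factored (a-b)*(a+b)  = %.9g\n", factored);
    printf("reference             = %.17g\n", ref);
    /* the factored form is typically two orders of magnitude closer to the
       reference; the closer a and b are, the wider the gap becomes */
    printf("relative error, naive    = %.2e\n", fabs(naive - ref) / ref);
    printf("relative error, factored = %.2e\n", fabs(factored - ref) / ref);
    return 0;
}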
that such a slowdown points out a likely improper nor- This minimization leads to comparisons and computa- malization of the values (see subsection 4.2). A second tions with precision higher than the target precision aspect when subnormal numbers occur, is that extend- for the routine. The system libraries typically provided ing the precision from single to double is likely to cure a more pragmatic approach, not aiming at correctly the problem at a vastly less cost than redesigning the rounded but good enough results [15]. algorithm or renormalizing the values. Yet, correctly rounded conversion routines have ex- isted for the last thirty years [16] ( source available on www.netlib.org/fp). Faster algorithms even ap- 3 Use cases peared [17]. But they did not seem to make it in our compilers and system or language libraries. While the In this section we will present a collection of use cases compilers have gone a long way to evaluate the literal where the developers should be aware of the inaccura- and named constants in extra precision, and that mit- cies of the floating point computation, and refrain from igates the problem, the bulk of the conversions comes using the naive approach. from decimal data input/output: as long as these are not commoditized as language construct, these aspects of round-off are often considered for experts. 3.1 Converting decimal to binary and binary to This kind of round-off can pile-up in loop of conver- decimal sions (from one program to another, for instance when storing intermediate results in decimal format). As a As B D, every binary floating point number has an ⊂ consequence, one should avoid decimal storage as much exact decimal representation within a finite number of as possible and prefer binary storage, not only for per- decimal digits. But not conversely: the decimal num- formance, but also for precision. Moreover, the number bers with which we count our cents may not always of bits for accurately encoding a given number of deci- get a finite binary representation to the disarray of be- mal digits is not the naive estimation: it has been known ginners, the archetypal example being the violation of for over fifty years [18], that the 24 bits of single preci- the mathematical identity 0.1+0.2=0.3, whatever the sion are not enough for an accurate representation of 7 floating-point precision, but more evidently in single decimal digits, while 24 log (2) > 7. Hence caution is precision. 10 needed when estimating the required precision, usually So when, frequently, there is no n digits binary num- on number of decimal grounds. ber corresponding exactly to the input decimal number, we round off towards the closest n digits binary number. This requires converting decimal to a precision larger 3.2 Double rounding than n binary digits to correctly round it. Hidden in most hardware, below the least significant bit, one finds The classical issue of double rounding occurs when us- first a guard bit, a round bit and then a sticky bit: the ing two different precisions: a high precision and a lower two first are just two extra bits of precision, while the one. The exact result is first rounded to the high pre- sticky bit keeps track of all non zero bits lost during cision, then to the lower one. For instance with base the right-shifts of the mantissa. 
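The archetypal identity violation mentioned in subsection 3.1 is easy to exhibit. The C sketch below (an added illustration) prints the double precision values hiding behind the decimal literals, first with more decimal digits than the shortest representation, then in C99 hexadecimal floating point notation, which shows the exact significands.

#include <stdio.h>

int main(void)
{
    double a = 0.1, b = 0.2, c = 0.3;

    printf("0.1 + 0.2 == 0.3 ? %s\n", (a + b == c) ? "yes" : "no");  /* no */

    /* seventeen significant digits are enough to distinguish any two doubles */
    printf("0.1 + 0.2 = %.17g\n", a + b);   /* 0.30000000000000004 */
    printf("0.3       = %.17g\n", c);       /* 0.29999999999999999 */

    /* the %a format reveals the underlying binary fractions */
    printf("0.1 + 0.2 = %a\n", a + b);
    printf("0.3       = %a\n", c);
    return 0;
}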
But the decimal to bi- 10 floating points, we will consider a mantissa of 2 dig- nary conversion is affected by a problem related to the its for the high precision and 1 digit for the lower one: table-maker dilemma (TMD) [13,14] and these three with an exact result of 3.47, the rounding to the high extra bits may not be sufficient to proceed to the cor- precision gives 3.5, and the rounding to the lower pre- rect rounding. The system standard C libraries usually cision gives 4, whereas the direct rounding of the exact Revisiting “What Every Computer Scientist Should Know About Floating-point Arithmetic” 5 result to the lower precision is 2. Double rounding hap- 3.4 Summation pens when computing in extended precision (such as the F hardware implementation of 64,16) that one converts to Another well known algorithm among the community is the smaller working precision: even if the extended pre- Monte-Carlo integration: basically, it is a loop sampling cision has been correctly rounded (within half a ulp), a configuration space, then accumulating a given con- the rounding to working precision may lead only to a tribution as function of the sampled point. We therefore faithful rounding (within one ulp). The correct solution proceed to add fluctuating values in a growing accumu- is addressed in Ref. [19]. In particular, while double cor- lator. The more one accumulates, the more the accuracy rect rounding is not always correct, the double faithful of the individual contribution is blurred. rounding is faithful. Ideally, one should sort the contributions from the smallest to the largest (for contributions with the same sign) to be sure to lose as little precision as possible. One could also apply a “divide and conquer” algorithm 3.3 Difference of two squares such as pairwise summation [20]. Another solution is to accumulate in extended arithmetic, or to use the sum- W. Kahan The difference of two squares a2 b2 seems innocuous, mation algorithm of [21] or other compen- − and if the units of the program have been chosen cor- sated sums [22,23,24,20,25,26]. rectly, we can easily avoid the overflow risk (however Let us specify this essential aspect of sums which it is enough to trigger this overflow that a or b be of is the accumulation of truncation error. As this error 19 154 order of 10 in single precision instead of 10 in dou- will statistically be distributed uniformly as much up- ble precision). The real issue is, when the two terms are wards as downwards, one can assimilate it to a random close, in the possibility to exhibit “catastrophic cancel- walk, and in the hypothesis of terms of comparable size, lation” [1], also known as “loss of significance”. These to expect for the sum of N terms to a resulting abso- cancellation occurs in an operation (usually a subtrac- lute uncertainty on the total of the order of √N . tion) on previously truncated operands, for instance, O   This does not seem alarming since the associated rela- the product of other operands (here, squares). The nu- tive uncertainty will be in 1/√N . This approach is merical result then has only a limited precision com- O   pared to that with which the operand computations however naive since the absolute error of truncation of were made. Even the sign of the result can be wrong. a sum is like the relative error compared to the largest term i.e. the accumulator. The relative error varies like A contrario the same difference on exact operands ǫ√N . Comparatively, pairwise summation has a leads to a benign cancellation. 
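Double rounding, described in subsection 3.2 with a decimal example, can be reproduced in binary as well. The sketch below (an added illustration; it assumes a C99 library whose strtof and strtod perform correctly rounded conversions, as glibc does) parses the constant 1 + 2^-24 + 2^-60 once directly to single precision and once through double precision: the direct conversion is correctly rounded, while the two-step conversion lands on a midpoint and loses the last bit.

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* 1 + 2^-24 + 2^-60: slightly above the midpoint of two consecutive floats */
    const char *s = "0x1.000001000000001p+0";

    float direct   = strtof(s, NULL);          /* one single rounding to float     */
    float two_step = (float)strtod(s, NULL);   /* rounded to double, then to float */

    printf("direct    : %a\n", direct);     /* expected 0x1.000002p+0, i.e. 1 + 2^-23 */
    printf("two steps : %a\n", two_step);   /* expected 0x1p+0, i.e. 1.0              */
    printf("double rounding changed the result: %s\n",
           (direct != two_step) ? "yes" : "no");
    return 0;
}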
O   much tamer relative error varying like ǫ log N . Using double precision is usually the way to mitigate O p 2  As a matter of fact, the naive addition of 4 106 con- this effect. × tributions in a half precision accumulator causes a rel- The best solution is to use a factorized formulation ative uncertainty of 100 %, and in the case of a single of the computation a2 b2 = (a b) (a + b). The − − · precision accumulator, it reduces the precision within remaining subtraction can be the cause of “benign can- four significant digits at best (if each contribution is cellation” at most. In this case factorization used to correctly calculated to the accuracy of the machine and provide a speed gain (a single multiplication instead of correctly rounded). The average Monte Carlo can easily two) on top of the precision gain5. On contemporary produce such a four million events sampling. architectures, the speed gain has disappeared, as mul- tiplication is now as fast as addition. In this simple case, We can also combine these mild (and all the more the factorization is trivial, but in the expression of the treacherous) losses of precision with catastrophic can- area of a triangle, or the volume of a tetrahedron, the cellation. This is, for instance, the case of series falling work required is arduous. More generally, in the deter- under the Leibniz’s test for alternating series: a sum minantal expressions common in geometry, there is no of terms of alternating signs, of zero limit at infinity guarantee of the existence of these factorizations, and and of decreasing absolute values. The convergence of when they exist such results seems to be found by sheer these series is mathematically guaranteed, and the un- luck. certainty on the partial sum is bounded from above by the absolute value of the first neglected term in trun-

5 cating the serie. Each pair of successive terms of these technically, when a and b are of different magnitude, the non-factorized version suffers one less rounding error, therefore series is thus concerned with a potentially disastrous is more precise, but without affecting the meaning of the result compensation. The latest can be overcome in the case 6 Vincent Lafage where an analytical expression of terms exists, but there overshoots, particularly in the case of simple precision. is no general solution for purely numerical expressions. It is even possible to increase the evaluation speed [28]. The democratization of “multiply-accumulate” ma- chine instructions (fma) combining addition and multi- 3.5 Quadratic formula plication (see Subsection 4.4), leads us to strongly seek their use in this case [29]. Let’s start with the famous quadratic equation Furthermore, a polynomial being a sum (see Subsec- tion 3.4), compensation summation techniques, such as ax2 + bx + c =0, where a =0, (1) 6 the summation algorithm of W. Kahan [21], must be with ∆ = b2 4ac, (2) − taken into account. its corresponding discriminant, we note, for ∆> 0, its two exact solutions: 3.7 Variance estimate b √∆ x = − ± . (3) ± 2a The computation of variance estimator is a canonical We can identify two possibilities for “catastrophic can- example of “catastrophic cancellation” in that it is a cellation”: the first, in expressing the discriminant, and difference of squares. Let us consider the variance of a the second in the compensation between b and √∆. sample of N individuals. − The latter can be overcome reformulating the roots as: N 2 N ( xi) x2 Pi=1 σ2 = (x x¯)2 = (x2) x¯2 = Pi=1 i − N . (8) − − N 1 q = b sgn(b)√∆ = sgn(b) b + √∆ (4) − − − − | |  A golden rule for students used to be “never compute q x1 = , (5) the variance with single precision”... Because the vari- 2a ance so computed could become negative, it appears 2c c x2 = = . (6) abruptly when one extract the square root to obtain q ax1 the standard deviation. As evaluation occurs more and q is then always a sum of two like-sign terms, without more online, consequences can be dire if proper safe- catastrophic cancellation. We thus obtain an accurate guard exception handling is not provided. expression without execution branching. On the other Certainly, a two-pass approach, when possible, al- hand, avoiding execution branching blurs the link be- lows to stick with single precision. But one can also tween the two pairs of roots, and a test might be nec- take advantage of Welford [30] one pass online algo- essary to match x1 to x+ or x . rithm. The cost for accuracy then comes as an extra − A solution for taming the catastrophic cancellation division for each iteration loop. in the discriminant will be presented in Subsection 4.4. The problem also appears when evaluating covari- ance matrices and in evaluating centered moments of higher orders. 3.6 Polynomial evaluation

It is well known that polynomial function should not be 3.8 Area of a triangle evaluated in the naive way: on the one hand, because of the computational cost of all the exponentiation, on the The formula of Heron of Alexandria, allows to cal- other hand because of the accuracy loss it represents. culate the area S of any triangle knowing only the As we have seen in Subsection 3.3, when a factored lengths a, b and c of its three edges: out form exists, it had better be exploited. In addition, S = p(p a)(p b)(p c) (9) when factorization is not known, the best approach is p − − − a+b+c Horner-Ruffini method [27]: with p = 2 the half perimeter. This nice sym- metrical formula exhibits an instability during the nu- n 1 n p(x)= a0 + a1x + + an 1x − + anx merical computation, which appears for a needle-like · · · − (7) triangle, i.e. a triangle of which one edge is of partic- = a0 + x a1 + x + x(an 1 + xan)  · · · − · · · ularly small dimension compared with the others (con- It is not only a gain in speed, through the saving of op- frontation of small and large values). erations that it achieves, but also in accuracy, partly for By labelling the names of edges so that a>b>c, the same reason, and more importantly it is a guarantee and by reorganizing the terms to optimize the quanti- of stability of the result and safety against intermediate ties added or subtracted, W. Kahan proposed a more Revisiting “What Every Computer Scientist Should Know About Floating-point Arithmetic” 7 stable formula for S in Ref. [31]: anteed values, for example for one-loop functions in ra- 1 diative corrections [37]. [a + (b + c)] [c (a b)] [c + (a b)] [a + (b c)] 4p − − − − (10) 3.10 Function evaluation While its apparent symmetry is lost, this formula is much more robust than the naive one. He similarly Function evaluation often start with a range reduc- provided a stable formula for the computation of the tion: trigonometric functions are first linked to their volume of a tetrahedron based on a non trivial factor- π expression in the range [0; 4 ], and building upon the bi- ization [32]. nary representation of the argument, square root over These calculations may seem hazy textbook case: √2 √ [ 2 ; 2], exponential in the range [0; ln 2], logarithm they are in fact only the expression with small dimen- over [ √2; √2]. . . This apparently trivial transforma- − sions (n = 2, n = 3) of factors which intervene in more tion can in itself require extra computations to be re- general geometric calculations, such as the relativistic ally accurate. But before algorithms such as power se- phase space volume [33]. These factors turn out to be ries expansion or Newton-like iteration can be used determinants, notoriously delicate to evaluate precisely efficiently, an extra step needs to be taken, as these 6 if one proceeds naively . For instance, if we study the reduced intervals are already too large for quick con- production of intermediate bosons and electrons, their vergence. This is commonly achieved through lookup 10 square masses are in a ratio of 10 and certain kine- tables (LUT). matic factors will be analogous to the surface of a tri- angle with edges in such a ratio. 3.11 Function Minimization

3.9 Complex Arithmetic and Analysis This almost daily activity in our community, seems well under control. If we look more closely, in the vicinity of Complex numbers are a powerful tool for the physi- its minimum, the studied function is the sum of the min- cist and the engineer. Unfortunately, they are often imum value and of a quadratic form (Taylor’s theo- presented as the simplest example of Cartesian coordi- rem). With this form’s coefficients close to unity, a rela- nates, which often gives rise to high school-like imple- tive displacement of (h) around the position x of the 7 O 0 mentations which are mathematically correct but nu- minimum will generate a relative displacement of (h2) O merically fragile. These expressions are notably liable of the function to minimize. However, the smallest rel- to involve intermediate overflow although their result ative measurable displacement being bounded by the is representable (complex modulus, complex division m ǫ =2− , the variation δx0 which will and even complex multiplication). For instance, when not generate a perceptible difference will be (√ǫ), O naively evaluating the modulus of a single precision half the machine precision. Therefore if we calculate the complex number with the largest component greater function with double precision, the uncertainty on the than 2 1019, the intermediate sum of squares will over- × position of the minimum corresponds to the single pre- flow while the resulting modulus is representable as a cision. If we now evaluate the function to minimize in single precision floating point number. Other languages single precision, the position of the minimum will only integrate complex numbers better, such as Fortran, have the resolution of the half precision. Moreover if we Ada and D, and even better, C99. estimate the function by means of the half precision in As for complex analysis, particularly when evalu- vogue, the position will be known with less than two ating radiative corrections in particle physics, but also significant figures. This half precision being supposed when modeling 2D irrotational fluid flows, the subtleties to serve the training in machine learning, which rely of managing the zero sign in the vicinity of the cuts essentially on minimization, using half precision casts associated with connection points, can lead to results some doubts on the accuracy of the results. which are not trivially false [36]. It is therefore impor- Some studies have tested the consequences of using tant to use established libraries to benefit from guar- half precision in training neural networks, including the 6 Multivariate polynomial in nature, while relying on “sim- use of mixed precision with single precision accumula- ple” arithmetics, these multi-linear alternating form are the tors (see Refs. [38,39] or fixed point number instead of best place to exhibit massive cancellation. Computation exam- floating point numbers (see Refs. [40,41]). These pre- ples defeating even quadruple precision rely on these higher degree polynomial cancellation (see Ref. [34,35]) cisions are sufficient for training and running the net- 7 see comments such as // XXX: This is a grammar work. The question of the accuracy of a neural network school implementation. in the complex header of g++ reproducing a numerical function is yet open. 8 Vincent Lafage
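Referring back to subsection 3.4, the C sketch below (an added illustration, to be compiled without -ffast-math, which would optimize the compensation away) accumulates the 4 × 10^6 contributions mentioned there, here all equal to 0.1, and contrasts a naive single precision accumulator with the compensated summation of Kahan [21] and a double precision reference.

#include <stdio.h>

int main(void)
{
    const long  n = 4000000;   /* the 4e6 contributions of subsection 3.4 */
    const float term = 0.1f;

    float  naive = 0.0f;
    float  s = 0.0f, c = 0.0f; /* compensated accumulator and its correction */
    double ref = 0.0;          /* double precision reference                 */

    for (long i = 0; i < n; ++i) {
        naive += term;

        float y = term - c;    /* corrected term                        */
        float t = s + y;       /* low order bits of y are lost here ... */
        c = (t - s) - y;       /* ... and recovered here                */
        s = t;

        ref += (double)term;
    }

    printf("naive float accumulator : %.9g\n", naive); /* off by a large margin  */
    printf("Kahan float accumulator : %.9g\n", s);     /* close to the reference */
    printf("double reference        : %.9g\n", ref);   /* about 400000.006, as
                                                          0.1f is slightly above 0.1 */
    return 0;
}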
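The stable evaluation of the quadratic roots, equations (4)-(6) of subsection 3.5, translates directly into code. The sketch below (an added illustration) compares it with the textbook formula on an equation whose coefficient b dominates a and c, where the naive −b + √∆ cancels catastrophically.

#include <stdio.h>
#include <math.h>

/* textbook roots: (-b ± sqrt(b*b - 4*a*c)) / (2*a) */
static void naive_roots(double a, double b, double c, double *x1, double *x2)
{
    double d = sqrt(b * b - 4.0 * a * c);
    *x1 = (-b + d) / (2.0 * a);   /* cancels when b > 0 and |4ac| << b^2 */
    *x2 = (-b - d) / (2.0 * a);
}

/* equations (4)-(6): q is a sum of two like-sign terms, hence no cancellation */
static void stable_roots(double a, double b, double c, double *x1, double *x2)
{
    double delta = b * b - 4.0 * a * c;            /* assumed positive here */
    double q = -(b + copysign(sqrt(delta), b));    /* eq. (4)               */
    *x1 = q / (2.0 * a);                           /* eq. (5)               */
    *x2 = (2.0 * c) / q;                           /* eq. (6)               */
}

int main(void)
{
    /* x^2 + 1e8 x + 1 = 0: the exact roots are close to -1e8 and -1e-8 */
    double a = 1.0, b = 1.0e8, c = 1.0;
    double n1, n2, s1, s2;

    naive_roots(a, b, c, &n1, &n2);
    stable_roots(a, b, c, &s1, &s2);

    printf("large root, naive  : %.17g\n", n2);  /* fine either way                  */
    printf("large root, stable : %.17g\n", s1);
    printf("small root, naive  : %.17g\n", n1);  /* off by tens of percent here      */
    printf("small root, stable : %.17g\n", s2);  /* close to -1e-8 to full precision */
    return 0;
}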
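For the variance estimate of subsection 3.7, the sketch below (an added illustration on a synthetic sample) keeps all accumulators in single precision, as in the students' cautionary tale, and compares the textbook difference of squares, equation (8), with the one-pass online algorithm of Welford [30]; the latter trades an extra division per iteration for a result that cannot turn negative.

#include <stdio.h>

/* textbook formula (8): sum(x^2) and (sum x)^2 / N cancel catastrophically
   when the mean is large compared to the spread of the sample */
static float variance_naive(const float *x, long n)
{
    float s = 0.0f, s2 = 0.0f;
    for (long i = 0; i < n; ++i) { s += x[i]; s2 += x[i] * x[i]; }
    return (s2 - s * s / (float)n) / (float)(n - 1);
}

/* Welford's one-pass algorithm [30], with the same single precision accumulators */
static float variance_welford(const float *x, long n)
{
    float mean = 0.0f, m2 = 0.0f;
    for (long i = 0; i < n; ++i) {
        float delta = x[i] - mean;
        mean += delta / (float)(i + 1);
        m2   += delta * (x[i] - mean);   /* the second factor uses the updated mean */
    }
    return (n > 1) ? m2 / (float)(n - 1) : 0.0f;
}

int main(void)
{
    /* 1000 values alternating between 10000 and 10001:
       the exact sample variance is about 0.2502 */
    static float x[1000];
    for (long i = 0; i < 1000; ++i)
        x[i] = 10000.0f + (float)(i % 2);

    printf("naive, single precision   : %.9g\n", variance_naive(x, 1000));   /* wildly off, possibly negative */
    printf("Welford, single precision : %.9g\n", variance_welford(x, 1000)); /* close to 0.2502               */
    return 0;
}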

4 Some approaches to control floating point This points out the need for different levels of cau- uncertainty tiousness in the evaluation strategy of a function, de- pending on the argument: some subdomains of the def- Now cautious that our prejudice will work against us, inition domain do not exhibit large condition number, we need some way to compute nevertheless. others need more care. From the definition, one can easily demonstrate that only power laws x C xn → × (with C and n some real constants) have a constant 4.1 Condition number and stability condition number x, κ (x) = n. Hence even functions ∀ with a nice unique expression may need piecewise eval- As we have seen, some computations have a risk of de- uation. Beside lacking of aesthetic appeal, this casuistry velopping large relative uncertainties even with precise has a clear impact on performance as the required tests input. In this subsection, we will attempt to quantify potentially lead to branch misses. Moreover, these tests the amplification factor between input uncertainty and come on top on those needed for checking that the ar- output uncertainty, and to discriminate the part inher- gument is inside the function’s domain of definition. ent to the function under study, called the condition The condition number of the problem reveals that number, from the part caused by the evaluation algo- the precision of the computation may have to be larger rithm. than the target accuracy. The condition number κ inherent to a mathematical Note that until this point, we have considered only function f evaluated at argument x with approximate exact mathematical expression: the condition number value xa compares the relative uncertainty on f(x) to κ (x) does not consider approximations and rounding. the relative uncertainty on the argument. It takes the Further uncertainties come from the numerical algo- form: rithm used to evaluate the expression: approximation, f(xa) f(x) f(xa) f(x) − − for instance truncation of a power or Fourier series f(x) (xa x) xf (x) κ (x)= = − ′ (11) xa x expansion, as well as the rounding, on each term and − f(x) ∼ f(x) x x on the sum. The condition number and extra uncer-

tainties contribute to define an overall stability of the κ is also known as the absolute value of the elasticity algorithm under small perturbations. These concepts of the function f. Note that κ is dimensionless. This are as old as computers [42] but deserve to be given a simple expression reveal that even the most elementary larger exposure: these were for instance not described function is prone to large uncertainties: considering the in Ref. [1], despite the numerous examples of this pa- addition of a constant c (f : x x + c), the corre- → per being developped around singular points for their sponding condition number κ = x exhibits a singu- x+c respective condition number. lar behavior around x = c, while being tame on the − Once the singular values of the condition number vast majority of value: next to the cancellation point, is found (most of the time simple zeros of function f relative error on the result will be far larger than rela- will be singular for its condition number κ), an asymp- tive error on the argument. This particular example il- totic analysis of f around these singular values must lustrates that catastrophic cancellation is amplified by be performed to assess the more stable expression: for a large condition number even before rounding takes instance, considering f = x ln x, one finds that place. → κ (x) = 1 , with singular value x = 1, which brings Beyond this formal definition, a more concrete con- ln x our attention to the expression of f around 1, f (x = sequence of the condition number will help us measure 1 + h) = ln(1+ h) for which the condition number its value as a tool: as the smallest relative uncertainty becomes: achievable with a given precision is of half a ulp, a h 1 neatly computed parameter of a function f will pro- κ (h)= (12) κ (1 + h) ln (1 + h) h∼0 (1 + h) duce a relative uncertainty of 2 ulp in the result of f. → Hence log10 κ is the number of digits of accuracy lost which is no more singular for h near 0 (but is singu- in this computation, and log κ is the number of bits lar for h near 1). Yet as the expression involves an 2 − of accuracy lost in this computation. indetermination, it needs to be evaluated with care as Moreover, the composition of functions will lead to described in Ref. [1]. Nowadays, the mathematical li- a net condition number equal to the product of compo- braries provide us with log1p. nent functions condition numbers. The 23 bits of sin- In real situations, the condition number must be gle precision can then succumb to repeatedly composed extended beyond univariate function by using an arbi- evaluation of a function with as moderate a condition trary norm instead of absolute value, and one must take number as 2. into account the uncertainties on physical constants as Revisiting “What Every Computer Scientist Should Know About Floating-point Arithmetic” 9 possibly greater than half a ulp (as mathematical con- it is not enough to couple a library [45,46] to its pro- stants have zero for their condition number). gram to automatically benefit from interval calculation. Worse, beyond the transformation of types, literal con- stants and, more importantly, input-output, these are 4.2 Nondimensionalization the expressions and algorithms that must be radically revised to curb the expansion of the resulting intervals. 
Scientists often compute to solve equations or simulate This technique has already been used for a thesis in phenomena where variables and parameters have spe- the context of calculating uncertainties for data acqui- cific units and physical dimensions, following dimen- sitions [47]. sional analysis. Dimensional homogeneity requires that all analytic functions and their arguments must be di- mensionless, as these functions can be expressed as an 4.4 Good news on accuracy front infinite power series of the argument and all terms must have the same dimension. Usage is to extract the char- Since  [5] the IEEE 754 standard takes into account acteristic units of the problem at hand, and use these the operations fma “ Fused Multiply and Add ” which for rescaling the dimensional quantities to dimension- allows one to perform only a rounding instead of two Vaschy Buckingham Bertrand less ones ( - - π theo- on the combination of a multiplication and an addition. rem). This choice of units allows one to simplify the This new operation primarily affects two major algo- equation, with parameters both fewer and dimension- rithms: the scalar product [25] (and therefore the prod- Reynolds less (such as the number in hydrodynamics, ucts of matrices but also convolutions) and the evalua- among others). Beside this aspect, useful to understand tion of polynomials by the Horner-Ruffini method. the problem, it also proves beneficial numerically as the These two algorithms, based on sums, are intended dimensionless variables are of the order of unity, where to be coupled with compensation techniques such as the the density of representable floating points is the high- summation algorithm of Kahan [48,49,50]. est (and where the overflow and underflow are as far as fma can also be used for Error-Free Transformation possible). (EFT) of arithmetics: given one operation and two ◦ floating point numbers a and b, the rounded operation a b would produce an approximate result s, but the 4.3 Interval arithmetic ◦ corresponding EFT will produce a sum of two floating If one would like a higher precision than the native points values s + e corresponding to the exact result. types of the processor, there are different solutions such This capacity is relevant to use when computing a re- 2 as double-double precision, quadruple precision, arbi- duced discriminant ∆′ = b′ ac riddled with cancel- 2 − lation when b′ ac. The EFT for the multiplication trary precision, rational representations. . . ∼ An important alternative is to represent each vari- has two steps: the most significant result is computed with a simple, approximate (rounded) product a b, also able by a pair of floating point numbers representing ∗ noted a b distinct from the exact product a b, and the bounds of an interval ensuring that it contains the ⊗ × true value. Numbers arithmetic then becomes interval the error term is evaluated with a fma as follows: arithmetic, which is more than twice as expensive to a b = a b + fma(a,b, a b) (13) evaluate, but provides mathematical certainties. This × ⊗ − ⊗ approach can be traced back to Archimedes and ap- One then apply this EFT to the two terms in ∆′ = 2 peared quickly but discreetly in the history of comput- b′ ac and if there is a cancellation between the most − ers. 
Beyond the calculation of each terminal, it is ad- significant terms of the EFT, adding the difference of visable to round the lower limit towards and the the least significant terms will provide for the missing −∞ upper limit towards + and this switching of the con- accuracy. ∞ trol register of the FPU considerably slows down the execution. A number of implementations bypass this problem by systematically subtracting ǫ from the lower 4.5 Stochastic Arithmetic bound and adding ǫ to the upper bound. As harmless as this extra ǫ may seem, it feeds the natural demon Beyond these self-defense and stability validation ap- of interval arithmetic, which constantly increases the proaches, we can use more recent environments for as- size of intervals. The result is then certainly fair, but sessing digital precision. These execution environments useless. The recent agreement on a standard will some- induce small disturbances at each calculation stage and what facilitate the use of the intervals [43,44]. However, thus make it possible to measure the fluctuation of 10 Vincent Lafage the result. Here the emphasis is about stochastic arith- and documentation of their algorithms and their library metic. This technique is akin to Monte Carlo. The dis- implementation. turbance induced in the computation can appear in the form of either a random noise added to the computa- tion or a random change of mode of rounding of the re- 4.7 Mixed precision sults. These environments are CADNA [51,52,53,54,55], Verificarlo [56,57] and Verrou [58,59,60,61] which As mentioned in Subsection 3.11, the algebraic intensity can be used with valgrind [62,63,64]. of machine learning computations and the best use of The Verrou environment allows, in addition to its both memory size and bandwidth had the community stochastic mode, to force all rounding of the calcula- turn to accuracies lower than single precision. Namely tion in an unconventional direction instead of the near- half-precision, either the IEEE compliant binary16, or est rounding: towards , towards + , towards 0, −∞ ∞ FP16 [5], or the bfloat16 of Google Brain [65]. FP16 and even to test the rounding off “as far as possible”, having a very restricted range (magnitude up to 65 536), a good way to get an idea of the resilience of ones nu- one often use a single precision accumulator while mul- merical code. These programs allow an instrumentation tiplying half precision variable. of the code which reveals numerical weaknesses at run- time. Their implementation has an impact on the per- Besides machine learning, dense and sparse linear formance of the code, and must be taken into account algebra, as well as Fast Fourier Transform are studied in the time budget for testing and optimization. with mixed precision in Ref. [66].
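Coming back to the condition number discussion of subsection 4.1, the following C sketch (an added illustration) shows why the mathematical libraries provide log1p: once 1 + h has been rounded, the naive log(1 + h) keeps only part of the digits of h, and none of them when h falls below half a ulp of 1.

#include <stdio.h>
#include <math.h>

int main(void)
{
    double h1 = 1.0e-10, h2 = 1.0e-17;

    /* the rounding of 1 + h wipes out the low order digits of h */
    printf("h = 1e-10: log(1 + h) = %.17g\n", log(1.0 + h1));  /* roughly half the digits correct */
    printf("           log1p(h)   = %.17g\n", log1p(h1));      /* accurate to full precision      */

    printf("h = 1e-17: log(1 + h) = %.17g\n", log(1.0 + h2));  /* 1 + h rounds to 1: result 0     */
    printf("           log1p(h)   = %.17g\n", log1p(h2));
    return 0;
}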
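The error-free transformation of the product, equation (13) of subsection 4.4, is one fma away. The sketch below (an added illustration; it assumes a correctly rounded fma, as mandated by IEEE 754-2008, and should be compiled without aggressive contraction such as -ffast-math) uses it to recover the digits that cancel in a reduced discriminant b'² − ac when b'² ≈ ac.

#include <stdio.h>
#include <math.h>

/* error-free transformation of the product, eq. (13):
   a * b = p + e exactly, with p the rounded product and e its rounding error */
static void two_prod(double a, double b, double *p, double *e)
{
    *p = a * b;
    *e = fma(a, b, -*p);
}

/* reduced discriminant b'^2 - a*c with the cancellation compensated */
static double discriminant(double a, double bp, double c)
{
    double p, ep, q, eq;
    two_prod(bp, bp, &p, &ep);
    two_prod(a, c, &q, &eq);
    return (p - q) + (ep - eq);   /* p - q is exact when p and q are close */
}

int main(void)
{
    /* b'^2 and a*c agree in their leading bits: the naive difference vanishes */
    double a = 1.0, bp = 1.0 + 0x1.0p-27, c = 1.0 + 0x1.0p-26;

    double p = bp * bp;   /* rounded: the 2^-54 tail is lost */
    double q = a  * c;    /* exact                           */

    printf("naive    : %a\n", p - q);                   /* 0x0p+0                   */
    printf("with EFT : %a\n", discriminant(a, bp, c));  /* 0x1p-54, the exact value */
    return 0;
}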

4.6 Double precision as an insurance 4.8 Software engineering The previous examples have shed a light on numeri- cal computation that is far from common expectation: The major programming paradigms prevents us from floating-point computing is not straightforward at all, dangerous practices through discipline: structured pro- while expressions like “number crunching” evoke blind gramming forbids the use of goto, modular program- use of power. In fact, while a mathematical formula ming prevents global variables, functional programming looks like a monolithic, deep-frozen archive of theoret- forbids mutations, strong-typing prevents implicit cast ical truth, its practical evaluation often requires to be of type, polymorphism discipline the indirect transfer split between different domains and regimes, each rely- of control of pointers to functions, etc. We should be ing on mathematically identical but numerically differ- looking for what to forbid to achieve a safer floating- ent expressions. This casuistry may feel uncomfortable point environment: let us note that it has not to rely on for the beginner, but it is also irritating to the opti- a specific syntax or language, but more on good prac- mization expert: as pipelining and vectorization are the tices. modern basis of computing performance, tests in the in- The first and most well known approach is that tests nermost loops of library are shunned. In the same way, for strict equalities should likely be banned (topological cache-misses induced by lookup table are also a pain aspect of the representation), but it is only the most for performance computing. elementary defense. All fields of engineering usually involve “factors of To coin Robert Martin [67] “A good architecture safety” to provide for unexpected situation. The fields allows you to defer framework decisions”. The choice of where the materials are known the better can afford to the precision is such a framework decision. In order to minimize these factors of safety: concrete walls or eleva- defer this choice, generic programming is the paradigm tor cables can be oversized with security factors of order of choice. The easiest way to implement it, is using sim- twenty, but planes have to stay within a security factor ple preprocessing of source or C style typedef to specify of two: this can only be achieved through a deep and at compilation time a generic floating-point type with thoroughly documented knowledge of the material at a given precision. Unfortunately, the precision of the hand. One might see the traditional recourse to double floating-point variables does not guarantee in itself the precision as such a factor of safety. While most com- accuracy of the result. Hence the full generic paradigm putations will not be used in real life with more than is required here, allowing to rewrite the function differ- single precision, intermediate result may require dou- ently for the various precision if needed. ble precision to achieve single accuracy. For end user Unit testing and Test Driven Development may also to rely only on single precision, and possibly benefit on look beneficial to the numerical computation commu- its gain in speed, they need a much higher level of test nity, but the amount of testing required is rather large. Revisiting “What Every Computer Scientist Should Know About Floating-point Arithmetic” 11
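As a minimal sketch of the deferred precision choice evoked in subsection 4.8 (an added illustration of the "C style typedef" mentioned there, not a prescription of the paper), the working precision can be selected at compilation time and the same generic code compared across precisions:

#include <stdio.h>

/* choose the working precision when compiling, e.g.
   cc -DREAL=float example.c     or     cc -DREAL=double example.c */
#ifndef REAL
#define REAL double
#endif
typedef REAL real;

/* any algorithm written against the generic type can then be built in
   several precisions and its results compared, a cheap first diagnostic */
static real harmonic_sum(long n)
{
    real s = (real)0;
    for (long i = 1; i <= n; ++i)
        s += (real)1 / (real)i;
    return s;
}

int main(void)
{
    printf("sum of 1/i up to 1e6: %.17g\n", (double)harmonic_sum(1000000));
    return 0;
}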

5 Conclusion Unfortunately, conducting such tests is not as simple as it seems [71]. “The purpose of computing is insight, not numbers.” (Richard Wesley Hamming, Turing Award ) Acknowledgements We gratefully acknowledge the stimulat- Experimental systematic error usually represent a ing discussions we had with Amel Korichi, Hadrien Grasland, large part of the discussion in a science paper or the- Pierre Aubert and Philippe Gauron. sis. However numerical systematic error are rarely dis- cussed. The propagation of numerical errors along the References computational process needs to be given attention as part of the systematic error study. 1. D. Goldberg, ACM Comput. Surv. 23(1), 5, Mar. . Floating-point numbers are a computer abstraction doi:10.1145/103162.103163 for real numbers. As we have shown in this paper, this 2. W. Kahan, J.F. Palmer, SIGNUM Newsl. 14(si-2), 13, Oct. . doi:10.1145/1057520.1057522 abstraction does not protect us completely from im- 3. IEEE Standard for Binary Floating-Point Arithmetic. plementation details: to coin Ref. [68,69] floating-point Tech. rep., Oct . doi:10.1109/IEEESTD.1985.82928 numbers are a leaky abstraction. 4. IEEE Standard for Radix-Independent Floating- Fifteen years after the computer power growth is Point Arithmetic. Tech. rep., Oct . doi:10.1109/IEEESTD.1987.81037 no more being driven by the processor frequency, in 5. IEEE Standard for Floating-Point Arithmetic. Tech. rep., order to achieve the expectation on science projects, we Aug . doi:10.1109/IEEESTD.2008.4610935 face the need for keeping the computing power growing, 6. IEEE Standard for Floating-Point Arithmetic. Tech. rep., and some in the field consider moving to less precise July . doi:10.1109/IEEESTD.2019.8766229 7. Binary floating-point arithmetic for microprocessor sys- computation to execute it faster. tems. Standard, International Organization for Standard- Admittedly double precision is not double accuracy. ization, Geneva, CH, Feb. . URL https://www.iso. Floating-point precision is a mean to achieve floating- org/standard/19706.html 8. Information technology — Microprocessor Systems — point accuracy, the goal of our computation, that has to Floating-Point arithmetic. Standard, International Orga- be asserted beforehands. The problem stands in com- nization for Standardization, Geneva, CH, Jun. . URL putations larger than the ones we can follow, where not https://www.iso.org/standard/57469.html only the accumulation of round-off errors, but the exis- 9. Information technology — Microprocessor Systems — Floating-Point arithmetic. Standard, International Orga- tence of numerical singularities hidden in the more and nization for Standardization, Geneva, CH, May . URL more complex domain spaces of these computation hin- https://www.iso.org/standard/80985.html ders our confidence in floating points computations. In 10. B. Liskov, in Addendum to the Proceedings on Object- the same way the software engineering has dealt with Oriented Programming Systems, Languages and Applica- tions (Addendum) (Association for Computing Machin- the growing complexity of modern programs over sev- ery, New York, NY, USA, 1987), OOPSLA ’87, p. 17–34. enty years, through a set of disciplines enforced by lan- doi:10.1145/62138.62141 guage paradigms as well as helper tools, floating point 11. B. Liskov, J.M. Wing, in ECOOP’ 93 — Object-Oriented computing must be understood as the difficult field it Programming, ed. by O.M. 
Nierstrasz (Springer Berlin Hei- delberg, Berlin, Heidelberg, 1993), pp. 118–141 is, deserving also its set of good practices and tools. 12. B.H. Liskov, J.M. Wing, ACM Trans. Program. Lang. Syst. As W. Kahan reminds us in the conclusion of Ref. 16(6), 1811–1841, Nov. . doi:10.1145/197320.197383. [70], we still have to rely on the usual strategies: dif- URL https://doi.org/10.1145/197320.197383 ferent compilers, and if possible different computers 13. A. Castiel, V. Lefèvre, P. Zimmermann, Interstices . URL https://hal.inria.fr/inria-00000567. http: (not only a different hardware, but a different architec- / / interstices . info / display . jsp ? id = c _ 5936. See ture, and possibly a different manufacturer), different also http : / / www . vinc17 . org / research / slides / libraries and different runtime environments also mat- epao2001-03.pdf ter. Using binary format data storage from end to end 14. C. Daramy, D. Defour, F. de Dinechin, J.M. Muller, in Advanced Signal Processing Algorithms, Architectures, and is also a good approach, separating the data model from Implementations XIII, vol. 5205, ed. by F.T. Luk. Interna- the data view. tional Society for Optics and Photonics (SPIE, 2003), vol. The self-defense techniques suggest us to develop 5205, pp. 458 – 464. doi:10.1117/12.505591 our codes in generic precision, instead of setting from 15. R. Regan. Quick and Dirty Floating-Point to Decimal Conversion. start a fixed precision. The relative safety provided by https://www.exploringbinary.com/quick-and-dirty- double precision is dismissed by teams ready for more floating-point-to-decimal-conversion, Nov.  aggressive approach, particularly with large amount of 16. D.M. Gay, Correctly Rounded Binary-Decimal and data, naively confusing precision with accuracy. An em- Decimal-Binary Conversions. Tech. Rep. 90-10, Numer- ical Analysis Manuscript AT&T Bell Laboratories, Mur- pirical, down to earth approach, comparing results of ray Hill, New Jersey 07974, Nov. . URL https : different precisions, should help decide on the precision. //ampl.com/REFS/rounding.pdf 12 Vincent Lafage

17. F. Loitsch, SIGPLAN Not. 45(6), 233, Jun. . 41. Y. Choi, M. El-Khamy, J. Lee, CoRR abs/1809.00095, doi:10.1145/1809028.1806623 . URL http://arxiv.org/abs/1809.00095 18. I.B. Goldberg, Commun. ACM 10(2), 105–106, Feb. . 42. A.M. Turing, The Quarterly Journal of Mechan- doi:10.1145/363067.363112 ics and Applied Mathematics 1(1), 287, 01 . 19. S. Boldo, G. Melquiond, When double rounding is odd. doi:10.1093/qjmam/1.1.287 Tech. Rep. Research Report No 2004-48, École Normale 43. IEEE Standard for Interval Arithmetic. Tech. rep., June Supérieure Lyon, Nov. . URL http://www.ens-lyon. . doi:10.1109/IEEESTD.2015.7140721. https : / / fr/LIP/Pub/Rapports/RR/RR2004/RR2004-48.pdf ieeexplore.ieee.org/document/7140721 20. N.J. Higham, SIAM Journal on Scientific Computing 44. IEEE Standard for Interval Arithmetic (Simplified). Tech. 14(4), 783, . doi:10.1137/0914050 rep., Jan . doi:10.1109/IEEESTD.2018.8277144 21. W. Kahan, Commun. ACM 8(1), 40, Jan. . 45. 2014 IEEE Conference on Norbert Wiener in the 21st Cen- doi:10.1145/363707.363723. http://doi.acm.org/10. tury (21CW) (Wiley-IEEE Computer Society Press, 2014). 1145/363707.363723 doi:10.1109/NORBERT.2014.6893849 22. J. Mcnamee, ACM SIGSAM Bulletin 38, 1, 03 . 46. M. Nehmeier, in 2014 IEEE Conference on Norbert doi:10.1145/980175.980177 Wiener in the 21st Century (21CW) [45], pp. 1–6. 23. T.O. Espelid, SIAM Review 37(4), 603, . doi:10.1109/NORBERT.2014.6893854. https://github. doi:10.1137/1037130 com/nehmeier/libieeep1788 Its main focus is the correct- 24. I. Babuška, in IFIP Congress (1), vol. 68 (1968), vol. 68, ness and not the performance of the implementation. pp. 11–23 47. F. Hautot, Cartographie topographique et radiologique 3D 25. T. Ogita, S.M. Rump, S. Oishi, SIAM Journal on Scientific en temps réel : acquisition, traitement, fusion des données Computing 26(6), 1955, . doi:10.1137/030601818 et gestion des incertitudes. Ph.D. thesis, Université Paris 26. S.M. Rump, T. Ogita, S. Oishi, SIAM Journal on Scientific Saclay,  Computing 31(2), 1269, . doi:10.1137/07068816X 48. S. Graillat, P. Langlois, N. Louvet. FMA Implementa- 27. W.G. Horner, Philosophical Transactions of the Royal So- tions of the Compensated Horner Scheme, Jan. . URL ciety of London 109, 308, . doi:10.2307/107508 https://perso.univ-perp.fr/langlois/slides/dag06_ 28. D.E. Knuth, Commun. ACM 5(12), 595, Dec. . sl.pdf. Reliable Implementation of Real Number Algo- doi:10.1145/355580.369074 rithms: Theory and Practice, Dagstuhl Seminar 29. S. Graillat, P. Langlois, N. Louvet, in Proceedings of 49. P. Langlois, N. Louvet. Operator Dependant Compensated the 2006 ACM Symposium on Applied Computing (ACM, Algorithms, May . URL http://perso.ens-lyon.fr/ New York, NY, USA, 2006), SAC ’06, pp. 1323–1327. nicolas.louvet/LaLo07b.pdf doi:10.1145/1141277.1141585 50. S. Graillat, P. Langlois, N. Louvet. Accurate dot prod- 30. B.P. Welford, Technometrics 4(3), 419, . ucts with FMA, Jul. . URL http://rnc7.loria.fr/ doi:10.1080/00401706.1962.10490022. URL https : louvet_poster.pdf //www.tandfonline.com/doi/abs/10.1080/00401706. 51. N. Scott, F. Jézéquel, C. Denis, J.M. Chesneaux, 1962.10490022 Computer Physics Communications 176(8), 507, . 31. W. Kahan, . URL https://people.eecs.berkeley. doi:https://doi.org/10.1016/j.cpc.2007.01.005 edu / ~wkahan / Triangle . pdf. https : / / people . eecs . 52. F. Jézéquel, J.M. Chesneaux, Computer berkeley.edu/~wkahan/Triangle.pdf Physics Communications 178(12), 933, . 
32. W. Kahan, . URL https://people.eecs.berkeley. doi:https://doi.org/10.1016/j.cpc.2008.02.003 edu / ~wkahan / VtetLang . pdf. https : / / people . eecs . berkeley.edu/~wkahan/VtetLang.pdf 53. F. Jézéquel, J.M. Chesneaux, J.L. Lamotte, Com- 33. E. Byckling, K. Kajantie,  puter Physics Communications 181(11), 1927, . 34. S.M. Rump, in Reliability in computing, ed. doi:https://doi.org/10.1016/j.cpc.2010.07.012 by R.E. Moore (Elsevier, 1988), pp. 109–126. 54. J.L. Lamotte, J.M. Chesneaux, F. Jézéquel, Com- doi:https://doi.org/10.1016/B978-0-12-505630-4.50012-2. puter Physics Communications 181(11), 1925, . URL http://www.sciencedirect.com/science/article/ doi:https://doi.org/10.1016/j.cpc.2010.07.006 pii/B9780125056304500122 55. N. Scott, F. Jézéquel, C. Denis, J.M. Chesneaux. The 35. E. Loh, G.W. Walster, Reliable Computing 8(3), 245,  CADNA library, . URL http://cadna.lip6.fr/ 36. W. Kahan, J.W. Thomas, Augmenting a programming lan- 56. C. Denis, P.D.O. Castro, E. Petit. A tool for automatic guage with complex arithmetic. Tech. Rep. UCB/CSD-92- Montecarlo Arithmetic analysis. URL https://github. 667, EECS Department, University of California, Berkeley, com/verificarlo/verificarlo Dec . URL http://www2.eecs.berkeley.edu/Pubs/ 57. C. Denis, P.D.O. Castro, E. Petit. Verificarlo: checking TechRpts/1991/6127.html floating point accuracy through monte carlo arithmetic, 37. G.J. van Oldenborgh, Comput. Phys. Commun. 66, 1, . Sep.  doi:10.1016/0010-4655(91)90002-3 58. F. Févotte, B. Lathuilière, SCAN 2016 p. 47,  38. M. Courbariaux, Y. Bengio, J.P. David, in Advances in 59. F. Févotte, B. Lathuilière, in 13e colloque national en neural information processing systems (2015), pp. 3123– calcul des structures (Université Paris-Saclay, Giens, Var, 3131. URL https://arxiv.org/abs/1412.7024. Accepted France, 2017). URL https://hal.archives-ouvertes. as a workshop contribution at ICLR 2015 fr/hal-01908466 39. P. Micikevicius, S. Narang, J. Alben, G.F. Di- 60. F. Févotte, B. Lathuilière, VERROU: Assessing Floating- amos, E. Elsen, D. García, B. Ginsburg, M. Hous- Point Accuracy Without Recompiling, Oct. . URL ton, O. Kuchaiev, G. Venkatesh, H. Wu, CoRR https : / / hal . archives-ouvertes . fr / hal-01383417. abs/1710.03740, . URL http : / / arxiv . org / Working paper or preprint abs/1710.03740 61. F. Févotte, B. Lathuilière. Verrou: a floating-point round- 40. S. Gupta, A. Agrawal, K. Gopalakrishnan, P. Narayanan, ing errors checker. URL http://edf-hpc.github.io/ in International Conference on Machine Learning (2015), verrou/vr-manual.html. Verrou is french for “lock” pp. 1737–1746 62. URL http://www.valgrind.org/ Revisiting “What Every Computer Scientist Should Know About Floating-point Arithmetic” 13

63. N. Nethercote, J. Seward, Electronic Notes in Theoreti- cal Computer Science 89(2), 44 , . doi:10.1016/S1571- 0661(04)81042-9. RV ’2003, Run-time Verification (Satel- lite Workshop of CAV ’03) 64. N. Nethercote, J. Seward, SIGPLAN Not. 42(6), 89–100, Jun. . doi:10.1145/1273442.1250746 65. D.D. Kalamkar, D. Mudigere, N. Mellempudi, D. Das, K. Banerjee, S. Avancha, D.T. Vooturi, N. Jammala- madaka, J. Huang, H. Yuen, J. Yang, J. Park, A. Heinecke, E. Georganas, S. Srinivasan, A. Kundu, M. Smelyanskiy, B. Kaul, P. Dubey, CoRR abs/1905.12322, . URL http://arxiv.org/abs/1905.12322 66. A. Abdelfattah, H. Anzt, E.G. Boman, E. Carson, T. Co- jean, J. Dongarra, M. Gates, T. Grützmacher, N.J. Higham, S. Li, N. Lindquist, Y. Liu, J. Loe, P. Luszczek, P. Nayak, S. Pranesh, S. Rajamanickam, T. Ribizel, B. Smith, K. Swirydowicz, S. Thomas, S. Tomov, Y.M. Tsai, I. Yamazaki, U.M. Yang,  67. R.C. Martin, Clean Architecture: A Craftsman’s Guide to Software Structure and Design. Martin, Robert C (Prentice Hall, 2018). URL https://books.google.fr/books?id= 8ngAkAEACAAJ 68. J.D. Cook. Floating point numbers are a leaky abstraction, . URL https://www.johndcook.com/blog/2009/04/ 06/numbers-are-a-leaky-abstraction/ 69. J. Spolsky. The Law of Leaky Abstractions, . URL https : / / www . joelonsoftware . com / 2002 / 11 / 11 / the-law-of-leaky-abstractions/ 70. W. Kahan. Beastly numbers, Jan. . URL https:// people.eecs.berkeley.edu/~wkahan/tests/numbeast. pdf 71. D. Monniaux, ACM Trans. Program. Lang. Syst. 30(3), 12:1, May . doi:10.1145/1353445.1353446