
NUMERICAL REPRODUCIBILITY AND INTERVAL ALGORITHMS

Numerical Reproducibility and Parallel Computations: Issues for Interval Algorithms

Nathalie Revol, Member, IEEE, and Philippe Théveny

Abstract—What is called numerical reproducibility is the problem of getting the same result when the scientific computation is run several times, either on the same machine or on different machines, with different types and numbers of processing units, execution environments, computational loads, etc. This problem is especially stringent for HPC numerical simulations. In what follows, the focus is on parallel implementations of interval arithmetic using floating-point arithmetic. For interval computations, numerical reproducibility is of course an issue for testing and debugging purposes. However, as long as the computed result encloses the exact and unknown result, the inclusion property, which is the main property of interval arithmetic, is satisfied and getting bit-for-bit identical results may not be crucial. Still, implementation issues may invalidate the inclusion property. Several ways to preserve the inclusion property are presented, on the example of the product of matrices with interval coefficients.

Index Terms—interval arithmetic, numerical reproducibility, parallel implementation, floating-point arithmetic, rounding mode.

Université de Lyon - LIP (UMR 5668 CNRS - ENS de Lyon - INRIA - UCBL), ENS de Lyon, France. Nathalie Revol is with INRIA - [email protected] - WWW home page: http://perso.ens-lyon.fr/nathalie.revol/

arXiv:1312.3300v1 [cs.NA] 11 Dec 2013

I. INTRODUCTION

Interval arithmetic is a means to perform numerical computations and to get a guarantee on the computed result. In interval arithmetic, one computes with intervals, not numbers. These intervals are guaranteed to contain the (set of) exact values, both input values and computed values, and this property is preserved throughout the computations. Indeed, the fundamental theorem of interval arithmetic is the inclusion property: each computed interval contains the exact (set of) result(s). To learn more about interval arithmetic, see [29], [30], [32], [42], [43], [50].

The inclusion property is satisfied by the definition of the arithmetic operations, and of the other operations and functions acting on intervals. Implementations of interval arithmetic most often rely on the floating-point unit provided by the processor. The inclusion property is preserved via the use of directed roundings. If an interval is given exactly, for instance by its endpoints, the left endpoint of its floating-point representation is the rounding towards −∞ of the left endpoint of the exact interval. Likewise, the right endpoint of its floating-point representation is the rounding towards +∞ of the right endpoint of the exact interval. Similarly, directed roundings are used to implement mathematical operations and functions: in this way, roundoff errors are accounted for and the inclusion property is satisfied.

In this paper, the focus is on the problems encountered while implementing interval arithmetic, using floating-point arithmetic, on multicore architectures. In a sense, these issues relate to issues known as problems of numerical reproducibility in scientific computing using floating-point arithmetic on emerging architectures. The common points, the differences, and the concerns which are specific to interval arithmetic will be detailed.

The main contributions of this paper are the identification of problems that cause numerical irreproducibility of interval computations, and recommendations to circumvent these problems. The classification proposed here distinguishes between three categories. The first category, addressed in Section IV, concerns problems of variable computing precision that occur both in sequential and parallel implementations, both for floating-point and interval computations. These behaviours are motivated by the quest for speed, at the expense of accuracy on the results, be they floating-point or interval results.
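The role of directed roundings described above can be sketched in code. This is a minimal illustration, not the paper's implementation: it assumes Python 3.9+ (for `math.nextafter`) and, since Python gives no access to the FPU rounding mode, it over-approximates the directed roundings RD/RU by stepping one ulp outward from the round-to-nearest result, which preserves the inclusion property at the cost of a slightly wider enclosure.

```python
import math

def interval_add(x, y):
    """Interval addition [x] + [y] with outward rounding emulated.

    Python computes with round-to-nearest, so each endpoint sum is at
    most half an ulp away from the exact value; stepping one ulp outward
    with math.nextafter therefore guarantees the inclusion property
    (at the price of an enclosure up to one ulp wider than with true
    directed roundings RD and RU).
    """
    lo = math.nextafter(x[0] + y[0], -math.inf)
    hi = math.nextafter(x[1] + y[1], math.inf)
    return (lo, hi)

a = (1.0, 2.0)
b = (0.1, 0.2)            # endpoints that are not exact binary fractions
c = interval_add(a, b)
# The computed interval encloses the exact sums of the endpoints.
assert c[0] <= 1.0 + 0.1 and 2.0 + 0.2 <= c[1]
```

A library with genuine directed roundings would return tighter endpoints, but the enclosure produced this way is always valid.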
However, even if these behaviours hinder numerical reproducibility, usually they do not threaten the validity of interval computations: interval results satisfy the inclusion property.

Philippe Théveny is with ENS de Lyon - [email protected] - WWW home page: http://perso.ens-lyon.fr/philippe.theveny/

The second category, developed in Section V, concerns the order of the operations. Due mainly to the indeterminism in multithreaded computations, the order of operations may vary. It is the most acknowledged problem. Again, this problem is due to the quest for short execution time; again, it hinders numerical reproducibility of floating-point or interval computations, but at first sight it does not invalidate the inclusion property. However, an interval algorithm will be presented whose validity relies on an assumption on the order of operations.

The problems of the third category, detailed in Section VI, are specific to interval computations and have an impact on the validity of interval results. The question is whether directed rounding modes, set by the user or the interval library, are respected by the environment. These problems occur both for sequential and parallel computations. Ignoring the rounding modes makes it possible to reduce the execution time. Indeed, changing the rounding mode can incur a severe penalty: the slowdown lies between 10 and 100 on architectures where the rounding modes are accessed via global flags and where changing a rounding mode implies flushing pipelines. However, it seems that the most frequent motivation is either the ignorance of issues related to rounding modes, or the quest for an easier and faster development of the execution environment by overlooking these issues.

For each category, we give recommendations for the design of interval algorithms: numerical reproducibility is still not attained, but the impact of the problems identified is attenuated and the inclusion property is preserved.

Before detailing our classification, numerical reproducibility is defined in Section II and then an extensive bibliography is detailed in Section III.

II. PROBLEM OF NUMERICAL REPRODUCIBILITY

In computing in general and in scientific computing in particular, the quest for speed has led to parallel or distributed computations, from multithreaded programming to high-performance computing (HPC). Within such computations, contrary to sequential ones, the order in which events occur is not deterministic. "Events" here is a generic term which covers communications, threads or processes, modification of shared data structures, etc. The order in which communications from different senders arrive at a common receiver, the relative order in which threads or processes are executed, and the order in which shared data structures such as working lists are consulted or modified are not deterministic. Indeed, for the example of threads, the quest for speed leads a greedy scheduler to start a task as soon as it is ready, and not at the time or in the order at which it would start in a sequential execution. One consequence of this indeterminism is that speed is traded against numerical reproducibility of floating-point computations: computed results may differ, depending on the order in which events happened in one run or another.

However, numerical reproducibility is a desirable feature, at least for debugging purposes: how is it possible to find and fix a bug that occurs sporadically, in an indeterministic way? Testing is also impossible without numerical reproducibility: when a program returns results which differ from the results of a reference implementation, nothing can be concluded in the absence of numerical reproducibility. To address the issue of numerical reproducibility, in [45], not only the use of higher computing precision is recommended, but also the use of interval arithmetic or of some variant of it in order to get numerical guarantees, and the use of tools that can diagnose sensitive portions of code and take corrective actions. We follow these tracks and we focus on interval arithmetic; the same conclusions can be drawn for its "variants" such as affine arithmetic or polynomial models, e.g. Taylor models.

First, let us introduce the problem of numerical reproducibility in floating-point arithmetic. Floating-point arithmetic is standardised: the IEEE-754 standard [18], [19] completely specifies the formats of floating-point numbers, their binary representation and the behaviour of arithmetic operations. One of the goals of this standardisation was to enable portability and reproducibility of numerical computations across architectures. However, several runs of the same code on different platforms, or even on the same platform when the code is parallel (typically, multithreaded), may yield different results. Getting the same result on the same platform is sometimes called repeatability. Only the word reproducibility will be used throughout this paper.

Let us give a first explanation of the lack of numerical reproducibility. On some architectures, and with some programming languages and compilers, intermediate computations such as the intermediate sum in s ← a + b + c can take place with variable precision. This precision can depend on the availability of extended-size registers (such as 80-bit registers on x87 or IA64 architectures, when computing with 64-bit operands), on whether operations use registers or store intermediate results in memory, on data alignment or on the execution of other threads [3], and on whether an intermediate computation is promoted to higher precision or not. The impact of the computing precision on numerical irreproducibility is discussed in detail in Section IV.

Let us stick to our toy example s ← a + b + c to illustrate a second cause of numerical irreproducibility. As floating-point addition is not associative (neither is multiplication), the order according to which the operations are performed matters. Let us illustrate this with an example in double precision: RN(1 + (2^100 − 2^100)) (where RN stands for rounding-to-nearest) yields 1, since RN(2^100 − 2^100) = 0, whereas RN((1 + 2^100) − 2^100) yields 0, since RN(1 + 2^100) = 2^100. The last result is called a catastrophic cancellation: most or all accuracy is lost when close (and even equal, in this example) large numbers are subtracted and only roundoff errors remain, hiding the meaningful result. Real-life examples of this problem are to be found in [14], where the results of computations for ocean-atmosphere simulation strongly depend on the order of operations. Other examples are given in [7] for the numerical simulation of punching of metal sheets. This lack of associativity of floating-point operations implies that the result of a reduction operation depends on the order according to which the operations are performed. However, with multithreaded or parallel or distributed implementations, this order is not deterministic and the computed result thus varies from one execution to the next. How the order of operations influences numerical computations and interval computations is detailed in Section V.

A last issue, which is specific to the implementation of interval arithmetic on parallel environments, is the respect of rounding modes. As already mentioned, the implementation of interval arithmetic crucially requires directed rounding modes. However, it has been observed that rounding modes are not respected by numerical libraries such as BLAS [23], nor by compilers when default options are used, nor by execution environments for multithreaded computations, nor by parallel languages such as OpenMP or OpenCL. Either this is not documented, or this is explicitly mentioned as being not supported; cf. Section VI for a more detailed discussion. Respecting the rounding modes is required by the IEEE-754 standard for floating-point arithmetic, and the behaviours just mentioned are either misbehaviours, as in the example of Section VI-A, or are (often undocumented) features of the libraries, often "justified" by the quest for shorter execution times.

These phenomena explain the lack of numerical reproducibility for floating-point computations, i.e. the fact that two different runs yield two different results, or the loss of the inclusion property in interval arithmetic, due to the non-respect of rounding modes.

Facing this lack of reproducibility, various reactions are possible. One consists in acknowledging the computation of differing results as an indication of a lack of numerical stability. In a sense, a positive way of considering numerical irreproducibility is to consider it as useful information on the numerical quality of the code. However, for debugging and testing purposes, reproducibility is more than helpful. For such purposes, numerical reproducibility means getting bitwise identical results from one run to the next. Indeed, this is the most common definition of numerical reproducibility. This definition is also useful for contractual or legal purposes (architectural design and drug design are instances mentioned in the slides corresponding to [5]), as long as a reference implementation, or at least a reference result, is given. However, this definition is not totally satisfactory, as the result is not well-defined. In particular, it says nothing about the accuracy of the result. Requiring the computed result to be the correct rounding of the exact result is a semantically meaningful definition of numerical reproducibility. The computed result is thus uniquely defined.
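The double-precision non-associativity example above can be replayed directly; a minimal sketch in Python, whose floats are IEEE-754 Binary64 with rounding to nearest:

```python
# Floating-point addition is not associative: the same three summands,
# added in two different orders, give two different Binary64 results.
x = 2.0 ** 100

left = 1.0 + (x - x)   # RN(2^100 - 2^100) = 0, so the sum is 1.0
right = (1.0 + x) - x  # RN(1 + 2^100) = 2^100 (the 1 is absorbed), so 0.0

assert left == 1.0
assert right == 0.0
```

The second ordering exhibits the catastrophic absorption-then-cancellation discussed in the text: the 1 disappears below the ulp of 2^100, and the final subtraction leaves only the roundoff.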
The difficulty with this definition is to devise an algorithm that is efficient on parallel platforms and that computes the correct rounding of the exact result. This definition has been adopted in [26], for LHC computations of 600,000 jobs on 60,000 machines: efficient mathematical functions such as exponential and logarithm, with correct rounding, were available. However, in most cases it is not known how to compute the correctly rounded result efficiently. For this reason, this definition of numerical reproducibility as computing the correct rounding of the exact result may be too demanding. Thus we prefer to keep separate the notions of numerical reproducibility (i.e. getting bitwise identical results) and of correct rounding.

To sum up, numerical reproducibility can be defined as getting bitwise identical results, where these results have to be specified in some way, e.g. by a reference implementation, which can be slow. We consider it is simpler not to mix this notion with the notion of correct rounding of the exact result, which is uniquely specified. Our opinion has been reinforced by the results in [5]: it is even possible to define several levels of numerical reproducibility, each one corresponding to a level of accuracy. One has thus a hierarchy of reproducibilities, corresponding to different tradeoffs between efficiency and accuracy, ranging from low numerical quality to correct rounding.

III. PREVIOUS WORK

In the previous section, we introduced various sources of numerical irreproducibility and we delineated their main features. More detailed and technical explanations are given in [12]: implementation issues such as data races, out-of-order execution, message buffering with insufficient buffer size, and non-blocking communication operations are also introduced. A tentative classification of indeterminism can be found in [3]; it distinguishes between external determinism, which roughly corresponds to getting the same results independently of the internal states reached during the computation, and internal determinism, which requires that internal execution steps are the same from one run to the other. External indeterminism can be due to data alignment that varies from one execution to the next, to the order of communications with "wildcard receives" of MPI, and to other causes already mentioned. In what follows, only external indeterminism will be considered.

Numerical irreproducibility, in particular of summations of floating-point numbers, is cited as early as 1994 in [21] for weather forecasting applications. More recently, it has been clearly put in evidence in [7], where the application is the numerical simulation of a deep drawing process for deforming metal sheets: depending on the execution, the computed variations of the thickness of the metal sheet vary. Other references mentioning numerical irreproducibility are for instance [14] for an application in ocean-atmosphere simulation, [24] and reference [10] therein about digital breast tomosynthesis, or [51] for power state estimation in the electricity grid.

A. The Example of the Summation

The non-associativity of floating-point operations is the major explanation of the phenomenon of numerical irreproducibility. The simplest problem that exemplifies the non-associativity is the summation of n floating-point numbers. It is also called a reduction of n numbers with the addition operation. Not surprisingly, many efforts to counteract numerical irreproducibility focus on the summation problem, as accuracy can be lost due to catastrophic cancellation or catastrophic absorption(*) in summation. Let us list some of them in chronological order of publication. An early work [14] on the summation uses and compares several techniques to get more accuracy on the result: the conclusion is that compensated summation and the use of double-double arithmetic give the best results. Following this work, for physical simulations, in [40] conservation laws were numerically enforced using compensated sums, either Kahan's version or Knuth's version, implemented as MPI reduction operators. Similarly, in [46], a simplified form of "single-single" arithmetic is employed on GPU to sum the energy and satisfy numerically the conservation law. Bailey, the main author of QD, a library for double-double and quad-double arithmetic [15], also advocates the use of this higher-precision arithmetic in [1]. However, even if the use of extra computing precision yields accurate results for more ill-conditioned summations, i.e. results with more correct bits, it does not ensure numerical reproducibility, as solving even worse-conditioned problems shows.

(*) Catastrophic absorption occurs in a sum when small numbers are added to a large number and "disappear" in the roundoff error, i.e. they do not appear in the final result, whereas adding these small numbers first would result in a number large enough to be added to the first large number and be "visible".

Since the problem can be attributed to the variations in the summation order, a first approach to get reproducible results consists in fixing the reduction tree [51]. More precisely, using a fixed integer K, the array of summands is split into K chunks of consecutive subarrays, each subarray (or chunk) being summed sequentially (but each chunk can be summed independently of the other ones), and then the order of the reduction of the K partial sums is also sequential.

Another approach, in [5], [6], consists in what is called pre-rounding. Even if the actual implementation is very different from the process explained here, it can be thought of as a sum in fixed-point, as learnt in elementary school. The mantissa of every summand is aligned with respect to the point, then the leftmost "vertical slice" is considered and added to produce a sum S1. The width of this slice is chosen in such a way that the sum S1 can be performed exactly using the width of the mantissa of floating-point numbers (e.g. 53 bits for the double precision format). As this sum S1 is exact, it is independent of the order in which the additions are performed and is thus numerically reproducible on any platform. To get a more accurate result, a second vertical slice, just right to the first one, can be summed, again yielding an exact sum S2, and the final result is the (floating-point, thus inexact) sum of S1 and S2. An increase of the number K of slices corresponds to an increase of the computing precision. However, for a fixed K, each partial sum S1, ..., SK is exact and thus numerically reproducible. The details to determine the slices and to get S1, ..., SK are given in [5], and a faster algorithm for exascale computing, i.e. for really large platforms, is provided in [6].

B. Approaches to Reach Numerical Reproducibility

In [22] a tool, called MARMOT, that detects and signals race conditions and deadlocks in MPI codes is introduced, but reduction operations are not handled.

The original proposal for Java by Gosling [10] included numerical reproducibility of floating-point computations. To reach reproducibility, it prohibited the use of any format different from Binary32 and Binary64, the use of the FMA, as well as any optimisation based on the associativity of the operators. It also forbade changes of the rounding modes [34]: the only rounding mode is to-nearest [11, Section 4.2.4]. It did not include the handling of exceptions via flags, as required by the IEEE-754 standard. The seminal talk by Kahan in 1998 [20] has shaken these principles. Actually, Kahan disputed mainly the lack of exception handling and the restriction to shorter (thus, less precise) formats even when longer ones are available. It seems that this dispute opened the door to variations around Java. Indeed, Java Grande [35], [47] proposes to allow the use of longer formats and of the FMA. The use of the associativity of operations to optimise the execution time has been under close scrutiny for a longer lapse of time and remains prohibited [11, Section 15.7.3], as explicit rules are given, e.g. left-to-right priority for +, −, ×. However, a strict adherence to the initial principles of Java can be enforced by using the StrictFp keyword. For instance, the VSEit environment for modelling complex systems, simulating them and getting a graphical view [2], uses the StrictMath option in Java as a way to get numerical reproducibility. However, the need for getting reproducibility is not explained in much detail.

Finally, Intel MKL (Math Kernel Library) 11.0 introduces a feature called Conditional Numerical Reproducibility (CNR) [49], which provides functions for obtaining reproducible floating-point results. When using these new features, Intel MKL functions are designed to return the same floating-point results from run to run, subject to the following limitations:
• calls to Intel MKL occur in a single executable;
• input and output arrays in calls must be aligned on 16, 32, or 64 byte boundaries on systems with SSE / AVX1 / AVX2 instruction support (resp.);
• the number of computational threads used by the library remains constant throughout the run.
These conditions are rather stringent. Another approach to numerical reproducibility consists in providing correctly rounded functions, at least for the mathematical library [26].

The approaches presented here are not yet entirely satisfactory, either because they do not really offer numerical reproducibility, or because they handle only summation, or because performances are too drastically slowed down. Furthermore, none addresses interval computations. In what follows, we propose a classification of the sources of numerical irreproducibility for interval computations and some recommendations to circumvent these problems, even if we do not have the definitive solution. In our classification, a first source of problems is the variability of the employed computing precision.

IV. COMPUTING PRECISION

A. Problems for Floating-Point Computations

The computing precision used for floating-point computations depends on the employed format (as defined by the IEEE-754 standard [18], [19]). The single precision corresponds to the representation of floating-point numbers on 32 bits, where 1 bit is used for the sign, 8 bits for the exponent and the rest for the significand. The corresponding format is called Binary32. The double precision uses 64 bits, hence the name Binary64, with 1 bit for the sign and 11 for the exponent. The quadruple precision, also known as Binary128, uses 128 bits, with 1 bit for the sign and 15 bits for the exponent. The rest of Section IV-A owes much to [8].

On some architectures (IA32 / x87), registers have a longer format: 80 bits instead of 64. The idea prevailing to the introduction of these long registers was to provide higher accuracy by computing, for a while, with higher precision than the Binary64 format, and to round the result into a Binary64 number only after several operations. However, it entailed the so-called "double-rounding" problem: an intermediate result is first rounded into an 80-bit number, then into a 64-bit number when it is stored, but these two successive roundings can yield a result different from rounding directly into a 64-bit number. From the point of view of numerical reproducibility, the use of extended precision registers is also troublesome: as the operations which take place within registers and the temporary storage into memory can occur at different stages of the computation, they may vary from run to run, depending for instance on the load of the current processor, on data alignment or on the execution of other threads [3].

Another issue is the format chosen for the intermediate results, say for the intermediate sums in the expression a + b + c + d, where a, b, c and d are in the Binary32 floating-point format. Notwithstanding the order in which the intermediate sums are computed, let us focus on the problem of the precision. If the architecture offers Binary64 and if the language is C or Python, then the intermediate sums may be promoted to Binary64. It will not be the case if the language is Java with the StrictFp keyword, or Fortran. In C, it may or may not be the case, depending on the compiler and on the compilation options: the compiler may prefer to take advantage of a vectorised architecture like SSE2 or AVX, where two Binary32 floating-point additions are performed in parallel, in the same amount of time as one Binary64 floating-point addition. The compiler may thus choose to execute (a + b) + (c + d) in Binary32, where the two additions a + b and c + d are performed in parallel, as it will execute faster than ((a + b) + c) + d. The compiler may also decide to use more accurate registers (64 bits or 80 bits) when such vectorised devices are not available. Thus, depending on the programming language, on the compiler and its options and on the architecture, the intermediate results may vary, as does the final result. In some languages, it is possible to gain some a posteriori knowledge on the employed precision. In C99, the value of FLT_EVAL_METHOD gives an indication, at run-time, of the chosen intermediate format: indeterminate, double or long double.

B. Problems for Interval Computations

The notion of precision in interval arithmetic could be regarded as the radius of the input arguments. It is known that the overestimation of the result of a calculation is proportional to the radius of the input interval, with a proportionality constant which depends on the computed expression. More precisely [32, Section 2.1], if f is a Lipschitz-continuous function D ⊂ R^n → R, then

q(f(x), f(x)) = O(r) if rad(x) ≤ r,

where boldface letters denote interval quantities, f(x) is the exact range of f over the interval x, f(x) is an overestimation of f(x) computed from an expression for f using interval arithmetic in a straightforward way, and q stands for the Hausdorff distance. As the first interval here encloses the second, q(f(x), f(x)) is simply max(inf(f(x)) − inf(f(x)), sup(f(x)) − sup(f(x))). In this formula, r can be considered as the precision. It is possible to improve this result by using more elaborate approaches to interval arithmetic, such as a Taylor expansion of order 1 of the function f. Indeed, if f is a continuously differentiable function, then [32, Section 2.3]

q(f(x), f(x)) = O(r^2) if rad(x) ≤ r,

where f(x) is computed using a so-called centered form. As f is Lipschitz, it holds that the radius of f(x) is also proportional to the radius of the input interval. In other words, the accuracy of the result improves with the radius, or precision, of the input. These results hold for an underlying exact arithmetic.

This result in interval analysis can be seen as an equivalent of the rule of thumb in numerical floating-point analysis [16, Chapter 1]. Considered with optimism, this means that it suffices to increase the computing precision – in floating-point arithmetic – or the precision on the inputs – in interval arithmetic – to get more accurate results. In interval arithmetic, this can be done through bisection of the inputs, to get tight intervals as inputs and to reduce r in the formula above. The final result is then the union of the results computed for each subinterval. A more pragmatic point of view is, first, that even if the results are getting more and more accurate as the precision increases, there is still no guarantee that the employed precision will allow to reach a prescribed accuracy for the result. Second, reproducibility is still out of reach, as developed in Section V.

In what follows, we consider interval arithmetic implemented using floating-point arithmetic. The aforementioned theorems, that bound q(f(x), f(x)), do not account for the limited precision of the underlying floating-point arithmetic. Indeed, computed interval results also suffer from the problems of floating-point arithmetic, namely the possible loss of accuracy. Even if directed roundings make it possible to ensure the inclusion property, there is no information about how much the exact interval is overestimated, and no guarantee about the tightness of the computed interval.

C. Recommendations

We advocate the use of the mid-rad representation on the one hand, and the use of iterative refinement on the other hand. The mid-rad representation ⟨m, r⟩ corresponds to the interval [m − r, m + r] = {x : |m − x| ≤ r}. An advantage of the mid-rad representation is that thin intervals are represented more accurately in floating-point arithmetic. For instance, let us consider m a non-zero floating-point number and r = 1/2 ulp(m), i.e. r is a power of 2 that corresponds, roughly speaking, to half the last bit in the floating-point representation of m. (Let us recall [31] that for x ∈ R \ {0}, if x ∈ [2^e, 2^(e+1)) then ulp(x) = 2^(e−p+1) in radix-2 floating-point arithmetic with p bits used for the significand, and that ulp(−x) = ulp(x) for negative x.) In this example, m and r are floating-point numbers. Then the floating-point representation by endpoints of the interval is [RD(m − r), RU(m + r)] (where RD denotes the rounding mode towards −∞, or rounding downwards, and RU denotes the rounding mode towards +∞, or rounding upwards), which is [m − 2r, m + 2r]: in this example, the width of the interval is doubled with the representation by endpoints. The reader has to be aware that no representation, neither mid-rad nor by endpoints, always supersedes the other one. An example where the mid-rad representation is superior to the representation by endpoints has just been given. Conversely, an unbounded interval can be represented by its endpoints, say [1, +∞), but the only enclosing mid-rad representation ⟨m, +∞⟩, with m > 1, corresponds to R.

We also recommend the use of iterative refinement where applicable. Indeed, even if the computations of the midpoint and the radius of the result suffer from the aforementioned lack of accuracy, iterative refinement (usually a few iterations suffice) recovers a more accurate result from this inaccurate one.

Let us illustrate this procedure on the example of square linear system solving Ax = b (cf. [32, Chapter 4], [30, Chapter 7], [42] for an introduction). Once an initial approximation x0 is computed, the residual r = b − Ax0 is computed using twice the current precision (and here we rejoin the solutions in [1], [14], [40], [46] already mentioned), as much cancellation occurs in this calculation. Then, solve – again approximately – the linear system Ae = r with the same matrix A, and re-use every pre-computation done on A, typically a factorisation such as LU. Finally, correct the approximate solution: x1 ← mid(x0) + e.
Under specific assumptions, but independently of the order of the operations, it is possible to relegate the effects of the intermediate precision and of the condition number beyond the significant bits of the destination format [33]. In other words, the overestimation due to the floating-point arithmetic is minimal: only one ulp (or very few ulps) on the precision of the interval result.

Another study [48] also takes into account the effect of floating-point arithmetic. It suggests that the same approach applies to nonlinear system solving and that iterative refinement, called in this case Newton iteration, again yields fully accurate results, i.e. up to 1 ulp of the exact result.

However, in both cases it is assumed that enough steps have been performed: if there is a limit on the number of steps, then one run could converge but not the other one, and again numerical reproducibility would not be attained.

To conclude on the impact of the computing precision: it raises no problem for the validity of interval computations, i.e. the inclusion property is satisfied, but it influences the accuracy of the result and the execution time. It seems that the same could be said for the order of the operations, which is the issue discussed next, but it will be seen that the validity of interval computations can depend on it.

V. ORDER OF THE OPERATIONS

A. Problems for Interval Computations

As already abundantly mentioned, a main explanation for the lack of reproducibility of floating-point computations is the lack of associativity of floating-point operations (addition, multiplication).

Interval arithmetic also suffers from a lack of algebraic properties. In interval arithmetic, the square operation differs from the multiplication of a variable by itself, because the occurrences of the variable (x and y in the example below) are decorrelated:
  [−1, 2]^2 = {x^2 : x ∈ [−1, 2]} = [0, 4]
  ≠ [−1, 2] · [−1, 2] = {x·y : x ∈ [−1, 2], y ∈ [−1, 2]} = [−2, 4].
This problem is often called variable dependency. In interval arithmetic, the multiplication is also not distributive over the addition, again because of the decorrelation of the variables.

Finally, interval arithmetic implemented using floating-point arithmetic suffers from all these algebraic deficiencies. In any case, the computed result contains the exact result. However, it is not possible to guarantee that one order of evaluation produces tighter results than another one, as illustrated by the following example. In the Binary64 format, the smallest floating-point number larger than 1 is 1 + 2^-52. Let us consider three intervals with floating-point endpoints: A1 = [−2^-53, 2^-52], A2 = [−1, 2^-52] and A3 = [1, 2]. Using the representation by endpoints and the implementation in floating-point arithmetic of [a, ā] + [b, b̄] as [RD(a + b), RU(ā + b̄)], one gets for (A1 + A2) + A3:
  tmp1 := A1 + A2 = [RD(−2^-53 − 1), RU(2^-52 + 2^-52)] = [−1 − 2^-52, 2^-51]
and finally
  B1 := tmp1 + A3 = [RD(−1 − 2^-52 + 1), RU(2^-51 + 2)] = [−2^-52, 2 + 2^-51].
And one gets for A1 + (A2 + A3):
  tmp2 := A2 + A3 = [RD(−1 + 1), RU(2^-52 + 2)] = [0, 2 + 2^-51]
and eventually
  B2 := A1 + tmp2 = [RD(−2^-53 + 0), RU(2^-52 + (2 + 2^-51))] = [−2^-53, 2 + 2^-50].
The exact result is B := [−2^-53, 2 + 2^-51], and this interval is representable using floating-point endpoints. It can be observed that both B1 and B2 enclose B: the inclusion property is satisfied. Another observation is that B1 overestimates B to the left and B2 to the right. Of course, one can construct examples where both endpoints are over-estimated.

Not only does the order in which non-associative operations such as additions are performed matter; other orders matter as well. Let us go back to the bisection process mentioned in Section IV-B. Indeterminism is present in the bisection process, and it introduces more sources of irreproducibility. Indeed, with bisection techniques, one interval is split into two and one half (or both halves) is stored in a list for later processing. The order in which intervals are created and processed is usually not deterministic, if the list is used by several threads.

This can even influence the creation or bisection of intervals later in the computation. Typically, in algorithms for global optimisation [13], [36], the best enclosure found so far for the optimum can lead to the exploration or destruction of intervals in the list. As the order in which intervals are explored varies, this enclosure for the optimum varies, and thus the content of the list varies – and not only the order in which its content is processed.

To sum up, so far only the accuracy and speed of interval computations can be affected by the order of operations, but not the validity of the results. As discussed now, the order may also impact the validity of interval results, i.e. the inclusion property may not be satisfied.

B. Recommendations

As shown by the example of the sum of three intervals, even if the result depends on the order of the operations, the inclusion property holds and the result always encloses the exact result. Thus an optimistic view is that numerical reproducibility is irrelevant for interval computations, as the result always satisfies the inclusion property. An even more optimistic view could be that getting different results is good, since the sought result lies in their intersection: intersecting the various results would yield even more accuracy. Indeed, with the example above, the exact result B is recovered by intersecting B1 and B2. Unfortunately this is rarely the case, and usually no computation yields tighter results than the other ones. Thus the order may matter.

We have already mentioned the benefit of using the mid-rad representation in Section IV-C. Another, much more important, advantage is the ability to benefit from the efforts made in developing mathematical libraries. As shown by Rump in his pioneering work [41], in linear algebra in particular, it is possible to devise algorithms that use floating-point routines: typically, these algorithms compute the midpoint of the result using optimised numerical routines, and afterwards compute the radius of the result, again using optimised numerical routines.

Let us illustrate this approach with the product of matrices with interval coefficients, as given in [41]: A = ⟨Am, Ar⟩ and B = ⟨Bm, Br⟩ are matrices with interval coefficients, represented as a matrix of midpoints and a matrix of radii. The product A·B is enclosed in C = ⟨Cm, Cr⟩, where an interval enclosing Cm is computed as [RD(Am·Bm), RU(Am·Bm)] using optimised BLAS3 routines for the product of floating-point matrices. Then Cr is computed as RU((|Am| + Ar)·Br + Ar·|Bm|), again using optimised BLAS routines. In [41], the main benefit announced is the gain in execution time, as these routines are really well optimised for a variety of architectures, e.g. in Goto-BLAS, in ATLAS, or in MKL, the library developed by Intel for its architectures. We also foresee the benefit of using reproducible libraries, such as the CNR version of Intel MKL, once they are developed and once their usage and performance are optimised.

C. Concerns Regarding the Validity of Interval Results

Apart from their lack of numerical reproducibility, there is another limit to the current usability of floating-point BLAS routines for interval computations. This limit lies in the use of strong assumptions on the order in which operations are performed. Let us again exemplify this issue on the product of matrices with interval coefficients. The algorithm given above reduces interval operations to floating-point matrix operations; however, it requires 4 calls to BLAS3 routines. An algorithm has recently been proposed [44] that requires only 3 calls to floating-point matrix products. An important assumption to ensure the inclusion property is that two of these floating-point matrix products perform the basic arithmetic operations in the same order. Namely, the algorithm relies on the following theorem [44, Theorem 2.1]: if A and B are two n × n matrices with floating-point coefficients, if C = RN(A·B) is computed in any order and if Γ = RN(|A|·|B|) is computed in the same order, then the error between C and the exact product A·B satisfies
  |RN(A·B) − A·B| ≤ RN((n + 2)·u·Γ + η),
where u is the unit roundoff and η is the smallest normal positive floating-point number (u = 2^-52 and η = 2^-1022 in Binary64). This assumption on the order is not guaranteed by any BLAS library we know of, and the discussion above tends to show that it does not hold for multithreaded computations.

Two solutions can be proposed to preserve the efficiency of this algorithm along with the inclusion property. A straightforward solution consists in using a larger bound for the error on the matrix product, a bound which holds whatever the order of the computations of C and Γ. Following [16, Chapter 2] and barring underflow, a bound of the following form can be used instead [39]:
  |RN(A·B) − A·B| ≤ RN( ((1+3u)^n − 1) · (((1+2u)·|A|) · ((1+2u)·|B|)) ).
A more demanding solution, implemented in [38], consists in implementing the two matrix products simultaneously, in order to ensure the same order of operations. Optimisations are done to get performance: block products to optimise the cache usage (level 1) and hand-made vectorisation of the code yield performance, even if the performance of well-known BLAS implementations is difficult to attain. Furthermore, this hand-made implementation has a better control of the rounding modes than the BLAS routines, and this is another major point in satisfying the inclusion property, as discussed in the next section.

The lack of respect of the rounding modes is another source of interval invalidity (to paraphrase the term "numerical reproducibility") and it is specific to interval computations.

VI. ROUNDING MODES

A. Problems for Interval Computations

Directed rounding modes of floating-point arithmetic are crucial for a correct implementation of interval arithmetic, i.e. for an implementation that preserves the inclusion property. This is a main difference between interval computations and floating-point computations, which usually employ rounding-to-nearest only. It is also an issue not only for reproducibility, but even for correctness of interval computations, especially for parallel implementations, as will be shown below.

For some floating-point computations that make extensive use of EFTs (Error-Free Transforms), such as the compensated sums mentioned in Section III, the already mentioned QD library [15], or XBLAS, a library for BLAS with extended precision [25], the only usable rounding mode is rounding-to-nearest; otherwise these libraries fail to deliver meaningful results in higher precision.

For interval arithmetic, both the endpoints and the mid-rad representations require directed rounding modes. However, many obstacles are encountered.

First, it may happen that the compiler is too eager to optimise a code to respect the required rounding mode. For instance, let us compute an enclosure of a/b, where both a and b are floating-point numbers, using the following piece of pseudo-code:
  set_rounding_mode(downwards)
  left := a/b
  set_rounding_mode(upwards)
  right := a/b
In order to spare a division, which is a relatively expensive operation, a compiler may assign right := left, but then the result is not the expected one. This example is given in the gcc bug report #34678, entitled "Optimization generates incorrect code with -frounding-math option". Even using the right set of compilation options does not solve the problem. Usually these options, when doing the job properly, neutralise optimisations done by the compiler, and the resulting code may exhibit poor performance.

Second, interval computations may rely on floating-point routines, such as the BLAS routines in our example of the product of matrices with interval coefficients. For interval computations, it is desirable that the BLAS library respect the rounding mode set before the call to a routine. (However, C99 seems to exclude that external libraries called from C99 respect the rounding mode in use; on the contrary, C99 allows them to set and modify the rounding mode.) This desirable behaviour is not documented for the libraries we know of, as developed in [23]. The opposite has been observed by A. Neumaier and, years later, by F. Goualard in MATLAB, as explained in his message to the reliable_computing mailing list on 29 March 2012, quoted below.

In a 2008 mail from Pr. Neumaier to Cleve Moler, forwarded to the IEEE 1788 standard mailing list (http://grouper.ieee.org/groups/1788/email/msg00204.html), it is stated that MATLAB resets the FPU control word after each external call, which would preclude any use of fast directed rounding through some C MEX file calling, say, the fesetround() function. According to my own tests, the picture is more complicated than that. With MATLAB R2010b under Linux/64 bits I am

perfectly able to switch the rounding direction in some MEX file for subsequent computation in MATLAB. The catch is that the rounding direction may be reset by some later events (calling some non-existent function, calling an M-file function, ...). In addition, I cannot check the rounding direction using fegetround() in a C MEX file, because it always returns that the rounding direction is set to nearest/even, even when I can check by other means that this is not the case in the current MATLAB environment.

A more subtle argument against the naive use of mathematical libraries is based on the monotony of the operations. Let us take again the example of a matrix product with nonnegative entries, such as Γ introduced in Section V-B. If the rounding mode is set to rounding upwards, for instance, then it suffices to compute Γij as RU(Σk |aik|·|bkj|) to get an overestimation of Σk |aik|·|bkj|. However, if Γ is computed using Strassen's formulae, then terms of the form x − z and y + z are introduced. To get an upper bound on these terms, one needs an overestimation x̄ of x, ȳ of y and z̄ of z, but also an underestimation z̲ of z. The overestimation of x − z can thus be computed as RU(x̄ − z̲) and the overestimation of y + z as RU(ȳ + z̄). In other words, one may need both over- and under-estimations of intermediate values, i.e. one may need to compute many intermediate quantities twice. This would ruin the fact that Strassen's method is a fast method. Anyway, if a library implementing Strassen's product is called with the rounding mode set to +∞ and respects it, it simply performs all operations with this rounding mode, and there is no guarantee that the result overestimates the exact result. To sum up, there is little evidence and little hope that mathematical libraries return an overestimation of the result when the rounding mode is set to +∞.

Third, the programming language may or may not support changes of the rounding mode. OpenCL does not support them, as explicitly stated in the documentation (from the OpenCL Specification, Version 1.2, Document Revision 15): "Round to nearest even is currently the only rounding mode required by the OpenCL specification for single precision and double precision operations and is therefore the default rounding mode. In addition, only static selection of rounding mode is supported. Dynamically reconfiguring the rounding modes as specified by the IEEE 754 spec is unsupported." OpenMP does not support them either, as less explicitly stated in the documentation (from OpenMP Application Program Interface, Version 4.0 - RC 2 - March 2013, Public Review Release Candidate 2): "This OpenMP API specification refers to ISO/IEC 1539-1:2004 as Fortran 2003. The following features are not supported: IEEE Arithmetic issues covered in Fortran 2003 Section 14 [...]", which are the issues related to rounding modes.

Fourth, the execution environment, i.e. the support for multithreaded execution, usually does not document how the rounding modes are handled either. For architectures or instruction sets which have the rounding mode specified in the code of the operation, such as CUDA for GPUs, or IA64 processors (but without access from a high-level programming language), rounding modes are handled smoothly, even if they are accessed with more or less ease. For other environments, where the rounding modes are set via a global flag, it is not clear how this flag is handled: it is expected that it is properly saved and restored when a thread is preempted or migrated; it is not documented whether concurrent threads on the same core "share" the rounding mode or whether each of them can use its own rounding mode. To quote [23]: "How a rounding mode change is propagated from one thread or node of a cluster to all others is unspecified in the C standard. In MKL the rounding mode can be specified only in the VML (Vector Math Library) part and any multithreading or clustering behavior is not documented."

After this discussion, it may appear utopian to rely too much on directed rounding modes to ensure that the inclusion property is satisfied.

B. Recommendations

Our main recommendation is to use bounds on roundoff errors rather than directed rounding modes, when it is not safe to use the latter. These bounds must be computable using floating-point arithmetic. They must be computed using rounding-to-nearest, which is the mode most likely to be in use, and they are preferably independent of the rounding mode. For instance, the bound given in Section V-C can be made independent of the rounding mode by replacing u, which is the roundoff unit for rounding-to-nearest, by 2u, which is an upper bound for any rounding mode [39]. Another example of this approach to floating-point roundoff errors is the FI_LIB library [17]. It was also adopted to account for roundoff errors by the COSY library [37]. This approach is rather brute force, but it is robust to a change of rounding mode. Furthermore, if the algorithm implemented in a numerical routine is known, it is possible to use this numerical routine and to get an upper bound on its result. For instance, the bound in Section V-C on the error of a matrix product can be obtained using any routine for the matrix product, as long as it does not use a fast method such as Strassen's.

VII. CONCLUSION

As developed in the preceding sections, obtaining numerical reproducibility is a difficult task, and achieving it can severely impede performance in terms of execution time. It is thus worth checking whether numerical reproducibility is really needed, or whether getting guarantees on the accuracy of the results suffices. For instance, as stated in [14] about climate models: "It is known that there are multiple stable regions in phase space [...] that the climate system could be attracted to. However, due to the inherent chaotic nature of the numerical algorithms involved, it is feared that slight changes during calculations could bring the system from one regime to another." In this example, qualitative information, such as the determination of the attractor for the system under consideration, is more important than reproducibility. This questioning goes further in a talk by Dongarra, similar to [9], where he advocated the quest for a guaranteed accuracy and the use of a small computing precision, such as single precision, rather than the quest for bit-for-bit reproducibility, for speed and energy-consumption reasons.

However, numerical reproducibility may be mandatory. In such a case, our main recommendation to conclude this work is the following methodology.
• Firstly, develop interval algorithms that are based on well-established numerical bricks, so as to benefit from their optimised implementations.
• Second, convince the developers and vendors of these bricks to clearly specify their behaviour, especially as regards rounding modes.
• If the second step fails, replicate the work done for the optimisation of the considered numerical bricks, to adapt them to the peculiarities and requirements of the interval algorithm. A precursor to this recommendation is the recommendation in [27], which aims at easing such developments: "In order to achieve near-optimal performance, library developers must be given access to routines or kernels that provide computational- and utility-related functionality at a lower level than the customary BLAS interface." This would make it possible to use the lower-level bricks used in high-performance BLAS, e.g. computations at the level of the blocks and not of the entire matrix.
• Get free from the rounding mode by bounding, roughly but robustly, errors with formulas independent of the rounding mode if needed.

Eventually, let us emphasise that the problem of numerical reproducibility is different from the more general topic called Reproducible research in science, which is for instance developed in a complete issue of the Computing in Science & Engineering magazine [4]. Reproducible research corresponds to the possibility to reproduce computational results by keeping track of the code version, compiler version and options, input data, etc., used to produce the results that are often only summed up, mostly as figures, in papers. A possible solution is to adopt a "versioning" system not only for code files, but also for compilation commands, data files, binary files, etc. However, floating-point issues are usually not considered in the related publications; they very probably should be.

ACKNOWLEDGEMENT

This work is partly funded by the HPAC project of the French Agence Nationale de la Recherche (ANR 11 BS02 013).

REFERENCES

[1] D. H. Bailey. High-Precision Computation and Reproducibility. In Reproducibility in Computational and Experimental Mathematics, Providence, RI, USA, December 2012. http://www.davidhbailey.com/dhbtalks/dhb-icerm.pdf

[2] K.-H. Brassel. Advanced Object-Oriented Technologies in Modeling and Simulation: the VSEit Framework. In ESM 2001 - European Simulation Multiconference, pages 154–160, Prague, Czech Republic, June 2001.
[3] W.-F. Chiang, G. Gopalakrishnan, Z. Rakamaric, D. H. Ahn, and G. L. Lee. Determinism and Reproducibility in Large-Scale HPC Systems (7 pages). In WODET 2013 - 4th Workshop on Determinism and Correctness in Parallel Programming, Washington, USA, March 2013. http://wodet.cs.washington.edu/wp-content/uploads/2013/03/wodet2013-final12.pdf
[4] S. Fomel and J. F. Claerbout (editors). Reproducible Research. Special issue of Computing in Science & Engineering, 11(1), January/February 2009.
[5] J. Demmel and H. D. Nguyen. Fast Reproducible Floating-Point Summation. In ARITH 2013, pages 163–172, Austin, TX, USA, April 2013. http://www.eecs.berkeley.edu/~hdnguyen/public/papers/ARITH21 Fast Sum.pdf
[6] J. Demmel and H. D. Nguyen. Numerical Reproducibility and Accuracy at ExaScale. In ARITH 2013, Austin, TX, USA, April 2013. http://www.arithsymposium.org/papers/paper 33.pdf
[7] K. Diethelm. The Limits of Reproducibility in Numerical Simulation. Computing in Science & Engineering, 14(1):64–71, 2012.
[8] F. de Dinechin. De calculer juste à calculer au plus juste (in English except the title). In School on Accuracy and Reproducibility of Numerical Computations, Fréjus, France, March 2013. http://calcul.math.cnrs.fr/IMG/pdf/2013-Dinechin-PRCN.pdf
[9] J. Dongarra. Algorithmic and Software Challenges when Moving Towards Exascale. In ISUM 2013: 4th International Supercomputing Conference in Mexico, March 2013. http://www.netlib.org/utk/people/JackDongarra/SLIDES/isum0213.pdf
[10] J. Gosling. Sun Microsystems Laboratories - The First Ten Years - 1991–2001, chapter Java: an Overview. J. Treichel and M. Holzer, 1995.
[11] J. Gosling, B. Joy, G. Steele, G. Bracha, and A. Buckley. The Java Language Specification - Java SE 7 Edition. Addison-Wesley, 2013.
[12] W. D. Gropp. Issues in Accurate and Reliable Use of Parallel Computing in Numerical Programs. In Bo Einarsson, editor, Accuracy and Reliability in Scientific Computing, pages 253–263. SIAM, 2005.
[13] E. R. Hansen and G. W. Walster. Global Optimization Using Interval Analysis. 2nd edition, Marcel Dekker, 2003.
[14] Y. He and C. Q. Ding. Using Accurate Arithmetics to Improve Numerical Reproducibility and Stability in Parallel Applications. The Journal of Supercomputing, 18:259–277, 2001.
[15] Y. Hida, X. S. Li, and D. H. Bailey. Quad-Double computation package, 2003–2012. http://crd-legacy.lbl.gov/~dhbailey/mpdist/
[16] N. J. Higham. Accuracy and Stability of Numerical Algorithms. 2nd edition, SIAM, 2002.
[17] W. Hofschuster and W. Krämer. FI_LIB - A fast interval library (Version 1.2) in ANSI-C, 2005. http://www2.math.uni-wuppertal.de/~xsc/software/filib.html
[18] Institute of Electrical and Electronic Engineers and American National Standards Institute. IEEE standard for binary floating-point arithmetic. ANSI/IEEE Standard, Std 754-1985, New York, 1985.
[19] Institute of Electrical and Electronic Engineers. IEEE standard for floating-point arithmetic. IEEE Standard 754-2008, New York, 2008.
[20] W. Kahan and J. D. Darcy. How Java's floating-point hurts everyone everywhere. In ACM 1998 Workshop on Java for High-Performance Network Computing, 1998. http://www.cs.berkeley.edu/~wkahan/JAVAhurt.pdf
[21] T. Kauranne, J. Oinonen, S. Saarinen, O. Serimaa, and J. Hietaniemi. The operational HIRLAM 2 on parallel computers. In 6th Workshop on the Use of Parallel Processors in Meteorology, ECMWF, pages 63–74, November 1994.
[22] B. Krammer, K. Bidmon, M. S. Müller, and M. M. Resch. MARMOT: An MPI analysis and checking tool. Advances in Parallel Computing, 13:493–500, 2004.
[23] C. Lauter and V. Ménissier-Morain. There's no reliable computing without reliable access to rounding modes. In SCAN 2012 - Symposium on Scientific Computing, Computer Arithmetics and Verified Numerics, Russian Federation, pages 99–100, 2012.
[24] M. Leeser, J. Ramachandran, T. Wahl, and D. Yablonski. OpenCL Floating Point Software on Heterogeneous Architectures – Portable or Not? In Workshop on Numerical Software Verification (NSV), 2012. http://www.cs.rice.edu/~sc40/NSV/12/program.html
[25] X. Li, J. Demmel, D. Bailey, G. Henry, Y. Hida, J. Iskandar, W. Kahan, A. Kapur, M. Martin, T. Tung, and D. J. Yoo. Design, implementation and testing of extended and mixed precision BLAS. ACM Transactions on Mathematical Software, 28(2):152–205, 2002.
[26] E. McIntosh, F. Schmidt, and F. de Dinechin. Massive Tracking on Heterogeneous Platforms. In ICAP 2006, Chamonix, France, 2006.
[27] B. Marker, F. Van Zee, K. Goto, G. Quintana-Ortí, and R. van de Geijn. Toward scalable matrix multiply on multithreaded architectures. In Euro-Par 2007 Parallel Processing, pages 748–757, 2007.
[28] J. K. Martinsen and H. Grahn. A Methodology for Evaluating JavaScript Execution Behavior in Interactive Web Applications. In AICCSA 2011 - 9th IEEE/ACS International Conference on Computer Systems and Applications, pages 241–248, December 2011.
[29] R. Moore. Interval Analysis. Prentice Hall, Englewood Cliffs, 1966.
[30] R. Moore, R. B. Kearfott, and M. J. Cloud. Introduction to Interval Analysis. SIAM, 2009.
[31] J.-M. Muller et al. Handbook of Floating-Point Arithmetic. Birkhäuser, 2010.
[32] A. Neumaier. Interval Methods for Systems of Equations. Cambridge University Press, 1990.
[33] H. D. Nguyen. Efficient algorithms for verified scientific computing: Numerical linear algebra using interval arithmetic. PhD thesis, ENS de Lyon, January 2011. http://tel.archives-ouvertes.fr/tel-00680352/PDF/NGUYEN Hong - Diep 2011 These.pdf
[34] M. Philippsen. Is Java ready for computational science? In Euro-PDS'98 - 2nd European Parallel and Distributed Systems Conference, pages 299–304, Austria, 1998.
[35] M. Philippsen, R. F. Boisvert, V. S. Getov, R. Pozo, J. Moreira, D. Gannon, and G. C. Fox. JavaGrande – High Performance Computing with Java. In Applied Parallel Computing. New Paradigms for HPC in Industry and Academia, pages 20–36. Springer, 2001.
[36] N. Revol, Y. Denneulin, J.-F. Méhaut, and B. Planquelle. Parallelization of continuous verified global optimization. In 19th IFIP TC7 Conference on System Modelling and Optimization, Cambridge, United Kingdom, 1999.
[37] N. Revol, K. Makino, and M. Berz. Taylor models and floating-point arithmetic: proof that arithmetic operations are validated in COSY. Journal of Logic and Algebraic Programming, 64:135–154, 2005.
[38] N. Revol and P. Théveny. Parallel implementation of interval matrix multiplication. http://hal.inria.fr/hal-00801890, 2013.
[39] N. Revol. Tradeoffs between Accuracy and Efficiency for Interval Matrix Multiplication. In Numerical Software: Design, Analysis and Verification, Santander, Spain, July 2012.
[40] R. W. Robey, J. M. Robey, and R. Aulwes. In search of numerical consistency in parallel programming. Parallel Computing, 37:217–229, 2011.
[41] S. M. Rump. INTLAB - INTerval LABoratory. In Tibor Csendes, editor, Developments in Reliable Computing, pages 77–104. Kluwer Academic Publishers, Dordrecht, 1999.
[42] S. M. Rump. Computer-assisted Proofs and Self-validating Methods. In Bo Einarsson, editor, Accuracy and Reliability in Scientific Computing, pages 195–240. SIAM, 2005.
[43] S. M. Rump. Verification methods: Rigorous results using floating-point arithmetic. Acta Numerica, 19:287–449, 2010.
[44] S. M. Rump. Fast interval matrix multiplication. Numerical Algorithms, 61(1):1–34, 2012.
[45] V. Stodden, D. H. Bailey, J. Borwein, R. J. LeVeque, W. Rider, and W. Stein. Setting the Default to Reproducible – Reproducibility in Computational and Experimental Mathematics. Report on the ICERM workshop Reproducibility in Computational and Experimental Mathematics, December 2012. ICERM, Brown University, Providence, RI, USA, 2013. http://stodden.net/icerm report.pdf
[46] M. Taufer, O. Padron, P. Saponaro, and S. Patel. Improving numerical reproducibility and stability in large-scale numerical simulations on GPUs. In IEEE International Symposium on Parallel and Distributed Processing (IPDPS), pages 1–9, Atlanta, GA, USA, April 2010.
[47] G. K. Thiruvathukal. Java at Middle Age: Enabling Java for Computational Science. Computing in Science & Engineering, 2(1):74–84, 2002.
[48] F. Tisseur. Newton's method in floating point arithmetic and iterative refinement of generalized eigenvalue problems. SIAM Journal on Matrix Analysis and Applications, 22(4):1038–1057, 2001.
[49] R. Todd. Introduction to Conditional Numerical Reproducibility (CNR), 2012. http://software.intel.com/en-us/articles/introduction-to-the-conditional-numerical-reproducibility-cnr
[50] W. Tucker. Validated Numerics: A Short Introduction to Rigorous Computations. Princeton University Press, 2011.
[51] O. Villa, D. Chavarría-Miranda, V. Gurumoorthi, A. Márquez, and S. Krishnamoorthy. Effects of Floating-Point non-Associativity on Numerical Computations on Massively Multithreaded Systems. In Cray User Group, Atlanta, GA, USA, 2009. https://cug.org/5-publications/proceedings attendee lists/CUG09CD/S09 Proceedings/pages/authors/01-5Monday/4C-Villa/villa-paper.pdf

Philippe Théveny earned an MS degree in Computer Science from Université de Versailles Saint-Quentin-en-Yvelines (France) in 2011. He is now a PhD candidate at École Normale Supérieure de Lyon (France), under the supervision of Nathalie Revol.

Nathalie Revol received her PhD degree in applied mathematics from INPG (Grenoble, France) in 1994. She was an assistant professor at the University of Sciences and Technologies of Lille, France, between 1995 and 2002. Since then, she has been an INRIA researcher at the LIP laboratory, ENS de Lyon, France. Since 2008, she has co-chaired the IEEE P1788 working group for the standardization of interval arithmetic with R. B. Kearfott. Her research interests include interval arithmetic, floating-point arithmetic, and numerical quality.