
Sensitivities for Guiding Refinement in Arbitrary-Precision Arithmetic

by Jesse Michel

B.S., Massachusetts Institute of Technology (2019)

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Master of Engineering in Electrical Engineering and Computer Science at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY

May 2020

© Massachusetts Institute of Technology 2020. All rights reserved.

Author...... Department of Electrical Engineering and Computer Science May 18, 2020

Certified by ...... Michael Carbin Jamieson Career Development Assistant Professor of Electrical Engineering and Computer Science Thesis Supervisor

Accepted by...... Katrina LaCurts Chair, Master of Engineering Thesis Committee

Sensitivities for Guiding Refinement in Arbitrary-Precision Arithmetic by Jesse Michel

Submitted to the Department of Electrical Engineering and Computer Science on May 18, 2020, in partial fulfillment of the requirements for the degree of Master of Engineering in Electrical Engineering and Computer Science

Abstract

Programmers often develop and analyze numerical algorithms assuming that they operate on real numbers, but implementations generally use floating-point approximations. Arbitrary-precision arithmetic enables developers to write programs that operate over reals: given an output error bound, the program will produce a result within that bound. A key drawback of arbitrary-precision arithmetic is its speed. Fast implementations of arbitrary-precision arithmetic use interval arithmetic (which provides a lower and upper bound for all variables and expressions in a computation) computed at successively higher precisions until the result is within the error bound. Current approaches refine computations at precisions that increase uniformly across the computation rather than changing precisions per-variable or per-operator. This thesis proposes a novel definition and implementation of derivatives through interval code that I use to create a sensitivity analysis. I present and analyze the critical path algorithm, which uses sensitivities to guide precision refinements in the computation. Finally, I evaluate this approach empirically on sample programs and demonstrate its effectiveness.

Thesis Supervisor: Michael Carbin Title: Jamieson Career Development Assistant Professor of Electrical Engineering and Computer Science

Acknowledgments

I thank my advisor Michael Carbin. He helped guide the intuition and motivation that shaped this thesis and provided useful feedback and guidance on the experimental results. I would also like to thank Ben Sherman for helping to develop technical aspects of this thesis, for making the time to review my writing, and for his guidance throughout the research process. Alex Renda, Rogers Epstein, Stefan Grosser, and Nina Thacker provided useful feedback. I am grateful for the financial support that I have received from NSF grant CCF-1751011. I thank my parents, sisters, and extended family for their love and support and my nephew Joseph for being a shining light in my life.

Contents

1 Introduction 13
    1.1 Motivating example ...... 15
    1.2 Thesis ...... 17
    1.3 Outline ...... 17

2 Background on Interval Arithmetic 19
    2.1 Interval addition ...... 20
    2.2 Interval multiplication ...... 20
    2.3 Interval sine ...... 21
    2.4 Analysis ...... 23

3 Sensitivities for Precision Refinement 25
    3.1 A baseline schedule ...... 26
    3.2 Sensitivities from derivatives ...... 27
        3.2.1 Constructing sensitivities ...... 27
        3.2.2 Sensitivity as a derivative ...... 28
        3.2.3 Introducing a cost model ...... 29
    3.3 A schedule using sensitivities ...... 30
    3.4 Analysis ...... 31
        3.4.1 Uniform schedule ...... 32
        3.4.2 Critical path schedule ...... 32
        3.4.3 Cost-modeled schedule ...... 33
        3.4.4 A comparison of schedules ...... 34

4 Automatic Differentiation of Interval Arithmetic 37
    4.1 Introduction to automatic differentiation ...... 37
    4.2 Automatic differentiation on intervals ...... 38
        4.2.1 Derivative of interval addition ...... 39
        4.2.2 Derivative of interval multiplication ...... 40
        4.2.3 Derivative of interval sine ...... 41
    4.3 Analysis ...... 42

5 Results 45
    5.1 Schedules ...... 45
        5.1.1 Baseline schedule ...... 45
        5.1.2 Critical path schedule ...... 46
    5.2 Empirical comparison ...... 46
        5.2.1 Improving a configuration ...... 46
        5.2.2 Improving a schedule ...... 47
    5.3 Implementation ...... 48

6 Related Work 49
    6.1 Mixed-precision tuning and sensitivity analysis ...... 50
    6.2 Arbitrary-precision arithmetic ...... 50
        6.2.1 Pull-based approaches ...... 51
        6.2.2 Push-based approaches ...... 51

7 Discussion and Future Work 53
    7.1 Benchmarks ...... 53
    7.2 Further improving precision refinement ...... 54
        7.2.1 Per-primitive cost modeling ...... 54
        7.2.2 Unexplored trade-offs in precision refinement ...... 55
        7.2.3 Generalizing the critical path algorithm ...... 56
    7.3 New applications to experimental research ...... 56

8 Conclusions 57

List of Figures

1-1 Example of a uniform configuration ...... 15
1-2 Derivatives of the computation in Figure 1-1 ...... 16
1-3 Sensitivities of the computation in Figure 1-1 ...... 16
1-4 Example of a non-uniform configuration ...... 16

2-1 The four key monotonic regions for the definition of interval sine ...... 22
2-2 A simple Python implementation of interval sin ...... 23

3-1 Computation graph for theoretical analysis ...... 31

4-1 Reverse-mode automatic differentiation on intervals ...... 39
4-2 Interval addition with derivatives ...... 40

List of Tables

3.1 Theoretical comparison of schedules ...... 35

5.1 Comparison of precisions for configurations ...... 47
5.2 Comparison of error and time for configurations ...... 47

7.1 FPBench benchmark results...... 55

Chapter 1

Introduction

Floating-point computations can produce arbitrarily large errors. For example, Python implements the IEEE-754 standard, which produces the following behavior for 64-bit floating-point numbers:

>>> 1 + 1e17 - 1e17
0.0

The result of this computation is 0 instead of 1! This leads to an arbitrarily large error in results; for example, (1 + 1e17 − 1e17)푥 will always be 0 instead of 푥. Resilience to numerical-computing error is especially desirable for safety-critical software such as control systems for vehicles, medical equipment, and industrial plants, which are known to produce incorrect results because of numerical errors [12]. In contrast to floating-point arithmetic, arbitrary-precision arithmetic computes a result within a given error bound. Concretely, given the function 푦 = 푓(푥) and an error bound 휖, arbitrary-precision arithmetic produces a result 푦˜ such that

|푦˜ − 푦| < 휖.

Arbitrary-precision primitives It is necessary to use a data type that supports arbitrary rational numbers in order to refine to arbitrarily small error. I chose to use a multiple-precision floating-point representation implemented in MPFR [14]. To understand the representation, consider the example of representing $\pi$ to 5 mantissa bits:

$$11.001_2 = \underbrace{11001_2}_{\text{mantissa}} \times 2^{\overbrace{-3}^{\text{exponent}}}.$$

The exponent automatically adjusts as appropriate, so requesting 10 mantissa bits of precision results in

$$11.00100100_2 = 1100100100_2 \times 2^{-8}.$$

Since the exponent adjusts automatically, I focus on setting the number of mantissa bits for the variables and operators in the computation. For the rest of the thesis, bits of precision will denote mantissa bits.

Implementing arbitrary-precision arithmetic The push-based approach to implementing arbitrary-precision arithmetic sets the precisions at which to compute each variable and operator and computes the error in the output. It then refines results at increasingly high precisions until the result is within the given error bound. Each pass through the computation uses interval arithmetic, which computes error bounds by “pushing” bounds from the leaves of the computation graph up to the root. For example, assuming no error in addition, interval addition $\oplus$ works such that $[1, 2] \oplus [3, 4] = [4, 6]$.

More realistically, suppose that the functions $\underline{+}_p, \overline{+}_p : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ for bounded addition at precision $p$ are given. Then, $\underline{+}_p$ and $\overline{+}_p$ compute the lower and upper bound for adding inputs truncated to precision $p$. They satisfy the property that for all $a, b \in \mathbb{R}$, $(a \,\underline{+}_p\, b) \le (a + b) \le (a \,\overline{+}_p\, b)$, where $+$ is exact and where, as $p \to \infty$, the inequality becomes an equality. Assuming error in addition,

$$[1, 2] \oplus_p [3, 4] = [1 \,\underline{+}_p\, 3,\; 2 \,\overline{+}_p\, 4],$$

which will always have a lower bound ≤ 4 and an upper bound ≥ 6. Computing constants such as $\pi$ or $e$ makes the need for this type of approximation clearer since they require infinite space to represent exactly ($\pi$ and $e$ are transcendental). However, they are soundly computed using arbitrary-precision arithmetic.
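As a concrete illustration (my own sketch, not part of the thesis implementation), the bounded additions $\underline{+}_p$ and $\overline{+}_p$ can be realized with the bigfloat wrapper for MPFR by rounding toward $-\infty$ and $+\infty$ at a chosen mantissa precision:

from bigfloat import add, precision, RoundTowardNegative, RoundTowardPositive

def add_lower(a, b, p):
    # Underapproximation of a + b: round toward -infinity at p mantissa bits.
    return add(a, b, context=precision(p) + RoundTowardNegative)

def add_upper(a, b, p):
    # Overapproximation of a + b: round toward +infinity at p mantissa bits.
    return add(a, b, context=precision(p) + RoundTowardPositive)

# For the example from the start of this chapter, the true sum is bracketed:
# add_lower(1, 1e17, 24) <= 1 + 1e17 <= add_upper(1, 1e17, 24).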

1.1 Motivating example

Current push-based implementations refine precisions uniformly across the computation graph [30, 25]. Concretely, this means setting all variables and operators to the same preci- sion (e.g. 1 mantissa bit) and if the error bound is not satisfied, repeating the computation at a higher precision (e.g. 2 mantissa bits). This means that certain variables and operators are computed to a high precision even when they contribute little to the error – an inefficient allocation of compute resources. For example, consider computing

푒 + 1000휋

to a generous error bound of 500. Existing approaches refine precision uniformly across variables and operators [30, 25]. In the best-case scenario, these approaches require 5 mantissa bits for $\oplus$, $e$, $\otimes$, and $\pi$ (since $k$ is a constant, it remains at a fixed precision). Note that $\oplus$ and $\otimes$ are the addition and multiplication operators over intervals respectively, described in detail in Chapter 2. An example of this computation is shown in Figure 1-1.

[Figure 1-1 shows a computation graph: the root $\oplus$ evaluates to [3070, 3460]; its children are $e$ = [2.62, 2.75] and $\otimes$ = [3070, 3330]; the children of $\otimes$ are $k$ = [1000, 1000] and $\pi$ = [3.12, 3.25].]

Figure 1-1: The figure presents a computation graph evaluated at a uniform precision of 5 mantissa bits (except for the constant $k$) with an error of 3460 − 3070 = 390.

Suppose the approach to precision refinement is to start at a uniform precision of 1 mantissa bit, and then increment precisions (mantissa bits) until the error bound is satisfied. In this case, there are four refinement steps and the error bound of 500 will only be reached on the 4th refinement at 5 mantissa bits. I propose an approach that generates non-uniform precision assignments. To determine which vertices to refine to a higher precision, I introduce a novel sensitivity analysis that measures the infinitesimal change in the interval width of the output to an infinitesimal change to the interval width of each of the variables and operators in the computation. The

sensitivities are implemented with automatic differentiation through the interval code, which is novel as well. More explicitly, if the output is the interval $y = [\underline{y}, \overline{y}]$, then for each interval $x = [\underline{x}, \overline{x}]$, the derivatives will be $\left(\frac{\partial(\underline{y}-\overline{y})}{\partial \underline{x}}, \frac{\partial(\underline{y}-\overline{y})}{\partial \overline{x}}\right)$, as shown in Figure 1-2. Note that the parentheses in the figure denote pairs of numbers (tuples), not open intervals.

[Figure 1-2 shows the same computation graph annotated with derivative pairs: $\oplus$ (1, −1); $e$ (1, −1); $\otimes$ (1, −1); $k$ N/A; $\pi$ (1000, −1000).]

Figure 1-2: Derivatives of the computation in Figure 1-1.

The sensitivity is the difference between the derivative of the output with respect to the lower bound and the derivative of the output with respect to the upper bound, namely $\frac{\partial(\underline{y}-\overline{y})}{\partial \underline{x}} - \frac{\partial(\underline{y}-\overline{y})}{\partial \overline{x}}$. The resulting sensitivities are presented in Figure 1-3.

[Figure 1-3 shows the computation graph annotated with sensitivities: $\oplus$ 2; $e$ 2; $\otimes$ 2; $k$ N/A; $\pi$ 2000.]

Figure 1-3: Sensitivities are the derivative with respect to the lower bound minus the derivative with respect to the upper bound as shown in Figure 1-2.

The most sensitive vertex in the computation graph in Figure 1-3 is $\pi$ because 2000 is the largest sensitivity. The proposed technique identifies the critical path as the path from the root to the most sensitive vertex. In this case, the critical path is $\oplus \to \otimes \to \pi$. The resulting computation graph is shown in Figure 1-4.

[Figure 1-4 shows the refined computation graph: the root $\oplus$ = [3070, 3460]; its children are $e$ = [2.5, 3] and $\otimes$ = [3070, 3330]; the children of $\otimes$ are $k$ = [1000, 1000] and $\pi$ = [3.12, 3.25].]

Figure 1-4: Computation graph using 5 mantissa bits for $\oplus$, $\otimes$, and $\pi$, 3 mantissa bits for $e$, and not changing the constant $k$. The critical path is bolded.

Along the critical path, variables and

16 operators are incremented by 2 mantissa bits, while the remainder of the computation graph is incremented by 1. This is an instantiation of the critical path algorithm. In this case, the first configuration satisfying the error bound assigns 5 mantissa bits along the critical path and 3 bits to 푒. As 푘 becomes larger, approaches using uniform refinement techniques compute more and more decimal places of 푒 unnecessarily. The critical path algorithm can avoid this problem.

1.2 Thesis

In this thesis, I investigate ways to improve precision refinement in arbitrary-precision arithmetic. I define a novel sensitivity analysis in terms of derivatives computed through interval code. Using these sensitivities, I propose an algorithm – the critical path algorithm – that guides the refinement process of arbitrary-precision arithmetic. The sensitivities use derivatives computed with reverse-mode automatic differentiation through interval code, which is novel. I implement a system for performing arbitrary-precision arithmetic and demonstrate that the critical path algorithm can guide refinements to produce more accurate results with less computation on certain programs.

1.3 Outline

The thesis is structured as follows. In Chapter 2, I explain how interval arithmetic works and elaborate on the mathematical and implementation challenges. Then in Chapter 3, I present the current approach to implementing arbitrary-precision arithmetic using interval arithmetic and show how it may be improved assuming derivatives of interval code can be efficiently computed. I describe the approach to efficient derivative computation in Chapter 4. Next, I present empirical results using the proposed approach to precision refinement in arbitrary-precision arithmetic in Chapter 5. I discuss some related work in Chapter 6 and finally present a discussion in Chapter 7 and conclusions in Chapter 8. The open-source implementation is available at https://github.com/psg-mit/fast_reals.

Chapter 2

Background on Interval Arithmetic

This chapter provides a brief introduction to interval arithmetic. I describe the interval operations for addition, multiplication, and sine. I also provide code to elucidate the underlying implementation and provide an analysis of some of the properties that arise from using interval arithmetic.

Interval arithmetic is a method of computing that provides a bound on output error, useful in the implementation of push-based arbitrary-precision arithmetic. For a more thorough treatment of interval arithmetic, including an analysis of correctness, totality, closedness, optimality, and efficiency, see [18].

An interval version of $f : \mathbb{R}^n \to \mathbb{R}^m$ will take $n$ intervals as an input $\vec{x} \in (\mathbb{R}^2)^n$ and produce lower and upper bounds on each of the outputs. Thus, the interval arithmetic computation of $f$ will be $f' : \mathbb{R}^{2n} \to \mathbb{R}^{2m}$, where $f'(\vec{x})$ produces $m$ intervals $[\underline{f'(\vec{x})_i}, \overline{f'(\vec{x})_i}]$ such that

$$\underline{f'(\vec{x})_i} \le f(\vec{x})_i \le \overline{f'(\vec{x})_i}$$

for each $i = 1, 2, \ldots, m$. To achieve this, I take a compositional approach by converting each operation in $f$ to a version over intervals (that takes intervals as input and produces intervals as output).

2.1 Interval addition

In this section, I show how to implement interval addition given access to primitives provided in a number of libraries such as MPFR [14]. Assume that the functions $\underline{+}_p, \overline{+}_p : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ that compute error-bounded addition at precision $p$ are given and satisfy the property that $(a \,\underline{+}_p\, b) \le (a + b) \le (a \,\overline{+}_p\, b)$, where $+$ is exact and such that, in the limit as $p \to \infty$, the inequality becomes an equality. The addition operator over intervals at precision $p$ is $\oplus_p : \mathbb{R}^2 \times \mathbb{R}^2 \to \mathbb{R}^2$ and has the following behavior: given the two intervals

$$i_1 = [\underline{i_1}, \overline{i_1}] \quad \text{and} \quad i_2 = [\underline{i_2}, \overline{i_2}],$$

it computes the sum

$$i_1 \oplus_p i_2 = [\underline{i_1} \,\underline{+}_p\, \underline{i_2},\; \overline{i_1} \,\overline{+}_p\, \overline{i_2}].$$

This is correct because there is a precondition that $i_1, i_2$ are valid intervals (i.e. $\underline{i_1} \le \overline{i_1}$ and $\underline{i_2} \le \overline{i_2}$) and addition is monotonic increasing. Thus, the minimum and maximum possible values of the sum are the lower and upper bounds given.
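A minimal sketch of $\oplus_p$ in Python (mine, not the thesis code), assuming the bigfloat wrapper for MPFR used elsewhere in the thesis; the lower bound is rounded toward $-\infty$ and the upper bound toward $+\infty$ so that the true sum is always enclosed:

from bigfloat import add, precision, RoundTowardNegative, RoundTowardPositive

def interval_add(i1, i2, p):
    # i1 = [lo1, hi1] and i2 = [lo2, hi2] are valid intervals (lo <= hi).
    lo1, hi1 = i1
    lo2, hi2 = i2
    lo = add(lo1, lo2, context=precision(p) + RoundTowardNegative)
    hi = add(hi1, hi2, context=precision(p) + RoundTowardPositive)
    return [lo, hi]

# Example from Chapter 1: interval_add([1, 2], [3, 4], 5) encloses [4, 6].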

2.2 Interval multiplication

Implementing interval multiplication is a little more nuanced because multiplication over the reals is neither monotonic increasing nor monotonic decreasing. For example, −5 × −5 = 25 and −4 × −5 = 20, so increasing an argument may decrease the output. On the other hand, 5 × 5 = 25 and 5 × 6 = 30, so increasing an argument may increase the output. It is possible to regain monotonicity by partitioning the reals into the negative $\mathbb{R}_-$ and non-negative $\mathbb{R}_+$. Kaucher multiplication is an algorithm that takes advantage of this structure [22].

I present a simpler, but potentially less efficient, algorithm. Assume that the functions $\underline{\times}_p, \overline{\times}_p : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ that compute error-bounded multiplication at precision $p$ are given and satisfy the property that $(a \,\underline{\times}_p\, b) \le (a \times b) \le (a \,\overline{\times}_p\, b)$, where $\times$ is exact and such that, in the limit as $p \to \infty$, the inequality becomes an equality. Given the two intervals

$$i_1 = [\underline{i_1}, \overline{i_1}] \quad \text{and} \quad i_2 = [\underline{i_2}, \overline{i_2}],$$

the product at precision $p$ on intervals, $\otimes_p : \mathbb{R}^2 \times \mathbb{R}^2 \to \mathbb{R}^2$, is

$$i_1 \otimes_p i_2 = [\min \underline{S}, \max \overline{S}]$$

where $\underline{S} = \{\underline{i_1} \,\underline{\times}_p\, \underline{i_2},\; \underline{i_1} \,\underline{\times}_p\, \overline{i_2},\; \overline{i_1} \,\underline{\times}_p\, \underline{i_2},\; \overline{i_1} \,\underline{\times}_p\, \overline{i_2}\}$ is the set of lower bounds of each of the pairwise products and $\overline{S} = \{\underline{i_1} \,\overline{\times}_p\, \underline{i_2},\; \underline{i_1} \,\overline{\times}_p\, \overline{i_2},\; \overline{i_1} \,\overline{\times}_p\, \underline{i_2},\; \overline{i_1} \,\overline{\times}_p\, \overline{i_2}\}$ is the set of upper bounds of each of the pairwise products. The correctness proof is provided in Section 4.6 of [18].
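A corresponding sketch of $\otimes_p$ (again mine, under the same bigfloat assumption): the four pairwise products are computed twice, once rounded down for the candidate lower bounds and once rounded up for the candidate upper bounds:

from bigfloat import mul, precision, RoundTowardNegative, RoundTowardPositive

def interval_mul(i1, i2, p):
    lo1, hi1 = i1
    lo2, hi2 = i2
    down = precision(p) + RoundTowardNegative
    up = precision(p) + RoundTowardPositive
    # Lower bounds of the pairwise products (the set S with a lower bar) ...
    los = [mul(a, b, context=down) for a in (lo1, hi1) for b in (lo2, hi2)]
    # ... and upper bounds of the pairwise products (S with an upper bar).
    his = [mul(a, b, context=up) for a in (lo1, hi1) for b in (lo2, hi2)]
    return [min(los), max(his)]

# interval_mul([-1, 2], [-4, 1], 53) encloses [-8, 4] (cf. Example 1 in Chapter 4).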

2.3 Interval sine

Even more difficult is the computation of interval $\sin_p : \mathbb{R}^2 \to \mathbb{R}^2$. The contributions in this section are, to my knowledge, novel, but are not core to the thesis as a whole. This section serves the purpose of introducing some of the relevant challenges of implementing interval arithmetic.

Sine is periodic and composed of monotonic segments. I approach computing interval sine by cases with respect to these segments and identify which region or regions the bounds lie within. Figure 2-1 depicts the four monotonic segments key to the implementation of interval sine.

More formally, let $x \in \mathbb{R}^2$ be given. In the implementation, if $\overline{x} - \underline{x}$ is large (greater than 3), then the output range will be the range of sine (i.e. [−1, 1]). Otherwise, consider the following cases: $x \subset (\text{I} \cup \text{IV})$, $x \subset (\text{II} \cup \text{III})$, $x \subset (\text{I} \cup \text{II})$, and $x \subset (\text{III} \cup \text{IV})$. I assume access to functions $\underline{\sin}_p, \overline{\sin}_p : \mathbb{R} \to \mathbb{R}$, where $p$ is the precision of the result, defined such that $\underline{\sin}_p(x) \le \sin(x) \le \overline{\sin}_p(x)$. These functions are provided in MPFR [14]. In the first case, where $x \subset (\text{I} \cup \text{IV})$, the interval is monotonic increasing and the result is the under-approximation of the sine of the lower bound of the input and the over-approximation of the upper bound of the input. Similar reasoning applies to the other cases (assuming that $\overline{x} - \underline{x} < \pi$).

21 Figure 2-1: The four key monotonic regions for the definition of interval sine.

The cases are represented in the equation below:

$$\sin_p(x) = \begin{cases} [\underline{\sin}_p \underline{x},\; \overline{\sin}_p \overline{x}], & \text{for } x \subset (\text{I} \cup \text{IV}) \\ [\underline{\sin}_p \overline{x},\; \overline{\sin}_p \underline{x}], & \text{for } x \subset (\text{II} \cup \text{III}) \\ [\min(\underline{\sin}_p \underline{x},\; \underline{\sin}_p \overline{x}),\; 1], & \text{for } x \subset (\text{I} \cup \text{II}) \\ [-1,\; \max(\overline{\sin}_p \underline{x},\; \overline{\sin}_p \overline{x})], & \text{for } x \subset (\text{III} \cup \text{IV}) \end{cases} \tag{2.1}$$

The Python implementation in Figure 2-2 surfaces a few details that I did not specify in the mathematical presentation. For example, it shows how to identify the monotonic regions labeled in Figure 2-1 using cosine. I use the bigfloat Python wrapper for the MPFR library to compute $\underline{\sin}_p, \overline{\sin}_p : \mathbb{R} \to \mathbb{R}$ [14]. I check that the width of the interval $x$ is less than 3 rather than $\pi$ because the contract of interval arithmetic allows for over-estimation (loose bounds) and in practice, the bounds are generally tight intervals (with width much less than 3). Also note that the cosine is used to identify the various regions by their slope, avoiding modular arithmetic and significantly simplifying the implementation when compared with other approaches ([2]).

def interval_sin(interval, lower_at_p, upper_at_p):
    lower, upper = interval

    # Computes the lower or upper bound at a given precision
    lp, up = lower_at_p, upper_at_p

    # Start at the range of sine
    out_lower, out_upper = -1, 1

    if sub(upper, lower, up) < 3:
        # Signs of derivatives identify monotonic regions
        if (cos(lower, lp) >= 0) and (cos(upper, lp) >= 0):
            out_lower, out_upper = sin(lower, lp), sin(upper, up)

        elif (cos(lower, up) <= 0) and (cos(upper, up) <= 0):
            out_lower, out_upper = sin(upper, lp), sin(lower, up)

        elif (cos(lower, lp) >= 0) and (cos(upper, up) <= 0):
            out_lower = min(sin(lower, lp), sin(upper, lp))

        elif (cos(lower, up) <= 0) and (cos(upper, lp) >= 0):
            out_upper = max(sin(lower, up), sin(upper, up))

    return [out_lower, out_upper]

Figure 2-2: A simple Python implementation of interval sin.

For example, their implementation requires that $\pi$ is computed to higher precisions to compute sine at higher precisions for certain inputs. Furthermore, the cosine computation can be reused when implementing the derivative of $\sin_p$, which is both convenient and efficient (see Section 4.2.3 for more details).

2.4 Analysis

I will briefly reflect on some of the properties of interval arithmetic with these operators. Applying an interval operator is sound if for every input interval, the output interval contains the result. Since any operator can be implemented soundly by returning [−∞, ∞], we need a condition on tightness. A precision-parameterized function on intervals (e.g., $\oplus_p$) is tight if, for every input interval, in the limit as the precision approaches infinity, the width approaches the width when the computation is exact (error-free) over the reals.

I analyze these properties with respect to $\oplus_p$, $\otimes_p$, and $\sin_p$. $\oplus_p$ is sound and tight because it acts element-wise on each of the input intervals with $\underline{+}_p, \overline{+}_p : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$. Similarly, $\otimes_p$ is sound and tight. The implementation of $\sin_p$ can be both sound and tight [2]. However, the implementation provided is sound, but not tight, because it returns [−1, 1] for input intervals with width greater than 3. It is tight for intervals narrower than 3. In the high-precision limit for arbitrary-precision arithmetic, the input interval widths tend to 0 and thus $\sin_p$ is tight.
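To make tightness concrete, the following sketch (mine; it assumes bigfloat's const_pi and exp) encloses $\pi + e$ at increasing precisions and shows the interval width shrinking toward the exact width of zero:

from bigfloat import add, exp, const_pi, precision, RoundTowardNegative, RoundTowardPositive

def pi_plus_e(p):
    down = precision(p) + RoundTowardNegative
    up = precision(p) + RoundTowardPositive
    lo = add(const_pi(context=down), exp(1, context=down), context=down)
    hi = add(const_pi(context=up), exp(1, context=up), context=up)
    return lo, hi

for p in (5, 10, 20, 40):
    lo, hi = pi_plus_e(p)
    # The width shrinks roughly like 2**-p as the precision p grows.
    print(p, hi - lo)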

Chapter 3

Sensitivities for Precision Refinement

The push-based approach to implementing arbitrary-precision arithmetic relies upon guessing appropriate precisions for all of the variables and operations in a computation and then checking whether the error of the result is small enough. If it is not, precisions must be increased, or refined, until the error bound is satisfied. In this chapter, I present an established approach to precision refinement and present a novel algorithm – the critical path algorithm – to improve precision refinement. I also provide an analysis of the convergence rate of different schedules operating on a particular class of computations. At a high level, the critical path algorithm guides precisions of the variables and operators in a computation using a heuristic. I represent a computation as its underlying computation graph (e.g. Figure 1-1). A computation graph is a directed acyclic graph consisting of vertices 푉 (the variables and operators in the computation) and edges such that a vertex

푣1 has a directed edge to 푣2 if and only if 푣1 is an operator and 푣2 is one of its arguments. Thus, the leaves will always be variables or constants, and the non-leaf nodes will always be operators. A forward pass on a computation graph performs operations from the leaves to the root following the precisions at each vertex. To more easily refer to parts of this push-based computation, I introduce the following terminology:

Definition 3.0.1. A configuration 퐶 : 푉 → N maps vertices in the computation graph to precisions such that a variable 푣 ∈ 푉 is represented using 퐶푣 bits of precision.

25 Definition 3.0.2. A schedule is a sequence of configurations 푆 : N → (푉 → N) such that 푛 ↦→ 푆(푛) where 푆(푛) is the 푛th configuration.

Each forward pass computes with respect to a configuration and a push-based computation will follow a schedule – computing with respect to the successive configurations $(S^{(i)})_{i \in \mathbb{N}}$ until the result lies within the error bounds. In general, schedules in push-based computations will produce configurations that assign variables to increasingly high precisions ($S_v^{(k)} < S_v^{(k+1)}$ for all $k \in \mathbb{N}$), leading to a monotonically decreasing error on commonly occurring computations.

3.1 A baseline schedule

I begin by considering a baseline that generalizes the schedule proposed by iRRAM [30]. This schedule computes the function 푓(푥) by setting the (푘 + 1)th configuration as

$$S_v^{(k+1)} = S_v^{(k)} + ab^k \tag{3.1}$$

where $S^{(0)}$, $a$, $b$ are parameters that define the behavior of the schedule. Notice that the precisions grow exponentially. I present this schedule simply because it is used in iRRAM, one of the fastest arbitrary-precision arithmetic libraries [5]. There are fundamental trade-offs between different choices of configurations, with the central concerns being: (1) overshooting – when the final configuration is at an unnecessarily high precision – and (2) undershooting – requiring too many forward passes to converge (i.e. the error bound is satisfied when $k$ in $S^{(k)}$ is large).
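A sketch of this baseline schedule as a Python generator (my own rendering of Equation 3.1, not iRRAM's code); a configuration is a dict from vertices to mantissa precisions, the vertex names below are placeholders, and rounding the increment to a whole number of bits is an assumption the equation leaves implicit:

def uniform_schedule(vertices, s0, a, b):
    # Yields S(1), S(2), ...: every vertex receives the same increment a * b**k.
    config = {v: s0 for v in vertices}
    k = 0
    while True:
        config = {v: p + max(1, int(a * b ** k)) for v, p in config.items()}
        k += 1
        yield dict(config)

# Example: the uniform refinement of Figure 1-1 corresponds to
# uniform_schedule({'add', 'mul', 'e', 'pi'}, s0=0, a=1, b=1).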

The problem For the schedule in Equation 3.1, the final configuration $S^{(k)}$ that satisfies the given error bound assigns each variable and operation to the same precision; this assignment is rarely optimal and may be far from it. Chapter 1 provides a worked example of a case where uniform refinement is suboptimal and benefits from setting different variables and operations to different precisions. Although it may be possible to compute the necessary precisions optimally by hand (at least for simple cases), an optimal, automated approach to

26 precision refinement would require perfectly modeling floating-point error, which has evaded researchers. I choose to take a heuristic approach. Heuristics may be used to guide schedules to configurations satisfying error bounds, while minimizing the amount of total computation (the sum of all of the compute, generally measured in time, required for all of the configu- rations run). A good heuristic is fast to compute and guides the computation quickly to a configuration respecting the given error bound without overshooting or undershooting.

3.2 Sensitivities from derivatives

In this section, I describe the novel algorithm to compute sensitivities assuming that deriva- tives of the interval code are already provided. The sensitivities provide a measure of the amount of change in the output interval width from a change to the input interval width. In Chapter 4, I demonstrate that derivatives of interval code can be computed efficiently using automatic differentiation.

3.2.1 Constructing sensitivities

I now present a sensitivity analysis that is a key contribution of this thesis. My construction of sensitivities of interval computations assumes correctly computed derivatives are already provided. I will detail the implementation of these derivatives in the following chapter. Running automatic differentiation on the computation graph of an interval arithmetic expression produces 4 partial derivatives for each $v \in V$. In particular, for vertex $v_x$ corresponding to the input interval $x$, if the function $f$ has the output $y = f(x)$, then the change in the output $y = [\underline{y}, \overline{y}]$ with respect to a change in the interval $x = [\underline{x}, \overline{x}]$ is

$$\frac{\partial \underline{y}}{\partial \underline{x}},\; \frac{\partial \underline{y}}{\partial \overline{x}},\; \frac{\partial \overline{y}}{\partial \underline{x}},\; \frac{\partial \overline{y}}{\partial \overline{x}}. \tag{3.2}$$

For example, $\frac{\partial \underline{y}}{\partial \underline{x}}$ is an intuitive answer to the question “what will be the change in the lower bound of the output given a small increase in the lower bound of $v_x$?” Increasing the precision at which $v_x$ is computed decreases the width of the output interval, and thus,

$$\frac{\partial \underline{y}}{\partial \underline{x}},\; \frac{\partial \overline{y}}{\partial \overline{x}} \ge 0, \qquad \frac{\partial \overline{y}}{\partial \underline{x}},\; \frac{\partial \underline{y}}{\partial \overline{x}} \le 0.$$

This leads to a natural definition of sensitivity that is one of the core contributions of this thesis. I define the sensitivity of $v_x$ with respect to a decrease in the width of $x$ as:

$$\operatorname{sens}(v_x) = \frac{\partial \underline{y}}{\partial \underline{x}} + \frac{\partial \overline{y}}{\partial \overline{x}} - \frac{\partial \overline{y}}{\partial \underline{x}} - \frac{\partial \underline{y}}{\partial \overline{x}}. \tag{3.3}$$

Implicitly, this formulation of sensitivity asserts that it is just as important to increase the lower bound as it is to decrease the upper bound because all of the coefficients on the derivatives have the same (unit) magnitude. Also note that sens(푣푥) ≥ 0 because of the previous inequalities.

3.2.2 Sensitivity as a derivative

I will now build a function that explicitly relates a change in the width of the interval corresponding to a vertex in the computation graph to the width of the output interval, giving a scalar-valued function whose derivative determines sensitivity. Formally, the sensitivities are the derivative of the composition of functions that take the derivative $(Df) : \mathbb{R}^2 \to \mathbb{R}^{2 \times 2}$ of an interval-valued function $f : \mathbb{R}^2 \to \mathbb{R}^2$ and transform it in terms of its directional derivatives. The goal is to understand the decrease in the output interval width as a result of decreasing the input interval width. This means directing perturbations in a positive direction for lower bounds and a negative direction for upper bounds. The function with respect to a specific input $x$ satisfying these properties is:

$$z_x(t) := (g \circ f \circ h_x)(t) \tag{3.4}$$

where $g : \mathbb{R}^2 \to \mathbb{R}$ and $h_x : \mathbb{R} \to \mathbb{R}^2$ with $g(x) = \underline{x} - \overline{x}$ and $h_x(t) = (\underline{x} + t, \overline{x} - t)$. In words, $g$ computes the (negative) width of an interval, and $h_x$ symmetrically decreases the width of the interval $x$ by $t$. If $y = f(x)$, the derivative is:

$$(Df)_x = \begin{bmatrix} \dfrac{\partial \underline{y}}{\partial \underline{x}} & \dfrac{\partial \underline{y}}{\partial \overline{x}} \\[6pt] \dfrac{\partial \overline{y}}{\partial \underline{x}} & \dfrac{\partial \overline{y}}{\partial \overline{x}} \end{bmatrix} \tag{3.5}$$

and the derivative of 푧푥(푡) is:

$$\frac{dz_x}{dt} = \begin{bmatrix} 1 & -1 \end{bmatrix} \begin{bmatrix} \dfrac{\partial \underline{y}}{\partial \underline{x}} & \dfrac{\partial \underline{y}}{\partial \overline{x}} \\[6pt] \dfrac{\partial \overline{y}}{\partial \underline{x}} & \dfrac{\partial \overline{y}}{\partial \overline{x}} \end{bmatrix} \begin{bmatrix} 1 \\ -1 \end{bmatrix}. \tag{3.6}$$

This evaluates exactly to the proposed sensitivity, meaning that if 푥 corresponds to the

vertex 푣푥 in the computation graph,

$$\frac{dz_x}{dt} = \operatorname{sens}(v_x).$$
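As a sanity check (a sketch of mine using numpy, not part of the thesis), evaluating Equation 3.6 on the Jacobian of the output with respect to the interval for $\pi$ in the Chapter 1 example recovers the sensitivity of 2000 reported in Figure 1-3:

import numpy as np

def sensitivity(jacobian):
    # Equation 3.6: sens(v_x) = [1, -1] (Df)_x [1, -1]^T
    return np.array([1.0, -1.0]) @ jacobian @ np.array([1.0, -1.0])

# Jacobian of y = e + 1000*pi with respect to the interval for pi
# (rows: lower/upper bound of y; columns: lower/upper bound of pi).
d_pi = np.array([[1000.0, 0.0],
                 [0.0, 1000.0]])
print(sensitivity(d_pi))  # 2000.0, matching Figure 1-3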

3.2.3 Introducing a cost model

The proposed sensitivity analysis (in Equation 3.3) implicitly assumes that decreasing the width of an interval by an infinitesimal amount $\delta$ is just as costly when the interval width is 100 as when the current interval width is 0.01. This assumption is often inaccurate. For example, computing the first $n$ digits of $\pi$ using the Bailey–Borwein–Plouffe algorithm has a computational complexity that is $O(n \log^3(n))$ [3]. Incorporating the property that it requires more computational cost to refine narrower intervals may help to encourage cost-efficient configurations for computations using sensitivities to guide refinement.

The cost-dependent sensitivity analysis for the vertex $v_x$ in the computation graph corresponding to $x$ is

$$\operatorname{sens}'(v_x) = \begin{bmatrix} c_1(x) & c_2(x) & c_3(x) & c_4(x) \end{bmatrix} \begin{bmatrix} \dfrac{\partial \underline{y}}{\partial \underline{x}} \\[4pt] \dfrac{\partial \overline{y}}{\partial \overline{x}} \\[4pt] -\dfrac{\partial \overline{y}}{\partial \underline{x}} \\[4pt] -\dfrac{\partial \underline{y}}{\partial \overline{x}} \end{bmatrix}. \tag{3.7}$$

I provide a theoretical analysis of the cost function $c$ where $c_i(x) = \overline{x} - \underline{x}$ for $i = 1, 2, 3, 4$ in Section 3.4. This allows for per-operator cost functions that can model the difficulty of refining different parts of the compute graph, which I expand upon in Chapter 7.

3.3 A schedule using sensitivities

In the previous section, I gave a formal definition of sensitivities, and I now explore how these sensitivities may be incorporated into schedules to produce faster computations. To describe these schedules, I introduce the following terminology:

Definition 3.3.1. The most sensitive vertex is the vertex 푣 ∈ 푉 such that

$$v = \operatorname*{arg\,max}_{w \in V} \operatorname{sens}_k(w),$$

where sens푘 is the sensitivity (as defined in Equation 3.3) of the given program evaluated at the 푘th configuration.

Definition 3.3.2. The critical path 푃 (푘) is the path from the most sensitive vertex 푣 (with ties broken arbitrarily) to the root for computation evaluated at the configuration 퐶(푘).

Note that the sensitivities may change throughout the course of the computation due to changes in values. Thus, the most sensitive vertex and critical path are parameterized by the configuration. Armed with this terminology, defining the schedule is quite straightforward. At each iteration, the configuration is refined by a larger increment along the critical path than it is in the rest of the computation. Explicitly, I define this schedule as

$$S'^{(k+1)}_v = \begin{cases} S'^{(k)}_v + a_1 b_1^k & \text{if } v \in P^{(k)} \\ S'^{(k)}_v + a_2 b_2^k & \text{otherwise} \end{cases} \tag{3.8}$$

where $S'^{(0)}$, $a_1$, $b_1$, $a_2$, $b_2$ dictate the behavior of the schedule. I call this the critical path algorithm for precision refinement. In Chapter 5, I experimentally compare the baseline (iRRAM) schedule and the proposed schedule.
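A sketch of one refinement step of this schedule (mine, not the released implementation); it assumes a sens function implementing Equation 3.3 over the refinable vertices and a parent function giving each vertex's consumer in the computation graph (None at the root), so the critical path can be walked up to the root:

def critical_path(vertices, sens, parent):
    # Path from the most sensitive vertex (ties broken arbitrarily) to the root.
    v = max(vertices, key=sens)
    path = set()
    while v is not None:
        path.add(v)
        v = parent(v)
    return path

def refine(config, path, k, a1, b1, a2, b2):
    # Equation 3.8: a larger increment along the critical path than elsewhere.
    return {v: p + int(a1 * b1 ** k if v in path else a2 * b2 ** k)
            for v, p in config.items()}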

3.4 Analysis

In this section, I compare the asymptotic behavior and theoretical properties of the uniform, critical path, and cost-modeled schedules. Although the empirical results use multiple-precision floating point, I use fixed point for the theoretical results because it is easier to analyze. In particular, I consider numbers in the range [0, 1] in the form of fixed-point binary numbers

$$x = \sum_{i=1}^{\infty} 2^{-i} b_i,$$

where $b_1, b_2, \ldots \in \{0, 1\}$. Consider a computation of the form:

$$y = \sum_{i=1}^{n} a_i x_i,$$

where $a_i$ is a constant and $x_i \in [0, 1]$ for all $i$. In this case, the sensitivities will be 2 for all of the $\oplus$ and $\otimes$ operators, $a_i$ for each $x_i$, and not applicable for constants (because they are assumed to be binary rationals at infinite precision, e.g. $a_1, a_2, \ldots, a_n$). I explore the case where $a_i = 2^{-i}$. Figure 3-1 presents the computation graph of $y$.

[Figure 3-1 shows a computation graph: a root $\oplus$ summing $n$ products $\otimes$, where the $i$th product has children $a_i$ and $x_i$.]

Figure 3-1: Example computation graph of a family of computations that can benefit from the critical path algorithm. The critical path, which remains the same for all iterations, is bolded for $a_i = 2^{-i}$.

I now introduce notation and properties that are useful in the analysis of different schedules. Let $[n]$ denote the set $\{1, 2, \ldots, n\}$.

Definition 3.4.1. Let $p^{(k)}_{x_i}$ denote the precision of the variable $x_i$ on the $k$th iteration of a given schedule.

Definition 3.4.2. Let $w^{(k)}$ denote the width of the output (the error) on the $k$th iteration of a given schedule.

The properties below are simple, but useful to refer to in the analysis that follows.

Fact 1. The largest width possible with $k$ bits of precision is $2^{-k}$.

Fact 2. Given a finite geometric series where the ratio between consecutive terms is $\frac{1}{2}$, the sum is $\sum_{i=j}^{n} t_i = 2t_j\left(1 - 2^{j-n-1}\right)$.

3.4.1 Uniform schedule

Assume that each refinement increments the precisions for each of the vertices in the computation graph by 1. Formally, this means that for every $i \in [n]$,

$$p^{(k)}_{x_i} = k.$$

Since the error for each $x_i$ is the same, and is $\frac{1}{2^k}$ (by Fact 1), it may be factored out of the summation. The other term contributing to the error is $\sum_{i=1}^{n} a_i = 1 - \frac{1}{2^n}$ (by Fact 2). Therefore, the width of the output when using uniform refinement is

$$w_u^{(k)} = \frac{1}{2^k}\left(1 - \frac{1}{2^n}\right). \tag{3.9}$$

3.4.2 Critical path schedule

Figure 3-1 shows the critical path of the computation. Since the derivative for each of the bounds of $x_i$ is $a_i$, the most sensitive vertex (the one with the largest derivative) is $x_1$ for every refinement. I use a schedule where the configuration is incremented by 2 along the critical path and by 1 everywhere else in the computation graph. As a result, if the first refinement sets every variable and operator to one bit of precision, the precisions will follow the equations:

$$p^{(k)}_{x_1} = 2k - 1, \qquad p^{(k)}_{x_i} = k \quad \text{for } i \ge 2.$$

Again, by Fact 1, the widths of the intervals are $w^{(k)}_{x_1} = \frac{1}{2^{2k-1}}$ and $w^{(k)}_{x_i} = \frac{1}{2^k}$ for $i \ge 2$. Computing the output interval width is then a matter of combining these terms with the corresponding coefficients

$a_i = 2^{-i}$, giving rise to the formula:

$$w_p^{(k)} = \frac{1}{2^{2k}} + \frac{1}{2^k}\sum_{i=2}^{n} \frac{1}{2^i}.$$

By Fact 2, it is straightforward to see that the width is

$$w_p^{(k)} = \frac{1}{2^{2k}} + \frac{1}{2^{k+1}}\left(1 - \frac{1}{2^{n-1}}\right). \tag{3.10}$$

The result can be proven formally by induction.

3.4.3 Cost-modeled schedule

I analyze a schedule that uses the critical path algorithm with cost-modeled sensitivities, and I call this the cost-modeled schedule. I define the cost-aware sensitivity as

$$\operatorname{sens}'(v_x) = \operatorname{sens}(v_x)\,(\overline{x} - \underline{x}),$$

where “sens” is defined in Equation 3.3. The sensitivities $\operatorname{sens}'$ are a special case of the cost model presented in Section 3.2.3. The most sensitive vertex is the one with the largest product of the sensitivity and the interval width. I use the same schedule as in the previous algorithm, where for each refinement, the configuration is incremented by 2 along the critical path and by 1 everywhere else in the computation graph. Let $T_l = \frac{l(l+1)}{2}$ be the $l$th triangular number. Intuitively, the refinement will proceed as follows:

1. The computation begins at precision $p^{(1)}_{x_i} = 1$ for all $i$. The most sensitive vertex is $x_1$.

2. After one refinement, the precisions are $p^{(2)}_{x_1} = 3$ and $p^{(2)}_{x_{i \ne 1}} = 2$. $x_1$ and $x_2$ are equally sensitive.

3. After two additional refinement steps, the precisions are $p^{(4)}_{x_1} = 6$, $p^{(4)}_{x_2} = 5$, and $p^{(4)}_{x_{i \notin \{1,2\}}} = 4$. $x_1$, $x_2$, and $x_3$ are all equally sensitive.

4. After three additional refinement steps, the precisions are $p^{(7)}_{x_1} = 10$, $p^{(7)}_{x_2} = 9$, $p^{(7)}_{x_3} = 8$, and $p^{(7)}_{x_{i \notin \{1,2,3\}}} = 7$. $x_1$, $x_2$, $x_3$, and $x_4$ are all equally sensitive.

...

k. Assuming $k \le n$, after $k$ additional refinement steps, the precisions are $p^{(1+T_l)}_{x_1} = T_{l+1}$, $p^{(1+T_l)}_{x_2} = T_{l+1} - 1$, ..., $p^{(1+T_l)}_{x_{i \notin [l]}} = k$, where $l$ is defined so that $k = T_l + 1$. $x_1, x_2, \ldots, x_{l+1}$ are all equally sensitive.

Using these observations and applying Fact 1 and Fact 2, the formula is:

$$w_c^{(T_l+1)} = \frac{l}{2^{T_{l+1}+1}} + \frac{1}{2^{T_l+l+1}}\left(1 - \frac{1}{2^{n-l}}\right), \tag{3.11}$$

which can be confirmed using induction.

3.4.4 A comparison of schedules

In this section, I compare the three different schedules for a few specific values of $n$, which varies the number of terms in the summation $\sum_{i=1}^{n} a_i x_i$, and I analyze the comparative asymptotic performances of the different schedules. I derive formulas for the widths of the output intervals as a function of the number of terms in the summation $n$ and the number of refinements $k$. I fix $k = T_n + 1$ because it is the number of refinements at which all of the leaves of the cost-modeled schedule contribute the same amount of error. This simplifies the expression for the width to

$$w_c^{(T_n+1)} = \frac{n}{2^{T_{n+1}+1}}.$$

Table 3.1 shows the number of additional refinements needed for results to lie within the

error bound from the cost-modeled schedule at the $(T_n + 1)$th iteration. Even for relatively small $n$, it is clear that there is a significant practical advantage to using the cost-modeled schedule over the uniform and critical path schedules. The critical path schedule also consistently outperforms the uniform schedule.

Schedule      n = 5    n = 10    n = 15
Uniform          4        8         13
Crit. path       3        7         12

Table 3.1: The table shows the number of additional refinements needed for the uniform and critical path schedules to lie within the error bound that the cost-modeled schedule achieves on the $(T_n + 1)$th iteration ($T_5 + 1 = 16$, $T_{10} + 1 = 56$, and $T_{15} + 1 = 121$).
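The entries in Table 3.1 follow from the closed forms above. The script below (mine) evaluates Equations 3.9 and 3.10 and the simplified cost-modeled width, then counts how many extra refinements the uniform and critical path schedules need to match the cost-modeled error:

def w_uniform(k, n):
    return (1 - 2.0 ** -n) / 2.0 ** k                                  # Equation 3.9

def w_critical(k, n):
    return 2.0 ** -(2 * k) + (1 - 2.0 ** -(n - 1)) / 2.0 ** (k + 1)    # Equation 3.10

def extra_refinements(width, n):
    t_n = n * (n + 1) // 2
    target = n / 2.0 ** (t_n + n + 2)   # cost-modeled width at iteration T_n + 1
    extra = 0
    while width(t_n + 1 + extra, n) > target:
        extra += 1
    return extra

for n in (5, 10, 15):
    print(n, extra_refinements(w_uniform, n), extra_refinements(w_critical, n))
# Prints 5 4 3, 10 8 7, and 15 13 12, reproducing Table 3.1.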

I now study the limiting behavior of the ratio between the width of the cost-modeled schedule and each of the other two schedules. I find that it is an exponentially better schedule (note that smaller is better for widths) because

$$\lim_{n \to \infty} \frac{w_c^{(T_n+1)}}{w_p^{(T_n+1)}} = \lim_{n \to \infty} \frac{w_c^{(T_n+1)}}{w_u^{(T_n+1)}} = \frac{n}{2^n}.$$

The single additional bit added along the critical path significantly improves the refinement process for the cost-modeled schedule. This emphasizes the importance of using carefully considered scheduling algorithms. To a lesser extent, the critical path schedule outperforms the uniform schedule. The limit

$$\lim_{n \to \infty} \frac{w_p^{(T_n+1)}}{w_u^{(T_n+1)}} = \frac{1}{2}$$

shows that the critical path schedule is a factor of two tighter than the uniform schedule at the same refinement iteration. A single additional bit of precision per refinement leads to halving the interval width globally.

Chapter 4

Automatic Differentiation of Interval Arithmetic

In this chapter, I introduce automatic differentiation and detail both the implementation and relevant analysis behind computing derivatives of interval code. The recent popularity of deep learning led to a focus on efficient computation of derivatives. Indeed, the backpropagation algorithm, key to deep learning, is a special case of automatic differentiation [1, 32]. I begin with a brief overview of different approaches and some design considerations in the efficient computation of derivatives.

4.1 Introduction to automatic differentiation

Automatic Differentiation (AD) enables efficient derivative computations and is commonly used in machine learning to compute first- and second-order derivatives [4]. I will provide a brief overview of AD and direct readers to [4, 19] for further description. There are two chain-rule factorizations of derivatives that lead to two different realizations of AD with different properties: forward-mode and reverse-mode.

Forward-mode AD Consider a differentiable function 푓 : R푁 → R푀 . Forward-mode AD computes derivatives from the 푁 leaves of the computation graph up to the 푀 roots. This simple choice of computing derivatives from the leaves to the roots means that the values of

37 the 푁 leaf derivatives are assigned at initialization. As a result, forward-mode AD computes all of the 푀 output derivatives with respect to an assignment of input derivatives.

Reverse-mode AD Consider a differentiable function 푓 : R푁 → R푀 . In contrast to forward-mode, reverse-mode AD computes derivatives from the 푀 roots of the computation graph to the 푁 leaves. Reverse-mode AD reverses the dependencies, requiring an initializa- tion for the 푀 output derivatives and computing the 푁 input derivatives. Derivatives with respect to all inputs are computed with respect to an assignment of output derivatives.

4.2 Automatic differentiation on intervals

I decide to use reverse-mode AD because it computes from the outputs to the inputs. Therefore, given an initialization for the two output derivatives, reverse-mode AD computes the derivatives for all of the inputs and intermediate computations. In contrast, forward-mode would require a forward-pass for each input (in the case of interval arithmetic, both the lower and upper bound).

For simplicity, I do not use interval arithmetic to bound the error of the gradient com- putation, but I do leave the gradients parameterized by precision for convenience and note that my implementation can easily be extended to compute error-bounded gradients.

Figure 4-1 presents an implementation of the gradient computation code on intervals. Each variable has an appropriate weight (the derivative of the term with respect to self ) assigned during the forward-pass based on the operation performed. The gradient is then the sum of the product of the appropriate weight and the corresponding gradient, as given by the chain rule. I provide an example implementation for interval addition. The careful reader may notice that there is an unnecessary recomputation of the co-recursive call to grad in _compute_grad . Indeed, my implementation caches computed gradients appropriately, but these details are omitted in Figure 4-1 for simplicity.

def _compute_grad(self, parents, rte):
    grad = 0
    for (w1, w2), var in parents:
        lower, upper = var.grad()
        grad_term = add(mul(w1, lower, rte), mul(w2, upper, rte), rte)
        grad = add(grad_term, grad, rte)
    return grad

def grad(self):
    rte = precision(self.grad_precision) + RoundTiesToEven
    self.lower_grad = self._compute_grad(self.ad_lower_parents, rte)
    self.upper_grad = self._compute_grad(self.ad_upper_parents, rte)
    return self.lower_grad, self.upper_grad

Figure 4-1: Reverse-mode automatic differentiation on intervals.

Extracting sensitivities Generating the sensitivity for each variable specified in Equation 3.3 requires two simple but key steps. First, initialize the derivatives at the root

root.lower_grad, root.upper_grad = 1, -1 .

Then, for a vertex 푣 in the computation graph, I compute the sensitivity 푠푒푛푠(푣), described in Equation 3.3: v.lower_grad - v.upper_grad .

These correspond to the post-composition with $g$ and pre-composition with $h_x$ that map the four partial derivatives in Equation 3.5 to the sensitivity. Explicitly,

$$\begin{bmatrix} 1 & -1 \end{bmatrix}, \qquad \begin{bmatrix} 1 \\ -1 \end{bmatrix}$$

correspond to initializing the derivatives at the root and to the sensitivity assignment, respectively, as they appear in Equation 3.6. The given code snippets constitute the implementation naturally arising from Equation 3.6.

4.2.1 Derivative of interval addition

Building upon the explanation of interval addition in Section 2.1, I now show how to take derivatives through $\oplus_p : \mathbb{R}^2 \times \mathbb{R}^2 \to \mathbb{R}^2$, the interval addition operator at precision $p$. Since $\oplus_p$ is monotonic increasing, the derivative of the lower bound of the output with respect to the lower bound of either of the inputs is 1 and similarly, the derivative of the upper bound of the output with respect to the upper bound of either of the inputs is 1. All other derivatives are 0, with eight derivatives in total. Explicitly, Figure 4-2 presents an implementation of interval addition with derivatives. My implementation stores these derivatives and the corresponding objects that are used in the recursive calls to grad.

def __add__(self, lower_at_p, upper_at_p):
    # Perform addition
    left, right = self.parents
    self.lower = add(left.lower, right.lower, lower_at_p)
    self.upper = add(left.upper, right.upper, upper_at_p)

    # Add derivative information
    left.ad_lower_parents.append(((1, 0), self))
    left.ad_upper_parents.append(((0, 1), self))
    right.ad_lower_parents.append(((1, 0), self))
    right.ad_upper_parents.append(((0, 1), self))

Figure 4-2: Interval addition with derivatives.

4.2.2 Derivative of interval multiplication

The derivative of interval multiplication is more difficult because the output interval is the minimum and maximum of the set of pairwise products of the input intervals (as explained in Section 2.2). The derivative computation involves identifying the terms that contribute to the output and assigning the appropriate derivatives. I will only provide an example and forego the implementation as it is detailed and adds little additional insight (it is in the provided code).

Example 1. In this example, I assume error-free arithmetic to focus on the complexities of differentiation. Recall that the computation $x \otimes y = z$ expands to

$$[\underline{x}, \overline{x}] \otimes [\underline{y}, \overline{y}] = [\underline{z}, \overline{z}].$$

Consider the example: $[-1, 2] \otimes [-4, 1] = [-8, 4]$.

The set of products is $S = \{-8, -1, 2, 4\}$. Since $z = [\min S, \max S]$, the result is $z = [-8, 4]$. There are four input terms $\{-1, 2, -4, 1\}$ and two outputs, −8 and 4, that lead to the eight derivatives shown in Equation 4.1. Each derivative provides an answer to the intuitive question: “how much would a change in this input affect that output?”

Since −1 only contributes to $\overline{z}$, where it is multiplied by −4, the derivatives $\left(\frac{\partial \underline{z}}{\partial \underline{x}}, \frac{\partial \overline{z}}{\partial \underline{x}}\right)$ are (0, −4). Similarly, 2 only contributes to $\underline{z}$ and it is multiplied by −4, so the derivatives $\left(\frac{\partial \underline{z}}{\partial \overline{x}}, \frac{\partial \overline{z}}{\partial \overline{x}}\right)$ are (−4, 0). Continuing in this way yields the eight derivatives

$$((0, -4), (-4, 0), (2, -1), (0, 0)),$$

corresponding to the derivatives

$$\left(\left(\frac{\partial \underline{z}}{\partial \underline{x}}, \frac{\partial \overline{z}}{\partial \underline{x}}\right), \left(\frac{\partial \underline{z}}{\partial \overline{x}}, \frac{\partial \overline{z}}{\partial \overline{x}}\right), \left(\frac{\partial \underline{z}}{\partial \underline{y}}, \frac{\partial \overline{z}}{\partial \underline{y}}\right), \left(\frac{\partial \underline{z}}{\partial \overline{y}}, \frac{\partial \overline{z}}{\partial \overline{y}}\right)\right). \tag{4.1}$$
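A sketch of this selection logic in the error-free setting of Example 1 (my own; the thesis's full precision-aware implementation is in the released code): the endpoints that achieve the minimum and maximum pairwise products receive product-rule derivatives, and every other endpoint gets zero:

def interval_mul_with_derivatives(x, y):
    # x = (x_lo, x_hi), y = (y_lo, y_hi); error-free arithmetic as in Example 1.
    vals = {'x_lo': x[0], 'x_hi': x[1], 'y_lo': y[0], 'y_hi': y[1]}
    pairs = [('x_lo', 'y_lo'), ('x_lo', 'y_hi'), ('x_hi', 'y_lo'), ('x_hi', 'y_hi')]
    products = [vals[a] * vals[b] for a, b in pairs]
    lo_idx = min(range(4), key=products.__getitem__)
    hi_idx = max(range(4), key=products.__getitem__)
    z = (products[lo_idx], products[hi_idx])

    # derivs[v] = (dz_lo/dv, dz_hi/dv); only the endpoints selected for an output
    # bound get a nonzero derivative, given by the product rule.
    derivs = {v: [0, 0] for v in vals}
    a, b = pairs[lo_idx]
    derivs[a][0] += vals[b]
    derivs[b][0] += vals[a]
    a, b = pairs[hi_idx]
    derivs[a][1] += vals[b]
    derivs[b][1] += vals[a]
    return z, derivs

# interval_mul_with_derivatives((-1, 2), (-4, 1)) returns z = (-8, 4) and the
# derivatives (0, -4), (-4, 0), (2, -1), (0, 0) for x_lo, x_hi, y_lo, y_hi.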

4.2.3 Derivative of interval sine

Building on the understanding of how to compute derivatives of $\oplus_p$ and $\otimes_p$, I now briefly cover how to compute the derivative of $\sin_p$. It may help to take another look at the detailed description of interval sine in Section 2.3 and to look at the definition of $\sin_p$ in Equation 2.1.

Each of the cases for the derivative (assuming that $\overline{x} - \underline{x} < \pi$) is shown below:

$$\frac{d\sin_p}{dx} = \begin{cases} ((\cos_p \underline{x}, 0), (0, \cos_p \overline{x})), & \text{for } x \subset (\text{I} \cup \text{IV}) \\ ((0, \cos_p \underline{x}), (\cos_p \overline{x}, 0)), & \text{for } x \subset (\text{II} \cup \text{III}) \\ ((\cos_p \underline{x}, 0), (0, 0)), & \text{for } (x \subset (\text{I} \cup \text{II})) \wedge (\underline{\sin}_p \underline{x} < \underline{\sin}_p \overline{x}) \\ ((0, 0), (\cos_p \overline{x}, 0)), & \text{for } (x \subset (\text{I} \cup \text{II})) \wedge \neg(\underline{\sin}_p \underline{x} < \underline{\sin}_p \overline{x}) \\ ((0, \cos_p \underline{x}), (0, 0)), & \text{for } (x \subset (\text{III} \cup \text{IV})) \wedge (\overline{\sin}_p \underline{x} > \overline{\sin}_p \overline{x}) \\ ((0, 0), (0, \cos_p \overline{x})), & \text{for } (x \subset (\text{III} \cup \text{IV})) \wedge \neg(\overline{\sin}_p \underline{x} > \overline{\sin}_p \overline{x}) \end{cases} \tag{4.2}$$

where the regions I, II, III, and IV are those specified in Figure 2-1. If $\overline{x} - \underline{x} \ge \pi$, my implementation returns (0, 0), which is potentially too “loose” (because sine evaluated at an interval with width $\pi$ may not span the whole range of sine), but is still sound because over-approximation is acceptable for interval arithmetic.

4.3 Analysis

In this section, I introduce the mathematical challenges that arise from taking derivatives through interval code and highlight some additional concerns in my implementation. Non-differentiability is of particular concern. Addition on intervals $\oplus$ is differentiable, but multiplication on intervals $\otimes$ is only differentiable almost everywhere. For example, consider $[-2, 4] \otimes [-4, 2] = [-16, 8]$: the upper bound could come either from $-2 \times -4$ or from $4 \times 2$. Thus, this computation is not differentiable, but most computations, like the one shown in Example 1, are differentiable. My implementation takes the derivative with respect to the selected computation as a result of the nondeterministic choices (arising from the computation of min and max with multiplicity at the extrema) made during the computation. Similarly, sin is differentiable almost everywhere and is not differentiable, for example, at the interval [−1, 2], which has a width of 3 and does not span [−1, 1]. Example 1 also exhibits dead-zones, where a set of inputs has a zero derivative. In Example 1, where $[-1, 2] \otimes [-4, 1] = [-8, 4]$, “1” could be replaced with any $a$ in the open interval (−4, 2) and produce the same result. Since none of these values of $a$ will contribute to

the output, they will have a derivative of (0, 0). Similarly, there is a dead-zone for all inputs with a width greater than 3 for sin (the same is true for any definition for a width ≥ 2휋). These dead-zones present a challenge for using derivatives as sensitivities. For example, all of the derivatives are 0 for sin for a wide interval, indicating that it is not important to decrease the interval widths. However, this is clearly not the case, as a non-infinitesimal change (like those used in refinement) may indeed yield a narrower interval (and a more accurate result). Together, dead-zones and non-differentiability present cases where this approach for using derivatives as sensitivities may fail. They also highlight some subtleties of computing derivatives through interval code that may be worth further mathematical exploration and analysis. Since computations “break ties” by making arbitrary non-deterministic choices, I compute derivatives for every input (even at points that are technically not differentiable). I now move on to establishing the benefit of computing these derivatives.

Chapter 5

Results

In this chapter, I present an experiment and provide empirical results demonstrating the effectiveness of the critical path algorithm described in Chapter 3. Consider the computation

$$y = \pi + 2^{100000} e, \tag{5.1}$$

which is the motivating example from Section 1.1 depicted in Figure 1-1, except with $k = 2^{100000}$. In this case, changing the precision at which $\pi$ is computed from 1 to 2 bits reduces the output error by approximately 1, whereas for $e$, the same change in precision reduces the output error by approximately $2^{99999}$.

5.1 Schedules

In this section, I define the baseline schedule and the critical path schedule that I will compare empirically in Section 5.2 on the computation in Equation 5.1.

5.1.1 Baseline schedule

iRRAM uses a precision schedule $S$ that uniformly refines variables and operators with

$$S_v^{(k)} = S_v^{(k-1)} + 50 \cdot 1.25^k,$$

where $S^{(0)} = 0$. The computation is computed at configurations starting with $S^{(1)}$. Intuitively, the refinement process increases the precision of every variable and operator in the program by 25% until the output error is within the error bound.

5.1.2 Critical path schedule

Now consider the alternate precision schedule $S'$ that sets variables and operators to different precisions depending on whether or not they are on the critical path $P^{(k)} = \{+, \times, e\}$ for every $k$. The instantiation of the critical path algorithm I use is

$$S'^{(k+1)}_v = \begin{cases} S'^{(k)}_v + 50 \cdot 1.33^k & \text{if } v \in P^{(k)} \\ S'^{(k)}_v + 50 \cdot 1.25^k & \text{otherwise} \end{cases}$$

where $S'^{(0)} = 0$. Notice that the precision refinements increase at a faster rate along the critical path than in the rest of the program.
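A sketch of this schedule as a generator (mine); the vertex names and the critical path $\{+, \times, e\}$ come from Equation 5.1, and truncating precisions to whole bits is an assumption the equation leaves implicit:

def critical_path_schedule(vertices, path):
    # S'(0) = 0; increments grow by 50 * 1.33**k on the critical path and
    # by 50 * 1.25**k elsewhere.
    config = {v: 0.0 for v in vertices}
    k = 0
    while True:
        config = {v: p + 50 * (1.33 ** k if v in path else 1.25 ** k)
                  for v, p in config.items()}
        k += 1
        yield {v: int(p) for v, p in config.items()}

# schedule = critical_path_schedule({'+', 'x', 'pi', '2**100000', 'e'},
#                                   path={'+', 'x', 'e'})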

5.2 Empirical comparison

In this section, I present the experimental results from the implementation of Equation 5.1, with the goal of comparing the baseline schedule and the critical path schedule. I begin by comparing two configurations arising from the two schedules at the final iteration of a computation satisfying the error bound of $10^{-12000}$. Then I show how this affects the schedules as a whole for the same error bound.

5.2.1 Improving a configuration

I compare the time it takes to run two precision configurations that constitute a mapping from variables and operations (in this case $\{+, \pi, \times, 2^{100000}, e\}$) to precisions for the example presented in Equation 5.1 and Figure 1-1. The error in the computation is the width of the output interval. Notice that in Table 5.1, the precisions along the critical path (which is $\{+, \times, e\}$ because

$\frac{\partial y}{\partial e}$ has the largest derivative) for $S'^{(24)}$ are higher than the precisions in $S^{(29)}$, while the variables not on the critical path have a lower precision. Furthermore, note from Table 5.2 that the configuration $S'^{(24)}$ has less output error than $S^{(29)}$. This means that using the critical path schedule produces a superior final configuration with a speed increase of roughly 37% and higher output accuracy.

Configuration      +         π         ×      2^100000      e
S^(29)          129046    129046    129046    129046    129046
S'^(24)         142047     42151    142047     42151    142047

Table 5.1: The table presents a comparison of the precisions generated on the 29th iteration of the baseline schedule, $S^{(29)}$, and the 24th configuration of the critical path schedule, $S'^{(24)}$.

Configuration        Error           Time (sec)
S^(29)          1.5 · 10^-8743         0.037
S'^(24)         3.1 · 10^-12657        0.027

Table 5.2: The table presents a comparison of the first configurations satisfying the error bound of $10^{-12000}$ for the baseline and critical path schedules.

5.2.2 Improving a schedule

The amount of total computation of a schedule can be understood in terms of the number of configurations computed and the time to compute each configuration. In the previous section, I demonstrate that computing using the critical path algorithm requires less time on the final configuration and fewer refinement steps. Since this is the case, it follows that the schedule as a whole will take less total time as well.

Let 푡(푆(푘)) denote the time it takes to run the schedule 푆 for 푘 iterations (i.e. to run all of the configurations 푆(1), 푆(2), . . . , 푆(푘)). Continuing this example, it is clear that the effect of using the critical path algorithm is even more pronounced at the schedule level. I find 푡(푆(29)) = 0.163s and 푡(푆′(24)) = 0.112s, while the error on the last iteration is as shown in Table 5.2. This means that this approach produces a 45% speed increase with higher output accuracy.

5.3 Implementation

I implement a push-based system of arbitrary-precision arithmetic that uses interval arithmetic and computes derivatives using reverse-mode automatic differentiation. The implementation is in Python and uses the bigfloat wrapper for MPFR for multiple-precision floating-point computations [14]. The implementation is available at https://github.com/psg-mit/fast_reals.

Implementation challenges I encountered three core challenges in the implementation:

1. Caching – computations are cached automatically by MPFR.

2. Inconsistent performance – at low precisions, performance characteristics are erratic.

3. Timing error – the running times of simple programs are within the range of timing error.

I solve challenge 1 by running each experiment in a separate thread – allowing caching, but only within each run and not among runs. I resolve challenges 2 and 3 by setting small enough tolerances that durations are large enough to be easily measurable.
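A sketch of the timing harness this describes (mine; run_schedule stands in for the actual driver in the released implementation): each measured run happens in its own thread so that caching stays local to that run:

import threading
import time

def timed_run(run_schedule, schedule, error_bound, results):
    # Time one complete schedule run with a monotonic clock.
    start = time.perf_counter()
    run_schedule(schedule, error_bound)
    results.append(time.perf_counter() - start)

def benchmark(run_schedule, schedule, error_bound):
    # A fresh thread per experiment keeps caching within a single run.
    results = []
    t = threading.Thread(target=timed_run,
                         args=(run_schedule, schedule, error_bound, results))
    t.start()
    t.join()
    return results[0]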

Chapter 6

Related Work

Related work aims to improve the performance of arbitrary-precision arithmetic with careful software implementations, by restructuring computations for efficiency, and by caching [23, 25, 30]. Due to performance concerns, interval arithmetic and arbitrary-precision arithmetic have not yet been widely adopted. However, interval and arbitrary-precision arithmetic have uses in robotics and other domains, and have been used in the proof of the Kepler conjecture and of the Lorenz attractor in 3D [17, 21, 34].

Implementations of arbitrary-precision arithmetic often rely on interval arithmetic, which researchers are working to accelerate by introducing further standardization – namely IEEE-1788 – and by creating specialized computer architectures [13, 24]. I take a mixed-precision approach, which would see significant improvements with hardware support. For example, field-programmable gate arrays (FPGAs) have been used for mixed-precision computation in other applications [16, 29].

I tackle the problem of allocating precisions to variables and operators in expressions in arbitrary-real computations. Although I do not know of other work addressing this particular problem, mixed-precision tuning shares similar concerns and trade-offs. I draw inspiration from prior work on this topic. In Chapter 4, I present my approach to automatic differentiation for interval arithmetic, where I include appropriate references as they arise. I now provide a background of various approaches to sensitivity analysis and to implementing arbitrary-precision arithmetic.

6.1 Mixed-precision tuning and sensitivity analysis

Mixed-precision tuning of floating-point computations involves assigning variables in a program different floating-point precisions (e.g. float32 versus float64). The goal is to minimize run-time, space, etc. while respecting a bound on the output error.

The numerical tuning approaches providing error bounds often use SMT solvers and restrict the inputs to interval ranges [6, 7, 9, 10, 11]. One common way to use sensitivity analysis techniques (which measure the effect that changing an input parameter has on the output) is to produce annotations that identify operations requiring high precision while satisfying an error bound [31, 33]. For example, Hwang et al. use automatic differentiation to produce a sensitivity analysis of air quality models [20].

6.2 Arbitrary-precision arithmetic

I present a new categorization of two high-level approaches to arbitrary-precision arithmetic. The pull-based approach propagates error bounds recursively from the output to each of the sub-expressions, requiring a single pass through the computation but often producing overly-precise results. I call this approach pull-based because error flows down from the root of the computation graph to the leaves. The push-based approach computes the corresponding error at precisions set arbitrarily and refines results at increasingly high precisions until the given error bound is satisfied. Each pass through the computation uses interval arithmetic, which computes error bounds by “pushing” bounds from the leaves of the computation graph up to the root. The push-based approach sometimes requires multiple passes through the computation, but it can potentially avoid producing unnecessarily-precise results. A survey of techniques for implementing arbitrary-precision arithmetic is presented in [15]. Because the analysis used to assign error thresholds to sub-expressions tends to be loose, there seems to be some consensus that the push-based approach is faster [15, 23, 25, 26, 30].
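
The push-based refinement loop can be summarized with a short sketch; here evaluate_intervals is a hypothetical function that evaluates the whole expression with interval arithmetic at a given precision and returns the output interval, and doubling the precision is only one possible refinement policy.

    def push_based_eval(evaluate_intervals, error_bound, start_precision=53):
        """Refine until the output interval is narrower than the error bound."""
        p = start_precision
        while True:
            lo, hi = evaluate_intervals(p)   # bounds are "pushed" from leaves to root
            if hi - lo <= error_bound:
                return lo, hi, p
            p *= 2                           # refine and re-evaluate the computation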

6.2.1 Pull-based approaches

A representative pull-based approach is to use binary Cauchy sequences and to evaluate sub-expressions at higher precisions, ensuring that only a single pass from the root to the leaves is required [27]. Concretely, a real number is represented by an infinite sequence of integers in some base and a denominator that increases exponentially by that base at every iteration.

In the binary case, for an integer sequence $\{z_k\}$, the corresponding real number is the limit of the sequence $z_k 2^{-k}$. There have been efforts to accelerate this approach by caching results even at different precisions [23]. Unfortunately, the pull-based approach often computes overly-precise results and, especially as expression size scales, has worse overall performance than push-based approaches (based upon the results of the CCA 2000 competition) [5].
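
As a concrete instance of this representation (my own illustration, not drawn from [27]), the snippet below realizes 1/3 as the integer sequence $z_k = \lfloor 2^k / 3 \rfloor$, whose scaled values $z_k 2^{-k}$ approach 1/3 with error below $2^{-k}$.

    from fractions import Fraction

    def one_third(k):
        """The k-th element z_k of a binary Cauchy sequence representing 1/3."""
        return (2 ** k) // 3

    for k in (4, 8, 16):
        approx = Fraction(one_third(k), 2 ** k)
        error = abs(approx - Fraction(1, 3))
        print(k, float(approx), float(error))   # error shrinks roughly as 2**(-k)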

6.2.2 Push-based approaches

iRRAM In iRRAM [30], intervals are iteratively, globally refined with a uniform precision for each node in the computation tree to yield a result with the desired precision. In terms of relevant optimizations, iRRAM supports by-hand labeling of specific parts of a computation as more sensitive, and thus computing them with higher precision than the rest of the program. Müller evaluates iRRAM by showing its performance in computing simple arithmetic expressions (e.g. $\sqrt{1/3}$, $\log(1/3)$), iterative functions (the logistic map $x_i = 3.75\,x_{i-1}(1 - x_{i-1})$), and inverting the Hilbert matrix. Since these computations have a computation graph with few or no branches, I would not expect significant speed increases using my proposed approach on this set of benchmarks.

RealLib Lambov [25] aims to make low-precision arbitrary-precision arithmetic comparable to floating point in terms of speed. Their core insight is that a pull-based approach is sometimes faster on small subtrees of the computation graph. They provide a programming model that allows users to embed pull-based sub-expressions within the overall push-based computation. At high precisions, they find that iRRAM generally outperforms RealLib, but in the low-precision regime on particular computations, their approach yields orders-of-magnitude faster results than iRRAM, even giving speeds comparable to floating point in some cases. They also accelerate their computation by caching results, so that if a sub-expression appears in multiple places, its value may be reused.

Chapter 7

Discussion and Future Work

In this chapter, I introduce some additional, preliminary benchmark results, reflect upon some ways future researchers may improve precision refinement, and lay out a new application of the techniques developed in this thesis to experimental research.

7.1 Benchmarks

Currently, there is not a comprehensive benchmark suite for arbitrary-precision numerical tasks with significant branching in the computation graph. The CCA benchmarks are mostly computations that have little branching, such as $\arctan(10^{50})$. The FPBench benchmark suite for scientific computations has more programs with branched computation graphs, but the inputs are defined over intervals without a clear arbitrary-precision translation [8]. Furthermore, the benchmarks contain few arbitrary-precision constants such as $e$, $\pi$, $h$, etc. A comprehensive collection of scientific computations that rely on high-precision inputs and constants would help in comparing future work aimed at speeding up arbitrary-precision arithmetic.

Methodology The benchmarks have variables that come from an interval range, constants specified as floats, and operators. I implement a uniform random sampler that selects a point from the interval range and provides an arbitrary-precision sample. Due to the heavy use of random sampling, it is relatively computationally expensive to increase precisions. The constants are left as-is, and their precision remains the same throughout the computation. The operations are replaced with their arbitrary-precision equivalents. For simplicity, I use a subset of the FPBench benchmarks that does not have loops or other language primitives beyond those my implementation supports [8].
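
The thesis does not spell out the sampler's construction here, so the following is only a plausible sketch of an on-the-fly uniform sampler over an interval: random bits are drawn lazily, so requesting more precision extends the same sampled point rather than drawing a new one. The class name and the exact-fraction representation are illustrative choices, not the implementation's.

    import random
    from fractions import Fraction

    class IntervalSampler:
        def __init__(self, lo, hi, seed=0):
            self.lo, self.hi = Fraction(lo), Fraction(hi)
            self.rng = random.Random(seed)
            self.bits = 0          # accumulated random bits
            self.nbits = 0

        def sample(self, p):
            """Return the sampled point to p bits of precision as an exact fraction."""
            while self.nbits < p:                     # extend the sample on the fly
                self.bits = (self.bits << 1) | self.rng.getrandbits(1)
                self.nbits += 1
            frac = Fraction(self.bits, 1 << self.nbits)
            return self.lo + frac * (self.hi - self.lo)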

Preliminary results Table 7.1 shows the speedup from using the critical path schedule (Section 5.1.2) instead of the baseline schedule (Section 5.1.1). The parameters for the critical path schedule are the same as those used in the experiments in Chapter 5. The results are comparable, if not slightly worse, using the critical path algorithm for all of the benchmarks except verhulst. Looking at the underlying computations, verhulst is the only benchmark that has a single clear choice of critical path, and it benefits with a 2.12x speedup. For the other computations, the critical path remains the same throughout the computation and thus that path is over-refined. In other words, the refinement results in a little extra computation with little benefit in output precision. This problem is compounded by the methodology, in which more digits of variables are sampled on the fly, which is computationally expensive. The benefits of the critical path schedule on general-purpose computation are limited by the lack of per-variable and per-operation cost modeling and by the simplicity of the algorithm. I discuss ways to broaden the applicability of this technique and to extend it in Section 7.2.

7.2 Further improving precision refinement

The critical path algorithm is a new approach to precision refinement that inspires a number of questions and opens up many directions for future research.

7.2.1 Per-primitive cost modeling

The sensitivity analysis that I focus on in this thesis does not take into account the computational difficulty of refining different variables and operations. For example, generating the 10th digit of precision for $\pi$ will generally require more compute than generating it for a sum. Operators will require different amounts of compute that will scale differently with $p$. These differences can be accounted for by incorporating them into a cost model like the one presented in Section 3.2.3. Defining the cost model could be done theoretically (by hand-coding the asymptotic behavior of each variable and operator) or empirically (by collecting data for each of the variables and operators and modeling the observed behavior).

Benchmark      # Ops   Speedup
carbon gas        15      0.97
doppler1          11      1.0
doppler2          11      0.99
doppler3          11      0.97
jetEngine         28      0.92
predPrey           7      0.93
rigidbody1        11      0.96
rigidbody2        13      0.93
sine              11      0.91
sineOrder3         6      0.96
sqroot            12      0.97
turbine1          16      0.92
turbine2          13      0.96
turbine3          16      0.96
verhulst           5      2.12

Table 7.1: The table presents the FPBench benchmark results. “# Ops” is the number of variables and operations in the computation. The “Speedup” is the ratio of the time taken by the baseline schedule (Section 5.1.1) to the time taken by the critical path schedule (Section 5.1.2), each respecting an error bound of $10^{-12000}$.
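
As a hedged sketch of the per-primitive cost model discussed in Section 7.2.1 above, one could hand-code an assumed asymptotic cost for each primitive as a function of the precision p and use it to weight sensitivities; the particular growth rates below are placeholders chosen for illustration, not measured costs.

    import math

    # Assumed asymptotic refinement costs as a function of the precision p.
    OP_COST = {
        "add": lambda p: p,                    # addition assumed linear in precision
        "mul": lambda p: p * math.log2(p),     # multiplication assumed quasi-linear
        "pi":  lambda p: p * math.log2(p)**2,  # constants such as pi assumed costlier
    }

    def weighted_sensitivity(op, sensitivity, p):
        """Benefit-per-cost score: sensitivity divided by the cost of refining `op` at p."""
        return sensitivity / OP_COST[op](p)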

7.2.2 Unexplored trade-offs in precision refinement

I think there is an opportunity to explore schedules that grow at asymptotically different rates. For example, in applications where computing extra digits of sub-computations requires significant resources, a schedule that grows linearly rather than exponentially is likely to perform better (more iterations until reaching the error bound, but less overshoot). On the other hand, in applications where computing extra digits of sub-computations requires little additional compute, a schedule that grows super-exponentially rather than exponentially is likely to perform better (fewer iterations until satisfying the error bound). An important characteristic is that these refinement rates are application-dependent, which makes the development of more comprehensive benchmarks all the more important (as I argue in Section 7.1).
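
To make the trade-off concrete, here is a small sketch (my own illustration) of precision schedules that grow at different asymptotic rates; the functional forms are assumptions chosen only to contrast linear, exponential, and super-exponential growth.

    def linear(p0, step):
        """Precision grows by a fixed number of bits per iteration."""
        return lambda k: p0 + step * k

    def exponential(p0, base=2):
        """Precision doubles (or scales by `base`) each iteration."""
        return lambda k: p0 * base**k

    def super_exponential(p0):
        """Precision is squared each iteration."""
        return lambda k: p0 ** (2**k)

    # e.g. precisions at iterations 0..3 for p0 = 16:
    # linear(16, 16): 16, 32, 48, 64;  exponential(16): 16, 32, 64, 128;
    # super_exponential(16): 16, 256, 65536, ...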

7.2.3 Generalizing the critical path algorithm

The critical path algorithm refines uniformly across the computation except along the critical path. This algorithm is relatively easy to implement and analyze compared to alternative algorithms that may empirically perform better. For example, consider an algorithm that uses the sensitivities to refine at a different rate along each path in the computation graph (from the root to a leaf), based on the sensitivity of each of the leaves. Understanding the degree to which to refine each of these paths is an open problem that could lead to significant speed improvements, since the effect of the critical path algorithm would be compounded along all paths simultaneously.
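
One possible instantiation of this generalization, purely as a sketch and not an algorithm evaluated in this thesis, is to scale each leaf's precision increment by its normalized sensitivity, so that every root-to-leaf path is refined at its own rate.

    def per_path_increments(sensitivities, base_increment):
        """`sensitivities` maps each leaf to a nonnegative sensitivity value;
        returns the number of extra bits of precision to give each leaf."""
        total = sum(sensitivities.values()) or 1.0
        n = len(sensitivities)
        return {leaf: max(1, round(base_increment * s / total * n))
                for leaf, s in sensitivities.items()}

    # Example: leaves with sensitivities {x: 8.0, y: 1.0, z: 1.0} and a base
    # increment of 10 bits would get roughly {x: 24, y: 3, z: 3} extra bits.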

7.3 New applications to experimental research

Experimental research may provide an excellent future direction for using sensitivity analysis in interval code. An experiment may yield a set of variables with measurement error that is naturally represented with intervals [28]. Conclusive experimental results require certainty, so minimizing the error in the output is important; the output error can be computed efficiently with interval arithmetic and differentiated using automatic differentiation over these intervals. The scientist may wonder, “what parameters should be measured with higher precision in order to produce the greatest increase in the accuracy of the results?” The sensitivity defined in Equation 3.3 is one answer to this question. Scientists may also have a metric of how much effort it takes to measure different parameters, which can be incorporated into the sensitivity analysis using a cost model as presented in Section 3.2.3.

Chapter 8

Conclusions

This thesis explores an opportunity to improve precision refinements in implementations of arbitrary-precision arithmetic. I introduce the critical path algorithm as a way to guide precision refinements using a sensitivity analysis. This new sensitivity analysis uses novel, efficiently computed derivatives of interval code. I describe some of the challenges of implementing reverse-mode automatic differentiation through intervals and provide an analysis of the properties of these derivatives. I provide a system that implements the critical path algorithm for arbitrary-precision arithmetic programs and demonstrate that the algorithm can speed up computation. There are many opportunities for applying automatic differentiation through interval code and for improving precision-refinement algorithms that I hope future research will explore.

Bibliography

[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Gregory S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian J. Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Józefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Gordon Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul A. Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda B. Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. CoRR, 2016.

[2] Matthias Althoff and Dmitry Grebenyuk. Implementation of interval arithmetic in CORA 2016. In ARCH@CPSWeek, 2016.

[3] David Bailey, Peter Borwein, and Simon Plouffe. On the Rapid Computation of Various Polylogarithmic Constants. 1997.

[4] Atilim Günes Baydin, Barak A. Pearlmutter, Alexey Andreyevich Radul, and Jeffrey Mark Siskind. Automatic differentiation in machine learning: A survey. Journal of Machine Learning Research, 2017.

[5] Jens Blanck. Exact real arithmetic systems: Results of competition. In Computability and Complexity in Analysis, 2001.

[6] Wei-Fan Chiang, Mark Baranowski, Ian Briggs, Alexey Solovyev, Ganesh Gopalakrishnan, and Zvonimir Rakamarić. Rigorous floating-point mixed-precision tuning. SIGPLAN Not., 2017.

[7] Wei-Fan Chiang, Mark Baranowski, Ian Briggs, Alexey Solovyev, Ganesh Gopalakrishnan, and Zvonimir Rakamarić. Rigorous floating-point mixed-precision tuning. ACM SIGPLAN Notices, 2017.

[8] Nasrine Damouche, Matthieu Martel, Pavel Panchekha, Jason Qiu, Alex Sanchez-Stern, and Zachary Tatlock. Toward a standard benchmark format and suite for floating-point analysis. 2016.

[9] Eva Darulova, Anastasiia Izycheva, Fariha Nasir, Fabian Ritter, Heiko Becker, and Robert Bastian. Daisy - framework for analysis and optimization of numerical programs (tool paper). In TACAS, 2018.

[10] Eva Darulova and Viktor Kuncak. Sound compilation of reals. Principles of Programming Languages, 2014.

[11] Eva Darulova and Viktor Kuncak. Towards a compiler for reals. ACM Trans. Program. Lang. Syst., 2017.

[12] A. Di Franco, H. Guo, and C. Rubio-González. A comprehensive study of real-world numerical bug characteristics. In International Conference on Automated Software Engineering, 2017.

[13] W. Edmonson and G. Melquiond. IEEE interval standard working group - P1788: Current status. In Symposium on Computer Arithmetic, 2009.

[14] Laurent Fousse, Guillaume Hanrot, Vincent Lefèvre, Patrick Pélissier, and Paul Zimmermann. MPFR: A multiple-precision binary floating-point library with correct rounding. ACM Trans. Math. Softw., 2007.

[15] Paul Gowland and David Lester. A survey of exact arithmetic implementations. In Computability and Complexity in Analysis, 2001.

[16] Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep learning with limited numerical precision. In International Conference on Machine Learning, 2015.

[17] Thomas Hales. A proof of the Kepler conjecture. Annals of Mathematics, 2005.

[18] T. Hickey, Q. Ju, and M. H. Van Emden. Interval arithmetic: From principles to implementation. J. ACM, 2001.

[19] Philipp H. Hoffmann. A hitchhiker’s guide to automatic differentiation. Numerical Algorithms, 2016.

[20] Dongming Hwang, Daewon W. Byun, and M. Talat Odman. An automatic differentiation technique for sensitivity analysis of numerical advection schemes in air quality models. Atmospheric Environment, 1997.

[21] Luc Jaulin and Benoît Desrochers. Introduction to the algebra of separators with application to path planning. Engineering Applications of Artificial Intelligence, 2014.

[22] E. Kaucher. Interval Analysis in the Extended Interval Space IR. 1980.

[23] Hideyuki Kawabata. Speeding up exact real arithmetic on fast binary Cauchy sequences by using memoization based on quantized precision. In Journal of Information Processing, 2017.

[24] Reinhard Kirchner and Ulrich W. Kulisch. Hardware support for interval arithmetic. Reliable Computing, 2006.

[25] Branimir Lambov. RealLib: An efficient implementation of exact real arithmetic. In Mathematical Structures in Computer Science, 2007.

[26] Yong Li and Jun-Hai Yong. Efficient exact arithmetic over constructive reals. In The 4th Annual Conference on Theory and Applications of Models of Computation, 2007.

[27] Valérie Ménissier-Morain. Arbitrary precision real arithmetic: design and algorithms. The Journal of Logic and Algebraic Programming, 2005.

[28] Ramon E. Moore, R. Baker Kearfott, and Michael J. Cloud. First Applications of Interval Arithmetic, chapter 3, pages 19–29. In Introduction to Interval Analysis. SIAM, 2009.

[29] Duncan J. M. Moss, Srivatsan Krishnan, Eriko Nurvitadhi, Piotr Ratuszniak, Chris Johnson, Jaewoong Sim, Asit Mishra, Debbie Marr, Suchit Subhaschandra, and Philip H. W. Leong. A customizable matrix multiplication framework for the Intel HARPv2 Xeon+FPGA platform: A deep learning case study. In International Symposium on Field-Programmable Gate Arrays, 2018.

[30] Norbert Th. Müller. The iRRAM: Exact arithmetic in C++. In Computability and Complexity in Analysis, 2000.

[31] B. Nongpoh, R. Ray, S. Dutta, and A. Banerjee. AutoSense: A framework for automated sensitivity analysis of program data. IEEE Transactions on Software Engineering, 2017.

[32] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS-W, 2017.

[33] Pooja Roy, Rajarshi Ray, Chundong Wang, and Weng Fai Wong. ASAC: Automatic sensitivity analysis for approximate computing. Conference on Languages, Compilers, and Tools for Embedded Systems, 2014.

[34] Warwick Tucker. A rigorous ODE solver and Smale’s 14th problem. Foundations of Computational Mathematics, 2002.
