
Efficient Optimization of Loops and Limits with Randomized Telescoping Sums

Alex Beatson¹   Ryan P. Adams¹

¹Department of Computer Science, Princeton University, Princeton, NJ, USA. Correspondence to: Alex Beatson.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

Abstract

We consider optimization problems in which the objective requires an inner loop with many steps or is the limit of a sequence of increasingly costly approximations. Meta-learning, training recurrent neural networks, and optimization of the solutions to differential equations are all examples of optimization problems with this character. In such problems, it can be expensive to compute the objective value and its gradient, but truncating the loop or using less accurate approximations can induce biases that damage the overall solution. We propose randomized telescope (RT) gradient estimators, which represent the objective as the sum of a telescoping series and sample linear combinations of terms to provide cheap unbiased gradient estimates. We identify conditions under which RT estimators achieve optimization convergence rates independent of the length of the loop or the required accuracy of the approximation. We also derive a method for tuning RT estimators online to maximize a lower bound on the expected decrease in loss per unit of computation. We evaluate our adaptive RT estimators on a range of applications including meta-optimization of learning rates, variational inference of ODE parameters, and training an LSTM to model long sequences.

1. Introduction

Many important optimization problems consist of objective functions that can only be computed iteratively or as the limit of an approximation. Machine learning and scientific computing provide many important examples. In meta-learning, evaluation of the objective typically requires the training of a model, a case of bi-level optimization. When training a model on sequential data or to make decisions over time, each learning step requires looping over time steps. More broadly, in many scientific and engineering applications one wishes to optimize an objective that is defined as the limit of a sequence of approximations, with both fidelity and computational cost increasing according to an index n ≥ 1. Inner-loop examples include: integration by Monte Carlo or quadrature with n evaluation points; solving ordinary differential equations (ODEs) with an Euler or Runge-Kutta method with n steps and O(1/n) step size; and solving partial differential equations (PDEs) with a finite element basis whose size or order increases with n.

Whether the task is fitting parameters to data, identifying the parameters of a natural system, or optimizing the design of a mechanical part, in this work we seek to more rapidly solve problems in which the objective function demands a tradeoff between computational cost and accuracy. We formalize this by considering parameters θ ∈ R^D and a loss function L(θ) that is the uniform limit of a sequence L_n(θ):

    min_θ L(θ) = min_θ lim_{n→H} L_n(θ) .    (1)

Some problems involve a finite horizon H; in other cases H = ∞. We also introduce a cost function C : N_+ → R, nondecreasing in n, to represent the cost of computing L_n and its gradient.

A principal challenge of optimization problems of the form in Eq. 1 is selecting a finite N such that the minimum of the surrogate L_N is close to that of L, but without L_N (or its gradient) being too expensive to evaluate. Choosing a large N can be computationally prohibitive, while choosing a small N may bias optimization. Meta-optimizing learning rates with truncated horizons can select hyperparameters that are wrong by orders of magnitude (Wu et al., 2018). Truncating backpropagation through time for recurrent neural networks (RNNs) favors short-term dependencies (Tallec & Ollivier, 2017). Using too coarse a discretization to solve an ODE or PDE can cause error in the solution and bias outer-loop optimization. These optimization problems thus experience a sharp trade-off between efficient computation and bias.

In this work we propose randomized telescope (RT) estimators, which provide cheap unbiased gradient estimates to allow efficient optimization of these objectives. RT estimators represent the objective or its gradients as a telescoping series of differences between intermediate values, and draw weighted samples from this series to maintain unbiasedness while balancing variance and expected computation.

The paper proceeds as follows. Section 2 introduces RT estimators and their history. Section 3 formalizes RT estimators for optimization and discusses related work in optimization. Section 4 discusses conditions for finite variance and computation, and proves that RT estimators can achieve optimization guarantees for loops and limits. Section 5 discusses designing RT estimators by maximizing a bound on expected improvement per unit of computation. Section 6 describes practical considerations for adapting RT estimators online. Section 7 presents experimental results. Section 8 discusses limitations and future work. Appendix A presents algorithm pseudocode. Appendix B presents proofs. Code may be found at https://github.com/PrincetonLIPS/randomized_telescopes.

2. Unbiased randomized truncation

In this section, we discuss the general problem of estimating limits through randomized truncation. The first subsection presents the randomized telescope family of unbiased estimators, while the second subsection describes their history (dating back to von Neumann and Ulam). In the following sections, we will describe how this technique can be used to provide cheap unbiased gradient estimates and accelerate optimization for many problems.

2.1. Randomized telescope estimators

Consider estimating any quantity Y_H := lim_{n→H} Y_n for n ∈ N_+, where H ∈ N_+ ∪ {∞}. Assume that we can compute Y_n for any finite n ∈ N_+, but since the cost is nondecreasing in n there is a point at which this becomes impractical. Rather than truncating at some fixed value short of the limit, we may find it useful to construct an unbiased estimator of Y_H and take on some randomness in return for reduced computational cost.

Define the backward difference ∆_n and represent the quantity of interest Y_H with a telescoping series:

    Y_H = Σ_{n=1}^H ∆_n ,  where ∆_n = Y_n − Y_{n−1} for n > 1 and ∆_1 = Y_1 .

We may sample from this telescoping series to provide unbiased estimates of Y_H, introducing variance to our estimator in exchange for reduced expected computation. We use the name randomized telescope (RT) to refer to the family of estimators indexed by a distribution q over the integers 1, ..., H (for example, a geometric distribution) and a weight function W(n, N):

    Ŷ_H = Σ_{n=1}^N ∆_n W(n, N) ,  where N ∈ {1, ..., H} ∼ q .    (2)

Proposition 2.1. Unbiasedness of RT estimators. The RT estimators in (2) are unbiased estimators of Y_H as long as

    E_{N∼q}[W(n, N) 1{N ≥ n}] = Σ_{N=n}^H W(n, N) q(N) = 1 ,  for all n .

See Appendix B for a short proof. Although we are coining the term "randomized telescope" to refer to the family of estimators with the form of Eq. 2, the underlying trick has a long history, discussed in the next section. The literature we are aware of focuses on one or both of two special cases of Eq. 2, defined by the choice of weight function W(n, N). We will also focus on these two variants of RT estimators, but we observe that there is a larger family.

Most related work uses the "Russian roulette" estimator, originally discovered and named by von Neumann and Ulam (Kahn, 1955), which we term RT-RR and which has the form

    W(n, N) = 1{N ≥ n} / (1 − Σ_{n′=1}^{n−1} q(n′)) .    (3)

It can be seen as summing the iterates ∆_n while flipping a biased coin at each iterate. With probability q(n), the series is truncated at term N = n. With probability 1 − q(n), the process continues, and all future terms are upweighted by 1/(1 − q(n)) to maintain unbiasedness.

The other important special case of Eq. 2 is the "single sample" estimator RT-SS, referred to as "single term weighted truncation" in Lyne et al. (2015). RT-SS takes

    W(n, N) = 1{n = N} / q(N) .    (4)

This directly importance-samples the differences ∆_n.

We will later prove conditions under which RT-SS and RT-RR should be preferred. Of all estimators in the form of Eq. 2 which obey Proposition 2.1, and for all q, RT-SS minimizes the variance across worst-case diagonal covariances Cov(∆_i, ∆_j). Within the same family, RT-RR achieves minimum variance when ∆_i and ∆_j are independent for all i, j.
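To make Eqs. 2–4 concrete, here is a minimal sketch (ours, not from the paper's code release) that estimates the limit of a toy series with both special cases and checks Proposition 2.1 empirically; the sequence Y_n and the geometric q are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
H = 20                                                  # finite horizon for the demo
Y = lambda n: sum(0.5 ** k for k in range(1, n + 1))    # Y_n -> 1 as n grows
delta = lambda n: Y(n) - Y(n - 1) if n > 1 else Y(1)    # backward differences

q = 0.5 ** np.arange(1, H + 1)                          # q(N): geometric, renormalized
q /= q.sum()
Q = np.cumsum(q[::-1])[::-1]                            # survival Q(n) = Pr(N >= n)

def rt_ss():
    """RT-SS (Eq. 4): importance-sample a single difference."""
    N = rng.choice(H, p=q) + 1
    return delta(N) / q[N - 1]

def rt_rr():
    """RT-RR (Eq. 3): keep all terms up to N, each upweighted by 1/Q(n)."""
    N = rng.choice(H, p=q) + 1
    return sum(delta(n) / Q[n - 1] for n in range(1, N + 1))

for est in (rt_ss, rt_rr):
    draws = np.array([est() for _ in range(100_000)])
    print(est.__name__, round(draws.mean(), 4), "vs exact", round(Y(H), 4))
```

Both sample means agree with Y_H up to Monte Carlo error, while each draw touches only a random prefix of the sequence.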

2.2. A brief history of unbiased randomized truncation

The essential trick—unbiased estimation of a quantity via randomized truncation of a series—dates back to unpublished work of John von Neumann and Stanislaw Ulam. They are credited with using it to develop a Monte Carlo method for matrix inversion in Forsythe & Leibler (1950), and with a method for particle diffusion in Kahn (1955).

It has been applied and rediscovered in a number of fields and applications. The early work from von Neumann and Ulam led to its use in computational physics: in neutron transport problems (Spanier & Gelbard, 1969), for studying lattice fermions (Kuti, 1982), and for estimating functional integrals (Wagner, 1987). In computer graphics, Arvo & Kirk (1990) introduced its use for ray tracing; it is now widely used in rendering software. In statistical estimation, it has been used for unbiased estimation of derivatives (Rychlik, 1990), unbiased kernel density estimation (Rychlik, 1995), doubly-intractable Bayesian posterior distributions (Girolami et al., 2013; Lyne et al., 2015; Wei & Murray, 2016), and unbiased Markov chain Monte Carlo (Jacob et al., 2017).

The underlying trick has been rediscovered by Fearnhead et al. (2008) for unbiased estimation in particle filtering, by McLeish (2010) for debiasing Monte Carlo estimates, by Rhee & Glynn (2012; 2015) for unbiased estimation in stochastic differential equations, and by Tallec & Ollivier (2017) to debias truncated backpropagation. The latter also uses RT estimators for optimization; however, it only considers fixed "Russian roulette"-style randomized telescope estimators and does not consider convergence rates or how to adapt the estimator online (our main contributions).

3. Optimizing loops and limits

In this paper, we consider optimizing functions defined as limits. Consider a problem where, given parameters θ, we can obtain a series of approximate losses L_n(θ) which converges uniformly to some limit lim_{n→H} L_n := L, for n ∈ N_+ and H ∈ N_+ ∪ {∞}. We assume the sequence of gradients with respect to θ, denoted G_n(θ) := ∇_θ L_n(θ), converges uniformly to a limit G(θ). Under this uniform convergence, and assuming convergence of L_n, we have lim_{n→H} ∇_θ L_n(θ) = ∇_θ lim_{n→H} L_n(θ) (see Theorem 7.17 in Rudin (1976)), and so G(θ) is indeed the gradient of our objective L(θ). We assume there is a computational cost C(n) associated with evaluating L_n or G_n, nondecreasing with n, and we wish to efficiently minimize L with respect to θ. Loops are an important special case of this framework, where L_n is the final output resulting from running, e.g., a training loop or an RNN for some number of steps increasing in n.

3.1. Randomized telescopes for optimization

We propose using randomized telescopes as a stochastic gradient estimator for such optimization problems. We aim to accelerate optimization much as mini-batch stochastic gradient descent accelerates optimization for large datasets: using Monte Carlo sampling to decrease the expected cost of each optimization step, at the price of increased variance in the gradient estimates, without introducing bias.

Consider the gradient G(θ) = lim_{n→H} G_n(θ) and the backward difference ∆_n(θ) = G_n(θ) − G_{n−1}(θ), where G_0(θ) := 0, so that G(θ) = Σ_{n=1}^H ∆_n(θ). We use the randomized telescope estimator

    Ĝ(θ) = Σ_{n=1}^N ∆_n(θ) W(n, N) ,    (5)

where N ∈ {1, 2, ..., H} is drawn according to a proposal distribution q, and together W and q satisfy Proposition 2.1.

Note that due to linearity of differentiation, and letting L_0(θ) := 0, we have

    Σ_{n=1}^N ∆_n(θ) W(n, N) = ∇_θ Σ_{n=1}^N (L_n(θ) − L_{n−1}(θ)) W(n, N) .

Thus, when the computation of L_n(θ) can reuse most of the computation performed for L_{n−1}(θ), we can evaluate Ĝ_N(θ) via forward or backward automatic differentiation with cost approximately equal to that of computing G_N(θ); i.e., Ĝ_N(θ) has computation cost ≈ C(N). This most often occurs when evaluating L_n involves an inner loop with a step size which does not change with n—e.g., meta-learning and training RNNs, but not solving ODEs or PDEs. When computing L_n(θ) does not reuse computation, evaluating Ĝ_N(θ) has computation cost Σ_{n=1}^N C(n) 1{W(n, N) ≠ 0}.
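As a minimal illustration of Eq. 5 (our toy sketch, not the released implementation), consider surrogate losses L_n(θ) = (1 − 2^{−n}) θ²/2, whose gradients G_n(θ) = (1 − 2^{−n}) θ converge geometrically to G(θ) = θ. Anticipating Theorem 4.2, we take q(N) ∝ p^N with p = 1/2:

```python
import numpy as np

rng = np.random.default_rng(0)
H = 30                                    # horizon; G_H is essentially exact here

def grad_Ln(theta, n):
    # Gradient of the toy surrogate L_n(theta) = (1 - 2**-n) * theta**2 / 2.
    return (1.0 - 2.0 ** -n) * theta if n > 0 else 0.0

q = 0.5 ** np.arange(1, H + 1)            # q(N) proportional to p**N, p = 1/2
q /= q.sum()

def rt_ss_grad(theta):
    """RT-SS instance of Eq. 5: a single importance-sampled difference.
    Here q matches the decay of Delta_n exactly, so the variance is near zero."""
    N = rng.choice(H, p=q) + 1
    return (grad_Ln(theta, N) - grad_Ln(theta, N - 1)) / q[N - 1]

theta, lr = 5.0, 0.5
for _ in range(50):
    theta -= lr * rt_ss_grad(theta)       # unbiased SGD on the limit objective
print(theta)                              # approaches 0, the minimizer of L
```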
3.2. Related work in optimization

Gradient-based bilevel optimization has seen extensive work in the literature. See Jameson (1988) for an early example of optimizing implicit functions, Christianson (1998) for a mathematical treatment, and Maclaurin et al. (2015) and Franceschi et al. (2017) for recent treatments in machine learning. Shaban et al. (2018) propose truncating only the backward pass, by backpropagating through only the final few optimization steps to reduce memory requirements. Metz et al. (2018) propose linearly increasing the number of inner steps over the course of the outer optimization.

An important case of bi-level optimization is optimization of architectures and hyperparameters. Truncation causes bias, as shown by Wu et al. (2018) for learning rates and by Metz et al. (2018) for neural optimizers.

Bi-level optimization is also used for meta-learning across related tasks (Schmidhuber, 1987; Bengio et al., 1992). Ravi & Larochelle (2016) train an initialization and optimizer, and Finn et al. (2017) only an initialization, to minimize validation loss. The latter paper shows increasing performance with the number of steps used in the inner optimization. However, in practice the number of inner-loop steps must be kept small to allow training over many tasks.

Bi-level optimization can be accelerated by amortization. Variational inference can be seen as bi-level optimization; variational autoencoders (Kingma & Welling, 2014) amortize the inner optimization with a predictive model of the solution to the inner objective. Recent work such as Brock et al. (2018) and Lorraine & Duvenaud (2018) amortizes hyperparameter optimization in a similar fashion.

However, amortizing the inner loop induces bias. Cremer et al. (2018) demonstrate this in VAEs, while Kim et al. (2018) show that in VAEs, combining amortization with truncation by taking several gradient steps on the output of the encoder can reduce this bias. This shows these techniques are orthogonal to our contributions: while fully amortizing the inner optimization causes bias, predictive models of the limit can accelerate convergence of L_n to L.

Our work is also related to work on training sequence models. Tallec & Ollivier (2017) use the Russian roulette estimator to debias truncated backpropagation through time.
They use a fixed geometrically decaying q(N), and show that this improves validation loss for Penn Treebank. They do not consider efficiency of optimization, or methods to automatically set or adapt the hyperparameters of the randomized telescope. Trinh et al. (2018) learn long-term dependencies with auxiliary losses. Other work accelerates optimization of sequence models by replacing recurrent models with models which use convolution or attention (Vaswani et al., 2017), which can be trained more efficiently.

4. Convergence rates with fixed RT estimators

Before considering more complex large-scale problems, we examine the simple RT estimator for stochastic gradient descent on convex problems. We assume that the sequence L_n(θ) and the units for C are chosen such that C(n) = n. We study RT-SS, with q(N) fixed a priori. We consider optimizing parameters θ ∈ K, where K ⊂ R^d is a bounded, convex and compact set with diameter bounded by D. We assume L(θ) is convex in θ, and that the G_n(θ) converge according to ||∆_n||_2 ≤ ψ_n, where ψ_n converges polynomially or geometrically. The quantity of interest is the instantaneous regret R_t = L(θ_t) − min_θ L(θ), where θ_t is the parameter after t steps of SGD.

In this setting, any fixed truncation scheme using L_N as a surrogate for L, with fixed N < H, cannot achieve lim_{t→∞} R_t = 0. Meanwhile, the fully unrolled estimator has computational cost which scales with H. In the many situations where H → ∞, it is impossible to take even a single gradient step with this estimator.

The randomized telescope estimator overcomes these drawbacks by exploiting the fact that G_n converges according to ||∆_n||_2 ≤ ψ_n. As long as q is chosen to have tails no lighter than ψ_n, for sufficiently fast convergence, the resulting RT-SS gradient estimator achieves asymptotic regret bounds invariant to H in terms of convergence rate.

All proofs are deferred to Appendix B. We begin by proving bounds on the variance and expected computation for polynomially decaying q(N) and ψ_n.

Theorem 4.1. Bounded variance and compute with polynomial convergence of ψ. Assume ψ converges according to ψ_n ≤ c/n^p or faster, for constants p > 0 and c > 0. Choose the RT-SS estimator with q(N) ∝ 1/N^{p+1/2}. The resulting estimator Ĝ achieves expected compute C ≤ (H_H^{p−1/2})², where H_H^i is the Hth generalized harmonic number of order i, and expected squared norm E[||Ĝ||²_2] ≤ c²(H_H^{p−1/2})² := G̃². The limit lim_{H→∞} H_H^{p−1/2} is finite iff p > 3/2, in which case it is given by the Riemann zeta function, lim_{H→∞} H_H^{p−1/2} = ζ(p − 1/2). Accordingly, the estimator achieves horizon-agnostic variance and expected compute bounds iff p > 3/2.

The corresponding bounds for geometrically decaying q(N) and ψ_n follow.

Theorem 4.2. Bounded variance and compute with geometric convergence of ψ. Assume ψ_n converges according to ψ_n ≤ c p^n or faster, for 0 < p < 1. Choose RT-SS with q(N) ∝ p^N. The resulting estimator Ĝ achieves expected compute C ≤ (1 − p)^{−2} and expected squared norm E[||Ĝ||²_2] ≤ c²/(1 − p)² := G̃². Thus, the estimator achieves horizon-agnostic variance and expected compute bounds for all 0 < p < 1.

Given a setting and estimator Ĝ from either Theorem 4.1 or 4.2, with corresponding expected compute cost C and upper bound on expected squared norm G̃², the following theorem considers regret guarantees when using this estimator to perform stochastic gradient descent.

Theorem 4.3. Asymptotic regret bounds for optimizing infinite-horizon programs. Assume the setting from Theorem 4.1 or 4.2, and the corresponding C and G̃ from those theorems. Let R_t be the instantaneous regret at the tth step of optimization, R_t = L(θ_t) − min_θ L(θ). Let t(B) be the greatest t such that a computational budget B is not exceeded. Use online gradient descent with step size η_t = D / √(t E[||Ĝ||²_2]). As B → ∞, the asymptotic instantaneous regret is bounded by R_{t(B)} ≤ O(G̃ D √(C/B)), independent of H.

Theorem 4.3 indicates that if G_n converges sufficiently fast and L_n is convex, the RT estimator provably optimizes the limit.
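As a worked instance of Theorem 4.1 (our illustrative numbers, not from the paper):

```latex
% Worked instance of Theorem 4.1 (illustrative numbers, H -> infinity).
% Suppose \psi_n \le c / n^2, so p = 2 > 3/2, and take q(N) \propto N^{-5/2}.
% The bounds involve harmonic numbers of order p - 1/2 = 3/2:
\lim_{H \to \infty} \mathcal{H}_H^{3/2} = \zeta(3/2) \approx 2.612, \qquad
\mathcal{C} \le \zeta(3/2)^2 \approx 6.82, \qquad
\mathbb{E}\!\left[\lVert \hat{G} \rVert_2^2\right] \le c^2 \, \zeta(3/2)^2 .
% Both bounds are finite and independent of H: the estimator evaluates a
% constant expected number of inner-loop terms no matter how long the loop is.
```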
5. Adaptive RT estimators

In practice, the estimator considered in the previous section may have high variance. This section develops an objective for designing such estimators, and derives closed-form W(n, N) and q which maximize this objective given estimates of E[||∆_i||²_2] and assumptions on Cov(∆_i, ∆_j).

5.1. Choosing between unbiased estimators

We propose choosing the estimator which achieves the best lower bound on the expected improvement per unit of compute spent, given smoothness assumptions on the loss. Our analysis builds on that of Balles et al. (2016): they adaptively choose a batch size using batch covariance information, while we choose between arbitrary unbiased gradient estimators using knowledge of those estimators' expected squared norms and computation costs.

Here we assume that the true gradient of the objective, ∇_θ[L(θ)] := ∇_θ (for compactness of notation), is smooth in θ. We do not assume convexity. Note that ∇_θ is not necessarily equal to G(θ), as the loss L(θ) and its gradient G(θ) may be random variables due to sampling of data and/or latent variables.

We assume that L is L-smooth (the gradients of L(θ) are L-Lipschitz); i.e., there exists a constant L > 0 such that ||∇_{θ_b} − ∇_{θ_a}||_2 ≤ L ||θ_b − θ_a||_2 for all θ_a, θ_b ∈ R^d. It follows (Balles et al., 2016; Bottou et al., 2018) that, when performing SGD with an unbiased stochastic gradient estimator Ĝ_t,

    E[L_H(θ_t) − L_H(θ_{t+1})] ≥ E[η_t ∇_{θ_t}^T Ĝ_t(θ_t)] − E[(L η_t² / 2) ||Ĝ_t(θ_t)||²_2] .    (6)

Unbiasedness of Ĝ implies E[∇_{θ_t}^T Ĝ_t(θ_t)] = ||∇_{θ_t}||²_2, thus

    E[L_H(θ_t) − L_H(θ_{t+1})] ≥ E[η_t ||∇_θ||²_2] − E[(L η_t² / 2) ||Ĝ_t(θ_t)||²_2] := J .    (7)

Above, J is a lower bound on the expected improvement in the loss from one optimization step. Given a fixed choice of Ĝ_t(θ_t), how should one pick the learning rate η_t to maximize J, and what is the corresponding lower bound on expected improvement? Optimizing η_t by finding η_t* such that dJ/dη_t = 0 yields

    η_t* = ||∇_θ||²_2 / (L E[||Ĝ_t(θ_t)||²_2]) ∝ 1 / E[||Ĝ_t(θ_t)||²_2]    (8)
    J* = ||∇_θ||⁴_2 / (2L E[||Ĝ_t(θ_t)||²_2]) ∝ 1 / E[||Ĝ_t(θ_t)||²_2] .    (9)

This tells us how to choose η_t if we know L, E[||Ĝ_t||²_2], etc. In practice, it is unlikely that we know L or even ||∇_{θ_t}||_2. We instead assume we have access to some "reference" learning rate η̄_t, which has been optimized for use with a "reference" gradient estimator Ḡ_t with known E[||Ḡ_t||²_2]. When using RT estimators, we may have access to learning rates which have been optimized for use with the un-truncated estimator. Even when we do not know an optimal reference learning rate, this construction means we need only tune one hyperparameter (the reference learning rate), and can still choose between a family of gradient estimators online. Instead of directly maximizing J, we choose η_t for Ĝ by maximizing improvement relative to the reference estimator in terms of J, the lower bound on expected improvement.

Assume that η̄_t has been set optimally for the problem and reference estimator Ḡ up to some constant k; i.e.,

    η̄_t = k ||∇_{θ_t}||²_2 / (L E[||Ḡ_t(θ_t)||²_2]) .    (10)

Then the expected improvement J̄ obtained by the reference estimator Ḡ is

    J̄ = (k − k²/2) ||∇_{θ_t}||⁴_2 / (L E[||Ḡ_t(θ_t)||²_2]) .    (11)

We assume that 0 < k < 2, such that J̄ is positive and the reference has guaranteed expected improvement. Now set the learning rate according to

    η_t = η̄_t E[||Ḡ_t||²_2] / E[||Ĝ_t||²_2] .    (12)

It follows that the expected improvement Ĵ obtained by the estimator Ĝ is

    Ĵ = (E[||Ḡ_t(θ_t)||²_2] / E[||Ĝ_t(θ_t)||²_2]) J̄ .    (13)

Let the expected computation cost of evaluating Ĝ be Ĉ. We want to maximize Ĵ/Ĉ. If we use the above method to choose η_t, we have Ĵ/Ĉ ∝ (Ĉ E[||Ĝ_t(θ_t)||²_2])^{−1}. We call (Ĉ E[||Ĝ_t(θ_t)||²_2])^{−1} the relative optimization efficiency, or ROE. We decide between gradient estimators Ĝ by choosing the one which maximizes the ROE. Once an estimator is chosen, one should choose a learning rate according to (12), relative to a reference learning rate η̄ and estimator Ḡ.
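A minimal sketch of this selection rule (ours; the candidate names, costs, and squared norms are placeholder inputs of the kind maintained by the tuning procedure in Section 6):

```python
def roe(expected_cost, expected_sq_norm):
    """Relative optimization efficiency: 1 / (C_hat * E||G_hat||^2)."""
    return 1.0 / (expected_cost * expected_sq_norm)

def transfer_lr(ref_lr, ref_sq_norm, new_sq_norm):
    """Eq. 12: eta_t = eta_bar_t * E||G_bar||^2 / E||G_hat||^2."""
    return ref_lr * ref_sq_norm / new_sq_norm

# Candidate estimators: name -> (expected cost, estimated E||G_hat||^2).
candidates = {"rt_ss": (2.0, 9.0), "rt_rr": (3.0, 4.0), "full": (257.0, 1.0)}
best = max(candidates, key=lambda name: roe(*candidates[name]))
ref_lr, ref_sq_norm = 0.01, 1.0        # tuned for the full-horizon estimator
print(best, transfer_lr(ref_lr, ref_sq_norm, candidates[best][1]))
```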
We also derive closed- ˆ 2 ˆ 2 2LE[||Gt(θt)||2] E[||Gt(θt)||2] form expressions for the optimal sampling distribution q for each class, under the conditions where that class is optimal. ˆ 2 This tells us how to choose ηt if we know L, ||G||2, etc. In practice, it is unlikely that we know L or even ||∇θ ||2. We We assume that computation can be reused and t ˆ PN instead assume we have access to some “reference” learning evaluating GH = n=1 ∆nW (n, N) has computation rate η¯t, which has been optimized for use with a “reference” cost C(N). As described in Section 3.1, this 2 gradient estimator G¯t, with known [||G¯t|| ]. When using is approximately true for many objectives. When E 2 PN RT estimators, we may have access to learning rates which it is not, the cost of computing n=1 ∆nW (n, N) PN 1 have been optimized for use with the un-truncated estimator. is n=1 C(n) {(W (n, N) 6= 0) or (W (n + 1,N) 6= 0)}. Even when we do not know an optimal reference learning This would penalize the ROE of dense W (n, N) and favor Efficient Optimization of Loops and Limits

¯ ¯ 2 sparse W (n, N), possibly impacting the optimality con- has cost c¯i. We assume knowledge of E[||Gi −Gj||2]. ditions for RT-RR. We mitigate this inaccuracy by subse- We aim to find a S ∈ S, where S is quence selection (described in the following subsection), the set of over the 1,..., H¯ which allows construction of sparse sampling strategies. which have final element S−1 = H¯ . Given S, we take L = L¯ , G = G¯ , C(n) = C¯(S ), H = |S|, We begin by showing the RT-SS estimator is optimal with n Sn n Sn n and ∆n = Gn − Gn−1, where G0 := 0. regards to worst-case diagonal covariances Cov(∆i, ∆j), and deriving the optimal q(N). In practice, we greedily construct S by adding indexes i ¯ Theorem 5.1. Optimality of RT-SS under adversarial cor- to the sequence [H] or removing indexes i from the se- ¯ relation. Consider the family of estimators presented quence [1,..., H]. As this step requires minimal computa- tion, we perform both greedy adding and greedy removal in Equation 2. Assume θ, ∇θ, and G are univari- ate. For any fixed sampling distribution q, the single- and return the S with the best ROE. The minimal subse- ¯ sample RT estimator RT-SS minimizes the worst-case quence S = [H] is always considered, allowing RT estima- variance of Gˆ across an adversarial choice of covari- tors to fall back on the original full-horizon estimator. p p ances Cov(∆i, ∆j) ≤ Var(∆i) Var(∆j). Theorem 5.2. Optimal q under adversarial correlation. 6. Practical implementation Consider the family of estimators presented in Equation 2. 6.1. Tuning the estimator Assume Cov(∆i, ∆i) and Cov(∆i, ∆j) are diagonal. The q 2 ¯ ¯ 2 E[||∆n||2 We estimate the expected squared distances [||Gi − Gj||2] RT-SS estimator with qn ∝ maximizes the ROE E C(n) by maintaining exponential moving averages. We keep track across an adversarial choice of diagonal covariance matri- of the computational budget B used so far by the RT es- ces Cov(∆ , ∆ ) ≤ pCov(∆ , ∆ ) Cov(∆ , ∆ ) . i j kk i i kk j j kk timator, and “tune” the estimator every KC¯(H¯ ) units of ¯ ¯ We next show the RT-RR estimator is optimal computation, where C(H) is the compute required to eval- uate G¯ ¯ , and K is a “tuning frequency” hyperparameter. when Cov(∆i, ∆i) is diagonal and ∆i and ∆j are H independent for j 6= i, and derive the optimal q(N). During tuning, the gradients Gi are computed, the squared norms ||G¯ − G¯ ||2 are computed, and the exponential mov- . i j 2 Theorem 5.3. Optimality of RT-RR under independence ing averages are updated. At the end of tuning, the estimator Consider the family of estimators presented in Eq. 2. Assume is updated using the expected squared norms; i.e. a subse- the ∆ are univariate. When the ∆ are uncorrelated, for j j quence is selected, q is set according to section 5.2 with any importance sampling distribution q, the Russian roulette choice of RT-RR or RT-SS left as a hyperparameter, and the estimator achieves the minimum variance in this family and learning rate is adapted according to section 5.1 thus maximizes the optimization efficiency lower bound. Theorem 5.4. Optimal q under independence. Consider 6.2. Controlling sequence length the family of estimators presented in Equation 2. As- sume Cov(∆i, ∆i) is diagonal and ∆i and ∆j are indepen- Tuning and subsequence selection require computation. q [||∆ ||2 Consider using RT to optimize an objective with an inner dent. The RT-RR estimator with Q(i) ∝ E i 2 ], C(i)−C(i−1) loop of size M. 
If we let G¯i be the gradient of the loss PH 2 where Q(i) = Pr(n ≥ i) = j=i q(j), maximizes the after i inner steps, we must maintain M − M exponential ¯ ¯ 2 ROE. moving averages E||Gi − Gj||2, and compute M gradients G¯i each time we tune the estimator. The computational cost 5.3. Subsequence selection of the tuning step under this scheme is O(M 2). This is unacceptable if we wish our method to scale well with the The scheme for designing RT estimators given in the previ- size of loops we might wish to optimize. ous subsection contains assumptions which will often not hold in practice. To partially alleviate these concerns, we To circumvent this, we choose base subsequences such ¯ i ¯ can design the sequence of iterates over which we apply the that Ci ∝ 2 . This ensures that H = O(log2 M), where M RT estimator to maximize the ROE. is the maximum number of steps we wish to unroll. We must maintain O(log2 M) exponential moving averages. Com- Some sequences may result in more efficient estimators, 2 puting the gradients G¯i during each tuning step requires depending on how the intermediate iterates Gn correlate PH¯ i H¯ compute C = k ∗ 2 . Noting that C¯ ¯ = k ∗ 2 with G. The variance of the estimator, and the ROE, will tune i=1 H PN i N+1 ¯ and that 2 < 2 ∀N yields Ctune < 2C ¯ = 2M. be reduced if we choose a sequence Ln such that Gn is i=1 H positively correlated with G for all n.

We begin with a reference sequence L¯i, G¯i, with cost function C¯, where i, j ∈ N and i, j ≤ H¯ , and where G¯i Efficient Optimization of Loops and Limits
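The pieces above combine into a small tuner. The sketch below (ours, and deliberately simplified: it assumes compute reuse, positive squared-norm estimates, and scores candidates with the RT-SS ROE only) computes the closed-form q of Theorems 5.2 and 5.4 and performs the greedy removal pass of Section 5.3:

```python
import numpy as np

def q_rt_ss(sq_norms, costs):
    """Theorem 5.2: q_n proportional to sqrt(E||Delta_n||^2 / C(n))."""
    q = np.sqrt(sq_norms / costs)
    return q / q.sum()

def q_rt_rr(sq_norms, costs):
    """Theorem 5.4: Q(i) proportional to sqrt(E||Delta_i||^2 / (C(i)-C(i-1))).
    Assumes the resulting Q is nonincreasing, as in the theorem's setting."""
    Q = np.sqrt(sq_norms / np.diff(costs, prepend=0.0))
    Q = Q / Q[0]                              # Q(1) = Pr(N >= 1) = 1
    return Q, -np.diff(np.append(Q, 0.0))     # survival Q and pmf q

def roe_rt_ss(sq_norms, costs):
    """ROE of RT-SS with its optimal q, assuming compute reuse (cost C(N))."""
    q = q_rt_ss(sq_norms, costs)
    return 1.0 / (np.sum(q * costs) * np.sum(sq_norms / q))

def greedy_removal(D, cost_bar):
    """Greedily drop inner indices from [1..H_bar] while the ROE improves.
    D[i, j] ~ EMA estimate of E||G_bar_i - G_bar_j||^2, with index 0 standing
    for G_bar_0 := 0; cost_bar[i - 1] = C_bar(i). The last index is kept."""
    H_bar = len(cost_bar)
    def score(S):
        sq = np.array([D[i, j] for i, j in zip([0] + S[:-1], S)])
        return roe_rt_ss(sq, np.array([cost_bar[i - 1] for i in S], float))
    S = list(range(1, H_bar + 1))
    improved = True
    while improved and len(S) > 1:
        improved = False
        for i in S[:-1]:                      # never drop the final element
            cand = [j for j in S if j != i]
            if score(cand) > score(S):
                S, improved = cand, True
    return S
```

The full procedure additionally tries greedy adding from [H̄] and keeps whichever candidate subsequence, including the fallback S = [H̄], attains the best ROE.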

7. Experiments

For all experiments, we tune learning rates for the full-horizon un-truncated estimator via grid search over all a × 10^{−b}, for a ∈ {1.0, 2.2, 5.5} and b ∈ {0.0, 1.0, 2.0, 3.0, 5.0}. The same learning rates are used for the truncated estimators and (as reference learning rates) for the RT estimators. We do not decay the learning rate. Experiments are run with the random seeds 0, 1, 2, 3, 4, and we plot means and standard deviations.

We use the same hyperparameters for our online tuning procedure in all experiments: the tuning frequency K is set to 5, and the exponential moving average weight α is set to 0.9. These hyperparameters were not extensively tuned. For each problem, we compare deterministic, RT-SS, and RT-RR estimators, each with a range of truncations.

7.1. Lotka-Volterra ODE

We first experiment with variational inference of the parameters of a Lotka-Volterra (LV) ODE. LV ODEs are defined by the predator-prey equations, where u_2 and u_1 are the predator and prey populations, respectively:

    du_1/dt = A u_1 − B u_1 u_2 ,    du_2/dt = C u_1 u_2 − D u_2 .

We aim to infer the parameters λ = [u_1(t=0), u_2(t=0), A, B, C, D]. The true parameters are drawn from U([1.0, 0.4, 0.8, 0.4, 1.5, 0.4], [1.5, 0.6, 1.2, 0.6, 2.0, 0.6]), chosen empirically to ensure stability when solving the equations. We generate ground-truth data by solving the equations using RK4 (a common fourth-order Runge-Kutta method) from t = 0 to t = 5 with 10000 steps. The learner is given access to five equally spaced noisy observations y(t), generated according to y(t) = u(t) + N(0, 0.1).

We place a diagonal Gaussian prior on θ with the same mean and standard deviation as the data-generating distribution. The variational posterior is a diagonal Gaussian q(λ) with mean µ and standard deviation σ. The parameters optimized are θ = [µ̃, σ̃]. We let µ = g(µ̃) and σ = g(σ̃), where g(x̃) = log(1 + e^{x̃}), to ensure positivity. We use a reflecting boundary to ensure positivity of parameter samples from q. The variational posterior is initialized to have mean equal to the prior and standard deviation 0.1.

The loss considered is the negative evidence lower bound (negative ELBO), where the ELBO is

    ELBO(q(θ)) = E_{q(θ)}[Σ_t log p(y(t) | u_θ(t))] − D_KL(q(θ) || p(θ)) .

Above, u_θ(t) is the value of the solution u_θ to the LV ODE with parameters θ, evaluated at time t. We consider a sequence L_n(θ) where, in computing the ELBO, u_θ(t) is approximated by solving the ODE using RK4 with 2^n + 1 steps and linearly interpolating the solution to the 5 observation times. The outer-loop optimization is performed with a batch size of 64 (i.e., 64 samples of θ are drawn at each step) and a learning rate of 0.01. Evaluation is performed with a batch size of 512.

[Figure 1. Lotka-Volterra parameter inference: negative ELBO vs. gradient evaluations (thousands), for un-truncated, RT-SS, and RT-RR estimators with H ∈ {4, 6, 9}.]

Figure 1 shows the loss of the different estimators over the course of training. RT-SS estimators outperform the un-truncated estimator without inducing bias. They are competitive with the truncation H = 6, while avoiding the bias present with the truncation H = 4, at the cost of some variance. Some RT-RR estimators experience issues with optimization, appearing to obtain the same biased solution as the H = 4 truncation.

7.2. MNIST learning rate

We next experiment with meta-optimization of a learning rate on MNIST. We largely follow the procedure used by Wu et al. (2018). We use a feedforward network with two hidden layers of 100 units, with weights initialized from a Gaussian with standard deviation 0.1 and biases initialized to zero. Optimization is performed with a batch size of 100.

The neural network is trained by SGD with momentum, using Polyak averaging, with the momentum parameter fixed to 0.9. We aim to learn a learning rate η_0 and decay λ for the inner-loop optimization. These are initialized to 0.01 and 0.1, respectively. The learning rate at an inner optimization step t is η_t = η_0 (1 + t/5000)^{−λ}.

As in Wu et al. (2018), we pre-train the network for 50 steps with a learning rate of 0.1. L_n is the evaluation loss after 2^n + 1 training steps with a batch size of 100. The evaluation loss is measured over 2^n + 1 validation batches or the entire validation set, whichever is smaller. The outer optimization is performed with a learning rate of 0.01.

[Figure 2. MNIST learning rate meta-optimization: evaluation loss vs. neural network evaluations (thousands).]

RT-SS estimators achieve faster convergence than fixed-truncation estimators. RT-RR estimators suffer from very poor convergence. Truncated estimators appear to obtain biased solutions. The un-truncated estimator achieves a slightly better loss than the RT estimators, but takes significantly longer to converge.
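To make the setup concrete, the following schematic (ours) shows how the sequence L_n is structured in this experiment; a noisy quadratic and its gradient stand in for the two-layer network and MNIST batches, and Polyak averaging is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

def inner_lr(t, eta0, lam):
    """Inner-loop schedule from Section 7.2: eta_t = eta0 * (1 + t/5000)**-lam."""
    return eta0 * (1.0 + t / 5000.0) ** -lam

def L_n(n, eta0, lam):
    """L_n = evaluation loss after 2**n + 1 inner steps of SGD with momentum.
    A noisy quadratic stands in for the MNIST network (our simplification)."""
    w = np.full(10, 5.0)                  # "pre-trained" inner parameters
    v = np.zeros_like(w)
    for t in range(2 ** n + 1):
        g = w + 0.1 * rng.standard_normal(w.shape)   # stochastic gradient
        v = 0.9 * v + g                              # momentum 0.9
        w -= inner_lr(t, eta0, lam) * v
    return 0.5 * (w @ w)                  # stands in for the validation loss

print([round(L_n(n, eta0=0.01, lam=0.1), 3) for n in (2, 4, 6)])
```

In the actual experiment, G_n is obtained by backpropagating L_n through the unrolled inner loop with respect to (η_0, λ), and the RT machinery of Sections 5–6 is applied to that sequence.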

7.3. enwik8 LSTM

Finally, we study a high-dimensional optimization problem: training an LSTM to model sequences on enwik8. These data are the first 100M bytes of a Wikipedia XML dump. There are 205 unique tokens. We use the first 90M, 5M, and 5M characters as the training, evaluation, and test sets.

We build on code from Merity et al. (2017; 2018), available at http://github.com/salesforce/awd-lstm-lm. We train an LSTM with 1000 hidden units and 400-dimensional input and output embeddings. The model has 5.9M parameters. The only regularization is an ℓ_2 penalty on the weights with magnitude 10^{−6}. The optimization is performed with a learning rate of 2.2. This model is not state-of-the-art: our aim is to investigate the performance of RT estimators for optimizing high-dimensional neural networks, rather than to maximize performance on a language modeling task.

We choose L_n to be the mean cross-entropy after unrolling the LSTM training for 2^{n−1} + 1 steps. We choose the horizon H = 9, such that the un-truncated loop has 257 steps, chosen to be close to the 200-length training sequences used by Merity et al. (2018).

[Figure 3. LSTM training on enwik8: training bits-per-character vs. LSTM cell evaluations (thousands).]

Figure 3 shows the training bits-per-character (proportional to the training cross-entropy loss). RT estimators provide some acceleration over the un-truncated H = 9 estimator early in training but, after about 200k cell evaluations, fall back on the un-truncated estimator, subsequently progressing slightly more slowly due to the computational cost of tuning. We conjecture that the diagonal covariance assumption in Section 5 is unsuited to high-dimensional problems and leads to overly conservative estimators.

8. Limitations and future work

Other optimizers. We develop the lower bound on expected improvement for SGD. Important future directions would investigate adaptive and momentum-based SGD methods such as Adam (Kingma & Ba, 2014).

Tuning step. Our method includes a tuning step which requires computation. It might be possible to remove this tuning step by estimating the covariance structure online, using just the values of Ĝ observed during each optimization step.

RT estimators beyond RT-SS and RT-RR. There is a rich family defined by the choices of q and W(n, N). The optimal member depends on the covariance structure between the G_i. We explore RT-SS and RT-RR under strict covariance assumptions. Relaxing these assumptions and optimizing q and W across a wider family could improve adaptive estimator performance for high-dimensional problems such as training RNNs.

Predictive models of the sequence limit. Using any sequence G_n with RT yields an unbiased estimator as long as the sequence is consistent, i.e., its limit G is the true gradient. Combining randomized telescopes with predictive models of the gradients (Jaderberg et al., 2017; Weber et al., 2019) might yield a fast-converging sequence, leading to estimators with low computation and variance.

9. Conclusion

We investigated the use of randomly truncated unbiased gradient estimators for optimizing objectives which involve loops and limits. We proved that these estimators can achieve horizon-independent convergence rates for optimizing loops and limits. We derived adaptive variants which can be tuned online to maximize a lower bound on expected improvement per unit of computation. Experimental results matched the theoretical intuition that the single-sample estimator is more robust than Russian roulette for optimization. The adaptive RT-SS estimator often significantly accelerates optimization, and can otherwise fall back on the un-truncated estimator.

10. Acknowledgements

We would like to thank Matthew Johnson, Peter Orbanz, and James Saunderson for helpful discussions. This work was funded by the Alfred P. Sloan Foundation and NSF IIS-1421780.

References

Arvo, J. and Kirk, D. Particle transport and image synthesis. ACM SIGGRAPH Computer Graphics, 24(4):63–66, 1990.

Balles, L., Romero, J., and Hennig, P. Coupling adaptive batch sizes with learning rates. In Uncertainty in Artificial Intelligence, 2016.

Bengio, S., Bengio, Y., Cloutier, J., and Gecsei, J. On the optimization of a synaptic learning rule. In Conference on Optimality in Artificial and Biological Neural Networks, 1992.

Bottou, L., Curtis, F. E., and Nocedal, J. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018.

Brock, A., Lim, T., Ritchie, J., and Weston, N. SMASH: One-shot model architecture search through hypernetworks. In International Conference on Learning Representations, 2018.

Christianson, B. Reverse accumulation and implicit functions. Optimization Methods and Software, 9(4):307–322, 1998.

Cremer, C., Li, X., and Duvenaud, D. Inference suboptimality in variational autoencoders. arXiv preprint arXiv:1801.03558, 2018.

Fearnhead, P., Papaspiliopoulos, O., and Roberts, G. O. Particle filters for partially observed diffusions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(4):755–777, 2008.

Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, 2017.

Forsythe, G. E. and Leibler, R. A. Matrix inversion by a Monte Carlo method. Mathematics of Computation, 4(31):127–129, 1950.

Franceschi, L., Donini, M., Frasconi, P., and Pontil, M. Forward and reverse gradient-based hyperparameter optimization. In International Conference on Machine Learning, 2017.

Girolami, M., Lyne, A.-M., Strathmann, H., Simpson, D., and Atchadé, Y. Playing Russian roulette with intractable likelihoods. Technical report, Citeseer, 2013.

Jacob, P. E., O'Leary, J., and Atchadé, Y. F. Unbiased Markov chain Monte Carlo with couplings. arXiv preprint arXiv:1708.03625, 2017.

Jaderberg, M., Czarnecki, W. M., Osindero, S., Vinyals, O., Graves, A., Silver, D., and Kavukcuoglu, K. Decoupled neural interfaces using synthetic gradients. In International Conference on Machine Learning, pp. 1627–1635, 2017.

Jameson, A. Aerodynamic design via control theory. Journal of Scientific Computing, 3(3):233–260, 1988.

Kahn, H. Use of different Monte Carlo sampling techniques. 1955.

Kim, Y., Wiseman, S., Miller, A. C., Sontag, D., and Rush, A. M. Semi-amortized variational autoencoders. In International Conference on Machine Learning, 2018.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2014.

Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.

Kuti, J. Stochastic method for the numerical study of lattice fermions. Physical Review Letters, 49(3):183, 1982.

Lorraine, J. and Duvenaud, D. Stochastic hyperparameter optimization through hypernetworks. arXiv preprint arXiv:1802.09419, 2018.

Lyne, A.-M., Girolami, M., Atchadé, Y., Strathmann, H., Simpson, D., et al. On Russian roulette estimates for Bayesian inference with doubly-intractable likelihoods. Statistical Science, 30(4):443–467, 2015.

Maclaurin, D., Duvenaud, D., and Adams, R. Gradient-based hyperparameter optimization through reversible learning. In International Conference on Machine Learning, pp. 2113–2122, 2015.

McLeish, D. A general method for debiasing a Monte Carlo estimator. Monte Carlo Methods and Applications, 2010.

Merity, S., Keskar, N. S., and Socher, R. Regularizing and optimizing LSTM language models. arXiv preprint arXiv:1708.02182, 2017.

Merity, S., Keskar, N. S., and Socher, R. An analysis of neural language modeling at multiple scales. arXiv preprint arXiv:1803.08240, 2018.

Metz, L., Maheswaranathan, N., Nixon, J., Freeman, C. D., and Sohl-Dickstein, J. Learned optimizers that outperform SGD on wall-clock and validation loss. arXiv preprint arXiv:1810.10180, 2018.

Ravi, S. and Larochelle, H. Optimization as a model for few-shot learning. In International Conference on Learning Representations, 2016.

Rhee, C.-h. and Glynn, P. W. A new approach to unbiased estimation for SDEs. In Proceedings of the Winter Simulation Conference, pp. 17. Winter Simulation Conference, 2012.

Rhee, C.-h. and Glynn, P. W. Unbiased estimation with square root convergence for SDE models. Operations Research, 63(5):1026–1043, 2015.

Rudin, W. Principles of Mathematical Analysis, volume 3. McGraw-Hill, New York, 1976.

Rychlik, T. Unbiased nonparametric estimation of the derivative of the mean. Statistics & Probability Letters, 10(4):329–333, 1990.

Rychlik, T. A class of unbiased kernel estimates of a probability density function. Applicationes Mathematicae, 22(4):485–497, 1995.

Schmidhuber, J. Evolutionary principles in self-referential learning, or on learning how to learn: the meta-meta- ... hook. PhD thesis, Technische Universität München, 1987.

Shaban, A., Cheng, C.-A., Hatch, N., and Boots, B. Truncated back-propagation for bilevel optimization. arXiv preprint arXiv:1810.10667, 2018.

Spanier, J. and Gelbard, E. M. Monte Carlo Principles and Neutron Transport Problems. Addison-Wesley Publishing Company, 1969.

Tallec, C. and Ollivier, Y. Unbiasing truncated backpropagation through time. arXiv preprint arXiv:1705.08209, 2017.

Trinh, T. H., Dai, A. M., Luong, T., and Le, Q. V. Learning longer-term dependencies in RNNs with auxiliary losses. arXiv preprint arXiv:1803.00144, 2018.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Neural Information Processing Systems, pp. 5998–6008, 2017.

Wagner, W. Unbiased Monte Carlo evaluation of certain functional integrals. Journal of Computational Physics, 71(1):21–33, 1987.

Weber, T., Heess, N., Buesing, L., and Silver, D. Credit assignment techniques in stochastic computation graphs. arXiv preprint arXiv:1901.01761, 2019.

Wei, C. and Murray, I. Markov chain truncation for doubly-intractable inference. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2016.

Wu, Y., Ren, M., Liao, R., and Grosse, R. Understanding short-horizon bias in stochastic meta-optimization. In International Conference on Learning Representations, 2018.