
Efficient Optimization of Loops and Limits with Randomized Telescoping Sums

Alex Beatson¹   Ryan P. Adams¹

¹Department of Computer Science, Princeton University, Princeton, NJ, USA. Correspondence to: Alex Beatson.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

Abstract

We consider optimization problems in which the objective requires an inner loop with many steps or is the limit of a sequence of increasingly costly approximations. Meta-learning, training recurrent neural networks, and optimization of the solutions to differential equations are all examples of optimization problems with this character. In such problems, it can be expensive to compute the objective value and its gradient, but truncating the loop or using less accurate approximations can induce biases that damage the overall solution. We propose randomized telescope (RT) gradient estimators, which represent the objective as the sum of a telescoping series and sample linear combinations of terms to provide cheap unbiased gradient estimates. We identify conditions under which RT estimators achieve optimization convergence rates independent of the length of the loop or the required accuracy of the approximation. We also derive a method for tuning RT estimators online to maximize a lower bound on the expected decrease in loss per unit of computation. We evaluate our adaptive RT estimators on a range of applications including meta-optimization of learning rates, variational inference of ODE parameters, and training an LSTM to model long sequences.

1. Introduction

Many important optimization problems consist of objective functions that can only be computed iteratively or as the limit of an approximation. Machine learning and scientific computing provide many important examples. In meta-learning, evaluation of the objective typically requires the training of a model, a case of bi-level optimization. When training a model on sequential data or to make decisions over time, each learning step requires looping over time steps. More broadly, in many scientific and engineering applications one wishes to optimize an objective that is defined as the limit of a sequence of approximations, with both fidelity and computational cost increasing according to an index n ≥ 1. Inner-loop examples include: integration by Monte Carlo or quadrature with n evaluation points; solving ordinary differential equations (ODEs) with an Euler or Runge-Kutta method with n steps and O(1/n) step size; and solving partial differential equations (PDEs) with a finite element basis whose size or order increases with n.

Whether the task is fitting parameters to data, identifying the parameters of a natural system, or optimizing the design of a mechanical part, in this work we seek to more rapidly solve problems in which the objective function demands a tradeoff between computational cost and accuracy. We formalize this by considering parameters θ ∈ R^D and a loss function L(θ) that is the uniform limit of a sequence L_n(θ):

    min_θ L(θ) = min_θ lim_{n→H} L_n(θ) .    (1)

Some problems involve a finite horizon H; in other cases H = ∞. We also introduce a cost function C : N_+ → R, nondecreasing in n, to represent the cost of computing L_n and its gradient.

A principal challenge of optimization problems of the form in Eq. 1 is selecting a finite N such that the minimum of the surrogate L_N is close to that of L, but without L_N (or its gradient) being too expensive to evaluate. Choosing a large N can be computationally prohibitive, while choosing a small N may bias optimization. Meta-optimizing learning rates with truncated horizons can select hyperparameters that are wrong by orders of magnitude (Wu et al., 2018). Truncating backpropagation through time for recurrent neural networks (RNNs) favors short-term dependencies (Tallec & Ollivier, 2017). Using too coarse a discretization to solve an ODE or PDE can cause error in the solution and bias outer-loop optimization. These optimization problems thus experience a sharp trade-off between efficient computation and bias.

In this work we propose randomized telescope (RT) estimators, which provide cheap unbiased gradient estimates to allow efficient optimization of these objectives. RT estimators represent the objective or its gradients as a telescoping series of differences between intermediate values, and draw weighted samples from this series to maintain unbiasedness while balancing variance and expected computation.

The paper proceeds as follows. Section 2 introduces RT estimators and their history. Section 3 formalizes RT estimators for optimization and discusses related work in optimization. Section 4 discusses conditions for finite variance and computation, and proves that RT estimators can achieve optimization guarantees for loops and limits. Section 5 discusses designing RT estimators by maximizing a bound on expected improvement per unit of computation. Section 6 describes practical considerations for adapting RT estimators online. Section 7 presents experimental results. Section 8 discusses limitations and future work. Appendix A presents algorithm pseudocode. Appendix B presents proofs. Code may be found at https://github.com/PrincetonLIPS/randomized_telescopes.

2. Unbiased randomized truncation

In this section, we discuss the general problem of estimating limits through randomized truncation. The first subsection presents the randomized telescope family of unbiased estimators, while the second subsection describes their history (dating back to von Neumann and Ulam). In the following sections, we will describe how this technique can be used to provide cheap unbiased gradient estimates and accelerate optimization for many problems.

2.1. Randomized telescope estimators

Consider estimating any quantity Y_H := lim_{n→H} Y_n for n ∈ N_+, where H ∈ N_+ ∪ {∞}. Assume that we can compute Y_n for any finite n ∈ N_+, but since the cost is nondecreasing in n there is a point at which this becomes impractical. Rather than truncating at some fixed value short of the limit, we may find it useful to construct an unbiased estimator of Y_H and take on some randomness in return for reduced computational cost.

Define the backward difference ∆_n and represent the quantity of interest Y_H with a telescoping series:

    Y_H = Σ_{n=1}^H ∆_n ,  where ∆_n = Y_n − Y_{n−1} for n > 1 and ∆_1 = Y_1 .

We may sample from this telescoping series to provide unbiased estimates of Y_H, introducing variance to our estimator in exchange for reduced expected computation. We use the name randomized telescope (RT) to refer to the family of estimators indexed by a distribution q over the integers 1, ..., H (for example, a geometric distribution) and a weight function W(n, N):

    Ŷ_H = Σ_{n=1}^N ∆_n W(n, N) ,  where N ∈ {1, ..., H} ∼ q .    (2)

Proposition 2.1. Unbiasedness of RT estimators. The RT estimators in (2) are unbiased estimators of Y_H as long as

    E_{N∼q}[W(n, N) 1{N ≥ n}] = Σ_{N=n}^H W(n, N) q(N) = 1 ,  for all n .

See Appendix B for a short proof. Although we are coining the term "randomized telescope" to refer to the family of estimators with the form of Eq. 2, the underlying trick has a long history, discussed in the next section. The literature we are aware of focuses on one or both of two special cases of Eq. 2, defined by the choice of weight function W(n, N). We will also focus on these two variants of RT estimators, but we observe that there is a larger family.

Most related work uses the "Russian roulette" estimator, originally discovered and named by von Neumann and Ulam (Kahn, 1955), which we term RT-RR and which has the form

    W(n, N) = 1{N ≥ n} / (1 − Σ_{n′=1}^{n−1} q(n′)) .    (3)

It can be seen as summing the iterates ∆_n while flipping a biased coin at each iterate. With probability q(n), the series is truncated at term N = n. With probability 1 − q(n), the process continues, and all future terms are upweighted by 1/(1 − q(n)) to maintain unbiasedness.

The other important special case of Eq. 2 is the "single sample" estimator RT-SS, referred to as "single term weighted truncation" in Lyne et al. (2015). RT-SS takes

    W(n, N) = 1{n = N} / q(N) .    (4)

This directly importance-samples the differences ∆_n.

We will later prove conditions under which RT-SS and RT-RR should be preferred. Of all estimators in the form of Eq. 2 which obey Proposition 2.1, and for all q, RT-SS minimizes the variance across worst-case diagonal covariances Cov(∆_i, ∆_j). Within the same family, RT-RR achieves minimum variance when ∆_i and ∆_j are independent for all i, j.
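To make Eqs. 2–4 concrete, here is a minimal sketch (ours, not from the paper's code release) that estimates the limit of a toy series with both special cases and checks Proposition 2.1 empirically; the sequence Y_n and the geometric q are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
H = 20                                                  # finite horizon for the demo
Y = lambda n: sum(0.5 ** k for k in range(1, n + 1))    # Y_n -> 1 as n grows
delta = lambda n: Y(n) - Y(n - 1) if n > 1 else Y(1)    # backward differences

q = 0.5 ** np.arange(1, H + 1)                          # q(N): geometric, renormalized
q /= q.sum()
Q = np.cumsum(q[::-1])[::-1]                            # survival Q(n) = Pr(N >= n)

def rt_ss():
    """RT-SS (Eq. 4): importance-sample a single difference."""
    N = rng.choice(H, p=q) + 1
    return delta(N) / q[N - 1]

def rt_rr():
    """RT-RR (Eq. 3): keep all terms up to N, each upweighted by 1/Q(n)."""
    N = rng.choice(H, p=q) + 1
    return sum(delta(n) / Q[n - 1] for n in range(1, N + 1))

for est in (rt_ss, rt_rr):
    draws = np.array([est() for _ in range(100_000)])
    print(est.__name__, round(draws.mean(), 4), "vs exact", round(Y(H), 4))
```

Both sample means agree with Y_H up to Monte Carlo error, while each draw touches only a random prefix of the sequence.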

2.2. A brief history of unbiased randomized truncation

The essential trick—unbiased estimation of a quantity via randomized truncation of a series—dates back to unpublished work of John von Neumann and Stanislaw Ulam. They are credited with using it to develop a Monte Carlo method for matrix inversion in Forsythe & Leibler (1950), and with a method for particle diffusion in Kahn (1955).

It has been applied and rediscovered in a number of fields and applications. The early work from von Neumann and Ulam led to its use in computational physics: in neutron transport problems (Spanier & Gelbard, 1969), for studying lattice fermions (Kuti, 1982), and for estimating functional integrals (Wagner, 1987). In computer graphics, Arvo & Kirk (1990) introduced its use for ray tracing; it is now widely used in rendering software. In statistical estimation, it has been used for unbiased estimation of derivatives (Rychlik, 1990), unbiased kernel density estimation (Rychlik, 1995), doubly-intractable Bayesian posterior distributions (Girolami et al., 2013; Lyne et al., 2015; Wei & Murray, 2016), and unbiased Markov chain Monte Carlo (Jacob et al., 2017).

The underlying trick has been rediscovered by Fearnhead et al. (2008) for unbiased estimation in particle filtering, by McLeish (2010) for debiasing Monte Carlo estimates, by Rhee & Glynn (2012; 2015) for unbiased estimation in stochastic differential equations, and by Tallec & Ollivier (2017) to debias truncated backpropagation. The latter also uses RT estimators for optimization; however, it only considers fixed "Russian roulette"-style randomized telescope estimators and does not consider convergence rates or how to adapt the estimator online (our main contributions).

3. Optimizing loops and limits

In this paper, we consider optimizing functions defined as limits. Consider a problem where, given parameters θ, we can obtain a series of approximate losses L_n(θ) which converges uniformly to some limit lim_{n→H} L_n := L, for n ∈ N_+ and H ∈ N_+ ∪ {∞}. We assume the sequence of gradients with respect to θ, denoted G_n(θ) := ∇_θ L_n(θ), converges uniformly to a limit G(θ). Under this uniform convergence, and assuming convergence of L_n, we have lim_{n→H} ∇_θ L_n(θ) = ∇_θ lim_{n→H} L_n(θ) (see Theorem 7.17 in Rudin (1976)), and so G(θ) is indeed the gradient of our objective L(θ). We assume there is a computational cost C(n) associated with evaluating L_n or G_n, nondecreasing with n, and we wish to efficiently minimize L with respect to θ. Loops are an important special case of this framework, where L_n is the final output resulting from running, e.g., a training loop or an RNN for some number of steps increasing in n.

3.1. Randomized telescopes for optimization

We propose using randomized telescopes as a stochastic gradient estimator for such optimization problems. We aim to accelerate optimization much as mini-batch stochastic gradient descent accelerates optimization for large datasets: using Monte Carlo sampling to decrease the expected cost of each optimization step, at the price of increased variance in the gradient estimates, without introducing bias.

Consider the gradient G(θ) = lim_{n→H} G_n(θ) and the backward difference ∆_n(θ) = G_n(θ) − G_{n−1}(θ), where G_0(θ) := 0, so that G(θ) = Σ_{n=1}^H ∆_n(θ). We use the randomized telescope estimator

    Ĝ(θ) = Σ_{n=1}^N ∆_n(θ) W(n, N) ,    (5)

where N ∈ {1, 2, ..., H} is drawn according to a proposal distribution q, and together W and q satisfy Proposition 2.1.

Note that due to linearity of differentiation, and letting L_0(θ) := 0, we have

    Σ_{n=1}^N ∆_n(θ) W(n, N) = ∇_θ Σ_{n=1}^N (L_n(θ) − L_{n−1}(θ)) W(n, N) .

Thus, when the computation of L_n(θ) can reuse most of the computation performed for L_{n−1}(θ), we can evaluate Ĝ_N(θ) via forward or backward automatic differentiation with cost approximately equal to that of computing G_N(θ); i.e., Ĝ_N(θ) has computation cost ≈ C(N). This most often occurs when evaluating L_n involves an inner loop with a step size which does not change with n—e.g., meta-learning and training RNNs, but not solving ODEs or PDEs. When computing L_n(θ) does not reuse computation, evaluating Ĝ_N(θ) has computation cost Σ_{n=1}^N C(n) 1{W(n, N) ≠ 0}.
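As a minimal illustration of Eq. 5 (our toy sketch, not the released implementation), consider surrogate losses L_n(θ) = (1 − 2^{−n}) θ²/2, whose gradients G_n(θ) = (1 − 2^{−n}) θ converge geometrically to G(θ) = θ. Anticipating Theorem 4.2, we take q(N) ∝ p^N with p = 1/2:

```python
import numpy as np

rng = np.random.default_rng(0)
H = 30                                    # horizon; G_H is essentially exact here

def grad_Ln(theta, n):
    # Gradient of the toy surrogate L_n(theta) = (1 - 2**-n) * theta**2 / 2.
    return (1.0 - 2.0 ** -n) * theta if n > 0 else 0.0

q = 0.5 ** np.arange(1, H + 1)            # q(N) proportional to p**N, p = 1/2
q /= q.sum()

def rt_ss_grad(theta):
    """RT-SS instance of Eq. 5: a single importance-sampled difference.
    Here q matches the decay of Delta_n exactly, so the variance is near zero."""
    N = rng.choice(H, p=q) + 1
    return (grad_Ln(theta, N) - grad_Ln(theta, N - 1)) / q[N - 1]

theta, lr = 5.0, 0.5
for _ in range(50):
    theta -= lr * rt_ss_grad(theta)       # unbiased SGD on the limit objective
print(theta)                              # approaches 0, the minimizer of L
```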
3.2. Related work in optimization

Gradient-based bilevel optimization has seen extensive work in the literature. See Jameson (1988) for an early example of optimizing implicit functions, Christianson (1998) for a mathematical treatment, and Maclaurin et al. (2015) and Franceschi et al. (2017) for recent treatments in machine learning. Shaban et al. (2018) propose truncating only the backward pass, by backpropagating through only the final few optimization steps to reduce memory requirements. Metz et al. (2018) propose linearly increasing the number of inner steps over the course of the outer optimization.

An important case of bi-level optimization is optimization of architectures and hyperparameters. Truncation causes bias, as shown by Wu et al. (2018) for learning rates and by Metz et al. (2018) for neural optimizers.

Bi-level optimization is also used for meta-learning across related tasks (Schmidhuber, 1987; Bengio et al., 1992). Ravi & Larochelle (2016) train an initialization and optimizer, and Finn et al. (2017) only an initialization, to minimize validation loss. The latter paper shows increasing performance with the number of steps used in the inner optimization. However, in practice the number of inner-loop steps must be kept small to allow training over many tasks.

Bi-level optimization can be accelerated by amortization. Variational inference can be seen as bi-level optimization; variational autoencoders (Kingma & Welling, 2014) amortize the inner optimization with a predictive model of the solution to the inner objective. Recent work such as Brock et al. (2018) and Lorraine & Duvenaud (2018) amortizes hyperparameter optimization in a similar fashion.

However, amortizing the inner loop induces bias. Cremer et al. (2018) demonstrate this in VAEs, while Kim et al. (2018) show that in VAEs, combining amortization with truncation by taking several gradient steps on the output of the encoder can reduce this bias. This shows these techniques are orthogonal to our contributions: while fully amortizing the inner optimization causes bias, predictive models of the limit can accelerate convergence of L_n to L.

Our work is also related to work on training sequence models. Tallec & Ollivier (2017) use the Russian roulette estimator to debias truncated backpropagation through time.
They use a fixed geometrically decaying q(N), and show that this improves validation loss for Penn Treebank. They do not consider efficiency of optimization, or methods to automatically set or adapt the hyperparameters of the randomized telescope. Trinh et al. (2018) learn long-term dependencies with auxiliary losses. Other work accelerates optimization of sequence models by replacing recurrent models with models which use convolution or attention (Vaswani et al., 2017), which can be trained more efficiently.

4. Convergence rates with fixed RT estimators

Before considering more complex large-scale problems, we examine the simple RT estimator for stochastic gradient descent on convex problems. We assume that the sequence L_n(θ) and the units for C are chosen such that C(n) = n. We study RT-SS, with q(N) fixed a priori. We consider optimizing parameters θ ∈ K, where K ⊂ R^d is a bounded, convex and compact set with diameter bounded by D. We assume L(θ) is convex in θ, and that the G_n(θ) converge according to ||∆_n||_2 ≤ ψ_n, where ψ_n converges polynomially or geometrically. The quantity of interest is the instantaneous regret R_t = L(θ_t) − min_θ L(θ), where θ_t is the parameter after t steps of SGD.

In this setting, any fixed truncation scheme using L_N as a surrogate for L, with fixed N < H, cannot achieve lim_{t→∞} R_t = 0. Meanwhile, the fully unrolled estimator has computational cost which scales with H. In the many situations where H → ∞, it is impossible to take even a single gradient step with this estimator.

The randomized telescope estimator overcomes these drawbacks by exploiting the fact that G_n converges according to ||∆_n||_2 ≤ ψ_n. As long as q is chosen to have tails no lighter than ψ_n, for sufficiently fast convergence, the resulting RT-SS gradient estimator achieves asymptotic regret bounds invariant to H in terms of convergence rate.

All proofs are deferred to Appendix B. We begin by proving bounds on the variance and expected computation for polynomially decaying q(N) and ψ_n.

Theorem 4.1. Bounded variance and compute with polynomial convergence of ψ. Assume ψ converges according to ψ_n ≤ c/n^p or faster, for constants p > 0 and c > 0. Choose the RT-SS estimator with q(N) ∝ 1/N^{p+1/2}. The resulting estimator Ĝ achieves expected compute C ≤ (H_H^{p−1/2})², where H_H^i is the Hth generalized harmonic number of order i, and expected squared norm E[||Ĝ||²_2] ≤ c²(H_H^{p−1/2})² := G̃². The limit lim_{H→∞} H_H^{p−1/2} is finite iff p > 3/2, in which case it is given by the Riemann zeta function, lim_{H→∞} H_H^{p−1/2} = ζ(p − 1/2). Accordingly, the estimator achieves horizon-agnostic variance and expected compute bounds iff p > 3/2.

The corresponding bounds for geometrically decaying q(N) and ψ_n follow.

Theorem 4.2. Bounded variance and compute with geometric convergence of ψ. Assume ψ_n converges according to ψ_n ≤ c p^n or faster, for 0 < p < 1. Choose RT-SS with q(N) ∝ p^N. The resulting estimator Ĝ achieves expected compute C ≤ (1 − p)^{−2} and expected squared norm E[||Ĝ||²_2] ≤ c²/(1 − p)² := G̃². Thus, the estimator achieves horizon-agnostic variance and expected compute bounds for all 0 < p < 1.

Given a setting and estimator Ĝ from either Theorem 4.1 or 4.2, with corresponding expected compute cost C and upper bound on expected squared norm G̃², the following theorem considers regret guarantees when using this estimator to perform stochastic gradient descent.

Theorem 4.3. Asymptotic regret bounds for optimizing infinite-horizon programs. Assume the setting from Theorem 4.1 or 4.2, and the corresponding C and G̃ from those theorems. Let R_t be the instantaneous regret at the tth step of optimization, R_t = L(θ_t) − min_θ L(θ). Let t(B) be the greatest t such that a computational budget B is not exceeded. Use online gradient descent with step size η_t = D / √(t E[||Ĝ||²_2]). As B → ∞, the asymptotic instantaneous regret is bounded by R_{t(B)} ≤ O(G̃ D √(C/B)), independent of H.

Theorem 4.3 indicates that if G_n converges sufficiently fast and L_n is convex, the RT estimator provably optimizes the limit.
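As a worked instance of Theorem 4.1 (our illustrative numbers, not from the paper):

```latex
% Worked instance of Theorem 4.1 (illustrative numbers, H -> infinity).
% Suppose \psi_n \le c / n^2, so p = 2 > 3/2, and take q(N) \propto N^{-5/2}.
% The bounds involve harmonic numbers of order p - 1/2 = 3/2:
\lim_{H \to \infty} \mathcal{H}_H^{3/2} = \zeta(3/2) \approx 2.612, \qquad
\mathcal{C} \le \zeta(3/2)^2 \approx 6.82, \qquad
\mathbb{E}\!\left[\lVert \hat{G} \rVert_2^2\right] \le c^2 \, \zeta(3/2)^2 .
% Both bounds are finite and independent of H: the estimator evaluates a
% constant expected number of inner-loop terms no matter how long the loop is.
```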
5. Adaptive RT estimators

In practice, the estimator considered in the previous section may have high variance. This section develops an objective for designing such estimators, and derives closed-form W(n, N) and q which maximize this objective given estimates of E[||∆_i||²_2] and assumptions on Cov(∆_i, ∆_j).

5.1. Choosing between unbiased estimators

We propose choosing the estimator which achieves the best lower bound on the expected improvement per unit of compute spent, given smoothness assumptions on the loss. Our analysis builds on that of Balles et al. (2016): they adaptively choose a batch size using batch covariance information, while we choose between arbitrary unbiased gradient estimators using knowledge of those estimators' expected squared norms and computation costs.

Here we assume that the true gradient of the objective, ∇_θ[L(θ)] := ∇_θ (for compactness of notation), is smooth in θ. We do not assume convexity. Note that ∇_θ is not necessarily equal to G(θ), as the loss L(θ) and its gradient G(θ) may be random variables due to sampling of data and/or latent variables.

We assume that L is L-smooth (the gradients of L(θ) are L-Lipschitz); i.e., there exists a constant L > 0 such that ||∇_{θ_b} − ∇_{θ_a}||_2 ≤ L ||θ_b − θ_a||_2 for all θ_a, θ_b ∈ R^d. It follows (Balles et al., 2016; Bottou et al., 2018) that, when performing SGD with an unbiased stochastic gradient estimator Ĝ_t,

    E[L_H(θ_t) − L_H(θ_{t+1})] ≥ E[η_t ∇_{θ_t}^T Ĝ_t(θ_t)] − E[(L η_t² / 2) ||Ĝ_t(θ_t)||²_2] .    (6)

Unbiasedness of Ĝ implies E[∇_{θ_t}^T Ĝ_t(θ_t)] = ||∇_{θ_t}||²_2, thus

    E[L_H(θ_t) − L_H(θ_{t+1})] ≥ E[η_t ||∇_θ||²_2] − E[(L η_t² / 2) ||Ĝ_t(θ_t)||²_2] := J .    (7)

Above, J is a lower bound on the expected improvement in the loss from one optimization step. Given a fixed choice of Ĝ_t(θ_t), how should one pick the learning rate η_t to maximize J, and what is the corresponding lower bound on expected improvement? Optimizing η_t by finding η_t* such that dJ/dη_t = 0 yields

    η_t* = ||∇_θ||²_2 / (L E[||Ĝ_t(θ_t)||²_2]) ∝ 1 / E[||Ĝ_t(θ_t)||²_2]    (8)
    J* = ||∇_θ||⁴_2 / (2L E[||Ĝ_t(θ_t)||²_2]) ∝ 1 / E[||Ĝ_t(θ_t)||²_2] .    (9)

This tells us how to choose η_t if we know L, E[||Ĝ_t||²_2], etc. In practice, it is unlikely that we know L or even ||∇_{θ_t}||_2. We instead assume we have access to some "reference" learning rate η̄_t, which has been optimized for use with a "reference" gradient estimator Ḡ_t with known E[||Ḡ_t||²_2]. When using RT estimators, we may have access to learning rates which have been optimized for use with the un-truncated estimator. Even when we do not know an optimal reference learning rate, this construction means we need only tune one hyperparameter (the reference learning rate), and can still choose between a family of gradient estimators online. Instead of directly maximizing J, we choose η_t for Ĝ by maximizing improvement relative to the reference estimator in terms of J, the lower bound on expected improvement.

Assume that η̄_t has been set optimally for the problem and reference estimator Ḡ up to some constant k; i.e.,

    η̄_t = k ||∇_{θ_t}||²_2 / (L E[||Ḡ_t(θ_t)||²_2]) .    (10)

Then the expected improvement J̄ obtained by the reference estimator Ḡ is

    J̄ = (k − k²/2) ||∇_{θ_t}||⁴_2 / (L E[||Ḡ_t(θ_t)||²_2]) .    (11)

We assume that 0 < k < 2, such that J̄ is positive and the reference has guaranteed expected improvement. Now set the learning rate according to

    η_t = η̄_t E[||Ḡ_t||²_2] / E[||Ĝ_t||²_2] .    (12)

It follows that the expected improvement Ĵ obtained by the estimator Ĝ is

    Ĵ = (E[||Ḡ_t(θ_t)||²_2] / E[||Ĝ_t(θ_t)||²_2]) J̄ .    (13)

Let the expected computation cost of evaluating Ĝ be Ĉ. We want to maximize Ĵ/Ĉ. If we use the above method to choose η_t, we have Ĵ/Ĉ ∝ (Ĉ E[||Ĝ_t(θ_t)||²_2])^{−1}. We call (Ĉ E[||Ĝ_t(θ_t)||²_2])^{−1} the relative optimization efficiency, or ROE. We decide between gradient estimators Ĝ by choosing the one which maximizes the ROE. Once an estimator is chosen, one should choose a learning rate according to (12), relative to a reference learning rate η̄ and estimator Ḡ.
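A minimal sketch of this selection rule (ours; the candidate names, costs, and squared norms are placeholder inputs of the kind maintained by the tuning procedure in Section 6):

```python
def roe(expected_cost, expected_sq_norm):
    """Relative optimization efficiency: 1 / (C_hat * E||G_hat||^2)."""
    return 1.0 / (expected_cost * expected_sq_norm)

def transfer_lr(ref_lr, ref_sq_norm, new_sq_norm):
    """Eq. 12: eta_t = eta_bar_t * E||G_bar||^2 / E||G_hat||^2."""
    return ref_lr * ref_sq_norm / new_sq_norm

# Candidate estimators: name -> (expected cost, estimated E||G_hat||^2).
candidates = {"rt_ss": (2.0, 9.0), "rt_rr": (3.0, 4.0), "full": (257.0, 1.0)}
best = max(candidates, key=lambda name: roe(*candidates[name]))
ref_lr, ref_sq_norm = 0.01, 1.0        # tuned for the full-horizon estimator
print(best, transfer_lr(ref_lr, ref_sq_norm, candidates[best][1]))
```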
We also derive closed- ˆ 2 ˆ 2 2LE[||Gt(θt)||2] E[||Gt(θt)||2] form expressions for the optimal sampling distribution q for each class, under the conditions where that class is optimal. ˆ 2 This tells us how to choose ηt if we know L, ||G||2, etc. In practice, it is unlikely that we know L or even ||∇θ ||2. We We assume that computation can be reused and t ˆ PN instead assume we have access to some “reference” learning evaluating GH = n=1 ∆nW (n, N) has computation rate η¯t, which has been optimized for use with a “reference” cost C(N). As described in Section 3.1, this 2 gradient estimator G¯t, with known [||G¯t|| ]. When using is approximately true for many objectives. When E 2 PN RT estimators, we may have access to learning rates which it is not, the cost of computing n=1 ∆nW (n, N) PN 1 have been optimized for use with the un-truncated estimator. is n=1 C(n) {(W (n, N) 6= 0) or (W (n + 1,N) 6= 0)}. Even when we do not know an optimal reference learning This would penalize the ROE of dense W (n, N) and favor Efficient Optimization of Loops and Limits

¯ ¯ 2 sparse W (n, N), possibly impacting the optimality con- has cost c¯i. We assume knowledge of E[||Gi −Gj||2]. ditions for RT-RR. We mitigate this inaccuracy by subse- We aim to find a S ∈ S, where S is quence selection (described in the following subsection), the set of over the 1,..., H¯ which allows construction of sparse sampling strategies. which have final element S−1 = H¯ . Given S, we take L = L¯ , G = G¯ , C(n) = C¯(S ), H = |S|, We begin by showing the RT-SS estimator is optimal with n Sn n Sn n and ∆n = Gn − Gn−1, where G0 := 0. regards to worst-case diagonal covariances Cov(∆i, ∆j), and deriving the optimal q(N). In practice, we greedily construct S by adding indexes i ¯ Theorem 5.1. Optimality of RT-SS under adversarial cor- to the sequence [H] or removing indexes i from the se- ¯ relation. Consider the family of estimators presented quence [1,..., H]. As this step requires minimal computa- tion, we perform both greedy adding and greedy removal in Equation 2. Assume θ, ∇θ, and G are univari- ate. For any fixed sampling distribution q, the single- and return the S with the best ROE. The minimal subse- ¯ sample RT estimator RT-SS minimizes the worst-case quence S = [H] is always considered, allowing RT estima- variance of Gˆ across an adversarial choice of covari- tors to fall back on the original full-horizon estimator. p p ances Cov(∆i, ∆j) ≤ Var(∆i) Var(∆j). Theorem 5.2. Optimal q under adversarial correlation. 6. Practical implementation Consider the family of estimators presented in Equation 2. 6.1. Tuning the estimator Assume Cov(∆i, ∆i) and Cov(∆i, ∆j) are diagonal. The q 2 ¯ ¯ 2 E[||∆n||2 We estimate the expected squared distances [||Gi − Gj||2] RT-SS estimator with qn ∝ maximizes the ROE E C(n) by maintaining exponential moving averages. We keep track across an adversarial choice of diagonal covariance matri- of the computational budget B used so far by the RT es- ces Cov(∆ , ∆ ) ≤ pCov(∆ , ∆ ) Cov(∆ , ∆ ) . i j kk i i kk j j kk timator, and “tune” the estimator every KC¯(H¯ ) units of ¯ ¯ We next show the RT-RR estimator is optimal computation, where C(H) is the compute required to eval- uate G¯ ¯ , and K is a “tuning frequency” hyperparameter. when Cov(∆i, ∆i) is diagonal and ∆i and ∆j are H independent for j 6= i, and derive the optimal q(N). During tuning, the gradients Gi are computed, the squared norms ||G¯ − G¯ ||2 are computed, and the exponential mov- . i j 2 Theorem 5.3. Optimality of RT-RR under independence ing averages are updated. At the end of tuning, the estimator Consider the family of estimators presented in Eq. 2. Assume is updated using the expected squared norms; i.e. a subse- the ∆ are univariate. When the ∆ are uncorrelated, for j j quence is selected, q is set according to section 5.2 with any importance sampling distribution q, the Russian roulette choice of RT-RR or RT-SS left as a hyperparameter, and the estimator achieves the minimum variance in this family and learning rate is adapted according to section 5.1 thus maximizes the optimization efficiency lower bound. Theorem 5.4. Optimal q under independence. Consider 6.2. Controlling sequence length the family of estimators presented in Equation 2. As- sume Cov(∆i, ∆i) is diagonal and ∆i and ∆j are indepen- Tuning and subsequence selection require computation. q [||∆ ||2 Consider using RT to optimize an objective with an inner dent. The RT-RR estimator with Q(i) ∝ E i 2 ], C(i)−C(i−1) loop of size M. 
If we let G¯i be the gradient of the loss PH 2 where Q(i) = Pr(n ≥ i) = j=i q(j), maximizes the after i inner steps, we must maintain M − M exponential ¯ ¯ 2 ROE. moving averages E||Gi − Gj||2, and compute M gradients G¯i each time we tune the estimator. The computational cost 5.3. Subsequence selection of the tuning step under this scheme is O(M 2). This is unacceptable if we wish our method to scale well with the The scheme for designing RT estimators given in the previ- size of loops we might wish to optimize. ous subsection contains assumptions which will often not hold in practice. To partially alleviate these concerns, we To circumvent this, we choose base subsequences such ¯ i ¯ can design the sequence of iterates over which we apply the that Ci ∝ 2 . This ensures that H = O(log2 M), where M RT estimator to maximize the ROE. is the maximum number of steps we wish to unroll. We must maintain O(log2 M) exponential moving averages. Com- Some sequences may result in more efficient estimators, 2 puting the gradients G¯i during each tuning step requires depending on how the intermediate iterates Gn correlate PH¯ i H¯ compute C = k ∗ 2 . Noting that C¯ ¯ = k ∗ 2 with G. The variance of the estimator, and the ROE, will tune i=1 H PN i N+1 ¯ and that 2 < 2 ∀N yields Ctune < 2C ¯ = 2M. be reduced if we choose a sequence Ln such that Gn is i=1 H positively correlated with G for all n.

We begin with a reference sequence L¯i, G¯i, with cost function C¯, where i, j ∈ N and i, j ≤ H¯ , and where G¯i Efficient Optimization of Loops and Limits
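The pieces above combine into a small tuner. The sketch below (ours, and deliberately simplified: it assumes compute reuse, positive squared-norm estimates, and scores candidates with the RT-SS ROE only) computes the closed-form q of Theorems 5.2 and 5.4 and performs the greedy removal pass of Section 5.3:

```python
import numpy as np

def q_rt_ss(sq_norms, costs):
    """Theorem 5.2: q_n proportional to sqrt(E||Delta_n||^2 / C(n))."""
    q = np.sqrt(sq_norms / costs)
    return q / q.sum()

def q_rt_rr(sq_norms, costs):
    """Theorem 5.4: Q(i) proportional to sqrt(E||Delta_i||^2 / (C(i)-C(i-1))).
    Assumes the resulting Q is nonincreasing, as in the theorem's setting."""
    Q = np.sqrt(sq_norms / np.diff(costs, prepend=0.0))
    Q = Q / Q[0]                              # Q(1) = Pr(N >= 1) = 1
    return Q, -np.diff(np.append(Q, 0.0))     # survival Q and pmf q

def roe_rt_ss(sq_norms, costs):
    """ROE of RT-SS with its optimal q, assuming compute reuse (cost C(N))."""
    q = q_rt_ss(sq_norms, costs)
    return 1.0 / (np.sum(q * costs) * np.sum(sq_norms / q))

def greedy_removal(D, cost_bar):
    """Greedily drop inner indices from [1..H_bar] while the ROE improves.
    D[i, j] ~ EMA estimate of E||G_bar_i - G_bar_j||^2, with index 0 standing
    for G_bar_0 := 0; cost_bar[i - 1] = C_bar(i). The last index is kept."""
    H_bar = len(cost_bar)
    def score(S):
        sq = np.array([D[i, j] for i, j in zip([0] + S[:-1], S)])
        return roe_rt_ss(sq, np.array([cost_bar[i - 1] for i in S], float))
    S = list(range(1, H_bar + 1))
    improved = True
    while improved and len(S) > 1:
        improved = False
        for i in S[:-1]:                      # never drop the final element
            cand = [j for j in S if j != i]
            if score(cand) > score(S):
                S, improved = cand, True
    return S
```

The full procedure additionally tries greedy adding from [H̄] and keeps whichever candidate subsequence, including the fallback S = [H̄], attains the best ROE.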

7. Experiments

For all experiments, we tune learning rates for the full-horizon un-truncated estimator via grid search over all a × 10^{−b}, for a ∈ {1.0, 2.2, 5.5} and b ∈ {0.0, 1.0, 2.0, 3.0, 5.0}. The same learning rates are used for the truncated estimators and (as reference learning rates) for the RT estimators. We do not decay the learning rate. Experiments are run with the random seeds 0, 1, 2, 3, 4, and we plot means and standard deviations.

We use the same hyperparameters for our online tuning procedure in all experiments: the tuning frequency K is set to 5, and the exponential moving average weight α is set to 0.9. These hyperparameters were not extensively tuned. For each problem, we compare deterministic, RT-SS, and RT-RR estimators, each with a range of truncations.

7.1. Lotka-Volterra ODE

We first experiment with variational inference of the parameters of a Lotka-Volterra (LV) ODE. LV ODEs are defined by the predator-prey equations, where u_2 and u_1 are the predator and prey populations, respectively:

    du_1/dt = A u_1 − B u_1 u_2 ,    du_2/dt = C u_1 u_2 − D u_2 .

We aim to infer the parameters λ = [u_1(t=0), u_2(t=0), A, B, C, D]. The true parameters are drawn from U([1.0, 0.4, 0.8, 0.4, 1.5, 0.4], [1.5, 0.6, 1.2, 0.6, 2.0, 0.6]), chosen empirically to ensure stability when solving the equations. We generate ground-truth data by solving the equations using RK4 (a common fourth-order Runge-Kutta method) from t = 0 to t = 5 with 10000 steps. The learner is given access to five equally spaced noisy observations y(t), generated according to y(t) = u(t) + N(0, 0.1).

We place a diagonal Gaussian prior on θ with the same mean and standard deviation as the data-generating distribution. The variational posterior is a diagonal Gaussian q(λ) with mean µ and standard deviation σ. The parameters optimized are θ = [µ̃, σ̃]. We let µ = g(µ̃) and σ = g(σ̃), where g(x̃) = log(1 + e^{x̃}), to ensure positivity. We use a reflecting boundary to ensure positivity of parameter samples from q. The variational posterior is initialized to have mean equal to the prior and standard deviation 0.1.

The loss considered is the negative evidence lower bound (negative ELBO), where the ELBO is

    ELBO(q(θ)) = E_{q(θ)}[Σ_t log p(y(t) | u_θ(t))] − D_KL(q(θ) || p(θ)) .

Above, u_θ(t) is the value of the solution u_θ to the LV ODE with parameters θ, evaluated at time t. We consider a sequence L_n(θ) where, in computing the ELBO, u_θ(t) is approximated by solving the ODE using RK4 with 2^n + 1 steps and linearly interpolating the solution to the 5 observation times. The outer-loop optimization is performed with a batch size of 64 (i.e., 64 samples of θ are drawn at each step) and a learning rate of 0.01. Evaluation is performed with a batch size of 512.

[Figure 1. Lotka-Volterra parameter inference: negative ELBO vs. gradient evaluations (thousands), for un-truncated, RT-SS, and RT-RR estimators with H ∈ {4, 6, 9}.]

Figure 1 shows the loss of the different estimators over the course of training. RT-SS estimators outperform the un-truncated estimator without inducing bias. They are competitive with the truncation H = 6, while avoiding the bias present with the truncation H = 4, at the cost of some variance. Some RT-RR estimators experience issues with optimization, appearing to obtain the same biased solution as the H = 4 truncation.

7.2. MNIST learning rate

We next experiment with meta-optimization of a learning rate on MNIST. We largely follow the procedure used by Wu et al. (2018). We use a feedforward network with two hidden layers of 100 units, with weights initialized from a Gaussian with standard deviation 0.1 and biases initialized to zero. Optimization is performed with a batch size of 100.

The neural network is trained by SGD with momentum, using Polyak averaging, with the momentum parameter fixed to 0.9. We aim to learn a learning rate η_0 and decay λ for the inner-loop optimization. These are initialized to 0.01 and 0.1, respectively. The learning rate at an inner optimization step t is η_t = η_0 (1 + t/5000)^{−λ}.

As in Wu et al. (2018), we pre-train the network for 50 steps with a learning rate of 0.1. L_n is the evaluation loss after 2^n + 1 training steps with a batch size of 100. The evaluation loss is measured over 2^n + 1 validation batches or the entire validation set, whichever is smaller. The outer optimization is performed with a learning rate of 0.01.

[Figure 2. MNIST learning rate meta-optimization: evaluation loss vs. neural network evaluations (thousands).]

RT-SS estimators achieve faster convergence than fixed-truncation estimators. RT-RR estimators suffer from very poor convergence. Truncated estimators appear to obtain biased solutions. The un-truncated estimator achieves a slightly better loss than the RT estimators, but takes significantly longer to converge.
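To make the setup concrete, the following schematic (ours) shows how the sequence L_n is structured in this experiment; a noisy quadratic and its gradient stand in for the two-layer network and MNIST batches, and Polyak averaging is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

def inner_lr(t, eta0, lam):
    """Inner-loop schedule from Section 7.2: eta_t = eta0 * (1 + t/5000)**-lam."""
    return eta0 * (1.0 + t / 5000.0) ** -lam

def L_n(n, eta0, lam):
    """L_n = evaluation loss after 2**n + 1 inner steps of SGD with momentum.
    A noisy quadratic stands in for the MNIST network (our simplification)."""
    w = np.full(10, 5.0)                  # "pre-trained" inner parameters
    v = np.zeros_like(w)
    for t in range(2 ** n + 1):
        g = w + 0.1 * rng.standard_normal(w.shape)   # stochastic gradient
        v = 0.9 * v + g                              # momentum 0.9
        w -= inner_lr(t, eta0, lam) * v
    return 0.5 * (w @ w)                  # stands in for the validation loss

print([round(L_n(n, eta0=0.01, lam=0.1), 3) for n in (2, 4, 6)])
```

In the actual experiment, G_n is obtained by backpropagating L_n through the unrolled inner loop with respect to (η_0, λ), and the RT machinery of Sections 5–6 is applied to that sequence.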

7.3. enwik8 LSTM

Finally, we study a high-dimensional optimization problem: training an LSTM to model sequences on enwik8. These data are the first 100M bytes of a Wikipedia XML dump. There are 205 unique tokens. We use the first 90M, 5M, and 5M characters as the training, evaluation, and test sets.

We build on code from Merity et al. (2017; 2018), available at http://github.com/salesforce/awd-lstm-lm. We train an LSTM with 1000 hidden units and 400-dimensional input and output embeddings. The model has 5.9M parameters. The only regularization is an ℓ_2 penalty on the weights with magnitude 10^{−6}. The optimization is performed with a learning rate of 2.2. This model is not state-of-the-art: our aim is to investigate the performance of RT estimators for optimizing high-dimensional neural networks, rather than to maximize performance on a language modeling task.

We choose L_n to be the mean cross-entropy after unrolling the LSTM training for 2^{n−1} + 1 steps. We choose the horizon H = 9, such that the un-truncated loop has 257 steps, chosen to be close to the 200-length training sequences used by Merity et al. (2018).

[Figure 3. LSTM training on enwik8: training bits-per-character vs. LSTM cell evaluations (thousands).]

Figure 3 shows the training bits-per-character (proportional to the training cross-entropy loss). RT estimators provide some acceleration over the un-truncated H = 9 estimator early in training but, after about 200k cell evaluations, fall back on the un-truncated estimator, subsequently progressing slightly more slowly due to the computational cost of tuning. We conjecture that the diagonal covariance assumption in Section 5 is unsuited to high-dimensional problems and leads to overly conservative estimators.

8. Limitations and future work

Other optimizers. We develop the lower bound on expected improvement for SGD. Important future directions would investigate adaptive and momentum-based SGD methods such as Adam (Kingma & Ba, 2014).

Tuning step. Our method includes a tuning step which requires computation. It might be possible to remove this tuning step by estimating the covariance structure online, using just the values of Ĝ observed during each optimization step.

RT estimators beyond RT-SS and RT-RR. There is a rich family defined by the choices of q and W(n, N). The optimal member depends on the covariance structure between the G_i. We explore RT-SS and RT-RR under strict covariance assumptions. Relaxing these assumptions and optimizing q and W across a wider family could improve adaptive estimator performance for high-dimensional problems such as training RNNs.

Predictive models of the sequence limit. Using any sequence G_n with RT yields an unbiased estimator as long as the sequence is consistent, i.e., its limit G is the true gradient. Combining randomized telescopes with predictive models of the gradients (Jaderberg et al., 2017; Weber et al., 2019) might yield a fast-converging sequence, leading to estimators with low computation and variance.

9. Conclusion

We investigated the use of randomly truncated unbiased gradient estimators for optimizing objectives which involve loops and limits. We proved that these estimators can achieve horizon-independent convergence rates for optimizing loops and limits. We derived adaptive variants which can be tuned online to maximize a lower bound on expected improvement per unit of computation. Experimental results matched the theoretical intuition that the single-sample estimator is more robust than Russian roulette for optimization. The adaptive RT-SS estimator often significantly accelerates optimization, and can otherwise fall back on the un-truncated estimator.

10. Acknowledgements

We would like to thank Matthew Johnson, Peter Orbanz, and James Saunderson for helpful discussions. This work was funded by the Alfred P. Sloan Foundation and NSF IIS-1421780.

References

Arvo, J. and Kirk, D. Particle transport and image synthesis. ACM SIGGRAPH Computer Graphics, 24(4):63–66, 1990.

Balles, L., Romero, J., and Hennig, P. Coupling adaptive batch sizes with learning rates. In Uncertainty in Artificial Intelligence, 2016.

Bengio, S., Bengio, Y., Cloutier, J., and Gecsei, J. On the optimization of a synaptic learning rule. In Conference on Optimality in Artificial and Biological Neural Networks, 1992.

Bottou, L., Curtis, F. E., and Nocedal, J. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018.

Brock, A., Lim, T., Ritchie, J., and Weston, N. SMASH: One-shot model architecture search through hypernetworks. In International Conference on Learning Representations, 2018.

Christianson, B. Reverse accumulation and implicit functions. Optimization Methods and Software, 9(4):307–322, 1998.

Cremer, C., Li, X., and Duvenaud, D. Inference suboptimality in variational autoencoders. arXiv preprint arXiv:1801.03558, 2018.

Fearnhead, P., Papaspiliopoulos, O., and Roberts, G. O. Particle filters for partially observed diffusions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(4):755–777, 2008.

Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, 2017.

Forsythe, G. E. and Leibler, R. A. Matrix inversion by a Monte Carlo method. Mathematics of Computation, 4(31):127–129, 1950.

Franceschi, L., Donini, M., Frasconi, P., and Pontil, M. Forward and reverse gradient-based hyperparameter optimization. In International Conference on Machine Learning, 2017.

Girolami, M., Lyne, A.-M., Strathmann, H., Simpson, D., and Atchadé, Y. Playing Russian roulette with intractable likelihoods. Technical report, Citeseer, 2013.

Jacob, P. E., O'Leary, J., and Atchadé, Y. F. Unbiased Markov chain Monte Carlo with couplings. arXiv preprint arXiv:1708.03625, 2017.

Jaderberg, M., Czarnecki, W. M., Osindero, S., Vinyals, O., Graves, A., Silver, D., and Kavukcuoglu, K. Decoupled neural interfaces using synthetic gradients. In International Conference on Machine Learning, pp. 1627–1635, 2017.

Jameson, A. Aerodynamic design via control theory. Journal of Scientific Computing, 3(3):233–260, 1988.

Kahn, H. Use of different Monte Carlo sampling techniques. 1955.

Kim, Y., Wiseman, S., Miller, A. C., Sontag, D., and Rush, A. M. Semi-amortized variational autoencoders. In International Conference on Machine Learning, 2018.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2014.

Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.

Kuti, J. Stochastic method for the numerical study of lattice fermions. Physical Review Letters, 49(3):183, 1982.

Lorraine, J. and Duvenaud, D. Stochastic hyperparameter optimization through hypernetworks. arXiv preprint arXiv:1802.09419, 2018.

Lyne, A.-M., Girolami, M., Atchadé, Y., Strathmann, H., Simpson, D., et al. On Russian roulette estimates for Bayesian inference with doubly-intractable likelihoods. Statistical Science, 30(4):443–467, 2015.

Maclaurin, D., Duvenaud, D., and Adams, R. Gradient-based hyperparameter optimization through reversible learning. In International Conference on Machine Learning, pp. 2113–2122, 2015.

McLeish, D. A general method for debiasing a Monte Carlo estimator. Monte Carlo Methods and Applications, 2010.

Merity, S., Keskar, N. S., and Socher, R. Regularizing and optimizing LSTM language models. arXiv preprint arXiv:1708.02182, 2017.

Merity, S., Keskar, N. S., and Socher, R. An analysis of neural language modeling at multiple scales. arXiv preprint arXiv:1803.08240, 2018.

Metz, L., Maheswaranathan, N., Nixon, J., Freeman, C. D., and Sohl-Dickstein, J. Learned optimizers that outperform SGD on wall-clock and validation loss. arXiv preprint arXiv:1810.10180, 2018.

Ravi, S. and Larochelle, H. Optimization as a model for few-shot learning. In International Conference on Learning Representations, 2016.

Rhee, C.-h. and Glynn, P. W. A new approach to unbiased estimation for SDEs. In Proceedings of the Winter Simulation Conference, pp. 17. Winter Simulation Conference, 2012.

Rhee, C.-h. and Glynn, P. W. Unbiased estimation with square root convergence for SDE models. Operations Research, 63(5):1026–1043, 2015.

Rudin, W. Principles of Mathematical Analysis, volume 3. McGraw-Hill, New York, 1976.

Rychlik, T. Unbiased nonparametric estimation of the derivative of the mean. Statistics & Probability Letters, 10(4):329–333, 1990.

Rychlik, T. A class of unbiased kernel estimates of a probability density function. Applicationes Mathematicae, 22(4):485–497, 1995.

Schmidhuber, J. Evolutionary principles in self-referential learning, or on learning how to learn: the meta-meta- ... hook. PhD thesis, Technische Universität München, 1987.

Shaban, A., Cheng, C.-A., Hatch, N., and Boots, B. Truncated back-propagation for bilevel optimization. arXiv preprint arXiv:1810.10667, 2018.

Spanier, J. and Gelbard, E. M. Monte Carlo Principles and Neutron Transport Problems. Addison-Wesley Publishing Company, 1969.

Tallec, C. and Ollivier, Y. Unbiasing truncated backpropagation through time. arXiv preprint arXiv:1705.08209, 2017.

Trinh, T. H., Dai, A. M., Luong, T., and Le, Q. V. Learning longer-term dependencies in RNNs with auxiliary losses. arXiv preprint arXiv:1803.00144, 2018.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Neural Information Processing Systems, pp. 5998–6008, 2017.

Wagner, W. Unbiased Monte Carlo evaluation of certain functional integrals. Journal of Computational Physics, 71(1):21–33, 1987.

Weber, T., Heess, N., Buesing, L., and Silver, D. Credit assignment techniques in stochastic computation graphs. arXiv preprint arXiv:1901.01761, 2019.

Wei, C. and Murray, I. Markov chain truncation for doubly-intractable inference. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2016.

Wu, Y., Ren, M., Liao, R., and Grosse, R. Understanding short-horizon bias in stochastic meta-optimization. In International Conference on Learning Representations, 2018.