Industrial and Systems Engineering

Adaptive Stochastic Optimization

Frank E. Curtis¹ and Katya Scheinberg²

¹Department of Industrial and Systems Engineering, Lehigh University
²Department of Operations Research and Information Engineering, Cornell University

ISE Technical Report 20T-001

Adaptive Stochastic Optimization

Frank E. Curtis    Katya Scheinberg

January 18, 2020

Abstract. Optimization lies at the heart of machine learning and signal processing. Contemporary approaches based on the stochastic gradient method are non-adaptive in the sense that their implementation employs prescribed parameter values that need to be tuned for each application. This article summarizes recent research and motivates future work on adaptive stochastic optimization methods, which have the potential to offer significant computational savings when training large-scale systems.

1 Introduction

The successes of stochastic optimization algorithms for solving problems arising in machine learning (ML) and signal processing (SP) are now widely recognized. Scores of articles have appeared in recent years as researchers aim to build upon fundamental methodologies such as the stochastic gradient method (SG) [25]. The motivation and scope of much of this effort has been captured in various books and review articles; see, e.g., [4, 12, 16, 19, 27].

Despite these advances and this accumulation of knowledge, there remain significant challenges in the use of stochastic optimization algorithms in practice. The dirty secret in the use of these algorithms is the tremendous computational cost required to tune an algorithm for each application. For large-scale, real-world systems, tuning an algorithm to solve a single problem might require weeks or months of effort on a supercomputer before the algorithm performs well. To appreciate the energy consumed in doing so, the authors of [1] list multiple recent articles in which training a model for a single task requires thousands of CPU-days, and remark how the energy consumption of each 10^4 CPU-days of computation is comparable to that of driving from Los Angeles to San Francisco with 50 Toyota Camrys.

One avenue for avoiding expensive tuning efforts is to employ adaptive optimization algorithms. Long the focus in the deterministic optimization community, where they have seen widespread success in practice, such algorithms become significantly more difficult to design for the stochastic regime in which many modern problems reside, including those arising in large-scale ML and SP.

The purpose of this article is to summarize recent work and motivate continued research on the design and analysis of adaptive stochastic optimization methods. In particular, we present an analytical framework—new for the context of adaptive deterministic optimization—that sets the stage for establishing convergence rate guarantees for adaptive stochastic optimization techniques. With this framework in hand, we remark on important open questions related to how it can be extended further for the design of new methods. We also discuss challenges and opportunities for their use in real-world systems.

2 Background

Many problems in ML and SP are formulated as optimization problems. For example, given a data vector y ∈ R^m from an unknown distribution, one often desires to have a vector of model parameters x ∈ R^n such that a composite objective function f : R^n → R is minimized, as in

    min_{x ∈ R^n} f(x),  where  f(x) := E_y[φ(x, y)] + λρ(x).

Here, the function φ : R^{n+m} → R defines the data fitting term E_y[φ(x, y)]. For example, in supervised machine learning, the vector y may represent the input and output from an unknown mapping and one aims to find x to minimize the discrepancy between the output vector and the predicted value captured by φ. Alternatively, the vector y may represent a noisy signal measurement and one may aim to find x that filters out the noise to reveal the true signal. The function ρ : R^n → R with weight λ ∈ [0, ∞) is included as a regularizer. This can be used to induce desirable properties of the vector x, such as sparsity, and/or to help avoid over-fitting a particular set of data vectors that are used when (approximately) minimizing f. Supposing that instances of y can be generated—one-by-one or in mini-batches, essentially ad infinitum—the problem to minimize f becomes a stochastic problem over x.

Traditional algorithms for minimizing f are often very simple to understand and implement. For example, given a solution estimate x_k, the well-known and celebrated stochastic gradient method (SG) [25] computes the next estimate as x_{k+1} ← x_k − α_k g_k, where g_k estimates the gradient of f at x_k by taking a random sample y_{i_k} and setting g_k ← ∇_x φ(x_k, y_{i_k}) + λ∇_x ρ(x_k) (or by taking a mini-batch of samples and setting g_k as the average sampled gradient). This value estimates the gradient since E_k[g_k] = ∇f(x_k), where E_k[·] conditions on the history of the behavior of the algorithm up to iteration k ∈ N. Under reasonable assumptions about the stochastic gradient estimates and with a prescribed sequence of stepsize parameters {α_k}, such an algorithm enjoys good convergence properties, which ensure that {x_k} converges in probability to a minimizer, or at least a stationary point, of f.
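To make the SG update concrete, the following is a minimal sketch. The helpers sample_grad (an unbiased mini-batch estimate of the data-fitting gradient) and grad_rho (the gradient of the regularizer ρ) are hypothetical stand-ins, not specified in the text above.

```python
import numpy as np

def sg(x0, sample_grad, grad_rho, alphas, lam=1e-4):
    """Minimal (non-adaptive) SG sketch for f(x) = E_y[phi(x, y)] + lam * rho(x).

    sample_grad(x): unbiased estimate of the data-fitting gradient (assumed helper)
    grad_rho(x):    gradient of the regularizer rho (assumed helper)
    alphas:         prescribed stepsize sequence {alpha_k}
    """
    x = np.asarray(x0, dtype=float).copy()
    for alpha_k in alphas:
        g_k = sample_grad(x) + lam * grad_rho(x)  # g_k with E_k[g_k] = grad f(x_k)
        x = x - alpha_k * g_k                     # x_{k+1} <- x_k - alpha_k g_k
    return x
```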

A practical issue in the use of SG is that the variance of the stochastic gradient estimates, i.e., E_k[‖g_k − ∇f(x_k)‖₂²], can be large, which inhibits the algorithm from attaining convergence rate guarantees on par with those for first-order algorithms in deterministic settings. To address this, variance reduction techniques have been proposed and analyzed, such as those used in the algorithms SVRG, SAGA, and other methods [13, 14, 17, 21, 26]. That said, SG and these variants of it are inherently nonadaptive in the sense that each iteration involves a prescribed number of data samples to compute g_k in addition to a prescribed sequence of stepsizes {α_k}. Determining which parameter values—defining the mini-batch sizes, stepsizes, and other factors—work well for a particular problem is a nontrivial task. Tuning these parameters means that problems cannot be solved once; they need to be solved numerous times until reasonable parameter values are determined for future use on new data.

3 Illustrative Example

To illustrate the use of adaptivity in stochastic optimization, consider a problem of binary classification by logistic regression using the well-known MNIST dataset. In particular, consider the minimization of a logistic loss plus an ℓ₂-norm squared regularizer (with λ = 10^{-4}) in order to classify images as either of the digit 5 or not.

Employing SG with a mini-batch size of 64 and different fixed stepsizes, one obtains the plot of testing accuracy over 10 epochs given on the left in Figure 1. One finds that for a stepsize of α_k = 1 for all k ∈ N, the model achieves testing accuracy around 98%. However, for a stepsize of α_k = 0.01 for all k ∈ N, the algorithm stagnates and never achieves accuracy much above 90%.

By comparison, we also ran an adaptive method. This method, like SG, begins with a mini-batch size of 64 and the stepsize parameter indicated in the plot on the right in Figure 1. However, in each iteration, it checks the value of the objective (only over the current mini-batch) at the current iterate x_k and trial iterate x_k − α_k g_k. If the mini-batch objective would not reduce sufficiently as a result of the trial step, then the step is not taken, the stepsize parameter is reduced by a factor, and the mini-batch size is increased by a factor. This results in a more conservative stepsize with a more accurate gradient estimate in the subsequent iteration. Otherwise, if the mini-batch objective would reduce sufficiently with the trial step, then the step is taken, the stepsize parameter is increased by a factor, and the mini-batch size is reduced by a factor. Despite the data accesses required by this adaptive algorithm to evaluate mini-batch objective values in each iteration, the attained testing accuracy with all initializations competes with that attained by the best SG run.

This experiment demonstrates the potentially significant savings in computational costs offered by adaptive stochastic optimization methods. While one might be able to achieve good practical performance with a nonadaptive method, it might only come after expensive tuning efforts. By contrast, an adaptive algorithm can perform well without such expensive tuning.

Figure 1: SG vs. an adaptive stochastic method on regularized logistic regression on MNIST.

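One iteration of the adaptive scheme just described might be sketched as follows. The text above does not specify the sufficient-decrease test or the update factors, so this sketch adopts an Armijo-style test and a factor of two as plausible choices; f_batch, grad_batch, and sampler are hypothetical helpers, and iterates are assumed to be NumPy arrays.

```python
def adaptive_iteration(x, alpha, batch_size, f_batch, grad_batch, sampler,
                       gamma=2.0, eta=1e-4):
    """One iteration: accept the trial step only if the mini-batch objective
    decreases sufficiently; adjust the stepsize and mini-batch size accordingly."""
    batch = sampler(batch_size)                   # draw the current mini-batch
    g = grad_batch(x, batch)
    trial = x - alpha * g
    decrease = f_batch(x, batch) - f_batch(trial, batch)
    if decrease >= eta * alpha * (g @ g):         # sufficient mini-batch decrease
        return trial, gamma * alpha, max(batch_size // 2, 1)  # success: bolder step, smaller batch
    return x, alpha / gamma, 2 * batch_size                   # failure: smaller step, larger batch
```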
4 Framework for Analyzing Adaptive Deterministic Methods

Rigorous development of adaptive stochastic optimization methods requires a solid foundation in terms of convergence rate guarantees. The types of adaptive methods that have enjoyed great success in the realm of deterministic optimization are (a) trust region, (b) line search, and (c) regularized Newton methods. Extending these techniques to the stochastic regime is a highly nontrivial task. After all, these methods traditionally require accurate function information at each iterate, which is what they use to adapt their behavior. When an oracle can only return stochastic function estimates, comparing function values to make adaptive algorithmic decisions can be problematic. In particular, when objective values are merely estimated, poor decisions can be made, and the combined effects of these poor decisions can be difficult to estimate and control.

As a first step toward showing how these challenges can be overcome, let us establish a general framework for convergence analysis for adaptive deterministic optimization. This will lay a foundation for the framework that we present for adaptive stochastic optimization in §5.

The analytical framework presented here is new for the deterministic optimization literature. A typical convergence analysis for adaptive deterministic optimization partitions the set of iterations into successful and unsuccessful ones. Nonzero progress in reducing the objective function is made on successful iterations, whereas unsuccessful iterations merely result in an update of a model or algorithmic parameter to promote success in the subsequent iteration. As such, a convergence rate guarantee results from a lower bound on the progress made in each successful iteration and a limit on the number of unsuccessful iterations that can occur between successful ones. By contrast, in the framework presented here, the analysis is structured around a measure in which progress is made in all iterations.

In the remainder of this section, we consider all three aforementioned types of adaptive algorithms under the assumption that f is continuously differentiable with ∇f being Lipschitz continuous with constant L ∈ (0, ∞). Each of these methods follows the general algorithmic framework that we state as Algorithm 1. The role of the sequence {α_k} ≥ 0 in the algorithm is to control the length of the trial steps {s_k(α_k)}. In particular, as seen in our discussion of each type of adaptive algorithm, one presumes that for a given model m_k the norm of s_k(α) is directly proportional to the magnitude of α. Another assumption—and the reason that we refer to this as a deterministic optimization framework—is that the models agree with the objective at least up to first-order derivatives, i.e., m_k(x_k) = f(x_k) and ∇m_k(x_k) = ∇f(x_k) for all k ∈ N.

Our analysis involves three central ingredients. We define them here in such a way that they are easily generalized when we consider our adaptive stochastic framework later on.

• {Φ_k} ≥ 0 is a sequence whose role is to measure progress of the algorithm. The choice of this sequence may vary by type of algorithm and assumptions on f.

• {W_k} is a sequence of indicators; specifically, for all k ∈ N, if iteration k is successful, then W_k = 1, and W_k = −1 otherwise.

• T_ε, the stopping time, is the index of the first iterate that satisfies a desired convergence criterion parameterized by ε.

These quantities are not part of the algorithm itself, and therefore do not influence the iterates. They are merely tools of the analysis. At the heart of the analysis is the goal to show that the following condition holds.

Condition 4.1 The following statements hold with respect to {(Φ_k, α_k, W_k)} and T_ε.

1. There exists a scalar α_ε ∈ (0, ∞) such that, for each k ∈ N such that α_k ≤ γα_ε, the iteration is guaranteed to be successful, i.e., W_k = 1. Therefore, α_k ≥ α_ε for all k ∈ N.

2. There exists a nondecreasing function h_ε : [0, ∞) → (0, ∞) and a scalar Θ ∈ (0, ∞) such that, for all k < T_ε, Φ_k − Φ_{k+1} ≥ Θh_ε(α_k).

The goal to satisfy Condition 4.1 is motivated by the fact that, if it holds, then it is trivial to derive (since Φ_k ≥ 0 for all k ∈ N) that

    T_ε ≤ Φ_0 / (Θh_ε(α_ε)).    (1)

For generality, we have written α_ε and h_ε as parameterized by ε. However, in the context of different algorithms, one or the other of these quantities may be independent of ε. Throughout our analysis, we denote f_* := inf_{x ∈ R^n} f(x) > −∞.
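Concretely, (1) follows by telescoping the inequality in part 2 of Condition 4.1, using that h_ε is nondecreasing and that α_k ≥ α_ε:

```latex
\Phi_0 \;\ge\; \Phi_0 - \Phi_{T_\varepsilon}
       \;=\; \sum_{k=0}^{T_\varepsilon - 1} \left( \Phi_k - \Phi_{k+1} \right)
       \;\ge\; \sum_{k=0}^{T_\varepsilon - 1} \Theta\, h_\varepsilon(\alpha_k)
       \;\ge\; T_\varepsilon\, \Theta\, h_\varepsilon(\alpha_\varepsilon).
```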

Algorithm 1 Adaptive deterministic framework

Initialization
Choose constants η ∈ (0, 1), γ ∈ (1, ∞), and ᾱ ∈ (0, ∞). Choose an initial iterate x_0 ∈ R^n and stepsize parameter α_0 ∈ (0, ᾱ].

1. Determine model and compute step
Choose a local model m_k of f around x_k. Compute a step s_k(α_k) such that the model reduction m_k(x_k) − m_k(x_k + s_k(α_k)) ≥ 0 is sufficiently large.

2. Check for sufficient reduction in f
Check if the reduction f(x_k) − f(x_k + s_k(α_k)) is sufficiently large relative to the model reduction m_k(x_k) − m_k(x_k + s_k(α_k)) using a condition parameterized by η.

3. Successful iteration
If sufficient reduction has been attained (along with other potential requirements), then set x_{k+1} ← x_k + s_k(α_k) and α_{k+1} ← min{γα_k, ᾱ}.

4. Unsuccessful iteration
Otherwise, set x_{k+1} ← x_k and α_{k+1} ← γ^{-1}α_k.

5. Next iteration
Set k ← k + 1.

4.1 Classical trust region

In a classical trust region (TR) method, the model m_k is chosen as at least a first-order accurate Taylor series approximation of f at x_k, and the step s_k(α_k) is computed as a minimizer of m_k in a ball of radius α_k centered at x_k. In Step 2, the sufficient reduction condition is chosen as

    (f(x_k) − f(x_k + s_k(α_k))) / (m_k(x_k) − m_k(x_k + s_k(α_k))) ≥ η.    (2)

Figure 2 shows the need to distinguish between successful and unsuccessful iterations in a TR method. Even though the model is (at least) first-order accurate, a large trust region radius may allow a large enough step such that the reduction predicted by the model does not well represent the reduction in the function itself. We contrast these illustrations later with situations in the stochastic setting, which are complicated by the fact that the model might be inaccurate regardless of the size of the step.

For simplicity in our discussions, for iteration k ∈ N to be successful, we impose the additional condition that α_k ≤ τ‖∇f(x_k)‖ for some suitably large constant τ ∈ (0, ∞).¹ This condition is actually not necessary for the deterministic setting, but is needed in the stochastic setting to ensure that the trial step is not too large compared to the size of the gradient estimate. We impose it now in the deterministic setting for consistency.

For this TR instance of Algorithm 1, consider the first-order ε-stationarity stopping time

    T_ε := min{k ∈ N : ‖∇f(x_k)‖ ≤ ε},    (3)

corresponding to which we define

    Φ_k := ν(f(x_k) − f_*) + (1 − ν)α_k²    (4)

for some ν ∈ (0, 1) (to be determined below). Standard TR analysis involves two key results; see, e.g., [22]. First, while ‖∇f(x_k)‖ > ε, if α_k is sufficiently small, then iteration k is successful, i.e., W_k = 1; in particular, α_k ≥ α_ε := c_1ε for all k ∈ N for some sufficiently small c_1 ∈ (0, ∞) dependent on L, η, and γ. In addition, if iteration k is successful, then the ratio condition (2) and our imposed condition α_k ≤ τ‖∇f(x_k)‖ yield

    f(x_k) − f(x_{k+1}) ≥ ηc_2α_k²

for some c_2 ∈ (0, ∞), meaning that

    Φ_k − Φ_{k+1} ≥ νηc_2α_k² − (1 − ν)(γ² − 1)α_k²;

otherwise, if iteration k is unsuccessful, then

    Φ_k − Φ_{k+1} = (1 − ν)(1 − γ^{-2})α_k².

We aim to show in either case that Φ_k − Φ_{k+1} ≥ Θα_k² for some Θ > 0. This can be done by choosing ν sufficiently close to 1 such that

    νηc_2 ≥ (1 − ν)(γ² − γ^{-2}).

In this manner, it follows from the observations above that Condition 4.1 holds with h_ε(α_k) := α_k² and Θ := (1 − ν)(1 − γ^{-2}). Thus, by (1), the number of iterations required until a first-order ε-stationary point is reached satisfies

    T_ε ≤ (ν(f(x_0) − f_*) + (1 − ν)α_0²) / ((1 − ν)(1 − γ^{-2})c_1²ε²).

This shows that T_ε = O(ε^{-2}).

¹Throughout the paper, consider all norms to be ℓ₂.
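As an illustration, the following is a minimal Python sketch of this TR instance of Algorithm 1, using a purely linear model so that the step is the Cauchy point and the model reduction is α_k‖∇f(x_k)‖; all names and constants are illustrative, not taken from the text.

```python
import numpy as np

def adaptive_tr(f, grad, x0, alpha0=1.0, alpha_bar=10.0,
                eta=0.1, gamma=2.0, tau=1e2, eps=1e-4):
    """Sketch of Algorithm 1 as a classical TR method with a first-order model."""
    x, alpha = np.asarray(x0, dtype=float).copy(), alpha0
    while np.linalg.norm(grad(x)) > eps:       # stopping time (3)
        g = grad(x)
        gnorm = np.linalg.norm(g)
        s = -(alpha / gnorm) * g               # minimizer of the linear model in the ball
        pred = alpha * gnorm                   # model reduction m_k(x_k) - m_k(x_k + s)
        ared = f(x) - f(x + s)                 # actual reduction in f
        if ared >= eta * pred and alpha <= tau * gnorm:   # test (2) plus extra condition
            x, alpha = x + s, min(gamma * alpha, alpha_bar)   # successful iteration
        else:
            alpha = alpha / gamma              # unsuccessful iteration
    return x
```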


Figure 2: Illustration of successful (left) and unsuccessful (right) steps in a trust region method.

4.2 Classical line search

In a classical line search (LS) method, the model is again chosen as at least a first-order accurate Taylor series approximation of f at x_k, with care taken to ensure it is convex so that a minimizer of it exists. The trial step s_k(α_k) is defined as α_k d_k for some direction of sufficient descent d_k. In Step 2, the sufficient reduction condition often includes the Armijo condition

    f(x_k) − f(x_k + s_k(α_k)) ≥ −η∇f(x_k)^T s_k(α_k).

As is common, suppose m_k is chosen and d_k is computed such that, for a successful iteration, one finds for some c_3 ∈ (0, ∞), dependent on L and the angle between d_k and −∇f(x_k), that

    f(x_k) − f(x_{k+1}) ≥ ηc_3α_k‖∇f(x_k)‖².

Using common techniques, this can be ensured with α_k ≥ α for all k ∈ N for some α ∈ (0, ∞) dependent on L, η, and γ. Similarly as in §4.1, let us also impose that ‖d_k‖ ≤ β‖∇f(x_k)‖ for some suitably large β ∈ (0, ∞).

For this LS instance of Algorithm 1, for the stopping time T_ε defined in (3), consider

    Φ_k := ν(f(x_k) − f_*) + (1 − ν)α_k‖∇f(x_k)‖².

If iteration k is successful, then

    Φ_k − Φ_{k+1} ≥ νηc_3α_k‖∇f(x_k)‖² − (1 − ν)(γα_k‖∇f(x_{k+1})‖² − α_k‖∇f(x_k)‖²).

By Lipschitz continuity of ∇f, it follows that

    ‖∇f(x_{k+1})‖ ≤ (Lα_kβ + 1)‖∇f(x_k)‖.

Squaring both sides and applying α_k ≤ ᾱ yields

    ‖∇f(x_{k+1})‖² ≤ (Lᾱβ + 1)²‖∇f(x_k)‖².

Overall, if iteration k is successful, then

    Φ_k − Φ_{k+1} ≥ νηc_3α_k‖∇f(x_k)‖² − (1 − ν)((Lᾱβ + 1)²γ − 1)α_k‖∇f(x_k)‖².

On unsuccessful iterations, one simply finds

    Φ_k − Φ_{k+1} ≥ (1 − ν)(1 − γ^{-1})α_k‖∇f(x_k)‖².

By selecting ν sufficiently close to 1 such that

    νηc_3 ≥ (1 − ν)((Lᾱβ + 1)²γ − γ^{-1}),

one ensures, while ‖∇f(x_k)‖ > ε, that Condition 4.1 holds with h_ε(α_k) := α_kε² and Θ := (1 − ν)(1 − γ^{-1}). Thus, by (1), the number of iterations required until a first-order ε-stationary point is reached satisfies

    T_ε ≤ (ν(f(x_0) − f_*) + (1 − ν)α_0‖∇f(x_0)‖²) / ((1 − ν)(1 − γ^{-1})αε²),

which again shows that T_ε = O(ε^{-2}).

4.3 Regularized Newton

Regularized Newton methods have been popular in recent years due to their ability to offer optimal worst-case complexity guarantees for nonconvex smooth optimization over the class of practical second-order methods. For example, in cubicly regularized Newton, the model is chosen as

    m_k(x) = f(x_k) + ∇f(x_k)^T(x − x_k) + ½(x − x_k)^T∇²f(x_k)(x − x_k) + (1/(3α_k))‖x − x_k‖³,

and in Step 2 the imposed sufficient reduction condition has the same form as that in a trust region method, namely, the ratio condition (2).
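For concreteness, the cubicly regularized model can be transcribed directly from the formula above; in this sketch, f_k, g_k, and H_k stand for f(x_k), ∇f(x_k), and ∇²f(x_k).

```python
import numpy as np

def cubic_model(f_k, g_k, H_k, x_k, alpha_k):
    """Return the cubicly regularized Newton model m_k of Section 4.3."""
    def m(x):
        s = x - x_k
        return (f_k + g_k @ s + 0.5 * s @ (H_k @ s)
                + np.linalg.norm(s) ** 3 / (3.0 * alpha_k))
    return m
```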

As for the aforementioned line search method, one can show that if the Hessian of f is Lipschitz continuous on a set containing the iterates, then an iteration of cubicly regularized Newton will be successful if the stepsize parameter is sufficiently small; consequently, α_k ≥ α for all k ∈ N for some α ∈ (0, ∞) dependent on L, the Lipschitz constant for ∇²f, η, and γ. However, to prove the optimal complexity guarantee in this setting, one must account for the progress in a successful iteration as being dependent on the magnitude of the gradient at the next iterate, not the current one. (As we shall see, this leads to complications for the analysis in the stochastic regime.) For this reason, and ignoring the trivial case when ‖∇f(x_0)‖ ≤ ε, let us define

    T_ε := min{k ∈ N : ‖∇f(x_{k+1})‖ ≤ ε}

along with

    Φ_k := ν(f(x_k) − f_*) + (1 − ν)α_k‖∇f(x_k)‖^{3/2}.

If iteration k is successful, then

    f(x_k) − f(x_{k+1}) ≥ ηc_4α_k‖∇f(x_{k+1})‖^{3/2}

for some c_4 ∈ (0, ∞) [7], meaning that

    Φ_k − Φ_{k+1} ≥ νηc_4α_k‖∇f(x_{k+1})‖^{3/2} + (1 − ν)α_k‖∇f(x_k)‖^{3/2} − (1 − ν)α_{k+1}‖∇f(x_{k+1})‖^{3/2}
                ≥ νηc_4α_k‖∇f(x_{k+1})‖^{3/2} − (1 − ν)γα_k‖∇f(x_{k+1})‖^{3/2};

otherwise, if iteration k is unsuccessful, then

    Φ_k − Φ_{k+1} ≥ (1 − ν)(1 − γ^{-1})α_k‖∇f(x_{k+1})‖^{3/2}.

Choosing ν sufficiently close to 1 such that

    νηc_4 ≥ (1 − ν)(γ + 1 − γ^{-1}),

one ensures, while ‖∇f(x_{k+1})‖ > ε, that Condition 4.1 holds with h_ε(α_k) := α_kε^{3/2} and Θ := (1 − ν)(1 − γ^{-1}). Thus, by (1), the number of iterations required until a first-order ε-stationary point will be reached (in the next iteration) satisfies

    T_ε ≤ (ν(f(x_0) − f_*) + (1 − ν)α_0‖∇f(x_0)‖^{3/2}) / ((1 − ν)(1 − γ^{-1})αε^{3/2}),

which shows that T_ε = O(ε^{-3/2}). One can obtain the same result with a second-order TR method; see [11]. It should be said, however, that this method requires a more complicated mechanism for adjusting the stepsize parameter than that stated in Algorithm 1.

4.4 Additional examples

Our analysis so far has focused on the setting of having a stopping time based on a first-order stationarity criterion and no assumptions on f besides first- and/or second-order Lipschitz continuous differentiability. However, the framework can be extended to other situations as well.

For example, if one is interested in approximate second-order stationarity, then one can let T_ε := min{k ∈ N : χ_k ≤ ε}, where χ_k := max{‖∇f(x_k)‖, −λ_min(∇²f(x_k))}, with λ_min(·) denoting the minimum eigenvalue of its symmetric matrix argument. One can show that if the model m_k is chosen as a (regularized) second-order Taylor series approximation of f at x_k, then for all of the aforementioned methods one obtains that T_ε = O(ε^{-3}). For TR, for example, one can derive this using

    Φ_k := ν(f(x_k) − f_*) + (1 − ν)α_k³.

On the other hand, if one is interested in analyzing specially the case of f being convex, or even strongly convex, then one might consider

    T_ε := min{k ∈ N : f(x_k) − f_* ≤ ε}.

In this case, improved complexity bounds can be obtained through other careful choices of {Φ_k}. For example, when f is convex and the LS method from §4.2 is employed, one can let

    Φ_k := ν(1/ε − 1/(f(x_k) − f_*)) + (1 − ν)α_k.

Under the assumption that the level sets of f are bounded, one can show that (f(x_{k+1}) − f_*)^{-1} − (f(x_k) − f_*)^{-1} is uniformly bounded below by a positive constant over all successful iterations, while for all k ∈ N one has α_k ≥ α. Hence, for a suitable constant ν, one can determine h_ε(α_k) and Θ to satisfy Condition 4.1. In this case, the function h_ε does not depend on ε, but Φ_0 = O(ε^{-1}), meaning that T_ε = O(ε^{-1}).

Similarly, when f is strongly convex and the LS method from §4.2 is employed, consider

    Φ_k := ν(log(1/ε) − log(1/(f(x_k) − f_*))) + (1 − ν)log(α_k).

This time, log((f(x_{k+1}) − f_*)^{-1}) − log((f(x_k) − f_*)^{-1}) is uniformly bounded below by a positive constant over all successful iterations, and similar to the convex case one can determine ν, h_ε, and Θ independent of ε in order to show that Φ_0 = O(log(ε^{-1})) implies T_ε = O(log(ε^{-1})).
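For the convex case above, the key progress estimate, derived in the Appendix, is that on a successful iteration

```latex
\frac{1}{f(x_{k+1}) - f_*} - \frac{1}{f(x_k) - f_*}
  \;\ge\; \frac{\eta c_3}{D^2}\,\alpha_k
  \;\ge\; \frac{\eta c_3}{D^2}\,\alpha,
```

where α is the lower bound on the stepsize parameter, so that the corresponding Φ_k decreases by a uniform amount on successful iterations.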

5 Framework for Analyzing Adaptive Stochastic Methods

We now present a generalization of the framework introduced in the previous section that allows for the analysis of adaptive stochastic optimization methods. This framework is based on the techniques proposed in [8], which served to analyze the behavior of algorithms when derivative estimates are stochastic, but function values could be computed exactly. It is also based on the subsequent work in [2], which allows for function values to be stochastic as well. For our purposes of providing intuition, we discuss a simplification of the framework, avoiding some technical details. See [8, 2] for complete details.

As in the deterministic setting, let us define a generic algorithmic framework, which we state as Algorithm 2, that encapsulates multiple types of adaptive algorithms. This algorithm has the same structure as Algorithm 1, except that it makes use of a stochastic model of f to compute the trial step, and makes use of stochastic objective value estimates when determining whether sufficient reduction has been achieved.

Algorithm 2 Adaptive stochastic framework

Initialization
Choose (η, δ_1, δ_2) ∈ (0, 1)³, γ ∈ (1, ∞), and ᾱ ∈ (0, ∞). Choose an initial iterate x_0 ∈ R^n and stepsize parameter α_0 ∈ (0, ᾱ].

1. Determine model and compute step
Choose a stochastic model m_k of f around x_k, which satisfies some sufficient accuracy requirement with probability at least 1 − δ_1. Compute a step s_k(α_k) such that the model reduction m_k(x_k) − m_k(x_k + s_k(α_k)) ≥ 0 is sufficiently large.

2. Check for sufficient reduction
Compute estimates f̃_k^0 and f̃_k^s of f(x_k) and f(x_k + s_k(α_k)), respectively, which satisfy some sufficient accuracy requirement with probability at least 1 − δ_2. Check if f̃_k^0 − f̃_k^s is sufficiently large relative to the model reduction m_k(x_k) − m_k(x_k + s_k(α_k)) using a condition parameterized by η.

3. Successful iteration
If sufficient reduction has been attained (along with other potential requirements), then set x_{k+1} ← x_k + s_k(α_k) and α_{k+1} ← min{γα_k, ᾱ}.

4. Unsuccessful iteration
Otherwise, set x_{k+1} ← x_k and α_{k+1} ← γ^{-1}α_k.

5. Next iteration
Set k ← k + 1.

Corresponding to Algorithm 2, let {(Φ_k, W_k)} be a stochastic process such that Φ_k ≥ 0 for all k ∈ N. The sequences {Φ_k} and {W_k} play similar roles as for our analysis of Algorithm 1, but it should be noted that now each W_k is a random indicator influenced by the iterate sequence {x_k} and stepsize parameter sequence {α_k}, which are themselves stochastic processes.

Let F_k denote the σ-algebra generated by all stochastic processes within Algorithm 2 at the beginning of iteration k. Roughly, it is generated by {(Φ_j, α_j, x_j)}_{j=0}^k. Note that this includes (Φ_k, α_k, x_k), but does not include W_k, as this depends on what happens during iteration k. We then let T_ε denote a family of stopping times for {(Φ_k, W_k)} with respect to {F_k}, parameterized by ε ∈ (0, ∞).

The goal of our analytical framework is to derive an upper bound for the expected stopping time E[T_ε] under various assumptions on the behavior of {(Φ_k, W_k)} and on the objective f. At the heart of the analysis is the goal to show that the following condition holds.

Condition 5.1 The following statements hold with respect to {(Φ_k, α_k, W_k)} and T_ε.

1. There exists a scalar α_ε ∈ (0, ∞) such that each iteration k ∈ N with α_k ≤ α_ε has W_k = 1 with probability 1 − δ > 1/2, conditioned on F_k. (This means that, if the stepsize parameter becomes sufficiently small, one is more likely to see an increase in it than a decrease.)

2. There exists a nondecreasing function h_ε : [0, ∞) → (0, ∞) and a scalar Θ ∈ (0, ∞) such that, for all k < T_ε, the conditional expectation of Φ_k − Φ_{k+1} with respect to F_k is at least Θh_ε(α_k); specifically, E[Φ_k − Φ_{k+1} | F_k] ≥ Θh_ε(α_k).

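The content of part 1 of Condition 5.1, namely that below the threshold the stepsize parameter behaves like a random walk biased upward, can be illustrated with a toy simulation; all names and values here are illustrative.

```python
import random

def stepsize_walk(alpha0, delta=0.2, gamma=2.0, num_iters=1000):
    """Toy illustration of Condition 5.1(1): each iteration is successful with
    probability 1 - delta > 1/2, so increases of the stepsize parameter
    outnumber decreases in expectation."""
    alpha, path = alpha0, [alpha0]
    for _ in range(num_iters):
        if random.random() < 1.0 - delta:  # successful iteration: W_k = 1
            alpha *= gamma
        else:                              # unsuccessful iteration: W_k = -1
            alpha /= gamma
        path.append(alpha)
    return path
```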
Condition 5.1 parallels Condition 4.1, but its requirements now hold only in probability and in expectation: when the stepsize parameter is at most α_ε, the iteration is likely to be successful, and the expected reduction in Φ_k is sufficiently large, namely, at least Θh_ε(α_k) for each k < T_ε. In addition, the stepsize parameter is allowed to fall below the threshold α_ε; in fact, it can become arbitrarily small. That said, if α_k ≤ α_ε, then one tends to find α_{k+1} > α_k.

With this added flexibility, one can still prove complexity guarantees. Intuitively, the reduction Φ_k − Φ_{k+1} is at least the deterministic amount Θh_ε(α_ε) often enough that one is able to bound the total number of such occurrences (since {Φ_k} ≥ 0). The following theorem (see Theorem 2.2 in [2]) bounds the expected stopping time in terms of a deterministic value.

Theorem 5.2 If Condition 5.1 holds, then

    E[T_ε] ≤ ((1 − δ)/(1 − 2δ)) · Φ_0/(Θh_ε(α_ε)) + 1.

In §5.2–5.3, we summarize how this framework has been applied to analyze the behavior of stochastic TR and LS methods. In each case, the keys to applying the framework are determining how to design the process {Φ_k} as well as how to specify details of Algorithm 2 to ensure that Condition 5.1 holds. For these aspects, we first need to describe different adaptive accuracy requirements for stochastic function and derivative estimates that might be imposed in Steps 1 and 2, as well as the techniques that have been developed to ensure these requirements.

5.1 Error Bounds for Stochastic Function and Derivative Estimates

In this section, we describe various types of conditions that one may require in an adaptive stochastic optimization algorithm when computing objective function, gradient, and Hessian estimates. These conditions have been used in some previously proposed adaptive stochastic optimization algorithms [2, 5, 8, 23].

Let us remark in passing that one does not necessarily need to employ sophisticated error bound conditions in stochastic optimization to achieve improved convergence rates. For example, in the case of minimizing strongly convex f, the linear rate of convergence of gradient descent can be emulated by an SG-like method if the mini-batch size grows exponentially [15, 24]. However, attaining similar improvements in the (not strongly) convex and nonconvex settings has proved elusive. Moreover, while the stochastic estimates improve with the progress of such an algorithm, this improvement is based on prescribed parameters, and hence the algorithm is not adaptive in our desirable sense. Hence, one still needs to tune such algorithms for each application.

Returning to our setting, in the types of error bounds presented below, for a given f : R^n → R and x ∈ R^n, let f̃(x), g(x), and H(x) denote stochastic approximations of f(x), ∇f(x), and ∇²f(x), respectively.

• Taylor-like Conditions. Corresponding to a norm ‖·‖, let B(x_k, ∆_k) denote a ball of radius ∆_k centered at x_k. If the function, gradient, and Hessian estimates satisfy

    |f̃(x_k) − f(x_k)| ≤ κ_f∆_k²,    (5a)
    ‖g(x_k) − ∇f(x_k)‖ ≤ κ_g∆_k,    (5b)
    and ‖H(x_k) − ∇²f(x_k)‖ ≤ κ_H    (5c)

for some nonnegative scalars (κ_f, κ_g, κ_H), then the model m_k(x) = f̃(x_k) + g(x_k)^T(x − x_k) + ½(x − x_k)^T H(x_k)(x − x_k) gives an approximation of f within B(x_k, ∆_k) that is comparable to that given by an accurate first-order Taylor series approximation (with error dependent on ∆_k). Similarly, if (5) holds with the right-hand side values replaced by κ_f∆_k³, κ_g∆_k², and κ_H∆_k, respectively, then m_k gives an approximation of f that is comparable to that given by an accurate second-order Taylor series approximation. In a stochastic setting when unbiased estimates of f(x_k), ∇f(x_k), and ∇²f(x_k) can be computed, such conditions can be ensured, with some sufficiently high probability 1 − δ, by sample average approximations using a sufficiently large number of samples. For example, to satisfy (5b) with probability 1 − δ, the sample size for computing g(x_k) can be Ω(V_g/(κ_g²∆_k²)), where V_g is the variance of the stochastic gradient estimators corresponding to a mini-batch size of 1; a sketch of this sample-size rule appears after this list. Here, the Ω-notation hides a dependence on δ, which is weak if ‖g(x_k) − ∇f(x_k)‖ is bounded.

• Gradient Norm Condition. If, for some θ ∈ [0, 1), one has that

    ‖g(x_k) − ∇f(x_k)‖ ≤ θ‖∇f(x_k)‖,    (6)

then we say that g(x_k) satisfies the gradient norm condition at x_k. Unfortunately, verifying the gradient norm condition at x_k requires knowledge of ‖∇f(x_k)‖, which makes it an impractical condition. In [5], a heuristic is proposed that attempts to approximate the sample size for which the gradient norm condition holds.

More recently, in [3], the authors improve upon the gradient norm condition by introducing an angle condition, which in principle allows smaller sample sizes to be employed. However, again, the angle condition requires a bound in terms of ‖∇f(x_k)‖, for which a heuristic estimate needs to be employed.

Instead of employing the unknown quantity ‖∇f(x_k)‖ on the right-hand side of the norm and angle conditions, one can substitute ε ∈ (0, ∞), the desired stationarity tolerance. In this manner, while ‖∇f(x_k)‖ > ε and ‖g_k − ∇f(x_k)‖ ≤ θε, the norm condition is satisfied. This general idea has been exploited recently in several articles (e.g., [28, 30]), where gradient and Hessian estimates are computed based on large-enough numbers of samples, then assumed to be accurate in every iteration (with high probability) until an ε-stationary solution is reached. While strong iteration complexity guarantees can be proved for such algorithms, these approaches are too conservative to be competitive with truly stochastic algorithms.

• Stochastic Gradient Norm Conditions. Consider again the conditions in (5), although now consider the specific setting of defining ∆_k := α_k‖g(x_k)‖ for all k ∈ N. This condition, which involves a bound comparable to (6) when g(x_k) ≈ ∇f(x_k), is particularly useful in the context of LS methods. It is possible to impose it in probability since, unlike (6), it does not require knowledge of ‖∇f(x_k)‖. In this case, to satisfy (5) for ∆_k := α_k‖g(x_k)‖ with probability 1 − δ, the sample size for computing g(x_k) need only be Ω(V_g/(κ_g²α_k²‖g(x_k)‖²)). While ‖g(x_k)‖ is not known when the sample size for computing it is chosen, one can use a simple loop that guesses the value of ‖g(x_k)‖, then iteratively increases the number of samples as needed; see [8, 23].

Note that all of the above conditions can be adaptive in terms of the progress of an algorithm, assuming that {∆_k} and/or {‖g(x_k)‖} vanish as k → ∞. However, this behavior does not have to be monotonic, which is a benefit of adaptive stepsizes and accuracy requirements.
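As promised above, the Ω(V_g/(κ_g²∆_k²)) sample-size rule for (5b) can be sketched as follows; c_delta is a hypothetical constant absorbing the (weak) dependence on δ hidden by the Ω-notation and is not specified in the text.

```python
import math

def batch_size_for_5b(V_g, kappa_g, Delta_k, c_delta=1.0):
    """Mini-batch size sufficient for the gradient accuracy condition (5b)
    to hold with probability at least 1 - delta (sketch; c_delta is an
    assumed constant)."""
    return max(1, math.ceil(c_delta * V_g / (kappa_g ** 2 * Delta_k ** 2)))
```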


Figure 3: Illustration of "good" and "bad" models and estimates in a stochastic trust region method: (first) good model and good estimates; (second) good model and bad estimates; (third) bad model and good estimates; (fourth) bad model and bad estimates.

5.2 Stochastic trust region

The idea of designing stochastic TR methods has been attractive for years, even before the recent explosion of efforts on algorithms for stochastic optimization; see, e.g., [9]. This is due to the impressive practical performance that TR methods offer, especially when (approximate) second-order information is available. Since TR methods typically compute trial steps by minimizing a quadratic model of the objective in a neighborhood around the current iterate, they are inherently equipped to avoid certain spurious stationary points that are not local minimizers. In addition, by normalizing the step length by a trust region radius, the behavior of the algorithm is kept relatively stable. Indeed, this feature allows TR methods to offer stability even in the nonadaptive stochastic regime; see [13].

However, it has not been until the last couple of years that researchers have been able to design stochastic TR methods that can offer strong expected complexity bounds, which are essential for ensuring that the practical performance of such methods can be competitive over broad classes of problems. The recently proposed algorithm known as STORM, introduced in [10], achieves such complexity guarantees by requiring the Taylor-like conditions (5) to hold with probability at least 1 − δ, conditioned on F_k.² In particular, in Step 1 of Algorithm 2 a model m_k(x) := f̃(x_k) + g(x_k)^T(x − x_k) + ½(x − x_k)^T H(x_k)(x − x_k) is computed with components satisfying (5) with probability 1 − δ_1 (where ∆_k := α_k for all k ∈ N), then the step s_k(α_k) is computed by minimizing m_k (approximately) within a ball of radius α_k. In Step 2, the estimates f̃_k^0 and f̃_k^s are computed to satisfy (5a) with probability 1 − δ_2. The imposed sufficient reduction condition is

    (f̃_k^0 − f̃_k^s) / (m_k(x_k) − m_k(x_k + s_k)) ≥ η.

An iteration is successful if the above holds and ‖g_k‖ ≥ τα_k for some user-defined τ ∈ (0, ∞); this test is sketched in code at the end of this subsection.

For brevity, we omit some details of the algorithm. For example, the constant κ_f in (5) cannot be too large, whereas κ_g and κ_H can be arbitrarily large. Naturally, the magnitudes of these constants affect the constants in the convergence rate; see [2] for further details.

Based on the stochastic process generated by Algorithm 2, let us define, as in the deterministic setting, T_ε and {Φ_k} by (3) and (4), respectively. It is shown in [2] that for sufficiently small constants δ_1, δ_2, and Θ (independent of ε), Condition 5.1 holds with

    α_ε := ζε  and  h(α_k) := α_k²

for some positive constants ζ and Θ that depend on the algorithm parameters and properties of f, but not on ε. The probability 1 − δ that arises in Condition 5.1 is at least (1 − δ_1)(1 − δ_2). Thus, by Theorem 5.2, the expected complexity of reaching a first-order ε-stationary point is at most O(ε^{-2}), which matches the complexity of the deterministic version of the algorithm up to the factor dependent on (1 − δ_1)(1 − δ_2).

Let us give some intuition for how Condition 5.1 is ensured. Let us say that the model m_k is "good" if its components satisfy (5) and "bad" otherwise. Similarly, the estimates f̃_k^0 and f̃_k^s are "good" if they satisfy (5a) and "bad" otherwise. By construction of Steps 1 and 2 of the algorithm, m_k is "good" with probability at least 1 − δ_1, and f̃_k^0 and f̃_k^s are "good" with probability at least 1 − δ_2. Figure 3 illustrates the four possible outcomes. When both the model and the estimates are good, the algorithm essentially behaves as its deterministic counterpart; in particular, if α_k ≤ α_ε and k < T_ε, then the k-th iteration is successful and the reduction Φ_k − Φ_{k+1} is sufficiently large. If the model is bad and the estimates are good, or if the model is good and the estimates are bad, then the worst case (depicted in Figure 3) is that the step is deemed unsuccessful, even though α_k ≤ α_ε. This shows that the stepsize parameter can continue to decrease, even if it is already small. Finally, if both the model and the estimates are bad, which happens with probability at most δ_1δ_2, then it is possible that the iteration will be deemed successful despite the fact that Φ_{k+1} > Φ_k. (Recall that this cannot occur in the deterministic setting, where {Φ_k} decreases monotonically.) The key step in showing that Φ_{k+1} < Φ_k in expectation is to establish that, in each iteration, the possible decrease of this measure is proportional to any possible increase; thus, by ensuring that δ_1δ_2 is sufficiently small and (1 − δ_1)(1 − δ_2) is sufficiently large, one can ensure a desired reduction in expectation.

The same TR algorithm can be employed with minor modifications to obtain good expected complexity properties with respect to achieving second-order ε-stationarity. In this case, the requirements on the estimates need to be stronger; in particular, (5) has to be imposed with right-hand side values κ_fα_k³, κ_gα_k², and κ_Hα_k, respectively. In this case, Condition 5.1 holds for h(α_k) = α_k³. Hence, the expected complexity of reaching a second-order ε-stationary point by Algorithm 2 is bounded by O(ε^{-3}), which similarly matches the deterministic complexity.

²In what follows, all accuracy conditions are assumed to hold with some probability, conditioned on F_k. We omit mention of this conditioning for brevity.
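The success test used by this stochastic TR method, referenced earlier in this subsection, is a direct transcription of the two displayed requirements:

```python
def storm_success(f0_tilde, fs_tilde, model_reduction, g_norm, alpha, eta, tau):
    """Success test: the estimated reduction must be at least an eta-fraction
    of the model reduction, and the gradient estimate must not be small
    relative to the trust region radius."""
    ratio_ok = (f0_tilde - fs_tilde) >= eta * model_reduction
    return ratio_ok and g_norm >= tau * alpha
```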

5.3 Line Search Methods

A major disadvantage of an algorithm such as SG is that one is very limited in the choice of stepsize sequence that can be employed while adhering to the theoretical guidelines. One would like to be able to employ a type of line search, as has been standard practice throughout the history of research on deterministic optimization algorithms. However, devising line search methods for the stochastic regime turns out to be extremely difficult. This is partially due to the fact that, unlike TR methods, LS algorithms employ steps that are highly influenced by the norm of the gradient ‖∇f(x_k)‖, or in the stochastic regime by the norm of the gradient estimate ‖g(x_k)‖. Since the norm of the gradient estimate can vary dramatically from one iteration to the next, the algorithm needs to have a carefully controlled stepsize selection mechanism to ensure convergence.

In [3] and [15], two backtracking line search methods were proposed that use different heuristic sample size strategies when computing gradient and function estimates. In both cases, the backtracking is based on the Armijo condition applied to function estimates that are computed on the same batch as the gradient estimates. A different type of LS method, which uses a probabilistic Wolfe condition for choosing the stepsize, was proposed in [18], although this approach possesses no known theoretical guarantees.

In [29], the authors argue that with over-parametrization of deep neural networks (DNNs), the variance of the stochastic gradients tends to zero near stationarity points. Under this assumption, the authors analyzed a stochastic LS method, which has good practical performance in some cases. However, not only is such an assumption unreasonably strong, even for DNNs, but it also does not extend to the fully stochastic setting, since in that setting the assumption would imply zero generalization error at the solution—an ideal situation, but hardly realistic in general.

Here we summarize the results in [23], where an LS method with an adaptive sample size selection mechanism is proposed and complexity bounds are provided. This method can be described as a particular case of Algorithm 2. As in the deterministic case, a stochastic model m_k is chosen in Step 1 and s_k(α_k) = α_k d_k, where d_k makes an obtuse angle with the gradient estimate g(x_k). The sufficient reduction condition in Step 2 is based on the estimated Armijo condition

    f̃_k^0 − f̃_k^s ≥ −ηg(x_k)^T s_k(α_k).

The algorithm requires that the components of the model m_k satisfy (5) with probability at least 1 − δ_1, and that the estimates f̃_k^0 and f̃_k^s satisfy (5a) with probability at least 1 − δ_2. Here, it is critical that (5a) not be imposed with the right-hand side being α_k‖g_k‖² (even though the deterministic case might suggest this as being appropriate), since this quantity can vary uncontrollably from one iteration to the next. To avoid this issue, the approach defines an additional control sequence {∆_k} used for controlling the accuracy of f̃_k^0 and f̃_k^s. Intuitively, for all k ∈ N, the value ∆_k² is meant to approximate α_k‖∇f(x_k)‖², which, as seen in the deterministic case, is the desired reduction in f if iteration k is successful.

This control sequence needs to be set carefully. The first value in the sequence is set arbitrarily, with subsequent values set as follows. If iteration k is unsuccessful, then ∆_{k+1} ← ∆_k. Otherwise, if iteration k is successful, then one checks whether the step is reliable in the sense that the accuracy parameter is sufficiently small, i.e.,

    α_k‖g(x_k)‖² ≥ ∆_k².    (7)

If (7) holds, then one sets ∆_{k+1} ← √γ ∆_k; otherwise, one sets ∆_{k+1} ← √(γ^{-1}) ∆_k to promote reliability of the step in the subsequent iteration. (This update is sketched in code at the end of this subsection.) Using {∆_k} defined in this manner, an additional bound is imposed on the variance of the objective value estimates. For all k ∈ N, one requires

    max{E[|f̃_k^0 − f(x_k)|²], E[|f̃_k^s − f(x_k + s_k(α_k))|²]} ≤ max{κ_f²α_k²‖∇f(x_k)‖⁴, κ_f²∆_k⁴}.

Note that because of the max on the right-hand side of this inequality, and because ∆_k is a known value, it is not necessary to know ‖∇f(x_k)‖² in order to impose this condition. Also note that this condition is stronger than imposing the Taylor-like condition (5a) with some probability less than 1, because the bound on the expectation does not allow |f̃_k^0 − f(x_k)| to be arbitrarily large with positive probability, while (5a) allows this and thus is more tolerant of outliers.

For analyzing this LS instance of Algorithm 2, again let T_ε be as in (3). However, now let

    Φ_k := ν(f(x_k) − f_*) + (1 − ν)((α_k/L²)‖∇f(x_k)‖² + η∆_k²).

Using a strategy similar to that for STORM combined with the logic of §4.2, it is shown in [23] that Condition 5.1 holds with

    h(α_k) = α_kε².

The expected complexity of this stochastic LS method has also been analyzed for minimizing convex and strongly convex f using modified definitions of Φ_k and T_ε as described in §4.4; see [23].
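The update of the control sequence {∆_k} described above, including the reliability test (7), can be sketched as:

```python
import math

def update_Delta(Delta_k, alpha_k, g_norm, successful, gamma=2.0):
    """Update Delta_{k+1} per Section 5.3: unchanged on unsuccessful
    iterations; on successful ones, grow by sqrt(gamma) if the step was
    reliable in the sense of (7), otherwise shrink by sqrt(gamma)."""
    if not successful:
        return Delta_k                            # Delta_{k+1} <- Delta_k
    if alpha_k * g_norm ** 2 >= Delta_k ** 2:     # reliability test (7)
        return math.sqrt(gamma) * Delta_k
    return Delta_k / math.sqrt(gamma)
```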

5.4 Stochastic regularized Newton

As we have seen, stochastic TR and LS algorithms have been developed that fit into the adaptive stochastic framework that we have described, showing that they can achieve expected complexity guarantees on par with their deterministic counterparts. However, neither of these types of algorithms achieves complexity guarantees that are optimal in the deterministic regime.

Cubicly regularized Newton [6, 7, 20], described and analyzed in §4.3, enjoys optimal convergence rates among second-order methods for minimizing nonconvex functions. We have shown how our adaptive deterministic framework gives an O(ε^{-3/2}) complexity bound for this method for achieving first-order ε-stationarity, in particular, to achieve ‖∇f(x_{k+1})‖ ≤ ε. There has also been some work that proposes stochastic and randomized versions of cubic regularization methods [28, 30], but these impose strong conditions on the accuracy of the function, gradient, and Hessian estimates that are equivalent to using ε (i.e., the desired accuracy threshold) in place of ∆_k for all k ∈ N in (5). Thus, these approaches essentially reduce to sample average approximation with very tight accuracy tolerances and are not adaptive enough to be competitive with truly stochastic algorithms in practice. For example, in [28], no adaptive stepsizes or batch sizes are employed, and in [30] only the Hessian approximations are assumed to be stochastic. Moreover, the convergence analysis is performed under the assumption that the estimates are sufficiently accurate (for a given ε) in every iteration. Thus, essentially, the analysis is reduced to that in the deterministic setting and applies only as long as no iteration fails to satisfy the accuracy condition. Hence, a critical open question is whether one can extend the framework described here (or another approach) to develop and analyze a stochastic algorithm that achieves optimal deterministic complexity.

The key difficulty in extending the analysis described in §4.3 to the stochastic regime is the definition of the stopping time. For Theorem 5.2 to hold, T_ε has to be a stopping time with respect to {F_k}. However, with s_k(α_k) being random, x_{k+1} is not measurable with respect to F_k; hence, T_ε, as it is defined in the deterministic setting, is not a valid stopping time in the stochastic regime. A different definition is needed that would be agreeable with the analysis in the stochastic setting.

Another algorithmic framework that enjoys optimal complexity guarantees is the Trust Region Algorithm with Contractions and Expansions (TRACE) [11]. This algorithm borrows much from the traditional TR methodology also followed by STORM, but incorporates a few algorithmic variations that reduce the complexity from O(ε^{-2}) to O(ε^{-3/2}) for achieving first-order ε-stationarity. It remains an open question whether one can employ the adaptive stochastic framework to analyze a stochastic variant of TRACE, the main challenge being that TRACE involves a relatively complicated strategy for updating the stepsize parameter. Specifically, deterministic TRACE requires knowledge of the exact Lagrange multiplier of the trust region constraint at a solution of the step computation subproblem. If the model m_k is stochastic and the subproblem is solved only approximately, then it remains open how to maintain the optimal (expected) complexity guarantee. In addition, the issue of determining the correct stopping time in the stochastic regime is also open for this method.

6 Other possible extensions

We have shown that the analytical framework for analyzing adaptive stochastic optimization algorithms presented in §5 has offered a solid foundation upon which stochastic TR and stochastic LS algorithms have been proposed and analyzed. We have also shown the challenges and opportunities for extending the use of this framework for analyzing algorithms whose deterministic counterparts have optimal complexity.

Other interesting questions remain to be answered. For example, great opportunities exist for the design of error bound conditions other than those mentioned in §5.1, especially when it comes to bounds that are tailored for particular problem settings. While improved error bounds might not lead to improvements in iteration complexity, they can have great effects on the work complexity of various algorithms, which translates directly into performance gains in practice.

The proposed analytical framework might also benefit from extensions in terms of the employed algorithmic parameters. For example, rather than using a single constant γ when updating the stepsize parameter, one might consider different values for increases vs. decreases, and when different types of steps are computed. This would allow for improved bounds on the accuracy probability tolerances δ_1 and δ_2.

Finally, numerous open questions remain in terms of how best to implement adaptive stochastic algorithms in practice. For different algorithm instances, practitioners need to explore how best to adjust mini-batch sizes and stepsizes so that one can truly achieve good performance without wasteful tuning efforts. One hopes that with additional theoretical advances, these practical questions will become easier to answer.

References

[1] H. Asi and J. C. Duchi. The importance of better models in stochastic optimization. Proceedings of the National Academy of Sciences of the United States of America, 116(46):22924–22930, 2019.

[2] J. Blanchet, C. Cartis, M. Menickelly, and K. Scheinberg. Convergence rate analysis of a stochastic trust-region method via supermartingales. INFORMS Journal on Optimization, 1(2):92–119, 2019.

[3] R. Bollapragada, R. Byrd, and J. Nocedal. Adaptive sampling strategies for stochastic optimization. SIAM Journal on Optimization, 28(4):3312–3343, 2018.

[4] L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018.

[5] R. Byrd, G. M. Chin, J. Nocedal, and Y. Wu. Sample size selection in optimization methods for machine learning. Mathematical Programming, 134:127–155, 2012.

[6] C. Cartis, N. I. M. Gould, and Ph. L. Toint. Adaptive cubic regularisation methods for unconstrained optimization. Part I: Motivation, convergence and numerical results. Mathematical Programming, 127:245–295, 2011.

[7] C. Cartis, N. I. M. Gould, and Ph. L. Toint. Adaptive cubic regularisation methods for unconstrained optimization. Part II: Worst-case function- and derivative-evaluation complexity. Mathematical Programming, 130:295–319, 2011.

[8] C. Cartis and K. Scheinberg. Global convergence rate analysis of unconstrained optimization methods based on probabilistic models. Mathematical Programming, 169(2):337–375, 2018.

[9] K. H. Chang, M. K. Li, and H. Wan. Stochastic trust-region response-surface method (STRONG): a new response-surface framework for simulation optimization. INFORMS Journal on Computing, 25(2):230–243, 2013.

[10] R. Chen, M. Menickelly, and K. Scheinberg. Stochastic optimization using a trust-region method and random models. Mathematical Programming, 169(2):447–487, 2018.

[11] F. E. Curtis, D. P. Robinson, and M. Samadi. A trust region algorithm with a worst-case iteration complexity of O(ε^{-3/2}) for nonconvex optimization. Mathematical Programming, 162(1):1–32, 2017.

[12] F. E. Curtis and K. Scheinberg. Optimization methods for supervised machine learning: From linear models to deep learning. In INFORMS Tutorials in Operations Research, chapter 5, pages 89–114. Institute for Operations Research and the Management Sciences (INFORMS), 2017.

[13] F. E. Curtis, K. Scheinberg, and R. Shi. A stochastic trust region algorithm based on careful step normalization. INFORMS Journal on Optimization, https://doi.org/10.1287/ijoo.2018.0010, 2019.

[14] A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 1646–1654. Curran Associates, Inc., 2014.

[15] M. Friedlander and M. Schmidt. Hybrid deterministic-stochastic methods for data fitting. SIAM Journal on Scientific Computing, 34(3):1380–1405, 2012.

[16] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.

[17] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Proceedings of the 26th International Conference on Neural Information Processing Systems, NIPS'13, pages 315–323, USA, 2013. Curran Associates Inc.

[18] M. Mahsereci and P. Hennig. Probabilistic line searches for stochastic optimization. Journal of Machine Learning Research, 18(119):1–59, 2017.

[19] M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of Machine Learning. The MIT Press, second edition, 2018.

[20] Yu. Nesterov and B. T. Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006.

[21] L. M. Nguyen, J. Liu, K. Scheinberg, and M. Takáč. SARAH: A novel method for machine learning problems using stochastic recursive gradient. In Proceedings of the 34th International Conference on Machine Learning, ICML'17, pages 2613–2621, USA, 2017. JMLR.

[22] J. Nocedal and S. J. Wright. Numerical Optimization. Springer, second edition, 2006.

[23] C. Paquette and K. Scheinberg. A stochastic line search method with expected complexity analysis. arXiv:1807.07994, 2018.

[24] R. Pasupathy, P. Glynn, S. Ghosh, and F. S. Hashemi. On sampling rates in simulation-based recursions. SIAM Journal on Optimization, 28(1):45–73, 2018.

[25] H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.

[26] M. Schmidt, N. Le Roux, and F. Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1-2):83–112, 2017.

[27] S. Sra, S. Nowozin, and S. J. Wright. Optimization for Machine Learning. The MIT Press, 2011.

[28] N. Tripuraneni, M. Stern, C. Jin, J. Regier, and M. I. Jordan. Stochastic cubic regularization for fast nonconvex optimization. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 2904–2913. Curran Associates, Inc., 2018.

[29] S. Vaswani, A. Mishkin, I. H. Laradji, M. W. Schmidt, G. Gidel, and S. Lacoste-Julien. Painless stochastic gradient: Interpolation, line-search, and convergence rates. arXiv:1905.09997, 2019.

[30] P. Xu, F. Roosta, and M. W. Mahoney. Newton-type methods for non-convex optimization under inexact Hessian information. Mathematical Programming, 2019. DOI: https://doi.org/10.1007/s10107-019-01405-z.

Appendix

• Analysis in §4.1. The lower bound on {α_k} can be seen in the proof of Theorem 4.5 in Nocedal & Wright (2006). In particular, under the assumption that ‖∇f(x_k)‖ > ε for all k ∈ N, one can derive a lower bound on the trust region radius of the form c_1ε, where c_1 depends on L, η, γ, and an upper bound β on the norm of the Hessian of a quadratic term used in the model m_k, the latter of which is equal to L if an exact second-order Taylor series model is used. The reduction in f on successful iterations can be seen as follows. First, Cauchy decrease (see Lemma 4.3 in Nocedal & Wright (2006)) and the bound α_k ≤ τ‖∇f(x_k)‖ imply that

    m_k(x_k) − m_k(s_k(α_k)) ≥ ½‖∇f(x_k)‖ min{α_k, ‖∇f(x_k)‖/β}
                             ≥ ½(1/τ)α_k min{α_k, α_k/(βτ)}
                             = ½(1/τ)min{1, 1/(βτ)}α_k²,

from which it follows that for a successful iteration

    f(x_k) − f(x_{k+1}) ≥ ½η(1/τ)min{1, 1/(βτ)}α_k².

• Analysis in §4.2. The lower bound for the reduction in f can be seen in various settings. For example, if d_k = −M_k∇f(x_k) for all k ∈ N, where {M_k} is a sequence of real symmetric matrices with eigenvalues uniformly bounded in a positive interval [κ_1, κ_2], then one finds that

    −η∇f(x_k)^T s_k(α_k) = −ηα_k∇f(x_k)^T d_k = ηα_k∇f(x_k)^T M_k∇f(x_k) ≥ ηα_kκ_1‖∇f(x_k)‖².

Similarly, the lower bound for {α_k} can be seen in various settings. For example, in the same setting as above, one has from Lipschitz continuity of the gradient that

    f(x_k + s_k(α_k)) − f(x_k) ≤ ∇f(x_k)^T s_k(α_k) + ½L‖s_k(α_k)‖²
                               = −α_k∇f(x_k)^T M_k∇f(x_k) + ½Lα_k²‖M_k∇f(x_k)‖²
                               ≤ −α_kκ_1‖∇f(x_k)‖² + ½Lκ_2²α_k²‖∇f(x_k)‖².

On the other hand, if α_k fails to satisfy the Armijo condition, then

    f(x_k + s_k(α_k)) − f(x_k) > η∇f(x_k)^T s_k(α_k)
                               = −ηα_k∇f(x_k)^T M_k∇f(x_k)
                               ≥ −ηα_kκ_1‖∇f(x_k)‖².

Combined, this shows that

    α_k ≥ 2(1 − η)κ_1/(Lκ_2²),

from which it follows by the structure of the algorithm that {α_k} ≥ 2γ^{-1}(1 − η)κ_1/(Lκ_2²). Finally, the upper bound on the norm of the gradient after a successful step follows since

    ‖∇f(x_{k+1})‖ = ‖∇f(x_{k+1}) − ∇f(x_k) + ∇f(x_k)‖
                  ≤ L‖x_{k+1} − x_k‖ + ‖∇f(x_k)‖
                  ≤ Lα_kβ‖∇f(x_k)‖ + ‖∇f(x_k)‖.

• Analysis in §4.4. The desired results in the convex and strongly convex settings can be derived using techniques similar to those in [8, §3.3–§3.4]. First, suppose f is convex, has at least one global minimizer (call it x_*), and has bounded level sets in the sense that, for some D ∈ (0, ∞),

    ‖x − x_*‖ ≤ D for all x ∈ R^n with f(x) ≤ f(x_0).

Convexity of f implies for all k ∈ N that

    f_* − f(x_k) ≥ ∇f(x_k)^T(x_* − x_k) ≥ −D‖∇f(x_k)‖.

If iteration k is successful, then the above implies that

    (f(x_k) − f_*) − (f(x_{k+1}) − f_*) = f(x_k) − f(x_{k+1})
                                       ≥ ηc_3α_k‖∇f(x_k)‖²
                                       ≥ (ηc_3/D²)α_k(f(x_k) − f_*)².

Dividing by (f(x_k) − f_*)(f(x_{k+1}) − f_*) > 0 and noting that f(x_k) > f(x_{k+1}), one finds that

    1/(f(x_{k+1}) − f_*) − 1/(f(x_k) − f_*) ≥ (ηc_3/D²)α_k · (f(x_k) − f_*)/(f(x_{k+1}) − f_*) ≥ (ηc_3/D²)α,

showing, as desired, that (f(x_{k+1}) − f_*)^{-1} − (f(x_k) − f_*)^{-1} is uniformly bounded below by a positive constant over all successful iterations.

If f is strongly convex, then, as is well known, one has for some c ∈ (0, ∞) that

    ‖∇f(x)‖² ≥ c(f(x) − f_*) for all x ∈ R^n.

This shows that if iteration k is successful, then (similar to above)

    (f(x_k) − f_*) − (f(x_{k+1}) − f_*) ≥ ηc_3cα_k(f(x_k) − f_*),

which implies that

    f(x_{k+1}) − f_* ≤ (1 − ηc_3cα_k)(f(x_k) − f_*).

By the definition of f_*, it is clear from this inequality that for this successful iteration one must have α_k ≤ 1/(ηc_3c), meaning that (1 − ηc_3cα_k) ∈ [0, 1). Taking logs, this implies that

    log(f(x_{k+1}) − f_*) ≤ log(1 − ηc_3cα_k) + log(f(x_k) − f_*),

which after rearrangement yields

    −log(f(x_{k+1}) − f_*) ≥ −log(1 − ηc_3cα_k) − log(f(x_k) − f_*).

As desired, this shows that log((f(x_{k+1}) − f_*)^{-1}) − log((f(x_k) − f_*)^{-1}) is uniformly bounded below by a positive constant over all successful iterations.
