APPROXIMATELY EXACT LINE SEARCH

SARA FRIDOVICH-KEIL† AND BENJAMIN RECHT†

Abstract. We propose approximately exact line search, which uses only function evaluations to select a step size within a constant fraction of the exact line search minimizer. We bound the number of iterations and function evaluations, showing linear convergence on smooth, strongly convex objectives with no dependence on the initial step size for three variations of the method: using the true gradient, using an approximate gradient, and using a random search direction. We demonstrate experimental speedup on logistic regression for both gradient descent and minibatch stochastic gradient descent and on a benchmark set of derivative-free optimization objectives using quasi-Newton search directions.

Key words. line search methods, derivative free optimization, convex optimization

AMS subject classifications. 90C25, 90C56, 90C30

1. Introduction. Consider the unconstrained optimization problem:

argmin_{x ∈ ℝⁿ} f(x),

where f may be known analytically, derived from a large dataset, and/or available only as a black box. In our analysis, we assume that f is strongly convex and has Lipschitz gradient, but our experiments include more general objectives. Algorithm 1 presents a prototypical approach in our quest to minimize the objective f.

Algorithm 1: Descent Method

Input: f, x0, K, T0 > 0, β ∈ (0, 1)
t−1 ← T0
for k = 0, k < K, k = k + 1 do
    dk ← SearchDirection(f, xk)
    tk ← LineSearch(f, xk, β⁻¹tk−1, β, dk)
    xk+1 ← xk + tkdk
end
return xK

In each step k with iterate xk, we choose a search direction dk (by convention a descent direction) and a step size tk, and update xk accordingly. Our contribution is a novel method for the LineSearch subroutine, approximately exact line search (Algorithm 2, or AELS), an adaptive variant of golden section search [3] that operates without a predefined search interval or required accuracy. As its name suggests, the core of our approach is to approximate the solution of an exact line search, the step size that minimizes the one-dimensional slice of the objective along the search direction. Exact line search is a natural and effective heuristic for choosing the step size, but computing the minimization exactly can be too costly for a single step of Algorithm 1.
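As a concrete illustration, the descent loop of Algorithm 1 might be implemented in Python as follows (a minimal sketch, not the authors' code; search_direction and line_search stand in for any choice of the two subroutines):

def descent(f, x0, K, T0, beta, search_direction, line_search):
    """Prototypical descent method (Algorithm 1): a sketch."""
    x, t = x0, T0
    for _ in range(K):
        d = search_direction(f, x)
        # Warm start: initialize the line search slightly above the
        # previous step size (beta in (0, 1), so t / beta > t).
        t = line_search(f, x, t / beta, beta, d)
        x = x + t * d
    return x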

∗Funding: This research is generously supported in part by ONR awards N00014-20-1-2497 and N00014-18-1-2833, NSF CPS award 1931853, and the DARPA Assured Autonomy program (FA8750-18-C-0101). SFK is also supported by NSF GRFP.
†Department of Electrical Engineering and Computer Science, University of California, Berkeley, CA 94720 ([email protected], [email protected]).

AELS instead approximates this minimizer within a constant fraction, using only a logarithmic number of objective evaluations. Although many other approximate line searches exist, these generally either require additional gradient or directional derivative computations, which may be expensive or unavailable in some settings, or are inefficient, using more function evaluations or taking smaller steps than necessary. Approximately exact line search is parameter-free, with nominal input parameters of adjustment factor and initial step size that have minimal or no influence on the iteration complexity, respectively. Although its performance bounds depend on the Lipschitz and strong convexity parameters of the objective function (and the quality of the gradient approximation), these parameters play no role in the algorithm itself. AELS requires only unimodality (not convexity) to find a good step size, and can adjust the step size as much as necessary in each step. It selects a step size using only function evaluations, so irregularities in the gradient have no direct impact on its execution and it applies naturally to derivative-free optimization (DFO). We prove linear iteration complexity (of a flavor similar to [24]) and logarithmic function evaluation complexity per iteration in three settings: available gradient, approximate gradient via finite differencing, and random direction search (see [37, 36]).

AELS shows strong empirical performance on benchmark strongly convex tasks (regularized logistic regression) with full batch gradient descent (GD) and minibatch stochastic gradient descent (SGD) [33], converging faster than Armijo line searches. These performance gains persist over a wide range of minibatch sizes, including single-example minibatches, and a wide range of required accuracies, with performance degrading to the level of an adaptive Armijo line search as allowable relative error drops to 10⁻⁶. AELS also shows competitive empirical performance using BFGS [16] search directions on benchmark nonconvex DFO objectives [21], tending to converge faster than both Nelder-Mead [22] (designed specifically for DFO) and BFGS with a Wolfe line search [40] (requires directional derivative estimation in the line search, but guarantees a step size for which the BFGS direction update is well-defined).

1.1. Outline. We begin with a review of related work in Section 2. We then introduce approximately exact line search in Section 3. Section 4 reviews notation and useful preliminaries. Section 5 contains both a summary of convergence rates for general Armijo line searches (Subsection 5.1) and our bounds on the iteration and function evaluation complexity of approximately exact line search (Subsection 5.2). These results are proven in Section 6 and validated experimentally in Section 7. Our conclusions are presented in Section 8.

2. Related work. Line search is a well-studied problem, and many approaches have been proposed to address the basic goal of choosing a step size as well as various practical challenges like stochasticity and nonconvexity. In this section, we offer a survey of relevant literature, with a particular emphasis on work that inspired ours.

Step size schedules. The simplest step size is simply a constant, which will suffice for convergence provided that the step size is no more than 1/L for an objective with L-Lipschitz gradient [32].
In practice, however, even for L-smooth objectives L is rarely known, and an improperly chosen fixed step size can result in divergence or arbitrarily slow convergence. Another simple strategy is a decreasing step size, for instance tk ∝ 1/k. This has the benefit of guaranteeing convergence, but is still susceptible to slow convergence if the initial step size is too small [5]. Other schedules include the Polyak step size [31] and a hierarchical variant, which adaptively updates a lower bound on the objective to choose the step sizes without line search [19]. Along with these strategies, many heuristics and hand-tuned step size schedules exist. Engineers may

hand-tune a series of step sizes during the course of optimization, manually adjusting to changes in curvature of the objective [5]. These strategies can be successful, but require substantial human effort which can be averted by a line search algorithm.

Stochastic methods. In addition to these fixed, pre-scheduled, or hand-tuned step sizes, several variations of line search have been proposed (and proven) specifically for the stochastic setting; these methods typically adjust the step size by at most a constant factor in each step. Paquette and Scheinberg [29] prove convergence of such a method in the stochastic setting, with careful tracking of expected function decrease in each step. Likewise, Berahas, Cao, and Scheinberg prove convergence of a slightly simpler line search [7] in the noisy setting, provided sufficient bounds on the noise. Another class of methods diverges from the classical framework of Algorithm 1 by borrowing from online learning [27, 13, 15] to choose each iterate without line search, also with proven convergence in the stochastic setting [28]. Many methods for the stochastic setting also include variance reduction, such as adaptively choosing the minibatch size in each iteration [8, 9, 17] or accumulating information from prior gradients [35, 34, 1]. In the case of interpolating classifiers, Vaswani et al. [38] prove that backtracking line search with a periodic step size reset converges at the same rate in stochastic (minibatch) and deterministic (full batch) gradient settings. Line search methods have also been studied in the DFO setting; see [12] for an overview. Jamieson, Nowak, and Recht [20] offer lower bounds on the number of (potentially noisy) function evaluations, Nesterov and Spokoiny [24] and Golovin et al. [18] analyze methods with random search directions, and Berahas et al. [6] compare various gradient approximation methods.

Armijo condition methods. Perhaps the most common line search is Armijo backtracking, which begins with a fixed initial step size and geometrically decreases it until the Armijo condition is satisfied [32, 26, 23].

Definition 2.1. Armijo condition [2]: If our line search direction is d (such that dᵀ∇f(x) < 0) from iterate x, then the Armijo condition is

f(x + td) ≤ f(x) + γt dᵀ∇f(x),

where γ ∈ (0, 1) and typically γ = 1/2 (we use γ = 1/2 throughout). In the case of steepest descent, we have d = −∇f(x), with corresponding Armijo condition

f(x − t∇f(x)) ≤ f(x) − γt‖∇f(x)‖².

On convex objective functions with L-Lipschitz gradients, such a strategy (taking Armijo steps in the negative gradient direction) guarantees convergence with a rate depending on the chosen initial step size, which is intended to be ≥ 1/L [32, 23]. As we might expect, this implementation (denoted “traditional backtracking” in our experiments) performs well with a sufficiently large initial step size but can converge quite slowly if the initial step size is too small. More sophisticated Armijo line searches attempt to find a reasonably large step size that satisfies Definition 2.1. One option is to replace the fixed initial step size with a value slightly larger than the previous step size [23] (denoted “adaptive backtracking” in our experiments), which like many of the stochastic line searches increases the step size by at most a constant factor in each step. Another is the polynomial interpolation method of Nocedal and Wright, which seeks to minimize a local approximation of the objective while satisfying Definition 2.1 [26].
Although the Armijo condition requires no additional gradient evaluations, it does rely on the gradient at the initial point, making it more sensitive (than approximately exact line search, which uses no gradient information) to noisy gradients like those available in stochastic or derivative-free optimization.
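For comparison with AELS, a minimal sketch of traditional Armijo backtracking (with γ = 1/2, as used throughout the paper) follows; grad_f denotes a gradient oracle and is an assumption of this sketch, not part of the paper's code:

import numpy as np

def backtracking_step_size(f, grad_f, x, T0, beta, gamma=0.5):
    """Traditional Armijo backtracking along d = -grad f(x): a sketch.
    Shrinks the fixed initial step size T0 geometrically until the
    Armijo condition f(x - t g) <= f(x) - gamma * t * ||g||^2 holds."""
    g = grad_f(x)
    fx, t = f(x), T0
    while f(x - t * g) > fx - gamma * t * np.dot(g, g):
        t *= beta
    return t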

Curvature condition methods. In addition to the Armijo condition, which ensures that the step size is not too large, many line searches include curvature conditions, such as the Wolfe or Goldstein conditions [40, 26], which aim to ensure that the step size is also not too small. Although this largely remedies the small step size failure mode of pure Armijo backtracking methods, it comes at a cost. The Wolfe curvature condition requires gradient or directional derivative measurements during the line search, which may be expensive, unavailable, or noisy in certain situations (e.g. stochastic optimization or DFO). The Goldstein condition closely parallels the Armijo condition and requires no extra gradient evaluations, but depending on the objective it may exclude the exact line search minimizer [26].

Bracketing methods. If the exact line search step size is known to lie within a certain interval, intermediate function evaluations can reduce the search interval geometrically until a desired accuracy is achieved [10, 3], and this search interval can be found adaptively (see e.g. Algorithms 3.5 and 3.6 in [26], although these use gradient evaluations). Our proposed approximately exact line search (Algorithm 2) is particularly inspired by golden section search [3], which minimizes any unimodal 1D function in logarithmic time (as a function of the initial search interval and the desired precision). Golden section search proceeds in a manner much like binary search, reducing the size of the active search window by a constant fraction with each additional function evaluation, and uses no gradient evaluations whatsoever. Algorithm 2 relies on the same logic to bracket the minimum with only function evaluations, but operates without a known search interval or fixed precision. Instead, we use a geometrically-increasing sequence of trial step sizes to determine the search interval, and search to a relative precision that guarantees a step size within a constant fraction of the true minimizing step size. This relative precision allows Algorithm 2 to smoothly trade off the function complexity of the line search and the precision of the selected step size: when we are close to the minimum, our line search has high precision, whereas farther away we accept lower precision in favor of proceeding to the next step.
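For reference, a sketch of classical golden section search on a fixed bracket (the method AELS adapts) appears below; the bracket [a, b] and tolerance tol are exactly the inputs AELS dispenses with:

def golden_section(h, a, b, tol=1e-8):
    """Minimize a unimodal 1D function h on [a, b]: a sketch.
    Each iteration shrinks the bracket by the inverse golden ratio
    using one new function evaluation and no gradients."""
    invphi = 2.0 / (1.0 + 5.0 ** 0.5)  # inverse golden ratio
    c, d = b - invphi * (b - a), a + invphi * (b - a)
    hc, hd = h(c), h(d)
    while b - a > tol:
        if hc < hd:  # minimizer lies in [a, d]
            b, d, hd = d, c, hc
            c = b - invphi * (b - a)
            hc = h(c)
        else:        # minimizer lies in [c, b]
            a, c, hc = c, d, hd
            d = a + invphi * (b - a)
            hd = h(d)
    return 0.5 * (a + b)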

3. Proposed algorithm. The goal of approximately exact line search (Algorithm 2), as its name suggests, is to find a step size within a constant fraction of (and without exceeding) the exact line search minimizer. AELS achieves this goal by iteratively increasing and/or decreasing the step size by a fixed multiplicative factor as long as doing so decreases the line search objective. Our procedure follows the same logic as golden section search [3]: a set of three sample step sizes contains the exact line search minimizer of a unimodal 1D objective as long as, of our three samples, the one in the middle has the lowest objective value. Accordingly, we evaluate the objective at an exponentially spaced set of step sizes, stopping when we know the exact minimizer lies in the interior of our sample step sizes. Among the three sample step sizes that most closely bracket the minimizer, we use the smallest to ensure that we never exceed the true minimizer.

Using only function evaluations to choose the step size, rather than relying on the Armijo condition, requires a small overhead in function evaluations per step but enables greater flexibility in the objective functions we can optimize efficiently (by removing any gradient requirements). We also note that the “warm start” procedure of initializing the step size slightly larger than the previous step size used (T = β⁻¹t) is beneficial experimentally, although an alternative approach of initializing with a step size equal to the previous step size (T = t) achieves the same worst-case bounds in terms of steps and function evaluations. Since Algorithm 2 chooses a step size slightly smaller than an exact line search (rather than slightly larger), this larger

initialization may on average better approximate the desired step size in practice. A python implementation of AELS, including various search directions, special cases to optimize nonconvex objectives (like those in our DFO experiments), and adaptations for finite precision function evaluations (which may be indistinguishable in a wide, flat minimum), is provided at https://github.com/modestyachts/AELS.

Algorithm 2: Approximately Exact Line Search

Input: f, x, T > 0, β ∈ (0, 1), d
t ← T
fold ← f(x)
fnew ← f(x + td)
/* Decide to try increasing or decreasing the step size */
α ← β
if fnew ≤ fold then
    α ← β⁻¹
end
/* Increase/decrease the step size as long as f is decreasing */
repeat
    t ← αt
    fold ← fnew
    fnew ← f(x + td)
until fnew ≥ fold
/* If increasing the step size failed, try decreasing it */
if t = T/β then
    t ← T
    α ← β
    (fnew, fold) ← (fold, fnew)
    repeat
        t ← αt
        fold ← fnew
        fnew ← f(x + td)
    until fnew > fold
end
/* Return a step size ≤ the exact line search minimizer */
if α < 1 then
    return t
end
return β²t
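A direct Python transcription of Algorithm 2 reads as follows (a sketch for strictly unimodal line search objectives; the authors' fuller implementation, with the nonconvex and finite-precision special cases, is at the repository linked above). The only departure from the pseudocode is that the first-increase-failed test t = T/β is tracked with a counter rather than a floating-point equality:

def aels(f, x, T, beta, d):
    """Approximately exact line search (Algorithm 2): a sketch.
    Returns a step size within a constant fraction of (and not
    exceeding) the exact minimizer of f(x + t * d) over t > 0."""
    t, f_old, f_new = T, f(x), f(x + T * d)
    # Decide whether to try increasing or decreasing the step size.
    alpha = 1.0 / beta if f_new <= f_old else beta
    # Increase/decrease the step size as long as f is decreasing.
    trials = 0
    while True:
        t *= alpha
        f_old, f_new = f_new, f(x + t * d)
        trials += 1
        if f_new >= f_old:
            break
    # If the very first increase failed (t = T / beta), try decreasing.
    if alpha > 1.0 and trials == 1:
        t, alpha = T, beta
        f_old, f_new = f_new, f_old  # swap, as in the pseudocode
        while True:
            t *= alpha
            f_old, f_new = f_new, f(x + t * d)
            if f_new > f_old:
                break
    # Return a step size <= the exact line search minimizer.
    return t if alpha < 1.0 else beta ** 2 * t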

Figure 1 illustrates the three typical modes of AELS. If T is sufficiently larger than the exact line search minimizer t∗ (Figure 1a), our first candidate step will immediately increase the objective value, so we iteratively decrease the step size until we are guaranteed that our step size tk does not exceed t∗. If T is sufficiently smaller than t∗ (Figure 1b), we iteratively increase the candidate step size until it surpasses t∗, and then select the largest previous candidate step size tk that we can guarantee does not exceed t∗. If T is sufficiently close to t∗, we must try both larger and smaller step sizes in order to find a tk we can guarantee is within a constant fraction of t∗.


(a) T too large (b) T too small (c) T near t∗

Fig. 1: Illustration of Algorithm 2 in three characteristic scenarios defined by the relationship between T and t∗, with the chosen step size denoted tk. Arrows indicate trials of candidate step sizes, with each candidate step shown in a lighter shade than the previous. In this illustration and in our experiments, we use β = 2/(1 + √5), the inverse golden ratio.

4. Preliminaries. We begin by compiling useful facts and notation that appear in our analysis.

4.1. Notation. We use the following notation:
• ‖·‖ ≡ ‖·‖₂, the Euclidean norm.
• L denotes the Lipschitz constant of the gradient of f, and m denotes its strong convexity parameter.
• f′v(x) ≡ vᵀ∇f(x)/‖v‖, the directional derivative of f at x in direction v.
• f∗ denotes the minimum value of a (typically convex) function f. x∗ denotes an optimal input point, such that f(x∗) = f∗.
• Δ is the user-specified tolerance for ‖x − x∗‖ at termination.

Definition 4.1. A function f has L-Lipschitz gradient if for all x, y:

‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖.

Equivalently (using Taylor's theorem), f has L-Lipschitz gradient if for all x, y:

f(y) ≤ f(x) + ∇f(x)ᵀ(y − x) + (L/2)‖y − x‖².

Definition 4.2. A function f is m-strongly convex if for all x, y:

‖∇f(x) − ∇f(y)‖ ≥ m‖x − y‖.

Equivalently (using Taylor's theorem), f is m-strongly convex if for all x, y:

f(y) ≥ f(x) + ∇f(x)ᵀ(y − x) + (m/2)‖y − x‖².

4.2. Gradient approximation. For derivative-free optimization, we approximate the gradient or directional derivative using finite differences. In this setting, we can often guarantee performance that is nearly as fast as in the gradient setting, but with an extra error term determined by the quality of our gradient approximation.

Lemma 4.3. Let v be any unit vector and σ > 0 be a sampling radius. Let g_v^f(x) = (f(x + σv) − f(x))/σ be a forward finite differencing approximation of f′v(x) with radius σ. Then for all x:

|g_v^f(x) − f′v(x)| ≤ Lσ/2.

Proof. See Theorem 2.1 in [6].

Corollary 4.4. A forward finite differencing gradient approximation g(x) with sampling radius σ satisfies ‖g(x) − ∇f(x)‖ ≤ Lσ√n/2, where n is the dimension of x.

We work with forward finite differencing rather than central finite differencing because although central finite differencing can give better accuracy, it requires twice as many function evaluations.

4.3. Function properties. Although we show experimentally that AELS is competitive even on nonconvex objective functions with potentially non-Lipschitz gradients, we prove convergence rates in the simpler setting of strongly convex functions with Lipschitz gradients. Functions with various subsets of these properties satisfy relationships that will be useful in our analysis.

Lemma 4.5. If f is m-strongly convex, then for all x:

f(x) − f∗ ≤ (1/(2m))‖∇f(x)‖².

Proof. See chapter 3 in [32].
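The forward finite-difference approximations of Subsection 4.2 can be sketched in Python as follows (a sketch under the lemmas' assumptions, not the authors' code):

import numpy as np

def fd_directional(f, x, v, sigma):
    """Forward finite-difference estimate of f'_v(x) for a unit
    vector v, accurate to within L * sigma / 2 (Lemma 4.3)."""
    return (f(x + sigma * v) - f(x)) / sigma

def fd_gradient(f, x, sigma):
    """Coordinate-wise forward finite-difference gradient estimate,
    with error at most L * sigma * sqrt(n) / 2 (Corollary 4.4).
    Costs n extra function evaluations per call."""
    n = x.shape[0]
    fx = f(x)
    g = np.empty(n)
    for i in range(n):
        e = np.zeros(n)
        e[i] = 1.0
        g[i] = (f(x + sigma * e) - fx) / sigma
    return g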

Lemma 4.6. Let f be any m-strongly convex function with L-Lipschitz gradient, x be the current iterate, d be any vector with dᵀ∇f(x) < 0, and t > 0 be any step size. Then for all x:

f(x) + tdᵀ∇f(x) + (mt²/2)‖d‖² ≤ f(x + td) ≤ f(x) + tdᵀ∇f(x) + (Lt²/2)‖d‖².

Note that if we set d = −f′v(x)v for an arbitrary unit vector v, then dᵀ∇f(x) = −(f′v(x))² < 0 and

f(x) − t(1 − tm/2)(f′v(x))² ≤ f(x − tf′v(x)v) ≤ f(x) − t(1 − tL/2)(f′v(x))².

Regardless of the choice of d, the first inequality requires only m-strong convexity and the second inequality requires only L-Lipschitz gradient.

Proof. See chapter 2 in [25].

5. Theoretical results. In this section, we state and discuss convergence rates for both general Armijo line searches and AELS; proofs are deferred to Section 6. We focus on objective functions that are strongly convex and have Lipschitz gradients. We analyze three schemes for selecting dk (the search direction in Algorithm 1), sketched in code at the end of Subsection 5.1:
1. dk = −∇f(xk).
2. dk is an approximation of −∇f(xk) with bounded error, e.g. by using σ-finite differencing along coordinate directions.
3. dk = −µkvk, where vk is a uniformly random unit vector and µk is a σ-finite differencing approximation of the directional derivative in direction vk.

The first scheme is useful for objective functions which are efficiently differentiable, while the second and third are useful for derivative-free optimization. We begin by analyzing general line searches whose step sizes satisfy the Armijo condition and then proceed with our analysis of AELS. Both types of line search guarantee linear convergence (modulo gradient approximation errors), but AELS has two advantages in its convergence bounds compared to Armijo methods.

The first is that AELS' convergence rates (Theorem 5.5) depend only on the objective function, with no dependence on the initial step size T0. For Armijo methods, the convergence rates (Lemma 5.2) depend on the minimum step size taken, which can be needlessly slow if T0 < 1/L and the line search is insufficiently adaptive. The second advantage appears in random direction search, in which the expected convergence rate is truly linear for AELS, whereas an Armijo line search has an additive error term from the gradient approximation used in the Armijo condition. One downside of AELS is that it requires a slight overhead in function evaluations (see Theorem 5.6) compared to most Armijo line searches, in order to compensate for not using gradient information. In practice, this overhead is minimal compared to the speedup of convergence in fewer iterations, as shown in Section 7.

5.1. Convergence rates for general Armijo line searches.

Definition 5.1. ta is the positive solution to f(x + td) = f(x) + (t/2)dᵀ∇f(x), the Armijo condition (Definition 2.1) in direction d (with dᵀ∇f(x) < 0) from iterate x. Note that in some cases we work with an approximation to the gradient, in which case we replace the gradient in this condition with its approximation.

We begin with upper bounds on the iteration complexity of general line searches based on the Armijo condition (Definition 2.1) using our three gradient approximation methods (true gradient, finite differencing, and random direction). These bounds apply to any line search whose step sizes satisfy the Armijo condition, such as the traditional backtracking line search and its adaptive version we consider in Section 7. As is evident in the bounds, the convergence rate of Armijo-based methods depends on the minimum step size they take. Traditional backtracking line search, for instance, never increases the step size from a user-provided initial guess, which may be arbitrarily small compared to ta, the largest step size that satisfies the Armijo condition (Definition 5.1). Adaptive backtracking line search can increase the step size, but may require many steps to reach a step size near ta.

Lemma 5.2. Let f be m-strongly convex and have L-Lipschitz gradient, and let xk+1 = xk − tkg(xk), where tk ∈ [tmin, tmax] satisfies the Armijo condition with respect to direction −g(xk) (i.e., g(xk) is the gradient approximation and −g(xk) is the search direction) and n is the dimension of x. Using the true gradient,

f(xk) − f∗ ≤ (1 − mtmin)^k (f(x0) − f∗).

Using a σ-finite differencing gradient approximation,

f(xk) − f∗ ≤ (1 − mtmin)^k (f(x0) − f∗) + (σL²tmax√n‖x0 − x∗‖/(2mtmin)) (1 − (1 − mtmin)^k).

Using a random direction and approximating the directional derivative by σ-finite differencing,

E[f(xk) − f∗] ≤ (1 − mtmin/n)^k (f(x0) − f∗) + (σL²tmaxn‖x0 − x∗‖/(2mtmin)) (1 − (1 − mtmin/n)^k).

Lemma 5.2 shows in general linear convergence with a worst-case rate determined by the strong convexity parameter m of the objective and the minimum step size taken

during the line search: the greater the curvature of the objective and the larger the steps, the faster the convergence. In particular, for traditional backtracking line search with initial step size T0 using the true gradient, tmin ≥ min(T0, β/L). For adaptive backtracking (which increases the initial step size by a constant factor β⁻¹ in each step) using the true gradient, tmin ≥ min(T0/β^k, β/L) in step k. We also experimented with a “forward-tracking” line search, which increases the initial step size as much as necessary in each step to guarantee tmin ≥ β/L, assuming the true gradient is used. This method guarantees a step size within a β fraction of the maximum step size that satisfies the Armijo condition; we found it to be more efficient than both traditional and adaptive backtracking line searches, but slower than AELS, which bypasses the Armijo condition entirely.

Approximating the gradient by σ-finite differencing slows the convergence rate by an additive error term, which goes to zero as σ → 0. This error term is larger for higher-dimensional objectives (since each dimension of the gradient must be estimated), ill-conditioned objectives (with large L and small m), and objective-algorithm pairs with large variation in step size (often related to poor conditioning). The two-term rate means we expect linear convergence until a certain accuracy threshold, beyond which we can no longer improve because the magnitude of the true gradient is overwhelmed by approximation error. The accuracy threshold can be dropped by decreasing the sampling radius σ; in practice this could be done adaptively to achieve a desired precision with performance mostly adhering to the linear term.

If descent proceeds along random directions with approximate directional derivatives, the linear rate is worsened by a factor of the dimension (as observed in [24]), with a similar σ-dependent accuracy threshold that can be managed by (potentially adaptive) reduction of σ.
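For concreteness, the three direction schemes analyzed in this section can be sketched in Python as follows, reusing the hypothetical finite-difference helpers fd_gradient and fd_directional sketched in Section 4:

import numpy as np

def direction_true_gradient(grad_f, x):
    """Scheme 1: dk = -grad f(xk)."""
    return -grad_f(x)

def direction_fd_gradient(f, x, sigma):
    """Scheme 2: dk approximates -grad f(xk) by sigma-finite
    differencing along coordinate directions."""
    return -fd_gradient(f, x, sigma)

def direction_random(f, x, sigma, rng):
    """Scheme 3: dk = -mu_k * vk for a uniformly random unit vector
    vk, with mu_k a sigma-finite-differencing estimate of the
    directional derivative along vk."""
    z = rng.standard_normal(x.shape[0])
    v = z / np.linalg.norm(z)  # uniform on the unit sphere
    return -fd_directional(f, x, v, sigma) * v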

5.2. Approximately exact line search.

Definition 5.3. t∗ ≡ argmin_t f(x + td), the exact minimizer of a line search along direction d (with dᵀ∇f(x) < 0) from iterate x.

The core idea of AELS is to find a step size close to (and less than) t∗. More precisely, we have the following invariant in each step of Algorithm 2.

Invariant 5.4. Let t be the step size used in a step of Algorithm 2. Then t ∈ [β²t∗, t∗], where t∗ is the step size used in exact line search (using the same search direction from the same iterate).

Although AELS may choose a step size that does not satisfy the Armijo condition, it does so only if the larger step size achieves a smaller objective value compared to ta. This property is central to the proof of Theorem 5.5, which bounds the iteration complexity of AELS.

Theorem 5.5. On m-strongly convex objectives with L-Lipschitz gradients, Algorithm 2 satisfies the following convergence rates: Using the true gradient,

f(xk) − f∗ ≤ (1 − β²(1 − √(1 − m/L)))^k (f(x0) − f∗).

Using a σ-finite differencing gradient approximation with σ = 2σ̃mΔ/(L√n) and σ̃ < 1,

f(xk) − f∗ ≤ (1 − β²((1 − σ̃)/(1 + σ̃)²)(1 − √(1 − m/L)))^k (f(x0) − f∗) + σ̃(1 + σ̃)²LΔ‖x0 − x∗‖/(β²(1 − σ̃)²(1 − √(1 − m/L))).

Using a random direction and approximating the directional derivative (by any method with the correct sign),

E[f(xk) − f∗] ≤ (1 − (β²/n)(1 − √(1 − m/L)))^k (f(x0) − f∗),

where n is the dimension of x.

Theorem 5.5 parallels Lemma 5.2, with a difference in the random direction regime, in which AELS is robust to directional derivative approximation error. More generally, AELS has the feature of choosing the step size based only on function evaluations, so the performance of the line search itself has no dependence on the accuracy of the gradient approximation. Again, this difference tends to not appear in practice, although we do observe that in the stochastic setting (explored in Figure 3) AELS performs much better with smaller minibatches (less accurate gradient and function evaluations) compared to the baseline Armijo line searches.

Theorem 5.6. On m-strongly convex objectives with L-Lipschitz gradients, Algorithm 2 uses at most

k (4 + ⌈log_{1/β}( γmax(1 + √(1 − m/L)) / (γmin(1 − √(1 − m/L))) )⌉) + ⌈log_{1/β} max( β²γmin(1 − √(1 − m/L))/(mT0), mT0/(βγmax(1 + √(1 − m/L))) )⌉

function evaluations for k steps, where γk = −dkᵀ∇f(xk)/‖dk‖² ∈ [γmin, γmax] according to Lemma 6.2.

The first step requires some overhead to adjust the initial step size, and all subsequent steps require up to four function evaluations, plus some potential overhead depending on the conditioning of the objective and the quality of the gradient approximation (appearing through γk). AELS tends to require slightly more function evaluations per step compared to simpler line searches like the backtracking Armijo methods we consider in Section 7 because it relies on function evaluations only, and at least two function evaluations are required just to determine if the objective is locally increasing or decreasing.

Note that Theorem 5.6 counts only function evaluations (not gradient evaluations), so if the gradient is approximated by forward finite differencing that will cost an additional n function evaluations per step. Using a random direction and approximating the directional derivative by forward finite differencing requires one additional function evaluation per step.

6. Proofs of theoretical results. We now prove the bounds in Section 5.

6.1. Convergence rates for general Armijo line searches.

Proof of Lemma 5.2. The convergence rates using the true gradient and using a finite differencing approximation both follow (via Corollary 4.4) from the more general rate: If ‖g(x) − ∇f(x)‖ ≤ ρ, then

f(xk) − f∗ ≤ (1 − mtmin)^k (f(x0) − f∗) + (tmaxρL‖x0 − x∗‖/(mtmin)) (1 − (1 − mtmin)^k).

We prove this more general rate using a technique based on chapter 3 of [32].

f(xk+1) − f∗ = f(xk − tkg(xk)) − f∗
≤ f(xk) − f∗ − (tk/2)‖g(xk)‖² (since tk satisfies Armijo)
= f(xk) − f∗ − (tk/2)‖∇f(xk)‖² − tk(g(xk) − ∇f(xk))ᵀ∇f(xk) − (tk/2)‖g(xk) − ∇f(xk)‖²
≤ f(xk) − f∗ − mtk(f(xk) − f∗) − tk(g(xk) − ∇f(xk))ᵀ∇f(xk)
≤ (1 − mtk)(f(xk) − f∗) + tkρ‖∇f(xk)‖
≤ (1 − mtmin)(f(xk) − f∗) + tmaxρL‖xk − x∗‖ (by smoothness),

where the third to last line follows from strong convexity (Lemma 4.5). Then,

f(xk) − f∗ ≤ (1 − mtmin)^k (f(x0) − f∗) + Σ_{i=0}^{k−1} (1 − mtmin)^i tmaxρL‖xi − x∗‖

= (1 − mtmin)^k (f(x0) − f∗) + (tmaxρL‖x0 − x∗‖/(mtmin)) (1 − (1 − mtmin)^k),

where the last term captures the error introduced by approximating the gradient. This error term is bounded for any fixed k, and if tmin ∈ (0, 2/m), we can bound the last term by its limit as k → ∞:

f(xk) − f∗ ≤ (1 − mtmin)^k (f(x0) − f∗) + tmaxρL‖x0 − x∗‖/(mtmin).

Next, we prove the convergence rate for the random direction case, where dk = (−f′vk(xk) + ρk)vk for an arbitrary unit vector vk and |ρk| ≤ Lσ/2 (i.e., ‖dk‖ is a σ-finite differencing approximation of the directional derivative magnitude along vk, using the bound of Lemma 4.3). Let tk be the step size used in step k, so xk+1 = xk + tk(−f′vk(xk) + ρk)vk. Using the approximated Armijo condition:

f(xk+1) ≤ f(xk) − (tk/2)(−f′vk(xk) + ρk)²
= f(xk) − (tk/2)(f′vk(xk))² + tkρk f′vk(xk) − (tk/2)ρk²
≤ f(xk) − (tk/2)(f′vk(xk))² + tkρk f′vk(xk)
≤ f(xk) − (tk/2)(f′vk(xk))² + tk(Lσ/2)|f′vk(xk)|
≤ f(xk) − (tmin/2)(f′vk(xk))² + tmaxL²σ‖xk − x∗‖/2,

where we used Lemma 4.3 and Lipschitz gradient to bound ρk f′vk(xk). Now, we take expectations over all the random directions v0, ..., vk:

E[f(xk+1)] ≤ E[f(xk)] − (tmin/2)E[(f′vk(xk))²] + tmaxL²σ‖xk − x∗‖/2
= E[f(xk)] − (tmin/2)E[E[(f′vk(xk))² | v0, ..., vk−1]] + tmaxL²σ‖xk − x∗‖/2
= E[f(xk)] − (tmin/2)E[E[(vkᵀ∇f(xk))² | v0, ..., vk−1]] + tmaxL²σ‖xk − x∗‖/2
= E[f(xk)] − (tmin/2)E[(∇f(xk))ᵀE[vkvkᵀ]∇f(xk)] + tmaxL²σ‖xk − x∗‖/2
= E[f(xk)] − (tmin/(2n))E[‖∇f(xk)‖²] + tmaxL²σ‖xk − x∗‖/2,

where n is the dimension of the space and vk = z/‖z‖ for z ∼ N(0, In). Subtracting f∗ from each side, we have:

E[f(xk+1) − f∗] ≤ E[f(xk) − f∗] − (tmin/(2n))E[‖∇f(xk)‖²] + tmaxL²σ‖xk − x∗‖/2
≤ E[f(xk) − f∗](1 − mtmin/n) + tmaxL²σ‖xk − x∗‖/2,

where the last inequality follows from strong convexity (Lemma 4.5). Repeating over the k steps, we have nearly linear convergence:

E[f(xk) − f∗] ≤ (1 − mtmin/n)^k (f(x0) − f∗) + (σL²tmaxn‖x0 − x∗‖/(2mtmin)) (1 − (1 − mtmin/n)^k),

where the last term captures the error introduced by approximating the directional derivative by finite differencing in each step. This error term is bounded for any number of iterations k, and if tmin ∈ (0, 2n/m) it converges to σL²tmaxn‖x0 − x∗‖/(2mtmin), which can be managed by judicious choice of σ.

Lemma 5.2 depends on tmin, which in turn may depend on ta, so we next prove bounds on ta. Lemma 6.1 will also be useful in the convergence analysis of AELS.

Lemma 6.1. If f is m-strongly convex with L-Lipschitz gradient and d is a search (descent) direction, then

ta ∈ (−dᵀ∇f(x)/‖d‖²) [1/L, 1/m].

Note that the first inequality requires only the Lipschitz gradient condition, and the second requires only the strong convexity condition. Without strong convexity (as m → 0) we approach the one-sided bound ta ≥ −dᵀ∇f(x)/(L‖d‖²).

Proof. Applying Lemma 4.6 to the case t = ta,

f(x) + ta dᵀ∇f(x) + (mta²/2)‖d‖² ≤ f(x) + (ta/2) dᵀ∇f(x) ≤ f(x) + ta dᵀ∇f(x) + (Lta²/2)‖d‖².

Rearranging, we have:

(mta/2)‖d‖² ≤ −(1/2) dᵀ∇f(x) ≤ (Lta/2)‖d‖²,

which implies the lemma.

Lemma 6.1 gives upper and lower bounds on ta as a function of −dᵀ∇f(x)/‖d‖², which varies depending on the local behavior of the objective and the method for choosing a descent direction d. Accordingly, the next step is to bound −dᵀ∇f(x)/‖d‖² for each of the descent directions we consider.

Lemma 6.2. If we set d = −f′v(x)v for an arbitrary unit vector v, including the case where v = ∇f(x)/‖∇f(x)‖ so d = −∇f(x), then

−dᵀ∇f(x)/‖d‖² = 1.

If we set d = −∇f(x) + ρ where ‖ρ‖ ≤ Lσ√n/2 (i.e., d is a σ-finite differencing approximation of −∇f(x), using the bound of Corollary 4.4), with σ = 2σ̃mΔ/(L√n) and σ̃ < 1, then

−dᵀ∇f(x)/‖d‖² ∈ [(1 − σ̃)/(1 + σ̃)², 1/(1 − σ̃)].

If we set d = (−f′v(x) + ρ)v for an arbitrary unit vector v where |ρ| < Lσ/2 (i.e., ‖d‖ is a σ-finite differencing approximation of the directional derivative magnitude along v, using the bound of Lemma 4.3), with σ = 2σ̃mΔ/L and σ̃ < 1, then

−dᵀ∇f(x)/‖d‖² ∈ [1/(1 + σ̃), 1/(1 − σ̃)].

Proof. If we set d = −f′v(x)v for an arbitrary unit vector v, then

−dᵀ∇f(x)/‖d‖² = f′v(x) vᵀ∇f(x)/(f′v(x))² = 1.

If we set d = −∇f(x) + ρ where ‖ρ‖ ≤ Lσ√n/2, with σ = 2σ̃mΔ/(L√n) and σ̃ < 1, then

−dᵀ∇f(x)/‖d‖² = (∇f(x) − ρ)ᵀ∇f(x)/‖∇f(x) − ρ‖²
∈ [(‖∇f(x)‖² − ‖∇f(x)‖‖ρ‖)/‖∇f(x) − ρ‖², ‖∇f(x)‖/‖∇f(x) − ρ‖]
⊆ [m‖x − x∗‖(m‖x − x∗‖ − σ̃mΔ)/(m‖x − x∗‖ + σ̃mΔ)², m‖x − x∗‖/(m‖x − x∗‖ − σ̃mΔ)]
⊆ [(1 − σ̃Δ/‖x − x∗‖)/(1 + σ̃Δ/‖x − x∗‖)², 1/(1 − σ̃Δ/‖x − x∗‖)]
⊆ [(1 − σ̃)/(1 + σ̃)², 1/(1 − σ̃)].

If we set d = (−f′v(x) + ρ)v for an arbitrary unit vector v where |ρ| < Lσ/2, with σ = 2σ̃mΔ/L and σ̃ < 1, then

−dᵀ∇f(x)/‖d‖² = (f′v(x) − ρ)vᵀ∇f(x)/(f′v(x) − ρ)² = f′v(x)/(f′v(x) − ρ)
∈ [m‖x − x∗‖/(m‖x − x∗‖ + mσ̃Δ), m‖x − x∗‖/(m‖x − x∗‖ − mσ̃Δ)]
⊆ [1/(1 + σ̃Δ/‖x − x∗‖), 1/(1 − σ̃Δ/‖x − x∗‖)]
⊆ [1/(1 + σ̃), 1/(1 − σ̃)].

The constraints on σ in Lemma 6.2 guarantee that we choose a descent direction; in our subsequent analysis we assume they are satisfied. Although we commonly don't know m and L in practice, setting σ to be a fixed small constant (e.g. 10⁻⁸) suffices in our experiments. In the limit as σ → 0, the bounds on −dᵀ∇f(x)/‖d‖² using finite differencing approach 1, the value using the true gradient or directional derivative.

6.2. Approximately exact line search.

Proof of Invariant 5.4. We begin with some notation and observations. Let h(s) = f(x + sd) for current iterate x and search direction d. If f is strictly unimodal (e.g. if f is strongly convex), then h is decreasing for all s ∈ [0, t∗) and increasing for all s > t∗. Therefore, if we can find some s1 < s2 < s3 such that h(s1) ≥ h(s2) and h(s2) ≤ h(s3), then t∗ ∈ [s1, s3]. AELS follows this methodology by always returning a step size t for which h(t) ≥ h(β⁻¹t) and h(β⁻¹t) ≤ h(β⁻²t).

We analyze AELS (Algorithm 2) step by step, beginning with the first If statement, which compares h(0) and h(T). There are three cases.

If h(T) > h(0), then we know t∗ < T, and accordingly we choose to backtrack (setting α = β < 1). Note that by convexity, our first backtracking step (with step size βT) in this scenario will always satisfy h(βT) < h(T). We then repeatedly multiply the step size by β until we reach a step size t with h(t) ≥ h(β⁻¹t), at which point we are also guaranteed that h(β⁻¹t) < h(β⁻²t). AELS returns this t, which satisfies the desired invariant.

If instead h(T) ≤ h(0), we cannot be sure of the relationship between t∗ and T, so we try to forward-track (setting α = β⁻¹ > 1). In the first forward-tracking step, we test a step size β⁻¹T. If h(β⁻¹T) < h(T), we continue forward-tracking until we reach a step size t with h(t) ≥ h(βt), at which point we are also guaranteed that h(βt) < h(β²t). AELS returns β²t, which satisfies the desired invariant.

The third and final case is if h(T) ≤ h(0) but h(β⁻¹T) ≥ h(T), meaning that we tried forward-tracking for one loop iteration and immediately the objective value increased. We then follow the second If statement, reset the step size to T, and know that h(T) ≤ h(β⁻¹T). We backtrack by factors of β until we reach a step size t with h(t) > h(β⁻¹t), at which point we are also guaranteed that h(β⁻¹t) ≤ h(β⁻²t). AELS returns this t, which satisfies the desired invariant.

To bound the iteration complexity of AELS, we must first bound t∗.

Lemma 6.3. If f is m-strongly convex with L-Lipschitz gradient and d is a descent

direction, then

t∗ ∈ (−dᵀ∇f(x)/(m‖d‖²)) [1 − √(1 − m/L), 1 + √(1 − m/L)].

Note that without strong convexity (as m → 0) we approach the one-sided bound t∗ ≥ −dᵀ∇f(x)/(2L‖d‖²).

Proof. By optimality of t∗,

f(x + t∗d) ≤ f(x − (dᵀ∇f(x)/(L‖d‖²))d) ≤ f(x) − (dᵀ∇f(x))²/(2L‖d‖²),

where the last line follows from Lemma 6.1. Combining this with strong convexity (Definition 4.2),

f(x) + t∗ dᵀ∇f(x) + (mt∗²/2)‖d‖² ≤ f(x) − (dᵀ∇f(x))²/(2L‖d‖²),

which implies that (mt∗²/2)‖d‖² + t∗ dᵀ∇f(x) + (dᵀ∇f(x))²/(2L‖d‖²) ≤ 0. This is satisfied for t∗ in the interval (−dᵀ∇f(x)/(m‖d‖²)) [1 − √(1 − m/L), 1 + √(1 − m/L)].

We can now prove Theorem 5.5, which bounds the iteration complexity of AELS.

Proof of Theorem 5.5. Let f be m-strongly convex with L-Lipschitz gradient. Let xk+1 = xk − tkg(xk), where tk is chosen according to Algorithm 2 with search direction dk = −g(xk). Let γk = g(xk)ᵀ∇f(xk)/‖g(xk)‖².

We begin with the case where ‖g(x) − ∇f(x)‖ ≤ ρ, which occurs when either g(xk) is the true gradient ∇f(xk) or when g(xk) is a finite differencing approximation of the gradient. Let ta be the step size that satisfies the Armijo condition with equality (with gradient approximation g(xk) and direction −g(xk)) and t∗ be the exact line search step size (in direction −g(xk)). We begin with the observation that if tk > ta, it must be because f(xk − tkg(xk)) ≤ f(xk − tag(xk)). Then,

f(xk+1) − f∗ ≤ f(xk) − f∗ − (min(tk, ta)/2)‖g(xk)‖²
≤ f(xk) − f∗ − (min(tk, ta)/2)‖∇f(xk)‖² − min(tk, ta)(g(xk) − ∇f(xk))ᵀ∇f(xk) − (min(tk, ta)/2)ρ²
≤ f(xk) − f∗ − (γkβ²/(2m))(1 − √(1 − m/L))‖∇f(xk)‖² + (γkρ/m)‖∇f(xk)‖
≤ (1 − γminβ²(1 − √(1 − m/L)))(f(xk) − f∗) + (γmaxρL/m)‖xk − x∗‖,

where the third line follows by combining Lemma 6.1, Invariant 5.4, and Lemma 6.3 to show that min(tk, ta) ∈ [γk(β²/m)(1 − √(1 − m/L)), 1/m], and the fourth line follows from

strong convexity (Lemma 4.5), with γk ∈ [γmin, γmax] according to Lemma 6.2. Applying this relation iteratively (and simplifying the error term as a geometric series),

f(xk) − f∗ ≤ (1 − γminβ²(1 − √(1 − m/L)))^k (f(x0) − f∗) + Σ_{i=0}^{k−1} (1 − γminβ²(1 − √(1 − m/L)))^i γmaxρL‖xi − x∗‖/m
≤ (1 − γminβ²(1 − √(1 − m/L)))^k (f(x0) − f∗) + (γmaxρL‖x0 − x∗‖/(mγminβ²(1 − √(1 − m/L)))) (1 − (1 − γminβ²(1 − √(1 − m/L)))^k)
≤ (1 − γminβ²(1 − √(1 − m/L)))^k (f(x0) − f∗)

+ γmaxρL‖x0 − x∗‖/(mγminβ²(1 − √(1 − m/L))).

Combining this bound with Lemma 6.2 and Corollary 4.4 yields the stated bounds using the true gradient and a finite differencing approximation.

Next, we prove the bound for the case where g(xk) = µkvk, where µk is a finite differencing approximation of f′vk(xk) for a uniformly random unit vector vk. For

Algorithm 2, all we need to assume about µk is that it has the same sign as f′vk(xk), so that −g(xk) is a descent direction. Let tk be the step size chosen in step k, so xk+1 = xk − tkµkvk. We begin by introducing some auxiliary variables:
• sk = tk|µk|, so xk+1 = xk − sk vk sign(f′vk(xk)).
• sa = ta|f′vk(xk)|, where ta satisfies the Armijo condition (Definition 2.1) with equality (for d = −f′vk(xk)vk), so f(xk − sa vk sign(f′vk(xk))) = f(xk) − (sa/2)|f′vk(xk)|. Since by Lemma 6.1 ta ∈ [1/L, 1/m], sa ∈ [(1/L)|f′vk(xk)|, (1/m)|f′vk(xk)|].
• s∗ = t∗|f′vk(xk)|, where t∗ is an exact line search step size (along direction d = −f′vk(xk)vk), so f(xk − s∗ vk sign(f′vk(xk))) = min_s f(xk − s vk sign(f′vk(xk))). Since by Lemma 6.3 t∗ ∈ [(1/m)(1 − √(1 − m/L)), (1/m)(1 + √(1 − m/L))], we have s∗ ∈ [(1/m)(1 − √(1 − m/L))|f′vk(xk)|, (1/m)(1 + √(1 − m/L))|f′vk(xk)|].

Rewriting Invariant 5.4 in this notation, we have that sk ∈ [β²s∗, s∗]. One of the following cases must hold:
• sk ≤ sa: Then f(xk − sk vk sign(f′vk(xk))) ≤ f(xk) − (sk/2)|f′vk(xk)|.
• sk > sa: Then f(xk − sk vk sign(f′vk(xk))) ≤ f(xk) − (sa/2)|f′vk(xk)|.

Combining these cases, we have:

f(xk+1) ≤ f(xk) − (min(sk, sa)/2)|f′vk(xk)|
≤ f(xk) − (1/2) min((β²/m)(1 − √(1 − m/L))|f′vk(xk)|, (1/L)|f′vk(xk)|) |f′vk(xk)|
= f(xk) − (1/2) min((β²/m)(1 − √(1 − m/L)), 1/L) (f′vk(xk))²
≤ f(xk) − (β²/(2m))(1 − √(1 − m/L)) (f′vk(xk))²,

where we plugged in our bounds on sk and sa and then simplified by noticing that for any m ∈ (0, L], (β²/m)(1 − √(1 − m/L)) ≤ 1/L. Now, we take expectations over all the random directions v0, ..., vk and follow the same steps as in the proof of the random direction portion of Lemma 5.2:

E[f(xk+1)] ≤ E[f(xk)] − (β²/(2m))(1 − √(1 − m/L)) E[(f′vk(xk))²]
= E[f(xk)] − (β²/(2mn))(1 − √(1 − m/L)) E[‖∇f(xk)‖²],

where n is the dimension of the space and vk = z/‖z‖ for z ∼ N(0, In). Subtracting f∗ from each side, we have:

E[f(xk+1) − f∗] ≤ E[f(xk) − f∗] − (β²/(2mn))(1 − √(1 − m/L)) E[‖∇f(xk)‖²]
≤ E[f(xk) − f∗](1 − (β²/n)(1 − √(1 − m/L))),

where the last inequality follows from strong convexity (Lemma 4.5). Repeating over the k steps, we have linear convergence:

E[f(xk) − f∗] ≤ (f(x0) − f∗)(1 − (β²/n)(1 − √(1 − m/L)))^k.

Finally, we prove Theorem 5.6, which bounds the function evaluation complexity of AELS (Algorithm 2).

Proof of Theorem 5.6. Let tk denote the step size taken in step k, where by convention we set t−1 = T0. The initial step size in step k is then β⁻¹tk−1. We ignore the initial evaluation of f(xk) in each step as this can be reused between steps (except for the first step). In step k, AELS first makes one function evaluation to compute f(xk + β⁻¹tk−1dk). Then, there are three cases (as in the proof of Invariant 5.4, whose notation h we reuse).

If h(β⁻¹tk−1) > h(0), then we repeatedly multiply the step size by β (making one function evaluation for each factor of β) until we reach a step size t with t ∈ [β²t∗, t∗], which is used as tk. If p is the number of function evaluations in this loop, then tk = β^(p−1)tk−1, so p ≤ 2 + ⌈log_{1/β}(tk−1/t∗)⌉.

If instead h(β⁻¹tk−1) ≤ h(0), we try to forward-track (setting α = β⁻¹ > 1). In the first forward-tracking step, we test a step size β⁻²tk−1. If h(β⁻²tk−1) < h(β⁻¹tk−1), we continue multiplying the step size by β⁻¹ (making one function evaluation for each factor of β⁻¹) until we reach a step size t with t ∈ [t∗, β⁻²t∗], and then use tk = β²t. If q is the number of function evaluations in this loop, then tk = β^(−q+1)tk−1, so q ≤ ⌈log_{1/β}(t∗/tk−1)⌉.

The third and final case is if h(β⁻¹tk−1) ≤ h(0) but h(β⁻²tk−1) ≥ h(β⁻¹tk−1), meaning that we tried forward-tracking for one loop iteration and immediately the objective value increased. We then follow the second If statement, reset the step size to β⁻¹tk−1, and know that h(β⁻¹tk−1) ≤ h(β⁻²tk−1). We backtrack by factors of β until we reach a step size t with t ∈ [β²t∗, t∗], which is used as tk. We used one function evaluation to evaluate h(β⁻²tk−1), and then r function evaluations to backtrack (one per loop iteration). Then tk = β^(r−1)tk−1, so r ≤ 2 + ⌈log_{1/β}(tk−1/t∗)⌉ and the number of additional function evaluations is ≤ 3 + ⌈log_{1/β}(tk−1/t∗)⌉.

Combining these cases, the number of function evaluations in step k is

nk ≤ 1 + max(⌈log_{1/β}(t∗/tk−1)⌉, 3 + ⌈log_{1/β}(tk−1/t∗)⌉).

For k = 0, tk−1 = T0, so

n0 ≤ 1 + ⌈log_{1/β} max( γ0(1 + √(1 − m/L))/(mT0), mT0/(β³γ0(1 − √(1 − m/L))) )⌉.

For k > 0, tk−1 ∈ (γk−1/m) [β²(1 − √(1 − m/L)), 1 + √(1 − m/L)] (Lemma 6.3), so

nk ≤ 4 + ⌈log_{1/β}( γmax(1 + √(1 − m/L)) / (γmin(1 − √(1 − m/L))) )⌉.

We can now bound the total number of function evaluations Nk in the first k steps:

Nk = 1 + n0 + Σ_{i=1}^{k−1} ni
≤ 2 + ⌈log_{1/β} max( γ0(1 + √(1 − m/L))/(mT0), mT0/(β³γ0(1 − √(1 − m/L))) )⌉ + (k − 1)(4 + ⌈log_{1/β}( γmax(1 + √(1 − m/L))/(γmin(1 − √(1 − m/L))) )⌉)
≤ k (4 + ⌈log_{1/β}( γmax(1 + √(1 − m/L))/(γmin(1 − √(1 − m/L))) )⌉) + ⌈log_{1/β} max( β²γmin(1 − √(1 − m/L))/(mT0), mT0/(βγmax(1 + √(1 − m/L))) )⌉.

7. Empirical results. We evaluate the empirical performance of approximately exact line search and baseline approaches in three settings: deterministic gradient descent for logistic regression, minibatch (stochastic) gradient descent for logistic regression, and derivative-free optimization (DFO). The objective function for logistic regression is strongly convex with Lipschitz gradient, but these assumptions do not hold on the DFO benchmark problems [21]. All experiments are run on a 4 socket Intel Xeon CPU E7-8870 machine with 18 cores per socket and 1TB of DRAM. We run experiments in parallel, ensuring that at most one trial is run concurrently on each core.

7.1. Logistic regression. Our logistic regression experiments use the following objective function (with λ = 1/N):

f(x) = (λ/2) xᵀx + (1/N) Σ_{i=1}^{N} log(1 + exp(−yi(xᵀzi))),

where x is a vector of parameters to be optimized, N is the number of data points, and each data point has covariates zi and binary label yi. This objective is commonly used in practice, has L-Lipschitz gradient with L ≤ λ + 2λmax((1/N) Σ_{i=1}^{N} zi ziᵀ) (where λmax(·) denotes the maximum eigenvalue of a positive semidefinite matrix), and is m-strongly convex with m ≥ λ.

We define a benchmark set of logistic regression objectives comprised of the three datasets summarized in Table 1, where we append 1 to each covariate vector to allow for a bias in the model (this extra dimension is not included in Table 1). For each problem, we use the origin as a standard initial point x0. We estimate a Barzilai-Borwein [4] step size tBB there, and include each experiment using various initial step sizes T0: 0.01tBB, 0.1tBB, tBB, 10tBB, and 100tBB. Our code is available at https://github.com/modestyachts/AELS.
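A Python sketch of this objective and its gradient (hypothetical helper names, not the repository's API; Z is the N × d matrix of covariates with the appended 1s and y ∈ {−1, +1}^N):

import numpy as np

def logistic_objective(x, Z, y, lam):
    """f(x) = (lam/2) x^T x + (1/N) sum_i log(1 + exp(-y_i x^T z_i))."""
    margins = -y * (Z @ x)
    # np.logaddexp(0, m) computes log(1 + exp(m)) stably.
    return 0.5 * lam * (x @ x) + np.mean(np.logaddexp(0.0, margins))

def logistic_gradient(x, Z, y, lam):
    """Gradient of the objective above."""
    margins = -y * (Z @ x)
    s = 1.0 / (1.0 + np.exp(-margins))  # sigmoid of the margins
    return lam * x - Z.T @ (s * y) / y.shape[0]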

Table 1: Datasets used for logistic regression experiments.

Dataset                                            Data points   Variables
adult (LIBSVM [30] version of UCI [14] dataset)          32561         123
quantum (from KDD Cup 2004 [11])                         50000          78
protein (from KDD Cup 2004 [11])                        145751          74
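The Barzilai-Borwein initialization mentioned above can be estimated at $x_0$ by probing along the gradient; the probe radius and the probing scheme in this sketch are our assumptions, since the text only states that $t_{BB}$ is estimated at the initial point.

```python
import numpy as np

def bb_step_size_estimate(grad, x0, eps=1e-4):
    """Estimate a Barzilai-Borwein step size at a single point x0 by
    taking a small probe step along the negative gradient. The probe
    radius eps is an illustrative assumption.

    BB1 formula: t = (s^T s) / (s^T y), with s = x1 - x0 and
    y = grad(x1) - grad(x0)."""
    g0 = grad(x0)
    x1 = x0 - eps * g0            # small probe step along -grad f(x0)
    s = x1 - x0
    y = grad(x1) - g0
    return float(s @ s) / float(s @ y)
```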

In our logistic regression experiments, we compare the following line searches:
• Approximately exact: Algorithm 2.
• Traditional backtracking: at every step, starts with the same initial step size $T_0$ and decreases it geometrically until the Armijo condition is satisfied.
• Adaptive backtracking: follows the same "warm start" initial step size as AELS and decreases it geometrically until the Armijo condition is satisfied.
We plot results using performance profiles, introduced by Moré and Wild [21] to compare computational efficiency in terms of function evaluations for the DFO setting. Performance profiles visualize the fraction of benchmark problems (on the vertical axis) solved within various multiples of the computational cost incurred by the fastest algorithm to achieve convergence on each problem (the performance ratio, on the horizontal axis). For our logistic regression experiments, since all algorithms use both function and gradient evaluations, we use computation time rather than function evaluations as our notion of computational cost.
7.1.1. Deterministic gradient descent. We begin our logistic regression experiments by executing each algorithm as described, using the full dataset to compute the objective function and its gradient exactly wherever required. In this setting, the upper bounds on function and gradient evaluations from Section 5 apply. Our experiments in this section, shown in Figure 2, confirm that these bounds are indeed predictive of empirical performance, both between the algorithms we consider and across the range of initial step sizes $T_0$ we test for each objective function. Traditional backtracking line search performs well with a sufficiently large $T_0$, but slows down dramatically if $T_0$ is too small. Adaptive backtracking largely fixes this problem, at the cost of some overhead in the number of steps needed to reach a good step size. AELS achieves performance similar to that of traditional backtracking with a good $T_0$, without any notable dependence on $T_0$.
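As an implementation note, a performance profile of the kind described above reduces to a few lines of code; the cost matrix in the usage example is hypothetical.

```python
import numpy as np

def performance_profile(costs, ratios):
    """costs: (num_problems, num_solvers) array, where costs[p, s] is the
    computational cost (e.g., wall-clock time) for solver s to converge on
    problem p, with np.inf marking failures. Assumes at least one solver
    succeeds on each problem. Returns, for each solver, the fraction of
    problems solved within each multiple (ratio) of the best solver's
    cost on that problem."""
    best = costs.min(axis=1, keepdims=True)   # fastest solver per problem
    perf_ratio = costs / best                 # >= 1 everywhere, inf on failures
    # Row s, column r: fraction of problems with perf_ratio <= ratios[r].
    return np.stack([(perf_ratio <= r).mean(axis=0) for r in ratios], axis=1)

# Hypothetical usage: three solvers on five problems.
costs = np.array([[1.0, 2.0, 4.0],
                  [3.0, 3.0, 9.0],
                  [2.0, 8.0, 2.0],
                  [5.0, 5.0, np.inf],
                  [1.0, 4.0, 8.0]])
profile = performance_profile(costs, ratios=np.arange(1, 101))
```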

[Figure 2 (plot omitted): performance profile curves for Approximately Exact, Adaptive Backtracking, and Traditional Backtracking; vertical axis is the fraction of problems solved (0 to 1), horizontal axis is the performance ratio (0 to 100).]

Fig. 2: Performance profile for gradient descent (to a relative error of $10^{-4}$) on our logistic regression benchmark problems.

7.1.2. Minibatch stochastic gradient descent. Although our analysis does not extend to the stochastic setting, in practice with large datasets it is often desirable to work with only a small subset (minibatch) of the data at a time. In this section, we repeat the above experiments, except that at each iteration we randomly sample a minibatch from the dataset and replace all required objective function and gradient values with their approximations computed on the current minibatch. As we saw in Section 5, our proposed algorithm makes the tradeoff of slightly more expensive steps in exchange for requiring fewer steps to converge, resulting in a net speedup. In the minibatch setting, our algorithm still requires more computation per step than simpler line searches, but since we are optimizing an approximate objective in each step, it is a priori unclear whether this investment should yield any decrease in the number of steps to convergence.

In particular, there are two (not necessarily disjoint) regimes of stochastic optimization in which we might expect a more expensive line search subroutine to run slower than a simpler one: when the minibatch size is too small, and when the iterates approach the optimum. In the first case, the approximate objective we optimize in each step can be far from the true objective, so putting more computational effort into choosing each step size might be counterproductive. We explore this issue in Figure 3 by repeating our logistic regression experiment with decreasing minibatch sizes, and find that AELS remains competitive across a wide range of minibatch sizes, including the extreme case of single-example minibatches.

The second scenario (approaching the minimum) is similar, except that for functions with wide, flat minima we run the risk of taking a large counterproductive step even with a minibatch size that might be large enough to make progress farther from the minimum. We explore this issue in Figure 4, in which we hold the minibatch size constant (at 100) and vary the relative accuracy required for convergence, forcing the iterates to approach the minimizer.
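A minimal sketch of this minibatch setup follows, with the line search treated as a black box. Each iteration draws one minibatch and uses it for every function and gradient evaluation within that iteration, including inside the line search; the subroutine signature is our simplification of Algorithm 1, not the authors' implementation.

```python
import numpy as np

def minibatch_sgd_with_line_search(Z, y, lam, line_search, x0,
                                   batch_size=100, num_steps=1000, seed=0):
    """Minibatch SGD where all objective/gradient values in an iteration
    (line search included) come from that iteration's minibatch."""
    rng = np.random.default_rng(seed)
    x, t_prev = x0.copy(), 1.0
    for _ in range(num_steps):
        idx = rng.choice(Z.shape[0], size=batch_size, replace=False)
        Zb, yb = Z[idx], y[idx]

        def f(v):  # minibatch objective, fixed for this iteration
            return 0.5 * lam * (v @ v) + np.mean(np.logaddexp(0.0, -yb * (Zb @ v)))

        def grad(v):  # per-example terms: d/dm log(1 + e^{-m}) = -1/(1 + e^m)
            s = -yb / (1.0 + np.exp(yb * (Zb @ v)))
            return lam * v + Zb.T @ s / batch_size

        d = -grad(x)                          # search direction
        t_prev = line_search(f, x, t_prev, d) # e.g., AELS or Armijo backtracking
        x = x + t_prev * d
    return x
```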

[Figure 3 (plots omitted): six performance profile panels at minibatch sizes (a) 1600, (b) 400, (c) 100, (d) 25, (e) 6, and (f) 1, each comparing Approximately Exact, Adaptive Backtracking, and Traditional Backtracking.]

Fig. 3: Performance profiles for stochastic gradient descent (to a relative error of $10^{-4}$) on our logistic regression benchmark problems, repeated with varying minibatch sizes and over 10 random seeds (using the same set of seeds for each algorithm). The performance improvement of AELS in the full batch regime (Figure 2) transfers to the minibatch regime across all minibatch sizes tested.

We find that AELS remains fastest until the required relative error drops to $10^{-6}$, at which point it performs similarly to adaptive backtracking line search. Although performance may vary across objective functions, this experiment gives a sense of how close we can get to the minimum before our line searches are no longer helpful.

These experiments suggest that even our relatively expensive line search method can improve performance over a wide range of batch sizes and required accuracies, likely including many practical scenarios. Additionally, the line search method we propose could easily be combined with an adaptive batch size or other variance reduction method, as needed for specific problems.

[Figure 4 (plots omitted): four performance profile panels at required relative errors (a) $10^{-3}$, (b) $10^{-4}$, (c) $10^{-5}$, and (d) $10^{-6}$, each comparing Approximately Exact, Adaptive Backtracking, and Traditional Backtracking.]

Fig. 4: Performance profiles for stochastic gradient descent (with minibatches of size 100) on our logistic regression benchmark problems, repeated with varying required accuracy and over 10 random seeds (using the same set of seeds for each algorithm). The performance improvement of AELS in the minibatch regime (Figure 3) transfers to the various accuracy regimes tested, with performance degrading to roughly the same as adaptive backtracking in the high accuracy regime (relative error $10^{-6}$).

7.2. Derivative-free optimization. Our derivative-free optimization (DFO) experiments use the 53 benchmark problems proposed by Moré and Wild [21], all with an initial step size $T_0 = 1$. These problems are low to medium dimensional (no more than 12 dimensions), smooth (although accessed only through function evaluations), generally nonconvex (although they often have unique minima), and include standard starting points, sometimes near and sometimes far from a minimum. Our finite differencing uses the same sampling radius ($\sigma \approx 1.5 \times 10^{-8}$) as scipy.optimize [39]. We experimented with a heuristic for adapting $\sigma$, but found that a fixed $\sigma$ worked better on these problems. We also experimented with central finite differencing, but found forward finite differencing to be a bit faster, offering a favorable tradeoff of fewer function evaluations for minimal lost accuracy. We use the same forward finite differencing approximation in place of the gradient in both gradient descent and BFGS [16].
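A sketch of the forward finite differencing approximation described above; the sampling radius matches the value quoted in the text, while the function name is ours.

```python
import numpy as np

SIGMA = 1.5e-8  # sampling radius quoted above, as in scipy.optimize

def forward_difference_gradient(f, x, sigma=SIGMA):
    """Forward finite differencing: one extra function evaluation per
    coordinate, versus two per coordinate for central differencing.
    A sketch of the approximation described above, not the paper's
    exact code."""
    fx = f(x)
    g = np.empty_like(x, dtype=float)
    for i in range(x.size):
        e = np.zeros_like(x, dtype=float)
        e[i] = sigma
        g[i] = (f(x + e) - fx) / sigma
    return g
```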

We compare the following methods:
• BFGS with approximately exact line search (Algorithm 2).
• Gradient descent with approximately exact line search.
• Random direction search with approximately exact line search.
• BFGS with adaptive backtracking line search.
• BFGS with traditional backtracking line search.
• Nelder-Mead [22], as implemented in scipy.optimize [39].
• BFGS with a Wolfe line search [40], as implemented in scipy.optimize [39].
Figure 5 shows a performance profile comparing these methods. We find that all of the line searches applied to random or approximate gradient search directions are outperformed by the scipy.optimize [39] implementations of BFGS [16] and Nelder-Mead [22], likely because many of the objective functions are poorly conditioned, and Nelder-Mead was designed specifically for the derivative-free setting. However, applying line search methods to BFGS search directions allows both approximately exact line search (Algorithm 2) and traditional backtracking line search to compete effectively with these well-tuned baseline methods. Both also converge faster than BFGS with a Wolfe line search (as it is implemented in scipy.optimize), even though the BFGS update step is only guaranteed to be well-defined if the step size satisfies the Wolfe conditions (which require additional gradient estimation).
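The well-definedness caveat refers to the BFGS inverse-Hessian update, which requires positive curvature $s^T y > 0$ along the step; the Wolfe conditions guarantee this, while function-value-only line searches do not. The sketch below uses the standard update formula with a common safeguard of skipping the update when the curvature condition fails; the safeguard threshold is our choice, not necessarily the authors'.

```python
import numpy as np

def bfgs_inverse_hessian_update(H, s, y):
    """Standard BFGS update of the inverse Hessian approximation H, where
    s = x_{k+1} - x_k and y = grad_{k+1} - grad_k (here, finite-difference
    gradient estimates). Well-defined only when s^T y > 0."""
    sy = float(s @ y)
    if sy <= 1e-10 * np.linalg.norm(s) * np.linalg.norm(y):
        return H  # curvature condition failed: keep the previous H
    rho = 1.0 / sy
    I = np.eye(H.shape[0])
    V = I - rho * np.outer(s, y)
    return V @ H @ V.T + rho * np.outer(s, s)
```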

[Figure 5 (plot omitted): performance profile curves for Approximately Exact + BFGS, Approximately Exact, Approximately Exact + Random, Adaptive Backtracking + BFGS, Traditional Backtracking + BFGS, Nelder-Mead, and Wolfe + BFGS.]

Fig. 5: Performance profile on the DFO benchmark of Moré and Wild [21].

8. Conclusions. We introduced a simple line search method, approximately exact line search (AELS, Algorithm 2), that uses only function evaluations to find a step size within a constant fraction of the exact line search minimizer. Our algorithm can be thought of as an adaptive version of golden section search. We proved linear convergence of our method applied to gradient descent as well as to derivative-free optimization with either a finite differencing gradient approximation or a random search direction (in which case we prove linear expected convergence). We also bounded the number of function evaluations per iteration as logarithmic in properties of the objective function and gradient approximation. We compared these bounds to those for Armijo line searches such as the popular backtracking method, and found that our method is more robust to improper choices of the initial step size $T_0$ as well as better suited to the derivative-free setting, in which Armijo methods depend more strongly on the quality of the gradient approximation.

We illustrated these worst-case bounds using gradient descent on a strongly convex logistic regression benchmark, on which we found our method converges faster than various Armijo backtracking methods across a range of initial step sizes $T_0$. We then stress-tested our approach in the stochastic setting by repeating our logistic regression benchmark trials with decreasing minibatch size and decreasing relative error at convergence. Although line search is commonly believed to be useless or counterproductive in the stochastic setting, in which our analysis does not directly apply, we found that AELS remains performant on logistic regression even with single-example minibatches or when optimizing to a relative error less than $10^{-5}$. These regimes are of great practical interest, and our experiments suggest that AELS may be effective well beyond the settings we analyzed.

Finally, we presented empirical results of our method using BFGS search directions on a DFO benchmark (including nonconvex but smooth objectives), and showed faster convergence compared to popular baselines including Nelder-Mead and BFGS with a Wolfe line search. These experiments suggest that, although our convergence analysis focused on strongly convex functions, with proper implementation and search directions (as provided in our code) our method extends gracefully to nonconvex objectives. This result is encouraging but unsurprising, since our line search method itself relies only on unimodality rather than convexity.

In total, our investigation indicates that line search, particularly AELS, improves performance in many regimes, including ones in which line search is typically avoided. We hope that our work spurs further theoretical research characterizing the performance of approximately exact line search and other line searches in the stochastic and nonconvex settings, as well as further empirical research exploring the range of practical optimization problems for which these line searches improve performance.

Acknowledgments. We appreciate helpful feedback from Ludwig Schmidt.

REFERENCES

[1] L. B. Almeida, T. Langlois, J. D. Amaral, and A. Plakhov, Parameter adaptation in stochastic optimization, Publications of the Newton Institute, Cambridge Univ. Press, 1999, pp. 111–134.
[2] L. Armijo, Minimization of functions having Lipschitz continuous first partial derivatives, Pacific J. Math., 16 (1966), pp. 1–3.
[3] M. Avriel and D. J. Wilde, Golden block search for the maximum of unimodal functions, Management Science, 14 (1968), pp. 307–319.
[4] J. Barzilai and J. J. Borwein, Two-point step size gradient methods, IMA J. Numer. Anal., 8 (1988), pp. 141–148.
[5] Y. Bengio, Practical recommendations for gradient-based training of deep architectures, Neural Networks: Tricks of the Trade, (2012), pp. 437–478.
[6] A. S. Berahas, L. Cao, K. Choromanski, and K. Scheinberg, A theoretical and empirical comparison of gradient approximations in derivative-free optimization, 2019, https://arxiv.org/abs/1905.01332.
[7] A. S. Berahas, L. Cao, and K. Scheinberg, Global convergence rate analysis of a generic line search algorithm with noise, 2019, https://arxiv.org/abs/1910.04055.
[8] R. Bollapragada, R. Byrd, and J. Nocedal, Adaptive sampling strategies for stochastic optimization, SIAM J. Optim., 28 (2018), pp. 3312–3343.
[9] R. Bollapragada, D. Mudigere, J. Nocedal, H.-J. M. Shi, and P. T. P. Tang, A progressive batching L-BFGS method for machine learning, 2018, https://arxiv.org/abs/1802.05374.
[10] R. P. Brent, An algorithm with guaranteed convergence for finding a zero of a function, Comput. J., 14 (1971), pp. 422–425.
[11] R. Caruana, T. Joachims, and L. Backstrom, KDD-Cup 2004: Results and analysis,

SIGKDD Explorations Newsletter, 6 (2004), pp. 95–108.
[12] A. R. Conn, K. Scheinberg, and L. N. Vicente, Introduction to derivative-free optimization, vol. 8 of MPS-SIAM Ser. Optim., SIAM, 2009.
[13] A. Cutkosky and K. Boahen, Online learning without prior information, 2017, https://arxiv.org/abs/1703.02629.
[14] D. Dua and C. Graff, UCI machine learning repository, 2017, http://archive.ics.uci.edu/ml.
[15] J. Duchi, E. Hazan, and Y. Singer, Adaptive subgradient methods for online learning and stochastic optimization, J. Mach. Learn. Res., 12 (2011), pp. 2121–2159.
[16] R. Fletcher, Newton-like methods, Wiley, 1986, pp. 44–74.
[17] M. P. Friedlander and M. Schmidt, Hybrid deterministic-stochastic methods for data fitting, SIAM J. Sci. Comput., 34 (2012), pp. A1380–A1405.
[18] D. Golovin, J. Karro, G. Kochanski, C. Lee, X. Song, and Q. Zhang, Gradientless descent: High-dimensional zeroth-order optimization, 2020, https://arxiv.org/abs/1911.06317.
[19] E. Hazan and S. Kakade, Revisiting the Polyak step size, 2019, https://arxiv.org/abs/1905.00313.
[20] K. G. Jamieson, R. Nowak, and B. Recht, Query complexity of derivative-free optimization, in Advances in Neural Information Processing Systems, 2012, pp. 2672–2680.
[21] J. J. Moré and S. M. Wild, Benchmarking derivative-free optimization algorithms, SIAM J. Optim., 20 (2009), pp. 172–191.
[22] J. A. Nelder and R. Mead, A simplex method for function minimization, Comput. J., 7 (1965), pp. 308–313.
[23] Y. Nesterov, Gradient methods for minimizing composite functions, Math. Program., 140 (2013), pp. 125–161.
[24] Y. Nesterov and V. Spokoiny, Random gradient-free minimization of convex functions, Found. Comput. Math., 17 (2017), pp. 527–566.
[25] Y. E. Nesterov, Introductory lectures on convex optimization - A basic course, vol. 87 of Appl. Optim., Springer, 2004.
[26] J. Nocedal and S. J. Wright, Line search methods, Numer. Optim., (2006), pp. 30–65.
[27] F. Orabona and D. Pál, Coin betting and parameter-free online learning, 2016, https://arxiv.org/abs/1602.04128.
[28] F. Orabona and T. Tommasi, Training deep networks without learning rates through coin betting, 2017, https://arxiv.org/abs/1705.07795.
[29] C. Paquette and K. Scheinberg, A stochastic line search method with convergence rate analysis, 2018, https://arxiv.org/abs/1807.07994.
[30] J. C. Platt, Fast training of support vector machines using sequential minimal optimization, MIT Press, Cambridge, MA, USA, 1999, pp. 185–208.
[31] B. Polyak, Introduction to optimization, Optimization Software, New York, 1987.
[32] B. Recht and S. J. Wright, Optimization for modern data analysis, 2019. Preprint available at http://eecs.berkeley.edu/~brecht/opt4mlbook.
[33] H. Robbins and S. Monro, A stochastic approximation method, Ann. Math. Stat., (1951), pp. 400–407.
[34] T. Schaul, S. Zhang, and Y. LeCun, No more pesky learning rates, 2012, https://arxiv.org/abs/1206.1106.
[35] M. Schmidt, N. Le Roux, and F. Bach, Minimizing finite sums with the stochastic average gradient, Math. Program., 162 (2017), pp. 83–112.
[36] S. U. Stich, Convex optimization with random pursuit, PhD thesis, ETH Zurich, 2014.
[37] S. U. Stich, C. L. Müller, and B. Gärtner, Optimization of convex functions with random pursuit, SIAM J. Optim., 23 (2013), pp. 1284–1309.
[38] S. Vaswani, A. Mishkin, I. Laradji, M. Schmidt, G. Gidel, and S. Lacoste-Julien, Painless stochastic gradient: Interpolation, line-search, and convergence rates, 2019, https://arxiv.org/abs/1905.09997.
[39] P. Virtanen, R. Gommers, T. E. Oliphant, M. Haberland, T. Reddy, D. Cournapeau, E. Burovski, P. Peterson, W. Weckesser, J. Bright, S. J. van der Walt, M. Brett, J. Wilson, K. Jarrod Millman, N. Mayorov, A. R. J. Nelson, E. Jones, R. Kern, E. Larson, C. Carey, İ. Polat, Y. Feng, E. W. Moore, J. VanderPlas, D. Laxalde, J. Perktold, R. Cimrman, I. Henriksen, E. A. Quintero, C. R. Harris, A. M. Archibald, A. H. Ribeiro, F. Pedregosa, P. van Mulbregt, and SciPy 1.0 Contributors, SciPy 1.0: Fundamental algorithms for scientific computing in Python, Nature Methods, 17 (2020), pp. 261–272.
[40] P. Wolfe, Convergence conditions for ascent methods, SIAM Rev., 11 (1969), pp. 226–235.