The Gradient Sampling Methodology
James V. Burke*, Frank E. Curtis†, Adrian S. Lewis‡, Michael L. Overton§

January 14, 2019

* Department of Mathematics, University of Washington, Seattle, WA. [email protected]. Supported in part by U.S. National Science Foundation grant DMS-1514559.
† Department of Industrial and Systems Engineering, Lehigh University, Bethlehem, PA. [email protected]. Supported in part by U.S. Department of Energy grant DE-SC0010615 and National Science Foundation grant CCF-1740796.
‡ School of Operations Research and Information Engineering, Cornell University, Ithaca, NY. [email protected]. Supported in part by U.S. National Science Foundation grant DMS-1613996.
§ Courant Institute of Mathematical Sciences, New York University. [email protected]. Supported in part by U.S. National Science Foundation grant DMS-1620083.

1 Introduction

The principal methodology for minimizing a smooth function is the steepest descent (gradient) method. One way to extend this methodology to the minimization of a nonsmooth function involves approximating subdifferentials through the random sampling of gradients. This approach, known as gradient sampling (GS), gained a solid theoretical foundation about a decade ago [BLO05, Kiw07], and has developed into a comprehensive methodology for handling nonsmooth, potentially nonconvex functions in the context of optimization algorithms. In this article, we summarize the foundations of the GS methodology, provide an overview of the enhancements and extensions to it that have been developed over the past decade, and highlight some interesting open questions related to GS.

2 Fundamental Idea

The central idea of gradient sampling can be explained as follows. When a function $f : \mathbb{R}^n \to \mathbb{R}$ is differentiable at a point $x$ (at which $\nabla f(x) \neq 0$), the traditional steepest descent direction for $f$ at $x$ in the 2-norm is found by observing that
\[
  \arg\min_{\|d\|_2 \le 1} \nabla f(x)^T d = -\frac{\nabla f(x)}{\|\nabla f(x)\|_2}; \tag{1}
\]
in particular, this leads to calling the negative gradient, namely $-\nabla f(x)$, the direction of steepest descent for $f$ at $x$. However, when $f$ is not differentiable near $x$, one finds that following the negative gradient direction might offer only a small amount of decrease in $f$; indeed, obtaining decrease from $x$ along $-\nabla f(x)$ may be possible only with a very small stepsize. The GS methodology is based on the idea of stabilizing this definition of steepest descent by instead finding a direction to approximately solve
\[
  \min_{\|d\|_2 \le 1} \ \max_{g \in \bar\partial_\epsilon f(x)} g^T d, \tag{2}
\]
where $\bar\partial_\epsilon f(x)$ is the $\epsilon$-subdifferential of $f$ at $x$ [Gol77]. To understand the context of this idea, recall that the (Clarke) subdifferential of a locally Lipschitz $f$ at $x$, denoted $\bar\partial f(x)$, is the convex hull of the limits of all sequences of gradients evaluated at sequences of points, at which $f$ is differentiable, that converge to $x$ [Cla75]. The $\epsilon$-subdifferential, in turn, is the convex hull of all subdifferentials at points within an $\epsilon$-neighborhood of $x$.

Although the $\epsilon$-subdifferential of $f$ at $x$ is not readily computed, the central idea of gradient sampling is to approximate the solution of (2) by finding the smallest-norm vector in the convex hull of gradients computed at randomly generated points in an $\epsilon$-neighborhood of $x$, then normalizing the result to have unit norm. See [BLO02] for analysis of approximating an $\epsilon$-subdifferential by sampling gradients at randomly generated points.

A complete algorithm based on the GS methodology is stated as Algorithm 1, taken from the recent survey paper [BCL+19]. To illustrate the efficacy of this algorithm compared to more standard gradient and subgradient methodologies, let us show its performance on a nonsmooth variant of the nonconvex Rosenbrock function [Ros60], namely,
\[
  f(x) = 8|x_1^2 - x_2| + (1 - x_1)^2. \tag{3}
\]
The contours of this function are shown in Figure 1.
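To make the experiment easy to reproduce, the test function (3) and its gradient can be written in a few lines. This is our own illustrative sketch, not code from the paper; the names `f` and `grad_f` are ours, and the gradient formula is valid only off the parabola $x_2 = x_1^2$, where $f$ is differentiable.

```python
import numpy as np

def f(x):
    # Nonsmooth variant of the Rosenbrock function, eq. (3):
    # f(x) = 8|x1^2 - x2| + (1 - x1)^2, nonsmooth on the parabola x2 = x1^2.
    return 8.0 * abs(x[0]**2 - x[1]) + (1.0 - x[0])**2

def grad_f(x):
    # Gradient of f, defined wherever x1^2 != x2.
    s = np.sign(x[0]**2 - x[1])
    return np.array([16.0 * s * x[0] - 2.0 * (1.0 - x[0]),
                     -8.0 * s])
```

Note that `np.sign` returns zero exactly on the parabola, where $f$ is not differentiable; Algorithm 1 takes care to keep its iterates off this set.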
The black asterisk indicates the initial iterate $x_0 = (0.1, 0.1)$ and the red asterisk indicates the unique minimizer $x^* = (1, 1)$. The blue dots show the iterates generated by the gradient sampling method (Algorithm 1) converging to $x^*$, roughly tracing out the parabola on which $f$ is nonsmooth, but never actually landing on it, even to finite precision. In contrast, the magenta dots show the iterates of the gradient method with the same line search enforcing (4) from Algorithm 1, indicating that these iterates move directly toward the parabola on which $f$ is nonsmooth and stall without moving along it toward the minimizer $x^*$. The essential difficulty is that the direction of descent tangential to the parabola is overwhelmed by the steepness of the graph of the function near the parabola. The gradient sampling method, by choosing the direction of least norm in the convex hull of sampled gradients, is able to approximate the tangential directions of descent toward $x^*$.

The poor behavior of the gradient method in this context is well known, even in the convex case [HUL93]; see [AO18] for a discussion of the behavior of the gradient method on a simple nonsmooth convex function. In these experiments, for both algorithms, $x_1^2 - x_2$ was nonzero at all iterates, even in finite precision, so gradients were always defined. Figure 2 shows the function values $\{f(x_k)\}$ generated by the two methods. Both algorithms were terminated as soon as the objective and/or gradient was evaluated at 2000 points, including iterates, trial points in the line searches, and randomly generated points at which the gradient is evaluated for Algorithm 1. Algorithm 1 is able to reach iterates with much better objective values within the same budget.[1]

It is also instructive to consider a subgradient method [Sho85, Rus06] that sets iterates by
\[
  x_{k+1} \gets x_k - t_k d_k,
\]
where $d_k$ is any subgradient of $f$ at $x_k$ (i.e., any element of $\bar\partial f(x_k)$) and $\{t_k\}$ is a fixed stepsize or is set according to a diminishing stepsize schedule. This is a popular approach in the optimization literature, with convergence guarantees in various contexts without requiring that the value of $f$ decrease at each iteration. By not requiring monotonic decrease, the method does not get stuck near the parabola on which $f$ is nonsmooth. However, progress is slow since the method has no mechanism for identifying the tangential direction of descent along the parabola. Instead, it is destined to oscillate back and forth across the parabola as it creeps tangentially toward the minimizer $x^*$. In this experiment, $x_1^2 - x_2$ was nonzero (even in finite precision) at all but a handful of the iterates, and since the only subgradient of $f$ at such a point is the gradient, the method is, for all practical purposes, identical to a gradient method with the same stepsizes. The iterates of this method with $\{t_k\} = \{0.1/k\}$ are shown in Figure 3, and the performance with different choices of $\{t_k\}$ is shown in Figure 4. With the same function and gradient evaluation budget as the methods above, this approach is slow for all stepsize choices. One might be able to obtain better results by tuning the stepsize choice further; note, however, that Algorithm 1 does not require such parameter tuning.

Of course, there are other effective algorithms for nonsmooth optimization that we do not consider here, in part because they are significantly more complicated to describe; these include bundle methods [Kiw85, SZ92], which have been used extensively for decades, and quasi-Newton methods [LO13]. For a collection of surveys of recent developments in nonsmooth optimization methods, see [BGKM19].

The function $f$ in (3) is an example of an important class of functions, namely those that are partly smooth with respect to a manifold in the sense defined in [Lew02]. In the convex case, this concept is related to that of the U-Lagrangian [LOS99].

3 Convergence Theory

Algorithm 1 is conceptually straightforward. At each iterate, one need only compute gradients at randomly sampled points, project the origin onto the convex hull of these gradients (by solving a strongly convex quadratic program (QP) for which specialized algorithms have been designed [Kiw86]), and perform a line search. The other details relate to dynamically setting the sampling radii $\{\epsilon_k\}$ and ensuring that the objective $f$ is differentiable at each iterate.

On the other hand, the convergence theory for the algorithm when minimizing a locally Lipschitz function involves important, subtle details. Rademacher's theorem states that locally Lipschitz functions are differentiable almost everywhere [Cla83], ensuring that the gradients sampled at the randomly generated points are well defined with probability one. However, this is not sufficient to ensure convergence. To

[1] We used the following parameters for Algorithm 1: $\epsilon_0 = \nu_0 = 0.1$, $m = 3$, $\beta = 10^{-8}$, $\gamma = 0.5$, $\epsilon_{\mathrm{opt}} = \nu_{\mathrm{opt}} = 0$, and $\theta_\epsilon = \theta_\nu = 0.1$. The gradient method used the same line search with $\beta = 10^{-8}$ and $\gamma = 0.5$. Both algorithms used most of their function and gradient evaluations in later iterations. The final sampling radius for Algorithm 1 was $10^{-5}$.

Algorithm 1: Gradient Sampling with a Line Search

Require: initial point $x^0$ at which $f$ is differentiable, initial sampling radius $\epsilon_0 \in (0,\infty)$, initial stationarity target $\nu_0 \in [0,\infty)$, sample size $m \ge n+1$, line search parameters $(\beta, \gamma) \in (0,1) \times (0,1)$, termination tolerances $(\epsilon_{\mathrm{opt}}, \nu_{\mathrm{opt}}) \in [0,\infty) \times [0,\infty)$, and reduction factors $(\theta_\epsilon, \theta_\nu) \in (0,1] \times (0,1]$
1: for $k \in \mathbb{N}$ do
2:   independently sample $\{x^{k,1}, \ldots, x^{k,m}\}$ uniformly from $B(x^k, \epsilon_k) := \{x \in \mathbb{R}^n : \|x - x^k\|_2 \le \epsilon_k\}$
3:   compute $g^k$ as the solution of $\min_{g \in G^k} \frac{1}{2}\|g\|_2^2$, where $G^k := \mathrm{conv}\{\nabla f(x^k), \nabla f(x^{k,1}), \ldots,
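To make the structure of an iteration concrete, here is a minimal Python sketch of steps 2 and 3 together with a backtracking line search. Everything in it is an illustrative assumption rather than the authors' implementation: the QP in step 3 is solved by a simple Frank-Wolfe loop instead of a specialized method such as that of [Kiw86], the Armijo-style acceptance test merely stands in for condition (4), and the dynamic updates of the sampling radius and stationarity target are omitted.

```python
import numpy as np

def min_norm_in_hull(G, iters=200):
    # Frank-Wolfe sketch for the QP in step 3: the smallest-norm vector in
    # the convex hull of the columns of G. A specialized QP solver would be
    # used in practice.
    m = G.shape[1]
    c = np.full(m, 1.0 / m)              # convex-combination weights
    for _ in range(iters):
        p = G @ c                        # current point in the hull
        i = int(np.argmin(G.T @ p))      # vertex minimizing the linearization
        d = G[:, i] - p                  # move toward that vertex
        dd = d @ d
        if dd <= 1e-16:
            break
        t = min(1.0, max(0.0, -(p @ d) / dd))  # exact step for ||p + t d||^2
        c *= (1.0 - t)
        c[i] += t
    return G @ c

def gs_step(f, grad_f, x, eps, m, beta=1e-8, gamma=0.5, rng=None):
    # One simplified gradient sampling iteration: steps 2-3 of Algorithm 1
    # plus a backtracking line search standing in for condition (4).
    rng = np.random.default_rng(0) if rng is None else rng
    n = x.size
    # Step 2: sample m points uniformly from the ball B(x, eps).
    U = rng.standard_normal((m, n))
    U /= np.linalg.norm(U, axis=1, keepdims=True)
    pts = x + eps * (rng.random((m, 1)) ** (1.0 / n)) * U
    # Step 3: min-norm element of conv{grad f(x), sampled gradients}.
    G = np.column_stack([grad_f(x)] + [grad_f(p) for p in pts])
    g = min_norm_in_hull(G)
    gnorm = np.linalg.norm(g)
    if gnorm == 0.0:
        return x                         # approximately stationary
    d = -g / gnorm                       # normalized search direction
    t = 1.0
    while f(x + t * d) > f(x) - beta * t * gnorm and t > 1e-12:
        t *= gamma                       # backtrack
    return x + t * d
```

Applied repeatedly to the function in (3) from $x_0 = (0.1, 0.1)$, steps of this kind trace the qualitative behavior described above: the min-norm combination of gradients sampled on both sides of the parabola recovers a tangential direction of descent.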