Complexity of Finding Stationary Points of Nonsmooth Nonconvex Functions
Jingzhao Zhang 1  Hongzhou Lin 1  Stefanie Jegelka 1  Suvrit Sra 1  Ali Jadbabaie 1

1 Massachusetts Institute of Technology. Correspondence to: Jingzhao Zhang <[email protected]>.

Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 119, 2020. Copyright 2020 by the author(s).

Abstract

We provide the first non-asymptotic analysis for finding stationary points of nonsmooth, nonconvex functions. In particular, we study the class of Hadamard semi-differentiable functions, perhaps the largest class of nonsmooth functions for which the chain rule of calculus holds. This class contains examples such as ReLU neural networks and others with non-differentiable activation functions. We first show that finding an ε-stationary point with first-order methods is impossible in finite time. We then introduce the notion of (δ, ε)-stationarity, which allows for an ε-approximate gradient to be the convex combination of generalized gradients evaluated at points within distance δ to the solution. We propose a series of randomized first-order methods and analyze their complexity of finding a (δ, ε)-stationary point. Furthermore, we provide a lower bound and show that our stochastic algorithm has min-max optimal dependence on δ. Empirically, our methods perform well for training ReLU neural networks.

Table 1. When the problem is nonconvex and nonsmooth, finding an ε-stationary point is intractable, see Theorem 11. Thus we introduce a refined notion, (δ, ε)-stationarity, and provide non-asymptotic convergence rates for finding a (δ, ε)-stationary point.

DETERMINISTIC RATES   CONVEX        NONCONVEX
L-SMOOTH              O(ε^{-0.5})   O(ε^{-2})
L-LIPSCHITZ           O(ε^{-2})     Õ(ε^{-3} δ^{-1})

STOCHASTIC RATES      CONVEX        NONCONVEX
L-SMOOTH              O(ε^{-2})     O(ε^{-4})
L-LIPSCHITZ           O(ε^{-2})     Õ(ε^{-4} δ^{-1})

1. Introduction

Gradient based optimization underlies most of machine learning and it has attracted tremendous research attention over the years. While non-asymptotic complexity analysis of gradient based methods is well-established for convex and smooth nonconvex problems, little is known for nonsmooth nonconvex problems. We summarize the known rates in Table 1 based on the references (Nesterov, 2018; Carmon et al., 2017; Arjevani et al., 2019); the L-Lipschitz nonconvex entries are the new rates established in this paper. Within the nonsmooth nonconvex setting, recent research results have focused on asymptotic convergence analysis (Benaïm et al., 2005; Kiwiel, 2007; Majewski et al., 2018; Davis et al., 2018; Bolte & Pauwels, 2019). Despite their advances, these results fail to address finite-time, non-asymptotic convergence rates. Given the widespread use of nonsmooth nonconvex problems in machine learning, a canonical example being deep ReLU neural networks, obtaining a non-asymptotic convergence analysis is an important open problem of fundamental interest.

We tackle this problem for nonsmooth functions that are Lipschitz and directionally differentiable. This class is rich enough to cover common machine learning problems, including ReLU neural networks. Surprisingly, even for this seemingly restricted class, finding an ε-stationary point, i.e., a point x̄ for which d(0, ∂f(x̄)) ≤ ε, is intractable. In other words, no algorithm can guarantee to find an ε-stationary point within a finite number of iterations.

This intractability suggests that, to obtain meaningful non-asymptotic results, we need to refine the notion of stationarity. We introduce such a notion and base our analysis on it, leading to the following main contributions of the paper:

• We show that a traditional ε-stationary point cannot be obtained in finite time (Theorem 5).
• We study the notion of (δ, ε)-stationary points (see Definition 4). For smooth functions, this notion reduces to usual ε-stationarity by setting δ = O(ε/L). We provide an Ω(δ^{-1}) lower bound on the number of calls if algorithms are only allowed access to a generalized gradient oracle.
• We propose a normalized "gradient descent" style algorithm that achieves Õ(ε^{-3} δ^{-1}) complexity in finding a (δ, ε)-stationary point in the deterministic setting.
• We propose a momentum based algorithm that achieves Õ(ε^{-4} δ^{-1}) complexity in finding a (δ, ε)-stationary point in the stochastic finite variance setting.

As a proof of concept to validate our theoretical findings, we implement our stochastic algorithm and show that it matches the performance of the empirically used SGD with momentum method for training ResNets on the Cifar10 dataset.

Our results attempt to bridge the gap from recent advances in developing a non-asymptotic theory for nonconvex optimization algorithms to settings that apply to training deep neural networks, where, due to non-differentiability of the activations, most existing theory does not directly apply.
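To make the refined notion and the algorithmic template above concrete, the sketch below illustrates the underlying idea in Python. It is not the algorithm analyzed in the paper: the function names (`sample_ball`, `averaged_gradient`, `normalized_descent`), the step size, the sample count, and the toy objective are illustrative assumptions. The only ingredient taken from the text is the observation that an average of gradients evaluated at points within distance δ of x is one particular convex combination of generalized gradients there, so a small norm of that average is a natural surrogate certificate of (δ, ε)-stationarity, and a normalized "gradient descent" style update can be driven by it.

```python
import numpy as np

def sample_ball(x, delta, rng):
    """Draw a point uniformly from the Euclidean ball of radius delta around x."""
    d = x.shape[0]
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)
    r = delta * rng.uniform() ** (1.0 / d)
    return x + r * u

def averaged_gradient(grad, x, delta, k, rng):
    """Average gradients at k random points within distance delta of x.

    Each term lies in the generalized gradient of some point in the delta-ball,
    so the average is one convex combination of such generalized gradients; a
    small norm of this average serves as a surrogate (delta, eps)-stationarity check.
    """
    g = np.zeros_like(x)
    for _ in range(k):
        g += grad(sample_ball(x, delta, rng))
    return g / k

def normalized_descent(grad, x0, delta=1e-2, eps=1e-2, k=20,
                       step=1e-2, max_iter=1000, seed=0):
    """Hypothetical normalized-descent loop built on the averaged gradient."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(max_iter):
        g = averaged_gradient(grad, x, delta, k, rng)
        if np.linalg.norm(g) <= eps:        # surrogate stationarity test
            break
        x -= step * g / np.linalg.norm(g)   # normalized step
    return x

# Toy example: f(x) = |x1| + max(x2, 0) is Lipschitz but nonsmooth.
grad = lambda x: np.array([np.sign(x[0]), 1.0 if x[1] > 0 else 0.0])
x_hat = normalized_descent(grad, [1.0, 1.0])
```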
1.1. Related Work

Asymptotic convergence for nonsmooth nonconvex functions. Benaïm et al. (2005) study the convergence of subgradient methods from a differential inclusion perspective; Majewski et al. (2018) extend the result to include proximal and implicit updates. Bolte & Pauwels (2019) focus on formally justifying the back propagation rule under nonsmooth conditions. In parallel, Davis et al. (2018) proved asymptotic convergence of subgradient methods assuming the objective function to be Whitney stratifiable. The class of Whitney stratifiable functions is broader than the regular functions studied in (Majewski et al., 2018), and it does not assume the regularity inequality (see Lemma 6.3 and (51) in (Majewski et al., 2018)). Another line of work (Mifflin, 1977; Kiwiel, 2007; Burke et al., 2018) studies convergence of gradient sampling algorithms. These algorithms assume a deterministic generalized gradient oracle. Our methods draw intuition from these algorithms and their analysis, but are non-asymptotic in contrast.

Structured nonsmooth nonconvex problems. Another line of research in nonconvex optimization is to exploit structure: Duchi & Ruan (2018); Drusvyatskiy & Paquette (2019); Davis & Drusvyatskiy (2019) consider the composition structure f ∘ g of convex and smooth functions; Bolte et al. (2018); Zhang & He (2018); Beck & Hallak (2020) study composite objectives of the form f + g where one function is differentiable or convex/concave. With such structure, one can apply proximal gradient algorithms if the proximal mapping can be efficiently evaluated. However, this usually requires weak convexity, i.e., that adding a quadratic function makes the function convex, which is not satisfied by several simple functions, e.g., −|x|.

Stationary points under smoothness. When the objective function is smooth, SGD finds an ε-stationary point in O(ε^{-4}) gradient calls (Ghadimi & Lan, 2013), which improves to O(ε^{-2}) for convex problems. Fast upper bounds under a variety of settings (deterministic, finite-sum, stochastic) are studied in (Carmon et al., 2018; Fang et al., 2018; Zhou et al., 2018; Nguyen et al., 2019; Allen-Zhu, 2018; Reddi et al., 2016). More recently, lower bounds have also been developed (Carmon et al., 2017; Drori & Shamir, 2019; Arjevani et al., 2019; Foster et al., 2019). When the function enjoys high-order smoothness, a stronger goal is to find an approximate second-order stationary point and thus also escape saddle points; many methods focus on this goal (Ge et al., 2015; Agarwal et al., 2017; Jin et al., 2017; Daneshmand et al., 2018; Fang et al., 2019).

2. Preliminaries

In this section, we set up the notion of generalized directional derivatives that will play a central role in our analysis. Throughout the paper, we assume that the nonsmooth function f is L-Lipschitz continuous (more precise assumptions on the function class are outlined in §2.3).

2.1. Generalized gradients

We start with the definition of generalized gradients, following (Clarke, 1990), for which we first need:

Definition 1. Given a point x ∈ ℝ^d and a direction d, the generalized directional derivative of f is defined as

    $f^\circ(x; d) := \limsup_{y \to x,\, t \downarrow 0} \frac{f(y + td) - f(y)}{t}.$

Definition 2. The generalized gradient of f is defined as

    $\partial f(x) := \{\, g \mid \langle g, d \rangle \le f^\circ(x; d),\ \forall d \in \mathbb{R}^d \,\}.$

We recall below the following basic properties of the generalized gradient; see, e.g., (Clarke, 1990) for details.

Proposition 1 (Properties of generalized gradients).
1. ∂f(x) is a nonempty, convex, compact set. For all vectors g ∈ ∂f(x), we have ‖g‖ ≤ L.
2. f°(x; d) = max{⟨g, d⟩ | g ∈ ∂f(x)}.
3. ∂f(x) is an upper-semicontinuous set-valued map.
4. f is differentiable almost everywhere (as it is L-Lipschitz); let conv(·) denote the convex hull, then $\partial f(x) = \mathrm{conv}\{\, g \mid g = \lim_{k \to \infty} \nabla f(x_k),\ x_k \to x \,\}.$
5. Let B denote the unit Euclidean ball. Then $\partial f(x) = \bigcap_{\delta > 0} \bigcup_{y \in x + \delta B} \partial f(y).$
6. For any y, z, there exist λ ∈ (0, 1) and g ∈ ∂f(λy + (1 − λ)z) such that f(y) − f(z) = ⟨g, y − z⟩.
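To make these properties concrete, here is a short worked example (added for illustration; it is not part of the original text) computing the generalized gradient of the function f(x) = −|x| mentioned in §1.1, using item 4 of Proposition 1:

```latex
% Worked example (illustrative, not from the paper): Clarke generalized gradient of f(x) = -|x|.
f(x) = -\lvert x \rvert, \qquad \nabla f(x) = -\operatorname{sign}(x) \quad \text{for } x \neq 0.
% Taking limits of gradients along sequences x_k \to 0 from the left and from the right
% (Proposition 1, item 4) gives the two values +1 and -1, hence
\partial f(0) = \operatorname{conv}\{-1, +1\} = [-1, 1].
% In particular 0 \in \partial f(0), so x = 0 is Clarke-stationary even though it is a local
% maximum; this is one reason approximate stationarity for nonsmooth nonconvex f must be
% handled with care.
```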
2.2. Directional derivatives

Since general nonsmooth functions can have arbitrarily large variations in their "gradients," we must restrict the function class to be able to develop a meaningful complexity theory. We show below that directionally differentiable functions match this purpose well.

Definition 3. A function f is called directionally differentiable in the sense of Hadamard (cf. (Sova, 1964; Shapiro, 1990)) if for any mapping φ : ℝ₊ → X for which φ(0) = x and lim_{t→0⁺} (φ(t) − φ(0))/t = d, the following limit exists:

    $f'(x; d) = \lim_{t \to 0^+} \tfrac{1}{t}\,\bigl(f(\varphi(t)) - f(x)\bigr).$    (1)

In the rest of the paper, we will say a function f is directionally differentiable if it is directionally differentiable in the sense of Hadamard at all x (a numerical illustration of this definition appears at the end of this section).

2.3. Nonsmooth function class of interest

Throughout the paper, we focus on the set of Lipschitz, directionally differentiable and bounded (below) functions:

    $\mathcal{F}(\Delta, L) := \{\, f \mid f \text{ is } L\text{-Lipschitz},\ f \text{ is directionally differentiable},\ f(x_0) - \inf_x f(x) \le \Delta \,\},$    (2)

where a function f : ℝⁿ → ℝ is L-Lipschitz if |f(x) − f(y)| ≤ L‖x − y‖ for all x, y ∈ ℝⁿ.
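As a small sanity check that ReLU models fit the class F(∆, L), the following sketch (illustrative only; the two-layer network, its random weights, and the Lipschitz bound used are assumptions, not details from the paper) verifies the Lipschitz inequality above on random pairs and evaluates the Hadamard difference quotient of Eq. (1) along a curved path tangent to a direction d.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny two-layer ReLU network f(x) = w2 . max(W1 x + b1, 0) with scalar output.
W1, b1 = rng.standard_normal((8, 4)), rng.standard_normal(8)
w2 = rng.standard_normal(8)

def f(x):
    return w2 @ np.maximum(W1 @ x + b1, 0.0)

# One valid Lipschitz constant: ||w2|| * ||W1||_2, since ReLU is 1-Lipschitz.
L = np.linalg.norm(w2) * np.linalg.norm(W1, 2)

# (i) Check |f(x) - f(y)| <= L ||x - y|| on random pairs.
for _ in range(5):
    x, y = rng.standard_normal(4), rng.standard_normal(4)
    assert abs(f(x) - f(y)) <= L * np.linalg.norm(x - y) + 1e-9

# (ii) Hadamard difference quotient of Eq. (1) along phi(t) = x + t d + t^2 u,
# a curved path that is still tangent to d at t = 0; the quotient stabilizes
# as t -> 0+, consistent with Hadamard directional differentiability.
x, d, u = np.zeros(4), rng.standard_normal(4), rng.standard_normal(4)
phi = lambda t: x + t * d + t ** 2 * u
for t in [1e-1, 1e-2, 1e-3, 1e-4]:
    print(f"t={t:.0e}  quotient={(f(phi(t)) - f(x)) / t:.6f}")
```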