General Hölder Smooth Convergence Rates Follow from Specialized Rates Assuming Growth Bounds
Benjamin Grimmer*

*School of Operations Research and Information Engineering, Cornell University, Ithaca, NY 14850, USA; people.orie.cornell.edu/bdg79/. This material is based upon work supported by the National Science Foundation Graduate Research Fellowship under Grant No. DGE-1650441.

Abstract

Often in the analysis of first-order methods for both smooth and nonsmooth optimization, assuming the existence of a growth/error bound or a KL condition facilitates much stronger convergence analysis. Hence the analysis is done twice, once for the general case and once for the growth-bounded case. We give meta-theorems for deriving general convergence rates from those assuming a growth lower bound. Applying this simple but conceptually powerful tool to the proximal point method, subgradient method, bundle method, gradient descent, and universal accelerated method immediately recovers their known convergence rates for general convex optimization problems from their specialized rates. Our results lift any rate based on Hölder continuity of the objective's gradient and Hölder growth bounds to apply to any problem with a weaker growth bound or when no growth bound is assumed.

1 Introduction

We consider minimizing a closed proper convex function $f : \mathbb{R}^n \rightarrow \mathbb{R} \cup \{\infty\}$,

$$\min_{x \in \mathbb{R}^n} f(x), \tag{1}$$

that attains its minimum value at some $x_*$. Typically, first-order optimization methods suppose $f$ possesses some continuity or smoothness structure. These assumptions are broadly captured by assuming $f$ is $(L, \eta)$-Hölder smooth, defined as

$$\|g - g'\| \le L \|x - x'\|^{\eta} \quad \text{for all } x, x' \in \mathbb{R}^n,\ g \in \partial f(x),\ g' \in \partial f(x'), \tag{2}$$

where $\partial f(x) = \{ g \in \mathbb{R}^n \mid f(x') \ge f(x) + \langle g, x' - x \rangle \ \forall x' \in \mathbb{R}^n \}$ is the subdifferential of $f$ at $x$. When $\eta = 1$, this corresponds to the common assumption of $L$-smoothness (i.e., having $\nabla f(x)$ be $L$-Lipschitz). When $\eta = 0$, this captures the standard nonsmooth optimization model of having $f(x)$ itself be $L$-Lipschitz.

Throughout the first-order optimization literature, improved convergence rates typically follow from assuming the given objective function satisfies a growth/error bound or Kurdyka-Łojasiewicz condition (see [15, 14, 9, 1, 2] for a sample of works developing these ideas). One common formalization of this comes from assuming $(\alpha, p)$-Hölderian growth holds, defined by

$$f(x) \ge f(x_*) + \alpha \|x - x_*\|^p \quad \text{for all } x \in \mathbb{R}^n. \tag{3}$$

The two most important cases of this bound are Quadratic Growth given by $p = 2$ (generalizing strong convexity [16]) and Sharp Growth given by $p = 1$ (which occurs broadly in nonsmooth optimization [3]).

| Method | General | Quadratic Growth | Sharp Growth |
| --- | --- | --- | --- |
| Proximal Point Method [24, 5] | $O\left(\frac{1}{\epsilon}\right)$ | $O\left(\frac{1}{\alpha}\log\frac{f(x_0)-f(x_*)}{\epsilon}\right)$ | $O\left(\frac{f(x_0)-f(x_*)-\epsilon}{\alpha^2}\right)$ |
| Polyak Subgradient Method [21, 22] | $O\left(\frac{1}{\epsilon^2}\right)$ | $O\left(\frac{1}{\epsilon\alpha}\right)$ | $O\left(\frac{1}{\alpha^2}\log\frac{f(x_0)-f(x_*)}{\epsilon}\right)$ |
| Proximal Bundle Method [8, 4] | $O\left(\frac{1}{\epsilon^3}\right)$ | $O\left(\frac{1}{\epsilon\alpha^2}\right)$ | - |
| Gradient Descent [18] | $O\left(\frac{1}{\epsilon}\right)$ | $O\left(\frac{1}{\alpha}\log\frac{f(x_0)-f(x_*)}{\epsilon}\right)$ | - |
| Restarted Universal Method [25, 23] | $O\left(\frac{1}{\sqrt{\epsilon}}\right)$ | $O\left(\frac{1}{\sqrt{\alpha}}\log\frac{f(x_0)-f(x_*)}{\epsilon}\right)$ | - |

Table 1: Known convergence rates for several methods. The proximal point method makes no smoothness or continuity assumptions. The subgradient and bundle method rates assume Lipschitz continuity ($\eta = 0$) and the gradient descent and universal method rates assume a Lipschitz gradient ($\eta = 1$), although these methods can be analyzed for generic $\eta$.
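As a concrete illustration of definition (3), the following short Python sketch checks Hölder growth numerically on two one-dimensional examples. This is our own illustration, not part of the paper; the helper `holder_growth_holds` and the sampled-grid test are assumptions of the sketch.

```python
import numpy as np

# Minimal numerical sketch (ours) of the two key growth regimes in Table 1:
#   f(x) = |x|        has (1, 1)-Hölder (sharp) growth and is 1-Lipschitz (eta = 0);
#   f(x) = x**2 / 2   has (1/2, 2)-Hölder (quadratic) growth and a
#                     1-Lipschitz gradient (eta = 1).

def holder_growth_holds(f, f_star, x_star, alpha, p, points):
    """Check f(x) >= f(x*) + alpha * |x - x*|**p at the sample points."""
    return all(f(x) >= f_star + alpha * abs(x - x_star) ** p - 1e-12
               for x in points)

xs = np.linspace(-2.0, 2.0, 401)
print(holder_growth_holds(abs, 0.0, 0.0, alpha=1.0, p=1, points=xs))   # True
print(holder_growth_holds(lambda x: x * x / 2, 0.0, 0.0, 0.5, 2, xs))  # True
# Sharp growth (p = 1) fails for the smooth quadratic near its minimizer:
print(holder_growth_holds(lambda x: x * x / 2, 0.0, 0.0, 0.5, 1, xs))  # False
```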
Typically, different convergence proofs are employed for the cases of minimizing $f$ with and without assuming the existence of a given growth lower bound. Table 1 summarizes the number of iterations required to guarantee $\epsilon$-accuracy for several well-known first-order methods: the Proximal Point Method with any stepsize $\rho > 0$, defined by

$$x_{k+1} = \mathrm{prox}_{\rho, f}(x_k) := \operatorname*{argmin}_x \left\{ f(x) + \frac{1}{2\rho}\|x - x_k\|^2 \right\}, \tag{4}$$

the Polyak Subgradient Method, defined by

$$x_{k+1} = x_k - \rho_k g_k \quad \text{for some } g_k \in \partial f(x_k) \tag{5}$$

with stepsize $\rho_k = (f(x_k) - f(x_*))/\|g_k\|^2$, and Gradient Descent,

$$x_{k+1} = x_k - \rho_k \nabla f(x_k) \tag{6}$$

with stepsize $\rho_k = \|\nabla f(x_k)\|^{(1-\eta)/\eta} / L^{1/\eta}$, as well as the more sophisticated Proximal Bundle Method¹ of [11, 26] and a restarted variant of the Universal Gradient Method of [19].

Our Contribution. This work presents a pair of meta-theorems for deriving general convergence rates from rates that assume the existence of a growth lower bound. In terms of Table 1, we show that each convergence rate implies all of the convergence rates to its left: the quadratic growth column's rates imply all of the general setting's rates, and each sharp growth rate implies that method's general and quadratic growth rates. More generally, our results show that any convergence rate assuming growth with exponent $p$ implies rates for any growth exponent $q > p$ and for the general setting. An early version of these results [6] only considered the case of nonsmooth Lipschitz continuous optimization.

A natural question is whether the reverse implications hold (whether each column of Table 1 implies the rates to its right). If one is willing to modify the given first-order method, then the literature already provides a partial answer through restarting schemes [17, 13, 27, 25, 23]. Such schemes repeatedly run a given first-order method until some criterion is met and then restart the method at the current iterate. For example, Renegar and Grimmer [23] show how a general convergence rate can be extended to yield a convergence rate under Hölder growth. Combining this with our rate lifting theory allows any rate assuming growth with exponent $p$ to give a convergence rate applying under any growth exponent $q < p$.

¹Stronger convergence rates have been established for related level bundle methods [12, 7, 10], which share many core elements with proximal bundle methods.

2 Rate Lifting Theorems

Now we formalize our model for a generic first-order method fom. Note that for it to be meaningful to lift a convergence rate to apply to general problems, the inputs to fom need to be independent of the existence of a growth bound (3), but may depend on the Hölder smoothness constants $(L, \eta)$ or the optimal objective value $f(x_*)$. We make the following three assumptions about fom:

(A1) The method fom computes a sequence of iterates $\{x_k\}_{k=0}^{\infty}$. The next iterate $x_{k+1}$ is determined by the first-order oracle values $\{(f(x_j), g_j)\}_{j=0}^{k+1}$, where $g_j \in \partial f(x_j)$.

(A2) The distance from any $x_k$ to some fixed $x_* \in \operatorname{argmin} f$ is at most some constant $D > 0$.

(A3) If $f$ is $(L, \eta)$-Hölder smooth and possesses $(\alpha, p)$-Hölder growth on $B(x_*, D)$, then for any $\epsilon > 0$, fom finds an $\epsilon$-minimizer $x_k$ with

$$k \le K(f(x_0) - f(x_*), \epsilon, \alpha)$$

for some function $K : \mathbb{R}^3_{++} \rightarrow \mathbb{R}$ capturing the rate of convergence. (A sketch of how methods (5) and (6) fit this oracle model is given below.)
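The following is an illustrative Python sketch (ours, not from the paper) of how the Polyak Subgradient Method (5) and Gradient Descent (6) fit the oracle model of (A1): each new iterate depends only on past oracle values $(f(x_j), g_j)$. The oracle signature returning a pair `(f(x), g)` is an assumption of this sketch.

```python
import numpy as np

def polyak_subgradient(oracle, f_star, x0, iters):
    """Polyak subgradient method (5): x_{k+1} = x_k - rho_k g_k with
    rho_k = (f(x_k) - f(x*)) / ||g_k||^2."""
    x = x0
    for _ in range(iters):
        fx, g = oracle(x)
        if fx - f_star <= 0 or not np.any(g):
            break  # already optimal (g = 0 certifies optimality for convex f)
        x = x - (fx - f_star) / np.dot(g, g) * g
    return x

def gradient_descent(oracle, L, eta, x0, iters):
    """Gradient descent (6) with the Hölder stepsize
    rho_k = ||grad f(x_k)||^{(1-eta)/eta} / L^{1/eta}, for eta > 0."""
    x = x0
    for _ in range(iters):
        _, g = oracle(x)
        norm_g = np.linalg.norm(g)
        if norm_g == 0:
            break
        x = x - (norm_g ** ((1 - eta) / eta) / L ** (1 / eta)) * g
    return x

# Example: f(x) = ||x||_1 is 1-Lipschitz (eta = 0) with sharp growth;
# sign(x) is a valid subgradient.
oracle = lambda x: (np.linalg.norm(x, 1), np.sign(x))
x = polyak_subgradient(oracle, f_star=0.0, x0=np.array([3.0, -2.0]), iters=50)
print(np.linalg.norm(x, 1))  # -> 0.0
```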
On the Generality of (A1)-(A3). Note that (A1) allows the computation of $x_{k+1}$ to depend on the function and subgradient values at $x_{k+1}$. Hence the proximal point method is included in our model since it is equivalent to $x_{k+1} = x_k - \rho g_{k+1}$, where $g_{k+1} \in \partial f(x_{k+1})$. We remark that (A2) holds for all of the previously mentioned algorithms under proper selection of their stepsize parameters. In fact, many common first-order methods are nonexpansive, giving $D = \|x_0 - x_*\|$. Lastly, note that the convergence bound $K(f(x_0) - f(x_*), \epsilon, \alpha)$ in (A3) can depend on $L, \eta, p$ even though our notation does not enumerate this. If one wants to make no continuity or smoothness assumptions about $f$, then $(L, \eta)$ can be set as $(\infty, 0)$ (the proximal point method is one such example, as it converges independently of such structure).

The following pair of convergence rate lifting theorems shows that these assumptions suffice to give general convergence guarantees without Hölder growth and guarantees for Hölder growth with any exponent $q > p$.

Theorem 2.1. Consider any method fom satisfying (A1)-(A3) and any $(L, \eta)$-Hölder smooth function $f$. For any $\epsilon > 0$, fom will find $f(x_k - \gamma_k g_k) - f(x_*) \le \epsilon$ by iteration

$$k \le K\left(f(x_0) - f(x_*),\ \epsilon,\ \epsilon/D^p\right)$$

where $\gamma_k = \begin{cases} \|g_k\|^{(1-\eta)/\eta} / L^{1/\eta} & \text{if } \eta > 0 \\ 0 & \text{if } \eta = 0. \end{cases}$

Theorem 2.2. Consider any method fom satisfying (A1)-(A3) and any $(L, \eta)$-Hölder smooth function $f$ possessing $(\alpha, q)$-Hölder growth with $q > p$. For any $\epsilon > 0$, fom will find $f(x_k - \gamma_k g_k) - f(x_*) \le \epsilon$ by iteration

$$k \le K\left(f(x_0) - f(x_*),\ \epsilon,\ \alpha^{p/q} \epsilon^{1-p/q}\right)$$

where $\gamma_k = \begin{cases} \|g_k\|^{(1-\eta)/\eta} / L^{1/\eta} & \text{if } \eta > 0 \\ 0 & \text{if } \eta = 0. \end{cases}$

Applying these rate lifting theorems amounts to simply substituting $\alpha$ with $\epsilon/D^p$ or $\alpha^{p/q}\epsilon^{1-p/q}$ in any guarantee depending on $(\alpha, p)$-Hölder growth. For example, the Polyak subgradient method satisfies (A2) with $D = \|x_0 - x_*\|$ and converges for any $L$-Lipschitz objective with quadratic growth $p = 2$ at rate

$$K(f(x_0) - f(x_*), \epsilon, \alpha) = \frac{8L^2}{\alpha\epsilon},$$

establishing (A3) (a proof of this fact is given in the appendix for completeness).
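To make the substitution concrete, here is the arithmetic for this example, worked out by us from the quantities above (Theorem 2.1 with $p = 2$ substitutes $\alpha \mapsto \epsilon/D^2$):

```latex
% Worked step (ours): lifting the Polyak subgradient method's
% quadratic-growth rate K = 8L^2/(\alpha\epsilon) via Theorem 2.1.
\[
  k \;\le\; K\!\Big(f(x_0) - f(x_*),\ \epsilon,\ \frac{\epsilon}{D^2}\Big)
    \;=\; \frac{8L^2}{(\epsilon/D^2)\,\epsilon}
    \;=\; \frac{8L^2 D^2}{\epsilon^2}.
\]
% With D = \|x_0 - x_*\|, this recovers the classical O(1/\epsilon^2)
% general rate for subgradient methods reported in Table 1.
```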