Randomized Smoothing for (Parallel) Stochastic Optimization

Randomized Smoothing for (Parallel) Stochastic Optimization John C. Duchi [email protected] Peter L. Bartlett [email protected] Martin J. Wainwright [email protected] University of California, Berkeley, Berkeley, CA 94720 Abstract where ϕ : R is a known regularizing function, which mayX be → non-smooth. The problem (2) has wide By combining randomized smoothing tech- applicability in machine learning problems; essentially niques with accelerated gradient methods, we all empirical risk-minimization procedures fall into the obtain convergence rates for stochastic opti- form (2), where the distribution P in the definition (1) mization procedures, both in expectation and is either the empirical distribution over some sample with high probability, that have optimal de- of n datapoints, or it may simply be the (unknown) pendence on the variance of the gradient es- population distribution of the samples ξ Ξ. timates. To the best of our knowledge, these ∈ are the first variance-based convergence guar- As a first motivating example, consider support vector antees for non-smooth optimization. A com- machines (SVMs) (Cortes & Vapnik, 1995). In this bination of our techniques with recent work setting, the loss F and regularizer ϕ are defined by on decentralized optimization yields order- λ F (x; ξ)=[1 ξ,x ] and ϕ(x)= x 2 , (3) optimal parallel stochastic optimization algo- −h i + 2 k k2 rithms. We give applications of our results where [α] := max α, 0 . Here, the samples take the to several statistical machine learning prob- + form ξ = ba, where{ b } 1, +1 is the label of the lems, providing experimental results demon- data point a Rd, and∈ the{− goal} of the learner is to strating the effectiveness of our algorithms. find an x Rd∈that separates positive b from negative. ∈ More complicated examples include structured predic- 1. Introduction tion (Taskar, 2005), inverse convex or combinatorial optimization (e.g. Ahuja & Orlin, 2001) and inverse In this paper, we develop and analyze procedures for reinforcement learning or optimal control (Abbeel, solving a class of stochastic optimization problems 2008). The learner receives examples of the form that frequently arise in machine learning and statis- (ξ,ν) where ξ is the input to a system (for exam- tics. Formally, consider a collection F ( ; ξ), ξ Ξ of ple, in NLP applications ξ may be a sentence) and { · ∈ } closed convex functions, each with domain containing ν is a target (e.g. the parse tree for the sen- the closed convex set Rd. Let P be a probability tence∈ Vξ) that belongs to a potentially complicated set X ⊆ distribution over the sample space Ξ and consider the . The goal of the learner is to find parameters x so expected convex function f : R defined via thatV ν = argmax x,φ(ξ,v) , where φ is a feature X → v mapping. Given a∈V lossh ℓ(ν,v)i measuring the penalty f(x):= E F (x; ξ) = F (x; ξ)dP (ξ). (1) for predicting v = ν, the objective F (x;(ξ,ν)) is 6 ZΞ max [ℓ(ν,v)+ x,φ(ξ,v) x,φ(ξ,ν) ] . (4) v h i−h i We focus on potentially non-smooth stochastic opti- ∈V mization problems of the form These examples highlight one of the two main diffi- culties in solving the problem (2). The first, which is min f(x)+ ϕ(x) , (2) x by now well known (e.g. Nemirovski et al., 2009), is ∈X that it is often difficult to compute the integral (1). th Appearing in Proceedings of the 29 International Confer- Indeed, when ξ is high-dimensional, the integral can- ence on Machine Learning, Edinburgh, Scotland, UK, 2012. not be efficiently computed, and in machine learn- Copyright 2012 by the author(s)/owner(s). ing problems, we rarely even know the distribution Randomized Smoothing for Stochastic Optimization P . Thus, throughout this work, we assume only that The main contribution of our paper is to develop algo- we have access to i.i.d. samples ξ P , and conse- rithms for non-smooth stochastic optimization whose quently we adopt the current method∼ of choice and convergence rate depends on the variance σ2 of the focus on stochastic gradient procedures for solving stochastic (sub)gradient estimate. In particular, we the convex program (2) (Nemirovski et al., 2009; Lan, show that the ability to issue several queries to the 2010; Duchi & Singer, 2009; Xiao, 2010). In the oracle stochastic oracle for the original objective (2) can give model we assume, the optimizer issues a query vector faster rates of convergence than a simple stochastic x, after which the oracle samples a point ξ i.i.d. ac- oracle (to our knowledge, this is the first such result cording to P and returns a vector g ∂xF (x; ξ). The for non-smooth optimization). Our theorems quan- second difficulty of solving the problem∈ (2), which in tify the above statement in terms of expected values this stochastic setting is the main focus of our paper, (Theorem 1) and, under an additional reasonable tail is that the functions F and expected function f may condition, with high probability (Theorem 2). In ad- be non-smooth (i.e. non-differentiable). dition, we give extensions to the strongly convex case in Theorem 3. One consequence of our results is that a When the objective function f is smooth, meaning procedure that queries the non-smooth stochastic or- that it has Lipschitz continuous gradient, recent work acle for m subgradients at iteration t achieves rate of by Juditsky et al. (2008) and Lan (2010) has shown convergence (RL /√Tm) in expectation and with that if the variance of a stochastic gradient estimate 0 high probability.O (Here L is the Lipschitz constant is at most σ2 then stochastic optimization procedures 0 of the function and R is the ℓ -radius of its domain.) may obtain convergence rate (σ/√T ). Of partic- 2 This convergence rate is optimal up to constant fac- ular relevance here is that if insteadO of receiving sin- tors, and our algorithms have applications in statistics, gle stochastic gradient estimates the algorithm receives distributed optimization, and machine learning. m unbiased estimates of the gradient, the variance of the gradient estimator is reduced by a factor of m. Notation For a parameter p [1, ], we define Dekel et al. (2011) exploit this fact to develop asymp- the ℓ ball B (x,u) := y x ∈y ∞u . Addition totically order-optimal distributed optimization algo- p p p of sets A and B is defined{ | k as− thek ≤ Minkowski} sum rithms. The dependence on the variance is essential for A + B = x Rd x = y + z,y A, z B , and improvements gained through parallelism; however, to multiplication{ ∈ of a set| A by a scalar∈ α is∈ defined} to the best of our knowledge there has thus far been no be αA := αx x A . For any function f, we let work on non-smooth stochastic problems for which a supp f := {x f|(x)∈= 0} denote its support. Given a reduction in the variance of the stochastic subgradient convex function{ | f we6 use}∂f(x) to denote its subdiffer- estimate gives an improvement in convergence rates. ential at the point x. We define the shorthand notation The starting point for our approach is a convolution- ∂f(x) = sup g g ∂f(x) . The dual norm k k {k k | ∈ } k·k∗ based smoothing technique amenable to non-smooth of the norm is defined as z := sup x 1 z,x . A k·k k k k k≤ h i stochastic optimization problems. Let µ be a density function f is L -Lipschitz with respect∗ to the norm 0 k·k and consider the smoothed objective function over if f(x) f(y) L x y for all x,y . X | − | ≤ 0 k − k ∈ X The gradient of f is L1-Lipschitz continuous with re- f (x) := f(x + y)µ(y)dy = E [f(x + Z)], (5) spect to the norm over if µ µ k·k X Z f(x) f(y) L1 x y for x,y . where Z is a random variable with density µ. The k∇ − ∇ k∗ ≤ k − k ∈X A function ψ is strongly convex with respect to a norm function fµ is convex whenever f is convex, and fµ is over if for all x,y , guaranteed to be differentiable (e.g. Bertsekas, 1973). k·k X ∈X The important aspect of the convolution (5) to note is 1 ψ(y) ψ(x)+ ψ(x),y x + x y 2 . that by Fubini’s theorem, we can write ≥ h∇ − i 2 k − k Given a convex and differentiable function ψ, the fµ(x)= Eµ[F (x + Z; ξ) ξ]dP (ξ), (6) associated Bregman divergence between x and y is | Z D (x,y) := ψ(x) ψ(y) ψ(y),x y . We write ψ − − h∇ − i so that samples of subgradients g ∂F (x + Z; ξ) for drawing ξ from the distribution P as ξ P . ∼ Z µ and ξ P are unbiased gradient∈ estimates of ∼ ∼ fµ(x). By adding random perturbations, we do not 2. Algorithms and Main Results assume we know anything about the function F , and the perturbations allow us to automatically smooth We begin by describing our base algorithm, which even complex F for which finding a smooth proxy is builds off of Tseng’s (2008) work on accelerated gradi- difficult (e.g. the structured prediction problem (4)). ent methods. The method generates three sequences of Randomized Smoothing for Stochastic Optimization points, denoted x ,y ,z 3. The algorithm also Since the function f = f µ is smooth, Assumption A { t t t}∈X µ ∗ requires a non-increasing sequence of smoothing pa- ensures that fµ is close to f but not too “jagged.” We rameters ut R to control the perturbation and—as elaborate on conditions under which Assumption A is standard{ (Tseng}⊂ , 2008; Lan, 2010; Xiao, 2010)—uses holds after stating our first two theorems: a proximal function ψ strongly convex with respect to Theorem 1.

Randomized Smoothing for (Parallel) Stochastic Optimization

Lipschitz Continuity of Convex Functions

The Beginnings 2 2. the Topology of Metric Spaces 5 3. Sequences and Completeness 9 4

Convergence Rates for Deterministic and Stochastic Subgradient

The Lipschitz Constant of Self-Attention

Hölder-Continuity for the Nonlinear Stochastic Heat Equation with Rough Initial Conditions

1. the Rademacher Theorem

On Lipschitz Continuity of Projections 2

Stability and Convergence of Stochastic Gradient Clipping: Beyond Lipschitz Continuity and Smoothness

Arxiv:1706.10175V1 [Math.CV] 30 Jun 2017 Uscnomlslto Fteieult 11 Ihrsett H D the to Respect with (1.1) Inequality the Metric

Metric Spaces

Functions of One Variable

Fat Homeomorphisms and Unbounded Derivate Containers