
Randomized Smoothing for (Parallel) Stochastic Optimization

John C. Duchi [email protected]
Peter L. Bartlett [email protected]
Martin J. Wainwright [email protected]
University of California, Berkeley, Berkeley, CA 94720

Abstract

By combining randomized smoothing techniques with accelerated gradient methods, we obtain convergence rates for stochastic optimization procedures, both in expectation and with high probability, that have optimal dependence on the variance of the gradient estimates. To the best of our knowledge, these are the first variance-based convergence guarantees for non-smooth optimization. A combination of our techniques with recent work on decentralized optimization yields order-optimal parallel stochastic optimization algorithms. We give applications of our results to several statistical machine learning problems, providing experimental results demonstrating the effectiveness of our algorithms.

1. Introduction

In this paper, we develop and analyze procedures for solving a class of stochastic optimization problems that frequently arise in machine learning and statistics. Formally, consider a collection {F(·; ξ), ξ ∈ Ξ} of closed convex functions, each with domain containing the closed convex set X ⊆ R^d. Let P be a probability distribution over the sample space Ξ and consider the expected convex function f : X → R defined via

    f(x) := E[F(x; ξ)] = ∫_Ξ F(x; ξ) dP(ξ).    (1)

We focus on potentially non-smooth stochastic optimization problems of the form

    min_{x ∈ X} { f(x) + ϕ(x) },    (2)

where ϕ : X → R is a known regularizing function, which may be non-smooth. The problem (2) has wide applicability in machine learning; essentially all empirical risk-minimization procedures take the form (2), where the distribution P in the definition (1) is either the empirical distribution over some sample of n datapoints, or the (unknown) population distribution of the samples ξ ∈ Ξ.

As a first motivating example, consider support vector machines (SVMs) (Cortes & Vapnik, 1995). In this setting, the loss F and regularizer ϕ are defined by

    F(x; ξ) = [1 − ⟨ξ, x⟩]_+  and  ϕ(x) = (λ/2)‖x‖₂²,    (3)

where [α]_+ := max{α, 0}. Here the samples take the form ξ = ba, where b ∈ {−1, +1} is the label of the data point a ∈ R^d, and the goal of the learner is to find an x ∈ R^d that separates positive b from negative.

More complicated examples include structured prediction (Taskar, 2005), inverse convex or combinatorial optimization (e.g. Ahuja & Orlin, 2001), and inverse reinforcement learning or optimal control (Abbeel, 2008). The learner receives examples of the form (ξ, ν), where ξ is the input to a system (for example, in NLP applications ξ may be a sentence) and ν is a target (e.g. the parse tree for the sentence ξ) that belongs to a potentially complicated set V. The goal of the learner is to find parameters x so that ν = argmax_{v ∈ V} ⟨x, φ(ξ, v)⟩, where φ is a feature mapping. Given a loss ℓ(ν, v) measuring the penalty for predicting v ≠ ν, the objective F(x; (ξ, ν)) is

    max_{v ∈ V} [ ℓ(ν, v) + ⟨x, φ(ξ, v)⟩ − ⟨x, φ(ξ, ν)⟩ ].    (4)

These examples highlight one of the two main difficulties in solving the problem (2). The first, which is by now well known (e.g. Nemirovski et al., 2009), is that it is often difficult to compute the integral (1). Indeed, when ξ is high-dimensional, the integral cannot be efficiently computed, and in machine learning problems, we rarely even know the distribution P.

(Appearing in Proceedings of the 29th International Conference on Machine Learning, Edinburgh, Scotland, UK, 2012. Copyright 2012 by the author(s)/owner(s).)

Thus, throughout this work, we assume only that we have access to i.i.d. samples ξ ∼ P, and consequently we adopt the current method of choice and focus on stochastic gradient procedures for solving the convex program (2) (Nemirovski et al., 2009; Lan, 2010; Duchi & Singer, 2009; Xiao, 2010). In the oracle model we assume, the optimizer issues a query vector x, after which the oracle samples a point ξ i.i.d. according to P and returns a vector g ∈ ∂_x F(x; ξ). The second difficulty of solving the problem (2), which in this stochastic setting is the main focus of our paper, is that the functions F and the expected function f may be non-smooth (i.e. non-differentiable).

When the objective function f is smooth, meaning that it has Lipschitz continuous gradient, recent work by Juditsky et al. (2008) and Lan (2010) has shown that if the variance of a stochastic gradient estimate is at most σ², then stochastic optimization procedures may obtain a convergence rate of O(σ/√T). Of particular relevance here is that if, instead of receiving single stochastic gradient estimates, the algorithm receives m unbiased estimates of the gradient, the variance of the gradient estimator is reduced by a factor of m. Dekel et al. (2011) exploit this fact to develop asymptotically order-optimal distributed optimization algorithms. The dependence on the variance is essential for improvements gained through parallelism; however, to the best of our knowledge, there has thus far been no work on non-smooth stochastic problems for which a reduction in the variance of the stochastic subgradient estimate gives an improvement in convergence rates.

The main contribution of our paper is to develop algorithms for non-smooth stochastic optimization whose convergence rate depends on the variance σ² of the stochastic (sub)gradient estimate. In particular, we show that the ability to issue several queries to the stochastic oracle for the original objective (2) can give faster rates of convergence than a simple stochastic oracle (to our knowledge, this is the first such result for non-smooth optimization). Our theorems quantify the above statement in terms of expected values (Theorem 1) and, under an additional reasonable tail condition, with high probability (Theorem 2). In addition, we give extensions to the strongly convex case in Theorem 3. One consequence of our results is that a procedure that queries the non-smooth stochastic oracle for m subgradients at iteration t achieves a rate of convergence O(RL₀/√(Tm)) in expectation and with high probability. (Here L₀ is the Lipschitz constant of the function and R is the ℓ₂-radius of its domain.) This convergence rate is optimal up to constant factors, and our algorithms have applications in statistics, distributed optimization, and machine learning.

Notation. For a parameter p ∈ [1, ∞], we define the ℓ_p ball B_p(x, u) := {y | ‖x − y‖_p ≤ u}. Addition of sets A and B is defined as the Minkowski sum A + B = {x ∈ R^d | x = y + z, y ∈ A, z ∈ B}, and multiplication of a set A by a scalar α is defined to be αA := {αx | x ∈ A}. For any function f, we let supp f := {x | f(x) ≠ 0} denote its support. Given a convex function f, we use ∂f(x) to denote its subdifferential at the point x, and we define the shorthand notation ‖∂f(x)‖ = sup{‖g‖ | g ∈ ∂f(x)}. The dual norm ‖·‖_* of the norm ‖·‖ is defined as ‖z‖_* := sup_{‖x‖ ≤ 1} ⟨z, x⟩. A function f is L₀-Lipschitz with respect to the norm ‖·‖ over X if |f(x) − f(y)| ≤ L₀‖x − y‖ for all x, y ∈ X. The gradient of f is L₁-Lipschitz continuous with respect to the norm ‖·‖ over X if ‖∇f(x) − ∇f(y)‖_* ≤ L₁‖x − y‖ for x, y ∈ X. A function ψ is strongly convex with respect to a norm ‖·‖ over X if for all x, y ∈ X,

    ψ(y) ≥ ψ(x) + ⟨∇ψ(x), y − x⟩ + ½‖x − y‖².

Given a convex and differentiable function ψ, the associated Bregman divergence between x and y is D_ψ(x, y) := ψ(x) − ψ(y) − ⟨∇ψ(y), x − y⟩. We write ξ ∼ P to denote drawing ξ from the distribution P.

The starting point for our approach is a convolution-based smoothing technique amenable to non-smooth stochastic optimization problems.
Let µ be a density and consider the smoothed objective function

    f_µ(x) := ∫ f(x + y) µ(y) dy = E_µ[f(x + Z)],    (5)

where Z is a random variable with density µ. The function f_µ is convex whenever f is convex, and f_µ is guaranteed to be differentiable (e.g. Bertsekas, 1973). The important aspect of the convolution (5) is that, by Fubini's theorem, we can write

    f_µ(x) = ∫_Ξ E_µ[F(x + Z; ξ) | ξ] dP(ξ),    (6)

so that samples of subgradients g ∈ ∂F(x + Z; ξ), for Z ∼ µ and ξ ∼ P, are unbiased estimates of the gradient ∇f_µ(x). By adding random perturbations, we do not assume we know anything about the function F, and the perturbations allow us to automatically smooth even complex F for which finding a smooth proxy is difficult (e.g. the structured prediction problem (4)).
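The convolution identity (6) suggests a direct Monte Carlo implementation of the gradient estimator. Below is a minimal Python sketch, assuming a user-supplied callback `subgrad` that samples ξ ∼ P internally and returns an element of ∂F(·; ξ) at the query point; the function name and interface are ours, not the paper's.

```python
import numpy as np

def smoothed_grad_estimate(subgrad, x, u, m, d, rng):
    """Monte Carlo estimate of grad f_mu(x) via the identity (6).

    subgrad(y): draws a fresh xi ~ P and returns g in dF(y; xi).
    Z is drawn uniformly from the unit Euclidean ball, so u*Z is
    uniform on B_2(0, u)."""
    g = np.zeros(d)
    for _ in range(m):
        z = rng.normal(size=d)
        z *= rng.uniform() ** (1.0 / d) / np.linalg.norm(z)  # uniform in unit ball
        g += subgrad(x + u * z)
    return g / m  # averaging m samples cuts the variance by roughly 1/m
```

The final averaging line is exactly the variance-reduction mechanism the paper exploits: each term is an unbiased estimate of ∇f_µ(x), so the average has variance smaller by a factor of m.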

2. Algorithms and Main Results

We begin by describing our base algorithm, which builds off of Tseng's (2008) work on accelerated gradient methods. The method generates three sequences of points, denoted {x_t, y_t, z_t} ∈ X³. The algorithm also requires a non-increasing sequence of smoothing parameters {u_t} ⊂ R to control the perturbation and—as is standard (Tseng, 2008; Lan, 2010; Xiao, 2010)—uses a proximal function ψ, strongly convex with respect to the norm ‖·‖, to regularize the points. At iteration t, the algorithm computes stochastic gradients at m points drawn from a neighborhood around y_t:

(i) Draw random variables {Z_{i,t}}_{i=1}^m i.i.d. according to the distribution µ.

(ii) Compute m stochastic (sub)gradients at the points y_t + u_t Z_{i,t} for i = 1, 2, ..., m:

    g_{i,t} ∈ ∂F(y_t + u_t Z_{i,t}; ξ_{i,t}), where ξ_{i,t} ∼ P, for i = 1, 2, ..., m.    (7)

(iii) Compute the average g_t = (1/m) Σ_{i=1}^m g_{i,t}.

With the gradient sampling scheme defined, we can give our update scheme. We require scalars L₁ and η to control step sizes, and as in Tseng's work we assume the sequence {θ_t} satisfies θ₀ = 1 and θ_t = 2/(1 + √(1 + 4/θ²_{t−1})). Starting from an initial point x₀ ∈ X, we define the update recursions

    y_t = (1 − θ_t) x_t + θ_t z_t    (8a)

    z_{t+1} = argmin_{x ∈ X} { Σ_{τ=0}^t (1/θ_τ) ⟨g_τ, x⟩ + Σ_{τ=0}^t (1/θ_τ) ϕ(x) + [ L₁/u_t + η√(t+1)/θ_{t+1} ] D_ψ(x, x₀) }    (8b)

    x_{t+1} = (1 − θ_t) x_t + θ_t z_{t+1}.    (8c)

In prior work on accelerated schemes for stochastic and non-stochastic optimization (Tseng, 2008; Lan, 2010; Xiao, 2010), the term L₁ is set equal to the Lipschitz constant of ∇f; we use L₁/u_t to allow varying amounts of smoothing due to the shrinking sequence of u_t, and the damping term η√t/θ_t provides control over the fluctuations induced by the random vector g_t.
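For concreteness, the following Python sketch implements the recursions (8a)–(8c) in the simplified case X = R^d, ψ(x) = ½‖x − x₀‖₂², and ϕ ≡ 0, where the argmin in (8b) has a closed form; these simplifications are ours, not the paper's, and `grad_est` stands in for the sampling scheme (7).

```python
import numpy as np

def accelerated_smoothing(grad_est, x0, L1, eta, u, T):
    """Sketch of the updates (8a)-(8c) with psi(x) = 0.5*||x - x0||^2,
    X = R^d, and phi = 0. grad_est(y, u_t) should return the averaged
    stochastic gradient g_t of the scheme (7) at smoothing level u_t."""
    x, z = x0.copy(), x0.copy()
    theta = 1.0                        # theta_0 = 1
    dual_sum = np.zeros_like(x0)       # running sum of g_tau / theta_tau
    for t in range(T):
        y = (1 - theta) * x + theta * z            # update (8a)
        u_t = theta * u                            # u_t = theta_t * u (Theorem 1)
        g = grad_est(y, u_t)
        dual_sum += g / theta
        theta_next = 2.0 / (1.0 + np.sqrt(1.0 + 4.0 / theta**2))
        damp = L1 / u_t + eta * np.sqrt(t + 1) / theta_next
        z = x0 - dual_sum / damp                   # closed-form minimizer of (8b)
        x = (1 - theta) * x + theta * z            # update (8c)
        theta = theta_next
    return x
```

With a constraint set X or a non-trivial ϕ, the closed-form line for z would be replaced by the corresponding proximal/projection step.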

2.1. Convergence Rates for Convex Objectives

We now state our main results on the convergence rate of the randomized smoothing procedure (7) with accelerated dual averaging updates (8a)–(8c), providing proofs in the appendices. We begin with our main assumption, which ensures that the smoothing operator and smoothed function f_µ are relatively well-behaved.

Assumption A (Smoothing). The random variable Z is zero-mean with density µ, and there are constants L₀ and L₁ such that for u > 0, E[f(x + uZ)] ≤ f(x) + L₀u, and E[f(x + uZ)] has L₁/u-Lipschitz continuous gradient.

Since the function f_µ = f ∗ µ is smooth, Assumption A ensures that f_µ is close to f but not too "jagged." We elaborate on conditions under which Assumption A holds after stating our first two theorems.

Theorem 1. Define u_t = θ_t u, and let µ_t be the density of u_t Z. For any x* ∈ X,

    E[f(x_T) + ϕ(x_T)] − [f(x*) + ϕ(x*)] ≤ 6L₁ψ(x*)/(Tu) + 2ηψ(x*)/√T + (1/(ηT)) Σ_{t=1}^T (1/√t) E‖e_t‖²_* + 4L₀u/T,

where e_t := ∇f_{µ_t}(y_t) − g_t is the error in the gradient estimate.

The preceding result, which provides convergence in expectation, can be extended to bounds that hold with high probability under suitable tail conditions on the error e_t := ∇f_µ(y_t) − g_t. In particular, let F_t denote the σ-field of the random variables g_{i,s}, for i = 1, ..., m and s = 0, ..., t. To obtain high-probability results, we make the following assumption.

Assumption B (Sub-Gaussian errors). The error is (‖·‖_*, σ) sub-Gaussian, meaning that there exists a constant σ > 0 such that

    E[exp(‖e_t‖²_*/σ²) | F_{t−1}] ≤ exp(1) for t ∈ N.    (9)

Past work on smooth stochastic optimization (Juditsky et al., 2008; Lan, 2010; Xiao, 2010) has imposed this type of tail assumption, and Corollary 4 gives sufficient conditions for the assumption to hold.

Theorem 2. In addition to the conditions of Theorem 1, suppose that X is compact with ‖x − x*‖ ≤ R for all x ∈ X and that Assumption B holds. Then with probability at least 1 − 2δ,

    f(x_T) + ϕ(x_T) − [f(x*) + ϕ(x*)] ≤ 6L₁ψ(x*)/(Tu) + 4L₀u/T + 4ηψ(x*)/√T + (1/(ηT)) Σ_{t=1}^T (1/√t) E[‖e_t‖²_*] + (4σ²/(ηT)) max{ log(1/δ), √(3e log(T) log(1/δ)) } + σR√(log(1/δ))/√T.

We now give two corollaries that help to explain the bounds Theorem 1 provides. In each corollary we assume that the point x* ∈ X satisfies ψ(x*) ≤ R², and we use the proximal function ψ(x) = ½‖x‖₂².

Corollary 1. Let µ be uniform on {z | ‖z‖₂ ≤ u} and assume E[‖∂F(x; ξ)‖₂²] ≤ L₀². If we set u = Rd^{1/4} and η = L₀/(R√m), then

    E[f(x_T) + ϕ(x_T)] − [f(x*) + ϕ(x*)] ≤ 10L₀Rd^{1/4}/T + 5L₀R/√(Tm).

Corollary 2. Let µ be the d-dimensional normal distribution with covariance u²I and assume F(·; ξ) is L₀-Lipschitz for ξ ∈ Ξ. With smoothing u = Rd^{−1/4},

    E[f(x_T) + ϕ(x_T)] − [f(x*) + ϕ(x*)] ≤ 10L₀Rd^{1/4}/T + 5L₀R/√(Tm).

There are several objectives f + ϕ with domains X for which the natural geometry is non-Euclidean, which motivates the mirror descent family of algorithms (Nemirovski & Yudin, 1983). By using different distributions µ for the random perturbations Z_{i,t} in (7), we can take advantage of non-Euclidean geometry. Here we give an example that is quite useful for problems in which the optimizer x* is sparse; for example, the optimization set X may be a simplex or ℓ₁-ball, or ϕ(x) = λ‖x‖₁. The idea in this corollary is that we achieve a pair of dual norms that may give better performance than the ℓ₂-ℓ₂ pair above.

Corollary 3. Let µ be the uniform density on B_∞(0, u) and assume that F(·; ξ) is L₀-Lipschitz continuous with respect to the ℓ₁-norm over X + B_∞(0, u) for ξ ∈ Ξ. Use the prox function ψ(x) = (1/(2(p−1)))‖x‖²_p for p = 1 + 1/log d, and set u = R√(log d / d) and η = L₀√(log d)/(R√m). Then

    E[f(x_T) + ϕ(x_T)] − [f(x*) + ϕ(x*)] = O( L₀‖x*‖₁ √d log d / T + L₀‖x*‖₁ log d / √(Tm) ).

The dimension dependence of √d log d on the leading 1/T term in the corollary is weaker than the d^{1/4} dependence in the earlier corollaries, so for very large m the corollary is not as strong as one desires for problems with non-Euclidean geometry. But for large T, the 1/√(Tm) terms dominate the convergence rates.

Inspection of the above corollaries shows that we have achieved our goal: we have convergence rates that provably improve with the "mini-batch" (or gradient sample) size m. Additionally, Agarwal et al. (2012) give lower bounds for stochastic optimization that show these rates are optimal: they cannot be improved by more than constant factors.

Our final corollary specializes the high-probability convergence result in Theorem 2 by showing that the error is sub-Gaussian (9). We state the corollary for problems with Euclidean geometry, but it is clear that similar results hold for non-Euclidean geometry as above.

Corollary 4. Assume that F(·; ξ) is L₀-Lipschitz with respect to the ℓ₂-norm. Let ψ(x) = ½‖x‖₂² and assume that X is compact with ‖x − x*‖₂ ≤ R for x, x* ∈ X. Using a smoothing distribution µ uniform on B₂(0, u), smoothing parameter u = Rd^{1/4}, and damping parameter η = L₀/(R√m), with probability at least 1 − δ,

    f(x_T) + ϕ(x_T) − f(x*) − ϕ(x*) = O( L₀Rd^{1/4}/T + L₀R/√(Tm) + L₀R√(log(1/δ))/√(Tm) + L₀R max{log(1/δ), log T}/(T√m) ).

2.2. Strongly Convex Optimization

When the objective f + ϕ is strongly convex—for example, when the regularizer ϕ(x) = (λ/2)‖x‖₂²—it is possible to attain rates of convergence faster than 1/√T (Hazan & Kale, 2010; Ghadimi & Lan, 2010). Following the idea of Hazan and Kale, we apply the algorithm (8a)–(8c) in a series of epochs. Let x(i) denote the output of epoch i. Each epoch i consists of running the accelerated algorithm (8a)–(8c) for t(i) iterations, except that we replace D_ψ(x, x₀) with D_ψ(x, x(i−1)). In addition to Assumption A, we make

Assumption C (Strong convexity). The function f + ϕ is λ-strongly convex over X: for any x, y ∈ X and any f′(x) ∈ ∂f(x) and ϕ′(x) ∈ ∂ϕ(x),

    f(y) + ϕ(y) ≥ f(x) + ϕ(x) + ⟨f′(x) + ϕ′(x), y − x⟩ + (λ/2)‖y − x‖₂².

Now we describe the parameters and execution of the algorithm. We perform the updates (7) and (8a)–(8c) with the damping parameter η(i) and smoothing magnitude u_t = u(i) fixed within each epoch, setting η(i) = 2^i·η and u(i) = 2^{−i}·u within epoch i. We set the number of iterations t(i) in epoch i to be

    t(i) = max{ 4√(L₁/(u(i)λ)), 12η(i)/λ }.

Algorithm 1 (Epoch-based stochastic gradient)
  Input: strong convexity parameter λ, Lipschitz parameter L₁, smoothing u, damping η
  for i = 1 to k do
    Set η(i) = η·2^i and u(i) = u·2^{−i}
    Set t(i) = max{ 12η(i)/λ, 4√(L₁/(u(i)λ)) }
    Run the updates (8a)–(8c) for t = t(i) timesteps with
      • damping parameter η(i)
      • gradients g_t computed via the procedure (7) with smoothing parameter u(i)
      • initial points x₀ = x(i−1) and z₀ = x(i−1)
    Set x(i) to be the output of the previous step
  end for
  Output: x(k)
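As a small illustration, the epoch schedule of Algorithm 1 can be tabulated ahead of time; the Python sketch below computes (η(i), u(i), t(i)) for each epoch (rounding t(i) up to an integer is our choice; the paper leaves t(i) as a real number).

```python
import math

def epoch_schedule(lam, L1, u, eta, k):
    """Per-epoch parameters of Algorithm 1: the damping eta(i) grows and
    the smoothing u(i) shrinks geometrically, while t(i) balances the
    two terms in the max."""
    schedule = []
    for i in range(1, k + 1):
        eta_i = eta * 2.0 ** i
        u_i = u * 2.0 ** (-i)
        t_i = max(math.ceil(12.0 * eta_i / lam),
                  math.ceil(4.0 * math.sqrt(L1 / (u_i * lam))))
        schedule.append((eta_i, u_i, t_i))
    return schedule
```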

See Algorithm 1 for pseudocode. We obtain

Theorem 3. Let x(k) be the output of the epoch-based Algorithm 1 after k epochs, let σ² be a bound on the variance of the stochastic gradient estimate, and let M ≥ f(x(0)) + ϕ(x(0)) − f(x*) − ϕ(x*). Then after k = log₂(M/ǫ) epochs of Algorithm 1,

    E[f(x(k)) + ϕ(x(k))] − [f(x*) + ϕ(x*)] ≤ ǫ + 5 ( σ²/(2η) + L₀u ) (ǫ/M),

and the number of updates (8a)–(8c) is bounded by

    10√(L₁M/(uλǫ)) + 24Mη/(λǫ).

To translate Theorem 3 into a slightly more intuitive bound, we set η = σ²/(2M) and u = M/L₀. These choices imply that after at most

    10√(L₀L₁/(λǫ)) + 12σ²/(λǫ)    (10)

updates (8a)–(8c), we obtain optimization error 11ǫ.
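As a sanity check on (10), one can substitute the stated choices directly into the bounds of Theorem 3; the short derivation below uses the parameter settings as reconstructed above (η = σ²/(2M), u = M/L₀).

```latex
\underbrace{10\sqrt{\tfrac{L_1 M}{u\lambda\epsilon}}}_{u = M/L_0}
  = 10\sqrt{\tfrac{L_0 L_1}{\lambda\epsilon}},
\qquad
\underbrace{\tfrac{24 M \eta}{\lambda\epsilon}}_{\eta = \sigma^2/(2M)}
  = \tfrac{12\sigma^2}{\lambda\epsilon},
\qquad
\epsilon + \tfrac{5\epsilon}{M}\Bigl(\tfrac{\sigma^2}{2\eta} + L_0 u\Bigr)
  = \epsilon + \tfrac{5\epsilon}{M}(M + M) = 11\epsilon .
```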

We can apply these bounds to the SVM (3) and structured prediction (4) problems described in the introduction. The next corollary makes clear that computing m stochastic gradients in parallel (while employing our randomized perturbation techniques) yields linear speedup as a function of the batch size m.

Corollary 5. Let the regularizer ϕ(x) = (λ/2)‖x‖₂². Assume that ‖ξ‖₂ ≤ L₀ (in the SVM case) or ‖φ(ξ, ν)‖₂ ≤ L₀ (for the structured prediction problem). Then with m subgradient evaluations and µ either the d-dimensional normal or uniform on B₂(0, u), Algorithm 1 outputs an x̂ with

    E[f(x̂) + ϕ(x̂)] − [f(x*) + ϕ(x*)] = O(ǫ)

after at most

    O( L₀d^{1/4}/√(λǫ) + L₀²/(λmǫ) )

iterations of the method (8a)–(8c).

2.3. Robustness and Distributed Computing

In this final expository section, we give a few remarks on the robustness of our algorithms to the choice of stepsizes, and we describe techniques that allow parallelization. For concreteness, we focus our remarks on the strongly convex Algorithm 1, though essentially identical results hold for Section 2.1's procedures.

At first glance, Algorithm 1 appears to require knowledge of many of the parameters of the stochastic optimization problem (2). However, a closer inspection of the convergence rates guaranteed by Theorem 3 shows that the algorithm is robust to mis-specification of these values. Indeed, the smoothness Lipschitz constant L₁ appears in the algorithm only via √(L₁/(u(i)λ)), and the strong convexity parameter λ appears in η(i)/λ and √(L₁/(u(i)λ)). In both cases, we can write L₁ and λ as functions of the other inputs u and η: indeed, set C_u = L₁/(λu) and C_η = η/λ. The optimization error in Theorem 3 has linear dependence on η⁻¹, and the number of iterations (10) also has linear dependence on C_η = η/λ, so the algorithm is robust to (and the global rate of convergence does not change with) mis-specification of η and λ. Similarly, Theorem 3's optimization error is linear in L₀u, and the iteration bound (10) grows as √C_u, that is, linearly in u^{−1/2}. As a consequence, the epoch-based algorithm—similar to Nemirovski et al.'s (2009) study of stochastic gradient descent—is robust to mis-specification of its input parameters. See also our simulations in Section 3.

Now we turn to exploiting parallel computation to achieve variance reductions, which yield convergence rate improvements for our algorithms. As motivation, note that Corollary 5 makes clear that the convergence rate of Algorithm 1 to optimization accuracy ǫ improves linearly with the number of samples m, at least until m ≥ L₀/(d^{1/4}√(λǫ)). In our distributed setting, we assume we have n processors and take as our objective

    f(x) := (1/n) Σ_{i=1}^n f_i(x)  for  f_i(x) := E_{P_i}[F(x; ξ)].    (11)

The distributed variant of Algorithm 1 is as follows: a centralized processor executes Algorithm 1, while gradient computation is distributed among several processors, each of which computes some number m of local gradients that are then averaged across the network. We replace the gradient sampling (7) at iteration t with the following: each processor i draws m i.i.d. samples Z_{i,j} ∼ µ, j = 1, ..., m; the sampling scheme (7) is then replaced with each processor i querying its local oracle at the m points y_t + Z_{i,j}, yielding the stochastic gradients

    g_{i,j} ∈ ∂F(y_t + Z_{i,j}; ξ_{i,j}),  where ξ_{i,j} ∼ P_i i.i.d. for j ∈ {1, ..., m}.    (12)

The network then computes the average g_t = (1/(nm)) Σ_{i=1}^n Σ_{j=1}^m g_{i,j}. We may analyze the run-time and convergence rate of Algorithm 1 with the sampling scheme (12) using the technique developed by Dekel et al. (2011).
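A serial Python simulation of the scheme (12) is below; it idealizes consensus as an exact average, and the list of per-processor oracles `subgrads` is a placeholder interface of our own.

```python
import numpy as np

def distributed_grad(subgrads, y, u, m, d, rng):
    """Sketch of the distributed sampling scheme (12): each of the n
    processors averages m local stochastic gradients of the smoothed
    objective, then the network averages the n local results."""
    n = len(subgrads)
    total = np.zeros(d)
    for sg in subgrads:                       # in practice: run in parallel
        local = np.zeros(d)
        for _ in range(m):
            z = rng.normal(size=d)
            z *= rng.uniform() ** (1.0 / d) / np.linalg.norm(z)
            local += sg(y + u * z)
        total += local / m
    return total / n                          # variance shrinks like 1/(n*m)
```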

single gradient vector as in (12), and let c(n) be the x ( amount of time required to achieve consensus in an ϕ

− −1 n-node network. Then ) 10 ∗ x Corollary 6. Under the conditions of Theorem 3, ( f −

2 ) b L0L1 m + c(n) σ x

( −2 max (m + c(n)) , 10 O · λǫ m · nλǫ ϕ  r 

n o ) + b x

( 3 units of time of the procedure (12) are sufficient for 10

f −1 10 1 E[f(x)+ ϕ(x)] [f(x∗)+ ϕ(x∗)] ǫ. 10 1 − ≤ 10 −1 3 η 1/u 10 10 Corollary 6 givesb usb our desired result: order-optimal Figure 1. Optimality gap of the vector xb output by Al- distributed optimization—i.e. linear speedup in the gorithm 1 for the SVM problem (13) plotted against the number n of processors—so long as c(n) is at most inverse smoothing parameter 1/u and damping stepsize η. the order of m. In Figure 1, we plot the average optimality gap of the point x Algorithm 1 selects. From the fig- 3. Experimental Results ure, we see that the performance of the method Though the results we have presented thus far are is nearly identical—achievingb optimization accuracy 2 essentially theoretically unimprovable (Agarwal et al., better than 10− after 2000 iterations—so long as (η, 1/u) [1, 1000] [.1, 100]. The method suffers 2012), it is important to understand their practical ∈ × performance. In particular, we would like to under- some performance degradation if η is too small, that is, η 1 or so, or u is too small, that is, 1/u 100. Even stand the robustness properties of the algorithms— ≤ ≥ whether they still perform well under mis-specification when these extreme settings of u or η occur, however, of problem parameters—and how they compare to the method still has optimization accuracy on the or- 1 state-of-the-art stochastic optimization algorithms. der of 10− . The breakdown points in the figure are intuitive: if η is too small, the damping in the proximal Robustness: We begin by studying the robustness gradient update (8b) will not overcome the stochastic- of Algorithm 1 as a function of the parameters L , u, ity of the vectors gt; if u is too small, the perturbed 1 E and η: the Lipschitz continuity constant of the gra- function [F (x + uZ; ξ)] is nearly non-smooth. dient fµ of the smoothed objective, the amount of perturbation,∇ and the damping stepsize. (For lack of Learning: Now we turn to showing the per- space, we omit robustness results for our other algo- formance benefits of parallelization and the random- rithms.) As we discuss following Theorem 3, it is no ized perturbation methods. We begin with experi- ments based on a metric learning problem (Xing et al., loss of generality to view estimating the constant L1 as choosing 1/u, whence our robustness analysis reduces 2003). For points i,j = 1,...,n we are given a vec- tor a Rd, and a measure b 0 of the similarity to studying the effects of the choices for u and η. i ∈ ij ≥ between the vectors ai and aj . (Here bij = 0 means For our experiment, we generate n = 1000 vectors that ai and aj are the same.) The goal is to learn a d ai 1, 0, 1 with d = 200 dimensions, each entry matrix X such that (ai aj),X(ai aj ) bij . One ∈ {− } 1 h − − i≈ set to 0 with probability 2 and uniformly 1 with method for doing so is to minimize the objective 1 {± } probability 2 . We set bi = sign( ai,w ) for a random Rd h i 1 vector w distributed as N(0,Id d) and flip the f(X)= tr X(ai aj )(ai aj )⊤ bij ∈ × n − − − signs of 10% of the bi, and solve the SVM problem 2 i=j X6  n subject  to tr(X) C, X 0. (14) 1 λ 2 ≤  minimize [1 bi ai,x ]+ + x 2 (13) x Rd n − h i 2 k k A stochastic gradient for this problem is simple: given i=1 ∈ X a matrix X, choose a pair (i,j) uniformly at random, with λ = .1. By construction, the Lipschitz constant then compute the subgradient L0 10, so that we can estimate the values u and ≈ sign[ (ai aj ),X(ai aj ) bij](ai aj )(ai aj )⊤. η Corollary 5 specifies. 
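For reference, here is a Python sketch of this synthetic setup together with a stochastic subgradient oracle for (13); the batching interface is our own convention, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 1000, 200, 0.1

# Each entry of a_i is 0 w.p. 1/2 and +/-1 w.p. 1/4 each; labels come
# from a random hyperplane, with 10% of the signs flipped.
A = rng.choice([-1.0, 0.0, 1.0], size=(n, d), p=[0.25, 0.5, 0.25])
w = rng.normal(size=d)
b = np.sign(A @ w)
b[rng.random(n) < 0.1] *= -1

def svm_subgrad(x, batch):
    """Stochastic subgradient of the objective (13) on a batch of indices."""
    margins = b[batch] * (A[batch] @ x)
    active = margins < 1.0                    # where the hinge is active
    g = -(b[batch][active][:, None] * A[batch][active]).sum(axis=0) / len(batch)
    return g + lam * x
```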
We run Algorithm 1 for 2000 iterations of the updates (8a)–(8c), using m = 5 samples, and we perform 50 such experiments. In Figure 1, we plot the average optimality gap of the point x̂ that Algorithm 1 selects. From the figure, we see that the performance of the method is nearly identical—achieving optimization accuracy better than 10⁻² after 2000 iterations—so long as (η, 1/u) ∈ [1, 1000] × [.1, 100]. The method suffers some performance degradation if η is too small, that is, η ≤ 1 or so, or if u is too small, that is, 1/u ≥ 100. Even when these extreme settings of u or η occur, however, the method still attains optimization accuracy on the order of 10⁻¹. The breakdown points in the figure are intuitive: if η is too small, the damping in the proximal gradient update (8b) will not overcome the stochasticity of the vectors g_t; if u is too small, the perturbed function E[F(x + uZ; ξ)] is nearly non-smooth.

(Figure 1. Optimality gap of the vector x̂ output by Algorithm 1 for the SVM problem (13), plotted against the inverse smoothing parameter 1/u and damping stepsize η.)

Metric Learning: Now we turn to showing the performance benefits of parallelization and the randomized perturbation methods. We begin with experiments based on a metric learning problem (Xing et al., 2003). For each i = 1, ..., n we are given a vector a_i ∈ R^d, and for each pair (i, j) a measure b_{ij} ≥ 0 of the similarity between the vectors a_i and a_j. (Here b_{ij} = 0 means that a_i and a_j are the same.) The goal is to learn a matrix X such that ⟨(a_i − a_j), X(a_i − a_j)⟩ ≈ b_{ij}. One method for doing so is to minimize the objective

    f(X) = (1/n²) Σ_{i ≠ j} | tr( X(a_i − a_j)(a_i − a_j)ᵀ ) − b_{ij} |  subject to tr(X) ≤ C, X ⪰ 0.    (14)

A stochastic gradient for this problem is simple: given a matrix X, choose a pair (i, j) uniformly at random, then compute the subgradient

    sign[ ⟨(a_i − a_j), X(a_i − a_j)⟩ − b_{ij} ] (a_i − a_j)(a_i − a_j)ᵀ.

We solve ten random instances of the metric learning problem (14) with d = 100 and n = 2000, yielding an objective with approximately 2·10⁶ terms.
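A Python sketch of this stochastic subgradient follows; we resample to avoid i = j, and we omit the projection onto the constraint set {tr(X) ≤ C, X ⪰ 0}, which Algorithm 1 would additionally require.

```python
import numpy as np

def metric_subgrad(X, A, B, rng):
    """Stochastic subgradient for the objective (14): sample a pair
    (i, j) with i != j uniformly at random and differentiate the
    sampled absolute-deviation term. A is the n x d data matrix and
    B holds the pairwise similarities b_ij."""
    n = A.shape[0]
    i = j = 0
    while i == j:
        i, j = rng.integers(n), rng.integers(n)
    diff = A[i] - A[j]
    resid = diff @ X @ diff - B[i, j]
    return np.sign(resid) * np.outer(diff, diff)
```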

In Figure 2, we plot the optimality gap f(X_t) − inf_{X* ∈ X} f(X*) as a function of computation time. We plot several lines, each of which captures the performance of the algorithm using a different number m of samples in the smoothing step (7). Receiving more samples gives improvements in the convergence rate as a function of time. Our theory also predicts that for m ≥ d, there should be no improvement in the time taken to minimize the objective. Figure 2 suggests this is correct: the plots for m = 64 and m = 128 are essentially indistinguishable.

(Figure 2. Optimization error in the metric learning problem (14) versus time in seconds. Each line indicates error when using m samples in the gradient estimate (7), for m ∈ {1, 2, 4, 8, 16, 32, 64, 128}.)

Support Vector Machines: For this experiment, we investigate solving an SVM problem (3) using the Reuters RCV1 dataset (Lewis et al., 2004), which consists of 800,000 training examples for binary classification tasks involving prediction of the topic of a news document from the words it contains. We compare Algorithm 1 to the state-of-the-art SVM solver Pegasos (Shalev-Shwartz et al., 2007). In the left plot of Figure 3, we plot the optimality gap f(x_t) + ϕ(x_t) − f(x*) − ϕ(x*) as a function of computation time for Pegasos and our perturbation method with 1, 2, 3, and 4 worker threads (denoted Acc(i)). In the right plot, we show the time in seconds required to achieve an ǫ = .004 optimality gap for Algorithm 1 as a function of the number of threads computing stochastic subgradient estimates. The blue (top) line corresponds to each worker thread using a batch of size m = 10 to estimate a stochastic gradient, which is further averaged, while the red (lower) line corresponds to each worker using batches of size 20. We see an improvement of approximately 1/n for n workers, as we expect.

(Figure 3. Left: Optimality gap of Pegasos and the accelerated strongly convex methods with {1, ..., 4} threads versus time. Right: time to achieve optimality gap ǫ = .004 for accelerated methods versus number of worker threads.)

Structured Prediction: Our final experiment is to learn feature weights for a probabilistic parsing task using a hypergraph parser. Hypergraph parsers (Klein & Manning, 2002) convert parsing tasks—which require assigning the productions in a probabilistic context-free grammar (PCFG) to a sentence—to finding maximum-weight paths in hypergraphs. To learn the weights on the features, we minimize a loss of the form (4), where the datum ξ is a sentence and the label ν is a parse tree, which corresponds to a weighted path between a sentence node and an initial sentential production node in the hypergraph associated with the PCFG. We only sketch our setup here. The important conditions we note are that the feature function φ decomposes across edges in the hypergraph, and we use standard lexical features (Taskar et al., 2004). If we let v be a {0, 1} matrix with entries for each edge in a hypergraph, where v_{j,k} = 1 means edge (j, k) is selected, then labels ν are paths in the hypergraph, and our loss is the Hamming loss ℓ(v, ν) = Σ_{j,k} 1(v_{j,k} ≠ ν_{j,k}). This loss decomposes across edges of the hypergraph, meaning that the objective

    max_{v ∈ V} [ ℓ(v, ν) + ⟨x, φ(ξ, v)⟩ − ⟨x, φ(ξ, ν)⟩ ]_+ = max_{v ∈ V} Σ_{j,k} [ 1(v_{j,k} ≠ ν_{j,k}) + ⟨x, φ_{j,k}(ξ, v_{j,k}) − φ_{j,k}(ξ, ν_{j,k})⟩ ]

is computable in time cubic in the length of the sentence ξ (Klein & Manning, 2002). (Here we have used φ_{j,k} to indicate the features associated with the edge (j, k) in the hypergraph.) The hypergraph representation we use is quite large: each word in a sentence generates some 2002 different possible productions in the corresponding context-free grammar, each of which requires thousands of features, yielding billions of weights; moreover, each hypergraph requires approximately 200kB of memory to store.
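Given a decoder for the loss-augmented problem above (for hypergraphs, the cubic-time dynamic program just mentioned), the subgradient of the objective (4) fed to the sampling scheme (7) takes a single line; the oracle and feature interfaces in this Python sketch are hypothetical stand-ins.

```python
def structured_subgrad(x, phi, loss_augmented_argmax, xi, nu):
    """Subgradient of the structured hinge loss (4): a subgradient of a
    pointwise max is the gradient of the term attaining the max, so it
    suffices to decode the loss-augmented maximizer v* and difference
    the feature vectors. phi(xi, v) returns the feature vector of v."""
    v_star = loss_augmented_argmax(x, xi, nu)   # maximizer of (4)
    return phi(xi, v_star) - phi(xi, nu)
```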

In Figure 4, we plot the results of 10 experiments for minimizing the structured prediction loss (4) on the Wall Street Journal portion of the Penn Treebank (Marcus et al., 1994). We use ℓ₂-regularization ϕ(x) = (λ/2)‖x‖₂² with multiplier λ = .25. We plot Algorithm 1's optimality gap as a function of the number of iterations (left) and the amount of time (right) required by the method. The right plot in Figure 4 also shows the performance of the state-of-the-art regularized dual averaging (RDA) algorithm (Xiao, 2010). The left plot evidences a striking benefit in the number of iterations performed by the method as the number of threads increases. As the right plot shows, it is not trivial to translate this into improvements in actual running time. This is a consequence of the memory overhead engendered by multiple threads accessing the same memory, as well as synchronization among the threads. Nonetheless, as the amount of time increases, the benefit of using multiple threads—and thus of reducing the variance of the stochastic gradient estimate via the averaged gradient (12)—is clear. We also see the perhaps surprising result that even when using a single thread, the perturbation-based accelerated method yields improved performance over prior algorithms.

(Figure 4. Optimality gaps for the hypergraph-based parsing task (structured prediction). The legend for each plot gives the number of threads used to compute stochastic subgradients. Left: optimality gap versus number of iterations. Right: optimality gap versus computation time.)

References

Abbeel, P. Apprenticeship Learning and Reinforcement Learning with Application to Robotic Control. PhD thesis, Stanford University, 2008.
Agarwal, A., Bartlett, P. L., Ravikumar, P., and Wainwright, M. J. Information-theoretic lower bounds on the oracle complexity of convex optimization. IEEE Transactions on Information Theory, 58(5):3235–3249, May 2012.
Ahuja, R. and Orlin, J. Inverse optimization. Operations Research, 49(5):771–783, 2001.
Bertsekas, D. P. Stochastic optimization problems with nondifferentiable cost functionals. Journal of Optimization Theory and Applications, 12(2):218–231, 1973.
Cortes, C. and Vapnik, V. Support-vector networks. Machine Learning, 20(3):273–297, September 1995.
Dekel, O., Gilad-Bachrach, R., Shamir, O., and Xiao, L. Optimal distributed online prediction. In Proceedings of the 28th International Conference on Machine Learning, 2011.
Duchi, J. C. and Singer, Y. Efficient online and batch learning using forward-backward splitting. Journal of Machine Learning Research, 10:2873–2898, 2009.
Ghadimi, S. and Lan, G. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization. 2010.
Hazan, E. and Kale, S. An optimal algorithm for stochastic strongly convex optimization. URL http://arxiv.org/abs/1006.2425, 2010.
Juditsky, A., Nemirovski, A., and Tauvel, C. Solving variational inequalities with the stochastic mirror-prox algorithm. URL http://arxiv.org/abs/0809.0815, 2008.
Klein, D. and Manning, C. Parsing and hypergraphs. In New Developments in Parsing Technology. Kluwer Academic, 2002.
Lan, G. An optimal method for stochastic composite optimization. Mathematical Programming, 2010. Online first. URL http://www.ise.ufl.edu/glan/papers/OPT_SA4.pdf.
Lewis, D., Yang, Y., Rose, T., and Li, F. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361–397, 2004.
Marcus, M. P., Santorini, B., and Marcinkiewicz, M. A. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19:313–330, 1994.
Nemirovski, A. and Yudin, D. Problem Complexity and Method Efficiency in Optimization. Wiley, 1983.
Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
Shalev-Shwartz, S., Singer, Y., and Srebro, N. Pegasos: Primal estimated sub-gradient solver for SVM. In Proceedings of the 24th International Conference on Machine Learning, 2007.
Taskar, B. Learning Structured Prediction Models: A Large Margin Approach. PhD thesis, Stanford University, 2005.
Taskar, B., Klein, D., Collins, M., Koller, D., and Manning, C. Max-margin parsing. In Empirical Methods in Natural Language Processing, 2004.
Tseng, P. On accelerated proximal gradient methods for convex-concave optimization. 2008. URL http://www.math.washington.edu/~tseng/papers/apgm.pdf.
Xiao, L. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11:2543–2596, 2010.
Xing, E., Ng, A. Y., Jordan, M., and Russell, S. Distance metric learning, with application to clustering with side-information. In Advances in Neural Information Processing Systems 15, 2003.
Yousefian, F., Nedić, A., and Shanbhag, U. Convex nondifferentiable stochastic optimization: a local randomized smoothing technique. In Proceedings of IEEE American Control Conference, pp. 4875–4880, 2010.

Appendix: Proofs

A. Proofs for Non-Strongly Convex Optimization

In this section, we provide the proofs of Theorems 1 and 2, as well as Corollaries 1 through 4. We begin with the proofs of the corollaries, after which we give the full proofs of the theorems. In both cases, we defer some of the more technical lemmas to appendices.

The general technique for the proof of each corollary is as follows. First, we recognize that the randomly smoothed function f_µ(x) = E[f(x + Z)] for Z ∼ µ has Lipschitz continuous gradients and is uniformly close to the original non-smooth function f. This allows us to apply Theorem 1. The second step is to realize that with the sampling procedure (7), the variance E‖e_t‖²_* decreases at a rate of approximately 1/m, the number of gradient samples. Choosing the stepsizes appropriately in the theorems then completes the proofs. Proofs of these corollaries require relatively tight control of the smoothness properties of the smoothing convolution (5), so we refer frequently to the lemmas stated in Appendix C.

A.1. Proof of Corollaries 1 and 2

We begin by proving Corollary 1. Recall the averaged quantity g_t = (1/m) Σ_{i=1}^m g_{i,t}, where g_{i,t} ∈ ∂F(y_t + Z_{i,t}; ξ_{i,t}) and the random variables Z_{i,t} are distributed uniformly on the ball B₂(0, u).

From Lemma 8 in Appendix C, the variance of g_t as an estimate of ∇f_µ(y_t) satisfies

    σ² := E‖e_t‖₂² = E‖g_t − ∇f_µ(y_t)‖₂² ≤ L₀²/m.    (15)

Further, for Z distributed uniformly on B₂(0, u), we have the bound f(x) ≤ E[f(x + Z)] ≤ f(x) + L₀u, and moreover the function f_µ has L₀√d/u-Lipschitz continuous gradient. Thus, applying Lemma 8 and Theorem 1 with the setting L_t = L₀√d/(uθ_t), we obtain

    E[f(x_T) + ϕ(x_T)] − [f(x*) + ϕ(x*)] ≤ 6L₀R²√d/(Tu) + 2η_T R²/T + (1/T) Σ_{t=0}^{T−1} (1/η_t)·L₀²/m + 4L₀u/T,

where we have used the bound (15). Now recall that η_t = L₀√(t+1)/(R√m) by construction. Coupled with the inequality

    Σ_{t=1}^T 1/√t ≤ 1 + ∫_1^T dt/√t = 1 + 2(√T − 1) ≤ 2√T,    (16)

we use that 2√(T+1)/T + 2/√T ≤ 5/√T to obtain

    E[f(x_T) + ϕ(x_T)] − [f(x*) + ϕ(x*)] ≤ 6L₀R²√d/(Tu) + 5L₀R/√(Tm) + 4L₀u/T.

Plugging in the specified setting of u = Rd^{1/4} completes the proof.

The proof of Corollary 2 is essentially identical, differing only in the setting of u = Rd^{−1/4} and the application of Lemma 9 instead of Lemma 8 from Appendix C.
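The 1/m variance reduction in (15) is easy to check empirically; below is a small Python experiment of our own for the one-dimensional function f(x) = |x| (so L₀ = 1), smoothed uniformly on [−u, u].

```python
import numpy as np

# Averaged subgradient of f(x) = |x| at x = 0 under uniform smoothing:
# each subgradient sign(z) is +/-1, so the m-sample average should have
# variance close to L0^2 / m = 1 / m.
rng = np.random.default_rng(1)
u, trials = 0.1, 20000
for m in (1, 4, 16, 64):
    z = rng.uniform(-u, u, size=(trials, m))
    g_bar = np.sign(z).mean(axis=1)   # averaged subgradients at x = 0
    print(m, g_bar.var())             # decays roughly like 1/m
```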

A.2. Proof of Corollary 3

Under the stated conditions of the corollary, Lemma 6 implies that when µ is uniform on B_∞(0, u), the function f_µ(x) := E_µ[f(x + Z)] has L₀/u-Lipschitz continuous gradient with respect to the ℓ₁-norm, and moreover it satisfies the upper bound f_µ(x) ≤ f(x) + L₀du/2.

Now, fix x ∈ X and let g_i ∈ ∂F(x + Z_i; ξ_i), with g = (1/m) Σ_{i=1}^m g_i. We claim that the error satisfies

    E‖g − ∇f_µ(x)‖²_∞ ≤ C L₀² log d / m    (17)

for some universal constant C. Indeed, Lemma 6 shows that E[g] = ∇f_µ(x); moreover, component j of the random vector g_i is an unbiased estimator of the jth component of ∇f_µ(x). Since ‖g_i‖_∞ ≤ L₀ and ‖∇f_µ(x)‖_∞ ≤ L₀, the vector g_i − ∇f_µ(x) is a d-dimensional random vector whose components are sub-Gaussian with sub-Gaussian parameter 4L₀². Conditional on x, the g_i are independent, so g − ∇f_µ(x) has sub-Gaussian components with parameter at most 4L₀²/m. Applying standard sub-Gaussian tail bounds to the ℓ_∞-norm of the bounded vectors g_i − ∇f_µ(x) (we omit the proof) yields the claim (17).

Now, as in the proof of Corollary 1, we can apply Theorem 1. Recall that (1/(2(p−1)))‖x‖²_p is strongly convex over R^d with respect to the ℓ_p-norm for any p ∈ (1, 2] (e.g. Xiao, 2010). Thus, with the choice ψ(x) = (1/(2(p−1)))‖x‖²_p for p = 1 + 1/log d, it is clear that the squared radius R² of the set X is of order ‖x*‖²_p log d ≤ ‖x*‖₁² log d. Essentially all that remains is to relate the Lipschitz constant L₀ with respect to the ℓ₁-norm to that for the ℓ_p-norm. Let q be conjugate to p, that is, 1/q + 1/p = 1; under the assumptions of the theorem, we have q = 1 + log d. For any g, we have ‖g‖_q ≤ d^{1/q}‖g‖_∞. Of course, d^{1/(log d + 1)} ≤ d^{1/log d} = exp(1), so ‖g‖_q ≤ e‖g‖_∞. Having shown that the Lipschitz constant L for the ℓ_p-norm satisfies L ≤ L₀ exp(1), where L₀ is the Lipschitz constant with respect to the ℓ₁-norm, we simply apply Theorem 1 and the variance bound (17) to get the result. Specifically, Theorem 1 implies

    E[f(x_T) + ϕ(x_T)] − [f(x*) + ϕ(x*)] ≤ 6L₀R²/(Tu) + 2η_T R²/T + (C/T) Σ_{t=0}^{T−1} (1/η_t)·L₀² log d/m + 4L₀du/(2T).

Plugging in the values for u and η_t, using R ≤ ‖x*‖₁√(log d) and the bound (16), completes the proof.

A.3. Proof of Corollary 4

The proof of this corollary requires an auxiliary result showing that Assumption B holds under the stated conditions. The following result (whose proof we omit) can be shown using a Doob martingale construction with norm-bounded random vectors. In stating it, we recall the definition of the σ-field F_t from Assumption B.

Lemma 1. Using the notation of Theorem 2, suppose that F(·; ξ) is L₀-Lipschitz continuous with respect to the norm ‖·‖ over X + supp µ for P-a.e. ξ. Then

    E[ exp(‖e_t‖²_*/σ²) | F_{t−1} ] ≤ exp(1),  where  σ² := 2 max{ E[‖e_t‖²_* | F_{t−1}], 16L₀²/m }.

Using this lemma, we now prove Corollary 4. When µ is the uniform distribution on B₂(0, u), Lemma 8 from Appendix C implies that ∇f_µ is Lipschitz with constant L₁ = L₀√d/u. As discussed previously, Lemma 1 ensures that the error e_t satisfies Assumption B. Noting the inequality

    max{ log(1/δ), √(log T · log(1/δ)) } ≤ max{ log(1/δ), log T }

and combining the bound in Theorem 2 with Lemma 1, we see that with probability at least 1 − 2δ,

    f(x_T) + ϕ(x_T) − f(x*) − ϕ(x*) ≤ 6L₀R²√d/(Tu) + 4L₀u/T + 4R²η/√(T+1) + 2L₀²/(m√T η) + C L₀² max{log(1/δ), log(T+1)}/((T+1)mη) + L₀R√(log(1/δ))/√(Tm)

for a universal constant C. Plugging in η = L₀/(R√m) and u = Rd^{1/4} gives the desired result.
Then we apply the (stochastic) ac- The proof of this corollary requires an auxiliary result celerated gradient≤ method (Tseng, 2008; Xiao, 2010) showing that Assumption B holds under the stated to the sequence of functions ft decreasing to f, and by conditions. The following result (whose proof we omit) allowing ut to decrease appropriately we achieve our can be shown using a Doob martingale construction result. In the proof, we frequently use Lt as shorthand with norm-bounded random vectors. In stating it, we for the quantity L1/ut. We also simply assume that recall the definition of the sigma field t from Assump- ψ(x)= Dψ(x,x0) for notational convenience. tion B. F We begin by stating two technical lemmas: Lemma 1. Using the notation of Theorem 2, suppose that F ( ; ξ) is L -Lipschitz continuous with respect to Lemma 2. Let ft be a sequence of functions such that · 0 the norm over + supp µ for P -a.e. ξ. Then ft has Lt-Lipschitz continuous gradients with respect to k·k X the norm and assume that ft(x) ft 1(x) for any 2 k·k ≤ − x . Let the sequence xt,yt,zt be generated ac- E et ∈ X { } exp k k∗ t 1 exp(1), cording to the updates (8a)–(8c), and define the error σ2 |F − ≤     2 term et = ft(yt) gt. Then for any x∗ , 2 2 16L0 ∇ − ∈X where σ : = 2 max E[ et t 1], . k k∗ |F − m 1   2 [ft(xt+1)+ ϕ(xt+1)] θt t 1 ηt+1 Using this lemma, we now prove Corollary 4. When [f (x∗)+ ϕ(x∗)] + L + ψ(x∗) ≤ θ τ t+1 θ µ is the uniform distribution on B2(0,u), Lemma 8 τ=0 τ t+1 X   from Appendix C implies that fµ is Lipschitz with t t ∇ 1 2 1 constant L1 = L0√d/u. As discussed previously, + et + eτ ,zτ x∗ . 2θ η k k∗ θ h − i Lemma 1 ensures that the error e satisfies Assump- τ=0 τ τ τ=0 τ t X X tion B. Noting the inequality See Appendix A.6 for the proof of this claim. max log(1/δ), log T log(1/δ) max log(1/δ), log T 1 θt 1 { }≤ { } Lemma 3. Let the sequence θt satisfy −2 = 2 θt θt−1 p 2 t 1 1 and combining the bound in Theorem 2 with Lemma 1, and θ0 = 1. Then θt t+2 , and τ=0 θ = θ2 . ≤ τ t P Randomized Smoothing for Stochastic Optimization

The second statement was proved by Tseng (2008); Recall the filtration of σ-fields so that x ,y ,z Ft t t t ∈ the first follows by a straightforward induction. t 1, that is, t contains the randomness in the Fstochastic− oracleF to time t. Since g is an unbi- We now proceed with the proof of the theorem. Defin- t ased estimator of ft(yt) by construction, we have ing ft(x) := E[f(x + utZ)], let us first verify that ∇ E[gt t 1]= ft(yt) and ft(x) ft 1(x) for any x and t so that Lemma 2 |F − ∇ ≤ − ∈X can be applied. Since ut ut 1, we may define a ≤ − E[ et,zt x∗ ]= E E[ et,zt x∗ t 1] random variable U with support on 0, 1 such that h − i h − i|F − u { } E E P(U =1)= t [0, 1]. Then =  [et t 1],zt x∗  = 0, ut−1 ∈ h |F − − i   ft(x)= E[f(x + utZ)] = E f x + ut 1ZE[U] where we have used the fact that zt is measurable with − P E P respect to t 1. Now, recall from Lemma 3 that θt [U = 1] [f(x + ut 1Z)] + [U = 0] f(x), − ≤  −  2 F 2 2 ≤ 2+t and that (1 θt)/θt = 1/θt 1. Thus where the inequality follows from Jensen’s inequality. − − By a second application of Jensen’s inequality, we have 2 θt 1 1 1 2+ t 3 − = = for t 4. f(x) = f(x + ut 1EZ) Ef(x + ut 1Z) = ft 1(x). 2 2 θ 1 θt ≤ 1 t ≤ 2 ≥ Combined with the− previous≤ inequality,− we conclude− t − − 2+t that ft(x) E[f(x + ut 1Z)] = ft 1(x) as claimed. ≤ − − Furthermore, we have θt+1 θt, so by multiplying Consequently, we have verified that the function ft ≤ 2 both sides of our bound (20) by θT 1 and taking ex- satisfies the assumptions of Lemma 2 where ft has − ∇ pectations over the random vectors gt, Lipschitz parameter Lt = L1/ut and error term et = ft(yt) gt. We apply the lemma momentarily. E[f(x )+ ϕ(x )] [f(x∗)+ ϕ(x∗)] ∇ − T T − Using Assumption A that f(x) E[f(x + utZ)] 2 2 ≥ − θT 1LT ψ(x∗)+ θT 1ηT ψ(x∗)+ θT 1TL0u L u = f (x) L u for all x , Lemma 3 implies ≤ − − − 0 t t − 0 t ∈X T 1 T 1 − 1 2 − 1 1 + θT 1 E et + θT 1 E[ et,zt x∗ ] − 2η k k∗ − h − i 2 [f(xT )+ ϕ(xT )] 2 [f(x∗)+ ϕ(x∗)] t=0 t t=0 θT 1 − θT 1 X X − − T 1 T 1 6L1ψ(x∗) 2ηT ψ(x∗) 1 − 1 2 4L0u 1 − 1 + + E et + , = [f(xT )+ ϕ(xT )] [f(x∗)+ ϕ(x∗)] ≤ Tu T T η k k∗ T θ2 − θ t=0 t T 1 t=0 t X − X 1 where we used the fact that LT = L1/uT = L1/θT u. 2 [fT 1(xT )+ ϕ(xT )] (18) ≤ θT 1 − This completes the proof of Theorem 1. − T 1 T 1 − 1 − L0ut [ft(x∗)+ ϕ(x∗)] + , A.5. Proof of Theorem 2 − θ θ t=0 t t=0 t X X An examination of the proof of Theorem 1 shows that which by the definition of ut is in turn bounded by to control the probability of deviation from the ex- T 1 pected convergence rate, we need to control two terms: 1 − 1 T 1 1 2 [f (x )+ϕ(x )] [f (x∗)+ϕ(x∗)]+TL u. the squared error sequence − et and the se- 2 T 1 T T t 0 t=0 2ηt θ − − θ k k∗ T 1 t=0 t T 1 1 − quence − et,zt x∗ . The next two lemmas X (19) t=0 θt P handle these terms.h − i Now we apply Lemma 2 to the bound (19), which gives P Lemma 4. Let be compact with x x∗ R for 1 X k − k ≤ [f(x )+ ϕ(x ) f(x ) ϕ(x )] all x . Under Assumption B, we have 2 T T ∗ ∗ ∈X θT 1 − − − T 1 T 1 − 2 ηT − 1 2 P 2 1 Tǫ LT ψ(x∗)+ ψ(x∗) et θT 1 et,zt x∗ ǫ exp 2 2 . ≤ θ 2θ η k k∗ − θt h − i≥ ≤ − R σ T t=0 t t  t=0    X X T 1 (21) − 1 + e ,z x∗ + TL u. (20) Consequently, with probability at least 1 δ, θ h t t − i 0 − t=0 t X T 1 1 − The non-probabilistic bound (20) is the key to the re- 2 1 log δ θT 1 et,zt x∗ Rσ . (22) − θ h − i≤ s T mainder of this proof, as well as the starting point for t=0 t the proof of Theorem 2 in the next section. What X remains is to take expectations in the bound (20). Lemma 5. 
A.5. Proof of Theorem 2

An examination of the proof of Theorem 1 shows that, to control the probability of deviation from the expected convergence rate, we need to control two terms: the squared error sequence Σ_{t=0}^{T−1} (1/(2η_t))‖e_t‖²_* and the sequence Σ_{t=0}^{T−1} (1/θ_t)⟨e_t, z_t − x*⟩. The next two lemmas handle these terms.

Lemma 4. Let X be compact with ‖x − x*‖ ≤ R for all x ∈ X. Under Assumption B, we have

    P( θ²_{T−1} Σ_{t=0}^{T−1} (1/θ_t)⟨e_t, z_t − x*⟩ ≥ ǫ ) ≤ exp( −Tǫ²/(R²σ²) ).    (21)

Consequently, with probability at least 1 − δ,

    θ²_{T−1} Σ_{t=0}^{T−1} (1/θ_t)⟨e_t, z_t − x*⟩ ≤ Rσ √(log(1/δ)/T).    (22)

Lemma 5. In the notation of Theorem 2 and under Assumption B, we have

    log P( Σ_{t=0}^{T−1} (1/(2η_t))‖e_t‖²_* ≥ Σ_{t=0}^{T−1} (1/(2η_t)) E[‖e_t‖²_*] + ǫ ) ≤ max{ −ǫ²/(32eσ⁴ Σ_{t=0}^{T−1} η_t^{−2}), −η₀ǫ/(4σ²) }.    (23)

Consequently, with probability at least 1 − δ,

    Σ_{t=0}^{T−1} (1/(2η_t))‖e_t‖²_* ≤ Σ_{t=0}^{T−1} (1/(2η_t)) E[‖e_t‖²_*] + (4σ²/η) max{ log(1/δ), √(2e log(T+1) log(1/δ)) }.    (24)

The proofs of these probabilistic lemmas are technical and build off of concentration results for sums of random variables similar to those used by Nemirovski et al. (2009); we omit them, as they are somewhat lengthy and require several auxiliary statements on concentration of sub-Gaussian and sub-exponential random variables. (We provide proofs in the journal version of this paper.)

Equipped with these lemmas, we now prove Theorem 2. Let us recall the deterministic bound (20) from the proof of Theorem 1:

    (1/θ²_{T−1})[f(x_T) + ϕ(x_T) − f(x*) − ϕ(x*)] ≤ L_T ψ(x*) + (η_T/θ_T)ψ(x*) + Σ_{t=0}^{T−1} (1/(2θ_tη_t))‖e_t‖²_* + Σ_{t=0}^{T−1} (1/θ_t)⟨e_t, z_t − x*⟩ + TL₀u.

Noting that θ_{T−1} ≤ θ_t for t ∈ {0, ..., T−1}, Lemma 5 implies that with probability at least 1 − δ,

    θ²_{T−1} Σ_{t=0}^{T−1} (1/(2θ_tη_t))‖e_t‖²_* ≤ θ_{T−1} [ Σ_{t=0}^{T−1} (1/(2η_t)) E[‖e_t‖²_*] + (4σ²/η) max{ log(1/δ), √(2e log(T+1) log(1/δ)) } ].

Applying Lemma 4, we see that with probability at least 1 − δ,

    θ²_{T−1} Σ_{t=0}^{T−1} (1/θ_t)⟨e_t, z_t − x*⟩ ≤ Rσ√(log(1/δ))/√T.

The terms remaining to control are deterministic and were bounded previously in the proof of Theorem 1; in particular, we have θ²_{T−1}L_T ≤ 6L₁/(Tu), θ²_{T−1}η_T/θ_T ≤ 4η_T/(T+1), and θ²_{T−1}TL₀u ≤ 4L₀u/(T+1). Combining the above bounds completes the proof.

A.6. Proof of Lemma 2

Define the linearized version of the cumulative objective

    ℓ_t(z) := Σ_{τ=0}^t (1/θ_τ)[ f_τ(y_τ) + ⟨g_τ, z − y_τ⟩ + ϕ(z) ],    (25)

and let ℓ_{−1}(z) denote the indicator function of X. For conciseness, we adopt the shorthand notation α_t^{−1} = L_t + η_t/θ_t and φ_t(x) = f_t(x) + ϕ(x).

By the smoothness of f_t, we have

    φ_t(x_{t+1}) = f_t(x_{t+1}) + ϕ(x_{t+1}) ≤ f_t(y_t) + ⟨∇f_t(y_t), x_{t+1} − y_t⟩ + (L_t/2)‖x_{t+1} − y_t‖² + ϕ(x_{t+1}).

From the definition (8a)–(8c) of the triple (x_t, y_t, z_t), we obtain

    φ_t(x_{t+1}) ≤ f_t(y_t) + ⟨∇f_t(y_t), θ_t z_{t+1} + (1 − θ_t)x_t − y_t⟩ + (L_t/2)‖θ_t z_{t+1} − θ_t z_t‖² + ϕ(θ_t z_{t+1} + (1 − θ_t)x_t).

Finally, by convexity of the regularizer ϕ, we conclude that

    φ_t(x_{t+1}) ≤ θ_t [ f_t(y_t) + ⟨∇f_t(y_t), z_{t+1} − y_t⟩ + ϕ(z_{t+1}) + (L_tθ_t/2)‖z_{t+1} − z_t‖² ] + (1 − θ_t)[ f_t(y_t) + ⟨∇f_t(y_t), x_t − y_t⟩ + ϕ(x_t) ].    (26)

By the strong convexity of ψ, we have the lower bound

    D_ψ(x, y) ≥ ½‖x − y‖².    (27)

On the other hand, by the convexity of f_t, we have

    f_t(y_t) + ⟨∇f_t(y_t), x_t − y_t⟩ ≤ f_t(x_t).    (28)

Substituting inequalities (27) and (28) into the upper bound (26) and simplifying yields

    φ_t(x_{t+1}) ≤ θ_t [ f_t(y_t) + ⟨∇f_t(y_t), z_{t+1} − y_t⟩ + ϕ(z_{t+1}) ] + L_tθ_t D_ψ(z_{t+1}, z_t) + (1 − θ_t)[f_t(x_t) + ϕ(x_t)].

We now rewrite this upper bound in terms of the error e_t = ∇f_t(y_t) − g_t. In particular,

    φ_t(x_{t+1}) ≤ θ_t [ f_t(y_t) + ⟨g_t, z_{t+1} − y_t⟩ + ϕ(z_{t+1}) ] + L_tθ_t D_ψ(z_{t+1}, z_t) + (1 − θ_t)[f_t(x_t) + ϕ(x_t)] + θ_t⟨e_t, z_{t+1} − y_t⟩
    = θ_t [ ℓ_t(z_{t+1}) − ℓ_{t−1}(z_{t+1}) + L_t D_ψ(z_{t+1}, z_t) ] + (1 − θ_t)[f_t(x_t) + ϕ(x_t)] + θ_t⟨e_t, z_{t+1} − y_t⟩.    (29)

Using the fact that z_t minimizes ℓ_{t−1}(x) + (1/α_t)ψ(x), the first-order conditions for optimality imply that, for all g ∈ ∂ℓ_{t−1}(z_t) and x ∈ X, ⟨g + (1/α_t)∇ψ(z_t), x − z_t⟩ ≥ 0. Thus first-order convexity gives

    ℓ_{t−1}(x) ≥ ℓ_{t−1}(z_t) + ⟨g, x − z_t⟩ ≥ ℓ_{t−1}(z_t) − (1/α_t)⟨∇ψ(z_t), x − z_t⟩ = ℓ_{t−1}(z_t) + (1/α_t)ψ(z_t) − (1/α_t)ψ(x) + (1/α_t)D_ψ(x, z_t).

Adding ℓ_t(z_{t+1}) to both sides of the above and substituting x = z_{t+1}, we conclude

    ℓ_t(z_{t+1}) − ℓ_{t−1}(z_{t+1}) ≤ ℓ_t(z_{t+1}) − ℓ_{t−1}(z_t) − (1/α_t)ψ(z_t) + (1/α_t)ψ(z_{t+1}) − (1/α_t)D_ψ(z_{t+1}, z_t).

Combining this inequality with the bound (29) and using the definition α_t^{−1} = L_t + η_t/θ_t, we find

    φ_t(x_{t+1}) ≤ θ_t² [ ℓ_t(z_{t+1}) − ℓ_t(z_t) − (1/α_t)ψ(z_t) + (1/α_{t+1})ψ(z_{t+1}) − (η_t/θ_t)D_ψ(z_{t+1}, z_t) ] + (1 − θ_t)[f_t(x_t) + ϕ(x_t)] + θ_t⟨e_t, z_{t+1} − y_t⟩,

since α_t^{−1} is non-decreasing. We now divide both sides by θ_t² and unwrap the recursion. Recall that (1 − θ_t)/θ_t² = 1/θ²_{t−1} and f_t ≤ f_{t−1} by construction, so we obtain

    (1/θ_t²)[f_t(x_{t+1}) + ϕ(x_{t+1})] ≤ (1/θ²_{t−1})[f_{t−1}(x_t) + ϕ(x_t)] + ℓ_t(z_{t+1}) − ℓ_t(z_t) − (1/α_t)ψ(z_t) + (1/α_{t+1})ψ(z_{t+1}) − (η_t/θ_t)D_ψ(z_{t+1}, z_t) + (1/θ_t)⟨e_t, z_{t+1} − y_t⟩.

By applying the three steps above successively to (1/θ²_{t−1})[f_{t−1}(x_t) + ϕ(x_t)], then to (1/θ²_{t−2})[f_{t−2}(x_{t−1}) + ϕ(x_{t−1})], and so on until t = 0, we find

    (1/θ_t²)[f_t(x_{t+1}) + ϕ(x_{t+1})] ≤ ((1 − θ₀)/θ₀²)[f₀(x₀) + ϕ(x₀)] + ℓ_t(z_{t+1}) + (1/α_{t+1})ψ(z_{t+1}) − ℓ_{−1}(z₀) − (1/α₀)ψ(z₀) − Σ_{τ=0}^t (η_τ/θ_τ)D_ψ(z_{τ+1}, z_τ) + Σ_{τ=0}^t (1/θ_τ)⟨e_τ, z_{τ+1} − y_τ⟩.

By construction θ₀ = 1, we have ℓ_{−1}(z₀) = 0, and z_{t+1} minimizes ℓ_t(x) + (1/α_{t+1})ψ(x) over X. Thus, for any x* ∈ X, we have

    (1/θ_t²)[f_t(x_{t+1}) + ϕ(x_{t+1})] ≤ ℓ_t(x*) + (1/α_{t+1})ψ(x*) − Σ_{τ=0}^t (η_τ/θ_τ)D_ψ(z_{τ+1}, z_τ) + Σ_{τ=0}^t (1/θ_τ)⟨e_τ, z_{τ+1} − y_τ⟩.

Recalling the definition (25) of ℓ_t and noting that f_τ(y_τ) + ⟨∇f_τ(y_τ), x − y_τ⟩ ≤ f_τ(x) by convexity, we expand ℓ_t and have

    ℓ_t(x*) = Σ_{τ=0}^t (1/θ_τ)[ f_τ(y_τ) + ⟨g_τ, x* − y_τ⟩ + ϕ(x*) ] ≤ Σ_{τ=0}^t (1/θ_τ)[ f_τ(x*) + ϕ(x*) ] − Σ_{τ=0}^t (1/θ_τ)⟨e_τ, x* − y_τ⟩,

so that

    (1/θ_t²)[f_t(x_{t+1}) + ϕ(x_{t+1})] ≤ Σ_{τ=0}^t (1/θ_τ)[f_τ(x*) + ϕ(x*)] + (1/α_{t+1})ψ(x*) − Σ_{τ=0}^t (η_τ/θ_τ)D_ψ(z_{τ+1}, z_τ) + Σ_{τ=0}^t (1/θ_τ)⟨e_τ, z_{τ+1} − x*⟩.    (30)

Now we use the Fenchel–Young inequality applied to the conjugate pair ½‖·‖² and ½‖·‖²_*, which gives

    ⟨e_t, z_{t+1} − x*⟩ = ⟨e_t, z_t − x*⟩ + ⟨e_t, z_{t+1} − z_t⟩ ≤ ⟨e_t, z_t − x*⟩ + (1/(2η_t))‖e_t‖²_* + (η_t/2)‖z_{t+1} − z_t‖².

In particular, using the strong-convexity bound (27) once more,

    −(η_t/θ_t)D_ψ(z_{t+1}, z_t) + (1/θ_t)⟨e_t, z_{t+1} − x*⟩ ≤ (1/(2η_tθ_t))‖e_t‖²_* + (1/θ_t)⟨e_t, z_t − x*⟩.

Using this inequality and rearranging (30) gives the statement of the lemma.

B. Proof of Theorem 3

In this section, we provide the promised proof of Theorem 3. Our proof is based on the following proposition, which shows the exponential decrease of the optimization error as a function of the number of epochs.

Proposition 1. Let x(k) denote the output of Algorithm 1, and let Assumptions A and C hold. In addition, let M ≥ f(x(0)) + ϕ(x(0)) − f(x*) − ϕ(x*) denote an upper bound on the initial optimality gap. Then

    E[f(x(k)) + ϕ(x(k))] − [f(x*) + ϕ(x*)] ≤ 2^{−k}M + 5·2^{−k} ( σ²/(2η) + L₀u ).    (31)

Before proving Proposition 1, we give the proof of Theorem 3. To that end, we compute the number of iterations required to achieve a particular ǫ-accuracy in the optimization error (31). We first note that by choosing k = log₂(M/ǫ), we can replace the expected optimality gap (31) with

    (ǫ/M)·M + 5·(ǫ/M)( σ²/(2η) + L₀u ) = ǫ + 5 ( σ²/(2η) + L₀u ) (ǫ/M).

What remains is to compute the number of iterations of the three-series updates (8a)–(8c). To that end, we compute the sum of t(i) as chosen across all the epochs of Algorithm 1. Using that max{a, b} ≤ a + b for a, b ≥ 0, we have

    Σ_{i=1}^k t(i) ≤ 4√(L₁/(uλ)) Σ_{i=1}^k (√2)^i + (12η/λ) Σ_{i=1}^k 2^i ≤ (4/√2)·√(L₁/(uλ))·(√2)^k/(1 − √2/2) + (24η/λ)·2^k.

Plugging in the choice k = log₂(M/ǫ), so that (√2)^k = √(M/ǫ) and 2^k = M/ǫ, we conclude that

    Σ_{i=1}^k t(i) ≤ (4/(√2 − 1))·√(L₁M/(uλǫ)) + 24Mη/(λǫ) ≤ 10√(L₁M/(uλǫ)) + 24Mη/(λǫ),

which is the content of Theorem 3.

Proof of Proposition 1. We begin by recasting the convergence guarantee of Lemma 2 in the necessary epoch-based notation. Let x_τ(i) denote the value of x_τ in epoch i, and similarly for z_τ(i), g_τ(i), and so on. Let f_i(x) = E_µ[f(x + u(i)Z)] denote the mollified function during epoch i, and define the error of the gradient estimate g_τ(i) at iteration τ of epoch i to be e_τ(i) = ∇f_i(y_τ(i)) − g_τ(i). Since ψ(·) is 1-strongly convex with respect to the norm ‖·‖, the output x(i) of epoch i of Algorithm 1 satisfies

    f_i(x(i)) + ϕ(x(i)) − [f_i(x*) + ϕ(x*)] ≤ θ²_{t(i)−1} ( L₁/u(i) + η(i)/θ_{t(i)} ) D_ψ(x*, x(i−1))    (32)
        + θ²_{t(i)−1} Σ_{τ=0}^{t(i)−1} (1/(2θ_τη(i)))‖e_τ(i)‖²_* + θ²_{t(i)−1} Σ_{τ=0}^{t(i)−1} (1/θ_τ)⟨e_τ(i), z_τ(i) − x*⟩.

Define the error factors

    E(i) := Σ_{τ=0}^{t(i)−1} (1/(2θ_τη(i)))‖e_τ(i)‖²_* + Σ_{τ=0}^{t(i)−1} (1/θ_τ)⟨e_τ(i), z_τ(i) − x*⟩,    (33)

and apply the smoothing Assumption A to the single-epoch bound (32) to see that

    f(x(i)) + ϕ(x(i)) − [f(x*) + ϕ(x*)] ≤ θ²_{t(i)−1} ( L₁/u(i) + η(i)/θ_{t(i)} ) D_ψ(x*, x(i−1)) + θ²_{t(i)−1} E(i) + L₀u(i).    (34)

Note that by our strong-convexity Assumption C, for any x ∈ X we have

    D_ψ(x*, x) ≤ (1/λ)[ f(x) + ϕ(x) − f(x*) − ϕ(x*) ].

Define φ(x) = f(x) + ϕ(x) for notational convenience. Then we can replace the upper bound (34) with

    φ(x(i)) − φ(x*) ≤ (θ²_{t(i)−1}/λ)( L₁/u(i) + η(i)/θ_{t(i)} )[ φ(x(i−1)) − φ(x*) ] + θ²_{t(i)−1} E(i) + L₀u(i).

To simplify, we define the multiplication factors M(i) and π_k(i) as

    M(i) := (θ²_{t(i)−1}/λ)( L₁/u(i) + η(i)/θ_{t(i)} )  and  π_k(i) := Π_{j=i+1}^k M(j).    (35)

Then recursively applying the bound (34), the definitions (33) and (35), and that of the joint function φ, we find

    f(x(k)) + ϕ(x(k)) − f(x*) − ϕ(x*)
    ≤ M(k)[ φ(x(k−1)) − φ(x*) ] + θ²_{t(k)−1}E(k) + L₀u(k)
    ≤ M(k)M(k−1)[ φ(x(k−2)) − φ(x*) ] + M(k)[ θ²_{t(k−1)−1}E(k−1) + L₀u(k−1) ] + θ²_{t(k)−1}E(k) + L₀u(k)
    ≤ Π_{i=1}^k M(i)·[ f(x(0)) + ϕ(x(0)) − f(x*) − ϕ(x*) ] + Σ_{i=1}^k π_k(i)[ θ²_{t(i)−1}E(i) + L₀u(i) ].    (36)

So if we can choose η(i) and t(i) so that M(i) < ½, we will have bounds that decrease geometrically in the number of epochs k, so very few epochs are necessary. To that end, choose the iteration count specified in Algorithm 1:

    t(i) = max{ 4√(L₁/(u(i)λ)), 12η(i)/λ }.

Noting that

    θ²_{t−1}/θ_t = θ_t²/((1 − θ_t)θ_t) = θ_t/(1 − θ_t) ≤ 2/t  (since θ_t ≤ 2/(2+t)),

this choice of t(i) yields the bound

    M(i) = (θ²_{t(i)−1}/λ)( L₁/u(i) + η(i)/θ_{t(i)} ) ≤ 4L₁/(λu(i)t(i)²) + 2η(i)/(λt(i)) ≤ 1/4 + 2/12 = 5/12 < 1/2

on the multipliers M(i). So after k epochs, the bound (36) tells us that

    f(x(k)) + ϕ(x(k)) − f(x*) − ϕ(x*) ≤ 2^{−k}M + Σ_{i=1}^k π_k(i)θ²_{t(i)−1}E(i) + L₀ Σ_{i=1}^k π_k(i)u(i).

Now we take expectations of the error terms E(i). Recalling that E[e_τ(i) | F_{τ−1}(i)] = 0, since the draws of Z and ξ are independent in each iteration, we have

    E[E(i)] = Σ_{τ=0}^{t(i)−1} σ²/(2θ_τη(i)) = σ²/(2η(i)θ²_{t(i)−1}) = σ²/(2^{i+1}ηθ²_{t(i)−1})

by the definition of the recursion for the θ_t (recall Lemma 3). Consequently, using the choices η(i) = η·2^i and u(i) = u·2^{−i} in Algorithm 1, we have

    E[f(x(k)) + ϕ(x(k)) − f(x*) − ϕ(x*)] ≤ 2^{−k}M + (σ²/(2η)) Σ_{i=1}^k π_k(i)2^{−i} + L₀u Σ_{i=1}^k π_k(i)2^{−i}
    ≤ 2^{−k}M + (σ²/(2η) + L₀u)·2^{−k} Σ_{i=1}^k (5/6)^{k−i} ≤ 2^{−k}M + 5·2^{−k}(σ²/(2η) + L₀u),

which is the desired result (31).

C. Smoothing Properties

In this section, we discuss the analytic properties of the smoothed function f_µ obtained from the convolution (5). We simply state the results, preferring to avoid lengthening the already lengthy theoretical treatment, and refer the reader to Yousefian et al. (2010) for one example proof, excepting the sharpness arguments (the other proofs are similar). We assume throughout that functions are sufficiently integrable, without bothering with measurability conditions (since F(·; ξ) is convex, this is no real loss of generality (Bertsekas, 1973)). By Fubini's theorem, we have

    f_µ(x) = ∫_Ξ ∫_{R^d} F(y; ξ)µ(x − y) dy dP(ξ) = ∫_Ξ F_µ(x; ξ) dP(ξ),

where F_µ(x; ξ) = (F(·; ξ) ∗ µ)(x). We begin with the observation that since µ is a density with respect to Lebesgue measure, the function f_µ is in fact differentiable (Bertsekas, 1973). So we have already made our problem somewhat smoother, as it is now differentiable; for the remainder, we consider finer properties of the smoothing operation. In particular, we show that under suitable conditions on µ, F(·; ξ), and P, the function f_µ is uniformly close to f over X and ∇f_µ is Lipschitz continuous.

A remark on notation before proceeding: since f is convex, it is almost-everywhere differentiable, and we can abuse notation and take its gradient inside integrals and expectations with respect to Lebesgue measure. Similarly, F(·; ξ) is almost-everywhere differentiable with respect to Lebesgue measure, so we use the same abuse of notation for F and write ∇F(x + Z; ξ), which exists with probability 1.

Lemma 6. Let µ be the uniform density on the ℓ_∞-ball of radius u. Assume that E[‖∂F(x; ξ)‖²_∞] ≤ L₀² for all x ∈ X + B_∞(0, u). Then

(i) f(x) ≤ f_µ(x) ≤ f(x) + (L₀d/2)u;

(ii) f_µ is L₀-Lipschitz with respect to the ℓ₁-norm over X;

(iii) f_µ is continuously differentiable; moreover, its gradient is L₀/u-Lipschitz continuous with respect to the ℓ₁-norm;

(iv) if Z ∼ µ, then E[∇F(x + Z; ξ)] = ∇f_µ(x) and E[‖∇f_µ(x) − ∇F(x + Z; ξ)‖²_∞] ≤ 4L₀².

There exist functions for which each of the estimates (i)–(iii) is tight simultaneously, and (iv) is tight at least to a factor of 1/4.

Remark: The hypothesis of this lemma is satisfied if, for any fixed ξ ∈ Ξ, the function F(·; ξ) is L₀-Lipschitz with respect to the ℓ₁-norm.

The following lemma provides bounds for uniform smoothing of functions Lipschitz with respect to the ℓ₂-norm while sampling from an ℓ_∞-ball.

ℓ2-norm while sampling from an ℓ -ball. sup g g ∂F (x; ξ),x L0 for P -a.e. ξ. ∞ {k k2 | ∈ ∈X}≤ Lemma 7. Let µ be the uniform density on B (0,u) E 2 2 ∞ Then the following properties hold: and assume that [ ∂F (x; ξ) 2] L0 for x + B (0,u). Then k k ≤ ∈ X ∞ (i) f(x) f (x) f(x)+ L u√d ≤ µ ≤ 0 (i) The function f satisfies the upper bound f(x) (ii) f is L -Lipschitz with respect to the ℓ norm f (x) f(x)+ L u√d ≤ µ 0 2 µ ≤ 0 (iii) fµ is continuously differentiable; moreover, its (ii) The function fµ is L0-Lipschitz over . L0 X gradient is u -Lipschitz continuous with respect (iii) The function fµ is continuously differentiable; to the ℓ2-norm. moreover, its gradient is 2√dL0 Lipschitz contin- u (iv) Let Z µ. Then E[ F (x + Z; ξ)] = fµ(x), uous. ∼ ∇ ∇ and E[ f (x) F (x + Z; ξ) 2] L2. k∇ µ − ∇ k2 ≤ 0 (iv) For random variables Z µ and ξ P , we have ∼ ∼ In addition, there exists a function f for which each of E[ F (x + Z; ξ)] = f (x) ∇ ∇ µ the bounds (i)–(iv) cannot be improved by more than and a constant factor. E[ f (x) F (x + Z; ξ) 2] L2. k∇ µ − ∇ k2 ≤ 0