Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization

Alexander Rakhlin [email protected] University of Pennsylvania
Ohad Shamir [email protected] Microsoft Research New England
Karthik Sridharan [email protected] University of Pennsylvania

Appearing in Proceedings of the 29th International Conference on Machine Learning, Edinburgh, Scotland, UK, 2012. Copyright 2012 by the author(s)/owner(s).

Abstract

Stochastic gradient descent (SGD) is a simple and popular method to solve stochastic optimization problems which arise in machine learning. For strongly convex problems, its convergence rate was known to be O(log(T)/T), by running SGD for T iterations and returning the average point. However, recent results showed that using a different algorithm, one can get an optimal O(1/T) rate. This might lead one to believe that standard SGD is suboptimal, and maybe should even be replaced as a method of choice. In this paper, we investigate the optimality of SGD in a stochastic setting. We show that for smooth problems, the algorithm attains the optimal O(1/T) rate. However, for non-smooth problems, the convergence rate with averaging might really be Ω(log(T)/T), and this is not just an artifact of the analysis. On the flip side, we show that a simple modification of the averaging step suffices to recover the O(1/T) rate, and no other change of the algorithm is necessary. We also present experimental results which support our findings, and point out open problems.

1. Introduction

Stochastic gradient descent (SGD) is one of the simplest and most popular first-order methods to solve convex learning problems. Given a convex loss function and a training set of T examples, SGD can be used to obtain a sequence of T predictors, whose average has a generalization error which converges (with T) to the optimal one in the class of predictors we consider. The common framework to analyze such first-order algorithms is via stochastic optimization, where our goal is to optimize an unknown convex function F, given only unbiased estimates of F's subgradients (see Sec. 2 for a more precise definition).

An important special case is when F is strongly convex (intuitively, can be lower bounded by a quadratic function). Such functions arise, for instance, in Support Vector Machines and other regularized learning algorithms. For such problems, there is a well-known O(log(T)/T) convergence guarantee for SGD with averaging. This rate is obtained using the analysis of the algorithm in the harder setting of online learning (Hazan et al., 2007), combined with an online-to-batch conversion (see (Hazan & Kale, 2011) for more details).

Surprisingly, a recent paper by Hazan and Kale (Hazan & Kale, 2011) showed that in fact, O(log(T)/T) is not the best that one can achieve for strongly convex stochastic problems. In particular, an optimal O(1/T) rate can be obtained using a different algorithm, which is somewhat similar to SGD but is more complex (although with comparable computational complexity)¹. A very similar algorithm was also presented recently by Juditsky and Nesterov (Juditsky & Nesterov, 2010).

¹Roughly speaking, the algorithm divides the T iterations into exponentially increasing epochs, and runs stochastic gradient descent with averaging on each one. The resulting point of each epoch is used as the starting point of the next epoch. The algorithm returns the resulting point of the last epoch.
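For concreteness, the epoch-based scheme described in footnote 1 might be rendered as follows. This is our own illustrative sketch, not the authors' code: the doubling epoch lengths and the step-size hook eta_schedule(k, t) are placeholder assumptions, and the precise epoch lengths and step sizes are specified in (Hazan & Kale, 2011).

```python
# Hypothetical sketch of the epoch-based scheme of (Hazan & Kale, 2011);
# the epoch lengths and step sizes here are illustrative placeholders.
import numpy as np

def epoch_gd(grad_oracle, project, w1, T, eta_schedule):
    w_start = np.array(w1, dtype=float)
    used, k, epoch_len = 0, 0, 1
    while used < T:
        k += 1
        n = min(epoch_len, T - used)          # length of epoch k
        w, avg = w_start.copy(), np.zeros_like(w_start)
        for t in range(1, n + 1):
            avg += (w - avg) / t              # running average within the epoch
            w = project(w - eta_schedule(k, t) * grad_oracle(w))
        w_start = avg                         # seed the next epoch with its average
        used += n
        epoch_len *= 2                        # exponentially increasing epochs
    return w_start                            # average point of the last epoch
```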

These results left an important gap: Namely, whether the true convergence rate of SGD, possibly with some sort of averaging, might also be O(1/T), and the known O(log(T)/T) result is just an artifact of the analysis. Indeed, the whole motivation of (Hazan & Kale, 2011) was that the standard online analysis is too loose to analyze the stochastic setting properly. Perhaps a similar looseness applies to the analysis of SGD as well? This question has immediate practical relevance: if the new algorithms enjoy a better rate than SGD, it might indicate they will work better in practice, and that practitioners should abandon SGD in favor of them.

In this paper, we study the convergence rate of SGD for stochastic strongly convex problems, with the following contributions:

• First, we extend known results to show that if F is not only strongly convex, but also smooth (with respect to the optimum), then SGD with and without averaging achieves the optimal O(1/T) convergence rate.

• We then show that for non-smooth F, there are cases where the convergence rate of SGD with averaging is Ω(log(T)/T). In other words, the O(log(T)/T) bound for general strongly convex problems is real, and not just an artifact of the currently-known analysis.

• However, we show that one can recover the optimal O(1/T) convergence rate (in expectation and in high probability) by a simple modification of the averaging step: Instead of averaging all T points, we only average the last αT points, where α ∈ (0, 1) is arbitrary. Thus, to obtain an optimal rate, one does not need to use an algorithm significantly different than SGD, such as those discussed earlier.

• We perform an empirical study on both artificial and real-world data, which supports our findings.

Moreover, our rate upper bounds are shown to hold in expectation, as well as in high probability (up to a log(log(T)) factor). While the focus here is on getting the optimal rate in terms of T, we note that our upper bounds are also optimal in terms of other standard problem parameters, such as the strong convexity parameter and the variance of the stochastic gradients.

Following the paradigm of (Hazan & Kale, 2011), we analyze the algorithm directly in the stochastic setting, and avoid an online analysis with an online-to-batch conversion. This also allows us to prove results which are more general. In particular, the standard online analysis of SGD requires the step size of the algorithm at round t to equal 1/λt, where λ is the strong convexity parameter of F. In contrast, our analysis copes with any step size c/λt, as long as c is not too small.

In terms of related work, we note that the performance of SGD in a stochastic setting has been extensively researched in stochastic approximation theory (see for instance (Kushner & Yin, 2003)). However, these results are usually obtained under smoothness assumptions, and are often asymptotic, so we do not get an explicit bound in terms of T which applies to our setting. We also note that a finite-sample analysis of SGD in the stochastic setting was recently presented in (Bach & Moulines, 2011). However, the focus there was different than ours, and it also obtained bounds which hold only in expectation rather than in high probability. More importantly, the analysis was carried out under stronger smoothness assumptions than our analysis, and to the best of our understanding, does not apply to general, possibly non-smooth, strongly convex stochastic optimization problems. For example, smoothness assumptions may not cover the application of SGD to support vector machines (as in (Shalev-Shwartz et al., 2011)), since it uses a non-smooth loss function, and thus the underlying function F we are trying to stochastically optimize may not be smooth.

2. Preliminaries

We use bold-face letters to denote vectors. Given some vector w, we use w_i to denote its i-th coordinate. Similarly, given some indexed vector w_t, we let w_{t,i} denote its i-th coordinate. We let 1_A denote the indicator function for some event A.

We consider the standard setting of convex stochastic optimization, using first-order methods. Our goal is to minimize a convex function F over some convex domain W (which is assumed to be a subset of some Hilbert space). However, we do not know F, and the only information available is through a stochastic gradient oracle, which given some w ∈ W, produces a vector ĝ, whose expectation E[ĝ] = g is a subgradient of F at w. Using a bounded number T of calls to this oracle, we wish to find a point w_T such that F(w_T) is as small as possible. In particular, we will assume that F attains a minimum at some w∗ ∈ W, and our analysis provides bounds on F(w_T) − F(w∗) either in expectation or in high probability (the high-probability results are stronger, but require more effort and have slightly worse dependence on some problem parameters). The application of this framework to learning is straightforward (see for instance (Shalev-Shwartz et al., 2009)): given a hypothesis class W and a set of T i.i.d. examples, we wish to find a predictor w whose expected loss F(w) is close to optimal over W. Since the examples are chosen i.i.d., the subgradient of the loss function with respect to any individual example can be shown to be an unbiased estimate of a subgradient of F.

We will focus on an important special case of the problem, characterized by F being a strongly convex function. Formally, we say that a function F is λ-strongly convex, if for all w, w′ ∈ W and any subgradient g of F at w,

    F(w′) ≥ F(w) + ⟨g, w′ − w⟩ + (λ/2)‖w′ − w‖².    (1)
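As a concrete instance of Eq. (1) (a standard fact, written here in our notation rather than taken verbatim from the paper): any objective of the form F(w) = (λ/2)‖w‖² + R(w) with R convex, such as the regularized losses arising in Support Vector Machines, is λ-strongly convex. Indeed, for a subgradient g = λw + r with r a subgradient of R at w,

    F(w′) − F(w) − ⟨g, w′ − w⟩ ≥ (λ/2)‖w′‖² − (λ/2)‖w‖² − ⟨λw, w′ − w⟩ = (λ/2)‖w′ − w‖²,

where the inequality uses R(w′) − R(w) ≥ ⟨r, w′ − w⟩, so Eq. (1) holds.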
Another possible property of F we will consider is smoothness, at least with respect to the optimum w∗. Formally, a function F is µ-smooth with respect to w∗ if for all w ∈ W,

    F(w) − F(w∗) ≤ (µ/2)‖w − w∗‖².    (2)

Such functions arise, for instance, in logistic and least-squares regression, and in general for learning linear predictors where the loss function has a Lipschitz-continuous gradient.
The algorithm we focus on is stochastic gradient descent (SGD). The SGD algorithm is parameterized by step sizes η₁, …, η_T, and is defined as follows:

1. Initialize w₁ ∈ W arbitrarily (or randomly).

2. For t = 1, …, T:
   • Query the stochastic gradient oracle at w_t to get a random ĝ_t such that E[ĝ_t] = g_t is a subgradient of F at w_t.
   • Let w_{t+1} = Π_W(w_t − η_t ĝ_t), where Π_W is the projection operator on W.

This algorithm returns a sequence of points w₁, …, w_T. To obtain a single point, one can use several strategies. Perhaps the simplest one is to return the last point, w_{T+1}. Another procedure, for which the standard online analysis of SGD applies (Hazan et al., 2007), is to return the average point

    w̄_T = (w₁ + … + w_T)/T.

For stochastic optimization of λ-strongly convex functions, the standard analysis (through online learning) focuses on the step size η_t being exactly 1/λt (Hazan et al., 2007). Our analysis will consider more general step sizes c/λt, where c is a constant. We note that a step size of Θ(1/t) is necessary for the algorithm to obtain an optimal convergence rate (see Appendix A in the full version (Rakhlin et al., 2011)).

In general, we will assume that regardless of how w₁ is initialized, it holds that E[‖ĝ_t‖²] ≤ G² for some fixed constant G. Note that this is a somewhat weaker assumption than (Hazan & Kale, 2011), which required that ‖ĝ_t‖² ≤ G² with probability 1, since we focus only on bounds which hold in expectation. These types of assumptions are common in the literature, and are generally implied by taking W to be a bounded domain, or alternatively, assuming that w₁ is initialized not too far from w∗ and F satisfies certain technical conditions (see for instance the proof of Theorem 1 in (Shalev-Shwartz et al., 2011)).

Full proofs of our results are provided in Appendix B of the full version of this paper (Rakhlin et al., 2011).
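In code, the procedure above might look as follows. This is a minimal sketch under our own naming (grad_oracle and project are placeholders that the user must supply), not an implementation taken from the paper:

```python
import numpy as np

def sgd(grad_oracle, project, w1, T, lam, c=1.0):
    """Projected SGD with step sizes eta_t = c/(lam*t); returns w_1, ..., w_{T+1}.

    grad_oracle(w, t): unbiased estimate of a subgradient of F at w.
    project(w):        Euclidean projection onto the convex domain W.
    lam:               strong convexity parameter lambda (the analysis needs c > 1/2).
    """
    iterates = [np.array(w1, dtype=float)]
    w = iterates[0]
    for t in range(1, T + 1):
        eta = c / (lam * t)            # eta_t = c / (lambda * t)
        g = grad_oracle(w, t)          # stochastic subgradient ghat_t at w_t
        w = project(w - eta * g)       # w_{t+1} = Pi_W(w_t - eta_t * ghat_t)
        iterates.append(w)
    return iterates

# The two simplest outputs discussed above:
def last_point(iterates):
    return iterates[-1]                          # w_{T+1}

def full_average(iterates):
    return np.mean(iterates[:-1], axis=0)        # (w_1 + ... + w_T) / T
```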

3. Smooth Functions

We begin by considering the case where the expected function F(·) is both strongly convex and smooth with respect to w∗. Our starting point is to show an O(1/T) convergence rate for the last point obtained by SGD. This result is well known in the literature (see for instance (Nemirovski et al., 2009)) and we include a proof for completeness. Later on, we will show how to extend it to a high-probability bound.

Theorem 1. Suppose F is λ-strongly convex and µ-smooth with respect to w∗ over a convex set W, and that E[‖ĝ_t‖²] ≤ G². Then if we pick η_t = c/λt for some constant c > 1/2, it holds for any T that

    E[F(w_T) − F(w∗)] ≤ (1/2) max{4, c/(2 − 1/c)} · µG²/(λ²T).

The theorem is an immediate corollary of the following key lemma, and the definition of µ-smoothness with respect to w∗.

Lemma 1. Suppose F is λ-strongly convex over a convex set W, and that E[‖ĝ_t‖²] ≤ G². Then if we pick η_t = c/λt for some constant c > 1/2, it holds for any T that

    E[‖w_T − w∗‖²] ≤ max{4, c/(2 − 1/c)} · G²/(λ²T).
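To spell the corollary out (our rewriting of the step, combining Eq. (2) with Lemma 1):

    E[F(w_T) − F(w∗)] ≤ (µ/2) E[‖w_T − w∗‖²] ≤ (µ/2) max{4, c/(2 − 1/c)} · G²/(λ²T),

which is exactly the bound of Theorem 1.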

We now turn to discuss the behavior of the average point w̄_T = (w₁ + … + w_T)/T, and show that for smooth F, it also enjoys an optimal O(1/T) convergence rate (with even better dependence on c).

Theorem 2. Suppose F is λ-strongly convex and µ-smooth with respect to w∗ over a convex set W, and that E[‖ĝ_t‖²] ≤ G². Then if we pick η_t = c/λt for some constant c > 1/2, E[F(w̄_T) − F(w∗)] is at most

    2 max{ µG²/λ², 4µG/λ, (µG/λ)√(4c/(2 − 1/c)) } · (1/T).

A rough proof intuition is the following: Lemma 1 implies that the Euclidean distance of w_t from w∗ is on the order of 1/√t, so the squared distance of w̄_T from w∗ is on the order of ((1/T) Σ_{t=1}^{T} 1/√t)² ≈ 1/T, and the rest follows from smoothness.

4. Non-Smooth Functions

We now turn to discuss the more general case where the function F may not be smooth (i.e. there is no constant µ which satisfies Eq. (2) uniformly for all w ∈ W). In the context of learning, this may happen when we try to learn a predictor with respect to a non-smooth loss function, such as the hinge loss.

As discussed earlier, SGD with averaging is known to have a rate of at most O(log(T)/T). In the previous section, we saw that for smooth F, the rate is actually O(1/T). Moreover, (Hazan & Kale, 2011) showed that by using a different algorithm than SGD, one can obtain a rate of O(1/T) even in the non-smooth case. This might lead us to believe that an O(1/T) rate for SGD is possible in the non-smooth case, and that the O(log(T)/T) analysis is simply not tight.

However, this intuition turns out to be wrong. Below, we show that there are strongly convex stochastic optimization problems in Euclidean space, in which the convergence rate of SGD with averaging is lower bounded by Ω(log(T)/T). Thus, the logarithm in the bound is not merely a shortcoming in the standard online analysis of SGD, but is really a property of the algorithm.

We begin with the following relatively simple example, which shows the essence of the idea. Let F be the 1-strongly convex function

    F(w) = (1/2)‖w‖² + w₁,

over the domain W = [0, 1]^d, which has a global minimum at 0. Suppose the stochastic gradient oracle, given a point w_t, returns the gradient estimate ĝ_t = w_t + (Z_t, 0, …, 0), where Z_t is uniformly distributed over [−1, 3]. It is easily verified that E[ĝ_t] is a subgradient of F at w_t, and that E[‖ĝ_t‖²] ≤ d + 5, which is a bounded quantity for fixed d.

The following theorem implies that in this case, the convergence rate of SGD with averaging has an Ω(log(T)/T) lower bound. The intuition for this is that the global optimum lies at a corner of W, so SGD "approaches" it only from one direction. As a result, averaging the points returned by SGD actually hurts us.

Theorem 3. Consider the strongly convex stochastic optimization problem presented above. If SGD is initialized at any point in W, and run with η_t = c/t, then for any T ≥ T₀ + 1, where T₀ = max{2, c/2}, we have

    E[F(w̄_T) − F(w∗)] ≥ (c/(16T)) Σ_{t=T₀}^{T−1} 1/t.

When c is considered a constant, this lower bound is Ω(log(T)/T).

While the lower bound scales with c, we remind the reader that one must pick η_t = c/t with constant c for an optimal convergence rate in general (see discussion in Sec. 2).

This example is relatively straightforward but not fully satisfying, since it crucially relies on the fact that w∗ is on the border of W. In strongly convex problems, w∗ usually lies in the interior of W, so perhaps the Ω(log(T)/T) lower bound does not hold in such cases. Our main result, presented below, shows that this is not the case, and that even if w∗ is well inside the interior of W, an Ω(log(T)/T) rate for SGD with averaging can be unavoidable. The intuition is that we construct a non-smooth F, which forces w_t to approach the optimum from just one direction, creating the same effect as in the previous example.

In particular, let F be the 1-strongly convex function

    F(w) = (1/2)‖w‖² + w₁ if w₁ ≥ 0,  and  F(w) = (1/2)‖w‖² − 7w₁ if w₁ < 0,

over the domain W = [−1, 1]^d, which has a global minimum at 0. Suppose the stochastic gradient oracle, given a point w_t, returns the gradient estimate

    ĝ_t = w_t + (Z_t, 0, …, 0) if w_{t,1} ≥ 0,  and  ĝ_t = w_t + (−7, 0, …, 0) if w_{t,1} < 0,

where Z_t is a random variable uniformly distributed over [−1, 3]. It is easily verified that E[ĝ_t] is a subgradient of F at w_t, and that E[‖ĝ_t‖²] ≤ d + 63, which is a bounded quantity for fixed d.

Theorem 4. Consider the strongly convex stochastic optimization problem presented above. If SGD is initialized at any point w₁ with w_{1,1} ≥ 0, and run with η_t = c/t, then for any T ≥ T₀ + 2, where T₀ = max{2, 6c + 1}, we have

    E[F(w̄_T) − F(w∗)] ≥ (3c/(16T)) ( Σ_{t=T₀+2}^{T} 1/t − T₀/T ).

When c is considered a constant, this lower bound is Ω(log(T)/T).

We note that the requirement of w_{1,1} ≥ 0 is just for convenience, and the analysis also carries through, with some second-order factors, if we let w_{1,1} < 0.
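A short simulation of this construction makes the effect visible (our own sketch, with d = 5 and c = 1, so η_t = 1/t and λ = 1; variable names are ours): the suboptimality of the full average w̄_T decays at roughly the log(T)/T scale predicted by Thm. 4, because the iterates approach w∗ = 0 from one side only.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, c = 5, 100_000, 1.0

def F(w):
    # F(w) = 0.5*||w||^2 + w_1 if w_1 >= 0, else 0.5*||w||^2 - 7*w_1; min is F(0) = 0
    return 0.5 * (w @ w) + (w[0] if w[0] >= 0 else -7.0 * w[0])

w = rng.uniform(0.0, 1.0, size=d)               # initialize with w_{1,1} >= 0
avg = np.zeros(d)
for t in range(1, T + 1):
    avg += (w - avg) / t                        # running average of w_1, ..., w_t
    g = w.copy()                                # gradient of the 0.5*||w||^2 part
    g[0] += rng.uniform(-1.0, 3.0) if w[0] >= 0 else -7.0
    w = np.clip(w - (c / t) * g, -1.0, 1.0)     # project onto W = [-1, 1]^d

print(F(avg))           # suboptimality of the averaged point w_bar_T
print(np.log(T) / T)    # the Theta(log(T)/T) scale, for comparison
```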

5. Recovering an O(1/T) Rate for SGD with α-Suffix Averaging

In the previous section, we showed that SGD with averaging may have a rate of Ω(log(T)/T) for non-smooth F. To get the optimal O(1/T) rate for any F, we might turn to the algorithms of (Hazan & Kale, 2011) and (Juditsky & Nesterov, 2010). However, these algorithms constitute a significant departure from standard SGD. In this section, we show that it is actually possible to get an O(1/T) rate using a much simpler modification of the algorithm: given the sequence of points w₁, …, w_T provided by SGD, instead of returning the average w̄_T = (w₁ + … + w_T)/T, we average and return just a suffix, namely

    w̄_T^α = (w_{(1−α)T+1} + … + w_T) / (αT),

for some constant α ∈ (0, 1) (assuming αT and (1 − α)T are integers). We call this procedure α-suffix averaging.

Theorem 5. Consider SGD with α-suffix averaging as described above, and with step sizes η_t = c/λt where c > 1/2 is a constant. Suppose F is λ-strongly convex, and that E[‖ĝ_t‖²] ≤ G² for all t. Then for any T, it holds that

    E[F(w̄_T^α) − F(w∗)] ≤ ((c′ + (c/2 + c′) log(1/(1−α))) / α) · G²/(λT),

where c′ = max{2/c, 1/(4 − 2/c)}.

Note that for any constant α ∈ (0, 1), the bound above is O(G²/λT). This applies to any relevant step size c/λt, and matches the optimal guarantees in (Hazan & Kale, 2011) up to constant factors. However, this is shown for standard SGD, as opposed to the more specialized algorithm of (Hazan & Kale, 2011). Finally, we note that it might be tempting to use Thm. 5 as a guide to choose the averaging window, by optimizing the bound for α (for instance, for c = 1, the optimum is achieved around α ≈ 0.65). However, we note that the optimal value of α is dependent on the constants in the bound, which may not be the tightest or most "correct" ones.

Proof Sketch. The proof combines the analysis of online gradient descent (Hazan et al., 2007) and Lemma 1. In particular, starting as in the proof of Lemma 1, and extracting the inner products, we get

    Σ_{t=(1−α)T+1}^{T} E[⟨g_t, w_t − w∗⟩] ≤ Σ_{t=(1−α)T+1}^{T} η_t G²/2 + Σ_{t=(1−α)T+1}^{T} ( E[‖w_t − w∗‖²] − E[‖w_{t+1} − w∗‖²] ) / (2η_t).    (3)

Rearranging the r.h.s., and using the convexity of F to relate the l.h.s. to E[F(w̄_T^α) − F(w∗)], we get a convergence upper bound of

    (1/(2αT)) [ E[‖w_{(1−α)T+1} − w∗‖²]/η_{(1−α)T+1} + G² Σ_{t=(1−α)T+1}^{T} η_t + Σ_{t=(1−α)T+1}^{T} E[‖w_t − w∗‖²] (1/η_t − 1/η_{t−1}) ].

Lemma 1 tells us that with any strongly convex F, even non-smooth, we have E[‖w_t − w∗‖²] ≤ O(1/t). Plugging this in and performing a few more manipulations, the result follows.

One potential disadvantage of suffix averaging is that if we cannot store all the iterates w_t in memory, then we need to know from which iterate (the start of the last αT of them) to begin computing the suffix average (in contrast, standard averaging can be computed "on-the-fly" without knowing the stopping time T in advance). However, even if T is not known, this can be easily addressed in several ways. For example, since our results are robust to the value of α, it is really enough to guess when we passed some "constant" portion of all iterates. Alternatively, one can divide the rounds into exponentially increasing epochs, and maintain the average just of the current epoch. Such an average would always correspond to a constant-portion suffix of all iterates.
6. High-Probability Bounds

All our previous bounds were on the expected suboptimality E[F(w) − F(w∗)] of an appropriate predictor w. We now outline how these results can be strengthened to bounds on F(w_t) − F(w∗) which hold with arbitrarily high probability 1 − δ, with the bound depending logarithmically on δ. They are slightly worse than our in-expectation bounds by having a worse dependence on the step size parameter c and an additional log(log(T)) factor (interestingly, a similar factor also appears in the analysis of (Hazan & Kale, 2011), and we do not know if it is necessary). The key result is the following strengthening of Lemma 1, under slightly stronger technical conditions.

Lemma 2. Let δ ∈ (0, 1/e) and T ≥ 4. Suppose F is λ-strongly convex over a convex set W, and that ‖ĝ_t‖² ≤ G² with probability 1. Then if we pick η_t = c/λt for some constant c > 1/2, such that 2c is a whole number, it holds with probability at least 1 − δ that for any t ∈ {4c² + 4c, …, T − 1, T},

    ‖w_t − w∗‖² ≤ 12c²G²/(λ²t) + 8(121G² + 1)G · c log(log(t)/δ)/(λt).

We note that the assumptions on 2c and t are only for simplifying the result. To obtain high-probability versions of Thm. 1, Thm. 2, and Thm. 5, we simply need to plug in this lemma in lieu of Lemma 1 in their proofs. This leads overall to rates of the form O(log(log(T)/δ)/T), which hold with probability 1 − δ.

Figure 1. Results for the smooth strongly convex stochastic optimization problem. The experiment was repeated 10 times, and we report the mean and standard deviation for each choice of T. The X-axis is the log-number of rounds log(T), and the Y-axis is (F(w_T) − F(w∗)) · T. The scaling by T means that a roughly constant graph corresponds to a Θ(1/T) rate, whereas a linearly increasing graph corresponds to a Θ(log(T)/T) rate.

7. Experiments

We now turn to empirically study how the algorithms behave, and compare this to our theoretical findings. We studied the following four algorithms:

1. Sgd-A: Performing SGD and then returning the average point over all T rounds.

2. Sgd-α: Performing SGD with α-suffix averaging. We chose α = 1/2; namely, we return the average point over the last T/2 rounds.

3. Sgd-L: Performing SGD and returning the point obtained in the last round.

4. Epoch-Gd: The optimal algorithm of (Hazan & Kale, 2011) for strongly convex stochastic optimization.

First, as a simple sanity check, we measured the performance of these algorithms on a simple, strongly convex stochastic optimization problem, which is also smooth. We define W = [−1, 1]⁵, and F(w) = ‖w‖². The stochastic gradient oracle, given a point w, returns the stochastic gradient w + z, where z is uniformly distributed in [−1, 1]⁵. Clearly, this is an unbiased estimate of the gradient of F at w. The initial point w₁ of all 4 algorithms was chosen uniformly at random from W. The results are presented in Fig. 1, and it is clear that all 4 algorithms indeed achieve a Θ(1/T) rate, matching our theoretical analysis (Thm. 1, Thm. 2 and Thm. 5). The results also seem to indicate that Sgd-A has a somewhat worse performance in terms of leading constants.

Second, as another simple experiment, we measured the performance of the algorithms on the non-smooth, strongly convex problem described in the proof of Thm. 4. In particular, we simulated this problem with d = 5, and picked w₁ uniformly at random from W. The results are presented in Fig. 2. As our theory indicates, Sgd-A seems to have a Θ(log(T)/T) convergence rate, whereas the other 3 algorithms all seem to have the optimal Θ(1/T) convergence rate. Among these algorithms, the SGD variants Sgd-L and Sgd-α seem to perform somewhat better than Epoch-Gd. Also, while the average performance of Sgd-L and Sgd-α is similar, Sgd-α has less variance. This is reasonable, considering the fact that Sgd-α returns an average of many points, whereas Sgd-L returns only the very last point.

Finally, we performed a set of experiments on real-world data. We used the same 3 binary classification datasets (ccat, cov1 and astro-ph) used by (Shalev-Shwartz et al., 2011) and (Joachims, 2006), to test the performance of optimization algorithms for Support Vector Machines using linear kernels. Each of these datasets is composed of a training set and a test set. Given a training set of instance-label pairs, {x_i, y_i}_{i=1}^m, we defined F to be the standard (non-smooth) objective function of Support Vector Machines, namely

    F(w) = (λ/2)‖w‖² + (1/m) Σ_{i=1}^m max{0, 1 − y_i⟨x_i, w⟩}.    (4)

Following (Shalev-Shwartz et al., 2011) and (Joachims, 2006), we took λ = 10⁻⁴ for ccat, λ = 10⁻⁶ for cov1, and λ = 5 × 10⁻⁵ for astro-ph. The stochastic gradient given w_t was computed by taking a single randomly drawn training example (x_i, y_i), and computing the gradient with respect to that example, namely

    ĝ_t = λw_t − 1_{y_i⟨x_i, w_t⟩ ≤ 1} · y_i x_i.
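A sketch of this sampling scheme in code (our own rendering; X stands for the m×d matrix of training instances and y for the ±1 labels, names being our assumptions):

```python
import numpy as np

def svm_objective(w, X, y, lam):
    # Eq. (4): (lam/2)*||w||^2 plus the average hinge loss over the given set
    margins = y * (X @ w)
    return 0.5 * lam * (w @ w) + np.mean(np.maximum(0.0, 1.0 - margins))

def svm_stochastic_grad(w, X, y, lam, rng):
    # ghat_t = lam*w_t - 1[y_i <x_i, w_t> <= 1] * y_i * x_i, for a random example i
    i = rng.integers(len(y))
    g = lam * w
    if y[i] * (X[i] @ w) <= 1.0:
        g = g - y[i] * X[i]
    return g
```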

Figure 2. Results for the non-smooth strongly convex stochastic optimization problem. The experiment was repeated 10 times, and we report the mean and standard deviation for each choice of T. The X-axis is the log-number of rounds log(T), and the Y-axis is (F(w_T) − F(w∗)) · T. The scaling by T means that a roughly constant graph corresponds to a Θ(1/T) rate, whereas a linearly increasing graph corresponds to a Θ(log(T)/T) rate.

Each dataset comes with a separate test set, and we also report the objective function value with respect to that set (as in Eq. (4), this time with {x_i, y_i} representing the test set examples). All algorithms were initialized at w₁ = 0, with W = ℝ^d (i.e. no projections were performed; see the discussion in Sec. 2).

The results of the experiments are presented in Fig. 3, Fig. 4 and Fig. 5. In all experiments, Sgd-A performed the worst. The other 3 algorithms performed rather similarly, with Sgd-α being slightly better on the cov1 dataset, and Sgd-L being slightly better on the other 2 datasets.

Figure 3. Results for the astro-ph dataset. The left row refers to the average loss on the training data, and the right row refers to the average loss on the test data. Each experiment was repeated 10 times, and we report the mean and standard deviation for each choice of T. The X-axis is the log-number of rounds log(T), and the Y-axis is the log of the objective function, log(F(w_T)).

Figure 4. Results for the ccat dataset. See Fig. 3 caption for details.

In summary, our experiments indicate the following:

• Sgd-A, which averages over all T predictors, is worse than the other approaches. This accords with our theory, as well as the results reported in (Shalev-Shwartz et al., 2011).

• The Epoch-Gd algorithm does have better performance than Sgd-A, but a similar or better performance was obtained using the simpler approaches of α-suffix averaging (Sgd-α) or even just returning the last predictor (Sgd-L). The good performance of Sgd-α is supported by our theoretical results, as is the performance of Sgd-L in the strongly convex and smooth case.

• Sgd-L also performed rather well (with what seems like a Θ(1/T) rate) on the non-smooth problem reported in Fig. 2, although with a larger variance than Sgd-α. Our current theory does not cover the convergence of the last predictor in non-smooth problems; see the discussion below.

Figure 5. Results for the cov1 dataset. See Fig. 3 caption for details.

8. Discussion

In this paper, we analyzed the behavior of SGD for strongly convex stochastic optimization problems. We demonstrated that this simple and well-known algorithm performs optimally whenever the underlying function is smooth, but the standard averaging step can make it suboptimal for non-smooth problems. However, a simple modification of the averaging step suffices to recover the optimal rate, and a more sophisticated algorithm is not necessary. Our experiments seem to support this conclusion.

There are several open issues remaining. In particular, the O(1/T) rate in the non-smooth case still requires some sort of averaging. However, in our experiments and other studies (e.g. (Shalev-Shwartz et al., 2011)), returning the last iterate w_T also seems to perform quite well. Our current theory does not cover this; at best, one can use Lemma 1 and Jensen's inequality to argue that the last iterate has an O(1/√T) rate, but the behavior in practice is clearly much better. Does SGD, without averaging, obtain an O(1/T) rate for general strongly convex problems? Also, a fuller empirical study is warranted of whether and which averaging scheme is best in practice.

Acknowledgements: We thank Elad Hazan and Satyen Kale for helpful comments on an earlier version of this paper.

References

Bach, F. and Moulines, E. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In NIPS, 2011.

Hazan, E. and Kale, S. Beyond the regret minimization barrier: An optimal algorithm for stochastic strongly-convex optimization. In COLT, 2011.

Hazan, E., Agarwal, A., and Kale, S. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169–192, 2007.

Joachims, T. Training linear SVMs in linear time. In KDD, 2006.

Juditsky, A. and Nesterov, Y. Primal-dual subgradient methods for minimizing uniformly convex functions. Technical Report (August 2010), available at http://hal.archives-ouvertes.fr/docs/00/50/89/33/PDF/Strong-hal.pdf, 2010.

Kushner, H. and Yin, G. Stochastic Approximation and Recursive Algorithms and Applications. Springer, 2nd edition, 2003.

Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. Robust stochastic approximation approach to stochastic programming. SIAM J. Optim., 19(4):1574–1609, 2009.

Rakhlin, A., Shamir, O., and Sridharan, K. Making gradient descent optimal for strongly convex stochastic optimization. ArXiv Technical Report, arXiv:1109.5647, 2011.

Shalev-Shwartz, S., Shamir, O., Srebro, N., and Sridharan, K. Stochastic convex optimization. In COLT, 2009.

Shalev-Shwartz, S., Singer, Y., Srebro, N., and Cotter, A. Pegasos: Primal estimated sub-gradient solver for SVM. Mathematical Programming, 127(1):3–30, 2011.