Variance-Reduced and Projection-Free Stochastic Optimization

Elad Hazan [email protected] Princeton University, Princeton, NJ 08540, USA

Haipeng Luo [email protected] Princeton University, Princeton, NJ 08540, USA

Abstract

The Frank-Wolfe optimization algorithm has recently regained popularity for machine learning applications due to its projection-free property and its ability to handle structured constraints. However, in the stochastic learning setting, it is still relatively understudied compared to the gradient descent counterpart. In this work, leveraging a recent variance reduction technique, we propose two stochastic Frank-Wolfe variants which substantially improve previous results in terms of the number of stochastic gradient evaluations needed to achieve $1-\epsilon$ accuracy. For example, we improve from $O(\frac{1}{\epsilon})$ to $O(\ln\frac{1}{\epsilon})$ if the objective function is smooth and strongly convex, and from $O(\frac{1}{\epsilon^2})$ to $O(\frac{1}{\epsilon^{1.5}})$ if the objective function is smooth and Lipschitz. The theoretical improvement is also observed in experiments on real-world datasets for a multiclass classification application.

1. Introduction

We consider the following optimization problem
$$\min_{w \in \Omega} f(w) = \min_{w \in \Omega} \frac{1}{n}\sum_{i=1}^n f_i(w),$$
which is an extremely common objective in machine learning. We are interested in the case where 1) $n$, usually corresponding to the number of training examples, is very large and therefore stochastic optimization is much more efficient; and 2) the domain $\Omega$ admits fast linear optimization, while projecting onto it is much slower, necessitating projection-free optimization algorithms. Examples of such problems include multiclass classification, multitask learning, recommendation systems, matrix learning and many more (see for example (Hazan & Kale, 2012; Hazan et al., 2012; Jaggi, 2013; Dudik et al., 2012; Zhang et al., 2012; Harchaoui et al., 2015)).

The Frank-Wolfe algorithm (Frank & Wolfe, 1956) (also known as conditional gradient) and its variants are natural candidates for solving these problems, due to their projection-free property and their ability to handle structured constraints. However, despite gaining more popularity recently, their applicability and efficiency in the stochastic learning setting, where computing stochastic gradients is much faster than computing exact gradients, is still relatively understudied compared to variants of projected gradient descent methods.

In this work, we thus try to answer the following question: what running time can a projection-free algorithm achieve in terms of the number of stochastic gradient evaluations and the number of linear optimizations needed to achieve a certain accuracy? Utilizing Nesterov's acceleration technique (Nesterov, 1983) and the recent variance reduction idea (Johnson & Zhang, 2013; Mahdavi et al., 2013), we propose two new algorithms that are substantially faster than previous work. Specifically, to achieve $1-\epsilon$ accuracy, the number of linear optimizations is the same as in previous work, while the improvement in the number of stochastic gradient evaluations is summarized in Table 1:

|                             | previous work           | this work                    |
| Smooth                      | $O(1/\epsilon^2)$       | $O(1/\epsilon^{1.5})$        |
| Smooth and Strongly Convex  | $O(1/\epsilon)$         | $O(\ln(1/\epsilon))$         |

Table 1: Comparison of the number of stochastic gradient evaluations

The extra overhead of our algorithms is computing at most $O(\ln\frac{1}{\epsilon})$ exact gradients, which is computationally insignificant compared to the other operations. A more detailed comparison to previous work is included in Table 2, which will be further explained in Section 2.
While the idea of our algorithms is quite straightforward, we emphasize that our analysis is non-trivial, especially for the second algorithm, where the convergence of a sequence of auxiliary points in Nesterov's algorithm needs to be shown.

To support our theoretical results, we also conducted experiments on three large real-world datasets for a multiclass classification application. These experiments show significant improvement over both previous projection-free algorithms and algorithms such as projected stochastic gradient descent and its variance-reduced version.

The rest of the paper is organized as follows: Section 2 sets up the problem more formally and discusses related work. Our two new algorithms are presented and analyzed in Sections 3 and 4, followed by experiment details in Section 5.

2. Preliminary and Related Work

We assume each function $f_i$ is convex and $L$-smooth in $\mathbb{R}^d$, so that for any $w, v \in \mathbb{R}^d$,¹
$$\nabla f_i(v)^\top (w-v) \;\le\; f_i(w) - f_i(v) \;\le\; \nabla f_i(v)^\top (w-v) + \tfrac{L}{2}\|w-v\|^2.$$
We will use two more important properties of smoothness. The first one is
$$\|\nabla f_i(w) - \nabla f_i(v)\|^2 \;\le\; 2L\big(f_i(w) - f_i(v) - \nabla f_i(v)^\top (w-v)\big) \qquad (1)$$
(proven in Appendix A for completeness), and the second one is
$$f_i(\lambda w + (1-\lambda)v) \;\ge\; \lambda f_i(w) + (1-\lambda)f_i(v) - \tfrac{L}{2}\lambda(1-\lambda)\|w-v\|^2 \qquad (2)$$
for any $w, v \in \Omega$ and $\lambda \in [0, 1]$. Notice that $f = \frac{1}{n}\sum_{i=1}^n f_i$ is also $L$-smooth since smoothness is preserved under convex combinations.

For some cases, we also assume each $f_i$ is $G$-Lipschitz, that is, $\|\nabla f_i(w)\| \le G$ for any $w \in \Omega$, and that $f$ (although not necessarily each $f_i$) is $\alpha$-strongly convex, that is,
$$f(w) - f(v) \;\le\; \nabla f(w)^\top (w-v) - \tfrac{\alpha}{2}\|w-v\|^2$$
for any $w, v \in \Omega$. As usual, $\mu = \frac{L}{\alpha}$ is called the condition number of $f$.

We assume the domain $\Omega \subset \mathbb{R}^d$ is a compact convex set with diameter $D$. We are interested in the case where linear optimization on $\Omega$, formally $\operatorname{argmin}_{v \in \Omega} w^\top v$ for any $w \in \mathbb{R}^d$, is much faster than projection onto $\Omega$, formally $\operatorname{argmin}_{v \in \Omega} \|w - v\|^2$. Examples of such domains include the set of all bounded trace norm matrices, the convex hull of all rotation matrices, the flow polytope and many more (see for instance (Hazan & Kale, 2012)).

2.1. Example Application: Multiclass Classification

Consider a multiclass classification problem where a set of training examples $(e_i, y_i)_{i=1,\dots,n}$ is given beforehand. Here $e_i \in \mathbb{R}^m$ is a feature vector and $y_i \in \{1, \dots, h\}$ is the label. Our goal is to find an accurate linear predictor, a matrix $w = [w_1^\top; \dots; w_h^\top] \in \mathbb{R}^{h \times m}$ that predicts $\operatorname{argmax}_\ell w_\ell^\top e$ for any example $e$. Note that here the dimensionality $d$ is $hm$.

Previous work (Dudik et al., 2012; Zhang et al., 2012) found that finding $w$ by minimizing a regularized multivariate logistic loss gives a very accurate predictor in general. Specifically, the objective can be written in our notation with
$$f_i(w) = \log\Big(1 + \sum_{\ell \ne y_i} \exp(w_\ell^\top e_i - w_{y_i}^\top e_i)\Big)$$
and $\Omega = \{w \in \mathbb{R}^{h \times m} : \|w\|_* \le \tau\}$, where $\|\cdot\|_*$ denotes the matrix trace norm. In this case, projecting onto $\Omega$ is equivalent to performing an SVD, which takes $O(hm\min\{h, m\})$ time, while linear optimization on $\Omega$ amounts to finding the top singular vector, which can be done in time linear in the number of non-zeros of the corresponding $h$ by $m$ matrix, and is thus much faster. One can also verify that each $f_i$ is smooth. The number of examples $n$ can be prohibitively large for non-stochastic methods (for instance, tens of millions for the ImageNet dataset (Deng et al., 2009)), which makes stochastic optimization necessary.
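To make the contrast concrete, the following sketch (our own illustration, not part of the paper; the function name and the use of scipy.sparse.linalg.svds are assumptions) implements the linear optimization oracle over the trace norm ball via a single top-singular-vector computation:

```python
import numpy as np
from scipy.sparse.linalg import svds

def linear_opt_trace_ball(grad, tau):
    """Solve argmin_{||V||_* <= tau} <grad, V> for an h-by-m gradient matrix.

    The minimizer is the rank-one matrix -tau * u1 v1^T, where (u1, v1) is the
    top singular vector pair of grad, so no full SVD (and no projection) is needed.
    """
    u, s, vt = svds(grad.astype(np.float64), k=1)  # top singular triplet only
    return -tau * np.outer(u[:, 0], vt[0, :])

# Small usage example with a random "gradient" for h = 5 classes, m = 8 features.
rng = np.random.default_rng(0)
G = rng.standard_normal((5, 8))
V = linear_opt_trace_ball(G, tau=50.0)
print(np.sum(G * V))  # <G, V>, the minimum of the linear objective over the ball
```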
2.2. Detailed Efficiency Comparisons

We call $\nabla f_i(w)$ a stochastic gradient for $f$ at some $w$, where $i$ is picked from $\{1, \dots, n\}$ uniformly at random. Note that a stochastic gradient $\nabla f_i(w)$ is an unbiased estimator of the exact gradient $\nabla f(w)$.

The efficiency of a projection-free algorithm is measured by how many exact gradient evaluations, stochastic gradient evaluations and linear optimizations, respectively, are needed to achieve $1-\epsilon$ accuracy, that is, to output a point $w \in \Omega$ such that $\mathbb{E}[f(w) - f(w^*)] \le \epsilon$, where $w^* \in \operatorname{argmin}_{w \in \Omega} f(w)$ is any optimum.

In Table 2, we summarize the efficiency (and the extra assumptions needed besides convexity and smoothness²) of existing algorithms in the literature as well as of the two new algorithms we propose. Below we briefly explain these results from top to bottom.

¹We thank Sebastian Pokutta and Gábor Braun for pointing out that $f_i$ needs to be defined over $\mathbb{R}^d$, rather than only over $\Omega$, in order for property (1) to hold.
²In general, the condition "$G$-Lipschitz" in Table 2 means that each $f_i$ is $G$-Lipschitz, except for our STORC algorithm, which only requires $f$ to be $G$-Lipschitz.

| Algorithm | Extra Conditions | #Exact Gradients | #Stochastic Gradients | #Linear Optimizations |
| Frank-Wolfe | | $O(\frac{LD^2}{\epsilon})$ | 0 | $O(\frac{LD^2}{\epsilon})$ |
| (Garber & Hazan, 2013) | $\alpha$-strongly convex, $\Omega$ is polytope | $O(d\mu\rho\ln\frac{LD^2}{\epsilon})$ | 0 | $O(d\mu\rho\ln\frac{LD^2}{\epsilon})$ |
| SFW | $G$-Lipschitz | 0 | $O(\frac{G^2LD^4}{\epsilon^3})$ | $O(\frac{LD^2}{\epsilon})$ |
| Online-FW (Hazan & Kale, 2012) | $G$-Lipschitz | 0 | $O(\frac{d^2(LD^2+GD)^4}{\epsilon^4})$ | $O(\frac{d(LD^2+GD)^2}{\epsilon^2})$ |
| | $G$-Lipschitz ($L = \infty$ allowed) | 0 | $O(\frac{G^4D^4}{\epsilon^4})$ | $O(\frac{G^4D^4}{\epsilon^4})$ |
| SCGS (Lan & Zhou, 2014) | $G$-Lipschitz | 0 | $O(\frac{G^2D^2}{\epsilon^2})$ | $O(\frac{LD^2}{\epsilon})$ |
| | $G$-Lipschitz, $\alpha$-strongly convex | 0 | $O(\frac{G^2}{\alpha\epsilon})$ | $O(\frac{LD^2}{\epsilon})$ |
| SVRF (this work) | | $O(\ln\frac{LD^2}{\epsilon})$ | $O(\frac{L^2D^4}{\epsilon^2})$ | $O(\frac{LD^2}{\epsilon})$ |
| STORC (this work) | $G$-Lipschitz | $O(\ln\frac{LD^2}{\epsilon})$ | $O(\frac{LD^2}{\epsilon} + \frac{\sqrt{L}D^2G}{\epsilon^{1.5}})$ | $O(\frac{LD^2}{\epsilon})$ |
| | $\nabla f(w^*) = 0$ | $O(\ln\frac{LD^2}{\epsilon})$ | $O(\frac{LD^2}{\epsilon})$ | $O(\frac{LD^2}{\epsilon})$ |
| | $\alpha$-strongly convex | $O(\ln\frac{LD^2}{\epsilon})$ | $O(\mu^2\ln\frac{LD^2}{\epsilon})$ | $O(\frac{LD^2}{\epsilon})$ |

Table 2: Comparison of different Frank-Wolfe variants (see Section 2.2 for further explanations).

The standard Frank-Wolfe algorithm,
$$v_k = \operatorname{argmin}_{v \in \Omega} \nabla f(w_{k-1})^\top v, \qquad w_k = (1-\gamma_k)w_{k-1} + \gamma_k v_k, \qquad (3)$$
for some appropriately chosen $\gamma_k$, requires $O(\frac{1}{\epsilon})$ iterations without additional conditions (Frank & Wolfe, 1956; Jaggi, 2013). In a recent paper, Garber & Hazan (2013) give a variant that requires $O(d\mu\rho\ln\frac{1}{\epsilon})$ iterations when $f$ is strongly convex and smooth, and $\Omega$ is a polytope.³ Although the dependence on $\epsilon$ is much better, the geometric constant $\rho$ depends on the polyhedral set and can be very large. Moreover, each iteration of the algorithm requires further computation besides the linear optimization step.

The most obvious way to obtain a stochastic Frank-Wolfe variant is to replace $\nabla f(w_{k-1})$ by some $\nabla f_i(w_{k-1})$, or more generally by the average of some number of iid samples of $\nabla f_i(w_{k-1})$ (a mini-batch approach). We call this method SFW and include its analysis in Appendix B, since we do not find it explicitly analyzed before. SFW needs $O(\frac{1}{\epsilon^3})$ stochastic gradients and $O(\frac{1}{\epsilon})$ linear optimization steps to reach an $\epsilon$-approximate optimum.
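For concreteness, one SFW update can be sketched as follows (our own illustration; stoch_grad and lin_opt are assumed user-supplied callables for $\nabla f_i$ and the linear optimization oracle, for example the trace-norm oracle of Section 2.1):

```python
import numpy as np

def sfw_step(w, stoch_grad, lin_opt, n, k, batch_size, rng):
    """One SFW iteration: Eq. (3) with the exact gradient replaced by a
    mini-batch average of stochastic gradients.

    stoch_grad(w, i) returns grad f_i(w); lin_opt(g) returns
    argmin_{v in Omega} <g, v>.
    """
    idx = rng.integers(0, n, size=batch_size)                 # iid uniform indices
    g_tilde = np.mean([stoch_grad(w, i) for i in idx], axis=0)  # mini-batch gradient
    v = lin_opt(g_tilde)                                      # linear optimization
    gamma = 2.0 / (k + 1)                                     # step size gamma_k
    return (1.0 - gamma) * w + gamma * v
```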

The work by Hazan & Kale (2012) focuses on an online learning setting. One can extract two results from this work for the setting studied here.⁴ In any case, the result is worse than SFW for both the number of stochastic gradients and the number of linear optimizations.

Stochastic Conditional Gradient Sliding (SCGS), recently proposed by Lan & Zhou (2014), uses Nesterov's acceleration technique to speed up Frank-Wolfe. Without strong convexity, SCGS needs $O(\frac{1}{\epsilon^2})$ stochastic gradients, improving upon SFW. With strong convexity, this number can even be improved to $O(\frac{1}{\epsilon})$. In both cases, the number of linear optimization steps is $O(\frac{1}{\epsilon})$.

The key idea of our algorithms is to combine the variance reduction technique proposed in (Johnson & Zhang, 2013; Mahdavi et al., 2013) with some of the above-mentioned algorithms. For example, our algorithm SVRF combines this technique with SFW, also improving the number of stochastic gradients from $O(\frac{1}{\epsilon^3})$ to $O(\frac{1}{\epsilon^2})$, but without any extra conditions (such as the Lipschitzness required for SCGS). More importantly, despite having a seemingly identical convergence rate, SVRF substantially outperforms SCGS empirically (see Section 5).

On the other hand, our second algorithm STORC combines variance reduction with SCGS, providing even further improvements. Specifically, the number of stochastic gradients is improved to: $O(\frac{1}{\epsilon^{1.5}})$ when $f$ is Lipschitz; $O(\frac{1}{\epsilon})$ when $\nabla f(w^*) = 0$; and finally $O(\ln\frac{1}{\epsilon})$ when $f$ is strongly convex. Note that the condition $\nabla f(w^*) = 0$ essentially means that $w^*$ is in the interior of $\Omega$, but it is still an interesting case when the optimum is not unique and doing unconstrained optimization would not necessarily return a point in $\Omega$.

Both of our algorithms require $O(\frac{1}{\epsilon})$ linear optimization steps as in previous work, and overall require computing $O(\ln\frac{LD^2}{\epsilon})$ exact gradients. However, we emphasize that this extra overhead is much more affordable than non-stochastic Frank-Wolfe (that is, computing exact gradients every iteration), since it does not have any polynomial dependence on parameters such as $d$, $L$ or $\mu$.

³See also recent follow-up work (Lacoste-Julien & Jaggi, 2015).
⁴The first result comes from the setting where the online loss functions are stochastic, and the second one comes from a completely online setting with the standard online-to-batch conversion.

2.3. Variance-Reduced Stochastic Gradients

Originally proposed in (Johnson & Zhang, 2013) and independently in (Mahdavi et al., 2013), the idea of variance-reduced stochastic gradients has proven to be highly useful and has been extended to various other algorithms (such as (Frostig et al., 2015; Moritz et al., 2016)).

A variance-reduced stochastic gradient at some point $w \in \Omega$ with some snapshot $w_0 \in \Omega$ is defined as
$$\tilde\nabla f(w; w_0) = \nabla f_i(w) - (\nabla f_i(w_0) - \nabla f(w_0)),$$
where $i$ is again picked from $\{1, \dots, n\}$ uniformly at random. The snapshot $w_0$ is usually a decision point from some previous iteration of the algorithm whose exact gradient $\nabla f(w_0)$ has been pre-computed, so that computing $\tilde\nabla f(w; w_0)$ only requires two standard stochastic gradient evaluations: $\nabla f_i(w)$ and $\nabla f_i(w_0)$.

A variance-reduced stochastic gradient is clearly also unbiased, that is, $\mathbb{E}[\tilde\nabla f(w; w_0)] = \nabla f(w)$. More importantly, the term $\nabla f_i(w_0) - \nabla f(w_0)$ serves as a correction term that reduces the variance of the stochastic gradient. Formally, one can prove the following

Lemma 1. For any $w, w_0 \in \Omega$, we have
$$\mathbb{E}[\|\tilde\nabla f(w; w_0) - \nabla f(w)\|^2] \le 6L\big(2\mathbb{E}[f(w) - f(w^*)] + \mathbb{E}[f(w_0) - f(w^*)]\big).$$

In words, the variance of the variance-reduced stochastic gradient is bounded by how close the current point and the snapshot are to the optimum. The original work proves a bound on $\mathbb{E}[\|\tilde\nabla f(w; w_0)\|^2]$ under the assumption $\nabla f(w^*) = 0$, which we do not require here. However, the main idea of the proof is similar, and we defer it to Section 6.
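In code, a mini-batch of variance-reduced stochastic gradients can be formed as follows (a minimal sketch of our own; the callables and names are assumptions, not from the paper):

```python
import numpy as np

def vr_gradient(w, w0, full_grad_w0, stoch_grad, n, batch_size, rng):
    """Average of `batch_size` iid variance-reduced stochastic gradients
    grad f_i(w) - (grad f_i(w0) - grad f(w0)), with i uniform in {0,...,n-1}.

    full_grad_w0 = grad f(w0) is computed once per snapshot and reused here,
    so each sample costs only two stochastic gradient evaluations.
    """
    idx = rng.integers(0, n, size=batch_size)
    samples = [stoch_grad(w, i) - (stoch_grad(w0, i) - full_grad_w0) for i in idx]
    return np.mean(samples, axis=0)
```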

3. Stochastic Variance-Reduced Frank-Wolfe

With the previous discussion, our first algorithm is pretty straightforward: compared to the standard Frank-Wolfe, we simply replace the exact gradient with the average of a mini-batch of variance-reduced stochastic gradients, and take snapshots every once in a while. We call this algorithm Stochastic Variance-Reduced Frank-Wolfe (SVRF); its pseudocode is presented in Algorithm 1.

Algorithm 1 Stochastic Variance-Reduced Frank-Wolfe (SVRF)
1: Input: Objective function $f = \frac{1}{n}\sum_{i=1}^n f_i$.
2: Input: Parameters $\gamma_k$, $m_k$ and $N_t$.
3: Initialize: $w_0 = \operatorname{argmin}_{w \in \Omega} \nabla f(x)^\top w$ for some arbitrary $x \in \Omega$.
4: for $t = 1, 2, \dots, T$ do
5:   Take snapshot: $x_0 = w_{t-1}$ and compute $\nabla f(x_0)$.
6:   for $k = 1$ to $N_t$ do
7:     Compute $\tilde\nabla_k$, the average of $m_k$ iid samples of $\tilde\nabla f(x_{k-1}; x_0)$.
8:     Compute $v_k = \operatorname{argmin}_{v \in \Omega} \tilde\nabla_k^\top v$.
9:     Compute $x_k = (1-\gamma_k)x_{k-1} + \gamma_k v_k$.
10:  end for
11:  Set $w_t = x_{N_t}$.
12: end for
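For readers who prefer running code, here is a minimal sketch of Algorithm 1 (our own illustration; the helper names full_grad, stoch_grad, lin_opt and x_arbitrary are assumptions, and the epoch lengths and batch sizes follow the parameter choice of Theorem 1 below):

```python
import numpy as np

def svrf(full_grad, stoch_grad, lin_opt, x_arbitrary, n, T, rng):
    """Sketch of SVRF. full_grad(w) = grad f(w); stoch_grad(w, i) = grad f_i(w);
    lin_opt(g) = argmin_{v in Omega} <g, v>."""
    w = lin_opt(full_grad(x_arbitrary))           # line 3: initial linear optimization
    for t in range(1, T + 1):
        x, g_snap = w, full_grad(w)               # line 5: snapshot x_0 = w_{t-1}
        N_t = 2 ** (t + 3) - 2
        for k in range(1, N_t + 1):
            m_k = 96 * (k + 1)                    # mini-batch size from Theorem 1
            idx = rng.integers(0, n, size=m_k)
            g = np.mean([stoch_grad(x, i) - stoch_grad(w, i) + g_snap
                         for i in idx], axis=0)   # variance-reduced gradient estimate
            v = lin_opt(g)                        # line 8: linear optimization
            gamma = 2.0 / (k + 1)
            x = (1.0 - gamma) * x + gamma * v     # line 9: Frank-Wolfe step
        w = x                                     # line 11: w_t = x_{N_t}
    return w
```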

The convergence rate of this algorithm is shown in the following theorem.

Theorem 1. With the parameters
$$\gamma_k = \frac{2}{k+1}, \qquad m_k = 96(k+1), \qquad N_t = 2^{t+3} - 2,$$
Algorithm 1 ensures $\mathbb{E}[f(w_t) - f(w^*)] \le \frac{LD^2}{2^{t+1}}$ for any $t$.

Before proving this theorem, we first state a direct implication of this convergence result.

Corollary 1. To achieve $1-\epsilon$ accuracy, Algorithm 1 requires $O(\ln\frac{LD^2}{\epsilon})$ exact gradient evaluations, $O(\frac{L^2D^4}{\epsilon^2})$ stochastic gradient evaluations and $O(\frac{LD^2}{\epsilon})$ linear optimizations.

Proof. According to the algorithm and the choice of parameters, it is clear that these three numbers are $T+1$, $\sum_{t=1}^T\sum_{k=1}^{N_t} m_k = O(4^T)$ and $\sum_{t=1}^T N_t = O(2^T)$ respectively. Theorem 1 implies that $T$ should be of order $\Theta(\log_2\frac{LD^2}{\epsilon})$. Plugging in all parameters concludes the proof.

To prove Theorem 1, we first consider a fixed iteration $t$ and prove the following lemma:

Lemma 2. For any $k$, we have
$$\mathbb{E}[f(x_k) - f(w^*)] \le \frac{4LD^2}{k+2}$$
if $\mathbb{E}[\|\tilde\nabla_s - \nabla f(x_{s-1})\|^2] \le \frac{L^2D^2}{(s+1)^2}$ for all $s \le k$.

We defer the proof of this lemma to Section 6 for coherence. With the help of Lemma 2, we are now ready to prove the main convergence result.

Proof of Theorem 1. We prove by induction. For $t = 0$, by smoothness, the optimality of $w_0$ and convexity, we have
$$f(w_0) \le f(x) + \nabla f(x)^\top(w_0 - x) + \tfrac{L}{2}\|w_0 - x\|^2 \le f(x) + \nabla f(x)^\top(w^* - x) + \tfrac{LD^2}{2} \le f(w^*) + \tfrac{LD^2}{2}.$$
Now assuming $\mathbb{E}[f(w_{t-1}) - f(w^*)] \le \frac{LD^2}{2^t}$, we consider iteration $t$ of the algorithm and use another induction to show $\mathbb{E}[f(x_k) - f(w^*)] \le \frac{4LD^2}{k+2}$ for any $k \le N_t$. The base case is trivial since $x_0 = w_{t-1}$. Suppose $\mathbb{E}[f(x_{s-1}) - f(w^*)] \le \frac{4LD^2}{s+1}$ for any $s \le k$. Now because $\tilde\nabla_s$ is the average of $m_s$ iid samples of $\tilde\nabla f(x_{s-1}; x_0)$, its variance is reduced by a factor of $m_s$. That is, with Lemma 1 we have
$$\mathbb{E}[\|\tilde\nabla_s - \nabla f(x_{s-1})\|^2] \le \frac{6L}{m_s}\big(2\mathbb{E}[f(x_{s-1}) - f(w^*)] + \mathbb{E}[f(x_0) - f(w^*)]\big) \le \frac{6L}{m_s}\Big(\frac{8LD^2}{s+1} + \frac{LD^2}{2^t}\Big) \le \frac{6L}{m_s}\Big(\frac{8LD^2}{s+1} + \frac{8LD^2}{s+1}\Big) = \frac{L^2D^2}{(s+1)^2},$$
where the last inequality is by the fact $s \le N_t = 2^{t+3} - 2$ and the last equality is by plugging in the choice of $m_s$. Therefore the condition of Lemma 2 is satisfied and the induction is completed. Finally, with the choice of $N_t$ we thus prove $\mathbb{E}[f(w_t) - f(w^*)] = \mathbb{E}[f(x_{N_t}) - f(w^*)] \le \frac{4LD^2}{N_t+2} = \frac{LD^2}{2^{t+1}}$.

We remark that in Algorithm 1 we essentially restart the algorithm (that is, reset $k$ to 1) after taking a new snapshot. Another option, however, is to keep increasing $k$ and never reset it. Although one can show that this only leads to a constant-factor speed-up in the convergence, it gives more stable updates and is thus what we implement in the experiments.

4. Stochastic Variance-Reduced Conditional Gradient Sliding

Our second algorithm applies variance reduction to the SCGS algorithm (Lan & Zhou, 2014). Again, the key difference is that we replace the stochastic gradients with the average of a mini-batch of variance-reduced stochastic gradients, and take snapshots every once in a while. See the pseudocode in Algorithm 2 for details.

Algorithm 2 STOchastic variance-Reduced Conditional gradient sliding (STORC)
1: Input: Objective function $f = \frac{1}{n}\sum_{i=1}^n f_i$.
2: Input: Parameters $\gamma_k$, $\beta_k$, $\eta_{t,k}$, $m_{t,k}$ and $N_t$.
3: Initialize: $w_0 = \operatorname{argmin}_{w \in \Omega} \nabla f(x)^\top w$ for some arbitrary $x \in \Omega$.
4: for $t = 1, 2, \dots$ do
5:   Take snapshot: $y_0 = w_{t-1}$ and compute $\nabla f(y_0)$.
6:   Initialize $x_0 = y_0$.
7:   for $k = 1$ to $N_t$ do
8:     Compute $z_k = (1-\gamma_k)y_{k-1} + \gamma_k x_{k-1}$.
9:     Compute $\tilde\nabla_k$, the average of $m_{t,k}$ iid samples of $\tilde\nabla f(z_k; y_0)$.
10:    Let $g(x) = \frac{\beta_k}{2}\|x - x_{k-1}\|^2 + \tilde\nabla_k^\top x$.
11:    Compute $x_k$, the output of running standard Frank-Wolfe to solve $\min_{x \in \Omega} g(x)$ until the duality gap is at most $\eta_{t,k}$, that is,
       $$\max_{x \in \Omega} \nabla g(x_k)^\top(x_k - x) \le \eta_{t,k}. \qquad (4)$$
12:    Compute $y_k = (1-\gamma_k)y_{k-1} + \gamma_k x_k$.
13:  end for
14:  Set $w_t = y_{N_t}$.
15: end for

The algorithm makes use of two auxiliary sequences $x_k$ and $z_k$ (Lines 8 and 12), which is standard for Nesterov's algorithm. $x_k$ is obtained by approximately solving a squared-norm-regularized linear optimization so that it stays close to $x_{k-1}$ (Line 11). Note that this step does not require computing any extra gradients of $f$ or $f_i$, and is done by performing the standard Frank-Wolfe algorithm (Eq. (3)) until the duality gap is at most a certain value $\eta_{t,k}$. The duality gap is a certificate of approximate optimality (see (Jaggi, 2013)), and is a side product of the linear optimization performed at each step, requiring no extra cost.

Also note that the stochastic gradients are computed at the sequence $z_k$ instead of $y_k$, which is also standard in Nesterov's algorithm. However, according to Lemma 1, we thus need to show the convergence rate of the auxiliary sequence $z_k$, which appears to be rarely studied previously, to the best of our knowledge. This is one of the key steps in our analysis.
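Line 11 is the only non-standard step, so we sketch it separately (again our own illustration with assumed helper names): the gradient of the quadratic model is cheap to form, and the duality gap used for stopping reuses the linear optimization already performed at each inner step.

```python
import numpy as np

def solve_subproblem(x_prev, g_tilde, beta, eta, lin_opt, max_iter=100000):
    """Approximately minimize g(x) = beta/2 * ||x - x_prev||^2 + <g_tilde, x>
    over Omega by standard Frank-Wolfe, stopping once the duality gap
    max_{u in Omega} <grad g(x), x - u> drops below eta (Eq. (4))."""
    x = x_prev
    for k in range(1, max_iter + 1):
        grad_g = beta * (x - x_prev) + g_tilde   # gradient of the quadratic model
        u = lin_opt(grad_g)                      # linear optimization step
        gap = np.sum(grad_g * (x - u))           # Frank-Wolfe duality gap at x
        if gap <= eta:
            break
        gamma = 2.0 / (k + 1)
        x = (1.0 - gamma) * x + gamma * u
    return x
```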
The main convergence result of STORC is the following:

Theorem 2. With the following parameters (where $D_t$ is defined in each case below),
$$\gamma_k = \frac{2}{k+1}, \qquad \beta_k = \frac{3L}{k}, \qquad \eta_{t,k} = \frac{2LD_t^2}{N_t k},$$
Algorithm 2 ensures $\mathbb{E}[f(w_t) - f(w^*)] \le \frac{LD^2}{2^{t+1}}$ for any $t$ if any of the following three cases holds:

(a) $\nabla f(w^*) = 0$, and $D_t = D$, $N_t = \lceil 2^{t/2+2}\rceil$, $m_{t,k} = 900N_t$.

(b) $f$ is $G$-Lipschitz, and $D_t = D$, $N_t = \lceil 2^{t/2+2}\rceil$, $m_{t,k} = 700N_t + \frac{24N_tG(k+1)}{LD}$.

(c) $f$ is $\alpha$-strongly convex, and $D_t^2 = \frac{\mu D^2}{2^{t-1}}$, $N_t = \lceil\sqrt{32\mu}\rceil$, $m_{t,k} = 5600N_t\mu$, where $\mu = \frac{L}{\alpha}$.

Again we first give a direct implication of the above result:

Corollary 2. To achieve $1-\epsilon$ accuracy, Algorithm 2 requires $O(\ln\frac{LD^2}{\epsilon})$ exact gradient evaluations and $O(\frac{LD^2}{\epsilon})$ linear optimizations. The numbers of stochastic gradient evaluations for Cases (a), (b) and (c) are respectively $O(\frac{LD^2}{\epsilon})$, $O(\frac{LD^2}{\epsilon} + \frac{\sqrt{L}D^2G}{\epsilon^{1.5}})$ and $O(\mu^2\ln\frac{LD^2}{\epsilon})$.

Proof. Line 11 requires $O(\frac{\beta_kD^2}{\eta_{t,k}})$ iterations of the standard Frank-Wolfe algorithm since $g(x)$ is $\beta_k$-smooth (see e.g. (Jaggi, 2013, Theorem 2)). So the numbers of exact gradient evaluations, stochastic gradient evaluations and linear optimizations are respectively $T+1$, $\sum_{t=1}^T\sum_{k=1}^{N_t} m_{t,k}$ and $O(\sum_{t=1}^T\sum_{k=1}^{N_t}\frac{\beta_kD^2}{\eta_{t,k}})$. Theorem 2 implies that $T$ should be of order $\Theta(\log_2\frac{LD^2}{\epsilon})$. Plugging in all parameters proves the corollary.

To prove Theorem 2, we again first consider a fixed iteration $t$ and use the following lemma, which is essentially proven in (Lan & Zhou, 2014). We include a distilled proof in Appendix C for completeness.

Lemma 3. Suppose $\mathbb{E}[\|y_0 - w^*\|^2] \le D_t^2$ holds for some positive constant $D_t \le D$. Then for any $k$, we have
$$\mathbb{E}[f(y_k) - f(w^*)] \le \frac{8LD_t^2}{k(k+1)}$$
if $\mathbb{E}[\|\tilde\nabla_s - \nabla f(z_s)\|^2] \le \frac{L^2D_t^2}{N_t(s+1)^2}$ for all $s \le k$.

Proof of Theorem 2. We prove by induction. The base case $t = 0$ holds by the exact same argument as in the proof of Theorem 1. Suppose $\mathbb{E}[f(w_{t-1}) - f(w^*)] \le \frac{LD^2}{2^t}$ and consider iteration $t$. Below we use another induction to prove $\mathbb{E}[f(y_k) - f(w^*)] \le \frac{8LD_t^2}{k(k+1)}$ for any $1 \le k \le N_t$, which will conclude the proof since, for any of the three cases, we have $\mathbb{E}[f(w_t) - f(w^*)] = \mathbb{E}[f(y_{N_t}) - f(w^*)]$, which is at most $\frac{8LD_t^2}{N_t^2} \le \frac{LD^2}{2^{t+1}}$.

We first show that the condition $\mathbb{E}[\|y_0 - w^*\|^2] \le D_t^2$ holds. This is trivial for Cases (a) and (b), where $D_t = D$. For Case (c), by strong convexity and the inductive assumption, we have $\mathbb{E}[\|y_0 - w^*\|^2] \le \frac{2}{\alpha}\mathbb{E}[f(y_0) - f(w^*)] \le \frac{LD^2}{\alpha 2^{t-1}} = D_t^2$.

Next note that Lemma 1 implies that $\mathbb{E}[\|\tilde\nabla_s - \nabla f(z_s)\|^2]$ is at most $\frac{6L}{m_{t,s}}(2\mathbb{E}[f(z_s) - f(w^*)] + \mathbb{E}[f(y_0) - f(w^*)])$. So the key is to bound $\mathbb{E}[f(z_s) - f(w^*)]$. With $z_1 = y_0$, one can verify that $\mathbb{E}[\|\tilde\nabla_1 - \nabla f(z_1)\|^2]$ is at most $\frac{18L}{m_{t,1}}\mathbb{E}[f(y_0) - f(w^*)] \le \frac{18L^2D^2}{m_{t,1}2^t} \le \frac{L^2D_t^2}{4N_t}$ for all three cases, and thus $\mathbb{E}[f(y_s) - f(w^*)] \le \frac{8LD_t^2}{s(s+1)}$ holds for $s = 1$ by Lemma 3. Now suppose it holds for any $s < k$; below we discuss the three cases separately to show that it also holds for $s = k$.

Case (a). By smoothness, the condition $\nabla f(w^*) = 0$, the construction of $z_s$, and the Cauchy-Schwarz inequality, we have for any $1 < s \le k$,
$$\begin{aligned}
f(z_s) &\le f(y_{s-1}) + (\nabla f(y_{s-1}) - \nabla f(w^*))^\top(z_s - y_{s-1}) + \tfrac{L}{2}\|z_s - y_{s-1}\|^2 \\
&= f(y_{s-1}) + \gamma_s(\nabla f(y_{s-1}) - \nabla f(w^*))^\top(x_{s-1} - y_{s-1}) + \tfrac{L\gamma_s^2}{2}\|x_{s-1} - y_{s-1}\|^2 \\
&\le f(y_{s-1}) + \gamma_s D\|\nabla f(y_{s-1}) - \nabla f(w^*)\| + \tfrac{LD^2\gamma_s^2}{2}.
\end{aligned}$$
Property (1) and the optimality of $w^*$ imply
$$\|\nabla f(y_{s-1}) - \nabla f(w^*)\|^2 \le 2L\big(f(y_{s-1}) - f(w^*) - \nabla f(w^*)^\top(y_{s-1} - w^*)\big) \le 2L(f(y_{s-1}) - f(w^*)).$$
So subtracting $f(w^*)$ and taking expectations on both sides, and applying Jensen's inequality and the inductive assumption, we have
$$\mathbb{E}[f(z_s) - f(w^*)] \le \mathbb{E}[f(y_{s-1}) - f(w^*)] + \gamma_s D\sqrt{2L\,\mathbb{E}[f(y_{s-1}) - f(w^*)]} + \frac{2LD^2}{(s+1)^2} \le \frac{8LD^2}{(s-1)s} + \frac{8LD^2}{(s+1)\sqrt{(s-1)s}} + \frac{2LD^2}{(s+1)^2} < \frac{55LD^2}{(s+1)^2}.$$
On the other hand, we have $\mathbb{E}[f(y_0) - f(w^*)] \le \frac{LD^2}{2^t} \le \frac{16LD^2}{(N_t-1)^2} < \frac{40LD^2}{(N_t+1)^2} \le \frac{40LD^2}{(s+1)^2}$. So $\mathbb{E}[\|\tilde\nabla_s - \nabla f(z_s)\|^2]$ is at most $\frac{900L^2D^2}{m_{t,s}(s+1)^2}$, and the choice of $m_{t,s}$ ensures that this bound is at most $\frac{L^2D^2}{N_t(s+1)^2}$, satisfying the condition of Lemma 3 and thus completing the induction.

Case (b). With the $G$-Lipschitz condition we proceed similarly and bound $f(z_s)$ by
$$f(y_{s-1}) + \nabla f(y_{s-1})^\top(z_s - y_{s-1}) + \tfrac{L}{2}\|z_s - y_{s-1}\|^2 = f(y_{s-1}) + \gamma_s\nabla f(y_{s-1})^\top(x_{s-1} - y_{s-1}) + \tfrac{L\gamma_s^2}{2}\|x_{s-1} - y_{s-1}\|^2 \le f(y_{s-1}) + \gamma_s GD + \tfrac{LD^2\gamma_s^2}{2}.$$
So using the bounds derived previously and the choice of $m_{t,s}$, we bound $\mathbb{E}[\|\tilde\nabla_s - \nabla f(z_s)\|^2]$ as follows:
$$\frac{6L}{m_{t,s}}\Big(\frac{16LD^2}{(s-1)s} + \frac{4GD}{s+1} + \frac{4LD^2}{(s+1)^2} + \frac{40LD^2}{(s+1)^2}\Big) \le \frac{6L}{m_{t,s}}\Big(\frac{4GD}{s+1} + \frac{116LD^2}{(s+1)^2}\Big) < \frac{L^2D^2}{N_t(s+1)^2},$$
again completing the induction.

Case (c). Using the definitions of $z_s$ and $y_s$ and a direct calculation, one can remove the dependence on $x_s$ and verify
$$y_{s-1} = \frac{s+1}{2s-1}z_s + \frac{s-2}{2s-1}y_{s-2}$$
for any $s \ge 2$. Now we apply Property (2) with $\lambda = \frac{s+1}{2s-1}$:
$$\begin{aligned}
f(y_{s-1}) &\ge \frac{s+1}{2s-1}f(z_s) + \frac{s-2}{2s-1}f(y_{s-2}) - \frac{L}{2}\cdot\frac{(s+1)(s-2)}{(2s-1)^2}\|z_s - y_{s-2}\|^2 \\
&= f(w^*) + \frac{s+1}{2s-1}(f(z_s) - f(w^*)) + \frac{s-2}{2s-1}(f(y_{s-2}) - f(w^*)) - \frac{L(s-2)}{2(s+1)}\|y_{s-1} - y_{s-2}\|^2 \\
&\ge f(w^*) + \frac{1}{2}(f(z_s) - f(w^*)) - \frac{L}{2}\|y_{s-1} - y_{s-2}\|^2,
\end{aligned}$$
where the equality is by adding and subtracting $f(w^*)$ and the fact $y_{s-1} - y_{s-2} = \frac{s+1}{2s-1}(z_s - y_{s-2})$, and the last inequality is by $f(y_{s-2}) \ge f(w^*)$ and trivial relaxations. Rearranging gives $f(z_s) - f(w^*) \le 2(f(y_{s-1}) - f(w^*)) + L\|y_{s-1} - y_{s-2}\|^2$. Applying the Cauchy-Schwarz inequality, strong convexity and the fact $\mu \ge 1$, we continue with
$$\begin{aligned}
f(z_s) - f(w^*) &\le 2(f(y_{s-1}) - f(w^*)) + 2L\big(\|y_{s-1} - w^*\|^2 + \|y_{s-2} - w^*\|^2\big) \\
&\le 2(f(y_{s-1}) - f(w^*)) + 4\mu\big(f(y_{s-1}) - f(w^*) + f(y_{s-2}) - f(w^*)\big) \\
&\le 6\mu(f(y_{s-1}) - f(w^*)) + 4\mu(f(y_{s-2}) - f(w^*)).
\end{aligned}$$
For $s \ge 3$, we use the inductive assumption to show $\mathbb{E}[f(z_s) - f(w^*)] \le \frac{48\mu LD_t^2}{(s-1)s} + \frac{32\mu LD_t^2}{(s-2)(s-1)} \le \frac{448\mu LD_t^2}{(s+1)^2}$. The case $s = 2$ can be verified similarly using the bounds on $\mathbb{E}[f(y_0) - f(w^*)]$ and $\mathbb{E}[f(y_1) - f(w^*)]$ (the base case). Finally we bound the term $\mathbb{E}[f(y_0) - f(w^*)] \le \frac{LD^2}{2^t} = \frac{LD_t^2}{2\mu} \le \frac{32LD_t^2}{(N_t+1)^2} \le \frac{32LD_t^2}{(s+1)^2}$, and conclude that the variance $\mathbb{E}[\|\tilde\nabla_s - \nabla f(z_s)\|^2]$ is at most $\frac{6L}{m_{t,s}}\big(\frac{896\mu LD_t^2}{(s+1)^2} + \frac{32LD_t^2}{(s+1)^2}\big) \le \frac{L^2D_t^2}{N_t(s+1)^2}$, completing the induction by Lemma 3.

5. Experiments

To support our theory, we conduct experiments on the multiclass classification problem mentioned in Section 2.1. Three datasets are selected from the LIBSVM repository⁵ with relatively large numbers of features, categories and examples, summarized in Table 3.

⁵https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/

| dataset | #features | #categories | #examples |
| news20  | 62,061    | 20          | 15,935    |
| rcv1    | 47,236    | 53          | 15,564    |
| aloi    | 128       | 1,000       | 108,000   |

Table 3: Summary of datasets

Recall that the loss is the multivariate logistic loss and $\Omega$ is the set of matrices with bounded trace norm $\tau$. We focus on how fast the loss decreases instead of the final test error rate, so that the tuning of $\tau$ is less important; it is fixed to 50 throughout.

We compare six algorithms. Four of them (SFW, SCGS, SVRF, STORC) are projection-free as discussed, and the other two are standard projected stochastic gradient descent (SGD) and its variance-reduced version (SVRG (Johnson & Zhang, 2013)), both of which require expensive projections.

For most of the parameters in these algorithms, we roughly follow what the theory suggests. For example, the size of the mini-batch of stochastic gradients at round $k$ is set to $k^2$, $k^3$ and $k$ respectively for SFW, SCGS and SVRF, and is fixed to 100 for the other three. The number of iterations between taking two snapshots for the variance-reduced methods (SVRG, SVRF and STORC) is fixed to 50. The learning rate is set to the typical decaying sequence $c/\sqrt{k}$ for SGD and to a constant $c'$ for SVRG, as the original work suggests, for some best-tuned $c$ and $c'$.

Since the complexities of computing gradients, performing linear optimization and projecting are very different, we measure the actual running time of the algorithms and see how fast the loss decreases. Results can be found in Figure 1.

[Figure 1: Comparison of the six algorithms (SGD, SVRG, SCGS, STORC, SFW, SVRF) on the three multiclass datasets, (a) news20, (b) rcv1 and (c) aloi, plotting loss against running time in seconds; best viewed in color.]

For all datasets, one can clearly observe that SGD and SVRG are significantly slower than the others, due to the expensive projection step, highlighting the usefulness of projection-free algorithms. Moreover, we also observe large improvements gained from the variance reduction technique, especially when comparing SCGS and STORC, as well as SFW and SVRF on the aloi dataset. Interestingly, even though the STORC algorithm gives the best theoretical results, empirically the simpler algorithms SFW and SVRF tend to have consistently better performance.

6. Omitted Proofs

Proof of Lemma 1. Let $\mathbb{E}_i$ denote the conditional expectation given all the past except the realization of $i$. We have
$$\begin{aligned}
\mathbb{E}_i[\|\tilde\nabla f(w; w_0) - \nabla f(w)\|^2] &= \mathbb{E}_i[\|\nabla f_i(w) - \nabla f_i(w_0) + \nabla f(w_0) - \nabla f(w)\|^2] \\
&= \mathbb{E}_i[\|(\nabla f_i(w) - \nabla f_i(w^*)) - (\nabla f_i(w_0) - \nabla f_i(w^*)) + (\nabla f(w_0) - \nabla f(w^*)) - (\nabla f(w) - \nabla f(w^*))\|^2] \\
&\le 3\,\mathbb{E}_i\big[\|\nabla f_i(w) - \nabla f_i(w^*)\|^2 + \|(\nabla f_i(w_0) - \nabla f_i(w^*)) - (\nabla f(w_0) - \nabla f(w^*))\|^2 + \|\nabla f(w) - \nabla f(w^*)\|^2\big] \\
&\le 3\,\mathbb{E}_i\big[\|\nabla f_i(w) - \nabla f_i(w^*)\|^2 + \|\nabla f_i(w_0) - \nabla f_i(w^*)\|^2 + \|\nabla f(w) - \nabla f(w^*)\|^2\big],
\end{aligned}$$
where the first inequality is the Cauchy-Schwarz inequality, and the second one is by the fact $\mathbb{E}_i[\nabla f_i(w_0) - \nabla f_i(w^*)] = \nabla f(w_0) - \nabla f(w^*)$ and that the variance of a random variable is bounded by its second moment.

We now apply Property (1) to bound each of the three terms above. For example, $\mathbb{E}_i\|\nabla f_i(w) - \nabla f_i(w^*)\|^2 \le 2L\,\mathbb{E}_i[f_i(w) - f_i(w^*) - \nabla f_i(w^*)^\top(w - w^*)] = 2L(f(w) - f(w^*) - \nabla f(w^*)^\top(w - w^*))$, which is at most $2L(f(w) - f(w^*))$ by the optimality of $w^*$. Proceeding similarly for the other two terms concludes the proof.

Proof of Lemma 2. For any $s \le k$, by smoothness we have $f(x_s) \le f(x_{s-1}) + \nabla f(x_{s-1})^\top(x_s - x_{s-1}) + \frac{L}{2}\|x_s - x_{s-1}\|^2$. Plugging in $x_s = (1-\gamma_s)x_{s-1} + \gamma_s v_s$ gives $f(x_s) \le f(x_{s-1}) + \gamma_s\nabla f(x_{s-1})^\top(v_s - x_{s-1}) + \frac{L\gamma_s^2}{2}\|v_s - x_{s-1}\|^2$. Rewriting and using the fact that $\|v_s - x_{s-1}\| \le D$ leads to
$$f(x_s) \le f(x_{s-1}) + \gamma_s\tilde\nabla_s^\top(v_s - x_{s-1}) + \gamma_s(\nabla f(x_{s-1}) - \tilde\nabla_s)^\top(v_s - x_{s-1}) + \frac{LD^2\gamma_s^2}{2}.$$
The optimality of $v_s$ implies $\tilde\nabla_s^\top v_s \le \tilde\nabla_s^\top w^*$. So with further rewriting we arrive at
$$f(x_s) \le f(x_{s-1}) + \gamma_s\nabla f(x_{s-1})^\top(w^* - x_{s-1}) + \gamma_s(\nabla f(x_{s-1}) - \tilde\nabla_s)^\top(v_s - w^*) + \frac{LD^2\gamma_s^2}{2}.$$
By convexity, the term $\nabla f(x_{s-1})^\top(w^* - x_{s-1})$ is bounded by $f(w^*) - f(x_{s-1})$, and by the Cauchy-Schwarz inequality, the term $(\nabla f(x_{s-1}) - \tilde\nabla_s)^\top(v_s - w^*)$ is bounded by $D\|\tilde\nabla_s - \nabla f(x_{s-1})\|$, which in expectation is at most $\frac{LD^2}{s+1}$ by the condition on $\mathbb{E}[\|\tilde\nabla_s - \nabla f(x_{s-1})\|^2]$ and Jensen's inequality. Therefore we can bound $\mathbb{E}[f(x_s) - f(w^*)]$ by
$$(1-\gamma_s)\mathbb{E}[f(x_{s-1}) - f(w^*)] + \frac{LD^2\gamma_s}{s+1} + \frac{LD^2\gamma_s^2}{2} = (1-\gamma_s)\mathbb{E}[f(x_{s-1}) - f(w^*)] + LD^2\gamma_s^2.$$
Finally we prove $\mathbb{E}[f(x_k) - f(w^*)] \le \frac{4LD^2}{k+2}$ by induction. The base case is trivial: $\mathbb{E}[f(x_1) - f(w^*)]$ is bounded by $(1-\gamma_1)\mathbb{E}[f(x_0) - f(w^*)] + LD^2\gamma_1^2 = LD^2$ since $\gamma_1 = 1$. Suppose $\mathbb{E}[f(x_{s-1}) - f(w^*)] \le \frac{4LD^2}{s+1}$; then with $\gamma_s = \frac{2}{s+1}$ we bound $\mathbb{E}[f(x_s) - f(w^*)]$ by
$$\frac{4LD^2}{s+1}\Big(1 - \frac{2}{s+1} + \frac{1}{s+1}\Big) \le \frac{4LD^2}{s+2},$$
completing the induction.

7. Conclusion and Open Problems

We conclude that the variance reduction technique, previously shown to be highly useful for gradient descent variants, can also be very helpful in speeding up projection-free algorithms. The main open question is, in the strongly convex case, whether the number of stochastic gradients for STORC can be improved from $O(\mu^2\ln\frac{1}{\epsilon})$ to $O(\mu\ln\frac{1}{\epsilon})$, which is typical for gradient descent methods, and whether the number of linear optimizations can be improved from $O(\frac{1}{\epsilon})$ to $O(\ln\frac{1}{\epsilon})$.

Acknowledgements. The authors acknowledge support from the National Science Foundation grant IIS-1523815 and a Google research award.

References

Deng, Jia, Dong, Wei, Socher, Richard, Li, Li-Jia, Li, Kai, and Fei-Fei, Li. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 248–255. IEEE, 2009.

Dudik, Miro, Harchaoui, Zaid, and Malick, Jérôme. Lifted coordinate descent for learning with trace-norm regularization. In Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, volume 22, pp. 327–336, 2012.

Frank, Marguerite and Wolfe, Philip. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1-2):95–110, 1956.

Frostig, Roy, Ge, Rong, Kakade, Sham M., and Sidford, Aaron. Competing with the empirical risk minimizer in a single pass. In Proceedings of the 28th Annual Conference on Learning Theory, 2015.

Garber, Dan and Hazan, Elad. A linearly convergent conditional gradient algorithm with applications to online and stochastic optimization. arXiv preprint arXiv:1301.4666, 2013.

Harchaoui, Zaid, Juditsky, Anatoli, and Nemirovski, Arkadi. Conditional gradient algorithms for norm-regularized smooth convex optimization. Mathematical Programming, 152(1-2):75–112, 2015.

Hazan, Elad and Kale, Satyen. Projection-free online learning. In Proceedings of the 29th International Conference on Machine Learning, 2012.

Hazan, Elad, Kale, Satyen, and Shalev-Shwartz, Shai. Near-optimal algorithms for online matrix prediction. In COLT 2012 - The 25th Annual Conference on Learning Theory, June 25-27, 2012, Edinburgh, Scotland, pp. 38.1–38.13, 2012.

Jaggi, Martin. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In Proceedings of the 30th International Conference on Machine Learning, pp. 427–435, 2013.

Johnson, Rie and Zhang, Tong. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems 27, pp. 315–323, 2013.

Lacoste-Julien, Simon and Jaggi, Martin. On the global linear convergence of Frank-Wolfe optimization variants. In Advances in Neural Information Processing Systems 29, pp. 496–504, 2015.

Lan, Guanghui and Zhou, Yi. Conditional gradient sliding for convex optimization. Optimization-Online preprint (4605), 2014.

Mahdavi, Mehrdad, Zhang, Lijun, and Jin, Rong. Mixed optimization for smooth functions. In Advances in Neural Information Processing Systems, pp. 674–682, 2013.

Moritz, Philipp, Nishihara, Robert, and Jordan, Michael I. A linearly-convergent stochastic L-BFGS algorithm. In Proceedings of the Nineteenth International Conference on Artificial Intelligence and Statistics, 2016.

Nesterov, Yu. E. A method of solving a convex programming problem with convergence rate O(1/k²). In Soviet Mathematics Doklady, volume 27, pp. 372–376, 1983.

Zhang, Xinhua, Schuurmans, Dale, and Yu, Yao-liang. Accelerated training for matrix-norm regularization: A boosting approach. In Advances in Neural Information Processing Systems 26, pp. 2906–2914, 2012.

Supplementary material for "Variance-Reduced and Projection-Free Stochastic Optimization"

A. Proof of Property (1)

Proof. We drop the subscript $i$ for conciseness. Define $g(w) = f(w) - \nabla f(v)^\top w$, which is clearly also convex and $L$-smooth over $\mathbb{R}^d$. Since $\nabla g(v) = 0$, $v$ is one of the minimizers of $g$. Therefore we have
$$\begin{aligned}
g(v) - g(w) &\le g\Big(w - \tfrac{1}{L}\nabla g(w)\Big) - g(w) \\
&\le \nabla g(w)^\top\Big(w - \tfrac{1}{L}\nabla g(w) - w\Big) + \tfrac{L}{2}\Big\|w - \tfrac{1}{L}\nabla g(w) - w\Big\|^2 \qquad \text{(by smoothness of $g$)} \\
&= -\tfrac{1}{2L}\|\nabla g(w)\|^2 = -\tfrac{1}{2L}\|\nabla f(w) - \nabla f(v)\|^2.
\end{aligned}$$
Rearranging and plugging in the definition of $g$ concludes the proof.

B. Analysis for SFW

The concrete update of SFW is

$$v_k = \operatorname{argmin}_{v \in \Omega} \tilde\nabla_k^\top v, \qquad w_k = (1-\gamma_k)w_{k-1} + \gamma_k v_k,$$
where $\tilde\nabla_k$ is the average of $m_k$ iid samples of the stochastic gradient $\nabla f_i(w_{k-1})$. The convergence rate of SFW is presented below.

Theorem 3. If each $f_i$ is $G$-Lipschitz, then with $\gamma_k = \frac{2}{k+1}$ and $m_k = \big(\frac{G(k+1)}{LD}\big)^2$, SFW ensures for any $k$,
$$\mathbb{E}[f(w_k) - f(w^*)] \le \frac{4LD^2}{k+2}.$$

Proof. Similar to the proof of Lemma 2, we first proceed as follows:
$$\begin{aligned}
f(w_k) &\le f(w_{k-1}) + \nabla f(w_{k-1})^\top(w_k - w_{k-1}) + \tfrac{L}{2}\|w_k - w_{k-1}\|^2 &&\text{(smoothness)} \\
&= f(w_{k-1}) + \gamma_k\nabla f(w_{k-1})^\top(v_k - w_{k-1}) + \tfrac{L\gamma_k^2}{2}\|v_k - w_{k-1}\|^2 &&\text{($w_k - w_{k-1} = \gamma_k(v_k - w_{k-1})$)} \\
&\le f(w_{k-1}) + \gamma_k\tilde\nabla_k^\top(v_k - w_{k-1}) + \gamma_k(\nabla f(w_{k-1}) - \tilde\nabla_k)^\top(v_k - w_{k-1}) + \tfrac{LD^2\gamma_k^2}{2} &&\text{($\|v_k - w_{k-1}\| \le D$)} \\
&\le f(w_{k-1}) + \gamma_k\tilde\nabla_k^\top(w^* - w_{k-1}) + \gamma_k(\nabla f(w_{k-1}) - \tilde\nabla_k)^\top(v_k - w_{k-1}) + \tfrac{LD^2\gamma_k^2}{2} &&\text{(by optimality of $v_k$)} \\
&= f(w_{k-1}) + \gamma_k\nabla f(w_{k-1})^\top(w^* - w_{k-1}) + \gamma_k(\nabla f(w_{k-1}) - \tilde\nabla_k)^\top(v_k - w^*) + \tfrac{LD^2\gamma_k^2}{2} \\
&\le f(w_{k-1}) + \gamma_k(f(w^*) - f(w_{k-1})) + \gamma_k D\|\tilde\nabla_k - \nabla f(w_{k-1})\| + \tfrac{LD^2\gamma_k^2}{2},
\end{aligned}$$
where the last step is by convexity and the Cauchy-Schwarz inequality. Since each $f_i$ is $G$-Lipschitz, with Jensen's inequality we further have $\mathbb{E}[\|\tilde\nabla_k - \nabla f(w_{k-1})\|] \le \sqrt{\mathbb{E}[\|\tilde\nabla_k - \nabla f(w_{k-1})\|^2]} \le \frac{G}{\sqrt{m_k}}$, which is at most $\frac{LD\gamma_k}{2}$ with the choice of $\gamma_k$ and $m_k$. So we arrive at $\mathbb{E}[f(w_k) - f(w^*)] \le (1-\gamma_k)\mathbb{E}[f(w_{k-1}) - f(w^*)] + LD^2\gamma_k^2$. It remains to use a simple induction to conclude the proof.

Now it is clear that to achieve $1-\epsilon$ accuracy, SFW needs $O(\frac{LD^2}{\epsilon})$ iterations, and in total $O\big(\frac{G^2}{L^2D^2}\big(\frac{LD^2}{\epsilon}\big)^3\big) = O\big(\frac{G^2LD^4}{\epsilon^3}\big)$ stochastic gradients.

C. Proof of Lemma 3

Proof. Let $\delta_s = \tilde\nabla_s - \nabla f(z_s)$. For any $s \le k$, we proceed as follows:
$$\begin{aligned}
f(y_s) &\le f(z_s) + \nabla f(z_s)^\top(y_s - z_s) + \tfrac{L}{2}\|y_s - z_s\|^2 &&\text{(by smoothness)} \\
&= (1-\gamma_s)\big(f(z_s) + \nabla f(z_s)^\top(y_{s-1} - z_s)\big) + \gamma_s\big(f(z_s) + \nabla f(z_s)^\top(w^* - z_s)\big) + \gamma_s\nabla f(z_s)^\top(x_s - w^*) + \tfrac{L\gamma_s^2}{2}\|x_s - x_{s-1}\|^2 &&\text{(by definition of $y_s$ and $z_s$)} \\
&\le (1-\gamma_s)f(y_{s-1}) + \gamma_s f(w^*) + \gamma_s\nabla f(z_s)^\top(x_s - w^*) + \tfrac{L\gamma_s^2}{2}\|x_s - x_{s-1}\|^2 &&\text{(by convexity)} \\
&= (1-\gamma_s)f(y_{s-1}) + \gamma_s f(w^*) + \gamma_s\tilde\nabla_s^\top(x_s - w^*) + \tfrac{L\gamma_s^2}{2}\|x_s - x_{s-1}\|^2 + \gamma_s\delta_s^\top(w^* - x_s) \\
&\le (1-\gamma_s)f(y_{s-1}) + \gamma_s f(w^*) + \gamma_s\eta_{t,s} - \gamma_s\beta_s(x_s - x_{s-1})^\top(x_s - w^*) + \tfrac{L\gamma_s^2}{2}\|x_s - x_{s-1}\|^2 + \gamma_s\delta_s^\top(w^* - x_s) &&\text{(by Eq. (4))} \\
&= (1-\gamma_s)f(y_{s-1}) + \gamma_s f(w^*) + \gamma_s\eta_{t,s} + \tfrac{\beta_s\gamma_s}{2}\big(\|x_{s-1} - w^*\|^2 - \|x_s - w^*\|^2\big) + \tfrac{\gamma_s}{2}\Big((L\gamma_s - \beta_s)\|x_s - x_{s-1}\|^2 + 2\delta_s^\top(x_{s-1} - x_s)\Big) + \gamma_s\delta_s^\top(w^* - x_{s-1}) \\
&\le (1-\gamma_s)f(y_{s-1}) + \gamma_s f(w^*) + \gamma_s\eta_{t,s} + \tfrac{\beta_s\gamma_s}{2}\big(\|x_{s-1} - w^*\|^2 - \|x_s - w^*\|^2\big) + \tfrac{\gamma_s}{2}\Big(\tfrac{\|\delta_s\|^2}{\beta_s - L\gamma_s} + 2\delta_s^\top(w^* - x_{s-1})\Big),
\end{aligned}$$
where the last inequality is by the fact $\beta_s \ge L\gamma_s$ and thus
$$(L\gamma_s - \beta_s)\|x_s - x_{s-1}\|^2 + 2\delta_s^\top(x_{s-1} - x_s) = \frac{\|\delta_s\|^2}{\beta_s - L\gamma_s} - (\beta_s - L\gamma_s)\Big\|x_s - x_{s-1} - \frac{\delta_s}{\beta_s - L\gamma_s}\Big\|^2 \le \frac{\|\delta_s\|^2}{\beta_s - L\gamma_s}.$$

Note that $\mathbb{E}[\delta_s^\top(w^* - x_{s-1})] = 0$. So with the condition $\mathbb{E}[\|\delta_s\|^2] \le \frac{L^2D_t^2}{N_t(s+1)^2} \stackrel{\text{def}}{=} \sigma_s^2$ we arrive at

$$\mathbb{E}[f(y_s) - f(w^*)] \le (1-\gamma_s)\mathbb{E}[f(y_{s-1}) - f(w^*)] + \gamma_s\Big(\eta_{t,s} + \frac{\beta_s}{2}\big(\mathbb{E}[\|x_{s-1} - w^*\|^2] - \mathbb{E}[\|x_s - w^*\|^2]\big) + \frac{\sigma_s^2}{2(\beta_s - L\gamma_s)}\Big).$$

Now define $\Gamma_s = \Gamma_{s-1}(1-\gamma_s)$ when $s > 1$ and $\Gamma_1 = 1$. By induction, one can verify $\Gamma_s = \frac{2}{s(s+1)}$ and the following:

$$\mathbb{E}[f(y_k) - f(w^*)] \le \Gamma_k\sum_{s=1}^k\frac{\gamma_s}{\Gamma_s}\Big(\eta_{t,s} + \frac{\beta_s}{2}\big(\mathbb{E}[\|x_{s-1} - w^*\|^2] - \mathbb{E}[\|x_s - w^*\|^2]\big) + \frac{\sigma_s^2}{2(\beta_s - L\gamma_s)}\Big),$$
which is at most

$$\Gamma_k\sum_{s=1}^k\frac{\gamma_s}{\Gamma_s}\Big(\eta_{t,s} + \frac{\sigma_s^2}{2(\beta_s - L\gamma_s)}\Big) + \frac{\Gamma_k}{2}\Bigg(\frac{\gamma_1\beta_1}{\Gamma_1}\mathbb{E}[\|x_0 - w^*\|^2] + \sum_{s=2}^k\Big(\frac{\gamma_s\beta_s}{\Gamma_s} - \frac{\gamma_{s-1}\beta_{s-1}}{\Gamma_{s-1}}\Big)\mathbb{E}[\|x_{s-1} - w^*\|^2]\Bigg).$$

Finally, plugging in the parameters $\gamma_s$, $\beta_s$, $\eta_{t,s}$, $\Gamma_s$ and the bound $\mathbb{E}[\|x_0 - w^*\|^2] \le D_t^2$ concludes the proof:

$$\mathbb{E}[f(y_k) - f(w^*)] \le \frac{2}{k(k+1)}\sum_{s=1}^k\Big(\frac{2LD_t^2}{N_t} + \frac{LD_t^2}{2N_t}\Big) + \frac{3LD_t^2}{k(k+1)} \le \frac{8LD_t^2}{k(k+1)},$$
where the last step uses $k \le N_t$.