Variance-Reduced and Projection-Free Stochastic Optimization

Elad Hazan [email protected] Princeton University, Princeton, NJ 08540, USA

Haipeng Luo [email protected] Princeton University, Princeton, NJ 08540, USA

Abstract

The Frank-Wolfe optimization algorithm has recently regained popularity for machine learning applications due to its projection-free property and its ability to handle structured constraints. However, in the stochastic learning setting, it is still relatively understudied compared to the gradient descent counterpart. In this work, leveraging a recent variance reduction technique, we propose two stochastic Frank-Wolfe variants which substantially improve previous results in terms of the number of stochastic gradient evaluations needed to achieve $1-\epsilon$ accuracy. For example, we improve from $O(\frac{1}{\epsilon})$ to $O(\ln\frac{1}{\epsilon})$ if the objective function is smooth and strongly convex, and from $O(\frac{1}{\epsilon^2})$ to $O(\frac{1}{\epsilon^{1.5}})$ if the objective function is smooth and Lipschitz. The theoretical improvement is also observed in experiments on real-world datasets for a multiclass classification application.

1. Introduction

We consider the following optimization problem
$$\min_{w \in \Omega} f(w) = \min_{w \in \Omega} \frac{1}{n}\sum_{i=1}^n f_i(w),$$
which is an extremely common objective in machine learning. We are interested in the case where 1) $n$, usually corresponding to the number of training examples, is very large and therefore stochastic optimization is much more efficient; and 2) the domain $\Omega$ admits fast linear optimization, while projecting onto it is much slower, necessitating projection-free optimization algorithms. Examples of such problems include multiclass classification, multitask learning, recommendation systems, matrix learning and many more (see for example (Hazan & Kale, 2012; Hazan et al., 2012; Jaggi, 2013; Dudik et al., 2012; Zhang et al., 2012; Harchaoui et al., 2015)).

The Frank-Wolfe algorithm (Frank & Wolfe, 1956) (also known as conditional gradient) and its variants are natural candidates for solving these problems, due to their projection-free property and their ability to handle structured constraints. However, despite gaining more popularity recently, their applicability and efficiency in the stochastic learning setting, where computing stochastic gradients is much faster than computing exact gradients, is still relatively understudied compared to variants of projected gradient descent methods.

In this work, we thus try to answer the following question: what running time can a projection-free algorithm achieve in terms of the number of stochastic gradient evaluations and the number of linear optimizations needed to achieve a certain accuracy? Utilizing Nesterov's acceleration technique (Nesterov, 1983) and the recent variance reduction idea (Johnson & Zhang, 2013; Mahdavi et al., 2013), we propose two new algorithms that are substantially faster than previous work. Specifically, to achieve $1-\epsilon$ accuracy, the number of linear optimizations is the same as in previous work, while the improvement in the number of stochastic gradient evaluations is summarized in Table 1:

|                             | previous work           | this work                    |
| Smooth                      | $O(1/\epsilon^2)$       | $O(1/\epsilon^{1.5})$        |
| Smooth and Strongly Convex  | $O(1/\epsilon)$         | $O(\ln(1/\epsilon))$         |

Table 1: Comparison of the number of stochastic gradient evaluations

The extra overhead of our algorithms is computing at most $O(\ln\frac{1}{\epsilon})$ exact gradients, which is computationally insignificant compared to the other operations. A more detailed comparison to previous work is included in Table 2, which will be further explained in Section 2.
While the idea of our algorithms is quite straightforward, we emphasize that our analysis is non-trivial, especially for the second algorithm, where the convergence of a sequence of auxiliary points in Nesterov's algorithm needs to be shown.

To support our theoretical results, we also conducted experiments on three large real-world datasets for a multiclass classification application. These experiments show significant improvement over both previous projection-free algorithms and algorithms such as projected stochastic gradient descent and its variance-reduced version.

The rest of the paper is organized as follows: Section 2 sets up the problem more formally and discusses related work. Our two new algorithms are presented and analyzed in Sections 3 and 4, followed by experiment details in Section 5.

2. Preliminary and Related Work

We assume each function $f_i$ is convex and $L$-smooth in $\mathbb{R}^d$, so that for any $w, v \in \mathbb{R}^d$,¹
$$\nabla f_i(v)^\top (w-v) \;\le\; f_i(w) - f_i(v) \;\le\; \nabla f_i(v)^\top (w-v) + \tfrac{L}{2}\|w-v\|^2.$$
We will use two more important properties of smoothness. The first one is
$$\|\nabla f_i(w) - \nabla f_i(v)\|^2 \;\le\; 2L\big(f_i(w) - f_i(v) - \nabla f_i(v)^\top (w-v)\big) \qquad (1)$$
(proven in Appendix A for completeness), and the second one is
$$f_i(\lambda w + (1-\lambda)v) \;\ge\; \lambda f_i(w) + (1-\lambda)f_i(v) - \tfrac{L}{2}\lambda(1-\lambda)\|w-v\|^2 \qquad (2)$$
for any $w, v \in \Omega$ and $\lambda \in [0, 1]$. Notice that $f = \frac{1}{n}\sum_{i=1}^n f_i$ is also $L$-smooth since smoothness is preserved under convex combinations.

For some cases, we also assume each $f_i$ is $G$-Lipschitz, that is, $\|\nabla f_i(w)\| \le G$ for any $w \in \Omega$, and that $f$ (although not necessarily each $f_i$) is $\alpha$-strongly convex, that is,
$$f(w) - f(v) \;\le\; \nabla f(w)^\top (w-v) - \tfrac{\alpha}{2}\|w-v\|^2$$
for any $w, v \in \Omega$. As usual, $\mu = \frac{L}{\alpha}$ is called the condition number of $f$.

We assume the domain $\Omega \subset \mathbb{R}^d$ is a compact convex set with diameter $D$. We are interested in the case where linear optimization on $\Omega$, formally $\operatorname{argmin}_{v \in \Omega} w^\top v$ for any $w \in \mathbb{R}^d$, is much faster than projection onto $\Omega$, formally $\operatorname{argmin}_{v \in \Omega} \|w - v\|^2$. Examples of such domains include the set of all bounded trace norm matrices, the convex hull of all rotation matrices, the flow polytope and many more (see for instance (Hazan & Kale, 2012)).

2.1. Example Application: Multiclass Classification

Consider a multiclass classification problem where a set of training examples $(e_i, y_i)_{i=1,\dots,n}$ is given beforehand. Here $e_i \in \mathbb{R}^m$ is a feature vector and $y_i \in \{1, \dots, h\}$ is the label. Our goal is to find an accurate linear predictor, a matrix $w = [w_1^\top; \dots; w_h^\top] \in \mathbb{R}^{h \times m}$ that predicts $\operatorname{argmax}_\ell w_\ell^\top e$ for any example $e$. Note that here the dimensionality $d$ is $hm$.

Previous work (Dudik et al., 2012; Zhang et al., 2012) found that finding $w$ by minimizing a regularized multivariate logistic loss gives a very accurate predictor in general. Specifically, the objective can be written in our notation with
$$f_i(w) = \log\Big(1 + \sum_{\ell \ne y_i} \exp(w_\ell^\top e_i - w_{y_i}^\top e_i)\Big)$$
and $\Omega = \{w \in \mathbb{R}^{h \times m} : \|w\|_* \le \tau\}$, where $\|\cdot\|_*$ denotes the matrix trace norm. In this case, projecting onto $\Omega$ is equivalent to performing an SVD, which takes $O(hm\min\{h, m\})$ time, while linear optimization on $\Omega$ amounts to finding the top singular vector, which can be done in time linear in the number of non-zeros of the corresponding $h$ by $m$ matrix, and is thus much faster. One can also verify that each $f_i$ is smooth. The number of examples $n$ can be prohibitively large for non-stochastic methods (for instance, tens of millions for the ImageNet dataset (Deng et al., 2009)), which makes stochastic optimization necessary.
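To make the contrast concrete, the following sketch (our own illustration, not part of the paper; the function name and the use of scipy.sparse.linalg.svds are assumptions) implements the linear optimization oracle over the trace norm ball via a single top-singular-vector computation:

```python
import numpy as np
from scipy.sparse.linalg import svds

def linear_opt_trace_ball(grad, tau):
    """Solve argmin_{||V||_* <= tau} <grad, V> for an h-by-m gradient matrix.

    The minimizer is the rank-one matrix -tau * u1 v1^T, where (u1, v1) is the
    top singular vector pair of grad, so no full SVD (and no projection) is needed.
    """
    u, s, vt = svds(grad.astype(np.float64), k=1)  # top singular triplet only
    return -tau * np.outer(u[:, 0], vt[0, :])

# Small usage example with a random "gradient" for h = 5 classes, m = 8 features.
rng = np.random.default_rng(0)
G = rng.standard_normal((5, 8))
V = linear_opt_trace_ball(G, tau=50.0)
print(np.sum(G * V))  # <G, V>, the minimum of the linear objective over the ball
```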
2.2. Detailed Efficiency Comparisons

We call $\nabla f_i(w)$ a stochastic gradient for $f$ at some $w$, where $i$ is picked from $\{1, \dots, n\}$ uniformly at random. Note that a stochastic gradient $\nabla f_i(w)$ is an unbiased estimator of the exact gradient $\nabla f(w)$.

The efficiency of a projection-free algorithm is measured by how many exact gradient evaluations, stochastic gradient evaluations and linear optimizations, respectively, are needed to achieve $1-\epsilon$ accuracy, that is, to output a point $w \in \Omega$ such that $\mathbb{E}[f(w) - f(w^*)] \le \epsilon$, where $w^* \in \operatorname{argmin}_{w \in \Omega} f(w)$ is any optimum.

In Table 2, we summarize the efficiency (and the extra assumptions needed besides convexity and smoothness²) of existing algorithms in the literature as well as of the two new algorithms we propose. Below we briefly explain these results from top to bottom.

¹We thank Sebastian Pokutta and Gábor Braun for pointing out that $f_i$ needs to be defined over $\mathbb{R}^d$, rather than only over $\Omega$, in order for property (1) to hold.
²In general, the condition "$G$-Lipschitz" in Table 2 means that each $f_i$ is $G$-Lipschitz, except for our STORC algorithm, which only requires $f$ to be $G$-Lipschitz.

| Algorithm | Extra Conditions | #Exact Gradients | #Stochastic Gradients | #Linear Optimizations |
| Frank-Wolfe | | $O(\frac{LD^2}{\epsilon})$ | 0 | $O(\frac{LD^2}{\epsilon})$ |
| (Garber & Hazan, 2013) | $\alpha$-strongly convex, $\Omega$ is polytope | $O(d\mu\rho\ln\frac{LD^2}{\epsilon})$ | 0 | $O(d\mu\rho\ln\frac{LD^2}{\epsilon})$ |
| SFW | $G$-Lipschitz | 0 | $O(\frac{G^2LD^4}{\epsilon^3})$ | $O(\frac{LD^2}{\epsilon})$ |
| Online-FW (Hazan & Kale, 2012) | $G$-Lipschitz | 0 | $O(\frac{d^2(LD^2+GD)^4}{\epsilon^4})$ | $O(\frac{d(LD^2+GD)^2}{\epsilon^2})$ |
| | $G$-Lipschitz ($L = \infty$ allowed) | 0 | $O(\frac{G^4D^4}{\epsilon^4})$ | $O(\frac{G^4D^4}{\epsilon^4})$ |
| SCGS (Lan & Zhou, 2014) | $G$-Lipschitz | 0 | $O(\frac{G^2D^2}{\epsilon^2})$ | $O(\frac{LD^2}{\epsilon})$ |
| | $G$-Lipschitz, $\alpha$-strongly convex | 0 | $O(\frac{G^2}{\alpha\epsilon})$ | $O(\frac{LD^2}{\epsilon})$ |
| SVRF (this work) | | $O(\ln\frac{LD^2}{\epsilon})$ | $O(\frac{L^2D^4}{\epsilon^2})$ | $O(\frac{LD^2}{\epsilon})$ |
| STORC (this work) | $G$-Lipschitz | $O(\ln\frac{LD^2}{\epsilon})$ | $O(\frac{LD^2}{\epsilon} + \frac{\sqrt{L}D^2G}{\epsilon^{1.5}})$ | $O(\frac{LD^2}{\epsilon})$ |
| | $\nabla f(w^*) = 0$ | $O(\ln\frac{LD^2}{\epsilon})$ | $O(\frac{LD^2}{\epsilon})$ | $O(\frac{LD^2}{\epsilon})$ |
| | $\alpha$-strongly convex | $O(\ln\frac{LD^2}{\epsilon})$ | $O(\mu^2\ln\frac{LD^2}{\epsilon})$ | $O(\frac{LD^2}{\epsilon})$ |

Table 2: Comparison of different Frank-Wolfe variants (see Section 2.2 for further explanations).

The standard Frank-Wolfe algorithm,
$$v_k = \operatorname{argmin}_{v \in \Omega} \nabla f(w_{k-1})^\top v, \qquad w_k = (1-\gamma_k)w_{k-1} + \gamma_k v_k, \qquad (3)$$
for some appropriately chosen $\gamma_k$, requires $O(\frac{1}{\epsilon})$ iterations without additional conditions (Frank & Wolfe, 1956; Jaggi, 2013). In a recent paper, Garber & Hazan (2013) give a variant that requires $O(d\mu\rho\ln\frac{1}{\epsilon})$ iterations when $f$ is strongly convex and smooth, and $\Omega$ is a polytope.³ Although the dependence on $\epsilon$ is much better, the geometric constant $\rho$ depends on the polyhedral set and can be very large. Moreover, each iteration of the algorithm requires further computation besides the linear optimization step.

The most obvious way to obtain a stochastic Frank-Wolfe variant is to replace $\nabla f(w_{k-1})$ by some $\nabla f_i(w_{k-1})$, or more generally by the average of some number of iid samples of $\nabla f_i(w_{k-1})$ (a mini-batch approach). We call this method SFW and include its analysis in Appendix B, since we do not find it explicitly analyzed before. SFW needs $O(\frac{1}{\epsilon^3})$ stochastic gradients and $O(\frac{1}{\epsilon})$ linear optimization steps to reach an $\epsilon$-approximate optimum.
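For concreteness, one SFW update can be sketched as follows (our own illustration; stoch_grad and lin_opt are assumed user-supplied callables for $\nabla f_i$ and the linear optimization oracle, for example the trace-norm oracle of Section 2.1):

```python
import numpy as np

def sfw_step(w, stoch_grad, lin_opt, n, k, batch_size, rng):
    """One SFW iteration: Eq. (3) with the exact gradient replaced by a
    mini-batch average of stochastic gradients.

    stoch_grad(w, i) returns grad f_i(w); lin_opt(g) returns
    argmin_{v in Omega} <g, v>.
    """
    idx = rng.integers(0, n, size=batch_size)                 # iid uniform indices
    g_tilde = np.mean([stoch_grad(w, i) for i in idx], axis=0)  # mini-batch gradient
    v = lin_opt(g_tilde)                                      # linear optimization
    gamma = 2.0 / (k + 1)                                     # step size gamma_k
    return (1.0 - gamma) * w + gamma * v
```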

The work by Hazan & Kale (2012) focuses on an online learning setting. One can extract two results from this work for the setting studied here.⁴ In any case, the result is worse than SFW for both the number of stochastic gradients and the number of linear optimizations.

Stochastic Conditional Gradient Sliding (SCGS), recently proposed by Lan & Zhou (2014), uses Nesterov's acceleration technique to speed up Frank-Wolfe. Without strong convexity, SCGS needs $O(\frac{1}{\epsilon^2})$ stochastic gradients, improving upon SFW. With strong convexity, this number can even be improved to $O(\frac{1}{\epsilon})$. In both cases, the number of linear optimization steps is $O(\frac{1}{\epsilon})$.

The key idea of our algorithms is to combine the variance reduction technique proposed in (Johnson & Zhang, 2013; Mahdavi et al., 2013) with some of the above-mentioned algorithms. For example, our algorithm SVRF combines this technique with SFW, also improving the number of stochastic gradients from $O(\frac{1}{\epsilon^3})$ to $O(\frac{1}{\epsilon^2})$, but without any extra conditions (such as the Lipschitzness required for SCGS). More importantly, despite having a seemingly identical convergence rate, SVRF substantially outperforms SCGS empirically (see Section 5).

On the other hand, our second algorithm STORC combines variance reduction with SCGS, providing even further improvements. Specifically, the number of stochastic gradients is improved to: $O(\frac{1}{\epsilon^{1.5}})$ when $f$ is Lipschitz; $O(\frac{1}{\epsilon})$ when $\nabla f(w^*) = 0$; and finally $O(\ln\frac{1}{\epsilon})$ when $f$ is strongly convex. Note that the condition $\nabla f(w^*) = 0$ essentially means that $w^*$ is in the interior of $\Omega$, but it is still an interesting case when the optimum is not unique and doing unconstrained optimization would not necessarily return a point in $\Omega$.

Both of our algorithms require $O(\frac{1}{\epsilon})$ linear optimization steps as in previous work, and overall require computing $O(\ln\frac{LD^2}{\epsilon})$ exact gradients. However, we emphasize that this extra overhead is much more affordable than non-stochastic Frank-Wolfe (that is, computing exact gradients every iteration), since it does not have any polynomial dependence on parameters such as $d$, $L$ or $\mu$.

³See also recent follow-up work (Lacoste-Julien & Jaggi, 2015).
⁴The first result comes from the setting where the online loss functions are stochastic, and the second one comes from a completely online setting with the standard online-to-batch conversion.

2.3. Variance-Reduced Stochastic Gradients

Originally proposed in (Johnson & Zhang, 2013) and independently in (Mahdavi et al., 2013), the idea of variance-reduced stochastic gradients has proven to be highly useful and has been extended to various other algorithms (such as (Frostig et al., 2015; Moritz et al., 2016)).

A variance-reduced stochastic gradient at some point $w \in \Omega$ with some snapshot $w_0 \in \Omega$ is defined as
$$\tilde\nabla f(w; w_0) = \nabla f_i(w) - (\nabla f_i(w_0) - \nabla f(w_0)),$$
where $i$ is again picked from $\{1, \dots, n\}$ uniformly at random. The snapshot $w_0$ is usually a decision point from some previous iteration of the algorithm whose exact gradient $\nabla f(w_0)$ has been pre-computed, so that computing $\tilde\nabla f(w; w_0)$ only requires two standard stochastic gradient evaluations: $\nabla f_i(w)$ and $\nabla f_i(w_0)$.

A variance-reduced stochastic gradient is clearly also unbiased, that is, $\mathbb{E}[\tilde\nabla f(w; w_0)] = \nabla f(w)$. More importantly, the term $\nabla f_i(w_0) - \nabla f(w_0)$ serves as a correction term that reduces the variance of the stochastic gradient. Formally, one can prove the following

Lemma 1. For any $w, w_0 \in \Omega$, we have
$$\mathbb{E}[\|\tilde\nabla f(w; w_0) - \nabla f(w)\|^2] \le 6L\big(2\mathbb{E}[f(w) - f(w^*)] + \mathbb{E}[f(w_0) - f(w^*)]\big).$$

In words, the variance of the variance-reduced stochastic gradient is bounded by how close the current point and the snapshot are to the optimum. The original work proves a bound on $\mathbb{E}[\|\tilde\nabla f(w; w_0)\|^2]$ under the assumption $\nabla f(w^*) = 0$, which we do not require here. However, the main idea of the proof is similar, and we defer it to Section 6.
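In code, a mini-batch of variance-reduced stochastic gradients can be formed as follows (a minimal sketch of our own; the callables and names are assumptions, not from the paper):

```python
import numpy as np

def vr_gradient(w, w0, full_grad_w0, stoch_grad, n, batch_size, rng):
    """Average of `batch_size` iid variance-reduced stochastic gradients
    grad f_i(w) - (grad f_i(w0) - grad f(w0)), with i uniform in {0,...,n-1}.

    full_grad_w0 = grad f(w0) is computed once per snapshot and reused here,
    so each sample costs only two stochastic gradient evaluations.
    """
    idx = rng.integers(0, n, size=batch_size)
    samples = [stoch_grad(w, i) - (stoch_grad(w0, i) - full_grad_w0) for i in idx]
    return np.mean(samples, axis=0)
```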

3. Stochastic Variance-Reduced Frank-Wolfe

With the previous discussion, our first algorithm is pretty straightforward: compared to the standard Frank-Wolfe, we simply replace the exact gradient with the average of a mini-batch of variance-reduced stochastic gradients, and take snapshots every once in a while. We call this algorithm Stochastic Variance-Reduced Frank-Wolfe (SVRF); its pseudocode is presented in Algorithm 1.

Algorithm 1 Stochastic Variance-Reduced Frank-Wolfe (SVRF)
1: Input: Objective function $f = \frac{1}{n}\sum_{i=1}^n f_i$.
2: Input: Parameters $\gamma_k$, $m_k$ and $N_t$.
3: Initialize: $w_0 = \operatorname{argmin}_{w \in \Omega} \nabla f(x)^\top w$ for some arbitrary $x \in \Omega$.
4: for $t = 1, 2, \dots, T$ do
5:   Take snapshot: $x_0 = w_{t-1}$ and compute $\nabla f(x_0)$.
6:   for $k = 1$ to $N_t$ do
7:     Compute $\tilde\nabla_k$, the average of $m_k$ iid samples of $\tilde\nabla f(x_{k-1}; x_0)$.
8:     Compute $v_k = \operatorname{argmin}_{v \in \Omega} \tilde\nabla_k^\top v$.
9:     Compute $x_k = (1-\gamma_k)x_{k-1} + \gamma_k v_k$.
10:  end for
11:  Set $w_t = x_{N_t}$.
12: end for
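For readers who prefer running code, here is a minimal sketch of Algorithm 1 (our own illustration; the helper names full_grad, stoch_grad, lin_opt and x_arbitrary are assumptions, and the epoch lengths and batch sizes follow the parameter choice of Theorem 1 below):

```python
import numpy as np

def svrf(full_grad, stoch_grad, lin_opt, x_arbitrary, n, T, rng):
    """Sketch of SVRF. full_grad(w) = grad f(w); stoch_grad(w, i) = grad f_i(w);
    lin_opt(g) = argmin_{v in Omega} <g, v>."""
    w = lin_opt(full_grad(x_arbitrary))           # line 3: initial linear optimization
    for t in range(1, T + 1):
        x, g_snap = w, full_grad(w)               # line 5: snapshot x_0 = w_{t-1}
        N_t = 2 ** (t + 3) - 2
        for k in range(1, N_t + 1):
            m_k = 96 * (k + 1)                    # mini-batch size from Theorem 1
            idx = rng.integers(0, n, size=m_k)
            g = np.mean([stoch_grad(x, i) - stoch_grad(w, i) + g_snap
                         for i in idx], axis=0)   # variance-reduced gradient estimate
            v = lin_opt(g)                        # line 8: linear optimization
            gamma = 2.0 / (k + 1)
            x = (1.0 - gamma) * x + gamma * v     # line 9: Frank-Wolfe step
        w = x                                     # line 11: w_t = x_{N_t}
    return w
```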

The convergence rate of this algorithm is shown in the following theorem.

Theorem 1. With the parameters
$$\gamma_k = \frac{2}{k+1}, \qquad m_k = 96(k+1), \qquad N_t = 2^{t+3} - 2,$$
Algorithm 1 ensures $\mathbb{E}[f(w_t) - f(w^*)] \le \frac{LD^2}{2^{t+1}}$ for any $t$.

Before proving this theorem, we first state a direct implication of this convergence result.

Corollary 1. To achieve $1-\epsilon$ accuracy, Algorithm 1 requires $O(\ln\frac{LD^2}{\epsilon})$ exact gradient evaluations, $O(\frac{L^2D^4}{\epsilon^2})$ stochastic gradient evaluations and $O(\frac{LD^2}{\epsilon})$ linear optimizations.

Proof. According to the algorithm and the choice of parameters, it is clear that these three numbers are $T+1$, $\sum_{t=1}^T\sum_{k=1}^{N_t} m_k = O(4^T)$ and $\sum_{t=1}^T N_t = O(2^T)$ respectively. Theorem 1 implies that $T$ should be of order $\Theta(\log_2\frac{LD^2}{\epsilon})$. Plugging in all parameters concludes the proof.

To prove Theorem 1, we first consider a fixed iteration $t$ and prove the following lemma:

Lemma 2. For any $k$, we have
$$\mathbb{E}[f(x_k) - f(w^*)] \le \frac{4LD^2}{k+2}$$
if $\mathbb{E}[\|\tilde\nabla_s - \nabla f(x_{s-1})\|^2] \le \frac{L^2D^2}{(s+1)^2}$ for all $s \le k$.

We defer the proof of this lemma to Section 6 for coherence. With the help of Lemma 2, we are now ready to prove the main convergence result.

Proof of Theorem 1. We prove by induction. For $t = 0$, by smoothness, the optimality of $w_0$ and convexity, we have
$$f(w_0) \le f(x) + \nabla f(x)^\top(w_0 - x) + \tfrac{L}{2}\|w_0 - x\|^2 \le f(x) + \nabla f(x)^\top(w^* - x) + \tfrac{LD^2}{2} \le f(w^*) + \tfrac{LD^2}{2}.$$
Now assuming $\mathbb{E}[f(w_{t-1}) - f(w^*)] \le \frac{LD^2}{2^t}$, we consider iteration $t$ of the algorithm and use another induction to show $\mathbb{E}[f(x_k) - f(w^*)] \le \frac{4LD^2}{k+2}$ for any $k \le N_t$. The base case is trivial since $x_0 = w_{t-1}$. Suppose $\mathbb{E}[f(x_{s-1}) - f(w^*)] \le \frac{4LD^2}{s+1}$ for any $s \le k$. Now because $\tilde\nabla_s$ is the average of $m_s$ iid samples of $\tilde\nabla f(x_{s-1}; x_0)$, its variance is reduced by a factor of $m_s$. That is, with Lemma 1 we have
$$\mathbb{E}[\|\tilde\nabla_s - \nabla f(x_{s-1})\|^2] \le \frac{6L}{m_s}\big(2\mathbb{E}[f(x_{s-1}) - f(w^*)] + \mathbb{E}[f(x_0) - f(w^*)]\big) \le \frac{6L}{m_s}\Big(\frac{8LD^2}{s+1} + \frac{LD^2}{2^t}\Big) \le \frac{6L}{m_s}\Big(\frac{8LD^2}{s+1} + \frac{8LD^2}{s+1}\Big) = \frac{L^2D^2}{(s+1)^2},$$
where the last inequality is by the fact $s \le N_t = 2^{t+3} - 2$ and the last equality is by plugging in the choice of $m_s$. Therefore the condition of Lemma 2 is satisfied and the induction is completed. Finally, with the choice of $N_t$ we thus prove $\mathbb{E}[f(w_t) - f(w^*)] = \mathbb{E}[f(x_{N_t}) - f(w^*)] \le \frac{4LD^2}{N_t+2} = \frac{LD^2}{2^{t+1}}$.

We remark that in Algorithm 1 we essentially restart the algorithm (that is, reset $k$ to 1) after taking a new snapshot. Another option, however, is to keep increasing $k$ and never reset it. Although one can show that this only leads to a constant-factor speed-up in the convergence, it gives more stable updates and is thus what we implement in the experiments.

4. Stochastic Variance-Reduced Conditional Gradient Sliding

Our second algorithm applies variance reduction to the SCGS algorithm (Lan & Zhou, 2014). Again, the key difference is that we replace the stochastic gradients with the average of a mini-batch of variance-reduced stochastic gradients, and take snapshots every once in a while. See the pseudocode in Algorithm 2 for details.

Algorithm 2 STOchastic variance-Reduced Conditional gradient sliding (STORC)
1: Input: Objective function $f = \frac{1}{n}\sum_{i=1}^n f_i$.
2: Input: Parameters $\gamma_k$, $\beta_k$, $\eta_{t,k}$, $m_{t,k}$ and $N_t$.
3: Initialize: $w_0 = \operatorname{argmin}_{w \in \Omega} \nabla f(x)^\top w$ for some arbitrary $x \in \Omega$.
4: for $t = 1, 2, \dots$ do
5:   Take snapshot: $y_0 = w_{t-1}$ and compute $\nabla f(y_0)$.
6:   Initialize $x_0 = y_0$.
7:   for $k = 1$ to $N_t$ do
8:     Compute $z_k = (1-\gamma_k)y_{k-1} + \gamma_k x_{k-1}$.
9:     Compute $\tilde\nabla_k$, the average of $m_{t,k}$ iid samples of $\tilde\nabla f(z_k; y_0)$.
10:    Let $g(x) = \frac{\beta_k}{2}\|x - x_{k-1}\|^2 + \tilde\nabla_k^\top x$.
11:    Compute $x_k$, the output of running standard Frank-Wolfe to solve $\min_{x \in \Omega} g(x)$ until the duality gap is at most $\eta_{t,k}$, that is,
       $$\max_{x \in \Omega} \nabla g(x_k)^\top(x_k - x) \le \eta_{t,k}. \qquad (4)$$
12:    Compute $y_k = (1-\gamma_k)y_{k-1} + \gamma_k x_k$.
13:  end for
14:  Set $w_t = y_{N_t}$.
15: end for

The algorithm makes use of two auxiliary sequences $x_k$ and $z_k$ (Lines 8 and 12), which is standard for Nesterov's algorithm. $x_k$ is obtained by approximately solving a squared-norm-regularized linear optimization so that it stays close to $x_{k-1}$ (Line 11). Note that this step does not require computing any extra gradients of $f$ or $f_i$, and is done by performing the standard Frank-Wolfe algorithm (Eq. (3)) until the duality gap is at most a certain value $\eta_{t,k}$. The duality gap is a certificate of approximate optimality (see (Jaggi, 2013)), and is a side product of the linear optimization performed at each step, requiring no extra cost.

Also note that the stochastic gradients are computed at the sequence $z_k$ instead of $y_k$, which is also standard in Nesterov's algorithm. However, according to Lemma 1, we thus need to show the convergence rate of the auxiliary sequence $z_k$, which appears to be rarely studied previously, to the best of our knowledge. This is one of the key steps in our analysis.
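Line 11 is the only non-standard step, so we sketch it separately (again our own illustration with assumed helper names): the gradient of the quadratic model is cheap to form, and the duality gap used for stopping reuses the linear optimization already performed at each inner step.

```python
import numpy as np

def solve_subproblem(x_prev, g_tilde, beta, eta, lin_opt, max_iter=100000):
    """Approximately minimize g(x) = beta/2 * ||x - x_prev||^2 + <g_tilde, x>
    over Omega by standard Frank-Wolfe, stopping once the duality gap
    max_{u in Omega} <grad g(x), x - u> drops below eta (Eq. (4))."""
    x = x_prev
    for k in range(1, max_iter + 1):
        grad_g = beta * (x - x_prev) + g_tilde   # gradient of the quadratic model
        u = lin_opt(grad_g)                      # linear optimization step
        gap = np.sum(grad_g * (x - u))           # Frank-Wolfe duality gap at x
        if gap <= eta:
            break
        gamma = 2.0 / (k + 1)
        x = (1.0 - gamma) * x + gamma * u
    return x
```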
The main convergence result of STORC is the following:

Theorem 2. With the following parameters (where $D_t$ is defined in each case below),
$$\gamma_k = \frac{2}{k+1}, \qquad \beta_k = \frac{3L}{k}, \qquad \eta_{t,k} = \frac{2LD_t^2}{N_t k},$$
Algorithm 2 ensures $\mathbb{E}[f(w_t) - f(w^*)] \le \frac{LD^2}{2^{t+1}}$ for any $t$ if any of the following three cases holds:

(a) $\nabla f(w^*) = 0$, and $D_t = D$, $N_t = \lceil 2^{t/2+2}\rceil$, $m_{t,k} = 900N_t$.

(b) $f$ is $G$-Lipschitz, and $D_t = D$, $N_t = \lceil 2^{t/2+2}\rceil$, $m_{t,k} = 700N_t + \frac{24N_tG(k+1)}{LD}$.

(c) $f$ is $\alpha$-strongly convex, and $D_t^2 = \frac{\mu D^2}{2^{t-1}}$, $N_t = \lceil\sqrt{32\mu}\rceil$, $m_{t,k} = 5600N_t\mu$, where $\mu = \frac{L}{\alpha}$.

Again we first give a direct implication of the above result:

Corollary 2. To achieve $1-\epsilon$ accuracy, Algorithm 2 requires $O(\ln\frac{LD^2}{\epsilon})$ exact gradient evaluations and $O(\frac{LD^2}{\epsilon})$ linear optimizations. The numbers of stochastic gradient evaluations for Cases (a), (b) and (c) are respectively $O(\frac{LD^2}{\epsilon})$, $O(\frac{LD^2}{\epsilon} + \frac{\sqrt{L}D^2G}{\epsilon^{1.5}})$ and $O(\mu^2\ln\frac{LD^2}{\epsilon})$.

Proof. Line 11 requires $O(\frac{\beta_kD^2}{\eta_{t,k}})$ iterations of the standard Frank-Wolfe algorithm since $g(x)$ is $\beta_k$-smooth (see e.g. (Jaggi, 2013, Theorem 2)). So the numbers of exact gradient evaluations, stochastic gradient evaluations and linear optimizations are respectively $T+1$, $\sum_{t=1}^T\sum_{k=1}^{N_t} m_{t,k}$ and $O(\sum_{t=1}^T\sum_{k=1}^{N_t}\frac{\beta_kD^2}{\eta_{t,k}})$. Theorem 2 implies that $T$ should be of order $\Theta(\log_2\frac{LD^2}{\epsilon})$. Plugging in all parameters proves the corollary.

To prove Theorem 2, we again first consider a fixed iteration $t$ and use the following lemma, which is essentially proven in (Lan & Zhou, 2014). We include a distilled proof in Appendix C for completeness.

Lemma 3. Suppose $\mathbb{E}[\|y_0 - w^*\|^2] \le D_t^2$ holds for some positive constant $D_t \le D$. Then for any $k$, we have
$$\mathbb{E}[f(y_k) - f(w^*)] \le \frac{8LD_t^2}{k(k+1)}$$
if $\mathbb{E}[\|\tilde\nabla_s - \nabla f(z_s)\|^2] \le \frac{L^2D_t^2}{N_t(s+1)^2}$ for all $s \le k$.

Proof of Theorem 2. We prove by induction. The base case $t = 0$ holds by the exact same argument as in the proof of Theorem 1. Suppose $\mathbb{E}[f(w_{t-1}) - f(w^*)] \le \frac{LD^2}{2^t}$ and consider iteration $t$. Below we use another induction to prove $\mathbb{E}[f(y_k) - f(w^*)] \le \frac{8LD_t^2}{k(k+1)}$ for any $1 \le k \le N_t$, which will conclude the proof since, for any of the three cases, we have $\mathbb{E}[f(w_t) - f(w^*)] = \mathbb{E}[f(y_{N_t}) - f(w^*)]$, which is at most $\frac{8LD_t^2}{N_t^2} \le \frac{LD^2}{2^{t+1}}$.

We first show that the condition $\mathbb{E}[\|y_0 - w^*\|^2] \le D_t^2$ holds. This is trivial for Cases (a) and (b), where $D_t = D$. For Case (c), by strong convexity and the inductive assumption, we have $\mathbb{E}[\|y_0 - w^*\|^2] \le \frac{2}{\alpha}\mathbb{E}[f(y_0) - f(w^*)] \le \frac{LD^2}{\alpha 2^{t-1}} = D_t^2$.

Next note that Lemma 1 implies that $\mathbb{E}[\|\tilde\nabla_s - \nabla f(z_s)\|^2]$ is at most $\frac{6L}{m_{t,s}}(2\mathbb{E}[f(z_s) - f(w^*)] + \mathbb{E}[f(y_0) - f(w^*)])$. So the key is to bound $\mathbb{E}[f(z_s) - f(w^*)]$. With $z_1 = y_0$, one can verify that $\mathbb{E}[\|\tilde\nabla_1 - \nabla f(z_1)\|^2]$ is at most $\frac{18L}{m_{t,1}}\mathbb{E}[f(y_0) - f(w^*)] \le \frac{18L^2D^2}{m_{t,1}2^t} \le \frac{L^2D_t^2}{4N_t}$ for all three cases, and thus $\mathbb{E}[f(y_s) - f(w^*)] \le \frac{8LD_t^2}{s(s+1)}$ holds for $s = 1$ by Lemma 3. Now suppose it holds for any $s < k$; below we discuss the three cases separately to show that it also holds for $s = k$.

Case (a). By smoothness, the condition $\nabla f(w^*) = 0$, the construction of $z_s$, and the Cauchy-Schwarz inequality, we have for any $1 < s \le k$,
$$\begin{aligned}
f(z_s) &\le f(y_{s-1}) + (\nabla f(y_{s-1}) - \nabla f(w^*))^\top(z_s - y_{s-1}) + \tfrac{L}{2}\|z_s - y_{s-1}\|^2 \\
&= f(y_{s-1}) + \gamma_s(\nabla f(y_{s-1}) - \nabla f(w^*))^\top(x_{s-1} - y_{s-1}) + \tfrac{L\gamma_s^2}{2}\|x_{s-1} - y_{s-1}\|^2 \\
&\le f(y_{s-1}) + \gamma_s D\|\nabla f(y_{s-1}) - \nabla f(w^*)\| + \tfrac{LD^2\gamma_s^2}{2}.
\end{aligned}$$
Property (1) and the optimality of $w^*$ imply
$$\|\nabla f(y_{s-1}) - \nabla f(w^*)\|^2 \le 2L\big(f(y_{s-1}) - f(w^*) - \nabla f(w^*)^\top(y_{s-1} - w^*)\big) \le 2L(f(y_{s-1}) - f(w^*)).$$
So subtracting $f(w^*)$ and taking expectations on both sides, and applying Jensen's inequality and the inductive assumption, we have
$$\mathbb{E}[f(z_s) - f(w^*)] \le \mathbb{E}[f(y_{s-1}) - f(w^*)] + \gamma_s D\sqrt{2L\,\mathbb{E}[f(y_{s-1}) - f(w^*)]} + \frac{2LD^2}{(s+1)^2} \le \frac{8LD^2}{(s-1)s} + \frac{8LD^2}{(s+1)\sqrt{(s-1)s}} + \frac{2LD^2}{(s+1)^2} < \frac{55LD^2}{(s+1)^2}.$$
On the other hand, we have $\mathbb{E}[f(y_0) - f(w^*)] \le \frac{LD^2}{2^t} \le \frac{16LD^2}{(N_t-1)^2} < \frac{40LD^2}{(N_t+1)^2} \le \frac{40LD^2}{(s+1)^2}$. So $\mathbb{E}[\|\tilde\nabla_s - \nabla f(z_s)\|^2]$ is at most $\frac{900L^2D^2}{m_{t,s}(s+1)^2}$, and the choice of $m_{t,s}$ ensures that this bound is at most $\frac{L^2D^2}{N_t(s+1)^2}$, satisfying the condition of Lemma 3 and thus completing the induction.

Case (b). With the $G$-Lipschitz condition we proceed similarly and bound $f(z_s)$ by
$$f(y_{s-1}) + \nabla f(y_{s-1})^\top(z_s - y_{s-1}) + \tfrac{L}{2}\|z_s - y_{s-1}\|^2 = f(y_{s-1}) + \gamma_s\nabla f(y_{s-1})^\top(x_{s-1} - y_{s-1}) + \tfrac{L\gamma_s^2}{2}\|x_{s-1} - y_{s-1}\|^2 \le f(y_{s-1}) + \gamma_s GD + \tfrac{LD^2\gamma_s^2}{2}.$$
So using the bounds derived previously and the choice of $m_{t,s}$, we bound $\mathbb{E}[\|\tilde\nabla_s - \nabla f(z_s)\|^2]$ as follows:
$$\frac{6L}{m_{t,s}}\Big(\frac{16LD^2}{(s-1)s} + \frac{4GD}{s+1} + \frac{4LD^2}{(s+1)^2} + \frac{40LD^2}{(s+1)^2}\Big) \le \frac{6L}{m_{t,s}}\Big(\frac{4GD}{s+1} + \frac{116LD^2}{(s+1)^2}\Big) < \frac{L^2D^2}{N_t(s+1)^2},$$
again completing the induction.

Case (c). Using the definitions of $z_s$ and $y_s$ and a direct calculation, one can remove the dependence on $x_s$ and verify
$$y_{s-1} = \frac{s+1}{2s-1}z_s + \frac{s-2}{2s-1}y_{s-2}$$
for any $s \ge 2$. Now we apply Property (2) with $\lambda = \frac{s+1}{2s-1}$:
$$\begin{aligned}
f(y_{s-1}) &\ge \frac{s+1}{2s-1}f(z_s) + \frac{s-2}{2s-1}f(y_{s-2}) - \frac{L}{2}\cdot\frac{(s+1)(s-2)}{(2s-1)^2}\|z_s - y_{s-2}\|^2 \\
&= f(w^*) + \frac{s+1}{2s-1}(f(z_s) - f(w^*)) + \frac{s-2}{2s-1}(f(y_{s-2}) - f(w^*)) - \frac{L(s-2)}{2(s+1)}\|y_{s-1} - y_{s-2}\|^2 \\
&\ge f(w^*) + \frac{1}{2}(f(z_s) - f(w^*)) - \frac{L}{2}\|y_{s-1} - y_{s-2}\|^2,
\end{aligned}$$
where the equality is by adding and subtracting $f(w^*)$ and the fact $y_{s-1} - y_{s-2} = \frac{s+1}{2s-1}(z_s - y_{s-2})$, and the last inequality is by $f(y_{s-2}) \ge f(w^*)$ and trivial relaxations. Rearranging gives $f(z_s) - f(w^*) \le 2(f(y_{s-1}) - f(w^*)) + L\|y_{s-1} - y_{s-2}\|^2$. Applying the Cauchy-Schwarz inequality, strong convexity and the fact $\mu \ge 1$, we continue with
$$\begin{aligned}
f(z_s) - f(w^*) &\le 2(f(y_{s-1}) - f(w^*)) + 2L\big(\|y_{s-1} - w^*\|^2 + \|y_{s-2} - w^*\|^2\big) \\
&\le 2(f(y_{s-1}) - f(w^*)) + 4\mu\big(f(y_{s-1}) - f(w^*) + f(y_{s-2}) - f(w^*)\big) \\
&\le 6\mu(f(y_{s-1}) - f(w^*)) + 4\mu(f(y_{s-2}) - f(w^*)).
\end{aligned}$$
For $s \ge 3$, we use the inductive assumption to show $\mathbb{E}[f(z_s) - f(w^*)] \le \frac{48\mu LD_t^2}{(s-1)s} + \frac{32\mu LD_t^2}{(s-2)(s-1)} \le \frac{448\mu LD_t^2}{(s+1)^2}$. The case $s = 2$ can be verified similarly using the bounds on $\mathbb{E}[f(y_0) - f(w^*)]$ and $\mathbb{E}[f(y_1) - f(w^*)]$ (the base case). Finally we bound the term $\mathbb{E}[f(y_0) - f(w^*)] \le \frac{LD^2}{2^t} = \frac{LD_t^2}{2\mu} \le \frac{32LD_t^2}{(N_t+1)^2} \le \frac{32LD_t^2}{(s+1)^2}$, and conclude that the variance $\mathbb{E}[\|\tilde\nabla_s - \nabla f(z_s)\|^2]$ is at most $\frac{6L}{m_{t,s}}\big(\frac{896\mu LD_t^2}{(s+1)^2} + \frac{32LD_t^2}{(s+1)^2}\big) \le \frac{L^2D_t^2}{N_t(s+1)^2}$, completing the induction by Lemma 3.

5. Experiments

To support our theory, we conduct experiments on the multiclass classification problem mentioned in Section 2.1. Three datasets are selected from the LIBSVM repository⁵ with relatively large numbers of features, categories and examples, summarized in Table 3.

⁵https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/

| dataset | #features | #categories | #examples |
| news20  | 62,061    | 20          | 15,935    |
| rcv1    | 47,236    | 53          | 15,564    |
| aloi    | 128       | 1,000       | 108,000   |

Table 3: Summary of datasets

Recall that the loss is the multivariate logistic loss and $\Omega$ is the set of matrices with bounded trace norm $\tau$. We focus on how fast the loss decreases instead of the final test error rate, so that the tuning of $\tau$ is less important; it is fixed to 50 throughout.

We compare six algorithms. Four of them (SFW, SCGS, SVRF, STORC) are projection-free as discussed, and the other two are standard projected stochastic gradient descent (SGD) and its variance-reduced version (SVRG (Johnson & Zhang, 2013)), both of which require expensive projections.

For most of the parameters in these algorithms, we roughly follow what the theory suggests. For example, the size of the mini-batch of stochastic gradients at round $k$ is set to $k^2$, $k^3$ and $k$ respectively for SFW, SCGS and SVRF, and is fixed to 100 for the other three. The number of iterations between taking two snapshots for the variance-reduced methods (SVRG, SVRF and STORC) is fixed to 50. The learning rate is set to the typical decaying sequence $c/\sqrt{k}$ for SGD and to a constant $c'$ for SVRG, as the original work suggests, for some best-tuned $c$ and $c'$.

Since the complexities of computing gradients, performing linear optimization and projecting are very different, we measure the actual running time of the algorithms and see how fast the loss decreases. Results can be found in Figure 1.

[Figure 1: Comparison of the six algorithms (SGD, SVRG, SCGS, STORC, SFW, SVRF) on the three multiclass datasets, (a) news20, (b) rcv1 and (c) aloi, plotting loss against running time in seconds; best viewed in color.]

For all datasets, one can clearly observe that SGD and SVRG are significantly slower than the others, due to the expensive projection step, highlighting the usefulness of projection-free algorithms. Moreover, we also observe large improvements gained from the variance reduction technique, especially when comparing SCGS and STORC, as well as SFW and SVRF on the aloi dataset. Interestingly, even though the STORC algorithm gives the best theoretical results, empirically the simpler algorithms SFW and SVRF tend to have consistently better performance.

6. Omitted Proofs

Proof of Lemma 1. Let $\mathbb{E}_i$ denote the conditional expectation given all the past except the realization of $i$. We have
$$\begin{aligned}
\mathbb{E}_i[\|\tilde\nabla f(w; w_0) - \nabla f(w)\|^2] &= \mathbb{E}_i[\|\nabla f_i(w) - \nabla f_i(w_0) + \nabla f(w_0) - \nabla f(w)\|^2] \\
&= \mathbb{E}_i[\|(\nabla f_i(w) - \nabla f_i(w^*)) - (\nabla f_i(w_0) - \nabla f_i(w^*)) + (\nabla f(w_0) - \nabla f(w^*)) - (\nabla f(w) - \nabla f(w^*))\|^2] \\
&\le 3\,\mathbb{E}_i\big[\|\nabla f_i(w) - \nabla f_i(w^*)\|^2 + \|(\nabla f_i(w_0) - \nabla f_i(w^*)) - (\nabla f(w_0) - \nabla f(w^*))\|^2 + \|\nabla f(w) - \nabla f(w^*)\|^2\big] \\
&\le 3\,\mathbb{E}_i\big[\|\nabla f_i(w) - \nabla f_i(w^*)\|^2 + \|\nabla f_i(w_0) - \nabla f_i(w^*)\|^2 + \|\nabla f(w) - \nabla f(w^*)\|^2\big],
\end{aligned}$$
where the first inequality is the Cauchy-Schwarz inequality, and the second one is by the fact $\mathbb{E}_i[\nabla f_i(w_0) - \nabla f_i(w^*)] = \nabla f(w_0) - \nabla f(w^*)$ and that the variance of a random variable is bounded by its second moment.

We now apply Property (1) to bound each of the three terms above. For example, $\mathbb{E}_i\|\nabla f_i(w) - \nabla f_i(w^*)\|^2 \le 2L\,\mathbb{E}_i[f_i(w) - f_i(w^*) - \nabla f_i(w^*)^\top(w - w^*)] = 2L(f(w) - f(w^*) - \nabla f(w^*)^\top(w - w^*))$, which is at most $2L(f(w) - f(w^*))$ by the optimality of $w^*$. Proceeding similarly for the other two terms concludes the proof.

Proof of Lemma 2. For any $s \le k$, by smoothness we have $f(x_s) \le f(x_{s-1}) + \nabla f(x_{s-1})^\top(x_s - x_{s-1}) + \frac{L}{2}\|x_s - x_{s-1}\|^2$. Plugging in $x_s = (1-\gamma_s)x_{s-1} + \gamma_s v_s$ gives $f(x_s) \le f(x_{s-1}) + \gamma_s\nabla f(x_{s-1})^\top(v_s - x_{s-1}) + \frac{L\gamma_s^2}{2}\|v_s - x_{s-1}\|^2$. Rewriting and using the fact that $\|v_s - x_{s-1}\| \le D$ leads to
$$f(x_s) \le f(x_{s-1}) + \gamma_s\tilde\nabla_s^\top(v_s - x_{s-1}) + \gamma_s(\nabla f(x_{s-1}) - \tilde\nabla_s)^\top(v_s - x_{s-1}) + \frac{LD^2\gamma_s^2}{2}.$$
The optimality of $v_s$ implies $\tilde\nabla_s^\top v_s \le \tilde\nabla_s^\top w^*$. So with further rewriting we arrive at
$$f(x_s) \le f(x_{s-1}) + \gamma_s\nabla f(x_{s-1})^\top(w^* - x_{s-1}) + \gamma_s(\nabla f(x_{s-1}) - \tilde\nabla_s)^\top(v_s - w^*) + \frac{LD^2\gamma_s^2}{2}.$$
By convexity, the term $\nabla f(x_{s-1})^\top(w^* - x_{s-1})$ is bounded by $f(w^*) - f(x_{s-1})$, and by the Cauchy-Schwarz inequality, the term $(\nabla f(x_{s-1}) - \tilde\nabla_s)^\top(v_s - w^*)$ is bounded by $D\|\tilde\nabla_s - \nabla f(x_{s-1})\|$, which in expectation is at most $\frac{LD^2}{s+1}$ by the condition on $\mathbb{E}[\|\tilde\nabla_s - \nabla f(x_{s-1})\|^2]$ and Jensen's inequality. Therefore we can bound $\mathbb{E}[f(x_s) - f(w^*)]$ by
$$(1-\gamma_s)\mathbb{E}[f(x_{s-1}) - f(w^*)] + \frac{LD^2\gamma_s}{s+1} + \frac{LD^2\gamma_s^2}{2} = (1-\gamma_s)\mathbb{E}[f(x_{s-1}) - f(w^*)] + LD^2\gamma_s^2.$$
Finally we prove $\mathbb{E}[f(x_k) - f(w^*)] \le \frac{4LD^2}{k+2}$ by induction. The base case is trivial: $\mathbb{E}[f(x_1) - f(w^*)]$ is bounded by $(1-\gamma_1)\mathbb{E}[f(x_0) - f(w^*)] + LD^2\gamma_1^2 = LD^2$ since $\gamma_1 = 1$. Suppose $\mathbb{E}[f(x_{s-1}) - f(w^*)] \le \frac{4LD^2}{s+1}$; then with $\gamma_s = \frac{2}{s+1}$ we bound $\mathbb{E}[f(x_s) - f(w^*)]$ by
$$\frac{4LD^2}{s+1}\Big(1 - \frac{2}{s+1} + \frac{1}{s+1}\Big) \le \frac{4LD^2}{s+2},$$
completing the induction.

7. Conclusion and Open Problems

We conclude that the variance reduction technique, previously shown to be highly useful for gradient descent variants, can also be very helpful in speeding up projection-free algorithms. The main open question is, in the strongly convex case, whether the number of stochastic gradients for STORC can be improved from $O(\mu^2\ln\frac{1}{\epsilon})$ to $O(\mu\ln\frac{1}{\epsilon})$, which is typical for gradient descent methods, and whether the number of linear optimizations can be improved from $O(\frac{1}{\epsilon})$ to $O(\ln\frac{1}{\epsilon})$.

Acknowledgements. The authors acknowledge support from the National Science Foundation grant IIS-1523815 and a Google research award.

References

Deng, Jia, Dong, Wei, Socher, Richard, Li, Li-Jia, Li, Kai, and Fei-Fei, Li. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 248–255. IEEE, 2009.

Dudik, Miro, Harchaoui, Zaid, and Malick, Jérôme. Lifted coordinate descent for learning with trace-norm regularization. In Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, volume 22, pp. 327–336, 2012.

Frank, Marguerite and Wolfe, Philip. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1-2):95–110, 1956.

Frostig, Roy, Ge, Rong, Kakade, Sham M., and Sidford, Aaron. Competing with the empirical risk minimizer in a single pass. In Proceedings of the 28th Annual Conference on Learning Theory, 2015.

Garber, Dan and Hazan, Elad. A linearly convergent conditional gradient algorithm with applications to online and stochastic optimization. arXiv preprint arXiv:1301.4666, 2013.

Harchaoui, Zaid, Juditsky, Anatoli, and Nemirovski, Arkadi. Conditional gradient algorithms for norm-regularized smooth convex optimization. Mathematical Programming, 152(1-2):75–112, 2015.

Hazan, Elad and Kale, Satyen. Projection-free online learning. In Proceedings of the 29th International Conference on Machine Learning, 2012.

Hazan, Elad, Kale, Satyen, and Shalev-Shwartz, Shai. Near-optimal algorithms for online matrix prediction. In COLT 2012 - The 25th Annual Conference on Learning Theory, June 25-27, 2012, Edinburgh, Scotland, pp. 38.1–38.13, 2012.

Jaggi, Martin. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In Proceedings of the 30th International Conference on Machine Learning, pp. 427–435, 2013.

Johnson, Rie and Zhang, Tong. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems 27, pp. 315–323, 2013.

Lacoste-Julien, Simon and Jaggi, Martin. On the global linear convergence of Frank-Wolfe optimization variants. In Advances in Neural Information Processing Systems 29, pp. 496–504, 2015.

Lan, Guanghui and Zhou, Yi. Conditional gradient sliding for convex optimization. Optimization-Online preprint (4605), 2014.

Mahdavi, Mehrdad, Zhang, Lijun, and Jin, Rong. Mixed optimization for smooth functions. In Advances in Neural Information Processing Systems, pp. 674–682, 2013.

Moritz, Philipp, Nishihara, Robert, and Jordan, Michael I. A linearly-convergent stochastic L-BFGS algorithm. In Proceedings of the Nineteenth International Conference on Artificial Intelligence and Statistics, 2016.

Nesterov, Yu. E. A method of solving a convex programming problem with convergence rate O(1/k²). In Soviet Mathematics Doklady, volume 27, pp. 372–376, 1983.

Zhang, Xinhua, Schuurmans, Dale, and Yu, Yao-liang. Accelerated training for matrix-norm regularization: A boosting approach. In Advances in Neural Information Processing Systems 26, pp. 2906–2914, 2012.

Supplementary material for "Variance-Reduced and Projection-Free Stochastic Optimization"

A. Proof of Property (1)

Proof. We drop the subscript $i$ for conciseness. Define $g(w) = f(w) - \nabla f(v)^\top w$, which is clearly also convex and $L$-smooth over $\mathbb{R}^d$. Since $\nabla g(v) = 0$, $v$ is one of the minimizers of $g$. Therefore we have
$$\begin{aligned}
g(v) - g(w) &\le g\Big(w - \tfrac{1}{L}\nabla g(w)\Big) - g(w) \\
&\le \nabla g(w)^\top\Big(w - \tfrac{1}{L}\nabla g(w) - w\Big) + \tfrac{L}{2}\Big\|w - \tfrac{1}{L}\nabla g(w) - w\Big\|^2 \qquad \text{(by smoothness of $g$)} \\
&= -\tfrac{1}{2L}\|\nabla g(w)\|^2 = -\tfrac{1}{2L}\|\nabla f(w) - \nabla f(v)\|^2.
\end{aligned}$$
Rearranging and plugging in the definition of $g$ concludes the proof.

B. Analysis for SFW

The concrete update of SFW is

$$v_k = \operatorname{argmin}_{v \in \Omega} \tilde\nabla_k^\top v, \qquad w_k = (1-\gamma_k)w_{k-1} + \gamma_k v_k,$$
where $\tilde\nabla_k$ is the average of $m_k$ iid samples of the stochastic gradient $\nabla f_i(w_{k-1})$. The convergence rate of SFW is presented below.

Theorem 3. If each $f_i$ is $G$-Lipschitz, then with $\gamma_k = \frac{2}{k+1}$ and $m_k = \big(\frac{G(k+1)}{LD}\big)^2$, SFW ensures for any $k$,
$$\mathbb{E}[f(w_k) - f(w^*)] \le \frac{4LD^2}{k+2}.$$

Proof. Similar to the proof of Lemma 2, we first proceed as follows:
$$\begin{aligned}
f(w_k) &\le f(w_{k-1}) + \nabla f(w_{k-1})^\top(w_k - w_{k-1}) + \tfrac{L}{2}\|w_k - w_{k-1}\|^2 &&\text{(smoothness)} \\
&= f(w_{k-1}) + \gamma_k\nabla f(w_{k-1})^\top(v_k - w_{k-1}) + \tfrac{L\gamma_k^2}{2}\|v_k - w_{k-1}\|^2 &&\text{($w_k - w_{k-1} = \gamma_k(v_k - w_{k-1})$)} \\
&\le f(w_{k-1}) + \gamma_k\tilde\nabla_k^\top(v_k - w_{k-1}) + \gamma_k(\nabla f(w_{k-1}) - \tilde\nabla_k)^\top(v_k - w_{k-1}) + \tfrac{LD^2\gamma_k^2}{2} &&\text{($\|v_k - w_{k-1}\| \le D$)} \\
&\le f(w_{k-1}) + \gamma_k\tilde\nabla_k^\top(w^* - w_{k-1}) + \gamma_k(\nabla f(w_{k-1}) - \tilde\nabla_k)^\top(v_k - w_{k-1}) + \tfrac{LD^2\gamma_k^2}{2} &&\text{(by optimality of $v_k$)} \\
&= f(w_{k-1}) + \gamma_k\nabla f(w_{k-1})^\top(w^* - w_{k-1}) + \gamma_k(\nabla f(w_{k-1}) - \tilde\nabla_k)^\top(v_k - w^*) + \tfrac{LD^2\gamma_k^2}{2} \\
&\le f(w_{k-1}) + \gamma_k(f(w^*) - f(w_{k-1})) + \gamma_k D\|\tilde\nabla_k - \nabla f(w_{k-1})\| + \tfrac{LD^2\gamma_k^2}{2},
\end{aligned}$$
where the last step is by convexity and the Cauchy-Schwarz inequality. Since each $f_i$ is $G$-Lipschitz, with Jensen's inequality we further have $\mathbb{E}[\|\tilde\nabla_k - \nabla f(w_{k-1})\|] \le \sqrt{\mathbb{E}[\|\tilde\nabla_k - \nabla f(w_{k-1})\|^2]} \le \frac{G}{\sqrt{m_k}}$, which is at most $\frac{LD\gamma_k}{2}$ with the choice of $\gamma_k$ and $m_k$. So we arrive at $\mathbb{E}[f(w_k) - f(w^*)] \le (1-\gamma_k)\mathbb{E}[f(w_{k-1}) - f(w^*)] + LD^2\gamma_k^2$. It remains to use a simple induction to conclude the proof.

Now it is clear that to achieve $1-\epsilon$ accuracy, SFW needs $O(\frac{LD^2}{\epsilon})$ iterations, and in total $O\big(\frac{G^2}{L^2D^2}\big(\frac{LD^2}{\epsilon}\big)^3\big) = O\big(\frac{G^2LD^4}{\epsilon^3}\big)$ stochastic gradients.

C. Proof of Lemma 3

Proof. Let $\delta_s = \tilde\nabla_s - \nabla f(z_s)$. For any $s \le k$, we proceed as follows:
$$\begin{aligned}
f(y_s) &\le f(z_s) + \nabla f(z_s)^\top(y_s - z_s) + \tfrac{L}{2}\|y_s - z_s\|^2 &&\text{(by smoothness)} \\
&= (1-\gamma_s)\big(f(z_s) + \nabla f(z_s)^\top(y_{s-1} - z_s)\big) + \gamma_s\big(f(z_s) + \nabla f(z_s)^\top(w^* - z_s)\big) + \gamma_s\nabla f(z_s)^\top(x_s - w^*) + \tfrac{L\gamma_s^2}{2}\|x_s - x_{s-1}\|^2 &&\text{(by definition of $y_s$ and $z_s$)} \\
&\le (1-\gamma_s)f(y_{s-1}) + \gamma_s f(w^*) + \gamma_s\nabla f(z_s)^\top(x_s - w^*) + \tfrac{L\gamma_s^2}{2}\|x_s - x_{s-1}\|^2 &&\text{(by convexity)} \\
&= (1-\gamma_s)f(y_{s-1}) + \gamma_s f(w^*) + \gamma_s\tilde\nabla_s^\top(x_s - w^*) + \tfrac{L\gamma_s^2}{2}\|x_s - x_{s-1}\|^2 + \gamma_s\delta_s^\top(w^* - x_s) \\
&\le (1-\gamma_s)f(y_{s-1}) + \gamma_s f(w^*) + \gamma_s\eta_{t,s} - \gamma_s\beta_s(x_s - x_{s-1})^\top(x_s - w^*) + \tfrac{L\gamma_s^2}{2}\|x_s - x_{s-1}\|^2 + \gamma_s\delta_s^\top(w^* - x_s) &&\text{(by Eq. (4))} \\
&= (1-\gamma_s)f(y_{s-1}) + \gamma_s f(w^*) + \gamma_s\eta_{t,s} + \tfrac{\beta_s\gamma_s}{2}\big(\|x_{s-1} - w^*\|^2 - \|x_s - w^*\|^2\big) + \tfrac{\gamma_s}{2}\Big((L\gamma_s - \beta_s)\|x_s - x_{s-1}\|^2 + 2\delta_s^\top(x_{s-1} - x_s)\Big) + \gamma_s\delta_s^\top(w^* - x_{s-1}) \\
&\le (1-\gamma_s)f(y_{s-1}) + \gamma_s f(w^*) + \gamma_s\eta_{t,s} + \tfrac{\beta_s\gamma_s}{2}\big(\|x_{s-1} - w^*\|^2 - \|x_s - w^*\|^2\big) + \tfrac{\gamma_s}{2}\Big(\tfrac{\|\delta_s\|^2}{\beta_s - L\gamma_s} + 2\delta_s^\top(w^* - x_{s-1})\Big),
\end{aligned}$$
where the last inequality is by the fact $\beta_s \ge L\gamma_s$ and thus
$$(L\gamma_s - \beta_s)\|x_s - x_{s-1}\|^2 + 2\delta_s^\top(x_{s-1} - x_s) = \frac{\|\delta_s\|^2}{\beta_s - L\gamma_s} - (\beta_s - L\gamma_s)\Big\|x_s - x_{s-1} - \frac{\delta_s}{\beta_s - L\gamma_s}\Big\|^2 \le \frac{\|\delta_s\|^2}{\beta_s - L\gamma_s}.$$

Note that $\mathbb{E}[\delta_s^\top(w^* - x_{s-1})] = 0$. So with the condition $\mathbb{E}[\|\delta_s\|^2] \le \frac{L^2D_t^2}{N_t(s+1)^2} \stackrel{\text{def}}{=} \sigma_s^2$ we arrive at

$$\mathbb{E}[f(y_s) - f(w^*)] \le (1-\gamma_s)\mathbb{E}[f(y_{s-1}) - f(w^*)] + \gamma_s\Big(\eta_{t,s} + \frac{\beta_s}{2}\big(\mathbb{E}[\|x_{s-1} - w^*\|^2] - \mathbb{E}[\|x_s - w^*\|^2]\big) + \frac{\sigma_s^2}{2(\beta_s - L\gamma_s)}\Big).$$

Now define $\Gamma_s = \Gamma_{s-1}(1-\gamma_s)$ when $s > 1$ and $\Gamma_1 = 1$. By induction, one can verify $\Gamma_s = \frac{2}{s(s+1)}$ and the following:

$$\mathbb{E}[f(y_k) - f(w^*)] \le \Gamma_k\sum_{s=1}^k\frac{\gamma_s}{\Gamma_s}\Big(\eta_{t,s} + \frac{\beta_s}{2}\big(\mathbb{E}[\|x_{s-1} - w^*\|^2] - \mathbb{E}[\|x_s - w^*\|^2]\big) + \frac{\sigma_s^2}{2(\beta_s - L\gamma_s)}\Big),$$
which is at most

$$\Gamma_k\sum_{s=1}^k\frac{\gamma_s}{\Gamma_s}\Big(\eta_{t,s} + \frac{\sigma_s^2}{2(\beta_s - L\gamma_s)}\Big) + \frac{\Gamma_k}{2}\Bigg(\frac{\gamma_1\beta_1}{\Gamma_1}\mathbb{E}[\|x_0 - w^*\|^2] + \sum_{s=2}^k\Big(\frac{\gamma_s\beta_s}{\Gamma_s} - \frac{\gamma_{s-1}\beta_{s-1}}{\Gamma_{s-1}}\Big)\mathbb{E}[\|x_{s-1} - w^*\|^2]\Bigg).$$

Finally, plugging in the parameters $\gamma_s$, $\beta_s$, $\eta_{t,s}$, $\Gamma_s$ and the bound $\mathbb{E}[\|x_0 - w^*\|^2] \le D_t^2$ concludes the proof:

$$\mathbb{E}[f(y_k) - f(w^*)] \le \frac{2}{k(k+1)}\sum_{s=1}^k\Big(\frac{2LD_t^2}{N_t} + \frac{LD_t^2}{2N_t}\Big) + \frac{3LD_t^2}{k(k+1)} \le \frac{8LD_t^2}{k(k+1)},$$
where the last step uses $k \le N_t$.