Tutorial, Part 2: Optimization for Machine Learning
Elad Hazan, Princeton University
with help from Sanjeev Arora & Yoram Singer

Agenda
1. Learning as mathematical optimization
   • Stochastic optimization, ERM, online regret minimization
   • Online gradient descent
2. Regularization
   • AdaGrad and optimal regularization
3. Gradient Descent++
   • Frank-Wolfe, acceleration, variance reduction, second order methods, non-convex optimization

Accelerating gradient descent?
1. Adaptive regularization (AdaGrad): works for non-smooth and non-convex objectives.
2. Variance reduction: uses the special ERM structure; very effective for smooth and convex objectives.
3. Acceleration/momentum: smooth convex case only; general-purpose optimization since the 1980s.

Condition number of convex functions
Defined as \gamma = \beta / \alpha, where (simplified)
  0 \prec \alpha I \preceq \nabla^2 f(x) \preceq \beta I,
with \alpha = strong convexity and \beta = smoothness.
Non-convex smooth functions (simplified): -\beta I \preceq \nabla^2 f(x) \preceq \beta I.
Why do we care? Well-conditioned functions exhibit much faster optimization (though the settings are equivalent via reductions).
[Examples slide: figures omitted.]

Smooth gradient descent
The descent lemma for \beta-smooth functions (algorithm: x_{t+1} = x_t - \eta \nabla_t):
  f(x_{t+1}) - f(x_t) \le \nabla_t^\top (x_{t+1} - x_t) + \beta \|x_t - x_{t+1}\|^2
                      = -\eta \|\nabla_t\|^2 + \beta \eta^2 \|\nabla_t\|^2
                      = -\tfrac{1}{4\beta} \|\nabla_t\|^2   (taking \eta = \tfrac{1}{2\beta}).
Thus, for M-bounded functions (|f(x_t)| \le M):
  -2M \le f(x_T) - f(x_1) = \sum_t \big(f(x_{t+1}) - f(x_t)\big) \le -\tfrac{1}{4\beta} \sum_t \|\nabla_t\|^2,
so there exists a t for which
  \|\nabla_t\|^2 \le \frac{8 M \beta}{T}.

Smooth gradient descent 2
Conclusions: for x_{t+1} = x_t - \eta \nabla_t and T = \Omega(1/\varepsilon), gradient descent finds a point with \|\nabla_t\|^2 \le \varepsilon.
1. This holds even for non-convex functions.
2. For convex functions it implies f(x_t) - f(x^*) \le O(1/t) (faster for smooth functions!).

Non-convex stochastic gradient descent
The descent lemma for \beta-smooth functions (algorithm: x_{t+1} = x_t - \eta \hat{\nabla}_t, where E[\hat{\nabla}_t] = \nabla_t):
  E[f(x_{t+1}) - f(x_t)] \le E\big[\nabla_t^\top (x_{t+1} - x_t) + \beta \|x_t - x_{t+1}\|^2\big]
                          = E\big[-\eta \hat{\nabla}_t^\top \nabla_t + \beta \eta^2 \|\hat{\nabla}_t\|^2\big]
                          = -\eta \|\nabla_t\|^2 + \beta \eta^2 \big(\|\nabla_t\|^2 + \mathrm{var}(\hat{\nabla}_t)\big).
Thus, for M-bounded functions (|f(x_t)| \le M) with gradient variance \sigma^2:
  T = O\Big(\frac{M \beta \sigma^2}{\varepsilon^2}\Big) \Rightarrow \exists\, t \le T:\ \|\nabla_t\|^2 \le \varepsilon.

Controlling the variance: interpolating GD and SGD
Model: both full and stochastic gradients are available. The estimator combines both into a lower-variance random variable:
  x_{t+1} = x_t - \eta \big(\hat{\nabla} f(x_t) - \hat{\nabla} f(x_0) + \nabla f(x_0)\big).
Every so often, compute a full gradient and restart at a new x_0.
Theorem [Schmidt, Le Roux, Bach '12; Johnson, Zhang '13; Mahdavi, Zhang, Jin '13]:
variance reduction for well-conditioned functions, 0 \prec \alpha I \preceq \nabla^2 f(x) \preceq \beta I with \gamma = \beta/\alpha (think of \gamma as of order 1/\varepsilon), produces an \varepsilon-approximate solution in time
  O\big((m + \gamma)\, d \log(1/\varepsilon)\big).

Acceleration / momentum [Nesterov '83]
• Optimal gradient complexity for smooth, convex objectives.
• Modest practical improvements; non-convex "momentum" methods.
• With variance reduction, the fastest possible running time of first-order methods is
  \tilde{O}\big((m + \sqrt{m \gamma})\, d \log(1/\varepsilon)\big);
  [Woodworth, Srebro '15] give a tight lower bound with a gradient oracle.

[Experiments with convex losses: plots omitted.]

Improve upon GradientDescent++?
Next few slides: move from first order (GD++) to second order.

Higher order optimization
• Gradient descent: direction of steepest descent.
• Second order methods: use local curvature.

Newton's method (+ trust region)
  x_{t+1} = x_t - \eta\, [\nabla^2 f(x_t)]^{-1} \nabla f(x_t)
For non-convex functions the Newton step can move toward \infty.
Solution: solve a quadratic approximation in a local area (trust region).

Newton's method (+ trust region), continued
  x_{t+1} = x_t - \eta\, [\nabla^2 f(x_t)]^{-1} \nabla f(x_t)
requires d^3 time per iteration: infeasible for ML! Till recently...

Speed up the Newton direction computation??
• Spielman-Teng '04: diagonally dominant systems of equations in linear time!
  • 2015 Gödel prize
  • Used by Daitch-Spielman for faster flow algorithms
• Erdogdu-Montanari '15: low-rank approximation and inversion by Sherman-Morrison
  • Allows stochastic information
  • Still prohibitive: rank * d^2 time

Stochastic Newton?
  x_{t+1} = x_t - \eta\, [\nabla^2 f(x)]^{-1} \nabla f(x)
• ERM, rank-1 loss: \arg\min_x E_{i \sim [m]} \big[\ell(x^\top a_i, b_i) + \tfrac{1}{2}\|x\|^2\big]
• Unbiased estimator of the Hessian:
  \hat{\nabla}^2 = a_i a_i^\top \cdot \ell''(x^\top a_i, b_i) + I, \qquad i \sim U[1, \dots, m]
• Clearly E[\hat{\nabla}^2] = \nabla^2 f, but E[(\hat{\nabla}^2)^{-1}] \ne (\nabla^2 f)^{-1}.

Circumvent Hessian creation and inversion!
Three steps:
(1) Represent the Hessian inverse as an infinite series:
  \nabla^{-2} = \sum_{i=0}^{\infty} (I - \nabla^2)^i.
(2) Sample from the infinite series (a Hessian-gradient product), ONCE, for any distribution N over the naturals:
  \nabla^{-2} \nabla f = \sum_{i} (I - \nabla^2)^i \nabla f = E_{i \sim N}\Big[(I - \nabla^2)^i \nabla f \cdot \tfrac{1}{\Pr[i]}\Big].
(3) Estimate the Hessian power by sampling i.i.d. data examples (single examples, vector-vector products only):
  = E_{i \sim N,\, k \sim [i]}\Big[\prod_{k=1}^{i} (I - \nabla^2 f_k)\, \nabla f \cdot \tfrac{1}{\Pr[i]}\Big].

Linear-time Second-order Stochastic Algorithm (LiSSA)
• Use the estimator \tilde{\nabla}^{-2} f\, \nabla f defined above.
• Compute a full (large batch) gradient \nabla f.
• Move in the direction \tilde{\nabla}^{-2} f\, \nabla f.
• Theoretical running time to produce an \varepsilon-approximate solution for \gamma-well-conditioned (convex) functions [Agarwal, Bullins, Hazan '15]:
  O\big(d m \log(1/\varepsilon) + \sqrt{\gamma}\, d^2 \log(1/\varepsilon)\big).
1. Faster than first-order methods!
2. There are indications this is tight [Arjevani, Shamir '16].
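To make the LiSSA estimator concrete, here is a minimal NumPy sketch (not code from the tutorial) of the truncated Neumann-series estimator for the Hessian-inverse-gradient product, with a ridge-regularized logistic loss standing in for the rank-1 ERM loss. The function names, the fixed truncation depth, and the scale normalization, which assumes the scaled per-example Hessians have eigenvalues in (0, 1], are all illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hessian_vector_product(x, a_i, v, reg=1.0):
    """Single-example Hessian-vector product for logistic loss + (reg/2)*||x||^2.

    The example Hessian is a_i a_i^T * l''(x^T a_i) + reg*I, so applying it to v
    needs only vector-vector products; no d x d matrix is ever formed.
    """
    s = sigmoid(a_i @ x)
    return a_i * (a_i @ v) * s * (1.0 - s) + reg * v

def neumann_inverse_hv(x, A, grad, depth=50, reg=1.0, scale=0.25, seed=None):
    """Estimate [Hessian]^{-1} @ grad with a truncated Neumann series.

    Recursion: u_j = grad + (I - scale*H_{i_j}) u_{j-1}, one random example per
    step.  In expectation u_depth approximates (scale*H)^{-1} grad, so the
    result is rescaled by `scale` at the end.
    """
    rng = np.random.default_rng(seed)
    m = A.shape[0]
    u = grad.copy()
    for _ in range(depth):
        i = rng.integers(m)                       # single random example
        hv = hessian_vector_product(x, A[i], u, reg)
        u = grad + u - scale * hv
    return scale * u

# Tiny usage example on synthetic data (sizes are hypothetical).
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    m, d = 500, 20
    A = rng.normal(size=(m, d)) / np.sqrt(d)
    b = (rng.random(m) < 0.5).astype(float)
    x = np.zeros(d)
    grad = A.T @ (sigmoid(A @ x) - b) / m + x     # full regularized gradient
    newton_dir = neumann_inverse_hv(x, A, grad)   # approximates H^{-1} grad
    x_next = x - newton_dir                       # approximate Newton step
```

A full LiSSA iteration would average several independent copies of this estimate and pair it with a large-batch gradient, as in the bullets above; the point of the sketch is only that each inner step touches a single example and costs O(d).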
What about constraints??
Next few slides: projection-free (Frank-Wolfe) methods.

Recommendation systems
• Data: a users-by-movies matrix in which entry (i, j) is the rating of user i for movie j, and most entries are missing.
• Task: complete the missing entries; as new ratings arrive, re-complete the missing entries.
[Rating-matrix illustrations omitted.]
• Assume the "true matrix" is low rank. Convex relaxation: bounded trace norm.

Bounded trace norm matrices
• Trace norm of a matrix = sum of its singular values.
• K = { X : X is a matrix with trace norm at most D }.
• Computational bottleneck: projections onto K require an eigendecomposition, O(n^3) operations.
• But: linear optimization over K is easier; it amounts to computing the top eigenvector, in O(sparsity) time.

Projections → linear optimization
1. Matrix completion (K = bounded trace norm matrices): eigendecomposition becomes a largest-singular-vector computation.
2. Online routing (K = flow polytope): conic optimization over the flow polytope becomes a shortest-path computation.
3. Rotations (K = rotation matrices): conic optimization over the rotation set becomes Wahba's algorithm, which runs in linear time.
4. Matroids (K = matroid polytope): convex optimization via the ellipsoid method becomes the greedy algorithm, which runs in nearly linear time!

Conditional gradient algorithm [Frank, Wolfe '56]
Convex optimization problem: \min_{x \in K} f(x), where
• f is smooth and convex, and
• linear optimization over K is easy.
  v_t = \arg\min_{x \in K} \nabla f(x_t)^\top x
  x_{t+1} = x_t + \eta_t (v_t - x_t)
1. At iteration t the iterate is a convex combination of at most t vertices (sparsity).
2. No learning rate to tune: \eta_t \approx \tfrac{2}{t}, independent of the diameter, gradients, etc.
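As a concrete illustration of the two steps above, the linear-optimization oracle followed by a convex-combination update, here is a minimal NumPy sketch of conditional gradient for matrix completion over a trace-norm ball. The objective (squared error on observed entries), the full-SVD oracle, and the parameter names are illustrative assumptions rather than code from the tutorial; in practice the oracle needs only the top singular pair, computable in O(sparsity) time by power iteration or Lanczos, which is exactly why the method avoids projections.

```python
import numpy as np

def frank_wolfe_matrix_completion(M_obs, mask, radius, T=200):
    """Conditional gradient (Frank-Wolfe) over the trace-norm ball.

    Minimizes f(X) = 0.5 * ||mask * (X - M_obs)||_F^2 subject to
    ||X||_tr <= radius.  The linear oracle argmin_{||V||_tr <= radius} <G, V>
    is -radius * u v^T for the top singular pair (u, v) of the gradient G;
    a full SVD is used here for clarity only.
    """
    X = np.zeros_like(M_obs)
    for t in range(1, T + 1):
        G = mask * (X - M_obs)                   # gradient of f at X
        U, s, Vt = np.linalg.svd(G)              # singular vectors of G
        V = -radius * np.outer(U[:, 0], Vt[0])   # vertex returned by the oracle
        eta = 2.0 / (t + 1)                      # parameter-free step size
        X = X + eta * (V - X)                    # convex combination: stays in K
    return X

# Tiny usage example with a synthetic low-rank matrix (sizes are hypothetical).
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, r = 50, 3
    M_true = rng.normal(size=(n, r)) @ rng.normal(size=(r, n))
    mask = (rng.random((n, n)) < 0.3).astype(float)       # 30% of entries observed
    X_hat = frank_wolfe_matrix_completion(mask * M_true, mask,
                                          radius=np.linalg.norm(M_true, "nuc"))
    err = np.linalg.norm(mask * (X_hat - M_true)) / np.linalg.norm(mask * M_true)
    print(f"relative error on observed entries: {err:.3f}")
```

Each update x_{t+1} = x_t + \eta_t (v_t - x_t) is a convex combination of feasible points, so the iterate never leaves K and no projection is performed; after t steps it is a combination of at most t rank-one matrices.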
Frank-Wolfe theorem
  x_{t+1} = x_t + \eta_t (v_t - x_t), \qquad v_t = \arg\min_{x \in K} \nabla f(x_t)^\top x
Theorem: f(x_t) - f(x^*) = O(1/t).
Proof, main observation (D denotes the diameter of K):
  f(x_{t+1}) - f(x^*) = f(x_t + \eta_t (v_t - x_t)) - f(x^*)
    \le f(x_t) - f(x^*) + \eta_t (v_t - x_t)^\top \nabla_t + \tfrac{\beta}{2} \eta_t^2 \|v_t - x_t\|^2    (\beta-smoothness of f)
    \le f(x_t) - f(x^*) + \eta_t (x^* - x_t)^\top \nabla_t + \tfrac{\beta}{2} \eta_t^2 \|v_t - x_t\|^2    (optimality of v_t)
    \le f(x_t) - f(x^*) + \eta_t \big(f(x^*) - f(x_t)\big) + \tfrac{\beta}{2} \eta_t^2 \|v_t - x_t\|^2    (convexity of f)
    \le (1 - \eta_t)\big(f(x_t) - f(x^*)\big) + \tfrac{\eta_t^2 \beta}{2} D^2.
Thus, writing h_t = f(x_t) - f(x^*):
  h_{t+1} \le (1 - \eta_t) h_t + O(\eta_t^2), \qquad \text{and } \eta_t = O(1/t) \text{ gives } h_t = O(1/t).

Online conditional gradient
• Set x_1 \in K arbitrarily.
• For t = 1, 2, \dots:
  1. Play x_t, obtain f_t.
  2. Compute x_{t+1} as follows:
       v_t = \arg\min_{x \in K} \Big(\sum_{i=1}^{t} \nabla f_i(x_i) + \beta_t x_t\Big)^\top x
       x_{t+1} \leftarrow (1 - t^{-\alpha})\, x_t + t^{-\alpha} v_t
Theorem [Hazan, Kale '12]: \mathrm{Regret} = O(T^{3/4}).
Theorem [Garber, Hazan '13]: for polytopes and strongly convex, smooth losses,
1. Offline: convergence after t steps at rate e^{-O(t)}.
2. Online: \mathrm{Regret} = O(\sqrt{T}).

Next few slides: a survey of the state of the art in non-convex optimization.

Non-convex optimization in ML
A machine maps examples a \in R^d, drawn from some distribution, to labels; training solves
  \arg\min_{x \in R^d} \frac{1}{m} \sum_{i=1}^{m} \ell_i(x, a_i, b_i) + R(x).
But generally speaking, we're screwed: f_0 can have local (non-global) minima, and there can be all kinds of constraints, even restricting to continuous functions, e.g. h(x) = \sin(2\pi x) = 0.
[Surface plot omitted; slide adapted from Duchi (UC Berkeley), Convex Optimization for Machine Learning, Fall 2009.]

Solution concepts for non-convex optimization?
• Global minimization is NP-hard, even for degree-4 polynomials. Even local minimization up to 4th-order optimality conditions is NP-hard.
• Algorithmic stability is sufficient [Hardt, Recht, Singer '16].
• Optimization approaches:
  • Finding vanishing gradients / local minima efficiently
  • Graduated optimization / homotopy method
  • Quasi-convexity
  • Structure of local optima (probabilistic assumptions that allow alternating minimization, ...)

Gradient/Hessian based methods
Goal: find a point x such that
1. \|\nabla f(x)\| \le \varepsilon (approximate first-order optimality), and
2. \nabla^2 f(x) \succeq -\varepsilon I (approximate second-order optimality).
1. (We have proved) The GD algorithm x_{t+1} = x_t - \eta \nabla_t finds a point satisfying (1) in O(1/\varepsilon^2) (expensive) iterations.
2. (We have proved) The SGD algorithm x_{t+1} = x_t - \eta \hat{\nabla}_t finds a point satisfying (1) in O(1/\varepsilon^4) (cheap) iterations.
3. SGD with added noise finds a point satisfying (1 & 2) in poly(1/\varepsilon) (cheap) iterations [Ge, Huang, Jin, Yuan '15].
4. Recent second-order methods find a point satisfying (1 & 2) in O(1/\varepsilon^{7/4}) (expensive) iterations [Carmon, Duchi, Hinder, Sidford '16; Agarwal, Allen-Zhu, Bullins, Hazan, Ma '16].

Recap
1. Online learning and stochastic optimization
   • Regret minimization
   • Online gradient descent
2. Regularization
   • AdaGrad and optimal regularization
3. Advanced optimization
   • Frank-Wolfe, acceleration, variance reduction, second order methods, non-convex optimization

Bibliography & more information, see:
http://www.cs.princeton.edu/~ehazan/tutorial/SimonsTutorial.htm

Thank you!