Tutorial: PART 2
Optimization for Machine Learning
Elad Hazan, Princeton University
+ help from Sanjeev Arora & Yoram Singer

Agenda
1. Learning as mathematical optimization
   • Stochastic optimization, ERM, online regret minimization
   • Online gradient descent
2. Regularization
   • AdaGrad and optimal regularization
3. Gradient Descent++
   • Frank-Wolfe, acceleration, variance reduction, second order methods, non-convex optimization

Accelerating gradient descent?
1. Adaptive regularization (AdaGrad): works for non-smooth & non-convex
2. Variance reduction: uses special ERM structure; very effective for smooth & convex
3. Acceleration/momentum: smooth convex only; general-purpose optimization since the '80s

Condition number of convex functions
Defined as γ = β/α, where (simplified) 0 ≺ αI ≼ ∇²f(x) ≼ βI;
α = strong convexity, β = smoothness.
Non-convex smooth functions (simplified): −βI ≼ ∇²f(x) ≼ βI.
Why do we care? Well-conditioned functions exhibit much faster optimization! (But equivalent via reductions.)

Examples

Smooth gradient descent
The descent lemma, for β-smooth functions (algorithm: x_{t+1} = x_t − η∇_t):
f(x_{t+1}) − f(x_t) ≤ ∇_t⊤(x_{t+1} − x_t) + β‖x_{t+1} − x_t‖²
  = −η‖∇_t‖² + βη²‖∇_t‖² = −(1/(4β)) ‖∇_t‖²   (choosing η = 1/(2β))
Thus, for M-bounded functions (|f(x_t)| ≤ M):
−2M ≤ f(x_T) − f(x_1) = Σ_t [f(x_{t+1}) − f(x_t)] ≤ −(1/(4β)) Σ_t ‖∇_t‖²
Thus, there exists a t for which ‖∇_t‖² ≤ 8Mβ/T.

Smooth gradient descent 2
Conclusions: for x_{t+1} = x_t − η∇_t and T = Ω(βM/ε), GD finds ‖∇_t‖² ≤ ε.
1. Holds even for non-convex functions.
2. For convex functions this implies f(x_t) − f(x*) ≤ O(ε) (faster for smooth!).

Non-convex stochastic gradient descent
The descent lemma, for β-smooth functions (algorithm: x_{t+1} = x_t − η∇̂_t, with E[∇̂_t] = ∇_t):
E[f(x_{t+1}) − f(x_t)] ≤ E[∇_t⊤(x_{t+1} − x_t) + β‖x_{t+1} − x_t‖²]
  = E[−η ∇̂_t⊤∇_t + β‖η∇̂_t‖²] = −η‖∇_t‖² + βη² E[‖∇̂_t‖²]
  = −η‖∇_t‖² + βη² (‖∇_t‖² + var(∇̂_t))
Thus, for M-bounded functions (|f(x_t)| ≤ M):
T = O(Mβ/ε²) ⇒ ∃ t ≤ T: ‖∇_t‖² ≤ ε.

Controlling the variance: interpolating GD and SGD
Model: both full and stochastic gradients are available. The estimator combines both into a lower-variance random variable:
x_{t+1} = x_t − η (∇̂f(x_t) − ∇̂f(x₀) + ∇f(x₀))
Every so often, compute a full gradient and restart at a new x₀.
Theorem [Schmidt, Le Roux, Bach '12; Johnson and Zhang '13; Mahdavi, Zhang, Jin '13]: Variance reduction for well-conditioned functions, 0 ≺ αI ≼ ∇²f(x) ≼ βI with γ = β/α (γ should be interpreted as 1/ε), produces an ε-approximate solution in time
O((m + γ) d log(1/ε)).
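To make the interpolation between GD and SGD concrete, here is a minimal sketch of a variance-reduced loop in the spirit of the theorem above (SVRG-style, as in Johnson and Zhang '13). This is not code from the tutorial: the least-squares objective, the synthetic data, the step size, and the epoch length are all illustrative assumptions.

```python
import numpy as np

# Minimal SVRG-style sketch (variance-reduced SGD) for the finite-sum
# least-squares objective f(x) = 1/(2m) * ||A x - b||^2.
# Step size, epoch count, and data below are illustrative assumptions.

def svrg(A, b, eta=0.01, epochs=30, inner_steps=None):
    m, d = A.shape
    if inner_steps is None:
        inner_steps = m                      # common choice: ~one pass per epoch
    x = np.zeros(d)
    for _ in range(epochs):
        x_snap = x.copy()                    # snapshot point x0
        full_grad = A.T @ (A @ x_snap - b) / m   # full gradient at the snapshot
        for _ in range(inner_steps):
            i = np.random.randint(m)         # sample one example
            g_i = A[i] * (A[i] @ x - b[i])           # stochastic gradient at x
            g_i_snap = A[i] * (A[i] @ x_snap - b[i]) # same example at snapshot
            # variance-reduced direction: unbiased, variance vanishes near x_snap
            x = x - eta * (g_i - g_i_snap + full_grad)
    return x

if __name__ == "__main__":
    np.random.seed(0)
    A = np.random.randn(500, 20)
    x_true = np.random.randn(20)
    b = A @ x_true
    x_hat = svrg(A, b)
    print("distance to minimizer:", np.linalg.norm(x_hat - x_true))
```

The point of the construction is visible in the update: the two per-example gradients cancel in expectation except for the full-gradient correction, so the estimator stays unbiased while its variance shrinks as the iterate approaches the snapshot.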
Acceleration/momentum [Nesterov '83]
• Optimal gradient complexity (smooth, convex).
• Modest practical improvements; non-convex "momentum" methods.
• With variance reduction, fastest possible running time of first-order methods:
O((m + √(γm)) d log(1/ε))
[Woodworth, Srebro '15] – tight lower bound w. gradient oracle.

Experiments w. convex losses

Improve upon Gradient Descent++?
Next few slides: move from first order (GD++) to second order.

Higher Order Optimization
• Gradient descent – direction of steepest descent.
• Second order methods – use local curvature.

Newton's method (+ trust region)
x_{t+1} = x_t − η [∇²f(x_t)]⁻¹ ∇f(x_t)
For non-convex functions: can move to ∞.
Solution: solve a quadratic approximation in a local area (trust region).

Newton's method (+ trust region)
x_{t+1} = x_t − η [∇²f(x_t)]⁻¹ ∇f(x_t)
d³ time per iteration! Infeasible for ML!! Till recently…

Speed up the Newton direction computation??
• Spielman-Teng '04: diagonally dominant systems of equations in linear time!
  • 2015 Gödel Prize
  • Used by Daitch-Spielman for faster flow algorithms
• Erdogdu-Montanari '15: low-rank approximation & inversion by Sherman-Morrison
  • Allows stochastic information
  • Still prohibitive: rank × d²

Stochastic Newton?
x_{t+1} = x_t − η [∇²f(x)]⁻¹ ∇f(x)
• ERM, rank-1 loss: arg min_x E_{i∼[m]} [ℓ(x⊤a_i, b_i) + ½‖x‖²]
• Unbiased estimator of the Hessian: ∇̃²f = a_i a_i⊤ ⋅ ℓ''(x⊤a_i, b_i) + I, with i ∼ U[1,…,m]
• Clearly E[∇̃²f] = ∇²f, but E[(∇̃²f)⁻¹] ≠ (∇²f)⁻¹.

Circumvent Hessian creation and inversion! (3 steps)
• (1) Represent the Hessian inverse as an infinite series (valid after scaling so that 0 ≺ ∇² ≼ I):
∇⁻² = Σ_{i=0}^{∞} (I − ∇²)^i
• (2) Sample from the infinite series (Hessian-gradient product), ONCE, for any distribution over the naturals i ∼ N:
∇⁻²∇f = Σ_i (I − ∇²)^i ∇f = E_{i∼N} [ (I − ∇²)^i ∇f ⋅ 1/Pr[i] ]
• (3) Estimate the Hessian power by sampling i.i.d. data examples:
  = E_{i∼N, k∼[i]} [ Π_{k=1}^{i} (I − ∇̃²_k) ∇f ⋅ 1/Pr[i] ]
Single example per factor; vector-vector products only.

Linear-time Second-order Stochastic Algorithm (LiSSA)
• Use the estimator of ∇⁻²f defined previously.
• Compute a full (large batch) gradient ∇f.
• Move in the direction ∇̃⁻²f ∇f.
• Theoretical running time to produce an ε-approximate solution for γ-well-conditioned functions (convex) [Agarwal, Bullins, Hazan '15]:
O(dm log(1/ε) + √γ d² log(1/ε))
1. Faster than first-order methods!
2. Indications this is tight [Arjevani, Shamir '16].

What about constraints??
Next few slides – projection-free (Frank-Wolfe) methods.

Recommendation systems
[Figure sequence: a users × movies matrix with a few observed ratings (rating of user i for movie j); complete the missing entries; new ratings data arrives; re-complete the missing entries.]
Assume the "true matrix" is low rank; convex relaxation: bounded trace norm.

Bounded trace norm matrices
• Trace norm of a matrix = sum of its singular values.
• K = { X | X is a matrix with trace norm at most D }.
• Computational bottleneck: projections onto K require eigendecomposition, O(n³) operations.
• But: linear optimization over K is easier – computing the top eigenvector; O(sparsity) time.

Projections → linear optimization
1. Matrix completion (K = bounded trace norm matrices): eigendecomposition → largest singular vector computation.
2. Online routing (K = flow polytope): conic optimization over the flow polytope → shortest path computation.
3. Rotations (K = rotation matrices): conic optimization over the rotations set → Wahba's algorithm, linear time.
4. Matroids (K = matroid polytope): convex opt. via the ellipsoid method → greedy algorithm, nearly linear time!

Conditional Gradient algorithm [Frank, Wolfe '56]
Convex optimization problem: min_{x∈K} f(x), where
• f is smooth and convex,
• linear optimization over K is easy.
v_t = arg min_{x∈K} ∇f(x_t)⊤x
x_{t+1} = x_t + η_t (v_t − x_t)
1. At iteration t: x_t is a convex combination of at most t vertices (sparsity).
2. No learning rate: η_t ≈ 2/t (independent of diameter, gradients, etc.).
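To illustrate why linear optimization over the trace-norm ball is cheap, here is a minimal Frank-Wolfe sketch for matrix completion. It is a sketch under assumptions rather than the tutorial's code: the synthetic data, the trace-norm radius D, the use of scipy's svds for the top singular pair, and the step rule η_t = 2/(t+2) are all illustrative choices.

```python
import numpy as np
from scipy.sparse.linalg import svds

# Minimal Frank-Wolfe (conditional gradient) sketch for matrix completion:
#   min_X  f(X) = 1/2 * sum over observed (i,j) of (X_ij - Y_ij)^2
#   s.t.   trace norm (nuclear norm) of X <= D.
# Linear optimization over the trace-norm ball needs only the top singular
# pair of the gradient: argmin_{||V||_* <= D} <G, V> = -D * u1 v1^T.
# Problem sizes, D, and the step rule below are illustrative assumptions.

def frank_wolfe_completion(Y, mask, D, iters=200):
    X = np.zeros_like(Y)
    for t in range(iters):
        G = mask * (X - Y)                    # gradient of f at X
        u, s, vt = svds(G, k=1)               # top singular pair of G
        V = -D * np.outer(u[:, 0], vt[0])     # linear minimizer over the ball
        eta = 2.0 / (t + 2)                   # parameter-free FW step size
        X = X + eta * (V - X)                 # convex combination stays feasible
    return X

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, m, r = 60, 50, 3
    Y_true = rng.standard_normal((n, r)) @ rng.standard_normal((r, m))
    mask = (rng.random((n, m)) < 0.3).astype(float)    # ~30% observed entries
    D = np.linalg.norm(Y_true, "nuc")                  # assume a known trace-norm bound
    X = frank_wolfe_completion(mask * Y_true, mask, D)
    err = np.linalg.norm(mask * (X - Y_true)) / np.linalg.norm(mask * Y_true)
    print("relative error on observed entries:", err)
```

Note that each iterate is a convex combination of at most t rank-one matrices, so the sparsity (low-rank) property on slide "Conditional Gradient algorithm" appears for free, and no projection (hence no full SVD) is ever computed.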
FW theorem
x_{t+1} = x_t + η_t (v_t − x_t),   v_t = arg min_{x∈K} ∇f(x_t)⊤x
[Figure: the iterate x_t, the gradient ∇f(x_t), the linear minimizer v_t, and the new iterate x_{t+1} inside K.]
Theorem: f(x_t) − f(x*) = O(1/t).
Proof, main observation:
f(x_{t+1}) − f(x*) = f(x_t + η_t(v_t − x_t)) − f(x*)
  ≤ f(x_t) − f(x*) + η_t (v_t − x_t)⊤∇_t + (β/2) η_t² ‖v_t − x_t‖²   (β-smoothness of f)
  ≤ f(x_t) − f(x*) + η_t (x* − x_t)⊤∇_t + (β/2) η_t² ‖v_t − x_t‖²   (optimality of v_t)
  ≤ f(x_t) − f(x*) + η_t (f(x*) − f(x_t)) + (β/2) η_t² ‖v_t − x_t‖²   (convexity of f)
  ≤ (1 − η_t)(f(x_t) − f(x*)) + (η_t² β/2) D².
Thus, for h_t = f(x_t) − f(x*):
h_{t+1} ≤ (1 − η_t) h_t + O(η_t²)   ⇒   with η_t = O(1/t), h_t = O(1/t).

Online Conditional Gradient
• Set x₁ ∈ K arbitrarily.
• For t = 1, 2, …:
  1. Use x_t, obtain f_t.
  2. Compute x_{t+1} as follows:
     v_t = arg min_{x∈K} (Σ_{i=1}^{t} ∇f_i(x_i) + β_t x_t)⊤x
     x_{t+1} ← (1 − t^{−α}) x_t + t^{−α} v_t
[Figure: x_t, the direction Σ_{i=1}^{t} ∇f_i(x_i) + β_t x_t, the linear minimizer v_t, and x_{t+1}.]
Theorem [Hazan, Kale '12]: Regret = O(T^{3/4}).
Theorem [Garber, Hazan '13]: For polytopes, strongly-convex and smooth losses,
1. Offline: convergence after t steps is e^{−Θ(t)}.
2. Online: Regret = O(√T).

Next few slides: survey of the state of the art.

Non-convex optimization in ML
[Figure: a learned "machine" maps an input image (chair/car) to a distribution over labels.]
arg min_{x∈R^d} (1/m) Σ_{i=1}^{m} ℓ_i(x, a_i, b_i) + R(x)

[Embedded slide from Duchi (UC Berkeley), "Convex Optimization for Machine Learning", Fall 2009: "What is optimization? But generally speaking... we're screwed." – local (non-global) minima of f₀; all kinds of constraints (even restricting to continuous functions), e.g. h(x) = sin(2πx) = 0; shown with a 3D surface plot.]

Solution concepts for non-convex optimization?
• Global minimization is NP-hard, even for degree-4 polynomials. Even local minimization up to 4th-order optimality conditions is NP-hard.
• Algorithmic stability is sufficient [Hardt, Recht, Singer '16].
• Optimization approaches:
  • Finding vanishing gradients / local minima efficiently
  • Graduated optimization / homotopy method
  • Quasi-convexity
  • Structure of local optima (probabilistic assumptions that allow alternating minimization, …)

Gradient/Hessian based methods
Goal: find a point x such that
1. ‖∇f(x)‖ ≤ ε (approximate first order optimality)
2. ∇²f(x) ≽ −εI (approximate second order optimality)
1. (We've proved) GD, x_{t+1} = x_t − η∇_t, finds a point satisfying (1) in O(1/ε) (expensive) iterations.
2. (We've proved) SGD, x_{t+1} = x_t − η∇̂_t, finds a point satisfying (1) in O(1/ε²) (cheap) iterations.
3. SGD with noise finds a point satisfying (1 & 2) in O(1/ε⁴) (cheap) iterations [Ge, Huang, Jin, Yuan '15].
4. Recent second order methods find a point satisfying (1 & 2) in O(1/ε^{7/4}) (expensive) iterations [Carmon, Duchi, Hinder, Sidford '16] [Agarwal, Allen-Zhu, Bullins, Hazan, Ma '16].

Recap
[Figures revisited: the chair/car classification "machine" and the non-convex landscape from Duchi's slide.]
1. Online learning and stochastic optimization
   • Regret minimization
   • Online gradient descent
2. Regularization
   • AdaGrad and optimal regularization
3. Advanced optimization
   • Frank-Wolfe, acceleration, variance reduction, second order methods, non-convex optimization

Bibliography & more information, see:
http://www.cs.princeton.edu/~ehazan/tutorial/SimonsTutorial.htm

Thank you!
