Tutorial: PART 2

Optimization for Machine Learning

Elad Hazan Princeton University

+ help from Sanjeev Arora & Yoram Singer

Agenda

1. Learning as mathematical optimization
   • Stochastic optimization, ERM, online regret minimization
   • Online gradient descent
2. Regularization
   • AdaGrad and optimal regularization
3. Gradient Descent++
   • Frank-Wolfe, acceleration, variance reduction, second order methods, non-convex optimization

Accelerating gradient descent?

1. Adaptive regularization (AdaGrad): works for non-smooth & non-convex
2. Variance reduction: uses special ERM structure; very effective for smooth & convex
3. Acceleration/momentum: smooth convex only; general purpose optimization since the 80's

Condition number of convex functions

defined as γ = β/α, where (simplified)

0 ≺ αI ≼ ∇²f(x) ≼ βI

α = strong convexity, β = smoothness

Non-convex smooth functions: (simplified)

−βI ≼ ∇²f(x) ≼ βI

Why do we care? Well-conditioned functions exhibit much faster optimization! (but equivalent via reductions)

Examples
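As a concrete illustration of these quantities (a minimal sketch, not from the slides), the code below computes α, β and γ = β/α for the Hessian of a ridge-regression objective f(x) = (1/2m)‖Ax − b‖² + (λ/2)‖x‖²; the data A, b and the regularization λ are made up.

```python
import numpy as np

# Hypothetical data: f(x) = 1/(2m) ||Ax - b||^2 + (lam/2) ||x||^2
rng = np.random.default_rng(0)
m, d, lam = 200, 10, 0.1
A = rng.standard_normal((m, d))

# The Hessian of f is constant: H = A^T A / m + lam * I
H = A.T @ A / m + lam * np.eye(d)

eigs = np.linalg.eigvalsh(H)
alpha, beta = eigs.min(), eigs.max()   # strong convexity, smoothness
gamma = beta / alpha                   # condition number
print(f"alpha={alpha:.3f}, beta={beta:.3f}, condition number gamma={gamma:.1f}")
```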

Smooth gradient descent

The descent lemma, β-smooth functions: (algorithm: x_{t+1} = x_t − η∇_t)

f(x_{t+1}) − f(x_t) ≤ ∇_t⊤(x_{t+1} − x_t) + β‖x_{t+1} − x_t‖²

= −(η − βη²)‖∇_t‖² = −(1/(4β))‖∇_t‖²    [for η = 1/(2β)]

Thus, for M-bounded functions (|f(x_t)| ≤ M):

−2M ≤ f(x_{T+1}) − f(x_1) = Σ_t [f(x_{t+1}) − f(x_t)] ≤ −(1/(4β)) Σ_t ‖∇_t‖²

Thus, there exists a t for which

‖∇_t‖² ≤ 8Mβ/T

Smooth gradient descent

Conclusions: for x_{t+1} = x_t − η∇_t and T = Ω(Mβ/ε²), finds

‖∇_t‖ ≤ ε

1. Holds even for non-convex functions
2. For convex functions implies f(x_t) − f(x*) ≤ O(ε) (faster for smooth!)
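To make the guarantee above concrete, here is a minimal NumPy sketch (not from the slides) of gradient descent with the step size η = 1/(2β) used in the analysis, run on a made-up smooth quadratic objective until the gradient norm drops below ε.

```python
import numpy as np

def gradient_descent(grad, x0, beta, eps=1e-3, max_iters=10_000):
    """Run x_{t+1} = x_t - eta * grad(x_t) with eta = 1/(2*beta),
    stopping once ||grad(x_t)|| <= eps (approximate first-order optimality)."""
    x, eta = x0.copy(), 1.0 / (2.0 * beta)
    for t in range(max_iters):
        g = grad(x)
        if np.linalg.norm(g) <= eps:
            return x, t
        x -= eta * g
    return x, max_iters

# Hypothetical smooth objective: f(x) = 0.5 x^T A x - b^T x  (beta = largest eigenvalue of A)
rng = np.random.default_rng(1)
Q = rng.standard_normal((20, 20))
A = Q.T @ Q / 20 + 0.1 * np.eye(20)
b = rng.standard_normal(20)
beta = np.linalg.eigvalsh(A).max()

x_out, iters = gradient_descent(lambda x: A @ x - b, np.zeros(20), beta)
print(f"stopped after {iters} iterations, ||grad|| = {np.linalg.norm(A @ x_out - b):.2e}")
```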

Non-convex stochastic gradient descent

The descent lemma, β-smooth functions: (algorithm: x_{t+1} = x_t − η∇̂_t, with E[∇̂_t] = ∇_t)

E[f(x_{t+1}) − f(x_t)] ≤ E[∇_t⊤(x_{t+1} − x_t) + β‖x_{t+1} − x_t‖²]

= E[−η∇_t⊤∇̂_t + βη²‖∇̂_t‖²] = −η‖∇_t‖² + βη² E[‖∇̂_t‖²]

= −η‖∇_t‖² + βη²(‖∇_t‖² + var(∇̂_t))

Thus, for M-bounded functions (|f(x_t)| ≤ M) with gradient variance σ²:

T = O(Mβσ²/ε⁴) ⇒ ∃ t ≤ T such that ‖∇_t‖ ≤ ε
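A minimal sketch of the stochastic version (not from the slides): single-example gradients for a made-up least-squares finite sum (chosen for simplicity; the guarantee above only needs smoothness), tracking the best full-gradient norm seen so far, which is the quantity the analysis bounds.

```python
import numpy as np

def sgd(stoch_grad, full_grad, x0, eta, n, iters=5000, seed=0):
    """x_{t+1} = x_t - eta * g_t, where g_t is an unbiased single-example gradient.
    Returns the iterate with the smallest full gradient norm seen so far."""
    rng = np.random.default_rng(seed)
    x, best_x, best_norm = x0.copy(), x0.copy(), np.inf
    for _ in range(iters):
        i = rng.integers(n)                 # sample one example uniformly
        x -= eta * stoch_grad(x, i)
        g_norm = np.linalg.norm(full_grad(x))   # for monitoring only
        if g_norm < best_norm:
            best_x, best_norm = x.copy(), g_norm
    return best_x, best_norm

# Hypothetical finite sum: f(x) = 1/(2n) sum_i (a_i^T x - b_i)^2
rng = np.random.default_rng(2)
n, d = 500, 20
A, b = rng.standard_normal((n, d)), rng.standard_normal(n)
stoch_grad = lambda x, i: (A[i] @ x - b[i]) * A[i]
full_grad = lambda x: A.T @ (A @ x - b) / n

x_best, g_best = sgd(stoch_grad, full_grad, np.zeros(d), eta=0.01, n=n)
print(f"best ||grad|| seen: {g_best:.3e}")
```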

Controlling the variance: Interpolating GD and SGD

Model: both full and stochastic gradients. Estimator combines both into a lower-variance RV:

x_{t+1} = x_t − η [∇̂f(x_t) − ∇̂f(x_0) + ∇f(x_0)]

Every so often, compute a full gradient and restart at a new x_0.

Theorem: [Schmidt, LeRoux, Bach '12; Johnson and Zhang '13; Mahdavi, Zhang, Jin '13]
Variance reduction for well-conditioned functions, 0 ≺ αI ≼ ∇²f(x) ≼ βI, produces an ε-approximate solution in time

O((m + γ) d log(1/ε))

(m = # examples, d = dimension; γ = β/α, which for non-strongly-convex losses should be interpreted as 1/ε)
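A minimal SVRG-style sketch of the estimator above (one possible instantiation, not the cited authors' exact algorithms): an outer loop computes a full gradient at a snapshot x_0, inner steps use ∇̂f(x_t) − ∇̂f(x_0) + ∇f(x_0). The least-squares data and step-size rule are made up.

```python
import numpy as np

def svrg(A, b, epochs=20, seed=0):
    """Variance-reduced SGD for f(x) = 1/(2n) ||Ax - b||^2.
    Inner update: x <- x - eta * (g_i(x) - g_i(x0) + full_grad(x0))."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    grad_i = lambda x, i: (A[i] @ x - b[i]) * A[i]        # single-example gradient
    full_grad = lambda x: A.T @ (A @ x - b) / n
    eta = 0.1 / np.max(np.sum(A ** 2, axis=1))            # ~ 1 / (10 * max per-example smoothness)
    x = np.zeros(d)
    for _ in range(epochs):
        x0, mu = x.copy(), full_grad(x)                   # snapshot and its full gradient
        for _ in range(n):
            i = rng.integers(n)
            x -= eta * (grad_i(x, i) - grad_i(x0, i) + mu)
    return x

rng = np.random.default_rng(3)
A, b = rng.standard_normal((500, 20)), rng.standard_normal(500)
x = svrg(A, b)
print("||full grad|| after SVRG:", np.linalg.norm(A.T @ (A @ x - b) / len(b)))
```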

Acceleration/momentum [Nesterov '83]

• Optimal gradient complexity (smooth, convex)

• modest practical improvements, non-convex “momentum” methods.

• With variance reduction, fastest possible running time of first-order methods:

  O((m + √(mγ)) d log(1/ε))

[Woodworth, Srebro '15] – tight lower bound w. gradient oracle
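For reference, a minimal sketch of Nesterov-style momentum for a smooth convex quadratic (one common variant of the method; the ill-conditioned quadratic and iteration budget are made up, and this is not tied to the experiments that follow).

```python
import numpy as np

def nesterov(grad, x0, beta, iters=500):
    """Accelerated gradient descent (a standard variant):
    y is an extrapolated point, x_{t+1} = y_t - (1/beta) grad(y_t)."""
    x, y, lam = x0.copy(), x0.copy(), 1.0
    for _ in range(iters):
        lam_next = (1 + np.sqrt(1 + 4 * lam ** 2)) / 2
        momentum = (lam - 1) / lam_next
        x_next = y - grad(y) / beta
        y = x_next + momentum * (x_next - x)
        x, lam = x_next, lam_next
    return x

# Hypothetical ill-conditioned quadratic: f(x) = 0.5 x^T A x - b^T x
rng = np.random.default_rng(4)
d = 50
eigs = np.linspace(0.01, 1.0, d)
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
A = Q @ np.diag(eigs) @ Q.T
b = rng.standard_normal(d)

x = nesterov(lambda z: A @ z - b, np.zeros(d), beta=eigs.max())
print("||grad|| after acceleration:", np.linalg.norm(A @ x - b))
```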

Experiments w. convex losses

Improve upon GradientDescent++ ?

Next few slides: move from first order (GD++) to second order

Higher Order Optimization

• Gradient Descent – Direction of Steepest Descent
• Second Order Methods – Use Local Curvature

Newton's method (+ Trust region)

x_{t+1} = x_t − η [∇²f(x)]⁻¹ ∇f(x)

For non-convex functions: can move to ∞. Solution: solve a quadratic approximation in a local area (trust region).

Newton's method (+ Trust region)

x_{t+1} = x_t − η [∇²f(x)]⁻¹ ∇f(x)

d³ time per iteration! Infeasible for ML!!

Till recently…
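A minimal damped-Newton sketch (illustrative only): each step forms the d×d Hessian and solves a linear system, which is the O(d³) bottleneck the slide refers to. The logistic-regression objective, regularization, and step size are made up.

```python
import numpy as np

def newton_logreg(A, b, lam=0.1, eta=1.0, iters=10):
    """Damped Newton for ridge-regularized logistic regression.
    Each iteration forms the d x d Hessian and solves a linear system: O(d^3)."""
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(A @ x)))           # sigmoid predictions
        grad = A.T @ (p - b) / n + lam * x
        W = p * (1 - p)                              # per-example curvature
        H = (A * W[:, None]).T @ A / n + lam * np.eye(d)
        x -= eta * np.linalg.solve(H, grad)          # the d^3 step
    return x

rng = np.random.default_rng(5)
n, d = 1000, 30
A = rng.standard_normal((n, d))
b = (A @ rng.standard_normal(d) + 0.5 * rng.standard_normal(n) > 0).astype(float)
x = newton_logreg(A, b)
print("trained; ||x|| =", np.linalg.norm(x))
```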

Speed up the Newton direction computation??

• Spielman-Teng '04: diagonally dominant systems of equations in linear time!
  • 2015 Gödel Prize
  • Used by Daitch-Spielman for faster flow algorithms

• Erdogdu-Montanari '15: low-rank approximation & inversion by Sherman-Morrison
  • Allows stochastic information
  • Still prohibitive: rank × d²

Stochastic Newton?

x_{t+1} = x_t − η [∇²f(x)]⁻¹ ∇f(x)

• ERM, rank-1 loss: arg min_x E_{i∼[m]} [ ℓ(x⊤a_i, b_i) + ½‖x‖² ]

• unbiased estimator of the Hessian:

∇̃²f = a_i a_i⊤ · ℓ''(x⊤a_i, b_i) + I,    i ~ U[1, …, m]

• clearly E[∇̃²f] = ∇²f, but E[(∇̃²f)⁻¹] ≠ (∇²f)⁻¹

Circumvent Hessian creation and inversion!

• 3 steps:
• (1) represent the Hessian inverse as an infinite series (valid when 0 ≺ ∇²f ≼ I):

  ∇⁻²f = Σ_{i=0}^{∞} (I − ∇²f)^i

• (2) sample from the infinite series (Hessian-gradient product), ONCE: for any distribution on the naturals, i ∼ D,

  ∇⁻²f ∇f = Σ_i (I − ∇²f)^i ∇f = E_{i∼D} [ (I − ∇²f)^i ∇f · 1/Pr[i] ]

• (3) estimate the Hessian-power by sampling i.i.d. data examples:

  = E_{i∼D, k∼[m]} [ Π_{k=1}^{i} (I − ∇̃²_k f) ∇f · 1/Pr[i] ]

Single example; vector-vector products only
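A minimal sketch of this idea for ridge-regularized logistic regression (illustrative, not the authors' implementation; it uses the truncated recursion form of the series rather than the 1/Pr[i]-weighted single term). Per-example Hessians are rank-1 plus λI, the series depth is geometric, and only Hessian-vector products are used; the data, λ, and rescaling constant are made up.

```python
import numpy as np

def hessian_inv_grad_estimate(A, b, x, lam=1.0, depth_p=0.1, scale=0.1, seed=0):
    """Estimate H(x)^{-1} grad(x) for regularized logistic regression without
    forming H: run the recursion u <- g + (I - H_k) u, where H_k is the rank-1
    Hessian of a single random example, for a random (geometric) number of steps.
    `scale` rescales the objective so every per-example Hessian has norm < 1."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    sig = 1.0 / (1.0 + np.exp(-(A @ x)))
    g = scale * (A.T @ (sig - b) / n + lam * x)          # full (rescaled) gradient

    u = g.copy()
    for _ in range(rng.geometric(depth_p)):              # random truncation depth
        k = rng.integers(n)                              # one data example per factor
        s = 1.0 / (1.0 + np.exp(-(A[k] @ x)))
        Hu = scale * (A[k] * (A[k] @ u) * s * (1 - s) + lam * u)   # rank-1 Hessian-vector product
        u = g + u - Hu                                   # u <- g + (I - H_k) u
    return u                                             # approximates H^{-1} grad (scales cancel)

rng = np.random.default_rng(6)
A = rng.standard_normal((500, 20))
b = rng.integers(0, 2, 500).astype(float)
print(hessian_inv_grad_estimate(A, b, x=np.ones(20) * 0.1)[:5])
```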

Linear-time Second-order Stochastic Algorithm (LiSSA)

• Use the estimator ∇̃⁻²f defined previously
• Compute a full (large batch) gradient ∇f
• Move in the direction ∇̃⁻²f ∇f
• Theoretical running time to produce an ε-approximate solution for γ-well-conditioned functions (convex): [Agarwal, Bullins, Hazan '15]

O( dm log(1/ε) + √γ d² log(1/ε) )

1. Faster than first-order methods!
2. Indications this is tight [Arjevani, Shamir '16]

What about constraints??

Next few slides – projection-free (Frank-Wolfe) methods

Recommendation systems

[figure: users × movies matrix; entry (i, j) = rating of user i for movie j, with many entries missing]

Recommendation systems

[figure: the missing entries completed]

Recommendation systems

[figure: new data arrives]

Recommendation systems

[figure: missing entries re-completed]

Assume low rank of the "true matrix"; convex relaxation: bounded trace norm

Bounded trace norm matrices

• Trace norm of a matrix = sum of singular values
• K = { X | X is a matrix with trace norm at most D }

• Computational bottleneck: projections onto K require eigendecomposition: O(n³) operations

• But: linear optimization over K is easy – computing the top eigenvector; O(sparsity) time
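To illustrate why linear optimization over the trace-norm ball is cheap, here is a minimal sketch (assumptions: a made-up gradient matrix, plain power iteration, ball of radius D): the linear minimizer argmin_{‖X‖_tr ≤ D} ⟨G, X⟩ is −D·u vᵀ for the top singular pair (u, v) of G.

```python
import numpy as np

def linear_opt_trace_ball(G, D, iters=100, seed=0):
    """argmin_{||X||_tr <= D} <G, X>  =  -D * u v^T,
    where (u, v) is the top singular vector pair of G.
    Computed with power iteration: only matrix-vector products."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(G.shape[1])
    for _ in range(iters):
        u = G @ v
        u /= np.linalg.norm(u)
        v = G.T @ u
        v /= np.linalg.norm(v)
    return -D * np.outer(u, v)

G = np.random.default_rng(7).standard_normal((50, 40))
V = linear_opt_trace_ball(G, D=10.0)
print("<G, V> =", np.sum(G * V), " vs  -D*sigma_max =", -10.0 * np.linalg.svd(G, compute_uv=False)[0])
```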

Projections → linear optimization

1. Matrix completion (K = bounded trace norm matrices)
   projection: eigendecomposition; linear optimization: largest singular vector computation
2. Online routing (K = flow polytope)
   projection: conic optimization over the flow polytope; linear optimization: shortest path computation
3. Rotations (K = rotation matrices)
   projection: conic optimization over the rotations set; linear optimization: Wahba's algorithm – linear time
4. Matroids (K = matroid polytope)
   projection: convex optimization; linear optimization: greedy algorithm – nearly linear time!

Conditional Gradient algorithm [Frank, Wolfe '56]

Convex optimization problem:  min_{x∈K} f(x)

• f is smooth, convex
• linear optimization over K is easy

v_t = arg min_{x∈K} ∇f(x_t)⊤ x
x_{t+1} = x_t + η_t (v_t − x_t)

1. At iteration t: x_t is a convex combination of at most t vertices (sparsity)
2. No learning rate: η_t ≈ 1/t (independent of diameter, gradients, etc.)
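A minimal Frank-Wolfe sketch (illustrative; the quadratic objective and the choice of K as the probability simplex are made up): linear optimization over the simplex just picks the vertex with the smallest gradient coordinate, and η_t = 2/(t+2).

```python
import numpy as np

def frank_wolfe(grad, linear_opt, x0, iters=200):
    """x_{t+1} = x_t + eta_t (v_t - x_t), with v_t = argmin_{v in K} <grad(x_t), v>."""
    x = x0.copy()
    for t in range(iters):
        v = linear_opt(grad(x))          # linear optimization oracle over K
        eta = 2.0 / (t + 2.0)            # standard step size, no tuning
        x = x + eta * (v - x)
    return x

# Hypothetical problem: minimize ||Ax - b||^2 over the probability simplex
rng = np.random.default_rng(8)
A, b = rng.standard_normal((30, 10)), rng.standard_normal(30)
grad = lambda x: 2 * A.T @ (A @ x - b)
simplex_lo = lambda g: np.eye(len(g))[np.argmin(g)]   # vertex e_i with smallest gradient entry

x = frank_wolfe(grad, simplex_lo, np.ones(10) / 10)
print("objective:", np.sum((A @ x - b) ** 2), " sum(x) =", x.sum())
```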

FW theorem

x_{t+1} = x_t + η_t (v_t − x_t),   v_t = arg min_{x∈K} ∇_t⊤ x

Theorem: f(x_t) − f(x*) = O(1/t)

Proof, main observation:

f(x_{t+1}) − f(x*) = f(x_t + η_t(v_t − x_t)) − f(x*)
≤ f(x_t) − f(x*) + η_t (v_t − x_t)⊤∇_t + (β/2) η_t² ‖v_t − x_t‖²      [β-smoothness of f]
≤ f(x_t) − f(x*) + η_t (x* − x_t)⊤∇_t + (β/2) η_t² ‖v_t − x_t‖²      [optimality of v_t]
≤ f(x_t) − f(x*) + η_t (f(x*) − f(x_t)) + (β/2) η_t² ‖v_t − x_t‖²    [convexity of f]
≤ (1 − η_t)(f(x_t) − f(x*)) + (β η_t²/2) D².

Thus, for h_t = f(x_t) − f(x*):

h_{t+1} ≤ (1 − η_t) h_t + O(η_t²)   ⇒   h_t = O(1/t)

Online Conditional Gradient

• Set x_1 ∈ K arbitrarily
• For t = 1, 2, …,

1. Use x_t, obtain f_t
2. Compute x_{t+1} as follows:

   v_t = arg min_{x∈K} ( Σ_{i=1}^{t} ∇f_i(x_i) + t·x_t )⊤ x
   x_{t+1} ← (1 − α_t) x_t + α_t v_t

Theorem: [Hazan, Kale '12]  Regret = O(T^{3/4})

Theorem: [Garber, Hazan '13]  For polytopes, strongly-convex and smooth losses,
1. Offline: convergence after t steps: e^{−Θ(t)}
2. Online: Regret = O(√T)

Next few slides: survey state-of-the-art

Non-convex optimization in ML

[figure: a "machine" mapping a distribution over inputs {a_i} ∈ ℝ^d to labels b_i]

arg min_{x∈ℝ^d} (1/m) Σ_i ℓ(x, a_i, b_i) + R(x)

[embedded slide: "What is Optimization? But generally speaking… we're screwed: local (non-global) minima of f_0; all kinds of constraints (even restricting to continuous functions), e.g. h(x) = sin(2πx) = 0"]

Solution concepts for non-convex optimization?

• Global minimization is NP hard, even for degree-4 polynomials. Even local minimization up to 4th-order optimality conditions is NP hard.
• Algorithmic stability is sufficient [Hardt, Recht, Singer '16]
• Optimization approaches:
  • Finding vanishing gradients / local minima efficiently
  • Graduated optimization / homotopy method
  • Quasi-convexity
  • Structure of local optima (probabilistic assumptions that allow alternating minimization, …)

Gradient/Hessian based methods

Goal: find a point x such that
1. ‖∇f(x)‖ ≤ ε (approximate first-order optimality)
2. ∇²f(x) ≽ −εI (approximate second-order optimality)

1. (we've proved) GD algorithm x_{t+1} = x_t − η∇_t finds, in O(1/ε²) (expensive) iterations, a point satisfying (1)
2. (we've proved) SGD algorithm x_{t+1} = x_t − η∇̂_t finds, in O(1/ε⁴) (cheap) iterations, a point satisfying (1)
3. SGD algorithm with noise finds, in poly(1/ε) (cheap) iterations, a point satisfying (1 & 2) [Ge, Huang, Jin, Yuan '15]
4. Recent second-order methods: find, in O(1/ε^{7/4}) (expensive) iterations, a point satisfying (1 & 2) [Carmon, Duchi, Hinder, Sidford '16] [Agarwal, Allen-Zhu, Bullins, Hazan, Ma '16]
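A minimal sketch of item 3 in the list above, i.e. gradient steps with explicit noise injection to escape saddle points (illustrative only; the saddle-shaped test function, noise level, and step size are made up, and this is not the precise algorithm of Ge et al.).

```python
import numpy as np

def noisy_gd(grad, x0, eta=0.05, noise=0.1, iters=2000, seed=0):
    """Gradient descent plus isotropic noise: x <- x - eta * (grad(x) + xi).
    The noise lets the iterate drift off strict saddle points."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for _ in range(iters):
        x -= eta * (grad(x) + noise * rng.standard_normal(x.shape))
    return x

# Hypothetical non-convex function with a strict saddle at the origin:
# f(x, y) = x^2 - y^2 + y^4   (saddle at 0; minima at y = +/- 1/sqrt(2))
grad = lambda z: np.array([2 * z[0], -2 * z[1] + 4 * z[1] ** 3])

x = noisy_gd(grad, np.zeros(2))
print("escaped to:", x)   # plain GD started exactly at the saddle would stay at (0, 0)
```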

Recap

1. Online learning and stochastic optimization
   • Regret minimization
   • Online gradient descent
2. Regularization
   • AdaGrad and optimal regularization
3. Advanced optimization
   • Frank-Wolfe, acceleration, variance reduction, second order methods, non-convex optimization

Bibliography & more information, see:
http://www.cs.princeton.edu/~ehazan/tutorial/SimonsTutorial.htm

Thank you!