
COS 402 – Machine Learning and Artificial Intelligence, Fall 2016

Lecture 6: stochastic gradient descent

Sanjeev Arora, Elad Hazan

Admin

• Exercise 2 (implementation): this Thu, in class
• Exercise 3 (written): this Thu, in class
• Movie – “Ex Machina” + discussion panel w. Prof. Hasson (PNI), Wed Oct. 5th, 19:30; tickets from Bella; room 204 COS

• Today: special guest - Dr. Yoram Singer @ Google

Recap

• Definition + fundamental theorem of statistical learning, motivated efficient optimization
• Convexity and its computational importance
• Local greedy optimization – gradient descent

Agenda

• Stochastic gradient descent
• Dr. Singer on opt @ Google & beyond

Mathematical optimization

Input: f : K → R, for K ⊆ R^d
Output: point x ∈ K, such that f(x) ≤ f(y) ∀ y ∈ K
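
To make this input/output contract concrete, here is a tiny Python sketch (not from the slides; the quadratic objective, the box K, and the use of SciPy are illustrative assumptions) that hands a convex f and a feasible set to an off-the-shelf solver:

    # Minimal sketch of "given f : K -> R, return x in K with minimal value".
    # Toy convex objective and box constraint, assuming SciPy is available.
    import numpy as np
    from scipy.optimize import minimize

    def f(x):
        # illustrative convex quadratic on R^2
        return (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2

    # K = the box [-1, 1] x [-1, 1]
    bounds = [(-1.0, 1.0), (-1.0, 1.0)]

    res = minimize(f, x0=np.zeros(2), bounds=bounds)
    print(res.x, res.fun)   # a point in K achieving (approximately) the minimum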

But generally speaking...

We’re screwed!
• Local (non-global) minima of f0
• All kinds of constraints (even restricting to continuous functions): h(x) = sin(2πx) = 0

Prefer Convex Problems!

[Figure: 3D surface plot illustrating local (non-global) minima. Source: Duchi (UC Berkeley), Convex Optimization for Machine Learning, Fall 2009]

Convex functions: local → global

Sum of convex functions → also convex

Convex Functions and Sets

A function f : R^n → R is convex if for x, y ∈ dom f and any a ∈ [0, 1],
f(ax + (1 − a)y) ≤ a·f(x) + (1 − a)·f(y)
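
As a quick check of the earlier claim that a sum of convex functions is again convex, here is a short derivation (added for completeness, written in LaTeX) using only the definition above:

    For convex $f, g$, any $x, y$ and $a \in [0,1]$:
    \begin{align*}
    (f+g)(ax + (1-a)y) &= f(ax + (1-a)y) + g(ax + (1-a)y) \\
      &\le a f(x) + (1-a) f(y) + a g(x) + (1-a) g(y) \\
      &= a\,(f+g)(x) + (1-a)\,(f+g)(y).
    \end{align*}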

Convex Sets

Definition: A set C ⊆ R^n is convex if for x, y ∈ C and any α ∈ [0, 1],
αx + (1 − α)y ∈ C.


Duchi (UC Berkeley), Convex Optimization for Machine Learning, Fall 2009

Special case: optimization for linear classification

Given a sample S = {(x_1, y_1), … , (x_m, y_m)}, find a hyperplane (through the origin w.l.o.g.) such that:

min_w  # of mistakes

min_w  ∑_i ℓ(w⊤x_i, y_i),  for a convex loss ℓ

Convex relaxation for 0-1 loss

1. Ridge / linear regression:  ℓ(w, x_i, y_i) = (w⊤x_i − y_i)²
2. SVM:  ℓ(w, x_i, y_i) = max{0, 1 − y_i·w⊤x_i}

i.e. for |w| = |x_i| = 1, we have:

1 − y_i·w⊤x_i  { = 0,   y_i = w⊤x_i
               { ≤ 2,   y_i ≠ w⊤x_i
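
To see the relaxation in action, here is a small Python sketch (added for illustration; the random unit vectors and function names are mine, not the lecture's) evaluating the 0-1 loss and the two convex surrogates on a single example; the hinge loss always upper-bounds the 0-1 loss:

    # Convex surrogates for the 0-1 loss on one example (w, x in R^d, y in {-1,+1}).
    import numpy as np

    def zero_one_loss(w, x, y):
        return 0.0 if np.sign(w @ x) == y else 1.0

    def squared_loss(w, x, y):          # ridge / linear regression surrogate
        return (w @ x - y) ** 2

    def hinge_loss(w, x, y):            # SVM surrogate
        return max(0.0, 1.0 - y * (w @ x))

    rng = np.random.default_rng(0)
    w = rng.normal(size=5); w /= np.linalg.norm(w)   # |w| = 1
    x = rng.normal(size=5); x /= np.linalg.norm(x)   # |x| = 1
    y = 1.0

    # hinge >= 0-1 loss, and 1 - y*(w @ x) lies in [0, 2] when |w| = |x| = 1
    print(zero_one_loss(w, x, y), hinge_loss(w, x, y), squared_loss(w, x, y))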

Greedy optimization: gradient descent

• Move in the direction of steepest descent:

x ← x − η·∇f(x)

We saw: for a suitable choice of step size,

f( (1/T) ∑_t x_t )  ≤  min_{x*∈K} f(x*) + DG/√T
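
A minimal sketch of the update above on a toy convex quadratic (added for illustration; the objective, step size and horizon are arbitrary choices, not the D, G constants behind the bound), evaluating f at the averaged iterate as in the guarantee:

    # Gradient descent x <- x - eta * grad f(x), reporting f at the average iterate.
    import numpy as np

    def f(x):
        return 0.5 * np.sum(x ** 2)      # toy convex objective

    def grad_f(x):
        return x

    T, eta = 200, 0.1
    x = np.ones(5)
    iterates = []
    for _ in range(T):
        iterates.append(x.copy())
        x = x - eta * grad_f(x)

    x_avg = np.mean(iterates, axis=0)
    print(f(x_avg))                      # approaches min_x f(x) = 0 as T grows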

GD for linear classification

min_w  (1/m) ∑_i ℓ(w⊤x_i, y_i)

w_{t+1} = w_t − η · (1/m) ∑_i ∇ℓ(w_t⊤x_i, y_i)

• Complexity? ~1/ε² iterations, each taking ~ linear time in the data set
• Overall ~ md/ε² running time, m = # of examples in R^d
• Can we speed it up??
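
A sketch of this full-batch procedure for the hinge loss (added for illustration; the synthetic data, step size and iteration count are assumptions), making the ~O(md) per-iteration cost explicit:

    # Full-batch (sub)gradient descent on (1/m) * sum_i max(0, 1 - y_i * w.x_i).
    import numpy as np

    rng = np.random.default_rng(0)
    m, d = 1000, 20
    X = rng.normal(size=(m, d))
    y = np.sign(X @ rng.normal(size=d))          # synthetic linearly separable labels

    def full_subgradient(w):
        margins = y * (X @ w)                    # touches all m examples: O(m*d)
        active = margins < 1.0                   # examples with nonzero hinge loss
        # subgradient of max(0, 1 - y_i * w.x_i) is -y_i * x_i when the hinge is active
        return -(y[active, None] * X[active]).sum(axis=0) / m

    w, eta, T = np.zeros(d), 0.5, 200
    for _ in range(T):
        w -= eta * full_subgradient(w)

    print(np.mean(np.sign(X @ w) == y))          # training accuracy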

GD for linear classification

• What if we take a single example, and compute the gradient only w.r.t. its loss??

• Which example?
• → uniformly at random…

• Why would this work?

SGD for linear classification

min_w  (1/m) ∑_i ℓ(w⊤x_i, y_i)

w_{t+1} = w_t − η · ∇ℓ(w_t⊤x_i, y_i)

• Uniformly at random?!  For i ∼ U[1, …, m], the update direction has expectation = the full gradient

• Each iteration is much faster: O(md) → O(d). Convergence??
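
A matching SGD sketch (added for illustration, with the same synthetic setup and arbitrary constants as the full-batch sketch above): each step draws one example uniformly at random and uses only its (sub)gradient, so an update costs O(d):

    # SGD on the same hinge-loss objective: one random example per step, O(d) per update.
    import numpy as np

    rng = np.random.default_rng(0)
    m, d = 1000, 20
    X = rng.normal(size=(m, d))
    y = np.sign(X @ rng.normal(size=d))

    w, eta, T = np.zeros(d), 0.1, 20000
    for _ in range(T):
        i = rng.integers(m)                      # i ~ U[1, ..., m]
        if y[i] * (X[i] @ w) < 1.0:              # hinge active for example i
            w += eta * y[i] * X[i]               # w <- w - eta * (subgradient of its loss)

    print(np.mean(np.sign(X @ w) == y))          # training accuracy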

Crucial for SGD: linearity of expectation and gradient

Let f(w) = (1/m) ∑_i ℓ_i(w); then for i ∼ U[1, …, m] chosen uniformly at random, we have

E[ ∇ℓ_i(w) ]  =  (1/m) ∑_i ∇ℓ_i(w)  =  ∇( (1/m) ∑_i ℓ_i(w) )  =  ∇f(w)
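
A quick numerical check of this identity (added for illustration; the squared loss and random data are assumptions): the average of the per-example gradients, i.e. the expectation over a uniformly random index, equals the full gradient of f:

    # Check E_i[ grad l_i(w) ] = grad f(w) for f(w) = (1/m) * sum_i l_i(w),
    # with l_i(w) = (w.x_i - y_i)^2.
    import numpy as np

    rng = np.random.default_rng(0)
    m, d = 50, 4
    X, y, w = rng.normal(size=(m, d)), rng.normal(size=m), rng.normal(size=d)

    def grad_li(w, i):
        return 2.0 * (X[i] @ w - y[i]) * X[i]

    full_grad = (2.0 / m) * X.T @ (X @ w - y)                         # grad f(w) directly
    avg_grad = np.mean([grad_li(w, i) for i in range(m)], axis=0)     # E over i ~ U[1..m]

    print(np.allclose(full_grad, avg_grad))                           # True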

Greedy optimization: gradient descent

• Move in a random direction whose expectation is the direction of steepest descent:

• Denote by ∇̃f(w) a vector random variable whose expectation is the gradient:  E[ ∇̃f(w) ] = ∇f(w)

x ← x − η·∇̃f(x)

Stochastic gradient descent – constrained case

y_{t+1} ← x_t − η·∇̃f(x_t),   E[ ∇̃f(x_t) ] = ∇f(x_t)
x_{t+1} = arg min_{x∈K} |x − y_{t+1}|

Stochastic gradient descent – constrained set

Let:  y_{t+1} ← x_t − η·∇̃f(x_t),   x_{t+1} = arg min_{x∈K} |x − y_{t+1}|
• G = upper bound on the norm of the gradient estimators:  E[ ∇̃f(x) ] = ∇f(x),  |∇̃f(x)| ≤ G

• D = diameter of the set:  ∀ x, y ∈ K,  |x − y| ≤ D

Theorem: for step size η = D/(G√T),

E[ f( (1/T) ∑_t x_t ) ]  ≤  min_{x*∈K} f(x*) + DG/√T

Proof:
1. We have proved (for any sequence of vectors ∇̃f(x_t)):

(1/T) ∑_t ∇̃f(x_t)⊤x_t   ≤   min_{x*∈K} (1/T) ∑_t ∇̃f(x_t)⊤x*  +  DG/√T

2. By properties of expectation:

E[ f( (1/T) ∑_t x_t ) − min_{x*∈K} f(x*) ]   ≤   E[ (1/T) ∑_t ∇̃f(x_t)⊤(x_t − x*) ]   ≤   DG/√T
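
A sketch of the projected SGD procedure analyzed above (added for illustration; K is taken to be a Euclidean ball of radius R so that the projection is a simple rescaling, and R, the data, and T are assumptions), using the step size η = D/(G√T) from the theorem:

    # Projected SGD: y_{t+1} = x_t - eta * stochastic_grad(x_t); x_{t+1} = Proj_K(y_{t+1}).
    # K = Euclidean ball of radius R, so Proj_K(v) = v * min(1, R / |v|).
    import numpy as np

    rng = np.random.default_rng(0)
    m, d, R = 1000, 20, 5.0
    X = rng.normal(size=(m, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)     # |x_i| = 1, so G <= 1 for the hinge loss
    y = np.sign(X @ rng.normal(size=d))

    def project(v):
        n = np.linalg.norm(v)
        return v if n <= R else v * (R / n)

    T, D, G = 20000, 2 * R, 1.0
    eta = D / (G * np.sqrt(T))                        # step size from the theorem
    w, w_sum = np.zeros(d), np.zeros(d)
    for _ in range(T):
        i = rng.integers(m)
        g = -y[i] * X[i] if y[i] * (X[i] @ w) < 1.0 else np.zeros(d)  # gradient estimator
        w = project(w - eta * g)
        w_sum += w

    w_avg = w_sum / T                                 # the averaged iterate in the bound
    print(np.mean(np.sign(X @ w_avg) == y))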

Summary

• Mathematical & convex optimization
• Gradient descent, linear classification
• Stochastic gradient descent