Lecture 6: Stochastic Gradient Descent
Sanjeev Arora, Elad Hazan


COS 402 – Machine Learning and Artificial Intelligence, Fall 2016

Admin
• Exercise 2 (implementation): due this Thu, in class
• Exercise 3 (written): due this Thu, in class
• Movie – "Ex Machina" + discussion panel w. Prof. Hasson (PNI), Wed Oct. 5th 19:30; tickets from Bella; room 204 COS
• Today: special guest, Dr. Yoram Singer @ Google

Recap
• Definition + fundamental theorem of statistical learning, motivating efficient algorithms/optimization
• Convexity and its computational importance
• Local greedy optimization: gradient descent

Agenda
• Stochastic gradient descent
• Dr. Singer on optimization @ Google & beyond

Mathematical optimization
Input: a function $f : K \mapsto \mathbb{R}$, for $K \subseteq \mathbb{R}^d$
Output: a point $x \in K$ such that $f(x) \le f(y)$ for all $y \in K$

But generally speaking... we're screwed:
• local (non-global) minima of $f$
• all kinds of constraints (even restricting to continuous functions), e.g. $h(x) = \sin(2\pi x) = 0$
[Figure: surface plot of a highly non-convex function; from Duchi (UC Berkeley), Convex Optimization for Machine Learning, Fall 2009]

Prefer convex problems!
• Convex functions: local minima → global minima
• Sum of convex functions → also convex

Convex functions and sets
• A function $f : \mathbb{R}^n \to \mathbb{R}$ is convex if for all $x, y \in \mathrm{dom}\, f$ and any $\alpha \in [0,1]$,
  $f(\alpha x + (1-\alpha)y) \le \alpha f(x) + (1-\alpha) f(y)$.
• A set $C \subseteq \mathbb{R}^n$ is convex if for all $x, y \in C$ and any $\alpha \in [0,1]$,
  $\alpha x + (1-\alpha)y \in C$.
[Figure: a convex set, with the segment between two points $x, y$ contained in the set; from Duchi (UC Berkeley), Convex Optimization for Machine Learning, Fall 2009]

Special case: optimization for linear classification
Given a sample $S = \{(x_1, y_1), \dots, (x_m, y_m)\}$, find a hyperplane (through the origin w.l.o.g.) such that:
$\min_w \; \#\text{ of mistakes} \;\longrightarrow\; \min_w \sum_{i=1}^m \ell(w^\top x_i, y_i)$ for a convex loss function $\ell$.

Convex relaxation for the 0-1 loss
1. Ridge / linear regression: $\ell(w, x_i, y_i) = (w^\top x_i - y_i)^2$
2. SVM (hinge loss): $\ell(w, x_i, y_i) = \max\{0,\; 1 - y_i\, w^\top x_i\}$
   i.e. for $|w| = |x_i| = 1$ we have $1 - y_i\, w^\top x_i = 0$ if $y_i = w^\top x_i$, and $\le 2$ if $y_i \ne w^\top x_i$.

Greedy optimization: gradient descent
• Move in the direction of steepest descent: $x_{t+1} \leftarrow x_t - \eta \nabla f(x_t)$
• We saw: for a suitable step-size choice, $f\!\left(\frac{1}{T}\sum_t x_t\right) \le \min_{x^* \in K} f(x^*) + \varepsilon$.

GD for linear classification
• Objective: $\min_w \frac{1}{m}\sum_i \ell(w^\top x_i, y_i)$
• Update: $w_{t+1} = w_t - \eta \cdot \frac{1}{m}\sum_i \ell'(w_t^\top x_i, y_i)\, x_i$
• Complexity? $O(1/\varepsilon^2)$ iterations, each taking roughly linear time in the data set
• Overall $O\!\left(\frac{md}{\varepsilon^2}\right)$ running time, where $m$ = # of examples in $\mathbb{R}^d$
• Can we speed it up?

GD for linear classification (cont.)
• What if we take a single example and compute the gradient only w.r.t. its loss?
• Which example? → one chosen uniformly at random...
• Why would this work?

SGD for linear classification
• Objective: $\min_w \frac{1}{m}\sum_i \ell(w^\top x_i, y_i)$
• Update: $w_{t+1} = w_t - \eta\, \ell'(w_t^\top x_{i_t}, y_{i_t})\, x_{i_t}$, with $i_t \sim U[1, \dots, m]$ drawn uniformly at random
• The random gradient has expectation equal to the full gradient
• Each iteration is much faster: $O(md) \to O(d)$. But does it still converge?
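To make the GD and SGD updates above concrete, here is a minimal NumPy sketch for the hinge-loss objective. It is a sketch under assumptions, not code from the lecture: the function names, the fixed step size eta, and labels $y_i \in \{-1, +1\}$ are illustrative choices.

```python
import numpy as np

def hinge_subgradient(w, x_i, y_i):
    """Subgradient of the hinge loss max{0, 1 - y_i * w^T x_i} with respect to w."""
    if y_i * np.dot(w, x_i) < 1.0:
        return -y_i * x_i
    return np.zeros_like(x_i)

def gd_linear_classifier(X, y, eta=0.1, T=1000):
    """Full gradient descent: each step averages subgradients over all m examples (O(md) per step)."""
    m, d = X.shape
    w = np.zeros(d)
    for _ in range(T):
        grad = np.mean([hinge_subgradient(w, X[i], y[i]) for i in range(m)], axis=0)
        w = w - eta * grad
    return w

def sgd_linear_classifier(X, y, eta=0.1, T=1000, seed=0):
    """Stochastic gradient descent: each step uses one uniformly random example (O(d) per step)."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    w = np.zeros(d)
    iterates = []
    for _ in range(T):
        i = rng.integers(m)                      # i_t ~ U[1, ..., m]
        w = w - eta * hinge_subgradient(w, X[i], y[i])
        iterates.append(w)
    return np.mean(iterates, axis=0)             # average iterate, as in the guarantee below
```

The only difference between the two routines is the per-step cost: the full-gradient version touches all m examples per step, while the stochastic version touches one and returns the averaged iterate, matching the convergence statement that follows.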
Crucial for SGD: linearity of expectation and derivatives
Let $f(w) = \frac{1}{m}\sum_i \ell_i(w)$. Then for $i_t \sim U[1, \dots, m]$ chosen uniformly at random,
$\mathbb{E}[\nabla \ell_{i_t}(w)] = \frac{1}{m}\sum_i \nabla \ell_i(w) = \nabla\!\left(\frac{1}{m}\sum_i \ell_i(w)\right) = \nabla f(w)$.

Greedy optimization: stochastic gradient descent
• Move in a random direction whose expectation is the direction of steepest descent
• Denote by $\widetilde{\nabla} f(x)$ a vector random variable whose expectation is the gradient: $\mathbb{E}[\widetilde{\nabla} f(x)] = \nabla f(x)$
• Update: $x_{t+1} \leftarrow x_t - \eta\, \widetilde{\nabla} f(x_t)$

Stochastic gradient descent – constrained case
• $y_{t+1} \leftarrow x_t - \eta\, \widetilde{\nabla} f(x_t)$, with $\mathbb{E}[\widetilde{\nabla} f(x_t)] = \nabla f(x_t)$
• $x_{t+1} = \arg\min_{x \in K} |y_{t+1} - x|$  (projection back onto $K$)

Stochastic gradient descent – constrained set
• $G$ = upper bound on the norm of the gradient estimators: $|\widetilde{\nabla} f(x_t)| \le G$
• $D$ = diameter of the constraint set: $\forall x, y \in K, \; |x - y| \le D$
Theorem: for step size $\eta = \frac{D}{G\sqrt{T}}$,
$\mathbb{E}\!\left[f\!\left(\frac{1}{T}\sum_t x_t\right)\right] \le \min_{x^* \in K} f(x^*) + \frac{DG}{\sqrt{T}}$.

Proof sketch:
1. We have already proved (for any sequence of gradient estimates $\widetilde{\nabla}_t$):
   $\frac{1}{T}\sum_t \widetilde{\nabla}_t^\top x_t \le \min_{x^* \in K} \frac{1}{T}\sum_t \widetilde{\nabla}_t^\top x^* + \frac{DG}{\sqrt{T}}$
2. By convexity and the properties of expectation:
   $\mathbb{E}\!\left[f\!\left(\frac{1}{T}\sum_t x_t\right) - \min_{x^* \in K} f(x^*)\right] \le \frac{1}{T}\sum_t \mathbb{E}\!\left[\widetilde{\nabla}_t^\top (x_t - x^*)\right] \le \frac{DG}{\sqrt{T}}$.

Summary
• Mathematical & convex optimization
• Gradient descent algorithm, linear classification
• Stochastic gradient descent
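To tie the constrained-case theorem to something runnable, here is a hedged sketch of projected SGD in the special case where the constraint set $K$ is a Euclidean ball, so the projection has a closed form. The names projected_sgd and grad_estimate and the choice of $K$ are illustrative assumptions; only the update rule, the projection step, the step size $\eta = D/(G\sqrt{T})$, and the averaged iterate come from the slides.

```python
import numpy as np

def project_to_ball(y, radius):
    """Euclidean projection onto the ball {x : |x| <= radius} (so the diameter is D = 2 * radius)."""
    norm = np.linalg.norm(y)
    return y if norm <= radius else (radius / norm) * y

def projected_sgd(grad_estimate, d, radius, G, T, seed=0):
    """Projected SGD with the step size eta = D / (G * sqrt(T)) from the theorem.

    grad_estimate(x, rng) must return an unbiased estimate of the gradient at x
    whose norm is at most G.  Returns the average iterate (1/T) * sum_t x_t.
    """
    rng = np.random.default_rng(seed)
    D = 2.0 * radius                               # diameter of the constraint set
    eta = D / (G * np.sqrt(T))
    x = np.zeros(d)
    x_sum = np.zeros(d)
    for _ in range(T):
        x_sum += x
        y_next = x - eta * grad_estimate(x, rng)   # stochastic gradient step
        x = project_to_ball(y_next, radius)        # project back onto K
    return x_sum / T
```

For the finite-sum classification objective above, grad_estimate would sample $i_t$ uniformly and return the subgradient of $\ell_{i_t}$ at the current point, which is exactly the unbiasedness identity $\mathbb{E}[\nabla \ell_{i_t}(w)] = \nabla f(w)$ used in the proof.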
