
COS 402 – Machine Learning and Artificial Intelligence, Fall 2016

Lecture 6: stochastic gradient descent

Sanjeev Arora, Elad Hazan

Admin

• Exercise 2 (implementation): this Thu, in class
• Exercise 3 (written): this Thu, in class
• Movie – “Ex Machina” + discussion panel w. Prof. Hasson (PNI), Wed Oct. 5th, 19:30; tickets from Bella; room 204 COS

• Today: special guest - Dr. Yoram Singer @ Google

Recap

• Definition + fundamental theorem of statistical learning, motivated efficient optimization
• Convexity and its computational importance
• Local greedy optimization – gradient descent

Agenda

• Stochastic gradient descent
• Dr. Singer on opt @ Google & beyond

Mathematical optimization

Input: f : K → R, for K ⊆ R^d
Output: point x ∈ K, such that f(x) ≤ f(y) ∀ y ∈ K
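
To make this input/output contract concrete, here is a tiny Python sketch (not from the slides; the quadratic objective, the box K, and the use of SciPy are illustrative assumptions) that hands a convex f and a feasible set to an off-the-shelf solver:

    # Minimal sketch of "given f : K -> R, return x in K with minimal value".
    # Toy convex objective and box constraint, assuming SciPy is available.
    import numpy as np
    from scipy.optimize import minimize

    def f(x):
        # illustrative convex quadratic on R^2
        return (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2

    # K = the box [-1, 1] x [-1, 1]
    bounds = [(-1.0, 1.0), (-1.0, 1.0)]

    res = minimize(f, x0=np.zeros(2), bounds=bounds)
    print(res.x, res.fun)   # a point in K achieving (approximately) the minimum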

But generally speaking...

We’re screwed!
• Local (non-global) minima of f0
• All kinds of constraints (even restricting to continuous functions): h(x) = sin(2πx) = 0

Prefer Convex Problems!

[Figure: 3D surface plot illustrating local (non-global) minima. Source: Duchi (UC Berkeley), Convex Optimization for Machine Learning, Fall 2009]

Convex functions: local → global

Sum of convex functions → also convex

Convex Functions and Sets

A function f : R^n → R is convex if for x, y ∈ dom f and any a ∈ [0, 1],
f(ax + (1 − a)y) ≤ a·f(x) + (1 − a)·f(y)
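
As a quick check of the earlier claim that a sum of convex functions is again convex, here is a short derivation (added for completeness, written in LaTeX) using only the definition above:

    For convex $f, g$, any $x, y$ and $a \in [0,1]$:
    \begin{align*}
    (f+g)(ax + (1-a)y) &= f(ax + (1-a)y) + g(ax + (1-a)y) \\
      &\le a f(x) + (1-a) f(y) + a g(x) + (1-a) g(y) \\
      &= a\,(f+g)(x) + (1-a)\,(f+g)(y).
    \end{align*}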

Convex Sets

Definition: A set C ⊆ R^n is convex if for x, y ∈ C and any α ∈ [0, 1],
αx + (1 − α)y ∈ C.


Duchi (UC Berkeley), Convex Optimization for Machine Learning, Fall 2009

Special case: optimization for linear classification

Given a sample S = {(x_1, y_1), … , (x_m, y_m)}, find a hyperplane (through the origin w.l.o.g.) such that:

min_w  # of mistakes

min_w  ∑_i ℓ(w⊤x_i, y_i),  for a convex loss ℓ

Convex relaxation for 0-1 loss

1. Ridge / linear regression:  ℓ(w, x_i, y_i) = (w⊤x_i − y_i)²
2. SVM:  ℓ(w, x_i, y_i) = max{0, 1 − y_i·w⊤x_i}

i.e. for |w| = |x_i| = 1, we have:

1 − y_i·w⊤x_i  { = 0,   y_i = w⊤x_i
               { ≤ 2,   y_i ≠ w⊤x_i
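
To see the relaxation in action, here is a small Python sketch (added for illustration; the random unit vectors and function names are mine, not the lecture's) evaluating the 0-1 loss and the two convex surrogates on a single example; the hinge loss always upper-bounds the 0-1 loss:

    # Convex surrogates for the 0-1 loss on one example (w, x in R^d, y in {-1,+1}).
    import numpy as np

    def zero_one_loss(w, x, y):
        return 0.0 if np.sign(w @ x) == y else 1.0

    def squared_loss(w, x, y):          # ridge / linear regression surrogate
        return (w @ x - y) ** 2

    def hinge_loss(w, x, y):            # SVM surrogate
        return max(0.0, 1.0 - y * (w @ x))

    rng = np.random.default_rng(0)
    w = rng.normal(size=5); w /= np.linalg.norm(w)   # |w| = 1
    x = rng.normal(size=5); x /= np.linalg.norm(x)   # |x| = 1
    y = 1.0

    # hinge >= 0-1 loss, and 1 - y*(w @ x) lies in [0, 2] when |w| = |x| = 1
    print(zero_one_loss(w, x, y), hinge_loss(w, x, y), squared_loss(w, x, y))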

Greedy optimization: gradient descent

• Move in the direction of steepest descent:

x ← x − η·∇f(x)

We saw: for a suitable choice of step size,

f( (1/T) ∑_t x_t )  ≤  min_{x*∈K} f(x*) + DG/√T
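
A minimal sketch of the update above on a toy convex quadratic (added for illustration; the objective, step size and horizon are arbitrary choices, not the D, G constants behind the bound), evaluating f at the averaged iterate as in the guarantee:

    # Gradient descent x <- x - eta * grad f(x), reporting f at the average iterate.
    import numpy as np

    def f(x):
        return 0.5 * np.sum(x ** 2)      # toy convex objective

    def grad_f(x):
        return x

    T, eta = 200, 0.1
    x = np.ones(5)
    iterates = []
    for _ in range(T):
        iterates.append(x.copy())
        x = x - eta * grad_f(x)

    x_avg = np.mean(iterates, axis=0)
    print(f(x_avg))                      # approaches min_x f(x) = 0 as T grows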

GD for linear classification

min_w  (1/m) ∑_i ℓ(w⊤x_i, y_i)

w_{t+1} = w_t − η · (1/m) ∑_i ∇ℓ(w_t⊤x_i, y_i)

• Complexity? ~1/ε² iterations, each taking ~ linear time in the data set
• Overall ~ md/ε² running time, m = # of examples in R^d
• Can we speed it up??
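
A sketch of this full-batch procedure for the hinge loss (added for illustration; the synthetic data, step size and iteration count are assumptions), making the ~O(md) per-iteration cost explicit:

    # Full-batch (sub)gradient descent on (1/m) * sum_i max(0, 1 - y_i * w.x_i).
    import numpy as np

    rng = np.random.default_rng(0)
    m, d = 1000, 20
    X = rng.normal(size=(m, d))
    y = np.sign(X @ rng.normal(size=d))          # synthetic linearly separable labels

    def full_subgradient(w):
        margins = y * (X @ w)                    # touches all m examples: O(m*d)
        active = margins < 1.0                   # examples with nonzero hinge loss
        # subgradient of max(0, 1 - y_i * w.x_i) is -y_i * x_i when the hinge is active
        return -(y[active, None] * X[active]).sum(axis=0) / m

    w, eta, T = np.zeros(d), 0.5, 200
    for _ in range(T):
        w -= eta * full_subgradient(w)

    print(np.mean(np.sign(X @ w) == y))          # training accuracy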

GD for linear classification

• What if we take a single example, and compute the gradient only w.r.t. its loss??

• Which example?
• → uniformly at random…

• Why would this work?

SGD for linear classification

min_w  (1/m) ∑_i ℓ(w⊤x_i, y_i)

w_{t+1} = w_t − η · ∇ℓ(w_t⊤x_i, y_i)

• Uniformly at random?!  For i ∼ U[1, …, m], the update direction has expectation = the full gradient

• Each iteration is much faster: O(md) → O(d). Convergence??
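
A matching SGD sketch (added for illustration, with the same synthetic setup and arbitrary constants as the full-batch sketch above): each step draws one example uniformly at random and uses only its (sub)gradient, so an update costs O(d):

    # SGD on the same hinge-loss objective: one random example per step, O(d) per update.
    import numpy as np

    rng = np.random.default_rng(0)
    m, d = 1000, 20
    X = rng.normal(size=(m, d))
    y = np.sign(X @ rng.normal(size=d))

    w, eta, T = np.zeros(d), 0.1, 20000
    for _ in range(T):
        i = rng.integers(m)                      # i ~ U[1, ..., m]
        if y[i] * (X[i] @ w) < 1.0:              # hinge active for example i
            w += eta * y[i] * X[i]               # w <- w - eta * (subgradient of its loss)

    print(np.mean(np.sign(X @ w) == y))          # training accuracy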

Crucial for SGD: linearity of expectation and gradient

Let f(w) = (1/m) ∑_i ℓ_i(w); then for i ∼ U[1, …, m] chosen uniformly at random, we have

E[ ∇ℓ_i(w) ]  =  (1/m) ∑_i ∇ℓ_i(w)  =  ∇( (1/m) ∑_i ℓ_i(w) )  =  ∇f(w)
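
A quick numerical check of this identity (added for illustration; the squared loss and random data are assumptions): the average of the per-example gradients, i.e. the expectation over a uniformly random index, equals the full gradient of f:

    # Check E_i[ grad l_i(w) ] = grad f(w) for f(w) = (1/m) * sum_i l_i(w),
    # with l_i(w) = (w.x_i - y_i)^2.
    import numpy as np

    rng = np.random.default_rng(0)
    m, d = 50, 4
    X, y, w = rng.normal(size=(m, d)), rng.normal(size=m), rng.normal(size=d)

    def grad_li(w, i):
        return 2.0 * (X[i] @ w - y[i]) * X[i]

    full_grad = (2.0 / m) * X.T @ (X @ w - y)                         # grad f(w) directly
    avg_grad = np.mean([grad_li(w, i) for i in range(m)], axis=0)     # E over i ~ U[1..m]

    print(np.allclose(full_grad, avg_grad))                           # True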

Greedy optimization: gradient descent

• Move in a random direction whose expectation is the direction of steepest descent:

• Denote by ∇̃f(w) a vector random variable whose expectation is the gradient:  E[ ∇̃f(w) ] = ∇f(w)

x ← x − η·∇̃f(x)

Stochastic gradient descent – constrained case

y_{t+1} ← x_t − η·∇̃f(x_t),   E[ ∇̃f(x_t) ] = ∇f(x_t)
x_{t+1} = arg min_{x∈K} |x − y_{t+1}|

Stochastic gradient descent – constrained set

Let:  y_{t+1} ← x_t − η·∇̃f(x_t),   x_{t+1} = arg min_{x∈K} |x − y_{t+1}|
• G = upper bound on the norm of the gradient estimators:  E[ ∇̃f(x) ] = ∇f(x),  |∇̃f(x)| ≤ G

• D = diameter of the set:  ∀ x, y ∈ K,  |x − y| ≤ D

Theorem: for step size η = D/(G√T),

E[ f( (1/T) ∑_t x_t ) ]  ≤  min_{x*∈K} f(x*) + DG/√T

Proof:
1. We have proved (for any sequence of vectors ∇̃f(x_t)):

(1/T) ∑_t ∇̃f(x_t)⊤x_t   ≤   min_{x*∈K} (1/T) ∑_t ∇̃f(x_t)⊤x*  +  DG/√T

2. By properties of expectation:

E[ f( (1/T) ∑_t x_t ) − min_{x*∈K} f(x*) ]   ≤   E[ (1/T) ∑_t ∇̃f(x_t)⊤(x_t − x*) ]   ≤   DG/√T
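
A sketch of the projected SGD procedure analyzed above (added for illustration; K is taken to be a Euclidean ball of radius R so that the projection is a simple rescaling, and R, the data, and T are assumptions), using the step size η = D/(G√T) from the theorem:

    # Projected SGD: y_{t+1} = x_t - eta * stochastic_grad(x_t); x_{t+1} = Proj_K(y_{t+1}).
    # K = Euclidean ball of radius R, so Proj_K(v) = v * min(1, R / |v|).
    import numpy as np

    rng = np.random.default_rng(0)
    m, d, R = 1000, 20, 5.0
    X = rng.normal(size=(m, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)     # |x_i| = 1, so G <= 1 for the hinge loss
    y = np.sign(X @ rng.normal(size=d))

    def project(v):
        n = np.linalg.norm(v)
        return v if n <= R else v * (R / n)

    T, D, G = 20000, 2 * R, 1.0
    eta = D / (G * np.sqrt(T))                        # step size from the theorem
    w, w_sum = np.zeros(d), np.zeros(d)
    for _ in range(T):
        i = rng.integers(m)
        g = -y[i] * X[i] if y[i] * (X[i] @ w) < 1.0 else np.zeros(d)  # gradient estimator
        w = project(w - eta * g)
        w_sum += w

    w_avg = w_sum / T                                 # the averaged iterate in the bound
    print(np.mean(np.sign(X @ w_avg) == y))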

Summary

• Mathematical & convex optimization
• Gradient descent, linear classification
• Stochastic gradient descent