
Gradients of the Log Loss, Multi-Class Classification, and Other Optimization Techniques

Karl Stratos

June 20, 2018


Recall: Logistic Regression

▶ Task. Given input x ∈ R^d, predict either 1 or 0 (on or off).

▶ Model. The probability of on is parameterized by w ∈ R^d as a dot product squashed under the sigmoid/logistic function σ : R → [0, 1].

  p(1 | x, w) := σ(w · x) := 1 / (1 + exp(−w · x))

  The probability of off is p(0 | x, w) = 1 − σ(w · x) = σ(−w · x).

▶ Today’s focus:
  1. Optimizing the log loss by gradient descent
  2. Multi-class classification to handle more than two classes
  3. More on optimization: Newton’s method, stochastic gradient descent

Overview

  Gradient Descent on the Log Loss
  Multi-Class Classification
  More on Optimization
    Newton’s Method
    Stochastic Gradient Descent (SGD)


Negative Log Probability Under Logistic Function

▶ Using y ∈ {0, 1} to denote the two classes,

  −log p(y | x, w) = −y log σ(w · x) − (1 − y) log σ(−w · x)
                   = log(1 + exp(−w · x))   if y = 1
                   = log(1 + exp(w · x))    if y = 0

  is convex in w.

▶ Gradient given by

  ∇_w (−log p(y | x, w)) = −(y − σ(w · x)) x
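
To make the formula concrete, here is a minimal NumPy sketch (not part of the original slides) that evaluates −log p(y | x, w) and the gradient above, and checks the gradient against central finite differences; the helper names and random test values are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_log_prob(w, x, y):
    # -log p(y | x, w) for y in {0, 1}, using the two cases from the slide
    z = w @ x
    return np.log1p(np.exp(-z)) if y == 1 else np.log1p(np.exp(z))

def grad_neg_log_prob(w, x, y):
    # gradient from the slide: -(y - sigma(w . x)) x
    return -(y - sigmoid(w @ x)) * x

# central finite-difference check of the gradient formula
rng = np.random.default_rng(0)
w, x, y = rng.normal(size=3), rng.normal(size=3), 1
eps = 1e-6
numeric = np.array([(neg_log_prob(w + eps * e, x, y) - neg_log_prob(w - eps * e, x, y)) / (2 * eps)
                    for e in np.eye(3)])
print(np.allclose(numeric, grad_neg_log_prob(w, x, y)))  # True
```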



Logistic Regression Objective

▶ Given iid samples S = {(x^(i), y^(i))}_{i=1}^n, find w that minimizes the empirical negative log likelihood of S (“log loss”):

  J_S^LOG(w) := −(1/n) Σ_{i=1}^n log p(y^(i) | x^(i), w)

▶ Gradient?

  ∇J_S^LOG(w) = −(1/n) Σ_{i=1}^n (y^(i) − σ(w · x^(i))) x^(i)

▶ Unlike in linear regression, there is no closed-form solution for

  w_S^LOG := argmin_{w ∈ R^d} J_S^LOG(w)

▶ But J_S^LOG(w) is convex and differentiable! So we can do gradient descent and approach an optimal solution.

Gradient Descent for Logistic Regression

Input: training objective J_S^LOG(w) := −(1/n) Σ_{i=1}^n log p(y^(i) | x^(i), w), number of iterations T
Output: parameter ŵ^LOG ∈ R^d such that J_S^LOG(ŵ^LOG) ≈ J_S^LOG(w_S^LOG)

1. Initialize θ_0 (e.g., randomly).
2. For t = 0, ..., T − 1,

     θ_{t+1} = θ_t + (η_t / n) Σ_{i=1}^n (y^(i) − σ(θ_t · x^(i))) x^(i)

3. Return θ_T.
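
A compact NumPy implementation of this procedure might look as follows; it is a sketch that assumes a constant step size η in place of a schedule η_t, and the synthetic data at the bottom is only there to show the call.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent_logreg(X, y, T=500, eta=0.1):
    """Batch gradient descent on the log loss.

    X: (n, d) array of inputs, y: (n,) array of 0/1 labels.
    Returns the parameter vector after T full-batch updates.
    """
    n, d = X.shape
    theta = np.zeros(d)
    for t in range(T):
        # average update direction: (1/n) sum_i (y_i - sigma(theta . x_i)) x_i
        residual = y - sigmoid(X @ theta)          # shape (n,)
        theta = theta + (eta / n) * (X.T @ residual)
    return theta

# toy usage on synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = (rng.uniform(size=200) < sigmoid(X @ true_w)).astype(float)
print(gradient_descent_logreg(X, y))  # roughly recovers the direction of true_w
```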

Overview

  Gradient Descent on the Log Loss
  Multi-Class Classification
  More on Optimization
    Newton’s Method
    Stochastic Gradient Descent (SGD)

From Binary to m Classes

▶ A logistic regressor has a single w ∈ R^d to define the probability of “on” (against “off”):

  p(1 | x, w) = exp(w · x) / (1 + exp(w · x))

▶ A log-linear model has w^y ∈ R^d for each possible class y ∈ {1, ..., m} to define the probability of the class equaling y:

  p(y | x, θ) = exp(w^y · x) / Σ_{y′=1}^m exp(w^{y′} · x)

  where θ = {w^t}_{t=1}^m denotes the set of parameters.

▶ Is this a valid probability distribution?

Softmax Formulation

▶ We can transform any vector z ∈ R^m into a probability distribution over m elements by the softmax function softmax : R^m → ∆^{m−1},

  softmax_i(z) := exp(z_i) / Σ_{j=1}^m exp(z_j)   ∀ i = 1, ..., m

▶ A log-linear model is a linear transformation by a matrix W ∈ R^{m×d} (with rows w^1, ..., w^m) followed by softmax:

  p(y | x, W) = softmax_y(Wx)
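
As an illustration, a NumPy version of the softmax and the resulting log-linear probabilities could look like this (a sketch, not code from the slides); subtracting the maximum before exponentiating is the standard trick to avoid overflow and does not change the result.

```python
import numpy as np

def softmax(z):
    # subtract the max for numerical stability; softmax is invariant to
    # adding a constant to every coordinate, so the output is unchanged
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def log_linear_probs(W, x):
    """p(. | x, W) = softmax(Wx) for a weight matrix W with one row per class."""
    return softmax(W @ x)

# toy usage: m = 3 classes, d = 4 features (shapes are illustrative)
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
x = rng.normal(size=4)
p = log_linear_probs(W, x)
print(p, p.sum())  # nonnegative entries summing to 1
```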

Negative Log Probability Under Log-Linear Model

▶ For any y ∈ {1, ..., m},

  −log p(y | x, θ) = log( Σ_{y′=1}^m exp(w^{y′} · x) ) − w^y · x

  is convex in each w^l ∈ θ (the log-sum-exp term is constant with respect to y; the second term is linear).

▶ For any l ∈ {1, ..., m}, the gradient wrt. w^l is given by

  ∇_{w^l} (−log p(y | x, θ)) = −(1 − p(l | x, θ)) x   if l = y
                             =  p(l | x, θ) x          if l ≠ y
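
The two cases above collapse to the single expression (p(l | x, θ) − 1{l = y}) x for row l, which is what this illustrative NumPy sketch (not from the slides) implements for all rows of W at once, together with a finite-difference spot check.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def grad_neg_log_prob_multiclass(W, x, y):
    """Gradient of -log p(y | x, W) with respect to each row w^l of W.

    Row l of the result is (p(l | x) - 1{l = y}) * x, the one-line form of
    the case-by-case gradient on the slide.
    """
    p = softmax(W @ x)                    # class probabilities, shape (m,)
    indicator = np.zeros_like(p)
    indicator[y] = 1.0
    return np.outer(p - indicator, x)     # shape (m, d)

# finite-difference spot check on one coordinate (values are illustrative)
rng = np.random.default_rng(0)
W, x, y = rng.normal(size=(3, 4)), rng.normal(size=4), 2
loss = lambda M: -np.log(softmax(M @ x)[y])
eps = 1e-6
E = np.zeros_like(W); E[1, 2] = eps
numeric = (loss(W + E) - loss(W - E)) / (2 * eps)
print(np.isclose(numeric, grad_neg_log_prob_multiclass(W, x, y)[1, 2]))  # True
```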

Log-Linear Model Objective

▶ Given iid samples S = {(x^(i), y^(i))}_{i=1}^n, find θ that minimizes the empirical negative log likelihood of S:

  J_S^LLM(θ) := −(1/n) Σ_{i=1}^n log p(y^(i) | x^(i), θ)

  This is the so-called cross-entropy loss.

▶ Again convex and differentiable, so it can be optimized by gradient descent to reach an optimal solution.

Overview

  Gradient Descent on the Log Loss
  Multi-Class Classification
  More on Optimization
    Newton’s Method
    Stochastic Gradient Descent (SGD)

One Way to Motivate Gradient Descent

▶ At current parameter value θ ∈ R^d, choose update ∆ ∈ R^d by minimizing the first-order Taylor approximation around θ with squared l2 regularization:

  J(θ + ∆) ≈ J(θ) + ∇J(θ)^⊤ ∆ + (1/2) ||∆||_2^2 =: J_1(∆)

  ∆^GD = argmin_{∆ ∈ R^d} J_1(∆) = −∇J(θ)

▶ Equivalent to minimizing a second-order Taylor approximation but with uniform curvature:

  J(θ) + ∇J(θ)^⊤ ∆ + (1/2) ∆^⊤ I_{d×d} ∆

Newton’s Method

▶ Actually minimize the second-order Taylor approximation:

  J(θ + ∆) ≈ J(θ) + ∇J(θ)^⊤ ∆ + (1/2) ∆^⊤ ∇²J(θ) ∆ =: J_2(∆)

  ∆^NEWTON = argmin_{∆ ∈ R^d} J_2(∆) = −∇²J(θ)^{−1} ∇J(θ)

▶ Equivalent to gradient descent after a change of coordinates by ∇²J(θ)^{1/2}
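
As an illustration, here is a sketch of Newton’s method applied to the log loss J_S^LOG, using the standard closed forms for its gradient and Hessian (the Hessian formula is not derived on these slides); the small damping term added to the Hessian is an extra safeguard for the linear solve, not something from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_logreg(X, y, T=10, damping=1e-6):
    """Newton's method on the log loss (a sketch).

    Uses gradient -(1/n) X^T (y - sigma(Xw)) and Hessian
    (1/n) X^T diag(s (1 - s)) X, plus a small damping term.
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(T):
        s = sigmoid(X @ w)
        grad = -(X.T @ (y - s)) / n
        hess = (X.T * (s * (1 - s))) @ X / n + damping * np.eye(d)
        w = w - np.linalg.solve(hess, grad)   # Delta^NEWTON = -H^{-1} grad
    return w

# usage: typically needs far fewer iterations than plain gradient descent
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (rng.uniform(size=200) < sigmoid(X @ np.array([2.0, -1.0, 0.5]))).astype(float)
print(newton_logreg(X, y))
```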

Overview

  Gradient Descent on the Log Loss
  Multi-Class Classification
  More on Optimization
    Newton’s Method
    Stochastic Gradient Descent (SGD)

Loss: a Function of All Samples

▶ Empirical loss J_S(θ) is a function of the entire data S.*

  J_S^LS(w) = (1/2n) Σ_{i=1}^n (y^(i) − w · x^(i))²
  J_S^LOG(w) = −(1/n) Σ_{i=1}^n log p(y^(i) | x^(i), w)

▶ So the gradient is also a function of the entire data.

  ∇J_S^LS(w) = −(1/n) Σ_{i=1}^n (y^(i) − w · x^(i)) x^(i)
  ∇J_S^LOG(w) = −(1/n) Σ_{i=1}^n (y^(i) − σ(w · x^(i))) x^(i)

▶ Thus one update in gradient descent requires summing over all n samples.

*We’re normalizing by a constant for convenience without loss of generality.

Gradient Estimation Based on a Single Sample

▶ What if we use a single uniformly random sample i ∈ {1, ..., n} to estimate the gradient?

  ∇̂^(i) J_S^LS(w) = −(y^(i) − w · x^(i)) x^(i)
  ∇̂^(i) J_S^LOG(w) = −(y^(i) − σ(w · x^(i))) x^(i)

▶ In expectation, the gradient is consistent:

  E_i[∇̂^(i) J_S(w)] = (1/n) Σ_{i=1}^n ∇̂^(i) J_S(w) = ∇J_S(w)

▶ This is stochastic gradient descent (SGD): estimate the gradient with a single random sample. This is justified as long as the gradient is consistent in expectation.
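
The consistency claim is easy to check numerically: in this illustrative NumPy snippet (not from the slides), the per-sample gradient estimates of J_S^LOG average exactly to the full gradient.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# verify that per-sample gradient estimates average to the full gradient
rng = np.random.default_rng(0)
n, d = 100, 3
X, y = rng.normal(size=(n, d)), rng.integers(0, 2, size=n)
w = rng.normal(size=d)

full_grad = -(X.T @ (y - sigmoid(X @ w))) / n
per_sample = np.stack([-(y[i] - sigmoid(w @ X[i])) * X[i] for i in range(n)])
print(np.allclose(per_sample.mean(axis=0), full_grad))  # True: E_i[estimate] = gradient
```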

SGD with Mini-Batches

▶ Instead of estimating the gradient based on a single random example i, use a random “mini-batch” B ⊆ {1, ..., n}.

  ∇̂^(B) J_S^LS(w) = −(1/|B|) Σ_{i∈B} (y^(i) − w · x^(i)) x^(i)
  ∇̂^(B) J_S^LOG(w) = −(1/|B|) Σ_{i∈B} (y^(i) − σ(w · x^(i))) x^(i)

▶ Still consistent: E_B[∇̂^(B) J_S(w)] = ∇J_S(w)

▶ Mini-batches allow for more stable gradient estimation.

▶ SGD is a special case with |B| = 1.
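
Below is a sketch of the mini-batch estimator for the log loss, together with a Monte Carlo check of the consistency claim; the batch size of 32 and the synthetic data are arbitrary illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def minibatch_grad_log(w, X, y, batch):
    """Mini-batch estimate of the log-loss gradient over the rows in `batch`."""
    Xb, yb = X[batch], y[batch]
    return -(Xb.T @ (yb - sigmoid(Xb @ w))) / len(batch)

# averaging the estimate over many random mini-batches approaches the
# full-batch gradient (consistency in expectation)
rng = np.random.default_rng(0)
n, d = 500, 3
X, y, w = rng.normal(size=(n, d)), rng.integers(0, 2, size=n), rng.normal(size=d)
full = -(X.T @ (y - sigmoid(X @ w))) / n
estimates = [minibatch_grad_log(w, X, y, rng.choice(n, size=32, replace=False))
             for _ in range(2000)]
print(np.round(np.mean(estimates, axis=0) - full, 3))  # close to zero
```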

Stochastic Gradient Descent

Input: training objective J(θ) ∈ R of the form J(θ) = (1/n) Σ_{i=1}^n J_i(θ), number of iterations T
Output: parameter θ̂ ∈ R^d such that J(θ̂) is small

1. Initialize θ̂ (e.g., randomly).
2. For t = 0, ..., T − 1,
   2.1 For i ∈ {1, ..., n} in random order,

         θ̂ ← θ̂ − η_{t,i} ∇J_i(θ̂)

3. Return θ̂.
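
In code, the procedure is a double loop. This generic sketch (not from the slides) assumes a constant step size in place of the schedule η_{t,i} and takes the per-example gradient ∇J_i as a function argument; the toy usage at the end minimizes an average of simple quadratics.

```python
import numpy as np

def sgd(grad_i, theta0, n, T=10, eta=0.05, seed=0):
    """Generic SGD loop following the pseudocode above (a sketch).

    grad_i(theta, i) should return the gradient of the i-th term J_i at theta.
    A constant step size eta stands in for a schedule eta_{t,i}.
    """
    rng = np.random.default_rng(seed)
    theta = np.array(theta0, dtype=float)
    for t in range(T):
        for i in rng.permutation(n):        # each pass visits samples in random order
            theta -= eta * grad_i(theta, i)
    return theta

# tiny usage: J_i(theta) = 0.5 * (theta - a_i)^2, so SGD should approach mean(a)
a = np.array([1.0, 2.0, 3.0, 4.0])
print(sgd(lambda th, i: th - a[i], theta0=[0.0], n=len(a), T=50))  # approximately [2.5]
```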

Stochastic Gradient Descent for Linear Regression

Input: training objective J_S^LS(w) := (1/2) Σ_{i=1}^n (y^(i) − w · x^(i))², number of iterations T
Output: parameter ŵ^LS ∈ R^d such that J_S^LS(ŵ^LS) ≈ J_S^LS(w_S^LS)

1. Initialize ŵ (e.g., randomly).
2. For t = 0, ..., T − 1,
   2.1 For i ∈ {1, ..., n} in random order,

         ŵ ← ŵ + η_{t,i} (y^(i) − ŵ · x^(i)) x^(i)

3. Return ŵ.

Stochastic Gradient Descent for Logistic Regression

Input: training objective J_S^LOG(w) := −(1/n) Σ_{i=1}^n log p(y^(i) | x^(i), w), number of iterations T
Output: parameter ŵ^LOG ∈ R^d such that J_S^LOG(ŵ^LOG) ≈ J_S^LOG(w_S^LOG)

1. Initialize ŵ (e.g., randomly).
2. For t = 0, ..., T − 1,
   2.1 For i ∈ {1, ..., n} in random order,

         ŵ ← ŵ + η_{t,i} (y^(i) − σ(ŵ · x^(i))) x^(i)

3. Return ŵ.
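
Specializing the generic loop to the log loss gives the following sketch (again with a constant step size standing in for η_{t,i}); the synthetic data is only there to show the call.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_logreg(X, y, T=20, eta=0.1, seed=0):
    """SGD for the log loss, one example per update, as in the pseudocode above.

    X is (n, d), y holds 0/1 labels; a constant step size is used for simplicity.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(T):
        for i in rng.permutation(n):
            w += eta * (y[i] - sigmoid(w @ X[i])) * X[i]
    return w

# toy usage on synthetic data
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = (rng.uniform(size=500) < sigmoid(X @ np.array([2.0, -1.0, 0.5]))).astype(float)
print(sgd_logreg(X, y))  # roughly aligned with [2, -1, 0.5]
```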

Summary

▶ Logistic regression: binary classifier that can be trained by optimizing the log loss

▶ Log-linear model: multi-class classifier that can be trained by optimizing the cross-entropy loss

▶ Newton’s method: local search using the curvature of the objective

▶ SGD: gradient descent with stochastic gradient estimation

▶ Cornerstone of modern large-scale machine learning
