On Logistic Regression: Gradients of the Log Loss, Multi-Class Classification, and Other Optimization Techniques
Karl Stratos
June 20, 2018
Recall: Logistic Regression

- Task. Given input x ∈ R^d, predict either 1 or 0 (on or off).
- Model. The probability of on is parameterized by w ∈ R^d as a dot product squashed under the sigmoid/logistic function σ : R → [0, 1]:

      p(1|x, w) := σ(w · x) := 1 / (1 + exp(−w · x))

  The probability of off is p(0|x, w) = 1 − σ(w · x) = σ(−w · x).
- Today's focus:
  1. Optimizing the log loss by gradient descent
  2. Multi-class classification to handle more than two classes
  3. More on optimization: Newton's method, stochastic gradient descent

Overview

- Gradient Descent on the Log Loss
- Multi-Class Classification
- More on Optimization: Newton's Method, Stochastic Gradient Descent (SGD)
Negative Log Probability Under Logistic Function

- Using y ∈ {0, 1} to denote the two classes,

      −log p(y|x, w) = −y log σ(w · x) − (1 − y) log σ(−w · x)
                     = log(1 + exp(−w · x))   if y = 1
                     = log(1 + exp(w · x))    if y = 0

  is convex in w.
- The gradient is given by

      ∇_w (−log p(y|x, w)) = −(y − σ(w · x)) x
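A minimal NumPy sketch of the loss and gradient above, checked against finite differences (the random w, x, y below are illustrative, and `logaddexp` is used for a numerically stable log(1 + exp(·))):

```python
import numpy as np

def sigmoid(z):
    # Logistic function sigma(z) = 1 / (1 + exp(-z)).
    return 1.0 / (1.0 + np.exp(-z))

def neg_log_prob(w, x, y):
    # -log p(y|x, w): log(1 + exp(-w.x)) if y = 1, log(1 + exp(w.x)) if y = 0.
    z = np.dot(w, x)
    return np.logaddexp(0.0, -z if y == 1 else z)

def grad(w, x, y):
    # Closed-form gradient of -log p(y|x, w) wrt. w: -(y - sigma(w.x)) x.
    return -(y - sigmoid(np.dot(w, x))) * x

# Finite-difference check that the closed-form gradient matches.
rng = np.random.default_rng(0)
w, x, y = rng.normal(size=3), rng.normal(size=3), 1
eps = 1e-6
numerical = np.array(
    [(neg_log_prob(w + eps * e, x, y) - neg_log_prob(w - eps * e, x, y)) / (2 * eps)
     for e in np.eye(3)])
assert np.allclose(numerical, grad(w, x, y), atol=1e-6)
```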
Logistic Regression Objective

- Given iid samples S = {(x^(i), y^(i))}_{i=1}^n, find w that minimizes the empirical negative log likelihood of S ("log loss"):

      J_S^LOG(w) := −(1/n) Σ_{i=1}^n log p(y^(i) | x^(i), w)

- Gradient?

      ∇J_S^LOG(w) = −(1/n) Σ_{i=1}^n (y^(i) − σ(w · x^(i))) x^(i)

- Unlike in linear regression, there is no closed-form solution for

      w_S^LOG := argmin_{w ∈ R^d} J_S^LOG(w)

- But J_S^LOG(w) is convex and differentiable! So we can do gradient descent and approach an optimal solution.

Gradient Descent for Logistic Regression
Input: training objective

      J_S^LOG(w) := −(1/n) Σ_{i=1}^n log p(y^(i) | x^(i), w),

number of iterations T
Output: parameter ŵ^LOG ∈ R^d such that J_S^LOG(ŵ^LOG) ≈ J_S^LOG(w_S^LOG)

1. Initialize θ_0 (e.g., randomly).
2. For t = 0 … T − 1:

      θ_{t+1} = θ_t + (η_t / n) Σ_{i=1}^n (y^(i) − σ(θ_t · x^(i))) x^(i)

3. Return ŵ^LOG = θ_T.
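The algorithm above can be sketched in NumPy as follows; the constant step size and the synthetic data drawn from a known w_star are illustrative assumptions, not part of the algorithm:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, T=500, eta=1.0):
    """Batch gradient descent on the log loss.
    X: (n, d) matrix of inputs, y: (n,) labels in {0, 1}."""
    n, d = X.shape
    theta = np.zeros(d)
    for t in range(T):
        # theta_{t+1} = theta_t + (eta_t / n) sum_i (y_i - sigma(theta_t . x_i)) x_i
        theta = theta + (eta / n) * (X.T @ (y - sigmoid(X @ theta)))
    return theta

# Toy run on synthetic data generated from a known parameter vector w_star.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
w_star = np.array([1.0, -2.0, 0.5])
y = (rng.uniform(size=200) < sigmoid(X @ w_star)).astype(float)
w_hat = gradient_descent(X, y, T=2000)
accuracy = np.mean((sigmoid(X @ w_hat) > 0.5) == (y == 1))
```

Since the objective is convex, any fixed step size below 2/L (L the smoothness constant) brings θ_t toward an optimum.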
From Binary to m Classes

- A logistic regressor has a single w ∈ R^d to define the probability of "on" (against "off"):

      p(1|x, w) = exp(w · x) / (1 + exp(w · x))

- A log-linear model has a vector w^y ∈ R^d for each possible class y ∈ {1, …, m} to define the probability of the class equaling y:

      p(y|x, θ) = exp(w^y · x) / Σ_{y'=1}^m exp(w^{y'} · x)

  where θ = {w^t}_{t=1}^m denotes the set of parameters.
- Is this a valid probability distribution?
Softmax Formulation

- We can transform any vector z ∈ R^m into a probability distribution over m elements by the softmax function softmax : R^m → Δ^{m−1}:

      softmax_i(z) := exp(z_i) / Σ_{j=1}^m exp(z_j)   ∀ i = 1 … m

- A log-linear model is a linear transformation by a matrix W ∈ R^{m×d} (with rows w^1 … w^m) followed by softmax:

      p(y|x, W) = softmax_y(Wx)
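A short sketch of the softmax formulation; the particular W and x below are made-up illustrative values (m = 3 classes, d = 2 features), and subtracting max(z) is a standard numerical-stability trick that does not change the output:

```python
import numpy as np

def softmax(z):
    # Softmax is shift-invariant, so subtracting max(z) leaves the result
    # unchanged while preventing overflow in exp.
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Log-linear model as matrix-multiply-then-softmax: p(y|x, W) = softmax_y(Wx).
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [-1.0, -1.0]])   # rows w^1, w^2, w^3
x = np.array([0.5, -0.5])
probs = softmax(W @ x)
# probs is a valid distribution: positive entries summing to 1.
```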
Negative Log Probability Under Log-Linear Model

- For any y ∈ {1, …, m},

      −log p(y|x, θ) = log Σ_{y'=1}^m exp(w^{y'} · x) − w^y · x

  (the first term is constant wrt. y; the second is linear) is convex in each w^l ∈ θ.
- For any l ∈ {1, …, m}, the gradient wrt. w^l is given by

      ∇_{w^l} (−log p(y|x, θ)) = −(1 − p(l|x, θ)) x   if l = y
                                 = p(l|x, θ) x        if l ≠ y
Log-Linear Model Objective

- Given iid samples S = {(x^(i), y^(i))}_{i=1}^n, find θ that minimizes the empirical negative log likelihood of S:

      J_S^LLM(θ) := −(1/n) Σ_{i=1}^n log p(y^(i) | x^(i), θ)

  This is the so-called cross-entropy loss.
- Again convex and differentiable, so it can be optimized by gradient descent to approach an optimal solution.
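The cross-entropy loss and its gradient (assembled from the per-class gradients on the previous slide) can be sketched as below; the vectorized form stacks the per-class formula into one (m, d) matrix:

```python
import numpy as np

def softmax_rows(Z):
    # Stable row-wise softmax for an (n, m) matrix of scores.
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def cross_entropy_loss_and_grad(W, X, y):
    """Cross-entropy loss J^LLM and its gradient wrt. W.
    W: (m, d) parameter matrix, X: (n, d) inputs, y: (n,) labels in {0, ..., m-1}."""
    n = X.shape[0]
    P = softmax_rows(X @ W.T)              # P[i, l] = p(l | x_i, W)
    loss = -np.mean(np.log(P[np.arange(n), y]))
    # Per-class gradient: (p(l|x) - [l = y]) x, averaged over the n samples.
    P[np.arange(n), y] -= 1.0
    grad = P.T @ X / n                     # shape (m, d)
    return loss, grad
```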
One Way to Motivate Gradient Descent

- At the current parameter value θ ∈ R^d, choose the update ∆ ∈ R^d by minimizing the first-order Taylor approximation around θ with squared l2 regularization:

      J(θ + ∆) ≈ J(θ) + ∇J(θ)ᵀ∆ + (1/2) ||∆||² =: J_1(∆)

      ∆^GD = argmin_{∆ ∈ R^d} J_1(∆) = −∇J(θ)

- Equivalent to minimizing a second-order Taylor approximation but with uniform curvature:

      J(θ) + ∇J(θ)ᵀ∆ + (1/2) ∆ᵀ I_{d×d} ∆
Newton's Method

- Actually minimize the second-order Taylor approximation:

      J(θ + ∆) ≈ J(θ) + ∇J(θ)ᵀ∆ + (1/2) ∆ᵀ ∇²J(θ) ∆ =: J_2(∆)

      ∆^NEWTON = argmin_{∆ ∈ R^d} J_2(∆) = −∇²J(θ)⁻¹ ∇J(θ)

- Equivalent to gradient descent after a change of coordinates by ∇²J(θ)^{1/2}.
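For the log loss, Newton's method can be sketched concretely: the Hessian has the closed form (1/n) Xᵀ D X with D = diag(σ_i (1 − σ_i)). The small `ridge` term below is an added safeguard to keep the Hessian invertible, an assumption of this sketch rather than part of the derivation above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_logistic(X, y, T=10, ridge=1e-6):
    """Newton's method on J^LOG for X: (n, d), y: (n,) in {0, 1}."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(T):
        p = sigmoid(X @ w)
        g = -X.T @ (y - p) / n                          # gradient of J^LOG
        H = X.T @ (X * (p * (1 - p))[:, None]) / n      # Hessian of J^LOG
        w = w - np.linalg.solve(H + ridge * np.eye(d), g)   # Delta = -H^{-1} g
    return w
```

On well-conditioned problems this typically converges in far fewer iterations than gradient descent, at the cost of forming and solving a d × d system per step.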
Loss: a Function of All Samples

- Empirical loss J_S(θ) is a function of the entire data S:*

      J_S^LS(w)  = (1/2n) Σ_{i=1}^n (y^(i) − w · x^(i))²
      J_S^LOG(w) = −(1/n) Σ_{i=1}^n log p(y^(i) | x^(i), w)

- So the gradient is also a function of the entire data:

      ∇J_S^LS(w)  = −(1/n) Σ_{i=1}^n (y^(i) − w · x^(i)) x^(i)
      ∇J_S^LOG(w) = −(1/n) Σ_{i=1}^n (y^(i) − σ(w · x^(i))) x^(i)

- Thus one update in gradient descent requires summing over all n samples.

* We're normalizing by a constant for convenience without loss of generality.
Gradient Estimation Based on a Single Sample

- What if we use a single uniformly random sample i ∈ {1, …, n} to estimate the gradient?

      ∇̂^(i) J_S^LS(w)  = −(y^(i) − w · x^(i)) x^(i)
      ∇̂^(i) J_S^LOG(w) = −(y^(i) − σ(w · x^(i))) x^(i)

- In expectation, the estimate is consistent:

      E_i [∇̂^(i) J_S(w)] = (1/n) Σ_{i=1}^n ∇̂^(i) J_S(w) = ∇J_S(w)

- This is stochastic gradient descent (SGD): estimate the gradient with a single random sample. This is justified as long as the gradient estimate is consistent in expectation.
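The consistency claim above can be verified numerically for the log loss: averaging the n single-sample estimates recovers the full-batch gradient exactly (the random data below is illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n, d = 50, 4
X = rng.normal(size=(n, d))
y = rng.integers(0, 2, size=n).astype(float)
w = rng.normal(size=d)

# Single-sample estimates: row i holds -(y_i - sigma(w.x_i)) x_i.
per_sample = -(y - sigmoid(X @ w))[:, None] * X

# Full-batch gradient computed directly from the definition of J^LOG.
full_grad = -X.T @ (y - sigmoid(X @ w)) / n

# E_i over the uniform choice of i is just the average of the n estimates.
assert np.allclose(per_sample.mean(axis=0), full_grad)
```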
SGD with Mini-Batches

- Instead of estimating the gradient based on a single random example i, use a random "mini-batch" B ⊆ {1, …, n}:

      ∇̂^(B) J_S^LS(w)  = −(1/|B|) Σ_{i∈B} (y^(i) − w · x^(i)) x^(i)
      ∇̂^(B) J_S^LOG(w) = −(1/|B|) Σ_{i∈B} (y^(i) − σ(w · x^(i))) x^(i)

- Still consistent: E_B [∇̂^(B) J_S(w)] = ∇J_S(w)
- Mini-batches allow for more stable gradient estimation.
- SGD is a special case with |B| = 1.
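The "more stable" claim can be made concrete: over repeated random batches, the spread of the mini-batch estimate around the full gradient shrinks as |B| grows. The sizes and trial count below are arbitrary choices for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n, d = 1000, 5
X = rng.normal(size=(n, d))
y = rng.integers(0, 2, size=n).astype(float)
w = rng.normal(size=d)

def minibatch_grad(B):
    # Mini-batch estimate: -(1/|B|) sum_{i in B} (y_i - sigma(w.x_i)) x_i
    return -X[B].T @ (y[B] - sigmoid(X[B] @ w)) / len(B)

def estimator_spread(batch_size, trials=2000):
    # Norm of the coordinate-wise standard deviation over random batches.
    ests = np.array([minibatch_grad(rng.choice(n, size=batch_size, replace=False))
                     for _ in range(trials)])
    return np.linalg.norm(ests.std(axis=0))
```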
Stochastic Gradient Descent

Input: training objective J(θ) ∈ R of the form

      J(θ) = (1/n) Σ_{i=1}^n J_i(θ),

number of iterations T
Output: parameter θ̂ ∈ R^d such that J(θ̂) is small

1. Initialize θ̂ (e.g., randomly).
2. For t = 0 … T − 1:
   2.1 For i ∈ {1, …, n} in random order:

         θ̂ ← θ̂ − η_{t,i} ∇J_i(θ̂)

3. Return θ̂.
Stochastic Gradient Descent for Linear Regression

Input: training objective

      J_S^LS(w) := (1/2n) Σ_{i=1}^n (y^(i) − w · x^(i))²,

number of iterations T
Output: parameter ŵ ∈ R^d such that J_S^LS(ŵ) ≈ J_S^LS(w_S^LS)

1. Initialize ŵ (e.g., randomly).
2. For t = 0 … T − 1:
   2.1 For i ∈ {1, …, n} in random order:

         ŵ ← ŵ + η_{t,i} (y^(i) − ŵ · x^(i)) x^(i)

3. Return ŵ.
Stochastic Gradient Descent for Logistic Regression

Input: training objective

      J_S^LOG(w) := −(1/n) Σ_{i=1}^n log p(y^(i) | x^(i), w),

number of iterations T
Output: parameter ŵ^LOG ∈ R^d such that J_S^LOG(ŵ^LOG) ≈ J_S^LOG(w_S^LOG)

1. Initialize ŵ (e.g., randomly).
2. For t = 0 … T − 1:
   2.1 For i ∈ {1, …, n} in random order:

         ŵ ← ŵ + η_{t,i} (y^(i) − σ(ŵ · x^(i))) x^(i)

3. Return ŵ.
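The SGD algorithm above can be sketched as follows; a constant step size `eta` stands in for the schedule η_{t,i}, and the synthetic data from a known w_star is an illustrative assumption:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_logistic(X, y, T=10, eta=0.1, seed=0):
    """SGD on the log loss, one example per update.
    X: (n, d) inputs, y: (n,) labels in {0, 1}, T passes over the data."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(T):
        for i in rng.permutation(n):   # visit samples in random order each pass
            w = w + eta * (y[i] - sigmoid(w @ X[i])) * X[i]
    return w

# Toy run on synthetic data generated from a known parameter vector.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
w_star = np.array([2.0, -1.0, 0.5])
y = (rng.uniform(size=500) < sigmoid(X @ w_star)).astype(float)
w_hat = sgd_logistic(X, y)
accuracy = np.mean((sigmoid(X @ w_hat) > 0.5) == (y == 1))
```

Each update touches a single sample, so one pass here costs roughly the same as one full-batch gradient step, while making n parameter updates.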
Summary

- Logistic regression: binary classifier that can be trained by optimizing the log loss
- Log-linear model: multi-class classifier that can be trained by optimizing the cross-entropy loss
- Newton's method: local search using the curvature of the loss function
- SGD: gradient descent with stochastic gradient estimation
  - A cornerstone of modern large-scale machine learning