
Gradients of the Log Loss, Multi-Class Classification, and Other Optimization Techniques

Karl Stratos

June 20, 2018


Recall: Logistic Regression

▶ Task. Given input x ∈ R^d, predict either 1 or 0 (on or off).

▶ Model. The probability of on is parameterized by w ∈ R^d as a dot product squashed under the sigmoid/logistic function σ : R → [0, 1].

  p(1 | x, w) := σ(w · x) := 1 / (1 + exp(−w · x))

  The probability of off is p(0 | x, w) = 1 − σ(w · x) = σ(−w · x).

▶ Today’s focus:
  1. Optimizing the log loss by gradient descent
  2. Multi-class classification to handle more than two classes
  3. More on optimization: Newton’s method, stochastic gradient descent

Overview

  Gradient Descent on the Log Loss
  Multi-Class Classification
  More on Optimization
    Newton’s Method
    Stochastic Gradient Descent (SGD)


Negative Log Probability Under Logistic Function

▶ Using y ∈ {0, 1} to denote the two classes,

  −log p(y | x, w) = −y log σ(w · x) − (1 − y) log σ(−w · x)
                   = log(1 + exp(−w · x))   if y = 1
                   = log(1 + exp(w · x))    if y = 0

  is convex in w.

▶ Gradient given by

  ∇_w (−log p(y | x, w)) = −(y − σ(w · x)) x
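
To make the formula concrete, here is a minimal NumPy sketch (not part of the original slides) that evaluates −log p(y | x, w) and the gradient above, and checks the gradient against central finite differences; the helper names and random test values are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_log_prob(w, x, y):
    # -log p(y | x, w) for y in {0, 1}, using the two cases from the slide
    z = w @ x
    return np.log1p(np.exp(-z)) if y == 1 else np.log1p(np.exp(z))

def grad_neg_log_prob(w, x, y):
    # gradient from the slide: -(y - sigma(w . x)) x
    return -(y - sigmoid(w @ x)) * x

# central finite-difference check of the gradient formula
rng = np.random.default_rng(0)
w, x, y = rng.normal(size=3), rng.normal(size=3), 1
eps = 1e-6
numeric = np.array([(neg_log_prob(w + eps * e, x, y) - neg_log_prob(w - eps * e, x, y)) / (2 * eps)
                    for e in np.eye(3)])
print(np.allclose(numeric, grad_neg_log_prob(w, x, y)))  # True
```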



Logistic Regression Objective

▶ Given iid samples S = {(x^(i), y^(i))}_{i=1}^n, find w that minimizes the empirical negative log likelihood of S (“log loss”):

  J_S^LOG(w) := −(1/n) Σ_{i=1}^n log p(y^(i) | x^(i), w)

▶ Gradient?

  ∇J_S^LOG(w) = −(1/n) Σ_{i=1}^n (y^(i) − σ(w · x^(i))) x^(i)

▶ Unlike in linear regression, there is no closed-form solution for

  w_S^LOG := argmin_{w ∈ R^d} J_S^LOG(w)

▶ But J_S^LOG(w) is convex and differentiable! So we can do gradient descent and approach an optimal solution.

Gradient Descent for Logistic Regression

Input: training objective J_S^LOG(w) := −(1/n) Σ_{i=1}^n log p(y^(i) | x^(i), w), number of iterations T
Output: parameter ŵ^LOG ∈ R^d such that J_S^LOG(ŵ^LOG) ≈ J_S^LOG(w_S^LOG)

1. Initialize θ_0 (e.g., randomly).
2. For t = 0, ..., T − 1,

     θ_{t+1} = θ_t + (η_t / n) Σ_{i=1}^n (y^(i) − σ(θ_t · x^(i))) x^(i)

3. Return θ_T.
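
A compact NumPy implementation of this procedure might look as follows; it is a sketch that assumes a constant step size η in place of a schedule η_t, and the synthetic data at the bottom is only there to show the call.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent_logreg(X, y, T=500, eta=0.1):
    """Batch gradient descent on the log loss.

    X: (n, d) array of inputs, y: (n,) array of 0/1 labels.
    Returns the parameter vector after T full-batch updates.
    """
    n, d = X.shape
    theta = np.zeros(d)
    for t in range(T):
        # average update direction: (1/n) sum_i (y_i - sigma(theta . x_i)) x_i
        residual = y - sigmoid(X @ theta)          # shape (n,)
        theta = theta + (eta / n) * (X.T @ residual)
    return theta

# toy usage on synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = (rng.uniform(size=200) < sigmoid(X @ true_w)).astype(float)
print(gradient_descent_logreg(X, y))  # roughly recovers the direction of true_w
```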

Overview

  Gradient Descent on the Log Loss
  Multi-Class Classification
  More on Optimization
    Newton’s Method
    Stochastic Gradient Descent (SGD)

From Binary to m Classes

▶ A logistic regressor has a single w ∈ R^d to define the probability of “on” (against “off”):

  p(1 | x, w) = exp(w · x) / (1 + exp(w · x))

▶ A log-linear model has w^y ∈ R^d for each possible class y ∈ {1, ..., m} to define the probability of the class equaling y:

  p(y | x, θ) = exp(w^y · x) / Σ_{y′=1}^m exp(w^{y′} · x)

  where θ = {w^t}_{t=1}^m denotes the set of parameters.

▶ Is this a valid probability distribution?

Softmax Formulation

▶ We can transform any vector z ∈ R^m into a probability distribution over m elements by the softmax function softmax : R^m → ∆^{m−1},

  softmax_i(z) := exp(z_i) / Σ_{j=1}^m exp(z_j)   ∀ i = 1, ..., m

▶ A log-linear model is a linear transformation by a matrix W ∈ R^{m×d} (with rows w^1, ..., w^m) followed by softmax:

  p(y | x, W) = softmax_y(Wx)
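
As an illustration, a NumPy version of the softmax and the resulting log-linear probabilities could look like this (a sketch, not code from the slides); subtracting the maximum before exponentiating is the standard trick to avoid overflow and does not change the result.

```python
import numpy as np

def softmax(z):
    # subtract the max for numerical stability; softmax is invariant to
    # adding a constant to every coordinate, so the output is unchanged
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def log_linear_probs(W, x):
    """p(. | x, W) = softmax(Wx) for a weight matrix W with one row per class."""
    return softmax(W @ x)

# toy usage: m = 3 classes, d = 4 features (shapes are illustrative)
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
x = rng.normal(size=4)
p = log_linear_probs(W, x)
print(p, p.sum())  # nonnegative entries summing to 1
```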

Negative Log Probability Under Log-Linear Model

▶ For any y ∈ {1, ..., m},

  −log p(y | x, θ) = log( Σ_{y′=1}^m exp(w^{y′} · x) ) − w^y · x

  is convex in each w^l ∈ θ (the log-sum-exp term is constant with respect to y; the second term is linear).

▶ For any l ∈ {1, ..., m}, the gradient wrt. w^l is given by

  ∇_{w^l} (−log p(y | x, θ)) = −(1 − p(l | x, θ)) x   if l = y
                             =  p(l | x, θ) x          if l ≠ y
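
The two cases above collapse to the single expression (p(l | x, θ) − 1{l = y}) x for row l, which is what this illustrative NumPy sketch (not from the slides) implements for all rows of W at once, together with a finite-difference spot check.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def grad_neg_log_prob_multiclass(W, x, y):
    """Gradient of -log p(y | x, W) with respect to each row w^l of W.

    Row l of the result is (p(l | x) - 1{l = y}) * x, the one-line form of
    the case-by-case gradient on the slide.
    """
    p = softmax(W @ x)                    # class probabilities, shape (m,)
    indicator = np.zeros_like(p)
    indicator[y] = 1.0
    return np.outer(p - indicator, x)     # shape (m, d)

# finite-difference spot check on one coordinate (values are illustrative)
rng = np.random.default_rng(0)
W, x, y = rng.normal(size=(3, 4)), rng.normal(size=4), 2
loss = lambda M: -np.log(softmax(M @ x)[y])
eps = 1e-6
E = np.zeros_like(W); E[1, 2] = eps
numeric = (loss(W + E) - loss(W - E)) / (2 * eps)
print(np.isclose(numeric, grad_neg_log_prob_multiclass(W, x, y)[1, 2]))  # True
```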

Log-Linear Model Objective

▶ Given iid samples S = {(x^(i), y^(i))}_{i=1}^n, find θ that minimizes the empirical negative log likelihood of S:

  J_S^LLM(θ) := −(1/n) Σ_{i=1}^n log p(y^(i) | x^(i), θ)

  This is the so-called cross-entropy loss.

▶ Again convex and differentiable, so it can be optimized by gradient descent to reach an optimal solution.

Overview

  Gradient Descent on the Log Loss
  Multi-Class Classification
  More on Optimization
    Newton’s Method
    Stochastic Gradient Descent (SGD)

One Way to Motivate Gradient Descent

▶ At current parameter value θ ∈ R^d, choose update ∆ ∈ R^d by minimizing the first-order Taylor approximation around θ with squared l2 regularization:

  J(θ + ∆) ≈ J(θ) + ∇J(θ)^⊤ ∆ + (1/2) ||∆||_2^2 =: J_1(∆)

  ∆^GD = argmin_{∆ ∈ R^d} J_1(∆) = −∇J(θ)

▶ Equivalent to minimizing a second-order Taylor approximation but with uniform curvature:

  J(θ) + ∇J(θ)^⊤ ∆ + (1/2) ∆^⊤ I_{d×d} ∆

Newton’s Method

▶ Actually minimize the second-order Taylor approximation:

  J(θ + ∆) ≈ J(θ) + ∇J(θ)^⊤ ∆ + (1/2) ∆^⊤ ∇²J(θ) ∆ =: J_2(∆)

  ∆^NEWTON = argmin_{∆ ∈ R^d} J_2(∆) = −∇²J(θ)^{−1} ∇J(θ)

▶ Equivalent to gradient descent after a change of coordinates by ∇²J(θ)^{1/2}
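
As an illustration, here is a sketch of Newton’s method applied to the log loss J_S^LOG, using the standard closed forms for its gradient and Hessian (the Hessian formula is not derived on these slides); the small damping term added to the Hessian is an extra safeguard for the linear solve, not something from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_logreg(X, y, T=10, damping=1e-6):
    """Newton's method on the log loss (a sketch).

    Uses gradient -(1/n) X^T (y - sigma(Xw)) and Hessian
    (1/n) X^T diag(s (1 - s)) X, plus a small damping term.
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(T):
        s = sigmoid(X @ w)
        grad = -(X.T @ (y - s)) / n
        hess = (X.T * (s * (1 - s))) @ X / n + damping * np.eye(d)
        w = w - np.linalg.solve(hess, grad)   # Delta^NEWTON = -H^{-1} grad
    return w

# usage: typically needs far fewer iterations than plain gradient descent
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (rng.uniform(size=200) < sigmoid(X @ np.array([2.0, -1.0, 0.5]))).astype(float)
print(newton_logreg(X, y))
```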

Overview

  Gradient Descent on the Log Loss
  Multi-Class Classification
  More on Optimization
    Newton’s Method
    Stochastic Gradient Descent (SGD)

Loss: a Function of All Samples

▶ Empirical loss J_S(θ) is a function of the entire data S.*

  J_S^LS(w) = (1/2n) Σ_{i=1}^n (y^(i) − w · x^(i))²
  J_S^LOG(w) = −(1/n) Σ_{i=1}^n log p(y^(i) | x^(i), w)

▶ So the gradient is also a function of the entire data.

  ∇J_S^LS(w) = −(1/n) Σ_{i=1}^n (y^(i) − w · x^(i)) x^(i)
  ∇J_S^LOG(w) = −(1/n) Σ_{i=1}^n (y^(i) − σ(w · x^(i))) x^(i)

▶ Thus one update in gradient descent requires summing over all n samples.

*We’re normalizing by a constant for convenience without loss of generality.

Gradient Estimation Based on a Single Sample

▶ What if we use a single uniformly random sample i ∈ {1, ..., n} to estimate the gradient?

  ∇̂^(i) J_S^LS(w) = −(y^(i) − w · x^(i)) x^(i)
  ∇̂^(i) J_S^LOG(w) = −(y^(i) − σ(w · x^(i))) x^(i)

▶ In expectation, the gradient is consistent:

  E_i[∇̂^(i) J_S(w)] = (1/n) Σ_{i=1}^n ∇̂^(i) J_S(w) = ∇J_S(w)

▶ This is stochastic gradient descent (SGD): estimate the gradient with a single random sample. This is justified as long as the gradient is consistent in expectation.
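
The consistency claim is easy to check numerically: in this illustrative NumPy snippet (not from the slides), the per-sample gradient estimates of J_S^LOG average exactly to the full gradient.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# verify that per-sample gradient estimates average to the full gradient
rng = np.random.default_rng(0)
n, d = 100, 3
X, y = rng.normal(size=(n, d)), rng.integers(0, 2, size=n)
w = rng.normal(size=d)

full_grad = -(X.T @ (y - sigmoid(X @ w))) / n
per_sample = np.stack([-(y[i] - sigmoid(w @ X[i])) * X[i] for i in range(n)])
print(np.allclose(per_sample.mean(axis=0), full_grad))  # True: E_i[estimate] = gradient
```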

SGD with Mini-Batches

▶ Instead of estimating the gradient based on a single random example i, use a random “mini-batch” B ⊆ {1, ..., n}.

  ∇̂^(B) J_S^LS(w) = −(1/|B|) Σ_{i∈B} (y^(i) − w · x^(i)) x^(i)
  ∇̂^(B) J_S^LOG(w) = −(1/|B|) Σ_{i∈B} (y^(i) − σ(w · x^(i))) x^(i)

▶ Still consistent: E_B[∇̂^(B) J_S(w)] = ∇J_S(w)

▶ Mini-batches allow for more stable gradient estimation.

▶ SGD is a special case with |B| = 1.
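
Below is a sketch of the mini-batch estimator for the log loss, together with a Monte Carlo check of the consistency claim; the batch size of 32 and the synthetic data are arbitrary illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def minibatch_grad_log(w, X, y, batch):
    """Mini-batch estimate of the log-loss gradient over the rows in `batch`."""
    Xb, yb = X[batch], y[batch]
    return -(Xb.T @ (yb - sigmoid(Xb @ w))) / len(batch)

# averaging the estimate over many random mini-batches approaches the
# full-batch gradient (consistency in expectation)
rng = np.random.default_rng(0)
n, d = 500, 3
X, y, w = rng.normal(size=(n, d)), rng.integers(0, 2, size=n), rng.normal(size=d)
full = -(X.T @ (y - sigmoid(X @ w))) / n
estimates = [minibatch_grad_log(w, X, y, rng.choice(n, size=32, replace=False))
             for _ in range(2000)]
print(np.round(np.mean(estimates, axis=0) - full, 3))  # close to zero
```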

Stochastic Gradient Descent

Input: training objective J(θ) ∈ R of the form J(θ) = (1/n) Σ_{i=1}^n J_i(θ), number of iterations T
Output: parameter θ̂ ∈ R^d such that J(θ̂) is small

1. Initialize θ̂ (e.g., randomly).
2. For t = 0, ..., T − 1,
   2.1 For i ∈ {1, ..., n} in random order,

         θ̂ ← θ̂ − η_{t,i} ∇J_i(θ̂)

3. Return θ̂.
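
In code, the procedure is a double loop. This generic sketch (not from the slides) assumes a constant step size in place of the schedule η_{t,i} and takes the per-example gradient ∇J_i as a function argument; the toy usage at the end minimizes an average of simple quadratics.

```python
import numpy as np

def sgd(grad_i, theta0, n, T=10, eta=0.05, seed=0):
    """Generic SGD loop following the pseudocode above (a sketch).

    grad_i(theta, i) should return the gradient of the i-th term J_i at theta.
    A constant step size eta stands in for a schedule eta_{t,i}.
    """
    rng = np.random.default_rng(seed)
    theta = np.array(theta0, dtype=float)
    for t in range(T):
        for i in rng.permutation(n):        # each pass visits samples in random order
            theta -= eta * grad_i(theta, i)
    return theta

# tiny usage: J_i(theta) = 0.5 * (theta - a_i)^2, so SGD should approach mean(a)
a = np.array([1.0, 2.0, 3.0, 4.0])
print(sgd(lambda th, i: th - a[i], theta0=[0.0], n=len(a), T=50))  # approximately [2.5]
```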

Stochastic Gradient Descent for Linear Regression

Input: training objective J_S^LS(w) := (1/2) Σ_{i=1}^n (y^(i) − w · x^(i))², number of iterations T
Output: parameter ŵ^LS ∈ R^d such that J_S^LS(ŵ^LS) ≈ J_S^LS(w_S^LS)

1. Initialize ŵ (e.g., randomly).
2. For t = 0, ..., T − 1,
   2.1 For i ∈ {1, ..., n} in random order,

         ŵ ← ŵ + η_{t,i} (y^(i) − ŵ · x^(i)) x^(i)

3. Return ŵ.

Stochastic Gradient Descent for Logistic Regression

Input: training objective J_S^LOG(w) := −(1/n) Σ_{i=1}^n log p(y^(i) | x^(i), w), number of iterations T
Output: parameter ŵ^LOG ∈ R^d such that J_S^LOG(ŵ^LOG) ≈ J_S^LOG(w_S^LOG)

1. Initialize ŵ (e.g., randomly).
2. For t = 0, ..., T − 1,
   2.1 For i ∈ {1, ..., n} in random order,

         ŵ ← ŵ + η_{t,i} (y^(i) − σ(ŵ · x^(i))) x^(i)

3. Return ŵ.
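
Specializing the generic loop to the log loss gives the following sketch (again with a constant step size standing in for η_{t,i}); the synthetic data is only there to show the call.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_logreg(X, y, T=20, eta=0.1, seed=0):
    """SGD for the log loss, one example per update, as in the pseudocode above.

    X is (n, d), y holds 0/1 labels; a constant step size is used for simplicity.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(T):
        for i in rng.permutation(n):
            w += eta * (y[i] - sigmoid(w @ X[i])) * X[i]
    return w

# toy usage on synthetic data
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = (rng.uniform(size=500) < sigmoid(X @ np.array([2.0, -1.0, 0.5]))).astype(float)
print(sgd_logreg(X, y))  # roughly aligned with [2, -1, 0.5]
```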

Summary

▶ Logistic regression: binary classifier that can be trained by optimizing the log loss

▶ Log-linear model: multi-class classifier that can be trained by optimizing the cross-entropy loss

▶ Newton’s method: local search using the curvature of the objective

▶ SGD: gradient descent with stochastic gradient estimation

▶ Cornerstone of modern large-scale machine learning
