On Logistic Regression: Gradients of the Log Loss, Multi-Class Classification, and Other Optimization Techniques

Karl Stratos

June 20, 2018

Recall: Logistic Regression

- Task. Given input $x \in \mathbb{R}^d$, predict either 1 or 0 (on or off).
- Model. The probability of on is parameterized by $w \in \mathbb{R}^d$ as a dot product squashed under the sigmoid/logistic function $\sigma : \mathbb{R} \to [0, 1]$:
  $$p(1 \mid x; w) := \sigma(w \cdot x) := \frac{1}{1 + \exp(-w \cdot x)}$$
  The probability of off is $p(0 \mid x; w) = 1 - \sigma(w \cdot x) = \sigma(-w \cdot x)$.
- Today's focus:
  1. Optimizing the log loss by gradient descent
  2. Multi-class classification to handle more than two classes
  3. More on optimization: Newton's method, stochastic gradient descent

Overview

- Gradient Descent on the Log Loss
- Multi-Class Classification
- More on Optimization
  - Newton's Method
  - Stochastic Gradient Descent (SGD)

Negative Log Probability Under Logistic Function

- Using $y \in \{0, 1\}$ to denote the two classes,
  $$-\log p(y \mid x; w) = -y \log \sigma(w \cdot x) - (1 - y) \log \sigma(-w \cdot x) = \begin{cases} \log(1 + \exp(-w \cdot x)) & \text{if } y = 1 \\ \log(1 + \exp(w \cdot x)) & \text{if } y = 0 \end{cases}$$
  is convex in $w$.
- Gradient given by
  $$\nabla_w \left( -\log p(y \mid x; w) \right) = -\left(y - \sigma(w \cdot x)\right) x$$
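As an illustration (not part of the original slides), here is a minimal NumPy sketch of the per-example loss and its gradient, with a finite-difference check of the gradient formula above; the helper names `neg_log_prob` and `neg_log_prob_grad` are illustrative choices, not from the slides.

```python
import numpy as np

def sigmoid(z):
    # Logistic function sigma(z) = 1 / (1 + exp(-z)).
    return 1.0 / (1.0 + np.exp(-z))

def neg_log_prob(w, x, y):
    # Per-example loss -log p(y | x; w) for y in {0, 1}, written with
    # logaddexp for stability: log(1 + exp(-z)) if y = 1, log(1 + exp(z)) if y = 0.
    z = w @ x
    return np.logaddexp(0.0, -z) if y == 1 else np.logaddexp(0.0, z)

def neg_log_prob_grad(w, x, y):
    # Gradient from the slide: -(y - sigma(w . x)) x.
    return -(y - sigmoid(w @ x)) * x

# Central finite-difference check of the gradient formula on random data.
rng = np.random.default_rng(0)
w, x, y = rng.normal(size=3), rng.normal(size=3), 1
eps = 1e-6
numeric = np.array([(neg_log_prob(w + eps * e, x, y) - neg_log_prob(w - eps * e, x, y)) / (2 * eps)
                    for e in np.eye(3)])
print(np.allclose(numeric, neg_log_prob_grad(w, x, y), atol=1e-6))  # expect True
```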
Logistic Regression Objective

- Given iid samples $S = \left(\left(x^{(i)}, y^{(i)}\right)\right)_{i=1}^n$, find $w$ that minimizes the empirical negative log likelihood of $S$ (the "log loss"):
  $$J_S^{\mathrm{LOG}}(w) := -\frac{1}{n} \sum_{i=1}^n \log p\left(y^{(i)} \mid x^{(i)}; w\right)$$
- Gradient?
  $$\nabla J_S^{\mathrm{LOG}}(w) = -\frac{1}{n} \sum_{i=1}^n \left(y^{(i)} - \sigma\left(w \cdot x^{(i)}\right)\right) x^{(i)}$$
- Unlike in linear regression, there is no closed-form solution for
  $$w_S^{\mathrm{LOG}} := \arg\min_{w \in \mathbb{R}^d} J_S^{\mathrm{LOG}}(w)$$
- But $J_S^{\mathrm{LOG}}(w)$ is convex and differentiable! So we can do gradient descent and approach an optimal solution.

Gradient Descent for Logistic Regression

Input: training objective
$$J_S^{\mathrm{LOG}}(w) := -\frac{1}{n} \sum_{i=1}^n \log p\left(y^{(i)} \mid x^{(i)}; w\right)$$
and number of iterations $T$.
Output: parameter $\hat{w}^{\mathrm{LOG}} \in \mathbb{R}^d$ such that $J_S^{\mathrm{LOG}}(\hat{w}^{\mathrm{LOG}}) \approx J_S^{\mathrm{LOG}}(w_S^{\mathrm{LOG}})$.

1. Initialize $\theta^0$ (e.g., randomly).
2. For $t = 0 \ldots T - 1$,
   $$\theta^{t+1} = \theta^t + \frac{\eta^t}{n} \sum_{i=1}^n \left(y^{(i)} - \sigma\left(\theta^t \cdot x^{(i)}\right)\right) x^{(i)}$$
3. Return $\theta^T$.
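Below is a minimal NumPy sketch of this batch gradient-descent loop on a toy synthetic dataset. It assumes a constant step size and zero initialization for reproducibility; the names `train_logistic_regression`, `eta`, and the toy data are illustrative assumptions, not from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, T=1000, eta=0.1):
    """Batch gradient descent on the log loss.
    X: (n, d) design matrix; y: (n,) labels in {0, 1}."""
    n, d = X.shape
    theta = np.zeros(d)                        # step 1: initialize theta^0
    for t in range(T):                         # step 2: T gradient steps
        residual = y - sigmoid(X @ theta)      # y^(i) - sigma(theta^t . x^(i))
        theta = theta + (eta / n) * (X.T @ residual)
    return theta                               # step 3: return theta^T

# Toy example: two Gaussian clusters labeled 0 and 1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=-1.0, size=(50, 2)),
               rng.normal(loc=+1.0, size=(50, 2))])
y = np.concatenate([np.zeros(50), np.ones(50)])
w_hat = train_logistic_regression(X, y)
accuracy = np.mean((sigmoid(X @ w_hat) >= 0.5) == y)
print(w_hat, accuracy)
```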
From Binary to m Classes

- A logistic regressor has a single $w \in \mathbb{R}^d$ to define the probability of "on" (against "off"):
  $$p(1 \mid x; w) = \frac{\exp(w \cdot x)}{1 + \exp(w \cdot x)}$$
- A log-linear model has a $w^y \in \mathbb{R}^d$ for each possible class $y \in \{1 \ldots m\}$ to define the probability of the class equaling $y$:
  $$p(y \mid x; \theta) = \frac{\exp(w^y \cdot x)}{\sum_{y'=1}^m \exp(w^{y'} \cdot x)}$$
  where $\theta = (w^t)_{t=1}^m$ denotes the set of parameters.
- Is this a valid probability distribution?

Softmax Formulation

- We can transform any vector $z \in \mathbb{R}^m$ into a probability distribution over $m$ elements by the softmax function $\mathrm{softmax} : \mathbb{R}^m \to \Delta^{m-1}$,
  $$\mathrm{softmax}_i(z) := \frac{\exp(z_i)}{\sum_{j=1}^m \exp(z_j)} \qquad \forall i = 1 \ldots m$$
- A log-linear model is a linear transformation by a matrix $W \in \mathbb{R}^{m \times d}$ (with rows $w^1 \ldots w^m$) followed by softmax:
  $$p(y \mid x; W) = \mathrm{softmax}_y(Wx)$$
- For any $l \in \{1 \ldots m\}$, the gradient of the negative log probability with respect to $w^l$ is
  $$\nabla_{w^l} \left( -\log p(y \mid x; W) \right) = -\left([[y = l]] - \mathrm{softmax}_l(Wx)\right) x$$
  where $[[y = l]]$ is 1 if $y = l$ and 0 otherwise, mirroring the binary case.
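The short NumPy sketch below (again an illustration, not from the slides) computes the softmax, checks that it sums to 1 (so the log-linear model is a valid distribution), and stacks the per-class gradients into an $(m, d)$ matrix; it uses 0-based class indices, and the names `log_linear_prob` and `neg_log_prob_grad` are assumed for illustration.

```python
import numpy as np

def softmax(z):
    # softmax_i(z) = exp(z_i) / sum_j exp(z_j), shifted by max(z) for stability.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def log_linear_prob(W, x):
    # p(. | x; W) = softmax(W x), with rows of W playing the role of w^1 ... w^m.
    return softmax(W @ x)

def neg_log_prob_grad(W, x, y):
    # Gradient of -log p(y | x; W) wrt each row w^l:
    # -(1[y == l] - softmax_l(W x)) x, stacked into an (m, d) matrix.
    m = W.shape[0]
    p = log_linear_prob(W, x)          # shape (m,)
    onehot = np.eye(m)[y]              # 1[y == l] for each l
    return -np.outer(onehot - p, x)    # shape (m, d)

rng = np.random.default_rng(0)
m, d = 4, 3
W, x, y = rng.normal(size=(m, d)), rng.normal(size=d), 2
print(log_linear_prob(W, x).sum())       # 1.0: a valid probability distribution
print(neg_log_prob_grad(W, x, y).shape)  # (4, 3)
```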
