
On Logistic Regression: Gradients of the Log Loss, Multi-Class Classification, and Other Optimization Techniques
Karl Stratos
June 20, 2018

Recall: Logistic Regression

- Task. Given input x ∈ R^d, predict either 1 or 0 (on or off).
- Model. The probability of on is parameterized by w ∈ R^d as a dot product squashed under the sigmoid/logistic function σ : R → [0, 1]:

      p(1 | x; w) := σ(w · x) := 1 / (1 + exp(−w · x))

  The probability of off is p(0 | x; w) = 1 − σ(w · x) = σ(−w · x).
- Today's focus:
  1. Optimizing the log loss by gradient descent
  2. Multi-class classification to handle more than two classes
  3. More on optimization: Newton, stochastic gradient descent

Overview

- Gradient Descent on the Log Loss
- Multi-Class Classification
- More on Optimization
  - Newton's Method
  - Stochastic Gradient Descent (SGD)

Negative Log Probability Under Logistic Function

- Using y ∈ {0, 1} to denote the two classes,

      −log p(y | x; w) = −y log σ(w · x) − (1 − y) log σ(−w · x)
                       = log(1 + exp(−w · x))   if y = 1
                       = log(1 + exp(w · x))    if y = 0

  is convex in w.
- Gradient given by

      ∇_w (−log p(y | x; w)) = −(y − σ(w · x)) x
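As a concrete check of these two formulas, here is a minimal NumPy sketch (not from the slides; all names are placeholders) that evaluates the negative log probability in its log(1 + exp(·)) form and its gradient −(y − σ(w · x)) x for a single example:

```python
import numpy as np

def sigmoid(t):
    # sigma(t) = 1 / (1 + exp(-t))
    return 1.0 / (1.0 + np.exp(-t))

def neg_log_prob(w, x, y):
    # -log p(y | x; w) for y in {0, 1}; np.logaddexp(0, t) = log(1 + exp(t)),
    # evaluated without overflow for large |t|.
    s = np.dot(w, x)
    return np.logaddexp(0.0, -s) if y == 1 else np.logaddexp(0.0, s)

def neg_log_prob_grad(w, x, y):
    # gradient wrt w: -(y - sigma(w . x)) x
    return -(y - sigmoid(np.dot(w, x))) * x

# Made-up numbers for illustration.
w = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 3.0, -0.5])
print(neg_log_prob(w, x, 1))        # log(1 + exp(-w.x))
print(neg_log_prob_grad(w, x, 1))   # -(1 - sigma(w.x)) x
```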
Logistic Regression Objective

- Given iid samples S = ((x^(i), y^(i)))_{i=1}^n, find w that minimizes the empirical negative log likelihood of S ("log loss"):

      J_S^LOG(w) := −(1/n) Σ_{i=1}^n log p(y^(i) | x^(i); w)

- Gradient?

      ∇ J_S^LOG(w) = −(1/n) Σ_{i=1}^n (y^(i) − σ(w · x^(i))) x^(i)

- Unlike in linear regression, there is no closed-form solution for

      w_S^LOG := argmin_{w ∈ R^d} J_S^LOG(w)

- But J_S^LOG(w) is convex and differentiable! So we can do gradient descent and approach an optimal solution.

Gradient Descent for Logistic Regression

Input: the training objective

      J_S^LOG(w) := −(1/n) Σ_{i=1}^n log p(y^(i) | x^(i); w)

and a number of iterations T.
Output: a parameter ŵ^LOG ∈ R^d such that J_S^LOG(ŵ^LOG) ≈ J_S^LOG(w_S^LOG).

1. Initialize θ_0 (e.g., randomly).
2. For t = 0 ... T − 1,

      θ_{t+1} = θ_t + (η_t/n) Σ_{i=1}^n (y^(i) − σ(θ_t · x^(i))) x^(i)

3. Return θ_T.
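The loop above translates almost line for line into code. Below is a minimal NumPy sketch of batch gradient descent on the log loss, under assumed placeholder names (X stacks the x^(i) as rows, y holds the 0/1 labels, and a constant step size eta stands in for the schedule η_t); it illustrates the update rule rather than reproducing any reference implementation.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def logistic_regression_gd(X, y, T=1000, eta=0.1):
    """Batch gradient descent on the log loss J_S^LOG(w).

    X: (n, d) array whose rows are the x^(i); y: (n,) array of 0/1 labels.
    Uses a constant step size eta in place of the schedule eta_t.
    """
    n, d = X.shape
    theta = np.zeros(d)                    # step 1: initialize (zeros here; random also works)
    for _ in range(T):                     # step 2: T updates
        residual = y - sigmoid(X @ theta)  # y^(i) - sigma(theta_t . x^(i)) for every i
        theta = theta + (eta / n) * (X.T @ residual)
    return theta                           # step 3: return theta_T

# Tiny usage example on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = (rng.uniform(size=200) < sigmoid(X @ w_true)).astype(float)
w_hat = logistic_regression_gd(X, y, T=2000, eta=0.5)
print(w_hat)   # should be roughly aligned with w_true
```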
From Binary to m Classes

- A logistic regressor has a single w ∈ R^d to define the probability of "on" (against "off"):

      p(1 | x; w) = exp(w · x) / (1 + exp(w · x))

- A log-linear model has a vector w^y ∈ R^d for each possible class y ∈ {1 ... m} to define the probability of the class equaling y:

      p(y | x; θ) = exp(w^y · x) / Σ_{y'=1}^m exp(w^{y'} · x)

  where θ = {w^t}_{t=1}^m denotes the set of parameters.
- Is this a valid probability distribution?

Softmax Formulation

- We can transform any vector z ∈ R^m into a probability distribution over m elements by the softmax function softmax : R^m → ∆^{m−1},

      softmax_i(z) := exp(z_i) / Σ_{j=1}^m exp(z_j)    ∀ i = 1 ... m

- A log-linear model is a linear transformation by a matrix W ∈ R^{m×d} (with rows w^1 ... w^m) followed by softmax:

      p(y | x; W) = softmax_y(W x)

- For any l ∈ {1 ... m}, the gradient wrt. w^l is

      ∇_{w^l} (−log p(y | x; θ)) = −([[y = l]] − p(l | x; θ)) x

  where [[y = l]] is 1 if y = l and 0 otherwise, the exact analogue of the binary case.
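To make the softmax formulation concrete, here is a small NumPy sketch (function and variable names are mine, not the slides'). It computes p(· | x; W) = softmax(Wx), confirming that the entries form a valid distribution, and evaluates the per-class gradient −([[y = l]] − p(l | x; W)) x for all l at once; the maximum of z is subtracted before exponentiating, which leaves the result unchanged and avoids overflow.

```python
import numpy as np

def softmax(z):
    # softmax_i(z) = exp(z_i) / sum_j exp(z_j); subtracting max(z) does not
    # change the ratio and keeps exp() from overflowing.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def class_probs(W, x):
    # p(. | x; W) = softmax(Wx), where the rows of W are w^1 ... w^m.
    return softmax(W @ x)

def neg_log_prob_grads(W, x, y):
    # Gradient of -log p(y | x; W) wrt each row w^l: -([[y = l]] - p(l | x; W)) x.
    # Returned as an (m, d) matrix whose l-th row is the gradient wrt w^l.
    # Classes are indexed 0 ... m-1 here rather than 1 ... m.
    p = class_probs(W, x)
    indicator = np.zeros_like(p)
    indicator[y] = 1.0
    return -np.outer(indicator - p, x)

# Usage with made-up numbers: m = 3 classes, d = 4 features.
W = np.arange(12, dtype=float).reshape(3, 4) / 10.0
x = np.array([1.0, -0.5, 0.2, 2.0])
p = class_probs(W, x)
print(p, p.sum())                      # nonnegative entries that sum to 1
print(neg_log_prob_grads(W, x, y=2))   # one gradient row per class
```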