Linear classification (COMS 4771, Fall 2019)

Overview

- Logistic regression model
- Linear classifiers
- Gradient descent and SGD
- Multinomial logistic regression model

Logistic regression model I

- Suppose x is given by d real-valued features, so x ∈ R^d, while y ∈ {−1, +1}.
- Logistic regression model for (X, Y): Y | X = x is Bernoulli (but taking values in {−1, +1} rather than {0, 1}), with "heads probability" parameter
  1 / (1 + exp(−x^T w)).
- w ∈ R^d is the parameter vector of interest.
- w is not involved in the marginal distribution of X (which we don't care much about).

Logistic regression model II

- Sigmoid function: σ(t) := 1 / (1 + e^{−t}).
- Useful property: 1 − σ(t) = σ(−t).
- In this notation,
  Pr(Y = +1 | X = x) = σ(x^T w),
  Pr(Y = −1 | X = x) = 1 − σ(x^T w) = σ(−x^T w).
- Convenient formula: for each y ∈ {−1, +1},
  Pr(Y = y | X = x) = σ(y x^T w).

Figure 1: Sigmoid function

Log-odds in logistic regression model

- The log-odds in the model is given by a linear function:
  ln [ Pr(Y = +1 | X = x) / Pr(Y = −1 | X = x) ] = x^T w.
- Just like in linear regression, it is common to use feature expansion, e.g., the affine feature expansion φ(x) = (1, x) ∈ R^{d+1}.

Optimal classifier in logistic regression model

- Recall that the Bayes classifier is
  f*(x) = +1 if Pr(Y = +1 | X = x) > 1/2, and −1 otherwise.
- If the distribution of (X, Y) comes from the logistic regression model with parameter w, then the Bayes classifier is
  f*(x) = +1 if x^T w > 0, and −1 otherwise; i.e., f*(x) = sign(x^T w).
- This is a linear classifier.
- Many other statistical models for classification data lead to a linear (or affine) classifier, e.g., Naive Bayes.

Linear classifiers, operationally

- Compute a linear combination of the features, then check whether it is above a threshold (zero):
  sign(x^T w) = +1 if Σ_{i=1}^d w_i x_i > 0, and −1 otherwise.
- With the affine feature expansion, the threshold can be non-zero.

Figure 2: Example of an affine classifier

Geometry of linear classifiers I

- A hyperplane is specified by a normal vector w ∈ R^d:
  H = { x ∈ R^d : x^T w = 0 }.
- This is the decision boundary of a linear classifier.
- The angle θ between x and w satisfies
  cos(θ) = x^T w / (‖x‖₂ ‖w‖₂).
- The distance from x to the hyperplane is ‖x‖₂ · cos(θ).
- x is on the same side of H as w iff x^T w > 0.

Figure 3: Hyperplane

Geometry of linear classifiers II

- With feature expansion, we can obtain other types of decision boundaries.

MLE for logistic regression

- Treat the training examples as iid, with the same distribution as the test example.
- Log-likelihood of w given data (x_1, y_1), ..., (x_n, y_n) ∈ R^d × {−1, +1}:
  −Σ_{i=1}^n ln(1 + exp(−y_i x_i^T w)) + {terms not involving w}.
- No "closed form" expression for the maximizer.
- (Later, we'll discuss algorithms for finding approximate maximizers using iterative methods like gradient descent.)
- What about the ERM perspective?

Zero-one loss and ERM for linear classifiers

- Recall: the error rate of a classifier f can also be written as a risk,
  R(f) = E[1{f(X) ≠ Y}] = Pr(f(X) ≠ Y),
  where the loss function is the zero-one loss.
- For classification, we are ultimately interested in classifiers with small error rate, i.e., small (zero-one loss) risk.
- Just like for linear regression, we can apply the plug-in principle to derive empirical risk minimization (ERM), but now for linear classifiers: find w ∈ R^d to minimize
  R̂(w) := (1/n) Σ_{i=1}^n 1{sign(x_i^T w) ≠ y_i}.
- This is very different from MLE for logistic regression. (Both objectives are sketched in code below.)
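The model probabilities and the log-likelihood above translate directly into code. Below is a minimal NumPy sketch (illustrative only; the names X, y, w and the helper functions are assumptions, with X an n-by-d array of feature vectors, y an array of labels in {−1, +1}, and w a parameter vector):

```python
import numpy as np

def sigmoid(t):
    """Sigmoid function: sigma(t) = 1 / (1 + e^{-t})."""
    return 1.0 / (1.0 + np.exp(-t))

def model_probabilities(X, y, w):
    """Pr(Y = y_i | X = x_i) = sigma(y_i x_i^T w) for every example."""
    return sigmoid(y * (X @ w))

def log_likelihood(X, y, w):
    """Log-likelihood of w, dropping terms not involving w:
    -sum_i ln(1 + exp(-y_i x_i^T w))."""
    margins = y * (X @ w)
    # logaddexp(0, -m) computes ln(1 + exp(-m)) without overflow.
    return -np.sum(np.logaddexp(0.0, -margins))
```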
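For comparison with the MLE objective, here is a sketch of the ERM objective with the zero-one loss for a linear classifier, including the affine feature expansion φ(x) = (1, x); again the function names are illustrative assumptions:

```python
import numpy as np

def affine_expand(X):
    """Affine feature expansion phi(x) = (1, x) in R^{d+1}."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

def predict(X, w):
    """Linear classifier sign(x^T w), mapping the boundary case x^T w = 0 to -1."""
    return np.where(X @ w > 0, 1, -1)

def empirical_zero_one_risk(X, y, w):
    """R_hat(w) = (1/n) sum_i 1{sign(x_i^T w) != y_i}."""
    return np.mean(predict(X, w) != y)
```

For example, empirical_zero_one_risk(affine_expand(X), y, w) is the training error rate of the affine classifier with parameter w ∈ R^{d+1}.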
ERM for linear classifiers

- Theorem: in the IID model, the ERM solution ŵ satisfies
  E[R(ŵ)] ≤ min_{w ∈ R^d} R(w) + O(√(d/n)).
- Unfortunately, solving this optimization problem, even for linear classifiers, is computationally intractable.
- (Sharp contrast to the ERM optimization problem for linear regression!)

Linearly separable data I

- Training data is linearly separable if there exists a linear classifier with training error rate zero.
- (This is a special case where the ERM optimization problem is tractable.)
- Equivalently, there exists w ∈ R^d such that sign(x_i^T w) = y_i for all i = 1, ..., n, i.e.,
  y_i x_i^T w > 0 for all i = 1, ..., n.

Figure 4: Linearly separable

Linearly separable data II

- (Same definition, illustrated on data that is not linearly separable.)

Figure 5: Not linearly separable

Finding a linear separator I

- Suppose the training data (x_1, y_1), ..., (x_n, y_n) ∈ R^d × {−1, +1} is linearly separable.
- How to find a linear separator (assuming one exists)?
- Method 1: solve a linear feasibility problem. (A sketch follows below.)

Finding a linear separator II

- Method 2: (approximately) solve the logistic regression MLE.

Surrogate loss functions I

- Often, a linear separator will not exist.
- Regard each term in the negative log-likelihood as a "loss":
  ℓ_log(s) := ln(1 + exp(−s)).
- Compare with the zero-one loss:
  ℓ_zo(s) := 1{s ≤ 0}.
- ℓ_log (up to scaling) is an upper bound on ℓ_zo, i.e., a surrogate loss:
  ℓ_zo(s) ≤ (1/ln 2) ℓ_log(s) = ℓ_{log2}(s).

Figure 6: (Scaled) logistic loss vs zero-one loss

Surrogate loss functions II

- Another example: the squared loss, ℓ_sq(s) = (1 − s)^2.

Figure 7: Squared loss vs zero-one loss

Surrogate loss functions III

- Modified squared loss: ℓ_msq(s) := max{0, 1 − s}^2.

Figure 8: Modified squared loss vs zero-one loss

Gradient descent for logistic regression

- No "closed form" expression for the logistic regression MLE; instead, use iterative algorithms like gradient descent.
- Gradient of the empirical risk: by linearity of the derivative operator,
  ∇R̂_{ℓ_log}(w) = (1/n) Σ_{i=1}^n ∇ℓ_log(y_i x_i^T w).
- Algorithm: start with some w^{(0)} ∈ R^d and η > 0. For t = 1, 2, ...:
  w^{(t)} := w^{(t−1)} − η ∇R̂_{ℓ_log}(w^{(t−1)}).
  (A code sketch of this update follows below.)

Gradient of logistic loss

- Gradient of the logistic loss on the i-th training example: using the chain rule,
  ∇ℓ_log(y_i x_i^T w) = ℓ'_log(y_i x_i^T w) y_i x_i
                      = −(1 − 1/(1 + exp(−y_i x_i^T w))) y_i x_i
                      = −(1 − Pr_w(Y = y_i | X = x_i)) y_i x_i.
- Here, Pr_w is the probability distribution of (X, Y) from the logistic regression model with parameter w.

Interpretation of gradient descent for logistic regression

- Interpretation: gradient descent is trying to make Pr_w(Y = y_i | X = x_i) as close to 1 as possible.
- (This would be achieved by making w infinitely far in the direction of y_i x_i.)
- How much of y_i x_i to add to w is scaled by how far Pr_w(Y = y_i | X = x_i) currently is from 1.

Behavior of gradient descent for logistic regression

- The analysis of gradient descent for the logistic regression MLE is much more complicated than for linear regression.
- The solution could be at infinity!
- Theorem: for an appropriate choice of step size η > 0,
  R̂_{ℓ_log}(w^{(t)}) → inf_{w ∈ R^d} R̂_{ℓ_log}(w) as t → ∞
  (even if the infimum is never attained).

Figure 9: Empirical logistic loss risk
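The slides leave Method 1 abstract. One common way to pose the feasibility problem (an assumption of this sketch, which uses SciPy's linprog with a zero objective) is to ask for any w with y_i x_i^T w ≥ 1 for all i; by rescaling w, this is equivalent to y_i x_i^T w > 0:

```python
import numpy as np
from scipy.optimize import linprog

def find_separator(X, y):
    """Look for w with y_i x_i^T w >= 1 for all i (a linear feasibility
    problem). Returns such a w if one is found, otherwise None."""
    n, d = X.shape
    # y_i x_i^T w >= 1  is the same as  -(y_i x_i)^T w <= -1.
    A_ub = -(y[:, None] * X)
    b_ub = -np.ones(n)
    result = linprog(c=np.zeros(d), A_ub=A_ub, b_ub=b_ub,
                     bounds=[(None, None)] * d)
    return result.x if result.success else None
```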
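Below is a minimal NumPy sketch of the gradient descent update above, plugging in the gradient formula from the chain rule computation; the zero initialization, step size, and iteration count are illustrative assumptions rather than choices made in the slides:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def empirical_logistic_risk_gradient(X, y, w):
    """(1/n) sum_i grad l_log(y_i x_i^T w), where
    grad l_log(y_i x_i^T w) = -(1 - sigma(y_i x_i^T w)) y_i x_i."""
    coeffs = -(1.0 - sigmoid(y * (X @ w))) * y      # shape (n,)
    return (coeffs[:, None] * X).mean(axis=0)       # shape (d,)

def gradient_descent(X, y, eta=0.1, num_iters=1000):
    """Iterate w^{(t)} := w^{(t-1)} - eta * grad R_hat_{l_log}(w^{(t-1)})."""
    w = np.zeros(X.shape[1])
    for _ in range(num_iters):
        w = w - eta * empirical_logistic_risk_gradient(X, y, w)
    return w
```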
complicated than for linear regression I Pass through all training examples to make a single update. I Solution could be at infinity! I If n is enormous, too expensive to make many passes. I Theorem: for appropriate choice of step size η > 0, (w(t)) inf (w) I Alternative: Stochastic gradient descent (SGD) R → w d R ∈R I Use one or a few training examples to estimate the gradient. as t (even if the infimumb is never attained).b I Gradient at w(t): → ∞ n 1 T (t) 10.0 `(yj x w ). n ∇ j 7.5 j X=1 5.0 (Called full batch gradient.) 2.5 2.000 I Pick term J uniformly at random: 4.000 14.000 10.000 8.000 6.000 12.000 0.0 T (t) `(yJ xJ w ). 2.5 ∇ 5.0 I What is expected value of this random vector? 7.5 T (t) 10.0 10.0 7.5 5.0 2.5 0.0 2.5 5.0 7.5 10.0 E `(yJ xJ w ) = ∇ h i Figure 9: Empirical logistic loss risk 22 / 27 23 / 27 Stochastic gradient method II Example: SGD for logistic regression I Minibatch I Logistic regression MLE for data d I To reduce variance of estimate, use several random examples (x1, y1),..., (xn, yn) R 1, +1 . (0) d ∈ × {− } J1,...,JB and average—called minibatch gradient. I Start with w R , η > 0, t = 1 ∈ I For epoch p = 1, 2,...: B 1 T (t) I For each training example (x, y) in a random order: `(yJ x w ).

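Finally, a sketch of SGD with optional minibatching, in the same illustrative NumPy style as the earlier sketches; the epoch structure follows the example above, while the step size, batch size, epoch count, and zero initialization are assumptions:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def sgd_logistic(X, y, eta=0.1, num_epochs=10, batch_size=1, seed=0):
    """SGD for the logistic regression MLE objective: each update uses the
    gradient of l_log on one random example (or an averaged minibatch)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(num_epochs):
        order = rng.permutation(n)              # random order of examples
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            # grad l_log(y x^T w) = -(1 - sigma(y x^T w)) y x, averaged over the batch.
            coeffs = -(1.0 - sigmoid(yb * (Xb @ w))) * yb
            w = w - eta * (coeffs[:, None] * Xb).mean(axis=0)
    return w
```

Each update costs Θ(Bd) time for batch size B, instead of the Θ(nd) per iteration of full batch gradient descent.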