Linear Classification (COMS 4771, Fall 2019)
Overview

- Logistic regression model
- Linear classifiers
- Gradient descent and SGD
- Multinomial logistic regression model

Logistic regression model I

- Suppose x is given by d real-valued features, so x ∈ R^d, while y ∈ {−1, +1}.
- Logistic regression model for (X, Y): Y | X = x is Bernoulli (but taking values in {−1, +1} rather than {0, 1}), with "heads probability" parameter
      1 / (1 + exp(−x^T w)).
- w ∈ R^d is the parameter vector of interest.
- w is not involved in the marginal distribution of X (which we don't care much about).

Logistic regression model II

- Sigmoid function: σ(t) := 1 / (1 + e^(−t)).
- Useful property: 1 − σ(t) = σ(−t).
- Pr(Y = +1 | X = x) = σ(x^T w).
- Pr(Y = −1 | X = x) = 1 − σ(x^T w) = σ(−x^T w).
- Convenient formula: for each y ∈ {−1, +1},
      Pr(Y = y | X = x) = σ(y x^T w).

[Figure 1: Sigmoid function]

Log-odds in logistic regression model

- The log-odds in the model is given by a linear function:
      ln( Pr(Y = +1 | X = x) / Pr(Y = −1 | X = x) ) = x^T w.
- Just like in linear regression, it is common to use feature expansion!
- E.g., affine feature expansion φ(x) = (1, x) ∈ R^(d+1).

Optimal classifier in logistic regression model

- Recall that the Bayes classifier is
      f*(x) = +1 if Pr(Y = +1 | X = x) > 1/2, and −1 otherwise.
- If the distribution of (X, Y) comes from the logistic regression model with parameter w, then the Bayes classifier is
      f*(x) = +1 if x^T w > 0, and −1 otherwise; i.e., f*(x) = sign(x^T w).
- This is a linear classifier.
- Many other statistical models for classification data lead to a linear (or affine) classifier, e.g., Naive Bayes.

Linear classifiers, operationally

- Compute a linear combination of the features, then check whether it is above a threshold (zero):
      sign(x^T w) = +1 if Σ_{i=1}^d w_i x_i > 0, and −1 otherwise.
- With an affine feature expansion, the threshold can be non-zero.

[Figure 2: Example of an affine classifier]

Geometry of linear classifiers I

- Hyperplane specified by a normal vector w ∈ R^d:
      H = { x ∈ R^d : x^T w = 0 }.
- This is the decision boundary of a linear classifier.
- The angle θ between x and w satisfies
      cos(θ) = x^T w / (‖x‖₂ ‖w‖₂).
- Distance to the hyperplane is given by ‖x‖₂ · cos(θ).
- x is on the same side of H as w iff x^T w > 0.

[Figure 3: Hyperplane]

Geometry of linear classifiers II

- With feature expansion, can obtain other types of decision boundaries.

MLE for logistic regression

- Treat the training examples as iid, from the same distribution as the test example.
- Log-likelihood of w given data (x_1, y_1), ..., (x_n, y_n) ∈ R^d × {−1, +1}:
      −Σ_{i=1}^n ln(1 + exp(−y_i x_i^T w)) + {terms not involving w}.
- No "closed form" expression for the maximizer.
- (Later, we'll discuss algorithms for finding approximate maximizers using iterative methods like gradient descent.)
- What about the ERM perspective?
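As a concrete illustration of the model and the MLE objective above, here is a minimal NumPy sketch (not from the slides; the helper names sigmoid, cond_prob, and log_likelihood are illustrative) that evaluates Pr(Y = y | X = x) = σ(y x^T w) and the log-likelihood term −Σ_i ln(1 + exp(−y_i x_i^T w)) for labels in {−1, +1}.

```python
import numpy as np

def sigmoid(t):
    # sigma(t) = 1 / (1 + e^{-t})
    return 1.0 / (1.0 + np.exp(-t))

def cond_prob(w, x, y):
    # Pr(Y = y | X = x) = sigma(y * x^T w), for y in {-1, +1}
    return sigmoid(y * (x @ w))

def log_likelihood(w, X, y):
    # Log-likelihood of w (ignoring terms not involving w):
    #   -sum_i ln(1 + exp(-y_i x_i^T w))
    margins = y * (X @ w)                          # y_i * x_i^T w for each example
    return -np.sum(np.logaddexp(0.0, -margins))    # ln(1 + e^{-m}) computed stably
```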
Zero-one loss and ERM for linear classifiers

- Recall: the error rate of a classifier f can also be written as a risk,
      R(f) = E[ 1{f(X) ≠ Y} ] = Pr(f(X) ≠ Y),
  where the loss function is the zero-one loss.
- For classification, we are ultimately interested in classifiers with small error rate, i.e., small (zero-one loss) risk.
- Just like for linear regression, can apply the plug-in principle to derive empirical risk minimization (ERM), but now for linear classifiers.
- Find w ∈ R^d to minimize
      R̂(w) := (1/n) Σ_{i=1}^n 1{ sign(x_i^T w) ≠ y_i }.
- Very different from MLE for logistic regression.

ERM for linear classifiers

- Theorem: In the IID model, the ERM solution ŵ satisfies
      E[ R(ŵ) ] ≤ min_{w ∈ R^d} R(w) + O(√(d/n)).
- Unfortunately, solving this optimization problem, even for linear classifiers, is computationally intractable.
- (Sharp contrast to the ERM optimization problem for linear regression!)

Linearly separable data I & II

- Training data is linearly separable if there exists a linear classifier with training error rate zero.
- (This is a special case where the ERM optimization problem is tractable.)
- That is, there exists w ∈ R^d such that sign(x_i^T w) = y_i for all i = 1, ..., n; equivalently,
      y_i x_i^T w > 0 for all i = 1, ..., n.

[Figure 4: Linearly separable]
[Figure 5: Not linearly separable]

Finding a linear separator I & II

- Suppose the training data (x_1, y_1), ..., (x_n, y_n) ∈ R^d × {−1, +1} is linearly separable.
- How to find a linear separator (assuming one exists)?
- Method 1: solve a linear feasibility problem.
- Method 2: (approximately) solve the logistic regression MLE.

Surrogate loss functions I

- Often, a linear separator will not exist.
- Regard each term in the negative log-likelihood as a "loss":
      ℓ_log(s) := ln(1 + exp(−s)).
- Cf. the zero-one loss:
      ℓ_zo(s) := 1{s ≤ 0}.
- ℓ_log (up to scaling) is an upper bound on ℓ_zo, i.e., a surrogate loss:
      ℓ_zo(s) ≤ (1/ln 2) ℓ_log(s) = ℓ_log2(s).

[Figure 6: Logistic loss vs zero-one loss]

Surrogate loss functions II

- Another example: squared loss,
      ℓ_sq(s) = (1 − s)².

[Figure 7: Squared loss vs zero-one loss]

Surrogate loss functions III

- Modified squared loss:
      ℓ_msq(s) := max{0, 1 − s}².

[Figure 8: Modified squared loss vs zero-one loss]

Gradient descent for logistic regression

- No "closed form" expression for the logistic regression MLE.
- Instead, use iterative algorithms like gradient descent.
- Gradient of the empirical risk: by linearity of the derivative operator,
      ∇R̂_log(w) = (1/n) Σ_{i=1}^n ∇ℓ_log(y_i x_i^T w).
- Algorithm: start with some w^(0) ∈ R^d and η > 0. For t = 1, 2, ...:
      w^(t) := w^(t−1) − η ∇R̂_log(w^(t−1)).

Gradient of logistic loss

- Gradient of the logistic loss on the i-th training example: using the chain rule,
      ∇ℓ_log(y_i x_i^T w) = ℓ'_log(y_i x_i^T w) y_i x_i
                          = −(1 − 1/(1 + exp(−y_i x_i^T w))) y_i x_i
                          = −(1 − Pr_w(Y = y_i | X = x_i)) y_i x_i.
  Here, Pr_w is the probability distribution of (X, Y) from the logistic regression model with parameter w.

Interpretation of gradient descent for logistic regression

- Interpretation of gradient descent: each update adds a multiple of y_i x_i to w.
- We are trying to make Pr_w(Y = y_i | X = x_i) as close to 1 as possible.
- (This would be achieved by making w infinitely far in the direction of y_i x_i.)
- How much of y_i x_i to add to w is scaled by how far Pr_w(Y = y_i | X = x_i) currently is from 1.
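To make the gradient descent recipe above concrete, here is a minimal NumPy sketch (not from the slides; the function name gradient_descent_logistic, the default step size, and the fixed iteration count are illustrative choices) that runs the update w^(t) := w^(t−1) − η ∇R̂_log(w^(t−1)) using the per-example gradient −(1 − σ(y_i x_i^T w)) y_i x_i derived above.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def gradient_descent_logistic(X, y, eta=0.1, iters=1000):
    # Full-batch gradient descent on the empirical logistic-loss risk
    #   R_log(w) = (1/n) sum_i ln(1 + exp(-y_i x_i^T w)),
    # where each term has gradient -(1 - sigma(y_i x_i^T w)) y_i x_i.
    n, d = X.shape
    w = np.zeros(d)                                # w^(0)
    for _ in range(iters):
        margins = y * (X @ w)                      # y_i x_i^T w
        coeffs = -(1.0 - sigmoid(margins)) * y     # per-example scalar factor
        grad = (X.T @ coeffs) / n                  # average gradient over the data
        w = w - eta * grad                         # w^(t) = w^(t-1) - eta * grad
    return w
```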
Behavior of gradient descent for logistic regression

- Analysis of gradient descent for the logistic regression MLE is much more complicated than for linear regression.
- The solution could be at infinity!
- Theorem: for an appropriate choice of step size η > 0,
      R̂_log(w^(t)) → inf_{w ∈ R^d} R̂_log(w)
  as t → ∞ (even if the infimum is never attained).

[Figure 9: Empirical logistic loss risk]

Stochastic gradient method I

- Every iteration of gradient descent takes Θ(nd) time: a pass through all training examples to make a single update.
- If n is enormous, too expensive to make many passes.
- Alternative: stochastic gradient descent (SGD).
- Use one or a few training examples to estimate the gradient.
- Gradient at w^(t):
      (1/n) Σ_{j=1}^n ∇ℓ(y_j x_j^T w^(t)).
  (Called the full batch gradient.)
- Pick a term J uniformly at random:
      ∇ℓ(y_J x_J^T w^(t)).
- What is the expected value of this random vector?
      E[ ∇ℓ(y_J x_J^T w^(t)) ] = ?

Stochastic gradient method II

- Minibatch: to reduce the variance of the estimate, use several random examples J_1, ..., J_B and average; this is called the minibatch gradient:
      (1/B) Σ_{b=1}^B ∇ℓ(y_{J_b} x_{J_b}^T w^(t)).

Example: SGD for logistic regression

- Logistic regression MLE for data (x_1, y_1), ..., (x_n, y_n) ∈ R^d × {−1, +1}.
- Start with w^(0) ∈ R^d, η > 0, t = 1.
- For epoch p = 1, 2, ...:
  - For each training example (x, y) in a random order: ...
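The example slide is cut off before its update rule, but a plausible completion is the usual single-example SGD step with the logistic-loss gradient. The sketch below is my assumption rather than the slides' exact procedure; sgd_logistic, the step size, and the epoch count are illustrative. It visits the training examples in a random order each epoch and applies w ← w − η ∇ℓ_log(y_i x_i^T w).

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def sgd_logistic(X, y, eta=0.1, epochs=10, rng=None):
    # Single-example SGD for the logistic loss, labels y in {-1, +1}.
    # Each step uses the stochastic gradient  -(1 - sigma(y_i x_i^T w)) y_i x_i.
    rng = np.random.default_rng() if rng is None else rng
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):               # examples in a random order
            margin = y[i] * (X[i] @ w)
            grad = -(1.0 - sigmoid(margin)) * y[i] * X[i]
            w = w - eta * grad
    return w
```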