Overview

- Logistic regression model
- Linear classifiers
- Gradient descent and SGD for linear classification
- Multinomial logistic regression model

COMS 4771 Fall 2019

Logistic regression model I

- Suppose x is given by d real-valued features, so x ∈ R^d, while y ∈ {−1, +1}.
- Logistic regression model for (X, Y):
  - Y | X = x is Bernoulli (but taking values in {−1, +1} rather than {0, 1}), with "heads probability" parameter

    1 / (1 + exp(−xᵀw)).

- w ∈ R^d is the parameter vector of interest.
- w is not involved in the marginal distribution of X (which we don't care much about).

Logistic regression model II

- Sigmoid function: σ(t) := 1 / (1 + e^{−t}).
- Useful property: 1 − σ(t) = σ(−t).
- Pr(Y = +1 | X = x) = σ(xᵀw)
- Pr(Y = −1 | X = x) = 1 − σ(xᵀw) = σ(−xᵀw)
- Convenient formula: for each y ∈ {−1, +1},

  Pr(Y = y | X = x) = σ(y xᵀw).

Figure 1: Sigmoid function
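To make the model concrete, here is a minimal NumPy sketch (not from the slides) of the sigmoid and the conditional class probability Pr(Y = y | X = x) = σ(y xᵀw); the particular w and x values are made up for illustration.

```python
import numpy as np

def sigmoid(t):
    """Sigmoid function sigma(t) = 1 / (1 + e^{-t})."""
    return 1.0 / (1.0 + np.exp(-t))

def prob_y_given_x(x, y, w):
    """Pr(Y = y | X = x) = sigma(y * x^T w) for y in {-1, +1}."""
    return sigmoid(y * (x @ w))

# Toy example (made-up numbers): d = 3 features.
w = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 0.3, -0.2])

p_pos = prob_y_given_x(x, +1, w)
p_neg = prob_y_given_x(x, -1, w)
print(p_pos, p_neg, p_pos + p_neg)  # the two probabilities sum to 1
```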

Log-odds in logistic regression model

- Log-odds in the model is given by a linear function:

  ln [ Pr(Y = +1 | X = x) / Pr(Y = −1 | X = x) ] = xᵀw.

- Just like in linear regression, it is common to use feature expansion!
  - E.g., affine feature expansion φ(x) = (1, x) ∈ R^{d+1}.

Optimal classifier in logistic regression model

- Recall that the Bayes classifier is

  f*(x) = +1 if Pr(Y = +1 | X = x) > 1/2, −1 otherwise.

- If the distribution of (X, Y) comes from the logistic regression model with parameter w, then the Bayes classifier is

  f*(x) = +1 if xᵀw > 0, −1 otherwise
        = sign(xᵀw).

- This is a linear classifier.
- Many other statistical models for classification data lead to a linear (or affine) classifier, e.g., Naive Bayes.
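As a quick illustration (not part of the slides), here is a sketch of the resulting linear classifier sign(xᵀw), using the affine feature expansion φ(x) = (1, x) so the decision boundary need not pass through the origin; the weights shown are made up.

```python
import numpy as np

def affine_expand(X):
    """Affine feature expansion phi(x) = (1, x): prepend a constant-1 feature."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

def predict(X, w):
    """Linear classifier f(x) = sign(phi(x)^T w), with values in {-1, +1}."""
    scores = affine_expand(X) @ w
    return np.where(scores > 0, +1, -1)

# Made-up weights in R^{d+1}: w = (bias, w_1, w_2).
w = np.array([-0.5, 1.0, 2.0])
X = np.array([[1.0, 0.0],
              [-1.0, 0.2],
              [0.0, 0.3]])
print(predict(X, w))
```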

Linear classifiers, operationally

- Compute a linear combination of the features, then check if it is above a threshold (zero):

  sign(xᵀw) = +1 if ∑_{i=1}^d w_i x_i > 0, −1 otherwise.

- With affine feature expansion, the threshold can be non-zero.

Figure 2: Example of an affine classifier

Geometry of linear classifiers I

- Hyperplane specified by normal vector w ∈ R^d:
  - H = {x ∈ R^d : xᵀw = 0}
  - This is the decision boundary of a linear classifier.
- The angle θ between x and w satisfies

  cos(θ) = xᵀw / (‖x‖₂ ‖w‖₂).

- Distance to the hyperplane is given by ‖x‖₂ · cos(θ).
- x is on the same side of H as w iff xᵀw > 0.

Figure 3: Hyperplane (with x, θ, w, and H labeled)
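A small sketch (not from the slides) of these geometric quantities, using made-up vectors: the cosine of the angle between x and w, the signed distance ‖x‖₂ cos(θ) from x to the hyperplane H, and the side-of-H check.

```python
import numpy as np

w = np.array([2.0, 1.0])     # normal vector of the hyperplane H = {x : x^T w = 0}
x = np.array([1.0, 3.0])     # a point to classify (made-up)

cos_theta = (x @ w) / (np.linalg.norm(x) * np.linalg.norm(w))
signed_distance = np.linalg.norm(x) * cos_theta   # equals (x^T w) / ||w||_2

same_side_as_w = (x @ w) > 0  # True iff x is on the same side of H as w
print(cos_theta, signed_distance, same_side_as_w)
```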

Geometry of linear classifiers II

- With feature expansion, can obtain other types of decision boundaries.

MLE for logistic regression

- Treat the training examples as iid, from the same distribution as the test example.
- Log-likelihood of w given data (x_1, y_1), ..., (x_n, y_n) ∈ R^d × {−1, +1}:

  −∑_{i=1}^n ln(1 + exp(−y_i x_iᵀw)) + {terms not involving w}

- No "closed form" expression for the maximizer.
- (Later, we'll discuss algorithms for finding approximate maximizers using iterative methods like gradient descent.)
- What about the ERM perspective?
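As a sanity check (not part of the slides), here is a minimal sketch of the w-dependent part of this log-likelihood; the data are randomly generated for illustration.

```python
import numpy as np

def log_likelihood(w, X, y):
    """w-dependent part of the log-likelihood: -sum_i ln(1 + exp(-y_i x_i^T w))."""
    margins = y * (X @ w)                        # y_i x_i^T w for each i
    return -np.sum(np.logaddexp(0.0, -margins))  # ln(1 + exp(-m)) computed stably

rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))
y = np.where(rng.random(n) < 0.5, -1, 1)         # made-up labels
w = rng.normal(size=d)
print(log_likelihood(w, X, y))
```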

Zero-one loss and ERM for linear classifiers

- Recall: the error rate of a classifier f can also be written as a risk:

  R(f) = E[1{f(X) ≠ Y}] = Pr(f(X) ≠ Y),

  where the loss is the zero-one loss 1{·}.
- For classification, we are ultimately interested in classifiers with small error rate.
  - I.e., small (zero-one loss) risk.
- Just like for linear regression, we can apply the plug-in principle to derive empirical risk minimization (ERM), but now for linear classifiers.
- Find w ∈ R^d to minimize

  R̂(w) := (1/n) ∑_{i=1}^n 1{sign(x_iᵀw) ≠ y_i}.

- Very different from MLE for logistic regression.

ERM for linear classifiers

- Theorem: In the IID model, the ERM solution ŵ satisfies

  E[R(ŵ)] ≤ min_{w ∈ R^d} R(w) + O(√(d/n)).

- Unfortunately, solving this optimization problem, even for linear classifiers, is computationally intractable.
- (Sharp contrast to the ERM optimization problem for linear regression!)
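A minimal sketch (not from the slides) of the empirical zero-one risk R̂(w) being minimized; note this only evaluates the objective, since actually minimizing it is the intractable part. The data and weight vectors are made up.

```python
import numpy as np

def empirical_zero_one_risk(w, X, y):
    """R_hat(w) = (1/n) * sum_i 1{sign(x_i^T w) != y_i}."""
    preds = np.where(X @ w > 0, +1, -1)
    return np.mean(preds != y)

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
w_true = np.array([1.0, -2.0])
y = np.where(X @ w_true > 0, +1, -1)   # labels from a made-up linear rule
print(empirical_zero_one_risk(w_true, X, y))                  # 0.0 for the generating w
print(empirical_zero_one_risk(np.array([0.0, 1.0]), X, y))    # generally > 0
```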

Linearly separable data I

- Training data is linearly separable if there exists a linear classifier with training error rate zero.
  - (Special case where the ERM optimization problem is tractable.)
- There exists w ∈ R^d such that sign(x_iᵀw) = y_i for all i = 1, ..., n.
- Equivalent:

  y_i x_iᵀw > 0 for all i = 1, ..., n.

Figure 4: Linearly separable

Linearly separable data II

- (Same definition as above, illustrated with data that is not linearly separable.)

Figure 5: Not linearly separable

Finding a linear separator I

- Suppose training data (x_1, y_1), ..., (x_n, y_n) ∈ R^d × {−1, +1} is linearly separable.
- How do we find a linear separator (assuming one exists)?
- Method 1: solve a linear feasibility problem.

Finding a linear separator II

- Method 2: (approximately) solve the logistic regression MLE.
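One way to set up Method 1 (a sketch, not the slides' implementation): if the data are separable, some w satisfies y_i x_iᵀw ≥ 1 for all i (rescale any separator), which is a linear feasibility problem that a generic LP solver such as scipy.optimize.linprog can handle with a zero objective. The toy data are made up.

```python
import numpy as np
from scipy.optimize import linprog

def find_separator(X, y):
    """Find w with y_i * x_i^T w >= 1 for all i, via an LP with a zero objective.

    Returns w if the data are linearly separable, else None.
    """
    n, d = X.shape
    # Constraint y_i x_i^T w >= 1  <=>  -(y_i x_i)^T w <= -1.
    A_ub = -(y[:, None] * X)
    b_ub = -np.ones(n)
    res = linprog(c=np.zeros(d), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * d, method="highs")
    return res.x if res.success else None

X = np.array([[2.0, 1.0], [1.0, -1.0], [-1.5, 0.5], [-2.0, -1.0]])
y = np.array([+1, +1, -1, -1])
w = find_separator(X, y)
print(w, None if w is None else bool(np.all(y * (X @ w) > 0)))
```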

Surrogate loss functions I

- Often, a linear separator will not exist.
- Regard each term in the negative log-likelihood as a "loss":

  ℓ_log(s) := ln(1 + exp(−s)).

- C.f. the zero-one loss:

  ℓ_zo(s) := 1{s ≤ 0}.

- ℓ_log (up to scaling) is an upper bound on ℓ_zo, i.e., a surrogate loss:

  ℓ_zo(s) ≤ (1/ln 2) ℓ_log(s) = ℓ_log₂(s).

Figure 6: Logistic loss vs zero-one loss

Surrogate loss functions II

- Another example: squared loss

  ℓ_sq(s) = (1 − s)².

Figure 7: Squared loss vs zero-one loss
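A quick numerical sketch (not from the slides) of the surrogate-loss bound ℓ_zo(s) ≤ (1/ln 2) ℓ_log(s), checked on a grid of margin values s.

```python
import numpy as np

def loss_zero_one(s):
    return (s <= 0).astype(float)

def loss_logistic(s):
    return np.log1p(np.exp(-s))

s = np.linspace(-3, 3, 601)
scaled_logistic = loss_logistic(s) / np.log(2)   # = log2(1 + exp(-s))
print(np.all(loss_zero_one(s) <= scaled_logistic + 1e-12))  # True: surrogate upper-bounds zero-one
```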

Surrogate loss functions III

- Modified squared loss:

  ℓ_msq(s) := max{0, 1 − s}².

Figure 8: Modified squared loss vs zero-one loss

Gradient descent for logistic regression

- No "closed form" expression for the logistic regression MLE.
- Instead, use iterative algorithms like gradient descent.
- Gradient of the empirical risk: by linearity of the derivative operator,

  ∇R̂_{ℓ_log}(w) = (1/n) ∑_{i=1}^n ∇ℓ_log(y_i x_iᵀw).

- Algorithm: start with some w^(0) ∈ R^d and η > 0.
  - For t = 1, 2, ...:

    w^(t) := w^(t−1) − η ∇R̂_{ℓ_log}(w^(t−1))
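A minimal sketch (not the course's reference implementation) of this gradient descent loop, using the per-example gradient ∇ℓ_log(y_i x_iᵀw) = −(1 − σ(y_i x_iᵀw)) y_i x_i derived on the next slide; the step size, iteration count, and data are arbitrary choices.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def grad_empirical_risk(w, X, y):
    """(1/n) sum_i grad l_log(y_i x_i^T w) = -(1/n) sum_i (1 - sigmoid(y_i x_i^T w)) y_i x_i."""
    margins = y * (X @ w)
    coeffs = -(1.0 - sigmoid(margins)) * y        # shape (n,)
    return (coeffs[:, None] * X).mean(axis=0)     # shape (d,)

def gradient_descent(X, y, eta=0.5, num_steps=1000):
    w = np.zeros(X.shape[1])
    for _ in range(num_steps):
        w = w - eta * grad_empirical_risk(w, X, y)
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X @ np.array([1.5, -1.0]) + 0.3 * rng.normal(size=200) > 0, +1, -1)
w_hat = gradient_descent(X, y)
print(w_hat, np.mean(np.where(X @ w_hat > 0, +1, -1) != y))  # learned w and its training error rate
```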

Gradient of logistic loss

- Gradient of the logistic loss on the i-th training example: using the chain rule,

  ∇ℓ_log(y_i x_iᵀw) = ℓ′_log(y_i x_iᵀw) y_i x_i
                     = −(1 − 1/(1 + exp(−y_i x_iᵀw))) y_i x_i
                     = −(1 − Pr_w(Y = y_i | X = x_i)) y_i x_i.

- Here, Pr_w is the probability distribution of (X, Y) from the logistic regression model with parameter w.

Interpretation of gradient descent for logistic regression

- Interpretation of gradient descent:

  ∇ℓ_log(y_i x_iᵀw) = −(1 − Pr_w(Y = y_i | X = x_i)) y_i x_i

- Trying to make Pr_w(Y = y_i | X = x_i) as close to 1 as possible.
  - (Achieved by making w infinitely far in the direction of y_i x_i.)
- How much of y_i x_i to add to w is scaled by how far Pr_w(···) currently is from 1.
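A short sketch (not from the slides) checking this per-example gradient formula against a finite-difference approximation at a made-up point.

```python
import numpy as np

def loss_logistic(s):
    return np.log1p(np.exp(-s))

def grad_loss_logistic(w, x, y):
    """grad_w of l_log(y x^T w) = -(1 - sigma(y x^T w)) * y * x."""
    sig = 1.0 / (1.0 + np.exp(-y * (x @ w)))
    return -(1.0 - sig) * y * x

rng = np.random.default_rng(2)
w, x, y = rng.normal(size=3), rng.normal(size=3), -1

# Central finite-difference approximation of each coordinate of the gradient.
eps = 1e-6
numeric = np.array([
    (loss_logistic(y * (x @ (w + eps * e))) - loss_logistic(y * (x @ (w - eps * e)))) / (2 * eps)
    for e in np.eye(3)
])
print(np.allclose(numeric, grad_loss_logistic(w, x, y), atol=1e-6))  # True
```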

Behavior of gradient descent for logistic regression

- Analysis of gradient descent for the logistic regression MLE is much more complicated than for linear regression.
- The solution could be at infinity!
- Theorem: for an appropriate choice of step size η > 0,

  R̂(w^(t)) → inf_{w ∈ R^d} R̂(w)  as t → ∞

  (even if the infimum is never attained).

Figure 9: Empirical logistic loss risk (contour plot)

Stochastic gradient method I

- Every iteration of gradient descent takes Θ(nd) time.
  - A pass through all training examples is needed to make a single update.
  - If n is enormous, it is too expensive to make many passes.
- Alternative: Stochastic gradient descent (SGD)
  - Use one or a few training examples to estimate the gradient.
- Gradient at w^(t):

  (1/n) ∑_{j=1}^n ∇ℓ(y_j x_jᵀw^(t)).

  (Called the full batch gradient.)
- Pick a term J uniformly at random:

  ∇ℓ(y_J x_Jᵀw^(t)).

- What is the expected value of this random vector, E[∇ℓ(y_J x_Jᵀw^(t))]?
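A small numerical sketch (not from the slides) of the question above: since J is uniform over {1, ..., n}, the expectation of the stochastic gradient is the average of the per-example gradients, i.e., the full batch gradient; a Monte Carlo average over many draws of J should come out close to it. Data are made up.

```python
import numpy as np

def per_example_grad(w, x, y):
    """Gradient of l_log(y x^T w) for a single example (x, y)."""
    sig = 1.0 / (1.0 + np.exp(-y * (x @ w)))
    return -(1.0 - sig) * y * x

rng = np.random.default_rng(3)
n, d = 50, 4
X, y = rng.normal(size=(n, d)), np.where(rng.random(n) < 0.5, -1, 1)
w = rng.normal(size=d)

grads = np.array([per_example_grad(w, X[j], y[j]) for j in range(n)])
full_batch = grads.mean(axis=0)                   # (1/n) sum_j grad_j

draws = rng.integers(0, n, size=100000)           # J ~ Uniform{0,...,n-1}, many times
monte_carlo = grads[draws].mean(axis=0)
print(np.max(np.abs(monte_carlo - full_batch)))   # small: sample average approaches the full batch gradient
```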

Stochastic gradient method II

- Minibatch:
  - To reduce the variance of the estimate, use several random examples J_1, ..., J_B and average—called the minibatch gradient:

    (1/B) ∑_{b=1}^B ∇ℓ(y_{J_b} x_{J_b}ᵀw^(t)).

  - Verify that the expected value is the same!
  - Rule of thumb: larger batch size B → larger step size η.
- Alternative: instead of picking an example uniformly at random, shuffle the order of the training examples, and take the next example in this order.
  - Seems to reduce variance as well, but not fully understood.

Example: SGD for logistic regression

- Logistic regression MLE for data (x_1, y_1), ..., (x_n, y_n) ∈ R^d × {−1, +1}.
- Start with w^(0) ∈ R^d, η > 0, t = 1.
- For epoch p = 1, 2, ...:
  - For each training example (x, y) in a random order:

    w^(t) := w^(t−1) + η (1 − 1/(1 + exp(−y xᵀw^(t−1)))) y x;   t := t + 1.

- (If w^(0) = 0, then the final solution is in the span of the x_i.)
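A minimal sketch of this epoch-based SGD loop (not the course's reference code); the step size, epoch count, and data are arbitrary.

```python
import numpy as np

def sgd_logistic_regression(X, y, eta=0.1, num_epochs=20, seed=0):
    """Epoch-based SGD: shuffle examples each epoch, then for each (x, y) update
    w <- w + eta * (1 - sigmoid(y x^T w)) * y * x."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)             # w^(0) = 0, so the solution stays in the span of the x_i
    for _ in range(num_epochs):
        for i in rng.permutation(n):
            sig = 1.0 / (1.0 + np.exp(-y[i] * (X[i] @ w)))
            w = w + eta * (1.0 - sig) * y[i] * X[i]
    return w

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 3))
y = np.where(X @ np.array([1.0, -2.0, 0.5]) > 0, +1, -1)
w_hat = sgd_logistic_regression(X, y)
print(np.mean(np.where(X @ w_hat > 0, +1, -1) != y))   # training error rate of the SGD solution
```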

Multinomial logistic regression

- How to handle K > 2 classes?
- Multinomial logistic regression model:
  - Y | X = x has a categorical distribution over {1, ..., K}.
  - Pr(Y = k | X = x) = softmax(Wx)_k
- Softmax function (vector-valued function):

  softmax(v)_k = exp(v_k) / ∑_{l=1}^K exp(v_l)

- W = [w_1 | ··· | w_K]ᵀ ∈ R^{K×d} is the parameter matrix of interest.

MLE for multinomial logistic regression

- Treat the training examples as iid, from the same distribution as the test example.
- Encode the label y_i as a one-hot vector ỹ_i ∈ {e_1, ..., e_K}:
  - ỹ_i = (ỹ_{i,1}, ..., ỹ_{i,K}), where ỹ_{i,k} = 1{y_i = k}.
- MLE is equivalent to minimizing the empirical risk with the cross-entropy loss:

  (1/n) ∑_{i=1}^n ℓ_ce(ỹ_i, softmax(Wx_i)),

  where ℓ_ce(p, q) = −∑_{k=1}^K p_k ln q_k.
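A minimal sketch (not from the slides) of the softmax probabilities and the cross-entropy empirical risk for one-hot labels; W and the data are made up.

```python
import numpy as np

def softmax(V):
    """Row-wise softmax: softmax(v)_k = exp(v_k) / sum_l exp(v_l) (max-shifted for numerical stability)."""
    V = V - V.max(axis=1, keepdims=True)
    E = np.exp(V)
    return E / E.sum(axis=1, keepdims=True)

def cross_entropy_risk(W, X, Y_onehot):
    """(1/n) sum_i l_ce(y_tilde_i, softmax(W x_i)), with l_ce(p, q) = -sum_k p_k ln q_k."""
    Q = softmax(X @ W.T)                   # row i holds Pr(Y = k | X = x_i), k = 1..K
    return -np.mean(np.sum(Y_onehot * np.log(Q), axis=1))

rng = np.random.default_rng(5)
n, d, K = 100, 3, 4
X = rng.normal(size=(n, d))
labels = rng.integers(0, K, size=n)        # made-up class labels in {0, ..., K-1}
Y_onehot = np.eye(K)[labels]
W = rng.normal(size=(K, d))
print(cross_entropy_risk(W, X, Y_onehot))
```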
