Machine Learning

10-701/15-781, Fall 2008

Logistic Regression -- generative versus discriminative classifiers

Eric Xing Lecture 5, September 22, 2008

Reading: Chap. 4.3 CB

Generative vs. Discriminative Classifiers

- Goal: wish to learn f: X → Y, e.g., P(Y|X)
- Generative classifiers (e.g., Naïve Bayes):
  - Assume some functional form for P(X|Y), P(Y)
    This is a ‘generative’ model of the data!
  - Estimate parameters of P(X|Y), P(Y) directly from training data
  - Use Bayes rule to calculate P(Y|X = x)
- Discriminative classifiers:
  - Directly assume some functional form for P(Y|X)
    This is a ‘discriminative’ model of the data!
  - Estimate parameters of P(Y|X) directly from training data


Discussion: Generative and discriminative classifiers

- Generative:
  - Modeling the joint distribution of all the data
- Discriminative:
  - Modeling only the points at the boundary
  - How? Regression!


Linear regression

- The data:
  $$\{(x_1, y_1), (x_2, y_2), (x_3, y_3), \ldots, (x_N, y_N)\}$$
- Both nodes are observed:
  - X is an input vector
  - Y is a response vector (we first consider y as a generic continuous response vector, then we consider the special case of classification, where y is a discrete indicator)
- A regression scheme can be used to model p(y|x) directly, rather than p(x, y)


Classification and …


The logistic function


Logistic regression (sigmoid classifier)

- The conditional distribution is a Bernoulli:
  $$p(y \mid x) = \mu(x)^y \bigl(1 - \mu(x)\bigr)^{1-y}$$
  where µ is a logistic function:
  $$\mu(x) = \frac{1}{1 + e^{-\theta^T x}}$$

- We can use the brute-force gradient method as in linear regression

- But we can also apply generic laws by observing that p(y|x) is an exponential family distribution, more specifically, a generalized linear model (see future lectures …)
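A minimal NumPy sketch of the Bernoulli/sigmoid model above (the function names and the convention that a column of ones in X supplies the bias term are illustrative choices, not from the slides):

```python
import numpy as np

def sigmoid(a):
    """Logistic function mu(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def predict_proba(theta, X):
    """Bernoulli mean mu(x_n) = P(y_n = 1 | x_n, theta) for each row of X.

    X is an (N, d) design matrix; a column of ones can encode the bias term."""
    return sigmoid(X @ theta)
```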


Training Logistic Regression: MCLE

- Estimate parameters θ = <θ0, θ1, …, θm> to maximize the conditional likelihood of the training data
- Training data:
- Data likelihood =
- Data conditional likelihood =
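The expressions belonging to these blanks are the standard ones for logistic regression; a sketch in the notation above (the slide's original formulas did not survive extraction, so this is a reconstruction under that assumption):

$$P(\mathbf{y} \mid X, \theta) \;=\; \prod_{n=1}^{N} \mu(x_n)^{y_n}\bigl(1-\mu(x_n)\bigr)^{1-y_n}$$

$$l(\theta) \;=\; \log P(\mathbf{y} \mid X, \theta) \;=\; \sum_{n=1}^{N} \Bigl[\, y_n \log \mu(x_n) + (1-y_n)\log\bigl(1-\mu(x_n)\bigr) \Bigr]$$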


Expressing Conditional Log Likelihood

- Recall the logistic function:
  and the conditional likelihood:


Maximizing Conditional Log Likelihood

- The objective:
- Good news: l(θ) is a concave function of θ
- Bad news: there is no closed-form solution for maximizing l(θ)


Gradient Ascent

- Property of the sigmoid function:
- The gradient:
- The gradient ascent algorithm: iterate until change < ε; for all i, repeat:
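The quantities on this slide are standard: the sigmoid satisfies dµ/da = µ(1 − µ), which gives the gradient ∇θ l(θ) = Σn (yn − µ(xn)) xn. A hedged NumPy sketch of the resulting batch gradient-ascent loop (function name, step size η, and tolerance ε are illustrative):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic_mcle(X, y, eta=0.1, eps=1e-6, max_iter=10000):
    """Batch gradient ascent on the conditional log-likelihood l(theta).

    Uses grad l(theta) = X^T (y - mu), which follows from d mu/da = mu(1 - mu).
    X: (N, d) design matrix; y: length-N array of 0/1 labels."""
    theta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        mu = sigmoid(X @ theta)           # mu_n = P(y_n = 1 | x_n, theta)
        step = eta * (X.T @ (y - mu))     # eta * gradient of l(theta)
        theta += step
        if np.max(np.abs(step)) < eps:    # stop once the change is below epsilon
            break
    return theta
```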


Overfitting …


How about MAP?

- It is very common to use regularized maximum likelihood.
- One common approach is to define a prior on θ:
  - Normal distribution, zero mean, identity covariance
- This helps avoid very large weights and overfitting
- MAP estimate:
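With that Gaussian prior, the MAP estimate adds a quadratic penalty to the conditional log-likelihood; a sketch of the standard form (λ is the regularization constant set by the prior variance σ0², so λ = 1 for the identity-covariance prior above):

$$\theta_{MAP} \;=\; \arg\max_{\theta}\,\bigl[\log p(\theta) + l(\theta)\bigr] \;=\; \arg\max_{\theta}\,\Bigl[\, l(\theta) - \tfrac{\lambda}{2}\,\|\theta\|^2 \Bigr], \qquad \lambda = 1/\sigma_0^2$$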


MLE vs MAP

- Maximum conditional likelihood estimate:
- Maximum a posteriori estimate:


Newton’s method

- Finding a zero of a function


Newton’s method (cont’d)

- To maximize the conditional likelihood l(θ):
  since l is concave, we need to find θ∗ where l′(θ∗) = 0!
- So we can perform the following iteration:
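The iteration referred to is the standard scalar Newton update for finding a stationary point of l; sketched in the notation above:

$$\theta^{(t+1)} \;=\; \theta^{(t)} - \frac{l'(\theta^{(t)})}{l''(\theta^{(t)})}$$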


The Newton-Raphson method

- In LR, θ is vector-valued, so we need the following generalization:
- ∇ is the gradient operator over the function
- H is known as the Hessian of the function
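The generalization is the standard vector-valued Newton-Raphson update, using the gradient and Hessian of l(θ); a sketch:

$$\theta^{(t+1)} \;=\; \theta^{(t)} - H^{-1}\,\nabla_\theta\, l(\theta)\Big|_{\theta=\theta^{(t)}}, \qquad H_{ij} = \frac{\partial^2 l(\theta)}{\partial\theta_i\,\partial\theta_j}$$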



Iterative reweighted least squares (IRLS)

- Recall that in the least squares estimate in linear regression, we have:
  which can also be derived from Newton-Raphson
- Now for logistic regression:
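The equations the slide refers to are the standard ones; sketched here, where X is the design matrix, y the response vector, µ the vector of µ(xn), and W = diag(µ(xn)(1 − µ(xn))):

$$\theta \;=\; (X^T X)^{-1} X^T \mathbf{y} \qquad \text{(least squares, linear regression)}$$

$$\theta^{(t+1)} \;=\; \theta^{(t)} + (X^T W X)^{-1} X^T (\mathbf{y} - \boldsymbol{\mu}) \;=\; (X^T W X)^{-1} X^T W \mathbf{z}, \qquad \mathbf{z} = X\theta^{(t)} + W^{-1}(\mathbf{y} - \boldsymbol{\mu})$$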



Logistic regression: practical issues

- NR (IRLS) takes O(N + d³) per iteration, where N = number of training cases and d = dimension of input x, but converges in fewer iterations
- Quasi-Newton methods, which approximate the Hessian, work faster.
- Conjugate gradient takes O(Nd) per iteration, and usually works best in practice.
- Stochastic gradient descent can also be used if N is large; cf. the perceptron rule:
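The per-example update being compared to the perceptron rule is the standard stochastic gradient step (η is a learning rate); a sketch:

$$\theta \;\leftarrow\; \theta + \eta\,\bigl(y_n - \mu(x_n)\bigr)\,x_n$$

The perceptron rule has the same form, with the hard prediction ŷn = 1[θᵀxn > 0] in place of the probability µ(xn).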


Case study

- Dataset:
  - 20 Newsgroups (20 classes)
  - Download: http://people.csail.mit.edu/jrennie/20Newsgroups/
  - 61,188 words, 18,774 documents
- Class label descriptions


Experimental setup

- Parameters:
  - Binary features
  - Using (stochastic) gradient descent vs. IRLS to estimate parameters
- Training/test sets:
  - Subset of 20NG (binary classes)
  - Use SVD for dimensionality reduction (to 100)
  - Randomly select 50% for training
  - 10 runs; report the average result
- Convergence criterion: RMSE on the training set


Convergence curves

Panels: alt.atheism vs. comp.graphics; rec.autos vs. rec.sport.baseball; comp.windows.x vs. rec.motorcycles

Legend: x-axis: iteration #; y-axis: error. In each figure, red is IRLS and blue is gradient descent.


Generative vs. Discriminative Classifiers

- Goal: wish to learn f: X → Y, e.g., P(Y|X)
- Generative classifiers (e.g., Naïve Bayes):
  - Assume some functional form for P(X|Y), P(Y)
    This is a ‘generative’ model of the data!
  - Estimate parameters of P(X|Y), P(Y) directly from training data
  - Use Bayes rule to calculate P(Y|X = x)
- Discriminative classifiers:
  - Directly assume some functional form for P(Y|X)
    This is a ‘discriminative’ model of the data!
  - Estimate parameters of P(Y|X) directly from training data


Gaussian Discriminative Analysis

- Learning f: X → Y, where
  - X is a vector of real-valued features, <X1, …, Xm>
  - Y is boolean
- What does that imply about the form of P(Y|X)?
- The joint probability of a datum and its label is:
  $$p(x_n, y_n^k = 1 \mid \mu, \sigma) = p(y_n^k = 1) \times p(x_n \mid y_n^k = 1, \mu, \sigma) = \pi_k \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\Bigl\{-\frac{1}{2\sigma^2}(x_n - \mu_k)^2\Bigr\}$$
- Given a datum x_n, we predict its label using the conditional probability of the label given the datum:
  $$p(y_n^k = 1 \mid x_n, \mu, \sigma) = \frac{\pi_k \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\bigl\{-\frac{1}{2\sigma^2}(x_n - \mu_k)^2\bigr\}}{\sum_{k'} \pi_{k'} \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\bigl\{-\frac{1}{2\sigma^2}(x_n - \mu_{k'})^2\bigr\}}$$


Naïve Bayes Classifier

- When X is a multivariate Gaussian vector:
  - The joint probability of a datum and its label is:
    $$p(\vec{x}_n, y_n^k = 1 \mid \vec{\mu}, \Sigma) = p(y_n^k = 1) \times p(\vec{x}_n \mid y_n^k = 1, \vec{\mu}, \Sigma) = \pi_k \frac{1}{(2\pi)^{m/2}|\Sigma|^{1/2}} \exp\Bigl\{-\tfrac{1}{2}(\vec{x}_n - \vec{\mu}_k)^T \Sigma^{-1} (\vec{x}_n - \vec{\mu}_k)\Bigr\}$$

- The naïve Bayes simplification:
  $$p(x_n, y_n^k = 1 \mid \mu, \sigma) = p(y_n^k = 1) \times \prod_j p(x_n^j \mid y_n^k = 1, \mu_{k,j}, \sigma_{k,j}) = \pi_k \prod_j \frac{1}{(2\pi\sigma_{k,j}^2)^{1/2}} \exp\Bigl\{-\frac{1}{2\sigma_{k,j}^2}(x_n^j - \mu_{k,j})^2\Bigr\}$$
- More generally:
  $$p(x_n, y_n \mid \eta, \pi) = p(y_n \mid \pi) \times \prod_{j=1}^{m} p(x_n^j \mid y_n, \eta)$$
  where p(· | ·) is an arbitrary conditional (discrete or continuous) 1-D density


The predictive distribution

- Understanding the predictive distribution:
  $$p(y_n^k = 1 \mid x_n, \vec{\mu}, \Sigma, \pi) = \frac{p(y_n^k = 1, x_n \mid \vec{\mu}, \Sigma, \pi)}{p(x_n \mid \vec{\mu}, \Sigma)} = \frac{\pi_k\, N(x_n \mid \mu_k, \Sigma_k)}{\sum_{k'} \pi_{k'}\, N(x_n \mid \mu_{k'}, \Sigma_{k'})} \quad (*)$$

- Under the naïve Bayes assumption:
  $$p(y_n^k = 1 \mid x_n, \vec{\mu}, \Sigma, \pi) = \frac{\pi_k \exp\Bigl\{-\sum_j \bigl(\frac{1}{2\sigma_{k,j}^2}(x_n^j - \mu_k^j)^2 + \log\sigma_{k,j} + C\bigr)\Bigr\}}{\sum_{k'} \pi_{k'} \exp\Bigl\{-\sum_j \bigl(\frac{1}{2\sigma_{k',j}^2}(x_n^j - \mu_{k'}^j)^2 + \log\sigma_{k',j} + C\bigr)\Bigr\}} \quad (**)$$

- For two classes (i.e., K = 2), and when the two classes have the same variance, ** turns out to be the logistic function:
  $$p(y_n^1 = 1 \mid x_n) = \frac{1}{1 + \exp\Bigl\{-\sum_j \Bigl(\frac{\mu_1^j - \mu_2^j}{\sigma_j^2}\, x_n^j - \frac{[\mu_1^j]^2 - [\mu_2^j]^2}{2\sigma_j^2}\Bigr) + \log\frac{1-\pi_1}{\pi_1}\Bigr\}} = \frac{1}{1 + e^{-\theta^T x_n}}$$

The decision boundary

- The predictive distribution:
  $$p(y_n^1 = 1 \mid x_n) = \frac{1}{1 + e^{-\theta^T x_n}} = \frac{1}{1 + \exp\Bigl\{-\sum_{j=1}^{M} \theta_j x_n^j - \theta_0\Bigr\}}$$

- The Bayes decision rule:
  $$\ln\frac{p(y_n^1 = 1 \mid x_n)}{p(y_n^2 = 1 \mid x_n)} = \ln\left(\frac{\dfrac{1}{1 + e^{-\theta^T x_n}}}{\dfrac{e^{-\theta^T x_n}}{1 + e^{-\theta^T x_n}}}\right) = \theta^T x_n$$

- For multiple classes (i.e., K > 2), * corresponds to a softmax function:
  $$p(y_n^k = 1 \mid x_n) = \frac{e^{-\theta_k^T x_n}}{\sum_j e^{-\theta_j^T x_n}}$$
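A minimal NumPy sketch of this softmax predictive distribution (the function name is illustrative; it keeps the slide's sign convention with exp(−θkᵀxn)):

```python
import numpy as np

def softmax_predict(Theta, x):
    """Class posteriors p(y^k = 1 | x) for K-class logistic regression.

    Theta: (K, d) array whose rows are the per-class weight vectors theta_k.
    x: length-d input vector."""
    a = -Theta @ x      # scores -theta_k^T x for each class k
    a -= a.max()        # shift for numerical stability; does not change the result
    p = np.exp(a)
    return p / p.sum()
```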


Naïve Bayes vs Logistic Regression

- Consider Y boolean, X continuous, X = <X1, …, Xm>
- Number of parameters to estimate (see the sketch below):
  - NB:
  - LR:
- Estimation method:
  - NB parameter estimates are uncoupled
  - LR parameter estimates are coupled
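A sketch of the standard counts, assuming Gaussian naïve Bayes with a class prior plus a class-specific mean and variance for each of the m features (the exact count depends on whether variances are shared across classes):

$$\text{NB: } \underbrace{1}_{\pi} + \underbrace{2m}_{\mu_{i,k}} + \underbrace{2m}_{\sigma_{i,k}} = 4m + 1, \qquad \text{LR: } m + 1 \ \text{weights } \theta_0, \ldots, \theta_m$$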


Naïve Bayes vs Logistic Regression

- Asymptotic comparison (# training examples → infinity):
  - When model assumptions are correct:
    - NB and LR produce identical classifiers
  - When model assumptions are incorrect:
    - LR is less biased: it does not assume conditional independence
    - Therefore LR is expected to outperform NB


Naïve Bayes vs Logistic Regression

- Non-asymptotic analysis (see [Ng & Jordan, 2002]):
  - Convergence rate of parameter estimates: how many training examples are needed to assure good estimates?
    - NB: order log m (where m = # of attributes in X)
    - LR: order m
  - NB converges more quickly to its (perhaps less helpful) asymptotic estimates


Rate of convergence: logistic regression

- Let h_Dis,m be logistic regression trained on n examples in m dimensions. Then with high probability:
- Implication: if we want
  for some small constant ε0, it suffices to pick order m examples
  → LR converges to its asymptotic classifier in order m examples
- The result follows from Vapnik’s structural risk bound, plus the fact that the “VC dimension” of m-dimensional linear separators is m


Rate of convergence: naïve Bayes parameters

- Let any ε1, δ > 0 and any n ≥ 0 be fixed.
  Assume that for some fixed ρ0 > 0, we have that:
- Let
- Then with probability at least 1 − δ, after n examples:
  1. For discrete inputs, for all i and b:
  2. For continuous inputs, for all i and b:


Some results from UCI data sets


Back to our 20 NG Case study

- Dataset:
  - 20 Newsgroups (20 classes)
  - 61,188 words, 18,774 documents
- Experimental setup:
  - Solve only a two-class subset: class 1 vs class 2
  - 1768 instances, 61,188 features
  - Use dimensionality reduction on the data (SVD)
  - Use 90% as the training set, 10% as the test set
  - Test prediction error is used as the accuracy measure


Generalization error (1)

- Versus training-set size
  - 30 features
  - A fixed test set
  - Training set varied from 10% to 100% of the available training data


Generalization error (2)

- Versus model size
  - Number of dimensions of the data varied from 5 to 50 in steps of 5
  - The features were chosen in decreasing order of their singular values
  - 90%/10% split into training and test sets


Summary

- Logistic regression:
  - Functional form follows from the Naïve Bayes assumptions
    - For Gaussian Naïve Bayes assuming class-independent variances
    - For discrete-valued Naïve Bayes too
  - But the training procedure picks parameters without the conditional independence assumption
  - MLE training: pick θ to maximize P(Y | X; θ)
  - MAP training: pick θ to maximize P(θ | X, Y)
    - This is a form of ‘regularization’ and helps reduce overfitting
- Gradient ascent/descent:
  - A general approach when closed-form solutions are unavailable
- Generative vs. discriminative classifiers:
  - Bias vs. variance tradeoff

