Machine Learning
10-701/15-781, Fall 2008
Logistic Regression: generative versus discriminative classifiers
Eric Xing, Lecture 5, September 22, 2008
Reading: Chap. 4.3, CB
Generative vs. Discriminative Classifiers
- Goal: wish to learn f: X → Y, e.g., P(Y|X)
- Generative classifiers (e.g., Naïve Bayes):
  - Assume some functional form for P(X|Y), P(Y). This is a 'generative' model of the data!
  - Estimate parameters of P(X|Y), P(Y) directly from training data
  - Use Bayes rule to calculate P(Y|X = x)
- Discriminative classifiers:
  - Directly assume some functional form for P(Y|X). This is a 'discriminative' model of the data!
  - Estimate parameters of P(Y|X) directly from training data
(Figure: graphical-model sketches with nodes Y_i and X_i for the two approaches.)
Discussion: Generative and discriminative classifiers
- Generative: modeling the joint distribution of all the data
- Discriminative: modeling only the points at the boundary
- How? Regression!
Linear regression
- The data:
  $$\{(x_1, y_1), (x_2, y_2), (x_3, y_3), \ldots, (x_N, y_N)\}$$
- Both nodes are observed:
  - X is an input vector
  - Y is a response vector (we first consider y as a generic continuous response, then the special case of classification where y is a discrete indicator)
- A regression scheme can be used to model p(y|x) directly, rather than p(x, y)
Classification and logistic regression
The logistic function
Logistic regression (sigmoid classifier)
- The conditional distribution: a Bernoulli
  $$p(y \mid x) = \mu(x)^y \, (1 - \mu(x))^{1-y}$$
  where µ is a logistic function:
  $$\mu(x) = \frac{1}{1 + e^{-\theta^T x}}$$
- We can use the brute-force gradient method as in linear regression
- But we can also apply generic results by observing that p(y|x) is an exponential-family distribution, more specifically a generalized linear model (see future lectures …)
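To make the sigmoid classifier above concrete, here is a minimal NumPy sketch (not from the slides; the names theta and x are illustrative) of the Bernoulli conditional p(y|x) = µ(x)^y (1 − µ(x))^{1−y} with µ(x) = 1/(1 + e^{−θᵀx}):

```python
import numpy as np

def sigmoid(z):
    # logistic function: mu = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

def cond_prob(y, x, theta):
    # Bernoulli conditional p(y | x) = mu^y * (1 - mu)^(1 - y)
    mu = sigmoid(theta @ x)
    return mu**y * (1.0 - mu)**(1 - y)

# toy example with an illustrative parameter vector
theta = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 0.3, -0.2])
print(cond_prob(1, x, theta), cond_prob(0, x, theta))  # the two values sum to 1
```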
Training Logistic Regression: MCLE
- Estimate parameters θ = ⟨θ_0, θ_1, ..., θ_m⟩ to maximize the conditional likelihood of the training data
- Training data: $D = \{(x_n, y_n)\}_{n=1}^N$
- Data likelihood $= \prod_n P(x_n, y_n \mid \theta)$
- Data conditional likelihood $= \prod_n P(y_n \mid x_n, \theta)$
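As a small illustration of the conditional-likelihood objective (my own sketch, not code from the course; X, y, theta are illustrative names), the conditional log-likelihood of 0/1-labeled training data under the sigmoid model can be evaluated as:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conditional_log_likelihood(theta, X, y):
    """l(theta) = sum_n [ y_n log mu_n + (1 - y_n) log(1 - mu_n) ],
    where mu_n = sigmoid(theta^T x_n).  X is (N, d), y is (N,) of 0/1 labels."""
    mu = sigmoid(X @ theta)
    eps = 1e-12  # guard against log(0)
    return np.sum(y * np.log(mu + eps) + (1 - y) * np.log(1 - mu + eps))
```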
Expressing Conditional Log Likelihood
- Recall the logistic function:
  $$\mu(x) = \frac{1}{1 + e^{-\theta^T x}}$$
  and the conditional log-likelihood:
  $$l(\theta) = \sum_n \big[\, y_n \log \mu(x_n) + (1 - y_n) \log(1 - \mu(x_n)) \,\big]$$
Maximizing Conditional Log Likelihood
- The objective:
  $$l(\theta) = \log \prod_n \mu(x_n)^{y_n} (1 - \mu(x_n))^{1 - y_n} = \sum_n \Big[\, y_n \,\theta^T x_n - \log\!\big(1 + e^{\theta^T x_n}\big) \,\Big]$$
- Good news: l(θ) is a concave function of θ
- Bad news: no closed-form solution to maximize l(θ)
Gradient Ascent
- Property of the sigmoid function:
  $$\frac{d\mu}{dz} = \mu (1 - \mu), \qquad \mu = \frac{1}{1 + e^{-z}}$$
- The gradient:
  $$\frac{\partial l(\theta)}{\partial \theta} = \sum_n x_n \big( y_n - \mu(x_n) \big)$$
- The gradient ascent algorithm: iterate until the change is < ε; for all i, repeat
  $$\theta_i \leftarrow \theta_i + \eta \sum_n x_{n,i} \big( y_n - \mu(x_n) \big)$$
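A minimal sketch of this gradient-ascent loop (not from the slides; the step size eta and the stopping threshold are illustrative choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_ascent(X, y, eta=0.01, eps=1e-6, max_iter=10000):
    """Maximize the conditional log-likelihood of logistic regression.
    X: (N, d) design matrix (include a column of ones for the bias),
    y: (N,) array of 0/1 labels."""
    theta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        mu = sigmoid(X @ theta)
        grad = X.T @ (y - mu)           # gradient of l(theta)
        theta_new = theta + eta * grad  # ascent step
        if np.linalg.norm(theta_new - theta) < eps:  # change < epsilon
            return theta_new
        theta = theta_new
    return theta
```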
Overfitting …
How about MAP?
- It is very common to use regularized maximum likelihood.
- One common approach is to define a prior on θ:
  - Normal distribution, zero mean, identity covariance
- This helps avoid very large weights and overfitting.
- MAP estimate:
  $$\theta^{MAP} = \arg\max_\theta \; \log \Big[ p(\theta) \prod_n P(y_n \mid x_n, \theta) \Big]$$
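With a zero-mean Gaussian prior θ ~ N(0, λ⁻¹I), the MAP gradient simply adds a weight-decay term to the MLE gradient. A sketch under that assumption (lam is an illustrative hyperparameter, not a value from the lecture):

```python
import numpy as np

def map_gradient(theta, X, y, lam=1.0):
    """Gradient of the penalized objective
    l(theta) - (lam / 2) * ||theta||^2  (zero-mean Gaussian prior on theta)."""
    mu = 1.0 / (1.0 + np.exp(-(X @ theta)))
    return X.T @ (y - mu) - lam * theta
```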
MLE vs MAP
- Maximum conditional likelihood estimate:
  $$\theta_i \leftarrow \theta_i + \eta \sum_n x_{n,i} \big( y_n - \mu(x_n) \big)$$
- Maximum a posteriori estimate (zero-mean Gaussian prior):
  $$\theta_i \leftarrow \theta_i - \eta \lambda \theta_i + \eta \sum_n x_{n,i} \big( y_n - \mu(x_n) \big)$$
7 The Newton’s method
- Finding a zero of a function f(θ): iterate
  $$\theta^{(t+1)} = \theta^{(t)} - \frac{f(\theta^{(t)})}{f'(\theta^{(t)})}$$
Newton's method (cont'd)
- To maximize the conditional likelihood l(θ): since l is concave, we need to find θ* where l'(θ*) = 0.
- So we can perform the following iteration:
  $$\theta^{(t+1)} = \theta^{(t)} - \frac{l'(\theta^{(t)})}{l''(\theta^{(t)})}$$
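A one-dimensional sketch of this iteration (my own toy example, not from the lecture), where dl and d2l stand for l' and l'':

```python
def newton_max(dl, d2l, theta0, tol=1e-8, max_iter=100):
    """Find theta with dl(theta) = 0 via Newton's method:
    theta <- theta - dl(theta) / d2l(theta)."""
    theta = theta0
    for _ in range(max_iter):
        step = dl(theta) / d2l(theta)
        theta -= step
        if abs(step) < tol:
            break
    return theta

# toy example: maximize l(theta) = -(theta - 3)^2, so l'(theta) = -2(theta - 3)
print(newton_max(lambda t: -2 * (t - 3), lambda t: -2.0, theta0=0.0))  # -> 3.0
```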
The Newton-Raphson method
- In logistic regression θ is vector-valued, so we need the following generalization:
  $$\theta^{(t+1)} = \theta^{(t)} - H^{-1} \nabla_\theta\, l(\theta^{(t)})$$
- ∇ is the gradient operator over the function
- H is known as the Hessian of the function
Iteratively reweighted least squares (IRLS)
- Recall that in the least-squares estimate for linear regression, we have:
  $$\theta = (X^T X)^{-1} X^T y$$
  which can also be derived from Newton-Raphson.
- Now for logistic regression, each Newton-Raphson step is a weighted least-squares problem:
  $$\theta^{(t+1)} = (X^T W X)^{-1} X^T W z, \qquad W = \mathrm{diag}\big(\mu_n(1-\mu_n)\big), \qquad z = X\theta^{(t)} + W^{-1}(y - \mu)$$
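A compact sketch of this recursion (my own illustration of the standard IRLS update, not code from the course; variable names are illustrative):

```python
import numpy as np

def irls(X, y, n_iter=25, tol=1e-8):
    """Iteratively reweighted least squares for logistic regression.
    X: (N, d) design matrix, y: (N,) array of 0/1 labels."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-(X @ theta)))
        w = mu * (1.0 - mu)                               # diagonal of W
        z = X @ theta + (y - mu) / np.maximum(w, 1e-10)   # working response
        # weighted least-squares step: theta = (X^T W X)^{-1} X^T W z
        WX = X * w[:, None]
        theta_new = np.linalg.solve(X.T @ WX, X.T @ (w * z))
        if np.linalg.norm(theta_new - theta) < tol:
            return theta_new
        theta = theta_new
    return theta
```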
Logistic regression: practical issues
- NR (IRLS) takes O(N + d³) per iteration, where N = number of training cases and d = dimension of the input x, but converges in fewer iterations.
- Quasi-Newton methods, which approximate the Hessian, work faster.
- Conjugate gradient takes O(Nd) per iteration, and usually works best in practice.
- Stochastic gradient descent can also be used if N is large (cf. the perceptron rule); a sketch follows.
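The stochastic (online) version updates θ from one example at a time; it resembles the perceptron rule but uses the sigmoid output in place of a hard threshold. A minimal sketch with an illustrative learning rate:

```python
import numpy as np

def sgd_epoch(theta, X, y, eta=0.1, rng=None):
    """One pass of stochastic gradient ascent: for each example n (in random
    order), theta <- theta + eta * x_n * (y_n - mu(x_n))."""
    if rng is None:
        rng = np.random.default_rng(0)
    for n in rng.permutation(len(y)):
        mu_n = 1.0 / (1.0 + np.exp(-(X[n] @ theta)))
        theta = theta + eta * (y[n] - mu_n) * X[n]
    return theta
```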
Case study
- Dataset:
  - 20 Newsgroups (20 classes)
  - Download: http://people.csail.mit.edu/jrennie/20Newsgroups/
  - 61,118 words, 18,774 documents
  - Class label descriptions
Experimental setup
- Parameters:
  - Binary features
  - Parameters estimated with (stochastic) gradient descent vs. IRLS
- Training/test sets:
  - Subset of 20ng (binary classes)
  - SVD used for dimensionality reduction (to 100 dimensions); a sketch of this step follows the list
  - Randomly select 50% for training
  - 10 runs; report the average result
- Convergence criterion: RMSE on the training set
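As a hedged sketch of the SVD-based dimensionality-reduction step (the function and variable names are illustrative; the course's actual preprocessing script is not shown): project each document's binary term vector onto the top-k right singular directions of the document-term matrix.

```python
import numpy as np

def svd_reduce(X, k=100):
    """Project rows of the (N, V) document-term matrix X onto the top-k
    right singular vectors, giving an (N, k) representation."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:k].T   # equivalently U[:, :k] * s[:k]
```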
Convergence curves
(Figure: three convergence plots: alt.atheism vs. comp.graphics, rec.autos vs. rec.sport.baseball, and comp.windows.x vs. rec.motorcycles.)
Legend: x-axis: iteration #; y-axis: error. In each figure, red is IRLS and blue is gradient descent.
Gaussian Discriminant Analysis
- Learning f: X → Y, where
  - X is a vector of real-valued features ⟨X_1, ..., X_m⟩
  - Y is boolean
  - What does that imply about the form of P(Y|X)?
- The joint probability of a datum and its label is:
  $$p(x_n, y_n^k = 1 \mid \mu, \sigma) = p(y_n^k = 1)\, p(x_n \mid y_n^k = 1, \mu, \sigma) = \pi_k \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\Big\{ -\frac{1}{2\sigma^2}(x_n - \mu_k)^2 \Big\}$$
- Given a datum x_n, we predict its label using the conditional probability of the label given the datum:
  $$p(y_n^k = 1 \mid x_n, \mu, \sigma) = \frac{\pi_k \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\big\{ -\frac{1}{2\sigma^2}(x_n - \mu_k)^2 \big\}}{\sum_{k'} \pi_{k'} \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\big\{ -\frac{1}{2\sigma^2}(x_n - \mu_{k'})^2 \big\}}$$
Naïve Bayes Classifier
- When X is a multivariate Gaussian vector, the joint probability of a datum and its label is:
  $$p(\vec{x}_n, y_n^k = 1 \mid \vec{\mu}, \Sigma) = p(y_n^k = 1)\, p(\vec{x}_n \mid y_n^k = 1, \vec{\mu}, \Sigma) = \pi_k \frac{1}{(2\pi)^{m/2} |\Sigma|^{1/2}} \exp\Big\{ -\tfrac{1}{2} (\vec{x}_n - \vec{\mu}_k)^T \Sigma^{-1} (\vec{x}_n - \vec{\mu}_k) \Big\}$$
- The naïve Bayes simplification (conditionally independent features):
  $$p(x_n, y_n^k = 1 \mid \mu, \sigma) = p(y_n^k = 1) \prod_j p(x_n^j \mid y_n^k = 1, \mu_{k,j}, \sigma_{k,j}) = \pi_k \prod_j \frac{1}{(2\pi\sigma_{k,j}^2)^{1/2}} \exp\Big\{ -\frac{1}{2\sigma_{k,j}^2}\big(x_n^j - \mu_{k,j}\big)^2 \Big\}$$
- More generally:
  $$p(x_n, y_n \mid \eta, \pi) = p(y_n \mid \pi) \prod_{j=1}^m p(x_n^j \mid y_n, \eta)$$
- where p(· | ·) is an arbitrary conditional (discrete or continuous) 1-D density
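To contrast with the coupled logistic-regression fit discussed later, here is a minimal sketch (my own, with illustrative names) of the uncoupled maximum-likelihood estimates for a Gaussian naïve Bayes model: class priors π_k and per-feature, per-class means and variances µ_{k,j}, σ_{k,j}².

```python
import numpy as np

def fit_gaussian_nb(X, y, n_classes):
    """X: (N, m) real-valued features, y: (N,) integer labels in {0..K-1}.
    Each estimate depends only on the counts/moments of its own class."""
    pi = np.array([np.mean(y == k) for k in range(n_classes)])
    mu = np.array([X[y == k].mean(axis=0) for k in range(n_classes)])
    var = np.array([X[y == k].var(axis=0) + 1e-9 for k in range(n_classes)])
    return pi, mu, var
```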
The predictive distribution
- Understanding the predictive distribution:
  $$p(y_n^k = 1 \mid x_n, \vec{\mu}, \Sigma, \pi) = \frac{p(y_n^k = 1, x_n \mid \vec{\mu}, \Sigma, \pi)}{p(x_n \mid \vec{\mu}, \Sigma)} = \frac{\pi_k\, N(x_n \mid \mu_k, \Sigma_k)}{\sum_{k'} \pi_{k'}\, N(x_n \mid \mu_{k'}, \Sigma_{k'})} \qquad (*)$$
- Under the naïve Bayes assumption:
  $$p(y_n^k = 1 \mid x_n, \vec{\mu}, \Sigma, \pi) = \frac{\pi_k \exp\Big\{ -\sum_j \Big( \frac{1}{2\sigma_{k,j}^2}\big(x_n^j - \mu_k^j\big)^2 + \log \sigma_{k,j} + C \Big) \Big\}}{\sum_{k'} \pi_{k'} \exp\Big\{ -\sum_j \Big( \frac{1}{2\sigma_{k',j}^2}\big(x_n^j - \mu_{k'}^j\big)^2 + \log \sigma_{k',j} + C \Big) \Big\}} \qquad (**)$$
- For two classes (i.e., K = 2), and when the two classes have the same variance, (**) turns out to be the logistic function:
  $$p(y_n^1 = 1 \mid x_n) = \frac{1}{1 + \frac{1-\pi_1}{\pi_1} \exp\Big\{ -\sum_j x_n^j \frac{1}{\sigma_j^2}\big(\mu_1^j - \mu_2^j\big) + \sum_j \frac{1}{2\sigma_j^2}\big([\mu_1^j]^2 - [\mu_2^j]^2\big) \Big\}}
  = \frac{1}{1 + \exp\Big\{ -\sum_j x_n^j \frac{1}{\sigma_j^2}\big(\mu_1^j - \mu_2^j\big) + \sum_j \frac{1}{2\sigma_j^2}\big([\mu_1^j]^2 - [\mu_2^j]^2\big) + \log\frac{1-\pi_1}{\pi_1} \Big\}}
  = \frac{1}{1 + e^{-\theta^T x_n}}$$
The decision boundary
- The predictive distribution:
  $$p(y_n^1 = 1 \mid x_n) = \frac{1}{1 + e^{-\theta^T x_n}} = \frac{1}{1 + \exp\Big\{ -\sum_{j=1}^M \theta_j x_n^j - \theta_0 \Big\}}$$
- The Bayes decision rule:
  $$\ln \frac{p(y_n^1 = 1 \mid x_n)}{p(y_n^2 = 1 \mid x_n)} = \ln \frac{\dfrac{1}{1 + e^{-\theta^T x_n}}}{\dfrac{e^{-\theta^T x_n}}{1 + e^{-\theta^T x_n}}} = \theta^T x_n$$
- For multiple classes (i.e., K > 2), (*) corresponds to a softmax function:
  $$p(y_n^k = 1 \mid x_n) = \frac{e^{-\theta_k^T x_n}}{\sum_j e^{-\theta_j^T x_n}}$$
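A small sketch of this softmax predictive distribution, written with the slide's sign convention p(yᵏ=1|x) ∝ exp(−θ_kᵀx) (not from the course; Theta is an illustrative (K, d) parameter matrix):

```python
import numpy as np

def softmax_predict(Theta, x):
    """p(y^k = 1 | x) = exp(-theta_k^T x) / sum_j exp(-theta_j^T x),
    matching the slide's sign convention.  Theta: (K, d), x: (d,)."""
    scores = -(Theta @ x)
    scores -= scores.max()   # subtract the max for numerical stability
    p = np.exp(scores)
    return p / p.sum()
```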
Naïve Bayes vs Logistic Regression
- Consider Y boolean, X continuous, X = ⟨X_1, ..., X_m⟩
- Number of parameters to estimate:
  - NB: 4m + 1 (the class prior, plus a mean and a variance per feature per class)
  - LR: m + 1
- Estimation method:
  - NB parameter estimates are uncoupled
  - LR parameter estimates are coupled
Naïve Bayes vs Logistic Regression
- Asymptotic comparison (# training examples → infinity):
  - When model assumptions are correct:
    - NB and LR produce identical classifiers
  - When model assumptions are incorrect:
    - LR is less biased: it does not assume conditional independence
    - Therefore LR is expected to outperform NB
Naïve Bayes vs Logistic Regression
- Non-asymptotic analysis (see [Ng & Jordan, 2002]):
  - Convergence rate of parameter estimates: how many training examples are needed to assure good estimates?
    - NB: order log m (where m = # of attributes in X)
    - LR: order m
  - NB converges more quickly to its (perhaps less helpful) asymptotic estimates
Rate of convergence: logistic regression
- Let h_{Dis,m} be logistic regression trained on n examples in m dimensions. Then, with high probability, its generalization error is within O(√((m/n) log(n/m))) of that of the asymptotic (infinite-data) logistic regression classifier [Ng & Jordan, 2002].
- Implication: if we want the error to exceed the asymptotic error by at most some small constant ε_0, it suffices to pick order m examples
  → LR converges to its asymptotic classifier in order m examples
- The result follows from Vapnik's structural risk bound, plus the fact that the VC dimension of m-dimensional linear separators is m
Rate of convergence: naïve Bayes parameters
- Let any ε_1, δ > 0 and any n ≥ 0 be fixed. Assume that for some fixed ρ_0 > 0, the class prior is bounded away from 0 and 1 (ρ_0 ≤ p(y) ≤ 1 − ρ_0).
- Let the number of examples be of order (1/ε_1²) log(m/δ) (up to constants; see [Ng & Jordan, 2002] for the precise statement).
- Then, with probability at least 1 − δ, after n examples:
  1. For discrete inputs, for all i and b, the estimates p̂(x_i | y = b) are within ε_1 of their true values.
  2. For continuous inputs, for all i and b, the estimated class-conditional means and variances are within ε_1 of their true values.
Some experiments from UCI data sets
Back to our 20 NG Case study
- Dataset:
  - 20 Newsgroups (20 classes)
  - 61,118 words, 18,774 documents
- Experiment:
  - Solve only a two-class subset: 1 vs 2
  - 1,768 instances, 61,188 features
  - Use dimensionality reduction on the data (SVD)
  - Use 90% as the training set, 10% as the test set
  - Test prediction error used as the accuracy measure
Generalization error (1)
- Versus training size:
  - 30 features
  - A fixed test set
  - Training set varied from 10% to 100% of the available training data
Generalization error (2)
- Versus model size:
  - Number of dimensions of the data varied from 5 to 50 in steps of 5
  - The features were chosen in decreasing order of their singular values
  - 90% / 10% split into training and test sets
Summary
- Logistic regression:
  - Functional form follows from the Naïve Bayes assumptions
    - For Gaussian Naïve Bayes assuming class-independent variances σ_{k,j} = σ_j
    - For discrete-valued Naïve Bayes too
  - But the training procedure picks parameters without the conditional independence assumption
  - MLE training: pick θ to maximize P(Y | X; θ)
  - MAP training: pick θ to maximize P(θ | X, Y)
    - 'Regularization': helps reduce overfitting
- Gradient ascent/descent:
  - General approach when closed-form solutions are unavailable
- Generative vs. discriminative classifiers:
  - Bias vs. variance tradeoff