Machine Learning
10-701/15-781, Spring 2008
Logistic Regression -- generative versus discriminative classifiers
Eric Xing Lecture 5, January 30, 2006
Reading: Chap. 3.1.3-4 CB
Generative vs. Discriminative Classifiers
• Goal: we wish to learn f: X → Y, e.g., P(Y|X)
• Generative classifiers (e.g., Naïve Bayes):
  • Assume some functional form for P(X|Y), P(Y). This is a 'generative' model of the data!
  • Estimate the parameters of P(X|Y), P(Y) directly from training data
  • Use Bayes rule to calculate P(Y|X = x_i)
• Discriminative classifiers:
  • Directly assume some functional form for P(Y|X). This is a 'discriminative' model of the data!
  • Estimate the parameters of P(Y|X) directly from training data
Consider a Gaussian Generative Classifier
• Learning f: X → Y, where
  • X is a vector of real-valued features, ⟨X_1 … X_m⟩
  • Y is boolean
• What does that imply about the form of P(Y|X)?
• The joint probability of a datum and its label is:

  p(x_n, y_n^k = 1 \mid \mu, \sigma) = p(y_n^k = 1) \times p(x_n \mid y_n^k = 1, \mu, \sigma)
                                     = \pi_k \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{-\frac{1}{2\sigma^2}(x_n - \mu_k)^2\right\}
• Given a datum x_n, we predict its label using the conditional probability of the label given the datum:

  p(y_n^k = 1 \mid x_n, \mu, \sigma) = \frac{\pi_k \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{-\frac{1}{2\sigma^2}(x_n - \mu_k)^2\right\}}{\sum_{k'} \pi_{k'} \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{-\frac{1}{2\sigma^2}(x_n - \mu_{k'})^2\right\}}
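The Bayes-rule computation above is straightforward to sketch in code. This is a minimal illustration, not code from the lecture; `gaussian_pdf`, `posterior`, and the sample parameters are hypothetical names chosen here, and a shared variance σ across classes is assumed, as in the slide.

```python
import math

def gaussian_pdf(x, mu, sigma):
    # Density of N(mu, sigma^2) at x.
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def posterior(x, priors, mus, sigma):
    # Bayes rule: p(y^k = 1 | x) = pi_k N(x | mu_k, sigma^2) / sum_k' pi_k' N(x | mu_k', sigma^2)
    joints = [pi * gaussian_pdf(x, mu, sigma) for pi, mu in zip(priors, mus)]
    evidence = sum(joints)
    return [j / evidence for j in joints]

# A datum at x = 0.9, between class means 0.0 and 2.0, with equal priors:
probs = posterior(0.9, priors=[0.5, 0.5], mus=[0.0, 2.0], sigma=1.0)
```

Since 0.9 is closer to the first class mean, the posterior favors class 1; the denominator (the evidence) guarantees the posteriors sum to one.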
Naïve Bayes Classifier
• When X is a multivariate Gaussian vector, the joint probability of a datum and its label is:

  p(\vec{x}_n, y_n^k = 1 \mid \vec{\mu}, \Sigma) = p(y_n^k = 1) \times p(\vec{x}_n \mid y_n^k = 1, \vec{\mu}, \Sigma)
                                                 = \pi_k \frac{1}{(2\pi)^{m/2} |\Sigma|^{1/2}} \exp\left\{-\tfrac{1}{2}(\vec{x}_n - \vec{\mu}_k)^T \Sigma^{-1} (\vec{x}_n - \vec{\mu}_k)\right\}

• The naïve Bayes simplification (features X_1, X_2, …, X_m independent given the class):

  p(x_n, y_n^k = 1 \mid \mu, \sigma) = p(y_n^k = 1) \times \prod_j p(x_n^j \mid y_n^k = 1, \mu_{k,j}, \sigma_{k,j})
                                     = \pi_k \prod_j \frac{1}{(2\pi\sigma_{k,j}^2)^{1/2}} \exp\left\{-\frac{1}{2\sigma_{k,j}^2}(x_n^j - \mu_{k,j})^2\right\}
• More generally:

  p(x_n, y_n \mid \eta, \pi) = p(y_n \mid \pi) \times \prod_{j=1}^{m} p(x_n^j \mid y_n, \eta)

  where p(\cdot \mid \cdot) is an arbitrary conditional (discrete or continuous) 1-D density.
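The factored joint above is cheap to evaluate, and its maximum-likelihood parameters are just per-class, per-feature means and variances estimated from the training data. A minimal sketch with Gaussian conditionals (all function names are hypothetical, not from the lecture):

```python
import math

def fit_gnb(X, y, k):
    # Uncoupled MLE for class k: class prior, per-feature mean and std deviation.
    Xk = [x for x, yn in zip(X, y) if yn == k]
    prior = len(Xk) / len(X)
    mus = [sum(col) / len(col) for col in zip(*Xk)]
    sigmas = [max(sum((v - m) ** 2 for v in col) / len(col), 1e-9) ** 0.5
              for col, m in zip(zip(*Xk), mus)]
    return prior, mus, sigmas

def gnb_joint(x, prior, mus, sigmas):
    # p(x, y^k = 1) = pi_k * prod_j N(x_j | mu_{k,j}, sigma_{k,j}^2)
    p = prior
    for xj, mu, s in zip(x, mus, sigmas):
        p *= math.exp(-(xj - mu) ** 2 / (2 * s ** 2)) / math.sqrt(2 * math.pi * s ** 2)
    return p
```

Note that the per-class estimates never interact: this is the "uncoupled" estimation discussed later in the naïve Bayes vs. logistic regression comparison.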
The predictive distribution
• Understanding the predictive distribution:

  p(y_n^k = 1 \mid x_n, \vec{\mu}, \Sigma, \pi) = \frac{p(y_n^k = 1, x_n \mid \vec{\mu}, \Sigma, \pi)}{p(x_n \mid \vec{\mu}, \Sigma)} = \frac{\pi_k N(x_n \mid \mu_k, \Sigma_k)}{\sum_{k'} \pi_{k'} N(x_n \mid \mu_{k'}, \Sigma_{k'})}   (*)
• Under the naïve Bayes assumption:

  p(y_n^k = 1 \mid x_n, \vec{\mu}, \Sigma, \pi) = \frac{\pi_k \exp\left\{-\sum_j \left(\frac{1}{2\sigma_{k,j}^2}(x_n^j - \mu_{k,j})^2 + \log\sigma_{k,j} + C\right)\right\}}{\sum_{k'} \pi_{k'} \exp\left\{-\sum_j \left(\frac{1}{2\sigma_{k',j}^2}(x_n^j - \mu_{k',j})^2 + \log\sigma_{k',j} + C\right)\right\}}   (**)
• For two classes (i.e., K = 2), and when the two classes have the same variance, ** turns out to be the logistic function:

  p(y_n = 1 \mid x_n) = \frac{1}{1 + \frac{1-\pi_1}{\pi_1}\exp\left\{-\sum_j \frac{x_n^j(\mu_1^j - \mu_2^j)}{\sigma_j^2} + \sum_j \frac{[\mu_1^j]^2 - [\mu_2^j]^2}{2\sigma_j^2}\right\}}
                      = \frac{1}{1 + \exp\left\{-\sum_j x_n^j\frac{\mu_1^j - \mu_2^j}{\sigma_j^2} + \sum_j \frac{[\mu_1^j]^2 - [\mu_2^j]^2}{2\sigma_j^2} + \log\frac{1-\pi_1}{\pi_1}\right\}}
                      = \frac{1}{1 + e^{-\theta^T x_n}}
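The equivalence above can be checked numerically: the posterior computed directly from the two Gaussians should match the sigmoid of a linear function with weights θ_j = (μ_1^j - μ_2^j)/σ_j² and intercept θ_0 = Σ_j ([μ_2^j]² - [μ_1^j]²)/(2σ_j²) + log(π_1/(1-π_1)). A small sketch with hypothetical names and sample parameters:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gnb_posterior(x, pi1, mu1, mu2, sigma):
    # Class-1 posterior computed directly from the two shared-variance Gaussians.
    def joint(pi, mu):
        p = pi
        for xj, m, s in zip(x, mu, sigma):
            p *= math.exp(-(xj - m) ** 2 / (2 * s ** 2)) / math.sqrt(2 * math.pi * s ** 2)
        return p
    j1, j2 = joint(pi1, mu1), joint(1 - pi1, mu2)
    return j1 / (j1 + j2)

def logistic_posterior(x, pi1, mu1, mu2, sigma):
    # Same posterior via the induced logistic form 1/(1 + e^{-(theta^T x + theta_0)}).
    theta = [(m1 - m2) / s ** 2 for m1, m2, s in zip(mu1, mu2, sigma)]
    theta0 = math.log(pi1 / (1 - pi1)) + sum(
        (m2 ** 2 - m1 ** 2) / (2 * s ** 2)
        for m1, m2, s in zip(mu1, mu2, sigma))
    return sigmoid(sum(t * xj for t, xj in zip(theta, x)) + theta0)
```

Both routes give the same number for any input, which is exactly the claim of the derivation.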
The decision boundary
• The predictive distribution:

  p(y_n = 1 \mid x_n) = \frac{1}{1 + e^{-\theta^T x_n}} = \frac{1}{1 + \exp\left\{-\sum_{j=1}^{M}\theta_j x_n^j - \theta_0\right\}}
• The Bayes decision rule:

  \ln\frac{p(y_n = 1 \mid x_n)}{p(y_n = 2 \mid x_n)} = \ln\frac{\frac{1}{1+e^{-\theta^T x_n}}}{\frac{e^{-\theta^T x_n}}{1+e^{-\theta^T x_n}}} = \theta^T x_n
• For multiple classes (i.e., K > 2), * corresponds to a softmax function:

  p(y_n^k = 1 \mid x_n) = \frac{e^{-\theta_k^T x_n}}{\sum_j e^{-\theta_j^T x_n}}
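In code, the softmax posterior looks as follows. Note that the slide writes the scores as -θ_kᵀx_n; that minus sign is just a convention absorbed into θ, and the sketch below uses +θ_kᵀx_n and subtracts the max score for numerical stability. Names and parameters are hypothetical:

```python
import math

def softmax_posterior(x, thetas):
    # p(y^k = 1 | x) = exp(theta_k^T x) / sum_j exp(theta_j^T x)
    scores = [sum(t * xi for t, xi in zip(theta_k, x)) for theta_k in thetas]
    m = max(scores)                       # subtract max to avoid overflow
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# With K = 2, softmax reduces to the sigmoid of the score difference:
probs = softmax_posterior([2.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])  # probs[0] ~ 0.881
```

The K = 2 check mirrors the logistic form above: the first class's posterior equals 1/(1 + e^{-(s_1 - s_2)}).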
Discussion: Generative and discriminative classifiers
• Generative:
  • Modeling the joint distribution of all data
• Discriminative:
  • Modeling only the points at the boundary
  • How? Regression!
Linear regression
• The data:

  \{(x_1, y_1), (x_2, y_2), (x_3, y_3), \ldots, (x_N, y_N)\}

• Both nodes are observed:
  • X is an input vector
  • Y is a response vector (we first consider y as a generic continuous response vector, then we consider the special case of classification where y is a discrete indicator)
• A regression scheme can be used to model p(y|x) directly, rather than p(x, y)
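As a concrete instance of modeling p(y|x) for a continuous response, ordinary least squares for a scalar input has a closed-form solution. A minimal sketch (the function name and data are illustrative, not from the lecture):

```python
def fit_ols(xs, ys):
    # Minimizes sum_n (y_n - (a*x_n + b))^2 over slope a and intercept b.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return a, b

# Data generated by y = 2x + 1 is recovered exactly:
a, b = fit_ols([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```

Under a Gaussian noise model, this least-squares fit is exactly the maximum-likelihood estimate of p(y|x), which is the sense in which regression models the conditional directly.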
Logistic regression (sigmoid classifier)
• The conditional distribution is a Bernoulli:

  p(y \mid x) = \mu(x)^y (1 - \mu(x))^{1-y}

  where μ is a logistic function:

  \mu(x) = \frac{1}{1 + e^{-\theta^T x}}

• We can use the brute-force gradient method as in linear regression
• But we can also apply generic results by observing that p(y|x) is an exponential family function, more specifically, a generalized linear model (see future lectures …)
Training Logistic Regression: MCLE
• Estimate parameters θ = ⟨θ_0, θ_1, …, θ_m⟩ to maximize the conditional likelihood of the training data
• Training data: \{(x_1, y_1), \ldots, (x_N, y_N)\}
• Data likelihood = \prod_n p(x_n, y_n \mid \theta)
• Data conditional likelihood = \prod_n p(y_n \mid x_n, \theta)
Expressing Conditional Log Likelihood
• Recall the logistic function:

  \mu(x) = \frac{1}{1 + e^{-\theta^T x}}

  and the conditional log likelihood:

  l(\theta) = \sum_n \left[ y_n \log \mu(x_n) + (1 - y_n)\log(1 - \mu(x_n)) \right]
Maximizing Conditional Log Likelihood
• The objective:

  l(\theta) = \sum_n \left[ y_n \log \mu(x_n) + (1 - y_n)\log(1 - \mu(x_n)) \right]

• Good news: l(θ) is a concave function of θ
• Bad news: no closed-form solution to maximize l(θ)
Gradient Ascent
• Property of the sigmoid function:

  \mu(z) = \frac{1}{1 + e^{-z}}, \qquad \frac{d\mu}{dz} = \mu(z)\,(1 - \mu(z))

• The gradient:

  \frac{\partial l(\theta)}{\partial \theta_j} = \sum_n \left( y_n - \mu(x_n) \right) x_n^j

• The gradient ascent algorithm: iterate until change < ε. For all j, repeat

  \theta_j \leftarrow \theta_j + \eta \sum_n \left( y_n - \mu(x_n) \right) x_n^j
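The update rule above translates directly into code. This is a minimal batch gradient-ascent sketch, not the lecture's implementation; the names are hypothetical, a fixed iteration count stands in for the change < ε test, and each input is assumed to include a constant 1 for the intercept:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, lr=0.1, iters=500):
    # Batch gradient ascent on l(theta) = sum_n [y_n log mu_n + (1 - y_n) log(1 - mu_n)].
    theta = [0.0] * len(X[0])
    for _ in range(iters):
        # Gradient: sum_n (y_n - mu(x_n)) x_n; the sigmoid derivative mu(1 - mu) cancels.
        grad = [0.0] * len(theta)
        for xn, yn in zip(X, y):
            err = yn - sigmoid(sum(t * xj for t, xj in zip(theta, xn)))
            for j, xj in enumerate(xn):
                grad[j] += err * xj
        theta = [t + lr * g for t, g in zip(theta, grad)]
    return theta
```

On a small 1-D problem with an intercept column, the learned θ separates the two classes in a few hundred iterations.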
How about MAP?
• It is very common to use regularized maximum likelihood
• One common approach is to define a prior on θ:
  • Normal distribution, zero mean, identity covariance
  • Helps avoid very large weights and overfitting
• MAP estimate:

  \theta^{MAP} = \arg\max_\theta \, \ln\left[ p(\theta) \prod_n p(y_n \mid x_n, \theta) \right]
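With a zero-mean Gaussian prior N(0, σ²I) on θ, the log prior contributes -λ‖θ‖²/2 (with λ = 1/σ²) to the objective, so the MAP gradient simply subtracts λθ from the MLE gradient. A hedged sketch (hypothetical names; in practice the intercept is often left unpenalized, which this minimal version ignores):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_map(X, y, lam=0.1, lr=0.1, iters=500):
    # Gradient ascent on sum_n log p(y_n | x_n, theta) - (lam / 2) * ||theta||^2.
    theta = [0.0] * len(X[0])
    for _ in range(iters):
        grad = [-lam * t for t in theta]   # gradient of the Gaussian log prior
        for xn, yn in zip(X, y):
            err = yn - sigmoid(sum(t * xj for t, xj in zip(theta, xn)))
            for j, xj in enumerate(xn):
                grad[j] += err * xj
        theta = [t + lr * g for t, g in zip(theta, grad)]
    return theta
```

The regularizer visibly shrinks the weights: on separable data, λ = 0 lets the weights grow without bound, while λ > 0 keeps them finite.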
MLE vs MAP
• Maximum conditional likelihood estimate:

  \theta^{MLE} = \arg\max_\theta \sum_n \ln p(y_n \mid x_n, \theta)

• Maximum a posteriori estimate:

  \theta^{MAP} = \arg\max_\theta \left[ \ln p(\theta) + \sum_n \ln p(y_n \mid x_n, \theta) \right]
Logistic regression: practical issues
• IRLS takes O(Nd³) per iteration, where N = number of training cases and d = dimension of input x
• Quasi-Newton methods, which approximate the Hessian, work faster
• Conjugate gradient takes O(Nd) per iteration, and usually works best in practice
• Stochastic gradient descent can also be used if N is large (cf. the perceptron rule):

  \theta \leftarrow \theta + \eta \left( y_n - \mu(x_n) \right) x_n
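The stochastic update processes one example at a time, which is essentially the perceptron rule with the hard threshold replaced by the sigmoid. A minimal sketch (the function name is hypothetical):

```python
import math

def sgd_step(theta, xn, yn, lr=0.1):
    # Single-example update: theta <- theta + lr * (y_n - mu(x_n)) * x_n
    mu = 1.0 / (1.0 + math.exp(-sum(t * xj for t, xj in zip(theta, xn))))
    return [t + lr * (yn - mu) * xj for t, xj in zip(theta, xn)]
```

From θ = 0 the prediction is μ = 0.5 for any input, so the first step moves each weight by lr · (y - 0.5) · x_j.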
Naïve Bayes vs Logistic Regression
• Consider Y boolean, X continuous, X = ⟨X_1 … X_m⟩
• Number of parameters to estimate:
  • NB: 4m + 1 (a mean and variance per class for each of the m features, plus the class prior)
  • LR: m + 1 (one weight per feature, plus the intercept)
• Estimation method:
  • NB parameter estimates are uncoupled
  • LR parameter estimates are coupled
Naïve Bayes vs Logistic Regression
• Asymptotic comparison (# training examples → ∞)
  • When model assumptions are correct:
    • NB and LR produce identical classifiers
  • When model assumptions are incorrect:
    • LR is less biased: it does not assume conditional independence
    • Therefore LR is expected to outperform NB
Naïve Bayes vs Logistic Regression
• Non-asymptotic analysis (see [Ng & Jordan, 2002])
• Convergence rate of parameter estimates: how many training examples are needed to assure good estimates?
  • NB: order log n (where n = # of attributes in X)
  • LR: order n
• NB converges more quickly to its (perhaps less helpful) asymptotic estimates
Rate of convergence: logistic regression
• Let h_{Dis,m} be logistic regression trained on m examples in n dimensions. Then with high probability:

  \varepsilon(h_{Dis,m}) \le \varepsilon(h_{Dis,\infty}) + O\left(\sqrt{\tfrac{n}{m}\log\tfrac{m}{n}}\right)

• Implication: if we want

  \varepsilon(h_{Dis,m}) \le \varepsilon(h_{Dis,\infty}) + \varepsilon_0

  for some small constant ε_0, it suffices to pick order n examples
  → LR converges to its asymptotic classifier in order n examples
• The result follows from Vapnik's structural risk bound, plus the fact that the VC dimension of n-dimensional linear separators is n
Rate of convergence: naïve Bayes parameters
• Let any ε_1, δ > 0, and any n ≥ 0 be fixed. Assume that for some fixed ρ > 0, we have that
• Let
• Then with probability at least 1 − δ, after m examples:
  1. For discrete inputs, for all i and b
  2. For continuous inputs, for all i and b
Some experiments from UCI data sets
Summary
• Logistic regression
  • Functional form follows from naïve Bayes assumptions
    • For Gaussian naïve Bayes, assuming variance σ_{j,k} = σ_j
    • For discrete-valued naïve Bayes too
  • But the training procedure picks parameters without the conditional independence assumption
  • MLE training: pick θ to maximize P(Y | X; θ)
  • MAP training: pick θ to maximize P(θ | X, Y)
    • 'Regularization'
    • Helps reduce overfitting
• Gradient ascent/descent
  • General approach when closed-form solutions are unavailable
• Generative vs. discriminative classifiers
  • Bias vs. variance tradeoff