
Machine Learning 10-701/15-781, Spring 2008
Logistic Regression -- generative versus discriminative classifiers
Eric Xing
Lecture 5, January 30, 2006
Reading: Chap. 3.1.3-4 CB

Generative vs. Discriminative Classifiers

- Goal: we wish to learn f: X -> Y, e.g., P(Y|X)
- Generative classifiers (e.g., Naïve Bayes):
  - Assume some functional form for P(X|Y) and P(Y). This is a 'generative' model of the data! (graphical model: Y_i -> X_i)
  - Estimate the parameters of P(X|Y) and P(Y) directly from training data
  - Use Bayes rule to calculate P(Y | X = x_i)
- Discriminative classifiers:
  - Directly assume some functional form for P(Y|X). This is a 'discriminative' model of the data! (graphical model: X_i -> Y_i)
  - Estimate the parameters of P(Y|X) directly from training data

Consider a Gaussian Generative Classifier

- Learning f: X -> Y, where
  - X is a vector of real-valued features <X_1, ..., X_m>
  - Y is Boolean
- What does that imply about the form of P(Y|X)?
- The joint probability of a datum and its label is:

  p(x_n, y_n^k = 1 \mid \mu, \sigma) = p(y_n^k = 1)\, p(x_n \mid y_n^k = 1, \mu, \sigma)
                                     = \pi_k \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{ -\frac{1}{2\sigma^2}(x_n - \mu_k)^2 \right\}

- Given a datum x_n, we predict its label using the conditional probability of the label given the datum:

  p(y_n^k = 1 \mid x_n, \mu, \sigma) = \frac{ \pi_k \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\{ -\frac{1}{2\sigma^2}(x_n - \mu_k)^2 \} }{ \sum_{k'} \pi_{k'} \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\{ -\frac{1}{2\sigma^2}(x_n - \mu_{k'})^2 \} }

Naïve Bayes Classifier

- When X is a multivariate Gaussian vector, the joint probability of a datum and its label is:

  p(\vec{x}_n, y_n^k = 1 \mid \vec{\mu}, \Sigma) = p(y_n^k = 1)\, p(\vec{x}_n \mid y_n^k = 1, \vec{\mu}, \Sigma)
                                                = \pi_k \frac{1}{(2\pi)^{m/2} |\Sigma|^{1/2}} \exp\left\{ -\tfrac{1}{2} (\vec{x}_n - \vec{\mu}_k)^T \Sigma^{-1} (\vec{x}_n - \vec{\mu}_k) \right\}

- The naïve Bayes simplification (graphical model: Y_i -> X_i^1, X_i^2, ..., X_i^m):

  p(x_n, y_n^k = 1 \mid \mu, \sigma) = p(y_n^k = 1) \prod_j p(x_n^j \mid y_n^k = 1, \mu_{k,j}, \sigma_{k,j})
                                     = \pi_k \prod_j \frac{1}{(2\pi\sigma_{k,j}^2)^{1/2}} \exp\left\{ -\frac{1}{2\sigma_{k,j}^2}(x_n^j - \mu_{k,j})^2 \right\}

- More generally:

  p(x_n, y_n \mid \eta, \pi) = p(y_n \mid \pi) \prod_{j=1}^{m} p(x_n^j \mid y_n, \eta)

  where p(. | .) is an arbitrary conditional (discrete or continuous) 1-D density.

The predictive distribution

- Understanding the predictive distribution:

  p(y_n^k = 1 \mid x_n, \vec{\mu}, \Sigma, \pi) = \frac{ p(y_n^k = 1, x_n \mid \vec{\mu}, \Sigma, \pi) }{ p(x_n \mid \vec{\mu}, \Sigma) } = \frac{ \pi_k N(x_n \mid \mu_k, \Sigma_k) }{ \sum_{k'} \pi_{k'} N(x_n \mid \mu_{k'}, \Sigma_{k'}) }    (*)

- Under the naïve Bayes assumption:

  p(y_n^k = 1 \mid x_n, \vec{\mu}, \Sigma, \pi) = \frac{ \pi_k \exp\{ -\sum_j [ \frac{1}{2\sigma_{k,j}^2}(x_n^j - \mu_k^j)^2 + \log\sigma_{k,j} ] + C \} }{ \sum_{k'} \pi_{k'} \exp\{ -\sum_j [ \frac{1}{2\sigma_{k',j}^2}(x_n^j - \mu_{k'}^j)^2 + \log\sigma_{k',j} ] + C \} }    (**)

  where C collects the terms that do not depend on k (they cancel between numerator and denominator).

- For two classes (i.e., K = 2), and when the two classes have the same variance, (**) turns out to be the logistic function (a numerical check of this reduction appears at the end of this part):

  p(y_n^1 = 1 \mid x_n) = \frac{1}{ 1 + \frac{1-\pi_1}{\pi_1} \exp\left\{ -\sum_j \frac{1}{\sigma_j^2} x_n^j (\mu_1^j - \mu_2^j) + \sum_j \frac{1}{2\sigma_j^2} \left([\mu_1^j]^2 - [\mu_2^j]^2\right) \right\} } = \frac{1}{1 + e^{-\theta^T x_n}}

  where x_n is augmented with a constant 1, so that \theta_0 absorbs the terms that do not depend on x_n.

The decision boundary

- The predictive distribution:

  p(y_n^1 = 1 \mid x_n) = \frac{1}{ 1 + \exp\{ -\sum_{j=1}^{M} \theta_j x_n^j - \theta_0 \} } = \frac{1}{1 + e^{-\theta^T x_n}}

- The Bayes decision rule:

  \ln \frac{ p(y_n^1 = 1 \mid x_n) }{ p(y_n^2 = 1 \mid x_n) } = \ln \frac{ \frac{1}{1 + e^{-\theta^T x_n}} }{ \frac{e^{-\theta^T x_n}}{1 + e^{-\theta^T x_n}} } = \theta^T x_n

- For multiple classes (i.e., K > 2), (*) corresponds to a softmax function:

  p(y_n^k = 1 \mid x_n) = \frac{ e^{\theta_k^T x_n} }{ \sum_j e^{\theta_j^T x_n} }

Discussion: Generative and discriminative classifiers

- Generative: modeling the joint distribution of all the data
- Discriminative: modeling only the points at the boundary. How? Regression!
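To make the two-class reduction above concrete, here is a minimal numerical sketch in Python. It assumes a Gaussian naïve Bayes model with a per-feature variance shared by both classes, and checks that the Bayes-rule posterior equals a logistic function of a linear score \theta^T x_n + \theta_0. All parameter values, names, and the random test point are illustrative, not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical Gaussian naive Bayes parameters for two classes,
# with the variance sigma_j shared by both classes in every dimension.
pi1   = 0.3                          # prior P(y = 1)
mu1   = np.array([1.0, -0.5, 2.0])   # class-1 means
mu2   = np.array([-1.0, 0.5, 1.0])   # class-2 means
sigma = np.array([0.8, 1.2, 0.6])    # shared per-feature standard deviations

def naive_bayes_likelihood(x, mu, sigma):
    """Product of independent 1-D Gaussian densities p(x | y)."""
    return np.prod(np.exp(-(x - mu) ** 2 / (2 * sigma ** 2))
                   / np.sqrt(2 * np.pi * sigma ** 2))

def posterior_bayes(x):
    """P(y = 1 | x) computed via Bayes rule on the generative model."""
    num = pi1 * naive_bayes_likelihood(x, mu1, sigma)
    den = num + (1 - pi1) * naive_bayes_likelihood(x, mu2, sigma)
    return num / den

# Logistic parameters implied by the generative model (see the derivation above).
theta  = (mu1 - mu2) / sigma ** 2
theta0 = (-np.sum((mu1 ** 2 - mu2 ** 2) / (2 * sigma ** 2))
          + np.log(pi1 / (1 - pi1)))

def posterior_logistic(x):
    """P(y = 1 | x) as a logistic function of the linear score theta^T x + theta0."""
    return 1.0 / (1.0 + np.exp(-(theta @ x + theta0)))

x = rng.normal(size=3)
print(posterior_bayes(x), posterior_logistic(x))   # the two should agree
assert np.isclose(posterior_bayes(x), posterior_logistic(x))
```

The point of the check is the one made on the slides: under the shared-variance assumption the generative classifier's decision rule is already linear in x, which is exactly the functional form that logistic regression assumes directly.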
Linear regression

- The data: {(x_1, y_1), (x_2, y_2), (x_3, y_3), ..., (x_N, y_N)}   (graphical model: X_i -> Y_i, plate over N)
- Both nodes are observed:
  - X is an input vector
  - Y is a response vector (we first consider y as a generic continuous response vector, then we consider the special case of classification, where y is a discrete indicator)
- A regression scheme can be used to model p(y|x) directly, rather than p(x, y)

Logistic regression (sigmoid classifier)

- The conditional distribution is a Bernoulli:

  p(y \mid x) = \mu(x)^y (1 - \mu(x))^{1-y}

  where \mu is a logistic function:

  \mu(x) = \frac{1}{1 + e^{-\theta^T x}}

- We can use the brute-force gradient method, as in linear regression
- But we can also apply generic results by observing that p(y|x) is an exponential-family distribution, more specifically a generalized linear model (see future lectures ...)

Training Logistic Regression: MCLE

- Estimate the parameters \theta = <\theta_0, \theta_1, ..., \theta_m> to maximize the conditional likelihood of the training data
- Training data: D = {(x_1, y_1), ..., (x_N, y_N)}
- Data likelihood = \prod_n p(x_n, y_n \mid \theta)
- Data conditional likelihood = \prod_n p(y_n \mid x_n, \theta)

Expressing Conditional Log Likelihood

- Recall the logistic function \mu(x) = 1/(1 + e^{-\theta^T x}) and the conditional likelihood above; the conditional log likelihood is

  l(\theta) = \sum_n \left[ y_n \log \mu(x_n) + (1 - y_n) \log(1 - \mu(x_n)) \right]

Maximizing Conditional Log Likelihood

- The objective (the same l(\theta), rewritten):

  l(\theta) = \sum_n \left[ y_n \theta^T x_n - \log(1 + e^{\theta^T x_n}) \right]

- Good news: l(\theta) is a concave function of \theta
- Bad news: there is no closed-form solution that maximizes l(\theta)

Gradient Ascent

- Property of the sigmoid function \mu(z) = 1/(1 + e^{-z}): d\mu/dz = \mu(1 - \mu)
- The gradient:

  \frac{\partial l(\theta)}{\partial \theta_j} = \sum_n x_n^j \left( y_n - \mu(x_n) \right)

- The gradient ascent algorithm: iterate until the change is < \epsilon; for all j, repeat

  \theta_j \leftarrow \theta_j + \eta \sum_n x_n^j \left( y_n - \mu(x_n) \right)

  (a code sketch of this update appears at the end of this part)

How about MAP?

- It is very common to use regularized maximum likelihood.
- One common approach is to define a prior on \theta: a normal distribution with zero mean and identity covariance. This helps avoid very large weights and overfitting.
- MAP estimate:

  \theta_{MAP} = \arg\max_\theta \left[ \log p(\theta) + \sum_n \log p(y_n \mid x_n, \theta) \right]

MLE vs MAP

- Maximum conditional likelihood estimate:

  \theta_{MCLE} = \arg\max_\theta \sum_n \log p(y_n \mid x_n, \theta)

- Maximum a posteriori estimate:

  \theta_{MAP} = \arg\max_\theta \left[ \log p(\theta) + \sum_n \log p(y_n \mid x_n, \theta) \right]

Logistic regression: practical issues

- IRLS takes O(Nd^3) per iteration, where N = number of training cases and d = dimension of the input x.
- Quasi-Newton methods, which approximate the Hessian, work faster.
- Conjugate gradient takes O(Nd) per iteration and usually works best in practice.
- Stochastic gradient descent can also be used if N is large (cf. the perceptron rule; see the stochastic-gradient sketch at the end of this part).

Naïve Bayes vs Logistic Regression

- Consider Y Boolean and X continuous, X = <X_1, ..., X_m>
- Number of parameters to estimate:
  - NB: 4m + 1 for a Gaussian naïve Bayes with class-specific means and variances (one class prior, plus a mean and a variance per feature per class)
  - LR: m + 1 (one weight per feature, plus an intercept)
- Estimation method:
  - NB parameter estimates are uncoupled
  - LR parameter estimates are coupled

Naïve Bayes vs Logistic Regression

- Asymptotic comparison (# training examples -> infinity):
  - When the model assumptions are correct, NB and LR produce identical classifiers
  - When the model assumptions are incorrect, LR is less biased (it does not assume conditional independence) and is therefore expected to outperform NB

Naïve Bayes vs Logistic Regression

- Non-asymptotic analysis (see [Ng & Jordan, 2002]):
  - Convergence rate of parameter estimates: how many training examples are needed to assure good estimates?
    - NB: order log n (where n = # of attributes in X)
    - LR: order n
  - NB converges more quickly to its (perhaps less helpful) asymptotic estimates

Rate of convergence: logistic regression

- Let h_{Dis,m} be logistic regression trained on m examples in n dimensions. Then, with high probability, its error exceeds that of the asymptotic (infinite-data) logistic regression classifier h_{Dis,\infty} by a term that shrinks on the order of \sqrt{n/m} (up to logarithmic factors).
- Implication: if we want this excess error to be at most some small constant \epsilon_0, it suffices to pick order n examples -> logistic regression converges to its asymptotic classifier in order n examples.
- The result follows from Vapnik's structural risk bound, plus the fact that the "VC dimension" of n-dimensional linear separators is n.
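The MCLE training procedure and its MAP variant described in this part can be sketched in a few lines. This is a minimal illustration, not a reference implementation from the lecture; the function name, step size, and stopping rule are arbitrary choices. Setting lam > 0 adds the L2 penalty corresponding to a zero-mean Gaussian prior on \theta (the MAP estimate); lam = 0 gives plain MCLE by batch gradient ascent.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, lam=0.0, n_iters=10000, tol=1e-6):
    """Batch gradient ascent on the conditional log likelihood l(theta).

    X   : (N, d) array of inputs (an intercept column is appended internally)
    y   : (N,) array of labels in {0, 1}
    lam : L2 penalty strength; lam > 0 corresponds to the MAP estimate under
          a zero-mean Gaussian prior on theta, lam = 0 to plain MCLE.
    """
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])   # append the intercept column
    theta = np.zeros(Xb.shape[1])
    for _ in range(n_iters):
        mu = sigmoid(Xb @ theta)                    # mu(x_n) = P(y=1 | x_n, theta)
        grad = Xb.T @ (y - mu) - lam * theta        # gradient of the (penalized) l
        step = lr * grad / len(y)
        theta += step
        if np.max(np.abs(step)) < tol:              # iterate until change < epsilon
            break
    return theta

# Toy usage on synthetic data (illustrative only):
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X @ np.array([2.0, -1.0]) + 0.3 * rng.normal(size=200) > 0).astype(float)
print(fit_logistic(X, y, lam=1.0))
```

The update inside the loop is exactly the gradient from the slides, \theta_j <- \theta_j + \eta \sum_n x_n^j (y_n - \mu(x_n)), with the extra -\lambda \theta term coming from the Gaussian prior in the MAP objective.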
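For large N, the practical-issues slide above suggests stochastic gradient ascent: update \theta one example at a time, much like the perceptron rule but with the soft sigmoid output in place of a hard threshold. A minimal sketch, with illustrative hyperparameters:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_sgd(X, y, lr=0.05, n_epochs=20, seed=0):
    """Stochastic gradient ascent: one (x_n, y_n) pair per update."""
    rng = np.random.default_rng(seed)
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])   # intercept column
    theta = np.zeros(Xb.shape[1])
    for _ in range(n_epochs):
        for i in rng.permutation(len(y)):           # visit examples in random order
            mu_i = sigmoid(Xb[i] @ theta)
            theta += lr * (y[i] - mu_i) * Xb[i]     # per-example gradient step
    return theta
```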
Rate of convergence: naïve Bayes parameters

- Let any \epsilon_1, \delta > 0 and any n >= 0 be fixed. Assume that for some fixed \rho > 0 the class prior is bounded away from 0 and 1, i.e., \rho <= p(y = 1) <= 1 - \rho.
- Let m be of order \log(n/\delta) (the precise expression also depends on \epsilon_1 and \rho).
- Then, with probability at least 1 - \delta, after m examples:
  1. For discrete inputs: for all features i and values b, the estimate of P(X_i = b | Y) is within \epsilon_1 of its true value.
  2. For continuous inputs: for all features i and classes b, the estimated class-conditional means and variances are within \epsilon_1 of their true values.

Some experiments from UCI data sets

- (Figures not included in this text version; a small synthetic-data sketch in the same spirit appears after the summary.)

Summary

- Logistic regression:
  - Its functional form follows from the naïve Bayes assumptions
    - for Gaussian naïve Bayes assuming variance \sigma_{j,k} = \sigma_j
    - for discrete-valued naïve Bayes too
  - But the training procedure picks the parameters without the conditional-independence assumption
  - MLE training: pick \theta to maximize P(Y | X; \theta)
  - MAP training: pick \theta to maximize P(\theta | X, Y)
    - 'regularization'; helps reduce overfitting
  - Gradient ascent/descent: a general approach when closed-form solutions are unavailable
- Generative vs. discriminative classifiers:
  - bias vs. variance tradeoff
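As a closing illustration of the Ng & Jordan comparison (this is not the UCI experiment from the slides, just a toy sketch under assumptions of my own), the data below are generated from a Gaussian naïve Bayes model, so both classifiers approach the same asymptotic error; the question is how quickly each gets there as m grows. The function fit_logistic refers to the gradient-ascent sketch above, and all constants (feature count, means, sample sizes) are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(m, n=20):
    """Synthetic data satisfying the naive Bayes assumptions:
    class-conditional unit-variance Gaussians with means +/- 0.5 per feature."""
    y = rng.integers(0, 2, size=m)
    means = np.where(y[:, None] == 1, 0.5, -0.5)
    return means + rng.normal(size=(m, n)), y

def nb_fit_predict(Xtr, ytr, Xte):
    """Plug-in Gaussian NB with a shared per-feature variance; its linear
    discriminant has the closed form derived earlier in these notes."""
    mu1, mu0 = Xtr[ytr == 1].mean(0), Xtr[ytr == 0].mean(0)
    var = (((Xtr[ytr == 1] - mu1) ** 2).sum(0)
           + ((Xtr[ytr == 0] - mu0) ** 2).sum(0)) / len(ytr) + 1e-6
    pi1 = ytr.mean()
    score = (Xte @ ((mu1 - mu0) / var)
             - ((mu1 ** 2 - mu0 ** 2) / (2 * var)).sum()
             + np.log(pi1 / (1 - pi1)))
    return (score > 0).astype(int)

Xte, yte = make_data(5000)                              # large held-out test set
for m in (20, 50, 200, 1000):
    Xtr, ytr = make_data(m)
    nb_err = np.mean(nb_fit_predict(Xtr, ytr, Xte) != yte)
    theta = fit_logistic(Xtr, ytr.astype(float))        # gradient-ascent sketch above
    lr_pred = (np.hstack([Xte, np.ones((len(Xte), 1))]) @ theta > 0).astype(int)
    lr_err = np.mean(lr_pred != yte)
    print(f"m={m:5d}   NB error={nb_err:.3f}   LR error={lr_err:.3f}")
```

Because the naïve Bayes assumptions hold here by construction, one would expect NB to be close to its asymptotic error already at small m, with logistic regression catching up as m grows, consistent with the order-log-n versus order-n discussion above.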