
Machine Learning 10-701/15-781, Spring 2008
Logistic Regression -- generative versus discriminative classifiers
Eric Xing
Lecture 5, January 30, 2006
Reading: Chap. 3.1.3-4 CB

Generative vs. Discriminative Classifiers

- Goal: we wish to learn f: X -> Y, e.g., P(Y|X)
- Generative classifiers (e.g., Naïve Bayes):
  - Assume some functional form for P(X|Y) and P(Y). This is a 'generative' model of the data! (graphical model: Y_i -> X_i)
  - Estimate the parameters of P(X|Y) and P(Y) directly from training data
  - Use Bayes rule to calculate P(Y | X = x_i)
- Discriminative classifiers:
  - Directly assume some functional form for P(Y|X). This is a 'discriminative' model of the data! (graphical model: X_i -> Y_i)
  - Estimate the parameters of P(Y|X) directly from training data

Consider a Gaussian Generative Classifier

- Learning f: X -> Y, where
  - X is a vector of real-valued features <X_1, ..., X_m>
  - Y is Boolean
- What does that imply about the form of P(Y|X)?
- The joint probability of a datum and its label is:

  p(x_n, y_n^k = 1 \mid \mu, \sigma) = p(y_n^k = 1)\, p(x_n \mid y_n^k = 1, \mu, \sigma)
                                     = \pi_k \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{ -\frac{1}{2\sigma^2}(x_n - \mu_k)^2 \right\}

- Given a datum x_n, we predict its label using the conditional probability of the label given the datum:

  p(y_n^k = 1 \mid x_n, \mu, \sigma) = \frac{ \pi_k \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\{ -\frac{1}{2\sigma^2}(x_n - \mu_k)^2 \} }{ \sum_{k'} \pi_{k'} \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\{ -\frac{1}{2\sigma^2}(x_n - \mu_{k'})^2 \} }

Naïve Bayes Classifier

- When X is a multivariate Gaussian vector, the joint probability of a datum and its label is:

  p(\vec{x}_n, y_n^k = 1 \mid \vec{\mu}, \Sigma) = p(y_n^k = 1)\, p(\vec{x}_n \mid y_n^k = 1, \vec{\mu}, \Sigma)
                                                = \pi_k \frac{1}{(2\pi)^{m/2} |\Sigma|^{1/2}} \exp\left\{ -\tfrac{1}{2} (\vec{x}_n - \vec{\mu}_k)^T \Sigma^{-1} (\vec{x}_n - \vec{\mu}_k) \right\}

- The naïve Bayes simplification (graphical model: Y_i -> X_i^1, X_i^2, ..., X_i^m):

  p(x_n, y_n^k = 1 \mid \mu, \sigma) = p(y_n^k = 1) \prod_j p(x_n^j \mid y_n^k = 1, \mu_{k,j}, \sigma_{k,j})
                                     = \pi_k \prod_j \frac{1}{(2\pi\sigma_{k,j}^2)^{1/2}} \exp\left\{ -\frac{1}{2\sigma_{k,j}^2}(x_n^j - \mu_{k,j})^2 \right\}

- More generally:

  p(x_n, y_n \mid \eta, \pi) = p(y_n \mid \pi) \prod_{j=1}^{m} p(x_n^j \mid y_n, \eta)

  where p(. | .) is an arbitrary conditional (discrete or continuous) 1-D density.

The predictive distribution

- Understanding the predictive distribution:

  p(y_n^k = 1 \mid x_n, \vec{\mu}, \Sigma, \pi) = \frac{ p(y_n^k = 1, x_n \mid \vec{\mu}, \Sigma, \pi) }{ p(x_n \mid \vec{\mu}, \Sigma) } = \frac{ \pi_k N(x_n \mid \mu_k, \Sigma_k) }{ \sum_{k'} \pi_{k'} N(x_n \mid \mu_{k'}, \Sigma_{k'}) }    (*)

- Under the naïve Bayes assumption:

  p(y_n^k = 1 \mid x_n, \vec{\mu}, \Sigma, \pi) = \frac{ \pi_k \exp\{ -\sum_j [ \frac{1}{2\sigma_{k,j}^2}(x_n^j - \mu_k^j)^2 + \log\sigma_{k,j} ] + C \} }{ \sum_{k'} \pi_{k'} \exp\{ -\sum_j [ \frac{1}{2\sigma_{k',j}^2}(x_n^j - \mu_{k'}^j)^2 + \log\sigma_{k',j} ] + C \} }    (**)

  where C collects the terms that do not depend on k (they cancel between numerator and denominator).

- For two classes (i.e., K = 2), and when the two classes have the same variance, (**) turns out to be the logistic function (a numerical check of this reduction appears at the end of this part):

  p(y_n^1 = 1 \mid x_n) = \frac{1}{ 1 + \frac{1-\pi_1}{\pi_1} \exp\left\{ -\sum_j \frac{1}{\sigma_j^2} x_n^j (\mu_1^j - \mu_2^j) + \sum_j \frac{1}{2\sigma_j^2} \left([\mu_1^j]^2 - [\mu_2^j]^2\right) \right\} } = \frac{1}{1 + e^{-\theta^T x_n}}

  where x_n is augmented with a constant 1, so that \theta_0 absorbs the terms that do not depend on x_n.

The decision boundary

- The predictive distribution:

  p(y_n^1 = 1 \mid x_n) = \frac{1}{ 1 + \exp\{ -\sum_{j=1}^{M} \theta_j x_n^j - \theta_0 \} } = \frac{1}{1 + e^{-\theta^T x_n}}

- The Bayes decision rule:

  \ln \frac{ p(y_n^1 = 1 \mid x_n) }{ p(y_n^2 = 1 \mid x_n) } = \ln \frac{ \frac{1}{1 + e^{-\theta^T x_n}} }{ \frac{e^{-\theta^T x_n}}{1 + e^{-\theta^T x_n}} } = \theta^T x_n

- For multiple classes (i.e., K > 2), (*) corresponds to a softmax function:

  p(y_n^k = 1 \mid x_n) = \frac{ e^{\theta_k^T x_n} }{ \sum_j e^{\theta_j^T x_n} }

Discussion: Generative and discriminative classifiers

- Generative: modeling the joint distribution of all the data
- Discriminative: modeling only the points at the boundary. How? Regression!
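To make the two-class reduction above concrete, here is a minimal numerical sketch in Python. It assumes a Gaussian naïve Bayes model with a per-feature variance shared by both classes, and checks that the Bayes-rule posterior equals a logistic function of a linear score \theta^T x_n + \theta_0. All parameter values, names, and the random test point are illustrative, not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical Gaussian naive Bayes parameters for two classes,
# with the variance sigma_j shared by both classes in every dimension.
pi1   = 0.3                          # prior P(y = 1)
mu1   = np.array([1.0, -0.5, 2.0])   # class-1 means
mu2   = np.array([-1.0, 0.5, 1.0])   # class-2 means
sigma = np.array([0.8, 1.2, 0.6])    # shared per-feature standard deviations

def naive_bayes_likelihood(x, mu, sigma):
    """Product of independent 1-D Gaussian densities p(x | y)."""
    return np.prod(np.exp(-(x - mu) ** 2 / (2 * sigma ** 2))
                   / np.sqrt(2 * np.pi * sigma ** 2))

def posterior_bayes(x):
    """P(y = 1 | x) computed via Bayes rule on the generative model."""
    num = pi1 * naive_bayes_likelihood(x, mu1, sigma)
    den = num + (1 - pi1) * naive_bayes_likelihood(x, mu2, sigma)
    return num / den

# Logistic parameters implied by the generative model (see the derivation above).
theta  = (mu1 - mu2) / sigma ** 2
theta0 = (-np.sum((mu1 ** 2 - mu2 ** 2) / (2 * sigma ** 2))
          + np.log(pi1 / (1 - pi1)))

def posterior_logistic(x):
    """P(y = 1 | x) as a logistic function of the linear score theta^T x + theta0."""
    return 1.0 / (1.0 + np.exp(-(theta @ x + theta0)))

x = rng.normal(size=3)
print(posterior_bayes(x), posterior_logistic(x))   # the two should agree
assert np.isclose(posterior_bayes(x), posterior_logistic(x))
```

The point of the check is the one made on the slides: under the shared-variance assumption the generative classifier's decision rule is already linear in x, which is exactly the functional form that logistic regression assumes directly.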
Linear regression

- The data: {(x_1, y_1), (x_2, y_2), (x_3, y_3), ..., (x_N, y_N)}   (graphical model: X_i -> Y_i, plate over N)
- Both nodes are observed:
  - X is an input vector
  - Y is a response vector (we first consider y as a generic continuous response vector, then we consider the special case of classification, where y is a discrete indicator)
- A regression scheme can be used to model p(y|x) directly, rather than p(x, y)

Logistic regression (sigmoid classifier)

- The conditional distribution is a Bernoulli:

  p(y \mid x) = \mu(x)^y (1 - \mu(x))^{1-y}

  where \mu is a logistic function:

  \mu(x) = \frac{1}{1 + e^{-\theta^T x}}

- We can use the brute-force gradient method, as in linear regression
- But we can also apply generic results by observing that p(y|x) is an exponential-family distribution, more specifically a generalized linear model (see future lectures ...)

Training Logistic Regression: MCLE

- Estimate the parameters \theta = <\theta_0, \theta_1, ..., \theta_m> to maximize the conditional likelihood of the training data
- Training data: D = {(x_1, y_1), ..., (x_N, y_N)}
- Data likelihood = \prod_n p(x_n, y_n \mid \theta)
- Data conditional likelihood = \prod_n p(y_n \mid x_n, \theta)

Expressing Conditional Log Likelihood

- Recall the logistic function \mu(x) = 1/(1 + e^{-\theta^T x}) and the conditional likelihood above; the conditional log likelihood is

  l(\theta) = \sum_n \left[ y_n \log \mu(x_n) + (1 - y_n) \log(1 - \mu(x_n)) \right]

Maximizing Conditional Log Likelihood

- The objective (the same l(\theta), rewritten):

  l(\theta) = \sum_n \left[ y_n \theta^T x_n - \log(1 + e^{\theta^T x_n}) \right]

- Good news: l(\theta) is a concave function of \theta
- Bad news: there is no closed-form solution that maximizes l(\theta)

Gradient Ascent

- Property of the sigmoid function \mu(z) = 1/(1 + e^{-z}): d\mu/dz = \mu(1 - \mu)
- The gradient:

  \frac{\partial l(\theta)}{\partial \theta_j} = \sum_n x_n^j \left( y_n - \mu(x_n) \right)

- The gradient ascent algorithm: iterate until the change is < \epsilon; for all j, repeat

  \theta_j \leftarrow \theta_j + \eta \sum_n x_n^j \left( y_n - \mu(x_n) \right)

  (a code sketch of this update appears at the end of this part)

How about MAP?

- It is very common to use regularized maximum likelihood.
- One common approach is to define a prior on \theta: a normal distribution with zero mean and identity covariance. This helps avoid very large weights and overfitting.
- MAP estimate:

  \theta_{MAP} = \arg\max_\theta \left[ \log p(\theta) + \sum_n \log p(y_n \mid x_n, \theta) \right]

MLE vs MAP

- Maximum conditional likelihood estimate:

  \theta_{MCLE} = \arg\max_\theta \sum_n \log p(y_n \mid x_n, \theta)

- Maximum a posteriori estimate:

  \theta_{MAP} = \arg\max_\theta \left[ \log p(\theta) + \sum_n \log p(y_n \mid x_n, \theta) \right]

Logistic regression: practical issues

- IRLS takes O(Nd^3) per iteration, where N = number of training cases and d = dimension of the input x.
- Quasi-Newton methods, which approximate the Hessian, work faster.
- Conjugate gradient takes O(Nd) per iteration and usually works best in practice.
- Stochastic gradient descent can also be used if N is large (cf. the perceptron rule; see the stochastic-gradient sketch at the end of this part).

Naïve Bayes vs Logistic Regression

- Consider Y Boolean and X continuous, X = <X_1, ..., X_m>
- Number of parameters to estimate:
  - NB: 4m + 1 for a Gaussian naïve Bayes with class-specific means and variances (one class prior, plus a mean and a variance per feature per class)
  - LR: m + 1 (one weight per feature, plus an intercept)
- Estimation method:
  - NB parameter estimates are uncoupled
  - LR parameter estimates are coupled

Naïve Bayes vs Logistic Regression

- Asymptotic comparison (# training examples -> infinity):
  - When the model assumptions are correct, NB and LR produce identical classifiers
  - When the model assumptions are incorrect, LR is less biased (it does not assume conditional independence) and is therefore expected to outperform NB

Naïve Bayes vs Logistic Regression

- Non-asymptotic analysis (see [Ng & Jordan, 2002]):
  - Convergence rate of parameter estimates: how many training examples are needed to assure good estimates?
    - NB: order log n (where n = # of attributes in X)
    - LR: order n
  - NB converges more quickly to its (perhaps less helpful) asymptotic estimates

Rate of convergence: logistic regression

- Let h_{Dis,m} be logistic regression trained on m examples in n dimensions. Then, with high probability, its error exceeds that of the asymptotic (infinite-data) logistic regression classifier h_{Dis,\infty} by a term that shrinks on the order of \sqrt{n/m} (up to logarithmic factors).
- Implication: if we want this excess error to be at most some small constant \epsilon_0, it suffices to pick order n examples -> logistic regression converges to its asymptotic classifier in order n examples.
- The result follows from Vapnik's structural risk bound, plus the fact that the "VC dimension" of n-dimensional linear separators is n.
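The MCLE training procedure and its MAP variant described in this part can be sketched in a few lines. This is a minimal illustration, not a reference implementation from the lecture; the function name, step size, and stopping rule are arbitrary choices. Setting lam > 0 adds the L2 penalty corresponding to a zero-mean Gaussian prior on \theta (the MAP estimate); lam = 0 gives plain MCLE by batch gradient ascent.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, lam=0.0, n_iters=10000, tol=1e-6):
    """Batch gradient ascent on the conditional log likelihood l(theta).

    X   : (N, d) array of inputs (an intercept column is appended internally)
    y   : (N,) array of labels in {0, 1}
    lam : L2 penalty strength; lam > 0 corresponds to the MAP estimate under
          a zero-mean Gaussian prior on theta, lam = 0 to plain MCLE.
    """
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])   # append the intercept column
    theta = np.zeros(Xb.shape[1])
    for _ in range(n_iters):
        mu = sigmoid(Xb @ theta)                    # mu(x_n) = P(y=1 | x_n, theta)
        grad = Xb.T @ (y - mu) - lam * theta        # gradient of the (penalized) l
        step = lr * grad / len(y)
        theta += step
        if np.max(np.abs(step)) < tol:              # iterate until change < epsilon
            break
    return theta

# Toy usage on synthetic data (illustrative only):
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X @ np.array([2.0, -1.0]) + 0.3 * rng.normal(size=200) > 0).astype(float)
print(fit_logistic(X, y, lam=1.0))
```

The update inside the loop is exactly the gradient from the slides, \theta_j <- \theta_j + \eta \sum_n x_n^j (y_n - \mu(x_n)), with the extra -\lambda \theta term coming from the Gaussian prior in the MAP objective.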
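For large N, the practical-issues slide above suggests stochastic gradient ascent: update \theta one example at a time, much like the perceptron rule but with the soft sigmoid output in place of a hard threshold. A minimal sketch, with illustrative hyperparameters:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_sgd(X, y, lr=0.05, n_epochs=20, seed=0):
    """Stochastic gradient ascent: one (x_n, y_n) pair per update."""
    rng = np.random.default_rng(seed)
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])   # intercept column
    theta = np.zeros(Xb.shape[1])
    for _ in range(n_epochs):
        for i in rng.permutation(len(y)):           # visit examples in random order
            mu_i = sigmoid(Xb[i] @ theta)
            theta += lr * (y[i] - mu_i) * Xb[i]     # per-example gradient step
    return theta
```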
Rate of convergence: naïve Bayes parameters

- Let any \epsilon_1, \delta > 0 and any n >= 0 be fixed. Assume that for some fixed \rho > 0 the class prior is bounded away from 0 and 1, i.e., \rho <= p(y = 1) <= 1 - \rho.
- Let m be of order \log(n/\delta) (the precise expression also depends on \epsilon_1 and \rho).
- Then, with probability at least 1 - \delta, after m examples:
  1. For discrete inputs: for all features i and values b, the estimate of P(X_i = b | Y) is within \epsilon_1 of its true value.
  2. For continuous inputs: for all features i and classes b, the estimated class-conditional means and variances are within \epsilon_1 of their true values.

Some experiments from UCI data sets

- (Figures not included in this text version; a small synthetic-data sketch in the same spirit appears after the summary.)

Summary

- Logistic regression:
  - Its functional form follows from the naïve Bayes assumptions
    - for Gaussian naïve Bayes assuming variance \sigma_{j,k} = \sigma_j
    - for discrete-valued naïve Bayes too
  - But the training procedure picks the parameters without the conditional-independence assumption
  - MLE training: pick \theta to maximize P(Y | X; \theta)
  - MAP training: pick \theta to maximize P(\theta | X, Y)
    - 'regularization'; helps reduce overfitting
  - Gradient ascent/descent: a general approach when closed-form solutions are unavailable
- Generative vs. discriminative classifiers:
  - bias vs. variance tradeoff
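As a closing illustration of the Ng & Jordan comparison (this is not the UCI experiment from the slides, just a toy sketch under assumptions of my own), the data below are generated from a Gaussian naïve Bayes model, so both classifiers approach the same asymptotic error; the question is how quickly each gets there as m grows. The function fit_logistic refers to the gradient-ascent sketch above, and all constants (feature count, means, sample sizes) are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(m, n=20):
    """Synthetic data satisfying the naive Bayes assumptions:
    class-conditional unit-variance Gaussians with means +/- 0.5 per feature."""
    y = rng.integers(0, 2, size=m)
    means = np.where(y[:, None] == 1, 0.5, -0.5)
    return means + rng.normal(size=(m, n)), y

def nb_fit_predict(Xtr, ytr, Xte):
    """Plug-in Gaussian NB with a shared per-feature variance; its linear
    discriminant has the closed form derived earlier in these notes."""
    mu1, mu0 = Xtr[ytr == 1].mean(0), Xtr[ytr == 0].mean(0)
    var = (((Xtr[ytr == 1] - mu1) ** 2).sum(0)
           + ((Xtr[ytr == 0] - mu0) ** 2).sum(0)) / len(ytr) + 1e-6
    pi1 = ytr.mean()
    score = (Xte @ ((mu1 - mu0) / var)
             - ((mu1 ** 2 - mu0 ** 2) / (2 * var)).sum()
             + np.log(pi1 / (1 - pi1)))
    return (score > 0).astype(int)

Xte, yte = make_data(5000)                              # large held-out test set
for m in (20, 50, 200, 1000):
    Xtr, ytr = make_data(m)
    nb_err = np.mean(nb_fit_predict(Xtr, ytr, Xte) != yte)
    theta = fit_logistic(Xtr, ytr.astype(float))        # gradient-ascent sketch above
    lr_pred = (np.hstack([Xte, np.ones((len(Xte), 1))]) @ theta > 0).astype(int)
    lr_err = np.mean(lr_pred != yte)
    print(f"m={m:5d}   NB error={nb_err:.3f}   LR error={lr_err:.3f}")
```

Because the naïve Bayes assumptions hold here by construction, one would expect NB to be close to its asymptotic error already at small m, with logistic regression catching up as m grows, consistent with the order-log-n versus order-n discussion above.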