
Naïve Bayes & Logistic Regression
Lecture 18
David Sontag, New York University
Slides adapted from Vibhav Gogate, Luke Zettlemoyer, Carlos Guestrin, and Dan Weld

Naïve Bayes
• Naïve Bayes assumption:
  – Features are independent given class: $P(X_1, X_2 \mid Y) = P(X_1 \mid Y)\,P(X_2 \mid Y)$
  – More generally: $P(X_1, \dots, X_n \mid Y) = \prod_i P(X_i \mid Y)$
• How many parameters now?
• Suppose X is composed of n binary features

The Naïve Bayes Classifier
• Given:
  – Prior P(Y)
  – n conditionally independent features X given the class Y
  – For each Xi, we have the likelihood P(Xi | Y)
  [Figure: Bayes net with class node Y and feature nodes X1, X2, ..., Xn]
• Decision rule: $y^* = \arg\max_y P(y) \prod_i P(x_i \mid y)$
• If the conditional independence assumption holds, NB is the optimal classifier! (It typically doesn't.)

A Digit Recognizer
• Input: pixel grids
• Output: a digit 0-9

Naïve Bayes for Digits (Binary Inputs)
• Simple version:
  – One feature Fij for each grid position <i,j>
  – Possible feature values are on / off, based on whether intensity is more or less than 0.5 in the underlying image
  – Each input maps to a feature vector, e.g. a binary vector with one entry per pixel
  – Here: lots of features, each is binary valued
• Naïve Bayes model: $P(Y \mid F_{0,0}, \dots, F_{n,n}) \propto P(Y) \prod_{i,j} P(F_{i,j} \mid Y)$
• Are the features independent given class?
• What do we need to learn?

Example Distributions (for each digit y: the prior P(Y = y) and P(F = on | Y = y) for two example pixel features)

  y    P(Y = y)    P(F_A = on | Y = y)    P(F_B = on | Y = y)
  1    0.1         0.01                   0.05
  2    0.1         0.05                   0.01
  3    0.1         0.05                   0.90
  4    0.1         0.30                   0.80
  5    0.1         0.80                   0.90
  6    0.1         0.90                   0.90
  7    0.1         0.05                   0.25
  8    0.1         0.60                   0.85
  9    0.1         0.50                   0.60
  0    0.1         0.80                   0.80

MLE for the mean and variance of a Gaussian:
$$\mu_{MLE}, \sigma_{MLE} = \arg\max_{\mu,\sigma} P(D \mid \mu, \sigma)$$
Setting the derivative of the log-likelihood with respect to $\mu$ to zero:
$$-\sum_{i=1}^{N} \frac{(x_i - \mu)}{\sigma^2} = 0 \;\Rightarrow\; -\sum_{i=1}^{N} x_i + N\mu = 0 \;\Rightarrow\; \mu_{MLE} = \frac{1}{N}\sum_{i=1}^{N} x_i$$
Setting the derivative with respect to $\sigma$ to zero:
$$-\frac{N}{\sigma} + \sum_{i=1}^{N} \frac{(x_i - \mu)^2}{\sigma^3} = 0 \;\Rightarrow\; \sigma^2_{MLE} = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu_{MLE})^2$$

MLE for linear regression with Gaussian noise (reduces to least squares):
$$\arg\max_{w} \sum_{j=1}^{N} \left( \ln\frac{1}{\sigma\sqrt{2\pi}} - \frac{\left[t_j - \sum_i w_i h_i(x_j)\right]^2}{2\sigma^2} \right)
= \arg\max_{w} \sum_{j=1}^{N} -\frac{\left[t_j - \sum_i w_i h_i(x_j)\right]^2}{2\sigma^2}
= \arg\min_{w} \sum_{j=1}^{N} \left[t_j - \sum_i w_i h_i(x_j)\right]^2$$

MLE for the parameters of NB
• Given a dataset:
  – Count(A = a, B = b) → number of examples where A = a and B = b
• MLE for discrete NB, simply:
  – Prior: $P(Y = y) = \dfrac{\mathrm{Count}(Y = y)}{\sum_{y'} \mathrm{Count}(Y = y')}$
  – Likelihood: $P(X_i = x \mid Y = y) = \dfrac{\mathrm{Count}(X_i = x, Y = y)}{\sum_{x'} \mathrm{Count}(X_i = x', Y = y)}$

Subtleties of NB classifier – Violating the NB assumption
• Usually, features are not conditionally independent: $P(X_1, \dots, X_n \mid Y) \neq \prod_i P(X_i \mid Y)$
• Actual probabilities P(Y|X) often biased towards 0 or 1 (i.e., not well "calibrated")
• Nonetheless, NB often performs well, even when the assumption is violated

Text classification
• Classify e-mails
  – Y = {Spam, NotSpam}
• Classify news articles
  – Y = {what is the topic of the article?}
• Classify webpages
  – Y = {Student, professor, project, …}
• What about the features X?
  – The text! Features X are the entire document – Xi for the i-th word in the article
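To make the counting-based MLE above concrete for the text-classification setting just described, here is a minimal Python sketch (my own illustration, not code from the lecture; the function name train_nb and the data layout are assumptions) that estimates the prior P(Y) and the per-word likelihoods P(word present | Y) by counting:

```python
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """MLE for a discrete (Bernoulli) naive Bayes text classifier.

    docs   : list of dicts mapping word -> 0/1 (word absent/present in the document)
    labels : list of class labels, e.g. "Spam" / "NotSpam"
    Returns the prior P(Y = y) and the likelihoods P(word present | Y = y).
    """
    n_docs = len(labels)
    class_count = Counter(labels)          # Count(Y = y)
    word_count = defaultdict(Counter)      # Count(X_w = 1, Y = y)

    for doc, y in zip(docs, labels):
        for word, present in doc.items():
            if present:
                word_count[y][word] += 1

    prior = {y: class_count[y] / n_docs for y in class_count}
    likelihood = {y: {w: c / class_count[y] for w, c in word_count[y].items()}
                  for y in class_count}
    return prior, likelihood
```

In practice one would also add smoothing (e.g. add-one / Laplace) so that unseen words do not receive zero probability, and classify using sums of log-probabilities to avoid numerical underflow.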
Bag of Words Approach
Example word-count vector for one document:
  aardvark 0
  about    2
  all      2
  Africa   1
  apple    0
  anxious  0
  ...
  gas      1
  ...
  oil      1
  ...
  Zaire    0

NB with Bag of Words for text classification
• Learning phase:
  – Prior P(Y)
    • Count the number of documents for each topic
  – P(Xi|Y)
    • Just considering the documents assigned to topic Y, find the fraction of docs containing a given word; remember this distribution is shared across all positions i
• Test phase:
  – For each document
    • Use the naïve Bayes decision rule

Naive Bayes = Linear Classifier
Theorem: assume that $x_i \in \{0, 1\}$ for all $i \in [1, N]$. Then, the Naive Bayes classifier is defined by
$$x \mapsto \mathrm{sgn}(w \cdot x + b),$$
where
$$w_i = \log\frac{\Pr[x_i = 1 \mid +1]}{\Pr[x_i = 1 \mid -1]} - \log\frac{\Pr[x_i = 0 \mid +1]}{\Pr[x_i = 0 \mid -1]}$$
and
$$b = \log\frac{\Pr[+1]}{\Pr[-1]} + \sum_{i=1}^{N} \log\frac{\Pr[x_i = 0 \mid +1]}{\Pr[x_i = 0 \mid -1]}.$$
Proof: observe that for any $i \in [1, N]$,
$$\log\frac{\Pr[x_i \mid +1]}{\Pr[x_i \mid -1]} = \left( \log\frac{\Pr[x_i = 1 \mid +1]}{\Pr[x_i = 1 \mid -1]} - \log\frac{\Pr[x_i = 0 \mid +1]}{\Pr[x_i = 0 \mid -1]} \right) x_i + \log\frac{\Pr[x_i = 0 \mid +1]}{\Pr[x_i = 0 \mid -1]}.$$
[Slide from Mehryar Mohri, Introduction to Machine Learning]

Summary
Bayesian prediction:
• requires solving density estimation problems.
• often difficult to estimate Pr[x | y] for x ∈ R^N.
• but, simple and easy to apply; widely used.
Naive Bayes:
• strong assumption.
• straightforward estimation problem.
• specific linear classifier.
• sometimes surprisingly good performance.
[Slide from Mehryar Mohri, Introduction to Machine Learning]

Let's take a(nother) probabilistic approach!!!
• Previously: directly estimate the data distribution P(X,Y)!
  – challenging due to size of distribution!
  – make Naïve Bayes assumption: only need P(Xi|Y)!
• But wait, we classify according to:
  – maxY P(Y|X)
• Why not learn P(Y|X) directly?

Generative vs. Discriminative Classifiers
• Want to learn: X → Y
  – X – features
  – Y – target classes
  P(Y | X) ∝ P(X | Y) P(Y)
• Generative classifier, e.g., Naïve Bayes:
  – Assume some functional form for P(X|Y), P(Y)
  – Estimate parameters of P(X|Y), P(Y) directly from training data
  – Use Bayes' rule to calculate P(Y | X = x)
  – This is an example of a "generative" model
    • Indirect computation of P(Y|X) through Bayes rule
    • As a result, can also generate a sample of the data, P(X) = ∑y P(y) P(X|y)
  – Can easily handle missing data
• Discriminative classifiers, e.g., Logistic Regression:
  – Assume some functional form for P(Y|X)
  – Estimate parameters of P(Y|X) directly from training data
  – This is the "discriminative" (or "conditional") model
    • Directly learn P(Y|X)
    • But cannot obtain a sample of the data, because P(X) is not available

Logistic Regression
• Learn P(Y|X) directly! Assume a particular functional form.
  ✬ Linear classifier? On one side we say P(Y=1|X) = 1, and on the other P(Y=1|X) = 0
  ✬ But, this is not differentiable (hard to learn)… doesn't allow for label noise...
  [Figure: linear decision boundary with P(Y=1) = 1 on one side and P(Y=1) = 0 on the other]
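Relating back to the "Naive Bayes = Linear Classifier" theorem stated earlier in this section, the sketch below (my own illustration; the names nb_to_linear and predict and the list-based inputs are assumptions) converts the conditional probabilities of a binary NB model into the weight vector w and bias b from the theorem and classifies with sgn(w · x + b):

```python
import math

def nb_to_linear(p_pos, p1_given_pos, p1_given_neg):
    """Convert a binary Naive Bayes model into (w, b) with x -> sgn(w.x + b).

    p_pos             : Pr[+1], prior probability of the positive class
    p1_given_pos[i]   : Pr[x_i = 1 | +1]
    p1_given_neg[i]   : Pr[x_i = 1 | -1]
    """
    w = []
    b = math.log(p_pos / (1 - p_pos))  # log Pr[+1]/Pr[-1]
    for pp, pn in zip(p1_given_pos, p1_given_neg):
        # w_i = log Pr[x_i=1|+1]/Pr[x_i=1|-1] - log Pr[x_i=0|+1]/Pr[x_i=0|-1]
        w.append(math.log(pp / pn) - math.log((1 - pp) / (1 - pn)))
        # b accumulates the "feature off" terms log Pr[x_i=0|+1]/Pr[x_i=0|-1]
        b += math.log((1 - pp) / (1 - pn))
    return w, b

def predict(w, b, x):
    """Classify a binary feature vector x with the linear form of NB."""
    score = b + sum(wi * xi for wi, xi in zip(w, x))
    return 1 if score >= 0 else -1
```

Working in this log-space linear form also avoids multiplying many small probabilities together at prediction time.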
[Excerpt from Tom M. Mitchell, copyright 2010]

where the superscript j refers to the jth training example, and where δ(Y = yk) is 1 if Y = yk and 0 otherwise. Note the role of δ here is to select only those training examples for which Y = yk.

The maximum likelihood estimator for $\sigma^2_{ik}$ is
$$\hat{\sigma}^2_{ik} = \frac{1}{\sum_j \delta(Y^j = y_k)} \sum_j (X_i^j - \hat{\mu}_{ik})^2 \, \delta(Y^j = y_k) \qquad (14)$$
This maximum likelihood estimator is biased, so the minimum variance unbiased estimator (MVUE) is sometimes used instead. It is
$$\hat{\sigma}^2_{ik} = \frac{1}{\left(\sum_j \delta(Y^j = y_k)\right) - 1} \sum_j (X_i^j - \hat{\mu}_{ik})^2 \, \delta(Y^j = y_k) \qquad (15)$$

Logistic Regression
• Learn P(Y|X) directly!
• Assume a particular functional form: the logistic function (sigmoid)
  $$\sigma(z) = \frac{1}{1 + e^{-z}}$$
• Sigmoid applied to a linear function of the data, $z = w_0 + \sum_i w_i X_i$
• Features can be discrete or continuous!

3 Logistic Regression

Logistic Regression is an approach to learning functions of the form f : X → Y, or P(Y|X) in the case where Y is discrete-valued, and X = ⟨X1 ... Xn⟩ is any vector containing discrete or continuous variables. In this section we will primarily consider the case where Y is a boolean variable, in order to simplify notation. In the final subsection we extend our treatment to the case where Y takes on any finite number of discrete values.

Logistic Regression assumes a parametric form for the distribution P(Y|X), then directly estimates its parameters from the training data. The parametric model assumed by Logistic Regression in the case where Y is boolean is:
$$P(Y = 1 \mid X) = \frac{1}{1 + \exp\left(w_0 + \sum_{i=1}^{n} w_i X_i\right)} \qquad (16)$$
and
$$P(Y = 0 \mid X) = \frac{\exp\left(w_0 + \sum_{i=1}^{n} w_i X_i\right)}{1 + \exp\left(w_0 + \sum_{i=1}^{n} w_i X_i\right)} \qquad (17)$$
Notice that equation (17) follows directly from equation (16), because the sum of these two probabilities must equal 1.
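As a small numeric check of equations (16) and (17), the sketch below (my own code, not Mitchell's; it keeps the sign convention of equation (16), in which a larger w0 + ∑i wi Xi makes P(Y = 1 | X) smaller) evaluates both probabilities for a given weight vector:

```python
import math

def p_y1_given_x(w0, w, x):
    """P(Y=1 | X) = 1 / (1 + exp(w0 + sum_i w_i * x_i)), as in equation (16)."""
    z = w0 + sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(z))

def p_y0_given_x(w0, w, x):
    """P(Y=0 | X) = 1 - P(Y=1 | X), which is exactly equation (17)."""
    return 1.0 - p_y1_given_x(w0, w, x)

# Example: with w0 = 0 and all weights zero, both classes get probability 0.5.
print(p_y1_given_x(0.0, [0.0, 0.0], [1.0, 1.0]))  # 0.5
print(p_y0_given_x(0.0, [0.0, 0.0], [1.0, 1.0]))  # 0.5
```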