Naïve Bayes & Logistic Regression (Lecture 18)

David Sontag New York University

Slides adapted from Vibhav Gogate, Luke Zettlemoyer, Carlos Guestrin, and Dan Weld

Naïve Bayes
• Naïve Bayes assumption:
– Features are independent given the class: P(X1, X2 | Y) = P(X1 | Y) P(X2 | Y)

– More generally: P(X1, …, Xn | Y) = ∏i P(Xi | Y)

• How many parameters now? Suppose X is composed of n binary features.

The Naïve Bayes Classifier
• Given:
– Prior P(Y)
– n conditionally independent features Xi given the class Y
– For each Xi, we have the likelihood P(Xi | Y)
[Graphical model: class node Y with children X1, X2, …, Xn]
• Decision rule: y* = argmaxy P(y) ∏i P(xi | y)
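As a concrete illustration of the decision rule above, here is a minimal sketch in Python (not from the slides); the names `prior` and `likelihood` are hypothetical placeholders for parameter tables estimated elsewhere, and features are assumed binary.

```python
import math

def nb_predict(x, prior, likelihood):
    """Naive Bayes decision rule: return argmax_y P(y) * prod_i P(x_i | y).

    prior:      dict mapping class y -> P(Y = y)
    likelihood: dict mapping (i, y) -> P(X_i = 1 | Y = y) for binary features
    x:          list of binary feature values (0 or 1)
    """
    best_y, best_score = None, -math.inf
    for y, p_y in prior.items():
        # work in log space to avoid underflow when there are many features
        score = math.log(p_y)
        for i, xi in enumerate(x):
            p1 = likelihood[(i, y)]
            score += math.log(p1 if xi == 1 else 1.0 - p1)
        if score > best_score:
            best_y, best_score = y, score
    return best_y

# toy usage: two binary features, two classes
prior = {0: 0.5, 1: 0.5}
likelihood = {(0, 0): 0.9, (1, 0): 0.2, (0, 1): 0.3, (1, 1): 0.8}
print(nb_predict([1, 0], prior, likelihood))  # -> 0
```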

If the conditional independence assumption holds, NB is the optimal classifier! (In practice it typically doesn't hold.)

A Digit Recognizer

• Input: pixel grids

• Output: a digit 0-9

Naïve Bayes for Digits (Binary Inputs)

• Simple version:

– One feature Fij for each grid position
– Possible feature values are on / off, based on whether intensity is more or less than 0.5 in the underlying image
– Each input maps to a feature vector, e.g.

– Here: lots of features, each is binary valued

• Naïve Bayes model: P(Y | F) ∝ P(Y) ∏i,j P(Fi,j | Y)

• Are the features independent given class?
• What do we need to learn?

Example Distributions

Y   P(Y)   P(Fa = on | Y)   P(Fb = on | Y)      (for two example grid positions a, b)
1   0.1    0.01             0.05
2   0.1    0.05             0.01
3   0.1    0.05             0.90
4   0.1    0.30             0.80
5   0.1    0.80             0.90
6   0.1    0.90             0.90
7   0.1    0.05             0.25
8   0.1    0.60             0.85
9   0.1    0.50             0.60
0   0.1    0.80             0.80

µMLE, σMLE = argmax_{µ,σ} P(D | µ, σ)

∂/∂µ:  −∑_{i=1}^N (x_i − µ) / σ² = 0   ⇒   −∑_{i=1}^N x_i + Nµ = 0
∂/∂σ:  −N/σ + ∑_{i=1}^N (x_i − µ)² / σ³ = 0

argmax_w ∑_{j=1}^N [ ln(1/(σ√2π)) − [t_j − ∑_i w_i h_i(x_j)]² / (2σ²) ]
  = argmin_w ∑_{j=1}^N [t_j − ∑_i w_i h_i(x_j)]²

MLE for the parameters of NB
• Given dataset
– Count(A=a, B=b) ← number of examples where A=a and B=b
• MLE for discrete NB, simply:
– Prior:      P(Y = y) = Count(Y = y) / ∑_{y′} Count(Y = y′)
– Likelihood: P(Xi = x | Y = y) = Count(Xi = x, Y = y) / ∑_{x′} Count(Xi = x′, Y = y)
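To make the counting estimates above concrete, here is a minimal Python sketch (not from the slides) that computes the MLE prior and per-feature likelihoods from a list of examples with discrete features; the function name `nb_mle` is a placeholder.

```python
from collections import Counter, defaultdict

def nb_mle(examples):
    """MLE for discrete Naive Bayes: P(Y=y) and P(X_i=x | Y=y) from counts.

    examples: list of (x, y) pairs, where x is a tuple of discrete feature values.
    Returns (prior, likelihood) with
      prior[y]              = Count(Y=y) / sum_y' Count(Y=y')
      likelihood[(i, x, y)] = Count(X_i=x, Y=y) / sum_x' Count(X_i=x', Y=y)
    """
    class_counts = Counter(y for _, y in examples)
    joint_counts = defaultdict(int)          # (i, x_value, y) -> count
    for x, y in examples:
        for i, xi in enumerate(x):
            joint_counts[(i, xi, y)] += 1

    n = len(examples)
    prior = {y: c / n for y, c in class_counts.items()}
    # each example contributes one value of X_i, so the denominator is Count(Y=y)
    likelihood = {(i, xi, y): c / class_counts[y]
                  for (i, xi, y), c in joint_counts.items()}
    return prior, likelihood

# toy usage: two binary features, labels in {0, 1}
data = [((1, 0), 0), ((1, 1), 0), ((0, 1), 1), ((0, 0), 1)]
prior, likelihood = nb_mle(data)
print(prior[0], likelihood[(0, 1, 0)])   # 0.5, 1.0
```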

Subtleties of the NB classifier – Violating the NB assumption

• Usually, features are not conditionally independent: P(X1, …, Xn | Y) ≠ ∏i P(Xi | Y)

• Actual probabilities P(Y|X) are often biased towards 0 or 1 (i.e., not well “calibrated”)

• Nonetheless, NB often performs well, even when the assumption is violated

Text classification
• Classify e-mails
– Y = {Spam, NotSpam}
• Classify news articles
– Y = {what is the topic of the article?}
• Classify webpages
– Y = {Student, professor, project, …}

• What about the features X?
– The text! The features X are the entire document
– Xi is the i-th word in the article

Bag of Words Approach

aardvark 0
about    2
all      2
Africa   1
apple    0
anxious  0
...
gas      1
...
oil      1
…
Zaire    0

NB with Bag of Words for text classification
• Learning phase:
– Prior P(Y)
• Count the number of documents for each topic

– P(Xi|Y)
• Just considering the documents assigned to topic Y, find the fraction of docs containing a given word; remember this distribution is shared across all positions i
• Test phase:
– For each document
• Use the naïve Bayes decision rule (see the sketch below)
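Below is a minimal sketch (not from the slides) of the learning and test phases just described, using a Bernoulli bag-of-words representation; the names `train_nb_text` and `classify` are hypothetical, and an `eps` floor stands in for proper smoothing of unseen words.

```python
import math
from collections import Counter

def train_nb_text(docs):
    """docs: list of (set_of_words, label). Returns prior and word likelihoods."""
    labels = Counter(label for _, label in docs)
    prior = {y: c / len(docs) for y, c in labels.items()}
    word_counts = {y: Counter() for y in labels}
    for words, y in docs:
        word_counts[y].update(set(words))          # count docs containing each word
    likelihood = {y: {w: c / labels[y] for w, c in word_counts[y].items()}
                  for y in labels}
    return prior, likelihood

def classify(words, prior, likelihood, eps=1e-6):
    """Naive Bayes decision rule in log space; eps guards against log(0)."""
    def score(y):
        s = math.log(prior[y])
        for w in set(words):
            s += math.log(likelihood[y].get(w, eps))
        return s
    return max(prior, key=score)

# toy usage
docs = [({"cheap", "viagra"}, "Spam"), ({"meeting", "today"}, "NotSpam")]
prior, likelihood = train_nb_text(docs)
print(classify({"cheap", "pills"}, prior, likelihood))  # -> "Spam"
```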

Naive Bayes = Linear Classifier

Theorem: assume that xi ∈ {0, 1} for all i ∈ [1, N]. Then the Naive Bayes classifier is defined by
    x ↦ sgn(w · x + b),
where
    wi = log( Pr[xi = 1 | +1] / Pr[xi = 1 | −1] ) − log( Pr[xi = 0 | +1] / Pr[xi = 0 | −1] )
and
    b = log( Pr[+1] / Pr[−1] ) + ∑_{i=1}^N log( Pr[xi = 0 | +1] / Pr[xi = 0 | −1] ).

Proof: observe that for any i ∈ [1, N],
    log( Pr[xi | +1] / Pr[xi | −1] )
      = ( log( Pr[xi = 1 | +1] / Pr[xi = 1 | −1] ) − log( Pr[xi = 0 | +1] / Pr[xi = 0 | −1] ) ) xi
        + log( Pr[xi = 0 | +1] / Pr[xi = 0 | −1] ).
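The following sketch (my own, not from the slide) illustrates the theorem numerically for a small Bernoulli NB model with made-up parameters: it builds w and b from the conditional probabilities exactly as above and checks that the linear score w·x + b equals the NB log-odds for every binary input.

```python
import math
import itertools

# hypothetical Bernoulli NB parameters: p_pos[i] = Pr[x_i = 1 | +1], p_neg[i] = Pr[x_i = 1 | -1]
p_pos = [0.9, 0.2, 0.7]
p_neg = [0.4, 0.6, 0.3]
prior_pos, prior_neg = 0.6, 0.4

# weights and bias exactly as in the theorem
w = [math.log(p_pos[i] / p_neg[i]) - math.log((1 - p_pos[i]) / (1 - p_neg[i]))
     for i in range(3)]
b = math.log(prior_pos / prior_neg) + sum(
    math.log((1 - p_pos[i]) / (1 - p_neg[i])) for i in range(3))

for x in itertools.product([0, 1], repeat=3):
    # NB log-odds: log Pr[+1]/Pr[-1] + sum_i log Pr[x_i | +1] / Pr[x_i | -1]
    log_odds = math.log(prior_pos / prior_neg)
    for i, xi in enumerate(x):
        log_odds += math.log(p_pos[i] if xi else 1 - p_pos[i])
        log_odds -= math.log(p_neg[i] if xi else 1 - p_neg[i])
    linear = sum(wi * xi for wi, xi in zip(w, x)) + b
    assert abs(log_odds - linear) < 1e-9   # the two scores agree, so the signs agree
```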

[Slide from Mehryar Mohri]

Summary

Bayesian prediction:
• requires solving density estimation problems.
• often difficult to estimate Pr[x | y] for x ∈ R^N.
• but simple and easy to apply; widely used.

Naive Bayes:
• strong assumption.
• straightforward estimation problem.
• specific linear classifier.
• sometimes surprisingly good performance.

Let's take a(nother) probabilistic approach!!!

• Previously: directly estimate the data distribution P(X,Y)!
– challenging due to the size of the distribution!
– make the Naïve Bayes assumption: only need P(Xi|Y)!
• But wait, we classify according to:

– maxY P(Y|X)
• Why not learn P(Y|X) directly?

Generative vs. Discriminative Classifiers

• Want to learn: X → Y
– X: features
– Y: target classes

P(Y | X) ∝ P(X | Y) P(Y)

• Generative classifier, e.g., Naïve Bayes:
– Assume some functional form for P(X|Y), P(Y)
– Estimate the parameters of P(X|Y), P(Y) directly from training data
– Use Bayes' rule to calculate P(Y | X = x)
– This is an example of a "generative" model
• Indirect computation of P(Y|X) through Bayes' rule

• As a result, can also generate a sample of the data, P(X) = ∑y P(y) P(X|y)

– Can easily handle missing data
• Discriminative classifiers, e.g., Logistic Regression:
– Assume some functional form for P(Y|X)
– Estimate the parameters of P(Y|X) directly from training data
– This is the "discriminative" (or "conditional") model
• Directly learn P(Y|X)
• But cannot obtain a sample of the data, because P(X) is not available

Logistic Regression

• Learn P(Y|X) directly!
• Assume a particular functional form
– ? On one side we say P(Y=1|X)=1, and on the other P(Y=1|X)=0
– But this is not differentiable (hard to learn)… and doesn't allow for label noise...

[Figure: a step function in X, with P(Y=1|X) = 1 on one side of the threshold and P(Y=1|X) = 0 on the other]

Logistic Regression
• Learn P(Y|X) directly!
• Assume a particular functional form: the logistic function (sigmoid), 1 / (1 + e^(−z))
• Sigmoid applied to a linear function of the data
• Features can be discrete or continuous!

[From Tom M. Mitchell, Machine Learning chapter draft, copyright 2010:]

where the superscript j refers to the jth training example, and where δ(Y = yk) is 1 if Y = yk and 0 otherwise. Note the role of δ here is to select only those training examples for which Y = yk.

The maximum likelihood estimator for σ²_ik is

    σ̂²_ik = ( 1 / ∑_j δ(Y^j = yk) ) ∑_j (X_i^j − µ̂_ik)² δ(Y^j = yk)                    (14)

This maximum likelihood estimator is biased, so the minimum variance unbiased estimator (MVUE) is sometimes used instead. It is

    σ̂²_ik = ( 1 / (∑_j δ(Y^j = yk) − 1) ) ∑_j (X_i^j − µ̂_ik)² δ(Y^j = yk)              (15)

3 Logistic Regression

Logistic Regression is an approach to learning functions of the form f : X → Y, or P(Y|X) in the case where Y is discrete-valued, and X = ⟨X1 ... Xn⟩ is any vector containing discrete or continuous variables. In this section we will primarily consider the case where Y is a boolean variable, in order to simplify notation. In the final subsection we extend our treatment to the case where Y takes on any finite number of discrete values.

Logistic Regression assumes a parametric form for the distribution P(Y|X), then directly estimates its parameters from the training data. The parametric model assumed by Logistic Regression in the case where Y is boolean is:

    P(Y = 1 | X) = 1 / ( 1 + exp(w0 + ∑_{i=1}^n wi Xi) )                                (16)

and

    P(Y = 0 | X) = exp(w0 + ∑_{i=1}^n wi Xi) / ( 1 + exp(w0 + ∑_{i=1}^n wi Xi) )        (17)

Notice that equation (17) follows directly from equation (16), because the sum of these two probabilities must equal 1.

One highly convenient property of this form for P(Y|X) is that it leads to a simple linear expression for classification. To classify any given X we generally want to assign the value yk that maximizes P(Y = yk | X). Put another way, we assign the label Y = 0 if the following condition holds:

    1 < P(Y = 0 | X) / P(Y = 1 | X)

Substituting from equations (16) and (17), this becomes

    1 < exp(w0 + ∑_{i=1}^n wi Xi)

Logistic Function in n Dimensions

Sigmoid applied to a linear function of the data:

[Figure: the sigmoid Y = 1 / (1 + exp(−X)) plotted against X; Y ranges from 0 to 1]

Figure 1: Form of the logistic function. In Logistic Regression, P(Y|X) is assumed to follow this form.

Features can be discrete or continuous!

Logistic Regression: decision boundary
• Prediction: output the Y with the highest P(Y|X)
– For binary Y, output Y = 0 if w0 + ∑i wi Xi > 0
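As a small illustration of equations (16)–(17) and the decision boundary above, here is a sketch in Python (mine, not from the text). It follows Mitchell's sign convention, where P(Y=1|X) = 1 / (1 + exp(w0 + ∑i wi Xi)), so Y = 0 is predicted exactly when the linear score is positive; the parameter values are made up.

```python
import math

def p_y1_given_x(x, w0, w):
    """Equation (16): P(Y=1 | X) = 1 / (1 + exp(w0 + sum_i w_i x_i))."""
    z = w0 + sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(z))

def predict(x, w0, w):
    """Decision boundary: output Y = 0 exactly when w0 + sum_i w_i x_i > 0."""
    return 0 if p_y1_given_x(x, w0, w) < 0.5 else 1

# toy parameters (hypothetical)
w0, w = -1.0, [2.0, -3.0]
x = [0.5, 1.0]
print(p_y1_given_x(x, w0, w))   # P(Y=1|X) for this x
print(predict(x, w0, w))        # predicted label, 0 or 1
```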

3.1 Form of P(Y|X) for Gaussian Naive Bayes Classifier

Here we derive the form of P(Y|X) entailed by the assumptions of a Gaussian Naive Bayes (GNB) classifier, showing that it is precisely the form used by Logistic Regression and summarized in equations (16) and (17). In particular, consider a GNB based on the following modeling assumptions:

• Y is boolean, governed by a Bernoulli distribution, with parameter π = P(Y = 1)

• X = ⟨X1 ... Xn⟩, where each Xi is a continuous random variable

Understanding Sigmoids

[Figure: sigmoid curves for (w0, w1) = (−2, −1), (0, −1), and (0, −0.5)]

Likelihood vs. Conditional Likelihood

Generative (Naïve Bayes) maximizes the data likelihood: ln P(D | w) = ∑_j ln P(x^j, y^j | w)

Discriminative (Logistic Regression) maximizes the conditional data likelihood: ln P(D_Y | D_X, w) = ∑_j ln P(y^j | x^j, w)

Discriminative models can't compute P(x^j | w)! Or, … "They don't waste effort learning P(X)" – focus only on P(Y|X), which is all that matters for classification (see the sketch below).
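To make the conditional likelihood objective concrete, here is a minimal sketch (not from the slides) of the conditional log-likelihood for binary logistic regression and one gradient-ascent step. It uses the common convention P(Y=1|X) = 1 / (1 + exp(−(w0 + ∑i wi Xi))), which differs from Mitchell's equation (16) by a sign; the function names are hypothetical.

```python
import math

def conditional_log_likelihood(w0, w, data):
    """ln P(D_Y | D_X, w) = sum_j [ y^j ln P(1|x^j) + (1 - y^j) ln P(0|x^j) ].

    data: list of (x, y) pairs with x a list of floats and y in {0, 1}.
    """
    ll = 0.0
    for x, y in data:
        z = w0 + sum(wi * xi for wi, xi in zip(w, x))
        p1 = 1.0 / (1.0 + math.exp(-z))
        ll += math.log(p1) if y == 1 else math.log(1.0 - p1)
    return ll

def gradient_ascent_step(w0, w, data, lr=0.1):
    """One step of gradient ascent on the conditional log-likelihood."""
    g0, g = 0.0, [0.0] * len(w)
    for x, y in data:
        z = w0 + sum(wi * xi for wi, xi in zip(w, x))
        p1 = 1.0 / (1.0 + math.exp(-z))
        g0 += y - p1                      # d ll / d w0
        for i, xi in enumerate(x):
            g[i] += (y - p1) * xi         # d ll / d w_i
    return w0 + lr * g0, [wi + lr * gi for wi, gi in zip(w, g)]

# toy usage: the conditional log-likelihood increases toward 0 as the fit improves
data = [([1.0, 0.0], 1), ([0.0, 1.0], 0)]
w0, w = 0.0, [0.0, 0.0]
for _ in range(100):
    w0, w = gradient_ascent_step(w0, w, data)
print(conditional_log_likelihood(w0, w, data))
```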