Naïve Bayes & Logistic Regression (Lecture 18)

David Sontag New York University

Slides adapted from Vibhav Gogate, Luke Zettlemoyer, Carlos Guestrin, and Dan Weld

Naïve Bayes
• Naïve Bayes assumption:
– Features are independent given the class: P(X1, X2 | Y) = P(X1 | Y) P(X2 | Y)

– More generally: P(X1, …, Xn | Y) = ∏i P(Xi | Y)

• How many parameters now? Suppose X is composed of n binary features.

The Naïve Bayes Classifier
• Given:
– Prior P(Y)
– n conditionally independent features Xi given the class Y
– For each Xi, we have the likelihood P(Xi | Y)
[Graphical model: class node Y with children X1, X2, …, Xn]
• Decision rule: y* = argmaxy P(y) ∏i P(xi | y)
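As a concrete illustration of the decision rule above, here is a minimal sketch in Python (not from the slides); the names `prior` and `likelihood` are hypothetical placeholders for parameter tables estimated elsewhere, and features are assumed binary.

```python
import math

def nb_predict(x, prior, likelihood):
    """Naive Bayes decision rule: return argmax_y P(y) * prod_i P(x_i | y).

    prior:      dict mapping class y -> P(Y = y)
    likelihood: dict mapping (i, y) -> P(X_i = 1 | Y = y) for binary features
    x:          list of binary feature values (0 or 1)
    """
    best_y, best_score = None, -math.inf
    for y, p_y in prior.items():
        # work in log space to avoid underflow when there are many features
        score = math.log(p_y)
        for i, xi in enumerate(x):
            p1 = likelihood[(i, y)]
            score += math.log(p1 if xi == 1 else 1.0 - p1)
        if score > best_score:
            best_y, best_score = y, score
    return best_y

# toy usage: two binary features, two classes
prior = {0: 0.5, 1: 0.5}
likelihood = {(0, 0): 0.9, (1, 0): 0.2, (0, 1): 0.3, (1, 1): 0.8}
print(nb_predict([1, 0], prior, likelihood))  # -> 0
```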

If the conditional independence assumption holds, NB is the optimal classifier! (In practice it typically doesn't hold.)

A Digit Recognizer

• Input: pixel grids

• Output: a digit 0-9

Naïve Bayes for Digits (Binary Inputs)

• Simple version:

– One feature Fij for each grid position
– Possible feature values are on / off, based on whether intensity is more or less than 0.5 in the underlying image
– Each input maps to a feature vector, e.g.

– Here: lots of features, each is binary valued

• Naïve Bayes model: P(Y | F) ∝ P(Y) ∏i,j P(Fi,j | Y)

• Are the features independent given class?
• What do we need to learn?

Example Distributions

Y   P(Y)   P(Fa = on | Y)   P(Fb = on | Y)      (for two example grid positions a, b)
1   0.1    0.01             0.05
2   0.1    0.05             0.01
3   0.1    0.05             0.90
4   0.1    0.30             0.80
5   0.1    0.80             0.90
6   0.1    0.90             0.90
7   0.1    0.05             0.25
8   0.1    0.60             0.85
9   0.1    0.50             0.60
0   0.1    0.80             0.80

µMLE, σMLE = argmax_{µ,σ} P(D | µ, σ)

∂/∂µ:  −∑_{i=1}^N (x_i − µ) / σ² = 0   ⇒   −∑_{i=1}^N x_i + Nµ = 0
∂/∂σ:  −N/σ + ∑_{i=1}^N (x_i − µ)² / σ³ = 0

argmax_w ∑_{j=1}^N [ ln(1/(σ√2π)) − [t_j − ∑_i w_i h_i(x_j)]² / (2σ²) ]
  = argmin_w ∑_{j=1}^N [t_j − ∑_i w_i h_i(x_j)]²

MLE for the parameters of NB
• Given dataset
– Count(A=a, B=b) ← number of examples where A=a and B=b
• MLE for discrete NB, simply:
– Prior:      P(Y = y) = Count(Y = y) / ∑_{y′} Count(Y = y′)
– Likelihood: P(Xi = x | Y = y) = Count(Xi = x, Y = y) / ∑_{x′} Count(Xi = x′, Y = y)
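To make the counting estimates above concrete, here is a minimal Python sketch (not from the slides) that computes the MLE prior and per-feature likelihoods from a list of examples with discrete features; the function name `nb_mle` is a placeholder.

```python
from collections import Counter, defaultdict

def nb_mle(examples):
    """MLE for discrete Naive Bayes: P(Y=y) and P(X_i=x | Y=y) from counts.

    examples: list of (x, y) pairs, where x is a tuple of discrete feature values.
    Returns (prior, likelihood) with
      prior[y]              = Count(Y=y) / sum_y' Count(Y=y')
      likelihood[(i, x, y)] = Count(X_i=x, Y=y) / sum_x' Count(X_i=x', Y=y)
    """
    class_counts = Counter(y for _, y in examples)
    joint_counts = defaultdict(int)          # (i, x_value, y) -> count
    for x, y in examples:
        for i, xi in enumerate(x):
            joint_counts[(i, xi, y)] += 1

    n = len(examples)
    prior = {y: c / n for y, c in class_counts.items()}
    # each example contributes one value of X_i, so the denominator is Count(Y=y)
    likelihood = {(i, xi, y): c / class_counts[y]
                  for (i, xi, y), c in joint_counts.items()}
    return prior, likelihood

# toy usage: two binary features, labels in {0, 1}
data = [((1, 0), 0), ((1, 1), 0), ((0, 1), 1), ((0, 0), 1)]
prior, likelihood = nb_mle(data)
print(prior[0], likelihood[(0, 1, 0)])   # 0.5, 1.0
```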

Subtleties of the NB classifier – Violating the NB assumption

• Usually, features are not conditionally independent: P(X1, …, Xn | Y) ≠ ∏i P(Xi | Y)

• Actual probabilities P(Y|X) are often biased towards 0 or 1 (i.e., not well “calibrated”)

• Nonetheless, NB often performs well, even when the assumption is violated

Text classification
• Classify e-mails
– Y = {Spam, NotSpam}
• Classify news articles
– Y = {what is the topic of the article?}
• Classify webpages
– Y = {Student, professor, project, …}

• What about the features X?
– The text! The features X are the entire document
– Xi is the i-th word in the article

Bag of Words Approach

aardvark 0
about    2
all      2
Africa   1
apple    0
anxious  0
...
gas      1
...
oil      1
…
Zaire    0

NB with Bag of Words for text classification
• Learning phase:
– Prior P(Y)
• Count the number of documents for each topic

– P(Xi|Y)
• Just considering the documents assigned to topic Y, find the fraction of docs containing a given word; remember this distribution is shared across all positions i
• Test phase:
– For each document
• Use the naïve Bayes decision rule (see the sketch below)
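Below is a minimal sketch (not from the slides) of the learning and test phases just described, using a Bernoulli bag-of-words representation; the names `train_nb_text` and `classify` are hypothetical, and an `eps` floor stands in for proper smoothing of unseen words.

```python
import math
from collections import Counter

def train_nb_text(docs):
    """docs: list of (set_of_words, label). Returns prior and word likelihoods."""
    labels = Counter(label for _, label in docs)
    prior = {y: c / len(docs) for y, c in labels.items()}
    word_counts = {y: Counter() for y in labels}
    for words, y in docs:
        word_counts[y].update(set(words))          # count docs containing each word
    likelihood = {y: {w: c / labels[y] for w, c in word_counts[y].items()}
                  for y in labels}
    return prior, likelihood

def classify(words, prior, likelihood, eps=1e-6):
    """Naive Bayes decision rule in log space; eps guards against log(0)."""
    def score(y):
        s = math.log(prior[y])
        for w in set(words):
            s += math.log(likelihood[y].get(w, eps))
        return s
    return max(prior, key=score)

# toy usage
docs = [({"cheap", "viagra"}, "Spam"), ({"meeting", "today"}, "NotSpam")]
prior, likelihood = train_nb_text(docs)
print(classify({"cheap", "pills"}, prior, likelihood))  # -> "Spam"
```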

Naive Bayes = Linear Classifier

Theorem: assume that xi ∈ {0, 1} for all i ∈ [1, N]. Then the Naive Bayes classifier is defined by
    x ↦ sgn(w · x + b),
where
    wi = log( Pr[xi = 1 | +1] / Pr[xi = 1 | −1] ) − log( Pr[xi = 0 | +1] / Pr[xi = 0 | −1] )
and
    b = log( Pr[+1] / Pr[−1] ) + ∑_{i=1}^N log( Pr[xi = 0 | +1] / Pr[xi = 0 | −1] ).

Proof: observe that for any i ∈ [1, N],
    log( Pr[xi | +1] / Pr[xi | −1] )
      = ( log( Pr[xi = 1 | +1] / Pr[xi = 1 | −1] ) − log( Pr[xi = 0 | +1] / Pr[xi = 0 | −1] ) ) xi
        + log( Pr[xi = 0 | +1] / Pr[xi = 0 | −1] ).
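The following sketch (my own, not from the slide) illustrates the theorem numerically for a small Bernoulli NB model with made-up parameters: it builds w and b from the conditional probabilities exactly as above and checks that the linear score w·x + b equals the NB log-odds for every binary input.

```python
import math
import itertools

# hypothetical Bernoulli NB parameters: p_pos[i] = Pr[x_i = 1 | +1], p_neg[i] = Pr[x_i = 1 | -1]
p_pos = [0.9, 0.2, 0.7]
p_neg = [0.4, 0.6, 0.3]
prior_pos, prior_neg = 0.6, 0.4

# weights and bias exactly as in the theorem
w = [math.log(p_pos[i] / p_neg[i]) - math.log((1 - p_pos[i]) / (1 - p_neg[i]))
     for i in range(3)]
b = math.log(prior_pos / prior_neg) + sum(
    math.log((1 - p_pos[i]) / (1 - p_neg[i])) for i in range(3))

for x in itertools.product([0, 1], repeat=3):
    # NB log-odds: log Pr[+1]/Pr[-1] + sum_i log Pr[x_i | +1] / Pr[x_i | -1]
    log_odds = math.log(prior_pos / prior_neg)
    for i, xi in enumerate(x):
        log_odds += math.log(p_pos[i] if xi else 1 - p_pos[i])
        log_odds -= math.log(p_neg[i] if xi else 1 - p_neg[i])
    linear = sum(wi * xi for wi, xi in zip(w, x)) + b
    assert abs(log_odds - linear) < 1e-9   # the two scores agree, so the signs agree
```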

[Slide from Mehryar Mohri]

Summary

Bayesian prediction:
• requires solving density estimation problems.
• often difficult to estimate Pr[x | y] for x ∈ R^N.
• but simple and easy to apply; widely used.

Naive Bayes:
• strong assumption.
• straightforward estimation problem.
• specific linear classifier.
• sometimes surprisingly good performance.

Let's take a(nother) probabilistic approach!!!

• Previously: directly estimate the data distribution P(X,Y)!
– challenging due to the size of the distribution!
– make the Naïve Bayes assumption: only need P(Xi|Y)!
• But wait, we classify according to:

– maxY P(Y|X)
• Why not learn P(Y|X) directly?

Generative vs. Discriminative Classifiers

• Want to learn: X → Y
– X: features
– Y: target classes

P(Y | X) ∝ P(X | Y) P(Y)

• Generative classifier, e.g., Naïve Bayes:
– Assume some functional form for P(X|Y), P(Y)
– Estimate the parameters of P(X|Y), P(Y) directly from training data
– Use Bayes' rule to calculate P(Y | X = x)
– This is an example of a "generative" model
• Indirect computation of P(Y|X) through Bayes' rule

• As a result, can also generate a sample of the data, P(X) = ∑y P(y) P(X|y)

– Can easily handle missing data
• Discriminative classifiers, e.g., Logistic Regression:
– Assume some functional form for P(Y|X)
– Estimate the parameters of P(Y|X) directly from training data
– This is the "discriminative" (or "conditional") model
• Directly learn P(Y|X)
• But cannot obtain a sample of the data, because P(X) is not available

Logistic Regression

• Learn P(Y|X) directly!
• Assume a particular functional form
– ? On one side we say P(Y=1|X)=1, and on the other P(Y=1|X)=0
– But this is not differentiable (hard to learn)… and doesn't allow for label noise...

[Figure: a step function in X, with P(Y=1|X) = 1 on one side of the threshold and P(Y=1|X) = 0 on the other]

Logistic Regression
• Learn P(Y|X) directly!
• Assume a particular functional form: the logistic function (sigmoid), 1 / (1 + e^(−z))
• Sigmoid applied to a linear function of the data
• Features can be discrete or continuous!

[From Tom M. Mitchell, Machine Learning chapter draft, copyright 2010:]

where the superscript j refers to the jth training example, and where δ(Y = yk) is 1 if Y = yk and 0 otherwise. Note the role of δ here is to select only those training examples for which Y = yk.

The maximum likelihood estimator for σ²_ik is

    σ̂²_ik = ( 1 / ∑_j δ(Y^j = yk) ) ∑_j (X_i^j − µ̂_ik)² δ(Y^j = yk)                    (14)

This maximum likelihood estimator is biased, so the minimum variance unbiased estimator (MVUE) is sometimes used instead. It is

    σ̂²_ik = ( 1 / (∑_j δ(Y^j = yk) − 1) ) ∑_j (X_i^j − µ̂_ik)² δ(Y^j = yk)              (15)

3 Logistic Regression

Logistic Regression is an approach to learning functions of the form f : X → Y, or P(Y|X) in the case where Y is discrete-valued, and X = ⟨X1 ... Xn⟩ is any vector containing discrete or continuous variables. In this section we will primarily consider the case where Y is a boolean variable, in order to simplify notation. In the final subsection we extend our treatment to the case where Y takes on any finite number of discrete values.

Logistic Regression assumes a parametric form for the distribution P(Y|X), then directly estimates its parameters from the training data. The parametric model assumed by Logistic Regression in the case where Y is boolean is:

    P(Y = 1 | X) = 1 / ( 1 + exp(w0 + ∑_{i=1}^n wi Xi) )                                (16)

and

    P(Y = 0 | X) = exp(w0 + ∑_{i=1}^n wi Xi) / ( 1 + exp(w0 + ∑_{i=1}^n wi Xi) )        (17)

Notice that equation (17) follows directly from equation (16), because the sum of these two probabilities must equal 1.

One highly convenient property of this form for P(Y|X) is that it leads to a simple linear expression for classification. To classify any given X we generally want to assign the value yk that maximizes P(Y = yk | X). Put another way, we assign the label Y = 0 if the following condition holds:

    1 < P(Y = 0 | X) / P(Y = 1 | X)

Substituting from equations (16) and (17), this becomes

    1 < exp(w0 + ∑_{i=1}^n wi Xi)

Logistic Function in n Dimensions

Sigmoid applied to a linear function of the data:

[Figure: the sigmoid Y = 1 / (1 + exp(−X)) plotted against X; Y ranges from 0 to 1]

Figure 1: Form of the logistic function. In Logistic Regression, P(Y|X) is assumed to follow this form.

Features can be discrete or continuous!

Logistic Regression: decision boundary
• Prediction: output the Y with the highest P(Y|X)
– For binary Y, output Y = 0 if w0 + ∑i wi Xi > 0
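As a small illustration of equations (16)–(17) and the decision boundary above, here is a sketch in Python (mine, not from the text). It follows Mitchell's sign convention, where P(Y=1|X) = 1 / (1 + exp(w0 + ∑i wi Xi)), so Y = 0 is predicted exactly when the linear score is positive; the parameter values are made up.

```python
import math

def p_y1_given_x(x, w0, w):
    """Equation (16): P(Y=1 | X) = 1 / (1 + exp(w0 + sum_i w_i x_i))."""
    z = w0 + sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(z))

def predict(x, w0, w):
    """Decision boundary: output Y = 0 exactly when w0 + sum_i w_i x_i > 0."""
    return 0 if p_y1_given_x(x, w0, w) < 0.5 else 1

# toy parameters (hypothetical)
w0, w = -1.0, [2.0, -3.0]
x = [0.5, 1.0]
print(p_y1_given_x(x, w0, w))   # P(Y=1|X) for this x
print(predict(x, w0, w))        # predicted label, 0 or 1
```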

3.1 Form of P(Y|X) for Gaussian Naive Bayes Classifier

Here we derive the form of P(Y|X) entailed by the assumptions of a Gaussian Naive Bayes (GNB) classifier, showing that it is precisely the form used by Logistic Regression and summarized in equations (16) and (17). In particular, consider a GNB based on the following modeling assumptions:

• Y is boolean, governed by a Bernoulli distribution, with parameter π = P(Y = 1)

• X = ⟨X1 ... Xn⟩, where each Xi is a continuous random variable

Understanding Sigmoids

[Figure: sigmoid curves for (w0, w1) = (−2, −1), (0, −1), and (0, −0.5)]

Likelihood vs. Conditional Likelihood

Generative (Naïve Bayes) maximizes the data likelihood: ln P(D | w) = ∑_j ln P(x^j, y^j | w)

Discriminative (Logistic Regression) maximizes the conditional data likelihood: ln P(D_Y | D_X, w) = ∑_j ln P(y^j | x^j, w)

Discriminative models can't compute P(x^j | w)! Or, … "They don't waste effort learning P(X)" – focus only on P(Y|X), which is all that matters for classification (see the sketch below).
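To make the conditional likelihood objective concrete, here is a minimal sketch (not from the slides) of the conditional log-likelihood for binary logistic regression and one gradient-ascent step. It uses the common convention P(Y=1|X) = 1 / (1 + exp(−(w0 + ∑i wi Xi))), which differs from Mitchell's equation (16) by a sign; the function names are hypothetical.

```python
import math

def conditional_log_likelihood(w0, w, data):
    """ln P(D_Y | D_X, w) = sum_j [ y^j ln P(1|x^j) + (1 - y^j) ln P(0|x^j) ].

    data: list of (x, y) pairs with x a list of floats and y in {0, 1}.
    """
    ll = 0.0
    for x, y in data:
        z = w0 + sum(wi * xi for wi, xi in zip(w, x))
        p1 = 1.0 / (1.0 + math.exp(-z))
        ll += math.log(p1) if y == 1 else math.log(1.0 - p1)
    return ll

def gradient_ascent_step(w0, w, data, lr=0.1):
    """One step of gradient ascent on the conditional log-likelihood."""
    g0, g = 0.0, [0.0] * len(w)
    for x, y in data:
        z = w0 + sum(wi * xi for wi, xi in zip(w, x))
        p1 = 1.0 / (1.0 + math.exp(-z))
        g0 += y - p1                      # d ll / d w0
        for i, xi in enumerate(x):
            g[i] += (y - p1) * xi         # d ll / d w_i
    return w0 + lr * g0, [wi + lr * gi for wi, gi in zip(w, g)]

# toy usage: the conditional log-likelihood increases toward 0 as the fit improves
data = [([1.0, 0.0], 1), ([0.0, 1.0], 0)]
w0, w = 0.0, [0.0, 0.0]
for _ in range(100):
    w0, w = gradient_ascent_step(w0, w, data)
print(conditional_log_likelihood(w0, w, data))
```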