Naïve Bayes Lecture 17
David Sontag, New York University
Slides adapted from Luke Zettlemoyer, Carlos Guestrin, Dan Klein, and Mehryar Mohri

Recap: MLE for a Bernoulli (a sequence of coin flips with $\alpha_H$ heads and $\alpha_T$ tails)

$\hat{\theta} = \arg\max_\theta \ln P(\mathcal{D} \mid \theta)$

$\frac{d}{d\theta} \ln P(\mathcal{D} \mid \theta) = \frac{d}{d\theta} \ln\!\left[\theta^{\alpha_H}(1-\theta)^{\alpha_T}\right] = \frac{d}{d\theta}\left[\alpha_H \ln\theta + \alpha_T \ln(1-\theta)\right] = \frac{\alpha_H}{\theta} - \frac{\alpha_T}{1-\theta} = 0$

Recap: how many flips do we need?

$\delta \ge 2e^{-2N\epsilon^2} \ge P(\text{mistake})$

$\ln\delta \ge \ln 2 - 2N\epsilon^2 \quad\Rightarrow\quad N \ge \frac{\ln(2/\delta)}{2\epsilon^2}$

For $\epsilon = 0.1$, $\delta = 0.05$:  $N \ge \frac{\ln(2/0.05)}{2 \times 0.1^2} \approx \frac{3.8}{0.02} = 190$

Bayesian Learning
• Use Bayes' rule:  posterior = (data likelihood × prior) / normalization, i.e. $P(\theta \mid \mathcal{D}) = \frac{P(\mathcal{D} \mid \theta)\, P(\theta)}{P(\mathcal{D})}$
• Or equivalently: $P(\theta \mid \mathcal{D}) \propto P(\mathcal{D} \mid \theta)\, P(\theta)$
• For uniform priors, $P(\theta) \propto 1$, this reduces to maximum likelihood estimation: $P(\theta \mid \mathcal{D}) \propto P(\mathcal{D} \mid \theta)$

Prior + Data -> Posterior
• Likelihood function: $P(\mathcal{D} \mid \theta) = \theta^{\alpha_H}(1-\theta)^{\alpha_T}$
• Posterior: $P(\theta \mid \mathcal{D}) \propto \theta^{\alpha_H}(1-\theta)^{\alpha_T}\,\theta^{\beta_H-1}(1-\theta)^{\beta_T-1} = \theta^{\alpha_H+\beta_H-1}(1-\theta)^{\alpha_T+\beta_T-1} = \text{Beta}(\alpha_H+\beta_H,\ \alpha_T+\beta_T)$

What about continuous variables?
• Billionaire says: If I am measuring a continuous variable, what can you do for me?
• You say: Let me tell you about Gaussians…

Some properties of Gaussians
• Affine transformations (multiplying by a scalar and adding a constant) of a Gaussian are Gaussian
  – X ~ N(µ, σ²)
  – Y = aX + b  ⇒  Y ~ N(aµ + b, a²σ²)
• Sum of (independent) Gaussians is Gaussian
  – X ~ N(µ_X, σ²_X)
  – Y ~ N(µ_Y, σ²_Y)
  – Z = X + Y  ⇒  Z ~ N(µ_X + µ_Y, σ²_X + σ²_Y)
• Easy to differentiate, as we will see soon!

Learning a Gaussian
• Collect a bunch of data
  – Hopefully, i.i.d. samples
  – e.g., exam scores
• Learn parameters
  – Mean: µ
  – Variance: σ

    i    Exam score
    0    85
    1    95
    2    100
    3    12
    …    …
    99   89

MLE for Gaussian
• Prob. of i.i.d. samples D = {x1, …, xN}: $\mu_{MLE}, \sigma_{MLE} = \arg\max_{\mu,\sigma} P(\mathcal{D} \mid \mu, \sigma)$
• Log-likelihood of data: $\ln P(\mathcal{D} \mid \mu, \sigma) = -N \ln\!\left(\sigma\sqrt{2\pi}\right) - \sum_{i=1}^N \frac{(x_i-\mu)^2}{2\sigma^2}$

Your second learning algorithm: MLE for mean of a Gaussian
• What's MLE for mean? (See the numerical sketch below; the derivation follows it.)
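A minimal numpy sketch, not from the slides, that computes the closed-form estimators the next slides derive; the scores are made-up stand-ins for the exam-score table, used only for illustration.

```python
import numpy as np

# Hypothetical scores standing in for the exam-score table (illustrative only).
x = np.array([85.0, 95.0, 100.0, 12.0, 89.0])
N = len(x)

# Closed-form MLE, derived on the next slides.
mu_mle = x.sum() / N                      # (1/N) * sum_i x_i
var_mle = ((x - mu_mle) ** 2).sum() / N   # biased: divides by N

# The unbiased variance estimator divides by N - 1 (numpy: ddof=1).
var_unbiased = ((x - mu_mle) ** 2).sum() / (N - 1)

assert np.isclose(mu_mle, x.mean())
assert np.isclose(var_mle, np.var(x))            # np.var defaults to ddof=0 (the MLE)
assert np.isclose(var_unbiased, np.var(x, ddof=1))
print(mu_mle, var_mle, var_unbiased)
```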
• $\mu_{MLE}, \sigma_{MLE} = \arg\max_{\mu,\sigma} P(\mathcal{D} \mid \mu, \sigma)$
• Set the derivative of the log-likelihood with respect to µ to zero:
  $\frac{\partial}{\partial \mu} \ln P(\mathcal{D} \mid \mu, \sigma) = \sum_{i=1}^N \frac{x_i - \mu}{\sigma^2} = 0 \;\Rightarrow\; -\sum_{i=1}^N x_i + N\mu = 0 \;\Rightarrow\; \mu_{MLE} = \frac{1}{N}\sum_{i=1}^N x_i$

MLE for variance
• Again, set the derivative to zero:
  $\frac{\partial}{\partial \sigma} \ln P(\mathcal{D} \mid \mu, \sigma) = -\frac{N}{\sigma} + \sum_{i=1}^N \frac{(x_i - \mu)^2}{\sigma^3} = 0 \;\Rightarrow\; \sigma^2_{MLE} = \frac{1}{N}\sum_{i=1}^N (x_i - \mu_{MLE})^2$

Learning Gaussian parameters
• MLE: $\mu_{MLE} = \frac{1}{N}\sum_{i=1}^N x_i$, $\quad \sigma^2_{MLE} = \frac{1}{N}\sum_{i=1}^N (x_i - \mu_{MLE})^2$
• BTW, the MLE for the variance of a Gaussian is biased
  – Expected result of estimation is not the true parameter!
  – Unbiased variance estimator: $\hat\sigma^2 = \frac{1}{N-1}\sum_{i=1}^N (x_i - \mu_{MLE})^2$

Bayesian learning of Gaussian parameters
• Conjugate priors
  – Mean: Gaussian prior
  – Variance: Wishart distribution
• Prior for mean: a Gaussian over µ

Bayesian Prediction
• Definition: the expected conditional loss of predicting $\hat y \in \mathcal{Y}$ is
  $\mathcal{L}[\hat y \mid x] = \sum_{y \in \mathcal{Y}} L(\hat y, y)\,\Pr[y \mid x]$
• Bayesian decision: predict the class minimizing the expected conditional loss, that is
  $y^* = \arg\min_{\hat y \in \mathcal{Y}} \mathcal{L}[\hat y \mid x] = \arg\min_{\hat y \in \mathcal{Y}} \sum_{y \in \mathcal{Y}} L(\hat y, y)\,\Pr[y \mid x]$
• Zero-one loss: $y^* = \arg\max_{y} \Pr[y \mid x]$ — the Maximum a Posteriori (MAP) principle

Binary Classification – Illustration
[Figure: the class posteriors Pr[y1 | x] and Pr[y2 | x] plotted as functions of x]

Maximum a Posteriori (MAP)
• Definition: the MAP principle consists of predicting according to the rule
  $\hat y = \arg\max_{y \in \mathcal{Y}} \Pr[y \mid x]$
• Equivalently, by Bayes' formula:
  $\hat y = \arg\max_{y \in \mathcal{Y}} \frac{\Pr[x \mid y]\,\Pr[y]}{\Pr[x]} = \arg\max_{y \in \mathcal{Y}} \Pr[x \mid y]\,\Pr[y]$
• How do we determine Pr[x|y] and Pr[y]? Density estimation problem.

Density Estimation
• Data: sample drawn i.i.d. from a set X according to some distribution D: x1, …, xm ∈ X
• Problem: find a distribution p out of a set P that best estimates D
• Note: we will study density estimation specifically in a future lecture.
[Slide from Mehryar Mohri]

Density estimation
• Can make a parametric assumption, e.g. that Pr(x|y) is a multivariate Gaussian distribution
• When the dimension of x is small enough, can use a non-parametric approach (e.g., kernel density estimation)

Difficulty of (naively) estimating high-dimensional distributions
• Can we directly estimate the data distribution P(X,Y)?
• How do we represent these? How many parameters?
  – Prior, P(Y): suppose Y is composed of k classes (k − 1 parameters)
  – Likelihood, P(X|Y): suppose X is composed of n binary features (2^n − 1 parameters per class)
• Complex model! High variance with limited data!!!

Conditional Independence
• X is conditionally independent of Y given Z if the probability distribution for X is independent of the value of Y, given the value of Z: P(X | Y, Z) = P(X | Z)
• e.g., P(Thunder | Rain, Lightning) = P(Thunder | Lightning)
• Equivalent to: P(X, Y | Z) = P(X | Z) P(Y | Z)

Naïve Bayes
• Naïve Bayes assumption: features are independent given the class
  – P(X1, X2 | Y) = P(X1 | Y) P(X2 | Y)
  – More generally: P(X1, …, Xn | Y) = ∏_i P(Xi | Y)
• How many parameters now? Suppose X is composed of n binary features: only n parameters per class

The Naïve Bayes Classifier
• Given:
  – Prior P(Y)
  – n conditionally independent features X1, …, Xn given the class Y
  – For each Xi, we have the likelihood P(Xi | Y)
  [Figure: graphical model with Y as the parent of X1, X2, …, Xn]
• Decision rule: $y^* = \arg\max_y P(y) \prod_i P(x_i \mid y)$ (see the sketch below)
• If the conditional independence assumption holds, NB is the optimal classifier! (It typically doesn't hold.)
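To make the parameter counting and the decision rule concrete, here is a minimal sketch of a Bernoulli Naïve Bayes classifier. It is not from the slides: the function names, the toy data, and the add-one (Laplace) smoothing term are illustrative assumptions.

```python
import numpy as np

def train_nb(X, y, alpha=1.0):
    """Bernoulli Naive Bayes: estimate P(Y) and P(X_i = 1 | Y) from counts.
    X: (m, n) binary feature matrix; y: (m,) class labels.
    alpha is Laplace smoothing, a common choice not specified on the slides."""
    classes = np.unique(y)
    # Prior P(Y): k classes -> k - 1 free parameters.
    prior = np.array([(y == c).mean() for c in classes])
    # Likelihood P(X_i = 1 | Y = c): one parameter per (feature, class) -> n * k in total,
    # versus (2^n - 1) * k for the full joint likelihood P(X | Y).
    cond = np.array([(X[y == c].sum(axis=0) + alpha) / ((y == c).sum() + 2 * alpha)
                     for c in classes])
    return classes, prior, cond

def predict_nb(x, classes, prior, cond):
    """Decision rule: argmax_y  log P(y) + sum_i log P(x_i | y)."""
    log_post = np.log(prior) + (x * np.log(cond) + (1 - x) * np.log(1 - cond)).sum(axis=1)
    return classes[np.argmax(log_post)]

# Toy usage: m = 4 examples, n = 4 binary features, k = 2 classes.
X = np.array([[1, 0, 1, 0],
              [1, 1, 1, 0],
              [0, 0, 0, 1],
              [0, 1, 0, 1]])
y = np.array([0, 0, 1, 1])
print(predict_nb(np.array([1, 0, 1, 1]), *train_nb(X, y)))
```

The decision rule is evaluated in log space, which leaves the argmax unchanged while avoiding underflow when many per-feature probabilities are multiplied together.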