
Machine Learning
10-701/15-781, Spring 2008

Naïve Bayes Classifiers

Eric Xing
Lecture 3, January 23, 2008

Reading: Chap. 4 CB and handouts

Classification

- Representing the data
- Choosing a hypothesis class
- Learning: h : X → Y
  - X – features
  - Y – target classes

Suppose you know the following …

- Class-conditional distribution: P(X|Y)

  p(X | Y = 1) = p1(X; μ1, Σ1)
  p(X | Y = 2) = p2(X; μ2, Σ2)

- Bayes classifier:

  h(X) = arg max_y P(Y = y | X)

- Class prior (i.e., "weight"): P(Y)
- This is a generative model of the data!

Optimal classification

- Theorem: the Bayes classifier is optimal!
- That is, for any other classifier h:

  P(h_Bayes(X) ≠ Y) ≤ P(h(X) ≠ Y)

- Proof:
- Recall last lecture:
- How to learn a Bayes classifier?

Recall Bayes Rule

P(Y | X) = P(X | Y) P(Y) / P(X)

Which is shorthand for:

P(Y = y_i | X = x_j) = P(X = x_j | Y = y_i) P(Y = y_i) / P(X = x_j)

Equivalently:

P(Y = y_i | X = x_j) = P(X = x_j | Y = y_i) P(Y = y_i) / Σ_k P(X = x_j | Y = y_k) P(Y = y_k)

Common abbreviation:

P(Y | X) ∝ P(X | Y) P(Y)

Learning Bayes Classifier

- Training data: labeled examples (X, Y)
- Learning = estimating P(X|Y) and P(Y)
- Classification = using Bayes rule to calculate P(Y | X_new)

How hard is it to learn the optimal classifier?

- How do we represent these? How many parameters?
  - Prior, P(Y):
    - Suppose Y is composed of k classes
  - Likelihood, P(X|Y):
    - Suppose X is composed of n binary features
- Complex model! High variance with limited data!!! (see the count below)
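As a quick worked count (my own addition, following the setup above): a k-class prior needs k − 1 free parameters, and a full joint distribution over n binary features needs 2^n − 1 per class, so even a modest problem is hopeless to estimate directly:

    \underbrace{(k-1)}_{P(Y)} \;+\; \underbrace{k\,(2^{n}-1)}_{P(X\mid Y)}
    \;\xrightarrow{\;k=2,\ n=30\;}\; 1 + 2\,(2^{30}-1) \approx 2.1 \times 10^{9} \text{ parameters.}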

Naïve Bayes:

assuming that Xi and Xj are conditionally independent given Y, for all i ≠ j

Conditional Independence

- X is conditionally independent of Y given Z, if the distribution governing X is independent of the value of Y, given the value of Z
- Which we often write:

  P(X | Y, Z) = P(X | Z)

- e.g., P(Thunder | Rain, Lightning) = P(Thunder | Lightning)
- Equivalent to:

  P(X, Y | Z) = P(X | Z) P(Y | Z)

The Naïve Bayes assumption

- Naïve Bayes assumption:
  - Features are independent given the class:

    P(X1, X2 | Y) = P(X1 | X2, Y) P(X2 | Y) = P(X1 | Y) P(X2 | Y)

  - More generally:

    P(X1, …, Xn | Y) = ∏_i P(Xi | Y)

- How many parameters now?
  - Suppose X is composed of n binary features (see the count below)
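Again as a worked count (my own addition): under the NB assumption each class needs only one parameter per binary feature, so the k(2^n − 1) likelihood parameters collapse to k·n:

    k\,(2^{n}-1) \;\longrightarrow\; k\,n \qquad\text{e.g. } k=2,\ n=30:\ \approx 2.1\times 10^{9} \;\to\; 60.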

The Naïve Bayes Classifier

- Given:
  - Prior P(Y)
  - n conditionally independent features X1, …, Xn given the class Y
  - For each Xi, we have the likelihood P(Xi | Y)
- Decision rule:

  y* = h_NB(x) = arg max_y P(y) ∏_i P(xi | y)

- If the assumption holds, NB is the optimal classifier! (a minimal implementation sketch follows)
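A minimal sketch of this decision rule in Python (my own illustration, not from the slides; it assumes the prior and the per-feature likelihood tables are already given, e.g. estimated as on the following slides):

    import math

    def nb_classify(x, prior, likelihood):
        """Return arg max_y P(y) * prod_i P(x_i | y), computed in log space.

        prior:      dict  y -> P(Y = y)
        likelihood: dict  (i, x_i, y) -> P(X_i = x_i | Y = y)
        x:          list of feature values x_1, ..., x_n
        """
        best_y, best_score = None, -math.inf
        for y, p_y in prior.items():
            # log P(y) + sum_i log P(x_i | y)
            score = math.log(p_y) + sum(
                math.log(likelihood[(i, xi, y)]) for i, xi in enumerate(x)
            )
            if score > best_score:
                best_y, best_score = y, score
        return best_y

Working in log space avoids numerical underflow when n is large (e.g. the 1000-word documents later in the lecture).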

Naïve Bayes Algorithm

- Train_Naïve_Bayes(examples):
  - for each* value y_k: estimate P(Y = y_k)
  - for each* value x_ij of each attribute Xi: estimate P(Xi = x_ij | Y = y_k)
- Classify(X_new):

  Y_new = arg max_{y_k} P(Y = y_k) ∏_i P(X_i^new | Y = y_k)

* each set of probabilities must sum to 1, so only n − 1 of the n parameters need to be estimated...

Learning NB: parameter estimation

- Maximum Likelihood Estimate (MLE): choose θ that maximizes the probability of the observed data D:

  θ_MLE = arg max_θ P(D | θ)

- Maximum a Posteriori (MAP) estimate: choose θ that is most probable given the prior and the data:

  θ_MAP = arg max_θ P(θ | D) = arg max_θ P(D | θ) P(θ)

MLE for the parameters of NB

- Discrete features
- Maximum likelihood estimates (MLEs):
  - Given a dataset, let Count(A = a, B = b) ≡ number of examples where A = a and B = b

  P̂(Y = b) = Count(Y = b) / N
  P̂(Xi = a | Y = b) = Count(Xi = a, Y = b) / Count(Y = b)
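A small sketch of these count-based MLEs (my own illustration, assuming examples are given as lists of discrete feature values plus a class label):

    from collections import Counter, defaultdict

    def train_nb_mle(X, Y):
        """MLE for a discrete Naive Bayes model.

        X: list of examples, each a list of n discrete feature values
        Y: list of class labels, same length as X
        Returns (prior, likelihood) with
          prior[y]              = Count(Y = y) / N
          likelihood[(i, a, y)] = Count(Xi = a, Y = y) / Count(Y = y)
        """
        N = len(Y)
        class_count = Counter(Y)                  # Count(Y = y)
        joint_count = defaultdict(int)            # Count(Xi = a, Y = y)
        for x, y in zip(X, Y):
            for i, a in enumerate(x):
                joint_count[(i, a, y)] += 1

        prior = {y: c / N for y, c in class_count.items()}
        likelihood = {
            (i, a, y): c / class_count[y] for (i, a, y), c in joint_count.items()
        }
        return prior, likelihood

The outputs plug directly into the nb_classify sketch above; note that any (feature, value, class) combination never seen in training gets probability zero, which is exactly the problem discussed two slides below.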

Subtleties of NB classifier 1 – Violating the NB assumption

- Often the Xi are not really conditionally independent
- We use Naïve Bayes in many cases anyway, and it often works pretty well
  - often the right classification, even when not the right probability (see [Domingos & Pazzani, 1996])
- But the resulting probabilities P(Y | X_new) are biased toward 1 or 0 (why?)

Subtleties of NB classifier 2 – Insufficient training data

- What if you never see a training instance where X1000 = a when Y = b?
  - e.g., Y = {SpamEmail or not}, X1000 = {'Rolex'}
  - Then the MLE gives P(X1000 = T | Y = T) = 0
- Thus, no matter what values X2, …, Xn take:

  P(Y = T | X1, X2, …, X1000 = T, …, Xn) = 0

- What now???

MAP for the parameters of NB

- Discrete features
- Maximum a Posteriori (MAP) estimates:
  - Given a prior:
    - Consider a binary feature
    - θ is a Bernoulli rate, with a Beta prior:

      P(\theta; \alpha_T, \alpha_F) = \frac{\Gamma(\alpha_T + \alpha_F)}{\Gamma(\alpha_T)\,\Gamma(\alpha_F)}\,\theta^{\alpha_T - 1}(1-\theta)^{\alpha_F - 1} = \frac{\theta^{\alpha_T - 1}(1-\theta)^{\alpha_F - 1}}{B(\alpha_T, \alpha_F)}

  - Let β_a = Count(X = a) ≡ number of examples where X = a

Bayesian learning for NB parameters – a.k.a. smoothing

- Posterior distribution of θ:
  - Bernoulli:

    P(θ | D) ∝ θ^(β_T + α_T − 1) (1 − θ)^(β_F + α_F − 1),   i.e. Beta(β_T + α_T, β_F + α_F)

  - Multinomial: with a Dirichlet(α_1, …, α_K) prior, the posterior is Dirichlet(β_1 + α_1, …, β_K + α_K)
- MAP estimate:

  θ̂_MAP = (β_T + α_T − 1) / (β_T + β_F + α_T + α_F − 2)

- The Beta prior is equivalent to extra thumbtack flips
- As N → ∞, the prior is "forgotten"
- But for small sample sizes, the prior is important!
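A quick numeric illustration of the MAP estimate (my own, assuming a symmetric Beta(2, 2) prior, i.e. one virtual flip of each outcome, and data with β_T = 0, β_F = 5):

    \hat{\theta}_{MAP} = \frac{\beta_T + \alpha_T - 1}{\beta_T + \beta_F + \alpha_T + \alpha_F - 2}
    = \frac{0 + 2 - 1}{0 + 5 + 2 + 2 - 2} = \frac{1}{7} \approx 0.14,
    \qquad\text{whereas}\qquad \hat{\theta}_{MLE} = \frac{0}{5} = 0.

So the zero-count problem of the previous slide disappears.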

MAP for the parameters of NB

- Dataset of N examples
  - Let β_iab = Count(Xi = a, Y = b) ≡ number of examples where Xi = a and Y = b
  - Let γ_b = Count(Y = b)
- Prior:
  - Q(Xi | Y) ∝ Multinomial(α_i1, …, α_iK) or Multinomial(α/K)
  - Q(Y) ∝ Multinomial(τ_1, …, τ_M) or Multinomial(τ/M)
  - m "virtual" examples

- MAP estimate: plug the real counts plus the virtual counts into the estimates above
- Now, even if you never observe some feature value with some class, the estimate is never zero (a smoothed version of the earlier training sketch follows)
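A sketch of such smoothed estimates, extending the MLE sketch above (my own illustration; the single symmetric pseudo-count `alpha` is an assumed simplification of the per-value prior counts, with alpha = 1 corresponding to Laplace "add-one" smoothing):

    from collections import Counter, defaultdict

    def train_nb_smoothed(X, Y, feature_values, alpha=1.0):
        """Naive Bayes estimates with `alpha` virtual examples per value.

        feature_values[i] is the set of possible values of feature Xi,
        so even unseen (value, class) pairs get nonzero probability.
        """
        N = len(Y)
        classes = set(Y)
        class_count = Counter(Y)
        joint_count = defaultdict(int)
        for x, y in zip(X, Y):
            for i, a in enumerate(x):
                joint_count[(i, a, y)] += 1

        # smoothed prior over the classes
        prior = {
            y: (class_count[y] + alpha) / (N + alpha * len(classes)) for y in classes
        }
        # smoothed likelihoods over the possible values of each feature
        likelihood = {}
        for i, values in enumerate(feature_values):
            for a in values:
                for y in classes:
                    likelihood[(i, a, y)] = (joint_count[(i, a, y)] + alpha) / (
                        class_count[y] + alpha * len(values)
                    )
        return prior, likelihood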

Text classification

- Classify e-mails
  - Y = {Spam, NotSpam}
- Classify news articles
  - Y = {what is the topic of the article?}
- Classify webpages
  - Y = {Student, professor, project, …}
- What about the features X?
  - The text!
  - Features X are the entire document – Xi for the i-th word in the article

NB for Text classification

- P(X|Y) is huge!!!
  - An article has at least 1000 words: X = {X1, …, X1000}
  - Xi represents the i-th word in the document, i.e., the domain of Xi is the entire vocabulary, e.g., Webster's Dictionary (or more), 10,000 words, etc.
- The NB assumption helps a lot!!!
  - P(Xi = xi | Y = y) is just the probability of observing word xi in a document on topic y

Bag of words model

- Typical additional assumption – position in the document doesn't matter:

  P(Xi = xi | Y = y) = P(Xk = xi | Y = y)

- "Bag of words" model – order of words on the page ignored
- Sounds really silly, but often works very well!

When the lecture is over, remember to wake up the person sitting next to you in the lecture room.

Bag of words model

The same sentence as a bag of words – order discarded, only the word counts kept:

in is lecture lecture next over person remember room sitting the the the to to up wake when you
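A two-line illustration of this representation (my own, using Python's collections.Counter on a lower-cased, unpunctuated version of the sentence above):

    from collections import Counter

    sentence = ("when the lecture is over remember to wake up the "
                "person sitting next to you in the lecture room")
    bag = Counter(sentence.split())     # word -> count; word order is discarded
    print(bag["the"], bag["lecture"])   # 3 2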

Bag of Words Approach

aardvark  0
about     2
all       2
Africa    1
apple     0
anxious   0
...
gas       1
...
oil       1
...
Zaire     0

NB with Bag of Words for text classification

- Learning phase:
  - Prior P(Y): count how many documents you have from each topic (+ prior)
  - P(Xi|Y): for each topic, count how many times you saw each word in documents of this topic (+ prior)
- Test phase:
  - For each document, use the naïve Bayes decision rule (a sketch of both phases follows)
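A compact sketch of both phases under the bag-of-words model (my own illustration; documents are lists of word tokens, and `alpha` is again an assumed smoothing pseudo-count):

    import math
    from collections import Counter, defaultdict

    def train_text_nb(docs, labels, alpha=1.0):
        """docs: list of documents, each a list of word tokens; labels: topics."""
        vocab = {w for doc in docs for w in doc}
        doc_count = Counter(labels)                    # documents per topic
        word_count = defaultdict(Counter)              # topic -> word -> count
        for doc, y in zip(docs, labels):
            word_count[y].update(doc)

        log_prior = {y: math.log(c / len(docs)) for y, c in doc_count.items()}
        log_like = {}
        for y in doc_count:
            total = sum(word_count[y].values()) + alpha * len(vocab)
            for w in vocab:
                log_like[(w, y)] = math.log((word_count[y][w] + alpha) / total)
        return log_prior, log_like, vocab

    def classify_text_nb(doc, log_prior, log_like, vocab):
        """arg max_y  log P(y) + sum over the document's words of log P(word | y)."""
        return max(
            log_prior,
            key=lambda y: log_prior[y]
            + sum(log_like[(w, y)] for w in doc if w in vocab),
        )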

Twenty News Groups results

Learning curve for Twenty News Groups

What if we have continuous Xi?

- E.g., character recognition: Xi is the i-th pixel
- Gaussian Naïve Bayes (GNB):

  P(X_i = x \mid Y = y_k) = \frac{1}{\sigma_{ik}\sqrt{2\pi}} \exp\!\left(-\frac{(x - \mu_{ik})^2}{2\sigma_{ik}^2}\right)

- Sometimes we assume the variance
  - is independent of Y (i.e., σ_i),
  - or independent of Xi (i.e., σ_k),
  - or both (i.e., σ)

Estimating Parameters: Y discrete, Xi continuous

- Maximum likelihood estimates (j indexes the j-th training example):

  \hat{\mu}_{ik} = \frac{\sum_j X_i^j\,\delta(Y^j = y_k)}{\sum_j \delta(Y^j = y_k)}
  \qquad
  \hat{\sigma}_{ik}^2 = \frac{\sum_j (X_i^j - \hat{\mu}_{ik})^2\,\delta(Y^j = y_k)}{\sum_j \delta(Y^j = y_k)}

  where δ(x) = 1 if x is true, else 0
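A short sketch of these estimates and the resulting classifier (my own illustration; it assumes a separate mean and variance per class and per feature, i.e. the σ_ik case with no variance sharing, and nonzero empirical variances):

    import math
    from collections import defaultdict

    def train_gnb(X, Y):
        """Per-class, per-feature mean and variance (MLE) for Gaussian Naive Bayes.

        X: list of examples, each a list of n real-valued features; Y: class labels.
        """
        groups = defaultdict(list)
        for x, y in zip(X, Y):
            groups[y].append(x)

        params = {}
        for y, rows in groups.items():
            n_k = len(rows)
            means = [sum(col) / n_k for col in zip(*rows)]
            variances = [
                sum((v - m) ** 2 for v in col) / n_k
                for col, m in zip(zip(*rows), means)
            ]
            # store log prior, mu_ik, sigma^2_ik for this class
            params[y] = (math.log(n_k / len(X)), means, variances)
        return params

    def classify_gnb(x, params):
        """arg max_y  log P(y) + sum_i log N(x_i; mu_ik, sigma^2_ik)."""
        def log_gauss(v, m, s2):
            return -0.5 * math.log(2 * math.pi * s2) - (v - m) ** 2 / (2 * s2)
        return max(
            params,
            key=lambda y: params[y][0]
            + sum(log_gauss(v, m, s2)
                  for v, m, s2 in zip(x, params[y][1], params[y][2])),
        )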

Gaussian Naïve Bayes

Example: GNB for classifying mental states [Mitchell et al.]

fMRI:
- ~1 mm resolution
- ~2 images per sec.
- 15,000 voxels/image
- non-invasive, safe
- measures the Blood Oxygen Level Dependent (BOLD) response (typical impulse response lasts ~10 sec)

Brain scans can track activation with precision and sensitivity.

Gaussian Naïve Bayes: learned μ_{voxel,word} for P(BrainActivity | WordCategory = {People, Animal}) [Mitchell et al.]

Learned Bayes models for P(BrainActivity | WordCategory) [Mitchell et al.]
Pairwise classification accuracy (People words vs. Animal words): 85%

What you need to know about Naïve Bayes

- Optimal classification using the Bayes classifier
- Naïve Bayes classifier
  - What's the assumption
  - Why we use it
  - How do we learn it
  - Why Bayesian estimation is important
- Text classification
  - Bag of words model
- Gaussian NB
  - Features are still conditionally independent
  - Each feature has a Gaussian distribution given the class
