
The Naïve Bayes Classifier

Machine Learning

1 Today’s lecture

• The naïve Bayes Classifier

• Learning the naïve Bayes Classifier

• Practical concerns

4 Where are we?

We have seen Bayesian learning
– Using a probabilistic criterion to select a hypothesis
– Maximum a posteriori and maximum likelihood learning
(You should know the difference between them)

We could also learn functions that predict probabilities of outcomes
– Different from using a probabilistic criterion to learn

Maximum a posteriori (MAP) prediction as opposed to MAP learning

5 MAP prediction

Using the Bayes rule for predicting y given an input x:

P(Y = y | X = x) = P(X = x | Y = y) P(Y = y) / P(X = x)

The left-hand side is the posterior of the label being y for this input x

6 MAP prediction

Using the Bayes rule for predicting y given an input x:

P(Y = y | X = x) = P(X = x | Y = y) P(Y = y) / P(X = x)

Predict the label y for the input x using

argmax_y P(Y = y | X = x)

7 MAP prediction

Using the Bayes rule for predicting y given an input x:

P(Y = y | X = x) = P(X = x | Y = y) P(Y = y) / P(X = x)

Predict the label y for the input x using

argmax_y P(X = x | Y = y) P(Y = y)

(The denominator P(X = x) does not depend on y, so it can be dropped from the argmax.)

8 MAP prediction

(Don't confuse this with MAP learning: MAP learning selects a hypothesis with maximum posterior given the data, while MAP prediction selects the label with maximum posterior for a given input.)

Using the Bayes rule for predicting y given an input x:

P(Y = y | X = x) = P(X = x | Y = y) P(Y = y) / P(X = x)

Predict the label y for the input x using

argmax_y P(X = x | Y = y) P(Y = y)

9 MAP prediction

Predict the label y for the input x using

argmax_y P(X = x | Y = y) P(Y = y)

P(X = x | Y = y) is the likelihood of observing this input x when the label is y; P(Y = y) is the prior of the label being y.

All we need are these two sets of probabilities

10 Example: Tennis again

Prior: Without any other information, what is the prior probability that I should play tennis?

Play tennis   P(Play tennis)
Yes           0.3
No            0.7

Likelihood: On days that I do play tennis, what is the probability that the temperature is T and the wind is W?

Temperature   Wind     P(T, W | Tennis = Yes)
Hot           Strong   0.15
Hot           Weak     0.4
Cold          Strong   0.1
Cold          Weak     0.35

On days that I don't play tennis, what is the probability that the temperature is T and the wind is W?

Temperature   Wind     P(T, W | Tennis = No)
Hot           Strong   0.4
Hot           Weak     0.1
Cold          Strong   0.3
Cold          Weak     0.2

13 Example: Tennis again

Input: Temperature = Hot (H), Wind = Weak (W)
Should I play tennis?

(Prior and likelihood tables as above.)

14 Example: Tennis again

MAP prediction: argmax_y P(H, W | Play = y) P(Play = y)

15 Example: Tennis again

P(H, W | Yes) P(Yes) = 0.4 × 0.3 = 0.12
P(H, W | No) P(No) = 0.1 × 0.7 = 0.07

16 Example: Tennis again

Since 0.12 > 0.07, the MAP prediction = Yes
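As a quick check, here is a minimal Python sketch (not from the lecture; the variable and function names are ours) that reproduces this computation from the tables above:

```python
# Prior and joint likelihood tables from the slides above
prior = {"Yes": 0.3, "No": 0.7}
likelihood = {
    "Yes": {("Hot", "Strong"): 0.15, ("Hot", "Weak"): 0.40,
            ("Cold", "Strong"): 0.10, ("Cold", "Weak"): 0.35},
    "No":  {("Hot", "Strong"): 0.40, ("Hot", "Weak"): 0.10,
            ("Cold", "Strong"): 0.30, ("Cold", "Weak"): 0.20},
}

def map_predict(temperature, wind):
    # argmax over labels of P(T, W | label) * P(label)
    scores = {label: likelihood[label][(temperature, wind)] * prior[label]
              for label in prior}
    return max(scores, key=scores.get), scores

label, scores = map_predict("Hot", "Weak")
print(label, scores)   # Yes  {'Yes': 0.12, 'No': 0.07}
```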

17 How hard is it to learn probabilistic models?

    O   T   H   W   Play?
1   S   H   H   W   -
2   S   H   H   S   -
3   O   H   H   W   +
4   R   M   H   W   +
5   R   C   N   W   +
6   R   C   N   S   -
7   O   C   N   S   +
8   S   M   H   W   -
9   S   C   N   W   +
10  R   M   N   W   +
11  S   M   N   S   +
12  O   M   H   S   +
13  O   H   N   W   +
14  R   M   H   S   -

Outlook: S(unny), O(vercast), R(ainy)
Temperature: H(ot), M(edium), C(ool)
Humidity: H(igh), N(ormal), L(ow)
Wind: S(trong), W(eak)

18 How hard is it to learn probabilistic models?

(Same dataset as above.)

We need to learn
1. The prior P(Play?)
2. The likelihoods P(x | Play?)

19 How hard is it to learn probabilistic models?

(Same dataset as above.)

Prior P(Play?)
• A single number (Why only one?)

Likelihood P(X | Play?)
• There are 4 features
• For each value of Play? (+/-), we need a value for each possible assignment: P(x1, x2, x3, x4 | Play?)
• (2^4 – 1) parameters in each case, one for each assignment

20 How hard is it to learn probabilistic models?

(Same dataset as above.)

Prior P(Play?)
• A single number (Why only one?)

Likelihood P(X | Play?)
• There are 4 features
• For each value of Play? (+/-), we need a value for each possible assignment: P(O, T, H, W | Play?)

21 How hard is it to learn probabilistic models?

(Same dataset as above; Outlook, Temperature, Humidity, and Wind take 3, 3, 3, and 2 values respectively.)

Prior P(Play?)
• A single number (Why only one?)

Likelihood P(X | Play?)
• There are 4 features
• For each value of Play? (+/-), we need a value for each possible assignment: P(O, T, H, W | Play?)

22 How hard is it to learn probabilistic models?

• (3 ⋅ 3 ⋅ 3 ⋅ 2 – 1) parameters in each case, one for each assignment

23 How hard is it to learn probabilistic models? In general

Prior P(Y)
• If there are k labels, then k – 1 parameters (why not k?)

Likelihood P(X | Y)
• If there are d Boolean features, we need a value for each possible P(x1, x2, …, xd | y) for each y
• k(2^d – 1) parameters

Need a lot of data to estimate these many numbers!

26 How hard is it to learn probabilistic models?

High model complexity: if there is very limited data, there will be high variance in the estimated parameters.

27 How hard is it to learn probabilistic models?

How can we deal with this?

28 How hard is it to learn probabilistic models?

Answer: Make independence assumptions

29 Recall: Conditional independence

Suppose X, Y and Z are random variables

X is conditionally independent of Y given Z if the probability distribution of X is independent of the value of Y when Z is observed

Or equivalently, P(X | Y, Z) = P(X | Z), i.e., P(X, Y | Z) = P(X | Z) P(Y | Z)

30 Modeling the features

P(x1, x2, ⋯, xd | y) required k(2^d – 1) parameters

What if all the features were conditionally independent given the label? The Naïve Bayes Assumption

That is, P(x1, x2, ⋯, xd | y) = P(x1 | y) P(x2 | y) ⋯ P(xd | y)

Requires only d numbers for each label. kd parameters overall. Not bad!

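To make the savings concrete, a quick sketch (assuming k labels and d binary features; these helper functions are ours, not from the slides) comparing the two parameter counts:

```python
def full_joint_params(k, d):
    # One probability per joint assignment of d binary features, per label,
    # minus one per label because each conditional table must sum to 1.
    return k * (2 ** d - 1)

def naive_bayes_params(k, d):
    # With the naive Bayes assumption: one Bernoulli parameter per feature, per label.
    return k * d

for d in (4, 10, 30):
    print(d, full_joint_params(2, d), naive_bayes_params(2, d))
# 4 30 8
# 10 2046 20
# 30 2147483646 60
```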

32 The Naïve Bayes Classifier

Assumption: Features are conditionally independent given the label Y

To predict, we need two sets of probabilities – Prior P(y)

– For each xj, we have the likelihood P(xj | y)

33 The Naïve Bayes Classifier

Assumption: Features are conditionally independent given the label Y

To predict, we need two sets of probabilities – Prior P(y)

– For each xj, we have the likelihood P(xj | y)

Decision rule

h(x) = argmax_y P(y) P(x1, x2, ⋯, xd | y)

34 The Naïve Bayes Classifier

Assumption: Features are conditionally independent given the label Y

To predict, we need two sets of probabilities – Prior P(y)

– For each xj, we have the likelihood P(xj | y)

Decision rule

h(x) = argmax_y P(y) P(x1, x2, ⋯, xd | y)

     = argmax_y P(y) ∏j P(xj | y)
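A minimal sketch of this decision rule in Python (log probabilities are summed to avoid numerical underflow when there are many features; the dictionary-based representation and names are our own choices):

```python
import math

def nb_predict(x, prior, likelihood):
    """x: dict feature -> value; prior: dict label -> P(y);
    likelihood: dict label -> dict (feature, value) -> P(x_j = value | y)."""
    best_label, best_score = None, -math.inf
    for label, p_y in prior.items():
        # log P(y) + sum_j log P(x_j | y)
        score = math.log(p_y)
        for feature, value in x.items():
            score += math.log(likelihood[label][(feature, value)])
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```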

35 Decision boundaries of naïve Bayes

What is the decision boundary of the naïve Bayes classifier?

Consider the two class case. We predict the label to be + if

P(y = +) P(x | y = +) > P(y = −) P(x | y = −)

36 Decision boundaries of naïve Bayes

What is the decision boundary of the naïve Bayes classifier?

Consider the two class case. We predict the label to be + if

P(y = +) P(x | y = +) > P(y = −) P(x | y = −)

P(y = +) ∏j P(xj | y = +) / [ P(y = −) ∏j P(xj | y = −) ] > 1

37 Decision boundaries of naïve Bayes

What is the decision boundary of the naïve Bayes classifier?

Taking log and simplifying, we get

log [ P(y = +|x) / P(y = −|x) ] = w^T x + b

This is a linear function of the feature space!

Easy to prove. See note on course website
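A sketch of the argument, assuming binary features xj ∈ {0, 1} with P(xj = 1 | y = +) = aj and P(xj = 1 | y = −) = bj (the parameterization used later in this lecture); this is one way to see it, not necessarily the exact derivation in the course note:

```latex
\log\frac{P(y=+\mid x)}{P(y=-\mid x)}
 = \log\frac{P(y=+)}{P(y=-)} + \sum_j \log\frac{P(x_j \mid y=+)}{P(x_j \mid y=-)}
 = \sum_j x_j \log\frac{a_j(1-b_j)}{b_j(1-a_j)}
   \;+\; \log\frac{P(y=+)}{P(y=-)} + \sum_j \log\frac{1-a_j}{1-b_j}
```

The coefficients of the xj form the weight vector w, and the remaining terms form the bias b, so the log-odds is linear in x.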

38 Today’s lecture

• The naïve Bayes Classifier

• Learning the naïve Bayes Classifier

• Practical Concerns

39 Learning the naïve Bayes Classifier

• What is the hypothesis function h defined by?
  – A collection of probabilities
    • Prior for each label: P(y)
    • Likelihoods for feature xj given a label: P(xj | y)

If we have a data set D = {(xi, yi)} with m examples, and we want to learn the classifier in a probabilistic way:
  – What is the probabilistic criterion to select the hypothesis?

41 Learning the naïve Bayes Classifier

• What is the hypothesis function h defined by?
  – A collection of probabilities
    • Prior for each label: P(y)
    • Likelihoods for feature xj given a label: P(xj | y)

Suppose we have a data set D = {(xi, yi)} with m examples

42 Learning the naïve Bayes Classifier

• What is the hypothesis function h defined by?
  – A collection of probabilities
    • Prior for each label: P(y)
    • Likelihoods for feature xj given a label: P(xj | y)

Suppose we have a data set D = {(xi, yi)} with m examples

A note on convention for this section:
• Examples in the dataset are indexed by the subscript i (e.g. xi)
• Features within an example are indexed by the subscript j
• The jth feature of the ith example will be xij

43 Learning the naïve Bayes Classifier

• What is the hypothesis function h defined by?
  – A collection of probabilities
    • Prior for each label: P(y)
    • Likelihoods for feature xj given a label: P(xj | y)

If we have a data set D = {(xi, yi)} with m examples, and we want to learn the classifier in a probabilistic way:
  – What is a probabilistic criterion to select the hypothesis?

44 Learning the naïve Bayes Classifier

Maximum likelihood estimation: choose h_ML = argmax_h P(D | h)

Here h is defined by all the probabilities used to construct the naïve Bayes decision

45 Maximum likelihood estimation

Given a dataset � = {(��, ��)} with m examples

Each example in the dataset is independent and identically distributed

So we can represent P(D | h) as the product

P(D | h) = ∏i P(xi, yi | h)

46 Maximum likelihood estimation

Given a dataset � = {(��, ��)} with m examples

Each example in the dataset is independent and identically distributed

So we can represent P(D | h) as the product

P(D | h) = ∏i P(xi, yi | h)

Asks “What probability would this particular h assign to the pair (xi, yi)?”

47 Maximum likelihood estimation

Given a dataset D = {(xi, yi)} with m examples

48 Maximum likelihood estimation

Given a dataset D = {(xi, yi)} with m examples

P(D | h) = ∏i P(yi | h) P(xi | yi, h) = ∏i P(yi | h) ∏j P(xij | yi, h)

xij is the jth feature of xi

The last step uses the naïve Bayes assumption

49 Maximum likelihood estimation

Given a dataset D = {(xi, yi)} with m examples

How do we proceed?

50 Maximum likelihood estimation

Given a dataset D = {(xi, yi)} with m examples, maximize the log-likelihood:

log P(D | h) = Σi [ log P(yi | h) + Σj log P(xij | yi, h) ]

51 Learning the naïve Bayes Classifier

Maximum likelihood estimation

What next?

52 Learning the naïve Bayes Classifier

Maximum likelihood estimation

What next?

We need to make a modeling assumption about the functional form of these probability distributions

53 Learning the naïve Bayes Classifier

Maximum likelihood estimation

For simplicity, suppose there are two labels 1 and 0 and all features are binary • Prior: P(y = 1) = p and P (y = 0) = 1 – p

That is, the prior probability is from the Bernoulli distribution.

54 Learning the naïve Bayes Classifier

Maximum likelihood estimation

For simplicity, suppose there are two labels 1 and 0 and all features are binary • Prior: P(y = 1) = p and P (y = 0) = 1 – p • Likelihood for each feature given a label

• P(xj = 1 | y = 1) = aj and P(xj = 0 | y = 1) = 1 – aj • P(xj = 1 | y = 0) = bj and P(xj = 0 | y = 0) = 1 - bj

55 Learning the naïve Bayes Classifier

Maximum likelihood estimation

For simplicity, suppose there are two labels 1 and 0 and all features are binary • That is, the Prior: P(y = 1) = p and P (y = 0) = 1 – p likelihood of each • feature is also is Likelihood for each feature given a label from the Bernoulli • P(xj = 1 | y = 1) = aj and P(xj = 0 | y = 1) = 1 – aj distribution. • P(xj = 1 | y = 0) = bj and P(xj = 0 | y = 0) = 1 - bj

56 Learning the naïve Bayes Classifier

Maximum likelihood estimation

For simplicity, suppose there are two labels 1 and 0 and all features are binary

• Prior: P(y = 1) = p and P(y = 0) = 1 – p
• Likelihood for each feature given a label
  • P(xj = 1 | y = 1) = aj and P(xj = 0 | y = 1) = 1 – aj
  • P(xj = 1 | y = 0) = bj and P(xj = 0 | y = 0) = 1 – bj

The hypothesis h consists of p and all the a's and b's.

57 Learning the naïve Bayes Classifier

Maximum likelihood estimation

• Prior: P(y = 1) = p and P (y = 0) = 1 – p

58 Learning the naïve Bayes Classifier

Maximum likelihood estimation

• Prior: P(y = 1) = p and P(y = 0) = 1 – p, which can be written compactly as P(yi | h) = p^[yi = 1] (1 – p)^[yi = 0]

[z] is called the indicator function or the Iverson bracket

Its value is 1 if the argument z is true and zero otherwise

59 Learning the naïve Bayes Classifier

Maximum likelihood estimation

Likelihood for each feature given a label

• P(xj = 1 | y = 1) = aj and P(xj = 0 | y = 1) = 1 – aj • P(xj = 1 | y = 0) = bj and P(xj = 0 | y = 0) = 1 - bj

60 Learning the naïve Bayes Classifier

Substituting and deriving the argmax, we get

P(y = 1) = p

61 Learning the naïve Bayes Classifier

Substituting and deriving the argmax, we get

P(y = 1) = p

P(xj = 1 | y = 1) = aj

62 Learning the naïve Bayes Classifier

Substituting and deriving the argmax, we get

P(y = 1) = p = (number of examples with yi = 1) / m

P(xj = 1 | y = 1) = aj = (number of examples with xij = 1 and yi = 1) / (number of examples with yi = 1)

P(xj = 1 | y = 0) = bj = (number of examples with xij = 1 and yi = 0) / (number of examples with yi = 0)
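A minimal sketch of these maximum likelihood estimates computed from counts (our own helper, not lecture code; it assumes both labels appear in the data):

```python
def mle_bernoulli_nb(X, y):
    """X: list of binary feature vectors (lists of 0/1); y: list of 0/1 labels.
    Returns p = P(y=1), a[j] = P(x_j=1 | y=1), b[j] = P(x_j=1 | y=0)."""
    m, d = len(X), len(X[0])
    pos = [x for x, label in zip(X, y) if label == 1]
    neg = [x for x, label in zip(X, y) if label == 0]
    p = len(pos) / m                                        # fraction of positive examples
    a = [sum(x[j] for x in pos) / len(pos) for j in range(d)]  # feature counts among positives
    b = [sum(x[j] for x in neg) / len(neg) for j in range(d)]  # feature counts among negatives
    return p, a, b
```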

63 Let's learn a naïve Bayes classifier

With the assumption that all our probabilities are from the Bernoulli distribution

    O   T   H   W   Play?
1   S   H   H   W   -
2   S   H   H   S   -
3   O   H   H   W   +
4   R   M   H   W   +
5   R   C   N   W   +
6   R   C   N   S   -
7   O   C   N   S   +
8   S   M   H   W   -
9   S   C   N   W   +
10  R   M   N   W   +
11  S   M   N   S   +
12  O   M   H   S   +
13  O   H   N   W   +
14  R   M   H   S   -

64 Let’s learn a naïve Bayes classifier

P(Play = +) = 9/14    P(Play = −) = 5/14

65 Let's learn a naïve Bayes classifier

P(Play = +) = 9/14    P(Play = −) = 5/14

P(O = S | Play = +) = 2/9
P(O = R | Play = +) = 3/9
P(O = O | Play = +) = 4/9

And so on, for the other attributes and also for Play = −
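These counts can be reproduced mechanically; a short Python sketch (the list encoding of the table is ours) that prints the priors and the Outlook likelihoods for Play = +:

```python
from collections import Counter

# (Outlook, Temperature, Humidity, Wind, Play?) for the 14 examples above
data = [("S","H","H","W","-"), ("S","H","H","S","-"), ("O","H","H","W","+"),
        ("R","M","H","W","+"), ("R","C","N","W","+"), ("R","C","N","S","-"),
        ("O","C","N","S","+"), ("S","M","H","W","-"), ("S","C","N","W","+"),
        ("R","M","N","W","+"), ("S","M","N","S","+"), ("O","M","H","S","+"),
        ("O","H","N","W","+"), ("R","M","H","S","-")]

label_counts = Counter(row[-1] for row in data)
print({y: c / len(data) for y, c in label_counts.items()})      # +: 9/14, -: 5/14

# Likelihood of each Outlook value given Play = +
outlook_pos = Counter(row[0] for row in data if row[-1] == "+")
print({v: c / label_counts["+"] for v, c in outlook_pos.items()})  # S: 2/9, R: 3/9, O: 4/9
```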

68 Naïve Bayes: Learning and Prediction

• Learning
  – Count how often features occur with each label. Normalize to get likelihoods
  – Priors from fraction of examples with each label
  – Generalizes to multiclass

• Prediction
  – Use the learned probabilities to find the highest scoring label

69 Today’s lecture

• The naïve Bayes Classifier

• Learning the naïve Bayes Classifier

• Practical concerns + an example

70 Important caveats with Naïve Bayes

1. Features need not be conditionally independent given the label
  – Just because we assume that they are doesn't mean that's how they behave in nature
  – We made a modeling assumption because it makes computation and learning easier

2. Not enough training data to get good estimates of the probabilities from counts

71 Important caveats with Naïve Bayes

1. Features are not conditionally independent given the label

All bets are off if the naïve Bayes assumption is not satisfied

And yet, naïve Bayes is very often used in practice because of its simplicity, and it works reasonably well even when the assumption is violated

72 Important caveats with Naïve Bayes

2. Not enough training data to get good estimates of the probabilities from counts

The basic operation for learning likelihoods is counting how often a feature occurs with a label.

What if we never see a particular feature with a particular label? Eg: Suppose we never observe Temperature = cold with PlayTennis= Yes

Should we treat those counts as zero?

73 Important caveats with Naïve Bayes

2. Not enough training data to get good estimates of the probabilities from counts

The basic operation for learning likelihoods is counting how often a feature occurs with a label.

What if we never see a particular feature with a particular label? Eg: Suppose we never observe Temperature = cold with PlayTennis= Yes

Should we treat those counts as zero? But that will make the probabilities zero

74 Important caveats with Naïve Bayes

2. Not enough training data to get good estimates of the probabilities from counts

The basic operation for learning likelihoods is counting how often a feature occurs with a label.

What if we never see a particular feature with a particular label? Eg: Suppose we never observe Temperature = cold with PlayTennis= Yes

Should we treat those counts as zero? But that will make the probabilities zero

Answer: Smoothing
• Add fake counts (very small numbers so that the counts are not zero)
• The Bayesian interpretation of smoothing: priors on the hypothesis (MAP learning)
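A minimal sketch of add-one (Laplace) smoothing for a single likelihood estimate; the pseudo-count α = 1 is a common choice, not something the slides prescribe:

```python
def smoothed_likelihood(count_feature_and_label, count_label, num_values, alpha=1.0):
    # Add alpha fake counts to every possible feature value so that no
    # estimated probability is exactly zero.
    return (count_feature_and_label + alpha) / (count_label + alpha * num_values)

# E.g. Temperature = cold never observed with PlayTennis = Yes out of 9 "Yes" days,
# with Temperature taking 3 possible values:
print(smoothed_likelihood(0, 9, num_values=3))   # 1/12 instead of 0
```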

75 Example: Classifying text

• Instance space: Text documents • Labels: Spam or NotSpam

• Goal: To learn a function that can predict whether a new document is Spam or NotSpam

How would you build a naïve Bayes classifier?

Let us brainstorm

How to represent documents? How to estimate probabilities? How to classify?

76 Example: Classifying text

1. Represent documents by a vector of words
   – A sparse vector consisting of one feature per word

2. Learning from N labeled documents
   1. Priors: the fraction of the N documents with each label
   2. For each word w in the vocabulary: the likelihood of w given each label
      – How often does the word occur with that label?
      – Estimated with smoothing
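The slides leave the exact probabilistic model of documents unspecified; below is a sketch assuming a multinomial ("bag of words") model with add-one smoothing. All function and variable names are ours:

```python
import math
from collections import Counter

def train(docs, labels):
    """docs: list of token lists; labels: parallel list of labels, e.g. 'Spam' / 'NotSpam'."""
    n = len(docs)
    prior = {y: c / n for y, c in Counter(labels).items()}
    vocab = {w for doc in docs for w in doc}
    word_counts = {y: Counter() for y in prior}
    for doc, y in zip(docs, labels):
        word_counts[y].update(doc)
    # Add-one smoothing over the vocabulary so unseen (word, label) pairs get non-zero probability
    likelihood = {}
    for y in prior:
        total = sum(word_counts[y].values())
        likelihood[y] = {w: (word_counts[y][w] + 1) / (total + len(vocab)) for w in vocab}
    return prior, likelihood, vocab

def classify(doc, prior, likelihood, vocab):
    # Highest-scoring label under log P(y) + sum of log P(w | y) over the document's words
    scores = {y: math.log(p) + sum(math.log(likelihood[y][w]) for w in doc if w in vocab)
              for y, p in prior.items()}
    return max(scores, key=scores.get)
```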

82 Continuous features

• So far, we have been looking at discrete features

– P(xj | y) is a Bernoulli trial (i.e. a coin toss)

• We could model P(xj | y) with other distributions too – This is a separate assumption from the independence assumption that naive Bayes makes

– Eg: For real valued features, (Xj | Y) could be drawn from a normal distribution

• Exercise: Derive the maximum likelihood estimate when the features are assumed to be drawn from the normal distribution
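For reference, the Gaussian modeling assumption mentioned above takes the form below, with a per-feature, per-label mean and variance (the notation μ_jy, σ_jy is ours, not from the slides):

```latex
P(x_j \mid y) = \frac{1}{\sqrt{2\pi\sigma_{jy}^2}}
  \exp\!\left(-\frac{(x_j-\mu_{jy})^2}{2\sigma_{jy}^2}\right)
```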

83 Summary: Naïve Bayes

• Independence assumption – All features are independent of each other given the label

• Maximum likelihood learning: Learning is simple – Generalizes to real valued features

• Prediction via MAP estimation – Generalizes beyond binary classification

• Important caveats to remember – Smoothing – Independence assumption may not be valid

• Decision boundary is linear for binary classification

84