The Naïve Bayes Classifier
Machine Learning
1 Today’s lecture
• The naïve Bayes Classifier
• Learning the naïve Bayes Classifier
• Practical concerns
3 Where are we?
We have seen Bayesian learning
– Using a probabilistic criterion to select a hypothesis
– Maximum a posteriori and maximum likelihood learning (you should know the difference between them)
We could also learn functions that predict probabilities of outcomes
– Different from using a probabilistic criterion to learn
Maximum a posteriori (MAP) prediction as opposed to MAP learning
5 MAP prediction
Using the Bayes rule for predicting y given an input x:

P(Y = y | X = x) = P(X = x | Y = y) P(Y = y) / P(X = x)

The left-hand side is the posterior probability of the label being y for this input x.

Predict the label y for the input x using

argmax_y P(Y = y | X = x) = argmax_y P(X = x | Y = y) P(Y = y)

(The evidence P(X = x) does not depend on y, so it drops out of the argmax.)

Don't confuse this with MAP learning: MAP learning selects a hypothesis, whereas MAP prediction selects the most probable label for a given input.

In the product above, P(X = x | Y = y) is the likelihood of observing this input x when the label is y, and P(Y = y) is the prior probability of the label being y. All we need are these two sets of probabilities.
10 Example: Tennis again

Prior: without any other information, what is the prior probability that I should play tennis?

Play tennis   P(Play tennis)
Yes           0.3
No            0.7

Likelihood: on days that I do play tennis, what is the probability that the temperature is T and the wind is W? And on days that I don't play tennis?

Temperature   Wind     P(T, W | Tennis = Yes)
Hot           Strong   0.15
Hot           Weak     0.4
Cold          Strong   0.1
Cold          Weak     0.35

Temperature   Wind     P(T, W | Tennis = No)
Hot           Strong   0.4
Hot           Weak     0.1
Cold          Strong   0.3
Cold          Weak     0.2

Input: Temperature = Hot (H), Wind = Weak (W). Should I play tennis?

argmax_y P(H, W | play?) P(play?)

P(H, W | Yes) P(Yes) = 0.4 × 0.3 = 0.12
P(H, W | No) P(No) = 0.1 × 0.7 = 0.07

MAP prediction = Yes
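As a minimal sketch of the computation above, the probability tables can be stored as plain dictionaries and the MAP label picked with an argmax; the variable names here are illustrative, not from the lecture.

```python
# MAP prediction for the tennis example: pick the label y maximizing P(x | y) P(y).
prior = {"Yes": 0.3, "No": 0.7}

# Joint likelihood P(Temperature, Wind | Play tennis = y), keyed by (temperature, wind).
likelihood = {
    "Yes": {("Hot", "Strong"): 0.15, ("Hot", "Weak"): 0.4,
            ("Cold", "Strong"): 0.1, ("Cold", "Weak"): 0.35},
    "No":  {("Hot", "Strong"): 0.4,  ("Hot", "Weak"): 0.1,
            ("Cold", "Strong"): 0.3, ("Cold", "Weak"): 0.2},
}

def map_predict(x):
    # Score each label by likelihood × prior and return the best one.
    scores = {y: likelihood[y][x] * prior[y] for y in prior}
    return max(scores, key=scores.get), scores

label, scores = map_predict(("Hot", "Weak"))
print(scores)  # ≈ {'Yes': 0.12, 'No': 0.07}
print(label)   # 'Yes'
```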
17 How hard is it to learn probabilistic models?

 #   O  T  H  W   Play?
 1   S  H  H  W   -
 2   S  H  H  S   -
 3   O  H  H  W   +
 4   R  M  H  W   +
 5   R  C  N  W   +
 6   R  C  N  S   -
 7   O  C  N  S   +
 8   S  M  H  W   -
 9   S  C  N  W   +
10   R  M  N  W   +
11   S  M  N  S   +
12   O  M  H  S   +
13   O  H  N  W   +
14   R  M  H  S   -

Outlook: S(unny), O(vercast), R(ainy)
Temperature: H(ot), M(edium), C(ool)
Humidity: H(igh), N(ormal), L(ow)
Wind: S(trong), W(eak)

We need to learn
1. The prior P(Play?)
2. The likelihoods P(x | Play?)

Prior P(Play?)
• A single number (Why only one?)

Likelihood P(X | Play?)
• There are 4 features
• For each value of Play? (+/-), we need a value for each possible assignment: P(O, T, H, W | Play?)
• The features take 3, 3, 3, and 2 values respectively, so (3 ⋅ 3 ⋅ 3 ⋅ 2 − 1) parameters for each label, one for each assignment

In general
Prior P(Y)
• If there are k labels, then k – 1 parameters (why not k?)
Likelihood P(X | Y)
• If there are d Boolean features, we need a value for each possible P(x1, x2, …, xd | y) for each y
• k(2^d − 1) parameters

We need a lot of data to estimate this many numbers!
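To make the count concrete: for the tennis data above, each label needs (3 ⋅ 3 ⋅ 3 ⋅ 2 − 1) = 53 likelihood parameters, so the two labels together need 2 × 53 = 106 likelihood values plus 1 prior value, all estimated from only 14 training examples.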
High model complexity: if there is very limited data, the estimated parameters will have high variance.

How can we deal with this? Answer: Make independence assumptions.

29 Recall: Conditional independence
Suppose X, Y, and Z are random variables.

X is conditionally independent of Y given Z if the probability distribution of X is independent of the value of Y when Z is observed:

P(X | Y, Z) = P(X | Z)

Or equivalently,

P(X, Y | Z) = P(X | Z) P(Y | Z)
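As a quick sanity check of the definition, here is a small Python sketch (with made-up numbers, not from the lecture) that builds a joint distribution in which X is conditionally independent of Y given Z and verifies that P(X | Y, Z) = P(X | Z) for every assignment.

```python
from itertools import product

p_z = {0: 0.4, 1: 0.6}                                     # P(Z = z)
p_x_given_z = {0: {0: 0.2, 1: 0.8}, 1: {0: 0.7, 1: 0.3}}   # P(X = x | Z = z)
p_y_given_z = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.1, 1: 0.9}}   # P(Y = y | Z = z)

# Joint P(X, Y, Z) constructed so the conditional independence holds by design.
joint = {(x, y, z): p_z[z] * p_x_given_z[z][x] * p_y_given_z[z][y]
         for x, y, z in product((0, 1), repeat=3)}

def prob_x_given(x_val, fixed):
    """P(X = x_val | fixed), where `fixed` maps tuple positions (1 for Y, 2 for Z) to values."""
    def matches(key):
        return all(key[i] == v for i, v in fixed.items())
    numer = sum(p for key, p in joint.items() if key[0] == x_val and matches(key))
    denom = sum(p for key, p in joint.items() if matches(key))
    return numer / denom

for y, z in product((0, 1), repeat=2):
    # Conditioning on Y changes nothing once Z is known.
    assert abs(prob_x_given(1, {1: y, 2: z}) - prob_x_given(1, {2: z})) < 1e-12
```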
30 Modeling the features

P(x1, x2, ⋯, xd | y) required k(2^d − 1) parameters.

What if all the features were conditionally independent given the label? This is the Naïve Bayes assumption:

P(x1, x2, ⋯, xd | y) = P(x1 | y) P(x2 | y) ⋯ P(xd | y)

Requires only d numbers for each label, i.e. kd parameters overall. Not bad!
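For instance, with k = 2 labels and d = 4 Boolean features, the full joint likelihood would need 2 × (2^4 − 1) = 30 parameters, while the naïve Bayes factorization needs only 2 × 4 = 8.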
32 The Naïve Bayes Classifier

Assumption: Features are conditionally independent given the label Y

To predict, we need two sets of probabilities
– The prior P(y)
– For each xj, the likelihood P(xj | y)

Decision rule

h_NB(x) = argmax_y P(y) P(x1, x2, ⋯, xd | y)
        = argmax_y P(y) ∏j P(xj | y)
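A minimal Python sketch of this decision rule, assuming the prior and the per-feature likelihoods have already been estimated and are stored in dictionaries; the feature names and the toy probabilities below are illustrative, not from the lecture.

```python
# Prior P(y) and per-feature likelihoods P(x_j = value | y).
prior = {"Yes": 0.3, "No": 0.7}
likelihoods = {
    "Yes": {"Temperature": {"Hot": 0.55, "Cold": 0.45},
            "Wind":        {"Strong": 0.25, "Weak": 0.75}},
    "No":  {"Temperature": {"Hot": 0.5, "Cold": 0.5},
            "Wind":        {"Strong": 0.7, "Weak": 0.3}},
}

def naive_bayes_predict(x):
    """x maps feature names to observed values, e.g. {"Temperature": "Hot", "Wind": "Weak"}."""
    scores = {}
    for y in prior:
        score = prior[y]                                  # P(y)
        for feature, value in x.items():
            score *= likelihoods[y][feature][value]       # × P(x_j | y)
        scores[y] = score                                 # P(y) ∏_j P(x_j | y)
    return max(scores, key=scores.get)

print(naive_bayes_predict({"Temperature": "Hot", "Wind": "Weak"}))  # 'Yes'
```

In practice, implementations usually sum log-probabilities instead of multiplying raw probabilities, to avoid numerical underflow when there are many features.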
35 Decision boundaries of naïve Bayes
What is the decision boundary of the naïve Bayes classifier?
Consider the two class case. We predict the label to be + if P(y = +) ∏j P(xj | y = +) ≥ P(y = −) ∏j P(xj | y = −).