Naïve Bayes

Eamonn Keogh UCR

This is a high-level overview only. For details, see Pattern Recognition and Machine Learning, Christopher Bishop, Springer-Verlag, 2006, or Pattern Classification by R. O. Duda, P. E. Hart, and D. Stork, Wiley and Sons. (Pictured: Thomas Bayes, 1702–1761.)

We will start off with a visual intuition, before looking at the math…

Remember this example? Let's get lots more data…

[Scatter plot: Antenna Length vs. Abdomen Length, Grasshoppers and Katydids]

With a lot of data, we can build histograms. Let us just build one for "Antenna Length" for now…

[Histograms of Antenna Length for Grasshoppers and Katydids]

We can leave the histograms as they are, or we can summarize them with two normal distributions.
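As a minimal sketch (not from the slides, and with made-up measurements), summarizing each class with a normal distribution just means estimating a mean and standard deviation per class:

```python
import numpy as np

# Hypothetical antenna-length measurements; the numbers are illustrative only
grasshopper_lengths = np.array([2.1, 2.8, 3.3, 3.9, 4.4, 5.0, 5.6])
katydid_lengths     = np.array([5.2, 6.0, 6.7, 7.1, 7.8, 8.4, 9.0])

# A normal distribution per class: just a mean and a standard deviation
for name, data in [("Grasshopper", grasshopper_lengths),
                   ("Katydid", katydid_lengths)]:
    print(f"{name}: mean = {data.mean():.2f}, std = {data.std(ddof=1):.2f}")
```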

Let us use two normal distributions for ease of visualization in the following slides…

• We want to classify an insect we have found. Its antennae are 3 units long. How can we classify it?

• We can just ask ourselves: given the distributions of antennae lengths we have seen, is it more probable that our insect is a Grasshopper or a Katydid?
• There is a formal way to discuss the most probable classification…

p(cj | d) = probability of class cj, given that we have observed d

Antennae length is 3:

P(Grasshopper | 3) = 10 / (10 + 2) = 0.833
P(Katydid | 3) = 2 / (10 + 2) = 0.166

Antennae length is 7:

P(Grasshopper | 7) = 3 / (3 + 9) = 0.250
P(Katydid | 7) = 9 / (3 + 9) = 0.750

Antennae length is 5:

P(Grasshopper | 5) = 6 / (6 + 6) = 0.500
P(Katydid | 5) = 6 / (6 + 6) = 0.500
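A minimal sketch of the count-based reasoning above; the counts for antenna lengths 3, 5, and 7 are the ones shown on the slides:

```python
# Histogram counts at a given antenna length: (grasshoppers, katydids)
counts = {3: (10, 2), 5: (6, 6), 7: (3, 9)}

def posteriors(antenna_length):
    g, k = counts[antenna_length]
    return g / (g + k), k / (g + k)   # P(Grasshopper | d), P(Katydid | d)

print(posteriors(3))  # (0.833..., 0.166...) -> Grasshopper
print(posteriors(7))  # (0.25, 0.75)         -> Katydid
print(posteriors(5))  # (0.5, 0.5)           -> a tie
```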

Bayes Classifiers

That was a visual intuition for a simple case of the Bayes classifier, also called:

• Idiot Bayes
• Naïve Bayes
• Simple Bayes

We are about to see some of the mathematical formalisms, and more examples, but keep in mind the basic idea.

Find out the probability of the previously unseen instance belonging to each class, then simply pick the most probable class.

Bayes Classifiers

• Bayesian classifiers use Bayes theorem, which says

p(cj | d) = p(d | cj) p(cj) / p(d)

• p(cj | d) = probability of instance d being in class cj. This is what we are trying to compute.

• p(d | cj) = probability of generating instance d given class cj. We can imagine that being in class cj causes you to have feature d with some probability.
• p(cj) = probability of occurrence of class cj. This is just how frequent class cj is in our database.
• p(d) = probability of instance d occurring. This can actually be ignored, since it is the same for all classes.

Assume that we have two classes: c1 = male, and c2 = female. (Note: "Drew" can be a male or female name.) We have a person whose sex we do not know, say "drew" or d. (Pictured: Drew Barrymore.)

Classifying drew as male or female is equivalent to asking: is it more probable that drew is male or female? I.e., which is greater, p(male | drew) or p(female | drew)?

(Pictured: Drew Carey.)

p(male | drew) = p(drew | male) p(male) / p(drew)

where p(drew | male) is the probability of being called "drew" given that you are a male, p(male) is the probability of being a male, and p(drew) is the probability of being named "drew" (actually irrelevant, since it is the same for all classes).

This is Officer Drew (who arrested me in 1997). Is Officer Drew a Male or Female?

Luckily, we have a small database with names and sex.

We can use it to apply Bayes rule…

Name     Sex
Drew     Male
Claudia  Female
Drew     Female
Drew     Female
Alberto  Male
Karin    Female
Nina     Female
Sergio   Male

p(cj | d) = p(d | cj) p(cj) / p(d)

p(male | drew) = (1/3 * 3/8) / (3/8) = 0.125 / (3/8)
p(female | drew) = (2/5 * 5/8) / (3/8) = 0.250 / (3/8)

Officer Drew is more likely to be a Female. Officer Drew IS a female!
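A sketch of the same computation in code, using the eight-row database from the slide. The slide compares the numerators (0.125 vs. 0.250); dividing by the shared p(drew) = 3/8 rescales them to 1/3 and 2/3 without changing the winner:

```python
# The name/sex database from the slide
data = [("Drew", "Male"), ("Claudia", "Female"), ("Drew", "Female"),
        ("Drew", "Female"), ("Alberto", "Male"), ("Karin", "Female"),
        ("Nina", "Female"), ("Sergio", "Male")]

def posterior(name, sex):
    in_class = [n for n, s in data if s == sex]
    p_d_given_c = in_class.count(name) / len(in_class)   # p(d | cj)
    p_c = len(in_class) / len(data)                      # p(cj)
    p_d = sum(n == name for n, _ in data) / len(data)    # p(d)
    return p_d_given_c * p_c / p_d

print(posterior("Drew", "Male"))    # (1/3 * 3/8) / (3/8) = 0.333...
print(posterior("Drew", "Female"))  # (2/5 * 5/8) / (3/8) = 0.666... -> Female
```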

So far we have only considered Bayes Classification when we have one attribute (the "antennae length", or the "name"). But we may have many features. How do we use all the features?

Name     Over 170cm  Eye    Hair length  Sex
Drew     No          Blue   Short        Male
Claudia  Yes         Brown  Long         Female
Drew     No          Blue   Long         Female
Drew     No          Blue   Long         Female
Alberto  Yes         Brown  Short        Male
Karin    No          Blue   Long         Female
Nina     Yes         Brown  Short        Female
Sergio   Yes         Blue   Long         Male

• To simplify the task, naïve Bayesian classifiers assume attributes have independent distributions, and thereby estimate

p(d|cj) = p(d1|cj) * p(d2|cj) * ….* p(dn|cj)

The probability of class cj generating instance d, equals…

The probability of class cj generating the observed value for feature 1, multiplied by…

The probability of class cj generating the observed value for feature 2, multiplied by… and so on.

p(d|cj) = p(d1|cj) * p(d2|cj) * … * p(dn|cj)

p(officer drew|cj) = p(over_170cm = yes|cj) * p(eye = blue|cj) * …

Officer Drew is blue-eyed, over 170cm tall, and has long hair.

p(officer drew | Female) = 2/5 * 3/5 * …
p(officer drew | Male) = 2/3 * 2/3 * …
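A sketch of the naïve product for Officer Drew, with the per-feature fractions read off the eight-row table above (3 males, 5 females):

```python
# p(feature | class) read off the table above
likelihoods = {
    "Male":   {"over_170cm=yes": 2/3, "eye=blue": 2/3},
    "Female": {"over_170cm=yes": 2/5, "eye=blue": 3/5},
}

def naive_likelihood(cls, observed):
    # p(d|cj) = p(d1|cj) * p(d2|cj) * ... under the independence assumption
    result = 1.0
    for feature in observed:
        result *= likelihoods[cls][feature]
    return result

observed = ["over_170cm=yes", "eye=blue"]
print(naive_likelihood("Female", observed))  # 2/5 * 3/5 = 0.24
print(naive_likelihood("Male", observed))    # 2/3 * 2/3 = 0.444...
```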

The Naïve Bayes classifier is often represented as this type of graph…

[Graph: class node cj with arrows down to the feature nodes, labelled p(d1|cj), p(d2|cj), …, p(dn|cj)]

Note the direction of the arrows, which state that each class causes certain features, with a certain probability.

Naïve Bayes is fast and space efficient.

We can look up all the probabilities with a single scan of the database and store them in a (small) table…

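A sketch of that single scan: one pass over the records, counting, then normalizing into small lookup tables like the ones below (the record fields here are illustrative):

```python
from collections import defaultdict

records = [
    {"sex": "Male",   "over_190cm": "Yes", "long_hair": "No"},
    {"sex": "Female", "over_190cm": "No",  "long_hair": "Yes"},
    # ... one dict per database row
]

class_counts = defaultdict(int)     # class -> count
feature_counts = defaultdict(int)   # (class, feature, value) -> count

for r in records:                   # the single scan
    cls = r["sex"]
    class_counts[cls] += 1
    for feature, value in r.items():
        if feature != "sex":
            feature_counts[(cls, feature, value)] += 1

def p(value, feature, cls):         # p(feature = value | class)
    return feature_counts[(cls, feature, value)] / class_counts[cls]

print(p("Yes", "long_hair", "Female"))   # from the toy records above: 1.0
```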

Sex     Over 190cm
Male    Yes  0.15
Male    No   0.85
Female  Yes  0.01
Female  No   0.99

Sex     Long Hair
Male    Yes  0.05
Male    No   0.95
Female  Yes  0.70
Female  No   0.30

Naïve Bayes is NOT sensitive to irrelevant features…

Suppose we are trying to classify a person's sex based on several features, including eye color. (Of course, eye color is completely irrelevant to a person's gender.)

p(Jessica | cj) = p(eye = brown|cj) * p(wears_dress = yes|cj) * …

p(Jessica | Female) = 9,000/10,000 * 9,975/10,000 * …
p(Jessica | Male) = 9,001/10,000 * 2/10,000 * …

Almost the same!
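The arithmetic behind this, as a sketch using the slide's numbers: the eye-color factors (0.9000 vs. 0.9001) are nearly identical across classes, so the ratio is decided almost entirely by the wears_dress factors:

```python
p_female = (9000 / 10000) * (9975 / 10000)  # eye=brown, wears_dress=yes | Female
p_male   = (9001 / 10000) * (2 / 10000)     # eye=brown, wears_dress=yes | Male

print(p_female)            # ~0.898
print(p_male)              # ~0.00018
print(p_female / p_male)   # roughly 5000 : 1 in favour of Female
```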

However, this assumes that we have good enough estimates of the probabilities, so the more data the better.

An obvious point: I have used a simple two-class problem, with two possible values for each feature, in my previous examples. However, we can have an arbitrary number of classes, or feature values.


Animal  Mass >10kg
Cat     Yes  0.15
Cat     No   0.85
Dog     Yes  0.91
Dog     No   0.09
Pig     Yes  0.99
Pig     No   0.01

Animal  Color
Cat     Black  0.33
Cat     White  0.23
Cat     Brown  0.44
Dog     Black  0.97
Dog     White  0.03
Dog     Brown  0.90
Pig     Black  0.04
Pig     White  0.01
Pig     Brown  0.95

Problem! Naïve Bayes assumes independence of features…


Sex     Over 6 foot
Male    Yes  0.15
Male    No   0.85
Female  Yes  0.01
Female  No   0.99

Sex     Over 200 pounds
Male    Yes  0.11
Male    No   0.80
Female  Yes  0.05
Female  No   0.95

Solution: Consider the relationships between attributes…


Sex     Over 6 foot
Male    Yes  0.15
Male    No   0.85
Female  Yes  0.01
Female  No   0.99

Sex     Over 200 pounds
Male    Yes and Over 6 foot      0.11
Male    No and Over 6 foot       0.59
Male    Yes and NOT Over 6 foot  0.05
Male    No and NOT Over 6 foot   0.35
Female  Yes and Over 6 foot      0.01
Female  …

Solution: Consider the relationships between attributes…


But how do we find the set of connecting arcs?

The Naïve Bayesian Classifier has a piecewise quadratic decision boundary.

[Decision regions for three classes: Grasshoppers, Katydids, Ants]
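Why quadratic? With one normal distribution per class, the log-odds log p(c1|x) − log p(c2|x) expands to a quadratic in x whenever the class variances differ. A small sketch with made-up parameters (not the slide's data):

```python
import numpy as np

def log_odds(x, m1, s1, m2, s2):
    # log p(c1|x) - log p(c2|x) with Gaussian class-conditionals and equal
    # priors; the sqrt(2*pi) terms cancel, leaving a quadratic in x
    t1 = -((x - m1) ** 2) / (2 * s1 ** 2) - np.log(s1)
    t2 = -((x - m2) ** 2) / (2 * s2 ** 2) - np.log(s2)
    return t1 - t2

# With unequal variances the boundary (where log_odds = 0) has two roots
xs = np.linspace(0, 10, 1001)
lo = log_odds(xs, m1=4.0, s1=1.0, m2=6.0, s2=2.0)
print(xs[np.where(np.diff(np.sign(lo)) != 0)])  # two crossing points
```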

Adapted from a slide by Ricardo Gutierrez-Osuna. One second of audio from the laser sensor. Only Bombus impatiens (Common Eastern Bumble Bee) is in the insectary.

[Audio waveform: background noise, then the bee begins to cross the laser]

[Single-Sided Amplitude Spectrum of Y(t): |Y(f)| vs. Frequency (Hz), with a peak at 197Hz, plus 60Hz interference and its harmonics. Subsequent plots: histograms of Wing Beat Frequency (Hz).]

Anopheles stephensi: Female, mean = 475, Std = 30. Aedes aegypti: Female, mean = 567, Std = 43. (The two densities cross at 517.)

If I see an insect with a wingbeat frequency of 500, what is it?

P(Anopheles | wingbeat = 500) = (1 / (√(2π) * 30)) * e^(−(500−475)² / (2 * 30²))
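The same density evaluation as a sketch in code (scipy's norm.pdf implements the Gaussian formula above):

```python
from scipy.stats import norm

anopheles = norm(loc=475, scale=30)   # Anopheles stephensi (female)
aedes     = norm(loc=567, scale=43)   # Aedes aegypti (female)

x = 500
print(anopheles.pdf(x))  # the formula above: ≈ 0.0094
print(aedes.pdf(x))      # ≈ 0.0028 -- lower, so 500Hz looks more like Anopheles
```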


What is the error rate? 8.02% of the area under the pink curve; 12.2% of the area under the red curve.
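A sketch reproducing those two areas, assuming the classifier thresholds at the crossing point of 517:

```python
from scipy.stats import norm

threshold = 517  # where the two densities cross (from the slide)

# Anopheles (mean 475, std 30) misclassified as Aedes: area above the threshold
print(norm(475, 30).sf(threshold))   # ≈ 0.080, the slide's ~8.02%

# Aedes (mean 567, std 43) misclassified as Anopheles: area below the threshold
print(norm(567, 43).cdf(threshold))  # ≈ 0.122, the slide's 12.2%
```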

Can we get more features? Circadian Features.

Aedes aegypti (yellow fever mosquito)

[Circadian activity plot for Aedes aegypti: activity between dawn and dusk, from Midnight (0) through Noon (12) to Midnight (24)]

Suppose I observe an insect with a wingbeat frequency of 420Hz. What is it?


Suppose I observe an insect with a wingbeat frequency of 420Hz at 11:00am. What is it?

[Plot: wingbeat frequency (400–700Hz) against time of day (Midnight, Noon, Midnight)]

P(Aedes | [420Hz, 11:00am]) = (6 / (6 + 6 + 0)) * (4 / (2 + 4 + 3)) = 0.222
P(Anopheles | [420Hz, 11:00am]) = (0 / (6 + 6 + 0)) * (3 / (2 + 4 + 3)) = 0.000
P(Culex | [420Hz, 11:00am]) = (6 / (6 + 6 + 0)) * (2 / (2 + 4 + 3)) = 0.111
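A sketch of the two-feature computation above, following the slide's count-based shortcut (the counts 6/0/6 at 420Hz and 4/3/2 at 11:00am are read off the two plots):

```python
# Insects matching each observed feature value, per species (from the plots)
at_420hz = {"Aedes": 6, "Anopheles": 0, "Culex": 6}
at_11am  = {"Aedes": 4, "Anopheles": 3, "Culex": 2}

def score(species):
    # Product of the per-feature fractions, as on the slide
    return (at_420hz[species] / sum(at_420hz.values())) * (
            at_11am[species] / sum(at_11am.values()))

for s in ("Aedes", "Anopheles", "Culex"):
    print(s, round(score(s), 3))   # Aedes 0.222, Anopheles 0.0, Culex 0.111
```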

Which of the "Pigeon Problems" can be solved by a decision tree?

[Pigeon Problem plots]

Advantages/Disadvantages of Naïve Bayes

• Advantages:
  – Fast to train (single scan). Fast to classify.
  – Not sensitive to irrelevant features.
  – Handles real and discrete data.
  – Handles streaming data well.
• Disadvantages:
  – Assumes independence of features.