
Lecture 2: Probability, Naive Bayes

CS 585, Fall 2015 Introduction to Natural Language Processing http://people.cs.umass.edu/~brenocon/inlp2015/

Brendan O’Connor

Today

• Probability Review
• “Naive Bayes” classification
• Python demo


Probability Theory Review

1 = Σ_a P(A = a)

Conditional Probability:   P(A | B) = P(A, B) / P(B)

Chain Rule:   P(A, B) = P(A | B) P(B)

Law of Total Probability:   P(A) = Σ_b P(A, B = b) = Σ_b P(A | B = b) P(B = b)

Disjunction (Union):   P(A ∨ B) = P(A) + P(B) − P(A, B)

Negation (Complement):   P(¬A) = 1 − P(A)
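As a quick sanity check, here is a tiny worked example (the numbers are my own, not from the slides) verifying the chain rule and the law of total probability on a small joint distribution:

```python
# A tiny joint distribution over A in {0, 1} and B in {0, 1} (made-up numbers),
# used to check the identities above.
P = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}    # P(A=a, B=b)

P_B1 = sum(p for (a, b), p in P.items() if b == 1)          # marginal: P(B=1) = 0.6
P_A1_given_B1 = P[(1, 1)] / P_B1                            # conditional probability
assert abs(P_A1_given_B1 * P_B1 - P[(1, 1)]) < 1e-12        # chain rule holds
P_A1 = sum(P[(1, b)] for b in (0, 1))                       # law of total probability
print(P_A1, P_A1_given_B1)                                  # ≈ 0.7, ≈ 0.667
```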

Bayes Rule

P(D | H) — model: how the hypothesis causes the data
D: observed data / evidence
H: unknown
P(H | D)

Bayes Rule tells you how to flip the conditional. Useful if you assume a generative process for your data.

P(H | D) = P(D | H) P(H) / P(D)

(Posterior = Likelihood × Prior / Normalizer)

Rev. Thomas Bayes, c. 1701–1761

Bayes Rule and its pesky denominator

P(h | d) = P(d | h) P(h) / P(d) = P(d | h) P(h) / Σ_{h′} P(d | h′) P(h′)

The denominator P(d) is constant with respect to h.

“Proportional to”:   P(h | d) ∝ P(d | h) P(h)   (implicitly, for varying h). This notation is very common, though slightly ambiguous.

P(d | h) P(h) is the unnormalized posterior: by itself it does not sum to 1!

Bayes Rule for classification inference

Assume a generative process, P(w | y):  y = doc label, w = words.
Inference problem: given w, what is y?

Authorship problem: classify a new text. Is it y=Anna or y=Barry?

Observe w: look at a random word in the new text. It is abracadabra.

P(y=A | w=abracadabra) ?

P(y | w) = P(w|y) P(y) / P(w)

P(y): assume 50% prior prob.

From the homework handout shown on the slide:

1.3 The chain rule and marginal probabilities (1 pt). Alice has been studying karate, and has a 50% chance of surviving an encounter with Zombie Bob. If she opens the door, what is the chance that she will live, conditioned on Bob saying graagh? Assume there is a 100% chance that Alice lives if Bob is not a zombie.

2 Necromantic Scrolls. The Necromantic Scroll Aficionados (NSA) would like to know the author of a recently discovered ancient scroll. They have narrowed down the possibilities to two candidate wizards: Anna and Barry. From painstaking corpus analysis of texts known to be written by each of these wizards, they have collected frequencies for the words abracadabra and gesundheit, shown in Table 1.

P(w | y): calculate from previous data (Table 1).

Table 1: Word frequencies for wizards Anna and Barry
         abracadabra          gesundheit
Anna     5 per 1000 words     6 per 1000 words
Barry    10 per 1000 words    1 per 1000 words

2.1 Bayes rule (1 pt) Catherine has a prior belief that Anna is 80% likely to be the author of the scroll. She peeks at a random word of the scroll, and sees that it is the word abracadabra. Use Bayes’ rule to compute Catherine’s posterior belief that Anna is the author of the scroll.

2.2 Multiple words (2 pts) Dante has no prior belief about the authorship of the scrolls, and reads the entire first page. It contains 100 words, with two counts of the word abracadabra and one count of the word gesundheit.

1. What is his posterior belief about the probability that Anna is author of the scroll? (1 pt)

2. Does Dante need to consider the 97 words that were not abracadabra or gesundheit? Why or why not? (1 pt)
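A minimal Python sketch of the single-word Bayes-rule calculation (the helper name and data layout are mine, not from the course materials), using the Table 1 rates as P(w | y):

```python
# Posterior over authors after seeing one word, via Bayes rule.
def posterior(prior, likelihood, word):
    """prior: {author: P(y)}; likelihood: {author: {word: P(w | y)}}."""
    unnorm = {y: prior[y] * likelihood[y][word] for y in prior}
    z = sum(unnorm.values())                     # the "pesky denominator" P(w)
    return {y: p / z for y, p in unnorm.items()}

rates = {"Anna":  {"abracadabra": 5 / 1000, "gesundheit": 6 / 1000},
         "Barry": {"abracadabra": 10 / 1000, "gesundheit": 1 / 1000}}

# In-class version: uniform 50/50 prior, observe "abracadabra"
print(posterior({"Anna": 0.5, "Barry": 0.5}, rates, "abracadabra"))
# Problem 2.1: Catherine's 80/20 prior
print(posterior({"Anna": 0.8, "Barry": 0.2}, rates, "abracadabra"))
```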

Bayes Rule as hypothesis vector scoring

(Figure: bar charts over three hypotheses a, b, c, one per step below.)

Prior   P(H = h) = [0.2, 0.2, 0.6]   — sums to 1? Yes
Likelihood   P(E | H = h) = [0.2, 0.05, 0.05]   — sums to 1? No
Multiply → unnormalized posterior   P(E | H = h) P(H = h) = [0.04, 0.01, 0.03]   — sums to 1? No
Normalize → posterior   (1/Z) P(E | H = h) P(H = h) = [0.500, 0.125, 0.375]   — sums to 1? Yes
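The same scoring in a few lines of Python (a sketch using the slide's numbers, assuming numpy is available):

```python
import numpy as np

# Hypotheses a, b, c, with the values from the slide.
prior = np.array([0.2, 0.2, 0.6])          # P(H = h), sums to 1
likelihood = np.array([0.2, 0.05, 0.05])   # P(E | H = h), need not sum to 1
unnorm = likelihood * prior                # unnormalized posterior: [0.04, 0.01, 0.03]
posterior = unnorm / unnorm.sum()          # divide by Z: [0.5, 0.125, 0.375]
print(posterior)
```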

Text Classification with Naive Bayes


Classification problems

• Given text d, want to predict label y
  • Is this restaurant review positive or negative?
  • Is this email spam or not?
  • Which author wrote this text?
  • (Is this word a noun or verb?)
• d: documents, sentences, etc.
• y: discrete/categorical variable

Goal: from a training set of (d, y) pairs, learn a probabilistic classifier f(d) = P(y|d)


6.1 Naive Bayes Classifiers

In this section we introduce the multinomial naive Bayes classifier, so called because it is a Bayesian classifier that makes a simplifying (naive) assumption about how the features interact. The intuition of the classifier is shown in Fig. 6.1. We represent a text document as if it were a bag of words, that is, an unordered set of words with their position ignored, keeping only their frequency in the document. In the example in the figure, instead of representing the word order in all the phrases like “I love this movie” and “I would recommend it”, we simply note that the word I occurred 5 times in the entire excerpt, the word it 6 times, the words love, recommend, and movie once, and so on.

Features for model: bag of words

[Figure: the review text — “I love this movie! It's sweet, but with satirical humor. The dialogue is great and the whimsical adventure scenes are fun... It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet!” — reduced to word counts: it 6, I 5, the 4, to 3, and 3, seen 2, yet 1, would 1, whimsical 1, times 1, sweet 1, satirical 1, adventure 1, genre 1, fairy 1, humor 1, have 1, great 1, ...]

Figure 6.1 Intuition of the multinomial naive Bayes classifier applied to a movie review. The position of the words is ignored (the bag of words assumption) and we make use of the frequency of each word.
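A quick sketch of building such a bag-of-words representation in Python (the tokenization here is crude and mine, so counts for contractions like “It's” may differ slightly from the figure):

```python
from collections import Counter

review = ("I love this movie! It's sweet, but with satirical humor. The dialogue is "
          "great and the whimsical adventure scenes are fun... It manages to be "
          "whimsical and romantic while laughing at the conventions of the fairy tale "
          "genre. I would recommend it to just about anyone. I've seen it several "
          "times, and I'm always happy to see it again whenever I have a friend who "
          "hasn't seen it yet!")

# Lowercase + strip surrounding punctuation; not the textbook's exact preprocessing.
tokens = [w.strip(".,!?'\"") for w in review.lower().split()]
bag = Counter(tokens)          # unordered word -> frequency map
print(bag.most_common(5))
```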

Naive Bayes is a probabilistic classifier, meaning that for a document d, out of all classes c ∈ C the classifier returns the class ĉ which has the maximum posterior probability given the document. In Eq. 6.1 we use the hat notation ˆ to mean “our estimate of the correct class”.

ĉ = argmax_{c ∈ C} P(c | d)      (6.1)

This idea of Bayesian inference has been known since the work of Bayes (1763), and was first applied to text classification by Mosteller and Wallace (1964). The intuition of Bayesian classification is to use Bayes’ rule to transform Eq. 6.1 into other probabilities that have some useful properties. Bayes’ rule is presented in Eq. 6.2; it gives us a way to break down any conditional probability P(x | y) into three other ...

Levels of linguistic structure

Discourse: CommunicationEvent(e), SpeakerContext(s)
Semantics: Agent(e, Alice), Recipient(e, Bob), TemporalBefore(e, s)
Syntax: (parse tree over the sentence, with constituents S, NP, VP, PP and tags NounPrp, VerbPast, Prep, NounPrp, Punct)
Words: Alice talked to Bob .
Morphology: talk -ed
Characters: Alice talked to Bob.

Levels of linguistic structure

Words are fundamental units of meaning and easily identifiable* (*in some languages)

Words: Alice talked to Bob .
Characters: Alice talked to Bob.

How to classify with words?

• Approach #1: use a predefined dictionary (or make one up) — Human Knowledge
  • e.g. for sentiment... (a toy version is sketched after this list)
  • score += 1 for each “happy”, “awesome”, “cool”
  • score -= 1 for each “sad”, “awful”, “bad”
• Approach #2: use labeled documents — Supervised Learning
  • Learn which words correlate with positive vs. negative documents
  • Use these correlations to classify new documents
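A toy sketch of Approach #1 in Python; the word lists are made up, not an actual sentiment lexicon:

```python
# Dictionary-based sentiment scoring: +1 per positive word, -1 per negative word.
POSITIVE = {"happy", "awesome", "cool"}
NEGATIVE = {"sad", "awful", "bad"}

def sentiment_score(text):
    score = 0
    for token in text.lower().split():
        if token in POSITIVE:
            score += 1
        elif token in NEGATIVE:
            score -= 1
    return score

print(sentiment_score("awesome movie but kind of sad"))   # +1 - 1 = 0
```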

Supervised learning

Many labeled examples:   (d = “...”, y=B),  (d = “...”, y=B),  (d = “...”, y=A), ...
   → Training →   Model parameters:  P(y),  P(d | y)

Classify new texts:   (d = “...”, y=?) → y=A,   (d = “...”, y=?) → y=B,   using P(y | d)

Multinomial Naive Bayes

P(y | w_1..w_T) ∝ P(y) P(w_1..w_T | y)      (w_1..w_T: the tokens in the doc)

The “Naive Bayes” assumption: conditional independence,
P(w_1..w_T | y) = ∏_t P(w_t | y)

Parameters:
P(w | y) for each document category y and word type w
P(y): prior distribution over document categories y

Learning: estimate parameters as (smoothed) frequency ratios, e.g.
P(w | y, α) = [#(w occurrences in docs with label y) + α] / [#(tokens total across docs with label y) + V α]
(V: number of word types in the vocabulary)

Predictions: predict the class argmax_y P(Y = y | w_1..w_T), or predict the probability of each class...
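A minimal Python sketch of multinomial Naive Bayes with add-α smoothing, roughly following the formulas above; the function names, data layout, and toy training data are mine, not the course's Python demo:

```python
import math
from collections import Counter, defaultdict

ALPHA = 1.0  # add-alpha (pseudocount) smoothing

def train(docs):
    """docs: list of (tokens, label) pairs. Returns class priors, per-class word counts, vocab."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)   # word_counts[y][w] = #(w occurrences in docs with label y)
    vocab = set()
    for tokens, y in docs:
        label_counts[y] += 1
        word_counts[y].update(tokens)
        vocab.update(tokens)
    n_docs = sum(label_counts.values())
    priors = {y: c / n_docs for y, c in label_counts.items()}   # P(y)
    return priors, word_counts, vocab

def predict(tokens, priors, word_counts, vocab):
    """argmax_y P(y) * prod_t P(w_t | y), computed in log space for stability."""
    V = len(vocab)
    scores = {}
    for y in priors:
        total = sum(word_counts[y].values())   # #(tokens total across docs with label y)
        score = math.log(priors[y])
        for w in tokens:
            score += math.log((word_counts[y][w] + ALPHA) / (total + V * ALPHA))
        scores[y] = score
    return max(scores, key=scores.get)

# Tiny made-up training set, just to exercise the code.
docs = [("I love this movie".lower().split(), "pos"),
        ("awful boring movie".lower().split(), "neg")]
priors, counts, vocab = train(docs)
print(predict("love this boring boring movie".lower().split(), priors, counts, vocab))
# -> "neg" here: the two counts of "boring" outweigh "love"
```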