
Lecture 2: Probability, Naive Bayes

CS 585, Fall 2015 Introduction to Natural Language Processing http://people.cs.umass.edu/~brenocon/inlp2015/

Brendan O’Connor

Today

• Probability Review
• “Naive Bayes” classification
• Python demo


Probability Theory Review

1 = Σ_a P(A = a)

Conditional Probability:   P(A | B) = P(A, B) / P(B)

Chain Rule:   P(A, B) = P(A | B) P(B)

Law of Total Probability:   P(A) = Σ_b P(A, B = b) = Σ_b P(A | B = b) P(B = b)

Disjunction (Union):   P(A ∨ B) = P(A) + P(B) − P(A, B)

Negation (Complement):   P(¬A) = 1 − P(A)
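As a quick sanity check, here is a tiny worked example (the numbers are my own, not from the slides) verifying the chain rule and the law of total probability on a small joint distribution:

```python
# A tiny joint distribution over A in {0, 1} and B in {0, 1} (made-up numbers),
# used to check the identities above.
P = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}    # P(A=a, B=b)

P_B1 = sum(p for (a, b), p in P.items() if b == 1)          # marginal: P(B=1) = 0.6
P_A1_given_B1 = P[(1, 1)] / P_B1                            # conditional probability
assert abs(P_A1_given_B1 * P_B1 - P[(1, 1)]) < 1e-12        # chain rule holds
P_A1 = sum(P[(1, b)] for b in (0, 1))                       # law of total probability
print(P_A1, P_A1_given_B1)                                  # ≈ 0.7, ≈ 0.667
```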

Bayes Rule

P(D | H) — model: how the hypothesis causes the data
D: observed data / evidence
H: unknown
P(H | D)

Bayes Rule tells you how to flip the conditional. Useful if you assume a generative process for your data.

P(H | D) = P(D | H) P(H) / P(D)

(Posterior = Likelihood × Prior / Normalizer)

Rev. Thomas Bayes, c. 1701–1761

Bayes Rule and its pesky denominator

P(h | d) = P(d | h) P(h) / P(d) = P(d | h) P(h) / Σ_{h′} P(d | h′) P(h′)

The denominator P(d) is constant with respect to h.

“Proportional to”:   P(h | d) ∝ P(d | h) P(h)   (implicitly, for varying h). This notation is very common, though slightly ambiguous.

P(d | h) P(h) is the unnormalized posterior: by itself it does not sum to 1!

Bayes Rule for classification inference

Assume a generative process, P(w | y):  y = doc label, w = words.
Inference problem: given w, what is y?

Authorship problem: classify a new text. Is it y=Anna or y=Barry?

Observe w: look at a random word in the new text. It is abracadabra.

P(y=A | w=abracadabra) ?

P(y | w) = P(w|y) P(y) / P(w)

P(y): assume 50% prior prob.

From the homework handout shown on the slide:

1.3 The chain rule and marginal probabilities (1 pt). Alice has been studying karate, and has a 50% chance of surviving an encounter with Zombie Bob. If she opens the door, what is the chance that she will live, conditioned on Bob saying graagh? Assume there is a 100% chance that Alice lives if Bob is not a zombie.

2 Necromantic Scrolls. The Necromantic Scroll Aficionados (NSA) would like to know the author of a recently discovered ancient scroll. They have narrowed down the possibilities to two candidate wizards: Anna and Barry. From painstaking corpus analysis of texts known to be written by each of these wizards, they have collected frequencies for the words abracadabra and gesundheit, shown in Table 1.

P(w | y): calculate from previous data (Table 1).

Table 1: Word frequencies for wizards Anna and Barry
         abracadabra          gesundheit
Anna     5 per 1000 words     6 per 1000 words
Barry    10 per 1000 words    1 per 1000 words

2.1 Bayes rule (1 pt) Catherine has a prior belief that Anna is 80% likely to be the author of the scroll. She peeks at a random word of the scroll, and sees that it is the word abracadabra. Use Bayes’ rule to compute Catherine’s posterior belief that Anna is the author of the scroll.

2.2 Multiple words (2 pts) Dante has no prior belief about the authorship of the scrolls, and reads the entire first page. It contains 100 words, with two counts of the word abracadabra and one count of the word gesundheit.

1. What is his posterior belief about the probability that Anna is author of the scroll? (1 pt)

2. Does Dante need to consider the 97 words that were not abracadabra or gesundheit? Why or why not? (1 pt)
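A minimal Python sketch of the single-word Bayes-rule calculation (the helper name and data layout are mine, not from the course materials), using the Table 1 rates as P(w | y):

```python
# Posterior over authors after seeing one word, via Bayes rule.
def posterior(prior, likelihood, word):
    """prior: {author: P(y)}; likelihood: {author: {word: P(w | y)}}."""
    unnorm = {y: prior[y] * likelihood[y][word] for y in prior}
    z = sum(unnorm.values())                     # the "pesky denominator" P(w)
    return {y: p / z for y, p in unnorm.items()}

rates = {"Anna":  {"abracadabra": 5 / 1000, "gesundheit": 6 / 1000},
         "Barry": {"abracadabra": 10 / 1000, "gesundheit": 1 / 1000}}

# In-class version: uniform 50/50 prior, observe "abracadabra"
print(posterior({"Anna": 0.5, "Barry": 0.5}, rates, "abracadabra"))
# Problem 2.1: Catherine's 80/20 prior
print(posterior({"Anna": 0.8, "Barry": 0.2}, rates, "abracadabra"))
```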

Bayes Rule as hypothesis vector scoring

(Figure: bar charts over three hypotheses a, b, c, one per step below.)

Prior   P(H = h) = [0.2, 0.2, 0.6]   — sums to 1? Yes
Likelihood   P(E | H = h) = [0.2, 0.05, 0.05]   — sums to 1? No
Multiply → unnormalized posterior   P(E | H = h) P(H = h) = [0.04, 0.01, 0.03]   — sums to 1? No
Normalize → posterior   (1/Z) P(E | H = h) P(H = h) = [0.500, 0.125, 0.375]   — sums to 1? Yes
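The same scoring in a few lines of Python (a sketch using the slide's numbers, assuming numpy is available):

```python
import numpy as np

# Hypotheses a, b, c, with the values from the slide.
prior = np.array([0.2, 0.2, 0.6])          # P(H = h), sums to 1
likelihood = np.array([0.2, 0.05, 0.05])   # P(E | H = h), need not sum to 1
unnorm = likelihood * prior                # unnormalized posterior: [0.04, 0.01, 0.03]
posterior = unnorm / unnorm.sum()          # divide by Z: [0.5, 0.125, 0.375]
print(posterior)
```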

Text Classification with Naive Bayes


Classification problems

• Given text d, want to predict label y
  • Is this restaurant review positive or negative?
  • Is this email spam or not?
  • Which author wrote this text?
  • (Is this word a noun or verb?)
• d: documents, sentences, etc.
• y: discrete/categorical variable

Goal: from a training set of (d, y) pairs, learn a probabilistic classifier f(d) = P(y|d)


6.1 Naive Bayes Classifiers

In this section we introduce the multinomial naive Bayes classifier, so called because it is a Bayesian classifier that makes a simplifying (naive) assumption about how the features interact. The intuition of the classifier is shown in Fig. 6.1. We represent a text document as if it were a bag of words, that is, an unordered set of words with their position ignored, keeping only their frequency in the document. In the example in the figure, instead of representing the word order in all the phrases like “I love this movie” and “I would recommend it”, we simply note that the word I occurred 5 times in the entire excerpt, the word it 6 times, the words love, recommend, and movie once, and so on.

Features for model: bag of words

[Figure: the review text — “I love this movie! It's sweet, but with satirical humor. The dialogue is great and the whimsical adventure scenes are fun... It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet!” — reduced to word counts: it 6, I 5, the 4, to 3, and 3, seen 2, yet 1, would 1, whimsical 1, times 1, sweet 1, satirical 1, adventure 1, genre 1, fairy 1, humor 1, have 1, great 1, ...]

Figure 6.1 Intuition of the multinomial naive Bayes classifier applied to a movie review. The position of the words is ignored (the bag of words assumption) and we make use of the frequency of each word.
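A quick sketch of building such a bag-of-words representation in Python (the tokenization here is crude and mine, so counts for contractions like “It's” may differ slightly from the figure):

```python
from collections import Counter

review = ("I love this movie! It's sweet, but with satirical humor. The dialogue is "
          "great and the whimsical adventure scenes are fun... It manages to be "
          "whimsical and romantic while laughing at the conventions of the fairy tale "
          "genre. I would recommend it to just about anyone. I've seen it several "
          "times, and I'm always happy to see it again whenever I have a friend who "
          "hasn't seen it yet!")

# Lowercase + strip surrounding punctuation; not the textbook's exact preprocessing.
tokens = [w.strip(".,!?'\"") for w in review.lower().split()]
bag = Counter(tokens)          # unordered word -> frequency map
print(bag.most_common(5))
```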

Naive Bayes is a probabilistic classifier, meaning that for a document d, out of all classes c ∈ C the classifier returns the class ĉ which has the maximum posterior probability given the document. In Eq. 6.1 we use the hat notation ˆ to mean “our estimate of the correct class”.

ĉ = argmax_{c ∈ C} P(c | d)      (6.1)

This idea of Bayesian inference has been known since the work of Bayes (1763), and was first applied to text classification by Mosteller and Wallace (1964). The intuition of Bayesian classification is to use Bayes’ rule to transform Eq. 6.1 into other probabilities that have some useful properties. Bayes’ rule is presented in Eq. 6.2; it gives us a way to break down any conditional probability P(x | y) into three other ...

Levels of linguistic structure

Discourse: CommunicationEvent(e), SpeakerContext(s)
Semantics: Agent(e, Alice), Recipient(e, Bob), TemporalBefore(e, s)
Syntax: (parse tree over the sentence, with constituents S, NP, VP, PP and tags NounPrp, VerbPast, Prep, NounPrp, Punct)
Words: Alice talked to Bob .
Morphology: talk -ed
Characters: Alice talked to Bob.

Levels of linguistic structure

Words are fundamental units of meaning and easily identifiable* (*in some languages)

Words: Alice talked to Bob .
Characters: Alice talked to Bob.

How to classify with words?

• Approach #1: use a predefined dictionary (or make one up) — Human Knowledge
  • e.g. for sentiment... (a toy version is sketched after this list)
  • score += 1 for each “happy”, “awesome”, “cool”
  • score -= 1 for each “sad”, “awful”, “bad”
• Approach #2: use labeled documents — Supervised Learning
  • Learn which words correlate with positive vs. negative documents
  • Use these correlations to classify new documents
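A toy sketch of Approach #1 in Python; the word lists are made up, not an actual sentiment lexicon:

```python
# Dictionary-based sentiment scoring: +1 per positive word, -1 per negative word.
POSITIVE = {"happy", "awesome", "cool"}
NEGATIVE = {"sad", "awful", "bad"}

def sentiment_score(text):
    score = 0
    for token in text.lower().split():
        if token in POSITIVE:
            score += 1
        elif token in NEGATIVE:
            score -= 1
    return score

print(sentiment_score("awesome movie but kind of sad"))   # +1 - 1 = 0
```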

Supervised learning

Many labeled examples:   (d = “...”, y=B),  (d = “...”, y=B),  (d = “...”, y=A), ...
   → Training →   Model parameters:  P(y),  P(d | y)

Classify new texts:   (d = “...”, y=?) → y=A,   (d = “...”, y=?) → y=B,   using P(y | d)

Multinomial Naive Bayes

P(y | w_1..w_T) ∝ P(y) P(w_1..w_T | y)      (w_1..w_T: the tokens in the doc)

The “Naive Bayes” assumption: conditional independence,
P(w_1..w_T | y) = ∏_t P(w_t | y)

Parameters:
P(w | y) for each document category y and word type w
P(y): prior distribution over document categories y

Learning: estimate parameters as (smoothed) frequency ratios, e.g.
P(w | y, α) = [#(w occurrences in docs with label y) + α] / [#(tokens total across docs with label y) + V α]
(V: number of word types in the vocabulary)

Predictions: predict the class argmax_y P(Y = y | w_1..w_T), or predict the probability of each class...
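A minimal Python sketch of multinomial Naive Bayes with add-α smoothing, roughly following the formulas above; the function names, data layout, and toy training data are mine, not the course's Python demo:

```python
import math
from collections import Counter, defaultdict

ALPHA = 1.0  # add-alpha (pseudocount) smoothing

def train(docs):
    """docs: list of (tokens, label) pairs. Returns class priors, per-class word counts, vocab."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)   # word_counts[y][w] = #(w occurrences in docs with label y)
    vocab = set()
    for tokens, y in docs:
        label_counts[y] += 1
        word_counts[y].update(tokens)
        vocab.update(tokens)
    n_docs = sum(label_counts.values())
    priors = {y: c / n_docs for y, c in label_counts.items()}   # P(y)
    return priors, word_counts, vocab

def predict(tokens, priors, word_counts, vocab):
    """argmax_y P(y) * prod_t P(w_t | y), computed in log space for stability."""
    V = len(vocab)
    scores = {}
    for y in priors:
        total = sum(word_counts[y].values())   # #(tokens total across docs with label y)
        score = math.log(priors[y])
        for w in tokens:
            score += math.log((word_counts[y][w] + ALPHA) / (total + V * ALPHA))
        scores[y] = score
    return max(scores, key=scores.get)

# Tiny made-up training set, just to exercise the code.
docs = [("I love this movie".lower().split(), "pos"),
        ("awful boring movie".lower().split(), "neg")]
priors, counts, vocab = train(docs)
print(predict("love this boring boring movie".lower().split(), priors, counts, vocab))
# -> "neg" here: the two counts of "boring" outweigh "love"
```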