Bayesian Methods: Naïve Bayes

Nicholas Ruozzi, University of Texas at Dallas

Based on the slides of Vibhav Gogate

Last Time

• Parameter learning
  – Learning the parameter of a simple coin flipping model
• Prior distributions
• Posterior distributions
• Today: more parameter learning and naïve Bayes

Maximum Likelihood Estimation (MLE)

• Data: Observed set of $\alpha_H$ heads and $\alpha_T$ tails
• Hypothesis: Coin flips follow a binomial distribution
• Learning: Find the "best" $\theta$
• MLE: Choose $\theta$ to maximize the likelihood (probability of $D$ given $\theta$)

$$\theta_{MLE} = \arg\max_\theta \; p(D \mid \theta)$$
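For the coin model this maximization has the familiar closed form $\theta_{MLE} = \alpha_H / (\alpha_H + \alpha_T)$, the empirical fraction of heads. A minimal sketch in Python; the function name and the 0/1 encoding of flips are illustrative choices:

```python
def mle_coin(flips):
    """MLE for the coin model: theta_MLE = alpha_H / (alpha_H + alpha_T),
    i.e., the empirical fraction of heads (flips encoded as 1 = heads, 0 = tails)."""
    return sum(flips) / len(flips)

# Example: 8 heads and 2 tails -> theta_MLE = 0.8
print(mle_coin([1, 1, 1, 1, 1, 1, 1, 1, 0, 0]))
```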

MAP Estimation

• Choosing $\theta$ to maximize the posterior distribution is called maximum a posteriori (MAP) estimation

$$\theta_{MAP} = \arg\max_\theta \; p(\theta \mid D)$$

• The only difference between $\theta_{MLE}$ and $\theta_{MAP}$ is that one assumes a uniform prior (MLE) and the other allows an arbitrary prior
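A minimal sketch of the contrast, reusing the Beta prior from the coin-flipping example; the Beta(2, 2) hyperparameters are an arbitrary illustrative choice:

```python
def map_coin(alpha_H, alpha_T, a=2.0, b=2.0):
    """MAP estimate of the heads probability under a Beta(a, b) prior.

    The posterior is Beta(alpha_H + a, alpha_T + b), whose mode is
    (alpha_H + a - 1) / (alpha_H + alpha_T + a + b - 2).
    With a = b = 1 (a uniform prior), this reduces to the MLE.
    """
    return (alpha_H + a - 1.0) / (alpha_H + alpha_T + a + b - 2.0)

print(map_coin(8, 2))            # 0.75: pulled toward 0.5 by the Beta(2, 2) prior
print(map_coin(8, 2, a=1, b=1))  # 0.8: uniform prior recovers the MLE
```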

Sample Complexity

• How many coin flips do we need in order to guarantee that our learned parameter does not differ too much from the true parameter (with high probability)?

• Can use the Chernoff bound (again!)

  – Suppose $Y_1, \ldots, Y_N$ are i.i.d. random variables taking values in $\{0, 1\}$ such that $E_p[Y_i] = y$. For $\epsilon > 0$,

$$p\left(\left|\, y - \frac{1}{N}\sum_i Y_i \right| \ge \epsilon\right) \le 2e^{-2N\epsilon^2}$$

  – For the coin-flipping problem with $X_1, \ldots, X_N$ i.i.d. coin flips and $\epsilon > 0$, the sample mean $\frac{1}{N}\sum_i X_i$ is exactly $\theta_{MLE}$, so

$$p\left(\left|\, \theta_{true} - \theta_{MLE} \right| \ge \epsilon\right) \le 2e^{-2N\epsilon^2}$$

  – To guarantee $|\theta_{true} - \theta_{MLE}| < \epsilon$ with probability at least $1 - \delta$, it suffices to have

$$\delta \ge 2e^{-2N\epsilon^2} \;\Rightarrow\; N \ge \frac{1}{2\epsilon^2}\ln\frac{2}{\delta}$$
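Plugging concrete numbers into this bound makes it tangible; in the sketch below, the tolerance $\epsilon = 0.1$ and failure probability $\delta = 0.05$ are just example values:

```python
import math

def flips_needed(eps, delta):
    """Number of coin flips N that the Chernoff bound says suffices so that
    |theta_true - theta_MLE| < eps with probability at least 1 - delta."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

# e.g., estimate theta to within 0.1 with 95% confidence
print(flips_needed(0.1, 0.05))  # 185 flips suffice according to this bound
```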

MLE for Gaussian Distributions

• Two-parameter distribution characterized by a mean and a variance
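For reference, the density of a Gaussian with mean $\mu$ and variance $\sigma^2$ is

$$p(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$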

Some properties of Gaussians

• Affine transformations (multiplying by a scalar and adding a constant) of a Gaussian are Gaussian
  – $X \sim N(\mu, \sigma^2)$
  – $Y = aX + b \;\Rightarrow\; Y \sim N(a\mu + b, a^2\sigma^2)$
• The sum of independent Gaussians is Gaussian
  – $X \sim N(\mu_X, \sigma_X^2)$, $Y \sim N(\mu_Y, \sigma_Y^2)$
  – $Z = X + Y \;\Rightarrow\; Z \sim N(\mu_X + \mu_Y, \sigma_X^2 + \sigma_Y^2)$
• Easy to differentiate, as we will see soon!
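A quick Monte Carlo sanity check of both properties (the means, variances, and constants below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# X ~ N(1, 2^2) and Y ~ N(-3, 1.5^2); all constants here are arbitrary
X = rng.normal(1.0, 2.0, size=1_000_000)
Y = rng.normal(-3.0, 1.5, size=1_000_000)

# Affine transformation: Z = aX + b should have mean a*mu + b = 7, variance a^2*sigma^2 = 16
a, b = 2.0, 5.0
Z = a * X + b
print(Z.mean(), Z.var())

# Sum of independent Gaussians: X + Y should have mean -2, variance 4 + 2.25 = 6.25
S = X + Y
print(S.mean(), S.var())
```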

Learning a Gaussian

• Collect data
  – Hopefully, i.i.d. samples
  – e.g., exam scores

    i     Exam Score
    0     85
    1     95
    2     100
    3     12
    …     …
    99    89

• Learn parameters
  – Mean: $\mu$
  – Variance: $\sigma^2$

MLE for a Gaussian

• Probability of $N$ i.i.d. samples $D = (x^{(1)}, \ldots, x^{(N)})$:

$$p(D \mid \mu, \sigma) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x^{(i)} - \mu)^2}{2\sigma^2}}$$

• Log-likelihood of the data:

$$\ln p(D \mid \mu, \sigma) = -\frac{N}{2}\ln(2\pi\sigma^2) - \sum_{i=1}^{N} \frac{(x^{(i)} - \mu)^2}{2\sigma^2}$$

MLE for the Mean of a Gaussian

$$\frac{\partial}{\partial\mu}\ln p(D \mid \mu, \sigma) = -\frac{\partial}{\partial\mu}\sum_{i=1}^{N}\frac{(x^{(i)} - \mu)^2}{2\sigma^2} = \sum_{i=1}^{N}\frac{x^{(i)} - \mu}{\sigma^2} = \frac{\sum_{i=1}^{N} x^{(i)} - N\mu}{\sigma^2} = 0$$

(the $\ln(2\pi\sigma^2)$ term does not depend on $\mu$ and drops out)

$$\Rightarrow\quad \mu_{MLE} = \frac{1}{N}\sum_{i=1}^{N} x^{(i)}$$

MLE for Variance

$$\frac{\partial}{\partial\sigma}\ln p(D \mid \mu, \sigma) = \frac{\partial}{\partial\sigma}\left[-\frac{N}{2}\ln(2\pi\sigma^2) - \sum_{i=1}^{N}\frac{(x^{(i)} - \mu)^2}{2\sigma^2}\right] = -\frac{N}{\sigma} + \sum_{i=1}^{N}\frac{(x^{(i)} - \mu)^2}{\sigma^3} = 0$$

$$\Rightarrow\quad \sigma^2_{MLE} = \frac{1}{N}\sum_{i=1}^{N}\left(x^{(i)} - \mu_{MLE}\right)^2$$

Learning Gaussian Parameters

$$\mu_{MLE} = \frac{1}{N}\sum_{i=1}^{N} x^{(i)} \qquad\qquad \sigma^2_{MLE} = \frac{1}{N}\sum_{i=1}^{N}\left(x^{(i)} - \mu_{MLE}\right)^2$$

• MLE for the variance of a Gaussian is biased
  – Expected result of estimation is not the true parameter!
  – Unbiased variance estimator:

$$\sigma^2_{unbiased} = \frac{1}{N-1}\sum_{i=1}^{N}\left(x^{(i)} - \mu_{MLE}\right)^2$$
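A minimal sketch of the two variance estimators on a handful of made-up exam scores:

```python
import numpy as np

scores = np.array([85.0, 95.0, 100.0, 12.0, 89.0])  # made-up exam scores

mu_mle = scores.mean()
var_mle = ((scores - mu_mle) ** 2).mean()                          # divides by N (biased)
var_unbiased = ((scores - mu_mle) ** 2).sum() / (len(scores) - 1)  # divides by N - 1

print(mu_mle, var_mle, var_unbiased)
# numpy equivalents: scores.var() is the MLE (ddof=0); scores.var(ddof=1) is unbiased
```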

Bayesian Categorization/Classification

• Given features $x = (x_1, \ldots, x_m)$, predict a label $y$
• If we had a joint distribution over $x$ and $y$, given $x$ we could find the label using MAP inference

$$\arg\max_y \; p(y \mid x_1, \ldots, x_m)$$

• Can compute this in exactly the same way that we did before using Bayes rule:

$$p(y \mid x_1, \ldots, x_m) = \frac{p(x_1, \ldots, x_m \mid y)\, p(y)}{p(x_1, \ldots, x_m)}$$

Article Classification

• Given a collection of news articles labeled by topic, the goal is to predict the topic of a new article
  – One possible feature vector:
    • One feature for each word in the document, in order
      – $x_i$ corresponds to the $i$-th word
      – $x_i$ can take a different value for each word in the dictionary

Text Classification

• $x_1, x_2, \ldots$ is the sequence of words in the document
• The set of all possible feature vectors (and hence $p(y \mid x)$) is huge
  – An article has at least 1000 words: $x = (x_1, \ldots, x_{1000})$
  – $x_i$ represents the $i$-th word in the document
    • Can be any word in the dictionary: at least 10,000 words
  – $10{,}000^{1000} = 10^{4000}$ possible values
  – Atoms in the universe: $\sim 10^{80}$

Bag of Words Model

• Typically assume the position in the document doesn't matter:

$$p(X_i = x_i \mid Y = y) = p(X_k = x_i \mid Y = y)$$

  – All positions have the same distribution
  – Ignores the order of words
  – Sounds like a bad assumption, but often works well!
• Features
  – Set of all possible words and their corresponding frequencies (the number of times each occurs in the document)

Bag of Words

• Example word-count vector for a document:

    aardvark   0
    about      2
    all        2
    Africa     1
    apple      0
    anxious    0
    ...
    gas        1
    ...
    oil        1
    ...
    Zaire      0
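A minimal sketch of how such a count vector can be built (whitespace tokenization is a simplification; a real pipeline would also strip punctuation):

```python
from collections import Counter

def bag_of_words(document):
    """Map a document to word-count features, ignoring word order."""
    return Counter(document.lower().split())

print(bag_of_words("Oil prices rose as gas demand in Zaire grew and oil analysts took note"))
# Counter({'oil': 2, 'prices': 1, 'rose': 1, ...})
```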

Need to Simplify Somehow

• Even with the bag of words assumption, there are still too many possible outcomes
  – Too many probabilities $p(x_1, \ldots, x_m \mid y)$
• Can we assume some of them are the same?

$$p(x_1, x_2 \mid y) = p(x_1 \mid y)\, p(x_2 \mid y)$$

  – This is a conditional independence assumption

Conditional Independence

• $X$ is conditionally independent of $Y$ given $Z$ if the probability distribution for $X$ is independent of the value of $Y$, given the value of $Z$:

$$p(X \mid Y, Z) = p(X \mid Z)$$

• Equivalently,

$$p(X, Y \mid Z) = p(X \mid Z)\, p(Y \mid Z)$$

Naïve Bayes

• Naïve Bayes assumption
  – Features are independent given the class label:

$$p(x_1, x_2 \mid y) = p(x_1 \mid y)\, p(x_2 \mid y)$$

  – More generally:

$$p(x_1, \ldots, x_m \mid y) = \prod_{i=1}^{m} p(x_i \mid y)$$

• How many parameters now? (a count is sketched below)
  – Suppose $x$ is composed of $m$ binary features
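One way to do the count, assuming the label is also binary: the full conditional $p(x_1, \ldots, x_m \mid y)$ needs $2^m - 1$ free parameters per class, while the naïve Bayes factorization needs only $m$ per class (one Bernoulli per feature), plus one for the prior $p(y)$:

$$\underbrace{2\,(2^m - 1) + 1}_{\text{without the assumption}} \qquad\text{vs.}\qquad \underbrace{2m + 1}_{\text{naïve Bayes}}$$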

The Naïve Bayes Classifier

• Given (graphical model: class node $Y$ with children $X_1, X_2, \ldots, X_n$):
  – Prior $p(y)$
  – $m$ conditionally independent features $X_i$ given the class $Y$
  – For each $X_i$, a likelihood $p(X_i \mid Y)$
• Classify via

$$y^* = h_{NB}(x) = \arg\max_y \; p(y)\, p(x_1, \ldots, x_m \mid y) = \arg\max_y \; p(y) \prod_{i=1}^{m} p(x_i \mid y)$$
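A minimal sketch of this decision rule in Python, with the parameter tables passed in as plain dictionaries (the table layout is an illustrative assumption; estimating the tables is the subject of the next slide) and log-probabilities used to avoid numerical underflow:

```python
import math

def nb_classify(x, prior, likelihood):
    """Naive Bayes prediction: argmax_y p(y) * prod_i p(x_i | y), computed with
    log-probabilities so that long products do not underflow.

    prior[y] is p(y); likelihood[y][i][x_i] is p(X_i = x_i | Y = y).
    Both tables are assumed to have been estimated already.
    """
    def score(y):
        return math.log(prior[y]) + sum(
            math.log(likelihood[y][i][xi]) for i, xi in enumerate(x))

    return max(prior, key=score)
```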

MLE for the Parameters of NB

• Given a dataset, count occurrences for all pairs
  – $Count(X_i = x_i, Y = y)$ is the number of samples in which $X_i = x_i$ and $Y = y$
• MLE for discrete NB:

$$p(Y = y) = \frac{Count(Y = y)}{\sum_{y'} Count(Y = y')}$$

$$p(X_i = x_i \mid Y = y) = \frac{Count(X_i = x_i, Y = y)}{\sum_{x_i'} Count(X_i = x_i', Y = y)}$$
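A minimal sketch of these counting formulas; the data layout (a list of feature tuples plus a parallel list of labels) is an illustrative assumption:

```python
from collections import Counter, defaultdict

def fit_naive_bayes(X, Y):
    """MLE for discrete naive Bayes by counting, following the formulas above.

    X is a list of feature tuples and Y the matching list of labels. Returns
    prior[y] = p(Y = y) and likelihood[y][i][x_i] = p(X_i = x_i | Y = y).
    """
    N = len(Y)
    class_counts = Counter(Y)
    prior = {y: c / N for y, c in class_counts.items()}

    value_counts = defaultdict(Counter)          # (y, i) -> counts of feature values
    for x, y in zip(X, Y):
        for i, xi in enumerate(x):
            value_counts[(y, i)][xi] += 1

    likelihood = {
        y: {i: {xi: c / class_counts[y] for xi, c in value_counts[(y, i)].items()}
            for i in range(len(X[0]))}
        for y in class_counts
    }
    return prior, likelihood
```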

Naïve Bayes Calculations
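A small illustrative calculation with made-up numbers: suppose $p(\text{spam}) = 0.3$, $p(\text{free} \mid \text{spam}) = 0.5$, $p(\text{free} \mid \text{ham}) = 0.1$, $p(\text{meeting} \mid \text{spam}) = 0.1$, and $p(\text{meeting} \mid \text{ham}) = 0.4$. For the two-word document ("free", "meeting"), naïve Bayes compares

$$p(\text{spam})\prod_i p(x_i \mid \text{spam}) = 0.3 \cdot 0.5 \cdot 0.1 = 0.015 \qquad\text{vs.}\qquad p(\text{ham})\prod_i p(x_i \mid \text{ham}) = 0.7 \cdot 0.1 \cdot 0.4 = 0.028,$$

so the document is classified as ham.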

Subtleties of NB Classifier: #1

• Usually, features are not conditionally independent:

$$p(x_1, \ldots, x_m \mid y) \ne \prod_{i=1}^{m} p(x_i \mid y)$$

• The naïve Bayes assumption is often violated, yet it performs surprisingly well in many cases
• Plausible reason: only need the probability of the correct class to be the largest!
  – Example: binary classification; just need to figure out the correct side of 0.5 and not the actual probability (0.51 is the same as 0.99).

Subtleties of NB Classifier: #2

• What if you never see a training instance with $X_1 = a$ and $Y = b$?
  – Example: you did not see the word "Enlargement" in spam!
  – Then $p(X_1 = a \mid Y = b) = 0$
  – Thus, no matter what values $X_2, \ldots, X_m$ take,

$$p(X_1 = a, X_2 = x_2, \ldots, X_m = x_m \mid Y = b) = 0$$

  – Why?

Subtleties of NB Classifier: #2

• To fix this, use a prior!
  – Already saw how to do this in the coin-flipping example using the Beta distribution
  – For NB over discrete spaces, can use the Dirichlet prior
  – The Dirichlet distribution is a distribution over $z_1, \ldots, z_k \in (0, 1)$ such that $z_1 + \cdots + z_k = 1$, characterized by $k$ parameters $\alpha_1, \ldots, \alpha_k$:

$$f(z_1, \ldots, z_k; \alpha_1, \ldots, \alpha_k) \propto \prod_{i=1}^{k} z_i^{\alpha_i - 1}$$

• This is called smoothing; what do the parameter estimates look like under these kinds of priors?
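A minimal sketch of the resulting add-$\alpha$ ("Laplace") smoothed estimate, which is the form a symmetric Dirichlet prior leads to; the $\alpha = 1$ default and the vocabulary size in the example are illustrative:

```python
def smoothed_likelihood(count_xy, count_y, num_values, alpha=1.0):
    """Add-alpha smoothed estimate of p(X_i = x_i | Y = y): even a value that was
    never seen with this class in training gets a small nonzero probability.

    count_xy   = Count(X_i = x_i, Y = y)
    count_y    = Count(Y = y)
    num_values = number of values the feature X_i can take (e.g., vocabulary size)
    """
    return (count_xy + alpha) / (count_y + alpha * num_values)

# A word never seen in spam still gets a small nonzero probability:
print(smoothed_likelihood(count_xy=0, count_y=100, num_values=10_000))  # ~9.9e-05
```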
