Introduction to Machine Learning, Lecture 3

Mehryar Mohri
Courant Institute and Google Research
[email protected]

Bayesian Learning

Bayes’ Formula/Rule

Terminology:

\underbrace{\Pr[Y \mid X]}_{\text{posterior}} \;=\; \frac{\overbrace{\Pr[X \mid Y]}^{\text{likelihood}}\;\overbrace{\Pr[Y]}^{\text{prior}}}{\underbrace{\Pr[X]}_{\text{evidence}}}.
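One step the slide leaves implicit: the formula follows from writing the joint probability Pr[X, Y] in two ways,

\Pr[Y \mid X]\,\Pr[X] = \Pr[X, Y] = \Pr[X \mid Y]\,\Pr[Y]
\;\Longrightarrow\;
\Pr[Y \mid X] = \frac{\Pr[X \mid Y]\,\Pr[Y]}{\Pr[X]} \qquad (\Pr[X] > 0).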

Loss Function

Definition: a function L: Y × Y → R+ indicating the penalty for an incorrect prediction.
• L(y, y′): loss for the prediction of y instead of the correct label y′.

Examples:
• zero-one loss: standard loss function in classification; L(y, y′) = 1_{y ≠ y′} for y, y′ ∈ Y.
• non-symmetric losses: e.g., for spam classification; L(ham, spam) ≤ L(spam, ham).
• squared loss: standard loss function in regression; L(y, y′) = (y′ − y)².
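A minimal Python sketch of these three losses (the concrete spam costs are invented; only their ordering matters):

    # A sketch of the three example losses; argument order matches the slide:
    # first the prediction, then the correct label.

    def zero_one_loss(y_pred, y_true):
        # 1 if the prediction differs from the correct label, 0 otherwise.
        return 1.0 if y_pred != y_true else 0.0

    def spam_loss(y_pred, y_true):
        # Non-symmetric loss: flagging a legitimate email (ham) as spam is
        # penalized more than letting a spam message through, so that
        # L(ham, spam) <= L(spam, ham).  The costs 1.0 and 5.0 are made up.
        if y_pred == y_true:
            return 0.0
        return 1.0 if (y_pred, y_true) == ("ham", "spam") else 5.0

    def squared_loss(y_pred, y_true):
        # Standard squared loss for regression.
        return (y_pred - y_true) ** 2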

Classification Problem

Input space X: e.g., a set of documents.
• feature vector Φ(x) ∈ R^N associated to x ∈ X.
• notation: feature vector x ∈ R^N.
• example: vector of word counts in a document.

Output or target space Y: set of classes; e.g., sport, business, art.

Problem: given x, predict the correct class y ∈ Y associated to x.

Bayesian Prediction

Definition: the expected conditional loss of predicting ŷ ∈ Y is

\mathcal{L}[\hat{y} \mid x] = \sum_{y \in \mathcal{Y}} L(\hat{y}, y)\,\Pr[y \mid x].

Bayesian decision: predict the class minimizing the expected conditional loss, that is

\hat{y}^* = \argmin_{\hat{y} \in \mathcal{Y}} \mathcal{L}[\hat{y} \mid x] = \argmin_{\hat{y} \in \mathcal{Y}} \sum_{y \in \mathcal{Y}} L(\hat{y}, y)\,\Pr[y \mid x].

• zero-one loss: \hat{y}^* = \argmax_{y \in \mathcal{Y}} \Pr[y \mid x], the Maximum a Posteriori (MAP) principle.
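A minimal Python sketch of this decision rule, assuming the posteriors Pr[y | x] are already available as a dictionary (the loss, classes, and numbers are illustrative, not from the slides):

    # Bayesian decision: predict the class minimizing the expected conditional
    # loss  L[y_hat | x] = sum_y L(y_hat, y) * Pr[y | x].
    def bayes_decision(posterior, loss, classes):
        def expected_loss(y_hat):
            return sum(loss(y_hat, y) * posterior[y] for y in classes)
        return min(classes, key=expected_loss)

    classes = ["spam", "ham"]
    posterior = {"spam": 0.55, "ham": 0.45}   # illustrative values of Pr[y | x]

    # Under the zero-one loss the rule reduces to the MAP prediction.
    def zero_one(y_hat, y):
        return float(y_hat != y)

    assert bayes_decision(posterior, zero_one, classes) == "spam"

    # Under a non-symmetric loss, flagging a legitimate email as spam is so
    # costly that the decision flips to "ham" even though "spam" has the
    # larger posterior.
    def spam_loss(y_hat, y):
        if y_hat == y:
            return 0.0
        return 1.0 if (y_hat, y) == ("ham", "spam") else 5.0

    assert bayes_decision(posterior, spam_loss, classes) == "ham"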

Binary Classification - Illustration

[Figure: the two posterior probabilities Pr[y1 | x] and Pr[y2 | x] plotted as functions of x, with values between 0 and 1.]

Maximum a Posteriori (MAP)

Definition: the MAP principle consists of predicting according to the rule

\hat{y} = \argmax_{y \in \mathcal{Y}} \Pr[y \mid x].

Equivalently, by Bayes’ formula:

\hat{y} = \argmax_{y \in \mathcal{Y}} \frac{\Pr[x \mid y]\,\Pr[y]}{\Pr[x]} = \argmax_{y \in \mathcal{Y}} \Pr[x \mid y]\,\Pr[y].

How do we determine Pr[x | y] and Pr[y]? Density estimation problem.

Application - Maximum a Posteriori

Formulation: hypothesis set H.

\hat{h} = \argmax_{h \in H} \Pr[h \mid O] = \argmax_{h \in H} \frac{\Pr[O \mid h]\,\Pr[h]}{\Pr[O]} = \argmax_{h \in H} \Pr[O \mid h]\,\Pr[h].

Example: determine if a patient has a rare disease, H = {d, nd}, given a laboratory test O = {pos, neg}. With Pr[d] = .005, Pr[pos | d] = .98, and Pr[neg | nd] = .95, if the test is positive, what should be the diagnosis?

\Pr[\text{pos} \mid d]\,\Pr[d] = .98 \times .005 = .0049.

\Pr[\text{pos} \mid nd]\,\Pr[nd] = (1 - .95) \times (1 - .005) = .04975 > .0049.

Hence the MAP diagnosis is nd: no disease, despite the positive test.
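A quick numerical check in Python; dividing by the evidence to obtain the posterior Pr[d | pos] is not on the slide but follows directly from Bayes’ formula:

    # Values from the slide.
    p_d = 0.005            # Pr[d]
    p_pos_given_d = 0.98   # Pr[pos | d]
    p_neg_given_nd = 0.95  # Pr[neg | nd]

    # Unnormalized scores Pr[pos | h] Pr[h] for both hypotheses.
    score_d = p_pos_given_d * p_d                 # 0.0049
    score_nd = (1 - p_neg_given_nd) * (1 - p_d)   # 0.04975

    # MAP decision given a positive test: the larger score wins.
    print("diagnosis:", "d" if score_d > score_nd else "nd")   # "nd"

    # Dividing by the evidence Pr[pos] gives the actual posterior:
    p_pos = score_d + score_nd
    print("Pr[d | pos] =", score_d / p_pos)   # roughly 0.09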

Density Estimation

Data: sample drawn i.i.d. from set X according to some distribution D,

x1,...,xm ∈ X.

Problem: find distribution p out of a set P that best estimates D.
• Note: we will study density estimation specifically in a future lecture.

Maximum Likelihood

Likelihood: probability of observing the sample under distribution p ∈ P, which, given the independence assumption, is

\Pr[x_1, \ldots, x_m] = \prod_{i=1}^{m} p(x_i).

Principle: select the distribution maximizing the sample probability,

\hat{p} = \argmax_{p \in P} \prod_{i=1}^{m} p(x_i),
\quad \text{or} \quad
\hat{p} = \argmax_{p \in P} \sum_{i=1}^{m} \log p(x_i).
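A minimal sketch of the principle for a finite candidate set P (the two candidate distributions and the sample are made up for illustration):

    import math

    # A sample of coin flips and a finite set P of candidate distributions.
    sample = ["H", "T", "T", "H", "H", "H"]
    candidates = {
        "fair":   {"H": 0.5, "T": 0.5},
        "biased": {"H": 0.7, "T": 0.3},
    }

    # Log-likelihood of the i.i.d. sample under a candidate p: sum_i log p(x_i).
    def log_likelihood(p, sample):
        return sum(math.log(p[x]) for x in sample)

    # Maximum likelihood: select the candidate with the largest log-likelihood.
    best = max(candidates, key=lambda name: log_likelihood(candidates[name], sample))
    print(best)   # "biased", since 4 of the 6 flips are heads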

Example: Bernoulli Trials

Problem: find the most likely Bernoulli distribution, given a sequence of coin flips H, T, T, H, T, H, T, H, H, H, T, T, . . ., H.

Bernoulli distribution: p(H) = θ, p(T) = 1 − θ.

Likelihood:

l(p) = \log \theta^{N(H)} (1 - \theta)^{N(T)} = N(H) \log \theta + N(T) \log(1 - \theta).

Solution: l is differentiable and concave;

\frac{dl(p)}{d\theta} = \frac{N(H)}{\theta} - \frac{N(T)}{1 - \theta} = 0 \iff \theta = \frac{N(H)}{N(H) + N(T)}.
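The closed-form solution in a few lines of Python (the flip list here is an arbitrary finite sequence, since the slide’s sequence is elided):

    # An arbitrary finite sequence of flips (illustrative).
    flips = ["H", "T", "T", "H", "T", "H", "T", "H", "H", "H", "T", "T", "H"]

    # Maximum likelihood estimate: theta = N(H) / (N(H) + N(T)).
    n_heads = flips.count("H")
    n_tails = flips.count("T")
    theta = n_heads / (n_heads + n_tails)
    print(theta)   # 7/13 for this particular sequence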

Example: Gaussian Distribution

Problem: find the most likely Gaussian distribution, given a sequence of real-valued observations 3.18, 2.35, .95, 1.175, . . .

Normal distribution:

p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right).

Likelihood:

l(p) = -\frac{m}{2} \log(2\pi\sigma^2) - \sum_{i=1}^{m} \frac{(x_i - \mu)^2}{2\sigma^2}.

Solution: l is differentiable and concave;

\frac{\partial l(p)}{\partial \mu} = 0 \iff \mu = \frac{1}{m} \sum_{i=1}^{m} x_i,
\qquad
\frac{\partial l(p)}{\partial \sigma^2} = 0 \iff \sigma^2 = \frac{1}{m} \sum_{i=1}^{m} x_i^2 - \mu^2.
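The corresponding estimates in NumPy, using only the four observations shown (a sketch; these are the biased 1/m maximum likelihood estimates):

    import numpy as np

    # The four observations listed on the slide.
    x = np.array([3.18, 2.35, 0.95, 1.175])

    # Maximum likelihood estimates for the Gaussian parameters.
    mu = x.mean()                        # mu = (1/m) sum_i x_i
    sigma2 = (x ** 2).mean() - mu ** 2   # sigma^2 = (1/m) sum_i x_i^2 - mu^2

    # np.var uses the same 1/m normalization by default (ddof=0).
    assert np.isclose(sigma2, np.var(x))
    print(mu, sigma2)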

ML Properties

Problems:
• the underlying distribution may not be among those searched.
• overfitting: the number of examples is too small with respect to the number of parameters.
• Pr[y] = 0 if class y does not appear in the sample! → smoothing techniques.

Additive Smoothing

Definition: the additive or Laplace smoothing for estimating Pr[y], y ∈ Y, from a sample of size m is defined by

\widehat{\Pr}[y] = \frac{|y| + \alpha}{m + \alpha\,|\mathcal{Y}|},

where |y| denotes the number of occurrences of class y in the sample.
• α = 0: ML estimate (MLE).
• MLE after adding α to the count of each class.
• Bayesian justification based on a Dirichlet prior.
• poor performance for some applications, such as n-gram language modeling.
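A short sketch of the smoothed estimate in Python (the sample and class set are illustrative):

    from collections import Counter

    # Additive (Laplace) smoothing: Pr[y] = (|y| + alpha) / (m + alpha * |Y|).
    def smoothed_priors(labels, classes, alpha=1.0):
        counts = Counter(labels)
        m = len(labels)
        return {y: (counts[y] + alpha) / (m + alpha * len(classes))
                for y in classes}

    # The class "art" never appears, yet still receives nonzero probability.
    sample = ["sport", "business", "sport", "sport"]
    print(smoothed_priors(sample, classes=["sport", "business", "art"]))
    # -> sport: 4/7, business: 2/7, art: 1/7 (with alpha = 1)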

Estimation Problem

Conditional probability:

\Pr[x \mid y] = \Pr[x_1, \ldots, x_N \mid y].

• for large N, the number of features, difficult to estimate.
• even if the features are Boolean, that is x_i ∈ {0, 1}, there are 2^N possible feature vectors (already over a billion for N = 30)! May need a very large sample.

Naive Bayes

Conditional independence assumption: for any y ∈ Y,

\Pr[x_1, \ldots, x_N \mid y] = \Pr[x_1 \mid y] \cdots \Pr[x_N \mid y].

• given the class, the features are assumed to be independent.
• strong assumption, typically does not hold.

Example - Document Classification

Features: presence/absence of word x_i.

Estimation of Pr[x_i | y]: frequency of word x_i among the documents labeled with y, or a smoothed estimate.

Estimation of Pr[y]: frequency of class y in the sample.

Classification:

\hat{y} = \argmax_{y \in \mathcal{Y}} \Pr[y] \prod_{i=1}^{N} \Pr[x_i \mid y].
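A compact sketch of this classifier with Boolean presence/absence features and add-one smoothed estimates (the vocabulary and toy training set are invented for illustration):

    # Toy training set: (set of words present in the document, class label).
    docs = [
        ({"goal", "match"}, "sport"),
        ({"match", "score"}, "sport"),
        ({"stock", "market"}, "business"),
    ]
    vocab = ["goal", "match", "score", "stock", "market"]
    classes = ["sport", "business"]

    # Pr[y]: frequency of each class in the sample.
    prior = {y: sum(1 for _, c in docs if c == y) / len(docs) for y in classes}

    # Pr[x_i = 1 | y]: frequency of word x_i among documents labeled y,
    # smoothed with add-one counts over the two values {present, absent}.
    cond = {}
    for y in classes:
        docs_y = [words for words, c in docs if c == y]
        cond[y] = {w: (sum(1 for words in docs_y if w in words) + 1) / (len(docs_y) + 2)
                   for w in vocab}

    # Classification rule: argmax_y Pr[y] * prod_i Pr[x_i | y].
    def predict(words):
        def score(y):
            s = prior[y]
            for w in vocab:
                s *= cond[y][w] if w in words else (1 - cond[y][w])
            return s
        return max(classes, key=score)

    print(predict({"goal", "score"}))   # "sport" on this toy data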

Naive Bayes - Binary Classification

Classes: Y = {−1, +1}.

Decision based on the sign of \log \frac{\Pr[+1 \mid x]}{\Pr[-1 \mid x]}; in terms of log-odds ratios:

\log \frac{\Pr[+1 \mid x]}{\Pr[-1 \mid x]}
= \log \frac{\Pr[+1]\,\Pr[x \mid +1]}{\Pr[-1]\,\Pr[x \mid -1]}
= \log \frac{\Pr[+1] \prod_{i=1}^{N} \Pr[x_i \mid +1]}{\Pr[-1] \prod_{i=1}^{N} \Pr[x_i \mid -1]}
= \log \frac{\Pr[+1]}{\Pr[-1]} + \sum_{i=1}^{N} \underbrace{\log \frac{\Pr[x_i \mid +1]}{\Pr[x_i \mid -1]}}_{\text{contribution of feature/expert } i \text{ to the decision}}.

Naive Bayes = Linear Classifier

Theorem: assume that x_i ∈ {0, 1} for all i ∈ [1, N]. Then, the Naive Bayes classifier is defined by

x \mapsto \operatorname{sgn}(w \cdot x + b),

where

w_i = \log \frac{\Pr[x_i = 1 \mid +1]}{\Pr[x_i = 1 \mid -1]} - \log \frac{\Pr[x_i = 0 \mid +1]}{\Pr[x_i = 0 \mid -1]}
\quad \text{and} \quad
b = \log \frac{\Pr[+1]}{\Pr[-1]} + \sum_{i=1}^{N} \log \frac{\Pr[x_i = 0 \mid +1]}{\Pr[x_i = 0 \mid -1]}.

Proof: observe that for any i ∈ [1, N],

\log \frac{\Pr[x_i \mid +1]}{\Pr[x_i \mid -1]}
= \left( \log \frac{\Pr[x_i = 1 \mid +1]}{\Pr[x_i = 1 \mid -1]} - \log \frac{\Pr[x_i = 0 \mid +1]}{\Pr[x_i = 0 \mid -1]} \right) x_i
+ \log \frac{\Pr[x_i = 0 \mid +1]}{\Pr[x_i = 0 \mid -1]}.
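A small numerical check of the theorem in Python (the class priors and Bernoulli parameters below are arbitrary valid values, not from the slides):

    import itertools
    import math
    import random

    N = 4
    random.seed(0)

    # Arbitrary class priors and Bernoulli parameters Pr[x_i = 1 | y].
    prior = {+1: 0.3, -1: 0.7}
    p1 = {y: [random.uniform(0.1, 0.9) for _ in range(N)] for y in (+1, -1)}

    def log_odds(x):
        # Direct Naive Bayes log-odds: log(Pr[+1]/Pr[-1]) + sum_i log-ratio.
        s = math.log(prior[+1] / prior[-1])
        for i, xi in enumerate(x):
            num = p1[+1][i] if xi == 1 else 1 - p1[+1][i]
            den = p1[-1][i] if xi == 1 else 1 - p1[-1][i]
            s += math.log(num / den)
        return s

    # Weight vector and bias from the theorem.
    w = [math.log(p1[+1][i] / p1[-1][i])
         - math.log((1 - p1[+1][i]) / (1 - p1[-1][i])) for i in range(N)]
    b = math.log(prior[+1] / prior[-1]) + sum(
        math.log((1 - p1[+1][i]) / (1 - p1[-1][i])) for i in range(N))

    # The two expressions agree on every Boolean input vector.
    for x in itertools.product([0, 1], repeat=N):
        assert math.isclose(log_odds(x), sum(wi * xi for wi, xi in zip(w, x)) + b)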

Summary

Bayesian prediction:
• requires solving density estimation problems.
• often difficult to estimate Pr[x | y] for x ∈ R^N.
• but simple and easy to apply; widely used.

Naive Bayes:
• strong assumption.
• straightforward estimation problem.
• specific linear classifier.
• sometimes surprisingly good performance.
