Introduction to Machine Learning, Lecture 3

Mehryar Mohri
Courant Institute and Google Research
[email protected]

Bayesian Learning

Bayes’ Formula/Rule

Terminology:

\underbrace{\Pr[Y \mid X]}_{\text{posterior}} \;=\; \frac{\overbrace{\Pr[X \mid Y]}^{\text{likelihood}}\;\overbrace{\Pr[Y]}^{\text{prior}}}{\underbrace{\Pr[X]}_{\text{evidence}}}.
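One step the slide leaves implicit: the formula follows from writing the joint probability Pr[X, Y] in two ways,

\Pr[Y \mid X]\,\Pr[X] = \Pr[X, Y] = \Pr[X \mid Y]\,\Pr[Y]
\;\Longrightarrow\;
\Pr[Y \mid X] = \frac{\Pr[X \mid Y]\,\Pr[Y]}{\Pr[X]} \qquad (\Pr[X] > 0).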

Loss Function

Definition: a function L: Y × Y → R+ indicating the penalty for an incorrect prediction.
• L(y, y′): loss for the prediction of y instead of the correct label y′.

Examples:
• zero-one loss: standard loss function in classification; L(y, y′) = 1_{y ≠ y′} for y, y′ ∈ Y.
• non-symmetric losses: e.g., for spam classification; L(ham, spam) ≤ L(spam, ham).
• squared loss: standard loss function in regression; L(y, y′) = (y′ − y)².
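A minimal Python sketch of these three losses (the concrete spam costs are invented; only their ordering matters):

    # A sketch of the three example losses; argument order matches the slide:
    # first the prediction, then the correct label.

    def zero_one_loss(y_pred, y_true):
        # 1 if the prediction differs from the correct label, 0 otherwise.
        return 1.0 if y_pred != y_true else 0.0

    def spam_loss(y_pred, y_true):
        # Non-symmetric loss: flagging a legitimate email (ham) as spam is
        # penalized more than letting a spam message through, so that
        # L(ham, spam) <= L(spam, ham).  The costs 1.0 and 5.0 are made up.
        if y_pred == y_true:
            return 0.0
        return 1.0 if (y_pred, y_true) == ("ham", "spam") else 5.0

    def squared_loss(y_pred, y_true):
        # Standard squared loss for regression.
        return (y_pred - y_true) ** 2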

Classification Problem

Input space X: e.g., a set of documents.
• feature vector Φ(x) ∈ R^N associated to x ∈ X.
• notation: feature vector x ∈ R^N.
• example: vector of word counts in a document.

Output or target space Y: set of classes; e.g., sport, business, art.

Problem: given x, predict the correct class y ∈ Y associated to x.

Bayesian Prediction

Definition: the expected conditional loss of predicting ŷ ∈ Y is

\mathcal{L}[\hat{y} \mid x] = \sum_{y \in \mathcal{Y}} L(\hat{y}, y)\,\Pr[y \mid x].

Bayesian decision: predict the class minimizing the expected conditional loss, that is

\hat{y}^* = \argmin_{\hat{y} \in \mathcal{Y}} \mathcal{L}[\hat{y} \mid x] = \argmin_{\hat{y} \in \mathcal{Y}} \sum_{y \in \mathcal{Y}} L(\hat{y}, y)\,\Pr[y \mid x].

• zero-one loss: \hat{y}^* = \argmax_{y \in \mathcal{Y}} \Pr[y \mid x], the Maximum a Posteriori (MAP) principle.
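A minimal Python sketch of this decision rule, assuming the posteriors Pr[y | x] are already available as a dictionary (the loss, classes, and numbers are illustrative, not from the slides):

    # Bayesian decision: predict the class minimizing the expected conditional
    # loss  L[y_hat | x] = sum_y L(y_hat, y) * Pr[y | x].
    def bayes_decision(posterior, loss, classes):
        def expected_loss(y_hat):
            return sum(loss(y_hat, y) * posterior[y] for y in classes)
        return min(classes, key=expected_loss)

    classes = ["spam", "ham"]
    posterior = {"spam": 0.55, "ham": 0.45}   # illustrative values of Pr[y | x]

    # Under the zero-one loss the rule reduces to the MAP prediction.
    def zero_one(y_hat, y):
        return float(y_hat != y)

    assert bayes_decision(posterior, zero_one, classes) == "spam"

    # Under a non-symmetric loss, flagging a legitimate email as spam is so
    # costly that the decision flips to "ham" even though "spam" has the
    # larger posterior.
    def spam_loss(y_hat, y):
        if y_hat == y:
            return 0.0
        return 1.0 if (y_hat, y) == ("ham", "spam") else 5.0

    assert bayes_decision(posterior, spam_loss, classes) == "ham"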

Binary Classification - Illustration

[Figure: the two posterior probabilities Pr[y1 | x] and Pr[y2 | x] plotted as functions of x, with values between 0 and 1.]

Maximum a Posteriori (MAP)

Definition: the MAP principle consists of predicting according to the rule

\hat{y} = \argmax_{y \in \mathcal{Y}} \Pr[y \mid x].

Equivalently, by Bayes’ formula:

\hat{y} = \argmax_{y \in \mathcal{Y}} \frac{\Pr[x \mid y]\,\Pr[y]}{\Pr[x]} = \argmax_{y \in \mathcal{Y}} \Pr[x \mid y]\,\Pr[y].

How do we determine Pr[x | y] and Pr[y]? Density estimation problem.

Application - Maximum a Posteriori

Formulation: hypothesis set H.

\hat{h} = \argmax_{h \in H} \Pr[h \mid O] = \argmax_{h \in H} \frac{\Pr[O \mid h]\,\Pr[h]}{\Pr[O]} = \argmax_{h \in H} \Pr[O \mid h]\,\Pr[h].

Example: determine if a patient has a rare disease, H = {d, nd}, given a laboratory test O = {pos, neg}. With Pr[d] = .005, Pr[pos | d] = .98, and Pr[neg | nd] = .95, if the test is positive, what should be the diagnosis?

\Pr[\text{pos} \mid d]\,\Pr[d] = .98 \times .005 = .0049.

\Pr[\text{pos} \mid nd]\,\Pr[nd] = (1 - .95) \times (1 - .005) = .04975 > .0049.

Hence the MAP diagnosis is nd: no disease, despite the positive test.
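A quick numerical check in Python; dividing by the evidence to obtain the posterior Pr[d | pos] is not on the slide but follows directly from Bayes’ formula:

    # Values from the slide.
    p_d = 0.005            # Pr[d]
    p_pos_given_d = 0.98   # Pr[pos | d]
    p_neg_given_nd = 0.95  # Pr[neg | nd]

    # Unnormalized scores Pr[pos | h] Pr[h] for both hypotheses.
    score_d = p_pos_given_d * p_d                 # 0.0049
    score_nd = (1 - p_neg_given_nd) * (1 - p_d)   # 0.04975

    # MAP decision given a positive test: the larger score wins.
    print("diagnosis:", "d" if score_d > score_nd else "nd")   # "nd"

    # Dividing by the evidence Pr[pos] gives the actual posterior:
    p_pos = score_d + score_nd
    print("Pr[d | pos] =", score_d / p_pos)   # roughly 0.09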

Density Estimation

Data: sample drawn i.i.d. from set X according to some distribution D,

x1,...,xm ∈ X.

Problem: find distribution p out of a set P that best estimates D.
• Note: we will study density estimation specifically in a future lecture.

Maximum Likelihood

Likelihood: probability of observing the sample under distribution p ∈ P, which, given the independence assumption, is

\Pr[x_1, \ldots, x_m] = \prod_{i=1}^{m} p(x_i).

Principle: select the distribution maximizing the sample probability,

\hat{p} = \argmax_{p \in P} \prod_{i=1}^{m} p(x_i),
\quad \text{or} \quad
\hat{p} = \argmax_{p \in P} \sum_{i=1}^{m} \log p(x_i).
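A minimal sketch of the principle for a finite candidate set P (the two candidate distributions and the sample are made up for illustration):

    import math

    # A sample of coin flips and a finite set P of candidate distributions.
    sample = ["H", "T", "T", "H", "H", "H"]
    candidates = {
        "fair":   {"H": 0.5, "T": 0.5},
        "biased": {"H": 0.7, "T": 0.3},
    }

    # Log-likelihood of the i.i.d. sample under a candidate p: sum_i log p(x_i).
    def log_likelihood(p, sample):
        return sum(math.log(p[x]) for x in sample)

    # Maximum likelihood: select the candidate with the largest log-likelihood.
    best = max(candidates, key=lambda name: log_likelihood(candidates[name], sample))
    print(best)   # "biased", since 4 of the 6 flips are heads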

Example: Bernoulli Trials

Problem: find the most likely Bernoulli distribution, given a sequence of coin flips H, T, T, H, T, H, T, H, H, H, T, T, . . ., H.

Bernoulli distribution: p(H) = θ, p(T) = 1 − θ.

Likelihood:

l(p) = \log \theta^{N(H)} (1 - \theta)^{N(T)} = N(H) \log \theta + N(T) \log(1 - \theta).

Solution: l is differentiable and concave;

\frac{dl(p)}{d\theta} = \frac{N(H)}{\theta} - \frac{N(T)}{1 - \theta} = 0 \iff \theta = \frac{N(H)}{N(H) + N(T)}.
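The closed-form solution in a few lines of Python (the flip list here is an arbitrary finite sequence, since the slide’s sequence is elided):

    # An arbitrary finite sequence of flips (illustrative).
    flips = ["H", "T", "T", "H", "T", "H", "T", "H", "H", "H", "T", "T", "H"]

    # Maximum likelihood estimate: theta = N(H) / (N(H) + N(T)).
    n_heads = flips.count("H")
    n_tails = flips.count("T")
    theta = n_heads / (n_heads + n_tails)
    print(theta)   # 7/13 for this particular sequence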

Example: Gaussian Distribution

Problem: find the most likely Gaussian distribution, given a sequence of real-valued observations 3.18, 2.35, .95, 1.175, . . .

Normal distribution:

p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right).

Likelihood:

l(p) = -\frac{m}{2} \log(2\pi\sigma^2) - \sum_{i=1}^{m} \frac{(x_i - \mu)^2}{2\sigma^2}.

Solution: l is differentiable and concave;

\frac{\partial l(p)}{\partial \mu} = 0 \iff \mu = \frac{1}{m} \sum_{i=1}^{m} x_i,
\qquad
\frac{\partial l(p)}{\partial \sigma^2} = 0 \iff \sigma^2 = \frac{1}{m} \sum_{i=1}^{m} x_i^2 - \mu^2.
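The corresponding estimates in NumPy, using only the four observations shown (a sketch; these are the biased 1/m maximum likelihood estimates):

    import numpy as np

    # The four observations listed on the slide.
    x = np.array([3.18, 2.35, 0.95, 1.175])

    # Maximum likelihood estimates for the Gaussian parameters.
    mu = x.mean()                        # mu = (1/m) sum_i x_i
    sigma2 = (x ** 2).mean() - mu ** 2   # sigma^2 = (1/m) sum_i x_i^2 - mu^2

    # np.var uses the same 1/m normalization by default (ddof=0).
    assert np.isclose(sigma2, np.var(x))
    print(mu, sigma2)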

ML Properties

Problems:
• the underlying distribution may not be among those searched.
• overfitting: the number of examples is too small with respect to the number of parameters.
• Pr[y] = 0 if class y does not appear in the sample! → smoothing techniques.

Additive Smoothing

Definition: the additive or Laplace smoothing for estimating Pr[y], y ∈ Y, from a sample of size m is defined by

\widehat{\Pr}[y] = \frac{|y| + \alpha}{m + \alpha\,|\mathcal{Y}|},

where |y| denotes the number of occurrences of class y in the sample.
• α = 0: ML estimate (MLE).
• MLE after adding α to the count of each class.
• Bayesian justification based on a Dirichlet prior.
• poor performance for some applications, such as n-gram language modeling.
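A short sketch of the smoothed estimate in Python (the sample and class set are illustrative):

    from collections import Counter

    # Additive (Laplace) smoothing: Pr[y] = (|y| + alpha) / (m + alpha * |Y|).
    def smoothed_priors(labels, classes, alpha=1.0):
        counts = Counter(labels)
        m = len(labels)
        return {y: (counts[y] + alpha) / (m + alpha * len(classes))
                for y in classes}

    # The class "art" never appears, yet still receives nonzero probability.
    sample = ["sport", "business", "sport", "sport"]
    print(smoothed_priors(sample, classes=["sport", "business", "art"]))
    # -> sport: 4/7, business: 2/7, art: 1/7 (with alpha = 1)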

Estimation Problem

Conditional probability:

\Pr[x \mid y] = \Pr[x_1, \ldots, x_N \mid y].

• for large N, the number of features, difficult to estimate.
• even if the features are Boolean, that is x_i ∈ {0, 1}, there are 2^N possible feature vectors (already over a billion for N = 30)! May need a very large sample.

Naive Bayes

Conditional independence assumption: for any y ∈ Y,

\Pr[x_1, \ldots, x_N \mid y] = \Pr[x_1 \mid y] \cdots \Pr[x_N \mid y].

• given the class, the features are assumed to be independent.
• strong assumption, typically does not hold.

Example - Document Classification

Features: presence/absence of word x_i.

Estimation of Pr[x_i | y]: frequency of word x_i among the documents labeled with y, or a smoothed estimate.

Estimation of Pr[y]: frequency of class y in the sample.

Classification:

\hat{y} = \argmax_{y \in \mathcal{Y}} \Pr[y] \prod_{i=1}^{N} \Pr[x_i \mid y].
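A compact sketch of this classifier with Boolean presence/absence features and add-one smoothed estimates (the vocabulary and toy training set are invented for illustration):

    # Toy training set: (set of words present in the document, class label).
    docs = [
        ({"goal", "match"}, "sport"),
        ({"match", "score"}, "sport"),
        ({"stock", "market"}, "business"),
    ]
    vocab = ["goal", "match", "score", "stock", "market"]
    classes = ["sport", "business"]

    # Pr[y]: frequency of each class in the sample.
    prior = {y: sum(1 for _, c in docs if c == y) / len(docs) for y in classes}

    # Pr[x_i = 1 | y]: frequency of word x_i among documents labeled y,
    # smoothed with add-one counts over the two values {present, absent}.
    cond = {}
    for y in classes:
        docs_y = [words for words, c in docs if c == y]
        cond[y] = {w: (sum(1 for words in docs_y if w in words) + 1) / (len(docs_y) + 2)
                   for w in vocab}

    # Classification rule: argmax_y Pr[y] * prod_i Pr[x_i | y].
    def predict(words):
        def score(y):
            s = prior[y]
            for w in vocab:
                s *= cond[y][w] if w in words else (1 - cond[y][w])
            return s
        return max(classes, key=score)

    print(predict({"goal", "score"}))   # "sport" on this toy data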

Naive Bayes - Binary Classification

Classes: Y = {−1, +1}.

Decision based on the sign of \log \frac{\Pr[+1 \mid x]}{\Pr[-1 \mid x]}; in terms of log-odds ratios:

\log \frac{\Pr[+1 \mid x]}{\Pr[-1 \mid x]}
= \log \frac{\Pr[+1]\,\Pr[x \mid +1]}{\Pr[-1]\,\Pr[x \mid -1]}
= \log \frac{\Pr[+1] \prod_{i=1}^{N} \Pr[x_i \mid +1]}{\Pr[-1] \prod_{i=1}^{N} \Pr[x_i \mid -1]}
= \log \frac{\Pr[+1]}{\Pr[-1]} + \sum_{i=1}^{N} \underbrace{\log \frac{\Pr[x_i \mid +1]}{\Pr[x_i \mid -1]}}_{\text{contribution of feature/expert } i \text{ to the decision}}.

Naive Bayes = Linear Classifier

Theorem: assume that x_i ∈ {0, 1} for all i ∈ [1, N]. Then, the Naive Bayes classifier is defined by

x \mapsto \operatorname{sgn}(w \cdot x + b),

where

w_i = \log \frac{\Pr[x_i = 1 \mid +1]}{\Pr[x_i = 1 \mid -1]} - \log \frac{\Pr[x_i = 0 \mid +1]}{\Pr[x_i = 0 \mid -1]}
\quad \text{and} \quad
b = \log \frac{\Pr[+1]}{\Pr[-1]} + \sum_{i=1}^{N} \log \frac{\Pr[x_i = 0 \mid +1]}{\Pr[x_i = 0 \mid -1]}.

Proof: observe that for any i ∈ [1, N],

\log \frac{\Pr[x_i \mid +1]}{\Pr[x_i \mid -1]}
= \left( \log \frac{\Pr[x_i = 1 \mid +1]}{\Pr[x_i = 1 \mid -1]} - \log \frac{\Pr[x_i = 0 \mid +1]}{\Pr[x_i = 0 \mid -1]} \right) x_i
+ \log \frac{\Pr[x_i = 0 \mid +1]}{\Pr[x_i = 0 \mid -1]}.
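A small numerical check of the theorem in Python (the class priors and Bernoulli parameters below are arbitrary valid values, not from the slides):

    import itertools
    import math
    import random

    N = 4
    random.seed(0)

    # Arbitrary class priors and Bernoulli parameters Pr[x_i = 1 | y].
    prior = {+1: 0.3, -1: 0.7}
    p1 = {y: [random.uniform(0.1, 0.9) for _ in range(N)] for y in (+1, -1)}

    def log_odds(x):
        # Direct Naive Bayes log-odds: log(Pr[+1]/Pr[-1]) + sum_i log-ratio.
        s = math.log(prior[+1] / prior[-1])
        for i, xi in enumerate(x):
            num = p1[+1][i] if xi == 1 else 1 - p1[+1][i]
            den = p1[-1][i] if xi == 1 else 1 - p1[-1][i]
            s += math.log(num / den)
        return s

    # Weight vector and bias from the theorem.
    w = [math.log(p1[+1][i] / p1[-1][i])
         - math.log((1 - p1[+1][i]) / (1 - p1[-1][i])) for i in range(N)]
    b = math.log(prior[+1] / prior[-1]) + sum(
        math.log((1 - p1[+1][i]) / (1 - p1[-1][i])) for i in range(N))

    # The two expressions agree on every Boolean input vector.
    for x in itertools.product([0, 1], repeat=N):
        assert math.isclose(log_odds(x), sum(wi * xi for wi, xi in zip(w, x)) + b)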

Summary

Bayesian prediction:
• requires solving density estimation problems.
• often difficult to estimate Pr[x | y] for x ∈ R^N.
• but simple and easy to apply; widely used.

Naive Bayes:
• strong assumption.
• straightforward estimation problem.
• specific linear classifier.
• sometimes surprisingly good performance.
