COMP 551 – Applied Machine Learning Lecture 5: Generative models for linear classification

Instructor: Joelle Pineau ([email protected])

Class web page: www.cs.mcgill.ca/~jpineau/comp551

Unless otherwise noted, all material posted for this course is copyright of the instructor and cannot be reused or reposted without the instructor’s written permission.

Today’s quiz

• Q1. What is a linear classifier? (In contrast to a non-linear classifier.)

• Q2. Describe the difference between discriminative and generative classifiers.

• Q3. Consider the following data set. If you use ____ to compute a decision boundary, what is the prediction for x6?

Data   Feature 1   Feature 2   Feature 3   Output
x1     1           0           0           0
x2     1           0           1           0
x3     0           1           0           0
x4     1           1           1           1
x5     1           1           0           1
x6     0           0           0           ?

COMP-551: Applied Machine Learning 2 Joelle Pineau Quick recap

• Two approaches for linear classification:

  – Discriminative learning: Directly estimate P(y|x).
    • Logistic regression: P(y|x) = σ(w^T x) = 1 / (1 + e^{−w^T x})

  – Generative learning: Separately model P(x|y) and P(y). Use these, through Bayes rule, to estimate P(y|x).

COMP-551: Applied Machine Learning 3 Joelle Pineau

Linear discriminant analysis (LDA)

• Return to Bayes rule:  P(y | x) = P(x | y) P(y) / P(x)

• LDA makes explicit assumptions about P(x | y): a multivariate Gaussian with mean μ and covariance matrix Σ,

    P(x | y) = exp( −(1/2) (x − μ)^T Σ^{−1} (x − μ) ) / ( (2π)^{m/2} |Σ|^{1/2} )

  Notation: here x is a single instance, represented as an m×1 vector.

• Consider the log-odds ratio (again, P(x) doesn’t matter for the decision):

    log [ Pr(x | y=1) P(y=1) / ( Pr(x | y=0) P(y=0) ) ]
      = log P(y=1)/P(y=0) − (1/2) (μ1 + μ0)^T Σ^{−1} (μ1 − μ0) + x^T Σ^{−1} (μ1 − μ0)

• Key assumption of LDA: both classes have the same covariance matrix, Σ.

COMP-551: Applied Machine Learning 4 Joelle Pineau


Linear discriminant analysis (LDA)

• The log-odds ratio above has the form w0 + x^T w, with

    w = Σ^{−1} (μ1 − μ0)   and   w0 = log P(y=1)/P(y=0) − (1/2) (μ1 + μ0)^T Σ^{−1} (μ1 − μ0).

  This is a linear decision boundary!

COMP-551: Applied Machine Learning 5 Joelle Pineau


Learning in LDA: 2 class case

• Estimate P(y), μ, Σ, from the training data, then apply log-odds ratio.

– P(y=0) = N0 / (N0 + N1) P(y=1) = N1 / (N0 + N1)

where N1 and N0 are the number of training samples from classes 1 and 0, respectively.

– μ0 = ∑_{i=1:n} I(yi=0) xi / N0,   μ1 = ∑_{i=1:n} I(yi=1) xi / N1,
  where I(·) is the indicator function: I(z) = 1 if z is true, I(z) = 0 otherwise.

– Σ = ∑_{k=0:1} ∑_{i=1:n} I(yi=k) (xi − μk)(xi − μk)^T / (N0 + N1 − 2)

• Decision boundary: Given an input x, classify it as class 1 if the log-odds ratio is >0, classify it as class 0 otherwise.
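A minimal NumPy sketch of this two-class recipe (illustrative only; the function names are mine, and the prediction uses the log-odds threshold above):

```python
import numpy as np

def fit_lda_2class(X, y):
    """Estimate LDA parameters from an (n, m) data matrix X and 0/1 labels y."""
    X0, X1 = X[y == 0], X[y == 1]
    N0, N1 = len(X0), len(X1)
    p1 = N1 / (N0 + N1)                              # P(y=1); P(y=0) = 1 - p1
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)      # class means
    # Pooled covariance estimate, shared by both classes.
    S = ((X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)) / (N0 + N1 - 2)
    return p1, mu0, mu1, S

def predict_lda_2class(X, p1, mu0, mu1, S):
    """Classify as class 1 exactly when the log-odds ratio is > 0."""
    S_inv = np.linalg.inv(S)
    w = S_inv @ (mu1 - mu0)
    w0 = np.log(p1 / (1 - p1)) - 0.5 * (mu1 + mu0) @ S_inv @ (mu1 - mu0)
    return (w0 + X @ w > 0).astype(int)
```

For example, `predict_lda_2class(X, *fit_lda_2class(X, y))` returns the training-set predictions.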

COMP-551: Applied Machine Learning 10 Joelle Pineau

More complex decision boundaries

• Want to accommodate more than 2 classes?
  – Consider the per-class linear discriminant function:

      δy(x) = x^T Σ^{−1} μy − (1/2) μy^T Σ^{−1} μy + log P(y)

  – Use the following decision rule:

      Output = argmax_y δy(x) = argmax_y [ x^T Σ^{−1} μy − (1/2) μy^T Σ^{−1} μy + log P(y) ]

• Want more flexible (non-linear) decision boundaries?
  – Recall the trick from linear regression of re-coding variables.

COMP-551: Applied Machine Learning 11 Joelle Pineau
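Under the same shared-Σ assumption, the multi-class rule above is a straightforward argmax over the per-class discriminants; a hypothetical sketch (here `means` is a list of per-class mean vectors and `priors` the corresponding class frequencies):

```python
import numpy as np

def lda_predict_multiclass(X, means, priors, S):
    """Return, for each row of X, the class index y maximizing
    delta_y(x) = x^T S^{-1} mu_y - 0.5 mu_y^T S^{-1} mu_y + log P(y)."""
    S_inv = np.linalg.inv(S)
    scores = [X @ S_inv @ mu - 0.5 * mu @ S_inv @ mu + np.log(p)
              for mu, p in zip(means, priors)]
    return np.argmax(np.stack(scores, axis=1), axis=1)
```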

LDA decision boundaries – 3 class case

[Figure (ESL Fig. 4.5): The left panel shows three Gaussian distributions with the same covariance and different means, the contours of constant density enclosing 95% of the probability in each case, and the Bayes decision boundaries between each pair of classes. The right panel shows a sample of 30 points drawn from each Gaussian and the fitted LDA decision boundaries.]

COMP-598: Applied Machine Learning 22 Joelle Pineau

Using higher-order features

[Figure (ESL Fig. 4.1): The left plot shows some data from three classes, with linear decision boundaries found by linear discriminant analysis. The right plot shows quadratic decision boundaries, obtained by finding linear boundaries in the five-dimensional space X1, X2, X1X2, X1², X2². Linear inequalities in this space are quadratic inequalities in the original space.]

COMP-551: Applied Machine Learning 12 Joelle Pineau

Quadratic discriminant analysis

• LDA assumes all classes have the same covariance matrix, Σ.
• QDA allows different covariance matrices, Σy for each class y.

• Linear discriminant function:

    δy(x) = x^T Σ^{−1} μy − (1/2) μy^T Σ^{−1} μy + log P(y)

• Quadratic discriminant function:

    δy(x) = −(1/2) log |Σy| − (1/2) (x − μy)^T Σy^{−1} (x − μy) + log P(y)

• QDA has more parameters to estimate, but greater flexibility to estimate the target function. Bias-variance trade-off.

COMP-551: Applied Machine Learning 13 Joelle Pineau
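To see the two kinds of boundary side by side in practice, scikit-learn’s LinearDiscriminantAnalysis and QuadraticDiscriminantAnalysis can be fit to the same data; a small sketch on made-up Gaussian data (the dataset and seed are arbitrary):

```python
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)

rng = np.random.default_rng(0)
# Two Gaussian classes with different covariances, so QDA has something to gain.
X0 = rng.multivariate_normal([0, 0], [[1.0, 0.3], [0.3, 1.0]], size=200)
X1 = rng.multivariate_normal([2, 2], [[0.3, 0.0], [0.0, 2.0]], size=200)
X = np.vstack([X0, X1])
y = np.array([0] * 200 + [1] * 200)

lda = LinearDiscriminantAnalysis().fit(X, y)
qda = QuadraticDiscriminantAnalysis().fit(X, y)
print("LDA training accuracy:", lda.score(X, y))
print("QDA training accuracy:", qda.score(X, y))
```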

LDA vs QDA

[Figure (ESL Fig. 4.6): Two methods for fitting quadratic boundaries. The left panel shows the quadratic decision boundaries for the data in Figure 4.1, obtained using LDA in the five-dimensional space X1, X2, X1X2, X1², X2². The right panel shows the quadratic decision boundaries found by QDA. The differences are small, as is usually the case.]

COMP-551: Applied Machine Learning 14 Joelle Pineau

Comparing linear classification methods

[Figure (ESL Fig. 4.4): Two-dimensional plot of the vowel training data; eleven classes with X ∈ ℜ^10, projected class means shown as heavy circles; the class overlap is considerable.]

• Training and test error rates on the vowel data (ESL Table 4.1). There are eleven classes in ten dimensions, of which three account for 90% of the variance (via a principal components analysis). Linear regression is hurt by masking, increasing the test and training error by over 10%.

  Technique                          Training   Test
  Linear regression                    0.48     0.67
  Linear discriminant analysis         0.32     0.56
  Quadratic discriminant analysis      0.01     0.53
  Logistic regression                  0.22     0.51

Generative learning – Scaling up


• Consider applying generative learning (e.g. LDA) to Project 1 data.
  E.g. task: given the first few words, determine the language (without a dictionary).
  – What are the features? How do we encode the data?
  – How do we handle a high-dimensional feature space?
  – Do we assume the same covariance matrix for both classes? (Maybe that is ok.)

What kinds of assumptions can we make to simplify the problem?

COMP-551: Applied Machine Learning 16 Joelle Pineau Naïve Bayes assumption


• Generative learning: Estimate P(x|y), P(y). Then calculate P(y|x).

• Naïve Bayes: Assume the xj are conditionally independent given y.

– In other words: P(xj | y) = P(xj | y, xk), for all j ≠ k

• Generative model structure:

P(x | y) = P(x1, x2, …, xm | y)

= P(x1 | y) P(x2 | y, x1) P(x3 | y, x1, x2) … P(xm | y, x1, x2, …, xm-1)   (from the general rules of probability, i.e. the chain rule)

= P(x1 | y) P(x2 | y) P(x3 | y) …P(xm | y) (from the Naïve Bayes assumption above)

COMP-551: Applied Machine Learning 18 Joelle Pineau Naïve Bayes graphical model


[Graphical model: y → x1, x2, x3, …, xm]

• How many parameters to estimate? Assume m binary features.
  – Without the Naïve Bayes assumption: O(2^m) numbers to describe the model.
  – With the Naïve Bayes assumption: O(m) numbers to describe the model.

• Useful when the number of features is high.
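To make the gap concrete: with m = 30 binary features, a full table for P(x|y) needs on the order of 2^30 ≈ 10^9 numbers per class, while the Naïve Bayes model needs only about 2m + 1 = 61 parameters (the θ1, θj,1, θj,0 defined on the next slides).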

COMP-551: Applied Machine Learning 20 Joelle Pineau Training a Naïve Bayes classifier


• Assume x, y are binary variables, m = 1.
• Estimate the parameters P(x|y) and P(y) from data.

  [Graphical model: y → x]

  – Define: θ1 = Pr(y=1), θj,1 = Pr(xj=1 | y=1), θj,0 = Pr(xj=1 | y=0).

• Evaluation criterion: Find parameters that maximize the log-likelihood of the data.

• Likelihood: Pr(y|x) ∝ Pr(y) Pr(x|y) = ∏_{i=1:n} ( P(yi) ∏_{j=1:m} P(xi,j | yi) )
  – Samples i are independent, so we take the product over the n examples.
  – Input features are independent (conditioned on y), so we take the product over the m features.

COMP-551: Applied Machine Learning 22 Joelle Pineau Training a Naïve Bayes classifier


• Likelihood for a binary output variable: L(θ1 | y) = θ1^y (1 − θ1)^(1−y)
• Log-likelihood for all parameters:

  log L(θ1, θj,1, θj,0 | D) = Σ_{i=1:n} [ log P(yi) + Σ_{j=1:m} log P(xi,j | yi) ]

    = Σ_{i=1:n} [ yi log θ1 + (1 − yi) log(1 − θ1)

      + Σ_{j=1:m} yi ( xi,j log θj,1 + (1 − xi,j) log(1 − θj,1) )

      + Σ_{j=1:m} (1 − yi) ( xi,j log θj,0 + (1 − xi,j) log(1 − θj,0) ) ]

  (The terms will have another form if the parameters P(x|y) have another form, e.g. Gaussian.)

• To estimate θ1, take the derivative of log L with respect to θ1, and set it to 0:

  ∂ log L / ∂θ1 = Σ_{i=1:n} ( yi / θ1 − (1 − yi) / (1 − θ1) ) = 0
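To see why the estimate on the next slide follows, multiply the condition through by θ1 (1 − θ1):

  Σ_{i=1:n} [ yi (1 − θ1) − (1 − yi) θ1 ] = 0
  Σ_{i=1:n} yi − θ1 Σ_{i=1:n} yi − n θ1 + θ1 Σ_{i=1:n} yi = 0
  Σ_{i=1:n} yi = n θ1,   i.e.   θ1 = (1/n) Σ_{i=1:n} yi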

COMP-551: Applied Machine Learning 26 Joelle Pineau Training a Naïve Bayes classifier


Solving for θ1 we get:

  θ1 = (1/n) Σ_{i=1:n} yi = (number of examples where y=1) / (number of examples)

Similarly, we get:

  θj,1 = (number of examples where xj=1 and y=1) / (number of examples where y=1)

  θj,0 = (number of examples where xj=1 and y=0) / (number of examples where y=0)
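These counting formulas translate directly into a few NumPy lines; a minimal sketch assuming X is an (n, m) array of 0/1 features and y an array of 0/1 labels (names are illustrative):

```python
import numpy as np

def fit_bernoulli_nb(X, y):
    """Maximum-likelihood estimates of the Naive Bayes parameters."""
    theta1 = y.mean()                      # Pr(y=1): fraction of examples with y=1
    theta_j1 = X[y == 1].mean(axis=0)      # Pr(x_j=1 | y=1), one entry per feature
    theta_j0 = X[y == 0].mean(axis=0)      # Pr(x_j=1 | y=0)
    return theta1, theta_j1, theta_j0
```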

COMP-551: Applied Machine Learning 28 Joelle Pineau Naïve Bayes decision boundary

• Consider again the log-odds ratio:

  log [ Pr(y=1 | x) / Pr(y=0 | x) ] = log [ Pr(x | y=1) P(y=1) / ( Pr(x | y=0) P(y=0) ) ]

    = log P(y=1)/P(y=0) + log [ ∏_{j=1:m} P(xj | y=1) / ∏_{j=1:m} P(xj | y=0) ]

    = log P(y=1)/P(y=0) + Σ_{j=1:m} log [ P(xj | y=1) / P(xj | y=0) ]

COMP-551: Applied Machine Learning 29 Joelle Pineau Naïve Bayes decision boundary

• Consider the case where features are binary: xj ∈ {0, 1}.
• Define:

    wj,0 = log [ P(xj=0 | y=1) / P(xj=0 | y=0) ];   wj,1 = log [ P(xj=1 | y=1) / P(xj=1 | y=0) ]

• Now we have:

  log [ Pr(y=1 | x) / Pr(y=0 | x) ] = log P(y=1)/P(y=0) + Σ_{j=1:m} log [ P(xj | y=1) / P(xj | y=0) ]

    = log P(y=1)/P(y=0) + Σ_{j=1:m} ( wj,0 (1 − xj) + wj,1 xj )

    = log P(y=1)/P(y=0) + Σ_{j=1:m} wj,0 + Σ_{j=1:m} (wj,1 − wj,0) xj

  This is a linear decision boundary: w0 + x^T w.
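A sketch of turning fitted parameters into exactly this linear rule (it reuses the hypothetical `fit_bernoulli_nb` estimates from the earlier sketch):

```python
import numpy as np

def nb_linear_weights(theta1, theta_j1, theta_j0):
    """Return (w0, w) such that the log-odds ratio equals w0 + x . w for binary x."""
    w_j0 = np.log((1 - theta_j1) / (1 - theta_j0))   # log P(x_j=0|y=1) / P(x_j=0|y=0)
    w_j1 = np.log(theta_j1 / theta_j0)               # log P(x_j=1|y=1) / P(x_j=1|y=0)
    w0 = np.log(theta1 / (1 - theta1)) + w_j0.sum()
    w = w_j1 - w_j0
    return w0, w

def nb_predict(X, w0, w):
    """Classify as class 1 when the (linear) log-odds is positive."""
    return (w0 + X @ w > 0).astype(int)
```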

COMP-551: Applied Machine Learning 30 Joelle Pineau Text classification example


• Using Naïve Bayes, we can compute probabilities for all the words which appear in the document collection. – P(y=c) is the probability of class c

– P(xj | y=c) is the probability of seeing word j in documents of class c

• Set of classes depends on the application, e.g.
  – Topic modeling: each class corresponds to documents on a given topic, e.g. {Politics, Finance, Sports, Arts}.

  [Graphical model: Class c → word1, word2, word3, …, wordm]

• What happens when a word is not observed in the training data?

COMP-551: Applied Machine Learning 32 Joelle Pineau Laplace smoothing


• Replace the maximum likelihood estimator:

  Pr(xj=1 | y=1) = (number of instances with xj=1 and y=1) / (number of examples with y=1)

• With the following:

  Pr(xj=1 | y=1) = [ (number of instances with xj=1 and y=1) + 1 ] / [ (number of examples with y=1) + 2 ]

  – If there are no examples from that class, the estimate reduces to a prior probability of Pr = 1/2.
  – If all examples have xj=1, then Pr(xj=0 | y=1) gets Pr = 1 / (number of examples with y=1 + 2), rather than 0.
  – If a word appears frequently, the new estimate is only slightly biased.

• This is an example of a Bayesian prior for Naïve Bayes.
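In code, Laplace smoothing only changes the counts; a sketch of the smoothed estimator, under the same assumptions as the earlier training sketch:

```python
import numpy as np

def fit_bernoulli_nb_smoothed(X, y):
    """Laplace-smoothed estimates of Pr(x_j=1 | y) for binary features."""
    N1, N0 = (y == 1).sum(), (y == 0).sum()
    theta_j1 = (X[y == 1].sum(axis=0) + 1) / (N1 + 2)
    theta_j0 = (X[y == 0].sum(axis=0) + 1) / (N0 + 2)
    theta1 = y.mean()    # the class prior can be smoothed the same way if desired
    return theta1, theta_j1, theta_j0
```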

COMP-551: Applied Machine Learning 35 Joelle Pineau Example: 20 newsgroups

• Given 1000 training documents from each group, learn to classify new documents according to which newsgroup they came from:

  comp.graphics, misc.forsale, comp.os.ms-windows.misc, rec.autos, comp.sys.ibm.pc.hardware, rec.motorcycles, comp.sys.mac.hardware, rec.sport.baseball, comp.windows.x, rec.sport.hockey, alt.atheism, sci.space, soc.religion.christian, sci.crypt, talk.religion.misc, sci.electronics, talk.politics.mideast, sci.med, talk.politics.misc, talk.politics.guns

• Naïve Bayes: 89% classification accuracy (comparable to other state-of-the-art methods).

COMP-551: Applied Machine Learning 36 Joelle Pineau

Multi-class classification


• Generally two options:
  1. Learn a single classifier that can produce 20 distinct output values.
  2. Learn 20 different 1-vs-all binary classifiers.

• Option 1 assumes you have a multi-class version of the classifier.
  – For Naïve Bayes, compute P(y|x) for each class, and select the class with the highest probability.

• Option 2 applies to all binary classifiers, so it is more flexible. But it is often slower (we need to learn many classifiers), and it creates a class imbalance problem (the target class has relatively few data points compared to the aggregation of the other classes).
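A sketch of option 2 built on top of an off-the-shelf binary classifier (scikit-learn’s LogisticRegression is used here purely as an example of a base learner):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_one_vs_all(X, y, classes):
    """Train one binary classifier per class (class k vs. the rest)."""
    return {k: LogisticRegression(max_iter=1000).fit(X, (y == k).astype(int))
            for k in classes}

def predict_one_vs_all(X, models):
    """Score every class and return the highest-scoring one for each example."""
    classes = sorted(models)
    scores = np.column_stack([models[k].predict_proba(X)[:, 1] for k in classes])
    return np.array(classes)[scores.argmax(axis=1)]
```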

COMP-551: Applied Machine Learning 39 Joelle Pineau Best features extracted

From MSc thesis by Jason Rennie http://qwone.com/~jason/papers/sm-thesis.pdf COMP-551: Applied Machine Learning 40 Joelle Pineau Gaussian Naïve Bayes


Extending Naïve Bayes to continuous inputs:

• P(y) is still assumed to be a binomial distribution.

• P(x|y) is assumed to be a multivariate Gaussian (normal) distribution with mean μ ∊ ℜ^n and covariance matrix Σ ∊ ℜ^{n×n}.
  – If we assume the same Σ for all classes: Linear discriminant analysis.
  – If Σ is distinct between classes: Quadratic discriminant analysis.
  – If Σ is diagonal (i.e. features are independent): Gaussian Naïve Bayes.

• How do we estimate parameters? Derive the maximum likelihood estimators for μ and Σ.
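A minimal sketch of the diagonal-Σ case (Gaussian Naïve Bayes), where the maximum likelihood estimates are just per-class, per-feature means and variances (names and structure are illustrative):

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """MLE of per-class priors, feature means, and feature variances (diagonal Sigma)."""
    params = {}
    for k in np.unique(y):
        Xk = X[y == k]
        # Small constant added to the variances only to avoid division by zero.
        params[k] = (len(Xk) / len(X), Xk.mean(axis=0), Xk.var(axis=0) + 1e-9)
    return params

def predict_gaussian_nb(X, params):
    """Pick the class maximizing log P(y) + sum_j log N(x_j; mu_{y,j}, sigma2_{y,j})."""
    classes = sorted(params)
    log_post = []
    for k in classes:
        prior, mu, var = params[k]
        ll = -0.5 * (np.log(2 * np.pi * var) + (X - mu) ** 2 / var).sum(axis=1)
        log_post.append(np.log(prior) + ll)
    return np.array(classes)[np.argmax(np.column_stack(log_post), axis=1)]
```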

COMP-551: Applied Machine Learning 43 Joelle Pineau More generally: Bayesian networks

• If not all conditional independence relations hold, we can introduce new arcs to show which dependencies are present.

• At each node, we have the conditional probability distribution of the corresponding variable, given its parents.

• This more general type of graph, annotated with conditional distributions, is called a Bayesian network.

• More on this in COMP-652.

COMP-551: Applied Machine Learning 44 Joelle Pineau What you should know

• Naïve Bayes assumption

• Log-odds ratio decision boundary

• How to estimate parameters for Naïve Bayes

• Laplace smoothing

• Relation between Naïve Bayes, LDA, QDA, Gaussian Naïve Bayes.

COMP-551: Applied Machine Learning 45 Joelle Pineau