COMP 551 – Applied Machine Learning Lecture 5: Generative models for linear classification

Instructor: Joelle Pineau ([email protected])

Class web page: www.cs.mcgill.ca/~jpineau/comp551

Unless otherwise noted, all material posted for this course is copyright of the instructor and cannot be reused or reposted without the instructor’s written permission.

Today’s quiz

• Q1. What is a linear classifier? (In contrast to a non-linear classifier.)

• Q2. Describe the difference between discriminative and generative classifiers.

• Q3. Consider the following data set. If you use ____ to compute a decision boundary, what is the prediction for x6?

Data   Feature 1   Feature 2   Feature 3   Output
x1     1           0           0           0
x2     1           0           1           0
x3     0           1           0           0
x4     1           1           1           1
x5     1           1           0           1
x6     0           0           0           ?

COMP-551: Applied Machine Learning 2 Joelle Pineau Quick recap

• Two approaches for linear classification:

  – Discriminative learning: Directly estimate P(y|x).
    • Logistic regression: P(y|x) = σ(w^T x) = 1 / (1 + e^{−w^T x})

  – Generative learning: Separately model P(x|y) and P(y). Use these, through Bayes rule, to estimate P(y|x).

COMP-551: Applied Machine Learning 3 Joelle Pineau

Linear discriminant analysis (LDA)

• Return to Bayes rule:  P(y | x) = P(x | y) P(y) / P(x)

• LDA makes explicit assumptions about P(x | y): a multivariate Gaussian with mean μ and covariance matrix Σ,

    P(x | y) = exp( −(1/2) (x − μ)^T Σ^{−1} (x − μ) ) / ( (2π)^{m/2} |Σ|^{1/2} )

  Notation: here x is a single instance, represented as an m×1 vector.

• Consider the log-odds ratio (again, P(x) doesn’t matter for the decision):

    log [ Pr(x | y=1) P(y=1) / ( Pr(x | y=0) P(y=0) ) ]
      = log P(y=1)/P(y=0) − (1/2) (μ1 + μ0)^T Σ^{−1} (μ1 − μ0) + x^T Σ^{−1} (μ1 − μ0)

• Key assumption of LDA: both classes have the same covariance matrix, Σ.

COMP-551: Applied Machine Learning 4 Joelle Pineau


Linear discriminant analysis (LDA)

• The log-odds ratio above has the form w0 + x^T w, with

    w = Σ^{−1} (μ1 − μ0)   and   w0 = log P(y=1)/P(y=0) − (1/2) (μ1 + μ0)^T Σ^{−1} (μ1 − μ0).

  This is a linear decision boundary!

COMP-551: Applied Machine Learning 5 Joelle Pineau


Learning in LDA: 2 class case

• Estimate P(y), μ, Σ, from the training data, then apply log-odds ratio.

– P(y=0) = N0 / (N0 + N1) P(y=1) = N1 / (N0 + N1)

where N1 and N0 are the number of training samples from classes 1 and 0, respectively.

– μ0 = ∑_{i=1:n} I(yi=0) xi / N0,   μ1 = ∑_{i=1:n} I(yi=1) xi / N1,
  where I(·) is the indicator function: I(z) = 1 if z is true, I(z) = 0 otherwise.

– Σ = ∑_{k=0:1} ∑_{i=1:n} I(yi=k) (xi − μk)(xi − μk)^T / (N0 + N1 − 2)

• Decision boundary: Given an input x, classify it as class 1 if the log-odds ratio is >0, classify it as class 0 otherwise.
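A minimal NumPy sketch of this two-class recipe (illustrative only; the function names are mine, and the prediction uses the log-odds threshold above):

```python
import numpy as np

def fit_lda_2class(X, y):
    """Estimate LDA parameters from an (n, m) data matrix X and 0/1 labels y."""
    X0, X1 = X[y == 0], X[y == 1]
    N0, N1 = len(X0), len(X1)
    p1 = N1 / (N0 + N1)                              # P(y=1); P(y=0) = 1 - p1
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)      # class means
    # Pooled covariance estimate, shared by both classes.
    S = ((X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)) / (N0 + N1 - 2)
    return p1, mu0, mu1, S

def predict_lda_2class(X, p1, mu0, mu1, S):
    """Classify as class 1 exactly when the log-odds ratio is > 0."""
    S_inv = np.linalg.inv(S)
    w = S_inv @ (mu1 - mu0)
    w0 = np.log(p1 / (1 - p1)) - 0.5 * (mu1 + mu0) @ S_inv @ (mu1 - mu0)
    return (w0 + X @ w > 0).astype(int)
```

For example, `predict_lda_2class(X, *fit_lda_2class(X, y))` returns the training-set predictions.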

COMP-551: Applied Machine Learning 10 Joelle Pineau

More complex decision boundaries

• Want to accommodate more than 2 classes?
  – Consider the per-class linear discriminant function:

      δy(x) = x^T Σ^{−1} μy − (1/2) μy^T Σ^{−1} μy + log P(y)

  – Use the following decision rule:

      Output = argmax_y δy(x) = argmax_y [ x^T Σ^{−1} μy − (1/2) μy^T Σ^{−1} μy + log P(y) ]

• Want more flexible (non-linear) decision boundaries?
  – Recall the trick from linear regression of re-coding variables.

COMP-551: Applied Machine Learning 11 Joelle Pineau
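Under the same shared-Σ assumption, the multi-class rule above is a straightforward argmax over the per-class discriminants; a hypothetical sketch (here `means` is a list of per-class mean vectors and `priors` the corresponding class frequencies):

```python
import numpy as np

def lda_predict_multiclass(X, means, priors, S):
    """Return, for each row of X, the class index y maximizing
    delta_y(x) = x^T S^{-1} mu_y - 0.5 mu_y^T S^{-1} mu_y + log P(y)."""
    S_inv = np.linalg.inv(S)
    scores = [X @ S_inv @ mu - 0.5 * mu @ S_inv @ mu + np.log(p)
              for mu, p in zip(means, priors)]
    return np.argmax(np.stack(scores, axis=1), axis=1)
```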

LDA decision boundaries – 3 class case

[Figure (ESL Fig. 4.5): The left panel shows three Gaussian distributions with the same covariance and different means, the contours of constant density enclosing 95% of the probability in each case, and the Bayes decision boundaries between each pair of classes. The right panel shows a sample of 30 points drawn from each Gaussian and the fitted LDA decision boundaries.]

COMP-598: Applied Machine Learning 22 Joelle Pineau

Using higher-order features

[Figure (ESL Fig. 4.1): The left plot shows some data from three classes, with linear decision boundaries found by linear discriminant analysis. The right plot shows quadratic decision boundaries, obtained by finding linear boundaries in the five-dimensional space X1, X2, X1X2, X1², X2². Linear inequalities in this space are quadratic inequalities in the original space.]

COMP-551: Applied Machine Learning 12 Joelle Pineau

Quadratic discriminant analysis

• LDA assumes all classes have the same covariance matrix, Σ.
• QDA allows different covariance matrices, Σy for each class y.

• Linear discriminant function:

    δy(x) = x^T Σ^{−1} μy − (1/2) μy^T Σ^{−1} μy + log P(y)

• Quadratic discriminant function:

    δy(x) = −(1/2) log |Σy| − (1/2) (x − μy)^T Σy^{−1} (x − μy) + log P(y)

• QDA has more parameters to estimate, but greater flexibility to estimate the target function. Bias-variance trade-off.

COMP-551: Applied Machine Learning 13 Joelle Pineau
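To see the two kinds of boundary side by side in practice, scikit-learn’s LinearDiscriminantAnalysis and QuadraticDiscriminantAnalysis can be fit to the same data; a small sketch on made-up Gaussian data (the dataset and seed are arbitrary):

```python
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)

rng = np.random.default_rng(0)
# Two Gaussian classes with different covariances, so QDA has something to gain.
X0 = rng.multivariate_normal([0, 0], [[1.0, 0.3], [0.3, 1.0]], size=200)
X1 = rng.multivariate_normal([2, 2], [[0.3, 0.0], [0.0, 2.0]], size=200)
X = np.vstack([X0, X1])
y = np.array([0] * 200 + [1] * 200)

lda = LinearDiscriminantAnalysis().fit(X, y)
qda = QuadraticDiscriminantAnalysis().fit(X, y)
print("LDA training accuracy:", lda.score(X, y))
print("QDA training accuracy:", qda.score(X, y))
```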

LDA vs QDA

[Figure (ESL Fig. 4.6): Two methods for fitting quadratic boundaries. The left panel shows the quadratic decision boundaries for the data in Figure 4.1, obtained using LDA in the five-dimensional space X1, X2, X1X2, X1², X2². The right panel shows the quadratic decision boundaries found by QDA. The differences are small, as is usually the case.]

COMP-551: Applied Machine Learning 14 Joelle Pineau

Comparing linear classification methods

[Figure (ESL Fig. 4.4): Two-dimensional plot of the vowel training data; eleven classes with X ∈ ℜ^10, projected class means shown as heavy circles; the class overlap is considerable.]

• Training and test error rates on the vowel data (ESL Table 4.1). There are eleven classes in ten dimensions, of which three account for 90% of the variance (via a principal components analysis). Linear regression is hurt by masking, increasing the test and training error by over 10%.

  Technique                          Training   Test
  Linear regression                    0.48     0.67
  Linear discriminant analysis         0.32     0.56
  Quadratic discriminant analysis      0.01     0.53
  Logistic regression                  0.22     0.51

Generative learning – Scaling up


• Consider applying generative learning (e.g. LDA) to Project 1 data.
  E.g. task: given the first few words, determine the language (without a dictionary).
  – What are the features? How do we encode the data?
  – How do we handle a high-dimensional feature space?
  – Do we assume the same covariance matrix for both classes? (Maybe that is ok.)

What kinds of assumptions can we make to simplify the problem?

COMP-551: Applied Machine Learning 16 Joelle Pineau Naïve Bayes assumption


• Generative learning: Estimate P(x|y), P(y). Then calculate P(y|x).

• Naïve Bayes: Assume the xj are conditionally independent given y.

– In other words: P(xj | y) = P(xj | y, xk), for all j ≠ k

• Generative model structure:

P(x | y) = P(x1, x2, …, xm | y)

= P(x1 | y) P(x2 | y, x1) P(x3 | y, x1, x2) … P(xm | y, x1, x2, …, xm-1)   (from the general rules of probability, i.e. the chain rule)

= P(x1 | y) P(x2 | y) P(x3 | y) …P(xm | y) (from the Naïve Bayes assumption above)

COMP-551: Applied Machine Learning 18 Joelle Pineau Naïve Bayes graphical model


[Graphical model: y → x1, x2, x3, …, xm]

• How many parameters to estimate? Assume m binary features.
  – Without the Naïve Bayes assumption: O(2^m) numbers to describe the model.
  – With the Naïve Bayes assumption: O(m) numbers to describe the model.

• Useful when the number of features is high.
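To make the gap concrete: with m = 30 binary features, a full table for P(x|y) needs on the order of 2^30 ≈ 10^9 numbers per class, while the Naïve Bayes model needs only about 2m + 1 = 61 parameters (the θ1, θj,1, θj,0 defined on the next slides).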

COMP-551: Applied Machine Learning 20 Joelle Pineau Training a Naïve Bayes classifier


• Assume x, y are binary variables, m = 1.
• Estimate the parameters P(x|y) and P(y) from data.

  [Graphical model: y → x]

  – Define: θ1 = Pr(y=1), θj,1 = Pr(xj=1 | y=1), θj,0 = Pr(xj=1 | y=0).

• Evaluation criterion: Find parameters that maximize the log-likelihood of the data.

• Likelihood: Pr(y|x) ∝ Pr(y) Pr(x|y) = ∏_{i=1:n} ( P(yi) ∏_{j=1:m} P(xi,j | yi) )
  – Samples i are independent, so we take the product over the n examples.
  – Input features are independent (conditioned on y), so we take the product over the m features.

COMP-551: Applied Machine Learning 22 Joelle Pineau Training a Naïve Bayes classifier


• Likelihood for a binary output variable: L(θ1 | y) = θ1^y (1 − θ1)^(1−y)
• Log-likelihood for all parameters:

  log L(θ1, θj,1, θj,0 | D) = Σ_{i=1:n} [ log P(yi) + Σ_{j=1:m} log P(xi,j | yi) ]

    = Σ_{i=1:n} [ yi log θ1 + (1 − yi) log(1 − θ1)

      + Σ_{j=1:m} yi ( xi,j log θj,1 + (1 − xi,j) log(1 − θj,1) )

      + Σ_{j=1:m} (1 − yi) ( xi,j log θj,0 + (1 − xi,j) log(1 − θj,0) ) ]

  (The terms will have another form if the parameters P(x|y) have another form, e.g. Gaussian.)

• To estimate θ1, take the derivative of log L with respect to θ1, and set it to 0:

  ∂ log L / ∂θ1 = Σ_{i=1:n} ( yi / θ1 − (1 − yi) / (1 − θ1) ) = 0
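To see why the estimate on the next slide follows, multiply the condition through by θ1 (1 − θ1):

  Σ_{i=1:n} [ yi (1 − θ1) − (1 − yi) θ1 ] = 0
  Σ_{i=1:n} yi − θ1 Σ_{i=1:n} yi − n θ1 + θ1 Σ_{i=1:n} yi = 0
  Σ_{i=1:n} yi = n θ1,   i.e.   θ1 = (1/n) Σ_{i=1:n} yi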

COMP-551: Applied Machine Learning 26 Joelle Pineau Training a Naïve Bayes classifier


Solving for θ1 we get:

  θ1 = (1/n) Σ_{i=1:n} yi = (number of examples where y=1) / (number of examples)

Similarly, we get:

  θj,1 = (number of examples where xj=1 and y=1) / (number of examples where y=1)

  θj,0 = (number of examples where xj=1 and y=0) / (number of examples where y=0)
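These counting formulas translate directly into a few NumPy lines; a minimal sketch assuming X is an (n, m) array of 0/1 features and y an array of 0/1 labels (names are illustrative):

```python
import numpy as np

def fit_bernoulli_nb(X, y):
    """Maximum-likelihood estimates of the Naive Bayes parameters."""
    theta1 = y.mean()                      # Pr(y=1): fraction of examples with y=1
    theta_j1 = X[y == 1].mean(axis=0)      # Pr(x_j=1 | y=1), one entry per feature
    theta_j0 = X[y == 0].mean(axis=0)      # Pr(x_j=1 | y=0)
    return theta1, theta_j1, theta_j0
```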

COMP-551: Applied Machine Learning 28 Joelle Pineau Naïve Bayes decision boundary

• Consider again the log-odds ratio:

  log [ Pr(y=1 | x) / Pr(y=0 | x) ] = log [ Pr(x | y=1) P(y=1) / ( Pr(x | y=0) P(y=0) ) ]

    = log P(y=1)/P(y=0) + log [ ∏_{j=1:m} P(xj | y=1) / ∏_{j=1:m} P(xj | y=0) ]

    = log P(y=1)/P(y=0) + Σ_{j=1:m} log [ P(xj | y=1) / P(xj | y=0) ]

COMP-551: Applied Machine Learning 29 Joelle Pineau Naïve Bayes decision boundary

• Consider the case where features are binary: xj ∈ {0, 1}.
• Define:

    wj,0 = log [ P(xj=0 | y=1) / P(xj=0 | y=0) ];   wj,1 = log [ P(xj=1 | y=1) / P(xj=1 | y=0) ]

• Now we have:

  log [ Pr(y=1 | x) / Pr(y=0 | x) ] = log P(y=1)/P(y=0) + Σ_{j=1:m} log [ P(xj | y=1) / P(xj | y=0) ]

    = log P(y=1)/P(y=0) + Σ_{j=1:m} ( wj,0 (1 − xj) + wj,1 xj )

    = log P(y=1)/P(y=0) + Σ_{j=1:m} wj,0 + Σ_{j=1:m} (wj,1 − wj,0) xj

  This is a linear decision boundary: w0 + x^T w.
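A sketch of turning fitted parameters into exactly this linear rule (it reuses the hypothetical `fit_bernoulli_nb` estimates from the earlier sketch):

```python
import numpy as np

def nb_linear_weights(theta1, theta_j1, theta_j0):
    """Return (w0, w) such that the log-odds ratio equals w0 + x . w for binary x."""
    w_j0 = np.log((1 - theta_j1) / (1 - theta_j0))   # log P(x_j=0|y=1) / P(x_j=0|y=0)
    w_j1 = np.log(theta_j1 / theta_j0)               # log P(x_j=1|y=1) / P(x_j=1|y=0)
    w0 = np.log(theta1 / (1 - theta1)) + w_j0.sum()
    w = w_j1 - w_j0
    return w0, w

def nb_predict(X, w0, w):
    """Classify as class 1 when the (linear) log-odds is positive."""
    return (w0 + X @ w > 0).astype(int)
```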

COMP-551: Applied Machine Learning 30 Joelle Pineau Text classification example


• Using Naïve Bayes, we can compute probabilities for all the words which appear in the document collection. – P(y=c) is the probability of class c

– P(xj | y=c) is the probability of seeing word j in documents of class c

• Set of classes depends on the application, e.g.
  – Topic modeling: each class corresponds to documents on a given topic, e.g. {Politics, Finance, Sports, Arts}.

  [Graphical model: Class c → word1, word2, word3, …, wordm]

• What happens when a word is not observed in the training data?

COMP-551: Applied Machine Learning 32 Joelle Pineau Laplace smoothing


• Replace the maximum likelihood estimator:

  Pr(xj=1 | y=1) = (number of instances with xj=1 and y=1) / (number of examples with y=1)

• With the following:

  Pr(xj=1 | y=1) = [ (number of instances with xj=1 and y=1) + 1 ] / [ (number of examples with y=1) + 2 ]

  – If there are no examples from that class, the estimate reduces to a prior probability of Pr = 1/2.
  – If all examples have xj=1, then Pr(xj=0 | y=1) gets Pr = 1 / (number of examples with y=1 + 2), rather than 0.
  – If a word appears frequently, the new estimate is only slightly biased.

• This is an example of a Bayesian prior for Naïve Bayes.
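In code, Laplace smoothing only changes the counts; a sketch of the smoothed estimator, under the same assumptions as the earlier training sketch:

```python
import numpy as np

def fit_bernoulli_nb_smoothed(X, y):
    """Laplace-smoothed estimates of Pr(x_j=1 | y) for binary features."""
    N1, N0 = (y == 1).sum(), (y == 0).sum()
    theta_j1 = (X[y == 1].sum(axis=0) + 1) / (N1 + 2)
    theta_j0 = (X[y == 0].sum(axis=0) + 1) / (N0 + 2)
    theta1 = y.mean()    # the class prior can be smoothed the same way if desired
    return theta1, theta_j1, theta_j0
```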

COMP-551: Applied Machine Learning 35 Joelle Pineau Example: 20 newsgroups

• Given 1000 training documents from each group, learn to classify new documents according to which newsgroup they came from:

  comp.graphics, misc.forsale, comp.os.ms-windows.misc, rec.autos, comp.sys.ibm.pc.hardware, rec.motorcycles, comp.sys.mac.hardware, rec.sport.baseball, comp.windows.x, rec.sport.hockey, alt.atheism, sci.space, soc.religion.christian, sci.crypt, talk.religion.misc, sci.electronics, talk.politics.mideast, sci.med, talk.politics.misc, talk.politics.guns

• Naïve Bayes: 89% classification accuracy (comparable to other state-of-the-art methods).

COMP-551: Applied Machine Learning 36 Joelle Pineau

Multi-class classification


• Generally two options:
  1. Learn a single classifier that can produce 20 distinct output values.
  2. Learn 20 different 1-vs-all binary classifiers.

• Option 1 assumes you have a multi-class version of the classifier.
  – For Naïve Bayes, compute P(y|x) for each class, and select the class with the highest probability.

• Option 2 applies to all binary classifiers, so it is more flexible. But it is often slower (we need to learn many classifiers), and it creates a class imbalance problem (the target class has relatively few data points compared to the aggregation of the other classes).
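A sketch of option 2 built on top of an off-the-shelf binary classifier (scikit-learn’s LogisticRegression is used here purely as an example of a base learner):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_one_vs_all(X, y, classes):
    """Train one binary classifier per class (class k vs. the rest)."""
    return {k: LogisticRegression(max_iter=1000).fit(X, (y == k).astype(int))
            for k in classes}

def predict_one_vs_all(X, models):
    """Score every class and return the highest-scoring one for each example."""
    classes = sorted(models)
    scores = np.column_stack([models[k].predict_proba(X)[:, 1] for k in classes])
    return np.array(classes)[scores.argmax(axis=1)]
```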

COMP-551: Applied Machine Learning 39 Joelle Pineau Best features extracted

From MSc thesis by Jason Rennie http://qwone.com/~jason/papers/sm-thesis.pdf COMP-551: Applied Machine Learning 40 Joelle Pineau Gaussian Naïve Bayes


Extending Naïve Bayes to continuous inputs:

• P(y) is still assumed to be a binomial distribution.

• P(x|y) is assumed to be a multivariate Gaussian (normal) distribution with mean μ ∊ ℜ^n and covariance matrix Σ ∊ ℜ^{n×n}.
  – If we assume the same Σ for all classes: Linear discriminant analysis.
  – If Σ is distinct between classes: Quadratic discriminant analysis.
  – If Σ is diagonal (i.e. features are independent): Gaussian Naïve Bayes.

• How do we estimate parameters? Derive the maximum likelihood estimators for μ and Σ.
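A minimal sketch of the diagonal-Σ case (Gaussian Naïve Bayes), where the maximum likelihood estimates are just per-class, per-feature means and variances (names and structure are illustrative):

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """MLE of per-class priors, feature means, and feature variances (diagonal Sigma)."""
    params = {}
    for k in np.unique(y):
        Xk = X[y == k]
        # Small constant added to the variances only to avoid division by zero.
        params[k] = (len(Xk) / len(X), Xk.mean(axis=0), Xk.var(axis=0) + 1e-9)
    return params

def predict_gaussian_nb(X, params):
    """Pick the class maximizing log P(y) + sum_j log N(x_j; mu_{y,j}, sigma2_{y,j})."""
    classes = sorted(params)
    log_post = []
    for k in classes:
        prior, mu, var = params[k]
        ll = -0.5 * (np.log(2 * np.pi * var) + (X - mu) ** 2 / var).sum(axis=1)
        log_post.append(np.log(prior) + ll)
    return np.array(classes)[np.argmax(np.column_stack(log_post), axis=1)]
```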

COMP-551: Applied Machine Learning 43 Joelle Pineau More generally: Bayesian networks

• If not all conditional independence relations hold, we can introduce new arcs to show which dependencies are present.

• At each node, we have the conditional probability distribution of the corresponding variable, given its parents.

• This more general type of graph, annotated with conditional distributions, is called a Bayesian network.

• More on this in COMP-652.

COMP-551: Applied Machine Learning 44 Joelle Pineau What you should know

• Naïve Bayes assumption

• Log-odds ratio decision boundary

• How to estimate parameters for Naïve Bayes

• Laplace smoothing

• Relation between Naïve Bayes, LDA, QDA, Gaussian Naïve Bayes.

COMP-551: Applied Machine Learning 45 Joelle Pineau