
Linear and Quadratic Discriminant Analysis: Tutorial

Benyamin Ghojogh [email protected] Department of Electrical and Computer Engineering, Machine Learning Laboratory, University of Waterloo, Waterloo, ON, Canada

Mark Crowley [email protected] Department of Electrical and Computer Engineering, Machine Learning Laboratory, University of Waterloo, Waterloo, ON, Canada

Abstract

This tutorial explains Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) as two fundamental classification methods in statistical and probabilistic learning. We start with the optimization of the decision boundary on which the posteriors are equal. Then, LDA and QDA are derived for binary and multiple classes. The estimation of parameters in LDA and QDA is also covered. Then, we explain how LDA and QDA are related to metric learning, kernel principal component analysis, Mahalanobis distance, logistic regression, the Bayes optimal classifier, Gaussian naive Bayes, and the likelihood ratio test. We also prove that LDA and Fisher discriminant analysis are equivalent. We finally clarify some of the theoretical concepts with the simulations we provide.

1. Introduction

Assume we have a dataset of instances {(x_i, y_i)}_{i=1}^n with sample size n and dimensionality x_i ∈ R^d and y_i ∈ R. The y_i's are the class labels. We would like to classify the space of data using these instances. Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) (Friedman et al., 2009) are two well-known supervised classification methods in statistical and probabilistic learning. This paper is a tutorial for these two classifiers where the theory for binary and multi-class classification is detailed. Then, the relations of LDA and QDA to metric learning, kernel Principal Component Analysis (PCA), Fisher Discriminant Analysis (FDA), logistic regression, the Bayes optimal classifier, Gaussian naive Bayes, and the Likelihood Ratio Test (LRT) are explained for a better understanding of these two fundamental methods. Finally, some experiments on synthetic datasets are reported and analyzed for illustration.

2. Optimization for the Boundary of Classes

First suppose the data is one dimensional, x ∈ R. Assume we have two classes with the Cumulative Distribution Functions (CDF) F_1(x) and F_2(x), respectively. Let the Probability Density Functions (PDF) of these CDFs be:

f_1(x) = ∂F_1(x) / ∂x,   (1)
f_2(x) = ∂F_2(x) / ∂x,   (2)

respectively.
We assume that the two classes have normal (Gaussian) distributions, which is the most common and default distribution in real-world applications. The mean of one of the two classes is greater than the other; we assume µ_1 < µ_2. An instance x ∈ R belongs to one of these two classes:

x ∼ N(µ_1, σ_1^2)  if x ∈ C_1,
x ∼ N(µ_2, σ_2^2)  if x ∈ C_2,   (3)

where C_1 and C_2 denote the first and second class, respectively.
For an instance x, we may have an error in estimating the class it belongs to. At a point, which we denote by x*, the probabilities of the two classes are equal; therefore, the point x* is on the boundary of the two classes. As we have µ_1 < µ_2, we can say µ_1 < x* < µ_2, as shown in Fig. 1. Therefore, the instance x belongs to the first class if x < x* and to the second class if x > x*. Hence, estimating an instance with x < x* to be in the second class, or one with x > x* to be in the first class, is an error in estimating the class. This probability of error can be stated as:

P(error) = P(x > x*, x ∈ C_1) + P(x < x*, x ∈ C_2).   (4)

Figure 1. Two Gaussian density functions; they are equal at the point x*.

As we have P(A, B) = P(A | B) P(B), we can say:

P(error) = P(x > x* | x ∈ C_1) P(x ∈ C_1) + P(x < x* | x ∈ C_2) P(x ∈ C_2),   (5)

which we want to minimize:

minimize_{x*}  P(error),   (6)

by finding the best boundary of classes, i.e., x*.
According to the definition of the CDF, we have:

P(x < c, x ∈ C_1) = F_1(c)
⟹ P(x > x*, x ∈ C_1) = 1 − F_1(x*),   (7)
P(x < x*, x ∈ C_2) = F_2(x*).   (8)

According to the definition of the PDF, we have:

P(x ∈ C_1) = f_1(x) = π_1,   (9)
P(x ∈ C_2) = f_2(x) = π_2,   (10)

where we denote the priors f_1(x) and f_2(x) by π_1 and π_2, respectively.
Hence, Eqs. (5) and (6) become:

minimize_{x*}  (1 − F_1(x*)) π_1 + F_2(x*) π_2.   (11)

We take the derivative and set it to zero for the sake of minimization:

∂P(error)/∂x* = −f_1(x*) π_1 + f_2(x*) π_2 = 0
⟹ f_1(x*) π_1 = f_2(x*) π_2.   (12)

Another way to obtain this expression is to equate the posterior probabilities, which gives the equation of the boundary of classes:

P(x ∈ C_1 | X = x) = P(x ∈ C_2 | X = x).   (13)

According to Bayes' rule, the posterior is:

P(x ∈ C_1 | X = x) = P(X = x | x ∈ C_1) P(x ∈ C_1) / P(X = x) = f_1(x) π_1 / Σ_{k=1}^{|C|} P(X = x | x ∈ C_k) π_k,   (14)

where |C| is the number of classes, which is two here. The f_1(x) and π_1 are the likelihood (class conditional) and prior probabilities, respectively, and the denominator is the marginal probability. Therefore, Eq. (13) becomes:

f_1(x) π_1 / Σ_{i=1}^{|C|} P(X = x | x ∈ C_i) π_i = f_2(x) π_2 / Σ_{i=1}^{|C|} P(X = x | x ∈ C_i) π_i
⟹ f_1(x) π_1 = f_2(x) π_2.   (15)

Now let us think of the data as multivariate, with dimensionality d. The PDF of the multivariate Gaussian distribution x ∼ N(µ, Σ) is:

f(x) = (1 / √((2π)^d |Σ|)) exp( −(1/2) (x − µ)^⊤ Σ^{-1} (x − µ) ),   (16)

where x ∈ R^d, µ ∈ R^d is the mean, Σ ∈ R^{d×d} is the covariance matrix, and |·| denotes the determinant of a matrix. The π ≈ 3.14 in this equation should not be confused with the π_k (prior) in Eq. (12) or (15). Therefore, Eq. (12) or (15) becomes:

(1 / √((2π)^d |Σ_1|)) exp( −(1/2) (x − µ_1)^⊤ Σ_1^{-1} (x − µ_1) ) π_1 = (1 / √((2π)^d |Σ_2|)) exp( −(1/2) (x − µ_2)^⊤ Σ_2^{-1} (x − µ_2) ) π_2,   (17)

where the distributions of the first and second class are N(µ_1, Σ_1) and N(µ_2, Σ_2), respectively.
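As a quick illustration of Eq. (12), the boundary x* of two univariate Gaussian classes can be located numerically as the root of f_1(x) π_1 − f_2(x) π_2 between the two means. The following is a minimal sketch of our own (not part of the original tutorial), assuming NumPy/SciPy are available; the example parameters are arbitrary.

import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

# Two univariate Gaussian classes (arbitrary example parameters).
mu1, sd1, pi1 = 0.0, 1.0, 0.6
mu2, sd2, pi2 = 4.0, 1.5, 0.4

# Eq. (12): the boundary x* satisfies f1(x*) pi1 - f2(x*) pi2 = 0.
def g(x):
    return pi1 * norm.pdf(x, mu1, sd1) - pi2 * norm.pdf(x, mu2, sd2)

# Since mu1 < x* < mu2, a sign change is bracketed between the two means.
x_star = brentq(g, mu1, mu2)
print("decision boundary x* =", x_star)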

3. Linear Discriminant Analysis for Binary Classification

In Linear Discriminant Analysis (LDA), we assume that the two classes have equal covariance matrices:

Σ_1 = Σ_2 = Σ.   (18)

Therefore, Eq. (17) becomes:

(1/√((2π)^d |Σ|)) exp( −(1/2)(x − µ_1)^⊤ Σ^{-1} (x − µ_1) ) π_1 = (1/√((2π)^d |Σ|)) exp( −(1/2)(x − µ_2)^⊤ Σ^{-1} (x − µ_2) ) π_2
⟹ exp( −(1/2)(x − µ_1)^⊤ Σ^{-1} (x − µ_1) ) π_1 = exp( −(1/2)(x − µ_2)^⊤ Σ^{-1} (x − µ_2) ) π_2
⟹(a) −(1/2)(x − µ_1)^⊤ Σ^{-1} (x − µ_1) + ln(π_1) = −(1/2)(x − µ_2)^⊤ Σ^{-1} (x − µ_2) + ln(π_2),

where (a) takes the natural logarithm of both sides of the equation. We can expand the quadratic term as:

(x − µ_1)^⊤ Σ^{-1} (x − µ_1) = (x^⊤ − µ_1^⊤) Σ^{-1} (x − µ_1) = x^⊤ Σ^{-1} x − x^⊤ Σ^{-1} µ_1 − µ_1^⊤ Σ^{-1} x + µ_1^⊤ Σ^{-1} µ_1 =(a) x^⊤ Σ^{-1} x + µ_1^⊤ Σ^{-1} µ_1 − 2 µ_1^⊤ Σ^{-1} x,   (19)

where (a) is because x^⊤ Σ^{-1} µ_1 = µ_1^⊤ Σ^{-1} x, as it is a scalar and Σ^{-1} is symmetric, so Σ^{-⊤} = Σ^{-1}. Thus, we have:

−(1/2) x^⊤ Σ^{-1} x − (1/2) µ_1^⊤ Σ^{-1} µ_1 + µ_1^⊤ Σ^{-1} x + ln(π_1) = −(1/2) x^⊤ Σ^{-1} x − (1/2) µ_2^⊤ Σ^{-1} µ_2 + µ_2^⊤ Σ^{-1} x + ln(π_2).

Therefore, if we multiply both sides of the equation by 2 and rearrange, we have:

2 (Σ^{-1} (µ_2 − µ_1))^⊤ x + (µ_1 − µ_2)^⊤ Σ^{-1} (µ_1 + µ_2) + 2 ln(π_2 / π_1) = 0,   (20)

which is the equation of a line in the form a^⊤ x + b = 0. Therefore, if we consider Gaussian distributions for the two classes where the covariance matrices are assumed to be equal, the decision boundary of classification is a line. Because of the linearity of the decision boundary which discriminates the two classes, this method is named linear discriminant analysis.
For obtaining Eq. (20), we brought the expressions to the side which corresponded to the second class; therefore, if we use δ(x): R^d → R as the left-hand-side expression (function) in Eq. (20):

δ(x) := 2 (Σ^{-1} (µ_2 − µ_1))^⊤ x + (µ_1 − µ_2)^⊤ Σ^{-1} (µ_1 + µ_2) + 2 ln(π_2 / π_1),   (21)

the class of an instance x is estimated as:

Ĉ(x) = 1  if δ(x) < 0,   Ĉ(x) = 2  if δ(x) > 0.   (22)

If the priors of the two classes are equal, i.e., π_1 = π_2, Eq. (20) becomes:

2 (Σ^{-1} (µ_2 − µ_1))^⊤ x + (µ_1 − µ_2)^⊤ Σ^{-1} (µ_1 + µ_2) = 0,   (23)

whose left-hand-side expression can be considered as δ(x) in Eq. (22).

4. Quadratic Discriminant Analysis for Binary Classification

In Quadratic Discriminant Analysis (QDA), we relax the assumption of equality of the covariance matrices:

Σ_1 ≠ Σ_2,   (24)

which means the covariances are not necessarily equal (if they happen to be equal, the decision boundary will be linear and QDA reduces to LDA).
Therefore, Eq. (17) becomes:

(1/√((2π)^d |Σ_1|)) exp( −(1/2)(x − µ_1)^⊤ Σ_1^{-1} (x − µ_1) ) π_1 = (1/√((2π)^d |Σ_2|)) exp( −(1/2)(x − µ_2)^⊤ Σ_2^{-1} (x − µ_2) ) π_2
⟹(a) −(d/2) ln(2π) − (1/2) ln(|Σ_1|) − (1/2)(x − µ_1)^⊤ Σ_1^{-1} (x − µ_1) + ln(π_1) = −(d/2) ln(2π) − (1/2) ln(|Σ_2|) − (1/2)(x − µ_2)^⊤ Σ_2^{-1} (x − µ_2) + ln(π_2),

where (a) takes the natural logarithm of both sides of the equation. According to Eq. (19), we have:

−(1/2) ln(|Σ_1|) − (1/2) x^⊤ Σ_1^{-1} x − (1/2) µ_1^⊤ Σ_1^{-1} µ_1 + µ_1^⊤ Σ_1^{-1} x + ln(π_1) = −(1/2) ln(|Σ_2|) − (1/2) x^⊤ Σ_2^{-1} x − (1/2) µ_2^⊤ Σ_2^{-1} µ_2 + µ_2^⊤ Σ_2^{-1} x + ln(π_2).

Therefore, if we multiply both sides of the equation by 2 and rearrange, we have:

x^⊤ (Σ_1^{-1} − Σ_2^{-1}) x + 2 (Σ_2^{-1} µ_2 − Σ_1^{-1} µ_1)^⊤ x + (µ_1^⊤ Σ_1^{-1} µ_1 − µ_2^⊤ Σ_2^{-1} µ_2) + ln(|Σ_1| / |Σ_2|) + 2 ln(π_2 / π_1) = 0,   (25)

which is in the form x^⊤ A x + b^⊤ x + c = 0. Therefore, if we consider Gaussian distributions for the two classes, the decision boundary of classification is quadratic. Because of the quadratic decision boundary which discriminates the two classes, this method is named quadratic discriminant analysis.
For obtaining Eq. (25), we brought the expressions to the side which corresponded to the second class; therefore, if we use δ(x): R^d → R as the left-hand-side expression (function) in Eq. (25):

δ(x) := x^⊤ (Σ_1^{-1} − Σ_2^{-1}) x + 2 (Σ_2^{-1} µ_2 − Σ_1^{-1} µ_1)^⊤ x + (µ_1^⊤ Σ_1^{-1} µ_1 − µ_2^⊤ Σ_2^{-1} µ_2) + ln(|Σ_1| / |Σ_2|) + 2 ln(π_2 / π_1),   (26)

the class of an instance x is estimated as in Eq. (22).
If the priors of the two classes are equal, i.e., π_1 = π_2, Eq. (25) becomes:

x^⊤ (Σ_1^{-1} − Σ_2^{-1}) x + 2 (Σ_2^{-1} µ_2 − Σ_1^{-1} µ_1)^⊤ x + (µ_1^⊤ Σ_1^{-1} µ_1 − µ_2^⊤ Σ_2^{-1} µ_2) + ln(|Σ_1| / |Σ_2|) = 0,   (27)

whose left-hand-side expression can be considered as δ(x) in Eq. (22).

5. LDA and QDA for Multi-class Classification

Now we consider multiple classes, which can be more than two, indexed by k ∈ {1, ..., |C|}. Recall Eq. (12) or (15), where we use the scaled posterior, i.e., f_k(x) π_k. According to Eq. (16), we have:

f_k(x) π_k = (1/√((2π)^d |Σ_k|)) exp( −(1/2)(x − µ_k)^⊤ Σ_k^{-1} (x − µ_k) ) π_k.

Taking the natural logarithm gives:

ln(f_k(x) π_k) = −(d/2) ln(2π) − (1/2) ln(|Σ_k|) − (1/2)(x − µ_k)^⊤ Σ_k^{-1} (x − µ_k) + ln(π_k).

We drop the constant term −(d/2) ln(2π), which is the same for all classes (note that this term appears as a multiplicative factor before taking the logarithm). Thus, the scaled posterior of the k-th class becomes:

δ_k(x) := −(1/2) ln(|Σ_k|) − (1/2)(x − µ_k)^⊤ Σ_k^{-1} (x − µ_k) + ln(π_k).   (28)

In QDA, the class of the instance x is estimated as:

Ĉ(x) = arg max_k δ_k(x),   (29)

because it maximizes the posterior of that class. In this expression, δ_k(x) is Eq. (28).
In LDA, we assume that the covariance matrices of the k classes are equal:

Σ_1 = ··· = Σ_{|C|} = Σ.   (30)

Therefore, Eq. (28) becomes:

δ_k(x) = −(1/2) ln(|Σ|) − (1/2)(x − µ_k)^⊤ Σ^{-1} (x − µ_k) + ln(π_k) = −(1/2) ln(|Σ|) − (1/2) x^⊤ Σ^{-1} x − (1/2) µ_k^⊤ Σ^{-1} µ_k + µ_k^⊤ Σ^{-1} x + ln(π_k).

We drop the constant terms −(1/2) ln(|Σ|) and −(1/2) x^⊤ Σ^{-1} x, which are the same for all classes (note that before taking the logarithm, these terms appear as multiplicative and exponential factors, respectively). Thus, the scaled posterior of the k-th class becomes:

δ_k(x) := µ_k^⊤ Σ^{-1} x − (1/2) µ_k^⊤ Σ^{-1} µ_k + ln(π_k).   (31)

In LDA, the class of the instance x is determined by Eq. (29), where δ_k(x) is Eq. (31), because it maximizes the posterior of that class.
In conclusion, QDA and LDA deal with maximizing the posterior of classes but work with the likelihoods (class conditionals) and priors.
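To make the multi-class rules concrete, the following sketch of our own (assuming NumPy; the function and variable names are not from the paper) evaluates the QDA discriminant of Eq. (28) and the LDA discriminant of Eq. (31) and applies Eq. (29).

import numpy as np

def delta_qda(x, mu_k, Sigma_k, pi_k):
    # Eq. (28): scaled log-posterior of class k in QDA.
    diff = x - mu_k
    _, logdet = np.linalg.slogdet(Sigma_k)
    return -0.5 * logdet - 0.5 * diff @ np.linalg.inv(Sigma_k) @ diff + np.log(pi_k)

def delta_lda(x, mu_k, Sigma, pi_k):
    # Eq. (31): scaled log-posterior of class k in LDA (shared covariance).
    Sigma_inv = np.linalg.inv(Sigma)
    return mu_k @ Sigma_inv @ x - 0.5 * mu_k @ Sigma_inv @ mu_k + np.log(pi_k)

def classify(x, mus, Sigmas, pis):
    # Eq. (29): pick the class with the largest discriminant (labels are 1-based).
    deltas = [delta_qda(x, m, S, p) for m, S, p in zip(mus, Sigmas, pis)]
    return int(np.argmax(deltas)) + 1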

6. Estimation of Parameters in LDA and QDA

In LDA and QDA, we have several parameters which are required in order to calculate the posteriors. These parameters are the means and the covariance matrices of the classes and the priors of the classes.
The priors of the classes are somewhat tricky to calculate. It is a chicken-and-egg problem because we want to know the class probabilities (priors) to estimate the class of an instance, but we do not have the priors and should estimate them. Usually, the prior of the k-th class is estimated according to the sample size of the k-th class:

π̂_k = n_k / n,   (32)

where n_k and n are the number of training instances in the k-th class and in total, respectively. This estimation considers a Bernoulli distribution for choosing every instance out of the overall training set to be in the k-th class.
The mean of the k-th class can be estimated using Maximum Likelihood Estimation (MLE), or the Method of Moments (MOM), for the mean of a Gaussian distribution:

R^d ∋ µ̂_k = (1/n_k) Σ_{i=1}^{n} x_i I(C(x_i) = k),   (33)

where I(·) is the indicator function, which is one if its condition is satisfied and zero otherwise.
In QDA, the covariance matrix of the k-th class is estimated using MLE:

R^{d×d} ∋ Σ̂_k = (1/n_k) Σ_{i=1}^{n} (x_i − µ̂_k)(x_i − µ̂_k)^⊤ I(C(x_i) = k),   (34)

or we can use the unbiased estimate of the covariance matrix:

R^{d×d} ∋ Σ̂_k = (1/(n_k − 1)) Σ_{i=1}^{n} (x_i − µ̂_k)(x_i − µ̂_k)^⊤ I(C(x_i) = k).   (35)

In LDA, we assume that the covariance matrices of the classes are equal; therefore, we use the weighted average of the estimated covariance matrices as the common covariance matrix in LDA:

R^{d×d} ∋ Σ̂ = Σ_{k=1}^{|C|} n_k Σ̂_k / Σ_{r=1}^{|C|} n_r = (1/n) Σ_{k=1}^{|C|} n_k Σ̂_k,   (36)

where the weights are the cardinalities of the classes.

Figure 2. The QDA and LDA where the covariance matrices are the identity matrix. For equal priors, QDA and LDA reduce to simple classification using the Euclidean distance from the means of classes. Changing the priors modifies the location of the decision boundary, where even one point can be classified differently for different priors.
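A compact sketch of Eqs. (32)-(36), of our own and assuming NumPy, with X an (n, d) data matrix and y a vector of integer class labels (these names are ours, not the paper's):

import numpy as np

def estimate_parameters(X, y):
    # Returns priors (Eq. 32), class means (Eq. 33), unbiased class
    # covariances (Eq. 35), and the pooled LDA covariance (Eq. 36).
    n = len(y)
    priors, means, covs = {}, {}, {}
    pooled = np.zeros((X.shape[1], X.shape[1]))
    for k in np.unique(y):
        Xk = X[y == k]
        n_k = len(Xk)
        priors[k] = n_k / n
        means[k] = Xk.mean(axis=0)
        covs[k] = np.cov(Xk, rowvar=False, ddof=1)
        pooled += n_k * covs[k]
    return priors, means, covs, pooled / n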

7. LDA and QDA are Metric Learning!

Recall Eq. (28), which is the scaled posterior for QDA. First, assume that the covariance matrices are all equal (as we have in LDA) and that they are all the identity matrix:

Σ_1 = ··· = Σ_{|C|} = I,   (37)

which means that all the classes are assumed to be spherically distributed in the d-dimensional space. After this assumption, Eq. (28) becomes:

δ_k(x) = −(1/2)(x − µ_k)^⊤ (x − µ_k) + ln(π_k),   (38)

because |I| = 1, ln(1) = 0, and I^{-1} = I. If we assume that the priors are all equal, the term ln(π_k) is constant and can be dropped:

δ_k(x) = −(1/2)(x − µ_k)^⊤ (x − µ_k) = −(1/2) d_k^2,   (39)

where d_k is the Euclidean distance from the mean of the k-th class:

d_k = ||x − µ_k||_2 = √((x − µ_k)^⊤ (x − µ_k)).   (40)

Thus, QDA and LDA reduce to simple Euclidean distance from the means of classes if the covariance matrices are all the identity matrix and the priors are equal. Simple classification using the Euclidean distance from the means of classes is one of the simplest classification methods, where the metric used is the Euclidean distance.
Eq. (39) carries a very interesting message. We know that in metric Multi-Dimensional Scaling (MDS) (Cox & Cox, 2000) and kernel Principal Component Analysis (PCA), we have (see (Ham et al., 2004) and Chapter 2 in (Strange & Zwiggelaar, 2014)):

K = −(1/2) H D H,   (41)

where D ∈ R^{n×n} is the distance matrix whose elements are the distances between the data instances, K ∈ R^{n×n} is the kernel matrix over the data instances, R^{n×n} ∋ H := I − (1/n) 1 1^⊤ is the centering matrix, and R^n ∋ 1 := [1, 1, ..., 1]^⊤. If the elements of the distance matrix D are obtained using the Euclidean distance, MDS is equivalent to Principal Component Analysis (PCA) (Jolliffe, 2011).
Comparing Eqs. (39) and (41) shows an interesting connection between the posterior of a class in QDA and the kernel over the data instances of the class. In this comparison, Eq. (41) should be considered for a class and not the entire data, so K ∈ R^{n_k×n_k}, D ∈ R^{n_k×n_k}, and H ∈ R^{n_k×n_k}.
Now, consider the case where the covariance matrices are still all the identity matrix but the priors are not equal. In this case, we have Eq. (38). If we take an exponential (the inverse of the logarithm) of this expression, the π_k becomes a scale factor (weight). This means that we are still using a distance metric to measure the distance of an instance from the means of classes, but we are scaling the distances by the priors of classes. If a class occurs more frequently, i.e., its prior is larger, it must have a larger posterior, so we reduce the distance from the mean of its class. In other words, we move the decision boundary according to the priors of classes (see Fig. 2).

As the next step, consider a more general case where the covariance matrices are not equal, as we have in QDA. We apply Singular Value Decomposition (SVD) to the covariance matrix of the k-th class:

Σ_k = U_k Λ_k U_k^⊤,

where the left and right matrices of singular vectors are equal because the covariance matrix is symmetric. Therefore:

Σ_k^{-1} = U_k Λ_k^{-1} U_k^⊤,

where U_k^{-1} = U_k^⊤ because it is an orthogonal matrix. Therefore, we can simplify the following term:

(x − µ_k)^⊤ Σ_k^{-1} (x − µ_k) = (x − µ_k)^⊤ U_k Λ_k^{-1} U_k^⊤ (x − µ_k) = (U_k^⊤ x − U_k^⊤ µ_k)^⊤ Λ_k^{-1} (U_k^⊤ x − U_k^⊤ µ_k).

As Λ_k is a diagonal matrix with non-negative elements (because it comes from a covariance matrix), we can decompose it as:

Λ_k^{-1} = Λ_k^{-1/2} Λ_k^{-1/2}.

Therefore:

(U_k^⊤ x − U_k^⊤ µ_k)^⊤ Λ_k^{-1} (U_k^⊤ x − U_k^⊤ µ_k) = (Λ_k^{-1/2} U_k^⊤ x − Λ_k^{-1/2} U_k^⊤ µ_k)^⊤ (Λ_k^{-1/2} U_k^⊤ x − Λ_k^{-1/2} U_k^⊤ µ_k),

where we used Λ_k^{-⊤/2} = Λ_k^{-1/2} because it is diagonal. We define the following transformation:

φ_k : x ↦ Λ_k^{-1/2} U_k^⊤ x,   (42)

which also results in the transformation of the mean: φ_k : µ ↦ Λ_k^{-1/2} U_k^⊤ µ. Therefore, Eq. (28) can be restated as:

δ_k(x) = −(1/2) ln(|Σ_k|) − (1/2) (φ_k(x) − φ_k(µ_k))^⊤ (φ_k(x) − φ_k(µ_k)) + ln(π_k).   (43)

Ignoring the terms −(1/2) ln(|Σ_k|) and ln(π_k), we can see that the transformation has changed the covariance matrix of the class to the identity matrix. Therefore, QDA (and also LDA) can be seen as a simple comparison of distances from the means of classes after applying a transformation to the data of every class. In other words, we are learning the metric using the SVD of the covariance matrix of every class. Thus, LDA and QDA can be seen as metric learning (Yang & Jin, 2006; Kulis, 2013) from this perspective. Note that in metric learning, a valid distance metric is defined as (Yang & Jin, 2006):

d_A^2(x, µ_k) := ||x − µ_k||_A^2 = (x − µ_k)^⊤ A (x − µ_k),   (44)

where A is a positive semi-definite matrix, i.e., A ⪰ 0. In QDA, we are also using (x − µ_k)^⊤ Σ_k^{-1} (x − µ_k). The covariance matrix is positive semi-definite according to the characteristics of a covariance matrix. Moreover, the inverse of a (full-rank, i.e., positive definite) covariance matrix is also positive (semi-)definite, so Σ_k^{-1} ⪰ 0. Therefore, QDA is using metric learning (and, as will be discussed in the next section, it can be seen as a manifold learning method, too).
It is also noteworthy that QDA and LDA can be seen as using the Mahalanobis distance (McLachlan, 1999; De Maesschalck et al., 2000), which is also a metric:

d_M^2(x, µ) := ||x − µ||_M^2 = (x − µ)^⊤ Σ^{-1} (x − µ),   (45)

where Σ is the covariance matrix of the cloud of data whose mean is µ. The intuition of the Mahalanobis distance is that, if we have several data clouds (e.g., classes), the distance from the class with larger variance should be scaled down because that class takes up more of the space, so it is more probable to happen. The scaling down shows up in the inverse of the covariance matrix. Comparing (x − µ_k)^⊤ Σ_k^{-1} (x − µ_k) in QDA or LDA with Eq. (45) shows that QDA and LDA are, in a sense, using the Mahalanobis distance.
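The whitening view of Eqs. (42) and (43) can be checked numerically. The sketch below is ours (assuming NumPy; the covariance values are arbitrary) and verifies that the quadratic form with Σ_k^{-1} equals a plain squared Euclidean distance after the per-class transformation φ_k.

import numpy as np

# Arbitrary symmetric positive definite covariance and mean for a class k.
Sigma_k = np.array([[6.0, 1.5], [1.5, 4.0]])
mu_k = np.array([3.0, -3.0])
x = np.array([1.0, 2.0])

# Eigendecomposition of the symmetric covariance (its SVD has equal left/right bases).
lam, U = np.linalg.eigh(Sigma_k)

def phi(v):
    # Eq. (42): v -> Lambda^{-1/2} U^T v
    return np.diag(lam ** -0.5) @ U.T @ v

quad_direct = (x - mu_k) @ np.linalg.inv(Sigma_k) @ (x - mu_k)
quad_whitened = np.sum((phi(x) - phi(mu_k)) ** 2)  # Euclidean distance after phi_k
print(np.isclose(quad_direct, quad_whitened))      # True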

8. LDA ≡ FDA

In the previous section, we saw that LDA and QDA can be seen as metric learning. We know that metric learning can be seen as a family of manifold learning methods. We briefly explain the reason for this assertion: as A ⪰ 0, we can write A = U U^⊤. Therefore, Eq. (44) becomes:

||x − µ_k||_A^2 = (x − µ_k)^⊤ U U^⊤ (x − µ_k) = (U^⊤ x − U^⊤ µ_k)^⊤ (U^⊤ x − U^⊤ µ_k),

which means that metric learning can be seen as a comparison of simple Euclidean distances after the transformation φ : x ↦ U^⊤ x, which is a projection onto a subspace with projection matrix U. Thus, metric learning is a manifold learning approach. This gives a hint that Fisher Discriminant Analysis (FDA) (Fisher, 1936; Welling, 2005), which is a manifold learning approach (Tharwat et al., 2017), might have a connection to LDA; especially because the names FDA and LDA are often used interchangeably in the literature. Actually, other names for FDA are Fisher LDA (FLDA) and even LDA.
We know that if we project (transform) the data of a class onto a p-dimensional subspace (p ≤ d) using a projection vector u ∈ R^d (here p = 1), i.e.:

x ↦ u^⊤ x,   (46)

for all data instances of the class, the mean and the covariance matrix of the class are transformed as:

µ ↦ u^⊤ µ,   (47)
Σ ↦ u^⊤ Σ u,   (48)

because of the characteristics of mean and variance.
The Fisher criterion (Xu & Lu, 2006) is the ratio of the between-class variance, σ_b^2, and the within-class variance, σ_w^2:

f := σ_b^2 / σ_w^2 = (u^⊤ µ_2 − u^⊤ µ_1)^2 / (u^⊤ Σ_2 u + u^⊤ Σ_1 u) = (u^⊤ (µ_2 − µ_1))^2 / (u^⊤ (Σ_2 + Σ_1) u).   (49)
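The Fisher criterion of Eq. (49) is easy to evaluate for any candidate direction. The small sketch below is ours (assuming NumPy); the class statistics are just example numbers (the first two Gaussians used later in Section 12), and it also shows that the criterion is invariant to rescaling u, which is why a unit-denominator constraint can be imposed in the optimization that follows.

import numpy as np

def fisher_criterion(u, mu1, mu2, S1, S2):
    # Eq. (49): between-class over within-class variance along direction u.
    between = (u @ (mu2 - mu1)) ** 2
    within = u @ (S1 + S2) @ u
    return between / within

mu1, mu2 = np.array([-4.0, 4.0]), np.array([3.0, -3.0])
S1 = np.array([[10.0, 1.0], [1.0, 5.0]])
S2 = np.array([[3.0, 0.0], [0.0, 4.0]])

u = np.array([1.0, -1.0])
print(fisher_criterion(u, mu1, mu2, S1, S2))
print(fisher_criterion(5.0 * u, mu1, mu2, S1, S2))  # same value: scale does not matter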

The FDA maximizes the Fisher criterion:

maximize_u  (u^⊤ (µ_2 − µ_1))^2 / (u^⊤ (Σ_2 + Σ_1) u),   (50)

which can be restated as:

maximize_u  (u^⊤ (µ_2 − µ_1))^2,
subject to  u^⊤ (Σ_2 + Σ_1) u = 1,   (51)

according to the Rayleigh-Ritz quotient method (Croot, 2005). The Lagrangian (Boyd & Vandenberghe, 2004) is:

L = (u^⊤ (µ_2 − µ_1))^2 − λ (u^⊤ (Σ_2 + Σ_1) u − 1),

where λ is the Lagrange multiplier. Equating the derivative of L to zero gives:

∂L/∂u = 2 (µ_2 − µ_1)(µ_2 − µ_1)^⊤ u − 2 λ (Σ_2 + Σ_1) u = 0
⟹ (µ_2 − µ_1)(µ_2 − µ_1)^⊤ u = λ (Σ_2 + Σ_1) u,

which is a generalized eigenvalue problem ((µ_2 − µ_1)(µ_2 − µ_1)^⊤, (Σ_2 + Σ_1)) according to (Ghojogh et al., 2019b). The projection vector is the leading eigenvector of (Σ_2 + Σ_1)^{-1} (µ_2 − µ_1)(µ_2 − µ_1)^⊤; therefore, since (µ_2 − µ_1)^⊤ u is a scalar, we can say:

u ∝ (Σ_2 + Σ_1)^{-1} (µ_2 − µ_1).

In LDA, the equality of covariance matrices is assumed. Thus, according to Eq. (18), we can say:

u ∝ (2 Σ)^{-1} (µ_2 − µ_1) ∝ Σ^{-1} (µ_2 − µ_1).   (52)

According to Eq. (46), we have:

u^⊤ x ∝ (µ_2 − µ_1)^⊤ Σ^{-1} x.   (53)

Comparing Eq. (53) with Eq. (23) shows that LDA and FDA use the same projection direction; they are equivalent up to a scaling factor and the constant term in Eq. (23), which does not depend on x (note that, before taking the logarithm to obtain Eq. (23), this term appears as a multiplicative factor). Hence, we can say:

LDA ≡ FDA.   (54)

In other words, FDA projects onto a subspace. On the other hand, according to Section 7, LDA can be seen as metric learning with a subspace where the Euclidean distance is used after projecting onto that subspace. The two subspaces of FDA and LDA are the same subspace. It should be noted that in manifold (subspace) learning, the scale does not matter because all the distances scale similarly.
Note that LDA assumes one (and not several) Gaussian for every class, and so does FDA. That is why FDA faces problems for multi-modal data (Sugiyama, 2007).
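Eqs. (52)-(54) can be verified numerically: under a shared covariance, the FDA direction (Σ_2 + Σ_1)^{-1}(µ_2 − µ_1) and the LDA boundary normal Σ^{-1}(µ_2 − µ_1) are parallel. A minimal sketch of our own, assuming NumPy and arbitrary example values:

import numpy as np

mu1, mu2 = np.array([-4.0, 4.0]), np.array([3.0, -3.0])
Sigma = np.array([[6.0, 1.5], [1.5, 4.0]])   # shared covariance, as in Eq. (18)

u_fda = np.linalg.inv(Sigma + Sigma) @ (mu2 - mu1)  # Eq. (52) before dropping the factor 2
u_lda = np.linalg.inv(Sigma) @ (mu2 - mu1)          # normal of the LDA boundary, Eq. (23)

cos = u_fda @ u_lda / (np.linalg.norm(u_fda) * np.linalg.norm(u_lda))
print(np.isclose(abs(cos), 1.0))  # True: the two directions coincide up to scale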

9. Relation to Logistic Regression

According to Eqs. (16) and (32), Gaussian and Bernoulli distributions are used for the likelihood (class conditional) and prior, respectively, in LDA and QDA. Thus, we are making assumptions for the likelihood and prior, although we finally work with the posterior in LDA and QDA according to Eq. (15). Logistic regression (Kleinbaum et al., 2002) asks why we should make assumptions on the likelihood and prior when we ultimately want to work with the posterior. Let us make an assumption directly for the posterior.
In logistic regression, first a linear function is applied to the data to obtain β^⊤ x′, where R^{d+1} ∋ x′ = [x^⊤, 1]^⊤ and β ∈ R^{d+1} includes the intercept. Then, the logistic function is used in order to have a value in the range (0, 1) to simulate a probability. Therefore, in logistic regression, the posterior is assumed to be:

P(C(x) | X = x) = ( exp(β^⊤ x′) / (1 + exp(β^⊤ x′)) )^{C(x)} ( 1 / (1 + exp(β^⊤ x′)) )^{1 − C(x)},   (55)

where C(x) ∈ {0, 1} for the two classes (so that the exponents select one of the two factors). Logistic regression considers the coefficient β as the parameter to be optimized and uses Newton's method (Boyd & Vandenberghe, 2004) for the optimization. Therefore, in summary, logistic regression makes an assumption on the posterior while LDA and QDA make assumptions on the likelihood and prior.

10. Relation to Bayes Optimal Classifier and Gaussian Naive Bayes

The Bayes classifier maximizes the posteriors of the classes (Murphy, 2012):

Ĉ(x) = arg max_k P(x ∈ C_k | X = x).   (56)

According to Eq. (14) and Bayes' rule, we have:

P(x ∈ C_k | X = x) ∝ P(X = x | x ∈ C_k) P(x ∈ C_k),   (57)

where P(x ∈ C_k) = π_k, and the denominator of the posterior (the marginal), which is:

P(X = x) = Σ_{r=1}^{|C|} P(X = x | x ∈ C_r) π_r,

is ignored because it does not depend on the classes C_1 to C_{|C|}.
According to Eq. (57), the posterior can be written in terms of the likelihood and prior; therefore, Eq. (56) can be restated as:

Ĉ(x) = arg max_k π_k P(X = x | x ∈ C_k).   (58)

Note that the Bayes classifier does not make any assumption on the posterior, prior, and likelihood, unlike LDA and QDA, which assume a uni-modal Gaussian distribution for the likelihood (and we may assume a Bernoulli distribution for the prior in LDA and QDA according to Eq. (32)). Therefore, we can say that the difference between Bayes and QDA is in the assumption of a uni-modal Gaussian distribution for the likelihood (class conditional); hence, if the likelihoods are already uni-modal Gaussian, the Bayes classifier reduces to QDA. Likewise, the difference between Bayes and LDA is in the assumption of a Gaussian distribution for the likelihood (class conditional) and the equality of the covariance matrices of classes; thus, if the likelihoods are already Gaussian and the covariance matrices are already equal, the Bayes classifier reduces to LDA.
It is noteworthy that the Bayes classifier is an optimal classifier because it can be seen as an ensemble of hypotheses (models) in the hypothesis (model) space, and no other ensemble of hypotheses can outperform it (see Chapter 6, page 175 in (Mitchell, 1997)). In the literature, it is referred to as the Bayes optimal classifier. To better formulate the explained statements, the Bayes optimal classifier estimates the class as:

Ĉ(x) = arg max_{C_k ∈ C} Σ_{h_j ∈ H} P(C_k | h_j) P(D | h_j) P(h_j),   (59)

where C := {C_1, ..., C_{|C|}}, D := {x_i}_{i=1}^n is the training set, h_j is a hypothesis for estimating the class of instances, and H is the hypothesis space including all possible hypotheses.
According to Bayes' rule, similar to what we had for Eq. (57), we have:

P(h_j | D) ∝ P(D | h_j) P(h_j).

Therefore, Eq. (59) becomes (Mitchell, 1997):

Ĉ(x) = arg max_{C_k ∈ C} Σ_{h_j ∈ H} P(C_k | h_j) P(h_j | D).   (60)

In conclusion, the Bayes classifier is optimal. Therefore, if the likelihoods of the classes are Gaussian, QDA is an optimal classifier, and if the likelihoods are Gaussian and the covariance matrices are equal, LDA is an optimal classifier. Often, the distributions in nature are Gaussian; especially, because of the central limit theorem (Hazewinkel, 2001), the summation of independent and identically distributed (iid) variables is Gaussian, and signals usually add in the real world. This explains why LDA and QDA are very effective classifiers in machine learning. We also saw that FDA is equivalent to LDA. Thus, the reason for the effectiveness of the powerful FDA classifier becomes clear. We have seen the very successful performance of FDA and LDA in different applications, such as face recognition (Belhumeur et al., 1997; Etemad & Chellappa, 1997; Zhao et al., 1999), action recognition (Ghojogh et al., 2017; Mokari et al., 2018), and EEG classification (Malekmohammadi et al., 2019).
Implementing the Bayes classifier is difficult in practice, so we approximate it by naive Bayes (Zhang, 2004). If x_j denotes the j-th dimension (feature) of x = [x_1, ..., x_d]^⊤, Eq. (58) is restated as:

Ĉ(x) = arg max_k π_k P(x_1, x_2, ..., x_d | x ∈ C_k).   (61)

The term P(x_1, x_2, ..., x_d | x ∈ C_k) is very difficult to compute as the features are possibly correlated. Naive Bayes relaxes this possibility and naively assumes that the features are conditionally independent when they are conditioned on the class:

P(x_1, x_2, ..., x_d | x ∈ C_k) ≈ P(x_1 | C_k) P(x_2 | C_k) ··· P(x_d | C_k) = Π_{j=1}^{d} P(x_j | C_k).

Therefore, Eq. (61) becomes:

Ĉ(x) = arg max_k π_k Π_{j=1}^{d} P(x_j | C_k).   (62)

In Gaussian naive Bayes, a univariate Gaussian distribution is assumed for the likelihood (class conditional) of every feature:

P(x_j | C_k) = (1 / √(2π σ_k^2)) exp( −(x_j − µ_k)^2 / (2 σ_k^2) ),   (63)

where the mean and unbiased variance are estimated as:

R ∋ µ̂_k = (1/n_k) Σ_{i=1}^{n} x_{i,j} I(C(x_i) = k),   (64)
R ∋ σ̂_k^2 = (1/(n_k − 1)) Σ_{i=1}^{n} (x_{i,j} − µ̂_k)^2 I(C(x_i) = k),   (65)

where x_{i,j} denotes the j-th feature of the i-th training instance. The prior can again be estimated using Eq. (32).
According to Eqs. (62) and (63), Gaussian naive Bayes is equivalent to QDA where the covariance matrices are diagonal, i.e., the off-diagonal entries of the covariance matrices are ignored. Therefore, we can say that QDA is more powerful than Gaussian naive Bayes because Gaussian naive Bayes is a simplified version of QDA. Moreover, it is obvious that Gaussian naive Bayes and QDA are equivalent for one-dimensional data. Compared to LDA, Gaussian naive Bayes is equivalent to LDA if the covariance matrices are diagonal and all equal, i.e., σ_1^2 = ··· = σ_{|C|}^2; therefore, LDA and Gaussian naive Bayes have their own assumptions, one on the off-diagonal of the covariance matrices and the other on the equality of the covariance matrices. As Gaussian naive Bayes has some level of optimality (Zhang, 2004), it becomes clear why LDA and QDA are such effective classifiers.
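A compact sketch of the Gaussian naive Bayes rule in Eqs. (62)-(65), of our own (assuming NumPy) and working in log-space to avoid numerical underflow:

import numpy as np

def fit_gnb(X, y):
    # Per-class prior (Eq. 32), per-feature means (Eq. 64) and variances (Eq. 65).
    params = {}
    for k in np.unique(y):
        Xk = X[y == k]
        params[k] = (len(Xk) / len(y), Xk.mean(axis=0), Xk.var(axis=0, ddof=1))
    return params

def predict_gnb(x, params):
    # Eq. (62) in log-space: argmax_k [log pi_k + sum_j log N(x_j; mu_kj, var_kj)].
    scores = {}
    for k, (pi_k, mu, var) in params.items():
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
        scores[k] = np.log(pi_k) + log_lik
    return max(scores, key=scores.get)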
11. Relation to Likelihood Ratio Test

Consider two hypotheses for estimating some parameter: a null hypothesis H_0 and an alternative hypothesis H_A. The probability P(reject H_0 | H_0) is called the type 1 error, false positive error, or false alarm error. The probability P(accept H_0 | H_A) is called the type 2 error or false negative error. The P(reject H_0 | H_0) is also called the significance level, while 1 − P(accept H_0 | H_A) = P(reject H_0 | H_A) is called the power.
If L(θ_A) and L(θ_0) are the likelihoods (probabilities) for the alternative and null hypotheses, the likelihood ratio is:

Λ = L(θ_A) / L(θ_0) = f(x; θ_A) / f(x; θ_0).   (66)

The Likelihood Ratio Test (LRT) (Casella & Berger, 2002) rejects H_0 in favor of H_A if the likelihood ratio is greater than a threshold, i.e., Λ ≥ t. The LRT is a very effective statistical test because, according to the Neyman-Pearson lemma (Neyman & Pearson, 1933), it has the largest power among all statistical tests with the same significance level.
If the sample size is large, n → ∞, and θ_A is estimated using MLE, the logarithm of the likelihood ratio asymptotically has a χ² distribution under the null hypothesis (White, 1984; Casella & Berger, 2002):

2 ln(Λ) ∼ χ²_{df}  under H_0,   (67)

where the degrees of freedom of the χ² distribution is df := dim(H_A) − dim(H_0) and dim(·) is the number of unspecified parameters in the hypothesis.
There is a connection between LDA or QDA and the LRT (Lachenbruch & Goldstein, 1979). Recall Eq. (12) or (15), which can be restated as:

f_2(x) π_2 / (f_1(x) π_1) = 1,   (68)

which holds on the decision boundary. Eq. (22) dealt with the difference of f_2(x) π_2 and f_1(x) π_1; here, however, we are dealing with their ratio. Recall Fig. 1: if we move x* to the right or left, the ratio f_2(x*) π_2 / (f_1(x*) π_1) changes (increasing toward the mean of the second class and decreasing toward the mean of the first), because the probabilities of the two classes change. In other words, moving x* changes the significance level and the power. Therefore, Eq. (68) can be used to form a statistical test where the posteriors are used in the ratio, as we also used posteriors in LDA and QDA. The null and alternative hypotheses can be considered to be the mean and covariance of the first and second class, respectively; in other words, the two hypotheses say that the point belongs to a specific class. Hence, if the ratio is larger than a value t, the instance x is estimated to belong to the second class; otherwise, the first class is chosen. According to Eq. (16), Eq. (68) becomes:

|Σ_2|^{-1/2} exp( −(1/2)(x − µ_2)^⊤ Σ_2^{-1} (x − µ_2) ) π_2 / ( |Σ_1|^{-1/2} exp( −(1/2)(x − µ_1)^⊤ Σ_1^{-1} (x − µ_1) ) π_1 ) ≥ t,   (69)

for QDA. In LDA, the covariance matrices are equal, so:

exp( −(1/2)(x − µ_2)^⊤ Σ^{-1} (x − µ_2) ) π_2 / ( exp( −(1/2)(x − µ_1)^⊤ Σ^{-1} (x − µ_1) ) π_1 ) ≥ t.   (70)

As can be seen, changing the priors impacts the ratio, as expected. Moreover, the value of t can be chosen according to the desired significance level in the χ² distribution using the χ² table. Eqs. (69) and (70) show the relation of LDA and QDA with the LRT. As the LRT has the largest power (Neyman & Pearson, 1933), the effectiveness of LDA and QDA in classification is also explained from a hypothesis testing point of view.
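The ratio test of Eqs. (68)-(70) can be sketched as follows (our own illustration, assuming NumPy/SciPy); here the threshold is derived from a χ² quantile, as the text suggests, with the significance level and degrees of freedom chosen only for illustration.

import numpy as np
from scipy.stats import chi2, multivariate_normal

def log_ratio(x, mu1, S1, pi1, mu2, S2, pi2):
    # Logarithm of the ratio in Eq. (69): (class 2) over (class 1).
    return (np.log(pi2) + multivariate_normal.logpdf(x, mu2, S2)
            - np.log(pi1) - multivariate_normal.logpdf(x, mu1, S1))

alpha, df = 0.05, 2                    # assumed significance level and degrees of freedom
t_log = 0.5 * chi2.ppf(1 - alpha, df)  # threshold on the log-ratio, via Eq. (67)

x = np.array([0.0, 0.0])
mu1, S1 = np.array([-4.0, 4.0]), np.array([[10.0, 1.0], [1.0, 5.0]])
mu2, S2 = np.array([3.0, -3.0]), np.array([[3.0, 0.0], [0.0, 4.0]])
label = 2 if log_ratio(x, mu1, S1, 0.5, mu2, S2, 0.5) >= t_log else 1
print(label)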

12. Simulations

In this section, we report some simulations which make the concepts of this tutorial clearer by illustration.

12.1. Experiments with Equal Class Sample Sizes

We created a synthetic dataset of three classes, each of which is a two-dimensional Gaussian distribution. The means and covariance matrices of the three Gaussians from which the class samples were randomly drawn are:

µ_1 = [−4, 4]^⊤,  µ_2 = [3, −3]^⊤,  µ_3 = [−3, 3]^⊤,
Σ_1 = [10, 1; 1, 5],  Σ_2 = [3, 0; 0, 4],  Σ_3 = [6, 1.5; 1.5, 4].
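Such a dataset can be drawn in a few lines; the sketch below is ours (assuming NumPy; the random seed is arbitrary) and generates the equal-sample-size case of Fig. 3-a.

import numpy as np

rng = np.random.default_rng(0)
means = [np.array([-4.0, 4.0]), np.array([3.0, -3.0]), np.array([-3.0, 3.0])]
covs = [np.array([[10.0, 1.0], [1.0, 5.0]]),
        np.array([[3.0, 0.0], [0.0, 4.0]]),
        np.array([[6.0, 1.5], [1.5, 4.0]])]
sizes = [200, 200, 200]  # equal class sample sizes

X = np.vstack([rng.multivariate_normal(m, S, n) for m, S, n in zip(means, covs, sizes)])
y = np.concatenate([np.full(n, k + 1) for k, n in enumerate(sizes)])  # labels 1, 2, 3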


Figure 3. The synthetic dataset: (a) three classes each with size 200, (b) two classes each with size 200, (c) three classes each with size 10, (d) two classes each with size 10, (e) three classes with sizes 200, 100, and 10, (f) two classes with sizes 200 and 10, and (g) two classes with sizes 400 and 200 where the larger class has two modes.

The three classes are shown in Fig. 3-a, where each has sample size 200. Experiments were performed on the three classes. We also performed experiments on two of the three classes to test binary classification. The two classes are shown in Fig. 3-b. The LDA, QDA, Gaussian naive Bayes, and Bayes classifications of the two and three classes are shown in Fig. 4.


Figure 4. Experiments with equal class sample sizes: (a) LDA for two classes, (b) QDA for two classes, (c) Gaussian naive Bayes for two classes, (d) Bayes for two classes, (e) LDA for three classes, (f) QDA for three classes, (g) Gaussian naive Bayes for three classes, and (h) Bayes for three classes.

For both binary and ternary classification with LDA and QDA, we used Eqs. (31) and (28), respectively, together with Eq. (29). We also estimated the means and covariances using Eqs. (33), (35), and (36). For Gaussian naive Bayes, we used Eqs. (62) and (63) and estimated the parameters using Eqs. (64) and (65). For the Bayes classifier, we used Eq. (58) with the Gaussian likelihoods, but we do not estimate the mean and variance; in order to use the exact likelihoods in Eq. (58), we use the exact means and covariance matrices of the distributions from which we sampled.


Figure 5. Experiments with small class sample sizes: (a) LDA for two classes, (b) QDA for two classes, (c) Gaussian naive Bayes for two classes, (d) Bayes for two classes, (e) LDA for three classes, (f) QDA for three classes, (g) Gaussian naive Bayes for three classes, and (h) Bayes for three classes.

We did, however, estimate the priors; the priors were estimated using Eq. (32) for all the classifiers.
As can be seen in Fig. 4, the space is partitioned into two/three parts, which validates the assertion that LDA and QDA can be considered as metric learning methods, as discussed in Section 7. As expected, the boundaries of LDA and QDA are linear and curved (quadratic), respectively. The results of QDA, Gaussian naive Bayes, and Bayes are very similar, although they have slight differences. This is because the classes are already Gaussian, so if the estimates of the means and covariance matrices are accurate enough, QDA and Bayes are equivalent. The classes are Gaussian and the off-diagonal elements of the covariance matrices are also small compared to the diagonal; therefore, naive Bayes also behaves similarly.


Figure 6. Experiments with different class sample sizes: (a) LDA for two classes, (b) QDA for two classes, (c) Gaussian naive Bayes for two classes, (d) Bayes for two classes, (e) LDA for three classes, (f) QDA for three classes, (g) Gaussian naive Bayes for three classes, and (h) Bayes for three classes.

12.2. Experiments with Small Class Sample Sizes

According to the Monte Carlo approximation (Robert & Casella, 2013), the estimates in Eqs. (33), (35), (64) and (65) are more accurate as the sample size goes to infinity, i.e., n → ∞. Therefore, if the sample size is small, we expect more difference between the QDA and Bayes classifiers. We made a synthetic dataset with three or two classes with the same means and covariance matrices mentioned before, where the sample size of every class was 10. Figures 3-c and 3-d show these datasets. The results of the LDA, QDA, Gaussian naive Bayes, and Bayes classifiers for this dataset are shown in Fig. 5. As can be seen, the results of QDA, Gaussian naive Bayes, and Bayes are now different, for the reason explained.


Figure 7. Experiments with multi-modal data: (a) LDA, (b) QDA, (c) Gaussian naive Bayes, and (d) Bayes.

12.3. Experiments with Different Class Sample Sizes

According to Eq. (32), used in Eqs. (28), (31), (58), and (62), the prior of a class changes with the sample size of the class. In order to see the effect of the sample size, we made a synthetic dataset with different class sizes, i.e., 200, 100, and 10, shown in Figs. 3-e and 3-f. We used the same means and covariance matrices mentioned before. The results are shown in Fig. 6. As can be seen, the class with the small sample size covers a small portion of the space in discrimination, which is expected because its prior is small according to Eq. (32); therefore, its posterior is small. On the other hand, the class with the large sample size covers a larger portion because of its larger prior.

12.4. Experiments with Multi-Modal Data

As mentioned in Section 8, LDA and QDA assume a uni-modal Gaussian distribution for every class, and thus FDA or LDA faces problems for multi-modal data (Sugiyama, 2007). For testing this, we made a synthetic dataset with two classes, one with sample size 400 having two modes of Gaussians and the other with sample size 200 having one mode. We again used the same means and covariance matrices mentioned before. The dataset is shown in Fig. 3-g.
The results of the LDA, QDA, Gaussian naive Bayes, and Bayes classifiers for this dataset are shown in Fig. 7. The mean and covariance matrix of the larger class, although it has two modes, were estimated using Eqs. (33), (35), (64) and (65) in LDA, QDA, and Gaussian naive Bayes. However, for the likelihood used in the Bayes classifier, i.e., in Eq. (58), we need to know the exact multi-modal distribution. Therefore, we fit a mixture of two Gaussians (Ghojogh et al., 2019a) to the data of the larger class:

P(X = x | x ∈ C_k) = Σ_{m=1}^{2} w_m f(x; µ_m, Σ_m),   (71)

where f(x; µ_m, Σ_m) is Eq. (16), and the fitted parameters were:

µ_1 = [−3.88, 4]^⊤,  µ_2 = [3.04, −2.92]^⊤,
Σ_1 = [9.27, 0.79; 0.79, 4.82],  Σ_2 = [2.87, 0.03; 0.03, 3.78],
w_1 = 0.49,  w_2 = 0.502.

As Fig. 7 shows, LDA has not performed well enough, as expected. The performance of QDA is more acceptable than that of LDA but still not good enough, because QDA also assumes a uni-modal Gaussian for every class. The result of Gaussian naive Bayes is very different from Bayes here because Gaussian naive Bayes assumes a uni-modal Gaussian with diagonal covariance for every class. Finally, Bayes has the best result, as it takes into account the multi-modality of the data and it is optimal (Mitchell, 1997).

13. Conclusion and Future Work

This paper was a tutorial paper for LDA and QDA as two fundamental classification methods. We explained the relations of these two methods with some other methods in machine learning, manifold (subspace) learning, metric learning, statistics, and statistical testing. Some simulations were also provided for better clarification.
This paper focused on LDA and QDA, which are discriminators with one and two degrees of freedom, respectively. As future work, we will work on a tutorial paper for non-linear discriminant analysis using kernels (Baudat & Anouar, 2000; Li et al., 2003; Lu et al., 2003), which is called kernel discriminant analysis, to have discriminators with more than two degrees of freedom.

Acknowledgment

The authors hugely thank Prof. Ali Ghodsi (see his great online related courses (Ghodsi, 2015; 2017)), Prof. Mu Zhu, Prof. Hoda Mohammadzade, and other professors whose courses have partly covered the materials mentioned in this tutorial paper.

References

Baudat, Gaston and Anouar, Fatiha. Generalized discriminant analysis using a kernel approach. Neural Computation, 12(10):2385–2404, 2000.

Belhumeur, Peter N, Hespanha, João P, and Kriegman, David J. Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. IEEE Transactions on Pattern Analysis & Machine Intelligence, (7):711–720, 1997.

Boyd, Stephen and Vandenberghe, Lieven. Convex optimization. Cambridge University Press, 2004.

Casella, George and Berger, Roger L. Statistical inference, volume 2. Duxbury Pacific Grove, CA, 2002.

Cox, Trevor F and Cox, Michael AA. Multidimensional scaling. Chapman and Hall/CRC, 2000.

Croot, Ernie. The Rayleigh principle for finding eigenvalues. Technical report, Georgia Institute of Technology, School of Mathematics, 2005. Online: http://people.math.gatech.edu/~ecroot/notes_linear.pdf, Accessed: March 2019.

De Maesschalck, Roy, Jouan-Rimbaud, Delphine, and Massart, Désiré L. The Mahalanobis distance. Chemometrics and Intelligent Laboratory Systems, 50(1):1–18, 2000.

Etemad, Kamran and Chellappa, Rama. Discriminant analysis for recognition of human face images. Journal of the Optical Society of America A, 14(8):1724–1733, 1997.

Fisher, Ronald A. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2):179–188, 1936.

Friedman, Jerome, Hastie, Trevor, and Tibshirani, Robert. The elements of statistical learning, volume 2. Springer Series in Statistics, New York, NY, USA, 2009.

Ghodsi, Ali. Classification course, Department of Statistics and Actuarial Science, University of Waterloo. Online YouTube videos, 2015. Accessed: January 2019.

Ghodsi, Ali. Data visualization course, Department of Statistics and Actuarial Science, University of Waterloo. Online YouTube videos, 2017. Accessed: January 2019.

Ghojogh, Benyamin, Mohammadzade, Hoda, and Mokari, Mozhgan. Fisherposes for human action recognition using Kinect sensor data. IEEE Sensors Journal, 18(4):1612–1627, 2017.

Ghojogh, Benyamin, Ghojogh, Aydin, Crowley, Mark, and Karray, Fakhri. Fitting a mixture distribution to data: Tutorial. arXiv preprint arXiv:1901.06708, 2019a.

Ghojogh, Benyamin, Karray, Fakhri, and Crowley, Mark. Eigenvalue and generalized eigenvalue problems: Tutorial. arXiv preprint arXiv:1903.11240, 2019b.

Ham, Ji Hun, Lee, Daniel D, Mika, Sebastian, and Schölkopf, Bernhard. A kernel view of the dimensionality reduction of manifolds. In International Conference on Machine Learning, 2004.

Hazewinkel, Michiel. Central limit theorem. Encyclopedia of Mathematics, Springer, 2001.

Jolliffe, Ian. Principal component analysis. Springer, 2011.

Kleinbaum, David G, Dietz, K, Gail, M, Klein, Mitchel, and Klein, Mitchell. Logistic regression. Springer, 2002.

Kulis, Brian. Metric learning: A survey. Foundations and Trends in Machine Learning, 5(4):287–364, 2013.

Lachenbruch, Peter A and Goldstein, M. Discriminant analysis. Biometrics, pp. 69–85, 1979.

Li, Yongmin, Gong, Shaogang, and Liddell, Heather. Recognising trajectories of facial identities using kernel discriminant analysis. Image and Vision Computing, 21(13-14):1077–1086, 2003.

Lu, Juwei, Plataniotis, Konstantinos N, and Venetsanopoulos, Anastasios N. Face recognition using kernel direct discriminant analysis algorithms. IEEE Transactions on Neural Networks, 14(1):117–126, 2003.

Malekmohammadi, Alireza, Mohammadzade, Hoda, Chamanzar, Alireza, Shabany, Mahdi, and Ghojogh, Benyamin. An efficient hardware implementation for a motor imagery brain computer interface system. Scientia Iranica, 26:72–94, 2019.

McLachlan, Goeffrey J. Mahalanobis distance. Resonance, 4(6):20–26, 1999.

Mitchell, Thomas. Machine learning. McGraw Hill Higher Education, 1997.

Mokari, Mozhgan, Mohammadzade, Hoda, and Ghojogh, Benyamin. Recognizing involuntary actions from 3D skeleton data using body states. Scientia Iranica, 2018.

Murphy, Kevin P. Machine learning: a probabilistic perspective. MIT Press, 2012.

Neyman, Jerzy and Pearson, Egon Sharpe. IX. On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, 231(694-706):289–337, 1933.

Robert, Christian and Casella, George. Monte Carlo statistical methods. Springer Science & Business Media, 2013.

Strange, Harry and Zwiggelaar, Reyer. Open Problems in Spectral Dimensionality Reduction. Springer, 2014.

Sugiyama, Masashi. Dimensionality reduction of multimodal labeled data by local Fisher discriminant analysis. Journal of Machine Learning Research, 8(May):1027–1061, 2007.

Tharwat, Alaa, Gaber, Tarek, Ibrahim, Abdelhameed, and Hassanien, Aboul Ella. Linear discriminant analysis: A detailed tutorial. AI Communications, 30(2):169–190, 2017.

Welling, Max. Fisher linear discriminant analysis. Technical report, University of Toronto, Toronto, Ontario, Canada, 2005.

White, Halbert. Asymptotic theory for econometricians. Academic Press, 1984.

Xu, Yong and Lu, Guangming. Analysis on Fisher discriminant criterion and linear separability of feature space. In 2006 International Conference on Computational Intelligence and Security, volume 2, pp. 1671–1676. IEEE, 2006.

Yang, Liu and Jin, Rong. Distance metric learning: A comprehensive survey. Technical report, Department of Computer Science and Engineering, Michigan State University, 2006.

Zhang, Harry. The optimality of naive Bayes. In American Association for Artificial Intelligence (AAAI), 2004.

Zhao, Wenyi, Chellappa, Rama, and Phillips, P Jonathon. Subspace linear discriminant analysis for face recognition. Citeseer, 1999.