
Linear and Quadratic Discriminant Analysis: Tutorial

Benyamin Ghojogh [email protected] Department of Electrical and Computer Engineering, Machine Learning Laboratory, University of Waterloo, Waterloo, ON, Canada

Mark Crowley [email protected] Department of Electrical and Computer Engineering, Machine Learning Laboratory, University of Waterloo, Waterloo, ON, Canada

Abstract

This tutorial explains Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) as two fundamental classification methods in statistical and probabilistic learning. We start with the optimization of the decision boundary on which the posteriors are equal. Then, LDA and QDA are derived for binary and multiple classes. The estimation of parameters in LDA and QDA is also covered. Then, we explain how LDA and QDA are related to metric learning, kernel principal component analysis, Mahalanobis distance, logistic regression, the Bayes optimal classifier, Gaussian naive Bayes, and the likelihood ratio test. We also prove that LDA and Fisher discriminant analysis are equivalent. We finally clarify some of the theoretical concepts with the simulations we provide.

1. Introduction

Assume we have a dataset of instances {(x_i, y_i)}_{i=1}^n with sample size n and dimensionality x_i ∈ R^d and y_i ∈ R. The y_i's are the class labels. We would like to classify the space of data using these instances. Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) (Friedman et al., 2009) are two well-known supervised classification methods in statistical and probabilistic learning. This paper is a tutorial for these two classifiers where the theory for binary and multi-class classification is detailed. Then, the relations of LDA and QDA to metric learning, kernel Principal Component Analysis (PCA), Fisher Discriminant Analysis (FDA), logistic regression, the Bayes optimal classifier, Gaussian naive Bayes, and the Likelihood Ratio Test (LRT) are explained for a better understanding of these two fundamental methods. Finally, some experiments on synthetic datasets are reported and analyzed for illustration.

2. Optimization for the Boundary of Classes

First suppose the data is one dimensional, x ∈ R. Assume we have two classes with the Cumulative Distribution Functions (CDF) F_1(x) and F_2(x), respectively. Let the Probability Density Functions (PDF) of these CDFs be:

f_1(x) = ∂F_1(x) / ∂x,   (1)
f_2(x) = ∂F_2(x) / ∂x,   (2)

respectively.
We assume that the two classes have normal (Gaussian) distributions, which is the most common and default distribution in real-world applications. The mean of one of the two classes is greater than the other; we assume µ_1 < µ_2. An instance x ∈ R belongs to one of these two classes:

x ∼ N(µ_1, σ_1^2)  if x ∈ C_1,
x ∼ N(µ_2, σ_2^2)  if x ∈ C_2,   (3)

where C_1 and C_2 denote the first and second class, respectively.
For an instance x, we may have an error in estimating the class it belongs to. At a point, which we denote by x*, the probabilities of the two classes are equal; therefore, the point x* is on the boundary of the two classes. As we have µ_1 < µ_2, we can say µ_1 < x* < µ_2, as shown in Fig. 1. Therefore, the instance x belongs to the first class if x < x* and to the second class if x > x*. Hence, estimating an instance with x < x* to be in the second class, or one with x > x* to be in the first class, is an error in estimating the class. This probability of error can be stated as:

P(error) = P(x > x*, x ∈ C_1) + P(x < x*, x ∈ C_2).   (4)

Figure 1. Two Gaussian density functions; they are equal at the point x*.

As we have P(A, B) = P(A | B) P(B), we can say:

P(error) = P(x > x* | x ∈ C_1) P(x ∈ C_1) + P(x < x* | x ∈ C_2) P(x ∈ C_2),   (5)

which we want to minimize:

minimize_{x*}  P(error),   (6)

by finding the best boundary of classes, i.e., x*.
According to the definition of the CDF, we have:

P(x < c, x ∈ C_1) = F_1(c)
⟹ P(x > x*, x ∈ C_1) = 1 − F_1(x*),   (7)
P(x < x*, x ∈ C_2) = F_2(x*).   (8)

According to the definition of the PDF, we have:

P(x ∈ C_1) = f_1(x) = π_1,   (9)
P(x ∈ C_2) = f_2(x) = π_2,   (10)

where we denote the priors f_1(x) and f_2(x) by π_1 and π_2, respectively.
Hence, Eqs. (5) and (6) become:

minimize_{x*}  (1 − F_1(x*)) π_1 + F_2(x*) π_2.   (11)

We take the derivative and set it to zero for the sake of minimization:

∂P(error)/∂x* = −f_1(x*) π_1 + f_2(x*) π_2 = 0
⟹ f_1(x*) π_1 = f_2(x*) π_2.   (12)

Another way to obtain this expression is to equate the posterior probabilities, which gives the equation of the boundary of classes:

P(x ∈ C_1 | X = x) = P(x ∈ C_2 | X = x).   (13)

According to Bayes' rule, the posterior is:

P(x ∈ C_1 | X = x) = P(X = x | x ∈ C_1) P(x ∈ C_1) / P(X = x) = f_1(x) π_1 / Σ_{k=1}^{|C|} P(X = x | x ∈ C_k) π_k,   (14)

where |C| is the number of classes, which is two here. The f_1(x) and π_1 are the likelihood (class conditional) and prior probabilities, respectively, and the denominator is the marginal probability. Therefore, Eq. (13) becomes:

f_1(x) π_1 / Σ_{i=1}^{|C|} P(X = x | x ∈ C_i) π_i = f_2(x) π_2 / Σ_{i=1}^{|C|} P(X = x | x ∈ C_i) π_i
⟹ f_1(x) π_1 = f_2(x) π_2.   (15)

Now let us think of the data as multivariate, with dimensionality d. The PDF of the multivariate Gaussian distribution x ∼ N(µ, Σ) is:

f(x) = (1 / √((2π)^d |Σ|)) exp( −(1/2) (x − µ)^⊤ Σ^{-1} (x − µ) ),   (16)

where x ∈ R^d, µ ∈ R^d is the mean, Σ ∈ R^{d×d} is the covariance matrix, and |·| denotes the determinant of a matrix. The π ≈ 3.14 in this equation should not be confused with the π_k (prior) in Eq. (12) or (15). Therefore, Eq. (12) or (15) becomes:

(1 / √((2π)^d |Σ_1|)) exp( −(1/2) (x − µ_1)^⊤ Σ_1^{-1} (x − µ_1) ) π_1 = (1 / √((2π)^d |Σ_2|)) exp( −(1/2) (x − µ_2)^⊤ Σ_2^{-1} (x − µ_2) ) π_2,   (17)

where the distributions of the first and second class are N(µ_1, Σ_1) and N(µ_2, Σ_2), respectively.
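As a quick illustration of Eq. (12), the boundary x* of two univariate Gaussian classes can be located numerically as the root of f_1(x) π_1 − f_2(x) π_2 between the two means. The following is a minimal sketch of our own (not part of the original tutorial), assuming NumPy/SciPy are available; the example parameters are arbitrary.

import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

# Two univariate Gaussian classes (arbitrary example parameters).
mu1, sd1, pi1 = 0.0, 1.0, 0.6
mu2, sd2, pi2 = 4.0, 1.5, 0.4

# Eq. (12): the boundary x* satisfies f1(x*) pi1 - f2(x*) pi2 = 0.
def g(x):
    return pi1 * norm.pdf(x, mu1, sd1) - pi2 * norm.pdf(x, mu2, sd2)

# Since mu1 < x* < mu2, a sign change is bracketed between the two means.
x_star = brentq(g, mu1, mu2)
print("decision boundary x* =", x_star)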

3. Linear Discriminant Analysis for Binary Classification

In Linear Discriminant Analysis (LDA), we assume that the two classes have equal covariance matrices:

Σ_1 = Σ_2 = Σ.   (18)

Therefore, Eq. (17) becomes:

(1/√((2π)^d |Σ|)) exp( −(1/2)(x − µ_1)^⊤ Σ^{-1} (x − µ_1) ) π_1 = (1/√((2π)^d |Σ|)) exp( −(1/2)(x − µ_2)^⊤ Σ^{-1} (x − µ_2) ) π_2
⟹ exp( −(1/2)(x − µ_1)^⊤ Σ^{-1} (x − µ_1) ) π_1 = exp( −(1/2)(x − µ_2)^⊤ Σ^{-1} (x − µ_2) ) π_2
⟹(a) −(1/2)(x − µ_1)^⊤ Σ^{-1} (x − µ_1) + ln(π_1) = −(1/2)(x − µ_2)^⊤ Σ^{-1} (x − µ_2) + ln(π_2),

where (a) takes the natural logarithm of both sides of the equation. We can expand the quadratic term as:

(x − µ_1)^⊤ Σ^{-1} (x − µ_1) = (x^⊤ − µ_1^⊤) Σ^{-1} (x − µ_1) = x^⊤ Σ^{-1} x − x^⊤ Σ^{-1} µ_1 − µ_1^⊤ Σ^{-1} x + µ_1^⊤ Σ^{-1} µ_1 =(a) x^⊤ Σ^{-1} x + µ_1^⊤ Σ^{-1} µ_1 − 2 µ_1^⊤ Σ^{-1} x,   (19)

where (a) is because x^⊤ Σ^{-1} µ_1 = µ_1^⊤ Σ^{-1} x, as it is a scalar and Σ^{-1} is symmetric, so Σ^{-⊤} = Σ^{-1}. Thus, we have:

−(1/2) x^⊤ Σ^{-1} x − (1/2) µ_1^⊤ Σ^{-1} µ_1 + µ_1^⊤ Σ^{-1} x + ln(π_1) = −(1/2) x^⊤ Σ^{-1} x − (1/2) µ_2^⊤ Σ^{-1} µ_2 + µ_2^⊤ Σ^{-1} x + ln(π_2).

Therefore, if we multiply both sides of the equation by 2 and rearrange, we have:

2 (Σ^{-1} (µ_2 − µ_1))^⊤ x + (µ_1 − µ_2)^⊤ Σ^{-1} (µ_1 + µ_2) + 2 ln(π_2 / π_1) = 0,   (20)

which is the equation of a line in the form a^⊤ x + b = 0. Therefore, if we consider Gaussian distributions for the two classes where the covariance matrices are assumed to be equal, the decision boundary of classification is a line. Because of the linearity of the decision boundary which discriminates the two classes, this method is named linear discriminant analysis.
For obtaining Eq. (20), we brought the expressions to the side which corresponded to the second class; therefore, if we use δ(x): R^d → R as the left-hand-side expression (function) in Eq. (20):

δ(x) := 2 (Σ^{-1} (µ_2 − µ_1))^⊤ x + (µ_1 − µ_2)^⊤ Σ^{-1} (µ_1 + µ_2) + 2 ln(π_2 / π_1),   (21)

the class of an instance x is estimated as:

Ĉ(x) = 1  if δ(x) < 0,   Ĉ(x) = 2  if δ(x) > 0.   (22)

If the priors of the two classes are equal, i.e., π_1 = π_2, Eq. (20) becomes:

2 (Σ^{-1} (µ_2 − µ_1))^⊤ x + (µ_1 − µ_2)^⊤ Σ^{-1} (µ_1 + µ_2) = 0,   (23)

whose left-hand-side expression can be considered as δ(x) in Eq. (22).

4. Quadratic Discriminant Analysis for Binary Classification

In Quadratic Discriminant Analysis (QDA), we relax the assumption of equality of the covariance matrices:

Σ_1 ≠ Σ_2,   (24)

which means the covariances are not necessarily equal (if they happen to be equal, the decision boundary will be linear and QDA reduces to LDA).
Therefore, Eq. (17) becomes:

(1/√((2π)^d |Σ_1|)) exp( −(1/2)(x − µ_1)^⊤ Σ_1^{-1} (x − µ_1) ) π_1 = (1/√((2π)^d |Σ_2|)) exp( −(1/2)(x − µ_2)^⊤ Σ_2^{-1} (x − µ_2) ) π_2
⟹(a) −(d/2) ln(2π) − (1/2) ln(|Σ_1|) − (1/2)(x − µ_1)^⊤ Σ_1^{-1} (x − µ_1) + ln(π_1) = −(d/2) ln(2π) − (1/2) ln(|Σ_2|) − (1/2)(x − µ_2)^⊤ Σ_2^{-1} (x − µ_2) + ln(π_2),

where (a) takes the natural logarithm of both sides of the equation. According to Eq. (19), we have:

−(1/2) ln(|Σ_1|) − (1/2) x^⊤ Σ_1^{-1} x − (1/2) µ_1^⊤ Σ_1^{-1} µ_1 + µ_1^⊤ Σ_1^{-1} x + ln(π_1) = −(1/2) ln(|Σ_2|) − (1/2) x^⊤ Σ_2^{-1} x − (1/2) µ_2^⊤ Σ_2^{-1} µ_2 + µ_2^⊤ Σ_2^{-1} x + ln(π_2).

Therefore, if we multiply both sides of the equation by 2 and rearrange, we have:

x^⊤ (Σ_1^{-1} − Σ_2^{-1}) x + 2 (Σ_2^{-1} µ_2 − Σ_1^{-1} µ_1)^⊤ x + (µ_1^⊤ Σ_1^{-1} µ_1 − µ_2^⊤ Σ_2^{-1} µ_2) + ln(|Σ_1| / |Σ_2|) + 2 ln(π_2 / π_1) = 0,   (25)

which is in the form x^⊤ A x + b^⊤ x + c = 0. Therefore, if we consider Gaussian distributions for the two classes, the decision boundary of classification is quadratic. Because of the quadratic decision boundary which discriminates the two classes, this method is named quadratic discriminant analysis.
For obtaining Eq. (25), we brought the expressions to the side which corresponded to the second class; therefore, if we use δ(x): R^d → R as the left-hand-side expression (function) in Eq. (25):

δ(x) := x^⊤ (Σ_1^{-1} − Σ_2^{-1}) x + 2 (Σ_2^{-1} µ_2 − Σ_1^{-1} µ_1)^⊤ x + (µ_1^⊤ Σ_1^{-1} µ_1 − µ_2^⊤ Σ_2^{-1} µ_2) + ln(|Σ_1| / |Σ_2|) + 2 ln(π_2 / π_1),   (26)

the class of an instance x is estimated as in Eq. (22).
If the priors of the two classes are equal, i.e., π_1 = π_2, Eq. (25) becomes:

x^⊤ (Σ_1^{-1} − Σ_2^{-1}) x + 2 (Σ_2^{-1} µ_2 − Σ_1^{-1} µ_1)^⊤ x + (µ_1^⊤ Σ_1^{-1} µ_1 − µ_2^⊤ Σ_2^{-1} µ_2) + ln(|Σ_1| / |Σ_2|) = 0,   (27)

whose left-hand-side expression can be considered as δ(x) in Eq. (22).

5. LDA and QDA for Multi-class Classification

Now we consider multiple classes, which can be more than two, indexed by k ∈ {1, ..., |C|}. Recall Eq. (12) or (15), where we use the scaled posterior, i.e., f_k(x) π_k. According to Eq. (16), we have:

f_k(x) π_k = (1/√((2π)^d |Σ_k|)) exp( −(1/2)(x − µ_k)^⊤ Σ_k^{-1} (x − µ_k) ) π_k.

Taking the natural logarithm gives:

ln(f_k(x) π_k) = −(d/2) ln(2π) − (1/2) ln(|Σ_k|) − (1/2)(x − µ_k)^⊤ Σ_k^{-1} (x − µ_k) + ln(π_k).

We drop the constant term −(d/2) ln(2π), which is the same for all classes (note that this term appears as a multiplicative factor before taking the logarithm). Thus, the scaled posterior of the k-th class becomes:

δ_k(x) := −(1/2) ln(|Σ_k|) − (1/2)(x − µ_k)^⊤ Σ_k^{-1} (x − µ_k) + ln(π_k).   (28)

In QDA, the class of the instance x is estimated as:

Ĉ(x) = arg max_k δ_k(x),   (29)

because it maximizes the posterior of that class. In this expression, δ_k(x) is Eq. (28).
In LDA, we assume that the covariance matrices of the k classes are equal:

Σ_1 = ··· = Σ_{|C|} = Σ.   (30)

Therefore, Eq. (28) becomes:

δ_k(x) = −(1/2) ln(|Σ|) − (1/2)(x − µ_k)^⊤ Σ^{-1} (x − µ_k) + ln(π_k) = −(1/2) ln(|Σ|) − (1/2) x^⊤ Σ^{-1} x − (1/2) µ_k^⊤ Σ^{-1} µ_k + µ_k^⊤ Σ^{-1} x + ln(π_k).

We drop the constant terms −(1/2) ln(|Σ|) and −(1/2) x^⊤ Σ^{-1} x, which are the same for all classes (note that before taking the logarithm, these terms appear as multiplicative and exponential factors, respectively). Thus, the scaled posterior of the k-th class becomes:

δ_k(x) := µ_k^⊤ Σ^{-1} x − (1/2) µ_k^⊤ Σ^{-1} µ_k + ln(π_k).   (31)

In LDA, the class of the instance x is determined by Eq. (29), where δ_k(x) is Eq. (31), because it maximizes the posterior of that class.
In conclusion, QDA and LDA deal with maximizing the posterior of classes but work with the likelihoods (class conditionals) and priors.
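To make the multi-class rules concrete, the following sketch of our own (assuming NumPy; the function and variable names are not from the paper) evaluates the QDA discriminant of Eq. (28) and the LDA discriminant of Eq. (31) and applies Eq. (29).

import numpy as np

def delta_qda(x, mu_k, Sigma_k, pi_k):
    # Eq. (28): scaled log-posterior of class k in QDA.
    diff = x - mu_k
    _, logdet = np.linalg.slogdet(Sigma_k)
    return -0.5 * logdet - 0.5 * diff @ np.linalg.inv(Sigma_k) @ diff + np.log(pi_k)

def delta_lda(x, mu_k, Sigma, pi_k):
    # Eq. (31): scaled log-posterior of class k in LDA (shared covariance).
    Sigma_inv = np.linalg.inv(Sigma)
    return mu_k @ Sigma_inv @ x - 0.5 * mu_k @ Sigma_inv @ mu_k + np.log(pi_k)

def classify(x, mus, Sigmas, pis):
    # Eq. (29): pick the class with the largest discriminant (labels are 1-based).
    deltas = [delta_qda(x, m, S, p) for m, S, p in zip(mus, Sigmas, pis)]
    return int(np.argmax(deltas)) + 1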

6. Estimation of Parameters in LDA and QDA

In LDA and QDA, we have several parameters which are required in order to calculate the posteriors. These parameters are the means and the covariance matrices of the classes and the priors of the classes.
The priors of the classes are somewhat tricky to calculate. It is a chicken-and-egg problem because we want to know the class probabilities (priors) to estimate the class of an instance, but we do not have the priors and should estimate them. Usually, the prior of the k-th class is estimated according to the sample size of the k-th class:

π̂_k = n_k / n,   (32)

where n_k and n are the number of training instances in the k-th class and in total, respectively. This estimation considers a Bernoulli distribution for choosing every instance out of the overall training set to be in the k-th class.
The mean of the k-th class can be estimated using Maximum Likelihood Estimation (MLE), or the Method of Moments (MOM), for the mean of a Gaussian distribution:

R^d ∋ µ̂_k = (1/n_k) Σ_{i=1}^{n} x_i I(C(x_i) = k),   (33)

where I(·) is the indicator function, which is one if its condition is satisfied and zero otherwise.
In QDA, the covariance matrix of the k-th class is estimated using MLE:

R^{d×d} ∋ Σ̂_k = (1/n_k) Σ_{i=1}^{n} (x_i − µ̂_k)(x_i − µ̂_k)^⊤ I(C(x_i) = k),   (34)

or we can use the unbiased estimate of the covariance matrix:

R^{d×d} ∋ Σ̂_k = (1/(n_k − 1)) Σ_{i=1}^{n} (x_i − µ̂_k)(x_i − µ̂_k)^⊤ I(C(x_i) = k).   (35)

In LDA, we assume that the covariance matrices of the classes are equal; therefore, we use the weighted average of the estimated covariance matrices as the common covariance matrix in LDA:

R^{d×d} ∋ Σ̂ = Σ_{k=1}^{|C|} n_k Σ̂_k / Σ_{r=1}^{|C|} n_r = (1/n) Σ_{k=1}^{|C|} n_k Σ̂_k,   (36)

where the weights are the cardinalities of the classes.

Figure 2. The QDA and LDA where the covariance matrices are the identity matrix. For equal priors, QDA and LDA reduce to simple classification using the Euclidean distance from the means of classes. Changing the priors modifies the location of the decision boundary, where even one point can be classified differently for different priors.
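A compact sketch of Eqs. (32)-(36), of our own and assuming NumPy, with X an (n, d) data matrix and y a vector of integer class labels (these names are ours, not the paper's):

import numpy as np

def estimate_parameters(X, y):
    # Returns priors (Eq. 32), class means (Eq. 33), unbiased class
    # covariances (Eq. 35), and the pooled LDA covariance (Eq. 36).
    n = len(y)
    priors, means, covs = {}, {}, {}
    pooled = np.zeros((X.shape[1], X.shape[1]))
    for k in np.unique(y):
        Xk = X[y == k]
        n_k = len(Xk)
        priors[k] = n_k / n
        means[k] = Xk.mean(axis=0)
        covs[k] = np.cov(Xk, rowvar=False, ddof=1)
        pooled += n_k * covs[k]
    return priors, means, covs, pooled / n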

7. LDA and QDA are Metric Learning!

Recall Eq. (28), which is the scaled posterior for QDA. First, assume that the covariance matrices are all equal (as we have in LDA) and that they are all the identity matrix:

Σ_1 = ··· = Σ_{|C|} = I,   (37)

which means that all the classes are assumed to be spherically distributed in the d-dimensional space. After this assumption, Eq. (28) becomes:

δ_k(x) = −(1/2)(x − µ_k)^⊤ (x − µ_k) + ln(π_k),   (38)

because |I| = 1, ln(1) = 0, and I^{-1} = I. If we assume that the priors are all equal, the term ln(π_k) is constant and can be dropped:

δ_k(x) = −(1/2)(x − µ_k)^⊤ (x − µ_k) = −(1/2) d_k^2,   (39)

where d_k is the Euclidean distance from the mean of the k-th class:

d_k = ||x − µ_k||_2 = √((x − µ_k)^⊤ (x − µ_k)).   (40)

Thus, QDA and LDA reduce to simple Euclidean distance from the means of classes if the covariance matrices are all the identity matrix and the priors are equal. Simple classification using the Euclidean distance from the means of classes is one of the simplest classification methods, where the metric used is the Euclidean distance.
Eq. (39) carries a very interesting message. We know that in metric Multi-Dimensional Scaling (MDS) (Cox & Cox, 2000) and kernel Principal Component Analysis (PCA), we have (see (Ham et al., 2004) and Chapter 2 in (Strange & Zwiggelaar, 2014)):

K = −(1/2) H D H,   (41)

where D ∈ R^{n×n} is the distance matrix whose elements are the distances between the data instances, K ∈ R^{n×n} is the kernel matrix over the data instances, R^{n×n} ∋ H := I − (1/n) 1 1^⊤ is the centering matrix, and R^n ∋ 1 := [1, 1, ..., 1]^⊤. If the elements of the distance matrix D are obtained using the Euclidean distance, MDS is equivalent to Principal Component Analysis (PCA) (Jolliffe, 2011).
Comparing Eqs. (39) and (41) shows an interesting connection between the posterior of a class in QDA and the kernel over the data instances of the class. In this comparison, Eq. (41) should be considered for a class and not the entire data, so K ∈ R^{n_k×n_k}, D ∈ R^{n_k×n_k}, and H ∈ R^{n_k×n_k}.
Now, consider the case where the covariance matrices are still all the identity matrix but the priors are not equal. In this case, we have Eq. (38). If we take an exponential (the inverse of the logarithm) of this expression, the π_k becomes a scale factor (weight). This means that we are still using a distance metric to measure the distance of an instance from the means of classes, but we are scaling the distances by the priors of classes. If a class occurs more frequently, i.e., its prior is larger, it must have a larger posterior, so we reduce the distance from the mean of its class. In other words, we move the decision boundary according to the priors of classes (see Fig. 2).

As the next step, consider a more general case where the covariance matrices are not equal, as we have in QDA. We apply Singular Value Decomposition (SVD) to the covariance matrix of the k-th class:

Σ_k = U_k Λ_k U_k^⊤,

where the left and right matrices of singular vectors are equal because the covariance matrix is symmetric. Therefore:

Σ_k^{-1} = U_k Λ_k^{-1} U_k^⊤,

where U_k^{-1} = U_k^⊤ because it is an orthogonal matrix. Therefore, we can simplify the following term:

(x − µ_k)^⊤ Σ_k^{-1} (x − µ_k) = (x − µ_k)^⊤ U_k Λ_k^{-1} U_k^⊤ (x − µ_k) = (U_k^⊤ x − U_k^⊤ µ_k)^⊤ Λ_k^{-1} (U_k^⊤ x − U_k^⊤ µ_k).

As Λ_k is a diagonal matrix with non-negative elements (because it comes from a covariance matrix), we can decompose it as:

Λ_k^{-1} = Λ_k^{-1/2} Λ_k^{-1/2}.

Therefore:

(U_k^⊤ x − U_k^⊤ µ_k)^⊤ Λ_k^{-1} (U_k^⊤ x − U_k^⊤ µ_k) = (Λ_k^{-1/2} U_k^⊤ x − Λ_k^{-1/2} U_k^⊤ µ_k)^⊤ (Λ_k^{-1/2} U_k^⊤ x − Λ_k^{-1/2} U_k^⊤ µ_k),

where we used Λ_k^{-⊤/2} = Λ_k^{-1/2} because it is diagonal. We define the following transformation:

φ_k : x ↦ Λ_k^{-1/2} U_k^⊤ x,   (42)

which also results in the transformation of the mean: φ_k : µ ↦ Λ_k^{-1/2} U_k^⊤ µ. Therefore, Eq. (28) can be restated as:

δ_k(x) = −(1/2) ln(|Σ_k|) − (1/2) (φ_k(x) − φ_k(µ_k))^⊤ (φ_k(x) − φ_k(µ_k)) + ln(π_k).   (43)

Ignoring the terms −(1/2) ln(|Σ_k|) and ln(π_k), we can see that the transformation has changed the covariance matrix of the class to the identity matrix. Therefore, QDA (and also LDA) can be seen as a simple comparison of distances from the means of classes after applying a transformation to the data of every class. In other words, we are learning the metric using the SVD of the covariance matrix of every class. Thus, LDA and QDA can be seen as metric learning (Yang & Jin, 2006; Kulis, 2013) from this perspective. Note that in metric learning, a valid distance metric is defined as (Yang & Jin, 2006):

d_A^2(x, µ_k) := ||x − µ_k||_A^2 = (x − µ_k)^⊤ A (x − µ_k),   (44)

where A is a positive semi-definite matrix, i.e., A ⪰ 0. In QDA, we are also using (x − µ_k)^⊤ Σ_k^{-1} (x − µ_k). The covariance matrix is positive semi-definite according to the characteristics of a covariance matrix. Moreover, the inverse of a (full-rank, i.e., positive definite) covariance matrix is also positive (semi-)definite, so Σ_k^{-1} ⪰ 0. Therefore, QDA is using metric learning (and, as will be discussed in the next section, it can be seen as a manifold learning method, too).
It is also noteworthy that QDA and LDA can be seen as using the Mahalanobis distance (McLachlan, 1999; De Maesschalck et al., 2000), which is also a metric:

d_M^2(x, µ) := ||x − µ||_M^2 = (x − µ)^⊤ Σ^{-1} (x − µ),   (45)

where Σ is the covariance matrix of the cloud of data whose mean is µ. The intuition of the Mahalanobis distance is that, if we have several data clouds (e.g., classes), the distance from the class with larger variance should be scaled down because that class takes up more of the space, so it is more probable to happen. The scaling down shows up in the inverse of the covariance matrix. Comparing (x − µ_k)^⊤ Σ_k^{-1} (x − µ_k) in QDA or LDA with Eq. (45) shows that QDA and LDA are, in a sense, using the Mahalanobis distance.
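The whitening view of Eqs. (42) and (43) can be checked numerically. The sketch below is ours (assuming NumPy; the covariance values are arbitrary) and verifies that the quadratic form with Σ_k^{-1} equals a plain squared Euclidean distance after the per-class transformation φ_k.

import numpy as np

# Arbitrary symmetric positive definite covariance and mean for a class k.
Sigma_k = np.array([[6.0, 1.5], [1.5, 4.0]])
mu_k = np.array([3.0, -3.0])
x = np.array([1.0, 2.0])

# Eigendecomposition of the symmetric covariance (its SVD has equal left/right bases).
lam, U = np.linalg.eigh(Sigma_k)

def phi(v):
    # Eq. (42): v -> Lambda^{-1/2} U^T v
    return np.diag(lam ** -0.5) @ U.T @ v

quad_direct = (x - mu_k) @ np.linalg.inv(Sigma_k) @ (x - mu_k)
quad_whitened = np.sum((phi(x) - phi(mu_k)) ** 2)  # Euclidean distance after phi_k
print(np.isclose(quad_direct, quad_whitened))      # True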

8. LDA ≡ FDA

In the previous section, we saw that LDA and QDA can be seen as metric learning. We know that metric learning can be seen as a family of manifold learning methods. We briefly explain the reason for this assertion: as A ⪰ 0, we can write A = U U^⊤. Therefore, Eq. (44) becomes:

||x − µ_k||_A^2 = (x − µ_k)^⊤ U U^⊤ (x − µ_k) = (U^⊤ x − U^⊤ µ_k)^⊤ (U^⊤ x − U^⊤ µ_k),

which means that metric learning can be seen as a comparison of simple Euclidean distances after the transformation φ : x ↦ U^⊤ x, which is a projection onto a subspace with projection matrix U. Thus, metric learning is a manifold learning approach. This gives a hint that Fisher Discriminant Analysis (FDA) (Fisher, 1936; Welling, 2005), which is a manifold learning approach (Tharwat et al., 2017), might have a connection to LDA; especially because the names FDA and LDA are often used interchangeably in the literature. Actually, other names for FDA are Fisher LDA (FLDA) and even LDA.
We know that if we project (transform) the data of a class onto a p-dimensional subspace (p ≤ d) using a projection vector u ∈ R^d (here p = 1), i.e.:

x ↦ u^⊤ x,   (46)

for all data instances of the class, the mean and the covariance matrix of the class are transformed as:

µ ↦ u^⊤ µ,   (47)
Σ ↦ u^⊤ Σ u,   (48)

because of the characteristics of mean and variance.
The Fisher criterion (Xu & Lu, 2006) is the ratio of the between-class variance, σ_b^2, and the within-class variance, σ_w^2:

f := σ_b^2 / σ_w^2 = (u^⊤ µ_2 − u^⊤ µ_1)^2 / (u^⊤ Σ_2 u + u^⊤ Σ_1 u) = (u^⊤ (µ_2 − µ_1))^2 / (u^⊤ (Σ_2 + Σ_1) u).   (49)
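The Fisher criterion of Eq. (49) is easy to evaluate for any candidate direction. The small sketch below is ours (assuming NumPy); the class statistics are just example numbers (the first two Gaussians used later in Section 12), and it also shows that the criterion is invariant to rescaling u, which is why a unit-denominator constraint can be imposed in the optimization that follows.

import numpy as np

def fisher_criterion(u, mu1, mu2, S1, S2):
    # Eq. (49): between-class over within-class variance along direction u.
    between = (u @ (mu2 - mu1)) ** 2
    within = u @ (S1 + S2) @ u
    return between / within

mu1, mu2 = np.array([-4.0, 4.0]), np.array([3.0, -3.0])
S1 = np.array([[10.0, 1.0], [1.0, 5.0]])
S2 = np.array([[3.0, 0.0], [0.0, 4.0]])

u = np.array([1.0, -1.0])
print(fisher_criterion(u, mu1, mu2, S1, S2))
print(fisher_criterion(5.0 * u, mu1, mu2, S1, S2))  # same value: scale does not matter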

The FDA maximizes the Fisher criterion:

maximize_u  (u^⊤ (µ_2 − µ_1))^2 / (u^⊤ (Σ_2 + Σ_1) u),   (50)

which can be restated as:

maximize_u  (u^⊤ (µ_2 − µ_1))^2,
subject to  u^⊤ (Σ_2 + Σ_1) u = 1,   (51)

according to the Rayleigh-Ritz quotient method (Croot, 2005). The Lagrangian (Boyd & Vandenberghe, 2004) is:

L = (u^⊤ (µ_2 − µ_1))^2 − λ (u^⊤ (Σ_2 + Σ_1) u − 1),

where λ is the Lagrange multiplier. Equating the derivative of L to zero gives:

∂L/∂u = 2 (µ_2 − µ_1)(µ_2 − µ_1)^⊤ u − 2 λ (Σ_2 + Σ_1) u = 0
⟹ (µ_2 − µ_1)(µ_2 − µ_1)^⊤ u = λ (Σ_2 + Σ_1) u,

which is a generalized eigenvalue problem ((µ_2 − µ_1)(µ_2 − µ_1)^⊤, (Σ_2 + Σ_1)) according to (Ghojogh et al., 2019b). The projection vector is the leading eigenvector of (Σ_2 + Σ_1)^{-1} (µ_2 − µ_1)(µ_2 − µ_1)^⊤; therefore, since (µ_2 − µ_1)^⊤ u is a scalar, we can say:

u ∝ (Σ_2 + Σ_1)^{-1} (µ_2 − µ_1).

In LDA, the equality of covariance matrices is assumed. Thus, according to Eq. (18), we can say:

u ∝ (2 Σ)^{-1} (µ_2 − µ_1) ∝ Σ^{-1} (µ_2 − µ_1).   (52)

According to Eq. (46), we have:

u^⊤ x ∝ (µ_2 − µ_1)^⊤ Σ^{-1} x.   (53)

Comparing Eq. (53) with Eq. (23) shows that LDA and FDA use the same projection direction; they are equivalent up to a scaling factor and the constant term in Eq. (23), which does not depend on x (note that, before taking the logarithm to obtain Eq. (23), this term appears as a multiplicative factor). Hence, we can say:

LDA ≡ FDA.   (54)

In other words, FDA projects onto a subspace. On the other hand, according to Section 7, LDA can be seen as metric learning with a subspace where the Euclidean distance is used after projecting onto that subspace. The two subspaces of FDA and LDA are the same subspace. It should be noted that in manifold (subspace) learning, the scale does not matter because all the distances scale similarly.
Note that LDA assumes one (and not several) Gaussian for every class, and so does FDA. That is why FDA faces problems for multi-modal data (Sugiyama, 2007).
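Eqs. (52)-(54) can be verified numerically: under a shared covariance, the FDA direction (Σ_2 + Σ_1)^{-1}(µ_2 − µ_1) and the LDA boundary normal Σ^{-1}(µ_2 − µ_1) are parallel. A minimal sketch of our own, assuming NumPy and arbitrary example values:

import numpy as np

mu1, mu2 = np.array([-4.0, 4.0]), np.array([3.0, -3.0])
Sigma = np.array([[6.0, 1.5], [1.5, 4.0]])   # shared covariance, as in Eq. (18)

u_fda = np.linalg.inv(Sigma + Sigma) @ (mu2 - mu1)  # Eq. (52) before dropping the factor 2
u_lda = np.linalg.inv(Sigma) @ (mu2 - mu1)          # normal of the LDA boundary, Eq. (23)

cos = u_fda @ u_lda / (np.linalg.norm(u_fda) * np.linalg.norm(u_lda))
print(np.isclose(abs(cos), 1.0))  # True: the two directions coincide up to scale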

9. Relation to Logistic Regression

According to Eqs. (16) and (32), Gaussian and Bernoulli distributions are used for the likelihood (class conditional) and prior, respectively, in LDA and QDA. Thus, we are making assumptions for the likelihood and prior, although we finally work with the posterior in LDA and QDA according to Eq. (15). Logistic regression (Kleinbaum et al., 2002) asks why we should make assumptions on the likelihood and prior when we ultimately want to work with the posterior. Let us make an assumption directly for the posterior.
In logistic regression, first a linear function is applied to the data to obtain β^⊤ x′, where R^{d+1} ∋ x′ = [x^⊤, 1]^⊤ and β ∈ R^{d+1} includes the intercept. Then, the logistic function is used in order to have a value in the range (0, 1) to simulate a probability. Therefore, in logistic regression, the posterior is assumed to be:

P(C(x) | X = x) = ( exp(β^⊤ x′) / (1 + exp(β^⊤ x′)) )^{C(x)} ( 1 / (1 + exp(β^⊤ x′)) )^{1 − C(x)},   (55)

where C(x) ∈ {0, 1} for the two classes (so that the exponents select one of the two factors). Logistic regression considers the coefficient β as the parameter to be optimized and uses Newton's method (Boyd & Vandenberghe, 2004) for the optimization. Therefore, in summary, logistic regression makes an assumption on the posterior while LDA and QDA make assumptions on the likelihood and prior.

10. Relation to Bayes Optimal Classifier and Gaussian Naive Bayes

The Bayes classifier maximizes the posteriors of the classes (Murphy, 2012):

Ĉ(x) = arg max_k P(x ∈ C_k | X = x).   (56)

According to Eq. (14) and Bayes' rule, we have:

P(x ∈ C_k | X = x) ∝ P(X = x | x ∈ C_k) P(x ∈ C_k),   (57)

where P(x ∈ C_k) = π_k, and the denominator of the posterior (the marginal), which is:

P(X = x) = Σ_{r=1}^{|C|} P(X = x | x ∈ C_r) π_r,

is ignored because it does not depend on the classes C_1 to C_{|C|}.
According to Eq. (57), the posterior can be written in terms of the likelihood and prior; therefore, Eq. (56) can be restated as:

Ĉ(x) = arg max_k π_k P(X = x | x ∈ C_k).   (58)

Note that the Bayes classifier does not make any assumption on the posterior, prior, and likelihood, unlike LDA and QDA, which assume a uni-modal Gaussian distribution for the likelihood (and we may assume a Bernoulli distribution for the prior in LDA and QDA according to Eq. (32)). Therefore, we can say that the difference between Bayes and QDA is in the assumption of a uni-modal Gaussian distribution for the likelihood (class conditional); hence, if the likelihoods are already uni-modal Gaussian, the Bayes classifier reduces to QDA. Likewise, the difference between Bayes and LDA is in the assumption of a Gaussian distribution for the likelihood (class conditional) and the equality of the covariance matrices of classes; thus, if the likelihoods are already Gaussian and the covariance matrices are already equal, the Bayes classifier reduces to LDA.
It is noteworthy that the Bayes classifier is an optimal classifier because it can be seen as an ensemble of hypotheses (models) in the hypothesis (model) space, and no other ensemble of hypotheses can outperform it (see Chapter 6, page 175 in (Mitchell, 1997)). In the literature, it is referred to as the Bayes optimal classifier. To better formulate the explained statements, the Bayes optimal classifier estimates the class as:

Ĉ(x) = arg max_{C_k ∈ C} Σ_{h_j ∈ H} P(C_k | h_j) P(D | h_j) P(h_j),   (59)

where C := {C_1, ..., C_{|C|}}, D := {x_i}_{i=1}^n is the training set, h_j is a hypothesis for estimating the class of instances, and H is the hypothesis space including all possible hypotheses.
According to Bayes' rule, similar to what we had for Eq. (57), we have:

P(h_j | D) ∝ P(D | h_j) P(h_j).

Therefore, Eq. (59) becomes (Mitchell, 1997):

Ĉ(x) = arg max_{C_k ∈ C} Σ_{h_j ∈ H} P(C_k | h_j) P(h_j | D).   (60)

In conclusion, the Bayes classifier is optimal. Therefore, if the likelihoods of the classes are Gaussian, QDA is an optimal classifier, and if the likelihoods are Gaussian and the covariance matrices are equal, LDA is an optimal classifier. Often, the distributions in nature are Gaussian; especially, because of the central limit theorem (Hazewinkel, 2001), the summation of independent and identically distributed (iid) variables is Gaussian, and signals usually add in the real world. This explains why LDA and QDA are very effective classifiers in machine learning. We also saw that FDA is equivalent to LDA. Thus, the reason for the effectiveness of the powerful FDA classifier becomes clear. We have seen the very successful performance of FDA and LDA in different applications, such as face recognition (Belhumeur et al., 1997; Etemad & Chellappa, 1997; Zhao et al., 1999), action recognition (Ghojogh et al., 2017; Mokari et al., 2018), and EEG classification (Malekmohammadi et al., 2019).
Implementing the Bayes classifier is difficult in practice, so we approximate it by naive Bayes (Zhang, 2004). If x_j denotes the j-th dimension (feature) of x = [x_1, ..., x_d]^⊤, Eq. (58) is restated as:

Ĉ(x) = arg max_k π_k P(x_1, x_2, ..., x_d | x ∈ C_k).   (61)

The term P(x_1, x_2, ..., x_d | x ∈ C_k) is very difficult to compute as the features are possibly correlated. Naive Bayes relaxes this possibility and naively assumes that the features are conditionally independent when they are conditioned on the class:

P(x_1, x_2, ..., x_d | x ∈ C_k) ≈ P(x_1 | C_k) P(x_2 | C_k) ··· P(x_d | C_k) = Π_{j=1}^{d} P(x_j | C_k).

Therefore, Eq. (61) becomes:

Ĉ(x) = arg max_k π_k Π_{j=1}^{d} P(x_j | C_k).   (62)

In Gaussian naive Bayes, a univariate Gaussian distribution is assumed for the likelihood (class conditional) of every feature:

P(x_j | C_k) = (1 / √(2π σ_k^2)) exp( −(x_j − µ_k)^2 / (2 σ_k^2) ),   (63)

where the mean and unbiased variance are estimated as:

R ∋ µ̂_k = (1/n_k) Σ_{i=1}^{n} x_{i,j} I(C(x_i) = k),   (64)
R ∋ σ̂_k^2 = (1/(n_k − 1)) Σ_{i=1}^{n} (x_{i,j} − µ̂_k)^2 I(C(x_i) = k),   (65)

where x_{i,j} denotes the j-th feature of the i-th training instance. The prior can again be estimated using Eq. (32).
According to Eqs. (62) and (63), Gaussian naive Bayes is equivalent to QDA where the covariance matrices are diagonal, i.e., the off-diagonal entries of the covariance matrices are ignored. Therefore, we can say that QDA is more powerful than Gaussian naive Bayes because Gaussian naive Bayes is a simplified version of QDA. Moreover, it is obvious that Gaussian naive Bayes and QDA are equivalent for one-dimensional data. Compared to LDA, Gaussian naive Bayes is equivalent to LDA if the covariance matrices are diagonal and all equal, i.e., σ_1^2 = ··· = σ_{|C|}^2; therefore, LDA and Gaussian naive Bayes have their own assumptions, one on the off-diagonal of the covariance matrices and the other on the equality of the covariance matrices. As Gaussian naive Bayes has some level of optimality (Zhang, 2004), it becomes clear why LDA and QDA are such effective classifiers.
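A compact sketch of the Gaussian naive Bayes rule in Eqs. (62)-(65), of our own (assuming NumPy) and working in log-space to avoid numerical underflow:

import numpy as np

def fit_gnb(X, y):
    # Per-class prior (Eq. 32), per-feature means (Eq. 64) and variances (Eq. 65).
    params = {}
    for k in np.unique(y):
        Xk = X[y == k]
        params[k] = (len(Xk) / len(y), Xk.mean(axis=0), Xk.var(axis=0, ddof=1))
    return params

def predict_gnb(x, params):
    # Eq. (62) in log-space: argmax_k [log pi_k + sum_j log N(x_j; mu_kj, var_kj)].
    scores = {}
    for k, (pi_k, mu, var) in params.items():
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
        scores[k] = np.log(pi_k) + log_lik
    return max(scores, key=scores.get)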
11. Relation to Likelihood Ratio Test

Consider two hypotheses for estimating some parameter: a null hypothesis H_0 and an alternative hypothesis H_A. The probability P(reject H_0 | H_0) is called the type 1 error, false positive error, or false alarm error. The probability P(accept H_0 | H_A) is called the type 2 error or false negative error. The P(reject H_0 | H_0) is also called the significance level, while 1 − P(accept H_0 | H_A) = P(reject H_0 | H_A) is called the power.
If L(θ_A) and L(θ_0) are the likelihoods (probabilities) for the alternative and null hypotheses, the likelihood ratio is:

Λ = L(θ_A) / L(θ_0) = f(x; θ_A) / f(x; θ_0).   (66)

The Likelihood Ratio Test (LRT) (Casella & Berger, 2002) rejects H_0 in favor of H_A if the likelihood ratio is greater than a threshold, i.e., Λ ≥ t. The LRT is a very effective statistical test because, according to the Neyman-Pearson lemma (Neyman & Pearson, 1933), it has the largest power among all statistical tests with the same significance level.
If the sample size is large, n → ∞, and θ_A is estimated using MLE, the logarithm of the likelihood ratio asymptotically has a χ² distribution under the null hypothesis (White, 1984; Casella & Berger, 2002):

2 ln(Λ) ∼ χ²_{df}  under H_0,   (67)

where the degrees of freedom of the χ² distribution is df := dim(H_A) − dim(H_0) and dim(·) is the number of unspecified parameters in the hypothesis.
There is a connection between LDA or QDA and the LRT (Lachenbruch & Goldstein, 1979). Recall Eq. (12) or (15), which can be restated as:

f_2(x) π_2 / (f_1(x) π_1) = 1,   (68)

which holds on the decision boundary. Eq. (22) dealt with the difference of f_2(x) π_2 and f_1(x) π_1; here, however, we are dealing with their ratio. Recall Fig. 1: if we move x* to the right or left, the ratio f_2(x*) π_2 / (f_1(x*) π_1) changes (increasing toward the mean of the second class and decreasing toward the mean of the first), because the probabilities of the two classes change. In other words, moving x* changes the significance level and the power. Therefore, Eq. (68) can be used to form a statistical test where the posteriors are used in the ratio, as we also used posteriors in LDA and QDA. The null and alternative hypotheses can be considered to be the mean and covariance of the first and second class, respectively; in other words, the two hypotheses say that the point belongs to a specific class. Hence, if the ratio is larger than a value t, the instance x is estimated to belong to the second class; otherwise, the first class is chosen. According to Eq. (16), Eq. (68) becomes:

|Σ_2|^{-1/2} exp( −(1/2)(x − µ_2)^⊤ Σ_2^{-1} (x − µ_2) ) π_2 / ( |Σ_1|^{-1/2} exp( −(1/2)(x − µ_1)^⊤ Σ_1^{-1} (x − µ_1) ) π_1 ) ≥ t,   (69)

for QDA. In LDA, the covariance matrices are equal, so:

exp( −(1/2)(x − µ_2)^⊤ Σ^{-1} (x − µ_2) ) π_2 / ( exp( −(1/2)(x − µ_1)^⊤ Σ^{-1} (x − µ_1) ) π_1 ) ≥ t.   (70)

As can be seen, changing the priors impacts the ratio, as expected. Moreover, the value of t can be chosen according to the desired significance level in the χ² distribution using the χ² table. Eqs. (69) and (70) show the relation of LDA and QDA with the LRT. As the LRT has the largest power (Neyman & Pearson, 1933), the effectiveness of LDA and QDA in classification is also explained from a hypothesis testing point of view.
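The ratio test of Eqs. (68)-(70) can be sketched as follows (our own illustration, assuming NumPy/SciPy); here the threshold is derived from a χ² quantile, as the text suggests, with the significance level and degrees of freedom chosen only for illustration.

import numpy as np
from scipy.stats import chi2, multivariate_normal

def log_ratio(x, mu1, S1, pi1, mu2, S2, pi2):
    # Logarithm of the ratio in Eq. (69): (class 2) over (class 1).
    return (np.log(pi2) + multivariate_normal.logpdf(x, mu2, S2)
            - np.log(pi1) - multivariate_normal.logpdf(x, mu1, S1))

alpha, df = 0.05, 2                    # assumed significance level and degrees of freedom
t_log = 0.5 * chi2.ppf(1 - alpha, df)  # threshold on the log-ratio, via Eq. (67)

x = np.array([0.0, 0.0])
mu1, S1 = np.array([-4.0, 4.0]), np.array([[10.0, 1.0], [1.0, 5.0]])
mu2, S2 = np.array([3.0, -3.0]), np.array([[3.0, 0.0], [0.0, 4.0]])
label = 2 if log_ratio(x, mu1, S1, 0.5, mu2, S2, 0.5) >= t_log else 1
print(label)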

12. Simulations

In this section, we report some simulations which make the concepts of this tutorial clearer by illustration.

12.1. Experiments with Equal Class Sample Sizes

We created a synthetic dataset of three classes, each of which is a two-dimensional Gaussian distribution. The means and covariance matrices of the three Gaussians from which the class samples were randomly drawn are:

µ_1 = [−4, 4]^⊤,  µ_2 = [3, −3]^⊤,  µ_3 = [−3, 3]^⊤,
Σ_1 = [10, 1; 1, 5],  Σ_2 = [3, 0; 0, 4],  Σ_3 = [6, 1.5; 1.5, 4].
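Such a dataset can be drawn in a few lines; the sketch below is ours (assuming NumPy; the random seed is arbitrary) and generates the equal-sample-size case of Fig. 3-a.

import numpy as np

rng = np.random.default_rng(0)
means = [np.array([-4.0, 4.0]), np.array([3.0, -3.0]), np.array([-3.0, 3.0])]
covs = [np.array([[10.0, 1.0], [1.0, 5.0]]),
        np.array([[3.0, 0.0], [0.0, 4.0]]),
        np.array([[6.0, 1.5], [1.5, 4.0]])]
sizes = [200, 200, 200]  # equal class sample sizes

X = np.vstack([rng.multivariate_normal(m, S, n) for m, S, n in zip(means, covs, sizes)])
y = np.concatenate([np.full(n, k + 1) for k, n in enumerate(sizes)])  # labels 1, 2, 3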


Figure 3. The synthetic dataset: (a) three classes each with size 200, (b) two classes each with size 200, (c) three classes each with size 10, (d) two classes each with size 10, (e) three classes with sizes 200, 100, and 10, (f) two classes with sizes 200 and 10, and (g) two classes with sizes 400 and 200 where the larger class has two modes.

The three classes are shown in Fig. 3-a, where each has sample size 200. Experiments were performed on the three classes. We also performed experiments on two of the three classes to test binary classification. The two classes are shown in Fig. 3-b. The LDA, QDA, Gaussian naive Bayes, and Bayes classifications of the two and three classes are shown in Fig. 4.


Figure 4. Experiments with equal class sample sizes: (a) LDA for two classes, (b) QDA for two classes, (c) Gaussian naive Bayes for two classes, (d) Bayes for two classes, (e) LDA for three classes, (f) QDA for three classes, (g) Gaussian naive Bayes for three classes, and (h) Bayes for three classes.

For both binary and ternary classification with LDA and QDA, we used Eqs. (31) and (28), respectively, together with Eq. (29). We also estimated the means and covariances using Eqs. (33), (35), and (36). For Gaussian naive Bayes, we used Eqs. (62) and (63) and estimated the parameters using Eqs. (64) and (65). For the Bayes classifier, we used Eq. (58) with the Gaussian likelihoods, but we do not estimate the mean and variance; in order to use the exact likelihoods in Eq. (58), we use the exact means and covariance matrices of the distributions from which we sampled.


Figure 5. Experiments with small class sample sizes: (a) LDA for two classes, (b) QDA for two classes, (c) Gaussian naive Bayes for two classes, (d) Bayes for two classes, (e) LDA for three classes, (f) QDA for three classes, (g) Gaussian naive Bayes for three classes, and (h) Bayes for three classes.

We did, however, estimate the priors; the priors were estimated using Eq. (32) for all the classifiers.
As can be seen in Fig. 4, the space is partitioned into two/three parts, which validates the assertion that LDA and QDA can be considered as metric learning methods, as discussed in Section 7. As expected, the boundaries of LDA and QDA are linear and curved (quadratic), respectively. The results of QDA, Gaussian naive Bayes, and Bayes are very similar, although they have slight differences. This is because the classes are already Gaussian, so if the estimates of the means and covariance matrices are accurate enough, QDA and Bayes are equivalent. The classes are Gaussian and the off-diagonal elements of the covariance matrices are also small compared to the diagonal; therefore, naive Bayes also behaves similarly.


Figure 6. Experiments with different class sample sizes: (a) LDA for two classes, (b) QDA for two classes, (c) Gaussian naive Bayes for two classes, (d) Bayes for two classes, (e) LDA for three classes, (f) QDA for three classes, (g) Gaussian naive Bayes for three classes, and (h) Bayes for three classes.

12.2. Experiments with Small Class Sample Sizes

According to the Monte Carlo approximation (Robert & Casella, 2013), the estimates in Eqs. (33), (35), (64) and (65) are more accurate as the sample size goes to infinity, i.e., n → ∞. Therefore, if the sample size is small, we expect more difference between the QDA and Bayes classifiers. We made a synthetic dataset with three or two classes with the same means and covariance matrices mentioned before, where the sample size of every class was 10. Figures 3-c and 3-d show these datasets. The results of the LDA, QDA, Gaussian naive Bayes, and Bayes classifiers for this dataset are shown in Fig. 5. As can be seen, the results of QDA, Gaussian naive Bayes, and Bayes are now different, for the reason explained.


Figure 7. Experiments with multi-modal data: (a) LDA, (b) QDA, (c) Gaussian naive Bayes, and (d) Bayes.

12.3. Experiments with Different Class Sample Sizes

According to Eq. (32), used in Eqs. (28), (31), (58), and (62), the prior of a class changes with the sample size of the class. In order to see the effect of the sample size, we made a synthetic dataset with different class sizes, i.e., 200, 100, and 10, shown in Figs. 3-e and 3-f. We used the same means and covariance matrices mentioned before. The results are shown in Fig. 6. As can be seen, the class with the small sample size covers a small portion of the space in discrimination, which is expected because its prior is small according to Eq. (32); therefore, its posterior is small. On the other hand, the class with the large sample size covers a larger portion because of its larger prior.

12.4. Experiments with Multi-Modal Data

As mentioned in Section 8, LDA and QDA assume a uni-modal Gaussian distribution for every class, and thus FDA or LDA faces problems for multi-modal data (Sugiyama, 2007). For testing this, we made a synthetic dataset with two classes, one with sample size 400 having two modes of Gaussians and the other with sample size 200 having one mode. We again used the same means and covariance matrices mentioned before. The dataset is shown in Fig. 3-g.
The results of the LDA, QDA, Gaussian naive Bayes, and Bayes classifiers for this dataset are shown in Fig. 7. The mean and covariance matrix of the larger class, although it has two modes, were estimated using Eqs. (33), (35), (64) and (65) in LDA, QDA, and Gaussian naive Bayes. However, for the likelihood used in the Bayes classifier, i.e., in Eq. (58), we need to know the exact multi-modal distribution. Therefore, we fit a mixture of two Gaussians (Ghojogh et al., 2019a) to the data of the larger class:

P(X = x | x ∈ C_k) = Σ_{m=1}^{2} w_m f(x; µ_m, Σ_m),   (71)

where f(x; µ_m, Σ_m) is Eq. (16), and the fitted parameters were:

µ_1 = [−3.88, 4]^⊤,  µ_2 = [3.04, −2.92]^⊤,
Σ_1 = [9.27, 0.79; 0.79, 4.82],  Σ_2 = [2.87, 0.03; 0.03, 3.78],
w_1 = 0.49,  w_2 = 0.502.

As Fig. 7 shows, LDA has not performed well enough, as expected. The performance of QDA is more acceptable than that of LDA but still not good enough, because QDA also assumes a uni-modal Gaussian for every class. The result of Gaussian naive Bayes is very different from Bayes here because Gaussian naive Bayes assumes a uni-modal Gaussian with diagonal covariance for every class. Finally, Bayes has the best result, as it takes into account the multi-modality of the data and it is optimal (Mitchell, 1997).

13. Conclusion and Future Work

This paper was a tutorial paper for LDA and QDA as two fundamental classification methods. We explained the relations of these two methods with some other methods in machine learning, manifold (subspace) learning, metric learning, statistics, and statistical testing. Some simulations were also provided for better clarification.
This paper focused on LDA and QDA, which are discriminators with one and two degrees of freedom, respectively. As future work, we will work on a tutorial paper for non-linear discriminant analysis using kernels (Baudat & Anouar, 2000; Li et al., 2003; Lu et al., 2003), which is called kernel discriminant analysis, to have discriminators with more than two degrees of freedom.

Acknowledgment

The authors hugely thank Prof. Ali Ghodsi (see his great online related courses (Ghodsi, 2015; 2017)), Prof. Mu Zhu, Prof. Hoda Mohammadzade, and other professors whose courses have partly covered the materials mentioned in this tutorial paper.

References

Baudat, Gaston and Anouar, Fatiha. Generalized discriminant analysis using a kernel approach. Neural Computation, 12(10):2385–2404, 2000.

Belhumeur, Peter N, Hespanha, João P, and Kriegman, David J. Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. IEEE Transactions on Pattern Analysis & Machine Intelligence, (7):711–720, 1997.

Boyd, Stephen and Vandenberghe, Lieven. Convex optimization. Cambridge University Press, 2004.

Casella, George and Berger, Roger L. Statistical inference, volume 2. Duxbury Pacific Grove, CA, 2002.

Cox, Trevor F and Cox, Michael AA. Multidimensional scaling. Chapman and Hall/CRC, 2000.

Croot, Ernie. The Rayleigh principle for finding eigenvalues. Technical report, Georgia Institute of Technology, School of Mathematics, 2005. Online: http://people.math.gatech.edu/~ecroot/notes_linear.pdf, Accessed: March 2019.

De Maesschalck, Roy, Jouan-Rimbaud, Delphine, and Massart, Désiré L. The Mahalanobis distance. Chemometrics and Intelligent Laboratory Systems, 50(1):1–18, 2000.

Etemad, Kamran and Chellappa, Rama. Discriminant analysis for recognition of human face images. Journal of the Optical Society of America A, 14(8):1724–1733, 1997.

Fisher, Ronald A. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2):179–188, 1936.

Friedman, Jerome, Hastie, Trevor, and Tibshirani, Robert. The elements of statistical learning, volume 2. Springer Series in Statistics, New York, NY, USA, 2009.

Ghodsi, Ali. Classification course, Department of Statistics and Actuarial Science, University of Waterloo. Online YouTube videos, 2015. Accessed: January 2019.

Ghodsi, Ali. Data visualization course, Department of Statistics and Actuarial Science, University of Waterloo. Online YouTube videos, 2017. Accessed: January 2019.

Ghojogh, Benyamin, Mohammadzade, Hoda, and Mokari, Mozhgan. Fisherposes for human action recognition using Kinect sensor data. IEEE Sensors Journal, 18(4):1612–1627, 2017.

Ghojogh, Benyamin, Ghojogh, Aydin, Crowley, Mark, and Karray, Fakhri. Fitting a mixture distribution to data: Tutorial. arXiv preprint arXiv:1901.06708, 2019a.

Ghojogh, Benyamin, Karray, Fakhri, and Crowley, Mark. Eigenvalue and generalized eigenvalue problems: Tutorial. arXiv preprint arXiv:1903.11240, 2019b.

Ham, Ji Hun, Lee, Daniel D, Mika, Sebastian, and Schölkopf, Bernhard. A kernel view of the dimensionality reduction of manifolds. In International Conference on Machine Learning, 2004.

Hazewinkel, Michiel. Central limit theorem. Encyclopedia of Mathematics, Springer, 2001.

Jolliffe, Ian. Principal component analysis. Springer, 2011.

Kleinbaum, David G, Dietz, K, Gail, M, Klein, Mitchel, and Klein, Mitchell. Logistic regression. Springer, 2002.

Kulis, Brian. Metric learning: A survey. Foundations and Trends in Machine Learning, 5(4):287–364, 2013.

Lachenbruch, Peter A and Goldstein, M. Discriminant analysis. Biometrics, pp. 69–85, 1979.

Li, Yongmin, Gong, Shaogang, and Liddell, Heather. Recognising trajectories of facial identities using kernel discriminant analysis. Image and Vision Computing, 21(13-14):1077–1086, 2003.

Lu, Juwei, Plataniotis, Konstantinos N, and Venetsanopoulos, Anastasios N. Face recognition using kernel direct discriminant analysis algorithms. IEEE Transactions on Neural Networks, 14(1):117–126, 2003.

Malekmohammadi, Alireza, Mohammadzade, Hoda, Chamanzar, Alireza, Shabany, Mahdi, and Ghojogh, Benyamin. An efficient hardware implementation for a motor imagery brain computer interface system. Scientia Iranica, 26:72–94, 2019.

McLachlan, Goeffrey J. Mahalanobis distance. Resonance, 4(6):20–26, 1999.

Mitchell, Thomas. Machine learning. McGraw Hill Higher Education, 1997.

Mokari, Mozhgan, Mohammadzade, Hoda, and Ghojogh, Benyamin. Recognizing involuntary actions from 3D skeleton data using body states. Scientia Iranica, 2018.

Murphy, Kevin P. Machine learning: a probabilistic perspective. MIT Press, 2012.

Neyman, Jerzy and Pearson, Egon Sharpe. IX. On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, 231(694-706):289–337, 1933.

Robert, Christian and Casella, George. Monte Carlo statistical methods. Springer Science & Business Media, 2013.

Strange, Harry and Zwiggelaar, Reyer. Open Problems in Spectral Dimensionality Reduction. Springer, 2014.

Sugiyama, Masashi. Dimensionality reduction of multimodal labeled data by local Fisher discriminant analysis. Journal of Machine Learning Research, 8(May):1027–1061, 2007.

Tharwat, Alaa, Gaber, Tarek, Ibrahim, Abdelhameed, and Hassanien, Aboul Ella. Linear discriminant analysis: A detailed tutorial. AI Communications, 30(2):169–190, 2017.

Welling, Max. Fisher linear discriminant analysis. Technical report, University of Toronto, Toronto, Ontario, Canada, 2005.

White, Halbert. Asymptotic theory for econometricians. Academic Press, 1984.

Xu, Yong and Lu, Guangming. Analysis on Fisher discriminant criterion and linear separability of feature space. In 2006 International Conference on Computational Intelligence and Security, volume 2, pp. 1671–1676. IEEE, 2006.

Yang, Liu and Jin, Rong. Distance metric learning: A comprehensive survey. Technical report, Department of Computer Science and Engineering, Michigan State University, 2006.

Zhang, Harry. The optimality of naive Bayes. In American Association for Artificial Intelligence (AAAI), 2004.

Zhao, Wenyi, Chellappa, Rama, and Phillips, P Jonathon. Subspace linear discriminant analysis for face recognition. Citeseer, 1999.