Fisher Discriminant Analysis (FDA) Multiple Discriminant Analysis (MDA) Multidimensional Scaling (MDS)

ECE656-Machine Learning and Adaptive Systems Lectures 21 & 22

M.R. Azimi, Professor

Department of Electrical and Computer Engineering Colorado State University

Fall 2015

Fisher Linear Discriminant (1936) - Two-Class Story

Why FDA versus PCA? PCA finds components that are useful for representing the data in a lower-dimensional subspace by maximizing the variance; low-variance (low-energy) components are discarded. PCA is an unsupervised method, i.e., the data samples are unlabeled, so no attempt is made to extract discriminatory components. Thus, the directions discarded by PCA may happen to be the most discriminatory features (e.g., discriminating O and Q). In contrast, FDA seeks to reduce dimensionality while preserving the most discriminatory features.

Let $\{x_p, d_p\}_{p=1}^{P}$, with $x_p \in \mathbb{R}^N$ and $d_p \in \mathbb{R}$, be a set of labeled data samples, $P_1$ of class $C_1$ ($d_p = -1$) and $P_2$ of class $C_2$ ($d_p = 1$), with $P_1 + P_2 = P$. We seek to obtain a mapping,

$$y_p = w^t x_p$$

that linearly combines the samples $x_p$ into scalars $y_p$ such that the class separation among these scalars is maximized. Clearly, if $\|w\| = 1$ this corresponds to projecting each sample onto a line, in which case we are seeking the line direction that maximizes the separation (see figure).


To find the best $w$, we use the sample mean as a measure of separation between the projected points. If we define the sample mean vector of each class $C_i$ as
$$\mu_i = \frac{1}{P_i}\sum_{p=1}^{P_i} x_p, \quad i = 1, 2,$$
then the sample mean of the projected points is,

$$m_i = \frac{1}{P_i}\sum_{p=1}^{P_i} y_p = \frac{1}{P_i}\sum_{p=1}^{P_i} w^t x_p = w^t \mu_i, \quad i = 1, 2.$$

Note that the training samples are arranged in two different subsets for classes $C_1$ and $C_2$. We use the distance between the projected means as our measure, i.e.
$$|m_1 - m_2| = |w^t(\mu_1 - \mu_2)|.$$
However, the distance between the projected means alone is not a good measure since it does not account for the standard deviation within the classes.

Fisher suggested maximizing the difference between the means, normalized by a measure of the within-class scatter of the projected samples, $(\sigma_1^2 + \sigma_2^2)$, where $\sigma_i^2$ is the sample variance of the $y$ samples in class $C_i$. Thus, the Fisher measure to maximize with respect to $w$ is,
$$J(w) = \frac{|m_1 - m_2|^2}{\sigma_1^2 + \sigma_2^2}.$$
To find the optimum $w$ we must express $J(w)$ as a function of $w$. We define the scatter matrices $S_i$ as
$$S_i = \sum_{p=1}^{P_i}(x_p - \mu_i)(x_p - \mu_i)^t,$$
i.e. the within-class sample covariance matrices (a measure of the scatter in the feature space $x$ for class $C_i$). Also, define the within-class scatter matrix
$$S_W = S_1 + S_2.$$


The variance (or scatter) of the projected samples in $C_i$ can be expressed as,
$$\sigma_i^2 = \sum_{p=1}^{P_i}(y_p - m_i)^2 = \sum_{p=1}^{P_i}(w^t x_p - w^t \mu_i)^2 = \sum_{p=1}^{P_i} w^t(x_p - \mu_i)(x_p - \mu_i)^t w = w^t S_i w.$$
Thus, the denominator of $J(w)$, or the total within-class scatter, becomes
$$\sigma_1^2 + \sigma_2^2 = w^t(S_1 + S_2)w = w^t S_W w.$$
Also, we can express the numerator of $J(w)$ as
$$(m_1 - m_2)^2 = (w^t\mu_1 - w^t\mu_2)^2 = w^t(\mu_1 - \mu_2)(\mu_1 - \mu_2)^t w = w^t S_B w,$$

where the matrix $S_B$ is called the between-class scatter matrix. Note that although $S_W$ is symmetric, positive definite, and nonsingular (when $P_i \gg N$), the matrix $S_B$ is symmetric and positive semi-definite but singular (rank one). Now, the Fisher criterion in terms of $w$ becomes,

$$J(w) = \frac{w^t S_B w}{w^t S_W w},$$
i.e. a generalized Rayleigh quotient. An important property of the Fisher measure $J(w)$ is that it is invariant to scaling of the weight vector, i.e. $w \to aw$. Hence, we can choose $w$ such that $w^t S_W w = 1$.

This leads to the following constrained optimization

Maximize $w^t S_B w$ subject to $w^t S_W w = 1$, which can be represented by the Lagrangian cost function,

$$\max_{w}\;\{\, w^t S_B w - \lambda(w^t S_W w - 1)\,\}.$$
The solution of this maximization gives the following generalized eigenvalue problem,

$$S_B w = \lambda S_W w, \quad \text{or alternatively} \quad S_W^{-1} S_B w = \lambda w.$$
Remarks:

1. Note that it is not necessary to solve for the eigenvalues and eigenvectors of $S_W^{-1} S_B$, since $S_B w$ is always in the direction of $(\mu_1 - \mu_2)$. Thus, due to the scale invariance, we can write the solution for the optimum $w$ as $w^* = S_W^{-1}(\mu_1 - \mu_2)$.

2. Using the Fisher linear discriminant, classification is reduced from an $N$-D problem to a 1-D problem. This mapping is many-to-one and cannot reduce the minimum achievable error rate.
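As a quick illustration of Remark 1, the sketch below (NumPy, with a hypothetical helper name `fisher_direction` and the two class subsets passed as row-wise arrays) computes $w^* = S_W^{-1}(\mu_1 - \mu_2)$ directly:

```python
import numpy as np

def fisher_direction(X1, X2):
    """Two-class Fisher direction w* = S_W^{-1} (mu_1 - mu_2).

    X1, X2: arrays of shape (P1, N) and (P2, N) holding the samples of
    classes C1 and C2 as rows.
    """
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter S_W = S_1 + S_2 (unnormalized sample covariances)
    S1 = (X1 - mu1).T @ (X1 - mu1)
    S2 = (X2 - mu2).T @ (X2 - mu2)
    SW = S1 + S2
    # Solve S_W w = (mu_1 - mu_2) rather than forming the inverse explicitly
    w = np.linalg.solve(SW, mu1 - mu2)
    return w / np.linalg.norm(w)  # scale of w is irrelevant; normalize for convenience
```

Classification then reduces to thresholding the scalar $y_p = w^t x_p$, e.g. at the midpoint between the two projected class means.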


Example: Compute the LDA projection for the following 2-D data subsets

$X_1 = \{(4,1), (2,4), (2,3), (3,6), (4,4)\}$ in class $C_1$ and $X_2 = \{(9,10), (6,8), (9,5), (8,7), (10,8)\}$ in class $C_2$. The class scatter matrices are:
$$S_1 = \begin{bmatrix} 0.8 & -0.4 \\ -0.4 & 2.64 \end{bmatrix} \quad \text{and} \quad S_2 = \begin{bmatrix} 1.84 & -0.04 \\ -0.04 & 2.64 \end{bmatrix}.$$

Also, $\mu_1 = [3\;\;3.6]^t$ and $\mu_2 = [8.4\;\;7.6]^t$. Thus, we get
$$S_B = \begin{bmatrix} 29.16 & 21.6 \\ 21.6 & 16 \end{bmatrix} \quad \text{and} \quad S_W = \begin{bmatrix} 2.64 & -0.44 \\ -0.44 & 5.28 \end{bmatrix}.$$
First we use the direct approach, solving for the eigenvalues and eigenvectors of $S_W^{-1}S_B w = \lambda w$. The eigenvector that gives the largest $J(w)$ is $w^* = [0.92\;\;0.39]^t$. Alternatively, we use the result in Remark 1, i.e. $w^* = S_W^{-1}(\mu_1 - \mu_2) = [-0.92\;\;-0.39]^t$ (after normalizing to unit length).
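The numbers in this example can be checked with a few lines of NumPy (a verification sketch only; note that the scatter matrices above are normalized by $P_i = 5$, which does not change the direction of $w^*$):

```python
import numpy as np

X1 = np.array([[4, 1], [2, 4], [2, 3], [3, 6], [4, 4]], dtype=float)
X2 = np.array([[9, 10], [6, 8], [9, 5], [8, 7], [10, 8]], dtype=float)

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)   # [3.0, 3.6] and [8.4, 7.6]
S1 = (X1 - mu1).T @ (X1 - mu1) / len(X1)      # [[0.8, -0.4], [-0.4, 2.64]]
S2 = (X2 - mu2).T @ (X2 - mu2) / len(X2)      # [[1.84, -0.04], [-0.04, 2.64]]
SW = S1 + S2                                  # [[2.64, -0.44], [-0.44, 5.28]]
SB = np.outer(mu1 - mu2, mu1 - mu2)           # [[29.16, 21.6], [21.6, 16.0]]

w = np.linalg.solve(SW, mu1 - mu2)            # direction of S_W^{-1} (mu_1 - mu_2)
print(w / np.linalg.norm(w))                  # approx [-0.92, -0.39]
```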

Multiple Discriminant Analysis

For the $M$-class problem, instead of one projection $y$, we now seek $(M-1)$ projections $y^{(1)}, y^{(2)}, \cdots, y^{(M-1)}$ by means of projection vectors $w_i$, $i \in [1, M-1]$ (or DFs), arranged by columns into a projection matrix $W = [w_1\, w_2 \cdots w_{M-1}]$. That is,
$$y_p^{(i)} = w_i^t x_p \quad \text{or} \quad y_p = W^t x_p.$$
Following the same procedure as in the LDA, we have
$$S_i = \sum_{p=1}^{P_i}(x_p - \mu_i)(x_p - \mu_i)^t, \quad \text{where } \mu_i = \frac{1}{P_i}\sum_{p=1}^{P_i} x_p.$$
The within-class scatter matrix becomes
$$S_W = \sum_{i=1}^{M}\sum_{p=1}^{P_i}(x_p - \mu_i)(x_p - \mu_i)^t.$$
The between-class scatter matrix in this case is
$$S_B = \sum_{i=1}^{M} P_i(\mu_i - \mu)(\mu_i - \mu)^t, \quad \text{where } \mu = \frac{1}{P}\sum_{p=1}^{P} x_p = \frac{1}{P}\sum_{i=1}^{M} P_i \mu_i.$$
Similarly, we define the mean vectors and scatter matrices for the projected samples as
$$\tilde S_W = \sum_{i=1}^{M}\sum_{p=1}^{P_i}(y_p - \tilde\mu_i)(y_p - \tilde\mu_i)^t, \quad \text{where } \tilde\mu_i = \frac{1}{P_i}\sum_{p=1}^{P_i} y_p.$$

Also,
$$\tilde S_B = \sum_{i=1}^{M} P_i(\tilde\mu_i - \tilde\mu)(\tilde\mu_i - \tilde\mu)^t, \quad \text{where } \tilde\mu = \frac{1}{P}\sum_{p=1}^{P} y_p = \frac{1}{P}\sum_{i=1}^{M} P_i \tilde\mu_i.$$
It is straightforward to show that
$$\tilde S_W = W^t S_W W \quad \text{and} \quad \tilde S_B = W^t S_B W.$$
Since the projection is no longer a scalar (it has $M-1$ dimensions), we use the determinants of the scatter matrices to obtain a scalar objective function for finding the optimum $W$. This is,
$$J(W) = \frac{|W^t S_B W|}{|W^t S_W W|}.$$
It can be shown that the optimal projection matrix $W^*$ is the one whose columns are the eigenvectors corresponding to the largest eigenvalues of the following generalized eigenvalue problem,
$$W^* = [w_1^*\, w_2^* \cdots w_{M-1}^*] = \arg\max_W \frac{|W^t S_B W|}{|W^t S_W W|} \;\Longrightarrow\; (S_B - \lambda_i S_W)\, w_i^* = 0.$$
(A numerical sketch of this computation is given after the remarks below.)

Remarks:

Matrix $S_B$ is the sum of $M$ rank-one matrices, and the mean vectors are constrained by $\mu = \frac{1}{P}\sum_{i=1}^{M} P_i \mu_i$. Thus, matrix $S_B$ will be of rank $M-1$ or less. This means that only $M-1$ of the eigenvalues $\lambda_i$ will be non-zero.

The projections with maximum class-separability information are the eigenvectors corresponding to the largest eigenvalues of $S_W^{-1} S_B$. The EVD can be avoided by computing the roots of $|S_B - \lambda_i S_W| = 0$.

LDA can be derived as the maximum-likelihood method for the case of normal class-conditional densities with equal covariance matrices.

LDA produces at most $M-1$ feature projections (the principal components of $S_B$). If the classification error estimates establish that more features are needed, some other method must be employed to provide those additional features.

LDA is a parametric method (it assumes unimodal Gaussian likelihoods). If the distributions are non-Gaussian, the LDA projections may not preserve the complex structure in the data needed for classification.
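The multi-class computation can be sketched as follows (a minimal sketch assuming labeled training data in a NumPy array `X` of shape (P, N) with integer labels `d`; `scipy.linalg.eigh` solves the symmetric generalized eigenproblem $S_B w = \lambda S_W w$):

```python
import numpy as np
from scipy.linalg import eigh

def mda_projection(X, d, n_components=None):
    """Return W whose columns are the generalized eigenvectors of (S_B, S_W)
    with the largest eigenvalues; at most M - 1 of them carry information."""
    classes = np.unique(d)
    M, N = len(classes), X.shape[1]
    mu = X.mean(axis=0)                        # overall mean
    SW, SB = np.zeros((N, N)), np.zeros((N, N))
    for c in classes:
        Xc = X[d == c]
        mu_c = Xc.mean(axis=0)
        SW += (Xc - mu_c).T @ (Xc - mu_c)      # within-class scatter
        SB += len(Xc) * np.outer(mu_c - mu, mu_c - mu)  # between-class scatter
    evals, evecs = eigh(SB, SW)                # S_B w = lambda S_W w
    order = np.argsort(evals)[::-1]            # descending eigenvalues
    k = n_components if n_components else M - 1
    return evecs[:, order[:k]]                 # projection matrix W (N x k)

# Projected samples: Y = X @ W, i.e. y_p = W^t x_p for each sample.
```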

[Figure: comparison of LDA, Perceptron, and SVM.]

Multidimensional Scaling (MDS)

Goal: Find a low-dimensional representation, i.e., a set of points which is most closely consistent (under some cost function) with a measured set of dissimilarities $D \in \mathbb{R}^{N\times N}$. That is, construct a set of low-dimensional points whose interpoint Euclidean distances closely match those in $D$. Note that here the original data points are not directly observable, only their distances (e.g., construct a geographical map of CO based upon the road distances between towns).

There are many cost functions that can be used, leading to different MDS algorithms. Here, we focus on classical MDS. Let $\hat D$ denote the distance matrix between the points in the low-dimensional space and $\rho(D, \hat D)$ be an appropriate cost function. One very convenient cost function is the "STRAIN criterion" given by
$$\rho(D, \hat D) = \|P_c(D - \hat D)P_c\|_F^2,$$
where $\|A\|_F^2 = \mathrm{tr}(AA^H)$ is the Frobenius norm of matrix $A$, and $P_c = I - \frac{1}{N}\mathbf{1}\mathbf{1}^t$ is a centering matrix. Note that $P_c X$ subtracts the column mean from each column of matrix $X$, while $X P_c$ subtracts the row mean from each row of matrix $X$. Using this cost function one would like to match variations in distance, rather than the distance values themselves; this makes the criterion invariant to translation, rotation, and reflection. The solution of the above problem is given in the next theorem.
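A quick numerical check of the centering matrix behavior (a minimal sketch with an arbitrary matrix `X`):

```python
import numpy as np

N = 5
Pc = np.eye(N) - np.ones((N, N)) / N          # P_c = I - (1/N) 1 1^t
X = np.random.randn(N, N)

# P_c X removes each column's mean; X P_c removes each row's mean.
print(np.allclose((Pc @ X).mean(axis=0), 0))  # True: columns of P_c X sum to zero
print(np.allclose((X @ Pc).mean(axis=1), 0))  # True: rows of X P_c sum to zero
```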

Definition 1: A distance matrix $D \in \mathbb{R}^{L\times L}$ has the properties (a) $d_{ii} = 0$, (b) $d_{ij} \ge 0,\; i \ne j$, and (c) $d_{ij} = d_{ji}$.

Definition 2: A distance matrix $D$ is called Euclidean if there exists a configuration of points $x_1, \cdots, x_L \in \mathbb{R}^N$ whose interpoint distances are given by $D$, i.e. $d_{ij} = (x_i - x_j)^t(x_i - x_j) = \|x_i - x_j\|^2$.

Theorem: Let $D$ be a distance matrix and define $\Delta = -\frac{1}{2}P_c D P_c$. Then $D$ is Euclidean iff $\Delta$ is PSD. In particular, the following hold:

Matrix $\Delta$ is the Gram matrix for a mean-centered configuration with interpoint distances given by $D$. If $D$ is the matrix for a configuration $X$, then $\delta_{ij} = (x_i - \mu_x)^t(x_j - \mu_x)$, where $\mu_x = \frac{1}{L}\sum_{i=1}^{L} x_i$. In matrix form, $\Delta = P_c X (P_c X)^t$, so $\Delta \ge 0$, i.e. $\Delta$ is the centered inner-product matrix for $X$.

Conversely, if $\Delta$ is PSD then a configuration corresponding to $\Delta$ can be constructed as follows. Let $\Delta = U\Lambda U^t$ be the spectral decomposition of $\Delta$ with $\Lambda = \mathrm{Diag}[\lambda_1 \cdots \lambda_L]$. Now, since $\Delta$ is a Gram matrix, $\Delta = YY^t$, where $Y = U\Lambda^{1/2}$. Note that the points in the configuration $Y$ also have interpoint distances $D$, but with mean zero. Additionally, they are related to the eigenvectors of $\Delta$. The rows of $Y$ correspond to the new points with the same interpoint distances given by $D$.

To get a low $M$-dimensional embedding corresponding to the $M$ largest non-negative eigenvalues $\lambda_i$, we can discard the last $L - M$ columns of $Y$. To see the relation with PCA, we use PCA to project $Y$ onto its $M$ principal coordinates, i.e. $Y\Psi$, where $\Psi \in \mathbb{R}^{L\times M}$ and the columns of $\Psi = [\psi_1, \cdots, \psi_M]$ are the first $M$ eigenvectors of $Y^t Y$ (the covariance matrix). That is,
$$Y^t Y \psi_i = \gamma_i \psi_i.$$
Now, using $Y = U\Lambda^{1/2}$ yields,
$$(U\Lambda^{1/2})^t U\Lambda^{1/2}\psi_i = \gamma_i\psi_i \;\Longrightarrow\; \Lambda^{1/2}U^t U\Lambda^{1/2}\psi_i = \gamma_i\psi_i \;\Longrightarrow\; \Lambda\psi_i = \gamma_i\psi_i, \quad \text{since } U^t U = I,$$
which suggests that $\psi_i = e_i$ and $\gamma_i = \lambda_i$, where $e_i$ is the $i$th standard basis vector for $\mathbb{R}^L$. Then, $Y\Psi = Y[e_1 \cdots e_M]$, which corresponds to the first $M$ columns of $Y$.

Remarks:

1. The truncation step is indeed equivalent to projecting $Y$ onto its first $M$ principal components. Moreover, the $\lambda_i$ reveal the relative importance of each dimension, as they are the same as those found by PCA.

2. MDS produces a linear embedding that preserves all pairwise distances. However, it cannot be used for nonlinear manifold learning, as the Euclidean distance measurements do not respect the topology of the manifold.
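The entire classical MDS construction fits in a few lines (a sketch assuming the input `D2` already holds squared pairwise distances, consistent with Definition 2):

```python
import numpy as np

def classical_mds(D2, M=2):
    """Classical MDS: embed L points in M dimensions from squared distances D2."""
    L = D2.shape[0]
    Pc = np.eye(L) - np.ones((L, L)) / L       # centering matrix
    Delta = -0.5 * Pc @ D2 @ Pc                # centered Gram matrix
    evals, evecs = np.linalg.eigh(Delta)       # eigenvalues in ascending order
    idx = np.argsort(evals)[::-1][:M]          # keep the M largest eigenvalues
    lam = np.clip(evals[idx], 0.0, None)       # guard against tiny negative values
    return evecs[:, idx] * np.sqrt(lam)        # Y = U Lambda^{1/2}, truncated to M columns
```

As noted in Remark 1, keeping the leading columns of $Y$ is the same as projecting onto the first $M$ principal coordinates.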

MDS Example

Consider the $7\times 7$ distance matrix (only the upper triangle is shown; $D$ is symmetric)
$$D = \begin{bmatrix}
0 & 1 & \sqrt{3} & 2 & \sqrt{3} & 1 & 1\\
  & 0 & 1 & \sqrt{3} & 2 & \sqrt{3} & 1\\
  &   & 0 & 1 & \sqrt{3} & 2 & 1\\
  &   &   & 0 & 1 & \sqrt{3} & 1\\
  &   &   &   & 0 & 1 & 1\\
  &   &   &   &   & 0 & 1\\
  &   &   &   &   &   & 0
\end{bmatrix}.$$
Constructing the $\Delta$ matrix from the squared distances $[d_{ij}^2]$ (cf. Definition 2) gives
$$\Delta = -\frac{1}{2}P_c\,[d_{ij}^2]\,P_c = \frac{1}{2}\begin{bmatrix}
2 & 1 & -1 & -2 & -1 & 1 & 0\\
  & 2 & 1 & -1 & -2 & -1 & 0\\
  &   & 2 & 1 & -1 & -2 & 0\\
  &   &   & 2 & 1 & -1 & 0\\
  &   &   &   & 2 & 1 & 0\\
  &   &   &   &   & 2 & 0\\
  &   &   &   &   &   & 0
\end{bmatrix}.$$

The columns of $\Delta$ are linearly dependent. It can be shown that $\delta_3 = \delta_2 - \delta_1$, $\delta_4 = -\delta_1$, $\delta_5 = -\delta_2$, $\delta_6 = \delta_1 - \delta_2$, and $\delta_7 = 0$. Thus, the rank of $\Delta$ is 2, and a configuration exactly fitting the distance matrix $D$ can be constructed in $M = 2$ dimensions. The eigenvalues of $\Delta$ are $\lambda_1 = \lambda_2 = 3$ and $\lambda_3 = \cdots = \lambda_7 = 0$. The associated eigenvectors, scaled by $\sqrt{\lambda_i}$ (i.e. the columns of $Y = U\Lambda^{1/2}$), are
$$u_1 = \left[\tfrac{\sqrt{3}}{2}, \tfrac{\sqrt{3}}{2}, 0, -\tfrac{\sqrt{3}}{2}, -\tfrac{\sqrt{3}}{2}, 0, 0\right]^t \quad \text{and} \quad u_2 = \left[\tfrac{1}{2}, -\tfrac{1}{2}, -1, -\tfrac{1}{2}, \tfrac{1}{2}, 1, 0\right]^t.$$
The coordinates of the 7 points are therefore $[\tfrac{\sqrt{3}}{2}, \tfrac{1}{2}]^t$, $[\tfrac{\sqrt{3}}{2}, -\tfrac{1}{2}]^t$, $[0, -1]^t$, $[-\tfrac{\sqrt{3}}{2}, -\tfrac{1}{2}]^t$, $[-\tfrac{\sqrt{3}}{2}, \tfrac{1}{2}]^t$, $[0, 1]^t$, and $[0, 0]^t$. Note that the mean vector of these points is $[0, 0]^t$.
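This example can be verified numerically; the sketch below rebuilds the squared-distance matrix from a unit hexagon plus its center and recovers a 2-D configuration (which matches the one above up to rotation and reflection):

```python
import numpy as np

# Six vertices of a unit hexagon plus its center: the configuration behind D
pts = np.array([[np.cos(a), np.sin(a)] for a in np.arange(6) * np.pi / 3] + [[0.0, 0.0]])
D2 = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1)   # squared distances

L = 7
Pc = np.eye(L) - np.ones((L, L)) / L
Delta = -0.5 * Pc @ D2 @ Pc

print(np.linalg.matrix_rank(Delta))            # 2
print(np.round(np.linalg.eigvalsh(Delta), 6))  # two eigenvalues equal to 3, the rest 0

# Recover a 2-D configuration and confirm it reproduces the distances
evals, evecs = np.linalg.eigh(Delta)
Y = evecs[:, -2:] * np.sqrt(evals[-2:])
print(np.allclose(((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1), D2))  # True
```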
