Fisher Discriminant Analysis (FDA) Multiple Discriminant Analysis (MDA) Multidimensional Scaling (MDS)

ECE656-Machine Learning and Adaptive Systems Lectures 21 & 22

M.R. Azimi, Professor

Department of Electrical and Computer Engineering Colorado State University

Fall 2015

Fisher Linear Discriminant (1936) - Two-Class Story

Why FDA versus PCA? PCA finds components that are useful for representing the data in a lower-dimensional subspace by maximizing the variance; low-variance (low-energy) components are discarded. PCA is an unsupervised method, i.e., the data samples are unlabeled, so no attempt is made to extract discriminatory components. Thus, the directions discarded by PCA may happen to be the most discriminatory features (e.g., discriminating O and Q). In contrast, FDA seeks to reduce dimensionality while preserving the most discriminatory features.

Let $\{x_p, d_p\}_{p=1}^{P}$, with $x_p \in \mathbb{R}^N$ and $d_p \in \mathbb{R}$, be a set of labeled data samples, $P_1$ of class $C_1$ ($d_p = -1$) and $P_2$ of class $C_2$ ($d_p = 1$), with $P_1 + P_2 = P$. We seek to obtain a mapping,

$$y_p = w^t x_p$$

that linearly combines the samples $x_p$ into scalars $y_p$ such that the class separation among these scalars is maximized. Clearly, if $\|w\| = 1$ this corresponds to projecting each sample onto a line, in which case we are seeking the line direction that maximizes the separation (see figure).


To find the best $w$, we use the sample mean as a measure of separation between the projected points. If we define the sample mean vector of each class $C_i$ as
$$\mu_i = \frac{1}{P_i}\sum_{p=1}^{P_i} x_p, \quad i = 1, 2,$$
then the sample mean of the projected points is,

$$m_i = \frac{1}{P_i}\sum_{p=1}^{P_i} y_p = \frac{1}{P_i}\sum_{p=1}^{P_i} w^t x_p = w^t \mu_i, \quad i = 1, 2.$$

Note that the training samples are arranged in two different subsets for classes $C_1$ and $C_2$. We use the distance between the projected means as our measure, i.e.
$$|m_1 - m_2| = |w^t(\mu_1 - \mu_2)|.$$
However, the distance between the projected means alone is not a good measure since it does not account for the standard deviation within the classes.

Fisher suggested maximizing the difference between the means, normalized by a measure of the within-class scatter of the projected samples, $(\sigma_1^2 + \sigma_2^2)$, where $\sigma_i^2$ is the sample variance of the $y$ samples in class $C_i$. Thus, the Fisher measure to maximize with respect to $w$ is,
$$J(w) = \frac{|m_1 - m_2|^2}{\sigma_1^2 + \sigma_2^2}.$$
To find the optimum $w$ we must express $J(w)$ as a function of $w$. We define the scatter matrices $S_i$ as
$$S_i = \sum_{p=1}^{P_i}(x_p - \mu_i)(x_p - \mu_i)^t,$$
i.e. the within-class sample covariance matrices (a measure of the scatter in the feature space $x$ for class $C_i$). Also, define the within-class scatter matrix
$$S_W = S_1 + S_2.$$


The variance (or scatter) of the projected samples in $C_i$ can be expressed as,
$$\sigma_i^2 = \sum_{p=1}^{P_i}(y_p - m_i)^2 = \sum_{p=1}^{P_i}(w^t x_p - w^t \mu_i)^2 = \sum_{p=1}^{P_i} w^t(x_p - \mu_i)(x_p - \mu_i)^t w = w^t S_i w.$$
Thus, the denominator of $J(w)$, or the total within-class scatter, becomes
$$\sigma_1^2 + \sigma_2^2 = w^t(S_1 + S_2)w = w^t S_W w.$$
Also, we can express the numerator of $J(w)$ as
$$(m_1 - m_2)^2 = (w^t\mu_1 - w^t\mu_2)^2 = w^t(\mu_1 - \mu_2)(\mu_1 - \mu_2)^t w = w^t S_B w,$$

where the matrix $S_B$ is called the between-class scatter matrix. Note that although $S_W$ is symmetric, positive definite, and nonsingular (when $P_i \gg N$), the matrix $S_B$ is symmetric and positive semi-definite but singular (rank one). Now, the Fisher criterion in terms of $w$ becomes,

$$J(w) = \frac{w^t S_B w}{w^t S_W w},$$
i.e. a generalized Rayleigh quotient. An important property of the Fisher measure $J(w)$ is that it is invariant to scaling of the weight vector, i.e. $w \to aw$. Hence, we can choose $w$ such that $w^t S_W w = 1$.

This leads to the following constrained optimization

Maximize $w^t S_B w$ subject to $w^t S_W w = 1$, which can be represented by the Lagrangian cost function,

$$\max_{w}\;\{\, w^t S_B w - \lambda(w^t S_W w - 1)\,\}.$$
The solution of this maximization gives the following generalized eigenvalue problem,

$$S_B w = \lambda S_W w, \quad \text{or alternatively} \quad S_W^{-1} S_B w = \lambda w.$$
Remarks:

1. Note that it is not necessary to solve for the eigenvalues and eigenvectors of $S_W^{-1} S_B$, since $S_B w$ is always in the direction of $(\mu_1 - \mu_2)$. Thus, due to the scale invariance, we can write the solution for the optimum $w$ as $w^* = S_W^{-1}(\mu_1 - \mu_2)$.

2. Using the Fisher linear discriminant, classification is reduced from an $N$-D problem to a 1-D problem. This mapping is many-to-one and cannot reduce the minimum achievable error rate.
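As a quick illustration of Remark 1, the sketch below (NumPy, with a hypothetical helper name `fisher_direction` and the two class subsets passed as row-wise arrays) computes $w^* = S_W^{-1}(\mu_1 - \mu_2)$ directly:

```python
import numpy as np

def fisher_direction(X1, X2):
    """Two-class Fisher direction w* = S_W^{-1} (mu_1 - mu_2).

    X1, X2: arrays of shape (P1, N) and (P2, N) holding the samples of
    classes C1 and C2 as rows.
    """
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter S_W = S_1 + S_2 (unnormalized sample covariances)
    S1 = (X1 - mu1).T @ (X1 - mu1)
    S2 = (X2 - mu2).T @ (X2 - mu2)
    SW = S1 + S2
    # Solve S_W w = (mu_1 - mu_2) rather than forming the inverse explicitly
    w = np.linalg.solve(SW, mu1 - mu2)
    return w / np.linalg.norm(w)  # scale of w is irrelevant; normalize for convenience
```

Classification then reduces to thresholding the scalar $y_p = w^t x_p$, e.g. at the midpoint between the two projected class means.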


Example: Compute the LDA projection for the following 2-D data subsets

$X_1 = \{(4,1), (2,4), (2,3), (3,6), (4,4)\}$ in class $C_1$ and $X_2 = \{(9,10), (6,8), (9,5), (8,7), (10,8)\}$ in class $C_2$. The class scatter matrices are:
$$S_1 = \begin{bmatrix} 0.8 & -0.4 \\ -0.4 & 2.64 \end{bmatrix} \quad \text{and} \quad S_2 = \begin{bmatrix} 1.84 & -0.04 \\ -0.04 & 2.64 \end{bmatrix}.$$

Also, $\mu_1 = [3\;\;3.6]^t$ and $\mu_2 = [8.4\;\;7.6]^t$. Thus, we get
$$S_B = \begin{bmatrix} 29.16 & 21.6 \\ 21.6 & 16 \end{bmatrix} \quad \text{and} \quad S_W = \begin{bmatrix} 2.64 & -0.44 \\ -0.44 & 5.28 \end{bmatrix}.$$
First we use the direct approach, solving for the eigenvalues and eigenvectors of $S_W^{-1}S_B w = \lambda w$. The eigenvector that gives the largest $J(w)$ is $w^* = [0.92\;\;0.39]^t$. Alternatively, we use the result in Remark 1, i.e. $w^* = S_W^{-1}(\mu_1 - \mu_2) = [-0.92\;\;-0.39]^t$ (after normalizing to unit length).
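The numbers in this example can be checked with a few lines of NumPy (a verification sketch only; note that the scatter matrices above are normalized by $P_i = 5$, which does not change the direction of $w^*$):

```python
import numpy as np

X1 = np.array([[4, 1], [2, 4], [2, 3], [3, 6], [4, 4]], dtype=float)
X2 = np.array([[9, 10], [6, 8], [9, 5], [8, 7], [10, 8]], dtype=float)

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)   # [3.0, 3.6] and [8.4, 7.6]
S1 = (X1 - mu1).T @ (X1 - mu1) / len(X1)      # [[0.8, -0.4], [-0.4, 2.64]]
S2 = (X2 - mu2).T @ (X2 - mu2) / len(X2)      # [[1.84, -0.04], [-0.04, 2.64]]
SW = S1 + S2                                  # [[2.64, -0.44], [-0.44, 5.28]]
SB = np.outer(mu1 - mu2, mu1 - mu2)           # [[29.16, 21.6], [21.6, 16.0]]

w = np.linalg.solve(SW, mu1 - mu2)            # direction of S_W^{-1} (mu_1 - mu_2)
print(w / np.linalg.norm(w))                  # approx [-0.92, -0.39]
```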

Multiple Discriminant Analysis

For the $M$-class problem, instead of one projection $y$, we now seek $(M-1)$ projections $y^{(1)}, y^{(2)}, \cdots, y^{(M-1)}$ by means of projection vectors $w_i$, $i \in [1, M-1]$ (or DFs), arranged by columns into a projection matrix $W = [w_1\, w_2 \cdots w_{M-1}]$. That is,
$$y_p^{(i)} = w_i^t x_p \quad \text{or} \quad y_p = W^t x_p.$$
Following the same procedure as in the LDA, we have
$$S_i = \sum_{p=1}^{P_i}(x_p - \mu_i)(x_p - \mu_i)^t, \quad \text{where } \mu_i = \frac{1}{P_i}\sum_{p=1}^{P_i} x_p.$$
The within-class scatter matrix becomes
$$S_W = \sum_{i=1}^{M}\sum_{p=1}^{P_i}(x_p - \mu_i)(x_p - \mu_i)^t.$$
The between-class scatter matrix in this case is
$$S_B = \sum_{i=1}^{M} P_i(\mu_i - \mu)(\mu_i - \mu)^t, \quad \text{where } \mu = \frac{1}{P}\sum_{p=1}^{P} x_p = \frac{1}{P}\sum_{i=1}^{M} P_i \mu_i.$$
Similarly, we define the mean vectors and scatter matrices for the projected samples as
$$\tilde S_W = \sum_{i=1}^{M}\sum_{p=1}^{P_i}(y_p - \tilde\mu_i)(y_p - \tilde\mu_i)^t, \quad \text{where } \tilde\mu_i = \frac{1}{P_i}\sum_{p=1}^{P_i} y_p.$$

Also,
$$\tilde S_B = \sum_{i=1}^{M} P_i(\tilde\mu_i - \tilde\mu)(\tilde\mu_i - \tilde\mu)^t, \quad \text{where } \tilde\mu = \frac{1}{P}\sum_{p=1}^{P} y_p = \frac{1}{P}\sum_{i=1}^{M} P_i \tilde\mu_i.$$
It is straightforward to show that
$$\tilde S_W = W^t S_W W \quad \text{and} \quad \tilde S_B = W^t S_B W.$$
Since the projection is no longer a scalar (it has $M-1$ dimensions), we use the determinants of the scatter matrices to obtain a scalar objective function for finding the optimum $W$. This is,
$$J(W) = \frac{|W^t S_B W|}{|W^t S_W W|}.$$
It can be shown that the optimal projection matrix $W^*$ is the one whose columns are the eigenvectors corresponding to the largest eigenvalues of the following generalized eigenvalue problem,
$$W^* = [w_1^*\, w_2^* \cdots w_{M-1}^*] = \arg\max_W \frac{|W^t S_B W|}{|W^t S_W W|} \;\Longrightarrow\; (S_B - \lambda_i S_W)\, w_i^* = 0.$$
(A numerical sketch of this computation is given after the remarks below.)

Remarks:

Matrix $S_B$ is the sum of $M$ rank-one matrices, and the mean vectors are constrained by $\mu = \frac{1}{P}\sum_{i=1}^{M} P_i \mu_i$. Thus, matrix $S_B$ will be of rank $M-1$ or less. This means that only $M-1$ of the eigenvalues $\lambda_i$ will be non-zero.

The projections with maximum class-separability information are the eigenvectors corresponding to the largest eigenvalues of $S_W^{-1} S_B$. The EVD can be avoided by computing the roots of $|S_B - \lambda_i S_W| = 0$.

LDA can be derived as the maximum-likelihood method for the case of normal class-conditional densities with equal covariance matrices.

LDA produces at most $M-1$ feature projections (the principal components of $S_B$). If the classification error estimates establish that more features are needed, some other method must be employed to provide those additional features.

LDA is a parametric method (it assumes unimodal Gaussian likelihoods). If the distributions are non-Gaussian, the LDA projections may not preserve the complex structure in the data needed for classification.
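The multi-class computation can be sketched as follows (a minimal sketch assuming labeled training data in a NumPy array `X` of shape (P, N) with integer labels `d`; `scipy.linalg.eigh` solves the symmetric generalized eigenproblem $S_B w = \lambda S_W w$):

```python
import numpy as np
from scipy.linalg import eigh

def mda_projection(X, d, n_components=None):
    """Return W whose columns are the generalized eigenvectors of (S_B, S_W)
    with the largest eigenvalues; at most M - 1 of them carry information."""
    classes = np.unique(d)
    M, N = len(classes), X.shape[1]
    mu = X.mean(axis=0)                        # overall mean
    SW, SB = np.zeros((N, N)), np.zeros((N, N))
    for c in classes:
        Xc = X[d == c]
        mu_c = Xc.mean(axis=0)
        SW += (Xc - mu_c).T @ (Xc - mu_c)      # within-class scatter
        SB += len(Xc) * np.outer(mu_c - mu, mu_c - mu)  # between-class scatter
    evals, evecs = eigh(SB, SW)                # S_B w = lambda S_W w
    order = np.argsort(evals)[::-1]            # descending eigenvalues
    k = n_components if n_components else M - 1
    return evecs[:, order[:k]]                 # projection matrix W (N x k)

# Projected samples: Y = X @ W, i.e. y_p = W^t x_p for each sample.
```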

[Figure: comparison of LDA, Perceptron, and SVM.]

Multidimensional Scaling (MDS)

Goal: Find a low-dimensional representation, i.e., a set of points which is most closely consistent (under some cost function) with a measured set of dissimilarities $D \in \mathbb{R}^{N\times N}$. That is, construct a set of low-dimensional points whose interpoint Euclidean distances closely match those in $D$. Note that here the original data points are not directly observable, only their distances (e.g., construct a geographical map of CO based upon the road distances between towns).

There are many cost functions that can be used, leading to different MDS algorithms. Here, we focus on classical MDS. Let $\hat D$ denote the distance matrix between the points in the low-dimensional space and $\rho(D, \hat D)$ be an appropriate cost function. One very convenient cost function is the "STRAIN criterion" given by
$$\rho(D, \hat D) = \|P_c(D - \hat D)P_c\|_F^2,$$
where $\|A\|_F^2 = \mathrm{tr}(AA^H)$ is the Frobenius norm of matrix $A$, and $P_c = I - \frac{1}{N}\mathbf{1}\mathbf{1}^t$ is a centering matrix. Note that $P_c X$ subtracts the column mean from each column of matrix $X$, while $X P_c$ subtracts the row mean from each row of matrix $X$. Using this cost function one would like to match variations in distance, rather than the distance values themselves; this makes the criterion invariant to translation, rotation, and reflection. The solution of the above problem is given in the next theorem.
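A quick numerical check of the centering matrix behavior (a minimal sketch with an arbitrary matrix `X`):

```python
import numpy as np

N = 5
Pc = np.eye(N) - np.ones((N, N)) / N          # P_c = I - (1/N) 1 1^t
X = np.random.randn(N, N)

# P_c X removes each column's mean; X P_c removes each row's mean.
print(np.allclose((Pc @ X).mean(axis=0), 0))  # True: columns of P_c X sum to zero
print(np.allclose((X @ Pc).mean(axis=1), 0))  # True: rows of X P_c sum to zero
```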

Definition 1: A distance matrix $D \in \mathbb{R}^{L\times L}$ has the properties (a) $d_{ii} = 0$, (b) $d_{ij} \ge 0,\; i \ne j$, and (c) $d_{ij} = d_{ji}$.

Definition 2: A distance matrix $D$ is called Euclidean if there exists a configuration of points $x_1, \cdots, x_L \in \mathbb{R}^N$ whose interpoint distances are given by $D$, i.e. $d_{ij} = (x_i - x_j)^t(x_i - x_j) = \|x_i - x_j\|^2$.

Theorem: Let $D$ be a distance matrix and define $\Delta = -\frac{1}{2}P_c D P_c$. Then $D$ is Euclidean iff $\Delta$ is PSD. In particular, the following hold:

Matrix $\Delta$ is the Gram matrix for a mean-centered configuration with interpoint distances given by $D$. If $D$ is the matrix for a configuration $X$, then $\delta_{ij} = (x_i - \mu_x)^t(x_j - \mu_x)$, where $\mu_x = \frac{1}{L}\sum_{i=1}^{L} x_i$. In matrix form, $\Delta = P_c X (P_c X)^t$, so $\Delta \ge 0$, i.e. $\Delta$ is the centered inner-product matrix for $X$.

Conversely, if $\Delta$ is PSD then a configuration corresponding to $\Delta$ can be constructed as follows. Let $\Delta = U\Lambda U^t$ be the spectral decomposition of $\Delta$ with $\Lambda = \mathrm{Diag}[\lambda_1 \cdots \lambda_L]$. Now, since $\Delta$ is a Gram matrix, $\Delta = YY^t$, where $Y = U\Lambda^{1/2}$. Note that the points in the configuration $Y$ also have interpoint distances $D$, but with mean zero. Additionally, they are related to the eigenvectors of $\Delta$. The rows of $Y$ correspond to the new points with the same interpoint distances given by $D$.

To get a low $M$-dimensional embedding corresponding to the $M$ largest non-negative eigenvalues $\lambda_i$, we can discard the last $L - M$ columns of $Y$. To see the relation with PCA, we use PCA to project $Y$ onto its $M$ principal coordinates, i.e. $Y\Psi$, where $\Psi \in \mathbb{R}^{L\times M}$ and the columns of $\Psi = [\psi_1, \cdots, \psi_M]$ are the first $M$ eigenvectors of $Y^t Y$ (the covariance matrix). That is,
$$Y^t Y \psi_i = \gamma_i \psi_i.$$
Now, using $Y = U\Lambda^{1/2}$ yields,
$$(U\Lambda^{1/2})^t U\Lambda^{1/2}\psi_i = \gamma_i\psi_i \;\Longrightarrow\; \Lambda^{1/2}U^t U\Lambda^{1/2}\psi_i = \gamma_i\psi_i \;\Longrightarrow\; \Lambda\psi_i = \gamma_i\psi_i, \quad \text{since } U^t U = I,$$
which suggests that $\psi_i = e_i$ and $\gamma_i = \lambda_i$, where $e_i$ is the $i$th standard basis vector for $\mathbb{R}^L$. Then, $Y\Psi = Y[e_1 \cdots e_M]$, which corresponds to the first $M$ columns of $Y$.

Remarks:

1. The truncation step is indeed equivalent to projecting $Y$ onto its first $M$ principal components. Moreover, the $\lambda_i$ reveal the relative importance of each dimension, as they are the same as those found by PCA.

2. MDS produces a linear embedding that preserves all pairwise distances. However, it cannot be used for nonlinear manifold learning, as the Euclidean distance measurements do not respect the topology of the manifold.
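The entire classical MDS construction fits in a few lines (a sketch assuming the input `D2` already holds squared pairwise distances, consistent with Definition 2):

```python
import numpy as np

def classical_mds(D2, M=2):
    """Classical MDS: embed L points in M dimensions from squared distances D2."""
    L = D2.shape[0]
    Pc = np.eye(L) - np.ones((L, L)) / L       # centering matrix
    Delta = -0.5 * Pc @ D2 @ Pc                # centered Gram matrix
    evals, evecs = np.linalg.eigh(Delta)       # eigenvalues in ascending order
    idx = np.argsort(evals)[::-1][:M]          # keep the M largest eigenvalues
    lam = np.clip(evals[idx], 0.0, None)       # guard against tiny negative values
    return evecs[:, idx] * np.sqrt(lam)        # Y = U Lambda^{1/2}, truncated to M columns
```

As noted in Remark 1, keeping the leading columns of $Y$ is the same as projecting onto the first $M$ principal coordinates.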

MDS Example

Consider the $7\times 7$ distance matrix (only the upper triangle is shown; $D$ is symmetric)
$$D = \begin{bmatrix}
0 & 1 & \sqrt{3} & 2 & \sqrt{3} & 1 & 1\\
  & 0 & 1 & \sqrt{3} & 2 & \sqrt{3} & 1\\
  &   & 0 & 1 & \sqrt{3} & 2 & 1\\
  &   &   & 0 & 1 & \sqrt{3} & 1\\
  &   &   &   & 0 & 1 & 1\\
  &   &   &   &   & 0 & 1\\
  &   &   &   &   &   & 0
\end{bmatrix}.$$
Constructing the $\Delta$ matrix from the squared distances $[d_{ij}^2]$ (cf. Definition 2) gives
$$\Delta = -\frac{1}{2}P_c\,[d_{ij}^2]\,P_c = \frac{1}{2}\begin{bmatrix}
2 & 1 & -1 & -2 & -1 & 1 & 0\\
  & 2 & 1 & -1 & -2 & -1 & 0\\
  &   & 2 & 1 & -1 & -2 & 0\\
  &   &   & 2 & 1 & -1 & 0\\
  &   &   &   & 2 & 1 & 0\\
  &   &   &   &   & 2 & 0\\
  &   &   &   &   &   & 0
\end{bmatrix}.$$

The columns of $\Delta$ are linearly dependent. It can be shown that $\delta_3 = \delta_2 - \delta_1$, $\delta_4 = -\delta_1$, $\delta_5 = -\delta_2$, $\delta_6 = \delta_1 - \delta_2$, and $\delta_7 = 0$. Thus, the rank of $\Delta$ is 2, and a configuration exactly fitting the distance matrix $D$ can be constructed in $M = 2$ dimensions. The eigenvalues of $\Delta$ are $\lambda_1 = \lambda_2 = 3$ and $\lambda_3 = \cdots = \lambda_7 = 0$. The associated eigenvectors, scaled by $\sqrt{\lambda_i}$ (i.e. the columns of $Y = U\Lambda^{1/2}$), are
$$u_1 = \left[\tfrac{\sqrt{3}}{2}, \tfrac{\sqrt{3}}{2}, 0, -\tfrac{\sqrt{3}}{2}, -\tfrac{\sqrt{3}}{2}, 0, 0\right]^t \quad \text{and} \quad u_2 = \left[\tfrac{1}{2}, -\tfrac{1}{2}, -1, -\tfrac{1}{2}, \tfrac{1}{2}, 1, 0\right]^t.$$
The coordinates of the 7 points are therefore $[\tfrac{\sqrt{3}}{2}, \tfrac{1}{2}]^t$, $[\tfrac{\sqrt{3}}{2}, -\tfrac{1}{2}]^t$, $[0, -1]^t$, $[-\tfrac{\sqrt{3}}{2}, -\tfrac{1}{2}]^t$, $[-\tfrac{\sqrt{3}}{2}, \tfrac{1}{2}]^t$, $[0, 1]^t$, and $[0, 0]^t$. Note that the mean vector of these points is $[0, 0]^t$.
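This example can be verified numerically; the sketch below rebuilds the squared-distance matrix from a unit hexagon plus its center and recovers a 2-D configuration (which matches the one above up to rotation and reflection):

```python
import numpy as np

# Six vertices of a unit hexagon plus its center: the configuration behind D
pts = np.array([[np.cos(a), np.sin(a)] for a in np.arange(6) * np.pi / 3] + [[0.0, 0.0]])
D2 = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1)   # squared distances

L = 7
Pc = np.eye(L) - np.ones((L, L)) / L
Delta = -0.5 * Pc @ D2 @ Pc

print(np.linalg.matrix_rank(Delta))            # 2
print(np.round(np.linalg.eigvalsh(Delta), 6))  # two eigenvalues equal to 3, the rest 0

# Recover a 2-D configuration and confirm it reproduces the distances
evals, evecs = np.linalg.eigh(Delta)
Y = evecs[:, -2:] * np.sqrt(evals[-2:])
print(np.allclose(((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1), D2))  # True
```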
