Eigenfaces and Fisherfaces: Dimension Reduction and Component Analysis

Jason Corso

University of Michigan

EECS 598 Fall 2014 Foundations of

Dimensionality: High Dimensions Often Test Our Intuitions

Consider a simple arrangement: you have a sphere of radius r = 1 in a space of D dimensions. We want to compute the fraction of the volume of the sphere that lies between radius r = 1 - \epsilon and r = 1.

Noting that the volume of the sphere will scale with r^D, we have

    V_D(r) = K_D r^D    (1)

where K_D is some constant (depending only on D).

The fraction of the volume lying in the shell is therefore

    \frac{V_D(1) - V_D(1 - \epsilon)}{V_D(1)} = 1 - (1 - \epsilon)^D    (2)

[Figure: volume fraction versus \epsilon for D = 1, 2, 5, and 20; as D grows, nearly all of the volume concentrates in a thin shell near the surface.]
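As a quick numerical check of Eq. (2), here is a minimal sketch; the values of \epsilon and D are arbitrary illustrations, not from the slides:

```python
# Evaluate Eq. (2): the fraction of a D-dimensional sphere's volume
# lying within a shell of width eps at the surface.
eps = 0.1
for D in (1, 2, 5, 20, 100):
    frac = 1 - (1 - eps) ** D
    print(f"D = {D:3d}: shell fraction = {frac:.3f}")
```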

Dimensionality: Let's Build Some More Intuition (Example from Bishop PRML)

Dataset: measurements taken from a pipeline containing a mixture of oil.
Three classes are present (different geometrical configurations): homogeneous, annular, and laminar.
Each data point is a 12-dimensional input vector consisting of measurements taken with gamma ray densitometers, which measure the attenuation of gamma rays passing along narrow beams through the pipe.

[Figure: scatter plot of features x6 and x7 for the three classes.]

100 data points of features x6 and x7 are shown in the scatter plot. Goal: classify the new data point at the 'x'.

[Figure: scatter plot of x6 versus x7 with the query point marked 'x'.]

Observations we can make:
The cross is surrounded by many red points and some green points.
Blue points are quite far from the cross.
Nearest-Neighbor Intuition: the query point should be determined more strongly by nearby points from the training set and less strongly by more distant points.

One simple way of doing it is:
We can divide the feature space up into regular cells.
For each cell, we associate the class that occurs most frequently in that cell (in our training data).
Then, for a query point, we determine which cell it falls into and assign the label associated with that cell.

What problems may exist with this approach?

The problem we are most interested in now is the one that becomes apparent when we add more variables into the mix, corresponding to problems of higher dimensionality. In this case, the number of cells grows exponentially with the dimensionality of the space. Hence, we would need an exponentially large training data set to ensure all cells are filled.

[Figure: regular grids over the feature space for D = 1 (x1), D = 2 (x1, x2), and D = 3 (x1, x2, x3), illustrating the exponential growth in the number of cells.]

Dimensionality: Curse of Dimensionality

This severe difficulty when working in high dimensions was coined the curse of dimensionality by Bellman in 1961. The idea is that the volume of a space increases exponentially with the dimensionality of the space.
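To make the cell-counting argument concrete, a minimal sketch (the number of bins per axis is an arbitrary choice for illustration):

```python
# With `bins` cells per axis, a regular grid has bins**D cells in D dimensions,
# and we would like training data in every one of those cells.
bins = 10
for D in (1, 2, 3, 10, 20):
    print(f"D = {D:2d}: {bins ** D:.3e} cells")
```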

Dimensionality: Dimensionality and Classification Error?
(Some parts taken from G. V. Trunk, TPAMI, Vol. 1, No. 3, pp. 306-307, 1979.)

How does the probability of error vary as we add more features, in theory? Consider the following two-class problem:

The prior probabilities are known and equal: P(\omega_1) = P(\omega_2) = 1/2.
The class-conditional densities are Gaussian with unit covariance:

    p(x \mid \omega_1) \sim N(\mu_1, I)    (3)
    p(x \mid \omega_2) \sim N(\mu_2, I)    (4)

where \mu_1 = \mu, \mu_2 = -\mu, and \mu is an n-vector whose ith component is (1/i)^{1/2}.

The corresponding Bayesian decision rule is:

    decide \omega_1 if x^T \mu > 0    (5)

The probability of error is

    P(error) = \frac{1}{\sqrt{2\pi}} \int_{r/2}^{\infty} \exp(-z^2/2)\, dz    (6)

where

    r^2 = \|\mu_1 - \mu_2\|^2 = 4 \sum_{i=1}^{n} (1/i).    (7)

Let's take this integral for granted... (For more detail, you can look at DHS Problem 31 in Chapter 2 and read Section 2.7.)

What can we say about this result as more features are added?

[Figure 2.17 from Duda, Hart, and Stork, Pattern Classification: components of the probability of error for equal priors and a (nonoptimal) decision point x*; moving the decision boundary to the point of equal posteriors x_B eliminates the reducible error and yields the Bayes error rate.]
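A minimal sketch evaluating Eqs. (6)-(7) numerically, using the identity Q(a) = erfc(a/\sqrt{2})/2; the particular values of n are arbitrary:

```python
import math

def p_error(n):
    # Eq. (7): r^2 = 4 * sum_{i=1}^n 1/i;  Eq. (6): P(error) = Q(r/2)
    r = math.sqrt(4 * sum(1.0 / i for i in range(1, n + 1)))
    return 0.5 * math.erfc((r / 2) / math.sqrt(2))

for n in (1, 10, 100, 1000):
    print(f"n = {n:4d}: P(error) = {p_error(n):.4f}")
```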

The probability of error approaches 0 as n approaches infinity because \sum 1/i is a divergent series.
More intuitively, each additional feature is going to decrease the probability of error as long as its class means differ.
In the general case of features with differing means but the same per-feature variance, we have

    r^2 = \sum_{i=1}^{d} \left( \frac{\mu_{i1} - \mu_{i2}}{\sigma_i} \right)^2    (8)

Certainly, we prefer features that have big differences in the mean relative to their variance.
We need to note that if the probabilistic structure of the problem is completely known, then adding new features cannot increase the Bayes risk (at worst it leaves it unchanged).

So, adding dimensions is good... in theory. But, in practice, performance seems not to obey this theory.

[Figure 3.3 from Duda, Hart, and Stork, Pattern Classification: two three-dimensional distributions have nonoverlapping densities, so in three dimensions the Bayes error vanishes; when projected to a subspace (the two-dimensional x1-x2 subspace or the one-dimensional x1 subspace) there can be greater overlap of the projected distributions, and hence greater Bayes error.]

Consider again the two-class problem, but this time with unknown means \mu_1 and \mu_2. Instead, we have m labeled samples x_1, ..., x_m. Then, the best estimate of \mu for each class is the sample mean (recall the parameter estimation lecture):

    \mu = \frac{1}{m} \sum_{i=1}^{m} x_i    (9)

where the x_i come from \omega_2. The covariance of this estimate is I/m.

The probability of error is given by

    P(error) = P(x^T \mu \ge 0 \mid \omega_2) = \frac{1}{\sqrt{2\pi}} \int_{\gamma_n}^{\infty} \exp(-z^2/2)\, dz    (10)

because the statistic has a Gaussian form as n approaches infinity, where

    \gamma_n = E(z) / [\mathrm{var}(z)]^{1/2}    (11)

    E(z) = \sum_{i=1}^{n} (1/i)    (12)

    \mathrm{var}(z) = \left( 1 + \frac{1}{m} \right) \sum_{i=1}^{n} (1/i) + \frac{n}{m}    (13)

The key is that we can show

    \lim_{n \to \infty} \gamma_n = 0    (14)

and thus the probability of error approaches one-half as the dimensionality of the problem becomes very high.

Trunk performed an experiment to investigate the convergence rate of the probability of error to one-half. He simulated the problem for dimensionality 1 to 1000 and ran 500 repetitions for each dimension. We see an increase in performance initially and then a decrease (as the dimensionality of the problem grows larger than the number of training samples).
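A minimal simulation sketch in the spirit of Trunk's experiment; the training-set size m, the dimensions tested, and the trial count are arbitrary choices here, not the paper's exact protocol:

```python
import numpy as np

rng = np.random.default_rng(0)

def trunk_error_rate(n_dims, m=25, n_trials=500):
    """Empirical error rate of the plug-in rule sign(x^T mu_hat) for one dimensionality."""
    mu = 1.0 / np.sqrt(np.arange(1, n_dims + 1))    # i-th component of mu is (1/i)^(1/2)
    errors = 0
    for _ in range(n_trials):
        train = rng.normal(mu, 1.0, size=(m, n_dims))   # m labeled samples from omega_1
        mu_hat = train.mean(axis=0)
        x = rng.normal(-mu, 1.0, size=n_dims)           # one test sample from omega_2
        errors += (x @ mu_hat >= 0)                     # error if it lands on omega_1's side
    return errors / n_trials

for d in (1, 10, 100, 1000):
    print(f"dim = {d:5d}: error rate ~ {trunk_error_rate(d):.3f}")
```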

Dimension Reduction: Motivation for Dimension Reduction

The discussion on the curse of dimensionality should be enough!
Even though our problem may have a high dimension, the data will often be confined to a much lower effective dimension in most real-world problems.
Computational complexity is another important point: generally, the higher the dimension, the longer the training stage will be (and potentially the measurement and classification stages).
We seek an understanding of the underlying data in order to:
remove or reduce the influence of noisy or irrelevant features that would otherwise interfere with the classifier;
identify a small set of features (perhaps a transformation thereof) during data exploration.

Dimension Reduction: Principal Component Analysis, Version 0

We have n samples {x_1, x_2, ..., x_n}.
How can we best represent the n samples by a single vector x_0?
First, we need a distance function on the sample space. Let's use the Euclidean distance and the sum of squared distances criterion:

    J_0(x_0) = \sum_{k=1}^{n} \| x_0 - x_k \|^2    (15)

Then, we seek the value of x_0 that minimizes J_0.

We can show that the minimizer is indeed the sample mean:

    m = \frac{1}{n} \sum_{k=1}^{n} x_k    (16)

We can verify this by adding m - m into J_0:

    J_0(x_0) = \sum_{k=1}^{n} \| (x_0 - m) - (x_k - m) \|^2    (17)
             = \sum_{k=1}^{n} \| x_0 - m \|^2 + \sum_{k=1}^{n} \| x_k - m \|^2    (18)

(the cross term vanishes because \sum_k (x_k - m) = 0). Thus, J_0 is minimized when x_0 = m. Note that the second term is independent of x_0.

So, the sample mean is an initial dimension-reduced representation of the data (a zero-dimensional one). It is simple, but it does not reveal any of the variability in the data. Let's try to obtain a one-dimensional representation: i.e., let's project the data onto a line running through the sample mean.

Dimension Reduction: Principal Component Analysis, Version 1

Let e be a unit vector in the direction of the line. The standard equation for a line is then

    x = m + a e    (19)

where the scalar a \in \mathbb{R} governs which point along the line we are at, and hence corresponds to the distance of any point x from the mean m.

Represent point x_k by m + a_k e. We can find an optimal set of coefficients by again minimizing the squared-error criterion:

    J_1(a_1, \ldots, a_n, e) = \sum_{k=1}^{n} \| (m + a_k e) - x_k \|^2    (20)
                             = \sum_{k=1}^{n} a_k^2 \|e\|^2 - 2 \sum_{k=1}^{n} a_k e^T (x_k - m) + \sum_{k=1}^{n} \| x_k - m \|^2    (21)

Differentiating with respect to a_k and equating to 0 yields

    a_k = e^T (x_k - m)    (22)

This indicates that the best value for a_k is the projection of the point x_k onto the line e that passes through m.

How do we find the best direction for that line?

What if we substitute the expression we computed for the best a_k directly into the J_1 criterion?

    J_1(e) = \sum_{k=1}^{n} a_k^2 - 2 \sum_{k=1}^{n} a_k^2 + \sum_{k=1}^{n} \| x_k - m \|^2    (23)
           = - \sum_{k=1}^{n} \left[ e^T (x_k - m) \right]^2 + \sum_{k=1}^{n} \| x_k - m \|^2    (24)
           = - \sum_{k=1}^{n} e^T (x_k - m)(x_k - m)^T e + \sum_{k=1}^{n} \| x_k - m \|^2    (25)

Dimension Reduction: Principal Component Analysis, Version 1 (The Scatter Matrix)

Define the scatter matrix S as

    S = \sum_{k=1}^{n} (x_k - m)(x_k - m)^T    (26)

This should be familiar: it is a multiple of the sample covariance matrix. Putting it in:

    J_1(e) = -e^T S e + \sum_{k=1}^{n} \| x_k - m \|^2    (27)

The e that maximizes e^T S e will minimize J_1.

Dimension Reduction: Principal Component Analysis, Version 1 (Solving for e)

We use Lagrange multipliers to maximize e^T S e subject to the constraint that \|e\| = 1:

    u = e^T S e - \lambda (e^T e - 1)    (28)

Differentiating with respect to e and setting the result equal to 0:

    \frac{\partial u}{\partial e} = 2 S e - 2 \lambda e    (29)

    S e = \lambda e    (30)

Does this form look familiar?

Dimension Reduction: Principal Component Analysis, Version 1 (Eigenvectors of S)

Se = λe

This is an eigenproblem. Hence, it follows that the best one-dimensional estimate (in a least-squares sense) for the data is the eigenvector corresponding to the largest eigenvalue of S. So, we will project the data onto the largest eigenvector of S and translate it to pass through the mean.
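A minimal NumPy sketch of this one-dimensional case; the synthetic data and variable names are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [0.0, 0.5]])  # one dominant direction

m = X.mean(axis=0)
S = (X - m).T @ (X - m)            # scatter matrix, Eq. (26)
evals, evecs = np.linalg.eigh(S)   # S is symmetric, so eigh applies
e = evecs[:, np.argmax(evals)]     # eigenvector with the largest eigenvalue

a = (X - m) @ e                    # coefficients a_k = e^T (x_k - m), Eq. (22)
X_1d = m + np.outer(a, e)          # one-dimensional representation m + a_k e
```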

Dimension Reduction: Principal Component Analysis

We're already done... basically.

This idea readily extends to multiple dimensions, say d' < d dimensions. We replace the earlier equation of the line with

    x = m + \sum_{i=1}^{d'} a_i e_i    (31)

And we have a new criterion function:

    J_{d'} = \sum_{k=1}^{n} \left\| \left( m + \sum_{i=1}^{d'} a_{ki} e_i \right) - x_k \right\|^2    (32)

J_{d'} is minimized when the vectors e_1, ..., e_{d'} are the d' eigenvectors of the scatter matrix having the largest eigenvalues. These vectors are orthogonal. They form a natural set of basis vectors for representing any feature x.

The coefficients a_i are called the principal components. Visualize the basis vectors as the principal axes of a hyperellipsoid surrounding the data (a cloud of points). Principal component analysis reduces the dimension of the data by restricting attention to those directions of maximum variation, or scatter.
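A minimal sketch of the d'-dimensional case, generalizing the one-dimensional snippet above; the function name and data are illustrative assumptions:

```python
import numpy as np

def pca_project(X, d_prime):
    """Project rows of X onto the d' leading eigenvectors of the scatter matrix."""
    m = X.mean(axis=0)
    S = (X - m).T @ (X - m)                 # scatter matrix
    evals, evecs = np.linalg.eigh(S)        # eigenvalues in ascending order
    E = evecs[:, ::-1][:, :d_prime]         # top-d' eigenvectors as columns
    A = (X - m) @ E                         # principal components a_{ki}
    return A, E, m

X = np.random.default_rng(2).normal(size=(100, 5))
A, E, m = pca_project(X, d_prime=2)
X_hat = m + A @ E.T                         # reconstruction from the reduced representation
```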

Dimension Reduction: Fisher Linear Discriminant

Description vs. Discrimination.
PCA is likely going to be useful for representing data.
But there is no reason to assume that it would be good for discriminating between two classes of data; think of 'Q' versus 'O', where the discriminating tail accounts for very little of the total variation.
Discriminant analysis seeks directions that are efficient for discrimination.

Dimension Reduction: Fisher Linear Discriminant (Let's Formalize the Situation)

Suppose we have a set of n d-dimensional samples with n_1 in set D_1, and similarly for set D_2:

    D_i = \{ x_1, \ldots, x_{n_i} \}, \quad i \in \{1, 2\}    (33)

We can form a linear combination of the components of a sample x:

    y = w^T x    (34)

which yields a corresponding set of n samples y_1, \ldots, y_n split into subsets Y_1 and Y_2.

Dimension Reduction: Fisher Linear Discriminant (Geometrically, This is a Projection)

If we constrain the norm of w to be 1 (i.e., \|w\| = 1), then we can conceptualize each y_i as the projection of the corresponding x_i onto a line in the direction of w.

[Figure 3.5 from Duda, Hart, and Stork, Pattern Classification: projection of the same set of samples onto two different lines in the directions marked w; the figure on the right shows greater separation between the red and black projected points.]

Does the magnitude of w have any real significance?

Dimension Reduction Fisher Linear Discriminant What is a Good Projection?

For our two-class setup, it should be clear that we want the projection that will have those samples from class ω1 falling into one cluster (on the line) and those samples from class ω2 falling into a separate cluster (on the line).

JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 32 / 101 So, how do we find the best direction w?

Dimension Reduction Fisher Linear Discriminant What is a Good Projection?

For our two-class setup, it should be clear that we want the projection that will have those samples from class ω1 falling into one cluster (on the line) and those samples from class ω2 falling into a separate cluster (on the line). However, this may not be possible depending on our underlying classes.

JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 32 / 101 Dimension Reduction Fisher Linear Discriminant What is a Good Projection?

For our two-class setup, it should be clear that we want the projection that will have those samples from class ω1 falling into one cluster (on the line) and those samples from class ω2 falling into a separate cluster (on the line). However, this may not be possible depending on our underlying classes. So, how do we find the best direction w?

JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 32 / 101 Then the sample mean for the projected points is 1 m˜ i = y (36) ni y i X∈Y 1 = wTx (37) ni x i X∈D T = w mi . (38)

And, thus, is simply the projection of mi.

Dimension Reduction Fisher Linear Discriminant Separation of the Projected Means

Let mi be the d-dimensional sample mean for class i: 1 mi = x . (35) ni x i X∈D

JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 33 / 101 Dimension Reduction Fisher Linear Discriminant Separation of the Projected Means

Let mi be the d-dimensional sample mean for class i: 1 mi = x . (35) ni x i X∈D Then the sample mean for the projected points is 1 m˜ i = y (36) ni y i X∈Y 1 = wTx (37) ni x i X∈D T = w mi . (38)

And, thus, is simply the projection of mi.

JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 33 / 101 The scale of w: we can make this distance arbitrarily large by scaling w. Rather, we want to make this distance large relative to some measure of the standard deviation.

Dimension Reduction Fisher Linear Discriminant Distance Between Projected Means

The distance between projected means is thus

T |m˜ 1 − m˜ 2| = |w (m1 − m2)| (39)

JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 34 / 101 Rather, we want to make this distance large relative to some measure of the standard deviation.

Dimension Reduction Fisher Linear Discriminant Distance Between Projected Means

The distance between projected means is thus

T |m˜ 1 − m˜ 2| = |w (m1 − m2)| (39)

The scale of w: we can make this distance arbitrarily large by scaling w.

JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 34 / 101 Dimension Reduction Fisher Linear Discriminant Distance Between Projected Means

The distance between projected means is thus

T |m˜ 1 − m˜ 2| = |w (m1 − m2)| (39)

The scale of w: we can make this distance arbitrarily large by scaling w. Rather, we want to make this distance large relative to some measure of the standard deviation.

JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 34 / 101 From this, we can estimate the variance of the pooled data: 1 s˜2 +s ˜2 . (41) n 1 2 2 2  s˜1 +s ˜2 is called the total within-class scatter of the projected samples.

Dimension Reduction Fisher Linear Discriminant The Scatter

To capture this variation, we compute the scatter

2 2 s˜i = (y − m˜ i) (40) y i X∈Y which is essentially a scaled sampled variance.

JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 35 / 101 Dimension Reduction Fisher Linear Discriminant The Scatter

To capture this variation, we compute the scatter

2 2 s˜i = (y − m˜ i) (40) y i X∈Y which is essentially a scaled sampled variance. From this, we can estimate the variance of the pooled data: 1 s˜2 +s ˜2 . (41) n 1 2 2 2  s˜1 +s ˜2 is called the total within-class scatter of the projected samples.

JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 35 / 101 This term is the ratio of the distance between the projected means scaled by the within-class scatter (the variation of the data). Recall the similar term from earlier in the lecture which indicated the amount a feature will reduce the probability of error is proportional to the ratio of the difference of the means to the variance. FLD will choose the maximum... We need to rewrite J(·) as a function of w.

Dimension Reduction Fisher Linear Discriminant Fisher Linear Discriminant

The Fisher Linear Discriminant will select the w that maximizes

2 |m˜ 1 − m˜ 2| J(w) = 2 2 . (42) s˜1 +s ˜2 It does so independently of the magnitude of w.

JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 36 / 101 We need to rewrite J(·) as a function of w.

Dimension Reduction Fisher Linear Discriminant Fisher Linear Discriminant

The Fisher Linear Discriminant will select the w that maximizes

2 |m˜ 1 − m˜ 2| J(w) = 2 2 . (42) s˜1 +s ˜2 It does so independently of the magnitude of w. This term is the ratio of the distance between the projected means scaled by the within-class scatter (the variation of the data). Recall the similar term from earlier in the lecture which indicated the amount a feature will reduce the probability of error is proportional to the ratio of the difference of the means to the variance. FLD will choose the maximum...

JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 36 / 101 Dimension Reduction Fisher Linear Discriminant Fisher Linear Discriminant

The Fisher Linear Discriminant will select the w that maximizes

2 |m˜ 1 − m˜ 2| J(w) = 2 2 . (42) s˜1 +s ˜2 It does so independently of the magnitude of w. This term is the ratio of the distance between the projected means scaled by the within-class scatter (the variation of the data). Recall the similar term from earlier in the lecture which indicated the amount a feature will reduce the probability of error is proportional to the ratio of the difference of the means to the variance. FLD will choose the maximum... We need to rewrite J(·) as a function of w.

JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 36 / 101 SW is called the within-class scatter matrix.

SW is symmetric and positive semidefinite.

In typical cases, when is SW nonsingular?

Dimension Reduction Fisher Linear Discriminant Within-Class Scatter Matrix

Define scatter matrices Si:

T Si = (x − mi)(x − mi) (43)

x i X∈D and

SW = S1 + S2 . (44)

JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 37 / 101 Dimension Reduction Fisher Linear Discriminant Within-Class Scatter Matrix

Define scatter matrices Si:

T Si = (x − mi)(x − mi) (43)

x i X∈D and

SW = S1 + S2 . (44)

SW is called the within-class scatter matrix.

SW is symmetric and positive semidefinite.

In typical cases, when is SW nonsingular?

JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 37 / 101 Dimension Reduction Fisher Linear Discriminant Within-Class Scatter Matrix

Deriving the sum of scatters.

2 2 T T s˜i = w x − w mi (45) x i X∈D   T T = w (x − mi)(x − mi) w (46)

x i X∈D T = w Siw (47)

We can therefore write the sum of the scatters as an explicit function of w:

2 2 T s˜1 +s ˜2 = w SW w (48)

JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 38 / 101 Dimension Reduction Fisher Linear Discriminant Between-Class Scatter Matrix

The separation of the projected means obeys

2 2 T T (m ˜ 1 − m˜ 2) = w m1 − w m2 (49)  T  T = w (m1 − m2)(m1 − m2) w (50) T = w SBw (51)

Here, SB is called the between-class scatter matrix:

T SB = (m1 − m2)(m1 − m2) (52)

SB is also symmetric and positive semidefinite.

When is SB nonsingular?

JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 39 / 101 The vector w that maximizes J(·) must satisfy

SBw = λSW w (54)

which is a generalized eigenvalue problem.

For nonsingular SW (typical), we can write this as a standard eigenvalue problem:

1 SW− SBw = λw (55)

Dimension Reduction Fisher Linear Discriminant Rewriting the FLD J(·)

We can rewrite our objective as a function of w.

T w SBw J(w) = T (53) w SW w This is the generalized Rayleigh quotient.

JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 40 / 101 For nonsingular SW (typical), we can write this as a standard eigenvalue problem:

1 SW− SBw = λw (55)

Dimension Reduction Fisher Linear Discriminant Rewriting the FLD J(·)

We can rewrite our objective as a function of w.

T w SBw J(w) = T (53) w SW w This is the generalized Rayleigh quotient. The vector w that maximizes J(·) must satisfy

SBw = λSW w (54)

which is a generalized eigenvalue problem.

JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 40 / 101 Dimension Reduction Fisher Linear Discriminant Rewriting the FLD J(·)

We can rewrite our objective as a function of w.

T w SBw J(w) = T (53) w SW w This is the generalized Rayleigh quotient. The vector w that maximizes J(·) must satisfy

SBw = λSW w (54)

which is a generalized eigenvalue problem.

For nonsingular SW (typical), we can write this as a standard eigenvalue problem:

1 SW− SBw = λw (55)

JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 40 / 101 The FLD converts a many-dimensional problem to a one-dimensional one. One still must find the threshold. This is easy for known densities, but not so easy in general.

Dimension Reduction Fisher Linear Discriminant Simplifying

Since kwk is not important and SBw is always in the direction of (m1 − m2), we can simplify

1 w∗ = SW− (m1 − m2) (56)

w∗ maximizes J(·) and is the Fisher Linear Discriminant.

JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 41 / 101 One still must find the threshold. This is easy for known densities, but not so easy in general.

Dimension Reduction Fisher Linear Discriminant Simplifying

Since kwk is not important and SBw is always in the direction of (m1 − m2), we can simplify

1 w∗ = SW− (m1 − m2) (56)

w∗ maximizes J(·) and is the Fisher Linear Discriminant. The FLD converts a many-dimensional problem to a one-dimensional one.

JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 41 / 101 Dimension Reduction Fisher Linear Discriminant Simplifying

Since kwk is not important and SBw is always in the direction of (m1 − m2), we can simplify

1 w∗ = SW− (m1 − m2) (56)

w∗ maximizes J(·) and is the Fisher Linear Discriminant. The FLD converts a many-dimensional problem to a one-dimensional one. One still must find the threshold. This is easy for known densities, but not so easy in general.

JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 41 / 101 Comparisons Classic A Classic PCA vs. FLD comparison Source: Belhumeur et al. IEEE TPAMI 19(7) 711–720. 1997. 714 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 19, NO. 7, JULY 1997 classification. ThisPCA method maximizes selects W in the [1] in total such a way that the ratio ofscatter the between-class across scatter all classes. and the within- class scatter is maximized. Let the between-classPCA projections scatter matrix be are defined thus as c T =--mmmm optimalSNBiiÂchch for reconstructioni fromi=1 a low dimensional and the within-class scatter matrix be defined as basis,c but not necessarily T S =--chchxxmm fromW  a discriminationk i k i i=1xk ŒXi where m i is thestandpoint. mean image of class Xi , and Ni is the num- ber of samples in class X . If S is nonsingular, the optimal FLD maximizesi W the ratio of projection Wopt is chosen as the matrix with orthonormal columns whichthe maximizes between-class the ratio of the scatter determinant of the between-class scatter matrix of the projected samples to the determinantand of the the within-class within-class scatter matrix scatter. of the projected samples, i.e.,

FLD tries toT “shape” the Fig. 2. A comparison of principal component analysis (PCA) and WSWB Fisher’s linear discriminant (FLD) for a two class problem where data Wscatteropt = arg max to make it more for each class lies near a linear subspace. W WSWT effective for classification.W K = ww12 wm (4) method, which we call Fisherfaces, avoids this problem by JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 42 / 101 K projecting the image set to a lower dimensional space so where nswi im=12,, , is the set of generalized eigen- that the resulting within-class scatter matrix SW is nonsin- vectors of SB and SW corresponding to the m largest gener- gular. This is achieved by using PCA to reduce the dimen- K alized eigenvalues nsli im=12,, , , i.e., sion of the feature space to N - c, and then applying the K standard FLD defined by (4) to reduce the dimension to c - 1. SSiBiww==l iWi,,,,12 m More formally, W is given by Note that there are at most c - 1 nonzero generalized eigen- opt values, and so an upper bound on m is c - 1, where c is the T T T WWWopt = fld pca (5) number of classes. See [4]. To illustrate the benefits of class specific linear projec- where tion, we constructed a low dimensional analogue to the T WWSWpca =arg max T classification problem in which the samples from each class W lie near a linear subspace. Fig. 2 is a comparison of PCA T T WWpca SWBpca W and FLD for a two-class problem in which the samples from Wfld =arg max W T T each class are randomly perturbed in a direction perpen- WWpca SWWpca W dicular to a linear subspace. For this example, N = 20, n = 2, Note that the optimization for W is performed over and m = 1. So, the samples from each class lie near a line pca passing through the origin in the 2D feature space. Both n ¥ (N - c) matrices with orthonormal columns, while the PCA and FLD have been used to project the points from 2D optimization for W is performed over (N - c) ¥ m matrices down to 1D. Comparing the two projections in the figure, fld PCA actually smears the classes together so that they are no with orthonormal columns. In computing Wpca , we have longer linearly separable in the projected space. It is clear thrown away only the smallest c - 1 principal components. that, although PCA achieves larger total scatter, FLD There are certainly other ways of reducing the within- achieves greater between-class scatter, and, consequently, class scatter while preserving between-class scatter. For classification is simplified. example, a second method which we are currently investi- In the face recognition problem, one is confronted with gating chooses W to maximize the between-class scatter of Rnn¥ the projected samples after having first reduced the within- the difficulty that the within-class scatter matrix SW Œ is always singular. This stems from the fact that the rank of class scatter. Taken to an extreme, we can maximize the between-class scatter of the projected samples subject to the S is at most N - c, and, in general, the number of images W constraint that the within-class scatter is zero, i.e., in the learning set N is much smaller than the number of pixels in each image n. This means that it is possible to T WWSWopt = arg max B (6) choose the matrix W such that the within-class scatter of the W Œ: projected samples can be made exactly zero. where : is the set of n ¥ m matrices with orthonormal col- In order to overcome the complication of a singular S , W umns contained in the kernel of SW . we propose an alternative to the criterion in (4). 
This IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 19, NO. 7, JULY 1997 711
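A minimal sketch of that two-stage Fisherface projection: my own NumPy illustration of the recipe quoted above, using the multi-class scatter definitions from the paper; an SVD stands in for an explicit eigendecomposition of the total scatter, and all names are assumptions rather than the authors' code:

```python
import numpy as np

def fisherfaces(X, labels, c):
    """PCA down to N - c dimensions, then multi-class FLD down to c - 1."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    N, n = X.shape
    Xc = X - X.mean(axis=0)

    # PCA stage: top N - c principal directions of the total scatter
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    W_pca = Vt[: N - c].T                        # n x (N - c)
    Y = Xc @ W_pca                               # reduced training data

    # FLD stage: leading eigenvectors of S_W^{-1} S_B in the reduced space
    d = N - c
    S_W, S_B = np.zeros((d, d)), np.zeros((d, d))
    m = Y.mean(axis=0)
    for k in np.unique(labels):
        Yk = Y[labels == k]
        mk = Yk.mean(axis=0)
        S_W += (Yk - mk).T @ (Yk - mk)
        S_B += len(Yk) * np.outer(mk - m, mk - m)
    evals, evecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
    order = np.argsort(-evals.real)[: c - 1]
    W_fld = evecs[:, order].real                 # (N - c) x (c - 1)

    return W_pca @ W_fld                         # overall n x (c - 1) projection
```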

Comparisons, Case Study: Faces (EigenFaces versus FisherFaces)
(Source: Belhumeur et al., IEEE TPAMI 19(7):711-720, 1997.)

Analysis of classic pattern recognition techniques (PCA and FLD) to do face recognition.
Fixed pose but varying illumination.

[The slide shows the first page of the paper, "Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projection" by Peter N. Belhumeur, Joao P. Hespanha, and David J. Kriegman: the abstract describes a face recognition algorithm, based on Fisher's Linear Discriminant, that is insensitive to large variations in lighting direction and facial expression and reports lower error rates than the Eigenface technique on the Harvard and Yale Face Databases; Fig. 1 shows that the same person, with the same expression and viewpoint, can appear dramatically different when the dominant light source changes direction.]

[The next slide shows results from p. 717 of the paper. Fig. 6 (interpolation): each method is trained on images under both near-frontal and extreme lighting (Subsets 1 and 5) and tested under intermediate lighting. Error rates (%) on Subsets 2 / 3 / 4, with the reduced space dimension in parentheses:
Eigenface (4): 53.3 / 75.4 / 52.9
Eigenface (10): 11.11 / 33.9 / 20.0
Eigenface w/o first 3 components (4): 31.11 / 60.0 / 29.4
Eigenface w/o first 3 components (10): 6.7 / 20.0 / 12.9
Correlation (129): 0.0 / 21.54 / 7.1
Linear Subspace (15): 0.0 / 1.5 / 0.0
Fisherface (4): 0.0 / 0.0 / 1.2
Fig. 7: the Yale database contains 160 frontal face images covering 16 individuals taken under 10 different conditions: a normal image under ambient lighting, one with or without glasses, three images taken with different point light sources, and five different facial expressions.]

Comparisons, Case Study: Faces (Is Linear Okay For Faces?)
(Source: Belhumeur et al., IEEE TPAMI 19(7):711-720, 1997.)

Fact: all of the images of a Lambertian surface, taken from a fixed viewpoint but under varying illumination, lie in a 3D linear subspace of the high-dimensional image space.
But, in the presence of shadowing, specularities, and facial expressions, the above statement will not hold. This will ultimately result in deviations from the 3D linear subspace and worse classification accuracy.

Comparisons, Case Study: Faces (Method 1: Correlation)
(Source: Belhumeur et al., IEEE TPAMI 19(7):711-720, 1997.)

Consider a set of N training images, {x_1, x_2, ..., x_N}. We know that each of the N images belongs to one of c classes, and we can define a function C(\cdot) that maps an image x to its class \omega_c.
Pre-processing: each image is normalized to have zero mean and unit variance. Why? It gets rid of the light source intensity and the effects of a camera's automatic gain control.
For a query image x, we select the class of the training image that is the nearest neighbor in the image space:

    x^* = \arg\min_{x_i} \| x - x_i \|, \text{ then decide } C(x^*)    (57)

where we have vectorized each image.
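A minimal sketch of this correlation (nearest-neighbor) rule, Eq. (57), including the zero-mean, unit-variance pre-processing; the function names are illustrative:

```python
import numpy as np

def normalize(img):
    """Vectorize an image and normalize it to zero mean and unit variance."""
    v = np.asarray(img, dtype=float).ravel()
    return (v - v.mean()) / v.std()

def correlation_classify(query, train_images, train_labels):
    """Nearest-neighbor rule of Eq. (57) on normalized images."""
    q = normalize(query)
    dists = [np.linalg.norm(q - normalize(x)) for x in train_images]
    return train_labels[int(np.argmin(dists))]
```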

What are the advantages and disadvantages of the correlation-based method in this context?
Computationally expensive.
Requires a large amount of storage.
Noise may play a role.
Highly parallelizable.

Comparisons, Case Study: Faces (Method 2: EigenFaces)
(Source: Belhumeur et al., IEEE TPAMI 19(7):711-720, 1997. The original paper is Turk and Pentland, "Eigenfaces for Recognition," Journal of Cognitive Neuroscience, 3(1), 1991.)

Quickly recall the main idea of PCA. Define a linear projection of the original n-dimensional image space into an m-dimensional space, with m < n (indeed m \ll n), which yields new vectors y:

    y_k = W^T x_k, \quad k = 1, 2, \ldots, N    (58)

where W \in \mathbb{R}^{n \times m}.

Define the total scatter matrix S_T as

    S_T = \sum_{k=1}^{N} (x_k - \mu)(x_k - \mu)^T    (59)

where \mu is the sample mean image.

The scatter of the projected vectors {y_1, y_2, ..., y_N} is

    W^T S_T W    (60)

W_{opt} is chosen to maximize the determinant of the total scatter matrix of the projected vectors:

    W_{opt} = \arg\max_W |W^T S_T W|    (61)
            = [\, w_1 \;\; w_2 \;\; \ldots \;\; w_m \,]    (62)

where \{ w_i \mid i = 1, 2, \ldots, m \} is the set of n-dimensional eigenvectors of S_T corresponding to the largest m eigenvalues.
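A minimal sketch of computing such an Eigenface basis in NumPy; the SVD of the centered data is used as an equivalent, better-conditioned route to the leading eigenvectors of S_T, and the names are mine, not from the papers:

```python
import numpy as np

def eigenfaces(X, m):
    """Top-m eigenvectors of the total scatter of vectorized images X (N x n)."""
    mu = X.mean(axis=0)
    # right singular vectors of (X - mu) are the eigenvectors of S_T, Eq. (59)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    W = Vt[:m].T                       # n x m projection matrix W_opt
    return W, mu

def project(W, mu, x):
    """Coordinates y = W^T (x - mu) of an image, centered at the mean image."""
    return W.T @ (x - mu)
```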

Comparisons, Case Study: Faces (An Example)

Source: http://www.cs.princeton.edu/~cdecoro/eigenfaces/ (not sure if this dataset includes lighting variation...)

What are the advantages and disadvantages of the Eigenfaces method in this context?
The scatter being maximized is due not only to the between-class scatter that is useful for classification, but also to the within-class scatter, which is generally undesirable for classification.
If PCA is presented with faces under varying illumination, the projection matrix W_{opt} will contain principal components that retain the variation in lighting. If this variation is higher than the variation due to class identity, then PCA will suffer greatly for classification.
It yields a more compact representation than the correlation-based method.

Comparisons Case Study: Faces Method 3: Linear Subspaces Source: Belhumeur et al. IEEE TPAMI 19(7) 711–720. 1997.

Use the Lambertian model directly. Consider a point p on a Lambertian surface illuminated by a point light source at infinity. Let s ∈ R^3 denote the product of the light source intensity with the unit vector for the light source direction. The image intensity of the surface at p when viewed by a camera is

E(p) = a(p) n(p)^T s    (63)

where n(p) is the unit inward normal vector to the surface at point p, and a(p) is the albedo of the surface at p (a scalar). Hence, the image intensity of the point p is linear in s.

So, if we assume no shadowing, given three images of a Lambertian surface from the same viewpoint under three known, linearly independent light source directions, the albedo and surface normal can be recovered. Alternatively, one can reconstruct the image of the surface under an arbitrary lighting direction by a linear combination of the three original images. This fact can be used for classification.

For each face (class), use three or more images taken under different lighting conditions to construct a 3D basis for the linear subspace. For recognition, compute the distance of a new image to each linear subspace and choose the face corresponding to the shortest distance.
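A minimal sketch of this classifier under the assumptions above (illustrative names, not the authors' code): build an orthonormal basis for each person's three-dimensional illumination subspace from three or more vectorized training images, then assign a new image to the class with the smallest residual after orthogonal projection onto that subspace.

```python
import numpy as np

def illumination_basis(images, dim=3):
    """images: k x n array of vectorized images of one person (k >= 3)
    under different lighting. Returns an n x dim orthonormal basis."""
    U, _, _ = np.linalg.svd(images.T, full_matrices=False)
    return U[:, :dim]

def classify(bases, image):
    """bases: list of n x dim orthonormal bases, one per class.
    Returns the class whose linear subspace is closest to `image`."""
    residuals = [np.linalg.norm(image - B @ (B.T @ image)) for B in bases]
    return int(np.argmin(residuals))
```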

Pros and Cons?

If there is no noise or shadowing, this method will achieve error-free classification under any lighting conditions (and if the surface is indeed Lambertian). But faces inevitably have self-shadowing, and faces have expressions... It is also still fairly computationally expensive (linear in the number of classes).

Comparisons Case Study: Faces Method 4: FisherFaces Source: Belhumeur et al. IEEE TPAMI 19(7) 711–720. 1997.

Recall the Fisher Linear Discriminant setup. The between-class scatter matrix is

S_B = \sum_{i=1}^{c} N_i (µ_i − µ)(µ_i − µ)^T    (64)

The within-class scatter matrix is

S_W = \sum_{i=1}^{c} \sum_{x_k \in D_i} (x_k − µ_i)(x_k − µ_i)^T    (65)

The optimal projection W_opt is chosen as the matrix with orthonormal columns that maximizes the ratio of the determinant of the between-class scatter matrix of the projected vectors to the determinant of the within-class scatter of the projected vectors:

W_opt = \arg\max_W \frac{|W^T S_B W|}{|W^T S_W W|}    (66)

The eigenvectors {w_i | i = 1, 2, ..., m} corresponding to the m largest eigenvalues of the following generalized eigenvalue problem comprise W_opt:

S_B w_i = λ_i S_W w_i,   i = 1, 2, \ldots, m    (67)

This is a multi-class version of the FLD, which we will discuss in more detail later.

In face recognition, things get a little more complicated because the within-class scatter matrix S_W is always singular. This is because the rank of S_W is at most N − c, and the number of images in the learning set is commonly much smaller than the number of pixels in each image. This means we can choose W such that the within-class scatter of the projected samples is exactly zero.

To overcome this, project the image set to a lower-dimensional space so that the resulting within-class scatter matrix is nonsingular. How? Use PCA to first reduce the dimension to N − c and then FLD to reduce it to c − 1. W_opt^T is given by the product W_FLD^T W_PCA^T, where

W_PCA = \arg\max_W |W^T S_T W|    (68)

W_FLD = \arg\max_W \frac{|W^T W_PCA^T S_B W_PCA W|}{|W^T W_PCA^T S_W W_PCA W|}    (69)
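A minimal sketch of this Fisherface pipeline (Eqs. 64–69), assuming SciPy is available; the helper names are illustrative, and the generalized eigenproblem of Eq. (67) is solved in the PCA-reduced space, where S_W is generically nonsingular.

```python
import numpy as np
from scipy.linalg import eigh

def scatter_matrices(X, labels):
    """Between- and within-class scatter S_B, S_W (Eqs. 64-65), rows of X."""
    mu = X.mean(axis=0)
    d = X.shape[1]
    Sb, Sw = np.zeros((d, d)), np.zeros((d, d))
    for c in np.unique(labels):
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        Sb += len(Xc) * np.outer(mc - mu, mc - mu)
        Sw += (Xc - mc).T @ (Xc - mc)
    return Sb, Sw

def fisherfaces(X, labels):
    """W_opt = W_PCA W_FLD (so W_opt^T = W_FLD^T W_PCA^T, Eqs. 68-69).
    X is N x n, one vectorized image per row."""
    N = X.shape[0]
    c = len(np.unique(labels))
    mu = X.mean(axis=0)
    # Step 1: PCA to N - c dimensions (top eigenvectors of the total scatter).
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    Wpca = Vt[:N - c].T                              # n x (N - c)
    # Step 2: FLD in the reduced space, keeping the top c - 1 directions of
    # the generalized eigenproblem S_B w = lambda S_W w (Eq. 67).
    Sb, Sw = scatter_matrices((X - mu) @ Wpca, labels)
    vals, vecs = eigh(Sb, Sw)                        # ascending eigenvalues
    Wfld = vecs[:, -(c - 1):][:, ::-1]               # largest c - 1 first
    return Wpca @ Wfld                               # n x (c - 1)
```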

Comparisons Case Study: Faces Experiment 1 and 2: Variation in Lighting Source: Belhumeur et al. IEEE TPAMI 19(7) 711–720. 1997.

Hypothesis: face recognition algorithms will perform better if they exploit the fact that images of a Lambertian surface lie in a linear subspace. Used Hallinan's Harvard database, which sampled the space of light source directions in 15-degree increments. Used 330 images of five people (66 of each) and extracted five subsets. Classification is nearest neighbor in all cases.

[Figs. 3 and 4 of the paper show the sampled light source directions and example images from each subset of the Harvard database. Subset 1 (30 images) has both the longitudinal and latitudinal angles of the light source direction within 15 degrees of the camera axis; Subsets 2 through 5 (45, 65, 85, and 105 images) have the greater of the two angles at 30, 45, 60, and 75 degrees from the camera axis, respectively. For the Eigenface and correlation tests, images were normalized to zero mean and unit variance; Eigenface results are reported with ten principal components and also with components four through thirteen (dropping the first three).]

Comparisons Case Study: Faces Experiment 1: Variation in Lighting – Extrapolation Source: Belhumeur et al. IEEE TPAMI 19(7) 711–720. 1997.

Train on Subset 1.

Test on Subsets 1, 2, and 3.

Extrapolating from Subset 1
Method                 Reduced Space   Error Rate (%)
                                       Subset 1   Subset 2   Subset 3
Eigenface              4               0.0        31.1       47.7
Eigenface              10              0.0        4.4        41.5
Eigenface w/o 1st 3    4               0.0        13.3       41.5
Eigenface w/o 1st 3    10              0.0        4.4        27.7
Correlation            29              0.0        0.0        33.9
Linear Subspace        15              0.0        4.4        9.2
Fisherface             4               0.0        0.0        4.6

Fig. 5 (Extrapolation): When each of the methods is trained on images with near-frontal illumination (Subset 1), the table shows the relative performance under extreme light source conditions. Since there are 30 images in the training set, correlation is equivalent to the Eigenface method using 29 principal components.

Comparisons Case Study: Faces Experiment 2: Variation in Lighting – Interpolation Source: Belhumeur et al. IEEE TPAMI 19(7) 711–720. 1997.

Train on Subsets 1 and 5. Test on Subsets 2, 3, and 4.

Interpolating between Subsets 1 and 5
Method                 Reduced Space   Error Rate (%)
                                       Subset 2   Subset 3   Subset 4
Eigenface              4               53.3       75.4       52.9
Eigenface              10              11.11      33.9       20.0
Eigenface w/o 1st 3    4               31.11      60.0       29.4
Eigenface w/o 1st 3    10              6.7        20.0       12.9
Correlation            129             0.0        21.54      7.1
Linear Subspace        15              0.0        1.5        0.0
Fisherface             4               0.0        0.0        1.2

Fig. 6 (Interpolation): When each of the methods is trained on images from both near-frontal and extreme lighting (Subsets 1 and 5), the table shows the relative performance under intermediate lighting conditions.

Comparisons Case Study: Faces Experiment 1 and 2: Variation in Lighting Source: Belhumeur et al. IEEE TPAMI 19(7) 711–720. 1997.

Observations:

All of the algorithms perform perfectly when lighting is nearly frontal. However, as lighting is moved off axis, there is a significant difference between the methods (specifically, between the class-specific methods and the Eigenface method).

The experiments empirically demonstrate that the Eigenface method is equivalent to correlation when the number of Eigenfaces equals the size of the training set (Exp. 1).

In the Eigenface method, removing the first three principal components results in better performance under variable lighting conditions.

The Linear Subspace method has error rates comparable with the Fisherface method, but it requires storing roughly three times as much information and takes three times as long.

The Fisherface method had error rates lower than the Eigenface method and required less computation time.

Comparisons Case Study: Faces Experiment 3: Yale DB Source: Belhumeur et al. IEEE TPAMI 19(7) 711–720. 1997.

Uses a second database, constructed at Yale, of 16 subjects with ten images each: a normal image under ambient lighting, one with/without glasses, three images taken with different point light sources, and five different facial expressions. Error rates are computed with the "leaving-one-out" strategy and a nearest neighbor classifier, at two crop scales (closely cropped faces and full faces).

"Leaving-One-Out" of Yale Database
Method                 Reduced Space   Error Rate (%)
                                       Close Crop   Full Face
Eigenface              30              24.4         19.4
Eigenface w/o 1st 3    30              15.3         10.8
Correlation            160             23.9         20.0
Linear Subspace        48              21.6         15.6
Fisherface             15              7.3          0.6

Fig. 9: The table shows the relative performance of the algorithms when applied to the Yale database, which contains variation in facial expression and lighting. The Fisherface method had error rates better than half those of any other method. The Linear Subspace method fared comparatively worse here than in the lighting experiments: because of variation in facial expression, the images no longer lie in a linear subspace. The Fisherface method tends to discount the portions of the image that are not significant for recognizing an individual (e.g., the region around the mouth, which varies considerably across expressions), while regions such as the nose, cheeks, and brow are stable within class.

Comparisons Case Study: Faces Experiment 3: Comparing Variation with Number of Principal Components Source: Belhumeur et al. IEEE TPAMI 19(7) 711–720. 1997.

Fig. 8: As demonstrated on the Yale database, the performance of the Eigenface method varies with the number of principal components retained; dropping the first three appears to improve performance.


Comparisons Case Study: Faces Can PCA outperform FLD for recognition? Source: Martínez and Kak. PCA versus LDA. PAMI 23(2), 2001.

There may be situations in which PCA might outperform FLD. Can you think of such a situation?

Key: When we have few training data, then it may be preferable to describe the total scatter.

[Figure from Martínez and Kak: two different classes embedded in two different Gaussian-like distributions, with only two samples per class supplied to the learning procedure (PCA or LDA). The classification result of the PCA procedure, using only the first eigenvector, is more desirable than the result of the LDA. D_PCA and D_LDA mark the decision thresholds obtained with nearest-neighbor classification.]

The paper's argument: the switch from PCA to LDA may not always be warranted and may sometimes lead to faulty system design, especially if the learning database is small or nonuniformly distributed. Taking all of the data into account, PCA computes the direction of largest variance, while LDA computes the direction that best discriminates between the two classes; with so few samples per class, the LDA direction generalizes poorly, and the PCA-based nearest-neighbor threshold separates the underlying class distributions better. As additional evidence, in the September FERET competition an LDA approach compared unfavorably with a standard PCA approach when only one or two learning samples per class were available. Of course, given large and representative learning datasets, LDA should outperform PCA; the paper confirms this on the AR face database, which provides many facial shots per subject (different expressions, illumination conditions, and/or occlusions). Note that FERET deals with a very large number of classes, but the number of classes is not the issue here; the concern is insufficient data per class available for learning.

They tested on the AR face database and found affirmative results.

[Figures from Martínez and Kak: performance curves for three different ways of dividing the data into a training set and a testing set, and performance curves for the high-dimensional case.]
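To explore this question on synthetic data, here is a small sketch (purely illustrative, not the paper's experiment): it computes a one-dimensional PCA direction and a two-class Fisher direction from the same tiny training set and scores each by nearest projected-class-mean accuracy on held-out samples. With only a couple of samples per class, the Fisher direction depends on a very noisy within-class scatter estimate and can come out worse than the label-free PCA direction, which is the point of the figure above.

```python
import numpy as np

def pca_direction(X):
    """First principal component of the pooled (label-free) training data."""
    _, _, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
    return Vt[0]

def fisher_direction(X0, X1, reg=1e-8):
    """Two-class Fisher direction S_W^{-1}(m_1 - m_0); a tiny ridge term is
    added because S_W estimated from very few samples may be singular."""
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    Sw = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
    return np.linalg.solve(Sw + reg * np.eye(Sw.shape[0]), m1 - m0)

def accuracy(w, train0, train1, test0, test1):
    """Nearest projected-class-mean accuracy along the direction w."""
    c0, c1 = (train0 @ w).mean(), (train1 @ w).mean()
    right = np.sum(np.abs(test0 @ w - c0) <= np.abs(test0 @ w - c1))
    right += np.sum(np.abs(test1 @ w - c1) < np.abs(test1 @ w - c0))
    return right / (len(test0) + len(test1))
```

Drawing, say, two training samples per class from two elongated Gaussian-like classes and repeating over many random draws lets you see how sensitive the small-sample Fisher direction is to the particular samples drawn, relative to the PCA direction.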

Extensions Multiple Discriminant Analysis Multiple Discriminant Analysis

We can generalize the Fisher Linear Discriminant to multiple classes. Indeed we saw it done in the Case Study on Faces. But, let’s cover it with some rigor. Let’s make sure we’re all at the same place: How many discriminant functions will be involved in a c class problem?

Key: There will be c − 1 projection functions for a c-class problem, and hence the projection will be from a d-dimensional space to a (c − 1)-dimensional space (d must be greater than or equal to c).

Extensions Multiple Discriminant Analysis MDA The Projection

The projection is accomplished by c − 1 discriminant functions:

y_i = w_i^T x,   i = 1, \ldots, c − 1    (70)

which is summarized in matrix form as

y = W^T x    (71)

where W is the d × (c − 1) projection matrix.

[FIGURE 3.6 from Duda, Hart, and Stork, Pattern Classification: three three-dimensional distributions are projected onto two-dimensional subspaces, described by normal vectors W1 and W2. Informally, multiple discriminant methods seek the optimum such subspace, that is, the one with the greatest separation of the projected distributions for a given total within-class scatter, here associated with W1.]

Extensions Multiple Discriminant Analysis MDA Within-Class Scatter Matrix

The generalization for the Within-Class Scatter Matrix is straightforward:

S_W = \sum_{i=1}^{c} S_i    (72)

where, as before,

S_i = \sum_{x \in D_i} (x − m_i)(x − m_i)^T

and

m_i = \frac{1}{n_i} \sum_{x \in D_i} x .

Extensions Multiple Discriminant Analysis MDA Between-Class Scatter Matrix

The between-class scatter matrix S_B is not so easy to generalize.

Let's define a total mean vector

m = \frac{1}{n} \sum_{x} x = \frac{1}{n} \sum_{i=1}^{c} n_i m_i    (73)

Recall the total scatter matrix

S_T = \sum_{x} (x − m)(x − m)^T    (74)

Then it follows that

S_T = \sum_{i=1}^{c} \sum_{x \in D_i} (x − m_i + m_i − m)(x − m_i + m_i − m)^T    (75)
    = \sum_{i=1}^{c} \sum_{x \in D_i} (x − m_i)(x − m_i)^T + \sum_{i=1}^{c} \sum_{x \in D_i} (m_i − m)(m_i − m)^T    (76)
    = S_W + \sum_{i=1}^{c} n_i (m_i − m)(m_i − m)^T    (77)

So, we can define this second term as a generalized between-class scatter matrix. The total scatter is the sum of the within-class scatter and the between-class scatter.
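The decomposition in Eq. (77) is easy to check numerically. Below is a small self-contained sketch (illustrative random data, not from the lecture) that builds S_W, S_B, and S_T for labeled samples and verifies S_T = S_W + S_B.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))                     # 60 samples in d = 5
labels = rng.integers(0, 3, size=60)             # c = 3 classes

m = X.mean(axis=0)                               # total mean (Eq. 73)
St = (X - m).T @ (X - m)                         # total scatter (Eq. 74)

Sw = np.zeros((5, 5))
Sb = np.zeros((5, 5))
for c in np.unique(labels):
    Xc = X[labels == c]
    mc = Xc.mean(axis=0)
    Sw += (Xc - mc).T @ (Xc - mc)                # within-class term (Eq. 72)
    Sb += len(Xc) * np.outer(mc - m, mc - m)     # between-class term (Eq. 77)

assert np.allclose(St, Sw + Sb)                  # S_T = S_W + S_B
```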

Extensions Multiple Discriminant Analysis MDA Objective

We again seek a criterion that will maximize the ratio of the between-class scatter of the projected vectors to the within-class scatter of the projected vectors. Recall that we can write down the scatter in terms of these projections:

\tilde{m}_i = \frac{1}{n_i} \sum_{y \in Y_i} y   and   \tilde{m} = \frac{1}{n} \sum_{i=1}^{c} n_i \tilde{m}_i    (78)

\tilde{S}_W = \sum_{i=1}^{c} \sum_{y \in Y_i} (y − \tilde{m}_i)(y − \tilde{m}_i)^T    (79)

\tilde{S}_B = \sum_{i=1}^{c} n_i (\tilde{m}_i − \tilde{m})(\tilde{m}_i − \tilde{m})^T    (80)

Then, we can show

\tilde{S}_W = W^T S_W W    (81)

\tilde{S}_B = W^T S_B W    (82)

A simple scalar measure of scatter is the determinant of the scatter matrix. The determinant is the product of the eigenvalues and thus is the product of the variation along the principal directions.

J(W) = \frac{|\tilde{S}_B|}{|\tilde{S}_W|} = \frac{|W^T S_B W|}{|W^T S_W W|}    (83)

Extensions Multiple Discriminant Analysis MDA Solution

The columns of an optimal W are the generalized eigenvectors that correspond to the largest eigenvalues in

S_B w_i = λ_i S_W w_i    (84)

If S_W is nonsingular, then this can be converted to a conventional eigenvalue problem. Or, we could notice that the rank of S_B is at most c − 1 and do some clever algebra...

The solution for W is, however, not unique and would allow arbitrary scaling and rotation, but these would not change the ultimate classification.
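Assuming S_W is nonsingular, the generalized eigenproblem of Eq. (84) can be handed to a generalized symmetric eigensolver. A minimal SciPy sketch (illustrative names) is below; scipy.linalg.eigh returns eigenvalues in ascending order, so the last c − 1 columns correspond to the largest eigenvalues.

```python
import numpy as np
from scipy.linalg import eigh

def mda_projection(Sb, Sw, c):
    """Columns of W solve S_B w = lambda S_W w (Eq. 84) for the c - 1
    largest eigenvalues; requires S_W symmetric positive definite."""
    vals, vecs = eigh(Sb, Sw)            # generalized problem, ascending eigenvalues
    W = vecs[:, -(c - 1):][:, ::-1]      # keep the top c - 1, largest first
    return W

# Projecting a data matrix X (one sample per row) then follows Eq. (71): Y = X @ W
```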

Extensions Image PCA IMPCA Source: Yang and Yang: IMPCA vs. PCA, Pattern Recognition v35, 2002.

Observation: To apply PCA and FLD to images, we need to first "vectorize" them, which can lead to very high-dimensional vectors. Solving the associated eigen-problems is a very time-consuming process.

So, can we apply PCA on the images directly? Yes! This will be accomplished by what the authors call the image total covariance (scatter) matrix.

Extensions Image PCA IMPCA: Problem Set-Up Source: Yang and Yang: IMPCA vs. PCA, Pattern Recognition v35, 2002.

Define A ∈ R^{m×n} as our image. Let w denote an n-dimensional column vector, which will represent the subspace onto which we project an image:

y = A w    (85)

which yields the m-dimensional vector y. You might think of w as a "feature selector." We, again, want to maximize the total scatter...

Extensions Image PCA IMPCA: Image Total Scatter Matrix Source: Yang and Yang: IMPCA vs. PCA, Pattern Recognition v35, 2002.

We have M total data samples. The sample mean image (written \bar{A} here to avoid clashing with the sample count M) is

\bar{A} = \frac{1}{M} \sum_{j=1}^{M} A_j    (86)

And the projected sample mean is

\tilde{m} = \frac{1}{M} \sum_{j=1}^{M} y_j    (87)
          = \frac{1}{M} \sum_{j=1}^{M} A_j w    (88)
          = \bar{A} w    (89)

The scatter of the projected samples is

\tilde{S} = \sum_{j=1}^{M} (y_j − \tilde{m})(y_j − \tilde{m})^T    (90)
          = \sum_{j=1}^{M} [(A_j − \bar{A}) w][(A_j − \bar{A}) w]^T    (91)

tr(\tilde{S}) = w^T \left[ \sum_{j=1}^{M} (A_j − \bar{A})^T (A_j − \bar{A}) \right] w    (92)

So, the image total scatter matrix is

S_I = \sum_{j=1}^{M} (A_j − \bar{A})^T (A_j − \bar{A})    (93)

And a suitable criterion, in a form we've seen before, is

J_I(w) = w^T S_I w    (94)

We know already that the vectors w that maximize J_I are the orthonormal eigenvectors of S_I corresponding to the largest eigenvalues of S_I.
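A minimal sketch of IMPCA feature extraction under these definitions (illustrative names, not the authors' code): form the n × n image total scatter matrix of Eq. (93) directly from the image stack, keep its top d eigenvectors, and project each image with Eq. (85).

```python
import numpy as np

def impca(images, d):
    """images: array of shape (M, m, n), one m x n image per slice.
    Returns (A_bar, W) where the columns of W (n x d) are the top-d
    eigenvectors of the image total scatter matrix S_I (Eq. 93)."""
    A_bar = images.mean(axis=0)                  # sample mean image, Eq. (86)
    C = images - A_bar                           # stack of A_j - A_bar
    S_I = np.einsum('jki,jkl->il', C, C)         # sum_j (A_j - A_bar)^T (A_j - A_bar)
    vals, vecs = np.linalg.eigh(S_I)             # ascending eigenvalues
    W = vecs[:, -d:][:, ::-1]                    # largest d eigenvectors first
    return A_bar, W

def impca_features(image, W):
    """Projected feature vectors Y = A W (Eq. 85): one m-vector per column of W."""
    return image @ W                             # shape (m, d)
```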

Extensions Image PCA IMPCA: Experimental Results Source: Yang and Yang: IMPCA vs. PCA, Pattern Recognition v35, 2002.

Dataset is the ORL database: 10 different images of each of 40 individuals. Facial expressions (open/closed eyes, smiling/non-smiling) and facial details (glasses/no glasses) vary. Even the pose can vary slightly: up to about 20 degrees of tilt/rotation and about 10% variation in scale. Image size is normalized to 92 × 112. The first five images of each person are used for training and the second five for testing. Since the image total scatter matrix here is only 92 × 92, its eigenvectors are very easy to compute; the number of selected eigenvectors (projection vectors) is varied from 1 to 10, and both a minimum distance classifier and a nearest neighbor classifier are used in the projected feature space.

IMPCA varying the number of extracted eigenvectors (NN classifier). Recognition rates (%) based on IMPCA projected features as the number of projection vectors varies from 1 to 10:

Projection vectors   1     2     3     4     5     6     7     8     9     10
Minimum distance     73.0  83.0  86.5  88.5  88.5  88.5  90.0  90.5  91.0  91.0
Nearest neighbor     85.0  92.0  93.5  94.5  94.5  95.0  95.0  95.5  93.5  94.0

Comparison of the maximal recognition rates (the value in parentheses is the number of axes at which the maximal rate is achieved):

Recognition rate     Eigenfaces   Fisherfaces   IMPCA
Minimum distance     89.5% (46)   88.5% (39)    91.0%
Nearest neighbor     93.5% (37)   88.5% (39)    95.5%

CPU time (s) consumed for feature extraction and classification (nearest neighbor classifier):

Method             Feature extraction   Classification   Total
Eigenfaces (37)    371.79               5.16             376.95
Fisherfaces (39)   378.10               5.27             383.37
IMPCA (112×8)      27.14                25.04            52.18

Tables 1 and 2 indicate that IMPCA outperforms Eigenfaces and Fisherfaces on this data, and Table 3 shows that the time consumed for feature extraction using IMPCA is much less than that of the other two methods.

Table 1 Recognition rates (%) based on IMPCA projected features as the number of projection vectors varying from 1 to 10

Projection 1 2 345678910 vector number

Minimum distance 73.0 83.0 86.5 88.5 88.5 88.5 90.0 90.5 91.0 91.0 Nearest neighbor 85.0 92.0 93.5 94.5 94.5 95.0 95.0 95.5 93.5 94.0

Table 2 Tables 1 and 2 indicate that our proposed IMPCA outper- Comparison ofthe maximal recognition rates using the three meth- forms Eigenfaces and Fisherfaces. What’s more, in Table 3, ods it is evident that the time consumed for feature extraction Recognition rate Eigenfaces Fisherfaces IMPCA using IMPCA is much less than that ofthe other two meth- ods. So our methods are more preferable for image feature Minimum distance 89.5% (46) 88.5% (39) 91.0% extraction. Nearest neighbor 93.5% (37) 88.5% (39) 95.5%

Note: The value in parentheses denotes the number ofaxes as the maximal recognition rate is achieved. Acknowledgements

We wish to thank National Science Foundation ofChina under Grant No.60072034 for supporting. Table 3 The CPU time consumed for feature extraction and classiÿcation using the three methods IMPCA vs. EigenFaces vs. FisherFaces for Speed References Time (s) Feature Classiÿcation Total extraction time time time [1] M. Turk, A. Pentland, Face recognition using Eigenfaces, Proceedings ofthe IEEE Conferenceon Computer Vision and Eigenfaces (37) 371.79 5.16 376.95 Pattern Recognition, 1991, pp. 586–591. Fisherfaces (39) 378.10 5.27 383.37 [2] P.N. Belhumeur, J.P. Hespanha, D.J. Kriengman, Eigenfaces vs. IMPCA (112 8) 27.14 25.04 52.18 Fisherfaces: recognition using class speciÿc linear projection, × IEEE Trans. Pattern Anal. Mach. Intell. 19 (7) (1997) 711–720. [3] K. Liu, Y.-Q. Cheng, J.-Y. Yang, et al., Algebraic feature time consumed for feature extraction and classiÿcation using extraction for image recognition based on an optimal the above methods under a nearest neighbor classiÿer is discriminant criterion, Pattern Recognition 26 (6) (1993) exhibited in Table 3. 903–911.

JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 84 / 101

Extensions Image PCA IMPCA: Discussion Source: Yang and Yang: IMPCA vs. PCA, Pattern Recognition v35, 2002.

There is a clear speed-up because the amount of computation has been greatly reduced.
However, this speed-up comes at some cost. What is that cost?
Why does IMPCA work in this case? What is IMPCA really doing?
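To make the trade-off concrete, here is a minimal NumPy sketch of the kind of computation IMPCA performs (a sketch assuming that, as described earlier, IMPCA builds an n x n image covariance directly from the m x n image matrices rather than from mn-dimensional vectorized images; the array names, sizes, and random data are illustrative, not taken from the paper):

import numpy as np

# Illustrative stack of M face images, each an m x n matrix; the width 92 is illustrative,
# only the 112 x 8 feature size appears in Table 3.
M, m, n = 200, 112, 92
rng = np.random.default_rng(0)
images = rng.random((M, m, n))

# IMPCA-style image covariance: only n x n, built from the image matrices directly.
mean_image = images.mean(axis=0)
centered = images - mean_image
G = np.einsum('kij,kil->jl', centered, centered) / M   # (1/M) sum_k (A_k - Abar)^T (A_k - Abar)

# Top d eigenvectors of the small n x n matrix give the projection axes.
d = 8
evals, evecs = np.linalg.eigh(G)        # eigenvalues in ascending order
Xproj = evecs[:, -d:]                   # n x d projection matrix

# Each image projects to an m x d feature matrix (e.g., 112 x 8, matching "IMPCA (112 x 8)").
features = centered @ Xproj             # shape (M, m, d)

# Contrast: Eigenfaces/Fisherfaces work on mn-dimensional vectors, so their eigenproblem comes
# from an (mn) x (mn) covariance (or an M x M Gram matrix), far costlier to form and solve.

One visible cost in the sketch: the projected features are m x d matrices rather than short vectors, which is consistent with IMPCA's larger classification time in Table 3 even though its extraction time is much smaller.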

JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 85 / 101

Non-linear Dimension Reduction Locally Linear Embedding Locally Linear Embedding Source: Saul and Roweis, An Introduction to Locally Linear Embedding, 2001

So far, we have covered a few methods for dimension reduction that all make the underlying assumption that the data in high dimension lives on a planar manifold in the lower dimension.
The methods are easy to implement.
The methods do not suffer from local minima.
But, what if the data is non-linear?

Figure 1: The problem of nonlinear dimensionality reduction, as illustrated for three dimensional data (B) sampled from a two dimensional manifold (A). An unsupervised learning algorithm must discover the global internal coordinates of the manifold without signals that explicitly indicate how the data should be embedded in two dimensions. The shading in (C) illustrates the neighborhood-preserving mapping discovered by LLE.

JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 86 / 101


Non-linear Dimension Reduction Locally Linear Embedding The Non-Linear Problem Source: Saul and Roweis, An Introduction to Locally Linear Embedding, 2001

In non-linear dimension reduction, one must discover the global internal coordinates of the manifold without signals that explicitly indicate how the data should be embedded in the lower dimension (or even how many dimensions should be used).

The LLE way of doing this is to make the assumption that, given enough data samples, the local frame of a particular point xi is linear. LLE proceeds to preserve this local structure while simultaneously reducing the global dimension (indeed, preserving the local structure gives LLE the necessary constraints to discover the manifold).

JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 87 / 101

Non-linear Dimension Reduction Locally Linear Embedding LLE: Neighborhoods Source: Saul and Roweis, An Introduction to Locally Linear Embedding, 2001

Suppose we have n real-valued vectors xi in a dataset D, each of dimension D. We assume these data are sampled from some smooth underlying manifold.
Furthermore, we assume that each data point and its neighbors lie on or close to a locally linear patch of the manifold.
LLE characterizes the local geometry of these patches by linear coefficients that reconstruct each data point from its neighbors. If the neighbors form a D-dimensional simplex, then these coefficients form the barycentric coordinates of the data point.
In the simplest form of LLE, one identifies K such nearest neighbors based on the Euclidean distance. But, one can use all points in a ball of fixed radius, or even more sophisticated metrics.

JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 88 / 101

Non-linear Dimension Reduction Locally Linear Embedding LLE: Neighborhoods Source: Saul and Roweis, An Introduction to Locally Linear Embedding, 2001

We can write down the reconstruction error with the following cost function:

J_{LLE1}(W) = \sum_i \big\| x_i - \sum_j W_{ij} x_j \big\|^2    (95)

Notice that each row of the weight matrix W will be nonzero in only K columns; i.e., W is a quite sparse matrix. If we define the set N(xi) as the K neighbors of xi, then we enforce

W_{ij} = 0 \text{ if } x_j \notin N(x_i)    (96)

We will enforce that the rows sum to 1, i.e., \sum_j W_{ij} = 1. (This is for invariance.)
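As a small illustration of the cost and constraints in Eqs. (95)-(96), the sketch below builds a K-nearest-neighbor graph and a valid (row-stochastic, K-sparse) W; the uniform weights over each neighborhood are purely for illustration, since the optimal weights are derived on the following slides. All names and data are illustrative.

import numpy as np

def knn_indices(X, K):
    # Brute-force K nearest neighbors by Euclidean distance (excluding the point itself).
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(d2, np.inf)
    return np.argsort(d2, axis=1)[:, :K]            # shape (n, K)

def lle_reconstruction_cost(X, W):
    # J_LLE1(W) = sum_i || x_i - sum_j W_ij x_j ||^2
    return np.sum((X - W @ X) ** 2)

n, D, K = 100, 5, 8
X = np.random.default_rng(1).random((n, D))
nbrs = knn_indices(X, K)

W = np.zeros((n, n))
for i in range(n):
    W[i, nbrs[i]] = 1.0 / K     # W_ij = 0 outside N(x_i); each row sums to one

print(lle_reconstruction_cost(X, W))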

JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 89 / 101

Non-linear Dimension Reduction Locally Linear Embedding LLE: Neighborhoods Source: Saul and Roweis, An Introduction to Locally Linear Embedding, 2001

Note that the constrained weights that minimize these reconstruction errors obey important symmetries: for any data point, they are invariant to rotations, rescalings, and translations of that data point and its neighbors.
This means that these weights characterize intrinsic geometric properties of the neighborhood as opposed to any properties that depend on a particular frame of reference.
If we suppose the data lie on or near a smooth nonlinear manifold of dimension d ≪ D, then, to a good approximation, there exists a linear mapping (a translation, rotation, and scaling) that maps the high dimensional coordinates of each neighborhood to global internal coordinates on the manifold.
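A quick numerical check of this invariance (a sketch; the closed-form weight solve used here anticipates Eq. (100) on the upcoming slides, and all names and data are illustrative):

import numpy as np

def local_weights(x, nbrs):
    Z = x - nbrs                                  # rows are (x - eta_j)
    C = Z @ Z.T + 1e-12 * np.eye(len(nbrs))       # local covariance; tiny jitter since C is rank-deficient when K > D
    w = np.linalg.solve(C, np.ones(len(nbrs)))
    return w / w.sum()

rng = np.random.default_rng(4)
x, nbrs = rng.random(3), rng.random((5, 3))

Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))  # random orthogonal transform (rotation/reflection)
t, s = rng.random(3), 2.5                         # translation and rescaling

w1 = local_weights(x, nbrs)
w2 = local_weights(s * (x @ Q) + t, s * (nbrs @ Q) + t)
print(np.allclose(w1, w2))                        # True: the optimal weights are unchanged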

JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 90 / 101

Non-linear Dimension Reduction Locally Linear Embedding LLE: Neighborhoods Source: Saul and Roweis, An Introduction to Locally Linear Embedding, 2001

So, we expect the characterization of the local neighborhoods (W) in the original space to be equally valid for the local patches on the manifold.

In other words, the same weights Wij that reconstruct point xi in the original space should also reconstruct it in the embedded manifold coordinate system. First, let’s solve for the weights. And, then we’ll see how to use this point to ultimately compute the global dimension reduction.

JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 91 / 101

Non-linear Dimension Reduction Locally Linear Embedding Solving for the Weights Source: Saul and Roweis, An Introduction to Locally Linear Embedding, 2001

Solving for the weights W is a constrained least-squares problem.

Consider a particular x with K nearest neighbors ηj and weights wj which sum to one. We can write the reconstruction error as

\epsilon = \big\| x - \sum_j w_j \eta_j \big\|^2    (97)

         = \big\| \sum_j w_j (x - \eta_j) \big\|^2    (98)   (using the constraint that the weights w_j sum to one)

         = \sum_{jk} w_j w_k C_{jk}    (99)

where C_{jk} is the local covariance C_{jk} = (x - \eta_j)^T (x - \eta_k).

JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 92 / 101

Non-linear Dimension Reduction Locally Linear Embedding Solving for the Weights Source: Saul and Roweis, An Introduction to Locally Linear Embedding, 2001

We need to add a Lagrange multiplier to enforce the constraint

\sum_j w_j = 1, and then the weights can be solved in closed form. The optimal weights, in terms of the local covariance matrix, are

w_j = \frac{\sum_k C_{jk}^{-1}}{\sum_{lm} C_{lm}^{-1}} .    (100)

But, rather than explicitly inverting the covariance matrix, one can solve the linear system

\sum_k C_{jk} w_k = 1    (101)

and rescale the weights so that they sum to one.
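A sketch of the per-point solve in Eqs. (97)-(101) (assuming the K neighbors have already been gathered into a K x D array; the small ridge added to C is a common safeguard for the case where the neighbors outnumber the input dimensionality, and is an assumption of this sketch rather than something stated above):

import numpy as np

def reconstruction_weights(x, neighbors, reg=1e-3):
    # neighbors: K x D array whose rows are the eta_j for this point x.
    Z = x - neighbors                               # rows are (x - eta_j)
    C = Z @ Z.T                                     # local covariance C_jk = (x - eta_j)^T (x - eta_k)
    C = C + reg * np.trace(C) * np.eye(len(C))      # ridge so the solve stays well conditioned
    w = np.linalg.solve(C, np.ones(len(C)))         # solve sum_k C_jk w_k = 1
    return w / w.sum()                              # rescale so the weights sum to one

rng = np.random.default_rng(2)
x = rng.random(3)
eta = rng.random((5, 3))                            # K = 5 neighbors in D = 3
w = reconstruction_weights(x, eta)
print(w.sum(), np.linalg.norm(x - w @ eta))         # weights sum to 1; second value is the residual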

JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 93 / 101

Non-linear Dimension Reduction Locally Linear Embedding Stage 2: Neighborhood Preserving Mapping Source: Saul and Roweis, An Introduction to Locally Linear Embedding, 2001

Each high-dimensional input vector xi is mapped to a low-dimensional vector yi representing the global internal coordinates on the manifold.

LLE does this by choosing the d-dimensional coordinates yi to minimize the embedding cost function:

J_{LLE2}(y) = \sum_i \big\| y_i - \sum_j W_{ij} y_j \big\|^2    (102)

The basis for the cost function is the same—locally linear reconstruction errors—but here, the weights W are fixed and the coordinates y are optimized.

JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 94 / 101

Non-linear Dimension Reduction Locally Linear Embedding Stage 2: Neighborhood Preserving Mapping Source: Saul and Roweis, An Introduction to Locally Linear Embedding, 2001

This defines a quadratic form:

J_{LLE2}(y) = \sum_i \big\| y_i - \sum_j W_{ij} y_j \big\|^2    (103)

            = \sum_{ij} M_{ij} \, (y_i^T y_j)    (104)

where

M_{ij} = \delta_{ij} - W_{ij} - W_{ji} + \sum_k W_{ki} W_{kj}    (105)

with \delta_{ij} equal to 1 if i = j and 0 otherwise (the Kronecker delta).

JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 95 / 101

Non-linear Dimension Reduction Locally Linear Embedding Stage 2: Neighborhood Preserving Mapping Source: Saul and Roweis, An Introduction to Locally Linear Embedding, 2001

To remove the degree of freedom for translation, we enforce the coordinates to be centered at the origin:

\sum_i y_i = 0    (106)

To avoid degenerate solutions, we constrain the embedding vectors to have unit covariance, with their outer products satisfying

\frac{1}{n} \sum_i y_i y_i^T = I    (107)

The optimal embedding is found from the bottom d + 1 eigenvectors of M, i.e., the d + 1 eigenvectors corresponding to its smallest d + 1 eigenvalues. The bottom eigenvector is the constant unit vector that corresponds to the free translation; it is discarded, and the remaining d eigenvectors give the embedding coordinates.
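A sketch of the embedding step in Eqs. (102)-(107) (assuming the full n x n weight matrix W from the previous stage; a dense eigendecomposition is used here for clarity, whereas a sparse eigensolver would normally be used; all names are illustrative):

import numpy as np

def lle_embedding(W, d):
    n = W.shape[0]
    # M = (I - W)^T (I - W), whose entries are exactly Eq. (105):
    # M_ij = delta_ij - W_ij - W_ji + sum_k W_ki W_kj
    I = np.eye(n)
    M = (I - W).T @ (I - W)
    evals, evecs = np.linalg.eigh(M)      # eigenvalues in ascending order
    # Assuming the zero eigenvalue is simple, the bottom eigenvector is the constant vector
    # (free translation); discard it and keep the next d eigenvectors. Their columns sum to
    # zero (Eq. (106)) and are orthonormal, so up to a factor of sqrt(n) they satisfy Eq. (107).
    return evecs[:, 1:d + 1]              # row i is the embedding coordinate y_i

# Usage sketch: Y = lle_embedding(W, d=2) for a W built from the reconstruction weights.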

JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 96 / 101

Non-linear Dimension Reduction Locally Linear Embedding Putting It Together Source: Saul and Roweis, An Introduction to Locally Linear Embedding, 2001

LLE ALGORITHM

1. Compute the neighbors of each data point, Xi.
2. Compute the weights Wij that best reconstruct each data point from its neighbors, minimizing the reconstruction cost (Eq. (95) here) by constrained linear fits.
3. Compute the vectors Yi best reconstructed by the weights Wij, minimizing the quadratic form (Eq. (102) here) by its bottom nonzero eigenvectors.

Figure 2: Summary of the LLE algorithm, mapping high dimensional data points Xi to low dimensional embedding vectors Yi.

LLE illustrates a general principle of manifold learning, elucidated by Tenenbaum et al [11], that overlapping local neighborhoods, collectively analyzed, can provide information about global geometry. (From Saul and Roweis 2001)
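Putting the three steps together as code (a sketch that simply composes the illustrative helpers from the earlier sketches, knn_indices, reconstruction_weights, and lle_embedding; none of these names come from the paper):

import numpy as np

def lle(X, K=12, d=2):
    # Step 1: neighbors of each data point.
    n = X.shape[0]
    nbrs = knn_indices(X, K)
    # Step 2: weights that best reconstruct each point from its neighbors (constrained fits).
    W = np.zeros((n, n))
    for i in range(n):
        W[i, nbrs[i]] = reconstruction_weights(X[i], X[nbrs[i]])
    # Step 3: embedding coordinates from the bottom nonzero eigenvectors of M.
    return lle_embedding(W, d)

# Usage sketch: Y = lle(X, K=12, d=2), e.g., on points sampled from the manifold of Figure 1.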

JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 97 / 101

Non-linear Dimension Reduction Locally Linear Embedding Example 1 Source: Saul and Roweis, An Introduction to Locally Linear Embedding, 2001

K=12.


Figure 1: The problem of nonlinear dimensionality reduction, as illustrated for three dimensional data (B) sampled from a two dimensional manifold (A). An unsupervised learning algorithm must discover the global internal coordinates of the manifold without signals that explicitly indicate how the data should be embedded in two dimensions. The shading in (C) illustrates the neighborhood-preserving mapping discovered by LLE.

JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 98 / 101

Non-linear Dimension Reduction Locally Linear Embedding Example 2 Source: Saul and Roweis, An Introduction to Locally Linear Embedding, 2001

K=4.

JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 99 / 101

Figure 3: The results of PCA (top) and LLE (bottom), applied to images of a single face translated across a two-dimensional background of noise. Note how LLE maps the images with corner faces to the corners of its two dimensional embedding, while PCA fails to preserve the neighborhood structure of nearby images.

Non-linear Dimension Reduction Locally Linear Embedding Example 3 Source: Saul and Roweis, An Introduction to Locally Linear Embedding, 2001

K=16.

If the lip images described a nearly linear manifold, these two methods would yield similar results; thus the significant differences in these embeddings reveal the presence of nonlinear structure.

JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 100 / 101

Figure 4: Images of lips mapped into the embedding space described by the first two coordinates of PCA (top) and LLE (bottom). Representative lips are shown next to circled points in different parts of each space. The differences between the two embeddings indicate the presence of nonlinear structure in the data.


Non-linear Dimension Reduction Locally Linear Embedding LLE Complexity Source: Saul and Roweis, An Introduction to Locally Linear Embedding, 2001


Computing nearest neighbors scales as O(Dn^2) in the worst case. But, in many situations, space partitioning methods can be used to find the K nearest neighbors in O(n log n) time.
Computing the weights is O(DnK^3).
Computing the projection is O(dn^2).
All matrix computations are on very sparse matrices and can thus be implemented quite efficiently.
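As a concrete example of the space-partitioning idea (a sketch assuming SciPy is available; the k-d tree used here is one such structure, effective when D is small or moderate, and all names and data are illustrative):

import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(3)
X = rng.random((2000, 3))            # n = 2000 points in D = 3

tree = cKDTree(X)                    # build the tree once
dist, idx = tree.query(X, k=13)      # query K + 1 = 13, since each point is its own nearest neighbor
nbrs = idx[:, 1:]                    # drop the self-match, leaving K = 12 neighbors per point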

JJ Corso (University of Michigan) Eigenfaces and Fisherfaces 101 / 101