Appears in the Second International Conference on Audio- and Video-based Biometric Person Authentication, AVBPA'99, Washington D. C. USA, March 22-24, 1999.

Comparative Assessment of Independent Component Analysis (ICA) for Face Recognition

Chengjun Liu and Harry Wechsler
Department of Computer Science, George Mason University,

4400 University Drive, Fairfax, VA 22030-4444, USA

{cliu, wechsler}@cs.gmu.edu

Abstract

This paper addresses the relative usefulness of Independent Component Analysis (ICA) for face recognition. Comparative assessments are made regarding (i) ICA sensitivity to the dimension of the space where it is carried out, and (ii) ICA discriminant performance alone or when combined with other discriminant criteria such as the Bayesian framework or Fisher's Linear Discriminant (FLD). Sensitivity analysis suggests that for enhanced performance ICA should be carried out in a compressed and whitened Principal Component Analysis (PCA) space where the small trailing eigenvalues are discarded. The reason for this finding is that during whitening the eigenvalues of the covariance matrix appear in the denominator and that the small trailing eigenvalues mostly encode noise. As a consequence the whitening component, if used in an uncompressed image space, would fit misleading variations and thus generalize poorly to new data. Discriminant analysis shows that the ICA criterion, when carried out in the properly compressed and whitened space, performs better than the eigenfaces and Fisherfaces methods for face recognition, but its performance deteriorates when augmented by additional criteria such as the Maximum A Posteriori (MAP) rule of the Bayes classifier or the FLD. The reason for the last finding is that the Mahalanobis distance embedded in the MAP classifier duplicates to some extent the whitening component, while using FLD runs counter to the independence criterion intrinsic to ICA.

1. Introduction

Face recognition is important not only because it has many potential applications in research fields such as Human Computer Interaction (HCI), biometrics and security, but also because it is a typical Pattern Recognition (PR) problem whose solution would help in solving other classification problems. A successful face recognition methodology depends heavily on the particular choice of the features used by the (pattern) classifier [4], [17], [3]. Feature selection in pattern recognition involves the derivation of salient features from the raw input data in order to reduce the amount of data used for classification and simultaneously provide enhanced discriminatory power.

One popular technique for feature selection and dimensionality reduction is Principal Component Analysis (PCA) [8], [6]. PCA is a standard decorrelation technique, and following its application one derives an orthogonal projection basis which directly leads to dimensionality reduction, and possibly to feature selection. PCA was first applied to reconstruct human faces by Kirby and Sirovich [10] and to recognize faces by Turk and Pentland [19]. The recognition method, known as eigenfaces, defines a feature space, or "face space", which drastically reduces the dimensionality of the original space; face detection and identification are then carried out in the reduced space.

Independent Component Analysis (ICA) has emerged recently as a powerful solution to the problem of blind source separation [5], [9], [7], while its possible use for face recognition has been shown by Bartlett and Sejnowski [1]. ICA searches for a linear transformation to express a set of random variables as linear combinations of statistically independent source variables [5]. The search criterion involves the minimization of the mutual information expressed as a function of high order cumulants. Basically, PCA considers the 2nd order moments only and decorrelates the data, while ICA accounts for higher order statistics and identifies the independent source components from their linear mixtures (the observables). ICA thus provides a more powerful data representation than PCA [9]. As PCA derives only the most expressive features for face reconstruction rather than face classification, one would usually apply some subsequent discriminant analysis to enhance PCA performance [18].

This paper makes a comparative assessment of the use of ICA as a discriminant analysis criterion whose goal is to enhance PCA stand-alone performance. Experiments in support of our comparative assessment of ICA for face recognition are carried out using a large data set consisting of 1,107 images drawn from the FERET database [16]. The comparative assessment suggests that for enhanced face recognition performance ICA should be carried out in a compressed and whitened space, and that ICA performance deteriorates when it is augmented by additional decision rules such as the Bayes classifier or Fisher's linear discriminant analysis.

2. Background

PCA provides an optimal signal representation technique in the mean square error sense. The motivation behind using PCA for human face representation and recognition comes from its optimal and robust image compression and reconstruction capability [10], [15]. PCA yields projection axes based on the variations from all the training samples, hence these axes are fairly robust for representing both training and testing images (not seen during training). PCA does not distinguish, however, the different roles of the variations (within- and between-class variations) and treats them equally. This leads to poor performance when the distributions of the face classes are separated not by the mean-difference but by the covariance-difference [6].

Swets and Weng [18] point out that PCA derives only the most expressive features, which are unrelated to actual face recognition, and that in order to improve performance one needs additional discriminant analysis. One such discriminant criterion, the Bayes classifier, yields the minimum classification error when the underlying probability density functions (pdf) are known. The use of the Bayes classifier is conditioned on the availability of an adequate amount of representative training data in order to estimate the pdf. As an example, Moghaddam and Pentland [13] developed a Probabilistic Visual Learning (PVL) method which uses the eigenspace decomposition as an integral part of estimating the pdf in high-dimensional image spaces. To address the lack of sufficient training data, Liu and Wechsler [12] introduced the Probabilistic Reasoning Models (PRM), where the conditional pdf for each class is modeled using the pooled within-class scatter and the Maximum A Posteriori (MAP) Bayes classification rule is implemented in the reduced PCA subspace.

Fisher's Linear Discriminant (FLD) is yet another popular discriminant criterion. By applying first PCA for dimensionality reduction and then FLD for discriminant analysis, Belhumeur, Hespanha, and Kriegman [2] and Swets and Weng [18] developed similar methods (Fisherfaces and the Most Discriminant Features (MDF) method) for face recognition. Methods that combine PCA and the standard FLD, however, lack generalization ability as they overfit to the training data. To address the overfitting problem, Liu and Wechsler [11] introduced the Enhanced FLD Models (EFM) to improve the generalization capability of standard FLD based classifiers such as Fisherfaces [2].

3. Independent Component Analysis (ICA)

As PCA considers the 2nd order moments only, it lacks information on higher order statistics. ICA accounts for higher order statistics and identifies the independent source components from their linear mixtures (the observables). ICA thus provides a more powerful data representation than PCA [9], as its goal is to provide an independent rather than an uncorrelated image decomposition and representation.

ICA of a random vector searches for a linear transformation which minimizes the statistical dependence between its components [5]. In particular, let $X \in \mathbb{R}^N$ be a random vector representing an image, where $N$ is the dimensionality of the image space. The vector is formed by concatenating the rows or the columns of the image, which may be normalized to have a unit norm and/or an equalized histogram. The covariance matrix of $X$ is defined as

    \Sigma_X = E\{[X - E(X)][X - E(X)]^t\}                          (1)

where $E(\cdot)$ is the expectation operator, $t$ denotes the transpose operation, and $\Sigma_X \in \mathbb{R}^{N \times N}$. The ICA of $X$ factorizes the covariance matrix $\Sigma_X$ into the following form

    \Sigma_X = F \Delta F^t                                         (2)

where $\Delta$ is diagonal real positive and $F$ transforms the original data $X$ into $Z$,

    X = F Z                                                         (3)

such that the components of the new data $Z$ are independent or "the most independent possible" [5].

To derive the ICA transformation $F$, Comon [5] developed an algorithm which consists of three operations: whitening, rotation, and normalization. First, the whitening operation transforms the random vector $X$ into another vector $U$ that has a unit covariance matrix:

    X = \Phi \Lambda^{1/2} U                                        (4)

where $\Phi$ and $\Lambda$ are derived by solving the following eigenvalue equation

    \Sigma_X = \Phi \Lambda \Phi^t                                  (5)

where $\Phi = [\phi_1 \, \phi_2 \, \cdots \, \phi_N]$ is an orthonormal eigenvector matrix and $\Lambda = \mathrm{diag}\{\lambda_1, \lambda_2, \ldots, \lambda_N\}$ is the diagonal eigenvalue matrix of $\Sigma_X$. One notes that whitening, an integral ICA component, counteracts the fact that the Mean Square Error (MSE) preferentially weighs low frequencies [14]. The rotation operations then perform source separation (i.e., derive the independent components) by minimizing the mutual information approximated using higher order cumulants. Finally, the normalization operation derives unique independent components in terms of orientation, unit norm, and order of projections [5].
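The whitening step of Eqs. (4)-(5) reduces to an eigendecomposition of the sample covariance matrix. The following minimal numpy sketch illustrates it; the array layout (one image vector per row) and the helper name `whiten_data` are our own choices rather than anything specified in the paper.

```python
import numpy as np

def whiten_data(X):
    """Whitening step of Eqs. (4)-(5): project the mean-centered data onto the
    eigenvectors Phi of its covariance matrix and rescale each axis by
    1/sqrt(lambda_i), so the transformed vectors U = Lambda^{-1/2} Phi^t X have
    (approximately) unit covariance.

    X : array of shape (n_samples, n_dims), one image vector per row.
    Returns the whitened data U together with (Phi, lambda) for later use.
    """
    Xc = X - X.mean(axis=0)                   # remove the mean E(X)
    Sigma = Xc.T @ Xc / (len(Xc) - 1)         # sample covariance, Eq. (1)
    lam, Phi = np.linalg.eigh(Sigma)          # Sigma = Phi diag(lam) Phi^t, Eq. (5)
    order = np.argsort(lam)[::-1]             # eigenvalues in decreasing order
    lam, Phi = lam[order], Phi[:, order]
    U = (Xc @ Phi) / np.sqrt(lam)             # divide each axis by sqrt(eigenvalue)
    return U, Phi, lam
```

Note that the division by $\sqrt{\lambda_i}$ is exactly where the trailing, near-zero eigenvalues become harmful, which motivates the compressed whitening analyzed next.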

4. Sensitivity Analysis of ICA

Eq. 4 can be rearranged into the following form

    U = \Lambda^{-1/2} \Phi^t X                                     (6)

where $\Phi$ and $\Lambda$ are the eigenvector and eigenvalue matrices, respectively (see Eq. 5). Eq. 6 shows that during whitening the eigenvalues appear in the denominator. The trailing eigenvalues, which tend to capture noise as their values are fairly small, thus cause the whitening step to fit misleading variations and make the ICA criterion generalize poorly when it is exposed to new data. As a consequence, if the whitening step is preceded by a dimensionality reduction procedure, ICA performance would be enhanced and the computational complexity reduced. Specifically, the space dimensionality is first reduced in order to discard the small trailing eigenvalues, and only then is the compressed data normalized ('sphered') to unit gain control.

If we assume that the eigenvalues in $\Lambda$ are sorted in decreasing order, $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_N$, then the first $m$ ($m < N$) leading eigenvectors define a matrix $P \in \mathbb{R}^{N \times m}$,

    P = [\phi_1 \, \phi_2 \, \cdots \, \phi_m]                      (7)

and the first $m$ eigenvalues specify a diagonal matrix $\Lambda_m \in \mathbb{R}^{m \times m}$,

    \Lambda_m = \mathrm{diag}\{\lambda_1, \lambda_2, \ldots, \lambda_m\}   (8)

The dimensionality reduction and whitening transform the data $X$ into $V \in \mathbb{R}^m$:

    V = \Lambda_m^{-1/2} P^t X                                      (9)

The question now is how to choose the dimensionality $m$ of the reduced subspace. Note that the goal of using whitening together with dimensionality reduction is two-fold. On the one hand, we hope to lose as little representative information of the original data as possible in the transformation from the high dimensional space to the low dimensional one. On the other hand, in the reduced subspace the small trailing eigenvalues are excluded so that we can obtain more robust whitening results. Toward that end, during the whitening transformation we should keep a balance between the need that the selected eigenvalues account for most of the spectral energy of the raw data and the requirement that the trailing eigenvalues of the covariance matrix are not too small. As a result, the eigenvalue spectrum of the training data supplies useful information for deciding the dimensionality $m$ of the subspace.

The experimental data, consisting of 1,107 facial images corresponding to 369 subjects, comes from the FERET database [16]. Some of the face images used are shown in Fig. 1. 600 out of the 1,107 images correspond to 200 subjects, with each subject having three images: two of them are the first and the second shot, while the third shot is taken under low illumination. For the remaining 169 subjects there are also three images for each subject, but two out of the three images are duplicates taken at a different and much later time. Two images of each subject are used for training, with the remaining one used for testing. The images are cropped to the size of 64 x 96 once the eye coordinates are manually detected.

[Figure 1: Some of the FERET face images used in the experiments.]

Fig. 2 shows the relative magnitude of the eigenvalues derived using the 738 training face images. When the eigenvalue index is greater than 40, the corresponding eigenvalues have relatively small magnitudes; if they were included in the whitening transformation, these small eigenvalues would lead to decreased ICA performance (see Fig. 3 and Fig. 4), as they amplify the effects of noise. We therefore set the dimension of the reduced space to m = 40.

In order to assess the sensitivity of ICA with respect to the dimension of the compressed and whitened space where it is implemented, we carried out a comparative assessment for whitened subspaces of different dimension m. Fig. 3 and Fig. 4 show the ICA face recognition performance when different values of m are chosen during the whitening transformation (see Eq. 9), with the number of features used ranging up to the dimension of the compressed space. The curve corresponding to m = 40 performs better than all the other curves obtained for different m values. Fig. 4 shows that ICA performance deteriorates quite severely as m gets larger. The reason for this deterioration is that for large values of m, more small trailing eigenvalues (see Fig. 2) are included in the whitening step, and this leads to an unstable transformation. Note that smaller values of m lead to decreased ICA performance as well, because the whitened space then fails to capture enough information about the original data.
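A compact way to express Eqs. (7)-(9) in code is sketched below. The energy-based default for choosing $m$ is only an illustrative heuristic consistent with the balance discussed above; the paper itself selects m = 40 by inspecting the eigenvalue spectrum of Fig. 2. The function and parameter names (`compress_and_whiten`, `energy`, `eig_floor`) are our own.

```python
import numpy as np

def compress_and_whiten(X_train, m=None, energy=0.98, eig_floor=1e-6):
    """Dimensionality reduction followed by whitening, Eqs. (7)-(9).

    X_train : (n_samples, N) matrix of training image vectors, one per row.
    m       : dimension of the reduced subspace. If None, keep the smallest m
              whose leading eigenvalues capture `energy` of the spectral energy,
              while never retaining eigenvalues below `eig_floor` (a heuristic,
              not the paper's rule).
    Returns a projection function x -> V = Lambda_m^{-1/2} P^t (x - mean), and m.
    """
    mean = X_train.mean(axis=0)
    Xc = X_train - mean
    lam, phi = np.linalg.eigh(Xc.T @ Xc / (len(Xc) - 1))
    order = np.argsort(lam)[::-1]
    lam, phi = lam[order], phi[:, order]
    if m is None:
        ratio = np.cumsum(lam) / lam.sum()
        m = int(np.searchsorted(ratio, energy)) + 1
        m = min(m, int(np.sum(lam > eig_floor)))      # drop near-zero trailing eigenvalues
    P, lam_m = phi[:, :m], lam[:m]                    # Eqs. (7)-(8)
    return (lambda x: ((x - mean) @ P) / np.sqrt(lam_m)), m   # Eq. (9)
```

With only 738 training images and N = 64 x 96 = 6144 pixels, one would in practice obtain the leading eigenvectors from the much smaller 738 x 738 Gram matrix (the usual eigenfaces shortcut [19]) rather than from the full N x N covariance; the direct form above simply stays closest to Eqs. (7)-(9).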

[Figure 2: The relative magnitude of the eigenvalues (plotted against the eigenvalue index) derived from the 738 training images.]

[Figure 3: Comparative ICA face recognition performance (correct retrieval rate vs. number of features) when the whitened subspace dimension is m = 30, 40, 50, and 70, respectively.]

[Figure 4: Comparative ICA face recognition performance (correct retrieval rate vs. number of features) when the whitened subspace dimension is m = 100, 200, and 369, respectively.]
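For readers who wish to reproduce the sensitivity sweep behind Fig. 3 and Fig. 4, the following sketch outlines one possible evaluation loop. It reuses compress_and_whiten from above, substitutes scikit-learn's FastICA for Comon's cumulant-based rotation, and scores a simple nearest-neighbor matcher; the paper does not specify this exact classifier or ICA routine, so all of these are stand-in assumptions (in particular, FastICA returns its components in an arbitrary order, whereas Comon's normalization step fixes the ordering).

```python
import numpy as np
from sklearn.decomposition import FastICA

def retrieval_rate(train_x, train_y, test_x, test_y, m, n_features):
    """Correct retrieval rate of nearest-neighbor matching on ICA features
    extracted from an m-dimensional compressed and whitened subspace."""
    project, _ = compress_and_whiten(train_x, m=m)     # Eqs. (7)-(9), defined earlier
    ica = FastICA(whiten=False, random_state=0)        # rotation only; data already sphered
    v_train = ica.fit_transform(project(train_x))[:, :n_features]
    v_test = ica.transform(project(test_x))[:, :n_features]
    dists = np.linalg.norm(v_test[:, None, :] - v_train[None, :, :], axis=-1)
    predictions = np.asarray(train_y)[np.argmin(dists, axis=1)]
    return float(np.mean(predictions == np.asarray(test_y)))

# Sweep corresponding to Fig. 3 (m = 30, 40, 50, 70); requires train/test arrays:
# for m in (30, 40, 50, 70):
#     print(m, retrieval_rate(train_x, train_y, test_x, test_y, m, n_features=30))
```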

5. Discrimination Power of ICA

To assess the performance of ICA as a discriminant criterion, we implemented the eigenfaces [19] and Fisherfaces [2] methods for comparison purposes. Note that Fisherfaces applies first PCA for dimensionality reduction and then FLD for discriminant analysis. Relevant questions concerning PCA are usually related to the range of principal components used and how it affects performance. Regarding discriminant analysis, one has to understand the reasons for overfitting and how to avoid it. One can actually show that using more principal components may lead to decreased recognition performance. The explanation for this behavior is that the trailing eigenvalues correspond to high-frequency components and usually encode noise. As a result, when these trailing but small valued eigenvalues are used to define the reduced PCA subspace, the FLD procedure has to fit the noise as well, and as a consequence overfitting takes place. To improve the generalization ability of Fisherfaces, we implemented it in the reduced PCA subspace. The comparative performance of eigenfaces, Fisherfaces and ICA is plotted in Fig. 5, with ICA implemented in the m = 40 PCA subspace. Fig. 5 shows that the ICA criterion performs better than both eigenfaces and Fisherfaces.

[Figure 5: Comparative face recognition performance of eigenfaces, Fisherfaces, and ICA.]

We also assessed the ICA discriminant criterion against two other popular discriminant criteria: the MAP rule of the Bayes classifier and Fisher's linear discriminant, as they are embedded within the Probabilistic Reasoning Models (PRM) [12] and the Enhanced FLD Models (EFM) [11]. Fig. 6 plots the comparative performances of PRM-1 and EFM-1 against the ICA method, with m again being set to 40. ICA is shown to have face recognition performance comparable with the MAP rule of the Bayes classifier and with Fisher's linear discriminant as embedded within PRM and EFM, respectively.

[Figure 6: Comparative face recognition performance of PRM-1, EFM-1, and ICA.]

We next augmented the ICA criterion by additional criteria such as the MAP rule of the Bayes classifier or the FLD. In the ICA space, the Bayes classifier uses the pooled within-class scatter to estimate the covariance matrix for each class in order to approximate the conditional pdf, and then applies the MAP rule as the classification criterion (see [12] for details). The FLD likewise uses the pooled within-class scatter to estimate the within-class covariance matrix in the ICA space [11]. The first augmented criterion (ICA + Bayes classifier) does not improve the face recognition rate, as it displays a performance curve similar to that of ICA alone, as plotted in Fig. 7. Note that when ICA is combined with the Bayes classifier, the MAP rule specifies a quadratic classifier characterized by the Mahalanobis distance. The whitening transformation that standardizes the data is applied first and precedes MAP. As a result, the Mahalanobis distance embedded in the MAP classifier duplicates to some extent the whitening component of ICA and cannot improve the overall performance. The second augmented criterion (ICA + FLD), whose performance is shown in Fig. 8, significantly deteriorates the recognition performance. This deterioration is caused by the additional FLD transformation, which cancels to a large extent the independence criterion intrinsic to ICA.

[Figure 7: Comparative face recognition performance of ICA and ICA combined with the Bayes classifier.]
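The claim that the Mahalanobis distance largely duplicates the whitening step can be made concrete with a small numerical check: the Mahalanobis distance computed with covariance $\Sigma$ in the original coordinates equals the Euclidean distance between the whitened vectors. The sketch below only illustrates this identity on synthetic data; it is not a rerun of the FERET experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))
Sigma = A @ A.T + 5 * np.eye(5)              # an arbitrary positive definite covariance

# Whitening transform W = Lambda^{-1/2} Phi^t, so that W Sigma W^t = I
lam, phi = np.linalg.eigh(Sigma)
W = np.diag(lam ** -0.5) @ phi.T

x, y = rng.normal(size=5), rng.normal(size=5)

# Mahalanobis distance in the original coordinates ...
d_mahalanobis = np.sqrt((x - y) @ np.linalg.inv(Sigma) @ (x - y))
# ... equals the plain Euclidean distance between the whitened vectors
d_euclidean_whitened = np.linalg.norm(W @ (x - y))

print(d_mahalanobis, d_euclidean_whitened)   # identical up to round-off
```

Since the ICA features are already sphered, the quadratic (Mahalanobis-type) term of the MAP rule adds little beyond what whitening has already accomplished, which is consistent with the nearly identical curves of ICA and ICA + Bayes in Fig. 7.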

[Figure 8: Comparative face recognition performance of ICA and ICA combined with the FLD.]

6. Conclusions

This paper addresses the relative usefulness of independent component analysis for face recognition. Comparative assessments are made regarding (i) ICA sensitivity to the dimension of the space where it is carried out, and (ii) ICA discriminant performance alone or when combined with other discriminant criteria such as the MAP criterion of the Bayes classifier or Fisher's linear discriminant. The sensitivity analysis suggests that for enhanced performance ICA should be carried out in a compressed and whitened space where most of the representative information of the original data is preserved and the small trailing eigenvalues are discarded. The dimensionality of the compressed subspace is decided based on the eigenvalue spectrum of the training data. The discriminant analysis shows that the ICA criterion, when carried out in the properly compressed and whitened space, performs better than the eigenfaces and Fisherfaces methods for face recognition, but its performance deteriorates significantly when augmented by an additional discriminant criterion such as the FLD.

Acknowledgments: This work was partially supported by the DoD Counterdrug Technology Development Program, with the U.S. Army Research Laboratory as Technical Agent, under contract DAAL01-97-K-0118.

References

[1] M. Bartlett and T. Sejnowski. Independent components of face images: A representation for face recognition. In Proc. the 4th Annual Joint Symposium on Neural Computation, Pasadena, CA, May 17, 1997.
[2] P. Belhumeur, J. Hespanha, and D. Kriegman. Eigenfaces vs. fisherfaces: Recognition using class specific linear projection. IEEE Trans. Pattern Analysis and Machine Intelligence, 19(7):711–720, 1997.
[3] R. Brunelli and T. Poggio. Face recognition: Features vs. templates. IEEE Trans. Pattern Analysis and Machine Intelligence, 15(10):1042–1053, 1993.
[4] R. Chellappa, C. Wilson, and S. Sirohey. Human and machine recognition of faces: A survey. Proc. IEEE, 83(5):705–740, 1995.
[5] P. Comon. Independent component analysis, a new concept? Signal Processing, 36:287–314, 1994.
[6] K. Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press, second edition, 1991.
[7] A. Hyvarinen and E. Oja. A fast fixed-point algorithm for independent component analysis. Neural Computation, 9:1483–1492, 1997.
[8] I. T. Jolliffe. Principal Component Analysis. Springer, New York, 1986.
[9] J. Karhunen, E. Oja, L. Wang, R. Vigario, and J. Joutsensalo. A class of neural networks for independent component analysis. IEEE Trans. on Neural Networks, 8(3):486–504, 1997.
[10] M. Kirby and L. Sirovich. Application of the karhunen-loeve procedure for the characterization of human faces. IEEE Trans. Pattern Analysis and Machine Intelligence, 12(1):103–108, 1990.
[11] C. Liu and H. Wechsler. Enhanced fisher linear discriminant models for face recognition. In Proc. the 14th International Conference on Pattern Recognition, Queensland, Australia, August 17-20, 1998.
[12] C. Liu and H. Wechsler. Probabilistic reasoning models for face recognition. In Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Santa Barbara, California, USA, June 23-25, 1998.
[13] B. Moghaddam and A. Pentland. Probabilistic visual learning for object representation. IEEE Trans. Pattern Analysis and Machine Intelligence, 19(7):696–710, 1997.
[14] B. A. Olshausen and D. J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(13):607–609, 1996.
[15] P. Penev and J. Atick. Local feature analysis: A general statistical theory for object representation. Network: Computation in Neural Systems, 7:477–500, 1996.
[16] P. Phillips, H. Moon, P. Rauss, and S. Rizvi. The FERET September 1996 database and evaluation procedure. In Proc. First Int'l Conf. on Audio and Video-based Biometric Person Authentication, pages 12–14, Switzerland, 1997.
[17] A. Samal and P. Iyengar. Automatic recognition and analysis of human faces and facial expression: A survey. Pattern Recognition, 25(1):65–77, 1992.
[18] D. L. Swets and J. Weng. Using discriminant eigenfeatures for image retrieval. IEEE Trans. Pattern Analysis and Machine Intelligence, 18(8):831–836, 1996.
[19] M. Turk and A. Pentland. Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1):71–86, 1991.
