MIT Media Laboratory Perceptual Computing Section Technical Report No. 443
Appears in: The 3rd IEEE Int'l Conference on Automatic Face & Gesture Recognition, Nara, Japan, April 1998

Beyond Eigenfaces:
Probabilistic Matching for Face Recognition

Baback Moghaddam, Wasiuddin Wahid and Alex Pentland
MIT Media Laboratory, 20 Ames St., Cambridge, MA 02139, USA.
Email: {baback,wasi,sandy}@media.mit.edu
Abstract

We propose a novel technique for direct visual matching of images for the purposes of face recognition and database search. Specifically, we argue in favor of a probabilistic measure of similarity, in contrast to simpler methods which are based on standard L2 norms (e.g., template matching) or subspace-restricted norms (e.g., eigenspace matching). The proposed similarity measure is based on a Bayesian analysis of image differences: we model two mutually exclusive classes of variation between two facial images: intra-personal (variations in appearance of the same individual, due to different expressions or lighting) and extra-personal (variations in appearance due to a difference in identity). The high-dimensional probability density functions for each respective class are then obtained from training data using an eigenspace density estimation technique and subsequently used to compute a similarity measure based on the a posteriori probability of membership in the intra-personal class, which is used to rank matches in the database. The performance advantage of this probabilistic matching technique over standard nearest-neighbor eigenspace matching is demonstrated using results from ARPA's 1996 "FERET" face recognition competition, in which this algorithm was found to be the top performer.

1 Introduction

Current approaches to image matching for visual object recognition and image database retrieval often make use of simple image similarity metrics such as Euclidean distance or normalized correlation, which correspond to a standard template-matching approach to recognition [2]. For example, in its simplest form, the similarity measure S(I_1, I_2) between two images I_1 and I_2 can be set to be inversely proportional to the norm ||I_1 − I_2||. Such a simple formulation suffers from a major drawback: it does not exploit knowledge of which type of variations are critical (as opposed to incidental) in expressing similarity. In this paper, we formulate a probabilistic similarity measure which is based on the probability that the image intensity differences, denoted by Δ = I_1 − I_2, are characteristic of typical variations in appearance of the same object. For example, for purposes of face recognition, we can define two classes of facial image variations: intrapersonal variations Ω_I (corresponding, for example, to different facial expressions of the same individual) and extrapersonal variations Ω_E (corresponding to variations between different individuals). Our similarity measure is then expressed in terms of the probability

    S(I_1, I_2) = P(Δ ∈ Ω_I) = P(Ω_I | Δ)                                  (1)

where P(Ω_I | Δ) is the a posteriori probability given by Bayes rule, using estimates of the likelihoods P(Δ | Ω_I) and P(Δ | Ω_E) which are derived from training data using an efficient subspace method for density estimation of high-dimensional data [6]. This Bayesian (MAP) approach can also be viewed as a generalized nonlinear extension of Linear Discriminant Analysis (LDA) [8, 3] or "FisherFace" techniques [1] for face recognition. Moreover, our nonlinear generalization has distinct computational/storage advantages over these linear methods for large databases.

2 Analysis of Intensity Differences

We now consider the problem of characterizing the type of differences which occur when matching two images in a face recognition task. We define two distinct and mutually exclusive classes: Ω_I representing intrapersonal variations between multiple images of the same individual (e.g., with different expressions and lighting conditions), and Ω_E representing extrapersonal variations which result when matching two different individuals. We will assume that both classes are Gaussian-distributed and seek to obtain estimates of the likelihood functions P(Δ | Ω_I) and P(Δ | Ω_E) for a given intensity difference Δ = I_1 − I_2. Given these likelihoods we can define the similarity score S(I_1, I_2) between a pair of images directly in terms of the intrapersonal a posteriori probability as given by Bayes rule:

    S = P(Ω_I | Δ)
      = P(Δ | Ω_I) P(Ω_I) / [ P(Δ | Ω_I) P(Ω_I) + P(Δ | Ω_E) P(Ω_E) ]      (2)

where the priors P(Ω) can be set to reflect specific operating conditions (e.g., number of test images vs. the size of the database) or other sources of a priori knowledge regarding the two images being matched. Additionally, this particular Bayesian formulation casts the standard face recognition task (essentially an M-ary classification problem for M individuals) into a binary pattern classification problem with Ω_I and Ω_E. This much simpler problem is then solved using the maximum a posteriori (MAP) rule; i.e., two images are determined to belong to the same individual if P(Ω_I | Δ) > P(Ω_E | Δ), or equivalently, if S(I_1, I_2) > 1/2.
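To make Eq. (2) and the MAP rule concrete, the following is a minimal numerical sketch in which each difference class is modeled as a one-dimensional, zero-mean Gaussian. The function names and the variances (1.0 for intrapersonal, 25.0 for extrapersonal) are illustrative assumptions, not values from the paper; they only encode the intuition that intrapersonal differences tend to be smaller.

```python
import math

def gaussian(x, mean, var):
    """1-D Gaussian density N(x; mean, var)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def similarity(delta, p_intra=0.5, p_extra=0.5):
    """S = P(Omega_I | Delta) via Bayes rule over the two difference classes.
    Hypothetical variances: intrapersonal differences are small (var = 1),
    extrapersonal ones are large (var = 25)."""
    l_intra = gaussian(delta, 0.0, 1.0)    # P(Delta | Omega_I)
    l_extra = gaussian(delta, 0.0, 25.0)   # P(Delta | Omega_E)
    return (l_intra * p_intra) / (l_intra * p_intra + l_extra * p_extra)

# MAP rule: declare "same individual" iff S > 1/2
print(similarity(0.5) > 0.5)    # small difference: intrapersonal wins
print(similarity(10.0) > 0.5)   # large difference: extrapersonal wins
```

With equal priors, the rule S > 1/2 reduces to choosing whichever class assigns the difference the larger likelihood.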
2.1 Density Modeling

One difficulty with this approach is that the intensity difference vector is very high-dimensional, with Δ ∈ R^N and N = O(10^4). Therefore we typically lack sufficient independent training observations to compute reliable 2nd-order statistics for the likelihood densities (i.e., singular covariance matrices will result). Even if we were able to estimate these statistics, the computational cost of evaluating the likelihoods is formidable. Furthermore, this computation would be highly inefficient since the intrinsic dimensionality or major degrees-of-freedom of Δ for each class is likely to be significantly smaller than N.

Recently, an efficient density estimation method was proposed by Moghaddam & Pentland [6] which divides the vector space R^N into two complementary subspaces using an eigenspace decomposition. This method relies on a Principal Components Analysis (PCA) [4] to form a low-dimensional estimate of the complete likelihood which can be evaluated using only the first M principal components, where M << N. This decomposition is illustrated in Figure 1, which shows an orthogonal decomposition of the vector space R^N into two mutually exclusive subspaces: the principal subspace F containing the first M principal components and its orthogonal complement F̄, which contains the residual of the expansion. The component of Δ in the orthogonal subspace F̄ is the so-called "distance-from-feature-space" (DFFS), a Euclidean distance equivalent to the PCA residual error. The component of Δ which lies in the feature space F is referred to as the "distance-in-feature-space" (DIFS) and is a Mahalanobis distance for Gaussian densities.

As shown in [6], the complete likelihood estimate can be written as the product of two independent marginal Gaussian densities

    P̂(Δ | Ω) = [ exp( −(1/2) Σ_{i=1..M} y_i² / λ_i ) / ( (2π)^{M/2} Π_{i=1..M} λ_i^{1/2} ) ]
               · [ exp( −ε²(Δ) / (2ρ) ) / (2πρ)^{(N−M)/2} ]
             = P_F(Δ | Ω) P̂_F̄(Δ | Ω)                                      (3)

where P_F(Δ | Ω) is the true marginal density in F, P̂_F̄(Δ | Ω) is the estimated marginal density in the orthogonal complement F̄, the y_i are the principal components and ε²(Δ) is the residual (or DFFS). The optimal value for the weighting parameter ρ is then found to be simply the average of the F̄ eigenvalues

    ρ = (1 / (N − M)) Σ_{i=M+1..N} λ_i                                     (4)

We note that in actual practice, the majority of the F̄ eigenvalues are unknown but can be estimated, for example, by fitting a nonlinear function to the available portion of the eigenvalue spectrum and estimating the average of the eigenvalues beyond the principal subspace.

Figure 1: (a) Decomposition of R^N into the principal subspace F and its orthogonal complement F̄ for a Gaussian density, (b) a typical eigenvalue spectrum and its division into the two orthogonal subspaces.
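The estimator of Eqs. (3)-(4) can be sketched as follows. The data here are synthetic and the sizes (N = 20, M = 5) are arbitrary illustrative choices; the function names are our own, not part of any published implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))          # stand-in training difference vectors in R^N
N, M = X.shape[1], 5

# Eigenspace of the training data
mean = X.mean(axis=0)
cov = np.cov(X - mean, rowvar=False)
lam, V = np.linalg.eigh(cov)
order = np.argsort(lam)[::-1]           # sort eigenvalues in descending order
lam, V = lam[order], V[:, order]

rho = lam[M:].mean()                    # Eq. (4): average of the F-bar eigenvalues

def log_likelihood(delta):
    """log P-hat(Delta | Omega) from Eq. (3): a Mahalanobis (DIFS) term in the
    principal subspace F plus a residual (DFFS) term weighted by rho in F-bar."""
    d = delta - mean
    y = V[:, :M].T @ d                  # first M principal components y_i
    difs = np.sum(y**2 / lam[:M])       # DIFS: Mahalanobis distance in F
    dffs = d @ d - y @ y                # DFFS residual epsilon^2(Delta)
    return (-0.5 * difs - 0.5 * np.sum(np.log(lam[:M])) - 0.5 * M * np.log(2 * np.pi)
            - dffs / (2 * rho) - 0.5 * (N - M) * np.log(2 * np.pi * rho))

print(log_likelihood(X[0]))
```

Working in log space avoids the underflow that the raw product in Eq. (3) would cause for realistic N.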
3 Experiments

To test our recognition strategy we used a collection of images from the FERET face database. This collection of images consists of hard recognition cases that have proven difficult for all face recognition algorithms previously tested on the FERET database. The difficulty posed by this dataset appears to stem from the fact that the images were taken at different times, at different locations, and under different imaging conditions. The set of images consists of pairs of frontal views (FA/FB) and is divided into two subsets: the "gallery" (training set) and the "probes" (testing set). The gallery images consisted of 74 pairs of images (2 per individual) and the probe set consisted of 38 pairs of images, corresponding to a subset of the gallery members. The probe and gallery datasets were captured a week apart and exhibit differences in clothing, hair and lighting (see Figure 2).

Before we can apply our matching technique, we need to perform an affine alignment of these facial images. For this purpose we have used an automatic face-processing system which extracts faces from the input image and normalizes for translation, scale as well as slight rotations (both in-plane and out-of-plane). This system is described in detail in [6] and uses maximum-likelihood estimation of object location (in this case the position and scale of a face and the location of individual facial features) to geometrically align faces into standard normalized form as shown in Figure 3. All the faces in our experiments were geometrically aligned and normalized in this manner prior to further analysis.

Figure 2: Examples of FERET frontal-view image pairs used for (a) the Gallery set (training) and (b) the Probe set (testing).

Figure 3: The face alignment system (multiscale head search, feature search, scale, warp, mask).

3.1 Eigenface Matching

As a baseline comparison, we first used an eigenface matching technique for recognition [9]. The normalized images from the gallery and the probe sets were projected onto a 100-dimensional eigenspace and a nearest-neighbor rule based on a Euclidean distance measure was used to match each probe image to a gallery image. We note that this method corresponds to a generalized template-matching method which uses a Euclidean-norm type of similarity S(I_1, I_2), restricted to the principal component subspace of the data. A few of the lower-order eigenfaces used for this projection are shown in Figure 4. We note that these eigenfaces represent the principal components of an entirely different set of images; i.e., none of the individuals in the gallery or probe sets were used in obtaining these eigenvectors. In other words, neither the gallery nor the probe sets were part of the "training set."

Figure 4: Standard Eigenfaces.

The rank-1 recognition rate obtained with this method was found to be 84% (64 correct matches out of 76), and the correct match was always in the top 10 nearest neighbors. Note that this performance is better than or similar to recognition rates obtained by any algorithm tested on this database, and that it is lower (by about 10%) than the typical rates that we have obtained with the FERET database [5]. We attribute this lower performance to the fact that these images were selected to be particularly challenging. In fact, using an eigenface method to match the first views of the 76 individuals in the gallery to their second views, we obtain a higher recognition rate of 89% (68 out of 76), suggesting that the gallery images represent a less challenging data set, since these images were taken at the same time and under identical lighting conditions.
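The baseline above amounts to PCA projection followed by Euclidean nearest-neighbor ranking. A self-contained sketch follows; the random 64-pixel "images", the 10-dimensional subspace, and the separate training set are placeholders standing in for real face data.

```python
import numpy as np

rng = np.random.default_rng(1)
train = rng.normal(size=(200, 64))      # images used only to build the eigenspace
gallery = rng.normal(size=(20, 64))
probe = gallery[3] + 0.01 * rng.normal(size=64)   # noisy copy of gallery entry 3

# Eigenfaces: principal components of the (separate) training set
mean = train.mean(axis=0)
_, _, Vt = np.linalg.svd(train - mean, full_matrices=False)
W = Vt[:10]                             # first 10 eigenfaces (rows)

g = (gallery - mean) @ W.T              # gallery subspace coefficients
p = (probe - mean) @ W.T                # probe subspace coefficients
ranking = np.argsort(np.linalg.norm(g - p, axis=1))
print(ranking[0])                       # index of the nearest gallery match
```

Note that, as in the experiment, the eigenspace is built from images disjoint from both gallery and probe sets; only the projection and distance computation involve the test data.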
3.2 Bayesian Matching

For our probabilistic algorithm, we first gathered training data by computing the intensity differences for a training subset of 74 intrapersonal differences (by matching the two views of every individual in the gallery) and a random subset of 296 extrapersonal differences (by matching images of different individuals in the gallery), corresponding to the classes Ω_I and Ω_E, respectively.

It is interesting to consider how these two classes are distributed; for example, are they linearly separable or embedded distributions? One simple method of visualizing this is to plot their mutual principal components; i.e., perform PCA on the combined dataset and project each vector onto the principal eigenvectors. Such a visualization is shown in Figure 5(a), which is a 3D scatter plot of the first 3 principal components. This plot shows what appears to be two completely enmeshed distributions, both having near-zero means and differing primarily in the amount of scatter, with Ω_I displaying smaller intensity differences as expected. It therefore appears that one can not reliably distinguish low-amplitude extrapersonal differences (of which there are many) from intrapersonal ones.

However, direct visual interpretation of Figure 5(a) is very misleading since we are essentially dealing with low-dimensional (or "flattened") hyper-ellipsoids which are intersecting near the origin of a very high-dimensional space. The key distinguishing factor between the two distributions is their relative orientation. Fortunately, we can easily determine this relative orientation by performing a separate PCA on each class and computing the dot product of their respective first eigenvectors. This analysis yields the cosine of the angle between the major axes of the two hyper-ellipsoids; the corresponding angle was found to be 124°, implying that the orientation of the two hyper-ellipsoids is quite different. Figure 5(b) is a schematic illustration of the geometry of this configuration, where the hyper-ellipsoids have been drawn to approximate scale using the corresponding eigenvalues.

Figure 5: (a) Distribution of the two classes in the first 3 principal components (circles for Ω_I, dots for Ω_E) and (b) schematic representation of the two distributions showing the orientation difference between the corresponding principal eigenvectors.

Having analyzed the geometry of the two distributions, we then computed the likelihood estimates P(Δ | Ω_I) and P(Δ | Ω_E) using the PCA-based method outlined in Section 2.1. We selected principal subspace dimensions of M_I = 10 and M_E = 30 for Ω_I and Ω_E, respectively. These density estimates were then used with a default setting of equal priors, P(Ω_I) = P(Ω_E), to evaluate the a posteriori intrapersonal probability P(Ω_I | Δ) for matching probe images to those in the gallery.

Therefore, for each probe image we computed probe-to-gallery differences and sorted the matching order, this time using the a posteriori probability P(Ω_I | Δ) as the similarity measure. This probabilistic ranking yielded an improved rank-1 recognition rate of 89.5%. Furthermore, out of the 608 extrapersonal warps performed in this recognition experiment, only 2% (11) were misclassified as being intrapersonal; i.e., with P(Ω_I | Δ) > P(Ω_E | Δ).

3.3 Dual Eigenfaces

We note that the two mutually exclusive classes Ω_I and Ω_E correspond to a "dual" set of eigenfaces as shown in Figure 6. Note that the intrapersonal variations shown in Figure 6(a) represent subtle variations due mostly to expression changes (and lighting) whereas the extrapersonal variations in Figure 6(b) are more representative of general eigenfaces which code variations such as hair color, facial hair and glasses. Also note the overall qualitative similarity of the extrapersonal eigenfaces to the standard eigenfaces in Figure 4. This suggests the basic intuition that intensity differences of the extrapersonal type span a larger vector space, similar to the volume of facespace spanned by standard eigenfaces, whereas the intrapersonal eigenspace corresponds to a more tightly constrained subspace. It is the representation of this intrapersonal subspace that is the critical part of formulating a probabilistic measure of facial similarity. In fact, our experiments with a larger set of FERET images have shown that this intrapersonal eigenspace alone is sufficient for a simplified maximum-likelihood measure of similarity (see Section 3.4).

Figure 6: "Dual" Eigenfaces: (a) Intrapersonal, (b) Extrapersonal.

Finally, we note that since these classes are not linearly separable, simple linear discriminant techniques (e.g., using hyperplanes) can not be used with any degree of reliability. The proper decision surface is inherently nonlinear (quadratic, in fact, under the Gaussian assumption) and is best defined in terms of the a posteriori probabilities; i.e., by the equality P(Ω_I | Δ) = P(Ω_E | Δ). Fortunately, the optimal discriminant surface is automatically implemented when invoking a MAP classification rule.
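The relative-orientation analysis of Section 3.2 (a separate PCA per class, then the dot product of the leading eigenvectors) can be sketched with two synthetic, differently oriented Gaussian clouds standing in for Ω_I and Ω_E; the shapes and scales below are illustrative, not measured values.

```python
import numpy as np

rng = np.random.default_rng(2)
# Two elongated 2-D clouds with roughly perpendicular major axes
intra = rng.normal(size=(300, 2)) @ np.array([[3.0, 0.0], [0.0, 0.3]])
extra = rng.normal(size=(300, 2)) @ np.array([[0.4, 0.0], [0.0, 5.0]])

def first_eigenvector(X):
    """Leading principal axis of a centered sample (first right-singular vector)."""
    _, _, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
    return Vt[0]

cosine = float(first_eigenvector(intra) @ first_eigenvector(extra))
angle = np.degrees(np.arccos(np.clip(cosine, -1.0, 1.0)))
print(angle)   # near 90 degrees for these roughly axis-aligned clouds
```

Because an eigenvector's sign is arbitrary, only the magnitude of the cosine (equivalently, the angle modulo 180°) is meaningful when comparing principal axes.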
3.4 The 1996 FERET Competition Results

This approach to recognition has produced a significant improvement over the accuracy we obtained using a standard eigenface nearest-neighbor matching rule. The probabilistic similarity measure was used in the September 1996 FERET competition (with subspace dimensionalities of M_I = M_E = 125) and was found to be the top-performing system, by a typical margin of 10-20% over the other competing algorithms [7] (see Figure 7). Figure 8 shows the performance comparison between standard eigenfaces and the Bayesian method from this test. Note the 10% gain in performance afforded by the new Bayesian similarity measure. Similarly, Figure 9 shows the recognition results for "duplicate" images which were separated in time by up to 6 months (a much more challenging recognition problem), showing a 30% improvement in recognition rate with Bayesian matching. Thus we note that in both cases (FA/FB and duplicates) the new probabilistic similarity measure has effectively halved the error rate of eigenface matching.

Figure 7: Cumulative recognition rates for frontal FA/FB views for the competing algorithms in the FERET 1996 test. The top curve (labeled "MIT Sep 96") corresponds to our Bayesian matching technique. Note that second place is standard eigenface matching (labeled "MIT Mar 95").

Figure 8: Cumulative recognition rates for frontal FA/FB views with standard eigenface matching and the newer Bayesian similarity metric.

Figure 9: Cumulative recognition rates for frontal duplicate views with standard eigenface matching and the newer Bayesian similarity metric.

We have recently experimented with a more simplified probabilistic similarity measure which uses only the intrapersonal eigenfaces with the intensity difference Δ to formulate a maximum-likelihood (ML) matching technique using

    S' = P(Δ | Ω_I)                                                        (5)

instead of the maximum a posteriori (MAP) approach defined by Equation 2. Although this simplified measure has not yet been officially FERET tested, our own experiments with a database of size 2000 have shown that using S' instead of S results in only a minor (2%) deficit in the recognition rate while cutting the computational cost by a factor of 1/2 (requiring a single eigenspace projection as opposed to two).

4 Conclusions

We have proposed a novel technique for direct visual matching of images for the purposes of recognition and search in a large face database. Specifically, we have argued in favor of a probabilistic measure of similarity, in contrast to simpler methods which are based on standard L2 norms (e.g., template matching [2]) or subspace-restricted norms (e.g., eigenspace matching [9]). This technique is based on a Bayesian analysis of image differences which leads to a very useful measure of similarity. The performance advantage of our probabilistic matching technique has been demonstrated using both a small database (internally tested) as well as a large (800+) database with an independent double-blind test as part of ARPA's September 1996 "FERET" competition, in which Bayesian similarity out-performed all competing algorithms (at least one of which was using an LDA/Fisher type method). We believe that these results clearly demonstrate the superior performance of probabilistic matching over eigenface, LDA/Fisher and other existing techniques.

This probabilistic framework is particularly advantageous in that the intra/extra density estimates explicitly characterize the type of appearance variations which are critical in formulating a meaningful measure of similarity. For example, the deformations corresponding to facial expression changes (which may have high image-difference norms) are, in fact, irrelevant when the measure of similarity is to be based on identity. The subspace density estimation method used for representing these classes thus corresponds to a learning method for discovering the principal modes of variation important to the classification task. Furthermore, by equating similarity with the a posteriori probability we obtain an optimal non-linear decision rule for matching and recognition. This aspect of our approach differs significantly from recent methods which use simple linear discriminant analysis techniques for recognition (e.g., [8, 3]). Our Bayesian (MAP) method can also be viewed as a generalized nonlinear (quadratic) version of Linear Discriminant Analysis (LDA) [3] or "FisherFace" techniques [1]. The computational advantage of our approach is that there is no need to compute and store an eigenspace for each individual in the gallery (as required with LDA). One (or at most two) eigenspaces are sufficient for probabilistic matching, and therefore storage and computational costs are fixed and do not increase with the size of the database (as with LDA/Fisher methods).

Finally, the results obtained with the simplified ML similarity measure (S' in Eq. 5) suggest a computationally equivalent yet superior alternative to standard eigenface matching. In other words, a likelihood similarity based on the intrapersonal density P(Δ | Ω_I) alone is far superior to nearest-neighbor matching in eigenspace while essentially requiring the same number of projections. For completeness (and slightly better performance), however, one should use the a posteriori similarity S in Eq. 2, at twice the computational cost of standard eigenfaces.

References

[1] V.I. Belhumeur, J.P. Hespanha, and D.J. Kriegman. Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-19(7):711-720, July 1997.
[2] R. Brunelli and T. Poggio. Face recognition: Features vs. templates. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(10), October 1993.
[3] K. Etemad and R. Chellappa. Discriminant analysis for recognition of human faces. In Proc. of Int'l Conf. on Acoustics, Speech and Signal Processing, pages 2148-2151, 1996.
[4] I.T. Jolliffe. Principal Component Analysis. Springer-Verlag, New York, 1986.
[5] B. Moghaddam and A. Pentland. Face recognition using view-based and modular eigenspaces. Automatic Systems for the Identification and Inspection of Humans, 2277, 1994.
[6] B. Moghaddam and A. Pentland. Probabilistic visual learning for object representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-19(7):696-710, July 1997.
[7] P.J. Phillips, H. Moon, P. Rauss, and S. Rizvi. The FERET evaluation methodology for face-recognition algorithms. In IEEE Proceedings of Computer Vision and Pattern Recognition, pages 137-143, June 1997.
[8] D. Swets and J. Weng. Using discriminant eigenfeatures for image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-18(8):831-836, August 1996.
[9] M. Turk and A. Pentland. Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1), 1991.