Whitened LDA for Face Recognition ∗

Vo Dinh Minh Nhat, Ubiquitous Computing Lab, Kyung Hee University, Suwon, Korea
SungYoung Lee, Ubiquitous Computing Lab, Kyung Hee University, Suwon, Korea
Hee Yong Youn, Mobile Computing Lab, SungKyunKwan University, Suwon, Korea

∗ This research was supported by the MIC (Ministry of Information and Communication), Korea, under the ITRC (Information Technology Research Center) support program supervised by the IITA (Institute of Information Technology Advancement) (IITA-2006-C1090-0602-0002).

ABSTRACT

Over the years, many Linear Discriminant Analysis (LDA) algorithms have been proposed for the study of high dimensional data in a large variety of problems. An intrinsic limitation of classical LDA is the so-called "small sample size (3S) problem": it fails when all scatter matrices are singular. Many LDA extensions were proposed in the past to overcome the 3S problem. However, none of the previous methods solves the 3S problem completely, in the sense of keeping all the discriminative features at a low computational cost. By applying LDA after whitening the data, we propose Whitened LDA (WLDA), which can find the most discriminant features without facing the 3S problem. In WLDA, only eigenvalue problems, instead of generalized eigenvalue problems, need to be solved, which leads to the low computational cost of WLDA. Experimental results are reported on the two most popular databases, Yale and ORL, with comparisons against Linear Discriminant Analysis (LDA), Direct LDA (DLDA), Null space LDA (NLDA) and several recently developed matrix-based subspace analysis approaches. Our method achieves the best recognition accuracy in all these comparisons.

Categories and Subject Descriptors

I.5.4 [PATTERN RECOGNITION]: Applications

General Terms

Algorithms, Theory, Experimentation, Performance

Keywords

Face Recognition, Data Whitening, LDA

1. INTRODUCTION

A facial recognition system is a computer-driven application for automatically identifying a person from a digital image. It does so by comparing selected facial features in the live image with a facial database. With the rapidly increasing demand for face recognition technology, it is not surprising to see an overwhelming amount of research publications on this topic in recent years. As in other classical statistical pattern recognition tasks, we usually represent data samples with n-dimensional vectors, i.e., the data is vectorized before applying any technique. However, in many real applications the dimension of those 1D data vectors is very high, leading to the "curse of dimensionality". The curse of dimensionality is a significant obstacle in pattern recognition and machine learning problems that involve learning from few data samples in a high-dimensional feature space. In face recognition, Principal Component Analysis (PCA) [5] and Linear Discriminant Analysis (LDA) [1] are the most popular subspace analysis approaches for learning the low-dimensional structure of high-dimensional data. PCA is a subspace projection technique widely used for face recognition. It finds a set of representative projection vectors such that the projected samples retain most of the information about the original samples; the most representative vectors are the eigenvectors corresponding to the largest eigenvalues of the covariance matrix. Unlike PCA, LDA finds a set of vectors that maximizes the Fisher Discriminant Criterion: it simultaneously maximizes the between-class scatter while minimizing the within-class scatter in the projected feature space. While PCA can be called an unsupervised learning technique, LDA is a supervised learning technique because it needs class information for each image in the training process. This method overcomes the limitations of the Eigenface method by applying Fisher's Linear Discriminant criterion.
This criterion tries to maximize the ratio

\frac{w^T S_b w}{w^T S_w w}    (1)

where S_b is the between-class scatter matrix and S_w is the within-class scatter matrix. Thus, by applying this method, we find the projection directions that on the one hand maximize the Euclidean distance between the face images of different classes and on the other hand minimize the distance between the face images of the same class. This ratio is maximized when the column vectors of the projection matrix W are the eigenvectors of S_w^{-1} S_b.
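For concreteness, the following is a minimal NumPy sketch of the classical LDA solution just described (the leading eigenvectors of S_w^{-1} S_b). It assumes S_w is nonsingular, and the function and variable names are our own illustration rather than code from the paper.

```python
import numpy as np

def lda_directions(X, labels, m):
    """Classical LDA: return the m leading eigenvectors of Sw^{-1} Sb.

    X      : (N, n) matrix, one vectorized face image per row
    labels : (N,) array of class labels
    m      : number of discriminant directions (at most C - 1)
    Assumes Sw is nonsingular, which is exactly what fails in the
    small-sample-size setting discussed next.
    """
    N, n = X.shape
    mu = X.mean(axis=0)
    Sb = np.zeros((n, n))
    Sw = np.zeros((n, n))
    for c in np.unique(labels):
        Xc = X[labels == c]
        mu_c = Xc.mean(axis=0)
        Sb += len(Xc) * np.outer(mu_c - mu, mu_c - mu)
        Sw += (Xc - mu_c).T @ (Xc - mu_c)
    Sb /= N
    Sw /= N
    # Columns of W are the eigenvectors of Sw^{-1} Sb with largest eigenvalues.
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(evals.real)[::-1]
    return evecs[:, order[:m]].real          # projection matrix W, shape (n, m)
```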

In face recognition tasks, this method cannot be applied directly, since the dimension of the sample space is typically larger than the number of samples in the training set. As a consequence, S_w is singular. This problem is known as the "small sample size problem" [3]. Many methods have been proposed to solve this problem, and reviews of those methods can be found in the related literature. For the sake of completeness, we summarize here the most notable approaches to overcoming the 3S problem. In [1], a two-stage PCA+LDA method, also known as the Fisherface method, was proposed, in which PCA is first used for dimension reduction so as to make S_w nonsingular before the application of LDA. However, in order to make S_w nonsingular, some directions corresponding to the small eigenvalues of S_w are thrown away in the PCA step; thus, applying PCA for dimensionality reduction has the potential to remove dimensions that contain discriminative information. In [12], S_w is made nonsingular by adding a small perturbation matrix ∆ to it. However, this method is computationally very expensive and is not considered in the experimental part of this paper. The Direct-LDA method is proposed in [11]. First, the null space of S_b is removed and, then, the projection vectors that minimize the within-class scatter in the transformed space are selected from the range space of S_b. However, removing the null space of S_b by dimensionality reduction also removes part of the null space of S_w and may result in the loss of important discriminative information. In [2], Chen et al. proposed the null space based LDA (NLDA), where the between-class scatter is maximized in the null space of the within-class scatter matrix; the singularity problem is thus implicitly avoided. Huang et al. [4] improved the efficiency of the algorithm by first removing the null space of the total scatter matrix. In orthogonal LDA (OLDA) [8], a set of orthogonal discriminant vectors is computed.

In this paper, we propose Whitened LDA (WLDA). The key idea of WLDA is that we apply data whitening before performing LDA. Using some nice properties of the whitened data, we show how to turn the generalized eigenvalue problem of LDA into a simple eigenvalue problem, leading to the low computational cost of the algorithm. While previous methods lose some discriminant information along the way, WLDA is able to keep all discriminant information. The main contributions of this paper are: a solution to LDA that does not face the 3S problem, the retention of all discriminative features, and a low computational cost. The outline of this paper is as follows. In Section 2, the most important previous and related works are described: PCA, LDA, 2DPCA, GLRAM and 2DLDA. The proposed method is described in Section 3. In Section 4, experimental results are presented for the ORL and Yale face image databases to demonstrate the effectiveness of our method. Finally, conclusions are presented in Section 5.

2. SUBSPACE ANALYSIS

One approach to cope with the problem of excessive dimensionality of the image space is to reduce the dimensionality by combining features. Linear combinations are particularly attractive because they are simple to compute and analytically tractable. In effect, linear methods project the high-dimensional data onto a lower dimensional subspace. Basic notations are described in Table 1 for reference. Suppose that we have N sample images {x_1, x_2, ..., x_N} taking values in an n-dimensional image space, and consider a linear transformation mapping the original n-dimensional image space into an m-dimensional feature space, where m < n. The new feature vectors y_k ∈ ℝ^m are defined by the linear transformation y_k = W^T x_k, k = 1, ..., N, with W ∈ ℝ^{n×m}.

Table 1: Basic Notations

Notation         Description
x_i ∈ ℝ^n        the i-th image point in vector form
X_i ∈ ℝ^{r×c}    the i-th image point in matrix form
Π_i              the i-th class of data points (both in vector and matrix form)
n                dimension of x_i
m                dimension of the reduced feature vector y_i
r                number of rows in X_i
c                number of columns in X_i
N                number of data samples
C                number of classes
N_i              number of data samples in class Π_i
L                transformation on the left side
R                transformation on the right side
l_1              number of rows in Y_i
l_2              number of columns in Y_i

Let µ denote the mean of all samples, µ_i the mean of the samples in class Π_i, and N_i the number of samples in class Π_i. Then the between-class scatter matrix S_b and the within-class scatter matrix S_w are defined as

S_b = \frac{1}{N}\sum_{i=1}^{C} N_i (\mu_i - \mu)(\mu_i - \mu)^T    (4)

S_w = \frac{1}{N}\sum_{i=1}^{C}\sum_{x_k \in \Pi_i} (x_k - \mu_i)(x_k - \mu_i)^T    (5)

In the typical face recognition setting we have

rank(S_t) = N - 1
rank(S_b) = C - 1    (6)
rank(S_w) = N - C
S_t = S_b + S_w

where S_t is the total scatter matrix. In LDA, the projection W_opt is chosen to maximize the ratio of the determinant of the between-class scatter matrix of the projected samples to the determinant of the within-class scatter matrix of the projected samples, i.e., the Fisher criterion

\frac{w^T S_b w}{w^T S_w w}

Several LDA extensions (the Fisherface method, DLDA, NLDA) are compared in our experiments; due to the limited length of the paper, details of those algorithms can be found in the respective references.

2.3 Two-dimensional PCA - 2DPCA

In the 2D approach, the image matrix does not need to be transformed into a vector beforehand, so a set of N sample images is represented as {X_1, X_2, ..., X_N} with X_i ∈ ℝ^{r×c}. The image covariance (scatter) matrix T_t is computed directly from the image matrices,

T_t = \frac{1}{N}\sum_{i=1}^{N} (X_i - M)^T (X_i - M)

with M = \frac{1}{N}\sum_{i=1}^{N} X_i ∈ ℝ^{r×c} the mean image of all samples. A linear transformation maps the original r × c image space into an r × m feature space, where m < c. The new feature matrices Y_i ∈ ℝ^{r×m} are defined by the following linear transformation:

Y_i = (X_i - M) W ∈ ℝ^{r×m}    (10)

where i = 1, 2, ..., N and W ∈ ℝ^{c×m} is formed by the eigenvectors of T_t corresponding to its m largest eigenvalues.
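As an illustration of the 2DPCA construction above, here is a small NumPy sketch that builds the image covariance matrix from the centered image matrices and projects each image onto its top-m eigenvectors; the function name and array layout are our own assumptions.

```python
import numpy as np

def two_d_pca(images, m):
    """2DPCA sketch: project each r x c image onto the top-m eigenvectors of
    the image covariance matrix Tt = (1/N) sum_i (Xi - M)^T (Xi - M).

    images : (N, r, c) stack of image matrices
    m      : number of projection columns (m < c)
    Returns (W, Y) with W of shape (c, m) and Y of shape (N, r, m).
    """
    M = images.mean(axis=0)                                   # mean image
    centered = images - M
    Tt = np.einsum('nij,nik->jk', centered, centered) / len(images)  # (c, c)
    evals, evecs = np.linalg.eigh(Tt)                         # ascending order
    W = evecs[:, ::-1][:, :m]                                 # top-m eigenvectors
    Y = centered @ W                                          # Yi = (Xi - M) W
    return W, Y
```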

2.4 Generalized Low Rank Approximations of Matrices - GLRAM

In [7], Ye considers the problem of computing low rank approximations of matrices. By solving an optimization problem which aims to minimize the reconstruction (approximation) error, an iterative algorithm is derived, namely GLRAM, which stands for the Generalized Low Rank Approximations of Matrices. GLRAM reduces the reconstruction error sequentially, and the resulting approximation is thus improved during successive iterations. Formally, the following optimization problem is considered:

\min_{L,R,Y_i} \sum_{i=1}^{N} \| X_i - L Y_i R^T \|_F^2
\text{s.t. } L^T L = I_1, \; R^T R = I_2    (12)

where L ∈ ℝ^{r×l_1}, R ∈ ℝ^{c×l_2}, Y_i ∈ ℝ^{l_1×l_2} for i = 1..N, and I_1 ∈ ℝ^{l_1×l_1} and I_2 ∈ ℝ^{l_2×l_2} are identity matrices, with l_1 ≤ r and l_2 ≤ c. An iterative procedure for computing L and R is presented in Table 2.

Table 2: Algorithm – GLRAM
Step 0: Initialize L^{(0)} = [I_1, 0]^T and set k = 0.
Step 1: Compute the l_2 eigenvectors {\Phi_i^{R(k+1)}}_{i=1}^{l_2} of the matrix S_R = \sum_{i=1}^{N} X_i^T L^{(k)} L^{(k)T} X_i corresponding to the largest l_2 eigenvalues and form R^{(k+1)} = [\Phi_1^{R(k+1)} .. \Phi_{l_2}^{R(k+1)}].
Step 2: Compute the l_1 eigenvectors {\Phi_i^{L(k+1)}}_{i=1}^{l_1} of the matrix S_L = \sum_{i=1}^{N} X_i R^{(k+1)} R^{(k+1)T} X_i^T corresponding to the largest l_1 eigenvalues and form L^{(k+1)} = [\Phi_1^{L(k+1)} .. \Phi_{l_1}^{L(k+1)}].
Step 3: If L^{(k+1)} and R^{(k+1)} have not converged, increase k by 1 and go to Step 1; otherwise proceed to Step 4.
Step 4: Let L^* = L^{(k+1)}, R^* = R^{(k+1)} and compute Y_i = L^{*T} X_i R^* for i = 1..N.
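The iterative procedure of Table 2 can be sketched in NumPy as follows; this is an illustrative reading of the algorithm, with our own function name and a fixed number of sweeps standing in for the convergence test.

```python
import numpy as np

def glram(images, l1, l2, n_iter=10):
    """GLRAM sketch (Table 2): find L (r x l1), R (c x l2) and Yi = L^T Xi R
    that reduce sum_i ||Xi - L Yi R^T||_F^2 via alternating eigenproblems.

    images : (N, r, c) stack of image matrices.
    """
    N, r, c = images.shape
    L = np.eye(r, l1)                                # Step 0: L = [I, 0]^T
    for _ in range(n_iter):
        # Step 1: top-l2 eigenvectors of SR = sum_i Xi^T L L^T Xi
        P = images.transpose(0, 2, 1) @ L            # (N, c, l1) = Xi^T L
        SR = np.einsum('nij,nkj->ik', P, P)          # (c, c)
        R = np.linalg.eigh(SR)[1][:, ::-1][:, :l2]
        # Step 2: top-l1 eigenvectors of SL = sum_i Xi R R^T Xi^T
        Q = images @ R                               # (N, r, l2) = Xi R
        SL = np.einsum('nij,nkj->ik', Q, Q)          # (r, r)
        L = np.linalg.eigh(SL)[1][:, ::-1][:, :l1]
    Y = L.T @ images @ R                             # (N, l1, l2) reduced matrices
    return L, R, Y
```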

2.5 Two-dimensional LDA - 2DLDA

In 2DLDA [9], class separability is measured directly on the matrix samples through the class mean matrices M_j and the overall mean matrix M. The within-class and between-class distances are

D_w = \sum_{j=1}^{C}\sum_{X_i \in \Pi_j} \| X_i - M_j \|_F^2 = tr\left( \sum_{j=1}^{C}\sum_{X_i \in \Pi_j} (X_i - M_j)(X_i - M_j)^T \right)    (13)

D_b = \sum_{j=1}^{C} N_j \| M_j - M \|_F^2 = tr\left( \sum_{j=1}^{C} N_j (M_j - M)(M_j - M)^T \right)    (14)

In the low-dimensional space resulting from the linear transformations L and R, the within- and between-class distances \tilde{D}_w and \tilde{D}_b can be computed as follows:

\tilde{D}_w = tr\left( \sum_{j=1}^{C}\sum_{X_i \in \Pi_j} L^T (X_i - M_j) R R^T (X_i - M_j)^T L \right)    (15)

\tilde{D}_b = tr\left( \sum_{j=1}^{C} N_j L^T (M_j - M) R R^T (M_j - M)^T L \right)    (16)

The optimal transformations L and R would maximize F(L, R) = \tilde{D}_b / \tilde{D}_w. Let us define

S_w^R = \sum_{j=1}^{C}\sum_{X_i \in \Pi_j} (X_i - M_j) R R^T (X_i - M_j)^T    (17)

S_b^R = \sum_{j=1}^{C} N_j (M_j - M) R R^T (M_j - M)^T    (18)

S_w^L = \sum_{j=1}^{C}\sum_{X_i \in \Pi_j} (X_i - M_j)^T L L^T (X_i - M_j)    (19)

S_b^L = \sum_{j=1}^{C} N_j (M_j - M)^T L L^T (M_j - M)    (20)

After defining those matrices, we can derive the 2DLDA algorithm as in Table 3.

Table 3: Algorithm – 2DLDA
Step 0: Initialize R = R^{(0)} = [I_2, 0]^T and set k = 0.
Step 1: Compute
S_w^{R(k)} = \sum_{j=1}^{C}\sum_{X_i \in \Pi_j} (X_i - M_j) R^{(k)} R^{(k)T} (X_i - M_j)^T
S_b^{R(k)} = \sum_{j=1}^{C} N_j (M_j - M) R^{(k)} R^{(k)T} (M_j - M)^T
Step 2: Compute the l_1 eigenvectors {\Phi_i^{L(k)}}_{i=1}^{l_1} of the matrix (S_w^{R(k)})^{-1} S_b^{R(k)} and form L^{(k)} = [\Phi_1^{L(k)} .. \Phi_{l_1}^{L(k)}].
Step 3: Compute
S_w^{L(k)} = \sum_{j=1}^{C}\sum_{X_i \in \Pi_j} (X_i - M_j)^T L^{(k)} L^{(k)T} (X_i - M_j)
S_b^{L(k)} = \sum_{j=1}^{C} N_j (M_j - M)^T L^{(k)} L^{(k)T} (M_j - M)
Step 4: Compute the l_2 eigenvectors {\Phi_i^{R(k)}}_{i=1}^{l_2} of the matrix (S_w^{L(k)})^{-1} S_b^{L(k)} and form R^{(k+1)} = [\Phi_1^{R(k)} .. \Phi_{l_2}^{R(k)}].
Step 5: If L^{(k)} and R^{(k+1)} have not converged, increase k by 1 and go to Step 1; otherwise proceed to Step 6.
Step 6: Let L^* = L^{(k)}, R^* = R^{(k+1)} and compute Y_i = L^{*T} X_i R^* for i = 1..N.
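Similarly, a hedged NumPy sketch of the 2DLDA iteration in Table 3; it assumes the intermediate matrices S_w^R and S_w^L are nonsingular, and the function name and the fixed sweep count are our own choices.

```python
import numpy as np

def two_d_lda(images, labels, l1, l2, n_iter=5):
    """2DLDA sketch (Table 3): alternately update L (r x l1) and R (c x l2)
    from the leading eigenvectors of (Sw_R)^-1 Sb_R and (Sw_L)^-1 Sb_L.

    images : (N, r, c) array, labels : (N,) array of class labels.
    """
    N, r, c = images.shape
    classes = np.unique(labels)
    M = images.mean(axis=0)
    means = {k: images[labels == k].mean(axis=0) for k in classes}
    counts = {k: int((labels == k).sum()) for k in classes}
    R = np.eye(c, l2)                                 # Step 0: R = [I, 0]^T

    def leading(Sw, Sb, num):
        # eigenvectors of Sw^{-1} Sb with the largest eigenvalues
        vals, vecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
        order = np.argsort(vals.real)[::-1]
        return vecs[:, order[:num]].real

    for _ in range(n_iter):
        SwR = np.zeros((r, r)); SbR = np.zeros((r, r))
        for k in classes:
            D = (images[labels == k] - means[k]) @ R          # (Nk, r, l2)
            SwR += np.einsum('nij,nkj->ik', D, D)             # sum (Xi-Mj) R R^T (Xi-Mj)^T
            B = (means[k] - M) @ R                            # (r, l2)
            SbR += counts[k] * B @ B.T
        L = leading(SwR, SbR, l1)                             # Step 2
        SwL = np.zeros((c, c)); SbL = np.zeros((c, c))
        for k in classes:
            D = np.swapaxes(images[labels == k] - means[k], 1, 2) @ L   # (Nk, c, l1)
            SwL += np.einsum('nij,nkj->ik', D, D)             # sum (Xi-Mj)^T L L^T (Xi-Mj)
            B = (means[k] - M).T @ L                          # (c, l1)
            SbL += counts[k] * B @ B.T
        R = leading(SwL, SbL, l2)                             # Step 4
    Y = L.T @ images @ R                                      # Step 6: Yi = L^T Xi R
    return L, R, Y
```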

3. WHITENED LDA

In this section, we first review data whitening and then derive WLDA in detail.

3.1 Data Whitening

Let x ∈ ℝ^n denote a random vector with mean µ and positive semi-definite covariance matrix C_x. We wish to whiten the vector x, i.e., to transform it so that it has zero mean and identity covariance. Let

C_x = U \Lambda U^T

where U = [u_1, ..., u_r] contains the orthonormal eigenvectors of C_x corresponding to its non-zero eigenvalues and \Lambda = diag(\sigma_1, ..., \sigma_r) with \sigma_1 \ge ... \ge \sigma_r > 0 is the diagonal matrix of non-zero eigenvalues. Then whitening a random vector x with mean µ and covariance matrix C_x can be obtained by performing the following calculation with the whitening transformation matrix P = \Lambda^{-1/2} U^T:

y = P(x - \mu) = \Lambda^{-1/2} U^T (x - \mu)    (23)

Thus, the output of this transformation has expectation

E\{y\} = \Lambda^{-1/2} U^T (E\{x\} - \mu) = \Lambda^{-1/2} U^T (\mu - \mu) = 0    (24)

Due to E\{y\} = 0, the covariance matrix C_y can be written as

C_y = E\{y y^T\} = E\{\Lambda^{-1/2} U^T (x - \mu)(x - \mu)^T U \Lambda^{-1/2}\} = \Lambda^{-1/2} U^T C_x U \Lambda^{-1/2} = I    (25)

Thus, with the above transformation, we can whiten the random vector to have zero mean and the identity covariance matrix.
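A minimal sketch of this whitening transformation, assuming the covariance is estimated from N samples stored as rows and that eigenvalues below a small tolerance are treated as zero; all names are illustrative.

```python
import numpy as np

def whiten(X, tol=1e-10):
    """Whiten samples stored as rows of X (shape (N, n)).

    Returns the whitening matrix P = Lambda^{-1/2} U^T built from the
    non-zero eigenpairs of the sample covariance, the mean mu, and the
    whitened data Y with rows y_i = P (x_i - mu).  Y has (approximately)
    zero mean and identity covariance.
    """
    mu = X.mean(axis=0)
    Xc = X - mu
    C = Xc.T @ Xc / X.shape[0]               # covariance estimate C_x
    evals, U = np.linalg.eigh(C)             # eigenvalues in ascending order
    keep = evals > tol * evals.max()         # treat tiny eigenvalues as zero
    U, evals = U[:, keep], evals[keep]
    P = (U / np.sqrt(evals)).T               # P = Lambda^{-1/2} U^T, shape (r, n)
    Y = Xc @ P.T                             # whitened samples, shape (N, r)
    return P, mu, Y
```

For raw face images, where n is far larger than N, the same non-zero eigenpairs would in practice be obtained from the N × N Gram matrix (1/N) Xc Xc^T rather than the n × n covariance matrix.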

3.2 Whitened LDA

The key idea of WLDA is that we apply data whitening before performing LDA. So we first perform data whitening by diagonalizing the total covariance matrix S_t. With r = rank(S_t), let

S_t = U \Lambda U^T

where U = [u_1, ..., u_r] contains the orthonormal eigenvectors of S_t corresponding to its non-zero eigenvalues and \Lambda = diag(\lambda_1, ..., \lambda_r) with \lambda_1 \ge ... \ge \lambda_r > 0 is the diagonal matrix of non-zero eigenvalues. Then we can form the whitening transformation matrix P = \Lambda^{-1/2} U^T and obtain the whitened data as

y_i = P(x_i - \mu) = \Lambda^{-1/2} U^T (x_i - \mu) \in ℝ^r    (27)

where i = 1..N. Then the between-class scatter matrix G_b and the within-class scatter matrix G_w after data whitening are re-defined as

G_b = \frac{1}{N} \sum_{i=1}^{C} N_i \eta_i \eta_i^T    (28)

G_w = \frac{1}{N} \sum_{i=1}^{C} \sum_{y_k \in \Pi_i} (y_k - \eta_i)(y_k - \eta_i)^T    (29)

where \eta_i is the mean of the whitened data in class i, i = 1..C, defined as

\eta_i = \frac{1}{N_i} \sum_{k=1}^{N_i} y_k = \frac{1}{N_i} \sum_{k=1}^{N_i} P(x_k - \mu) = P\left( \frac{1}{N_i} \sum_{k=1}^{N_i} x_k - \mu \right) = P(\mu_i - \mu)    (30)

Theorem 1. Given the whitened data and the between-class scatter matrix G_b and within-class scatter matrix G_w defined as in (28) and (29), we have

G_b + G_w = I_r

where I_r ∈ ℝ^{r×r} is an identity matrix.

Proof. We can re-write the between-class scatter matrix G_b as

G_b = \frac{1}{N} \sum_{i=1}^{C} N_i \eta_i \eta_i^T = \frac{1}{N} \sum_{i=1}^{C} N_i P(\mu_i - \mu)(\mu_i - \mu)^T P^T = P S_b P^T    (31)

and the within-class scatter matrix G_w as

G_w = \frac{1}{N} \sum_{i=1}^{C} \sum_{y_k \in \Pi_i} \left( P(x_k - \mu) - P(\mu_i - \mu) \right)\left( P(x_k - \mu) - P(\mu_i - \mu) \right)^T = \frac{1}{N} \sum_{i=1}^{C} \sum_{y_k \in \Pi_i} P(x_k - \mu_i)(x_k - \mu_i)^T P^T = P S_w P^T    (32)

From (31) and (32), we have

G_b + G_w = P S_b P^T + P S_w P^T = P(S_b + S_w)P^T = P S_t P^T = I_r    (33)

Proof is done.

Now we can formulate the Whitened LDA problem as the following optimization problem:

J_{WLDA}(v) = \max_{v} \frac{v^T G_b v}{v^T G_w v}    (34)

where v ∈ ℝ^r. The following derivation shows that (34) can be reduced to a simple eigenvalue problem.

Proof. Applying the fact that G_b + G_w = I_r from Theorem 1, we can re-write the optimization problem in (34) as

\max_{v} \frac{v^T G_b v}{v^T G_w v} = \max_{v} \frac{v^T G_b v}{v^T (I_r - G_b) v} = \max_{v} \frac{v^T G_b v}{v^T v - v^T G_b v}    (36)

We can see that if v = v_0 is an optimal vector of (36), then c v_0 is also an optimal solution of (36) for any non-zero scalar c. So we constrain v to unit length, i.e. v^T v = 1. Now, the problem in (36) can be written with constraints as follows:

\max_{v^T v = 1} \frac{v^T G_b v}{v^T v - v^T G_b v}    (37)

It is easy to see from (37) that the solution of the optimization problem in (34) can be obtained by solving the following simple eigenvalue problem:

\max_{v^T v = 1} v^T G_b v    (38)

Proof is done.

It is easy to see that the optimal projection matrix V = [v_1 v_2 ... v_m] for (38) is the set of r-dimensional eigenvectors of G_b corresponding to the m largest eigenvalues. Finally, the optimal projection matrix for WLDA can be calculated as W_{WLDA} = P^T V. One should note that in WLDA we find the optimal projection vectors w_{opt} among all vectors w of the form

w = P^T v, \quad \forall v \in ℝ^r

i.e., in the range space of S_t. The following argument shows that all discriminant projection vectors can indeed be found in the range space of S_t.

Proof. Let V be the range space of S_t and V^{\perp} be the null space of S_t. Equivalently,

V = span\{\alpha_k \,|\, S_t \alpha_k \ne 0, k = 1, .., r\}    (41)
V^{\perp} = span\{\alpha_k \,|\, S_t \alpha_k = 0, k = r + 1, .., n\}    (42)

where r is the rank of S_t, \{\alpha_1, .., \alpha_n\} is an orthonormal set, and \{\alpha_1, .., \alpha_r\} is the set of orthonormal eigenvectors corresponding to the non-zero eigenvalues of S_t. Since ℝ^n = V \oplus V^{\perp}, every vector a ∈ ℝ^n has a unique decomposition of the form a = b + c, where b ∈ V and c ∈ V^{\perp}. The projected value corresponding to x_k on the projection vector a can be formed as

y_k^{(a)} = a^T (x_k - \mu) = b^T (x_k - \mu) + c^T (x_k - \mu) = b^T (x_k - \mu)    (43)

since c lies in the null space of S_t and therefore c^T (x_k - \mu) = 0 for all k. From (43) we see that we can find all discriminant projection vectors in the range space of S_t. Proof is done.
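Putting the derivation together, the following is a hedged end-to-end sketch of WLDA: whiten with P built from the non-zero eigenpairs of S_t, form G_b on the whitened data, take its top-m eigenvectors V, and return W_WLDA = P^T V. The function name and the tolerance used to decide the rank of S_t are our own assumptions.

```python
import numpy as np

def wlda(X, labels, m, tol=1e-10):
    """Whitened LDA sketch: W_WLDA = P^T V, with P whitening the total
    scatter St and V the top-m eigenvectors of the whitened between-class
    scatter Gb.

    X : (N, n) vectorized images, labels : (N,) class labels, m <= C - 1.
    Returns (W, mu); a new image x is projected as (x - mu) @ W.
    """
    N = X.shape[0]
    mu = X.mean(axis=0)
    Xc = X - mu
    St = Xc.T @ Xc / N                          # total scatter matrix
    evals, U = np.linalg.eigh(St)
    keep = evals > tol * evals.max()            # range space of St (rank r)
    P = (U[:, keep] / np.sqrt(evals[keep])).T   # whitening matrix, shape (r, n)
    Y = Xc @ P.T                                # whitened data, shape (N, r)
    r = Y.shape[1]
    Gb = np.zeros((r, r))                       # between-class scatter of Y
    for c in np.unique(labels):
        eta = Y[labels == c].mean(axis=0)       # class mean of whitened data
        Gb += (labels == c).sum() * np.outer(eta, eta)
    Gb /= N
    gvals, gvecs = np.linalg.eigh(Gb)
    V = gvecs[:, ::-1][:, :m]                   # top-m eigenvectors of Gb
    return P.T @ V, mu                          # W_WLDA has shape (n, m)
```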

4. EXPERIMENTAL RESULTS

This section evaluates the performance of PCA [5], Fisherface [1], Direct-LDA [11], NLDA [4], 2DPCA [6], GLRAM [7], 2DLDA [9] and our new approach WLDA on the Yale face database and the ORL face database. In this paper, we apply the nearest-neighbor classifier for its simplicity.

4.1 Yale Face Database

The Yale face database contains images of 15 individuals, with 11 images per subject taken under different facial expressions and configurations: center-light, w/glasses, happy, left-light, w/no glasses, normal, right-light, sad, sleepy, surprised, and wink.

Figure 2: Ten sample images from Yale face database.

A random subset with k (k = 2, 3, 4, 5) images per individual was taken with labels to form the training set. The rest of the database was considered to be the testing set. Ten rounds of random selection of training examples were performed and the average recognition result was recorded. The training samples were used to learn the subspace; the testing samples were then projected into the low-dimensional representation subspace, and recognition was performed using a nearest-neighbor classifier. We tested the recognition rates with different numbers of training samples. The best results obtained by PCA [5], Fisherface [1], Direct-LDA [11], NLDA [4], 2DPCA [6], GLRAM [7], 2DLDA [9] and our new approach WLDA are shown in Table 4.

Table 4: Comparison of the top recognition accuracy (%) on the Yale database.

k            2       3       4       5
PCA          75.556  83.333  84.762  87.778
Fisherfaces  84.444  86.667  93.333  93.333
DLDA         80      83.333  91.429  91.111
NLDA         85.185  87.5    94.286  93.333
2DPCA        76.296  83.333  88.571  88.889
GLRAM        74.2    84.35   84.762  88.1
2DLDA        82.1    84.3    89.2    90
WLDA         86.481  89.167  94.524  96.111

4.2 ORL Face Database

In the ORL database, there are ten different images of each of 40 distinct subjects. For some subjects, the images were taken at different times, varying the lighting, facial expressions (open / closed eyes, smiling / not smiling) and facial details (glasses / no glasses). All the images were taken against a dark homogeneous background with the subjects in an upright, frontal position (with tolerance for some side movement).

Figure 1: Twenty sample images from ORL face database.

A random subset with k (k = 2, 3, 4, 5) images per individual was taken with labels to form the training set. The rest of the database was considered to be the testing set. Ten rounds of random selection of training examples were performed and the average recognition result was recorded. The experimental protocol is the same as before. The best recognition results of each method are shown in Table 5.

Table 5: Comparison of the top recognition accuracy (%) on the ORL database.

k            2       3       4       5
PCA          81.875  87.5    90      90.5
Fisherfaces  81.563  85.714  86.667  82.5
DLDA         84.375  87.857  90.833  92.5
NLDA         84.063  87.857  91.25   91.5
2DPCA        84.063  86.071  89.167  91.5
GLRAM        81.875  86.75   89.45   90
2DLDA        82.175  84.234  90      90.75
WLDA         88.875  89.857  92.417  94

Next, in order to test the performance of these LDA-based algorithms versus the reduced dimension, we vary the reduced dimension m from 1 to 39 and run these approaches on the ORL database. Figs. 3, 4, 5 and 6 show the recognition accuracy versus the reduced dimension for the LDA-based algorithms and our algorithm, corresponding to the number of training samples k = 2, 3, 4, 5.

Figure 3: Comparison of the recognition accuracy (%) versus reduced dimension m (1 to 39) on ORL database with training sample k = 2.
Figure 4: Comparison of the recognition accuracy (%) versus reduced dimension m (1 to 39) on ORL database with training sample k = 3.
Figure 5: Comparison of the recognition accuracy (%) versus reduced dimension m (1 to 39) on ORL database with training sample k = 4.
Figure 6: Comparison of the recognition accuracy (%) versus reduced dimension m (1 to 39) on ORL database with training sample k = 5.
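A sketch of the evaluation protocol described above (k randomly chosen training images per subject, several random trials, nearest-neighbor classification in the learned subspace); the helper names and the fit_projection interface are hypothetical and only illustrate the stated protocol.

```python
import numpy as np

def nn_accuracy(train_feats, train_labels, test_feats, test_labels):
    """1-nearest-neighbour accuracy with Euclidean distance (labels as arrays)."""
    d = ((test_feats[:, None, :] - train_feats[None, :, :]) ** 2).sum(axis=2)
    pred = train_labels[d.argmin(axis=1)]
    return float((pred == test_labels).mean())

def random_split(labels, k, rng):
    """Pick k random training indices per class; the rest are test indices."""
    train = []
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        train.extend(idx[:k])
    train = np.array(train)
    test = np.setdiff1d(np.arange(len(labels)), train)
    return train, test

def evaluate(X, labels, fit_projection, k, n_trials=10, seed=0):
    """Average accuracy over n_trials random splits with k training images
    per subject.  fit_projection(X_train, y_train) is assumed to return an
    (n, m) projection matrix W; features are computed as samples @ W."""
    rng = np.random.default_rng(seed)
    accs = []
    for _ in range(n_trials):
        tr, te = random_split(labels, k, rng)
        W = fit_projection(X[tr], labels[tr])
        accs.append(nn_accuracy(X[tr] @ W, labels[tr], X[te] @ W, labels[te]))
    return float(np.mean(accs))
```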

5. CONCLUSIONS

In this paper, we propose a new LDA algorithm that overcomes some disadvantages of previous work. The key idea of WLDA is that we apply data whitening before performing LDA. Using some nice properties of the whitened data, we show how to turn the generalized eigenvalue problem of LDA into a simple eigenvalue problem, leading to the low computational cost of the algorithm. While previous methods lose some discriminant information along the way, WLDA is able to keep all discriminant information. Experimental results show the effectiveness of our algorithm.

6. REFERENCES

[1] P. Belhumeur, J. Hespanha, and D. Kriegman. Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):711–720, July 1997.
[2] L. Chen, H. Liao, M. Ko, J. Lin, and G. Yu. A new LDA-based face recognition system which can solve the small sample size problem. Pattern Recognition, 33(10):1713–1726, October 2000.
[3] K. Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press Professional, Inc., San Diego, CA, USA, 1990.
[4] R. Huang, Q. Liu, H. Lu, and S. Ma. Solving the small sample size problem of LDA. In Proceedings of the 16th International Conference on Pattern Recognition, volume 3, pages 29–32, 2002.
[5] M. Turk and A. Pentland. Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1):71–86, 1991.
[6] J. Yang, D. Zhang, A. F. Frangi, and J.-Y. Yang. Two-dimensional PCA: A new approach to appearance-based face representation and recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(1):131–137, January 2004.
[7] J. Ye. Generalized low rank approximations of matrices. In Proceedings of the Twenty-First International Conference on Machine Learning, page 112, 2004.
[8] J. Ye. Characterization of a family of algorithms for generalized discriminant analysis on undersampled problems. Journal of Machine Learning Research, 6:483–502, 2005.
[9] J. Ye, R. Janardan, and Q. Li. Two-dimensional linear discriminant analysis. In Advances in Neural Information Processing Systems, 2004.
[10] J. Ye and T. Xiong. Null space versus orthogonal linear discriminant analysis. In Proceedings of the 23rd International Conference on Machine Learning, pages 1073–1080, 2006.
[11] H. Yu and J. Yang. A direct LDA algorithm for high-dimensional data with application to face recognition. Pattern Recognition, 34:2067–2070, 2001.
[12] W. Zhao, R. Chellappa, and A. Krishnaswamy. Discriminant analysis of principal components for face recognition. In Proceedings of the 3rd International Conference on Face and Gesture Recognition, page 336, 1998.