Subclass discriminant Nonnegative Matrix Factorization for facial image analysis
Symeon Nikitidis b,a, Anastasios Tefas b, Nikos Nikolaidis b,a, Ioannis Pitas b,a

a Informatics and Telematics Institute, Center for Research and Technology Hellas, Greece
b Department of Informatics, Aristotle University of Thessaloniki, Greece
Article history: Received 4 October 2011; Received in revised form 21 March 2012; Accepted 26 April 2012; Available online 16 May 2012.

Keywords: Nonnegative Matrix Factorization; Subclass discriminant analysis; Multiplicative updates; Facial expression recognition; Face recognition

Abstract

Nonnegative Matrix Factorization (NMF) is among the most popular subspace methods, widely used in a variety of image processing problems. Recently, a discriminant NMF method that incorporates Linear Discriminant Analysis inspired criteria has been proposed, which achieves an efficient decomposition of the provided data to its discriminant parts, thus enhancing classification performance. However, this approach possesses certain limitations, since it assumes that the underlying data distribution is unimodal, which is often unrealistic. To remedy this limitation, we regard that data inside each class have a multimodal distribution, thus forming clusters, and use criteria inspired by Clustering based Discriminant Analysis. The proposed method incorporates appropriate discriminant constraints in the NMF decomposition cost function in order to address the problem of finding discriminant projections that enhance class separability in the reduced dimensional projection space, while taking subclass information into account. The developed algorithm has been applied to both facial expression and face recognition on three popular databases. Experimental results verified that it successfully identified discriminant facial parts, thus enhancing recognition performance.
1. Introduction

Nonnegative Matrix Factorization (NMF) [1] is an unsupervised matrix decomposition algorithm that requires both the data matrix being decomposed and the yielding factors to contain nonnegative elements. The nonnegativity constraint implies that the original data are reconstructed using only additive, and no subtractive, combinations of the yielded basic elements. This limitation distinguishes NMF from many other traditional dimensionality reduction algorithms, such as Principal Component Analysis (PCA) [2], Independent Component Analysis (ICA) [3,4] or Singular Value Decomposition (SVD) [5].

One of the most useful properties of NMF-based methods is that they usually produce a sparse representation of the decomposed data. Sparse coding corresponds to data representation using few basic elements that are spatially distributed and, ideally, nonoverlapping. However, because the sparseness achieved by the original NMF is somewhat of a side-effect rather than a goal, caused by the imposed nonnegativity constraints, different approaches have been proposed that attempt to control the degree to which the yielded representation is sparse. Towards this direction, Hoyer incorporated the notion of sparseness into the standard NMF decomposition function, so that the representation sparsity can be better controlled [6], while Li et al. [7] introduced localization constraints leading to a parts-based representation.

To interpret the NMF parts-based image representation, consider the scenario where NMF operates either on facial images or on a text document collection. In this scenario, the NMF training procedure aims to learn the parts of the decomposed data, which, in the first case, will correspond to different facial parts and, in the latter, to meaningful topics. Consequently, the identified basis elements, when combined using appropriate weight factors, will accurately reconstruct the original facial images or text documents that have been decomposed. This parts-based representation property of NMF is consistent with the psychological intuition of combining parts to form the whole for object representation in the human brain [8,9].

Recently, numerous specialized NMF-based algorithms applied to various problems in diverse fields have been proposed. These algorithms modify the NMF decomposition cost function by incorporating additional penalty terms in order to fulfill specific requirements arising in each application domain. In [10] Projective NMF (PNMF) was introduced, which proved to generate a much sparser, compared to the original NMF, and approximately orthogonal projection matrix, which reveals strong connections between PNMF and nonnegative PCA.
Extensive theoretical and practical justifications of the PNMF algorithm have been given in [11]. An extension of NMF that is applicable to mixed sign data has been attempted in [12], where the proposed framework relaxes the nonnegativity constraint on the bases matrix and considers entries with both positive and negative sign, while the weights matrix remains positively constrained. The efficiency of the presented framework has been investigated in various clustering problems.

Focusing on facial image analysis, numerous specialized NMF decomposition variants have been proposed for face recognition [7,13], facial identity verification [14] and facial expression recognition [17,18]. In such applications the entire facial image forms a feature vector and NMF aims to find projections of it that optimize a given criterion. The resulting projections are then used to project unknown test facial images from the original high dimensional image space into a lower dimensional subspace, where the criterion under consideration is optimized. One limitation of NMF is that the decomposed images should be vectorized in order to perform the nonnegative decomposition. Consequently, this vectorization leads to significant information loss, since the local structure of the decomposed images is no longer available. In order to remedy this limitation, 3D Nonnegative Tensor Factorization (NTF) has been introduced in the literature [19,20].

A supervised NMF learning method that aims at extracting discriminant facial parts is the Discriminant NMF (DNMF) algorithm proposed in [14,26]. DNMF incorporates discrimination criteria in the NMF factorization and achieves a more efficient decomposition of the data into its discriminant parts, thus enhancing separability between classes compared to the conventional NMF. However, the incorporation of a Linear Discriminant Analysis (LDA) inspired criterion [22] inside DNMF has certain shortcomings. Firstly, LDA assumes that the sample vectors of each class are generated from underlying multivariate Gaussian distributions having a common covariance matrix but different class means. Secondly, since LDA assumes that each class is represented by a single compact data cluster, the problem of nonlinearly separable classes cannot be solved. However, this problem can be tackled if we consider that each class is partitioned into a set of disjoint clusters (subclasses) and perform a discriminant analysis aiming at separating subclasses that belong to different classes. Typically, in real world applications, data usually do have a subclass structure. This is common, e.g., in facial expression recognition, since there is no unique way in which people express certain emotions, and there are other factors, such as facial pose, texture and illumination variations, that lead to expression subclasses. If this fact is not properly addressed, the performance of NMF-based methods is significantly degraded [23].

In this work, we focus on developing a specialized NMF decomposition method that performs face and facial expression recognition efficiently, alleviating certain deficiencies inherent to these problems. Motivated by observations regarding the facial expression sample distribution in the initial facial space [24], and also by the fact that NMF does not perform robustly on noisy datasets, we propose a novel algorithm, called Subclass Discriminant NMF (SDNMF). The proposed method addresses the general problem of finding discriminant projections that enhance class separability in the reduced dimensional space by imposing discriminant criteria that assume multimodality of the available facial training data.

To remedy the aforementioned limitations, we relax the assumption that each class is expected to consist of a single compact data cluster and regard that data inside each class form various subclasses, each of which is approximated by a Gaussian distribution. Consequently, we approximate the underlying distribution of each class by a mixture of Gaussians and employ criteria inspired by the Clustering based Discriminant Analysis (CDA) introduced in [24]. Moreover, we extend the NMF algorithm, modifying its decomposition by embedding appropriate discriminant constraints and reformulating the cost function that drives the optimization process. This extension provides discriminant projections that are expected both to exhibit robustness to illumination changes and expression variations, and to enhance class separability in the reduced dimensional space. To perform the SDNMF optimization, we develop multiplicative update rules that consider both the class origin of the samples and the cluster formation inside each class, and prove their convergence using an appropriately designed auxiliary function.

In summary, the novel contributions of this paper are the following:

- Subclass based discriminant constraints are incorporated in the NMF decomposition cost function, resulting in a specialized NMF-based method.
- Novel multiplicative update rules for optimizing SDNMF are proposed and their convergence is proven.
- A decomposition of a facial image into its discriminant parts using sparse representations is obtained.

The rest of the paper is organized as follows. The NMF algorithm and some of its most notable variants are reviewed in Section 2. Section 3 introduces the proposed SDNMF method, which incorporates subclass discriminant constraints in the NMF decomposition framework, and also derives the proposed multiplicative update rules. Section 4 describes the conducted experiments, verifying the efficiency of our algorithm in face and facial expression recognition. Finally, the convergence proof of our optimization scheme is provided in Appendix A, whereas directions for future work and concluding remarks are drawn in Section 5.

2. Brief review of NMF and its most notable variants

In this section, we briefly present the NMF decomposition concept and review its discriminant variant DNMF. Without loss of generality, we shall assume that the decomposed data are facial images. Obviously, the techniques that will be described can be applied to any kind of nonnegative data.

2.1. NMF

The basic idea of NMF is to approximate a facial image by a linear combination of elements, the so called basis images, that correspond to facial parts. Let I be a facial image database comprised of L images belonging to n different classes and let $\mathbf{X} \in \mathbb{R}_{+}^{F \times L}$ be the nonnegative data matrix whose columns are F-dimensional feature vectors obtained by scanning each facial image in the database row-wise. Thus, $x_{i,j}$ is the ith element of the jth column vector $\mathbf{x}_j$. NMF considers factorizations of the form:

$$\mathbf{X} \approx \mathbf{Z}\mathbf{H}, \quad (1)$$

where $\mathbf{Z} \in \mathbb{R}_{+}^{F \times M}$ is a matrix containing the basis images, while matrix $\mathbf{H} \in \mathbb{R}_{+}^{M \times L}$ contains the coefficients of the linear combinations of the basis images required to reconstruct each original image in the database. Thus, the jth facial image, represented by vector $\mathbf{x}_j$, can be approximated by the factorization $\mathbf{x}_j \approx \mathbf{Z}\mathbf{h}_j$, where $\mathbf{h}_j$ denotes the jth weight column of matrix $\mathbf{H}$. Obviously, useful factorizations for real world applications appear when the linear subspace transformation projects data from the original F-dimensional space to an M-dimensional subspace with $M \ll F$.
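To make the dimensions in (1) concrete, the following minimal NumPy sketch instantiates the factorization for a hypothetical database of 32x32 facial images; the sizes F, L and M and the random initialization are illustrative assumptions, not values used in the paper.

```python
import numpy as np

# Illustrative sizes (assumptions): F pixels per vectorized face, L images, M basis images (M << F).
F, L, M = 32 * 32, 200, 25
rng = np.random.default_rng(0)

X = rng.random((F, L))   # nonnegative data matrix, one vectorized facial image per column
Z = rng.random((F, M))   # basis images; each column loosely corresponds to a facial part
Z /= Z.sum(axis=0)       # normalize each basis image to sum to one, matching the constraint in Eq. (4)
H = rng.random((M, L))   # weight coefficients of the linear combinations

X_hat = Z @ H            # Eq. (1): column j of X_hat approximates x_j = Z @ H[:, j]
```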
To measure the cost of the decomposition in (1), one popular approach is to use the Kullback–Leibler (KL) divergence metric, which is the most common approximation error measure for NMF factorization methods. The KL divergence between two vectors $\mathbf{x} = [x_1, \ldots, x_F]^T$ and $\mathbf{q} = [q_1, \ldots, q_F]^T$ is defined as

$$KL(\mathbf{x}\,\|\,\mathbf{q}) \triangleq \sum_{i=1}^{F} \left( x_i \ln\frac{x_i}{q_i} + q_i - x_i \right). \quad (2)$$

Thus, the cost of the decomposition $D_{NMF}(\mathbf{X}\,\|\,\mathbf{Z}\mathbf{H})$ in (1) can be measured as the sum of the KL divergences between all images in the database and their respective reconstructed versions:

$$D_{NMF}(\mathbf{X}\,\|\,\mathbf{Z}\mathbf{H}) \triangleq \sum_{j=1}^{L} KL(\mathbf{x}_j\,\|\,\mathbf{Z}\mathbf{h}_j) = \sum_{j=1}^{L} \sum_{i=1}^{F} \left( x_{i,j} \ln\frac{x_{i,j}}{\sum_{k=1}^{M} z_{i,k} h_{k,j}} + \sum_{k=1}^{M} z_{i,k} h_{k,j} - x_{i,j} \right). \quad (3)$$

Thus, the NMF algorithm factorizes the data matrix $\mathbf{X}$ into $\mathbf{Z}\mathbf{H}$ by solving the following optimization problem:

$$\min_{\mathbf{Z},\mathbf{H}} D_{NMF}(\mathbf{X}\,\|\,\mathbf{Z}\mathbf{H}), \quad \text{subject to:}\; z_{i,k} \ge 0,\; h_{k,j} \ge 0,\; \sum_{i} z_{i,k} = 1, \;\forall i,j,k. \quad (4)$$
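For illustration, the decomposition cost in (2) and (3) can be evaluated directly as sketched below; the small constant eps is an added safeguard against division by zero and logarithms of zero, and is not part of the paper's formulation.

```python
import numpy as np

def nmf_kl_cost(X, Z, H, eps=1e-12):
    """Eq. (3): sum of the KL divergences of Eq. (2) between every image x_j
    and its reconstruction Z h_j."""
    R = Z @ H  # reconstructed images, one per column
    return float(np.sum(X * np.log((X + eps) / (R + eps)) + R - X))
```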
Using an appropriately designed auxiliary function and the Expectation Maximization (EM) algorithm, it has been shown in [25] that the following multiplicative update rules for $h_{k,j}$ and $z_{i,k}$ yield the desired factors, while guaranteeing a nonincreasing behavior of the cost function in (3). The update rule at the tth iteration for $h_{k,j}^{(t)}$ is given by

$$h_{k,j}^{(t)} = h_{k,j}^{(t-1)} \frac{\sum_{i} z_{i,k}^{(t-1)} \frac{x_{i,j}}{\sum_{l} z_{i,l}^{(t-1)} h_{l,j}^{(t-1)}}}{\sum_{i} z_{i,k}^{(t-1)}}, \quad (5)$$

while $z_{i,k}^{(t)}$ is updated by

$$\tilde{z}_{i,k}^{(t)} = z_{i,k}^{(t-1)} \frac{\sum_{j} h_{k,j}^{(t)} \frac{x_{i,j}}{\sum_{l} z_{i,l}^{(t-1)} h_{l,j}^{(t)}}}{\sum_{j} h_{k,j}^{(t)}}, \quad (6)$$

and normalized as

$$z_{i,k}^{(t)} = \frac{\tilde{z}_{i,k}^{(t)}}{\sum_{l} \tilde{z}_{l,k}^{(t)}}. \quad (7)$$
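The updates (5)–(7) translate almost directly into vectorized NumPy; the sketch below performs one iteration and is a plain reading of the formulas, with an added eps guard that is an assumption rather than part of the original derivation.

```python
import numpy as np

def nmf_kl_update(X, Z, H, eps=1e-12):
    """One iteration of the multiplicative updates in Eqs. (5)-(7)."""
    # Eq. (5): update H while keeping Z fixed.
    R = Z @ H + eps
    H = H * (Z.T @ (X / R)) / Z.sum(axis=0)[:, None]
    # Eq. (6): update Z using the newly updated H.
    R = Z @ H + eps
    Z = Z * ((X / R) @ H.T) / H.sum(axis=1)[None, :]
    # Eq. (7): renormalize every basis image (column of Z) to sum to one.
    Z /= Z.sum(axis=0, keepdims=True)
    return Z, H
```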
2.2. NMF variants

In order to further enhance the sparsity of the resulting basis images, Li et al. proposed the Local NMF (LNMF) algorithm [7] by including appropriate additional penalty terms in the NMF decomposition cost function. More precisely, LNMF aims at creating bases that cannot be further decomposed into more components, while at the same time it attempts to reduce redundant information between different bases. Moreover, the algorithm retains only those representation components that contain the most important information. The LNMF cost function is formulated as

$$D_{LNMF}(\mathbf{X}\,\|\,\mathbf{Z}\mathbf{H}) \triangleq D_{NMF}(\mathbf{X}\,\|\,\mathbf{Z}\mathbf{H}) + \alpha_1 \sum_{i \ne j}^{M} [\mathbf{Z}^T\mathbf{Z}]_{i,j} + \alpha_2 \sum_{i=1}^{M} [\mathbf{Z}^T\mathbf{Z}]_{i,i} - \beta \sum_{i=1}^{M} [\mathbf{H}\mathbf{H}^T]_{i,i}, \quad (8)$$

where $\alpha_1$, $\alpha_2$ and $\beta$ are positive constants.

Another popular NMF variant is the Discriminant Nonnegative Matrix Factorization (DNMF) algorithm, which is an attempt to introduce discriminant constraints in the NMF decomposition cost function. The rationale behind DNMF is to incorporate LDA inspired criteria inside the NMF factorization, thus introducing information regarding how the various facial images are separated into different classes, aiming at finding basis images that correspond to discriminant salient facial parts such as eyes, nose, mouth, eyebrows, etc.

In order to incorporate discriminant constraints into the NMF decomposition, the well known Fisher discriminant criterion [21,27] has been exploited, given by

$$J(\mathbf{W}) = \frac{\mathrm{tr}[\mathbf{W}^T \mathbf{S}_b \mathbf{W}]}{\mathrm{tr}[\mathbf{W}^T \mathbf{S}_w \mathbf{W}]}, \quad (9)$$

where $\mathrm{tr}[\cdot]$ is the matrix trace operator. This criterion attempts to find a transformation matrix $\mathbf{W}$ that maximizes the ratio defined by the traces of the between-class and within-class scatter matrices $\tilde{\mathbf{S}}_b = \mathbf{W}^T\mathbf{S}_b\mathbf{W}$ and $\tilde{\mathbf{S}}_w = \mathbf{W}^T\mathbf{S}_w\mathbf{W}$ evaluated over the projected data. The DNMF cost function incorporates the optimization of a similar discriminant criterion, minimizing the dispersion of the projected samples that belong to the same class around their corresponding mean, while at the same time maximizing the scatter of the mean vectors of all classes around their global mean. Consequently, the DNMF algorithm minimizes the following cost function:

$$D_{DNMF}(\mathbf{X}\,\|\,\mathbf{Z}\mathbf{H}) \triangleq D_{NMF}(\mathbf{X}\,\|\,\mathbf{Z}\mathbf{H}) + a\,\mathrm{tr}[\tilde{\mathbf{S}}_w] - b\,\mathrm{tr}[\tilde{\mathbf{S}}_b], \quad (10)$$

where $a$ and $b$ are positive constants.

Recently, Cai et al. [28] proposed the Graph regularized NMF (GNMF), designed for clustering problems, which encodes the geometric structure of the data. GNMF considers a nearest neighbor graph in order to exploit local geometrical invariance between the training samples when traversed from the initial space to the projection subspace. The developed optimization scheme is formulated as

$$D_{GNMF}(\mathbf{X}\,\|\,\mathbf{Z}\mathbf{H}) \triangleq \tfrac{1}{2}\|\mathbf{X} - \mathbf{Z}\mathbf{H}\|_F^2 + \lambda\,\mathrm{tr}[\mathbf{H}\mathbf{L}\mathbf{H}^T], \quad (11)$$

where $\|\cdot\|_F$ is the Frobenius norm, $\lambda \ge 0$ is a regularization parameter that controls the smoothness of the new representation and $\mathbf{L}$ is the graph Laplacian matrix.

Another notable variant of NMF which exploits the manifold structure of the training samples in face space is the topology preserving NMF (TPNMF) algorithm proposed by Zhang et al. in [29]. In particular, the TPNMF method is specialized for face representation and recognition and achieves better discrimination, compared against the original NMF, by minimizing a constrained gradient distance which preserves the local nonlinear topology structure of the face. The developed optimization problem is formulated as

$$D_{TPNMF}(\mathbf{X}\,\|\,\mathbf{Z}\mathbf{H}) \triangleq \tfrac{1}{2}\|\mathbf{X} - \mathbf{Z}\mathbf{H}\|_F^2 + \lambda |\nabla \mathbf{Z}\mathbf{H}|^2, \quad (12)$$

where $\nabla \mathbf{Z}\mathbf{H}$ is the gradient of $\mathbf{Z}\mathbf{H}$ and $\lambda \ge 0$ is a parameter which controls the trade-off between the minimization of the decomposition error and the topology preserving power.
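Referring back to the DNMF cost in (10), the sketch below illustrates one plausible way to evaluate the two trace penalties over the coefficient vectors (the columns of H), given one class label per image. The exact scatter definitions used by DNMF [14,26] may differ in detail, so this should be read as an assumption-laden illustration rather than the algorithm itself.

```python
import numpy as np

def dnmf_penalty(H, labels, a=1.0, b=1.0):
    """Sketch of the penalty a*tr[S_w] - b*tr[S_b] in Eq. (10), computed over the columns of H.

    labels: 1-D array with one class label per column of H.
    """
    mu = H.mean(axis=1, keepdims=True)                     # global mean of the coefficient vectors
    tr_sw, tr_sb = 0.0, 0.0
    for c in np.unique(labels):
        Hc = H[:, labels == c]                             # coefficient vectors of class c
        mu_c = Hc.mean(axis=1, keepdims=True)
        tr_sw += np.sum((Hc - mu_c) ** 2)                  # trace of the within-class scatter
        tr_sb += Hc.shape[1] * np.sum((mu_c - mu) ** 2)    # trace of the between-class scatter
    return a * tr_sw - b * tr_sb
```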
3. Subclass DNMF

In this section, we present the imposed clustering based discriminant criteria and demonstrate how these are incorporated in the NMF cost function, thus creating the proposed SDNMF optimization problem. Then, we derive the proposed multiplicative update rules to optimize SDNMF and demonstrate their convergence.

3.1. Clustering based Discriminant Analysis

Similar to LDA, CDA seeks to determine a transformation matrix $\mathbf{W}$ such that, when it is applied on the initial input data $\mathbf{X}$, the projected samples form classes that are better separated. To do so, CDA assumes that data inside each class do not form one compact cluster, but that each class can be partitioned into one or more compact subclasses. Based on this assumption, CDA attempts to discriminate classes while, at the same time, minimizing the scatter within each subclass.

In detail, CDA exploits a modified Fisher criterion so that the between and within subclass scatter matrices are evaluated considering not only the class labels of the samples but also their respective subclass origins. To formulate the CDA criteria in the n-class facial image database I, let us denote the number of subclasses composing the rth class by $C_r$, the total number of formed subclasses in the database by $C$, where $C = \sum_{i} C_i$, and the number of facial images belonging to the $\theta$th subclass of the rth class by $N^{(r)(\theta)}$. Let us also define the mean vector of the $\theta$th subclass of the rth class by $\boldsymbol{\mu}^{(r)(\theta)} = [m_1^{(r)(\theta)}, \ldots, m_F^{(r)(\theta)}]^T$, which is evaluated from the $N^{(r)(\theta)}$ facial images, while vector $\mathbf{x}_\rho^{(r)(\theta)} = [x_{\rho,1}^{(r)(\theta)}, \ldots, x_{\rho,F}^{(r)(\theta)}]^T$ corresponds to the feature vector of the $\rho$th facial image belonging to the $\theta$th subclass of the rth class. Using the above notation, we can define the within subclass scatter matrix $\mathbf{S}_w$ as

$$\mathbf{S}_w = \sum_{r=1}^{n} \sum_{\theta=1}^{C_r} \sum_{\rho=1}^{N^{(r)(\theta)}} (\mathbf{x}_\rho^{(r)(\theta)} - \boldsymbol{\mu}^{(r)(\theta)})(\mathbf{x}_\rho^{(r)(\theta)} - \boldsymbol{\mu}^{(r)(\theta)})^T, \quad (13)$$

and the between subclass scatter matrix $\mathbf{S}_b$ as

$$\mathbf{S}_b = \sum_{i=1}^{n} \sum_{\substack{r=1 \\ r \ne i}}^{n} \sum_{j=1}^{C_i} \sum_{\theta=1}^{C_r} (\boldsymbol{\mu}^{(i)(j)} - \boldsymbol{\mu}^{(r)(\theta)})(\boldsymbol{\mu}^{(i)(j)} - \boldsymbol{\mu}^{(r)(\theta)})^T. \quad (14)$$

Since NMF projects the initial data to a lower dimensional subspace using the pseudo-inverse $\mathbf{Z}^\dagger = (\mathbf{Z}^T\mathbf{Z})^{-1}\mathbf{Z}^T$, we desire to perform this projection in a discriminant manner and enhance class separability in the projection subspace. In order to determine the optimal projection of the facial images, we attempt to maximize a CDA inspired criterion formulated by exploiting the within and between subclass scatter matrices in the projection subspace. To do so, the within subclass scatter matrix evaluated on the projected samples is transformed, with respect to its previous form, into $\mathbf{R}_w = (\mathbf{Z}^\dagger)^T \mathbf{S}_w \mathbf{Z}^\dagger$, while the between subclass scatter matrix becomes $\mathbf{R}_b = (\mathbf{Z}^\dagger)^T \mathbf{S}_b \mathbf{Z}^\dagger$. Let us define the projection of the $\rho$th facial image $\mathbf{x}_\rho^{(r)(\theta)}$ as the M-dimensional feature vector $\mathbf{g}_\rho^{(r)(\theta)} = [\zeta_{\rho,1}^{(r)(\theta)}, \ldots, \zeta_{\rho,M}^{(r)(\theta)}]^T$, resulting by applying the transformation $\mathbf{g}_\rho^{(r)(\theta)} = \mathbf{Z}^\dagger \mathbf{x}_\rho^{(r)(\theta)}$. Using the above notation, we can evaluate the within and between subclass scatter matrices $\mathbf{R}_w$ and $\mathbf{R}_b$ in the projection subspace as

$$\mathbf{R}_w = \sum_{r=1}^{n} \sum_{\theta=1}^{C_r} \sum_{\rho=1}^{N^{(r)(\theta)}} (\mathbf{g}_\rho^{(r)(\theta)} - \boldsymbol{\mu}^{(r)(\theta)})(\mathbf{g}_\rho^{(r)(\theta)} - \boldsymbol{\mu}^{(r)(\theta)})^T, \quad (15)$$

$$\mathbf{R}_b = \sum_{i=1}^{n} \sum_{\substack{r=1 \\ r \ne i}}^{n} \sum_{j=1}^{C_i} \sum_{\theta=1}^{C_r} (\boldsymbol{\mu}^{(i)(j)} - \boldsymbol{\mu}^{(r)(\theta)})(\boldsymbol{\mu}^{(i)(j)} - \boldsymbol{\mu}^{(r)(\theta)})^T, \quad (16)$$

where the M-dimensional mean vector $\boldsymbol{\mu}^{(i)(j)}$ henceforth denotes the mean vector evaluated over the projected samples composing the jth subclass of the ith class.

It is reasonable to desire the dispersion of the projected samples that belong to the same subclass of a certain class to be as small as possible, since this denotes a high concentration of these samples around their subclass mean and, consequently, a more compact subclass formation. Furthermore, to separate, in the projection subspace, subclasses belonging to different classes, we desire to maximize the difference between the means of every subclass of a certain class and every subclass of each other class. Therefore, we attempt to simultaneously minimize $\mathrm{tr}[\mathbf{R}_w]$ and maximize $\mathrm{tr}[\mathbf{R}_b]$.
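Since only the traces of $\mathbf{R}_w$ and $\mathbf{R}_b$ enter the SDNMF cost introduced next, a short sketch of how these traces could be computed from the projected samples (the columns of $\mathbf{G} = \mathbf{Z}^\dagger\mathbf{X}$) and per-sample class/subclass labels may help; the grouping logic is an assumption written to be consistent with (15) and (16).

```python
import numpy as np

def subclass_scatter_traces(G, class_labels, subclass_labels):
    """Traces of the within- and between-subclass scatter matrices, Eqs. (15)-(16),
    evaluated over projected samples stored as the columns of G."""
    class_labels = np.asarray(class_labels)
    subclass_labels = np.asarray(subclass_labels)
    means, tr_Rw = {}, 0.0
    for r, theta in set(zip(class_labels.tolist(), subclass_labels.tolist())):
        idx = (class_labels == r) & (subclass_labels == theta)
        mu = G[:, idx].mean(axis=1, keepdims=True)         # subclass mean in the projection subspace
        means[(r, theta)] = mu
        tr_Rw += np.sum((G[:, idx] - mu) ** 2)             # contribution to tr[R_w], Eq. (15)
    tr_Rb = 0.0
    for (r1, _), m1 in means.items():
        for (r2, _), m2 in means.items():
            if r1 != r2:                                   # only subclass pairs from different classes
                tr_Rb += np.sum((m1 - m2) ** 2)            # contribution to tr[R_b], Eq. (16)
    return tr_Rw, tr_Rb
```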
3.2. Subclass discriminant Nonnegative Matrix Factorization (SDNMF)

In order to formulate the new cost function for the SDNMF problem, we add appropriate penalty terms in the NMF decomposition as follows:

$$D_{SDNMF}(\mathbf{X}\,\|\,\mathbf{Z}\mathbf{H}) \triangleq D_{NMF}(\mathbf{X}\,\|\,\mathbf{Z}\mathbf{H}) + \frac{a}{2}\,\mathrm{tr}[\mathbf{R}_w] - \frac{b}{2}\,\mathrm{tr}[\mathbf{R}_b], \quad (17)$$

where $a$ and $b$ are positive constants, while the factor $\tfrac{1}{2}$ is used to simplify the subsequent derivations. Consequently, the new minimization problem is formulated as

$$\min_{\mathbf{Z},\mathbf{H}} D_{SDNMF}(\mathbf{X}\,\|\,\mathbf{Z}\mathbf{H}), \quad \text{subject to:}\; z_{i,k} \ge 0,\; h_{k,j} \ge 0,\; \sum_{i} z_{i,k} = 1, \;\forall i,j,k, \quad (18)$$

which requires the minimization of (17) subject to the nonnegativity constraints applied on the elements of both the weights matrix $\mathbf{H}$ and the basis images matrix $\mathbf{Z}$.
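Putting the pieces together, the objective in (17) could be evaluated as in the sketch below, which reuses the hypothetical helpers nmf_kl_cost and subclass_scatter_traces from the earlier sketches; np.linalg.pinv is used for $\mathbf{Z}^\dagger$ and coincides with $(\mathbf{Z}^T\mathbf{Z})^{-1}\mathbf{Z}^T$ when $\mathbf{Z}$ has full column rank.

```python
import numpy as np

def sdnmf_cost(X, Z, H, class_labels, subclass_labels, a=1.0, b=1.0):
    """Sketch of the SDNMF objective in Eq. (17): the KL reconstruction cost of Eq. (3)
    plus the subclass discriminant penalties evaluated in the projection subspace."""
    G = np.linalg.pinv(Z) @ X                              # projected samples, g = Z_pinv @ x
    tr_Rw, tr_Rb = subclass_scatter_traces(G, class_labels, subclass_labels)
    return nmf_kl_cost(X, Z, H) + 0.5 * a * tr_Rw - 0.5 * b * tr_Rb
```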
In order to solve the constrained optimization problem in (18), we follow an approach similar to that in [25]. It should be noted that, as in every NMF-based optimization problem, the objective function in (17) is convex either in $\mathbf{Z}$ or in $\mathbf{H}$, but nonconvex in both variables. Therefore, the iterative optimization process of the SDNMF algorithm reaches a local minimum. To do so, the proposed process successively optimizes either variable $\mathbf{Z}$ or $\mathbf{H}$, keeping the other fixed. Since the added discriminant terms in the SDNMF cost function are totally independent of the basis images matrix $\mathbf{Z}$, keeping $\mathbf{H}$ fixed and optimizing for $\mathbf{Z}$ results in the same optimization problem as the one solved by the original NMF algorithm in [25] and, consequently, leads to exactly the same update formulae. Thus, we can recall the convergence proof of the conventional NMF to show that (17) is nonincreasing under the update rules in (6). The interested reader is referred to [25] for more details. Moreover, in order to optimize for $\mathbf{H}$, we define an appropriate convex auxiliary function that bounds the objective function defining the factorization cost from above. If $G$ is such an auxiliary function, then (17) is nonincreasing under the update $\mathbf{H}^{(t)} = \arg\min_{\mathbf{H}} G(\mathbf{H},\mathbf{H}^{(t-1)})$. By iterating this update rule, we obtain a series of minimizers $\mathbf{H}^{(t)}$ that improve the objective function. Thus, we arrive at the following update rule for the weight coefficients $h_{k,l}$, which for the tth iteration is defined as

$$h_{k,l}^{(t)} = \frac{-A + \sqrt{A^{2} + 4\left(a + \frac{b}{N^{(r)(\theta)}}(C - C_{r})\right)\frac{1}{N^{(r)(\theta)}}\, h_{k,l}^{(t-1)} \sum_{i} z_{i,k}^{(t-1)} \frac{x_{i,l}}{\sum_{n} z_{i,n}^{(t-1)} h_{n,l}^{(t-1)}}}}{2\left(a + \frac{b}{N^{(r)(\theta)}}(C - C_{r})\right)\frac{1}{N^{(r)(\theta)}}}, \quad (19)$$

where $h_{k,l}$ can also be considered as the kth feature element, in the projection subspace, of the lth image, which belongs to the $\theta$th cluster of the rth facial class, and parameter $A$ is defined as

$$A = \left(a + \frac{b}{N^{(r)(\theta)}}(C - C_{r})\right)\frac{1}{N^{(r)(\theta)}} \sum_{\lambda,\,\lambda \ne l} h_{k,\lambda}^{(t-1)} - \frac{b}{N^{(r)(\theta)}} \sum_{\substack{m=1 \\ m \ne r}}^{n} \sum_{g=1}^{C_{m}} m_{k}^{(m)(g)} - 1. \quad (20)$$

Details regarding how the proposed update rules and the related parameter $A$ are derived, along with a proof that the objective function is guaranteed to have a nonincreasing behavior under the updates in (19), can be found in Appendix A.

After we obtain the optimal factors $\mathbf{Z}$, $\mathbf{H}$, SDNMF necessitates the use of the pseudo-inverse $\mathbf{Z}^\dagger = (\mathbf{Z}^T\mathbf{Z})^{-1}\mathbf{Z}^T$ of the basis images matrix $\mathbf{Z}$, in order to extract the discriminant features and compute the