
Engineering Applications of Artificial Intelligence 20 (2007) 101–110

Initialization enhancer for non-negative matrix factorization

Zhonglong Zheng a,b,*, Jie Yang b, Yitan Zhu c

a Institute of Information Science and Engineering, P.O. Box 143, Zhejiang Normal University, Jinhua 321004, Zhejiang, People's Republic of China
b Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, Shanghai 200030, People's Republic of China
c Alexandria Research Institute, Virginia Polytechnic Institute and State University, 206 N. Washington Street, Alexandria, VA 22314, USA

Received 25 March 2005; received in revised form 9 September 2005; accepted 12 March 2006. Available online 27 April 2006.

Abstract

Non-negative matrix factorization (NMF), proposed recently by Lee and Seung, has been applied to many areas such as image classification, image compression, and so on. Based on traditional NMF, researchers have put forward several new algorithms to improve its performance. However, particular emphasis has to be placed on the initialization of NMF because of its local convergence, although this is usually ignored in the literature. In this paper, we explore three initialization methods based on principal component analysis (PCA), fuzzy clustering and Gabor wavelets, chosen either for the reduction of computational complexity or for the preservation of data structure. In addition, the three methods suggest an efficient way of selecting the rank of the NMF factorization in low-dimensional space.
© 2006 Elsevier Ltd. All rights reserved.

Keywords: Non-negative matrix factorization; Principal component analysis; Fuzzy clustering; Gabor wavelet; Dimensionality reduction; Image classification

1. Introduction

In pattern analysis and computer vision, visual recognition of objects is one of the most challenging problems. Approaches to overcoming such problems have focused on several methodologies. Appearance-based representation and recognition is one of the most successful methods still used today. It involves preprocessing of multidimensional signals, such as face images (Turk and Pentland, 1991), character images (Jiangying Zhou and Lopresti, 1997), speech spectrograms (Bell and Sejnowski, 1995) and so on. In fact, the essence of the preprocessing is the so-called dimensionality reduction. Principal component analysis (PCA), based on second-order statistics, is one of the most popular linear dimensionality reduction methods. It is a well-known fact that PCA is optimal in terms of the reconstruction error (MSE) but not for the separation and classification of classes. In addition, the eigenface derived from PCA is holistic, or can be called dense (Buciu and Pitas, 2004).

Recently, an unsupervised approach called non-negative matrix factorization (NMF) has been proposed by Lee and Seung (1999, 2001). NMF differs from PCA and independent component analysis (ICA) (Barlett et al., 1998) in adding non-negativity constraints. When applied to image analysis and representation, the obtained NMF bases are localized features that correspond with intuitive notions of the parts of the images. This is supported by psychological and physiological evidence that perception of the whole is based on parts-based representations (Mel, 1999). Researchers have proposed several different algorithms based on traditional NMF. Guillamet et al. presented a study of a weighted version of the original NMF (WNMF) (Guillamet et al., 2003). Li et al. developed a variant of NMF, named local non-negative matrix factorization (LNMF), by imposing additional constraints (Li et al., 2001). Buciu et al. proposed a supervised technique called discriminant non-negative matrix factorization (DNMF) by taking class information into account (Buciu and Pitas, 2004). Also, the NMF technique has been applied to many areas (Pauca et al., 2004; Guillamet and Vitrià, 2002; Lee and Lee, 2001).

* Corresponding author. Institute of Information Science and Engineering, P.O. Box 143, Zhejiang Normal University, Jinhua 321004, Zhejiang, People's Republic of China. Tel.: +86 579 2282155; fax: +86 579 2298229. E-mail addresses: [email protected], [email protected] (Z. Zheng).


As we know, however, few documents pay much attention to the problem of NMF initialization. In theory, NMF unavoidably converges to local minima, so the NMF bases will differ under different initializations. The experiments cannot be repeated by others if they do not know the initialization conditions. It is our desire to give a proper initialization for a given task in order to improve the performance of NMF, either for the consideration of computational complexity or for the preservation of data structure. To this end, we explore three techniques, namely PCA, fuzzy clustering and Gabor wavelets, to realize NMF initialization. Furthermore, the three initialization methods provide helpful indications for determining the rank of NMF in low-dimensional space. In the following sections, we give a brief introduction to the NMF algorithm. Then we present the three initialization methods more formally, together with some illustrative simulations on a face image dataset. Finally, we discuss how to determine the rank of NMF in low-dimensional space through the three initialization techniques.

2. Non-negative matrix factorization algorithm

Let a set of N training images be given as an m × N matrix X, with each column consisting of the m non-negative pixel values of an image. NMF seeks to find non-negative factors W and H such that

X \approx \tilde{X} = WH, \quad (1)

where W \in R^{m \times r} and H \in R^{r \times N}. The r columns of W are thought of as basis images derived from the NMF algorithm (Lee and Seung, 1999). H is the encoding variable matrix. Dimensionality reduction is achieved when r < N.

NMF imposes non-negativity constraints on both W and H. As a consequence, the entries of W and H are all non-negative and hence only non-subtractive basis combinations are allowed. This is believed to be compatible with the intuitive notion of combining parts to form a whole. PCA, by contrast, requires the columns of W to be orthonormal and the rows of H to be mutually orthogonal. It lacks this intuitive meaning in that the linear combination of the bases in W usually involves complex cancellations between positive and negative numbers because of the arbitrary sign of the entries in W and H.

To find an approximate factorization, a cost function that quantifies the quality of the approximation has to be defined. Such a cost function can be constructed using some measure of distance between X and \tilde{X}. One natural way is simply to use the Euclidean distance between the two matrices:

D_E(X, \tilde{X}) = \|X - \tilde{X}\|^2 = \sum_{ij} (X_{ij} - \tilde{X}_{ij})^2. \quad (2)

This is lower bounded by 0, and equals 0 if and only if X = \tilde{X}. To minimize \|X - \tilde{X}\|^2 with respect to W and H, subject to the constraints W, H \ge 0, the multiplicative update rule is

H_{au} \leftarrow H_{au} \frac{(W^T X)_{au}}{(W^T W H)_{au}}, \qquad W_{ia} \leftarrow W_{ia} \frac{(X H^T)_{ia}}{(W H H^T)_{ia}}. \quad (3)

The Euclidean distance is invariant under these updates if and only if W and H are at a stationary point of the distance.

Another useful measure is the divergence of X and \tilde{X}, defined as

D_D(X \| \tilde{X}) = \sum_{ij} \left( X_{ij} \log \frac{X_{ij}}{\tilde{X}_{ij}} - X_{ij} + \tilde{X}_{ij} \right). \quad (4)

It is also lower bounded by 0, and vanishes if and only if X = \tilde{X}. We call it "divergence" instead of "distance" because it is not symmetric in X and \tilde{X}. To minimize D(X \| \tilde{X}) with respect to W and H, subject to the constraints W, H \ge 0, the corresponding multiplicative update rule is

H_{au} \leftarrow H_{au} \frac{\sum_i W_{ia} X_{iu}/(WH)_{iu}}{\sum_k W_{ka}}, \qquad W_{ia} \leftarrow W_{ia} \frac{\sum_u H_{au} X_{iu}/(WH)_{iu}}{\sum_v H_{av}}. \quad (5)

The divergence is also invariant under these updates if and only if W and H are at a stationary point of the divergence (Lee and Seung, 2001).

Li et al. proposed a refinement of NMF by a slight variation of the divergence algorithm, called LNMF (Li et al., 2001). The objective function of LNMF is

D_{LD} = \sum_{i=1}^{m} \sum_{j=1}^{N} \left( X_{ij} \log \frac{X_{ij}}{\tilde{X}_{ij}} - X_{ij} + \tilde{X}_{ij} \right) + \alpha \sum_{ij} U_{ij} - \beta \sum_{i} V_{ii}, \quad (6)

where \alpha, \beta are positive constants, U = W^T W and V = H H^T. The LNMF update rule for W is nearly identical to that in Eq. (5). For H, the update rule uses a square root to satisfy the additional constraints:

H_{au} \leftarrow \sqrt{ H_{au} \sum_i X_{iu} W_{ia} / (WH)_{iu} }. \quad (7)

Having introduced the framework of NMF, we test it on our private SJTU face database to extract basis images and give an illustrative impression of NMF. We select 400 images, each cropped to 64 × 64 pixels. The results are shown in Fig. 1. The NMF basis images look somewhat like the eigenfaces derived from PCA. However, the basis faces in Fig. 1(b) approximate the entire collection of faces using only positive combinations, instead of the arbitrary combinations in PCA.
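As a concrete reading of the update rules, here is a minimal NumPy sketch of Eqs. (3) and (5); the random default initialization, the epsilon guards in the denominators and the iteration count are our assumptions, not part of the original formulation:

```python
import numpy as np

def nmf_euclidean(X, r, n_iter=1000, W0=None, H0=None, eps=1e-9, seed=0):
    """Multiplicative updates of Eq. (3), minimizing ||X - WH||^2."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], r)) if W0 is None else W0.copy()
    H = rng.random((r, X.shape[1])) if H0 is None else H0.copy()
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)   # H update, Eq. (3)
        W *= (X @ H.T) / (W @ H @ H.T + eps)   # W update, Eq. (3)
    return W, H

def nmf_divergence(X, r, n_iter=1000, eps=1e-9, seed=0):
    """Multiplicative updates of Eq. (5), minimizing the divergence D(X||WH)."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], r))
    H = rng.random((r, X.shape[1]))
    for _ in range(n_iter):
        # numerators are W^T (X/WH) and (X/WH) H^T; denominators are the
        # column sums of W and the row sums of H, as in Eq. (5)
        H *= (W.T @ (X / (W @ H + eps))) / (W.sum(axis=0)[:, None] + eps)
        W *= ((X / (W @ H + eps)) @ H.T) / (H.sum(axis=1)[None, :] + eps)
    return W, H
```

Because the updates are multiplicative, any entry initialized to zero stays zero; this is one reason why the choice of the initial W and H matters in what follows.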

Fig. 1. Basis images: (a) part of original face images; (b) NMF basis images; (c) LNMF basis images; and (d) PCA basis images.

The basis faces shown in Fig. 1(c) are derived from LNMF and show more partial patches representing images. But we have to note that slow convergence is one of the weaknesses of LNMF.

3. Initializing NMF with different techniques

As pointed out in Section 2, NMF seeks to find basis images W and an encoding matrix H under non-negativity constraints. In theory, the solution to Eq. (1) is not unique under such constraints. Lee and Seung also provided proofs that both multiplicative update rules converge to local minima of the respective objective functions (Lee and Seung, 2001). As a result, different initialization conditions may lead to different outcomes. Generally speaking, there are distinct requirements in data analysis or representation for different applications such as visual coding, clustering and so on. It is our desire to design specific initializations for NMF in order to improve its performance, either for the consideration of computational complexity or for the preservation of data structure.

3.1. PCA-based initialization

PCA is one of the best known unsupervised feature extraction methods because of its conceptual simplicity and the existence of efficient algorithms that implement it (Mark and Erkki, 2004). Particularly in the face representation task, faces can be economically represented in the eigenface coordinate space, and approximately reconstructed using just a small collection of eigenfaces and their corresponding projections ('coefficients'). This representation is optimal in the sense of mean-square error.

Given the m × N matrix X as in NMF, the average vector c can be computed as

c = \frac{1}{N} \sum_{i=1}^{N} X_i. \quad (8)

Subtracting the mean vector gives the centered matrix

\bar{X} = (X_1 - c, \ldots, X_N - c). \quad (9)

Using SVD to compute the eigenvectors of \bar{X}^T \bar{X} instead of \bar{X} \bar{X}^T (because usually m \gg N), up to N eigenvalues and eigenvectors are obtained. The eigenvector matrix W is constructed by keeping only the r eigenvectors (corresponding to the r largest eigenvalues \lambda_i) as column vectors, and H is an r × N matrix containing the encoding coefficients. The reconstruction error is

E_{mse} = E \| \bar{X} - WH \|^2. \quad (10)

The criterion for selecting r is usually

\frac{\sum_{i=1}^{r} \lambda_i}{\sum_{i=1}^{N} \lambda_i} \ge a. \quad (11)

Comparing Eq. (10) with Eq. (2), we find they are very similar to each other. If a is large enough (of course not more than 1), PCA finds the directions in input space where most of the energy of the input lies. Motivated by this, we take the PCA-based method into account to initialize NMF. However, the signs of the variables in W and H of Eq. (10) are arbitrary, whereas in the NMF procedure all variables in W and H are non-negative. In order to utilize the W and H obtained from PCA, we introduce a nonlinear operator p such that

p(W) = p(W_+) = \max(0, W_{ij}),
p(H) = p(H_+) = \max(0, H_{ij}). \quad (12)

In this case, the reconstruction error is

E_{mse+} = E \| \bar{X} - p(W) p(H) \|^2. \quad (13)

Taking the mean matrix \bar{c}, composed of the mean vector c, into consideration, we rewrite Eq. (13) as

E_{mse+} = E \| X - (p(W) p(H) + \bar{c}) \|^2, \quad (14)

where X is the original data matrix. Obviously, Eq. (14) is equivalent to Eq. (2). If we initialize both factors W and H of NMF with p(W) and p(H), then minimizing Eq. (13) is equivalent to minimizing Eq. (2).

The reason why we adopt the PCA-based method to initialize W and H lies in two aspects. One is that much information is contained in the PCA subspace, so that it gives a proper approximation to the original data; such an approximation cannot be guaranteed by random initialization. The other is that the operator p enhances the sparseness of W, one possible goal of NMF; random initialization, on the contrary, is somewhat counterproductive in this respect. The visual images of random(W) and p(W) are shown in Fig. 2. Comparing the images shown in Fig. 2(b) with those in Fig. 1(d), we can conclude that the operator p makes the basis images sparser. The sparsity of the basis images slightly reduces the computational complexity: once an element is zero, it remains zero under the multiplicative update rules, so an element initialized to zero need not be computed.

To initialize NMF using the PCA-based method, we have to choose the rank of the low-dimensional space. As stated in Eq. (11), we choose r = 25 when setting a = 0.9. The ordering of columns in W is unimportant as long as the ordering of rows in H is kept consistent. Fig. 3 illustrates the improvement of the basis images after different numbers of iterations using Euclidean-distance NMF. For visualization, each basis image is normalized to have elements with a maximum value of 1.

We find from Fig. 3 that, with random initialization, NMF starts to model faces from scratch. As the algorithm progresses, the random initialization begins to lose its original noise and more closely resembles the expected faces. The PCA-based initialization, however, already works in the 'face space' with predefined parts. After 300 iterations, the basis images obtained by random initialization are distinctly different from the initial images shown in Fig. 2(a). With PCA-based initialization, the algorithm has a better start, emphasizing facial components such as eyes, nose and eyebrows from the beginning. Comparing the basis images after 1000 iterations with the initial images shown in Fig. 2(b), the reader will find they are very similar to each other.

We further analyze the basis images W based on the results shown in Fig. 3. One natural analysis, illustrated in Fig. 6, is the minimization error of the resulting approximation. Here, the error is measured by the Frobenius norm:

\| X - WH \|_F = \sqrt{ \sum_{i=1}^{m} \sum_{j=1}^{N} e_{ij}^2 }. \quad (15)

Fig. 6 shows that the approximation error from both PCA-based and random initialization decreases quickly before 200 iterations; after 300 iterations, the error decreases very slowly. Before 300 iterations, the PCA-based method achieves a better approximation than the random one because of its better starting point. In the long term, random initialization will achieve a lower error than the PCA-based method. This is because PCA-based initialization enforces certain restrictions on W and H, and the NMF algorithm using the Euclidean distance measure cannot pull it out of a local minimum. However, it is doubtful whether this long-term difference is meaningful, given the very low rate of decrease.

The other straightforward analysis concerns the sparsity and orthogonality of the normalized basis images W. For the sparsity of w_i, we define the measure

\| w \|_p = \Big( \sum_{i=1}^{m} w_i^p \Big)^{1/p}, \quad 0 < p \le 1. \quad (16)

In fact, this is a quasi-norm for 0 < p < 1 since it does not satisfy the triangle inequality, but only the weaker conditions \| x + y \|_p \le 2^{1/p'} ( \| x \|_p + \| y \|_p ), where p' = p/(p-1) is the conjugate exponent of p, and \| x + y \|_p^p \le \| x \|_p^p + \| y \|_p^p (Donoho, 2001).

For the orthogonality of w_i, we define the measure as the sum of the (r − 1) inner products

O(w_i) = \sum_{j=1, j \ne i}^{r} w_i^T w_j, \quad i = 1, \ldots, r. \quad (17)

If W is an orthogonal matrix, O(w_i) equals 0; otherwise, O(w_i) increases as the orthogonality decreases. The sparsity and orthogonality results of random and PCA-based initialization after 1000 iterations are shown in Fig. 4. Here we see that the basis images in W of PCA-based initialization after 1000 iterations are more sparse and orthogonal than those obtained by the random method.
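A sketch of this PCA-based initialization, with the rank chosen by the energy criterion of Eq. (11); the SVD route stands in for the eigendecomposition described above, the function name is ours, and `nmf_euclidean` refers to the sketch in Section 2:

```python
import numpy as np

def pca_init(X, a=0.9):
    """PCA-based NMF initialization (Eqs. (8)-(12)).

    Returns the rectified factors p(W), p(H) and the rank r chosen so that
    the kept eigenvalues carry at least a fraction `a` of the energy.
    """
    c = X.mean(axis=1, keepdims=True)          # mean vector, Eq. (8)
    Xc = X - c                                 # centered matrix, Eq. (9)
    # Left singular vectors of Xc are the eigenvectors of Xc Xc^T,
    # and the eigenvalues are the squared singular values.
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    energy = np.cumsum(s**2) / np.sum(s**2)
    r = int(np.searchsorted(energy, a) + 1)    # smallest r satisfying Eq. (11)
    W = U[:, :r]                               # eigenvector matrix
    H = W.T @ Xc                               # encoding coefficients
    # Nonlinear operator p of Eq. (12): clip negative entries to zero.
    return np.maximum(W, 0), np.maximum(H, 0), r

# Usage: W0, H0, r = pca_init(X, a=0.9)
#        W, H = nmf_euclidean(X, r, W0=W0, H0=H0)
```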

Fig. 2. Visual basis image of (a) random(W) and (b) p(W).
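To make the sparsity and orthogonality comparison of Fig. 4 concrete, here is a direct transcription of the measures in Eqs. (16) and (17); the choice p = 0.5 is our assumption, since the paper does not state which p was used:

```python
import numpy as np

def sparsity(w, p=0.5):
    """Quasi-norm ||w||_p of Eq. (16), 0 < p <= 1; smaller means sparser."""
    return np.sum(np.abs(w) ** p) ** (1.0 / p)

def orthogonality(W):
    """O(w_i) of Eq. (17): sum of inner products of column i with the others.

    Returns a length-r vector that is all zeros iff the columns of W
    are mutually orthogonal.
    """
    G = W.T @ W                        # Gram matrix of the basis columns
    return G.sum(axis=1) - np.diag(G)  # drop the j = i term
```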

Fig. 3. (a)–(j) Improvement of basis images for the random and PCA-based methods at different iterations. The top two rows are the results of random initialization; the other two are the results of PCA-based initialization.

3.2. Clustering-based initialization

In conventional clustering, each sample in the original dataset X is either assigned or not assigned to a group. The fuzzy clustering method, which applies the concept of fuzzy sets to clustering, gives membership to groups at each point of the dataset through a membership function (Mihaela et al., 2002). It has proved to be well suited to imprecise data. The advantage of fuzzy clustering is its adaptation to noisy data and to classes that are not well separated. The fuzzy c-means (FCM) clustering algorithm used in this paper is described as follows:

Fig. 4. Orthogonality and sparsity analysis: (a) sparsity of NMF W; (b) orthogonality of NMF W; (c) sparsity of PCA-based W; and (d) orthogonality of PCA-based W.

Fig. 5. (a)–(f) FCM-based initialization after different iterations.

(a) Fix the number of clusters c, the fuzzifier q, the termination tolerance \epsilon and an inner-product norm metric for R^n.
(b) Initialize the cluster center matrix W.
(c) Compute the dissimilarity matrix D_{i,j}, the dissimilarity between w_i and x_j.
(d) Compute the membership matrix

u_{i,j} = \frac{1}{\sum_{s=1}^{c} (D_{i,j}/D_{s,j})^{1/(q-1)}}.

(e) Update the cluster centers

w_i = \frac{\sum_{j=1}^{N} u_{i,j}^q x_j}{\sum_{j=1}^{N} u_{i,j}^q}.

(f) Calculate the objective function

J = \sum_{i=1}^{c} \sum_{j=1}^{N} u_{i,j}^q D_{i,j};

if the tolerance \epsilon is satisfied, the algorithm terminates; otherwise, go to (c).

Before carrying out FCM, the column vectors x_i of the original data matrix X are normalized to have elements with a maximum value of 1. Using the FCM algorithm, we obtain the centroid matrix W and the membership u of each sample in the dataset. The centroid matrix W is shown in Fig. 5(a) for c = 25. To some extent, each basis image contained in W is a combination of one or more faces in the original dataset. In Fig. 5(a), some images look more blurry than others because the corresponding cluster contains more faces. Motivated by this, consider the following least-squares problem:

X \approx WL, \quad L = (W^T W)^{-1} W^T X, \quad (18)

where the existence of (W^T W)^{-1} depends on the linear independence of the columns of W (Dhillon and Modha, 2001). Unfortunately, the elements of the least-squares solution L are unconstrained in sign. Therefore, we seek another matrix H whose elements are non-negative to approximate X together with W in order to initialize NMF. Assuming that each cluster derived from FCM is compact, a natural way is to replace every x_j in X with its corresponding cluster centroid w_i. This leads to

H_{ij} = \begin{cases} 1, & i = \arg\max_i (U_{ij}), \ j = 1, \ldots, N, \\ 0, & \text{otherwise}. \end{cases} \quad (19)

In other words, the values of the elements in H are automatically determined by the membership matrix: the maximum value in column U_{:j} corresponds to a 1 in H_{:j}. The approximation is then X \approx WH. Let us compare the objective function Eq. (2) with the cost function of the FCM algorithm under this constraint. In this case, we rewrite Eq. (2) as

D_E(X, \tilde{X}) = \sum_{j=1}^{N} \| X_j - W_{c_j} \|^2, \quad (20)

where W_{c_j} denotes the centroid to which the sample X_j belongs. The FCM cost function is transformed to

J = \sum_{j=1}^{N} D_j. \quad (21)

It is interesting that they are equivalent to each other. Therefore, it is appropriate to use the FCM-based method to initialize NMF. The iteration results under this initialization are shown in Fig. 5(b)–(f).

The experimental results illustrated in Fig. 5 show that the basis images do not change dramatically even after 500 iterations, unlike those of random initialization shown in Fig. 3. The reason is that the clusters derived from FCM on this small dataset are compact enough that the approximation error is very small from the head start, so the factorization easily converges to certain local minima, and the multiplicative update rule cannot change the basis much or pull it out, because of the already good approximation. The approximation error is shown in Fig. 6.

Fig. 6. Approximation error given different initializations.

However, as the basis images in W are the centroids of different clusters, they are combinations of one or more faces. So the bases are denser than a single face image and the orthogonality decreases. The sparsity and orthogonality of W, measured using Eqs. (16) and (17), are shown in Fig. 7. Compared with the values shown in Fig. 4(c) and (d), the values in Fig. 7 are much larger.
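A compact sketch of steps (a)–(f) and the H construction of Eq. (19), assuming the usual squared-Euclidean dissimilarity and a fuzzifier q = 2 (the paper fixes neither); initializing the centers from random samples is also our assumption:

```python
import numpy as np

def fcm_init(X, c=25, q=2.0, tol=1e-5, max_iter=300, seed=0):
    """FCM-based NMF initialization: steps (a)-(f), then H via Eq. (19).

    X : (m, N) data matrix with max-normalized columns.
    Returns the centroid matrix W (m, c) and the binary encoding H (c, N).
    """
    m, N = X.shape
    rng = np.random.default_rng(seed)
    W = X[:, rng.choice(N, size=c, replace=False)].copy()  # (b) init centers
    J_old = np.inf
    for _ in range(max_iter):
        # (c) squared-Euclidean dissimilarity D[i, j] between w_i and x_j
        D = ((W[:, :, None] - X[:, None, :]) ** 2).sum(axis=0) + 1e-12
        # (d) membership: u_ij = D_ij^(-1/(q-1)) / sum_s D_sj^(-1/(q-1))
        inv = D ** (-1.0 / (q - 1.0))
        U = inv / inv.sum(axis=0, keepdims=True)
        # (e) centers as membership-weighted means
        Uq = U ** q
        W = (X @ Uq.T) / Uq.sum(axis=1)
        # (f) objective; stop when the decrease falls below the tolerance
        J = (Uq * D).sum()
        if abs(J_old - J) < tol:
            break
        J_old = J
    # Eq. (19): one-hot H, a 1 where cluster i has the largest membership
    H = np.zeros((c, N))
    H[U.argmax(axis=0), np.arange(N)] = 1.0
    return W, H
```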

3.3. Gabor-based initialization

The Gabor wavelet is a powerful tool in image feature extraction, and Lee described its good performance under rotation, scaling and translation (Daugman, 1988; Liu and Wechsler, 2001). The Gabor wavelet (Lee, 1996) can be defined as

\psi_{\mu,\nu}(z) = \frac{\| k_{\mu,\nu} \|^2}{\sigma^2} e^{-\| k_{\mu,\nu} \|^2 \| z \|^2 / (2\sigma^2)} \left[ e^{i k_{\mu,\nu} \cdot z} - e^{-\sigma^2/2} \right], \quad (22)

where \mu and \nu define the orientation and scale of the Gabor kernels, and the wave vector k_{\mu,\nu} is defined as

k_{\mu,\nu} = k_\nu e^{i \phi_\mu}. \quad (23)

Fig. 7. Sparsity and orthogonality of basis images (FCM).

Fig. 8. The mean-face (a) and the basis images using Gabor-based initialization after different iterations (b)–(g).

Fig. 9. Sparsity and orthogonality of basis images (Gabor).

Once these parameters are given, the Gabor feature representation of an image I(z) is

G_{\mu,\nu}(z) = I(z) * \psi_{\mu,\nu}(z), \quad (24)

where z = (x, y) and * is the convolution operator. When convolved with Gabor wavelets, the image is transformed into a set of image features at certain scales and orientations; conversely, the image can be reconstructed from these features. Motivated by this standpoint, we try to initialize NMF with the Gabor-based method. In order to utilize Gabor wavelets to obtain the initial W for NMF, we simply compute the mean face of the dataset, that is, the center of all the images in the dataset. The mean-face image is shown in Fig. 8(a). Here, we choose five scales and five orientations for the sake of comparison, although four scales and eight orientations are selected in most cases. The initial basis and iterative results are shown in Fig. 8.

The approximation error using Gabor-based initialization at different iterations is illustrated in Fig. 6. The error curve is similar to that of random initialization. In this case, all elements of the right factor H are initialized to 1. Imprecisely speaking, the samples in the original dataset are approximated by the computed mean face under such initialization. It is obvious that such an approximation is inferior to both the clustering-based method and the PCA-based method. However, the features contained in the basis images are different from those of the clustering-based and PCA-based methods because of the scale and orientation properties. Although the basis images after 1000 iterations also look whole-face alike, there may exist some (unexplored) constraints that strengthen the features in terms of scale or orientation according to different applications. The sparsity and orthogonality of the bases are similar (Fig. 9).
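A sketch of how such an initial W could be assembled from Eqs. (22)–(24): build one Gabor kernel per (scale, orientation) pair, convolve the mean face with each, and stack the (non-negative) magnitude responses as columns. The k_max, f and \sigma values are conventional choices from the Gabor-face literature, not stated in the paper, and the magnitude rectification is likewise our assumption:

```python
import numpy as np
from scipy.signal import fftconvolve

def gabor_kernel(mu, nu, size=64, sigma=2 * np.pi, k_max=np.pi / 2, f=np.sqrt(2)):
    """Gabor wavelet of Eq. (22) with wave vector k_{mu,nu} of Eq. (23)."""
    k = k_max / f**nu                   # scale: k_nu = k_max / f^nu
    phi = np.pi * mu / 5.0              # orientation phi_mu (5 orientations)
    kv = k * np.exp(1j * phi)           # k_{mu,nu} = k_nu e^{i phi_mu}
    half = size // 2
    y, x = np.mgrid[-half:half, -half:half]
    z2 = x**2 + y**2
    carrier = np.exp(1j * (kv.real * x + kv.imag * y)) - np.exp(-sigma**2 / 2)
    return (k**2 / sigma**2) * np.exp(-k**2 * z2 / (2 * sigma**2)) * carrier

def gabor_init(mean_face, n_scales=5, n_orients=5):
    """Initial W: magnitudes of the Eq. (24) responses of the mean face,
    one column per (scale, orientation) pair; H is initialized to all ones."""
    cols = []
    for nu in range(n_scales):
        for mu in range(n_orients):
            g = fftconvolve(mean_face, gabor_kernel(mu, nu), mode='same')
            cols.append(np.abs(g).ravel())   # non-negative by construction
    return np.stack(cols, axis=1)            # (m, 25) for the 5 x 5 setting
```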

4. Discussion of choosing rank r

We have discussed the PCA-based, clustering-based and Gabor-based initialization methods in detail in the previous sections. For the sake of comparison, all experiments choose rank r = 25 for the low-dimensional space. Is there any other choice of the rank r? The answer is definitely "yes". Generally speaking, the rank r of the low-dimensional space is difficult to determine in a dimensionality reduction task, and the criterion for choosing it varies with the application. For example, online image retrieval demands that r be as small as possible to meet real-time requirements (Gunther and Beretta, 2001). For image compression, however, r should not be too small, in order to achieve a better approximation of the originals. In addition, common users who are not familiar with the specific application or the properties of the original dataset may be puzzled when selecting a particular rank r for the low-dimensional space.

The initialization methods discussed above give some guidance in determining the rank r, compared with random initialization. The PCA-based method is mainly concerned with energy conservation, as stated in Eq. (11); this standpoint is very popular in PCA-related applications and still gives direction when used to initialize NMF. For the clustering-based method, the number of clusters indicates the rank r under this type of initialization; it is believed that other clustering algorithms can also be adopted to initialize NMF (Wild, 2003). The Gabor-based method draws on its multi-scale and multi-orientation property; when other constraints are imposed on scale and orientation, it may show its potential benefit for specific applications.

If we compute NMF for different values of r in order to, for example, gain a better approximation, it can be computationally very expensive: for different datasets, the rank r has to be re-computed in the absence of a rank selection strategy. This is the reason we use these methods to initialize NMF. Furthermore, we find that the error decreases dramatically before 200 iterations. Both the PCA-based and clustering-based methods achieve better approximation accuracy than random initialization from the head start, though they are inferior in the long term. In our view, the approximation accuracy of the PCA-based and clustering-based methods is satisfactory in most cases because the residual rate of error decrease is small enough to be ignored in the long run. Even for the Gabor-based method, the iteration result is nearly the same as that of random initialization. One can expect to benefit from the proposed methods in choosing the rank r.

5. Conclusion

It has been proven that non-negative matrix factorization is a useful tool in the analysis of a diverse range of data. One of its most salient properties is that the resulting bases are often intuitive and easy to interpret because they are non-negative and sparse. Researchers often resort to random initialization when utilizing NMF. In fact, random initialization may make experiments unrepeatable because of the local-minima property (Baldi and Hornik, 1995). In this paper, we discuss the NMF algorithm with emphasis on the initialization problem. The proposed methods are based on PCA, clustering and Gabor techniques. Compared with random initialization, the three methods demonstrate their superiority either in fast convergence in the early phase or in structure preservation. We believe that the performance of NMF will be enhanced with the proposed initialization methods in data analysis.

Acknowledgement

The authors wish to acknowledge that this work is supported by the Fundamental Project of Shanghai under grant number 03DZ14015. The authors would like to thank Dr. Yonggang Wang for useful discussion.

References

Baldi, P., Hornik, K., 1995. Learning in linear neural networks: a survey. IEEE Transactions on Neural Networks 6, 837–857.
Barlett, M.S., Lades, H.M., Sejnowski, T.J., 1998. Independent component representations for face recognition. Proceedings of SPIE 3299, 528–539.
Bell, A.J., Sejnowski, T.J., 1995. An information-maximization approach to blind separation and blind deconvolution. Neural Computation 7, 1129–1159.
Buciu, I., Pitas, I., 2004. Application of non-negative and local non-negative matrix factorization to facial expression recognition. In: Proceedings of the International Conference on Pattern Recognition (ICPR), vol. 1, pp. 288–291.
Daugman, J.G., 1988. Complete discrete 2-D Gabor transforms by neural networks for image analysis and compression. IEEE Transactions on Acoustics, Speech, and Signal Processing 36, 1169–1179.
Dhillon, I.S., Modha, D.S., 2001. Concept decompositions for large sparse text data using clustering. Machine Learning 42, 143–175.
Donoho, D.L., 2001. Sparse components analysis and optimal atomic decomposition. Constructive Approximation 17, 353–382.
Guillamet, D., Vitrià, J., 2002. Non-negative matrix factorization for face recognition. In: Proceedings of the 5th Catalonian Conference on AI: Topics in Artificial Intelligence, pp. 336–344.
Guillamet, D., Vitrià, J., Schiele, B., 2003. Introducing a weighted non-negative matrix factorization for image classification. Pattern Recognition Letters 24, 2447–2454.
Gunther, N.J., Beretta, G., 2001. A benchmark for image retrieval using distributed systems over the Internet: BIRDS-I. HP Labs Technical Report HPL-2000-162, Palo Alto.
Jiangying Zhou, Lopresti, D.P., 1997. Extracting text from WWW images. In: Proceedings of ICDAR, pp. 248–252.
Lee, S., 1996. Image representation using 2D Gabor wavelets. IEEE Transactions on PAMI 18, 959–971.
Lee, J.S., Lee, D.D., 2001. Non-negative matrix factorization of dynamic images in nuclear medicine. In: IEEE Medical Imaging Conference, pp. 2027–2030.
Lee, D.D., Seung, H.S., 1999. Learning the parts of objects by non-negative matrix factorization. Nature 401, 788–791.

Lee, D.D., Seung, H.S., 2001. Algorithms for non-negative matrix factorization. Advances in Neural Information Processing Systems 13, 556–562.
Li, S.Z., Hou, X.W., Zhang, H.J., Cheng, Q.S., 2001. Learning spatially localized, parts-based representation. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 207–212.
Liu, C., Wechsler, H., 2001. A Gabor feature classifier for face recognition. In: Proceedings of the 8th IEEE International Conference on Computer Vision, Vancouver, BC, Canada, pp. 270–279.
Mark, D.P., Erkki, O., 2004. A 'nonnegative PCA' algorithm for independent component analysis. IEEE Transactions on Neural Networks 15, 66–76.
Mel, B.W., 1999. Think positive to find parts. Nature 401, 759–760.
Mihaela, G., Constantine, K., Apostolos, G., Ioannis, P., 2002. A new fuzzy c-means based segmentation strategy: application to lip region identification. In: IEEE-TTTC International Conference on Automation, Quality and Testing, Robotics, May 23–25, Cluj-Napoca, Romania, pp. 1–6.
Pauca, V., Shahnaz, F., Berry, M.W., Plemmons, R., 2004. Text mining using non-negative matrix factorization. In: Proceedings of the 4th SIAM International Conference on Data Mining.
Turk, M., Pentland, A., 1991. Eigenfaces for recognition. Journal of Cognitive Neuroscience 3, 71–86.
Wild, S.M., 2003. Seeding non-negative matrix factorizations with the spherical k-means clustering. M.S. Thesis, University of Colorado.

Zhonglong Zheng received the B.A. degree in EE from the University of Petroleum, China, in 1999 and the doctoral degree from the Institute of Image Processing & Pattern Recognition of Shanghai Jiao Tong University, China. He is now an associate professor in the Institute of Information Science and Engineering, Zhejiang Normal University. His current research interests include image analysis, video tracking, and face detection and recognition.

Jie Yang received a Ph.D. degree in computer science from the University of Hamburg, Germany, in 1994. Dr. Yang is now Professor and Vice-Director of the Institute of Image Processing & Pattern Recognition of Shanghai Jiao Tong University, China. His research interests include artificial intelligence, image processing, computer vision, data mining and pattern recognition. He is the author of more than 80 scientific publications and one book.

Yitan Zhu received his B.A. and M.S. degrees in pattern recognition and intelligent systems from Shanghai Jiao Tong University, China. He is now a Ph.D. student at the Alexandria Research Institute, Virginia Polytechnic Institute and State University.