MULTIPLE KERNEL NONNEGATIVE MATRIX FACTORIZATION

Shounan An 1, Jeong-Min Yun 2, and Seungjin Choi 2,3

1 Machine Intelligence Group, Information and Technology Lab, LG Electronics, Korea
2 Department of Computer Science, POSTECH, Korea
3 Division of IT Convergence Engineering, POSTECH, Korea
Email: [email protected], [email protected], [email protected]

ABSTRACT

Kernel nonnegative matrix factorization (KNMF) is a recent kernel extension of NMF in which matrix factorization is carried out in a reproducing kernel Hilbert space (RKHS) with a feature mapping φ(·). Given a data matrix X ∈ R^{m×n}, KNMF seeks a decomposition φ(X) ≈ UV⊤, where the basis matrix takes the form U = φ(X)W and the parameters W ∈ R_+^{n×r} and V ∈ R_+^{n×r} are estimated without explicit knowledge of φ(·). As in most kernel methods, the performance of KNMF heavily depends on the choice of kernel. To alleviate the kernel selection problem that arises when a single kernel is used, we present multiple kernel NMF (MKNMF), in which two learning problems are jointly solved in an unsupervised manner: (1) learning the best convex combination of kernel matrices; and (2) learning the parameters W and V. We formulate multiple kernel learning in MKNMF as a linear program and estimate W and V using multiplicative updates as in KNMF. Experiments on benchmark face datasets confirm the high performance of MKNMF over several existing variants of NMF in the task of feature extraction for face classification.

Index Terms— Face recognition, multiple kernel learning, nonnegative matrix factorization.

1. INTRODUCTION

Nonnegative matrix factorization (NMF) is a method for low-rank approximation of nonnegative multivariate data. Its goal is to approximate a data matrix (target matrix) X = [x_1, ..., x_n] ∈ R_+^{m×n} as a product of two nonnegative factor matrices U ∈ R_+^{m×r} and V ∈ R_+^{n×r} (a 2-factor decomposition), such that X ≈ UV⊤ [1]. The parameters U and V are estimated by multiplicative updates that iteratively minimize ‖X − UV⊤‖², where ‖·‖ denotes the Frobenius norm of a matrix.
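For concreteness, a minimal NumPy sketch of these standard Euclidean multiplicative updates is given below; the function name and the small eps safeguard are ours, not from [1].

```python
import numpy as np

def nmf_euclidean(X, r, n_iter=200, eps=1e-9, seed=0):
    """Standard Euclidean NMF, X (m x n, nonnegative) ~ U @ V.T,
    via the multiplicative updates of Lee and Seung [1]."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    U = rng.random((m, r))            # nonnegative basis matrix
    V = rng.random((n, r))            # nonnegative encoding matrix
    for _ in range(n_iter):
        U *= (X @ V) / (U @ (V.T @ V) + eps)      # U <- U .* (X V) ./ (U V'V)
        V *= (X.T @ U) / (V @ (U.T @ U) + eps)    # V <- V .* (X'U) ./ (V U'U)
    return U, V
```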
In addition to the Euclidean distance, various divergence measures have also been considered [2].

Various extensions of NMF have been developed. For example, additional constraints were imposed on basis vectors to improve their locality, leading to local NMF (LNMF) [3]. Fisher NMF (FNMF) was proposed in [4], where NMF is regularized by the Fisher criterion, incorporating label information into NMF to improve its discriminative power. Recently, semi-supervised NMF was presented in [5], where the data matrix and the (partial) class label matrix are jointly decomposed, sharing a factor matrix, in order to exploit both labeled and unlabeled data within the NMF framework.

Another notable extension of NMF, which is of interest in this paper, is kernel NMF (KNMF), where NMF is carried out in a reproducing kernel Hilbert space (RKHS) with a feature mapping φ(·) : R^m → F, where F is a feature space [6, 7]. KNMF assumes that the basis vectors u_j lie within the column space of φ(X) = [φ(x_1), ..., φ(x_n)], i.e., u_j = Σ_{i=1}^n φ(x_i) W_{ij}, leading to U = φ(X)W. Each column of W is restricted to satisfy the sum-to-one constraint. KNMF then seeks a decomposition φ(X) ≈ φ(X)WV⊤, estimating the parameters W ∈ R_+^{n×r} and V ∈ R_+^{n×r} without explicit knowledge of φ(·) by means of the kernel trick. In fact, KNMF is a special case of convex-NMF [6] and was shown to be useful in extracting spectral features from EEG signals [7].

The performance of KNMF depends on the choice of kernel when a single kernel is used. Recent advances in kernel methods have emphasized the need to consider multiple kernels or parameterizations of kernels instead of a single fixed kernel [8, 9]. Multiple kernel learning (MKL) considers a convex (or conic) combination of kernels for classifiers; the classifier parameters and the best convex combination of kernels are estimated by a joint optimization.

In this paper we incorporate MKL into KNMF, leading to multiple kernel NMF (MKNMF), where we jointly solve two learning problems in an unsupervised manner: (1) learning the best convex combination of kernel matrices; and (2) learning the parameters W and V. The useful characteristics of MKNMF are summarized as follows.

• To the best of our knowledge, this is the first work incorporating MKL into matrix factorizations.
• MKNMF jointly learns the best convex combination of kernel matrices and the parameters W and V in an unsupervised manner, while most MKL methods have been developed for supervised learning.
• We develop a simple alternating minimization algorithm for MKNMF, in which the best convex combination of kernels is determined by linear programming and the parameters W and V are estimated by multiplicative updates as in KNMF.
• MKNMF can handle both nonnegative and negative data, just like convex-NMF.
2. MULTIPLE KERNEL NMF

We present the objective function and an alternating minimization algorithm for MKNMF to learn the best convex combination of kernels and to estimate the factor matrices.

2.1. Objective Function

Suppose that the data matrix X = [x_1, ..., x_n] ∈ R^{m×n} is a collection of m-dimensional vectors. We consider a feature mapping φ(·) : R^m → F, where F is a feature space, and define φ(X) = [φ(x_1), ..., φ(x_n)]. The kernel matrix K ∈ R_+^{n×n} is then given by K = φ⊤(X)φ(X). A direct application of NMF to the feature matrix φ(X) yields

    φ(X) ≈ UV⊤,    (1)

where U = [u_1, ..., u_r] and V = [v_1, ..., v_r] are the nonnegative basis matrix and encoding matrix, respectively. Without explicit knowledge of φ(·), the parameters U and V cannot be estimated in the factorization (1).

A simple trick to develop KNMF (or MKNMF) is to impose the constraint that the basis vectors u_j lie within the column space of φ(X), i.e., u_j = φ(x_1)W_{1j} + ··· + φ(x_n)W_{nj}, leading to

    U = φ(X)W,    (2)

where W_{ij} is the (i, j)-element of the matrix W ∈ R_+^{n×r}. We also restrict ourselves to convex combinations of the columns of φ(X), so that each column of W satisfies the sum-to-one constraint.

Incorporating the constraint (2) into the least squares objective function for the factorization (1) yields

    J = (1/2) ‖φ(X) − UV⊤‖²
      = (1/2) ‖φ(X) − φ(X)WV⊤‖²
      = (1/2) tr{ K (I − WV⊤)(I − WV⊤)⊤ },    (3)

where tr{·} is the trace operator and K = φ⊤(X)φ(X) is the kernel matrix, which is assumed to be nonnegative in this paper. In fact, Eq. (3) is the objective function for KNMF.

Now we consider a convex combination of kernel matrices K_j, i.e., K = Σ_{j=1}^M β_j K_j, where β_j ≥ 0 for j = 1, ..., M and β_1 + ··· + β_M = 1. Substituting this relation into (3) yields the objective function for MKNMF,

    J = (1/2) tr{ Σ_{j=1}^M β_j K_j (I − WV⊤)(I − WV⊤)⊤ },    (4)

subject to W ∈ R_+^{n×r}, V ∈ R_+^{n×r}, 1⊤β = 1 (1 ∈ R^M is the vector of all ones), and β = [β_1, ..., β_M]⊤ ≥ 0.
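As a minimal sketch of how the cost (4) can be evaluated, assuming NumPy; the helper name mknmf_objective is ours and not part of the paper:

```python
import numpy as np

def mknmf_objective(kernels, beta, W, V):
    """Evaluate J = 0.5 * tr{ sum_j beta_j K_j (I - W V^T)(I - W V^T)^T }, Eq. (4).
    With a single kernel and beta = [1], this reduces to the KNMF cost (3)."""
    n = W.shape[0]
    K = sum(b * Kj for b, Kj in zip(beta, kernels))   # combined kernel matrix
    E = np.eye(n) - W @ V.T                           # residual operator I - W V^T
    return 0.5 * np.trace(K @ E @ E.T)
```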

2.2. Algorithm

The optimization of the objective function (4) involves two learning problems: (1) learning β to determine the best convex combination of kernels; and (2) estimating the two factor matrices W and V. We solve this optimization by alternating minimization. We first update β with W and V fixed, which turns out to be a linear program. We then estimate W and V with β fixed, as in KNMF [7]. The procedure is summarized in Algorithm 1.

Algorithm 1  Algorithm outline for MKNMF.
Input: K_j for j = 1, ..., M and r
Output: β ∈ R^M, W ∈ R^{n×r}, V ∈ R^{n×r}
1: Initialize W and V
2: repeat
3:   Apply linear programming to determine β which minimizes (5), given W and V
4:   Given β, construct K = Σ_{j=1}^M β_j K_j. Then update W and V using
        W ← W ⊙ (KV) / (KWV⊤V),
        V ← V ⊙ (KW) / (VW⊤KW).
5: until convergence

2.2.1. Optimization of β

Fixing W and V, the objective function (4) becomes

    J = (1/2) tr{ Σ_{j=1}^M β_j K_j (I − WV⊤)(I − WV⊤)⊤ }
      = (1/2) Σ_{j=1}^M β_j tr{T_j}
      = (1/2) t⊤β,    (5)

subject to 1⊤β = 1 and β ≥ 0. The matrices T_j are defined by T_j = K_j (I − WV⊤)(I − WV⊤)⊤ for j = 1, ..., M, and t ∈ R^M is the vector whose jth element is [t]_j = tr{T_j}. The optimization of (5) with respect to β is a standard linear program, which is solved using linprog(·) in our MATLAB implementation.

Please read this, which you cannot find in the final paper archived in IEEE Xplore. Our original idea was to find an optimal combination of kernels, determining β by LP. However, it turns out that the optimization formulation (5), subject to the constraints 1⊤β = 1 and β ≥ 0, yields a simple solution: β_i = 1 if i = arg min_j t_j, and β_i = 0 otherwise. In other words, β is a unit vector, and the location of the one depends on which entry of t is smallest. Therefore, instead of running an LP to determine β, it is sufficient to search for the entry of t with the minimal value; the problem thus becomes kernel selection. In fact, this was first pointed out to me by Mr. Bin Shen of Purdue University. Although the procedure reduces to kernel selection, it is still valuable because this simple process finds the best kernel, i.e., the one that minimizes the cost (5). A more careful study is required to determine whether this kernel is really the best one for classification, which is not yet clear.
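A small sketch of this β-step, assuming NumPy and SciPy; update_beta is our illustrative name, and both the explicit LP and the equivalent closed-form kernel selection discussed above are shown:

```python
import numpy as np
from scipy.optimize import linprog

def update_beta(kernels, W, V, use_lp=True):
    """beta-step of Eq. (5): minimize 0.5 * t^T beta s.t. 1^T beta = 1, beta >= 0,
    with t_j = tr{ K_j (I - W V^T)(I - W V^T)^T }.  The LP solution places all
    mass on the smallest entry of t, i.e. it selects a single kernel."""
    n = W.shape[0]
    E = np.eye(n) - W @ V.T
    t = np.array([np.trace(Kj @ E @ E.T) for Kj in kernels])
    if use_lp:                                        # explicit LP, analogous to MATLAB's linprog
        M = len(kernels)
        res = linprog(c=0.5 * t, A_eq=np.ones((1, M)), b_eq=[1.0],
                      bounds=[(0.0, None)] * M)
        return res.x
    beta = np.zeros(len(t))                           # equivalent closed form: kernel selection
    beta[np.argmin(t)] = 1.0
    return beta
```

Setting use_lp=False skips the LP call and returns the same selection directly.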

Fixing W and V , the objective function (4) becomes .η [∇J ]− Θ ← Θ ⊙ , (6) M [∇J ]+ 1 ⊤ ⊤ ⊤   J = tr βj Kj (I − WV )(I − WV ) 2 (j=1 ) X where ⊙ denotes the Hadamard product (element-wise product) and A M A represents the element-wise division, i.e. A = ij , ().η 1 Bij = βj tr {T j } B B ij 2 j=1 denotes the element-wise power and η is a learningh i rate (0 < η ≤ X 1). The multiplicative update (6) preserves the nonnegativity of the 1 ⊤ = t β, (5) parameter Θ, while ∇J = 0 when the convergence is achieved [10]. 2 Derivatives of the objective function (4) with respect to W and V are given by Table 1. Face recognition accuracy on FERET DB. Algorithm Train-2 Train-3 Train-4 + − ▽W J = [▽W J ] − [▽W J ] Baseline 0.5672 0.5938 0.5837 ⊤ PCA 0.6436 0.6837 0.7028 = KWV V − KV , + − KPCA 0.6731 0.6928 0.7372 ▽V J = [▽V J ] − [▽V J ] NMF 0.6527 0.6726 0.6892 = VW ⊤KW − KW . FNMF 0.6673 0.6847 0.7029 LNMF 0.6738 0.7028 0.7476 With these gradient calculations, invoking the relation (6) with η = KNMF 0.6624 0.7426 0.7562 yields multiplicative updates for W and V in MKNMF: 1 MKNMF 0.6928 0.7737 0.8163 KV W ← W ⊙ , (7) KWV ⊤V KW V ← V ⊙ . (8) Table 2. Face recognition accuracy on Yale DB. VW ⊤KW Algorithm Train-2 Train-3 Train-4 Baseline 0.4820 0.5137 0.5698 3. NUMERICAL EXPERIMENTS PCA 0.5185 0.5416 0.6285 KPCA 0.5259 0.5616 0.6333 We evaluated the performance of MKNMF in the task of face recog- NMF 0.4911 0.4083 0.5295 nition, compared to existing methods, including PCA (eigenface) FNMF 0.5518 0.5937 0.6327 [11], kernel PCA (KPCA) [12], NMF [1], local NMF (LNMF) [3], LNMF 0.5284 0.5732 0.6538 Fisher NMF (FNMF) [4], kernel NMF (KNMF) [7]. Features are KNMF 0.5838 0.6029 0.6650 extracted by these methods and a nearest neighbor (NN) classifier is MKNMF 0.6259 0.6524 0.7323 used for classification. As the baseline, original face images are used without any feature extraction.

3. NUMERICAL EXPERIMENTS

We evaluated the performance of MKNMF in the task of face recognition, comparing it with existing methods including PCA (eigenfaces) [11], kernel PCA (KPCA) [12], NMF [1], local NMF (LNMF) [3], Fisher NMF (FNMF) [4], and kernel NMF (KNMF) [7]. Features are extracted by these methods and a nearest neighbor (NN) classifier is used for classification. As the baseline, the original face images are used without any feature extraction.

3.1. Datasets

We used two face image datasets: the FERET DB (http://www.itl.nist.gov/iad/humanid/feret/feret-master.html) and the Yale DB (http://cvc.yale.edu/projects/yalefaces/yalefaces.html). Near-frontal face images were used and resized to 32 × 32.

• For the FERET dataset, we chose a subset containing 1400 images collected from 200 individuals. For each subject, 7 facial images were collected, reflecting varying facial expressions and illumination conditions.
• The Yale dataset provides 11 gray-scale face images for each of 15 individuals. The images exhibit variations in lighting condition, facial expression, and the presence or absence of glasses.

3.2. Experiment Settings

We divided the face images into a training set X_train and a test set X_test. For each subject in FERET (or Yale), we assigned 2 (3, or 4) randomly-selected images to the training set and the remaining images to the test set, yielding three different cases: 'Train-2', 'Train-3', and 'Train-4'.

The feature matrix V_test for the test set X_test is computed by LS projection, i.e., V_test⊤ = U† X_test, where U† = (U⊤U)⁻¹U⊤ and U is learned from X_train. In the case of KNMF or MKNMF, the LS projection is easily computed using the kernel trick: V_test⊤ = (W⊤KW)⁻¹ W⊤ K_test, where K_test = φ(X_train)⊤ φ(X_test).

We used Gaussian kernels of the form

    [K]_{ij} = exp( −(1/γ) ‖x_i − x_j‖² ),

where we used 11 different values

    γ ∈ { σ²/32, σ²/16, σ²/8, σ²/4, σ²/2, σ², 2σ², 4σ², 8σ², 16σ², 32σ² },

to consider a convex combination of 11 Gaussian kernel matrices (with different bandwidths). The value of σ was given by the average norm of the data vectors x_i. In the case of KPCA and KNMF, the appropriate value of γ was determined by leave-one-out cross validation. All experiments were carried out on a PC with a 3.4 GHz CPU and 2 GB RAM.

3.3. Results and Discussions

Classification accuracies averaged over 20 independent runs are summarized in Tables 1 and 2 for the FERET DB and Yale DB, respectively (in these experiments, the intrinsic dimension r was set to 200). The optimal value of γ is 8σ² for KPCA and σ²/4 for KNMF on the FERET DB, while for the Yale DB the optimal γ is σ²/2 for KPCA and σ²/8 for KNMF. MKNMF demonstrates the best performance across all cases considered here, confirming that multiple kernel learning indeed improves the performance of NMF.

Table 1. Face recognition accuracy on FERET DB.
Algorithm   Train-2   Train-3   Train-4
Baseline    0.5672    0.5938    0.5837
PCA         0.6436    0.6837    0.7028
KPCA        0.6731    0.6928    0.7372
NMF         0.6527    0.6726    0.6892
FNMF        0.6673    0.6847    0.7029
LNMF        0.6738    0.7028    0.7476
KNMF        0.6624    0.7426    0.7562
MKNMF       0.6928    0.7737    0.8163

Table 2. Face recognition accuracy on Yale DB.
Algorithm   Train-2   Train-3   Train-4
Baseline    0.4820    0.5137    0.5698
PCA         0.5185    0.5416    0.6285
KPCA        0.5259    0.5616    0.6333
NMF         0.4911    0.4083    0.5295
FNMF        0.5518    0.5937    0.6327
LNMF        0.5284    0.5732    0.6538
KNMF        0.5838    0.6029    0.6650
MKNMF       0.6259    0.6524    0.7323

Fig. 1 shows the classification accuracy as the value of r varies from 20 to 200. Compared with single-kernel NMF, where kernel selection is performed via cross-validation, MKNMF jointly optimizes the coefficients β_j in a convex combination of kernel matrices and the factor matrices W and V, providing the best performance in these experiments.

MKNMF requires O(n²r) time, while the computational complexity of NMF is O(mnr), where n is the number of data points, m is the dimension of the input data, and r is the intrinsic dimension. The computational complexity of MKNMF (or KNMF) mainly depends on the number of samples, while that of NMF depends on both the number of samples and the input dimension. Thus, MKNMF is efficient when the data dimension is high but the number of training samples is not large. Fig. 2 shows run time versus image size for both the FERET and Yale DBs, showing that the run time of MKNMF remains almost constant while the run time of NMF increases dramatically as the image size grows.
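To make the settings of Sec. 3.2 concrete, the following sketch (with our own function names, assuming NumPy and data vectors stored as columns of X) builds the 11 Gaussian kernel matrices and performs the kernel-trick LS projection of test features:

```python
import numpy as np

def gaussian_kernels(X):
    """Gaussian kernels of Sec. 3.2: [K]_ij = exp(-||x_i - x_j||^2 / gamma),
    gamma in {sigma^2/32, ..., 32 sigma^2}, sigma = average norm of the columns of X."""
    sq = np.sum(X**2, axis=0)
    D = sq[:, None] + sq[None, :] - 2.0 * (X.T @ X)        # pairwise squared distances
    sigma2 = np.mean(np.linalg.norm(X, axis=0)) ** 2
    gammas = [sigma2 * 2.0**p for p in range(-5, 6)]       # 11 bandwidths
    return [np.exp(-np.maximum(D, 0.0) / g) for g in gammas]

def project_test_features(K_train, K_test, W):
    """Kernel-trick LS projection of Sec. 3.2:
       V_test^T = (W^T K W)^{-1} W^T K_test, with K_test = phi(X_train)^T phi(X_test)."""
    return np.linalg.solve(W.T @ K_train @ W, W.T @ K_test).T
```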

Fig. 1. Comparison of classification accuracy when the intrinsic dimension r varies from 20 to 200, in the case of Train-4 for both the Yale and FERET DBs. (a) Yale DB; (b) FERET DB. [Plots: recognition accuracy vs. dimension for PCA, KPCA, NMF, FNMF, LNMF, KNMF, and MKNMF.]

Fig. 2. Comparison of NMF and MKNMF in terms of run time for different sizes of images. (a) Yale DB; (b) FERET DB. [Plots: computation time in seconds vs. input image size (16×16, 32×32, 64×64).]

4. CONCLUSIONS

We have presented MKNMF, in which multiple kernel learning is incorporated into nonnegative matrix factorization in order to alleviate the difficulty of kernel selection that arises when a single fixed kernel matrix is used for NMF. MKNMF involves the joint optimization of the coefficients β_j in a convex combination of kernel matrices and the factor matrices W and V. We solved this joint optimization by alternating minimization, where the best convex combination of kernel matrices is determined by linear programming and the factor matrices are estimated by multiplicative updates. We compared MKNMF with various NMF methods in the task of feature extraction for face recognition. Experiments on two face image datasets confirmed the high performance of MKNMF over other existing NMF methods.

Acknowledgments: This work was supported by NIPA ITRC Support Program (NIPA-2010-C1090-1031-0009), NRF Converging Research Center Program (2010K001171), and NRF WCU Program (R31-2010-000-10100-0).

5. REFERENCES

[1] D. D. Lee and H. S. Seung, "Learning the parts of objects by non-negative matrix factorization," Nature, vol. 401, pp. 788–791, 1999.
[2] A. Cichocki, H. Lee, Y.-D. Kim, and S. Choi, "Nonnegative matrix factorization with α-divergence," Pattern Recognition Letters, vol. 29, no. 9, pp. 1433–1440, 2008.
[3] S. Z. Li, X. W. Hou, H. J. Zhang, and Q. S. Cheng, "Learning spatially localized parts-based representation," in Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), Kauai, Hawaii, 2001, pp. 207–212.
[4] Y. Wang, Y. Jia, C. Hu, and M. Turk, "Fisher non-negative matrix factorization for learning local features," in Proceedings of the Asian Conference on Computer Vision (ACCV), Jeju Island, Korea, 2004.
[5] H. Lee, J. Yoo, and S. Choi, "Semi-supervised nonnegative matrix factorization," IEEE Signal Processing Letters, vol. 17, no. 1, pp. 4–7, 2010.
[6] C. Ding, T. Li, and M. I. Jordan, "Convex and semi-nonnegative matrix factorizations," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 1, pp. 45–55, 2010.
[7] H. Lee, A. Cichocki, and S. Choi, "Kernel nonnegative matrix factorization for spectral EEG feature extraction," Neurocomputing, vol. 72, pp. 3182–3190, 2009.
[8] G. R. G. Lanckriet, N. Cristianini, P. L. Bartlett, L. El Ghaoui, and M. I. Jordan, "Learning the kernel matrix with semidefinite programming," Journal of Machine Learning Research, vol. 5, pp. 27–72, 2004.
[9] F. R. Bach, G. R. G. Lanckriet, and M. I. Jordan, "Multiple kernel learning, conic duality, and the SMO algorithm," in Proceedings of the International Conference on Machine Learning (ICML), Banff, Canada, 2004.
[10] S. Choi, "Algorithms for orthogonal nonnegative matrix factorization," in Proceedings of the International Joint Conference on Neural Networks (IJCNN), Hong Kong, 2008.
[11] M. Turk and A. Pentland, "Eigenfaces for recognition," Journal of Cognitive Neuroscience, vol. 3, no. 1, pp. 71–86, 1991.
[12] B. Schölkopf, A. J. Smola, and K.-R. Müller, "Nonlinear component analysis as a kernel eigenvalue problem," Neural Computation, vol. 10, no. 5, pp. 1299–1319, 1998.