Multiple Kernel Nonnegative Matrix Factorization

MULTIPLE KERNEL NONNEGATIVE MATRIX FACTORIZATION Shounan An 1, Jeong-Min Yun 2, and Seungjin Choi 2,3 1 Machine Intelligence Group, Information and Technology Lab, LG Electronics, Korea 2 Department of Computer Science, POSTECH, Korea 3 Division of IT Convergence Engineering, POSTECH, Korea Email: [email protected], [email protected], [email protected] ABSTRACT assumes that basis vectors uj lie within the column space of n φ(X) = [φ(x1),...,φ(xn)], i.e., uj = i=1 φ(xi)Wij , lead- Kernel nonnegative matrix factorization (KNMF) is a recent ker- ing to U = φ(X)W . Each column of W is restricted to satisfy nel extension of NMF, where matrix factorization is carried out in the sum-to-one constraint. Then KNMF seeksP a decomposition a reproducing kernel Hilbert space (RKHS) with a feature mapping ⊤ Rn×r φ(X) ≈ φ(X)WV , estimating parameters W ∈ + and φ(·). Given a data matrix X ∈ Rm×n, KNMF seeks a decom- V ∈ Rn×r without explicit knowledge of φ(·) using kernel trick. In position, φ(X) ≈ UV ⊤, where the basis matrix takes the form + fact, KNMF is a special case of convex-NMF [6] and was shown to U = φ(X)W and parameters W ∈ Rn×r and V ∈ Rn×r are + + be useful in extracting spectral features from EEG signals [7]. estimated without explicit knowledge of φ(·). As in most of ker- The performance of KNMF depends on the choice of kernel, nel methods, the performance of KNMF also heavily depends on when a single kernel is used. Recent advances in kernel methods the choice of kernel. In order to alleviate the kernel selection prob- emphasized the need to consider multiple kernels or parameteriza- lem when a single kernel is used, we present multiple kernel NMF tions of kernels, instead of a single fixed kernel [8, 9]. Multiple (MKNMF) where two learning problems are jointly solved in unsu- kernel learning (MKL) considers a convex (or conic) combination of pervised manner: (1) learning the best convex combination of kernel kernels for classifiers. Parameters involving classifiers and the best matrices; (2) learning parameters W and V . We formulate multiple convex combination of kernels are estimated by a joint optimization. kernel learning in MKNMF as a linear programming and estimate In this paper we incorporate MKL into KNMF, leading to mul- W and V using multiplicative updates as in KNMF. Experiments on tiple kernel NMF (MKNMF) where we jointly solve two learning benchmark face datasets confirm the high performance of MKNMF problems in unsupervised manner: (1) learning the best convex com- over several existing variants of NMF, in the task of feature extrac- bination of kernel matrices; (2) learning parameters W and V . The tion for face classification. useful characteristics of MKNMF is summarized as follows. Index Terms— Face recognition, multiple kernel learning, nonnegative matrix factorization. • To our best knowledge, this is the first work on incorporating MKL into matrix factorizations. MKNMF jointly learns the best convex combination of ker- 1. INTRODUCTION • nel matrices and parameters W , V , in unsupervised manner, Nonnegative matrix factorization (NMF) is a method for low-rank while most of MKL methods have been developed in super- approximation of nonnegative multivariate data, the goal of which is vised learning. to approximate the data matrix (target matrix) X = [x1,..., xn] ∈ • We develop a simple alternating minimization algorithm for Rm×n Rm×r + as a product of two nonnegative factor matrices U ∈ + MKNMF, in which the best convex combination of kernels is Rn×r ⊤ and V ∈ + (2-factor decomposition), such that X ≈ UV [1]. determined by linear programming and parameters W and V Parameters U and V are estimated by multiplicative updates which are estimated by multiplicative updates as in KNMF. 2 iteratively minimize X − UV where · represents the Frobe- • MKNMF can handle both nonnegative and negative data, just nious norm of a matrix. In addition to Euclidean distance, various like convex-NMF. divergence measures were also considered [2]. Various extensions of NMF have been developed. For example, additional constraints were imposed on basis vectors to improve the 2. MULTIPLE KERNEL NMF locality of basis vectors, leading to local NMF (LNMF) [3]. Fisher NMF (FNMF) was proposed in [4] where NMF is regularized by We present the objective function and an alternating minimization Fisher criterion, incorporating label information into NMF to im- algorithm for MKNMF to learn the best convex combination of ker- prove the discriminative power. Recently semi-supervised NMF was nels and to estimate factor matrices. presented in [5], where the data matrix and the (partial) class label matrix are jointly decomposed, sharing a factor matrix, in order to 2.1. Objective Function exploit both labeled and unlabeled data in the framework of NMF. m×n Another notable extension of NMF, which is of interest in this Suppose that the data matrix X = [x1,..., xn] ∈ R is a col- paper, is kernel NMF (KNMF) where NMF is carried out in a re- lection of m-dimensional vectors. We consider a feature mapping producing kernel Hilbert space (RKHS) with a feature mapping φ(·) : Rm → F where F is a feature space. Define φ(X) = Rm Rn×n φ(·) : → F where F is a feature space. [6, 7]. KNMF [φ(x1),...,φ(xn)]. Then the kernel matrix K ∈ + is given by K = φ⊤(X)φ(X). A direct application of NMF to the feature Algorithm 1 Algorithm outline for MKNMF. matrix φ(X) yields Input: Kj for j = 1,...,M and r Output: β ∈ RM , W ∈ Rn×r, V ∈ Rn×r ⊤ φ(X) ≈ UV , (1) 1: Initialize W and V 2: repeat where U = [u1,..., ur] and V = [v1,..., vr] are nonnegative 3: Apply linear programming to determine β which minimizes basis matrix and encoding matrix, respectively. Without explicit (5), given W and V knowledge of φ(·), parameters U and V cannot be estimated in the M 4: Given β, construct K = j=1 βj Kj . Then update W and factorization (1). V using A simple trick to develop KNMF (or MKNMF) is to impose the P constraint that basis vectors u lie within the column space of φ(X), KV j W ← W ⊙ , i.e., uj = φ(x1)W1j + · · · + φ(xn)Wnj , leading to KWV ⊤V KW U = φ(X)W , (2) V ← V ⊙ . VW ⊤KW Rn×r where Wij is the (i, j)-element of the matrix W ∈ + . We also 5: until convergence restrict ourselves to convex combinations of the columns of φ(X), leading that each column of W satisfies the sum-to-one constraint. Incorporating the constraint (2) into the least squares objective ⊤ function for the factorization (1) yields subject to 1 β = 1 and β ≥ 0. The matrices T j are defined ⊤ ⊤ ⊤ by T j = Kj (I − WV )(I − WV ) for j = 1,...,M and 1 RM J = φ(X) − UV ⊤2 t ∈ is the vector whose jth element is given by [t]j = tr {T j } . 2 The optimization of (5) with respect to β is the standard linear pro- 1 ⊤ 2 gramming, which is solved using linprog( ) in our MATLAB = φ(X) − φ(X)WV 2 implementation. Please read this, which you cannot find in the fi- 1 nal paper archived in IEEE Xplore. Our original idea was to find an = tr K(I − WV ⊤)(I − WV ⊤)⊤ , (3) 2 optimal combination of kernels, determining β by LP. However, it n o turns out that the optimization formulation (5) subject to constraints ⊤ ⊤ where tr{·} is the trace operator and K = φ (X)φ(X) is the ker- 1 β = 1 and β ≥ 0, yields a simple solution where βi = 1 if nel matrix that is assumed to be nonnegative matrix in this paper. In i = arg maxj tj and otherwise 0. In other words, β is a unit vector fact, Eq. (3) is the objective function for KNMF. Now we consider a and the location of one depends on which entry of t has maximal M convex combination of kernel matrices Kj , i.e., K = j=1 βj Kj , value. Therefore, instead of running LP to determine β, it is suffi- where βj ≥ 0 for j = 1,...,M and β1 + · · · βM = 1. Substitute cient to search which entry of t has maximal value. The problem P this relation into (3) to obtain the objective function for MKNMF becomes kernel selection. In fact, this was first pointed out to me by Mr. Bin Shen in Purdue University. Although it becomes kernel se- M 1 ⊤ ⊤ ⊤ lection, it is still valuable because this simple process finds the best J = tr βj Kj (I − WV )(I − WV ) , (4) 2 kernel which minimize the cost (5). More careful study is required (j=1 ) X to analyze this kernel is really the best one for classification, which Rn×r Rn×r 1⊤ 1 RM is not clear yet. subject to W ∈ + , V ∈ + , β = 1 ( ∈ is the ⊤ vector of all ones ), and β = [β1,...,βM ] ≥ 0. 2.2.2. Optimization of W and V 2.2. Algorithm M Given β, K = j=1 βj Kj is a fixed kernel matrix. Then the The optimization of the objective function (4) involves two learning optimization of (4) with respect to W and V follows KNMF, which problems: (1) learning β to determine the best convex combination is derived below. P of kernels; (2) estimating two factor matrices W and V . We solve Suppose that the gradient of an error function has a decomposi- this optimization by an alternating minimization. We first update β tion that is of the form with W and V fixed, which turns out to be a linear programming.

Load more