Expert Systems with Applications 42 (2015) 1278–1286


Feature selection and multi-kernel learning for adaptive graph regularized nonnegative matrix factorization

Jim Jing-Yan Wang (a), Jianhua Z. Huang (b), Yijun Sun (c), Xin Gao (a,*)

(a) Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
(b) Department of Statistics, Texas A&M University, TX 77843-3143, USA
(c) University at Buffalo, The State University of New York, Buffalo, NY 14203, USA

(*) Corresponding author. Tel.: +966 12 808 0323. E-mail address: [email protected] (X. Gao).

Article history: Available online 20 September 2014

Keywords: Data representation; Nonnegative matrix factorization; Graph regularization; Feature selection; Multi-kernel learning

Abstract

Nonnegative matrix factorization (NMF), a popular part-based representation technique, does not capture the intrinsic local geometric structure of the data space. Graph regularized NMF (GNMF) was recently proposed to avoid this limitation by regularizing NMF with a nearest neighbor graph constructed from the input data set. However, GNMF has two main bottlenecks. First, using the original feature space directly to construct the graph is not necessarily optimal because of the noisy and irrelevant features and nonlinear distributions of data samples. Second, one possible way to handle the nonlinear distribution of data samples is by kernel embedding. However, it is often difficult to choose the most suitable kernel. To solve these bottlenecks, we propose two novel graph-regularized NMF methods, AGNMF_FS and AGNMF_MK, by introducing feature selection and multiple-kernel learning to the graph regularized NMF, respectively. Instead of using a fixed graph as in GNMF, the two proposed methods learn the nearest neighbor graph that is adaptive to the selected features and learned multiple kernels, respectively. For each method, we propose a unified objective function to conduct feature selection/multi-kernel learning, NMF and adaptive graph regularization simultaneously. We further develop two iterative algorithms to solve the two optimization problems. Experimental results on two challenging pattern classification tasks demonstrate that the proposed methods significantly outperform state-of-the-art data representation methods.

© 2014 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/3.0/). http://dx.doi.org/10.1016/j.eswa.2014.09.008

1. Introduction

Nonnegative matrix factorization (NMF) (Lee & Seung, 2000; Sun, Wu, Wu, Guo, & Lu, 2012; Wang, Almasri, & Gao, 2012b; Wang, Bensmail, & Gao, 2013a; Wang & Gao, 2013; Wang, Wang, & Gao, 2013c) decomposes a nonnegative data matrix into a product of two low-rank nonnegative matrices: one is regarded as the basis matrix, while the other is the coding matrix, which can be used as a reduced representation of the data samples in the data matrix (Kim, Chen, Kim, Pan, & Park, 2011a). This method has become popular in recent years for data representation in various areas, such as bioinformatics (Zheng, Ng, Zhang, Shiu, & Wang, 2011) and computer vision (Cai et al., 2013). Recently, Cai, He, Han, and Huang (2011) argued that NMF fails to exploit the intrinsic local geometric structure of the data space, and improved the traditional NMF to graph regularized nonnegative matrix factorization (GNMF). The basic idea is that the data samples are drawn from a low-dimensional manifold with a local geometric structure (Orsenigo & Vercellis, 2012; Wang, Bensmail, & Gao, 2012a, 2014a; Wang, Bensmail, Yao, & Gao, 2013b; Wang, Sun, & Gao, 2014b), so nearby data samples in the original data space should also have similar NMF representations. In GNMF, the geometric structure of the data space is encoded by constructing a nearest neighbor graph, and the matrix factorization is then sought by adding a graph regularization term to the original NMF objective function. The key component of GNMF is the graph. In the original GNMF algorithm, the graph is constructed from the original input feature space: the nearest neighbors of a data sample are found by comparing the Euclidean distances (Lee, Rajkumar, Lo, Wan, & Isa, 2013a; Merigó & Casanovas, 2011) between pairs of data points, and the edge weights are likewise estimated in the Euclidean space, under the assumption that the original features provide a proper representation of the local structure of the data space. However, in many pattern recognition problems, using the original feature space directly is not appropriate, because of the noisy and irrelevant features and the nonlinear distribution of the samples.

To handle the noisy and irrelevant features, one may apply feature selection (Fakhraei, Soltanian-Zadeh, & Fotouhi, 2014; Iquebal, Pal, Ceglarek, & Tiwari, 2014; Lin, Chen, & Wu, 2014; Li, Wu, Li, & Ding, 2013b, 2013a) to assign different weights to different features, so that the data samples can be represented better than with the original features. A broadly used feature selection method was proposed by Sun et al. (2012); it determines feature weights from a statistical point of view to automatically discover the intrinsic features, provides a powerful and efficient solution for feature selection in NMF, and has been widely recognized by researchers in this field. To handle the nonlinear distribution of the data samples, one could map the input data into a nonlinear feature space by kernel embedding (Cui & Soh, 2010; Yeh, Su, & Lee, 2013). However, the most suitable kernel types and parameters for a particular task are often unknown, and selecting the optimal kernel by exhaustive search over a pre-defined pool of kernels is usually time-consuming and sometimes causes over-fitting. Multi-kernel learning (Chen, Li, Wei, Xu, & Shi, 2011; Yeh, Huang, & Lee, 2011), which seeks the optimal kernel as a weighted, linear combination of pre-defined candidate kernels, has been introduced to handle the kernel selection problem. An, Yun, and Choi (2011) presented the multi-kernel NMF (NMF_MK), which jointly learns the best convex combination of multiple kernel matrices and the NMF parameters. However, graph regularization was not taken into consideration in their framework.

In this paper, we incorporate feature selection and multi-kernel learning into graph regularized NMF to obtain novel and enhanced data representation methods. In this way, we handle the problems of noisy and irrelevant features, nonlinearly distributed data samples, graph construction, and data matrix factorization simultaneously. Compared to the methods reported in the current literature, which use a fixed graph for learning the NMF parameters, our methods can adapt the graph to the learned feature or kernel weights, which improves the NMF by providing it with a more reliable graph.

Here, we propose two novel methods, AGNMF_FS and AGNMF_MK, that incorporate feature selection and multiple-kernel learning into graph-regularized NMF, respectively. Feature selection or multi-kernel learning provides a new data space for the graph construction of GNMF, and at the same time, GNMF directs the feature selection or multi-kernel learning. Both AGNMF_FS and AGNMF_MK are formulated as constrained optimization problems, each of which has a unified objective function to optimize feature selection/multi-kernel learning and graph-regularized NMF simultaneously. Experimental results demonstrate that the two proposed methods significantly outperform state-of-the-art data representation methods.

The rest of the paper is organized as follows: we briefly review GNMF in Section 2. We then propose the two novel algorithms, AGNMF_FS and AGNMF_MK, in Section 3. The proposed methods are compared with other NMF learning methods on two challenging data sets for classification tasks in Section 4. Finally, the paper is concluded in Section 5 with some future works.

2. Overview of graph regularized NMF

In this section, we briefly introduce graph regularized NMF as background knowledge for this paper.

2.1. Nonnegative matrix factorization

Given a training set with N nonnegative data samples X = {x_1, ..., x_N} ⊂ R_+^D, represented as a nonnegative data matrix X = [x_1, ..., x_N] ∈ R_+^{D×N}, where x_n ∈ R_+^D is the D-dimensional nonnegative feature vector of the nth sample, NMF aims to find two nonnegative matrices H and W whose product approximates the original matrix X well:

    X ≈ HW,    (1)

where H ∈ R_+^{D×R} and W ∈ R_+^{R×N}. Accordingly, each sample x_n is approximated by a linear combination of the columns of H, weighted by the components of the nth column of W:

    x_n ≈ Σ_{r=1}^{R} h_r w_{rn}.    (2)

Therefore, H can be regarded as a collection of basis vectors, while w_n, the nth column of W, can be regarded as the coding vector, or a new representation, of the nth data sample. The most commonly used cost function to solve for H and W is based on the squared Euclidean distance (SED) between the two matrices:

    O^NMF(H, W) = ||X − HW||² = Tr(XX^⊤) − 2 Tr(XW^⊤H^⊤) + Tr(HWW^⊤H^⊤),    (3)

where Tr(·) denotes the trace of a matrix.
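To make the factorization concrete, the following is a minimal sketch of plain NMF under the SED cost (3), using the classical multiplicative update rules of Lee and Seung (2000); the rank R, iteration count, and random initialization scheme are illustrative choices, not settings from this paper:

```python
import numpy as np

def nmf(X, R, n_iter=200, eps=1e-10, seed=0):
    """Minimal NMF via Lee-Seung multiplicative updates for ||X - HW||^2."""
    rng = np.random.default_rng(seed)
    D, N = X.shape
    H = rng.random((D, R))  # basis matrix
    W = rng.random((R, N))  # coding matrix
    for _ in range(n_iter):
        # Multiplicative updates keep H and W nonnegative.
        H *= (X @ W.T) / (H @ W @ W.T + eps)
        W *= (H.T @ X) / (H.T @ H @ W + eps)
    return H, W

X = np.abs(np.random.default_rng(1).normal(size=(20, 50)))
H, W = nmf(X, R=5)
print(np.linalg.norm(X - H @ W))  # reconstruction error shrinks over iterations
```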

2.2. Graph regularized NMF

Cai et al. (2011) introduced the GNMF algorithm by imposing the local invariance assumption (LIA) on NMF: if two data samples x_n and x_m are close in the intrinsic geometric space of the data distribution, then w_n and w_m, the coding vectors of these two samples with respect to the new basis, should also be close to each other, and vice versa. They modeled the local geometric structure by a K-nearest neighbor graph G constructed from the data set X. For each data sample x_n ∈ X, the set of its K nearest neighbors, N_n, in X is determined by the SED metric (Lee et al., 2013a):

    d²(x_n, x_m) = ||x_n − x_m||² = Σ_{d=1}^{D} (x_{dn} − x_{dm})² = x_n^⊤x_n + x_m^⊤x_m − 2 x_n^⊤x_m.    (4)

Each data sample in X is a node of the graph, and each node x_n is connected to its K nearest neighbors N_n. We also define a weight matrix A ∈ R^{N×N} on the graph, with A_nm equal to the weight of the connection between nodes x_n and x_m. There are many choices for the weight matrix A; two of the most commonly used options are as follows:

Gaussian kernel weighting:

    A_nm = exp(−||x_n − x_m||² / σ²), if x_m ∈ N_n; 0, otherwise.    (5)

Dot-product weighting:

    A_nm = x_n^⊤ x_m, if x_m ∈ N_n; 0, otherwise.    (6)

With the weight matrix A, we can use the following graph regularization term to measure the smoothness of the low-dimensional coding vector representations in W:

    O^G(W, A) = (1/2) Σ_{n,m=1}^{N} ||w_n − w_m||² A_nm = Tr(WDW^⊤) − Tr(WAW^⊤) = Tr(WLW^⊤),    (7)

where D is a diagonal matrix whose entries are the column sums of A, i.e., D_nn = Σ_{m=1}^{N} A_nm, and L = D − A is the graph Laplacian matrix. By minimizing O^G(W, A) with respect to W, we expect that if two data points x_n and x_m are close, i.e., A_nm is large, then w_n and w_m are also close to each other.
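The following sketch illustrates (4), (5) and (7): it builds a Gaussian-weighted K-nearest-neighbor graph and evaluates the regularizer Tr(WLW^⊤). The symmetrization step and the values of K and σ are our own assumptions for the illustration:

```python
import numpy as np

def knn_graph(X, K=5, sigma=1.0):
    """Gaussian-weighted K-NN graph (Eq. 5) and its Laplacian L = D - A.
    X is D x N: one sample per column."""
    N = X.shape[1]
    # Pairwise squared Euclidean distances (Eq. 4).
    sq = (X**2).sum(axis=0)
    d2 = sq[:, None] + sq[None, :] - 2 * X.T @ X
    A = np.zeros((N, N))
    for n in range(N):
        nbrs = np.argsort(d2[n])[1:K + 1]   # skip the sample itself
        A[n, nbrs] = np.exp(-d2[n, nbrs] / sigma**2)
    A = np.maximum(A, A.T)                  # symmetrize (an assumed convention)
    L = np.diag(A.sum(axis=1)) - A          # degree matrix minus weight matrix
    return A, L

X = np.abs(np.random.default_rng(0).normal(size=(10, 30)))
A, L = knn_graph(X)
W = np.abs(np.random.default_rng(1).normal(size=(4, 30)))
print(np.trace(W @ L @ W.T))  # graph regularizer O^G(W, A) of Eq. (7)
```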

Combining this geometry-based regularizer, O^G(W, A), with the original NMF objective function, O^NMF(H, W), leads to the loss function of GNMF (Cai et al., 2011):

    O^GNMF(H, W, A) = O^NMF(H, W) + α O^G(W, A) = Tr(XX^⊤) − 2 Tr(XW^⊤H^⊤) + Tr(HWW^⊤H^⊤) + α Tr(WLW^⊤),    (8)

in which α is a tradeoff parameter. Thus the GNMF problem turns into the constrained minimization problem

    min_{H,W} O^GNMF(H, W, A), s.t. H ≥ 0, W ≥ 0,    (9)

where H and W can be solved in an iterative manner by optimizing and updating them alternately (Cai et al., 2011).

3. Adaptive graph regularized NMF with feature selection and multi-kernel learning

In this section, we propose two enhanced data representation methods based on GNMF by encoding feature selection and multi-kernel learning, respectively.

3.1. Adaptive graph regularized NMF with feature selection

3.1.1. Feature selection for NMF

Given an input sample x represented as a vector of D nonnegative features, x = [x_1, ..., x_D]^⊤ ∈ R_+^D, feature selection tries to scale each feature to obtain a weighted feature space, parameterized by a D-dimensional nonnegative feature weight vector λ = [λ_1, ..., λ_D]^⊤ ∈ R_+^D, where λ_d is the scaling factor for the dth feature (Sun, Todorovic, & Goodison, 2010b). We restrict its scale by Σ_{d=1}^{D} λ_d = 1 and λ_d ≥ 0. Thus the scaled feature vector of x is represented as x̃ = [λ_1 x_1, ..., λ_D x_D]^⊤ = diag(λ) x, where diag(λ) is a D × D diagonal matrix with the entries of λ along the main diagonal. The original data matrix and basis matrix for NMF can be represented in the scaled space as

    X̃ = diag(λ) X and H̃ = diag(λ) H, s.t. Σ_{d=1}^{D} λ_d = 1, λ_d ≥ 0, d = 1, ..., D.    (10)

By replacing the original features in X and H of NMF with the weighted features X̃ and H̃ defined in (10), we have the augmented objective function for NMF with feature selection in an enlarged parameter space:

    O^NMF_FS(H, W, λ) = ||diag(λ)(X − HW)||² = Tr[diag(λ)² XX^⊤] − 2 Tr[diag(λ)² XW^⊤H^⊤] + Tr[diag(λ)² HWW^⊤H^⊤].    (11)

Here H, W and λ are all variables to be solved for so that the above objective function is minimized.

3.1.2. Graph adaptive to selected features

After the new feature space defined by the feature weight vector λ is defined, the nearest neighbor graph should also be updated to be adaptive to the selected features. First, the K nearest neighbors, N_n, of the nth data point should be re-found according to the λ-weighted SED, i.e.,

    d_λ²(x_n, x_m) = ||x_n − x_m||_λ² = Σ_{d=1}^{D} λ_d² (x_{dn} − x_{dm})².    (12)

The set of K nearest neighbors re-found by the λ-weighted distance is denoted N_n^λ, and the graph adaptive to λ is denoted G^λ. The corresponding weight matrix A^λ of G^λ should also be updated. Here we use the Gaussian kernel weighting for the adaptive graph with feature selection, which is updated as A^λ_nm = exp(−||x_n − x_m||_λ² / σ²) if x_m ∈ N_n^λ, and 0 otherwise.

With the adaptive graph G^λ, we can re-regularize the NMF in the selected feature space. Similar to GNMF, we propose the adaptive graph regularization term

    O^AG(W, A^λ) = (1/2) Σ_{n,m=1}^{N} ||w_n − w_m||² A^λ_nm = Tr(W L^λ W^⊤),    (13)

where L^λ = D^λ − A^λ is the corresponding graph Laplacian. By minimizing O^AG(W, A^λ), we expect that if two data points x̃_n and x̃_m are close with respect to the new features selected by λ, the representations w_n and w_m with respect to the selected features should also be close to each other.
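Because the λ-weighted SED (12) equals the plain SED between samples rescaled by diag(λ), the adaptive graph can be built by scaling the data and reusing the knn_graph helper sketched in Section 2.2; a minimal sketch, with illustrative K and σ:

```python
import numpy as np

def adaptive_graph(X, lam, K=5, sigma=1.0):
    """K-NN graph under the lambda-weighted SED of Eq. (12).
    Scaling the rows of X by lam reduces the weighted distance to a plain one;
    knn_graph is the helper sketched in Section 2.2."""
    X_scaled = np.diag(lam) @ X  # diag(lambda) X, as in Eq. (10)
    return knn_graph(X_scaled, K=K, sigma=sigma)

D, N = 10, 30
X = np.abs(np.random.default_rng(0).normal(size=(D, N)))
lam = np.full(D, 1.0 / D)        # uniform initial feature weights
A_lam, L_lam = adaptive_graph(X, lam)
```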

3.1.3. Adaptive graph regularized NMF algorithm with feature selection

To perform feature selection together with adaptive graph regularized NMF, we first propose a unified objective function for adaptive graph regularized NMF and feature selection for data representation, and then develop an alternately updating algorithm to estimate the basis matrix H, the coding coefficient matrix W and the feature weight vector λ.

Objective function: Combining the NMF objective function with feature selection defined in (11) with the adaptive graph-based regularizer defined in (13) leads to the objective function of our AGNMF with feature selection, the AGNMF_FS algorithm:

    O^AGNMF_FS(H, W, λ) = O^NMF_FS(H, W, λ) + α O^AG(W, A^λ) = Tr[diag(λ)² XX^⊤] − 2 Tr[diag(λ)² XW^⊤H^⊤] + Tr[diag(λ)² HWW^⊤H^⊤] + α Tr(W L^λ W^⊤).    (14)

The optimization problem (8) of GNMF can now be extended to accommodate the feature selection and adaptive graph:

    min_{H,W,λ} O^AGNMF_FS(H, W, λ), s.t. H ≥ 0, W ≥ 0, Σ_{d=1}^{D} λ_d = 1, λ_d ≥ 0, d = 1, ..., D.    (15)

Optimization: Since direct optimization of (15) is difficult, we instead adopt an iterative, two-step strategy to alternately optimize (H, W) and λ. At each iteration, one of (H, W) and λ is optimized while the other is fixed, and then the roles of (H, W) and λ are switched. Iterations are repeated until convergence or until a maximum number of iterations is reached.

– On optimizing (H, W): By fixing λ and updating the adaptive graph G^λ with its corresponding Laplacian matrix L^λ according to λ, the optimization problem (15) is reduced to

    min_{H,W} Tr[diag(λ)² XX^⊤] − 2 Tr[diag(λ)² XW^⊤H^⊤] + Tr[diag(λ)² HWW^⊤H^⊤] + α Tr(W L^λ W^⊤), s.t. H ≥ 0, W ≥ 0.    (16)

The Lagrangian of the above optimization problem is

    L = Tr[diag(λ)² XX^⊤] − 2 Tr[diag(λ)² XW^⊤H^⊤] + Tr[diag(λ)² HWW^⊤H^⊤] + α Tr(W L^λ W^⊤) + Tr(Φ H^⊤) + Tr(Ψ W^⊤),    (17)

where Φ = [φ_dr] and Ψ = [ψ_rn] are the Lagrange multiplier matrices for the constraints H ≥ 0 and W ≥ 0, respectively. Setting the partial derivatives of L with respect to H and W to zero gives

    ∂L/∂H = −2 diag(λ)² XW^⊤ + 2 diag(λ)² HWW^⊤ + Φ = 0,
    ∂L/∂W = −2 H^⊤ diag(λ)² X + 2 H^⊤ diag(λ)² HW + 2α W L^λ + Ψ = 0.    (18)

Using the KKT conditions, i.e., φ_dr h_dr = 0 and ψ_rn w_rn = 0, we get the following equations for h_dr and w_rn:

    −[diag(λ)² XW^⊤]_dr h_dr + [diag(λ)² HWW^⊤]_dr h_dr = 0,
    −[H^⊤ diag(λ)² X]_rn w_rn + [H^⊤ diag(λ)² HW]_rn w_rn + α (W L^λ)_rn w_rn = 0.    (19)

These equations lead to the following updating rules:

    h_dr ← h_dr [diag(λ)² XW^⊤]_dr / [diag(λ)² HWW^⊤]_dr,
    w_rn ← w_rn [H^⊤ diag(λ)² X + α W A^λ]_rn / [H^⊤ diag(λ)² HW + α W D^λ]_rn.    (20)

– On optimizing λ: By fixing H and W and removing the terms irrelevant to λ, the optimization problem (15) becomes

    min_λ Tr[diag(λ)² (XX^⊤ − 2XW^⊤H^⊤ + HWW^⊤H^⊤)] = Tr[diag(λ)² (YY^⊤)], s.t. Σ_{d=1}^{D} λ_d = 1, λ_d ≥ 0, d = 1, ..., D,    (21)

where Y = X − HW. Here, the value of λ_d indicates the weight of the dth feature. We rewrite the objective function of (21) as follows:

    min_λ Tr[diag(λ)² (YY^⊤)] = Σ_{d=1}^{D} λ_d² Σ_{n=1}^{N} y_dn² = Σ_{d=1}^{D} λ_d² e_d, s.t. Σ_{d=1}^{D} λ_d = 1, λ_d ≥ 0, d = 1, ..., D,    (22)

where y_dn is the (d, n)th element of the matrix Y and e_d = Σ_{n=1}^{N} y_dn². This problem can be solved using Theorem 1.

Theorem 1. The closed-form solution of the optimization problem in (21) is given by

    λ_d = (1/e_d) / (Σ_{d'=1}^{D} 1/e_{d'}), d = 1, ..., D.    (23)

Proof. Given the constraint Σ_{d=1}^{D} λ_d = 1 and the Cauchy–Schwarz inequality, we have

    1 = (Σ_{d=1}^{D} λ_d)² = (Σ_{d=1}^{D} λ_d √e_d · (1/√e_d))² ≤ (Σ_{d=1}^{D} λ_d² e_d)(Σ_{d=1}^{D} 1/e_d).    (24)

Thus we have the inequality

    Σ_{d=1}^{D} λ_d² e_d ≥ 1 / (Σ_{d=1}^{D} 1/e_d),    (25)

and equality holds if λ_d √e_d = C (1/√e_d), i.e.,

    λ_d = C (1/e_d), d = 1, ..., D,    (26)

for some constant C. Moreover, since Σ_{d=1}^{D} λ_d = 1, we have C Σ_{d=1}^{D} (1/e_d) = 1, therefore C = 1 / Σ_{d=1}^{D} (1/e_d), and the minimizer of (21) is (23). □

Algorithm: The proposed iterative AGNMF algorithm with feature selection (named AGNMF_FS) is summarized in Algorithm 1.

Algorithm 1. AGNMF_FS Algorithm

  Input: Original data matrix X;
  Input: Initial factorization matrices H^0 and W^0;
  Input: Tolerance of the stopping criterion ξ;
  Input: Maximum number of iterations T;
  Initialize the feature weights as λ_d^0 = 1/D, d = 1, ..., D;
  Initialize t = 1;
  repeat
    Update the graph G^{λ^{t−1}} and its corresponding Laplacian matrix L^{λ^{t−1}} according to λ^{t−1}, as introduced in Section 3.1.2;
    Update the factorization matrices H^t and W^t as in (20);
    Update the feature weights λ^t as in (23);
    t = t + 1;
  until O^AGNMF_FS(H^t, W^t, λ^t) ≤ ξ or t ≥ T
  Output: The factorization matrices H = H^{t−1}, W = W^{t−1} and the feature weight vector λ = λ^{t−1}.
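A compact sketch of Algorithm 1, combining the multiplicative updates (20) with the closed-form weight update (23); it reuses the adaptive_graph helper sketched above, omits the tolerance-based stopping test for brevity, and α, K, σ and the iteration budget are illustrative assumptions:

```python
import numpy as np

def agnmf_fs(X, R, alpha=0.1, K=5, sigma=1.0, T=100, seed=0, eps=1e-10):
    """Sketch of Algorithm 1 (AGNMF_FS): alternate (H, W) and lambda updates."""
    rng = np.random.default_rng(seed)
    D, N = X.shape
    H = rng.random((D, R))
    W = rng.random((R, N))
    lam = np.full(D, 1.0 / D)          # uniform initialization, as in Algorithm 1
    for t in range(T):
        A, L = adaptive_graph(X, lam, K=K, sigma=sigma)  # Section 3.1.2 helper
        Dg = np.diag(A.sum(axis=1))    # degree matrix D^lambda
        S = np.diag(lam**2)            # diag(lambda)^2
        # Multiplicative updates of Eq. (20).
        H *= (S @ X @ W.T) / (S @ H @ W @ W.T + eps)
        W *= (H.T @ S @ X + alpha * W @ A) / (H.T @ S @ H @ W + alpha * W @ Dg + eps)
        # Closed-form feature weight update of Eq. (23).
        Y = X - H @ W
        e = (Y**2).sum(axis=1) + eps   # e_d = sum_n y_dn^2
        lam = (1.0 / e) / (1.0 / e).sum()
    return H, W, lam
```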

3.1.4. Representing a test sample with AGNMF_FS

After learning (H, W) and λ via AGNMF_FS on the training data matrix X, we can use the basis matrix H and the feature weight vector λ to infer the coding vector of a new data point. When a new test data sample x ∈ R_+^D comes in, we first connect it to its K nearest neighbors N^λ in the training set X, found using the λ-weighted SED (12), and then calculate the weight vector of x as a^λ = [a^λ_1, ..., a^λ_N]^⊤ ∈ R^N, where a^λ_n = exp(−||x − x_n||_λ² / σ²) if x_n ∈ N^λ, and 0 otherwise. Assuming that the coding of the training samples is not affected by the test sample, we only need to optimize the following objective function with respect to the coding vector w ∈ R^R of the test sample:

    min_w O(w) = ||diag(λ)(x − Hw)||² + (α/2) Σ_{n=1}^{N} a^λ_n ||w − w_n||²
    = Tr[diag(λ)² xx^⊤] − 2 Tr[diag(λ)² xw^⊤H^⊤] + Tr[diag(λ)² Hww^⊤H^⊤] + (α/2) Σ_{n=1}^{N} a^λ_n Tr(ww^⊤) − α Tr(w Σ_{n=1}^{N} a^λ_n w_n^⊤) + (α/2) Σ_{n=1}^{N} a^λ_n Tr(w_n w_n^⊤),
    s.t. w ≥ 0.    (27)

By setting the partial derivative of the Lagrange function of (27) with respect to w to zero, and using the KKT conditions, we have the following updating rule for w:

    w_r ← w_r [H^⊤ diag(λ)² x + (α/2) Σ_{n=1}^{N} a^λ_n w_n]_r / [H^⊤ diag(λ)² Hw + (α/2) (Σ_{n=1}^{N} a^λ_n) w]_r.    (28)
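A hedged sketch of this out-of-sample step: it finds the λ-weighted neighbors, forms the Gaussian weights a^λ, and iterates rule (28); the initialization and iteration count are arbitrary choices, not prescriptions from the paper:

```python
import numpy as np

def code_test_sample(x, X_train, W_train, H, lam, K=5, sigma=1.0,
                     alpha=0.1, n_iter=100, eps=1e-10):
    """Infer the coding vector w of a test sample x via update rule (28)."""
    # lambda-weighted squared distances to all training samples (Eq. 12).
    d2 = ((np.diag(lam) @ (X_train - x[:, None]))**2).sum(axis=0)
    a = np.zeros(X_train.shape[1])
    nbrs = np.argsort(d2)[:K]
    a[nbrs] = np.exp(-d2[nbrs] / sigma**2)       # Gaussian neighbor weights a^lambda
    S = np.diag(lam**2)
    w = np.full(H.shape[1], 1.0 / H.shape[1])    # nonnegative initialization
    graph_num = 0.5 * alpha * (W_train @ a)      # (alpha/2) sum_n a_n w_n
    for _ in range(n_iter):
        num = H.T @ S @ x + graph_num
        den = H.T @ S @ H @ w + 0.5 * alpha * a.sum() * w + eps
        w *= num / den
    return w
```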

By repeating this updating rule, we obtain the optimal coding vector w for the test sample.

3.2. Adaptive graph regularized NMF with multiple kernel learning

3.2.1. Multiple kernel learning for NMF

Consider a nonlinear mapping x_n → φ(x_n), or X → φ(X) = [φ(x_1), ..., φ(x_N)]; the kernel matrix K ∈ R^{N×N} is given by K = φ(X)^⊤ φ(X). A direct application of NMF to the feature matrix φ(X) yields

    φ(X) ≈ HW.    (29)

For the sake of convenience, we impose the constraint that the vectors defining H lie within the column space of φ(X): h_r = Σ_{n=1}^{N} f_nr φ(x_n), or

    H = φ(X) F,    (30)

where f_nr is the (n, r)th element of the matrix F ∈ R_+^{N×R}. Substituting (30) into (3), we have the objective function for the kernelized version of NMF:

    O^NMF_K(F, W) = ||φ(X) − φ(X)FW||² = Tr[φ(X)(I − FW)(I − FW)^⊤ φ(X)^⊤] = Tr[φ(X)^⊤ φ(X)(I − FW)(I − FW)^⊤] = Tr[K (I − FW)(I − FW)^⊤].    (31)

Suppose there are altogether L different kernel functions {K_l}_{l=1}^{L} available for the NMF task at hand; accordingly, there are L different but associated nonlinear feature spaces. In general, we do not know which kernel space should be used. An intuitive way is to use them all, by concatenating all feature spaces into an augmented Hilbert space and associating each feature space with a relevance weight τ_l, where τ_l ≥ 0 and Σ_{l=1}^{L} τ_l = 1. We denote the kernel weights as a vector τ = [τ_1, ..., τ_L]^⊤. Performing NMF in such a feature space is equivalent to employing a combined kernel function for the NMF:

    K^τ = Σ_{l=1}^{L} τ_l K_l, s.t. τ_l ≥ 0, Σ_{l=1}^{L} τ_l = 1.    (32)

We substitute this relation into (31) to obtain the objective function for multiple kernel-based NMF (NMF_MK):

    O^NMF_MK(F, W, τ) = Tr[Σ_{l=1}^{L} τ_l K_l (I − FW)(I − FW)^⊤].    (33)

3.2.2. Graph adaptation to multiple kernel learning

To update the graph G with respect to the multiple kernel space, given a τ, the K nearest neighbors N_n^τ for the GNMF algorithm are re-found by the τ-weighted SED in the multiple kernel space, i.e.,

    d_τ²(x_n, x_m) = ||φ(x_n) − φ(x_m)||_τ² = K^τ(x_n, x_n) + K^τ(x_m, x_m) − 2 K^τ(x_n, x_m) = Σ_{l=1}^{L} τ_l [K_l(x_n, x_n) + K_l(x_m, x_m) − 2 K_l(x_n, x_m)].    (34)

The corresponding K-nearest neighbor graph adaptive to τ is denoted G^τ. Here we use dot-product weighting for the weight matrix A^τ of the adaptive graph with multiple kernel learning, i.e., A^τ_nm = φ(x_n)^⊤ φ(x_m) = K^τ(x_n, x_m) = Σ_{l=1}^{L} τ_l K_l(x_n, x_m) if x_m ∈ N_n^τ, and 0 otherwise.

With the graph G^τ adaptive to the multiple kernel space, we then re-regularize NMF_MK in the multiple kernel space. We propose the adaptive graph regularization term

    O^AG(W, A^τ) = (1/2) Σ_{n,m=1}^{N} ||w_n − w_m||² A^τ_nm = Tr(W L^τ W^⊤),    (35)

where L^τ = D^τ − A^τ is the corresponding graph Laplacian.

3.2.3. AGNMF algorithm with multiple kernel learning

To perform multi-kernel learning together with adaptive graph regularized NMF, we first propose a unified objective function, and then develop an alternately updating algorithm to solve it.

Objective function: Combining the NMF objective function with multiple kernels defined in (33) with the adaptive graph-based regularizer defined in (35) leads to the optimization problem of our AGNMF with multi-kernel learning, AGNMF_MK:

    min_{F,W,τ} O^AGNMF_MK(F, W, τ) = O^NMF_MK(F, W, τ) + α O^AG(W, A^τ) + β||τ||²
    = Tr[Σ_{l=1}^{L} τ_l K_l (I − FW)(I − FW)^⊤] + α Tr(W L^τ W^⊤) + β||τ||²
    = Tr[K^τ (I − FW)(I − FW)^⊤] + α Tr(W L^τ W^⊤) + β||τ||²,
    s.t. F ≥ 0, W ≥ 0, τ ≥ 0, Σ_{l=1}^{L} τ_l = 1,    (36)

where the regularization term β||τ||² is introduced to prevent the parameter τ from overfitting to one kernel.

Optimization: Similar to AGNMF_FS, we adopt an iterative strategy to alternately optimize (F, W) and τ.

– On optimizing (F, W): By fixing τ and updating the adaptive graph G^τ and kernel matrix K^τ, the optimization problem (36) is reduced to

    min_{F,W} Tr[K^τ (I − FW)(I − FW)^⊤] + α Tr(W L^τ W^⊤), s.t. F ≥ 0, W ≥ 0.    (37)

Similar to the optimization of H and W in AGNMF_FS, we have the following rules to update F and W:

    f_nr ← f_nr (K^τ W^⊤)_nr / (K^τ F W W^⊤)_nr,
    w_rn ← w_rn (F^⊤ K^τ + α W A^τ)_rn / (F^⊤ K^τ F W + α W D^τ)_rn.    (38)

– On optimizing τ: By fixing F and W and removing the irrelevant terms, the optimization problem (36) becomes

    min_τ Tr[Σ_{l=1}^{L} τ_l K_l (I − FW)(I − FW)^⊤] + β||τ||² = Tr[Σ_{l=1}^{L} τ_l K_l ZZ^⊤] + β||τ||² = Σ_{l=1}^{L} τ_l g_l + β Σ_{l=1}^{L} τ_l², s.t. τ ≥ 0, Σ_{l=1}^{L} τ_l = 1,    (39)

where Z = I − FW and g_l = Tr[K_l ZZ^⊤]. The optimization of (39) with respect to the kernel weights τ can be solved as a standard quadratic programming (QP) problem.

Algorithm: The iterative AGNMF algorithm with multiple kernel learning (named AGNMF_MK) is summarized in Algorithm 2.
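Since the quadratic term in (39) is β||τ||², the QP can equivalently be solved by projecting −g/(2β) onto the probability simplex; this reformulation and the sort-based projection routine below are our own illustration, not a procedure spelled out in the paper:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto {x : x >= 0, sum(x) = 1} (sort-based)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u - css / (np.arange(len(v)) + 1) > 0)[0][-1]
    theta = css[rho] / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def update_tau(kernels, Z, beta):
    """Solve the QP (39): min_tau sum_l tau_l g_l + beta ||tau||^2 on the simplex."""
    g = np.array([np.trace(K @ Z @ Z.T) for K in kernels])  # g_l = Tr[K_l Z Z^T]
    # Objective equals beta * ||tau + g/(2 beta)||^2 up to a constant.
    return project_simplex(-g / (2.0 * beta))
```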

Algorithm 2. AGNMF_MK Algorithm

  Input: L base kernel matrices K_l, l = 1, ..., L;
  Input: Initial factorization matrices F^0 and W^0;
  Input: Tolerance of the stopping criterion ξ;
  Input: Maximum number of iterations T;
  Initialize the kernel weights as τ_l^0 = 1/L, l = 1, ..., L;
  Initialize t = 1;
  repeat
    Update the graph G^{τ^{t−1}} and its corresponding Laplacian matrix L^{τ^{t−1}} according to τ^{t−1}, as introduced in Section 3.2.2;
    Update the factorization matrices F^t and W^t as in (38);
    Update the kernel weights τ^t as in (39);
    t = t + 1;
  until O^AGNMF_MK(F^t, W^t, τ^t) ≤ ξ or t ≥ T
  Output: F = F^{t−1}, W = W^{t−1} and τ = τ^{t−1}.
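A sketch assembling Algorithm 2 from the pieces above: the combined kernel (32), a τ-adaptive dot-product-weighted K-NN graph (Section 3.2.2), the multiplicative updates (38), and the update_tau step from the previous sketch; the graph construction details, K_nn, α, β and the iteration budget are illustrative assumptions:

```python
import numpy as np

def agnmf_mk(kernels, R, alpha=0.1, beta=1.0, K_nn=5, T=100, seed=0, eps=1e-10):
    """Sketch of Algorithm 2 (AGNMF_MK); kernels is a list of N x N matrices."""
    rng = np.random.default_rng(seed)
    N = kernels[0].shape[0]
    F = rng.random((N, R))
    W = rng.random((R, N))
    tau = np.full(len(kernels), 1.0 / len(kernels))  # uniform initialization
    I = np.eye(N)
    for t in range(T):
        Kt = sum(t_l * K_l for t_l, K_l in zip(tau, kernels))  # Eq. (32)
        # tau-adaptive graph: neighbors by kernel-space distance (Eq. 34),
        # dot-product edge weights A_nm = K^tau(x_n, x_m).
        diag = np.diag(Kt)
        d2 = diag[:, None] + diag[None, :] - 2 * Kt
        A = np.zeros((N, N))
        for n in range(N):
            nbrs = np.argsort(d2[n])[1:K_nn + 1]
            A[n, nbrs] = Kt[n, nbrs]
        A = np.maximum(A, A.T)
        Dg = np.diag(A.sum(axis=1))
        # Multiplicative updates of Eq. (38).
        F *= (Kt @ W.T) / (Kt @ F @ W @ W.T + eps)
        W *= (F.T @ Kt + alpha * W @ A) / (F.T @ Kt @ F @ W + alpha * W @ Dg + eps)
        # Kernel weight update by solving the QP (39).
        tau = update_tau(kernels, I - F @ W, beta)
    return F, W, tau
```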

3.2.4. Representing a test sample with AGNMF_MK

When a test sample x ∈ R^D comes in, we first connect it to its K nearest neighbors N^τ in the training set X, found using the τ-weighted SED (34). Then the weight vector a^τ = [a^τ_1, ..., a^τ_N]^⊤ ∈ R^N is calculated as a^τ_n = K^τ(x, x_n) if x_n ∈ N^τ, and 0 otherwise. We need to optimize the following objective function to solve for w with AGNMF_MK:

    min_w O(w) = ||φ(x) − φ(X)Fw||² + (α/2) Σ_{n=1}^{N} a^τ_n ||w − w_n||²
    = Tr[K^τ(x, x)] − 2 Tr[K^τ(X, x) w^⊤ F^⊤] + Tr[K^τ(X, X) F w w^⊤ F^⊤] + (α/2) Σ_{n=1}^{N} a^τ_n Tr(ww^⊤) − α Tr(w Σ_{n=1}^{N} a^τ_n w_n^⊤) + (α/2) Σ_{n=1}^{N} a^τ_n Tr(w_n w_n^⊤),
    s.t. w ≥ 0,    (40)

where K^τ(X, y) = [K^τ(x_1, y), ..., K^τ(x_N, y)]^⊤ and K^τ(X, X) = [K^τ(x_n, x_m)] ∈ R^{N×N}. By setting the partial derivative of the Lagrange function of (40) with respect to w to zero and using the KKT conditions, we have the following updating rule for w:

    w_r ← w_r [F^⊤ K^τ(X, x) + (α/2) Σ_{n=1}^{N} a^τ_n w_n]_r / [F^⊤ K^τ(X, X) F w + (α/2) (Σ_{n=1}^{N} a^τ_n) w]_r.    (41)

4. Experiments

In this section, we apply the two proposed enhanced AGNMF algorithms to two challenging classification tasks: colon cancer diagnosis and face recognition.

4.1. Experiment I: colon cancer diagnosis

In the first experiment, we test the proposed algorithms as data representation methods for the colon cancer diagnosis task based on gene expression data.

4.1.1. Colon cancer dataset and setup

Classification of colon cancer types according to the gene expression of patients' tissue samples is an important technique for colon cancer diagnosis (Zheng et al., 2011). Given the gene expression levels of a sample, the aim is to identify whether it is a tumor or a normal colon tissue. The gene expression data is usually nonnegative, thus NMF can be used to represent the gene expression data for classification tasks. In this experiment we evaluate the use of the proposed algorithms as data representation methods for the colon cancer classification problem. A publicly available microarray dataset of colon tissue samples is used in this experiment (Zheng et al., 2011). The colon cancer data set contains the gene expression data of D = 2000 genes in N = 62 colon tissue samples, 22 of which are normal samples and 40 of which are tumor colon tissue samples. The tumor tissue samples are defined as positive samples, while the normal samples are negative ones. The gene expression data of the 2000 genes of a sample are used as the original nonnegative features. The proposed AGNMF_FS or AGNMF_MK algorithms were applied to represent the data samples as low-dimensional coding vectors.

To evaluate the proposed algorithms, we performed 5-fold cross validation on the dataset. The entire dataset was split into five folds randomly; each fold contained 8 positive samples and 4–5 negative samples. Each fold was used as an independent test set in turn, while the remaining four folds were used as the training set. We first applied AGNMF_FS and AGNMF_MK respectively to the training set to represent all the training samples as coding vectors, and then trained a support vector machine (SVM) (Emami & Omar, 2013) to distinguish the normal and tumor colon tissue samples. When a test sample was given, we first represented it by a coding vector, based on the basis and coding matrices and the feature selection or multi-kernel learning parameters learned on the training set, and then classified the coding vector with the trained SVM classifier. Note that all the parameters were tuned on the training set only, and the test set was not included in the parameter optimization procedure.

The classification performance is measured by receiver operating characteristic (ROC) curves and recall-precision curves (Zhang, Xu, & Chen, 2008). The ROC curves were obtained by plotting the true positive rates (TPR) vs. the false positive rates (FPR) at various threshold settings, while the recall-precision curves were obtained by plotting the recalls vs. the precisions at various threshold settings. The TPR, FPR, recall and precision are defined as follows:

    TPR = TP / (TP + FN), FPR = FP / (FP + TN),
    recall = TP / (TP + FN), precision = TP / (TP + FP),    (42)

where TP is the number of true positives, FP is the number of false positives, TN is the number of true negatives, and FN is the number of false negatives. Moreover, the area under the ROC curve (AUC) is also used as a measure of the classification performance.

[Fig. 1. ROC and recall-precision curves of different NMF representation methods on the colon cancer dataset. Panels: ROC (TPR vs. FPR) and recall-precision; curves for AGNMF_FS, AGNMF_MK, GNMF, NMF_FS, NMF_MK, NMF_K and NMF.]
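As a small illustration of (42) and the AUC, the following sketch sweeps a decision threshold over classifier scores and integrates the resulting ROC curve with the trapezoidal rule (the integration scheme is our assumption; the paper does not state one):

```python
import numpy as np

def roc_points(scores, labels):
    """TPR/FPR of Eq. (42) at every threshold; labels are 0/1 arrays."""
    order = np.argsort(-scores)            # descending score = loosening threshold
    labels = labels[order]
    tp = np.cumsum(labels)                 # true positives above threshold
    fp = np.cumsum(1 - labels)             # false positives above threshold
    tpr = tp / labels.sum()
    fpr = fp / (len(labels) - labels.sum())
    return np.r_[0.0, fpr], np.r_[0.0, tpr]

scores = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.2])
labels = np.array([1, 1, 0, 1, 0, 0])
fpr, tpr = roc_points(scores, labels)
auc = np.trapz(tpr, fpr)                   # area under the ROC curve
print(auc)
```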

4.1.2. Experimental results

Since our algorithms combine feature selection, multi-kernel learning, graph regularization and NMF, we compared them against the following relevant methods: the original NMF (Lee & Seung, 2000), the graph-regularized NMF (GNMF) (Cai et al., 2011), the kernel NMF (NMF_K) (Lee, Cichocki, & Choi, 2009), the NMF with multi-kernels (NMF_MK) (An et al., 2011) and the NMF with feature selection (NMF_FS) (Das Gupta & Xiao, 2011). In total, seven different methods were compared on this data set. The ROC and recall-precision curves of these methods are shown in Fig. 1, and the AUC values are given in Table 1.

Table 1
AUC values of different NMF representation methods on the colon cancer dataset.

Method      AUC
AGNMF_FS    0.9972
AGNMF_MK    0.9869
GNMF        0.9716
NMF_FS      0.9699
NMF_MK      0.9585
NMF_K       0.9545
NMF         0.9523

It can be seen that the proposed methods, which use feature selection or multi-kernel learning to learn an adaptive graph for the regularization of NMF, consistently perform better than the GNMF method, which uses the original data space for the graph estimation. In particular, this difference is significant when only a small number of data samples with a large feature dimension is available to train the NMF. Moreover, NMF_FS outperforms NMF_MK, which implies that feature selection works better than multiple kernel learning for high-dimensional data with many noisy and irrelevant features, such as gene expression data. We also observed that both the proposed feature selection version (AGNMF_FS) and the multi-kernel version (AGNMF_MK) of the graph-adaptive NMF methods are superior to their competing algorithms that only consider feature selection (NMF_FS) or multi-kernel learning (NMF_MK) without conducting graph regularization. This is consistent with the manifold assumption on the data and also shows the necessity of applying graph regularization. It is also interesting to notice that the difference between NMF_MK and NMF_K is marginal. A possible reason is that NMF_MK does not use the ℓ2 norm to regularize the kernel weights whereas AGNMF_MK does; thus the kernel weights overfit to one kernel in NMF_MK.
4.2. Experiment II: face recognition

In the second experiment, we test our algorithms on the face recognition task.

4.2.1. Face image dataset and setup

We used the face image dataset from the Georgia Institute of Technology (GTFD) (Nefian & Hayes, 2000) in this experiment. This database contains face images of 50 individuals. Fifteen color pictures were taken of each individual in two or three sessions, so there are in total 750 images in the database. For each individual, pictures with different positions, facial expressions, lighting conditions and scales were taken. The face area in each image was manually cropped. We extracted the color-based local binary pattern (LBP) features (Nanni, Lumini, & Brahnam, 2012) and the Gabor wavelet coefficients (Park & Kim, 2013) as features of each face image, and concatenated them to construct the original nonnegative feature vector.

To conduct the evaluation, we randomly split the entire database into non-overlapping training and test subsets. For each individual, 10 of the 15 images were randomly selected as training samples, and the remaining five images were used as test samples. Thus there were in total 500 samples in the training subset and 250 samples in the test subset. To represent the samples, we first applied the NMF algorithms to the training set to obtain the basis matrix and the coding vectors of the training samples; the test samples were then coded into coding vectors using the NMF parameters learned on the training set. To classify the test samples, we first trained a hidden Markov model (HMM) (Elliott, Siu, & Fung, 2014) for each individual using the training samples, and then each test sample was fit to each of the HMMs and classified into the one with the highest log-likelihood score. The above training/test split process was repeated 10 times, and the recognition accuracies over the splits are reported as the final performance.

[Fig. 2. Boxplots of recognition accuracies of 10 splits on the GTFD face database, for AGNMF_FS, AGNMF_MK, GNMF, NMF_FS, NMF_MK, NMF_K and NMF.]

4.2.2. Experimental results

Fig. 2 shows the boxplots of the classification accuracies of the different methods over the 10 splits. It can be seen that the proposed AGNMF_MK and AGNMF_FS again consistently outperform the other methods. AGNMF_MK performs similarly to AGNMF_FS on this dataset. This makes sense because, for computer vision problems such as face recognition, multi-kernel-based methods have been shown to perform well (Yeh et al., 2013). In this data representation task, modeling the data as graphs gives a much better representation than modeling it in the original Hilbert space constructed by a single kernel or multiple kernels. Thus, it is not surprising to see that the proposed AGNMF_MK significantly and consistently outperforms NMF_MK and NMF_K, although all of them minimize the data reconstruction error. Similar reasons explain the improvement of AGNMF_FS over NMF_FS. By comparing the performance between AGNMF_FS and GNMF, and between NMF_FS and NMF, we can conclude that feature selection plays an important role in the classification accuracy.

5. Conclusion and future works

In this paper, we proposed two novel data representation methods that aim to solve the issues of NMF caused by noisy and irrelevant features and nonlinear distributions of data samples. The first method conducts feature selection, graph regularization, and NMF simultaneously, whereas the second method optimizes multi-kernel learning, graph regularization, and NMF in a unified objective function. We developed two iterative optimization algorithms to optimize the two objective functions, respectively. Experimental results demonstrate that our methods significantly outperform state-of-the-art data representation methods on the colon cancer classification and face recognition tasks. The strength of the proposed algorithms lies in the fact that they do not need class label information for feature selection and multi-kernel learning. The weakness is the high computational complexity of the proposed multi-kernel learning method, due to the QP problem in each iteration.

Manifold regularization based on graphs has been a popular method in NMF-related studies. However, the construction of the graph is often affected by noisy features and the nonlinear distribution of the data. The theoretical contributions of this paper are two solutions to these problems. One contribution is to use feature selection to refine the data for constructing a reliable graph to regularize the NMF, and also to incorporate the feature selection problem into the NMF problem. The other is to incorporate the multi-kernel learning problem into NMF and also to use it to refine the graph construction. We show that both feature selection and multi-kernel learning can be used to construct a more reliable graph for NMF, and that the feature and kernel weights can be learned simultaneously with NMF.

Moreover, we also provide some insightful and practical implications for feature selection and multi-kernel learning. Although feature selection and multi-kernel learning are used here to construct the graph for NMF, the feature selection and multi-kernel learning problems are themselves solved by minimizing the objective of graph regularized NMF. This gives an insight about feature selection and multi-kernel learning: NMF and graph regularization can also be used as criteria for feature selection and multi-kernel learning, even when the supervision information is missing. The problems of feature selection and multi-kernel learning, NMF and graph regularization can thus be unified as a single learning problem.

In the future, to extend the work proposed in this paper, we propose the following research directions. The first direction is to parallelize the proposed algorithms in a distributed system, to apply them to big data platforms (Kwon & Sim, 2013; Lee, Lee, & Sohn, 2013b; Li et al., 2011b; Wang, Jiang, & Agrawal, 2012c; Wang, Nandi, & Agrawal, 2014c; Wang, Su, & Agrawal, 2013d). The second direction is to improve the proposed algorithms to handle data sets of different distributions for domain transfer learning problems (Al-Shedivat, Wang, Alzahrani, Huang, & Gao, 2014; Lee et al., 2013b; Meng, Lin, & Li, 2011). The third direction is to apply them to more applications, such as image watermarking (Ouhsain & Hamza, 2009), fault diagnosis (Li et al., 2011a), sensing (Sun, Hu, & Qi, 2010a, 2014; Li, Wu, & Li, 2014), malicious website detection (Xu, Zhan, Xu, & Ye, 2014, 2013), data analysis (Luo & Brodsky, 2011; Luo, Brodsky, & Li, 2012), and protein sequence motif discovery (Kim, Chen, Kim, Pan, & Park, 2011b).

Acknowledgments

The study is in part supported by the US National Science Foundation (Grant No. DBI-1322212) and a grant from King Abdullah University of Science and Technology (KAUST), Saudi Arabia. We would like to thank Dr. Qingquan Sun for sharing the code of the feature selection algorithm with us.

References

Al-Shedivat, M., Wang, J. J.-Y., Alzahrani, M., Huang, J. Z., & Gao, X. (2014). Supervised transfer sparse coding. In Twenty-eighth AAAI conference on artificial intelligence (pp. 1665–1672).
An, S., Yun, J.-M., & Choi, S. (2011). Multiple kernel nonnegative matrix factorization. In 2011 IEEE international conference on acoustics, speech, and signal processing (ICASSP) (pp. 1976–1979).
Cai, Q., Yin, Y., & Man, H. (2013). DSPM: Dynamic structure preserving map for action recognition. In 2013 IEEE international conference on multimedia and expo (ICME) (pp. 1–6).
Cai, D., He, X., Han, J., & Huang, T. S. (2011). Graph regularized nonnegative matrix factorization for data representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(8), 1548–1560.
Chen, Z., Li, J., Wei, L., Xu, W., & Shi, Y. (2011). Multiple-kernel SVM based multiple-task oriented system for gene expression data analysis. Expert Systems with Applications, 38(10), 12151–12159.
Cui, S., & Soh, Y. C. (2010). Linearity indices and linearity improvement of 2-D tetralateral position-sensitive detector. IEEE Transactions on Electron Devices, 57(9), 2310–2316.
Das Gupta, M., & Xiao, J. (2011). Non-negative matrix factorization as a feature selection tool for maximum margin classifiers. In 2011 IEEE conference on computer vision and pattern recognition (CVPR). IEEE.
Elliott, R., Siu, T., & Fung, E. (2014). A double HMM approach to Altman Z-scores and credit ratings. Expert Systems with Applications, 41(4 PART 2), 1553–1560.
Emami, M., & Omar, K. (2013). A low-cost method for reliable ownership identification of medical images using SVM and Lagrange duality. Expert Systems with Applications, 40(18), 7579–7587.
Fakhraei, S., Soltanian-Zadeh, H., & Fotouhi, F. (2014). Bias and stability of single variable classifiers for feature ranking and selection. Expert Systems with Applications, 41(15), 6945–6958.
Iquebal, A., Pal, A., Ceglarek, D., & Tiwari, M. (2014). Enhancement of Mahalanobis–Taguchi system via rough sets based feature selection. Expert Systems with Applications, 41(17), 8003–8015.
Kim, W., Chen, B., Kim, J., Pan, Y., & Park, H. (2011a). Sparse nonnegative matrix factorization for protein sequence motif discovery. Expert Systems with Applications, 38(10), 13198–13207.
Kim, W., Chen, B., Kim, J., Pan, Y., & Park, H. (2011b). Sparse nonnegative matrix factorization for protein sequence motif discovery. Expert Systems with Applications, 38(10), 13198–13207.
Kwon, O., & Sim, J. (2013). Effects of data set features on the performances of classification algorithms. Expert Systems with Applications, 40(5), 1847–1857.
Lee, D. D., & Seung, H. S. (2000). Algorithms for non-negative matrix factorization. In NIPS (pp. 556–562).
Lee, H., Cichocki, A., & Choi, S. (2009). Kernel nonnegative matrix factorization for spectral EEG feature extraction. Neurocomputing, 72(13–15), 3182–3190.
Lee, L., Rajkumar, R., Lo, L., Wan, C., & Isa, D. (2013a). Oil and gas pipeline failure prediction system using long range ultrasonic transducers and Euclidean-support vector machines classification approach. Expert Systems with Applications, 40(6), 1925–1934.
Lee, M., Lee, A., & Sohn, S. (2013b). Behavior scoring model for coalition loyalty programs by using summary variables of transaction data. Expert Systems with Applications, 40(5), 1564–1570.

Li, B., Zhang, P.-L., Tian, H., Mi, S.-S., Liu, D.-S., & Ren, G.-Q. (2011a). A new feature extraction and selection scheme for hybrid fault diagnosis of gearbox. Expert Systems with Applications, 38(8), 10000–10009.
Li, H., Wu, G.-Q., Hu, X.-G., Zhang, J., Li, L., & Wu, X. (2011b). K-means clustering with bagging and MapReduce. In 2011 44th Hawaii international conference on system sciences (HICSS) (pp. 1–8). IEEE.
Li, H., Wu, X., & Li, Z. (2014). Online learning with mobile sensor data for user recognition. In The 29th symposium on applied computing (pp. 64–70). ACM.
Li, H., Wu, X., Li, Z., & Ding, W. (2013a). Group feature selection with feature streams. In 2013 IEEE 13th international conference on data mining (ICDM) (pp. 1109–1114). IEEE.
Li, H., Wu, X., Li, Z., & Ding, W. (2013b). Online group feature selection from feature streams. In Twenty-seventh AAAI conference on artificial intelligence (pp. 1627–1628). AAAI.
Lin, C.-H., Chen, H.-Y., & Wu, Y.-S. (2014). Study of image retrieval and classification based on adaptive features using genetic algorithm feature selection. Expert Systems with Applications, 41(15), 6611–6621.
Luo, J., & Brodsky, A. (2011). An EM-based multi-step piecewise surface regression learning algorithm. In The seventh international conference on data mining (WORLDCOMP DMIN 11) (pp. 286–292). Las Vegas, Nevada.
Luo, J., Brodsky, A., & Li, Y. (2012). An EM-based algorithm on piecewise surface regression problem. International Journal of Applied Mathematics and Statistics, 28(4), 59–74.
Meng, J., Lin, H., & Li, Y. (2011). Knowledge transfer based on feature representation mapping for text classification. Expert Systems with Applications, 38(8), 10562–10567.
Merigó, J., & Casanovas, M. (2011). Induced aggregation operators in the Euclidean distance and its application in financial decision making. Expert Systems with Applications, 38(6), 7603–7608.
Nanni, L., Lumini, A., & Brahnam, S. (2012). Survey on LBP based texture descriptors for image classification. Expert Systems with Applications, 39(3), 3634–3641.
Nefian, A., & Hayes, M. H., III (2000). Maximum likelihood training of the embedded HMM for face detection and recognition. In Proceedings 2000 international conference on image processing (Vol. 1, pp. 33–36).
Orsenigo, C., & Vercellis, C. (2012). Kernel ridge regression for out-of-sample mapping in supervised manifold learning. Expert Systems with Applications, 39(9), 7757–7762.
Ouhsain, M., & Hamza, A. (2009). Image watermarking scheme using nonnegative matrix factorization and wavelet transform. Expert Systems with Applications, 36(2 PART 1), 2123–2129.
Park, J.-G., & Kim, K.-J. (2013). Design of a visual perception model with edge-adaptive Gabor filter and support vector machine for traffic sign detection. Expert Systems with Applications, 40(9), 3679–3687.
Sun, Q., Hu, F., & Hao, Q. (2014). Mobile target scenario recognition via low-cost pyroelectric sensing system: Toward a context-enhanced accurate identification. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 44(3), 375–384.
Sun, Q., Hu, F., & Qi, H. (2010a). Context awareness emergence for distributed binary pyroelectric sensors. In 2010 IEEE conference on multisensor fusion and integration for intelligent systems (MFI) (pp. 162–167). IEEE.
Sun, Q., Wu, P., Wu, Y., Guo, M., & Lu, J. (2012). Unsupervised multi-level non-negative matrix factorization model: Binary data case. Journal of Information Security, 3, 245.
Sun, Y., Todorovic, S., & Goodison, S. (2010b). Local-learning-based feature selection for high-dimensional data analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9), 1610–1626.
Wang, J.-Y., Almasri, I., & Gao, X. (2012b). Adaptive graph regularized nonnegative matrix factorization via feature selection. In 2012 21st international conference on pattern recognition (ICPR) (pp. 963–966). IEEE.
Wang, J. J.-Y., Bensmail, H., & Gao, X. (2012a). Multiple graph regularized protein domain ranking. BMC Bioinformatics, 13(1), 307.
Wang, J. J.-Y., Bensmail, H., & Gao, X. (2013a). Multiple graph regularized nonnegative matrix factorization. Pattern Recognition, 46(10), 2840–2847.
Wang, J. J.-Y., Bensmail, H., & Gao, X. (2014a). Feature selection and multi-kernel learning for sparse representation on a manifold. Neural Networks, 51, 9–16.
Wang, J. J.-Y., Bensmail, H., Yao, N., & Gao, X. (2013b). Discriminative sparse coding on multi-manifolds. Knowledge-Based Systems, 54, 199–206.
Wang, J. J.-Y., & Gao, X. (2013). Beyond cross-domain learning: Multiple domain nonnegative matrix factorization. Engineering Applications of Artificial Intelligence, 28, 181–189.
Wang, J. J.-Y., Sun, Y., & Gao, X. (2014b). Sparse structure regularized ranking. Multimedia Tools and Applications, 1–20.
Wang, J. J.-Y., Wang, X., & Gao, X. (2013c). Non-negative matrix factorization by maximizing correntropy for cancer clustering. BMC Bioinformatics, 14(1), 107.
Wang, Y., Jiang, W., & Agrawal, G. (2012c). SciMATE: A novel MapReduce-like framework for multiple scientific data formats. In 2012 12th IEEE/ACM international symposium on cluster, cloud and grid computing (CCGrid) (pp. 443–450). IEEE.
Wang, Y., Nandi, A., & Agrawal, G. (2014c). SAGA: Array storage as a DB with support for structural aggregations. In Proceedings of the 26th international conference on scientific and statistical database management (p. 9). ACM.
Wang, Y., Su, Y., & Agrawal, G. (2013d). Supporting a light-weight data management layer over HDF5. In 2013 13th IEEE/ACM international symposium on cluster, cloud and grid computing (CCGrid) (pp. 335–342). IEEE.
Xu, L., Zhan, Z., Xu, S., & Ye, K. (2013). Cross-layer detection of malicious websites. In Proceedings of the third ACM conference on data and application security and privacy (pp. 141–152). ACM.
Xu, L., Zhan, Z., Xu, S., & Ye, K. (2014). An evasion and counter-evasion study in malicious websites detection. In 2014 IEEE conference on communications and network security (CNS). San Francisco, USA.
Yeh, C.-Y., Huang, C.-W., & Lee, S.-J. (2011). A multiple-kernel support vector regression approach for stock market price forecasting. Expert Systems with Applications, 38(3), 2177–2186.
Yeh, C.-Y., Su, W.-P., & Lee, S.-J. (2013). An efficient multiple-kernel learning for pattern classification. Expert Systems with Applications, 40(9), 3491–3499.
Zhang, T., Xu, D., & Chen, J. (2008). Application-oriented purely semantic precision and recall for ontology mapping evaluation. Knowledge-Based Systems, 21(8), 794–799.
Zheng, C.-H., Ng, T.-Y., Zhang, L., Shiu, C.-K., & Wang, H.-Q. (2011). Tumor classification based on non-negative matrix factorization using gene expression data. IEEE Transactions on Nanobioscience, 10(2), 86–93.