Yu et al. Human Genomics 2019, 13(Suppl 1):46
https://doi.org/10.1186/s40246-019-0222-6

RESEARCH — Open Access

Robust hypergraph regularized non-negative matrix factorization for sample clustering and feature selection in multi-view gene expression data

Na Yu1, Ying-Lian Gao2, Jin-Xing Liu1*, Juan Wang1* and Junliang Shang1

From IEEE International Conference on Bioinformatics and Biomedicine 2018, Madrid, Spain, 3-6 December 2018

Abstract

Background: As one of the most popular data representation methods, non-negative matrix factorization (NMF) has received wide attention in clustering and feature selection tasks. However, most previously proposed NMF-based methods do not adequately explore the hidden geometrical structure of the data. At the same time, noise and outliers are inevitably present in the data.

Results: To alleviate these problems, we present a novel NMF framework named robust hypergraph regularized non-negative matrix factorization (RHNMF). In particular, hypergraph Laplacian regularization is imposed to capture the geometric information of the original data. Unlike graph Laplacian regularization, which captures only the relationship between pairwise sample points, it captures high-order relationships among more sample points. Moreover, the robustness of RHNMF is enhanced by using the L2,1-norm when estimating the residual, because the L2,1-norm is insensitive to noise and outliers.

Conclusions: Clustering and common abnormal expression gene (com-abnormal expression gene) selection are conducted to test the validity of the RHNMF model. Extensive experimental results on multi-view datasets reveal that our proposed model outperforms other state-of-the-art methods.
Keywords: Non-negative matrix factorization, Hypergraph Laplacian, L2,1-norm, Clustering, Common abnormal gene selection, Multi-view gene expression data

Background

Due to the development of sequencing technology, more and more gene expression data have been detected. At the same time, gene expression data contain much meaningful biological information, and the effective analysis of this information is of great significance to the prevention and treatment of diseases. Multi-view data obtained by integrating data from different sources have gained much attention in the field of machine learning [1]. It is well known that gene expression data can be downloaded from The Cancer Genome Atlas (TCGA) platform. We therefore integrated the gene expression data of different diseases with the same genes into multi-view data. Multi-view data provide a new perspective to mine the connections between multiple cancers.

To meet the demand for studying rapidly growing gene expression data, modern biologists are increasingly concerned with clustering and feature selection. Clustering is the process of dividing a series of genes or samples into different subsets such that the genes or samples in the same subset are similar [2].

* Correspondence: [email protected]; [email protected]
1School of Information Science and Engineering, Qufu Normal University, Rizhao 276826, China
Full list of author information is available at the end of the article
Generally speaking, feature selection can not only find useful information and eliminate noise, but also reduce the complexity of the computation. In this paper, we performed the selection of com-abnormal genes to study the relationship between genes and multiple cancers [3].

As an effective matrix decomposition method, non-negative matrix factorization (NMF) [4] is widely used in bioinformatics [5], image representation [6], and other fields [7]. NMF can learn part-based representations of objects, which is consistent with the human brain's perception mechanism. Some extensions to NMF have been proposed from different perspectives. For example, non-negative local coordinate factorization (NLCF) was presented by imposing a locality coordinate constraint on the original NMF [8]. Kim et al. presented the sparse non-negative matrix factorization (NMFs) method in combination with sparsity constraints [9]. In practical applications, the data are sometimes negative, so semi-non-negative matrix factorization (Semi-NMF) and convex non-negative matrix factorization (Convex-NMF) were derived to handle data with both positive and negative entries [10].

As mentioned above, these methods have enhanced the performance of NMF, but the following limitations remain: (1) There is intrinsic geometrical information in high-dimensional data, yet these methods ignore the nonlinear low-dimensional geometrical structure of the original data. (2) There are always noise and outliers in real data, so a robust NMF-based approach is needed to suppress them effectively.

For the first problem, graph regularized non-negative matrix factorization (GNMF) was presented to discover the manifold structure of raw data [11]. However, graph regularization is based on constructing k-nearest neighbors in a simple graph, which explores only the pairwise relationship between two sample points. Zeng et al. introduced hypergraph regularized non-negative matrix factorization (HNMF) to encode the relationship between two or more sample points [12]. Unlike a simple graph, a hyperedge of a hypergraph contains a set of related vertices, so high-order relationships in the data can be found. GNMF and HNMF consider important manifold information, but they are exceptionally sensitive to noise and outliers. The second problem can be effectively alleviated by using the L2,1-norm when estimating the residual [13].

Inspired by these works, this paper presents a novel NMF model called robust hypergraph regularized non-negative matrix factorization (RHNMF). It adds hypergraph regularization and the L2,1-norm to traditional NMF, so it has the advantage of considering high-order relationships among samples while controlling the influence of noise and outliers. The main contributions of RHNMF are summarized as follows:

(i) To capture high-order relationships among more sample points, hypergraph regularization is applied to the objective function. This makes sense for enhancing the performance of NMF-based methods.
(ii) The L2,1-norm instead of the Frobenius norm is used to estimate the residual approximation, so that the error term for each data point is no longer in squared form. This greatly suppresses the effects of noise and outliers. The L2,1-norm is also suitable for clustering and feature selection because it produces sparse rows.
(iii) Scientific and comprehensive experiments are designed on the multi-view datasets to prove the effectiveness of RHNMF, and satisfactory results are achieved.

The rest of the paper is arranged as follows. In the "Methods" section, we introduce NMF, the L2,1-norm, and hypergraph regularization. The proposed RHNMF method, its solution process, and its convergence and computational complexity analyses are also described in detail. Experimental results are demonstrated in the section "Results and discussion." The conclusion is drawn in the section "Conclusions."

Methods

Non-negative matrix factorization

In the field of bioinformatics, gene expression data are usually expressed in the form of a matrix: each column of the matrix represents a sample, and each row represents the expression level of a gene. Given a data matrix $X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{m \times n}$, the column vector $x_j$ is a sample vector. NMF aims at finding two non-negative matrices $U = [u_1, u_2, \ldots, u_k] \in \mathbb{R}^{m \times k}$ and $V = [v_1, v_2, \ldots, v_n] \in \mathbb{R}^{k \times n}$ whose product approximates the data matrix $X$ [14]. $U$ represents a basis matrix, and $V$ represents a coefficient matrix. The minimization objective function of NMF is as below:

$$\min_{U,V} \|X - UV\|_F^2 = \sum_{j=1}^{n} \|x_j - Uv_j\|_2^2 \quad \text{s.t. } U \geq 0, \; V \geq 0, \tag{1}$$

where $\|\cdot\|_F$ represents the Frobenius norm of a matrix. Each $x_j$ can be seen as a linear combination of the columns of $U$, parameterized by the corresponding column $v_j$ of $V$ [15].

L2,1-norm

Given any matrix $X \in \mathbb{R}^{m \times n}$, $\|X\|_{2,1}$ is obtained by first calculating the L2-norm of each row to form a column vector, and then calculating the L1-norm of that vector [13], i.e.,

$$\|X\|_{2,1} = \sum_{i=1}^{m} \sqrt{\sum_{j=1}^{n} x_{i,j}^2} = \sum_{i=1}^{m} \|x^i\|_2. \tag{2}$$

As shown above, the L2,1-norm causes row sparsity [16]. At the same time, the L2,1-norm is not susceptible to noise and outliers, so it improves the robustness of the algorithm.

Hypergraph regularization

Inspired by simple graph theory, the hypergraph came into being [17]. In a simple graph, one edge connects two data samples, and the weight of the edge denotes the pairwise relationship between the two sample points [11]. $H$ represents the hypergraph's incidence matrix, which can be calculated as below:

$$H(v, e) = \begin{cases} 1, & \text{if } v \in e, \\ 0, & \text{otherwise.} \end{cases} \tag{3}$$

For any hyperedge $e_i$, its weight $W_i$ is denoted as follows:

$$W_i = W(e_i) = \sum_{v_j \in e_i} \exp\!\left(-\frac{\|v_i - v_j\|_2^2}{\delta}\right), \tag{4}$$

where $\delta = \frac{1}{k} \sum_{v_j \in e_i} \|v_i - v_j\|_2^2$ and $k$ represents the number of nearest neighbors for each vertex. $d(v)$ represents the degree of vertex $v$ and is expressed as follows:

$$d(v) = \sum_{e \in E} w(e) H(v, e). \tag{5}$$

And the degree of each hyperedge can be denoted as:

$$\delta(e) = \sum_{v \in V} H(v, e). \tag{6}$$
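To make these definitions concrete, the following is a minimal NumPy sketch of Eqs. (2)–(6): computing the L2,1-norm of a matrix, and building a k-nearest-neighbor hypergraph over the sample columns with its incidence matrix, heat-kernel hyperedge weights, vertex degrees, and hyperedge degrees. This is an illustration, not the authors' implementation; the function names `l21_norm` and `knn_hypergraph` and the choice of one hyperedge per sample are assumptions.

```python
import numpy as np

def l21_norm(X):
    """L2,1-norm (Eq. 2): L2-norm of each row, then summed (an L1-norm)."""
    return np.sqrt((X ** 2).sum(axis=1)).sum()

def knn_hypergraph(S, k):
    """Build a k-NN hypergraph over the n columns (samples) of S (m x n).

    Each sample is a vertex; hyperedge e_i groups sample i with its k
    nearest neighbors. Returns the incidence matrix H (Eq. 3), hyperedge
    weights W (Eq. 4), vertex degrees d (Eq. 5), and hyperedge degrees
    (Eq. 6).
    """
    n = S.shape[1]
    # Pairwise squared Euclidean distances between sample columns.
    sq = ((S[:, :, None] - S[:, None, :]) ** 2).sum(axis=0)
    H = np.zeros((n, n))  # rows index vertices, columns index hyperedges
    W = np.zeros(n)
    for i in range(n):
        nbrs = np.argsort(sq[i])[1:k + 1]     # k nearest neighbors (skip self)
        H[np.append(nbrs, i), i] = 1.0        # Eq. (3): members of hyperedge e_i
        delta_i = sq[i, nbrs].sum() / k       # bandwidth delta from the text
        W[i] = np.exp(-sq[i, nbrs] / delta_i).sum()   # Eq. (4)
    d = H @ W                  # Eq. (5): d(v) = sum_e w(e) H(v, e)
    edge_deg = H.sum(axis=0)   # Eq. (6): degree of each hyperedge
    return H, W, d, edge_deg
```

With one hyperedge per sample, every hyperedge contains k + 1 vertices, so each column of H sums to k + 1; this is how a hyperedge encodes a high-order relationship among k + 1 samples at once, in contrast to the pairwise edges of a simple graph used by GNMF.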