Weighted-SAMGSR: Combining Significance Analysis of Microarray
Total Page:16
File Type:pdf, Size:1020Kb
Tian et al. Biology Direct (2016) 11:50 DOI 10.1186/s13062-016-0152-3 RESEARCH Open Access Weighted-SAMGSR: combining significance analysis of microarray-gene set reduction algorithm with pathway topology-based weights to select relevant genes Suyan Tian1,2*, Howard H. Chang3 and Chi Wang4 Abstract Background: It has been demonstrated that a pathway-based feature selection method that incorporates biological information within pathways during the process of feature selection usually outperforms a gene-based feature selection algorithm in terms of predictive accuracy and stability. Significance analysis of microarray-gene set reduction algorithm (SAMGSR), an extension to a gene set analysis method with further reduction of the selected pathways to their respective core subsets, can be regarded as a pathway-based feature selection method. Methods: In SAMGSR, whether a gene is selected is mainly determined by its expression difference between the phenotypes, and partially by the number of pathways to which this gene belongs. It ignores the topology information among pathways. In this study, we propose a weighted version of the SAMGSR algorithm by constructing weights based on the connectivity among genes and then combing these weights with the test statistics. Results: Using both simulated and real-world data, we evaluate the performance of the proposed SAMGSR extension and demonstrate that the weighted version outperforms its original version. Conclusions: To conclude, the additional gene connectivity information does faciliatate feature selection. Reviewers: This article was reviewed by Drs. Limsoon Wong, Lev Klebanov, and, I. King Jordan. Keywords: Pathway knowledge, Pathway-based feature selection, Significance analysis of microarray (SAM), Weighted gene expression profiles, Non-small cell lung cancer (NSCLC), Multiple sclerosis (MS) Background In contrast to a pathway analysis method, which Many studies have demonstrated that pathway-based examines the association of a whole pathway with the feature selection algorithms, which utilize biological phenotype of interest, a pathway-based feature selection information contained in pathways to guide which algorithm focuses on the identification of relevant features/genes should be selected, are usually superior to individual features (e.g., genes) while considering known traditional gene-based feature selection algorithms in pathway knowledge. Pathway-based feature selection terms of predictive accuracy, stability, and biological algorithms can be classified into three major categories – interpretation [1–10]. Consequently, pathway-based penalty, stepwise forward, and weighting. Their definitions feature selection algorithms have become increasingly and characteristics are presented in Table 1. In the penalty popular and widespread. category, an additional penalty term accounting for the pathway structure/topology is added to the objective func- tion for optimization. In essence, this penalty term pro- * Correspondence: [email protected] vides some smoothness on nearby genes within a pathway, 1Division of Clinical Research, The First Hospital of Jilin University, 71Xinmin Street, Changchun, Jilin, China 130021 relying on the assumption that neighboring genes inside a 2School of Mathematics, Jilin University, 2699 Qianjin Street, Changchun, Jilin, pathway are more likely to function together or to be China 130012 Full list of author information is available at the end of the article © 2016 The Author(s). Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Tian et al. Biology Direct (2016) 11:50 Page 2 of 15 Table 1 Three categories of pathway-based feature selection algorithms. The filter and embedded methods are two typical types for the gene-based feature selection algorithms. As defined by [32], filter methods access the relevance of features by calculating some functional score while embedded methods search for the optimal subset simultaneously with the classifier construction Category/description Property Pathway topology information Examples [Ref.] Penalty: add an extra penalty term which Embedded feature selection methods, Need the pathway topology Net-Cox [Zhang accounts for the pathway structure to the carry out feature selection and information for all genes, e.g., et al. 2013] netSVM objective function, then optimize the coefficient estimation simultaneously, are they connected and the [Chen et al. 2011] resulting function to get the final gene subset moderate to heavy computing burden distance between them Stepwise forward: order genes based on one Usually filter methods, the beneath Usually ignore the pathway SAM-GSR specific statistic, and then add gene one by one concepts and theory are simple. However, topology information, the [Dinu et al, 2009] until there is no gain on the pre-defined score. they also inherits the filter methods’ decision hinges mainly on SurvNet [Li et al. 2012] drawbacks of inferior model parsimony the genes’ expression values and thus high false positive rate. Weighting: create some kind weights according With different weights, the chance of Account for the pathway RRFE [Johannes to the pathway knowledge and then combine those “driving” genes with subtle change topology information. et al. 2010] DRW with other feature selection methods to identify being selected increases. However, if the [Liu et al. 2013] the relevant genes estimated weights subject to big biases, the resulting model might even be inferior to those without weights. involved in the same biological process than those are far categories, while the weighting category is simpler, it has away. As a result, a ‘driving’ gene that has subtle changes been underutilized. Its underutilization might be due to but alters its neighbors’ expression values dramatically is the estimated weights being subject to errors and biases more likely to be selected. One limitation of penalization where the impact on the resulting significant may be methods is that they are associated with higher theoretical substantial. complexity and more computational efforts. In this study, we propose a hybrid method that com- The stepwise forward methods first select one gene bines SAMGSR with a pathway topology-based weight (e.g., the most significantly differentially expressed) and to carry out feature selection. As a combination of the add genes gradually, and then evaluate the performance weighting method and stepwise forward method, the of the resulting gene subset based on some statistic until objective is to address some disadvantages associated no further gain upon this statistic can be obtained. The with SAMGSR and the weighting method while utilizing SAMGSR algorithm proposed by [11] falls within this their strengths. The proposed method is referred to as category, and it consists of two steps. Its first step is weighted-SAMGSR herein. Applying it to both the simu- essentially an extension of the SAM method [12] to all lated data and real-world application, we evaluate if genes inside a pathway and the significance level of a weights reflecting gene connectivity information are pathway is determined using permutation tests. Then a valuable for feature selection. core subset is extracted from each significant pathway identified by the first step on the basis on the magni- Methods tudes of individual genes’ SAM statistics. Based on the Experimental data simulation results by us [13], SAMGSR may increase the We considered two sets of microarray data and one likelihood of those genes involved in many pathways RNA-Seq dataset in this study. One set of microarray being selected. But SAMGSR only considers a gene’s data is for a multiple sclerosis (MS) application; the pathway membership and ignores pathway topology other set and the RNA-Seq data are for a non-small cell information, it may miss those ‘driving’ genes with subtle lung cancer (NSCLC) application. changes because the inclusion of a gene in the reduced core subsets hinges on its expression difference between MS data two phenotypes. The MS application consisted of two microarray experi- The third category is to create a pathway knowledge- ments.Thefirstoneincludedchipsfromtheexperiment based weight for each gene. For instance, the reweighted E-MTAB-69 stored in the ArrayExpress [16] repository recursive feature elimination (RRFE) algorithm proposed (http://www.ebi.ac.uk/arrayexpress). All chips were hybrid- by Johannes et al. [14] uses GeneRank [15] to calculate a ized on Affymetrix HGU133 Plus 2.0 chips. In this study, rank for each gene and then weighs the coefficients in a there were 26 patients with relapsing-remitting multiple support vector machine (SVM) model by this rank. It sclerosis (RRMS) and 18 controls with neurological disor- had been demonstrated that the resulting gene signa- ders of a non-inflammatory nature. The second dataset was tures have better stability and more meaningful bio- provided by the sbv IMPROVER challenge in the year of logical interpretation [14]. Compared to the other two 2012 [17],