A Decision Analysis Model for KEGG Pathway Analysis Junli Du1,2†, Manlin Li1†, Zhifa Yuan1, Mancai Guo1, Jiuzhou Song3, Xiaozhen Xie1 and Yulin Chen2*
Total Page:16
File Type:pdf, Size:1020Kb
Du et al. BMC Bioinformatics (2016) 17:407 DOI 10.1186/s12859-016-1285-1 RESEARCH ARTICLE Open Access A decision analysis model for KEGG pathway analysis Junli Du1,2†, Manlin Li1†, Zhifa Yuan1, Mancai Guo1, Jiuzhou Song3, Xiaozhen Xie1 and Yulin Chen2* Abstract Background: The knowledge base-driven pathway analysis is becoming the first choice for many investigators, in that it not only can reduce the complexity of functional analysis by grouping thousands of genes into just several hundred pathways, but also can increase the explanatory power for the experiment by identifying active pathways in different conditions. However, current approaches are designed to analyze a biological system assuming that each pathway is independent of the other pathways. Results: A decision analysis model is developed in this article that accounts for dependence among pathways in time-course experiments and multiple treatments experiments. This model introduces a decision coefficient—a designed index, to identify the most relevant pathways in a given experiment by taking into account not only the direct determination factor of each Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway itself, but also the indirect determination factors from its related pathways. Meanwhile, the direct and indirect determination factors of each pathway are employed to demonstrate the regulation mechanisms among KEGG pathways, and the sign of decision coefficient can be used to preliminarily estimate the impact direction of each KEGG pathway. The simulation study of decision analysis demonstrated the application of decision analysis model for KEGG pathway analysis. Conclusions: A microarray dataset from bovine mammary tissue over entire lactation cycle was used to further illustrate our strategy. The results showed that the decision analysis model can provide the promising and more biologically meaningful results. Therefore, the decision analysis model is an initial attempt of optimizing pathway analysis methodology. Keywords: Pathway analysis, Decision coefficient (DC), Coefficient of determination (CD), Bovine mammary Background ortholog group tables including category pathways, sub- To gain more mechanistic insights into the underlying category pathways and the secondary pathways, which biology of the condition being studied, analyzing high- are often encoded by positionally coupled genes on the throughput molecular measurements at the functional chromosome and particularly useful in predicting gene level has become more and more appealing [1]. Espe- functions [3]. Therefore, KEGG pathway databases are cially, the knowledge base-driven pathway analysis is be- more widely used in current enrichment analysis plat- coming the first choice for many investigators, which forms. There are two advantages in this kind of pathway mainly exploit pathway knowledge in public repositories, analysis. One is to reduce the complexity through group- such as Gene Ontology (GO) and Kyoto Encyclopedia of ing thousands of differentially expressed genes (DEG) Genes and Genomes (KEGG) [2]. KEGG pathway data- from those high-throughput technologies to just several bases store the higher order functional information for hundred pathways; another is to increase the explanatory systematic analysis of gene functions. Importantly, power for the experiment through identifying the most KEGG pathway databases can be viewed as a set of impacted pathways under the given conditions [1, 2]. In the last decade, the pathway analysis has experi- – * Correspondence: [email protected]; [email protected] enced the over-representation approach (ORA) [4 11] †Equal contributors and a functional class scoring approach (FCS) stages 2 College of Animal Science and Technology, Northwest A&F University, [12–25]. Both ORA and FCS, including singular enrich- Yangling 712100, People’s Republic of China Full list of author information is available at the end of the article ment analysis (SEA), gene set enrichment analysis © 2016 The Author(s). Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Du et al. BMC Bioinformatics (2016) 17:407 Page 2 of 12 (GSEA) and modular enrichment analysis (MEA), aim to subcategory. In this method, a decision coefficient (DC) identify the significant pathways by considering the index was conceived by considering the mutual regula- number of genes in a pathway or gene co-expression [2]. tion between pathways and was used to identify the However, these methods are currently limited by the fact most impacted pathways. Besides, the subdivision of DC that they handle each pathway independently [19, 26]. In included not only the direct determination factor of each fact, the pathways can cross and overlap because each pathway itself, but also the indirect determination fac- gene has multiple functions and can act in more than tors from its related pathways. Meanwhile, the mutual one pathway [2]. Hence, exploring advanced data ana- regulation mechanisms among pathways can be demon- lysis methods and considering inter-pathway dependence strated by the subdivision of DC. In addition, the impact are still the most important challenge in pathway ana- direction of each pathway can be preliminarily estimated lysis up to now. To our knowledge, only Go-Bayes by using the sign of DC. Moreover, the decision percent- method has incorporated the dependence structure of age can be obtained by the ratio of DC absolute value of the directed acyclic graph (DAG) in assessing Gene the given subcategory pathway divided by the sum of Ontology (GO) term over-representation [27]. Besides, a DC absolute values of all subcategories in the same cat- KEGG-PATH approach took into account the correl- egory. According to the decision percentage, a decision ation among the KEGG pathways in identifying the most tree can be constructed to visualize the decision results. impact pathways and exploiting the regulations among We tested the utility of the method using the DIA im- the KEGG pathways [28]. pact value dataset from a functional analysis of the bo- For time-course experiments and multiple treatments vine mammary transcriptome during the lactation cycle. experiments, the Dynamic Impact Approach (DIA) had been validated to be an effective functional analysis method in real study based on a priori biological know- Methods ledge [29, 30]. In DIA, the impact values and the impact To introduce the decision analysis model, we take the direction were calculated as “Impact = [Proportion of KEGG pathways for an example to define the following DEG in the pathway] × [average log2 fold change of the notations. T DEG] × [average of –log P-value of the DEG]” and “Im- We assumed that X =(X1, X2, ⋯, Xm) , Xi =(Xi1, Xi2, ⋯, T T pact direction = Impact of up-regulated DEG-Impact of Xip) ,(i =1,2,⋯, m), Xij =(Xij1, Xij2, ⋯, Xijk) ,(i =1,2,⋯, down-regulated DEG” [29]. In fact, the impact value was m; j =1,2,⋯, p) are the sets of KEGG pathway categories, a pathway-level statistic aggregated the gene-level statis- subcategories and the secondary pathways, respectively. tics for all DEGs in the pathway. By ranking the esti- Let yi(i =1,2,⋯, m) be the impact values of the i-th T mated impact values of pathways and considering the KEGG pathway category and x =(xi1, xi2, ⋯ xip) be the sign of impact direction values, the DIA approach can impact values of its corresponding subcategory. The efficiently identify the most impact pathways and pro- vector x is assumed to follow a normal distribution x ~ vide the impact direction of pathways. The outstanding N(0, Rx), where Rx is the correlation matrix of x. Based on advantage of this method is to capture the dynamic na- the path analysis model, the total effect can be subdivided ture of the changing transcriptome. But, this ranking into the direct effect and indirectÀÁ effect through the method by ‘average values’ had two limitations: 1) hand- ^ Ã ¼ ^ ^ ¼ equation Rxb Rxy ,whereRx rjt pÂp is the max- ling each pathway independently; 2) considering the ef- imum likelihood estimation of correlation matrix Rx, fect of transcriptome expression in a cell for each time- ÀÁ and R^ ¼ r is the correlation matrix of x and course as “equal weight”. In fact, a biology mechanism xy jy pÂ1 * * ⋯ * T (or better a biology process) is a very complex network yi, b*=(b1, b2, , bp) is the solved path coefficient in- consisting of multiple pathways/functions. Obviously, dicating the direct effect of subcategory pathways the mutual regulations among pathways must exist and [28]. In fact, the path analysis approach is a standard the effect weight of transcriptome expression at different multiple linear regression model. In the linear regression ≤ 2 ≤ time-course is unequal. In KEGG-PATH approach, the analysis, the coefficient of determination (CD) (0 R 1) indirect regulation from the other related pathways was is the proportion of total variation of outcomes explained considered in the calculation of total effect for each by the model, which provides a measure of how well ob- pathway, neglecting