<<

Network-Based Approaches for Pathway UNIT 8.25 Level Analysis Tin Nguyen,1 Cristina Mitrea,2 and Sorin Draghici2,3 1Department of and Engineering, University of Nevada, Reno, Nevada 2Department of Computer Science, Wayne State University, Detroit, Michigan 3Department of Obstetrics and Gynecology, Wayne State University, Detroit, Michigan

Identification of impacted pathways is an important problem because it allows us to gain insights into the underlying biology beyond the detection of differ- entially expressed genes. In the past decade, a plethora of methods have been developed for this purpose. The last generation of pathway analysis methods are designed to take into account various aspects of pathway topology in order to increase the accuracy of the findings. Here, we cover 34 such topology-based pathway analysis methods published in the past 13 years. We compare these methods on categories related to implementation, availability, input format, graph models, and statistical approaches used to compute pathway level statis- tics and statistical significance. We also discuss a number of critical challenges that need to be addressed, arising both in methodology and pathway repre- sentation, including inconsistent terminology, data format, lack of meaningful benchmarks, and, more importantly, a systematic bias that is present in most existing methods. C 2018 by John Wiley & Sons, Inc. Keywords: systems biology r pathway r topology r gene network r survey r pathway analysis

How to cite this article: Nguyen, T., Mitrea, C., & Draghici, S. (2018). Network-based approaches for pathway level analysis. Current Protocols in , 61, 8.25.1–8.25.24. doi: 10.1002/cpbi.42

INTRODUCTION this gap is the fact that living organisms are With rapid advances in high-throughput complex systems whose emerging phenotypes technologies, various kinds of genomic are the results of multiple complex interactions data have become prevalent in most of taking place on various metabolic and signal- biomedical research. Advanced techniques ing pathways. in sequencing (e.g., RNA-Seq, miRNA-Seq, Regardless of the technology being used, DNA-Seq) and microarray assays (e.g., gene a typical comparative analysis (e.g., disease expression, methylation) have transformed bi- versus control, treated versus not treated, drug ological research by enabling comprehen- A versus drug B, etc.) often yields a set of sive monitoring of biological systems. Vast genes that are differentially expressed (DE) be- amounts of data of all types have accumu- tween the two phenotypes. Even though these lated in many public repositories, such as Gene lists of DE genes are important in identify- Expression Omnibus (GEO; Barrett et al., ing the genes that may be involved in biologi- 2013; Edgar, Domrachev, & Lash, 2002), Ar- cal changes, they fail to reveal the underlying ray Express (Brazma et al., 2003; Rustici mechanisms. In order to translate these lists et al., 2013), The Cancer Genome At- of DE genes into a better understanding of bi- las (https://cancergenome.nih.gov/), and cBio- ological phenomena, researchers have devel- Portal (Cerami et al., 2012; Gao et al., 2013). oped a variety of knowledge bases that map However, there is a large gap between the ease genes to functional modules. Depending on of data collection and our ability to extract the amount of information that one wishes to knowledge from these data. Contributing to include, these modules can be described as Analyzing Molecular Interactions

Current Protocols in Bioinformatics 8.25.1–8.25.24, March 2018 8.25.1 Published online March 2018 in Wiley Online (wileyonlinelibrary.com). doi: 10.1002/cpbi.42 Copyright C 2018 John Wiley & Sons, Inc. Supplement 61 Figure 8.25.1 Pathways are more than gene sets. Panel A shows the graphical representation of ERBB signaling pathway from KEGG while panel B shows the set of genes on the pathway. The graph in panel A contains important information regarding gene product (protein) localization, gene, protein, or metabolite interactions and the types of these interactions (activation, repression, etc.), the direction of the signal propagation, etc. Over-representation analysis (ORA) and functional class scoring (FCS) approaches are unable to exploit this information. As such, for the same molecular measurement, these approaches would yield exactly the same significance value for this pathway even if the graph were to be completely redesigned by future discovery. In contrast, network-based approaches are able to take into account the topological order of the genes and there interactions.

simple gene sets based on a function, process & Morishima, 2017; Kanehisa & Goto, 2000), or component (e.g., the Molecular Signatures Reactome (Croft et al., 2014), and Biocarta Database MSigDB; Liberzon et al., 2011), or- (http://www.biocarta.com). ganized in a hierarchical structure that con- Concurrently, statistical methods have been tains information about the relationship be- developed to identify the functional modules tween the various modules, as found in the or pathways that are impacted from the dif- Gene Ontology (Ashburner et al., 2000), or ferential expression evidence. They allow us organized into pathways that describe in de- to gain insights into the functional mecha- tail all known interactions between the vari- nisms of cells beyond the detection of dif- ous genes that are involved in a certain phe- ferential expressed genes. The earliest ap- nomenon. Biological processes in which genes proaches use Over-Representation Analysis are known to interact with each other are ac- (ORA; Draghici,ˇ Khatri, Martins, Ostermeier, Network-Based Approaches for cumulated in public , such as the & Krawetz, 2003; Tavazoie, Hughes, Camp- Pathway Level Kyoto Encyclopedia of Genes and Genomes bell, Cho, & Church, 1999) to identify gene Analysis (KEGG) (Kanehisa, Furumichi, Tanabe, Sato, sets that have more DE genes than expected 8.25.2

Supplement 61 Current Protocols in Bioinformatics Table 8.25.1 Pathway Analysis Tools

Method Availabilitya HIPAAb Licensec Coded Year Reference Signaling pathway analysis using gene expression MetaCoree Web (https://www.genego.com/ No Thomson Java 2004 N/A metacore.php) Reuters Pathway- Standalone (Bioconductor), Web No Freef Java, R 2005 (Draghiciˇ et al., Express (http: //vortex.cs.wayne.edu/) 2007, Khatri superseded by the ROntoTools et al., 2007) PathOlogist Standalone (ftp://ftp1.nci.nih.gov/pub/ No Free MATLAB 2007 (Efroni, pathologist/) Schaefer, & Buetow, 2007; Greenblum, Efroni, Schaefer, & Buetow, 2011) iPathway Web (https://www.advaitabio Yes Advaita Java, R 2009 N/A Guidee .com/products.html) Corp. SPIA Standalone (Bioconductor) No GPL (ࣙ2) R 2009 (Tarca et al., 2009) NetGSA Standalone (https://www.biostat No GPL-2 R 2009 (Shojaie & .washington.edu/ashojaie/ Michailidis, software/ 2009; Shojaie & Michailidis, 2010) PWEA Standalone (https://zlab.bu No Freef C++ 2010 (Hung et al., .edu/PWEA/) 2010) TopoGSA Web (https://www.infobiotics No Freef PHP, R 2010 (Glaab, Baudot, .net/topogsa) Krasnogor, & Valencia, 2010) Topology Standalone (CRAN) No AGPL-3 R 2010 (Massa, GSA Chiogna, & Romualdi, 2010) DEGraph Standalone (Bioconductor) No GPL-3 R 2010 (Jacob, Neuvial, & Dudoit, 2010) GGEA Standalone (Bioconductor) No Artistic- R 2011 (Geistlinger, 2.0 Csaba, Kuffner,¨ Mulder, & Zimmer, 2011) BPA Standalone No Freef MATLAB 2011 (Isci, Ozturk, (https://bumil.boun.edu.tr/bpa) Jones, & Otu, 2011) GANPA Standalone (CRAN) No GPL-2 R 2011 (Fang, Tian, & Ji, 2011) ROnto Standalone (Bioconductor) No CC BY- R 2012 (Voichit¸a et al., Toolse NC-ND 2012) 4 BAPA- No implementation available No N/A R 2012 (Zhao et al., IGGFD 2012)

continued 8.25.3

Current Protocols in Bioinformatics Supplement 61 Table 8.25.1 Pathway Analysis Tools, continued

Method Availabilitya HIPAAb Licensec Coded Year Reference CePa Standalone (CRAN), Web No GPL (ࣙ2) R 2012 (Gu, Liu, Cao, (https://mcube.nju.edu.cn/cgi-bin/ Zhang, & Wang, cepa/main.pl) 2012) THINK- Standalone, Web No Freef Java 2012 (Farfan,´ Ma, Back-DS (https://eecs.umich.edu/db/think/ Sartor, software.html) Michailidis, & Jagadish, 2012) TBScore No implementation available No N/A N/A 2012 (Ibrahim, Jassim, Cawthorne, & Langlands, 2012) ACST Standalone (available as article No N/A R 2012 (Mieczkowski, supplemental) Swiatek- Machado, & Kaminska, 2012) EnrichNet Web (https://www.enrichnet.org/) No free** PHP 2012 (Glaab, Baudot, Krasnogor, Schneider, & Valencia, 2012) clipper Standalone (https://romualdi.bio No AGPL-3 R 2013 (Martini, Sales, .unipd.it/software) Massa, Chiogna, & Romualdi, 2013) DEAP Standalone (available as article No GNU Python 2013 (Haynes, supplemental) Lesser Higdon, GPL Stanberry, Collins, & Kolker, 2013) DRAGEN Standalone (https://bioinfo.au. No N/A C++ 2014 (Ma, Jiang, & tsinghua.edu.cn/dragen/) Jiang, 2014) ToPASeqg Standalone (Bioconductor) No AGPL-3 R 2016 (Ihnatova & Budinska, 2015) pDis Standalone (Bioconductor) No Freef R 2016 (Ansari, Voichit¸a, Donato, Tagett, &Draghici,˘ 2017) SPATIAL No implementation available No N/A N/A 2016 (Bokanizad, Tagett, Ansari, Helmi, & Draghici,˘ 2016) BLMAh Standalone (Bioconductor) No GPL (ࣙ2) R 2017 (Nguyen, Tagett, Donato, Mitrea, & Draghici, 2016; also see Resources)

continued

8.25.4

Supplement 61 Current Protocols in Bioinformatics Table 8.25.1 Pathway Analysis Tools, continued

Method Availabilitya HIPAAb Licensec Coded Year Reference Signaling pathway analysis using multiple types of data PARADIGM Standalone (https://sbenz.github No UCSC- C 2010 (Vaske et al., .com/Paradigm) CGB, 2010) freef micro Standalone No AGPL-3 R 2014 (Calura et al., Graphite (https://romualdi.bio.unipd.it/software) 2014) mirIntegrator Standalone (Bioconductor) No GPLࣙ3 R 2016 (Diaz et al., 2016; Nguyen et al., 2016) Metabolic pathway analysis ScorePAGE No implementation available No N/A N/A 2004 (Rahnenfuhrer¨ et al., 2004) TAPPA No implementation available No N/A N/A 2007 (Gao & Wang, 2007) MetPA Web (https://metpa.metabolomics.ca)No GPL(ࣙ2) PHP, R 2010 (Xia & Wishart, 2010) Metabo Web (https://metpa.metabolomics.ca)No GPL(ࣙ2) R 2011 (Xia & Wishart, Analyst 2011; Xia, Sinelnikov, Han, & Wishart, 2015) aAvailability is a criterion that describes the implementation of the method as standalone or Web-based. bHIPAA provides information about HIPAA compliance. cLicense provides information about the type of the software license. GPL is an abbreviation for the GNU General Public License; AGPL is an abbreviation for the GNU Affero General Public License; CC BY-NC-ND 4 is an abbreviation of Creative Commons Attribution–NonCommercial- NoDerivatives 4.0 International Public License. dCode shows the used for the method implementation. eCommercial method. fFree for academic and non-commercial use; UCSC-CGB is the University of California Santa Cruz Cancer Genome Browser. gToPASeq provides an R package that runs TopologyGSA, DEGraph, clipper, SPIA, TBScore, PWEA, TAPPA. hBLMA provides an R package for bi-level meta-analysis that runs SPIA, ORA, GSA, and PADOG using one or multiple expression datasets. by chance. The drawbacks of this type of ap- ORA and FCS approaches are often re- proach include: (i) it only considers the num- ferred as gene set enrichment methods. Com- ber of DE genes and completely ignores the prehensive lists of gene set analysis ap- magnitude of the actual expression changes, proaches, as well as comparisons between resulting in information loss; (ii) it assumes them, can be found in well-developed sur- that genes are independent, which they are veys (Emmert-Streib & Glazko, 2011; Kelder, not (since the pathways are graphs that de- Conklin, Evelo, & Pico, 2010; Khatri, Sirota, scribe precisely how these genes influence & Butte, 2012; Misman et al., 2009). While each other); and (iii) it ignores the interactions useful for the purpose for which they have between various genes and modules. Func- been developed—to analyze sets of genes— tional Class Scoring (FCS) approaches, such these methods completely ignore the topol- as Gene Set Enrichment Analysis (GSEA; ogy and interaction between genes. Topology- Subramanian et al., 2005) and Gene Set Anal- based approaches, which fully exploit all ysis (GSA; Efron & Tibshirani, 2007), have the knowledge about how genes interact as been developed to address some of the is- described by pathways, have been devel- sues raised by ORA approaches. The main oped more recently. The first such techniques improvement of FCS is the observation that were ScorePAGE (Rahnenfuhrer,¨ Domingues, small but coordinated changes in expression Maydt, & Lengauer, 2004) for metabolic Analyzing of functionally related genes can have signifi- pathways and Impact Analysis (Draghiciˇ Molecular cant impact on pathways. et al., 2007; Tarca et al., 2009) for signaling Interactions 8.25.5

Current Protocols in Bioinformatics Supplement 61 pathways. Figure 8.25.1 shows an example Mitrea et al. (2013) or referenced manuscripts pathway named ERBB signaling pathway,in (Table 8.25.1). which the nodes represent genes and com- pounds while the edges represent the known NETWORK-BASED PATHWAY interaction between the compounds. Gene set ANALYSIS analysis approaches are not able to account The term pathway analysis is used in a for the topological order of genes, nor are able very broad context in the literature, includ- to explain the signal propagation and mecha- ing biological network construction and in- nisms of the pathway. These approaches would ference. Here we focus on methods that are yield exactly the same significance value for able to exploit biological knowledge in pub- this pathway even if the graph were to be lic repositories, rather than on network in- completely redesigned by future discovery. ference approaches that attempt to infer or In contrast, network-based pathway analysis reconstruct pathways from molecular mea- can distinguish between this pathway and any surements. Table 8.25.1 shows the list of 34 other pathway with the same proportion of DE pathway analysis approaches, together with genes. their availability and licensing. In this sec- Here we provide a survey of 34 tion we start by describing the availability of network-based methods developed for path- each method before discussing the input for- way analysis. Our survey of commercial mat, graph model, and underlying statistical tools for pathway analysis found iPath- approaches. wayGuide (Advaita Corporation, https://www .advaitabio.com) and MetaCore (Thomson Software Availability and Reuters, https://www.thomsonreuters.com)to Implementation be topology based. Other commercial tools, We often think that the main strength of such as Ingenuity Pathway Analysis (https:// an approach lies in its novelty and www.ingenuity.com) or Genomatix (https:// efficiency. However, the implementation and www.genomatix.de), do not use pathway topol- availability of a tool have become increas- ogy information and thus are not included in ingly important for several reasons. First, soft- this review. The existence of only two com- ware availability and version control are cru- mercial tools is more evidence of the challenge cial for reproducing the experimental results faced by the developers of such methods due that were used to assess the performance of to lack of standards. the approach (Sandve, Nekrutenko, Taylor, & In this document, we categorize and com- Hovig, 2013). For this reason, many journals pare the methods based on the following request authors to make their software and data criteria: the type of input the method ac- available before accepting methods articles. cepts, graph model, and the statistical ap- Second, if a software application is not ready- proaches used to evaluate gene and pathway to-run, it is very unlikely that the intended au- changes. Under the heading “Experiment In- dience (mostly life scientists) will invest the put and Pathway Databases” below we de- time to understand and implement complex scribe the types of input the surveyed meth- . Practicality, user-friendliness, out- ods use. Under the heading “Graph Mod- put format, and type of interface are all to be els” we provide details regarding the graph considered. Depending on the desired avail- models used for the biological networks. Un- ability and intended audience, a software pack- der “Pathway Scoring Strategies” we discuss age may be implemented as standalone or Web the statistical methods used by the surveyed based. Among the 34 approaches, there are 30 methods to assess gene and pathway changes that are available either as a standalone soft- between two phenotypes. Under “Challenges ware package or Web service. in Pathway Analysis” we discuss a number Typically, standalone tools need to be in- of outstanding challenges that needed to be stalled on local machines or servers, which of- addressed in order to improve the reliabil- ten requires some administrative skills. Most ity, as well as the relevance, of the next- standalone tools depend on full or partial generation pathway analysis approaches. We copies of public pathway databases, stored discuss the key elements and limitations of the locally, and need to be updated periodically. surveyed methods without going into details Advantages of standalone tools include: (i) Network-Based of each technique. For a more detailed review instant availability that does not require Inter- Approaches for and technical descriptions of network-based net access, and (ii) the security and privacy of Pathway Level pathway analysis methods, please refer to Analysis the experimental data. Web-based tools, on the 8.25.6

Supplement 61 Current Protocols in Bioinformatics other hand, run the analyses on a remote server or pathways are often represented in the form providing computational power and a graph- of graphs that capture our current knowl- ical interface. The major advantage of Web- edge about the interactions of genes, proteins, based tools is that they are user-friendly and do metabolites, or compounds in an organism. not require a separate local installation. From The pathway data is accumulated, updated, the accessibility perspective, Web-based tools and refined by amassing knowledge from sci- have the advantage of being available from any entific literature describing individual interac- location as long as there is an Internet connec- tions or high-throughput experiment results. tion and a browser available. Also, the update Table 8.25.2 shows a summary of in- is almost transparent to the client. This makes put format and mathematical modeling of the user’s task easy and enables collaboration, the surveyed approaches. Most pathway since users all over the world can utilize the analysis methods analyze data from high- same method without the burden of installing throughput experiments, such as microarrays, it or keeping it up-to-date. There are methods next-generation sequencing, or . that provide both Web-based and standalone They accept either a list of gene IDs or a list implementations. of such gene IDs associated with measured HIPAA compliance may also be a factor in changes. These changes could be measured certain applications that involve data coming with different technologies and therefore can from patients or data linked to other clinical serve as proxies for different biochemical en- data or clinical records. Currently, the iPath- tities. For instance, one could use gene ex- wayGuide is the only tool available that can pression changes measured with microarrays, do topological pathway analysis and is HIPAA or protein levels measured with a proteomic compliant. approach, etc. The programming language and style used Different analysis methods use different in- for implementation also play an important role put formats. Many methods accept a list of all in the acceptance of a method. Software tools genes considered in the experiment together that are neatly implemented and packaged are with their expression values. Some analysis more appealing compared to those that do not methods select a subset of genes, considered have ready-to-use implementations. Many of to be differentially expressed (DE), based on a the methods are implemented in the R pro- predefined cut-off. The cut-off is typically ap- gramming language and are available as soft- plied on fold-change, p-value, or both. These ware packages from Bioconductor, CRAN, or methods use the list of DE genes and their cor- the author’s Web site. Their popularity among responding (fold-change, p-value) as biologists and bioinformaticians is due to the input. Other methods use only the list of DE fact that many bioinformatics-dedicated pack- genes, without corresponding expression val- ages are available in R. In addition, the rigor- ues, because their scoring methods are based ous review procedure provided by Bioconduc- only on the relative positions of the genes in tor and CRAN makes the software packages the graph. more standardized and reliable. Methods which use cut-offs are sensitive to the chosen threshold value, because a small Experiment Input and Pathway change in the cut-off may drastically change Databases the number of selected genes (Nam & Kim, A pathway analysis approach typically re- 2008). In addition, they typically use the most quires two types of input: experiment data and significant genes and discard the rest whose known biological networks. Experiment data weaker but coordinated changes may also have is usually collected from high-throughput ex- significant impact on pathways. Genes with periments that compare a condition phenotype moderate differential expression may be lost, to a control phenotype. A condition pheno- even though they might be important play- type can be a disease, a drug treatment, or ers in the impacted pathways (Ben-Shaul, the knock-out (KO) of a gene, while a con- Bergman, & Soreq, 2005). Furthermore, the trol phenotype can be a healthy state, a dif- genes included in the set of DE genes can ferent disease or drug treatment, or wild-type vary dramatically if the selection methods are (non-KO) samples. Experimental data can be changed. Hence, the results of pathway anal- obtained from multiple technologies that pro- yses based on DE genes may be vastly differ- duce different types of data: gene expression, ent depending on both the selection method as protein abundance, metabolite concentration, well as the threshold value (Pan, Lih, & Co- Analyzing miRNA expression, etc. Biological networks hen, 2005). Furthermore, for the same disease, Molecular Interactions 8.25.7

Current Protocols in Bioinformatics Supplement 61 Table 8.25.2 A Summary of the Experimental Data Input Format and Biological Network Databases Used by the Surveyed Methods is Presented

Method Experiment inputa Pathway databaseb Graph modelc Pathway scoring Signaling pathway analysis using gene expression MetaCore DE genes Proprietary canonical Single-type, directed Hierarchically pathway, genome-scale aggregated network Pathway-Express DE genes change, or KEGG signaling Single-type, directed Hierarchically measured genes aggregated changed PathOlogist Measured genes KEGG Multi-type, directed Hierarchically expression aggregated iPathwayGuide DE genes change, or KEGG signaling, Single-type, directed Hierarchically measured genes Reactome, NCI, aggregated change BioCarta SPIA DE genes change KEGG signaling Single-type, directed Hierarchically aggregated NetGSA Measured genes KEGG signaling Single-type, directed Multivariate analysis expression PWEA Measured genes YeastNet Single-type, Hierarchically expression undirected aggregated TopoGSA DE genes PPI network, KEGG Single-type, Hierarchically undirected aggregated TopologyGSA Measured genes NCI-PID Single-type, Multivariate analysis expression undirected DEGraph Measured genes KEGG Single-type, Multivariate analysis expression undirected GGEA Measured genes KEGG Single-type, directed Aggregate fuzzy expression similarity BPA Measured genes NCI-PID Single-type, DAG Bayesian network expression—with cut-off GANPA DE genes change, or PPI network, KEGG, Single-type, Hierarchically measured genes Reactome, NCI-PID, undirected aggregated expression HumanCyc ROntoTools DE genes change, or KEGG signaling Single-type, directed Hierarchically measured gene aggregated expression BAPA-IGGFD Measured genes Literature-based Single-type, DAG Bayesian network expression - with interaction database; cut-off KEGG, WikiPathways; Reactome; MSigDB; GO BP; PANTHER; constructed gene association network from PPIs; co-annotation in GO Biological Process (BP); and co-expression in microarray data

continued 8.25.8

Supplement 61 Current Protocols in Bioinformatics Table 8.25.2 A Summary of the Experimental Data Input Format and Biological Network Databases Used by the Surveyed Methods is Presented, continued

Method Experiment inputa Pathway databaseb Graph modelc Pathway scoring CePa DE genes expression, NCI-PID Single-type, directed Hierarchically or measured genes aggregated expression THINK-Back-DS DE genes change, KEGG, PANTHER, Single-type, directed Hierarchically measured genes BioCarta, Reactome, aggregated expression GenMAPP TBScore DE genes change KEGG signaling Single-type, directed Hierarchically aggregated ACST Measured genes KEGG signaling Single-type, directed Hierarchically expression aggregated EnrichNet DE genes list PPI network, KEGG, Single-type, Hierarchically BioCarta, undirected aggregated WikiPathways, Reactome, NCI-PID, InterPro, GO with STRING 9.0 clipper Measured genes BioCarta, KEGG, Single-type, directed Multivariate analysis expression NCI-PID, Reactome DEAP Measured genes KEGG, Reactome Single-type, directed Hierarchically expression aggregated DRAGEN Measured genes RegulonDB, M3D, Single-type, directed Linear regression expression HTRIdb, ENCODE, MSigDB ToPASeqe Measured genes KEGG Single-type, directed Hierarchical & expression multivariate pDis DE genes change, or KEGG signaling Single-type, directed Hierarchically measured genes aggregated changed SPATIAL DE genes change KEGG signaling Single-type, directed Hierarchically aggregated BLMAf Measured genes KEGG signaling Single-type, directed Hierarchically expression aggregated Signaling pathway analysis using multiple types of data PARADIGM Measured genes Constructed PPI Multi-type, directed Hierarchically expression, copy networks from MIPS, aggregated number, and proteins DIP, BIND, HPRD, levels IntAct, and BioGRID microGraphite Measured gene BioCarta, KEGG, Single-type, directed Multivariate analysis expression, and NCI-PID, Reactome miRNA expression mirIntegrator DE genes change, KEGG signaling, Single-type, directed Hierarchically and DE miRNA miRTarBase aggregated change Metabolic pathway analysis ScorePAGE Measured genes KEGG metabolic Single-type, Hierarchically expression undirected aggregated

continued 8.25.9

Current Protocols in Bioinformatics Supplement 61 Table 8.25.2 A Summary of the Experimental Data Input Format and Biological Network Databases Used by the Surveyed Methods is Presented, continued

Method Experiment inputa Pathway databaseb Graph modelc Pathway scoring TAPPA Measured genes KEGG metabolic Single-type, Hierarchically expression undirected aggregated MetPA DE metabolites KEGG metabolic Single-type, directed Hierarchically change aggregated MetaboAnalyst DE genes change and KEGG metabolic Single-type, directed Hierarchically DE metabolites aggregated change aExperiment input shows the experiment data input format for each method. “DE” means differentially expressed. “Change” means fold-change value or t-statistics when comparing gene/metabolites values between two phenotypes. “Measured” means the list of all the genes/metabolites measured in the experiment; “List” represents a list of genes/metabolites identifiers (e.g., symbols). “With cut-off” show methods that take as input the list of all measured genes and in the analysis they mark the DE genes. bPathway database shows the name of the knowledge source for biological interactions. cGraph model is a characteristic that shows if the graph has one or multiple types of nodes as well as if directed or undirected. dThe package ROntoTools (Bioconductor) allows for analysis using expression values of all genes. eToPASeq provides an R package that runs TopologyGSA, DEGraph, clipper, SPIA, TBScore, PWEA, TAPPA. fBLMA provides an R package for bi-level meta-analysis that runs SPIA, ORA, GSA, and PADOG using one or multiple expression datasets.

independent studies or measurements often (over 1200 citations to date for the two impact produce different sets of differentially ex- analysis papers mentioned above; Draghiciˇ pressed (DE) genes (Ein-Dor, Kela, Getz, et al., 2007; Tarca et al., 2009) and over 30 Givol, & Domany, 2005; Ein-Dor, Zuk, & Do- other methods have been developed since. many, 2006; Tan et al., 2003). This makes ap- There are other types of biological net- proaches that use DE genes as input appear works that incorporate genome-wide inter- even more unreliable. actions between genes or proteins such as Usually, pathways are sets of genes and/or protein-protein interaction (PPI) networks. gene products that interact with each other in These networks are not restricted to specific a coordinated way to accomplish a given bi- biological functions. The main caveat related ological function. A typical signaling path- to PPI data is that most such data are obtained way (in KEGG for instance) uses nodes to from a bait-prey laboratory assays, rather than represent genes or gene products and edges from in vivo or in vitro studies. The fact that to represent signals, such as activation or re- two proteins stick to each other in an assay pression, that go from one gene to another. performed in an artificial environment can be A typical metabolic pathway uses nodes to misleading, since the two proteins may never represent biochemical compounds and edges be present at the same time in the same tissue to represent reactions that transform one or or the same part of the cell. more compound(s) into one or more other Publicly available curated pathway compounds. These reactions are usually car- databases used by methods listed in Table ried out or controlled by enzymes, which are 8.25.2 are KEGG (Ogata et al., 1999), in turn coded by genes. Hence, in a metabolic NCI-PID (Schaefer et al., 2009), BioCarta pathway, genes or gene products are associated (https://www.biocarta.com), WikiPathways with edges rather than nodes, as in a signal- (Pico et al., 2008), PANTHER (Mi et al., ing pathway. The immediate consequence of 2005), and Reactome (Joshi-Tope et al., this difference is that many techniques cannot 2005). These knowledge bases are built by be applied directly on all available pathways. manually curating experiments performed in This is why the analysis of metabolic pathways different cell types under different conditions. is still generally and arguably underdeveloped These curated knowledgebases are more (only 128 Google Scholar citations to date for reliable than protein interaction networks the original ScorePage paper; Rahnenfuhrer¨ but do not include all known genes and et al., 2004) and there are not many other their interactions. As an example, despite Network-Based methods available for metabolic pathway anal- being continuously updated, KEGG includes Approaches for ysis. However, the topology-based analysis of only about 5,000 human genes in signaling Pathway Level signaling pathways has been very successful pathways while the number of protein-coding Analysis 8.25.10

Supplement 61 Current Protocols in Bioinformatics Figure 8.25.2 Five biological network graph models used by public databases are displayed. The KEGG database contains both signaling and metabolic networks. Signaling networks have genes/gene products as nodes and regulatory signals as edges. Types of regulatory signals include activation, inhibition, phosphorylation, and many others. Metabolic networks have bio- chemical compounds as nodes and chemical reactions as edges. Enzymes (specialized pro- teins/gene products) catalyze biochemical reactions; therefore, genes are linked to the edges in these networks. The Reactome database is a collection of biochemical reactions that are grouped in functionally related sets to form a pathway. There are two types of nodes: biochemical com- pounds and reactions. Reaction nodes link biochemical compounds as reactants and products. NCI-PID signaling networks also have two types of nodes: component nodes and process nodes. Component nodes are usually biomolecular components. Process nodes are usually biochemical reactions or biological processes. In these networks, process nodes link two or more component nodes through directed edges. Process nodes are assigned one of the following states: positive or negative regulation, or involved in. Protein-protein interaction (PPI) networks have proteins as nodes and physical binding as edges or interactions. Two-hybrid assays are typically used to determine protein-protein interactions. PPIs can be directed in the bait-prey orientation when the bait-prey relation is considered (top), or undirected when is not (bottom). The Biological Pathway Exchange (BioPAX) format has physical entities as nodes and conversions as edges. The ad- vantage of this representation is that it is generic and provides a lot of flexibility to accommodate various types of interactions. In addition, it provides a machine-readable standard that can be used for all databases to provide their data in a unified format. Network nodes can be genes, gene products, complexes, or non-coding RNA. Network edges can be assembly of a complex, disassembly of a complex, or biochemical reactions, among others. genes is estimated to be between 19,000 and pare the pros and cons of existing knowledge 20,000 (Ezkurdia et al., 2014). bases. It is completely up to researchers to The implementation of analysis methods choose the tools that are able to work with constrains the software to accept a specific in- the database they trust. Intuitively, methods put pathway data format, while the underlying that are able to exploit the complementary in- graph models in the methods are independent formation available from different databases of the input format. Regardless of the pathway have an edge over methods that work with one format, this information must be parsed into a single database. computer-readable graph data structure before being processed. The implementation may in- Graph Models corporate a parser, or this may be up to the user. Pathway analysis approaches use two major For instance, SPIA accepts any signaling path- graph models to represent biological networks way or network if it can be transformed into an obtained from knowledge bases: (i) single- adjacency matrix representing a directed graph type and (ii) multiple-type. The first model where all nodes are components and all edges allows only one type of node, i.e., a gene or are interactions. NetGSA is similarly flexible protein, with edges representing molecular in- with regard to signaling and metabolic path- teractions between the nodes (KEGG signal- ways. SPIA provides KEGG signaling path- ing in Fig. 8.25.2). Models that contain di- ways as a set of pre-parsed adjacency matri- rected graphs are more suitable for analyses ces. The methods described in this chapter may that include signal propagation of gene pertur- be restricted to only one pathway database, or bation. The second graph model allows mul- may accept several. tiple type of nodes, such as components and Analyzing To the best of our knowledge, there have interactions (NCI-PID in Fig. 8.25.2). Multi- Molecular Interactions been no comprehensive approaches to com- type graph models are more complex than 8.25.11

Current Protocols in Bioinformatics Supplement 61 single-type, but they are expected to capture tion. Biological entity nodes are DNA, mRNA, more pathway characteristics. For example, protein, and active protein. The interaction single-type models are limited when trying nodes are transcription, translation, or pro- to describe “all” and “any” relations between tein activation, among others. Biological en- multiple components that are involved in the tity and interaction node values are derived same interaction. Bipartite graphs, which con- from these data and specify the of tain two types of nodes and allow connection the node being active. These are the hidden only between nodes of different types, are a states of the model. From the mathematical particular case of multi-type graph models. model perspective, models that allow multiple The majority of analysis methods surveyed types of nodes—component nodes and inter- here use a single-type graph model. Some ap- action nodes—are more flexible and are able to ply the analysis on a directed or un-directed model both AND and OR gates, which are very single-type network built using the input path- common when describing cellular processes. way, while others transform the pathways into graphs with specific characteristics. An ex- Pathway Scoring Strategies ample of the later is TopologyGSA, which The goal of the scoring method is to com- transforms the directed input pathway into pute a score for each pathway based on the ex- an undirected decomposable graph, which has pression change and the graph model, resulting the advantage of being easily broken down in a ranked list of pathways or sub-pathways. into separate modules (Lauritzen, 1996). In There are a variety of approaches to quantify this method, decomposable graphs are used the changes in a pathway. Some of the analysis to find “important” submodules—those that methods use a hierarchically aggregated scor- drive the changes across the whole pathway. ing algorithm, where on the first level a score For each pathway, TopologyGSA creates an is calculated and assigned to each node or pair undirected moral graph from the underlying of nodes (component and/or interaction). On directed acyclic graph (DAG) by connecting the second level, these scores are aggregated to the parents of each child and removing the compute the score of the pathway. On the last edge direction. [The moral graph of a DAG level, the statistical significance of the path- is the undirected graph created by adding an way score is assessed using univariate hypoth- (undirected) edge between all parents of the esis testing. Another approach assigns a ran- same node (sometimes called marrying), and dom variable to each node, and a multivariate then replacing all directed edges by undirected probability distribution is calculated for each edges. The name stems from the fact that, in pathway. The output score can be calculated in a moral graph, two nodes that have a com- two ways. One way is to use multivariate hy- mon child are required to be married by shar- pothesis testing to assess the statistical signif- ing an edge.] The moral graph is then used icance of changes in the pathway distribution to test the hypothesis that the underlying net- between the two phenotypes. The other way is work is changed significantly between the two to estimate the distribution parameters based phenotypes. If the the research hypothesis is on the Bayesian network model and use this rejected, a decomposable/triangulated graph is distribution to compute a probabilistic score to generated from the moral graph by adding new measure the changes. In this section, we pro- edges. This graph is broken into the maximal vide details regarding the scoring algorithms possible submodules and the hypothesis is re- of the surveyed methods. See Figure 8.25.3 for tested on each of them. scoring algorithms categories. PathOlogist and PARADIGM are the two surveyed methods that use multi-type Hierarchically aggregated scoring graph models. PathOlogist uses a bipartite The workflow of the hierarchically aggre- graph model with component and interac- gated scoring strategy has three levels: node tion nodes. PARADIGM, conceptually moti- statistic computation, pathway statistic com- vated by the central dogma of molecular bi- putation and the evaluation of the significance ology, takes a pathway graph as input and for the pathway statistic. converts it into a more detailed graph, where each component node is replaced by sev- Node level scoring eral more specific nodes: biological entity Most of the surveyed methods incorpo- Network-Based nodes, interaction nodes, and nodes contain- rate pathway topology information in the node Approaches for ing observed experiment data. The observed scores. The node level scoring can be di- Pathway Level experiment nodes could in principle contain vided into four categories: (i) graph mea- Analysis gene-expression and copy-number informa- sures (centrality), (ii) similarity measures, 8.25.12

Supplement 61 Current Protocols in Bioinformatics Figure 8.25.3 A summary of the statistical models is presented for the surveyed methods. Most methods use the hierarchically aggregated scoring strategy, in which the score is computed at the node level before being aggregated at the pathway level for significance assessment. In the left panel, the rows show a node-level statistical model while the columns show the aggregating strategies (linear, non-linear, or weighted gene set). Approaches in multivariate analysis and Bayesian network categories use multivariate modeling and Bayesian network, respectively, to find pathways that are most likely to be impacted. Both BLMA and ToPASeq provide R packages that run multiple algorithms. DRAGEN follows a linear regression strategy while GGEA applies fuzzy modeling to rank the pathways.

(iii) probabilistic graphical models, and (iv) Methods in this category include MetaCore, normalized node value (NNV). Approaches in SPIA, iPathwayGuide, MetPA, SPATIAL, To- the first category use centrality measures or a poGSA, DEAP, pDis, mirIntegrator, Pathway- variation of these measures to score nodes in a Express, and BLMA. given pathway. Centrality measures represent Approaches in the second category use the importance of a node relative to all other similarity measures in their node level scor- nodes in a network. There are several central- ing. Similarity measures estimate the co- ity measures that can be applied to networks expression, behavioral similarity, or co- of genes and their interactions: degree central- regulation of pairs of components. Their ity, closeness, between-ness, and eigenvector values can be correlation coefficients, covari- centrality. Degree centrality accounts for the ances, or dot products of the gene expression number of directed edges that enter and leave profile across time or samples. In these meth- each node. Closeness sums the shortest dis- ods, the pathways with clusters of highly cor- tance from each node to all other nodes in related genes are considered more significant. the network. Node between-ness measures the At the node level, a score is assigned to each importance of a node according to the number pair of nodes in the network which is the ra- of shortest paths that pass through it. Eigen- tio of one similarity measure over the shortest vector centrality uses the network adjacency path distance between these nodes. Thus, the matrix of a graph to determine a dominant topology information is captured in the node eigenvector; each element of this vector is a score by incorporating the shortest path dis- score for the corresponding node. Thus, each tance of the pair. Methods in this category in- score is influenced by the scores of neighbor- clude ScorePAGE and PWEA. ing nodes. In the case of directed graphs, a Approaches in the third category incorpo- node that has many downstream genes has rate the topology in the node level scoring Analyzing more influence and receives a higher score. using a probabilistic graphical model. In this Molecular Interactions 8.25.13

Current Protocols in Bioinformatics Supplement 61 model, nodes are random variables, and edges relevant pathway is the one with the greatest define the conditional dependency of the nodes difference between the pathway node score they link. For example, PARADIGM takes ob- distribution and the background score distri- served experimental data and calculates scores bution. The difference between the two distri- for all component nodes, in both observed and butions is measured by the weighted averaging hidden states, from the detailed network cre- of the difference between the two discretized ated by the method based on the input path- and normalized distributions. The averaging way. For each node score, a positive or negative method down-weights the higher distances and value denotes how likely it is for the node to emphasizes the lower distance nodes. be active or inactive, respectively. The scores Methods in the third group (weighted gene are calculated to maximize the probability of set) design scoring techniques that incorpo- the observed values. A p-value is associated rate existing gene set analysis methods, such with each score of each sample such that each as GSEA (Subramanian et al., 2005), GSA node can be tagged as significantly active, (Efron & Tibshirani, 2007), or LRPath (Sartor, significantly inactive, or not significant. For Leikauf, & Medvedovic, 2009). Pathway-level each network, a matrix of p-values is output scores can be calculated using node scores in which columns are samples and rows are that represent the topology characteristic of the component nodes. Methods in this category pathway as weight adjustments to a gene set include PARADIGM and PathOlogist. analysis method. PWEA, GANPA, THINK- Approaches in the fourth category simply Back-DS, and CePa use this approach, and we compute the score for each node using the in- refer to them as weighted gene set analysis formation obtained from the experiment input. methods. For example, TAPPA calculates the score of each node as the square root of the normalized Pathway significance assessment log gene expressions (node value) while ACST Pathway scores are intended to provide in- and TBScore calculate the node level score us- formation regarding the amount of change ing a sign statistic and log fold-change. incurred in the pathway between two phe- notypes. However, the amount of change is Pathway level scoring not meaningful by itself, since any amount of There are three different ways to compute change can take place just by chance. An as- the pathway score: (i) linear, (ii) non-linear, sessment of the significance of the measured and iii) weighted gene set. Most methods ag- changes is thus required. gregate node level statistics to pathway level TopoGSA, MetPA, and EnrichNet output statistics using linear functions such as av- scores without any significance assessment, eraging or summation. For example, iPath- leaving it up to the user to interpret the re- wayGuide computes the scores of pathway as sults. This is problematic because users do not the sum of all genes while TBScore weights have any instrument to distinguish between the pathway DE genes based on their log fold changes due to noise or random causes and change and the number of distinct DE genes meaningful changes that are unlikely to occur directly downstream of them, using a depth- just by chance and that therefore are possi- first search algorithm. bly related to the phenotype. The rest of the Approaches in the second group (non- analysis methods perform a hypothesis test for linear) use a non-linear function to compute each pathway. The null hypothesis is that the the pathway scores. For example, TAPPAcom- value of the observed statistic is due to random putes the pathway score for each sample as a noise or chance alone. The research hypothe- weighted sum of the product of all node pair sis is that the observed values are substantial scores in the pathway. The weight coefficient enough that they are potentially related to the is 0 when there is no edge between a pair. For phenotype. A p-value for calculated score is any connected node pair the weight is a sign then computed, and a user-defined threshold function, which represents joint up- or down- on the p-value is used to decide whether the regulation of the pair. Another example is En- the null hypothesis can be rejected or not for richNet, in which pathway scores measure the each pathway. Finally, a correction for multi- difference of the node score distribution for a ple comparisons should be performed. pathway and a background network/gene set Typically, pathway analysis methods com- Network-Based which consists of all pathways. At the node pute one score per pathway. The distribution of Approaches for level, the distance of all DE genes to the path- this score under the null hypothesis can be con- Pathway Level way is measured and summarized as a distance structed and compared to the observed score. Analysis distribution. The method assumes that the most However, there are often too few samples to 8.25.14

Supplement 61 Current Protocols in Bioinformatics calculate this distribution, so it is assumed the Bayesian random variable assigned to each that the distribution is known. For example, in node captures the state of a gene (DE or not). In MetaCore and many other techniques, when contrast, in BAPA-IGGFD each random vari- the pathway score is the number of DE nodes able assigned to an edge is the probability that that fall on the pathway, the distribution is up or down regulation of the genes at both ends assumed to be hypergeometric. However, the of an interaction are concordant with the type hypergeometric distribution assumes that the of interaction which can be activation or inhi- variables (genes in this case) are independent, bition. In both BPA and BAPA-IGGFD, each which is incorrect, as witnessed by the fact random variable is assumed to follow a bino- that the pathway graph structure itself is de- mial distribution whose probability of success signed to reflect the specific ways in which the follows a beta distribution. However, these two genes influence each other. Another approach methods use different approaches in represent- to identify the distribution is to use statistical ing the multivariate distribution of the corre- techniques such as the bootstrap (Efron, 1979). sponding random vector. BPA assumes that Bootstrapping can be done either at the sam- the random vector has a multinomial distribu- ple level, by permuting the sample labels, or tion, which is the generalization of the bino- at gene-set level, by permuting the the values mial distribution. In this case, the vector of the assigned to the genes in the set. success probability follows the Dirichlet dis- tribution, which is the multivariate extension Multivariate analysis and Bayesian of the beta distribution. In contrast, BAPA- network IGGFD assumes the random variables are Multivariate analysis methods mostly use independent, therefore the multivariate distri- multivariate probability distributions to com- butions are calculated by multiplying the dis- pute pathway-level statistics and these can be tributions of the random variables in the vec- grouped into two subcategories. Methods in tor. It is worth mentioning that the assumption the first category use multivariate hypothesis of independence in BAPA-IGGFD is contra- testing, while methods in the second category dicted by evidence, specifically in the case of are based on Bayesian network. edges that share nodes. NetGSA, TopologyGSA, DEGraph, clip- per, and microGraphite are methods based on Other approaches multivariate hypothesis testing. These anal- The four methods DRAGEN, GGEA, ysis methods assume the vectors of gene ToPASeq, and BLMA follow strategies that expression values in each (sub)pathway are are very different from those of the other 30 random vectors with multivariate normal dis- methods. As such, BLMA implements a bi- tributions. The information level meta-analysis approach that can be ap- is stored in the covariance matrix of the cor- plied in conjunction with any of the four sta- responding distribution. For a network, if the tistical approaches SPIA (Tarca et al., 2009), two distributions of the gene expression vec- GSA (Efron & Tibshirani, 2007), ORA (Tava- tors corresponding to the two phenotypes are zoie et al., 1999), or PADOG (Tarca, Draghici,ˇ significantly different, the network is assumed Bhatti, & Romero, 2012). The package al- to be significantly impacted when comparing lows users to perform pathway analysis with the two phenotypes. The significance assess- one dataset or with multiple datasets. Simi- ment is done by a multivariate hypothesis test. larly, ToPASeq provides an R package that The definition of the null hypothesis for the runs TopologyGSA, DEGraph, clipper, SPIA, statistical tests and the techniques to calcu- TBScore, PWEA, and TAPPA. This method late the parameters of the distributions are the models the input (biological networks and ex- main differences between these three analysis periment data) in such a way that it can be used methods. for any of the seven different analyses. BPA and BAPA-IGGFD are two methods DRAGEN is an analysis method that scores based on Bayesian networks. In a Bayesian the interactions rather then the genes and uses network, which is a special case of proba- a regression model to detect differential regu- bilistic graphical models, a random variable lation. DRAGEN fits each edge of a pathway is assigned to each node of a directed acyclic into two linear models (for case and control) (DAG) graph. The edges in the graph represent and then computes a p-value that represents the conditional between nodes, the difference between the two models. For so that the children are independent from each each pathway, a summary statistic is computed Analyzing other and the rest of the graph when condi- by combining the p-values of the edges us- Molecular tioned on the parents. In BPA, the value of ing a weighted Fisher’s method (Fisher, 1925). Interactions 8.25.15

Current Protocols in Bioinformatics Supplement 61 GGEA, on the other hand, uses Petri Net (Mu- patient to another. Understanding the specific rata, 1989) to model the pathway. The sum- pathways that are impacted in a given pheno- mary statistic, named consistency, is computed type or sub-group of patients should be another from the fuzzy similarity between the observed goal for the next generation of pathway anal- gene expression and Petri net with fuzzy logic ysis tools. See Khatri et al. (2012) for a more (PNFL; Kuffner,¨ Petri, Windhager, & Zimmer, detailed discussion of annotation limitations 2010). For both DRAGEN and GGEA, the p- of existing knowledge bases. value of the pathway is calculated by com- Here, we focus on challenges of pathway paring the observed summary statistic against analysis from computational perspectives. We the null distribution that is constructed by demonstrate that there is a systematic bias in permutation. pathway analysis (Nguyen, Mitrea, Tagett, & Draghici,˘ 2017). This leads to the unreliability of most if not all pathway analysis approaches. CHALLENGES IN PATHWAY We also discuss the lack of benchmark datasets ANALYSIS or pipelines to assess the performance of ex- Pathway analysis has become the first isting approaches. choice for gaining insights into the underly- ing biology of a phenotype due its explanatory Systematic Bias of Pathway Analysis power. However, there are outstanding annota- Methods tion and methodological limitations that have Pathway analysis approaches often rely on not been addressed (Kelder, Conklin, Evelo, hypothesis testing to identify the pathways that & Pico, 2010; Khatri et al., 2012). There are are impacted under the effects of different dis- three main limitations of current knowledge eases. Null distributions are used to model bases. First, existing knowledge bases are un- populations so that statistical tests can deter- able to keep up with the information available mine whether an observation is unlikely to oc- in data obtained from recent technologies. For cur by chance. In principle, the p-values pro- example, RNA-Seq data allows us to iden- duced by a sound statistical test must be uni- tify transcripts that are active under certain formly distributed under the null hypothesis conditions. Alternatively spliced transcripts, (Barton, Crozier, Lillycrop, Godfrey, & Inskip, even if they originate from the same gene, 2013; Bland, 2013; Fodor, Tickle, & Richard- may have distinct or even opposite functions son, 2007; Storey & Tibshirani, 2003). For ex- (Wang et al., 2008). However, most knowl- ample, the p-values that result from comparing edge bases provide pathway annotation only two groups using a t-test should be distributed at the gene level. Second, there is a lack of uniformly if the data are normally distributed condition and cell-specific information, i.e., (Bland, 2013). When the assumptions of sta- information about cell type, conditions, and tistical models do not hold, the resulting p- time points. Finally, current pathway annota- values are not uniformly distributed under the tions are neither complete nor perfectly accu- null hypothesis. This makes classical methods, rate (Khatri et al., 2012; Rhee, Wood, Dolinski, such as t-test inaccurate since gene expression &Draghici,˘ 2008). For example, the number of values do not necessary follow their assump- genes in KEGG have remained around 5,000 tions. Here, we also show that the problem is despite being updated continuously in the past extended to pathway analysis, as the pathway 10 years. while there are approximately 19,000 p-values obtained from statistical approaches human genes annotated with at least one GO are not uniformly distributed under the null term. The number of protein-coding genes is hypothesis. This might lead to severe bias to- estimated to be between 19,000 and 20,000 wards well-studied diseases, such as cancer, (Ezkurdia et al., 2014), most of which are and thus make the results unreliable (Nguyen included in DNA microarray assays, such as et al., 2017). Affymetrix HG U133 plus 2.0. Consider three pathway analysis methods Another challenge is the oversimplification that represent three different classes of that characterizes many of the models pro- methods for pathway analysis: Gene Set vided by pathway databases. In principle, each Analysis (GSA; Efron & Tibshirani, 2007) type of tissue might have different mecha- is a Functional Class Scoring method (Efron nisms, so generic, organism-level pathways & Tibshirani, 2007; Mootha et al., 2003; Network-Based present a somewhat simplistic description of Subramanian et al., 2005; Tarca et al., Approaches for the phenomena. Furthermore, signaling and 2012), Down-weighting of Overlapping Pathway Level metabolic processes can also be different from Genes (PADOG; Tarca et al., 2012) is an Analysis one condition to another, or even from one enrichment method (Beiẞbarth & Speed, 8.25.16

Supplement 61 Current Protocols in Bioinformatics 2004; Draghiciˇ et al., 2003; Khatri, Draghici,˘ (A0, B0, C0) shows the cumulative p-values Ostermeier, & Krawetz, 2002), and Sig- from all KEGG signaling pathways. The small naling Pathway Impact Analysis (SPIA; panels, six per method, display extreme ex- Tarca et al., 2009) is a topology-aware amples of nonuniform p-value distributions method (Draghiciˇ et al., 2007; Tarca et al., for specific pathways. For each method, we 2009). To simulate the null distribution, we show three distributions severely biased to- download and process the data from nine wards zero (e.g., A1-A3), and three distri- public datasets: GSE14924 CD4, GSE14924 butions severely biased towards one (e.g., CD8, GSE17054, GSE12662, GSE57194, A4-A6). GSE33223, GSE42140, GSE8023, and These results show that, contrary to gen- GSE15061. Using 140 control samples from erally accepted beliefs, the p-values are not the nine datasets, we simulate 40,000 datasets uniformly distributed for the three methods as follows. We randomly label 70 samples as considered. Therefore, one should expect a control samples and the remaining 70 samples very strong and systematic bias in identifying as disease samples. We repeat this procedure significant pathways for each of these meth- 10,000 times to generate different groups of ods. Pathways that have p-values biased to- 70 control and 70 disease samples. To make wards zero will often be falsely identified as the simulation more general, we also create significant (false positives). Likewise, path- 10,000 datasets consisting of 10 control and ways that have p-values biased towards one are 10 disease samples, 10,000 datasets consisting likely to rarely meet the significance require- of 10 control and 20 disease samples, and ments, even when they are truly implicated 10,000 datasets consisting of 20 control in the given phenotype (false negatives). Sys- and 10 disease samples. We then calculate tematic bias, due to nonuniformity of p-value the p-values of the KEGG human signaling distributions, results in failure of the statisti- pathways using each of the three methods. cal methods to correctly identify the biological The effect of combining control (i.e., pathways implicated in the condition, and also healthy) samples from different experiments leads to inconsistent and incorrect results. is to uniformly distribute all sources of bias among the random groups of samples. If we compare groups of control samples based on Querying Data from Knowledge Bases experiments, there could be true differences Independent research groups have tried due to batch effects. By pooling them together, different strategies to model complex bio- we form a population which is considered the molecular phenomena. These independent ef- reference population. This approach is simi- forts have led to variation among pathway lar to selecting from a large group of people databases, complicating the task of devel- that may contain different sub-groups (e.g., oping pathway analysis methods. Depending different ethnicities, gender, race, life style, on the database, there may be differences or living conditions). When we randomly se- in information sources, experiment interpre- lect samples (for the two random groups to tation, models of molecular interactions, or be compared) from the reference population, boundaries of the pathways. Therefore, it is we expect all bias (e.g., ethnic subgroups) to possible that pathways with the same desig- be represented equally in both random groups, nation and aiming to describe the same phe- and therefore, we should see no difference be- nomena may have different topologies in dif- tween these random groups, no matter how ferent databases. As an example, one could many distinct ethnic subgroups were present compare the insulin signaling pathways of in the population at large. Therefore, the p- KEGG and BioCarta. BioCarta includes fewer values of a test for difference between the two nodes and emphasizes the effect of insulin on randomly selected groups should be equally transcription, while KEGG includes transcrip- probable between zero and one. tion regulation as well as apoptosis and other Figure 8.25.4 displays the empirical null biological processes. Differences in graph distributions of p-values using PADOG, GSA, models for molecular interactions are partic- and SPIA. The horizontal axes represent p- ularly apparent when comparing the signal- values while the vertical axes represent p- ing pathways in KEGG and NCI-PID. While value densities. Green panels (A0-A6) show KEGG represents the interaction information p-value distributions from PADOG, while blue using the directed edges themselves, NCI-PID (B0-B6) and purple (C0-C6) panels show p- introduces “process nodes” to model interac- Analyzing value distributions from GSA and SPIA, re- tions (see Fig. 8.25.2). Developers are facing Molecular Interactions spectively. For each method, the larger panel the challenge of modifying methods to accept 8.25.17

Current Protocols in Bioinformatics Supplement 61 Figure 8.25.4 The empirical distributions of p-values using: Down-weighting of Overlapping Genes (PADOG; top), Gene Set Analysis (GSA; middle), and Signaling Pathway Impact Analysis (SPIA; bottom). The distributions are gener- ated by re-sampling from 140 control samples obtained from 9 AML datasets. The horizontal axes display the p-values, while the vertical axes display the p-value densities. Panels A0-A6 (green) show the distributions of p-values from PADOG; panels B0-B6 (blue) show the distribution of p-values from GSA; panels C0-C6 (purple) show the distribution of p-values from SPIA. The large panels on the left, A0, B0, and C0, display the distributions of p-values cumulated from all KEGG signaling pathways. The smaller panels on the right display the p-value distributions of selected individual pathways, which are extreme cases. For each method, the upper three distributions, for example A1-A3, are biased towards zero and the lower three distributions, for example A4-A6, are biased towards one. Since none of these p-value distributions are uniform, there will be systematic bias in identifying significant pathways using any one of the methods. Pathways that have p-values biased towards zero will often be falsely identified as significant (false positives). Likewise, pathways that have p-values biased towards one are more likely to be among false negative results even if they may be implicated in the given phenotype.

novel pathway databases or modifying the ac- KEGG Markup Language (KGML), Biolog- tual pathway graphs to conform to the method. ical Pathway Exchange (BioPAX)Level2 Pathway databases not only differ in the and Level 3, System Biology Markup Lan- way that interactions are modeled, but their guage (SBML), and the Biological Connec- Network-Based data are provided in different formats as tion Markup Language (BCML; Beltrame Approaches for well (Chuang, Hofree, & Ideker, 2010). Com- et al., 2011). The NCI provides a uni- Pathway Level mon formats are Pathway Interaction Database fied assembly of BioCarta and Reactome, as Analysis eXtensible Markup Language (PID XML), well as their in-house “NCI- curated 8.25.18

Supplement 61 Current Protocols in Bioinformatics pathways,” in NCI-PID format (Schaefer et al., tive way. The lower the rank and the p-value 2009). In order to unify pathway databases, of the target pathway in the method output, pathway information should be provided in a the better the method. This approach has sev- commonly accepted format. eral important advantages including the fact Another challenge is that the same biologi- that it is reproducible and completely objec- cal pathways are represented differently from tive, and relies on more than just a couple of one pathway database to another. None of the data sets that are assessed by the authors using tools is compatible with all database formats, the literature. This approach was also used by requiring either modification of pathway input Bayerlova et al. (2015), using a different set of or alteration of the underlying algorithm in or- 36 real datasets, as well by other, more recent der to accommodate the differences. As an ex- papers. ample, a study by Vaske et al. (2010) attempts However, in spite of its advantages and to compare SPIA (Tarca et al., 2009) with their great superiority compared to the usual method tool PARADIGM by re-implementing SPIA in of only analyzing a couple of data sets, even C and forcing its application on NCI-PID path- this benchmarking has important limitations. ways. Implementation errors are present in the First, not all conditions have a namesake path- C version of SPIA, invalidating the compar- way in existing databases or described in the ison. A solution to overcome this challenge literature. Second, complex diseases are of- could be the development of a unified glob- ten associated with not only one target path- ally accepted pathway format. Another pos- way, but with many biological processes. By sible solution is to build conversion software its nature, this assessment approach will ig- tools that can translate between pathway for- nore other pathways and their ranking, even mats. Some attempts exist to use BIO-PAX though they may be true positives. More im- (Demir et al., 2010) as the lingua franca for portantly, these approaches fail to take into this domain. consideration the systematic bias of pathway analysis approaches. In these review papers, Missing Benchmark Datasets and most of the datasets are related to cancer. As Comparison such, 28 out of 42 datasets used in Tarca et al. Newly developed approaches are typically (2013), and 26 out of 36 datasets used in Bay- assessed by simulated data or by well-studied erlova et al. (2015) are cancer datasets. In those biological datasets (Bayerlova et al., 2015; cases, methods that are biased towards cancer Varadan, Mittal, Vaske, & Benz, 2012). The are very likely to identify cancer pathways, advantage of using simulation is that the which are also the target pathways, as signifi- ground truth is known and can be used to com- cant. For this reason, the comparisons obtained pare the sensitivity and specificity of different from these reviews are likely to be biased and methods. However, simulation is often biased not reliable for assessing the performance of and does not fully reflect the complexity of liv- existing approaches. ing organisms. On the other hand, when using real biological data, the biology is never fully known. In addition, many papers presenting CONCLUSIONS new pathway analysis methods include results Pathway analysis is a core strategy of many obtained on only a couple of datasets, and re- basic research, clinical research, and transla- searchers are often influenced by the observer- tional medicine programs. Emerging applica- expectancy effect (Sackett, 1979). Thus, such tions range from targeting and modeling dis- results are not objective, and many times they ease networks to screening chemical or ligand cannot be reproduced. libraries, to identification of drug target inter- A better evaluation approach has been pro- actions for improved efficacy and safety. The posed by Tarca et al. (2013) using 42 real integration of molecular interaction informa- datasets. This approach uses a target pathway tion into pathway analysis represents a major which is the pathway describing the condi- advance in the development of mathematical tion under study available. For instance, in an techniques aimed at the evaluation of systems experiment with colon cancer versus healthy, perturbations in biological entities. the target pathway would be the colon can- This unit discussed and categorized 34 cer pathway. The datasets are chosen so that existing network-based pathway analysis ap- there is a target pathway associated with each proaches from different perspectives, includ- of the datasets. The datasets are also all pub- ing experiment input, graphical representation Analyzing lic, so other methods can be compared on the of pathways in knowledge bases, and statisti- Molecular Interactions very same data in a reproducible and objec- cal approaches to assess pathway significance. 8.25.19

Current Protocols in Bioinformatics Supplement 61 Despite being widely used, employing DE approach, as well as to validate existing knowl- genes as input makes the software sensitive to edge bases and their graphical representation cutoff parameters. Regarding the graph mod- of pathways. There have been some efforts to els, approaches using multiple types of nodes provide benchmarks for this purpose but such and bipartite graphs are more flexible and methods are not fully reliable yet. are able to model both AND and OR gates, which are very common when describing cel- lular processes. In addition, tools that are able CONFLICT-OF-INTEREST to work with multiple knowledge bases are STATEMENT Sorin Draghici is the founder and CEO expected to perform better due to their com- of Advaita Corporation, a Wayne State Uni- plementary and independent information that versity spin-off that commercializes iPath- cannot be obtained from individual databases. wayGuide, one of the pathway analysis tools We also pointed out that there has been no mentioned in this chapter. reliable benchmark to assess and rank exist- ing approaches. Some initial efforts to assess pathway analysis methods in an objective and ACKNOWLEDGMENTS reproducible way do exist, but they still fail to This work has been partially supported by take into account the statistical bias of existing the following grants: National Institutes of approaches. Health (R01 DK089167, R42 GM087013), Despite tremendous efforts in the field, National Science Foundation (DBI-0965741), there are outstanding challenges that need to be and the Robert J. Sokol, M.D. Endowed addressed. First, current pathway databases are Chair in Systems Biology. Any opinions, find- unable to provide transcript-level activity or ings, and conclusions or recommendations ex- information related to new types of data, such pressed in this material are those of the authors as SNP, mutation, or methylation. Second, in- and do not necessarily reflect the views of any complete annotation and the lack of condition- of the funding agencies. and cell-specific information hinder the accu- racy of downstream pathway analysis. Third, LITERATURE CITED variation among pathway databases and the Ansari, S., Voichit¸a, C., Donato, M., Tagett, R., & lack of a standard format in which the pathway Draghici,˘ S. (2017). A novel pathway analysis approach based on the unexplained disregulation data is provided pose a real challenge for im- of genes. Proceedings of the IEEE, 105(3), 482– plementation. Developers are facing the chal- 495. doi: 10.1109/JPROC.2016.2531000. lenge of modifying methods to accept novel Ashburner, M., Ball, C. A., Blake, J. A., Botstein, pathway databases or modifying the actual D., Butler, H., Cherry, J. M., . . . Sherlock, G. pathway graphs to conform with their method. (2000). Gene Ontology: Tool for the unifica- Fourth, there is a systematic bias due to the tion of biology. Nature Genetics, 25, 25–29. doi: fact that certain conditions, such as cancer, are 10.1038/75556. much more studied than others. Using con- Barrett, T., Wilhite, S. E., Ledoux, P., Evange- trol data obtained from nine mRNA datasets, lista, C., Kim, I. F., Tomashevsky, M., . . . Soboleva, A. (2013). NCBI GEO: Archive we showed that the p-values obtained by three for functional data sets–update. Nu- pathway analysis approaches that represent cleic Acids Research, 41(D1), D991–D995. doi: three mainstream strategies in pathway analy- 10.1093/nar/gks1193. sis (FCS, enrichment, and network-based) are Barton, S. J., Crozier, S. R., Lillycrop, K. A., not uniformly distributed under the null. Path- Godfrey, K. M., & Inskip, H. M. (2013). ways that have p-values biased towards zero Correction of unexpected distributions of P will often be falsely identified as significant values from analysis of whole genome ar- rays by rectifying violation of statistical as- (false positives). Likewise, pathways that have sumptions. BMC Genomics, 14(1), 161 doi: p-values biased towards one are likely to rarely 10.1186/1471-2164-14-161. meet the significance requirements, even when Bayerlova, M., Jung, K., Kramer, F., Klemm, they are truly implicated in the given phe- F., Bleckmann, A., & Beiẞbarth, T. (2015). notype (false negatives). Systematic bias, due Comparative study on gene set and path- to non-uniformity of p-value distributions, re- way topology-based enrichment meth- sults in failure of the statistical methods to ods. BMC Bioinformatics, 16(1), 334. doi: 10.1186/s12859-015-0751-5. correctly identify the biological pathways im- ẞ plicated in the condition, and also leads to Bei barth, T., & Speed, T. P. (2004). GOstat: Network-Based Find statistically overrepresented Gene Ontolo- Approaches for inconsistent and incorrect results. gies within a group of genes. Bioinformat- Pathway Level Finally, there is a lack of benchmarks to Analysis ics, 20, 1464–1465. doi: 10.1093/bioinformat- assess the performance of each mathematical ics/bth088. 8.25.20

Supplement 61 Current Protocols in Bioinformatics Beltrame, L., Calura, E., Popovici, R. R., Rizzetto, pression. Genomics, 81(2), 98–104. doi: L., Guedez, D. R., Donato, M., . . . Cavalieri, D. 10.1016/S0888-7543(02)00021-6. (2011). The biological connection markup lan- Draghici,ˇ S., Khatri, P., Tarca, A. L., Amin, K., guage: A SBGN-compliant format for visual- Done, A., Voichit¸a, C., . . . Romero, R. (2007). ization, filtering and analysis of biological path- A systems biology approach for pathway level ways. Bioinformatics, 27(15), 2127–2133. doi: analysis. Genome Research, 17(10), 1537–1545. 10.1093/bioinformatics/btr339. doi: 10.1101/gr.6202607. Ben-Shaul, Y., Bergman, H., & Soreq, H. (2005). Edgar, R., Domrachev, M., & Lash, A. E. (2002). Identifying subtle interrelated changes in func- Gene Expression Omnibus: NCBI gene expres- tional gene categories using continuous mea- sion and hybridization array data repository. sures of gene expression. Bioinformatics, Nucleic Acids Research, 30(1), 207–210. doi: 21(7), 1129–1137. doi: 10.1093/bioinformat- 10.1093/nar/30.1.207. ics/bti149. Efron, B. (1979). Bootstrap methods: Another look Bland, M. (2013). Do baseline p-values follow at the jackknife. The Annals of Statistics, 7(1), a uniform distribution in randomised trials? 1–26. doi: 10.1214/aos/1176344552. PloS One, 8(10), e76010. doi: 10.1371/jour- nal.pone.0076010. Efron, B., & Tibshirani, R. (2007). On testing the significance of sets of genes. The An- Bokanizad, B., Tagett, R., Ansari, S., Helmi, B. nals of Applied Statistics, 1(1), 107–129. doi: H.,&Draghici,˘ S. (2016). SPATIAL: A system- 10.1214/07-AOAS101. level PAThway impact AnaLysis approach. Nu- cleic Acids Research, 44(11), 5034–5044. doi: Efroni, S., Schaefer, C. F., & Buetow, K. H. (2007). 10.1093/nar/gkw429. Identification of key processes underlying can- cer phenotypes using biologic pathway analy- Brazma, A., Parkinson, H., Sarkans, U., Shojata- sis. PLoS One, 2(5), e425. doi: 10.1371/jour- lab, M., Vilo, J., Abeygunawardena, N., . . . nal.pone.0000425. Sansone, S-A. (2003). ArrayExpress—a public repository for microarray gene expression data Ein-Dor, L., Kela, I., Getz, G., Givol, D., & Do- at the EBI. Nucleic Acids Research, 31(1), 68– many, E. (2005). Outcome signature genes in Bioinfor- 71. doi: 10.1093/nar/gkg091. breast cancer: Is there a unique set? matics, 21(2), 171–178. doi: 10.1093/bioinfor- Calura, E., Martini, P., Sales, G., Beltrame, L., matics/bth469. Chiorino, G., D’Incalci, M., . . . Romualdi, C. (2014). Wiring miRNAs to pathways: A topo- Ein-Dor, L., Zuk, O., & Domany, E. (2006). logical approach to integrate miRNA and mRNA Thousands of samples are needed to gen- expression profiles. Nucleic Acids Research, erate a robust gene list for predicting out- In Proceedings of the National 42(11), e96. doi: 10.1093/nar/gku354. come in cancer. Academy of Sciences, 103(15), 5923–5928. doi: Cerami, E., Gao, J., Dogrusoz, U., Gross, 10.1073/pnas.0601231103. B. E., Sumer, S. O., Aksoy, B. A., . . . Larsson, E. (2012). The cBio cancer ge- Emmert-Streib, F., & Glazko, G. V. (2011). Path- nomics portal: An open platform for ex- way analysis of expression data: Deciphering ploring multidimensional cancer genomics functional building blocks of complex diseases. PLoS 7 data. Cancer Discovery, 2(5), 401–404. doi: , (5), e1002053. 10.1158/2159-8290.CD-12-0095. doi: 10.1371/journal.pcbi.1002053. Chuang, H.-Y., Hofree, M., & Ideker, T. (2010). A Ezkurdia, I., Juan, D., Rodriguez, J. M., Frankish, decade of systems biology. Annual Review of A., Diekhans, M., Harrow, J., . . . Tress, M. L. Cell and , 26, 721–744. (2014). Multiple evidence strands suggest that doi: 10.1146/annurev-cellbio-100109-104122. there may be as few as 19 000 human protein- coding genes. Human Molecular Genetics, Croft, D., Mundo, A. F., Haw, R., Milacic, M., 23(22), 5866–5878. doi: 10.1093/hmg/ddu309. Weiser, J., Wu, G., . . . D’Eustachio, P. (2014). The Reactome pathway knowledgebase. Nu- Fang, Z., Tian, W., & Ji, H. (2011). A network- cleic Acids Research, 42(D1), D472–D477. doi: based gene-weighting approach for pathway Cell Research 22 10.1093/nar/gkt1102. analysis. , (3), 565–580. doi: 10.1038/cr.2011.149. Demir, E., Cary, M. P.,Paley, S., Fukuda, K., Lemer, ´ C., Vastrik, I., . . . Bader, G. D. (2010). The Farfan, F., Ma, J., Sartor, M. A., Michai- BioPAX community standard for pathway data lidis, G., & Jagadish, H. V. (2012). THINK sharing. Nature Biotechnology, 28(9), 935–942. Back: Knowledge-based interpretation of high BMC Bioinformatics 13 doi: 10.1038/nbt.1666. throughput data. , (Suppl 2), S4. doi: 10.1186/1471-2105-13-S2-S4. Diaz, D., Donato, M., Nguyen, T., & Draghici, S. Statistical methods for re- (2016). MicroRNA-augmented pathways (mi- Fisher, R. A. (1925). search workers rAP) and their applications to pathway analy- . Edinburgh: Oliver & Boyd. sis and disease subtyping. In Pacific Symposium Fodor, A. A., Tickle, T. L., & Richardson, C. (2007). on Biocomputing. Pacific Symposium on Bio- Towards the uniform distribution of null P val- , volume 22, page 390. NIH Public ues on Affymetrix microarrays. Genome Biol- Access. ogy, 8(5), R69. doi: 10.1186/gb-2007-8-5-r69. Draghici,ˇ S., Khatri, P., Martins, R. P., Os- Gao, J., Aksoy, B. A., Dogrusoz, U., Dresdner, G., Analyzing termeier, G. C., & Krawetz, S. A. (2003). Gross, B., Sumer, Sur., . . . Schultz, N. (2013). Molecular Interactions Global functional profiling of gene ex- Integrative analysis of complex cancer genomics 8.25.21

Current Protocols in Bioinformatics Supplement 61 and clinical profiles using the cBioPortal. Sci- Stein, L. (2005). REACTOME: A knowledge- ence Signaling, 6(269), pl1. doi: 10.1126/scisig- base of biological pathways. Nucleic Acids Re- nal.2004088. search, 33(Database issue), D428–D432. doi: Gao, S., & Wang, X. (2007). TAPPA: Topolog- 10.1093/nar/gki072. ical analysis of pathway phenotype associa- Kanehisa, M., & Goto, S. (2000). KEGG: Ky- tion. Bioinformatics, 23(22), 3100–3102. doi: oto encyclopedia of genes and genomes. 10.1093/bioinformatics/btm460. Nucleic Acids Research, 28(1), 27–30. doi: Geistlinger, L., Csaba, G., Kuffner,¨ R., Mul- 10.1093/nar/28.1.27. der, N., & Zimmer, R. (2011). From sets to Kanehisa, M., Furumichi, M., Tanabe, M., Sato, Y., graphs: Towards a realistic enrichment analy- & Morishima, K. (2017). KEGG: New perspec- sis of transcriptomic systems. Bioinformatics, tives on genomes, pathways, diseases and drugs. 27(13), i366–i373. doi: 10.1093/bioinformat- Nucleic Acids Research, 45(D1), D353–D361. ics/btr228. doi: 10.1093/nar/gkw1092. Glaab, E., Baudot, A., Krasnogor, N., & Valencia, Kelder, T., Conklin, B. R., Evelo, C. T., & A. (2010). TopoGSA: Network topological gene Pico, A. R. (2010). Finding the right ques- set analysis. Bioinformatics, 26(9), 1271–1272. tions: Exploratory pathway analysis to enhance doi: 10.1093/bioinformatics/btq131. biological discovery in large datasets. PLoS Glaab, E., Baudot, A., Krasnogor, N., Schneider, Biology, 8(8), e1000472. doi: 10.1371/jour- R., & Valencia, A. (2012). EnrichNet: Network- nal.pbio.1000472. based gene set enrichment analysis. Bioinfor- Kelder, T., Conklin, B. R., Evelo, C. T., & matics, 28(18), i451–i457. doi: 10.1093/bioin- Pico, A. R. (2010). Finding the right ques- formatics/bts389. tions: Exploratory pathway analysis to enhance Greenblum, S., Efroni, S., Schaefer, C., & Buetow, biological discovery in large datasets. PLoS K. (2011). The PathOlogist: An automated tool Biology, 8(8), e1000472. doi: 10.1371/jour- for pathway-centric analysis. BMC Bioinformat- nal.pbio.1000472. ics, 12(1), 133. doi: 10.1186/1471-2105-12-133. Khatri, P., Draghici,˘ S., Ostermeier, G. C., & Gu, Z., Liu, J., Cao, K., Zhang, J., & Krawetz, S. A. (2002). Profiling gene expression Wang, J. (2012). Centrality-based pathway en- using Onto-Express. Genomics, 79(2), 266–270. richment: A systematic approach for find- doi: 10.1006/geno.2002.6698. ing significant pathways dominated by key Khatri, P., Sirota, M., & Butte, A. J. (2012). genes. BMC Systems Biology, 6(1), 56. doi: Ten years of pathway analysis: Current ap- 10.1186/1752-0509-6-56. proaches and outstanding challenges. PLoS Haynes, W. A., Higdon, R., Stanberry, L., Collins, Computational Biology, 8(2), e1002375. doi: D., & Kolker, E. (2013). Differential expres- 10.1371/journal.pcbi.1002375. sion analysis for pathways. PLoS Computational Khatri, P., Voichit¸a, C., Kattan, K., Ansari, N., Kha- Biology, 9(3), e1002967. doi: 10.1371/jour- tri, A., Georgescu, C., . . . Draghici,ˇ S. (2007). nal.pcbi.1002967. Onto-Tools: New additions and improvements Hung, J.-H., Whitfield, T. W., Yang, T.-H., Hu, in 2006. Nucleic Acids Research, 35(Web Z., Weng, Z., & DeLisi, C. (2010). Identifica- Server issue), W206–W211. doi: 10.1093/nar/ tion of functional modules that correlate with gkm327. phenotypic difference: The influence of net- Kuffner,¨ R., Petri, T., Windhager, L., & Zim- work topology. Genome Biology, 11(2), R23. mer, R. (2010). Petri nets with fuzzy logic doi: 10.1186/gb-2010-11-2-r23. (PNFL): Reverse engineering and parametriza- Ibrahim, M. A.-H., Jassim, S., Cawthorne, M. tion. PLoS One, 5(9), e12807. doi: 10.1371/jour- A., & Langlands, K. (2012). A topology- nal.pone.0012807. based score for pathway enrichment. Journal Lauritzen, S. L. (1996). Graphical models (Vol. 17). of Computational Biology, 19(5), 563–573. doi: Oxford University Press. 10.1089/cmb.2011.0182. Liberzon, A., Subramanian, A., Pinchback, R., Ihnatova, I., & Budinska, E. (2015). ToPASeq: Thorvaldsdottir,´ H., Tamayo, P., & Mesirov, An R package for topology-based path- J. P. (2011). Molecular signatures database way analysis of microarray and RNA-Seq (MSigDB) 3.0. Bioinformatics, 27(12), 1739– data. BMC Bioinformatics, 16(1), 350. doi: 1740. doi: 10.1093/bioinformatics/btr260. 10.1186/s12859-015-0763-1. Ma, S., Jiang, T., & Jiang, R. (2014). Differen- Isci, S., Ozturk, C., Jones, J., & Otu, H. H. (2011). tial regulation enrichment analysis via the in- Pathway analysis of high-throughput biolog- tegration of transcriptional regulatory network ical data within a Bayesian network frame- and gene expression data. Bioinformatics, 31(4), work. Bioinformatics, 27(12), 1667–1674. doi: 563–571. doi: 10.1093/bioinformatics/btu672. 10.1093/bioinformatics/btr269. Martini, P., Sales, G., Massa, M. S., Chiogna, M., Jacob, L., Neuvial, P., & Dudoit, S. (2010). & Romualdi, C. (2013). Along signal paths: An Gains in power from structured two-sample empirical gene set approach exploiting pathway Network-Based tests of means on graphs. Arxiv preprint topology. Nucleic Acids Research, 41(1), e19– Approaches for arXiv:1009.5173. e19. doi: 10.1093/nar/gks866. Pathway Level Joshi-Tope, G., Gillespie, M., Vastrik, I., Massa, M. S., Chiogna, M., & Romualdi, C. (2010). Analysis D’Eustachio, P., Schmidt, E., de Bono, B., . . . Gene set analysis exploiting the topology of a 8.25.22

Supplement 61 Current Protocols in Bioinformatics pathway. BMC Systems Biology, 4(1), 121. doi: sions reached during analysis of gene expres- https://doi.org/10.1186/1752-0509-4-121. sion by DNA microarrays. Proceedings of the Mi, H., Lazareva-Ulitsky, B., Loo, R., Kejariwal, National Academy of Sciences of the United A., Vandergriff, J., Rabkin, S., . . . Thomas, P. States of America, 102(25), 8961–8965. doi: D. (2005). The PANTHER database of protein 10.1073/pnas.0502674102. families, subfamilies, functions and pathways. Pico, A. R., Kelder, T., van Iersel, M. P., Hanspers, Nucleic Acids Research, 33(Suppl 1), D284– K., Conklin, B. R., & Evelo, C. (2008). D288. doi: 10.1093/nar/gki078. WikiPathways: Pathway editing for the people. Mieczkowski, J., Swiatek-Machado, K., & Kamin- PLoS Biology, 6(7), e184. doi: 10.1371/jour- ska, B. (2012). Identification of pathway nal.pbio.0060184. deregulation–gene expression based analysis of Rahnenfuhrer,¨ J., Domingues, F. S., Maydt, J., & consistent signal transduction. PLoS One, 7(7), Lengauer, T. (2004). Calculating the statistical e41541. doi: 10.1371/journal.pone.0041541. significance of changes in pathway activity from Misman, M. F., Deris, S., Hashim, S. Z. M., Jumali, gene expression data. Statistical Applications R., & Mohamad, M. S. (2009). Pathway-based in Genetics and , 3(1). doi: microarray analysis for defining statistical sig- 10.2202/1544-6115.1055. nificant phenotype-related pathways: A review Rhee, Y. S., Wood, V., Dolinski, K., & Draghici,˘ of common approaches. In Information Man- S. (2008). Use and misuse of the Gene Ontol- agement and Engineering, 2009. ICIME’09. ogy annotations. Nature Reviews Genetics, 9(7), International Conference on, pages 496–500. 509–515. doi: 10.1038/nrg2363. Piscataway, NJ: IEEE. Rustici, G., Kolesnikov, N., Brandizi, M., Bur- Mitrea, C., Taghavi, Z., Bokanizad, B., Hanoudi, dett, T., Dylag, M., Emam, I., . . . Sarkans, U. S., Tagett, R., Donato, M., . . . Draghici,˘ (2013). ArrayExpress update–trends in database S. (2013). Methods and approaches in the growth and links to data analysis tools. Nu- topology-based analysis of biological path- cleic Acids Research, 41(D1), D987–D990. doi: ways. Frontiers in Physiology, 4, 278. doi: 10.1093/nar/gks1174. 10.3389/fphys.2013.00278. Sackett, D. L. (1979). Bias in analytic research. Mootha, V. K., Lindgren, C. M., Eriksson, K- Journal of Chronic Diseases, 32(1-2), 51–63. F., Subramanian, A., Sihag, S., Lehar, J., doi: 10.1016/0021-9681(79)90012-2. α . . . Groop, L. C. (2003). PGC-11 -responsive Sandve, G. K., Nekrutenko, A., Taylor, J., & Hovig, genes involved in oxidative phosphorylation E. (2013). Ten simple rules for reproducible are coordinately downregulated in human di- computational research. PLoS Computational abetes. Nature Genetics, 34(3), 267–273. doi: Biology, 9(10), e1003285. doi: 10.1371/jour- 10.1038/ng1180. nal.pcbi.1003285. Murata, T. (1989). Petri nets: Properties, analy- Sartor, M. A., Leikauf, G. D., & Medvedovic, M. sis and applications. Proceedings of the IEEE, (2009). LRpath: A logistic regression approach 77(4), 541–580. doi: 10.1109/5.24143. for identifying enriched biological groups in Nam, D., & Kim, S.-Y. (2008). Gene-set ap- gene expression data. Bioinformatics, 25(2), proach for expression pattern analysis. Brief- 211–217. doi: 10.1093/bioinformatics/btn592. ings in Bioinformatics, 9(3), 189–197. doi: Schaefer, C. F., Anthony, K., Krupa, S., Buchoff, J., 10.1093/bib/bbn001. Day, M., Hannay, T., . . . Buetow, K. H. (2009). Nguyen, T., Diaz, D., Tagett, R., & Draghici, S. PID: The pathway interaction database. Nucleic (2016). Overcoming the matched-sample bottle- Acids Research, 37(Suppl 1), D674–D679. doi: neck: An orthogonal approach to integrate omic 10.1093/nar/gkn653. data. Nature Scientific Reports, 6, 29251. doi: Shojaie, A., & Michailidis, G. (2009). Analysis of 10.1038/srep29251. gene sets based on the underlying regulatory net- Nguyen, T., Mitrea, C., Tagett, R., & Draghici,˘ S. work. Journal of Computational Biology, 16(3), (2017). DANUBE: Data-driven meta-ANalysis 407–426. doi: 10.1089/cmb.2008.0081. using UnBiased Empirical distributions— Shojaie, A., & Michailidis, G. (2010). Network applied to biological pathway analysis. Pro- enrichment analysis in complex experiments. ceedings of the IEEE, 105(3), 496–515. doi: Statistical Applications in Genetics and Molec- 10.1109/JPROC.2015.2507119. ular Biology, 9(1). doi: 10.2202/1544-6115. Nguyen, T., Tagett, R., Donato, M., Mitrea, C., 1483. & Draghici, S. (2016). A novel bi-level meta- Storey, J. D., & Tibshirani, R. (2003). Statistical analysis approach-applied to biological pathway significance for genomewide studies. Proceed- analysis. Bioinformatics, 32(3), 409–416. doi: ings of the National Academy of Sciences of the 10.1093/bioinformatics/btv588. United States of America, 100(16), 9440–9445. Ogata, H., Goto, S., Sato, K., Fujibuchi, W., doi: 10.1073/pnas.1530509100. Bono, H., & Kanehisa, M. (1999). KEGG: Subramanian, A., Tamayo, P., Mootha, V. K., Kyoto Encyclopedia of Genes and Genomes. Mukherjee, S., Ebert, B. L., Gillette, M. A., Nucleic Acids Research, 27(1), 29–34. doi: . . . Mesirov, J. P. (2005). Gene set enrichment 10.1093/nar/27.1.29. analysis: A knowledge-based approach for inter- Analyzing Pan, K.-H., Lih, C.-J., & Cohen, S. N. (2005). Ef- preting genome-wide expression profiles. Pro- Molecular fects of threshold choice on biological conclu- ceeding of The National Academy of Sciences of Interactions 8.25.23

Current Protocols in Bioinformatics Supplement 61 the Unites States of America, 102(43), 15545– ing PARADIGM. Bioinformatics, 26(12), i237– 15550. doi: 10.1073/pnas.0506580102. i245. doi: 10.1093/bioinformatics/btq182. Tan, P. K., Downey, T. J., Spitznagel, E. L. Jr, Xu, Voichit¸a, C., Donato, M., & Draghici,ˇ S. (2012). In- P.,Fu,D.,Dimitrov,D.S.,...Cam,M.C. corporating gene significance in the impact anal- (2003). Evaluation of gene expression measure- ysis of signaling pathways. In ments from commercial microarray platforms. and Applications (ICMLA), 2012 11th Interna- Nucleic Acids Research, 31(19), 5676–5684. tional Conference on, volume 1, pages 126–131, doi: 10.1093/nar/gkg763. Boca Raton, FL, USA, 12–15 December 2012. Tarca, A. L., Bhatti, G., & Romero, R. (2013). Piscataway, NJ: IEEE. A comparison of gene set analysis meth- Wang, E. T., Sandberg, R., Luo, S., Khrebtukova, I., ods in terms of sensitivity, prioritization and Zhang, L., Mayr, C., . . . Burge, C. B. (2008). specificity. PloS One, 8(11), e79217. doi: Alternative isoform regulation in human tis- 10.1371/journal.pone.0079217. sue transcriptomes. Nature, 456(7221), 470. doi: Tarca, A. L., Draghici,ˇ S., Bhatti, G., & 10.1038/nature07509. Romero, R. (2012). Down-weighting over- Xia, J., & Wishart, D. S. (2010). MetPA: lapping genes improves gene set analy- A web-based metabolomics tool for path- sis. BMC Bioinformatics, 13(1), 136. doi: way analysis and . Bioinformatics, https://doi.org/10.1186/1471-2105-13-136. 26(18), 2342–2344. doi: 10.1093/bioinformat- Tarca, A. L., Draghici,ˇ S., Khatri, P., Hassan, ics/btq418. S. S., Mittal, P., Kim, J.-S., . . . Romero, R. Xia, J., & Wishart, D. S. (2011). Web-based infer- (2009). A novel signaling pathway impact anal- ence of biological patterns, functions and path- ysis (SPIA). Bioinformatics, 25(1), 75–82. doi: ways from metabolomic data using Metabo- 10.1093/bioinformatics/btn577. Analyst. Nature Protocols, 6(6), 743. doi: Tarca, A. L., Draghici,ˇ S., Khatri, P., Hassan, S. S., 10.1038/nprot.2011.319. Mittal, P., Kim, J.-S., . . . Romero, R. (2009). A Xia, J., Sinelnikov, I. V., Han, B., & novel signaling pathway impact analysis. Bioin- Wishart, D. S. (2015). MetaboAnalyst 3.0— formatics, 25(1), 75–82. doi: 10.1093/bioinfor- making metabolomics more meaningful. Nu- matics/btn577. cleic Acids Research, 43(W1), W251–W257. Tavazoie, S., Hughes, J. D., Campbell, M. J., Cho, doi: 10.1093/nar/gkv380. R. J., & Church, G. M. (1999). Systematic de- Zhao, Y., Chen, M.-H., Pei, B., Rowe, D., Shin, D- termination of genetic network architecture. Na- G., Xie, W., . . . Kuo, L. (2012). A Bayesian ap- ture Genetics, 22, 281–285. doi: 10.1038/10343. proach to pathway analysis by integrating gene– Varadan, V., Mittal, P., Vaske, C. J., & Benz, S. gene functional directions and microarray data. C. (2012). The integration of biological path- Statistics in Biosciences, 4(1), 105–131. doi: way knowledge in cancer genomics: A review 10.1007/s12561-011-9046-1. of existing computational approaches. Signal Internet Resources Processing Magazine, IEEE, 29(1), 35–50. doi: https://www.biocarta.com 10.1109/MSP.2011.943037. BioCarta: Charting Pathways of Life. Vaske, C. J., Benz, S. C., Sanborn, J. Z., Earl, D., https://bioconductor.org/packages/release/bioc/ Szeto, C., Zhu, J., . . . Stuart, J. M. (2010). Infer- manuals/BLMA/man/BLMA.pdf ence of patient-specific pathway activities from multi-dimensional cancer genomics data us- BLMA: A package for bi-level meta-analysis. R package version 1

Network-Based Approaches for Pathway Level Analysis 8.25.24

Supplement 61 Current Protocols in Bioinformatics