Hum Genet (2011) 130:465–481 DOI 10.1007/s00439-011-0983-z

REVIEW PAPER

Integrative computational biology for cancer research

Kristen Fortney · Igor Jurisica

Received: 3 February 2011 / Accepted: 2 April 2011 / Published online: 22 April 2011 © The Author(s) 2011. This article is published with open access at Springerlink.com

Abstract Over the past two decades, high-throughput Introduction (HTP) technologies such as microarrays and mass spec- trometry have fundamentally changed clinical cancer Since the commercialization of DNA microarray technol- research. They have revealed novel molecular markers of ogy in the late 1990s, high-throughput (HTP) data relevant cancer subtypes, metastasis, and drug sensitivity and resis- to cancer research have been accumulating at an ever- tance. Some have been translated into the clinic as tools for increasing rate. These data have led to crucial insights into early disease diagnosis, prognosis, and individualized treat- fundamental cancer biology, including the mechanisms of ment and response monitoring. Despite these successes, tumorigenesis, metastasis, and drug resistance (Rhodes and many challenges remain: HTP platforms are often noisy Chinnaiyan 2005). They have also had enormous clinical and suVer from false positives and false negatives; optimal impact, e.g., several cancers can now be fractionated into analysis and successful validation require complex work- therapeutic subsets with unique prognostic outcomes Xows; and great volumes of data are accumulating at a based on their molecular phenotypes (Buyse et al. 2006; rapid pace. Here we discuss these challenges, and show Dhanasekaran et al. 2001; Lowe et al. 2010; Pegram et al. how integrative computational biology can help diminish 1998; Slamon and Press 2009; Spentzos et al. 2004; Zhu them by creating new software tools, analytical methods, et al. 2010b). Despite these successes, many cancers still and data standards. have a high mortality rate and no eVective treatment. Looking at 1.9 million patients from 31 countries and 5 continents, the CONCORD study found that current treatments achieve a 5-year survival rate for less than 50% of diagnosed can- Electronic supplementary material The online version of this cers (Coleman et al. 2008). For many cancers, survival article (doi:10.1007/s00439-011-0983-z) contains supplementary rates have not changed in decades—pancreatic cancer material, which is available to authorized users. remains almost 100% lethal, and the overall survival rate K. Fortney · I. Jurisica for lung cancer has improved only from 13% to 16%. Most Department of Medical Biophysics, cancers still lack any eVective early disease biomarkers, University of Toronto, Toronto, ON, Canada and predictive signatures are limited to a few known muta- I. Jurisica tions, such as EGFR or kRAS in lung cancer or HER2 in Department of Computer Science, breast cancer. Predictive and prognostic biomarkers are University of Toronto, Toronto, ON, Canada often inconsistent from study to study (i.e., they show poor overlap), and cannot be validated by other methods or in I. Jurisica (&) Ontario Cancer Institute, Princess Margaret Hospital, new cohorts of patients (Diamandis 2010; Dupuy and University Health Network, Toronto, ON, Canada Simon 2007; Lau et al. 2007). e-mail: [email protected] The key diYculty is that cancer is a complex and hetero- geneous disease: many are ampliWed, deleted, I. Jurisica Campbell Family Institute for Cancer Research, mutated, and up- or down-regulated. Many pathways are Toronto, ON, Canada activated or suppressed. These changes vary substantially 123 466 Hum Genet (2011) 130:465–481 in diVerent cancers, in diVerent patients with the same can- The challenges facing HTP cancer biology cer, and even in diVerent tumor samples from the same patient (Axelrod et al. 2009; Bachtiary et al. 2006; Black- A wealth of genomic and proteomic cancer data is now hall et al. 2004). To get the full picture, we will need to available from HTP screens. While these data have combine information from diverse experimental platforms improved our understanding of basic cancer biology and and other sources that oVer diVerent perspectives on the some have even translated into improved patient diagnosis problem, e.g., and protein expression, protein–protein and treatment, signiWcant challenges remain. In this section interactions (PPIs) and pathways, chromosomal aberra- we brieXy review some of the major obstacles to the pro- tions, mutation events, epigenetic changes, and clinical gress of HTP cancer research. information from drug trials and the bedside—leading to integrative computational biology. Noise The challenges fall into three main categories. The Wrst is noise: HTP platforms are inherently noisy—results vary Cancer is a heterogeneous disease, and HTP platforms are substantially from run to run and from lab to lab, and are noisy, i.e., the resulting data sets have false positives and prone to false positives and negatives. The second chal- false negatives. Consequently, there can be large variation lenge is volume: there is a vast quantity of relevant data, in results from lab to lab (Bell et al. 2009; Irizarry et al. new data are piling up at an increasing rate and old data 2005); methods are needed to control this noise and to inte- need to be constantly reinterpreted and updated in the light grate diVerent data sources to make them more reliable. The of new Wndings or reanalyzed with improved algorithms. main noise issues that plague HTP cancer research are false The third challenge is analysis: simple methods—such as negatives and false positives, intra- and inter-sample heter- diVerential expression analyses of microarray data—often ogeneity, and platform bias. miss much of the signal in the data. Integrative computational methods will continue to play False negatives and false positives in HTP data a central role in addressing these challenges. We need new designs for databases; new software and workXows to com- HTP screens suVer from substantial noise—both false bine and continuously update heterogeneous and distrib- positives and false negatives—which must be resolved uted data; new analytical methods to identify complex through complementary experiments and computational signals in diVerent data sources; and new standards for analysis (AuVray et al. 2009; Augen 2001). For example, generating, maintaining, and sharing data. These methods while HTP PPI screens can identify thousands of protein will depend on advances in many areas such as statistics, interactions at once, they do so at the cost of either high false knowledge representation and ontology, machine learning, discovery rates or poor sensitivity. The false discovery rate is data mining, graph theory, and visualization. Integrative the proportion of detected interactions that are false, and sen- analyses may ultimately lead to a better understanding of sitivity is the proportion of true interactions that are success- cancer, earlier diagnoses and true personalized medicine, fully detected. When interactions detected by two HTP where therapies are individually tailored based on combina- studies were tested in small-scale screens, they were found to tions of single nucleotide polymorphisms, and gene, protein have false discovery rates of 22% (Rual et al. 2005) and 38% and microRNA expression levels (AuVray et al. 2009; (Stelzl et al. 2005). An evaluation of Wve HTP methods Augen 2001; Cervigne et al. 2009; Reis et al. 2010; Zhu found that their sensitivity rates ranged from only 21 to 36% et al. 2010a, b). at false discovery rates of 0–11% (Braun et al. 2009). Simi- Integrative computational biology shares many tools and larly, mass spectrometry analyses of human serum typically goals with the closely related Weld of systems biology, the produce many false negatives (Gstaiger and Aebersold discipline that attempts to explain the structure and behav- 2009). One problem is that human serum has a high dynamic ior of complex biological systems as a function of their range—protein concentrations are estimated to vary over ten simpler components (Kirschner 2005). The term systems orders of magnitude (Gstaiger and Aebersold 2009). biology may be applied to describe anything from the out- Mass spectrometers have a much smaller range of detection, put of an ‘omics’ experiment to explicit mechanical models leading to false negatives: low abundance proteins are not of tumor growth (Deisboeck et al. 2009; Hornberg et al. detected. This challenge may be somewhat diminished by 2006). Integrative computational biology, in contrast, spe- extensive sample fractionation (Kislinger et al. 2006). ciWcally concerns the computational interrogation and inte- gration of diverse HTP molecular biology data. Tumors are heterogeneous In this review, we discuss the major challenges of HTP cancer studies and give examples of integrative computa- Often only a single sample from each tumor is available for tional methods that help to meet them. analysis. But tumors are highly heterogeneous, so these 123 Hum Genet (2011) 130:465–481 467 samples may not be representative of the whole tumor use every year; and there will always be new diVerences (Axelrod et al. 2009; Bachtiary et al. 2006; Blackhall et al. and conXicts to resolve. The challenge is to have infrastruc- 2004). Tumors comprise cells belonging to several distinct ture set-up to compare and integrate new technologies as subpopulations—e.g., tumor regions can be hypoxic to they come along, and systematically identify the best diVerent extents, or be made up of diVerent proportions of workXow for data processing and analysis, e.g. (Ponzielli tumor initiating cells (cancer stem cells)—and these diVer- et al. 2008). ences have consequences for predicting drug response and prognosis (Axelrod et al. 2009; Blackhall et al. 2004; Jubb Analysis et al. 2010). Intra-tumor heterogeneity can lead to diVerent cell populations expressing diVerent levels of protein, or A primary goal of integrative computational biology analy- having diVerent mutations and copy number alterations, sis in individualized medicine is to identify small groups of which complicates analyses. For example, variable tumor genes/proteins/microRNAs, etc., that can be used to epithelial and stromal cell content in breast tumor samples improve diagnosis, predict outcome or predict treatment can signiWcantly aVect gene expression proWles and signa- response, i.e., to identify prognostic or predictive signa- ture accuracy (Cleator et al. 2006; Myhre et al. 2010). tures. Their identiWcation in HTP data is challenging since Some of these diYculties can be alleviated with techniques basic analysis methods fail to capture the entire signal in such as laser-capture micro-dissection (Fend and RaVeld the data, and good signatures comprise not only the most 2000), which can isolate regions of a sample that contain a diVerentially expressed molecules. For this section we will more uniform population of cells, but these techniques focus our attention on gene expression microarray studies, remain costly and slow. In addition, they may not result in a but the criticisms apply equally well to similar experimen- suYcient amount of material for follow-up experiments. tal designs, such as those using HTP protein or microRNA Though single samples from tumors can suYce for popula- assays. tion-level studies (Axelrod et al. 2009), the fact that two samples from the same patient can be quite diVerent means Lists of diVerentially expressed genes show poor overlap that this heterogeneity poses a signiWcant challenge to per- across studies sonalized medicine. On top of that, in cancer studies there are issues with the control samples available for analysis: The most popular way of analyzing microarray data is to often these “normal” samples come from tissue directly detect whether individual genes are diVerentially expressed adjacent to the tumor. The properties of these samples, such in one condition versus in another, e.g., in non-responders as gene expression proWles, may be quite diVerent from versus responders to some drug treatment. DiVerentially those of more distant tissue or of a healthy patient (Chan- expressed genes are widely used in cancer research—their dran et al. 2005). applications include deriving molecular signatures of can- cer subtypes, increasing our understanding of the biology of DiVerent experimental platforms can disagree tumorigenesis, and providing new candidate markers for diagnosis, prognosis, and drug response. Unfortunately, Experimental platforms produced by diVerent companies there are major challenges with their identiWcation and can yield conXicting results (Curtis et al. 2009; Elias et al. interpretation. Previous work has shown that lists of diVer- 2005; Tan et al. 2003). For example, diVerent microarray entially expressed genes are poorly reproduced across stud- platforms use diVerent probe design and labeling and have ies (Tian et al. 2005; Zhang et al. 2008); even random diVerent dynamic ranges. A gene may be overexpressed in subsets of samples from one experiment can yield widely cancer on one platform, yet under-expressed on another, divergent gene lists. These problems are caused by high simply because the two platforms use diVerent DNA dimensionality, small number of samples, and noise (bio- sequences to “probe” the same gene. On a given platform, logical and technical variability), but they can be exacer- some genes are represented by many probes while others by bated by the analysis method. For example, analyses that one or none, and this representation is diVerent for diVerent quantify the diVerential expression of gene groups rather platforms. Genes that are expressed at low levels are partic- than individual genes show higher conservation across plat- ularly problematic for concordance across array platforms forms and studies (Subramanian et al. 2005). (Barnes et al. 2005). Another problem is that genome anno- tations continue to grow and change (Eggle et al. 2009). The most diVerentially expressed genes do not yield Updated probe set deWnitions can substantially aVect the the best signatures number and the identity of diVerentially expressed genes (Sandberg and Larsson 2007). Further, HTP technology Very often in microarray experiments, the most diVeren- is developing rapidly and new technologies are coming into tially expressed genes are used to construct prognostic or 123 468 Hum Genet (2011) 130:465–481 predictive signatures: machine learning methods are trained 600 Cancer & Signature Whole & Genome & Sequencing to use the expression levels of those genes to predict, e.g., Personalized medicine the disease state of a patient and probability of survival. Cancer genome The problem with this approach is that single-gene analyses 400 overlook multivariate eVects. More sophisticated analyses are needed to identify sets of genes that complement one 200 another, i.e., ones whose combined expression levels yield Publications the best-performing prognostic signatures. Such analyses show that genes essential to a good prognostic signature 0 are often not highly diVerentially expressed on their own 2000 2002 2004 2006 2008 2010 (Chuang et al. 2007; Fujita et al. 2008). Ye a r Fig. 1 The rapid accumulation of personalized medicine publications. Signatures validate poorly on other data sets Counts of associated publications from the last 10 years for several personalized medicine search terms, retrieved with MEDSUM. Year is One of the most important contributions of HTP biology to shown on the x-axis and publication count on the y-axis. The search term “cancer AND signature” is shown in blue; “whole AND genome cancer research has been to develop prognostic and predic- AND sequencing” in turquoise; “personalized medicine” in red; and tive signatures. Unfortunately, many signatures have failed “cancer genome” in green to validate by other methods or in new cohorts of patients (Lau et al. 2007; Zhu et al. 2008). Obviously, this is a great concern for personalized medicine. Existing prognostic and sequence 500 tumors from each of 50 diVerent cancers predictive biomarkers for the same condition overlap only (Hudson et al. 2010), and the Cancer Genome Atlas partially, and the set of biomarker genes identiWed depends (TCGA) will sequence more than 20 diVerent tumor types strongly on the subset of patients used to generate it in the next 5 years (Ledford 2010). Between 2008 and (Ein-Dor et al. 2005). One study estimated that to achieve 2010, the number of papers published per year containing 50% overlap in prognostic gene sets for breast cancer in their abstracts “cancer genome”, “personalized medi- patients would require several thousand samples (Ein-Dor cine”, and “whole AND genome AND sequencing” have et al. 2006). Several factors contribute to these problems, roughly doubled, and “cancer AND signature” has shown a including: (1) diverse patients and heterogeneous tumors, 50% increase (Fig. 1) [data generated using MEDSUM (2) diVerent proWling platforms, (3) diverse statistical and (Galsworthy 2009)]. bioinformatics approaches to biomarker identiWcation (Shedden et al. 2008), (4) an insuYcient number of samples Many disease-related targets remain uncharacterized (Ein-Dor et al. 2006), and (5) the existence of multiple equivalent signatures (Boutros et al. 2009; Ein-Dor et al. Many disease-related targets remain uncharacterized; we 2005). have little or no information about their protein interaction partners, the pathways they control or are aVected by, or Volume their splice variants, functional mutations, or protein struc- ture. For example, while there are roughly 21,000 genes in The pace of discovery is rapid, data are piling up, and anno- the (Clamp et al. 2007), the Reactome and tations are changing all the time. We need integrated work- KEGG Pathway databases contain pathway annotations for Xows to organize data sets and make them easy to access, only about 4,000 of these (Kanehisa et al. 2008; Matthews incorporate, annotate and update. et al. 2009). And while many studies have shown that inter- action networks provide information that can substantially Large amounts of data need to be integrated improve predictive signatures (Chuang et al. 2007; Fortney and maintained et al. 2010; Nibbe et al. 2010), they can do so only for genes represented in the interactome. There are several examples of the rapid increase of knowl- We used the I2D interaction database (Brown and Juri- edge and data. The Gene Expression Omnibus (GEO) cur- sica 2005, 2007; Niu et al. 2010), to investigate how many rently contains over 490,000 microarray samples (Edgar cancer-associated genes remain uncharacterized in terms of et al. 2002). Roughly 100 cancer genomes have been their protein interactions, combining known interactions sequenced so far—most of these just within the past year— and interologs (interactions in other species that are trans- and several major projects are underway, which should see ferred to humans via orthology). For instance, if we con- that number quickly increase. For instance, the Interna- sider cancer genes with fewer than 5 interactions to be tional Cancer Genome Consortium (ICGC) plans to uncharacterized [this is unlikely to be their true connectivity, 123 Hum Genet (2011) 130:465–481 469

Census overlap is usually low, and even when combined were OMIM 60 CDIPprostate shown to result in a false negative rate of around 41% CDIPlung CDIPovarian (Braun et al. 2009). Interaction dynamics with respect to time and localization remain largely unknown. Also, 40 though over 92% of human genes have splice variants (Wang et al. 2008), and many genes may have hundreds or 20 even thousands of such variants [for instance, the Drospo- hila gene Dscam may have over 38,000 alternative splice

Percent uncharacterized Percent 0 0 510 forms (Schmucker et al. 2000)], we still lack information W Degree about most variant-speci c interactions.

Fig. 2 Many cancer-implicated genes remain uncharacterized. Percentages of known disease genes (y-axis) with zero, fewer than 5, or fewer than 10 known protein interactions (x-axis) in I2D version How integrative computational biology can address 1.85 (provided in Supplementary File 1). A large percentage of disease these challenges genes from the Sanger Cancer Gene Census (blue) and Online Mende- lian Inheritance in Man (turquoise) remain uncharacterized. Similar The Weld of integrative computational biology uses tech- results hold for genes implicated in prostate (green), lung (orange), and ovarian (red) cancers by two or more studies, as retrieved from the niques from computer science, mathematics, physics and Cancer Data Integration Portal engineering to comprehensively analyze and interpret bio- logical data. Through the creation of new analysis and visu- because cancer proteins are typically network hubs (Jons- alization methods, software tools and databases, it can help son and Bates 2006)], then we identify 19% of genes in the diminish the challenges to HTP cancer biology. Here we Sanger Cancer Gene Census [SCGC; (Futreal et al. 2004)] present some successful applications of integrative compu- and 47% of the genes in Online Mendelian Inheritance in tational biology to cancer research. These applications fall Man [OMIM; (Hamosh et al. 2005)] as uncharacterized into four main categories: data integration, network analy- (Fig. 2). Looking at genes in the Cancer Data Integration sis, databases, and standards. Portal (CDIP) that are implicated in cancer by at least two studies, we Wnd similar numbers; 42% of genes associated Data integration with prostate cancer, 31% of genes associated with ovarian cancer, and 24% of genes associated with lung cancer are As we have seen, noise in HTP cancer studies can arise uncharacterized. Interestingly, even though lung cancer is both from biological and technological variability. One of represented by a much smaller set of genes in CDIP—only the most eVective strategies for reducing both types of 1,232 genes, versus 5,575 for ovarian and 9,803 for prostate noise is data integration. The idea is simple: we can be cancer—it shows a roughly comparable proportion of more conWdent about the result of an experiment if similar uncharacterized genes across diVerent interaction cut-oVs experiments yielded similar results. We can integrate diVer- (Fig. 2). ent experiments that measure the same biological entity, such as microarray studies measuring tumor versus normal The human interactome is poorly characterized gene expression diVerences on diVerent experimental plat- forms. We can also integrate diVerent data types, such as PPI networks provide biological context and meaning for mutation, expression, and proteomic data. Clearly, data gene signatures and yield better network-based prognostic integration can increase our conWdence in results that are and predictive signatures. Unfortunately, with current consistent across multiple studies and experimental modali- experimental knowledge the coverage of the human inter- ties. But data integration can also increase sensitivity, since actome is only around 13%. The human interactome is esti- diVerent platforms and methods exhibit diVerent biases— mated to have around 650,000 interactions (Stumpf et al. e.g., some protein interactions may be undetectable by 2008), and currently there are only 82,857 unique experi- some methods. Integrative computational approaches can mentally derived interactions (and an additional 55,471 also be complemented by better experimental design: e.g., unique interologous interactions) in I2D, a database that in tumor proWling, analyzing multiple samples from the integrates HTP PPI data sets and most online PPI databases same patient reduces the eVect of intra-tumor heterogeneity [such as BioGRID (Stark et al. 2006), BIND (Bader et al. (Bachtiary et al. 2006; Blackhall et al. 2004), and executing 2003), DIP (Xenarios et al. 2000), HPRD (Peri et al. 2004), multiple MudPIT (Multi-dimensional Protein IdentiWca- InnateDB (Lynn et al. 2008); IntAct (Hermjakob et al. tion Technology) runs signiWcantly improves sensitivity 2004) and MINT (Zanzoni et al. 2002)]. While HTP meth- (Gortzak-Uzan et al. 2008; Kislinger and Emili 2005; ods can help to identify many protein interactions, their Sodek et al. 2008; Wei et al. 2011). 123 470 Hum Genet (2011) 130:465–481

Integrating the same type of data across multiple platforms Reactome which allows us to convert lists of diVerentially and studies expressed genes into lists of diVerentially expressed gene groups, which are more stable across studies (Subramanian With microarray and similar data, small sample numbers et al. 2005). and diVerent experimental platforms can lead to highly var- iable results. These problems can be addressed by combin- Integrating data to predict new PPIs ing data from diVerent studies and platforms, which increases the eVective number of samples and helps control Protein networks are essential for cancer signature design for inter-platform heterogeneity. and interpretation (Chuang et al. 2007; Ergun et al. 2007; Most approaches to integrating microarray data can be Rhodes and Chinnaiyan 2005), yet current experimental divided into two general classes, pooling and meta-analy- networks are far from complete, contain false positives, and ses. In pooling, multiple expression data sets are merged in general lack context, such as condition, place and into a single data set (van Vliet et al. 2008; Warnat et al. strength of individual interactions. Biological techniques 2005); typically, gene measurements from each separate can also be enriched by in silico approaches to predict new study are transformed before pooling to make the experi- interactions and the context of existing ones. For example, ments more comparable (Fierro et al. 2008). Previous work we recently developed FpClass (Kotlyar and Jurisica 2006; found that pooling six breast cancer data sets (over 900 Kotlyar et al. 2011), an association mining algorithm that samples in total) yielded better-performing signatures (van greatly expands coverage of the human proteome without Vliet et al. 2008). In contrast, for a meta-analysis, statistics increasing false discovery rate: it predicts 178,738 new are computed for each data set separately and then com- high-accuracy interactions for 9,372 proteins. FpClass inte- bined. Meta-analyses identify gene changes that are seen grates many diVerent data types to predict interactions, consistently across many studies (Rhodes and Chinnaiyan including sequence, domains, gene expression, and post- 2005). Several cancer-speciWc databases gather information translational modiWcations. Importantly, FpClass reduces from multiple studies to facilitate meta-analysis, including the number of uncharacterized disease genes (again, where Oncomine (Rhodes et al. 2007), GeneSigDB (Culhane et al. a gene is considered uncharacterized if its protein product 2010), and CDIP. CDIP currently covers lung, ovarian, participates in fewer than 5 interactions, see Fig. 2) to 34% prostate and head and neck cancers. For example, for pros- for OMIM and 11% for SCGC. tate cancer, CDIP contains 119 diVerent microarray analy- ses with over 850 tumor and normal samples. While 12,975 Integrating data to predict disease gene function unique genes are signiWcantly diVerentially regulated in at least one study, far fewer show this trend across multiple Though many cancer-related genes remain uncharacterized, studies and we can use this information to prioritize genes, HTP data and in silico methods can be applied to assign both for biological validation and for signature generation. them putative functions (Hu et al. 2007). For example, in the successful MouseFunc prediction challenge (Pena- Integrating diVerent types of data Castillo et al. 2008), teams of scientists competed to predict mouse gene function ( categories) on the Integrating complementary data from diVerent sources is basis of several diVerent data sources, including expression, helpful for reducing noise and prioritizing targets (Gortzak- sequence, interactions, phenotype annotations, disease Uzan et al. 2008; Varambally et al. 2005). For example, associations and phylogenetic proWles. Computational Gortzak-Uzan et al. (2008) combined proteins identiWed in methods for predicting the function of uncharacterized can- ovarian cancer ascites with diVerentially expressed genes cer genes have been previously reviewed in (Hu et al. from CDIP and PPIs from I2D to identify putative biomark- 2007). ers for early ovarian cancer detection in serum. For target prioritization, prognostic and predictive signatures can be Network analysis put in their biological context by overlapping and expand- ing them with data from cancer-speciWc resources such as Network approaches have been successful in addressing the Sanger Census database (Bamford et al. 2004; Stratton some of the analysis-based challenges of HTP cancer et al. 2009), the COSMIC catalog of somatic mutations biology. Genes do not act in isolation; they form highly (Bamford et al. 2004), GeneSigDB, and Oncomine. Tech- complex and interlinked molecular networks. Examining niques such as Gene Set Enrichment Analysis (GSEA) genes in the context of these networks can yield valuable (Subramanian et al. 2005) can be used to combine experi- clues about their function and relations, and expand our mental data with annotation and pathway databases such as knowledge of individual cancer pathways and their cross-talk the Gene Ontology (Ashburner et al. 2000), KEGG, and (Agarwal et al. 2009; Mills et al. 2009). Despite the noise 123 Hum Genet (2011) 130:465–481 471 in current protein interaction data sets, network analysis can were highly related: there were direct interactions between uncover biologically relevant information, such as lethality the protein products of genes from our signature and others. (Hahn and Kern 2005; Jeong et al. 2001), functional organi- Similar results have been shown in other studies (Zhu et al. zation (Gavin et al. 2002; Maslov and Sneppen 2002; 2009, 2010b). Wuchty 2006), hierarchical structure (Ravasz et al. 2002; Yu et al. 2006), modularity (Han et al. 2004) and network- Networks for identifying new disease genes building motifs (Milo et al. 2002; Przulj et al. 2006; Rice et al. 2005). Three important applications of protein net- Analyses of the network connectivity of cancer genes have works to cancer research include signature generation, sig- shown that they can be characterized by several topological nature interpretation, and disease gene prediction. properties. For example, proteins encoded by cancer genes tend to be central in interaction networks [they have high Networks for signature generation degree and betweenness centrality (Jonsson and Bates 2006; Rambaldi et al. 2008; Syed et al. 2010)], have high clustering By integrating network information with gene expression coeYcients (Li et al. 2009), and are overrepresented in net- data, we can identify predictive signatures that perform bet- work motifs (Rambaldi et al. 2008). Several methods use the ter and are more conserved across studies than signatures topological characteristics of known cancer genes, in combi- based on gene expression data alone. There are many ways nation with other features (such as Gene Ontology categories, of using networks to create improved gene signatures. One protein domains, biological pathways, and sequence fea- class of methods that has proven very successful is score- tures), to predict new cancer genes (Aragues et al. 2008; based subnetwork biomarkers (Chuang et al. 2007; Fortney Rambaldi et al. 2008) or functional SNPs (Savas et al. 2009). et al. 2010; Hwang et al. 2009; Ideker et al. 2002; Nacu Many algorithms identify modules in interaction networks, et al. 2007). In these approaches, genes are aggregated or groups of densely interconnected genes that can be highly around an initial “seed” gene in a network to generate sub- functionally related (King et al. 2004; Newman 2006; Palla networks whose pooled activity levels can be used to predict et al. 2005; Spirin and Mirny 2003). Module-Wnding algo- the value of some response variable, such as disease status rithms can also be applied to predict new disease genes (Chu- or survival time. Subnetworks identiWed using this approach ang et al. 2007; Goh et al. 2007): clusters enriched for known were shown to be highly conserved across studies, and to cancer genes may implicate novel genes in cancer. perform better than individual genes or pre-deWned gene groups at predicting breast cancer metastasis (Chuang et al. Generating context-speciWc co-expression networks 2007). Importantly, many crucial genes belonging to subnet- from microarray data work biomarkers were not diVerentially expressed on their own, demonstrating the added value of a network approach. Co-expression networks are distinct from PPI networks: Related approaches have been used to develop subnetwork instead of indicating physical interactions, edges (and edge biomarkers for colon cancer using a combination of prote- weights) between genes reXect the degree of correlation of ome and transcriptome data (Nibbe et al. 2009, 2010). their expression proWles. Co-expression networks have also been useful in cancer research, e.g., (Aggarwal et al. 2006; Networks for signature interpretation Choi et al. 2005). For example, we can use microarray data from diVerent health and disease conditions, such as cancer Many genes that play a role in predictive signatures have and non-cancer, to create networks speciWc to those con- not been previously linked to cancer, and thus can be con- texts. We can then compare the networks using a variety of sidered as novel candidate cancer genes. Networks can be measures that reXect local and global network structure, used to link these genes with known cancer mechanisms such as graphlets (Przulj et al. 2006), communities (Palla and pathways (Radulovich et al. 2010; Rhodes et al. 2005; et al. 2005), etc. Weighted Gene Co-expression Network Sodek et al. 2008; Tomasini et al. 2008). Gene signatures Analysis (WCGNA) (Zhang and Horvath 2005), a popular mapped to protein interactions can be further annotated method for generating and interpreting co-expression net- with other proWles (including proteomic, CGH, and miRNA works, has been applied to study prostate cancer (Wang studies), and with network structures, such as graphlets et al. 2009) and glioblastoma (Horvath et al. 2006). (Przulj 2007; Przulj et al. 2006). Networks can also reveal new connections between diVerent prognostic signatures. Databases and visualization For example, we recently identiWed a 15-gene prognostic and predictive signature in lung cancer (Zhu et al. 2010a). Integrated and updated tools and resources can help to deal Though our signature did not directly overlap with previ- with the large volume of HTP biological data being gener- ously published ones, network analysis revealed that they ated, as well as facilitate integrative analyses of cancer 123 472 Hum Genet (2011) 130:465–481

from, e.g., inconsistent gene nomenclature). One example is CDIP (Gortzak-Uzan et al. 2008), a cancer informatics portal that combines HTP data from microarrays, CGH, KEGG, I2D, annotations from GeneCards (Rebhan et al. 1997), and other molecular information speciWc to diVerent cancer types. Another is mirDIP (Shirdel et al. 2011), a microRNA integration portal that combines computational predictions of microRNA: mRNA binding from multiple databases (version 1 combines 11 largest sources).

Updating old data to reXect the latest knowledge

In part because of the noise inherent in HTP platforms, our best knowledge of what they are measuring and how their data should be interpreted changes with time. Past work has shown that using updated transcript data to re-annotate the Fig. 3 The value of data integration: I2D, the Interologous Interaction Database. Experimentally derived human interactions from I2D ver- assignment of microarray probe sets to genes provided for sion 1.85; the network comprises 13,190 proteins connected by 82,857 AVymetrix microarray platforms aVected 20–30% of all interactions. Interactions unique to a single source or database are ren- probe sets (Gautier et al. 2004). Updated probe set deWni- dered as red (covering 11,781 proteins and 49,440 interactions), and tions lead to higher precision and accuracy (Sandberg and interactions present in two or more sources are colored blue. Node col- V or and size are proportional to degree. Visualization and annotation of Larsson 2007), and more consistent results across di erent the network was performed in NAViGaTOR ver. 2.2.1. Network Wle microarray platforms (Carter et al. 2005; Elo et al. 2005). with GO and PubMed annotation is available as Supplemental File 2, The Ensembl database (Hubbard et al. 2009) maintains and in NAViGaTOR 2.2 XML format regularly updates a list of probeset mappings for many pop- ular microarray platforms. proWles. Integrative databases fulWll several essential func- tions; for instance, they organize heterogeneous and distrib- Visualizing biological networks uted data sources and make them easy to use, and they update and reinterpret old data in the light of new Wndings Tools like Cytoscape (Shannon et al. 2003) and NAViGa- (e.g., updated probe set mappings for a microarray plat- TOR (Brown et al. 2009; McGuYn and Jurisica 2009; Viau form). For example, the I2D database integrates HTP- et al. 2010) allow biologists to visualize protein interactions detected PPIs with PPIs from several source databases. and perform network analyses using an intuitive graphical Tens of thousands of these I2D PPIs are unique to a single interface, and have been widely used for cancer signature source database (Fig. 3); integrating these diVerent sources interpretation and development. EVective network visuali- both substantially improves coverage and increases conW- zation tools integrate several relevant resources; e.g., NAV- dence for those interactions present in multiple sources. iGaTOR users can choose to supplement and annotate Importantly, to support HTP data analysis, databases and network nodes and edges with additional data from several portals need to support multiple identiWers and batch pro- sources including I2D PPIs, PSICQUIC (Orchard et al. cessing. Both databases and network visualization methods 2010), STRING (Szklarczyk et al. 2011), Gene Ontology have played essential roles in interpreting HTP biological categories, and KEGG, PhosphoSite (Hornbeck et al. data and cancer signatures. 2004), PathwayCommons (Cerami et al. 2011), Reactome, WikiPathways (Pico et al. 2008) biological pathways, and Integrating heterogeneous and distributed data sources i-HOP (HoVmann and Valencia 2005). NAViGaTOR also provides tools for network analysis, including several mea- DiVerent data sources can be complementary. For instance, sures of centrality, methods for identifying motifs and com- while biological pathway databases may disagree on individ- munities, and random network and enrichment analysis. ual pathway deWnitions, by combining them we can diminish DiVerent visualization tools can yield diVerent and comple- their false positives and false negatives; by integrating path- mentary biological information; Fig. 4 shows three alterna- ways with protein interactions, we can improve their coverage tive visualizations of MAPK3 using pathway databases and and relevance (Radulovich et al. 2010; Savas et al. 2009). PPIs in NAViGaTOR. The MAP kinase signaling cascade Having all of these data available from a single source or at is implicated in tumor growth and may be a key anti-cancer least in a standardized format simpliWes integrative analyses drug target (Roberts and Der 2007; Sebolt-Leopold and and reduces the error and database incompatibility (arising Herrera 2004). 123 Hum Genet (2011) 130:465–481 473

Fig. 4 DiVerent databases oVer diVerent and complementary perspec- from I2D version 1.85, displayed in NAViGaTOR 2.2.1, and overlaid tives on the same data. Querying diVerent databases for the same gene, with pathway data from a and b. Nodes represent proteins unique to MAPK3 (with protein product MK03), results in pathway-related data hsa04010 (blue), proteins unique to REACT_634 (green), and proteins with diVerent structure and organization. Pathway data can be supple- present in both pathways (red). Edges represent links unique to mented with protein–protein interaction data to provide more complete hsa04010 (dark blue), REACT_634 (dark green), I2D (gray), or pres- networks. a From KEGG (release 57.0), the MAPK signaling pathway ent in both hsa04010 and I2D (light blue), or REACT_634 and I2D (PATHWAY:hsa04010). b From Reactome (version 35), the MAP (light green). The concentric circle layout uses Reactome nodes as a Kinase Cascade (REACT_634). c Protein–protein interaction network center, and organizes remaining nodes based on local connectivity 123 474 Hum Genet (2011) 130:465–481

C FGF4 FGF17

FGF21 FGF8

FGF6 FGF18

FGFR1 FGF3 EVI1 GRP4 FGF23 IL1A GRB2

IL1B FGF10 PP2BB RASH FGF9 JUN NFAC2 MP2K1 RASN ECSIT MK01 1433B FGF19 MP2K2 M4K2 CDC2 RAF1 FGF22 M3K1 TRAF2 M4K1 MKNK1 TNR1A MK03 M4K3 MYC CASP3 TAOK3 RASK M4K4 ARRB2 FGF1 NTRK1 NTF3

BDNF CD14 DUS6 NTF4 MK14 FGF5 NGF

TF65 MLTK NFKB1 MAX FGF20 PPM1B NFAC4 FLNA FGF13

CANB2

CHP1 CHP2

Fig. 4 continued

Web services for frequent updates Text mining for automated database annotation and quality-checking One of the main challenges faced by integrated databases is staying current: the knowledge and data in the source dat- Many of the data and observations gained from biological abases are changing all the time. Fortunately, many source experiments are not available in an easily accessible databases provide access to their content through web services. form—they are buried in the text of journal articles. But Integrated databases and analysis tools can then query these techniques from text mining can be used to automatically databases in response to a user request and access the latest extract biological knowledge from free text; these have data. For example, NAViGaTOR uses web services to access been extensively reviewed in, e.g., (Altman et al. 2008; pathway data from KEGG, Reactome, Pathway Commons Cohen and Hunter 2008; HoVmann et al. 2005; Jensen et al. (Cerami et al. 2011) and Wiki pathways (Pico et al. 2008), 2006; Rodriguez-Esteban 2009). For integrated databases, protein interactions from the IMEx consortium (Orchard et al. text mining methods have proven especially valuable for 2007), and protein interactions and gene associations via i- automatically annotating and quality-checking HTP data. HOP (HoVmann and Valencia 2005) or STRING. Of course, For example, text mining has been applied to annotate the not every relevant data set can be accessed through web ser- I2D PPI database (Niu et al. 2010) and to verify predicted vices; a list of web services currently available is provided by PPIs from FpClass (Kotlyar et al. 2011). Text mining can the EMBRACE Registry (Pettifer et al. 2009). also supplement integrated databases with new knowledge; 123 Hum Genet (2011) 130:465–481 475 e.g., by extracting gene-drug relationships (Kuhn et al. lenge Consortium for the Molecular ClassiWcation of Lung 2010; von Eichborn et al. 2011). Adenocarcinoma (Shedden et al. 2008). New initiatives such as TCGA and ICGC are expanding cancer proWles Standards and recommendations across multiple diVerent platforms and tumor types. Mak- ing the data and related clinical information publicly avail- Progress in HTP research relies on well-designed and widely able will signiWcantly contribute to our understanding of adopted standards for how data is produced, managed, and the molecular changes in cancer, enabling new discoveries analyzed. These provide a consistent way of dealing with as well as more comprehensive validation of novel prog- large volumes of data, help reduce noise, and ensure repro- nostic and predictive signatures. ducibility at both the experimental and computational levels. Recommendations for biological experiments Standards for HTP data and tools The quality of any computational analysis is limited by the DiVerent research groups, databases, and analyses must use data that are available. Considering intra- and inter-tumor consistent reporting standards so their data can be eVec- heterogeneity, we may sometimes have too few samples or tively shared, integrated, and evaluated. In cancer research, too much noise to be able to develop cancer signatures with standards are needed at several levels of data collection and the required accuracy and reproducibility. In this case, analysis, from tumor collection, storage, and sample prepa- computational analyses of the available data can at least ration to pre-processing and further computational study of provide guidelines as to the kinds of experiments and data the resulting HTP data. that are needed. For example, one study estimated that to For many HTP data types, there are now well-developed achieve 50% overlap in prognostic gene sets for breast can- standards for data reporting, such as Minimum Information cer patients would require several thousand samples (Ein- About a Microarray Experiment [MIAME (Brazma et al. Dor et al. 2006), and other work created a general tool to 2001)] and large public repositories for raw data, such as calculate the number of samples needed to train a classiWer the GEO (Barrett et al. 2005) and the PRoteomics IDEntiW- as a function of normalized fold change, class prevalence, cations database [PRIDE (Jones et al. 2006)]. Data stan- and the number of genes on an array (Dobbin et al. 2008). dards development is an ongoing area of research, and has Much of the human PPI network, including PPIs of known been extensively reviewed (Brazma et al. 2006; Brooks- cancer genes, remains uncharacterized. Past work shows bank and Quackenbush 2006; Enkemann 2010). Some out- that a reliable conWdence score can be associated with an standing issues include data identiWers—e.g., the many-to- interaction by feeding the output of four diVerent comple- many mappings between genes from databases like Entrez- mentary HTP assays into a logistic regression model Gene and Ensembl—and open access to data. Though many (Braun et al. 2009). Also, high-conWdence PPIs predicted high-impact journals now mandate that all HTP data and by various computational methods can focus future experi- code associated with a publication be made open-access, ments and reduce their false positives and false negatives this requirement is not universal, and sometimes the same (Kotlyar and Jurisica 2006; Kotlyar et al. 2011). journal will require microarray data submission, but not proteomic data. Also, many articles that do publish their data too frequently make them available only in an imprac- A directory of tools and resources for integrative tical form—such as hundreds of pages in supplementary computational cancer biology PDF Wles. We need new initiatives to standardize (and enforce) HTP data submission formats to ensure that they Below is a brief directory of links to some of the major are machine-readable. This will enable wider use of data resources for integrative computational biology that appear and new discoveries, and may also substantially improve in this review. quality of research by reducing errors and mistakes (Bag- gerly and Coombes 2009; Carey and Stodden 2010). Biological pathway annotations and related data For computational data analysis, we also need large- scale comparisons to provide a fair evaluation of diVerent Reactome http://www.reactome.org/ methods. In cancer research, these eVorts depend on Pathway Commons http://www.pathwaycommons.org/pc/ resources like GeneSigDB (Culhane et al. 2010), a curated KEGG PATHWAY—Kyoto Encyclopedia of Genes database containing over 2,000 cancer gene signatures, and and Genomes http://www.genome.jp/kegg/pathway. large well-annotated data sets such as the set of gene html expression proWles for over 400 non-small cell lung cancer WikiPathways http://www.wikipathways.org/ adenocarcinoma tumors provided by the Director’s Chal- GO—Gene Ontology http://www.geneontology.org/ 123 476 Hum Genet (2011) 130:465–481

Experimental data Gene function prediction

Protein–protein interactions GeneMania http://genemania.org/

I2D—Interologous Interaction Database http://ophid. Cancer genome initiatives utoronto.ca/i2d/ STRING—Search Tool for the Retrieval of Interacting TCGA—The Cancer Genome Atlas http://cancergenome. Genes http://string-db.org/ nih.gov/ BioGRID—Biological General Repository for Interac- ICGC—International Cancer Genome Consortium http:// tion Datasets http://thebiogrid.org/ www.icgc.org/ HPRD—Human Protein Reference Database http://www. hprd.org/ Acknowledgments This work was supported in part by Ontario Re- IntAct http://www.ebi.ac.uk/intact/main.xhtml search Fund (GL2-01-030), Canada Institutes for Health Research (BIO-99745), the Canada Foundation for Innovation (CFI #12301 and MINT—Molecular INTeraction Database http://mint. #203383), the Canada Research Chair Program, and IBM to IJ, and the bio.uniroma2.it/ Ontario Ministry of Health and Long Term Care. The views expressed IMEx—International Molecular Exchange Consortium do not necessarily reXect those of the OMOHLTC. We thank M. Kotl- W http://www.imexconsortium.org/ yar, A. Hrvojic, A. Teisanu, and W. Xie for providing data for the g- ures.

Gene expression Open Access This article is distributed under the terms of the Crea- tive Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, CDIP—Cancer Data Integration Portal http://ophid. provided the original author(s) and source are credited. utoronto.ca/cdip Oncomine https://www.oncomine.org/ GeneSigDB http://compbio.dfci.harvard.edu/genesigdb/ References ArrayExpress http://www.ebi.ac.uk/arrayexpress/ GEO—Gene Expression Omnibus http://www.ncbi. Agarwal R, Gonzalez-Angulo AM, Myhre S, Carey M, Lee JS, Overg- nlm.nih.gov/geo/ aard J, Alsner J, Stemke-Hale K, Lluch A, Neve RM, Kuo WL, Sorlie T, Sahin A, Valero V, Keyomarsi K, Gray JW, Borresen- Dale AL, Mills GB, Hennessy BT (2009) Integrative analysis of Mutations cyclin protein levels identiWes cyclin b1 as a classiWer and predic- tor of outcomes in breast cancer. Clin Cancer Res 15:3654–3662 Sanger Cancer Gene Census http://www.sanger.ac.uk/ Aggarwal A, Guo DL, Hoshida Y, Yuen ST, Chu KM, So S, Boussiou- tas A, Chen X, Bowtell D, Aburatani H, Leung SY, Tan P (2006) genetics/CGP/Census/ Topological and functional discovery in a gene coexpression COSMIC—Catalogue of Somatic Mutations in Cancer meta-network of gastric cancer. Cancer Res 66:232–241 http://www.sanger.ac.uk/resources/databases/cosmic.html Altman RB, Bergman CM, Blake J, Blaschke C, Cohen A, Gannon F, Grivell L, Hahn U, Hersh W, Hirschman L, Jensen LJ, Krallinger M, Mons B, O’Donoghue SI, Peitsch MC, Rebholz-Schuhmann Proteomics D, Shatkay H, Valencia A (2008) Text mining for biology—the way forward: opinions from leading scientists. Genome Biol PRIDE—PRoteomics IDEntiWcations database http:// 9(Suppl 2):S7 www.ebi.ac.uk/pride/ Aragues R, Sander C, Oliva B (2008) Predicting cancer involvement of genes from heterogeneous data. BMC Bioinformatics 9:172 PhosphoSite http://www.phosphosite.org/ Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, microRNAs Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G (2000) Gene ontology: tool for the uniWcation of biology. The Gene Ontology Consortium. mirDIP—microRNA Data Integration Portal http://ophid. Nat Genet 25:25–29 utoronto.ca/mirDIP AuVray C, Chen Z, Hood L (2009) Systems medicine: the future of medical genomics and healthcare. Genome Med 1:2 Augen J (2001) Information technology to the rescue! Nat Biotechnol Network visualization software 19 Suppl: BE39–40 Axelrod DE, Miller N, Chapman JA (2009) Avoiding pitfalls in the sta- NAViGaTOR—Network Analysis, Visualization, & tistical analysis of heterogeneous tumors. Biomed Inform Insights Graphing TORonto http://ophid.utoronto.ca/navigator/ 2:11–18 Bachtiary B, Boutros PC, Pintilie M, Shi W, Bastianutto C, Li JH, Cytoscape http://www.cytoscape.org/ Schwock J, Zhang W, Penn LZ, Jurisica I, Fyles A, Liu FF (2006)

123 Hum Genet (2011) 130:465–481 477

Gene expression proWling in cervical cancer: an exploration of in- Carter SL, Eklund AC, Mecham BH, Kohane IS, Szallasi Z (2005) tratumor heterogeneity. Clin Cancer Res 12:5632–5640 RedeWnition of AVymetrix probe sets by sequence overlap with Bader GD, Betel D, Hogue CW (2003) BIND: the Biomolecular Inter- cDNA microarray probes reduces cross-platform inconsistencies action Network Database. Nucleic Acids Res 31:248–250 in cancer-associated gene expression measurements. BMC Bioin- Baggerly KA, Coombes KR (2009) Deriving chemosensitivity from formatics 6:107 cell lines: forensic bioinformatics and reproducible research in Cerami EG, Gross BE, Demir E, Rodchenkov I, Babur O, Anwar N, high-throughput biology. Ann Appl Stat 3:1309–1334 Schultz N, Bader GD, Sander C (2011) Pathway Commons, a web Bamford S, Dawson E, Forbes S, Clements J, Pettett R, Dogan A, resource for biological pathway data. Nucleic Acids Res 39: Flanagan A, Teague J, Futreal PA, Stratton MR, Wooster R D685–D690 (2004) The COSMIC (Catalogue of Somatic Mutations in Can- Cervigne NK, Reis PP, Machado J, Sadikovic B, Bradley G, Galloni cer) database and website. Br J Cancer 91:355–358 NN, Pintilie M, Jurisica I, Perez-Ordonez B, Gilbert R, Gullane P, Barnes M, Freudenberg J, Thompson S, Aronow B, Pavlidis P (2005) Irish J, Kamel-Reid S (2009) IdentiWcation of a microRNA signa- Experimental comparison and cross-validation of the AVymetrix ture associated with progression of leukoplakia to oral carcinoma. and Illumina gene expression analysis platforms. Nucleic Acids Hum Mol Genet 18:4818–4829 Res 33:5914–5923 Chandran UR, Dhir R, Ma C, Michalopoulos G, Becich M, Gilbertson Barrett T, Suzek TO, Troup DB, Wilhite SE, Ngau WC, Ledoux P, J (2005) DiVerences in gene expression in prostate cancer, normal Rudnev D, Lash AE, Fujibuchi W, Edgar R (2005) NCBI GEO: appearing prostate tissue adjacent to cancer and prostate tissue mining millions of expression proWles—database and tools. from cancer free organ donors. BMC Cancer 5:45 Nucleic Acids Res 33:D562–D566 Choi JK, Yu U, Yoo OJ, Kim S (2005) DiVerential coexpression anal- Bell AW, Deutsch EW, Au CE, Kearney RE, Beavis R, Sechi S, ysis using microarray data and its application to human cancer. Nilsson T, Bergeron JJ (2009) A HUPO test sample study reveals Bioinformatics 21:4348–4355 common problems in mass spectrometry-based proteomics. Nat Chuang HY, Lee E, Liu YT, Lee D, Ideker T (2007) Network-based Methods 6:423–430 classiWcation of breast cancer metastasis. Mol Syst Biol 3:140 Blackhall FH, Pintilie M, Wigle D, Jurisica I, Liu N, Radulovitch N, Clamp M, Fry B, Kamal M, Xie X, CuV J, Lin MF, Kellis M, Lindblad- Keshavjee S, Johnston M, Shepherd FA, Tsao M-S (2004) Stability Toh K, Lander ES (2007) Distinguishing protein-coding and non- and heterogeneity of expression proWles in lung cancer specimens coding genes in the human genome. Proc Natl Acad Sci USA harvested following surgical resection. Neoplasia 6:761–767 104:19428–19433 Boutros PC, Lau SK, Pintilie M, Liu N, Shepherd FA, Der SD, Tsao MS, Cleator SJ, Powles TJ, Dexter T, Fulford L, Mackay A, Smith IE, Penn LZ, Jurisica I (2009) Prognostic gene signatures for non- Valgeirsson H, Ashworth A, Dowsett M (2006) The eVect of the small-cell lung cancer. Proc Natl Acad Sci USA 106:2824–2828 stromal component of breast tumours on prediction of clinical Braun P, Tasan M, Dreze M, Barrios-Rodiles M, Lemmens I, Yu H, outcome using gene expression microarray analysis. Breast Sahalie JM, Murray RR, Roncari L, de Smet AS, Venkatesan K, Cancer Res 8:R32 Rual JF, Vandenhaute J, Cusick ME, Pawson T, Hill DE, Taver- Cohen KB, Hunter L (2008) Getting started in text mining. PLoS Com- nier J, Wrana JL, Roth FP, Vidal M (2009) An experimentally de- put Biol 4:e20 rived conWdence score for binary protein–protein interactions. Coleman MP, Quaresma M, Berrino F, Lutz JM, De Angelis R, Capo- Nat Methods 6:91–97 caccia R, Baili P, Rachet B, Gatta G, Hakulinen T, Micheli A, Brazma A, Hingamp P, Quackenbush J, Sherlock G, Spellman P, Sto- Sant M, Weir HK, Elwood JM, Tsukuma H, Koifman S, ES GA, eckert C, Aach J, Ansorge W, Ball CA, Causton HC, Gaasterland Francisci S, Santaquilani M, Verdecchia A, Storm HH, Young JL T, Glenisson P, Holstege FC, Kim IF, Markowitz V, Matese JC, (2008) Cancer survival in Wve continents: a worldwide popula- Parkinson H, Robinson A, Sarkans U, Schulze-Kremer S, Stewart tion-based study (CONCORD). Lancet Oncol 9:730–756 J, Taylor R, Vilo J, Vingron M (2001) Minimum information Culhane AC, Schwarzl T, Sultana R, Picard KC, Picard SC, Lu TH, about a microarray experiment (MIAME)—toward standards for Franklin KR, French SJ, Papenhausen G, Correll M, Quacken- microarray data. Nat Genet 29:365–371 bush J (2010) GeneSigDB—a curated database of gene expres- Brazma A, Krestyaninova M, Sarkans U (2006) Standards for systems sion signatures. Nucleic Acids Res 38:D716–D725 biology. Nat Rev Genet 7:593–605 Curtis C, Lynch AG, Dunning MJ, Spiteri I, Marioni JC, HadWeld J, Brooksbank C, Quackenbush J (2006) Data standards: a call to action. Chin SF, Brenton JD, Tavare S, Caldas C (2009) The pitfalls of OMICS 10:94–99 platform comparison: DNA copy number array technologies Brown KR, Jurisica I (2005) Online predicted human interaction data- assessed. BMC Genomics 10:588 base. Bioinformatics 21:2076–2082 Deisboeck TS, Zhang L, Yoon J, Costa J (2009) In silico cancer mod- Brown KR, Jurisica I (2007) Unequal evolutionary conservation of eling: is it ready for prime time? Nat Clin Pract Oncol 6:34–42 human protein interactions in interologous networks. Genome Dhanasekaran SM, Barrette TR, Ghosh D, Shah R, Varambally S, Biol 8:R95 Kurachi K, Pienta KJ, Rubin MA, Chinnaiyan AM (2001) Brown KR, Otasek D, Ali M, McGuYn M, Xie W, Devani B, van Toch Delineation of prognostic biomarkers in prostate cancer. Nature IL, Jurisica I (2009) NAViGaTOR: Network Analysis, Visualiza- 412:822–826 tion and Graphing Toronto. Bioinformatics 25:3327–3329 Diamandis EP (2010) Cancer biomarkers: can we turn recent failures Buyse M, Loi S, van’t Veer L, Viale G, Delorenzi M, Glas AM, d’As- into success? J Natl Cancer Inst 102:1462–1467 signies MS, Bergh J, Lidereau R, Ellis P, Harris A, Bogaerts J, Dobbin KK, Zhao Y, Simon RM (2008) How large a training set is Therasse P, Floore A, Amakrane M, Piette F, Rutgers E, Sotiriou needed to develop a classiWer for microarray data? Clin Cancer C, Cardoso F, Piccart MJ (2006) Validation and clinical utility of Res 14:108–114 a 70-gene prognostic signature for women with node-negative Dupuy A, Simon RM (2007) Critical review of published microarray breast cancer. J Natl Cancer Inst 98: 1183–92 studies for cancer outcome and guidelines on statistical analysis Carey VJ, Stodden V (2010) Reproducible research concepts and tools and reporting. J Natl Cancer Inst 99:147–157 for cancer bioinformatics. In: Ochs MF, Cassagrande JT, Davulu- Edgar R, Domrachev M, Lash AE (2002) Gene Expression Omnibus: ri RV (eds) Biomedical informatics for cancer research. Springer, NCBI gene expression and hybridization array data repository. US, pp 149–175 Nucleic Acids Res 30:207–210

123 478 Hum Genet (2011) 130:465–481

Eggle D, Debey-Pascher S, Beyer M, Schultze JL (2009) The develop- Hamosh A, Scott A, Amberger J, Bocchini C, McKusick V (2005) On- ment of a comparison approach for Illumina bead chips unravels line Mendelian Inheritance in Man (OMIM), a knowledgebase of unexpected challenges applying newest generation microarrays. human genes and genetic disorders. Nucleic Acids Res 33:D514– BMC Bioinformatics 10:186 D517 Ein-Dor L, Kela I, Getz G, Givol D, Domany E (2005) Outcome sig- Han JD, Bertin N, Hao T, Goldberg DS, Berriz GF, Zhang LV, Dupuy nature genes in breast cancer: is there a unique set? Bioinformat- D, Walhout AJ, Cusick ME, Roth FP, Vidal M (2004) Evidence ics 21:171–178 for dynamically organized modularity in the yeast protein–protein Ein-Dor L, Zuk O, Domany E (2006) Thousands of samples are needed interaction network. Nature 430:88–93 to generate a robust gene list for predicting outcome in cancer. Hermjakob H, Montecchi-Palazzi L, Lewington C, Mudali S, Kerrien Proc Natl Acad Sci USA 103:5923–5928 S, Orchard S, Vingron M, Roechert B, RoepstorV P, Valencia A, Elias JE, Haas W, Faherty BK, Gygi SP (2005) Comparative evalua- Margalit H, Armstrong J, Bairoch A, Cesareni G, Sherman D, Ap- tion of mass spectrometry platforms used in large-scale proteo- weiler R (2004) IntAct: an open source molecular interaction mics investigations. Nat Methods 2:667–675 database. Nucleic Acids Res 32:D452–D455 Elo LL, Lahti L, Skottman H, Kylaniemi M, Lahesmaa R, Aittokallio HoVmann R, Valencia A (2005) Implementing the iHOP concept for T (2005) Integrating probe-level expression changes across gen- navigation of biomedical literature. Bioinformatics 21 Suppl 2: erations of AVymetrix arrays. Nucleic Acids Res 33:e193 ii252–ii258 Enkemann SA (2010) Standards aVecting the consistency of gene HoVmann R, Krallinger M, Andres E, Tamames J, Blaschke C, Valen- expression arrays in clinical applications. Cancer Epidemiol Bio- cia A (2005) Text mining for metabolic pathways, signaling cas- markers Prev 19:1000–1003 cades, and protein networks. Sci STKE 2005(283):pe21 Ergun A, Lawrence CA, Kohanski MA, Brennan TA, Collins JJ (2007) Hornbeck PV, Chabra I, Kornhauser JM, Skrzypek E, Zhang B (2004) A network biology approach to prostate cancer. Mol Syst Biol PhosphoSite: a bioinformatics resource dedicated to physiologi- 3:82 cal protein phosphorylation. Proteomics 4:1551–1561 Fend F, RaVeld M (2000) Laser capture microdissection in pathology. Hornberg JJ, Bruggeman FJ, WesterhoV HV, Lankelma J (2006) Can- J Clin Pathol 53:666–672 cer: a systems biology disease. Biosystems 83:81–90 Fierro AC, Vandenbussche F, Engelen K, Van de Peer Y, Marchal K Horvath S, Zhang B, Carlson M, Lu KV, Zhu S, Felciano RM, Lau- (2008) Meta analysis of gene expression data within and across rance MF, Zhao W, Qi S, Chen Z, Lee Y, Scheck AC, Liau LM, species. Curr Genomics 9:525–534 Wu H, Geschwind DH, Febbo PG, Kornblum HI, Cloughesy TF, Fortney K, Kotlyar M, Jurisica I (2010) Inferring the functions of lon- Nelson SF, Mischel PS (2006) Analysis of oncogenic signaling gevity genes with modular subnetwork biomarkers of Caeno- networks in glioblastoma identiWes ASPM as a molecular target. rhabditis elegans aging. Genome Biol 11:R13 Proc Natl Acad Sci USA 103:17402–17407 Fujita A, Gomes LR, Sato JR, Yamaguchi R, Thomaz CE, Sogayar Hu P, Bader G, Wigle DA, Emili A (2007) Computational prediction MC, Miyano S (2008) Multivariate gene expression analysis of cancer-gene function. Nat Rev Cancer 7:23–34 reveals functional connectivity changes between normal/tumoral Hubbard TJ, Aken BL, Ayling S, Ballester B, Beal K, Bragin E, Brent prostates. BMC Syst Biol 2:106 S, Chen Y, Clapham P, Clarke L, Coates G, Fairley S, Fitzgerald Futreal PA, Coin L, Marshall M, Down T, Hubbard T, Wooster R, S, Fernandez-Banet J, Gordon L, Graf S, Haider S, Hammond M, Rahman N, Stratton MR (2004) A census of human cancer genes. Holland R, Howe K, Jenkinson A, Johnson N, Kahari A, Keefe D, Nat Rev Cancer 4:177–183 Keenan S, Kinsella R, Kokocinski F, Kulesha E, Lawson D, Galsworthy MJ (2009) MEDSUM: an online MEDLINE summary Longden I, Megy K, Meidl P, Overduin B, Parker A, Pritchard B, tool. http://www.medsum.info Rios D, Schuster M, Slater G, Smedley D, Spooner W, Spudich Gautier L, Moller M, Friis-Hansen L, Knudsen S (2004) Alternative G, Trevanion S, Vilella A, Vogel J, White S, Wilder S, Zadissa A, mapping of probes to genes for AVymetrix chips. BMC Bioinfor- Birney E, Cunningham F, Curwen V, Durbin R, Fernandez- matics 5:111 Suarez XM, Herrero J, Kasprzyk A, Proctor G, Smith J, Searle S, Gavin AC, Bosche M, Krause R, Grandi P, Marzioch M, Bauer A, Flicek P (2009) Ensembl 2009. Nucleic Acids Res 37:D690–D697 Schultz J, Rick JM, Michon AM, Cruciat CM, Remor M, Hofert Hudson TJ, Anderson W, Artez A, Barker AD, Bell C, Bernabe RR, C, Schelder M, Brajenovic M, RuVner H, Merino A, Klein K, Hu- Bhan MK, Calvo F, Eerola I, Gerhard DS, Guttmacher A, Guyer dak M, Dickson D, Rudi T, Gnau V, Bauch A, Bastuck S, Huhse M, Hemsley FM, Jennings JL, Kerr D, Klatt P, Kolar P, Kusada B, Leutwein C, Heurtier MA, Copley RR, Edelmann A, Querfurth J, Lane DP, Laplace F, Youyong L, Nettekoven G, Ozenberger B, E, Rybin V, Drewes G, Raida M, Bouwmeester T, Bork P, Sera- Peterson J, Rao TS, Remacle J, Schafer AJ, Shibata T, Stratton phin B, Kuster B, Neubauer G, Superti-Furga G (2002) Func- MR, Vockley JG, Watanabe K, Yang H, Yuen MM, Knoppers tional organization of the yeast proteome by systematic analysis BM, Bobrow M, Cambon-Thomsen A, Dressler LG, Dyke SO, of protein complexes. Nature 415:141–147 Joly Y, Kato K, Kennedy KL, Nicolas P, Parker MJ, Rial-Sebbag Goh KI, Cusick ME, Valle D, Childs B, Vidal M, Barabasi AL (2007) E, Romeo-Casabona CM, Shaw KM, Wallace S, Wiesner GL, The human disease network. Proc Natl Acad Sci USA 104:8685– Zeps N, Lichter P, Biankin AV, Chabannon C, Chin L, Clement 8690 B, de Alava E, Degos F, Ferguson ML, Geary P, Hayes DN, Johns Gortzak-Uzan L, Ignatchenko A, Evangelou AI, Agochiya M, Brown AL, Kasprzyk A, Nakagawa H, Penny R, Piris MA, Sarin R, Scar- KA, St Onge P, Kireeva I, Schmitt-Ulms G, Brown TJ, Murphy J, pa A, van de Vijver M, Futreal PA, Aburatani H, Bayes M, Bot- Rosen B, Shaw P, Jurisica I, Kislinger T (2008) A proteome well DD, Campbell PJ, Estivill X, Grimmond SM, Gut I, Hirst M, resource of ovarian cancer ascites: integrated proteomic and bio- Lopez-Otin C, Majumder P, Marra M, McPherson JD, Ning Z, informatic analyses to identify putative biomarkers. J Proteome Puente XS, Ruan Y, Stunnenberg HG, Swerdlow H, Velculescu Res 7:339–351 VE, Wilson RK, Xue HH, Yang L, Spellman PT, Bader GD, Bou- Gstaiger M, Aebersold R (2009) Applying mass spectrometry-based tros PC, Flicek P, Getz G, Guigo R, Guo G, Haussler D, Heath S, proteomics to genetics, genomics and network biology. Nat Rev Hubbard TJ, Jiang T et al (2010) International network of cancer Genet 10:617–627 genome projects. Nature 464:993–998 Hahn MW, Kern AD (2005) Comparative genomics of centrality and Hwang YC, Lin CC, Chang JY, Mori H, Juan HF, Huang HC (2009) essentiality in three eukaryotic protein–interaction networks. Mol Predicting essential genes based on network and sequence analy- Biol Evol 22:803–806 sis. Mol Biosyst 5:1672–1678 123 Hum Genet (2011) 130:465–481 479

Ideker T, Ozier O, Schwikowski B, Siegel AF (2002) Discovering Maslov S, Sneppen K (2002) SpeciWcity and stability in topology of regulatory and signalling circuits in molecular interaction net- protein networks. Science 296:910–913 works. Bioinformatics 18(Suppl 1):S233–S240 Matthews L, Gopinath G, Gillespie M, Caudy M, Croft D, de Bono B, Irizarry RA, Warren D, Spencer F, Kim IF, Biswal S, Frank BC, Garapati P, Hemish J, Hermjakob H, Jassal B, Kanapin A, Lewis Gabrielson E, Garcia JG, Geoghegan J, Germino G, GriYn C, S, Mahajan S, May B, Schmidt E, Vastrik I, Wu G, Birney E, Hilmer SC, HoVman E, Jedlicka AE, Kawasaki E, Martinez- Stein L, D’Eustachio P (2009) Reactome knowledgebase of hu- Murillo F, Morsberger L, Lee H, Petersen D, Quackenbush J, Scott man biological pathways and processes. Nucleic Acids Res A, Wilson M, Yang Y, Ye SQ, Yu W (2005) Multiple-laboratory 37:D619–D622 comparison of microarray platforms. Nat Methods 2:345–350 McGuYn M, Jurisica I (2009) Interaction techniques for selecting and Jensen LJ, Saric J, Bork P (2006) Literature mining for the biologist: manipulating subgraphs in network visualizations. IEEE Trans from information retrieval to biological discovery. Nat Rev Genet Vis Comput Graph 15(6):937–944 7:119–129 Mills GB, Jurisica I, Yarden Y, Norman JC (2009) Genomic amplicons Jeong H, Mason SP, Barabasi AL, Oltvai ZN (2001) Lethality and cen- target vesicle recycling in breast cancer. J Clin Invest 119:2123– trality in protein networks. Nature 411:41–42 2127 Jones P, Cote RG, Martens L, Quinn AF, Taylor CF, Derache W, Her- Milo R, Shen-Orr S, Itzkovitz S, Kashtan N, Chklovskii D, Alon U mjakob H, Apweiler R (2006) PRIDE: a public repository of pro- (2002) Network motifs: simple building blocks of complex net- tein and peptide identiWcations for the proteomics community. works. Science 298:824–827 Nucleic Acids Res 34:D659–D663 Myhre S, Mohammed H, Tramm T, Alsner J, Finak G, Park M, Overg- Jonsson PF, Bates PA (2006) Global topological features of cancer aard J, Borresen-Dale AL, Frigessi A, Sorlie T (2010) In silico proteins in the human interactome. Bioinformatics 22:2291–2297 ascription of gene expression diVerences to tumor and stromal Jubb AM, BuVa FM, Harris AL (2010) Assessment of tumour hypoxia cells in a model to study impact on breast cancer outcome. PLoS for prediction of response to therapy and cancer prognosis. J Cell One 5:e14002 Mol Med 14:18–29 Nacu S, Critchley-Thorne R, Lee P, Holmes S (2007) Gene expression Kanehisa M, Araki M, Goto S, Hattori M, Hirakawa M, Itoh M, Katay- network analysis and applications to immunology. Bioinformat- ama T, Kawashima S, Okuda S, Tokimatsu T, Yamanishi Y ics 23:850–858 (2008) KEGG for linking genomes to life and the environment. Newman ME (2006) Modularity and community structure in networks. Nucleic Acids Res 36:D480–D484 Proc Natl Acad Sci USA 103:8577–8582 King AD, Przulj N, Jurisica I (2004) Protein complex prediction via Nibbe RK, Markowitz S, MyeroV L, Ewing R, Chance MR (2009) Dis- cost-based clustering. Bioinformatics 20:3013–3020 covery and scoring of protein interaction subnetworks discrimina- Kirschner MW (2005) The meaning of systems biology. Cell 121:503– tive of late stage human colon cancer. Mol Cell Proteomics 504 8:827–845 Kislinger T, Emili A (2005) Multidimensional protein identiWcation Nibbe RK, Koyuturk M, Chance MR (2010) An integrative-omics ap- technology: current status and future prospects. Expert Rev Pro- proach to identify functional sub-networks in human colorectal teomics 2:27–39 cancer. PLoS Comput Biol 6:e1000639 Kislinger T, Cox B, Kannan A, Chung C, Hu P, Ignatchenko A, Scott Niu Y, Otasek D, Jurisica I (2010) Evaluation of linguistic features MS, Gramolini AO, Morris Q, Hallett MT, Rossant J, Hughes TR, useful in extraction of interactions from PubMed; application to Frey B, Emili A (2006) Global survey of organ and organelle pro- annotating known, high-throughput and predicted interactions in tein expression in mouse: combined proteomic and transcriptomic I2D. Bioinformatics 26:111–119 proWling. Cell 125:173–186 Orchard S, Kerrien S, Jones P, Ceol A, Chatr-Aryamontri A, Salwinski Kotlyar M, Jurisica I (2006) Predicting protein–protein interactions by L, Nerothin J, Hermjakob H (2007) Submit your interaction data association mining. Inf Syst Frontiers 8:37–47 the IMEx way: a step by step guide to trouble-free deposition. Kotlyar M, Niu Y, Ponzielli R, Ding Z, Mills GB, Penn LZ, Jurisica I Proteomics 7(Suppl 1):28–34 (2011) Predicting human protein–protein interactions using non- Orchard S, Albar JP, Deutsch EW, Eisenacher M, Binz PA, Hermjakob independent features (in preparation) H (2010) Implementing data standards: a report on the HUPOPSI Kuhn M, Szklarczyk D, Franceschini A, Campillos M, von Mering C, workshop September 2009, Toronto, Canada. Proteomics Jensen LJ, Beyer A, Bork P (2010) STITCH 2: an interaction net- 10:1895–1898 work database for small molecules and proteins. Nucleic Acids Palla G, Derenyi I, Farkas I, Vicsek T (2005) Uncovering the overlap- Res 38:D552–D556 ping community structure of complex networks in nature and Lau SK, Boutros PC, Pintilie M, Blackhall FH, Zhu CQ, Strumpf D, society. Nature 435:814–818 Johnston MR, Darling G, Keshavjee S, Waddell TK, Liu N, Lau Pegram MD, Lipton A, Hayes DF, Weber BL, Baselga JM, Tripathy D, Penn LZ, Shepherd FA, Jurisica I, Der SD, Tsao MS (2007) D, Baly D, Baughman SA, Twaddell T, Glaspy JA, Slamon DJ Three-gene prognostic classiWer for early-stage non small-cell (1998) Phase II study of receptor-enhanced chemosensitivity us- lung cancer. J Clin Oncol 25:5562–5569 ing recombinant humanized anti-p185HER2/neu monoclonal Ledford H (2010) Big science: the cancer genome challenge. Nature antibody plus cisplatin in patients with HER2/neu-overexpressing 464:972–974 metastatic breast cancer refractory to chemotherapy treatment. Li L, Zhang K, Lee J, Cordes S, Davis DP, Tang Z (2009) Discovering J Clin Oncol 16:2659–2671 cancer genes by integrating network and functional properties. Pena-Castillo L, Tasan M, Myers CL, Lee H, Joshi T, Zhang C, Guan BMC Med Genomics 2:61 Y, Leone M, Pagnani A, Kim WK, Krumpelman C, Tian W, Obo- Lowe JA, Jones P, Wilson DM (2010) Network biology as a new ap- zinski G, Qi Y, Mostafavi S, Lin GN, Berriz GF, Gibbons FD, proach to drug discovery. Curr Opin Drug Discov Dev 13:524–526 Lanckriet G, Qiu J, Grant C, Barutcuoglu Z, Hill DP, Warde-Far- Lynn DJ, Winsor GL, Chan C, Richard N, Laird MR, Barsky A, Gardy ley D, Grouios C, Ray D, Blake JA, Deng M, Jordan MI, Noble JL, Roche FM, Chan TH, Shah N, Lo R, Naseer M, Que J, Yau WS, Morris Q, Klein-Seetharaman J, Bar-Joseph Z, Chen T, Sun M, Acab M, Tulpan D, Whiteside MD, Chikatamarla A, Mah B, F, Troyanskaya OG, Marcotte EM, Xu D, Hughes TR, Roth FP Munzner T, Hokamp K, Hancock RE, Brinkman FS (2008) Innat- (2008) A critical assessment of Mus musculus gene function pre- eDB: facilitating systems-level analyses of the mammalian innate diction using integrated genomic evidence. Genome Biol 9(Suppl immune response. Mol Syst Biol 4:218 1):S2 123 480 Hum Genet (2011) 130:465–481

Peri S, Navarro JD, Kristiansen TZ, Amanchy R, Surendranath V, Doucette-Stamm L, Cusick ME, Hill DE, Roth FP, Vidal M Muthusamy B, Gandhi TK, Chandrika KN, Deshpande N, Suresh (2005) Towards a proteome-scale map of the human protein–pro- S, Rashmi BP, Shanker K, Padma N, Niranjan V, Harsha HC, tein interaction network. Nature 437:1173–1178 Talreja N, Vrushabendra BM, Ramya MA, Yatish AJ, Joy M, Sandberg R, Larsson O (2007) Improved precision and accuracy for Shivashankar HN, Kavitha MP, Menezes M, Choudhury DR, microarrays using updated probe set deWnitions. BMC Bioinfor- Ghosh N, Saravana R, Chandran S, Mohan S, Jonnalagadda CK, matics 8:48 Prasad CK, Kumar-Sinha C, Deshpande KS, Pandey A (2004) Savas S, Geraci J, Jurisica I, Liu G (2009) A comprehensive catalogue Human protein reference database as a discovery resource for pro- of functional genetic variations in the EGFR pathway: protein– teomics. Nucleic Acids Res 32:D497–D501 protein interaction analysis reveals novel genes and polymor- Pettifer S, Thorne D, McDermott P, Attwood T, Baran J, Bryne JC, phisms important for cancer research. Int J Cancer 125:1257– Hupponen T, Mowbray D, Vriend G (2009) An active registry for 1265 bioinformatics web services. Bioinformatics 25:2090–2091 Schmucker D, Clemens JC, Shu H, Worby CA, Xiao J, Muda M, Dix- Pico AR, Kelder T, van Iersel MP, Hanspers K, Conklin BR, Evelo C on JE, Zipursky SL (2000) Drosophila Dscam is an axon guidance (2008) WikiPathways: pathway editing for the people. PLoS Biol receptor exhibiting extraordinary molecular diversity. Cell 6:e184 101:671–684 Ponzielli R, Boutros PC, Katz S, Stojanova A, Hanley AP, Khosravi F, Sebolt-Leopold JS, Herrera R (2004) Targeting the mitogen-activated Bros C, Jurisica I, Penn LZ (2008) Optimization of experimental protein kinase cascade to treat cancer. Nat Rev Cancer 4:937–947 design parameters for high-throughput chromatin immunoprecip- Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, itation studies. Nucleic Acids Res 36:e144 Amin N, Schwikowski B, Ideker T (2003) Cytoscape: a software Przulj N (2007) Biological network comparison using graphlet degree environment for integrated models of biomolecular interaction distribution. Bioinformatics 23:e177–e183 networks. Genome Res 13:2498–2504 Przulj N, Corneil DG, Jurisica I (2006) EYcient estimation of graphlet Shedden K, Taylor JM, Enkemann SA, Tsao MS, Yeatman TJ, Gerald frequency distributions in protein–protein interaction networks. WL, Eschrich S, Jurisica I, Giordano TJ, Misek DE, Chang AC, Bioinformatics 22:974–980 Zhu CQ, Strumpf D, Hanash S, Shepherd FA, Ding K, Seymour Radulovich N, Pham NA, Strumpf D, Leung L, Xie W, Jurisica I, Tsao L, Naoki K, Pennell N, Weir B, Verhaak R, Ladd-Acosta C, Go- MS (2010) DiVerential roles of cyclin D1 and D3 in pancreatic lub T, Gruidl M, Sharma A, Szoke J, Zakowski M, Rusch V, Kris ductal adenocarcinoma. Mol Cancer 9:24 M, Viale A, Motoi N, Travis W, Conley B, Seshan VE, Meyerson Rambaldi D, Giorgi FM, Capuani F, Ciliberto A, Ciccarelli FD (2008) M, Kuick R, Dobbin KK, Lively T, Jacobson JW, Beer DG (2008) Low duplicability and network fragility of cancer genes. Trends Gene expression-based survival prediction in lung adenocarci- Genet 24:427–430 noma: a multi-site, blinded validation study. Nat Med 14:822–827 Ravasz E, Somera AL, Mongru DA, Oltvai ZN, Barabasi AL (2002) Shirdel EA, Xie W, Mak TW, Jurisica I (2011) NAViGaTing the mic- Hierarchical organization of modularity in metabolic networks. roNome—using multiple microRNA prediction databases to Science 297:1551–1555 identify signaling pathway-associated microRNA. PLoS One Rebhan M, Chalifa-Caspi V, Prilusky J, Lancet D (1997) GeneCards: 6(2):e17429 integrating information about genes, proteins and diseases. Slamon DJ, Press MF (2009) Alterations in the TOP2A and HER2 Trends Genet 13:163 genes: association with adjuvant anthracycline sensitivity in hu- Reis PP, Tomenson M, Cervigne NK, Machado J, Jurisica I, Pintilie M, man breast cancers. J Natl Cancer Inst 101:615–618 Sukhai MA, Perez-Ordonez B, Grenman R, Gilbert RW, Gullane Sodek KL, Evangelou AI, Ignatchenko A, Agochiya M, Brown TJ, PJ, Irish JC, Kamel-Reid S (2010) Programmed cell death 4 loss Ringuette MJ, Jurisica I, Kislinger T (2008) IdentiWcation of path- increases tumor cell invasion and is regulated by miR-21 in oral ways associated with invasive behavior by ovarian cancer cells squamous cell carcinoma. Mol Cancer 9:238 using multidimensional protein identiWcation technology (Mud- Rhodes DR, Chinnaiyan AM (2005) Integrative analysis of the cancer PIT). Mol Biosyst 4:762–773 transcriptome. Nat Genet 37 Suppl: S31–37 Spentzos D, Levine DA, Ramoni MF, Joseph M, Gu X, Boyd J, Liber- Rhodes DR, Tomlins SA, Varambally S, Mahavisno V, Barrette T, mann TA, Cannistra SA (2004) Gene expression signature with Kalyana-Sundaram S, Ghosh D, Pandey A, Chinnaiyan AM independent prognostic signiWcance in epithelial ovarian cancer. (2005) Probabilistic model of the human protein–protein interac- J Clin Oncol 22:4700–4710 tion network. Nat Biotechnol 23:951–959 Spirin V, Mirny LA (2003) Protein complexes and functional modules Rhodes DR, Kalyana-Sundaram S, Mahavisno V, Varambally R, Yu J, in molecular networks. Proc Natl Acad Sci USA 100:12123– Briggs BB, Barrette TR, Anstet MJ, Kincead-Beal C, Kulkarni P, 12128 Varambally S, Ghosh D, Chinnaiyan AM (2007) Oncomine 3.0: Stark C, Breitkreutz BJ, Reguly T, Boucher L, Breitkreutz A, Tyers M genes, pathways, and networks in a collection of 18, 000 cancer (2006) BioGRID: a general repository for interaction datasets. gene expression proWles. Neoplasia 9:166–180 Nucleic Acids Res 34:D535–D539 Rice JJ, Kershenbaum A, Stolovitzky G (2005) Lasting impressions: Stelzl U, Worm U, Lalowski M, Haenig C, Brembeck FH, Goehler H, motifs in protein–protein maps may provide footprints of evolu- Stroedicke M, Zenkner M, Schoenherr A, Koeppen S, Timm J, tionary events. Proc Natl Acad Sci USA 102:3173–3174 MintzlaV S, Abraham C, Bock N, Kietzmann S, Goedde A, Roberts PJ, Der CJ (2007) Targeting the Raf-MEK-ERK mitogen-acti- Toksoz E, Droege A, Krobitsch S, Korn B, Birchmeier W, Lehrach vated protein kinase cascade for the treatment of cancer. Onco- H, Wanker EE (2005) A human protein–protein interaction net- gene 26:3291–3310 work: a resource for annotating the proteome. Cell 122:957–968 Rodriguez-Esteban R (2009) Biomedical text mining and its applica- Stratton MR, Campbell PJ, Futreal PA (2009) The cancer genome. tions. PLoS Comput Biol 5:e1000597 Nature 458:719–724 Rual JF, Venkatesan K, Hao T, Hirozane-Kishikawa T, Dricot A, Li N, Stumpf MP, Thorne T, de Silva E, Stewart R, An HJ, Lappe M, Wiuf Berriz GF, Gibbons FD, Dreze M, Ayivi-Guedehoussou N, Klitg- C (2008) Estimating the size of the human interactome. Proc Natl ord N, Simon C, Boxem M, Milstein S, Rosenberg J, Goldberg Acad Sci USA 105:6959–6964 DS, Zhang LV, Wong SL, Franklin G, Li S, Albala JS, Lim J, Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Fraughton C, Llamosas E, Cevik S, Bex C, Lamesch P, Sikorski Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, RS, Vandenhaute J, Zoghbi HY, Smolyar A, Bosak S, Sequerra R, Mesirov JP (2005) Gene set enrichment analysis: a knowledge-based 123 Hum Genet (2011) 130:465–481 481

approach for interpreting genome-wide expression proWles. Proc Warnat P, Eils R, Brors B (2005) Cross-platform analysis of cancer Natl Acad Sci USA 102:15545–15550 microarray data improves gene expression based classiWcation of Syed AS, D’Antonio M, Ciccarelli FD (2010) Network of Cancer Genes: phenotypes. BMC Bioinformatics 6:265 a web resource to analyze duplicability, orthology and network Wei Y, Tong J, Taylor P, Strumpf D, Ignatchenko V, Pham NA, Ya- properties of cancer genes. Nucleic Acids Res 38:D670–D675 nagawa N, Liu G, Jurisica I, Shepherd FA, Tsao MS, Kislinger T, Szklarczyk D, Franceschini A, Kuhn M, Simonovic M, Roth A, Moran MF (2011) Primary tumor xenografts of human lung ade- Minguez P, Doerks T, Stark M, Muller J, Bork P, Jensen LJ, von no and squamous cell carcinoma express distinct proteomic sig- Mering C (2011) The STRING database in 2011: functional inter- natures. J Proteome Res 10:161–174 action networks of proteins, globally integrated and scored. Wuchty S (2006) Topology and weights in a protein domain interac- Nucleic Acids Res 39:D561–D568 tion network—a novel way to predict protein interactions. BMC Tan PK, Downey TJ, Spitznagel EL Jr, Xu P, Fu D, Dimitrov DS, Genomics 7:122 Lempicki RA, Raaka BM, Cam MC (2003) Evaluation of gene Xenarios I, Rice DW, Salwinski L, Baron MK, Marcotte EM, Eisen- expression measurements from commercial microarray plat- berg D (2000) DIP: the database of interacting proteins. Nucleic forms. Nucleic Acids Res 31:5676–5684 Acids Res 28:289–291 Tian L, Greenberg SA, Kong SW, Altschuler J, Kohane IS, Park PJ Yu H, Paccanaro A, Trifonov V, Gerstein M (2006) Predicting interac- (2005) Discovering statistically signiWcant pathways in expres- tions in protein networks by completing defective cliques. Bioin- sion proWling studies. Proc Natl Acad Sci USA 102:13544–13549 formatics 22(7):823–829 Tomasini R, Tsuchihara K, Wilhelm M, Fujitani M, RuWni A, Cheung Zanzoni A, Montecchi-Palazzi L, Quondam M, Ausiello G, Helmer- CC, Khan F, Itie-Youten A, Wakeham A, Tsao MS, Iovanna JL, Citterich M, Cesareni G (2002) MINT: a Molecular INTeraction Squire J, Jurisica I, Kaplan D, Melino G, Jurisicova A, Mak TW database. FEBS Lett 513:135–140 (2008) TAp73 knockout shows genomic instability with infertil- Zhang B, Horvath S (2005) A general framework for weighted gene ity and tumor suppressor functions. Genes Dev 22:2677–2691 co-expression network analysis. Stat Appl Genet Mol Biol 4 (arti- van Vliet MH, Reyal F, Horlings HM, van de Vijver MJ, Reinders MJ, cle17) Wessels LF (2008) Pooling breast cancer datasets has a synergetic Zhang M, Yao C, Guo Z, Zou J, Zhang L, Xiao H, Wang D, Yang D, eVect on classiWcation performance and improves signature sta- Gong X, Zhu J, Li Y, Li X (2008) Apparently low reproducibility bility. BMC Genomics 9:375 of true diVerential expression discoveries in microarray studies. Varambally S, Yu J, Laxman B, Rhodes DR, Mehra R, Tomlins SA, Bioinformatics 24:2057–2063 Shah RB, Chandran U, Monzon FA, Becich MJ, Wei JT, Pienta Zhu CQ, da Cunha Santos G, Ding K, Sakurada A, Cutz JC, Liu N, KJ, Ghosh D, Rubin MA, Chinnaiyan AM (2005) Integrative Zhang T, Marrano P, Whitehead M, Squire JA, Kamel-Reid S, genomic and proteomic analysis of prostate cancer reveals signa- Seymour L, Shepherd FA, Tsao MS (2008) Role of KRAS and tures of metastatic progression. Cancer Cell 8:393–406 EGFR as biomarkers of response to erlotinib in National Cancer Viau C, McGuYn MJ, Chiricota Y, Jurisica I (2010) The Xowvizmenu Institute of Canada Clinical Trials Group Study BR.21. J Clin On- and parallel scatterplot matrix: hybrid multidimensional visual- col 26:4268–4275 izations for network exploration. Vis Comput Graph IEEE Trans Zhu CQ, Pintilie M, John T, Strumpf D, Shepherd FA, Der SD, Jurisica 16:1100–1108 I, Tsao M-S (2009) Understanding prognostic gene expression von Eichborn J, Murgueitio MS, Dunkel M, Koerner S, Bourne PE, signatures in lung cancer. Clin Lung Cancer 10(5):331–340 Preissner R (2011) PROMISCUOUS: a database for network- Zhu CQ, Ding K, Strumpf D, Weir BA, Meyerson M, Pennell N, Tho- based drug-repositioning. Nucleic Acids Res 39:D1060–D1066 mas RK, Naoki K, Ladd-Acosta C, Liu N, Pintilie M, Der S, Sey- Wang ET, Sandberg R, Luo S, Khrebtukova I, Zhang L, Mayr C, mour L, Jurisica I, Shepherd FA, Tsao MS (2010a) Prognostic and Kingsmore SF, Schroth GP, Burge CB (2008) Alternative isoform predictive gene signature for adjuvant chemotherapy in resected regulation in human tissue transcriptomes. Nature 456:470–476 non-small-cell lung cancer. J Clin Oncol 28:4417–4424 Wang L, Tang H, Thayanithy V, Subramanian S, Oberg AL, Cunning- Zhu CQ, Strumpf D, Li CY, Li Q, Liu N, Der S, Shepherd FA, Tsao ham JM, Cerhan JR, Steer CJ, Thibodeau SN (2009) Gene net- MS, Jurisica I (2010b) Prognostic gene expression signature for works and microRNAs implicated in aggressive prostate cancer. squamous cell carcinoma of lung. Clin Cancer Res 16:5038–5047 Cancer Res 69:9490–9497

123