Integrative Network Analysis for Understanding Human Complex Traits

by

Lili Wang

A thesis submitted to the School of Computing in conformity with the requirements for the degree of Doctor of Philosophy

Queen’s University Kingston, Ontario, Canada April 2015

Copyright © Lili Wang, 2015

Abstract

Over the last decade, high-throughput biological data have been accumulating at rapidly increasing rates, providing the opportunity to gain insight into various fundamental biological processes. Such large-scale data have been explored using network representations and graph theory to study biological relationships. Meanwhile, a great amount of effort has been dedicated to integrating diverse biological data types in order to build networks and apply computational analysis to distill meaningful information for specific biological problems. As a result, network-based analysis has become a powerful paradigm for modelling and studying large-scale biological data. The goal of network-based analysis of human complex traits is to annotate or predict new relationships between biological entities, such as genes, drugs and phenotypes. Furthermore, such analysis can facilitate the diagnosis and prognosis of common complex diseases.

This thesis comprises three contributions. First, heterogeneous biological data are integrated, and a novel tool has been developed to easily construct and navigate networks representing the large-scale data. In addition, the resulting networks can be analyzed using computational methods to solve specific biological problems. Second, an integrative network-based pathway analysis for genome-wide association studies (GWAS) has been proposed, which takes advantage of the large-scale network to combine topological connectivity with signals from GWAS in order to detect enriched pathways. Third, an integrative strategy combines multiple quantitative profiles with a large-scale network to assist biomarker selection for ovarian cancer using two different computational methods: (A) an aggregate ranking to score the candidate proteins and (B) pathway analysis to find enriched sub-networks.

These three contributions demonstrate a pipeline to model large heterogeneous biological data in terms of networks and to conduct network-based analysis for understanding the molecular basis of human diseases.

Acknowledgments

First, I would like to thank my parents, who have always stood behind me and supported me, and my daughter, who inspires me and gives me the motivation to succeed. Special thanks go to my gracious advisor and mentor Dr. Parvin Mousavi for her great encouragement and generous support through all these years of my PhD study. Without her, I would not have completed my study, nor been able to participate in exploring such a cross-disciplinary research field. She is a role model of inspiration and wisdom of life. I would also like to thank my mentors Dr. Sergio Baranzini and Dr. Igor Jurisica for providing a huge amount of help with all aspects of my research. Many thanks to the members of the Baranzini Lab at the University of California San Francisco and the Jurisica Lab at the University of Toronto for great collaborations and for sharing ideas and experience. Additionally, I would like to thank my PhD advisory committee members Dr. Dorothea Blostein and Dr. Janice Glasgow for their feedback and support. I really appreciate their effort in reading and commenting on my research. I would also like to thank the past and current members of the Med-i Lab: Amir, Farhad, Layan, Nathan, and Zhen for their friendship.

Finally, I would like to give thanks to the School of Computing at Queen’s University, for the good education they offered me that helped me be who I am.

Glossary

CROC Concentrated Receiver Operating Characteristic.

CTD Comparative Toxicogenomics Database.

DO Disease Ontology.

GO Gene Ontology.

GWAS Genome-wide association studies.

I2D Interologous Interaction Database.

KEGG Kyoto Encyclopedia of Genes and Genomes.

NHGRI National Human Genome Research Institute.

OMIM Online Mendelian Inheritance in Man database.

PPI Protein-protein interaction.

ROC Receiver operating characteristic.

SNP Single nucleotide polymorphism.

Contents

Abstract
Acknowledgments
Glossary
Contents
List of Figures
List of Tables

Chapter 1: Introduction
  1.1 Motivation
  1.2 Objectives
  1.3 Contributions
  1.4 Organization of Thesis

Chapter 2: Background
  2.1 Network Terminology and Definitions
    2.1.1 Degree
    2.1.2 Path
    2.1.3 Clustering Coefficient
    2.1.4 Network Models
  2.2 Biological Interactions and Sources
    2.2.1 Protein-protein Interactions
    2.2.2 Metabolic and Signalling Pathways
    2.2.3 Disease Gene Associations
    2.2.4 Drug Targets
    2.2.5 Gene Expression Data
    2.2.6 Protein Expression Data
    2.2.7 microRNA Data
  2.3 Data Integration
  2.4 Network Analysis
    2.4.1 Gene Prioritization
    2.4.2 Subnetwork Detection
  2.5 Discussion

Chapter 3: Integrative Complex Traits Networks
  3.1 Introduction
  3.2 Background
  3.3 Materials and Methods
    3.3.1 Nodes
    3.3.2 Edges
    3.3.3 Computational Analysis
  3.4 Implementation
    3.4.1 Availability and Requirements
    3.4.2 Database
  3.5 Features
    3.5.1 Disease-centered Network Model
    3.5.2 Gene-centered Network Model
    3.5.3 Drug-centered Network Model
    3.5.4 Similarity Network
  3.6 Discussion

Chapter 4: Integrative Network-based Functional Module Discovery for Genome-wide Association Studies
  4.1 Introduction
  4.2 Background
  4.3 Data
    4.3.1 GWAS Data Sets
    4.3.2 Human Protein Interaction Network
    4.3.3 Benchmark Genes
  4.4 Methods
    4.4.1 Step One
    4.4.2 Step Two
    4.4.3 Step Three
    4.4.4 Parameters
    4.4.5 Evaluation
  4.5 Results
    4.5.1 Prediction of WTCCC2 Genes
    4.5.2 Predicting iChip Genes
    4.5.3 Significantly Enriched Networks
    4.5.4 Sensitivity of iPINBPA to Parameters
  4.6 Application
    4.6.1 Availability and Requirements
    4.6.2 Features
  4.7 Discussion

Chapter 5: Integrative Biomarker Selection for Ovarian Cancer
  5.1 Introduction
  5.2 Background
  5.3 Data
    5.3.1 Proteomic Profiles
    5.3.2 Secretomic Profiles
    5.3.3 Transcriptomic Profiles
    5.3.4 Interologous PPI Network
    5.3.5 Matching Profiles
  5.4 Candidate Protein Selection
  5.5 Aggregate Ranking
  5.6 Sub-networks Detection
  5.7 Results
    5.7.1 Enriched Subnetworks
    5.7.2 Enriched KEGG Pathways
  5.8 Conclusion

Chapter 6: Summary and Future Work
  6.1 Future Directions

Bibliography

Appendix A: iCTNet Data Source Description

Appendix B: iCTNet Database Schema

Appendix C: ACS Proteins Selected for Ovarian Cancer

List of Figures

1.1 Network-based analytic process of biological data

2.1 KEGG Pathway: Insulin Secretion (http://www.kegg.jp)

3.1 iCTNet database schema and user interface
3.2 Disease-centered network schema
3.3 Disease network of breast cancer
3.4 Gene-centered network schema
3.5 Gene network of BRCA1, BRCA2 and BRCA3
3.6 Drug-centered network schema
3.7 Drug network of Methotrexate
3.8 Human disease networks

4.1 Work flow of iPINBPA
4.2 ROC and CROC curves [1]
4.3 CROC curves of Meta2.5 and WTCCC2 GWAS data sets
4.4 CROC curves with different restart ratios
4.5 CROC fold enrichments of different values of restart ratio r
4.6 Overview of iPINBPA app
4.7 Manhattan plot in iPINBPA
4.8 Table of genes and blocks
4.9 Node attributes added after running random walk with restart
4.10 Sub-network table
4.11 Sub-network of genes with p-value ≤ 0.05
4.12 Histograms of random networks

5.1 Data structure
5.2 Data distribution
5.3 Aggregate ranking strategy
5.4 Data sources used for sub-network detection
5.5 GO cellular component distribution in three sets
5.6 Network of 43 ACS proteins
5.7 Sub-network detection results
5.8 Detected sub-network including 71 nodes and 269 edges
5.9 21 enriched KEGG pathways

B.1 iCTNet database schema

List of Tables

2.1 PPI database list

3.1 Data sources in iCTNet database

4.1 Methodological difference between iPINBPA and other methods
4.2 Stats of top scored sub-networks from four methods
4.3 Functional annotation clusters of 1,299 genes in DAVID
4.4 Stats of top scored sub-networks from iPINBPA with different cutoffs

A.1 Ovarian cancer entry in disease ontology
A.2 Gene BRCA1 data entry in HGNC
A.3 miRNA hsa-let-7a entry in mirCat
A.4 Chemical-gene data entry in CTD
A.5 Chemical-disease data entry in CTD
A.6 Gene-disease data entry in CTD
A.7 Data entry in GWAS Catalog
A.8 Gene-disease data entry in OMIM
A.9 Drug-target data entries in DrugBank
A.10 Drug-side effect entries in SIDER
A.11 Protein-protein interaction entries in ppiTrim

C.1 43 ACS proteins and their values


Chapter 1

Introduction

1.1 Motivation

A central goal of human genetics is to identify risk factors for common, complex diseases such as autoimmune diseases and cancers, which have a multifactorial etiology. These complex traits are influenced by many genetic or environmental factors. Despite the development of the HapMap Project [2] to provide a catalog of common genetic variants that occur in human beings, the complexity of the underlying mechanisms of such diseases limits our efforts to tackle them using information only from common genetic variants. To date, the survival rates for some complex diseases, e.g., ovarian cancer, remain low due to the lack of knowledge about the corresponding molecular basis.

The wide usage of high-throughput technologies has resulted in a variety of molecular interaction data, such as protein-protein interactions (PPI), metabolic pathways, and functional associations including gene co-expression, drug-target and phenotype-gene associations, being accumulated and shared at rapidly increasing rates. These technologies include DNA microarrays, which measure the expression levels of a large number of genes simultaneously, chip-based microarrays for assaying millions of single nucleotide polymorphisms (SNPs) in genome-wide association studies (GWAS), mass spectrometry-based screening of protein-ligand complexes, and many more examples in the drug discovery industry [3]. In the meantime, many efforts in biology, statistics, graph theory, data mining, machine learning and visualization have been devoted to an emerging interdisciplinary field, namely network biology. Network biology offers a new conceptual framework to remodel our view of disease pathologies [4]. In this framework, interactions between biological entities, e.g., genes, proteins, and phenotypes, are represented as networks or graphs, where the biological entities are denoted as nodes and connected by edges that indicate physical or functional interactions. The network-based modelling of diverse biological interaction data has the potential to provide novel insights into biological processes and molecular functions. Network-based modelling is data-driven, as each type of biological interaction captures distinct features that provide indirect information about the biological functions involved in complex traits. Various types of molecular interactions offer complementary insights into biological processes, but none of them is complete on its own [5]. The processing and integration of large-scale data is a highly complex and time-consuming task. In the literature to date, most studies have focused on the integration of at most three different types of biological data, or on a specific disease alone, leading to the lack of a systematic global view of complex diseases.

To build a comprehensive view of complex diseases, several different biological interactions should be merged and modelled in terms of networks. The integration of heterogeneous biological data is the cornerstone of the practical applications of network biology, e.g., disease prediction and drug discovery. Once networks have been built, computational network-based analysis can be applied to discover patterns and predict causal genetic markers (nodes in the network) or active modules (referred to as sub-networks or pathways) to help understand the molecular basis of diseases and their relationships. This can facilitate early diagnosis, prognosis, and prevention of disease, as well as drug discovery.

To build a comprehensive view of diseases, and to study them systematically, the network-based analytic process, as shown in Figure 1.1, includes four steps: (1) data integration; (2) network construction; (3) network analysis; and (4) validation. Each step requires tackling several challenges. The integration of diverse biological data sources (steps 1 and 2) is complex, as the scale of large genomic projects is approaching the petabyte scale and is still increasing at unprecedented speed [6]. At the same time, these steps play an important role in the success of subsequent computational analysis; data integration and network construction can help us interpret the large-scale and high-dimensional data sets, and obtain higher-order biological relationships for complex diseases. In step 3, the networks built from such complex, large-scale biological data are themselves quite large. For example, the latest human PPI network in the Human Protein Reference Database contains more than 30,000 nodes and 41,000 edges. It is extremely difficult to distill meaningful information from such large networks. Computational tools for network analysis that scale up well are needed to search for valuable information. In the literature, network-based analysis can be classified into two major categories: (1) approaches that prioritize candidate genes, and (2) approaches that detect modules (sub-networks), to search for predictive signatures of complex diseases. Finally, in step 4, it is also quite challenging to validate the results of network analysis of human complex traits because of our incomplete understanding of them and the absence of gold standards.

Figure 1.1: Network-based analytic process of biological data

1.2 Objectives

As explained in Section 1.1, there is a need for integrating multiple large-scale biological resources to build a systematic network model of molecular interactions that offers a comprehensive view of complex diseases. Data integration from biological resources is challenging. First, the identification of appropriate data sources is a complicated task due to the diversity of existing data types and formats [7]. Data from different platforms may meet specific needs of the research, but also introduce biases which may affect the analysis. Second, the network construction is more complicated as it involves not only actual datasets but also domain knowledge [7]. A variety of biological data are available from multiple sources and these data can be integrated if they share overlapping content [7]. However, domain knowledge of genomics, proteomics or metabolomics is needed to incorporate data across multiple biological domains. In addition, network-based computational analysis should aim to distill useful information from large networks.

My research consists of three objectives: (1) integrating a collection of heterogeneous biological interaction data into networks and developing a tool that can easily construct and visualize the resulting networks, which can be analyzed using computational methods to solve specific biological problems; (2) designing an integrative network-based pathway analysis for GWAS to detect enriched pathways (sub-networks) by combining the topological connectivity of the network with the association signals from GWAS; and (3) designing a network-based gene selection approach for biomarker discovery in a particular disease as an example (ovarian cancer). The tool developed in objective 1 provides a platform to apply computational analysis for sub-network detection and gene prioritization as in objectives 2 and 3.

We hypothesize that by integrating multiple biological data in terms of networks, we can develop a framework to analyze functionally related connections for understanding the molecular basis of phenotypes, especially the high-order biological relationships among diseases. Through these objectives, I will focus on the first three steps in the analytic process: data integration, network construction and network analysis, and demonstrate a practical pipeline for conducting network modelling and analysis for a complex disease of interest.

1.3 Contributions

The major contributions of this thesis are as follows:

• Development of the integrated Complex Traits Networks (iCTNet) database by integrating nine types of biological interactions: phenotype-gene, phenotype-tissue, tissue-gene, drug-target, drug-phenotype, drug-side effect, side effect-tissue, protein-protein, and miRNA-gene.

• Development of the software iCTNet, a free application (referred to as an app) for Cytoscape [8]. Cytoscape is an open source platform for large data visualization. The iCTNet application allows automated and systematic construction of meta-networks from the iCTNet database based on a user’s interest. A regular biological network contains only one type of biological interaction among one or two different types of biological components. A meta-network is a “network of networks”, which integrates different types of biological interactions.

• An overview of the biological data sources available for data integration in terms of biological networks. Such integration requires mapping data from various sources and sharing them in a standardized form.

• Design of integrative protein-interaction-network-based pathway analysis (iPINBPA) for GWAS data to detect sub-networks with strong association signals from GWAS.

• Development of the software iPINBPA, a free and novel Cytoscape app that provides six features for analyzing GWAS data in a network fashion.

• Design of an integrative strategy including both aggregate ranking and pathway analysis for biomarker selection. This strategy integrates transcriptomic, proteomic, and secretomic profiles with an extremely large PPI network, to rank candidate proteins and also detect enriched sub-networks for ovarian cancer.

1.4 Organization of Thesis

This dissertation is organized as follows:

Chapter 2, Background: first introduces the major biological interaction databases and online repositories, and follows up with a review of past efforts in network-based analysis of biological data.

Chapter 3, Integrative Complex Traits Networks: introduces the integrated Complex Traits Networks (iCTNet) database consisting of a set of heterogeneous biological interactions, and the iCTNet application, a tool with a user-friendly interface to assist automated construction of meta-networks and subsequent exploration of higher-order biological relationships such as the discovery of new disease genes and new therapeutic applications for existing drugs.

Chapter 4, Integrative Network-based Functional Module Discovery for Genome-wide Association Studies: introduces the integrative protein-interaction-network-based pathway analysis (iPINBPA) for GWAS, a method to identify and prioritize genetic associations by merging statistical evidence of association with physical evidence of interaction at the protein level. In addition, the iPINBPA software application has been implemented with six features to analyze GWAS data in a network fashion.

Chapter 5, Integrative Biomarker Selection for Ovarian Cancer: introduces an aggregate ranking strategy to select potential biomarkers by integrating three different profiles (transcriptomic, proteomic and secretomic data) with a PPI network, and a pathway analysis method, which applies the same concepts as in Chapter 4, to detect enriched sub-networks.

Chapter 6, Summary and Future Work: summarizes the research presented in this dissertation, and discusses recommendations for future work.

Chapter 2

Background

This chapter introduces the commonly used network terminology and definitions, and an overview of biological interactions. The biological interactions are categorized into eight different classes: protein-protein interactions, metabolic and signalling pathways, disease gene associations, drug side effects, drug targets, drug therapy, gene expression data and microRNA data. A discussion of the challenges and strategies to integrate multiple sources of biological data is presented. The chapter ends with a review of computational network-based analysis in the literature.

2.1 Network Terminology and Definitions

In mathematical terms, a network can be represented by a graph $G = (V, E)$, where $V$ is the set of nodes (vertices) and $E$ is the set of edges, which may be directed or undirected. In an undirected graph, the edge $e_{ij}$ is identical to the edge $e_{ji}$; in a directed graph, each edge has a direction from a start node to an end node, so the edge $e_{ij}$ starts from node $i$ and points to node $j$. The graph is sparse if $|E|$ is much less than $|V|^2$; otherwise, the graph is dense. If each edge has an associated weight, i.e., each edge is assigned a numerical value, the graph is called a weighted graph. It is common to represent an unweighted graph $G$ using a binary matrix $W$, where the element $w_{ij} = 1$ if there exists an edge $e_{ij}$, and $w_{ij} = 0$ otherwise. For a weighted graph, a weight matrix $W$ is defined in which $w_{ij} > 0$ denotes the weight of the edge $e_{ij}$ [9].
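The matrix representation described here can be illustrated with a minimal Python sketch. The function names, the toy edge list, and in particular the numeric threshold used to decide "much less than" in `is_sparse` are assumptions made for illustration only; the text does not fix a threshold.

```python
def adjacency_matrix(n, edges, directed=False):
    """Build an n x n matrix W from (i, j, weight) triples.

    For an unweighted graph, pass weight 1 for every edge, which
    yields the binary matrix described in the text.
    """
    W = [[0.0] * n for _ in range(n)]
    for i, j, w in edges:
        W[i][j] = w
        if not directed:
            # In an undirected graph, e_ij and e_ji are the same edge.
            W[j][i] = w
    return W


def is_sparse(num_nodes, num_edges, ratio=0.1):
    """Heuristic check that |E| is much less than |V|^2.

    The cutoff `ratio` is an illustrative assumption.
    """
    return num_edges < ratio * num_nodes ** 2


# Hypothetical toy graph with four nodes and three weighted edges.
toy_edges = [(0, 1, 1.0), (1, 2, 0.5), (2, 3, 2.0)]
W_undirected = adjacency_matrix(4, toy_edges)
W_directed = adjacency_matrix(4, toy_edges, directed=True)
```

For large sparse networks, an adjacency list or a sparse matrix format would normally be preferred over the dense list-of-lists used in this sketch.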

2.1.1 Degree

The degree of a node in a graph is the number of edges incident to the node. In an undirected network, the degree of a node is exactly the number of edges connecting to it. In directed networks, a node has an incoming degree, defined as the number of links that point to this node, and an outgoing degree, defined as the number of links that start from it. Nodes whose degree is much higher than the average are called hubs. The degree distribution $P(k)$ denotes the probability that a selected node has exactly $k$ links [4]. Assuming that each of the $n$ nodes in a network is connected to each other node independently with equal probability $p$, the degree distribution is:

\[
P(k) = \binom{n-1}{k} \, p^{k} (1-p)^{n-1-k} \tag{2.1}
\]

Many studies agree that the degree distribution of a PPI network follows a power law, $P(k) \sim k^{-\gamma}$, where $\gamma$ is a constant [10]. Such a feature indicates the presence of hubs in the network.
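To make the contrast between the binomial degree distribution of Equation (2.1) and a power-law distribution concrete, the following Python sketch evaluates both for a small network. It is an illustration under stated assumptions: the network size, the connection probability, the exponent $\gamma = 2.5$, and the degree cutoff of 20 used to represent "hubs" are arbitrary choices and are not taken from the thesis.

```python
from math import comb


def binomial_degree_pmf(n, p):
    """P(k) from Equation (2.1) for k = 0, ..., n-1."""
    return [comb(n - 1, k) * p ** k * (1 - p) ** (n - 1 - k) for k in range(n)]


def power_law_pmf(n, gamma):
    """Normalized P(k) proportional to k^(-gamma) for k = 1, ..., n-1.

    P(0) is set to 0 because k^(-gamma) is undefined at k = 0.
    """
    weights = [k ** -gamma for k in range(1, n)]
    total = sum(weights)
    return [0.0] + [w / total for w in weights]


# Illustrative parameters (assumptions, not values from the thesis).
n, p = 50, 0.1
binom = binomial_degree_pmf(n, p)
power = power_law_pmf(n, gamma=2.5)

# Mean degree of the binomial model is (n - 1) * p.
mean_degree = sum(k * pk for k, pk in enumerate(binom))

# Probability mass on high-degree nodes ("hubs") under each model.
binom_tail = sum(binom[20:])
power_tail = sum(power[20:])
print(f"mean degree (binomial): {mean_degree:.2f}")
print(f"P(k >= 20): binomial {binom_tail:.2e}, power law {power_tail:.2e}")
```

With these settings the power-law model places far more probability on high-degree nodes than the binomial model, which is the sense in which a power law indicates the presence of hubs.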

2.1.2 Path

A path is defined as a sequence of nodes where each node is adjacent to the node that follows it in the sequence [4]. The distance between two nodes is measured by the path length, i.e., the number of links we need to travel from the start node to the end node. As there may be multiple paths between a pair of nodes, the shortest path is the route (or routes) with the smallest number of links between the selected nodes. The shortest path is a simple way to measure the distance between nodes in a network. If there are any cycles in the graph, then the graph is cyclic. A graph that does not contain any cycles is acyclic. Most biological networks are sparse and cyclic.
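As an illustration of the shortest path length in an unweighted network, the following Python sketch uses breadth-first search, a standard technique for this purpose. The function name and the toy adjacency list are hypothetical and serve only as an example.

```python
from collections import deque


def shortest_path_length(adj, start, end):
    """Return the number of links on a shortest path, or None if unreachable.

    `adj` maps each node to an iterable of its neighbours. Breadth-first
    search visits nodes in order of increasing distance from `start`, so
    the first time `end` is reached its distance is minimal.
    """
    if start == end:
        return 0
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adj.get(node, ()):
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                if neighbour == end:
                    return dist[neighbour]
                queue.append(neighbour)
    return None


# Hypothetical undirected toy network; node F is isolated.
toy_adj = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
    "F": [],
}
length_a_e = shortest_path_length(toy_adj, "A", "E")
length_a_f = shortest_path_length(toy_adj, "A", "F")
```

In this toy network there are two shortest routes from A to E (through B or through C), both of length 3, and the cycle A-B-D-C-A makes the graph cyclic.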

2.1.3 Clustering Coefficient

The clustering coefficient of a node is defined as:

\[
C_i = \frac{2 n_i}{k_i (k_i - 1)} \tag{2.2}
\]

where $n_i$ is the number of links connecting the $k_i$ neighbours of node $i$ to each other [4]. In other words, $C_i$ gives the ratio of the number of triangles that go through node $i$ to the total number of triangles that could pass through node $i$.
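Equation (2.2) can be evaluated directly from an adjacency structure, as in the following Python sketch. The function name and the toy network are hypothetical; returning 0 for nodes with fewer than two neighbours is a common convention adopted here as an assumption, since the formula is undefined in that case.

```python
def clustering_coefficient(adj, i):
    """Clustering coefficient C_i = 2 * n_i / (k_i * (k_i - 1)).

    `adj` maps each node to the set of its neighbours in an
    undirected graph.
    """
    neighbours = adj[i]
    k = len(neighbours)
    if k < 2:
        # Assumed convention: C_i = 0 when no triangle is possible.
        return 0.0
    # n_i: number of links among the neighbours of i, each counted once.
    n_i = sum(
        1
        for u in neighbours
        for v in neighbours
        if u < v and v in adj[u]
    )
    return 2 * n_i / (k * (k - 1))


# Hypothetical toy network: a triangle A-B-C with an extra node D attached to A.
toy_adj = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}
c_a = clustering_coefficient(toy_adj, "A")
c_b = clustering_coefficient(toy_adj, "B")
c_d = clustering_coefficient(toy_adj, "D")
```

Node A has three neighbours with one link among them, so only one of its three possible triangles is present; node B has two neighbours that are linked, so its single possible triangle is present.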

2.1.4 Network Models

The modelling of networks provides novel insights into real biological networks and significantly improves our understanding of the relationships between biological entities, such as genes, proteins, and phenotypes. We give a brief introduction to the network models discussed in biological network studies [11][12][13], to explain the difference between biological networks and other complex networks.

Random networks

The theory of random networks was introduced by Paul Erdős and Alfréd Rényi [10]. The simple random network model, also called the Erdős-Rényi (ER) random model, starts with $N$ nodes and connects each pair of nodes by an edge with the same probability $p$, which creates a graph whose expected number of edges is $pN(N-1)/2$. In this model, nodes have approximately the same number of connections, and hubs occur only rarely. Biological and social networks are not random networks, as they reveal a high level of order and organization [14].
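The ER construction can be sketched in a few lines of Python. The function name, the parameter values, and the fixed random seed are assumptions chosen for illustration; the sketch simply compares the number of sampled edges with the expected value $pN(N-1)/2$ and reports the spread of node degrees.

```python
import random
from itertools import combinations


def erdos_renyi(n, p, seed=None):
    """Sample an ER random graph as a list of undirected edges (u, v).

    Every one of the n * (n - 1) / 2 node pairs is connected
    independently with probability p.
    """
    rng = random.Random(seed)
    return [(u, v) for u, v in combinations(range(n), 2) if rng.random() < p]


# Illustrative parameters (assumptions, not values from the thesis).
n, p = 200, 0.05
edges = erdos_renyi(n, p, seed=1)
expected_edges = p * n * (n - 1) / 2

# Degree of every node, to see how tightly degrees cluster around the mean.
degree = [0] * n
for u, v in edges:
    degree[u] += 1
    degree[v] += 1

print(f"edges sampled: {len(edges)}, expected: {expected_edges:.0f}")
print(f"degree range: {min(degree)} to {max(degree)}, mean {sum(degree) / n:.2f}")
```

Because each pair is sampled independently, the degrees concentrate around the mean $(N-1)p$, which is why hubs are rare in this model.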

Scale-free networks

Scale-free networks are characterized by a power-law degree distribution, as discussed in Section 2.1.1. In random networks, all the nodes share the same probability of being connected, so hubs are rare. In contrast, the power-law degree distribution indicates the presence of a few hubs in scale-free networks. Studies of biological networks have found scale-free topology at many different organizational levels, ranging from genetic to protein interaction and metabolic networks. Oltvai and Barabási [15] described the cellular system as a scale-free network. Scale-free networks have a much larger clustering coefficient than random networks [16].

Hierarchical networks

Ravasz et al. [17] first introduced the concept of hierarchical networks in 2002. In a hierarchical network, a central node in the top level of the hierarchy is connected to one or more central nodes in a lower level of the hierarchy. Hierarchical networks combine modularity, a high degree of clustering, and scale-free topology in a single model [16]. A large number of real biological networks, including metabolic networks and PPI networks, are both scale-free and hierarchical [16]. The presence of a hierarchy and the scale-free property can be captured in a quantitative manner using a scaling law which describes the dependence of the clustering coefficient on the node degree [16]. Random networks are not hierarchical, and scale-free networks may not be hierarchical [16].

2.2 Biological Interactions and Sources

Many studies in the last decade review biological interaction databases as a source of expert knowledge that can be used to guide disease research. To date, there are thousands of biological database resources available [18]. In the following sections, we introduce the most studied biological interactions or associations in the following order: protein-protein interactions, metabolic and signalling pathways, disease gene associations, drug targets, gene expression data, protein expression data and microRNA target data. A comprehensive list of online biology databases is available in the latest database issue of the Nucleic Acids Research journal [19]. In this section, we review a collection of sources of biological interactions which can be integrated with each other to build a systematic network model that offers novel insights into complex diseases. Throughout this thesis, we use the terms SNP, gene and protein interchangeably, as we aim to provide a high-level view of complex diseases.

2.2.1 Protein-protein Interactions

Protein-protein interactions (PPIs) are the most fundamental relationships in human disease research, as PPIs are essential to many cellular functions. PPIs occur when two or more proteins bind together to carry out cellular functions. There are two main categories of PPI: experimentally validated (physical) or computationally predicted (functional). Usually the physical interactions have been experimentally validated using the yeast two-hybrid system (Y2H). Currently, experimentally detected PPIs represent only a small fraction of all PPIs, and they are stored in different databases. Most PPI databases are composed of manually curated literature and experimental data. Extracting information from the literature and subsequent manual curation is challenging. BIND [20], CORUM [21], DIP [22], HPRD [23], IntAct [24], and MINT [25] are among the highly accessed, manually compiled PPI databases. In addition to the above databases, two comprehensive databases, I2D [26] and iRefIndex [27], consolidate multiple public databases to provide a thorough coverage of the existing PPIs. Links to the PPI databases introduced in this section are listed in Table 2.1, with brief descriptions.

Name            Description                                             URL
BIND [20]       Biomolecular Interaction Network Database               http://bind.ca
BioGRID [28]    Biological General Repository for Interaction Datasets  http://thebiogrid.org
CORUM [21]      Comprehensive Resource of Mammalian protein complexes   http://mips.helmholtz-muenchen.de/genre/proj/corum
DIP [22]        Database of Interacting Proteins                        http://ww.dip.doe-mbi.ucla.edu
HPRD [23]       Human Protein Reference Database                        http://www.hprd.org
I2D [26]        Interologous Interaction Database                       http://ophid.utoronto.ca/i2d
IntAct [24]     Open source database of molecular interaction data      http://www.ebi.ac.uk/intact
iRefIndex [27]  A reference index for protein interaction data          http://irefindex.org
MINT [25]       Molecular INTeraction Database                          http://mint.bio.uniroma2.it
MIPS [29]       Munich Information Centre for Protein Sequences         http://mips.helmholtz-muenchen.de/proj/ppi
STRING [30]     Known and predicted PPIs                                http://string-db.org

Table 2.1: PPI database list

2.2.2 Metabolic and Signalling Pathways

It is widely known that some complex human diseases are caused by defects in biological pathways. For example, insulin signalling plays a crucial role in the onset of type II diabetes. The understanding of biological pathways starts from metabolism. Metabolism is the set of chemical reactions that occur in a cell to support its life. Metabolic networks, also called metabolic pathways, are the most comprehensive biological networks [31]. In metabolic networks, nodes are biochemical metabolites, and directed or undirected edges represent irreversible or reversible biochemical reactions, respectively [32].

KEGG [33] is the leading metabolic pathway database and was first released in 1997. In KEGG, each pathway is a manually drawn graphical diagram, representing molecular pathways for metabolism, genetic information processing, environmental information processing, other cellular processes, human diseases, and drug development [33]. Fig. 2.1 shows a cascade of biochemical processes in the insulin secretion pathway in KEGG. The description and other information of this pathway can be found at http://www.kegg.jp/dbget-bin/www_bget?pathway+map04911. Other comprehensive pathway databases include Pathway Commons [34], WikiPathways [35] and Reactome [36]. To test the hypothesis that there are direct connections between diseases associated with the same metabolic pathway, a metabolic disease network was constructed in which two disorders are connected if they are linked to potentially correlated reactions. Results showed that, in general, metabolically connected diseases have a higher than average comorbidity (or co-occurrence) rate [37].

Figure 2.1: KEGG Pathway: Insulin Secretion (http://www.kegg.jp)

2.2.3 Disease Gene Associations

The most studied disease-gene association database is the Online Mendelian Inheritance in Man (OMIM) [38]. OMIM collects genetic information related to disease phenotypes from biomedical literature, via text mining. Each OMIM entry has a full text summary of a phenotype with a list of short descriptions of relevant publications. By collecting disease-gene associations from OMIM, Goh et al. [39] generated the first global disease network in which two diseases are connected if they are associated with the same gene. Their aim was to provide a novel view of the genetic relationship among diseases.

In addition, genome-wide association studies (GWAS) have produced a rapidly increasing number of disease-gene associations for common human complex traits by investigating single nucleotide polymorphisms (SNPs) of DNA sequences in populations. Such studies aim to identify how these polymorphisms differ between disease and normal populations. These studies have the potential to uncover relationships between disease and SNPs that could be genetically inherited, as well as the risk of an individual for disorders. Statistical approaches are used to identify potential polymorphisms associated with disease, based on the distribution of a particular SNP's value between two populations, and significance values are reported. The National Human Genome Research Institute (NHGRI) GWAS catalog [40] collects and reports all large-scale GWAS, their reported highly significant SNPs, and the associated PubMed manuscripts. The data in OMIM, collected using a text mining approach, are susceptible to inconsistencies and omissions. In contrast, the GWAS catalog provides more reliable information from published GWA studies, in which a p-value is provided as the statistical significance score for each phenotype-SNP association [40].

Other notable databases containing disease-gene associations include the Genetic Association Database (GAD) [41] and the Human Genome Variation database of Genotype to Phenotype information (HGVbaseG2P) [42]. GAD focuses on archiving information from the literature on common complex human diseases, which are multifactorial, rather than rare Mendelian disorders, which are controlled by a single gene, as found in OMIM. Each record in GAD represents one association between a disease and a gene, annotated with links to molecular databases (GeneCards, HapMap, etc.) and publications. HGVbaseG2P is a database for summary-level findings from GWAS. Each query of a phenotype in HGVbaseG2P returns a list of GWA studies along with a short description and the number of reported associations for each study.
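As an illustration of how a significance value for a single SNP can be obtained from the distribution of its alleles in two populations, the following minimal sketch applies a 2x2 chi-square allelic test. The choice of test, the function name `allelic_chi_square` and the allele counts are assumptions made for this example only; they are not taken from the GWAS catalog or from any study cited above.

```python
from math import erfc, sqrt


def allelic_chi_square(case_counts, control_counts):
    """Chi-square test (1 degree of freedom) on a 2x2 table of allele counts.

    case_counts and control_counts are (minor_allele, major_allele) tuples.
    Returns (chi2, p_value).
    """
    a, b = case_counts
    c, d = control_counts
    n = a + b + c + d
    row_cases, row_controls = a + b, c + d
    col_minor, col_major = a + c, b + d
    if 0 in (row_cases, row_controls, col_minor, col_major):
        return 0.0, 1.0  # degenerate table: no evidence either way
    chi2 = n * (a * d - b * c) ** 2 / (
        row_cases * row_controls * col_minor * col_major
    )
    # For one degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = erfc(sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical allele counts for one SNP (illustrative numbers only).
chi2, p_value = allelic_chi_square(case_counts=(300, 700),
                                   control_counts=(200, 800))
```

In a genome-wide setting such a test would be repeated for every SNP, so reported p-values are normally interpreted under a correction for multiple testing.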

2.2.4 Drug Targets

The traditional drug design paradigm is to find a small molecule which interacts with one or two targets (proteins), resulting in treatment and prevention of human diseases [43]. Systems pharmacology is an emerging field that studies drug action across multiple biological levels, from molecular and cellular to tissue and organism, to understand the relationship between drugs and targets, and to assist drug target identification and drug repurposing. A comprehensive and manually curated data source is DrugBank (http://www.drugbank.ca) [44], which contains 4,282 non-redundant proteins linked to about 7,740 drug entries. Each entry contains drug/chemical data and target protein data. Other notable databases include the Comparative Toxicogenomics Database (CTD) [45] and the Therapeutic Target Database (TTD) [46]. CTD provides information about interactions between environmental chemicals and proteins and their relationships to diseases, including over 15 million toxicogenomic relationships. TTD includes approximately 2,000 targets linked to 5,000 drug compounds. Other drug-target databases are discussed in [47]. In addition to the drug-target databases discussed above, an open access resource, the International Union of Basic and Clinical Pharmacology/British Pharmacological Society (IUPHAR/BPS) Guide to PHARMACOLOGY (http://www.guidetopharmacology.org) [48], has been released recently to provide comprehensive information on the targets of approved and experimental drugs, including pharmacological, chemical, genetic, functional and pathophysiological data. Another database, called NetwoRX, provides pre-computed drug lists for KEGG pathways, GO categories, YEASTRACT transcription factor targets, and phenotypes [49].

2.2.5 Gene Expression Data

Genes are expressed differently in different developmental stages or diseases, as well as in different tissue types [50]. The advent of DNA microarrays accelerated the collection of gene expression data, as such arrays enable the analysis of the mRNA expression of thousands of genes simultaneously. The resulting gene expression studies lead to the identification of new genes with importance in diseases, and to the construction of a global genetic map of human complex traits.

The public repositories of gene expression data have been booming since the advent of microarray technology. The four most highly accessed gene expression databases are ArrayExpress, GeneSigDB, Gene Expression Omnibus (GEO) and Oncomine. The ArrayExpress Archive, a functional genomics database [51], contains data generated by sequencing or array-based technologies. Unlike genomics, functional genomics focuses on gene and protein functions and interactions, instead of DNA sequence or structures. The ArrayExpress Archive is updated daily, and the definition of an experimental dataset is determined by the user who uploaded the data and therefore varies. GeneSigDB [52] is a database of prognostic, diagnostic and other gene signatures of cancer and related diseases. The data in GeneSigDB have been extracted and manually curated from the published literature. The GEO database [53] was established at the National Center for Biotechnology Information (NCBI) a decade ago. GEO now stores over 20,000 microarray- and sequence-based functional genomics studies. Oncomine [54] is a cancer microarray database. To date, Oncomine contains clinical outcome analyses on 128 datasets with 23,124 samples, pathway and drug analyses on 134 datasets with 5,042 samples, and the latest Cancer Genome Atlas.

Disease genes that participate in a common functional module tend to be expressed in the same tissue [39]. Some studies [55][56][57] have looked at the expression patterns of genes in healthy and diseased tissues by integrating gene expression data with PPI networks. These studies have shown that ubiquitously expressed proteins have extensive interactions with tissue-specific proteins, which only occur in a restricted subset of cells or tissues. Publicly available data with reported associations between genes and tissues include: (1) gene expression data in 79 different human cell or tissue types organized by Bossi and Lehner [55]; (2) data from 27,868 human genes for 31 different tissue types in [56]; and (3) gene expression data for 19 different tissue types from the Human Gene Expression Index (HuGE Index) [57] database. Two recent reviews [58][50] have addressed a global view of protein expression in human cells and tissues. Through their analysis, Ponten et al. indicate that although many proteins are expressed in all tissues, there are a handful of proteins that are only expressed in certain tissue types and the amount of expression can be tissue-specific [58]. Lukk et al. created a global gene expression map in hundreds of different tissues and cells [50]. Research on the relationships between disease genes and tissues is relatively new and ongoing; further studies are needed to better define these relationships.

2.2.6 Protein Expression Data

Although proteins are composed of 20 standard amino acids, protein splicing and post-translational modification result in a variety of protein products from a single gene. In addition, dynamic processes, such as protein maturation and degradation, increase the functional complexity of proteins [59]. Protein expression data are a useful source for discovering biomarkers which change uniquely through different stages of diseases. Proteomics protocols, such as experimental methods for protein separation, quantification, and identification, are discussed in [59]. An online database, the Human Protein Atlas portal [60], provides a tissue-based map of proteins enriched in 44 different normal human tissues, 20 different cancer types, and 46 different human cell lines.

2.2.7 microRNA Data

MicroRNAs (also called miRNAs) are small (approximately 22-nucleotide), endogenous non-coding RNAs. MicroRNAs are able to bind to complementary sequences on target messenger RNA (mRNA) to control mRNA degradation and translation inhibition. MicroRNAs are essential for various cellular processes, including cell cycle control, cell growth and organ development, and consequently play a critical role in disease development, progression, prognosis, diagnosis and treatment response [61]. So far, there are over 1,800 human microRNAs annotated using deep sequencing data in the primary microRNA sequence repository miRBase [62].

Several microRNA data sources have also been established. These include the experimentally validated microRNA-gene association databases Tarbase [63] and miRTarBase [64], and the computationally predicted microRNA-gene interaction databases DIANA-microT [65], picTar [66], PITA [67], TargetScan [68] and the microRNA Data Integration Portal (mirDIP) [69]. mirDIP is the most comprehensive repository for microRNA-target predictions, and integrates up-to-date data from the following data sources: DIANA-microT, microRNA [70], miRBase, picTar, PITA, RNA22 [71], and TargetScan.

In addition to the microRNA-gene interaction data, miRGator [72], miR2Disease [61] and PhenomiR [73] are public databases providing information regarding microRNAs and related phenotypes. miRGator is an integrated database offering miRNA-associated gene expression, disease association and genomic annotation. miR2Disease collects data from more than 600 manuscripts through PubMed, and then manually curates 1,939 connections between 299 human microRNAs and 94 human diseases. PhenomiR includes 11,029 microRNA records from 542 studies in 296 published articles.

2.3 Data Integration

Due to the increasing accumulation of a variety of high throughput experimental data, data integration has become a central focus in making reliable inferences in network biology. It is well accepted that no error-free or complete biological dataset exists. Fortney and Jurisica [74] have given practical examples to explain the existence of errors in high throughput data and the disagreements among different experimental platforms. Although integrating diverse information across different databases or repositories is very challenging [75], it is the cornerstone of biological network analysis, as the integration of diverse genetic interactions can potentially reduce both noise and platform bias to some extent [74]. Furthermore, the integration of heterogeneous data can offer novel insights into the genetic basis of complex traits.

Generally speaking, there are two types of integration strategies in the literature. One is to integrate multiple sources of the same type of interactions to improve coverage and reliability. For example, different protein-protein interaction repositories have been created with different research foci, and the data in these repositories are partly complementary and partly redundant. A common identifier must be shared in order to integrate two or more sources. Some PPI repositories, such as I2D [26] and iRefIndex [27], integrate PPIs by consolidating multiple existing sources. The other integration strategy is to merge heterogeneous biological interactions to provide a global or comprehensive view for specific purposes. For instance, in gene prioritization, which ranks candidate genes by measuring their similarities to known disease associated genes, approaches merging heterogeneous biological interactions have achieved much higher prediction accuracy than those using a single data source.

The amount of data from large genomic projects is at the petabyte scale, and still increasing at unprecedented speed [6]. The most pressing challenges posed by large-scale data in life science include: (1) transfer, management and maintenance of data; (2) standardization of data formats to unify the representation; and (3) integration of diverse large-scale data sources. This thesis focuses on the third challenge: integration of heterogeneous biological interactions and their representation in the form of networks to assist our understanding of the genetic basis of human complex diseases. The current and future challenges in such data integration have been discussed in a recent review [7].

Along with the exponentially increasing accumulation of diverse datasets representing various aspects of cellular processes, large-scale biological data have been successfully integrated and analyzed to help scientists further our understanding of complex human diseases. As mentioned before, biological networks of multiple entities, namely, metabolic and signalling pathways, PPI networks, disease-gene associations, tissue expression, drug target networks, gene expression data and microRNA data, have been explored and well studied to help generate a global view of fundamental biological processes relevant to human diseases. Unfortunately, in the literature to date, most studies have integrated at most three different types of interactions, leading to the lack of a systematic global view of complex diseases. In addition, no standard method for integration of these data exists. There are many-to-one or many-to-many relationships among biological objects, e.g., a protein-encoding gene may have a standard gene symbol and several aliases, and a single gene can have several protein products. The cross-mappings between genes and proteins vary across different studies. Therefore, the quality of computational analysis approaches developed for biological networks is limited by the availability and reliability of the data, i.e., it is data-driven. In order to improve the performance of computational analysis, and to enhance its coverage and reliability, the integration of high dimensional data is essential.
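The first integration strategy, consolidating several sources of the same interaction type through a shared identifier, can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used by I2D or iRefIndex: the function name `integrate_ppi_sources`, the source labels and the small alias table are hypothetical, and a real pipeline would rely on a curated identifier mapping.

```python
def integrate_ppi_sources(sources, alias_to_symbol):
    """Merge several PPI edge lists into one undirected network.

    sources: dict mapping a source name to a list of (protein_a, protein_b).
    alias_to_symbol: dict mapping aliases to one common identifier.
    Returns a dict mapping each undirected edge (a frozenset of identifiers)
    to the set of source names that report it.
    """
    merged = {}
    for source_name, edges in sources.items():
        for a, b in edges:
            # Map aliases onto the shared identifier (many-to-one mapping).
            a = alias_to_symbol.get(a, a)
            b = alias_to_symbol.get(b, b)
            edge = frozenset((a, b))  # undirected: (a, b) equals (b, a)
            merged.setdefault(edge, set()).add(source_name)
    return merged


# Hypothetical toy input: two sources that name the same protein differently.
toy_sources = {
    "SourceA": [("TP53", "MDM2")],
    "SourceB": [("MDM2", "P53"), ("MDM2", "MDM4")],
}
toy_aliases = {"P53": "TP53"}

network = integrate_ppi_sources(toy_sources, toy_aliases)
```

Keeping the set of supporting sources per edge makes redundancy between repositories explicit, and the number of supporting sources could serve as a simple confidence weight.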

2.4 Network Analysis

Early studies of biological networks focused on descriptive analytics to address their common topological patterns [10]. With the development of systems biology, an increasing number of studies have reported the modular nature of human diseases; that is, associated genes for the same or similar phenotypes are likely to reside in the same biological module, such as a pathway or protein complex [76]. These modules are densely interconnected regions in the networks. Subsequently, efforts have been devoted to predictive analytics, which utilizes a combination of statistical, modelling, data mining, and machine learning techniques to study large amounts of biological data and to predict unknown genetic bases of complex diseases. Therefore, the main focus of computational analysis of biological networks has shifted from the understanding of network patterns to the prediction of disease-causing genes or modules in the context of genetic interactions.

As a result of the modular nature of human genetic diseases [76], current computational analyses of disease networks have concentrated on measuring the topological proximity between nodes within these networks to annotate their functional relationships. In this section, we provide an overview of the computational approaches for network-based human disease analysis. Various review articles [77][78] categorize computational analysis of biological networks differently. In general, however, these methods can be grouped into two main categories: (i) predicting new disease genes by ranking their closeness to known disease genes, named "Gene Prioritization", and (ii) detecting functional modules or subnetworks to facilitate the generation or annotation of prognostic and predictive signatures for complex diseases, named "Subnetwork Detection". The proposed approaches for gene prioritization compute local or global scores of the topological similarity between nodes in a network with respect to some biological reference, and rank the candidate nodes on the basis of these scores. The same scores are also used to assist subnetwork detection in identifying functional modules.

2.4.1 Gene Prioritization

Over the past few years, a multitude of genome-wide association studies have identified a large number of significantly associated genes for complex human diseases [79]. Network-based computational approaches have been proposed to take advantage of the prior knowledge of known disease-gene associations from existing GWA studies to facilitate the prediction of disease genes by prioritizing candidate genes using the Guilt-by-Association hypothesis [78]. Guilt-by-Association assumes that genes close to known disease associated genes in the network are more likely to be associated with either the same or a similar disease. As we mentioned at the beginning of this chapter, this hypothesis has been supported by the discovery of the modular nature of human genetic diseases [76], that is, associated genes for the same or similar phenotypes are likely to reside in the same biological modules.

Typically, computational approaches to prioritizing candidate genes first model the biological data as an integrative network. A scoring scheme can then be defined directly, to measure the association between a candidate gene and a selected disease if both are represented as nodes in the same network, or indirectly, to measure the relationships between a candidate gene and known disease associated genes in a PPI network. In practice, the choice between direct and indirect measurement is mainly determined by the availability of the data sources. Known disease-gene associations and PPI networks are essential data sources in current gene prioritization studies. Additional data sources, such as pathways, GO terms, gene expression data, or phenotype similarity data, are assets for some studies. Some forms of data, such as phenotype similarity from OMIM, have been under-annotated and may mislead the results [80].

It is natural to measure the topological closeness between nodes in a network in the form of scores. These scores reflect the local or global relationships between nodes, whether or not they are directly connected. The similarity or dissimilarity scores can be measured between a pair of nodes or between groups of nodes. In the literature, many similarity scores are computed and used directly or indirectly to rank the candidate genes. The methods proposed to compute such scores include, but are not limited to, local scores [81], kernels [81][82], linear regression [83][84] and random walks [81][9][80].

Local scores

Direct neighbor (DN) and shortest path (SP) are commonly used local scores [81]. Local scores only provide the direct connectivity information, such as whether two nodes are neighbors or connected through any path in a network, instead of expressing the degree of similarity between any two nodes on the global structure of the network. The direct neighbor distance DN(i, j) between node i and node j is equal to 1 if there is an edge directly connecting node i and node j, and +∞ otherwise. The shortest path distance SP(i, j) between node i and node j is defined as the length of a shortest path between the two nodes. In many practical network-based analyses, these local scores are subject to noise, as DN and SP are extremely sensitive to the insertion or deletion of individual edges in a network.
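The two local scores can be sketched in a few lines for an unweighted, undirected network. The adjacency-set representation, the function names and the toy network below are assumptions made for illustration; SP is computed by breadth-first search and is taken to be +∞ when no path exists.

```python
from collections import deque
from math import inf


def direct_neighbor(adj, i, j):
    """DN(i, j): 1 if an edge directly connects i and j, +inf otherwise."""
    return 1 if j in adj.get(i, ()) else inf


def shortest_path(adj, i, j):
    """SP(i, j): number of edges on a shortest path, +inf if unreachable."""
    if i == j:
        return 0
    seen = {i}
    queue = deque([(i, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in adj.get(node, ()):
            if neighbor == j:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return inf


# Hypothetical toy network: a path A - B - C and an isolated node D.
toy_network = {
    "A": {"B"},
    "B": {"A", "C"},
    "C": {"B"},
    "D": set(),
}
```

Removing the single edge between B and C changes SP(A, C) from 2 to +∞, which illustrates the sensitivity to individual edges noted above.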

Kernels

In contrast, global scores provide robust similarity measures on the space of an input graph G. An important class of global similarity scores includes kernels, such as diffusion kernels [81] [82]. A kernel must be positive definite and symmetric [85]. The simplified form of diffusion kernel [81] K of a graph G is defined as:

K = e^{-\beta L}    (2.3)

where β controls the magnitude of the diffusion. The matrix L = D − W is the Laplacian of the graph, where D is a diagonal matrix containing the node degrees and W is the adjacency matrix of the interaction network. The diffusion kernel represents a global pairwise similarity between nodes in a network. Higher values in a kernel represent closer relationships between pairs of nodes in the network. If nodes are not connected in a network, then the corresponding values in a kernel will be close to zero. One of the advantages of diffusion kernels is that they help integrate multiple data sources. For example, in [82], the authors integrated different networks using a customized pairwise distance measure. Let W_1, W_2, ..., W_m denote the adjacency matrices derived from m different independent networks, and let K_l, l = 1, 2, ..., m, denote their corresponding diffusion kernels. They defined the final ranking score for each candidate gene by summing the evidence from all kernels.
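A diffusion kernel for a small network can be computed as in the following sketch. It assumes the negative-exponent convention K = exp(−βL) with L = D − W, and approximates the matrix exponential with a truncated Taylor series in pure Python; the function names, the number of series terms and the toy adjacency matrix are choices made for this illustration. For networks of realistic size a library routine such as `scipy.linalg.expm` would normally be used instead.

```python
def mat_mul(a, b):
    """Product of two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def diffusion_kernel(w, beta, terms=30):
    """Approximate K = exp(-beta * L), with L = D - W, by a Taylor series."""
    n = len(w)
    # Graph Laplacian: node degrees on the diagonal minus the adjacency matrix.
    lap = [[(sum(w[i]) if i == j else 0.0) - w[i][j] for j in range(n)]
           for i in range(n)]
    m = [[-beta * lap[i][j] for j in range(n)] for i in range(n)]
    kernel = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in kernel]  # current term m^k / k!, starting at I
    for k in range(1, terms):
        term = [[x / k for x in row] for row in mat_mul(term, m)]
        for i in range(n):
            for j in range(n):
                kernel[i][j] += term[i][j]
    return kernel


# Hypothetical toy network: path 0 - 1 - 2 plus an isolated node 3.
toy_w = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
toy_kernel = diffusion_kernel(toy_w, beta=0.5)
```

In this toy example the kernel value between nodes 0 and 1 is larger than that between nodes 0 and 2, and the value between node 0 and the isolated node 3 is zero, in line with the interpretation given above.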

Linear regression

The linear regression model has been successfully applied to measure the global correlation between the similarity in a phenotype network and the closeness in a PPI network, in order to score the associations between phenotypes and genes [83][84]. The linear regression models are based on the assumption that candidate genes close to known disease genes in the human PPI network are highly associated with the same phenotype or with closely connected phenotypes in the human phenotype network. In the literature, the first linear regression model for gene prioritization, covering 5,080 OMIM phenotypes, was named Correlating protein Interaction network and PHEnotype network to pRedict disease genes (CIPHER) [83]. In this model, the authors built an integrated network with three layers of data: (1) a phenotype-phenotype similarity network; (2) a gene-phenotype association network; and (3) a reliable physical PPI network. In 2011, Zhang et al. [84] added more PPI data sources to the integrated model proposed in [83]. Given a query phenotype p and a candidate gene g, linear regression approaches [83][84] calculate the phenotype similarity profile by measuring the topological distance between p and all other phenotypes in the phenotype network. In the phenotype network, phenotypes are connected if they share a similar molecular basis. The shorter the distance between phenotypes, the higher the similarity they share. At the same time, the closeness profile of the candidate gene g is measured using topological distances between gene g and all known disease genes in the human PPI network. By mapping the known disease genes to associated phenotypes, the closeness profile measures the indirect distances between candidate gene g and all phenotypes. Linear regression-based approaches then model the closeness profile and similarity profile with a standard linear equation:

S_p = C_p + \sum_{g \in G(p)} \beta_{pg} \Phi_g    (2.4)

where S_p is the phenotype similarity profile, Φ_g is the closeness profile, G(p) denotes all disease genes relevant to the phenotype p, C_p is a constant and β_pg is the coefficient of this regression model.

Such linear models are generally estimated using sufficient training data and then applied to test data. However, in network-based human disease studies, the incomplete and noisy nature of the data underlying the network makes it hard to solve the model precisely. Instead, current methods studying associations between diseases and genes concentrate on finding a straightforward correlation measure between S and Φ. Accordingly, given the linear model, a gene g whose closeness profile Φ has high concordance with the similarity profile S of the query disease p is highly likely to be associated with p. In [83], the authors used the Pearson linear correlation coefficient between S and Φ as the concordance score to rank the candidate genes. Alternatively, in another study [84], the authors used the Bayes factor to measure the strength of evidence for the association between gene g and phenotype p. The Bayes factor is the ratio of the marginal likelihood under the alternative hypothesis to the marginal likelihood under the null hypothesis. In other words, the similarity profile S and the closeness profile Φ are evaluated under the two hypotheses H1: S and Φ are linearly correlated, and H2: they are not. If the prior probabilities for both hypotheses are one half (0.5), then the Bayes factor is equal to the posterior odds in favor of H1. A larger Bayes factor therefore means stronger evidence for a linear relationship between the similarity profile and the closeness profile.
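The correlation-based ranking described above, in which the Pearson coefficient between a phenotype similarity profile and each candidate gene's closeness profile is used as the concordance score, can be sketched as follows. The profiles and gene names are invented toy values, and the functions `pearson` and `rank_candidates` are hypothetical helpers, not code from CIPHER [83].

```python
from math import sqrt


def pearson(x, y):
    """Pearson linear correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no correlation signal
    return cov / sqrt(var_x * var_y)


def rank_candidates(similarity_profile, closeness_profiles):
    """Rank genes by concordance between S and each closeness profile."""
    scores = {gene: pearson(similarity_profile, profile)
              for gene, profile in closeness_profiles.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy profiles over four phenotypes (illustrative values only).
toy_similarity = [1.0, 0.8, 0.3, 0.1]        # S for the query phenotype
toy_closeness = {
    "GENE1": [0.9, 0.7, 0.2, 0.1],           # follows S closely
    "GENE2": [0.1, 0.2, 0.8, 0.9],           # opposite trend
    "GENE3": [0.5, 0.5, 0.5, 0.5],           # uninformative
}
ranking = rank_candidates(toy_similarity, toy_closeness)
```

In this toy example the gene whose closeness profile follows the similarity profile is ranked first and the gene with the opposite trend is ranked last.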

Random walks

Another important class of measures of global similarity is based on random walks [14]. A concise definition is given in [86]: a random walk is a finite Markov chain that is time-reversible. A finite Markov chain is a stochastic process with a finite number of states, in which the next state depends only on the current state and not on any previous states. A random walk on a network can be described as a walker starting its trip on an arbitrary node in the network and successively visiting one arbitrary neighbour of the current node. The probability of visiting a target node in no more than a given number of time steps after the walker leaves a start node has been proposed as a measure of the similarity between a pair of nodes in the network [14]. The most promising random walk approach in recent studies is random walk with restart (RWR), which adds a probability for the walker to jump back to the start node at each time step. RWR has been used to rank candidate disease genes in three studies [12][80][81]. The formal definition of RWR is as follows:

P_t = (1 - r) W P_{t-1} + r P_0    (2.5)

where r is the restart probability, W is the normalized adjacency matrix of the graph G, and P_t is a vector in which the i-th element represents the probability of being at node i at time step t [81]. If we have no prior knowledge at all, the initial vector P_0 assigns equal probabilities to all nodes in the network, with the sum equal to 1.
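The RWR iteration can be sketched as below for an unweighted, undirected network. Several details are assumptions made for illustration: the walker moves to each neighbour of its current node with equal probability (a column-normalized W), the initial vector places equal probability on a set of seed nodes representing known disease genes, iteration stops when the L1 change between successive vectors falls below a tolerance, and the value of the restart probability is arbitrary.

```python
def random_walk_with_restart(adj, seeds, restart_prob, tol=1e-9,
                             max_steps=1000):
    """Iterate P_t = (1 - r) * W * P_{t-1} + r * P_0 until convergence.

    adj: dict mapping each node to the set of its neighbours.
    seeds: non-empty set of start nodes (assumed to be nodes of adj).
    Returns a dict mapping each node to its steady-state probability.
    """
    nodes = sorted(adj)
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(max_steps):
        nxt = {n: restart_prob * p0[n] for n in nodes}  # restart term
        for j in nodes:
            moving = (1.0 - restart_prob) * p[j]
            if not adj[j]:
                nxt[j] += moving  # isolated node: the walker stays in place
                continue
            share = moving / len(adj[j])  # equal split among neighbours
            for i in adj[j]:
                nxt[i] += share
        change = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if change < tol:
            break
    return p


# Hypothetical toy network: a path A - B - C - D with seed node A.
toy_path = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
scores = random_walk_with_restart(toy_path, seeds={"A"}, restart_prob=0.5)
```

On this toy path the resulting scores decrease with distance from the seed node, so ranking nodes by score orders them by their proximity to the known disease gene.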

Equation 2.5 can be easily adapted into a flow-based approach, where P_t denotes the association scores and P_0 represents a prior knowledge function. This iterative algorithm can then be understood as simulating the transfer of information flow through the edges of the network. By predefining weights on different types of edges, it is easy to control the transfer of information flow over different types of biological interactions or associations.

Another successful random walk-based approach in the analysis of large-scale biological networks is Markov clustering (MCL). MCL was first proposed by Stijn van Dongen [87] in his PhD thesis in 2000. MCL simulates random walks on the interaction network by alternating two operations: expansion and inflation. The expansion operator expands the flow to connect different clusters by taking the e-th power of the normalized adjacency matrix W, and the inflation operator strengthens strong currents and weakens weak currents. These two operators are applied alternately until a steady state is reached. The complexity of MCL is O(N^3), where N is the number of nodes.

A comparative study for prioritization of candidate disease genes [81] demonstrated that RWR outperformed the diffusion kernel and local similarity scores (direct neighbors and shortest paths). In addition, an adapted form of RWR named PRINCE [9] was used to associate genes and protein complexes with human disease. In the literature, there are four highly cited random walk approaches for gene prioritization. Besides the computational methodology, the major difference between these four random walk-based studies [9][12][80][81] is the integration of different types of data sources. The study in [81] only utilized two kinds of connection: known disease-gene associations from OMIM and PPI. In [80], the authors added one more kind of connection: phenotype-phenotype similarity. In order to increase the reliability of PPIs, in [9], the authors not only utilized the same types of connections as in [80], but also assembled multiple data sources of PPIs to construct a weighted PPI network with confidence scores instead of an unweighted one. Furthermore, in [12], for the first time in the literature, the authors integrated novel information from interactions between protein domains and PPIs and then applied RWR to investigate the relationships between functional coherence and topological proximity. A protein domain is defined as a structural or functional subunit [12].

In conclusion, random walk-based analysis has several advantages over other approaches: (1) it is easy to add or remove biological prior knowledge by changing the initial value of P_0; (2) it is natural to model an integrative biological network with a stochastic model [88]; (3) it measures the global topological relationships of nodes within the network [89]; and (4) it is possible to balance computing time and prediction accuracy. For small networks, it is feasible to repeat the random walk process until a steady state is reached, that is, until the change between P_t and P_{t−1} falls below a certain threshold. For very large integrative biological networks, a pre-defined number of time steps has to be supplied to control the computing cost.

Random walk-based methods do have limitations; for instance, in previous studies of RWR, the restart probability r was set by trial and error, as no standard guideline exists.
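The alternation of expansion and inflation in Markov clustering, described earlier in this subsection, can be illustrated with the following minimal sketch. It is a simplified reading of the algorithm rather than van Dongen's implementation [87]: adding self-loops before normalization, the default parameter values, the convergence test and the rule for reading clusters from the rows of the final matrix are assumptions made for this example.

```python
def normalize_columns(m):
    """Rescale every column of a square matrix so that it sums to 1."""
    n = len(m)
    out = [row[:] for row in m]
    for j in range(n):
        col_sum = sum(m[i][j] for i in range(n))
        if col_sum > 0:
            for i in range(n):
                out[i][j] = m[i][j] / col_sum
    return out


def mat_mul(a, b):
    """Product of two square matrices; this step costs O(N^3)."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def markov_clustering(adjacency, expansion=2, inflation=2.0,
                      max_iter=100, tol=1e-8):
    """Cluster a graph by alternating expansion and inflation."""
    n = len(adjacency)
    # Self-loops are added before normalization (an assumption of this sketch).
    m = normalize_columns(
        [[float(adjacency[i][j]) + (1.0 if i == j else 0.0)
          for j in range(n)] for i in range(n)])
    for _ in range(max_iter):
        expanded = m
        for _ in range(expansion - 1):      # expansion: e-th matrix power
            expanded = mat_mul(expanded, m)
        inflated = normalize_columns(       # inflation: entrywise power
            [[x ** inflation for x in row] for row in expanded])
        delta = max(abs(inflated[i][j] - m[i][j])
                    for i in range(n) for j in range(n))
        m = inflated
        if delta < tol:                     # steady state reached
            break
    # Each row with non-zero entries lists the members of one cluster.
    clusters = {frozenset(j for j in range(n) if m[i][j] > 1e-6)
                for i in range(n)}
    clusters.discard(frozenset())
    return sorted(sorted(c) for c in clusters)


# Hypothetical toy network: two triangles joined by the single edge 2 - 3.
barbell = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
barbell_clusters = markov_clustering(barbell)
```

A larger inflation value tends to produce more, smaller clusters, which is the main parameter a user would tune in practice.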

2.4.2 Subnetwork Detection

As mentioned before, one of the goals of computational analysis of biological networks is to identify small groups of connected biological entities (genes, proteins, microRNAs and so on) that have specific biological functions. Such subnetworks (e.g., protein complexes) can be helpful to annotate or predict molecular functions related to human complex diseases. Network-based analysis approaches have been proposed to detect predictive subnetwork biomarkers. For example, a study of network-based classification of breast cancer metastasis [90] identified subnetwork signatures by integrating gene expression profiles of tissue samples from both metastatic and non-metastatic breast cancer with a human PPI network. Compared to individual marker genes, the subnetwork markers achieved higher prediction accuracy in the classification of metastatic versus non-metastatic tumors of breast cancer patients. In another study [91], subnetwork markers to classify cancer tissue samples for breast cancer and colon cancer were detected with substantially increased accuracy. In addition to subnetwork biomarker identification, subnetwork detection methods have been applied to predict essential genes [31], to identify functional modules [92], and to link known disease genes with either biological pathways or PPIs for further functional annotations [74][77].

Clustering methods

Clustering algorithms have been the most common approaches to detect sub-networks in biological networks [93]. Traditional partitional clustering methods, such as k-means clustering [94], require the number of clusters to be assigned in advance, which makes them impractical for the biological hypotheses underlying these networks in human disease research. Since the early studies of clustering [95], approaches from graph analysis and physics have been adapted to find clusters in biological networks [96]. The hierarchical nature of biological networks makes it natural to apply hierarchical clustering for subnetwork detection. Hierarchical clustering aims to categorize nodes into a hierarchical set of clusters in a tree structure (dendrogram) by calculating pairwise similarities among clusters [97]. Average linkage (sometimes referred to as the Unweighted Pair Group Method with Arithmetic Mean, UPGMA) is probably the most popular linkage criterion used for hierarchical clustering. UPGMA defines the similarity between two clusters as the mean similarity of all pairs of nodes across the two clusters, and is thus more robust than single linkage, in which only the similarity between the two closest nodes in the two clusters is used. However, UPGMA has to store the entire similarity matrix in memory, and the size of the matrix is quadratic in the number of nodes. In [97], the authors showed that an entire similarity matrix for 1.5 × 10^9 edges requires at least 30GB of memory. To fit the large size of biological datasets, a modified average linkage method, memory-constrained UPGMA (MCUPGMA), was proposed to break the clustering process into multiple rounds and so reduce memory usage [97].

An obvious limitation of hierarchical clustering is that it is hard to determine the appropriate level of the clustering dendrogram to study, or the number of expected clusters. In the full tree structure, all nodes are grouped into a single cluster at the root, while each node is its own cluster at the leaves, so external stopping criteria have to be pre-defined. SuperParamagnetic Clustering (SPC) [98] has been proposed to overcome this limitation. SPC is based on an analogy to the physical properties of a ferromagnetic model, in which the change of temperature T controls the clustering level of nodes. At high temperature, the system is in a disordered (paramagnetic) phase, in which each node is its own cluster. As the temperature T is lowered, a phase transition occurs (super-paramagnetic phase), and correlated nodes become aligned. Clusters keep merging as T is lowered further, until at low enough values of T all nodes form a single cluster (ferromagnetic phase). The division into clusters is stable only in the temperature range ∆T of the super-paramagnetic phase. The most important advantage of SPC is that no prior knowledge about cluster structure is needed.
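To make the average-linkage rule concrete, the following is a minimal sketch of naive UPGMA in Python. It is an illustration only, not the memory-constrained variant of [97]: the function name `upgma`, the dictionary-based similarity input and the `n_clusters` stopping criterion are assumptions made for this example, and pairs missing from the input are treated as having similarity 0.

```python
from itertools import combinations


def upgma(similarity, n_clusters):
    """Naive average-linkage clustering (illustrative sketch).

    similarity: dict mapping frozenset({a, b}) -> similarity of nodes a and b
    n_clusters: external stopping criterion (number of clusters to keep)
    """
    nodes = set()
    for pair in similarity:
        nodes |= pair
    # Start at the leaves of the dendrogram: every node is its own cluster.
    clusters = [frozenset([n]) for n in sorted(nodes)]

    def average_similarity(c1, c2):
        # Mean similarity over all cross-cluster node pairs.
        total = sum(similarity.get(frozenset({a, b}), 0.0) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Merge the two clusters with the highest average similarity.
        c1, c2 = max(combinations(clusters, 2),
                     key=lambda pair: average_similarity(*pair))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return clusters
```

Because the sketch recomputes every cluster-to-cluster average from the full pairwise table, it shows why the memory and time costs grow quickly with network size. It also needs `n_clusters` to be supplied from outside, which is the limitation discussed above.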

Belief propagation

In addition, belief propagation algorithms on graphical models such as Markov random fields (MRF) [99][100] have been proposed for subnetwork detection in biological network analysis. In order to infer protein function in a PPI network, binary functional labels are assigned to known proteins in the network. If a protein i is associated with a Gene Ontology (GO) term t, then the label of protein i related to term t is 1, and 0 otherwise. Belief propagation approaches rest on the Markov assumption: the labelling of any node in a graph is conditionally independent of all other nodes given its neighbours [99]. As a result, the estimated labelling probability of a node, called its belief, is entirely a function of its neighbours’ labelling probabilities. The belief propagation algorithm, also called the message passing algorithm, is an iterative process. At every iteration, each unlabelled node receives information about its neighbours’ labels and their beliefs in those labels, and any unlabelled node whose highest belief exceeds a given threshold is assigned the corresponding label. The process stops when no unlabelled nodes are left.

Another, more complex, belief propagation approach, Affinity Propagation (AP) [101], was proposed in 2007 for k-centers clustering based on a measure of similarity. k-centers clustering begins with an initial set of k exemplars randomly selected from the actual nodes in the network, as the centers of subnetworks, and iteratively updates the exemplars (centers) so as to minimize the sum of squared errors between nodes and their exemplars [101]. AP takes a collection of real-valued similarities between pairs of nodes as input. The similarity s(i, k) indicates how well node k is suited to be the exemplar for node i. In order to avoid specifying the number of clusters in advance, AP initially treats every node as a potential exemplar. If all nodes are equally suitable as exemplars, then s(k, k) should be set to a common value for every node k. The exemplars then emerge from a message-passing procedure. AP exchanges two kinds of real-valued messages between nodes: responsibility and availability. The responsibility r(i, k), sent from node i to node k, indicates the evidence for node k to serve as the exemplar for node i. The availability a(i, k), sent from node k to node i, indicates the evidence for node i to choose node k as its exemplar. The definitions of these two terms can be varied for specific tasks. In the first iteration, the availabilities are zero, and r(i, k) is set to s(i, k) minus the largest of the similarities between node i and the other candidate exemplars. In later iterations, these two kinds of messages are exchanged between pairs of nodes with known similarities, and node i is assigned to the exemplar k that maximizes the sum r(i, k) + a(i, k). The process stops when the exemplars no longer change for any node.

The limitations of belief propagation methods are: (1) prior knowledge is needed; (2) many parameters have to be adjusted; and (3) convergence to a stable final state is not guaranteed [99].
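For reference, the responsibility and availability messages described above are commonly written as the following update rules. This is the standard form usually given for AP [101], stated here as a reminder rather than as a task-specific variant; the primed indices i' and k' are notation introduced for this example, and practical implementations usually also damp the updates.

```latex
\begin{align}
r(i,k) &\leftarrow s(i,k) - \max_{k' \neq k} \bigl\{ a(i,k') + s(i,k') \bigr\} \\
a(i,k) &\leftarrow \min \Bigl\{ 0,\; r(k,k) + \sum_{i' \notin \{i,k\}} \max\{0,\, r(i',k)\} \Bigr\}, \qquad i \neq k \\
a(k,k) &\leftarrow \sum_{i' \neq k} \max\{0,\, r(i',k)\}
\end{align}
```

With all availabilities initialised to zero, the first rule reduces to s(i, k) minus the largest similarity between node i and any other candidate exemplar, which matches the first iteration described above.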

Heuristic search

The clustering and belief propagation methods discussed above are not efficient at finding sub-networks that overlap each other, so a group of score-based heuristic methods was proposed with the capability of identifying overlapping sub-networks. For example, Molecular Complex Detection (MCODE) [102], the Restricted Neighborhood Search Clustering algorithm (RNSC) [103], and Speed and Performance In Clustering (SPICi) [104] have been used to detect protein complexes in PPI networks of S. cerevisiae, D. melanogaster, C. elegans, E. coli, or human. These methods aim to detect densely connected regions using topological properties of the large interaction network. Typically, these approaches include three common steps:

1. Constructing an integrative network from multiple data sources;

2. Defining the scoring scheme;

3. Searching for optimal subnetworks with maximum or minimum scores.

Step 1 is mainly about collecting and processing multiple data sources. In Step 2, the scoring scheme is usually defined on the topological properties of the network, such as the network density, which is usually defined as the number of edges in the network divided by the total number of possible edges [102]. The search algorithms in Step 3 usually start from initial seed nodes, and then greedily add the best neighbouring node of the current subnetwork so as to maximize or minimize its total score, in an iterative process, until the stopping criterion has been reached. MCODE [102] defines the sub-network score as the product of the density of the sub-network and the number of nodes in the subnetwork (Dc × |V|). RNSC [103] scores the sub-network using a functional homogeneity p-value, which is the probability that a given set of proteins is enriched for a given functional group by chance, following the hypergeometric distribution. SPICi [104] uses a simple scoring scheme in which three scores are defined: weighted degree, density score, and support score. The weighted degree of a node is the sum of the weights of all its incident edges; the density score of a set of nodes S is the sum of the weights of the edges among them divided by the total number of possible edges; and the support score of a node n with respect to a set S is the sum of the weights of the edges linking n to nodes in S. The search first selects the node with the highest weighted degree in the network as a seed, and then picks its neighbour with the highest weighted degree as the second seed node. With the two seed nodes as the initial set S, the algorithm iteratively searches for the node with the maximum support score amongst all un-grouped nodes adjacent to a node in S and adds it to S, until either the support score or the density score falls below a user-defined threshold [104].

The heuristic methods discussed above are limited to detecting densely connected sub-networks using the topological properties of the network. With the rapid accumulation of biological interactions, especially PPIs, the network gets larger and denser, and sub-networks detected using topological features alone may be meaningless for practical research. Another group of methods combines biological profiles, e.g., gene expression data or GWAS data, with the PPI networks to detect active sub-networks. For example, jActiveModule [105] aggregates the p-values of the nodes, derived from gene expression data, into a single score to rank sub-networks, and applies a greedy algorithm to search for highly scored sub-networks. Protein-interaction-network based pathway analysis (PINBPA) [106] similarly aggregates the p-values of the nodes, derived from GWAS data, to identify enriched sub-networks in GWAS datasets.

Practical search algorithms [102][103][104] are polynomial-time approximate methods that take into account the large size of biological networks. In contrast to heuristic search, exhaustive enumeration algorithms, such as CFinder [107][108] and biclustering [91], help to find all possible subnetworks. CFinder finds sets of connected k-cliques (completely connected subnetworks of size k) in undirected graphs, such as PPI networks. The biclustering approach in [91] finds all subnetworks that satisfy a certain density threshold. Here the density of a subnetwork refers to the average pairwise edge weight in a PPI network. By integrating gene expression data and a PPI network, a subnetwork qualified in [91] is also a bicluster, which is defined as a set of genes and experimental conditions with high correlations.
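The seed-and-grow search described for SPICi in this section can be sketched as follows. This is a simplified illustration of the three scores (weighted degree, density, support) as they are defined in the text, not a re-implementation of the published tool: the function names, the nested-dictionary graph representation and the two plain thresholds `min_support` and `min_density` are assumptions made for this example.

```python
def weighted_degree(graph, node):
    """Sum of the weights of all edges incident to `node`."""
    return sum(graph[node].values())


def density(graph, nodes):
    """Sum of edge weights inside `nodes` divided by the number of possible edges."""
    members = sorted(nodes)
    n = len(members)
    if n < 2:
        return 0.0
    total = sum(graph[u].get(v, 0.0)
                for i, u in enumerate(members) for v in members[i + 1:])
    return total / (n * (n - 1) / 2)


def support(graph, node, nodes):
    """Sum of the weights of the edges linking `node` to members of `nodes`."""
    return sum(w for nbr, w in graph[node].items() if nbr in nodes)


def grow_cluster(graph, unassigned, min_support, min_density):
    """Grow one cluster greedily from the two highest-degree seeds."""
    # Sorting first makes tie-breaking deterministic.
    seed = max(sorted(unassigned), key=lambda n: weighted_degree(graph, n))
    cluster = {seed}
    partners = sorted(n for n in graph[seed] if n in unassigned)
    if not partners:
        return cluster
    cluster.add(max(partners, key=lambda n: weighted_degree(graph, n)))
    while True:
        # Un-grouped nodes adjacent to at least one member of the cluster.
        frontier = sorted({n for c in cluster for n in graph[c]
                           if n in unassigned and n not in cluster})
        if not frontier:
            break
        best = max(frontier, key=lambda n: support(graph, n, cluster))
        if support(graph, best, cluster) < min_support:
            break
        if density(graph, cluster | {best}) < min_density:
            break
        cluster.add(best)
    return cluster
```

Calling `grow_cluster` repeatedly, each time removing the returned nodes from `unassigned`, yields a partition of the network into dense regions. The thresholds decide how aggressively weakly attached nodes are excluded.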

2.5 Discussion

Numerous computational network-based approaches have been proposed for ranking disease genes and detecting subnetworks within a biological network model. However, there is no one general solution for studying biological networks in human disease research. To date, no comparative study has addressed both biological data integration and computational approaches covering all the aspects introduced in this section. Due to the incomplete and noisy nature of biological data, the evaluations from some comparative studies are questionable, as no reliable benchmark model exists. In order to achieve high prediction accuracy, researchers are attempting to add more data sources to the existing models. Unfortunately, most studies have focused on the integration of up to three different types of biological interactions, due to the difficulty of integrating large-scale heterogeneous biological data. As one of the objectives of this thesis, heterogeneous biological interaction data will be collected across multiple biological domains to build a systematic view of human diseases in terms of networks, together with a tool to construct and visualize the resulting networks. To some extent, this systematic integration of heterogeneous interactions has the potential to provide structured prior knowledge extracted from massive amounts of biological data, to assist biological investigations, such as drug repurposing, in future studies.

Given the size of the integrated biological networks, with or without additional functional information, any kind of analysis can be computationally expensive. Several popular large-scale computing platforms, such as cluster computing, cloud computing, grid computing and heterogeneous computing, have been proposed to meet such challenges [6]. Besides high performance computational systems, much effort has been spent on the design of efficient algorithms to test biological hypotheses on these networks. The second and third aims of this thesis mainly focus on the design of efficient network-based computing methods to tackle specific biological problems.

Chapter 3

Integrative Complex Traits Networks

Preamble: A part of the material presented in this chapter has been published in [109]. The tool proposed in [109] has been applied in neurological studies [110][111]. In addition, a Nature Methods paper [112] addressed this tool as “One of the most integrative Cytoscape plugins to date is the iCTNet plugin, which was recently developed to integrate genome-wide association data with protein-protein, disease-tissue, tissue-gene and drug-gene interactions. It may assist users in elucidating a new trait classification, pathogenic mechanism or treatments for human disease traits”.

3.1 Introduction

In the past decade, the variety of publicly available genomic, proteomic and metabolic data (“omics”), which encompass different biological interactions, has increased exponentially. As discussed in the previous chapter, each type of interaction captures distinct features of the molecular functions involved in complex traits, helping to describe and understand biological complexity. However, none of these genome-wide data are complete on their own, and further systematic exploration or annotation can be made by combining different levels of biological components, e.g., gene, protein, miRNA, tissue. In terms of network biology, these various biological interactions are represented as graphs, which allows easy integration of heterogeneous data sources. In this chapter, a systematic integration of heterogeneous interactions is proposed to offer a platform to construct meta-networks, which provide novel views of complex traits, to ultimately facilitate disease prediction or drug discovery. In addition, an application to navigate and visualize the resulting integrative networks is also developed.

3.2 Background

The integration of biological interactions was inspired by the first human disease network [39]. Goh et al. [39] built the first Diseasome, which is a bipartite network of diseases and known disease genes. Lage et al. [113] merged a protein-protein interaction network with disease gene associations. Later on, drug-target networks attracted attention in drug development [114][115], and led to the field of systems pharmacology. Meanwhile, other kinds of databases have been assembled with information about tissue [116] and miRNA [62]. The availability of large-scale heterogeneous datasets in public repositories has prompted the need for systematic integration. Heterogeneous data sets can be joined based on common keys, e.g., identifiers or ontology terms, but the integration of large-scale biological interactions is complex and time-consuming due to the lack of uniform identifiers across repositories and the demand for interdisciplinary skills. Some pioneering tools, e.g., GeneMANIA [117] and DisGeNET [118], have elegantly integrated multiple data sources. GeneMANIA integrates protein and genetic interactions, pathways, co-expression, co-localization and protein domain similarity to provide fast gene function prediction [117]. DisGeNET integrates human gene-disease associations from various databases and provides a tool to query and analyze human gene-disease networks [118]. In the literature, most studies have focused on the integration of up to three different types of biological interactions. In this chapter, we present the integrative complex traits networks (iCTNet), a tool to create, analyze and visualize human complex traits networks incorporating ontologies from several domains that may assist in elucidating a new classification, pathogenic mechanism, or treatment for common human traits.

To date, there are thousands of biological database resources [18], and many tools are available to visualize large-scale networks. One of the most popular network analysis and visualization tools, Cytoscape, is an open source platform that allows for a wide variety of applications (referred to as apps) through the development of task-specific plugins [8]. iCTNet has been implemented in Java as a Cytoscape app and provides a user-friendly interface to search, visualize, and analyze large-scale biological networks for human diseases and traits. In iCTNet, we use the terms SNP/gene/protein interchangeably because of the correspondence between SNPs and genes: more specifically, the most significantly associated SNPs define a gene-trait edge, and gene products (proteins) are used in the interaction networks.

3.3 Materials and Methods

The definition and classification of disease is an open problem [119]. Manual curation of the list of diseases is a time-consuming step and also requires medical knowledge. iCTNet incorporates the Disease Ontology (DO) [120] as the primary vocabulary for cataloguing phenotypes in a tree structure. The iCTNet database (version 2) consists of six types of nodes (phenotype, gene/protein, miRNA, tissue, drug and drug side effect) and nine types of edges, as shown in Fig. 3.1A. The iCTNet database can be accessed via the iCTNet app in the Cytoscape control panel, shown in Fig. 3.1B.

Figure 3.1: iCTNet database schema and user interface

In the following sections, we describe the different types of nodes and edges included in the iCTNet database. We have collected 13 publicly available sources to build the iCTNet database, and these sources are listed in Table 3.1. The descriptions and example data entries can be found in Appendix A.

3.3.1 Nodes

Table 3.1: Data sources in iCTNet database

Category  Type                Resource            URL
Node      Phenotype           Disease Ontology    http://disease-ontology.org
Node      Gene                HGNC                http://www.genenames.org
Node      miRNA               miRCat              http://www.mirrna.org
Node      Tissue              BRENDA              http://www.brenda-enzymes.org
Node      Drug                CTD                 http://ctdbase.org
Node      Side effect         MedDRA              http://www.meddra.org
Edge      Phenotype-gene      GWAS Catalog        http://www.genome.gov/gwastudies/
                              OMIM                http://www.omim.org
                              CTD                 http://ctdbase.org
Edge      Phenotype-tissue    Ontology inference
Edge      Gene-tissue         GNF Gene Atlas      http://www.gnf.org
Edge      Drug-phenotype      CTD                 http://ctdbase.org
Edge      Drug-gene           CTD                 http://ctdbase.org
                              DrugBank            http://www.drugbank.ca
Edge      Drug-side effect    SIDER               http://sideeffects.embl.de
Edge      Side effect-tissue  Ontology inference
Edge      Protein-protein     iRefIndex           http://irefindex.org
                              ppiTrim             http://www.ncbi.nlm.nih.gov/CBBresearch/Yu/downloads/ppiTrim.html
Edge      miRNA-gene          miRCat              http://www.mirrna.org

Phenotype: In addition to DO, we also included two other disease vocabularies: the Experimental Factor Ontology (EFO) [121] and MEDIC [45]. EFO is an ontology developed by the European Bioinformatics Institute (EBI) with a substantial disease component. MEDIC is a vocabulary produced by the Comparative Toxicogenomics Database (CTD) [45], which incorporates disease terms from the Online Mendelian Inheritance in Man (OMIM) [122] and the U.S. National Library of Medicine’s Medical Subject Headings (MeSH) (http://www.nlm.nih.gov/mesh/). The DO includes OMIM cross-references, providing the mapping for our network. We combined the DO cross-references to OMIM and MeSH to provide mappings to MEDIC. DO terms were manually mapped to relevant EFO terms. The mapping only covers the subset of EFO disease terms present in the GWAS catalog [40]. In total, there are 6,338 phenotype records in the iCTNet database.

Gene: Genes are supplied by the HUGO Gene Nomenclature Committee’s list of human genes (HGNC). iCTNet includes only currently valid genes, but stores outdated gene symbols and synonyms in an alias table for cross-reference. Non-protein coding genes are included as well. In order to map gene symbols or IDs across different data resources, we used the NCBI Gene [123] identifier as the main index of a gene or protein. There are 38,079 gene records in the iCTNet database.

miRNA: miRNAs are collected from an online database, miRCat (http://www.mirrna.org), which assembles experimentally verified data from five databases: microRNA.org, miRTarBase, TarBase, microT (v3.0) and miR2Disease.

Tissue: Tissue types were taken from the BRENDA tissue ontology [124].

Drug: The CTD [45] provides the primary resource for drugs in iCTNet. CTD includes over 13,000 curated chemicals and includes mappings to several other major chemical databases. We also included DrugBank 3.0 [125], which contains fewer chemicals but has extensive information on most FDA approved compounds. The CTD database provides references to DrugBank identifiers, providing the mapping between the two resources. iCTNet contains information on 151,378 drugs in total.

Side effect: The Medical Dictionary for Regulatory Activities (MedDRA) (http://www.meddra.org) provides the side effect ontology in iCTNet. Although MedDRA is a high quality and widely adopted vocabulary, its commercial nature prevents large-scale republication of its terms. Instead, our database reports the Unified Medical Language System (UMLS) (http://www.nlm.nih.gov/research/umls/) concepts for side effects. Since MedDRA is a source vocabulary for the UMLS, the mapping is straightforward and reversible. Nonetheless, upon request we will provide researchers who possess a valid MedDRA license with an untranslated version of our database which includes the hierarchical relationships between side effects.
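The identifier mapping described for genes in this section (current symbols plus an alias table, with the NCBI Gene identifier as the main index) amounts to a key-based join. The sketch below illustrates the idea under stated assumptions: the function names, the in-memory dictionaries and the source labels `source_a` and `source_b` are hypothetical and do not reflect the actual iCTNet schema. The example entries (BRCA1, BRCA2 and the alias FANCD1) are used only for illustration.

```python
def resolve_gene_id(symbol, current, aliases):
    """Map a gene symbol to its main identifier, falling back to the alias table."""
    key = symbol.strip()
    if key in current:
        return current[key]
    return aliases.get(key)  # None when the symbol is unknown


def merge_edges(sources, current, aliases):
    """Union phenotype-gene edges from several sources, keyed by gene identifier.

    sources: dict mapping a source name to a list of (phenotype, gene symbol) pairs
    Returns a dict mapping (phenotype, gene id) to the set of supporting sources.
    """
    merged = {}
    for name, edges in sources.items():
        for phenotype, symbol in edges:
            gene_id = resolve_gene_id(symbol, current, aliases)
            if gene_id is not None:  # skip symbols that cannot be mapped
                merged.setdefault((phenotype, gene_id), set()).add(name)
    return merged


# Illustrative tables: currently valid symbols and outdated symbols or synonyms.
current_symbols = {"BRCA1": 672, "BRCA2": 675}
alias_table = {"FANCD1": 675}
```

Resolving every record to one identifier before merging means that the same association reported under two different symbols is stored once, with both sources recorded as support.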

3.3.2 Edges

Phenotype-gene: The phenotype-gene associations are the primary resources to study the genetic factors of complex traits. iCTNet merges phenotype-gene associations from three online databases: GWAS Catalog, OMIM and CTD.

Phenotype-tissue: The phenotype-tissue associations are manually curated according to the BRENDA tissue ontology with the help of researchers with medical knowledge.

Gene-tissue: iCTNet collects an extensive atlas of tissue-specific gene expression from the GNF gene atlas [126], which provides the tissue-specific expression patterns of thousands of genes. The expression patterns across 79 human tissues provide important clues about gene functions.

Drug-phenotype: The drug-phenotype associations are collected from CTD (see Table 3.1).

Drug-gene: The drug-gene interactions are assembled from CTD and DrugBank, the two major databases containing drug information.

Drug-side effect: The side effects of drugs in humans are an essential source to understand human phenotypes. iCTNet collects information on 888 drugs and 1,450 side effect terms from the side effect resource (SIDER) [127], together with the available side effect frequencies.

Side effect-tissue: The side effect-tissue associations are manually curated.

PPI: PPIs are collected from ppiTrim [128], a third party tool providing curated data from iRefIndex [27]. iRefIndex is a consolidated PPI database assembling experimentally verified PPIs from 13 different sources. The latest version of iRefIndex, release 13, has been publicly available since December 2013 and contains 194,174 human PPIs. In iRefIndex, it is quite common for one interaction to be supported by multiple pieces of evidence, such as PubMed IDs, from different sources. ppiTrim removes redundant evidence for every interaction in iRefIndex to make sure each independent experimental support is counted exactly once [128].

miRNA-gene: 2,457 miRNA-gene interactions are collected from the online database miRCat.

3.3.3 Computational Analysis

iCTNet also provides two candidate prioritization algorithms that take full advantage of the underlying protein interaction network. We implemented the random walk with restart [81] and PRINCE [9] algorithms, which take a set of “associated” genes (genes with association p-value below a user-selected threshold) and perform searches through the entire protein interaction network. This is a powerful way to identify a set of candidate genes which have modest (non-significant) p-values but are close to associated genes in the network. The core of both implemented algorithms is similar (they are both random walk methods). Both methods are based on the following equation:

P(t) = (1 − r) W · P(t − 1) + r P(0)        (3.1)

where P(t) is the vector holding the scores of the nodes at time step t, W is the normalized adjacency matrix of the network, P(t − 1) is the vector holding the scores of the nodes at the previous time step t − 1, P(0) is the initial score vector, and r is the restart rate ranging from 0 to 1. The equation describes the iterative transitions of a walker in the network. In both methods [9][81], the walker begins at the starting nodes and moves to randomly selected neighbours in the network. The restart rate represents the probability that the walker jumps back to the starting nodes at every time step. In other words, the walker reaches farther nodes in the network when the restart rate is small; conversely, the walker is trapped at the starting nodes if the restart rate is 1. The main difference between these two methods lies in the input vector P(0) and the adjacency matrix W. In random walk with restart [81], the initial vector P(0) is constructed such that equal probabilities are assigned to the starting nodes, and the adjacency matrix is binary, so each edge in the network has equal weight. PRINCE [9] uses prior knowledge to initialize the vector P(0), such that genes having direct associations to the given disease are scored high, and genes having indirect associations are scored low. In PRINCE, each edge in the network is weighted by a confidence score between 0 and 1. In iCTNet, PPIs are not weighted due to the lack of a reliable source. The complexity of both methods is O(tn^2), where n is the number of nodes in the network and t is the number of time steps. The running times depend on the number of truly associated genes, their association strength (p-value), and the number of connections among their protein products in the network. To the best of our knowledge, a head-to-head comparison of these algorithms under an extensive range of parameters has not been performed. Thus, we are unable to recommend the use of a particular algorithm, and instead encourage the user to test them under different experimental scenarios.
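Equation 3.1 can be illustrated with a short Python sketch of the unweighted random walk with restart. This is not the Java implementation shipped with iCTNet: the function name, the adjacency-set representation, the convergence tolerance `tol` and the choice of column normalisation for W are assumptions made for this example. The sketch also assumes an undirected network in which every seed is a node of the network and no seed is isolated.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.5, tol=1e-8, max_steps=1000):
    """Iterate P(t) = (1 - r) W P(t-1) + r P(0) on an unweighted, undirected network.

    adjacency: dict mapping each node to the set of its neighbours (symmetric)
    seeds:     starting nodes; P(0) puts equal probability on each of them
    restart:   restart rate r in [0, 1]
    """
    nodes = list(adjacency)
    degree = {n: len(adjacency[n]) for n in nodes}
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(max_steps):
        nxt = {}
        for i in nodes:
            # (W P)[i]: each neighbour j passes on an equal share of its score,
            # i.e. W is the column-normalised binary adjacency matrix.
            inflow = sum(p[j] / degree[j] for j in adjacency[i])
            nxt[i] = (1.0 - restart) * inflow + restart * p0[i]
        change = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if change < tol:  # stop once the scores no longer move
            break
    return p
```

Iterating over neighbour sets rather than a dense matrix makes each time step proportional to the number of edges, which is cheaper than the dense O(n^2) step on sparse PPI networks. A PRINCE-style variant would replace the uniform P(0) with prior scores and the equal shares with confidence-weighted ones.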

3.4 Implementation

3.4.1 Availability and Requirements

• iCTNet is free to download from http://www.cs.queensu.ca/ictnet.

• Operating system: platform independent.

• Programming language: Java

• Cytoscape version: Two versions of iCTNet are available. iCTNet version 1 requires Cytoscape version 2.6, 2.7 or 2.8, and iCTNet version 2 is an update for Cytoscape version 3.0 and later versions. iCTNet version 1 has been published in [109]. iCTNet version 2 has not been released to the public yet.

3.4.2 Database

All data resources have been processed and stored in a relational MySQL (http://www.mysql.com) database system hosted on our database server. The database schema has been designed using MySQL Workbench 5.2 (http://www.mysql.com/products/workbench/). The view of the database schema of iCTNet version 2 is attached in Appendix B. In iCTNet version 1, the user can choose between connecting to a remote database and downloading a number of text files, which contain the information on nodes and edges and can be imported into Cytoscape to construct a network with all nodes and edges.

3.5 Features

iCTNet represents a new application that is designed to integrate and analyze disparate data sources, a key pillar in the new paradigm of systems biology. The prior knowledge extracted from massive amounts of biological data described in Section 3.3 is represented in the form of networks, to assist biological investigations involving phenotype, gene/protein, miRNA, tissue, drug and drug side effect. iCTNet is one of the most integrative Cytoscape plugins to date [112]. In iCTNet version 2, there are three different schemas to build integrative disease-centered, gene-centered and drug-centered networks, respectively. The different network schemas provide different perspectives to explore iCTNet networks, in which network-based computational analysis can be applied for further investigation.

3.5.1 Disease-centered Network Model

Human diseases, especially complex traits, are the consequence of the perturbation of multiple cellular components. It is known that several disorders may arise from mutations in a few genes whose protein products are involved in the same cellular pathway or functional module [39]. In iCTNet, the user can search human diseases among the 6,338 entries in the disease ontology, and build disease networks. Fig. 3.2 shows the steps to build disease networks in iCTNet for a given list of human diseases. A directed edge in the schema represents a process of adding both nodes and edges, and an un-directed edge represents a process of adding edges only. We explain each step in Fig. 3.2 as follows:

• Step 1: to add the associated genes for the diseases, and corresponding phenotype-gene edges.

• Step 2: to add the associated tissues for the diseases, and corresponding phenotype-tissue edges.

• Step 3: to add the associated drugs for the diseases, and corresponding drug-phenotype edges.

• Step 4: to add the side effects associated with drugs in the network, and corresponding drug-side effect edges.

• Step 5: to add interacting proteins for genes in the network, and corresponding protein-protein edges, if the interacting proteins are not in the network; otherwise, to add protein-protein edges only. This step is repeated d times, where d is the growth depth of PPI. If d = 0, only protein-protein edges among genes already in the network are added.

• Step 6: to add miRNAs for genes added in both Step 1 and Step 5, and corresponding miRNA-gene edges.

• Step 7: to add gene-tissue edges.

• Step 8: to add drug-gene edges.

• Step 9: to add side effect-tissue edges.
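The role of the growth depth d in Step 5 can be sketched as follows. This is a minimal illustration of the expansion rule as it is described in the list above, not the plugin's actual code: the function name `grow_ppi` and the dictionary-of-sets PPI representation are assumptions made for this example.

```python
def grow_ppi(genes, ppi, depth):
    """Expand a gene set through protein-protein interactions (Step 5 sketch).

    genes: genes already in the network
    ppi:   dict mapping a protein to the set of its interaction partners
    depth: growth depth d; each round adds the partners of all current nodes
    Returns (nodes, edges), with edges stored as frozensets of two proteins.
    """
    nodes = set(genes)
    for _ in range(depth):
        partners = {q for p in nodes for q in ppi.get(p, ())}
        nodes |= partners
    # With depth 0 no node is added, so only edges among the given genes remain.
    edges = {frozenset((p, q))
             for p in nodes for q in ppi.get(p, ())
             if q in nodes and p != q}
    return nodes, edges
```

Each additional round pulls in the partners of every protein added in the previous round, which is why the network size can grow dramatically once d exceeds 0.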

For instance, we build a disease network of breast cancer as shown in Fig. 3.3. First, we search “breast cancer” in iCTNet, and there are 78 entries in total, including any subtypes of breast cancer. We treat these 78 entries as one. The entry “breast carcinoma” is associated with 92 genes, of which 66 genes are targeted by 87 drugs and 12 genes are targeted by 24 miRNAs. The growth depth d of PPI in Step 5 is 0 by default, so no interacting proteins are added to this breast cancer network. If the growth depth of PPI increases, the size of the resulting network grows dramatically. Five disease entries (breast cancer, invasive ductal carcinoma, cribriform carcinoma, Ehrlich tumor carcinoma and inflammatory breast carcinoma) are connected to 129 drugs. Three types of tissues (muscular system, skeletal system and thorax) are associated in the network. There are 1,017 different kinds of side effects associated with 63 drugs.

Figure 3.2: Disease-centered network schema

The disease-centered network provides the power to explore high-order relationships, for instance the inferred associations among diseases and miRNAs. In the breast cancer network, breast carcinoma is associated with the genes ECHDC1 and CCND1, and the miRNAs MIR16-1 and MIR16-2 are both associated with ECHDC1 and CCND1. MIR16-1 and MIR16-2 can therefore be inferred as breast cancer associated miRNAs for further investigation. If we set d > 0 in Step 5, then interacting proteins of the genes in the network are added, which makes it possible to explore unknown relationships among diseases and proteins that are not directly connected to the given diseases.

Figure 3.3: Disease network of breast cancer
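The inference of disease-miRNA associations through shared genes, as in the ECHDC1/CCND1 example above, is a composition of two edge types. The sketch below shows one way to express it; the function name `infer_two_hop` and the dictionary representation are assumptions made for this example, and the edges are limited to those named in the text.

```python
def infer_two_hop(first_edges, second_edges):
    """Compose source->intermediate and intermediate->target edges.

    Returns a dict mapping (source, target) to the set of intermediates that
    support the inferred association.
    """
    inferred = {}
    for source, mids in first_edges.items():
        for mid in mids:
            for target in second_edges.get(mid, ()):
                inferred.setdefault((source, target), set()).add(mid)
    return inferred


# Edges taken from the breast cancer example in the text.
phenotype_gene = {"breast carcinoma": {"ECHDC1", "CCND1"}}
gene_mirna = {
    "ECHDC1": {"MIR16-1", "MIR16-2"},
    "CCND1": {"MIR16-1", "MIR16-2"},
}
inferred = infer_two_hop(phenotype_gene, gene_mirna)
```

Keeping the supporting genes for each inferred pair lets candidate miRNAs be ranked by how many independent genes link them to the disease.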

3.5.2 Gene-centered Network Model

A single gene may influence many pathologies, and consequently be associated with multiple diseases. First, we show the network schema to build a gene-centered network in iCTNet, and then discuss the gene network of three breast cancer genes: BRCA1, BRCA2 and BRCA3. Fig. 3.4 displays the nine steps to build gene networks in iCTNet for a given list of human gene symbols. These steps are similar to the ones used to build the disease-centered network: a directed edge in the schema represents a process of adding both nodes and edges, and an un-directed edge represents a process of adding edges only.

Figure 3.4: Gene-centered network schema

Fig. 3.5 shows the gene network of the three breast cancer genes: BRCA1, BRCA2, and BRCA3. In this gene network, there are 5 phenotypes associated with BRCA1, and 14 phenotypes associated with BRCA2. Two miRNAs, microRNA 16-1 and microRNA 16-2, target BRCA1. microRNA 16-1 is also associated with one phenotype: chronic lymphocytic leukemia. Both BRCA1 and BRCA2 are expressed in 69 different tissues.

The gene-centered network offers the power to explore high-order relationships among genes and miRNAs, drugs or diseases if the growth depth of PPI d > 0. Even with d = 0, the gene-centered network still makes it possible to measure the similarity among genes by counting the number of shared neighbours in the network. For example, in the gene network of BRCA1, BRCA2 and BRCA3, there are no edges connected to BRCA3, whereas BRCA1 and BRCA2 are both highly expressed in 69 different tissues. These findings could indicate that BRCA1 and BRCA2 are more likely involved in similar cellular functions. iCTNet offers a unique feature to build similarity networks, which will be introduced in Section 3.5.4.

Figure 3.5: Gene network of BRCA1, BRCA2 and BRCA3

3.5.3 Drug-centered Network Model

The drug-centered network model is useful in drug repurposing, in which a drug that is effective for one disease may be identified as a plausible alternative for another disease that is genetically associated with the same drug target. In this section, we first show the network schema of a drug network, and then discuss the drug network of Methotrexate, an FDA-approved drug used to treat breast cancer, lung cancer and other types of cancer. Fig. 3.6 shows the nine steps to build drug networks in iCTNet for a given list of drug names. These steps are similar to the previous ones used to build disease-centered and gene-centered networks: a directed edge in the schema represents a process of adding both nodes and edges, and an undirected edge represents a process of adding edges only.

Figure 3.6: Drug-centered network schema

Methotrexate is an antimetabolite and antifolate agent with antineoplastic and immunosuppressant activities, as described in the National Cancer Institute Drug Dictionary (http://www.cancer.gov/drugdictionary). In the drug network shown in Fig. 3.7, Methotrexate is connected with 114 phenotypes, of which 91 are associated with 13 different types of tissues. Methotrexate targets 591 genes, of which 70 are targeted by 58 miRNAs.

The drug-centered network provides the power to explore high-order relationships between (1) drug and tissue; and (2) drug and miRNA. As described in the previous paragraph, Methotrexate is associated with 91 phenotypes which connect to 13 different types of tissues, so these 13 tissues can be inferred as Methotrexate-associated tissues. Similarly, Methotrexate targets 70 genes that are themselves miRNA targets, indicating that Methotrexate may be a miRNA-targeted drug, which is a trend in therapeutic research [129].

Figure 3.7: Drug network of Methotrexate

3.5.4 Similarity Network

iCTNet offers a unique feature to generate a similarity network by counting shared direct neighbours. When Goh et al. [39] built the first Diseasome, they also built the first Human Disease Network (HDN), in which two diseases are connected if at least one gene is associated with both. The HDN displays connections between human diseases. The first HDN contains 1,284 disorders, using the disease-gene associations obtained from OMIM [39]. iCTNet collects disease-gene associations from three online repositories: GWAS Catalog, OMIM and CTD. GWAS Catalog reports disease-gene associations found in large-scale GWAS. OMIM initially focused on monogenic diseases and then expanded to complex diseases. CTD contains curated and inferred disease-gene associations: curated associations are extracted from either the literature or OMIM, and inferred associations are established via curated chemical-gene interactions in CTD. iCTNet lets the user select among the three data sources for disease-gene associations by checking the boxes in the disease-gene association panel in Fig. 3.1.

Using iCTNet, we can easily build the HDNs (Fig. 3.8) of 6,338 diseases using OMIM, CTD and GWAS Catalog, respectively. The HDNs in Fig. 3.8 only display the diseases having at least one connection with other diseases. Fig. 3.8A displays the HDN of OMIM, containing 436 diseases and 871 edges. Fig. 3.8B displays the HDN of CTD, containing 1,213 diseases and 35,957 edges. Fig. 3.8C displays the HDN of GWAS Catalog, containing 84 diseases and 469 edges (a cut-off of −log(p-value) > 7 has been applied to the disease-gene associations). In all three HDNs, each node represents a distinct disease/trait and is coloured by disease class, the size of each node is proportional to its degree in the network, and a force-directed layout has been applied. Because GWAS Catalog contains experimental data only and a stringent cut-off has been applied to filter the disease-gene associations, the HDN of GWAS Catalog is much smaller than the other two networks. In contrast, CTD contains both curated and inferred associations, so it is not surprising that the HDN of CTD is much larger and denser than the other two networks.

In addition, iCTNet offers the flexibility to build similarity networks on any node type. In such similarity networks, a node represents an entry of the selected node type, and two nodes are connected if they share at least one common neighbour.
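The shared-neighbour rule behind these similarity networks can be sketched in a few lines of Python. This is an illustrative sketch, not the iCTNet implementation: the function name `similarity_network`, the `min_shared` parameter, and the toy disease and gene labels are all assumptions.

```python
from itertools import combinations


def similarity_network(neighbours, min_shared=1):
    """Connect two nodes when they share at least `min_shared` direct neighbours.

    `neighbours` maps each node (e.g. a disease) to the set of its direct
    neighbours of another type (e.g. associated genes). The returned dict maps
    each connected pair to the number of shared neighbours.
    """
    edges = {}
    for a, b in combinations(sorted(neighbours), 2):
        shared = neighbours[a] & neighbours[b]
        if len(shared) >= min_shared:
            edges[(a, b)] = len(shared)
    return edges


# Hypothetical disease-gene associations (labels are placeholders).
disease_genes = {
    "disease A": {"G1", "G2"},
    "disease B": {"G2", "G3"},
    "disease C": {"G4"},
}

hdn_edges = similarity_network(disease_genes)
# Keep only diseases with at least one connection, as in the displayed HDNs.
hdn_nodes = {node for pair in hdn_edges for node in pair}
print(hdn_edges, hdn_nodes)
```

Because the function only needs a node-to-neighbour mapping, the same sketch applies to any node type, for example genes connected through shared tissues.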

3.6 Discussion

iCTNet is a powerful app for Cytoscape, providing easy access to an integrative database that assembles interactions among human phenotypes, proteins, tissues, and drugs. It utilizes the power of multi-partite network analysis and visualization to uncover high-order genetic relationships among multiple traits, to suggest alternative therapeutic approaches and to prioritize disease-associated genes. iCTNet provides a point-and-click environment to load views for user-selected phenotypes, genes or drugs.

Figure 3.8: Human disease networks

Some of the data sources in the iCTNet database are updated periodically (e.g., GWAS Catalog is updated daily), so Python scripts have been written to automate the updating of the iCTNet database; the parts requiring manual curation still have to be updated manually.

So far, iCTNet has been used in several neurological studies [110][111]. Networks containing traits, genes and drugs were built using iCTNet for multiple sclerosis, type 1 diabetes, and rheumatoid arthritis to identify hidden patterns, which may assist in the creation of more detailed models of disease pathogenesis [110]. Similar networks were built using iCTNet for neurological diseases and related diseases in a later study [111].

The iCTNet database and iCTNet app are a systematically developed resource and tool for genomic disease and drug studies. iCTNet demonstrates the integration of heterogeneous interactions and makes it as easy as clicking buttons to build meta-networks including genes, drugs, phenotypes, side effects, miRNAs, tissues and the connections among them, offering a comprehensive map. Once installed, iCTNet shows up automatically in Cytoscape. Through the Cytoscape platform, networks constructed via iCTNet can be visualized in different layouts with many visualization features. Other built-in functions or analysis apps in Cytoscape can be easily applied as well. As a result, iCTNet provides the flexibility to apply computational analysis to the generated networks for further exploration, such as disease gene prediction and module detection methods developed in other apps, for example the iPINBPA app introduced in the next chapter.

Chapter 4

Integrative Network-based Functional Module Discovery for Genome-wide Association Studies

Preamble: The material presented in this chapter has been published in [130][131].

4.1 Introduction

A primary goal of human genetic studies is to identify genetic risk factors for common complex diseases, such as autoimmune disorders. GWAS have been a powerful population-based tool to identify genetic risk factors by measuring SNPs across the human genome. GWAS test hundreds of thousands of SNPs to identify statistically significant differences in allelic frequencies between cases and controls [132]. For each SNP, a p-value gives the probability of observing an association at least as strong as the one actually observed if the null hypothesis is true, where the null hypothesis is that there is no association between this SNP and the studied phenotype. If the p-value is low, then the chance of seeing such a result under the null hypothesis is small. If the p-value falls below a certain threshold, which is commonly set to 0.05, then the null hypothesis is rejected. In order to measure an association of genome-wide significance, a Bonferroni correction is typically applied (p-value < 5 × 10⁻⁸ for 1 million markers) under the assumption that SNPs are independent of each other. While this method guarantees a low fraction of false positives (type I error), it inevitably increases the proportion of false negatives (type II error), which limits the overall utility of GWAS. Furthermore, the results of GWAS do not directly provide any functional information about the variants, so functional studies are needed. With growing knowledge of biological networks, network-based analysis has proven successful at detecting functional modules, and is promising for investigating statistically modest associations in GWAS in the context of functional modules to elucidate the underlying molecular mechanisms [133][134][106].
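The genome-wide significance threshold quoted above follows directly from the Bonferroni correction, as the short Python sketch below shows. The function names and the example p-values are assumptions made for illustration.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


def is_genome_wide_significant(p_value, alpha=0.05, n_tests=1_000_000):
    """True when a SNP p-value passes the Bonferroni-corrected threshold."""
    return p_value < bonferroni_threshold(alpha, n_tests)


threshold = bonferroni_threshold(0.05, 1_000_000)
print(f"genome-wide threshold: {threshold:.1e}")

# Hypothetical SNP p-values: the second is nominally significant (p < 0.05)
# but is discarded by the corrected threshold.
print(is_genome_wide_significant(1e-9), is_genome_wide_significant(1e-4))
```

The second p-value illustrates the modest associations that the stringent threshold rejects and that network-based analysis aims to revisit.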

4.2 Background

Pathway analysis approaches can be divided into three generations, according to a recent survey of the evolution of pathway analysis methods proposed in the last decade [135]. The first generation is composed of over-representation analysis (ORA) methods; examples include DAVID [136] and INRICH [137]. The typical input of such tools is a list of genes or intervals, and a statistical significance value is calculated to measure the degree of enrichment of each pathway in various knowledge databases, such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) [33] or Gene Ontology (GO). The tests commonly used in ORA are based on the hypergeometric, chi-square or binomial distribution [135]. The second generation is represented by functional class scoring (FCS) approaches, such as GenGen [138], SSEA [139] and PARIS [140]. The input data for these tools are SNP-wise statistics, such as p-values. FCS methods aggregate the SNP-wise statistics into a single score for each pre-defined pathway in KEGG or other databases. The pathway-level score in FCS can be calculated using the sum, mean, or median of the SNP-wise statistics. Both ORA and FCS ignore the functional connections between genes and assume the independence of pathways. The last generation is network-based pathway analysis, which largely overcomes the assumption of independence of pathways. Network-based analysis commonly uses a scaffold of protein interactions to build connections between gene products, where nodes represent proteins and edges represent physical or functional interactions between pairs of proteins. Rather than focusing on individual markers, network-based analysis methods take into account multiple markers in the context of molecular networks. Due to this critical feature, these methods can afford to use sub-genome-wide statistical significance and yet increase the power to detect new associations and functional relationships between genes in complex traits.

To date, several network-based methods have been proposed to identify functional modules in the form of sub-networks in a given network. For example, protein-interaction-network based pathway analysis (PINBPA) [106] was developed to identify enriched sub-networks in GWAS datasets. PINBPA reads in gene-wise statistics (gene-wise p-values) from the GWAS dataset, aggregates the gene-wise statistics of the nodes in each candidate sub-network into a single score to rank the sub-networks, then applies a greedy algorithm [105] to identify high-scoring sub-networks. Dense module searching of GWAS (dmGWAS) [141] also extensively searches for sub-networks enriched with low p-value genes in GWAS datasets, but the input data for dmGWAS are SNP p-values. Both PINBPA and dmGWAS apply a greedy algorithm to search for active modules, but do not incorporate topological properties of the human protein interaction network (PIN, also called PPI network). Another tool, network interface miner for multigenic interactions (NIMMI) [142], combines topological connectivity with association signals from GWAS. In this tool, sub-networks are pre-generated from each node by adding its neighbours up to the second order. Nodes are weighted using a modified Google PageRank algorithm and then the pre-generated sub-networks are scored. More recently, the disease association protein-protein link evaluator (DAPPLE) [143] was reported to effectively prioritize novel associations in Crohn's disease and rheumatoid arthritis datasets. Unfortunately, neither NIMMI nor DAPPLE accepts user-defined networks; in other words, users cannot substitute their own network for the pre-defined data in NIMMI and DAPPLE.

In this chapter, we introduce integrative protein-interaction based pathway analysis (iPINBPA), a novel network-based pathway analysis method. Inspired by the topological connectivity in NIMMI, iPINBPA is an extension of PINBPA that integrates topological connectivity among genes in network space with the association signals from GWAS to extensively search for sub-networks enriched in significant GWAS signals. Table 4.1 shows the methodological differences between iPINBPA and other methods. The details of this integrative search strategy are described in Section 4.4. We also develop a Cytoscape [8] app for iPINBPA, described in detail in Section 4.6 [130], in which users can upload their own networks (e.g., the iCTNet networks introduced in Chapter 3) and perform iPINBPA analysis. We also compare the performance of iPINBPA with three related approaches and evaluate the results using independent datasets.

                          iPINBPA               PINBPA             NIMMI              dmGWAS        DAPPLE
Input                     Gene-wise p-value     Gene-wise p-value  Gene-wise p-value  SNP p-value   Gene/SNP list
Output                    sub-networks          sub-networks       sub-networks       sub-networks  list
Algorithm                 Random walk + Greedy  Greedy             Google PageRank    Greedy        N/A
PPI                       User-defined          User-defined       Fixed              User-defined  Fixed
Topological Connectivity  Yes                   No                 Yes                No            No
Interface                 Standalone            Standalone         Standalone         Standalone    Web

Table 4.1: Methodological difference between iPINBPA and other methods

4.3 Data

The data used in this chapter include two independent GWAS data sets in multiple sclerosis (MS), a human PPI network, and benchmark MS genes for evaluating the proposed method.

4.3.1 GWAS Data Sets

Two large-scale GWAS data sets have been used to evaluate iPINBPA. The first data set includes 137,432 SNPs mapped to 17,425 unique genes for 5,545 cases and 12,153 controls [144], denoted as Meta2.5. The second data set is composed of 137,457 SNPs mapped to 17,401 unique genes for 9,772 cases and 17,376 controls [145], denoted as WTCCC2. For both data sets, the SNP-wise statistical significance (SNP-wise p-value) was transformed into gene-wise p-values using the versatile gene-based association study (VEGAS) tool [146]. If the gene-wise p-value ≤ 0.05 in a GWAS data set, the corresponding gene is defined as nominally significant. There are 1,982 unique nominally significant genes in the WTCCC2 data set and 1,690 in the Meta2.5 data set.

4.3.2 Human Protein Interaction Network

In this study, we used a high-confidence, manually curated human protein interaction network previously reported [106]. The PIN is represented as an undirected graph including 8,960 nodes and 27,724 edges.

4.3.3 Benchmark Genes

Two sets of benchmark genes are collected to test the proposed method. The first gene set includes 45 known MS genes identified in previous GWAS studies, mainly from WTCCC2, denoted as WTCCC2 genes. The second gene set includes 135 genes identified in the most recent MS study, the ImmunoChip custom genotyping array [147], denoted as iChip genes.

4.4 Methods

Given a human protein interaction network and gene-wise p-values from a GWAS data set, iPINBPA detects enriched sub-networks in three steps, as shown in Fig. 4.1. First, for a given GWAS data set, the statistically significant genes (gene-wise p-value ≤ 0.05) are selected as seed genes, and then nodes in the network are weighted through the random walk with restart algorithm according to their connectivity to seed genes, based on guilt-by-association. Second, a network score is defined as the combination of the gene-wise p-values with node weights using the Liptak-Stouffer method [148][149]. The background distribution for the network score is calculated using random sampling for various network sizes. Finally, a heuristic algorithm extensively searches for modules enriched with genes with low p-values and high weights, i.e., a high network score. We explain each step in detail in the following subsections.

Figure 4.1: Work flow of iPINBPA

4.4.1 Step One

In the first step, nodes in the network are scored through the random walk with restart method [81]. In this method, as described in Chapter 2, a walker starts from a seed node and moves randomly to connected neighbors. Nodes in the network are scored according to the probability that the walker reaches them. iPINBPA extends Köhler's approach by weighting the edge $e_{ij}$ connecting $n_i$ and $n_j$ using the corresponding gene-wise p-values as $W_{ij} = ((1 - p_i) + (1 - p_j))/2$, where $p_i$ and $p_j$ are the gene-wise p-values of $n_i$ and $n_j$, and normalizing the adjacency matrix W by its columns. A score vector is calculated after each step of the walker as follows:

\[ P(t) = (1 - r)\, W \cdot P(t - 1) + r\, P(0) \tag{4.1} \]

where P(t) is the score vector after t steps of walking, and r is the restart ratio. The initial score vector P(0) represents a priori knowledge of genes, where 1 is assigned to seed genes and 0 to the rest. Finally, all nodes are scored according to their values in the vector P(T), which quantitatively measures the topological connection to seed genes. As mentioned above, iPINBPA requires a group of seed genes to start random walks. In this study, we used the nominally significant genes (gene-wise p-value ≤ 0.05) in the GWAS data set as seed genes. This refines the search for enriched sub-networks, as nominally significant genes are assigned higher scores than the non-significant ones.
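The edge weighting, column normalization and the update in Equation 4.1 can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions rather than the iPINBPA source: the function names, the four-node path network and its p-values are hypothetical, and dense lists are used instead of a sparse matrix for readability.

```python
def build_transition_matrix(nodes, edges, pvals):
    """Weight each edge by ((1 - p_i) + (1 - p_j)) / 2 and column-normalize."""
    index = {node: i for i, node in enumerate(nodes)}
    n = len(nodes)
    W = [[0.0] * n for _ in range(n)]
    for a, b in edges:
        weight = ((1 - pvals[a]) + (1 - pvals[b])) / 2
        W[index[a]][index[b]] = weight
        W[index[b]][index[a]] = weight  # undirected network
    for j in range(n):
        column_sum = sum(W[i][j] for i in range(n))
        if column_sum > 0:
            for i in range(n):
                W[i][j] /= column_sum
    return W


def random_walk_with_restart(W, p0, r=0.5, steps=5):
    """Iterate P(t) = (1 - r) * W * P(t - 1) + r * P(0) for a fixed number of steps."""
    n = len(p0)
    p = list(p0)
    for _ in range(steps):
        p = [(1 - r) * sum(W[i][j] * p[j] for j in range(n)) + r * p0[i]
             for i in range(n)]
    return p


# Hypothetical path network A - B - C - D with made-up gene-wise p-values.
nodes = ["A", "B", "C", "D"]
edges = [("A", "B"), ("B", "C"), ("C", "D")]
pvals = {"A": 0.01, "B": 0.2, "C": 0.5, "D": 0.9}

# Seed genes are the nominally significant ones (p <= 0.05): here only A.
p0 = [1.0 if pvals[node] <= 0.05 else 0.0 for node in nodes]
W = build_transition_matrix(nodes, edges, pvals)
scores = dict(zip(nodes, random_walk_with_restart(W, p0, r=0.5, steps=5)))
print(scores)
```

In this toy network the seed A keeps the highest score and the scores decay with distance from it, which is the guilt-by-association behaviour the node weights are meant to capture.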

4.4.2 Step Two

In the second step of our approach, we face the problem of aggregating the p-values of the nodes in a network into a single number, the network score. Fisher's method is a simple and popular aggregation statistic, which combines n p-values by $\chi^2_{2n} = -2 \sum_{i=1}^{n} \log(p_i)$ [150]. However, this method is asymmetrically sensitive to small p-values compared to large p-values [151]. For example, given two p-values from two studies, $p_1 = 0.001$ and $p_2 = 0.999$, clearly on average there is no consistent evidence in these two studies. However, Fisher's combined p-value is 0.008, which favours the small p-value.

Another commonly used approach is the Z-test proposed by Stouffer et al. [152]. Using the Z-test, each p-value $p_i$ is transformed into its standard normal deviate $z_i$ using the inverse normal CDF, $z_i = \Phi^{-1}(1 - p_i)$, and then a network score for a network A containing k nodes is defined as $Z_A = \sum_{i=1}^{k} z_i / \sqrt{k} \sim N(0, 1)$. This test does not suffer from the asymmetry problem discussed above for Fisher's method. In PINBPA, network scores are calculated using the Z-test, in which each gene is assigned equal weight. Later, a weighted Z-test, also called the Liptak-Stouffer formula, was proposed [148][149], as shown in Equation 4.2, to give a different weight to each p-value. By using this formula, the gene-wise significance from GWAS is combined with node connectivity to seed genes. Nodes with low p-values that are close to seed genes will be scored high.
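The asymmetry of Fisher's method relative to the unweighted Z-test can be checked numerically on the two p-values used in the example above. The sketch below uses only the Python standard library; the function names are assumptions, and the closed-form chi-square tail is valid because Fisher's statistic always has an even number of degrees of freedom.

```python
import math
from statistics import NormalDist


def fisher_combined_p(pvals):
    """Fisher's method: X = -2 * sum(log p_i) follows chi-square with 2n d.f.

    For even degrees of freedom 2n the upper tail has the closed form
    exp(-X/2) * sum_{i<n} (X/2)^i / i!.
    """
    half = -sum(math.log(p) for p in pvals)  # this is X / 2
    n = len(pvals)
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(n))


def stouffer_z(pvals):
    """Unweighted Z-test: Z = sum(z_i) / sqrt(k) with z_i = Phi^{-1}(1 - p_i)."""
    normal = NormalDist()
    return sum(normal.inv_cdf(1 - p) for p in pvals) / math.sqrt(len(pvals))


pvals = [0.001, 0.999]
fisher_p = fisher_combined_p(pvals)
z = stouffer_z(pvals)
stouffer_p = 1 - NormalDist().cdf(z)
print(f"Fisher p = {fisher_p:.4f}, Stouffer Z = {z:.4f}, Stouffer p = {stouffer_p:.4f}")
```

Fisher's combined p-value is close to 0.008, whereas the two opposite deviates cancel in the Z-test and give a combined p-value of about 0.5.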

\[ Z_A = \frac{\sum_{i \in A} P(T)_i \, z_i}{\sqrt{\sum_{i \in A} P(T)_i^2}} \tag{4.2} \]

To determine the significance of the network score calculated above, we performed random sampling of gene sets of size k ∈ [1, 500] 1,000 times. For gene sets of size k, we computed their scores $Z_A$, then calculated the mean $\mu_k$ and the standard deviation $\sigma_k$ of the network score. The adjusted network score is defined as:

\[ S_A = \frac{Z_A - \mu_k}{\sigma_k} \tag{4.3} \]
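Equations 4.2 and 4.3 can be sketched with the Python standard library as below. This is an illustrative sketch only: the function names, the randomly generated background p-values and weights, and the example module are assumptions, and p-values are kept strictly between 0 and 1 because the inverse normal CDF is undefined at the endpoints.

```python
import math
import random
from statistics import NormalDist, mean, stdev

_normal = NormalDist()


def weighted_z(pvals, weights):
    """Liptak-Stouffer score (Eq. 4.2): sum(w_i * z_i) / sqrt(sum(w_i^2))."""
    numerator = sum(w * _normal.inv_cdf(1 - p) for p, w in zip(pvals, weights))
    return numerator / math.sqrt(sum(w * w for w in weights))


def background_moments(all_pvals, all_weights, k, n_samples=1000, seed=0):
    """Mean and standard deviation of the score over random gene sets of size k."""
    rng = random.Random(seed)
    indices = range(len(all_pvals))
    scores = []
    for _ in range(n_samples):
        chosen = rng.sample(indices, k)
        scores.append(weighted_z([all_pvals[i] for i in chosen],
                                 [all_weights[i] for i in chosen]))
    return mean(scores), stdev(scores)


def adjusted_score(z_a, mu_k, sigma_k):
    """Adjusted network score (Eq. 4.3)."""
    return (z_a - mu_k) / sigma_k


# Hypothetical background: 200 genes with made-up p-values and node weights
# (the weights stand in for the random-walk scores P(T)_i).
rng = random.Random(1)
all_pvals = [rng.uniform(0.001, 0.999) for _ in range(200)]
all_weights = [rng.uniform(0.01, 1.0) for _ in range(200)]

module_pvals = [0.001, 0.01, 0.02, 0.03, 0.2]
module_weights = [0.9, 0.8, 0.7, 0.6, 0.1]
mu, sigma = background_moments(all_pvals, all_weights, k=len(module_pvals))
s_a = adjusted_score(weighted_z(module_pvals, module_weights), mu, sigma)
print(f"adjusted network score S_A = {s_a:.2f}")
```

Note that the weighted score is unchanged when all weights are multiplied by a constant, so only the relative node weights within a sub-network matter.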

4.4.3 Step Three

The last step of iPINBPA is to find locally optimal sub-networks according to the adjusted network scores. A greedy algorithm searches for the optimal sub-network G starting from each node $v_{start}$ in the network. It examines all neighbors of G whose shortest path to $v_{start}$ is less than or equal to 2; if adding a neighbor increases the network score $S_G$, the neighbor giving the largest increase is added. Nodes are added until $S_G$ no longer increases. The algorithm then examines every node inside G that is not $v_{start}$ and is removable, meaning that G is still a connected sub-network after removing this node. If removing a node increases $S_G$, the node giving the largest increase is removed. The algorithm stops when $S_G$ no longer increases. The pseudocode of this algorithm is as follows:

1. G ← {v_start}
2. For each neighbor node v of G with depth ≤ 2:
3.     Calculate score S'_G if v is added into G
4. If max(S'_G) > S_G then:
5.     Add the corresponding node v_max into G
6.     Go back to step (2)
7. Else:
8.     For each node v in G except v_start:
9.         Calculate score S''_G if v is removed from G
10.    If max(S''_G) > S_G then:
11.        If the corresponding node v'_max is removable:
12.            Remove the corresponding node v'_max from G
13.            Go back to step (8)
14.    Else:
15.        Return G
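The pseudocode above can be turned into a small runnable Python sketch. It is one interpretation rather than the iPINBPA implementation: the depth limit is taken as the shortest-path distance to v_start in the full network, only removable nodes are considered as removal candidates, and the scoring function in the example is a plain unweighted Z-score used as a stand-in for the adjusted score S_G. All names and the toy network are assumptions.

```python
import math
from collections import deque


def nodes_within_depth(adj, start, max_depth=2):
    """Nodes whose shortest path to `start` is at most `max_depth` (BFS)."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        if depth[u] == max_depth:
            continue
        for v in adj[u]:
            if v not in depth:
                depth[v] = depth[u] + 1
                queue.append(v)
    return set(depth)


def is_connected(nodes, adj):
    """True when `nodes` induces a connected sub-network."""
    nodes = set(nodes)
    start = next(iter(nodes))
    seen, queue = {start}, deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v in nodes and v not in seen:
                seen.add(v)
                queue.append(v)
    return seen == nodes


def greedy_module(adj, start, score, max_depth=2):
    allowed = nodes_within_depth(adj, start, max_depth)
    G, best = {start}, score({start})
    while True:  # growth phase: steps 2-6 of the pseudocode
        candidates = ({v for u in G for v in adj[u]} - G) & allowed
        scored = [(score(G | {v}), v) for v in sorted(candidates)]
        if not scored or max(scored)[0] <= best:
            break
        best, v_max = max(scored)
        G.add(v_max)
    while True:  # pruning phase: steps 8-13, removable nodes only
        removable = [v for v in sorted(G)
                     if v != start and is_connected(G - {v}, adj)]
        scored = [(score(G - {v}), v) for v in removable]
        if not scored or max(scored)[0] <= best:
            break
        best, v_max = max(scored)
        G.remove(v_max)
    return G, best


# Hypothetical network and per-node z-scores.
adj = {"A": {"B", "C"}, "B": {"A", "D"}, "C": {"A"}, "D": {"B", "E"}, "E": {"D"}}
z = {"A": 2.0, "B": 1.5, "C": -1.0, "D": 2.5, "E": 3.0}


def toy_score(nodes):
    """Stand-in for S_G: unweighted Z-score of the node set."""
    return sum(z[v] for v in nodes) / math.sqrt(len(nodes))


module, module_score = greedy_module(adj, "A", toy_score)
print(sorted(module), round(module_score, 3))
```

In this toy run node E is never considered, even though it has the highest z-score, because its shortest path to the start node A is 3.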

4.4.4 Parameters

We chose the default number of time steps (T = 5) based on the network's characteristic path length (4.38 for the given PIN). The second parameter of the random walk is the restart ratio r, which weights the prior knowledge. As there is no standard criterion for selecting the restart ratio, we set the default value to 0.5. When other values were chosen (between 0.3 and 0.7), no significant differences in performance were observed.

4.4.5 Evaluation

To evaluate the performance of iPINBPA, we tested two sets of reported benchmark genes rather than using cross-validation. As shown in a previous study [106], if we define association regions (blocks) composed of significant genes (gene-wise p-value ≤ 0.05), there are 665 association blocks containing 1,982 unique genes in the WTCCC2 data set, and 612 blocks containing 1,690 unique genes in the Meta2.5 data set. The sizes of association blocks vary from 1 to more than 100, so it is statistically difficult to quantitatively compare the prediction or enrichment performance for each association block. In this study, we applied iPINBPA to identify sub-networks given a PIN and a GWAS data set, and ranked genes by their highest network score in descending order. Genes having the same network score were ranked by their gene-wise p-values in ascending order. Based on the ranking, Concentrated ROC (CROC) curves [1] were plotted to test the efficiency of iPINBPA in identifying the benchmark genes. CROC curves magnify any portion of interest of the ROC curve. As shown in Fig. 4.2, the CROC curve magnifies the early portion of the x-axis using an exponential transformation, $f(x) = (1 - e^{-\alpha x})/(1 - e^{-\alpha})$, in which α is the global magnification factor. As noted in [1], when α = 7, f(0.1) ≈ 0.5, and when α = 14, f(0.05) ≈ 0.5. As the ratio of the number of benchmark genes to the total number of nodes in our network is smaller than 0.05, we chose α = 14. CROC curves are better suited than ROC curves to measuring early retrieval ability in drug discovery and gene prediction [1]. In our case, early retrieval performance plays a much more important role, as we only consider the top scored/ranked nodes or sub-networks, and the number of benchmark genes is less than one percent of the total number of genes in the network.
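The exponential magnification and the resulting area under the CROC curve can be sketched as follows. The function names are assumptions, the area is computed with a simple trapezoidal rule over (transformed false positive rate, true positive rate) points, and the closed-form baseline for a random classifier is derived here from the transformation rather than taken from the thesis.

```python
import math


def croc_transform(x, alpha=14.0):
    """f(x) = (1 - exp(-alpha * x)) / (1 - exp(-alpha)), mapping [0, 1] onto [0, 1]."""
    return math.expm1(-alpha * x) / math.expm1(-alpha)


def croc_auc(fpr, tpr, alpha=14.0):
    """Trapezoidal area under the curve after magnifying the x-axis (FPR)."""
    xs = [croc_transform(x, alpha) for x in fpr]
    return sum((xs[i] - xs[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
               for i in range(1, len(xs)))


def random_croc_auc(alpha=14.0):
    """CROC area of a random classifier (diagonal ROC curve), in closed form."""
    return 1 / alpha - math.exp(-alpha) / (1 - math.exp(-alpha))


print(round(croc_transform(0.1, alpha=7), 3))    # close to 0.5
print(round(croc_transform(0.05, alpha=14), 3))  # close to 0.5
print(round(random_croc_auc(14), 4))
```

One plausible reading of the fold enrichment reported in the results is the CROC area of a method divided by `random_croc_auc(alpha)`; that reading is an assumption, since the exact computation is not spelled out in the text.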

Figure 4.2: ROC and CROC curves [1]

4.5 Results

Based on its predecessor (PINBPA), iPINBPA introduces node weighting by means of significant disease-related genes and integrates these weights with gene-based significance into a score, which is further normalized by network size. We applied the iPINBPA approach to two independent large-scale GWAS datasets in MS (Meta2.5 and WTCCC2), and benchmarked its performance against other established methods on the same input data.

In pathway analysis of GWAS, it is necessary to compute gene-wise (rather than SNP-wise) significance. Given that most associations fall outside coding regions, allocating a significant finding to a gene is not always straightforward. One common strategy is to assign the significant association to the closest gene, taking into account recombination hotspots. However, due to linkage disequilibrium (LD), it is not unusual to find that several genes map within the area of influence of the lead SNP. While usually the closest gene to the lead SNP is assigned, in reality, patterns of extended LD make it impossible to assign any given gene within that area with certainty.

It is challenging to compare different pathway analysis methods because of the lack of accurate knowledge of complex traits and the incompleteness of the human PPI network. Since DAPPLE and NIMMI do not accept a user-defined network and DAPPLE only accepts a short list of SNPs or genes (up to 500), it is not possible to directly compare these methods to iPINBPA. Thus, we compared iPINBPA to PINBPA and to dmGWAS. We performed three different tests: (1) prediction of WTCCC2 genes using Meta2.5 data; (2) prediction of iChip genes using WTCCC2 data; and (3) significantly enriched networks from both GWAS data sets.

4.5.1 Prediction of WTCCC2 Genes

We first tested the ability of each method to identify the WTCCC2 genes using Meta2.5 data. There are 45 WTCCC2 genes previously identified in GWAS studies (24 of them are represented in our network). Meta2.5 data are aggregated from seven moderately powered GWAS and one meta-analysis conducted before the completion of WTCCC2. The Meta2.5 GWAS data set contains weaker association signals than the WTCCC2 GWAS data set (26 SNPs with p-value < 5 × 10⁻⁸ and 1,690 nominally significant genes in Meta2.5, versus 57 associated SNPs and 1,982 nominally significant genes in WTCCC2). We measured the fold enrichment of the AUC score of each method compared to a random classifier. As shown in Fig. 4.3A, iPINBPA (fold enrichment = 5.858) performs marginally better than PINBPA (fold enrichment = 5.386) and substantially better than dmGWAS (fold enrichment = 3.646), with α = 14, f(0.05) = 0.5.

4.5.2 Predicting iChip Genes

Immunochip (iChip) is an Illumina Infinium SNP microarray, which produces fine mapping of GWAS loci [153]. In this study, we also tested the ability of each method to identify, from the WTCCC2 data, the latest MS genes reported in a recent study using the iChip custom genotyping array [147]. In this test, a total of 135 genes were associated with MS (although the total number of reported associated loci is 110, some SNPs map to more than one gene). Of these 135 iChip MS genes, 42 genes were WTCCC2 genes (23 of them are represented in our network), and thus 93 genes (54 of them represented in our network) found in iChip are novel. As shown in Fig. 4.3B, iPINBPA (fold enrichment = 6.22) performs better than PINBPA (fold enrichment = 5.211) and dmGWAS (fold enrichment = 2.818) in the prediction of iChip genes, with α = 14, f(0.05) = 0.5.

Figure 4.3: CROC curves of Meta2.5 and WTCCC2 GWAS data sets

4.5.3 Significantly Enriched Networks

As the primary goal of our approach is to identify the enriched pathways for the given GWAS data set, we selected the top scored sub-networks (score > 3 and size > 5) from each method. For this analysis we also tested NIMMI, which returns sub-networks with p-values. For NIMMI, the sub-networks with p-value < 0.0013 (equivalent to the z-score threshold used for the other methods) were selected. As shown in Table 4.2, iPINBPA is more sensitive to GWAS signals and identifies smaller networks, resulting in higher precision. The precision is calculated as the ratio of the number of benchmark genes to the number of selected nodes in the overlap of both GWAS data sets. Precision here assumes that all of the novel predictions, i.e., nodes in the selected sub-networks which have not been previously reported as being related to the disease, are false positives. This is a conservative estimate, as many of these nodes will be novel true positives. By overlapping the selected networks from both WTCCC2 and Meta2.5, iPINBPA identified 1,299 genes (including 17 WTCCC2 genes and 44 iChip genes), PINBPA identified 5,047 genes (including 23 WTCCC2 genes and 69 iChip genes), dmGWAS identified 7,634 genes (including 24 WTCCC2 genes and 77 iChip genes), and NIMMI identified 4,832 genes (including 19 WTCCC2 genes and 49 iChip genes). Altogether, iPINBPA achieved the highest precision for both sets of benchmark genes.

                     iPINBPA            PINBPA             dmGWAS             NIMMI
GWAS data set        WTCCC2   Meta2.5   WTCCC2   Meta2.5   WTCCC2   Meta2.5   WTCCC2   Meta2.5
#networks            1496     1295      4079     4080      7109     7000      402      400
#total nodes         2163     1938      6012     6079      7665     7643      4950     4979
#overlap of nodes    1299               5047               7634               4832
Precision            0.013              0.005              0.003              0.004
(#WTCCC2 genes)      (17)               (23)               (24)               (19)
Precision            0.034              0.014              0.01               0.01
(#iChip genes)       (44)               (69)               (77)               (49)

Table 4.2: Stats of top scored sub-networks from four methods
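The precision values in Table 4.2 follow from the reported counts as the number of benchmark genes divided by the number of overlapping nodes. The short Python check below reproduces them; the dictionary names are assumptions, while the counts are copied from the table.

```python
# Counts reported in Table 4.2 (overlap of nodes selected from both data sets).
overlap_nodes = {"iPINBPA": 1299, "PINBPA": 5047, "dmGWAS": 7634, "NIMMI": 4832}
wtccc2_hits = {"iPINBPA": 17, "PINBPA": 23, "dmGWAS": 24, "NIMMI": 19}
ichip_hits = {"iPINBPA": 44, "PINBPA": 69, "dmGWAS": 77, "NIMMI": 49}


def precision(hits, selected):
    """Fraction of selected nodes that are benchmark genes, rounded as in the table."""
    return round(hits / selected, 3)


wtccc2_precision = {m: precision(wtccc2_hits[m], overlap_nodes[m]) for m in overlap_nodes}
ichip_precision = {m: precision(ichip_hits[m], overlap_nodes[m]) for m in overlap_nodes}

for method in overlap_nodes:
    print(method, wtccc2_precision[method], ichip_precision[method])
```

The comparison shows why the smaller iPINBPA overlap yields the highest precision even though the other methods recover more benchmark genes in absolute terms.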

To evaluate the biological significance of the 1,299 candidate associated genes reported by iPINBPA, we tested their functional annotation clustering using the online tool DAVID [136][154]. The KEGG pathways in the cluster with the highest enrichment score (8.94) are listed in Table 4.3. While the precise etiology of MS is still unclear, it has been consistently described as a T-cell-mediated autoimmune disease. As such, it is not surprising that related KEGG pathways such as allograft rejection, type 1 diabetes mellitus, graft-versus-host disease, and autoimmune thyroid disease are significantly enriched. This result suggests that genes prioritized by iPINBPA are consistent with the biological functions likely implicated in MS pathogenesis.

KEGG Pathway                  Count   P-Value   Benjamini
Allograft rejection           24      4.7E-12   3.5E-11
Type I diabetes mellitus      24      4.0E-10   2.1E-9
Graft-versus-host disease     22      3.5E-9    1.6E-8
Autoimmune thyroid disease    23      2.6E-7    9.8E-7

Table 4.3: Functional annotation clusters of 1,299 genes in DAVID

4.5.4 Sensitivity of iPINBPA to Parameters

The restart ratio r in random walk with restart can be tuned by the user. We tested iPINBPA with different restart ratios (0.1, 0.3, 0.5, 0.7, and 0.9) and evaluated its performance as shown in Fig. 4.4. The corresponding CROC enrichments are shown in Fig. 4.5.

Figure 4.4: CROC curves with different restart ratios

Figure 4.5: CROC fold enrichments of different values of restart ratio r

In addition, we also tested iPINBPA with different cutoffs for selecting the seed genes that start the random walks, which controls the sensitivity of iPINBPA indirectly. By default, we used the nominally significant genes (gene-wise p-value ≤ 0.05). If a more stringent cutoff is used to select fewer seed genes, iPINBPA usually returns smaller sub-networks. Table 4.4 shows the sizes and precision of the top selected sub-networks from iPINBPA with different cutoffs. Results show that no significant differences in performance were observed with restart ratios between 0.3 and 0.7.

                               p-value ≤ 0.01      p-value ≤ 0.005     p-value ≤ 0.001
GWAS data set                  WTCCC2    Meta2.5   WTCCC2    Meta2.5   WTCCC2    Meta2.5
# networks                     1732      1224      1691      1293      1547      1458
# total nodes                  2108      1522      2000      1535      1774      1617
Mean of network size (node)    17.25     12.71     12.6      10.24     9.95      7.7
  (std)                        (15.49)   (8.78)    (11.12)   (4.36)    (4.52)    (2.32)
Mean of network size (edge)    26.91     16.22     16.21     12.38     10.68     8.37
  (std)                        (34.93)   (16.87)   (23.8)    (7.73)    (6.6)     (3.67)
# overlap of nodes             1133                1106                1082
Precision                      0.014               0.013               0.01
#WTCCC2 genes                  16                  14                  11
Precision                      0.036               0.034               0.028
#iChip genes                   41                  38                  30

Table 4.4: Stats of top scored sub-networks from iPINBPA with different cutoffs

4.6 Application

4.6.1 Availability and Requirements

In addition to our iPINBPA methodology, an iPINBPA app [130] for Cytoscape has been developed in Java. Additionally, R scripts are called via Rserve inside Cytoscape for plotting using R library ggplot2 (http://ggplot2.org).

• iPINBPA app is free to download from the Cytoscape app store: http://apps.cytoscape.org/apps/pinbpa.

• Operating system: platform independent.

• Programming language: Java and R.

• Cytoscape version: Cytoscape version 3.0 and later versions. Rserve and R library ggplot2 are required.

• The user manual is available at http://www.pinbpa.com.

4.6.2 Features

As shown in Fig. 4.6, the iPINBPA app has six features. First, the app reads the VEGAS output, or any other text file in the required format, containing the gene-wise p-values of a GWAS. VEGAS is a tool for transforming SNP-wise p-values into gene-wise p-values. Subsequently, iPINBPA offers six features to exploit the GWAS data: (1) Generates a gene-wise Manhattan plot of the GWAS; (2) Sorts all genes by their genomic coordinates and defines association blocks at any user-defined threshold (p-value < 0.05 by default); (3) Runs network smoothing (an optional gene prioritization scheme) using a random walk with restart algorithm [81]; (4) Detects sub-networks enriched in significant genes using either unweighted [105] or weighted z-scores [151]; (5) Generates a sub-network of only significant genes (first-order networks) passing a user-defined threshold; and (6) Tests the statistical significance of the sub-networks using random permutations.

The iPINBPA app is the first Cytoscape app designed for network analysis of GWAS data. Through a user-friendly interface, the iPINBPA app enables genomic researchers and biologists with no computational expertise to run a powerful and otherwise complex analytical pipeline. The six features are introduced one by one in the following sections.

Manhattan Plot of the GWAS Data

A Manhattan plot is a popular way to show the associations of genomic loci along the whole human genome. Once the app reads in the GWAS data, it is easy to draw a Manhattan plot, as shown in Fig. 4.7. As the input file is gene-wise GWAS data, the Manhattan plot drawn by iPINBPA represents genes rather than SNPs.

Blocks along Human Genome

In GWAS, most significant SNPs fall outside coding regions, so it is not always straightforward to assign a significant SNP to a single gene, e.g., the closest gene. iPINBPA provides a feature to define association blocks composed of genes passing a given cutoff, for example gene-wise p-value ≤ 0.05 (in general, the larger the GWAS, the more relaxed the p-value that can be used for defining association blocks). Once iPINBPA reads in the GWAS data, all the genes are ordered by their genomic positions in the table. After clicking the button “Find blocks”, the genes within blocks are assigned a block number and highlighted in red in the table, as shown in Fig. 4.8. Genes in each block are located on the same chromosome.

Figure 4.6: Overview of iPINBPA

Figure 4.7: Manhattan plot in iPINBPA

Figure 4.8: Table of genes and blocks
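The block-finding step can be sketched in Python as follows. This is a minimal illustration, not the app's Java implementation: it assumes a block is a run of at least two consecutive genes on the same chromosome that all pass the cutoff, and the tuple layout, the `min_size` parameter and the gene names are hypothetical.

```python
def find_blocks(genes, threshold=0.05, min_size=2):
    """Assign block numbers to runs of consecutive significant genes.

    genes: list of (name, chromosome, start_position, p_value) tuples.
    Returns a dict mapping gene name -> block number (1-based).
    min_size is an assumption: a single isolated gene does not form a block.
    """
    # Order genes by chromosome, then by position along the chromosome.
    ordered = sorted(genes, key=lambda g: (g[1], g[2]))
    blocks = {}
    current = []
    block_id = 0

    def flush():
        # Close the current run and number it if it is long enough.
        nonlocal block_id, current
        if len(current) >= min_size:
            block_id += 1
            for name in current:
                blocks[name] = block_id
        current = []

    prev_chrom = None
    for name, chrom, _start, p_value in ordered:
        if chrom != prev_chrom:
            flush()  # a block never spans two chromosomes
            prev_chrom = chrom
        if p_value <= threshold:
            current.append(name)
        else:
            flush()
    flush()
    return blocks


example = [
    ("G1", "1", 100, 0.01), ("G2", "1", 200, 0.03), ("G3", "1", 300, 0.50),
    ("G4", "1", 400, 0.02), ("G5", "2", 50, 0.01), ("G6", "2", 150, 0.04),
]
print(find_blocks(example))
```

In this toy input, G1 and G2 form the first block and G5 and G6 the second; G4 passes the cutoff but has no significant neighbour, so it is left unassigned under the assumed minimum size.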

After finding blocks, the user can select a Target network (combo box) and click the Map button. This action will map (as a node attribute) every p-value from the GWAS to each corresponding node in the target network. These attributes can be used by the iPINBPA app to search for first-order and enriched sub-networks.

Gene Prioritization

iPINBPA provides random walk with restart as a feature, which can be applied to quantitatively measure the topological connectivity of each node to any set of start nodes. The start nodes are the nodes currently selected in Cytoscape. After running random walk with restart, two more node attributes are added to the node attribute table: RW NodeWeight and RW rankScore, as shown in Fig. 4.9. RW NodeWeight is the node score returned directly from the random walk, and RW rankScore is the normalized ranking score ∈ (0, 1] of each node, with nodes ranked in ascending order of node score. For example, the node with the highest node score is assigned 1, and the node with the lowest node score is assigned 1/|N|, where |N| is the total number of nodes in the network.

Figure 4.9: Node attributes added after running random walk with restart
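The two node attributes can be illustrated with a small Python sketch. It assumes the common iterative form of random walk with restart on an undirected, unweighted network, with the restart ratio r taken as the probability of returning to the start nodes at each step; the function names and the toy network are hypothetical, and the app's own implementation may differ in normalization details.

```python
def random_walk_with_restart(neighbors, seeds, r=0.5, tol=1e-10, max_iter=1000):
    """Return a node weight for every node.

    neighbors: dict mapping node -> set of adjacent nodes (undirected network).
    seeds: collection of start nodes; restart mass is spread evenly over them.
    """
    nodes = list(neighbors)
    p0 = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        # Restart term, then spread the remaining mass evenly to neighbours.
        nxt = {v: r * p0[v] for v in nodes}
        for u in nodes:
            degree = len(neighbors[u])
            if degree == 0:
                continue  # isolated node: nothing to propagate
            share = (1.0 - r) * p[u] / degree
            for v in neighbors[u]:
                nxt[v] += share
        diff = sum(abs(nxt[v] - p[v]) for v in nodes)
        p = nxt
        if diff < tol:
            break
    return p


def rank_scores(weights):
    """Normalized rank in ascending order: lowest weight -> 1/|N|, highest -> 1."""
    ordered = sorted(weights, key=weights.get)
    n = len(ordered)
    return {v: (i + 1) / n for i, v in enumerate(ordered)}


toy_network = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
node_weight = random_walk_with_restart(toy_network, seeds={"a"}, r=0.5)
print(node_weight)
print(rank_scores(node_weight))
```

On this three-node path with the walk started at node a, the weights decrease with distance from a, so a receives rank score 1 and c receives 1/3.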

Enriched Sub-networks

A greedy algorithm was implemented in the iPINBPA app to find enriched sub-networks using either the unweighted or the weighted z-score, as described in Equation 4.2. Once the search finishes, all sub-networks are listed in the table, as shown in Fig. 4.10.

Figure 4.10: Sub-network table

User Defined Sub-networks

Given a user-defined threshold of gene-wise p-values, iPINBPA can easily generate a sub-network (first-order network) of only the genes passing the given threshold. Fig. 4.11 shows a sub-network of genes whose gene-wise p-value ≤ 0.05.

Figure 4.11: Sub-network of genes with p-value ≤ 0.05

Network Permutation

The iPINBPA app generates a sub-network composed of genes passing a given threshold (gene-wise p-values) with a single button click. Furthermore, in order to test the statistical significance of this sub-network, the iPINBPA app can run a random permutation step (1,000 times by default). If the size of the sub-network is k, then the iPINBPA app randomly selects k nodes in each random permutation, and then calculates the number of edges connecting them. Another useful metric provided is the number of edges in the largest connected component. Based on the permutations, the app can plot a histogram to show the distribution of the total number of edges (left), and of the number of edges in the largest connected component (right). The histograms are colored based on different percentiles: 70% (cyan), 90% (pink) and 99% (red), as shown in Fig. 4.12. The red dashed line shows the corresponding values for the sub-network in question.

Figure 4.12: Histograms of random networks
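The permutation step can be sketched in Python as below. This is an illustrative reading of the description above, not the app's code: the function names and the toy network are hypothetical, the "largest" component is assumed to be the one with the most nodes, and the empirical p-value is an addition for illustration (the app itself reports histograms with percentiles).

```python
import random


def count_edges(edges, nodes):
    """Number of edges whose two endpoints both lie in `nodes`."""
    nodes = set(nodes)
    return sum(1 for u, v in edges if u in nodes and v in nodes)


def largest_component_edges(edges, nodes):
    """Edge count of the largest connected component (by node count, an assumption)."""
    nodes = set(nodes)
    induced = [(u, v) for u, v in edges if u in nodes and v in nodes]
    adj = {v: set() for v in nodes}
    for u, v in induced:
        adj[u].add(v)
        adj[v].add(u)
    seen, best = set(), (0, 0)  # (component node count, component edge count)
    for start in nodes:
        if start in seen:
            continue
        stack, comp = [start], {start}
        while stack:  # depth-first traversal of one component
            u = stack.pop()
            for w in adj[u]:
                if w not in comp:
                    comp.add(w)
                    stack.append(w)
        seen |= comp
        best = max(best, (len(comp), count_edges(induced, comp)))
    return best[1]


def permutation_test(edges, all_nodes, subnetwork, n_perm=1000, seed=0):
    """Compare the edge count of a k-node sub-network with random k-node sets."""
    rng = random.Random(seed)
    all_nodes = sorted(all_nodes)
    k = len(set(subnetwork))
    observed = count_edges(edges, subnetwork)
    null = [count_edges(edges, rng.sample(all_nodes, k)) for _ in range(n_perm)]
    # Empirical p-value with a +1 correction (illustrative addition).
    p_value = (1 + sum(x >= observed for x in null)) / (n_perm + 1)
    return observed, null, p_value


toy_edges = [("a", "b"), ("b", "c"), ("a", "c"), ("d", "e")]
observed, null, p_value = permutation_test(toy_edges, "abcdef", ["a", "b", "c"], n_perm=200)
print(observed, p_value)
```

The same `null` list could be recomputed with `largest_component_edges` in place of `count_edges` to obtain the second distribution described in the text.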

4.7 Discussion

GWAS have been extremely successful in identifying thousands of associations in hundreds of complex traits. Due to the extensive statistical adjustments needed to avoid type 1 errors, type 2 errors are necessarily a consequence of GWAS, thus limiting their effectiveness. Furthermore, typically only a few markers are replicated in any given GWAS. Effective post-GWAS analysis methods can help prioritize associations using additional sources of evidence and are becoming a useful complementary strategy to the standard analytical pipeline.

Here we introduced a novel network-based pathway analysis strategy for GWAS, which integrates topological connectivity in a PPI network and the association signals from GWAS to detect significant sub-networks and also prioritize genes associated with a complex disease. The main feature of iPINBPA is the strategy we employed to identify enriched sub-networks by merging evidence from multiple sources. To our knowledge, this is the first method that integrates node weighting with a greedy search for significant sub-networks. Comparisons with different data sets and methods have demonstrated that our integrative approach dramatically improves the performance in predicting novel associations. The increase in prediction precision comes mostly from the fact that, unlike in the classical approach, potential associations with no biological relationships to statistically confirmed associations are down-weighted in this approach. The identified sub-networks presented here are centered on nodes with quite significant p-values, thus the overlap of these sub-networks lends additional support to our methods.

Given the number of candidate SNPs in GWAS data (hundreds of thousands to millions), it is common to see a low precision in prioritizing novel associations through network-based pathway analysis, for two major reasons: (1) substantial noise in the PPI network caused by the high false positive and false negative rates of PPI detection methods; (2) lack of annotation for the PPIs in the network. Two interacting proteins bind each other under certain conditions, and there are different types of PPI in terms of composition and duration. However, PPI networks in most studies are just static binary networks without annotation indicating whether or when the interactions occur under the studied conditions. Despite its low precision, the computational analysis is still useful to provide potential candidates, e.g., the genes in the top ranked sub-networks, for further experimental investigation. As future work, it would be helpful to use the knowledge integrated in iCTNet to assist the prediction of novel associations in iPINBPA.

Unlike dmGWAS, iPINBPA uses VEGAS to map SNPs to genes. For the analysis of GWAS data, the mapping of SNPs to genes is an open challenge. In this section, we have focused on the comparison of methodology and performance of different network-based analysis methods. We did not address potential variations emerging from using different strategies of mapping SNPs to genes; the default mapping recommended for each method was utilized.

Chapter 5

Integrative Biomarker Selection for Ovarian Cancer

5.1 Introduction

Cancer is a special category of complex disease, in which cells proliferate out of control, causing tissue damage and inflammation. Although tremendous efforts in the battle to conquer cancer have led to therapeutic and preventive breakthroughs, it is still challenging to discover clinical biomarkers for certain types of cancer, such as ovarian cancer and lung cancer. A biomarker is a biological molecule found in blood, other body fluids, or tissues that can be used to distinguish between normal and abnormal conditions [155]. The identification of effective biomarkers can speed up the diagnosis and treatment of cancer, but the complex nature of cancer makes it a long and difficult journey. In this chapter, we propose an integrative biomarker selection strategy comprising two computational analysis methods. The first scores the candidate proteins using aggregate ranking; the second detects the enriched sub-networks in the given quantitative profiles to find active modules for ovarian cancer.

5.2 Background

Ovarian cancer is the most lethal gynecologic malignancy among women, with 22,280 new cases and 15,460 deaths for 2012 in the United States [156]. The 5 year survival rate of advanced-stage patients is only about 30% [156], although the cure rate for stage I patients is usually greater than 90% [155]. A consensus has been reached that the high rate of deaths is primarily due to the lack of effective measures for early stage diagnosis. The present clinical biomarker to detect ovarian cancer is CA-125. The serum level of CA-125 has been effectively used to detect advanced-stage ovarian cancer, but it has low sensitivity and specificity for early stage detection.

Ascites fluid from ovarian cancer patients or conditioned medium from ovarian cancer cell lines contains the secretome of ovarian cancer cells, reflecting the process of ovarian cancer. In previous studies [157][158][159][160], many efforts have been devoted to discovering potential biomarkers from the secretome of ovarian cancer cells using proteomic profiling. Gunawardana et al. generated a list of 51 potential biomarkers after mining the conditioned media of four ovarian cancer cell lines (HTB75, TOV-112D, TOV-21G and RMUG-S) and cross-referencing with the ascites proteome [157]. Another study from the same group focused on filtering ascites of stage IV serous ovarian carcinoma patients, who had been treated previously with surgery plus chemotherapy, and published a list of 52 potential biomarkers [158]. A recent study reported a repertoire of 1129 tumor markers after filtering serum-free conditioned medium samples by proteomic analysis [159]. These three studies utilized solely proteomic profiles from mass spectrometry-based methods. The major limitation of proteomic analysis using mass spectrometry is its low sensitivity in identifying low abundance proteins [158].

Although the secretome of ovarian cancer cells provides a rich source for biomarker discovery, the underlying mechanisms of secretion are complex, and include various biological processes at different molecular levels, such as the mRNA transcript level and the protein level. For the purpose of improving our understanding of the biological mechanisms underlying a specific condition, it is very promising to apply comprehensive approaches which integrate the expression data sets from both the gene and protein levels [161][162]. DNA microarrays are a widely used platform to measure changes in mRNA transcript level, while mass spectrometry-based proteomics technologies are increasingly used to identify the abundance of large numbers of proteins. The pattern of gene expression reflects RNA transcription and degradation rates, while the protein expression levels are influenced by translational and post-translational mechanisms [161]. Expression at the gene and protein levels provides complementary information for biomarker discovery. An early study utilized proteomic profiles of ascites of four ovarian cancer patients and then integrated 59 microarray data sets and protein-protein interactions (PPI) to finalize a list of 80 putative biomarkers [160]. This is the first integrative analysis using both proteomic and transcriptomic profiles, plus PPIs, for ovarian cancer biomarker discovery.

In this chapter, we have collected the gene expression profiles and proteomic profiles from three mouse cell lines: B6, IC5 and IG10. B6 is a mouse embryonic stem cell used as a control cell line. IC5 and IG10 are both model cell lines for human ovarian cancer. In addition, secretomic profiles have been collected from the cancer cell lines. Next, the three different types of profiles are used to annotate the nodes in the PPI network, and an aggregate ranking method is applied to rank the candidate proteins. Furthermore, a sub-network detection method is applied to find the enriched modules, which enhance our discovery of potential biomarkers of ovarian cancer. This integrative computational analysis is useful to guide further experiment design.

5.3 Data

A cell line is a cell culture developed from a single cell that will proliferate indefinitely given appropriate medium and space. We obtained both transcriptomic and proteomic profiles for three mouse ovarian surface epithelial cell lines: B6, IC5 and IG10. B6 is a mouse embryonic stem cell. IC5 and IG10 are both model cell lines for human ovarian cancer, and IG10 cells are much more aggressive than IC5 cells. No secretion occurs in B6 cells, so the secretome profiles only include the IC5 and IG10 cell lines.

5.3.1 Proteomic Profiles

Proteomic profiling using mass spectrometry provides direct information about proteins differentially expressed in cancer cells and control cells. To date, identification of low abundance proteins is limited in mass spectrometry profiling, and consequently, most of the putative biomarkers are highly abundant acute-phase proteins [155]. The proteomic profiles we have used contain two features, fold change (fc) and false discovery rate (fdr), for each identified UniProt ID in both cancer cell lines, IC5 and IG10. The fc measures how much the protein expression changes between cancer and normal cells. The fdr is the proportion of positive tests that will be false; 1 − fdr is the proportion of positive tests that will be true. In statistics, a p-value threshold refers to the proportion of all tests that will be false positives, while fdr is the proportion of “significant” tests (e.g., p-value < 0.05) that will be false positives; the latter is clearly a smaller number. The fdr is an important correction for multiple tests. Here fdr values were calculated using a Bayesian analysis of mass spectrometry proteomics data. Because of the low sensitivity of mass spectrometry, it is difficult to set an appropriate threshold on fc to distinguish proteins with significant observed changes in cancer cells compared with control cells.

5.3.2 Secretomic Profiles

Secretomes from the ovarian cancer cells have been profiled using mass spectrometry. The normal cell line, B6, does not secrete, so the abundance of secreted proteins is measured only for the two cancer cell lines, IC5 and IG10, from the conditioned media in which the cells were cultured. The spectral count, which is the cumulative sum of peptide spectra matched to a given protein, has been used as the measure of the relative abundance of secreted proteins in the conditioned medium.

5.3.3 Transcriptomic Profiles

We obtained the Affymetrix microarray data sets for the B6 cell line with 4 replicates, the IC5 cell line with 2 replicates, and the IG10 cell line with 2 replicates. The identifiers in these transcriptomic profiles are probe set IDs. A probe set is a collection of probes designed to hybridize to a given sequence. A probe set ID, e.g., 12345_at, is used to refer to a probe set. Each transcriptomic profile contains 19,553 probe set IDs. These probe set IDs are mapped to 18,884 IDs.

5.3.4 Interologous PPI Network

The Interologous Interaction Database (I2D) [163][26] is a comprehensive online database of predicted interologous protein-protein interactions, containing 152,296 unique interactions of 12,818 mouse proteins. We collected these unique interactions of mouse proteins to build a mouse PPI network, which is used in the computational analysis proposed in this chapter.

5.3.5 Matching Profiles

Before starting the data analysis, we have to match the probe set IDs in the microarray data sets with the corresponding protein identifiers in the Universal Protein Resource (UniProt) (http://www.uniprot.org/) [164]. First, we mapped the probe set IDs to UniProt IDs directly. The relationship from probe set IDs to gene IDs is many-to-one, and the mapping from gene IDs to protein IDs is one-to-many, so the cross-mapping from probe set IDs to protein IDs is many-to-many. For each probe set, a mean expression value is calculated over the replicates under each condition, and the fold change is calculated as the inter-condition variance. For each protein, if multiple probe sets are mapped, the expression values of all associated probe sets are extracted, and then the fold change of the most significantly differentially expressed probe set (the probe set with the maximum inter-condition variance) is assigned to the protein. We have tested the mean, median and maximum methods for tackling the cross-mapping of multiple probe sets to a single protein. The maximum mapping is capable of catching the largest inter-condition variance between cancer and normal conditions. The cross-reference between transcriptomics and proteomics helps us integrate complementary information.
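The maximum mapping can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the fold change is taken as the log2 ratio of mean expression between conditions (the text does not fix the exact formula), the "maximum" is taken over the absolute fold change, and the probe set and protein identifiers are made up.

```python
from math import log2
from statistics import mean


def probe_fold_change(cancer_values, control_values):
    """log2 ratio of the replicate means (the log2 form is an assumption)."""
    return log2(mean(cancer_values) / mean(control_values))


def protein_fold_changes(probe_fc, probe_to_proteins):
    """Give each protein the fold change of largest magnitude among its probe sets.

    probe_fc: dict probe set ID -> fold change.
    probe_to_proteins: dict probe set ID -> list of protein IDs (many-to-many).
    """
    best = {}
    for probe, fc in probe_fc.items():
        for protein in probe_to_proteins.get(probe, ()):
            if protein not in best or abs(fc) > abs(best[protein]):
                best[protein] = fc
    return best


# Hypothetical identifiers, for illustration only.
probe_fc = {"100_at": 1.5, "101_at": -2.5, "102_at": 0.3}
probe_to_proteins = {"100_at": ["P1"], "101_at": ["P1", "P2"], "102_at": ["P2"]}
print(protein_fold_changes(probe_fc, probe_to_proteins))
print(probe_fold_change([8.0, 8.0], [2.0, 2.0, 2.0, 2.0]))
```

Replacing the comparison with a mean or median over each protein's probe sets would give the two alternative mappings mentioned in the text.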

5.4 Candidate Protein Selection

Before applying computational analysis, we pre-process the data by filtering candidate proteins from each profile. Fig. 5.1 shows our three different types of profiles, the features in each profile, and the filters we have applied. In the transcriptomic profiles, there are 18,884 proteins in each cancer cell line. Genes with |fc| > 2 are selected as differentially expressed genes (DEGs). There are 2,678 DEGs in IC5, 3,869 DEGs in IG10, and 4,192 unique DEGs in total (set A). In the proteomic profiles, there are 2,763 proteins in IC5, and 2,814 proteins in IG10. Proteins whose fdr < 0.1 are selected as significant cell proteins. Here fdr is the proportion of positive tests that will be false positives, so proteins with fdr < 0.1 have a less than 10 percent chance of being falsely identified. There are 451 significant cell proteins in IC5, 476 significant cell proteins in IG10 and 838 significant cell proteins in total (set C). In the secretomic profiles, there are 1,325 proteins in IC5 and 1,414 proteins in IG10. In previous studies of proteomic profiles of the secretome of ovarian cancer [157][158][159][160], only extracellular or plasma membranous proteins are considered as secreted proteins, and other proteins are removed or only considered as disease related proteins. We only keep the extracellular or membranous proteins, which leaves 360 proteins in IC5, 397 proteins in IG10, and 435 proteins in total (set S).

Figure 5.1: Data structure

For each cancer cell line, we have three different sets A, C and S. For each set, we examined the intersection and the union of the two cell lines to observe the differences between them. The distributions of proteins among these three sets for each cancer cell line, for the intersection and for the union of the two cell lines are shown in Fig. 5.2. From Fig. 5.2, it is clear that the number of proteins differentially expressed at the gene level is much greater than the number of proteins significantly expressed or secreted. In order not to miss any potential biomarkers, we pooled the proteins from sets A, C and S for the union of the two cancer cell lines (4,943 proteins in total, as shown at the bottom right of Fig. 5.2) as the starting list of candidates in our study.

Figure 5.2: Data distribution

In order to facilitate further analysis, we selected four different groups of proteins. First, it is straightforward to select proteins with observed expression changes at both the gene and protein levels that are also identified as secreted proteins. We call these ACS proteins, as they are the intersection of sets A, C and S. There are 43 ACS proteins among the 4,943 proteins in our starting list. Only 9 ACS proteins are found for the IC5 cell line and 26 ACS proteins for the IG10 cell line. Second, we selected proteins found only in the intersection of any two sets, but having interacting partners from the third set in the PPI network. Three different lists are defined to reflect different biological changes observed in our profiles.

1. AC S proteins are proteins showing changes in expression at both gene and protein levels and they are not secreted proteins, but have direct interactions with any secreted proteins. We found 119 proteins in this category.

2. AS C proteins are secreted proteins, differentially expressed at gene level, and have direct interactions with significant cell proteins. 77 AS C proteins are found in our list.

3. CS A proteins show significantly differential expression at protein level and are secreted from the cancer cells. CS A proteins are not differentially expressed at the gene level, but they have direct interactions with proteins differentially expressed at the gene level. 51 CS A proteins are found in our list.
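The four groups defined above reduce to set operations plus a neighbour check in the PPI network. The sketch below is a minimal Python illustration with made-up protein IDs; the function names are hypothetical, and it assumes that "found only in the intersection of any two sets" means the protein is absent from the third set.

```python
def has_partner_in(protein, partners, target):
    """True if `protein` has at least one direct PPI partner in `target`."""
    return any(q in target for q in partners.get(protein, ()))


def candidate_groups(A, C, S, partners):
    """Split candidates into the ACS, AC S, AS C and CS A groups.

    A, C, S: sets of protein IDs (DEGs, significant cell proteins, secreted proteins).
    partners: dict protein ID -> set of directly interacting protein IDs.
    """
    acs = A & C & S
    ac_s = {p for p in (A & C) - S if has_partner_in(p, partners, S)}
    as_c = {p for p in (A & S) - C if has_partner_in(p, partners, C)}
    cs_a = {p for p in (C & S) - A if has_partner_in(p, partners, A)}
    return {"ACS": acs, "AC S": ac_s, "AS C": as_c, "CS A": cs_a}


# Toy data, for illustration only.
A = {"p1", "p2", "p3"}
C = {"p1", "p2", "p4"}
S = {"p1", "p3", "p4", "p5"}
partners = {"p2": {"p5"}, "p3": {"p9"}, "p4": {"p3"}}
print(candidate_groups(A, C, S, partners))
```

In the toy data, p3 is in A and S but has no partner in C, so it falls into none of the four groups.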

5.5 Aggregate Ranking

The selection of candidate proteins in Section 5.4 categorizes candidate proteins by their observed changes in the profiles, but does not provide any quantitative measurement to prioritize the candidates in each group. We propose an aggregate ranking to prioritize protein candidates, taking into account the quantitative measurements in each profile. Typically, the number of candidates with differential changes in microarray data sets is much larger than the number of proteins with observed changes in proteomic profiles [161]. As the numbers of candidates in each profile are on different scales, it is difficult to avoid bias when assembling gene and protein expression profiles directly using a summation of the fold change or spectral count. Instead, we propose an aggregate ranking strategy, as shown in Fig. 5.3.

In this strategy, for each feature, e.g., fold change from a gene expression profile, we assign a single rank score to each protein in the original profile (without filtering). For a given feature $F_j$, we rank all the proteins in ascending order. If $r_{ij}$ is the rank position of protein $P_i$ in the ranked list and $L_j$ is the total length of the list, then a rank score $s_{ij} = r_{ij}/L_j$ is assigned to protein $P_i$ for the feature $F_j$. For the gene expression profile, we ranked the proteins on their fold change ratios. For the cell proteomic profile, we used (1 − fdr)fc to rank the changes of candidates. A protein with low fdr and high fc is given a high rank score. For the secretome profile, the averaged spectral count is used for ranking. Proteins with a missing value in a feature receive a rank score of 0 for that feature; all other rank scores lie in (0, 1]. In this method, we treated all features equally, thus the weight of each of the m features was set to 1/m. Consequently, the proteins that are up-regulated at gene level, over-expressed at protein level, and also secreted are promising candidates and are assigned high scores in the aggregate ranking.

Figure 5.3: Aggregate ranking strategy
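A minimal Python sketch of the aggregate ranking follows. It assumes that the list length for a feature counts only the proteins with an observed value, that ties are broken arbitrarily by sort order, and that the aggregate score is the equally weighted mean of the per-feature rank scores; the function names and the toy values are illustrative.

```python
def rank_scores(values):
    """Rank score per protein for one feature: rank / list length, ascending.

    values: dict protein -> numeric value, or None when the value is missing.
    Missing values get a rank score of 0.
    """
    observed = {p: v for p, v in values.items() if v is not None}
    ordered = sorted(observed, key=observed.get)
    length = len(ordered)  # assumption: length counts observed values only
    scores = {p: 0.0 for p in values}
    for rank, protein in enumerate(ordered, start=1):
        scores[protein] = rank / length
    return scores


def aggregate_scores(features):
    """Equal-weight (1/m) combination of rank scores over m features."""
    m = len(features)
    proteins = set().union(*features)
    per_feature = [rank_scores(f) for f in features]
    # A protein absent from a feature is treated as missing (score 0).
    return {p: sum(s.get(p, 0.0) for s in per_feature) / m for p in proteins}


# Toy feature values, for illustration only.
gene_fc = {"P1": 3.0, "P2": 1.0, "P3": 2.0}
cell_score = {"P1": 0.9, "P2": 0.1}            # e.g. the (1 - fdr)fc quantity
spectral_count = {"P1": 10, "P3": 5, "P2": None}
print(aggregate_scores([gene_fc, cell_score, spectral_count]))
```

Here P1 ranks highest in all three features and therefore receives the maximum aggregate score of 1.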

5.6 Sub-networks Detection

As the aggregate ranking does not explore functional connections among proteins, we also adapted the pathway analysis method introduced in Chapter 4 to search for enriched sub-networks. In this method, we use the original profiles, so the filters applied in Section 5.4 are removed. In order to clarify the data sources used for sub-network detection, Fig. 5.4 shows the candidates in each profile. In total, there are 6 profiles: A IC5, A IG10, C IC5, C IG10, S IC5 and S IG10. For each given profile including n candidates, we scored each candidate as $s_i = r_i/n$, in which $r_i$ is the ranking for protein $P_i$ in ascending order. The network score is defined as:

$$S_A = \frac{\sum_{i \in A} s_i}{\sqrt{n}} \tag{5.1}$$

We also adjusted the network score using random permutation as:

$$S'_A = \frac{S_A - \mu_k}{\sigma_k} \tag{5.2}$$

The greedy searching algorithm in Section 4.4.3 has been applied to search for sub-networks for each profile.

Figure 5.4: Data sources used for sub-network detection
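The scoring in Equations 5.1 and 5.2 can be illustrated with a short Python sketch. It reads Equation 5.1 literally, with n the number of candidates in the profile, and assumes that the mean and standard deviation in Equation 5.2 are estimated from random node sets of the same size k as the sub-network; that reading, the function names and the toy scores are assumptions, and the greedy search of Section 4.4.3 is not reproduced here.

```python
import random
from math import sqrt
from statistics import mean, pstdev


def network_score(scores, members):
    """Sum of member scores divided by sqrt(n), n = profile size (Equation 5.1)."""
    return sum(scores[v] for v in members) / sqrt(len(scores))


def adjusted_score(scores, members, n_perm=1000, seed=0):
    """Permutation-adjusted score (Equation 5.2).

    The null mean and standard deviation are estimated from n_perm random node
    sets of size k = len(members); this sampling scheme is an assumption.
    The null is degenerate (zero spread) if k equals the profile size.
    """
    rng = random.Random(seed)
    nodes = sorted(scores)
    k = len(members)
    null = [network_score(scores, rng.sample(nodes, k)) for _ in range(n_perm)]
    mu_k, sigma_k = mean(null), pstdev(null)
    return (network_score(scores, members) - mu_k) / sigma_k


# Toy profile of 10 candidates with rank scores s_i = r_i / n.
toy_scores = {f"P{i}": i / 10 for i in range(1, 11)}
top_three = ["P8", "P9", "P10"]
print(network_score(toy_scores, top_three))
print(adjusted_score(toy_scores, top_three, n_perm=200))
```

A sub-network made of the three highest-ranked candidates scores well above random sets of three nodes, so its adjusted score is positive.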

5.7 Results

To better understand the candidate proteins, Fig. 5.5 plots pie charts showing the distribution of GO cellular components in each set (A, C and S). In set A, 1,467 proteins are cytoplasm proteins. The majority of set C are cytoplasm proteins too. The proteins in set S are extracellular or membrane proteins. These pie charts show the difference between cell proteins and secreted proteins in terms of cellular components.

Figure 5.5: GO cellular component distribution in three sets

In order to compare our findings with others, we downloaded 156 putative biomarkers from previous proteomic studies [157][158][160], and 135 putative biomarkers from a literature review [155]. In total, we collected 253 unique putative biomarkers for ovarian cancer. Note: gene symbols starting with a capital letter followed by lower-case letters are mouse genes; gene symbols with all letters in upper-case are human genes.

There are 8 putative biomarkers among our 43 ACS proteins. Information on these 43 proteins is listed in Appendix C (ACS proteins are sorted by their rank scores in descending order). Among the 43 ACS proteins, Alpha-enolase (ENO1) is the only one shared by both cancer cell lines. Alpha-enolase is a multifunctional enzyme, and plays a role in various processes such as glycolysis, growth control, hypoxia tolerance and allergic responses [165]. A recent study [166] investigated ENO1 in connection with brain metastases of gynaecological malignancies (cervical cancer, endometrial cancer and ovarian cancer).

Among the 43 ACS proteins, there are 21 proteins with aggregate score > 0.8 for IC5, and 36 proteins with score > 0.8 for IG10. Only 6 ACS proteins pass the cutoff score > 0.8 for both IC5 and IG10. These 6 proteins are C3, Eno1, Lgals3, Lgals7, Clic4 and Msln. C3 is an extracellular protein and a putative ovarian cancer biomarker reported in [155]. C3 is an inflammatory mediator involved in complement and coagulation cascades (KEGG pathway). Eno1 is involved in the KEGG pathways RNA degradation and Glycolysis/Gluconeogenesis. Galectin-3 (LGALS3) is a β-galactoside-binding lectin involved in regulating cell growth, angiogenesis, and tumor progression [167]. Mesothelin (MSLN) is an antigen that is being investigated as a potential biomarker for [168], and has also been reported as a potential biomarker for ovarian cancer [155]. In addition, among the 43 ACS proteins, Cfl1, Egfr, Myh9 and Krt8 are proteins associated with cell morphogenesis; Egfr, Krt8 and Hspg2 are proteins associated with embryonic organ development; Capg, Cfl1, Flnb, Gsn, Msn, Myh9 and Anxa2 are proteins associated with cytoskeletal protein binding. Gpi, Iqgap1, Fam3c, Hspg2, Dsg2 and Tf are putative secreted biomarkers for ovarian cancer.

For further investigation, a network has been constructed by adding all direct interactors of these 43 proteins and all related interactions from the I2D database (shown in Fig. 5.6). We call this network the ACS network. It contains 1,236 nodes and 19,061 edges. In Fig. 5.6, the 43 ACS proteins are represented by large squares, and the 293 shared direct neighbors of these 43 proteins are represented by diamonds. Edges connecting these 43 proteins are shown in red. Solid edges represent protein-protein interactions in human or mouse. Dashed edges represent the PPIs from other species. Putative biomarkers from previous studies are highlighted in red. The node transparency is proportional to the ranking score. Proteins with mouse ranking score > 0.8 are labelled with their gene names too. The node colors represent different GO functions listed in the legend. This network is visualized in NAViGaTOR [69].

Figure 5.6: Network of 43 ACS proteins

5.7.1 Enriched Subnetworks

As I2D contains both experimental and inferred PPIs, the PPI network of I2D is quite dense, and consequently a large number of sub-networks have been detected. Fig. 5.7 shows the sizes of the top 1% scored sub-networks from each profile. The intersection of selected sub-networks for all six profiles is used to build the final enriched sub- network, which contains 71 nodes and 269 edges as shown in Fig. 5.8. Among these 71 proteins, 7 proteins are ACS proteins, shown in red colour, 20 AC S proteins, shown in blue colour, 2 AS C proteins, shown in yellow colour and 2 CS A proteins shown in green colour. Edges among these labelled proteins are coloured in red. This network provides complementary information to our selected protein lists discussed in Section 5.4.

5.7.2 Enriched KEGG Pathways

Additionally, we have also used the online DAVID tools [136] for pathway enrichment analysis with the default settings. In order to capture more useful pathway information, we obtained the 21 enriched KEGG pathways [33] for the union of the ACS, AC S, AS C and CS A protein lists. For each enriched pathway, we generated a sub-network by adding PPIs between the involved proteins, shown in Fig. 5.9. Large connected components are clearly visible in the following pathways: (1) Adherens junction, (2) ECM-receptor interaction, (3) Focal adhesion, (4) Regulation of actin cytoskeleton, (5) Leukocyte transendothelial migration, (6) Pathways in cancer and (7) Propanoate metabolism. Adherens junction and focal adhesion play roles in cell motility and proliferation; ECM-receptor interaction plays an important role in tissue and organ morphogenesis. Regulation of actin cytoskeleton plays a role in cell motility; Leukocyte transendothelial migration is vital for immune surveillance and inflammation; and Pathways in cancer shows the map of known cancer genes.

Figure 5.7: Sub-network detection results

5.8 Conclusion

We proposed an integrative biomarker selection strategy combining complementary proteomic and transcriptomic profiles, plus a large PPI network from I2D, to select potential biomarkers for ovarian cancer. The integration of the gene and protein expression profiles and the protein-protein interaction network involves several mapping issues from probe set IDs to gene IDs, and from gene IDs to protein IDs, as it is common that multiple probe set IDs share the same gene symbol, and the same gene can be translated into multiple proteins. In order to improve the prediction accuracy, we proposed the aggregate ranking method as the primary analysis to score the candidate proteins, and a sub-network detection method as the secondary analysis to find the enriched functional modules for ovarian cancer.

Figure 5.8: Detected sub-network including 71 nodes and 269 edges

Figure 5.9: 21 enriched KEGG pathways

The results shown in this chapter need further biological interpretation and validation. Nevertheless, the computational analysis proposed in this chapter demonstrates a framework to integrate and analyze proteomic, secretomic and transcriptomic profiles together with the PPI network.

Chapter 6

Summary and Future Work

The increasing availability of a variety of ‘omics’ data in public repositories presents an opportunity for building a comprehensive and systematic roadmap of disease pathologies by integrating heterogeneous biological data in terms of networks. However, the integration of data from multiple sources is complex and challenging, given the lack of unified identifiers and the demand for interdisciplinary knowledge. Existing approaches limit their exploration to either bipartite networks or up to three different types of interactions. In this thesis, we presented an overview of biological data sources available to build integrative networks, and also surveyed the network-based computational methods proposed in the literature. By taking advantage of the public data sources, we designed and developed a unique application, iCTNet, which can be used to construct the most comprehensive meta-networks with up to six types of nodes and nine types of edges. Furthermore, we developed two integrative network-based computational methods: (1) iPINBPA for pathway analysis and (2) a biomarker selection strategy that integrates three different profiles of ovarian cancer with a PPI network.

In Chapter 2, we reviewed a variety of biological data in public repositories and computational methods in two categories: (1) gene prioritization and (2) sub-network detection. The construction of biological networks is data-driven. In other words, the size and quality of the network are determined by the data collected from the online repositories.

In Chapter 3, we presented a comprehensive database and Cytoscape app that offer a platform to easily integrate heterogeneous biological interactions and to construct and visualize meta-networks. To the best of our knowledge, iCTNet constitutes the biggest effort to integrate multiple types of biological interactions as meta-networks, thus enabling systematic analysis of human complex traits.

In Chapter 4, we presented the integrative protein-interaction based pathway analysis (iPINBPA), a novel network-based pathway analysis method. This approach integrates topological connectivity among genes in network space and the association signals from the input GWAS data to extensively search for sub-networks enriched in significant GWAS signals. Results from two independent datasets have demonstrated that our integrative approach dramatically improves the performance in predicting novel associations. The strategy we proposed in this study is generic and can be readily applied to any disease or biological dataset, e.g., gene expression datasets and proteomic data, as long as quantitative gene-wise or protein-wise statistical measures and putative disease genes are available.

In Chapter 5, we introduced an integrative strategy including aggregate ranking and sub-network detection for biomarker discovery. This strategy integrates transcriptomic, proteomic, and secretomic profiles with a PPI network, to rank candidate proteins using aggregate ranking and detect enriched sub-networks using heuristic search. The aggregate ranking merges quantitative measurements from different profiles into a single score to help the selection of potential markers, but it does not offer functional connections among the candidate proteins to better understand the genetic map of ovarian cancer, so the heuristic method presented in Chapter 4 is adapted to detect enriched sub-networks for ovarian cancer as a secondary analysis.

Through the design and development of our iCTNet database, the iCTNet and iPINBPA applications, and the other network-based computational methods proposed in this thesis, we have built a pipeline to analyze heterogeneous biological data in terms of networks to better understand the molecular basis of human diseases.

6.1 Future Directions

We have demonstrated promising approaches for integrating large-scale biological interactions and for analyzing these connections using network analysis. The greatest challenge of any computational analysis of biological data is interpreting the results, which is especially important in human disease research. Computational tools are designed to provide integrated knowledge sources and analysis methods that aid interpretation and help explain the molecular basis of human diseases. We have developed the iCTNet database and Cytoscape app to construct systematic maps of human diseases. Currently, we are testing our data and tool on autoimmune disorders, and we will collaborate on further complex diseases where possible, to improve our understanding of the underlying mechanisms of complex diseases.

An inherent limitation of all approaches using biological networks is that interactions have only been described for a subset of all known nodes, e.g., proteins. Furthermore, if only high confidence interactions are taken into account in the PPI network, only approximately half of all proteins are represented. This necessarily places an upper bound on the number of successful predictions any of these methods can make. With new and more accurate techniques to determine biological interactions, this limitation may be overcome in the near future.

Another potential restriction of these methods is that they use global interactions, when tissue-specific interactions might actually be more appropriate. Several efforts are currently underway to develop tissue-specific protein interactions that, together with knowledge about the organ/tissue compromised in a given disease, could be incorporated into network analysis of diseases in the future. Furthermore, with the incorporation of additional information, e.g., the Encyclopedia of DNA Elements (ENCODE) [169] and the Epigenomics Roadmap (http://www.roadmapepigenomics.org), it will be possible to derive cell-specific networks. This will greatly enhance the performance of our computational approaches, as it will enable the incorporation of pathophysiologically relevant and disease-specific data.
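As an illustration of the two limitations discussed in Section 6.1, the sketch below derives a tissue-specific network from a global PPI network and measures how many proteins a network covers. It assumes one simple rule (keep an interaction only when both partners are expressed in the tissue), which is one possible approach and not a method prescribed in this thesis. The function names and protein labels are hypothetical.

```python
def tissue_specific_network(ppi_edges, expressed_in_tissue):
    """Keep only interactions whose two partners are both expressed.

    Assumption: co-expression of both partners in the tissue is used as
    the filter; other definitions of tissue specificity are possible.
    """
    expressed = set(expressed_in_tissue)
    return [(u, v) for u, v in ppi_edges if u in expressed and v in expressed]


def coverage(ppi_edges, all_proteins):
    """Fraction of the given proteins that appear in at least one interaction."""
    universe = set(all_proteins)
    if not universe:
        return 0.0
    covered = {protein for edge in ppi_edges for protein in edge}
    return len(covered & universe) / len(universe)


# Toy example with made-up protein labels.
toy_ppi = [("A", "B"), ("B", "C"), ("C", "D")]
toy_tissue = tissue_specific_network(toy_ppi, {"A", "B", "C"})
print(toy_tissue, coverage(toy_tissue, ["A", "B", "C", "D", "E", "F"]))
```

The coverage value makes the upper bound explicit: proteins without any retained interaction cannot be reached by a network-based prediction method.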

Bibliography

[1] S. Swamidass, C. Azencott, K. Daily, and P. Baldi, “A CROC stronger than ROC: measuring, visualizing and optimizing early retrieval,” Bioinformatics, vol. 26, no. 10, pp. 1348–1356, 2010.

[2] T. I. H. Consortium, “The International HapMap Project,” Nature, vol. 426, no. 6968, pp. 789–796, 2003.

[3] R. Macarron, M. Banks, D. Bojanic, D. Burns, D. Cirovic, T. Garyantes, D. S. Green, R. Hertzberg, W. Janzen, J. Paslay, U. Schopfer, and G. Sittampalam, “Impact of high-throughput screening in biomedical research,” Nat Rev Drug Discov, vol. 10, no. 3, pp. 188–95, 2011.

[4] A. Barabási and Z. Oltvai, “Network biology: understanding the cell’s functional organization,” Nature Reviews Genetics, vol. 5, no. 2, pp. 101–113, 2004.

[5] K. Mitra, A. Carvunis, S. Ramesh, and T. Ideker, “Integrative approaches for finding modular structure in biological networks,” Nature Reviews Genetics, vol. 14, no. 10, pp. 719–732, 2013.

[6] E. Schadt, M. Linderman, J. Sorenson, L. Lee, and G. Nolan, “Computational solutions to large-scale data management and analysis,” Nature Reviews Genetics, vol. 11, no. 9, pp. 647–657, 2010.

[7] D. Gomez-Cabrero, I. Abugessaisa, D. Maier, A. Teschendorff, M. Merkenschlager, A. Gisel, E. Ballestar, E. Bongcam-Rudloff, A. Conesa, and J. Tegnér, “Data integration in the era of omics: current and future challenges,” BMC Syst Biol, vol. 8 Suppl 2, p. I1, 2014.

[8] P. Shannon, A. Markiel, O. Ozier, N. S. Baliga, J. T. Wang, D. Ramage, N. Amin, B. Schwikowski, and T. Ideker, “Cytoscape: a software environment for integrated models of biomolecular interaction networks,” Genome research, vol. 13, no. 11, pp. 2498–2504, 2003.

[9] O. Vanunu, O. Magger, E. Ruppin, T. Shlomi, and R. Sharan, “Associating genes and protein complexes with disease via network propagation,” PLoS computational biology, vol. 6, no. 1, p. e1000641, 2010.

[10] R. Albert and A. Barabási, “Statistical mechanics of complex networks,” Reviews of Modern Physics, vol. 74, no. 1, pp. 47–97, 2002.

[11] L. Costa, F. A. Rodrigues, G. Travieso, and P. R. V. Boas, “Characterization of complex networks: A survey of measurements,” Advances in Physics, vol. 56, no. 1, pp. 167–242, 2007.

[12] J. Pandey, M. Koyuturk, and A. Grama, “Functional characterization and topological modularity of molecular interaction networks,” BMC Bioinformatics, vol. 11, no. Suppl 1, p. S35, 2010.

[13] S. Yook, Z. N. Oltvai, and A. L. Barabási, “Functional and topological characterization of protein interaction networks,” Proteomics, vol. 4, no. 4, pp. 928–942, 2004.

[14] S. Fortunato, “Community detection in graphs,” Physics Reports, vol. 486, pp. 75–174, 2010.

[15] Z. Oltvai and A. Barabási, “Life’s complexity pyramid,” Science, vol. 298, no. 5594, pp. 763–764, 2002.

[16] E. Ravasz and A. Barabási, “Hierarchical organization in complex networks,” Physical Review E, vol. 67, no. 2, p. 026112, 2003.

[17] E. Ravasz, A. Somera, D. Mongru, Z. Oltvai, and A. Barabási, “Hierarchical organization of modularity in metabolic networks,” Science, vol. 297, no. 5586, pp. 1551–1555, 2002.

[18] E. Sayers, T. Barrett, D. Benson, E. Bolton, S. Bryant, K. Canese, V. Chetvernin, D. Church, M. Dicuccio, S. Federhen, M. Feolo, L. Geer, W. Helmberg, Y. Kapustin, D. Landsman, D. J. Lipman, Z. Lu, T. Madden, T. Madej, D. Maglott, A. Marchler-Bauer, V. Miller, I. Mizrachi, J. Ostell, A. Panchenko, K. Pruitt, G. Schuler, E. Sequeira, S. Sherry, M. Shumway, K. Sirotkin, D. Slotta, A. Souvorov, G. Starchenko, T. Tatusova, L. Wagner, Y. Wang, W. Wilbur, E. Yaschenko, and J. Ye, “Database resources of the national center for biotechnology information,” Nucleic acids research, vol. 38, pp. D5–16, 2010.

[19] X. Fernández-Suárez, D. Rigden, and M. Galperin, “The 2014 Nucleic Acids Research Database Issue and an updated NAR online Molecular Biology Database Collection,” Nucleic Acids Res, vol. 42, pp. D1–6, 2014.

[20] C. Alfarano, C. Andrade, K. Anthony, N. Bahroos, M. Bajec, K. Bantoft, D. Betel, B. Bobechko, K. Boutilier, E. Burgess, et al., “The biomolecular interaction network database and related tools 2005 update,” Nucleic acids research, vol. 33, no. suppl 1, pp. D418–D424, 2005.

[21] A. Ruepp, B. Waegele, M. Lechner, B. Brauner, I. Dunger-Kaltenbach, G. Fobo, G. Frishman, C. Montrone, and H. Mewes, “CORUM: the comprehensive resource of mammalian protein complexes–2009,” Nucleic acids research, vol. 38, pp. D497–501, 2010.

[22] L. Salwinski, C. Miller, A. Smith, F. Pettit, J. Bowie, and D. Eisenberg, “The Database of Interacting Proteins: 2004 update,” Nucleic acids research, vol. 32, pp. D449–D451, 2004.

[23] T. Prasad, R. Goel, K. Kandasamy, S. Keerthikumar, S. Kumar, S. Mathivanan, D. Telikicherla, R. Raju, B. Shafreen, A. Venugopal, L. Balakrishnan, A. Marimuthu, S. Banerjee, D. Somanathan, A. Sebastian, S. Rani, S. Ray, C. Kishore, S. Kanth, M.A., M. Kashyap, R. Mohmood, Y. L. Ramachandra, V. Krishna, B. Rahiman, S. Mohan, P. Ranganathan, S. Ramabadran, R. Chaerkady, and A. Pandey, “Human Protein Reference Database-2009 update,” Nucleic acids research, vol. 37, pp. D767–D772, 2009.

[24] S. Kerrien, B. Aranda, L. Breuza, A. Bridge, F. Broackes-Carter, C. Chen, M. Duesbury, M. Dumousseau, M. Feuermann, U. Hinz, C. Jandrasits, R. Jimenez, J. Khadake, U. Mahadevan, P. Masson, I. Pedruzzi, E. Pfeiffenberger, P. Porras, A. Raghunath, B. Roechert, S. Orchard, and H. Hermjakob, “The intact molecular interaction database in 2012,” Nucleic Acids Research, vol. 40, no. D1, pp. D841–D846, 2012.

[25] L. Licata, L. Briganti, D. Peluso, L. Perfetto, M. Iannuccelli, E. Galeota, F. Sacco, A. Palma, A. Nardozza, E. Santonico, L. Castagnoli, and G. Cesareni, “MINT, the molecular interaction database: 2012 update,” Nucleic Acids Research, vol. 40, no. D1, pp. D857–D861, 2012.

[26] K. Brown and I. Jurisica, “Unequal evolutionary conservation of human protein interactions in interologous networks,” Genome biology, vol. 8, no. 5, p. R95, 2007.

[27] S. Razick, G. Magklaras, and I. Donaldson, “iRefIndex: a consolidated protein interaction database with provenance,” BMC bioinformatics, vol. 9, no. 1, 2008.

[28] C. Stark, B. Breitkreutz, A. Chatr-Aryamontri, L. Boucher, R. Oughtred, M. Livstone, J. Nixon, K. V. Auken, X. Wang, X. Shi, et al., “The BioGRID interaction database: 2011 update,” Nucleic acids research, vol. 39, no. suppl 1, pp. D698–D704, 2011.

[29] H. Mewes, A. Ruepp, F. Theis, T. Rattei, M. Walter, D. Frishman, K. Suhre, M. Spannagl, K. Mayer, V. Stümpflen, and A. Antonov, “MIPS: curated databases and comprehensive secondary data resources in 2010,” Nucleic acids research, vol. 39, pp. D220–D224, 2011.

[30] D. Szklarczyk, A. Franceschini, M. Kuhn, M. Simonovic, A. Roth, P. Minguez, T. Doerks, M. Stark, J. Muller, P. Bork, et al., “The STRING database in 2011: functional interaction networks of proteins, globally integrated and scored,” Nucleic acids research, vol. 39, no. suppl 1, pp. D561–D568, 2011.

[31] Y. Hwang, C. Lin, J. Chang, H. Mori, H. Juan, and H. Huang, “Predicting essential genes based on network and sequence analysis,” Mol. BioSyst., vol. 5, no. 12, pp. 1672–1678, 2009.

[32] M. Vidal, M. Cusick, and A. Barabási, “Interactome networks and human disease,” Cell, vol. 144, no. 6, pp. 986–998, 2011.

[33] M. Kanehisa, S. Goto, Y. Sato, M. Furumichi, and M. Tanabe, “KEGG for integration and interpretation of large-scale molecular data sets,” Nucleic acids research, vol. 40, pp. D109–D114, 2012.

[34] E. Cerami, B. Gross, E. Demir, I. Rodchenkov, Ö. Babur, N. Anwar, N. Schultz, G. Bader, and C. Sander, “Pathway Commons, a web resource for biological pathway data,” Nucleic Acids Research, vol. 39, no. suppl 1, pp. D685–D690, 2011.

[35] T. Kelder, M. van Iersel, K. Hanspers, M. Kutmon, B. Conklin, C. Evelo, and A. Pico, “WikiPathways: building research communities on biological pathways,” Nucleic acids research, vol. 40, pp. D1301–D1307, 2012.

[36] D. Croft, A. Mundo, R. Haw, M. Milacic, J. Weiser, G. Wu, M. Caudy, P. Garapati, M. Gillespie, M. Kamdar, et al., “The reactome pathway knowledgebase,” Nucleic acids research, vol. 42, pp. D472–D477, 2014.

[37] D. Lee, J. Park, K. Kay, N. Christakis, Z. Oltvai, and A. Barabási, “The implications of human metabolic network topology for disease comorbidity,” Proceedings of the National Academy of Sciences, vol. 105, no. 29, pp. 9880–9885, 2008.

[38] J. Amberger, C. Bocchini, A. Scott, and A. Hamosh, “McKusick’s Online Mendelian Inheritance in Man (OMIM®),” Nucleic acids research, vol. 37, no. suppl 1, pp. D793–D796, 2009.

[39] K. Goh, M. Cusick, D. Valle, B. Childs, M. Vidal, and A. Barabási, “The human disease network,” Proceedings of the National Academy of Sciences of the United States of America, vol. 104, no. 21, pp. 8685–8690, 2007.

[40] D. Welter, J. MacArthur, J. Morales, T. Burdett, P. Hall, H. Junkins, A. Klemm, P. Flicek, T. Manolio, L. Hindorff, et al., “The NHGRI GWAS Catalog, a curated resource of snp-trait associations,” Nucleic acids research, vol. 42, no. D1, pp. D1001–D1006, 2014.

[41] K. Becker, K. Barnes, T. Bright, and S. Wang, “The genetic association database,” Nature Genetics, vol. 36, no. 5, pp. 431–432, 2004.

[42] G. Thorisson, O. Lancaster, R. Free, R. Hastings, P. Sarmah, D. Dash, S. Brahmachari, and A. J. Brookes, “HGVbaseG2P: a central genetic association database,” Nucleic acids research, vol. 37, pp. D797–D802, 2009.

[43] O. Taboureau, S. Nielsen, K. Audouze, N. Weinhold, D. Edsgärd, F. Roque, I.K., A. Bora, R. Curpan, T. Jensen, S. Brunak, and T. Oprea, “ChemProt: a disease chemical biology database,” Nucleic acids research, vol. 39, pp. D367–D372, 2011.

[44] V. Law, C. Knox, Y. Djoumbou, T. Jewison, A. C. Guo, Y. Liu, A. Maciejewski, D. Arndt, M. Wilson, V. Neveu, et al., “DrugBank 4.0: shedding new light on drug metabolism,” Nucleic acids research, vol. 42, no. D1, pp. D1091–D1097, 2014.

[45] A. Davis, C. Murphy, R. Johnson, J. Lay, K. Lennon-Hopkins, C. Saraceni-Richards, D. Sciaky, B. King, M. Rosenstein, T. Wiegers, and C. Mattingly, “The Comparative Toxicogenomics Database: update 2013,” Nucleic Acids Res, vol. 41, pp. D1104–D1114, 2013.

[46] F. Zhu, Z. Shi, C. Qin, L. Tao, X. Liu, F. Xu, L. Zhang, Y. Song, X. Liu, J. Zhang, B. Han, P. Zhang, and Y. Chen, “Therapeutic target database update 2012: a resource for facilitating target-oriented drug discovery,” Nucleic acids research, vol. 40, pp. D1128–D1136, 2011.

[47] S. Kjærulff, L. Wich, J. Kringelum, U. Jacobsen, I. Kouskoumvekaki, K. Audouze, O. Lund, S. Brunak, T. I. Oprea, and O. Taboureau, “ChemProt-2.0: visual navigation in a disease chemical biology database,” Nucleic acids research, vol. 41, no. D1, pp. D464–D469, 2013.

[48] A. Pawson, J. Sharman, H. Benson, E. Faccenda, S. Alexander, O. P. Buneman, A. Davenport, J. McGrath, J. Peters, C. Southan, M. Spedding, W. Yu, A. Harmar, and NC-IUPHAR, “The IUPHAR/BPS Guide to PHARMACOLOGY: an expert-driven knowledgebase of drug targets and their ligands,” Nucleic Acids Res, vol. 42, pp. D1098–106, 2014.

[49] K. Fortney, W. Xie, M. Kotlyar, J. Griesman, Y. Kotseruba, and I. Jurisica, “NetwoRx: connecting drugs to networks and phenotypes in saccharomyces cerevisiae,” Nucleic Acids Res, vol. 41, pp. D720–7, 2013.

[50] M. Lukk, M. Kapushesky, J. Nikkila, H. Parkinson, A. Goncalves, W. Huber, E. Ukkonen, and A. Brazma, “A global map of human gene expression,” Nat Biotech, vol. 28, no. 4, pp. 322–324, 2010.

[51] H. Parkinson, U. Sarkans, N. Kolesnikov, N. Abeygunawardena, T. Burdett, M. Dylag, I. Emam, A. Farne, E. Hastings, E. Holloway, N. Kurbatova, M. Lukk, J. Malone, R. Mani, E. Pilicheva, G. Rustici, A. Sharma, E. Williams, T. Adamusiak, M. Brandizi, N. Sklyar, and A. Brazma, “ArrayExpress update - an archive of microarray and high-throughput sequencing-based functional genomics experiments,” Nucleic Acids Research, vol. 39, no. suppl 1, pp. D1002–D1004, 2011.

[52] A. Culhane, M. Schröder, R. Sultana, S. Picard, E. Martinelli, C. Kelly, B. Haibe-Kains, M. Kapushesky, A. S. Pierre, W. Flahive, K. Picard, D. Gusenleitner, G. Papenhausen, N. O’Connor, M. Correll, and J. Quackenbush, “GeneSigDB: a manually curated database and resource for analysis of gene expression signatures,” Nucleic Acids Research, vol. 40, pp. D1060–D1066, 2012.

[53] T. Barrett, S. Wilhite, P. Ledoux, C. Evangelista, I. Kim, M. Tomashevsky, K. Marshall, K. H. Phillippy, P. Sherman, M. Holko, et al., “NCBI GEO: archive for functional genomics data sets—update,” Nucleic acids research, vol. 41, no. D1, pp. D991–D995, 2013.

[54] D. Rhodes, S. Kalyana-Sundaram, V. Mahavisno, R. Varambally, J. Yu, B. Briggs, T. Barrette, M. Anstet, C. Kincead-Beal, P. Kulkarni, S. Varambally, D. Ghosh, and A. Chinnaiyan, “Oncomine 3.0: genes, pathways, and networks in a collection of 18,000 cancer gene expression profiles,” Neoplasia, vol. 9, no. 2, pp. 166–180, 2007.

[55] A. Bossi and B. Lehner, “Tissue specificity and the human protein interaction network,” Molecular Systems Biology, vol. 5, no. 1, 2009.

[56] Z. Dezso, Y. Nikolsky, E. Sviridov, W. Shi, T. Serebriyskaya, D. Dosymbekov, A. Bugrim, E. Rakhmatulin, R. Brennan, A. Guryanov, K. Li, J. Blake, R. Samaha, and T. Nikolskaya, “A comprehensive functional analysis of tissue specificity of human gene expression,” BMC Biology, vol. 6, no. 1, p. 49, 2008.

[57] L. Hsiao, F. Dangond, T. Yoshida, R. Hong, R. Jensen, J. Misra, W. Dillon, K. Lee, K. Clark, P. Haverty, Z. Weng, G. L. Mutter, M. P. Frosch, M. E. MacDonald, E. L. Milford, C. P. Crum, R. Bueno, R. E. Pratt, M. Mahadevappa, J. A. Warrington, G. Stephanopoulos, G. Stephanopoulos, and S. R. Gullans, “A compendium of gene expression in normal human tissues,” Physiological genomics, vol. 7, no. 2, pp. 97–104, 2001.

[58] F. Ponten, M. Gry, L. Fagerberg, E. Lundberg, A. Asplund, L. Berglund, P. Oksvold, E. Bjorling, S. Hober, C. Kampf, S. Navani, P. Nilsson, J. Ottosson, A. Persson, H. Wernerus, K. Wester, and M. Uhlen, “A global view of protein expression in human cells, tissues, and organs,” Molecular Systems Biology, vol. 5, no. 1, 2009.

[59] A. Dowsey and G. Yang, “The future of large-scale collaborative proteomics,” PROCEEDINGS OF THE IEEE, vol. 96, no. 8, pp. 1292–1309, 2008.

[60] M. Uhlén, L. Fagerberg, B. Hallström, C. Lindskog, P. Oksvold, A. Mardinoglu, Å. Sivertsson, C. Kampf, E. Sjöstedt, A. Asplund, I. Olsson, K. Edlund, E. Lundberg, S. Navani, C. Szigyarto, J. Odeberg, D. Djureinovic, J. Takanen, S. Hober, T. Alm, P. Edqvist, H. Berling, H. Tegel, J. Mulder, J. Rockberg, P. Nilsson, J. Schwenk, M. Hamsten, K. von Feilitzen, M. Forsberg, L. Persson, F. Johansson, M. Zwahlen, G. von Heijne, J. Nielsen, and F. Pontén, “Tissue-based map of the human proteome,” Science, vol. 347, no. 6220, p. 1260419, 2015.

[61] Q. Jiang, Y. Wang, Y. Hao, L. Juan, M. Teng, X. Zhang, M. Li, G. Wang, and Y. Liu, “miR2Disease: a manually curated database for microrna deregulation in human disease,” Nucleic Acids Research, vol. 37, no. suppl 1, pp. D98–D104, 2009.

[62] A. Kozomara and S. Griffiths-Jones, “mirbase: annotating high confidence micrornas using deep sequencing data,” Nucleic Acids Res, vol. 42, pp. D68–D73, 2014.

[63] T. Vergoulis, I. Vlachos, P. Alexiou, G. Georgakilas, M. Maragkakis, M. Reczko, S. Gerangelos, N. Koziris, T. Dalamagas, and A. Hatzigeorgiou, “TarBase 6.0: capturing the exponential growth of mirna targets with experimental support,” Nucleic acids research, vol. 40, no. D1, pp. D222–D229, 2012.

[64] S. Hsu, Y. Tseng, S. Shrestha, Y. Lin, A. Khaleel, C. Chou, C. Chu, H. Huang, C. Lin, S. Ho, et al., “miRTarBase update 2014: an information resource for experimentally validated mirna-target interactions,” Nucleic acids research, vol. 42, no. D1, pp. D78–D85, 2014.

[65] M. Maragkakis, M. Reczko, V. Simossis, P. Alexiou, G. Papadopoulos, T. Dalamagas, G. Giannopoulos, G. Goumas, E. Koukis, K. Kourtis, T. Vergoulis, N. Koziris, T. Sellis, P. Tsanakas, and A. Hatzigeorgiou, “DIANA-microT web server: elucidating microrna functions through target prediction,” Nucleic acids research, vol. 37, no. Web Server issue, pp. W273–276, 2009.

[66] A. Krek, D. Grun, M. Poy, R. Wolf, L. Rosenberg, E. Epstein, P. MacMenamin, I. da Piedade, K. C. Gunsalus, M. Stoffel, and N. Rajewsky, “Combinatorial microRNA target predictions,” Nature Genetics, vol. 37, no. 5, pp. 495–500, 2005.

[67] M. Kertesz, N. Iovino, U. Unnerstall, U. Gaul, and E. Segal, “The role of site accessibility in microrna target recognition,” Nature genetics, vol. 39, no. 10, pp. 1278–1284, 2007.

[68] B. Lewis, I. Shih, M. Jones-Rhoades, D. Bartel, and C. Burge, “Prediction of Mammalian microRNA targets,” Cell, vol. 115, no. 7, pp. 787–798, 2003.

[69] E. A. Shirdel, W. Xie, T. Mak, and I. Jurisica, “NAViGaTing the micronome – using multiple microRNA prediction databases to identify signalling pathway-associated micrornas,” PloS one, vol. 6, no. 2, p. e17429, 2011.

[70] D. Betel, M. Wilson, A. Gabow, D. Marks, and C. Sander, “The microRNA.org resource: targets and expression,” Nucleic Acids Research, vol. 36, no. suppl 1, pp. D149–D153, 2008.

[71] K. Miranda, T. Huynh, Y. Tay, Y. Ang, W. Tam, A. Thomson, B. Lim, and I. Rigoutsos, “A pattern-based method for the identification of microRNA binding sites and their corresponding heteroduplexes,” Cell, vol. 126, no. 6, pp. 1203–1217, 2006.

[72] S. Cho, Y. Jun, S. Lee, H. Choi, S. Jung, Y. Jang, C. Park, S. Kim, S. Lee, and W. Kim, “miRGator v2.0: an integrated system for functional investigation of micrornas,” Nucleic acids research, vol. 39, no. suppl 1, pp. D158–D162, 2011.

[73] A. Ruepp, A. Kowarsch, D. Schmidl, F. Bruggenthin, B. Brauner, I. Dunger, G. Fobo, G. Frishman, C. Montrone, and F. Theis, “PhenomiR: a knowledgebase for microrna expression in diseases and biological processes,” Genome Biology, vol. 11, no. 1, p. R6, 2010.

[74] K. Fortney and I. Jurisica, “Integrative computational biology for cancer research,” Human Genetics, vol. 130, no. 4, pp. 465–481, 2011.

[75] A. Turinsky, S. Razick, B. Turner, I. Donaldson, and S. Wodak, “Literature curation of protein interactions: measuring agreement across major public databases,” Database, vol. 2010, January 2010.

[76] M. Oti and H. Brunner, “The modular nature of genetic diseases,” Clinical Genetics, vol. 71, no. 1, pp. 1–11, 2007.

[77] K. Fortney, M. Kotlyar, and I. Jurisica, “Inferring the functions of longevity genes with modular subnetwork biomarkers of caenorhabditis elegans aging,” Genome Biology, vol. 11, no. 2, p. R13, 2010.

[78] T. Ideker and R. Sharan, “Protein networks in disease,” Genome research, vol. 18, no. 4, pp. 644–652, 2008.

[79] J. Hardy and A. Singleton, “Genomewide association studies and human disease,” The New England journal of medicine, vol. 360, no. 17, pp. 1759–1768, 2009.

[80] Y. Li and J. Patra, “Genome-wide inferring gene-phenotype relationship by walking on the heterogeneous network,” Bioinformatics, vol. 26, no. 9, pp. 1219–1224, 2010.

[81] S. Köhler, S. Bauer, D. Horn, and P. Robinson, “Walking the interactome for prioritization of candidate disease genes,” The American Journal of Human Genetics, vol. 82, no. 4, pp. 949–958, 2008.

[82] Y. Chen, W. Wang, Y. Zhou, R. Shields, S. Chanda, R. Elston, and J. Li, “In silico gene prioritization by integrating multiple data sources,” PloS one, vol. 6, no. 6, p. e21137, 2011.

[83] X. Wu, R. Jiang, M. Zhang, and S. Li, “Network-based global inference of human disease genes,” Molecular systems biology, vol. 4, no. 1, 2008.

[84] W. Zhang, F. Sun, and R. Jiang, “Integrating multiple protein-protein interaction networks to prioritize disease genes: a bayesian regression approach,” BMC Bioinformatics, vol. 12, no. Suppl 1, p. S11, 2011.

[85] R. Kondor and J. Lafferty, “Diffusion kernels on graphs and other discrete structures,” in Proceedings of the ICML, pp. 315–322, 2002.

[86] A. J. Enright, S. V. Dongen, and C. A. Ouzounis, “An efficient algorithm for large-scale detection of protein families,” Nucleic Acids Research, vol. 30, no. 7, pp. 1575–1584, 2002.

[87] S. van Dongen, Graph Clustering by Flow Simulation. PhD thesis, University of Utrecht, 2000.

[88] Z. Tu, L. Wang, M. Arbeitman, T. Chen, and F. Sun, “An integrative approach for causal gene identification and gene regulatory pathway inference,” Bioinformatics, vol. 22, no. 14, pp. e489–e496, 2006.

[89] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf, “Learning with local and global consistency,” in Advances in Neural Information Processing Systems 16, vol. 16, pp. 321–328, 2003.

[90] H. Chuang, E. Lee, Y. Liu, D. Lee, and T. Ideker, “Network-based classification of breast cancer metastasis,” Molecular Systems Biology, vol. 3, no. 1, 2007.

[91] P. Dao, R. Colak, R. Salari, F. Moser, E. Davicioni, A. Schönhuth, and M. Ester, “Inferring cancer subnetwork markers using density-constrained biclustering,” Bioinformatics, vol. 26, no. 18, pp. i625–i631, 2010.

[92] R. Nibbe, M. Koyutürk, and M. Chance, “An integrative-omics approach to identify functional sub-networks in human colorectal cancer,” PLoS Comput Biol, vol. 6, no. 1, p. e1000639, 2010.

[93] A. K. Jain, “Data clustering: 50 years beyond K-means,” Pattern Recognition Letters, vol. 31, pp. 651–666, June 2010.

[94] J. MacQueen, “Some methods for classification and analysis of multivariate observations,” Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 281–297, 1967.

[95] A. K. Jain and R. C. Dubes, Algorithms for clustering data. Upper Saddle River, NJ, USA: Prentice-Hall, Inc., 1988.

[96] M. Girvan and M. E. J. Newman, “Community structure in social and biological networks,” Proc Natl Acad Sci U S A, vol. 99, pp. 7821–6, Jun 2002.

[97] Y. Loewenstein, E. Portugaly, M. Fromer, and M. Linial, “Efficient algorithms for accurate hierarchical clustering of huge datasets: tackling the entire protein space,” Bioinformatics, vol. 24, no. 13, pp. i41–i49, 2008.

[98] M. Blatt, S. Wiseman, and E. Domany, “Superparamagnetic clustering of data,” Physical Review Letters, vol. 76, no. 18, pp. 3251–3254, 1996.

[99] S. Letovsky and S. Kasif, “Predicting protein function from protein/protein interaction data: a probabilistic approach,” Bioinformatics, vol. 19, no. suppl 1, pp. i197–i204, 2003.

[100] Z. Wei and H. Li, “A markov random field model for network-based analysis of genomic data,” Bioinformatics, vol. 23, no. 12, pp. 1537–1544, 2007.

[101] B. Frey and D. Dueck, “Clustering by passing messages between data points,” Science, vol. 315, no. 5814, pp. 972–976, 2007.

[102] G. Bader and C. Hogue, “An automated method for finding molecular complexes in large protein interaction networks,” BMC Bioinformatics, vol. 4, no. 1, p. 2, 2003.

[103] A. King, N. Pržulj, and I. Jurisica, “Protein complex prediction via cost-based clustering,” Bioinformatics, vol. 20, no. 17, pp. 3013–3020, 2004.

[104] P. Jiang and M. Singh, “SPICi: a fast clustering algorithm for large biological networks,” Bioinformatics, vol. 26, no. 8, pp. 1105–1111, 2010.

[105] T. Ideker, O. Ozier, B. Schwikowski, and A. F. Siegel, “Discovering regulatory and signalling circuits in molecular interaction networks,” Bioinformatics, vol. 18 Suppl 1, pp. S233–40, 2002.

[106] I. M. S. G. Consortium, “Network-based multiple sclerosis pathway analysis with GWAS data from 15,000 cases and 30,000 controls,” American Journal of Human Genetics, 2013.

[107] B. Adamcsek, G. Palla, I. J. Farkas, I. Derényi, and T. Vicsek, “CFinder: locating cliques and overlapping modules in biological networks,” Bioinformatics, vol. 22, no. 8, pp. 1021–1023, 2006.

[108] G. Palla, I. Derenyi, I. Farkas, and T. Vicsek, “Uncovering the overlapping community structure of complex networks in nature and society,” Nature, vol. 435, no. 7043, pp. 814–818, 2005.

[109] L. Wang, P. Khankhanian, S. Baranzini, and P. Mousavi, “iCTNet: a cytoscape plugin to produce and analyze integrative complex traits networks,” BMC bioinformatics, vol. 12, no. 1, p. 380, 2011.

[110] P. Villoslada and S. Baranzini, “Data integration and systems biology approaches for biomarker discovery: challenges and opportunities for multiple sclerosis,” J Neuroimmunol, vol. 248, pp. 58–65, Jul 2012.

[111] L. Diaz-Beltran, C. Cano, D. P. Wall, and F. J. Esteban, “Systems biology as a comparative approach to understand complex gene expression in neurological diseases,” Behav Sci (Basel), vol. 3, pp. 253–72, Jun 2013.

[112] R. Saito, M. E. Smoot, K. Ono, J. Ruscheinski, P.-L. Wang, S. Lotia, A. R. Pico, G. D. Bader, and T. Ideker, “A travel guide to cytoscape plugins,” Nat Methods, vol. 9, pp. 1069–76, Nov 2012.

[113] K. Lage, E. Karlberg, Z. Storling, P. Olason, A. Pedersen, O. Rigina, A. Hinsby, Z. Tumer, F. Pociot, N. Tommerup, Y. Moreau, and S. Brunak, “A human phenome-interactome network of protein complexes implicated in genetic disorders,” Nature Biotechnology, vol. 25, no. 3, pp. 309–316, 2007.

[114] Y. Yamanishi, M. Kotera, Y. Moriya, R. Sawada, M. Kanehisa, and S. Goto, “DINIES: drug-target interaction network inference engine based on supervised analysis,” Nucleic Acids Res, vol. 42, pp. W39–45, 2014.

[115] E. Schadt, S. Friend, and D. Shaywitz, “A network view of disease and compound screening,” Nature Reviews Drug Discovery, vol. 8, no. 4, pp. 286–295, 2009.

[116] Y. Guan, D. Gorenshteyn, M. Burmeister, A. Wong, J. Schimenti, M. Handel, C. Bult, M. Hibbs, and O. Troyanskaya, “Tissue-specific functional networks for prioritizing phenotype and disease genes,” PLoS Comput Biol, vol. 8, no. 9, p. e1002694, 2012.

[117] D. Warde-Farley, S. L. Donaldson, O. Comes, K. Zuberi, R. Badrawi, P. Chao, M. Franz, C. Grouios, F. Kazi, C. T. Lopes, A. Maitland, S. Mostafavi, J. Montojo, Q. Shao, G. Wright, G. D. Bader, and Q. Morris, “The GeneMANIA prediction server: biological network integration for gene prioritization and predicting gene function,” Nucleic Acids Research, vol. 38, pp. W214–W220, July 2010.

[118] A. Bauer-Mehren, M. Rautschka, F. Sanz, and L. I. Furlong, “DisGeNET: a Cytoscape plugin to visualize, integrate, search and analyze gene-disease networks,” Bioinformatics, vol. 26, pp. 2924–2926, November 2010.

[119] A. Mora, K. Michalickova, and I. Donaldson, “A survey of protein interaction data and multigenic inherited disorders,” BMC Bioinformatics, vol. 14, p. 47, 2013.

[120] L. Schriml, C. Arze, S. Nadendla, Y. Chang, M. Mazaitis, V. Felix, G. Feng, and W. A. Kibbe, “Disease Ontology: a backbone for disease semantic integration,” Nucleic acids research, vol. 40 (Database issue), pp. D940–6, 2012.

[121] J. Malone, E. Holloway, T. Adamusiak, M. Kapushesky, J. Zheng, N. Kolesnikov, A. Zhukova, A. Brazma, and H. Parkinson, “Modeling sample variables with an experimental factor ontology,” Bioinformatics, vol. 26, no. 8, pp. 1112–8, 2010.

[122] A. Hamosh, A. Scott, J. Amberger, C. Bocchini, and V. McKusick, “Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders,” Nucleic Acids Research, vol. 33, pp. D514–D517, 2005.

[123] G. Brown, V. Hem, K. Katz, M. Ovetsky, C. Wallin, O. Ermolaeva, I. Tolstoy, T. Tatusova, K. Pruitt, D. Maglott, and T. Murphy, “Gene: a gene-centered information resource at ncbi,” Nucleic Acids Res, vol. 43, pp. D36–D42, 2014.

[124] M. Gremse, A. Chang, I. Schomburg, A. Grote, M. Scheer, C. Ebeling, and D. Schomburg, “The BRENDA Tissue Ontology (BTO): the first all-integrating ontology of all organisms for enzyme sources,” Nucleic Acids Res, vol. 39, pp. D507–D513, 2011.

[125] C. Knox, V. Law, T. Jewison, P. Liu, S. Ly, A. Frolkis, A. Pon, K. Banco, C. Mak, V. Neveu, et al., “DrugBank 3.0: a comprehensive resource for ‘omics’ research on drugs,” Nucleic Acids Research, vol. 39, no. suppl 1, pp. D1035–D1041, 2011.

[126] A. Su, T. Wiltshire, S. Batalov, H. Lapp, K. Ching, D. Block, J. Zhang, R. Soden, M. Hayakawa, G. Kreiman, M. Cooke, J. Walker, and J. Hogenesch, “A gene atlas of the mouse and human protein-encoding transcriptomes,” Proceedings of the National Academy of Sciences of the United States of America, vol. 101, no. 16, pp. 6062–6067, 2004.

[127] M. Kuhn, M. Campillos, I. Letunic, L. Jensen, and P. Bork, “A side effect resource to capture phenotypic effects of drugs,” Mol Syst Biol, vol. 6, p. 343, 2010.

[128] A. Stojmirović and Y. Yu, “ppiTrim: constructing non-redundant and up-to-date interactomes,” Database, vol. 2011, p. bar036, 2011.

[129] Z. Li and T. M. Rana, “Therapeutic targeting of micrornas: current status and future challenges,” Nature reviews Drug discovery, vol. 13, no. 8, pp. 622–638, 2014.

[130] L. Wang, T. Matsushita, L. Madireddy, P. Mousavi, and S. Baranzini, “PINBPA: Cytoscape app for network analysis of gwas data,” Bioinformatics, vol. 31, no. 2, pp. 262–4, 2015.

[131] L. Wang, P. Mousavi, and S. Baranzini, “iPINBPA: An integrative network-based functional module discovery tool for genome-wide association studies,” Pacific Symposium on Biocomputing, vol. 20, pp. 255–66, 2015.

[132] T. Manolio, “Genomewide association studies and assessment of the risk of disease,” The New England journal of medicine, vol. 363, no. 2, pp. 166–176, 2010.

[133] K. Wang, M. Li, and H. Hakonarson, “Analysing biological pathways in genome-wide association studies,” Nature Reviews Genetics, vol. 11, no. 12, pp. 843–854, 2010.

[134] S. Baranzini, N. Galwey, J. Wang, P. Khankhanian, R. Lindberg, D. Pelletier, W. Wu, B. Uitdehaag, L. Kappos, G. Consortium, C. Polman, P. Matthews, S. Hauser, R. Gibson, J. Oksenberg, and M. Barnes, “Pathway and network-based analysis of genome-wide association studies in multiple sclerosis,” Human Molecular Genetics, vol. 18, no. 11, pp. 2078–2090, 2009.

[135] P. Khatri, M. Sirota, and A. J. Butte, “Ten years of pathway analysis: current approaches and outstanding challenges,” PLoS computational biology, vol. 8, no. 2, p. e1002375, 2012.

[136] D. W. Huang, B. T. Sherman, and R. A. Lempicki, “Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists,” Nucleic Acids Research, vol. 37, no. 1, pp. 1–13, 2009.

[137] P. Lee, C. O’Dushlaine, B. Thomas, and S. M. Purcell, “INRICH: interval-based enrichment analysis for genome-wide association studies,” Bioinformatics, vol. 28, no. 13, pp. 1797–1799, 2012.

[138] K. Wang, M. Li, and M. Bucan, “Pathway-based approaches for analysis of genomewide association studies,” American Journal of Human Genetics, vol. 81, no. 6, pp. 1278–1283, 2007.

[139] L. Weng, F. Macciardi, A. Subramanian, G. Guffanti, S. G. Potkin, Z. Yu, and X. Xie, “SNP-based pathway enrichment analysis for genome-wide association studies,” BMC Bioinformatics, vol. 12, p. 99, 2011.

[140] B. L. Yaspan, W. S. Bush, E. S. Torstenson, D. Ma, M. A. Pericak-Vance, M. D. Ritchie, J. S. Sutcliffe, and J. L. Haines, “Genetic analysis of biological pathway data through genomic randomization,” Human genetics, vol. 129, no. 5, pp. 563–571, 2011.

[141] P. Jia, S. Zheng, J. Long, W. Zheng, and Z. Zhao, “dmGWAS: dense module searching for genome-wide association studies in protein-protein interaction networks,” Bioinformatics (Oxford, England), vol. 27, no. 1, pp. 95–102, 2011.

[142] N. Akula, A. Baranova, D. Seto, J. Solka, M. A. Nalls, A. Singleton, L. Ferrucci, T. Tanaka, S. Bandinelli, Y. S. Cho, Y. J. Kim, J. Y. Lee, B. G. Han, B. D. G. S. B. Consortium, W. T. C.-C. Consortium, and F. J. McMahon, “A network-based approach to prioritize results from genome-wide association studies,” PLoS ONE, vol. 6, no. 9, p. e24220, 2011.

[143] E. Rossin, K. Lage, S. Raychaudhuri, R. Xavier, D. Tatar, Y. Benita, I. I. B. D. G. Consortium, C. Cotsapas, and M. J. Daly, “Proteins encoded in genomic regions associated with immune-mediated disease physically interact and suggest underlying biology,” PLoS Genetics, vol. 7, no. 1, p. e1001273, 2011.

[144] N. Patsopoulos, B. P. M. G. W. Group, S. C. of Studies Evaluating IFNbeta-1b, a CCR1-Antagonist, A. Consortium, GeneMSA, I. M. S. G. Consortium, F. Esposito, J. Reischl, S. Lehr, D. Bauer, J. Heubach, R. Sandbrink, C. Pohl, G. Edan, L. Kappos, D. Miller, J. Montalban, C. H. Polman, M. S. Freedman, H. P. Hartung, B. G. Arnason, G. Comi, S. Cook, M. Filippi, D. S. Goodin, D. Jeffery, P. O’Connor, G. C. Ebers, D. Langdon, A. T. Reder, A. Traboulsee, F. Zipp, S. Schimrigk, J. Hillert, M. Bahlo, D. R. Booth, S. Broadley, M. A. Brown, B. L. Browning, S. R. Browning, H. Butzkueven, W. M. Carroll, C. Chapman, S. J. Foote, L. Griffiths, A. G. Kermode, T. J. Kilpatrick, J. Lechner-Scott, M. Marriott, D. Mason, P. Moscato, R. N. Heard, M. P. Pender, V. M. Perreau, D. Perera, J. P. Rubio, R. J. Scott, M. Slee, J. Stankovich, G. J. Stewart, B. V. Taylor, N. Tubridy, E. Willoughby, J. Wiley, P. Matthews, F. M. Boneschi, A. Compston, J. Haines, S. L. Hauser, J. McCauley, A. Ivinson, J. R. Oksenberg, M. Pericak-Vance, S. J. Sawcer, P. L. D. Jager, D. A. Hafler, and P. I. de Bakker, “Genome-wide meta-analysis identifies novel multiple sclerosis susceptibility loci,” Annals of Neurology, vol. 70, no. 6, pp. 897–912, 2011.

[145] I. M. S. G. Consortium, W. T. C. C. C. 2, S. Sawcer, G. Hellenthal, M. Pirinen, C. C. Spencer, N. A. Patsopoulos, L. Moutsianas, A. Dilthey, Z. Su, C. Freeman, S. E. Hunt, S. Edkins, E. Gray, D. R. Booth, S. C. Potter, A. Goris, G. Band, A. B. Oturai, A. Strange, J. Saarela, C. Bellenguez, B. Fontaine, M. Gillman, B. Hemmer, R. Gwilliam, F. Zipp, A. Jayakumar, R. Martin, S. Leslie, S. Hawkins, E. Giannoulatou, S. D’alfonso, H. Blackburn, F. M. Boneschi, J. Liddle, H. F. Harbo, M. L. Perez, A. Spurkland, M. J. Waller, M. P. Mycko, M. Ricketts, M. Comabella, N. Hammond, I. Kockum, O. T. McCann, M. Ban, P. Whittaker, A. Kemppinen, P. Weston, C. Hawkins, S. Widaa, J. Zajicek, S. Dronov, N. Robertson, S. J. Bumpstead, L. F. Barcellos, R. Ravindrarajah, R. Abraham, L. Alfredsson, K. Ardlie, C. Aubin, A. Baker, K. Baker, S. E. Baranzini, L. Bergamaschi, R. Bergamaschi, A. Bernstein, A. Berthele, M. Boggild, J. P. Bradfield, D. Brassat, S. A. Broadley, D. Buck, H. Butzkueven, R. Capra, W. M. Carroll, P. Cavalla, E. G. Celius, S. Cepok, R. Chiavacci, F. Clerget-Darpoux, K. Clysters, G. Comi, M. Cossburn, I. Cournu-Rebeix, M. B. Cox, W. Cozen, B. A. Cree, A. H. Cross, D. Cusi, M. J. Daly, E. Davis, P. I. de Bakker, M. Debouverie, M. B. D’hooghe, K. Dixon, R. Dobosi, B. Dubois, D. Ellinghaus, I. Elovaara, F. Esposito, C. Fontenille, S. Foote, A. Franke, D. Galimberti, A. Ghezzi, J. Glessner, R. Gomez, O. Gout, C. Graham, S. F. Grant, F. R. Guerini, H. Hakonarson, P. Hall, A. Hamsten, H. P. Hartung, R. N. Heard, S. Heath, J. Hobart, M. Hoshi, C. Infante-Duarte, G. Ingram, W. Ingram, T. Islam, M. Jagodic, M. Kabesch, A. G. Kermode, T. J. Kilpatrick, C. Kim, N. Klopp, K. Koivisto, M. Larsson, M. Lathrop, J. S. Lechner-Scott, M. A. Leone, V. Leppa, U. Liljedahl, I. L. Bomfim, R. R. Lincoln, J. Link, J. Liu, A. R. Lorentzen, S. Lupoli, F. Macciardi, T. Mack, M. Marriott, V. Martinelli, D. Mason, J. L. McCauley, F. Mentch, I. L. Mero, T. Mihalova, X. Montalban, J. Mottershead, K. M. Myhr, P. Naldi, W. Ollier, A. Page, A. Palotie, J. Pelletier, L. Piccio, T. Pickersgill, F. Piehl, S. Pobywajlo, H. L. Quach, P. P. Ramsay, M. Reunanen, R. Reynolds, J. D. Rioux, M. Rodegher, S. Roesner, J. P. Rubio, I. M. Ruckert, M. Salvetti, E. Salvi, A. Santaniello, C. A. Schaefer, S. Schreiber, C. Schulze, R. J. Scott, F. Sellebjerg, K. W. Selmaj, D. Sexton, L. Shen, B. Simms-Acuna, S. Skidmore, P. M. Sleiman, C. Smestad, P. S. Sorensen, H. B. Sondergaard, J. Stankovich, R. C. Strange, A. M. Sulonen, E. Sundqvist, A. C. Syvanen, F. Taddeo, B. Taylor, J. M. Blackwell, P. Tienari, E. Bramon, A. Tourbah, M. A. Brown, E. Tronczynska, J. P. Casas, N. Tubridy, A. Corvin, J. Vickery, J. Jankowski, P. Villoslada, H. S. Markus, K. Wang, C. G. Mathew, J. Wason, C. N. Palmer, H. E. Wichmann, R. Plomin, E. Willoughby, A. Rautanen, J. Winkelmann, M. Wittig, R. C. Trembath, J. Yaouanq, A. C. Viswanathan, H. Zhang, N. W. Wood, R. Zuvich, P. Deloukas, C. Langford, A. Duncanson, J. R. Oksenberg, M. A. Pericak-Vance, J. L. Haines, T. Olsson, J. Hillert, A. J. Ivinson, P. L. D. Jager, L. Peltonen, G. J. Stewart, D. A. Hafler, S. L. Hauser, G. McVean, P. Donnelly, and A. Compston, “Genetic risk and a primary role for cell-mediated immune mechanisms in multiple sclerosis,” Nature, vol. 476, no. 7359, pp. 214–219, 2011.

[146] J. Z. Liu, A. F. McRae, D. R. Nyholt, S. E. Medland, N. R. Wray, K. M. Brown, A. Investigators, N. K. Hayward, G. W. Montgomery, P. M. Visscher, N. G. Martin, and S. Macgregor, “A versatile gene-based test for genome-wide association studies,” American Journal of Human Genetics, vol. 87, no. 1, pp. 139–145, 2010.

[147] I. M. S. G. C. (IMSGC), A. H. Beecham, N. A. Patsopoulos, D. K. Xifara, M. F. Davis, A. Kemppinen, C. Cotsapas, T. S. Shah, C. Spencer, D. Booth, A. Goris, A. Oturai, J. Saarela, B. Fontaine, B. Hemmer, C. Martin, F. Zipp, S. D’Alfonso, F. Martinelli-Boneschi, B. Taylor, H. F. Harbo, I. Kockum, J. Hillert, T. Olsson, M. Ban, J. R. Oksenberg, R. Hintzen, L. F. Barcellos, W. T. C. C. C. . (WTCCC2), I. I. G. C. (IIBDGC), C. Agliardi, L. Alfredsson, M. Alizadeh, C. Anderson, R. Andrews, H. B. Sondergaard, A. Baker, G. Band, S. E. Baranzini, N. Barizzone, J. Barrett, C. Bellenguez, L. Bergamaschi, L. Bernardinelli, A. Berthele, V. Biberacher, T. M. Binder, H. Blackburn, I. L. Bomfim, P. Brambilla, S. Broadley, B. Brochet, L. Brundin, D. Buck, H. Butzkueven, S. J. Caillier, W. Camu, W. Carpentier, P. Cavalla, E. G. Celius, I. Coman, G. Comi, L. Corrado, L. Cosemans, I. Cournu-Rebeix, B. A. Cree, D. Cusi, V. Damotte, G. Defer, S. R. Delgado, P. Deloukas, A. di Sapio, A. T. Dilthey, P. Donnelly, B. Dubois, M. Duddy, S. Edkins, I. Elovaara, F. Esposito, N. Evangelou, B. Fiddes, J. Field, A. Franke, C. Freeman, I. Y. Frohlich, D. Galimberti, C. Gieger, P. A. Gourraud, C. Graetz, A. Graham, V. Grummel, C. Guaschino, A. Hadjixenofontos, H. Hakonarson, C. Halfpenny, G. Hall, P. Hall, A. Hamsten, J. Harley, T. Harrower, C. Hawkins, G. Hellenthal, C. Hillier, J. Hobart, M. Hoshi, S. E. Hunt, M. Jagodic, I. Jelcic, A. Jochim, B. Kendall, A. Kermode, T. Kilpatrick, K. Koivisto, I. Konidari, T. Korn, H. Kronsbein, C. Langford, M. Larsson, M. Lathrop, C. Lebrun-Frenay, J. Lechner-Scott, M. H. Lee, M. A. Leone, V. Leppa, G. Liberatore, B. A. Lie, C. M. Lill, M. Linden, J. Link, F. Luessi, J. Lycke, F. Macciardi, S. Mannisto, C. P. Manrique, R. Martin, V. Martinelli, D. Mason, G. Mazibrada, C. McCabe, I. L. Mero, J. Mescheriakova, L. Moutsianas, K. M. Myhr, G. Nagels, R. Nicholas, P. Nilsson, F. Piehl, M. Pirinen, S. E. Price, H. Quach, M. Reunanen, W. Robberecht, N. P. Robertson, M. Rodegher, D. Rog, M. Salvetti, N. C. Schnetz-Boutaud, F. Sellebjerg, R. C. Selter, C. Schaefer, S. Shaunak, L. Shen, S. Shields, V. Siffrin, M. Slee, P. S. Sorensen, M. Sorosina, M. Sospedra, A. Spurkland, A. Strange, E. Sundqvist, V. Thijs, J. Thorpe, A. Ticca, P. Tienari, C. van Duijn, E. M. Visser, S. Vucic, H. Westerlind, J. S. Wiley, A. Wilkins, J. F. Wilson, J. Winkelmann, J. Zajicek, E. Zindler, J. L. Haines, M. A. Pericak-Vance, A. J. Ivinson, G. Stewart, D. Hafler, S. L. Hauser, A. Compston, G. McVean, P. D. Jager, S. J. Sawcer, and J. L. McCauley, “Analysis of immune-related loci identifies 48 new susceptibility variants for multiple sclerosis,” Nature Genetics, 2013.

[148] F. Mosteller and R. R. Bush, “Selected quantitative techniques,” Handbook of Social Psychology, vol. 1, pp. 289–334, 1954.

[149] T. Liptak, “On the combination of independent tests,” Magyar Tud. Akad. Mat. Kutato Int. Kozl, no. 3, pp. 171–197, 1958.

[150] R. Fisher, “Combining independent tests of significance,” Am. Stat., vol. 2, no. 30, 1948.

[151] M. Whitlock, “Combining probability from independent tests: the weighted Z-method is superior to Fisher’s approach,” Journal of Evolutionary Biology, vol. 18, no. 5, pp. 1368–1373, 2005.

[152] S. Stouffer, E. Suchman, L. DeVinney, S. Star, and R. J. Williams, The American soldier: Adjustment during army life, vol. 1, 1949.

[153] M. Parkes, A. Cortes, D. A. van Heel, and M. A. Brown, “Genetic insights into common pathways and complex relationships among immune-mediated diseases,” Nature Reviews Genetics, vol. 14, no. 9, pp. 661–673, 2013.

[154] D. W. Huang, B. T. Sherman, and R. A. Lempicki, “Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources,” Nature Protocols, vol. 4, no. 1, pp. 44–57, 2009.

[155] B. Nolen and A. Lokshin, “Protein biomarkers of ovarian cancer: the forest and the trees,” Future oncology (London, England), vol. 8, no. 1, pp. 55–71, 2012.

[156] D. Yang, Y. Sun, L. Hu, H. Zheng, P. Ji, C. Pecot, Y. Zhao, S. Reynolds, H. Cheng, R. Rupaimoole, D. Cogdell, M. Nykter, R. Broaddus, C. Rodriguez-Aguayo, G. Lopez-Berestein, J. Liu, I. Shmulevich, A. Sood, K. Chen, and W. Zhang, “Integrated analyses identify a master microRNA regulatory network for the mesenchymal subtype in serous ovarian cancer,” Cancer Cell, vol. 23, no. 2, pp. 186–99, 2013.

[157] C. G. Gunawardana, C. Kuk, C. R. Smith, I. Batruch, A. Soosaipillai, and E. P. Diamandis, “Comprehensive analysis of conditioned media from ovarian cancer cell lines identifies novel candidate markers of epithelial ovarian cancer,” Journal of Proteome Research, vol. 8, no. 10, pp. 4705–4713, 2009.

[158] C. Kuk, V. Kulasingam, C. Gunawardana, C. Smith, I. Batruch, and E. Diamandis, “Mining the ovarian cancer ascites proteome for potential ovarian cancer biomarkers,” Molecular & Cellular Proteomics, vol. 8, no. 4, pp. 661–669, 2009.

[159] Y. Zhang, B. Xu, Y. Liu, H. Yao, N. Lu, B. Li, J. Gao, S. Guo, N. Han, J. Qi, K. Zhang, S. Cheng, H. Wang, X. Zhang, T. Xiao, L. Wu, and Y. Gao, “The ovarian cancer-derived secretory/releasing proteome: A repertoire of tumor markers,” Proteomics, vol. 12, no. 11, pp. 1883–1891, 2012.

[160] L. Gortzak-Uzan, A. Ignatchenko, A. I. Evangelou, M. Agochiya, K. A. Brown, P. S. Onge, I. Kireeva, G. Schmitt-Ulms, T. J. Brown, J. Murphy, B. Rosen, P. Shaw, I. Jurisica, and T. Kislinger, “A proteome resource of ovarian cancer ascites: integrated proteomic and bioinformatic analyses to identify putative biomarkers,” Journal of proteome research, vol. 7, no. 1, pp. 339–351, 2008.

[161] B. Cox, T. Kislinger, and A. Emili, “Integrating gene and protein expression data: pattern analysis and profile mining,” Methods, vol. 35, no. 3, pp. 303–314, 2005.

[162] C. J. Hack, “Integrated transcriptome and proteome data: the challenges ahead,” Briefings in Functional Genomics and Proteomics, vol. 3, no. 3, pp. 212–219, 2004.

[163] K. Brown and I. Jurisica, “Online predicted human interaction database,” Bioinformatics, vol. 21, no. 9, pp. 2076–2082, 2005.

[164] N. J. Bowen, L. Walker, L. Matyunina, S. Logani, K. Totten, B. Benigno, and J. McDonald, “Gene expression profiling supports the hypothesis that human ovarian surface epithelia are multipotent and capable of serving as ovarian cancer initiating cells,” BMC Medical Genomics, vol. 2, p. 71, 2009.

[165] U. Consortium, “Update on activities at the Universal Protein Resource (UniProt) in 2013,” Nucleic Acids Research, vol. 41, pp. D43–D47, 2013.

[166] A. Yoshida, N. Okamoto, A. Tozawa-Ono, H. Koizumi, K. Kiguchi, B. Ishizuka, T. Kumai, and N. Suzuki, “Proteomic analysis of differential protein expression by brain metastases of gynecological malignancies,” Hum Cell, vol. 26, no. 2, pp. 56–66, 2013.

[167] M. Kim, C. Sung, I. Do, H. Jeon, T. Song, H. Park, Y. Lee, B. Kim, J. Lee, and D. Bae, “Overexpression of Galectin-3 and its clinical significance in ovarian carcinoma,” International Journal of Clinical Oncology, vol. 16, no. 4, pp. 352–358, 2011.

[168] U. Bharadwaj, C. Marin-Muller, M. Li, C. Chen, and Q. Yao, “Mesothelin confers pancreatic cancer cell resistance to TNF-alpha-induced apoptosis through Akt/PI3K/NF-kappaB activation and IL-6/Mcl-1 overexpression,” Molecular Cancer, vol. 10, p. 106, 2011.

[169] ENCODE Project Consortium, “An integrated encyclopedia of DNA elements in the human genome,” Nature, vol. 489, no. 7414, pp. 57–74, 2012.

[170] W. A. Kibbe, C. Arze, V. Felix, E. Mitraka, E. Bolton, G. Fu, C. J. Mungall, J. X. Binder, J. Malone, D. Vasant, H. Parkinson, and L. M. Schriml, “Disease Ontology 2015 update: an expanded and updated database of human diseases for linking biomedical knowledge through disease data,” Nucleic Acids Res, vol. 43, pp. D1071–8, Jan 2015.

Appendix A

iCTNet Data Source Description

iCTNet has collected 13 publicly available data sources, which are listed in Table 3.1. In this appendix, a short description is given for each source, and for most sources an example data entry is provided to illustrate the collected information.

1. Disease Ontology is a standardized source of human disease, providing consistent, reusable and sustainable descriptions of human disease terms, phenotype characteristics and related medical vocabulary disease concepts (http://disease-ontology.org). Disease Ontology is well maintained, and more details can be found in the latest update [170]. The data entry for ovarian cancer in Disease Ontology is shown in Table A.1.

2. HUGO Gene Nomenclature Committee (HGNC) is responsible for approving unique symbols and names for human genes. http://www.genenames.org is a curated online repository of HGNC-approved gene resources. Table A.2 shows the data entry of BRCA1 in HGNC.

3. mirCat is a concatenation of miRNA databases whose data have been experimentally verified. The data sources of mirCat include microRNA.org, miRTarBase, TarBase, microT (v3.0) and miR2Disease. mirCat provides online search at http://www.mirrna.org for miRNA-gene, miRNA-disease, and miRNA-tissue associations. iCTNet has collected miRNA-gene associations from mirCat. Table A.3 shows the miRNA-gene associations of hsa-let-7a, which targets 39 genes; the first 10 gene targets are listed.

4. BRENDA Tissue Ontology is a structured encyclopedia of enzyme sources, including terms for tissues, cell lines, cell types and cell cultures. BRENDA Tissue Ontology is one of the first tissue-specific ontologies.

5. Comparative Toxicogenomics Database (CTD) was developed to centralize information on genes and proteins that respond to environmental toxic agents. CTD now provides chemical-gene, chemical-disease and gene-disease interactions that are manually curated from the literature. Table A.4 shows a chemical-gene data entry, Table A.5 shows a chemical-disease data entry and Table A.6 shows a gene-disease data entry.

6. Medical Dictionary for Regulatory Activities (MedDRA) is a standardized medical terminology covering pharmaceuticals, biologics, vaccines and drug-device combination products.

7. GWAS Catalog shares GWAS data from published studies that meet certain criteria, e.g., the arrays provide sufficient genome-wide coverage. A sample data entry with selected columns is shown in Table A.7.

8. Online Mendelian Inheritance in Man database (OMIM) is an authoritative data source of human genes and genetic phenotypes. OMIM is updated daily. An example data entry of OMIM is given in Table A.8.

9. GNF Gene Atlas provides a global gene expression map by integrating microarray data representing different cell and tissue types. The data can be accessed at http://www.ebi.ac.uk/gxa/array/U133A.

10. DrugBank provides comprehensive drug target information of FDA-approved or experimental drugs. Table A.9 shows two data entries of FDA-approved drugs in DrugBank.

11. Side Effect Resource (SIDER) provides recorded adverse drug reactions of marketed medicines. SIDER maps drug labels to STITCH compound identifiers, which are derived from PubChem compound identifiers, and uses the MedDRA dictionary to extract side effect terms. Table A.10 shows two example data entries in SIDER.

12. iRefIndex is an online repository of protein-protein interactions. iRefIndex consolidates a number of the most popular databases: BIND, BioGRID, CORUM, DIP, HPRD, IntAct, MINT, MPact, MPPI and OPHID. Each entry in iRefIndex is supported by corresponding publication evidence. Data can be downloaded in the Proteomics Standards Initiative Molecular Interaction (PSI-MI) tab-delimited format.

13. ppiTrim provides non-redundant and consistently annotated physical protein-protein interactions from iRefIndex. Table A.11 shows two interaction entries in ppiTrim with selected columns.
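Item 12 notes that iRefIndex data can be downloaded in the PSI-MI tab-delimited format. As a rough illustration of what reading such a file involves, the following Python sketch parses the 15 columns of the standard PSI-MI TAB 2.5 layout. The short column names, the function `parse_mitab`, and the sample record are assumptions made for this example; the record is written in a simplified MITAB style and is not copied from an iRefIndex release, whose files carry additional columns that this sketch ignores.

```python
import csv
import io

# Short names (chosen for this sketch) for the 15 tab-separated fields
# of the standard PSI-MI TAB 2.5 layout.
MITAB_COLUMNS = [
    "id_a", "id_b", "alt_id_a", "alt_id_b", "alias_a", "alias_b",
    "method", "first_author", "pubmed", "taxid_a", "taxid_b",
    "interaction_type", "source_db", "interaction_id", "confidence",
]


def parse_mitab(handle):
    """Yield one dict per interaction, keeping only the first 15 columns."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip blank lines and the header line, which starts with '#'.
        if not row or row[0].startswith("#"):
            continue
        yield dict(zip(MITAB_COLUMNS, row[:len(MITAB_COLUMNS)]))


# Hypothetical, simplified record for illustration only.
SAMPLE = (
    "#uidA\tuidB\n"
    "entrezgene/locuslink:10000\tentrezgene/locuslink:5590\t-\t-\t"
    "entrezgene/locuslink:AKT3\tentrezgene/locuslink:PRKCZ\t"
    "MI:0415(enzymatic study)\t-\tpubmed:12162751\t"
    "taxid:9606(Homo sapiens)\ttaxid:9606(Homo sapiens)\t"
    "MI:0217(phosphorylation reaction)\t-\t-\t-\n"
)

records = list(parse_mitab(io.StringIO(SAMPLE)))
print(records[0]["id_a"], records[0]["id_b"], records[0]["pubmed"])
```

Yielding dictionaries keeps the sketch independent of any particular downstream representation; a real loader would also need to handle multi-valued fields, which MITAB separates with the `|` character.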

Table A.1: Ovarian cancer entry in disease ontology

| Field | Value |
| --- | --- |
| DOID | DOID:2394 |
| Name | ovarian cancer |
| Definition | A female reproductive organ cancer that is located in the ovary. http://www.cancer.gov/dictionary?CdrID=445074 |
| Xrefs | MSH:D010051; NCI:C4984; OMIM:167000; OMIM:607893; ORDO:213517; SNOMEDCT_2010_1_31:123843001; SNOMEDCT_2010_1_31:154528000; SNOMEDCT_2010_1_31:363443007; SNOMEDCT_2010_1_31:372117006; SNOMEDCT_2010_1_31:93934004; UMLS_CUI:C0919267; UMLS_CUI:C1140680; UMLS_CUI:C1299247 |
| Alternate ids | DOID:0060070; DOID:2144; DOID:9595 |
| Subsets | DO_MGI_slim; TopNodes_DOcancerslim |
| Synonyms | malignant Ovarian tumor [EXACT]; malignant tumour of ovary [EXACT]; ovarian neoplasm [EXACT]; ovary neoplasm [EXACT]; primary ovarian cancer [EXACT]; tumor of the Ovary [EXACT] |
| Relationships | is_a female reproductive organ cancer |

Table A.2: Gene BRCA1 data entry in HGNC

| Field | Value |
| --- | --- |
| APPROVED SYMBOL | BRCA1 |
| APPROVED NAME | breast cancer 1, early onset |
| HGNC ID | HGNC:1100 |
| PREVIOUS SYMBOLS & NAMES | - |
| SYNONYMS | "BRCA1/BRCA2-containing complex, subunit 1", BRCC1, "Fanconi anemia, complementation group S", FANCS, PPP1R53, "protein phosphatase 1, regulatory subunit 53", RNF53 |
| LOCUS TYPE | gene with protein product |
| CHROMOSOMAL LOCATION | 17q21.31 |
| GENE FAMILY | RING-type (C3HC4) zinc fingers; Serine/threonine phosphatases / Protein phosphatase 1, regulatory subunits; Fanconi anemia, complementation groups |
| HCOP | Orthology Predictions for BRCA1 |

Table A.3: miRNA hsa-let-7a entry in mirCat

miRNA: hsa-let-7a

Sequence: UGAGGUAGUAGGUUGUAUAGUU

Location: chr9:95978060-95978139 strand: 1

| Gene target | PubMed ID |
| --- | --- |
| APP | 19110058 |
| BCL2 | 17260024 |
| PRDM1 | 18583325 |
| ZFP36L1 | 17942906 |
| CASP3 | 18758960 |
| CASP8 | 18758960 |
| CASP9 | 18758960 |
| CCND2 | 20418948 |
| CYP27B1 | 17890240 |
| E2F1 | 19110058 |
| ... | ... |

Table A.4: Chemical-gene data entry in CTD

| Field | Value |
| --- | --- |
| ChemicalName | 10,10-bis(4-pyridinylmethyl)-9(10H)-anthracenone |
| ChemicalID | C112297 |
| GeneSymbol | KCNQ1 |
| GeneID | 3784 |
| GeneForms | protein |
| Organism | |
| Interaction | 10,10-bis(4-pyridinylmethyl)-9(10H)-anthracenone results in decreased activity of KCNQ1 protein |
| InteractionActions | decreases activity |
| PubMedIDs | 18568022 |

Table A.5: Chemical-disease data entry in CTD

| Chemical Name | Chemical ID | Disease Name | Disease ID | Direct Evidence | Inference GeneSymbol | Omim IDs | PubMed IDs |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 06-Paris-LA-66 protocol | C046983 | Precursor Cell Lymphoblastic Leukemia-Lymphoma | MESH:D054198 | therapeutic | | | 4519131 |

Table A.6: Gene-disease data entry in CTD

| Gene Symbol | Gene ID | Disease Name | Disease ID | Direct Evidence | Inference ChemicalName | Omim IDs | PubMed IDs |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 128UP | 36288 | Depressive Disorder | MESH:D003866 | | zinc chloride | | 17327255 |

Table A.7: Data entry in GWAS Catalog

| Field | Value |
| --- | --- |
| Date Added to Catalog | 07/22/14 |
| First Author | Goris A |
| Date | November 13, 2013 |
| Journal | Hum Mol Genet |
| Study | ... |
| Disease/Trait | Multiple sclerosis or amyotrophic lateral sclerosis |
| Initial Sample Description | 4,088 Multiple sclerosis cases, 3,762 Amyotrophic lateral sclerosis cases, 12,030 controls |
| Replication Sample Description | NA |
| Region | 17q21.32 |
| Mapped Gene(s) | NPEPPS |
| Strongest SNP-Risk Allele | rs2935183-? |
| Context | nearGene-5 |
| Risk Allele Frequency in Controls | NR |
| P-value | 5 × 10^−7 |

Table A.8: Gene-disease data entry in OMIM

| Phenotype | Alternative titles; symbols | Phenotype MIM # | Gene | Gene MIM # | Location |
| --- | --- | --- | --- | --- | --- |
| Breast Cancer | Breast cancer, familial | 114480 | RAD54L | 603615 | 1p34.1 |

Table A.9: Drug-target data entries in DrugBank

| DrugBank ID | Name | Type | UniProt ID | UniProt Name |
| --- | --- | --- | --- | --- |
| DB00001 | Lepirudin | BiotechDrug | P00734 | Prothrombin |
| DB00014 | Goserelin | SmallMoleculeDrug | P30968 | Gonadotropin-releasing hormone receptor |

Table A.10: Drug-side effect entries in SIDER

| STITCH flat compound ID | STITCH stereo compound ID | UMLS ID | Drug name | Side effect name | MedDRA concept type | UMLS ID | MedDRA side effect |
| --- | --- | --- | --- | --- | --- | --- | --- |
| -100003914 | -39468 | C0038454 | levobunolol | cerebrovascular accident | PT | C0038454 | Cerebrovascular accident |
| -100003914 | -39468 | C0015230 | levobunolol | rash | LLT | C0015230 | Rash |

Table A.11: Protein-protein interaction entries in ppiTrim

| geneID A | geneID B | aliasA | aliasB | method | pmids | interactionType |
| --- | --- | --- | --- | --- | --- | --- |
| 10000 | 23327 | AKT3 | NEDD4L | MI:0415 (enzymatic study) | 19953087 | MI:0220 (ubiquitination reaction) |
| 10000 | 5590 | AKT3 | PRKCZ | MI:0415 (enzymatic study) | 12162751 | MI:0217 (phosphorylation reaction) |
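To illustrate how interaction entries such as those in Table A.11 can be turned into a network, the following Python sketch builds an adjacency map from the two rows of the table. The function name `build_network` and the choice to treat each interaction as an undirected edge are assumptions made for this example; they are not taken from the iCTNet implementation.

```python
from collections import defaultdict

# The two rows of Table A.11:
# (geneID A, geneID B, aliasA, aliasB, method, pmid, interactionType)
ROWS = [
    (10000, 23327, "AKT3", "NEDD4L", "MI:0415", "19953087", "MI:0220"),
    (10000, 5590, "AKT3", "PRKCZ", "MI:0415", "12162751", "MI:0217"),
]


def build_network(rows):
    """Return (neighbours, symbols) for an undirected gene-gene network."""
    neighbours = defaultdict(set)   # gene ID -> set of interacting gene IDs
    symbols = {}                    # gene ID -> gene symbol
    for gene_a, gene_b, alias_a, alias_b, *_ in rows:
        symbols[gene_a] = alias_a
        symbols[gene_b] = alias_b
        # Assumption: edges are stored in both directions (undirected).
        neighbours[gene_a].add(gene_b)
        neighbours[gene_b].add(gene_a)
    return neighbours, symbols


neighbours, symbols = build_network(ROWS)
partners = sorted(symbols[g] for g in neighbours[10000])
print(symbols[10000], "interacts with", partners)
# prints: AKT3 interacts with ['NEDD4L', 'PRKCZ']
```

Keying the map on Entrez gene IDs rather than symbols avoids ambiguity when a symbol has aliases; the symbols are kept only for display.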

Appendix B

iCTNet Database Schema

Figure B.1 shows the schema of the iCTNet database, which comprises 39 tables. The schema was designed and exported using MySQL Workbench 5.2 (http://www.mysql.com/products/workbench/).

Figure B.1: iCTNet database schema

[Schema diagram. Group labels in the figure: Tissue Ontology, Experimental Factor Ontology, Gene Expression Atlas, Side Effect, Side Effect Connection, Human Disease Ontology, Human Disease Source-Ontology Map, miRNA-gene association, Gene, PPI, Human Disease, Drug-Gene Connection, Disease-Gene Association, Organism, Therapy, Drug. Tables in the figure: tb_bto, tb_bto_ontology, tb_bto_ontology_relationship, tb_efo, tb_efo_ontology, tb_gnf, tb_doid, tb_doid_ontology, tb_doid_bto, tb_doid_efo_map, tb_doid_medic_map, tb_doid_omim_map, tb_side_effect, tb_side_effect_bto, tb_ctd_side_effect, tb_mirna, tb_medic, tb_omim, tb_omim_type, tb_gene, tb_gene_alias, tb_gene_type, tb_gene_group, tb_gene_efo_gwas, tb_gene_omim_morbidmap, tb_gene_medic_ctd, tb_ppi, tb_ppi_method, tb_ppi_edge_type, tb_ppi_interaction_type, tb_ctd, tb_ctd_alias, tb_ctd_gene_ixn, tb_ctd_medic_therapy, tb_ctd_drugbank_map, tb_drugbank, tb_drugbank_alias, tb_drug_gene_drugbank, tb_organism.]
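As a small illustration of how tables in this schema relate to each other, the following Python sketch recreates a subset of the columns of `tb_gene` and `tb_ppi` from Figure B.1 in an in-memory SQLite database and joins them. Several points are assumptions made for this example: SQLite stands in for MySQL, `gene_id` is used as the primary key, `source` and `target` are taken to reference `tb_gene.gene_id`, and the `pubmed` column is declared as `TEXT`. The inserted rows reuse the gene IDs, symbols and PubMed IDs of Table A.11.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Subset of the tb_gene columns shown in Figure B.1.
    CREATE TABLE tb_gene (
        gene_id  INTEGER PRIMARY KEY,  -- assumption: gene_id is the key
        hgnc_id  VARCHAR(10),
        symbol   VARCHAR(25),
        name     VARCHAR(140),
        location VARCHAR(70)
    );
    -- Subset of the tb_ppi columns shown in Figure B.1.
    CREATE TABLE tb_ppi (
        source INTEGER,  -- assumption: refers to tb_gene.gene_id
        target INTEGER,  -- assumption: refers to tb_gene.gene_id
        pubmed TEXT      -- assumption: plain text column for this sketch
    );
    """
)

# Gene IDs, symbols and PubMed IDs taken from Table A.11.
conn.executemany(
    "INSERT INTO tb_gene (gene_id, symbol) VALUES (?, ?)",
    [(10000, "AKT3"), (23327, "NEDD4L"), (5590, "PRKCZ")],
)
conn.executemany(
    "INSERT INTO tb_ppi (source, target, pubmed) VALUES (?, ?, ?)",
    [(10000, 23327, "19953087"), (10000, 5590, "12162751")],
)

# Resolve both ends of each interaction to gene symbols.
rows = conn.execute(
    """
    SELECT a.symbol, b.symbol, p.pubmed
    FROM tb_ppi AS p
    JOIN tb_gene AS a ON a.gene_id = p.source
    JOIN tb_gene AS b ON b.gene_id = p.target
    ORDER BY b.symbol
    """
).fetchall()

for source_symbol, target_symbol, pubmed in rows:
    print(source_symbol, target_symbol, pubmed)
```

Storing interactions as pairs of gene IDs and resolving symbols through a join mirrors the separation between entity tables and connection tables that the group labels in the figure suggest.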

Appendix C

ACS Proteins Selected for Ovarian Cancer

ACS proteins are proteins that show changes in expression at both the gene and protein levels and are also identified as secreted proteins. Table C.1 contains 43 proteins selected for further investigation. The 23 attributes in the table are described below.

(1) Protein id: uniprotID. The UniProt ID of each protein in the record.

(2) Gene id: geneName. Gene name of the corresponding mouse protein.

(3) Gene id: Human. Gene name of the human homologue of the mouse protein. The mapping information was downloaded from HomoloGene (2012/12/14 version) (http://www.ncbi.nlm.nih.gov/homologene).

(4) Label: Class. A stands for differentially expressed genes in the Affymetrix data; C stands for significant cell proteins from the proteomic profile; and S stands for secreted proteins. In this category, A, C and S are pooled over the union of the two cancer cell lines. ACS indicates proteins that show changes in expression at both the gene and protein levels and are also identified as secreted proteins. AC S indicates proteins that show changes in expression at both the gene and protein levels and are not secreted proteins, but have direct interactions with secreted proteins. AS C indicates proteins that are differentially expressed at the gene level and have direct interactions with significant cell proteins. CS A indicates proteins that are significantly differentially expressed at the protein level and secreted from the cancer cells, and that are not differentially expressed at the gene level but have direct interactions with proteins differentially expressed at the gene level.

(5) Label: Class 2. Label for the intersection of the two cancer cell lines.

(6) Previous studies: Studied biomarker. Putative secreted biomarkers in previous proteomic studies [157][158][160].

(7) Previous studies: associated. Putative biomarkers from a literature review [155].

(8) GO: CC. Cellular component in GO. C: cytoplasm; E: extracellular; P: plasma membrane.

(9) CDIP database: CDIP UP. The number of studies showing the corresponding human gene up-regulated in CDIP (ovarian cancer).

(10) CDIP database: CDIP DOWN. The number of studies showing the corresponding human gene down-regulated in CDIP (ovarian cancer).

(11) Gene expression: probeset. The probeset with the largest variance in the Affymetrix gene expression data for IC5.

(12) Gene expression: AffyIC5fold. The fold change of the corresponding probeset in the Affymetrix gene expression data for IC5.

(13) Gene expression: probeset. The probeset with the largest variance in the Affymetrix gene expression data for IG10.

(14) Gene expression: AffyIG10fold. The fold change of the corresponding probeset in the Affymetrix gene expression data for IG10.

(15) Cell protein: msIC5. The fold change of the corresponding cell protein in the proteomic profile of IC5.

(16) Cell protein: msIC5 FDR. False discovery rate for the corresponding fold change in the proteomic profile of IC5.

(17) Cell protein: msIG10. The fold change of the corresponding cell protein in the proteomic profile of IG10.

(18) Cell protein: msIG10 FDR. False discovery rate for the corresponding fold change in the proteomic profile of IG10.

(19) Secretome: seIC5. The averaged count of spectra in the secretome profile of IC5.

(20) Secretome: seIG10. The averaged count of spectra in the secretome profile of IG10.

(21) IC5 rank: IC5 r. The aggregate ranking of the corresponding protein in the combination of all three profiles of IC5: gene expression, proteomic and secretome.

(22) IG10 rank: IG10 r. The aggregate ranking of the corresponding protein in the combination of all three profiles of IG10: gene expression, proteomic and secretome.

(23) General rank: Mouse rank. The aggregate ranking of the corresponding protein in the combination of all three profiles (gene expression, proteomic and secretome) of both cancer cell lines, IC5 and IG10.

Table C.1: 43 ACS proteins and their values