Network Medicine: A Network-based Approach to Human Diseases

by Susan Dina Ghiassian

B.S. in Physics, Sharif University of Technology

M.S. in Physics, Northeastern University

A dissertation submitted to

The Faculty of

the College of Art and Science of

Northeastern University

in partial fulfillment of the requirements

for the degree of

March 19, 2015

Dissertation directed by

Albert-László Barabási

Distinguished University

DEDICATION

To Mamanjoon

ACKNOWLEDGMENTS

I would like to thank my advisor, Albert-László Barabási, not only for giving me the opportunity to spend the most productive years of my life (so far!) in his lab but also for proving to me that there is no limit to human inspiration. He taught me to be broad-minded and unbiased in discussing ideas, pragmatic and collaborative in doing research and confident in defending what I believe in.

I would like to take this opportunity to thank three members of my committee: Daniel Chasman, who has always offered me his guidance during my research; Alain Karma, for teaching me the basics of statistical mechanics; and Alessandro Vespignani, from whom I have learnt and by whose contributions to the field of network science I have been amazed.

Completion of this dissertation would not have been possible without all the help and support of a former member of the lab, Jörg Menche, who patiently taught me all the steps required to perform successful research. He not only showed me the necessity of honest research but also taught me life lessons of being helpful to others and at peace with oneself. I am grateful to have worked with my wonderful collaborators at CCNR: Sabrina Rabello, Emre Guney, Marc Santolini, Maksim Kitsak, Joseph de Nicolo, Suzanne Aleva, Brett Common and James Bagrow.

I would also like to express my gratitude to Joseph Loscalzo, who guided me through to the completion of my research and patiently answered all my questions. This dissertation is the result of a collaborative effort of many bright collaborators from different institutes: CCNR, Dana Farber Cancer Institute, Brigham and Women’s Hospital, and University Heart Center, Hamburg, Germany. I would like to acknowledge all my collaborators (Mark Vidal, David E. Hill, Sam Pevzner, Anne-Ruxandra Carvunis, Thomas Rolland, Franco Giulianini, Piero Ricchiuto, Christian Mueller, Tajna Zeller, Sasha Singh, Aikawa Masanori, Ramy Arnaout and many more) for making this happen.

I would like to thank my uncle and aunts (Freydoun Ghiassian, Shaheen Ghiassian and Deena Westerby) for their continuous support and encouragement. I would also like to thank my dearest friends and family (Anahita Faham, Fateme Tousi, Amir Taqavi, Samira Faegh, Parnian Boloorizadeh, Dena Saadat, Sara Ansari, Marzieh Haghighi, Parisa Taheri, Noushin Fallahpour, Mona Shahi, Mona Manouchehri, and many more) who have always been by my side, listened to me, shared their experiences and brought the best out of me.

I am blessed to have my biggest role models, my lovely parents (Bahman Ghiassian and Fozia Benaissa), and their endless support. They always inspired me, believed in me and supported me in every way possible. Their kind hearts, bright minds, nice personalities and helping hands have always been my guide throughout my life. I am grateful to have my sister and brother, Yasman and Ehsan, who are always fun, supportive and loving.

My special thanks go to my husband, Razzi Movassaghi, who has been by my side through ups and downs for the past 8 years and made me believe in myself. He is not only the source of my courage and motivation in life, but has also always offered insightful scientific suggestions for my research.

Finally, this work is dedicated to the memory of my beloved grandmother who loved to learn and always encouraged me to keep learning. She was the best thing this world could have.

ABSTRACT

With the availability of large-scale data, it is now possible to systematically study the underlying interaction maps of many complex systems in multiple disciplines. Statistical physics has a long and successful history in modeling and characterizing systems with a large number of interacting individuals. Indeed, numerous approaches that were first developed in the context of statistical physics, such as the notion of random walks and diffusion processes, have been applied successfully to study and characterize complex systems in the context of network science. Based on these tools, network science has made important contributions to our understanding of many real-world, self-organizing systems, for example in computer science, sociology and economics.

Biological systems are no exception. Indeed, recent studies reflect the necessity of applying statistical and network-based approaches in order to understand complex biological systems, such as cells. In these approaches, a cell is viewed as a complex network consisting of interactions among cellular components, such as genes and proteins. Given the cellular network as a platform, the machinery, functionality and failure of a cell can be studied with network-based approaches, a field known as systems biology.

Here, we apply network-based approaches to explore human diseases and their associated genes within the cellular network. This dissertation is divided into three parts: (i) A systematic analysis of the connectivity patterns among disease proteins within the cellular network. The quantification of these patterns inspires the design of an algorithm which predicts a disease-specific subnetwork containing yet unknown disease-associated proteins¹. (ii) We apply the introduced algorithm to explore the common underlying mechanism of many complex diseases. We detect a subnetwork from which inflammatory processes initiate and result in many autoimmune diseases. (iii) The last chapter of this dissertation describes the statistical methods, detailed data curation processes and additional analyses performed to accomplish the previous parts.

¹ The contents of this part are published in PLoS Computational Biology.

CONTENTS

Dedication
Acknowledgments
Abstract
Contents
1 Introduction
1.1 Origin of graph theory
1.2 Emergence of network science
1.3 Network science applications in systems biology
1.4 Emergence of Network Medicine
1.4.1 Human interactome and complex diseases
1.4.2 Existing methods for the identification of disease-gene associations
2 A disease module detection (DIAMOnD) algorithm
2.1 Quantifying interaction patterns of disease proteins within the interactome
2.2 The DIAMOnD algorithm
2.2.1 Time complexity
2.3 DIAMOnD performance and robustness
2.3.1 Synthetic modules construction
2.3.2 Estimating the recovery rate
2.3.3 Analyzing the sensitivity towards perturbations and network noisiness
2.4 Identifying and validating disease modules
2.5 Comparison with existing methods
2.6 Extending the basic DIAMOnD algorithm
2.7 Discussion
3 Common underlying molecular mechanisms of complex diseases
3.1 Constructing inflammasome, thrombosome, and fibrosome
3.1.1 Significant clustering of seed genes within the human interactome
3.1.2 Effect of biased studies on significant clustering of seed genes
3.1.3 Modules detection, validation and robustness
3.1.4 Cross-talk region of the modules
3.1.5 Biological importance of the endophenotype modules
3.1.6 The role of endophenotype modules in cardiovascular disease
3.1.7 The role of endophenotype modules in complex diseases
3.2 Topological properties of the endophenotype modules
3.2.1 Central location of inflammatory and fibrotic genes
3.3 Functionality of detected endophenotype modules using macrophages
3.3.1 Detection of early and late proteins in response to inflammatory stimulator
3.3.2 Early proteins may be responsible for triggering late proteins
3.4 Discussion
4 Data analysis and preparation
4.1 Human Interactome (HI)
4.2 Highly studied proteins within the PPI
4.3 Modular nature of protein-protein interaction network
4.3.1 Disease-genes associations
4.3.2 Gene annotations
4.4 LCC significance
4.5 Pathways analysis
4.6 Genetic association analysis
4.7 Differential expression analysis of cardiovascular risk
4.8 THP-1 cell culture experiments and proteomics
5 Conclusions and future directions
Bibliography

LIST OF FIGURES

Figure 1 Schematic network representation.
Figure 2 Localization of disease proteins.
Figure 3 Disease proteins forming the largest connected component (LCC).
Figure 4 Significant clustering of disease proteins.
Figure 5 Topological communities and disease proteins.
Figure 6 Failure of topological community detection methods.
Figure 7 Connectivity significance vs. local modularity of disease proteins.
Figure 8 Connectivity significance characterizes disease proteins.
Figure 9 The DIAMOnD algorithm.
Figure 10 Macular degeneration disease module.
Figure 11 Synthetic modules.
Figure 12 Performance evaluation of DIAMOnD.
Figure 13 N-1 analysis.
Figure 14 DIAMOnD robustness.
Figure 15 Biological evaluation of lysosomal storage diseases module.
Figure 16 Biological validation of DIAMOnD across 70 diseases.
Figure 17 DIAMOnD and Random Walk in synthetic and disease modules.
Figure 18 Overall comparison of DIAMOnD and Random Walk.
Figure 19 Schematic representation showing why to assign node weights.
Figure 20 Extending the DIAMOnD algorithm to adopt node weights.
Figure 21 Topological characteristics of seed genes within the HI.
Figure 22 Genetic association of seed genes.
Figure 23 Studying biased studies of networks in seeds clustering.
Figure 24 Biological validation of the detected DIAMOnD genes.
Figure 25 Topological properties of the endophenotypic modules.
Figure 26 Differentially expressed genes within modules.
Figure 27 Tree analysis.
Figure 28 Tree analysis of seed genes and modules.
Figure 29 Functional similarities of ome-M proteins.
Figure 30 Detecting early and late proteins of inflammatory responses.
Figure 31 Dyadicity and heterophilicity of a group of nodes.
Figure 32 Heterophilicity vs. Dyadicity of hot and cool genes.
Figure 33 Tree analysis of hot and cool genes.
Figure 34 Modular structure of the PPI.
Figure 35 Venn diagram of differentially expressed genes.

LIST OF TABLES

Table 1 List of the first 35 diseases and their associated disease proteins.
Table 2 List of the second 35 diseases and their associated disease proteins.
Table 3 Genes associated with endophenotypes, on the HI.
Table 4 Fully embedded pathways within modules.
Table 5 Differentially expressed genes within the modules.
Table 6 Number of diseases significantly enriched with modules.
Table 7 Significantly enriched diseases in module-specific regions.
Table 8 Topological network properties of endophenotype modules.
Table 9 Properties of early and late proteins (criterion a).
Table 10 Properties of early and late proteins (criterion b).
Table 11 Properties of early and late proteins (criterion c).
Table 12 Tree analysis of hot and cool genes.
Table 13 Case and control selection statistics.

1 INTRODUCTION

Networks have been studied in a variety of disciplines, from mathematics and physics to sociology, biology and medicine. Prominent examples of real-world networks are friendship, co-citation, power-grid and gene regulatory networks. The reason for the widespread use of networks is that essentially any system with interacting components can be represented as a network. To better understand the functionality of such systems, it is important to first study their topological characteristics in a systematic framework, the topic of “Network Science”.

Network Medicine is an emerging branch of the multidisciplinary field of network science that aims to understand and ultimately treat human diseases through systematic network-based approaches. In the following, we present a brief historical background to network science in general, and network medicine in particular.

1.1 Origin of graph theory

Modern network science is rooted in graph theory, which dates back to 1735, when Leonhard Euler first formulated a mathematical puzzle as a graph problem. The puzzle was to find a path that passes through all seven bridges connecting four areas of the city of Königsberg without crossing a bridge twice. Euler found that he could simplify the puzzle by turning it into a graph representation, where vertices are the different areas and edges represent the bridges between them (Fig. 1). He then showed that such a path exists only if there are two or fewer vertices with an odd number of edges. Therefore, the existence of such a path is a characteristic of the underlying graph and hidden in its topology. Indeed, many real-world problems can be formulated and thus simplified using graph representations.
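Euler's criterion can be checked directly on the Königsberg graph. The following is a minimal sketch and not part of the original analysis: the labels A to D for the four land areas and the function names are illustrative assumptions, and the bridge layout is the standard one (an island joined to each river bank by two bridges).

```python
from collections import Counter

# The seven bridges of Königsberg as a multigraph: each pair is one bridge.
# The labels A-D for the four land areas are illustrative, not from the text.
BRIDGES = [
    ("A", "B"), ("A", "B"),  # two bridges between island A and bank B
    ("A", "C"), ("A", "C"),  # two bridges between island A and bank C
    ("A", "D"), ("B", "D"), ("C", "D"),
]


def vertex_degrees(edges):
    """Count the edge ends at each vertex (parallel edges count separately)."""
    degrees = Counter()
    for u, v in edges:
        degrees[u] += 1
        degrees[v] += 1
    return degrees


def is_connected(edges):
    """Check that all vertices touched by an edge lie in one component."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    if not neighbors:
        return True
    start = next(iter(neighbors))
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nxt in neighbors[node] - seen:
            seen.add(nxt)
            stack.append(nxt)
    return len(seen) == len(neighbors)


def has_euler_path(edges):
    """A connected graph has a path using every edge exactly once
    if and only if it has zero or two vertices of odd degree."""
    odd = [v for v, d in vertex_degrees(edges).items() if d % 2 == 1]
    return is_connected(edges) and len(odd) in (0, 2)


print(dict(vertex_degrees(BRIDGES)))
print(has_euler_path(BRIDGES))
```

All four areas have an odd number of bridges, so the check fails for Königsberg, whereas a simple chain such as `[("A", "B"), ("B", "C")]` passes it.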

The next far-reaching paradigm in graph theory is the random graph model, first introduced by two Hungarian mathematicians, Paul Erdős and Alfréd Rényi [1], and, independently, by Edgar Gilbert [2]. In the model introduced by Erdős and Rényi (ER model), L edges are randomly placed in a graph containing N vertices. Gilbert introduced an essentially equivalent model where each pair of vertices in a graph is connected with probability p. In this probabilistic formulation, the probability of having exactly L edges in a graph is:

$$P_L = \binom{\frac{N(N-1)}{2}}{L}\, p^{L}\, (1-p)^{\frac{N(N-1)}{2}-L} \qquad (1)$$

Therefore, the average number of edges is the first moment of equation 1,

$$\langle L \rangle = \sum_{L=0}^{N(N-1)/2} L\, P_L = p\, \frac{N(N-1)}{2}.$$

The number of edges that are attached to a vertex is called its degree. In a random graph containing N vertices and L edges, the average degree of a vertex is $\langle k \rangle = 2 \langle L \rangle / N = p(N-1)$, as each edge contributes to the degree of two vertices. Similarly, the probability of a node having degree k follows the binomial distribution:

$$p_k = \binom{N-1}{k}\, p^{k}\, (1-p)^{N-1-k} \qquad (2)$$

For graphs where $\langle k \rangle \ll N$, the binomial distribution can be well approximated by the Poisson distribution, $p_k = e^{-\langle k \rangle} \langle k \rangle^{k} / k!$. Thus, the degree distribution of random graphs is uniquely characterized by the average degree $\langle k \rangle$. For decades, random graph models remained the dominant paradigm. In the beginning of the twenty-first century, however, with the availability of large-scale data, scientists revisited random graph models, which eventually led to the emergence of network science.
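The quality of the Poisson approximation can be illustrated numerically. The sketch below evaluates the binomial degree distribution of equation 2 next to its Poisson limit; the values of N and p and the function names are illustrative assumptions, chosen only so that the average degree is much smaller than N.

```python
import math


def binomial_degree_pmf(k, n_nodes, p):
    """Equation 2: probability that a node of a random graph has degree k."""
    return math.comb(n_nodes - 1, k) * p**k * (1 - p) ** (n_nodes - 1 - k)


def poisson_degree_pmf(k, mean_degree):
    """Poisson limit, reasonable when the mean degree is much smaller than N."""
    return math.exp(-mean_degree) * mean_degree**k / math.factorial(k)


# Illustrative parameters (assumptions, not values from the text).
N = 1000
p = 0.004

mean_edges = p * N * (N - 1) / 2   # <L> = p N (N - 1) / 2
mean_degree = 2 * mean_edges / N   # <k> = 2 <L> / N = p (N - 1)

print(f"<L> = {mean_edges:.1f}, <k> = {mean_degree:.3f}")
for k in range(0, 11, 2):
    b = binomial_degree_pmf(k, N, p)
    q = poisson_degree_pmf(k, mean_degree)
    print(f"k = {k:2d}   binomial = {b:.5f}   Poisson = {q:.5f}")
```

For these parameters the two distributions agree closely at every degree; increasing p at fixed N makes the difference grow, as expected once the average degree is no longer small compared to N.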

1.2 Emergence of network science

While the origins of network science go back to mathematical graph theory, many important subsequent contributions have been made by physicists, in particular those with a background in statistical mechanics. The twenty-first century has offered vast experimental facilities and computational capacities to acquire and analyze the wiring diagrams behind many real-world graphs (i.e. networks). The availability of large-scale data, along with the multidisciplinary nature of networks, led to a sudden rise of attention to the science of networks.

In the terminology of network science, the fundamental components of a network are commonly referred to as ‘nodes’ and the interactions between them as ‘links’. Social networks, for example, consist of people as nodes, with their social ties, such as friendship, represented by links.

Studies attempting to understand and reproduce the properties of real-world networks played a crucial role in the emergence of network science. In 1998, Watts and Strogatz realized that most real networks are characterized by two important features: (a) the so-called ‘small-world effect’, meaning that nodes can be reached from one another with a few hops and (b) high clustering, representing the tendency of a node’s interacting partners to interact among themselves and form triangles (Figure 1). While random graph models are able to predict the small-world effect of real networks, they fail to describe their high clustering. Watts and Strogatz introduced a simple model (WS model) to combine these two properties [3]. The starting point of the model is a regular one-dimensional ring lattice. Next, with probability p each edge is rewired to another randomly chosen vertex. Even though the WS model is successful at explaining the small-world and high clustering effects of real networks, Barabási and Albert later showed that it fails to describe another key feature of real networks, i.e. their power-law degree distribution.
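The two steps of the WS model described above can be sketched as follows. This is a simplified illustration and not the original implementation of [3]: the network size, the number of lattice neighbors, the rewiring probabilities and all function names are assumptions made for the example.

```python
import random
from collections import deque


def ring_lattice(n_nodes, k_neighbors):
    """Regular ring: every node links to its k_neighbors nearest nodes (even k)."""
    adjacency = {i: set() for i in range(n_nodes)}
    for i in range(n_nodes):
        for step in range(1, k_neighbors // 2 + 1):
            j = (i + step) % n_nodes
            adjacency[i].add(j)
            adjacency[j].add(i)
    return adjacency


def watts_strogatz(n_nodes, k_neighbors, p_rewire, rng):
    """Rewire each lattice edge with probability p_rewire to a random new endpoint."""
    adjacency = ring_lattice(n_nodes, k_neighbors)
    for step in range(1, k_neighbors // 2 + 1):
        for i in range(n_nodes):
            if rng.random() >= p_rewire:
                continue
            old = (i + step) % n_nodes
            # Avoid self-loops and duplicate links when picking the new endpoint.
            candidates = [t for t in range(n_nodes)
                          if t != i and t not in adjacency[i]]
            if not candidates:
                continue
            new = rng.choice(candidates)
            adjacency[i].discard(old)
            adjacency[old].discard(i)
            adjacency[i].add(new)
            adjacency[new].add(i)
    return adjacency


def average_clustering(adjacency):
    """Mean fraction of a node's neighbor pairs that are themselves linked."""
    total = 0.0
    for nbrs in adjacency.values():
        d = len(nbrs)
        if d < 2:
            continue
        links = sum(1 for u in nbrs for v in adjacency[u] if v in nbrs) / 2
        total += links / (d * (d - 1) / 2)
    return total / len(adjacency)


def mean_shortest_path(adjacency):
    """Average hop distance over all reachable ordered pairs (breadth-first search)."""
    total, pairs = 0, 0
    for source in adjacency:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nxt in adjacency[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


rng = random.Random(1)
results = {}
for p in (0.0, 0.1, 1.0):
    graph = watts_strogatz(200, 4, p, rng)
    results[p] = (average_clustering(graph), mean_shortest_path(graph))
    print(f"p = {p}: clustering = {results[p][0]:.3f}, mean path = {results[p][1]:.2f}")
```

In this sketch a small rewiring probability is expected to shorten the mean path considerably while leaving most of the clustering of the lattice intact, which is the regime where both features coexist.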

In 1999, Albert et al. designed a study to generate a map of the World Wide Web (WWW) that changed the view of scientists about the applicability of the traditional random graph models. In this study, the existing documents (nodes) and URLs (links) form the WWW map. Each document was assigned a $k_{\mathrm{in}}$ and a $k_{\mathrm{out}}$, indicating the number of links directing to and out of the document, respectively.

Since users have complete freedom of directing the URLs to any document, one might expect that the resulting WWW map has properties similar to those of random graphs. Yet, as Albert et al. showed, this is not the case. In particular, the degree distribution of the WWW map is strikingly different. Neither the in-degree nor the out-degree distribution follows a Poisson distribution; both follow a power law instead [4]. Indeed, the large-scale characteristics of the WWW map resemble those predicted for self-organized systems with highly interacting components.

The power-law degree distribution is not a characteristic limited to the WWW. Rather, studying the topological characteristics of other networks, such as scientific citations, collaborations of actors and electrical power grids, confirmed that they all have scale-free properties [5]. This reflects a surprising phenomenon: regardless of the diverse nature of the components and the interactions of real-world networks, there is one universal law that governs them all. This universality is perhaps the most impactful observation towards the emergence of network science.

[Figure 1 labels: node, link, directed link, hub, k = 4, k_in = 1, k_out = 2, topological community, connected components, largest connected component (LCC), shortest path between nodes A and B, high clustering]

Figure 1: Schematic network representation. This representation shows a few network properties: (i) Connected components are the isolated islands in the network. This network consists of four connected components, where the largest one is known as the largest connected component (LCC). (ii) Topological communities are densely connected groups of nodes with few connections to other nodes. (iii) Hubs are the nodes with a high number of connections (high degree). (iv) The shortest path, shown in blue, is the shortest path available between nodes A and B. (v) High clustering represents the tendency of a node’s neighbors to interact among themselves.

A key feature of scale-free networks is that the probability of having large degree nodes is non-negligible, in stark contrast to ER and WS models. It was shown later that the presence of such high degree nodes (hubs) is not only essential for the structure of the network but also for the functioning of the corresponding physical system [6].

In networks with a power-law degree distribution, the nodes have highly heterogeneous degrees: While most nodes have a small degree, there exist some nodes with very large degree (Figure 1). One of the important and unexpected consequences of such scale-free networks is their high tolerance towards random failures. This means that removing large portions of the nodes from the network will not have a major effect on its overall structural integrity. For example, Albert et al. showed that after randomly removing as much as 45% of the nodes from the WWW map, its largest connected component¹ persists. In contrast, removing a few hubs will destroy the structure of the network entirely and leave only disconnected fragments [7].

Albert et al. further showed that the distribution of the shortest path lengths between pairs of nodes in the WWW network follows a normal distribution. The average shortest path can thus be interpreted as the diameter² of the network [4]. In so-called ‘small-world’ networks, the diameter scales logarithmically (or slower) with the network size. This means that even in very large networks, on average any two nodes are only a few links away from each other. The diameter of the WWW, for example, was calculated to be 18.59 and follows:

$$\langle d \rangle = 0.35 + 2.06 \log(N) \qquad (3)$$

¹ A connected component of a network is a subnetwork in which all nodes are connected to each other through at least one path (Fig. 1).

² An alternative definition that is more frequently used today considers the longest of the shortest paths between any pair of nodes instead of the average path length.

While previous models (ER and WS) fail to describe the scale-free nature of real networks, in the framework of the Barabási-Albert model it follows naturally from two generic assumptions: (1) real-world networks are continuously expanding by the introduction of new nodes and (2) new nodes are more likely to be attached to nodes with high degree (preferential attachment). This phenomenon, known as the “rich get richer” effect, is easy to picture in several social networks, where well-established members have a higher chance of acquiring new connections. Similarly, newly created documents on the WWW are more likely to link to already high-traffic webpages such as Yahoo and Google. Interestingly, it has been shown that even in biological systems the protein products of evolutionarily old genes tend to have higher degrees [8, 9], which is consistent with the rule of preferential attachment.
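Growth and preferential attachment can be sketched in a few lines, and the resulting network can be used to illustrate the contrast between random failures and the removal of hubs discussed earlier. The code below is a simplified illustration rather than the original model implementation: the network size, the number of links per new node, the fraction of removed nodes and the function names are all assumptions made for the example.

```python
import random


def preferential_attachment_graph(n_nodes, m_links, rng):
    """Grow a network node by node; each new node links to m_links distinct
    existing nodes chosen with probability proportional to their degree."""
    adjacency = {i: set() for i in range(n_nodes)}
    # Every node appears in `endpoints` once per link end, so a uniform draw
    # from this list is a degree-proportional draw from the nodes.
    endpoints = []
    # Seed the growth with a small fully connected core of m_links + 1 nodes.
    for i in range(m_links + 1):
        for j in range(i + 1, m_links + 1):
            adjacency[i].add(j)
            adjacency[j].add(i)
            endpoints += [i, j]
    for new in range(m_links + 1, n_nodes):
        targets = set()
        while len(targets) < m_links:
            targets.add(rng.choice(endpoints))
        for t in sorted(targets):
            adjacency[new].add(t)
            adjacency[t].add(new)
            endpoints += [new, t]
    return adjacency


def largest_component_size(adjacency, removed=()):
    """Size of the largest connected component after deleting `removed` nodes."""
    seen = set(removed)
    best = 0
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            node = stack.pop()
            size += 1
            for nxt in adjacency[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        best = max(best, size)
    return best


rng = random.Random(42)
graph = preferential_attachment_graph(2000, 2, rng)
degrees = {node: len(nbrs) for node, nbrs in graph.items()}
mean_degree = sum(degrees.values()) / len(degrees)
print(f"mean degree = {mean_degree:.2f}, largest degree = {max(degrees.values())}")

# Remove 10% of the nodes, either at random or starting from the largest hubs.
n_remove = len(graph) // 10
random_failures = rng.sample(sorted(graph), n_remove)
hubs = sorted(graph, key=degrees.get, reverse=True)[:n_remove]
lcc_random = largest_component_size(graph, random_failures)
lcc_attack = largest_component_size(graph, hubs)
print(f"LCC after random failures: {lcc_random}, after hub removal: {lcc_attack}")
```

The degree-proportional sampling through a list of link ends is one common way to implement preferential attachment; the largest degree should come out far above the mean, and the largest connected component should shrink much more under hub removal than under random failures.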

1.3 Network science applications in systems biology

Network science has found many applications in biology. The power-law degree distribution of real-world networks was shown to hold for several biological networks [6, 10–15]. This surprising universality reflects a major success of network science to explore cell functionality in a more comprehensive, systems-oriented framework known as “Systems Biology”.

Recent advances in molecular biology and low-cost unbiased sequencing have offered a giant inventory of cellular components of individuals. However, to address the complexity of living organisms we need a conceptual framework where a cell is viewed as a network of interacting cellular components. These components, such as proteins, genes and metabolites, work together to exert functions. Thus, to understand cell machinery, survival and failure it is important to consider the complex and multifactorial nature of cellular networks and their components.

The underlying network of a cell consists of cellular components as nodes and their interactions as links. These interactions are either physical or functional (phenotypic). Examples of physical networks include protein-protein interaction (PPI) networks, where proteins physically bind to each other; metabolic networks, where nodes are metabolites that are connected if they take part in the same biochemical reaction; and regulatory networks, where a transcription factor regulates the activity of a gene. RNA regulatory and kinase-substrate networks with post-translational modifications are other examples of networks with physical interactions. On the other hand, cellular components in phenotypic networks are connected to each other if they are attributed to the same function. Examples of such networks are co-expression networks, where two proteins with similar expression patterns are linked, and genetic networks, where two genes are connected if the mutation of both (double mutant) results in a different phenotype than a mutation in each alone (single mutant). Different methods and techniques, from computational predictions [16–18] to experimental approaches [19–23], have been designed to identify whether two molecules interact. However, the twenty-first century has offered unprecedented high-throughput techniques to identify the pair-wise interactions of proteins in a systematic, unbiased fashion.

In 2001, Jeong et al. [6] studied the protein-protein interaction network of Saccharomyces cerevisiae [24, 25], which contained 1,870 proteins as nodes, interacting through 2,240 links. They found that the topology of the network is characterized by a heterogeneous degree distribution, i.e. while most proteins have only few interaction partners, there are also a number of very highly connected proteins. This scale-free structure implies a high tolerance of the network structure towards random errors [7] and the importance of the presence of high degree nodes. They could confirm this network-based prediction experimentally by deleting the high degree nodes, finding that they are indeed essential for the survival of the organism. They observed a clear correlation between node degree and the percentage of essential genes. This is a very important finding, as it reflects the explanatory power of the network topology and the ability to extract biological insights from purely topological properties of the respective biological networks.

1.4 Emergence of Network Medicine

As the post-genomic view changes the role of cellular components from individual agents to members of a more complex network, network medicine aims to systematically study this network in the context of human diseases. The interdisciplinary field of network medicine is emerging from the whole body of principles governing cellular networks and their applications, aiming to provide a platform to study available clinical and biological data in the context of complex networks. Potential avenues that open with advances in network medicine include (a) understanding the molecular mechanisms that underlie diseases, (b) discovery of yet unknown disease genes, (c) identification of drugs and drug targets, (d) understanding disease-disease relationships and (e) introducing new strategies for molecular disease classifications.

• Understanding disease mechanisms

Current advances in exploring disease-gene associations provide us with rich, yet incomplete information towards understanding disease pathologies and diagnostics. Studies have shown that disease-associated genes constitute about 10% of the entire genome [26]. However, despite tremendous advances in molecular biology and disease gene discovery, there is no clear understanding of how an abnormality in a single gene translates into clinical symptoms (phenotypes).

Network medicine has offered a platform where the gap between genotype and phenotype may be filled with a network-based analysis of human diseases. In this platform, a phenotype is considered to be the consequence of a perturbation within the cellular network. This network perturbation triggers a cascade of failures, which leads to malfunctioning of the cellular network and thus the appearance of a certain phenotype. Fortunately, cellular networks are robust towards many perturbations and failures, allowing the cell to maintain its normal functioning. This is a direct consequence of the network topology, which allows it to adapt to many mistakes through alternative pathways. However, there are a few selective types of perturbations that lead to extreme fragility and local failure of the cellular network and disease manifestation [27]. Indeed, several studies have been carried out to explain anomalies in cellular networks and their association with different diseases [28–36].

• Disease gene discovery

Genetic diseases are associated with an abnormality in one or more genes. To better understand the molecular basis of diseases, it is important to assemble a complete list of all disease-associated genes. There are several techniques to discover disease-gene associations, including linkage analysis³, hypothesis-driven analysis and, more recently, Genome Wide Association (GWA) studies. However, many disease-gene associations are still unknown, and the mechanisms of many diseases remain unexplained. Network-based approaches may offer a framework to systematically integrate biological and clinical data in the context of the cellular network and to extract further information on disease origins and candidate disease genes. Identifying candidate genes helps us discover new biomarkers and improve diagnoses.

³ Linkage is the tendency of genes to be inherited together by the offspring, due to their close proximity on the chromosome.

• Network pharmacology

After spending a wealth of time, money and human effort on DNA sequencing and after the publication of the human genome sequence in 2001 [37, 38], scientists were hoping to discover new effective drugs. However, the number of FDA-approved drugs per year has only decreased since 2000 [39]. The reason, of course, is that a disease is not a consequence of an abnormality in a single cellular component, but rather a whole cascading effect of a perturbation through various pathobiological processes within a yet more complex network of interactions [40]. Therefore, a thorough understanding of the cellular dysfunction caused by genetic abnormalities is necessary for identifying candidate targets.

In the drug-target network, constructed by Yildirim et al., FDA-approved and experimental drugs are connected to their protein targets [41]. This map shows that most drugs are palliative, i.e. they do not target disease-associated gene products directly. Instead, they often target the proteins that are in the network vicinity of the disease-associated gene products. Furthermore, in reality, drugs may target proteins beyond those established in vitro, which may lead to several unwanted side effects. Drugs could also manifest their side effects through many other interacting partners of the target proteins. A network-based approach is therefore required to understand the mechanisms of action of drugs. In fact, network pharmacology is already successfully affecting drug development [42–45].

• Disease-disease relationships (causality and comorbidity)

The systematic analysis of human diseases has so far been focused on a single disease at a time. Recent studies, however, point out the importance of studying all human diseases under the same systematic framework. In this framework, different diseases are connected to each other through their common underlying cellular network [46].

Considering the complex nature of cellular networks, it is highly unlikely for different phenotypes to be independent at their molecular level. Many apparently different diseases are associated with mutations in the same gene(s). The co-occurrence of different diseases in the same patient, known as comorbidity [47], could be the result of shared and/or overlapping underlying molecular mechanisms. In 2007, Goh et al. constructed the Human Disease Network (HDN), where nodes represent diseases that are linked if a mutation in the same gene is associated with both [48]. Although the HDN was constructed without using disease classifications, they observed that diseases of the same broad category aggregated in the same network neighborhood, indicating that similar diseases are also similar at the molecular level. In order to further understand the relationships between diseases, several other types of disease networks have been mapped out, such as a metabolic disease network [49], a micro-RNA disease network [50] and phenotypic disease networks [51–53].
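The construction rule of such a disease network, linking two diseases whenever they share an associated gene, can be sketched as follows. The disease and gene labels below are hypothetical placeholders and not real associations, and the function name is an assumption made for this illustration.

```python
from itertools import combinations

# Hypothetical toy disease-gene associations, for illustration only.
# The labels are placeholders and do not correspond to real data.
disease_genes = {
    "disease_A": {"gene_1", "gene_2"},
    "disease_B": {"gene_2", "gene_3"},
    "disease_C": {"gene_4"},
    "disease_D": {"gene_3", "gene_4", "gene_5"},
}


def build_disease_network(associations):
    """Link two diseases whenever they share at least one associated gene.

    Returns a dict mapping each linked disease pair to the shared genes."""
    links = {}
    for d1, d2 in combinations(sorted(associations), 2):
        shared = associations[d1] & associations[d2]
        if shared:
            links[(d1, d2)] = shared
    return links


network = build_disease_network(disease_genes)
for (d1, d2), shared in network.items():
    print(d1, "--", d2, "via", sorted(shared))
```

In this toy example, disease_A and disease_C end up unlinked because they share no gene, although both are connected to other diseases; keeping the shared genes on each link makes it easy to weight links by the number of common genes if desired.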

Menche et al. further studied diseases and their associated genes in the context of the underlying cellular network and found that the topological location of a disease within the network predicts its pathobiological relationships with other diseases. This provides a theoretical framework that can explain disease comorbidity and similarity. For instance, they observed that diseases with topologically close disease genes are significantly comorbid and similar at the symptom level. Moreover, they found that genes associated with topologically close diseases are co-expressed. Hence, network-based approaches to studying diseases help reveal disease-disease relationships at both the topological and the functional level. This may lead to identifying and/or repurposing similar therapies for phenotypes that currently have no treatment. In fact, studies show the success of repurposing already approved drugs for a few phenotypes with no known treatment [54–57].

• Disease classification

Contemporary classification of human diseases is based on pathological analysis and clinical syndromes [58]. Although this approach has served clinicians well so far, it relies on observational skills and is limited by its focus on the single organ where the symptoms manifest. Today, given the steady advances in identifying the molecular associations of many diseases, a more systematic approach is required to replace the classic disease diagnostic and classification paradigm [59, 60].

Most phenotypes are governed by a few common intermediate phenotypes (endophenotypes). Endophenotypes are a collection of biological processes that play an important role in the development of many diseases. Loscalzo et al. have proposed a network-based framework for classifying diseases. In this framework, a phenotype is the result of a system-driven cascade in a series of linked networks: a mutated disease-associated gene is a node in the network, which interacts with many other proteins, including those that control these endophenotypes. Therefore, a specific phenotype is the result of the interplay of these endophenotypes together with environmental determinants. Altogether, we are convinced that an integrated, system-based approach is required to address the current limitations of disease classification.

In summary, for a systematic understanding of human diseases, it is not sufficient to identify disease-causing genes individually. Rather, a detailed overall map of their interactions is necessary to understand how the malfunction of one or more proteins cascades through the network and leads to disease manifestation. This can in turn open new avenues in preventive, predictive, personalized and participative medicine (P4-medicine) [61]. Therefore, we must shift our approach to studying complex diseases from a reductionist to a systematic view, in which different phenotypes are explored in the context of their underlying cellular network.

In this dissertation we focus on studying disease genes within a consolidated cellular network, the Human Interactome.

1.4.1 Human interactome and complex diseases

Various kinds of networks can be constructed by integrating different types of interactions, such as protein-protein, signaling, metabolic and regulatory interactions. We refer to the union of all physical interactions between human cellular components as the Human Interactome (HI). Despite recent advances, existing maps of the HI are noisy and far from complete. In fact, current high-throughput methods cover less than 20% of the estimated complete map of the interactome [62–67]. Yet, despite this incompleteness, several studies show that these maps offer insightful predictions [46, 68–71]. However, the incompleteness of current maps needs to be taken into account when studying phenotypes and cell functions in the context of the Human Interactome.

To explore complex diseases with a network-based approach, we start by mapping the disease-associated genes on the HI and studying their topological properties. This leads us to the main concepts of network medicine as described below.

A review article on network medicine (NM) lists the main concepts that help us explore human diseases in the context of the interactome [40]. The underlying hypotheses are: (a) Most disease genes tend to avoid hubs and are instead located at the periphery of the interactome. (b) Proteins that are involved in the same disease tend to interact with each other. As a corollary to this locality of disease genes, (c) abnormalities in interacting proteins often lead to the same phenotype. (d) Proteins associated with the same phenotype tend to aggregate in the same region of the HI (the disease module notion). (e) Causal biological pathways overlap with known disease-associated components. (f) Phenotypes with similar associated cellular components share symptoms and are comorbid [48, 72–74].

On average, disease-associated proteins tend to have a higher degree than non-disease-associated ones [75–77]. Intuitively, mutations in genes whose protein products have more interacting partners are more likely to have severe phenotypic outcomes. One might therefore think that hubs of the network are more likely to be disease-associated genes. On the contrary, the protein products of disease-associated genes do not constitute hubs of the network. In fact, model organisms show that hub proteins are often encoded by essential genes⁴ [6]. Abnormalities in essential genes spread through the links of their encoded proteins and affect a large number of surrounding proteins. As a result, mutations in essential genes lead to a lethal phenotype, i.e. affected organisms will not survive through the embryonic stage [78].

Local clustering of disease genes and the disease module concept are two central hypotheses of this dissertation. These hypotheses suggest that if a fraction of the disease-associated genes is known, the proteins in their network vicinity are also likely to be disease-associated and, as a result, a well-defined network region (a disease module) can be associated with a specific disease. Therefore, it is crucial to develop proper network-based tools to identify these disease modules. To do so, it is important to distinguish between three fundamental structures within biological networks, namely functional modules, topological communities and disease modules, as described below.

Topological modules are sets of nodes that are densely connected to each other and loosely connected to the rest of the network. In social networks, for example, topological modules can be seen as circles of friends. Cellular functions are most likely carried out in a modular fashion [79]; functional modules are thus sets of nodes that work together to exert a function. Biological pathways are examples of functional modules in biological networks, where a group of molecules works together to carry out a function such as DNA transcription.

4 Essential genes are crucial for the survival of living organisms. Essential genes are older and evolved more slowly and their protein products are often present in multiple tissues.

Disease modules are the subnetworks that contain all disease-associated components (proteins), such that perturbations within the modules can be associated with disease manifestation. Disease modules, as subnetworks associated with any disease of interest, are significantly enriched with disease determinants. Therefore, they offer limited search spaces in which to identify candidate disease genes, molecular mechanisms, drug targets and eventually new therapeutics. In the next chapter we show that disease modules cannot be characterized by topological modules and often contain several functional modules (biological pathways).

Several topological measures and tools, such as the clustering coefficient, motifs and community detection methods, have been designed to quantify and/or identify the interconnectivity of proteins and existing modules. However, the main question to address is whether topological modules represent distinct, biologically meaningful modules and, if not, whether network science can offer tools to correctly identify disease modules. Identifying disease modules is of particular importance because it defines our search space for detecting (a) disease-associated molecular components, (b) candidate drug targets and (c) the subnetwork responsible for disease manifestation.

The next section explores the existing methodologies that attempt to identify disease-gene associations and capture disease modules. In chapter 2 we perform a thorough study of the topological properties of the genes associated with a corpus of 70 diseases to explore the extent to which existing methodologies can capture disease modules. Moreover, we introduce proper statistical measures towards developing the DIseAse MOdule Detection (DIAMOnD) algorithm [80].

1.4.2 Existing methods for the identification of disease-gene associations

Genes whose loci are closer together on the chromosome are more likely to be inherited together during meiosis (cell division). Such genes are said to lie in the same linkage interval. Several integrative approaches have been proposed that leverage gene annotations such as Gene Ontology, biological pathways, gene expression, protein sequence, etc. to discover disease-gene associations.

Moreover, the fact that disease genes are not randomly scattered within the Human Interactome has given rise to different classes of methodologies that exploit the HI to discover new candidate genes. Recent studies show that network-based approaches outperform traditional techniques [81]. Therefore, a class of methodologies uses the map of protein-protein interaction (PPI) networks to rank candidate disease genes. These methodologies can be categorized into three groups:

• Network neighbors

These methods assume that direct interacting partners of disease genes are likely to be disease-associated. One class of these methods identifies a protein as a candidate if it interacts with a certain percentage of disease proteins [82, 83]. Oti et al. predict a gene to be disease-associated if it (a) lies within the linkage interval of known disease genes and (b) has at least one connection to known disease genes in the PPI network. Indeed, direct interacting partners of disease genes that also reside in the same linkage interval have been shown to be 10 times more likely to have a true disease association [73]. Additionally considering Gene Ontology annotations leads to a 1000-fold enrichment of candidate genes with true disease-causing genes [40].

• Topological community finding methods

Topological community finding (also known as graph partitioning) methods have been shown to uncover functional modules, and since phenotypically similar disorders tend to have functionally similar underlying genes, such methodologies are expected to be useful for predicting disease-gene associations as well [81]. For a given disease of interest, a set of genes with known disease association is curated from linkage and/or GWA studies. The underlying interactome is then constructed for the relevant tissue and cell line by merging it with the most up-to-date HI map. In this approach, several topological community finding methods [84–92] can be used to identify a network neighborhood, i.e. a disease module, that best contains the true disease-associated genes. These methods assume that topological, functional and disease modules are interconnected and highly overlapping, and thus that topological community finding methods can identify disease modules. Indeed, the Graph Summarization (GS), Markov Clustering (MCL) and VI-Cut methods have been shown to find biologically relevant modules within the PPI network.

GS is an unsupervised topological community finding method that compresses a given network into an ideal summary network [93]. Nodes of the summary graph represent the modules of the original network. MCL is a popular clustering technique based on random walks on the graph [94]; it identifies network neighborhoods with high flow concentration as clusters. Unlike these two methods, VI-Cut uses the known annotations as part of the clustering algorithm and outputs an optimal clustering that is not only topologically reasonable but also matches the known annotations. More recently, further clustering algorithms such as link clustering [95] and the Louvain method [96] have been introduced and are widely used.

Although topological community finding methods have been shown to outperform the network neighborhood methods, we will show in the next chapter, based on a thorough study, that topological communities are not able to significantly capture disease modules. In fact, we show not only that disease modules exhibit a distinct topology that does not match that of interaction-based topological communities, but also that this difference is expected to persist even if the complete map of the HI were available. Motivated by this observation, we introduce our DIseAse MOdule Detection algorithm, which enables us to capture disease genes and identify disease modules.

• Diffusion-based approaches

Diffusion-based approaches are another class of disease gene identification algorithms used in the field. The main feature of these approaches is the explicit use of disease genes as inputs. In this approach, random walkers on a graph iteratively move from their current positions (nodes) to randomly selected immediate neighbors [97]. Kohler et al. [98] introduced a variant of this model in which random walkers are able to restart, returning to their original position with probability r. This can be formulated as follows:

$$p_{t+1} = (1 - r)\, W p_t + r\, p_0 \tag{4}$$

where $p_t$ is a vector whose ith element is the probability of being at node i at time t, and W is the column-normalized adjacency matrix of the graph. Random walkers start from the nodes with known disease association with equal probability ($p_0$). The iteration continues until a steady state is reached, where the change between $p_t$ and $p_{t+1}$ falls below $10^{-6}$. Finally, the candidate disease genes are ranked by their visiting probabilities in the steady state, $p_\infty$.
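The iteration in Eq. (4) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in [98]: the function name, the toy graph, the default restart probability r = 0.5 and the choice of the L1 norm for measuring the change between successive iterations are all assumptions made here for the example.

```python
def random_walk_with_restart(adjacency, seeds, r=0.5, tol=1e-6, max_iter=10_000):
    """Iterate p_{t+1} = (1 - r) W p_t + r p_0 until the L1 change drops below tol.

    adjacency: dict mapping each node to the set of its neighbors (undirected graph).
    seeds:     set of nodes with known disease association (restart targets).
    r:         restart probability (illustrative default, not taken from the text).
    """
    nodes = list(adjacency)
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {n: r * p0[n] for n in nodes}              # restart term r * p_0
        for j in nodes:
            k_j = len(adjacency[j])
            if k_j == 0:
                continue                                  # isolated node: nothing to spread
            share = (1.0 - r) * p[j] / k_j                # column-normalized W: W_ij = A_ij / k_j
            for i in adjacency[j]:
                nxt[i] += share
        change = sum(abs(nxt[n] - p[n]) for n in nodes)  # L1 norm (an assumption)
        p = nxt
        if change < tol:
            break
    return p


# Toy path graph a - b - c - d with a single seed "a" (hypothetical data).
toy = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
steady = random_walk_with_restart(toy, seeds={"a"})
ranking = sorted((n for n in toy if n != "a"), key=steady.get, reverse=True)
```

On this toy graph the non-seed nodes are ranked by their network proximity to the seed, which is the behavior the ranking by steady-state visiting probability is meant to capture.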

Network propagation, a method presented by Vanunu et al. [99], uses a similar flow-based approach. In this method, a flow spreads from a set of known disease genes (seeds) to their neighboring nodes. A parameter controls the rate at which flow is pumped from the seed nodes. As in the random walk method, the algorithm stops when a steady state is reached.

Navlakha et al. compared all the above approaches and their variations [81]. They found that diffusion-based methods that use disease genes as inputs have the most predictive power for identifying candidate disease genes. In summary, network-based approaches have significantly improved the disease-gene identification process compared to using linkage intervals only. Still, given the continuing advances in mapping out the HI and disease-gene associations, the quality and predictive power of such methodologies should be revisited.

In this dissertation, we aim to systematically study and quantify the topological patterns of disease genes within the HI. In chapter 2, we introduce a distinctive parameter that describes the topological pattern of disease genes. We show that this parameter is robust to increasing coverage of the interactome map. We then use this parameter to further exploit the network neighborhood of a disease query and identify candidate disease genes. For a thorough analysis, we apply this methodology to a corpus of 70 diseases and show that its performance competes with the leading algorithms in the literature. Finally, we introduce a variation of this methodology that allows different weights to be assigned to nodes and links. We show that this simple variation results in a significant improvement in performance. The identified disease modules can be further validated using available biological data such as co-expression, Gene Ontology and biological pathways.

In chapter 3, we further discuss the hypothesis that most complex diseases have common underlying mechanisms. In fact, a few intermediate pathophenotypes (i.e. endophenotypes), including inflammation, thrombosis, fibrosis, immunity, apoptosis, hemorrhage and cell proliferation, are known to play important roles in the development of many complex diseases [58]. Therefore, to study the origin and development of human diseases (pathogenesis), we identify the disease modules associated with these endophenotypes and study their relationship to each other as well as to a curated corpus of 299 disorders. In chapter 4, we elaborate on the construction of these network models and explore the common underlying molecular mechanisms of complex diseases.


2 A DISEASE MODULE DETECTION (DIAMOND) ALGORITHM

In recent years¹, there has been increasing evidence that proteins associated with a particular disease correspond to a distinct network neighborhood within the Human Interactome, the cellular network of all physical molecular interactions [40, 42, 48, 77, 100–102]. The pathobiological properties of a disease and its clinical manifestations can be linked to perturbations within these disease neighborhoods, or disease modules [103]. With recent advances in genome-wide disease gene association [104] and high-throughput Interactome mapping [65], we can already pinpoint the approximate location of some disease modules (Figure 2). For many diseases, however, a considerable fraction of the disease associations remains unknown [105]. In this chapter, we propose a network-based methodology to uncover the disease module associated with a particular phenotype. The algorithm is based on a systematic analysis of the network properties of known disease proteins across 70 diseases, revealing that connectivity significance, rather than connection density, is the most predictive quantity characterizing their interaction patterns. This quantity allows us to systematically explore the local network neighborhood around a given set of known disease proteins, helping us identify promising new disease protein candidates.

1 The contents of this chapter are published in PLoS Comp. Biol.

2.1 Quantifying interaction patterns of disease proteins within the interactome

In order to observe and quantify the underlying interaction patterns of disease proteins, we first need to compile the proper data sources. We therefore start by constructing the Human Interactome as well as the disease-gene associations for a corpus of 70 manually curated and expert-selected diseases. The details of the data preparation are explained extensively in chapter 4.

[Figure 2 graphic: schematic of the protein-protein interaction network marking proteins, disease proteins, disease neighborhoods, a topological community and the largest connected component (LCC).]

Figure 2: Proteins associated with the same phenotype tend to localize in specific neighborhoods of the Interactome, indicating the approximate location of the corresponding disease modules. Topological network communities are highly interconnected groups of nodes.

[Figure 3 graphic: (a) frequency vs. fraction of disease proteins in LCC; (b) frequency vs. LCC size for the random expectation, annotated with observed LCC = 24, z-score = 23.42, p-value < $10^{-6}$.]

Figure 3: (a) Distribution of the fraction of disease proteins within the largest connected component (LCC) for 70 diseases. Only 10%-30% of the disease proteins are part of the LCC. (b) LCC size of proteins associated with lysosomal storage disease compared to random expectation. Out of 45 disease proteins, 24 (53%) are part of the LCC (z-score = 23.42, empirical p-value < $10^{-6}$).

In total, we obtained 141,296 interactions between 13,460 proteins, 1,531 of which are associated with one or more diseases. Examining the subgraphs consisting of proteins associated with the same disease, we found that the largest connected component (LCC) typically contains only 10%-30% of the disease proteins (Figure 3a). This surprisingly low fraction has been shown to be a direct consequence of the incompleteness of currently available interactome maps [46]. Yet, despite this apparent scattering, the observed agglomeration is typically still higher than expected for randomly distributed proteins (Figure 3b). The LCCs of 49 (out of 70) diseases are significantly larger (z-score > 1.6) than random expectation (Figure 4a, Tables 1 and 2). To explore the possible influence of noise in the underlying Interactome on the observed clustering, we repeated the analysis on perturbed networks with varying degrees of noise and incompleteness. Figure 4b shows that ~50% of all diseases exhibit significant LCCs even after removing or randomizing up to 90% of the links in the network, indicating that the finding that disease proteins tend to reside in specific network neighborhoods is remarkably robust.
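The LCC significance test described in this section can be sketched as follows. This is a minimal illustration rather than the code used in this study: the function names, the toy chain network, the uniform random sampling of control sets and the default of 1000 randomizations (the tables report empirical p-values from $10^{6}$ simulations) are assumptions made for the example.

```python
import random
from statistics import mean, pstdev


def lcc_size(adjacency, nodes):
    """Size of the largest connected component of the subgraph induced by `nodes`."""
    nodes = set(nodes)
    seen, best = set(), 0
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:                      # depth-first search restricted to `nodes`
            u = stack.pop()
            size += 1
            for v in adjacency[u]:
                if v in nodes and v not in seen:
                    seen.add(v)
                    stack.append(v)
        best = max(best, size)
    return best


def lcc_zscore(adjacency, disease_proteins, n_random=1000, seed=0):
    """Compare the observed LCC size with LCC sizes of equally large random node sets."""
    rng = random.Random(seed)
    all_nodes = sorted(adjacency)
    observed = lcc_size(adjacency, disease_proteins)
    null = [lcc_size(adjacency, rng.sample(all_nodes, len(disease_proteins)))
            for _ in range(n_random)]
    mu, sigma = mean(null), pstdev(null)
    z = (observed - mu) / sigma if sigma > 0 else float("inf")
    p_emp = sum(s >= observed for s in null) / n_random   # empirical p-value
    return observed, z, p_emp


# Hypothetical toy network: a chain of 40 nodes with 5 adjacent "disease proteins".
chain = {i: {j for j in (i - 1, i + 1) if 0 <= j < 40} for i in range(40)}
observed, z, p_emp = lcc_zscore(chain, disease_proteins=[0, 1, 2, 3, 4])
```

In this toy setting the five adjacent nodes form a single component, while random sets of five nodes are mostly disconnected, so the z-score lies well above the 1.6 threshold used in the text.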

From a network science perspective, the task of identifying these disease neighborhoods can be considered a community detection problem. Numerous algorithms [84–92, 96] define a community as a locally dense subgraph in a network (Figure 2). In order to evaluate the extent to which such topological community detection algorithms can be used to predict disease modules, we chose three representative, methodologically distinct algorithms that have been successfully applied to identify communities of functionally related proteins (functional modules) in protein interaction networks: (i) A link community algorithm [86], which is based on link similarities and provides a hierarchical clustering of all links in the network. We use the default cut-off at the optimal partition density. (ii) The parameter-free Louvain method [96], which maximizes the global modularity of the network. (iii) The Markov Cluster Algorithm (MCL) [94], which detects dense regions of the network based on random flow. We use the

Disease                                 #genes (LCC)  z-score   p-value
Adrenal gland diseases                  18 (5)        8.13      3.09e-4
Alzheimer disease                       29 (6)        6.55      8.13e-4
Amino acid metabolism inborn errors     52 (13)       10.27     2.5e-5
Amyotrophic lateral sclerosis           21 (2)        1.33      0.25
Anemia aplastic                         21 (9)        14.49     2.12e-4
Anemia hemolytic                        29 (7)        8         2.12e-4
Aneurysm                                15 (4)        7.22      1.15e-3
Arrhythmias cardiac                     30 (5)        4.91      3.87e-3
Arthritis rheumatoid                    42 (9)        7.95      2.53e-4
Asthma                                  37 (3)        1.53      0.12
Arterial occlusive diseases             44 (4)        2.19      0.06
Arteriosclerosis                        38 (4)        2.66      0.03
Basal ganglia diseases                  45 (8)        6.39      1.13e-3
Behcet syndrome                         13 (2)        2.57      0.11
Bile duct diseases                      31 (2)        0.6       0.46
Blood coagulation disorders             40 (25)       26.91     0.0
Blood platelet disorders                26 (7)        8.82      1.03e-4
Breast neoplasms                        40 (18)       18.74     0.0
Carbohydrate metabolism inborn          77 (11)       4.94      4.31e-3
Carcinoma renal cell                    18 (3)        3.84      0.02
Cardiomyopathies                        50 (12)       9.65      6.6e-5
Cardiomyopathy hypertrophic             22 (4)        1.86      4.96e-3
Celiac disease                          36 (2)        0.34      0.56
Cerebellar ataxia                       30 (2)        0.66      0.44
Cerebrovascular disorders               47 (4)        1.98      0.07
Charcot-marie-tooth disease             27 (5)        5.46      2.32e-3
Colitis ulcerative                      56 (4)        1.44      0.12
Colorectal neoplasms                    42 (16)       15.83     0.0
Coronary artery disease                 31 (2)        0.6       0.46
Crohn disease                           72 (10)       4.82      4.91e-3
Death sudden                            19 (1)        -0.49     1.0
Diabetes mellitus type 2                73 (9)        4.03      9.83e-3
Dwarfism                                26 (3)        2.5       0.05
Esophageal diseases                     24 (3)        2.76      0.04
Exophthalmos                            13 (2)        1.58      0.11

Table 1: List of the 70 diseases (1-35 shown here) considered in this study, together with their respective number of associated genes and the size of their largest connected component (LCC) on the Interactome, as well as its significance compared to randomly selected genes as given by the z-score and the empirical p-value obtained from $10^{6}$ simulations.

Disease                                 #genes (LCC)  z-score   p-value
Glomerulonephritis                      18 (3)        3.83      0.02
Gout                                    13 (1)        -0.33     1.0
Graves disease                          13 (2)        2.57      0.11
Head and neck neoplasms                 35 (4)        2.94      0.03
Hypothalamic diseases                   23 (2)        1.15      0.29
Leukemia b-cell                         17 (2)        1.82      0.18
Leukemia myeloid                        43 (17)       16.67     0.0
Lipid metabolism                        50 (4)        11.62     2e-6
Liver cirrhosis                         24 (2)        1.07      0.32
Liver cirrhosis biliary                 23 (2)        1.15      0.29
Lung diseases obstructive               40 (4)        2.49      0.04
Lupus erythematosus                     75 (7)        1.26      0.13
Lymphoma                                24 (2)        1.07      0.32
Lysosomal storage                       45 (24)       23.42     0.0
Macular degeneration                    44 (8)        6.53      9.36e-4
Metabolic syndrome x                    14 (3)        5.06      8.52e-3
Motor neuron disease                    31 (2)        0.6       0.46
Multiple sclerosis                      69 (11)       5.87      1.89e-3
Muscular dystrophies                    36 (12)       12.86     2e-6
Mycobacterium                           22 (4)        4.86      4.91e-3
Myeloproliferative disorders            19 (6)        9.76      6.1e-5
Metabolic and nutritional diseases      599 (270)     4.04      2e-6
Peroxisomal disorders                   20 (17)       30.86     0.0
Psoriasis                               54 (5)        2.47      0.04
Purine-pyrimidine metabolism inborn     16 (2)        1.98      0.16
Renal tubular transport inborn errors   34 (3)        1.74      0.10
Sarcoma                                 25 (7)        9.13      8.4e-5
Spastic paraplegia hereditary           20 (1)        -0.51     1.0
Spinocerebellar ataxias                 28 (2)        0.78      0.40
Spinocerebellar degenerations           30 (2)        0.65      0.44
Spondylarthropathies                    18 (4)        5.99      2.26e-3
Tauopathies                             35 (9)        9.32      5.6e-5
Uveal diseases                          17 (3)        4.07      0.01
Varicose veins                          20 (1)        -0.51     1.0
Vasculitis                              15 (2)        2.16      0.14

Table 2: List of the 70 diseases (36-70 shown here) considered in this study, together with their respective number of associated genes and the size of their largest connected component (LCC) on the Interactome, as well as its significance compared to randomly selected genes as given by the z-score and the empirical p-value obtained from $10^{6}$ simulations.

[Figure 4 graphic: box plots of the LCC z-score distribution. (a) Original network. (b) z-score distribution vs. fraction of randomized links (0.1, 0.3, 0.5, 0.7, 0.9), labeled "link rewiring".]

Figure 4: (a) Significance of the LCC sizes as measured by the z-score for all 70 considered diseases. The whiskers indicate the minimum, 25th, 50th and 75th percentiles and the maximum across all diseases. Overall, 70% of the diseases show significant clustering (z-score > 1.6). (b) LCC z-score distribution in noisy networks in which a fraction f of all links is randomized by either link removal or rewiring.

default settings (inflation parameter r = 2) of version mcl-12-068. Each of these methods identifies a large number of communities within the Interactome (Figures 5a, 5c and 5e). In order to evaluate whether some of these communities may be candidates for specific disease modules, we determined their enrichment with known disease proteins. We found that only about 1%-5% of the communities detected by the different methods are significantly enriched (p-value < 0.05, Fisher's exact test) with any set of disease proteins (Figure 6). Conversely, only 15% of the diseases have any significantly enriched community. As these significantly enriched communities cover only about 15%-38% of all proteins associated with the respective disease, we were unable to assign a single connected disease module to any of these diseases (Figures 5b, 5d and 5f).

[Figure 5 graphic: six panels. Left column (a, c, e): number of communities vs. community size. Right column (b, d, f): number of diseases vs. disease-gene coverage [fraction]. Link Clustering: 48 enriched communities, 11 diseases significantly enriched in any community, <coverage> = 15%. Louvain method: 8 enriched communities, 12 diseases significantly enriched in any community, <coverage> = 38%. MCL: 22 enriched communities, 16 diseases significantly enriched in any community, <coverage> = 18%.]

Figure 5: We applied three representative community detection algorithms to explore the extent to which topological modules correspond to disease modules. Each row shows the size distribution of the topological communities in the HI and the number of community-disease pairs with significant overlap vs. their Jaccard similarity J for (a, b) link clustering, (c, d) the Louvain method and (e, f) the MCL method, respectively. No identified topological community coincides (J=1) with a full set of disease genes.

[Figure 6 graphic: number of communities (logarithmic scale) for Link Clustering, Louvain and MCL; legend: all communities, communities with at least one disease gene, communities with p-value < 0.05; percentage labels as extracted: 33%, 36%, 23%, 0.07%, 1%, 5%.]

Figure 6: Only 1%-5% of the communities detected by the different methods are significantly enriched with disease proteins, none of which includes a significant fraction of all disease proteins.

These results suggest that while topological communities may often represent meaningful functional modules [106], they are not able to capture disease modules. One possible reason for this may be that disease proteins do not constitute particularly dense subgraphs.

To quantify the extent to which disease proteins correspond to topological communities, we use the local modularity R [92], a key measure used in community detection. The community character of a set of nodes C is determined by the “sharpness” of its boundary, i.e. by how well it is separated from the rest of the network. The boundary B consists of all nodes in C that have connections to nodes outside the community. The local modularity R is then defined as the number of links attached to nodes in B that do not leave the community, normalized by their total number of links. This can be written as

$$R = \frac{\sum_{ij} B_{ij}\,\delta(i,j)}{\sum_{ij} B_{ij}} \tag{5}$$

where $B_{ij}$ is the adjacency matrix of the boundary nodes, and $\delta(i,j) = 1$ if both nodes i and j are in C and $\delta(i,j) = 0$ otherwise. R = 1 corresponds to perfect modularity and R ~ 0 to randomly assigned communities. If we consider the known disease-associated proteins as communities, we find that R < 0.01 for 97% of the diseases, with no disease exceeding R = 0.07 (Figure 7a). While these values are still significantly different from the random expectation R ~ 0, the communities resulting from optimizing R are unlikely to represent meaningful disease modules. The comparison with the random control was done by selecting, for each disease, the same number of proteins at random from the Interactome (100 times). We then used a Kolmogorov-Smirnov test to estimate the significance of the difference between the distribution for disease proteins and the respective distribution obtained in the randomization.
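The local modularity of Eq. (5) can be computed directly from its verbal definition. The sketch below is only an illustration: the function name, the toy graph and the convention of returning 1.0 for a node set without boundary nodes are assumptions made here, and each undirected link attached to a boundary node is counted once.

```python
def local_modularity(adjacency, community):
    """R = (links at boundary nodes staying inside C) / (all links at boundary nodes).

    adjacency: dict mapping each node to the set of its neighbors (undirected graph).
    community: iterable of nodes forming the candidate community C.
    """
    community = set(community)
    # Boundary B: nodes of C with at least one neighbor outside C.
    boundary = {u for u in community
                if any(v not in community for v in adjacency[u])}
    internal = total = 0
    counted = set()
    for u in boundary:
        for v in adjacency[u]:
            edge = frozenset((u, v))
            if edge in counted:           # count each undirected link only once
                continue
            counted.add(edge)
            total += 1
            if v in community:
                internal += 1
    # Convention assumed here: a set with no boundary is perfectly separated.
    return internal / total if total else 1.0


# Hypothetical toy graph: two triangles {a, b, c} and {d, e, f} joined by the link c-d.
graph = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
r_triangle = local_modularity(graph, {"a", "b", "c"})   # one boundary node, c
r_scattered = local_modularity(graph, {"a", "d"})       # no links between a and d
```

The triangle has a sharp boundary (two of the three links at its boundary node stay inside), whereas the scattered pair has R = 0, mirroring the low values reported for disease proteins.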

Yet, disease proteins do exhibit distinct and predictive connectivity patterns that can be captured and exploited if we evaluate the significance of their connections instead of their density. Consider a network of N proteins containing a relatively small number ($s_0$) of seed proteins associated with a particular disease. For randomly scattered seed proteins, the probability that a protein with a total of k links has exactly $k_s$ links to seed proteins is given by the hypergeometric distribution:

$$p(k, k_s) = \frac{\binom{s_0}{k_s}\binom{N - s_0}{k - k_s}}{\binom{N}{k}} \tag{6}$$

To evaluate whether a certain protein has more connections to seed proteins than expected under this null hypothesis, we calculate the connectivity p-value, i.e. the cumulative probability for the observed or any higher number of connections:

$$p\text{-value}(k, k_s) = \sum_{k_i = k_s}^{k} p(k, k_i) \tag{7}$$
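Eqs. (6) and (7) translate into a short Python sketch using exact binomial coefficients. The function names and the small numerical example are illustrative assumptions; the sum is truncated at min(k, s0) because all later terms of Eq. (7) vanish.

```python
from math import comb


def hypergeom_pmf(N, s0, k, ks):
    """Eq. (6): probability that a protein with k links has exactly ks links to seeds."""
    return comb(s0, ks) * comb(N - s0, k - ks) / comb(N, k)


def connectivity_pvalue(N, s0, k, ks):
    """Eq. (7): probability of observing ks or more links to seed proteins."""
    upper = min(k, s0)   # terms with more seed links than seeds are zero
    return sum(hypergeom_pmf(N, s0, k, ki) for ki in range(ks, upper + 1))


# Hypothetical example: 10 proteins, 3 seeds, a protein with 4 links, 2 of them to seeds.
p_small = connectivity_pvalue(N=10, s0=3, k=4, ks=2)   # (63 + 7) / 210 = 1/3
```

Because the p-value accounts for the degree k, a high-degree protein needs proportionally more seed links to reach the same significance as a low-degree one.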

[Figure 7 graphic: (a) frequency vs. local modularity; (b) frequency vs. connectivity significance (logarithmic axes); legend entries as extracted: Random Proteins, Disease Proteins, Degree preserving randomization, Full randomization.]

Figure 7: (a) Comparison of the distribution of the local modularity R for disease proteins and proteins randomly selected from the Interactome. (b) Distribution of the connectivity significance of disease proteins and randomly selected proteins.

The use of the significance of the number of connections instead of their absolute number reduces the spurious detection of high-degree proteins. Figure 7b shows that the connectivity p-values within the sets of known disease proteins are very significantly (p-value < $10^{-241}$, Kolmogorov-Smirnov test) shifted towards smaller values when compared to the distributions expected for randomly scattered proteins. For example, the randomization procedure never yields connectivity significance values smaller than $10^{-5}$, while 60% of the disease proteins have a connectivity significance smaller than this value, some as small as $10^{-23}$. Again, the comparison with the random control was done by randomly assigning proteins to each disease (100 times). We derived the significance of the difference between the two distributions using the Kolmogorov-Smirnov test.

Taken together, these results show that disease proteins exhibit distinct interaction patterns among each other that suggest the existence of specific disease modules within the Interactome. Yet, these modules apparently do not coincide with topological communities of densely interconnected proteins. In principle, this discrepancy could either be a mere consequence of incomplete Interactome and gene-disease association data [40, 64, 65], or reflect an inherent fundamental difference between disease and topological modules. To investigate this question, we compared the behavior of the two relevant measures, local modularity and connectivity significance, for different levels of completeness of the underlying network. Figure 8a shows that the connectivity significance of disease genes slowly drops as more and more links are removed. Conversely, this trend indicates that the predictive power of the connectivity significance should continuously increase as the Interactome becomes more and more complete. For the local modularity measure, however, we see a very different behavior. Figure 8b shows that the modularity remains roughly constant or even slightly increases as the network completeness decreases, similar to the behavior observed for random expectation. The reason for this somewhat unintuitive behavior is that random removal affects links between disease proteins to the same extent as links to other proteins, thereby leaving their relative relationship, on average, unchanged (Figure 8c). We therefore expect that with increasing network completeness, the local modularity among disease proteins will not significantly increase. These results suggest that topological communities are not able to significantly capture disease proteins, regardless of the level of network completeness. Connectivity significance, on the other hand, captures the interaction patterns between disease proteins more and more distinctively as the network approaches the complete network.

2.2 The DIAMOnD algorithm

Building on the observation that the connectivity significance is highly distinctive for known disease proteins, we propose the following algorithm to infer yet unknown disease proteins (Figure 9), and hence to identify the respective disease module:

[Figure 8 plots. (a, b) Local modularity R and -log(connectivity significance) of disease proteins and random proteins versus the fraction of removed links; annotated p-values: p-value < $10^{-241}$ and p-value = $0.17 \times 10^{-4}$. (c) Schematic of boundary nodes and local community, with R = 0.34 and R = 0.33 at fractions 0.0 and 0.5 of removed links.]

Figure 8: (a) Connectivity significance of disease proteins as a function of the fraction f of links removed from the network. The red bars denote the mean and the standard deviation as measured across 70 diseases, yellow bars show random expectation obtained from the same number of randomly distributed genes. (b) Local modularity of disease proteins and randomly selected proteins when a fraction f of the links is removed from the network. (c) Illustration of the local modularity R.

(i) The connectivity significance (Equation 7) is determined for all proteins connected to any of the $s_0$ seed proteins.

(ii) The proteins are ranked according to their respective p-values.

(iii) The protein with the highest rank (i.e. lowest p-value) is added to the set of seed nodes, increasing their number from $s_0 \to s_1 = s_0 + 1$.

(iv) Steps (i)-(iii) are repeated with the expanded set of seed proteins, pulling in one protein at a time into the growing disease module.

The procedure (i)-(iv) can be continued until the module spans the entire network.
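Steps (i)-(iv) can be sketched in a few lines of Python. This is a plain, unoptimized illustration and not the implementation used in this work: the network is assumed to be a dictionary of neighbor sets, ties between equal p-values are broken alphabetically (an arbitrary choice), and all names are hypothetical.

```python
from math import comb


def connectivity_pvalue(N, s, k, ks):
    """Equation 7 for a network of N nodes and s current seed nodes."""
    numerator = sum(comb(s, ki) * comb(N - s, k - ki)
                    for ki in range(ks, min(k, s) + 1))
    return numerator / comb(N, k)


def diamond(adj, seeds, n_iterations):
    """Return the list of (node, p-value) pairs in order of inclusion."""
    N = len(adj)
    module = set(seeds)
    ranking = []
    for _ in range(n_iterations):
        # Step (i): all nodes linked to the current module.
        candidates = {u for v in module for u in adj[v]} - module
        if not candidates:
            break  # the module cannot grow any further
        scored = []
        for u in candidates:
            k = len(adj[u])
            ks = len(adj[u] & module)
            scored.append((connectivity_pvalue(N, len(module), k, ks), u))
        # Steps (ii)-(iii): pick the lowest p-value and add it to the module.
        pval, best = min(scored, key=lambda t: (t[0], str(t[1])))
        ranking.append((best, pval))
        module.add(best)  # step (iv): repeat with the expanded seed set
    return ranking


# Toy network: "a" links only to both seeds, "h" is a hub with one seed link.
adj = {
    "s1": {"a", "h"}, "s2": {"a"}, "a": {"s1", "s2"},
    "h": {"s1", "x1", "x2", "x3", "x4", "x5"},
    "x1": {"h"}, "x2": {"h"}, "x3": {"h"}, "x4": {"h"}, "x5": {"h"},
}
ranking = diamond(adj, {"s1", "s2"}, 3)
```

In the toy network the low-degree node `a` is selected before the hub `h`, although both are direct neighbors of seed proteins, because both of its two links lead to seeds.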

The order in which the proteins are pulled into the module reflects their topological relevance to the disease, resulting in a ranking of all proteins. Figure 10 shows a subgraph of the Interactome highlighting the seed proteins associated with macular degeneration and the first 50 DIAMOnD genes. Furthermore, as detailed below, the algorithm can be easily adapted to incorporate additional features, in particular weighted links and/or protein associations.

2.2.1 Time complexity

[Figure 9 schematic. Three iterations of the algorithm with $s_0 = 15$, $s_1 = 16$ and $s_2 = 17$ seed proteins in a network of $N = 10^4$ nodes; candidate proteins are labeled with their connectivity p-values. Legend: module protein, detected protein, candidate protein.]

Figure 9: The DIAMOnD algorithm. At each step of the iterative algorithm, the connectivity significance of all immediate neighbors of disease proteins is calculated. Next, the most significantly connected node (lowest p-value) is integrated into the module, thus expanding the module by one node per iteration step.

Calculating tens to hundreds of p-values at each iteration is computationally expensive; we have therefore implemented an efficient calculation to reduce the execution time. The number of p-value calculations can be considerably reduced by noticing that two proteins with the same value of either $k_s$ or $k$ can be ranked directly according to their value in the respective other parameter (see Eqs. 6 and 7). If two proteins have the same degree $k$, the one with the higher $k_s$ has fewer terms in the sum in Eq. 7 and consequently a lower p-value. Similarly, of two proteins with the same number of connections to seeds $k_s$, the one with the lower $k$ has the lower p-value. This results in the following procedure: At each iteration step, we first classify the nodes based on their $k_s$ and rank the node with the lowest $k$ highest within each class. Next, we classify the top-ranked nodes of each class by their degree $k$ and choose the ones with the highest $k_s$. Finally, we calculate the exact p-value for the remaining nodes. This procedure guarantees that the number of candidate nodes is reduced to at most $s$ nodes per iteration, as $k_s$ cannot exceed $s$ (note that $s_i \to s_i + 1$ at each iteration). In the worst-case scenario, and without further reducing the candidate nodes by their degree $k$, we are left with $s$ nodes for which we need to calculate p-values. Assuming we need to identify $N$ nodes from the network, the time complexity of the algorithm is of the order

$$s + (s + 1) + \dots + (N - 1) + N \sim \frac{N(N-1)}{2} = O(N^2).$$

This compares favorably with other well-established algorithms such as the random walk based method, whose complexity is between $O(N \log N)$ and $O(N^3)$.
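The two-stage reduction of candidates described above can be sketched as follows. The representation of candidates as a dictionary of $(k, k_s)$ pairs and the function name `reduce_candidates` are assumptions made for this illustration; ties within a class are resolved by keeping the first node encountered.

```python
def reduce_candidates(candidates):
    """Keep only nodes that can still have the lowest p-value.

    candidates: dict mapping node -> (k, ks), where k is the degree and
    ks the number of links to the current seed set.
    """
    # Stage 1: within each class of equal ks, keep the node with lowest k.
    best_per_ks = {}
    for node, (k, ks) in candidates.items():
        current = best_per_ks.get(ks)
        if current is None or k < candidates[current][0]:
            best_per_ks[ks] = node
    # Stage 2: among the survivors, within each class of equal k,
    # keep the node with the highest ks.
    best_per_k = {}
    for node in best_per_ks.values():
        k, ks = candidates[node]
        current = best_per_k.get(k)
        if current is None or ks > candidates[current][1]:
            best_per_k[k] = node
    # Exact p-values are only needed for the nodes returned here.
    return sorted(best_per_k.values())


candidates = {
    "a": (5, 2), "b": (3, 2), "c": (3, 1), "d": (10, 4), "e": (3, 3),
}
remaining = reduce_candidates(candidates)
```

In this example five candidates are reduced to two: `b` beats `a` at equal $k_s$, and `e` beats `b` and `c` at equal $k$, so only `d` and `e` require an exact p-value.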

2.3 DIAMOnD performance and robustness

In order to systematically evaluate the performance of DIAMOnD we first used a well-controlled test scenario by constructing synthetic modules of proteins within the Interactome. We analyzed the extent to which DIAMOnD can recover the full module if we remove the disease association from a certain fraction of proteins, thus obtaining a seed cluster that is no longer fully connected.

[Figure 10 network diagram. Legend: seed proteins, crucial seeds, DIAMOnD proteins (first 50 iterations, labeled by iteration step), other proteins; link types: seed-seed, seed-DIAMOnD, DIAMOnD-DIAMOnD, other.]

Figure 10: Subgraph of the Interactome highlighting the seed proteins for macular degeneration and the first 50 corresponding DIAMOnD proteins. In the beginning, two separate clusters grow independently until they merge at iteration step 50. Note that DIAMOnD also proposes proteins that do not have direct connections to seed proteins, e.g. at iteration steps 12 and 15. The squares mark seed proteins whose removal leads to large differences in the resulting DIAMOnD modules. The three leftmost squares, for example, enable the identification of a protein at iteration step 23, which in turn triggers the inclusion of the cluster of proteins depicted underneath, which would be absent otherwise.

[Figure 11 plots. (a) Schematic of a random node and the resulting module. (b) Module size versus degree of the random seed node on logarithmic axes, with module size 200 marked.]

Figure 11: Properties of the synthetic Shell modules. (a) Illustration of the construction process: An initial node is selected at random and all first and second neighbors are added to the module. The exact topological properties of the resulting modules depend on the initial node. Panel (b) shows how the synthetic module size varies with the degree of the initial node.

2.3.1 Synthetic modules construction

There are many different possibilities to construct a connected set of nodes in a network, generally leading to modules with different topological properties. We implemented two different methods:

(i) Shell modules: We randomly selected one node from the network and added all its first and second neighbors to the module (Figure 11a). Depending on the particular starting node, the constructed module may vary in size (Figure 11b). Most diseases in our curated corpus have between 50 and 150 currently identified disease proteins. Assuming that these represent only 30%-50% of all associated proteins, we chose 200 as the putative size of complete disease modules within the Interactome.

(ii) Connectivity significance modules: We started from a randomly selected node and iteratively added the most significantly connected node to the module until its size reached 200 nodes. This process produces modules with topological properties similar to those observed for real diseases.
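The shell construction in (i) amounts to collecting all nodes within two steps of a starting node. The following sketch assumes a dictionary-of-sets network; the function names are hypothetical, and any selection of modules by their final size is left out.

```python
import random


def shell_module(adj, start):
    """Return the start node together with its first and second neighbours."""
    first = set(adj[start])
    second = {w for v in first for w in adj[v]}
    return {start} | first | second


def random_shell_module(adj, rng):
    """Pick a random start node and build its shell module."""
    start = rng.choice(sorted(adj))  # sorted for reproducible sampling
    return start, shell_module(adj, start)


# Toy path graph a-b-c-d-e.
adj = {
    "a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"},
    "d": {"c", "e"}, "e": {"d"},
}
from_end = shell_module(adj, "a")
from_middle = shell_module(adj, "c")
start, module = random_shell_module(adj, random.Random(0))
```

As in Figure 11b, the size of the resulting module depends on the starting node: on the path graph, starting at the end yields three nodes, starting in the middle yields all five.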

2.3.2 Estimating the recovery rate

For each initially connected synthetic module, we randomly removed a certain fraction (25%, 50% and 75%) of the nodes and used the remaining nodes as seed proteins for DIAMOnD. Figures 12a and 12b show the fraction of removed nodes that are recaptured (recall) as a function of the number of iterations of the algorithm for 50% of the module removed. As expected, the highest rate of true positives is achieved in early iterations, so the highest ranked proteins are most likely to be part of the original full module.

In both shell and connectivity modules, we find that the total recall of the removed nodes is relatively insensitive to the incompleteness of the seed set, i.e. the fraction of removed seed nodes (Figure 12c, 12d). The observation that a similar number of proteins can be recalled from a 25% subset of the full module and from a 75% subset can be used to address a critical limitation of prioritization methods that only provide a ranking of all proteins, yet offer no objective criterion for the total number of biologically relevant proteins. Indeed, estimating the true positive rate is inherently difficult as the true set of proteins is by definition unknown. However, since the recall of DIAMOnD does not depend on the unknown total number of disease proteins, we can estimate it by further pruning a given incomplete set of known disease proteins. We tested this procedure on our set of 70 diseases by removing 10%, 20% and 30% of the respective known disease proteins; see Figure 12e, 12f for two examples, lysosomal storage diseases and lipid metabolism disorders, respectively. Generally, the recall is found to be higher when disease associations are preferentially removed from proteins that are part of the original LCC. The reason is that, unlike other clustering methods, DIAMOnD is insensitive to the connections between the seed proteins themselves. In fact, only the connections of seed proteins to the rest of the network determine the detected DIAMOnD nodes, and removing nodes from the LCC is less likely to alter those connections.
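The recovery experiment can be summarized in a short sketch: a module is split into retained seeds and removed nodes, and the recall after each iteration is the fraction of removed nodes found so far. The helper names and the toy ranking are hypothetical; the ranking would come from running the algorithm on the retained seeds.

```python
import random


def split_module(module, fraction, rng):
    """Remove a random fraction of the module; return (seeds, removed)."""
    nodes = sorted(module)
    removed = set(rng.sample(nodes, round(fraction * len(nodes))))
    return set(nodes) - removed, removed


def recall_curve(ranking, removed):
    """Recall after each iteration: recovered removed nodes / all removed."""
    removed = set(removed)
    hits = 0
    curve = []
    for node in ranking:
        if node in removed:
            hits += 1
        curve.append(hits / len(removed))
    return curve


seeds, removed = split_module(set(range(20)), 0.5, random.Random(0))
# Toy ranking: "a" and "b" were removed from the module, "x" and "y" were not.
curve = recall_curve(["a", "x", "b", "y"], {"a", "b", "c"})
```

Averaging such curves over many modules gives recall plots of the kind shown in Figure 12.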

2.3.3 Analyzing the sensitivity towards perturbations and network noisiness

Both the network data and the disease associations are inherently noisy and expected to contain a considerable number of false positives. The similar recall from different levels of seed protein incompleteness suggests, however, that collectively the seed proteins and their interactions provide sufficient predictive power to yield robust predictions. In order to evaluate how sensitive the DIAMOnD outcome is with respect to variations in the set of seed genes, we performed an N-1 analysis: We modified the initial seed protein set by removing one of the $s_0$ proteins at a time, resulting in $s_0$ different DIAMOnD sets. Comparing the resulting sets of DIAMOnD proteins to the original predictions obtained from the full seed set, we find that the methodology is very robust, yielding overlaps close to 100% in most cases. Individually, most seed proteins can be removed without considerably changing the resulting DIAMOnD proteins. There are, however, typically a small number of nodes whose removal results in a drastic change of the final outcome. The deviation caused by a specific node removal may occur in the initial iterations and disappear over the long run (Figure 13a, green data points) or persist across all iterations (red data points). These latter nodes are therefore more important for the integrity of the seed set.

We quantify the extent to which the removal of a seed node affects the outcome by two parameters: (i) the deviation from the original outcome and (ii) the persistence of that deviation for many iterations:

[Figure 12 plots. (a, b) Average recall over 100 modules versus iteration for synthetic shell modules (a) and synthetic connectivity modules (b). (c, d) Recall versus iteration for 25%, 50% and 75% node removal in synthetic shell modules (c) and synthetic connectivity modules (d). (e, f) Recall versus iteration for 10%, 20% and 30% node removal in lysosomal storage diseases (e) and lipid metabolism disorders (f).]

Figure 12: Performance evaluation of DIAMOnD. We use two different methods to construct synthetic modules (shells and connectivity modules). (a, b) Recovery rate of the DIAMOnD algorithm when removing 50% of seed nodes from shells (a) and connectivity synthetic modules (b), respectively. The recovery rate in synthetic modules is roughly independent of the module incompleteness. (c, d) Recovery rate when 25%, 50% and 75% of the nodes are removed from shells and connectivity modules. (e, f) Recovery rate when 10%, 20% and 30% of the nodes are removed from the disease proteins of lysosomal storage diseases and lipid metabolism disorders.

$$\text{deviation} = 1 - \text{overlap} \tag{8}$$

where the overlap is measured by the number of proteins that are in common between the original DIAMOnD outcome and the DIAMOnD outcome after the removal of seed genes. The persistence of a deviation is measured as:

$$\text{persistence} = \frac{\text{number of iterations during which the deviation exists}}{\text{total number of iterations}} \tag{9}$$

High persistence indicates that the removal of a node results in a deviation that holds across all iterations. However, typically we find that the perturbations introduced by removing a single seed node are compensated after a few iterations.
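Equations 8 and 9 can be illustrated with a small sketch that compares two rankings iteration by iteration. Two points are assumptions made here and are not stated in the text: the overlap at iteration $i$ is normalized by $i$ so that it lies between 0 and 1, and a deviation is said to exist whenever it exceeds a threshold (zero by default). The function names are hypothetical.

```python
def deviation_curve(original, perturbed):
    """Equation 8 evaluated for the first i predictions, i = 1, 2, ..."""
    n = min(len(original), len(perturbed))
    curve = []
    for i in range(1, n + 1):
        common = set(original[:i]) & set(perturbed[:i])
        overlap = len(common) / i  # assumed normalization to [0, 1]
        curve.append(1 - overlap)
    return curve


def persistence(curve, threshold=0.0):
    """Equation 9: fraction of iterations in which a deviation exists."""
    return sum(d > threshold for d in curve) / len(curve)


# Toy rankings: removing one seed swaps the order of "b" and "c".
original = ["a", "b", "c", "d"]
perturbed = ["a", "c", "b", "d"]
curve = deviation_curve(original, perturbed)
p = persistence(curve)
```

Here the deviation appears only at the second iteration and is compensated afterwards, which corresponds to a low persistence of 0.25.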

Figure 13b shows the degree of the nodes that cause deviations of different persistence. Crucial nodes with high persistence are characterized by a high degree (generally a several-fold increase compared to both the average degree of the network, $\langle k \rangle = 20.7$, and the average degree of the disease proteins, $\langle k_{\text{disease}} \rangle = 28.9$). Interestingly, we further observe that crucial nodes whose removal will be most destructive are generally not part of the largest connected component of the initial seed set. Instead, the disease modules are robust towards removing disease proteins from the LCC, as these proteins will be recovered early on due to their significant connectivity. Similar results are obtained when noise is introduced in the underlying network.

We used two models to construct ensembles of randomized networks with varying

degrees of noise and incompleteness compared to the original Interactome:

(i) To investigate the effects of network incompleteness we construct pruned networks

by removing a fraction of randomly selected links from the Interactome.

[Figure 13 plots. (a) Number of true hits (overlap) versus iteration, for crucial nodes (persistent deviation) and non-crucial nodes. (b) Average degree ⟨k⟩ versus persistence for deviations > 0.5, > 0.8, > 0.9 and > 0.99, together with the number of nodes per disease and the network average ⟨k⟩.]

Figure 13: (a) Robustness of the DIAMOnD algorithm towards small variations in the starting seed proteins (N-1 analysis). While most nodes influence the outcome very little, there are a few nodes whose removal results in a large deviation from the original outcome. This deviation may either persist across iterations (red data points) or disappear after a few iterations (green). (b) Crucial nodes are characterized by a 3-4 times higher degree.

(ii) To explore the impact of noise in the Interactome we use partially rewired networks in which a fraction of randomly selected links are split and then randomly reconnected. This procedure corresponds to the configuration model [107, 108] and does not alter the degrees of the nodes, i.e. only the specific interaction partners of the nodes are randomized, not their overall number. Note that the original network is perturbed considerably even at small fractions of rewired links, as existing links are removed and new ones are established simultaneously.
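Both perturbation models can be sketched on an edge list. The rewiring below splits the selected links into stubs and re-pairs them at random, which keeps all node degrees fixed; rejecting self-loops and duplicate links by reshuffling is an assumption of this sketch, as are the function names and the ring-shaped toy network.

```python
import random
from collections import Counter


def prune_links(edges, fraction, rng):
    """Model (i): remove a random fraction of the links."""
    edges = list(edges)
    drop = set(rng.sample(range(len(edges)), round(fraction * len(edges))))
    return [e for i, e in enumerate(edges) if i not in drop]


def rewire_links(edges, fraction, rng, max_tries=1000):
    """Model (ii): split a random fraction of links and reconnect the stubs."""
    edges = [tuple(e) for e in edges]
    chosen = set(rng.sample(range(len(edges)), round(fraction * len(edges))))
    kept = [e for i, e in enumerate(edges) if i not in chosen]
    kept_keys = {frozenset(e) for e in kept}
    stubs = [v for i in sorted(chosen) for v in edges[i]]
    for _ in range(max_tries):
        rng.shuffle(stubs)
        new = [(stubs[i], stubs[i + 1]) for i in range(0, len(stubs), 2)]
        keys = [frozenset(e) for e in new]
        no_loops = all(len(key) == 2 for key in keys)
        no_duplicates = (len(set(keys)) == len(keys)
                         and not set(keys) & kept_keys)
        if no_loops and no_duplicates:
            return kept + new
    raise RuntimeError("no valid rewiring found; try a smaller fraction")


def degrees(edges):
    """Degree of every node that appears in the edge list."""
    return Counter(v for e in edges for v in e)


ring = [(i, (i + 1) % 20) for i in range(20)]
rng = random.Random(1)
pruned = prune_links(ring, 0.25, rng)
rewired = rewire_links(ring, 0.5, rng)
```

Because every stub is reused exactly once, `degrees(rewired)` equals `degrees(ring)`, while pruning lowers the degrees of the affected nodes.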

Figures 14a and 14b show that, regardless of the method we choose to add noise to the network, small variations affecting ~1% of all links in the Interactome have almost no effect on the obtained DIAMOnD genes. Up to 5% of the Interactome can be completely randomized while still retrieving more than 70% of the original set of DIAMOnD genes for more than half of all diseases.

[Figure 14 plots. (a, b) Overlap versus iteration for f = 0.01, f = 0.02 and f = 0.05, with the 0.70 coverage level marked.]

Figure 14: (a) DIAMOnD robustness towards random link removal from the Interactome. We identified the DIAMOnD proteins for 70 diseases in the original Interactome as well as in perturbed networks with varying fractions f of randomly removed links. Data points and bars represent the median and median absolute deviation of the overlap (number of common proteins) between original and randomized DIAMOnD sets across 70 diseases as a function of the iteration step. (b) Same as (a), but for perturbed networks in which varying fractions f of all links have been randomly rewired.

2.4 Identifying and validating disease modules

Next we explore the performance of DIAMOnD on 70 real diseases. Since the full set of disease proteins is, by definition, unknown, we cannot assess the performance directly in terms of true positives/negatives. We therefore use publicly available gene annotation data, GeneOntology [109] and biological pathways from MSigDB [110], to validate the DIAMOnD disease modules: For each disease we determine a reference set of all significantly enriched GO terms and pathways within the set of seed proteins. We then compare the respective annotations of each DIAMOnD gene to this reference set, assuming that proteins with annotations similar to the ones of the seed genes are more likely to be disease associated as well [48, 72, 81]. Therefore, to validate the potential disease relevance of the predicted candidate genes, we compare their biological characteristics to the ones of the initial seed genes using the following workflow:

(i) First we identify the set of GO terms (pathways) that are significantly enriched within the given set of seed genes using Fisher's exact test (Bonferroni corrected p-value < 0.05).

(ii) For each candidate gene we then check whether it is annotated with any of these significant terms. Genes with common annotations are considered as true positives.

(iii) We compare the performance of DIAMOnD genes to seed genes as well as to random expectation for the same number of genes drawn randomly from the network.

The performance is based on the number of candidate genes that are considered true positives. To quantify the statistical significance of a given number of true positives at a given iteration step we use a sliding window approach: At each iteration step i, we consider the same number of candidate genes as there are seed genes for the respective disease. If there are 100 seed genes, for example, we use the genes in the interval [i − 100/2, i + 100/2] and count the number of true positives among these genes. The statistical significance of an observed number is then determined using Fisher's exact test. Matching the number of candidate genes with the number of seed genes allows us to compensate for the dependence of p-values on the underlying set size, thereby enabling us to directly compare DIAMOnD sets at different iteration steps, as well as DIAMOnD genes and seed genes.
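The validation workflow can be sketched with the standard library alone, using the fact that a one-sided Fisher's exact test for enrichment equals a hypergeometric tail probability. Several details are assumptions of this sketch: the Bonferroni factor is taken to be the total number of annotation terms, the sliding window is half-open and truncated at the ends of the ranking, and all names and the toy data are hypothetical.

```python
from math import comb


def enrichment_pvalue(N, n_annotated, n_set, n_overlap):
    """One-sided Fisher's exact test: overlap of n_overlap or more."""
    upper = min(n_annotated, n_set)
    numerator = sum(comb(n_annotated, i) * comb(N - n_annotated, n_set - i)
                    for i in range(n_overlap, upper + 1))
    return numerator / comb(N, n_set)


def significant_terms(annotations, seeds, universe, alpha=0.05):
    """Step (i): terms enriched among the seeds after Bonferroni correction."""
    universe = set(universe)
    seeds = set(seeds) & universe
    n_tests = len(annotations)  # assumed Bonferroni factor
    result = set()
    for term, genes in annotations.items():
        genes = set(genes) & universe
        overlap = len(genes & seeds)
        if overlap == 0:
            continue
        p = enrichment_pvalue(len(universe), len(genes), len(seeds), overlap)
        if p * n_tests < alpha:
            result.add(term)
    return result


def true_positive_flags(ranking, annotations, terms):
    """Step (ii): is each ranked gene annotated with a significant term?"""
    return [any(g in annotations[t] for t in terms) for g in ranking]


def window_hits(flags, i, n_seeds):
    """True positives in a window of n_seeds genes centred on iteration i."""
    half = n_seeds // 2
    return sum(flags[max(0, i - half):min(len(flags), i + half)])


universe = [f"g{i}" for i in range(100)]
annotations = {
    "T1": {f"g{i}" for i in range(10)},
    "T2": {f"g{i}" for i in range(50, 100)},
}
seeds = {"g0", "g1", "g2", "g3", "g4"}
terms = significant_terms(annotations, seeds, universe)
flags = true_positive_flags(["g5", "g60", "g6", "g70"], annotations, terms)
hits = window_hits(flags, 2, 2)
```

The number of hits in a window can then be passed to `enrichment_pvalue` again to judge its significance against the annotated fraction of the whole network.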

Figures 15a and 15b offer examples for the validation according to pathway similarity for lysosomal storage diseases. The first ~200 DIAMOnD genes are found to participate in important seed pathways at a rate similar to the one within the seed proteins themselves and significantly higher than random expectation. In total, 58 out of 70 disease modules can be validated by either GO terms or pathways, 46 by both. Figures 16a and 16b summarize the validation of the disease modules for all 70 diseases. The majority of the detected modules perform several times better than random expectation, in particular in the first 50-100 iterations.

[Figure 15 plots. (a) Number of true hits versus iteration for DIAMOnD genes, random genes and seed genes. (b) Enrichment p-value versus iteration, with the p-value = 0.05 level and the cutoff at iteration 200 marked. (c) Network of seed genes and DIAMOnD genes, the latter colored from early to late prediction.]

Figure 15: Biological evaluation of DIAMOnD. (a) Validation of the DIAMOnD genes based on GeneOntology terms (see Materials & Methods). (b) The significance of the similarity between DIAMOnD genes and seed genes suggests a cutoff of ~200 DIAMOnD genes. (c) Network representation of the lysosomal storage diseases module.

Depending on the specific application, the main interest of applying DIAMOnD could lie either in selecting a small number of most promising disease protein candidates, or in obtaining a larger set of proteins to explore the molecular disease mechanisms in a broader context. For the former case, DIAMOnD directly offers a ranked list of candidates. The latter approach, however, requires an additional criterion to define the boundary of the disease module, i.e. a threshold for the total number of proteins to be considered. This threshold can be chosen by using either (i) topological or (ii) biological properties of the agglomerated proteins.

(i) The connectivity p-values cannot be used directly to define a topological threshold. The reason is that the module grows at each iteration step, i.e. the number of seed genes $s$ on which the p-value in Eqs. (6) and (7) is based also increases. Since larger sets can produce smaller p-values, the absolute significance values obtained at different iteration steps cannot be compared to each other. However, our analysis suggests an alternative approach to define a topological threshold: As discussed above, the recall of the DIAMOnD algorithm does not depend sensitively on the initial level of completeness (Figure 12c-12f). Hence, the true positive rate can be estimated by removing varying fractions of seed proteins. For lysosomal storage disorders, for example, we find an estimated recall of ~50% at iteration 40 (Figure 12e). After 40 iterations, the recall saturates and reaches a plateau, indicating that thereafter only few DIAMOnD proteins are expected to be truly disease associated. This saturation point may therefore be used as a threshold for the total number of DIAMOnD genes to consider.

(ii) A biological criterion for the threshold can be obtained from the validation according to Figures 15a and 15b. The number of DIAMOnD proteins with direct biological evidence reaches a plateau at ~200 iteration steps, suggesting this as the maximal number that should be considered. A more stringent criterion is to use the significance of the enrichment. The enrichment is typically strongest within the highest ranked DIAMOnD proteins and decreases with increasing iteration steps. For lysosomal storage diseases, for example, we find that the first 200 DIAMOnD proteins are similarly significantly enriched as the seed proteins (Figure 15b). The largest connected component of the seed proteins alone consists of 24 (out of 45) proteins. When 200 DIAMOnD proteins are added, the largest connected component of the resulting module integrates 11 additional, previously disconnected seed proteins, resulting in a module consisting of 234 proteins (Figure 15c). Figure 16c shows the distribution of the fraction of integrated seed proteins across 70 diseases for several iterations. We find that with an increasing number of DIAMOnD genes more and more disconnected seed proteins are integrated into the module, thus allowing for an integrated analysis of their molecular mechanism.

2.5 Comparison with existing methods

[Figure 16 plots. (a, b) Fold change versus iteration. (c) Frequency versus fraction of seed proteins in the LCC, for iteration 0 (seed proteins) and iterations 50, 100, 150, 200 and 250.]

Figure 16: (a, b) Summary of the validation for all 70 disease modules based on GeneOntology (a) and biological pathways (b). (c) Fraction of seed proteins that are contained in the LCC of the DIAMOnD module for varying iteration steps. The distributions show the values obtained from 70 diseases. By introducing DIAMOnD proteins, previously disconnected seed proteins become part of the LCC.

In recent years, a number of disease protein prioritization methods [73, 81, 94, 98, 111, 112] have been developed that can in principle be used to identify disease modules. To evaluate the relative performance of DIAMOnD, we implemented a random walk based algorithm (RW) [98] that was shown to outperform other methods and may therefore serve as a reference [81]. This method prioritizes candidate genes based on network diffusion. The seed genes serve as starting points for a random walker that wanders from node to node along the links of the network. At every time step of the iterative algorithm, the walker moves to a randomly selected neighbor of its current position. After every move the walker is reset to a randomly chosen seed gene with a given probability r (we use r = 0.4). After a sufficient number of iterations the frequency with which the nodes in the network are visited converges and can be used to rank the corresponding genes. Genes that are visited more often are considered to be closer to the seed genes and therefore more relevant to the disease than those that are visited less often.
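The visiting frequencies of such a walker can be approximated without simulating individual steps, by iterating the corresponding probability vector until it converges. The sketch below is this deterministic counterpart of the walker described above and is not taken from [98]; the restart probability r = 0.4 follows the text, while the tolerance, the iteration limit and all names are assumptions.

```python
def random_walk_with_restart(adj, seeds, r=0.4, tol=1e-10, max_iter=1000):
    """Rank non-seed nodes by their stationary visiting probability."""
    nodes = sorted(adj)
    seeds = set(seeds)
    # The walker restarts at a uniformly chosen seed.
    p0 = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {v: r * p0[v] for v in nodes}  # restart contribution
        for v in nodes:
            if not adj[v]:
                nxt[v] += (1 - r) * p[v]  # isolated node keeps its mass
                continue
            share = (1 - r) * p[v] / len(adj[v])
            for u in adj[v]:
                nxt[u] += share  # move to a random neighbour
        change = sum(abs(nxt[v] - p[v]) for v in nodes)
        p = nxt
        if change < tol:
            break
    ranking = sorted((v for v in nodes if v not in seeds),
                     key=lambda v: (-p[v], v))
    return ranking, p


# Toy path graph a-b-c-d with a single seed "a".
adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
ranking, prob = random_walk_with_restart(adj, {"a"})
```

On the path graph the nodes are ranked by their distance from the seed, which illustrates the preference of this method for the immediate neighborhood of the seed genes.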

Figures 17a and 17b summarize the results of the comparison between DIAMOnD and RW on the synthetic modules. As we removed the attribute from half of the module nodes (about 100 nodes), iteration step 100 is a reasonable point of comparison. For both types of synthetic modules we find that DIAMOnD has a higher recovery in the top 100 predictions, whereas RW captures more true hits in its late predictions. In most cases DIAMOnD is able to identify removed nodes in the early iterations until the recovery rate saturates (Figure 17a). A higher initial slope corresponds to higher precision, i.e. a higher ratio of true positives TP/(TP+FP). DIAMOnD shows higher precision and sensitivity (recall) in the initial iterations, whereas RW performs better at later iterations once DIAMOnD has saturated. In the context of disease protein identification, a high-quality detection of fewer proteins with few false positives is generally more desirable than a low-quality detection of hundreds of proteins.

We also compared the predictions of DIAMOnD and RW for each of the 70 real dis- ease modules, as illustrated in Figure 17c for lysosomal storage diseases. In general,

DIAMOnD offers several conceptual and practical advantages compared to previous methods: (a) Many methods like RW preferentially select proteins from the immediate neighborhood of the seed proteins. Surprisingly, we find that a considerable fraction of the DIAMOnD proteins do not directly interact with seed genes (Figures 10 and

18a). DIAMOnD thereby offers disease-relevant candidates beyond first-order protein interactions. (b) Physically interacting proteins often share functional annotations and pathways [65, 106]. As a consequence, methods like RW are expected to perform well on generic validation data. In our comprehensive analysis across 70 diseases we are limited to such generic validation data and hence observe a comparable performance when GO term similarity is used as reference. Yet, we find that when we use pathways DIAMOnD

52 0.3 0.8 Random Walk 0.7 Random Walk 0.25 DIAMOnD DIAMOnD 0.6 0.2 0.5 0.15 0.4 0.3 0.1 Average recall Average Average recall Average 0.2 0.05 0.1 synthetic shell modules synthetic connectivity modules 0 0 0 20 40 60 80 100 0 20 40 60 80 100 Iteration Iteration

(a) (b)

120

100

RW GO 80 DIAMOnD GO RW KEGG DIAMOnD KEGG 60 # true hits true# 40 lysosomal storage diseases

20

0 0 100 200 300 400 500 Iteration

(c)

Figure 17: Comparison between DIAMOnD and Random Walk (RW). (a, b) Average recovery rates of DIAMOnD and the reference RW algorithm when removing 50%(100 nodes) of 100 generated shells (a) and connectivity (b) modules. (c) Comparison of the bio- logical evidence for proteins identified by DIAMOnD and RW for lysosomal storage diseases.

53 outperforms RW (Figure 18b). Furthermore, a more focused study on a single disease that used a variety of disease-specific data, e.g. from GWAS, microarray experiments and comorbidity analysis, has experimentally confirmed the specific disease-relevance of the DIAMOnD genes and significant outperformance of DIAMOnD over RW [68].

(c) By design, DIAMOnD avoids the selection of spurious high degree nodes. Consequently, the resulting modules are generally characterized by the absence of hubs. RW proteins, in contrast, have 2-3 times higher average degree (Figure 18c). (d) The recall rate of the DIAMOnD algorithm is roughly independent of the level of incompleteness in the seed genes. It therefore allows us to estimate the number of biologically relevant predictions (Figure 12c-12f). In contrast, methodologies like RW solely provide a ranking, without predicting the total number of the most probable candidates. (e) DIAMOnD shows a significantly higher recall in the early iterations compared to RW, thereby providing higher confidence candidates early on. (f) As we discuss below, the DIAMOnD algorithm can be fine-tuned for specific applications, for example by giving varying weights to the initial seed genes.

2.6 Extending the basic DIAMOnD algorithm

Exploring the validation plots of the 70 diseases, we observed a common behavior: the curves corresponding to the number of true hits often reach a plateau (as in Figure 15a), beyond which DIAMOnD stops detecting more relevant hits. This plateau could either be used as a way of identifying the module size or it might be a limitation of the methodology. Indeed, we were able to show that in many cases DIAMOnD performance can be improved by small adjustments.

[Figure 18 panels: (a) frequency versus Jaccard similarity between first neighbors and candidate nodes, for Random Walk and DIAMOnD; (b) DIAMOnD performance compared to random walk versus iteration, for KEGG pathways and GO ontology, with an "equal performance" reference line; (c) frequency versus degree k for Random Walk and DIAMOnD, annotated with <k> = 23.46, <k> = 91.93 and p-value = 1.96e-14.]

Figure 18: (a) Overlap between identified proteins and immediate neighbors of seed proteins. In contrast to RW, DIAMOnD includes a considerable number of proteins without first-order interactions to seed genes. (b) Comparison of the performance of DIAMOnD and RW across 70 diseases with respect to non-specific disease data. (c) Degree distributions of the identified proteins. DIAMOnD proteins are characterized by the absence of hubs.

Figure 19: Schematic representation showing the importance of adding weights to the seed proteins. Proteins detected by DIAMOnD (shown in gray) are iteratively added to the seed proteins (shown in black), updating the module. At each iteration, DIAMOnD treats detected proteins the same as seed proteins. Thus, nodes A and B are equally ranked. However, the methodology favors protein A if a higher weight is given to seed proteins.

As discussed before, at each iteration DIAMOnD identifies a node and updates the module by adding the node to the module. Therefore, in the iteration process introduced above, the seed proteins are treated the same way as the predicted proteins agglomerated into the module at later iteration steps (Figure 20e). In most cases, once DIAMOnD reaches a large enough module, DIAMOnD nodes dominate the module and the methodology stops detecting further relevant hits. Figure 19 shows a subnetwork where seed proteins are shown in black and detected proteins are shown in gray. In this subnetwork both nodes A and B have the same degree and the same number of connections to colored (either black or gray) nodes. Therefore, the methodology ranks them equally. However, it is intuitive to select node A over B, as its connections are to the seed proteins as opposed to the detected ones. As the module size increases this drawback becomes more pronounced, up to a point where the module is dominated by detected nodes and the methodology stops detecting relevant nodes, resulting in a saturation point in the performance (Figure 15a).

A simple solution to this is to assign a weight to the original set of seed nodes, so that the methodology differentiates them from detected nodes. In fact, the DIAMOnD methodology can be easily extended to incorporate weighted links or nodes.

To assign higher weights to the seed proteins as compared to those that are only predicted, we introduce an additional weight α > 1 for the seed proteins and α = 1 for all other proteins. By considering links to nodes with higher weights to be α times stronger, the direct neighbors of seed proteins have a higher chance of being identified. Technically, this is implemented by artificially increasing the number of seed genes, for example by duplicating their number in the case of α = 2, while maintaining their original interactions (Figure 20a). The generalized form of equation 6 then becomes:

$$p(k, k_s) = \frac{\binom{s + (\alpha - 1)s_0}{k_s + (\alpha - 1)k_{s_0}} \binom{N - s}{k - k_s}}{\binom{N + (\alpha - 1)s_0}{k + (\alpha - 1)k_{s_0}}} \qquad (10)$$

where s0 is the number of seed proteins and s is the module size. Similarly, ks denotes the number of links of a node to the module and ks0 denotes its number of links to the original seed proteins. By tuning α and comparing the different resulting DIAMOnD sets we can optimize their biological relevance. In synthetic modules, the recovery rate could thereby be increased 2 to 3 times in comparison to the original version of the algorithm, for which the recovered fraction saturates (Figure 20b, 20c). On the set of 70 diseases, the optimal values for α vary considerably (see Figures 20d and 20e for the examples of lysosomal storage diseases and ulcerative colitis). Based on the pathway validations, we find that α = 10 performs best for many diseases (Figure 20f). As noted above, however, the validation according to pathways is biased towards immediate neighbors of the seed genes and we therefore expect that optimal values of α will depend on the specific application and the validation data that are used. We also observed that introducing α allows for the construction of larger modules by helping avoid plateaus in the identification of relevant proteins (Figure 20b-20f).
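The effect of the seed weight can be illustrated with a minimal Python sketch of the weighted hypergeometric form in Equation 10. It assumes integer α (seed duplication) and assumes that the connectivity p-value is the upper tail of the distribution, i.e. the probability of at least the observed number of (weighted) links to the module. The function name `connectivity_pvalue`, the toy network size and the link counts for nodes A and B are hypothetical and chosen only to mirror the situation of Figure 19; this is not the DIAMOnD implementation itself.

```python
from math import comb


def connectivity_pvalue(N, s, s0, k, ks, ks0, alpha=1):
    """Upper-tail p-value of the weighted hypergeometric form (Equation 10).

    N: network size, s: module size, s0: number of seed proteins,
    k: node degree, ks: links to the module, ks0: links to the seeds.
    alpha is assumed to be a positive integer (alpha = 2 duplicates seeds).
    """
    extra_seeds = (alpha - 1) * s0    # virtual copies of the seed proteins
    extra_links = (alpha - 1) * ks0   # virtual copies of the links to seeds
    population = N + extra_seeds
    module = s + extra_seeds
    draws = k + extra_links
    observed = ks + extra_links
    total = comb(population, draws)
    tail = sum(
        comb(module, i) * comb(population - module, draws - i)
        for i in range(observed, min(draws, module) + 1)
    )
    return tail / total


# Hypothetical toy numbers: 1000 proteins, 10 seeds, module grown to 20.
# Node A has all 3 module links to seeds, node B has them to detected nodes.
for alpha in (1, 2):
    p_a = connectivity_pvalue(1000, 20, 10, k=10, ks=3, ks0=3, alpha=alpha)
    p_b = connectivity_pvalue(1000, 20, 10, k=10, ks=3, ks0=0, alpha=alpha)
    print(f"alpha={alpha}: p(A)={p_a:.3e}, p(B)={p_b:.3e}")
```

With α = 1 the two nodes receive identical p-values, whereas for α = 2 node A obtains the smaller p-value and would be ranked first, which is the behavior the weighting is meant to produce.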

2.7 Discussion

The hypothesis that disease-associated proteins tend to interact with each other in the human Interactome underlies all network-based prioritization methods. Yet, for most diseases we found that only a relatively small fraction of known seed proteins in fact interact with each other. As a consequence, diseases cannot be associated with topologically dense network communities. Instead of the interaction density, we identified the interaction significance as the key quantity to characterize the connection patterns among disease proteins. While in principle this could be a consequence of our currently still very limited knowledge of disease-associated proteins and their interactions, our results suggest that there is in fact a fundamental difference between disease modules and topological modules. Biologically, it is indeed plausible that disease modules do not necessarily coincide with densely interconnected topological modules. Highly interconnected proteins often represent functional units that perform a certain cellular task. Diseases, on the other hand, are likely to be the result of perturbations among several functional modules and are therefore expected to span across functional modules/topological communities. Our analysis of the connection patterns of known disease proteins further allowed us to design a predictive and robust algorithm to uncover unknown disease associations and construct a comprehensive disease module. For both synthetic test modules and real disease modules the recall of DIAMOnD generally does not depend on the level of completeness in the initial set of seed proteins, but is rather a property of the module itself. This can be used to estimate the expected true positive rate in the predictions and is particularly convenient for predicting new disease associations, where the total number of proteins involved in a disease is not known. While the outcome of DIAMOnD does not depend sensitively on the exact set of seed proteins, there typically are a few crucial seed proteins whose omission leads to drastically different and presumably random results. These crucial proteins are characterized by their high degree. Their topological importance also suggests particularly important roles for the pathobiological mechanisms of the disease. Overall, the final disease modules typically consist of one large component that contains all DIAMOnD genes and 30%-60% of the initially disconnected seed proteins, the rest remaining disconnected. The integration of the several initially disconnected seed clusters into a broader disease module and the elucidation of the network paths that interconnect them is crucial for a holistic understanding of the pathobiology and molecular mechanisms underlying complex diseases. Whether the remaining disconnected seed proteins could be integrated if the Interactome data were more complete, or whether their disease associations are spurious, remains an open question.

[Figure 20 panels: (a) schematic of the seed weighting; (b, c) number of true hits versus iteration for a synthetic shell module and a synthetic connectivity module; (d, e) number of true hits for ulcerative colitis and for nutritional and metabolic diseases, with curves for iterations 100, 200, 300 and 400; (f) fold change versus iteration.]

Figure 20: Extending the DIAMOnD algorithm. (a) Illustration of how the algorithm can be modified to give the initial seed proteins a higher weight α = 2 by (virtually) doubling the seed proteins while keeping their interactions. Tuning α results in different sets of detected proteins. (b, c) Comparing the performance for varying values of α in synthetic shells (b) and connectivity significance (c) modules, respectively. The best results are obtained for α = 3. (c) The performance may also saturate for α larger than a certain value. For a given disease α can be tuned to optimize the results. Performance of DIAMOnD with respect to different values of α is shown for ulcerative colitis (d) and nutritional and metabolic diseases (e). These plots suggest that at α = 2 the number of true positives is maximal. (f) Overall, α ~10 results in the best performance of DIAMOnD across 70 diseases. The individual values may vary considerably, however, suggesting an individual optimization for best results.

3 COMMON UNDERLYING MOLECULAR MECHANISMS OF COMPLEX DISEASES

A limited number of key endophenotypes are common to most complex diseases. Most notable among them are inflammation, thrombosis, and fibrosis [58]. These endophenotypes reflect mechanisms that facilitate the organism's adaptation to injury. Each has acute and resolving phases. The common goal of these underlying responses is to restore normal organ and organism function. Inasmuch as these endophenotypes evolved to promote healing from acute injury, their implications for chronic injury or disease likely had a lesser role, if any, on their genetic selection. As a result, chronic overexuberant inflammatory, thrombotic, or fibrotic responses can yield organ impairment and adverse long-term effects that outweigh the acute benefits they provide [113–116].

Inflammation, thrombosis, and fibrosis are pathologically linked [117]: inflammation can induce (accelerate) thrombosis, thrombosis can induce inflammation [118, 119], and fibrosis can result from resolving inflammation and thrombosis [120]. For these reasons, we explored the joint molecular network determinants of these endophenotypes, in particular aiming to identify those molecular subnetworks or mediators that are common to all, as well as those that are distinctive for each. In this way, we can define the determinants of the interplay among these common endophenotypes, as well as the determinants of heightened or deficient responses in them.

A complex cascade of molecular interactions occurs during inflammatory, thrombotic, and fibrotic processes, many of which remain poorly understood. Several molecules and cell types play a crucial role in these processes and exert their function through a network of interactions. Therefore, in this study we explore (a) network models of inflammation, thrombosis, and fibrosis; (b) their biological and topological crosstalk; and (c) the role of macrophages as central cellular mediators of these endophenotypes.

3.1 Constructing inflammasome, thrombosome, and fibrosome

Below, we describe in detail the steps towards constructing the network models of the three endophenotypes. We further validate the constructed modules using available biological evidence and show that the modules are robust towards small variations. Finally, we discuss the cross-talk region of the three endophenotypic modules and stress its importance.

3.1.1 Significant clustering of seed genes within the human interactome

We start our analysis by assembling from the literature a set of genes with established association (seed genes) with inflammation, thrombosis, and fibrosis. To do so, we used HuGE Navigator (http://www.hugenavigator.net). HuGE Navigator is a continuously updated and publicly available knowledge base that retrieves genes associated with a phenotype of interest by first parsing PubMed articles and subsequently having the results manually reviewed by experts.

In order to obtain genes with high-confidence association, we additionally applied two filtering criteria, retaining only (a) genes whose association has been reported in at least two publications; and (b) genes that are expressed in tissues related to cardiovascular diseases. For the second criterion, we used gene expression data from 79 human tissues [121]. The specific tissues considered include: monocytes, vascular smooth muscle cells, endothelial cells, T-cells, and hepatocytes. We consider a gene to be expressed in a tissue if its expression level meets one of the following criteria:

(i) Expression level of 200 mRNA counts or higher in the specified tissue.

(ii) The expression level in the specified tissue is significantly higher than the expression profile across all tissues. For this analysis we used a modified z-score > 1.6, defined by [122]:

$$M_i = \frac{0.6745\,(x_i - \tilde{x})}{\mathrm{MAD}}$$

where MAD is the median absolute deviation and $\tilde{x}$ denotes the median.
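The two expression criteria can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function names `modified_z_scores` and `is_expressed` are hypothetical, the input is taken to be one gene's expression values across all tissues, and the handling of a zero MAD (raising an error) is a choice made here, not something specified in the text.

```python
from statistics import median


def modified_z_scores(values):
    """Modified z-score M_i = 0.6745 * (x_i - median) / MAD for each value."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        raise ValueError("MAD is zero; the modified z-score is undefined")
    return [0.6745 * (v - med) / mad for v in values]


def is_expressed(values, tissue_index, min_count=200, z_cutoff=1.6):
    """Apply criterion (i) first, then criterion (ii), for one tissue."""
    if values[tissue_index] >= min_count:          # criterion (i)
        return True
    return modified_z_scores(values)[tissue_index] > z_cutoff  # criterion (ii)


# Hypothetical expression profile of one gene across seven tissues.
profile = [10, 12, 11, 13, 12, 150, 11]
print(is_expressed(profile, 5))  # below 200 counts, but far above the median
print(is_expressed(profile, 0))  # close to the median in a low-expression gene
```

Using the median and the MAD instead of the mean and standard deviation keeps a single highly expressing tissue from inflating the spread estimate, which is the usual motivation for the modified z-score.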

The final numbers of seed genes for inflammation, thrombosis, and fibrosis were 452, 158, and 104, respectively (Table 3). As expected, the three gene sets show significant overlap; for example, 80% (p-value = $1.50 \times 10^{-161}$; Fisher's exact test) and 78% (p-value = $3.58 \times 10^{-104}$) of the genes associated with thrombosis and fibrosis, respectively, are also associated with inflammation (Fig. 21a).

For an independent biological evaluation of the compiled seed gene lists, we tested for association between candidate functional single nucleotide polymorphisms (SNPs) mapping to each seed gene and selected established cardiovascular biomarkers, C-reactive protein (CRP), fibrinogen, soluble intercellular adhesion molecule (ICAM), as well as a clinical vascular pathophenotype, venous thromboembolism (VTE). The detailed description of data and analysis for finding the genetic associations is available in Chapter 4.

Endophenotype   # HuGENet genes   # Seed genes   # Seed genes in HI   LCC size   z-score
Inflammation    1,679             452            442                  360        10.85
Thrombosis      603               158            156                  96         19.25
Fibrosis        621               104            101                  55         22.27

Table 3: Genes associated with endophenotypes, on the human interactome

Although curated seed genes are not necessarily expected to overlap with genetic associations meeting genome-wide significance (p-value < $5 \times 10^{-8}$), we observed that for all four validation sets, inflammation and thrombosis seed genes carry a larger fraction of low p-values as compared to other genes in the network. We observed a similar effect for fibrosis seed genes with respect to CRP, ICAM, and VTE, but not fibrinogen (Fig. 22).

To identify the network modules corresponding to the three endophenotypes, we consolidated several sources of direct physical protein-protein interaction data, including regulatory interactions [19], binary interactions from high-throughput datasets [20–22, 65] together with binary interactions from the IntAct [123] and MINT [23] databases, literature-curated interactions from IntAct, MINT, BioGRID [124], and HPRD [125], metabolic enzyme-coupled interactions [49], protein complexes [126], kinase network [127], signaling interactions [128] and liver-specific interactions [129] (see Chapter 4 for a detailed description of the construction of the human interactome). The resulting network has a power-law degree distribution [5] and consists of N = 13,681 proteins (nodes) and M = 144,414 interactions between them (edges).

Figure 21: Topological characteristics of seed genes within the Human Interactome. (a) Venn diagram of inflammatory (red), thrombotic (blue), and fibrotic (orange) seed genes. (b), (c), and (d) correspond to subgraphs of the human interactome containing inflammatory, thrombotic, and fibrotic seed genes, respectively. These genes form a giant connected component, suggesting the existence of a local network neighborhood enriched with inflammatory, thrombotic, and fibrotic genes. The randomized distribution of the LCC size is shown in the histograms. For the effect of literature bias, see SI.

Considerable evidence suggests that genes associated with complex diseases are not randomly scattered within the HI but tend to interact with each other in specific network neighborhoods, or disease modules [40, 48, 80]. The same phenomenon is found for the seed genes of the three endophenotypes: their seed genes form connected subgraphs whose sizes are significantly larger than expected by chance for randomly distributed genes (Fig. 21b, 21c and 21d and Table 3).
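The comparison of the observed largest connected component (LCC) with random expectation can be sketched as follows. This is a minimal illustration, assuming that the null model draws the same number of nodes uniformly at random from the network and that the z-score is computed from the mean and standard deviation of the random LCC sizes; the function names and the toy chain network are hypothetical, and the randomization actually used for Table 3 may differ (for example by preserving degrees).

```python
import random
from statistics import mean, pstdev


def lcc_size(adjacency, nodes):
    """Size of the largest connected component of the subgraph induced by nodes."""
    nodes = set(nodes)
    seen, best = set(), 0
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:                      # depth-first search within the node set
            u = stack.pop()
            size += 1
            for v in adjacency.get(u, ()):
                if v in nodes and v not in seen:
                    seen.add(v)
                    stack.append(v)
        best = max(best, size)
    return best


def lcc_z_score(adjacency, seeds, n_random=1000, rng=None):
    """Observed LCC size of the seeds and its z-score against random node sets."""
    rng = rng or random.Random(0)
    seeds = list(seeds)
    all_nodes = sorted(adjacency)
    observed = lcc_size(adjacency, seeds)
    random_sizes = [
        lcc_size(adjacency, rng.sample(all_nodes, len(seeds)))
        for _ in range(n_random)
    ]
    mu, sigma = mean(random_sizes), pstdev(random_sizes)
    if sigma == 0:
        return observed, float("nan")     # degenerate null distribution
    return observed, (observed - mu) / sigma


# Toy network: a chain of 100 nodes; the "seeds" are 10 consecutive nodes.
chain = {i: set() for i in range(100)}
for i in range(99):
    chain[i].add(i + 1)
    chain[i + 1].add(i)
print(lcc_z_score(chain, range(10), n_random=200))
```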

3.1.2 Effect of biased studies on significant clustering of seed genes

Current maps of the human interactome are prone to investigative biases [46, 130]. Since disease genes are typically the particular focus of experimental research, it is often observed that they have more established interaction partners and, therefore, a higher degree in the network. To quantify the extent to which the observed topological properties of disease proteins are due to these biased studies, we repeated our analysis on an unbiased, Y2H high-throughput subset of the human interactome [113, 115, 116] and confirmed that the observed clustering, indeed, reflects the existence of modules responsible for these endophenotypes.

The raw observation suggests that seed proteins show a much smaller clustering effect on the unbiased protein-protein interaction network. Note, however, that the interactions among a substantial number of seed genes have not yet been examined in current high-throughput maps (Y2H network); this observation therefore has to be interpreted with caution and deserves further attention in the future.

As mentioned above, due to the limited search space (number of proteins examined) and an interaction detection sensitivity of ~10%, the unbiased maps are much sparser than current LCI maps. Thus, observing a smaller clustering effect is, indeed, expected and can be explained by the incompleteness of the current maps of the unbiased interactome. Moreover, LCI is not limited to protein-protein interactions and includes interactions from several sources such as metabolic, regulatory, etc.

Figure 22: Genetic association of seed genes as compared to other genes with respect to three cardiovascular biomarkers, CRP, fibrinogen, and sICAM, as well as the specific vascular disease phenotype, VTE, in the inflammatory (pink background, first column), thrombotic (light blue background, second column), and fibrotic (light orange, third column) subnetworks. Seed genes contain more low p-value GWAS genes than other genes in the network [red circles, seed genes; green circles, endophenotype module (subnetwork); black circles, rest of network].

To show that the observed topological properties of disease proteins on unbiased maps are, indeed, due to the incompleteness of the network (and not solely due to the biased nature of studies), we proceed as follows (see Fig. 23a for the flowchart):

1. For a fair comparison, we first limit the nodes of our HI to those existing in Y2H. That is, we consider the subnetwork of the full network that contains only Y2H network nodes. This subnetwork contains a substantially larger number of edges than the Y2H network. The latter can be viewed as an incomplete but unbiased subset of this subnetwork.

2. Next, we check whether the differences are significant or expected by chance. We randomly remove ("prune") links from this subnetwork until we reach the same number of links as in the Y2H network. In parallel, we try to keep the degree of the nodes preserved as in the Y2H network.

Our analysis shows that the observed low clustering of seed genes in unbiased maps lies within the expected range drawn by randomly pruned events (Fig. 23b). Therefore, low clustering of disease proteins can, to a great extent, be explained by the incompleteness of the network.
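The two steps of the pruning procedure can be sketched in Python as below. This is only an illustration under explicit assumptions: the text does not specify how node degrees are kept close to those of the Y2H network, so the sketch substitutes a simple heuristic (first remove, in random order, links whose two endpoints both exceed their reference degree, then remove random links until the target count is reached). The function names and the toy networks are hypothetical.

```python
import random


def restrict_to_nodes(edges, nodes):
    """Step 1: keep only links whose two endpoints are both in the node set."""
    nodes = set(nodes)
    return [(u, v) for u, v in edges if u in nodes and v in nodes]


def prune_to_reference(edges, target_degree, n_target_edges, rng=None):
    """Step 2: randomly prune links until n_target_edges remain.

    Heuristic (an assumption of this sketch): links between two nodes that
    both still exceed their reference degree are removed first.
    """
    rng = rng or random.Random(0)
    edges = list(edges)
    rng.shuffle(edges)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1

    kept, n_remaining = [], len(edges)
    for u, v in edges:
        both_over = (degree[u] > target_degree.get(u, 0)
                     and degree[v] > target_degree.get(v, 0))
        if n_remaining > n_target_edges and both_over:
            degree[u] -= 1
            degree[v] -= 1
            n_remaining -= 1
        else:
            kept.append((u, v))

    excess = len(kept) - n_target_edges
    if excess > 0:                 # fall back to uniform random removal
        rng.shuffle(kept)
        kept = kept[excess:]
    return kept


# Toy example: prune a complete graph on 6 nodes down to the 6 links
# (and degree 2 per node) of a hypothetical sparse reference network.
full = [(i, j) for i in range(6) for j in range(i + 1, 6)]
reference_degree = {i: 2 for i in range(6)}
print(prune_to_reference(full, reference_degree, n_target_edges=6))
```

Repeating the pruning with different random seeds yields an ensemble of sparse networks; the clustering of the seed genes measured on each of them gives the expected range against which the Y2H observation can be compared.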

3.1.3 Modules detection, validation and robustness

Figure 23: Studying the effect of biased studies on the clustering of seed genes. (a) Flowchart of the fair comparison of seed clustering effects between curated networks and unbiased networks. (b) The observed clustering of seeds within unbiased maps lies within the expected range drawn by randomization.

We used the seed gene clusters in the interactome as a starting point to explore the molecular mechanisms of the respective endophenotypes in the broader context of disease-associated endophenotype modules, i.e., sub-networks associated with inflammation, thrombosis, and fibrosis. To identify these neighborhoods of endophenotype proteins, we used the DIseAse MOdule Detection (DIAMOnD) method that iteratively expands the seed gene neighborhood by adding proteins with a significant number of connections to the seed gene pool [80]. In principle, the method ranks all proteins in the network. To identify the boundary of each endophenotype module, we therefore assessed the biological relevance of the ranked proteins using additional biological evidence. We used Gene Ontology and MSigDB [131] pathways as follows:

(a) MSigDB pathways: From the MSigDB database, we retrieved the biological pathways significantly enriched with seed genes (FDR corrected). Next, we show that these pathways are also statistically highly enriched with DIAMOnD genes. Figure 24 shows the number of DIAMOnD genes that belong to these sets of pathways as a function of DIAMOnD iteration and their corresponding p-values.

(b) Gene Ontology (GO): In the same fashion, we extracted GO terms [downloaded Nov. 2011] significantly annotated for the seed genes and show that DIAMOnD genes are significantly annotated for the same GO terms.

We found that approximately the first 450, 700, and 600 DIAMOnD genes show a clear and significant biological association with inflammatory, thrombotic, and fibrotic seed genes, respectively (Fig. 24a-24c). DIAMOnD genes, together with the seed genes, form three endophenotype modules that we call the inflammasome, thrombosome, and fibrosome, containing 902, 858, and 704 proteins, respectively. Moreover, through the addition of 450 DIAMOnD proteins, 93% of inflammatory seed proteins are integrated into a connected component (LCC) (Fig. 25a). Therefore, the additional DIAMOnD proteins allow for the integration of previously disconnected seed proteins into the connected component of the modules.

Figure 24: Biological validation of the detected DIAMOnD genes. Panels (a), (b), and (c) correspond to validating DIAMOnD genes of inflammation (a), thrombosis (b), and fibrosis (c), respectively (red lines, seed genes; green lines, DIAMOnD genes; black lines, randomly selected genes). Validation is assessed with respect to Gene Ontology and MSigDB pathways. As the DIAMOnD genes are iteratively added to the neighborhood, the p-value of enrichment increases with a clear jump to non-significant values (p-value ~1) at the indicated iteration. Therefore, we use the suggested iteration steps to define cutoffs for the methodology, and thereby identify the size limit of the underlying associated module. We chose the first 450, 700, and 600 identified DIAMOnD nodes to form the inflammasome, thrombosome, and fibrosome modules, respectively. (d) Venn diagram of the inflammasome, thrombosome, and fibrosome genes. The fully embedded pathways within detected modules have been found in inflammasome-specific proteins, thrombosome-specific proteins, overlapping proteins in the inflammasome and thrombosome, and overlapping proteins in all three modules.

Figure 25: Topological properties and robustness of the endophenotypic modules. (a) Previously disconnected seed genes are now connected to each other through detected DIAMOnD genes. The inflammasome, thrombosome, and fibrosome modules so constructed allow 93%, 90%, and 93% of seed genes to become part of the LCC, respectively. (b) N-1 analysis. All modules are robust towards removing one node from the initial set of seed genes. The largest deviation appears as a consequence of removing the gene A2M from the fibrosis seed genes.

The resulting modules are robust towards small variations in the initial seed gene set (Fig. 25b). To check the robustness of the module-finding methodology towards false positives and gene misannotations, we performed the so-called N-1 analysis, where N is the original number of seed genes. In this analysis, we remove one seed at a time and expand the neighborhood of the remaining N-1 seeds iteratively. At each iteration, we measure the overlap of the detected genes between the original (N seeds) and the trial (N-1 seeds) sets.

Figure 25b shows the average overlap of the detected DIAMOnD genes as a function of the DIAMOnD iteration step. The overlap has been measured between the genes resulting from two different seed sets: the original seeds and N different configurations of trial seeds (each containing N-1 seeds). As shown in the figure, the methodology is robust towards small variations of the seed genes.
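The N-1 analysis can be sketched as a leave-one-out loop around any module expansion routine. In the minimal Python sketch below, `toy_expand` is a deliberately simple stand-in that adds, at each step, the node with the most links to the current module; it is not the DIAMOnD algorithm and is used only so that the example runs on its own. The function names, the overlap definition (fraction of shared genes among the first i detected genes) and the toy network are assumptions of this sketch.

```python
def build_adjacency(edges):
    """Undirected adjacency sets from a list of links."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def toy_expand(adjacency, seeds, n_iterations):
    """Stand-in for DIAMOnD: repeatedly add the node with most module links."""
    module, added = set(seeds), []
    for _ in range(n_iterations):
        candidates = {v for u in module for v in adjacency.get(u, ())
                      if v not in module}
        if not candidates:
            break
        best = max(sorted(candidates),
                   key=lambda v: sum(w in module for w in adjacency.get(v, ())))
        module.add(best)
        added.append(best)
    return added


def n_minus_one_overlap(adjacency, seeds, n_iterations, expand=toy_expand):
    """Average overlap between the N-seed run and each of the N-1 seed runs."""
    seeds = list(seeds)
    reference = expand(adjacency, seeds, n_iterations)
    curves = []
    for left_out in seeds:
        trial_seeds = [s for s in seeds if s != left_out]
        trial = expand(adjacency, trial_seeds, n_iterations)
        curves.append([
            len(set(reference[:i]) & set(trial[:i])) / i
            for i in range(1, n_iterations + 1)
        ])
    return [sum(c[i] for c in curves) / len(curves) for i in range(n_iterations)]


# Hypothetical toy network with three seeds.
toy = build_adjacency([(0, 1), (0, 2), (1, 2), (3, 0), (3, 1), (3, 2),
                       (4, 0), (4, 1), (4, 3), (5, 3), (5, 4), (6, 5), (7, 6)])
print(n_minus_one_overlap(toy, seeds=[0, 1, 2], n_iterations=3))
```

In this sketch the left-out seed can itself be re-detected during the trial run, in which case it counts as a mismatch with respect to the reference run; whether to treat it that way is a design choice.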

Figure 26: Enrichment of seed genes and modules with differentially expressed genes in subjects with a significant cardiovascular risk factor burden.

3.1.4 Cross-talk region of the modules

The three modules have a large common core of 530 proteins (Fig. 24d). The thrombosome and inflammasome show a significant (p-value < $10^{-324}$) overlap of 637 genes (Jaccard index J = 0.57). Part of the overlap of the modules stems from common seed genes. Considering that the seed genes represent established knowledge, calculating the significance of the module overlap reduces to calculating the significance of the overlap between the added (DIAMOnD) genes. We calculate the significance of the inflammasome and thrombosome overlap using the following methods:

(a) First, we consider that the 450 (700) detected DIAMOnD nodes with respect to inflammation (thrombosis) could have been selected from any nodes in the network. We calculate the p-value using Fisher's exact test, resulting in a p-value < $10^{-324}$.

(b) In practice, the detected nodes cannot be selected from anywhere in the network. Rather, they are iteratively added based on their connectivity patterns to seed nodes. To factor this principle into the analysis, we assume that detected nodes can be selected from first neighbors of seeds only. We further limit this pool of candidate nodes by taking those that are first neighbors of both inflammation and thrombosis seeds. This selection process will underestimate the resulting significance. Fisher's exact test yields a p-value < $10^{-324}$.

(c) With the same approach, we found an overlap significance of p-value < $10^{-324}$ for both the inflammasome-thrombosome and thrombosome-fibrosome pairs.
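The overlap statistics above can be reproduced in outline with a short Python sketch. The Jaccard index uses the module sizes and overlap reported in the text (902, 858 and 637). The significance is computed as the upper tail of the hypergeometric distribution, which corresponds to a one-sided Fisher's exact test. Because p-values of this magnitude lie below the smallest positive double-precision number (about 4.9 × 10^-324), the sketch works with logarithms. The function names are hypothetical, and the overlap of 400 DIAMOnD genes used in the example is an invented placeholder, since the text does not report that number.

```python
from math import exp, lgamma, log


def jaccard_from_sizes(size_a, size_b, overlap):
    """Jaccard index |A and B| / |A or B| from set sizes."""
    return overlap / (size_a + size_b - overlap)


def log_comb(n, k):
    """Natural logarithm of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def log10_overlap_pvalue(universe, size_a, size_b, overlap):
    """log10 of P(X >= overlap) for a hypergeometric X (one-sided test)."""
    log_total = log_comb(universe, size_b)
    log_terms = [
        log_comb(size_a, i) + log_comb(universe - size_a, size_b - i) - log_total
        for i in range(overlap, min(size_a, size_b) + 1)
        if size_b - i <= universe - size_a
    ]
    peak = max(log_terms)                       # log-sum-exp for stability
    log_p = peak + log(sum(exp(t - peak) for t in log_terms))
    return log_p / log(10)


# Module sizes and overlap as reported in the text.
print(round(jaccard_from_sizes(902, 858, 637), 2))

# 450 and 700 DIAMOnD genes drawn from N = 13,681 proteins; the overlap of
# 400 is a hypothetical value used only to illustrate the order of magnitude.
print(log10_overlap_pvalue(13681, 450, 700, 400))
```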

Interestingly, the overlap between the modules is more significant than the overlap between the seed genes, suggesting that these endophenotypes are truly in the same neighborhood of the interactome. Further pathway analysis of the genes within the modules identifies five fully embedded pathways: IL6, IGF1, extrinsic prothrombin activation, AP1 family of transcription factors and PECAM1 (Fig. 24d, Table 4). Note that, given the current coverage of the map of the human Interactome, proteins belonging to the pathway AP1 family of transcription factors do not directly interact with each other.

The cross-talk region could be further investigated for inflammation-induced thrombotic pathways [132]. It is known, for example, that inflammation inhibits natural anticoagulant pathways and fibrinolytic activity as well as increases procoagulant factors, thereby increasing the (net) thrombotic response.

3.1.5 Biological importance of the endophenotype modules

We next turned to an analysis of the identified modules with respect to (a) cardiovascular disease risk (an example of preclinical disease), and (b) their more general association with complex diseases.

Fully embedded pathway                Network neighborhood                  Genes

IL6                                   Inflammasome-thrombosome crosstalk    CSNK2A1, JUN, SRF, HRAS, IL6R, STAT3, PTPN11, IL6ST,
                                                                            RAF1, SHC1, ELK1, MAP2K1, TYK2, JAK1, JAK3, JAK2,
                                                                            CEBPB, GRB2, MAPK3, FOS, IL6, SOS1

Extrinsic prothrombin activation      Thrombosome-specific                  F10, TFPI, F2, F2R, FGG, PROS1, PROC, FGA, SERPINC1,
                                                                            FGB, F3, F5, F7

PECAM1                                Modules crosstalk                     ITGAV, SRC, PTPN11, PLCG1, PECAM1, YES1, FYN, ITGB3,
                                                                            PTPN6, LYN, LCK, INPP5D

AP1 family of transcription factors                                         MAPK9, JUN, MAPK10, MAPK11, MAPK8, MAPK14, ATF2,
                                                                            MAPK3, FOS, MAPK1

IGF1                                  Inflammasome-specific                 CSNK2A1, JUN, SRF, HRAS, MAPK8, PTPN11, RAF1, SHC1,
                                                                            ELK1, MAP2K1, IGF1, RASA1, PIK3CA, IGF1R, GRB2, IRS1,
                                                                            MAPK3, FOS, PIK3CG, PIK3R1, SOS1

Table 4: Fully embedded pathways in different regions of endophenotype modules and their associated genes

3.1.6 The role of endophenotype modules in cardiovascular disease

To assess the potential role of the three endophenotypes for the risk of developing cardiovascular diseases, we analyzed gene expression data in monocytes from a cohort of 1,258 individuals [133, 134] (see Chapter 4 for data description), comparing individuals at high risk of cardiovascular diseases (cases) to individuals at low risk (controls). Quantitative biochemical risk factors measured in the population included CRP, fibrinogen, high-density lipoprotein (HDL), low-density lipoprotein (LDL), apolipoprotein-A (APO-A), apolipoprotein-B (APO-B), and triglycerides. All three endophenotypes are strongly enriched with CRP, HDL, and APO-A-associated genes. The inflammasome and thrombosome were additionally enriched with triglyceride-related genes (Fig. 26, Table 5).

Molecule       # DE genes   Inflammasome (seeds)   Thrombosome (seeds)   Fibrosome (seeds)
CRP            479          55 (28)                52 (11)               35 (6)
Fibrinogen     255          20 (8)                 19 (3)                15 (2)
APO-A          136          19 (11)                17 (3)                15 (2)
APO-B          9            1 (0)                  0 (0)                 1 (0)
HDL            216          34 (17)                30 (7)                24 (4)
LDL            3            1 (1)                  1 (0)                 0 (0)
Triglyceride   57           8 (4)                  8 (2)                 4 (1)

Table 5: Differentially expressed (DE) genes associated with cardiovascular risks and their overlap with endophenotype modules and seeds.

3.1.7 The role of endophenotype modules in complex diseases

For a more general assessment of the role of the three endophenotypes in complex diseases other than cardiovascular diseases, we next analyzed their enrichment with disease proteins from a corpus of 299 diseases [46]. We found that, in total, the disease genes associated with 156 (52% of) diseases significantly overlap with at least one of the three detected modules (Table 6). Among these diseases, 67 are enriched in all three modules, while 11, 10, and 26 are inflammasome-, thrombosome-, and fibrosome-specific, respectively (Table 7). These data support the notion that inflammation, thrombosis, and fibrosis are pathobiological endophenotypes common to many diseases.

In summary, we observed that the three detected endophenotype modules are highly enriched with known disease genes, in general, and, more specifically, with differentially expressed genes associated with cardiovascular risk factors. Hence, the detected subregions of the network, including the proteins and their molecular interactions, are of high biological importance and worth analyzing in the context of disease development.

               #diseases significantly enriched with
               Seeds   Module
Inflammation   95      117
Thrombosis     83      116
Fibrosis       77      99

Table 6: Number of diseases significantly enriched with endophenotype modules.

3.2 Topological properties of the endophenotype modules

Prompted by the strong enrichment of the endophenotype modules with genes associated with complex diseases and preclinical cardiovascular disease (cardiovascular risk factors), we explored whether this central role is also reflected in specific topological properties of the modules within the interactome.

3.2.1 Central location of inflammatory and fibrotic genes

To this end, we analyzed the extent to which the robustness and structural integrity of the network depend on these proteins using a “tree” analysis, i.e., testing whether a set of nodes constitutes an essential backbone of the HI (“trunk” of the tree) or whether it is of secondary importance for the overall structure (“leaves”). We remove the given set of nodes from the network and measure two parameters: (a) the number of remaining connected components (islands), and (b) the size of the remaining LCC.

Cross talk Inflammasome specific Thrombosome specific

arthritis autoimmune-diseases intestinal-neoplasms joint-diseases arteriosclerosis central-nervous- abnormalities,-multiple anemia,-aplastic autoimmune-diseases-of-the-nervous-system kidney-diseases kidney-neoplasms system-diseases death-sudden bone-diseases,-developmental breast-neoplasms bile-duct-diseases bone-diseases bone-marrow-diseases leukemia liver-diseases lung-diseases death-sudden-cardiac carcinoma,-renal-cell cardiovascular-abnormalities cardiovascular-diseases lymphatic-diseases heart-arrest charcot-marie-tooth-disease colitis,-ulcerative cerebrovascular-disorders colitis colonic-diseases lymphoproliferative-disorders heart-defects-congenital collagen-diseases colorectal-neoplasms congenital-abnormalities male-urogenital-diseases muscular-disorders-atrophic congenital,-hereditary,-and-neonatal- connective-tissue-diseases crohn-disease metabolic-diseases multiple-sclerosis neuroectodermal-tumors diseases-and-abnormalities death,-sudden demyelinating-diseases diabetes-mellitus musculoskeletal-abnormalities neuroendocrine-tumors demyelinating-autoimmune-diseases, -cns digestive-system-diseases digestive-system-neoplasms musculoskeletal-diseases skin-diseases-genetic diabetes-mellitus,-type-1 endocrine-system-diseases female-urogenital-diseases myeloproliferative-disorders neoplasms stomatognathic-diseases genetic-diseases,-inborn female-urogenital-diseases-and- neoplasms-by-histologic-type genetic-diseases,-x-linked pregnancy-complications gastroenteritis neoplasms-by-site nephritis Fibrosome specific genital-diseases,-female gastrointestinal-diseases glucose-metabolism-disorders nervous-system-diseases heart-defects,-congenital gonadal-disorders heart-diseases hematologic-diseases nervous-system-malformations infant,-newborn,-diseases leukemia,-lymphoid hemic-and-lymphatic-diseases hemorrhagic-disorders neurodegenerative-diseases adenocarcinoma leukemia,-myeloid lung-diseases,-obstructive immune-system-diseases nutritional-and-metabolic-diseases
arrhythmias-cardiac lupus-erythematosus,-systemic immunoproliferative-disorders peripheral-nervous-system-diseases genital-diseases-male lymphoma,-non-hodgkin inflammatory-bowel-diseases intestinal-diseases pigmentation-disorders psoriasis genital-neoplasms-male glioma neoplasms,-glandular-and-epithelial respiratory-tract-diseases lymphoma-non-hodgkin neoplastic-syndromes,-hereditary rheumatic-diseases macular-degeneration skin-diseases,-papulosquamous skin-and-connective-tissue-diseases otorhinolaryngologic-diseases urologic-neoplasms skin-diseases urogenital-neoplasms prostatic-diseases urologic-diseases vascular-diseases prostatic-neoplasms

Table 7: Significantly enriched diseases in module-specific regions.

Next, we compare the results to expected values of these measures as follows:

(a) Randomly select the same number of nodes from the network.

(b) Remove these nodes from the network.

(c) Measure parameters as introduced above (a and b).

(d) Repeat steps (a)-(c) 106 times to produce a randomized distribution.

(e) Calculate a z-score for the actual observation.

Highly negative (positive) z-scores of the LCC size (number of connected components) reflect a central location of the respective nodes.

A group of nodes whose removal results in a significantly higher number of connected components (z-score(CC) > 1.6) with a much smaller LCC (z-score(LCC) < −1.6) is considered essential for the integrity of the HI, i.e., is trunk-like. By contrast, nodes whose removal leads to a significantly decreased number of connected components (z-score(CC) < −1.6) and a larger LCC (z-score(LCC) > 1.6) are considered non-essential, i.e., are leaf-like (Fig. 27). [See Table 8 for a list of basic topological properties of these modules.]
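The tree analysis described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code used in the dissertation: the network is assumed to be an undirected adjacency dictionary, the function names are hypothetical, the z-score uses the population standard deviation, and the default of 1000 randomizations is an illustrative choice rather than the repetition count given in step (d).

```python
import random
from statistics import mean, pstdev


def components_after_removal(adjacency, removed):
    """Return (number of connected components, LCC size) after deleting `removed`."""
    removed = set(removed)
    seen = set()
    n_components, lcc = 0, 0
    for start in adjacency:
        if start in removed or start in seen:
            continue
        n_components += 1
        seen.add(start)
        stack, size = [start], 0
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            size += 1
            for nb in adjacency[node]:
                if nb not in removed and nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        lcc = max(lcc, size)
    return n_components, lcc


def tree_analysis(adjacency, node_set, n_random=1000, seed=0):
    """Return (z-score of #components, z-score of LCC size) for `node_set`."""
    rng = random.Random(seed)
    nodes = list(adjacency)
    size = len(set(node_set))
    obs_cc, obs_lcc = components_after_removal(adjacency, node_set)
    rand = [
        components_after_removal(adjacency, rng.sample(nodes, size))
        for _ in range(n_random)
    ]

    def z(obs, values):
        sd = pstdev(values)
        return (obs - mean(values)) / sd if sd > 0 else 0.0

    return z(obs_cc, [c for c, _ in rand]), z(obs_lcc, [l for _, l in rand])


# Toy example: the hub of a star graph is trunk-like.
star = {0: set(range(1, 10)), **{i: {0} for i in range(1, 10)}}
z_cc, z_lcc = tree_analysis(star, {0})
print(z_cc > 1.6 and z_lcc < -1.6)
```

With the thresholds from the text, a node set is trunk-like when the first returned value exceeds 1.6 and the second is below −1.6, and leaf-like in the opposite case.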

The results of this analysis on both seeds and module proteins show that the inflammatory seeds and module, as well as the fibrosome, are trunk-like and, thus, essential for the overall integrity of the network (with high z-score(CC) and strongly negative z-score(LCC)) (Fig. 28a). Note that these results cannot be attributed to high average degree and centrality alone: despite having a higher average degree and betweenness centrality than the network as a whole (Table 8), thrombosis and fibrosis seed proteins are not trunk-like (Fig. 28a and 28b).

It is worth noting that thrombosis and fibrosis seeds are near-subsets of the inflammation seeds, i.e., ~80% of the seeds are inflammatory. However, only inflammation seeds are trunk-like. Similarly, although the inflammasome and thrombosome overlap significantly and are comparable in size, only the inflammasome shows trunk-like behavior. This (a) excludes the possibility that size might be responsible for this effect and (b) indicates that the non-overlapping proteins are responsible for the observed differences in essentiality. Overall, we conclude that the enrichment of the inflammasome with different disease determinants is rooted in its topologically central location within the HI.

Figure 27: Tree analysis. (a) Given a set of nodes (shown in red), we start by removing them from the HI and recording the number of remaining connected components and the size of the remaining LCC. Next, we compare these values to random expectation, where the same number of nodes is randomly removed from the network. Red arrows show the observed LCC size and number of connected components of the remaining nodes for (b) trunk-like and (c) leaf-like behavior of a given set of nodes.

Figure 28: Tree analysis of seed genes and modules. (a) Different rows show the observed size of the LCC and the number of connected components after removing the denoted gene sets. The observed parameter is compared to that of random expectation and a z-score is calculated. As shown, the inflammasome, thrombosome, and fibrosome, as well as inflammatory seed genes, are highly essential for defining the clustered structure of the network. (b) Number of connected components z-score vs. LCC z-score phase diagram.

                           Betweenness centrality   Degree
Inflammation seeds         5.9e-4                   36.92
Inflammation added nodes   1.5e-3                   87.16
Inflammation module        1.1e-3                   62.26
Thrombosis seeds           3.7e-4                   28.28
Thrombosis added nodes     1.3e-3                   80.11
Thrombosis module          1.2e-3                   70.69
Fibrosis seeds             5.9e-4                   38.34
Fibrosis module            1.4e-3                   78.62
Network                    1.9e-4                   21.11

Table 8: Topological network properties of endophenotype modules.

3.3 Functionality of detected endophenotype modules using macrophages

During inflammatory responses, monocytes differentiate into macrophages17 [130], which function as either pro-inflammatory or anti-inflammatory cells known as M1 and M2 macrophages, respectively. A key difference between M1 and M2 macrophages lies in their gene expression pattern and protein levels. M1 macrophages may play a key role in an acute phase of inflammation through the production of injurious molecules, whereas M2 cells may participate in tissue repair in a later phase.

In order to identify proteins that play a role in M1 (IFNγ) responses [135, 136], we used two unbiased quantitative proteomic datasets generated from human THP-1 macrophage-like cells stimulated without (M0) or with IFNγ (M1). Proteins were sampled at six time points up to 72 hours of stimulation (see Chapter 4 for data description and analysis). This experimental procedure yields a time series of protein abundance that can inform or suggest downstream causation (Fig. 30a).

A total of 3,821 proteins with at least one interacting partner in the HI were detected in both M1 and the unstimulated baseline control, M0. Among these proteins, 447 overlap with the endophenotype modules (p-value = 1.40 × 10^−15). We refer to these 447 proteins as “ome-M1” proteins, indicating the proteins detected in both M0 and M1 that overlap with the three endophenotype modules, the inflammasome, thrombosome, and fibrosome.

Next, we observed that the M0 and M1 proteins residing in the endophenotype modules (ome-M proteins) have significantly different functional annotations from those outside of the modules. To show this, we proceeded as follows (Figure 29a):

(a) We store the pathways enriched by ome-M proteins in a pathway array $P^0 = [p^0_1, p^0_2, \ldots]$.

(b) From all proteins detected in the M0 and M1 conditions, we randomly select as many proteins as there are ome-M proteins.

(c) We store the pathways enriched by these randomly selected proteins in $P^1 = [p^1_1, p^1_2, \ldots]$.

(d) We repeat steps (b) and (c) 1000 times to obtain the enriched pathway arrays $P^1, P^2, \ldots, P^{1000}$.

(e) We calculate the Jaccard similarity of all pairwise combinations of the 1000 pathway arrays.

(f) Next, we calculate the Jaccard similarity of $P^0$ with each of the 1000 pathway arrays.

Figure 29: Functional similarities of ome-M proteins. (a) The flowchart shows the steps towards calculating functional cohesiveness among ome-M proteins inside and outside detected modules. (b) Detected proteins inside the modules (ome-M proteins) differ functionally and significantly from the detected proteins outside the modules.

The distributions of Jaccard similarities calculated in steps (e) and (f) are shown in gray and red, respectively, in Figure 29b. They show that the detected proteins that reside in the modules differ functionally and significantly from the detected proteins outside the modules. Therefore, we limit our further studies to these ome-M proteins.
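The comparison in steps (e) and (f) reduces to Jaccard similarities between sets of pathway names. The sketch below assumes each pathway array is available as a collection of pathway identifiers; the pathway enrichment step itself is not reproduced, and the function names and toy arrays are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections."""
    a, b = set(a), set(b)
    union = a | b
    # Convention chosen here: two empty arrays count as identical.
    return len(a & b) / len(union) if union else 1.0


def similarity_distributions(p0, random_arrays):
    """Return (background, observed) lists of Jaccard similarities.

    background: all pairwise similarities among the random arrays (step e).
    observed:   similarity of p0 to every random array (step f).
    """
    background = [jaccard(x, y) for x, y in combinations(random_arrays, 2)]
    observed = [jaccard(p0, x) for x in random_arrays]
    return background, observed


# Toy example with three random pathway arrays.
p0 = ["HEMOSTASIS", "PLATELET-ACTIVATION"]
randoms = [["CELL-CYCLE", "APOPTOSIS"], ["CELL-CYCLE"], ["APOPTOSIS", "HEMOSTASIS"]]
background, observed = similarity_distributions(p0, randoms)
print(background, observed)
```

A clear shift of the observed distribution below the background distribution would correspond to the functional difference reported in Figure 29b.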

3.3.1 Detection of early and late proteins in response to inflammatory stimulator

As we are interested in finding proteins responsive to inflammatory stimuli, we study the protein abundance in M1 relative to M0 (where proteins are not induced and their abundance varies normally). Therefore, we first calculated the fold change of protein abundances at each time point by dividing the protein abundance in M1 by that in M0. Next, we identified subgroups of enhanced and suppressed proteins by applying k-means clustering to the time series of fold changes for ome-M1 proteins. For performing k-means clustering, we used Cluster 3.0, and for viewing the heatmap, we used Java TreeView.

The change in the sum of within-cluster distances (sw) of protein levels with respect to the number of clusters suggests k = 5 as the optimal number of clusters (Fig. 30b, elbow method). The first (last) identified cluster represents a set of proteins with high (low) relative abundance throughout the measurement time (Fig. 30c). Cluster 2 represents a subset of proteins in which protein abundance is higher than the M0 baseline during the first day and decreases thereafter. Clusters 3 and 4 together represent two subsets of proteins that are highly expressed only after the first day of activation with IFNγ. We refer to these two subgroups as early and late proteins, where early proteins (cluster 2) have an elevated relative abundance within the first day and decreased levels thereafter, and late proteins (clusters 3 and 4) are unaffected within the first day and increase their expression after 24 hrs.
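The fold-change and elbow steps can be illustrated with a small, self-contained sketch. It is not the Cluster 3.0 procedure used in the text: it assumes plain Lloyd-style k-means with squared Euclidean distance and random initial centers, and all names and the toy data are hypothetical.

```python
import random


def fold_change(m1, m0):
    """Point-wise ratio of stimulated (M1) to baseline (M0) abundances."""
    return [a / b for a, b in zip(m1, m0)]


def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans(points, k, n_iter=100, seed=0):
    """Return (labels, sw), where sw is the sum of within-cluster squared distances."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest center for every point.
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        # Update step: each center becomes the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    sw = sum(sq_dist(p, centers[l]) for p, l in zip(points, labels))
    return labels, sw


# Toy elbow curve on one-dimensional "time series" with two obvious groups.
points = [[0.0], [2.0], [10.0], [12.0]]
sw_curve = {k: kmeans(points, k)[1] for k in range(1, 5)}
print(sw_curve)
```

The elbow is the value of k beyond which sw stops decreasing sharply; in the toy data this happens at k = 2, whereas the text reports k = 5 for the ome-M1 fold-change profiles.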

3.3.2 Early proteins may be responsible for triggering late proteins

The enhanced (suppressed) abundance level of proteins within (after) the first day of activation with IFNγ suggests that the high abundance of early proteins on the first day is mechanistically linked to the abundance of late proteins. This observation is also consistent with the connectivity patterns among early and late proteins within the interactome: each late or early protein has k_in interactions with the other proteins within its own group and k_out interactions with the proteins of the other group. We find that early proteins tend to interact with late proteins more than they do with themselves. In contrast, late proteins tend to interact with each other more than they interact with early proteins. An early protein has an average k_in of 3.33 and an average k_out of 8.24, whereas a late protein has an average k_in of 9.71 and an average k_out of 4.71.

This observation suggests that early proteins are responsible for triggering late proteins, while downstream, triggered late proteins tend to interact with each other.

Figure 30: Detecting early and late proteins of inflammatory responses. (a) Schematic representation of applying the inflammatory stimulator to THP-1 cells. (b) Sum of within-cluster distances vs. number of clusters, where k = 5 was found to give the optimal clustering. (c) Clusters formed by k-means clustering analysis of M1 macrophages, where the two boxes indicate late and early expressed proteins. (d) Network representation of early and late proteins within the detected endophenotype modules and the enrichment of early proteins within the cross-talk region of the three endophenotype modules.
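The average k_in and k_out of a protein group can be computed directly from an adjacency structure. The sketch below assumes an undirected adjacency dictionary and ignores self-interactions; the function name and the toy network are hypothetical.

```python
def mean_in_out_degree(adjacency, group, other):
    """Return (mean k_in, mean k_out) for the proteins in `group`.

    k_in counts neighbors inside `group`, k_out counts neighbors in `other`.
    """
    group, other = set(group), set(other)
    k_in = k_out = 0
    for node in group:
        neighbors = set(adjacency.get(node, ())) - {node}
        k_in += len(neighbors & group)
        k_out += len(neighbors & other)
    return k_in / len(group), k_out / len(group)


# Toy network: early proteins a, b; late proteins c, d, e.
adjacency = {
    "a": {"c", "d"},
    "b": {"c"},
    "c": {"a", "b", "d", "e"},
    "d": {"a", "c", "e"},
    "e": {"c", "d"},
}
early, late = {"a", "b"}, {"c", "d", "e"}
print(mean_in_out_degree(adjacency, early, late))  # early: mostly outward links
print(mean_in_out_degree(adjacency, late, early))  # late: mostly inward links
```

In this toy case the early group has a larger k_out than k_in and the late group the reverse, which is the qualitative pattern reported above.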

To define a high-confidence set of early and late proteins, we compared the average abundance levels of proteins within and after the first 24 hrs (Fig. 30c) and selected those that satisfy three different confidence criteria: (a) p-value < 0.05 and FC > 1.5, (b) p-value < 0.05, and (c) FC > 1.5. Early and late proteins, while separated, are interconnected within the modules and, thus, directly influence each other. A list of the top 20 pathways enriched by ome-M1 proteins characterized by confidence criterion (a) can be found in Table 9. Tables 10 and 11 list the same properties for proteins characterized by criteria (b) and (c). We generated the network representation of these proteins (Fig. 30d), where we observe that, unlike late proteins, early proteins are significantly located within the cross-talk of the modules (p-value = 0.01). Our observation suggests that the inflammatory responses initiate from within the cross-talk and trigger other downstream proteins in a cascade.

Early proteins (FC > 1.5 and p-value < 0.05)
Topological properties: #proteins = 33, M = 26, LCC size = 14, <k> = 81.03, <k_in> = 1.57, <k_out> = 0.64, p_in,out = 0.01
Proteins: STMN1, VAV3, ITGA3, CARD9, GNA12, IFNGR1, RASA1, PARP1, CD36, SCARB1, CSNK2A1, PTGS1, CD22, TNPO1, DHFR, PEBP1, GPX1, AKT2, PRKDC, CD9, LRPPRC, HSPB1, TOP2A, CLTC, CABIN1, CD58, CCNB1, CALM1, CDK1, CDK9, CDK7, CSK, RPS27A
Top 20 pathways: REACTOME-HEMOSTASIS, REACTOME-FORMATION-OF-PLATELET-PLUG, BIOCARTA-PTC1-PATHWAY, BIOCARTA-SRCRPTP-PATHWAY, REACTOME-PLATELET-ACTIVATION, REACTOME-CYCLIN-A1-ASSOCIATED-EVENTS-DURING-G2-M-TRANSITION, BIOCARTA-CELLCYCLE-PATHWAY, BIOCARTA-G2-PATHWAY, KEGG-HEMATOPOIETIC-CELL-LINEAGE, REACTOME-E2F-MEDIATED-REGULATION-OF-DNA-REPLICATION, REACTOME-G1-S-TRANSITION, KEGG-CELL-CYCLE, REACTOME-E2F-ENABLED-INHIBITION-OF-PRE-REPLICATION-COMPLEX-FORMATION, KEGG-MAPK-SIGNALING-PATHWAY, REACTOME-RECRUITMENT-OF-NUMA-TO-MITOTIC-CENTROSOMES, REACTOME-PLATELET-ACTIVATION-TRIGGERS, BIOCARTA-HIVNEF-PATHWAY, BIOCARTA-AKAP95-PATHWAY, REACTOME-PHOSPHORYLATION-OF-THE-APC, KEGG-B-CELL-RECEPTOR-SIGNALING-PATHWAY

Late proteins (FC > 1.5 and p-value < 0.05)
Topological properties: #proteins = 18, LCC size = 5, M = 7, <k> = 64.22, <k_in> = 0.78, <k_out> = 1.17, p_in,out = 0.24
Proteins: TAB1, IL1RN, IL1B, NAMPT, VDAC1, NCF1, CTTN, RPS24, CD74, TANK, BIRC2, TRAF3, IRAK1, TRADD, RANBP9, CRK, CASP7, NCF2
Top 20 pathways: KEGG-LEISHMANIA-INFECTION, KEGG-APOPTOSIS, BIOCARTA-IL1R-PATHWAY, BIOCARTA-NFKB-PATHWAY, KEGG-TOLL-LIKE-RECEPTOR-SIGNALING-PATHWAY, BIOCARTA-DEATH-PATHWAY, BIOCARTA-HIVNEF-PATHWAY, KEGG-NOD-LIKE-RECEPTOR-SIGNALING-PATHWAY, KEGG-RIG-I-LIKE-RECEPTOR-SIGNALING-PATHWAY, BIOCARTA-TNFR2-PATHWAY, BIOCARTA-MITOCHONDRIA-PATHWAY, BIOCARTA-CASPASE-PATHWAY, BIOCARTA-STRESS-PATHWAY, REACTOME-APOPTOSIS, BIOCARTA-TOLL-PATHWAY, REACTOME-GENES-INVOLVED-IN-APOPTOTIC-CLEAVAGE-OF-CELLULAR-PROTEINS, REACTOME-APOPTOTIC-EXECUTION-PHASE, KEGG-MAPK-SIGNALING-PATHWAY, KEGG-SMALL-CELL-LUNG-CANCER, REACTOME-TOLL-RECEPTOR-CASCADES

Table 9: Topological and biological properties of early and late proteins characterized by confidence level criterion (a): FC > 1.5 and p-value < 0.05

Early proteins (p-value < 0.05)
Topological properties: #proteins = 47, M = 36, LCC size = 23, <k> = 64.57, <k_in> = 1.53, <k_out> = 2.76, P-value = 0.04
Proteins: STMN1, VAV3, ITGA3, CARD9, GNA12, IFNGR1, VCL, RASA1, PARP1, CD36, SCARB1, CSNK2A1, PRKDC, CD9, LRPPRC, HSPB1, PTPN2, TOP2A, CLTC, CABIN1, CD58, MTHFD1, PIK3CG, CBS, PTGS1, CD22, TNPO1, CAD, DHFR, PEBP1, GPX1, AKT2, PON2, ROCK2, CD2AP, CCNB1, DAB2, CALM1, BRAF, MYO6, CASP8, IQGAP1, CDK1, CDK9, CDK7, CSK, RPS27A
Top 20 pathways: REACTOME-HEMOSTASIS, REACTOME-FORMATION-OF-PLATELET-PLUG, KEGG-REGULATION-OF-ACTIN-CYTOSKELETON, REACTOME-PLATELET-ACTIVATION, KEGG-FOCAL-ADHESION, REACTOME-GAP-JUNCTION-DEGRADATION, BIOCARTA-PTC1-PATHWAY, BIOCARTA-SRCRPTP-PATHWAY, KEGG-PROGESTERONE-MEDIATED-OOCYTE-MATURATION, REACTOME-CYCLIN-A1-ASSOCIATED-EVENTS-DURING-G2-M-TRANSITION, KEGG-CHEMOKINE-SIGNALING-PATHWAY, REACTOME-PLATELET-ACTIVATION-TRIGGERS, BIOCARTA-HIVNEF-PATHWAY, BIOCARTA-IGF1-PATHWAY, REACTOME-GAP-JUNCTION-TRAFFICKING, REACTOME-COLLAGEN-MEDIATED-ACTIVATION-CASCADE, BIOCARTA-INSULIN-PATHWAY, KEGG-GLIOMA, KEGG-NEUROTROPHIN-SIGNALING-PATHWAY, BIOCARTA-CELLCYCLE-PATHWAY

Late proteins (p-value < 0.05)
Topological properties: #proteins = 55, M = 159, LCC size = 47, <k> = 107.6, <k_in> = 5.78, <k_out> = 2.36, P-value = 1.7e-4
Proteins: TAB1, IL1RN, IL1B, NAMPT, AHCY, HMGB1, VDAC1, VDAC2, NCF1, LIMA1, CTTN, RPS24, KRAS, TRAF3, IRAK1, TRADD, CSNK2B, PAK2, CASP4, VLDLR, VIM, HSPA9, HSPA8, TLR2, ABI1, HSP90AA1, HSPD1, PPM1B, ACTN4, RDX, REL, ALOX5, EDF1, CD74, GRB2, NFKB2, GNB2L1, TANK, ENO1, BIRC2, HCLS1, RAN, EIF3E, RANBP9, MAPK13, CRK, ASAP1, LGALS3, CASP7, NCF2, PTPN12, HNRNPA1, RPL22, HSPA5, PAFAH1B1
Top 20 pathways: REACTOME-HEMOSTASIS, REACTOME-FORMATION-OF-PLATELET-PLUG, KEGG-REGULATION-OF-ACTIN-CYTOSKELETON, REACTOME-PLATELET-ACTIVATION, KEGG-FOCAL-ADHESION, REACTOME-GAP-JUNCTION-DEGRADATION, BIOCARTA-PTC1-PATHWAY, BIOCARTA-SRCRPTP-PATHWAY, KEGG-PROGESTERONE-MEDIATED-OOCYTE-MATURATION, REACTOME-CYCLIN-A1-ASSOCIATED-EVENTS-DURING-G2-M-TRANSITION, KEGG-CHEMOKINE-SIGNALING-PATHWAY, REACTOME-PLATELET-ACTIVATION-TRIGGERS, BIOCARTA-HIVNEF-PATHWAY, BIOCARTA-IGF1-PATHWAY, REACTOME-GAP-JUNCTION-TRAFFICKING, REACTOME-COLLAGEN-MEDIATED-ACTIVATION-CASCADE, BIOCARTA-INSULIN-PATHWAY, KEGG-GLIOMA, KEGG-NEUROTROPHIN-SIGNALING-PATHWAY, BIOCARTA-CELLCYCLE-PATHWAY

Table 10: Topological and biological properties of early and late proteins characterized by confidence level criterion (b): p-value < 0.05

Early proteins (FC > 1.5)
Topological properties: #proteins = 67, M = 85, LCC size = 36, <k> = 72.81, <k_in> = 2.54, <k_out> = 1.37, P-value = 4.47e-3
Proteins: STMN1, VAV3, CALU, DNAJB1, ITGAL, ITGA4, ITGA3, RPS6KA3, CARD9, GNA12, CAV1, IFNGR1, GSTM1, PTGES3, RASA1, PARP1, CD36, SCARB1, CSNK2A1, PTGS1, CD22, EIF3M, TNPO1, TFRC, DHFR, PPP1CA, GAPDH, PEBP1, EPHX1, AGT, ETS1, AHSG, GPX1, AKT2, RPS27A, CD63, CSNK2A2, GC, PIK3CB, CCNB1, EIF4A2, CALM1, ALB, GNAI2, PGRMC2, GGT1, CDK1, CDK9, CDK7, HLA-B, CSK, HCK, TBXAS1, PRKDC, CD9, LRPPRC, IFI30, CASP3, HSPB1, HSPA14, PLA2G4A, MYL6, TOP2A, CLTC, CABIN1, TXN, CD58
Top 20 pathways: REACTOME-HEMOSTASIS, REACTOME-FORMATION-OF-PLATELET-PLUG, REACTOME-PLATELET-ACTIVATION, REACTOME-PLATELET-ACTIVATION-TRIGGERS, KEGG-ARACHIDONIC-ACID-METABOLISM, KEGG-PROGESTERONE-MEDIATED-OOCYTE-MATURATION, REACTOME-PLATELET-DEGRANULATION, KEGG-HEMATOPOIETIC-CELL-LINEAGE, REACTOME-SIGNALING-IN-IMMUNE-SYSTEM, REACTOME-CELL-SURFACE-INTERACTIONS-AT-THE-VASCULAR-WALL, KEGG-REGULATION-OF-ACTIN-CYTOSKELETON, REACTOME-PROSTANOID-HORMONES, BIOCARTA-PTC1-PATHWAY, BIOCARTA-SRCRPTP-PATHWAY, BIOCARTA-AKAP95-PATHWAY, KEGG-MAPK-SIGNALING-PATHWAY, KEGG-NATURAL-KILLER-CELL-MEDIATED-CYTOTOXICITY, KEGG-FOCAL-ADHESION, REACTOME-CYCLIN-A1-ASSOCIATED-EVENTS-DURING-G2-M-TRANSITION, REACTOME-HORMONE-BIOSYNTHESIS

Late proteins (FC > 1.5)
Topological properties: #proteins = 42, LCC size = 8, M = 21, <k> = 63.88, <k_in> = 1, <k_out> = 2.19, P-value = 6.14e-3
Proteins: MYL12A, TANK, FSCN1, BIRC2, TRAF3, IRAK1, TRADD, RHOC, RELB, JUN, MX1, TGM2, RANBP9, MAPK9, CRK, HRAS, CASP10, CASP7, NCF2, IFIH1, SPP1, LPL, TAB1, IL1RN, IL1B, DOK1, NAMPT, RPS6KA5, GAB2, MARCKS, CAMK2G, RPS13, VDAC1, NCF1, TIMP3, FYB, ARHGAP17, CTTN, RPS24, IRF5, CD74, ITGB3
Top 20 pathways: REACTOME-HEMOSTASIS, REACTOME-FORMATION-OF-PLATELET-PLUG, REACTOME-PLATELET-ACTIVATION, REACTOME-PLATELET-ACTIVATION-TRIGGERS, KEGG-ARACHIDONIC-ACID-METABOLISM, KEGG-PROGESTERONE-MEDIATED-OOCYTE-MATURATION, REACTOME-PLATELET-DEGRANULATION, KEGG-HEMATOPOIETIC-CELL-LINEAGE, REACTOME-SIGNALING-IN-IMMUNE-SYSTEM, REACTOME-CELL-SURFACE-INTERACTIONS-AT-THE-VASCULAR-WALL, KEGG-REGULATION-OF-ACTIN-CYTOSKELETON, REACTOME-PROSTANOID-HORMONES, BIOCARTA-PTC1-PATHWAY, BIOCARTA-SRCRPTP-PATHWAY, BIOCARTA-AKAP95-PATHWAY, KEGG-MAPK-SIGNALING-PATHWAY, KEGG-NATURAL-KILLER-CELL-MEDIATED-CYTOTOXICITY, KEGG-FOCAL-ADHESION, REACTOME-CYCLIN-A1-ASSOCIATED-EVENTS-DURING-G2-M-TRANSITION, REACTOME-HORMONE-BIOSYNTHESIS

Table 11: Topological and biological properties of early and late proteins characterized by confidence level criterion (c): FC > 1.5

3.4 Discussion

Starting from high-confidence literature-curated seed genes and using the DIAMOnD methodology, we detected three sub-regions within the HI associated with inflammatory, thrombotic, and fibrotic responses. These highly overlapping regions are significantly enriched with several disease determinants, including: (a) disease genes associated with more than 50% of the compiled complex diseases, and (b) differentially expressed genes associated with cardiovascular risk factors (i.e., preclinical disease).

Separately, we found IL6, IGF1, extrinsic prothrombin activation, the AP1 family of transcription factors, and PECAM1 pathways to be fully embedded within these modules. The three endophenotypes are not only of interest in terms of functional enrichment, but also lie within a topologically important region of the HI. We showed that proteins belonging to the inflammasome and fibrosome are highly essential for maintaining the overall structure and integrity of the network.

To study further the rather large number of proteins in the predicted modules, we dissected them into functional subgroups. As proteins function through a cascade of interactions among cellular components, it is important to be able to map this biological and topological information to a potential molecular mechanism and find the most relevant underlying pathways. We, therefore, divided the genes within the modules into different subgroups, each of which has a certain role in inflammatory processes. These subgroups are defined based on the protein clusters with similar expression patterns in response to the inflammatory cytokine (IFNγ). Detailed analysis of the module response to IFNγ led us to observe four significantly distinctive protein abundance patterns belonging to: (a) expressed proteins, (b) silent proteins, (c) early proteins, and (d) late proteins. Expressed (silent) proteins show an elevated (decreased) level of abundance throughout the course of 72 hrs after IFNγ exposure. Early proteins manifest an elevated abundance during the first day, while late proteins show low abundance during the first day and are increased in abundance thereafter. Our observations suggest that the common underlying mechanism of many inflammatory-driven complex diseases resides in the common core of the endophenotype modules detected in this work. Hence, this region merits more attention in the context of physiology and treatment of inflammatory-driven phenotypes.

4 DATA ANALYSIS AND PREPARATION

Despite the constantly growing body of publicly available data, data extraction and curation can be burdensome. Not all data are readily available, nor are they complete and flawless. Certain biological data in particular are hard to access due to the strict policies of many health institutes. Furthermore, given the intrinsic limitations of experimental devices and human error, these data are prone to noise and thus challenging to refine and process. Therefore, collaborative efforts along with careful considerations are required to acquire and analyze the data. Below I elaborate on the curation and analysis performed on the datasets used in this dissertation.

4.1 Human Interactome (HI)

The underlying network of the studies performed in this dissertation is the consolidated human interactome (HI). In generating the HI, we only consider direct physical protein-protein interactions with reported experimental evidence. As described in [46], we consolidated several data sources including:

(i) Regulatory interactions: We used the TRANSFAC [19] database that lists regulatory interactions derived from the presence of a transcription factor binding site in the promoter region of a certain gene. The resulting network consists of 774 transcription factors and genes connected via 1,335 interactions.

(ii) Binary interactions: We combine several yeast-two-hybrid high-throughput datasets [20–22, 63, 65] with binary interactions from the IntAct [123] and MINT [23] databases. The sum of these data sources yields 28,653 interactions between 8,120 proteins.

(iii) Literature curated interactions: These interactions, typically obtained by low-throughput experiments, are manually curated from the literature. We use IntAct, MINT, BioGRID [124] and HPRD [125], resulting in 88,349 interactions between 11,798 proteins.

(iv) Metabolic enzyme-coupled interactions: Two enzymes are assumed to be coupled if they share adjacent reactions in the KEGG and BIGG databases. In total, we use 5,325 such metabolic links between 921 enzymes from [49].

(v) Protein complexes: Protein complexes are single molecular units that integrate multiple gene products. The CORUM database [126] is a collection of mammalian complexes derived from a variety of experimental tools, from co-immunoprecipitation to co-sedimentation and ion exchange chromatography. In total, CORUM yields 2,837 complexes with 2,069 proteins connected by 31,276 links.

(vi) Kinase network (kinase-substrate pairs): Protein kinases are important regulators in different biological processes, such as signal transduction. PhosphositePlus [127] provides a network of peptides that can be bound by kinases, yielding in total 6,066 interactions between 1,843 kinases and substrates.

(vii) Signaling interactions: The dataset from [128] provides 32,706 interactions between 6,339 proteins that integrate several sources, both high-throughput and literature curation, into a directed network in which cellular signals are transmitted by protein-protein interactions. Note that we do not take the direction of these interactions into account.

The union of all interactions from (i)-(vii) yields a network of 13,460 proteins that are interconnected by 141,296 physical interactions. The network has a power-law degree distribution with a few hubs and a substantial number of low-degree nodes [5], and shows other typical characteristics observed previously in biological networks, such as high clustering and short path lengths.
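Taking the union of several interaction sources while ignoring edge direction can be sketched as follows. The data layout (each source as a list of protein pairs), the function names, and the toy sources are assumptions made for illustration; identifier mapping between databases, which a real consolidation would require, is not shown.

```python
def consolidate(sources):
    """Union of undirected interactions from several sources.

    Each interaction is stored once as a sorted pair, so (A, B) from one
    source and (B, A) from another collapse into a single link.
    """
    edges = set()
    for source in sources:
        for a, b in source:
            edges.add(tuple(sorted((a, b))))
    proteins = {p for edge in edges for p in edge}
    return proteins, edges


def mean_degree(n_proteins, n_interactions):
    """Average degree 2M/N of an undirected network."""
    return 2 * n_interactions / n_proteins


# Toy example: two sources sharing one interaction in opposite orientation.
binary = [("A", "B"), ("B", "C")]
signaling = [("B", "A"), ("C", "D")]
proteins, edges = consolidate([binary, signaling])
print(len(proteins), len(edges))

# Average degree implied by the reported network sizes.
print(round(mean_degree(13460, 141296), 2))
print(round(mean_degree(13681, 144414), 2))
```

For the liver-augmented network used in Chapter 3 (13,681 proteins, 144,414 interactions), 2M/N is about 21.11, in line with the network-wide average degree listed in Table 8.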

The underlying map of the HI used in Chapter 3 contains one additional data source, liver-specific interactions [129]. This is because (a) the focus of Chapter 3 of this dissertation is exploring network models of inflammatory processes and (b) most protein mediators of inflammation are synthesized in the liver. Adding the liver-specific interactions results in a network containing 13,681 proteins that are connected through 144,414 interactions.

4.2 Highly studied proteins within the PPI

Protein-protein interaction maps are the blueprints of Network Medicine and systems biology. Different groups are experimentally exploring and updating these maps. Despite the wide usage of the Literature Curated Interactome (LCI), these sources are known to be biased, for example towards highly studied proteins. The yeast-two-hybrid (Y2H) method, on the other hand, is a high-throughput experimental setup that screens proteins in an unbiased fashion. Current PPI maps are far from complete. In fact, the Y2H data released in 2005 [20] are estimated to cover only 5% of all potential protein interactions. In 2012, this coverage increased to 20% [63] of what is known to be the reference map.

In this section, we refer to the 2005 and 2012 releases of the protein-protein interaction network as HI2005 and HI2012, respectively. Here, we study the topological properties of highly studied proteins within HI2005, HI2012 and LCI and show that, unlike the Y2H-based maps, LCI is clearly biased towards highly studied proteins.

We first defined two classes of proteins: highly studied proteins (hot proteins) and proteins of less interest (cool proteins). This classification is based on the number of publications in which a protein is mentioned [63]. Next, to assess any biases, we study the interactions within and between the hot and cool proteins in both PPI and LCI maps. In other words, we are interested to see whether hot and cool proteins tend to be connected to proteins of the same class or tend to avoid them. Therefore, we use the so-called dyadicity (D) and heterophilicity (H) measures as defined in Figure 31, where both measures range from -1 to 1 [137]. Given two types of nodes (hot and cool in this case), nodes with dyadicity higher than one have a higher number of interactions among themselves than expected by chance, whereas lower dyadicity reflects a lower tendency of nodes to interact with each other. Similarly, high (low) heterophilicity points to a high (low) tendency of nodes to attach to nodes of the other type.

Therefore, a group of nodes with D~1 and H~1 is randomly scattered within the network and does not manifest biased interaction selection. The result of our analysis suggests that both hot and cool proteins have a dyadicity and a heterophilicity of around one in HI2005 and HI2012. This indicates that the scientific interest in a protein does not affect the interactions identified for it by the Y2H method, whereas LCI is clearly biased towards the “popularity” of a protein (Figure 32).
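The exact definitions in Figure 31 are not reproduced in the text. The sketch below therefore assumes a commonly used ratio form, in which the observed number of within-group links and between-group links is divided by the number expected if the links were placed at random, so that D ≈ 1 and H ≈ 1 indicate no bias. The function name and toy networks are hypothetical.

```python
def dyadicity_heterophilicity(edges, nodes, group):
    """Return (D, H) for `group` within an undirected network.

    Assumed ratio form: D = m11 / E[m11] and H = m10 / E[m10], where the
    expectations use the overall link density p = 2M / (N (N - 1)).
    """
    nodes = set(nodes)
    group = set(group) & nodes
    n, n1 = len(nodes), len(group)
    # Undirected links without self-loops, each counted once.
    edge_set = {frozenset(e) for e in edges if len(set(e)) == 2}
    m = len(edge_set)
    p = 2 * m / (n * (n - 1))
    m11 = sum(1 for e in edge_set if e <= group)           # both ends in group
    m10 = sum(1 for e in edge_set if len(e & group) == 1)  # exactly one end in group
    expected_m11 = n1 * (n1 - 1) / 2 * p
    expected_m10 = n1 * (n - n1) * p
    return m11 / expected_m11, m10 / expected_m10


# Toy example 1: in a complete graph every grouping looks random (D = H = 1).
k4 = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"), ("c", "d")]
print(dyadicity_heterophilicity(k4, "abcd", {"a", "b"}))

# Toy example 2: the group only links to itself (D > 1, H = 0).
print(dyadicity_heterophilicity([("a", "b"), ("c", "d")], "abcd", {"a", "b"}))
```

The function requires a group with at least two members and at least one node outside the group, since otherwise the expected counts are zero.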

To address how essential hot and cool proteins are for the connectedness of the network, we apply the so-called “tree analysis” explained in Chapter 3. We found that, in HI2005 and LCI, cool proteins hold the integrity of the interactome (trunk-like). These results are very significant, as the same number of randomly removed proteins would not cause nearly as much damage to the network. Removing hot proteins, on the other hand, does not drastically alter the integrity of the network (Figure 33). In HI2012, however, we observed that cool and hot proteins are equally responsible for holding the structure of the network, and removing either group will disintegrate the network (Table 12).

Figure 31: Dyadicity (D) refers to the tendency of nodes of the same type to interact with each other, whereas heterophilicity (H) refers to the tendency of nodes of different types to interact with each other.

Figure 32: Heterophilicity vs. dyadicity of hot and cool genes in HI2005, HI2012 and LCI. LCI is biased towards highly studied proteins (hot proteins), whereas maps resulting from the Y2H method (HI2005 and HI2012) are unbiased (D~1 and H~1).

Figure 33: (a) and (b) correspond to the LCC size of the remaining proteins after removing cool and hot proteins, respectively, as opposed to random expectations.

4.3 Modular nature of protein-protein interaction network

We assess how the properties of the PPI network evolve as its coverage increases. To do so, we study the topology of different releases of the PPI maps [20, 63]. In particular, we are interested in how newly discovered interactions are placed in the previously released network.

To do so, we consider the protein pairs that established new links in the HI2012 release. We then compute the shortest path between each of these pairs in the previous release, HI2005 (i.e. without the presence of the new link). We compared our results with a case where the same number of new links are rewired to randomly chosen nodes from HI2005, allowing us to calculate a z-score for each newly discovered link.
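The procedure above can be sketched as follows. This is a minimal illustration, not the code used for the analysis: it assumes the older release is given as an adjacency dictionary, that the random reference keeps one endpoint of each new link and redraws the other uniformly among nodes reachable in the older release, and that pairs absent or disconnected in the older release are skipped. All names and parameters are hypothetical.

```python
import random
from collections import deque
from statistics import mean, pstdev


def bfs_distances(adj, source):
    """Shortest-path lengths from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def new_link_z_scores(old_adj, new_links, n_random=1000, seed=0):
    """z-score of the old-network distance of each new link (u, v).

    Negative values mean the pair was closer in the old release than a
    randomly rewired partner of u would be on average.
    """
    rng = random.Random(seed)
    z_scores = {}
    for u, v in new_links:
        if u not in old_adj or v not in old_adj:
            continue  # protein not present in the old release
        dist = bfs_distances(old_adj, u)
        if v not in dist:
            continue  # pair disconnected in the old release
        candidates = [w for w in dist if w != u]
        null = [dist[rng.choice(candidates)] for _ in range(n_random)]
        sigma = pstdev(null)
        if sigma > 0:
            z_scores[(u, v)] = (dist[v] - mean(null)) / sigma
    return z_scores
```

Collecting the returned values into a histogram gives a distribution of the kind that can be compared against the randomized reference.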

Network   Zone   LCC size   z-score
HI2005    Hot    207        -12.67
HI2005    Cool   268        4.01
HI2012    Hot    445        -27.23
HI2012    Cool   745        -4.44
LCI       Hot    367        -93.66
LCI       Cool   269        62.95

Table 12: LCC size of the remaining proteins when removing hot and cool proteins from HI2005, HI2012 and LCI.

Figure 34 shows the z-score distribution corresponding to the actual newly discovered links as opposed to new links placed randomly on HI2005. The shift towards negative values suggests that newly discovered interactions tend to connect proteins that were already closer than average in the previous PPI release, reinforcing the modular structure of the PPI network.

4.3.1 Disease-genes associations

We curated a list of 70 well-characterized complex diseases (Tables 1 and 2) and their known associated proteins from OMIM [138] and GWAS [104]. The corpus of 70 diseases was manually chosen by a medical expert, with the additional criterion of at least 20 associated genes reported in the literature. The gene-disease associations were retrieved from OMIM (Online Mendelian Inheritance in Man; http://www.ncbi.nlm.nih.gov/omim) [26] and GWAS (Genome-Wide Association Studies). The OMIM associations we use also include associations from UniProtKB/Swiss-Prot and have been compiled by [138]. The disease-gene associations from GWAS are obtained from the PheGenI database (Phenotype-Genotype Integrator; http://www.ncbi.nlm.nih.gov/gap/PheGenI) [104], which integrates various NCBI genomic databases. We use a genome-wide significance cutoff of $p \leq 5 \times 10^{-8}$.

Figure 34: The z-score distribution of newly discovered links compared with the z-score distribution of randomly placed links. We calculate a z-score for each newly discovered link. A negative z-score shows that the new link tends to connect a pair of proteins that were already close to each other.

4.3.2 Gene annotations

Gene Ontology (GO) annotations for all genes were extracted from http://www.geneontology.org/ (downloaded Nov. 2011). We only use high-confidence annotations associated with the evidence codes EXP, IDA, IMP, IGI, IEP, ISS, ISA, ISM or ISO. In particular, we do not use annotations inferred from physical interactions (evidence code IPI) in order to avoid circularity. To obtain a complete set of GO terms from the most specific term reported for each gene, all annotations are propagated upwards through the full tree.
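The filtering and upward propagation described above can be sketched as follows. The sketch assumes annotations are available as (gene, term, evidence code) records and the ontology as a mapping from each term to its direct parents; both input formats and all function names are hypothetical and do not correspond to a particular GO file parser.

```python
# Evidence codes retained in the text; IPI is deliberately excluded.
HIGH_CONFIDENCE = {"EXP", "IDA", "IMP", "IGI", "IEP", "ISS", "ISA", "ISM", "ISO"}


def filter_annotations(records):
    """Keep only high-confidence (gene, term, evidence_code) records."""
    gene_terms = {}
    for gene, term, code in records:
        if code in HIGH_CONFIDENCE:
            gene_terms.setdefault(gene, set()).add(term)
    return gene_terms


def propagate_annotations(gene_terms, parents):
    """Add every ancestor of each annotated term to the gene's term set.

    `parents` maps a term to the set of its direct parent terms.
    """
    cache = {}

    def ancestors(term):
        if term not in cache:
            found = set()
            for parent in parents.get(term, ()):
                found.add(parent)
                found |= ancestors(parent)
            cache[term] = found
        return cache[term]

    propagated = {}
    for gene, terms in gene_terms.items():
        full = set()
        for term in terms:
            full.add(term)
            full |= ancestors(term)
        propagated[gene] = full
    return propagated
```

Caching the ancestor sets keeps the propagation cheap even when many genes share the same specific terms.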

The pathway annotations are extracted from the Molecular Signatures Database (MSigDB) published by the Broad Institute, Version 3.1 [131]. MSigDB integrates several different pathway databases; we use the ones from KEGG, Biocarta and Reactome.

4.4 LCC significance

The significance of the clustering of a given set of nodes is obtained by comparing the observed LCC size with the size expected for randomly distributed node sets of the same size, obtained from $10^6$ simulations. The resulting z-score is:

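$$
z\text{-score} = \frac{lcc_{\text{observed}} - \langle lcc \rangle_{\text{randomized}}}{\sigma_{\text{randomized}}}
$$

where $lcc_{\text{observed}}$ is the size of the observed largest connected component, and $\langle lcc \rangle_{\text{randomized}}$ and $\sigma_{\text{randomized}}$ are the average size and standard deviation, respectively, of the largest connected components across all randomized sets.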
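A minimal sketch of this calculation is given below. It assumes the network is stored as an adjacency dictionary, that random node sets are drawn uniformly from all nodes, and that the population standard deviation is used; the number of simulations is a parameter, and the function names are hypothetical.

```python
import random
from statistics import mean, pstdev


def lcc_size(adj, nodes):
    """Size of the largest connected component of the subgraph induced by `nodes`."""
    nodes = set(nodes)
    seen, best = set(), 0
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            u = stack.pop()
            size += 1
            for v in adj[u]:
                if v in nodes and v not in seen:
                    seen.add(v)
                    stack.append(v)
        best = max(best, size)
    return best


def lcc_z_score(adj, node_set, n_random=1000, seed=0):
    """Return (observed LCC size, z-score against random sets of equal size)."""
    rng = random.Random(seed)
    universe = list(adj)
    observed = lcc_size(adj, node_set)
    null = [lcc_size(adj, rng.sample(universe, len(node_set)))
            for _ in range(n_random)]
    sigma = pstdev(null)
    z = (observed - mean(null)) / sigma if sigma > 0 else float("nan")
    return observed, z
```

A large positive z-score indicates that the node set forms a larger connected component than expected by chance.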

4.5 Pathways analysis

For pathway analysis, we used version 3.1 of the Molecular Signatures Database (MSigDB) developed by the Broad Institute [131], which integrates several different databases. Here we use pathways from KEGG, Reactome, and Biocarta. Each pathway is associated with a list of genes, for which we calculate enrichment using Fisher’s exact test.
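The enrichment test can be illustrated with a small sketch. It assumes a one-sided (over-representation) Fisher's exact test, which is equivalent to the upper tail of the hypergeometric distribution; whether a one- or two-sided test was used is not stated in the text, and the function and argument names are hypothetical.

```python
from math import comb


def fisher_enrichment_p(overlap, set_size, pathway_size, universe_size):
    """One-sided Fisher's exact test p-value for pathway over-representation.

    Probability of observing at least `overlap` pathway genes when drawing
    `set_size` genes at random from a universe of `universe_size` genes,
    `pathway_size` of which belong to the pathway.
    """
    total = comb(universe_size, set_size)
    upper = min(set_size, pathway_size)
    tail = sum(
        comb(pathway_size, k) * comb(universe_size - pathway_size, set_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total
```

For example, `fisher_enrichment_p(5, 5, 5, 10)` equals 1/252, the probability that all five drawn genes fall into a five-gene pathway in a universe of ten genes.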

4.6 Genetic association analysis

Using the Human Exome BeadChip platform v.1.1 (Illumina, San Diego), genotype data for candidate functional genetic polymorphisms were collected from the Women’s Genome Health Study (WGHS) [139]. Genotype data were further reduced to genotype calls as described [140]. The genetic markers on this platform mainly correspond to functional changes, which allows them to be assigned to genes. In total, there were 22,516, 22,390, and 22,411 WGHS individuals of European ancestry with genotype data and plasma measures of C-reactive protein (CRP), sICAM1, and fibrinogen, respectively. There were also 526 cases of incident venous thromboembolism (VTE) compared with 21,479 unaffected WGHS individuals with genotype data. We tested the SNPs for association using linear or logistic regression. We selected the SNPs with a minor allele frequency of at least 0.0005, and to each gene we assigned the corrected (Šidák method) minimum p-value among the SNPs mapping to it.
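The gene-level assignment can be sketched as follows, assuming the standard Šidák form $p_{\text{gene}} = 1 - (1 - p_{\min})^{n}$, where $n$ is the number of SNPs mapped to the gene that pass the frequency filter. The input format (gene, minor allele frequency, p-value) and the function names are hypothetical.

```python
from math import expm1, log1p


def sidak_gene_p(snp_p_values):
    """Sidak-corrected minimum p-value over the SNPs of one gene."""
    n = len(snp_p_values)
    p_min = min(snp_p_values)
    if p_min >= 1.0:
        return 1.0
    # Numerically stable form of 1 - (1 - p_min) ** n for very small p_min.
    return -expm1(n * log1p(-p_min))


def gene_level_p(snps, maf_threshold=0.0005):
    """Map each gene to its corrected minimum p-value.

    `snps` is an iterable of (gene, minor_allele_frequency, p_value);
    SNPs below the frequency threshold are discarded first.
    """
    by_gene = {}
    for gene, maf, p in snps:
        if maf >= maf_threshold:
            by_gene.setdefault(gene, []).append(p)
    return {gene: sidak_gene_p(ps) for gene, ps in by_gene.items()}
```

The correction penalizes genes with many mapped SNPs, which would otherwise be more likely to show a small minimum p-value by chance.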

4.7 Differential expression analysis of cardiovascular risk

We were provided with gene expression data derived from the population-based Gutenberg Health Study (GHS) through collaboration [133, 134]. The dataset consists of mRNA counts from Illumina HT-12 v3 BeadChips (n = 1,285). The dataset is not publicly available; however, the data are accessible through direct collaboration. Analyses were conducted at the University Heart Center, Hamburg, Germany.

We analyzed the data to retrieve differentially expressed genes associated with cardiovascular risk factors. The sample sizes were selected so that the biomarker levels are consistent with the recommended effect size (low and high risk ranges) from the literature [141–143]. Therefore, we defined cases and controls as individuals within the top and bottom 25% of the risk factor level distribution. The case and control sample sizes are listed in Table 13. To derive the differentially expressed genes, we performed a non-parametric Mann-Whitney U-test. Figure 35 shows the Venn diagram of differentially expressed genes with respect to different risk factors. As shown in the figure, the differentially expressed genes associated with different molecules are highly overlapping and pleiotropic.

Figure 35: Venn diagram of differentially expressed genes associated with cardiovascular risk factors CRP, fibrinogen, HDL, triglycerides, and APO-A.

Molecule       Case sample size (high risk range)   Control sample size (low risk range)
CRP            375 (>3 mg/l)                        372 (<1 mg/l)
Fibrinogen     324 (>402 mg/dl)                     325 (<302 mg/dl)
APO-A          329 (<1.46 g/l)                      325 (>1.86 g/l)
APO-B          328 (>1.20 g/l)                      330 (<0.85 g/l)
HDL            328 (<45 mg/dl)                      346 (>66 mg/dl)
LDL            316 (>164 mg/dl)                     318 (<115 mg/dl)
Triglyceride   325 (>153 mg/dl)                     327 (<80 mg/dl)

Table 13: Case and control sample sizes and biomarker level ranges.
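The case/control split and the per-gene test can be sketched as follows. This is an illustration under stated assumptions rather than the pipeline used for the analysis: groups are taken as exact bottom and top quartiles of one biomarker, and the Mann-Whitney U-test is computed two-sided with the normal approximation and tie correction, without continuity correction. Function names and input formats are hypothetical.

```python
from math import erfc, sqrt


def quartile_groups(levels):
    """Split sample ids into (bottom 25%, top 25%) by biomarker level.

    `levels` maps sample id -> biomarker level. Which group counts as
    "case" depends on the direction of risk for the biomarker.
    """
    ordered = sorted(levels, key=levels.get)
    k = len(ordered) // 4
    if k == 0:
        return [], []
    return ordered[:k], ordered[-k:]


def mann_whitney_u(x, y):
    """Return (U statistic of x, two-sided p-value) via normal approximation."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x, tie_term, i = 0.0, 0.0, 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        t = j - i
        tie_term += t ** 3 - t
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    mu = n1 * n2 / 2.0
    var = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var == 0:
        return u, 1.0
    z = (u - mu) / sqrt(var)
    return u, erfc(abs(z) / sqrt(2.0))
```

In use, the expression values of one gene in the two groups would be passed as `x` and `y`, and the resulting p-values would still need a multiple-testing correction across genes.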

We showed that the inflammasome, thrombosome, and fibrosome are significantly enriched with differentially expressed genes of the CRP, HDL, APO-A, and triglyceride modules. There were only 9 and 3 differentially expressed genes associated with APO-B and LDL, respectively. Therefore, due to a lack of statistical power, we observed non-significant enrichment of module proteins with those genes. Surprisingly, neither the seeds nor the detected modules are enriched with fibrinogen genes.

4.8 THP-1 cell culture experiments and proteomics

Detailed studies of the stimulation experiments on the human macrophage-like cell line THP-1 and the corresponding quantitative proteomics studies are included in (H. Iwata et al., manuscript under review) and (P. Ricchiuto et al., manuscript under review). Briefly, THP-1 cells (ATCC) were left unstimulated (M0) or stimulated with 10 ng/ml IFNγ for 72 hours (M1). Cells were collected from each time course condition at six time points (0, 8, 12, 24, 48 and 72 hours) for subsequent protein isolation, proteolysis and tandem mass tagging [TMT-6plex, Pierce]. The peptides were analyzed by an LTQ-Orbitrap Elite model (Thermo Scientific) coupled to an Easy-nLC1000 HPLC pump (Thermo Scientific). The MS/MS data were queried against the human Uniprot database (downloaded on March 27, 2012) using the SEQUEST search algorithm via the Proteome Discoverer Software package (version 1.3, Thermo Scientific).

Proteomic analysis revealed that 4,680 (4,589) proteins were detected (with at least 2 unique peptide IDs and a unique gene ID) in the M0 (M1) condition, among which 4,278 (4,188) proteins were found to have at least one interacting partner in the consolidated HI.

In order to explore protein abundance changes in response to inflammatory stimuli, we restricted our study to the proteins whose levels were measured in both the M0 and M1 conditions. We found that of the 4,143 proteins found in both datasets, 447 reside within the detected endophenotype modules ($p = 1.13 \times 10^{-9}$).

5 CONCLUSIONS AND FUTURE DIRECTIONS

The era of Big Data has offered an immense wealth of knowledge and data in a variety of disciplines, from physics to computer science and from sociology to biology and medicine. Over decades, many health institutes have collected detailed clinical patient data. As a consequence, a rich and constantly growing collection of biological and temporal clinical data is now available. Yet, these data are not informative by themselves. To harvest their full potential, we need to explore them through proper statistical analysis and visualization methods. Advances in network science and statistical physics have led to the design of several tools and methodologies to analyze the wealth of biological and clinical data in the framework of “Network Medicine”. Network medicine aims to analyze these data through an integrative, network-based approach, to understand and ultimately treat human diseases.

Numerous challenges and tasks need to be addressed for network medicine to accomplish its goals. These tasks include, for example, assembling complete functional annotations of all cellular components, understanding and reducing the effect of study biases, generating the complete map of the HI and identifying drug-target interactions. All these factors are gradually being improved and updated, bringing network medicine closer to fulfilling its goals. Yet, despite the incomplete and often noisy nature of the available biological and clinical data, network medicine has already proven its predictive power in several studies. Although further advances in this field rely on developing proper network tools and measures along with high-throughput data collection, equally important is the collaborative effort across different disciplines to interpret and validate the biological insights brought by data analysis tools.

In this dissertation we showed that network science can offer valuable tools and measures to quantify the topological properties of disease proteins within the Human Interactome. We found that conventional tools, such as topological community detection methods and modularity measures, are not sufficient to characterize the network properties of disease-associated proteins. We therefore developed several new, more accurate measures to gain insights into the network-based mechanisms of human diseases. The two main contributions of this dissertation can be summarized as follows:

(a) A systematic analysis of the network-based properties of disease proteins associated with 70 diseases. We found that disease proteins are not randomly scattered within the HI, but agglomerate in specific regions, suggesting the existence of specific disease modules for each disease. In contrast to previous, widely held assumptions, we could show that disease modules do not constitute particularly dense subgraphs and can therefore not be identified with traditional community detection algorithms. We were able to identify connectivity significance as a more accurate measure to characterize the interaction patterns of disease proteins. In contrast to previous modularity measures, the connectivity significance is robust to the incompleteness of currently available interactome data. These insights allowed us to rationally design a reliable and efficient Disease Module Detection algorithm (DIAMOnD).

(b) Identifying the inflammasome, thrombosome and fibrosome. Using the methods introduced in part (a), we constructed network models of inflammation, thrombosis and fibrosis, which are known to play a crucial role in the development of many diseases. Indeed, we observed that these regions are enriched with several disease determinants. We further studied the topological localization of the detected modules within the HI by introducing the so-called “tree analysis”. Tree analysis determines whether a group of nodes is essential (trunk-like) or non-essential (leaf-like) to the overall structure of the network. In this analysis a given group of nodes is removed from the network, resulting in the separation of the network into several isolated components. The number of remaining connected components, along with the size of the largest connected component, is then compared to the random expectation in which the same number of nodes is randomly removed from the network. We showed that the inflammasome and fibrosome are trunk-like and thus have a topologically central location within the HI. Moreover, we showed that the three detected modules are highly overlapping and that inflammatory processes initiate from the cross-talk region.

The results presented in this dissertation suggest a number of interesting directions for future research. Indeed, the identification of disease modules only represents the very first step towards elucidating the biological mechanisms of the respective diseases. The exact mechanism by which perturbations within disease modules translate into disease manifestation at the symptom level remains largely unknown and is a major challenge for our understanding of human disease. A second important challenge that can be systematically addressed using the framework of disease modules is the identification of promising drug targets.

Lastly, we expect that the concepts developed in this dissertation can also be applied beyond biological networks. Indeed, the notion of disease-specific subnetworks can be generalized, for example, to social networks, in which a group of individuals is attributed to a certain party or shares other interests. Since the spread of interests, ideas, etc. is highly influenced by close connections between individuals, certain attributes could be distributed (and therefore predicted) according to connectivity significance instead of the traditional connectivity density. Hence, the introduced measures and methodology may be applied when studying the spread of ideas, interests, etc. within a network.

BIBLIOGRAPHY

[1] P. Erdős and A. Rényi. On random graphs I. Publ. Math. Debrecen, 6:290–297, 1959. (Cited on page 2.)

[2] E. N. Gilbert. Random Graphs. pages 1141–1144, 1959. (Cited on page 2.)

[3] Duncan J. Watts and Steven H. Strogatz. Collective dynamics of ‘small-world’ networks. Nature, 393(6684):440–442, 1998. (Cited on page 4.)

[4] Reka Albert, Hawoong Jeong, and Albert-Laszlo Barabasi. Internet: Diameter of the World-Wide Web. Nature, 401(6749):130–131, 1999. (Cited on pages 4 and 6.)

[5] A. L. Barabási and R. Albert. Emergence of Scaling in Random Networks. Science, 286, 1999. (Cited on pages 4, 64, and 95.)

[6] H. Jeong, S. P. Mason, A. L. Barabási, and Z. N. Oltvai. Lethality and centrality in protein networks. Nature brief communications, 411, 2001. (Cited on pages 6, 7, 8, and 15.)

[7] Reka Albert, Hawoong Jeong, and Albert-Laszlo Barabasi. Error and attack tolerance of complex networks. Nature, 406(6794):378–382, 2000. (Cited on pages 6 and 8.)

[8] J. Kim, P. L. Krapivsky, B. Kahng, and S. Redner. Infinite-order percolation and giant fluctuations in a protein interaction network. Physical Review E, 66(5):055101, 2002. (Cited on page 7.)

[9] Andreas Wagner. How the global structure of protein interaction networks evolves. Proceedings of the Royal Society B: Biological Sciences, 270(1514):457–466, 2003. (Cited on page 7.)

[10] H. Jeong, B. Tombor, R. Albert, Z. N. Oltvai, and A. L. Barabasi. The large-scale organization of metabolic networks. Nature, 407(6804):651–654, 2000. (Cited on page 7.)

[11] Andreas Wagner and David A. Fell. The small world inside large metabolic networks. Proceedings of the Royal Society of London B: Biological Sciences, 268(1478):1803–1810, 2001.

[12] Andreas Wagner. The Yeast Protein Interaction Network Evolves Rapidly and Contains Few Redundant Duplicate Genes. Molecular Biology and Evolution, 18(7):1283–1292, 2001.

[13] Siming Li, Christopher M. Armstrong, Nicolas Bertin, Hui Ge, Stuart Milstein, Mike Boxem, Pierre-Olivier Vidalain, Jing-Dong J. Han, Alban Chesneau, Tong Hao, Debra S. Goldberg, Ning Li, Monica Martinez, Jean-François Rual, Philippe Lamesch, Lai Xu, Muneesh Tewari, Sharyl L. Wong, Lan V. Zhang, Gabriel F. Berriz, Laurent Jacotot, Philippe Vaglio, Jérôme Reboul, Tomoko Hirozane-Kishikawa, Qianru Li, Harrison W. Gabel, Ahmed Elewa, Bridget Baumgartner, Debra J. Rose, Haiyuan Yu, Stephanie Bosak, Reynaldo Sequerra, Andrew Fraser, Susan E. Mango, William M. Saxton, Susan Strome, Sander van den Heuvel, Fabio Piano, Jean Vandenhaute, Claude Sardet, Mark Gerstein, Lynn Doucette-Stamm, Kristin C. Gunsalus, J. Wade Harper, Michael E. Cusick, Frederick P. Roth, David E. Hill, and Marc Vidal. A Map of the Interactome Network of the Metazoan C. elegans. Science (New York, N.Y.), 303(5657):540–543, 2004.

[14] Soon-Hyung Yook, Zoltán N. Oltvai, and Albert-László Barabási. Functional and topological characterization of protein interaction networks. PROTEOMICS, 4(4):928–942, 2004.

[15] L. Giot, J. S. Bader, C. Brouwer, A. Chaudhuri, B. Kuang, Y. Li, Y. L. Hao, C. E. Ooi, B. Godwin, E. Vitols, G. Vijayadamodar, P. Pochart, H. Machineni, M. Welsh, Y. Kong, B. Zerhusen, R. Malcolm, Z. Varrone, A. Collis, M. Minto, S. Burgess, L. McDaniel, E. Stimpson, F. Spriggs, J. Williams, K. Neurath, N. Ioime, M. Agee, E. Voss, K. Furtak, R. Renzulli, N. Aanensen, S. Carrolla, E. Bickelhaupt, Y. Lazovatsky, A. DaSilva, J. Zhong, C. A. Stanyon, R. L. Finley, K. P. White, M. Braverman, T. Jarvie, S. Gold, M. Leach, J. Knight, R. A. Shimkets, M. P. McKenna, J. Chant, and J. M. Rothberg. A Protein Interaction Map of Drosophila melanogaster. Science, 302(5651):1727–1736, 2003. (Cited on page 7.)

[16] Anton J. Enright, Ioannis Iliopoulos, Nikos C. Kyrpides, and Christos A. Ouzounis. Protein interaction maps for complete genomes based on gene fusion events. Nature, 402(6757):86–90, 1999. (Cited on page 8.)

[17] Patrick Aloy and Robert B. Russell. InterPreTS: protein Interaction Prediction through Tertiary Structure. Bioinformatics, 19(1):161–162, 2003.

[18] Edward M. Marcotte, Matteo Pellegrini, Ho-Leung Ng, Danny W. Rice, Todd O. Yeates, and David Eisenberg. Detecting Protein Function and Protein-Protein Interactions from Genome Sequences. Science, 285(5428):751–753, 1999. (Cited on page 8.)

[19] V. Matys. TRANSFAC(R): transcriptional regulation, from patterns to profiles. Nucleic Acids Research, 31(1):374–378, 2003. (Cited on pages 8, 64, and 93.)

[20] J. F. Rual, K. Venkatesan, T. Hao, T. Hirozane-Kishikawa, A. Dricot, N. Li, G. F. Berriz, F. D. Gibbons, M. Dreze, N. Ayivi-Guedehoussou, N. Klitgord, C. Simon, M. Boxem, S. Milstein, J. Rosenberg, D. S. Goldberg, L. V. Zhang, S. L. Wong, G. Franklin, S. Li, J. S. Albala, J. Lim, C. Fraughton, E. Llamosas, S. Cevik, C. Bex, P. Lamesch, R. S. Sikorski, J. Vandenhaute, H. Y. Zoghbi, A. Smolyar, S. Bosak, R. Sequerra, L. Doucette-Stamm, M. E. Cusick, D. E. Hill, F. P. Roth, and M. Vidal. Towards a proteome-scale map of the human protein-protein interaction network. Nature, 437(7062):1173–8, 2005. (Cited on pages 64, 94, 95, and 98.)

[21] U. Stelzl, U. Worm, M. Lalowski, C. Haenig, F. H. Brembeck, H. Goehler, M. Stroedicke, M. Zenkner, A. Schoenherr, S. Koeppen, J. Timm, S. Mintzlaff, C. Abraham, N. Bock, S. Kietzmann, A. Goedde, E. Toksoz, A. Droege, S. Krobitsch, B. Korn, W. Birchmeier, H. Lehrach, and E. E. Wanker. A human protein-protein interaction network: a resource for annotating the proteome. Cell, 122(6):957–68, 2005.

[22] H. Yu, L. Tardivo, S. Tam, E. Weiner, F. Gebreab, C. Fan, N. Svrzikapa, T. Hirozane-Kishikawa, E. Rietman, X. Yang, J. Sahalie, K. Salehi-Ashtiani, T. Hao, M. E. Cusick, D. E. Hill, F. P. Roth, P. Braun, and M. Vidal. Next-generation sequencing to generate interactome datasets. Nat Methods, 8(6):478–80, 2011. (Cited on pages 64 and 94.)

[23] A. Ceol, A. Chatr Aryamontri, L. Licata, D. Peluso, L. Briganti, L. Perfetto, L. Castagnoli, and G. Cesareni. MINT, the molecular interaction database: 2009 update. Nucleic Acids Res, 38(Database issue):D532–9, 2010. (Cited on pages 8, 64, and 94.)

[24] Peter Uetz, Loic Giot, Gerard Cagney, Traci A. Mansfield, Richard S. Judson, James R. Knight, Daniel Lockshon, Vaibhav Narayan, Maithreyan Srinivasan, Pascale Pochart, Alia Qureshi-Emili, Ying Li, Brian Godwin, Diana Conover, Theodore Kalbfleisch, Govindan Vijayadamodar, Meijia Yang, Mark Johnston, Stanley Fields, and Jonathan M. Rothberg. A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature, 403(6770):623–627, 2000. (Cited on page 8.)

[25] Takashi Ito, Tomoko Chiba, Ritsuko Ozawa, Mikio Yoshida, Masahira Hattori, and Yoshiyuki Sakaki. A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proceedings of the National Academy of Sciences of the United States of America, 98(8):4569–4574, 2001. (Cited on page 8.)

[26] A. Hamosh, A. F. Scott, J. Amberger, C. Bocchini, D. Valle, and V. A. McKusick. Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res, 30(1), 2002. (Cited on pages 9 and 99.)

[27] Jörg Stelling, Uwe Sauer, Zoltan Szallasi, Francis J. Doyle III, and John Doyle. Robustness of Cellular Functions. Cell, 118(6):675–685, 2004. (Cited on page 10.)

[28] Stefano Volinia, Marco Galasso, Stefan Costinean, Luca Tagliavini, Giacomo Gamberoni, Alessandra Drusco, Jlenia Marchesini, Nicoletta Mascellani, Maria Elena Sana, Ramzey Abu Jarour, Caroline Desponts, Michael Teitell, Raffaele Baffa, Rami Aqeilan, Marilena V. Iorio, Cristian Taccioli, Ramiro Garzon, Gianpiero Di Leva, Muller Fabbri, Marco Catozzi, Maurizio Previati, Stefan Ambs, Tiziana Palumbo, Michela Garofalo, Angelo Veronese, Arianna Bottoni, Pierluigi Gasparini, Curtis C. Harris, Rosa Visone, Yuri Pekarsky, Albert de la Chapelle, Mark Bloomston, Mary Dillhoff, Laura Z. Rassenti, Thomas J. Kipps, Kay Huebner, Flavia Pichiorri, Dido Lenze, Stefano Cairo, Marie-Annick Buendia, Pascal Pineau, Anne Dejean, Nicola Zanesi, Simona Rossi, George A. Calin, Chang-Gong Liu, Jeff Palatini, Massimo Negrini, Andrea Vecchione, Anne Rosenberg, and Carlo M. Croce. Reprogramming of miRNA networks in cancer and leukemia. Genome Research, 20(5):589–599, 2010. (Cited on page 10.)

[29] Guanming Wu, Xin Feng, and Lincoln Stein. A human functional protein interaction network and its application to cancer data analysis. Genome Biology, 11(5):R53, 2010.

[30] Aleksej Zelezniak, Tune H. Pers, Simão Soares, Mary Elizabeth Patti, and Kiran Raosaheb Patil. Metabolic Network Topology Reveals Transcriptional Regulatory Signatures of Type 2 Diabetes. PLoS Comput Biol, 6(4):e1000729, 2010.

[31] Daehee Hwang, Inyoul Y. Lee, Hyuntae Yoo, Nils Gehlenborg, Ji-Hoon Cho, Brianne Petritis, David Baxter, Rose Pitstick, Rebecca Young, Doug Spicer, Nathan D. Price, John G. Hohmann, Stephen J. DeArmond, George A. Carlson, and Leroy E. Hood. A systems approach to prion disease. Molecular Systems Biology, 5(1), 2009.

[32] Eric Bonnet, Marianthi Tatari, Anagha Joshi, Tom Michoel, Kathleen Marchal, Geert Berx, and Yves Van de Peer. Module Network Inference from a Cancer Gene Expression Data Set Identifies MicroRNA Regulated Modules. PLoS ONE, 5(4):e10162, 2010.

[33] Elena Morandi, Cinzia Severini, Daniele Quercioli, Giovanni D’Ario, Stefania Perdichizzi, Miriam Capri, Giovanna Farruggia, Maria Mascolo, Wolfango Horn, Monica Vaccari, Roberto Serra, Annamaria Colacci, and Paola Silingardi. Gene expression time-series analysis of Camptothecin effects in U87-MG and DBTRG-05 glioblastoma cell lines. Molecular Cancer, 7(1):66, 2008.

[34] Xia Yang, Joshua L. Deignan, Hongxiu Qi, Jun Zhu, Su Qian, Judy Zhong, Gevork Torosyan, Sana Majid, Brie Falkard, Robert R. Kleinhanz, Jenny Karlsson, Lawrence W. Castellani, Sheena Mumick, Kai Wang, Tao Xie, Michael Coon, Chunsheng Zhang, Daria Estrada-Smith, Charles R. Farber, Susanna S. Wang, Atila van Nas, Anatole Ghazalpour, Bin Zhang, Douglas J. MacNeil, John R. Lamb, Katrina M. Dipple, Marc L. Reitman, Margarete Mehrabian, Pek Y. Lum, Eric E. Schadt, Aldons J. Lusis, and Thomas A. Drake. Validation of candidate causal genes for obesity that affect shared metabolic pathways and networks. Nat Genet, 41(4):415–423, 2009.

[35] Sergio E. Baranzini. The genetics of autoimmune diseases: a networked perspective. Current Opinion in Immunology, 21(6):596–605, 2009.

[36] I. W. Taylor, R. Linding, D. Warde-Farley, Y. Liu, C. Pesquita, D. Faria, S. Bull, T. Pawson, Q. Morris, and J. L. Wrana. Dynamic modularity in protein interaction networks predicts breast cancer outcome. Nat Biotechnol, 27(2):199–204, 2009. (Cited on page 10.)

[37] J. Craig Venter, Mark D. Adams, Eugene W. Myers, Peter W. Li, Richard J. Mural, Granger G. Sutton, Hamilton O. Smith, Mark Yandell, Cheryl A. Evans, Robert A. Holt, Jeannine D. Gocayne, Peter Amanatides, Richard M. Ballew, Daniel H. Huson, Jennifer Russo Wortman, Qing Zhang, Chinnappa D. Kodira, Xiangqun H. Zheng, Lin Chen, Marian Skupski, Gangadharan Subramanian, Paul D. Thomas, Jinghui Zhang, George L. Gabor Miklos, Catherine Nelson, Samuel Broder, Andrew G. Clark, Joe Nadeau, Victor A. McKusick, Norton Zinder, Arnold J. Levine, Richard J. Roberts, Mel Simon, Carolyn Slayman, Michael Hunkapiller, Randall Bolanos, Arthur Delcher, Ian Dew, Daniel Fasulo, Michael Flanigan, Liliana Florea, Aaron Halpern, Sridhar Hannenhalli, Saul Kravitz, Samuel Levy, Clark Mobarry, Knut Reinert, Karin Remington, Jane Abu-Threideh, Ellen Beasley, Kendra Biddick, Vivien Bonazzi, Rhonda Brandon, Michele Cargill, Ishwar Chandramouliswaran, Rosane Charlab, Kabir Chaturvedi, Zuoming Deng, Valentina Di Francesco, Patrick Dunn, Karen Eilbeck, Carlos Evangelista, Andrei E. Gabrielian, Weiniu Gan, Wangmao Ge, Fangcheng Gong, Zhiping Gu, Ping Guan, Thomas J. Heiman, Maureen E. Higgins, Rui-Ru Ji, Zhaoxi Ke, Karen A. Ketchum, Zhongwu Lai, Yiding Lei, Zhenya Li, Jiayin Li, Yong Liang, Xiaoying Lin, Fu Lu, Gennady V. Merkulov, Natalia Milshina, Helen M. Moore, Ashwinikumar K Naik, Vaibhav A. Narayan, Beena Neelam, Deborah Nusskern, Douglas B. Rusch, Steven Salzberg, Wei Shao, Bixiong Shue, Jingtao Sun, Zhen Yuan Wang, Aihui Wang, Xin Wang, Jian Wang, Ming-Hui Wei, Ron Wides, Chunlin Xiao, Chunhua Yan, and others. The Sequence of the Human Genome. Science, 291(5507):1304–1351, 2001. (Cited on page 11.)

[38] Initial sequencing and analysis of the human genome. Nature, 409(6822):860–921, 2001. (Cited on page 11.)

[39] Fabio Pammolli, Laura Magazzini, and Massimo Riccaboni. The productivity crisis in pharmaceutical R&D. Nat Rev Drug Discov, 10(6):428–438, 2011. (Cited on page 11.)

[40] A. L. Barabási, N. Gulbahce, and J. Loscalzo. Network medicine: a network-based approach to human disease. Nat Rev Genet, 12(1):56–68, 2011. (Cited on pages 11, 14, 17, 23, 34, and 66.)

[41] M. A. Yildirim, K. I. Goh, M. E. Cusick, A. L. Barabási, and M. Vidal. Drug-target network. Nat Biotechnol, 25(10):1119–26, 2007. (Cited on page 11.)

[42] E. E. Schadt. Molecular networks as sensors and drivers of common human diseases. Nature, 461(7261):218–23, 2009. (Cited on pages 11 and 23.)

[43] Liang-Hui Chu and Bor-Sen Chen. Construction of a cancer-perturbed protein-protein interaction network for discovery of apoptosis drug targets. BMC Systems Biology, 2(1):56, 2008.

[44] Asfar S. Azmi, Zhiwei Wang, Philip A. Philip, Ramzi M. Mohammad, and Fazlul H. Sarkar. Proof of Concept: Network and Systems Biology Approaches Aid in the Discovery of Potent Anticancer Drug Combinations. Molecular Cancer Therapeutics, 9(12):3137–3144, 2010.

[45] Shiwen Zhao and Shao Li. Network-Based Relating Pharmacological and Genomic Spaces for Drug Target Identification. PLoS ONE, 5(7):e11764, 2010. (Cited on page 11.)

[46] Jörg Menche, Amitabh Sharma, Maksim Kitsak, Susan Dina Ghiassian, Marc Vidal, Joseph Loscalzo, and Albert-László Barabási. Uncovering disease-disease relationships through the incomplete interactome. Science, 347(6224), 2015. (Cited on pages 11, 14, 24, 66, 76, and 93.)

[47] Jose M. Valderas, Barbara Starfield, Bonnie Sibbald, Chris Salisbury, and Martin Roland. Defining Comorbidity: Implications for Understanding Health and Health Services. Annals of Family Medicine, 7(4):357–363, 2009. (Cited on page 12.)

[48] K. I. Goh, M. E. Cusick, D. Valle, B. Childs, M. Vidal, and A. L. Barabási. The human disease network. Proc Natl Acad Sci U S A, 104(21):8685–90, 2007. (Cited on pages 12, 14, 23, 46, and 66.)

[49] D. S. Lee, J. Park, K. A. Kay, N. A. Christakis, Z. N. Oltvai, and A. L. Barabási. The implications of human metabolic network topology for disease comorbidity. Proc Natl Acad Sci U S A, 105(29):9880–5, 2008. (Cited on pages 12, 64, and 94.)

[50] Ming Lu, Qipeng Zhang, Min Deng, Jing Miao, Yanhong Guo, Wei Gao, and Qinghua Cui. An Analysis of Human MicroRNA and Disease Associations. PLoS ONE, 3(10):e3420, 2008. (Cited on page 12.)

[51] CA. Hidalgo, N. Blumm, A. L. Barabási, and NA. Christakis. A Dynamic Network Approach for the Study of Human Phenotypes. PLoS Comput Biol, 5, 2009. (Cited on page 12.)

[52] Silpa Suthram, Joel T. Dudley, Annie P. Chiang, Rong Chen, Trevor J. Hastie, and Atul J. Butte. Network-Based Elucidation of Human Disease Similarities Reveals Common Functional Modules Enriched for Pluripotent Drug Targets. PLoS Computational Biology, 6(2):e1000662, 2010.

[53] Yueyi Liu, Paul Wise, and Atul Butte. The "etiome": identification and clustering of human disease etiological factors. BMC Bioinformatics, 10(Suppl 2):S14, 2009. (Cited on page 12.)

[54] George J. Brewer. Drug development for orphan diseases in the context of personalized medicine. Translational Research, 154(6):314–322, 2009. (Cited on page 12.)

[55] Ted T. Ashburn and Karl B. Thor. Drug repositioning: identifying and developing new uses for existing drugs. Nat Rev Drug Discov, 3(8):673–683, 2004.

[56] Joseph Lehár, Andrew Krueger, Grant Zimmermann, and Alexis Borisy. High-order combination effects and biological robustness. Molecular Systems Biology, 4:215–215, 2008.

[57] Alexis A. Borisy, Peter J. Elliott, Nicole W. Hurst, Margaret S. Lee, Joseph Lehár, E. Roydon Price, George Serbedzija, Grant R. Zimmermann, Michael A. Foley, Brent R. Stockwell, and Curtis T. Keith. Systematic discovery of multicomponent therapeutics. Proceedings of the National Academy of Sciences, 100(13):7977–7982, 2003. (Cited on page 12.)

[58] J. Loscalzo, I. Kohane, and A. L. Barabási. Human disease classification in the postgenomic era: a complex systems approach to human pathobiology. Mol Syst Biol, 3:124, 2007. (Cited on pages 13, 20, and 61.)

[59] Carolyn Y. Ho and Christine E. Seidman. A Contemporary Approach to Hypertrophic Cardiomyopathy. Circulation, 113(24):e858–e862, 2006. (Cited on page 13.)

[60] Hiroyuki Morita, Heidi L. Rehm, Andres Menesses, Barbara McDonough, Amy E. Roberts, Raju Kucherlapati, Jeffrey A. Towbin, J. G. Seidman, and Christine E. Seidman. Shared Genetic Causes of Cardiac Hypertrophy in Children and Adults. New England Journal of Medicine, 358(18):1899–1908, 2008. (Cited on page 13.)

[61] D. G. Galas and L. Hood. Systems Biology and Emerging Technologies Will Catalyze the Transition from Reactive Medicine to Predictive, Personalized, Preventive and Participatory (P4) Medicine. IBC, 1:6, 2009. (Cited on page 13.)

[62] Roberto Mosca, Tirso Pons, Arnaud Céol, , and Patrick Aloy. Towards a detailed atlas of protein–protein interactions. Current Opinion in Structural Biology, 23(6):929–940, 2013. (Cited on page 14.)

[63] Thomas Rolland, Murat Taşan, Benoit Charloteaux, Samuel J Pevzner, Quan Zhong, Nidhi Sahni, Song Yi, Irma Lemmens, Celia Fontanillo, Roberto Mosca, Atanas Kamburov, Susan D Ghiassian, Xinping Yang, Lila Ghamsari, Dawit Balcha, Bridget E Begg, Pascal Braun, Marc Brehme, Martin P Broly, Anne-Ruxandra Carvunis, Dan Convery-Zupan, Roser Corominas, Jasmin Coulombe-Huntington, Elizabeth Dann, Matija Dreze, Amélie Dricot, Changyu Fan, Eric Franzosa, Fana Gebreab, Bryan J Gutierrez, Madeleine F Hardy, Mike Jin, Shuli Kang, Ruth Kiros, Guan Ning Lin, Katja Luck, Andrew MacWilliams, Jörg Menche, Ryan R Murray, Alexandre Palagi, Matthew M Poulin, Xavier Rambout, John Rasla, Patrick Reichert, Viviana Romero, Elien Ruyssinck, Julie M Sahalie, Annemarie Scholz, Akash A Shah, Amitabh Sharma, Yun Shen, Kerstin Spirohn, Stanley Tam, Alexander O Tejeda, Shelly A Trigg, Jean-Claude Twizere, Kerwin Vega, Jennifer Walsh, Michael E Cusick, Yu Xia, Albert-László Barabási, Lilia M Iakoucheva, Patrick Aloy, Javier De Las Rivas, Jan Tavernier, Michael A Calderwood, David E Hill, Tong Hao, Frederick P Roth, and Marc Vidal. A Proteome-Scale Map of the Human Interactome Network. Cell, 159(5):1212–1226, 2014. (Cited on pages 94, 95, 96, and 98.)

[64] G. Traver Hart, Arun Ramani, and Edward Marcotte. How complete are current yeast and human protein-interaction networks? Genome Biology, 7(11):120, 2006. (Cited on page 34.)

[65] K. Venkatesan, J. F. Rual, A. Vazquez, U. Stelzl, I. Lemmens, T. Hirozane-Kishikawa, T. Hao, M. Zenkner, X. Xin, K. I. Goh, M. A. Yildirim, N. Simonis, K. Heinzmann, F. Gebreab, J. M. Sahalie, S. Cevik, C. Simon, A. S. de Smet, E. Dann, A. Smolyar, A. Vinayagam, H. Yu, D. Szeto, H. Borick, A. Dricot, N. Klitgord, R. R. Murray, C. Lin, M. Lalowski, J. Timm, K. Rau, C. Boone, P. Braun, M. E. Cusick, F. P. Roth, D. E. Hill, J. Tavernier, E. E. Wanker, A. L. Barabási, and M. Vidal. An empirical framework for binary interactome mapping. Nat Methods, 6(1):83–90, 2009. (Cited on pages 23, 34, 52, 64, and 94.)

[66] Michael P. H. Stumpf, Thomas Thorne, Eric de Silva, Ronald Stewart, Hyeong Jun An, Michael Lappe, and Carsten Wiuf. Estimating the size of the human interactome. Proceedings of the National Academy of Sciences of the United States of America, 105(19):6959–6964, 2008.

[67] Mark N. Wass, Alessia David, and Michael J. E. Sternberg. Challenges for the prediction of macromolecular interactions. Current Opinion in Structural Biology, 21(3):382–390, 2011. (Cited on page 14.)

[68] A. Sharma, J. Menche, C. Huang, T. Ort, X. Zhou, S. D. Ghiassian, D. Thibault, L. Voung, F. Guo, N. Gulbahce, F. Baribaud, J. Tocker, R. Dobrin, E. Barnathan, H. Liu, M. Kitsak, N. Sahni, R. A. Panettieri, K. G. Tantisira, W. Qiu, B. A. Raby, E. K. Silverman, M. Vidal, S. T. Weiss, and A. L. Barabási. A disease module in the interactome explains disease heterogeneity, drug response and captures novel pathways and genes for Asthma. Under review, 2015. (Cited on pages 14 and 54.)

[69] Heike Goehler, Maciej Lalowski, Ulrich Stelzl, Stephanie Waelter, Martin Stroedicke, Uwe Worm, Anja Droege, Katrin S. Lindenberg, Maria Knoblich, Christian Haenig, Martin Herbst, Jaana Suopanki, Eberhard Scherzinger, Claudia Abraham, Bianca Bauer, Renate Hasenbank, Anja Fritzsche, Andreas H. Ludewig, Konrad Buessow, Sarah H. Coleman, Claire-Anne Gutekunst, Bernhard G. Landwehrmeyer, Hans Lehrach, and Erich E. Wanker. A Protein Interaction Network Links GIT1, an Enhancer of Huntingtin Aggregation, to Huntington’s Disease. Molecular Cell, 15(6):853–865, 2004.

[70] Janghoo Lim, Tong Hao, Chad Shaw, Akash J. Patel, Gábor Szabó, Jean-François Rual, C. Joseph Fisk, Ning Li, Alex Smolyar, David E. Hill, Albert-László Barabási, Marc Vidal, and Huda Y. Zoghbi. A Protein–Protein Interaction Network for Human Inherited Ataxias and Disorders of Purkinje Cell Degeneration. Cell, 125(4):801–814, 2006.

[71] Miguel Angel Pujana, Jing-Dong J. Han, Lea M. Starita, Kristen N. Stevens, Muneesh Tewari, Jin Sook Ahn, Gad Rennert, Victor Moreno, Tomas Kirchhoff, Bert Gold, Volker Assmann, Wael M. ElShamy, Jean-Francois Rual, Douglas Levine, Laura S. Rozek, Rebecca S. Gelman, Kristin C. Gunsalus, Roger A. Greenberg, Bijan Sobhian, Nicolas Bertin, Kavitha Venkatesan, Nono Ayivi-Guedehoussou, Xavier Sole, Pilar Hernandez, Conxi Lazaro, Katherine L. Nathanson, Barbara L. Weber, Michael E. Cusick, David E. Hill, Kenneth Offit, David M. Livingston, Stephen B. Gruber, Jeffrey D. Parvin, and Marc Vidal. Network modeling links breast cancer susceptibility and centrosome dysfunction. Nat Genet, 39(11):1338–1349, 2007. (Cited on page 14.)

[72] J. Xu and Y. Li. Discovering disease-genes by topological features in human protein-protein interaction network. Bioinformatics, 22(22):2800–5, 2006. (Cited on pages 14 and 46.)

[73] M. Oti, B. Snel, M. A. Huynen, and H. G. Brunner. Predicting disease genes using protein-protein interactions. J Med Genet, 43(8):691–8, 2006. (Cited on pages 17 and 50.)

[74] T. K. B. Gandhi, J. Zhong, S. Mathivanan, L. Karthick, and K. N. Chandrika. Analysis of the human protein interactome and comparison with yeast, worm and fly interaction datasets. 2006. (Cited on page 14.)

[75] Shinichiro Wachi, Ken Yoneda, and Reen Wu. Interactome-transcriptome analysis reveals the high centrality of genes differentially expressed in lung cancer tissues. Bioinformatics, 21(23):4205–4208, 2005. (Cited on page 15.)

[76] Pall F. Jonsson and Paul A. Bates. Global topological features of cancer proteins in the human interactome. Bioinformatics (Oxford, England), 22(18):2291–2297, 2006.

[77] I. Feldman, A. Rzhetsky, and D. Vitkup. Network properties of genes harboring inherited disease mutations. Proc Natl Acad Sci U S A, 105(11):4323–8, 2008. (Cited on pages 15 and 23.)

[78] Haiyuan Yu, Pascal Braun, Muhammed A. Yildirim, Irma Lemmens, Kavitha Venkatesan, Julie Sahalie, Tomoko Hirozane-Kishikawa, Fana Gebreab, Na Li, Nicolas Simonis, Tong Hao, Jean-François Rual, Amélie Dricot, Alexei Vazquez, Ryan R. Murray, Christophe Simon, Leah Tardivo, Stanley Tam, Nenad Svrzikapa, Changyu Fan, Anne-Sophie de Smet, Adriana Motyl, Michael E. Hudson, Juyong Park, Xiaofeng Xin, Michael E. Cusick, Troy Moore, Charlie Boone, Michael Snyder, Frederick P. Roth, Albert-László Barabási, Jan Tavernier, David E. Hill, and Marc Vidal. High Quality Binary Protein Interaction Map of the Yeast Interactome Network. Science (New York, N.Y.), 322(5898):104–110, 2008. (Cited on page 15.)

[79] Leland H. Hartwell, John J. Hopfield, Stanislas Leibler, and Andrew W. Murray. From molecular to modular cell biology. Nature, 1999. (Cited on page 15.)

[80] S. Ghiassian, J. Menche, and A. L. Barabási. A DIseAse MOdule Detection (DIAMOnD) Algorithm derived from a systematic analysis of connectivity patterns of disease proteins in the Human Interactome. Plos , 2015. (Cited on pages 16, 66, and 70.)

[81] S. Navlakha and C. Kingsford. The power of protein interaction networks for associating genes with diseases. Bioinformatics, 26(8):1057–63, 2010. (Cited on pages 17, 18, 20, 46, and 50.)

[82] Elena Nabieva, Kam Jim, Amit Agarwal, Bernard Chazelle, and . Whole-proteome prediction of protein function via graph-theoretic analysis of interaction maps. Bioinformatics, 21(suppl 1):i302–i310, 2005. (Cited on page 17.)

[83] Benno Schwikowski, Peter Uetz, and Stanley Fields. A network of protein-protein interactions in yeast. Nat Biotech, 18(12):1257–1261, 2000. (Cited on page 17.)

[84] Aaron Clauset, M. Newman, and Cristopher Moore. Finding community structure in very large networks. Physical Review E, 70(6), 2004. (Cited on pages 18 and 26.)

[85] Aaron Clauset. Finding local community structure in networks. Physical Review E, 72(2):026132, 2005.

[86] J. Ahn, D. H. Lee, Y. Yoon, Y. Yeu, and S. Park. Improved method for protein complex detection using bottleneck proteins. BMC Med Inform Decis Mak, 13 Suppl 1:S5, 2013. (Cited on page 26.)

[87] Santo Fortunato. Community detection in graphs. Physics Reports, 486(3-5):75–174, 2010.

[88] M. Girvan and M. E. Newman. Community structure in social and biological networks. Proc Natl Acad Sci U S A, 99(12):7821–6, 2002.

[89] Andrea Lancichinetti and Santo Fortunato. Community detection algorithms: A comparative analysis. Physical Review E, 80(5), 2009.

[90] M. Newman. Fast algorithm for detecting community structure in networks. Physical Review E, 69, 2004.

[91] M. Newman and M. Girvan. Finding and evaluating community structure in networks. Physical Review E, 69, 2004.

[92] James Bagrow and Erik Bollt. Local method for detecting communities. Physical Review E, 72(4), 2005. (Cited on pages 18, 26, and 31.)

[93] S. Navlakha, R. Rastogi, and N. Shrivastava. Graph summarization with bounded error. Proceedings of the 2008 ACM SIGMOD international conference on Management of data. ACM, 2008. (Cited on page 18.)

[94] Stijn Van Dongen. Graph Clustering Via a Discrete Uncoupling Process. SIAM Journal on Matrix Analysis and Applications, 30(1):121–141, 2008. (Cited on pages 18, 26, and 50.)

[95] Yong-Yeol Ahn, James P. Bagrow, and Sune Lehmann. Link communities reveal multiscale complexity in networks. Nature, 466(7307):761–764, 2010. (Cited on page 18.)

[96] Vincent D. Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008, 2008. (Cited on pages 18 and 26.)

[97] Tolga Can, Orhan Çamoğlu, and Ambuj K. Singh. Analysis of protein-protein interaction networks using random walks. ACM, 2005. (Cited on page 19.)

[98] S. Kohler, S. Bauer, D. Horn, and P. N. Robinson. Walking the interactome for prioritization of candidate disease genes. Am J Hum Genet, 82(4):949–58, 2008. (Cited on pages 19 and 50.)

[99] O. Vanunu and R. Sharan. A propagation-based algorithm for inferring gene-disease association. German Conference on Bioinformatics, pages 54–63, 2008. (Cited on page 19.)

[100] T. Pawson and R. Linding. Network medicine. FEBS Lett, 582(8):1266–70, 2008. (Cited on page 23.)

[101] A. Zanzoni, M. Soler-Lopez, and P. Aloy. A network medicine approach to human disease. FEBS Lett, 583(11):1759–65, 2009.

[102] M Buchanan, G Caldarelli, and P De Los Rios. Networks in cell biology. Cambridge University Press, 2010. (Cited on page 23.)

[103] A. del Sol, R. Balling, L. Hood, and D. Galas. Diseases as network perturbations. Curr Opin Biotechnol, 21(4):566–71, 2010. (Cited on page 23.)

[104] E. M. Ramos, D. Hoffman, H. A. Junkins, D. Maglott, L. Phan, S. T. Sherry, M. Feolo, and L. A. Hindorff. Phenotype-Genotype Integrator (PheGenI): synthesizing genome-wide association study (GWAS) data with existing genomic resources. Eur J Hum Genet, 22(1):144–7, 2014. (Cited on pages 23, 99, and 100.)

[105] Han G. Brunner and Marc A. van Driel. From syndrome families to functional . Nature Reviews Genetics, 5, 2004. (Cited on page 23.)

[106] R. Sharan, I. Ulitsky, and R. Shamir. Network-based prediction of protein function. Mol Syst Biol, 3:88, 2007. (Cited on pages 29 and 52.)

[107] M. Newman. The Structure and Function of Complex Networks. SIAM review, 45(2):167–256, 2003. (Cited on page 45.)

[108] E. A. Bender and E. R. Canfield. The asymptotic number of labeled graphs with given degree sequences. Journal of Combinatorial Theory, Series A, 24(3):296–307, 1978. (Cited on page 45.)

[109] Michael Ashburner, Catherine A. Ball, Judith A. Blake, David Botstein, Heather Butler, J. Michael Cherry, Allan P. Davis, Kara Dolinski, Selina S. Dwight, Janan T. Eppig, Midori A. Harris, David P. Hill, Laurie Issel-Tarver, Andrew Kasarskis, Suzanna Lewis, John C. Matese, Joel E. Richardson, Martin Ringwald, Gerald M. Rubin, and Gavin Sherlock. Gene Ontology: tool for the unification of biology. Nature genetics, 25(1):25–29, 2000. (Cited on page 46.)

[110] M. Kanehisa, S. Goto, S. Kawashima, Y. Okuno, and M. Hattori. The KEGG resource for deciphering the genome. Nucleic Acids Res, 32(Database issue):D277–80, 2004. (Cited on page 46.)

[111] K. Lage, K. Mollgard, S. Greenway, H. Wakimoto, J. M. Gorham, C. T. Workman, E. Bendsen, N. T. Hansen, O. Rigina, F. S. Roque, C. Wiese, V. M. Christoffels, A. E. Roberts, L. B. Smoot, W. T. Pu, P. K. Donahoe, N. Tommerup, S. Brunak, C. E. Seidman, J. G. Seidman, and L. A. Larsen. Dissecting spatio-temporal protein networks driving human heart development and related disorders. Mol Syst Biol, 6:381, 2010. (Cited on page 50.)

[112] E. Guney and B. Oliva. Exploiting protein-protein interaction networks for genome-wide disease-gene prioritization. PLoS One, 7(9):e43557, 2012. (Cited on page 50.)

[113] Patrick L. McGeer and Edith G. McGeer. Inflammation, autotoxicity and Alzheimer disease. Neurobiology of Aging, 22(6):799–809, 2001. (Cited on pages 61 and 66.)

[114] Gokhan S. Hotamisligil. Inflammation and metabolic disorders. Nature, 444(7121):860–867, 2006.

[115] Göran K. Hansson. Inflammation, Atherosclerosis, and Coronary Artery Disease. New England Journal of Medicine, 352(16):1685–1695, 2005. (Cited on page 66.)

[116] K. E. Wellen and G. S. Hotamisligil. Inflammation, stress, and diabetes. J Clin Invest, 115(5):1111–9, 2005. (Cited on pages 61 and 66.)

[117] Eric A. Fox and Susan R. Kahn. The relationship between inflammation and venous thrombosis. A systematic review of clinical studies. Thrombosis and Haemostasis, 2005. (Cited on page 61.)

[118] T. W. Wakefield, R. M. Strieter, M. R. Prince, L. J. Downing, and L. J. Greenfield. Pathogenesis of venous thrombosis: a new insight. Cardiovascular Surgery, 5(1):6–15, 1997. (Cited on page 61.)

[119] Peter Libby and Daniel I. Simon. Inflammation and Thrombosis: The Clot Thickens. Circulation, 103(13):1718–1720, 2001. (Cited on page 61.)

[120] B. M. Stramer, R. Mori, and P. Martin. The inflammation-fibrosis link? A Jekyll and Hyde role for blood cells during wound repair. J Invest Dermatol, 127(5):1009–17, 2007. (Cited on page 61.)

[121] A. I. Su, T. Wiltshire, S. Batalov, H. Lapp, K. A. Ching, D. Block, J. Zhang, R. Soden, M. Hayakawa, G. Kreiman, M. P. Cooke, J. R. Walker, and J. B. Hogenesch. A gene atlas of the mouse and human protein-encoding transcriptomes. Proc Natl Acad Sci U S A, 101(16):6062–7, 2004. (Cited on page 63.)

[122] B. Iglewicz and D. Hoaglin. How to Detect and Handle Outliers, The ASQC Basic References in Quality Control: Statistical Techniques. ASQC Quality Press, Milwaukee, WI, 1993. (Cited on page 63.)

[123] B. Aranda, P. Achuthan, Y. Alam-Faruque, I. Armean, A. Bridge, C. Derow, M. Feuermann, A. T. Ghanbarian, S. Kerrien, J. Khadake, J. Kerssemakers, C. Leroy, M. Menden, M. Michaut, L. Montecchi-Palazzi, S. N. Neuhauser, S. Orchard, V. Perreau, B. Roechert, K. van Eijk, and H. Hermjakob. The IntAct molecular interaction database in 2010. Nucleic Acids Res, 38(Database issue):D525–31, 2010. (Cited on pages 64 and 94.)

[124] C. Stark, B. J. Breitkreutz, A. Chatr-Aryamontri, L. Boucher, R. Oughtred, M. S. Livstone, J. Nixon, K. Van Auken, X. Wang, X. Shi, T. Reguly, J. M. Rust, A. Winter, K. Dolinski, and M. Tyers. The BioGRID Interaction Database: 2011 update. Nucleic Acids Res, 39(Database issue):D698–704, 2011. (Cited on pages 64 and 94.)

[125] T. S. Keshava Prasad, R. Goel, K. Kandasamy, S. Keerthikumar, S. Kumar, S. Mathivanan, D. Telikicherla, R. Raju, B. Shafreen, A. Venugopal, L. Balakrishnan, A. Marimuthu, S. Banerjee, D. S. Somanathan, A. Sebastian, S. Rani, S. Ray, C. J. Harrys Kishore, S. Kanth, M. Ahmed, M. K. Kashyap, R. Mohmood, Y. L. Ramachandra, V. Krishna, B. A. Rahiman, S. Mohan, P. Ranganathan, S. Ramabadran, R. Chaerkady, and A. Pandey. Human Protein Reference Database-2009 update. Nucleic Acids Res, 37(Database issue):D767–72, 2009. (Cited on pages 64 and 94.)

[126] A. Ruepp, B. Waegele, M. Lechner, B. Brauner, I. Dunger-Kaltenbach, G. Fobo, G. Frishman, C. Montrone, and H. W. Mewes. CORUM: the comprehensive resource of mammalian protein complexes-2009. Nucleic Acids Res, 38(Database issue):D497–501, 2010. (Cited on pages 64 and 94.)

[127] P. V. Hornbeck, J. M. Kornhauser, S. Tkachev, B. Zhang, E. Skrzypek, B. Murray, V. Latham, and M. Sullivan. PhosphoSitePlus: a comprehensive resource for investigating the structure and function of experimentally determined post-translational modifications in man and mouse. Nucleic Acids Res, 40(Database issue):D261–70, 2012. (Cited on pages 64 and 94.)

[128] A. Vinayagam, U. Stelzl, R. Foulle, S. Plassmann, M. Zenkner, J. Timm, H. E. Assmus, M. A. Andrade-Navarro, and E. E. Wanker. A directed protein interaction network for investigating intracellular signal transduction. Sci Signal, 4(189):rs8, 2011. (Cited on pages 64 and 94.)

[129] J. Wang, K. Huo, L. Ma, L. Tang, D. Li, X. Huang, Y. Yuan, C. Li, W. Wang, W. Guan, H. Chen, C. Jin, J. Wei, W. Zhang, Y. Yang, Q. Liu, Y. Zhou, C. Zhang, Z. Wu, W. Xu, Y. Zhang, T. Liu, D. Yu, Y. Zhang, L. Chen, D. Zhu, X. Zhong, L. Kang, X. Gan, X. Yu, Q. Ma, J. Yan, L. Zhou, Z. Liu, Y. Zhu, T. Zhou, F. He, and X. Yang. Toward an understanding of the protein interaction network of the human liver. Mol Syst Biol, 7:536, 2011. (Cited on pages 64 and 95.)

[130] H. Iwata, I. Manabe, and R. Nagai. Lineage of bone marrow-derived cells in atherosclerosis. Circ Res, 112(12):1634–47, 2013. (Cited on pages 66 and 82.)

[131] A. Subramanian, P. Tamayo, V. K. Mootha, S. Mukherjee, B. L. Ebert, M. A. Gillette, A. Paulovich, S. L. Pomeroy, T. R. Golub, E. S. Lander, and J. P. Mesirov. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A, 102(43):15545–50, 2005. (Cited on pages 70, 100, and 101.)

[132] Kenan Aksu, Ayhan Donmez, and Gokhan Keser. Inflammation-Induced Thrombosis: Mechanisms, Disease Associations and Management. Current Pharmaceutical Design, 18(11):1478–1493, 2012. (Cited on page 74.)

[133] C. Schurmann, K. Heim, A. Schillert, S. Blankenberg, M. Carstensen, M. Dorr, K. Endlich, S. B. Felix, C. Gieger, H. Grallert, C. Herder, W. Hoffmann, G. Homuth, T. Illig, J. Kruppa, T. Meitinger, C. Muller, M. Nauck, A. Peters, R. Rettig, M. Roden, K. Strauch, U. Volker, H. Volzke, S. Wahl, H. Wallaschofski, P. S. Wild, T. Zeller, A. Teumer, H. Prokisch, and A. Ziegler. Analyzing illumina gene expression microarray data from different tissues: methodological aspects of data analysis in the metaxpress consortium. PLoS One, 7(12):e50938, 2012. (Cited on pages 75 and 102.)

[134] T. Zeller, P. Wild, S. Szymczak, M. Rotival, A. Schillert, R. Castagne, S. Maouche, M. Germain, K. Lackner, H. Rossmann, M. Eleftheriadis, C. R. Sinning, R. B. Schnabel, E. Lubos, D. Mennerich, W. Rust, C. Perret, C. Proust, V. Nicaud, J. Loscalzo, N. Hubner, D. Tregouet, T. Munzel, A. Ziegler, L. Tiret, S. Blankenberg, and F. Cambien. Genetics and beyond–the transcriptome of human monocytes and disease susceptibility. PLoS One, 5(5):e10693, 2010. (Cited on pages 75 and 102.)

[135] E. Frederick Wheelock. Interferon-Like Virus-Inhibitor Induced in Human Leukocytes by Phytohemagglutinin. Science, 149(3681):310–311, 1965. (Cited on page 83.)

[136] J. A. Green, S. R. Cooperband, and S. Kibrick. Immune specific induction of interferon production in cultures of human blood lymphocytes. Science, 164(3886):1415–1417, 1969. (Cited on page 83.)

[137] Juyong Park and Albert-László Barabási. Distribution of node characteristics in complex networks. Proceedings of the National Academy of Sciences, 104(46):17916–17920, 2007. (Cited on page 96.)

[138] A. Mottaz, Y. L. Yip, P. Ruch, and A. L. Veuthey. Mapping proteins to disease terminologies: from UniProt to MeSH. BMC Bioinformatics, 9 Suppl 5:S3, 2008. (Cited on page 99.)

[139] P. M. Ridker, D. I. Chasman, R. Y. Zee, A. Parker, L. Rose, N. R. Cook, J. E. Buring, and Women’s Genome Health Study Working Group. Rationale, design, and methodology of the Women’s Genome Health Study: a genome-wide association study of more than 25,000 initially healthy American women. Clin Chem, 54(2):249–55, 2008. (Cited on page 101.)

[140] M. L. Grove, B. Yu, B. J. Cochran, T. Haritunians, J. C. Bis, K. D. Taylor, M. Hansen, I. B. Borecki, L. A. Cupples, M. Fornage, V. Gudnason, T. B. Harris, S. Kathiresan, R. Kraaij, L. J. Launer, D. Levy, Y. Liu, T. Mosley, G. M. Peloso, B. M. Psaty, S. S. Rich, F. Rivadeneira, D. S. Siscovick, A. V. Smith, A. Uitterlinden, C. M. van Duijn, J. G. Wilson, C. J. O’Donnell, J. I. Rotter, and E. Boerwinkle. Best practices and joint calling of the HumanExome BeadChip: the CHARGE Consortium. PLoS One, 8(7):e68095, 2013. (Cited on page 102.)

[141] Paul M. Ridker. C-Reactive Protein: A Simple Test to Help Predict Risk of Heart Attack and Stroke. Circulation, 108(12):e81–e85, 2003. (Cited on page 102.)

[142] Donat Spahn, Bertil Bouillon, Vladimir Cerny, Timothy Coats, Jacques Duranteau, Enrique Fernandez-Mondejar, Daniela Filipescu, Beverley Hunt, Radko Komad- ina, Giuseppe Nardi, Edmund Neugebauer, Yves Ozier, Louis Riddez, Arthur Schultz, Jean-Louis Vincent, and Rolf Rossaint. Management of bleeding and coagulopathy following major trauma: an updated European guideline. Critical Care, 17(2):R76, 2013.

[143] Kim K. Birtcher and Christie M. Ballantyne. Measurement of Cholesterol: A Patient Perspective. Circulation, 110(11):e296–e297, 2004. (Cited on page 102.)
