Functional Characteristics of Cancer Driver Genes In

FUNCTIONAL CHARACTERISTICS OF CANCER DRIVER GENES

IN COLORECTAL CANCER

GURKAN BEBEK

Submitted in partial fulﬁllment of the requirements for the degree of Master of

Science

Clinical Research Scholars Program

CASE WESTERN RESERVE UNIVERSITY

May 2017 CASE WESTERN RESERVE UNIVERSITY

SCHOOL OF GRADUATE STUDIES

We hereby approve the thesis/dissertation of

Gurkan Bebek

candidate for the degree of Master of Science∗.

Committee Chair Masaru Miyagi

Committee Member Thomas LaFramboise

Committee Member Jennifer Yu

Date of Defense April 5, 2017

∗ We also certify that written approval has been obtained for

any proprietary material contained therein.

i Dedicated to my family. Contents

1 Introduction 1

1.1 Colorectal Cancer ...... 1

1.2 Driver Genes in Cancer ...... 2

1.3 Contributions ...... 3

2 Methods 7

2.1 Cancer Driver Gene Networks ...... 7

2.2 Quantifying Network Similarity ...... 8

2.2.1 Constructing the Alignment Graph ...... 8

2.2.2 Alignment Scoring ...... 11

2.2.3 Scoring Nodes of the Alignment Graph...... 11

2.2.4 Scoring Sequence Similarity...... 13

2.2.5 Scoring Functional Similarity...... 13

2.2.6 Overall Similarity Between Two Interactions...... 15

2.2.7 Scoring the Alignment Graph and Normalization of Scores . . 15

2.2.8 Finding the Optimal Alignment ...... 16

2.3 Gene Ontology-Based Classiﬁcation of Genes ...... 17

2.4 Sequence (Protein Families) Based Classiﬁcation of Genes ...... 18

2.5 Sequence and Gene Ontology Based Classiﬁcation of Genes ...... 18

2.6 Somatic mutation proﬁles of tumor samples and statistical evaluation 19

1 3 Results and Discussion 20

3.1 Cancer driver genes and their synergistic activities ...... 20

3.2 Finding similarities among synergistic networks ...... 22

3.3 Clustering Cancer Driver Genes ...... 22

3.4 Testing synergistic activities by comparing somatic mutation proﬁles

of tumor samples ...... 26

3.5 Validation of results with cell line epistasis experiments ...... 27

4 Conclusion 30

A Appendix 32

A.1 Gene Ontology Based Classiﬁcation of Cancer Driver Genes . . . . . 32

A.2 Sequence Based Classiﬁcation of Cancer Driver Genes ...... 32

References 35

2 List of Figures

1 Overview of the study design ...... 4

2 Alignment graph nodes ...... 10

3 Algorithm for within species network alignment ...... 12

4 Example alignment of two CAN-gene networks ...... 21

5 Histogram of normalized scores ...... 23

6 Classiﬁcation of CAN-gene networks and observed somatic mutations

in various tumors ...... 24

7 Hierarchical clustering of CAN-genes ...... 25

8 Validation Experiments ...... 28

9 Multidimensional Scaling of knockdown experiments in triplicate . . . 29

10 Classiﬁcation of CAN-genes based on their gene ontology annotations 33

11 Classiﬁcation of CAN-genes based on their Pfam families ...... 34

3 Functional Characteristics of Cancer Driver Genes

in Colorectal Cancer

Abstract

GURKAN BEBEK

Colorectal Cancer is the third most commonly diagnosed cancer and third leading cause of cancer death in both men and women. Recent studies have led to the discovery of cancer driver genes whose mutations are mostly observed in low frequencies compared to the total number of tumors analyzed, suggesting that mutations in distinct subsets of driver genes are suﬃcient for tumorigenesis. We hypothesize that comparing the networks by which the driver genes contribute to tumorigenesis can generate more meaningful functional similarities. We develop an algorithm for within-species network similarity quantiﬁcation. We compare subnetworks corresponding to colorectal cancer driver genes using this tool to group driver genes. We validate our approach with cell line experiments and show we can predict driver gene similarities based on functional similarities. The relationships between driver genes whose mutations lead to similar phenotypic outcomes are best understood through comparison of pathways that are dysregulated.

4 1 Introduction

1.1 Colorectal Cancer

Colorectal cancer (CRC) is a major worldwide health problem owing to its high

prevalence and mortality rates. CRC is the third most commonly diagnosed cancer

and third leading cause of cancer death in both men and women. It is estimated that

about 134,490 people were diagnosed with CRC in the USA, and about 49,190 people

will die of the disease. Colon and rectum cancer represents 8.0% of all new cancer

cases in the U.S [? ]. Incidence of CRC is higher in men than women. The risk

factors for CRC increases with age. The median age at diagnosis is about 70 years

in developed countries. If diagnosed early CRC is also one of the most curable types

of cancer with 5-year survival rates as high as 90%. Most CRCs could be prevented

with current clinical practices such as screening tests [??? ].

Recent advances in the diagnosis and treatment of CRC improved the overall outcome of patients through new screening methods, biomarker and genomic analysis, personalized therapies and chemotherapy. However, many patients with advanced and metastatic tumors will still die from the disease. Novel diagnosis and treatment options are still needed for these patients.

There are various risk factors associated with colon cancer. These include the per- sonal history of colorectal cancer or polyps, family history of colon cancer, inherited syndromes that increase colon cancer risk, diabetes, obesity, smoking, alcohol use, diet etc. While inherited gene mutations make up only a small percentage of colon

1 cancers, they can increase an individual’s risk of cancer signiﬁcantly.

Most colorectal cancer cases are sporadic. They develop slowly over several years

through the adenoma–carcinoma sequence [?? ]. The major treatment options for

CRC are surgery, neoadjuvant radiotherapy (for patients with rectal cancer), and adjuvant chemotherapy (for patients with stage III/IV and high-risk stage II colon cancer). While 5-year relative survival rate for stage I disease is %90, stage IV patients survival rate is slightly greater than 13% [? ].

1.2 Driver Genes in Cancer

In recent years, projects utilizing next generation sequencing technologies have resulted in an excess of genomic data. Studies that sequenced cancer exomes allowed us to understand complex diseases such as colorectal and breast cancer [? ], ovarian cancer [? ] and glioblastoma multiforme (GBM) [? ] in greater depth. These studies conﬁdently recovered genes that are frequently mutated in diﬀerent cancers. Further analyses revealed so-called candidate cancer genes (CAN-genes), which are shown to drive carcinogenesis (hence named “driver” genes), as opposed to the genes that are also mutated in carcinogenic samples, but likely have no impact on tumor growth

(passenger genes). Recent research eﬀorts have introduced methods for identifying such sets of genes for non-small cell lung carcinoma, glioblastoma multiforme, and ovarian cancer using computational methods [???? ].

One of the key challenges associated with the introduction of these CAN-genes is to understand the mechanistic link between the mutations in these genes and the car-

2 cinogenic phenotype. Some researchers utilized network approaches to characterize the synergistic relationships among such genes. A variety of approaches were pre- sented for the identiﬁcation of protein subnetworks (small groups of proteins that are linked to each other through protein-protein interactions) that would describe pathways that are likely to be operative when particular sets of mutations are observed

[??? ]. However, these methods have not yet adequately explained the observation that, although CAN-genes are key drivers of carcinogenesis, they are mostly observed in very low abundance in the patient groups. For instance, the most frequently mutated GBM driver gene, TP53, is mutated in at most 27% of GBM patients [? ].

In colorectal cancer patients, mutations of well-known driver genes, other than the highly mutated APC gene [? ], are also found in lower frequencies than expected [? ].

However, via downstream alterations manifested by several of these driver mutations, cancers that share similar characteristics develop. Therefore, it is quite likely that cancer manifests itself via synergistic activities of various combinations of driver gene mutations.

1.3 Contributions

In this work, we present a comparative network analysis approach to further study cancer driver genes in terms of their functional similarity, with the objective of gain- ing insights into the relationships between these driver genes. The proposed computational framework for comparative analysis of cancer driver genes is illustrated in

Figure ??. Overall, the proposed framework provides a computational method for

3 A Exome Sequencing of Tumors KRAS APC

Cancer driver genes B Top Cancer Driver Gene Networks Pathway Analysis Framework

C Pairwise Quan9ta9ve Network Alignment

N1 vs. N2 Sim. Hierarchal Matrix clustering

Driver Genes Networks 20 Driver Gene Networks 1019 13 Driver Genes 18 Networks 1214 1517 1611

Figure3 1:6 Overview5 2 1 4 of the study design.

A. The cancer driver genes are identiﬁed mostly by exome sequencing [?? ].

B. The driver genes are imputed into the signaling pathway search framework [? ].

Using this framework, networks connecting each colorectal cancer (CRC) driver gene with APC are generated.

C. All pairs of the top cancer driver gene networks are aligned to each other, utilizing protein family information and gene ontologies, Using this measure of similarity, hierarchical clustering is performed to classify driver genes networks. The results are validated computationally and experimentally.

4 studying the functional characteristics of CAN-genes by approaching them from the standpoint of network similarity. Our hypothesis is that network similarities will better describe the synergistic similarities of CAN-genes then traditional ontology-based comparison of these genes. This analysis is carried out within a speciﬁc cancer type.

To test this hypothesis, we must first determine the networks by which each of the driver genes contribute to tumorigenesis. We obtain these networks by integrating annotations and tissue-specific microarray data [? ]. Next, we produce organism- specific alignments of these networks by optimizing a novel measure of edge similarity based on sequence information and gene ontology annotations. We then use cluster analysis to test our hypothesis. Our results show that driver genes that are part of more similar signaling subnetworks are grouped together, while naive, non-network- based measures produce less meaningful results. We also validate our approach with an epistasis analysis of a candidate pair.

An important contribution of our approach is that, instead of directly quantifying the similarity between the driver genes themselves, we quantify the similarity between the pathways (or more generally PPI subnetworks) associated with each driver gene.

In order to quantify subnetwork similarity, we develop a novel network-alignment algorithm for the speciﬁc purpose of aligning PPI subnetworks that originate from the same organism, i.e. within-species network alignment. Although there exists a rich literature on comparing PPI networks of diﬀerent species [?????????

] and in principle these approaches would be directly applicable to the comparison of our driver gene subnetworks, there are several considerations that prohibit direct application of these methods to these subnetworks. In particular, we identify the

5 following as the key characteristics of a network alignment algorithm that will be used to compare driver gene subnetworks:

1. The alignment should be global in the sense we should be able to compute a

single “similarity score” between the two subnetworks being compared.

2. There should not be any ambiguous mapping (i.e. every mapped gene from one

network should be mapped to only one gene from the other network).

3. Two identical genes in diﬀerent subnetworks (which is not possible in cross-

species network alignment) should have the maximum score, as they clearly

have an identical function.

Various solutions to the network alignment problem have been proposed for cross- species alignment, but none of them feature all of the above characteristics. For instance, IsoRank [? ] is a global network alignment tool that ﬁnds matches of a protein in one PPI network in another PPI network if the former’s neighbors are good matches for the latter’s neighbors. While, IsoRank satisﬁes (1) and (2), it does not satisfy (3), as it uses the PageRank algorithm where the score of a protein alignment is related to the scores of its neighbors. Therefore, the alignment of two identical genes may not be the maximum score, because their neighbors may not be identical.

All other cross-species alignment algorithms [?? ] focus on local alignments, and hence do not satisfy the ﬁrst requirement.

6 2 Methods

2.1 Cancer Driver Gene Networks

In this study, we utilize an integrative -omics methodology, where we ﬁrst generate

CAN-gene networks by which the driver genes contribute to tumorigenesis. It has been shown that mutation of APC is the first hit in colorectal cancer, followed by the mutation of a set of 22 cancer driver genes [? ]. To generate the proposed tumorigenesis pathways for each of the driver genes, we therefore generate protein- protein interaction networks that connect each of the driver genes with APC. To do so, we first obtain a tissue-specific microarray gene expression data collected in replicates from the Apc1638N+/− mouse model (available through the Gene Expression Omnibus data series GSE19338). Using the Blossom algorithm [? ] we search publicly available

PPI databases [?? ] and score these by integrating the tissue-speciﬁc gene expression data, gene ontologies [? ], and known pathways [???? ]. APC and 22 other

CAN-genes are linked via networks using this network discovery method. While we have utilized these networks to demonstrate utility of our approach, any other set of networks demonstrating synergistic links between driver genes could be utilized in this pipeline. The Blossom networks and the approach have been validated within [?

] and [? ] and various synergy studies of pairwise driver genes in other studies.

7 2.2 Quantifying Network Similarity

While quantifying the functional similarity between two PPI subnetworks, our goal

is to quantify 1) the functional similarity between proteins that are part of their

subnetworks and 2) topological consistency of the interactions in the network. In

other words, for two subnetworks to be considered functionally similar, both the

proteins and the context in which they interact with each other should be similar [?

? ].

In order to incorporate both topology and protein similarity, we formulate the

network similarity problem as a network alignment problem, where proteins in a

network are matched to proteins in another network. The resulting score is the sum

of the scores of the individual alignments of the proteins (See 2.3 Alignment Scoring),

penalized by topological inconsistency. A network alignment algorithm optimizes this

score, to produce the highest quality alignment.

We represent the two subnetworks as graphs GA = (VA,EA), GB = (VB,EB),

where the vertices (v ∈ V ) correspond to proteins/genes, the edges (e ∈ E) correspond

to interactions. We deﬁne an alignment as a map A : E0 → EB,E0 ⊆ EA, |EA| ≤

|EB|. A is constrained to be one-to-one: ∀e1, e2 ∈ E0,A(e1) = A(e2) =⇒ e1 = e2.

P|E0| We deﬁne an alignment score W (A) = i=0 w(ei), where w(ei) is the score of

alignment of interaction ei with A(ei). The best alignment is A that maximizes W (A).

8 2.2.1 Constructing the Alignment Graph

In order to ﬁnd the best alignment between GA and GB, a computationally feasible

algorithm for optimizing the alignment score must be developed. When aligning PPI

networks across species, Guo et al. [? ] construct an alignment graph, where each

node represents a pair of possibly alignable interactions. The edges of the alignment

graph are based on network connectivities and these edges are weighted according to

the score of mapping the two pairs of edges that are mapped to each other and the

topological inconsistency created by this mapping. They then use a greedy heuristic

to ﬁnd non-redundant subgraphs on this graph with high total edge weight. To align

PPI networks within the same species, we construct a similar graph, in which the

edge similarities are quantiﬁed based on Pfam and Gene Ontology annotation terms

(see next subsection). We then ﬁnd the optimal alignment using a search algorithm

where the graph is pruned greedily to form a topologically correct alignment of the

two networks and the maximal alignment is returned at the end.

The alignment graph Ψ(GA,GB) = (UΨ,FΨ) is generated as follows: every vertex

ui ∈ UΨ corresponds to a pair of interactions (eA, eB) where eA ∈ EA and eB ∈ EB, such that UΨ = EA × EB. In essence, the nodes of the alignment graph correspond to every possible alignment of the interactions of the two networks (In Figure ?? these are the nodes displayed on the right hand side). A pair of nodes (ui, uj) where ui, uj ∈ UΨ is connected by an edge fij ∈ Fψ in the alignment graphs if the alignment follows one of the following three types of connectivities:

1. The two pairs of aligned edges share a protein in both graphs (Extension)

9 2. The two pairs of aligned edges share a protein in one graph, but in the other

graph they are separated by one hop (Insertion/Deletion)

3. The two pairs of aligned edges are separated by one hop in both networks (Jump)

v1 v2 v3 v1 v2 v2 v3 A

v4 v5 v6 v4 v5 v5 v6

v1 v2 v3 v1 v2 v2 v3 B

v4 v5 v6 v7 v4 v5 v6 v7

v1 v2 v3 v4 v1 v2 v3 v4 C

v5 v6 v7 v8 v5 v6 v7 v8

Network A Proteins Network B Aligned Edges

Protein-Protein Interactions Protein-Protein Mapping Alignment Graph Interaction

Figure 2: Alignment graph nodes. The nodes that are generated while building the alignment graph are depicted. (A) an extension, where two edges follow each other in the two networks. (B) while one network has two connected edges, the other has an additional edge in the other network (insertion/deletion). (C) the two pairs of edges are separated with one interaction on both networks (a jump).

10 After the network is thusly constructed, the edges are weighted with a combination of the similarity score for each aligned interaction and a penalty for insertions/deletions and jumps, biasing the graph towards fewer topological inconsistencies. We next describe the scoring of nodes and edges in detail.

2.2.2 Alignment Scoring

In the alignment graph, the weight of an edge consists of two components: (i) the weights of nodes that are connected by the edge and (ii) the relative topology of the corresponding interactions in the two subnetworks. Since each node in the alignment graph represents a possible mapping of two interactions, each from one of the subnetworks being compared, the weight of a node reﬂects the functional similarity between these interactions. We now discuss how we quantify the functional similarity between two interactions. Subsequently, we discuss how topological inconsistencies in the mapping are penalized.

2.2.3 Scoring Nodes of the Alignment Graph.

An important factor that will result in proper alignments is to develop a biologically accurate quantitative measure of the similarity between interactions in the two subnetworks. Here, we deﬁne such a measure based on sequence similarities identiﬁed as in the protein domain family database Pfam [? ] and through Gene Ontology annotations [? ]. However, instead of aligning proteins to proteins (node-node alignments), we align interactions (edge-edge alignments). A similar approach was proposed earlier for cross-species alignments of networks [? ]. This approach allows us to focus more

11 A1 A2 Network A B1 B2 Construct and weight Construct edges based Network A vs. Network B on graph connecvies A1 A2 alignment nodes Alignment weight A1 A2 A2 A3 A2 A4 B1 B2 B1 B2 B1 B2 A4 A3 1 PFAM GO 2 … Network B A1 U B1 A1 U B1 Family1 kinase B1 B2 Family2 phosphorylaon A1 A2 A2 A3 A2 A4 B2 B3 B2 B3 B2 B3 B3 A2 U B2 A2 U B2 Family3 kinase binding Family4 Start from random node, and begin greedy search 3 algorithm

A1 A2 A2 A3 A2 A4 Repeat A1 A2 B1 B2 B1 B2 B1 B2 Network A Network B Final A1 A2 steps 3-4, Pick the B1 B2 next node A1 B1 Alignment B1 B2 pick best alignment

A4 A2 B2 4 A1 A2 A2 A3 A2 A4 6 A2 A3 A2 A3 A2 A4 5 B2 B3 B2 B3 B2 B3 A3 B3 B2 B3 B2 B3 B2 B3

Figure 3: Algorithm for within species network alignment.

1. Construct a node for every combination of interactions of networks A and B.

Note that alignment nodes are not transitive, but in this figure we only consider one direction of alignment. Weight the nodes with the “alignment weight,” which is proportional to the number of shared Pfam families and shared gene ontology terms between corresponding proteins in an aligned interaction. 2. Connect the nodes by edges corresponding to extensions, insertions/deletions, and jumps. Note that only extensions are shown in this figure, and also, reverse alignments are not included (hence the lone node). 3-5. Find the “modified” minimum spanning tree, corresponding to an alignment, eliminating nodes with each step. Repeat, until an alignment has been generated from each node, and 6. pick the alignment with the minimum weight. on functional associations across networks (Figure ??).

12 2.2.4 Scoring Sequence Similarity.

Pfam provides a public database of protein domains and their associated proteins as discovered by sequence similarity analysis. We incorporate Pfam into our algorithm under the conjecture that alignment between two edges is more plausible if the corresponding proteins in the alignment share a high number of Pfam domains. Hence, utilization of Pfam allows us to incorporate sequence similarity into our alignment score, highlighting the functional viewpoint of these sequences.

Given two interactions eA = (vAi , vAj ), eA ∈ EA and eB = (vBk , vBl ), eB ∈ EB, we compute the sequence-based similarity between these two interactions as follows:

Where PF(v) represents the set of Pfam domains corresponding to the given protein v.

2.2.5 Scoring Functional Similarity.

The functional relationship between two interacting proteins can be assessed using biological process (BP) Gene Ontology annotations, where operations or sets of molecular events are described. Molecular function annotations from Gene Ontology describe elemental activities of gene products at the molecular level, and cellular component annotations lists parts of a cell the gene product is active. Hence, in this context, only biological processes can possibly describe the underlying processes and can help capture the nature of interactions. For example, if an interaction consists of a protein

13 with the GO term kinase activity (GO: 0016301) and another protein with the GO

term protein kinase binding (GO: 0019901), when aligning two interactions, it follows

that both interactions will have a similar relationship. Therefore, given an interac-

tion alignment, corresponding proteins should have similar GO terms, indicating that

they have the same role in their respective interactions. Indeed, in recent years, many

studies have demonstrated that interactions in signaling and regulatory networks can

be abstracted as interactions between diﬀerent functional categories, which are used

recurrently in various biological processes [?? ]. We compute the GO-based similar-

ity between two proteins utilizing a semantic measure where GO terms of each gene

pair are assessed by their speciﬁcity and depth on the GO tree [? ]. This measure

quantiﬁes the information content of the shared functional annotations of the two

proteins being compared and is deﬁned as:

|GΛ(Si,Sj )| IC(x, y) = −log2 |Gr|

Here, GΛ(Si,Sj ) is the set of all genes that are described by the minimum common ancestor set of the genes Si and Sj, and Gr is the set of all genes in GO. If the two

proteins have highly speciﬁc common ancestors, then they are scored higher than if

their common ancestors describe many other genes.

Using this measure, we quantify the functional similarity between two interactions

eA = (vAi , vAj ), eA ∈ EA and eB = (vBk , vBl ) as follows:

sGO(eA, eB) = IC(vAi , vAj ) ∗ IC(vBk , vBl )

14 2.2.6 Overall Similarity Between Two Interactions.

The overall similarity score, and hence weight of alignment, is deﬁned as:

w(eA, eB) = spfam(eA, eB) + sGO(eA, eB)

2.2.7 Scoring the Alignment Graph and Normalization of Scores

The alignment score is essentially the cost of the tree, tAB, generated as a result of aligning networks GA(EA,VA) and GB(EB,VB). This cost, cost(tAB), is simply calculated by summing up the weight of nodes on the tree.

X cost(tAB) = w(eA, eB)

(eA,eB )∈tAB

The networks that are considered for alignment have varying number of protein- protein interactions, and the network alignments essentially ﬁnd similar interactions across two networks. To assess the network alignments, after a score is calculated, a normalized alignment score is calculated. The normalization is done with respect to the maximum possible alignment that is possible. The normalized score,s ˆAB, is:

k(min(|EA|, |EB|) − 1) sˆAB = cost(tAB)

Where k is the constant value that represents the maximum similarity score an edge can be assigned. In case of a perfect alignment, e.g. self-alignment of a network, sÂB=1, as the tree will return one less edge then the size of the smallest network (if there is a cycle in the graph). In case of no similarity, the score tAB will be infinite, which will correspond to a normalized scores ÂB = 0. Based on the normalized scores,

15 unsupervised hierarchical clustering of the networks was done in R using the hclust

function.

2.2.8 Finding the Optimal Alignment

Successful traversal of the alignment graph can discover an optimal alignment. How-

ever, due to the three types of connectivity that are possible, the number of edges

in the graph can become very large and the traversal can be cumbersome. We here

propose a novel method to approximate the global maximum of the alignment on the

graph. In an alignment graph, Ψ, a single edge in GA is mapped to many possible

edges in GB, therefore a traditional minimum spanning tree would not make sense, as

it would not correspond to a one-to-one alignment of an edge of GA with an edge in

GB. Therefore, when a node corresponding to the alignment (eA, eB) is added to the

network, all other nodes containing possible alignments of eA or eB are removed from

the graph. This dramatically reduces the computational time of the algorithm, and

it removes the redundancy inherent of the alignment graph. The alignment graph is

calculated starting from every node, and the modiﬁed network with the least weight

is picked as the optimum alignment. In essence, the traversal of the alignment graph

generates a tree that is representative of the alignment. A minimum spanning tree

would not work, because it would inevitably pick nodes that violate the one-to-one

constraint on A.

Starting at a random node, the algorithm proceeds as follows:

16 Initialize:

Etraversed = ∅

Vtraversed = {x}, where x = (eAx , eBx ) is a random node from A.

Vavailable = VΨ\[(eAx , ∗), (∗, eAx ), (eBx , ∗), (∗, eBx )]

Repeat until Vavailable = ∅

Choose (x, y),

where y = (eAy , eBy ) ∈ Vavailable with minimum weight in Ψ,

x ∈ Vtraversed and y∈ / Vtraversed.

Vtraversed=Vtraversed ∪ y

Etraversed = Etraversed ∪ (x, y)

Vavailable = Vavailable\[(eAy , ∗), (∗, eAy ), (eBy , ∗), (∗, eBy )]

The algorithm is repeated for every node in the alignment graph. The traversal with the minimum weight is the one that corresponds to the best alignment of the graphs.

2.3 Gene Ontology-Based Classiﬁcation of Genes

We have utilized the same semantic ontology similarity measure to assess the functional similarities of the genes [? ]. The similarity scores of driver genes based on

0 gene ontologies, sGO, are utilized to generate an unsupervised clustering of the genes in R using the hclust function, which performs hierarchical clustering.

0 sGO(vi, vj) = IC(vi, vj)

17 2.4 Sequence (Protein Families) Based Classiﬁcation of Genes

We have utilized a measure that is similar to the alignment scoring to assess the

sequence-based similarities of the cancer driver genes. Shared protein family domains

of genes obtained from PFAM [? ] were used to generate a scoring matrix, where

given two genes vi, vj, we compute the sequence-based similarity between these two

genes as follows:

0 |PF(vi) ∩ PF(vj)| spfam(vi, vj) = |PF(vi) ∪ PF(vj)|

where PF(v) represents the set of Pfam domains corresponding to the given gene v.

Unsupervised clustering of the genes was generated using hierarchical clustering.

2.5 Sequence and Gene Ontology Based Classiﬁcation of Genes

To generate a comparable measure at the gene level, we have also utilized a weight function that is similar to the weight function used in the alignment algorithm. How- ever, since only driver genes are considered, and not networks, a modiﬁed scoring function w0 is used.

0 0 0 w (vi, vj) = spfam(vi, vj) + sGO(vi, vj)

18 2.6 Somatic mutation proﬁles of tumor samples and statisti-

cal evaluation

We have acquired mutation data for the driver genes that correspond to the generated networks. The CAN-gene identiﬁcation dataset and The Cancer Genome Atlas publicly available somatic mutation data for have been acquired for the driver genes that correspond to the generated networks. The CAN-gene identiﬁcation dataset [? ] and

The Cancer Genome Atlas publicly available somatic mutations for colon and rectum adenocarcinomas [? ] were merged. In this mutational dataset, the 94 samples mutation profile for the driver genes of interest are represented in a binary fashion, i.e. a positive value is assigned to a gene if there is at least one non-synonymous mutation on that gene. The Mantel test is utilized to test any significant correlation between the network alignments and mutation profiles of patient samples. The normalized network score is converted to distances and the mutation profiles across patients are used for each gene to calculate a distance. Mantel test as it is implemented in R package ade4 is used with ten thousand permutations to generate a p-value showing the likelihood of correlation.

19 3 Results and Discussion

In this work, we investigate frequently mutated genes that are part of the neoplastic

process, also known as cancer driver genes. Although there are various genes that

have been identiﬁed as key driver genes, most of these genes are mutated in lower

frequencies than expected in sequenced tumors. We hypothesize that functionality of

cancer driver genes can be better assessed by looking at their functional role within

the tumorigenesis process. In this work, we have started from driver genes identiﬁed

for colorectal cancer, tested our hypothesis through a novel computational approach,

where networks connecting these driver genes are used to assess these genes (Fig-

ure ??).

3.1 Cancer driver genes and their synergistic activities

In recent years various cancer genomes and exomes were sequenced with the new

technologies that have become available. These studies have resulted in identiﬁcation

of key driver genes for various cancers such as colorectal and breast cancer [? ], ovarian cancer [? ] and glioblastoma multiforme [? ]. In this study, we focus on colorectal cancer driver genes as identiﬁed in [? ].

Colorectal tumorigenesis starts mostly with a mutation in APC gene, and tumor progression results from mutations in the other genes as observed previously [?? ].

These sequences of molecular and genetic events lead to transformation of adenoma- tous polyps to malignancies. In addition to mutations, various genetic and epigenetic

20 CTNNB1 CTNNB1

Figure 4: Example alignment of two CAN-gene networks (SMAD3 (left) vs. SMAD4

(right) depicted). CAN-genes are colored grey, and dashed (red) edges show aligned nodes (hence edges) that returned the best score. This alignment has a normalized score of 0.82. events such as abnormal DNA methylation can silence tumor suppressor genes or activate oncogenes. In this study we focus on somatic mutations that would compro- mise the genetic balance and ultimately lead to tumorigenesis. Utilizing an earlier network discovery approach we gather 22 CAN-gene signaling pathway networks that represent likely synergistic relationships between these CAN-genes and APC gene.

These networks reflect significant signaling pathway events, reflecting the content of the gene expression data used. In other words, utilization of a different gene expression data might change the set of driver gene networks we gather. Moreover, we are able to extend the number of discovered networks via utilizing estimated interactions as previously suggested [?? ] to overcome with missing links in the PPI network. However, we choose to use only known interactions collected from public

PPI databases [?? ] and these 22 networks are utilized for further evaluation.

21 3.2 Finding similarities among synergistic networks

We employ a novel network alignment tool catered towards aligning networks that originate from the same organism. The alignments resulted in normalized similarity scores among the 22 networks representing CAN-genes in colorectal cancer (e.g.

SMAD3 -SMAD4 alignment as in Figure ??). Figure ?? shows the histogram generated based on pairwise alignments of the driver gene networks. The normalized similarity score ranges between 0 and 1, and the distribution of the scores shows that while most of the networks are not similar to each other, there are smaller groups of networks that show high similarity, i.e. have high normalized alignment score.

Next, hierarchical clustering is conducted with these similarity measures, and a dendogram showing close relationships of these networks are shown (Figure ??). As expected, networks that connect genes that belong to the same family such as SMAD2,

SMAD3 and SMAD4, or EPHA3 and EPHB6 cluster closely. SMAD2, SMAD3, and

SMAD4 networks also show increased similarity to other driver genes based on their highly similar interacting partners. For instance, HIST1HB1 and SMAD2 tumor suppressor genes were previously classiﬁed as chromatin remodeling/transcription genes, and showed close relationship in our unsupervised classiﬁcation [?? ].

3.3 Clustering Cancer Driver Genes

In order to depict the utility of the network comparison approach, we also employed a similar clustering of the driver genes by utilizing their gene ontology terms and/or

Pfam domains. While sequence only and ontology only comparisons generated very

22 Histogram of Normalized Alignment Scores

140

120

100

Frequency 60

0.0 0.2 0.4 0.6 0.8 1.0

Normalized Alignment Score (0-1)

Figure 5: Histogram of normalized scores are plotted. All networks are compared

pairwise, and the frequencies are calculated. While most networks are not similar to

each other, there is a limited number that is highly similar, i.e. have a normalized

score greater than 0.8.

diﬀerent results than the network approach (See Appendix), we have combined the

two measures to create a better comparison of gene-based, and network-based ap-

proaches. The clustering of the cancer driver genes shown in Figure ?? and ??

clearly show that network-based approach can capture a diﬀerent clustering of these

genes.

For instance, TGFβ signaling associated SMADs are clustered with P2X7 receptor

(P2RX7) in the leftmost cluster in Figure ??, are not clustered together in the gene- based approach. These genes are shown to have a relation where activation of P2X7 receptors enhanced TGF-β signaling which is upstream of SMADs [? ]. Another

23 fsmlsfrteedie ee r ie nrows. in given are genes driver these for samples of two from samples patient [ 94 studies in independent observed mutations top. the at depicts shown below are heatmap networks The CAN-gene of in clustering mutations hierarchal somatic The observed tumors. and various networks CAN-gene of Classiﬁcation 6: Figure

Colorectal Cancer Samples ? ?

Colorectal Cancer Patients ahclm orsod oagn.Mttoa proﬁles Mutational gene. a to corresponds column Each ]

P2RX7 SMAD4 SMAD2 SMAD3 HIST1H1B SFRS6

Driver Genes ABCA1 LMO7

24 EPHA3 EPHB6 EVL TCF7L2 KRAS SYNE1 LRP2 RUNX1T1 EXOC4 KCNQ5 MMP2 CD46 NF1 PRKD1 iso A-ee,sqec n/rotlg-ae prahsaemr iie in limited more assessment. are this approaches ontology-based and/or sequence proper- CAN-genes, functional of capture ties can This approach network-based information. our while sequence that on shows dependent clearly not is and functions gene/protein of tion A (Figure subtrees different proxim- close A within (Figure appear ity would only genes if these Interestingly, used, is subtrees. information different sequence-based in genes gene two the these clustering, placed network-based approaches our only in subtree same the within were these While genes remodeling cytoskeleton is differences the of example shown. is incor- that [ measure similarity distance semantic a porates using CAN-genes of clustering Hierarchical 7: Figure

Height

?? 0 1 2 3 4 5 6 .Hwvr h eeotlg nycutrn a lcdte in them placed has clustering only ontology gene the However, ). SFRS6 SMAD4 SMAD2 SMAD3 EXOC4 ? ?? ABCA1 n euneifrainetatdfo fm[ Pfam from extracted information sequence and ] .Nt htgn nooyicue xetannota- expert includes ontology gene that Note ). LRP2

PRKD1 Cluster Dendrogram dist(merged_matrix)

hclust (*,"ward") P2RX7 KRAS

25 NF1 TCF7L2 SYNE1 EVL MMP2 EPHA3 EPHB6 HIST1H1B

LMO7 KCNQ5 LMO7 CD46 RUNX1T1 and SYNE1 [ ? ? ]. ] 3.4 Testing synergistic activities by comparing somatic mu-

tation proﬁles of tumor samples

In order to test our hypothesis, we have collected somatic mutations identiﬁed from

colorectal cancer samples collected from humans [?? ]. To improve the rigor of our

study, we have combined data from two studies that followed diﬀerent approaches.

Overall, the 94 colorectal cancer tumor samples are tabulated for their mutation

status at these 22 genes (Figure ??). We utilize Mantel test [? ] as a statistical test.

This method looks for signiﬁcant correlation between two distance matrices. The

CAN-gene network similarities acquired from the network alignments and mutation proﬁles of patient samples are used for this purpose.

We calculate distances between driver gene mutation proﬁles based on their com- monality in the tumor samples. For instance, observing a mutation in the same tumor would increase the similarity of these genes, or decrease the distance between them in this context. We gather such distance measures based on all 94 samples. The same distance measure used for hierarchical clustering put to work for CAN-gene networks.

As a result, the distance matrices (not shown) had no correlation (p ≥ 0.4).

This result rejects that the two datasets, one from the alignment of the networks and one from the somatic mutations have no correlation. If there was a correlation, this would mean that by utilizing the mutation proﬁles only, we could generate a similar clustering of the genes without using driver gene networks. This result rejects this claim and hence suggests that the mutations are mostly mutually exclusive for functionally similar CAN-genes.

26 3.5 Validation of results with cell line epistasis experiments

We validated our approach via a cell line experiment where we have picked two genes that are closely linked within the clustering of CAN-gene networks and performed an epistasis analysis. Epistasis refers to genetic interactions that the mutation of one gene masks the phenotypic eﬀects of a mutation at another locus. Our hypothesis is that given the CAN-gene clustering if two networks are closely similar, the sup- pression of one gene followed by the second gene should not create a big shift in the phenotype. This is justiﬁed by the functional similarity of the two genes, and it would be redundant to lose the same function twice.

We have performed siRNA knockdown experiments in a patient-derived cell (do- nated by Markowitz Lab, CWRU). The cell line did not have a mutation in EPHA3 or EPHB6 —the two genes we have selected. EPHA3 and EPHB6 were knockdown individually and in sequence (both ways) to observe the phenotypical change

(Figure ??). The knockdown samples in triplicate are then analyzed with Illumina

HumanHT12 microarray chip (data pending on Gene Expression Omnibus). The results were quantile normalized and gene level expression values were further analyzed.

All data was utilized to generate a multidimensional scaling plot to observe changes in the cell lines (Figure ??).

Based on our results, we observe that the single knockdown experiments have generated similar phenotypic changes. The second knockdown did not have additive or synergistic eﬀect. Hence the two similar CAN-genes identiﬁed by the network analysis validates our approach that CAN-genes with similar functions are not necessarily

27 !

Figure 8: The experimental design for validations is shown. The two genes are knockdown one at a time, and also in tandem. All cell lines are hybridized and analyzed with mRNA gene expression chips. mutated in the same CRC tumors. In other words, mutations are mutually exclusive for functionally similar CAN-genes identiﬁed by our framework.

28 WT 1.5 1.0

WT 0.5 Leading logFC dim 2

0.0 A3 B6A3 ABAB AB B6AB WT

0.0 0.5 1.0 1.5

Leading logFC dim 1

Figure 9: Multidimensional Scaling of knockdown experiments in triplicate.

The multidimensional scaling plot as a result of microarray gene expression proﬁles of samples are shown. WT shows Wild type CRC Cells, A3 shows knockdown of

EPHA3 gene (siEPHA3), B6 show knockdown of EPHB6 gene (siEPHB6) and AB shows knockdown of both genes (siEPHA3 first and siEPHB6 later or siEPHB6 first and siEPHA3 later). The phenotypes of single knockdowns shift cells drastically in a similar direction. However the additional knockdown of the second gene after the first one does not create incremental burden in the phenotype, hence siEPHA3-siEPHB6 knockdowns are still close to siEPHA3 only or siEPHB6 only knockdowns of samples.

29 4 Conclusion

In this study, we introduce a network alignment based approach to analyze signaling networks in carcinogenesis, and we demonstrate this method using colorectal cancer driver gene signaling networks. While recent advancement in sequencing technologies allow us to identify these changes, little is known as to how these alterations come together to manifest themselves as malignancies. By utilizing a novel network alignment tool we were able to study how functional similarity between driver gene signaling networks can be used to predict functional similarities among driver genes in carcinogenesis.

By repeating the clustering experiment with only functional annotations and sequence information, we found that the synergistic networks represent a better way of elucidating these relationships, and it is likely that similar driver genes are grouped together. In other words, given a group of similar driver genes, these work in similar synergistic activities with the APC gene in the tumorigenic progress.

By testing against an independent set of mutation data, we found that given a speciﬁc tumor, it is unlikely that similar driver genes are mutated together. In other words, given a group of similar driver genes, a single mutation in this group is suﬃcient to continue the tumorigenic progress.

We further validated our computational results with cell line experiments, where we demonstrated that the clustering of CAN-Genes can predict the functional role of these genes in cancer phenotypes. This work represents a novel approach to studying

30 cancer driver gene networks: analyzing the networks in terms of their functional similarity, using graph alignment algorithms.

In future work, this framework for analysis of signaling pathway networks can be applied to analyze larger numbers of driver genes, resulting in a more complete timeline of likely events of tumorigenesis. Here, we have concluded a small study for colorectal cancer; this can be applied to other cancers that benefit from recent large- scale cancer genome projects. These methods will be important for understanding the relationships between cancer signaling networks, resulting in improved diagnosis and characterization of tumors and finding appropriate and efficient regimens for treatments.

31 A Appendix

A.1 Gene Ontology Based Classiﬁcation of Cancer Driver

Genes

The clustering of the cancer driver genes shown in Figure ?? and ?? show very clear

diﬀerences. that network-based approach can capture a diﬀerent clustering of these

genes. For instance, EPHA3 and EPHB6 which have a close functional bimodality

are put together in the context of CRC networks, whereas the gene ontologies does not

show the same similarity for these two genes. Another example is SMAD2, SMAD3

and SMAD4 genes where in the latter clustering these are not observed to be in the same subcluster. Moreover, the hierarchical tree also shows a more level clustering rather than grouping the genes. This result suggests that the networks captured functional similarities of CAN-genes where better than the annotation only based approach.

A.2 Sequence Based Classiﬁcation of Cancer Driver Genes

We have also utilized Pfam domains [? ] as a measure of similarity at the sequence

level, and generated a scoring matrix that compared the driver genes that we have

studied. The subsequent clustering is shown in Figure ??, where most genes are grouped together, since most genes had similar distances to others. While some protein families are captured quite well, the dendogram does not present much information about the majority of the driver genes.

32 Functional similarity of Driver Genes using GO only EPHB6 CD46 LMO7 RUNX1T1 KCNQ5 HIST1H1B SYNE1 MMP2 EVL LRP2 NF1 KRAS P2RX7 PRKD1 EXOC4 SMAD2 SMAD4 SMAD3 TCF7L2 EPHA3 SFRS6 ABCA1

Figure 10: Classiﬁcation of CAN-genes based on their gene ontology annotations using

hierarchal clustering are shown. The functional similarities of genes are assessed by

the semantic similarity of the ontology terms that were reported in Gene Ontology

database. The red lines show expected groupings of the terms that were placed in

diﬀerent subtrees in the unsupervised method.

33 lseigaeson h eunesmlrte fgnsaeasse yteshared the by database. assessed Pfam are the genes in of terms similarities sequence The hierarchal shown. using families are Pfam clustering their on based CAN-genes of Classiﬁcation 11: Figure

Height

0 1 2 3 4 5

SMAD4 SMAD2 SMAD3 EPHA3 EPHB6 LMO7 SYNE1

PRKD1 Cluster Dendrogram dist(pfam_matrix) hclust (*,"ward") RUNX1T1 P2RX7

34 MMP2 NF1 LRP2 KRAS HIST1H1B KCNQ5 EXOC4 EVL ABCA1 CD46 SFRS6 TCF7L2 References

[1] N Howlader, AM Noone, M Krapcho, D Miller, K Bishop, SF Altekruse,

CL Kosary, M Yu, J Ruhl, Z Tatalovich, A Mariotto, DR Lewis, HS Chen,

EJ Feuer, and KA Cronin (eds). Seer cancer statistics review , 1975-2013,. Tech-

nical report, National Cancer Institute. Bethesda, MD, based on November 2015

SEER data submission., posted to the SEER web site, April 2016.

[2] Tara Gangadhar and Richard L Schilsky. Molecular markers to individualize

adjuvant therapy for colon cancer. Nat Rev Clin Oncol, 7(6):318–25, Jun 2010.

[3] Sebastian S Zeki, Trevor A Graham, and Nicholas A Wright. Stem cells and their

implications for colorectal cancer. Nat Rev Gastroenterol Hepatol, 8(2):90–100,

Feb 2011.

[4] Hermann Brenner, Matthias Kloor, and Christian Peter Pox. Colorectal cancer.

Lancet, 383(9927):1490–502, Apr 2014.

[5] B. Vogelstein, E. R. Fearon, S. R. Hamilton, S. E. Kern, A. C. Preisinger, M. Lep-

pert, Y. Nakamura, R. White, A. M. Smits, and J. L. Bos. Genetic alterations

during colorectal-tumor development. N Engl J Med, 319(9):525–32, 1988.

[6] Axel Walther, Elaine Johnstone, Charles Swanton, Rachel Midgley, Ian Tomlin-

son, and David Kerr. Genetic prognostic and predictive markers in colorectal

cancer. Nat Rev Cancer, 9(7):489–99, Jul 2009.

[7] Laura D Wood, D Williams Parsons, SiˆanJones, Jimmy Lin, Tobias Sj¨oblom,

35 Rebecca J Leary, ..., and Bert Vogelstein. The genomic landscapes of human

breast and colorectal cancers. Science, 318(5853):1108–13, 2007.

[8] TCGA. Integrated genomic analyses of ovarian carcinoma. Nature,

474(7353):609–15, 2011.

[9] D. W. Parsons, S. Jones, X. Zhang, J. C. Lin, R. J. Leary, P. Angenendt,

P. Mankoo, H. Carter, I. M. Siu, G. L. Gallia, A. Olivi, R. McLendon, B. A.

Rasheed, S. Keir, T. Nikolskaya, Y. Nikolsky, D. A. Busam, H. Tekleab, Jr. Diaz,

L. A., J. Hartigan, D. R. Smith, R. L. Strausberg, S. K. Marie, S. M. Shinjo,

H. Yan, G. J. Riggins, D. D. Bigner, R. Karchin, N. Papadopoulos, G. Parmi-

giani, B. Vogelstein, V. E. Velculescu, and K. W. Kinzler. An integrated genomic

analysis of human glioblastoma multiforme. Science, 321(5897):1807–12, 2008.

[10] H. Carter, S. Chen, L. Isik, S. Tyekucheva, V. E. Velculescu, K. W. Kinzler,

B. Vogelstein, and R. Karchin. Cancer-speciﬁc high-throughput annotation of

somatic mutations: computational prediction of driver missense mutations. Can-

cer Res, 69(16):6660–7, 2009.

[11] W. C. Wong, D. Kim, H. Carter, M. Diekhans, M. C. Ryan, and R. Karchin.

Chasm and snvbox: toolkit for detecting biologically important single nucleotide

mutations in cancer. Bioinformatics, 27(15):2147–8, 2011.

[12] A. Youn and R. Simon. Identifying cancer driver genes in tumor genome se-

quencing studies. Bioinformatics, 27(2):175–81, 2011.

36 [13] A. Torkamani and N. J. Schork. Identiﬁcation of rare cancer driver mutations

by network reconstruction. Genome Res, 19(9):1570–8, 2009.

[14] G. Bebek, V. Patel, and M. R. Chance. Petals: Proteomic evaluation and topo-

logical analysis of a mutated locus’ signaling. BMC Bioinformatics, 11:596, 2010.

[15] Nir Yosef, Einat Zalckvar, Assaf D Rubinstein, Max Homilius, Nir Atias, Liram

Vardi, Igor Berman, Hadas Zur, Adi Kimchi, Eytan Ruppin, and Roded Sharan.

Anat: a tool for constructing and analyzing functional protein networks. Sci

Signal, 4(196):pl1, 2011.

[16] S D Markowitz and M M Bertagnolli. Molecular origins of cancer: Molecular

basis of colorectal cancer. N Engl J Med, 361(25):2449–2460, Dec 2009.

[17] Comprehensive molecular characterization of human colon and rectal cancer.

Nature, 487(7407):330–337, 07 2012.

[18] M. Koyuturk, A. Grama, and W. Szpankowski. An eﬃcient algorithm for detect-

ing frequent subgraphs in biological networks. Bioinformatics, 20 Suppl 1:i200–7,

2004.

[19] M. Koyuturk, Y. Kim, S. Subramaniam, W. Szpankowski, and A. Grama. De-

tecting conserved interaction patterns in biological networks. J Comput Biol,

13(7):1299–322, 2006. Koyuturk, Mehmet Kim, Yohan Subramaniam, Shankar

Szpankowski, Wojciech Grama, Ananth R01 GM068959-01/GM/NIGMS NIH

HHS/ J Comput Biol. 2006 Sep;13(7):1299-322.

37 [20] J. Flannick, A. Novak, B. S. Srinivasan, H. H. McAdams, and S. Batzoglou.

Graemlin: general and robust alignment of multiple large interaction networks.

Genome Res, 16(9):1169–81, 2006.

[21] B. Dost, T. Shlomi, N. Gupta, E. Ruppin, V. Bafna, and R. Sharan. Qnet: a tool

for querying protein interaction networks. J Comput Biol, 15(7):913–25, 2008.

[22] X. Guo and A. J. Hartemink. Domain-oriented edge-based alignment of protein

interaction networks. Bioinformatics, 25(12):i240–6, 2009.

[23] C. S. Liao, K. Lu, M. Baym, R. Singh, and B. Berger. Isorankn: spectral methods

for global alignment of multiple protein networks. Bioinformatics, 25(12):i253–8,

2009.

[24] U. Pieper, N. Eswar, B. M. Webb, D. Eramian, L. Kelly, D. T. Barkan, H. Carter,

P. Mankoo, R. Karchin, M. A. Marti-Renom, F. P. Davis, and A. Sali. Modbase,

a database of annotated comparative protein structure models and associated

resources. Nucleic Acids Res, 37(Database issue):D347–54, 2009.

[25] X. Qian and B. J. Yoon. Eﬀective identiﬁcation of conserved pathways in bio-

logical networks using hidden markov models. PLoS One, 4(12):e8070, 2009.

[26] F. Ay, M. Kellis, and T. Kahveci. Submap: aligning metabolic pathways with

subnetwork mappings. J Comput Biol, 18(3):219–35, 2011.

[27] Rohit Singh, Jinbo Xu, and Bonnie Berger. Global alignment of multiple protein

interaction networks with application to functional orthology detection. Proc

Natl Acad Sci U S A, 105(35):12763–8, Sep 2008.

38 [28] Chris Stark, Bobby-Joe Breitkreutz, Teresa Reguly, Lorrie Boucher, Ashton Bre-

itkreutz, and Mike Tyers. Biogrid: a general repository for interaction datasets.

Nuc. Ac. Res., 34(suppl 1):D535–539, 2006.

[29] T S Keshava Prasad, Kumaran Kandasamy, and Akhilesh Pandey. Human pro-

tein reference database and human proteinpedia as discovery tools for systems

biology. Methods Mol Biol, 577:67–79, 2009.

[30] M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P.

Davis, K. Dolinski, S. S. Dwight, J. T. Eppig, M. A. Harris, D. P. Hill, L. Issel-

Tarver, A. Kasarskis, S. Lewis, J. C. Matese, J. E. Richardson, M. Ringwald,

G. M. Rubin, and G. Sherlock. Gene ontology: tool for the uniﬁcation of biology.

the gene ontology consortium. Nat Genet, 25(1):25–9, 2000.

[31] Minoru Kanehisa and Susumu Goto. Kegg: Kyoto encyclopedia of genes and

genomes. Nuc. Ac. Res., 28(1):27–30, 2000.

[32] F. Campagne, S. Neves, C. W. Chang, L. Skrabanek, P. T. Ram, R. Iyengar,

and H. Weinstein. Quantitative information management for the biochemical

computation of cellular networks. Sci STKE, 2004(248), August 2004.

[33] Nancy R. Gough, Elizabeth M. Adler, and L. Bryan Ray. Focus issue: Cell

signaling–making new connections. Sci. STKE, 2004(261):12, 2004.

[34] H. W. Mewes, K. Heumann, A. Kaps, K. Mayer, F. Pfeiﬀer, S. Stocker, and

D. Frishman. Mips: a database for genomes and protein sequences. Nuc. Ac.

Res., 27(1):44–48, January 1999.

39 [35] Vishal N Patel, Gurkan Bebek, John M Mariadason, Donghai Wang, Leonard H

Augenlicht, and Mark R Chance. Prediction and testing of biological networks

underlying intestinal cancer. PLoS ONE, 5(9):e12497, 2010.

[36] G Bebek and J Yang. Pathﬁnder: mining signal transduction pathway seg-

ments from protein-protein interaction networks. BMC Bioinformatics, 8:335–

335, 2007.

[37] M. Punta, P. C. Coggill, R. Y. Eberhardt, J. Mistry, J. Tate, C. Boursnell,

N. Pang, K. Forslund, G. Ceric, J. Clements, A. Heger, L. Holm, E. L. Sonnham-

mer, S. R. Eddy, A. Bateman, and R. D. Finn. The pfam protein families

database. Nucleic Acids Res, 40(D1):D290–D301, 2012.

[38] Jayesh Pandey, Mehmet Koyut¨urk,Yohan Kim, Wojciech Szpankowski, Shankar

Subramaniam, and Ananth Grama. Functional annotation of regulatory path-

ways. Bioinformatics, 23(13):i377–86, Jul 2007.

[39] Eric Banks, Elena Nabieva, Ryan Peterson, and Mona Singh. Netgrep: fast

network schema searches in interactomes. Genome Biol, 9(9):R138, 2008.

[40] Jayesh Pandey, Mehmet Koyut¨urk, Shankar Subramaniam, and Ananth

Grama. Functional coherence in domain interaction networks. Bioinformatics,

24(16):i28–34, Aug 2008.

[41] TCGA. The cancer genome atlas data portal, 2012. Accessed on Jan 4, 2012,

http://tcga-data.nci.nih.gov/tcga/dataAccessMatrix.htm.

40 [42] K. W. Kinzler and B. Vogelstein. Lessons from hereditary colorectal cancer. Cell,

87(2):159–70, 1996.

[43] Ugur Eskiocak, Sang Bum Kim, Peter Ly, Andres I Roig, Sebastian Biglione,

Kakajan Komurov, Crystal Cornelius, Woodring E Wright, Michael A White,

and Jerry W Shay. Functional parsing of driver mutations in the colorectal

cancer genome reveals numerous suppressors of anchorage-independent growth.

Cancer Res, 71(13):4359–65, Jul 2011.

[44] Chia-Mei Wang, Yuan-Yi Chang, and Synthia H Sun. Activation of p2x7

purinoceptor-stimulated tgf-beta 1 mrna expression involves pkc/mapk signalling

pathway in a rat brain-derived type-2 astrocyte cell line, rba-2. Cell Signal,

15(12):1129–37, Dec 2003.

[45] N. Mantel. The detection of disease clustering and a generalized regression ap-

proach. Cancer Res, 27(2):209–20, 1967.