DAWNRANK: DISCOVERING PERSONALIZED DRIVER IN CANCER

BY

JACK PU HOU

THESIS

Submitted in partial fulfillment of the requirements for the degree of Master of Science in Bioengineering in the Graduate College of the University of Illinois at Urbana-Champaign, 2016

Urbana, Illinois

Adviser:

Professor Jian Ma

ABSTRACT

Large-scale cancer genomic studies have revealed that the genetic heterogeneity of the same type of cancer is greater than previously thought. A key question in cancer genomics is the identification of driver genes. Although existing methods have identified many common drivers, it remains challenging to predict personalized drivers to assess rare and even patient-specific mutations. We developed a new algorithm called DawnRank to directly prioritize altered genes on a single patient level. Applications to TCGA datasets demonstrated the effectiveness of our method. We believe DawnRank complements existing driver identification methods and will help us discover personalized causal mutations that would otherwise be obscured by tumour heterogeneity. Source code can be accessed at http://bioen- compbio.bioen.illinois.edu/DawnRank/.

ii

ACKNOWLEDGEMENTS

We thank Benita Katzenellenbogen, Edison Liu, Koichiro Inaki, and ChengXiang Zhai for helpful discussions. We thank Sam Ng (Josh Stuart’s lab at UCSC) for technical assistance to run PARADIGM-Shift. This work was supported by an award from the Interdisciplinary

Innovation Initiative (In3) program at the University of Illinois (to J.M.) and an Illinois

Distinguished Fellowship (to J.P.H.). J.M. was additionally supported by NIH grant HG006464 and NSF grants 1054309 and 1262575.

iii

TABLE OF CONTENTS

CHAPTER 1: INTRODUCTION………………………………………………………..1

CHAPTER 2: METHODS……………………………………………………………….4 2.1 DawnRank Algorithm………………………………………………………..5 2.2 Condorcet Voting for Rank Aggregation…………………………………….8

CHAPTER 3: RESULTS………………………………………………………………..11 3.1 Datasets and Network………………………………………………………..11 3.2 Comparison to Previous Methods……………………………………………12 3.3 Discovering New and Personalized Drivers…………………………………15 3.4 Implementation and Running Time………………………………………….24

CHAPTER 4: DISCUSSION…………………………………………………………….25

CHAPTER 5: FIGURES………………………………………………………………...27

CHAPTER 6: REFERENCES………………………………………………………….33

APPENDIX A: LIST OF ABBREVIATIONS…………………………………………..38

APPENDIX B: SUPPLEMENTAL INFORMATION…………………………………..39 B.1 Dynamic Damping Factor vs. Static Damping Factor………………………39 B.2 DawnRank Proof of Convergence…………………………………………...40 B.3 Parameter Training…………………………………………………………..42 B.4 Comparison with Summary Statistics……………………………………….43

APPENDIX C: SUPPLEMENTAL FIGURES………………………………………….46

iv

CHAPTER 1

INTRODUCTION

Recent advances in next-generation sequencing (NGS) technologies have provided us with an unprecedented opportunity to better characterize the molecular signatures of human cancers. The critical challenge facing cancer genomics today is to analyze and integrate such information in the most efficient and meaningful way to advance cancer biology, and then to translate that knowledge to clinic [1, 2]. A key question in cancer genomics is how to distinguish “driver” mutations, which contribute to tumorigenesis, from functionally neutral “passenger” mutations

[3]. The most basic approach is to categorize mutations based on recurrence, i.e., the most commonly occurring mutations are more likely to be drivers [4, 5], or by comparing mutation rates in individual genes based on an empirically derived background mutation rate, such as

MutSig [6] and MuSiC [7]. Machine learning based approaches use existing knowledge to help identify drivers. For example, CHASM utilizes random forest to classify driver mutations using alterations trained from known cancer-causing somatic missense mutations [8]. There are several recent methods that use additional information to help predict driver genes and driver pathways.

CONEXIC was developed to integrate copy number change and expression data to identify potential driver genes located in regions that are amplified or deleted in tumors [9]. Network and pathway-based approaches have become one of the most promising methods to understand drivers due to their ability to model gene-gene interactions by aggregating small effect sizes from individual genes. MEMo and Dendrix rely on predicted mutual exclusivity of driver mutations within pathways or subnetworks [10, 11]. MEMo utilizes driver cliques based on known pathways with mutually exclusive mutations in the patient cohort, whereas Dendrix identifies subnetworks (de novo) with potential driver activity as having high coverage and high mutual

1 exclusivity. Another method, PARADIGM-Shift was developed to utilize pathway-level information along with other features (such as expression, methylation, copy number) to infer gain and loss of function for mutations. PARADIGM-Shift works best with small pathways with specific genes of interest, i.e., focus genes [12]. DriverNet classifies driver mutations as mutations that propagate outlying downstream differential expression in the transcriptional regulatory network [13]. More recently, TieDIE [14] was developed to find small cancer driver pathways within a large super-pathway to connect genomic alterations with transcriptomic changes, MAXDRIVER [15] was proposed to identify driver genes by integrating multiple omics data and heterogeneous networks, and VarWalker [16] was developed to construct cancer- specific mutation networks to assist driver identification in a patient cohort. However, the scope of these methods is still limited. Most existing methods such as MEMo, Dendrix, and DriverNet require a large number of patient samples to generate reliable results and lack the ability to discover rare and patient-specific drivers. Other methods such as PARADIGM-Shift are designed to determine drivers in small pathways and often require detailed previous knowledge of specific pathways and focus genes to operate effectively. New methods are needed to identify novel and rare drivers when we do not have much prior knowledge of the tumor.

It is now acknowledged that individual tumors of the same type are highly heterogeneous and have diverse genomic alterations [2, 17]. This stems from the “long-tail phenomenon” which states that cancer mutations are characterized by a small number of frequently mutated genes and a large number of infrequently mutated genes [18, 19]. Discovering rare drivers in the long tail of genetic alterations remains difficult. Therefore, we urgently need methods to assess the impact of patient-specific and rare mutations from individual tumor samples in order to elucidate personalized molecular drivers.

2

In this work, we introduce a new method called DawnRank that detects driver genes using data from a single patient sample. By only using data from an individual patient sample rather than a large cohort, we identify drivers in a personalized fashion. The single patient approach classifies drivers regardless of mutation frequency, thereby allowing us to focus on rare (infrequent) drivers. DawnRank ranks potential driver genes based on their impact on the overall differential expression of its downstream genes in the molecular interaction network. Mutated genes with higher ranking are more likely to be drivers. Our method builds on the DriverNet rationale that the impact of a potential driver gene can be determined by its effect on the genes that are regulated by it. However, unlike DriverNet, DawnRank can be applied to a single patient at a time. It is also important to note that our approach differs from the aforementioned

PARADIGM-Shift. The small-scale network approach of PARADIGM-Shift works well when the user has a target pathway and genes in mind, but is not applicable in determining rare driver genes distributed across multiple pathways with a more comprehensive gene network.

DawnRank differs from MEMo and Dendrix as the scope of DawnRank is to determine individual driver genes rather than driver pathways. Although CONEXIC also utilizes pathway information to identify important CNVs, DawnRank has a more generalized framework to assess the impact of both mutations and CNVs at the individual patient level. Such a patient-specific framework also makes DawnRank different from methods such as MAXDRIVER and

VarWalker. DawnRank also differs from TieDIE, which predicts subnetworks of interlinking genes that highlight cancer specific subnetworks, whereas DawnRank predicts individual driver genes.

3

CHAPTER 2

METHODS

The DawnRank method ranks mutated genes in a single patient according to their potential to be drivers. DawnRank requires the knowledge of a gene interaction network, somatic genomic alterations from the patient’s cancer genome, and the differential gene expression profile between the cancer transcriptome and normal transcriptome. The overview of DawnRank is in

Figure 1. Our method ranks genes according to their impact on the perturbation of downstream genes, i.e., a gene will be ranked higher if it causes many downstream genes, directly or indirectly in the interaction network, to be differentially expressed. DawnRank views the gene network as a directed graph. We adopted the random walk approach used in PageRank [20, 21] to model this process iteratively. The framework effectively reflects the observation from previous works that mutations in genes with higher connectivity within the gene network are more likely to be impactful. In each iteration, a node in the network can either, with a probability

1 − � (d is called damping factor, which we defined in a new way. See below.), revert back to stay at the same node, or with a probability � to walk randomly to a downstream node, which symbolizes the impact a particular gene has on its downstream neighbors. Our method depends on three parameters: the differential gene expression, the interaction network as a directed graph, and the damping factor. These three parameters along with genomic alterations form the key components of our model to determine drivers in individual patient samples. The output of the rank describes gene’s overall impact. In order to produce a more readable form of the rank, we converted the rank into percentile form to get the relative order of the genes in the rest of this paper.

4

2.1 DawnRank Algorithm

In DawnRank, a gene will possess a higher impact score (i.e., rank) if the gene is highly connected to differentially expressed downstream genes (directly and indirectly connected).

Driver genes tend to display a high-degree of connectivity within the gene network [22, 23]. For example, using the number of outgoing edges alone, known driver genes as classified by the

Cancer Gene Census (CGC) [24] have a mean and median of 31.45 and 12 outgoing edges, respectively, whereas genes not typically classified as drivers (not in CGC) have a mean and median of 17.73 and 3 outgoing edges, respectively. The higher number of outgoing connectivity of known driver genes suggests that the PageRank model would be appropriate to prioritize driver genes based on their impact in the gene interaction network. PageRank has had several adaptations in genomics. GeneRank utilized PageRank to rank the importance of genes in a molecular network [25]. PageRank derivatives (such as SPIA [26]) have also been used to analyze pathway-level importance. More recently, it was utilized to predict clinical outcome of cancer patients based on gene expression [27] and to assist subtype identification [28]. Such approaches also show similarity in nature to modeling network impact as a heat diffusion process as used in HotNet [29] and TieDIE [14]. DawnRank builds on the original PageRank algorithm by providing a way to model a network’s directionality with more stable rankings by utilizing dynamic damping factors (see below).

DawnRank views the gene network as a directed graph. Let N be the number of nodes (in our case, genes) in the directed graph, and A be the adjacency matrix representation of the graph, a 0-1 matrix (if node i links to j, then � = 1). Note that the current 0-1 adjacency matrix can be naturally extended to consider weighted edges to further distinguish different gene-gene interactions.

5

We define the rank of each gene iteratively:

�� � = (1 − �)� + � , 1 ≤ � ≤ � (1) ���

� is the rank in the � iteration. The output of the rank describes a gene’s overall impact on the network: the higher the rank, the higher the impact of the gene. � is the damping factor, a parameter representing the extent to which the ranking depends on the structure of the graph. In

DawnRank, the damping factor is individualized based on gene connectivity (discussed below).

� is the prior probability of the gene which we set to the absolute differential expression. The absolute differential expression is the absolute value of the difference of the log scale tumor and

normal expression values. The ��� = � is the in-degree of i, or the number of incoming nodes to i. This differs from the original PageRank definition of ���, which was the out-degree of i. A webpage’s PageRank is dependent on the rank of the webpages that link to it (incoming edges), whereas our gene’s rank is dependent on the rank of the genes that it links to (outgoing edges).

The zero-one gap problem refers to the potential pitfall that assigns biased ranks to some nodes [30]. When trying to rank nodes with 0 incoming edges, known as “dangling nodes”, the

��� will be 0, arising to a divide-by-zero error. In our real gene network data, 15.5% of all genes do not have any incoming edges. The initial PageRank algorithm attempts to handle the problem by setting the damping factor to be 0 for such genes, while using the damping factor

0.85 for all other nodes. If we use this approach, the ranks of genes with no incoming edges will be based solely on its differential expression (and not the network structure). However, this correction in itself causes a large gap in the damping factor for genes with 0 and 1 incoming edge

(Supplementary Figure C3). This large gap in the damping factor can cause a drastic change in

6 the ranking of the gene when an incoming edge is added to the gene which in turn may cause unstable rankings [30]. An unstable ranking system is especially concerning to gene network data, as it is still not a complete representation of all interactions among genes [31]. Therefore, small modifications and additions to certain gene interactions may significantly alter the rankings of potential drivers. To address this problem, we utilize dynamic damping factors [30], where each gene possesses an individualized damping factor based on the number of incoming edges to that gene (Eq. 2). As the number of incoming edges increases, the damping factor gradually rises to incorporate more connectivity information into the ranking of the gene, therefore no large gap is observed from 0 in-degree and 1 in-degree (see Supplementary Figure C3).

��� � = (2) ��� + �

The parameter � follows a Dirichlet prior trained from maximizing the values of � over 100 random samples. We selected the � value of 3 because it had the highest average DawnRank scores for known drivers in CGC (see Supplementary Materials for more details on how � was trained). Overall, the dynamic damping factor mitigates the large change in the damping factor in nodes with 0 and 1 incoming edges by gradually increasing the damping factor as the gene’s in- degree increases, thereby creating more reliable and more stable rankings. A toy example is in the Supplementary Figure C2. We also show that DawnRank performs more reliably with a dynamic damping factor than a static damping factor on the TCGA datasets (see Supplementary

Figure C1).

In addition to the iterative version of DawnRank, the method can also be presented in matrix form:

� = (1 − �)� + ��×� (3)

7 where �, �, and � are N×1 matrices to represent the rank, gene-specific damping factor, and the gene expression, respectively, and � is the transition matrix defined by:

� � , ⋯ , ��� ��� � = ⋮ ⋱ ⋮ (4) � � , ⋯ , ��� ���

DawnRank converges when there is no longer a significant update in the ranks. This is when the magnitude of the difference of the ranks between time � + 1 and the previous time point � falls below a small �, which we set to 0.001, the same value suggested by [30].

DawnRank also stops when no solution is present after a maximum number of iterations, which we set at 100. In practice, DawnRank always converges for any reasonable µ between 0.01 and

20 within 20 iterations. Nonetheless, there are corner cases at low damping factors (µ < 10) where DawnRank either does not converge or converges very slowly. A detailed discussion of the convergence of DawnRank can be found in the Supplementary Materials.

2.2 Condorcet Voting for Rank Aggregation

To aggregate the rankings of genes from individual patient samples to determine the most impactful drivers in a population (e.g., known drivers for the same type of cancer or a specific sub-type), DawnRank applies a modified version of the Condorcet method [32]. The Condorcet method is a voting scheme in which “voters” vote for the best “candidate” by submitting a rank- ordered list of candidate preferences. The list of preferences is allowed to be either partial or full.

The Condorcet method then selects a winning candidate by comparing every possible pair of candidates A and B and determining a “winner” by comparing the number of voters that preferred A to B and vice-versa. We applied the Condorcet method to the personalized rankings of genes to determine aggregate ranking of genes in a patient population.

8

Although the Condorcet method is built to handle partial voting lists, one difficulty of implementing the Condorcet method is the lack of patient samples that possess the commonly mutated genes. Many pairwise comparisons are missing for many gene combinations due to the lack of patients that have mutations in both genes simultaneously. However, since DawnRank can output a ranking as an impact score for all genes regardless if a gene is mutated, we evaluated pairwise comparisons of two genes based on patients with a mutation in at least one of the two genes. This approach avoids the use of non-mutated gene comparisons to calculate the aggregate score of genes, as the objective of DawnRank is to determine the altered genes that are the most impactful. However, since mutation recurrence is an important factor in detecting common drivers, we also implemented a penalty heuristic, �, a number between 0 and 1 in our approach to lower the ranking of a gene in a pairwise comparison that is not mutated. This penalty allows us to rank aggregate frequent drivers based on both impact and frequency.

� if �(�)����(�) > �(�)����(�) PairwiseWinner �, � = (5) � otherwise where

� if � is NOT mutated �(�) = (6) 1 if � is mutated

We used the output from DawnRank, which we converted to percentile rank format, to represent the ranking of the gene. The penalty heuristic lowers the value of a non-mutated gene when comparing it against a mutated gene. This heuristic serves as both a means to prevent a rare mutation that is impactful in one patient from winning all pairwise comparisons (akin to a candidate winning just because one and only voter that voted for it ranked it higher than any other candidate) and to prevent a low impact, high frequency mutation from winning a pairwise comparison against high-impact genes that are not frequently mutated (akin to an unpopular candidate winning just because many voters had a low-preference vote for that candidate). We

9 selected � by running DawnRank over 100 random patient samples for various instances of � between 0 and 1 and calculating the precision with respect to CGC genes (see Supplementary

Materials). We found � to be 0.85.

10

CHAPTER 3

RESULTS

We applied DawnRank to TCGA datasets. We first showed that DawnRank outperforms two recent pathway-based methods DriverNet and PARADIGM-Shift. We also showed that

DawnRank produces reliable results as compared to cohort-based approaches CHASM and

OncodriveFM. We then used the results of DawnRank to determine both potential novel drivers

(new genes mutated frequently), and more importantly, potential rare and personalized drivers that previously could not be assessed by other methods. We also applied the method to study breast cancer subtypes and found that the amount of predicted rare drivers has a strong correlation with the degree of genetic heterogeneity in different breast cancer subtypes.

3.1 Datasets and Network

We applied DawnRank to 512 glioblastoma multiforme (GBM) samples, 504 breast cancer

(BRCA) samples, and 572 ovarian cancer (OV) samples in TCGA. The datasets we used in this work include gene expression and coding-region mutation data for three cancer types generated by TCGA [33-35]. The data was accessed on May 20, 2013. The mutation data we used included non-synonymous point mutations and insertions and deletions (indels) in coding regions.

We built the gene interaction network used in DawnRank using a variety of sources, including the network used in MEMo [10] as well as the up-to-date curated information from

Reactome [36], the NCI-Nature Curated PID [37], and KEGG [38]. The MEMo network consisted of inferred gene-interaction from sources of information such as interactions, gene co-expression, protein domain interaction, and text-mined interaction described by [39]. To aggregate all of the networks together, all redundant edges were collapsed to single edges. We included self-loops within the network to account for auto-regulation events [40]. The resulting

11 aggregate network consisted of 11,648 genes and 211,794 edges, and can be downloaded from our supplementary website.

To help evaluate the quality of our results, we obtained a list of 487 known driver genes from the well-studied cancer gene database, CGC [24]. Additionally, we also compared the quality of our results to the Pan-Cancer drivers [41], a driver gene list built using results from well known cohort-based methods over twelve tumor types. In practice, there is no gold standard of known drivers. However, well-curated cancer gene databases such as CGC and the Pan-

Cancer study provide an approximate benchmark of known drivers [13, 42].

3.2 Comparison to Previous Methods

3.2.1 DawnRank outperforms pathway-based methods DriverNet and PARADIGM-Shift

We evaluated the performance of DawnRank’s ability to identify known drivers and compared it with DriverNet and PARADIGM-Shift. As mentioned above, we utilized CGC as an approximate benchmark of known drivers. Note that here we implicitly assume that all non- synonymous mutations in driver genes are potential driver mutations if they are selected by a method. We performed two separate comparisons. (1) We compared DawnRank to DriverNet over a large network in order to evaluate the performance of the two methods using a large human interaction network (which PARADIGM-Shift is not able to work with practically). (2)

We compared DawnRank to PARADIGM-Shift and DriverNet over a smaller, but well- annotated gene network based on KEGG in order to determine the effectiveness of the three algorithms in smaller networks. The network used in the first comparison was the same network described earlier. The network used in the second comparison was a smaller network built from the aggregation of the KEGG cancer pathways with 1,492 gene nodes and 8,070 edges. We ran

DriverNet version 1.0.0, defining a differentially expressed gene using their default settings of 2

12 standard deviations (http://www.bioconductor.org/packages/release/bioc/html/DriverNet.html), and we ran PARADIGM-Shift version 0.1.9 using the suggested global-rank transformation for expression data (http://sysbio.soe.ucsc.edu/paradigm/tutorial/). To facilitate the comparison, we applied the Condorcet rank aggregation (see Methods) for the DawnRank scores based on individual patient samples to provide the consensus population-level driver scores. For each comparison, we used the following three measures (Precision, Recall, and F1-Score):

(# Mutated Genes found in CGC) ∩ (# Genes found) ��������� = (# Genes found)

(# Mutated Genes found in CGC) ∩ (# Genes found) ������ = (7) (# Mutated Genes found in CGC)

���������×������ � ����� = 2× ��������� + ������

Precision, recall and F1 scores were based on the top N genes. We first evaluated the performance between DawnRank and DriverNet. In general, DawnRank outperforms DriverNet in all three cancer datasets with respect to CGC (Figure 2). Although DriverNet performs comparably in ranking the top genes in GBM, it has poorer performance in OV and BRCA. A potential explanation of the difference may lie in the total number of mutations in the three cancer datasets. GBM had 5,478 mutations over 599 genes, while OV had 13,520 mutations over

4,968 genes and BRCA had 11,900 mutations over 5,205 genes. The numbers indicate that there may be more passenger mutations in BRCA and OV and DawnRank is less affected by noise than DriverNet. An illustration of this is DriverNet’s ranking of the gene TTN as a top 5 driver in both BRCA and OV. TTN is the longest gene in the and recent TCGA analysis has suggested that that higher mutation rate in TTN is likely to be artifacts [34]. TTN was not ranked among the top 60 genes in any cancer according to DawnRank. We then evaluated the performance of DawnRank, PARADIGM-Shift, and DriverNet using the smaller KEGG

13 network. Overall, DawnRank outperforms both DriverNet and PARADIGM-Shift in terms of precision, recall, and F1 scores using CGC as a standard (Figure 3) or the Pan-Cancer results as a standard (Supplementary Figure C11). In BRCA, although some known drivers such as TP53 and ATM were detected by multiple methods, DawnRank detected important known driver genes in the top 10 such as CDH1, PIK3R1, and BRCA1 in breast cancer which were not detected by either PARADIGM-Shift or DriverNet as top ranking drivers.

In addition to its ability to more precisely identify known driver genes at the entire patient population level, DawnRank also has an advantage of obtaining high-quality results from a much smaller patient cohort. To evaluate this ability, we applied DawnRank to random subsets of the patient cohort from GBM, OV, and BRCA from 10, 20, 50, and 100 patients to determine the precision at which DawnRank identifies known drivers in its top 30 results. We compared these results with that of DriverNet using the same patient subsets. The mean of the precision scores after ten runs is presented in Figure 4. We found that even with a small patient cohort,

DawnRank can still perform reasonably well, much better than DriverNet. Even though

DawnRank does not perform as well with a sample of 10 as it can with the entire dataset, the average DawnRank precision at 10 samples is comparable to the DriverNet precision using the entire cohort. Our results suggest that DawnRank is able to identify known drivers even with a small number of samples.

3.2.2 DawnRank achieves reliable results as compared to cohort-based approaches CHASM and

OncodriveFM

In addition to comparing DawnRank to the pathway-based approaches of DriverNet and

PARADIGM-Shift, we compared DawnRank to non-pathway-based approaches as well. We

14 compared DawnRank with CHASM [8] and OncodriveFM [43]. We ran CHASM version 3.0 using CRAVAT [44] with the cancer driver analysis mode (http://www.cravat.us/), and we ran

OncodriveFM using the IntOGen Mutation Analysis interface using SIFT, Polyphen2, and

Mutation Assessor data (http://www.intogen.org/web/mutations/v04/analysis). Both CHASM and OncodriveFM can rank genes from most likely drivers to least likely ones. Like in

DawnRank, we evaluated all non-synonymous mutations and evaluated each method using precision-recall metrics in comparison to CGC. The results (Supplementary Figure C10) show that DawnRank performs comparably to both CHASM and OncodriveFM. These results also demonstrate that the personalized approach of DawnRank can achieve reliable results at the cohort level.

3.3 Discovering New and Personalized Drivers

In this section, we use DawnRank to identify driver genes that may not have been classified as drivers by other methods. We discuss such driver genes determined by DawnRank in two main categories: general novel drivers, and personalized drivers (we are mostly interested in personalized rare drivers).

• A general novel driver is a recurrent driver that has not been classified as a driver gene.

Here we define novel drivers as genes that satisfy the following requirements: (1) Highly

ranked (genes that have an aggregate rank in the top 30 of driver genes based on the

patient cohort), which means it is considered as driver in multiple samples; (2) Recurrent

(alteration frequency >2% in the patient cohort); (3) Not previously classified as a driver

by CGC.

• A personalized driver is a gene predicted to be a driver for specific patients. Here we

define personalized drivers as drivers classified as significant using DawnRank’s

15

maximally selected rank statistics cutoff (explained later) and as drivers that display

higher than expected rankings under the Chauvenet’s criterion for detecting outliers. In

other words, personalized driver genes could be recurrently mutated in multiple samples,

but they are only ranked high enough to be considered as drivers in specific samples. We

are particularly interested in rare drivers, as rare drivers would be obscured by the long

tail phenomenon of cancer mutations. Personalized drivers mutated in less than 2% of

patients will be considered personalized rare drivers.

3.3.1 Discovering Novel Drivers with Coding Mutations

Using the criteria listed above to detect novel drivers, we used the aggregate DawnRank score to select genes with coding mutations that were considered novel drivers. In GBM, we found several consistently high-ranking potential novel driver genes. Of them, TGFBR2, HIF1A, and

FOXO3 are the most promising. These genes are highly rank, frequently mutated, and are involved in several cancer functions and pathways. TGFBR2 is a transforming growth factor whose function is to regulate cell division signals and inhibit cell division [45]. TGFBR2 is the third-ranked driver in terms of Condorcet aggregate score, and it is altered in 18.7% of GBM patient samples. TGFBR2 is classified as a potential tumor suppressor (it shows decreased expression in GBM samples), and it is involved in many pathways to cancer including TGFβ signaling, HTLV infection, and Hippo signaling [46]. HIF1A is a hypoxia inducing factor that has been shown to be necessary for the survival of cancer stem cells in gliomas [47]. HIF1A is involved in mTOR pathway, which regulates nutrient sensing for cell growth, which could be responsible for cell growth and metastasis in cancer. HIF1A is ranked as the fourth most impactful driver in GBM, and it is mutated in 13.4% of GBM patient samples [48]. FOXO3 is

16 the eighth ranked gene. It is a common phosphorylation target of AKT and ERK, and is a key trigger for in the PIK3/AKT pathway [49].

In OV and BRCA, there were fewer candidate novel drivers than GBM, however, we found two genes that show strong potential to become novel drivers: PDPK1 in OV and CENPE in BRCA. PDPK1 is a phosphoinositide dependent protein kinase-1 that interacts with a crucial driver gene AKT1, and together their signaling plays a critical role in activating proliferation and survival pathways within cancer cells such as the PIK3/AKT pathway and the mTOR pathway

[50, 51]. PDPK1 is ranked second in OV and is mutated in 5.4% of OV samples. The mutation in

PDPK1 itself also suggests an important impact in the functionality of the protein binding. For example, in sample TCGA-13-0751, the mutation is in a Glycine to Arginine substitution near a hairpin loop in the middle of the protein (Supplementary Figure C6). The amino acid change from G to R changes the dynamic of the protein structure by replacing a non-polar amino acid with a positively charged amino acid, which may change the binding interaction with the negatively charged substrate phosphatidylinositol-3,4,5-trisphosphate, which in turn may affect the phosphorylation and activation of at least 24 kinases [52].

In BRCA, a potential novel driver we found is CENPE. CENPE is a centromere-associated protein that contributes to mechanics of microtubule- interactions with mitotic checkpoint. Overexpression of CENPE can lead to excessive cell growth and contribute to tumor proliferation [53]. CENPE is ranked 26-th in DawnRank. We observed that it is highly overexpressed in tumor samples. CENPE is interesting as it is ranked differently among different subtypes of breast cancer. All but one of the mutations in CENPE was in Luminal B or Basal subtype. The Basal subtype mutations had the highest average ranking in the 95.2 percentile and the average Luminal B subtype mutations ranked lower in the 90.9 percentile, whereas the

17

Luminal A mutation was much less significant with a ranking in the 78.3 percentile. This shows that even though CENPE has mutations in multiple subtypes, it may harbor different driver potentials in different subtypes.

3.3.2 Discovering Novel Drivers with Copy Number Changes

We then applied DawnRank to examine the role of CNVs in the three cancer types used in this study. We included genes that displayed at least a two-fold change in copy number change and also had a corresponding change in expression. We treated CNV events in a similar fashion as coding mutations to prioritize the genes with an amplification and deletion and then determined the most impactful CNVs. In addition to known driver genes with CNVs, we also identified novel driver CNVs not in CGC (Supplementary Figure C7).

In GBM, known CNV drivers (based on CGC) include the amplification of some well- known driver genes such as EGFR, PDGFRA, and MYC and the deletions of MLLT3 and ANK3.

A potential novel driver CNV is the amplification of SEC61G. SEC61G is a proto-oncogene that is required for tumor cell survival [54]. SEC61G is altered in 14.3% of GBM cases and it is ranked 17-th among CNVs. ELAVL2 is a potential tumor suppressor gene whose function is to stabilize and control nervous-specific binding [55]. ELAVL2 is altered in 6.7% of GBM cases and is ranked 2nd among CNVs in GBM. In OV, many known drivers are amplifications such as CCNE1 and KRAS and deletions such as MAF or PIK3R1. An example of a novel amplification is FZD5, a “frizzled” gene that has a strong co-expression event with the Wnt signaling pathway in ovarian cancer [56]. FZD5 is altered in 2.9% of OV cases and it is ranked

9th among CNVs in OV. In BRCA, known driver CNVs include amplifications in CCND1,

MYC, GATA3 and EGFR and deletions in MAF. An example of a novel amplification event in

BRCA is PAK1, which is a breast cancer oncogene that activates MAPK and MET signaling

18 pathways [57]. PAK1 is the 3rd ranked gene with CNVs, and it is altered in 2.4% of the BRCA patient samples.

3.3.3 Discovering Personalized Drivers

In this section, we discuss the personalized scope of DawnRank by demonstrating its ability to determine personalized novel and rare drivers. The main aspect that distinguishes DawnRank from existing programs is the ability to discover rare and even patient-specific drivers. Even if a gene is altered only in a single patient, DawnRank will still be able to evaluate the impact of that alteration. We determined such personalized drivers using the “maximally rank statistics” method. This method classifies alterations as “drivers” and “non-drivers” by assigning a cutoff that maximizes the number of known driver genes (genes in CGC) ranking higher than the cutoff). Altered genes in the individual patient with a ranking higher than the cutoff would be described as personalized drivers. The average cutoff for GBM, OV, and BRCA was 98.1, 92.8, and 92.2, respectively (measured in percentile rank). We specifically looked for genes that have not been classified as a driver in CGC and genes that have a significantly higher than expected rank in specific patients. A rank is considered significantly high in a patient if the rank is considered as an outlier under Chauvenet’s criterion for outlier detection.

In our case, a gene is considered to be a rare driver if the gene is both classified as a driver using the above criteria, and it is mutated in only a small number of patients (<2%). We selected genes that fit the above criteria to discover potential personalized drivers: genes that labeled as significant from the maximally-selected rank statistics and genes that ranked higher in specific patients according to the Chauvenet’s criterion. These selection criteria yielded in 26 potential personalized drivers in GBM (10 from mutations, 16 from CNV), 56 potential personalized

19 drivers in BRCA (20 from mutations, 36 from CNV), and 77 potential personalized drivers in

OV (26 from mutations, 51 from CNV) (Figure 5A and Supplementary Figures S8 and S9).

We found 26 potential personalized drivers in GBM. Of these genes, we found that several of the candidate driver genes were involved in important known cancer pathways. Using KEGG to map genes to pathways [46], mutations in three interferon receptor genes IFNA14, IFNW1, and IFNA17 belong to multiple pathways that have significant impact on cancer, including the

Cytokine-Cytokine receptor pathway and the JAK/STAT pathway. Cytokines are important factors in tumor cell control as they are key players in inflammatory and immune response [58].

The JAK/STAT is a related pathway activated by the binding of cytokines to its receptor. The pathway is involved in inflammation, proliferation, and invasion/migration [59]. In either case,

IFNA14, IFNW1, and IFNA17 are interferon receptor genes that may interact with genes such as

TP53 to induce apoptosis of cancer cells [60]. Both the Cytokine-Cytokine receptor pathway and the JAK/STAT pathway are common drug targets in GBM [58, 59], which could lead to the implications that IFNA14, IFNW1, and IFNA17 could be targeted. IFNA14, IFNW1, and IFNA17 are mutated in 4.7%, 5.6%, and 4.5% of GBM patients, respectively. IFNA14 is ranked significantly higher than its average expected ranking in patients TCGA-32-1978 and TCGA-19-

2619 while IFNW1 and IFNA17 both have high-ranking mutations in the patient TCGA-28-1749.

Although IFNA14, IFNW1, and IFNA17 are examples of personalized drivers, they are not personalized rare drivers. An example of a personalized rare driver is PITX2, which is altered in only 2 patients (0.4%). PITX2 is in the TGFβ signaling pathway, which can be targeted by cancer drugs such as decanoyl-RVKR chloromethylketone in GBM [61, 62].

We found 77 potential personalized driver genes in OV, and like GBM, many of these genes fall under similar KEGG pathways. Figure 5A shows the list of personalized drivers in

20

OV. The most well represented KEGG pathways are the calcium signaling pathway and the

MAPK signaling pathway. Calcium signaling is important in the regulation of cancer cell proliferation and/or apoptosis [63]. In OV, four alterations are present in the calcium signaling pathway: mutations in ADORA2B, CACNA1E, and ITPR3 along with a CNV for SLC8A1. Of these alterations, ADORA2B is often associated with cancer cell movement and metastasis, and it has been a drug target in other cancers [64]. The MAPK signaling pathway plays a role in communication with cell growth, and it can be targeted by cancer drugs such as MEK and RAF inhibitors [65, 66]. Genes mutated in the MAPK signaling pathway include CACNA1E, TRAF6, and DUSP16. The TRAF6 gene is especially interesting as it up-regulates HIF1A and has implications in tumor angiogenesis [67]. TRAF6 is another good example to demonstrate the power of DawnRank because it is mutated in only two patients, TCGA-13-1410 and TCGA-61-

1919, but DawnRank was still able to detect it as a rare driver. We looked into the TRAF6 mutation further, including its connectivity in the network and the potential functional impact of the mutation (Figure 5B and C). We found that the mutation in TCGA-13-1410 is considered to have an important functional impact by MutationAssessor, affecting the MATH domain [68].

The mutation is located at a loop defining one wall of the binding site and projects its long side- chain to directly contact the conserved aromatic residue in the peptide substrate, and it likely modulates its interaction with its receptors [69]. TRAF6 is also known to be involved in the NF-

κB pathway, which is related to poor outcomes in ovarian cancer [70].

We found 56 potential rare drivers in BRCA. Unlike in GBM and OV, the distribution of rare drivers in BRCA seemed to spread out among different pathway with no major cancer pathway having more than two candidate rare drivers. Nevertheless, several important cancer pathways are highlighted in BRCA. One of them is the KEGG “Proteoglycans in Cancer

21

Pathway”. Proteoglycans bind to ligands and receptors that regulate tumor neoplastic growth and angiogenesis [71]. Alterations in genes coding for proteoglycans include ANK3 and FRS2. Of these, FRS2 is especially interesting due to both its scarcity (only one amplification event in all

BRCA patients) and its function as a fibroblast growth factor whose amplification activates the

FGFR2 [72]. It is also highly differentially expressed in the patient, having differential expression 3.21 standard deviations above normal. The calcium signaling pathway, like in ovarian cancer, also has multiple potential drivers in breast cancer. These genes include

CACNA1B and GRIN2A. GRIN2A is a potential rare driver as it is mutated in 1.7% of BRCA patients but it has been known to be recurrent driver mutation in other cancers [73].

3.3.4 Discovering Personalized Drivers

We next looked at the distribution of the candidate rare drivers across breast cancer subtypes. We tested whether or not the rare drivers show different distribution among patients in the four major subtypes of breast cancer (Luminal A, Luminal B, Her2, and Basal-like). Although we used a 2% as a threshold to determine rare drivers so far, we further modeled the personalized drivers at low frequencies using cutoffs from 1% to 5% to better visualize the distribution of infrequent, personalized drivers in BRCA subtypes. As shown in Figure 6A, the distribution of the driver genes is not the same among subtypes. The most obvious contrast involves the Basal-like and

Luminal A comparison. Basal-like samples tend to have more very rare drivers occurring in less than 1% of the cases and have much small number of drivers occur over 1%. Luminal A samples, on the other hand, have fewer drivers occurring in less than 1% of the samples, but as the cutoff increases, more Luminal A drivers can be found. A Chi-squared goodness of fit test confirms that the distribution of Basal and Luminal A breast cancers have very different distributions (p=0.015). This may suggest a fundamental difference in the genetic heterogeneity

22 of Basal-like and Luminal A breast cancers. The rare driver distribution of Basal-like cancer is also different from the frequency distribution of known driver genes (Figure 6).

The distribution of known drivers and rare drivers may correspond to the genetic heterogeneity of breast cancer subtypes. Luminal A is an ER+ subtype with less heterogeneity and has been defined by its good prognosis. Patients with Luminal A subtype generally have the most favorable outcome and have a smaller chance of relapse when treated early [74, 75]. On the other hand, the Basal-like subtype, which mainly comprises of triple negative breast cancers, is difficult to treat due to tumor heterogeneity. Basal-like breast cancers are known to be more aggressive with poor prognosis [35, 76]. Based on our results, we found that Luminal A samples seem to have common drivers more than any other cancer type. The pattern in the Basal-like type is the opposite. There are more potential rare drivers in Basal-like samples than other subtype.

This suggests that Luminal A breast cancers are driven more by common drivers while Basal- like breast cancers are driven more by rare drivers.

Potential personalized rare driver alterations for Basal-like breast cancer include LGR5 and BCL2L14. LGR5 is a CNV that is amplified and overexpressed in the Basal-like patient sample TCGA-AN-A0AT. LGR5 is necessary for the stimulation of basal cell growth in breast tissue [77]. BCL2L14 is a key regulator of cell apoptosis at the mitochondria level [78], and it is altered in the single Basal-like patient TCGA-AN-A0FJ.

3.4 Implementation and Running time

DawnRank was implemented in R. Using a network of 1,492 nodes and 8,070 edges and a cohort of the BRCA samples we used, we compared the run time of DawnRank, DriverNet, and

PARADIGM-Shift on a computer with 16GB RAM and Intel i5-3317U processor. DriverNet was the fastest, with a runtime of 6 minutes. DawnRank completed the analysis in 9 minutes for

23 the main script and 12 minutes for the Condorcet rank aggregator. PARADIGM-Shift took approximately 15 hours to run.

24

CHAPTER 4

DISCUSSION

It is now acknowledged that individual tumors of the same type are highly heterogeneous and have diverse genomic alterations. Therefore, we urgently need novel methods to identify patient- specific and rare drivers from individual tumor samples in order to elucidate personalized molecular mechanisms in different types of cancer. The goal of DawnRank is to integrate mutation data, gene expression, and network information to discover drivers in a personalized manner. We applied DawnRank to a large number of TCGA datasets. By comparing to previous studies, our results demonstrated the effectiveness of DawnRank: (1) Despite its single-patient scope, DawnRank detects common and known drivers with as much or more precision than existing methods. (2) We can identify rare and novel genes that are potential drivers to specific patients. We believe this method will complement existing driver identification methods and will help us discover potential personalized drivers. The application to breast cancer subtypes further demonstrated that the rare drivers predicted by DawnRank may provide new insights into the molecular explanations of cancer subtypes with higher tumor heterogeneity.

One potential limitation of the current DawnRank method is that we still largely rely on known molecular interaction network. Such network information is still incomplete, which may create false negatives in DawnRank’s prediction. In addition, currently the overall network information we use in DawnRank is not cancer-type specific or patient-specific. Therefore, some of the network level interactions and perturbations specific to certain cancer subtypes or patient samples may be obscured by this approach. Another potential limitation is that DawnRank detects only potential drivers that alter the expression of other genes. However, this may not be the case for all drivers. In addition, some of the predictions from our method could potentially be

25 caused by passenger mutations that coincidentally change the expression of downstream genes.

In those cases, it may be useful to additionally utilize cohort and functional-impact based methods such as OncodriveFM. As the community continues to refine our understanding of the interaction networks and cancer driver genes, we expect that DawnRank’s ability to predict driver alterations would also improve.

Our method provides a unique solution to predict potential driver genes in cancer.

However, the computationally predicted personalized drivers should not be over-interpreted before additional experimental evidence becomes available. To fully validate such computational predictions, both in vitro and in vivo experimental validations would be needed to comprehensively assess the tumorigenic potentials of predicted drivers in individual patients.

Nevertheless, our method provides a reliable solution to prioritize such follow-up experimental validations. Taken together, our personalized approach is promising to discover potential causal generic variants that would be otherwise obscured by tumor heterogeneity. The personalized framework may help determine the optimal treatment strategy for each patient through individualized assessment based on the molecular signatures of their cancers.

26

CHAPTER 5

FIGURES

Gene Network

Gene Expression

DawnRank

▪ A ranking framework based on PageRank that considers the impact of genes in the network ▪ Impact includes connectivity and the amount downstream genes to be differentially expressed ▪ Dynamic damping factor is used to improve the original PageRank in ranking genes

Somatic Ranks of Alteration Data Genes (SNP, CNV, etc.)

Personalized Driver Alterations

Figure 1: Overview of the DawnRank Method

27

DawnRank and DriverNet Comparison (Precision)

GBM OV BRCA 1.00

0.75 Method 0.50 DawnRank DriverNet 0.25

0.00 Precision according to CGC 0 25 50 75 100 0 25 50 75 100 0 25 50 75 100 Top N genes DawnRank and DriverNet Comparison (Recall)

GBM OV BRCA 0.06

0.04 Method DawnRank DriverNet 0.02

0.00 Recall according to CGC 0 25 50 75 100 0 25 50 75 100 0 25 50 75 100 Top N genes DawnRank and DriverNet Comparison (F1 Score)

GBM OV BRCA 0.100

0.075 Method 0.050 DawnRank DriverNet 0.025

F1 according to CGC 0.000 0 25 50 75 100 0 25 50 75 100 0 25 50 75 100

Top N genes Figure 2: A comparison of the precision, recall, and F1-scores for the top ranking genes in

DawnRank and DriverNet. The X-axis represents the number of top ranking genes involved in the precision, recall, and F1 score calculation. The Y-axis represents the score of the given metric.

28

DawnRank, DriverNet, and PARADIGM−Shift comparison (Precision)

GBM OV BRCA 1.00

0.75 Method DawnRank 0.50 DriverNet PARADIGMShift 0.25

0.00 Precision according to CGC 0 25 50 75 100 0 25 50 75 100 0 25 50 75 100 Top N genes DawnRank, DriverNet, and PARADIGM−Shift comparison (Recall)

GBM OV BRCA 0.08

0.06 Method DawnRank 0.04 DriverNet PARADIGMShift 0.02

0.00 Recall according to CGC 0 25 50 75 100 0 25 50 75 100 0 25 50 75 100 Top N genes DawnRank, DriverNet, and PARADIGM−Shift comparison (F1 Score)

GBM OV BRCA

0.10 Method DawnRank DriverNet 0.05 PARADIGMShift

F1 according to CGC 0.00 0 25 50 75 100 0 25 50 75 100 0 25 50 75 100 Top N genes Figure 3: A comparison of the precision, recall, and F1 scores for the top ranking genes in

DawnRank, DriverNet, and PARADIGM-Shift on a small network (defined from the KEGG database). The X-axis represents the number of top ranking genes involved in the precision, recall, and F1 score calculation. The Y-axis represents the score of the given metric.

29

Figure 4 Method Comparison Using Patient Subsets (Precision)

GBM OV BRCA 0.6 ●

● ● ●

0.4 ● ● ● ● Method ● ● ● ● ● DawnRank ● ● ● ● ● ● ● DriverNet ● ● ● ● ● Precision 0.2 ● ● ● ● ● ●

● 0.0 10 20 50 100 All 10 20 50 100 All 10 20 50 100 All Size of Subset Method Comparison Using Patient Subsets (Recall)

GBM OV BRCA 0.04 ●

● 0.03 ● ●

● ● Method ● ● ● ● ● ● ● DawnRank ● 0.02 ● ● ● ● ● DriverNet Recall ● ● ● ● ● ● ● ● ● ● 0.01 ● ●

10 20 50 100 All 10 20 50 100 All 10 20 50 100 All Size of Subset Method Comparison Using Patient Subsets (F1 Score)

GBM OV BRCA ●

0.06 ● ● ●

● ● ● ● ● ● ● Method ● 0.04 ● ● ● ● ● ● DawnRank ● ● ● ● ● ● ● DriverNet ● ● ● ● F1 Score 0.02 ● ● ● 0.00 10 20 50 100 All 10 20 50 100 All 10 20 50 100 All Size of Subset Figure 4: Comparison results using subset of patient samples. The Figure Chows the precision, recall, and F1 scores of the DawnRank and DriverNet results in determining known drivers among their top 30 genes using a small subset of the patient samples (X-axis) rather than the entire cohort. The Y-axis represents the average precision of 10 runs using the subset. The error bars show the 1-standard deviation range of each data point.

30

The Pathway Interaction of TRAF6 in patient TCGA−13−1410 Figure 5

ZFAND5 TEP1 HACE1 ● DLG5 PDE4D CBLB SH3BGRL (A) (B) CLSTN2 SPTA1 CD40LG ●PDPK1 FBXO25 BEX1 UBE2N CARD11 GAB2 FBLN2 RIPK2 KLHL9 ● ● ● ● CD40 APBB2 IL1B ● ●MALT1BCL10 SLC8A1 ● IKBKG UBE2V2 HLA−DRB5 IL1A●MYD88 SQSTM1 ● ● PDE10A TOLLIP AKT1 IRAK1 ●IKBKB ●● BCAT1 ● CHUK MAP3K7 ITPR3 ● ● IL1R1 ● RELA NFKB1 ● RRAGC IRAK4 ● ● TRAF2 TXN DIRAS2 ● ● ● CACNA1E ● NGFR ●● ● ERC1 MAP3K5 PPL NFKBIA MAP3K14●MAP3K3 GLRX3 RSF1 ● ● TAS2R1 ● ● ● MAP2K3● MAP2K6 PDLIM5 TRAF6TRAF6 CPA6 SORT1 TP53 ● TRIM37 ● ● PRMT2 NR2C2 ROBO2 ZNF274 ● ● SPOCK3 ● PTPN4 BDNF● ● MATN1 CITED4 RAB3IP ● USP7 TRAF7 NRBF2 ● ● CTBP2 SLC1A6 IFNA10 KCNQ1 NAB1 VCP ASB13 ● ALDH3A2 ZRANB1 CNOT2 ● PYROXD1 CAND2 PSMC2 ● DRG1 PSMC3 PSMB5 UBE2M API5 ● ● PSMD13 BNIPL ● ● PSMC1 ● ARNTL2 PSMD3 CLDN11 ● ● PSMD6 EDARADD PSMD1 ● SKAP1 ● PSMD12PSMD7 SLC19A1 ● ● RIC8A TRAF6 ARHGAP22 DNAJB1 FBXL4 TAS2R7 MAN1A2 (C) ICAM4 ACTB ST8SIA4 DUSP16 SCARB2 MAT2B SAV1 GRB7 SLPI CD40 peptide ST8SIA3 ADORA2B GABBR2 RIMS1 OR4N4 GSTM2 PDPK1 CALCR TRPS1 FLG RNF144B TRAF6 TCGA TCGA TCGA TCGA TCGA TCGA TCGA TCGA TCGA TCGA TCGA TCGA TCGA TCGA TCGA TCGA TCGA TCGA TCGA TCGA TCGA TCGA TCGA TCGA TCGA TCGA TCGA TCGA TCGA TCGA TCGA TCGA TCGA TCGA TCGA TCGA TCGA TCGA TCGA TCGA TCGA TCGA TCGA TCGA TCGA TCGA TCGA TCGA TCGA TCGA TCGA TCGA TCGA TCGA TCGA TCGA TCGA TCGA TCGA TCGA TCGA TCGA TCGA TCGA TCGA TCGA TCGA TCGA TCGA TCGA TCGA TCGA TCGA TCGA − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − 13 61 13 61 61 57 36 29 29 25 24 13 13 10 04 04 61 61 59 57 57 36 31 30 29 29 29 25 25 25 24 24 24 23 23 23 13 13 13 09 09 61 61 36 36 30 30 29 29 29 25 25 25 24 24 24 24 24 20 13 13 13 13 04 04 61 36 36 31 24 24 24 23 04 − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − −

2066 1728 2057 2102 1727 1993 2529 1697 1692 1635 1434 1410 0791 0928 1646 1365 2110 1743 2349 1994 1586 2544 1959 1718 1778 1775 1707 2400 1625 1319 2298 2030 2027 2643 2081 1113 0913 0888 0792 0367 0364 2092 2018 1577 1570 1891 1859 2429 2428 2427 2393 2392 1870 2281 2260 1928 1616 1567 1686 2059 1405 0908 0751 1644 1638 2012 2547 1581 1955 2297 2262 1551 1114 1514 R466W

Figure 5: Personalized drivers in TCGA ovarian cancer samples. (A) The darker red/blue entries

(red for point mutations and blue for CNVs) indicate the personalized drivers that are significant and not documented in CGC. The lighter entries are mutations that are not considered as drivers by DawnRank in specific samples. The X-axis includes patient samples with a personalized driver that is significant and not documented in CGC. On the Y-axis, rare drivers (frequency

<2%) are in blue and non-rare drivers (frequency >=2%) are in purple. (B) Part of the gene interaction network where TRAF6 is involved. The size of the node scales according to the

DawnRank score in ovarian cancer patient sample TCGA-13-1410 and the color intensity scales with differential expression. Note that this Figure Chows the impact of TRAF6 to other genes in the network, but not all the edges where TRAF6 is not involved are shown due to space limitation

(for example, the large number of outgoing edges of TP53). (C) A close-up view of the protein

31 structure of TRAF6 indicates that an amino acid change from the mutation in ovarian cancer patient sample TCGA-13-1410 causes substitution of the Arginine (R) to Tryptophan (W). The substitution of the mutation occurs in a conserved binding site, and the substitution from a charged Arginine to a non-polar aromatic amino acid Tryptophan may suggest a change in the substrate binding with the CD40 peptide substrate.

(A) BRCA Subtype Comparison (B) BRCA Subtype Comparison (CGC genes) 0.15 Basal 0.6 Basal 0.10 0.4 0.05 0.2 0.00 0.0

0.15 Her2 0.6 Her2 0.10 0.4 0.05 0.2 0.00 0.0 LumA

0.15 LumA 0.6 0.10 0.4 0.05 0.2 0.00 0.0 LumB

0.15 LumB 0.6 0.10 0.4 Proportion of Patients Driven by Rare Driver Driven by Proportion of Patients 0.05 Rare Driver Driven by Proportion of Patients 0.2 0.00 0.0 0.01 0.02 0.03 0.04 0.05 0.01 0.02 0.03 0.04 0.05 Frequency Cutoff Frequency Cutoff

Figure 6: Distribution of rare drivers (A) and known drivers (B) across four breast cancer subtypes. The X-axis describes the frequency cutoffs (1% to 5%). Note that this cutoff is not cumulative. The Y-axis shows the proportion of patient samples that have rare drivers within that frequency cutoff.

32

CHAPTER 6

REFERENCES

1. Green ED, Guyer MS, National Human Genome Research I: Charting a course for genomic medicine from base pairs to bedside. Nature 2011, 470:204-213. 2. Stratton MR: Journeys into the genome of cancer cells. EMBO Mol Med 2013, 5:169-172. 3. Greenman C, Stephens P, Smith R, Dalgliesh GL, Hunter C, Bignell G, Davies H, Teague J, Butler A, Stevens C, et al: Patterns of somatic mutation in human cancer genomes. Nature 2007, 446:153-158. 4. Ding L, Getz G, Wheeler DA, Mardis ER, McLellan MD, Cibulskis K, Sougnez C, Greulich H, Muzny DM, Morgan MB, et al: Somatic mutations affect key pathways in lung adenocarcinoma. Nature 2008, 455:1069-1075. 5. Jones S, Zhang X, Parsons DW, Lin JC, Leary RJ, Angenendt P, Mankoo P, Carter H, Kamiyama H, Jimeno A, et al: Core signaling pathways in human pancreatic cancers revealed by global genomic analyses. Science 2008, 321:1801-1806. 6. Banerji S, Cibulskis K, Rangel-Escareno C, Brown KK, Carter SL, Frederick AM, Lawrence MS, Sivachenko AY, Sougnez C, Zou L, et al: Sequence analysis of mutations and translocations across breast cancer subtypes. Nature 2012, 486:405-409. 7. Dees ND, Zhang Q, Kandoth C, Wendl MC, Schierding W, Koboldt DC, Mooney TB, Callaway MB, Dooling D, Mardis ER, et al: MuSiC: identifying mutational significance in cancer genomes. Genome Res 2012, 22:1589-1598. 8. Carter H, Samayoa J, Hruban RH, Karchin R: Prioritization of driver mutations in pancreatic cancer using cancer-specific high-throughput annotation of somatic mutations (CHASM). Cancer Biol Ther 2010, 10:582-587. 9. Akavia UD, Litvin O, Kim J, Sanchez-Garcia F, Kotliar D, Causton HC, Pochanard P, Mozes E, Garraway LA, Pe'er D: An integrated approach to uncover drivers of cancer. Cell 2010, 143:1005-1017. 10. Ciriello G, Cerami E, Sander C, Schultz N: Mutual exclusivity analysis identifies oncogenic network modules. Genome Res 2012, 22:398-406. 11. Vandin F, Upfal E, Raphael BJ: De novo discovery of mutated driver pathways in cancer. Genome Res 2012, 22:375-385. 12. Ng S, Collisson EA, Sokolov A, Goldstein T, Gonzalez-Perez A, Lopez-Bigas N, Benz C, Haussler D, Stuart JM: PARADIGM-SHIFT predicts the function of mutations in multiple cancers using pathway impact analysis. Bioinformatics 2012, 28:i640-i646. 13. Bashashati A, Haffari G, Ding J, Ha G, Lui K, Rosner J, Huntsman DG, Caldas C, Aparicio SA, Shah SP: DriverNet: uncovering the impact of somatic driver mutations on transcriptional networks in cancer. Genome Biol 2012, 13:R124. 14. Paull EO, Carlin DE, Niepel M, Sorger PK, Haussler D, Stuart JM: Discovering causal pathways linking genomic events to transcriptional states using Tied Diffusion Through Interacting Events (TieDIE). Bioinformatics 2013. 15. Chen Y, Hao J, Jiang W, He T, Zhang X, Jiang T, Jiang R: Identifying potential cancer driver genes by genomic data integration. Sci Rep 2013, 3:3538. 16. Jia P, Zhao Z: VarWalker: personalized mutation network analysis of putative cancer genes from next-generation sequencing data. PLoS Comput Biol 2014, 10:e1003460.

33

17. Pe'er D, Hacohen N: Principles and strategies for developing network models in cancer. Cell 2011, 144:864-873. 18. Ding L, Wendl MC, Koboldt DC, Mardis ER: Analysis of next-generation genomic data in cancer: accomplishments and challenges. Hum Mol Genet 2010, 19:R188-196. 19. Reimand J, Bader GD: Systematic analysis of somatic mutations in phosphorylation signaling predicts novel cancer drivers. Mol Syst Biol 2013, 9:637. 20. Page L, Brin S, Motwani R, Winograd T: The PageRank citation ranking: bringing order to the web. 1999. 21. Brin S, Page L: The anatomy of a large-scale hypertextual Web search engine. Computer networks and ISDN systems 1998, 30:107-117. 22. Khurana E, Fu Y, Colonna V, Mu XJ, Kang HM, Lappalainen T, Sboner A, Lochovsky L, Chen J, Harmanci A, et al: Integrative annotation of variants from 1092 humans: application to cancer genomics. Science 2013, 342:1235587. 23. D'Antonio M, Ciccarelli FD: Integrated analysis of recurrent properties of cancer genes to identify novel drivers. Genome Biol 2013, 14:R52. 24. Futreal PA, Coin L, Marshall M, Down T, Hubbard T, Wooster R, Rahman N, Stratton MR: A census of human cancer genes. Nat Rev Cancer 2004, 4:177-183. 25. Morrison JL, Breitling R, Higham DJ, Gilbert DR: GeneRank: using search engine technology for the analysis of microarray experiments. BMC bioinformatics 2005, 6:233. 26. Tarca AL, Draghici S, Khatri P, Hassan SS, Mittal P, Kim JS, Kim CJ, Kusanovic JP, Romero R: A novel signaling pathway impact analysis. Bioinformatics 2009, 25:75-82. 27. Winter C, Kristiansen G, Kersting S, Roy J, Aust D, Knosel T, Rummele P, Jahnke B, Hentrich V, Ruckert F, et al: Google goes cancer: improving outcome prediction for cancer patients by network-based ranking of marker genes. PLoS Comput Biol 2012, 8:e1002511. 28. Liu Y, Gu Q, Hou JP, Han J, Ma J: A network-assisted co-clustering algorithm to discover cancer subtypes based on gene expression. BMC Bioinformatics 2014, 15:37. 29. Vandin F, Upfal E, Raphael BJ: Algorithms for detecting significantly mutated pathways in cancer. Journal of computational biology : a journal of computational molecular cell biology 2011, 18:507-522. 30. Wang X, Tao T, Sun J-T, Shakery A, Zhai C: Dirichletrank: Solving the zero-one gap problem of pagerank. ACM Transactions on Information Systems (TOIS) 2008, 26:10. 31. Vinh NX, Chetty M, Coppel R, Wangikar PP: Issues impacting genetic network reverse engineering algorithm validation using small networks. Biochim Biophys Acta 2012, 1824:1434-1441. 32. Pihur V, Datta S, Datta S: Finding common genes in multiple cancer types through meta- analysis of microarray experiments: a rank aggregation approach. Genomics 2008, 92:400- 403. 33. Cancer Genome Atlas Research N: Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature 2008, 455:1061-1068. 34. Integrated genomic analyses of ovarian carcinoma. Nature 2011, 474:609-615. 35. Cancer Genome Atlas N: Comprehensive molecular portraits of human breast tumours. Nature 2012, 490:61-70. 36. Croft D, O'Kelly G, Wu G, Haw R, Gillespie M, Matthews L, Caudy M, Garapati P, Gopinath G, Jassal B, et al: Reactome: a database of reactions, pathways and biological processes. Nucleic Acids Res 2011, 39:D691-697.

34

37. Schaefer CF, Anthony K, Krupa S, Buchoff J, Day M, Hannay T, Buetow KH: PID: the Pathway Interaction Database. Nucleic Acids Res 2009, 37:D674-679. 38. Kanehisa M, Goto S, Sato Y, Furumichi M, Tanabe M: KEGG for integration and interpretation of large-scale molecular data sets. Nucleic Acids Res 2012, 40:D109-114. 39. Wu G, Feng X, Stein L: A human functional protein interaction network and its application to cancer data analysis. Genome Biol 2010, 11:R53. 40. Becskei A, Serrano L: Engineering stability in gene networks by autoregulation. Nature 2000, 405:590-593. 41. Tamborero D, Gonzalez-Perez A, Perez-Llamas C, Deu-Pons J, Kandoth C, Reimand J, Lawrence MS, Getz G, Bader GD, Ding L, Lopez-Bigas N: Comprehensive identification of mutational cancer driver genes across 12 tumor types. Sci Rep 2013, 3:2650. 42. Lawrence MS, Stojanov P, Mermel CH, Robinson JT, Garraway LA, Golub TR, Meyerson M, Gabriel SB, Lander ES, Getz G: Discovery and saturation analysis of cancer genes across 21 tumour types. Nature 2014, 505:495-501. 43. Gonzalez-Perez A, Lopez-Bigas N: Functional impact bias reveals cancer drivers. Nucleic Acids Res 2012, 40:e169. 44. Douville C, Carter H, Kim R, Niknafs N, Diekhans M, Stenson PD, Cooper DN, Ryan M, Karchin R: CRAVAT: cancer-related analysis of variants toolkit. Bioinformatics 2013, 29:647-648. 45. Ye XZ, Xu SL, Xin YH, Yu SC, Ping YF, Chen L, Xiao HL, Wang B, Yi L, Wang QL, et al: Tumor-associated microglia/macrophages enhance the invasion of glioma stem-like cells via TGF-beta1 signaling pathway. J Immunol 2012, 189:444-453. 46. Kanehisa M: The KEGG database. Novartis Found Symp 2002, 247:91-101; discussion 101- 103, 119-128, 244-152. 47. Wang Y, Liu Y, Malek SN, Zheng P, Liu Y: Targeting HIF1alpha eliminates cancer stem cells in hematological malignancies. Cell Stem Cell 2011, 8:399-411. 48. Hsieh AC, Liu Y, Edlind MP, Ingolia NT, Janes MR, Sher A, Shi EY, Stumpf CR, Christensen C, Bonham MJ, et al: The translational landscape of mTOR signalling steers cancer initiation and metastasis. Nature 2012, 485:55-61. 49. Sunayama J, Sato A, Matsuda K, Tachibana K, Watanabe E, Seino S, Suzuki K, Narita Y, Shibui S, Sakurada K, et al: FoxO3a functions as a key integrator of cellular signals that control glioblastoma stem-like cell differentiation and tumorigenicity. Stem Cells 2011, 29:1327-1337. 50. Meuillet EJ, Zuohe S, Lemos R, Ihle N, Kingston J, Watkins R, Moses SA, Zhang S, Du- Cuny L, Herbst R, et al: Molecular pharmacology and antitumor activity of PHT-427, a novel Akt/phosphatidylinositide-dependent protein kinase 1 pleckstrin homology domain inhibitor. Mol Cancer Ther 2010, 9:706-717. 51. Ericson K, Gan C, Cheong I, Rago C, Samuels Y, Velculescu VE, Kinzler KW, Huso DL, Vogelstein B, Papadopoulos N: Genetic inactivation of AKT1, AKT2, and PDPK1 in human colorectal cancer cells clarifies their roles in tumor growth regulation. Proc Natl Acad Sci U S A 2010, 107:2598-2603. 52. Komander D, Fairservice A, Deak M, Kular GS, Prescott AR, Peter Downes C, Safrany ST, Alessi DR, van Aalten DM: Structural insights into the regulation of PDK1 by phosphoinositides and inositol phosphates. EMBO J 2004, 23:3918-3928. 53. Wood KW, Chua P, Sutton D, Jackson JR: Centromere-associated protein E: a motor that puts the brakes on the mitotic checkpoint. Clin Cancer Res 2008, 14:7588-7592.

35

54. Lu Z, Zhou L, Killela P, Rasheed AB, Di C, Poe WE, McLendon RE, Bigner DD, Nicchitta C, Yan H: Glioblastoma proto-oncogene SEC61gamma is required for tumor cell survival and response to endoplasmic reticulum stress. Cancer Res 2009, 69:9105-9111. 55. Nord H, Hartmann C, Andersson R, Menzel U, Pfeifer S, Piotrowski A, Bogdan A, Kloc W, Sandgren J, Olofsson T, et al: Characterization of novel and complex genomic aberrations in glioblastoma using a 32K BAC array. Neuro Oncol 2009, 11:803-818. 56. Yoshioka S, King ML, Ran S, Okuda H, MacLean JA, 2nd, McAsey ME, Sugino N, Brard L, Watabe K, Hayashi K: WNT7A regulates tumor growth and progression in ovarian cancer through the WNT/beta-catenin pathway. Mol Cancer Res 2012, 10:469-482. 57. Shrestha Y, Schafer EJ, Boehm JS, Thomas SR, He F, Du J, Wang S, Barretina J, Weir BA, Zhao JJ, et al: PAK1 is a breast cancer oncogene that coordinately activates MAPK and MET signaling. Oncogene 2012, 31:3397-3408. 58. Dranoff G: Cytokines in cancer pathogenesis and cancer therapy. Nat Rev Cancer 2004, 4:11-22. 59. McFarland BC, Ma JY, Langford CP, Gillespie GY, Yu H, Zheng Y, Nozell SE, Huszar D, Benveniste EN: Therapeutic potential of AZD1480 for the treatment of human glioblastoma. Mol Cancer Ther 2011, 10:2384-2393. 60. Takaoka A, Hayakawa S, Yanai H, Stoiber D, Negishi H, Kikuchi H, Sasaki S, Imai K, Shibue T, Honda K, Taniguchi T: Integration of interferon-alpha/beta signalling to p53 responses in tumour suppression and antiviral defence. Nature 2003, 424:516-523. 61. Smith AL, Robin TP, Ford HL: Molecular pathways: targeting the TGF-beta pathway for cancer therapy. Clin Cancer Res 2012, 18:4514-4521. 62. Basque J, Martel M, Leduc R, Cantin AM: Lysosomotropic drugs inhibit maturation of transforming growth factor-beta. Can J Physiol Pharmacol 2008, 86:606-612. 63. Swami M: Signalling: the calcium connection. Nat Rev Cancer 2010, 10:738. 64. Desmet CJ, Gallenne T, Prieur A, Reyal F, Visser NL, Wittner BS, Smit MA, Geiger TR, Laoukili J, Iskit S, et al: Identification of a pharmacologically tractable Fra-1/ADORA2B axis promoting breast cancer metastasis. Proc Natl Acad Sci U S A 2013, 110:5139-5144. 65. Boldt S, Weidle UH, Kolch W: The role of MAPK pathways in the action of chemotherapeutic drugs. Carcinogenesis 2002, 23:1831-1838. 66. Santarpia L, Lippman SM, El-Naggar AK: Targeting the MAPK-RAS-RAF signaling pathway in cancer therapy. Expert Opin Ther Targets 2012, 16:103-119. 67. Sun H, Li XB, Meng Y, Fan L, Li M, Fang J: TRAF6 upregulates expression of HIF-1alpha and promotes tumor angiogenesis. Cancer Res 2013, 73:4950-4959. 68. Reva B, Antipin Y, Sander C: Predicting the functional impact of protein mutations: application to cancer genomics. Nucleic Acids Res 2011, 39:e118. 69. Ye H, Arron JR, Lamothe B, Cirilli M, Kobayashi T, Shevde NK, Segal D, Dzivenu OK, Vologodskaia M, Yim M, et al: Distinct molecular mechanism for initiating TRAF6 signalling. Nature 2002, 418:443-447. 70. Annunziata CM, Stavnes HT, Kleinberg L, Berner A, Hernandez LF, Birrer MJ, Steinberg SM, Davidson B, Kohn EC: Nuclear factor kappaB transcription factors are coexpressed and convey a poor outcome in ovarian cancer. Cancer 2010, 116:3276-3284. 71. Iozzo RV, Sanderson RD: Proteoglycans in cancer biology, tumour microenvironment and angiogenesis. J Cell Mol Med 2011, 15:1013-1031.

36

72. Zhang K, Chu K, Wu X, Gao H, Wang J, Yuan YC, Loera S, Ho K, Wang Y, Chow W, et al: Amplification of FRS2 and activation of FGFR/FRS2 signaling pathway in high-grade liposarcoma. Cancer Res 2013, 73:1298-1307. 73. Wei X, Walia V, Lin JC, Teer JK, Prickett TD, Gartner J, Davis S, Program NCS, Stemke- Hale K, Davies MA, et al: Exome sequencing identifies GRIN2A as frequently mutated in melanoma. Nat Genet 2011, 43:442-446. 74. Guedj M, Marisa L, de Reynies A, Orsetti B, Schiappa R, Bibeau F, MacGrogan G, Lerebours F, Finetti P, Longy M, et al: A refined molecular taxonomy of breast cancer. Oncogene 2012, 31:1196-1206. 75. Neve RM, Chin K, Fridlyand J, Yeh J, Baehner FL, Fevr T, Clark L, Bayani N, Coppe JP, Tong F, et al: A collection of breast cancer cell lines for the study of functionally distinct cancer subtypes. Cancer Cell 2006, 10:515-527. 76. Cheang MC, Voduc D, Bajdik C, Leung S, McKinney S, Chia SK, Perou CM, Nielsen TO: Basal-like breast cancer defined by five biomarkers has superior prognostic value than triple-negative phenotype. Clin Cancer Res 2008, 14:1368-1376. 77. Plaks V, Brenot A, Lawson DA, Linnemann JR, Van Kappel EC, Wong KC, de Sauvage F, Klein OD, Werb Z: Lgr5-expressing cells are sufficient and necessary for postnatal mammary gland organogenesis. Cell Rep 2013, 3:70-78. 78. Liu X, Pan Z, Zhang L, Sun Q, Wan J, Tian C, Xing G, Yang J, Liu X, Jiang J, He F: JAB1 accelerates mitochondrial apoptosis by interaction with proapoptotic BclGs. Cell Signal 2008, 20:230-240.

37

APPENDIX A

LIST OF ABBREVIATIONS

NGS – Next-generation Sequencing CGC – Cancer Gene Census TCGA – The Cancer Genome Atlas GBM – Glioblastoma multiforme OV – Ovarian Cancer BRCA – Breast Cancer

38

APPENDIX B

SUPPLEMENTAL INFORMATION

B.1 Dynamic Damping Factor vs. Static Damping Factor

We compared DawnRank using the dynamic damping factor (see Methods section) to a

PageRank-like algorithm with a static damping factor to determine the impact of our method over traditional PageRank-like methods. Precision, recall, and F1 score were based on the top N genes where N ranges from 1 to 100. The PageRank-like static damping factor used the

PageRank-suggested value 0.85, whereas the DawnRank dynamic damping factor used a free parameter trained � of 3 (see Methods in the main text). We evaluated the precision and recall scores to compare the quality of the static damping factor and the dynamic damping factor.

In evaluating both damping factors, the dynamic damping factor has a higher precision, recall, and F1 score than the static damping factor in all three cancers. See Supplementary Figure

C1. In OV and BRCA, many known drivers such as NF1, BRCA2, RB1, and APC in OV and

PIK3CA, MAP2K4, and PIK3R1 in BRCA were ranked in the top 10 using the dynamic damping factor, and they were not accounted for using the static damping factor. In GBM, the dynamic damping factor performs slightly worse regarding the top 10 genes as several driver genes such as TP53 and EGFR are ranked in the top 15 rather than the top 10.

Overall, the result shows that the dynamic damping factor is helpful in determining known driver mutations more precisely than the static damping factor.

An illustration of the zero-one gap problem is shown in Supplementary Figure C2. In

Supplementary Figure C2, there is a network with six nodes linking to a center node. Each of the outer nodes linking to the center node has the same ranking. In the PageRank-like method with a static damping factor, the rank of a node depends on the rank of outgoing nodes, and therefore

39 genes that have more outgoing edges would have a higher score. However, when static damping factor is used, adding outgoing edges from the central node provides a counterintuitive result, not increasing the rank of the central node while also decreasing the ranks of the outer nodes. This is due to the difference in the damping factor between a node with no incoming edge and a node with one incoming edge is very large (a difference of a damping factor 0 and 0.85). This tends to lead to unstable rankings. The DawnRank dynamic damping factor correctly stabilizes the rank of the outer nodes while augmenting the rank of the central node. The change in rankings demonstrates the importance of using a dynamic damping factor. As shown earlier, we also compared the results on using dynamic damping factor and static damping factor based on

TCGA data, and demonstrated that the dynamic damping factor can help identify more drivers than the static damping factor (Supplementary Figure C1).

B.2 DawnRank Proof of Convergence

The DawnRank formula is shown in Equation 1 below:

�� � = (1 − �)� + � , 1 ≤ � ≤ � (1) ���

∗ To prove that the convergence time is small, let us define � to be the true ranking of any gene j.

Therefore the true rank of any gene must satisfy the DawnRank formula in (1) exactly.

∗ ∗ �� � = (1 − �)� + � , 1 ≤ � ≤ � (2) ���

To determine the convergence, we calculate the error of DawnRank at each time point t compared to the true result. The error rate of gene j will be defined as the sum of the absolute difference between each ranking of gene j:

40

∗ ��� � = � − � (3)

∗ To calculate the error, we calculate � − � :

∗ ∗ �� �� � − � = � − � (4) ��� ���

∗ Therefore, by way of the triangle inequality � − � :

∗ ∗ �|� − � | � − � ≤ � (5) ���

And the error rate will become the summation for all j.

∗ ∗ �|� − � | ��� � = � − � ≤ � (6) ���

This becomes

(� � ) |� − �∗| ��� � ≤ (7) ���

Since ��� = � and 0 ≤ � ≤ 1 for all instances of j:

(��) ≤ 1 (8) (�)

We have

(� � ) � − �∗ ��� � ≤ ≤ ���(� − 1) (9) (�)

41

Unless the damping factor for each j is equal to 1, the error will decrease at a compound rate.

Therefore given enough iterations, the ��� � will eventually approach 0. In our damping factor:

� = ��� (��� + �) , where � = 3, the damping factor will always be less than 1 for all genes. Therefore, for our iteration of DawnRank, the algorithm will converge at a compound rate.

B.3 Parameter Training

We trained two parameters in our model, the µ free parameter used to calculate the damping factor � = ��� (��� + �), and the � parameter used in the Condorcet rank-aggregation in order to account for missing data in certain pairwise comparisons. We performed the parameter training over 100 random patient samples using the small network of the KEGG pathways of

1,492 genes. This network was used to avoid using the same data to train the model and to run the analysis. This model was run over a 10-fold cross validation to determine the most reliable results.

We calculated µ by running the DawnRank algorithm using the data and network described in the main text over various empirically chosen µ. We selected the µ that presented the highest average rank for common driver genes from CGC. Our results show that the µ parameter has a peak at µ=3 where the average percentile rank of known driver genes is in the

72.7 percentile. Known drivers are driver genes in CGC. In Supplementary Figure C4, we show that µ=3 parameter contains the highest average rank of common driver genes with downward trends as µ both increases and decreases, so we set µ=3. DawnRank scores are based on both differential expression and a network connectivity, whose weight is determined by the damping factor. A high µ value lowers the impact of the network in DawnRank, and a low µ puts too much emphasis on the network and not enough on the differential expression.

42

We then applied the Condorcet rank aggregation to our DawnRank trial runs using the

µ=3. Since the � is a value from 0 and 1, we ran the Condorcet rank aggregation for all values of

� from 0 and 1 with an increment of 0.05. We selected the parameter that maximized the precision with respect to genes in CGC for the top 10 mutated genes within the randomly selected patient samples. This process was run 10 times, and the precision results for the 10 runs are shown in the Supplementary Figure C5 below. We can see from the figure that the precision with respect to CGC is maximized at � = 0.85.

B.4 Comparison with Summary Statistics

In addition to comparing DawnRank results with well-established and known methods, we also compared DawnRank rankings to simple metrics. We compared DawnRank’s rankings to the rankings based on the results: (1) if we only used connectivity of the genes; (2) if we only used differential expression of the genes; and (3) if we only used the maximum downstream differential expression, i.e., ranking genes by the maximum differential expression among all outgoing genes. We used the same precision-recall evaluation metrics. The results showed that none of the three metrics performs well by themselves, suggesting that it is necessary for

DawnRank to use a combination of connectivity and differential expression to calculate its rank

(Supplementary Figure C12).

B.4.1 Comparison against tumor quality and survival rate

We examined potential clinical factors that may show relationships with the DawnRank output.

We compared the DawnRank scores with tumor quality and survival rate. With tumor quality, we found no significant correlation with DawnRank scores. For each patient in the three cancer types, we determined the Pearson correlation of the tumor nuclei percentage with both the average rank of driver mutations and the variance of the ranks of the driver mutations. This

43 would test whether or not the tumor nuclei percentage was correlated with either the rank of driver mutations, indicating that more tumor quality leads to more significant driver, or the variance of the driver mutations, indicating that tumor quality raises the variation among drivers.

The resulting correlation, however, was low (each of the correlations for driver means and variances with tumor nuclei percentages was between 0.01 and -0.01) with the tumor quality having little relationship with the mean or variance of drivers. One potential reason behind this lack of correlation is the fact that TCGA has already required tumors should have at least 60% tumor nuclei for data from NGS platforms (and 80% for previous tumor samples).

We then examined the relationship between the DawnRank scores and the survival rate.

We took advantage of the inter-tumor heterogeneity of the tumors, and the resulting DawnRank scores were used as variables to build a consensus hierarchical co-clustering model using Wards linkage to separate the patients into potential subtypes. We compared our results to the gene- expression based GBM subtypes: Classical, Proneural, Neural, and Mesenchymal [1]. We chose to analyze GBM because it has both gene expression based subtypes and lower and more variable survival rates [2]. The results are shown in Supplementary Figure C13. Using a log-rank test, we found that the DawnRank hierarchical clustering separated the subtypes by survival time even more accurately than that of conventional gene-expression subtypes (p=0.0044 and p=0.131).

B.4.2 Comparison against mutation rate

One factor that may determine the number of driver mutations predicted by DawnRank could be the mutation rate itself. On average, there are 4.24 drivers per patient in GBM, 5.89 drivers in

BRCA, and 6.05 drivers in OV, within the range of the 2-8 mutations expected in solid tumors

[3]. The total number of alterations (mutation plus copy number) in each of the three cancers is

44

7,876, 10,950, and 17,189, respectively. Therefore, even though OV has higher mutation rate, the number of drivers only increases a little, suggesting the robustness of our method. To examine the DawnRank score in a more extreme case, we applied our method to the lung cancer data in

TCGA, including 22,789 mutations across 152 samples. We found that the average number of drivers in lung cancer patients is 7.18. Therefore, even with the large number of mutations in lung cancer, DawnRank still predicts the reasonable number of drivers, further suggesting the robustness of the method. We also plotted the number of predicted drivers against the total number of mutations for individual patients in Supplementary Figure C14. We observed that as the number of mutations increases, the number of predicted drivers in an individual sample increases at a much lower rate until it reaches a plateau, further suggesting the robustness of our method.

45

APPENDIX C

SUPPLEMENTAL FIGURES

Static and Dynamic Comparison (Precision)

GBM OV BRCA 1.00

0.75 Method 0.50 Dynamic Static 0.25

0.00 Precision according to CGC 0 25 50 75 100 0 25 50 75 100 0 25 50 75 100 Top N genes Static and Dynamic Comparison (Recall)

GBM OV BRCA 0.06

0.04 Method Dynamic Static 0.02

0.00 Recall according to CGC 0 25 50 75 100 0 25 50 75 100 0 25 50 75 100 Top N genes Static and Dynamic Comparison (F1 Score)

GBM OV BRCA 0.100

0.075 Method 0.050 Dynamic Static 0.025

F1 according to CGC 0.000 0 25 50 75 100 0 25 50 75 100 0 25 50 75 100 Top N genes

Figure C1: A comparison of the precision, recall, and F1 score for the top ranked genes when using the static damping factor (in original PageRank) and the dynamic damping factor (in

DawnRank). The X-axis represents the number of top ranked genes involved in the precision, recall and F1 score calculation, and the Y-axis represents the score of the given metric.

46

(A) (B) (C)

Figure C2: A toy example showing the effect of using dynamic damping factor. (A) Original network with one central node and six outer nodes. (B) When static damping factor is used, after adding the outgoing edges (in red) from the central node, the ranking of the central node does not change and the ranking of the outer nodes decreases. (C) When dynamic damping factor is used

(as in DawnRank), after adding the outgoing edges from the central node, the ranking of the central node increases (as expected) and the ranking of the outer nodes does not change (as expected).

47

Relation between Static and Dynamic Damping Factors

0.75

0.50 Method Dynamic Damping Static Damping

0.25 Damping Factor

0.00 0 5 10 15 20 Number of Inlinks

Figure C3: The static versus the dynamic damping factor change over the number of inlinks (i.e. incoming edges).

Free Parameter Training for Mu

● 72 ●

● ●

● ● 70 ●

● ● ● Average Percentile Rank Percentile Average ● 68 ● ● ● ● ● ● ● ● ● ●

0 5 10 15 20 25 Proposed Mu

Figure C4: The training results for various µ parameters when looking at the average rank of common driver genes. The µ that provides the highest average rank is 3.

48

Free Parameter Training for the Condorceet Vote Aggregation

● ●

● ● ● ● 0.55 ● ● ● ●

● ● ● ●

0.50 ●

0.45 ● Average Precision for top 10 Genes Precision for Average

● 0.40 ●

0.25 0.50 0.75 1.00 Proposed Delta

Figure C5 The training results for various � parameters when looking at the average precision of the top 10 rank-aggregated genes with respect to CGC. The � is maximized at 0.85 with a precision of 0.59.

49

G468R

Ins(1,3,4,5)P4

β2 β1

Figure C6: A close-up view of the protein structure of PDPK1 indicates that an amino acid change from the mutation causes substitution of the Glycine (G) to Arginine (R) in TCGA

Ovarian Cancer samples TCGA-13-0751 (DawnRank score 98.14 percentile). The substitution is in a loop between 2 beta strands, and occurs close to the binding site for the substrate

Ins(1,3,4,5)P4, indicating a potential interaction of the positively charged R-group of Arginine and a phosphate group.

50

GBM Genes by CNV Log Ratio PDGFRA

4 EGFR SEC61G MYC CDK6 KIT AHR EPHB1 DCN PITX2 2 KDR

0 CAMK1D HCN1 ODZ2 ADRBK2 CDH13 ODZ3 NEFL CNV Log Ratio CBLN2 SYT1 FGFR2 SMARCA2 ABR STXBP6 ANK3 ADRA2C SH3GL2 MLLT3 TEK FGF13 −2 ELAVL2 NPY ACTN2

−4

OV Genes by CNV Log Ratio

4 KRAS ODZ4 UGT2B15 NFE2L3 EPHB1 ACTR3 CCNE1 CDH2 BMP7 FGF1 FZD5 CBLB 2 MYC

0 ERBB4 ADRBK2 NR2F2 DCN PIK3R1 DMD MMP26 FBXO15 ABR CNV Log Ratio CBLN2 C3 MAF CDH13 SMARCA2 MEF2C NBEA AR GALR1 −2 DNAH9 FYN

−4

BRCA Genes by CNV Log Ratio

4 LEF1 FGFR1 ODZ4 MYC EGFR ESR1 VEGFA ANAPC1 GATA3 CCND1 NFE2L3 LRRC1 PAK1 MYB GLI3 PREX1 CDH2 ATAD2 GRB2 IGF1R ERBB4 E2F7 E2F3

2 APP

0 UGT2B15 ADRBK2 TSLP MAF FZD5 CDH13 CNV Log Ratio ANGPTL4 MMRN1

−2 CXCR7

−4

Figure C7: A visualization of the top driver CNVs in BRCA, GBM, and OV. The X-axis

represents the top genes (in order of rank), and the Y-axis represents the copy number changes.

The labels of known drivers are shown in blue.

51

TPM2 SEPT14 CNOT2 NUAK2 CSTF2T PLK2 H2AFX NCOA3 CCDC59 CPSF3 ARHGEF5 FAM123B NPHS1 AGTR2 PPIL1 TBCC ATR MARS NME7 IFNA17 IFNA14 IFNW1 DCN DGKG PITX2 TCGA−06−0129 TCGA−32−1978 TCGA−06−0173 TCGA−28−1749 TCGA−19−2619 TCGA−06−2569 TCGA−06−0195 TCGA−32−1987 TCGA−28−1756 TCGA−14−1825 TCGA−06−0145 TCGA−02−0104 TCGA−02−0085 TCGA−02−0025 TCGA−87−5896 TCGA−76−6662 TCGA−74−6584 TCGA−28−1746 TCGA−27−1837 TCGA−14−0740 TCGA−12−0826 TCGA−12−0818 TCGA−02−0326

Figure C8: Personalized drivers in TCGA GBM samples. The darker red/blue entries (red for

point mutations and blue for CNVs) indicate the personalized drivers that are significant and not

documented in CGC. The lighter entries are mutations that are not considered as drivers by

DawnRank in specific samples. The X-axis includes patient samples with a personalized driver

that is significant and not documented in CGC. On the Y-axis, rare drivers (frequency < 2%) are

in blue and non-rare drivers (frequency >= 2%) are in purple.

52

PJA1 TRRAP SCN8A CAD ARHGAP28 CD40 CYLC2 CACNA1B OR1J1 WNK1 HIST1H1E PIGG HCN1 ARSA IGSF5 CEP63 AATF GRIN2A HACE1 ADAM21 CCAR1 RIMS1 SLC4A4 RSF1 GBE1 PLEKHA1 COPS2 CHN2 STX16 FARP1 BCL2L14 LGR5 OR52A1 BTC ROBO1 BRCC3 OLIG3 CDC42BPB KIAA0146 GLIS1 GJB6 HSD17B4 CSNK1G1 ACLY ALX3 POU4F1 CNOT2 FRS2 TRPS1 PREX1 DNAH9 ODZ4 UNC5D ANK3 VPS13B CDH2 TCGA−BH−A0BW TCGA−E2−A10C TCGA−A1−A0SH TCGA−BH−A0BP TCGA−AR−A0TW TCGA−AN−A0 TCGA−AN−A0AR TCGA−A8−A06R TCGA−A2−A0T0 TCGA−A1−A0SK TCGA−A1−A0SJ TCGA−BH−A0HQ TCGA−B6−A0I8 TCGA−AR−A0TX TCGA− TCGA−AN−A0XV TCGA−AN−A0G0 TCGA−AN−A0FJ TCGA−AN−A0AJ TCGA−A2−A0SU TCGA−A2−A04W TCGA−A2−A04U TCGA−E2−A14O TCGA−E2−A108 TCGA−BH−A0H9 TCGA−BH−A0B8 TCGA−BH−A0B7 TCGA−B6−A0RM TCGA−B6−A0IN TCGA−AR−A0U1 TCGA−AR−A0TT TCGA− TCGA− TCGA− TCGA− TCGA− TCGA− TCGA−AN−A0FY TCGA−AN−A0FX TCGA−A8−A0AD TCGA−A8−A09N TCGA−A8−A08G TCGA−A8−A083 TCGA−A8−A06O TCGA−A2−A0YJ TCGA−A2−A0CL A A A A A A A O−A0JI O−A12G O−A0JG O−A0J2 O−A03V O−A03P O−A03M A T

Figure C9: Personalized drivers in TCGA BRCA samples. The darker red/blue entries (red for

point mutations and blue for CNVs) indicate the personalized drivers that are significant and not

documented in CGC. The lighter entries are mutations that are not considered as drivers by

DawnRank in specific samples. The X-axis includes patient samples with a personalized driver

that is significant and not documented in CGC. On the Y-axis, rare drivers (frequency < 2%) are

in blue and non-rare drivers (frequency >= 2%) are in purple.

53

DawnRank, CHASM, Oncodrive−FM Comparison (Precision)

GBM OV BRCA 1.00

0.75 Method DawnRank 0.50 CHASM OncodriveFM 0.25

0.00 Precision according to CGC 0 25 50 75 100 0 25 50 75 100 0 25 50 75 100 Top N genes DawnRank, CHASM, Oncodrive−FM Comparison (Recall)

GBM OV BRCA 0.06

Method 0.04 DawnRank CHASM 0.02 OncodriveFM

0.00 Recall according to CGC 0 25 50 75 100 0 25 50 75 100 0 25 50 75 100 Top N genes

DawnRank, CHASM, Oncodrive−FM Comparison (F1 Score)

GBM OV BRCA 0.100

0.075 Method DawnRank 0.050 CHASM OncodriveFM 0.025

F1 according to CGC 0.000 0 25 50 75 100 0 25 50 75 100 0 25 50 75 100 Top N genes

Figure C10: A comparison of the precision, recall, and F1-scores for the top ranking genes in

DawnRank, CHASM and OncodriveFM. The X-axis represents the number of top ranking genes involved in the precision, recall, and F1 score calculation. The Y-axis represents the score of the given metric.

54

DawnRank, DriverNet, and PARADIGM−Shift comparison (Precision)

GBM OV BRCA 1.00

0.75 Method DawnRank 0.50 DriverNet PARADIGMShift 0.25

0.00 Precision according to CGC 0 25 50 75 100 0 25 50 75 100 0 25 50 75 100 Top N genes DawnRank, DriverNet, and PARADIGM−Shift comparison (Recall)

GBM OV BRCA

0.10 Method DawnRank DriverNet 0.05 PARADIGMShift

0.00 Recall according to CGC 0 25 50 75 100 0 25 50 75 100 0 25 50 75 100 Top N genes DawnRank, DriverNet, and PARADIGM−Shift comparison (F1 Score)

GBM OV BRCA 0.20

0.15 Method DawnRank 0.10 DriverNet PARADIGMShift 0.05

F1 according to CGC 0.00 0 25 50 75 100 0 25 50 75 100 0 25 50 75 100 Top N genes

Figure C11: A comparison of the precision, recall, and F1-scores for the top ranking genes in

DawnRank, DriverNet, and PARADIGM-Shift using the Pan-Cancer predicted drivers based on

[4]. The X-axis represents the number of top ranking genes involved in the precision, recall, and

F1 score calculation. The Y-axis represents the score of the given metric.

55

Comparison against summary statistics (Precision)

GBM OV BRCA 1.00

0.75 Method DawnRank 0.50 Connectivity MaxExp 0.25 ExpressionOnly

0.00 Precision according to CGC 0 25 50 75 100 0 25 50 75 100 0 25 50 75 100 Top N genes Comparison against summary statistics (Recall)

GBM OV BRCA 0.06 Method 0.04 DawnRank Connectivity MaxExp 0.02 ExpressionOnly

0.00 Recall according to CGC 0 25 50 75 100 0 25 50 75 100 0 25 50 75 100 Top N genes Comparison against summary statistics (F1 Score)

GBM OV BRCA 0.100

0.075 Method DawnRank 0.050 Connectivity MaxExp 0.025 ExpressionOnly

F1 according to CGC 0.000 0 25 50 75 100 0 25 50 75 100 0 25 50 75 100 Top N genes

Figure C12: A comparison of the precision, recall, and F1-scores for the top ranking genes in

DawnRank compared to three summary statistics: differential expression only, connectivity, and maximum downstream differential expression (MaxExp). The X-axis represents the number of top ranking genes involved in the precision, recall, and F1 score calculation. The Y-axis represents the score of the given metric.

56

DawnRank−Based Survival Clustering DawnRankNCIS−−BasedBased Survival Survival Clustering Clustering GBMNCIS Expression−Based Survival−Based Clustering Survival Clustering GBM Expression−Based Survival Clustering 1.0 1.0 1.0 1.0 1.0 1.0

Classical ClassicalClassical ClassicalClassical Classical Proneural ProneuralProneural ProneuralProneural Proneural 0.8 0.8 0.8 Mesenchymal 0.8 MesenchymalMesenchymal 0.8 MesenchymalMesenchymal 0.8 Mesenchymal Neural NeuralNeural NeuralNeural Neural 0.6 0.6 0.6 0.6 0.6 0.6 Survival Survival Survival Survival Survival Survival 0.4 0.4 0.4 0.4 0.4 0.4 0.2 0.2 0.2 0.2 0.2 0.2 0.0 0.0 0.0 0.0 0.0 0.0

0 500 1000 1500 2000 0 0 500 500 10001000 15001500 20002000 0 0 500 500 10001000 15001500 20002000 0 500 1000 1500 2000 Time TimeTime TimeTime Time

Figure C13: A Kaplan-Meier survival plot showing the survival rates of four clusters derived

from using GBM DawnRank scores (left), as compared to expression based clustering (right).

The X-axis represents the days until death and the Y-axis represents the proportion of living

patients.

57

DawnRank drivers vs. total number of mutations

15

10

5 Number of significant drivers

0

0 25 50 75 100 125 Total number of mutations

Figure C14: A plot illustrating the trend between the total number of mutations (X-axis) and the number of significant DawnRank drivers (Y-axis). The intensity of each point represents the number of patients at the coordinate. The line in blue is the LOESS curve.

58