Characterization of Genome Rearrangements from Tumour Sequencing Data

Andrew William McPherson

B.A.Sc. (Hons.), Simon Fraser University, 2002

Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

in the
School of Computing Science
Faculty of Applied Science

© Andrew William McPherson 2015
SIMON FRASER UNIVERSITY
Summer 2015

All rights reserved. However, in accordance with the Copyright Act of Canada, this work may be reproduced without authorization under the conditions for “Fair Dealing.” Therefore, limited reproduction of this work for the purposes of private study, research, criticism, review and news reporting is likely to be in accordance with the law, particularly if cited appropriately.

Approval

Name: Andrew William McPherson
Degree: Doctor of Philosophy (Computing Science)
Title: Characterization of Genome Rearrangements from Tumour Sequencing Data
Examining Committee: Chair: Dr. Andrei Bulatov, Professor

Dr. S. Cenk Sahinalp Senior Supervisor Professor

Dr. Sohrab P. Shah Co-Supervisor Adjunct Professor

Dr. Cedric Chauve Supervisor Professor Department of Mathematics

Dr. Ryan Morin Internal Examiner Assistant Professor Department of Molecular Biology and Biochemistry

Dr. David Haussler External Examiner Professor Center for Biomolecular Science University of California, Santa Cruz

Date Defended: 27 July 2015

Abstract

Genome rearrangements are important mutational events in many cancers, and their detection and characterization have the potential to improve treatment options for cancer patients. Evidence of genome rearrangement is available in the sequence of affected DNA and RNA molecules of tumour cells. The development of high-throughput sequencing has drastically increased the efficiency with which researchers can sequence DNA and RNA molecules, though the new technologies have resulted in an increased computational burden, requiring solutions to novel algorithmic problems. In this thesis we describe novel algorithms for detection and characterization of genome rearrangements, with specific focus on rearrangements that reshape tumour genomes and impact cancer biology. We describe a method for detecting gene fusions from RNA sequence data (RNA-Seq). Given both RNA-Seq and Whole Genome Sequence (WGS) data, we describe an integrated method for detection of expressed rearrangements, and subsequently extend this method to account for complex genomic rearrangements. Finally, we describe a method for detecting rearrangements existing in subpopulations of tumour cells, and determining their impact on the content of the genome in those subpopulations. The described methods each formulate a maximum parsimony or likelihood optimization problem, and propose combinatorial algorithms to solve these problems. A common theme of the described methods is the benefit of integrating multiple and diverse data types. We demonstrate using simulated and real data that principled methods for joint analysis of multiple data types frequently outperform independent analyses of each data type. We apply our methods to the detection and characterization of rearrangements in tumour samples, and provide novel examples of events relevant to the biology of each tumour.

Keywords: cancer; genome rearrangements; genome sequencing; RNA-Seq; combinatorial algorithms

Dedication

To my three favorite people. Jocelyn, Avery, Devon.

Acknowledgements

This research was supported in part by the CIHR bioinformatics training program and an NSERC Alexander Graham Bell Canada Graduate Scholarship. Thank you to Dr. Cedric Chauve, who introduced me to computational biology with an enjoyable summer project, and convinced me to apply for the bioinformatics training program. Thank you to Dr. Sohrab Shah; I consider myself very lucky that Sohrab agreed to invite me to work in the CTAG lab, despite his many other commitments. His projects were the most interesting I had encountered thus far, and continued to inspire me throughout my PhD. My sincerest gratitude to Dr. Cenk Sahinalp for his guidance, dedication, and for continually challenging my assumptions. Cenk’s intellectual rigour is both an example and a positive influence on my work. A special thank you to my co-authors and collaborators Andrew Roth, Alex Wyatt, Salem Malikic, Nilgun Donmez, Gavin Ha, Lucas Swanson, Iman Hajirasouliha, and Fereydoun Hormozdiari. Thank you to David Huntsman, Sam Aparicio and Alexandre Bouchard-Côté for their additional supervision. My work would not have been possible without the support staff of the many labs with which I collaborated. Thank you to those who supported my work in the Shah Lab, the CTAG lab, the Vancouver Prostate Centre, and the Aparicio lab. Finally, this thesis could not have been written without the dedicated support of my family, my wife Jocelyn, my parents, and Jocelyn’s parents.

Table of Contents

Approval

Abstract

Dedication

Acknowledgements

Table of Contents

List of Tables

List of Figures

1 Introduction
  1.1 The Biology of Genome Rearrangements
    1.1.1 Oncogenic Genomic Rearrangements and Gene Fusions
    1.1.2 Structural and Numerical Chromosome Instability
    1.1.3 Chromoplexy, Chromothripsis and Complex Genomic Rearrangements
    1.1.4 Heterogeneity and Evolution
  1.2 Technologies for Characterization of Genome Rearrangements
    1.2.1 Non-sequencing Methods for Discovery and Detection
    1.2.2 Genome Re-Sequencing Technologies
    1.2.3 Detection of Genome Rearrangements
  1.3 Representation of Genome Rearrangements
    1.3.1 Types of Rearrangements
    1.3.2 Read Pair Alignment Signatures
    1.3.3 Graph Representations of Rearranged Genomes
  1.4 Contribution
  1.5 Organization of the thesis

2 deFuse: an algorithm for gene fusion discovery in tumor RNA-Seq data
  2.1 Introduction
    2.1.1 Previous work on gene fusion detection from RNA-Seq
    2.1.2 The deFuse Method
  2.2 Method
    2.2.1 Ethics Statement
    2.2.2 Data sets
    2.2.3 The deFuse algorithm
    2.2.4 Alignment parameters used for this study
    2.2.5 Annotation of each prediction
    2.2.6 Classification of predictions as real fusions or false positives
    2.2.7 Implementation, availability and data resources
  2.3 Results
    2.3.1 Application to ovarian and sarcoma datasets
    2.3.2 deFuse has higher sensitivity and specificity than competing methods
    2.3.3 Rediscovery of known gene fusions
    2.3.4 Fusion boundaries coincident with interrupted expression show dominant expression of the fused gene
    2.3.5 Evidence of previously described rearrangements in sarcoma and ovarian carcinoma data
  2.4 Discussion
    2.4.1 Limitations
    2.4.2 Conclusion

3 Comrad: Detection of expressed rearrangements by integrated analysis of RNA-Seq and low coverage genome sequence data
  3.1 Introduction
  3.2 Approach
  3.3 Methods
    3.3.1 Identifying potential rearrangement breakpoints and fusion splices
    3.3.2 Corroborating rearrangement breakpoints and fusion splices
    3.3.3 Selecting the most parsimonious set of alignments for ambiguously aligning reads
    3.3.4 Modifying the breakpoint overlap function
    3.3.5 Assembling a prediction sequence
    3.3.6 Heuristic filtering
  3.4 Results
    3.4.1 Accurate discovery of gene fusions
    3.4.2 MIPOL1-DGKB and MRPS10-HPR are caused by reciprocal exchanges in C4-2 and LNCaP
    3.4.3 Genomic rearrangements create fusion transcripts with non-canonical splicing
  3.5 Discussion

4 nFuse: Discovery of complex genomic rearrangements in cancer using high-throughput sequencing
  4.1 Introduction
  4.2 Methods
    4.2.1 Complex rearrangement discovery using breakpoint graphs
  4.3 Results
    4.3.1 HCC1954 breast cancer cell line
    4.3.2 Simulated Dataset
    4.3.3 Primary prostate tumour 963
  4.4 Discussion

5 ReMixT: Joint inference of genome structure and content in heterogeneous tumor samples
  5.1 Introduction
  5.2 Problem Definition
    5.2.1 Mixtures of Genome Graphs
    5.2.2 Modeling Read Counts
    5.2.3 Maximum Posterior Genome Mixtures
  5.3 Method
    5.3.1 Method Overview
    5.3.2 Expectation Maximization Method for Learning h
    5.3.3 Combinatorial Method for Inferring G
  5.4 Results
    5.4.1 Simulating rearranged genomes
    5.4.2 Benchmarking learning haploid depth using simulated data
    5.4.3 Benchmarking structure and content prediction using simulated data
    5.4.4 Comparison with Existing Copy Number Inference Methods
  5.5 Discussion

6 Applications
  6.1 deFuse
  6.2 Comrad and nFuse

7 Conclusion
  7.1 Possible Improvements
  7.2 Future Directions

Bibliography

Appendix A Supplementary Material for deFuse: an algorithm for gene fusion discovery in tumor RNA-Seq data
  A.1 Glossary
  A.2 Supplementary Results
    A.2.1 A classifier for gene fusions predictions
  A.3 Supplementary Computational Methods
    A.3.1 Conditions for discordant alignments to have originated from reads spanning the same fusion
    A.3.2 Generating Maximal Valid Clusters
    A.3.3 Split read boundary sequence prediction
    A.3.4 Dynamic programming matrix definition
    A.3.5 Covariance between the lengths of fragments spanning a fusion boundary
    A.3.6 Covariance between split read statistics for reads split by a fusion boundary
    A.3.7 Features
    A.3.8 Filtering
    A.3.9 Probabilistic motivation for clustering conditions
    A.3.10 FusionSeq predictions
    A.3.11 MapSplice predictions
    A.3.12 Running deFuse on melanoma RNA-Seq datasets

Appendix B Supplementary Material for nFuse: Discovery of complex genomic rearrangements in cancer using high-throughput sequencing
  B.1 nFuse pipeline overview
    B.1.1 Partial alignments of WGSS reads
    B.1.2 Discordant read clustering
    B.1.3 Corroboration between fusion transcripts and breakpoints
    B.1.4 Maximum parsimony formulation for resolving multi-map reads
    B.1.5 Post-processing
  B.2 Calculating breakpoint probability
    B.2.1 Calculating P(cj = 0 | csj)
    B.2.2 Calculating P(dj = 1 | asj, cj = 0)
    B.2.3 Calculating P(gij = 1 | nmj, dj = 1)
    B.2.4 Calculating P(bi | ni)
    B.2.5 Calculating P(bi | ·)
  B.3 Shortest alternating path algorithm
  B.4 Path search parameter βp

Appendix C Supplementary Material for ReMixT: Joint inference of genome structure and content in heterogeneous tumor samples
  C.1 Overview
  C.2 Expectation Maximization Method for Learning h
  C.3 Estimating the overdispersion parameter r
  C.4 Independence of Segment Read Counts
  C.5 Parameters used for existing methods
    C.5.1 Theta2.0
    C.5.2 Titan
    C.5.3 CloneHD

List of Tables

Table 2.1 RT-PCR validated novel deFuse predictions
Table 2.2 Summary of RNA-Seq statistics and fusion predictions
Table 2.3 Fusions predictions compared between deFuse and FusionSeq
Table 2.4 Comparison of accuracy metrics for FusionSeq and deFuse
Table 2.5 deFuse predictions for existing datasets with known fusions

Table 3.1 Known and novel fusions predicted by Comrad

Table 4.1 Sequencing statistics for HCC1954 and 963
Table 4.2 Summary of putative CCBRs discovered in HCC1954
Table 4.3 Summary of putative complex breakpoints discovered in HCC1954
Table 4.4 Statistics for CGR breakpoints detected by nFuse
Table 4.5 Complex breakpoints identified in a simulated dataset
Table 4.6 CCBRs identified in a simulated dataset
Table 4.7 Summary of putative CCBRs discovered in 963
Table 4.8 Summary of putative complex breakpoints discovered in 963

List of Figures

Figure 1.1 Representing read pair alignments
Figure 1.2 Read pair signatures and graph representation by rearrangement type

Figure 2.1 The deFuse gene fusion discovery method
Figure 2.2 Conditions for considering two paired end reads to have originated from the same fusion transcript
Figure 2.3 Searching for candidate split reads
Figure 2.4 deFuse ROC Curve
Figure 2.5 Variable importance for deFuse classifier
Figure 2.6 Evidence for the FRYL-SH2D1A fusion
Figure 2.7 Fusions in sarcoma samples

Figure 3.1 Corroborating rearrangement breakpoints and fusion splices
Figure 3.2 Rearrangement support graph
Figure 3.3 Evidence for MIPOL1-DGKB as a reciprocal insertion
Figure 3.4 Gene fusion CCDC43-YBX2 produces fusion transcripts with non-canonical splicing

Figure 4.1 Breakpoint graph representations of poly-fusions
Figure 4.2 Breakpoint Graph Representations of Closed Chains of Breakage and Rejoining
Figure 4.3 Performance of nFuse breakpoint prediction on breakpoints previously discovered in HCC1954
Figure 4.4 Complex breakpoints and poly-fusions in HCC1954
Figure 4.5 CGRs discovered in primary tumour sample 963

Figure 5.1 ReMixT Problem Overview
Figure 5.2 An Example Genome Graph
Figure 5.3 Observed and Expected Segment Read Counts
Figure 5.4 Haplotype Allele Read Counts
Figure 5.5 Genome Graph Modifications
Figure 5.6 The Genome Modification Graph
Figure 5.7 An Example Genome Graph Modification
Figure 5.8 Simulating realistic rearranged genomes
Figure 5.9 Benchmarking the Learning Algorithm
Figure 5.10 Benchmarking Genomegraph Inference vs. Naive Approaches
Figure 5.11 Performance of the genomegraph algorithm compared TITAN, CloneHD, and Theta2.0
Figure 5.12 Performance of the genomegraph algorithm compared TITAN, CloneHD, and Theta2.0 using a dataset with limited amplified regions

Chapter 1

Introduction

Recent innovations in molecular biology have resulted in the development of new technologies for high-throughput sequencing (HTS) of DNA molecules [13, 100, 95]. With these new technologies in hand, researchers are now equipped to interrogate genetic variation at unprecedented scale and resolution. Since the development of HTS, researchers have used it to quantify the extent of human genetic variation [29, 30], and to discover inherited disease-associated genetic variants [119, 17] and somatically acquired variants contributing to the development of cancer [168, 142]. As HTS technologies become progressively cheaper, researchers are planning larger scale studies that involve sequencing many thousands of samples [71, 60].

Three properties of HTS contribute to its novelty and utility. First, the sequencing technology itself is largely unbiased. Thus no prior knowledge of a sequence is required to interrogate the DNA sequence content of a sample, allowing for the discovery of novel and unexpected sequences. Second, the sequencing technology produces nucleotide-level sequence information. Third, the technology preserves the inherent digital nature of the sequenced molecules. A count of the number of measurements of a given sequence corresponds one-to-one with the number of molecules sequenced, and as such the dynamic range is effectively unlimited. The availability of a single low-cost assay with all three of these properties has been a significant driver of molecular biology research.

The benefits of HTS make it well suited to studying tumour genomes with the aim of understanding the genetic variants that contribute to the development of cancer. Cancer is a disease primarily caused by somatic mutation of genomic DNA (although specific inherited mutations also result in a predisposition to cancer development). Somatic mutations deregulate or modify genes, and provide tumour cells with the capacity for growth and metastatic dissemination.
Using HTS, researchers have produced comprehensive catalogs of somatic mutations in specific cancer types, providing insights into the mechanisms that produce somatic mutations [126, 14, 3]. Identification of common somatic mutations using HTS has allowed biologists to classify and stratify tumours, and has provided the potential for more specific and effective treatment [168, 142]. Finally, identification of rare but functionally important mutations is one step towards personalized cancer treatment [87].

1.1 The Biology of Genome Rearrangements

1.1.1 Oncogenic Genomic Rearrangements and Gene Fusions

The first functional somatic mutation discovered in cancer was a translocation between chromosomes 9 and 22 identified in chronic granulocytic leukemia (CML) by Nowell and Hungerford in 1960 [115]. Since this discovery, the importance of genomic rearrangements to cancer biology has been well established. As of 2007, rearrangements affecting 337 genes had been identified in benign and malignant neoplastic disorders [108]. Oncogenic rearrangements can be classified by their biological effects into four distinct groups: translocations that create novel oncogenic fusion genes from two distinct wild type genes, translocations of oncogenes to more favourable genomic loci, amplifications of oncogenes, and deletions of tumour suppressor genes.

The first class of rearrangement is exemplified by the 9-22 translocation in CML. Named the Philadelphia chromosome after the city in which it was discovered, the rearrangement fuses the wild type BCR and ABL1 genes to create a BCR-ABL fusion gene with oncogenic properties [89]. Subsequent experiments showed that BCR-ABL is sufficient for induction of CML-like disease in mice [34], providing further evidence that BCR-ABL is the primary initiating mutational event in CML and other leukemias.

The second class of rearrangement results in deregulated expression of a wild type gene with oncogenic potential. For instance, wild type Myc is a transcription factor controlling expression of many genes, including genes involved in cell proliferation. Transcription of Myc is tightly controlled in healthy cells, but in Burkitt’s lymphoma, translocation of Myc to the immunoglobulin heavy chain locus on human chromosome 14 results in persistent expression and progression towards neoplasia [35, 43]. In prostate cancer, gene fusions between TMPRSS2 and ETS transcription factors were recently identified as a highly recurrent feature of the disease [158]. Though the exact function of the ETS prostate fusions is still under investigation, a recent study provided evidence that the TMPRSS2-ERG fusion protein disrupts androgen receptor signaling, potentially initiating dedifferentiation [174].

The third class of rearrangements, amplification of oncogenes, also results in increased expression of a wild type gene with oncogenic potential. Increased gene copies result in increased transcription, higher concentrations of the protein product, and initiation and maintenance of the cancer phenotype. Focal and high level amplification of the ERBB2 gene occurs in 20-30% of breast cancers [109]. Amplification results in increased ERBB2 expression, and overexpression has been shown to be sufficient for malignant transformation [24]. Although previously an indicator of poor prognosis [148], ERBB2 positive breast cancers are now treatable with trastuzumab [24].

Finally, rearrangements that delete tumour suppressor genes have the effect of disabling the cellular mechanisms that control cell division and maintain the integrity of the genome. Many tumour suppressor genes are haplosufficient, and thus two disabling mutations (somatic or germline) are required to disable both alleles and inhibit the function of the gene [75]. For instance, retinoblastoma results from bi-allelic inactivation of the RB1 gene. In many cases, one ‘hit’ in the two-hit inactivation of RB1 is a deletion of a region of chromosome 13 encompassing RB1 [27]. Mutation of TP53 and coincident deletion of chromosome 17p is another important example seen in multiple cancers [8].

1.1.2 Structural and Numerical Chromosome Instability

In addition to the causative role of specific rearrangements, patterns of genomic rearrangements can also be the result of cancer-specific cellular malfunction. Dysregulation of DNA repair mechanisms inhibits the cell’s natural ability to respond to endogenous and environmental DNA damage. DNA breaks will thus either go unrepaired, or will be repaired by a more error-prone, but still functional, DNA repair mechanism. The result is structural instability, the accumulation of broken and rearranged chromosomes.

Structural instability often results in segregation errors and associated numerical instability [22]. During mitosis, the kinetochores of healthy cells attach to the centromeres of replicated chromosomes, and work to physically segregate sister chromatids into separate daughter cells. In structurally altered genomes, chromosomes without centromeres will be segregated arbitrarily to daughter cell nuclei, while chromosomes with multiple centromeres may be broken as they are pulled in opposite directions. In both cases, errors during segregation result in numerical changes to chromosomal content. Additionally, telomeres act to protect the ends of chromosomes, and the ‘free’ ends of broken chromosomes lack this protection. During replication, the free ends of sister chromatids have the potential to ligate together, forming a single dicentric chromosome. Subsequent kinetochore activity will then pull each centromere towards opposite daughter cells, breaking the aberrant chromosome in a new location. Each daughter cell will then acquire a broken chromosome, and the process will repeat. The resulting cycle has been termed breakage-fusion-bridge [99].

Many of the genes implicated in genomic instability in cancer, such as TP53 and BRCA1/2, are not directly targetable with chemotherapeutics. Instead, many therapies promote further DNA damage beyond that which is tolerable for the tumour cells.
Cross-linking agents such as cisplatin cause DNA damage by cross-linking DNA strands [136, 36]. Tumour cells with non-functioning DNA repair mechanisms will accumulate a critical mass of DNA lesions, resulting in problems during mitosis and activation of apoptosis. Additionally, recently developed treatments have leveraged synthetic lethality to provide treatment options for genomically unstable cancers [44]. By inhibiting the remaining functional repair mechanisms, these drugs promote increased instability resulting in the persistence of unrepaired DNA lesions, the accumulation of which causes cell cycle arrest and apoptosis. Included in this class of drugs are the very recently developed PARP inhibitors, the first of which was approved for use in Europe in 2014. PARP mediates base excision repair, and its inhibition leaves single stranded DNA breaks unrepaired; during replication these are converted into double stranded breaks that are normally repaired by homologous recombination (HR). Tumour cells with loss of HR will accumulate an unsustainable number of DNA lesions, resulting in cell cycle arrest, chromosome instability, and cell death [98].
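The breakage-fusion-bridge cycle described earlier in this section can be illustrated with a small simulation. The following is a minimal sketch under simplifying assumptions, not a model used elsewhere in this thesis: a chromosome arm is a list of segment labels ordered from the centromere to the free end, the bridge breaks at a uniformly random position, and the function name `bfb_cycle` is hypothetical.

```python
import random

def bfb_cycle(chromosome, rng):
    """One breakage-fusion-bridge cycle on a chromosome arm (illustrative only).

    `chromosome` lists segment labels from the centromere to a broken,
    telomere-free end. Returns the arms inherited by the two daughter cells.
    """
    # Fusion: the free ends of the replicated sister chromatids ligate,
    # giving a dicentric chromosome that reads as a palindrome.
    dicentric = chromosome + chromosome[::-1]
    # Bridge and breakage: the two centromeres are pulled apart and the
    # chromosome breaks somewhere between them (assumed uniform here).
    cut = rng.randint(1, len(dicentric) - 1)
    # Each daughter receives one fragment, written centromere first,
    # and each fragment again ends in an unprotected free end.
    daughter_a = dicentric[:cut]
    daughter_b = dicentric[cut:][::-1]
    return daughter_a, daughter_b

rng = random.Random(0)
arm = list("ABCD")
for _ in range(3):
    arm, other = bfb_cycle(arm, rng)
    print("".join(arm), "|", "".join(other))
```

Repeated cycles in this sketch tend to produce inverted duplications near the break in one daughter and terminal losses in the other, which is the pattern of copy number change usually attributed to breakage-fusion-bridge.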

1.1.3 Chromoplexy, Chromothripsis and Complex Genomic Rearrangements

The result of genome instability is progressive acquisition of rearrangements over multiple cycles of cell division. By contrast, two new classes of complex rearrangement, Chromoplexy and Chromothripsis, result in the simultaneous acquisition of multiple rearrangements in a single event. Chromothripsis affects whole chromosome arms, and is characterized by deletion of megabase scale segments with the remaining segments connected by a chaotic pattern of rearrangement. Such a pattern of rearrangements and copy changes suggests a single catastrophic event in which whole chromosome arms are shattered and then reassembled incorrectly [153]. Proposed mechanistic causes of Chromothripsis include DNA damage caused by ionizing radiation or telomere dysfunction [153], or fragmentation of chromatin during aborted apoptosis [159].

Another mechanism that simultaneously results in multiple rearrangements is Chromoplexy. Unlike Chromothripsis, Chromoplexy usually results in multiple balanced rearrangements, and few if any changes to the number of copies of neighboring genomic regions [14]. In fact, Chromoplexy can be considered a natural extension of balanced reciprocal translocation: not just two but several genomic loci are broken, and the broken ends are permuted and then rejoined. The cause of Chromoplexy has not been experimentally validated. The predominant hypothesis is that multiple, potentially distantly located genes co-regulated by the same transcription factor (Androgen Receptor in prostate cancer) are recruited to the same transcriptional hub. Once co-localized, multiple simultaneous breaks during transcription followed by aberrant repair produce the observed pattern of complex balanced rearrangement [7].
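The view of Chromoplexy as an extension of reciprocal translocation can be made concrete with a toy example. The sketch below assumes one break per chromosome and a cyclic permutation of the broken ends; the function `chromoplexy` and the segment labels are hypothetical and serve only to illustrate that such an event is balanced.

```python
from collections import Counter

def chromoplexy(chromosomes, cuts):
    """Break each chromosome once and rejoin the broken ends in a closed chain.

    `chromosomes` maps a name to a list of segment labels; `cuts` maps the
    same names to the index at which that chromosome is broken.
    """
    names = list(chromosomes)
    heads = {n: chromosomes[n][:cuts[n]] for n in names}
    tails = {n: chromosomes[n][cuts[n]:] for n in names}
    derived = {}
    for i, n in enumerate(names):
        # Cyclic permutation of broken ends: each head is joined to the
        # tail of the next chromosome in the chain.
        partner = names[(i + 1) % len(names)]
        derived[n] = heads[n] + tails[partner]
    return derived

normal = {"chr1": ["a1", "a2"], "chr2": ["b1", "b2"], "chr3": ["c1", "c2"]}
rearranged = chromoplexy(normal, {"chr1": 1, "chr2": 1, "chr3": 1})

def segment_counts(genome):
    return Counter(s for chrom in genome.values() for s in chrom)

# Three novel adjacencies are created, yet no segment is gained or lost.
print(rearranged)
print(segment_counts(rearranged) == segment_counts(normal))
```

With two chromosomes the same function reduces to a balanced reciprocal translocation, which is why the chain of three or more breaks is a natural generalization.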

1.1.4 Heterogeneity and Evolution

Ongoing somatic mutation in cancer produces lineages of genomically divergent tumour cell populations. Thus a single tumour mass may be composed of heterogeneous populations of tumour cells related by descent to a single ancestral tumour cell [115]. Mutations acquired along each lineage produce the complement of mutations, or genotype, of individual tumour cells. Distinct genotypes will differ in their ability to survive and proliferate within an environment that includes pressure from other tumour cells, the immune system and therapies administered to the patient. Tumour cells with higher fitness may out-compete other tumour and normal cells, resulting in the dominance of a specific genotype [23].

Tumour heterogeneity has important implications for diagnosis and treatment. Heterogeneity within an individual tumour limits the utility of diagnosis from a single biopsy [9, 50]. Furthermore, a tumour with a diversity of genotypes may be more likely to survive sudden environmental changes, such as those associated with initiation of treatment [32, 23]. In fact, studies of relapse patients have shown evidence for the relapse clone as a minor population present at the time of diagnosis [40, 92], suggesting treatment may select for resistant clones. Perhaps related to these problems, genomic heterogeneity has been associated with poor outcomes, and has been proposed as a prognostic marker [111].

A complete understanding of the development and progression of a tumour requires a study of the evolutionary history of tumour cell populations. In human cancers it is typically infeasible to sample intermediate tumour genotypes pre-diagnosis, thus evolutionary histories must be computationally predicted using phylogenetic approaches. As genomic technologies have improved, researchers have been able to more accurately predict evolutionary histories from multiple primary, metastatic and recurrence samples taken after diagnosis of a cancer [40]. Initial efforts focused mainly on the more tractable problem of identifying SNVs in exonic regions [144, 9, 49, 176, 49, 51, 26], identifying heterogeneity and branched evolution in many tumour types. More recent efforts involving whole genome sequencing have been able to build evolutionary histories from more comprehensive sets of mutations including rearrangements. Researchers have identified ancestral versus late acquired rearrangements, and uncovered patterns of lineage specific rearrangement classes [64, 38]. Recent studies have used rearrangements and other mutations as clonal markers, helping to uncover complex patterns of metastatic spread in prostate cancer [63, 56, 33].

1.2 Technologies for Characterization of Genome Rearrangements

1.2.1 Non-sequencing Methods for Discovery and Detection

Methods pre-dating high-throughput sequencing provided researchers with valuable tools for identifying rearrangements, though these techniques were limited in terms of resolution and throughput. Fluorescence in situ hybridization (FISH) could be used to count or confirm adjacency between a small set of targeted genomic regions for a large number of single cells. Spectral karyotyping (SKY) could be used to provide a more comprehensive picture of rearrangements in a tumour genome, but could only identify large scale changes and was too laborious to be applied to more than a single cell. Array based methods such as array comparative genomic hybridization (aCGH) and single nucleotide polymorphism (SNP) arrays could be used to identify copy number changes, but provided no direct information about adjacencies between segments and could not be used to identify balanced rearrangements. FISH, SKY, and array methods rely on hybridization to a known sequence, limiting their use for discovery of novel sequences.
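The reason copy number assays cannot identify balanced rearrangements can be shown with a toy genome. In the sketch below, which is illustrative only, a chromosome is a list of signed integers (the sign gives segment orientation), and the helper names `copy_numbers` and `adjacencies` are hypothetical.

```python
from collections import Counter

def copy_numbers(genome):
    """Count copies of each segment, ignoring order and orientation."""
    return Counter(abs(s) for chrom in genome for s in chrom)

def adjacencies(genome):
    """Set of adjacencies between consecutive oriented segments."""
    adj = set()
    for chrom in genome:
        for left, right in zip(chrom, chrom[1:]):
            # Reading a chromosome from the other end turns (left, right)
            # into (-right, -left); keep one canonical form of the pair.
            adj.add(min((left, right), (-right, -left)))
    return adj

reference = [[1, 2, 3, 4]]
inverted = [[1, -3, -2, 4]]   # balanced: segments 2 and 3 inverted
deleted = [[1, 4]]            # unbalanced: segments 2 and 3 lost

# The inversion is invisible to copy number but creates two novel adjacencies.
print(copy_numbers(inverted) == copy_numbers(reference))
print(adjacencies(inverted) - adjacencies(reference))
# The deletion changes copy number and creates one novel adjacency.
print(copy_numbers(deleted) == copy_numbers(reference))
print(adjacencies(deleted) - adjacencies(reference))
```

An assay that reports only the output of `copy_numbers` would therefore detect the deletion but not the inversion, whereas an assay that reports adjacencies would detect both.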

1.2.2 Genome Re-Sequencing Technologies

Sequencing technologies are more suited to novel sequence discovery, and were critical for sequencing large genomes, such as the human genome. The whole genome shotgun sequencing approaches of the human genome project used the chain termination method (Sanger sequencing) and relied on sophisticated robotic automation for increased throughput [79]. The invention of high-throughput sequencing now provides the ability to re-sequence additional human genomes, including tumour genomes, more quickly and cheaply than before.

In brief, high-throughput sequencing technology relies on real-time observations of DNA replication for single stranded DNA molecules of interest. Since replication involves incorporating a sequence of nucleotides complementary to the replicated molecule, observing the sequence of nucleotide incorporations is sufficient to identify the DNA sequence. The process is performed for millions of DNA fragments in parallel, with reactions taking place on a single chip and nucleotide incorporations detected as light emissions using an imaging device. The result is millions of sequences (reads), each representing a DNA molecule in the sample.

Furthermore, the existence of the reference genome produced by the human genome project obviates the need for expensive cloning of longer DNA molecules, as would be required for assembly of short contigs into longer scaffolds or chromosomes. Instead, researchers make the assumption that the DNA sample is highly similar to the reference genome, an assumption that is reasonable for humans and even to some extent human cancers. Sequences are aligned to the human reference genome, with the reference acting as a scaffold for an implicit assembly of the sample genome. The focus of re-sequencing shifts from full assembly of a genome to identification of important changes with respect to the reference.
As a result of high-throughput sequencing and the completed human reference genome, re-sequencing individual human genomes is now routine, opening up many new avenues in cancer research including comprehensive characterization of rearrangements. In addition to sequencing genomic DNA, HTS has been incorporated into a rich set of more complicated sequencing assays. Of particular importance is the sequencing of complementary DNA produced by reverse transcription of poly-A enriched RNA, termed RNA-Seq. From RNA-Seq, researchers can identify the genes and their isoforms that are transcribed in a given sample, and estimate expression levels of the expressed transcripts. Importantly for cancer research, RNA-Seq can also be used to identify novel fusion transcripts produced by somatic genome rearrangement.
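Two ideas from this section, reading a template from its complementary incorporations and using the reference as a scaffold, can be sketched on a toy scale. The code below is a deliberately naive illustration and not how production aligners work: it assumes error-free incorporations, an ungapped placement, and a reference short enough to scan exhaustively. The function names are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def template_from_incorporations(incorporated):
    """Recover the template strand (5'->3') from the nucleotides incorporated
    while synthesising its complementary strand (also given 5'->3')."""
    return "".join(COMPLEMENT[base] for base in reversed(incorporated))

def best_placement(read, reference):
    """Ungapped placement of `read` on `reference` with the fewest mismatches.

    Returns (position, mismatches), where each mismatch is a tuple of
    (reference coordinate, reference base, read base).
    """
    best = None
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = [(pos + i, r, s)
                      for i, (r, s) in enumerate(zip(window, read)) if r != s]
        if best is None or len(mismatches) < len(best[1]):
            best = (pos, mismatches)
    return best

reference = "ACGTTGCAAGGCT"
read = template_from_incorporations("CTAGCAA")
position, changes = best_placement(read, reference)
print(read, position, changes)
```

The list of mismatches returned by `best_placement` is the toy analogue of the changes with respect to the reference that re-sequencing aims to identify.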

1.2.3 Detection of Genome Rearrangements

New re-sequencing technologies have shifted much of the work from the molecular biology labs to computational labs. Significant recent work focuses on improving the computational steps required to accurately identify variants. Mapping and alignment of reads to the reference genome is the first major step in the process [86, 58, 80]. Accurate alignment is fundamental to subsequent detection of nucleotide level changes, since such changes will be observed as imperfect alignment between reads and the reference. Inaccurate read mapping may result in false positives where imperfectly aligned reads should be mapped to a different location in the reference, and false negatives if mis-mapped reads represent a DNA change of interest. Alignment ambiguity is also an important consideration. HTS reads are short enough that many cannot be uniquely mapped to a single locus in the human reference [140], impeding accurate placement of mutations, and complicating estimation of the number of copies of repeated regions. The second step in the detection process can be considered as either a clustering step or multiple sequence alignment (MSA), and depends on the type of variant to be identified. For instance, identification of small (<20 nt) substitutions, insertions or deletions involves simply counting the number of read alignments that either support or contradict the mismatch, insertion or deletion. Alignment ambiguity resulting from repetitive DNA often necessitates an approximation of MSA referred to as local realignment [61] for accurate clustering of supportive vs. contradictory reads. Detection of rearrangements involves identification of one or more novel adjacencies, or breakpoints, produced by the rearrangement. Read evidence for a breakpoint takes the form of discordant reads. A concordant read aligns to a single contiguous location in the genome, whereas a discordant read aligns to multiple non-contiguous locations.
Subsequent to alignment, detection of rearrangement breakpoints involves identification of clusters of discordant reads. As with small variants, an approximation to MSA is often required to overcome alignment ambiguity and accurately cluster supporting reads.

Paired End Reads

All sequencing technologies have length limitations, and maximum attainable lengths are less than the length of many repeats. The paired end approach was developed as early as 1981 [62], and was used to improve efficiency of genome assembly and overcome the problems of repetitive DNA [167]. Rather than attempting to sequence an entire DNA molecule, the paired approach sequences each end of a larger DNA molecule of approximate known size. Although one end of the molecule may be repeat sequence, the other end’s sequence may be unique, especially if the molecule is significantly longer than the average size of a repeat.

Initial efforts to discover rearrangements in cancer used paired end sequencing of long DNA molecules cloned in bacteria [162]. For pairs of end sequences, an alignment to the reference is concordant if: a) the two ends map to the same chromosome, b) the mappings are separated by a distance that is consistent with the known length of the molecule, and c) the orientation/strand of the mappings imply that the sequenced molecule is a contiguous subsequence of the reference. As before, end sequences with no concordant alignment are considered discordant and are indicative of a potential rearrangement. The paired end approach was adopted for HTS, first for human structural variation [77], then for cancer specific rearrangements [10]. For HTS, molecules are sequenced in parallel and must all have approximately the same size. Thus a typical initial step is fragmentation and size selection. HTS read lengths are shorter than previous technologies, and the lengths of sequenced molecules are typically also shorter. Paired end HTS has since been used extensively for detection of rearrangements in cancer. The predominant methodology uses three main steps: a) mapping of read pairs to the reference, b) clustering of discordant read pairs supporting the same rearrangement, and c) probabilistic or heuristic filtering to remove mapping and sequencing artifacts [10, 147, 130, 163, 28].
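The three concordance criteria above can be sketched in a few lines of Python. This is an illustrative sketch only: the `EndAlignment` type, the function name, and the use of a tolerated length range `[min_len, max_len]` in place of a single known molecule length are assumptions made for the example, not part of any published tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EndAlignment:
    """Hypothetical record for the alignment of one end of a molecule."""
    chrom: str
    start: int   # leftmost aligned reference coordinate (0-based)
    end: int     # one past the rightmost aligned reference coordinate
    strand: str  # '+' or '-'


def is_concordant(a, b, min_len, max_len):
    """Return True if the two end alignments satisfy criteria a), b) and c)."""
    # a) both ends map to the same chromosome
    if a.chrom != b.chrom:
        return False
    upstream, downstream = sorted((a, b), key=lambda x: x.start)
    # c) the upstream end is on the forward strand and the downstream end
    #    on the reverse strand, so the molecule is contiguous in the reference
    if upstream.strand != "+" or downstream.strand != "-":
        return False
    # b) the implied molecule length is consistent with the expected size
    #    (assumption: expressed here as a tolerated range)
    implied_length = downstream.end - upstream.start
    return min_len <= implied_length <= max_len
```

Any pair for which no alignment passes this check would be treated as discordant in the sense used above.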

1.3 Representation of Genome Rearrangements

1.3.1 Types of Rearrangements

A breakpoint represents the most atomic unit of rearrangement. For additional interpretability, researchers have defined classes of simple rearrangement based on whether they reorient or change the number of copies of affected chromosomal segments. A list of the common classes of rearrangement follows.

deletion: Excision of a segment of DNA from a chromosome, with the free ends adjacent to the excised segment re-joined. Results in a single breakpoint for the re-joined ends.

tandem duplication: A copy of a segment reinserted with the same orientation immediately after the end of the copied segment. Results in a single breakpoint representing the adjacency between the end of the original segment and the beginning of the copy. Also referred to by some authors as an eversion.

inversion: A segment excised, reversed and reinserted at the original location. Results in two breakpoints: adjacency between the upstream sequence and the end of the segment, and adjacency between the downstream sequence and the beginning of the segment. One inversion breakpoint in isolation is termed a fold-back inversion, and represents a chromosome arm copied and rejoined to itself.

translocation: Two separate chromosomes are broken and rejoined. Produces one breakpoint, the adjacency between the separate chromosomes. A reciprocal translocation implies a swapping of chromosome arms and produces two breakpoints, one for each mutant chromosome.

balanced: Multiple breakpoints that together result in a rearrangement that does not change the number of copies of neighboring segments of DNA. Inversions and reciprocal translocations are examples of balanced rearrangement.

1.3.2 Read Pair Alignment Signatures

Paired end read alignments produce a specific signature dependent on whether the sequenced molecule is rearranged or non-rearranged with respect to the reference. Paired end HTS sequences each DNA fragment from 5’ to 3’, towards the center of the fragment. Thus, for a concordant read, one end will align directly to the reference genome, and the reverse complement of the other end will align downstream and within a distance consistent with the fragment length. Equivalently, we say the other end aligns to the reverse complement of the reference sequence. For discordant reads, the relative position and orientation of each end alignment is sufficient to determine the type of simple rearrangement (Figure 1.2). Alignments of reads are typically represented in chromosome space as a line segment spanning the region of alignment, with an arrowhead in the direction of sequencing [135]. For discordant reads, this representation allows for unambiguous determination of the breakpoint relative to the read: the arrowhead points towards the breakpoint.
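As a rough illustration of how position and orientation determine the class of simple rearrangement, the following Python sketch maps a read pair to one of the signatures summarized in Figure 1.2. The tuple layout, the function name and the single `max_fragment_length` threshold are assumptions made for the example; real callers also model mapping quality and the full fragment length distribution.

```python
def classify_read_pair(end1, end2, max_fragment_length):
    """Classify a read pair by its alignment signature.

    Each end is a hypothetical tuple (chrom, position, strand), where
    position is the reference coordinate of the 5' end of the read and
    strand is '+' or '-'.
    """
    if end1[0] != end2[0]:
        # ends on separate chromosomes
        return "translocation"
    # order the two ends by reference coordinate
    (_, up_pos, up_strand), (_, down_pos, down_strand) = sorted(
        (end1, end2), key=lambda e: e[1]
    )
    if up_strand == down_strand:
        # both arrowheads point the same way
        return "inversion"
    if up_strand == "-":
        # tail-to-tail: arrowheads point away from each other
        return "tandem duplication"
    # head-to-head: arrowheads point towards each other
    if down_pos - up_pos > max_fragment_length:
        return "deletion"
    return "concordant"
```

For example, a head-to-head pair whose ends lie 4,900 nt apart is reported as a deletion when `max_fragment_length` is 500, whereas the same coordinates with both strands flipped are reported as a tandem duplication.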

1.3.3 Graph Representations of Rearranged Genomes

Figure 1.1: Representing read pair alignments. Top: DNA fragments are sequenced at each end, from 5’ to 3’, towards the middle of the fragment. Middle: Alignment of a concordant read pair, showing the read pair and sequencing direction overlaid on the reference. One read aligns directly to the reference and the other read aligns downstream and on the reverse complement strand of the reference. Bottom: Representation of the concordant alignment as line segments with arrowheads showing strand of alignment, or equivalently sequencing direction.

Rearrangement breakpoints detected using high-throughput sequencing provide information about sequences that are adjacent in the tumour genome, but not in the reference genome. Since read sequences are short, the range of the adjacency information provided by breakpoint predictions from the reads is limited. Assembly of full length tumour chromosomes is impossible from short reads alone, and generally not prioritized as it would require additional, more expensive sequencing of longer fragments of DNA. Thus the information provided by high-throughput sequencing about the structure of a rearranged tumour genome is inherently ambiguous. What can be readily identified are unbroken segments of the genome, and tumour specific adjacencies between the ends of those segments. Graphs are a natural way of representing a set of rearrangement breakpoints for a given tumour genome (Figure 1.2). Two representations predominate, bi-directed graphs and bi-edge-colored graphs. Undirected graphs are not generally used since they do not preserve orientation of rearranged segments. In a bi-directed graph, nodes represent genomic segments and bi-directed edges represent adjacencies [106]. The direction of edges in a bi-directed graph can be thought of as placing constraints on valid walks through the graph, where a valid walk represents a sequence of segments for a tumour chromosome. A walk that traverses an in-edge into a node must be followed by an out-edge out of the node, and conversely, traversing an out-edge into a node must be followed by an in-edge when traversing out of that node. By contrast, in bi-edge-colored graphs segments are represented as edges between vertices representing segment ends [129, 128, 117]. Adjacencies between segments are represented as bond edges with a color distinct from segment edges. In the bi-edge-colored graph representation, a valid chromosome is a walk that alternates between segment and bond edges.
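The alternating-walk property of the bi-edge-colored representation can be made concrete with a small Python sketch. The vertex encoding `(segment, side)`, the function name, and the convention that a chromosome starts and ends with a segment edge are assumptions chosen for illustration.

```python
def is_valid_chromosome(walk, bonds):
    """Check that a walk alternates between segment and bond edges.

    walk  -- list of vertices, each a hypothetical (segment_id, side)
             tuple with side in {'start', 'end'}
    bonds -- set of frozensets, each holding the two segment ends joined
             by an adjacency (bond edge)
    """
    # a walk of k segment edges and k - 1 bond edges visits 2k vertices
    if len(walk) < 2 or len(walk) % 2 != 0:
        return False
    for i in range(len(walk) - 1):
        u, v = walk[i], walk[i + 1]
        if i % 2 == 0:
            # segment edge: joins the two opposite ends of one segment
            if u[0] != v[0] or u[1] == v[1]:
                return False
        else:
            # bond edge: must be a known adjacency between segment ends
            if frozenset((u, v)) not in bonds:
                return False
    return True


# Tumour genome in which segment B has been deleted: A's end is bonded to C's start.
deletion_bonds = {frozenset((("A", "end"), ("C", "start")))}
deleted_b = [("A", "start"), ("A", "end"), ("C", "start"), ("C", "end")]

# Tumour genome in which segment B has been inverted between A and C.
inversion_bonds = {
    frozenset((("A", "end"), ("B", "end"))),
    frozenset((("B", "start"), ("C", "start"))),
}
inverted_b = [
    ("A", "start"), ("A", "end"),
    ("B", "end"), ("B", "start"),
    ("C", "start"), ("C", "end"),
]
```

Traversing segment B from its end to its start in `inverted_b` is what records the reversed orientation, which an undirected graph over whole segments could not express.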

1.4 Contribution

Figure 1.2: Read pair signatures and graph representation by rearrangement type. [Panel columns: Chromosome, Read Pair Signature, Bi-edge-colored Graph, Bi-directed Graph. Panel rows: Wild-type, Deletion, Tandem Duplication, Inversion, Translocation.] Non-rearranged wild-type chromosomes produce aligned read pairs in head-to-head orientation with no change in inferred fragment length from expected. A chromosome with a deletion produces aligned read pairs in head-to-head orientation with a larger than expected inferred fragment length. A tandem duplication produces aligned read pairs in tail-to-tail orientation. An inversion produces two types of aligned read pair signatures: tail-to-head and head-to-tail, for the two breakpoints created by the inversion. A translocation/fusion produces aligned read pairs on separate chromosomes. Shown is an unbalanced translocation. A balanced, or reciprocal, translocation involves two breakpoints with aligned read pair orientations reversed.

• We introduce deFuse [102], a tool for predicting gene fusions from RNA-Seq data (Chapter 2). deFuse uses an approximation algorithm to resolve ambiguously mapping RNA-Seq reads, and a variant of the Smith-Waterman algorithm to predict nucleotide level fusion sequences. A probability is calculated based on informative features of each predicted sequence. Based on comparisons on validated fusions in real tumour samples, we show that deFuse produces substantially better sensitivity and specificity than two other published methods. We used deFuse to discover gene fusions in 40 ovarian tumor samples, one ovarian cancer cell line, and three sarcoma samples. We identified the first gene fusions discovered in ovarian cancer and conclude that gene fusions are not infrequent events in ovarian cancer and have the potential to substantially alter the expression patterns of the genes involved. As testament to its utility, deFuse has been cited 112 times to date and has been used to identify fusions in at least 13 studies.

• We introduce Comrad [104], the first tool for joint analysis of RNA-Seq and WGS data for the purposes of identifying expressed genome rearrangements (Chapter 3). Comrad uses an approximation algorithm to jointly resolve ambiguously mapping genome sequencing and RNA-Seq reads. As a proof of concept, we rediscovered 4 fusions in prostate cancer cell line C4-2 out of 6 fusions previously identified in related cell line LNCaP. We also identified 6 novel fusion transcripts and associated genomic breakpoints, and verified their existence in both C4-2 and LNCaP, suggesting that Comrad may be more sensitive than previous methods that have been applied to fusion discovery in LNCaP.

• We introduce nFuse [105], a tool for identifying complex breakpoints and complex balanced rearrangements from joint analysis of RNA-Seq and WGS data (Chapter 4). Our work represents the first method for systematic identification of complex balanced rearrangements, and the first method for systematically predicting complex breakpoints resulting in gene fusions. We demonstrate the utility of our method by validating 2 out of 2 complex balanced rearrangements in primary prostate tumour 963, and 5 out of 6 complex breakpoints in breast cancer cell line HCC1954, including an important gene fusion that was missed in a previous study of the HCC1954 genome.

• We introduce ReMixT [103], the first tool for jointly predicting genome structure and content in heterogeneous tumour samples (Chapter 5). We extend traditional HMM based approaches by incorporating breakpoints predicted from whole genome sequencing, and model tumour samples as mixtures of genome graphs. We describe a novel greedy algorithm that infers clone specific copy number of segments and breakpoints given a known mixture. Using simulated whole genome sequence data, we show that integration of breakpoints improves inference of segment and breakpoint copy number over more naive, non-integrated methods. We also show that ReMixT out-performs three competing methods for inference of clone specific segment copy number.

1.5 Organization of the thesis

The remainder of the thesis is organized as follows. In Chapter 2 we present deFuse, a method for detection of gene fusions in RNA-Seq data. In Chapter 3 we present Comrad, a method for joint analysis of RNA-Seq and WGS for identification of expressed rearrangements. In Chapter 4 we present nFuse, an extension of Comrad for prediction of complex rearrangements and associated fusions from RNA-Seq and WGS data. In Chapter 5 we present ReMixT, a method for joint inference of genome structure and content in heterogeneous tumour samples using WGS data. In Chapter 6 we briefly summarize the results of studies that have benefited from the application of the described methods. Finally, in Chapter 7 we summarize our contribution to rearrangement detection and characterization and describe possible further improvements and directions of research.

Chapter 2

deFuse: an algorithm for gene fusion discovery in tumor RNA-Seq data

2.1 Introduction

Gene fusions are known to play an important role in the development of haematological disorders and childhood sarcomas, while the recent discovery of ETS gene fusions in prostate cancer [157] has also prompted renewed interest in gene fusions in solid tumors. ETS gene fusions are present in 80% of malignancies of the male genital organs, and as a result these fusions alone are associated with 16% of all cancer morbidity [108]. The discovery of the EML4-ALK fusion in non-small-cell lung cancer and the ETV6-NTRK3 fusion in human secretory breast carcinoma suggest that gene fusions are also recurrent at low levels in other solid tumor types [156, 150]. The discovery of such rare but recurrent gene fusions may be of significant clinical benefit where they provide the potential for targeted therapy.
Gene fusions are thought to arise predominantly from double stranded DNA breakages followed by a DNA repair error [108, 5]. Promoter exchanges are one class of gene fusions, characterized by the replacement of an oncogene’s regulatory regions with those of another gene, resulting in deregulation of transcription of the oncogene. For ETS gene fusions in prostate cancer, the androgen-responsive regulatory elements of TMPRSS2 drive the expression of the ETS family member to which TMPRSS2 is fused [157]. Another class of gene fusions leads to the creation of a chimeric protein with biological function distinct from either of the partner genes from which it originated. A classic example is BCR-ABL1, a chimeric protein that is the defining lesion in chronic myelogenous leukaemia (CML), and which induces growth factor independence and the inhibition of apoptosis [45].
Large scale, genome-wide efforts to comprehensively identify and characterize genomic rearrangements that lead to gene fusions in human cancers have recently been made possible through next generation sequencing technologies. These technologies provide a deeper level of sequencing than is possible by cytogenetic and Sanger sequencing methods and are poised to reveal a more detailed understanding of the extent and nature of genomic rearrangements in cancer. For example, using low-coverage paired end whole genome (gDNA) shotgun sequencing, Stephens et al. [152] reported that the genomes of breast cancer cells harbour many more rearrangements than previously thought, and suggested that this class of somatic mutation needs to be carefully considered when interpreting breast cancer genomes. Using similar experimental and analytical techniques, Campbell et al. [25] characterized tumor evolution in pancreatic cancer patient samples by profiling the pattern of somatic rearrangements found in primary tumors and distant metastases extracted from the same patient.

2.1.1 Previous work on gene fusion detection from RNA-Seq

Next generation sequencing of cDNA (RNA-Seq or whole transcriptome shotgun sequencing) provides an ideal experimental platform for expressed gene fusion discovery. Analogous to genome sequencing, RNA-Seq enables an unbiased and relatively comprehensive view into tumor transcriptomes, and can provide information about the rarest of transcripts. RNA-Seq targets only expressed sequences from protein coding genes and is thus more focused than whole genome sequencing. Maher et al. [91] demonstrated the capacity of RNA-Seq to find gene fusions in prostate cancer samples. They identified potentially fused gene pairs using discordantly aligned paired end reads, and also identified potential fusion splices by mining end sequences for alignments to all possible pairings of exons of the potentially fused gene pairs. A study of the melanoma transcriptome by Berger et al. [15] used many of the same principles. Another recently developed method called FusionSeq identifies gene fusions from discordant alignments, and uses a variety of novel filters and quality metrics to discriminate real fusions from sequencing and alignment artifacts [138]. FusionSeq has been used to identify fusions in prostate tumor samples and cell lines [138, 125]. While the methods used for these studies are capable of identifying genuine gene fusions, many challenges and limitations remain in the analysis of RNA-Seq data. For example, the aforementioned studies only considered reads that align uniquely to the genome. However, errors in next generation sequencing together with homologous and repetitive sequences shared between genes often produce ambiguous alignments of the short reads generated in RNA-Seq experiments. While resolving the ‘correct’ placement of these reads is often not possible, we propose that ambiguously-aligning reads provide important evidence of real gene fusions, and therefore should be leveraged by analysis methods.
Sequence reads that align across a gene fusion boundary (so-called split reads) are a strong source of evidence for gene fusions in paired-end RNA-Seq data. Hu et al. [70] propose a strategy called PERAlign that is centered on the ability to identify split reads: the method uses the split read aligner MapSplice [164] to identify single end reads split by fusion boundaries, and then verifies those fusion boundaries using a probabilistic model to infer the alignment of discordantly aligning pairs. As described below, our method also combines the complementary sources of split reads and discordant reads, but we show using real patient data that discordant read analysis followed by split read analysis is considerably more sensitive for gene fusion discovery than the reverse procedure described by Hu et al.

2.1.2 The deFuse Method

With the goal of resolving the limitations described above and therefore providing a more accurate method for detecting gene fusions from RNA-Seq, we developed a novel algorithm called deFuse. The central idea behind deFuse is to guide a dynamic programming-based split read analysis with discordant paired end alignments. This is in contrast to PERAlign, which uses discordant paired end alignments to verify the results of a split read analysis. Furthermore, unlike previous approaches, we do not discard paired end reads that align ambiguously, but instead consider all alignments for each read, and attempt to resolve the most likely alignment position for each read. We show that using ambiguously-aligning reads results in an increased amount of evidence for predicted gene fusions and an increase in the number of relevant gene fusions predicted. In addition, our method is not limited to finding gene fusions with boundaries between known exons, and therefore can identify fusion boundaries in the middle of exons or involving intronic or intergenic sequences. Finally, the method attempts to provide a number of confidence measures to estimate the validity of each prediction.

2.2 Method

2.2.1 Ethics Statement

We obtained three sarcomas and 40 ovarian carcinomas from the OvCaRe (Ovarian Cancer Research) frozen tumor bank. Patients provided written informed consent for research using these tumor samples before undergoing surgery, and the consent form acknowledged that a loss of confidentiality could occur through the use of samples for research. Separate approval from the hospital’s institutional review board was obtained to permit the use of these samples for RNA-sequencing experiments.

2.2.2 Data sets

Patient tumor samples from the OvCaRe tumor bank

We interrogated the transcriptomes of a cell line derived from a serous borderline tumor, in addition to three sarcomas and 40 ovarian carcinomas obtained from the OvCaRe (Ovarian Cancer Research) frozen tumor bank. Pathology review, sample preparation, RNA extraction, RNA-Seq library construction and RNA-Seq sequence data generation using Illumina GAII were performed as previously described [168, 142]. The RNA-Seq datasets used in this study are listed in Table 2.2, which provides a summary-level description of each sample. For each case we list data acquisition statistics, the tumor type and subtype and the number of predictions made by deFuse.

Published data sets

In addition to our internally generated data, we tested deFuse using published paired end RNA-Seq data sets known to contain gene fusions. These datasets were used as positive controls in the evaluation of deFuse. We used the NCI-H660 prostate cell line dataset from the FusionSeq website (http://info.gersteinlab.org/FusionSeq_Test_Datasets), known to harbour a TMPRSS2-ERG fusion. In addition, we downloaded the datasets derived from 13 melanoma samples and cell lines and one chronic myelogenous leukemia (CML) cell line K-562 described in Berger et al. [15]. Datasets were obtained from the Short Read Archive (http://www.ncbi.nlm.nih.gov/sra) under submission number SRA009053. As described in Berger et al. [15], the CML cell line harbours three previously described gene fusions including BCR-ABL1, and the melanoma data harbours 11 gene fusions.

2.2.3 The deFuse algorithm

In this section, we describe the deFuse algorithm. We begin by defining essential terms. We define a fragment as a cDNA sequence (usually approximately 250bp) that is size selected during RNA-Seq library construction. We define a read as a sequenced end of a fragment (usually approximately 50bp). We define paired ends as the pair of reads sequenced from the ends of the same fragment. The insert sequence is the portion in the middle of the fragment that is not sequenced. A fusion boundary is the precise, nucleotide-level genomic coordinate that defines the breakpoint on either side of the gene fusion. We define spanning reads as paired ends that harbour a fusion boundary in the insert sequence, whereas a split read harbours a fusion boundary in the read itself. A discordant alignment is produced by spanning reads of a fragment with each end aligning to a different gene, whereas a split read will often produce a single end anchored alignment for which one end aligns to one gene and the other end does not align. With these definitions in hand we will now describe how deFuse predicts gene fusions by searching RNA-Seq data for fragments that harbour fusion boundaries. As mentioned previously, the problem of identifying the true genomic origin of a set of RNA-Seq reads is confounded by several factors, and as a result, a proportion of the RNA-Seq reads will have ambiguous alignments to the genome. The deFuse method, outlined schematically in Figure 2.1, combines an approach for resolving the actual alignment location of ambiguously aligning spanning reads with a dynamic programming based split read analysis for resolving the nucleotide level fusion boundary with high sensitivity. A novel confidence measure is provided based on the degree of corroboration of evidence supporting the prediction.
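The distinction between spanning and split reads depends only on where the fusion boundary falls within the fragment. The Python sketch below illustrates this under simplifying assumptions: both reads have the same length, the reads do not overlap, and the boundary is given as an offset from the start of the fragment. The function and parameter names are hypothetical.

```python
def classify_fragment(fragment_length, read_length, boundary_offset):
    """Classify a fragment by the location of a fusion boundary within it.

    boundary_offset is the number of fragment bases that lie before the
    fusion boundary (assumption: fragment coordinates, 0-based).
    """
    if fragment_length < 2 * read_length:
        raise ValueError("reads overlap; this sketch assumes a non-negative insert")
    if boundary_offset <= 0 or boundary_offset >= fragment_length:
        # the boundary lies outside the fragment
        return "no fusion boundary"
    insert_start = read_length
    insert_end = fragment_length - read_length
    if insert_start <= boundary_offset <= insert_end:
        # boundary in the unsequenced insert: each read lies in a single gene
        return "spanning"
    # boundary inside one of the two sequenced reads
    return "split"
```

With the typical sizes quoted above (250bp fragments, 50bp reads), a boundary 120 bases into the fragment yields a spanning read, while a boundary 30 bases in yields a split read.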

Figure 2.1: The deFuse gene fusion discovery method. a) Discordant alignments are clustered based on the likelihood that those alignments were produced by reads spanning the same fusion boundary. Ambiguous alignments are resolved by selecting the most likely set of fusion events, and the most likely assignment of paired end reads to those events, and the remaining alignments are discarded. b) Paired end reads with an alignment for which one end aligns near the approximate fusion boundary are mined for split alignments of the other end of the read. c) The predicted fusion boundary is used to calculate the fragment lengths for each spanning paired end read. These fragment lengths are tested for the hypothesis that they were drawn at random from the fragment length distribution.

The method consists of four main steps. The first step is alignment of paired end reads to a reference composed of the sequences that are expected to exist in the sample, with all relevant alignments considered. We use spliced and unspliced gene sequences as a reference because we have found that fusion genes often produce splice variants that express intronic sequences, and that some of those splice variants are biologically relevant (unpublished data). We define two necessary conditions for considering discordant alignments to have originated from reads spanning the same fusion boundary and use these conditions to cluster discordant alignments representing the same fusion event. The second step resolves ambiguous discordant alignments by selecting the most likely set of fusion events, and the most likely assignment of spanning reads to those events (Figure 2.1a). The third step is a targeted search for split reads using a dynamic programming based solution to resolve the nucleotide level fusion boundary of each event (Figure 2.1b). The fourth step involves a test for the corroboration of the spanning and split read evidence. For each spanning read, we calculate the putative length of the fragment that generated that paired end read given the fusion boundary predicted by the split reads. The resulting set of fragment lengths is used to test the hypothesis that those fragments were generated by the inferred fragment length distribution (Figure 2.1c). Finally, we compute a set of quantitative features, and use an AdaBoost classifier to discriminate between real gene fusions and artifacts of the sequencing and alignment process.
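The corroboration test in the fourth step can be illustrated with a simple two-sided z-test on the mean of the putative fragment lengths. The choice of a z-test, and the summary of the fragment length distribution by a mean and standard deviation, are assumptions made for this sketch; the text above does not specify which statistical test is used.

```python
import math


def fragment_length_pvalue(putative_lengths, dist_mean, dist_stddev):
    """Two-sided z-test p-value for the mean of the putative fragment lengths.

    Assumption: fragment lengths are approximately normal with the given
    mean and standard deviation, estimated from concordant alignments.
    """
    n = len(putative_lengths)
    if n == 0 or dist_stddev <= 0:
        raise ValueError("need at least one length and a positive standard deviation")
    sample_mean = sum(putative_lengths) / n
    # standard error of the mean under the null hypothesis
    z = (sample_mean - dist_mean) / (dist_stddev / math.sqrt(n))
    # P(|Z| >= |z|) for a standard normal variable
    return math.erfc(abs(z) / math.sqrt(2))
```

A small p-value would indicate that the spanning reads imply fragment lengths inconsistent with the predicted fusion boundary, which is the kind of disagreement this step is designed to flag.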

Conditions for considering discordant alignments to have originated from reads spanning the same fusion boundary

deFuse begins with a search for spanning reads as evidence of gene fusion events. We describe two necessary conditions for considering two discordant alignments to have originated from reads spanning the same fusion boundary. The size selection step of the RNA-Seq library construction protocol results in a collection of cDNA fragments with lengths that we approximate with the inferred fragment length distribution $P(L)$ derived from concordant alignments [90, 15]. We restrict our analysis to consider only the most probable range of fragment lengths $[l_{\min}, l_{\max}]$, where $l_{\min}$ and $l_{\max}$ are the $\alpha/2$ and $(1 - \alpha/2)$-percentiles of $P(L)$ respectively. The value $\alpha$ represents the proportion of paired end reads that are not guaranteed by the algorithm to be assigned to the correct fusion event.
By definition a spanning read harbours a fusion boundary in the insert sequence (Figure 2.2a). Given a discordant alignment of a spanning read, the insert sequence of the corresponding fragment should align downstream of the alignment of each end in each gene (Figure 2.2b). We call the region in the transcript to which the insert sequence should align the fusion boundary region, since it represents the region in the transcript where the fusion boundary is expected to exist. Given read length $r$, the insert sequence has maximum length $l_{\max} - 2r$, thus the fusion boundary region has length $l_{\max} - 2r$.
We define the overlapping boundary region condition C1 as the condition that the fusion boundary regions for two paired end alignments must overlap in each transcript (Figure 2.2c). The overlapping boundary region condition ensures that there exists a valid location for the fusion boundary in each transcript that would simultaneously explain both paired end alignments.
Given two reads spanning the same fusion boundary, the difference between the fragment lengths of those reads can be calculated as $|d_X + d_Y|$ where $d_X$ and $d_Y$ are given by Figure 2.2d. The implied fragment length difference for two discordant alignments can be calculated similarly as shown in Figure 2.2e. We define the similar fragment length condition C2 as the constraint that $|d_X + d_Y|$ must be no more than $l_{\max} - l_{\min}$ for us to consider two paired end reads to have originated from the same fusion transcript. A more rigorous definition and probabilistic motivation for the two conditions is given in Supplementary Methods (Text S1).
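The two conditions can be sketched in Python as follows. The `DiscordantAlignment` record, its field names, and in particular the signed definitions of $d_X$ and $d_Y$ as offsets between the inner read ends in each transcript are assumptions made for illustration, since the figure that defines them is not reproduced in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiscordantAlignment:
    """Hypothetical summary of one discordant paired end alignment."""
    x_inner: int  # transcript X coordinate of the read end nearest the boundary
    x_dir: int    # +1 if the insert extends towards larger X coordinates, else -1
    y_inner: int  # transcript Y coordinate of the read end nearest the boundary
    y_dir: int    # +1 if the insert extends towards larger Y coordinates, else -1


def boundary_region(inner, direction, l_max, r):
    """Fusion boundary region of length l_max - 2r adjacent to a read end."""
    span = l_max - 2 * r
    return (inner, inner + span) if direction > 0 else (inner - span, inner)


def intervals_overlap(a, b):
    return a[0] < b[1] and b[0] < a[1]


def same_orientation(p, q):
    return p.x_dir == q.x_dir and p.y_dir == q.y_dir


def satisfies_c1(p, q, l_max, r):
    """Overlapping boundary region condition, checked in both transcripts."""
    return (
        same_orientation(p, q)
        and intervals_overlap(boundary_region(p.x_inner, p.x_dir, l_max, r),
                              boundary_region(q.x_inner, q.x_dir, l_max, r))
        and intervals_overlap(boundary_region(p.y_inner, p.y_dir, l_max, r),
                              boundary_region(q.y_inner, q.y_dir, l_max, r))
    )


def satisfies_c2(p, q, l_min, l_max):
    """Similar fragment length condition: |d_X + d_Y| <= l_max - l_min."""
    if not same_orientation(p, q):
        return False
    # assumed signed offsets: positive when q's read end is nearer the boundary
    d_x = (q.x_inner - p.x_inner) * p.x_dir
    d_y = (q.y_inner - p.y_inner) * p.y_dir
    return abs(d_x + d_y) <= l_max - l_min
```

Under these definitions, $d_X + d_Y$ equals the difference between the fragment lengths the two reads would imply for any shared fusion boundary, so the test does not require the boundary position to be known.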

Figure 2.2: Conditions for considering two paired end reads to have originated from the same fusion transcript. a) Fusion transcript X-Y supported by a paired end read spanning the fusion boundary. b) Discordant paired end reads represent reads potentially spanning a fusion boundary. Each discordant alignment suggests fusion boundaries in the regions adjacent to the alignments in each transcript. The fusion boundary region, shown in gray, is the region in which we expect a fusion boundary to occur. c) The overlapping boundary region condition is the condition that the fusion boundary regions in each transcript must overlap. d) The difference between the fragment lengths of two paired end reads spanning a fusion boundary is $|d_X + d_Y|$. e) The similar fragment length condition is the constraint that $|d_X + d_Y|$ must be no more than $l_{\max} - l_{\min}$.

Assigning a unique discordant alignment to each spanning read

The utility of an ambiguously aligned read depends on our ability to select the correct alignment for that read based on the greater context of all paired end alignments in the RNA-Seq dataset. Given that we are considering alignments to spliced and unspliced gene sequences, ambiguous alignments will result from homology between genes and also from the redundant representation of the same exon multiple times for multiple splice variants of the same gene. The true alignment for each read must be inferred, as it will be used to identify, for the first situation, the correct pair of genes involved in the fusion, and for the second situation, the correct pair of splice variants of those genes. Define a valid cluster as a set of discordant alignments for which every two paired end alignments in that set satisfy the overlapping boundary region condition C1 and the similar fragment length condition C2. Each valid cluster represents a potential fusion event implied by a set of discordant alignments. A paired end read may be a member of multiple valid clusters as a result of homology between genes, exon redundancy between transcripts, and the multiplicity of valid clusterings. Fusion events are rare when compared with the event of detecting a discordant paired end read given the existence of a fusion event that would generate that read. Thus we seek an assignment of each paired end read to a single fusion event (valid cluster) that minimizes the number of fusion events. The resulting solution, first described by Hormozdiari et al. in the context of genomic structural variation [65], is termed the maximum parsimony solution. Computation of the maximum parsimony solution is NP-hard by reduction from the set cover problem, as shown by Hormozdiari et al. [65]. Similar to Hormozdiari et al., we compute an approximate maximum parsimony solution using a modified version of the greedy algorithm for solving set cover with approximation factor log n.
Reads are initially set as unassigned and valid clusters are initially set as unselected. At each step the algorithm selects, from the set of unselected valid clusters, the cluster containing the largest number of unassigned reads, with ties broken randomly. Each unassigned read in the newly selected cluster is assigned to that cluster. The algorithm continues, selecting clusters and assigning reads, until all reads have been assigned to clusters. The algorithm will produce an equivalent solution to the maximum parsimony problem when considering only maximal valid clusters, rather than all valid clusters (which are exponential in number). Therefore we first calculate the maximal valid clusters (see Supplementary Methods, Text S1) and then apply the algorithm to only the maximal valid clusters.
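
The greedy selection described above can be sketched as follows. This is a minimal illustration rather than the deFuse implementation: the function name, the dictionary representation of clusters and the seeded random tie-breaking are assumptions made for the example, and the maximal valid clusters are assumed to have been computed already.

```python
import random


def greedy_parsimony(clusters, seed=0):
    """Assign every read to exactly one selected cluster.

    clusters: dict mapping a (hypothetical) cluster id to the set of
    read ids it contains. Returns a dict mapping read id -> cluster id.
    """
    rng = random.Random(seed)
    # Reads start unassigned, clusters start unselected.
    unassigned = set().union(*clusters.values())
    unselected = set(clusters)
    assignment = {}
    while unassigned:
        # Number of still-unassigned reads in each unselected cluster.
        gains = {c: len(clusters[c] & unassigned) for c in unselected}
        best = max(gains.values())
        # Ties are broken randomly among the best clusters.
        choice = rng.choice(sorted(c for c, g in gains.items() if g == best))
        for read in clusters[choice] & unassigned:
            assignment[read] = choice
        unassigned -= clusters[choice]
        unselected.remove(choice)
    return assignment


example = {"A": {"r1", "r2", "r3"}, "B": {"r3", "r4"}, "C": {"r4"}}
result = greedy_parsimony(example)
```

In this toy input, cluster A is selected first because it holds three unassigned reads; the remaining read r4 is then assigned to either B or C, since both contain exactly one unassigned read.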

Split read boundary sequence prediction

Given fusion events nominated by spanning reads, deFuse does a targeted split read analysis to predict nucleotide level fusion boundaries. For a cluster of discordant alignments, the approximate fusion boundary is the intersection of the fusion boundary regions of the discordant alignments in that cluster (Figure 2.3a). A candidate split read is a read for which one end is anchored near an approximate fusion boundary, such that the other end of the read could potentially align to the approximate fusion boundary. Reads with discordant and single end anchored alignments are considered as candidate split reads; however, reads with concordant alignments are not considered.

Figure 2.3: Searching for candidate split reads. a) Approximate fusion boundaries, shown as dashed rectangles, are the intersection of fusion boundary regions for discordant alignments supporting a potential fusion. b) The mate alignment region, shown as a dashed rectangle, is the union of possible alignment locations for the other end of a single end anchored alignment. c) The approximate fusion boundary in transcript X is projected into transcript Y by remapping the start of the approximate fusion boundary from X, to the genome, to Y.

Given the alignment of one end of a read, we define the mate alignment region as the region in which we would expect the other end of the read to align for the pair of alignments to be considered concordant. The mate alignment region is calculated as the union of all possible concordant alignments of the other end of the read given a fragment length in the range $l_{min}$ to $l_{max}$ (Figure 2.3b). A read with a mate alignment region that intersects the approximate fusion boundary of a cluster is considered a candidate split read for that cluster.

The maximum parsimony solution will nominate fusions such that the average number of spanning reads per fusion is maximized. However, it is possible that the selected transcript variants do not maximize both spanning and split read evidence. Thus, when searching for candidate split reads, it is necessary to search across all relevant transcripts. In addition to calculating the approximate fusion boundary in the transcript variants proposed by the maximum parsimony solution, we also project those approximate fusion boundaries onto other transcript variants of the same gene. The approximate fusion boundary in transcript variant X of gene A is projected onto transcript Y of gene A by remapping the start of the approximate fusion boundary from X to the genome, and then from the genome to Y (Figure 2.3c). The start of the approximate fusion boundary is the end closest to the discordant alignments. Candidate split reads are then found by searching for mate alignment regions that intersect the approximate fusion boundaries in any of the transcript variants.

The split read analysis proceeds by aligning candidate split reads to approximate fusion boundaries. We first align a candidate split read to the two approximate fusion boundaries in the two fused transcripts, then combine the two alignments in a way that maximizes the combined alignment score. Let $s_r$ be the end sequence of a candidate split read expected to be split by a fusion boundary and let $S_X$ and $S_Y$ be the sequences of the approximate fusion boundary in transcripts X and Y, where a fusion between X and Y has been nominated by spanning read evidence.

We start by aligning $s_r$ to $S_X$ using dynamic programming based local alignment, penalizing initial gaps in the end sequence, and then repeat with the reverse of $s_r$ and the reverse of $S_Y$ (see Supplementary Methods, Text S1). Let $D_X$ and $D_Y$ be the matrices produced by aligning $s_r$ to $S_X$ and $S_Y$. A split in the alignment is represented by the triple $(i_X, i_Y, j)$, where $i_X$ and $i_Y$ are the nucleotide level positions of the fusion boundary in $S_X$ and $S_Y$, and $j \in \{1, \ldots, |s_r|\}$ is the position of the fusion boundary in the read sequence, with $|s_r|$ defined as the length of $s_r$. All splits $(i_X, i_Y, j)$ with maximum score can be calculated in $O(|s_r||S_X| + |s_r||S_Y|)$ by first finding $j$ as follows:

$$ j = \arg\max_{j} \left( \max_{i} D_X(i, j) + \max_{i} D_Y(i, |s_r| - j) \right) \tag{2.1} $$

and then finding $i_X$ and $i_Y$ as:

$$ i_X = \arg\max_{i} D_X(i, j) \tag{2.2} $$

$$ i_Y = \arg\max_{i} D_Y(i, |s_r| - j) \tag{2.3} $$

With this method, finding all splits with maximum score does not necessitate backtracking through a dynamic programming matrix (which is a worst case exponential operation).

Additionally, we add the constraint that $D_X(i, j)$ and $D_Y(i, |s_r| - j)$ must surpass a threshold score $m_{anchor} = m \cdot n_{anchor}$, where $m$ is a score for each matched nucleotide and $n_{anchor}$ is the minimum number of nucleotides of a potential split read that must align to $S_X$ at one end of the read and $S_Y$ at the other end. Thresholding at $m_{anchor}$ prevents us from having to consider the case where the majority of a read aligns to $S_X$ whereas the terminal nucleotide matches erroneously to many locations in $S_Y$, and vice versa.
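
The split search, including the anchoring constraint, can be sketched as below. This is an illustrative sketch under stated assumptions, not the deFuse code: the score matrices are taken as given (their construction is described in the Supplementary Methods), they are stored as lists of rows with one column per read prefix length 0 to |s_r|, positions are 0-based, and "surpass" is interpreted as greater than or equal to the threshold. The function name and the toy matrices are hypothetical.

```python
def best_splits(DX, DY, read_len, m=2, n_anchor=4):
    """Return (best_score, splits), where splits lists every (iX, iY, j)
    that attains the maximum combined score.

    DX[i][j]: score for the first j read nucleotides ending at position i of S_X.
    DY[i][k]: score for the last k read nucleotides ending at position i of
              the reversed S_Y.
    """
    m_anchor = m * n_anchor
    best_score, best_js = None, []
    # One pass over the columns: O(|s_r||S_X| + |s_r||S_Y|) overall.
    for j in range(1, read_len + 1):
        x_max = max(row[j] for row in DX)
        y_max = max(row[read_len - j] for row in DY)
        # Both sides of the split must be anchored (assumed: >= threshold).
        if x_max < m_anchor or y_max < m_anchor:
            continue
        score = x_max + y_max
        if best_score is None or score > best_score:
            best_score, best_js = score, [j]
        elif score == best_score:
            best_js.append(j)
    splits = []
    for j in best_js:
        x_max = max(row[j] for row in DX)
        y_max = max(row[read_len - j] for row in DY)
        xs = [i for i, row in enumerate(DX) if row[j] == x_max]
        ys = [i for i, row in enumerate(DY) if row[read_len - j] == y_max]
        splits.extend((ix, iy, j) for ix in xs for iy in ys)
    return best_score, splits


# Toy matrices for a read of length 4 (values are made up).
DX = [[0, 1, 0, 0, 0], [0, 0, 2, 0, 0], [0, 0, 0, 1, 0]]
DY = [[0, 1, 0, 0, 0], [0, 0, 2, 0, 0], [0, 0, 0, 3, 0]]
score, splits = best_splits(DX, DY, read_len=4, m=1, n_anchor=1)
```

Because only column maxima are needed, no traceback through either matrix is performed; ties in the score simply yield several triples, which are resolved later using other candidate split reads.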

Multiple splits $(i_X, i_Y, j)$ that produce the same maximum score occur for two reasons. Multiple splits with different values for $j$ often imply sequence similarity between the regions on each side of the fusion boundary. As a result, the subsequence of $s_r$ that represents the fusion boundary will align equally well to $S_X$ or $S_Y$. Given a single value for $j$, multiple values for $i_X$ or $i_Y$ are often the result of one end of the split read aligning to many places because that end is short enough that no unique alignment exists. For the aforementioned situations, the only way to resolve the true split alignment is to consider the split alignments in the context of the alignments of other candidate split reads.

We resolve the problem of multiple split alignments as follows. We first cluster together split alignments that corroborate the same fusion boundaries $i_X$, $i_Y$. In the unlikely event that a single read produces multiple alignments with the same $i_X$, $i_Y$, we select the one with the maximum score. For each split alignment in each cluster we calculate the anchoring score $\min(D_X(i_X, j), D_Y(i_Y, |s_r| - j))$ and select the cluster that maximizes the sum of the anchoring scores across all split alignments in that cluster. Maximizing the sum of the anchoring scores has two effects: it attempts to ensure that the fusion boundary is centered in the middle of the reads where multiple fusion boundaries are possible, and it prevents a number of spurious alignments anchored only by a few nucleotides from eclipsing a more interesting fusion boundary prediction.
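
As a sketch of the cluster selection just described, the following groups split alignments by boundary and picks the boundary with the largest summed anchoring score. The tuple layout, the function name and the use of the combined score to choose among duplicate alignments of one read are assumptions made for illustration.

```python
from collections import defaultdict


def select_boundary(split_alignments):
    """Pick the fusion boundary with the largest summed anchoring score.

    split_alignments: iterable of (read_id, iX, iY, score_x, score_y),
    where score_x = D_X(iX, j) and score_y = D_Y(iY, |s_r| - j).
    Returns (boundary, totals) with boundary = (iX, iY).
    """
    # Keep one alignment per (boundary, read): the one with maximum score.
    best = {}
    for read_id, ix, iy, sx, sy in split_alignments:
        key = ((ix, iy), read_id)
        if key not in best or sx + sy > sum(best[key]):
            best[key] = (sx, sy)
    # Sum the anchoring score min(score_x, score_y) within each cluster.
    totals = defaultdict(int)
    for (boundary, _read), (sx, sy) in best.items():
        totals[boundary] += min(sx, sy)
    return max(totals, key=totals.get), dict(totals)


alignments = [
    ("r1", 10, 20, 30, 30),
    ("r1", 12, 22, 56, 4),
    ("r2", 10, 20, 20, 40),
    ("r2", 12, 22, 58, 2),
    ("r3", 12, 22, 50, 10),
]
boundary, totals = select_boundary(alignments)
```

In this made-up input the boundary (12, 22) is supported by more reads, but each is anchored by only a few nucleotides on one side, so the boundary (10, 20) has the larger summed anchoring score and is selected.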

Corroborating spanning read and split read evidence

To test the corroboration between spanning read and split read evidence, we first use the fusion boundary predicted by the split reads to calculate the inferred fragment lengths $\{l_i\}$ of the $n$ spanning reads that support the fusion prediction. We then test the hypothesis that these $n$ fragment lengths were drawn at random from the same distribution that generated the fragment lengths of concordant paired end reads. We model that distribution as $P(L) = N(\mu, \sigma^2)$ and use a z-test to calculate the p-value for the hypothesis that the set $\{l_i\}$ was generated by $P(L)$. A dependence between the fragment lengths of reads spanning the same fusion boundary means that the sample variance of the set $\{l_i\}$ includes a covariance term; the sample variance of $\{l_i\}$ can be calculated as described in Supplementary Methods (Text S1). We call the resulting value the corroboration p-value and use it to discriminate between true and false positives.
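
A minimal sketch of such a z-test is given below. It is not the deFuse calculation: the covariance-adjusted variance from the Supplementary Methods is not reproduced here, so the variance of the mean is left as a parameter and defaults to the independent-reads value σ²/n, which is a simplifying assumption. The two-sided form of the test and all names are likewise assumptions.

```python
import math


def corroboration_p_value(lengths, mu, sigma, var_of_mean=None):
    """Two-sided z-test p-value for inferred spanning fragment lengths.

    lengths: inferred fragment lengths {l_i} of the spanning reads.
    mu, sigma: mean and standard deviation of the concordant
               fragment length distribution P(L).
    var_of_mean: variance of the mean of {l_i}; if omitted, reads are
                 treated as independent (assumption), giving sigma^2 / n.
    """
    n = len(lengths)
    if var_of_mean is None:
        var_of_mean = sigma ** 2 / n
    z = (sum(lengths) / n - mu) / math.sqrt(var_of_mean)
    # P(|Z| >= |z|) for a standard normal Z.
    return math.erfc(abs(z) / math.sqrt(2))


# Made-up lengths: the first set is centred on mu, the second is not.
p_consistent = corroboration_p_value([190, 200, 210, 200], mu=200, sigma=30)
p_inconsistent = corroboration_p_value([300, 310, 290, 300], mu=200, sigma=30)
```

A split read boundary that is inconsistent with the spanning reads shifts the inferred fragment lengths away from μ and drives the p-value towards zero, which is why low values indicate likely false positives.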

2.2.4 Alignment parameters used for this study

To classify paired end reads as concordant we aligned reads to spliced genes, the genome, and UniGene sequences using bowtie [82] in paired end mode. We also aligned reads to spliced and unspliced genes in single end mode with parameters -k 100 -m 100. We classified any paired end read as concordant if both ends aligned to the same gene, regardless of the location of the alignment in that gene. Paired end reads aligning with one or both ends to ribosomal RNA sequences were removed from the analysis, as has been done previously [138]. Paired end reads not classified as concordant were classified as discordant. Single end mode alignments of discordant paired end reads were then classified as fully aligned or single end anchored. Fully aligned discordant paired end reads were clustered with α = 0.05 and the maximum parsimony solution was found using the algorithm given above. Split alignments were generated using $n_{anchor} = 4$, $m = 2$, $u = -1$, $g = -2$ (see Supplementary Methods, Text S1). Finally, a predicted fusion sequence was assembled that included the regions in each gene to which spanning reads aligned, joined together at the fusion boundary predicted by the split reads.

2.2.5 Annotation of each prediction

Predicted fusion sequences were annotated as open reading frame preserving, 5’ or 3’ UTR exchanges, interchromosomal, inversion, eversion, and between adjacent genes. The translational phase for each coding nucleotide was calculated using the frame column for each exon in the Ensembl GTF file. Given nucleotide x of 5’ gene A with phase ph(x) spliced to nucleotide y of 3’ gene B with phase ph(y), if ph(x) = (ph(y) + 1) mod 3, the fusion A-B is annotated as open reading frame preserving. Note that this method would not detect open reading frames of novel proteins, only those of chimeric proteins that are a combination of the protein sequences of the fused genes. Results were also annotated for their position (UTR, exonic, intronic, coding, upstream, downstream) within each gene.
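
The phase condition can be illustrated with a short sketch. The helper that derives a nucleotide's phase from an exon's frame value assumes the GTF convention that the frame is the number of bases to skip from the start of the feature to reach the next codon start; that helper, and both function names, are assumptions for illustration rather than the deFuse implementation.

```python
def nucleotide_phase(exon_frame, offset):
    """Phase of the coding nucleotide at 0-based `offset` into an exon
    whose GTF frame value is `exon_frame` (assumed convention: bases to
    skip before the next codon start)."""
    return (exon_frame - offset) % 3


def preserves_reading_frame(phase_x, phase_y):
    """True if ph(x) = (ph(y) + 1) mod 3 for 5' nucleotide x spliced to
    3' nucleotide y."""
    return phase_x % 3 == (phase_y + 1) % 3


# Hypothetical example: x is 4 bases into a frame-0 exon of gene A,
# y is the first base of a frame-1 exon of gene B.
ph_x = nucleotide_phase(0, 4)
ph_y = nucleotide_phase(1, 0)
in_frame = preserves_reading_frame(ph_x, ph_y)
```

Under the assumed convention the phase decreases by one (mod 3) with each coding nucleotide, so two adjacent nucleotides of an unbroken reading frame always satisfy the condition.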

2.2.6 Classiﬁcation of predictions as real fusions or false positives

We computed a set of features to better characterize our predicted fusions. The features were calculated for each fusion prediction with the aim of discriminating between true and false positives. We initially lacked a set of positive and negative controls that would have been necessary for a principled machine learning based classification method. Thus initial validation candidates were identified by thresholding these features at levels we suspected would enrich for real fusions (see Results). Validation was also attempted for suboptimal predictions and 40 randomly chosen predictions in order to establish a set of negative controls. Once we had performed a significant number of validations, these validations became the training set for a classifier. We calculated the following 11 features for the examples in our training set (detailed descriptions in Supplementary Methods, Text S1):

Spanning read coverage: Normalized spanning read coverage.

Split position p-value: P-value for the hypothesis that the split position statistic was calculated from split reads that are evenly distributed across the fusion boundary.

Minimum split anchor p-value: P-value for the hypothesis that the minimum split anchor statistic was calculated from split reads that are evenly distributed across the fusion boundary.

Corroboration p-value: P-value for the hypothesis that the lengths of reads spanning the fusion boundary were drawn from the fragment length distribution.

Concordant ratio: Proportion of spanning reads supporting a fusion that have a concordant alignment using blat with default parameters.

Fusion boundary di-nucleotide entropy: Di-nucleotide entropy calculated 40 nt upstream and downstream of the fusion boundary for the predicted sequence, taking the minimum of both values.

Fusion boundary homology: Number of homologous nucleotides in each gene at the predicted fusion boundary.

cDNA adjusted percent identity: Maximum adjusted percent identity for the alignments of the predicted sequence to any cDNA.

Genome adjusted percent identity: Maximum adjusted percent identity for the alignments of the predicted sequence to the genome.

EST adjusted percent identity: Maximum adjusted percent identity for the alignments of the predicted sequence to any EST.

EST island adjusted percent identity: Maximum adjusted percent identity for the alignments of the predicted sequence to any EST island.

We then used the ada (2.0-2) package in R (2.11.0) to train an adaboost model using the stochastic gradient boosting algorithm with exponential loss, discrete boosting, and decision stumps as the base classiﬁer[107]. We used conservative regularization (shrinkage parameter ν = 0.1) and permitted the algorithm 200 iterations. Adaboost was selected because it would enable us to leverage the weak predictive power of individual features, and would provide a straightforward way of evaluating the predictive power of each feature. Finally, the classiﬁer was used to classify all predictions for our ovarian and sarcoma datasets.

2.2.7 Implementation, availability and data resources

deFuse is implemented in C++, Perl and R. A typical library of 120,000,000 paired end reads completes in approximately 6 hours using a cluster of 100 compute nodes. The human genome (NCBI36) and gene models in GTF format (Ensembl 54) were downloaded from Ensembl [12]. EST sequences and spliced EST alignments were downloaded from UCSC [131]. UniGene sequences were downloaded from NCBI [127]. This study focused on the 21295 genes annotated as protein_coding, processed_transcript, IG_C_gene, IG_D_gene, IG_J_gene, and IG_V_gene in the Ensembl GTF file (see Table S10 for Ensembl IDs for each gene). An unspliced gene sequence is composed of the genomic sequence starting 2kb upstream of the most upstream exonic nucleotide of that gene’s splice variants, and ending 2kb downstream of the most downstream exonic nucleotide of that gene’s splice variants. A spliced gene sequence is composed of the concatenated sequences of each exon of a single splice variant of a gene. Our reference sequences comprised 21295 unspliced gene sequences and 46662 spliced gene sequences. All data files are available as part of the deFuse software package at http://compbio.bccrc.ca.

2.3 Results

2.3.1 Application to ovarian and sarcoma datasets

Fusion sequence predictions were obtained for the 44 datasets as detailed in Methods. This study only considered sequence predictions supported by ﬁve or more spanning reads and one or more split reads, though theoretically the limit on the number of spanning reads could be lowered for smaller datasets as was done for the melanoma datasets. The total number of unﬁltered predictions at this stage numbered 20,327.

Assembling a set of positive and negative controls

Next we assembled a set of positive and negative controls by attempting to validate a selection of predictions potentially representing real fusions, and another set of predictions representing systematic artifacts. To select potential positives, we ﬁrst used the following set of heuristic ﬁlters to enrich for real fusions, producing a subset consisting of 268 predictions.

Spanning read count > 5
Split read count > 3
Spanning read coverage > 0.6
Split position p-value > 0.1
Minimum split anchor p-value > 0.1
Corroboration p-value > 0.1
Concordant ratio < 0.1
cDNA adjusted percent identity < 0.1
Genome adjusted percent identity < 0.1
EST adjusted percent identity < 0.3
EST island adjusted percent identity < 0.3
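
These filters amount to a conjunction of simple threshold tests, which could be applied to a table of predictions roughly as follows. The dictionary keys and function names are hypothetical; only the comparison directions and threshold values are taken from the list above.

```python
import operator

# (feature key, comparison, threshold); the keys are hypothetical names.
HEURISTIC_FILTERS = [
    ("spanning_read_count", operator.gt, 5),
    ("split_read_count", operator.gt, 3),
    ("spanning_read_coverage", operator.gt, 0.6),
    ("split_position_p_value", operator.gt, 0.1),
    ("min_split_anchor_p_value", operator.gt, 0.1),
    ("corroboration_p_value", operator.gt, 0.1),
    ("concordant_ratio", operator.lt, 0.1),
    ("cdna_adjusted_percent_identity", operator.lt, 0.1),
    ("genome_adjusted_percent_identity", operator.lt, 0.1),
    ("est_adjusted_percent_identity", operator.lt, 0.3),
    ("est_island_adjusted_percent_identity", operator.lt, 0.3),
]


def failed_filters(prediction):
    """Names of the heuristic filters that a prediction fails."""
    return [name for name, compare, threshold in HEURISTIC_FILTERS
            if not compare(prediction[name], threshold)]


def passes_heuristic_filters(prediction):
    """True if the prediction satisfies every heuristic filter."""
    return not failed_filters(prediction)


# Made-up feature values for a single prediction.
prediction = {
    "spanning_read_count": 12, "split_read_count": 4,
    "spanning_read_coverage": 0.8, "split_position_p_value": 0.5,
    "min_split_anchor_p_value": 0.4, "corroboration_p_value": 0.3,
    "concordant_ratio": 0.0, "cdna_adjusted_percent_identity": 0.05,
    "genome_adjusted_percent_identity": 0.02,
    "est_adjusted_percent_identity": 0.1,
    "est_island_adjusted_percent_identity": 0.1,
}
keep = passes_heuristic_filters(prediction)
```

Returning the list of failed filters, rather than only a boolean, makes it straightforward to select predictions that fail at least one filter, as was done for the recurrent artifact candidates.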

From the 268 filtered predictions we selected 46 predictions, and in doing so attempted to select predictions from libraries with a range of read lengths, such that those predictions covered a large range of values for the spanning and split read counts. Included in this set of 46 predictions were all eight predictions that pass the heuristic filters and involve a cancer associated gene from the cancer gene census (Wellcome Trust Sanger Institute Cancer Genome Project web site, http://www.sanger.ac.uk/genetics/CGP). Out of this set of 46 predictions, 42 were successfully validated by RT-PCR (Table S5, Dataset S1). Next we selected 14 predictions representing potential recurrent artifacts, requiring that each of the 14 fail at least one of our heuristic filters, and also requiring that each was predicted to exist in two or more libraries. None of these predictions validated. Finally, we selected 40 predictions at random from the unfiltered list of 20,327 with the assumption that the majority of them would be negative. Only one of the 40 randomly selected predictions validated as real.

In total, 45 predictions were validated by RT-PCR (Table 2.1). This included 42 predictions from the 46 potential positives, one prediction from the 40 randomly selected fusions, and two more predictions nominated by FusionSeq (see Section 2.3.2). Fluorescent in situ hybridization (FISH) assays were attempted for 17 of the 45 PCR validated predictions, with 14 resulting in positive identification of a potentially causative underlying genomic aberration (Table S4, Dataset S2).

Classiﬁcation of ovarian and sarcoma predictions

We were interested in building a classifier that could discriminate between real fusions and false positives. As a training set, we compiled a list of all ovarian and sarcoma fusions for which validation was attempted, and added to this list the 11 melanoma fusions, the three K-562 fusions and the TMPRSS2-ERG fusion in NCI-H660. The resulting dataset contained 60 positive and 61 negative predictions (Table S8). The training set was used to train an adaboost model as described in Methods. Training error for the model was 0.017, representing two negatives misclassified as positives. We used a leave one out method to calculate adaboost probability estimates for each point in the training data. We then used the resulting set of probability estimates to generate an ROC curve (Figure 2.4) and calculated the conditional AUC from the ROC curve as 0.91. Given a target false positive rate of 10%, we calculated the estimated true positive rate as 82%, and a threshold of 0.81 on the probability estimate to achieve that target. Finally we identified the three most significant features as the two p-values calculated for the split alignment positions and the corroboration p-value (Figure 2.5).

Next we used the adaboost model to classify all remaining ovarian and sarcoma predictions to produce a final set of predictions for the ovarian and sarcoma datasets, thresholding the probability estimates produced by the adaboost model at 0.81. In total we predicted 2,540 gene fusions across all RNA-Seq datasets (Table S1). The majority of the 2,540 events, 1,658 in total, were predicted to involve adjacent genes and were not predicted to be the result of an underlying inversion or eversion. For these events, a skipped transcription stop site or alternative transcription start site is an alternative explanation to that of an underlying genomic deletion. Of the remaining 882 events, 394 were inter-chromosomal and 488 were intra-chromosomal. The intra-chromosomal events can be further subdivided into 240 inversions, 131 eversions and 117 deletions affecting non-adjacent genes (Table 2.2).

Table 2.1: RT-PCR validated novel deFuse predictions. RNA-Seq evidence, annotation information and validation information is shown for each prediction for which validation by PCR was attempted.

library | 5’ gene | 3’ gene | span count | ambig span count | split read count | exon bndry, inter. expr., prom. exch., FISH valid., CNV break | split pos. p-value | corrob. p-value | min anchor p-value
CCC1 | TYW1 | HGSNAT | 41 | 38 | 11 | ••/◦ ◦ ◦ | 0.92 | 1 | 0.65
CCC4 | TNS3 | PKD1L1 | 12 | 0 | 4 | ◦ ◦/•• | 0.82 | 0.39 | 0.68
CCC9 | RPN2 | PMEPA1 | 48 | 0 | 11 | • ◦/◦ ◦ • | 0.83 | 0.88 | 0.34
CCC9 | TLX3 | RANBP17 | 10 | 0 | 5 | ◦ •/◦ ◦ | 0.51 | 0.63 | 0.38
CCC12 | ITCH | RALY | 59 | 0 | 6 | • ◦/• • ◦/◦ | 0.99 | 0.88 | 0.78
CCC12 | MTHFD1 | C1orf61 | 27 | 8 | 11 | • ◦/••••/• | 0.53 | 0.67 | 0.79
CCC12 | YTHDF2 | SYTL1 | 53 | 0 | 19 | • ◦/◦ ◦ •/• | 0.75 | 0.94 | 0.81
CCC13 | PPME1 | MRPL48 | 69 | 0 | 22 | ••/◦ ◦ | 0.48 | 0.76 | 0.41
CCC14 | EPCAM | DLEC1 | 27 | 0 | 17 | • ◦/◦ ◦ ◦ | 0.98 | 0.72 | 0.91
CCC15 | AFF4 | LAMC3 | 5 | 0 | 3 | • ◦/• • ◦/◦ | 0.64 | 0.89 | 0.54
CCC15 | ARSB | DMGDH | 103 | 0 | 87 | ••/◦ ◦ ◦/◦ | 0.92 | 0.8 | 0.21
CCC15 | KIFC3 | CNGB1 | 14 | 0 | 16 | • ◦/• • ◦/◦ | 0.36 | 1 | 0.45
CCC15 | NUMB | ALDH6A1 | 22 | 0 | 12 | • ◦/◦ ◦ ◦/◦ | 0.98 | 0.85 | 0.81
CCC15 | PVRL2 | LMNA | 17 | 2 | 7 | • ◦/◦ ◦ ◦ ◦/◦ | 0.39 | 0.33 | 0.67
CCC15 | SLC38A10 | ZCCHC11 | 12 | 0 | 1 | • ◦/◦ ◦ ◦/◦ | 0.28 | 1 | 0.29
CCC15 | TMEM63A | NRD1 | 17 | 0 | 7 | ◦ ◦/◦ ◦ • ◦/• | 0.57 | 0.83 | 0.85
CCC15 | UBR4 | JMJD2B | 27 | 0 | 18 | ◦ ◦/• ◦ • •/• | 0.5 | 1 | 0.56
CCC16 | HPS5 | APOO | 23 | 3 | 11 | ••/◦ ◦ • | 0.67 | 0.49 | 0.35
CCC16 | PAPOLA | HIP1R | 44 | 0 | 19 | • ◦/◦ ◦ • | 0.89 | 0.62 | 0.57
CCC16 | PPL | RBKS | 10 | 0 | 14 | • ◦/◦ ◦ • | 0.81 | 0.7 | 0.43
EMD6 | BCAS3 | ARHGAP15 | 10 | 0 | 4 | • ◦/◦ ◦ • ◦/◦ | 0.42 | 0.73 | 0.81
EMD6 | CAMK2G | DDX1 | 9 | 0 | 2 | • ◦/◦ ◦ ◦/◦ | 0.28 | 1 | 0.46
EMD6 | CYB5D2 | ANKFY1 | 6 | 0 | 1 | ••/◦ ◦ ◦/◦ | 0.65 | 0.82 | 0.75
EMD6 | EIF4G3 | LRRC8D | 7 | 0 | 4 | • ◦/◦ ◦ ◦/◦ | 0.19 | 0.96 | 0.47
EMD6 | ROCK1 | CMKLR1 | 13 | 0 | 8 | • ◦/◦ ◦ • ◦/• | 0.62 | 0.31 | 0.82
GRC5 | FBXO25 | BET1L | 8 | 5 | 3 | • ◦/◦ ◦ | 0.86 | 0.37 | 0.68
GRC5 | PCP4L1 | SDHC | 7 | 7 | 10 | • ◦/◦ ◦ | 0.71 | 0.18 | 0.68
HGS1 | CAPNS1 | WDR62 | 7 | 0 | 11 | • ◦/• ◦ | 0.55 | 1 | 0.85
HGS1 | LETM1 | USP15 | 7 | 1 | 5 | • ◦/◦ ◦ | 0.95 | 0.74 | 0.59
HGS1 | RAB6A | USP43 | 14 | 9 | 6 | • ◦/◦ ◦ • | 0.45 | 0.81 | 0.5
HGS3 | ELL | CYLN2 | 15 | 0 | 8 | • ◦/• ◦ ◦/• | 0.85 | 1 | 0.56
HGS3 | FRYL | SH2D1A | 27 | 0 | 7 | • ◦/••••/◦ | 0.9 | 1 | 0.62
HGS3 | GTF2I | PGPEP1 | 34 | 0 | 3 | ◦ ◦/◦ ◦ •/◦ | 0.15 | 1 | 0.3
HGS3 | PRR12 | FLT3LG | 20 | 0 | 11 | ••/◦ ◦ •/• | 0.72 | 1 | 0.24
HGS4 | FLNB | VPS8 | 95 | 4 | 51 | ◦ ◦/• ◦ • •/• | 0.8 | 0.72 | 0.59
HGS4 | LMF1 | UMOD | 15 | 0 | 7 | ••/•••/◦ | 0.88 | 1 | 0.4
HGS4 | SLC37A1 | ABCG1 | 40 | 0 | 14 | • ◦/◦ ◦ •/• | 0.83 | 1 | 0.42
HGS4 | STK3 | NPAL2 | 7 | 0 | 3 | ••/◦ ◦ ◦/◦ | 0.69 | 0.79 | 0.13
MUC1 | ERBB2 | PERLD1 | 25 | 0 | 11 | • ◦/◦ ◦ •/◦ | 0.84 | 1 | 0.75
MUC1 | KIAA0355 | UQCRC1 | 10 | 0 | 6 | • ◦/◦ ◦ ◦/◦ | 0.82 | 0.76 | 0.44
YKS2 | C12orf48 | MYBPC1 | 8 | 0 | 6 | • ◦/◦ ◦ ◦/◦ | 0.63 | 0.84 | 0.2
SARC1 | CMKLR1 | HNF1A | 38 | 0 | 7 | ◦ •/• ◦ | 0.72 | 0.84 | 0.11
SARC1 | ERBB3 | CRADD | 103 | 7 | 41 | ◦ •/◦ ◦ • | 0.87 | 0.88 | 0.61
SARC2 | SMARCB1 | WASF2 | 16 | 14 | 4 | • ◦/◦ ◦ ◦/◦ | 0.38 | 0.64 | 0.66
SARC3 | RREB1 | TFE3 | 103 | 0 | 28 | ••/◦ ◦ • ◦/◦ | 0.93 | 1 | 0.3

Figure 2.4: deFuse ROC Curve. ROC curve for deFuse annotated with the threshold for the adaboost probability estimate. The threshold corresponds to a false positive rate of 10% and true positive rate of 82%.

Figure 2.5: Variable importance plot for deFuse classifier. Relative importance of each of the 11 features used by deFuse classifier.

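Table 2.2: Summary of RNA-Seq statistics and fusion predictions across all samples.

Case | Type | Reads (Millions) | Read Length | Fragment Mean | Fragment Std. Dev. | Total Fusions | In-frame | Inter-chr. | Intra-chr. | Read-through | Inversion | Eversion | Deletion
SBOT | LGS | 28 | 36-42 | 210 | 38 | 24 | 2 | 2 | 22 | 17 | 3 | 1 | 1
CCC1 | CCC | 18 | 50 | 282 | 36 | 49 | 6 | 10 | 39 | 30 | 1 | 3 | 5
CCC2 | CCC | 38 | 50 | 198 | 29 | 27 | 0 | 5 | 22 | 18 | 3 | 0 | 1
CCC3 | CCC | 37 | 50 | 209 | 27 | 34 | 2 | 6 | 28 | 22 | 3 | 0 | 3
CCC4 | CCC | 20 | 50 | 249 | 41 | 55 | 7 | 7 | 48 | 33 | 7 | 8 | 0
CCC5 | CCC | 32 | 36-42 | 245 | 36 | 26 | 1 | 6 | 20 | 17 | 2 | 0 | 1
CCC6 | CCC | 32 | 36-42 | 234 | 38 | 14 | 3 | 0 | 14 | 10 | 1 | 1 | 2
CCC7 | CCC | 19 | 50 | 259 | 39 | 48 | 4 | 12 | 36 | 21 | 4 | 6 | 5
CCC8 | CCC | 39 | 36-42 | 242 | 38 | 41 | 7 | 12 | 29 | 15 | 2 | 10 | 2
CCC9 | CCC | 38 | 50 | 265 | 41 | 62 | 13 | 10 | 52 | 35 | 5 | 6 | 6
CCC10 | CCC | 37 | 50 | 278 | 38 | 97 | 10 | 12 | 85 | 75 | 5 | 1 | 4
CCC11 | CCC | 53 | 36-42 | 259 | 39 | 64 | 2 | 24 | 40 | 33 | 4 | 2 | 1
CCC12 | CCC | 36 | 36-42 | 244 | 31 | 40 | 7 | 10 | 30 | 18 | 8 | 4 | 0
CCC13 | CCC | 31 | 50 | 263 | 35 | 74 | 10 | 15 | 59 | 49 | 3 | 6 | 1
CCC14 | CCC | 40 | 50 | 250 | 39 | 82 | 8 | 13 | 69 | 56 | 6 | 4 | 3
CCC15 | CCC | 40 | 50 | 189 | 29 | 53 | 6 | 16 | 37 | 19 | 5 | 9 | 4
CCC16 | CCC | 41 | 50 | 229 | 27 | 80 | 2 | 16 | 64 | 46 | 5 | 8 | 5
EMD1 | EMD | 32 | 36-50 | 187 | 35 | 62 | 5 | 9 | 53 | 37 | 9 | 1 | 6
EMD2 | EMD | 30 | 42-50 | 208 | 33 | 64 | 7 | 2 | 62 | 50 | 6 | 5 | 1
EMD3 | EMD | 33 | 50 | 227 | 31 | 40 | 4 | 6 | 34 | 30 | 4 | 0 | 0
EMD4 | EMD | 38 | 50 | 242 | 33 | 58 | 7 | 3 | 55 | 41 | 8 | 3 | 3
EMD5 | EMD | 39 | 50-75 | 244 | 29 | 49 | 6 | 3 | 46 | 38 | 4 | 2 | 2
EMD6 | EMD | 39 | 50 | 246 | 34 | 85 | 11 | 12 | 73 | 45 | 11 | 11 | 6
EMD7 | EMD | 25 | 42-50 | 211 | 33 | 23 | 4 | 3 | 20 | 15 | 3 | 1 | 1
EMD8 | EMD | 30 | 50-75 | 189 | 31 | 51 | 7 | 3 | 48 | 40 | 2 | 3 | 3
GRC1 | GRC | 58 | 36-50 | 206 | 39 | 105 | 14 | 10 | 95 | 78 | 8 | 3 | 6
GRC2 | GRC | 74 | 36-42 | 183 | 39 | 95 | 5 | 15 | 80 | 60 | 12 | 0 | 8
GRC3 | GRC | 31 | 36-42 | 196 | 37 | 38 | 3 | 5 | 33 | 29 | 3 | 1 | 0
GRC4 | GRC | 34 | 36-42 | 172 | 34 | 46 | 5 | 7 | 39 | 27 | 7 | 1 | 4
GRC5 | GRC | 41 | 50-75 | 247 | 31 | 101 | 9 | 8 | 93 | 71 | 16 | 0 | 6
HGS1 | HGS | 39 | 50 | 241 | 37 | 73 | 6 | 9 | 64 | 51 | 8 | 2 | 3
HGS2 | HGS | 29 | 50 | 278 | 38 | 75 | 12 | 8 | 67 | 58 | 5 | 1 | 3
HGS3 | HGS | 26 | 37-42 | 211 | 34 | 80 | 7 | 15 | 65 | 59 | 3 | 2 | 1
HGS4 | HGS | 30 | 36-42 | 209 | 33 | 54 | 3 | 11 | 43 | 20 | 8 | 10 | 5
HGS5 | HGS | 33 | 50 | 220 | 25 | 92 | 7 | 11 | 81 | 65 | 7 | 6 | 3
LGS1 | LGS | 35 | 50 | 242 | 26 | 47 | 8 | 3 | 44 | 34 | 9 | 0 | 1
MUC1 | MUC | 42 | 36-50 | 208 | 30 | 66 | 8 | 11 | 55 | 44 | 6 | 3 | 2
MUC2 | MUC | 33 | 36 | 224 | 31 | 61 | 9 | 11 | 50 | 37 | 10 | 1 | 2
SCH1 | SCH | 24 | 50-75 | 210 | 30 | 43 | 3 | 11 | 32 | 27 | 5 | 0 | 0
SCH2 | SCH | 35 | 36-50 | 201 | 31 | 46 | 0 | 6 | 40 | 34 | 4 | 1 | 1
YKS1 | YKS | 46 | 50 | 249 | 27 | 44 | 6 | 5 | 39 | 34 | 3 | 1 | 1
YKS2 | YKS | 40 | 50 | 252 | 31 | 49 | 5 | 11 | 38 | 32 | 3 | 1 | 2
SARC1 | EPS | 19 | 50 | 263 | 35 | 39 | 1 | 6 | 33 | 27 | 3 | 2 | 1
SARC2 | EPS | 28 | 36-50 | 333 | 36 | 69 | 10 | 10 | 59 | 51 | 6 | 0 | 2
SARC3 | IGMS | 17 | 50 | 233 | 33 | 15 | 2 | 4 | 11 | 10 | 0 | 1 | 0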

LGS: Low Grade Serous, HGS: High Grade Serous, CCC: Clear cell carcinoma, EMD: Endometrioid tumor, MUC: Mucinous tumor, YKS: Yolk sac tumor, GRC: Granulosa cell tumor, SCH: Small cell hypercalcemic, EPS: Epithelioid Sarcoma, IGMS: intermediate grade myofibroblastic sarcoma.

2.3.2 deFuse has higher sensitivity and speciﬁcity than competing methods

Comparison to FusionSeq and MapSplice

We analyzed CCC15, CCC16 and EMD6 with MapSplice version 1.14.1 and FusionSeq version 0.6.1 in order to compare the sensitivity of these methods with that of deFuse (Supplementary Methods, Text S1, Dataset S3, Dataset S4). These cases were chosen because they ranked highest with respect to the number of validated predictions, with seven, three and four validated predictions respectively. FusionSeq successfully identified ten of a potential 14 validated deFuse predictions in CCC15, CCC16 and EMD6 (Table 2.1), whereas MapSplice did not recover any of these fusions. We suspected that MapSplice might perform better on RNA-Seq data with longer read lengths. Thus we also used MapSplice to predict fusions using the 75mer reads from GRC5. GRC5 was chosen because it is the only case with 75mers and validated deFuse fusion predictions. MapSplice successfully identified the sequences of both validated fusions in GRC5.

We also attempted to establish whether FusionSeq and MapSplice could identify real fusions in our data that deFuse should have been able to identify, but did not. To this end, we identified all MapSplice and FusionSeq predictions for which there were no corresponding deFuse predictions in the set of initial predictions produced by the heuristic filters. We also removed MapSplice and FusionSeq predictions that did not involve Ensembl annotated genes because it would be impossible for deFuse to identify those events. From this list we selected 14 MapSplice predictions and eight FusionSeq predictions that we considered to have the highest likelihood of successful validation according to a variety of conservative criteria (Supplementary Methods, Text S1). For MapSplice, we attempted PCR validation for seven predictions from 50mer libraries CCC15, CCC16 and EMD6, and seven predictions from 75mer libraries SCH1, EMD5 and GRC5, with all 14 failing to produce a PCR product. For FusionSeq, we attempted PCR validation for eight predictions from CCC15, CCC16 and EMD6, three of which validated (Table 2.3).

Table 2.3: Fusion predictions compared between deFuse and FusionSeq. Comparison of deFuse using heuristic filters (deFuse thresholds) and deFuse using a classifier (deFuse classifier) with FusionSeq.

library | 5’ gene | 3’ gene | FusionSeq | deFuse thresholds | deFuse classifier | PCR validated
CCC15 | UBR4 | JMJD2B | • | • | • | •
CCC15 | TMEM63A | NRD1 | • | • | • | •
CCC15 | PVRL2 | LMNA | • | • | • | •
CCC15 | ARSB | DMGDH | ◦ | • | • | •
CCC15 | NUMB | ALDH6A1 | ◦ | • | • | •
CCC15 | KIFC3 | CNGB1 | ◦ | • | • | •
CCC15 | AFF4 | LAMC3 | ◦ | • | • | •
CCC16 | PAPOLA | HIP1R | • | • | • | •
CCC16 | HPS5 | APOO | • | • | • | •
CCC16 | PPL | RBKS | • | • | • | •
EMD6 | BCAS3 | ARHGAP15 | • | • | • | •
EMD6 | EIF4G3 | LRRC8D | • | • | • | •
EMD6 | ROCK1 | CMKLR1 | • | • | • | •
EMD6 | POLR2J2 | CLMN | • | • | • | •
EMD6 | CAMK2G | DDX1 | • | ◦ | • | •
EMD6 | CYB5D2 | ANKFY1 | • | ◦ | • | •
CCC15 | SLC38A10 | ZCCHC11 | • | ◦ | • | •
EMD6 | S100PBP | CAMK2G | • | ◦ | ◦ | ◦
CCC15 | FARSA | RAD23A | • | ◦ | • | ◦
CCC16 | ITCH | DYNLRB1 | • | ◦ | ◦ | ◦
CCC16 | PIK3C2B | SMG5 | • | ◦ | ◦ | ◦

None of the 14 MapSplice predictions had corresponding deFuse predictions, filtered or unfiltered (blat with > 90% sequence identity). However, seven of the eight FusionSeq predictions had a corresponding deFuse prediction, though all seven were filtered by the heuristic filters (Table 2.3). We initially trained the adaboost classifier on training data that did not include the seven deFuse predictions corresponding to these seven fusions identified by FusionSeq. We then used the resulting adaboost model to classify the seven deFuse predictions. The model successfully classified the three validated fusions as real, and erroneously classified as real one of the four predictions that failed to validate (Table 2.3). Subsequent analysis included the seven FusionSeq predictions as part of the 121 predictions used as training data as described in Methods.

In total there were 21 predictions with PCR results (17 positive and 4 negative) in CCC15, CCC16 and EMD6 upon which a quantitative comparison between FusionSeq and deFuse could be made. We computed the sensitivity and specificity on this data for deFuse-Threshold, deFuse-Classifier and FusionSeq. The sensitivity and specificity values were 82.3% and 100% for deFuse-Threshold; 100% and 94.4% for deFuse-Classifier; and 76.5% and 76.5% for FusionSeq (see Table 2.4).

Table 2.4: Comparison of accuracy between deFuse and FusionSeq on a subset of events predicted by either method in CCC15, CCC16 and EMD6. There were 21 PCR validations attempted including 17 positives (P) and 4 negatives (N). TP: true positives, TN: true negatives, FP: false positives, FN: false negatives, Sens = TP/P, Spec = TP/(TP + FP).

Method | P | N | TP | TN | FP | FN | Sens | Spec
deFuse-Threshold | 17 | 4 | 14 | 4 | 0 | 3 | 82.3 | 100
deFuse-Classifier | 17 | 4 | 17 | 3 | 1 | 0 | 100 | 94.4
FusionSeq | 17 | 4 | 13 | 0 | 4 | 4 | 76.5 | 76.5
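
The Sens and Spec columns of Table 2.4 follow directly from the counts, as the short sketch below shows. The function name is hypothetical; the formulas are the ones given in the table caption, where Spec = TP/(TP + FP) is the quantity usually called the positive predictive value. Note that 14/17 is approximately 82.35%, which the table reports as 82.3.

```python
def sens_spec(tp, fp, fn):
    """Percent Sens = TP / P and Spec = TP / (TP + FP), as defined in the
    caption of Table 2.4, with P = TP + FN."""
    sens = 100.0 * tp / (tp + fn)
    spec = 100.0 * tp / (tp + fp)
    return sens, spec


# Counts taken from Table 2.4.
counts = {
    "deFuse-Threshold": dict(tp=14, fp=0, fn=3),
    "deFuse-Classifier": dict(tp=17, fp=1, fn=0),
    "FusionSeq": dict(tp=13, fp=4, fn=4),
}
results = {method: sens_spec(**c) for method, c in counts.items()}
```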

Assessing the advantages of considering ambiguously aligning reads and avoiding reliance on known exon boundaries

We sought to establish the benefit of the maximum parsimony approach for resolving ambiguously aligning reads, and the dynamic programming based approach for aligning split reads to discover fusion boundaries. Each predicted fusion splice was annotated as coincident or not coincident with known Ensembl exon boundaries. The fusion splices for eight of the 45 PCR validated fusions were not predicted to coincide with Ensembl exon boundaries (Table 2.1), including CMKLR1-HNF1A and CRADD-ERBB3 involving cancer associated HNF1A and ERBB3 [145, 139, 123]. None of these eight gene fusions would be discoverable using a method that relied on the identification of reads split at known exon boundaries.

For each PCR validated fusion, we also calculated the number of spanning reads that align to a unique location in the genome, and considered the effect of an analysis restricted to considering only these reads. Such a theoretical analysis would have resulted in four fusions having fewer than the threshold of five spanning reads (Table 2.1), and one of those fusions having no spanning reads. The SMARCB1-WASF2 fusion involving SMARCB1, a gene known to be affected by genomic rearrangements in other cancers [110], would have only two reads in such an analysis. The TYW1-HGSNAT fusion with 41 spanning reads, ranked 10th by spanning read count out of the 45 PCR validated fusions, would have only three spanning reads in the restricted analysis.

An analysis that considered only uniquely aligning spanning reads and considered fusion splices at known exon boundaries would theoretically result in fewer false positives, as is apparent from the high validation rate in previous studies [138, 15]. However, such an analysis would be guaranteed to miss fusions in our datasets, including fusions involving previously described fusion partners. Considering ambiguously aligned spanning reads and performing an unbiased search for fusion splices can help to recover these false negatives, while more sophisticated techniques such as corroborating spanning and split read evidence can help to reduce the false positive rate without increasing the false negative rate.

2.3.3 Rediscovery of known gene fusions

We evaluated the ability of deFuse to rediscover known gene fusions in publicly available RNA-Seq data. Using deFuse, we searched for the TMPRSS2-ERG fusion in the NCI-H660 prostate cell line dataset, the three fusions previously identified in the CML dataset (SRA accession: SRR018269) and the 11 fusions in melanoma libraries identified by Berger et al. [15]. Since seven of the fusions in the melanoma datasets are supported by fewer than five spanning reads, we altered the configuration of deFuse for the melanoma libraries such that only two spanning reads and one split read were required for deFuse to attempt assembly of a fusion boundary sequence. For all 15 fusions, deFuse was able to assemble the correct fusion boundary sequence. We evaluated the performance of deFuse using heuristic filters (deFuse-Thresholds), and deFuse using the adaboost classifier (deFuse-Classifier), when applied to the prostate, CML and melanoma datasets (Table 2.5). Since the training set for deFuse-Classifier includes the prostate, CML and melanoma fusions, we used a leave-one-out method to classify each fusion.

Table 2.5: Results of a deFuse analysis on existing datasets with known fusions. Shown are results for both deFuse with thresholds and deFuse with the classifier (• = yes, ◦ = no).

library    5’ gene   3’ gene    span   split  corrob.  split pos.  split anchor  deFuse correct  deFuse      deFuse classifier
                                count  count  p-value  p-value     p-value       sequence        thresholds  probability
NCIH660    TMPRSS2   ERG        19     10     0.31     0.31        0.45          •               ◦           0.48
SRR018259  KCTD2     ARHGEF12   4      1      0.33     0.97        0.91          •               ◦           0.98
SRR018260  ITM2B     RB1        19     2      0.37     0.51        0.68          •               •           0.98
SRR018260  ANKHD1    C5orf32    2      2      0.45     0.09        0.05          •               ◦           0.00
SRR018261  GCN1L1    PLA2G1B    4      1      0.35     0.57        0.62          •               ◦           1.00
SRR018265  WDR72     SCAMP2     3      2      0.00     0.97        0.25          •               ◦           0.59
SRR018266  C1orf61   CCT3       54     17     0.12     0.56        0.56          •               •           0.79
SRR018266  MIXL1     PARP1      2      1      0.18     0.25        0.19          •               ◦           0.03
SRR018266  C11orf67  SLC12A7    43     24     0.92     0.55        0.75          •               •           0.99
SRR018266  GNA12     SHANK2     29     9      0.58     0.24        0.34          •               •           0.83
SRR018267  TLN1      C9orf127   3      1      0.08     0.71        0.76          •               ◦           0.91
SRR018267  ALX3      RECK       4      6      0.72     0.25        0.45          •               ◦           0.99
SRR018269  ABL1      BCR        91     14     0.68     0.68        0.67          •               •           0.97
SRR018269  SLC44A4   BAT3       27     6      0.44     0.74        0.67          •               •           0.99
SRR018269  NUP214    XKR3       67     15     0.90     0.91        0.28          •               •           1.00

deFuse-Thresholds identifies 7 of the 15 known fusions, whereas deFuse-Classifier identifies 10 of the 15 fusions. Notably, TMPRSS2-ERG is not included in the deFuse-Thresholds or the deFuse-Classifier results, primarily because the TMPRSS2-ERG prediction is an outlier on both the Spanning read coverage feature and the Fusion boundary di-nucleotide entropy feature. deFuse-Classifier assigns a probability of 0.48 to the TMPRSS2-ERG prediction. A probability threshold of 0.48 (instead of 0.81 as calculated in Section 2.3.1) would result in a true positive rate of 93% and a false positive rate of 14%. These numbers suggest that our initially selected probability threshold of 0.81 may have been overly conservative, given that we could have increased our true positive rate by 11% at the expense of only a 4% increase in false positive rate. Using a threshold of 0.48, deFuse-Classifier would recover 13 of the 15 events including TMPRSS2-ERG.
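The counts quoted above can be re-derived from the last two columns of Table 2.5. The following is a minimal sketch in Python, not part of deFuse; the function names are hypothetical, and it assumes that a prediction is accepted when its classifier probability is greater than or equal to the threshold.

```python
# Values copied from Table 2.5:
# (library, fusion, passes heuristic thresholds, classifier probability)
TABLE_2_5 = [
    ("NCIH660", "TMPRSS2-ERG", False, 0.48),
    ("SRR018259", "KCTD2-ARHGEF12", False, 0.98),
    ("SRR018260", "ITM2B-RB1", True, 0.98),
    ("SRR018260", "ANKHD1-C5orf32", False, 0.00),
    ("SRR018261", "GCN1L1-PLA2G1B", False, 1.00),
    ("SRR018265", "WDR72-SCAMP2", False, 0.59),
    ("SRR018266", "C1orf61-CCT3", True, 0.79),
    ("SRR018266", "MIXL1-PARP1", False, 0.03),
    ("SRR018266", "C11orf67-SLC12A7", True, 0.99),
    ("SRR018266", "GNA12-SHANK2", True, 0.83),
    ("SRR018267", "TLN1-C9orf127", False, 0.91),
    ("SRR018267", "ALX3-RECK", False, 0.99),
    ("SRR018269", "ABL1-BCR", True, 0.97),
    ("SRR018269", "SLC44A4-BAT3", True, 0.99),
    ("SRR018269", "NUP214-XKR3", True, 1.00),
]


def recovered_by_thresholds(rows):
    """Fusions that pass the heuristic filters (deFuse-Thresholds)."""
    return [fusion for _, fusion, passes, _ in rows if passes]


def recovered_by_classifier(rows, threshold):
    """Fusions whose classifier probability is at least `threshold`.

    Assumption: the comparison is inclusive (>=), so that a prediction
    scored exactly at the threshold is accepted.
    """
    return [fusion for _, fusion, _, prob in rows if prob >= threshold]


if __name__ == "__main__":
    print(len(recovered_by_thresholds(TABLE_2_5)))
    for threshold in (0.81, 0.48):
        print(threshold, len(recovered_by_classifier(TABLE_2_5, threshold)))
```

Under the inclusive comparison assumed here, the two fusions still missed at a threshold of 0.48 are ANKHD1-C5orf32 and MIXL1-PARP1.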

2.3.4 Fusion boundaries coincident with interrupted expression show dominant expression of the fused gene

We sought to understand each fusion’s impact on the expression patterns of the fused genes.

For a given fusion boundary $\zeta$, let $e^{p}_{\zeta}$ be the expression of exons on the preserved side of $\zeta$, normalized by the length of those exons. Also let $e^{r}_{\zeta}$ be the length normalized expression of the remaining exons, not predicted to be part of the fusion gene. We define the interrupted expression index $E_{\zeta} = \log_2 \left( e^{p}_{\zeta} / e^{r}_{\zeta} \right)$ as the log ratio of the expression of preserved versus remaining exons, analogous to the splicing index [54]. For each PCR-validated fusion boundary $\zeta$ predicted for an ovarian dataset we calculated $E_{\zeta}$ for all ovarian datasets and compared $E_{\zeta}$ for the dataset with the predicted fusion to $E_{\zeta}$ for the datasets without the fusion using a Wilcoxon test [54], resulting in 22 fusion events with at least one partner predicted as interrupted (p-values < 0.05, see Table 2.1 and Table S2).

Promoter exchanges are characterized by overexpression of the 3’ exons of a gene resulting from the replacement of 5’ regulatory regions [157]. For each PCR validated fusion we calculated whether the 3’ partner was expressed significantly higher in the dataset harbouring the fusion compared to other ovarian datasets (p-values < 0.1, see Table S2, Table S6 and Table S7). We then overlapped the overexpression results with the interrupted expression results to find seven fusions representing potential promoter exchanges (Table 2.1). The remaining 15 expression-interrupting fusions represent either biallelic inactivations (for example, HNF1A described below) or dominant expression of the fusion allele (for example, RREB1-TFE3 described below).

We sought to rule out genomic amplification as a mechanism of overexpression for the seven putative promoter exchanges. Analysis of Affy SNP6.0 genome data indicates that two of the 3’ partners, SH2D1A and UMOD, are in regions of genomic amplification (Table S3). Given that UMOD is not expressed in any other ovarian library (Table S9), a genomic amplification alone cannot explain UMOD expression in HGS4.
For the FRYL-SH2D1A fusion in HGS3, a marked coincidence between the fusion boundary and an expression changepoint implies that only the fused copy of SH2D1A is expressed (Figure 2.6). FISH evidence for FRYL-SH2D1A indicates that at most one copy of the FRYL-SH2D1A fusion exists in the genome of each tumor cell (Figure 2.6), suggesting that amplification of the SH2D1A region is not the underlying cause of SH2D1A overexpression. FRYL expression is on average 670-fold higher than SH2D1A expression in the non-HGS3 ovarian libraries (Table S6), implying that the FRYL promoter would overexpress SH2D1A, were it fused to SH2D1A. In HGS3, SH2D1A expression is on average 36-fold higher than in other ovarian libraries, supporting the theory that the FRYL promoter is driving SH2D1A expression. The FRYL-SH2D1A fusion does not preserve the open reading frame of SH2D1A. Investigation of the functional impact of FRYL-SH2D1A and the other six promoter exchanges is ongoing.
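The interrupted expression index defined earlier in this section can be illustrated with a short Python sketch. This is an illustration under stated assumptions rather than the code used in the analysis: exon expression is represented as raw read counts, length normalization is taken to be total reads divided by total exon length, and a small pseudocount (hypothetical, not described in the text) guards against division by zero when the remaining exons have no coverage. The Wilcoxon comparison across datasets is omitted.

```python
import math


def length_normalized_expression(exons):
    """Reads per base over a set of exons.

    exons: list of (exon_length_bp, read_count) tuples.
    """
    total_length = sum(length for length, _ in exons)
    total_reads = sum(count for _, count in exons)
    return total_reads / total_length


def interrupted_expression_index(preserved, remaining, pseudocount=1e-9):
    """E = log2(e_p / e_r) for one gene at one fusion boundary.

    preserved: exons on the side of the boundary retained in the fusion.
    remaining: exons not predicted to be part of the fusion gene.
    pseudocount: assumption added here to avoid log(0) and division by zero.
    """
    e_p = length_normalized_expression(preserved) + pseudocount
    e_r = length_normalized_expression(remaining) + pseudocount
    return math.log2(e_p / e_r)


if __name__ == "__main__":
    # Toy numbers: preserved exons at 4 reads/base, remaining at 0.5 reads/base.
    preserved = [(100, 400), (100, 400)]
    remaining = [(200, 100)]
    print(interrupted_expression_index(preserved, remaining))
```

A strongly positive index in the dataset harbouring the fusion, relative to the index for the same boundary in datasets without the fusion, is the pattern interpreted above as interrupted expression.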

2.3.5 Evidence of previously described rearrangements in sarcoma and ovarian carcinoma data

We sought to identify previously described rearrangements in our sarcoma and ovarian carcinoma data. Although generally considered a breast cancer rearrangement, amplification of ERBB2 has also been shown to occur in mucinous ovarian tumors [97]. In our ovarian cases, one mucinous tumor, MUC1, harbours a fusion between ERBB2 and adjacent PERLD1 caused by an underlying genomic inversion. CNV analysis of Affy SNP6.0 genome data predicted ERBB2 to be highly amplified in the genome of MUC1 (Table S3), and ERBB2 expression is approximately 10-fold higher in MUC1 than in any other ovarian library (Table S6). Since amplification of ERBB2 requires replication of ERBB2 across the genome, a reasonable explanation for the ERBB2-PERLD1 fusion is that it is a secondary effect of the process of ERBB2 amplification.

Analysis of the two epithelioid sarcomas and one intermediate grade myofibroblastic sarcoma produced five fusion predictions between non-adjacent genes, three involving genes previously described as translocated in cancer. The CMKLR1-HNF1A fusion is predicted to significantly interrupt expression of HNF1A (Table 2.1). In fact, there is no evidence of wild-type HNF1A expression in SARC1, indicating the possibility that the CMKLR1-HNF1A fusion transcript is evidence of a biallelic inactivation of HNF1A in SARC1 (Figure 2.7a). Biallelic inactivation of HNF1A has been previously reported to lead to aberrant activation of signalling pathways involved in tumorigenesis in human hepatocellular adenomas [123].

The RREB1-TFE3 gene fusion found in the intermediate grade myofibroblastic sarcoma SARC3 fuses the first eight exons of RREB1 to the last nine exons of TFE3, preserving the open reading frame of both RREB1 and TFE3. The fusion is predicted to interrupt expression of RREB1 (Table 2.1), indicating that RREB1-TFE3 is the dominantly expressed RREB1 allele. The underlying translocation leaves intact the DNA binding domain and N-terminal activation domain of TFE3 (Figure 2.7b). TFE3 is a known fusion partner in papillary renal cell carcinoma [146] and alveolar soft part sarcoma [78].

Figure 2.6: Evidence for the FRYL-SH2D1A fusion showing the validated fusion boundary (vertical red line). a) Validation evidence using a FISH come together assay, with fusion probes circled in white. b) FISH probe selection. c) FRYL exonic coverage showing fewer reads aligning after the fusion boundary. FRYL exons in blue with narrower boxes denoting untranslated sequence. d) SH2D1A exonic coverage showing significant coverage after the fusion boundary. SH2D1A exons in green with narrower boxes denoting untranslated sequence. e) FRYL-SH2D1A exons in blue or green depending on their origin, with the whole transcript predicted as untranslated. f) Positions of spanning reads supporting the fusion. g) Split alignments supporting the fusion prediction. h) Chromatogram of a sequenced PCR product supporting the fusion.

Figure 2.7: Fusions in sarcoma samples. a) Read depth across HNF1A exonic positions shows that only the region after the fusion boundary is being expressed, evidence of the possible biallelic inactivation of HNF1A. b) Putative RREB1-TFE3 chimeric protein showing preservation of TFE3’s basic helix-loop-helix (bHLH) leucine zipper (LZ) domain and N-terminal activation domain (ATA), in addition to 4 of RREB1’s zinc finger (ZF) motifs.

Finally, the SMARCB1-WASF2 gene fusion found in SARC2 is predicted to produce a transcript that preserves the reading frame of both SMARCB1 and WASF2. The predicted fusion protein would be composed of amino acids 1-209 of SMARCB1, which would preserve a DNA binding domain at 106-183 but interrupt a MYC binding domain at 186-245 [31], suggesting that the SMARCB1-WASF2 fusion protein would retain only partial SMARCB1 function. SMARCB1 has been shown to be frequently inactivated in epithelioid sarcomas [110].

2.4 Discussion

We have developed a new algorithmic method called deFuse for gene fusion discovery in RNA-Seq data. We evaluated deFuse on 40 ovarian cancer patient samples, one ovarian cancer cell line and three sarcoma patient samples. Using these data, we demonstrate with RT-PCR validated fusions that deFuse exhibits substantially better accuracy than two competing methods and that deFuse is able to discover gene fusions that are not discoverable by more simplistic methods. deFuse computes a set of 11 quantitative features used to characterize its predicted fusions. In our initial analysis we used heuristic, intuitively chosen thresholds to eliminate false positives, and nominated expected true positive and false positive predictions for RT-PCR validation. This yielded a set of benchmark fusion predictions: 60 true positives and 61 true negatives that we in turn leveraged to train an adaboost classifier to more robustly and objectively identify real gene fusions from the features. The classifier yielded an AUC of 0.91. Importantly, the validated fusions in ovarian cancer represent the first reported gene fusions in that tumor type.

2.4.1 Limitations

The lack of a sufficient number of positive and negative controls for a particular type of event, such as gene fusions, represents a major challenge when evaluating novel algorithms designed for discovery of those events. This challenge is exacerbated when the prediction set contains a much larger proportion of negatives than positives. We attempted to select candidates to enrich for positive examples to provide a balanced set of ground truth events with which to train our classifier. While this has inherent biases, only one in 40 randomly chosen predictions validated, indicating that a completely unbiased selection would have yielded too few positives to robustly fit a classifier. We attempted to mitigate the acknowledged biases by using other software to find additional positives and also included the very limited set of published examples from the literature.

The main limitation of deFuse is the requirement of at least five discordant read pairs to nominate a gene fusion to the adaboost classifier. This will certainly miss fusions that have very low expression and may result in insensitivity to fusions from RNA-Seq datasets with minimal sequence generation. This is suggested by the results in Section 2.3.3. However, sequencing platforms are increasing throughput at exponential rates and it will soon be rare for an RNA-Seq library to under-sample a transcriptome. Another potential limitation of deFuse is its reliance on an annotated set of genes. As such, it will not be able to discover fusions that involve loci that are not annotated as genes. Finally, deFuse relies on alignment to a reference as its primary analytical step. Thus deFuse would miss gene fusions involving completely novel sequences that may exist in a transcriptome library but are not represented in the reference used by the aligner. In such situations, de novo assembly based methods such as Trans-ABySS [134] may outperform deFuse.

2.4.2 Conclusion

Full characterization of the mutational composition of cancer genomes will provide the opportunity to discover drivers of oncogenesis and will aid the development of biomarkers and drug targets for targeted therapy. As production of RNA-Seq data derived from tumor transcriptomes becomes routine, sophisticated techniques such as those used by deFuse will be required to identify the gene fusions that are part of each tumor’s mutational landscape. As a first step in this process, we have identified gene fusions as a new class of features of the mutational landscape of ovarian tumor transcriptomes, in addition to discovering novel gene fusions in three sarcoma tumors.

Chapter 3

Comrad: Detection of expressed rearrangements by integrated analysis of RNA-Seq and low coverage genome sequence data

3.1 Introduction

High-throughput sequencing of cDNA (RNA-Seq) is rapidly accelerating our understanding of the sequence content of the human transcriptome. RNA-Seq can be used for high-throughput quantification of transcript abundance as has been done previously using microarrays. However, microarray based approaches require pre-existing knowledge of the transcriptome sequence. RNA-Seq, by contrast, can be used for de novo characterization of the transcriptome including unbiased discovery and nucleotide level characterization of novel transcripts. Compared to previous, Sanger sequencing based approaches for discovering novel transcripts, RNA-Seq is higher throughput for lower cost [166].

Applied to cancer genomics, RNA-Seq can be employed for the discovery of novel aberrant transcripts with implications for cancer biology. Previous studies [91, 90] used RNA-Seq to rediscover known gene fusions in CML and prostate cell lines, and also discovered novel gene fusions in prostate tumours. Similarly, Berger et al. [15] applied RNA-Seq to the discovery of gene fusions in melanoma. Another study [125] used a newly developed method called FusionSeq [138] to discover non-ETS fusions in RNA-Seq data from prostate tumours, and the authors of [70] developed PERAlign and used it to discover novel gene fusions in breast cancer cell lines. The general methodology used by all of these studies was the "paired end" method: i) RNA-Seq is used to sequence both ends of a set of cDNA fragments, ii) the resulting sequence pairs are aligned to the reference genome or transcriptome, iii) a chimeric transcript will produce chimeric fragments and those chimeric fragments will produce a pair of sequences (paired end reads) that align to different genes; thus, any paired end read for which one end aligns to one gene and the other end aligns to another gene is considered potential evidence of a gene fusion.

Despite the aforementioned methodological advances and associated discoveries, accurate prediction of gene fusions from RNA-Seq data remains a difficult problem. The large amount of sequence data produced by high-throughput sequencing and the complexity of the transcriptome make RNA-Seq data difficult to interpret. High expression levels combined with sequencing errors and novel splicing produce many sequence pairs that appear to have been produced from chimeric fragments [138]. Furthermore, the reverse transcription step used to produce cDNA has been shown to produce chimeric fragments via the process of template switching [69]. Previous studies dealt with false positives produced by ’false’ chimeric fragments with the application of heuristic filters [91, 90, 15, 138, 125]. Another common practice was to discard paired end reads with multiple mappings to the genome (multi-map reads) [91, 90, 15, 138, 125]. Although discarding multi-map reads and applying heuristic filters may reduce the false positive rate, the effects of these practices on the false negative rate have not been properly quantified.

Assuming successful prediction of fusion transcripts from RNA-Seq data, the significance of those transcripts may still depend on the discovery of an underlying genomic rearrangement. Fusions between adjacent genes not associated with a genomic rearrangement are a distinct class of fusion transcript known as transcription induced chimeras (TICs) or read-throughs. Though a significant amount of recent work has identified tissue and tumour specific read-throughs [20, 73, 165, 132], the mechanisms for their heritability remain unclear, and their ubiquity in normal tissue [2, 122] impedes assessment of their functional significance in cancer.
RNA-Seq alone cannot distinguish a read-through from a small deletion that brings together two genes. Instead of RNA-Seq, some investigators have sought to discover gene fusions using whole genome shotgun sequencing (WGSS) [126, 10]. However, results produced from WGSS data suffer from the reverse problem, that is, the significance of any fusion discovered using WGSS data is difficult to determine without an understanding of its effects on expression and without knowing whether it produces a fusion transcript. Furthermore, the unfocused nature of whole genome sequencing makes this method expensive at coverage levels required to accurately predict genomic rearrangements. A natural progression, given the complementarity of genomic and transcriptomic data, would be an analysis that combined these two data types. For example, the study by Berger et al. [15] combined RNA-Seq data with copy number data to identify gene fusions associated with deletions and unbalanced rearrangements. Unfortunately, copy number data is ineffective for the discovery of balanced rearrangements such as the reciprocal translocation that creates the BCR-ABL gene fusion associated with chronic myelogenous leukemia (CML) [137]. By contrast, WGSS data can effectively discover balanced and unbalanced rearrangements. However, there are currently no methods that combine whole genome sequence data with RNA-Seq data for the purposes of gene fusion detection.

The availability of methods for predicting gene fusions in RNA-Seq data and rearrangements in WGSS data make it conceivable that these existing methods could be combined for a joint analysis of RNA-Seq and WGSS data for accurate gene fusion prediction. However, as we demonstrate in this paper, applying each tool independently and then combining the results would be inaccurate. All methods for analysis of either RNA-Seq or WGSS data use heuristic filters to discard low confidence predictions supported by marginal amounts of evidence in order to attain a reasonable true positive rate. Thus any true fusion supported by only a marginal amount of evidence in either one or both datasets will be missed by independent analysis. We find that a joint analysis produces a limited number of results supported by both datasets, partially obviating the need for thresholding when searching for aberrant transcripts associated with genomic rearrangements. Similarly, the actual mapping location of multi-map reads may be difficult to resolve with an independent analysis of each dataset, even when using methods that effectively leverage multi-map reads, such as MoDIL [84] or VariationHunter [65, 66, 59] for WGSS analysis, and deFuse (unpublished, website: http://compbio.bccrc.ca/) or ShortFuse (unpublished, website: http://exon.ucsd.edu/ShortFuse/) for RNA-Seq analysis. We show that an analysis that simultaneously considers all reads from both datasets is better able to resolve the alignment location of multi-map reads.

3.2 Approach

Comrad is a novel algorithmic framework for the integrated analysis of RNA-Seq and WGSS data for the purposes of discovering genomic rearrangements and aberrant transcripts. Comrad builds on the COMMON-LAW framework first proposed in related work by [67] on structural variation discovery in multiple sequenced genomes. The Comrad method leverages the advantages of both types of data, providing accurate classification of rearrangements as expressed or not expressed and accurate classification of the genomic or non-genomic origin of aberrant transcripts. A major benefit of Comrad is its ability to accurately predict fusion transcripts and their associated genome rearrangements using low coverage WGSS data. As a result, a Comrad analysis can be performed at a cost comparable to that of two RNA-Seq experiments, significantly lower than an analysis requiring high coverage genome data. The algorithmic basis of Comrad, provided in detail in this paper, is an integer programming formulation which can be solved exactly using branch and bound or approximately using the relaxation of the linear program. For larger datasets, Comrad provides the option of using a greedy algorithm that can yield efficient solutions with reasonable running times.

We have applied Comrad to the discovery of gene fusions and read-throughs in prostate cancer cell line C4-2, a derivative of the LNCaP cell line with androgen-independent characteristics. As proof of concept, we have used Comrad to rediscover 4 out of 5 fusions previously described in LNCaP and known to also exist in C4-2. We have also used Comrad to identify 6 novel fusion transcripts and associated genomic rearrangements. A simple extension to the Comrad framework has allowed us to discover reciprocal rearrangement breakpoints for the two translocations found in the C4-2 data, making Comrad the first method to allow for the systematic discovery of reciprocal rearrangements. Furthermore, since Comrad is not biased towards canonical fusion splice junctions or fusions between known exons, we are able to use Comrad to discover fusions exhibiting non-canonical splicing. Some of the fusions we identify are supported by multi-map reads, showing that Comrad can effectively leverage multi-map reads for fusion discovery. Finally, some of the rearrangement breakpoints discovered by Comrad have as few as 1 read of supporting evidence, showing that Comrad is effective at discovering fusion evidence in low coverage genome data.

3.3 Methods

The Comrad method begins by enumerating all rearrangement breakpoints implied by the WGSS reads and all gene fusion splices implied by the RNA-Seq reads. Some of these breakpoints and fusion splices will be supported by multi-map reads, but a read can originate from at most one genomic location. Thus we require a robust method for determining the most likely origin for each read given the greater context of the alignments of all WGSS and RNA-Seq reads. Most rearrangements and gene fusions are specific to individual cell lineages, i.e., they occur at low levels of recurrence [108]. Furthermore, since the RNA-Seq and WGSS data originate from the same sample, fundamental differences between the two datasets (differences that are not the result of expression or splicing) are unlikely. Thus when seeking to determine the most likely origin for each read, we seek as the most parsimonious solution a global assignment (of reads) that minimizes three types of differences: differences between the WGSS dataset and the reference genome (rearrangement breakpoints), differences between the RNA-Seq dataset and the reference transcriptome (fusion splices), and differences between the RNA-Seq dataset and the WGSS dataset.

3.3.1 Identifying potential rearrangement breakpoints and fusion splices

Analysis of the WGSS data begins by enumerating all rearrangement breakpoints implied by the WGSS reads, and forming clusters of WGSS reads that support each rearrangement breakpoint. WGSS reads are aligned to the genome (NCBI36). Concordantly aligning reads are used to estimate the minimum and maximum DNA fragment length $L_{min}$ and $L_{max}$ [91, 15], and are subsequently discarded. All alignments of the remaining discordant reads are retained for further analysis. We then use an existing algorithm [65] to identify all clusters of discordant alignments, where each cluster is a maximal set of reads that could be explained by a single pair of breakpoints.

Similar to the analysis of the WGSS data, analysis of the RNA-Seq data begins by enumerating all fusion splices implied by the RNA-Seq reads, and forming clusters of RNA-Seq reads that support each fusion splice. RNA-Seq reads are aligned to spliced transcript sequences and unspliced gene sequences as annotated by Ensembl (version 54) [12]. Aligning to all splice variants of each gene and also the unspliced gene sequence enables Comrad to handle alternative splicing in a natural way. Multiple alignments of RNA-Seq data will arise because of homology between genes and redundant inclusion of the same exon in multiple splice variants of the same gene. Selecting the most parsimonious set of unique alignments for RNA-Seq data, as described later, will select not only the most likely pair of genes involved in a fusion event, but also the most likely pair of splice variants for each gene. Maximal sets of discordant RNA-Seq alignments corroborating the same fusion splice are enumerated using analogous conditions and the same algorithm as described for WGSS alignments.

3.3.2 Corroborating rearrangement breakpoints and fusion splices

RNA-Seq and WGSS alignments corroborate the same rearrangement if there exists at least one plausible genomic breakpoint at each locus that could explain both the RNA-Seq and WGSS alignments. Splicing confounds this problem because it often results in RNA-Seq reads that align to the genome many kb from the corresponding rearrangement breakpoints. Thus, to accurately establish corroboration between RNA-Seq and WGSS data, the effects of splicing must be considered. We describe two conditions for corroboration that afford efficient computation and ensure the existence of rearrangement breakpoints that would explain the RNA-Seq and genome sequencing alignments.

For one end of a given set of RNA-Seq alignments, we define the projected intron as the $I_{max}$-sized region starting at the most downstream genomic position of those alignments. A set of RNA-Seq alignments is said to be corroborated by a set of WGSS alignments if the projected introns for the RNA-Seq alignments overlap with the breakpoint regions for the WGSS alignments. Thus the overlapping intron condition is the condition that the pair of projected introns for RNA-Seq alignments must overlap with the pair of breakpoint regions for the WGSS alignments.

We also define the intron region as the portion of the projected intron of a set of RNA-Seq alignments that is upstream from and including the breakpoint regions of a set of WGSS alignments. The nonconflicting intron condition is the condition that the two intron regions for a potentially corroborating set of RNA-Seq and genome sequencing alignments cannot overlap. The nonconflicting intron condition disqualifies mutually exclusive sets of RNA-Seq and genome sequencing alignments that would otherwise satisfy the overlapping intron condition.
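The two corroboration conditions can be sketched as interval tests. The Python below is an illustrative reading of the definitions above and not the Comrad implementation. It assumes (hypothetically) that each end of an RNA-Seq cluster is summarized by a chromosome, its most downstream aligned position and a strand, that "downstream" means increasing coordinates on the "+" strand and decreasing coordinates on the "-" strand, and that the two RNA-Seq ends and the two WGSS breakpoint regions are supplied in matching order. All names are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    chrom: str
    start: int  # inclusive
    end: int    # inclusive

    def overlaps(self, other):
        return (self.chrom == other.chrom
                and self.start <= other.end
                and other.start <= self.end)


def projected_intron(chrom, downstream_pos, strand, i_max):
    """I_max-sized region starting at the most downstream aligned position."""
    if strand == "+":
        return Interval(chrom, downstream_pos, downstream_pos + i_max)
    return Interval(chrom, downstream_pos - i_max, downstream_pos)


def intron_region(intron, breakpoint_region, strand):
    """Part of the projected intron upstream from and including the breakpoint region."""
    if strand == "+":
        return Interval(intron.chrom, intron.start,
                        min(intron.end, breakpoint_region.end))
    return Interval(intron.chrom,
                    max(intron.start, breakpoint_region.start), intron.end)


def corroborates(rna_ends, wgss_regions, i_max):
    """Check both conditions for one RNA-Seq cluster and one WGSS cluster.

    rna_ends: two (chrom, downstream_pos, strand) tuples.
    wgss_regions: two Interval breakpoint regions, in the same order.
    """
    introns = [projected_intron(c, p, s, i_max) for c, p, s in rna_ends]
    # Overlapping intron condition: each projected intron must reach
    # the breakpoint region at the same end.
    if not all(i.overlaps(b) for i, b in zip(introns, wgss_regions)):
        return False
    regions = [intron_region(i, b, end[2])
               for i, b, end in zip(introns, wgss_regions, rna_ends)]
    # Nonconflicting intron condition: the two intron regions must not overlap.
    return not regions[0].overlaps(regions[1])
```

In this reading, two ends on different chromosomes can never violate the nonconflicting intron condition; the condition only matters for intrachromosomal events in which the implied introns would have to cross one another.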

Figure 3.1: Corroborating rearrangement breakpoints and fusion splices. Two conditions are required for RNA-Seq alignments and genome sequencing alignments to be considered evidence of the same rearrangement. The projected introns of the RNA-Seq alignments must overlap with the breakpoint regions of the genomic alignments, and the two intron regions must not overlap.

3.3.3 Selecting the most parsimonious set of alignments for ambiguously aligning reads

The relationship between WGSS reads, rearrangement breakpoints, RNA-Seq reads and fusion splices can be depicted in the rearrangement support graph as shown in Figure 3.2. The rearrangement support graph is formed by connecting WGSS reads to rearrangement breakpoints supported by those reads, and connecting RNA-Seq reads to fusion splices supported by those reads. Fusion splices and rearrangement breakpoints are connected based on the corroboration indicated by the overlapping intron condition and the nonconflicting intron condition. The rearrangement support graph encodes the ambiguity of WGSS and RNA-Seq reads with multiple alignments. Since at most one alignment for each read is valid, we seek a transformation of the graph that removes edges such that the transformed graph has exactly one edge incident with each read. Given only RNA-Seq data we previously attempted to produce a maximum parsimony solution with a minimum number of predicted fusions (deFuse: http://compbio.bccrc.ca). More recently, we introduced combinatorial formulations to identify structural variation events in several donor genomes by means of minimizing a weighted sum of structural differences between the donor genomes as well as one reference genome [67]. Expanding on this principle, we now attempt to select a set of alignments $M$ so as to minimize the number of differences between the WGSS data, RNA-Seq data, and the reference genome.

Figure 3.2: Rearrangement support graph. The relationship between DNA reads, DNA clusters, RNA reads and RNA clusters can best be depicted using the rearrangement support graph.

Let $X^G$ be the set of all rearrangement breakpoints and let $X^T$ be the set of all fusion splices. Let $C^{GT}$ be the set of all rearrangement breakpoint/fusion splice pairs $(x^G_j, x^T_k) \in X^G \times X^T$ that satisfy both conditions of corroboration. For each rearrangement breakpoint $x^G_j \in X^G$, let $\delta^G_j \in \Delta^G$ be a corresponding indicator variable, such that $\delta^G_j = 1$ if and only if at least one alignment has been selected that supports $x^G_j$. In other words, $\delta^G_j = 1$ if and only if the corresponding rearrangement breakpoint vertex has at least one incident alignment edge in the transformed graph. Define fusion splice indicator variables, $\delta^T_k \in \Delta^T$, similarly.

We are now in a position to give precise definitions of the three types of differences being considered. For a given set of selected alignments, each $x^G_j \in X^G$ for which $\delta^G_j = 1$ implies a single difference between the WGSS data and the reference. Similarly, each $x^T_k \in X^T$ for which $\delta^T_k = 1$ implies a single difference between the RNA-Seq data and the reference. Fully enumerating the differences between the WGSS data and the RNA-Seq data would require assembly of at least one of these datasets, a method we do not consider here. Instead, we define a difference between the WGSS data and the RNA-Seq data as a difference found between one dataset and the reference that is not corroborated by a difference between the other dataset and the reference.

First define a corroboration indicator variable $\kappa^G_j$ for each rearrangement breakpoint $x^G_j$ that indicates whether a fusion splice has been selected that corroborates $x^G_j$. More formally, $\kappa^G_j = 1$ if and only if there exists $(x^G_j, x^T_k) \in C^{GT}$ for which $\delta^T_k = 1$. Define a similar indicator variable $\kappa^T_k$ for each fusion splice $x^T_k$. A difference between the WGSS data and the reference not corroborated by a difference between the RNA-Seq data and the reference is given by $\delta^G_j \cdot (1 - \kappa^G_j)$. Conversely, an uncorroborated difference between the RNA-Seq data and the reference is given by $\delta^T_k \cdot (1 - \kappa^T_k)$.

Let $A$ be the full set of alignments considered and let $M \subseteq A$ be any valid subset, that is, a set of alignments for which each read is aligned to exactly one mapping location. We seek to minimize the objective function $f(M)$ that calculates the total number of differences implied by $M$ (Equation 3.1).

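\[
\begin{aligned}
f(M) &= \sum_j \delta_j^G + \sum_k \delta_k^T + \sum_j \delta_j^G \cdot (1 - \kappa_j^G) + \sum_k \delta_k^T \cdot (1 - \kappa_k^T) \\
     &= \sum_j \left[ 2\delta_j^G - \delta_j^G \cdot \kappa_j^G \right] + \sum_k \left[ 2\delta_k^T - \delta_k^T \cdot \kappa_k^T \right]
\end{aligned}
\tag{3.1}
\]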
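As a minimal sketch of how Equation 3.1 might be evaluated for one valid selection $M$, consider the following Python function. The data layout (dictionaries mapping each read to its chosen cluster, and a set of tuples standing in for $C^{GT}$) and all identifiers are illustrative assumptions, not part of Comrad.

```python
def objective(dna_choice, rna_choice, corroborating_pairs):
    """Evaluate f(M) from Equation 3.1 for one valid alignment selection.

    dna_choice: dict WGSS read -> selected rearrangement breakpoint
    rna_choice: dict RNA-Seq read -> selected fusion splice
    corroborating_pairs: set of (breakpoint, splice) tuples, standing in for C^GT
    (all three layouts are hypothetical, chosen for illustration)
    """
    # delta = 1 exactly for the clusters that received at least one read
    selected_bp = set(dna_choice.values())
    selected_sp = set(rna_choice.values())

    # kappa = 1 if a corroborating cluster of the other type is selected
    corroborated_bp = {j for (j, k) in corroborating_pairs if k in selected_sp}
    corroborated_sp = {k for (j, k) in corroborating_pairs if j in selected_bp}

    total = 0
    for j in selected_bp:
        total += 2 - (1 if j in corroborated_bp else 0)
    for k in selected_sp:
        total += 2 - (1 if k in corroborated_sp else 0)
    return total


pairs = {("bp1", "sp1")}
# Both reads placed on a corroborating pair: 1 + 1 differences
print(objective({"d1": "bp1"}, {"r1": "sp1"}, pairs))  # 2
# The same reads placed on unrelated clusters: 2 + 2 differences
print(objective({"d1": "bp2"}, {"r1": "sp2"}, pairs))  # 4
```

The two calls illustrate why the objective favours corroborated placements: a selected cluster costs 2 on its own but only 1 when a corroborating cluster of the other data type is also selected.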

We seek an efficient algorithm that minimizes this objective function by reducing the rearrangement support graph. We propose two algorithms for solving this problem: an ILP formulation for use with an ILP solver, and an approximation algorithm based on weighted set cover.

ILP formulation

For each WGSS read $r_i^G \in x_j^G$, a 0-1 integer variable $a_{ij}^G$ indicates whether the edge between $r_i^G$ and $x_j^G$ is present in the transformed graph. Similarly, for each RNA-Seq read $r_l^T \in x_k^T$, a 0-1 integer variable $a_{lk}^T$ represents whether the edge between $r_l^T$ and $x_k^T$ is present in the transformed graph. The ILP formulation for reducing the rearrangement support graph attempts to find a valid assignment for variables $a_{ij}^G$ and $a_{lk}^T$ that minimizes the objective function given in equation 3.1. As described above, 0-1 integer variables $\delta_j^G$ and $\delta_k^T$ represent whether at least one read has been assigned to the respective rearrangement breakpoint $x_j^G$ and fusion splice $x_k^T$. Let $z_j^G = \delta_j^G \cdot \kappa_j^G$ and $z_k^T = \delta_k^T \cdot \kappa_k^T$. Also, let $c_{jk} = 1$ if $(x_j^G, x_k^T) \in C^{GT}$, otherwise $c_{jk} = 0$. The objective function for the ILP is given as in equation 3.1, and the constraints are given below in equations 3.2-3.9.

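\begin{align}
\forall r_i^G \in R^G &: \sum_j a_{ij}^G = 1 \tag{3.2} \\
\forall r_l^T \in R^T &: \sum_k a_{lk}^T = 1 \tag{3.3} \\
\forall r_i^G \in R^G, x_j^G \in X^G &: \delta_j^G \geq a_{ij}^G \tag{3.4} \\
\forall r_l^T \in R^T, x_k^T \in X^T &: \delta_k^T \geq a_{lk}^T \tag{3.5} \\
\forall j &: z_j^G \leq \delta_j^G \tag{3.6} \\
\forall k &: z_k^T \leq \delta_k^T \tag{3.7} \\
\forall j &: z_j^G \leq \sum_k c_{jk} \delta_k^T \tag{3.8} \\
\forall k &: z_k^T \leq \sum_j c_{jk} \delta_j^G \tag{3.9}
\end{align}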

Constraints 3.2 and 3.3 ensure that each read is assigned to exactly one cluster in the transformed graph. Constraints 3.4 and 3.5 ensure that a cluster is considered as selected if at least one read has been assigned to that cluster. The constraint given by equation 3.6 ensures that $z_j^G = 1$ only if $\delta_j^G = 1$, a consequence of defining $z_j^G = \delta_j^G \cdot \kappa_j^G$. The constraint given by equation 3.8 ensures that $z_j^G = 1$ only if there exists a $k$ such that $\delta_k^T = 1$ and $c_{jk} = 1$, that is, $z_j^G = 1$ for rearrangement breakpoint $j$ only if a corroborating fusion splice has been selected. The constraints given by equations 3.7 and 3.9 are analogous constraints for $z_k^T$.

The above ILP formulation can be solved exactly using an ILP solver or can be solved approximately using randomized rounding applied to the LP relaxation of the problem. For the purposes of this study we used the branch and bound based exact ILP solver provided in the GLPK library (http://www.gnu.org/software/glpk/).
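For very small instances, the optimum of the ILP can be checked by enumerating every valid selection directly. The sketch below is such an exhaustive reference search, not the GLPK-based solver used in this study. The function name, the dictionary layout, and the requirement that WGSS and RNA-Seq reads carry distinct identifiers are all assumptions made for illustration.

```python
from itertools import product


def solve_exhaustively(dna_edges, rna_edges, corroborating_pairs):
    """Minimize Equation 3.1 by enumerating every valid selection M.

    dna_edges: dict WGSS read -> list of candidate rearrangement breakpoints
    rna_edges: dict RNA-Seq read -> list of candidate fusion splices
    corroborating_pairs: set of (breakpoint, splice) tuples, i.e. c_jk = 1
    Exponential time: only meant for checking tiny instances.
    Assumes read identifiers are distinct across the two data types.
    """
    dna_reads, rna_reads = list(dna_edges), list(rna_edges)
    # Choosing exactly one cluster per read mirrors constraints 3.2 and 3.3
    options = [dna_edges[r] for r in dna_reads] + [rna_edges[r] for r in rna_reads]

    best_cost, best_choice = None, None
    for choice in product(*options):
        bp = set(choice[:len(dna_reads)])   # clusters with delta^G = 1
        sp = set(choice[len(dna_reads):])   # clusters with delta^T = 1
        # z = 1 only for a selected cluster with a selected corroborating partner
        z_bp = {j for j in bp if any((j, k) in corroborating_pairs for k in sp)}
        z_sp = {k for k in sp if any((j, k) in corroborating_pairs for j in bp)}
        cost = 2 * len(bp) - len(z_bp) + 2 * len(sp) - len(z_sp)
        if best_cost is None or cost < best_cost:
            best_cost = cost
            best_choice = dict(zip(dna_reads + rna_reads, choice))
    return best_cost, best_choice


dna_edges = {"d1": ["bpA", "bpB"]}                  # d1 is a multi-map WGSS read
rna_edges = {"r1": ["spA", "spC"], "r2": ["spA"]}   # r1 is a multi-map RNA-Seq read
cost, choice = solve_exhaustively(dna_edges, rna_edges, {("bpA", "spA")})
print(cost, choice)
```

In this toy instance the two multi-map reads are both resolved to the corroborating pair (bpA, spA), giving a total of 2 differences; any other placement costs 4 or more.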

Greedy approximation algorithms

The greedy approximation algorithm for reducing the rearrangement support graph can be used to provide an approximate solution to larger problems. The algorithm is based on a previous formulation for identifying structural variations in multiple genomes as proposed by [67]. However, we rework the cost function to allow a single rearrangement breakpoint to support multiple fusion splices and vice versa. For each corroborating rearrangement breakpoint/fusion splice pair $(x_j^G, x_k^T) \in C^{GT}$, form the read set $z = x_j^G \cup x_k^T$ and an associated indicator set $\Delta_z = \{\delta_j^G, \delta_k^T\}$. For each uncorroborated rearrangement breakpoint $x_j^G$ form the set $z = x_j^G$ and an associated indicator set $\Delta_z = \{\delta_j^G\}$. Form analogous sets for uncorroborated transcriptome clusters. Calculate the cost of each set $z$ as given in equation 3.10.

\[
\mathrm{cost}(z) = 2 - \sum_{\delta_m \in \Delta_z} \delta_m \tag{3.10}
\]

Let $U$ be the set of covered reads, initially empty. Also, all $\delta_j^G$ and $\delta_k^T$ are initially 0. At each step in the algorithm, select the set $z_k$ that covers the largest number of not yet covered reads for the lowest cost, that is, the set $z_k$ that maximizes the ratio given in equation 3.11.

\[
\frac{|z_k \setminus U|}{\mathrm{cost}(z_k)} \tag{3.11}
\]

For each $\delta_m \in \Delta_{z_k}$, set $\delta_m = 1$ if $|x_m \setminus U| > 0$, where $x_m$ is the cluster associated with $\delta_m$. That is, select cluster $x_m$ by setting $\delta_m = 1$ if $x_m$ covers reads that are not yet in $U$. Update all other $\Delta_z$ that contain $\delta_m$ and also update the cost of any set that may have changed due to changes in its associated $\Delta_z$. For each read $r$ in $z_k \setminus U$, remove all edges in the rearrangement support graph incident with $r$, retaining only the edge between $r$ and the cluster used to create $z_k$. Add the reads in $z_k$ to $U$, and repeat, selecting a new set $z_{k+1}$, until $U$ includes all WGSS and RNA-Seq reads. The greedy algorithm will provide a solution with cost at most $\log n \cdot \mathrm{OPT}$, where OPT is the cost of the optimal solution and $n$ is the total number of reads. For asymptotic analysis and proofs of complexity, please see [67].
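The greedy procedure can be sketched in Python as follows. This is an illustrative reading of the steps above rather than Comrad's implementation. The names `greedy_reduce`, `clusters` and `corroborating_pairs` are hypothetical. The sketch takes "uncorroborated" to mean a cluster that appears in no pair of $C^{GT}$, and it skips sets that cover no new reads so that the ratio in equation 3.11 is never evaluated with zero cost.

```python
def greedy_reduce(clusters, corroborating_pairs):
    """Greedy weighted set cover over the rearrangement support graph.

    clusters: dict cluster id -> set of reads aligned to that cluster
    corroborating_pairs: set of (breakpoint id, splice id) tuples
    Returns (dict read -> chosen cluster, set of selected cluster ids).
    """
    paired = {c for pair in corroborating_pairs for c in pair}
    # Candidate sets z: one per corroborating pair, one per unpaired cluster
    candidates = [tuple(p) for p in sorted(corroborating_pairs)]
    candidates += [(c,) for c in sorted(clusters) if c not in paired]

    all_reads = set().union(*clusters.values())
    covered = set()                      # the set U
    delta = {c: 0 for c in clusters}     # all indicators start at 0
    assignment = {}

    while covered != all_reads:
        best, best_ratio = None, 0.0
        for members in candidates:
            gain = len(set().union(*(clusters[c] for c in members)) - covered)
            if gain == 0:
                continue                 # nothing new to cover; cost may be 0
            cost = 2 - sum(delta[c] for c in members)   # Equation 3.10
            ratio = gain / cost                         # Equation 3.11
            if ratio > best_ratio:
                best, best_ratio = members, ratio
        if best is None:
            raise ValueError("some read belongs to no candidate set")
        for c in best:
            new_reads = clusters[c] - covered
            if new_reads:
                delta[c] = 1             # select the cluster
            for r in new_reads:
                assignment[r] = c        # keep only the edge to this cluster
            covered |= new_reads
    return assignment, {c for c, d in delta.items() if d}


clusters = {
    "bpA": {"d1"}, "bpB": {"d1"},        # d1 is a multi-map WGSS read
    "spA": {"r1", "r2"}, "spC": {"r1"},  # r1 is a multi-map RNA-Seq read
}
assignment, selected = greedy_reduce(clusters, {("bpA", "spA")})
print(assignment, selected)
```

In the toy input, the set built from the corroborating pair (bpA, spA) covers three reads at cost 2, so it is selected first and both multi-map reads are resolved to the corroborated clusters.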

3.3.4 Modifying the breakpoint overlap function

One benefit of the given formulation of the problem is that it allows for the substitution of different rules for the corroborative relationship $(x_j^G, x_k^T) \in C^{GT}$. We explored one other possibility. Given a rearrangement breakpoint $x_j^G$ and a fusion splice $x_k^T$, we calculate the pair of genes potentially affected by the events represented by those clusters. We then define $C^{GT}$ as $(x_j^G, x_k^T) \in C^{GT} \iff \mathrm{genepair}(x_j^G) = \mathrm{genepair}(x_k^T)$. We show in the results section that this alternative corroborative relationship allows us to discover reciprocal translocations.
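A minimal sketch of the gene-pair rule is given below. The function name and input dictionaries are hypothetical, and the sketch assumes that gene pairs are compared without regard to order, which is one plausible reading of how a reciprocal breakpoint (genes B, A) could corroborate a fusion splice (genes A, B).

```python
def corroboration_by_gene_pair(breakpoint_genes, splice_genes):
    """Build C^GT under the gene-pair rule.

    breakpoint_genes: dict breakpoint id -> (gene, gene) potentially affected
    splice_genes: dict fusion splice id -> (gene, gene) joined by the splice
    Gene pairs are compared as unordered sets (an assumption of this sketch).
    """
    return {
        (j, k)
        for j, genes_j in breakpoint_genes.items()
        for k, genes_k in splice_genes.items()
        if frozenset(genes_j) == frozenset(genes_k)
    }


breakpoints = {
    "bp_regular": ("MIPOL1", "DGKB"),
    "bp_reciprocal": ("DGKB", "MIPOL1"),
    "bp_other": ("RERE", "PIK3CD"),
}
splices = {"sp1": ("MIPOL1", "DGKB")}
print(sorted(corroboration_by_gene_pair(breakpoints, splices)))
```

With this rule, both the regular and the reciprocal breakpoint are paired with the single fusion splice, whereas the unrelated breakpoint is not.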

3.3.5 Assembling a prediction sequence

For each fusion splice and each rearrangement breakpoint, Comrad assembles a prediction sequence. Suppose a set of reads implies a fusion splice (or rearrangement breakpoint) between transcript A and B (or genomic loci A and B). Let $\{(s_i^A, e_i^A)\}$ and $\{(s_i^B, e_i^B)\}$ be the start and end positions for the alignments to A and B respectively. Let $S_A$ be the sequence in A in the range $[\min\{s_i^A\}, \max\{e_i^A\}]$, and let $S_B$ be the sequence in B in the range $[\min\{s_i^B\}, \max\{e_i^B\}]$. If the alignments are to the + strand of gene A and the − strand of gene B (+− orientation), then the prediction sequence is $S_A \cdot S_B$. For −+, ++ and −− orientations, the predicted sequence is $\mathrm{rc}(S_A) \cdot \mathrm{rc}(S_B)$, $S_A \cdot \mathrm{rc}(S_B)$ and $\mathrm{rc}(S_A) \cdot S_B$ respectively, where $\mathrm{rc}()$ is reverse complementation.
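The orientation rules above can be expressed compactly as a lookup table. The following is a sketch only: the function names are hypothetical, and coordinates are assumed to be 0-based with an exclusive end so that Python slicing can be used directly, whereas the text describes closed ranges.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def rc(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def prediction_sequence(ref_a, ref_b, aligns_a, aligns_b, orientation):
    """Build the prediction sequence for one fusion splice or breakpoint.

    ref_a, ref_b: reference sequences of A and B (strings)
    aligns_a, aligns_b: lists of (start, end) alignment coordinates,
        assumed here to be 0-based with an exclusive end
    orientation: one of "+-", "-+", "++", "--"
    """
    # S_A and S_B span from the smallest start to the largest end
    s_a = ref_a[min(s for s, _ in aligns_a):max(e for _, e in aligns_a)]
    s_b = ref_b[min(s for s, _ in aligns_b):max(e for _, e in aligns_b)]
    builders = {
        "+-": lambda: s_a + s_b,
        "-+": lambda: rc(s_a) + rc(s_b),
        "++": lambda: s_a + rc(s_b),
        "--": lambda: rc(s_a) + s_b,
    }
    return builders[orientation]()


ref_a, ref_b = "AAACCCGGG", "TTTTGGGG"
aligns_a, aligns_b = [(0, 3), (2, 6)], [(4, 8)]
print(prediction_sequence(ref_a, ref_b, aligns_a, aligns_b, "+-"))  # AAACCCGGGG
print(prediction_sequence(ref_a, ref_b, aligns_a, aligns_b, "++"))  # AAACCCCCCC
```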

3.3.6 Heuristic filtering

The heuristic filtering used by Comrad can be categorized as pre-filtering or post-filtering. Pre-filtering is applied before the application of the above algorithms, and is intended to remove reads that are unlikely to inform a gene fusion analysis. Post-filtering is applied to the results of the above algorithms in an attempt to remove predictions that are likely to be false positives or are unlikely to be novel.

Pre-filtering

The pre-filtering used by Comrad involves aligning reads to a specific set of sequences using bowtie [82] and removing those reads from further consideration if their alignments satisfy given criteria. RNA-Seq reads are aligned to the genome and UniGene clusters [127], and reads with concordant alignments are discarded. RNA-Seq data is often contaminated by a significant amount of ribosomal RNA (rRNA) [138]. Thus, RNA-Seq reads are also aligned to Ensembl annotated rRNA, and any read with one or both ends aligning to any rRNA is discarded. Comrad is not intended as a method for reconstructing immunoglobulin (IG) rearrangements. Thus, any RNA-Seq read that aligns with one end to one IG gene, and the other end to any other IG gene, is discarded. Finally, Comrad discards RNA-Seq and WGSS reads for which each end aligns to a repeat region in the genome, and those repeat regions are of the same type.

False positive post-filtering

False positive post-filtering attempts to remove predictions that are most likely to be false positives produced by spurious alignment artifacts. The sequence concordance filter aligns each prediction sequence to the appropriate reference sequences using blat [74]. Fusion splices are aligned to spliced and unspliced gene sequences and rearrangement breakpoints are aligned to the genome. A prediction sequence is discarded if that sequence aligns to the reference with greater than 80% identity. The read concordance filter uses blat to align all reads suggestive of an event to the appropriate reference (see above). A prediction is discarded if more than 10% of the supporting reads align concordantly to that reference. Genomic fusions require at least 1 supporting WGSS read and 5 supporting RNA-Seq reads to be considered in this study.
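The thresholds above can be summarized as a single predicate. The sketch below is illustrative only: the function and parameter names are hypothetical, the alignment identity and concordant read counts are assumed to have been computed beforehand from the blat alignments, and applying all three criteria to one prediction in a single call is a simplification.

```python
def keep_prediction(best_identity, n_concordant, n_supporting, n_wgss, n_rnaseq):
    """Apply the false positive post-filtering thresholds to one prediction.

    best_identity: best identity (0..1) of the prediction sequence against
        the appropriate reference, assumed precomputed from blat output
    n_concordant: supporting reads that align concordantly to that reference
    n_supporting: total supporting reads considered by the read concordance filter
    n_wgss, n_rnaseq: supporting WGSS and RNA-Seq read counts
    """
    # Sequence concordance filter: discard above 80% identity
    if best_identity > 0.80:
        return False
    # Read concordance filter: discard if more than 10% of reads are concordant
    if n_supporting > 0 and n_concordant / n_supporting > 0.10:
        return False
    # Minimum support for a genomic fusion in this study
    return n_wgss >= 1 and n_rnaseq >= 5


print(keep_prediction(0.50, 0, 31, 1, 30))  # True
print(keep_prediction(0.85, 0, 31, 1, 30))  # False
```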

Novelty post-filtering

Novelty post-filtering attempts to remove predictions that are unlikely to be novel. The sequence concordance filter is used to remove fusion splice predictions with significant alignments to ESTs (EST database retrieved from the UCSC genome browser [131] on November 26, 2010). The EST island filter begins by aligning fusion splice prediction sequences to the genome using blat and allowing for a spliced alignment. Fusion splices are discarded if their prediction sequence aligns entirely within a region of the genome suggested as co-transcribed by clusters of overlapping spliced EST alignments [131].

3.4 Results

We analyzed RNA-Seq and WGSS data produced from the C4-2 cell line, a derivative of the LNCaP prostate cell line. As a result of the close relationship between C4-2 and LNCaP, we hypothesized that fusions previously discovered in LNCaP would be useful as positive controls to be searched for in the C4-2 data. We also sought to use the C4-2 data as a proxy for discovering novel gene fusions in LNCaP.

The WGSS and RNA-Seq data for C4-2 each consisted of 84 million 50bp + 50bp paired end reads. With an approximate fragment length of 500bp, the WGSS data provides 7X physical coverage of a diploid human genome. Given that the LNCaP genome is tetraploid [11], the physical coverage for C4-2 is more likely closer to 3.5X.

Previous analysis of LNCaP resulted in the discovery of 6 fusion transcripts [91, 90]: DLEU2-PSPC1, RERE-PIK3CD, MIPOL1-DGKB, MRPS10-HPR, C19orf25-APC2 and SLC45A3-ELK4. We used PCR to confirm 5 of these fusion transcripts as present in C4-2; DLEU2-PSPC1 could not be confirmed in C4-2. The 5 confirmed fusion transcripts serve as positive control fusion transcripts to be discovered by Comrad in the C4-2 data. Both C19orf25-APC2 and SLC45A3-ELK4 involve adjacent genes, and are thus potential read-through events, a possibility that was confirmed for SLC45A3-ELK4 in a more recent study [132]. Comrad found only RNA-Seq evidence of SLC45A3-ELK4 in C4-2, providing further evidence that SLC45A3-ELK4 is not associated with chromosomal rearrangement. No RNA-Seq or WGSS evidence was found for C19orf25-APC2. A targeted search did not identify any RNA-Seq reads supporting a C19orf25-APC2 fusion transcript, suggesting that C19orf25-APC2 expression is lower than that required for detection at the sequencing depth provided by the C4-2 RNA-Seq data. The remaining 3 fusions, RERE-PIK3CD, MIPOL1-DGKB and MRPS10-HPR, involve distant genes and are thus potentially caused by underlying genomic rearrangement. Comrad successfully identified the previously described [157] rearrangement breakpoint for MIPOL1-DGKB and also identified rearrangement breakpoints for RERE-PIK3CD and MRPS10-HPR. The novel rearrangement breakpoints for RERE-PIK3CD and MRPS10-HPR were confirmed by PCR for both C4-2 and LNCaP.

Comrad predicted an additional 10 novel genomic fusions for C4-2 (Supplementary Table S1). We attempted to validate 9 of the 10 predictions, excluding AMACR-GUSBL1 as a likely read-through associated with a small 600bp deletion. Of the 9 novel Comrad predictions, 6 were validated in C4-2 by PCR and Sanger sequencing. We also validated all 6 novel Comrad predictions in LNCaP, thereby showing that Comrad is more sensitive than previous methods that have been applied to fusion discovery in LNCaP. Evidence for the 6 previously identified fusion transcripts and the 6 novel Comrad predictions is shown in Table 3.1.

To identify potential false negatives, we also analyzed the C4-2 RNA-Seq data using deFuse, a method for identifying fusion transcripts in RNA-Seq data alone. The deFuse analysis produced 31 predictions, 8 of which are predicted to be interchromosomal or long-range intrachromosomal fusions (Supplementary Table S4). Genomic evidence was identified by Comrad for 7 of these 8 fusions; genomic evidence for the single remaining fusion transcript, ZDHHC20-TNFRSF19, could not be identified by Comrad. The predicted ZDHHC20-TNFRSF19 sequence exhibits canonical GT-AG splicing at the fusion boundary and is thus not likely to be the product of template switching during reverse transcription [69]. The lack of genomic evidence for ZDHHC20-TNFRSF19 makes this fusion a candidate trans-splicing event in C4-2 [85]. Alternatively, ZDHHC20-TNFRSF19 could represent a false negative for Comrad.
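The physical coverage figures quoted for the C4-2 WGSS data in this section can be checked with a short calculation. The sketch below assumes a haploid human genome size of roughly 3 Gbp, a value that is not stated in the text, and the function name is hypothetical.

```python
def physical_coverage(n_pairs, fragment_length, haploid_genome_size, ploidy):
    """Physical coverage: total fragment length divided by total genome content."""
    return n_pairs * fragment_length / (haploid_genome_size * ploidy)


HAPLOID_GENOME_BP = 3_000_000_000  # assumed approximate haploid genome size

# 84 million read pairs from fragments of approximately 500bp
print(physical_coverage(84_000_000, 500, HAPLOID_GENOME_BP, 2))  # 7.0 (diploid)
print(physical_coverage(84_000_000, 500, HAPLOID_GENOME_BP, 4))  # 3.5 (tetraploid)
```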

3.4.1 Accurate discovery of gene fusions

Analysis of WGSS data for evidence of genomic rearrangement is made difficult by the repetitive nature of the genome, and the large amount of coverage required to reliably predict rearrangements [65, 28]. Conversely, RNA-Seq produces many spurious chimeric reads by at least two mechanisms: template switching during reverse transcription [69], and the combined effect of read errors and high gene expression [138]. Given RNA-Seq and WGSS data from the same sample, an integrated analysis using Comrad provides the ability to more accurately resolve multi-map reads, and more confidently identify real events even where those events have relatively little evidence.

Comrad accurately identifies WGSS evidence for gene fusions, even when the evidence consists of only a small number of reads. Five of the validated genomic breakpoints are supported by two or fewer WGSS reads (Table 3.1). The breakpoint for DGKB-MIPOL1, arguably the most biologically important fusion in the dataset, is supported by one read. An independent analysis of WGSS data using a threshold of one read would result in the prediction of 20675 fusions between the genes considered in this study. Considering only uniquely aligned reads results in the prediction of 9949 fusions supported by at least one WGSS read. The existence of 9949 fusions would indicate that C4-2 is very highly rearranged, especially since these 9949 fusions represent rearrangements involving genic regions only. However, aCGH data does not indicate this level of genomic rearrangement, as only 41 copy number changepoints are predicted across the genome (Supplementary Table S5). Additionally, previous spectral karyotype results identified only 9 structural aberrations (per diploid cell) for LNCaP [11]. Thus it is likely that many of the 9949 fusions are either false positives or represent transposable elements as opposed to large scale structural aberrations. Clearly, identification of true positives from this set of fusions would be difficult if not impossible, rendering a single discordant WGSS read unreliable in an independent analysis of WGSS data. However, that single discordant WGSS read could be used to identify an important rearrangement breakpoint when considered in conjunction with RNA-Seq data, as is done by Comrad.

Comrad provides the ability to accurately identify fusions where evidence for those fusions does not map uniquely to the genome. For the 9 PCR confirmed genomic fusions, 38% of the RNA-Seq reads and 32% of the WGSS reads are multi-map reads. Removing these reads from the analysis would prevent the identification of the MRPS10-HPR and TFDP1-GRK1 fusion transcripts and would hinder our ability to properly identify the CYP2C19-FAM190B fusion transcript (14/20 multi-map reads) and both the forward and reciprocal rearrangement breakpoints for MRPS10-HPR (12/13 and 4/5 multi-map reads respectively).

The MRPS10-HPR and TFDP1-GRK1 fusion transcripts are supported entirely by multi-map RNA-Seq reads. Intron 2 of WDR32 contains a region of high sequence similarity to parts of MRPS10. As a result, the RNA-Seq evidence for MRPS10-HPR aligns to either a hypothetical WDR32-HPR fusion transcript or the MRPS10-HPR fusion transcript. Similarly, RNA-Seq evidence for TFDP1-GRK1 also supports a hypothetical BX842568-GRK1 fusion transcript. TFDP1 and BX842568 have 96% sequence similarity over a region of 1234bp, making the two possibilities, TFDP1-GRK1 or BX842568-GRK1, equally likely without knowledge of the corroborating WGSS data identified by Comrad. Therefore, even an analysis of the RNA-Seq data using a method that considers multi-map reads could fail to correctly identify the MRPS10-HPR and TFDP1-GRK1 fusion transcripts. However, Comrad is able to resolve the correct alignment location of multi-map reads that support the MRPS10-HPR and TFDP1-GRK1 fusion transcripts by leveraging the relatively unambiguous WGSS evidence of associated rearrangement breakpoints.

3.4.2 MIPOL1-DGKB and MRPS10-HPR are caused by reciprocal exchanges in C4-2 and LNCaP

In order to search for multiple genomic breakpoints associated with gene fusions we altered the breakpoint overlap function as described in section 3.3.4. The MIPOL1-DGKB and MRPS10-HPR fusions were both found to have reciprocal rearrangement breakpoints in addition to the regular breakpoints that produce the MIPOL1-DGKB and MRPS10-HPR fusion transcripts. The reciprocal breakpoints were each positioned no further than 10kb from the regular breakpoints, and were oriented so as to create reciprocal DGKB-MIPOL1 and HPR-MRPS10 fusion genes. Comrad did not detect a fusion transcript for either of the reciprocal DGKB-MIPOL1 and HPR-MRPS10 fusion genes. The presence of the reciprocal rearrangement breakpoints was confirmed by PCR for both C4-2 and LNCaP.

The regular and reciprocal breakpoints involving the MRPS10 gene at p21.1 on chromosome 6 and the HPR gene at q22.3 on chromosome 16 are almost certain to represent the t(6;16)(p21.1;q22) reciprocal translocation previously identified by spectral karyotyping of LNCaP [11]. The same spectral karyotype for LNCaP does not identify a reciprocal translocation between chromosomes 7 and 14 that would be necessary to explain the regular and reciprocal breakpoints identified for DGKB-MIPOL1. Extensive Fluorescent In Situ Hybridization (FISH) experiments performed by [157] also rule out the possibility of a 7-14 reciprocal translocation, and instead [157] hypothesize that the DGKB-MIPOL1 fusion is the result of an insertion. The new reciprocal breakpoint evidence identified by Comrad and validated by PCR strongly indicates that the DGKB-MIPOL1 fusion is not the result of a simple insertion. Given the previous spectral karyotype and FISH evidence, and the newly identified reciprocal breakpoint, a more likely hypothesis is that the DGKB-MIPOL1 fusion is caused by an underlying reciprocal insertion, by which genomic DNA is exchanged between chromosome 7 and chromosome 14 to produce DGKB-MIPOL1 and the reciprocal MIPOL1-DGKB (Figure 3.3).

3.4.3 Genomic rearrangements create fusion transcripts with non-canonical splicing

The Comrad predictions include 3 validated fusion transcripts with non-canonical splicing in C4-2. The 3 transcripts predicted for the PIK3CD-RERE fusion include one for which an intronic region of RERE is not spliced out of the resulting transcript. The MRPS10-HPR fusion activates a cryptic splice site in the 3' UTR of MRPS10 to produce a fusion transcript with non-canonical splicing. Finally, the CCDC43-YBX2 fusion transcript includes the full sequence of the genomic breakpoint, as no splicing occurs across the breakpoint. By analyzing the exon and intron expression obtained from RNA-Seq alignments, it is apparent that a significant proportion of CCDC43-YBX2 expression includes the first half of intron 3 of CCDC43 and the last half of intron 4 of YBX2 (Figure 3.4). The intron retention found for CCDC43-YBX2 likely results because intron 3 of CCDC43 is an AT-AC intron (for both splice variants) whereas intron 4 of YBX2 is a GT-AG intron. The resulting fused intron, with an AT at the 5' splice site and an AG at the 3' splice site, is unlikely to be recognized by either the U2 or U12 dependent spliceosomes [154].

Figure 3.3: Evidence for MIPOL1-DGKB as a reciprocal insertion. Insertion of the ETV1 locus into chromosome 14 is supported by 1 WGSS read and 30 RNA-Seq reads. The reciprocal insertion of the MIPOL1 locus into chromosome 7 is supported by 1 WGSS read.

Figure 3.4: Gene fusion CCDC43-YBX2 produces fusion transcripts with non-canonical splicing. Exon and intron expression estimated from RNA-Seq alignments is shown in dark grey above the CCDC43 and YBX2 gene models at the top. Sequences are shown for the 5' and 3' splice sites of introns involved in the fusion, in addition to the splice site sequences of the aberrant AT-AG intron of CCDC43-YBX2. RNA-Seq and WGSS reads supporting the fusion are shown at the bottom of the figure.

3.5 Discussion

Comrad provides the first accurate computational method for simultaneous analysis of RNA-Seq and low coverage WGSS data for the purposes of identifying fused genes, and for differentiating fusions of genomic origin from those fusions that are non-genomic, i.e. co-transcription of adjacent genes or trans-splicing events. We have used the C4-2 data and a theoretical analysis to show that Comrad is able to discover fusions that other methods would not be capable of discovering given the same data. The advantages of Comrad are twofold. First, Comrad is able to leverage unambiguous WGSS data in order to correctly identify a fusion transcript supported by multi-map RNA-Seq data, and vice versa. Second, Comrad is able to accurately identify genomic rearrangements that result in gene fusions, even if that genomic rearrangement is supported by only one WGSS read. This second advantage means that genomic rearrangements producing gene fusions can be accurately identified in low coverage genome data, i.e. a Comrad analysis can be performed for roughly twice the cost of an RNA-Seq experiment.

As a proof of concept, we have shown that we are able to re-discover, in the C4-2 data, 4 of 6 fusions previously identified in the closely related cell line LNCaP. We then successfully validated the fusion transcripts and rearrangement breakpoints for 6 of the 10 novel genomic fusions nominated by Comrad. All 6 fusions were confirmed by PCR for both C4-2 and LNCaP, showing that Comrad is more sensitive to genomic fusions than previous methods that have been applied to LNCaP. Additionally, we used deFuse to identify fusion transcripts for which no genomic rearrangement is detected by Comrad, despite the fact that a genomic rearrangement is expected given the position and orientation of the genes involved. We identified and validated ZDHHC20-TNFRSF19, possibly a trans-splicing event, or possibly the product of a genomic inversion and thus a false negative for Comrad in the C4-2 data. The fact that only one high confidence fusion transcript could be identified as a potential false negative for Comrad implies that Comrad provides a sensitive method for the detection of fusion transcripts with associated rearrangement breakpoints.

Finally, we have used Comrad to gain new insight into the biology of rearrangement fusions. We have identified instances of non-canonical splicing in fusion transcripts produced by genomic rearrangements, including activation of cryptic splice sites, and intron retention due to incompatibility between the 5' and 3' splice sites of the rearrangement induced intron. We have also used a simple modification of the Comrad framework to identify reciprocal evidence of rearrangement breakpoints. The modified framework led to the discovery of reciprocal evidence for both interchromosomal fusions identified by Comrad. For one of these fusions, MRPS10-HPR, the location of the reciprocal translocation coincides precisely with a translocation found in LNCaP by spectral karyotyping [11]. The other fusion, DGKB-MIPOL1, was previously thought to be the result of an insertion [157]; however, Comrad has provided evidence that this fusion could be the result of a reciprocal insertion.

RNA-Seq has already proven to be a powerful tool for the discovery of aberrant transcripts, and WGSS has already proven its utility when searching for rearrangements. We find that an integrated analysis of RNA-Seq and WGSS data minimizes some of the limitations of analyzing either RNA-Seq or WGSS data alone, and yields greater insight than either of these data types can provide independently. Given the possible existence of rearrangement fusions occurring at low levels of recurrence in cancer, the increased accuracy and decreased cost associated with a Comrad analysis may be useful when searching for these rearrangement fusions in a large number of tumours. The identification of such rearrangement fusions would then hopefully assist in the classification of molecular subtypes and the development of targeted therapies.

Table 3.1: Known and novel fusions predicted by Comrad in C4-2 and validated in both LNCaP and C4-2. The number of reads supporting each event is provided, in addition to how many of those reads multi-map to the genome. The first 6 fusions have been previously described.

5' gene   3' gene   event          evidence  reads  multi-map
MIPOL1    DGKB      transcript     RNA-Seq   30     0
                    translocation  WGSS      1      0
                    reciprocal     WGSS      1      0
RERE      PIK3CD    transcript     RNA-Seq   35     0
                    transcript     RNA-Seq   11     0
                    transcript     RNA-Seq   6      0
                    transcript     RNA-Seq   11     0
                    inversion      WGSS      1      0
MRPS10    HPR       transcript     RNA-Seq   67     67
                    translocation  WGSS      13     12
                    reciprocal     WGSS      5      4
DLEU2     PSPC1     transcript     RNA-Seq   0      0
                    deletion       WGSS      0      0
SLC45A3   ELK4      transcript     RNA-Seq   18     0
                    transcript     RNA-Seq   15     0
C19orf25  APC2      transcript     RNA-Seq   0      0
TFDP1     GRK1      transcript     RNA-Seq   10     10
                    deletion       WGSS      7      4
FAM117B   BMPR2     transcript     RNA-Seq   12     0
                    eversion       WGSS      2      0
ITPKC     PPFIA3    transcript     RNA-Seq   6      0
                    deletion       WGSS      13     0
CCDC43    YBX2      transcript     RNA-Seq   9      0
                    deletion       WGSS      8      1
GPS2      MPP2      transcript     RNA-Seq   23     1
                    eversion       WGSS      10     0
FAM190B   CYP2C19   transcript     RNA-Seq   20     14
                    deletion       WGSS      2      0

Chapter 4

nFuse: Discovery of complex genomic rearrangements in cancer using high-throughput sequencing

4.1 Introduction

Cancer is a genomic disease characterized by unregulated cell growth resulting from acquired or inherited DNA changes. Genome rearrangements are an important class of DNA changes, known to disrupt the activity of tumour suppressor genes and promote increased activity of oncogenes. Genome rearrangements are also known to create fusion genes: novel oncogenes formed when a rearrangement juxtaposes two or more existing genes. Fusion genes are the defining molecular feature of many cancers and represent potential drug targets in those cancers. A classic example is the BCR-ABL1 gene fusion present in 95% of chronic myelogenous leukemia patients, and targeted by the drug imatinib.

The molecular mechanisms that cause somatic genome rearrangements are still the focus of investigation. Double stranded DNA breaks followed by a 'joining event' are known to result in a simple genomic rearrangement consisting of a single breakpoint, where a breakpoint is defined as a pair of genomic locations that are distant in the normal genome, but adjacent in the tumour genome. A breakpoint can be considered as the most basic unit of rearrangement. Examples of processes that generate single breakpoints include breakage-fusion-bridge cycles, nonhomologous end joining and homologous recombination-mediated repair [16].

Recently discovered are complex genomic rearrangements (CGRs), rearrangements comprised of multiple breakpoints with a specific structure. In prostate cancer, for example, [14] discovered closed chains of breakage and rejoining (CCBR). [14] suggested that CCBR potentially occurs when distant chromosomal regions are spatially co-localized in the nucleus, possibly because they have been recruited by the same transcriptional factory. Importantly, they showed that biologically relevant gene fusions, such as TMPRSS2-ERG, were created by CCBR events. CCBRs are balanced rearrangements: they result in little or no loss of genomic material. It has been proposed that balanced rearrangements are more likely to produce functional gene fusions [108].

Other cancers exhibit an entirely different type of CGR produced by a shattering of chromosomal regions, followed by a reassembly from the resulting fragments [153]. As a result, some breakpoints between large chromosomal segments contain additional smaller fragments (genomic shards) interposed at the breakpoint [16]. These genomic shards originate from other regions affected by the catastrophe, typically at the boundaries of deleted regions [16]. Breakpoints with small (∼500bp) genomic shards interposed at the breakpoint are termed complex and have been identified previously in breast cancer [152]. Breakpoints with larger fragments of other genes interposed at the breakpoint have the potential to create poly-fusions, fusion genes comprised of 3 or more separate genes. Both complex breakpoints and poly-fusions are rearrangements composed of two or more simple breakpoints, and identification of all breakpoints is required to discover the fusion.

High-throughput paired-end Whole Genome Shotgun Sequencing (WGSS) is currently the most efficient method of identifying breakpoints in tumour genomes. Briefly, WGSS can be used to sequence the ends of short fragments of DNA produced by fragmentation of a tumour genome. The pairs of end sequences (paired-end reads, or simply reads) can then be mapped back to a healthy reference genome sequence. Distantly mapping reads or reads that map with unexpected orientation can then be used to predict breakpoints. WGSS, however, presents many unique challenges compared to earlier technologies. The presence of repeated regions in the genome and short WGSS read lengths complicate the problem of unambiguously identifying the origin of some WGSS reads. Furthermore, sequencing errors lead to some proportion of false reads. Both of these problems are magnified due to the huge size of WGSS datasets. Finally, aneuploidy, tumour heterogeneity and cellularity have the combined effect of diluting the sequence signal of breakpoints, even in high coverage WGSS datasets. Nevertheless, solutions now exist for accurately predicting breakpoints from WGSS [65, 28], though a true account of false negative rates remains elusive.

Given the ability to predict breakpoints in WGSS, an important question is how to infer genome structure from these breakpoints, and potentially reconstruct chromosomal architectures. In a recent paper, [53] propose methods for reconstructing 'digital karyotypes' from copy number and breakpoint predictions. Their method requires precise breakpoint predictions, and could not guarantee a unique solution for a reasonably complex genome. Prior to [53], efforts to reconstruct tumour genomes relied on low resolution data such as FISH and BAC sequencing [128, 129, 120]. These methods may be sufficiently sensitive to reconstruct large scale rearrangements; however, they will likely miss complex focal rearrangements.

In this study, we propose a method for reconstructing CGRs from WGSS data. Crucial to the problem of identifying CGRs is the missing data problem: identification of a CGR relies on the identification of all n breakpoints in the CGR. Therefore, the basis for our approach is a high sensitivity method for predicting breakpoints. However, WGSS read alignment data contains a significant amount of noise, and this noise will produce false positive predictions, especially with a method that prioritizes sensitivity. Thus we calculate a probability for each breakpoint that reflects our belief in its existence. Like the aforementioned studies, we identify CGRs using breakpoint graphs [124]. We incorporate the breakpoint probability into the graph, and use that probability to guide our search for high probability structures representing potential CGRs. We prioritize our search for CGRs based on fusion transcript predictions from matched high-throughput cDNA sequencing (RNA-seq), thereby using effect on the transcriptome as an indicator of potential functional significance.

We have applied our method, nFuse, to publicly available WGSS and RNA-seq data for the well characterized breast cancer cell line HCC1954. We show that we are able to rediscover a significant proportion of previously discovered breakpoints. Furthermore, we show that the breakpoint probability we calculate accurately separates the previously discovered breakpoints from a background of predominantly false positive predictions. Using Long-Range PCR (LR-PCR), we validated 5 out of 6 poly-fusions predicted by nFuse for HCC1954. We have also applied nFuse to WGSS and RNA-seq data generated from primary human prostate cancer sample 963 [169]. Using a CCBR discovered in 963, we illustrate how CCBRs can be used to infer the gene expression history of a tumour. Finally, we present an example of a CCBR with a complex breakpoint discovered in 963, providing a link between CCBR and complex breakpoints.

4.2 Methods

4.2.1 Complex rearrangement discovery using breakpoint graphs

Complex rearrangements involve two or more breakpoints, such that the set of breakpoints elicit a specific structure. To identify complex rearrangements, we employ a construct called the breakpoint graph [124]. The complex rearrangements we are interested in discovering naturally arise as features of the breakpoint graph. Unlike previous breakpoint graph approaches, the breakpoint graph we construct includes a measure of the uncertainty inherent in breakpoint predictions produced from WGSS data. Our algorithms then seek to identify CGRs more likely to be real by searching for the higher probability structures in the breakpoint graph. Of crucial importance is the effect of missing data on our ability to predict CGRs. For a CGR composed of n breakpoints, failing to predict any one of those n breakpoints will result in a failure to identify the CGR. To mitigate this problem we seek to include in the breakpoint graph all reasonable breakpoint predictions, including those nominated by reads with ambiguous genomic origin. Thus the breakpoint graph we construct will contain a large amount of noise, and the majority of breakpoints are expected to be false positives. A real but low probability breakpoint may then be identified as part of a CGR, provided the probabilities of the CGR's other breakpoints are sufficiently high. By contrast, removing low probability breakpoints before building the breakpoint graph would also remove the aforementioned real breakpoint, making it impossible to identify the CGR.

nFuse seeks to identify two types of CGRs: Closed Chains of Breakage and Rejoining (CCBR) [14], and poly-fusions/complex breakpoints. We emphasize here that these two types of CGRs are very different types of events, unified by breakpoint graphs as a common computational representation. We introduce the concept of the breakpoint graph by first focusing on poly-fusions and complex breakpoints, after which we describe CCBRs and their breakpoint graph representation.

Breakpoint graph structure

A breakpoint is an adjacency in one genome that does not exist in another genome. In the context of cancer genomics, we are interested in identifying adjacencies in the tumour genome not found in the normal (or reference) genome. Such unexpected adjacencies are evidence of somatic rearrangement, and may have important implications for tumour biology. For instance, an unexpected adjacency between the 5’ exons of gene A and the 3’ exons of gene B may represent an A-B fusion gene that drives proliferation of a tumour. The breakpoint graph is a representation of a set of unexpected adjacencies, or breakpoints. We use the breakpoint graph to represent the set of breakpoints identified in a tumour genome, that are not in the reference genome. The graph is defined on a set of vertices representing the set of nucleotides that are adjacent in the reference but not in the tumour. The graph contains two types of edges, breakpoint edges and adjacency edges. Breakpoint edges represent adjacencies in the tumour, while adjacency edges represent a putatively contiguous region of the reference genome not interrupted by a breakpoint. Note that for identification of CCBRs, we generalize adjacency edges as described below.

Consider the following example illustrated in Figure 4.1. Let A1 and A2 be adjacent nucleotides in reference chromosome A, B1 and B2 be adjacent nucleotides in reference chromosome B, and suppose we identify an A1, B2 breakpoint (Figure 4.1a). The graph for the A1, B2 breakpoint contains vertices for A1, A2, B1 and B2, in addition to the breakpoint edge (A1,B2) (Figure 4.1b). Now consider an additional B3, C1 breakpoint between chromosomes B and C (Figure 4.1c). In addition to the (B3,C1) breakpoint edge, the graph also contains a (B2,B3) adjacency edge representing a putatively contiguous region in the tumour between nucleotides B2 and B3 (Figure 4.1d). Finally consider a third breakpoint between chromosomes A and B (Figure 4.1e) represented by an (A3,B5) breakpoint edge (Figure 4.1f). The breakpoint graph will also contain a (B2,B5) adjacency edge representing the possibility that the (B3,C1) breakpoint does not exist in a putative tumour chromosome that contains both the (A1,B2) and (A3,B5) breakpoints. Thus each breakpoint will be considered optional in our realization of the breakpoint graph. To reflect this, we define adjacency edges as follows. Let X_left, X_right be two nucleotides adjacent in the reference with a breakpoint edge incident on X_left. We add an adjacency edge from X_left to every upstream right vertex. Similarly, if a breakpoint edge is incident on X_right, we add an adjacency edge from X_right to every downstream left vertex.
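The adjacency edge rule can be illustrated with a minimal Python sketch. It is not the nFuse implementation: the `Vertex` tuple, the function name, the coordinates, and the left/right side assigned to each vertex (inferred from the odd/even naming in the example) are assumptions made for illustration, and only vertices that carry a breakpoint edge are included.

```python
from collections import namedtuple

# side is "left" or "right": which member of a reference-adjacent pair the vertex is.
# The sides and coordinates below are assumptions made for this illustration.
Vertex = namedtuple("Vertex", ["name", "chrom", "pos", "side"])

def gain_adjacency_edges(vertices):
    """Connect each right vertex to every downstream left vertex on its chromosome.

    By symmetry this also covers the edges from each left vertex to every
    upstream right vertex. Returns sorted (right_name, left_name, length) tuples.
    """
    edges = []
    for u in vertices:
        if u.side != "right":
            continue
        for v in vertices:
            if v.side == "left" and v.chrom == u.chrom and v.pos > u.pos:
                edges.append((u.name, v.name, v.pos - u.pos))
    return sorted(edges)

# Breakpoint ends of the three-breakpoint example: (A1,B2), (B3,C1), (A3,B5).
vertices = [
    Vertex("A1", "chrA", 100, "left"),
    Vertex("A3", "chrA", 300, "left"),
    Vertex("B2", "chrB", 201, "right"),
    Vertex("B3", "chrB", 300, "left"),
    Vertex("B5", "chrB", 500, "left"),
    Vertex("C1", "chrC", 100, "left"),
]
gain_edges = gain_adjacency_edges(vertices)
```

Under these assumed coordinates the sketch yields the two adjacency edges named in the example, (B2,B3) and (B2,B5), with the second edge encoding the optional nature of the (B3,C1) breakpoint.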

Poly-fusions and complex breakpoints

A key feature of the breakpoint graph is that every alternating path represents a putative tumour chromosome. Poly-fusions and complex breakpoints are subsequences of tumour chromosomes, and as such will be represented as alternating paths given successful identification of all relevant breakpoints. As an example, consider a fusion between gene X on chromosome A and gene Y on chromosome C, for which a fragment of chromosome B is interposed at the breakpoint (Figure 4.1g). In the breakpoint graph, the complex breakpoint will be represented as an alternating path of length 5 between vertices representing the 5' end of gene X and the 3' end of gene Y (Figure 4.1h). In general, a poly-fusion or complex breakpoint involving n loci will be represented in the breakpoint graph as an alternating path of length 2n − 3.

Closed chains of breakage and rejoining (CCBR)

Closed chains of breakage and rejoining (CCBR) can be thought of as a generalization of a reciprocal translocation to n > 2 loci. For a reciprocal translocation, 2 loci are broken and the broken ends are swapped and rejoined. A 3 loci CCBR involves the breakage, permutation and rejoining of 3 loci. An example 3 loci CCBR event would be the transformation of chromosomes A, B and C into tumour chromosomes A-B, B-C and C-A (Figure 4.2a). In the ideal case, no chromosomal material will be lost in the exchange (Figure 4.2a). As shown in this work and previously [14], many instances of chromosomal breakage and rejoining involve the loss or gain of chromosomal material. As a result, the breakpoints at broken and rejoined loci may be separated by an unknown distance. Figure 4.2b depicts a more realistic example involving chromosomes A, B and C. In this example, the CCBR has resulted in a loss of small sections of chromosomes A and C. In addition, the A-B and B-C tumour chromosomes created by the CCBR both include copies of a segment of chromosome B, resulting in a gain of that segment. Of crucial importance, any loss or gain caused by a CCBR will not necessarily be represented by additional breakpoints in the breakpoint graph. Nevertheless, the breakpoint graph, properly defined, will yield CCBRs as a specific type of subgraph. To identify CCBRs we augment the previously defined breakpoint graph with additional edges. We refer to the previously defined adjacency edges as gain adjacency edges. Define

[Figure 4.1 image: panels (a) to (h), each drawn over reference chromosomes chrA, chrB and chrC with labelled nucleotides A1-A4, B1-B6 and C1-C2; panels (g) and (h) additionally mark gene X on chrA and gene Y on chrC.]

Figure 4.1: Breakpoint graph representations of poly-fusions. (A) A breakpoint as an unexpected adjacency. (B) The breakpoint graph for a single breakpoint showing a breakpoint edge. (C) Two breakpoints on chromosomes A, B, and C. (D) The breakpoint graph for the two breakpoints showing 2 breakpoint edges and an adjacency edge. (E) Three breakpoints on chromosomes A, B, and C. (F) The breakpoint graph for the three breakpoints showing a (B2,B5) adjacency edge that encodes the optional nature of breakpoint (B3,C1). (G) Breakpoints for a X-Y gene fusion with a complex breakpoint. (H) The breakpoint graph for the complex breakpoint showing an alternating path between X and Y.

additional adjacency edges called loss adjacency edges as follows. Let X_left, X_right be two nucleotides adjacent in the reference with a breakpoint edge incident on X_left. Add loss adjacency edges from X_left to every downstream right vertex. Similarly, if a breakpoint edge is incident on X_right, add loss adjacency edges from X_right to every upstream left vertex. An n-loci CCBR in the resulting graph will be represented by an alternating cycle of length 2n. Figure 4.2c shows the breakpoint graph for the CCBR in Figure 4.2b. The breakpoint edges, loss edges (A1,A4) and (C1,C4), and gain edge (B2,B3) together form an alternating 6-cycle. Note that an alternative explanation for the breakpoints in Figure 4.2b is a reciprocal translocation between chromosomes A and C, with a complex breakpoint for the A-C chromosome involving a shard of chromosome B. We will explore this ambiguity further when discussing the results for tumour sample 963.
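The alternating 6-cycle described above can be checked mechanically. The following Python sketch is illustrative only: the text names the loss edges (A1,A4) and (C1,C4) and the gain edge (B2,B3), whereas the three breakpoint edges used here, (A1,B2), (B3,C4) and (C1,A4), are an assumption chosen to be consistent with tumour chromosomes A-B, B-C and C-A.

```python
def is_alternating_cycle(cycle):
    """Check that a closed walk alternates breakpoint and adjacency edges.

    cycle: list of (u, v, kind) edges in traversal order, each oriented u -> v,
    where kind is "breakpoint", "gain" or "loss" (gain and loss are both
    adjacency edges).
    """
    n = len(cycle)
    if n < 2 or n % 2 != 0:
        return False
    for i, (_, v, kind) in enumerate(cycle):
        next_u, _, next_kind = cycle[(i + 1) % n]
        if v != next_u:
            return False  # consecutive edges must share a vertex
        if (kind == "breakpoint") == (next_kind == "breakpoint"):
            return False  # two edges of the same class in a row
    return True

# Assumed traversal of the 3-loci CCBR; breakpoint edges are hypothetical.
ccbr_cycle = [
    ("A1", "B2", "breakpoint"),
    ("B2", "B3", "gain"),
    ("B3", "C4", "breakpoint"),
    ("C4", "C1", "loss"),
    ("C1", "A4", "breakpoint"),
    ("A4", "A1", "loss"),
]
```

With n = 3 loci the cycle has 2n = 6 edges, three of them breakpoint edges and three adjacency edges.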

[Figure 4.2 image: panels A, B and C, each drawn over reference chromosomes chr A, chr B and chr C with labelled nucleotides A1-A4, B1-B4 and C1-C4.]

Figure 4.2: Breakpoint Graph Representations of Closed Chains of Breakage and Rejoining (CCBR). (a) In an idealized version of a CCBR, no chromosomal material is lost or gained. (b) Actual CCBRs may involve small loss or gain of chromosomal material. For instance, the A2 → A3 and C2 → C3 sections of chromosomes A and C appear to have been lost, and the B2 → B3 section of chromosome B appears to have been duplicated.

Identifying high probability CGRs

A breakpoint graph constructed from WGSS data will contain many alternating paths connecting candidate fused genes, and many alternating cycles. Some of the ambiguity arises because the WGSS data is produced from a diploid, or potentially poly-ploid, genome. Tumour chromosomes reassembled from copies of the same reference chromosomes will each produce a set of breakpoints. The WGSS data for these tumour chromosomes will then yield a merged set of all breakpoints. Only in very simplistic instances will it be possible to repartition the merged breakpoints into sets of breakpoints each produced by the same tumour chromosome. Furthermore, the breakpoints obtained from WGSS data will include a signiﬁcant number of spurious predictions, especially when prioritizing sensitivity as proposed by nFuse. Spurious breakpoint predictions will further increase the number of alternating paths and cycles. nFuse uses an objective function to identify real CGRs from the background of incidental and false positive structures. Our objective function is probabilistically motivated, and incorporates the probability that each breakpoint exists (breakpoint probability), in addition to a probability calculated for the total length of adjacency edges in the structure (CGR length probability). Inclusion of the breakpoint probability will allow nFuse to mitigate the eﬀects of spurious breakpoint predictions. We model the CGR length probability as an exponential distribution with scale parameter β, and motivate the choice of exponential independently for complex breakpoints/poly-fusions and CCBRs in the following sections. The negative log probability of a CGR with breakpoints X and adjacency edge lengths Y can be calculated as given in Equation 4.1, herein referred to as the CGR score.

\[
\text{CGR score} \equiv -\log P(X, Y) = \log \beta + \sum_{y \in Y} \frac{y}{\beta} - \sum_{x \in X} \log P(x) \tag{4.1}
\]

Let G be a breakpoint graph with breakpoint edges given distance −log P(x), and adjacency edges given distance y/β. By inspection of Equation 4.1, an alternating cycle or path that maximizes P(X,Y) will be a shortest alternating cycle or path on G.
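As a worked illustration of the CGR score and the corresponding edge distances, the following Python sketch evaluates both for made-up inputs. The function names and the numeric values are hypothetical, and natural logarithms are assumed.

```python
import math

def cgr_score(breakpoint_probs, adjacency_lengths, beta):
    """Negative log probability of a CGR: breakpoint terms plus an
    exponential (scale beta) term on the total adjacency edge length."""
    total_length = sum(adjacency_lengths)
    length_term = math.log(beta) + total_length / beta
    breakpoint_term = -sum(math.log(p) for p in breakpoint_probs)
    return length_term + breakpoint_term

def edge_distances(breakpoint_probs, adjacency_lengths, beta):
    """Per-edge distances on G: -log P(x) for breakpoint edges, y / beta for
    adjacency edges."""
    return [-math.log(p) for p in breakpoint_probs] + [
        y / beta for y in adjacency_lengths
    ]

# Hypothetical structure: two breakpoints and two adjacency edges (lengths in bp).
probs, lengths, beta = [0.9, 0.5], [300, 700], 2000.0
score = cgr_score(probs, lengths, beta)
path_length = sum(edge_distances(probs, lengths, beta))
```

The score differs from the summed edge distances only by the constant log β, which is why minimizing path or cycle length on G is equivalent to maximizing P(X,Y) for a fixed β.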

Breakpoint prediction and probability estimation

Let R be the set of paired end WGSS reads. We generate a set of mapping locations M for R using the following well established strategy [160, 162]. For each paired end read (r_j^1, r_j^2) ∈ R:

1. identify a single concordant mapping location if it exists.
2. if no concordant mapping location exists:
   (a) identify the n top scoring mapping locations for r_j^1
   (b) identify the n top scoring mapping locations for r_j^2

We identify the n top scoring mapping locations for r_j^1 (and r_j^2) as follows. Let s_j be the maximum alignment score attained by partial alignment of read j to the genome. Briefly, a partial alignment is an alignment of the first ℓ nucleotides of the read (supplementary methods). Let k be the number of mappings of read j that attain s_j. If k > n assume the read is unmappable and filter it, otherwise retain the k mapping locations. The study described herein used bowtie2 [81] in local alignment mode to obtain partial alignments. We are currently exploring the tradeoff between speed, accuracy and flexibility of available aligners to allow optimal performance of the nFuse breakpoint prediction.

Let m_j ∈ M be the mapping locations identified for read (r_j^1, r_j^2) ∈ R. Define the following indicator variables:

c_j ≡ read j is concordant
d_j ≡ the true alignment was discovered and is in the set m_j
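The rule for retaining top scoring mapping locations for one read end can be sketched as follows. This is a minimal illustration, not the nFuse code: the function name and the (location, score) input format are assumptions, and the alignment scores would in practice come from the aligner.

```python
def retain_top_mappings(scored_mappings, n):
    """Keep the mappings that attain the maximum alignment score s_j.

    scored_mappings: list of (location, alignment_score) for one read end.
    If more than n mappings attain the maximum score, the read end is treated
    as unmappable and an empty list is returned.
    """
    if not scored_mappings:
        return []
    best = max(score for _, score in scored_mappings)
    top = [loc for loc, score in scored_mappings if score == best]
    return [] if len(top) > n else top

# Hypothetical mappings for one read end: two locations tie for the best score.
mappings = [("chr1:100", 50), ("chr2:200", 50), ("chr3:300", 42)]
kept = retain_top_mappings(mappings, n=2)      # both top scoring locations kept
filtered = retain_top_mappings(mappings, n=1)  # k = 2 > n = 1, read filtered
```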

We make the assumption that reads mapped concordantly by the aligner are in fact concordant (with probability 1). We filter the concordantly mapped reads to create the set of discordant reads R^d and set of discordant mappings M^d. As a result, P(c_j = 1, d_j = 1) = 0 for the set of filtered reads. We estimate probabilities for the remaining two possibilities for the true alignment of each read:

P(c_j = 1|·) ≡ concordant but missed by the aligner
P(d_j = 1|c_j = 0, ·) ≡ discordant but missed by the aligner

We estimate P(c_j = 1|·) using the maximum concordant alignment score cs_j. To calculate cs_j, we align both ends of read j to all mapping locations in the set m_j, and set cs_j to the maximum alignment score identified by this process. We then calculate P(c_j = 1|cs_j) (supplementary methods), and use it to approximate P(c_j = 1|·). We approximate P(d_j = 1|c_j = 0, ·) as P(d_j = 1|c_j = 0, as_j) where as_j is the alignment score for read j (supplementary methods).

Next, we cluster the discordant alignments M^d based on the likelihood that a set of alignments were generated by the same breakpoint (supplementary methods). Let the resulting clusters of alignments represent putative breakpoints. Let g_ij indicate that putative breakpoint i generated read j. Assume g_ij = 0 if read j is not in the cluster that supports breakpoint i. We estimate P(g_ij = 1|·) as P(g_ij = 1|nm_j, d_j = 1), where nm_j is the number of alternate mapping locations of read j. Under the assumption that all mapping locations discovered by the aligner are equally likely, we calculate P(g_ij = 1|nm_j, d_j = 1) = 1/nm_j.

Finally, let b_i indicate that breakpoint i is true, let G_i be the set of all g_ij for breakpoint i, and let n_i be the number of reads that were generated by breakpoint i, that is n_i = Σ_{g_ij ∈ G_i} g_ij. We estimate P(b_i|n_i) (supplementary methods) and use it to estimate P(b_i|·) as given by Equation 4.2.

\[
P(b_i \mid \cdot) = \sum_{G_i} P(b_i \mid n_i) \prod_j P(g_{ij} = 1 \mid nm_j, d_j = 1) \times P(d_j = 1 \mid as_j, c_j = 0) \times P(c_j = 0 \mid cs_j) \tag{4.2}
\]
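One way to evaluate the sum over G_i in Equation 4.2 without enumerating every assignment is sketched below in Python. This is an interpretation rather than the documented nFuse procedure. It assumes that reads are independent, that each read j supports breakpoint i with probability q_j equal to the product of the three per-read terms, and that a read contributes 1 − q_j when g_ij = 0. The function P(b_i|n_i) is defined in the supplementary methods, so the `p_b_given_n` used here is a purely hypothetical stand-in.

```python
def read_support_distribution(q):
    """Distribution of n_i = sum_j g_ij for independent reads with support
    probabilities q (dynamic programming over reads)."""
    dist = [1.0]  # dist[n] = probability that exactly n reads support the breakpoint
    for qj in q:
        new = [0.0] * (len(dist) + 1)
        for n, p in enumerate(dist):
            new[n] += p * (1.0 - qj)
            new[n + 1] += p * qj
        dist = new
    return dist

def breakpoint_probability(reads, p_b_given_n):
    """Marginalize P(b_i | n_i) over the distribution of n_i.

    reads: dicts with nm (number of alternate mapping locations),
    p_d (P(d_j = 1 | as_j, c_j = 0)) and p_not_c (P(c_j = 0 | cs_j)).
    """
    q = [(1.0 / r["nm"]) * r["p_d"] * r["p_not_c"] for r in reads]
    dist = read_support_distribution(q)
    return sum(p_b_given_n(n) * p for n, p in enumerate(dist))

# Hypothetical stand-in for P(b_i | n_i); the real function is not given here.
def p_b_given_n(n):
    return 1.0 - 0.5 ** n

unique_reads = [{"nm": 1, "p_d": 1.0, "p_not_c": 1.0}] * 3
multi_map_read = [{"nm": 2, "p_d": 1.0, "p_not_c": 1.0}]
p_unique = breakpoint_probability(unique_reads, p_b_given_n)
p_multi = breakpoint_probability(multi_map_read, p_b_given_n)
```

Under these assumptions, a single read with two equally likely mapping locations contributes half the support of a uniquely mapped read, which lowers the resulting breakpoint probability.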

Identifying high probability complex breakpoints and poly-fusions

Complex breakpoints and poly-fusions may be frequently occurring events in a rearranged tumour genome. Without further information, the biological significance of these events will be difficult to quantify. We use fusion transcripts predicted from RNA-seq to guide our search for complex breakpoints and poly-fusions, using effect on the transcriptome as an indicator of potential biological significance. The fusion transcripts also serve as a scaffold for reconstruction of the complex breakpoints / poly-fusions. Given a gene A - gene B fusion transcript predicted from RNA-seq, we would like to predict the set of breakpoints that produced the A-B fusion. The breakpoints will often occur in the introns of gene A and B. As a result, these breakpoints are often spliced out of the A-B fusion transcript. Let xA and xB be the genomic positions of the splice sites in gene A and B that are predicted as spliced together in the fusion transcript. We would like to predict the intron sequence between xA and xB on the tumour chromosome. We model the intron lengths of fusion transcripts using an exponential with scale parameter βp. An alternating path p from xA to xB represents a potential intron for the A-B fusion transcript, and the total length of the adjacency edges in p equals the length of the putative intron. Following from the analysis that led to the CGR score (Equation 4.1), we reconstruct the most likely intron by searching for the shortest alternating path between xA and xB on the graph G with β = βp. See the supplementary methods for details on setting βp.
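The search for a shortest alternating path can be sketched with Dijkstra's algorithm over states of the form (vertex, class of the last edge used). This Python sketch is an assumption about one possible implementation, not the nFuse code. It finds the shortest alternating walk that starts and ends with an adjacency edge and does not forbid revisiting a vertex, so it is a simplification of a true simple-path search. The graph, the probabilities and the value βp = 1000 are hypothetical.

```python
import heapq
import itertools
import math

def shortest_alternating_path(edges, source, target):
    """edges: iterable of undirected (u, v, kind, distance) with kind
    "breakpoint" or "adjacency" and non-negative distance.
    Returns (total_distance, [vertices]) or None if no alternating walk exists.
    """
    graph = {}
    for u, v, kind, dist in edges:
        graph.setdefault(u, []).append((v, kind, dist))
        graph.setdefault(v, []).append((u, kind, dist))
    counter = itertools.count()  # tie-breaker so the heap never compares states
    # Pretend we arrived at the source by a breakpoint edge, so the first
    # edge taken must be an adjacency edge.
    heap = [(0.0, next(counter), (source, "breakpoint"), [source])]
    done = set()
    while heap:
        dist, _, (vertex, last_kind), path = heapq.heappop(heap)
        if (vertex, last_kind) in done:
            continue
        done.add((vertex, last_kind))
        if vertex == target and last_kind == "adjacency" and len(path) > 1:
            return dist, path
        for nxt, kind, d in graph.get(vertex, []):
            if kind != last_kind and (nxt, kind) not in done:
                heapq.heappush(
                    heap, (dist + d, next(counter), (nxt, kind), path + [nxt])
                )
    return None

beta_p = 1000.0  # hypothetical scale parameter, in bp
edges = [
    ("xA", "A1", "adjacency", 400 / beta_p),
    ("A1", "B2", "breakpoint", -math.log(0.9)),
    ("B2", "B3", "adjacency", 300 / beta_p),
    ("B3", "C1", "breakpoint", -math.log(0.8)),
    ("C1", "xB", "adjacency", 500 / beta_p),
    ("A1", "C1", "breakpoint", -math.log(0.001)),  # low probability direct breakpoint
]
distance, path = shortest_alternating_path(edges, "xA", "xB")
```

In this toy graph the path through the chromosome B shard is preferred over the direct A1 to C1 breakpoint, because the two high probability breakpoints and the short shard together cost less than the single low probability breakpoint.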

Identifying high probability CCBRs

Very little is currently known about CCBRs, making model selection difficult. We model the total length of loss and gain adjacency edges in a CCBR using an exponential distribution with scale parameter βc. We selected the exponential because it is the maximum entropy distribution for a positive random variate with fixed mean. For the purposes of this study, we have used βc = 2000bp. We expect that the future discovery of additional CCBRs will allow us to properly estimate βc. Similar to complex breakpoints / poly-fusions, we search for CCBRs that are associated with fusion transcript predictions. For each breakpoint b that is part of a complex breakpoint / poly-fusion, we search for a CCBR that includes b. Following from the analysis that led to the CGR score (Equation 4.1), we reconstruct the most likely CCBR that includes b by searching G with β = βc for the shortest alternating cycle that includes b. Specifically, we first remove the breakpoint edge (b1, b2) for breakpoint b from G, then search for the shortest alternating path between b1 and b2.

4.3 Results

We have used nFuse to identify CGRs in three datasets: a HCC1954 breast cancer cell line dataset, a dataset derived from primary tumour 963 [169], and a simulated dataset that includes 120 synthetic CGRs (see Table 4.1 for sequencing statistics). We used the HCC1954 dataset to assess breakpoint prediction sensitivity and breakpoint scoring speciﬁcity, and used the simulated dataset to assess precision and recall for CGR discovery. CGRs were retained only if their CGR score (Equation 4.1) was less than 20.

Table 4.1: Sequencing statistics for HCC1954 and 963.

                           HCC1954                   963                       Simulation
                           WGSS         RNA-seq      WGSS         RNA-seq      WGSS         RNA-seq
Read Length                36,80        36,50        50,76        50           80           50
Fragment Length Mean       193          176          406          233          300          250
Fragment Length Std. Dev.  37           33           49           36           50           40
Total Reads                340,977,703  175,508,350  176,764,897  86,720,870   208,839,566  2,877,519
Concordantly Mapped Reads  308,724,222  145,180,689  143,853,385  65,826,510   186,178,029  2,469,024

4.3.1 HCC1954 breast cancer cell line

We first applied our method to publicly available data for HCC1954, a cell line that has been well studied at the molecular level. The HCC1954 cell line was derived from a ductal breast carcinoma and is estrogen receptor negative, progesterone receptor negative, and ERBB2 positive [179]. Four recent studies sought to identify rearrangements in HCC1954. [16] used end sequencing of bacterial artificial chromosome (BAC) libraries to discover rearrangements in tumour amplicons, and identified 59 unique breakpoints in HCC1954. [178] used long transcriptome reads to nominate fusion transcripts, and a combination of Long-Range PCR (LR-PCR) and fluorescence in situ hybridization (FISH) to identify underlying genomic rearrangements. [152] used WGSS to discover rearrangements in 24 breast cancers, and were able to identify 230 unique breakpoints in HCC1954. Some of the breakpoints discovered by [152] were more complex than a breakage and rejoining of two genomic loci. Interposed between the breakpoints were one or more genomic shards: small (<500bp) fragments of DNA from elsewhere in the genome. [48] also used WGSS to discover somatic alterations in HCC1954, and identified 77 unique breakpoints. Finally, [6] used their pipeline SnowShoes-FTD to identify 4 fusion transcripts in HCC1954. We obtained WGSS and RNA-seq data for HCC1954 from the NCBI Sequence Read Archive (SRA, http://www.ncbi.nlm.nih.gov/Traces/sra). The WGSS data (accession number ERA010917) is the same data used in the study by [48], and the RNA-seq data (accession number ERA015355) was produced in a separate study on allele specific expression [179]. Next we compiled a list of 345 validated breakpoints from the studies by [16], [152], and [48]. Small deletions were excluded from the analysis since they are not the focus of this study. There was very little overlap between the sets of breakpoints discovered in each study, with only 3 breakpoints common to all three studies. Using the previously validated breakpoints, we sought to estimate whether nFuse is sensitive enough to detect a significant proportion of real breakpoints, and whether the nFuse breakpoint ranking method could discern between real breakpoints and the background noise of spurious predictions. The breakpoint detection step of the nFuse pipeline identifies 296 of the 345 previously validated true positive breakpoints, accounting for 91.5% of the [16] breakpoints, 81.3% of the [152] breakpoints, and 97.4% of the [48] breakpoints, for a recall of 0.858 (Figure 4.3a). In addition to the 296 true positive breakpoints, nFuse also identifies 2,634,524 additional breakpoints.
Since 2,634,524 is well beyond the expected number of breakpoints in a rearranged tumour genome, and since we are including even very low probability breakpoint predictions, a large majority of the 2,634,524 are expected to be false positives. We sought to estimate whether the breakpoint probability we calculate could discriminate between true and false breakpoints. We first selected 3000 breakpoint predictions at random, and assumed that a significant majority of these predictions were false. Next we compared the scores of the 3000 randomly selected predictions to the scores of the true positive predictions (Figure 4.3b), finding that the true positive predictions scored significantly better than the randomly selected predictions (p-value < 2.2 × 10^-16, Wilcoxon rank sum test). Next we used a breakpoint graph constructed from the 2,634,524 HCC1954 breakpoints to predict CGRs in HCC1954 (see Table 4.2 and Table 4.3 for summaries of CCBRs and complex breakpoints respectively). We then attempted to validate the top 6 complex breakpoint / poly-fusion predictions as ranked by CGR score (Equation 4.1). For validation we performed Long Range PCR (LR-PCR) across the entire length of the complex breakpoint / poly-fusion. An event was considered validated if the size of the PCR product matched the predicted length, and Sanger sequencing of both ends of the PCR product matched the predicted sequence. Five out of 6 LR-PCR assays produced PCR products, 4 of the predicted size. The PCR product for PHF20L1-SAMD12 was approximately 1.5 kbp longer than expected. We confirmed by PCR that each of the 3 individual breakpoints predicted to form the PHF20L1-SAMD12 poly-fusion were present in the PHF20L1-SAMD12 PCR product. Thus we conclude that the PHF20L1-SAMD12 prediction is correct but potentially incomplete, and suspect the existence of an additional insertion that is not identified when searching for the least complex solution. The 5 validated events are shown in Figure 4.4.

[Figure 4.3 image: (a) Venn diagram of the breakpoint sets of Stephens et al. (230), Bignell et al. (59), Galante et al. (77) and nFuse, annotated "recall: 0.858"; (b) beanplot with y-axis "Score" and groups "All Predictions" and "Previously Validated".]

Figure 4.3: Performance of nFuse breakpoint prediction on breakpoints previously discovered in HCC1954. (A) Shown is the overlap between sets of breakpoints discovered by [16], [152], [48], and nFuse. Previously discovered breakpoints are rediscovered by nFuse with a recall of 0.858. (B) Beanplot comparing nFuse breakpoint scores for a random selection of 3000 nFuse breakpoint predictions, and the 296 'true positive' nFuse breakpoint predictions. Score is calculated as -log probability. The nFuse breakpoint scoring ranks true positive breakpoints significantly higher (closer to 0) than random breakpoints, many of which are expected to be false positives.

Table 4.2: Summary of putative CCBRs discovered in HCC1954. The CCBRs are grouped by number of breakpoints, and cumulative distance between breakpoints. Shown are counts for each group.

              Total Distance Between Breakpoints
No. Breaks.   0-500   500-1000   1000-2000   2000-5000   5000-10000   >10000
2             6       0          3           5           2            1
3             6       1          4           1           1            0
4             1       2          0           0           3            3
5             0       0          0           0           0            4
6             0       0          0           0           0            1
7             0       0          0           0           0            2
8             0       0          0           0           1            0

Four of the 5 validated complex breakpoints (Figures 4.4a-4.4d) express intergenic or intronic sequence, and are more likely to be truncating mutations than viable fusion genes.

Table 4.3: Summary of putative complex breakpoints discovered in HCC1954. The complex breakpoints are grouped by number of genomic shards, and total length of shards. Shown are counts for each group.

              Total Length of Genomic Shards
No. Shards    0-500   500-1000   1000-2000   2000-5000   5000-10000   >10000
1             12      0          3           3           2            7
2             1       0          0           1           0            1
3             0       0          0           0           0            0
4             0       0          0           0           0            1
5             0       0          0           0           0            1
6             0       0          0           0           0            1

The fifth fusion transcript, PHF20L1-SAMD12, first discovered by [6], is predicted to preserve the reading frames of both PHF20L1 and SAMD12 (Figure 4.4e). Among high confidence fusion transcripts, PHF20L1-SAMD12 expression is second only to STRADB-NOP58, as suggested by read depth at the breakpoint. Genomic shards have implications for traditional rearrangement detection techniques. The genomic shards range in size from 220bp to 4.3 kbp for the 5 events we validated. A 220 bp genomic shard may be small enough such that some paired end reads span the full complex breakpoint, allowing detection of the breakpoint using conventional methods. However, the 1 kbp and larger genomic shards will be longer than the paired end reads in most WGSS assays, preventing straightforward detection of the breakpoint. Thus a potentially interesting gene fusion such as PHF20L1-SAMD12 would be impossible to discover when considering only single breakpoints as evidence for gene fusions, as has been done previously [10]. Instead, such methods would falsely nominate SAMD12-FAM49B and PHF20L1-Intergenic as truncating mutations. We identified an additional 2 high-confidence poly-fusions by searching for sequences of genomic shards highly connected by fusion transcripts. For each fusion transcript corroborated by alternating path p, we searched for other fusion transcripts corroborated by a subpath of p. We used the resulting sets of (non-conflicting) fusion transcripts to identify poly-fusions for which each genomic shard is expressed in at least one non-conflicting fusion transcript. Thus we use fusion transcripts as a scaffold for local genome reconstruction. We believe the highly connected nature of the additional 2 poly-fusions (Figures 4.4f-4.4g) provides more confidence in these events. Finally, we used the 7 complex breakpoints / poly-fusions (5 validated and 2 high-confidence) to evaluate the utility of including suboptimal breakpoint predictions in the breakpoint graph.
Statistics for the 7 events are detailed in Table 4.4. The 15 breakpoints for the 7 events include breakpoints with low read support, breakpoints supported by multi-map reads, and low probability breakpoints. Three of the breakpoints are supported by only 1 read. For 2 of the breakpoints, the entire set of supporting reads also align to other genomic loci, and form a coherent cluster at those loci. Thus even multi-map resolution methods such as VariationHunter [65] may be unable to identify the correct mappings of these reads. Another 2 breakpoints are given low probability due to the existence of marginal concordant alignments. Using breakpoints supported by at least 2 uniquely aligning, high confidence discordant reads would have resulted in identification of only 3 of the 7 events.

[Figure 4.4 image: panels (a) GSDMC-Intergenic, (b) PVT1-Intergenic, (c) ENDOD1-WASH2P, (d) ZDHHC11-RNF130, (e) PHF20L1-SAMD12, (f) PCNT and (g) OXR1, each showing the fusion transcript(s), the tumour chromosome with shard sizes, and the reference chromosomes of origin.]

Figure 4.4: Complex breakpoints and poly-fusions in HCC1954. (A-D) Complex breakpoints produce truncated GSDMC and PVT1 transcripts, and ENDOD1-WASH2P and ZDHHC11-RNF130 fusion transcripts. Validated by LR-PCR. (E) A PHF20L1-FAM49B-SAMD12 poly-fusion produces an in-frame PHF20L1-SAMD12 fusion transcript. Validated by LR-PCR. (F-G) Complex breakpoints corroborated by multiple fusion transcripts.

Table 4.4: Statistics for CGR breakpoints detected by nFuse. Columns: Type: Closed Chain Breakage and Rejoining (CCBR) or Complex Breakpoint (CB); CGR score: score calculated as per Equation 4.1; breakpoint: order of the breakpoint in the CCBR or CB; read count: number of supporting WGSS reads; multi-map: average number of genomic loci to which supporting reads can alternatively be mapped; probability: breakpoint probability as calculated using Equation 4.2; score: negative log of the breakpoint probability; rank: rank of the breakpoint in the set of all breakpoints ordered by score.

CGR               Case     Type     CGR score  breakpoint  read count  multi-map  probability  score  rank
HMGN2P46-MYC      963      CCBR     12.06      1           5           1.8        0.994        0.01   613
                                               2           3           1.3        0.022        3.83   4,328
                                               3           6           1.0        1.000        0.00   414
                                               4           1           1.0        0.001        7.47   152,379
WDTC1-EFCAB4A     963      CCBR/CB  7.42       1           6           1.0        0.759        0.28   849
                                               2           1           1.0        0.001        6.59   48,415
                                               3           12          1.3        1.000        0.00   541
ZDHHC11-RNF130    HCC1954  CB       1.74       1           10          1.0        0.545        0.61   1,311
                                               2           9           1.7        0.382        0.96   1,574
ENDOD1-WASH2P     HCC1954  CB       4.75       1           5           1.8        0.998        0.00   400
                                               2           14          5.1        0.578        0.55   1,277
PVT1-Intergenic   HCC1954  CB       2.96       1           3           1.3        0.465        0.76   1,414
                                               2           5           1.0        0.630        0.46   1,204
PHF20L1-SAMD12    HCC1954  CB       7.31       1           1           1.0        0.023        3.78   16,445
                                               2           9           3.0        1.000        0.00   273
                                               3           3           1.0        0.074        2.60   4,102
GSDMC-Intergenic  HCC1954  CB       9.04       1           1           1.0        0.004        5.58   139,931
                                               2           3           1.0        0.059        2.83   4,576
OXR1-Intergenic   HCC1954  CB       14.29      1           11          2.5        1.000        0.00   204
                                               2           2           1.5        0.026        3.65   6,879
PCNT-Intergenic   HCC1954  CB       16.77      1           1           1.0        0.001        6.63   400,357
                                               2           2           2.5        0.081        2.52   3,908

4.3.2 Simulated Dataset

We used a simulated dataset to estimate the sensitivity of the nFuse method. We generated 209 million 80 × 80 WGSS reads and 2.9 million 50 × 50 RNA-seq reads from a simulated genome that included 60 CCBRs and 60 complex breakpoints. WGSS and RNA-seq reads were generated using maq simulate, and simulation parameters trained from the HCC1954 data (lanes ERR016395 and ERR022661). The 60 CCBRs and 60 complex breakpoints were generated in 3 diﬀerent classes of diﬃculty, with features of each CGR selected uniformly and at random from a range of values dependent on the class. We analyzed the simulated dataset using nFuse with a threshold of 20 for the CGR score.

For the complex breakpoints, we first selected 2 genes with at least one intron each, then selected an intron from each gene. We created a fusion transcript by splicing the 5' exons of the first gene to the 3' exons of the second gene, and sampled RNA-seq reads from the fusion transcript at a coverage selected from between 20× and 200×. Next we created a complex breakpoint composed of n shards of length {ℓ_1..ℓ_n}, where n and {ℓ_1..ℓ_n} were selected from a range of values dependent on the class difficulty (Table 4.5). WGSS reads were sampled from the complex breakpoint at a coverage selected from between 5× and 30×. nFuse detected 49 of 60 complex breakpoints in the simulated dataset (Table 4.5).

Table 4.5: Complex breakpoints identified in a simulated dataset.

Class   Shard Length   Num. Shards   Recall
A       500-2000       1             19/20
B       1000-2000      2-4           16/20
C       2000-10000     3-5           19/20
Total                                54/60

For the CCBRs, we again selected 2 genes, created a fusion transcript, and generated reads as described above for complex breakpoints. We then simulated a simple breakpoint between the two genes. We also simulated an additional $n-1$ breakpoints with the structure of a CCBR. Each breakpoint was separated by a distance $\ell_i$ from the subsequent breakpoint in the CCBR. Values for $n$ and $\{\ell_1, \ldots, \ell_n\}$ were selected from a range of possibilities that depended on the class of difficulty (Table 4.6). WGSS reads were sampled from the $n$ breakpoints at a coverage selected from between 5× and 30×. nFuse detected 54 of 60 CCBRs in the simulated dataset (Table 4.6).

Table 4.6: CCBRs identified in a simulated dataset.

Class   Max. Dist.   Num. Breaks.   Recall
A       500          2-3            17/20
B       1000         4-5            18/20
C       2000         6-7            14/20
Total                               49/60

nFuse predicts an additional 3 CCBRs and 4 complex breakpoints in the simulated dataset. For 3 of the 4 false positive complex breakpoints, the predicted sequence is identical to the sequence of an undiscovered simulated complex breakpoint. However, for each of these 3 predictions, at least one of the shards is predicted to originate from the wrong location in the genome. Instead, a homologous region is incorrectly predicted as the origin of those shards. The remaining complex breakpoint and 3 CCBRs also represent undiscovered simulated events with misplaced breakpoints due to homology. Based on the simulation, we estimate the precision of nFuse to be 0.92 for complex breakpoints and 0.95 for CCBRs.
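The precision estimates can be reproduced from the reported counts. The following is a minimal Python sketch, assuming that the detected counts stated in the running text (49 complex breakpoints, 54 CCBRs) are the true positives and that each additional prediction is counted as a false positive; the function name `precision` is illustrative only.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Fraction of predictions that are correct: TP / (TP + FP)."""
    return true_positives / (true_positives + false_positives)


# Counts are taken from the text. Treating every additional prediction
# as a false positive is an assumption of this sketch.
complex_breakpoint_precision = precision(49, 4)
ccbr_precision = precision(54, 3)

print(f"{complex_breakpoint_precision:.2f} {ccbr_precision:.2f}")
```

Under these assumptions the two values round to 0.92 and 0.95 respectively.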

4.3.3 Primary prostate tumour 963

We applied nFuse to the discovery of complex rearrangements in sample 963, a primary prostate tumour sample. We generated WGSS data at 17× physical coverage in addition to 150 million reads of matched RNA-seq [169]. The WGSS data produced 762,675 putative breakpoint predictions, and these were used to construct the breakpoint graph for 963. The primary CGR feature of 963 was CCBRs (see Table 4.7 and Table 4.8 for summaries of CCBRs and complex breakpoints respectively). A 4-loci CCBR and a 3-loci CCBR were prioritized for validation since both affected cancer relevant genes, and each contained a breakpoint supported by only a single read. We validated all breakpoints and fusion transcripts associated with both CCBRs. We also validated a complex breakpoint associated with the 3-loci CCBR using LR-PCR. The complex breakpoint and two CCBRs are described in detail below.

Table 4.7: Summary of putative CCBRs discovered in 963. The CCBRs are grouped by number of breakpoints, and cumulative distance between breakpoints. Shown are counts for each group.

              Total Distance Between Breakpoints
No. Breaks    0-500   500-1000   1000-2000   2000-5000   5000-10000   >10000
2             7       5          2           5           4            0
3             1       1          1           2           0            0
4             1       1          1           0           1            0
5             0       0          0           0           0            1

Table 4.8: Summary of putative complex breakpoints discovered in 963. The complex breakpoints are grouped by number of genomic shards, and total length of shards. Shown are counts for each group.

              Total Length of Genomic Shards
No. Shards    0-500   500-1000   1000-2000   2000-5000   5000-10000   >10000
1             4       1          0           0           0            2
2             1       0          0           0           0            1

As described by [169], the 963 tumour is significant because it is difficult to classify in the context of established prostate cancer biology. Although the histology of 963 is consistent with a uniform cell type, the gene expression profile is suggestive of a hybrid luminal/neuroendocrine phenotype. The fusion genes discovered in 963 exhibit a similar hybrid pattern. Some of the fused genes are primarily expressed in luminal cells, and others are primarily expressed in neuroendocrine cells. A growing body of evidence suggests that the binding of transcriptional machinery predisposes DNA to double stranded breaks [88, 112]. Thus [169] hypothesized that luminal and neuroendocrine expression patterns were present in nascent 963 tumour cells.

Among the catalogue of luminal/neuroendocrine fusion genes discovered in 963, two are of particular interest because of their association with a CGR. Most notable of these fusions is the highly expressed HMGN2P46-MYC, a promoter exchange between the MYC oncogene and luminal cell specific HMGN2P46, with a breakpoint in MYC similar to that found in Burkitt’s Lymphoma [37]. Seemingly unrelated is the ARHGEF17-SHANK2 fusion involving neuroendocrine specific SHANK2, previously reported as fused in melanoma [15]. We have discovered and validated a CCBR consisting of 4 breakpoints, one of which produces an HMGN2P46-MYC fusion and another that produces an ARHGEF17-SHANK2 fusion (Figure 4.5a). The discovery of a single genomic event that produces two fusion transcripts, one involving luminal specific HMGN2P46 and another involving neuroendocrine specific SHANK2, is evidence that tumourigenesis occurred in a progenitor cell simultaneously expressing both luminal and neuroendocrine specific genes.

We have also identified a CGR that is represented in the breakpoint graph by a path and a cycle. The path represents a complex breakpoint involving the WDTC1, PRKRIP1, and EFCAB4A genes on chromosomes 1, 7, and 11 respectively. The complex breakpoint was identified as the underlying genomic rearrangement explaining several fusion transcripts. nFuse also identifies a cycle that uses the same two breakpoints as the path, and one additional breakpoint. Given all available information, including fusion transcripts and the 3 breakpoints, the most parsimonious CGR is a reciprocal translocation between chromosomes 1 and 11, with an insertion of an 800 bp shard of chromosome 7 at one of the breakpoints (Figure 4.5b). Without knowledge of the fusion transcripts, and given previous interpretation of CCBRs [14], the breakpoints may have been interpreted differently. Specifically, the 3 breakpoints could alternatively represent a transformation that produces 3 tumour chromosomes: a 1-7 chromosome, a 7-11 chromosome, and an 11-1 chromosome. We are able to exclude this alternate possibility by using the fusion transcripts as a scaffold for local reconstruction of the CGR.

As mentioned in reference to Figure 4.2b, gain adjacency edges produce CCBRs with ambiguous structures. The gain adjacency edge for WDTC1-PRKRIP1-EFCAB4A was determined to represent an insertion of a shard of chromosome 7 at a breakpoint between chromosomes 1 and 11, rather than a region of chromosome 7 duplicated in two tumour chromosomes. Thus we sought to rule out the possibility that all gain adjacency edges represent insertions. The MYC CCBR includes two gain adjacency edges. If either of these gain edges represented an insertion, it would be impossible for the MYC CCBR alone to explain the ARHGEF17-SHANK2 and HMGN2P46-MYC fusion transcripts; additional breakpoints and further complexity would be required. Thus the most parsimonious explanation is that the gain adjacency edges of the MYC CCBR represent regions of chromosome 11 that are duplicated in the final tumour chromosomes.

[Figure 4.5 labels. Panel A: MYC, ARHGEF17, SHANK2, HMGN2P46; chr8, chr11, chr15; fusions ARHGEF17-SHANK2 and HMGN2P46-MYC. Panel B: Chr1 (WDTC1), Chr7 (PRKRIP1), Chr11 (EFCAB4A, CD151); derived chromosomes Chr1-7-11 and Chr11-1; fusion transcripts WDTC1-CD151, WDTC1-EFCAB4A and WDTC1-PRKRIP1.]

Figure 4.5: CGRs discovered in primary tumour sample 963. (A) A single CCBR produces 4 fusion genes: MYC-ARHGEF17, ARHGEF17-SHANK2, SHANK2-HMGN2P46, and HMGN2P46-MYC. Only the ARHGEF17-SHANK2 and HMGN2P46-MYC fusion genes produce fusion transcripts. (B) Example of a CGR that is both a CCBR and a fusion involving 3 loci. The aberrant 1-7-11 chromosome produces 3 fusion transcripts: WDTC1-CD151, WDTC1-EFCAB4A, and WDTC1-PRKRIP1.

4.4 Discussion

We have applied nFuse to the discovery of fusion transcripts and underlying Complex Genomic Rearrangements (CGRs) in breast cancer cell line HCC1954 and a primary prostate tumour sample 963. The landscape of CGR events differed between HCC1954 and 963, with complex breakpoints and poly-fusions arising as the predominant CGR feature of HCC1954, and closed chains of breakage and rejoining (CCBRs) arising as the predominant feature of 963.

In HCC1954, nFuse predicted 7 high confidence complex breakpoints / poly-fusions. One of these fusions is PHF20L1-SAMD12, a highly expressed in-frame fusion missed by [152], likely because of the complexity of the breakpoint. In fact, our results strongly suggest that analysis of single breakpoints in isolation is inadequate as a method for identifying fusion genes. The large size of fragments that are interposed at the breakpoints of some fusion genes will prevent the discovery of those fusions. nFuse is also capable of identifying the single breakpoints underlying fusion transcripts caused by simpler rearrangements. nFuse successfully recovers all 4 fusion transcripts identified by SnowShoes-FTD, predicting a simple rearrangement for three and a CGR for the fourth.

In 963, nFuse identified a CGR with potential biological implications. Based on existing evidence that transcribed genes are prone to double stranded breaks, and building on the suggestion by [14] that CCBRs occur for sets of genes recruited to the same transcriptional factory, we propose that CCBRs may be used to infer the gene co-expression history of a tumour. In 963, the discovery of the MYC CCBR suggests that luminal specific HMGN2P46 and neuroendocrine specific SHANK2 were co-expressed in a single nascent tumour cell during the formation of the CCBR.
Thus the CCBR provides further evidence of the dual luminal/neuroendocrine history of the 963 tumour, and suggests the unusual luminal/neuroendocrine expression pattern of the tumour predates the formation of the MYC rearrangement.

We have used examples in both HCC1954 and 963 to highlight the potential utility of performing an integrated analysis of matched WGSS and RNA-seq datasets. The RNA-seq data yields information about long range connectivity between genomic regions, acting as a set of very long genomic reads. As such, the RNA-seq data can be useful as a scaffold for reconstructing tumour chromosomes. In some cases the RNA-seq data can also be used to resolve genomic architectural ambiguities, as for the 2 CCBRs discussed for 963. Finally, RNA-seq can be used to identify potentially interesting events such as fusion transcripts that serve as a starting point for targeted analysis.

Many fusion genes, including those with complex origins such as PHF20L1-SAMD12, can be detected using conventional analysis of RNA-seq data. Nevertheless, many interesting questions cannot be answered with a fusion transcript prediction alone. For instance, it is impossible to measure the clonal abundance of a gene fusion at the transcriptomic level alone, since transcript abundance is heavily influenced by expression. Given multiple tumour samples from the same patient, knowledge of breakpoints will allow us to ask which samples harbour the fusion, whereas knowledge of the fusion transcript only allows us to understand the expression levels in each sample. Furthermore, an understanding of the clonal abundances of rearrangements will help determine the evolutionary history of the tumour. The evolutionary history will then help determine the founder status of each rearrangement, and which rearrangements are drivers of tumourigenesis [143].

Finally, it has been assumed throughout this work that the breakpoints of CGRs occur simultaneously during a single event. A complex breakpoint with one genomic shard could also be formed by two independent breakage-rejoining events that occur at the same loci at different times during the tumour's development. Similarly, a CCBR could be formed by breakage and rejoining events occurring in succession at the same loci. Both of these scenarios require the formation of intermediate breakpoints. In the evolutionary history of the tumour, some cells would likely have evolved from cells in the intermediate state without having gained all breakpoints in the CGR. Thus we expect the intermediate breakpoints to be present in some proportion of tumour cells, though that proportion may be very small. To date we have not identified any intermediate breakpoints in the sequencing data. Future work will involve testing for intermediate breakpoints, and negative results will provide further evidence of the simultaneity of CGRs.

Chapter 5

ReMixT: Joint inference of genome structure and content in heterogeneous tumor samples

5.1 Introduction

Human cells have evolved DNA repair mechanisms to mitigate the effects of DNA breakage during transcription and replication. During the lifetime of many cancers, one or more of these mechanisms will be compromised. With DNA repair compromised, DNA breakages will either go unrepaired or will be repaired by a less accurate but still functional mechanism. Improper repair of DNA breakage events will result in structural chromosomal aberrations in descendent tumour cell lineages. Structural aberrations can then lead to further problems during mitosis, with incorrect segregation of DNA to daughter cells resulting in numerical chromosomal aberrations [22]. If both daughter cells are viable, incorrect segregation leads to a divergence in chromosomal content between the two descendent lineages. The progressive acquisition of structural and numerical chromosome aberrations is referred to as genome instability. A direct consequence of genome instability is intra-tumour heterogeneity [22]. A sample of a biopsy from a genomically unstable cancer will contain tumour cells from lineages that have diverged in the structure and content of their chromosomes. Significant changes do not usually out-pace cell division. Thus collections of cells, referred to as clones, will be genomically similar. Furthermore, a clone with increased fitness may expand relative to competing clones, resulting in partial dominance of a tumour sample by a genomically divergent clone. Bulk sequencing of a heterogeneous sample mixes the signals from tumour clones and contaminating normal cells. An important problem in cancer genomics is the unmixing of these signals, and reconstruction of the structure and content of the genomes

of each clone. The key difficulty of the problem is that mixing dilutes the signal of the changes of interest, often to a level approaching that of the noise in the data. Existing methods focus on accurate modeling of the number of copies of each reference genome segment in the sequenced tumour clone or clones. A simple model of genome structure predominates for most tools: segments in the model are adjacent only if they are also adjacent in the reference genome. An additional copy of a segment is implicitly modeled as a copied and truncated chromosome, when in reality it may be a tandem duplication resident on the original chromosome. Theta and Theta2 [117, 118] infer the copy number of tumour clone genomes and mixing proportions of tumour clones and contaminating normal cells. Both tools assume a priori knowledge of large segments of the genome with identical clone specific copy number, and model adjacent segments independently. Titan [57] uses an HMM to model spatial correlation between segments adjacent in the reference genome; however, the state space of the HMM is restricted to allow only one aberrant genotype per segment. A similar method, CloneHD [46], uses a factorial-HMM with a more comprehensive state space. Simplified models of connectivity are reasonable for genomic profiles using array based technologies. With whole genome sequencing, tumour specific adjacencies, or breakpoints, are readily available and can be predicted with reasonable accuracy using a variety of tools. Breakpoints provide the potential for a more comprehensive model of genome structure that includes long range connectivity between genomic segments. An important question in computational biology is the extent to which a more comprehensive model of genome structure has the potential to improve copy number inference.
Furthermore, a method that integrates both copy number and breakpoints could provide additional information about each breakpoint: whether the breakpoint is real or a false positive, the prevalence of the breakpoint in the clone mixture, and the number of chromosomes harboring the breakpoint per clone. Some progress has been made on more comprehensive modeling of genome structure in tumour clones. [93] proposes an algorithm to infer missing adjacencies in a mixture of rearranged tumour genomes, however they do not model copy number. [175] proposes a framework for sampling from the rearrangement history of tumour genomes. [117] proposes PREGO, a method for inferring the copy number of segments and breakpoints using a genome graph based approach, though they do not model normal contamination or tumour heterogeneity, limiting applicability of their method to real tumours. Additional progress in the ability to infer divergent genome structure remains relevant to cancer research. Subclonal copy changes (changes in a subset of clones) are difficult to assess with single cell methods, whereas coincident subclonal breakpoints could be more easily assessed for their presence in individual cells. Given multiple samples, an ability to identify subclonal events will enable more accurate tracking of complicated patterns of metastasis, such as sample heterogeneity produced by multiple-metastases to the same anatomic site

[33]. Furthermore, subclonal changes represent contemporary events, whereas ubiquitous changes represent historical events. The ability to discern historical from contemporary breakpoints may provide insight into the repair mechanisms that have been active historically versus those that were active at the time the tumour was sampled. We propose a method for joint inference of genomic content and structure given tumour sequencing data and a set of predicted breakpoints. Our method is built upon two important assumptions. First we assume that intelligent aggregation can be used to gain additional statistical strength, and increased power to detect changes in a minor tumour clone. Thus, we choose a larger segmentation than competing methods. To avoid the additional noise associated with a true copy number change occurring in the middle of a segment, we augment our segmentation with break-ends of predicted breakpoints, with the intention of capturing the majority of true copy number changes across the genome. Furthermore, we use counts of reads for alleles of each haplotype block, rather than for alleles of each heterozygous SNP, increasing statistical strength for inference of allele specific copy number changes. Second, we assume that a more comprehensive model of genome structure that includes the long range connectivity implied by rearrangement breakpoints will improve inference accuracy. We model the likelihood of observing the sequencing data given a mixture of genome graphs, where each genome graph represents the content and structure of the genome of a tumour clone. Our method can be considered a natural extension of the factorial-HMM used by CloneHD, which can be thought of as a mixture of HMMs for modeling the clone mixture. We show using simulated genomes that our method out-performs Titan, Theta2 and CloneHD for inference of clone specific copy number.
We also compare our method against two breakpoint naive methods for inferring copy number of segments and breakpoints: a factorial HMM, and a model assuming independence between segments. For the breakpoint naive methods, we post-hoc assign breakpoint copy number based on segment copy number. We show that integration of breakpoints improves inference of segment copy number given the same data, and that post-hoc assignment of breakpoint copy number is less accurate than using an integrated model.

5.2 Problem Definition

We consider the problem of predicting segment and breakpoint copy number given whole genome sequence data from tumour and matched normal samples (see Figure 5.1). Assume as input a set of alignments of uniquely mapped concordant reads, and a set of putative breakpoints predicted from discordant reads. We aim to predict the following:

1. mixture proportions of tumour clones and normal cells

2. clone and allele specific copy number of genomic segments

3. clone specific copy number of rearrangement breakpoints

The strongest signal of copy number change is from segment specific differences in counts of paired end reads (reads for the remainder of the chapter) aligned concordantly to the reference genome. Thus, we model the likelihood of concordant read counts for large segments (Section 5.2.2). A likelihood model on its own will over-fit to the data, thus we also impose a structure on the clone specific copy number by modeling each clone as a genome graph in a mixture of genome graphs (Section 5.2.1). We then formulate a maximum posterior estimation problem (Section 5.2.3). A more detailed description of the problem is provided below. Refer to Table C.1 for a description of each parameter.

[Figure 5.1 panels: tumour sample (normal cells, dominant tumour clone cells, subdominant tumour clone cells); sequencing and alignment (concordant alignments; breakpoints, i.e. tumour specific adjacencies); inference problem (mixture fractions, segment copy number, adjacency copy number).]

Figure 5.1: An overview of the problem solved by ReMixT. A tumour sample is modeled as a mixture of divergent tumour populations and contaminating normal cells (upper left). Sequencing and alignment produces predictions of breakpoints and concordant read alignments (upper right). The problem solved by ReMixT involves estimation of the cellular mixture fractions $f_m$ (bottom left), clone and allele specific segment copy number $C_{nm}$ (bottom middle, alleles not shown) and clone specific adjacency copy number $C_{jm}$ including that of breakpoints (bottom right).

5.2.1 Mixtures of Genome Graphs

In the following section we describe mixtures of genome graphs, and thereby define the structure of the solution space used to model mixtures of rearranged genomes. We will assume for simplicity a uni-chromosomal reference genome of length $L$, with the obvious generalization to multiple chromosomes that considers chromosome/position pairs in place of positions.

Segmenting the Genome

We define a regular segmentation of the genome, and augment this segmentation with break-ends of rearrangement breakpoints. Let $l_{seg}$ be the length of regular segments (3MB for this study). Represent a segment extremity as the pair $(x, t) \in [1, L] \times \{-1, +1\}$, where $x$ represents the position and $t = -1$ and $t = +1$ represent a segment start and end respectively. Define a set of $N' = \lceil L / l_{seg} \rceil$ regular segments as the set of start-end pairs given by Equation 5.1, truncating the last segment end to the length of the genome.

$$S' = \left\{ \big( ((i-1) \cdot l_{seg} + 1, -1),\ (i \cdot l_{seg}, +1) \big) \right\}_{i=1}^{N'} \tag{5.1}$$

Each breakpoint $b \in B$ is a pair of break-ends, and each break-end can be represented as a segment extremity $(x_j, t_j)$. Augment segmentation $S'$ based on break-ends as follows. If $t_j = -1$, split the segment containing $x_j$ into a segment ending at $x_j - 1$ and a segment starting with $x_j$. Correspondingly, if $t_j = +1$, split the segment containing $x_j$ into a segment ending at $x_j$ and a segment starting with $x_j + 1$. Let $S$ be the augmented set of $N$ segments defined on $2N$ segment extremities $W$. Let $l_n$ be the length of the $n$th segment $s_n \in S$ (see Figure 5.3).
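The segmentation procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the ReMixT implementation: segments are represented as closed 1-based `(start, end)` intervals on a single chromosome, break-ends as `(x, t)` pairs, and the function names `regular_segments` and `augment_with_breakends` are hypothetical.

```python
import math


def regular_segments(genome_length, seg_length):
    """Regular segmentation; the last segment is truncated at the genome end."""
    n_regular = math.ceil(genome_length / seg_length)
    return [((i - 1) * seg_length + 1, min(i * seg_length, genome_length))
            for i in range(1, n_regular + 1)]


def augment_with_breakends(segments, breakends):
    """Split segments at break-ends (x, t), with t = -1 a start and t = +1 an end."""
    genome_end = max(end for _, end in segments)
    starts = {start for start, _ in segments}
    for x, t in breakends:
        # t = -1: a segment must start at x; t = +1: a segment must end at x,
        # so the following segment starts at x + 1.
        cut = x if t == -1 else x + 1
        if 1 < cut <= genome_end:
            starts.add(cut)
    ordered = sorted(starts)
    ends = [s - 1 for s in ordered[1:]] + [genome_end]
    return list(zip(ordered, ends))


segments = regular_segments(10, 4)
augmented = augment_with_breakends(segments, [(3, -1), (6, +1)])
print(segments)
print(augmented)
```

Collecting segment start positions in a set means that break-ends that coincide with existing segment boundaries, or with each other, do not create empty or duplicate segments.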

Constructing the Genome Graph

We use genome graphs to represent potential adjacencies between segments in a rearranged genome. Trivially, segments adjacent in the reference genome may also be adjacent in the tumour genome. Define the set of reference adjacencies $A$ as given by Equation 5.2, where $w_i \in W$ is the $i$th segment extremity.

$$A = \left\{ (w_{2i}, w_{2i+1}) \right\}_{i=1}^{N-1} \tag{5.2}$$

The set of breakpoints $B$ defines putative tumour specific adjacencies. We will model chromosome ends as a special type of adjacency called a telomere adjacency (footnote 1). Let $s_{N+1} = ((\emptyset, -1), (\emptyset, +1))$ be a dummy telomere segment. Define a telomere as an adjacency between a real segment extremity and either $(\emptyset, -1)$ or $(\emptyset, +1)$. Define the space of all telomeres $T$ as the edges of a complete bipartite graph between vertex set $W$ and vertex set $\{(\emptyset, -1), (\emptyset, +1)\}$.

Footnote 1: Acknowledging the biological inaccuracy of applying the term telomere.

Let vertex set $V = W \cup \{(\emptyset, -1), (\emptyset, +1)\}$ be the set of segment extremities plus the 2 additional dummy telomere vertices. Each of $A$, $B$, $S$ and $T$ is a set of segment extremity pairs and can thus be represented as edges in a graph. Define the genome graph $H = (V, E = (S, Q))$, $Q = A \cup B \cup T$, as a bi-edge-colored graph on vertex set $V$, where segment edges in $S$ are given a ‘color’ distinct from bond edges in $Q$ (represented as thick dashed and thin solid respectively in Figure 5.2). Bond edges have 3 classes: reference (edge set $A$), breakpoint (edge set $B$) and telomere (edge set $T$).

[Figure 5.2 legend: segment edges; reference bond edges; breakpoint bond edges; telomere bond edges; dummy telomere segment edge.]

Figure 5.2: An example genome graph on 6 regular segments. Segment edges (black, thick dashed) connect vertices representing segment extremities. Reference bond edges (yellow, thin solid) connect segments to recapitulate the reference chromosomes. Breakpoint bond edges (red, thin solid) represent putative connections between segment ends as identified through analysis of discordant sequencing reads. A dummy segment edge (bottom) is used to connect the end points of linear chromosomes via telomere bond edges (blue, thin solid). Telomere bond edges form a complete bipartite graph on the pair of vertices incident to the dummy segment edge, and all other vertices.
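The construction of the genome graph from a segmentation and a set of breakpoints can be sketched as follows. This is an illustrative Python sketch under several assumptions: the dummy telomere vertices $(\emptyset, \pm 1)$ are encoded as `(None, -1)` and `(None, +1)`, the dummy telomere segment is stored alongside the real segment edges, and the function name `build_genome_graph` and the dictionary layout are hypothetical.

```python
from itertools import product

TELOMERE_START = (None, -1)  # stands in for the dummy vertex (∅, −1)
TELOMERE_END = (None, +1)    # stands in for the dummy vertex (∅, +1)


def build_genome_graph(segments, breakpoints):
    """Build vertex and edge sets from (start, end) segments and break-end pairs."""
    # W: two extremities per segment, in genomic order.
    extremities = []
    for start, end in segments:
        extremities.extend([(start, -1), (end, +1)])
    known = set(extremities)
    for end_a, end_b in breakpoints:
        if end_a not in known or end_b not in known:
            raise ValueError("break-end is not a segment extremity")
    # Segment edges, plus the dummy telomere segment (an assumption of this sketch).
    segment_edges = [((start, -1), (end, +1)) for start, end in segments]
    segment_edges.append((TELOMERE_START, TELOMERE_END))
    # Reference adjacencies (w_2i, w_2i+1): end of segment i to start of segment i + 1.
    reference_edges = [(extremities[2 * i - 1], extremities[2 * i])
                       for i in range(1, len(segments))]
    # Telomere edges: complete bipartite graph between W and the dummy vertices.
    telomere_edges = list(product(extremities, (TELOMERE_START, TELOMERE_END)))
    return {
        "vertices": extremities + [TELOMERE_START, TELOMERE_END],
        "segment": segment_edges,
        "reference": reference_edges,
        "breakpoint": list(breakpoints),
        "telomere": telomere_edges,
    }


graph = build_genome_graph(
    segments=[(1, 4), (5, 8), (9, 10)],
    breakpoints=[((4, +1), (9, -1))],  # adjacency that skips the middle segment
)
print(len(graph["vertices"]), len(graph["telomere"]))
```

Keeping the three bond edge classes in separate lists mirrors the distinction between edge sets $A$, $B$ and $T$ in the text.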

Genome Instances and Genome Mixtures

Define a linear chromosome as a sequence of (possibly repeated) oriented segments, and a circular chromosome as a cycle of (possibly repeated) oriented segments. Thus a linear chromosome is an alternating walk in $H$ starting and ending with a segment edge, and a circular chromosome is an alternating tour in $H$. For algorithmic convenience, we will model all chromosomes as alternating tours. An alternating walk that represents a linear chromosome can be transformed into a tour by connecting each end of the chromosome to opposite ends of the dummy telomere segment edge using telomere bond edges. A telomere bond edge is considered observed if it is incident to a vertex representing the end of a reference chromosome, with remaining telomere bond edges considered unobserved. All reference and breakpoint bond edges are considered observed.

A genome can be represented exactly as a collection of alternating tours in $H$, though such a representation would be unidentifiable based only on adjacency information provided by whole genome sequencing. Instead, a collapsed representation is more convenient. Define a genome instance $g : E \to \mathbb{N}$ as an assignment of counts to edges in $H$. A genome instance is valid if and only if there exists a collection $\mathcal{T}$ of alternating tours in $H$ for which each edge $e$ appears $g(e)$ times in $\mathcal{T}$. Alternatively, let $S(v)$ and $Q(v)$ be the segment and bond edges incident with vertex $v$ respectively. A genome instance is valid only if the copy number balance condition holds ([117], Equation 5.3).

$$\forall v \in V : \sum_{e \in S(v)} g(e) = \sum_{e \in Q(v)} g(e) \tag{5.3}$$

Call a genome instance that obeys the copy number balance condition balanced. A genome instance $g$ may be balanced but not valid if for some edge $e$, $g(e) < 0$. Balanced genome instances will be important as modifications of valid genome instances. Define a genome collection $G$ as a collection of $M$ valid balanced genome instances. Let $f \in \mathbb{R}^M_{>0}$, $\sum_m f_m = 1$, represent the cellular fraction of each population. Finally, define a genome mixture as the pair $(G, f)$.
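The copy number balance condition can be checked directly from edge counts. The Python sketch below assumes a genome instance is stored as two dictionaries mapping edges `(u, v)` to integer counts, one for segment edges and one for bond edges; the function names and the single-letter vertex labels are hypothetical. It tests only balance and non-negativity, which the text gives as necessary conditions for validity.

```python
from collections import defaultdict


def is_balanced(segment_counts, bond_counts):
    """True if, at every vertex, segment edge counts equal bond edge counts."""
    seg_total = defaultdict(int)
    bond_total = defaultdict(int)
    for (u, v), count in segment_counts.items():
        seg_total[u] += count
        seg_total[v] += count
    for (u, v), count in bond_counts.items():
        bond_total[u] += count
        bond_total[v] += count
    vertices = set(seg_total) | set(bond_total)
    return all(seg_total[v] == bond_total[v] for v in vertices)


def has_negative_count(segment_counts, bond_counts):
    """A balanced instance with any negative count cannot be valid."""
    return any(c < 0 for c in list(segment_counts.values()) + list(bond_counts.values()))


# One linear chromosome of two segments, closed into a tour through the
# dummy telomere segment ("T-", "T+").
segment_counts = {("a", "b"): 1, ("c", "d"): 1, ("T-", "T+"): 1}
bond_counts = {("b", "c"): 1, ("d", "T+"): 1, ("T-", "a"): 1}
print(is_balanced(segment_counts, bond_counts))

# Doubling the second segment without adding bond edges breaks the balance.
unbalanced = dict(segment_counts)
unbalanced[("c", "d")] = 2
print(is_balanced(unbalanced, bond_counts))
```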

Allele Specific Genome Graph

We model the copy number of each parental allele for each segment. Complete loss of a parental allele, termed Loss of Heterozygosity (LOH), is of particular biological interest as such events frequently occur as part of a ‘double hit’ targeting a tumour suppressor gene. A method that infers the specific copy number of each allele will enable us to identify such biologically important events. Furthermore, previous methods have shown increased performance when modeling allele specific versus total copy number. We define the allele specific genome graph $H'$ to jointly model both genome structure and allele specific copy number. Construct $H' = (V', E')$, $E' = S' \cup Q'$, from genome graph $H$ by duplicating edges and vertices for arbitrarily named allele 1 and allele 2. For each vertex $v_n \in V$ create two vertices $v_{n,1}, v_{n,2}$ in $V'$. For each segment edge $s_n = (v_{2n-1}, v_{2n}) \in S$, create two segment allele edges $s_{n,1} = (v_{2n-1,1}, v_{2n,1})$ and $s_{n,2} = (v_{2n-1,2}, v_{2n,2})$ in $S'$. For each bond edge $e = (v_j, v_k) \in Q$, create four bond edges $(v_{j,1}, v_{k,1})$, $(v_{j,1}, v_{k,2})$, $(v_{j,2}, v_{k,1})$, $(v_{j,2}, v_{k,2})$ in $Q'$. Subsequent sections will refer to the allele specific genome graph and its edges and vertices as $H$, $E = S \cup Q$ and $V$ respectively.
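The duplication step that produces the allele specific genome graph is mechanical and can be sketched as below. This is a minimal Python illustration, assuming vertices are arbitrary hashable labels and that each allele specific vertex is encoded as a `(vertex, allele)` tuple; the function name `allele_specific_graph` is hypothetical.

```python
def allele_specific_graph(vertices, segment_edges, bond_edges, alleles=(1, 2)):
    """Duplicate vertices and segment edges per allele; expand bond edges
    over all allele pairs."""
    new_vertices = [(v, a) for v in vertices for a in alleles]
    # A segment edge stays within one allele.
    new_segment_edges = [((u, a), (v, a)) for u, v in segment_edges for a in alleles]
    # A bond edge may join any allele of u to any allele of v.
    new_bond_edges = [((u, a), (v, b))
                      for u, v in bond_edges for a in alleles for b in alleles]
    return new_vertices, new_segment_edges, new_bond_edges


v2, s2, q2 = allele_specific_graph(
    vertices=["w1", "w2", "w3", "w4"],
    segment_edges=[("w1", "w2"), ("w3", "w4")],
    bond_edges=[("w2", "w3")],
)
print(len(v2), len(s2), len(q2))
```

With two alleles the vertex and segment edge sets double in size while the bond edge set quadruples, matching the construction in the text.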

5.2.2 Modeling Read Counts

In the following section we first describe a general model for per segment counts of concordantly aligned reads. We then extend this model to allele specific segment read counts, and describe a likelihood model for the read count data.

Sequenced Genome Mixtures

Let $R_m$ be the number of concordant reads contributed by cell population $m$ in a heterogeneous tumour sample. We assume each of the $R_m$ reads is sampled uniformly from the length $L_m$ of the genome of cell population $m$. Thus each cell population contributes a specific number of reads per nucleotide, $h_m = R_m / L_m$, to the sequencing experiment. We call $h_m$ the haploid read depth of population $m$ since it is the read depth contributed by a single copy of a segment by $m$. The haploid read depth encodes both the cellular fractions $f$ and the total amount of sequencing. Cellular fractions can be calculated from haploid read depths using Equation 5.4 (footnote 2).

$$f_m = \frac{h_m}{\sum_{m'} h_{m'}} \tag{5.4}$$

For convenience we use $h_m$ when modeling read counts, and convert back to cellular fractions when necessary. Define a sequenced genome mixture as the pair $(G, h)$, to model both the genome mixture and amount of sequencing.
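The relationship between read counts, haploid read depths and cellular fractions can be illustrated with a short Python sketch. The function names and the numeric values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def haploid_read_depths(read_counts, genome_lengths):
    """h_m = R_m / L_m for each cell population m."""
    return [r / l for r, l in zip(read_counts, genome_lengths)]


def cellular_fractions(haploid_depths):
    """f_m = h_m / sum over populations of h_m (Equation 5.4)."""
    total = sum(haploid_depths)
    return [h / total for h in haploid_depths]


# Illustrative values: normal cells plus two tumour clones.
h = haploid_read_depths([2.0e6, 5.0e6, 3.0e6], [1.0e8, 1.0e8, 1.0e8])
f = cellular_fractions(h)
print(h)
print(f)
```

Because the fractions depend only on ratios of haploid depths, rescaling all depths by a common factor, for example by sequencing more deeply, leaves them unchanged.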

Total Segment Read Counts

Represent each concordant read alignment $r = (x, y)$; $x, y \in [1, L]$ as a pair of positions representing the start and end of the alignment of the read in the genome (footnote 3). Let observed read count $x_n$ be the number of concordant read alignments fully contained within segment $s_n$. For sequenced genome mixture $(G, h)$, let $c_{nm}$ denote the total copy number of segment $n$ in genome $m$, i.e. $c_{nm} = g_m(s_n)$. The expected read count contributed by population $m$ for segment $n$ is the product of: a) the length of the segment, $l_n$; b) the copy number of the segment in population $m$, $c_{nm}$; and c) the haploid read depth of population $m$, $h_m$ (footnote 4). The expected total read count $\mu_n$ of segment $n$ can be calculated by summing over the $M$ populations in the mixture (Equation 5.5, Figure 5.3).

$$\mu_n = \sum_m l_n c_{nm} h_m \tag{5.5}$$
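Equation 5.5 can be evaluated directly once segment lengths, clone specific copy numbers and haploid read depths are known. The following Python sketch assumes copy numbers are stored as one row per segment with one entry per population; the function name and the example values are hypothetical, and the dataset specific segment bias mentioned in the footnotes is not modeled.

```python
def expected_read_counts(lengths, copy_number, haploid_depths):
    """mu_n = sum over m of l_n * c_nm * h_m, for each segment n."""
    return [
        length * sum(c * h for c, h in zip(row, haploid_depths))
        for length, row in zip(lengths, copy_number)
    ]


lengths = [1000, 2000]
copy_number = [
    [2, 2, 2],  # segment 1: two copies in the normal population and both clones
    [2, 3, 1],  # segment 2: gain in clone 1, loss in clone 2
]
haploid_depths = [0.01, 0.02, 0.01]
mu = expected_read_counts(lengths, copy_number, haploid_depths)
print(mu)
```

Here the second segment accumulates a higher expected count than its length alone would predict, because the gain in the more abundant clone outweighs the loss in the less abundant one.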

Haplotype Block Read Counts

Heterozygous germline SNPs can be used to distinguish reads originating from distinct parental alleles. Calculation of allele specific read counts can then be used to estimate allele specific copy number. Haplotype information allows for aggregation of read counts from multiple SNPs, increasing statistical strength significantly.

Let χi ∈ {0, 1} be a binary indicator representing the two possible alleles of heterozygous SNP i.A haplotype block η = (i, k, y), y ∈ {0, 1}k is a sequence of alleles

2 Let Nm be the fractional number of cells sequenced from population m, assuming all nucleotides of NmLm those cells are sequenced. Then Rm = y = hmLm where y is the length of sequenced fragments. Thus Nm = hmy, and assuming fragment length y is invariant between cell populations, normalizing hm is equivalent to normalizing Nm and results in cellular fraction fm. 3 For paired end reads, x is the start of the left end and y the end of the right end. 4 In practice, accurate modeling of real data requires a dataset speciﬁc segment bias, though we do not describe such a parameter in this chapter.

[Figure 5.3 labels: Concordant Alignments; Segment Boundaries; Rearrangement Breakpoint; Genome Mixture (Clone 0 Normal, Clone 1 Tumour, Clone 2); Segment Read Counts; Expected Segment Read Count = sum over clones m of (length of segment n, ln) · (copies of segment n in clone m, cnm) · (haploid read depth of clone m, hm)]

Figure 5.3: Observed and expected segment read counts. Regular segmentation is augmented by break-ends of rearrangement breakpoints (alternating grey, top left). Contained concordant reads are counted for each segment (blue histogram, bottom left). Based on a genome mixture (top right), expected segment read count is calculated as the product of segment copy number, segment length, and haploid read depth, summed over all clones.

(χi = y1, .., χi+k−1 = yk) for k consecutive SNPs starting at i, where the alleles in the sequence exist consecutively on the same physical chromosome. The alternate haplotype block η̄ = (χi = ȳ1, .., χi+k−1 = ȳk) represents the other of the two parental alleles (here ȳj = 1 − yj).

We infer haplotypes by first predicting heterozygous SNPs from a matched normal and then phasing them using shapeit2 [39] with a 1000 Genomes reference panel. Haplotype block read counts are calculated as follows (see Figure 5.4). Call a read r non-conflicting with η = (i, k, y) if, for every j ∈ {1, .., k} such that r covers SNP i + j − 1, read r matches allele χi+j−1 = yj. Call a read r supporting of η if it is non-conflicting and contains at least one SNP from {i, .., i + k − 1}. Let zη and zη̄ be the counts of reads that support η and η̄ respectively.
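The definitions of non-conflicting and supporting reads can be sketched as follows. This is an illustrative sketch only: we assume each read is summarised as a dictionary mapping the index of every covered heterozygous SNP to the observed allele (0 or 1), and the function names are hypothetical. Offsets into y are 0-based here, whereas the text uses 1-based j.

```python
def supports(read_alleles, i, y):
    """True if the read supports haplotype block (i, len(y), y).

    read_alleles -- dict {snp_index: allele} for SNPs covered by the read
    i            -- index of the first SNP of the block
    y            -- list of alleles (0/1) of the block
    """
    covered = [j for j in range(len(y)) if (i + j) in read_alleles]
    if not covered:
        return False  # the read contains no SNP of the block
    # Non-conflicting: every covered SNP matches the block's allele.
    return all(read_alleles[i + j] == y[j] for j in covered)


def haplotype_block_counts(reads, i, y):
    """Return (z_eta, z_eta_bar): reads supporting the block and its alternate."""
    y_bar = [1 - a for a in y]
    z_eta = sum(supports(r, i, y) for r in reads)
    z_eta_bar = sum(supports(r, i, y_bar) for r in reads)
    return z_eta, z_eta_bar


# Hypothetical reads over a block of 3 SNPs starting at SNP 10.
reads = [{10: 0, 11: 1}, {11: 1}, {10: 1, 11: 0, 12: 1}, {10: 0, 11: 0}, {20: 1}]
counts = haplotype_block_counts(reads, 10, [0, 1, 0])
```

Note that a read whose alleles mix the two haplotypes (the fourth read above) supports neither block, and a read covering no SNP of the block (the fifth) is not counted.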

[Figure 5.4 panels: Heterozygous SNPs (germline heterozygous SNPs on the maternal and paternal chromosomes); SNP Allele Read Counts (reads assigned to Allele 1 or Allele 2); Haplotype Allele Read Counts (reads assigned to either allele of a haplotype block)]

Figure 5.4: Haplotype Allele Read Counts. Heterozygous SNPs are present on one or the other parental allele in the normal genome (left). SNP allele read counts count reads supporting each allele of a heterozygous SNP (middle). By contrast, haplotype allele read counts are counts of reads supporting any subset of SNP alleles in a haplotype block, for each of the two alleles of that block (right).

Allele Specific Segment Read Counts

Modeling expected read counts of haplotype blocks is made difficult by the fact that we do not know the effective length of each block, and thus cannot easily write down an equation similar to Equation 5.5. Instead we use haplotype block read counts to calculate observed allele specific segment read counts xn1 and xn2 for allele 1 and 2 respectively. We then calculate expected allele specific segment read counts by scaling expected total read counts by φn, the proportion of total reads that support either allele (see Equation 5.7 below). As a simple convention, we choose an assignment of the two alleles such that xn1 ≥ xn2, and call allele 1 the major allele and allele 2 the minor allele (footnotes 5 and 6).

We calculate allele specific segment read counts as follows. For each haplotype block j overlapping segment n, calculate zηj,n as the count of reads that a) are contained within segment n, and b) support haplotype ηj. Calculate zη̄j,n similarly. Allele specific segment read counts xnk for allele k are calculated as given by Equation 5.6.

$$x_{nk} = \begin{cases} \sum_{j} \max(z_{\eta_j,n},\, z_{\bar{\eta}_j,n}) & \text{if } k = 1 \\ \sum_{j} \min(z_{\eta_j,n},\, z_{\bar{\eta}_j,n}) & \text{if } k = 2 \end{cases} \tag{5.6}$$
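Equation 5.6 can be illustrated with a short sketch. The function name and the toy counts are assumptions for the example; the input is a list of (zη, zη̄) pairs, one per haplotype block overlapping the segment.

```python
def allele_specific_segment_counts(block_counts):
    """Return (x_n1, x_n2), the major and minor allele counts of Equation 5.6.

    block_counts -- list of (z_eta, z_eta_bar) pairs for the haplotype
                    blocks overlapping segment n
    """
    major = sum(max(z, z_bar) for z, z_bar in block_counts)
    minor = sum(min(z, z_bar) for z, z_bar in block_counts)
    return major, minor


# Hypothetical counts for three haplotype blocks in one segment.
x_n1, x_n2 = allele_specific_segment_counts([(12, 7), (3, 9), (5, 5)])
```

Because the maximum and minimum are taken per block, the phase of one block relative to the next does not need to be known.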

Expected Segment Read Counts

Below we provide a more comprehensive description of how we calculate expected total and allele specific segment read counts in the model. The full set of observed data X ∈ ℕ^(N×3) are per-segment total and allele specific read counts. For segment n, xn3 is the total count of reads contained within segment n, and xn1 and xn2 are the counts of the subsets of those reads classified as from the major or minor allele (k = 1 and k = 2 respectively). Modeled alleles ℓ = 1 and ℓ = 2 correspond to measured major (k = 1) and minor (k = 2) alleles respectively.

For sequenced genome mixture (G, h), let cnmℓ denote the copy number of allele ℓ of segment n in genome m, i.e. cnmℓ = gm(snℓ). Let pnkℓ represent, for segment n, the proportion of reads from allele ℓ that can contribute to measurement k, assumed known.

For total read counts (k = 3), pnkℓ = 1 for ℓ ∈ {1, 2}, since total read count includes all reads from both alleles. For major and minor read counts, calculate the proportion of reads that can be genotyped by heterozygous SNPs for segment n as given by Equation 5.7.

Since reads from allele ℓ = 1 cannot contribute to measured read count xn2, and vice versa,

5 Classification into paternal/maternal is impossible without knowledge of the parental genomes, and somewhat irrelevant in the context of this work.

6 Note that major/minor does not imply more/fewer copies in all populations. The major allele may have fewer copies than the minor allele in a low prevalence subpopulation, even though the combination of populations produces more read counts for the major allele.

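pnkℓ = 0 for k ≠ ℓ; k ∈ {1, 2}. Equation 5.8 fully specifies pnkℓ.

$$\phi_n = \frac{x_{n1} + x_{n2}}{x_{n3}} \tag{5.7}$$

$$p_{nk\ell} = \begin{cases} \phi_n & \text{if } k = \ell,\ k \in \{1, 2\} \\ 1 & \text{if } k = 3 \\ 0 & \text{otherwise} \end{cases} \tag{5.8}$$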

Let µnk ∈ R>0 model expected major/minor/total read count, calculated as given by Equation 5.9.

$$\mu_{nk} = \sum_{m} \sum_{\ell} l_n \, h_m \, c_{nm\ell} \, p_{nk\ell} \tag{5.9}$$
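Equations 5.7 to 5.9 can be combined in a small worked example. The sketch below is illustrative; the function name, argument layout and toy values are assumptions for this example.

```python
def expected_read_counts(l_n, h, c_n, x_n):
    """Return [mu_n1, mu_n2, mu_n3] for one segment (Equations 5.7 to 5.9).

    l_n -- segment length
    h   -- haploid read depths h_m, one per population
    c_n -- c_n[m][l-1] is the copy number of allele l in population m
    x_n -- observed (major, minor, total) read counts (x_n1, x_n2, x_n3)
    """
    phi_n = (x_n[0] + x_n[1]) / x_n[2]  # Equation 5.7

    def p(k, l):  # Equation 5.8
        if k == 3:
            return 1.0
        return phi_n if k == l else 0.0

    return [
        sum(l_n * h_m * c_n[m][l - 1] * p(k, l)
            for m, h_m in enumerate(h) for l in (1, 2))
        for k in (1, 2, 3)
    ]  # Equation 5.9


# Hypothetical segment: normal (1+1 copies) and one tumour clone (2+1 copies).
mu = expected_read_counts(1000, [0.04, 0.06], [[1, 1], [2, 1]], (30, 20, 500))
```

With these toy values φn = 0.1, so only a tenth of the reads from each allele contribute to the major and minor counts, while all reads contribute to the total.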

5.2.3 Maximum Posterior Genome Mixtures

In this section we describe the likelihood model of observed read counts, and a prior probability over Genome Structure that favours solutions that use fewer unobserved telomeres.

Genome Mixture Likelihoods

We model the likelihood of allele specific or total read counts x given expected read counts µ using either a Poisson (Equation 5.10) or Negative Binomial (Equation 5.11). The overdispersion constant for the Negative Binomial is estimated off-line (Section C.3). See Section C.4 for a discussion of the independence assumption in comparison to similar models.

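$$p(x \mid \mu) = \frac{\mu^{x} e^{-\mu}}{x!} \tag{5.10}$$

$$p(x \mid \mu, r) = \frac{\Gamma(r + x)}{\Gamma(x + 1)\,\Gamma(r)} \left(\frac{r}{r + \mu}\right)^{r} \left(\frac{\mu}{r + \mu}\right)^{x} \tag{5.11}$$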
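For reference, the two likelihoods can be evaluated in log space as sketched below. This is a minimal sketch assuming the mean and overdispersion parameterisation of Equation 5.11 with normalising term x! = Γ(x + 1); the function names are hypothetical.

```python
import math


def poisson_log_pmf(x, mu):
    """log p(x | mu) for the Poisson of Equation 5.10 (requires mu > 0)."""
    return x * math.log(mu) - mu - math.lgamma(x + 1)


def negbin_log_pmf(x, mu, r):
    """log p(x | mu, r) for the Negative Binomial of Equation 5.11.

    mu is the mean and r the overdispersion constant; the variance is
    mu + mu**2 / r, so large r approaches the Poisson.
    """
    return (math.lgamma(r + x) - math.lgamma(x + 1) - math.lgamma(r)
            + r * math.log(r / (r + mu))
            + x * math.log(mu / (r + mu)))
```

Working with log probabilities avoids underflow when the log likelihoods of many segments are summed.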

Prior Probability over Genome Structure

Positive copy number assigned to a bond edge e by genome instance gm implies edge e ‘exists’ in the genome mixture. Observed bond edges are assumed to have higher prior probability of existing. Thus, we place a prior probability over the number of unobserved edges used by genome instances in a genome mixture. Let U ⊂ Q be the set of unobserved bond edges in H. Let β be a parameter related to our belief in the existence of an unobserved telomere, or copy number transition unexplainable by a breakpoint. Calculate a prior for the copy number of genome instance g as given by Equation 5.12.

$$P(g \mid \beta) \propto \prod_{e \in U} e^{-\beta g(e)} \tag{5.12}$$

In log space the above prior amounts to a fixed penalty on each additional copy of an unobserved bond edge. Such a prior prevents over-fitting of the genome structure, such as a genome instance that assigns positive copy number to many telomere edges in order to be able to fit each segment edge perfectly to the segment likelihood. Thus the impact of Equation 5.12 is similar to that of a transition matrix in an HMM for modeling spatial correlation, and in fact the genome graph model with no breakpoints has an equivalent representation as an HMM. Higher values of β have the effect of smoothing over false positive deviations in predicted copy number that result from sampling error of identical true copy number.

Overall Objective Function

Our overall objective is to identify the genome mixture (G, h) that maximizes the full posterior (Equation 5.13) given β.

$$p(G, h \mid X, \beta) \propto p(X \mid G, h) \prod_{g \in G} P(g \mid \beta) \tag{5.13}$$

5.3 Method

5.3.1 Method Overview

We separate the maximum posterior genome mixture problem into two subproblems:

1. learn h

2. infer G given h

For problem 1, we remove breakpoint edges from the genome graph, reducing the graph to a hidden Markov model (HMM), making learning of h more tractable. For problem 2, we use an approximate combinatorial method to infer G given h, initializing at the results of running the Viterbi algorithm on the HMM used to learn h.

5.3.2 Expectation Maximization Method for Learning h

We learn h using the Baum-Welch algorithm [18] for learning the parameters of a hidden Markov model (see Section C.2). In brief, we use Expectation Maximization to find a local maximum of the marginal likelihood function (Equation 5.14).

$$L(X \mid h) = \sum_{C} p(X, C \mid h) \tag{5.14}$$

At iteration t, the algorithm calculates the h(t) that maximizes the expected value of the complete data log likelihood log p(X, C|h) with respect to the conditional distribution p(C|X, h(t−1), ·). We use multiple restarts with different initial h(0) in an attempt to discover the global maximum.

5.3.3 Combinatorial Method for Inferring G

We propose a greedy algorithm for inferring the full structure of the genome collection G given h estimated from large segments. Given a current solution G(t), our aim is to select from a set of possible modifications that a) are simple to calculate, b) are comprehensive enough to escape local optima, and c) produce a valid genome collection G(t+1) when applied to G(t). A valid genome collection is defined as a genome collection for which all genomes g ∈ G have non-negative copy number and obey the copy number balance condition.

Modeling the likelihood of total reads introduces a dependency between the copy number of each allele of the same segment, complicating inference. For the purposes of inferring G, we model only the likelihood of allele specific read counts, allowing for the independent modeling of the alleles of each segment. Let ce be a vector of per clone copy numbers assigned to edge e for each of the M genome instances, such that cem = gm(e). Suppose edge snℓ ∈ S represents segment n, allele ℓ. The likelihood of allele specific read counts xnk where k = ℓ can be written as p(xnk|µnℓ) where µnℓ is calculated as given by Equation 5.15.

$$\mu_{n\ell} = \sum_{m} l_n \, h_m \, c_{em} \, \phi_n \tag{5.15}$$

We define the objective function of our greedy heuristic as given by Equations 5.16 and 5.17.

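$$f(G) = -\log p(G \mid X, h, \cdot) = \sum_{e \in E} f_e(c_e) \tag{5.16}$$

$$f_e(c_e) = \begin{cases} -\log p(x_{nk} \mid \mu_{n\ell}) & \text{if } e = s_{n\ell} \in S \\ \sum_{m} \beta \cdot g_m(e) & \text{if } e \in U \end{cases} \tag{5.17}$$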
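The per-edge terms of Equation 5.17 can be sketched as below. For brevity this sketch assumes the Poisson likelihood of Equation 5.10 (the Negative Binomial could be substituted); the function names and toy values are hypothetical.

```python
import math


def neg_log_poisson(x, mu):
    """-log p(x | mu) under the Poisson of Equation 5.10."""
    if mu <= 0.0:
        # Zero expected depth: only a zero count is possible.
        return 0.0 if x == 0 else math.inf
    return mu - x * math.log(mu) + math.lgamma(x + 1)


def segment_edge_cost(x_nk, l_n, phi_n, h, c_e):
    """f_e for a segment edge: -log p(x_nk | mu_nl), with mu_nl from Eq. 5.15."""
    mu = l_n * phi_n * sum(h_m * c_em for h_m, c_em in zip(h, c_e))
    return neg_log_poisson(x_nk, mu)


def unobserved_bond_edge_cost(beta, c_e):
    """f_e for an unobserved bond edge: beta times the copies over all clones."""
    return beta * sum(c_e)


# Hypothetical allele with 16 observed reads; candidate per-clone copy numbers.
h = [0.04, 0.06]
costs = {c: segment_edge_cost(16, 1000, 0.1, h, c) for c in [(1, 1), (1, 2), (1, 3)]}
```

In this toy example the candidate (1, 2) gives an expected count of 16, equal to the observed count, and therefore the lowest cost of the three.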

Representing modiﬁcations as edge disjoint sets of alternating cycles

We propose an algorithm for identifying a modiﬁcation of G(t), optimal with respect to f, from a restricted set of modiﬁcations that increase or decrease the copy number of any edge by at most 1. Consider valid genome instance g1 and balanced genome instance g∆. The set of balanced genome instances is closed under addition and subtraction. Thus g2 = g1 + g∆ will be a balanced genome instance, and if g2(e) = g1(e) + g∆(e) ≥ 0 ∀e, then g2 will also be valid.

An obvious set of candidate g∆ include all g∆ that set g∆(e) = 1 (or g∆(e) = −1) for all edges e in some alternating cycle of H. Such g∆ would be analogous to adding (or subtracting) a circular chromosome to the genome instance. However, it is trivial to show

that for such a restricted set of modifications, a greedy algorithm would easily get stuck in local optima. In Figure 5.5A, for instance, transformation of the genome instance on the left to that on the right would not be possible without an intermediate state that assigns 2 copies to each segment edge. Such an intermediate state may be prohibitively unlikely if there is considerable evidence for the segment to be copy number 1.


Figure 5.5: Genome graph modifications. Thick dashed lines show segment edges, thin solid lines show bond edges. A) Transformation of one genome instance to another via addition of a modification. Edges are annotated with copy number for edges with non-zero copies. Note that a transformation performed in two steps, one addition and one subtraction, would likely result in a highly improbable intermediate state with segment edge copies set to 2. B) Representation of the modification as an alternating cycle in the genome graph.

Instead, we require candidate g∆ that both add and subtract edges, as swapping one set of edges for another set will allow easier movement through the space of possible genome collections. Let g+ be a balanced genome instance constructed such that g+(e) = 1 for all edges in a set of edge disjoint alternating cycles in H. Note that since each vertex is incident to a single segment edge, edge disjoint alternating cycles are also vertex disjoint. Let g− be defined similarly to g+, with g−(e) = −1 for a distinct set of edge disjoint alternating cycles in H. Let g∆ = g+ + g−. We call a g∆ thus constructed a simple genome modification.

Proposition 1. Any simple genome modiﬁcation has a representation as an edge disjoint set of alternating cycles in a speciﬁc bi-edge-colored graph derived from H.

Proof. Edges in g∆ will have copy number in {−1, 0, +1}. Remove all 0 edges from g∆. Assign a color to each remaining edge as follows:

• segment edge e with g∆(e) = +1 → red

• segment edge e with g∆(e) = −1 → blue

• bond edge e with g∆(e) = +1 → blue

• bond edge e with g∆(e) = −1 → red

Now consider any vertex v ∈ H. Let s be the segment edge incident with v, and let ei be the i-th bond edge incident with v. Suppose g∆(s) = 0, but there exists i such that g∆(ei) ≠ 0. Then there must exist i1, i2 such that g∆(ei1) = +1 and g∆(ei2) = −1, otherwise g+ and g− would not both be vertex disjoint. Furthermore, ei1 will be colored blue and ei2 will be colored red. Suppose g∆(s) = +1. Then there exists a single bond edge ei incident with v such that g∆(ei) = +1 by the copy number balance condition and the requirement that g+ be vertex disjoint. Edge s will be colored red and ei will be colored blue. A similar analysis for a segment edge s with g∆(s) = −1 shows that s will be colored blue and a single incident bond edge will be colored red. Thus at any vertex, either 0 or 2 incident edges will have non-zero copy number. Furthermore, for vertices with 2 incident non-zero copy number edges, the colors of the edges as defined above will alternate. Thus g∆ can be represented as a set of vertex disjoint alternating cycles.

We now show how to construct a bi-edge colored graph F called the genome modification graph (Figure 5.6), such that any simple genome modification in H can be represented as a set of vertex disjoint alternating cycles in F. Take the vertex set of H as the vertex set of F. Take two copies of the edge set of H as the edge set of F, with the result that each edge in H corresponds to two parallel edges in F. For each pair of parallel edges, label one edge as ‘+1’ and the other edge as ‘-1’. Define the edge coloring of F as follows:

• segment edge e labeled +1 → red

• segment edge e labeled −1 → blue

• bond edge e labeled +1 → blue

• bond edge e labeled −1 → red
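The construction of F can be sketched directly from the rules above. This is an illustrative sketch only; representing edges as vertex pairs and the returned tuple layout are assumptions for the example.

```python
def genome_modification_graph(segment_edges, bond_edges):
    """Build the edge list of the genome modification graph F.

    Each edge (u, v) of H yields two parallel edges in F, labeled +1 and -1.
    Returns a list of (u, v, kind, label, color) tuples.
    """
    color = {
        ("segment", +1): "red",
        ("segment", -1): "blue",
        ("bond", +1): "blue",
        ("bond", -1): "red",
    }
    edges_f = []
    for kind, edges in (("segment", segment_edges), ("bond", bond_edges)):
        for u, v in edges:
            for label in (+1, -1):
                edges_f.append((u, v, kind, label, color[(kind, label)]))
    return edges_f


# Hypothetical genome graph: two segments joined by two bond edges in a cycle.
F = genome_modification_graph([(0, 1), (2, 3)], [(1, 2), (3, 0)])
```

In this toy graph, taking the +1 copy of every edge gives a cycle whose colors alternate red (segment) and blue (bond), corresponding to adding one circular chromosome.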


Figure 5.6: The Genome Modiﬁcation Graph. A) A small subgraph of a genome graph with a subset of edges is shown. B) Transformation of the genome graph to a genome modiﬁcation graph. Each edge is duplicated and labelled with ‘+’ or ‘-’. Segment edges are colored red for label ‘+’ and blue for label ‘-’. Bond edges are given an opposite labelling: blue for label ‘+’ and red for label ‘-’.

Proposition 2. Any simple genome modiﬁcation can be represented as a vertex disjoint alternating cycle in F .

Proof. Follows directly from Proposition 1.

Note that the converse of Proposition 2 is not true. Some vertex disjoint alternating cycles can only be decomposed into g+(e) and g−(e) such that either g+(e) > 1 or g−(e) < −1 for some edge e. Also note that simple genome modiﬁcations involving only bond edges are equivalent to a balanced rearrangement in a breakpoint graph. See Figure 5.7 for an example of a simple modiﬁcation.

[Figure 5.7 panels: Simple Modification of Edge Copy Number (top); Genome Modification Graph, with an edge coloring legend distinguishing +1 segment, −1 segment, +1 bond and −1 bond edges (bottom)]

Figure 5.7: A simple modification of edge copy number increments or decrements the copy number of a subset of edges for a specific clone, maintaining the copy balance condition (top). The associated simple genome modification and its alternating cycle representation.

Selecting an optimal simple genome modiﬁcation

For a collection of M genomes, consider the set of edge copy number modifications T^M, where T = {−1, 0, +1}. Let ∆ ∈ T^M, ∆ ≠ 0 be a specific modification affecting at least one of the M genomes. Let z ∈ T^E represent the acceptance of the modification or its inverse for each edge in E. Thus ze∆ represents modification by ∆, −∆ or 0 for ze equal to +1, −1 and 0 respectively. Write the objective of a locally optimal modification given ∆ as shown in Equation 5.18.

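$$\operatorname*{argmin}_{z \in T^{E}} \sum_{e \in E} f_e(c_e + z_e \Delta) \tag{5.18}$$

$$\text{s.t.} \quad \sum_{e \in S(v)} z_e \Delta = \sum_{e \in Q(v)} z_e \Delta \quad \forall v \in V, \tag{5.19}$$

$$1 \geq \sum_{e \in S(v)} |z_e| \quad \forall v \in V, \tag{5.20}$$

$$1 \geq \sum_{e \in Q(v)} |z_e| \quad \forall v \in V \tag{5.21}$$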

The first constraint (Equation 5.19) is simply the copy number balance condition. The second and third constraints (Equations 5.20 and 5.21) ensure that the set of alternating cycles represented by z is vertex disjoint, and is thus a simple genome modification. To identify the locally optimal modification, first create genome modification graph F , then create cost function κ assigning cost to each edge as follows:

• edge e labeled +1 → fe(ce + ∆) − fe(ce)

• edge e labeled −1 → fe(ce − ∆) − fe(ce).
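The cost function κ can be sketched as follows, where f_e is any per-edge objective taking a tuple of per-clone copy numbers. The function name and the toy convex f_e are assumptions for this example, and the sketch does not check that the modified copy numbers remain non-negative.

```python
def modification_costs(f_e, c_e, delta):
    """Costs kappa of the +1 and -1 labeled copies of one edge in F.

    f_e   -- per-edge objective, called on a tuple of per-clone copy numbers
    c_e   -- current per-clone copy numbers of the edge
    delta -- modification vector in {-1, 0, +1}^M
    """
    plus = tuple(c + d for c, d in zip(c_e, delta))
    minus = tuple(c - d for c, d in zip(c_e, delta))
    base = f_e(c_e)
    return {+1: f_e(plus) - base, -1: f_e(minus) - base}


# Hypothetical convex objective preferring 3 copies in total across clones.
kappa = modification_costs(lambda c: (sum(c) - 3) ** 2, (1, 1), (0, 1))
```

A negative cost indicates that applying the modification to that edge alone would improve the objective; for convex f_e the two costs of an edge can never both be negative.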

Proposition 3. Assume fe is convex. A minimum cost vertex disjoint set of alternating cycles C in F given cost function κ is a locally optimal modification minimizing Equation 5.18.

Proof. Any z can be transformed into a speciﬁc C by selecting the edges in F corresponding to settings of z (label +1 if ze = +1, label −1 if ze = −1, or no edge if ze = 0). Conversely, suppose C includes both the +1 labeled edge and −1 labeled edge that correspond to a single edge in H. The convexity of fe implies that fe(ce) − fe(ce + ∆) + fe(ce) − fe(ce − ∆) ≤ 0.

Selection of both edges implies fe(ce + ∆) − fe(ce) + fe(ce − ∆) − fe(ce) ≤ 0, otherwise neither would have been selected. Thus fe(ce + ∆) − fe(ce) + fe(ce − ∆) − fe(ce) = 0, and both edges can be removed from C without consequence to the objective. Let C′ be derived from C by removing pairs of +1 and −1 labeled edges corresponding to a single edge in H. There is a bijection between the space of possible z and the space of possible C′. Furthermore, minimizing Equation 5.18 is equivalent to minimizing ∑e∈E (fe(ce + ze∆) − fe(ce)). The result follows.

We use the following well known transformation to identify the minimum cost vertex disjoint set of alternating cycles in bi-edge-colored graph G = (V, E) with edge coloring color : E → {red, blue} and edge cost function cost : E → ℝ. Create graph G′ = (V′, E′) from G as follows. For each vertex v ∈ V, add vertices vred and vblue to V′, and add transverse edge (vred, vblue) to E′. For edge (u, v) = e ∈ E, add edge e′ = (ured, vred) to E′ if color(e) = red or add e′ = (ublue, vblue) if color(e) = blue. Transverse edges are assigned cost 0 and each e′ inherits the cost of e. Let M be a perfect matching in G′. Such a matching exists since the set of transverse edges is a perfect matching. Furthermore, M is a perfect matching in G′ if and only if the non-transverse edges in M represent a set of vertex disjoint alternating cycles in G. Thus the minimum cost perfect matching in G′ represents a minimum cost vertex disjoint set of alternating cycles in G. We use Blossom V [76] to calculate the minimum cost perfect matching.
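The transformation can be illustrated on a toy graph. Note that the sketch below replaces Blossom V with an exhaustive search over perfect matchings, which is only feasible for very small graphs; it is intended to show the construction of G′, not to be an efficient solver. The function name and the toy edge costs are assumptions for the example.

```python
def min_cost_alternating_cycles(vertices, edges):
    """Minimum cost vertex disjoint set of alternating cycles, by brute force.

    edges -- list of (u, v, color, cost) with color in {"red", "blue"}
    Returns (total_cost, sorted indices of the selected edges).
    """
    # Build G': split every vertex v into (v, "red") and (v, "blue").
    nodes = [(v, c) for v in vertices for c in ("red", "blue")]
    gprime = [((v, "red"), (v, "blue"), 0.0, None) for v in vertices]  # transverse
    for idx, (u, v, color, cost) in enumerate(edges):
        gprime.append(((u, color), (v, color), cost, idx))

    best = [float("inf"), None]

    def search(unmatched, cost, chosen):
        if not unmatched:
            if cost < best[0]:
                best[0], best[1] = cost, sorted(chosen)
            return
        a = unmatched[0]  # always match the first unmatched node next
        for x, y, w, idx in gprime:
            if a != x and a != y:
                continue
            b = y if a == x else x
            if b == a or b not in unmatched:
                continue
            rest = [n for n in unmatched if n != a and n != b]
            search(rest, cost + w, chosen + ([idx] if idx is not None else []))

    search(nodes, 0.0, [])
    return best[0], best[1]


# Hypothetical bi-edge-colored graph: an alternating 4-cycle plus one extra edge.
edges = [(0, 1, "red", -3), (1, 2, "blue", 1), (2, 3, "red", -2),
         (3, 0, "blue", 1), (0, 1, "blue", 5)]
cost, selected = min_cost_alternating_cycles([0, 1, 2, 3], edges)
```

In the toy graph the first four edges form an alternating cycle of total cost −3, which is preferred over the all-transverse matching of cost 0. When every edge cost is positive, the all-transverse matching (no modification) is returned.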

Greedy Algorithm

We propose the following algorithm for solving the maximum posterior genome mixture problem. Given are haploid read depths h. Initialize G to a valid genome collection as follows. Set the copy number of segment edge n to cn. Set the copy number of reference edges to the minimum of the adjacent segment edges. Finally, set the copy number of telomere edges so as to satisfy the copy number balance condition at each vertex.

The algorithm proceeds as follows. At iteration t, identify the minimum cost modification of G(t) from the set of all possible modifications ∆ ∈ T^M, ∆ ≠ 0. If the minimum cost modification has cost less than 0, apply it to G(t) to create G(t+1) and continue. If no modification has cost less than 0, stop iterating.
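The outer loop of the greedy algorithm can be sketched as below. The callables `find_best` (returning the cost and edge selection z of the locally optimal modification for a given ∆) and `apply_mod` are hypothetical placeholders for the matching-based step and the copy number update; they are assumptions of this sketch, as is the iteration cap.

```python
from itertools import product


def all_modifications(num_genomes):
    """All delta in {-1, 0, +1}^M except the zero vector."""
    return [d for d in product((-1, 0, 1), repeat=num_genomes) if any(d)]


def greedy_search(G, deltas, find_best, apply_mod, max_iter=1000):
    """Repeatedly apply the minimum cost modification while it has cost < 0."""
    for _ in range(max_iter):
        best_cost, best_delta, best_z = 0.0, None, None
        for delta in deltas:
            cost, z = find_best(G, delta)
            if cost < best_cost:
                best_cost, best_delta, best_z = cost, delta, z
        if best_delta is None:
            break  # no modification improves the objective
        G = apply_mod(G, best_delta, best_z)
    return G
```

Since a modification and its inverse are both reachable through z, enumerating ∆ and −∆ separately is redundant but harmless in this sketch.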

5.4 Results

5.4.1 Simulating rearranged genomes

We developed a principled method of simulating rearranged genomes that fulfilled two important criteria. First, the simulated tumour genomes are required to have been produced by a known evolutionary history composed of duplication, deletion, and balanced rearrangement events applied successively to an initially non-rearranged normal genome. Second, the copy number profile of the simulated tumour genome should be reasonably similar to that of previously observed tumours. Naively applying random rearrangements to a normal genome would result in a significant number of regions that are homozygously deleted. Such a scenario is unrealistic given that large scale homozygous deletion would remove housekeeping genes necessary for the survival of the cell. Alternatively, a naively simulated genome could become exceptionally large, an outcome that is rarely observed given the presumed burden of replicating such a genome. Thus, we developed a re-sampling method for producing realistic rearranged genomes (Figure 5.8). At each step in the simulation of an evolutionary history, we resample a swarm of genomes according to a fitness function. Fitness is calculated as the multinomial likelihood of the simulated copy numbers given average copy number proportions from a set of real tumours. For this study we used copy number proportions measured from 7 high grade serous tumours (data not shown).
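The resampling step can be sketched as follows. This sketch makes several assumptions that are not specified above: each genome is summarised as a list of per-segment copy numbers, every segment contributes equally to the multinomial likelihood, copy numbers absent from the target proportions receive a small floor probability, and the function names are hypothetical.

```python
import math
import random


def log_fitness(copy_numbers, target_proportions, floor=1e-6):
    """Multinomial log likelihood (up to a constant) of per-segment copy numbers."""
    return sum(math.log(target_proportions.get(c, floor)) for c in copy_numbers)


def resample(swarm, target_proportions, rng):
    """Resample a swarm of genomes with probability proportional to fitness."""
    logs = [log_fitness(g, target_proportions) for g in swarm]
    top = max(logs)
    weights = [math.exp(v - top) for v in logs]  # subtract max for stability
    return rng.choices(swarm, weights=weights, k=len(swarm))


# Hypothetical target copy number proportions and a swarm of two genomes.
target = {0: 0.01, 1: 0.2, 2: 0.5, 3: 0.2, 4: 0.09}
swarm = [[2, 2, 2, 2], [0, 0, 0, 0]]
new_swarm = resample(swarm, target, random.Random(0))
```

A genome dominated by homozygous deletions, such as the second genome here, receives a much lower fitness and is therefore unlikely to survive resampling.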

[Figure 5.8 labels: Rearrange and Resample; Swarm of Genomes; Clone 1; Clone 2]

Figure 5.8: Simulating realistic rearranged genomes. Simulation starts with a set of normal genomes (leftmost column of white circles). At each step, a random rearrangement is applied to produce a new set of genomes (darkening grey circles from left to right), and a new set of genomes is resampled (edges between columns of circles). Two clones are selected from a single evolutionary history (red tinted).

To simulate a mixture of related genomes, we first simulate the rearrangement history of the common ancestor. We then modify the fitness function to include a term that controls the deviation between ancestor and descendant. A target deviation is specified as the proportion of the genome with divergent copy number state between ancestor and descendant. The deviation term is the squared error between simulated deviation and target deviation, with some user specified scaling factor, or variance.

We simulated 20 mixtures of rearranged tumour genomes. Genomes in each mixture harbored 50 ancestral and 40 clone specific rearrangements, with an additional 50 false rearrangements. Each genome consisted of 1000 segments with randomly sampled lengths totaling 3 × 10^9 nt. Segment lengths as a proportion of genome length were sampled from a Dirichlet distribution with concentration parameter 1. The proportion of genotypable reads was sampled uniformly between 0.05 and 0.2. We assumed the samples were composed of 40% normal cells and 2 tumour clones. Target deviation was set at 30%. Minor clone proportions were set to 5, 10, 20, and 30% of cells. Read counts were simulated using a negative binomial likelihood given segment copy numbers and assuming 40X sequencing (footnote 7).
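The sampling of segment lengths and genotypable read proportions can be sketched with the standard library alone. A Dirichlet sample with concentration parameter 1 is obtained here by normalising independent Exponential(1) draws; the function names and the fixed seed are assumptions for this example.

```python
import random


def sample_segment_lengths(num_segments, genome_length, rng):
    """Segment lengths whose proportions follow a Dirichlet(1, ..., 1)."""
    draws = [rng.expovariate(1.0) for _ in range(num_segments)]
    total = sum(draws)
    return [genome_length * d / total for d in draws]


def sample_genotypable_proportions(num_segments, rng, low=0.05, high=0.2):
    """Per-segment proportion of genotypable reads, uniform on [low, high]."""
    return [rng.uniform(low, high) for _ in range(num_segments)]


rng = random.Random(0)
lengths = sample_segment_lengths(1000, 3e9, rng)
phi = sample_genotypable_proportions(1000, rng)
```

Lengths are kept as floating point values in this sketch; rounding to whole nucleotides would require redistributing the rounding error so that the total remains 3 × 10^9.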

5.4.2 Benchmarking learning haploid depth using simulated data

We used EM to infer the haploid read depths (h parameter) within the context of the HMM version of our model for simulated datasets. For 50% of the simulated datasets, minor clone proportion predictions are within 5% of the simulated value, and normal proportion predictions are within 2% of the simulated value (Figure 5.9).

7 A total haploid coverage of 0.2 reads (paired) per nucleotide corresponds to a 40X sequence coverage genome, 100×100bp reads from ∼400bp fragments.

Figure 5.9: Benchmarking the Learning Algorithm. Box plots of inferred clonal fraction for the minor tumour clone (left) and normal clone (right). Datasets are grouped by simulated minor clone proportion (0.05, 0.1, 0.2, 0.3). Normal proportion is 0.4 for all simulations.

(a) Proportion minor clone in the mixture with normal fixed at 0.4. [Plots: Minor Fraction Error and Normal Fraction Error versus Minor Clone Proportion (0.05, 0.1, 0.2, 0.3)]

(b) Proportion of normal in the mixture. [Plots: Minor Fraction Error and Normal Fraction Error versus Normal Proportion (0.1, 0.2, 0.3, 0.5, 0.75, 0.9)]

(c) Proportion of genomic segments with divergent (subclonal) copy number. [Plots: Minor Fraction Error and Normal Fraction Error versus Proportion Genome Subclonal (0.15, 0.3, 0.45, 0.6)]

5.4.3 Benchmarking structure and content prediction using simulated data

We applied 3 decoding algorithms to the read count data assuming the clone proportions and sequencing depth were known. The independent algorithm calculates the maximum likelihood copy number state of each segment and then post-hoc assigns copy number to breakpoints. The Viterbi algorithm calculates the maximum posterior path through the HMM representation of each chromosome, also assigning breakpoint copy number post-hoc. The genomegraph algorithm uses the proposed algorithm to simultaneously infer segment and breakpoint copy number. For optimal performance, the genomegraph algorithm is initialized with the results of the Viterbi.

We calculated 3 measures of performance: f-measure of ability to predict breakpoints as present versus false, f-measure of ability to predict breakpoints as subclonal, and proportion of segments for which the correct copy number is identified. The genomegraph algorithm outperforms the independent and Viterbi algorithms on all measures for all but the 5% minor clone mixtures (Figure 5.10). The inability of the independent algorithm to model spatial correlation between adjacent segments results in a higher number of spurious copy number transitions and low precision with respect to estimation of breakpoint presence and clonality. Precision is higher for the Viterbi due to the smoothing properties of the algorithm. However, recall is lower than the genomegraph method, since copy number changepoints either do not precisely coincide with respective breakpoints, or are smoothed over entirely for copy number changes in low proportion clones. Finally, joint inference as implemented by the genomegraph method noticeably improves the accuracy of segment copy number prediction over the current state of the art, Viterbi inference in an HMM.
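For clarity, the f-measure used for the two breakpoint measures is the harmonic mean of precision and recall. A minimal sketch follows, assuming predicted and true breakpoints are given as sets of identifiers; the function name and toy sets are hypothetical.

```python
def f_measure(predicted, truth):
    """Harmonic mean of precision and recall for two sets of breakpoint ids."""
    true_positives = len(predicted & truth)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: 4 predicted breakpoints, 3 true, 2 in common.
score = f_measure({1, 2, 3, 4}, {2, 3, 5})
```

The same function applies to the subclonal measure by restricting both sets to breakpoints labeled as subclonal.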

5.4.4 Comparison with Existing Copy Number Inference Methods

We compared our genome graph approach to three existing methods for subclonal copy number inference in heterogeneous tumour samples: TITAN [57], CloneHD [46], and Theta2.0 [117, 118] (see supplementary section C.5). We first simulated 10 normal genomes, including SNP genotypes. Recombination sites were selected at a rate of 20 per 100Mb. For each region between adjacent recombination sites we selected a random individual from a phased 1000 genomes reference panel [39], and used that individual’s SNP genotypes for that region. For each of the 10 normal genomes, we then simulated 4 pairs of tumour clones using the re-sampling method described above, varying the proportion of the genome that is divergent. Finally we simulated 4 genome mixtures for each of the 40 pairs of tumour clones. For each simulated mixture, we calculated the number of reads for each clone in an approximately 30X sequencing dataset, and simulated fragments with mean 300 and standard deviation 30. We then added the germline SNPs to 100bp reads from these fragments, flipping the genotype of SNPs based on a simulated sequencing error rate of 0.005. We compared each tool’s output CNA predictions and mixture prediction to the true simulated values using 3 measures:

1. Proportion of segments with correct major/minor copy number for each clone, or total copy number where allele speciﬁc copy number was unavailable

2. Relative error in estimation of normal proportion.

3. Relative error in estimation of minor clone proportion.

The results are shown in Figure 5.11. For our simulated data, our method outperforms the competing methods on all measures. In general, failure of the competing methods to

correctly predict copy number is primarily due to the inability to identify the true mixture. Given an incorrect estimate of the normal contamination and minor clone proportion, copy number estimates are unlikely to be correct.

We considered the possibility that other tools may perform poorly on datasets with significant amounts of high level amplification. Thus we simulated an additional 40 genome mixtures, modifying the target copy number state distribution to produce genomes for which segments with more than 4 total copies are very rare. For this simulation we fixed the proportion of the genome that is divergent at 0.25, and varied the minor clone proportion in the mixture. As can be seen in Figure 5.12, each tool performs better in some circumstances, but only the genomegraph method performs consistently well across different mixtures.

5.5 Discussion

We have developed a novel method for joint inference of genome content and structure. Using a comprehensive set of simulated genome mixtures, we have shown that joint inference out-performs naive methods with respect to identification of subclonal breakpoints and classification of breakpoints as real or false positive. Furthermore, we have shown that inclusion of breakpoints during copy number inference provides a modest but consistent improvement in the accuracy of predicted segment copy number. Results from a comparison against existing subclonal copy number inference tools show our method out-performs existing methods on the simulated mixtures.

Our method provides several additional novel contributions. When selecting a segment length, a balance must be struck between the need for increased statistical strength provided by longer segments, and the additional noise that results from a true copy number transition occurring in the middle of a segment. By segmenting at the break-ends of rearrangement breakpoints, we attempt to capture a majority of the copy number transitions with our segmentation. We then use a medium size segmentation to break up longer unsegmented regions, allowing us to retain improved statistical strength, while modeling for a majority of the changepoints. Furthermore, haplotype blocks are used in a novel way to improve the accuracy of allele specific read counts. The aggregation of allele specific read counts across segments allows us to model the alleles of large segments as (approximately) independent, making inference more tractable. Compared to copy number inference tools such as TITAN, our state space is in some ways more comprehensive, allowing for any combination of clone copy number at each segment.

Though our initial results are promising, significant work remains. The proposed moves of our greedy heuristic are not sufficient to escape local optima of significant importance.
For instance, a breakage fusion bridge cycle would be represented as a loop bond edge in the genome graph. We do not include loop bond edges as they would never be selected by a matching based approach. To support loops, we could either add dummy segment edges,

making each loop a 3 edge alternating path, or use a more general matching algorithm such as minimum perfect b-matchings to identify optimal moves. Finally, future work may show benefit to inclusion of rearrangement breakpoints during learning, providing motivation for development of a method that jointly infers genome content, genome structure, and clone structure.

Figure 5.10: Benchmarking Genomegraph Inference vs. Naive Approaches. Performance of the genomegraph algorithm compared to two breakpoint-naive approaches, the independent model and an HMM using the Viterbi algorithm. For each set of plots, one parameter of the simulation was varied.

[Plot panels not reproduced here. Each panel shows Breakpoint Presence F-Measure, Breakpoint Subclonal F-Measure, and Proportion Segments Correct for the independent, viterbi, and genomegraph methods. (a) Proportion minor clone in the mixture with normal fixed at 0.4 (x-axis: Minor Clone Proportion). (b) Proportion of genomic segments with divergent (subclonal) copy number (x-axis: Proportion Genome Subclonal). (c) Proportion of normal in the mixture (x-axis: Normal Proportion). (d) Learning error as standard deviation from true fraction (x-axis: Haploid Depth Error).]

Figure 5.11: Performance of the genomegraph algorithm compared to three existing methods, TITAN, CloneHD, and Theta2.0. Each plot shows performance with one parameter of the simulation varied.

[Plot panels not reproduced here. Each panel shows Proportion Segments Correct, Normal Fraction Error, and Minor Clone Fraction Error for the titan, theta, genomegraph, and clonehd methods. (a) Proportion minor clone in the mixture with normal fixed at 0.4 (x-axis: Minor Clone Proportion). (b) Proportion of genomic segments with divergent (subclonal) copy number (x-axis: Proportion Genome Subclonal).]

Figure 5.12: Performance comparison of the genomegraph algorithm with TITAN, CloneHD, and Theta2.0, using a dataset with limited amplified regions. Shown is performance when varying the proportion of minor clone in the mixture with normal fixed at 0.4.

[Plots not reproduced here. Shown are Proportion Segments Correct, Normal Fraction Error, and Minor Clone Fraction Error for the titan, theta, genomegraph, and clonehd methods (x-axis: Minor Clone Proportion).]

Chapter 6

Applications

6.1 deFuse

deFuse has been used to discover novel gene fusions in several high impact cancer studies. Steidl et al. used deFuse to identify CIITA as a recurrent fusion partner in lymphoid cancers, implicating fusions in reduced tumour cell immunogenicity in these cancers [151]. Roberts et al. used deFuse to discover fusions responsible for activating kinase and cytokine receptor signaling in high-risk acute lymphoblastic leukemia [133]. Scott et al. used deFuse to discover TBL1XR1-TP63, a novel recurrent gene fusion in B-cell non-Hodgkin lymphoma [141]. Thiollier et al. applied deFuse to xenografts of acute megakaryoblastic leukemia, discovering novel fusions for this cancer type [155]. In a landscape study of peripheral T cell lymphomas (PTCLs), Palomero et al. used deFuse in conjunction with ChimeraScan to identify ALK fusions in PTCL patient samples [121]. Masetti et al. also combined ChimeraScan and deFuse to discover CBFA2T3-GLIS2, a novel fusion transcript recurrent in pediatric cytogenetically normal acute myeloid leukemia (CN-AML), and showed that this fusion predicted poorer outcome for CN-AML patients [96]. McNerney et al. used deFuse to identify a loss-of-function fusion transcript involving CUX1, resulting in the discovery of CUX1 as a novel haplo-insufficient tumour suppressor gene in high-risk acute myeloid leukemia [101]. Lee et al. discovered 14-3-3 fusions in a clinically aggressive form of uterine sarcoma: high-grade endometrial stromal sarcoma (ESS), first identifying the associated rearrangement using cytogenetics, then confirming the presence of the fusion transcript using deFuse, and finally confirming the existence of a translated onco-protein and validating its onco-genic function [83]. Two studies used deFuse to identify fusions affecting the mitogen-activated protein kinase (MAPK) pathway in pediatric brain tumours [177, 72]. Finally, Majewski et al. used a capture assay to enrich for kinase fusions, and applied TopHat-fusion, deFuse, and de novo transcript assembly with Trinity to discover novel fusion transcripts in non-small cell lung cancer (NSCLC) [94].

6.2 Comrad and nFuse

Comrad and nFuse were used to characterize the genomes and transcriptomes of patient samples in several studies. In a companion paper to nFuse, Wu et al. applied Comrad and nFuse to the discovery of complex fusion transcripts in prostate cancer [171]. Their study identified fusion transcripts composed of 3 or more distinct genes (poly-gene fusions) in patient samples and the LNCaP cell line, including poly-gene fusions resulting from chromothripsis. In a second companion paper, Wu et al. applied nFuse to identify fusion transcripts and associated rearrangements in a conventional but aggressive prostate adenocarcinoma. Despite evidence that the tumour cells were homogeneous with respect to histology and copy number profiles, the tumour exhibited a hybrid luminal-neuroendocrine expression profile. Fusion transcripts identified using nFuse recapitulated the hybrid nature of the tumour with fusion partners from both neuroendocrine and luminal cell associated genes [170]. A single complex rearrangement responsible for 4 of these fusion transcripts, including C15orf21-MYC and ARHGEF17-SHANK2, was described in more detail in the nFuse paper. A subsequent study by Wyatt et al. used nFuse to catalog patient specific gene fusions in a larger cohort [173]. Many of the gene fusions identified converged on a small number of cancer specific pathways. The same study details the results of applying an early version of ReMixT to a patient with a large number of tandem duplications to show that the tumour cells were heterogeneous for the presence of those duplications. Applied to colorectal cancer (CRC), nFuse was used by Nome et al. to identify novel fusion transcripts, a small number of which were found to be recurrent in a large cohort of patient samples [113, 114]. Additional studies have used an nFuse derivative called deStruct to identify genomic rearrangements from only whole genome sequencing. 
Gunawardana et al. used deStruct as part of a study of the landscape of somatic changes in primary mediastinal B cell lymphoma and Hodgkin lymphoma [55]. Twa et al. used deStruct to identify breakpoints affecting the programmed death ligand (PDL) locus in primary testicular diffuse large B cell lymphoma (PTL) [161]. The sequencing data was obtained from archival, formalin-fixed, paraffin-embedded (FFPE) tissue samples, and Bacterial Artificial Chromosome (BAC) capture was used to enrich for PDL sequence; thus their study provides validation of deStruct’s applicability to data from a wide variety of sources. Boutros et al. used deStruct to catalog the presence of breakpoints across multiple biopsies each obtained from patients with multifocal prostate cancer [19]. Finally, Eirew et al. applied deStruct to patient derived breast xenografts to study the evolution of tumour cells throughout serial engraftment [41].

Chapter 7

Conclusion

Rearrangements impact tumour biology in a multitude of ways. A rearrangement may delete or translocate parts of a gene, compromising its ability to function. With a tumour suppressor compromised, a cell may then be able to evade cell cycle checkpoints and apoptosis, leading to unbounded growth. A fusion of two genes may create a novel gene that promotes tumour development. Translocation of a gene may deregulate its expression, and may also result in the development of the cancer phenotype. The effects of rearrangement on expressed genes will be observable in both the genome and transcriptome sequence data. We have shown evidence of biologically significant events identified by applying our methods to both genome and transcriptome sequence data. Furthermore, detected rearrangements may be important as molecular phenotypes, providing evidence of historical mutation events. Complex balanced rearrangements in prostate cancer provide potential evidence for which genes were co-expressed historically, under the current hypothesis that the breakpoints for these events form simultaneously at a single transcriptional hub. Detection of subclonal rearrangements may provide evidence of contemporary genomic instability, in addition to providing predictions of clonal markers for studying evolutionary relationships between tumour clones. This thesis presents several novel algorithms for detection and characterization of rearrangements in cancer. Each method has benefited from integration of multiple types of data: spanning and split discordant reads for deFuse, WGS and RNA-Seq for Comrad and nFuse, and breakpoints and segment read counts for ReMixT. Furthermore, for each problem solved by the proposed methods, we have developed a principled method for joint analysis of the multiple data types, and shown that joint analysis out-performs more naive independent methods. 
The algorithmic problems formulated in this thesis are primarily maximum parsimony or maximum likelihood motivated optimization problems. Two problems formulated for the deFuse work are identification of optimal mapping locations of multi-map reads, and optimal split read alignment. For the first problem, we show that the setcover formulation [65] and greedy approximation algorithm provide reasonable results. For the second problem we provide a dynamic programming (DP) approach based on merging two independent DP matrices, as was proposed concurrently for structural variation [1], and previously for spliced alignments [172]. For joint inference of WGS and RNA-Seq, we modified the setcover based approach for joint analysis of multiple related individuals [68], accounting for additional complexities resulting from working with the transcriptome. For Comrad, we proposed both an ILP formulation and setcover approximation to identify the most parsimonious sets of rearrangements and fusion transcripts that explain the read alignment data. For nFuse, we updated the problem to include complex breakpoints and complex rearrangements, and provided a greedy approximation algorithm to identify the most parsimonious set of events. Finally, for ReMixT we formulated the problem of inferring the maximum posterior segment and breakpoint copy number. We model genome structure implied by a set of predicted breakpoints, adding a prior that enforces sparsity of copy number changes that are unexplained by breakpoints. We propose solving the problem using a greedy heuristic that sequentially updates inferred copy number, maximizing the increase in the overall posterior probability at each step. We solve the subproblem of identifying each update from a restricted set of candidates using a novel reduction to minimum cost perfect matching.
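As an illustration of the greedy setcover approximation mentioned above, the following minimal sketch selects candidate events until every read is explained, then assigns each read to the event that first covered it. This is the standard textbook greedy heuristic under simplifying assumptions (unit cost per event, no alignment scores), not the exact formulation of [65] or of deFuse; the function name `greedy_set_cover`, the event labels, and the read identifiers are hypothetical.

```python
def greedy_set_cover(candidates):
    """Greedy set cover over candidate events.

    candidates maps a (hypothetical) event label to the set of read ids
    that could be explained by that event. Returns the chosen events in
    selection order and a read -> event assignment.
    """
    uncovered = set().union(*candidates.values())
    chosen = []
    assignment = {}
    while uncovered:
        # Pick the event explaining the most still-unexplained reads;
        # iterating in sorted order makes tie-breaking deterministic.
        best = max(sorted(candidates),
                   key=lambda event: len(candidates[event] & uncovered))
        newly_covered = candidates[best] & uncovered
        chosen.append(best)
        for read in newly_covered:
            assignment[read] = best
        uncovered -= newly_covered
    return chosen, assignment


# Toy input: read r3 is multi-mapped and supports two candidate fusions.
toy_candidates = {
    "A-B": {"r1", "r2", "r3"},
    "A-C": {"r3", "r4"},
    "D-E": {"r2"},
}
chosen_events, read_assignment = greedy_set_cover(toy_candidates)
```

In this toy input the ambiguous reads r2 and r3 are attributed to the best supported event, and the candidate "D-E" is not selected because its only read is already explained.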

7.1 Possible Improvements

Many important fusion transcripts have been identified from RNA-Seq using our method and others. However, many more fusions of unknown function have been identified. It is generally unknown whether these events are successfully translated into protein products. Recent progress in mass spectrometry (MS) now allows for its practical application in the context of proteogenomics studies [4]. By augmenting nucleotide databases with patient specific fusion transcripts discovered from RNA-Seq, researchers will be able to identify fusion proteins from MS data, allowing for a full understanding of the impact of rearrangements on the proteome. The proposed method for identification of complex balanced rearrangements (CBR) requires a shortest path search for every breakpoint, and is thus inefficient and does not provide a global solution. A more efficient algorithm that provides a global solution could be designed by slightly modifying the problem formulation, and leveraging the reduction to minimum cost perfect matching described for ReMixT. Suppose that, instead of the proposed formulations, we wish to identify the set of CBR breakpoints that simultaneously maximizes the number of CBR breakpoints and minimizes the length of loss and gain edges, or a weighted combination thereof. This problem can be solved by identifying the minimum cost set of edge disjoint alternating cycles (corresponding to CBRs) for a Genome Modification Graph with appropriate edge weights: ‘+’ and ‘-’ segment edges are given a positive weight scaled by length, ‘+’ breakpoint edges are given a negative weight, ‘-’ breakpoint edges are given infinite weight, telomere edges are removed, and reference bond edges are given weight 0. Minimum cost perfect matching could then be used to identify all CBRs simultaneously. Finally, we have proposed an algorithm for jointly inferring content and structure of genomes from heterogeneous tumour samples, though we assume the mixture is known. In practice, we use an HMM to learn the mixture, ignoring any additional information provided by predicted breakpoints. Of benefit would be an algorithm for learning the mixture while marginalizing breakpoint and segment copy number, analogous to the Baum-Welch algorithm used for HMMs. Such a problem is very likely to be computationally hard, though an approximate probabilistic method such as variational EM could provide reasonable solutions.
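The edge weighting proposed above for complex balanced rearrangements can be made concrete with a small sketch. It only encodes the weighting rules and sums them over a candidate cycle; it does not perform the matching itself. The function names, the dictionary encoding of edges, and the parameters `bases_per_unit_cost` and `breakpoint_reward` (which together set the relative weighting of breakpoint count against loss and gain length) are assumptions for illustration.

```python
import math


def cbr_edge_weight(kind, sign=None, length=0, bases_per_unit_cost=1000.0,
                    breakpoint_reward=1.0):
    """Weight of one Genome Modification Graph edge; None means 'removed'."""
    if kind == "telomere":
        return None
    if kind == "reference_bond":
        return 0.0
    if kind == "segment":
        # '+' and '-' segment edges: positive cost scaled by length.
        return length / bases_per_unit_cost
    if kind == "breakpoint":
        # '+' breakpoint edges are rewarded; '-' breakpoint edges are forbidden.
        return -breakpoint_reward if sign == "+" else math.inf
    raise ValueError(f"unknown edge kind: {kind!r}")


def cycle_cost(edges):
    """Total weight of a candidate cycle given as a list of edge dicts."""
    weights = [cbr_edge_weight(**edge) for edge in edges]
    if any(w is None for w in weights):
        raise ValueError("cycle uses a removed (telomere) edge")
    return sum(weights)


# Toy edge list standing in for a candidate cycle (not real data).
toy_cycle = [
    {"kind": "breakpoint", "sign": "+"},
    {"kind": "segment", "sign": "-", "length": 500},
    {"kind": "breakpoint", "sign": "+"},
    {"kind": "reference_bond"},
]
toy_cost = cycle_cost(toy_cycle)
```

Under this weighting, a cycle can only lower the total cost of a solution if its summed weight is negative, that is, if the reward from its breakpoints exceeds the length penalty of the loss and gain edges it uses.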

7.2 Future Directions

Improvements in single cell sequencing are making this technology a more practical tool for understanding the clonal composition of heterogeneous tumour samples. Nevertheless, many of these technologies require a trade-off; datasets have either low genome coverage, or exhibit allele and region specific coverage bias induced by the use of a DNA amplification step. Coverage biases inhibit accurate copy number prediction in amplification based assays, though the use of amplification increases the probability of detecting breakpoints in individual cells. A solution for single cell copy number characterization in these datasets may involve integration of copy number and breakpoints as has been proposed for ReMixT. Breakpoints that have been associated with copy number changes in WGS data can then be used as markers for copy number changes in single cell datasets. Even if the copy number change of interest cannot be directly associated with a breakpoint, using ReMixT, it may be possible to identify a breakpoint with the same clonal prevalence as the copy number change of interest. Single cell sequencing of sets of breakpoints with distinct clonal prevalences will then allow for an understanding of the co-occurrence of breakpoints, and indirectly, copy number changes. Also promising are amplification free techniques, though application of these technologies would require further computational advances. Important would be the ability to accurately detect the presence of breakpoints in very low coverage datasets. Such methods could benefit from integrated breakpoint and copy number prediction. Also important would be the ability to share statistical strength across groups of cells with similar genotype. Improvements to sequencing technology and inference algorithms will soon provide the ability to comprehensively characterize the rearrangements in sequenced tumours. 
However, understanding the evolutionary history of rearrangements in a tumour remains a difficult and poorly solved problem. Whereas the clonal evolution of tumours from the perspective of single nucleotide variants has been well studied, few studies have sought to build phylogenies of rearrangements, likely in part due to the inherent computational difficulties. A principled method for building rearrangement phylogenies from multiple cancer samples could benefit our understanding of the progression of cancer, particularly in cancers driven by rearrangements, and genomically unstable cancers. Whereas focal changes impact tumour biology by changing specific genes, rearrangements also have the ability to restructure the genome. It is known that restructuring may result in formation of fusion genes by juxtaposition of two wild type genes, deletion and breakage of tumour suppressor genes, or amplification of oncogenes. Additional work could help elucidate the order in which multiple events are acquired and the sequence of steps for progressively acquired events, with the aim of determining the initiating changes. Furthermore, the extent to which structural changes themselves are important events in cancer progression, independent of the genes they affect, is not well known. Structural changes may have other, previously unappreciated consequences, such as destabilizing a chromosome as a first step towards progressive amplification via successive segregation errors. The extent to which structural changes are a necessary step in the progression of the cancer, and how and by what mechanism the process is initiated, are questions that require further computational work to be answered in future studies.

Bibliography

[1] Alexej Abyzov and Mark Gerstein. Age: defining breakpoints of genomic structural variants at single-nucleotide resolution, through optimal alignments with gap excision. Bioinformatics, 27(5):595–603, Mar 2011.

[2] P Akiva, A Toporik, S Edelheit, Y Peretz, A Diber, R Shemesh, A Novik, and R Sorek. Transcription-mediated gene fusion in the human genome. Genome Res, 16(1):30–36, Jan 2006.

[3] Ludmil B Alexandrov, Serena Nik-Zainal, David C Wedge, Samuel AJR Aparicio, Sam Behjati, Andrew V Biankin, Graham R Bignell, Niccolò Bolli, Ake Borg, Anne-Lise Børresen-Dale, et al. Signatures of mutational processes in human cancer. Nature, 2013.

[4] Javier A Alfaro, Ankit Sinha, Thomas Kislinger, and Paul C Boutros. Onco-proteogenomics: cancer proteomics joins forces with genomics. Nat Methods, 11(11):1107–13, Nov 2014.

[5] P D Aplan. Causes of oncogenic chromosomal translocation. Trends Genet, 22(1):46–55, Jan 2006.

[6] Y W Asmann, A Hossain, B M Necela, S Middha, K R Kalari, Z Sun, H S Chai, D W Williamson, D Radisky, G P Schroth, J P Kocher, E A Perez, and E A Thompson. A novel bioinformatics pipeline for identification and characterization of fusion transcripts in breast cancer and normal cell lines. Nucleic Acids Res, 39(15), Aug 2011.

[7] Sylvan C Baca, Davide Prandi, Michael S Lawrence, Juan Miguel Mosquera, Alessandro Romanel, Yotam Drier, Kyung Park, Naoki Kitabayashi, Theresa Y MacDonald, Mahmoud Ghandi, Eliezer Van Allen, Gregory V Kryukov, Andrea Sboner, Jean-Philippe Theurillat, T David Soong, Elizabeth Nickerson, Daniel Auclair, Ashutosh Tewari, Himisha Beltran, Robert C Onofrio, Gunther Boysen, Candace Guiducci, Christopher E Barbieri, Kristian Cibulskis, Andrey Sivachenko, Scott L Carter, Gordon Saksena, Douglas Voet, Alex H Ramos, Wendy Winckler, Michelle Cipicchio, Kristin Ardlie, Philip W Kantoff, Michael F Berger, Stacey B Gabriel, Todd R Golub, Matthew Meyerson, Eric S Lander, Olivier Elemento, Gad Getz, Francesca Demichelis, Mark A Rubin, and Levi A Garraway. Punctuated evolution of prostate cancer genomes. Cell, 153(3):666–77, Apr 2013.

[8] S J Baker, A C Preisinger, J M Jessup, C Paraskeva, S Markowitz, J K Willson, S Hamilton, and B Vogelstein. p53 gene mutations occur in combination with 17p allelic deletions as late events in colorectal tumorigenesis. Cancer Res, 50(23):7717–22, Dec 1990.

[9] Ali Bashashati, Gavin Ha, Alicia Tone, Jiarui Ding, Leah M Prentice, Andrew Roth, Jamie Rosner, Karey Shumansky, Steve Kalloger, Janine Senz, et al. Distinct evolutionary trajectories of primary high-grade serous ovarian cancers revealed through spatial mutational profiling. The Journal of pathology, 231(1):21–34, 2013.

[10] A Bashir, S Volik, C Collins, V Bafna, and B J Raphael. Evaluation of paired-end sequencing strategies for detection of genome rearrangements in cancer. PLoS Comput Biol, 4(4), Apr 2008.

[11] B Beheshti, J Karaskova, P Park, J Squire, and B Beatty. Identification of a high frequency of chromosomal rearrangements in the centromeric regions of prostate cancer cell lines by sequential giemsa banding and spectral karyotyping. Molecular Diagnosis, 5(1):23–32, 2000.

[12] H Bengtsson, R Irizarry, B Carvalho, and T P Speed. Estimation and assessment of raw copy numbers at the single locus level. Bioinformatics, 24(6):759–767, Mar 2008.

[13] David R Bentley, Shankar Balasubramanian, Harold P Swerdlow, Geoffrey P Smith, John Milton, Clive G Brown, Kevin P Hall, Dirk J Evers, Colin L Barnes, Helen R Bignell, Jonathan M Boutell, Jason Bryant, Richard J Carter, R Keira Cheetham, Anthony J Cox, Darren J Ellis, Michael R Flatbush, Niall A Gormley, Sean J Humphray, Leslie J Irving, Mirian S Karbelashvili, Scott M Kirk, Heng Li, Xiaohai Liu, Klaus S Maisinger, Lisa J Murray, Bojan Obradovic, Tobias Ost, Michael L Parkinson, Mark R Pratt, Isabelle M J Rasolonjatovo, Mark T Reed, Roberto Rigatti, Chiara Rodighiero, Mark T Ross, Andrea Sabot, Subramanian V Sankar, Aylwyn Scally, Gary P Schroth, Mark E Smith, Vincent P Smith, Anastassia Spiridou, Peta E Torrance, Svilen S Tzonev, Eric H Vermaas, Klaudia Walter, Xiaolin Wu, Lu Zhang, Mohammed D Alam, Carole Anastasi, Ify C Aniebo, David M D Bailey, Iain R Bancarz, Saibal Banerjee, Selena G Barbour, Primo A Baybayan, Vincent A Benoit, Kevin F Benson, Claire Bevis, Phillip J Black, Asha Boodhun, Joe S Brennan, John A Bridgham, Rob C Brown, Andrew A Brown, Dale H Buermann, Abass A Bundu, James C Burrows, Nigel P Carter, Nestor Castillo, Maria Chiara E Catenazzi, Simon Chang, R Neil Cooley, Natasha R Crake, Olubunmi O Dada, Konstantinos D Diakoumakos, Belen Dominguez-Fernandez, David J Earnshaw, Ugonna C Egbujor, David W Elmore, Sergey S Etchin, Mark R Ewan, Milan Fedurco, Louise J Fraser, Karin V Fuentes Fajardo, W Scott Furey, David George, Kimberley J Gietzen, Colin P Goddard, George S Golda, Philip A Granieri, David E Green, David L Gustafson, Nancy F Hansen, Kevin Harnish, Christian D Haudenschild, Narinder I Heyer, Matthew M Hims, Johnny T Ho, Adrian M Horgan, Katya Hoschler, Steve Hurwitz, Denis V Ivanov, Maria Q Johnson, Terena James, T A Huw Jones, Gyoung-Dong Kang, Tzvetana H Kerelska, Alan D Kersey, Irina Khrebtukova, Alex P Kindwall, Zoya Kingsbury, Paula I Kokko-Gonzales, Anil Kumar, Marc A Laurent, Cynthia T Lawley, Sarah E Lee, Xavier Lee, 
Arnold K Liao, Jennifer A Loch, Mitch Lok, Shujun Luo, Radhika M Mammen, John W Martin, Patrick G McCauley, Paul McNitt, Parul Mehta, Keith W Moon, Joe W Mullens, Taksina Newington, Zemin Ning, Bee Ling Ng, Sonia M Novo, Michael J O’Neill, Mark A Osborne, Andrew Osnowski, Omead Ostadan, Lambros L Paraschos, Lea Pickering, Andrew C Pike, Alger C Pike, D Chris Pinkard, Daniel P Pliskin, Joe Podhasky, Victor J Quijano, Come Raczy, Vicki H Rae, Stephen R Rawlings, Ana Chiva Rodriguez, Phyllida M Roe, John Rogers, Maria C Rogert Bacigalupo, Nikolai Romanov, Anthony Romieu, Rithy K Roth, Natalie J Rourke, Silke T Ruediger, Eli Rusman, Raquel M Sanches-Kuiper, Martin R Schenker, Josefina M Seoane, Richard J Shaw, Mitch K Shiver, Steven W Short, Ning L Sizto, Johannes P Sluis, Melanie A Smith, Jean Ernest Sohna Sohna, Eric J Spence, Kim Stevens, Neil Sutton, Lukasz Szajkowski, Carolyn L Tregidgo, Gerardo Turcatti, Stephanie Vandevondele, Yuli Verhovsky, Selene M Virk, Suzanne Wakelin, Gregory C Walcott, Jingwen Wang, Graham J Worsley, Juying Yan, Ling Yau, Mike Zuerlein, Jane Rogers, James C Mullikin, Matthew E Hurles, Nick J McCooke, John S West, Frank L Oaks, Peter L Lundberg, David Klenerman, Richard Durbin, and Anthony J Smith. Accurate whole human genome sequencing using reversible terminator chemistry. Nature, 456(7218):53–9, Nov 2008.

[14] M F Berger, M S Lawrence, F Demichelis, Y Drier, K Cibulskis, A Y Sivachenko, A Sboner, R Esgueva, D Pflueger, C Sougnez, R Onofrio, S L Carter, K Park, L Habegger, L Ambrogio, T Fennell, M Parkin, G Saksena, D Voet, A H Ramos, T J Pugh, J Wilkinson, S Fisher, W Winckler, S Mahan, K Ardlie, J Baldwin, J W Simons, N Kitabayashi, T Y MacDonald, P W Kantoff, L Chin, S B Gabriel, M B Gerstein, T R Golub, M Meyerson, A Tewari, E S Lander, G Getz, M A Rubin, and L A Garraway. The genomic complexity of primary human prostate cancer. Nature, 470(7333):214–220, Feb 2011.

[15] M F Berger, J Z Levin, K Vijayendran, A Sivachenko, X Adiconis, J Maguire, L A Johnson, J Robinson, R G Verhaak, C Sougnez, R C Onofrio, L Ziaugra, K Cibulskis, E Laine, J Barretina, W Winckler, D E Fisher, G Getz, M Meyerson, D B Jaffe, S B Gabriel, E S Lander, R Dummer, A Gnirke, C Nusbaum, and L A Garraway. Integrative analysis of the melanoma transcriptome. Genome Res, 20(4):413–427, Apr 2010.

[16] G R Bignell, T Santarius, J C Pole, A P Butler, J Perry, E Pleasance, C Greenman, A Menzies, S Taylor, S Edkins, P Campbell, M Quail, B Plumb, L Matthews, K McLay, P A Edwards, J Rogers, R Wooster, P A Futreal, and M R Stratton. Architectures of somatic genomic rearrangement in human cancer amplicons at sequence-level resolution. Genome Res, 17(9):1296–1303, Sep 2007.

[17] Kaya Bilgüvar, Ali Kemal Öztürk, Angeliki Louvi, Kenneth Y Kwan, Murim Choi, Burak Tatlı, Dilek Yalnızoğlu, Beyhan Tüysüz, Ahmet Okay Çağlayan, Sarenur Gökben, et al. Whole-exome sequencing identifies recessive wdr62 mutations in severe brain malformations. Nature, 467(7312):207–210, 2010.

[18] Christopher M Bishop. Pattern recognition and machine learning. Springer, 2006.

[19] Paul C Boutros, Michael Fraser, Nicholas J Harding, Richard de Borja, Dominique Trudel, Emilie Lalonde, Alice Meng, Pablo H Hennings-Yeomans, Andrew McPherson, Veronica Y Sabelnykova, Amin Zia, Natalie S Fox, Julie Livingstone, Yu-Jia Shiah, Jianxin Wang, Timothy A Beck, Cherry L Have, Taryne Chong, Michelle Sam, Jeremy Johns, Lee Timms, Nicholas Buchner, Ada Wong, John D Watson, Trent T Simmons, Christine P’ng, Gaetano Zafarana, Francis Nguyen, Xuemei Luo, Kenneth C Chu, Stephenie D Prokopec, Jenna Sykes, Alan Dal Pra, Alejandro Berlin, Andrew Brown, Michelle A Chan-Seng-Yue, Fouad Yousif, Robert E Denroche, Lauren C Chong, Gregory M Chen, Esther Jung, Clement Fung, Maud H W Starmans, Hanbo Chen, Shaylan K Govind, James Hawley, Alister D’Costa, Melania Pintilie, Daryl Waggott, Faraz Hach, Philippe Lambin, Lakshmi B Muthuswamy, Colin Cooper, Rosalind Eeles, David Neal, Bernard Tetu, Cenk Sahinalp, Lincoln D Stein, Neil Fleshner, Sohrab P Shah, Colin C Collins, Thomas J Hudson, John D McPherson, Theodorus van der Kwast, and Robert G Bristow. Spatial genomic heterogeneity within localized, multifocal prostate cancer. Nat Genet, May 2015.

[20] Y S Brooks, G Wang, Z Yang, K K Smith, E Bieberich, and L Ko. Functional pre-mrna trans-splicing of coactivator coaa and corepressor rbm4 during stem/progenitor cell differentiation. J Biol Chem, 284(27):18033–18046, Jul 2009.

[21] J R Brown. Shortest alternating path algorithms. Networks, 4:311–334, 1974.

[22] Rebecca A Burrell, Sarah E McClelland, David Endesfelder, Petra Groth, Marie-Christine Weller, Nadeem Shaikh, Enric Domingo, Nnennaya Kanu, Sally M Dewhurst, Eva Gronroos, Su Kit Chew, Andrew J Rowan, Arne Schenk, Michal Sheffer, Michael Howell, Maik Kschischo, Axel Behrens, Thomas Helleday, Jiri Bartek, Ian P Tomlinson, and Charles Swanton. Replication stress links structural and numerical cancer chromosomal instability. Nature, 494(7438):492–6, Feb 2013.

[23] Rebecca A Burrell and Charles Swanton. Tumour heterogeneity and the evolution of polyclonal drug resistance. Mol Oncol, 8(6):1095–111, Sep 2014.

[24] Harold J Burstein. The distinctive nature of her2-positive breast cancers. N Engl J Med, 353(16):1652–4, Oct 2005.

[25] P J Campbell, S Yachida, L J Mudie, P J Stephens, E D Pleasance, L A Stebbings, L A Morsberger, C Latimer, S McLaren, M L Lin, D J McBride, I Varela, S A Nik-Zainal, C Leroy, M Jia, A Menzies, A P Butler, J W Teague, C A Griffin, J Burton, H Swerdlow, M A Quail, M R Stratton, C Iacobuzio-Donahue, and P A Futreal. The patterns and dynamics of genomic instability in metastatic pancreatic cancer. Nature, 467(7319):1109–1113, Oct 2010.

[26] Mauro Castellarin, Katy Milne, Thomas Zeng, Kane Tse, Michael Mayo, Yongjun Zhao, John R Webb, Peter H Watson, Brad H Nelson, and Robert A Holt. Clonal evolution of high-grade serous ovarian carcinoma from primary to recurrent disease. J Pathol, 229(4):515–24, Mar 2013.

[27] W K Cavenee, T P Dryja, R A Phillips, W F Benedict, R Godbout, B L Gallie, A L Murphree, L C Strong, and R L White. Expression of recessive alleles by chromosomal mechanisms in retinoblastoma. Nature, 305(5937):779–84, 1983.

[28] K Chen, J W Wallis, M D McLellan, D E Larson, J M Kalicki, C S Pohl, S D McGrath, M C Wendl, Q Zhang, D P Locke, X Shi, R S Fulton, T J Ley, R K Wilson, L Ding, and E R Mardis. BreakDancer: an algorithm for high-resolution mapping of genomic structural variation. Nat Methods, 6(9):677–681, Sep 2009.

[29] 1000 Genomes Project Consortium et al. A map of human genome variation from population-scale sequencing. Nature, 467(7319):1061–1073, 2010.

[30] 1000 Genomes Project Consortium et al. An integrated map of genetic variation from 1,092 human genomes. Nature, 491(7422):56–65, 2012.

[31] The UniProt Consortium. The Universal Protein Resource (UniProt) in 2010. Nucleic Acids Research, 38(suppl 1):D142–D148, 2010.

[32] S L Cooke, J Temple, S Macarthur, M A Zahra, L T Tan, R A F Crawford, C K Y Ng, M Jimenez-Linan, E Sala, and J D Brenton. Intra-tumour genetic heterogeneity and poor chemoradiotherapy response in cervical cancer. Br J Cancer, 104(2):361–8, Jan 2011.

[33] Colin S Cooper, Rosalind Eeles, David C Wedge, Peter Van Loo, Gunes Gundem, Ludmil B Alexandrov, Barbara Kremeyer, Adam Butler, Andrew G Lynch, Niedzica Camacho, Charlie E Massie, Jonathan Kay, Hayley J Luxton, Sandra Edwards, Zsofia Kote-Jarai, Nening Dennis, Sue Merson, Daniel Leongamornlert, Jorge Zamora, Cathy Corbishley, Sarah Thomas, Serena Nik-Zainal, Sarah O’Meara, Lucy Matthews, Jeremy Clark, Rachel Hurst, Richard Mithen, Robert G Bristow, Paul C Boutros, Michael Fraser, Susanna Cooke, Keiran Raine, David Jones, Andrew Menzies, Lucy Stebbings, Jon Hinton, Jon Teague, Stuart McLaren, Laura Mudie, Claire Hardy, Elizabeth Anderson, Olivia Joseph, Victoria Goody, Ben Robinson, Mark Maddison, Stephen Gamble, Christopher Greenman, Dan Berney, Steven Hazell, Naomi Livni, ICGC Prostate Group, Cyril Fisher, Christopher Ogden, Pardeep Kumar, Alan Thompson, Christopher Woodhouse, David Nicol, Erik Mayer, Tim Dudderidge, Nimish C Shah, Vincent Gnanapragasam, Thierry Voet, Peter Campbell, Andrew Futreal, Douglas Easton, Anne Y Warren, Christopher S Foster, Michael R Stratton, Hayley C Whitaker, Ultan McDermott, Daniel S Brewer, and David E Neal. Analysis of the genetic phylogeny of multifocal prostate cancer identifies multiple independent clonal expansions in neoplastic and morphologically normal prostate tissue. Nat Genet, 47(4):367–72, Apr 2015.

[34] G Q Daley, R A Van Etten, and D Baltimore. Induction of chronic myelogenous leukemia in mice by the p210bcr/abl gene of the philadelphia chromosome. Science, 247(4944):824–30, Feb 1990.

[35] R Dalla-Favera, M Bregni, J Erikson, D Patterson, R C Gallo, and C M Croce. Human c-myc onc gene is located on the region of chromosome 8 that is translocated in burkitt lymphoma cells. Proc Natl Acad Sci U S A, 79(24):7824–7, Dec 1982.

[36] Shaloam Dasari and Paul Bernard Tchounwou. Cisplatin in cancer therapy: molecular mechanisms of action. Eur J Pharmacol, 740:364–78, Oct 2014.

[37] S S Dave, K Fu, G W Wright, L T Lam, P Kluin, E J Boerma, T C Greiner, D D Weisenburger, A Rosenwald, G Ott, H K Müller-Hermelink, R D Gascoyne, J Delabie, L M Rimsza, R M Braziel, T M Grogan, E Campo, E S Jaffe, B J Dave, W Sanger, M Bast, J M Vose, J O Armitage, J M Connors, E B Smeland, S Kvaloy, H Holte, R I Fisher, T P Miller, E Montserrat, W H Wilson, M Bahl, H Zhao, L Yang, J Powell, R Simon, W C Chan, L M Staudt, and Lymphoma/Leukemia Molecular Profiling Project. Molecular diagnosis of burkitt’s lymphoma. N Engl J Med, 354(23):2431–2442, Jun 2006.

[38] Elza C de Bruin, Nicholas McGranahan, Richard Mitter, Max Salm, David C Wedge, Lucy Yates, Mariam Jamal-Hanjani, Seema Shafi, Nirupa Murugaesu, Andrew J Rowan, Eva Grönroos, Madiha A Muhammad, Stuart Horswell, Marco Gerlinger, Ignacio Varela, David Jones, John Marshall, Thierry Voet, Peter Van Loo, Doris M Rassl, Robert C Rintoul, Sam M Janes, Siow-Ming Lee, Martin Forster, Tanya Ahmad, David Lawrence, Mary Falzon, Arrigo Capitanio, Timothy T Harkins, Clarence C Lee, Warren Tom, Enock Teefe, Shann-Ching Chen, Sharmin Begum, Adam Rabinowitz, Benjamin Phillimore, Bradley Spencer-Dene, Gordon Stamp, Zoltan Szallasi, Nik Matthews, Aengus Stewart, Peter Campbell, and Charles Swanton. Spatial and temporal diversity in genomic instability processes defines lung cancer evolution. Science, 346(6206):251–6, Oct 2014.

[39] Olivier Delaneau, Jonathan Marchini, and Jean-François Zagury. A linear complexity phasing method for thousands of genomes. Nat Methods, 9(2):179–81, Feb 2012.

[40] Li Ding, Timothy J. Ley, David E. Larson, Christopher A. Miller, Daniel C. Koboldt, John S. Welch, Julie K. Ritchey, Margaret A. Young, Tamara Lamprecht, Michael D. McLellan, Joshua F. McMichael, John W. Wallis, Charles Lu, Dong Shen, Christopher C. Harris, David J. Dooling, Robert S. Fulton, Lucinda L. Fulton, Ken Chen, Heather Schmidt, Joelle Kalicki-Veizer, Vincent J. Magrini, Lisa Cook, Sean D. McGrath, Tammi L. Vickery, Michael C. Wendl, Sharon Heath, Mark A. Watson, Daniel C. Link, Michael H. Tomasson, William D. Shannon, Jacqueline E. Payton, Shashikant Kulkarni, Peter Westervelt, Matthew J. Walter, Timothy A. Graubert, Elaine R. Mardis, Richard K. Wilson, and John F. DiPersio. Clonal evolution in relapsed acute myeloid leukaemia revealed by whole-genome sequencing. Nature, 481(7382):506–510, 01 2012.

[41] Peter Eirew, Adi Steif, Jaswinder Khattra, Gavin Ha, Damian Yap, Hossein Farahani, Karen Gelmon, Stephen Chia, Colin Mar, Adrian Wan, Emma Laks, Justina Biele, Karey Shumansky, Jamie Rosner, Andrew McPherson, Cydney Nielsen, Andrew J L Roth, Calvin Lefebvre, Ali Bashashati, Camila de Souza, Celia Siu, Radhouane Aniba, Jazmine Brimhall, Arusha Oloumi, Tomo Osako, Alejandra Bruna, Jose L Sandoval, Teresa Algara, Wendy Greenwood, Kaston Leung, Hongwei Cheng, Hui Xue, Yuzhuo Wang, Dong Lin, Andrew J Mungall, Richard Moore, Yongjun Zhao, Julie Lorette, Long Nguyen, David Huntsman, Connie J Eaves, Carl Hansen, Marco A Marra, Carlos Caldas, Sohrab P Shah, and Samuel Aparicio. Dynamics of genomic clones in breast cancer patient xenografts at single-cell resolution. Nature, 518(7539):422–6, Feb 2015.

[42] Charles Elkan and Keith Noto. Learning classifiers from only positive and unlabeled data. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’08, pages 213–220, New York, NY, USA, 2008. ACM.

[43] J Erikson, A ar Rushdi, H L Drwinga, P C Nowell, and C M Croce. Transcriptional activation of the translocated c-myc oncogene in burkitt lymphoma. Proc Natl Acad Sci U S A, 80(3):820–4, Feb 1983.

[44] Hannah Farmer, Nuala McCabe, Christopher J Lord, Andrew N J Tutt, Damian A Johnson, Tobias B Richardson, Manuela Santarosa, Krystyna J Dillon, Ian Hickson, Charlotte Knights, Niall M B Martin, Stephen P Jackson, Graeme C M Smith, and Alan Ashworth. Targeting the dna repair defect in brca mutant cells as a therapeutic strategy. Nature, 434(7035):917–21, Apr 2005.

[45] J L Fernandez-Luna. Bcr-Abl and inhibition of apoptosis in chronic myelogenous leukemia cells. Apoptosis, 5(4):315–318, Oct 2000.

[46] Andrej Fischer, Ignacio Vázquez-García, Christopher J R Illingworth, and Ville Mustonen. High-definition reconstruction of clonal composition in cancer. Cell Rep, 7(5):1740–52, Jun 2014.

[47] Jonathan J Forster. Bayesian inference for poisson and multinomial log-linear models. Statistical Methodology, 7(3):210–224, 2010.

[48] Pedro Galante, Raphael Parmigiani, Qi Zhao, Otávia Caballero, Jorge de Souza, Fábio Navarro, Alexandra Gerber, Marisa Nicolás, Anna Salim, Ana Silva, Lee Edsall, Sylvie Devalle, Luiz Almeida, Zhen Ye, Samantha Kuan, Daniel Pinheiro, Israel Tojal, Renato Pedigoni, Rodrigo de Sousa, Thiago Oliveira, Marcelo de Paula, Lucila Ohno-Machado, Ewen Kirkness, Samuel Levy, Wilson da Silva, Ana Vasconcelos, Bing Ren, Marco Zago, Robert Strausberg, Andrew Simpson, Sandro de Souza, and Anamaria Camargo. Distinct patterns of somatic alterations in a lymphoblastoid and a tumor genome derived from the same individual. Nucleic Acids Research, 39(14):6056–6068, 2011.

[49] Marco Gerlinger, Stuart Horswell, James Larkin, Andrew J Rowan, Max P Salm, Ignacio Varela, Rosalie Fisher, Nicholas McGranahan, Nicholas Matthews, Claudio R Santos, Pierre Martinez, Benjamin Phillimore, Sharmin Begum, Adam Rabinowitz, Bradley Spencer-Dene, Sakshi Gulati, Paul A Bates, Gordon Stamp, Lisa Pickering, Martin Gore, David L Nicol, Steven Hazell, P Andrew Futreal, Aengus Stewart, and Charles Swanton. Genomic architecture and evolution of clear cell renal cell carcinomas defined by multiregion sequencing. Nat Genet, 46(3):225–33, Mar 2014.

[50] Marco Gerlinger, Andrew J Rowan, Stuart Horswell, James Larkin, David Endesfelder, Eva Gronroos, Pierre Martinez, Nicholas Matthews, Aengus Stewart, Patrick Tarpey, Ignacio Varela, Benjamin Phillimore, Sharmin Begum, Neil Q McDonald, Adam Butler, David Jones, Keiran Raine, Calli Latimer, Claudio R Santos, Mahrokh Nohadani, Aron C Eklund, Bradley Spencer-Dene, Graham Clark, Lisa Pickering, Gordon Stamp, Martin Gore, Zoltan Szallasi, Julian Downward, P Andrew Futreal, and Charles Swanton. Intratumor heterogeneity and branched evolution revealed by multiregion sequencing. N Engl J Med, 366(10):883–92, Mar 2012.

[51] Marco Gerlinger, Andrew J. Rowan, Stuart Horswell, James Larkin, David Endesfelder, Eva Gronroos, Pierre Martinez, Nicholas Matthews, Aengus Stewart, Patrick Tarpey, Ignacio Varela, Benjamin Phillimore, Sharmin Begum, Neil Q. McDonald, Adam Butler, David Jones, Keiran Raine, Calli Latimer, Claudio R. Santos, Mahrokh Nohadani, Aron C. Eklund, Bradley Spencer-Dene, Graham Clark, Lisa Pickering, Gordon Stamp, Martin Gore, Zoltan Szallasi, Julian Downward, P. Andrew Futreal, and Charles Swanton. Intratumor Heterogeneity and Branched Evolution Revealed by Multiregion Sequencing. N Engl J Med, 366(10):883–892, 2012.

[52] Osamu Gotoh. An improved algorithm for matching biological sequences. Journal of molecular biology, 162(3):705–708, 1982.

[53] C D Greenman, E D Pleasance, S Newman, F Yang, B Fu, S Nik-Zainal, D Jones, K W Lau, N Carter, P A Edwards, P A Futreal, M R Stratton, and P J Campbell. Estimation of rearrangement phylogeny for cancer genomes. Genome Res, Oct 2011.

[54] Malachi Griffith, Michelle Tang, Obi Griffith, Ryan Morin, Susanna Chan, Jennifer Asano, Thomas Zeng, Stephane Flibotte, Adrian Ally, Agnes Baross, Martin Hirst, Steven Jones, Gregg Morin, Isabella Tai, and Marco Marra. ALEXA: a microarray design platform for alternative expression analysis. Nature Methods, 5(2):118–118, 2008.

[55] Jay Gunawardana, Fong Chun Chan, Adèle Telenius, Bruce Woolcock, Robert Kridel, King L Tan, Susana Ben-Neriah, Anja Mottok, Raymond S Lim, Merrill Boyle, Sanja Rogic, Lisa M Rimsza, Chrystelle Guiter, Karen Leroy, Philippe Gaulard, Corinne Haioun, Marco A Marra, Kerry J Savage, Joseph M Connors, Sohrab P Shah, Randy D Gascoyne, and Christian Steidl. Recurrent somatic mutations of ptpn1 in primary mediastinal b cell lymphoma and hodgkin lymphoma. Nat Genet, 46(4):329–35, Apr 2014.

[56] Gunes Gundem, Peter Van Loo, Barbara Kremeyer, Ludmil B Alexandrov, Jose M C Tubio, Elli Papaemmanuil, Daniel S Brewer, Heini M L Kallio, Gunilla Högnäs, Matti Annala, Kati Kivinummi, Victoria Goody, Calli Latimer, Sarah O’Meara, Kevin J Dawson, William Isaacs, Michael R Emmert-Buck, Matti Nykter, Christopher Foster, Zsoﬁa Kote-Jarai, Douglas Easton, Hayley C Whitaker, ICGC Prostate UK Group, David E Neal, Colin S Cooper, Rosalind A Eeles, Tapio Visakorpi, Peter J Campbell, Ultan McDermott, David C Wedge, and G Steven Bova. The evolutionary history of lethal metastatic prostate cancer. Nature, 520(7547):353–7, Apr 2015.

[57] Gavin Ha, Andrew Roth, Jaswinder Khattra, Julie Ho, Damian Yap, Leah M Prentice, Nataliya Melnyk, Andrew McPherson, Ali Bashashati, Emma Laks, Justina Biele, Jiarui Ding, Alan Le, Jamie Rosner, Karey Shumansky, Marco A Marra, C Blake Gilks, David G Huntsman, Jessica N McAlpine, Samuel Aparicio, and Sohrab P Shah. Titan: inference of copy number architectures in clonal cell populations from tumor whole-genome sequence data. Genome Res, Jul 2014.

[58] F Hach, F Hormozdiari, C Alkan, F Hormozdiari, I Birol, E E Eichler, and S C Sahinalp. mrsfast: a cache-oblivious algorithm for short-read mapping. Nat Methods, 7(8):576–577, Aug 2010.

[59] I Hajirasouliha, F Hormozdiari, C Alkan, J M Kidd, I Birol, E E Eichler, and S C Sahinalp. Detection and characterization of novel sequence insertions using paired-end next-generation sequencing. Bioinformatics, 26(10):1277–1283, May 2010.

[60] David Haussler, Stephen J O’Brien, Oliver A Ryder, F Keith Barker, Michele Clamp, Andrew J Crawford, Robert Hanner, Olivier Hanotte, Warren E Johnson, Jimmy A McGuire, et al. Genome 10k: a proposal to obtain whole-genome sequence for 10 000 vertebrate species. Journal of Heredity, 100(6):659–674, 2009.

[61] Nils Homer and Stanley F Nelson. Improved variant discovery through local realignment of short-read next-generation sequencing data using srma. Genome Biol, 11(10):R99, 2010.

[62] G F Hong. A method for sequencing single-stranded cloned dna in both directions. Biosci Rep, 1(3):243–52, Mar 1981.

[63] Matthew K H Hong, Geoff Macintyre, David C Wedge, Peter Van Loo, Keval Patel, Sebastian Lunke, Ludmil B Alexandrov, Clare Sloggett, Marek Cmero, Francesco Marass, Dana Tsui, Stefano Mangiola, Andrew Lonie, Haroon Naeem, Nikhil Sapre, Pramit M Phal, Natalie Kurganovs, Xiaowen Chin, Michael Kerger, Anne Y Warren, David Neal, Vincent Gnanapragasam, Nitzan Rosenfeld, John S Pedersen, Andrew Ryan, Izhak Haviv, Anthony J Costello, Niall M Corcoran, and Christopher M Hovens. Tracking the origins and drivers of subclonal metastatic expansion in prostate cancer. Nat Commun, 6:6605, 2015.

[64] Marlous Hoogstraat, Mirjam S de Pagter, Geert A Cirkel, Markus J van Roosmalen, Timothy T Harkins, Karen Duran, Jennifer Kreeftmeijer, Ivo Renkens, Petronella O Witteveen, Clarence C Lee, Isaac J Nijman, Tanisha Guy, Ruben van ’t Slot, Trudy N Jonges, Martijn P Lolkema, Marco J Koudijs, Ronald P Zweemer, Emile E Voest, Edwin Cuppen, and Wigard P Kloosterman. Genomic and transcriptomic plasticity in treatment-naive ovarian cancer. Genome Res, 24(2):200–11, Feb 2014.

[65] F Hormozdiari, C Alkan, E E Eichler, and S C Sahinalp. Combinatorial algorithms for structural variation detection in high-throughput sequenced genomes. Genome Res, 19(7):1270–1278, Jul 2009.

[66] F Hormozdiari, I Hajirasouliha, P Dao, F Hach, D Yorukoglu, C Alkan, E E Eichler, and S C Sahinalp. Next-generation variationhunter: combinatorial algorithms for transposon insertion discovery. Bioinformatics, 26(12):350–357, Jun 2010.

[67] F Hormozdiari, I Hajirasouliha, A McPherson, E E Eichler, and S C Sahinalp. Simultaneous structural variation discovery in multiple paired-end sequenced genomes. In RECOMB 2011, Vancouver, Canada, March 28-31 2011. RECOMB.

[68] Fereydoun Hormozdiari, Iman Hajirasouliha, Andrew McPherson, Evan E Eichler, and S Cenk Sahinalp. Simultaneous structural variation discovery among multiple paired-end sequenced genomes. Genome Res, 21(12):2203–12, Dec 2011.

[69] J Houseley and D Tollervey. Apparent non-canonical trans-splicing is generated by reverse transcriptase in vitro. PLoS One, 5(8), 2010.

[70] Y Hu, K Wang, X He, D Y Chiang, J F Prins, and J Liu. A Probabilistic Framework for Aligning Paired-end RNA-seq Data. Bioinformatics, Jun 2010.

[71] Thomas J Hudson, Warwick Anderson, Axel Aretz, Anna D Barker, Cindy Bell, Rosa R Bernabé, MK Bhan, Fabien Calvo, Iiro Eerola, Daniela S Gerhard, et al. International network of cancer genome projects. Nature, 464(7291):993–998, 2010.

[72] David T W Jones, Barbara Hutter, Natalie Jäger, Andrey Korshunov, Marcel Kool, Hans-Jörg Warnatz, Thomas Zichner, Sally R Lambert, Marina Ryzhova, Dong Anh Khuong Quang, Adam M Fontebasso, Adrian M Stütz, Sonja Hutter, Marc Zuckermann, Dominik Sturm, Jan Gronych, Bärbel Lasitschka, Sabine Schmidt, Huriye Seker-Cin, Hendrik Witt, Marc Sultan, Meryem Ralser, Paul A Northcott, Volker Hovestadt, Sebastian Bender, Elke Pfaff, Sebastian Stark, Damien Faury, Jeremy Schwartzentruber, Jacek Majewski, Ursula D Weber, Marc Zapatka, Benjamin Raeder, Matthias Schlesner, Catherine L Worth, Cynthia C Bartholomae, Christof von Kalle, Charles D Imbusch, Sylwester Radomski, Chris Lawerenz, Peter van Sluis, Jan Koster, Richard Volckmann, Rogier Versteeg, Hans Lehrach, Camelia Monoranu, Beate Winkler, Andreas Unterberg, Christel Herold-Mende, Till Milde, Andreas E Kulozik, Martin Ebinger, Martin U Schuhmann, Yoon-Jae Cho, Scott L Pomeroy, Andreas von Deimling, Olaf Witt, Michael D Taylor, Stephan Wolf, Matthias A Karajannis, Charles G Eberhart, Wolfram Scheurlen, Martin Hasselblatt, Keith L Ligon, Mark W Kieran, Jan O Korbel, Marie-Laure Yaspo, Benedikt Brors, Jörg Felsberg, Guido Reifenberger, V Peter Collins, Nada Jabado, Roland Eils, Peter Lichter, Stefan M Pfister, and International Cancer Genome Consortium PedBrain Tumor Project. Recurrent somatic alterations of fgfr1 and ntrk2 in pilocytic astrocytoma. Nat Genet, 45(8):927–32, Aug 2013.

[73] M Kato, S Khan, N Gonzalez, B P O’Neill, K J McDonald, B J Cooper, N Z Angel, and D N Hart. Hodgkin’s lymphoma cell lines express a fusion protein encoded by intergenically spliced mrna for the multilectin receptor dec-205 (cd205) and a novel c-type lectin receptor dcl-1. J Biol Chem, 278(36):34035–34041, Sep 2003.

[74] W James Kent. BLAT–the BLAST-like alignment tool. Genome Res, 12(4):656–64, Apr 2002.

[75] A G Knudson, Jr. Mutation and cancer: statistical study of retinoblastoma. Proc Natl Acad Sci U S A, 68(4):820–3, Apr 1971.

[76] Vladimir Kolmogorov. Blossom v: a new implementation of a minimum cost perfect matching algorithm. Mathematical Programming Computation, 1(1):43–67, 2009.

[77] J O Korbel, A E Urban, J P Affourtit, B Godwin, F Grubert, J F Simons, P M Kim, D Palejev, N J Carriero, L Du, B E Taillon, Z Chen, A Tanzer, A C Saunders, J Chi, F Yang, N P Carter, M E Hurles, S M Weissman, T T Harkins, M B Gerstein, M Egholm, and M Snyder. Paired-end mapping reveals extensive structural variation in the human genome. Science, 318(5849):420–426, Oct 2007.

[78] M Ladanyi, M Y Lui, C R Antonescu, A Krause-Boehm, A Meindl, P Argani, J H Healey, T Ueda, H Yoshikawa, A Meloni-Ehrig, P H Sorensen, F Mertens, N Mandahl, H van den Berghe, R Sciot, P Dal Cin, and J Bridge. The der(17)t(X;17)(p11;q25) of human alveolar soft part sarcoma fuses the TFE3 transcription factor gene to ASPL, a novel gene at 17q25. Oncogene, 20(1):48–57, Jan 2001.

[79] E S Lander, L M Linton, B Birren, C Nusbaum, M C Zody, J Baldwin, K Devon, K Dewar, M Doyle, W FitzHugh, R Funke, D Gage, K Harris, A Heaford, J Howland, L Kann, J Lehoczky, R LeVine, P McEwan, K McKernan, J Meldrim, J P Mesirov, C Miranda, W Morris, J Naylor, C Raymond, M Rosetti, R Santos, A Sheridan, C Sougnez, N Stange-Thomann, N Stojanovic, A Subramanian, D Wyman, J Rogers, J Sulston, R Ainscough, S Beck, D Bentley, J Burton, C Clee, N Carter, A Coulson, R Deadman, P Deloukas, A Dunham, I Dunham, R Durbin, L French, D Grafham, S Gregory, T Hubbard, S Humphray, A Hunt, M Jones, C Lloyd, A McMurray, L Matthews, S Mercer, S Milne, J C Mullikin, A Mungall, R Plumb, M Ross, R Shownkeen, S Sims, R H Waterston, R K Wilson, L W Hillier, J D McPherson, M A Marra, E R Mardis, L A Fulton, A T Chinwalla, K H Pepin, W R Gish, S L Chissoe, M C Wendl, K D Delehaunty, T L Miner, A Delehaunty, J B Kramer, L L Cook, R S Fulton, D L Johnson, P J Minx, S W Clifton, T Hawkins, E Branscomb, P Predki, P Richardson, S Wenning, T Slezak, N Doggett, J F Cheng, A Olsen, S Lucas, C Elkin, E Uberbacher, M Frazier, R A Gibbs, D M Muzny, S E Scherer, J B Bouck, E J Sodergren, K C Worley, C M Rives, J H Gorrell, M L Metzker, S L Naylor, R S Kucherlapati, D L Nelson, G M Weinstock, Y Sakaki, A Fujiyama, M Hattori, T Yada, A Toyoda, T Itoh, C Kawagoe, H Watanabe, Y Totoki, T Taylor, J Weissenbach, R Heilig, W Saurin, F Artiguenave, P Brottier, T Bruls, E Pelletier, C Robert, P Wincker, D R Smith, L Doucette-Stamm, M Rubenfield, K Weinstock, H M Lee, J Dubois, A Rosenthal, M Platzer, G Nyakatura, S Taudien, A Rump, H Yang, J Yu, J Wang, G Huang, J Gu, L Hood, L Rowen, A Madan, S Qin, R W Davis, N A Federspiel, A P Abola, M J Proctor, R M Myers, J Schmutz, M Dickson, J Grimwood, D R Cox, M V Olson, R Kaul, C Raymond, N Shimizu, K Kawasaki, S Minoshima, G A Evans, M Athanasiou, R Schultz, B A Roe, F Chen, H Pan, J Ramser, H Lehrach, R Reinhardt, W R McCombie, M de la Bastide, N Dedhia, H Blöcker, K Hornischer, G Nordsiek, R Agarwala, L Aravind, J A Bailey, A Bateman, S Batzoglou, E Birney, P Bork, D G Brown, C B Burge, L Cerutti, H C Chen, D Church, M Clamp, R R Copley, T Doerks, S R Eddy, E E Eichler, T S Furey, J Galagan, J G Gilbert, C Harmon, Y Hayashizaki, D Haussler, H Hermjakob, K Hokamp, W Jang, L S Johnson, T A Jones, S Kasif, A Kaspryzk, S Kennedy, W J Kent, P Kitts, E V Koonin, I Korf, D Kulp, D Lancet, T M Lowe, A McLysaght, T Mikkelsen, J V Moran, N Mulder, V J Pollara, C P Ponting, G Schuler, J Schultz, G Slater, A F Smit, E Stupka, J Szustakowski, D Thierry-Mieg, J Thierry-Mieg, L Wagner, J Wallis, R Wheeler, A Williams, Y I Wolf, K H Wolfe, S P Yang, R F Yeh, F Collins, M S Guyer, J Peterson, A Felsenfeld, K A Wetterstrand, A Patrinos, M J Morgan, P de Jong, J J Catanese, K Osoegawa, H Shizuya, S Choi, Y J Chen, J Szustakowki, and International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature, 409(6822):860–921, Feb 2001.

[80] Ben Langmead. Aligning short sequencing reads with bowtie. Curr Protoc Bioinformatics, Chapter 11:Unit 11.7, Dec 2010.

[81] Ben Langmead and Steven Salzberg. Fast gapped-read alignment with bowtie 2. Nat Meth, 9(4):357–359, 2012.

[82] Ben Langmead, Cole Trapnell, Mihai Pop, and Steven L Salzberg. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol, 10(3):R25, 2009.

[83] Cheng-Han Lee, Wen-Bin Ou, Adrian Mariño-Enriquez, Meijun Zhu, Mark Mayeda, Yuexiang Wang, Xiangqian Guo, Alayne Brunner, Frédéric Amant, Christopher French, Robert West, Jessica McAlpine, Blake Gilks, Michael Yaffe, Leah Prentice, Andrew McPherson, Steven Jones, Marco Marra, Sohrab Shah, Matt van de Rijn, David Huntsman, Paola Dal Cin, Maria Debiec-Rychter, Marisa Nucci, and Jonathan Fletcher. 14-3-3 fusion oncogenes in high-grade endometrial stromal sarcoma. Proceedings of the National Academy of Sciences of the United States of America, 109(3):929–934, 2012.

[84] S Lee, F Hormozdiari, C Alkan, and M Brudno. Modil: detecting small indels from clone-end sequencing with mixtures of distributions. Nat Methods, 6(7):473–474, Jul 2009.

[85] H Li, J Wang, G Mor, and J Sklar. A neoplastic gene fusion mimics trans-splicing of RNAs in normal human cells. Science, 321(5894):1357–1361, Sep 2008.

[86] Heng Li and Richard Durbin. Fast and accurate short read alignment with burrows-wheeler transform. Bioinformatics, 25(14):1754–60, Jul 2009.

[87] Henrik Lilljebjörn, Helena Ågerstam, Christina Orsmark-Pietras, Marianne Rissler, Hans Ehrencrona, L Nilsson, Johan Richter, and Thoas Fioretos. Rna-seq identifies clinically relevant fusion genes in leukemia including a novel mef2d/csf1r fusion responsive to imatinib. Leukemia, 2013.

[88] C. Lin, L. Yang, B. Tanasa, K. Hutt, B. Ju, K.A. Ohgi, J. Zhang, D.W. Rose, X.D. Fu, C.K. Glass, et al. Nuclear receptor-induced chromosomal proximity and dna breaks underlie specific translocations in cancer. Cell, 139(6):1069–1083, 2009.

[89] T G Lugo, A M Pendergast, A J Muller, and O N Witte. Tyrosine kinase activity and transformation potency of bcr-abl oncogene products. Science, 247(4946):1079–82, Mar 1990.

[90] C A Maher, C Kumar-Sinha, X Cao, S Kalyana-Sundaram, B Han, X Jing, L Sam, T Barrette, N Palanisamy, and A M Chinnaiyan. Transcriptome sequencing to detect gene fusions in cancer. Nature, 458(7234):97–101, Mar 2009.

[91] C A Maher, N Palanisamy, J C Brenner, X Cao, S Kalyana-Sundaram, S Luo, I Khreb- tukova, T R Barrette, C Grasso, J Yu, R J Lonigro, G Schroth, C Kumar-Sinha, and A M Chinnaiyan. Chimeric transcript discovery by paired-end transcriptome sequencing. Proc Natl Acad Sci U S A, 106(30):12353–12358, Jul 2009.

[92] Shyamala Maheswaran, Lecia V Sequist, Sunitha Nagrath, Lindsey Ulkus, Brian Brannigan, Chey V Collura, Elizabeth Inserra, Sven Diederichs, A John Iafrate, Daphne W Bell, Subba Digumarthy, Alona Muzikansky, Daniel Irimia, Jeffrey Settleman, Ronald G Tompkins, Thomas J Lynch, Mehmet Toner, and Daniel A Haber. Detection of mutations in egfr in circulating lung-cancer cells. N Engl J Med, 359(4):366–77, Jul 2008.

[93] Ahmad Mahmoody, Crystal L Kahn, and Benjamin J Raphael. Reconstructing genome mixtures from partial adjacencies. BMC Bioinformatics, 13 Suppl 19:S9, 2012.

[94] Ian J Majewski, Lorenza Mittempergher, Nadia M Davidson, Astrid Bosma, Stefan M Willems, Hugo M Horlings, Iris de Rink, Liliana Greger, Gerrit K J Hooijer, Dennis Peters, Petra M Nederlof, Ingrid Hofland, Jeroen de Jong, Jelle Wesseling, Roelof J C Kluin, Wim Brugman, Ron Kerkhoven, Frank Nieboer, Paul Roepman, Annegien Broeks, Thomas R Muley, Jacek Jassem, Jacek Niklinski, Nico van Zandwijk, Alvis Brazma, Alicia Oshlack, Michel van den Heuvel, and René Bernards. Identification of recurrent fgfr3 fusion genes in lung cancer through kinome-centred rna sequencing. J Pathol, 230(3):270–6, Jul 2013.

[95] Marcel Margulies, Michael Egholm, William E Altman, Said Attiya, Joel S Bader, Lisa A Bemben, Jan Berka, Michael S Braverman, Yi-Ju Chen, Zhoutao Chen, et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature, 437(7057):376–380, 2005.

[96] Riccardo Masetti, Martina Pigazzi, Marco Togni, Annalisa Astolfi, Valentina Indio, Elena Manara, Rita Casadio, Andrea Pession, Giuseppe Basso, and Franco Locatelli. Cbfa2t3-glis2 fusion transcript is a novel common feature in pediatric, cytogenetically normal aml, not restricted to fab m7 subtype. Blood, 121(17):3469–72, Apr 2013.

[97] J N McAlpine, K C Wiegand, R Vang, B M Ronnett, A Adamiak, M Köbel, S E Kalloger, K D Swenerton, D G Huntsman, C B Gilks, and D M Miller. HER2 overexpression and amplification is present in a subset of ovarian mucinous carcinomas and can be targeted with trastuzumab therapy. BMC Cancer, 9:433–433, 2009.

[98] Nuala McCabe, Nicholas C Turner, Christopher J Lord, Katarzyna Kluzek, Aneta Bialkowska, Sally Swift, Sabrina Giavara, Mark J O’Connor, Andrew N Tutt, Małgorzata Z Zdzienicka, Graeme C M Smith, and Alan Ashworth. Deficiency in the repair of dna damage by homologous recombination and sensitivity to poly(adp-ribose) polymerase inhibition. Cancer Res, 66(16):8109–15, Aug 2006.

[99] B McClintock. The stability of broken ends of chromosomes in zea mays. Genetics, 26(2):234–82, Mar 1941.

[100] Kevin Judd McKernan, Heather E Peckham, Gina L Costa, Stephen F McLaughlin, Yutao Fu, Eric F Tsung, Christopher R Clouser, Cisyla Duncan, Jeffrey K Ichikawa, Clarence C Lee, et al. Sequence and structural variation in a human genome uncovered by short-read, massively parallel ligation sequencing using two-base encoding. Genome research, 19(9):1527–1541, 2009.

[101] Megan E McNerney, Christopher D Brown, Xiaoyue Wang, Elizabeth T Bartom, Subhradip Karmakar, Chaitanya Bandlamudi, Shan Yu, Jinkyung Ko, Barry P Sandall, Thomas Stricker, John Anastasi, Robert L Grossman, John M Cunningham, Michelle M Le Beau, and Kevin P White. Cux1 is a haploinsufficient tumor suppressor gene on chromosome 7 frequently inactivated in acute myeloid leukemia. Blood, 121(6):975–83, Feb 2013.

[102] Andrew McPherson, Fereydoun Hormozdiari, Abdalnasser Zayed, Ryan Giuliany, Gavin Ha, Mark Sun, Malachi Griffith, Alireza Heravi Moussavi, Janine Senz, Nataliya Melnyk, Marina Pacheco, Marco Marra, Martin Hirst, Torsten Nielsen, Cenk Sahinalp, David Huntsman, and Sohrab Shah. defuse: An algorithm for gene fusion discovery in tumor rna-seq data. PLoS Comput Biol, 7(5):e1001138, 2011.

[103] Andrew McPherson, Andrew Roth, Cedric Chauve, and S Cenk Sahinalp. Joint inference of genome structure and content in heterogeneous tumor samples. In Research in Computational Molecular Biology, pages 256–258. Springer, 2015.

[104] Andrew McPherson, Chunxiao Wu, Iman Hajirasouliha, Fereydoun Hormozdiari, Faraz Hach, Anna Lapuk, Stanislav Volik, Sohrab Shah, Colin Collins, and Cenk Sahinalp. Comrad: detection of expressed rearrangements by integrated analysis of rna-seq and low coverage genome sequence data. Bioinformatics, 27(11):1481–1488, 2011.

[105] Andrew McPherson, Chunxiao Wu, Alexander W Wyatt, Sohrab Shah, Colin Collins, and S Cenk Sahinalp. nfuse: discovery of complex genomic rearrangements in cancer using high-throughput sequencing. Genome Res, 22(11):2250–61, Nov 2012.

[106] Paul Medvedev and Michael Brudno. Maximum likelihood genome assembly. J Comput Biol, 16(8):1101–16, Aug 2009.

[107] George Michailides, Kjell Johnson, and Mark Culp. ada: An R Package for Stochastic Boosting. Journal of Statistical Software, 17(i02), undated.

[108] F Mitelman, B Johansson, and F Mertens. The impact of translocations and gene fusions on cancer causation. Nat Rev Cancer, 7(4):233–245, Apr 2007.

[109] Zahi Mitri, Tina Constantine, and Ruth O’Regan. The her2 receptor in breast cancer: Pathophysiology, clinical use, and new advances in therapy. Chemother Res Pract, 2012:743193, 2012.

[110] P Modena, E Lualdi, F Facchinetti, L Galli, M R Teixeira, S Pilotti, and G Sozzi. SMARCB1/INI1 tumor suppressor gene is frequently inactivated in epithelioid sarcomas. Cancer Res, 65(10):4012–4019, May 2005.

[111] Edmund A Mroz, Aaron M Tward, Rebecca J Hammon, Yin Ren, and James W Rocco. Intra-tumor genetic heterogeneity and mortality in head and neck cancer: analysis of data from the cancer genome atlas. PLoS Med, 12(2):e1001786, Feb 2015.

[112] M. Nambiar and S.C. Raghavan. How does dna break during chromosomal translocations? Nucleic Acids Research, 39(14):5813–5825, 2011.

[113] Torfinn Nome, Andreas M Hoff, Anne Cathrine Bakken, Torleiv O Rognum, Arild Nesbakken, and Rolf I Skotheim. High frequency of fusion transcripts involving tcf7l2 in colorectal cancer: novel fusion partner and splice variants. PLoS One, 9(3):e91264, 2014.

[114] Torfinn Nome, Gard Os Thomassen, Jarle Bruun, Terje Ahlquist, Anne C Bakken, Andreas M Hoff, Torleiv Rognum, Arild Nesbakken, Susanne Lorenz, Jinchang Sun, João Diogo Barros-Silva, Guro E Lind, Ola Myklebost, Manuel R Teixeira, Leonardo A Meza-Zepeda, Ragnhild A Lothe, and Rolf I Skotheim. Common fusion transcripts identified in colorectal cancer cell lines by high-throughput rna sequencing. Transl Oncol, 6(5):546–53, 2013.

[115] P Nowell and D Hungerford. A minute chromosome in chronic granulocytic leukemia. Science, 132(3438):1488–1501, November 1960.

[116] Layla Oesper, Ahmad Mahmoody, and Benjamin J Raphael. Theta: inferring intratumor heterogeneity from high-throughput dna sequencing data. Genome Biol, 14(7):R80, 2013.

[117] Layla Oesper, Anna Ritz, Sarah J Aerni, Ryan Drebin, and Benjamin J Raphael. Reconstructing cancer genomes from paired-end sequencing data. BMC Bioinformatics, 13 Suppl 6:S10, 2012.

[118] Layla Oesper, Gryte Satas, and Benjamin J Raphael. Quantifying tumor heterogeneity in whole-genome and whole-exome sequencing data. Bioinformatics, Oct 2014.

[119] Brian J O’Roak, Pelagia Deriziotis, Choli Lee, Laura Vives, Jerrod J Schwartz, Santhosh Girirajan, Emre Karakoc, Alexandra P MacKenzie, Sarah B Ng, Carl Baker, et al. Exome sequencing in sporadic autism spectrum disorders identifies severe de novo mutations. Nature genetics, 43(6):585–589, 2011.

[120] M Ozery-Flato and R Shamir. Sorting cancer karyotypes by elementary operations. J Comput Biol, 16(10):1445–1460, Oct 2009.

[121] Teresa Palomero, Lucile Couronné, Hossein Khiabanian, Mi-Yeon Kim, Alberto Ambesi-Impiombato, Arianne Perez-Garcia, Zachary Carpenter, Francesco Abate, Maddalena Allegretta, J Erika Haydu, Xiaoyu Jiang, Izidore S Lossos, Concha Nicolas, Milagros Balbin, Christian Bastard, Govind Bhagat, Miguel A Piris, Elias Campo, Olivier A Bernard, Raul Rabadan, and Adolfo A Ferrando. Recurrent mutations in epigenetic regulators, rhoa and fyn kinase in peripheral t cell lymphomas. Nat Genet, 46(2):166–70, Feb 2014.

[122] G Parra, A Reymond, N Dabbouseh, E T Dermitzakis, R Castelo, T M Thomson, S E Antonarakis, and R Guigó. Tandem chimerism as a means to increase protein complexity in the human genome. Genome Res, 16(1):37–44, Jan 2006.

[123] L Pelletier, S Rebouissou, A Paris, E Rathahao-Paris, E Perdu, P Bioulac-Sage, S Imbeaud, and J Zucman-Rossi. Loss of hepatocyte nuclear factor 1alpha function in human hepatocellular adenomas leads to aberrant activation of signaling pathways involved in tumorigenesis. Hepatology, 51(2):557–566, Feb 2010.

[124] Pavel Pevzner. Computational Molecular Biology: An Algorithmic Approach (Computational Molecular Biology). The MIT Press, 2000.

[125] D Pflueger, S Terry, A Sboner, L Habegger, R Esgueva, P C Lin, M A Svensson, N Kitabayashi, B J Moss, T Y MacDonald, X Cao, T Barrette, A K Tewari, M S Chee, A M Chinnaiyan, D S Rickman, F Demichelis, M B Gerstein, and M A Rubin. Discovery of non-ets gene fusions in human prostate cancer using next-generation rna sequencing. Genome Res, 21(1):56–67, Jan 2011.

[126] E D Pleasance, P J Stephens, S O’Meara, D J McBride, A Meynert, D Jones, M L Lin, D Beare, K W Lau, C Greenman, I Varela, S Nik-Zainal, H R Davies, G R Ordoñez, L J Mudie, C Latimer, S Edkins, L Stebbings, L Chen, M Jia, C Leroy, J Marshall, A Menzies, A Butler, J W Teague, J Mangion, Y A Sun, S F McLaughlin, H E Peckham, E F Tsung, G L Costa, C C Lee, J D Minna, A Gazdar, E Birney, M D Rhodes, K J McKernan, M R Stratton, P A Futreal, and P J Campbell. A small-cell lung cancer genome with complex signatures of tobacco exposure. Nature, 463(7278):184–190, Jan 2010.

[127] Joan U. Pontius, Lukas Wagner, and Gregory D. Schuler. UniGene: A Unified View of the Transcriptome. The NCBI Handbook, 2002.

[128] B J Raphael and P A Pevzner. Reconstructing tumor amplisomes. Bioinformatics, 20 Suppl 1:265–273, Aug 2004.

[129] B J Raphael, S Volik, C Collins, and P A Pevzner. Reconstructing tumor genome architectures. Bioinformatics, 19 Suppl 2:162–171, Oct 2003.

[130] Tobias Rausch, Thomas Zichner, Andreas Schlattl, Adrian M Stütz, Vladimir Benes, and Jan O Korbel. DELLY: structural variant discovery by integrated paired-end and split-read analysis. Bioinformatics (Oxford, England), 28(18):i333–i339, September 2012.

[131] B Rhead, D Karolchik, R M Kuhn, A S Hinrichs, A S Zweig, P A Fujita, M Diekhans, K E Smith, K R Rosenbloom, B J Raney, A Pohl, M Pheasant, L R Meyer, K Learned, F Hsu, J Hillman-Jackson, R A Harte, B Giardine, T R Dreszer, H Clawson, G P Barber, D Haussler, and W J Kent. The UCSC Genome Browser database: update 2010. Nucleic Acids Res, 38(Database issue):613–619, Jan 2010.

[132] D S Rickman, D Pﬂueger, B Moss, V E VanDoren, C X Chen, A de la Taille, R Kuefer, A K Tewari, S R Setlur, F Demichelis, and M A Rubin. Slc45a3-elk4 is a novel and frequent erythroblast transformation-speciﬁc fusion transcript in prostate cancer. Cancer Res, 69(7):2734–2738, Apr 2009.

[133] Kathryn G Roberts, Ryan D Morin, Jinghui Zhang, Martin Hirst, Yongjun Zhao, Xiaoping Su, Shann-Ching Chen, Debbie Payne-Turner, Michelle L Churchman, Richard C Harvey, Xiang Chen, Corynn Kasap, Chunhua Yan, Jared Becksfort, Richard P Finney, David T Teachey, Shannon L Maude, Kane Tse, Richard Moore, Steven Jones, Karen Mungall, Inanc Birol, Michael N Edmonson, Ying Hu, Kenneth E Buetow, I-Ming Chen, William L Carroll, Lei Wei, Jing Ma, Maria Kleppe, Ross L Levine, Guillermo Garcia-Manero, Eric Larsen, Neil P Shah, Meenakshi Devidas, Gregory Reaman, Malcolm Smith, Steven W Paugh, William E Evans, Stephan A Grupp, Sima Jeha, Ching-Hon Pui, Daniela S Gerhard, James R Downing, Cheryl L Willman, Mignon Loh, Stephen P Hunger, Marco A Marra, and Charles G Mullighan. Genetic alterations activating kinase and cytokine receptor signaling in high-risk acute lymphoblastic leukemia. Cancer Cell, 22(2):153–66, Aug 2012.

[134] G Robertson, J Schein, R Chiu, R Corbett, M Field, S D Jackman, K Mungall, S Lee, H M Okada, J Q Qian, M Griﬃth, A Raymond, N Thiessen, T Cezard, Y S Butterﬁeld, R Newsome, S K Chan, R She, R Varhol, B Kamoh, A L Prabhu, A Tam, Y Zhao, R A Moore, M Hirst, M A Marra, S J Jones, P A Hoodless, and I Birol. De novo assembly and analysis of RNA-seq data. Nat Methods, 7(11):909–912, Nov 2010.

[135] James T Robinson, Helga Thorvaldsdóttir, Wendy Winckler, Mitchell Guttman, Eric S Lander, Gad Getz, and Jill P Mesirov. Integrative genomics viewer. Nat Biotechnol, 29(1):24–6, Jan 2011.

[136] B Rosenberg, L VanCamp, J E Trosko, and V H Mansour. Platinum compounds: a new class of potent antitumour agents. Nature, 222(5191):385–6, Apr 1969.

[137] J D Rowley. Letter: A new consistent chromosomal abnormality in chronic myelogenous leukaemia identiﬁed by quinacrine ﬂuorescence and giemsa staining. Nature, 243(5405):290–293, Jun 1973.

[138] A Sboner, L Habegger, D Pﬂueger, S Terry, D Z Chen, J S Rozowsky, A K Tewari, N Kitabayashi, B J Moss, M S Chee, F Demichelis, M A Rubin, and M B Gerstein. FusionSeq: a modular framework for ﬁnding gene fusions by analyzing paired-end RNA-sequencing data. Genome Biol, 11(10), Oct 2010.

[139] K L Schaefer, K Brachwitz, Y Braun, R Diallo, D H Wai, S Zahn, D T Schneider, C Kuhnen, A Vollmann, G Brockhoﬀ, H E Gabbert, and C Poremba. Constitutive activation of neuregulin/ERBB3 signaling pathway in clear cell sarcoma of soft tissue. Neoplasia, 8(7):613–622, Jul 2006.

[140] Michael C Schatz, Arthur L Delcher, and Steven L Salzberg. Assembly of large genomes using second-generation sequencing. Genome Res, 20(9):1165–73, Sep 2010.

[141] David W Scott, Karen L Mungall, Susana Ben-Neriah, Sanja Rogic, Ryan D Morin, Graham W Slack, King L Tan, Fong Chun Chan, Raymond S Lim, Joseph M Connors, Marco A Marra, Andrew J Mungall, Christian Steidl, and Randy D Gascoyne. Tbl1xr1/tp63: a novel recurrent gene fusion in b-cell non-hodgkin lymphoma. Blood, 119(21):4949–52, May 2012.

[142] S P Shah, M Köbel, J Senz, R D Morin, B A Clarke, K C Wiegand, G Leung, A Zayed, E Mehl, S E Kalloger, M Sun, R Giuliany, E Yorida, S Jones, R Varhol, K D Swenerton, D Miller, P B Clement, C Crane, J Madore, D Provencher, P Leung, A DeFazio, J Khattra, G Turashvili, Y Zhao, T Zeng, J N Glover, B Vanderhyden, C Zhao, C A Parkinson, M Jimenez-Linan, D D Bowtell, A M Mes-Masson, J D Brenton, S A Aparicio, N Boyd, M Hirst, C B Gilks, M Marra, and D G Huntsman. Mutation of FOXL2 in granulosa-cell tumors of the ovary. N Engl J Med, 360(26):2719–2729, Jun 2009.

[143] Sohrab Shah, Andrew Roth, Rodrigo Goya, Arusha Oloumi, Gavin Ha, Yongjun Zhao, Gulisa Turashvili, Jiarui Ding, Kane Tse, Gholamreza Haffari, Ali Bashashati, Leah Prentice, Jaswinder Khattra, Angela Burleigh, Damian Yap, Virginie Bernard, Andrew McPherson, Karey Shumansky, Anamaria Crisan, Ryan Giuliany, Alireza Heravi-Moussavi, Jamie Rosner, Daniel Lai, Inanc Birol, Richard Varhol, Angela Tam, Noreen Dhalla, Thomas Zeng, Kevin Ma, Simon Chan, Malachi Griffith, Annie Moradian, Grace Cheng, Gregg Morin, Peter Watson, Karen Gelmon, Stephen Chia, Suet-Feung Chin, Christina Curtis, Oscar Rueda, Paul Pharoah, Sambasivarao Damaraju, John Mackey, Kelly Hoon, Timothy Harkins, Vasisht Tadigotla, Mahvash Sigaroudinia, Philippe Gascard, Thea Tlsty, Joseph Costello, Irmtraud Meyer, Connie Eaves, Wyeth Wasserman, Steven Jones, David Huntsman, Martin Hirst, Carlos Caldas, Marco Marra, and Samuel Aparicio. The clonal and mutational evolution spectrum of primary triple-negative breast cancers. Nature, advance online publication, 2012.

[144] S.P. Shah, R.D. Morin, J. Khattra, L. Prentice, T. Pugh, A. Burleigh, A. Delaney, K. Gelmon, R. Guliany, J. Senz, et al. Mutational evolution in a lobular breast tumour profiled at single nucleotide resolution. Nature, 461(7265):809–813, 2009.

[145] Q Sheng, X Liu, E Fleming, K Yuan, H Piao, J Chen, Z Moustafa, R K Thomas, H Greulich, A Schinzel, S Zaghlul, D Batt, S Ettenberg, M Meyerson, B Schoeberl, A L Kung, W C Hahn, R Drapkin, D M Livingston, and J F Liu. An activated ErbB3/NRG1 autocrine loop supports in vivo proliferation in ovarian cancer cells. Cancer Cell, 17(3):298–310, Mar 2010.

[146] S K Sidhar, J Clark, S Gill, R Hamoudi, A J Crew, R Gwilliam, M Ross, W M Linehan, S Birdsall, J Shipley, and C S Cooper. The t(X;1)(p11.2;q21.2) translocation in papillary renal cell carcinoma fuses a novel gene PRCC to the TFE3 transcription factor gene. Hum Mol Genet, 5(9):1333–1338, Sep 1996.

[147] Suzanne S Sindi, Selim Onal, Luke C Peng, Hsin-Ta Wu, and Benjamin J Raphael. An integrative probabilistic model for identiﬁcation of structural variation in sequencing data. Genome biology, 13(3):R22, 2012.

[148] D J Slamon, G M Clark, S G Wong, W J Levin, A Ullrich, and W L McGuire. Human breast cancer: correlation of relapse and survival with ampliﬁcation of the her-2/neu oncogene. Science, 235(4785):177–82, Jan 1987.

[149] T F Smith and M S Waterman. Identiﬁcation of common molecular subsequences. J Mol Biol, 147(1):195–7, Mar 1981.

[150] M Soda, Y L Choi, M Enomoto, S Takada, Y Yamashita, S Ishikawa, S Fujiwara, H Watanabe, K Kurashina, H Hatanaka, M Bando, S Ohno, Y Ishikawa, H Aburatani, T Niki, Y Sohara, Y Sugiyama, and H Mano. Identification of the transforming EML4-ALK fusion gene in non-small-cell lung cancer. Nature, 448(7153):561–566, Aug 2007.

[151] Christian Steidl, Sohrab P Shah, Bruce W Woolcock, Lixin Rui, Masahiro Kawahara, Pedro Farinha, Nathalie A Johnson, Yongjun Zhao, Adele Telenius, Susana Ben Neriah, Andrew McPherson, Barbara Meissner, Ujunwa C Okoye, Arjan Diepstra, Anke van den Berg, Mark Sun, Gillian Leung, Steven J Jones, Joseph M Connors, David G Huntsman, Kerry J Savage, Lisa M Rimsza, Douglas E Horsman, Louis M Staudt, Ulrich Steidl, Marco A Marra, and Randy D Gascoyne. Mhc class ii transactivator ciita is a recurrent gene fusion partner in lymphoid cancers. Nature, 471(7338):377–81, Mar 2011.

[152] P J Stephens, D J McBride, M L Lin, I Varela, E D Pleasance, J T Simpson, L A Stebbings, C Leroy, S Edkins, L J Mudie, C D Greenman, M Jia, C Latimer, J W Teague, K W Lau, J Burton, M A Quail, H Swerdlow, C Churcher, R Natrajan, A M Sieuwerts, J W Martens, D P Silver, A Langerod, H E Russnes, J A Foekens, J S Reis-Filho, L van ’t Veer, A L Richardson, A L Borresen-Dale, P J Campbell, P A Futreal, and M R Stratton. Complex landscapes of somatic rearrangement in human breast cancer genomes. Nature, 462(7276):1005–1010, Dec 2009.

[153] Philip Stephens, Chris Greenman, Beiyuan Fu, Fengtang Yang, Graham Bignell, Laura Mudie, Erin Pleasance, King Lau, David Beare, Lucy Stebbings, Stuart McLaren, Meng-Lay Lin, David McBride, Ignacio Varela, Serena Nik-Zainal, Catherine Leroy, Mingming Jia, Andrew Menzies, Adam Butler, Jon Teague, Michael Quail, John Burton, Harold Swerdlow, Nigel Carter, Laura Morsberger, Christine Iacobuzio-Donahue, George Follows, Anthony Green, Adrienne Flanagan, Michael Stratton, Andrew Futreal, and Peter Campbell. Massive genomic rearrangement acquired in a single catastrophic event during cancer development. Cell, 144(1):27–40, 2011.

[154] W Y Tarn and J A Steitz. A novel spliceosome containing u11, u12, and u5 snrnps excises a minor class (at-ac) intron in vitro. Cell, 84(5):801–811, Mar 1996.

[155] Clarisse Thiollier, Cécile K Lopez, Bastien Gerby, Cathy Ignacimouttou, Sandrine Poglio, Yannis Duffourd, Justine Guégan, Paola Rivera-Munoz, Olivier Bluteau, Vinciane Mabialah, M’boyba Diop, Qiang Wen, Arnaud Petit, Anne-Laure Bauchet, Dirk Reinhardt, Beat Bornhauser, Daniel Gautheret, Yann Lecluse, Judith Landman-Parker, Isabelle Radford, William Vainchenker, Nicole Dastugue, Stéphane de Botton, Philippe Dessen, Jean-Pierre Bourquin, John D Crispino, Paola Ballerini, Olivier A Bernard, Françoise Pflumio, and Thomas Mercher. Characterization of novel genomic alterations and therapeutic approaches using acute megakaryoblastic leukemia xenograft models. J Exp Med, 209(11):2017–31, Oct 2012.

[156] C Tognon, S R Knezevich, D Huntsman, C D Roskelley, N Melnyk, J A Mathers, L Becker, F Carneiro, N MacPherson, D Horsman, C Poremba, and P H Sorensen. Expression of the ETV6-NTRK3 gene fusion as a primary event in human secretory breast carcinoma. Cancer Cell, 2(5):367–376, Nov 2002.

[157] S A Tomlins, B Laxman, S M Dhanasekaran, B E Helgeson, X Cao, D S Morris, A Menon, X Jing, Q Cao, B Han, J Yu, L Wang, J E Montie, M A Rubin, K J Pienta, D Roulston, R B Shah, S Varambally, R Mehra, and A M Chinnaiyan. Distinct classes of chromosomal rearrangements create oncogenic ETS gene fusions in prostate cancer. Nature, 448(7153):595–599, Aug 2007.

[158] Scott A Tomlins, Daniel R Rhodes, Sven Perner, Saravana M Dhanasekaran, Rohit Mehra, Xiao-Wei Sun, Sooryanarayana Varambally, Xuhong Cao, Joelle Tchinda, Rainer Kuefer, Charles Lee, James E Montie, Rajal B Shah, Kenneth J Pienta, Mark A Rubin, and Arul M Chinnaiyan. Recurrent fusion of tmprss2 and ets transcription factor genes in prostate cancer. Science, 310(5748):644–8, Oct 2005.

[159] Jose M C Tubio and Xavier Estivill. Cancer: When catastrophe strikes a cell. Nature, 470(7335):476–7, Feb 2011.

[160] E Tuzun, A J Sharp, J A Bailey, R Kaul, V A Morrison, L M Pertz, E Haugen, H Hayden, D Albertson, D Pinkel, M V Olson, and E E Eichler. Fine-scale structural variation of the human genome. Nat Genet, 37(7):727–732, Jul 2005.

[161] David Dw Twa, Anja Mottok, Fong Chun Chan, Susana Ben-Neriah, Bruce W Woolcock, King L Tan, Andrew J Mungall, Helen McDonald, Yongjun Zhao, Raymond S Lim, Brad H Nelson, Katy Milne, Sohrab P Shah, Ryan D Morin, Marco A Marra, David W Scott, Randy D Gascoyne, and Christian Steidl. Recurrent genomic rearrangements in primary testicular lymphoma. J Pathol, 236(2):136–41, Jun 2015.

[162] S Volik, S Zhao, K Chin, J H Brebner, D R Herndon, Q Tao, D Kowbel, G Huang, A Lapuk, W L Kuo, G Magrane, P De Jong, J W Gray, and C Collins. End-sequence profiling: sequence-based analysis of aberrant genomes. Proc Natl Acad Sci U S A, 100(13):7696–7701, Jun 2003.

[163] Jianmin Wang, Charles G Mullighan, John Easton, Stefan Roberts, Sue L Heatley, Jing Ma, Michael C Rusch, Ken Chen, Christopher C Harris, Li Ding, Linda Holmfeldt, Debbie Payne-Turner, Xian Fan, Lei Wei, David Zhao, John C Obenauer, Clayton Naeve, Elaine R Mardis, Richard K Wilson, James R Downing, and Jinghui Zhang. CREST maps somatic structural variation in cancer genomes with base-pair resolution. Nature methods, 8(8):652–654, August 2011.

[164] K. Wang, D. Singh, Z. Zeng, S. J. Coleman, Y. Huang, G. L. Savich, X. He, P. Mieczkowski, S. A. Grimm, C. M. Perou, J. N. MacLeod, D. Y. Chiang, J. F. Prins, and J. Liu. MapSplice: accurate mapping of RNA-seq reads for splice junction discovery. Nucleic Acids Res., 38:e178, Oct 2010.

[165] K Wang, G Ubriaco, and L C Sutherland. Rbm6-rbm5 transcription-induced chimeras are diﬀerentially expressed in tumours. BMC Genomics, 8:348–348, 2007.

[166] Z Wang, M Gerstein, and M Snyder. Rna-seq: a revolutionary tool for transcriptomics. Nat Rev Genet, 10(1):57–63, Jan 2009.

[167] J L Weber and E W Myers. Human whole-genome shotgun sequencing. Genome Res, 7(5):401–9, May 1997.

[168] K C Wiegand, S P Shah, O M Al-Agha, Y Zhao, K Tse, T Zeng, J Senz, M K McConechy, M S Anglesio, S E Kalloger, W Yang, A Heravi-Moussavi, R Giuliany, C Chow, J Fee, A Zayed, L Prentice, N Melnyk, G Turashvili, A D Delaney, J Madore, S Yip, A W McPherson, G Ha, L Bell, S Fereday, A Tam, L Galletta, P N Tonin, D Provencher, D Miller, S J Jones, R A Moore, G B Morin, A Oloumi, N Boyd, S A Aparicio, I e M Shih, A M Mes-Masson, D D Bowtell, M Hirst, B Gilks, M A Marra, and D G Huntsman. ARID1A mutations in endometriosis-associated ovarian carcinomas. N Engl J Med, 363(16):1532–1543, Oct 2010.

[169] C Wu, A W Wyatt, A V Lapuk, A McPherson, B J McConeghy, R H Bell, S Anderson, A Haegert, S Brahmbhatt, R Shukin, F Mo, E Li, L Fazli, A Hurtado-Coll, E C Jones, Y S Butterfield, F Hach, F Hormozdiari, I Hajirasouliha, P C Boutros, R G Bristow, S J Jones, M Hirst, M A Marra, C A Maher, A M Chinnaiyan, S C Sahinalp, M E Gleave, S V Volik, and C C Collins. Integrated genome and transcriptome sequencing identifies a novel form of hybrid and aggressive prostate cancer. J Pathol, 227(1):53–61, May 2012.

[170] Chunxiao Wu, Alexander W Wyatt, Anna V Lapuk, Andrew McPherson, Brian J McConeghy, Robert H Bell, Shawn Anderson, Anne Haegert, Sonal Brahmbhatt, Robert Shukin, Fan Mo, Estelle Li, Ladan Fazli, Antonio Hurtado-Coll, Edward C Jones, Yaron S Butterfield, Faraz Hach, Fereydoun Hormozdiari, Iman Hajirasouliha, Paul C Boutros, Robert G Bristow, Steven Jm Jones, Martin Hirst, Marco A Marra, Christopher A Maher, Arul M Chinnaiyan, S Cenk Sahinalp, Martin E Gleave, Stanislav V Volik, and Colin C Collins. Integrated genome and transcriptome sequencing identifies a novel form of hybrid and aggressive prostate cancer. J Pathol, 227(1):53–61, May 2012.

[171] Chunxiao Wu, Alexander W Wyatt, Andrew McPherson, Dong Lin, Brian J McConeghy, Fan Mo, Robert Shukin, Anna V Lapuk, Steven J M Jones, Yongjun Zhao, Marco A Marra, Martin E Gleave, Stanislav V Volik, Yuzhuo Wang, S Cenk Sahinalp, and Colin C Collins. Poly-gene fusion transcripts and chromothripsis in prostate cancer. Genes Chromosomes Cancer, 51(12):1144–53, Dec 2012.

[172] Thomas D Wu and Colin K Watanabe. Gmap: a genomic mapping and alignment program for mrna and est sequences. Bioinformatics, 21(9):1859–75, May 2005.

[173] Alexander W Wyatt, Fan Mo, Kendric Wang, Brian McConeghy, Sonal Brahmbhatt, Lina Jong, Devon M Mitchell, Rebecca L Johnston, Anne Haegert, Estelle Li, Janet Liew, Jake Yeung, Raunak Shrestha, Anna V Lapuk, Andrew McPherson, Robert Shukin, Robert H Bell, Shawn Anderson, Jennifer Bishop, Antonio Hurtado-Coll, Hong Xiao, Arul M Chinnaiyan, Rohit Mehra, Dong Lin, Yuzhuo Wang, Ladan Fazli, Martin E Gleave, Stanislav V Volik, and Colin C Collins. Heterogeneity in the inter-tumor transcriptome of high risk prostate cancer. Genome Biol, 15(8):426, 2014.

[174] Jindan Yu, Jianjun Yu, Ram-Shankar Mani, Qi Cao, Chad J Brenner, Xuhong Cao, Xiaoju Wang, Longtao Wu, James Li, Ming Hu, Yusong Gong, Hong Cheng, Bharathi Laxman, Adaikkalam Vellaichamy, Sunita Shankar, Yong Li, Saravana M Dhanasekaran, Roger Morey, Terrence Barrette, Robert J Lonigro, Scott A Tomlins, Sooryanarayana Varambally, Zhaohui S Qin, and Arul M Chinnaiyan. An integrated network of androgen receptor, polycomb, and tmprss2-erg gene fusions in prostate cancer progression. Cancer Cell, 17(5):443–54, May 2010.

[175] Daniel R. Zerbino, Benedict Paten, Glenn Hickey, and David Haussler. An algebraic framework to sample the rearrangement histories of a cancer metagenome with double cut and join, duplication and deletion events. arXiv, 03 2013.

[176] Jianjun Zhang, Junya Fujimoto, Jianhua Zhang, David C Wedge, Xingzhi Song, Jiexin Zhang, Sahil Seth, Chi-Wan Chow, Yu Cao, Curtis Gumbs, Kathryn A Gold, Neda Kalhor, Latasha Little, Harshad Mahadeshwar, Cesar Moran, Alexei Protopopov, Huandong Sun, Jiabin Tang, Xifeng Wu, Yuanqing Ye, William N William, J Jack Lee, John V Heymach, Waun Ki Hong, Stephen Swisher, Ignacio I Wistuba, and P Andrew Futreal. Intratumor heterogeneity in localized lung adenocarcinomas delineated by multiregion sequencing. Science, 346(6206):256–9, Oct 2014.

[177] Jinghui Zhang, Gang Wu, Claudia P Miller, Ruth G Tatevossian, James D Dalton, Bo Tang, Wilda Orisme, Chandanamali Punchihewa, Matthew Parker, Ibrahim Qaddoumi, Fredrick A Boop, Charles Lu, Cyriac Kandoth, Li Ding, Ryan Lee, Robert Huether, Xiang Chen, Erin Hedlund, Panduka Nagahawatte, Michael Rusch, Kristy Boggs, Jinjun Cheng, Jared Becksfort, Jing Ma, Guangchun Song, Yongjin Li, Lei Wei, Jianmin Wang, Sheila Shurtleff, John Easton, David Zhao, Robert S Fulton, Lucinda L Fulton, David J Dooling, Bhavin Vadodaria, Heather L Mulder, Chunlao Tang, Kerri Ochoa, Charles G Mullighan, Amar Gajjar, Richard Kriwacki, Denise Sheer, Richard J Gilbertson, Elaine R Mardis, Richard K Wilson, James R Downing, Suzanne J Baker, David W Ellison, and St. Jude Children’s Research Hospital–Washington University Pediatric Cancer Genome Project. Whole-genome sequencing identifies genetic alterations in pediatric low-grade gliomas. Nat Genet, 45(6):602–12, Jun 2013.

[178] Q Zhao, O L Caballero, S Levy, B J Stevenson, C Iseli, S J de Souza, P A Galante, D Busam, M A Leversha, K Chadalavada, Y H Rogers, J C Venter, A J Simpson, and R L Strausberg. Transcriptome-guided characterization of genomic rearrangements in a breast cancer cell line. Proc Natl Acad Sci U S A, 106(6):1886–1891, Feb 2009.

[179] Qi Zhao, Ewen Kirkness, Otavia Caballero, Pedro Galante, Raphael Parmigiani, Lee Edshall, Samantha Kuan, Zhen Ye, Samuel Levy, Ana Vasconcelos, Bing Ren, Sandro de Souza, Anamaria Camargo, Andrew Simpson, and Robert Strausberg. Systematic detection of putative tumor suppressor genes through the combined use of exome and transcriptome sequencing. Genome Biology, 11(11):R114, 2010.

Appendix A

Supplementary Material for deFuse: an algorithm for gene fusion discovery in tumor RNA-Seq data

A.1 Glossary

Low Grade Serous (LGS) Ovarian cancer subtype characterized by small micropapillae that inﬁltrate ovarian stroma. Somatic KRAS, ERBB2, or BRAF mutations are found in two thirds of the cases and TP53 is rarely mutated.

High Grade Serous (HGS) Highly proliferative ovarian carcinoma subtype characterized by genomic instability due to TP53 loss and in some cases BRCA1/2 mutations. This cancer may originate in the fallopian tube.

Clear cell carcinoma (CCC) Ovarian carcinoma subtype characterized by large epithelial cells with abundant clear cytoplasm.

Endometrioid tumor (EMD) Ovarian carcinoma subtype composed of tubular glands bearing a close resemblance to benign or malignant endometrium.

Mucinous tumor (MUC) Ovarian carcinoma with similarities to mucinous colonic carcinomas.

Yolk sac tumor (YKS) Ovarian germ cell tumor that represents a proliferation of both yolk sac endoderm and extraembryonic mesenchyme.

Granulosa cell tumor (GRC) Ovarian tumors that arise from granulosa cells characterized by a single nucleotide variation in FOXL2.

Small cell hypercalcemic (SCH) Ovarian cancer subtype characterized by diffuse sheets of cells punctuated by variable numbers of follicle-like spaces. Often presents with hypercalcemia.

A.2 Supplementary Results

A.2.1 A classifier for gene fusion predictions

We sought to develop a classiﬁer for gene fusion predictions so that we would not have to rely on arbitrary thresholds. We selected the following 11 features, described in detail in section A.3.7. We chose to not select features that could be related to expression, such as the number of split or spanning reads, since we did not wish to bias the classiﬁer towards highly expressed fusions.

• Spanning read coverage
• Split position p-value
• Minimum split anchor p-value
• Corroboration p-value
• Concordant ratio
• Fusion boundary di-nucleotide entropy
• Fusion boundary homology
• cDNA adjusted percent identity
• Genome adjusted percent identity
• EST adjusted percent identity
• EST islands adjusted percent identity

We established whether each feature could be used to discriminate between true and false positives by plotting histograms of each feature for the 121 predictions in the example dataset (ﬁgure A.1).

[Figure A.1 legend: Positive, Negative]

Figure A.1: Histograms of each feature for all 121 predictions in the example dataset of 60 positive and 61 negative predictions.

A.3 Supplementary Computational Methods

A.3.1 Conditions for discordant alignments to have originated from reads spanning the same fusion

Let $r$ be the read length, let $\mathrm{fivep}(a_X)$ be the aligned position in transcript $X$ of the 5' end of the read, and let $\mathrm{strand}(a_X)$ be the strand of that alignment $a_X$. Then the fusion boundary region is given by equation A.1.

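\[
\mathrm{br}(a_X) =
\begin{cases}
[\,\mathrm{fivep}(a_X) + r,\ \mathrm{fivep}(a_X) + l_{\max} - r\,] & \text{if } \mathrm{strand}(a_X) = + \\
[\,\mathrm{fivep}(a_X) - l_{\max} + r,\ \mathrm{fivep}(a_X) - r\,] & \text{if } \mathrm{strand}(a_X) = -
\end{cases}
\tag{A.1}
\]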

Let $a_X$, $a_Y$, $b_X$ and $b_Y$ be the alignments to transcripts $X$ and $Y$ of paired end reads $a$ and $b$. We define the overlapping boundary region condition as the condition that the fusion boundary regions in each transcript must overlap in order to consider paired end reads $a$ and $b$ to have originated from the same fusion transcript. The overlapping boundary region condition ensures that there exists a valid location for the fusion boundary in transcript $X$ and transcript $Y$ that would simultaneously explain both paired end alignments. Included in the overlapping boundary region condition is the condition that $\mathrm{strand}(a_X) = \mathrm{strand}(b_X)$ and $\mathrm{strand}(a_Y) = \mathrm{strand}(b_Y)$. The overlapping boundary region condition is defined specifically as given in equation A.2.

\[
(\mathrm{br}(a_X) \cap \mathrm{br}(b_X) \neq \emptyset) \wedge (\mathrm{br}(a_Y) \cap \mathrm{br}(b_Y) \neq \emptyset)
\tag{A.2}
\]
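The boundary region of equation A.1 and the overlap test of equation A.2 can be sketched in a few lines of Python. This is a minimal illustration rather than the deFuse implementation: the `Alignment` container and the function names are hypothetical, regions are treated as closed integer intervals, and the strand requirement stated above is assumed to have been checked by the caller.

```python
from typing import NamedTuple, Tuple


class Alignment(NamedTuple):
    """One end of a paired end read aligned to a transcript (hypothetical container)."""
    fivep: int   # aligned position of the 5' end of the read
    strand: str  # '+' or '-'


def boundary_region(a: Alignment, r: int, l_max: int) -> Tuple[int, int]:
    """Fusion boundary region br(a) as a closed interval, following equation A.1."""
    if a.strand == '+':
        return (a.fivep + r, a.fivep + l_max - r)
    return (a.fivep - l_max + r, a.fivep - r)


def intervals_overlap(u: Tuple[int, int], v: Tuple[int, int]) -> bool:
    """True if two closed intervals share at least one position."""
    return u[0] <= v[1] and v[0] <= u[1]


def overlapping_boundary_condition(a_x: Alignment, a_y: Alignment,
                                   b_x: Alignment, b_y: Alignment,
                                   r: int, l_max: int) -> bool:
    """Equation A.2: boundary regions must overlap in transcript X and in transcript Y."""
    return (intervals_overlap(boundary_region(a_x, r, l_max),
                              boundary_region(b_x, r, l_max))
            and intervals_overlap(boundary_region(a_y, r, l_max),
                                  boundary_region(b_y, r, l_max)))
```

For example, with an assumed read length of 50 and maximum fragment length of 300, two reads whose X alignments start at positions 1000 and 1100 on the + strand have boundary regions [1050, 1250] and [1150, 1350], which overlap.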

Suppose now that transcripts $X$ and $Y$ are concatenated together as fusion transcript $XY$ with a $+-$ alignment configuration (alignments are to the $+$ strand of $X$ and the $-$ strand of $Y$). The location of the fusion boundary in each transcript is unknown, as is the variable $d$ that corresponds to the distance between the two fusion boundaries in the concatenated sequence. The fragment lengths $l_a$ and $l_b$ of fragments $a$ and $b$ are also unknown. However, it is possible to calculate the difference between the fragment lengths as $|l_a - l_b| = |z_a - z_b|$ as shown in figure A.2. We define the similar fragment length condition as the constraint that $|l_a - l_b|$ must be no more than $l_{\max} - l_{\min}$ for us to consider paired end reads $a$ and $b$ to have originated from the same fusion transcript.

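[Figure A.2: Fragment a and fragment b spanning the unknown fusion boundaries of transcript X and transcript Y, which are separated by distance d in the concatenated sequence, with $z_a = l_a + d$ and $z_b = l_b + d$.]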

Trivially, if $XY$ produces a $-+$ alignment configuration then $YX$ will produce a $+-$ configuration and should be considered instead. However, it may also be interesting to consider the situation in which $XY$ results in a $--$ or $++$ configuration because, although the prediction may not represent a chimeric transcript with a preserved open reading frame, it may represent an expressed structural variation or gene interruption. For this situation, a $+-$ configuration can be obtained by considering the reverse complement of either $X$ or $Y$ and recalculating the alignment positions to that reverse complemented sequence. In practice, however, it is not necessary to remap the position of each alignment to the concatenated sequence described above, since any offset added to the positions of alignments to $X$ or $Y$ will be incorporated into the value $d$ and will cancel out when calculating $|z_a - z_b|$. For the same reason, if it is necessary to reverse complement either $X$ or $Y$, all that is required is to consider the negation of the positions of alignments to whichever of $X$ or $Y$ it was necessary to reverse complement, since any additional offset will be incorporated into the value $d$, and will cancel out. The value $z_a$ (and $z_b$) can be calculated, with consideration for the strand of the alignments, using equation A.3. Note that this formulation of the similar fragment length condition is equivalent to that given in the main text, and allows for easier calculation of maximal valid clusters using the method in A.3.2.

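\[
z_a =
\begin{cases}
\mathrm{fivep}(a_Y) + \mathrm{fivep}(a_X) & \text{if } \mathrm{strand}(a_X) = \mathrm{strand}(a_Y) \\
\mathrm{fivep}(a_Y) - \mathrm{fivep}(a_X) & \text{if } \mathrm{strand}(a_X) \neq \mathrm{strand}(a_Y)
\end{cases}
\tag{A.3}
\]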
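As a small illustration of equation A.3 and the similar fragment length condition, the following Python sketch computes the z value of a paired end alignment and compares two of them. The function names and the example fragment length bounds are assumptions made for this sketch.

```python
def z_value(fivep_x: int, strand_x: str, fivep_y: int, strand_y: str) -> int:
    """z value of one paired end alignment, following equation A.3."""
    if strand_x == strand_y:
        return fivep_y + fivep_x
    return fivep_y - fivep_x


def similar_fragment_length(z_a: int, z_b: int, l_min: int, l_max: int) -> bool:
    """Similar fragment length condition: |z_a - z_b| <= l_max - l_min."""
    return abs(z_a - z_b) <= l_max - l_min
```

Because only the difference of z values is used, any constant offset in the coordinates of either transcript cancels out, which is the property the text above relies on.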

A.3.2 Generating Maximal Valid Clusters

We provide a polynomial time algorithm for calculating a set of clusters of paired end alignments, such that any two paired end alignments in a cluster satisfy the overlapping boundary region and similar fragment length conditions, and such that those clusters are maximal. Let $G$ be the set of transcripts under consideration. Let $S = \{+, -\}$ be the set of strands. Let $A_{X,Y,S,T}$ be the set of alignments such that one end finds at least one alignment to strand $S$ of transcript $X$ and the other end finds at least one alignment to strand $T$ of transcript $Y$. Consider all distinct sets $A_{X,Y,S,T} \neq \emptyset$. Let $A_X$ be the alignments to transcript $X$ and $A_Y$ be the alignments to transcript $Y$. Maximal paired end alignment clusters $P_{X,Y,S,T}$ satisfying both conditions can be computed in polynomial time as follows.

1. Create the fusion boundary region clusters $C_X$ for transcript $X$. The fusion boundary region clusters can be created using a polynomial time algorithm as described in [65], reiterated here. Fusion boundary regions $\mathrm{br}(A_X)$ are sorted by their start coordinate. Clustering proceeds by adding regions in left to right order to cluster $C_X^k$ until a region is encountered that does not overlap with all other regions in $C_X^k$. Cluster $C_X^k$ is kept unless it is a proper subset of $C_X^{k-1}$. Cluster $C_X^{k+1}$ is initialized to $C_X^k \setminus a$, where $a$ is the region in $C_X^k$ with the leftmost end coordinate, and the process repeats. Repeat for transcript $Y$, creating $C_Y$.

2. Create clusters of paired end alignments $D_{C_X,C_Y}$ where every paired end alignment $a \in D_{C_X,C_Y}$ satisfies $a \in C_X \wedge a \in C_Y$. For any $D_{C_X,C_Y}$ it should be true that any two paired end alignments in $D_{C_X,C_Y}$ satisfy the overlapping boundary region condition.

3. Refine clusters of paired end alignments $D_{C_X,C_Y}$ into clusters of paired end alignments $\{D^i_{C_X,C_Y}\}$ that also satisfy the similar fragment length condition. For each paired end alignment $a$ in $D_{C_X,C_Y}$ calculate the value $z_a$. Sort the alignments by $z$ and use a sliding window of size $l_{\max} - l_{\min}$ to calculate clusters $\{D^i_{C_X,C_Y}\}$. Specifically, proceed by adding alignments to cluster $D^k_{C_X,C_Y}$ in order of increasing $z$ while maintaining the property that the difference between the lowest and highest $z$ values in $D^k_{C_X,C_Y}$ is less than or equal to $l_{\max} - l_{\min}$. Cluster $D^k_{C_X,C_Y}$ is kept unless it is a proper subset of $D^{k-1}_{C_X,C_Y}$. Cluster $D^{k+1}_{C_X,C_Y}$ is initialized to $D^k_{C_X,C_Y} \setminus a$, where $a$ is the paired end alignment with the lowest $z$ value.

4. Remove any cluster that is the subset of another cluster. Let $P_{X,Y,S,T} = \{D^i_{C_X,C_Y}\}$ be the resulting set of clusters. It can be easily verified that $P_{X,Y,S,T}$ is the set of maximal paired end alignment clusters satisfying both conditions.
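The sliding window refinement of step 3 can be sketched as follows. This is a two-pointer variant written for illustration, not a transcription of the procedure above: the function name is hypothetical, the input is assumed to be the list of z values for one cluster from step 2, and the output lists indices into that input.

```python
from typing import List, Sequence


def maximal_z_windows(z_values: Sequence[float], width: float) -> List[List[int]]:
    """Return maximal groups of indices whose z values span at most `width`.

    `width` plays the role of l_max - l_min. A window is reported only if it
    extends further right than the previously reported one, so no reported
    window is a subset of another.
    """
    order = sorted(range(len(z_values)), key=lambda i: z_values[i])
    clusters: List[List[int]] = []
    end = 0        # exclusive right edge of the current window (sorted order)
    prev_end = 0   # right edge of the last window that was kept
    for start in range(len(order)):
        end = max(end, start)
        # Extend the window while the z spread stays within `width`.
        while (end < len(order)
               and z_values[order[end]] - z_values[order[start]] <= width):
            end += 1
        if end > prev_end:
            clusters.append([order[i] for i in range(start, end)])
            prev_end = end
    return clusters
```

After sorting, each index enters and leaves the window once, so the window scan is linear and the sort dominates the running time. Alignments may appear in more than one window, which matches the overlapping clusters allowed by the description above.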

A.3.3 Split read boundary sequence prediction

Let $C_{X,Y,S,T}$ be a paired end alignment cluster that provides evidence of a fusion between strand $S$ of transcript $X$ and strand $T$ of transcript $Y$. Let $A_X$ and $A_Y$ be the end alignments of each paired end to transcripts $X$ and $Y$ respectively. Let $\mathrm{br}(A_X) = \bigcap_{a_X \in A_X} \mathrm{br}(a_X)$ and $\mathrm{br}(A_Y) = \bigcap_{a_Y \in A_Y} \mathrm{br}(a_Y)$. For each alignment with one end aligning to transcript $X$ we calculate the mate alignment region, denoted $\mathrm{mate}(a_X)$, as in equation A.4.

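\[
\mathrm{mate}(a_X) =
\begin{cases}
[\,\mathrm{fivep}(a_X) + l_{\min} - r,\ \mathrm{fivep}(a_X) + l_{\max}\,] & \text{if } \mathrm{strand}(a_X) = + \\
[\,\mathrm{fivep}(a_X) - l_{\max},\ \mathrm{fivep}(a_X) - l_{\min} + r\,] & \text{if } \mathrm{strand}(a_X) = -
\end{cases}
\tag{A.4}
\]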

For each alignment with one end aligning to transcript $X$, if $\mathrm{br}(A_X) \cap \mathrm{mate}(a_X) \neq \emptyset$ then add the sequence of the end that does not align to transcript $X$ to $M_X$. Repeat the process for transcript $Y$ to create $M_Y$. Create the sequence $S_X$ by extracting the sequence of transcript $X$ in the range $\mathrm{br}(A_X)$ expanded by $r$ on each side. Repeat for transcript $Y$ to create $S_Y$. Reverse complement $S_Y$ if $S = T$. Reverse complement the sequences in $M_X$. Reverse complement the sequences in $M_Y$ if $S \neq T$. For each candidate split read $r \in M_X \cup M_Y = M$, align $r$ to $S_X$ using dynamic programming based local alignment, penalizing initial gaps in the end sequence. Repeat with the reverse of sequence $r$ and the reverse of sequence $S_Y$ (see supplementary section A.3.4). Proceed as described in the main text of the paper.
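The candidate selection described above can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: the function names are hypothetical, regions are closed integer intervals, and the cluster boundary region is taken to be the intersection of the per-alignment regions as defined in this section.

```python
from typing import Iterable, Optional, Tuple

Interval = Tuple[int, int]


def mate_region(fivep: int, strand: str, r: int, l_min: int, l_max: int) -> Interval:
    """Mate alignment region of one aligned end, following equation A.4."""
    if strand == '+':
        return (fivep + l_min - r, fivep + l_max)
    return (fivep - l_max, fivep - l_min + r)


def intersect_all(regions: Iterable[Interval]) -> Optional[Interval]:
    """Intersection of closed intervals, or None if the intersection is empty."""
    regions = list(regions)
    lo = max(start for start, _ in regions)
    hi = min(end for _, end in regions)
    return (lo, hi) if lo <= hi else None


def is_split_candidate(cluster_region: Optional[Interval], mate: Interval) -> bool:
    """True if the mate region intersects the cluster's fusion boundary region."""
    if cluster_region is None:
        return False
    return cluster_region[0] <= mate[1] and mate[0] <= cluster_region[1]
```

A read whose mate region passes this test could have its unaligned end overlapping the fusion boundary, which is why that end is collected for the split read alignment step.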

A.3.4 Dynamic programming matrix deﬁnition

We use dynamic programming based local alignment penalizing initial gaps in the read sequence as part of the method for finding read sequences split by the fusion boundary. Let $\delta(p, q) = m$ if $p = q$ and $\delta(p, q) = u$ otherwise, so that $m$ is the match score and $u$ the mismatch score. Let $g$ be the score given for a gap in either the read sequence or the transcript sequence. Let $r$ be the read sequence and $S$ the reference sequence on one side of the fusion boundary. The dynamic programming matrix may be defined as follows [149], where $p$ and $q$ denote the characters of $S$ and $r$ compared at cell $(i, j)$.

\[
\begin{aligned}
D(i, 0) &= 0, && 0 \le i \le |S| \\
D(0, j) &= D(0, j - 1) + g, && 0 < j \le |r| \\
D(i, j) &= \max
\begin{cases}
D(i - 1, j - 1) + \delta(p, q) \\
D(i - 1, j) + g \\
D(i, j - 1) + g
\end{cases}
&& 0 < i \le |S|,\ 0 < j \le |r|
\end{aligned}
\tag{A.5}
\]
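A direct way to fill the matrix of equation A.5 is sketched below in Python. The score values are placeholders chosen for illustration, the function name is hypothetical, and the first-row recurrence is read as D(0, j) = D(0, j − 1) + g, which is an assumption made because row 0 has no free index i.

```python
from typing import List


def alignment_matrix(S: str, r: str, m: int = 1, u: int = -1, g: int = -2) -> List[List[int]]:
    """Fill D with |S|+1 rows and |r|+1 columns.

    Column 0 is all zeros, so the alignment may start anywhere in S for free.
    Row 0 accumulates gap scores, so skipping a prefix of the read is penalized.
    """
    D = [[0] * (len(r) + 1) for _ in range(len(S) + 1)]
    for j in range(1, len(r) + 1):
        D[0][j] = D[0][j - 1] + g
    for i in range(1, len(S) + 1):
        for j in range(1, len(r) + 1):
            delta = m if S[i - 1] == r[j - 1] else u
            D[i][j] = max(D[i - 1][j - 1] + delta,  # match or mismatch
                          D[i - 1][j] + g,          # gap in the read
                          D[i][j - 1] + g)          # gap in the reference
    return D
```

With the placeholder scores, aligning the read "CG" against the reference "ACGT" gives its best score for the full read at the reference position just after "ACG", where both read characters match.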

A.3.5 Covariance between the lengths of fragments spanning a fusion boundary

We do not assume that the set of fragment lengths $\{l_i\}$ of paired end reads spanning the same fusion boundary are drawn independently from the fragment length distribution $P(L)$. Thus the variance of the mean fragment length $\bar{l}$ includes a covariance term $\mathrm{Cov}(L_1, L_2)$ as given by equation A.6. The covariance $\mathrm{Cov}(L_1, L_2)$ represents the degree to which two fragments overlapping the same position are likely to have the same length.

\[
\mathrm{Var}(\bar{L}) = \frac{1}{n}\,\mathrm{Var}(L) + \left(1 - \frac{1}{n}\right) \mathrm{Cov}(L_1, L_2)
\tag{A.6}
\]

We estimate the covariance between the lengths of two fragments originating from the same location in the transcriptome using concordant alignments to cDNA. Concordant alignments to cDNA often contain paired end alignments that are consistently aligned to the wrong splice variant, causing some alignments to imply the wrong fragment length. In an attempt to mitigate this effect we only consider paired end alignments for which the implied fragment length is in the range $[\mu - 3\sigma, \mu + 3\sigma]$, where $\mu$ and $\sigma$ are the mean and standard deviation of the inferred fragment length distribution. We begin by selecting $n$ positions in the transcriptome at random. For each position we select at random, if they exist, two paired end alignments with one end aligning entirely to the left and one end aligning entirely to the right of that position. Let the fragment lengths implied by the two paired end alignments selected for position $i$ be given by $l_{i1}$ and $l_{i2}$. Equation A.7 is used to estimate the covariance between the two random variables $L_1$ and $L_2$ representing the fragment lengths of two reads spanning the same fusion boundary.

\[
\widehat{\mathrm{Cov}}[L_1, L_2] = \frac{\sum_i l_{i1}\, l_{i2}}{n} - \frac{\sum_i \sum_j l_{i1}\, l_{j2}}{n^2}
\tag{A.7}
\]
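The estimator can be sketched in a few lines of Python. This is an illustrative sketch only: the function names are hypothetical, the input is assumed to be the list of sampled pairs (li1, li2) after filtering, and the second function uses the standard identity for the variance of a mean of n equally correlated variables, which is how equation A.6 is read here.

```python
def estimate_covariance(pairs):
    """Mean of products minus product of means over the sampled pairs."""
    n = len(pairs)
    sum_products = sum(a * b for a, b in pairs)
    sum_first = sum(a for a, _ in pairs)
    sum_second = sum(b for _, b in pairs)
    # The double sum over i and j factorizes into a product of two sums.
    return sum_products / n - (sum_first * sum_second) / (n * n)


def variance_of_mean(var_l, cov_l, n):
    """Variance of the mean of n lengths with common pairwise covariance."""
    return var_l / n + (1.0 - 1.0 / n) * cov_l


print(estimate_covariance([(1, 1), (3, 3)]))  # 1.0
print(variance_of_mean(4.0, 1.0, 2))          # 2.5
```

When the covariance is zero the second function reduces to the familiar Var(L)/n, and as the covariance approaches Var(L) the mean becomes no more precise than a single fragment.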

A.3.6 Covariance between split read statistics for reads split by a fusion boundary

We do not assume that the values pi calculated for reads split by a fusion boundary are drawn independently from a uniform distribution. To model dependency we estimate the covariance Cov(pi, pj). We begin by selecting n positions in the transcriptome at random. For each position we select at random, if they exist, two paired end alignments with one end overlapping that position by at least nanchor nucleotides. We calculate p1 and p2 for both of these split alignments as given by equation A.10. Equation A.8 is then used to estimate the covariance between two random variables P1 and P2 representing pi values of two reads split by the same fusion boundary. An equivalent analysis is used to estimate Cov̂(Q1, Q2) for qi values as calculated by equation A.11.

\[
\widehat{\mathrm{Cov}}(P_1, P_2) = \frac{\sum_i p_{i1}\, p_{i2}}{n} - \frac{\sum_i \sum_j p_{i1}\, p_{j2}}{n^2}
\tag{A.8}
\]

A.3.7 Features

For each fusion prediction we calculate a number of features to assist in the discrimination between real fusions and false positives.

Spanning read count Number of reads spanning the fusion boundary.

Spanning read coverage Normalized spanning read coverage (section A.3.7).

Split read count Number of reads split by the fusion boundary.

Split position p-value P-Value for the hypothesis that the split position statistic was calculated from split reads that are evenly distributed across the fusion boundary (section A.3.7).

Minimum split anchor p-value P-Value for the hypothesis that the minimum split anchor statistic was calculated from split reads that are evenly distributed across the fusion boundary (section A.3.7).

Corroboration p-value P-Value for the hypothesis that the lengths of reads spanning the fusion boundary were drawn from the fragment length distribution (section "Corroborating spanning and split read evidence" in the main text).

Concordant ratio Proportion of spanning reads supporting a fusion that have a concordant alignment using blat with default parameters.

Fusion boundary di-nucleotide entropy Di-nucleotide entropy calculated 40 nt upstream and downstream of the fusion boundary for the predicted sequence, taking the minimum of both values (section A.3.7).

Fusion boundary homology Number of homologous nucleotides in each gene at the predicted fusion boundary (section A.3.7).

cDNA adjusted percent identity Maximum adjusted percent identity (section A.3.7) for the alignments of the predicted sequence to any cDNA.

Genome adjusted percent identity Maximum adjusted percent identity (section A.3.7) for the alignments of the predicted sequence to the genome.

EST adjusted percent identity Maximum adjusted percent identity (section A.3.7) for the alignments of the predicted sequence to any EST.

EST island adjusted percent identity Maximum adjusted percent identity (section A.3.7) for the alignments of the predicted sequence to any EST island (section A.3.7).

Normalized spanning read coverage

For each fusion partner gene X we calculate cX, the number of nucleotides matched in X by at least one of the prediction's spanning read alignments. We then normalize cX by the expected coverage lavg − rmin, where lavg is the mean fragment length and rmin is the minimum read length. The normalized spanning read coverage for a prediction is the minimum of the normalized coverage calculated for each gene predicted as fused (equation A.9). Predictions arising from PCR duplicates of poor quality reads, or from systematic alignment errors in small homologous regions, are expected to have smaller values for the normalized spanning read coverage than predictions representing real fusions.

\[
\text{Normalized spanning read coverage} = \frac{\min(c_X, c_Y)}{l_{avg} - r_{min}}
\tag{A.9}
\]

Split position p-value and minimum split anchor p-value

Split read alignments are prone to systematic alignment errors that produce false positive fusion boundary predictions. We expect a true positive to produce a certain number of reads split approximately in half by the fusion boundary, whereas many false positives are identiﬁed by the lack of any reads that are split approximately in half. We calculate two statistics in order to identify false positive split alignments.

For each of the n split alignments supporting a prediction, let li and ri be the number of nucleotides aligning to the left and right of the fusion boundary respectively. Under the null hypothesis that the fusion boundary is real, the normalized split position pi (equation A.10) and normalized minimum split anchor qi (equation A.11) should be uniformly distributed on [0, 1], with expected value E[pi] = E[qi] = 0.5 and variance Var[pi] = Var[qi] = 1/12, so that the mean of n independent values would have variance 1/(12n).

\[
p_i = \frac{l_i - n_{anchor}}{l_i + r_i - 2\,n_{anchor}}
\tag{A.10}
\]

\[
q_i = \frac{\min(l_i, r_i) - n_{anchor}}{\frac{l_i + r_i}{2} - n_{anchor}}
\tag{A.11}
\]

A dependence between pi values for reads split by the same fusion boundary means that the sample variance of a set of n pi values includes a covariance term. The covariance term and sample variance of n pi values are calculated as described in A.3.6. A dependence between qi values is resolved similarly. The sample means of the n pi and n qi values are assumed to be normally distributed.

A two sided z-test with alternative hypothesis E[p] ≠ 0.5 is used to calculate the split position p-value. A one sided z-test with alternative hypothesis E[q] < 0.5 is used to calculate the minimum split anchor p-value. Significant values for these p-values represent evidence to reject the null hypothesis that the split reads are uniformly distributed across the fusion boundary.
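The two tests can be sketched as follows. This is a rough sketch rather than the deFuse code: the function names are hypothetical, the covariances cov_p and cov_q are assumed to have been estimated beforehand as in section A.3.6, and the variance of each sample mean is assumed to be the uniform variance 1/12 divided by n plus a covariance term weighted by (1 − 1/n).

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def split_statistics(left, right, n_anchor):
    """Return (p_i, q_i) for one split read (equations A.10 and A.11)."""
    p = (left - n_anchor) / (left + right - 2 * n_anchor)
    q = (min(left, right) - n_anchor) / ((left + right) / 2 - n_anchor)
    return p, q


def split_p_values(splits, n_anchor, cov_p=0.0, cov_q=0.0):
    """Return (split position p-value, minimum split anchor p-value)."""
    n = len(splits)
    stats = [split_statistics(l, r, n_anchor) for l, r in splits]
    mean_p = sum(p for p, _ in stats) / n
    mean_q = sum(q for _, q in stats) / n
    # Assumed variance of each mean: uniform variance plus covariance term.
    sd_p = math.sqrt((1.0 / 12.0) / n + (1.0 - 1.0 / n) * cov_p)
    sd_q = math.sqrt((1.0 / 12.0) / n + (1.0 - 1.0 / n) * cov_q)
    z_p = (mean_p - 0.5) / sd_p
    z_q = (mean_q - 0.5) / sd_q
    position_p = 2.0 * (1.0 - normal_cdf(abs(z_p)))  # two sided test
    anchor_p = normal_cdf(z_q)  # one sided test, alternative E[q] < 0.5
    return position_p, anchor_p


# A 50 nt read split 4/46 with n_anchor = 4 sits at the extreme of both tests.
print(split_p_values([(4, 46)], n_anchor=4))
```

A read split exactly in half gives p = 0.5 and q = 1, so neither test rejects, whereas reads that only just satisfy the anchor requirement drive both p-values down.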

Fusion boundary di-nucleotide entropy

A common source of false positive fusion boundary predictions using split alignments is the alignment of low complexity reads, such as poly-A reads, to low complexity regions in genes. In order to identify spurious fusion boundary predictions caused by low complexity reads, we calculate the di-nucleotide entropy of the predicted fusion boundary sequence. Let D = {ninj : ni, nj ∈ {A, C, T, G}} be the set of all possible di-nucleotides. Let S be a sequence of length m and let count(d, S) be the number of occurrences of di-nucleotide d in sequence S. The di-nucleotide entropy of the sequence S can be calculated as given by equation A.12.

\[
H(S) = -\sum_{d \in D} p_{d,S} \log_2 p_{d,S},
\qquad
p_{d,S} = \frac{\mathrm{count}(d, S)}{m - 1}
\tag{A.12}
\]

Let Su be the 40 nucleotides of the predicted sequence upstream of the fusion boundary, and let Sd be the 40 nucleotides of the predicted sequence downstream of the fusion boundary. For the purposes of this study we use m = 40. We calculate the fusion boundary di-nucleotide entropy as min(H(Su), H(Sd)). The fusion boundary di-nucleotide entropy is expected to be lower for fusion boundary predictions involving low complexity sequence on either side of the fusion boundary.
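A short Python sketch of this calculation is given below. The function names are hypothetical, overlapping di-nucleotides are counted (which is what the m − 1 denominator suggests), and di-nucleotides with zero count are taken to contribute nothing to the sum.

```python
import math
from collections import Counter


def dinucleotide_entropy(seq):
    """Entropy (in bits) of the overlapping di-nucleotides of seq."""
    m = len(seq)
    counts = Counter(seq[i:i + 2] for i in range(m - 1))
    entropy = 0.0
    for count in counts.values():
        p = count / (m - 1)
        entropy -= p * math.log2(p)
    return entropy


def fusion_boundary_entropy(predicted_seq, boundary, flank=40):
    """Minimum entropy of the flanks on either side of the boundary."""
    upstream = predicted_seq[max(0, boundary - flank):boundary]
    downstream = predicted_seq[boundary:boundary + flank]
    return min(dinucleotide_entropy(upstream),
               dinucleotide_entropy(downstream))


# A poly-A flank has a single di-nucleotide and therefore zero entropy.
print(fusion_boundary_entropy("A" * 40 + "ACGT" * 10, boundary=40))
```

Taking the minimum means that a single low complexity flank is enough to give the prediction a low score.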

Fusion boundary homology

Template switching by reverse transcriptase (RT) during cDNA preparation has been identified previously as a mechanism for producing chimeric cDNA fragments [69]. An identifying feature of chimeric cDNA produced by template switching is the existence of short homologous sequence at the 'splice site' implied by the cDNA sequence [69]. Thus, to identify predictions resulting from chimeric reads produced by template switching during RT, we calculate the length of homologous sequence at the fusion boundary. Let S be the predicted sequence for a fusion prediction between gene X and gene Y, and let l be the length of S. Let mX and mY be the number of matches minus mismatches for the best alignments of S to all splice variants of X and Y respectively. We calculate an estimate of the fusion boundary homology as given by equation A.13.

\[
\text{Fusion boundary homology} = m_X + m_Y - l
\tag{A.13}
\]

Note that if a prediction is caused by misalignments of non-chimeric reads from a single gene, the predicted sequence may align with high sequence similarity to only that gene. This situation will also produce a higher than normal value for the fusion boundary homology, also indicating a likely false positive. All alignments of S to splice variants of X and Y were obtained using blat [74].

Adjusted percent identity

We sought to identify concordant alignments of the predicted fusion sequence to cDNA, EST and chromosome sequences. However, some predicted fusion sequences are asymmetrical: they involve only a small amount of sequence from one of the genes predicted as fused. As a result, reporting a simple percent identity for the alignment of the predicted sequence to a cDNA, EST, or chromosome would be biased against asymmetrical fusion prediction sequences. We use the adjusted percent identity, described below, as an alternative to the percent identity that does not suffer from a bias against asymmetrical fusion prediction sequences. Let S be the predicted sequence for a fusion prediction between gene X and gene Y, let ζ be the fusion boundary in S, and let l be the length of S. Also let SX and SY be the sequences on the X and Y sides of ζ respectively, with lengths lX and lY respectively. Given an alignment of S to a cDNA, EST or chromosome sequence, let m be the matches minus mismatches for the alignment. We first assume that the longer of SX and SY is matched exactly in the alignment, and any remaining matches exist in the shorter of SX and SY. We then calculate the adjusted percent identity as the percent identity of the alignment within the shorter of SX and SY under these assumptions (equation A.14). All alignments of S to cDNA, EST and chromosome sequences were obtained using blat [74].

\[
\text{Adjusted percent identity} = \frac{m - \max(l_X, l_Y)}{\min(l_X, l_Y)}
\tag{A.14}
\]
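A small numerical example may help to show why the adjustment matters. The function below is a direct transcription of equation A.14 with a hypothetical name, and the numbers are invented for illustration.

```python
def adjusted_percent_identity(m, l_x, l_y):
    """Identity within the shorter side, assuming the longer side matches fully."""
    return (m - max(l_x, l_y)) / min(l_x, l_y)


# An asymmetrical prediction: 90 nt from gene X and 10 nt from gene Y.
# An alignment with m = 95 has a simple identity of 95 / 100 = 0.95,
# but only half of the short side is accounted for.
print(adjusted_percent_identity(95, 90, 10))  # 0.5
```

In this example a sequence that aligns almost entirely to a single gene still looks highly concordant by simple percent identity, while the adjusted value shows that half of the 10 nt side remains unexplained.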

EST islands

We sought to identify predictions that could be explained by alternative splicing as opposed to underlying genomic structural variation. We use UCSC's spliced EST alignments [131] as evidence of co-transcription of genomic regions. EST islands are then defined as the set of minimal genomic regions such that any spliced EST alignment that overlaps an EST island is contained within that EST island. EST islands represent islands of co-transcription in the genome, as evidenced by EST alignments. The EST island adjusted percent identity for a fusion prediction is the adjusted percent identity of a spliced alignment of the predicted sequence that falls entirely within an EST island.
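Under this definition, EST islands can be obtained by merging the genomic spans of overlapping alignments. The sketch below assumes each spliced EST alignment is summarized by a hypothetical (chromosome, start, end) tuple covering its first to last aligned base, introns included, with half-open coordinates; it is an illustration of the definition rather than the procedure actually used.

```python
def est_islands(alignments):
    """Merge overlapping alignment spans into minimal enclosing regions."""
    islands = []
    for chrom, start, end in sorted(alignments):
        last = islands[-1] if islands else None
        if last is not None and last[0] == chrom and start < last[2]:
            # Overlaps the current island: extend it if necessary.
            last[2] = max(last[2], end)
        else:
            islands.append([chrom, start, end])
    return [tuple(island) for island in islands]


spans = [("chr1", 100, 200), ("chr1", 150, 300),
         ("chr1", 400, 500), ("chr2", 100, 200)]
print(est_islands(spans))
# [('chr1', 100, 300), ('chr1', 400, 500), ('chr2', 100, 200)]
```

Sorting first means each span only needs to be compared with the most recent island, so the merge is a single pass over the alignments.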

A.3.8 Filtering

A principled machine learning approach to discriminating between true and false positives is difficult without a significant number of positive and negative controls. Thus in order to roughly discriminate between real fusions and false positives, we initially used a set of thresholds on a subset of the features calculated in section A.3.7. These thresholds are given below.

A.3.9 Probabilistic motivation for clustering conditions

The two conditions for clustering paired end alignments can be motivated probabilistically by considering the likelihood of two paired end alignments given that those paired end reads represent the same fusion transcript. Consider the alignments of two discordant paired end reads, a and b. Suppose a has an alignment of end aX to transcript X and end aY to transcript Y. Similarly, suppose b has an alignment of end bX to transcript X and an alignment of end bY to transcript Y. Figure A.3 shows a possible configuration of the alignments.

Figure A.3: The relative position of two paired end alignments, where dX and dY are the differences between the alignment positions in transcripts X and Y, and v is the inner distance between the two ends of a on a concatenated X-Y transcript.

The distances dX and dY are the diﬀerences between the positions of alignments on transcript X and transcript Y respectively. Also, v is the latent variable representing the unknown length of the unsequenced region of paired end a. Given v, we can calculate the fragment lengths xa and xb of paired end reads a and b as,

\[
x_a = v + 2r
\]

\[
x_b = v - d_X - d_Y + 2r,
\]

where r is the read length. Thus, supposing that paired end reads a and b result from the same fusion isoform F, we can calculate the probability P(dX, dY | F) by summing over v as

\[
P(d_X, d_Y \mid F) = \sum_v P(d_X, d_Y, v \mid F) = \frac{1}{Z} \sum_v P(d_X, d_Y \mid v, F)
\]

Z is a normalization constant calculated as follows.

\[
Z = \sum_v \sum_{d_X} \sum_{d_Y} P(d_X, d_Y \mid v, F)
\]

Figure A.4 shows the probability distribution P(dX, dY | F) for r = 50, µ = 200 and σ = 30. The overlapping boundary region condition and the similar fragment length condition have equivalent formulations as constraints on dX and dY. The overlapping boundary region condition is equivalent to the constraints given by equations A.15 and A.16. Any values for dX or dY outside these constraints will result in non-overlapping fusion boundary regions for transcript X or transcript Y. Values for dX and dY that satisfy the constraints given by both equations A.15 and A.16 will have overlapping boundary regions and will satisfy the overlapping boundary region condition. The similar fragment length condition is equivalent to the constraint −(lmax − lmin) ≤ dX + dY ≤ lmax − lmin (equation A.17), which is simply a reformulation of the equation in figure 2c.

\[
-l_{min} - 2r < d_X < l_{max} + 2r
\tag{A.15}
\]

\[
-l_{min} - 2r < d_Y < l_{max} + 2r
\tag{A.16}
\]

\[
-(l_{max} - l_{min}) \le d_X + d_Y \le l_{max} - l_{min}
\tag{A.17}
\]

Figure A.4: Probability distribution P(dX, dY | F) of the distances dX and dY between alignments for two paired end alignments produced by the same fusion F.

We compared the region of the dX × dY configuration space that satisfies the overlapping boundary region condition and similar fragment length condition for α = 0.05 with a region contained within an equivalent contour of P(dX, dY | F). We used r = 50, µ = 200 and σ = 30, as was used for figure A.4. We calculated lmax − lmin for α = 0.05 and then calculated q = Σ_{|li − lj| …} P(li) P(lj) = 0.99422. The value q represents the combined probability …

Figure A.5: Overlapping boundary region condition and similar fragment length condition in the context of probability distribution P(dX, dY | F). For α = 0.05, the region of the configuration space satisfying the two conditions overlaps with the equivalent contour of P(dX, dY | F). Legend: overlapping boundary region condition for α = 0.05; similar fragment length condition for α = 0.05; contour for P(dX, dY); equivalent contour for P(dX, dY) for α = 0.05; region satisfying both conditions.

A.3.10 FusionSeq predictions

FusionSeq version 0.6.1 was used to predict gene fusions in CCC15, CCC16 and EMD6. These 3 cases were chosen because they contained the greatest number of validated predictions, with 6, 3 and 4 validations respectively. We followed the instructions provided on the FusionSeq and RSeqTools websites, reiterated here. We first downloaded the hg18 bundled dataset. We then created a junction library from the UCSC provided 2bit genome and the gene models provided in the bundled dataset using the following command:

    createSpliceJunctionLibrary hg18.2bit knownGeneAnnotationTranscriptCompositeModel.txt 45

Next we used bowtie-build to generate bowtie indices for the human genome and junction library combined. Bowtie was used with default parameters to independently generate alignments for each end of the paired end reads. The two bowtie outputs were converted into MRF format using the bowtiePairedFix executable provided by the author and bowtie2mrf from RSeqTools.

    bowtie hg18_junctions reads.1.fastq reads.1.bwtout
    bowtie hg18_junctions reads.2.fastq reads.2.bwtout
    cat reads.1.bwtout reads.2.bwtout | sort | bowtiePairedFix | bowtie2mrf paired -sequence > data.mrf

Fusions were predicted based on the data.mrf ﬁle using the following commands with default parameters as given by the FusionSeq website:

    geneFusions data 4 < data.mrf > data.1.gfr 2> data.1.log
    (gfrAbnormalInsertSizeFilter 0.01 < data.1.gfr | gfrPCRFilter 4 4 | gfrProximityFilter 1000 | gfrAddInfo | gfrAnnotationConsistencyFilter ribosomal | gfrLargeScaleHomologyFilter | gfrRibosomalFilter | gfrSmallScaleHomologyFilter) > data.gfr 2> data.log
    gfrConfidenceValues data < data.gfr > data.confidence.gfr

To compare the overlap between FusionSeq predictions and deFuse predictions, we aligned the FusionSeq read evidence to fusion sequences predicted by deFuse using bowtie. Comparing the results in this way avoided problems that would result from trying to compare gene identifiers from different sets of gene annotations.

We also sought to validate fusions predicted by FusionSeq that were not predicted by deFuse. In order to maximize our chances of successful validation, we applied a set of filters to the FusionSeq output before selecting fusions to validate. We first sought to classify as concordant reads that were evidence for the FusionSeq predictions. We aligned the read evidence to the genome and ESTs from UCSC, and searched for alignments of each within 1000 nt of each other on the same chromosome/EST. We removed any FusionSeq prediction for which at least one read could be classified as concordant using this method. We also removed FusionSeq predictions for which at least one end of one read aligned to a ribosomal RNA (Ensembl 54 gene models). Since we were not interested in differences between the results that arose because of the use of different sets of gene annotations, we removed FusionSeq predictions for which none of the reads aligned using blat to gene regions considered by deFuse. Several of the predictions were removed because they involved reads that did not align to a contiguous region of the genome or to contiguous exons, making it difficult to pinpoint a breakpoint and design primers.
Finally, we removed fusions also predicted by deFuse, and selected the 3 predictions from each library with the highest RESPER score. This produced 3 candidates for CCC16 and EMD6, and 2 candidates for CCC16 which only contained 2 fusions after filtering.

A.3.11 MapSplice predictions

MapSplice version 1.14.1 was first used to predict fusions in CCC15, CCC16 and EMD6. To reiterate, these 3 cases were chosen because they contained the greatest number of validated predictions, with 6, 3 and 4 validations respectively. We followed the set of instructions on the MapSplice website, downloading the UCSC genome and building a bowtie index. The default paired end configuration file was used, with the following differences.

    read_length = 50
    segment_length = 16
    junction_type = non-canonical
    run_MapPER = yes
    full_running = no
    do_fusion = yes

We then searched the MapSplice results for the validated deFuse predictions. We selected all of the sequences in the synthetic sequence column of the fusion.junction file and used blat with default parameters to find an alignment of those junction sequences to the sequences predicted by deFuse. We also extracted all splice junction predictions from the CIGAR string of each alignment in the alignments.sam file, and compared those splice junction predictions with the validated deFuse predictions.

We suspected MapSplice might perform better on the 75mer libraries. Thus, we ran MapSplice on the 75mer reads from SCH1, EMD5 and GRC5. The default paired end configuration file was used, with the following differences.

    read_length = 75
    segment_length = 25
    junction_type = non-canonical
    run_MapPER = yes
    do_fusion = yes

We sought to validate fusions predicted by MapSplice that were not predicted by deFuse. In order to maximize our chances of successful validation, we applied a set of conservative filters to the MapSplice output before selecting fusions to validate. From the fusion.junction file we selected fusions with at least 2 supporting reads that were predicted to occur within the boundaries of the Ensembl genes we were considering in this study. We then removed predictions for which the synthetic sequence aligned with greater than 90% identity by blat to the genome, or greater than 50% identity to ribosomal RNA. After applying these filters we were left with 14 predictions from CCC15, CCC16, EMD6, SCH1, EMD5 and GRC5.

A.3.12 Running deFuse on melanoma RNA-Seq datasets

RNA-Seq datasets for 13 melanoma samples and cell lines were downloaded from the short read archive. These datasets are half the size of our average sarcoma or ovarian cancer datasets, and 4 of the fusions represented in these datasets have 5 or fewer supporting reads. Thus we adjusted the following parameters of deFuse so that deFuse would be able to predict fusions in these datasets. The clustering_precision parameter is equal to 1 − α.

    clustering_precision = 0.80
    span_count_threshold = 2
    split_count_threshold = 1

Appendix B

Supplementary Material for nFuse: Discovery of complex genomic rearrangements in cancer using high-throughput sequencing

B.1 nFuse pipeline overview

The nFuse method builds upon Comrad [104], our previous work on rearrangement detection in matched RNA-seq and WGSS. We begin this section by briefly describing Comrad, then describe significant differences between Comrad and nFuse. An overview of the nFuse pipeline is shown in Figure B.1.

[Figure B.1 flowchart. Labels recovered from the figure: Reference Transcriptome, Reference Genome, RNA-Seq, WGSS, Alignment, Clustering, Candidate Transcripts, Candidate Breakpoints, Breakpoint Prediction, Fusion Transcript Prediction, Complex Rearrangement Prediction, Breakpoint Graph, Path Search, Complex Breakpoints, Cycle Search, CCBRs, Multi-map Resolution, Corroborating Rearrangements, Maximum Parsimony, Post-processing, Annotation & Classification, Results.]

Figure B.1: The nFuse pipeline involves 5 major steps: breakpoint prediction, fusion transcript prediction, complex rearrangement prediction, multi-map resolution, and post-processing. Shaded nodes represent data and unshaded nodes represent analysis.

Comrad was developed to predict fusion transcripts and their associated rearrangements from matched RNA-seq and WGSS tumour data. Comrad begins by aligning RNA-seq reads to the reference transcriptome and WGSS reads to the reference genome. Paired end RNA-seq reads for which both ends align to the same gene are classified as concordant, and all other reads are classified as discordant. Discordant RNA-seq reads are clustered into sets of reads that suggest the same candidate fusion transcript. Paired end WGSS reads for which both ends align in the expected direction within 2 kb are classified as concordant, and all other reads are classified as discordant. Discordant WGSS reads are clustered into sets of reads that suggest the same candidate breakpoint.

The key contribution provided by Comrad was a method for resolving multiply mapped RNA-seq and WGSS reads using a maximum parsimony based combinatorial formulation. Real and biologically relevant gene fusions are known to exist where one of the fusion partners shares significant sequence homology with other genes in the genome [83]. These fusions may produce RNA-seq reads with ambiguous genomic origin, suggesting multiple, equally likely fusion transcripts. If a corroborating breakpoint can be predicted from WGSS data, it may be possible to identify the actual fusion transcript. For example, if A-B and A-B' are two fusion transcripts suggested by the same set of multi-mapping RNA-seq reads, an A-B' breakpoint predicted from matched WGSS data would help to identify A-B' as more likely. The reverse is also true: unambiguous RNA-seq reads can assist in predicting the correct breakpoint implied by multi-mapping WGSS reads. In fact, even for ambiguous RNA-seq and WGSS reads, it should be possible to correctly associate the RNA-seq and WGSS evidence even though the exact gene pair remains unknown.

Unfortunately, naive application of Comrad to the identification of CGRs will result in over-prediction of these events. The nFuse pipeline differs from Comrad in 4 major areas: WGSS alignment, discordant read clustering, corroboration between fusion transcripts and breakpoints, and the maximum parsimony formulation.
WGSS reads are aligned using a seed and extend strategy, and the best partial discordant alignments of the WGSS reads are used to predict breakpoints. Discordant reads are clustered using a mixture model and the EM algorithm. nFuse adds detection of CGRs associated with fusion transcripts, replacing the Comrad method of corroborating fusion transcripts and breakpoints. Finally, nFuse incorporates a new maximum parsimony formulation for resolving multi-map reads that does not over-predict CGRs.

B.1.1 Partial alignments of WGSS reads

As sequencing technology improves and read lengths increase, a larger proportion of each DNA fragment is sequenced, and a smaller proportion of the fragment remains unsequenced. Thus it becomes increasingly likely that a breakpoint will fall within a sequenced region of a DNA fragment rather than the unsequenced region in the middle of a fragment. As a result we will be less likely to find a complete and contiguous alignment of both reads produced by a DNA fragment harbouring a breakpoint. For instance, in HCC1954, the DNA fragments are approximately 193 bp in length, and many of the reads are 81 bp in length. If we expect complete and contiguous alignments, we will only be able to identify discordant reads for which the breakpoint is in the 193 − 2 × 81 = 31 bp region in the middle of the fragment.

To mitigate the aforementioned problem, we search for partial alignments of discordant paired end WGSS reads. Let r be the sequence of one end of a paired end read and define the partial alignment of r as an alignment of the first ℓ nucleotides of r, where 1 ≤ ℓ ≤ |r|. A read with a breakpoint at position ℓ + 1 in the read should ideally produce a partial alignment of length ℓ. The score of a partial alignment is calculated using a fixed bonus for matches, an affine gap penalty, and a penalty for mismatches based on the quality of the read at the mismatch. A partial alignment can be calculated using a trivial modification to the dynamic programming algorithm for calculating alignments with affine gap penalties [52].

We use bowtie2 in local alignment mode as an approximate but effective method for generating partial alignments of reads [81]. To generate the n top scoring mapping locations for a read, we first use bowtie2 with the parameter --very-sensitive-local and with -k set to n + 1 to calculate n + 1 local alignments, and re-score these alignments if they include soft-clipping at the beginning of the read. We use the default scoring method implemented by bowtie2 for end-to-end alignments, re-described here. A gap of length N is given a penalty calculated as,

\[
GO + N \times GE.
\]

We use the bowtie2 defaults, GO = 5 and GE = 3. Mismatches are given a penalty calculated as,

\[
MN + \left\lfloor (MX - MN) \cdot \frac{\min(Q, 40.0)}{40} \right\rfloor
\]

where Q is the Phred quality value. We use the bowtie2 defaults, MN = 2 and MX = 6. Matches are given a bonus MA = 2, the default for bowtie2. Let S = {s1, s2, ..., sn, sn+1} be the resulting set of scores and let T = {si : si = max(S)} be the set of scores that attain the maximum value. If |T| = n + 1 the read is filtered, otherwise the set of alignments T is retained. By default nFuse uses n = 20. Note that we are currently exploring the tradeoff between speed, accuracy and flexibility of available aligners to allow optimal performance of the nFuse breakpoint prediction.
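The scoring and filtering rules above can be sketched in Python as follows. The constants are the values quoted in the text; the function names and the representation of alignments as a plain list of scores are assumptions made for the example, and the sketch does not reproduce the re-scoring of soft-clipped alignments.

```python
import math

GO, GE = 5, 3   # gap open and gap extend penalties
MN, MX = 2, 6   # minimum and maximum mismatch penalties
MA = 2          # match bonus


def gap_penalty(length):
    """Penalty for a gap of the given length."""
    return GO + length * GE


def mismatch_penalty(quality):
    """Quality-scaled mismatch penalty for Phred quality value `quality`."""
    return MN + math.floor((MX - MN) * min(quality, 40.0) / 40.0)


def retain_best_alignments(scores, n=20):
    """Return indices of the top scoring alignments among n + 1 candidates.

    An empty list means the read is filtered because all n + 1 alignments
    tie for the best score, so more equally good locations may exist.
    """
    best = max(scores)
    top = [i for i, s in enumerate(scores) if s == best]
    return [] if len(top) == n + 1 else top


print(gap_penalty(1), mismatch_penalty(20), mismatch_penalty(40))  # 8 4 6
print(retain_best_alignments([10, 10, 7], n=2))                    # [0, 1]
```

Requesting n + 1 alignments is what makes the filter possible: if every one of them ties for the maximum, the aligner may have truncated the list of equally good locations.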

B.1.2 Discordant read clustering

Discordant reads are clustered into sets of reads that support the same fusion transcript/breakpoint using Expectation Maximization applied to a mixture model. Given a breakpoint formed by joining position s in chromosome X to position t in chromosome Y , we can write the likelihood of alignment A given breakpoint B = (s, t) as:

\[
P(A \mid B) \propto
\begin{cases}
\mathcal{N}(s - x + t - y \mid \mu, \sigma^2) & x \le s \text{ and } y \le t \\
0 & x > s \text{ or } y > t
\end{cases}
\]

Normalization constant W can be calculated exactly by noting that the volume under the surface deﬁned by P (A|B) is equivalent to an extrusion of the normal distribution by µ, thus W = µ. The assumption that P (A|B) = 0 for x > s or y > t implies that the partial alignment process is perfect. We soften this assumption because a distribution with support over the full 2d space of alignment positions will be more conducive to an EM type algorithm. Thus deﬁne the soft boundary likelihood as follows:

\[
P(A \mid B) \propto \mathcal{N}(s - x + t - y \mid \mu, \sigma^2) \cdot e^{-\lambda R(x - s)} \cdot e^{-\lambda R(y - t)}
\]

Where R(x) denotes the ramp function deﬁned as follows:

\[
R(x) =
\begin{cases}
x & x \ge 0 \\
0 & x < 0
\end{cases}
\]
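For illustration, the unnormalized soft boundary likelihood can be evaluated in log space as sketched below. The function names are hypothetical, x and y are taken to be the alignment positions in chromosomes X and Y, and the value of λ used in the example is an arbitrary assumption, since no value is stated here.

```python
import math


def ramp(x):
    """Ramp function R(x): x for x >= 0 and 0 otherwise."""
    return x if x >= 0 else 0.0


def soft_boundary_log_likelihood(x, y, s, t, mu, sigma, lam):
    """Unnormalized log P(A|B) for alignment (x, y) and breakpoint (s, t)."""
    implied_length = (s - x) + (t - y)
    log_normal = (-0.5 * math.log(2.0 * math.pi * sigma ** 2)
                  - (implied_length - mu) ** 2 / (2.0 * sigma ** 2))
    # Alignments that run past the breakpoint are penalized, not forbidden.
    return log_normal - lam * ramp(x - s) - lam * ramp(y - t)


inside = soft_boundary_log_likelihood(200, 200, 200, 200, 200, 30, lam=1.0)
past = soft_boundary_log_likelihood(210, 190, 200, 200, 200, 30, lam=1.0)
print(inside - past)  # penalty of lam * 10 for overshooting s by 10
```

Because the penalty is finite, every alignment position receives non-zero probability under every breakpoint, which is the property that makes the likelihood suitable for EM.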

As an approximation, we continue to use the normalization constant W = µ for the soft boundary likelihood. Suppose now that the N paired end alignments in A are produced by a mixture of K breakpoints B. At this stage K is assumed given, below we describe how we determine K. Let znk = 1 if and only if alignment An was generated by breakpoint Bk ∈ B, and let πk = P (znk = 1). Write the log likelihood of A given B as follows:

\[
\log p(A \mid B) = \sum_{n=1}^{N} \log \sum_{k=1}^{K} \pi_k\, P(A_n, z_{nk} \mid B_k)
\]

We use EM to infer the B, π, and Z that maximize log p(A|B). Finally we select K using the Bayesian Information Criterion. We start by setting K = 1, running EM, and calculating the BIC (Equation B.1). Next we select the paired end alignment m with the minimum model probability (Equation B.2). We increment K and augment the previous set of responsibilities, allowing the new cluster to take full responsibility for Am. We redo EM, starting with the M Step, and calculate BIC for the result. The process continues until the new BIC is larger than the previous, at which point we stop iteration and select the previous solution. Lastly, we assign each paired end alignment n to the cluster k for which P (znk|An, πˆ, B) is maximum.

\[
\mathrm{BIC} = -2 \log p(A \mid B) + 2K \log N
\tag{B.1}
\]

\[
m = \operatorname*{argmin}_n\; p(A_n \mid B) = \operatorname*{argmin}_n \sum_{k=1}^{K} \pi_k\, p(A_n \mid B_k)
\tag{B.2}
\]
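The outer model selection loop can be sketched as follows. This is a simplified illustration, not the nFuse code: the EM fit itself is abstracted behind a hypothetical callback fit(k) that returns the converged log likelihood for k clusters, so the step that seeds each new cluster with the least well explained alignment (Equation B.2) is assumed to happen inside that callback.

```python
import math


def bic(log_likelihood, k, n):
    """Bayesian Information Criterion with 2 parameters per breakpoint."""
    return -2.0 * log_likelihood + 2 * k * math.log(n)


def select_num_clusters(fit, n):
    """Increase K until the BIC gets larger, then keep the previous K.

    `fit` is a hypothetical callback: fit(k) -> log likelihood after EM.
    """
    k = 1
    best = bic(fit(k), k, n)
    while k < n:  # never more clusters than alignments
        candidate = bic(fit(k + 1), k + 1, n)
        if candidate > best:
            break
        k, best = k + 1, candidate
    return k, best


# Invented log likelihoods: a second cluster helps a lot, a third barely.
fake_fits = {1: -100.0, 2: -60.0, 3: -59.0, 4: -58.0}
print(select_num_clusters(fake_fits.__getitem__, n=50)[0])  # 2
```

The penalty term grows by 2 log N per added breakpoint, so an extra cluster is accepted only if it raises the log likelihood by at least log N.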

B.1.3 Corroboration between fusion transcripts and breakpoints

Corroboration between fusion transcripts and breakpoints differs significantly between nFuse and Comrad. For Comrad, fusion transcripts and rearrangement breakpoints are said to corroborate given satisfaction of two conditions that loosely determine whether a fusion transcript conceivably arose from a single breakpoint. Corroboration was determined by comparing all fusion transcripts with all breakpoints, and assessing satisfaction of the two conditions. By contrast, nFuse constructs a breakpoint graph from breakpoint predictions, including those supported by ambiguous WGSS reads. nFuse then searches for the shortest alternating path through the breakpoint graph that would corroborate each fusion transcript. As stated above, we add vertices representing the predicted fusion boundaries, and search for the shortest alternating path between those vertices. If a path exists below a given threshold score, that path is said to corroborate the fusion transcript. nFuse also searches for evidence of CCBRs associated with fusion transcripts. For each breakpoint in an alternating path corroborating a fusion transcript, nFuse searches for a CCBR that includes that breakpoint. As stated above, we first remove the breakpoint edge from the graph, then search for the shortest alternating path between the two vertices of that breakpoint. We also identify fusion transcripts with mapped genomic distance less than 200 kbp, and with an orientation suggestive of a deletion, and label these as read-throughs.

B.1.4 Maximum parsimony formulation for resolving multi-map reads

Let B be the set of all breakpoint predictions implied by WGSS reads G, and let F be the set of all fusion transcript predictions implied by RNA-seq reads R. Let MG ⊆ G × B be a mapping from WGSS reads to breakpoints, and let MR ⊆ R × F be a mapping from RNA-seq reads to fusion transcripts, each produced by the alignment and clustering process. We would like to identify UG ⊆ MG and UR ⊆ MR, such that UG and UR are unique mappings from reads to fusions/breakpoints, and such that UG and UR maximize parsimony. Under the assumption that fusion transcripts and breakpoints are rare, we define the UG and UR that maximize parsimony as the UG and UR that minimize the number of fusion transcripts and breakpoints. However, we would also like to maximize the number of CGRs we discover, without over-predicting CGRs. We do so by searching within the space of maximum parsimony solutions for a solution that also maximizes the weighted sum of discovered CGRs (herein referred to as the CGR corroboration score). Our formulation can thus be seen as a type of multi-objective optimization, for which the minimization of fusion transcripts and breakpoints is primary, and the maximization of the CGR corroboration score is secondary.

We define the space of maximum parsimony solutions as follows. Let δf (UR) indicate that UR maps at least one RNA-seq read to fusion transcript f. Similarly, let δb(UG) indicate that UG maps at least one WGSS read to breakpoint b. The space of maximum parsimony solutions P is defined as given in Equation B.3.

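\[
\begin{aligned}
P_R &= \operatorname*{argmin}_{U_R} \sum_{f} \delta_f(U_R) \\
P_G &= \operatorname*{argmin}_{U_G} \sum_{b} \delta_b(U_G) \\
P &= P_R \times P_G
\end{aligned}
\tag{B.3}
\]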

Next we define the CGR corroboration score for paths as follows. Let Q be the set of pairs of fusion transcripts and corroborating paths. Let δp(UG) indicate that δb(UG) = 1 for all breakpoints b in path p. Read-throughs are represented by empty paths and by definition δp(UG) = 1 for read-throughs. The CGR corroboration score sp for paths is given by Equation B.4.

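\[
\begin{aligned}
s_p(U_R, U_G) &= \sum_{(f,p) \in Q} w_p \cdot \delta_f(U_R) \cdot \delta_p(U_G) \\
&= \sum_{p} w_p \cdot \delta_p(U_G) \Bigg( \sum_{f : (f,p) \in Q} \delta_f(U_R) \Bigg)
\end{aligned}
\tag{B.4}
\]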

We down-weight more complex paths by defining wp as given by Equation B.5.

\[
w_p = \frac{1}{1 + |p|} \tag{B.5}
\]

Finally we define the CGR corroboration score for cycles as follows. Let c be a cycle and let δc(UG) indicate that δb(UG) = 1 for all breakpoints b in cycle c. The CGR corroboration score sc for cycles is given by Equation B.6.

\[
s_c(U_G) = \sum_{c} w_c \, \delta_c(U_G) \tag{B.6}
\]

Similar to paths, we down-weight more complex cycles by defining wc as given by Equation B.7.

\[
w_c = \frac{1}{1 + |c|} \tag{B.7}
\]

Our final optimization problem is to identify a solution to Equation B.8.

\[
\operatorname*{argmax}_{(U_R, U_G) \in P} \left[ s_p(U_R, U_G) + s_c(U_G) \right] \tag{B.8}
\]

A complete search of P would be prohibitively expensive for even small instances of the problem. Thus, we search for a good (UR, UG) using a heuristic search algorithm. Our heuristic is based on the greedy set cover algorithm, and as such guarantees a worst case approximation ratio of O(log n) for the problem of minimizing the number of fusion transcripts and breakpoints. No guarantee is provided for maximizing the CGR corroboration score.

The algorithm we propose alternates between minimizing the number of fusion transcripts and minimizing the number of breakpoints. The number of fusion transcripts is minimized, and the CGR corroboration score maximized, given an estimate for the value of δp. Next the number of breakpoints is minimized, and the CGR corroboration score maximized, given an estimate for the value of $\kappa_p = \sum_{f : (f,p) \in Q} \delta_f$. The value κp represents an estimate of the number of fusion transcripts corroborated by path p. At each iteration we re-estimate values for δp and κp.

For path p, let χp = δp1, δp2, ..., δpi be a sequence of δp values from previous solutions to the problem of minimizing breakpoints. We initialize χp to a sequence of m successes, placing a heavy prior on existence of each path in the form of m pseudo-counts. Furthermore, let ψp = κp1, κp2, ..., κpi be a sequence of κp values from previous solutions to the problem of minimizing fusion transcripts. We initialize ψp as follows. For each path p, identify ub(κp), an upper bound on κp, by running Algorithm 3 (with null bonuses) for only the fusion transcripts in the set {f : (f, p) ∈ Q}. Initialize ψp as a sequence of m values ub(κp), placing a heavy prior on the existence of the fusion transcripts corroborating p.

The algorithm proceeds as follows. Let Z be the set of complex rearrangements (paths and cycles).

1. Estimate $\hat{\kappa}_p = \sum_i \kappa_{pi} / |\psi_p|$. Let V be a set of bonuses for rearrangements, where $v_z$ is a bonus given for the solution that selects rearrangement z. Set the bonus $v_c$ for cycle c as $w_c$ and the bonus $v_p$ for path p as $\hat{\kappa}_p w_p$. Use Algorithm 1 to select a minimal set of breakpoints while attaining the maximum total in bonuses. Let CG = MinimizeBreakpoints(G, B, Z, V, ε). Create UG from CG by assigning each read g to the breakpoint in CG that covers g and contains the greatest number of reads, breaking ties randomly.

2. For each path p, calculate δp based on UG from step 1, and append it to χp.

3. Estimate $\hat{\delta}_p = \sum_i \delta_{pi} / |\chi_p|$. Let Y be a set of bonuses for fusion transcripts, where $y_f$ is a bonus for a solution that selects fusion transcript f. Set the bonus $y_f$ for fusion transcript f corroborated by path p as $\hat{\delta}_p w_p$. If f is a read-through then $y_f = 1$, since δp = 1 and wp = 1 by definition for read-throughs. If f is neither a read-through nor corroborated by a path, $y_f = 0$. Use Algorithm 3 to select a minimal set of fusion transcripts while attaining the maximum total in bonuses. Let CR = MinimizeFusions(R, F, Y, ε). Create UR from CR by assigning each read r to the fusion transcript in CR that covers r and contains the greatest number of reads, breaking ties randomly.

4. For each path p, calculate κp based on UR from step 3, and append it to ψp.

5. Repeat steps 1 to 4 n times and return UG and UR from the most recent iteration.

For the purposes of this study, we have used m = 10 and n = 10.

Algorithm 1 takes as input B, Z and V. Each element of B is a labelled set of reads representing a breakpoint. Each element of Z is a set of labels representing a set of breakpoints putatively forming a CGR. V contains a bonus for each CGR in Z. Initially, each breakpoint is given a weight of 1. The bonus vz for CGR z is distributed evenly between the breakpoints b ∈ z by subtracting a small bonus, ε · vz / |z|, from the weight of each b. At each iteration the algorithm selects the most cost-efficient breakpoint b′ and adds it to the set of chosen breakpoints C. All breakpoints not in C but completely covered by ∪C are considered invalid. All CGRs containing invalid breakpoints are invalid. The bonuses of invalid CGRs are removed from the CGR's remaining valid breakpoints. Finally, for each CGR z containing b′, the bonus from z is redistributed to the remaining breakpoints in z that are not in C. The algorithm terminates when C covers all reads. Calculation of weights is given by Algorithm 2.

Algorithm 1 MinimizeBreakpoints(G, B, Z, V, ε)
INPUT: WGSS reads G
INPUT: Breakpoints B as labelled sets of reads
INPUT: CGRs Z as sets of breakpoints
INPUT: CGR bonuses V
INPUT: Small constant ε
OUTPUT: Breakpoints C ⊆ B
  C ← ∅
  for all b ∈ B do
    w_b ← CalculateBreakpointWeight(b, B, Z, V, C, ε)
  end for
  while ∪C ≠ G do
    select b ∈ B \ C that minimizes w_b / |b \ ∪C|
    C ← C ∪ {b}
    for all b′ ∈ B : b′ ∉ C, b′ ⊆ ∪C do
      for all z ∈ Z : b′ ∈ z do
        Z ← Z \ {z}
        for all b″ ∈ z do
          w_b″ ← CalculateBreakpointWeight(b″, B, Z, V, C, ε)
        end for
      end for
      B ← B \ {b′}
    end for
    for all z ∈ Z : b ∈ z do
      for all b′ ∈ z do
        w_b′ ← CalculateBreakpointWeight(b′, B, Z, V, C, ε)
      end for
    end for
  end while
  return C

Algorithm 2 CalculateBreakpointWeight(b, B, Z, V, C, ε)
INPUT: Breakpoint b
INPUT: Valid breakpoints B as labelled sets of reads
INPUT: Valid CGRs Z as sets of breakpoints
INPUT: CGR bonuses V
INPUT: Chosen breakpoints C
INPUT: Small constant ε
OUTPUT: weight w
  w ← 1
  for all z ∈ Z do
    if z ⊆ B and b ∈ z then
      u ← z \ C
      w ← w − ε · v_z / |u|
    end if
  end for
  return w

Algorithm 3 takes as input F and Y. Each element of F is a labelled set of reads representing a fusion transcript. Y contains a bonus for each fusion transcript in F. Each fusion transcript f is given a weight 1 − ε · yf. At each iteration the algorithm selects the most cost-efficient fusion transcript f′ and adds it to the set of chosen fusion transcripts C. The algorithm terminates when C covers all reads.

Algorithm 3 MinimizeFusions(R, F, Y, ε)
INPUT: RNA-seq reads R
INPUT: Fusion transcripts F as labelled sets of reads
INPUT: Fusion transcript bonuses Y
INPUT: Small constant ε
OUTPUT: Fusion transcripts C ⊆ F
  C ← ∅
  for all f ∈ F do
    w_f ← 1 − ε · y_f
  end for
  while ∪C ≠ R do
    select f ∈ F \ C that minimizes w_f / |f \ ∪C|
    C ← C ∪ {f}
  end while
  return C
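The greedy selection in Algorithm 3 can be sketched in Python as follows. This is an illustrative sketch rather than the implementation used in this work: the function name `minimize_fusions`, the dictionary-based inputs, the default value of `eps`, and the first-come tie-breaking rule are all assumptions made for the example.

```python
def minimize_fusions(reads, fusions, bonuses, eps=1e-3):
    """Greedy weighted set cover over fusion transcripts (sketch of Algorithm 3).

    reads   : set of read ids that must be covered
    fusions : dict mapping a fusion label to the set of read ids supporting it
    bonuses : dict mapping a fusion label to its bonus y_f (missing labels get 0)
    eps     : small constant scaling the bonus (the default is an assumption)
    """
    # Each fusion transcript starts with weight 1 - eps * y_f.
    weights = {f: 1.0 - eps * bonuses.get(f, 0.0) for f in fusions}
    chosen = []
    covered = set()
    while covered != reads:
        best, best_cost = None, float("inf")
        for f, members in fusions.items():
            if f in chosen:
                continue
            # Number of still-uncovered reads this fusion would explain.
            gain = len((members & reads) - covered)
            if gain == 0:
                continue
            cost = weights[f] / gain
            if cost < best_cost:  # ties keep the first candidate seen
                best, best_cost = f, cost
        if best is None:
            raise ValueError("some reads are not covered by any fusion transcript")
        chosen.append(best)
        covered |= fusions[best] & reads
    return chosen


if __name__ == "__main__":
    candidates = {"A": {1, 2}, "B": {1, 2}, "C": {3}}
    # A and B explain the same reads; a bonus on B tips the choice towards B.
    print(minimize_fusions({1, 2, 3}, candidates, {"B": 1.0}))
    print(minimize_fusions({1, 2, 3}, candidates, {}))
```

Because the bonus is scaled by a small constant, it only breaks ties between candidates that explain the same number of uncovered reads, so the primary objective of minimizing the number of selected fusion transcripts is left intact.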

B.1.5 Post-processing

The result of the multi-map resolution stage of Comrad will be a set of fusion transcripts and corroborating breakpoints. Assembled sequences for breakpoint predictions are re-aligned to the reference genome using blat. Breakpoint sequences that align with 90% identity within a 2 kb genomic region are filtered, as are the associated CGRs. The fusion transcripts are further processed as for deFuse [102]. In brief, targeted dynamic programming is used to identify split reads and assemble nucleotide level fusion transcripts. A probability is calculated for each assembled fusion transcript using a classifier trained on known positive and negative fusion transcript predictions.

B.2 Calculating breakpoint probability

We predict breakpoints from discordant paired end alignments. Our approach aims for high sensitivity by including reads with multiple genomic mappings, and reads that map only partially to the genome. To ensure adequate specificity, we calculate a probability for each breakpoint based on the alignment evidence and use that probability in downstream analysis including CGR discovery. Let R be the set of paired end WGSS reads. We generate a set of mapping locations M for R using the following well established strategy [160, 162]. For each paired end read $(r_j^1, r_j^2)$ ∈ R:

1. Identify a single concordant mapping location if it exists.
2. If no concordant mapping location exists:
   (a) identify the n top scoring mapping locations for $r_j^1$
   (b) identify the n top scoring mapping locations for $r_j^2$

We identify the n top scoring mapping locations for $r_j^1$ (and $r_j^2$) as follows. Let sj be the maximum alignment score attained by partial alignment of read j to the genome. Let k be the number of mappings of read j that attain sj. If k > n assume the read is unmappable and filter it, otherwise retain the k mapping locations.

Let mj ∈ M be the mapping locations identified for read $(r_j^1, r_j^2)$ ∈ R. Define the following indicator variables:

cj ≡ read j is concordant
dj ≡ the true alignment was discovered and is in the set mj

We make the assumption that reads mapped concordantly by the aligner are in fact concordant (with probability 1). We filter the concordantly mapped reads to create the set of discordant reads $R^d$ and set of discordant mappings $M^d$. As a result, P (cj = 1, dj = 1) = 0 for the set of filtered reads. We estimate probabilities for the remaining two possibilities for the true alignment of each read:

P (cj = 1|·) ≡ concordant but missed by the aligner
P (dj = 1|cj = 0, ·) ≡ discordant but missed by the aligner

We estimate P (cj = 1|·) using the maximum concordant alignment score csj. To calculate csj, we align both ends of read j to all mapping locations in the set mj, and set csj to the maximum alignment score identified by this process. We then calculate P (cj = 1|csj), and use it to approximate P (cj = 1|·). We approximate P (dj = 1|cj = 0, ·) as P (dj = 1|cj = 0, asj) where asj is the alignment score for read j.

Next, we cluster the discordant alignments $M^d$ based on the likelihood that a set of alignments were generated by the same breakpoint. Let the resulting clusters of alignments represent putative breakpoints. Let gij indicate that putative breakpoint i generated read j. Assume gij = 0 if read j is not in the cluster that supports breakpoint i. We estimate P (gij = 1|·) as P (gij = 1|nmj, dj = 1), where nmj is the number of alternate mapping locations of read j. Under the assumption that all mapping locations discovered by the aligner are equally likely, we calculate $P(g_{ij} = 1 | nm_j, d_j = 1) = 1 / nm_j$.

Finally, let bi indicate that breakpoint i is true, let Gi be the set of all gij for breakpoint i, and let ni be the number of reads that were generated by breakpoint i, that is $n_i = \sum_{g_{ij} \in G_i} g_{ij}$. We estimate P (bi|ni) and use it to estimate P (bi|·) as given by Equation B.9.

\[
\begin{aligned}
P(b_i | \cdot) = \sum_{G_i} P(b_i | n_i) \prod_{j} \; & P(g_{ij} = 1 | nm_j, d_j = 1) \\
& \times P(d_j = 1 | as_j, c_j = 0) \\
& \times P(c_j = 0 | cs_j)
\end{aligned}
\tag{B.9}
\]

We ﬁrst describe methods for estimating the above probability distributions, then describe an algorithm for calculating P (bi|·) from these distributions.

B.2.1 Calculating P (cj = 0|csj)

Alignment scores are calculated using dynamic programming with aﬃne gap penalties [52]. We use the scoring function implemented for bowtie2 [81], redescribed here. A gap of length N is given a penalty calculated as,

GO + N × GE.

We calculate the maximum concordant alignment score csj for discordant read j as follows. Let $r_j^1$ and $r_j^2$ be end 1 and 2 of read j, and let $m_j^1$ and $m_j^2$ be the discordant mappings of each end. Mappings $m_j^1$ and $m_j^2$ were generated by a partial alignment of $r_j^1$ and $r_j^2$. First truncate $r_j^1$ and $r_j^2$ to the maximum aligned lengths in $m_j^1$ and $m_j^2$ respectively, to form $tr_j^1$ and $tr_j^2$ respectively. For each mapping $m_{jk}^1 \in m_j^1$ align $tr_j^2$ to the 1000nt region adjacent to $m_{jk}^1$ in the genome (downstream if the strand of $m_{jk}^1$ is "+", upstream if the strand of $m_{jk}^1$ is "-"). Repeat for $m_{jk}^2 \in m_j^2$ and $tr_j^1$. Calculate csj as the maximum of all scores identified using the above procedure.

Letting P (cj = 0) = πc, we can calculate P (cj = 0|csj) using Bayes' rule.

\[
P(c_j = 0 | cs_j) = \frac{P(cs_j | c_j = 0)\, \pi_c}{P(cs_j | c_j = 0)\, \pi_c + P(cs_j | c_j = 1)\, (1 - \pi_c)}
\]

We estimate P (csj|cj = 0) from alignments of reads to random locations in the genome. We first uniformly sample 1/1000 reads from the WGSS data. For each sampled read, we produce copies of the read truncated to lengths ranging from 20nt to the length of the read. We then align the truncated reads to genomic locations selected uniformly and at random. For each truncation length, we use the samples to calculate a density using Gaussian kernel density estimation with bandwidth 1. Let $f^\ell$ be the resulting density for truncation length $\ell$. Since csj is the result of multiple trials, we cannot naively use $f^\ell$ to calculate P (csj|cj = 0). We instead calculate extreme value distributions based on $f^\ell$, and use these to calculate P (csj|cj = 0). Let t be the number of trials used to calculate the maximum concordant alignment score, and let $f_t^\ell(x)$ represent the probability of attaining maximum concordant alignment score x after t trials. $f_t^\ell(cs_j)$ can be calculated from $f_1^\ell = f^\ell$ and the cumulative distribution $F_1^\ell$ using the following recursion.

\[
f_t^\ell = f_{t-1}^\ell \cdot F_1^\ell + F_{t-1}^\ell \cdot f_1^\ell - f_{t-1}^\ell \cdot f_1^\ell
\]

We then estimate $P(cs_j | c_j = 0) = f_t^\ell(cs_j)$.
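The recursion for the distribution of the maximum score can be illustrated with a short Python sketch. It assumes, for illustration only, that scores take values on a discrete grid so that the density is a probability mass function stored as a list; the function name `max_score_pmf` is hypothetical. For such a discrete distribution the result should agree with the closed form F_1(x)^t − F_1(x−1)^t.

```python
from itertools import accumulate


def max_score_pmf(pmf, trials):
    """Distribution of the maximum of `trials` independent draws from `pmf`.

    pmf    : list where pmf[x] is the probability of score x in a single trial
    trials : number of independent trials t (must be at least 1)
    """
    if trials < 1:
        raise ValueError("trials must be at least 1")
    cdf_1 = list(accumulate(pmf))          # F_1
    f_t, cdf_t = list(pmf), list(cdf_1)    # f_1 and F_1
    for _ in range(trials - 1):
        # f_t = f_{t-1} * F_1 + F_{t-1} * f_1 - f_{t-1} * f_1
        f_t = [f_prev * big_f1 + big_f_prev * f1 - f_prev * f1
               for f_prev, big_f_prev, f1, big_f1 in zip(f_t, cdf_t, pmf, cdf_1)]
        cdf_t = list(accumulate(f_t))
    return f_t


if __name__ == "__main__":
    single = [0.5, 0.3, 0.2]
    print(max_score_pmf(single, 3))
```

The subtracted term removes the double counting of the event that both the previous maximum and the new trial equal x.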

We estimate P (csj|cj = 1) from concordant alignments. We ﬁrst uniformly sample 1/1000 reads from the WGSS data. For each sampled read, we produce copies of the read truncated to lengths ranging from 20nt to the length of the read. We then align the truncated reads to genomic locations given by the concordant alignments of the original non-truncated reads. Given that read j is concordant, the alignment score csj represents the score produced by aligning read j to its single location of origin. Thus for P (csj|cj = 1) we do not use an extreme value distribution. Instead, for each truncation length we ﬁt a negative binomial distribution to the samples and use that to estimate P (csj|cj = 1).

Finally, we estimate πc using the EM algorithm. Let J be the number of potentially discordant reads. The expected value of the log likelihood function with respect to the conditional distribution of all cj ∈ C given $\pi_c^{(t)}$ and csj ∈ S is as follows.

The maximum likelihood estimates of πc yield the following update equation.

\[
\pi_c^{(t+1)} = \frac{\sum_{j} P(c_j = 0 | cs_j, \pi_c^{(t)})}{J}
\]

Thus we calculate P (cj = 0|csj) and πc by iteratively calculating $P(c_j = 0 | cs_j, \pi_c^{(t)})$ for all j (Expectation step) and using those values to calculate $\pi_c^{(t+1)}$ (Maximization step), repeating until convergence.
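The EM iteration for πc can be sketched as follows. This is a minimal illustration, assuming the two likelihoods P(csj|cj = 0) and P(csj|cj = 1) have already been evaluated for each read and are strictly positive in at least one of the two cases; the function name `estimate_pi_c`, the starting value, and the convergence tolerance are assumptions.

```python
def estimate_pi_c(lik_discordant, lik_concordant, pi_init=0.5,
                  tol=1e-9, max_iter=1000):
    """Estimate pi_c = P(c_j = 0) by EM.

    lik_discordant[j] : P(cs_j | c_j = 0)
    lik_concordant[j] : P(cs_j | c_j = 1)
    Returns the estimate of pi_c and the posteriors P(c_j = 0 | cs_j).
    """
    def posteriors(pi):
        # E-step: Bayes' rule applied to each read.
        return [a * pi / (a * pi + b * (1.0 - pi))
                for a, b in zip(lik_discordant, lik_concordant)]

    pi = pi_init
    for _ in range(max_iter):
        post = posteriors(pi)
        # M-step: the new pi_c is the mean posterior over the J reads.
        new_pi = sum(post) / len(post)
        converged = abs(new_pi - pi) < tol
        pi = new_pi
        if converged:
            break
    return pi, posteriors(pi)


if __name__ == "__main__":
    pi_c, post = estimate_pi_c([0.9, 0.8, 0.7, 0.05], [0.1, 0.2, 0.3, 0.95])
    print(pi_c, post)
```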

B.2.2 Calculating P (dj = 1|asj, cj = 0)

Alignment scores asj are calculated using dynamic programming with aﬃne gap penalties as described in the previous section. A discordant read may have a marginal alignment score for the following reasons:

1. the read is poor quality

2. the region has polymorphisms compared to the reference genome

3. the read is non-genomic

4. the read is mapped to the wrong location

We would like to distinguish the first two possibilities from the last two possibilities. We do so by calculating P (dj = 1|asj, cj = 0), the probability that discordant read j has a valid discordant mapping given its alignment score (referred to herein as simply P (dj = 1|asj)). We can estimate P (asj|dj = 1) from concordant alignments similar to how we estimated P (csj|cj = 1) in the previous section. Unfortunately it is difficult to estimate P (asj|dj = 0). Thus we formulate the problem of calculating P (dj = 1|asj) as a learning problem with positive and unlabelled data and use a modified version of a previously described method [42]. Let indicator sj represent whether read j has been sampled. Then we can write,

\[
P(d_j = 1 | as_j) = \frac{P(s_j = 1 | as_j)}{P(s_j = 1 | d_j = 1)}
\]

We first sample m scores from the concordant alignments (see previous section), where m is the number of discordant alignments. We then build a k-nearest-neighbour classifier (k = 0.05m) from the unlabelled scores U from the discordant alignments, and the labelled positive scores L from the concordant alignments. The KNN classifier will yield a function g(asj) = P (sj = 1|asj). We then estimate c = P (sj = 1|dj = 1) from the labelled positive scores as described previously (estimator 1 from [42]).

\[
\hat{c} = \frac{1}{|L|} \sum_{as \in L} g(as)
\]
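The positive-unlabelled estimate can be sketched in plain Python as below. The sketch makes several simplifying assumptions that are not taken from the text: scores are one-dimensional numbers, the nearest-neighbour search is a brute-force sort, the estimate of c is computed on the same labelled scores used to build the classifier rather than on a held-out set, and the ratio is clipped to 1. All function names are hypothetical.

```python
def knn_labelled_fraction(score, labelled, unlabelled, k):
    """g(as): fraction of the k nearest scores that belong to the labelled set."""
    tagged = [(abs(s - score), 1) for s in labelled]
    tagged += [(abs(s - score), 0) for s in unlabelled]
    # Stable sort by distance only; labelled scores win exact ties.
    tagged.sort(key=lambda pair: pair[0])
    nearest = tagged[:k]
    return sum(label for _, label in nearest) / len(nearest)


def estimate_c(labelled, unlabelled, k):
    """c-hat: mean of g over the labelled positive scores."""
    total = sum(knn_labelled_fraction(s, labelled, unlabelled, k)
                for s in labelled)
    return total / len(labelled)


def prob_valid_mapping(score, labelled, unlabelled, k):
    """Estimate P(d_j = 1 | as_j) as g(as_j) / c-hat, clipped to at most 1."""
    c_hat = estimate_c(labelled, unlabelled, k)
    g = knn_labelled_fraction(score, labelled, unlabelled, k)
    return min(1.0, g / c_hat)


if __name__ == "__main__":
    concordant_scores = [10, 11, 12]          # labelled positives L
    discordant_scores = [10, 11, 12, 0, 1, 2]  # unlabelled U
    print(prob_valid_mapping(11, concordant_scores, discordant_scores, k=2))
    print(prob_valid_mapping(1, concordant_scores, discordant_scores, k=2))
```

In this toy input, unlabelled scores that resemble the concordant scores receive a probability near 1, while scores far from any concordant score receive a probability near 0.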

B.2.3 Calculating P (gij = 1|nmj, dj = 1)

We assume all discordant alignment mapping locations are equally likely. Thus,

\[
P(g_{ij} = 1 | nm_j, d_j = 1) = \frac{1}{nm_j}.
\]

B.2.4 Calculating P (bi|ni)

Calculation of P (bi|ni) is formulated as a positive unlabelled learning problem and uses a similar technique as described for calculating P (dj = 1|asj, cj = 0). We build a set of labelled positives as follows. Select 100,000 genomic positions uniformly at random, and identify the number of reads that span (spanning read count) those positions from concordant alignments. Remove positions not covered by any reads and re-sample to produce a set of m spanning read counts, where m is the number of breakpoint predictions. Build a k-nearest neighbour classiﬁer (k = 0.05m) from the labelled positive spanning read counts from concordant alignments and the unlabelled spanning read counts from the breakpoint predictions. Estimate P (bi|ni) as described above for P (dj = 1|asj, cj = 0).

B.2.5 Calculating P (bi|·)

Finally, we require an eﬃcient way of calculating a sum over all possible settings of the values of Gi, and we do this using dynamic programming similar to previously described methods [65]. Rearranging the expression for P (bi|·) we obtain the following.

Let f(p, q) be deﬁned as follows.

\[
f(p, q) = \sum_{G_i : \sum_j g_{ij} = p} \; \prod_{j=1}^{q} P(g_{ij} | nm_j, as_j, cs_j)
\]

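Then we can calculate f({0, 1, ..., |Gi|}, |Gi|) in $O(n^2)$ time using the following recurrence, where read q is the read added at step q.

\[
\begin{aligned}
f(0, 0) &= 1 \\
f(-1, q) &= 0 \\
f(q + 1, q) &= 0 \\
f(p, q) &= f(p - 1, q - 1) \cdot P(g_{iq} = 1 | nm_q, as_q, cs_q) \\
&\quad + f(p, q - 1) \cdot P(g_{iq} = 0 | nm_q, as_q, cs_q)
\end{aligned}
\]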

Given f({0, 1, ..., |Gi|}, |Gi|), we can calculate P (bi|·) as follows.

\[
P(b_i | \cdot) = \sum_{n_i = 0}^{|G_i|} P(b_i | n_i) \cdot f(n_i, |G_i|)
\]
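The dynamic programme above can be sketched in Python as follows. The sketch assumes that the per-read probabilities P(gij = 1|nmj, asj, csj) have already been computed and that P(bi|ni) is available as a callable; the names `count_distribution` and `breakpoint_probability` are hypothetical.

```python
def count_distribution(read_probs):
    """Return f(p, q) for q = len(read_probs) and p = 0..q.

    read_probs[j] is the probability that read j was generated by the
    breakpoint; f[p] is the probability that exactly p reads were.
    """
    f = [1.0]  # f(0, 0) = 1
    for prob in read_probs:
        nxt = [0.0] * (len(f) + 1)
        for p, value in enumerate(f):
            nxt[p] += value * (1.0 - prob)   # read not generated by breakpoint
            nxt[p + 1] += value * prob       # read generated by breakpoint
        f = nxt
    return f


def breakpoint_probability(read_probs, prob_true_given_count):
    """Sum P(b_i | n_i) * f(n_i, |G_i|) over all possible read counts n_i."""
    f = count_distribution(read_probs)
    return sum(prob_true_given_count(n) * f[n] for n in range(len(f)))


if __name__ == "__main__":
    # Toy prior: a breakpoint is considered true if at least one read supports it.
    print(breakpoint_probability([0.5, 0.5], lambda n: 1.0 if n >= 1 else 0.0))
```

Each read is processed once and updates at most |Gi| + 1 entries, which gives the quadratic running time stated above.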

B.3 Shortest alternating path algorithm

The algorithm we propose for finding the shortest alternating path follows the algorithm proposed by [21]. We first create a new directed graph H from the undirected breakpoint graph G as follows. For each vertex v in G, add two vertices, v_in and v_out, to H. If (v1, v2) is a breakpoint edge in G, add the edge (v1_in, v2_out) to H. If (v1, v2) is an adjacency edge in G, add the edge (v1_out, v2_in) to H. We then find a shortest alternating path in G by finding a shortest path in H.

Finding the shortest path in H is a non-trivial problem. Adjacency edges connect vertex v in G to vertices that a) are on the same chromosome as v, and b) have opposite direction to v. Thus G and also H can be considered dense. The algorithm we propose is similar to Dijkstra's algorithm, however it has the benefit of being faster than Dijkstra's algorithm for a constrained search in a dense graph. We constrain our shortest path search in two ways. First, we search for paths with length less than a maximum allowable distance. Second, we place a limit on the number of vertices relaxed by the algorithm.

Like Dijkstra's algorithm, we maintain a set of vertices S for which we know the shortest path from the starting vertex s to any vertex in S. Also like Dijkstra's algorithm, we maintain a priority queue of the neighbours of S that are closest to vertices in S. However, we refrain from maintaining a priority queue containing all vertices adjacent to S. Instead, we maintain a priority queue such that each vertex v in the priority queue is a vertex in S̄ (the vertices not in S) that is the closest neighbor in S̄ of some vertex in S. We also maintain, for each vertex v in the priority queue, a set p(v) of the vertices in S for which v is the closest neighbor in S̄.

Clearly, the vertex vnext at the top of the priority queue will be the same for our algorithm as for Dijkstra's algorithm. We relax vnext by adding it to S and recording the shortest path to vnext. We then maintain the priority queue by doing the following. For each vertex u in the set {vnext} ∪ p(vnext), find the vertex v in S̄ that is closest to u, and add v to the priority queue if it exists. For this task we require a list of the neighbours of u sorted by their distance from u. If the outgoing edges from u in H are breakpoint edges in G this is trivial, since there is only one outgoing edge. If the outgoing edges from u in H are adjacency edges in G, then we can simply use a sorted list of breakpoint positions, where the same sorted list need not be duplicated multiple times for different vertices.

Suppose that we wish to constrain our search to a maximum of k relaxation steps. At each relaxation step we select the vertex vnext from the top of the priority queue, and then add the next closest neighbors of vertices in p(vnext) to the priority queue. On average one new vertex will be added to the queue at each step, since multiple vertices added at one step imply no vertex was added during previous steps. Using a binary heap, priority queue updates will require O(k log k) operations. At each step we also find the next closest vertex v for every vertex u in the set p(vnext). In the worst case, for all k vertices in S at step k, the k − 1 other vertices in S will be closer than any vertex in S̄. Thus, finding the next closest vertex and updating p(vnext) will be worst case O(k^2), resulting in an overall worst case complexity of O(k^2). The algorithm will be beneficial if k ≪ n, where n is the number of vertices, since Dijkstra's algorithm would be worst case O(kn log n) complexity.
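The reduction from a shortest alternating path in G to a shortest path in H can be sketched as follows. Note that this sketch runs a standard Dijkstra search over H with a binary heap, not the lazily maintained priority queue described above, and it assumes that a path starts and ends with an adjacency edge. The function name, the edge representation as `(u, v, weight)` triples, and the parameters `max_dist` and `max_relax` are assumptions made for illustration. Vertex labels must be mutually comparable (for example all strings), because they are used to break ties in the heap.

```python
import heapq


def shortest_alternating_path(breakpoint_edges, adjacency_edges, source, target,
                              max_dist=float("inf"), max_relax=None):
    """Shortest path alternating between adjacency and breakpoint edges.

    Returns (distance, list of vertices) or None if no path satisfies the
    distance limit and the limit on the number of relaxed vertices.
    """
    h = {}

    def add_edge(a, b, w):
        h.setdefault(a, []).append((b, w))

    # Breakpoint edges leave an "in" copy and arrive at an "out" copy;
    # adjacency edges do the opposite, which forces alternation.
    for u, v, w in breakpoint_edges:
        add_edge((u, "in"), (v, "out"), w)
        add_edge((v, "in"), (u, "out"), w)
    for u, v, w in adjacency_edges:
        add_edge((u, "out"), (v, "in"), w)
        add_edge((v, "out"), (u, "in"), w)

    start, goal = (source, "out"), (target, "in")
    dist = {start: 0.0}
    prev = {}
    done = set()
    heap = [(0.0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if node in done:
            continue
        done.add(node)
        if node == goal:
            break
        if max_relax is not None and len(done) > max_relax:
            return None
        for nxt, w in h.get(node, []):
            nd = d + w
            if nd <= max_dist and nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                prev[nxt] = node
                heapq.heappush(heap, (nd, nxt))
    if goal not in done:
        return None
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return dist[goal], [vertex for vertex, _ in reversed(path)]


if __name__ == "__main__":
    bp = [("a", "b", 1)]
    adj = [("s", "a", 100), ("b", "t", 50), ("s", "t", 1000)]
    print(shortest_alternating_path(bp, adj, "s", "t"))
```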

B.4 Path search parameter βp

The parameter βp can be seen as a mixing parameter for the two types of edges in the shortest alternating path optimization problem. Lower values of βp (labelled lambda in Figure B.2) will place more importance on the genomic distances between breakpoints in a path (adjacency edges), whereas higher values will place more importance on breakpoint probabilities.

We sought to estimate the eﬀect of βp on path searches using the HCC1954 data. A range of βp values was selected, and the shortest alternating path algorithm was used to predict paths for each value of βp. We made the strong assumption that paths were valid if they were composed entirely of breakpoints validated in 3 previous studies [16, 152, 48].

We then used the resulting valid paths to estimate an optimal value of βp, and measure the effect of βp on the shortest path algorithm. The number of valid paths ranged from 11 to 14 across the range of βp values (Figure B.2a), with the greatest number predicted between 1427 and 4074. As expected, scores for the valid breakpoints decreased almost monotonically with βp (Figure B.2b). We also analyzed the number of vertices visited by the algorithm (visit count) for each valid path, as a proxy for the time spent on the search. Interestingly, the distribution of visit counts for valid paths was bimodal for lower values of βp, indicating that some proportion of the paths would take significantly longer to be identified at these βp values. At a value of βp = 5296, the distribution of visit counts becomes unimodal, and is at a local minimum. We selected βp = 6884, a maximum visit count of 300,000, and a maximum score of 30 as an adequate balance of sensitivity and running time. Interestingly, the value we selected for βp is close to the estimate of 6082 gained by fitting the distribution of intron lengths to an exponential distribution.

Figure B.2: Statistics for paths composed of previously validated breakpoints (valid paths), as identified using a range of values for lambda. In each panel the x-axis is lambda, with values 500, 650, 845, 1098, 1427, 1855, 2411, 3134, 4074, 5296, 6884, 8949, 11633, 15122, 19658, 25555, 33221 and 43187.
(a) Number of valid paths identified (y-axis: number of paths predicted).
(b) Distribution of scores for valid paths (y-axis: score of found paths).
(c) Distribution of the number of vertices visited by the algorithm for valid paths (y-axis: number of visited vertices, logarithmic scale).

Appendix C

Supplementary Material for ReMixT: Joint inference of genome structure and content in heterogeneous tumor samples

C.1 Overview

See Table C.1 for a description of each parameter/symbol used in the text.

Table C.1: Description of each parameter in the ReMixT method as used in the main text.

parameter   description
N           number of segments
M           number of populations
L           length of the genome
l_n         length of segment n
l_seg       regular segmentation length
A           reference adjacencies / edges
B           breakpoints / breakpoint edges
S           segments / segment edges
T           telomere adjacencies / edges
Q           bond edges, A ∪ B ∪ T
V           vertex set, W and extremities of dummy telomere segment
W           segment extremities
U           unobserved bond edges
H           genome graph
G           genome collection, set of genome instances
g           genome instance, mapping from edges to copy numbers
s_n         segment n
χ_i         allele indicator for heterozygous SNP i
η           haplotype block
η̄           alternate allele haplotype block
z_η         read count of haplotype block η
φ_n         proportion of genotypeable reads for segment n
k           measurement (1: major, 2: minor, 3: total)
µ_nk        expected read count of segment n, measurement k
x_nk        observed read count of segment n, measurement k
p_nkℓ       proportion of reads from allele ℓ of segment n contributing to measurement k
β           smoothing parameter of the prior

Table C.2 describes additional parameters/symbols used in the supplementary material.

Table C.2: Description of each additional parameter of the ReMixT method as used in the supplementary material.

parameter   description
B_n         copy number of segment n when in the null HMM state (conflicts with set of breakpoints in the main text)
π_n         segment specific copy number prior
A_n         segment specific copy number transition probability matrix
λ_a         parameter of amplification probability for construction of A_n
α_n(c)      probability of amplification for segment n with copy state c
λ_d         parameter of divergence probability for construction of A_n
ρ_n(c)      probability of divergence for segment n with copy state c
β           parameter of telomere probability for construction of A_n
η(c, c′)    probability of transition from copy state c to copy state c′ (conflicts with haplotype block in main text)

C.2 Expectation Maximization Method for Learning h

To infer the haploid read depth parameter h, we break the dependency between segments connected by breakpoints, and rely on a restricted model that includes only the dependencies between segments adjacent in the reference genome. We use a similar likelihood function, and then select a tractable graphical model (HMM) structure that matches the original problem as closely as possible. Hidden states in the HMM correspond to copy number matrices: for segment n, copy number state c is an M by 2 matrix, with entry $c_{m\ell}$ for the copy number of allele $\ell$ for clone m.

The HMM version of our problem requires a finite state space. We restrict the state space by imposing a maximum copy number, and a maximum copy number difference between clones. Segments with true copy number outside this finite state space are modeled with an additional null state ∅. Transitions into the null state are penalized. Segments in the null state are free to take on any copy number state bn in the infinite state space of all copy number matrices. Thus xn is dependent on both the copy number prior C and an independent copy number state bn. A uniform (improper) prior is placed over bn. In practice, only bn in the neighborhood of argmax_b p(xn|h, cn = b) are considered. The graphical model for the HMM is depicted in Figure C.1.

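[Figure C.1 diagram. Nodes shown: prior π1 and transition matrices A2, ..., An-1, An; hidden copy number states C1, C2, ..., Cn-1, Cn; null-state copy numbers B1, B2, ..., Bn-1, Bn; haploid read depth h; observed read counts x1, x2, ..., xn-1, xn.]

Figure C.1: Graphical model of the HMM version of the problem used for parameter learning.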

The full data log likelihood can be expressed as given by Equation C.1, where q(b, c) is given by Equation C.2.

\[
q(b, c) =
\begin{cases}
c & \text{if } c \neq \emptyset \\
b & \text{if } c = \emptyset
\end{cases}
\tag{C.2}
\]

We model the probability of a segment having copy number that is higher than the maximum copy number in our HMM state space with an exponential distribution over the length of the segment, with exponential parameter λa. Calculate αn(c), the probability of high level ampliﬁcation of segment n, as given by Equation C.3.

\[
\alpha_n(c) = \lambda_a \exp\left(-\lambda_a l_n I(c = \emptyset)\right) \tag{C.3}
\]

Additionally, we model the probability of a segment having copy number that is divergent between tumour clones with an exponential distribution over the length of the segment, with exponential parameter λd. Calculate the number of alleles with divergent copy number d(c) as given by Equation C.4, and the probability of divergence ρn(c) as given by Equation C.5.

\[
d(c) = \sum_{\ell} \prod_{m=2}^{M} \prod_{m'=2}^{M} I(c_{m\ell} = c_{m'\ell}) \tag{C.4}
\]

\[
\rho_n(c) = \lambda_d \exp\left(-\lambda_d l_n d(c)\right) \tag{C.5}
\]

Finally, we model the probability of copy number transitions between adjacent segments as an exponential distribution over the copy number difference between the segments. The parameter of the transition exponential, β, is set to the same value as the telomere penalty in the combinatorial problem. Calculate the number of telomere copies t(c, c′) at a transition from a segment with copy number c′ to a segment with copy number c as given by Equation C.6, and the probability of the transition as given by Equation C.7.

\[
t(c, c') = \sum_{m} \sum_{\ell} |c_{m\ell} - c'_{m\ell}| \tag{C.6}
\]

\[
\eta(c, c') = \beta \exp\left(-\beta\, t(c, c')\right) \tag{C.7}
\]

The prior probability over the copy number of the first segment p(c1 = c) is thus given by Equation C.8, and the transition probability from segment n − 1 to segment n is given by Equation C.9.

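\[
\begin{aligned}
\pi_1(c) &= p(c_1 = c) \\
&\propto \rho_1(c)\, \alpha_1(c)
\end{aligned}
\tag{C.8}
\]

\[
\begin{aligned}
A_n(c, c') &= p(c_n = c | c_{n-1} = c') \\
&\propto \rho_n(c)\, \alpha_n(c)\, \eta(c, c')
\end{aligned}
\tag{C.9}
\]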

The expected value of the log likelihood function with respect to the conditional distribution p(C|X, h(t−1), ·), excluding terms independent of h, is given by Equation C.10.

To calculate the joint-posterior marginal probabilities p(bn = b, cn = c|xn, h(t−1), ·), and thereby calculate Q(h, h(t−1)) (E-step), we use the forwards-backwards algorithm. The graphical model is a polytree, thus the forwards-backwards algorithm cannot be applied directly. Instead, we model both cn and bn jointly, and apply the forwards-backwards algorithm to the merged state space of cn and bn.

Typically the state space of the jointly modeled variables would be a cross product of the state space of each variable independently. For our model, such a state space would be unmanageably large. Fortunately a cross product state space can be avoided due to the mutually exclusive effect of cn and bn on the likelihood of xn, and given that cn and bn are independent of bn−1. For instance, when marginalizing over the space of cn and bn in the forwards pass, part of the marginalization can take place analytically (Equation C.11).

The backwards pass can be simplified similarly. In effect, we can consider a state space that includes all regular (≠ ∅) states for cn, concatenated with the additional states for bn. Entries of the transition matrix are given by Equation C.12.

\[
p(b_n = b, c_n = c | c_{n-1} = c') = p(c_n = c | c_{n-1} = c')\, p(b_n = b) \tag{C.12}
\]

We maximize Q(h, h(t−1)) with respect to h (M-Step) numerically. The gradient of Q(h, h(t−1)), used for numerical maximization, can be computed as given by Equations C.13-C.16.

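\[
\begin{aligned}
\frac{\partial Q(h, h^{(t-1)})}{\partial h_m}
  &= \sum_{n} \sum_{c \neq \emptyset} p(c_n = c \mid x_n, h^{(t-1)}, \cdot) \sum_{k=1}^{3} \frac{\partial}{\partial h_m} \log p(x_{nk} \mid h, c_n = c, \cdot) \\
  &\quad + \sum_{n} \sum_{b} p(b_n = b, c_n = \emptyset \mid x_n, h^{(t-1)}, \cdot) \sum_{k=1}^{3} \frac{\partial}{\partial h_m} \log p(x_{nk} \mid h, c_n = b, \cdot)
\end{aligned}
\tag{C.13}
\]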

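\[ \frac{\partial}{\partial h_m} \log p(x_{nk} \mid h, c_n, \cdot) = \frac{\partial}{\partial \mu_{nk}} \log p(x_{nk} \mid h, c_n, \cdot)\, \frac{\partial \mu_{nk}}{\partial h_m} \tag{C.14} \]
\[ \frac{\partial \mu_{nk}}{\partial h_m} = P_n l_n \frac{\partial \gamma_n}{\partial h_m} \tag{C.15} \]
\[ \frac{\partial \gamma_{nk}}{\partial h_m} = c_{nm} \tag{C.16} \]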

The negative binomial specific term is given by Equation C.17, and the Poisson specific term by Equation C.18.

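\[ \frac{\partial}{\partial \mu_{nk}} \log p(x_{nk} \mid h, c_n, \cdot) = \frac{x_{nk}}{\mu_{nk}} - \frac{r_k + x_{nk}}{r_k + \mu_{nk}} \tag{C.17} \]
\[ \frac{\partial}{\partial \mu_{nk}} \log p(x_{nk} \mid h, c_n, \cdot) = \frac{x_{nk}}{\mu_{nk}} - 1 \tag{C.18} \]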
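The derivatives in Equations C.17 and C.18 can be checked numerically against the corresponding log likelihoods. The sketch below assumes the mean and overdispersion parameterization of the negative binomial that appears in Equation C.19; all function names are hypothetical.

```python
import math


def nb_log_pmf(x, mu, r):
    """Negative binomial log probability with mean mu and overdispersion r."""
    return (
        math.lgamma(x + r) - math.lgamma(x + 1) - math.lgamma(r)
        + r * math.log(r) - r * math.log(r + mu)
        + x * math.log(mu) - x * math.log(r + mu)
    )


def poisson_log_pmf(x, mu):
    """Poisson log probability with mean mu."""
    return x * math.log(mu) - mu - math.lgamma(x + 1)


def nb_dlog_dmu(x, mu, r):
    """Equation C.17: derivative of the negative binomial log probability in mu."""
    return x / mu - (r + x) / (r + mu)


def poisson_dlog_dmu(x, mu):
    """Equation C.18: derivative of the Poisson log probability in mu."""
    return x / mu - 1.0


def central_difference(f, mu, eps=1e-6):
    """Numerical derivative used only to sanity check the analytic forms."""
    return (f(mu + eps) - f(mu - eps)) / (2.0 * eps)
```

Comparing `nb_dlog_dmu(x, mu, r)` with `central_difference(lambda m: nb_log_pmf(x, m, r), mu)` for a few arbitrary values is a quick way to catch sign errors before the terms are combined through the chain rule of Equation C.14.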

C.3 Estimating the overdispersion parameter r

We estimate the overdispersion parameter r offline from segment read counts. We assume that the majority of adjacent segments have the same genotype, and thus the same expected read depth γ. Under this assumption, we identify the r that maximizes the likelihood of the read count data (Equation C.19) for pairs of adjacent segments i and i + 1 with identical read depth γi. We use gradient descent to find a local optimum of the likelihood with respect to both r and γi, with partial derivatives with respect to r and γi calculated as given by Equations C.21 and C.20 respectively.

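\[
\begin{aligned}
\ell(r, l, p, \gamma)
  &= \sum_{i=1}^{N/2} \sum_{n=2i-1}^{2i} \log p(x_n \mid l_n, p_n, \gamma_i, r) \\
  &= \sum_{i=1}^{N/2} \sum_{n=2i-1}^{2i} \Big[ \log \Gamma(x_n + r) - \log(x_n!) - \log \Gamma(r) + r \log(r) \\
  &\qquad\qquad - r \log(r + l_n \gamma_i) + x_n \log(l_n \gamma_i) - x_n \log(r + l_n \gamma_i) \Big]
\end{aligned}
\tag{C.19}
\]
\[
\frac{\partial \ell(r, l, p, \gamma)}{\partial \gamma_i}
  = \sum_{n=2i-1}^{2i} \left[ -\frac{r\, l_n}{r + l_n \gamma_i} + \frac{x_n}{\gamma_i} - \frac{x_n l_n}{r + l_n \gamma_i} \right]
\tag{C.20}
\]
\[
\frac{\partial \ell(r, l, p, \gamma)}{\partial r}
  = \sum_{i=1}^{N/2} \sum_{n=2i-1}^{2i} \left[ \psi(x_n + r) - \psi(r) + \log(r) + 1 - \log(r + l_n \gamma_i) - \frac{r}{r + l_n \gamma_i} - \frac{x_n}{r + l_n \gamma_i} \right]
\tag{C.21}
\]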
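The pairwise likelihood of Equation C.19 and its partial derivatives (Equations C.20 and C.21) can be written compactly as below. This is a sketch under stated assumptions rather than our implementation: segments are stored 0-indexed so that pair i covers positions 2i and 2i + 1, the `digamma` helper is a hypothetical stand-in for ψ built from the standard recurrence and asymptotic series, and the optimizer itself is omitted.

```python
import math


def digamma(z):
    """Hypothetical helper for psi(z): upward recurrence, then asymptotic series."""
    result = 0.0
    while z < 6.0:
        result -= 1.0 / z
        z += 1.0
    inv2 = 1.0 / (z * z)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(z) - 0.5 / z - series


def log_likelihood(x, l, gamma, r):
    """Equation C.19 for read counts x, segment lengths l, pair depths gamma."""
    total = 0.0
    for i, g in enumerate(gamma):
        for n in (2 * i, 2 * i + 1):
            mu = l[n] * g
            total += (
                math.lgamma(x[n] + r) - math.lgamma(x[n] + 1) - math.lgamma(r)
                + r * math.log(r) - r * math.log(r + mu)
                + x[n] * math.log(mu) - x[n] * math.log(r + mu)
            )
    return total


def grad_gamma(x, l, gamma, r, i):
    """Equation C.20: partial derivative with respect to gamma_i."""
    g = gamma[i]
    total = 0.0
    for n in (2 * i, 2 * i + 1):
        denom = r + l[n] * g
        total += -r * l[n] / denom + x[n] / g - x[n] * l[n] / denom
    return total


def grad_r(x, l, gamma, r):
    """Equation C.21: partial derivative with respect to r."""
    total = 0.0
    for i, g in enumerate(gamma):
        for n in (2 * i, 2 * i + 1):
            denom = r + l[n] * g
            total += (
                digamma(x[n] + r) - digamma(r) + math.log(r) + 1.0
                - math.log(denom) - r / denom - x[n] / denom
            )
    return total
```

These gradients could be passed to any gradient-based optimizer; since the objective is a likelihood to be maximized, a descent routine would be applied to its negative.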

C.4 Independence of Segment Read Counts

Previously, [116] modelled segment read counts as a single draw from a multinomial likelihood. Suppose, in addition to the multinomial, we model the total number of reads T measured by the sequencing experiment as a Poisson distributed random variable with unknown mean λ. The joint likelihood of multinomial distributed read counts xj and total read count T given proportions πj and expected total read count λ can be written as given by Equation C.22. Introduce variables λj such that λ = Σj λj and πj = λj / λ. Then Equation C.22 can be rewritten as given by Equation C.23, which is the likelihood of J Poisson distributed independent random variables with means λj [47]. Thus, our use of independent Poisson likelihoods can be seen as equivalent to the multinomial used by [116] if we also assume the total read count of the experiment is a Poisson distributed random variable.

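\[ p(T, x_1, \ldots, x_J \mid \lambda, \pi_1, \ldots, \pi_J) \propto e^{-\lambda} \lambda^{T} \prod_{j} \pi_j^{x_j} \tag{C.22} \]
\[ p(x_1, \ldots, x_J \mid \lambda_1, \ldots, \lambda_J) \propto e^{-\lambda} \prod_{j} \lambda_j^{x_j} \tag{C.23} \]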
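The equivalence between Equations C.22 and C.23 can be verified numerically, including the normalizing constants that the proportionality signs omit. The sketch below uses hypothetical function names and arbitrary example values; it compares the log probability of a Poisson total combined with a multinomial split against the sum of independent Poisson log probabilities.

```python
import math


def log_poisson(x, lam):
    """Poisson log probability of count x with mean lam."""
    return x * math.log(lam) - lam - math.lgamma(x + 1)


def log_multinomial(xs, pis):
    """Multinomial log probability of counts xs with proportions pis."""
    total = sum(xs)
    return (
        math.lgamma(total + 1)
        - sum(math.lgamma(x + 1) for x in xs)
        + sum(x * math.log(p) for x, p in zip(xs, pis))
    )


def log_total_and_multinomial(xs, lam, pis):
    """Left-hand model: Poisson total T = sum(xs), multinomial split (C.22)."""
    return log_poisson(sum(xs), lam) + log_multinomial(xs, pis)


def log_independent_poissons(xs, lams):
    """Right-hand model: independent Poisson counts with means lams (C.23)."""
    return sum(log_poisson(x, lam) for x, lam in zip(xs, lams))


# Arbitrary illustrative values.
lams = [12.0, 30.0, 8.0]
lam = sum(lams)
pis = [v / lam for v in lams]
xs = [10, 33, 7]
difference = log_total_and_multinomial(xs, lam, pis) - log_independent_poissons(xs, lams)
```

Up to floating point error, `difference` is zero for any choice of counts and means, because the T! terms cancel and λ^T ∏ πj^xj = ∏ λj^xj.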

C.5 Parameters used for existing methods

C.5.1 Theta2.0

Theta requires an existing segmentation, thus we merged adjacent segments with identical copy number states to form a collection of perfect segments. We then counted reads within those segments and used those read counts as input to Theta. We used the --NUM_INTERVALS 15 option to allow for reasonable running time, and the --FORCE argument to force Theta to run on some of the genomes which were sub-optimal candidates for Theta analysis. We then used Octave to execute the runBAFGaussianModel function to select a single solution from potentially multiple optimal solutions. Only the solution with 2 tumour clones was considered, as that is what was simulated.

C.5.2 Titan

As input to Titan, we calculated counts of reads contained within regular 1000 bp segments, and counts of reads supporting the reference and non-reference allele for heterozygous germline SNPs. We then ran Titan without correcting for GC and mappability (as this was not simulated). We used multiple initializations, with normal contamination from 0 to 1 in increments of 0.1, and ploidy from 1 to 4 in increments of 1. The number of Titan clusters was fixed at 2. We then selected the solution with the lowest S_Dbw validity index. Since the parameterization for Titan is slightly different than for other tools, we converted Titan's estimates of the tumour clone prevalences t1 and t2, and the normal contamination estimate n, into a mixture as follows.

mixture = [n, (1 − n) × t2, (1 − n) × |t1 − t2|] (C.24)
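Equation C.24 translates directly into code. The function name below is hypothetical and the input values are purely illustrative.

```python
def titan_to_mixture(n, t1, t2):
    """Equation C.24: convert Titan estimates into [normal, clone, clone] fractions.

    n  : normal contamination estimate
    t1 : prevalence estimate of the first tumour clone cluster
    t2 : prevalence estimate of the second tumour clone cluster
    """
    return [n, (1 - n) * t2, (1 - n) * abs(t1 - t2)]


# Illustrative values only: 20% normal contamination, prevalences 1.0 and 0.6.
mixture = titan_to_mixture(0.2, 1.0, 0.6)
```

For these illustrative inputs the three components are approximately 0.2, 0.48 and 0.32.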

C.5.3 CloneHD

As input to CloneHD, we calculated counts of reads contained within regular 1000 bp segments, and counts of reads supporting the reference and non-reference allele for heterozygous germline SNPs (b-allele frequency (BAF) data). We followed the series of steps outlined in run_example.sh.

1. use ﬁlterhd to analyze the normal read depth data for technical read depth modulation

2. use ﬁlterhd to analyze the tumour read depth data to get a benchmark of the log likelihood

3. use ﬁlterhd to analyze the tumour read depth data with the bias estimate from the normal

4. use ﬁlterhd to analyze the tumour BAF data

5. use clonehd to infer copy number based on tumour read depth and BAF data.
