CANCER FUSION DISCOVERY THROUGH INTEGRATIVE GENOMIC ANALYSIS

A DISSERTATION

SUBMITTED TO THE INTERDISCIPLINARY CANCER BIOLOGY PROGRAM

AND THE COMMITTEE ON GRADUATE STUDIES

OF STANFORD UNIVERSITY

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS

FOR THE DEGREE OF

DOCTOR OF PHILOSOPHY

Craig Patrick Giacomini

May 2013

© 2013 by Craig Patrick Giacomini. All Rights Reserved. Re-distributed by Stanford University under license with the author.

This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License. http://creativecommons.org/licenses/by-nc/3.0/us/

This dissertation is online at: http://purl.stanford.edu/ht558fh2632

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Jonathan Pollack, Primary Adviser

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Calvin Kuo

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Joseph Lipsick

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Robert West

Approved for the Stanford University Committee on Graduate Studies.

Patricia J. Gumport, Vice Provost Graduate Education

This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.


ABSTRACT

Recurrent gene fusions and chromosomal translocations have long been recognized for their roles in the pathogenesis of hematological and mesenchymal malignancies. Over the last several years, it has become increasingly apparent that gene fusions also exist in epithelial cancers, including prostate, lung, and gastric carcinomas. Advances in genomic technologies, including microarray and sequencing-based approaches, have facilitated the discovery of these alterations (discussed further in Chapter 1). This dissertation employs these genomic approaches to discover and characterize novel gene fusions in several cancer subtypes. First, we developed a “breakpoint analysis” pipeline to discover candidate gene fusions from DNA microarrays by screening for tell-tale transcript level or genomic DNA copy number transitions occurring within genes. Applying this approach to a collection of nearly 1,000 human cancer specimens, we discovered and characterized twelve new gene fusions in diverse cancer types, including angiosarcoma, pancreatic cancer, and colon cancer (Chapter 2).

Separately, we performed RNA sequencing on a series of 36 breast cancer specimens for gene fusion discovery. Using a suite of computational tools developed in-house, we discovered ~350 candidate gene rearrangements, including recurrent fusions of the sterile 20 (STE20)-like kinase TAOK1. Functional studies suggest that these TAOK1 fusions encode potent oncoproteins that drive carcinogenesis (Chapter 3). Together, this work provides a deeper understanding of the biology of gene fusions in cancer pathogenesis, yields many novel therapeutic targets, and may help scientists gain a foothold in the fight against these diseases.


ACKNOWLEDGMENTS

I have had significant help and support from many people during my graduate career at Stanford University, and the work in this dissertation would not have been possible without these individuals. First and foremost, I would like to thank my PI, Dr. Jonathan Pollack, for his tremendous guidance over these many years. He has helped me grow as a physician scientist, and he is truly a great friend, advisor, and role model as I move forward with my career. I have had the opportunity to work with extraordinarily accomplished and brilliant colleagues in the Pollack laboratory, and I am very grateful for the many helpful scientific discussions and for the friendships I have made. I would like to thank my parents, John and Kathleen Giacomini, for their love, guidance, and support throughout my life. Lastly, I want to thank my beautiful wife Marilyn. I am truly grateful for all of the life experiences we have shared, from high school graduation to college graduation and now to my graduation from Stanford’s MD/PhD program. It has been a long road, and I am looking forward to continuing my journey with you.


TABLE OF CONTENTS

Abstract
Acknowledgments
CHAPTER 1. Introduction: Gene Fusions and Cancer
    History of gene fusion discovery
    Gene fusions in sarcomas
    Gene fusions in uncommon epithelial cancers
    Recent discovery of gene fusions in common carcinomas
    References
CHAPTER 2. Breakpoint Analysis Uncovers Novel Gene Fusions Spanning Multiple Cancer Types
    Abstract
    Author Summary
    Introduction
    Results
        Microarray datasets
        Breakpoint Analysis
        Novel ROS1 rearrangements in angiosarcoma and epithelioid hemangioendothelioma
        Discovery of APIP/SLC1A2 in colon cancer
        Novel RAF kinase rearrangements in pancreatic cancer and anaplastic astrocytoma
        Discovery and characterization of EWSR1/CREM in melanoma
        Identification of FAM133B/CDK6 in T-ALL
        Rearrangement of CLTC and VMP1 occurs in multiple cancer types
        Novel cell line models for EGFRvIII and FIP1L1/PDGFRA
    Discussion
    Materials and methods
    References
CHAPTER 3. Transcriptome Sequencing Uncovers Recurrent TAOK1 Rearrangements in Breast Cancer
    Abstract
    Introduction
    Results
        Genomic Datasets
        Gene Rearrangement Discovery
        Discovery and characterization of TAOK1 rearrangements in breast cancer
    Discussion
    Materials and methods
    References
CHAPTER 4. Conclusions and Future Directions
    References


TABLES

Table 1.1. Characteristic gene fusions in hematological cancers, sarcomas, and carcinomas
Table 2.1. Validated gene fusions and rearrangements
Table 2.S1. Candidate rearrangements nominated by RBA
Table 2.S2. Candidate rearrangements nominated by DBA
Table 2.S3. Sarcoma subtypes included on TMA
Table 2.S4. Effect of filtering parameters on DBA analysis of bone cancer cell lines
Table 2.S5. RT-PCR primers (for validation of candidate fusions)
Table 3.S1. Breast cancer cell lines
Table 3.S2. Candidate gene fusions
Table 3.S3. Candidate intragenic rearrangements
Table 3.S4. Candidate gene-intergenic rearrangements


FIGURES

Figure 1.1. CML karyotype and the Philadelphia chromosome
Figure 1.2. Two major classes of gene fusions in human cancers
Figure 1.3. History of gene fusion discovery
Figure 2.1. Breakpoint analysis for discovering novel cancer gene rearrangements
Figure 2.2. Discovery and characterization of CEP85L/ROS1 in angiosarcoma
Figure 2.3. Discovery of APIP/SLC1A2 in colon cancer
Figure 2.4. Identification and characterization of novel RAF1 gene fusions in pancreatic cancer and anaplastic astrocytoma
Figure 2.5. Discovery and characterization of EWSR1/CREM in melanoma
Figure 2.6. Identification and characterization of FAM133B/CDK6 in J.RT3-T3.5
Figure 2.7. DBA discovery of recurrent rearrangements of CLTC and VMP1 across diverse cancer types
Figure 2.8. Discovery of new cell line models for the known rearrangements, EGFRvIII and FIP1L1/PDGFRA
Figure 2.S1. Datasets and cancer types included for breakpoint analysis
Figure 2.S2. RBA for discovery of gene fusions
Figure 2.S3. DBA pipeline for gene fusion discovery
Figure 2.S4. RBA rediscovery of known gene fusions in various cancers
Figure 2.S5. DBA rediscovery of known gene fusions in various cancers
Figure 3.1. RNA Seq gene rearrangement discovery pipeline
Figure 3.2. TAOK1 rearrangement discovery in breast cancer
Figure 3.3. TAOK1 rearrangements in primary breast cancer
Figure 3.4. TAOK1 rearrangements encode an active kinase
Figure 3.5. TAOK1 fusion knockdown inhibits cell growth


CHAPTER 1

INTRODUCTION: GENE FUSIONS AND CANCER


HISTORY OF GENE FUSION DISCOVERY

The German biologist Theodor Heinrich Boveri noted that sea urchin embryos developed abnormally following errors during mitosis. In the early 1900s, he proposed the critical hypothesis that cancer might similarly result from errors in cell division that produce abnormal chromosomes[1]. The next several decades brought significant advances in cytogenetic technologies, which led to discoveries substantiating Boveri’s theory. In 1958, Rothfels and Siminovitch published a new air-drying cytogenetic technique for flattening chromosomes[2]. In 1960, Peter Nowell and David Hungerford applied this approach to discover a recurring small chromosome in chronic myelogenous leukemia (CML). This alteration, the first evidence for a genetic origin of human cancer, was termed the Philadelphia chromosome after the city in which it was discovered[3-5] (Figure 1.1).

In the 1970s, the Swedish cytologist and geneticist Torbjörn Oskar Caspersson pioneered the development of chromosome banding techniques[6], and these methods were used to discover that the Philadelphia chromosome resulted from a translocation between chromosomes 9 and 22[7]. Further characterization revealed that this translocation created the gene fusion BCR/ABL1, which was subsequently shown to drive oncogenesis through constitutively active tyrosine kinase signaling[8-17]. Since its discovery, BCR/ABL1 has been used clinically as a diagnostic marker for CML as well as a therapeutic target. In particular, the treatment of CML has been revolutionized by the development of the BCR/ABL1 tyrosine kinase inhibitor imatinib mesylate[18,19].


In the 1970s, Zech et al. discovered a recurrent translocation between chromosomes 8 and 14 in Burkitt’s lymphoma[20]. The identity of the genes involved in this translocation remained elusive until 1982, when it was demonstrated that t(8;14) juxtaposes the immunoglobulin heavy chain (IGH) gene with the oncogenic transcription factor v-myc myelocytomatosis viral oncogene homolog (avian) (MYC)[21-23]. Unlike BCR/ABL1, which encodes a chimeric protein with domains derived from its constituent genes, IGH/MYC functions to drive high-level expression of the MYC oncoprotein through the influence of IGH regulatory elements (Figure 1.2). This aberrant MYC expression is a necessary component of malignant transformation in Burkitt’s lymphoma[24].

With further improvements in cytogenetic methodologies, additional recurrent chromosomal rearrangements and gene fusions were discovered in various hematological cancer types. In 1977, a recurrent translocation between chromosomes 15 and 17 was discovered in acute promyelocytic leukemia (APL) and was later found to create the gene fusion PML/RARA[25-27]. All-trans retinoic acid (ATRA) and arsenic trioxide are major molecular therapies for APL and function by targeting PML/RARA to cause differentiation and apoptosis in cancer cells[28-31]. In 1991, the RUNX1/RUNX1T1 gene fusion was found to characterize a fraction of acute myeloid leukemias (AML)[32,33]. RUNX1 is considered a promiscuously rearranged gene, with over 10 fusion partner genes now discovered[34]. Many additional recurrent gene fusions were discovered in the late 20th and early 21st centuries, including FIP1L1/PDGFRA in chronic eosinophilic leukemia and NPM1/ALK in anaplastic large cell lymphoma[35,36] (Table 1.1). The vast majority of these alterations were discovered in hematological cancers (Figure 1.3).

GENE FUSIONS IN SARCOMAS

The solid sarcomas represent an uncommon but clinically and morphologically heterogeneous group of neoplasms arising from bone, cartilage, or connective tissues. From a genetic perspective, these malignancies have traditionally been classified into two major subtypes: those with near-diploid karyotypes and those with complex and unbalanced karyotypes. The latter category comprises the majority of sarcomas and includes most angiosarcomas, leiomyosarcomas, and peripheral nerve sheath tumors. These cancers are characterized by complex patterns of chromosomal gains, losses, and rearrangements, but very few gene fusions have been discovered in them. In contrast, mesenchymal malignancies with simple genomes account for approximately one-third of all sarcomas and include Ewing’s sarcoma, synovial sarcoma, and congenital fibrosarcoma. These tumors often harbor only a single defining cytogenetic abnormality, and gene fusions frequently characterize these neoplasms[37-40].

In the 1980s, it was noted that Ewing’s sarcoma and related primitive neuroectodermal tumors (PNET) share a recurrent translocation between chromosomes 11 and 22[41-44]. In 1992, Delattre et al. discovered that this translocation creates the gene fusion EWSR1/FLI1[45]. Studies have demonstrated that this alteration juxtaposes a transcriptional transactivating domain from EWSR1 to the DNA binding domain of FLI1. EWSR1/FLI1 drives sarcomagenesis by enhancing transcription of various target genes that regulate cellular proliferation, invasion, and apoptosis[46-53]. It was later discovered that while 85% of Ewing’s sarcomas harbor EWSR1/FLI1, similar gene fusions are present in most other cases, including EWSR1/ERG, EWSR1/ETV1, and EWSR1/ETV4[54-58].

Approximately 40 gene fusions have subsequently been discovered in various sarcoma subtypes[59]. For example, the majority of alveolar rhabdomyosarcomas are associated with recurrent translocations fusing genes from the PAX family of transcription factors to FOXO1A. The specific gene fusion holds prognostic importance in these cancers: PAX3/FOXO1A-positive tumors behave more aggressively than those with PAX7/FOXO1A gene fusions, though the tumors are morphologically indistinguishable[60-65]. Myxoid-round cell liposarcoma is the second most common form of liposarcoma, after well-differentiated/dedifferentiated liposarcoma, and is characterized by the gene fusion FUS/DDIT3 (accounting for 90% of cases) or EWSR1/DDIT3[66-70]. Low grade fibromyxoid sarcoma is associated with a characteristic t(7;16) or t(11;16) juxtaposing FUS to either CREB3L2 or CREB3L1[71-73]. Additional gene fusions characterizing various sarcoma subtypes are outlined in Table 1.1.

GENE FUSIONS IN UNCOMMON EPITHELIAL CANCERS

Although the vast majority of recurrent gene fusions discovered in the 1900s were identified in leukemias and sarcomas, a few rearrangements were found in uncommon epithelial cancers. Notably, thyroid carcinoma was one of the first epithelial tumor types in which a gene fusion, CCDC6/RET, was detected[74]. Since this discovery, a large number of rearrangements involving RET have been discovered in papillary thyroid carcinoma, all of which encode constitutively active tyrosine kinases[75-77]. In the 1980s, karyotypic analysis led to the discovery of a recurrent translocation in a fraction of pleomorphic adenomas, slow-growing epithelial tumors that account for more than 50% of salivary gland tumors[78]. Nearly 20 years later, Kas et al. discovered that this translocation leads to the formation of the CTNNB1/PLAG1 fusion gene[79]. This rearrangement drives overexpression of the oncogenic zinc finger protein PLAG1 through the strong promoter of CTNNB1. Additional epithelial gene fusions include PRCC/TFE3 in renal cell carcinoma, BRD/NUT in midline carcinoma, PAX8/PPARG in follicular thyroid carcinoma, and ETV6/NTRK3 in secretory breast cancer[80] (Table 1.1).

RECENT DISCOVERY OF GENE FUSIONS IN COMMON CARCINOMAS

Until recently, recurrent gene fusions were considered rare in common carcinomas such as breast, prostate, lung, and colon cancers. Two major explanations have been proposed for the relative scarcity of identified fusions in these epithelial tumors[81]. First, recurrent fusion genes might be rare because common epithelial tumors exhibit different mechanisms of genomic instability, such as chromosome instability giving rise to chromosome gains and losses without recurrent breakpoints. Second, fusion genes might not be rare, but might remain unidentified because cytogenetic analysis of epithelial tumors is challenging and because rampant chromosome instability masks key recurrent rearrangements. Recent discoveries support the latter hypothesis (Table 1.1).

In 2005, Tomlins et al. used DNA microarrays to demonstrate that two ETS family oncogenic transcription factors, ETV1 and ERG, exhibit high-level, “outlier” expression in a fraction of prostate cancers[82]. Because these genes had previously been found to form gene fusions in other malignancies such as Ewing’s sarcoma, the authors hypothesized that the observed high-level expression occurred secondary to novel chromosomal rearrangements. Further characterization revealed that the promoter and 5’ untranslated region (UTR) of the prostate-expressed and androgen-regulated gene TMPRSS2 are fused to either ERG or ETV1 in these samples. While TMPRSS2/ERG is the most common gene fusion in prostate cancer, several additional rearrangements between prostate-specific gene promoters and ETS transcription factor genes have been discovered (e.g. TMPRSS2/ETV4, SLC45A3/ETV1, FLJ35294/ETV1)[83]. Functional characterization demonstrated that these alterations drive carcinogenesis by enhancing transcription of target genes that regulate various malignant processes, including cellular proliferation and invasion[83,84]. In total, ETS gene fusions are estimated to occur in approximately 40-60% of prostate cancers. Notably, given the high prevalence of prostate cancer, these alterations represent the most frequently occurring class of recurrent gene fusions discovered to date in any human cancer[82].

7

The identification of TMPRSS2/ETS sparked major efforts to discover and characterize novel oncogenic gene fusions in other common epithelial malignancies. In 2007, through a functional screen using retroviral cDNA expression libraries, Soda et al. discovered that a small fraction of non-small-cell lung cancers (NSCLC) harbor the gene fusion EML4/ALK[85]. This alteration is estimated to occur in approximately 2-7% of NSCLC patients and is the target of the novel anti-cancer agent crizotinib[86]. In 2010, Palanisamy et al. used paired-end RNA sequencing to discover that a small fraction of prostate cancers, gastric cancers, and melanomas harbor recurrent gene fusions involving RAF1 and BRAF[87]. These rearrangements encode oncogenic, constitutively active serine/threonine kinases, and in vitro experiments suggest that malignancies harboring these alterations are responsive to the anti-cancer agent sorafenib. In 2011, Tao et al. used high-density array-based comparative genomic hybridization (aCGH) microarrays to discover a novel gene fusion between the promoter and 5’UTR of the highly expressed CD44 and the glutamate transporter SLC1A2[88]. CD44/SLC1A2 occurs in approximately 2% of gastric cancers, and functional studies suggest that this alteration drives carcinogenesis by enhancing intracellular glutamate transport, which subsequently promotes cellular proliferation, invasion, and anchorage-independent growth.

These recent discoveries suggest that additional rearrangements remain to be identified in other cancer types. Advances in genomic technologies have facilitated gene fusion discovery; notably, the majority of these recently discovered alterations were identified using such technologies. Furthermore, major genome centers and consortia, including The Cancer Genome Atlas Project (TCGA) and the Wellcome Trust Sanger Institute, have been using these genomic methodologies to profile large numbers of cancer specimens and have made these data publicly available[89,90].

The major goal of this dissertation is to discover and characterize novel gene fusions using various genomic technologies. In Chapter 2, I describe a novel DNA microarray-based approach for gene fusion discovery. This approach is based on detecting tell-tale expression level or genomic DNA copy number transitions disrupting specific genes. Using this method, I discovered twelve new gene fusions in diverse cancer types. Many of these alterations represent the first gene fusions discovered to date in the corresponding malignancy, and for several I go on to demonstrate their oncogenic roles and potential as therapeutic targets. In Chapter 3, I develop a computational pipeline to discover novel cancer gene rearrangements from transcriptome sequencing (RNA Seq) data. Applying this pipeline to a collection of breast cancer specimens, I discovered approximately 350 candidate gene rearrangements, including a recurrent fusion of the TAOK1 kinase. I verified its expression and demonstrated oncogene dependency.
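As a rough illustration of the idea behind detecting an intragenic transition (and not the pipeline itself, which is described in Chapter 2), the sketch below scans probe-level values ordered 5’ to 3’ along a single gene and reports the split point with the largest difference in mean signal between the two sides. The function names, the use of a simple difference in means, and the thresholds `min_flank` and `min_shift` are all assumptions made for this example.

```python
from statistics import mean


def best_breakpoint(values, min_flank=2):
    """Find the split with the largest mean difference between the two sides.

    `values` are probe-level measurements (for example log2 ratios) ordered
    5' to 3' along one gene. Returns (index, shift), where `index` is the
    position of the first probe on the 3' side and `shift` is the 3' mean
    minus the 5' mean. Returns (None, 0.0) if no split shows any shift or
    the gene has too few probes.
    """
    best_index, best_shift = None, 0.0
    # Require at least `min_flank` probes on each side of the split.
    for i in range(min_flank, len(values) - min_flank + 1):
        shift = mean(values[i:]) - mean(values[:i])
        if abs(shift) > abs(best_shift):
            best_index, best_shift = i, shift
    return best_index, best_shift


def is_candidate(values, min_shift=1.0, min_flank=2):
    """Flag a gene whose best split exceeds a hypothetical shift threshold."""
    index, shift = best_breakpoint(values, min_flank)
    return index is not None and abs(shift) >= min_shift


# Illustrative values only: low signal 5' of the split, high signal 3' of it.
step_gene = [0.1, -0.1, 0.0, 0.1, 2.0, 2.1, 1.9, 2.2]
flat_gene = [0.0, 0.1, -0.1, 0.05, 0.0, -0.05]
print(best_breakpoint(step_gene)[0], is_candidate(step_gene), is_candidate(flat_gene))
```

A difference in means is used here only because it is easy to follow; a real analysis would also need to account for probe noise, sample-to-sample variation, and multiple testing across genes.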

Taken together, my work provides new methods to discover cancer gene fusions, and through these approaches, I have uncovered several novel alterations. Many of these rearrangements represent potential therapeutic targets with clinical implications for patients, and many represent the first gene fusions in corresponding tumor types. These findings provide a better understanding of the roles of gene fusions in oncogenesis and highlight a more widespread role of these alterations in cancer.


REFERENCES

1. Boveri T (1914) Zur Frage der Entstehung maligner Tumoren. Gustav Fischer, Jena: 1-64.
2. Rothfels KH, Siminovitch L (1958) An air-drying technique for flattening chromosomes in mammalian cells grown in vitro. Stain Technol 33: 73-77.
3. Nowell PC, Hungerford DA (1960) A minute chromosome in human chronic granulocytic leukemia [Abstract]. Science 132: 1497.
4. Nowell PC, Hungerford DA (1961) Chromosome studies in human leukemia. II. Chronic granulocytic leukemia. J Natl Cancer Inst 27: 1013-1035.
5. Tough IM, Court Brown WM, Baikie AG, Buckton KE, Harnden DG, et al. (1961) Cytogenetic studies in chronic myeloid leukaemia and acute leukaemia associated with mongolism. Lancet 1: 411-417.
6. Caspersson T, Lomakka G, Zech L (1972) The 24 fluorescence patterns of the human metaphase chromosomes - distinguishing characters and variability. Hereditas 67: 89-102.
7. Rowley JD (1973) Letter: A new consistent chromosomal abnormality in chronic myelogenous leukaemia identified by quinacrine fluorescence and Giemsa staining. Nature 243: 290-293.
8. Groffen J, Stephenson JR, Heisterkamp N, de Klein A, Bartram CR, et al. (1984) Philadelphia chromosomal breakpoints are clustered within a limited region, bcr, on chromosome 22. Cell 36: 93-99.
9. Groffen J, Heisterkamp N, Stephenson JR, van Kessel AG, de Klein A, et al. (1983) c-sis is translocated from chromosome 22 to chromosome 9 in chronic myelocytic leukemia. J Exp Med 158: 9-15.
10. Bartram CR, de Klein A, Hagemeijer A, Grosveld G, Heisterkamp N, et al. (1984) Localization of the human c-sis oncogene in Ph1-positive and Ph1-negative chronic myelocytic leukemia by in situ hybridization. Blood 63: 223-225.
11. de Klein A, van Kessel AG, Grosveld G, Bartram CR, Hagemeijer A, et al. (1982) A cellular oncogene is translocated to the Philadelphia chromosome in chronic myelocytic leukaemia. Nature 300: 765-767.


12. Canaani E, Gale RP, Steiner-Saltz D, Berrebi A, Aghai E, et al. (1984) Altered transcription of an oncogene in chronic myeloid leukaemia. Lancet 1: 593-595.
13. Konopka JB, Watanabe SM, Witte ON (1984) An alteration of the human c-abl protein in K562 leukemia cells unmasks associated tyrosine kinase activity. Cell 37: 1035-1042.
14. Ben-Neriah Y, Daley GQ, Mes-Masson AM, Witte ON, Baltimore D (1986) The chronic myelogenous leukemia-specific P210 protein is the product of the bcr/abl hybrid gene. Science 233: 212-214.
15. Gishizky ML, Witte ON (1992) Initiation of deregulated growth of multipotent progenitor cells by bcr-abl in vitro. Science 256: 836-839.
16. Pierce A, Spooncer E, Wooley S, Dive C, Francis JM, et al. (2000) Bcr-Abl protein tyrosine kinase activity induces a loss of p53 protein that mediates a delay in myeloid differentiation. Oncogene 19: 5487-5497.
17. Hariharan IK, Harris AW, Crawford M, Abud H, Webb E, et al. (1989) A bcr-v-abl oncogene induces lymphomas in transgenic mice. Mol Cell Biol 9: 2798-2805.
18. Druker BJ, Sawyers CL, Kantarjian H, Resta DJ, Reese SF, et al. (2001) Activity of a specific inhibitor of the BCR-ABL tyrosine kinase in the blast crisis of chronic myeloid leukemia and acute lymphoblastic leukemia with the Philadelphia chromosome. N Engl J Med 344: 1038-1042.
19. Druker BJ, Talpaz M, Resta DJ, Peng B, Buchdunger E, et al. (2001) Efficacy and safety of a specific inhibitor of the BCR-ABL tyrosine kinase in chronic myeloid leukemia. N Engl J Med 344: 1031-1037.
20. Zech L, Haglund U, Nilsson K, Klein G (1976) Characteristic chromosomal abnormalities in biopsies and lymphoid-cell lines from patients with Burkitt and non-Burkitt lymphomas. Int J Cancer 17: 47-56.
21. Dalla-Favera R, Bregni M, Erikson J, Patterson D, Gallo RC, et al. (1982) Human c-myc onc gene is located on the region of chromosome 8 that is translocated in Burkitt lymphoma cells. Proc Natl Acad Sci U S A 79: 7824-7827.


22. ar-Rushdi A, Nishikura K, Erikson J, Watt R, Rovera G, et al. (1983) Differential expression of the translocated and the untranslocated c-myc oncogene in Burkitt lymphoma. Science 222: 390-393.
23. Nishikura K, ar-Rushdi A, Erikson J, Watt R, Rovera G, et al. (1983) Differential expression of the normal and of the translocated human c-myc oncogenes in B cells. Proc Natl Acad Sci U S A 80: 4822-4826.
24. Adams JM, Harris AW, Pinkert CA, Corcoran LM, Alexander WS, et al. (1985) The c-myc oncogene driven by immunoglobulin enhancers induces lymphoid malignancy in transgenic mice. Nature 318: 533-538.
25. Rowley JD, Golomb HM, Vardiman J, Fukuhara S, Dougherty C, et al. (1977) Further evidence for a non-random chromosomal abnormality in acute promyelocytic leukemia. Int J Cancer 20: 869-872.
26. Kakizuka A, Miller WH, Jr., Umesono K, Warrell RP, Jr., Frankel SR, et al. (1991) Chromosomal translocation t(15;17) in human acute promyelocytic leukemia fuses RAR alpha with a novel putative transcription factor, PML. Cell 66: 663-674.
27. de The H, Lavau C, Marchio A, Chomienne C, Degos L, et al. (1991) The PML-RAR alpha fusion mRNA generated by the t(15;17) translocation in acute promyelocytic leukemia encodes a functionally altered RAR. Cell 66: 675-684.
28. Castaigne S, Chomienne C, Daniel MT, Ballerini P, Berger R, et al. (1990) All-trans retinoic acid as a differentiation therapy for acute promyelocytic leukemia. I. Clinical results. Blood 76: 1704-1709.
29. Powell BL, Moser B, Stock W, Gallagher RE, Willman CL, et al. (2010) Arsenic trioxide improves event-free and overall survival for adults with acute promyelocytic leukemia: North American Leukemia Intergroup Study C9710. Blood 116: 3751-3757.
30. Chen GQ, Zhu J, Shi XG, Ni JH, Zhong HJ, et al. (1996) In vitro studies on cellular and molecular mechanisms of arsenic trioxide (As2O3) in the treatment of acute promyelocytic leukemia: As2O3 induces NB4 cell apoptosis with downregulation of Bcl-2 expression and modulation of PML-RAR alpha/PML proteins. Blood 88: 1052-1061.
31. Huang ME, Ye YC, Chen SR, Chai JR, Lu JX, et al. (1988) Use of all-trans retinoic acid in the treatment of acute promyelocytic leukemia. Blood 72: 567-572.
32. Gao J, Erickson P, Gardiner K, Le Beau MM, Diaz MO, et al. (1991) Isolation of a yeast artificial chromosome spanning the 8;21 translocation breakpoint t(8;21)(q22;q22.3) in acute myelogenous leukemia. Proc Natl Acad Sci U S A 88: 4882-4886.
33. Nucifora G, Rowley JD (1995) AML1 and the 8;21 and 3;21 translocations in acute and chronic myeloid leukemia. Blood 86: 1-14.
34. Roulston D, Espinosa R, 3rd, Nucifora G, Larson RA, Le Beau MM, et al. (1998) CBFA2(AML1) translocations with novel partner chromosomes in myeloid leukemias: association with prior therapy. Blood 92: 2879-2885.
35. Morris SW, Kirstein MN, Valentine MB, Dittmer KG, Shapiro DN, et al. (1994) Fusion of a kinase gene, ALK, to a nucleolar protein gene, NPM, in non-Hodgkin's lymphoma. Science 263: 1281-1284.
36. Cools J, DeAngelo DJ, Gotlib J, Stover EH, Legare RD, et al. (2003) A tyrosine kinase created by fusion of the PDGFRA and FIP1L1 genes as a therapeutic target of imatinib in idiopathic hypereosinophilic syndrome. N Engl J Med 348: 1201-1214.
37. Mitelman F, Johansson B, Mertens F (2004) Fusion genes and rearranged genes as a linear function of chromosome aberrations in cancer. Nat Genet 36: 331-334.
38. Helman LJ, Meltzer P (2003) Mechanisms of sarcoma development. Nat Rev Cancer 3: 685-694.
39. Mertens F, Antonescu CR, Hohenberger P, Ladanyi M, Modena P, et al. (2009) Translocation-related sarcomas. Semin Oncol 36: 312-323.
40. Taylor BS, Barretina J, Maki RG, Antonescu CR, Singer S, et al. (2011) Advances in sarcoma genomics and new therapeutic targets. Nat Rev Cancer 11: 541-557.


41. Iselius L, Lindsten J, Aurias A, Fraccaro M, Bastard C, et al. (1983) The 11q;22q translocation: a collaborative study of 20 new cases and analysis of 110 families. Hum Genet 64: 343-355.
42. Whang-Peng J, Triche TJ, Knutsen T, Miser J, Douglass EC, et al. (1984) Chromosome translocation in peripheral neuroepithelioma. N Engl J Med 311: 584-585.
43. Vigfusson NV, Allen LJ, Phillips JH, Alschibaja T, Riches WG (1986) A neuroendocrine tumor of the small intestine with a karyotype of 46,XY,t(11;22). Cancer Genet Cytogenet 22: 211-218.
44. Aurias A, Rimbaut C, Buffe D, Dubousset J, Mazabraud A (1983) [Translocation of chromosome 22 in Ewing's sarcoma]. C R Seances Acad Sci III 296: 1105-1107.
45. Delattre O, Zucman J, Plougastel B, Desmaze C, Melot T, et al. (1992) Gene fusion with an ETS DNA-binding domain caused by chromosome translocation in human tumours. Nature 359: 162-165.
46. May WA, Gishizky ML, Lessnick SL, Lunsford LB, Lewis BC, et al. (1993) Ewing sarcoma 11;22 translocation produces a chimeric transcription factor that requires the DNA-binding domain encoded by FLI1 for transformation. Proc Natl Acad Sci U S A 90: 5752-5756.
47. May WA, Lessnick SL, Braun BS, Klemsz M, Lewis BC, et al. (1993) The Ewing's sarcoma EWS/FLI-1 fusion gene encodes a more potent transcriptional activator and is a more powerful transforming gene than FLI-1. Mol Cell Biol 13: 7393-7398.
48. Siligan C, Ban J, Bachmaier R, Spahn L, Kreppel M, et al. (2005) EWS-FLI1 target genes recovered from Ewing's sarcoma chromatin. Oncogene 24: 2512-2524.
49. Ohno T, Rao VN, Reddy ES (1993) EWS/Fli-1 chimeric protein is a transcriptional activator. Cancer Res 53: 5859-5863.
50. Prieur A, Tirode F, Cohen P, Delattre O (2004) EWS/FLI-1 silencing and gene profiling of Ewing cells reveal downstream oncogenic pathways and a crucial role for repression of insulin-like growth factor binding protein 3. Mol Cell Biol 24: 7275-7283.
51. Richter GH, Plehm S, Fasan A, Rossler S, Unland R, et al. (2009) EZH2 is a mediator of EWS/FLI1 driven tumor growth and metastasis blocking endothelial and neuro-ectodermal differentiation. Proc Natl Acad Sci U S A 106: 5324-5329.
52. Lessnick SL, Braun BS, Denny CT, May WA (1995) Multiple domains mediate transformation by the Ewing's sarcoma EWS/FLI-1 fusion gene. Oncogene 10: 423-431.
53. Luo W, Gangwal K, Sankar S, Boucher KM, Thomas D, et al. (2009) GSTM4 is a microsatellite-containing EWS/FLI target involved in Ewing's sarcoma oncogenesis and therapeutic resistance. Oncogene 28: 4126-4132.
54. Sorensen PH, Lessnick SL, Lopez-Terrada D, Liu XF, Triche TJ, et al. (1994) A second Ewing's sarcoma translocation, t(21;22), fuses the EWS gene to another ETS-family transcription factor, ERG. Nat Genet 6: 146-151.
55. Jeon IS, Davis JN, Braun BS, Sublett JE, Roussel MF, et al. (1995) A variant Ewing's sarcoma translocation (7;22) fuses the EWS gene to the ETS gene ETV1. Oncogene 10: 1229-1234.
56. Kaneko Y, Yoshida K, Handa M, Toyoda Y, Nishihira H, et al. (1996) Fusion of an ETS-family gene, EIAF, to EWS by t(17;22)(q12;q12) chromosome translocation in an undifferentiated sarcoma of infancy. Genes Chromosomes Cancer 15: 115-121.
57. Peter M, Couturier J, Pacquement H, Michon J, Thomas G, et al. (1997) A new member of the ETS family fused to EWS in Ewing tumors. Oncogene 14: 1159-1164.
58. Urano F, Umezawa A, Hong W, Kikuchi H, Hata J (1996) A novel chimera gene between EWS and E1A-F, encoding the adenovirus E1A enhancer-binding protein, in extraosseous Ewing's sarcoma. Biochem Biophys Res Commun 219: 608-612.

59. Mitelman F, Johansson B, Mertens F (2009) Mitelman Database of Chromosome Aberrations in Cancer. http://cgap.nci.nih.gov/Chromosomes/Mitelman.
60. Sorensen PH, Lynch JC, Qualman SJ, Tirabosco R, Lim JF, et al. (2002) PAX3-FKHR and PAX7-FKHR gene fusions are prognostic indicators in alveolar rhabdomyosarcoma: a report from the children's oncology group. J Clin Oncol 20: 2672-2679.
61. Davis RJ, D'Cruz CM, Lovell MA, Biegel JA, Barr FG (1994) Fusion of PAX7 to FKHR by the variant t(1;13)(p36;q14) translocation in alveolar rhabdomyosarcoma. Cancer Res 54: 2869-2872.
62. Galili N, Davis RJ, Fredericks WJ, Mukhopadhyay S, Rauscher FJ, 3rd, et al. (1993) Fusion of a fork head domain gene to PAX3 in the solid tumour alveolar rhabdomyosarcoma. Nat Genet 5: 230-235.
63. Barr FG, Qualman SJ, Macris MH, Melnyk N, Lawlor ER, et al. (2002) Genetic heterogeneity in the alveolar rhabdomyosarcoma subset without typical gene fusions. Cancer Res 62: 4704-4710.
64. Kelly KM, Womer RB, Sorensen PH, Xiong QB, Barr FG (1997) Common and variant gene fusions predict distinct clinical phenotypes in rhabdomyosarcoma. J Clin Oncol 15: 1831-1836.
65. Anderson J, Gordon T, McManus A, Mapp T, Gould S, et al. (2001) Detection of the PAX3-FKHR fusion gene in paediatric rhabdomyosarcoma: a reproducible predictor of outcome? Br J Cancer 85: 831-835.
66. Sreekantaiah C, Karakousis CP, Leong SP, Sandberg AA (1992) Cytogenetic findings in liposarcoma correlate with histopathologic subtypes. Cancer 69: 2484-2495.
67. Aman P, Ron D, Mandahl N, Fioretos T, Heim S, et al. (1992) Rearrangement of the transcription factor gene CHOP in myxoid liposarcomas with t(12;16)(q13;p11). Genes Chromosomes Cancer 5: 278-285.
68. Crozat A, Aman P, Mandahl N, Ron D (1993) Fusion of CHOP to a novel RNA-binding protein in human myxoid liposarcoma. Nature 363: 640-644.

69. Rabbitts TH, Forster A, Larson R, Nathan P (1993) Fusion of the dominant negative transcription regulator CHOP with a novel gene FUS by translocation t(12;16) in malignant liposarcoma. Nat Genet 4: 175-180.
70. Panagopoulos I, Hoglund M, Mertens F, Mandahl N, Mitelman F, et al. (1996) Fusion of the EWS and CHOP genes in myxoid liposarcoma. Oncogene 12: 489-494.
71. Mertens F, Fletcher CD, Antonescu CR, Coindre JM, Colecchia M, et al. (2005) Clinicopathologic and molecular genetic characterization of low-grade fibromyxoid sarcoma, and cloning of a novel FUS/CREB3L1 fusion gene. Lab Invest 85: 408-415.
72. Reid R, de Silva MV, Paterson L, Ryan E, Fisher C (2003) Low-grade fibromyxoid sarcoma and hyalinizing spindle cell tumor with giant rosettes share a common t(7;16)(q34;p11) translocation. Am J Surg Pathol 27: 1229-1236.
73. Storlazzi CT, Mertens F, Nascimento A, Isaksson M, Wejde J, et al. (2003) Fusion of the FUS and BBF2H7 genes in low grade fibromyxoid sarcoma. Hum Mol Genet 12: 2349-2358.
74. Pierotti MA, Santoro M, Jenkins RB, Sozzi G, Bongarzone I, et al. (1992) Characterization of an inversion on the long arm of chromosome 10 juxtaposing D10S170 and RET and creating the oncogenic sequence RET/PTC. Proc Natl Acad Sci U S A 89: 1616-1620.
75. Pierotti MA (2001) Chromosomal rearrangements in thyroid carcinomas: a recombination or death dilemma. Cancer Lett 166: 1-7.
76. Fugazzola L, Pilotti S, Pinchera A, Vorontsova TV, Mondellini P, et al. (1995) Oncogenic rearrangements of the RET proto-oncogene in papillary thyroid carcinomas from children exposed to the Chernobyl nuclear accident. Cancer Res 55: 5617-5620.
77. Rabes HM, Demidchik EP, Sidorow JD, Lengfelder E, Beimfohr C, et al. (2000) Pattern of radiation-induced RET and NTRK1 rearrangements in 191 post-Chernobyl papillary thyroid carcinomas: biological, phenotypic, and clinical implications. Clin Cancer Res 6: 1093-1103.
78. Bullerdiek J, Raabe G, Boschen C, Bartnitzke S (1987) Translocation (3;8;8)(p22 or p23;p23;q12) in a case of pleomorphic adenoma: similarity to a primary cytogenetic abnormality detected in an endometrial adenocarcinoma. Cancer Genet Cytogenet 27: 177-180.
79. Kas K, Voz ML, Roijer E, Astrom AK, Meyen E, et al. (1997) Promoter swapping between the genes for a novel zinc finger protein and beta-catenin in pleiomorphic adenomas with t(3;8)(p21;q12) translocations. Nat Genet 15: 170-174.
80. Brenner JC, Chinnaiyan AM (2009) Translocations in epithelial cancers. Biochim Biophys Acta 1796: 201-215.
81. Mitelman F, Johansson B, Mertens F (2007) The impact of translocations and gene fusions on cancer causation. Nat Rev Cancer 7: 233-245.
82. Tomlins SA, Rhodes DR, Perner S, Dhanasekaran SM, Mehra R, et al. (2005) Recurrent fusion of TMPRSS2 and ETS transcription factor genes in prostate cancer. Science 310: 644-648.
83. Tomlins SA, Laxman B, Dhanasekaran SM, Helgeson BE, Cao X, et al. (2007) Distinct classes of chromosomal rearrangements create oncogenic ETS gene fusions in prostate cancer. Nature 448: 595-599.
84. Klezovitch O, Risk M, Coleman I, Lucas JM, Null M, et al. (2008) A causal role for ERG in neoplastic transformation of prostate epithelium. Proc Natl Acad Sci U S A 105: 2105-2110.
85. Soda M, Choi YL, Enomoto M, Takada S, Yamashita Y, et al. (2007) Identification of the transforming EML4-ALK fusion gene in non-small-cell lung cancer. Nature 448: 561-566.
86. Kwak EL, Bang YJ, Camidge DR, Shaw AT, Solomon B, et al. (2010) Anaplastic lymphoma kinase inhibition in non-small-cell lung cancer. N Engl J Med 363: 1693-1703.

87. Palanisamy N, Ateeq B, Kalyana-Sundaram S, Pflueger D, Ramnarayanan K, et al. (2010) Rearrangements of the RAF kinase pathway in prostate cancer, gastric cancer and melanoma. Nat Med 16: 793-798.
88. Tao J, Deng NT, Ramnarayanan K, Huang B, Oh HK, et al. (2011) CD44-SLC1A2 gene fusions in gastric cancer. Sci Transl Med 3: 77ra30.
89. McLendon R, Friedman A, Bigner D, Van Meir EG, Brat DJ, et al. (2008) Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature 455: 1061-1068.
90. Bignell GR, Greenman CD, Davies H, Butler AP, Edkins S, et al. (2010) Signatures of mutation and selection in the cancer genome. Nature 463: 893-898.

FIGURE LEGENDS

Figure 1.1. CML karyotype and the Philadelphia chromosome. The Philadelphia chromosome (red arrow) is created by a translocation between chromosomes 9 and 22, creating the BCR/ABL1 gene fusion. The reciprocal translocation is also depicted (blue arrow).

Figure 1.2. Two major classes of gene fusions in human cancers. Gene fusions can drive high-level expression of a proto-oncogene through the transcriptional regulatory elements of the 5’ partner (A) as exemplified by IGH/MYC in Burkitt’s lymphoma or can create chimeric proteins (B) as exemplified by BCR/ABL1 in chronic myelogenous leukemia.

Figure 1.3. History of gene fusion discovery. Dates of major gene fusion discoveries in the 20th and 21st centuries are depicted in chronological order.

Table 1.1. Characteristic gene fusions in hematological cancers, sarcomas, and carcinomas.

Figure 1.1. CML karyotype and the Philadelphia chromosome.

Figure 1.2. Two major classes of gene fusions in human cancers[81].

Figure 1.3. History of gene fusion discovery [1, 3-7, 20-23, 25-27, 45, 82, 85].

Table 1.1. Characteristic gene fusions in hematological cancers, sarcomas, and carcinomas

Chromosome rearrangement   Gene fusion     Cancer subtype

Hematological Cancers
t(9;22)(q34;q11)           BCR/ABL1        Chronic myeloid leukemia
t(8;14)(q24;q32)           IGH/MYC         Burkitt’s lymphoma
t(15;17)(q22;q21)          PML/RARA        Acute promyelocytic leukemia
t(8;21)(q22;q22)           RUNX1/RUNX1T1   Acute myeloid leukemia (M2)
del(4)(q12q12)             FIP1L1/PDGFRA   Chronic eosinophilic leukemia
t(2;5)(p23;q35)            NPM1/ALK        Anaplastic large cell lymphoma
t(1;22)(p13;q13)           RBM15/MKL1      Acute megakaryoblastic leukemia
t(12;21)(p13;q22)          ETV6/RUNX1      B-cell acute lymphoblastic leukemia
t(14;16)(q32;q23)          IGH/MAF         Multiple myeloma

Sarcomas
t(11;22)(q24;q12)          EWSR1/FLI1      Ewing’s sarcoma
t(2;13)(q35;q14)           PAX3/FOXO1A     Alveolar rhabdomyosarcoma
t(12;16)(q13;p11)          FUS/DDIT3       Myxoid/round cell liposarcoma
t(7;16)(q33;p11)           FUS/CREB3L2     Low-grade fibromyxoid sarcoma
t(12;22)(q13;q12)          EWSR1/ATF1      Clear cell sarcoma
t(12;15)(p13;q25)          ETV6/NTRK3      Congenital fibrosarcoma
t(11;22)(p13;q12)          EWSR1/WT1       Desmoplastic round cell tumor
t(7;17)(p15;q11)           JAZF1/SUZ12     Endometrial stromal sarcoma
t(X;18)(p11;q11)           SS18/SSX1       Synovial sarcoma

Carcinomas
t(3;8)(p21;q12)            CTNNB1/PLAG1    Pleomorphic adenoma
t(X;1)(p11;q21)            PRCC/TFE3       Renal cell carcinoma
t(15;19)(q13;p13.1)        BRD4/NUT        Aggressive midline carcinoma
t(2;3)(q13;p25)            PAX8/PPARG      Follicular thyroid cancer
t(12;15)(p13;q25)          ETV6/NTRK3      Secretory breast cancer
del/ins(21)(q22.2)         TMPRSS2/ERG     Prostate cancer
inv(2)(p23;p21)            EML4/ALK        NSCLC
inv(11)(p13;p13)           CD44/SLC1A2     Gastric cancer

CHAPTER 2 BREAKPOINT ANALYSIS UNCOVERS NOVEL GENE FUSIONS SPANNING MULTIPLE CANCER TYPES

This chapter is a publication with the following citation: Craig P. Giacomini, Steven Sun, Sushama Varma, A. Hunter Shain, Marilyn M. Giacomini, Jay Balagtas, Robert T. Sweeney, Everett Lai, Catherine A. Del Vecchio, Andrew D. Forster, Nicole Clarke, Kelli D. Montgomery, Shirley Zhu, Albert J. Wong, Matt van de Rijn, Robert B. West, Jonathan R. Pollack. Breakpoint analysis uncovers novel gene fusions spanning multiple cancer types. PLoS Genet. 2013 Apr;9(4):e1003464. Epub 2013 Apr 25.

ABSTRACT

Gene fusions, like BCR/ABL1 in chronic myelogenous leukemia, have long been recognized in hematologic and mesenchymal malignancies. The recent finding of gene fusions in prostate and lung cancers has motivated the search for pathogenic gene fusions in other malignancies. Here, we developed a “breakpoint analysis” pipeline to discover candidate gene fusions by tell-tale transcript level or genomic DNA copy number transitions occurring within genes. Mining data from 974 diverse cancer samples, we identified 198 candidate fusions involving annotated cancer genes. From these, we validated and further characterized novel gene fusions involving ROS1 tyrosine kinase in angiosarcoma (CEP85L/ROS1), SLC1A2 glutamate transporter in colon cancer (APIP/SLC1A2), RAF1 kinase in pancreatic cancer (ATG7/RAF1) and anaplastic astrocytoma (BCL6/RAF1), EWSR1 in melanoma (EWSR1/CREM), CDK6 kinase in T-cell acute lymphoblastic leukemia (FAM133B/CDK6), and CLTC in breast cancer (CLTC/VMP1). Notably, while these fusions involved known cancer genes, all occurred with novel fusion partners and in previously unreported cancer types. Moreover, several constituted druggable targets (including kinases), with therapeutic implications for their respective malignancies. Lastly, breakpoint analysis identified new cell line models for known rearrangements, including EGFRvIII and FIP1L1/PDGFRA. Taken together, we provide a robust approach for gene fusion discovery, and our results highlight a more widespread role of fusion genes in cancer pathogenesis.

AUTHOR SUMMARY

Gene fusions represent an important class of cancer genes, created by rearrangements of the genome that bring together two different genes. Because they are unique to cancer cells, gene fusions are ideal diagnostic markers and therapeutic targets. While gene fusions were once thought restricted mainly to blood cancers, recent discoveries suggest they are more widespread. Here, we have developed an approach for mining DNA microarray data to detect the tell-tale signatures of gene fusions, as “breakpoints” occurring within the encoding DNA or expressed transcripts. We apply this approach to a large collection of nearly 1,000 human cancer specimens. From this analysis, we discover and verify twelve new gene fusions occurring in diverse cancer types. We verify that some of these rearrangements recur in other samples of the same cancer type (supporting a causal role), and that the cancers show dependency on the fusion for cancer cell growth. Notably, some of these fusions (e.g. CEP85L/ROS1 in angiosarcoma) represent the first for that cancer type, and thus provide important new biological insight. Some are also good drug targets (including rearrangements of ROS1, RAF1 and CDK6 kinases), with clear implications for therapy.

INTRODUCTION

During cancer development and progression, chromosomal rearrangements frequently lead to the juxtaposition of two previously separate genes. The resulting gene fusions often play major roles in oncogenesis and generally fall into two categories. In the first, promoter or enhancer elements are juxtaposed to a proto-oncogene, resulting in aberrant overexpression of an oncogenic protein (e.g. IGH/MYC). In the second category, the coding sequences of two genes are combined, leading to the formation of a chimeric protein with new or altered activity (e.g. BCR/ABL1) [1]. Pathogenic gene fusions characterize many hematological and mesenchymal neoplasms [2,3]. However, recent studies have demonstrated that epithelial malignancies can also harbor recurrent gene fusions, including ETS rearrangements in prostate cancer and EML4/ALK in non-small cell lung cancer [4-6]. Notably, gene fusions are used clinically for diagnosis and prognostication and can be important therapeutic targets, for example imatinib targeting BCR/ABL1 and crizotinib targeting EML4/ALK [7,8].

Gene fusions frequently represent markers for specific cancer subtypes. For example, chronic myelogenous leukemia (CML) is characterized by the Philadelphia chromosome and the resulting BCR/ABL1 gene fusion, while acute promyelocytic leukemia (APL) is characterized by RARA rearrangement [9-12]. However, certain gene fusions occur across multiple cancer types (i.e. “multi-tumor” rearrangements). For example, ETV6/NTRK3 has been described in secretory breast cancer, congenital fibrosarcoma, acute myeloid leukemia, and other malignancies [13-16]. Similarly, oncogenic rearrangements of the RAF kinases, RAF1 and BRAF, have been found in various cancers including pilocytic astrocytoma, melanoma, gastric cancer, and prostate cancer [17,18]. Such multi-tumor rearrangements suggest that cancers arising from distinct cell types and tissues might nonetheless represent related disease entities belonging to a common molecular grouping.

Advancements in genomic technologies have facilitated gene fusion discovery. Next-generation genomic and transcriptome sequencing have been used to discover novel gene rearrangements in prostate cancer, lung cancer, colon cancer, and melanoma [19-26]. Microarray-based approaches have been used to discover novel gene fusions in gastric cancer, prostate cancer, and leukemia [4,27,28]. Furthermore, major genome centers and consortiums, including The Cancer Genome Atlas Project (TCGA) and the Wellcome Trust Sanger Institute, have been using these genomic methodologies to profile large numbers of cancer specimens and have made these data publicly available [29-31]. In the modern era of cancer genomics, a major goal will be to mine these large datasets for the discovery of novel pathogenic alterations that drive oncogenesis. We hypothesized that novel multi-tumor rearrangements exist across various cancers and should be discoverable in large genomic datasets. Here we describe the development of a pipeline for the detection of these alterations based on the identification of tell-tale rearrangement “breakpoints” in transcriptome and genomic data. We apply this method to both publicly-available microarray datasets as well as data generated in our laboratory. As a proof of concept, we successfully rediscovered several known gene fusions. More significantly, we nominate and subsequently validate several novel gene fusions spanning multiple human cancer types.

RESULTS

Microarray datasets

For our breakpoint analysis (detailed below), we mined transcriptome data from 92 exon microarray experiments, together representing 12 different cancer types (Figure 2.S1). Our laboratory generated 16 of these profiles, which included several specimens with known rearrangements to optimize our methodology, as well as various cancer types where gene fusions had yet to be discovered. The remaining data were obtained from published studies [32,33]. In particular, we focused on datasets of established cancer cell lines, so that we could readily obtain the samples for validation and follow-up experiments. Separately, we mined genomic profiles from 882 high-density array-based comparative genomic hybridization (aCGH) experiments (Figure 2.S1). Of these samples, 812 were generated from the Wellcome Trust Sanger Institute’s Cancer Genome Project, which included cancer cell lines from 29 distinct tissue sites [31]. The remaining profiles were generated in our laboratory [34] and comprised 70 pancreatic cancer cell lines and early-passage xenografts.

Breakpoint Analysis

To nominate candidate gene fusions from transcriptome data (using exon microarrays), we developed an approach which we termed RNA breakpoint analysis (RBA) (Figure 2.1 and Figure 2.S2). Other groups have proposed similar methods, although in limited application to detect known fusions [35-37], or very recently with some success in discovering novel fusions [38,39]. Our strategy was to identify transcript “breakpoints”, i.e. significant transitions in expression level between proximal and distal exons. These transitions might reflect elevated expression of the exons proximal (for 5’ fusion partners) or distal (for 3’ fusion partners) to a gene fusion junction. To identify such transitions, we implemented a “walking” Student’s t-test, comparing expression levels of proximal and distal exons (testing all possible exonic breakpoints), for all assayed transcripts (Figure 2.S2A, B). Because such transitions might be present due to reasons other than rearrangement (e.g. alternative splicing), we applied additional filters to enrich for true positives (see Materials and Methods), including applying a stringent Bonferroni correction to adjust for multiple gene testing. We also limited our analysis to candidate breakpoints that disrupted genes known to be rearranged in human cancer, as defined by the Cancer Gene Census [40]. Though we might miss some novel genes, we reasoned, as have others [4], that as a starting point this gene set would be enriched for true positives, and for novel “multi-tumor” gene fusions that might span multiple cancer types.
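As a minimal sketch of the “walking” comparison described above (an illustration only, not the pipeline's actual code), the following Python computes a pooled-variance Student's t statistic at every candidate exon boundary of a single transcript and reports the boundary with the largest absolute statistic. The function names, the min_exons parameter, and the toy expression values are assumptions made for illustration. P-values, the Bonferroni correction, and the additional filters described in Materials and Methods are deliberately omitted.

```python
from math import copysign, inf, sqrt
from statistics import mean, variance


def student_t(a, b):
    """Two-sample Student's t statistic (pooled variance) for mean(a) - mean(b)."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    diff = mean(a) - mean(b)
    if pooled == 0:
        # Degenerate case: no within-group spread on either side.
        return 0.0 if diff == 0 else copysign(inf, diff)
    return diff / sqrt(pooled * (1 / na + 1 / nb))


def walking_t_test(exon_levels, min_exons=2):
    """Scan every candidate breakpoint of one transcript.

    exon_levels: per-exon expression values ordered 5' to 3' (hypothetical input).
    min_exons:   assumed minimum number of exons required on each side.
    Returns (k, t), where k is the number of exons proximal to the best
    candidate breakpoint and t is the distal-versus-proximal t statistic
    (t > 0 means the distal exons are higher, the 3' partner pattern).
    Returns None when the transcript has too few exons to test.
    """
    if min_exons < 2:
        raise ValueError("min_exons must be >= 2 so both variances are defined")
    best = None
    for k in range(min_exons, len(exon_levels) - min_exons + 1):
        proximal, distal = exon_levels[:k], exon_levels[k:]
        t = student_t(distal, proximal)
        if best is None or abs(t) > abs(best[1]):
            best = (k, t)
    return best


# Toy transcript: four low-expressed exons followed by four high-expressed exons.
toy_levels = [1.0, 1.2, 0.9, 1.1, 5.0, 5.2, 4.9, 5.1]
print(walking_t_test(toy_levels))
```

In a fuller implementation one would presumably convert each statistic to a p-value under a t distribution and compare it against a Bonferroni-adjusted threshold before applying the remaining filters.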

To discover gene fusions from genomic data (using high-density CGH/SNP arrays), we employed a similar method called DNA breakpoint analysis (DBA), based on identifying intragenic breakpoints as transitions in DNA copy number occurring within genes (Figure 2.1 and Figure 2.S3A). These intragenic genomic breakpoints might reflect unbalanced chromosomal rearrangements that result in the creation of a gene fusion. Other groups have recently reported similar approaches, though in limited datasets, either to rediscover known gene fusions or to discover novel rearrangements [27,28,41]. To identify genomic breakpoints, we first segmented the copy number data to identify statistically-significant copy number alterations (CNAs), using the fused lasso method (false discovery rate; FDR 1%) [42]. Because this approach tended to overcall copy number transitions, we also devised an algorithm to better define the boundaries of statistically-significant CNAs, which we termed “copy number smoothing” (see Materials and Methods). We then screened for copy number changes disrupting those genes of the Cancer Gene Census. Altogether, RBA identified 54 different transcript breakpoints across the 92 cancer samples analyzed (Figure 2.S2C and Table 2.S1). Many of these breakpoints corresponded to known gene fusions, including BCR/ABL1 in CML, FIP1L1/PDGFRA in eosinophilic leukemia, and NPM1/ALK in anaplastic large cell lymphoma (ALCL) (Figure 2.S4). In most cases of known gene fusions, we found that RBA was better suited to detect the 3’ fusion partner. This likely reflects that for 5’ partners, comparable expression of the remaining wildtype allele might mask an expression-level breakpoint, whereas for 3’ partners the corresponding wildtype allele is more likely to be expressed at low or negligible levels (from its endogenous promoter). Altogether, DBA identified 144 different intragenic DNA copy number breakpoints across the 882 cancer samples analyzed (Figure 2.S3B and Table 2.S2). 
Many of these candidates also corresponded to known gene fusions, including EWSR1/FLI1 in Ewing sarcoma and ABL1 rearrangements in several leukemia samples (Figure 2.S5). When possible, RBA and DBA results were integrated. In particular, four candidates were supported by both approaches, with three corresponding to known gene fusions (Table 2.S1 and Table 2.S2). However, opportunities for integrating RBA and DBA were few because of the limited overlap of samples profiled at both the transcriptional and genomic level.

In all, we prioritized two candidate gene fusions nominated by RBA and 12 candidate rearrangements nominated by DBA for further characterization. We used various criteria to select these candidates, and our rationale is presented in more detail in Table 2.S1 and Table 2.S2. Briefly, we prioritized RBA candidates by focusing on the most statistically-significant novel rearrangements. For DBA, we prioritized novel rearrangements associated with focal copy number alterations, because we noted in the datasets that many known gene fusions occurred in the context of focal genomic gains or losses. We also used gene-expression profiling data when available to prioritize DBA candidates that were highly expressed in the respective sample. In addition, for both RBA and DBA, we prioritized breakpoints aligning to exon positions previously demonstrated to be rearranged in other malignancies. In total, we were able to define and PCR-validate rearrangements in 12 of the 14 (86%) candidates tested (Table 2.1, Table 2.S1, and Table 2.S2).
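The DBA screening step described earlier in this section (copy number transitions falling within genes of interest) can be illustrated with a short Python sketch. It assumes the copy number data have already been segmented and smoothed, and it simply reports genes that fully contain the gap between two adjacent segments with different copy number values. The data layout, function name, and gene names are hypothetical; the fused lasso segmentation and the “copy number smoothing” algorithm themselves are not reproduced here.

```python
def intragenic_breakpoints(segments, genes):
    """Report genes disrupted by a copy number transition.

    segments: list of (chrom, start, end, log2_ratio) tuples for one sample,
              sorted by chromosome and start position (assumed pre-segmented).
    genes:    dict mapping gene name -> (chrom, start, end), for example the
              genes of a census list (hypothetical input format).
    Returns a list of (gene, left_end, right_start, delta) tuples, where the
    transition lies between left_end and right_start and delta is the change
    in log2 ratio going from the left segment to the right segment.
    """
    hits = []
    for left, right in zip(segments, segments[1:]):
        chrom_l, _, end_l, value_l = left
        chrom_r, start_r, _, value_r = right
        if chrom_l != chrom_r or value_l == value_r:
            continue  # no transition between these two segments
        for name, (g_chrom, g_start, g_end) in genes.items():
            # Require the whole transition interval to lie inside the gene.
            if g_chrom == chrom_l and g_start <= end_l and start_r <= g_end:
                hits.append((name, end_l, start_r, value_r - value_l))
    return hits


# Toy example with made-up coordinates and gene names.
toy_segments = [
    ("chr6", 1, 1000, 0.0),
    ("chr6", 1100, 5000, 0.8),
    ("chr7", 1, 9000, 0.0),
]
toy_genes = {
    "GENE_A": ("chr6", 900, 2000),   # contains the chr6 transition
    "GENE_B": ("chr6", 3000, 4000),  # lies entirely within one segment
    "GENE_C": ("chr7", 1, 9000),     # no transition on chr7
}
print(intragenic_breakpoints(toy_segments, toy_genes))
```

The sign of delta indicates which side of the gene is gained, which may help suggest whether the gene is acting as a 5’ or 3’ fusion partner, although any such inference would need experimental confirmation.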

Novel ROS1 rearrangements in angiosarcoma and epithelioid hemangioendothelioma

Rare oncogenic gene fusions involving the ROS1 receptor tyrosine kinase (RTK), a poorly characterized RTK with unknown ligand [43], have been described in glioblastoma, non-small cell lung cancer, and cholangiocarcinoma [44-46]. DBA identified a genomic breakpoint disrupting ROS1 in U-118MG cells, corresponding to the known GOPC/ROS1 (also called FIG/ROS1) gene fusion in this glioblastoma cell line [45] (Figure 2.S5C). In addition, RBA nominated 6 other candidate ROS1 rearrangements, in breast cancer (BT-549, HS578t), glioblastoma (SF-295, U251), lung cancer (HOP-62), and angiosarcoma (AS1). However, only the primary angiosarcoma specimen, AS1, exhibited a prominent and highly significant (P < 10^-27) expression transition (Figure 2.2A), with the predicted breakpoint corresponding to

known rearrangements of ROS1 in other malignancies. Thus, we chose to further investigate ROS1 in this specimen. While several sarcoma subtypes (e.g. Ewing sarcoma) harbor pathognomonic gene fusions, no such alterations have been discovered to date in angiosarcoma, a rare but aggressive endothelial neoplasm [47,48]. By 5’ rapid amplification of cDNA ends (5’ RACE), we uncovered a novel CEP85L/ROS1 rearrangement in AS1 (Figure 2.2B, C). CEP85L and ROS1 are located approximately 1 megabase (MB) apart within cytoband 6q22, and are oriented in the same direction. The gene fusion is in frame, and preserves the tyrosine kinase domain of ROS1, but removes its transmembrane and extracellular domains (Figure 2.2C). CEP85L was recently discovered to be the 5’ partner of a rearrangement involving PDGFRB in a patient with precursor T-ALL and an associated myeloproliferative neoplasm [49]. The breakpoint of CEP85L/PDGFRB includes the first 11 exons of CEP85L whereas CEP85L/ROS1 includes the first 12 exons. While little is known about the function of CEP85L (centrosomal protein 85kDa-like), structural analysis [50] predicts the presence of a coiled-coil domain that is retained in these gene fusions (Figure 2.2C). Rearrangements of RTKs often involve 5’ (N-terminal) partnered coiled-coil domains, which, presumptively by mediating dimerization, are necessary for the transforming properties of these fusions [5,51,52]. To further investigate the underlying genomic rearrangement in the AS1 angiosarcoma specimen, we performed a “break-apart” fluorescence in situ hybridization (FISH) assay, using two FISH probes (with different fluors) flanking ROS1. FISH analysis confirmed genomic rearrangement with amplification of ROS1 (Figure 2.2D). 
To determine whether ROS1 rearrangements recurred in angiosarcomas or other sarcoma subtypes, we performed the break-apart FISH assay on two tissue microarrays (TMAs) containing 280 specimens representing 36 diverse sarcoma and soft tissue tumor diagnoses (Table 2.S3). An advantage of FISH (e.g. as compared to RT-PCR) is that it does not require knowing the identity of the 5’ fusion partner, which may differ among tumor specimens. Of 33 evaluable angiosarcoma and 20 epithelioid hemangioendothelioma (EHE; a related diagnosis) cases, one EHE case (EHE10) exhibited rearrangement at the ROS1 locus (Figure 2.2D). Thus, counting the index case AS1, we observed ROS1 rearrangement in 1 of 34 (~3%) angiosarcomas and 1 of 20 (5%) EHE cases. No additional ROS1 rearrangements were identified in other sarcoma and soft tissue tumor subtypes.

Although ROS1 rearrangements appeared to be relatively uncommon, we hypothesized that ROS1 might nonetheless play a more general role in angiosarcoma pathogenesis, even in cases without rearrangement. To explore this hypothesis, we analyzed a microarray dataset of gene-expression profiles from various sarcoma subtypes including angiosarcoma [53-55]. By supervised analysis, we identified 455 genes (FDR < 5%) with elevated expression in angiosarcoma relative to other sarcoma subtypes. In addition to including various vascular endothelial markers (ECSCR, TIE1, CD34, CDH5, ESAM), the angiosarcoma gene signature also included ROS1 (Figure 2.2E), supporting a possible broader role of ROS1 in the pathogenesis of this disease.

Discovery of APIP/SLC1A2 in colon cancer

Recently, Tao et al. reported that a small subset of gastric cancers harbors the novel gene fusion CD44/SLC1A2 [27]. This fusion is formed through a chromosomal inversion that juxtaposes most of the coding region of the glutamate transporter gene SLC1A2 to the strong transcriptional promoter of its neighboring gene CD44. The rearrangement results in overexpression of an N-terminally truncated SLC1A2 protein, which increases intracellular glutamate levels and stimulates oncogenic growth. Our DBA results suggested that SLC1A2 rearrangements occur in cancer types other than gastric carcinomas. In addition to detecting a known SLC1A2 rearrangement in the gastric cancer cell line SNU-16, DBA identified breakpoints disrupting SLC1A2 in the colon cancer cell line SNU-C1 and in a pancreatic cancer xenograft (247) (Figure 2.3A and Table 2.S2). All of these breakpoints occur within intron 1 of SLC1A2, the same position found to be disrupted in several gastric cancers [27]. We chose to further characterize the putative rearrangement in SNU-C1.

By paired-end RNA sequencing (RNA-seq; see Materials and Methods) of SNU-C1 cells, we uncovered a novel colon cancer gene fusion, APIP/SLC1A2 (Figure 2.3B). The structure of this rearrangement is nearly identical to that of CD44/SLC1A2 and is predicted to encode the same truncated transporter protein. In particular, as is the case for CD44/SLC1A2, translation is predicted to occur from an internal start codon within exon 2 of SLC1A2 (Figure 2.3B).

Analogous to TMPRSS2/ERG in prostate cancer, the SLC1A2 fusion in gastric cancer is thought to be driven by strong expression of its 5’ partner, CD44 [27]. We therefore reasoned that for APIP/SLC1A2, the 5’ partner APIP (APAF1 interacting protein) ought to exhibit strong expression in colon. Indeed, analysis of publicly-available microarray data [56,57] revealed high-level expression of APIP in colon compared to other tissues (Figure 2.3C). Furthermore, analysis of a publicly-available colorectal cancer gene-expression dataset [58] demonstrated SLC1A2 to be expressed at higher levels in SNU-C1 compared to all other cell lines interrogated (Figure 2.3D). Attempts to characterize the oncogenic contribution of APIP/SLC1A2 by RNA interference (RNAi)-mediated knockdown were met with technical difficulties in efficiently transfecting the suspension line SNU-C1 (data not shown). Further studies are needed to fully characterize the role of this alteration in colon carcinogenesis.

Novel RAF kinase rearrangements in pancreatic cancer and anaplastic astrocytoma

Recurrent rearrangements of the RAF kinases, RAF1 and BRAF, were recently reported in a small fraction of prostate cancers, gastric cancers, and melanomas [17]. Here, DBA identified candidate rearrangements of RAF1 in lung cancer (DMS-153), pancreatic cancer (PL5), anaplastic astrocytoma (D538-MG), and osteosarcoma (CAL-72) (Figure 2.4A and Table 2.S2), and candidate rearrangements of BRAF in gastric cancer (NCI-N87), breast cancer (HCC38), and glioblastoma (D397-MG) (Table 2.S2). We further evaluated two of these candidates by paired-end RNA-seq, from which we identified novel gene fusions, ATG7/RAF1 in pancreatic cancer and BCL6/RAF1 in anaplastic astrocytoma (Figure 2.4B, C). Both of these fusions retained exons 8-17 of RAF1, and the encoded fusions were predicted to be in frame. In addition, both rearrangements preserved the RAF1 serine/threonine kinase domain but removed an N-terminal autoinhibitory Ras Binding Domain (RBD) (Figure 2.4B), consistent with the structural organization of known RAF kinase gene fusions [17,18].

We further characterized the oncogenic relevance of ATG7/RAF1 in pancreatic cancer, using RNAi to knock down its expression. Transfection of PL5 cells with short interfering RNAs (siRNAs) targeting the 3’ end of RAF1 (i.e. the portion retained in the fusion) led to reduced expression of the RAF1 fusion (Figure 2.4D). This resulted in significantly decreased cell proliferation and invasiveness (by Boyden chamber assay), compared to PL5 cells transfected with a non-targeting control siRNA (Figure 2.4E, F).

More than 90% of pancreatic cancers harbor activating mutations of KRAS, and a subset also exhibits KRAS amplification [59]. Comparatively little is known of the pathobiology of the pancreatic cancer subset that is wildtype for KRAS. Since RAF kinases mediate KRAS signaling through the MAPK cascade, we reasoned that ATG7/RAF1 might substitute for KRAS mutation in PL5. Supporting this possibility, PL5 exhibited neither amplification (by aCGH profile; data not shown) nor activating mutation (by Sanger sequencing) of KRAS. To determine whether RAF kinase rearrangements are recurrent events in pancreatic cancer, we performed break-apart FISH assays for both BRAF and RAF1, on TMAs containing 104 evaluable pancreatic cancer cases. We identified BRAF rearrangement in one of the 104 samples (~1%) (Figure 2.4G) but no additional RAF1 rearrangements. Taken together, our findings are consistent with RAF kinase fusions occurring in a small subset of pancreatic cancers, where they possibly substitute for KRAS mutations.

Discovery and characterization of EWSR1/CREM in melanoma

Rearrangements of the RNA binding protein, EWSR1, characterize various malignancies including Ewing sarcoma (EWSR1/ETS), desmoplastic small round cell tumor (EWSR1/WT1), and some acute lymphoblastic leukemias (EWSR1/ZNF384) [60-62]. By DBA, we identified intragenic breakpoints disrupting EWSR1 in Ewing sarcoma (ES6, EW12, EW22), neuroblastoma (GOTO, NBsusSR), and melanoma (CHL-1, SH4) (Figure 2.5A and Table 2.S2). As EWSR1 gene fusions had not previously been described in cutaneous melanoma, we prioritized CHL-1 and SH4 for further evaluation. By paired-end RNA-seq, we uncovered a novel rearrangement, EWSR1/CREM, in CHL-1 (Figure 2.5B, C), but were unable to identify an EWSR1 fusion in SH4. CREM is a basic leucine zipper transcription factor and downstream mediator of the cAMP signal transduction cascade [63-65]. The structure of EWSR1/CREM is typical of oncogenic EWSR1 rearrangements, with a putative transcriptional transactivating domain from EWSR1 fused in-frame to the basic leucine zipper DNA binding domain of CREM (Figure 2.5B).

To explore an oncogenic contribution of EWSR1/CREM in melanoma, we again used RNAi to knock down expression of the fusion. Transfection of CHL-1 cells with siRNAs targeting the 3’ end of CREM (the portion retained in the fusion) led to reduced transcript levels of the EWSR1/CREM fusion (Figure 2.5D), and to significantly decreased cell proliferation and invasion (compared to non-targeting control siRNAs) (Figure 2.5E, F). Notably, CHL-1 cells transfected with CREM-targeting siRNAs also appeared flattened and enlarged, morphological changes suggestive of senescence. To substantiate this observation, we stained for senescence-associated β-galactosidase and observed significantly increased numbers of senescent cells (Figure 2.5G).

Identification of FAM133B/CDK6 in T-ALL

Cyclin-dependent kinase 6 (CDK6) encodes a regulator of G1/S cell-cycle progression and has been found rearranged in B-cell lymphoma (IGK/CDK6), chronic lymphocytic leukemia (IGL/CDK6, IGH/CDK6, IGK/CDK6), and acute lymphoblastic leukemia (CDK6/MLL) [66-68]. DBA identified a focal DNA amplification disrupting CDK6 in J.RT3-T3.5, a mutant TCR-negative Jurkat cell line derivative [69] (Figure 2.6A). To evaluate this further, we performed paired-end RNA-seq on Jurkat cells, which revealed a novel gene fusion, FAM133B/CDK6 (Figure 2.6B). CDK6 sits adjacent to FAM133B (an uncharacterized gene) at chr 7q21.2, and both genes are transcribed in the same direction. However, CDK6 resides upstream of FAM133B; therefore the fusion might result from a tandem duplication event. The predicted fusion is in-frame, and juxtaposes 41 amino acids from the N-terminus of FAM133B to an N-terminally truncated CDK6. Analysis of publicly available microarray data confirmed high-level expression of CDK6 in J.RT3-T3.5, relative to other leukemia cell lines (Figure 2.6C; array probes mapped to the portion of CDK6 retained in the fusion). In addition, Jurkat cells exhibited marked sensitivity to the CDK4/6 inhibitor, PD0332991 (IC50 = 0.27 µM; Figure 2.6D).

Rearrangement of CLTC and VMP1 occurs in multiple cancer types

Gene fusions involving clathrin heavy chain (CLTC) have been described in various leukemias (CLTC/ALK) and in renal cell carcinoma (CLTC/TFE3) [70-72]. DBA suggested that CLTC rearrangements might be more widespread in human malignancies (Table 2.S2). Copy-number transitions within cytoband 17q23.1 occurred as focal deletions that involved three neighboring genes, CLTC, PTRH2, and VMP1 (also called TMEM49). For further evaluation, we selected two breast cancer cell lines, BT-549 and HCC1954, with deletions spanning CLTC-VMP1 (Figure 2.7A). Paired-end RNA-seq revealed a distinct CLTC/VMP1 fusion transcript in each sample (Figure 2.7B, C). Notably, both CLTC/VMP1 fusions were predicted to be out of frame. A recent study also identified the CLTC/VMP1 fusion in BT-549 [73]; our findings now demonstrate this to be a recurrent rearrangement in breast cancer. A similar deletion pattern occurred in other malignancies, including glioblastoma, neuroblastoma, lung cancer, bladder cancer, thyroid cancer, melanoma, leukemia, and others (Figure 2.7D). Across all these samples, the minimum common region of deletion appeared to include only PTRH2 and VMP1. One of the samples, renal cell carcinoma line RXF393, was also analyzed by RBA, where a candidate CLTC rearrangement was identified (Figure 2.7E).

Novel cell line models for EGFRvIII and FIP1L1/PDGFRA

In addition to discovering novel gene fusions, our breakpoint analysis approach proved useful for identifying new cell line models for known oncogenic rearrangements. In particular, DBA identified genomic breakpoints within epidermal growth factor receptor (EGFR) in two glioblastoma multiforme cell lines, DKMG and CAS-1 (Figure 2.8A). Approximately 20-30% of glioblastoma tumors harbor a constitutively active rearrangement of EGFR, called EGFRvIII, but glioblastoma-derived cell lines typically lose EGFR amplification and EGFRvIII expression [74,75]. Hence, studies of EGFRvIII have been hindered by the lack of suitable cell line models. Paired-end RNA-seq, followed by RT-PCR and Western blotting, revealed the expression of EGFRvIII in DKMG cells (Figure 2.8B, C). Further functional characterization of EGFRvIII in this cell line is described elsewhere [76].

DBA also identified DNA copy-number transitions within platelet-derived growth factor receptor alpha (PDGFRA) in glioblastoma (SNB19) and chronic eosinophilic leukemia (EOL-1) (Table 2.S2). RBA identified a corroborating expression-level transition within PDGFRA in EOL-1 (Table 2.S1, Figure 2.S4C), and another in the T-ALL cell line SUPT13 (Figure 2.8D). The FIP1L1/PDGFRA fusion is a hallmark of chronic eosinophilic leukemia and has been studied extensively in EOL-1 cells [77,78], but other cell line models are lacking. We performed paired-end RNA-seq on SUPT13 and found that it harbors FIP1L1/PDGFRA (Figure 2.8E, F), albeit with a distinct gene fusion junction from EOL-1. Notably, SUPT13 cells also demonstrated marked sensitivity to the PDGFR inhibitor, imatinib mesylate (IC50 = 0.036 µM) (Figure 2.8G). Thus, SUPT13 represents a new cell line model for studies of this known gene fusion.

DISCUSSION

Here, we have described the development and implementation of a breakpoint analysis pipeline for cancer gene fusion discovery, which we applied to a large collection of nearly 1,000 cancer samples. We discovered novel gene rearrangements in diverse human cancer types, including fusions of ROS1, SLC1A2, RAF1, EWSR1, CDK6, and CLTC.

The ROS1 rearrangement (CEP85L/ROS1), to our knowledge, represents the first gene fusion described in angiosarcoma. By FISH analysis, ROS1 rearrangements appear to be infrequent in angiosarcoma (and in another endothelial-derived tumor, epithelioid hemangioendothelioma). Nevertheless, the finding of elevated ROS1 expression in angiosarcomas, relative to other sarcoma subtypes, suggests that ROS1 might play a broader role in angiosarcoma pathogenesis. Angiosarcoma is an aggressive sarcoma subtype, with an overall 5-year survival rate of approximately 35% [48]. Locally recurrent and metastatic tumors are generally chemoresistant. As a tyrosine kinase, ROS1 represents a potential new therapeutic opportunity. In this regard, we note that ROS1 tyrosine kinase is sensitive to the existing ALK small-molecule inhibitor, crizotinib, and indeed a single patient’s non-small cell lung cancer harboring a ROS1 fusion was found to be responsive [79]. Intriguingly, single nucleotide polymorphism (SNP) variants in ROS1 have been associated with increased risk of vascular diseases, including coronary artery disease and stroke [80,81]. These reports possibly suggest an even broader link between ROS1 and endothelial cell pathobiology.

Our findings also demonstrate a more widespread role of SLC1A2 rearrangements in human malignancies. We show that in addition to CD44/SLC1A2 in gastric cancer, SLC1A2 is involved in a novel but analogous gene fusion, APIP/SLC1A2, in colon cancer. Both of these rearrangements are predicted to overexpress the identical N-terminally truncated SLC1A2 protein, a functioning glutamate transporter.
Notably, while most oncogenic gene fusions encode protein kinases and transcription factors [3,40,82], SLC1A2 fusions appear to define a new class of rearrangement targeting metabolism-related genes [27]. Indeed, altered cell metabolism is increasingly recognized as a primary driver of human cancer [83]. SLC1A2 fusions therefore also represent potential therapeutic targets in gastric and now colon cancer. Pharmacological inhibitors of several transporter proteins have been developed [84]. However, as glutamate is a major excitatory neurotransmitter of the central nervous system, a monoclonal antibody targeting SLC1A2 might provide an alternative anti-cancer agent, whose larger size would limit crossing of the blood-brain barrier.

Our analysis also uncovered RAF1 rearrangements in pancreatic cancer and in anaplastic astrocytoma. To our knowledge, these are the first fusion genes reported in either cancer type. On a cautionary note, most pilocytic astrocytomas (a distinct diagnosis, but related to anaplastic astrocytoma) carry RAF1 or BRAF rearrangements; thus it is possible that the D538-MG cell line (harboring BCL6/RAF1) was actually derived from a misdiagnosed pilocytic astrocytoma. Regardless, BCL6/RAF1 constitutes a novel RAF1-partnered fusion. Our findings extend the spectrum of cancer types harboring RAF kinase rearrangements, and underscore the importance of the RAS-RAF-MAPK signaling pathway in these additional malignancies. In pancreatic cancer, pathway activation typically occurs by mutation of KRAS, but in uncommon KRAS-wildtype tumors, RAF kinase fusions may provide an alternative route. Though RAF kinase fusions are uncommon, they nonetheless have therapeutic implications for this deadly malignancy. Several RAF kinase and MAP kinase pathway inhibitors are now in clinical trials for various cancer types [85,86].

Breakpoint analysis also identified a novel EWSR1/CREM fusion in melanoma (CHL-1). Several “singleton” gene fusions have been reported in melanoma, but it is unclear whether any of these rearrangements have oncogenic properties [20]. In addition, Palanisamy et al. found rearrangements of RAF kinase genomic loci by FISH in rare cases of melanoma, but no specific RAF kinase gene fusion was identified [17]. Thus, EWSR1/CREM potentially represents the first oncogenic gene fusion discovered to date in melanoma.
EWSR1 rearrangements in Ewing’s sarcoma have recently been shown to confer sensitivity to PARP-1 inhibition [87]. Advanced melanomas carry a poor prognosis and are generally unresponsive to anti-cancer medications or rapidly acquire resistance to these agents. The potential role of EWSR1/CREM as a marker for PARP-1 inhibitor sensitivity should be further explored.

Our discovery of a CDK6 fusion in T-ALL also carries important pathobiologic and clinical implications. In knockout studies in mice, CDK6 was recently shown to play a role in thymocyte development and tumorigenesis [88]. Thus, it is plausible that the CDK6 rearrangement drives deregulated CDK6 expression and T-cell derived leukemia. Our findings provide a rationale for preclinical testing and clinical trials using existing CDK6 inhibitors (e.g. PD0332991).

Lastly, our breakpoint analysis uncovered recurrent deletions and rearrangements of the CLTC-PTRH2-VMP1 locus, evident in diverse tumor types, including glioblastoma, neuroblastoma, lung cancer, breast cancer, bladder cancer, thyroid cancer, melanoma, and leukemias. In breast cancer, we discovered two CLTC/VMP1 fusions; however, both were out-of-frame. These findings are most consistent with one or more of the three genes at this locus functioning as a tumor suppressor in multiple tumor types. Notably, PTRH2, the centrally residing gene at this locus, encodes a mitochondrial protein that induces apoptosis through interactions with the small Groucho family transcriptional regulator, AES, consistent with a tumor suppressive function [89].

In the current study, we performed pharmacologic inhibition and RNAi knockdown experiments to functionally characterize several gene fusions, and we performed FISH to assess recurrence. The results of these experiments highlight the pathogenic roles of these alterations in their corresponding cancer types. However, not all rearrangements were fully characterized. In particular, we were unable to culture D538-MG cells, and so we did not perform experiments to assess the function of BCL6/RAF1. In addition, we were unable to efficiently transfect the suspension cell line SNU-C1 with siRNAs targeting APIP/SLC1A2. While the structures of these alterations strongly support oncogenic roles, further experiments must be undertaken to fully characterize their function.
Additional FISH and RT-PCR experiments are also planned to further assess rearrangement frequencies for several gene fusions.

Through these novel discoveries, we demonstrate that breakpoint analysis provides a powerful approach for gene fusion discovery. While our opportunities to integrate RBA and DBA were limited (due to the small overlap of samples), we expect that candidates identified by both methods would be further enriched for valid fusions. Publicly available microarray data now exist for many thousands of cancer samples [29,30,90], which can be mined by breakpoint analysis. In particular, recurrent gene fusions appear to occur at low frequency in many cancer types, and therefore these existing very large sample sets should empower their discovery. While here we have applied breakpoint analysis to discover rearrangements of known cancer genes as part of novel fusions and in novel cancer types, our approach should be extendable to discover pathogenic fusion genes not previously linked to malignancy.

In summary, breakpoint analysis uncovered several novel gene rearrangements spanning multiple human cancer types. We identified new gene fusions involving ROS1, SLC1A2, RAF1, EWSR1, CDK6, and CLTC, some occurring in cancer types not previously known to harbor fusions. Several of these fusions represent druggable targets or potential markers for sensitivity to specific anti-cancer treatments, with therapeutic implications for the corresponding cancer types. Importantly, such multi-tumor rearrangements support the notion that tumors might be better classified by their underlying molecular alterations, rather than their tissue of origin.

MATERIALS AND METHODS

Exon microarray expression datasets

For RBA, we mined data from 76 publicly available exon-resolution expression profiles generated on Affymetrix Human Exon 1.0 ST microarrays, including 17 T-ALL cell lines (GSE9342) and all 59 of the NCI-60 cancer cell lines (GSE29682) [33]. Affymetrix Expression Console software was used to extract normalized log2 ratios from raw data files using the RMA-sketch algorithm from Affymetrix’s Power Tools package. Exon log2 ratios were then mean centered across the array set.

In addition, we profiled 16 cancer samples on a custom Agilent 8 x 15K microarray that contained 325 genes previously known to be involved in oncogenic rearrangements. The sample set included 8 positive control samples harboring known rearrangements, used to optimize our analysis pipeline, as well as 8 sarcoma specimens representing sarcoma subtypes where gene fusions had not yet been described. For the custom arrays, sample labeling was done using the Fairplay III Microarray Labeling Kit (Agilent). Briefly, 10 µg of sample total RNA and 1 µg of reference mRNA (pooled from 11 diverse cell lines; [91]) were differentially labeled with Cy5 and Cy3, respectively, and co-hybridized to the microarray. Following overnight hybridization and washing, arrays were imaged using Agilent’s High-Resolution C Scanner. Normalized fluorescence ratios were extracted using Agilent Feature Extraction Software, and values were mean centered across samples.

Array CGH datasets

For DBA, we mined data from 812 CGH/SNP arrays, representing cancer cell lines derived from 29 distinct tissues, from the Wellcome Trust Sanger Institute’s Cancer Genome Project [31]. These cell lines were profiled on Affymetrix SNP 6.0 microarrays containing 1.8 million genetic markers including more than 946,000 probes for the detection of copy number variation. Affymetrix Genotyping Console software was used to extract probeset intensities from raw data files using the regional GC correction configuration for Copy Number/LOH analysis and default settings. Intensities were normalized against a HapMap 270 normal reference dataset, and log2 ratios were analyzed for genomic breakpoints. In addition, we analyzed a pancreatic cancer dataset generated by our laboratory, consisting of 22 pancreatic cancer cell lines and 48 early-passage xenografts [34]. These samples were profiled on Agilent 244K CGH arrays and normalized log2 ratios were obtained as described [34].

RNA breakpoint analysis

RBA was implemented using custom C# scripts. The RBA algorithm is based on a “walking” Student’s t-test, which for every exon-exon junction along the transcript compares expression levels of all proximal vs. distal exons (see Fig. S2A). The algorithm was applied to all annotated genes and subsequently filtered for candidate expression breakpoints disrupting genes previously identified in oncogenic rearrangements, as defined by the Cancer Gene Census [40]. The Cancer Gene Census was downloaded in November 2011 from the Wellcome Trust Sanger Institute (http://www.sanger.ac.uk/genetics/CGP/Census/). We filtered this list to exclude known common fragile sites, as well as non-oncogenic fusion partners such as those involved in rearrangements with MLL and the 5’ partners of tyrosine kinase fusions, with the exception of promiscuously rearranged genes (i.e. those involved in multiple distinct gene fusions). We also included SLC1A2, which has recently been discovered to form oncogenic gene fusions in gastric cancer [27], but had not yet been added to the census. The resulting filtered list included 306 genes.

Statistical significance (P < 0.05) was determined using a Bonferroni correction to adjust for multiple t-tests. Specifically, 3,218 t-tests were performed for each Affymetrix microarray experiment, with significance corresponding to an uncorrected P = 1.55 × 10⁻⁵, and 1,807 t-tests were performed for each custom microarray experiment, with significance corresponding to an uncorrected P = 2.77 × 10⁻⁵. Positive hits were defined as genes with P-values dipping below the significance threshold during the walking t-test. We only included expression breakpoints with directional orientation (i.e. being the 5’ or 3’ partner) corresponding to that of known rearrangements involving a given gene.
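As an illustration only, the walking t-test and the Bonferroni cutoffs described above might be sketched in Python as follows. This is not the original C# implementation, which is not reproduced here. The function names, the `min_side` parameter, and the choice of a pooled-variance (Student's) statistic over per-exon log2 ratios are assumptions made for the sketch. Converting each t statistic to a P-value would additionally require a t-distribution routine from a statistics library, which is omitted.

```python
import math
from statistics import mean, variance


def bonferroni_threshold(alpha, n_tests):
    """Per-test P-value cutoff that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


def walking_t(exon_log_ratios, min_side=2):
    """Walk along a transcript and test every exon-exon junction.

    Junction i separates exons [0, i) (proximal) from exons [i, n) (distal).
    Returns a list of (junction_index, t_statistic, degrees_of_freedom).
    min_side (hypothetical) is the minimum number of exons required per side.
    """
    results = []
    n = len(exon_log_ratios)
    for i in range(min_side, n - min_side + 1):
        proximal = exon_log_ratios[:i]
        distal = exon_log_ratios[i:]
        df = len(proximal) + len(distal) - 2
        # Pooled variance, as in the classical two-sample Student's t-test.
        pooled = ((len(proximal) - 1) * variance(proximal)
                  + (len(distal) - 1) * variance(distal)) / df
        se = math.sqrt(pooled * (1 / len(proximal) + 1 / len(distal)))
        diff = mean(distal) - mean(proximal)
        if se > 0:
            t = diff / se
        else:
            t = 0.0 if diff == 0 else math.copysign(math.inf, diff)
        results.append((i, t, df))
    return results


# Toy transcript: the first four exons are low, the last four are high.
exons = [-1.0, -1.1, -0.9, -1.0, 1.0, 1.1, 0.9, 1.0]
best = max(walking_t(exons), key=lambda r: abs(r[1]))
print("strongest junction:", best[0])
print("Affymetrix cutoff:", bonferroni_threshold(0.05, 3218))
print("custom array cutoff:", bonferroni_threshold(0.05, 1807))
```

A positive t statistic here corresponds to higher expression of the distal (3’) exons, which is the orientation expected when the gene is the 3’ fusion partner.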

DNA breakpoint analysis

DBA was done using a combination of publicly available software and custom C# scripts. Copy number alterations (CNAs) were initially determined from normalized log2 ratios using the fused lasso algorithm (FDR 1%) [92]. We then used a custom algorithm to better define the boundaries of each CNA (thereby minimizing overcalled transitions), which we termed “copy number smoothing.” Copy number smoothing was applied to each chromosome of each profiled sample, where each iteration begins by identifying the upper (5’) boundary of the subsequent candidate “well-defined” CNA called by fused lasso. A well-defined CNA was defined by an average |log2| ratio greater than or equal to an adjustable threshold (here set to 0.3) and a minimum length of at least 50 probe sets. Adjusting the log2 ratio threshold affected the number of nominated gene fusions. We empirically chose a threshold that enabled detection of many known gene fusions, such as EWSR1/FLI1 in Ewing’s sarcoma, while minimizing false positives (Table 2.S4). For high-level CNAs, defined by |log2| ratio greater than or equal to 1.0, we permitted a minimum length of only 10 probe sets, because we observed that focal high-level copy number transitions often characterized known rearrangements, e.g. BCR/ABL1 (K562), MLL (OCI-AML2), EWSR1/FLI1 (CADO-ES1, EW18), and CD44/SLC1A2 (SNU-16).

After finding this upper boundary, the algorithm walks down the CNA to identify its lower (3’) boundary. The lower boundary is defined as either reaching the end of the chromosome or finding the position where 95% of the subsequent 100 ratios meet any one of the following criteria: (1) copy number neutral (log2 ratio = 0); (2) change in the log2 ratio sign, i.e. from (-) to (+) or vice versa; or (3) average |log2| ratio that changes by an adjustable threshold (here set to 0.3). For high-level CNAs, 95% of only the next 50 ratios are evaluated using these criteria. After finding the upper and lower boundaries of a given CNA, its average value is determined.

A second custom C# script then mines the CNAs for those that disrupt annotated genes. These candidates were further filtered to include only the subset disrupting Cancer Gene Census genes. We also prioritized those breakpoints where the directional orientation of the copy number transition corresponded to that of known rearrangements of the particular gene. For example, breakpoints disrupting ABL1 kinase must comprise either amplification of the 3’ end or deletion of the 5’ end of the gene, since ABL1 is the 3’ partner in known oncogenic rearrangements such as BCR/ABL1.
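Two of the rules above lend themselves to a short sketch: the size and amplitude test for a "well-defined" CNA, and the orientation filter. The Python below is illustrative and is not the original C# code. The `Segment` type, the function names, and the encoding of a gene's known role as `"3p"` or `"5p"` are hypothetical. The rule for 5’ partners is an assumption that mirrors the ABL1 example stated in the text for 3’ partners. The boundary-walking step is not shown.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: int        # index of the first probe set in the segment
    end: int          # index one past the last probe set
    mean_log2: float  # average log2 ratio over the segment


def is_well_defined(seg, threshold=0.3, min_len=50,
                    high_level=1.0, high_min_len=10):
    """Apply the amplitude and length thresholds quoted in the text."""
    length = seg.end - seg.start
    magnitude = abs(seg.mean_log2)
    if magnitude >= high_level:
        # Focal high-level events are allowed to be much shorter.
        return length >= high_min_len
    return magnitude >= threshold and length >= min_len


def orientation_consistent(log2_5p_side, log2_3p_side, known_role):
    """Check that the copy number step within a gene matches its known role.

    For a known 3' partner, the 3' end must sit at higher copy number than
    the 5' end (3' amplification or 5' deletion). For a known 5' partner the
    reverse is assumed.
    """
    delta = log2_3p_side - log2_5p_side
    return delta > 0 if known_role == "3p" else delta < 0


print(is_well_defined(Segment(0, 60, 0.35)))    # long, modest gain
print(is_well_defined(Segment(0, 12, -1.2)))    # short, high-level loss
print(orientation_consistent(0.0, 1.1, "3p"))   # 3' end amplified
```

Keeping both thresholds as parameters reflects the statement in the text that the log2 ratio threshold is adjustable and was chosen empirically.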

Gene expression datasets and supervised analysis

To analyze expression levels, we also mined microarray gene-expression data including 76 leukemia cell lines profiled on Affymetrix U133 Plus 2.0 microarrays from the National Cancer Institute’s caArray database (https://array.nci.nih.gov/caarray/project/woost-00041), 67 sarcoma specimens profiled on cDNA microarrays printed at Stanford [53-55], 136 normal solid tissue samples profiled on Affymetrix Human Genome U133A microarrays [56,57], 67 colon cancer cell lines profiled on the Rosetta/Merck Human RSTA Custom Affymetrix 2.0 microarrays [58], and a subset of the gene-expression profiling data (Affymetrix U133 Plus 2.0 arrays) from the Cancer Cell Line Encyclopedia (CCLE) [93].

An angiosarcoma gene-expression signature was defined as previously described [94], as those genes meeting the following criteria: (1) gene expression correlated (Pearson correlation |R| ≥ 0.5) with angiosarcoma subtype considered as a binary variable; (2) gene expression significantly altered in angiosarcoma samples (two-tailed Student’s t-test, P < 0.001); and (3) ≥ 2-fold difference in average expression between angiosarcomas and other sarcoma specimens. To estimate an FDR (i.e. fraction of genes falsely called significant), we compared our results to those obtained from 1,000 trials with class labels (i.e. angiosarcoma versus other sarcomas) randomly permuted.
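The permutation-based FDR estimate described above can be sketched as follows. This Python example is a simplified illustration under stated assumptions, not the analysis code used in the study: all function names are hypothetical, expression values are assumed to be on a log2 scale (so a 2-fold difference is a difference of 1.0), and only criteria (1) and (3) are applied, because criterion (2) needs a t-distribution P-value that is omitted here for brevity. The FDR is estimated as the average number of genes passing under permuted labels divided by the number passing with the true labels.

```python
import math
import random
from statistics import mean


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0


def passes_signature(values, labels, min_r=0.5, min_log2_diff=1.0):
    """Criteria (1) and (3) only; labels are 1 (angiosarcoma) or 0 (other)."""
    inside = [v for v, lab in zip(values, labels) if lab == 1]
    outside = [v for v, lab in zip(values, labels) if lab == 0]
    return (abs(pearson(values, labels)) >= min_r
            and abs(mean(inside) - mean(outside)) >= min_log2_diff)


def permutation_fdr(expr, labels, n_trials=1000, seed=0):
    """expr maps gene name -> list of log2 expression values per sample."""
    rng = random.Random(seed)
    observed = sum(passes_signature(v, labels) for v in expr.values())
    if observed == 0:
        return float("nan")  # no genes called, so the FDR is undefined
    shuffled = list(labels)
    false_calls = 0
    for _ in range(n_trials):
        rng.shuffle(shuffled)  # permute class labels, keeping group sizes
        false_calls += sum(passes_signature(v, shuffled) for v in expr.values())
    return (false_calls / n_trials) / observed


toy_expr = {
    "geneA": [3.0, 3.2, 2.9, 0.0, 0.1, -0.1, 0.0, 0.2],
    "geneB": [0.0, 0.1, -0.1, 0.0, 0.1, -0.1, 0.0, 0.1],
}
toy_labels = [1, 1, 1, 0, 0, 0, 0, 0]
print(permutation_fdr(toy_expr, toy_labels, n_trials=200))
```

With only eight samples the toy estimate is coarse; the sample sizes described in the text would give a much finer-grained null distribution.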

Cell lines and tissues

Cell lines SNU-C1, BT-549, HCC1954, SK-ES-1, A-172, K562, A431, CHL-1, SH-4, VCaP and J-RT3-T3-5 were obtained from the American Type Culture Collection. DK-MG, SU-DHL-1, and EOL-1 were obtained from the German Collection of Microorganisms and Cell Cultures (DSMZ). ONS-76 was obtained from the Health Science Research Resources Bank (HSRRB, Tokyo, Japan). The remaining cell lines were kind gifts from different research laboratories including D-538MG (Dr. Darell Bigner, Duke University), PL5 (Dr. Anirban Maitra, Johns Hopkins University), and SUPT-13 (Dr. Michael Cleary, Stanford University). Cell lines were propagated in RPMI-1640 supplemented with 10% fetal bovine serum (FBS), except for VCaP (DMEM with 10% FBS), D-538MG (Richter’s zinc option medium (Invitrogen) with 10% FBS), and SK-ES-1 (McCoy's 5A Medium with 15% FBS). Cells were harvested at 80% confluency. Freshly-frozen cancer specimens were obtained from the Stanford Tissue Bank, collected with IRB approval and patient informed consent. Total RNA from tumors and cell lines was isolated using the RNeasy Mini Kit (Qiagen).

RACE-PCR

Rapid amplification of cDNA ends (RACE), done using the GeneRacer Kit (Invitrogen), was used to identify the 5’ fusion partner of ROS1 in the AS1 specimen (prior to our development of an RNA-seq pipeline). In brief, 5 µg of total RNA from AS1 was treated with calf intestinal phosphatase to remove the 5’ phosphate group from truncated or non-mRNA molecules. Next, the sample was treated with tobacco acid pyrophosphatase to remove the 5’ cap structure from full-length mRNA, creating a free 5’ phosphate group for subsequent adapter ligation. These molecules were ligated to the GeneRacer RNA oligo. Random primers and SuperScript III were then used to produce the RACE-ready cDNA. The GeneRacer 5’ primer served as the forward primer and a custom primer designed within exon 35 of ROS1 (AGTTGGCTGAGCTGCGAGGTCTG) was used as a reverse primer. RACE PCR reactions were resolved on a 1% agarose gel. A 600 bp band was purified and Sanger sequenced.

Paired-end library preparation for Illumina sequencing

Paired-end transcriptome sequencing (RNA-seq) was done to discover the identity of the fusion partner of candidate fusion genes. Adapter-ligated cDNA libraries were prepared using the mRNA Seq-8 Sample Prep Kit (Illumina). Briefly, mRNA was isolated and purified from 1 to 10 µg of total RNA using Sera-Mag Magnetic Oligo(dT) Beads. mRNA was subsequently fragmented at 94°C in a fragmentation buffer and converted to single-stranded cDNA using SuperScript II reverse transcriptase (Invitrogen). Second-strand cDNA synthesis was then performed using E. coli DNA polymerase I (Invitrogen). Double-stranded cDNA was end repaired using T4 DNA polymerase and T4 polynucleotide kinase, and then monoadenylated using Klenow DNA polymerase I. Adapter sequences were ligated to library molecules using T4 DNA ligase. Library fragments were then size selected (300-400 bp) on a 2% agarose gel and purified using the QIAquick Gel Extraction Kit (Qiagen). Purified cDNA fragments were enriched with 15 PCR cycles using Phusion DNA Polymerase and provided buffers. Libraries were again electrophoresed and then gel purified using the QIAquick Minelute Gel Purification Kit (Qiagen). Adapter-ligated cDNA libraries were quantified with the Agilent DNA 1000 kit on the Agilent 2100 Bioanalyzer. Libraries were sequenced on either the Genome Analyzer II or HiSeq 2000 instruments (Illumina).

Paired-end gene fusion discovery pipeline

Mate-paired RNA-seq reads were mapped to the human genome (hg18) and the RefSeq transcriptome allowing up to 2 mismatches, using Efficient Large-Scale Alignment of Nucleotide Databases (ELAND). For a given sample and its corresponding candidate gene fusion, a custom C# script was used to extract all mate pairs with one read mapping to the candidate rearranged gene and the other read mapping to a different genomic locus. The mapping position(s) of the paired read were used to nominate candidate gene fusion partners. A series of filters was then applied to distinguish nominated rearrangements from artifacts arising during library construction. Specifically, the median predicted distance between paired reads was required to be between 100 and 400 nts. Nominated fusions involving genes located adjacent to one another and oriented in the same direction on the chromosome (i.e. likely “readthrough” transcripts) were filtered out.

In addition, a second C# script was designed to screen for mate pairs with single reads spanning potential exon-exon fusion junctions (chimeric reads) of nominated gene fusions. Briefly, we screened for mate pairs with a single read mapping to either gene in a nominated gene fusion and with a second non-mapping read. The script attempted to align these non-mapping reads to various exon-exon combinations from the two genes involved in the nominated rearrangement. Identified chimeric reads were merged with the other mate pairs supporting the nominated gene fusion. Nominated rearrangements with fewer than two supporting mate pairs were filtered out, and candidates were validated by RT-PCR followed by Sanger sequencing.
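The partner-nomination step can be illustrated with a small Python sketch. It is not the original C# script: the `MatePair` type, its fields, and the `readthrough_neighbors` argument are hypothetical, and reads are assumed to be already annotated with the gene they map to. The insert-size filter and the chimeric-read realignment are not shown. Only the counting of discordant mate pairs, the readthrough exclusion, and the minimum of two supporting pairs follow the text.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class MatePair:
    gene1: str  # gene hit by the first read of the pair
    gene2: str  # gene hit by the second read of the pair


def nominate_partners(pairs, candidate, readthrough_neighbors=frozenset(),
                      min_support=2):
    """Count mate pairs linking `candidate` to each other gene.

    Pairs with both reads in the candidate gene are ignored, adjacent
    same-strand neighbors (likely readthrough) are excluded, and partners
    with fewer than `min_support` supporting pairs are dropped.
    """
    support = Counter()
    for pair in pairs:
        genes = {pair.gene1, pair.gene2}
        if candidate not in genes or len(genes) != 2:
            continue
        (partner,) = genes - {candidate}
        if partner in readthrough_neighbors:
            continue
        support[partner] += 1
    return {gene: n for gene, n in support.items() if n >= min_support}


toy_pairs = (
    [MatePair("EWSR1", "CREM")] * 2
    + [MatePair("CREM", "EWSR1")]           # read order does not matter
    + [MatePair("EWSR1", "EWSR1")]          # concordant pair, ignored
    + [MatePair("EWSR1", "GENE_X")]         # single supporting pair, dropped
    + [MatePair("EWSR1", "NEIGHBOR")] * 5   # hypothetical readthrough neighbor
)
print(nominate_partners(toy_pairs, "EWSR1",
                        readthrough_neighbors=frozenset({"NEIGHBOR"})))
```

The gene names `GENE_X` and `NEIGHBOR` are placeholders for the example and do not refer to real loci.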

RT-PCR validation of fusions

Specimen RNAs were reverse transcribed using SuperScript III reverse transcriptase with random hexamers (Invitrogen). Primers used for RT-PCR gene fusion validation are listed in Table 2.S5. PCR reactions were resolved on 1% agarose TAE gels, and bands were purified and Sanger-sequenced to verify predicted fusion junctions. For validation of the EGFRvIII gene product [95], RT-PCR was performed using 200 ng of total RNA and the One-Step RT-PCR kit (Qiagen). Reverse transcription was done at 52°C for 45 minutes, 60°C for 1 minute, and 52°C for 30 minutes, followed by inactivation and hot-start PCR at 95°C for 15 minutes. Denaturation, annealing, and extension were done at 93°C, 60°C, and 72°C, for 30 seconds, 1 minute, and 45 seconds, respectively, for a total of 40 cycles, with a final extension period at 72°C for 10 minutes. Reaction products were electrophoresed in 2% agarose gels and stained with SYBR Green.

Break-apart FISH assays

Probe labeling and FISH were performed using Vysis/Abbott Molecular reagents and protocols. Locus-specific BACs encompassing ROS1 (CTD-2174H19 telomeric, RP11-605K7 centromeric), RAF1 (RP11-586C12 telomeric, RP11-767C1 centromeric), and BRAF (RP11-364M15 telomeric, RP11-597I24 centromeric) were labeled with Cy5-dUTP (telomeric probes) or Cy3-dUTP (centromeric probes). Chromosomal locations of BACs were first validated using normal metaphase slides. Fluorescently labeled probes interrogating ROS1 were hybridized to TMAs containing 280 sarcoma and soft tissue tumor specimens. Probes interrogating RAF1 and BRAF were hybridized to TMAs containing 104 evaluable pancreatic cancer cases. Slides were counterstained with DAPI, and imaged using an Olympus BX51 fluorescence microscope with Applied Imaging Ariol 3.0 software. Rearrangement was defined by physical separation of the red and green FISH signals, or loss of the red or green FISH signal, in at least 25% of tumor nuclei.

siRNA transfections

ON-TARGETplus siRNAs targeting RAF1 and CREM, as well as a non-targeting control siRNA pool (ON-TARGETplus siCONTROL Non-targeting Pool), were obtained from Dharmacon. Cell lines were seeded at a density of 75,000-150,000 cells per well of a 6-well plate and transfected using Lipofectamine 2000 reagent (Invitrogen). Cells were transfected with a final concentration of 25 nM siRNA for 16 hours in Opti-MEM (GIBCO), which was subsequently replaced with complete growth media (RPMI-1640 with 10% FBS).

Q-RT-PCR and Western blots

Q-RT-PCR was performed using Assay-on-Demand TaqMan probes and reagents (Applied Biosystems). A custom primer set encompassing the EWSR1/CREM gene fusion junction was designed to interrogate expression of the gene fusion in CHL-1 (GCCAACAGAGCAGCAGCTA, GGATCTGGTAAGTTGGCATGTCA). Western blots were done on whole cell lysates, using the following primary antibodies: anti-RAF1 rabbit polyclonal (1:200; Cell Signaling); anti-EGFRvIII (1:1000, [95]); anti-GAPDH rabbit polyclonal antibody (1:5000; Santa Cruz Biotechnology); β-actin (1:10,000; Chemicon).

Cell proliferation, invasion, and senescence assays

Cell viability/proliferation was quantified by colorimetry associated with cleavage of the tetrazolium salt, WST-1 (Roche). Briefly, 10% WST-1 reagent was added to cells at 1, 3, and 5 days post siRNA transfection and then incubated at 37°C for 30 minutes. Absorbance was measured at 450 nm with reference to 650 nm using a Spectra Max 190 plate reader (Molecular Devices). Invasion was quantified by the Boyden chamber assay (BD Biosciences). Briefly, siRNA-transfected cells were plated at a density of 20,000 cells per 24-well insert. A chemotactic gradient of 1% to 10% FBS was established, and cells were fixed and stained with crystal violet 48 hours post transfection. Cells traversing the membrane were counted. Senescence was assessed 72 hours post transfection using the Senescence β-Galactosidase Staining Kit (Cell Signaling) according to the manufacturer’s instructions. Cells were washed with 1 x PBS and then treated with a fixative solution. Cells were then stained for β-galactosidase and counted. All assays were performed in biological triplicate, and mean values together with SDs are reported. All experiments were reproduced at least once.

Gleevec, sorafenib, and PD0332991 treatment

Gleevec and sorafenib were obtained from LC Laboratories (Woburn, MA) and PD0332991 was obtained from Selleck Chemicals (Houston, TX). Agents were reconstituted in DMSO and used at the indicated concentrations. IC50 values were determined by fitting sigmoidal (four-parameter logistic) curves with Prism 4.0 software (GraphPad).
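For readers unfamiliar with the four-parameter logistic model, the sketch below evaluates one common parameterization of the curve in Python. The exact model settings used in Prism are not stated in the text, so this form, the function name, and the example parameter values (0 and 100 for the plateaus, a Hill slope of -1) are assumptions. The only value taken from the text is the IC50 of 0.27 µM reported for PD0332991 in Jurkat cells. The sketch evaluates the curve and does not perform the fit.

```python
import math


def four_parameter_logistic(log_conc, bottom, top, log_ic50, hill_slope):
    """Response at a given log10 concentration.

    bottom and top are the lower and upper plateaus, log_ic50 is the log10
    concentration giving the half-maximal response, and hill_slope sets the
    steepness (negative for an inhibitor, so response falls with dose).
    """
    return bottom + (top - bottom) / (
        1.0 + 10.0 ** ((log_ic50 - log_conc) * hill_slope)
    )


log_ic50 = math.log10(0.27)  # IC50 in µM, as reported for PD0332991
for conc_um in (0.001, 0.027, 0.27, 2.7, 100.0):
    response = four_parameter_logistic(math.log10(conc_um), 0.0, 100.0,
                                       log_ic50, -1.0)
    print(f"{conc_um:>8} µM -> {response:.1f}% of control")
```

By construction, the response at the IC50 concentration lies exactly halfway between the two plateaus, which is the property that a fitting program estimates from the data.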

Data access

All microarray and short-read sequencing data have been deposited in the NCBI Gene Expression Omnibus and Short Read Archive under accession no. GSE45137.

ACKNOWLEDGEMENTS

The authors would like to thank the members of the Pollack lab for helpful discussion.

REFERENCES

1. Aman P (1999) Fusion genes in solid tumors. Semin Cancer Biol 9: 303-318.
2. Mitelman F, Johansson B, Mertens F (2004) Fusion genes and rearranged genes as a linear function of chromosome aberrations in cancer. Nat Genet 36: 331-334.
3. Mitelman F, Johansson B, Mertens F (2007) The impact of translocations and gene fusions on cancer causation. Nat Rev Cancer 7: 233-245.
4. Tomlins SA, Rhodes DR, Perner S, Dhanasekaran SM, Mehra R, et al. (2005) Recurrent fusion of TMPRSS2 and ETS transcription factor genes in prostate cancer. Science 310: 644-648.
5. Soda M, Choi YL, Enomoto M, Takada S, Yamashita Y, et al. (2007) Identification of the transforming EML4-ALK fusion gene in non-small-cell lung cancer. Nature 448: 561-566.
6. Tomlins SA, Laxman B, Dhanasekaran SM, Helgeson BE, Cao X, et al. (2007) Distinct classes of chromosomal rearrangements create oncogenic ETS gene fusions in prostate cancer. Nature 448: 595-599.
7. Kwak EL, Bang YJ, Camidge DR, Shaw AT, Solomon B, et al. (2010) Anaplastic lymphoma kinase inhibition in non-small-cell lung cancer. N Engl J Med 363: 1693-1703.
8. Tallman MS (2002) Advancing the treatment of hematologic malignancies through the development of targeted interventions. Semin Hematol 39: 1-5.
9. Rowley JD (1973) Letter: A new consistent chromosomal abnormality in chronic myelogenous leukaemia identified by quinacrine fluorescence and Giemsa staining. Nature 243: 290-293.
10. de Klein A, van Kessel AG, Grosveld G, Bartram CR, Hagemeijer A, et al. (1982) A cellular oncogene is translocated to the Philadelphia chromosome in chronic myelocytic leukaemia. Nature 300: 765-767.
11. de The H, Lavau C, Marchio A, Chomienne C, Degos L, et al. (1991) The PML-RAR alpha fusion mRNA generated by the t(15;17) translocation in acute promyelocytic leukemia encodes a functionally altered RAR. Cell 66: 675-684.

12. Kakizuka A, Miller WH, Jr., Umesono K, Warrell RP, Jr., Frankel SR, et al. (1991) Chromosomal translocation t(15;17) in human acute promyelocytic leukemia fuses RAR alpha with a novel putative transcription factor, PML. Cell 66: 663-674.
13. Eguchi M, Eguchi-Ishimae M, Tojo A, Morishita K, Suzuki K, et al. (1999) Fusion of ETV6 to neurotrophin-3 receptor TRKC in acute myeloid leukemia with t(12;15)(p13;q25). Blood 93: 1355-1363.
14. Knezevich SR, McFadden DE, Tao W, Lim JF, Sorensen PH (1998) A novel ETV6-NTRK3 gene fusion in congenital fibrosarcoma. Nat Genet 18: 184-187.
15. Tognon C, Knezevich SR, Huntsman D, Roskelley CD, Melnyk N, et al. (2002) Expression of the ETV6-NTRK3 gene fusion as a primary event in human secretory breast carcinoma. Cancer Cell 2: 367-376.
16. Forghieri F, Morselli M, Potenza L, Maccaferri M, Pedrazzi L, et al. (2011) Chronic eosinophilic leukaemia with ETV6-NTRK3 fusion transcript in an elderly patient affected with pancreatic carcinoma. Eur J Haematol 86: 352-355.
17. Palanisamy N, Ateeq B, Kalyana-Sundaram S, Pflueger D, Ramnarayanan K, et al. (2010) Rearrangements of the RAF kinase pathway in prostate cancer, gastric cancer and melanoma. Nat Med 16: 793-798.
18. Jones DT, Kocialkowski S, Liu L, Pearson DM, Backlund LM, et al. (2008) Tandem duplication producing a novel oncogenic BRAF fusion gene defines the majority of pilocytic astrocytomas. Cancer Res 68: 8673-8677.
19. Bass AJ, Lawrence MS, Brace LE, Ramos AH, Drier Y, et al. (2011) Genomic sequencing of colorectal adenocarcinomas identifies a recurrent VTI1A-TCF7L2 fusion. Nat Genet 43: 964-968.
20. Berger MF, Levin JZ, Vijayendran K, Sivachenko A, Adiconis X, et al. (2010) Integrative analysis of the melanoma transcriptome. Genome Res 20: 413-427.

21. Campbell PJ, Stephens PJ, Pleasance ED, O'Meara S, Li H, et al. (2008) Identification of somatically acquired rearrangements in cancer using genome-wide massively parallel paired-end sequencing. Nat Genet 40: 722-729.
22. Ju YS, Lee WC, Shin JY, Lee S, Bleazard T, et al. (2011) Fusion of KIF5B and RET transforming gene in lung adenocarcinoma revealed from whole-genome and transcriptome sequencing. Genome Res.
23. Maher CA, Kumar-Sinha C, Cao X, Kalyana-Sundaram S, Han B, et al. (2009) Transcriptome sequencing to detect gene fusions in cancer. Nature 458: 97-101.
24. Maher CA, Palanisamy N, Brenner JC, Cao X, Kalyana-Sundaram S, et al. (2009) Chimeric transcript discovery by paired-end transcriptome sequencing. Proc Natl Acad Sci U S A 106: 12353-12358.
25. Pflueger D, Terry S, Sboner A, Habegger L, Esgueva R, et al. (2011) Discovery of non-ETS gene fusions in human prostate cancer using next-generation RNA sequencing. Genome Res 21: 56-67.
26. Stephens PJ, McBride DJ, Lin ML, Varela I, Pleasance ED, et al. (2009) Complex landscapes of somatic rearrangement in human breast cancer genomes. Nature 462: 1005-1010.
27. Tao J, Deng NT, Ramnarayanan K, Huang B, Oh HK, et al. (2011) CD44-SLC1A2 gene fusions in gastric cancer. Sci Transl Med 3: 77ra30.
28. Kawamata N, Ogawa S, Zimmermann M, Niebuhr B, Stocking C, et al. (2008) Cloning of genes involved in chromosomal translocations by high-resolution single nucleotide polymorphism genomic microarray. Proc Natl Acad Sci U S A 105: 11921-11926.
29. McLendon R, Friedman A, Bigner D, Van Meir E, Brat D, et al. (2008) Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature 455: 1061-1068.
30. Bell D, Berchuck A, Birrer M, Chien J, Cramer D, et al. (2011) Integrated genomic analyses of ovarian carcinoma. Nature 474: 609-615.

31. Bignell GR, Greenman CD, Davies H, Butler AP, Edkins S, et al. (2010) Signatures of mutation and selection in the cancer genome. Nature 463: 893-898.
32. O'Neil J, Tchinda J, Gutierrez A, Moreau L, Maser RS, et al. (2007) Alu elements mediate MYB gene tandem duplication in human T-ALL. J Exp Med 204: 3059-3066.
33. Reinhold WC, Mergny JL, Liu H, Ryan M, Pfister TD, et al. (2010) Exon array analyses across the NCI-60 reveal potential regulation of TOP1 by transcription pausing at guanosine quartets in the first intron. Cancer Res 70: 2191-2203.
34. Shain AH, Giacomini CP, Matsukuma K, Karikari CA, Bashyam MD, et al. (2012) Convergent structural alterations define SWItch/Sucrose NonFermentable (SWI/SNF) chromatin remodeler as a central tumor suppressive complex in pancreatic cancer. Proc Natl Acad Sci U S A 109: E252-259.
35. Lovf M, Thomassen GO, Bakken AC, Celestino R, Fioretos T, et al. (2011) Fusion gene microarray reveals cancer type-specificity among fusion genes. Genes Chromosomes Cancer 50: 348-357.
36. Lin E, Li L, Guan Y, Soriano R, Rivers CS, et al. (2009) Exon array profiling detects EML4-ALK fusion in breast, colorectal, and non-small cell lung cancers. Mol Cancer Res 7: 1466-1476.
37. Jhavar S, Reid A, Clark J, Kote-Jarai Z, Christmas T, et al. (2008) Detection of TMPRSS2-ERG translocations in human prostate cancer by expression profiling using GeneChip Human Exon 1.0 ST arrays. J Mol Diagn 10: 50-57.
38. Wang L, Motoi T, Khanin R, Olshen A, Mertens F, et al. (2012) Identification of a novel, recurrent HEY1-NCOA2 fusion in mesenchymal chondrosarcoma based on a genome-wide screen of exon-level expression data. Genes Chromosomes Cancer 51: 127-139.
39. Li F, Feng Y, Fang R, Fang Z, Xia J, et al. (2012) Identification of RET gene fusion by exon array analyses in "pan-negative" lung cancer from never smokers. Cell Res 22: 928-931.

40. Futreal PA, Coin L, Marshall M, Down T, Hubbard T, et al. (2004) A census of human cancer genes. Nat Rev Cancer 4: 177-183.
41. Ritz A, Paris PL, Ittmann MM, Collins C, Raphael BJ (2011) Detection of recurrent rearrangement breakpoints from copy number data. BMC Bioinformatics 12: 114.
42. Tibshirani R, Wang P (2008) Spatial smoothing and hot spot detection for CGH data using the fused lasso. Biostatistics 9: 18-29.
43. Acquaviva J, Wong R, Charest A (2009) The multifaceted roles of the receptor tyrosine kinase ROS in development and cancer. Biochim Biophys Acta 1795: 37-52.
44. Rikova K, Guo A, Zeng Q, Possemato A, Yu J, et al. (2007) Global survey of phosphotyrosine signaling identifies oncogenic kinases in lung cancer. Cell 131: 1190-1203.
45. Charest A, Lane K, McMahon K, Park J, Preisinger E, et al. (2003) Fusion of FIG to the receptor tyrosine kinase ROS in a glioblastoma with an interstitial del(6)(q21q21). Genes Chromosomes Cancer 37: 58-71.
46. Gu TL, Deng X, Huang F, Tucker M, Crosby K, et al. (2011) Survey of tyrosine kinase signaling reveals ROS kinase fusions in human cholangiocarcinoma. PLoS One 6: e15640.
47. Toguchida J, Nakayama T (2009) Molecular genetics of sarcomas: applications to diagnoses and therapy. Cancer Sci 100: 1573-1580.
48. Young RJ, Brown NJ, Reed MW, Hughes D, Woll PJ (2010) Angiosarcoma. Lancet Oncol 11: 983-991.
49. Chmielecki J, Peifer M, Viale A, Hutchinson K, Giltnane J, et al. (2012) Systematic screen for tyrosine kinase rearrangements identifies a novel C6orf204-PDGFRB fusion in a patient with recurrent T-ALL and an associated myeloproliferative neoplasm. Genes Chromosomes Cancer 51: 54-65.
50. Apweiler R, Jesus Martin M, O'Donovan C, Magrane M, Alam-Faruque Y, et al. (2012) Reorganizing the protein space at the Universal Protein Resource (UniProt). Nucleic Acids Res 40: D71-75.

51. Charest A, Kheifets V, Park J, Lane K, McMahon K, et al. (2003) Oncogenic targeting of an activated tyrosine kinase to the Golgi apparatus in a glioblastoma. Proc Natl Acad Sci U S A 100: 916-921.
52. Soda M, Takada S, Takeuchi K, Choi YL, Enomoto M, et al. (2008) A mouse model for EML4-ALK-positive lung cancer. Proc Natl Acad Sci U S A 105: 19893-19897.
53. West RB, Nuyten DS, Subramanian S, Nielsen TO, Corless CL, et al. (2005) Determination of stromal signatures in breast carcinoma. PLoS Biol 3: e187.
54. West RB, Rubin BP, Miller MA, Subramanian S, Kaygusuz G, et al. (2006) A landscape effect in tenosynovial giant-cell tumor from activation of CSF1 expression by a translocation in a minority of tumor cells. Proc Natl Acad Sci U S A 103: 690-695.
55. Beck AH, Lee CH, Witten DM, Gleason BC, Edris B, et al. (2010) Discovery of molecular subtypes in leiomyosarcoma through integrative molecular profiling. Oncogene 29: 845-854.
56. Wu C, Orozco C, Boyer J, Leglise M, Goodale J, et al. (2009) BioGPS: an extensible and customizable portal for querying and organizing gene annotation resources. Genome Biol 10: R130.
57. Su AI, Wiltshire T, Batalov S, Lapp H, Ching KA, et al. (2004) A gene atlas of the mouse and human protein-encoding transcriptomes. Proc Natl Acad Sci U S A 101: 6062-6067.
58. Loboda A, Nebozhyn MV, Watters JW, Buser CA, Shaw PM, et al. (2011) EMT is the dominant program in human colon cancer. BMC Med Genomics 4: 9.
59. Koorstra JB, Hustinx SR, Offerhaus GJ, Maitra A (2008) Pancreatic carcinogenesis. Pancreatology 8: 110-125.
60. Delattre O, Zucman J, Plougastel B, Desmaze C, Melot T, et al. (1992) Gene fusion with an ETS DNA-binding domain caused by chromosome translocation in human tumours. Nature 359: 162-165.
61. Gerald WL, Rosai J, Ladanyi M (1995) Characterization of the genomic breakpoint and chimeric transcripts in the EWS-WT1 gene fusion of desmoplastic small round cell tumor. Proc Natl Acad Sci U S A 92: 1028-1032.
62. Martini A, La Starza R, Janssen H, Bilhou-Nabera C, Corveleyn A, et al. (2002) Recurrent rearrangement of the Ewing's sarcoma gene, EWSR1, or its homologue, TAF15, with the transcription factor CIZ/NMP4 in acute leukemia. Cancer Res 62: 5408-5412.
63. Molina CA, Foulkes NS, Lalli E, Sassone-Corsi P (1993) Inducibility and negative autoregulation of CREM: an alternative promoter directs the expression of ICER, an early response repressor. Cell 75: 875-886.
64. Masquilier D, Foulkes NS, Mattei MG, Sassone-Corsi P (1993) Human CREM gene: evolutionary conservation, chromosomal localization, and inducibility of the transcript. Cell Growth Differ 4: 931-937.
65. Foulkes NS, Borrelli E, Sassone-Corsi P (1991) CREM gene: use of alternative DNA-binding domains generates multiple antagonists of cAMP-induced transcription. Cell 64: 739-749.
66. Hayette S, Tigaud I, Callet-Bauchu E, Ffrench M, Gazzo S, et al. (2003) In B-cell chronic lymphocytic leukemias, 7q21 translocations lead to overexpression of the CDK6 gene. Blood 102: 1549-1550.
67. Raffini LJ, Slater DJ, Rappaport EF, Lo Nigro L, Cheung NK, et al. (2002) Panhandle and reverse-panhandle PCR enable cloning of der(11) and der(other) genomic breakpoint junctions of MLL translocations and identify complex translocation of MLL, AF-4, and CDK6. Proc Natl Acad Sci U S A 99: 4568-4573.
68. Corcoran MM, Mould SJ, Orchard JA, Ibbotson RE, Chapman RM, et al. (1999) Dysregulation of cyclin dependent kinase 6 expression in splenic marginal zone lymphoma through chromosome 7q translocations. Oncogene 18: 6271-6277.
69. Ohashi PS, Mak TW, Van den Elsen P, Yanagi Y, Yoshikai Y, et al. (1985) Reconstitution of an active surface T3/T-cell antigen receptor by DNA transfer. Nature 316: 606-609.

70. Argani P, Lui MY, Couturier J, Bouvier R, Fournet JC, et al. (2003) A novel CLTC-TFE3 gene fusion in pediatric renal adenocarcinoma with t(X;17)(p11.2;q23). Oncogene 22: 5374-5378.
71. Cools J, Wlodarska I, Somers R, Mentens N, Pedeutour F, et al. (2002) Identification of novel fusion partners of ALK, the anaplastic lymphoma kinase, in anaplastic large-cell lymphoma and inflammatory myofibroblastic tumor. Genes Chromosomes Cancer 34: 354-362.
72. Gascoyne RD, Lamant L, Martin-Subero JI, Lestou VS, Harris NL, et al. (2003) ALK-positive diffuse large B-cell lymphoma is associated with Clathrin-ALK rearrangements: report of 6 cases. Blood 102: 2568-2573.
73. Robinson DR, Kalyana-Sundaram S, Wu YM, Shankar S, Cao X, et al. (2011) Functionally recurrent rearrangements of the MAST kinase and Notch gene families in breast cancer. Nat Med 17: 1646-1651.
74. Bigner SH, Humphrey PA, Wong AJ, Vogelstein B, Mark J, et al. (1990) Characterization of the epidermal growth factor receptor in human glioma cell lines and xenografts. Cancer Res 50: 8017-8022.
75. Stockhausen MT, Broholm H, Villingshoj M, Kirchhoff M, Gerdes T, et al. (2011) Maintenance of EGFR and EGFRvIII expressions in an in vivo and in vitro model of human glioblastoma multiforme. Exp Cell Res 317: 1513-1526.
76. Del Vecchio CA, Giacomini CP, Vogel H, Jensen KC, Florio T, et al. (2012) EGFRvIII gene rearrangement is an early event in glioblastoma tumorigenesis and expression defines a hierarchy modulated by epigenetic mechanisms. Oncogene.
77. Cools J, Quentmeier H, Huntly BJ, Marynen P, Griffin JD, et al. (2004) The EOL-1 cell line as an in vitro model for the study of FIP1L1-PDGFRA-positive chronic eosinophilic leukemia. Blood 103: 2802-2805.
78. Cools J, DeAngelo DJ, Gotlib J, Stover EH, Legare RD, et al. (2003) A tyrosine kinase created by fusion of the PDGFRA and FIP1L1 genes as a therapeutic target of imatinib in idiopathic hypereosinophilic syndrome. N Engl J Med 348: 1201-1214.

79. Bergethon K, Shaw AT, Ou SH, Katayama R, Lovly CM, et al. (2012) ROS1 rearrangements define a unique molecular class of lung cancers. J Clin Oncol 30: 863-870.
80. Shiffman D, Ellis SG, Rowland CM, Malloy MJ, Luke MM, et al. (2005) Identification of four gene variants associated with myocardial infarction. Am J Hum Genet 77: 596-605.
81. Yamada Y, Metoki N, Yoshida H, Satoh K, Kato K, et al. (2008) Genetic factors for ischemic and hemorrhagic stroke in Japanese individuals. Stroke 39: 2211-2218.
82. Rabbitts TH (1994) Chromosomal translocations in human cancer. Nature 372: 143-149.
83. DeBerardinis RJ, Lum JJ, Hatzivassiliou G, Thompson CB (2008) The biology of cancer: metabolic reprogramming fuels cell growth and proliferation. Cell Metab 7: 11-20.
84. Nakanishi T (2007) Drug transporters as targets for cancer chemotherapy. Cancer Genomics Proteomics 4: 241-254.
85. Chappell WH, Steelman LS, Long JM, Kempf RC, Abrams SL, et al. (2011) Ras/Raf/MEK/ERK and PI3K/PTEN/Akt/mTOR inhibitors: rationale and importance to inhibiting these pathways in human health. Oncotarget 2: 135-164.
86. McCubrey JA, Steelman LS, Abrams SL, Chappell WH, Russo S, et al. (2009) Emerging Raf inhibitors. Expert Opin Emerg Drugs 14: 633-648.
87. Brenner JC, Feng FY, Han S, Patel S, Goyal SV, et al. (2012) PARP-1 inhibition as a targeted strategy to treat Ewing's sarcoma. Cancer Res 72: 1608-1613.
88. Hu MG, Deshpande A, Enos M, Mao D, Hinds EA, et al. (2009) A requirement for cyclin-dependent kinase 6 in thymocyte development and tumorigenesis. Cancer Res 69: 810-818.
89. Jan Y, Matter M, Pai JT, Chen YL, Pilch J, et al. (2004) A mitochondrial protein, Bit1, mediates apoptosis regulated by integrins and Groucho/TLE corepressors. Cell 116: 751-762.

90. Edgar R, Domrachev M, Lash AE (2002) Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res 30: 207-210.
91. Lapointe J, Li C, Higgins JP, van de Rijn M, Bair E, et al. (2004) Gene expression profiling identifies clinically relevant subtypes of prostate cancer. Proc Natl Acad Sci U S A 101: 811-816.
92. Nowak G, Hastie T, Pollack JR, Tibshirani R (2011) A fused lasso latent feature model for analyzing multi-sample aCGH data. Biostatistics 12: 776-791.
93. Barretina J, Caponigro G, Stransky N, Venkatesan K, Margolin AA, et al. (2012) The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature 483: 603-607.
94. Kim YH, Girard L, Giacomini CP, Wang P, Hernandez-Boussard T, et al. (2006) Combined microarray analysis of small cell lung cancer reveals altered apoptotic balance and distinct expression signatures of MYC family gene amplification. Oncogene 25: 130-138.
95. Del Vecchio CA, Jensen KC, Nitta RT, Shain AH, Giacomini CP, et al. (2012) Epidermal growth factor receptor variant III contributes to cancer stem cell phenotypes in invasive breast carcinoma. Cancer Res 72: 2657-2671.
96. Heisterkamp N, Morris C, Sender L, Knoppel E, Uribe L, et al. (1990) Rearrangement of the human ABL oncogene in a glioblastoma. Cancer Res 50: 3429-3434.
97. Ullrich A, Coussens L, Hayflick JS, Dull TJ, Gray A, et al. (1984) Human epidermal growth factor receptor cDNA sequence and aberrant expression of the amplified gene in A431 epidermoid carcinoma cells. Nature 309: 418-425.
98. Hunts JH, Shimizu N, Yamamoto T, Toyoshima K, Merlino GT, et al. (1985) Translocation chromosome 7 of A431 cells contains amplification and rearrangement of EGF receptor gene responsible for production of variant mRNA. Somat Cell Mol Genet 11: 477-484.

FIGURE LEGENDS

Figure 2.1. Breakpoint analysis for discovering novel cancer gene rearrangements. Schematic depiction of the approach and workflow, illustrated by the rediscovery of a known gene fusion, SET/NUP214, in the T-ALL cell line LOUCY. Various publicly available and in-house exon microarray and high-density CGH/SNP array experiments were analyzed. RNA breakpoint analysis (RBA) identifies significant transitions in exon expression level, which may reflect elevated expression of exons distal (3’ partner) or proximal (5’ partner) to a gene fusion junction. To identify such transitions, a “walking” Student’s t-test was applied, comparing expression levels of proximal and distal exons. Candidate rearrangements were subsequently filtered for those disrupting genes of the Cancer Gene Census, with directional orientation (i.e., being the 5’ or 3’ partner) consistent with known rearrangements of that gene. RBA candidates were further filtered using a Bonferroni correction to adjust for multiple t-tests. DNA breakpoint analysis (DBA) screens for intragenic DNA copy number transitions, which may reflect unbalanced chromosomal rearrangements leading to the formation of gene fusions. The fused lasso method (FDR 1%) followed by a copy number smoothing algorithm was applied to identify copy number alterations (CNAs). Copy number transitions were filtered for those disrupting any annotated gene and then further filtered for those disrupting genes of the Cancer Gene Census. We included only candidate breakpoints where the directional orientation of the copy number transition was consistent with known rearrangements of that gene. Several candidates were then validated using molecular and cytogenetic approaches. The average numbers of candidate rearrangements per cancer sample at each stage of the workflow are shown along the left and right panels.

Figure 2.2. Discovery and characterization of CEP85L/ROS1 in angiosarcoma. (A) RBA of angiosarcoma specimen AS1 reveals an expression breakpoint between exons 34 and 35 of ROS1, suggesting rearrangement. (B) Experimental validation of CEP85L/ROS1 in AS1 by RT-PCR, using primers flanking the gene fusion junction. (C) Predicted structure of CEP85L/ROS1. CEP85L and ROS1 are oriented in the same direction and located ~1 Mb apart within cytoband 6q22. The gene fusion preserves a coiled-coil (CC) domain from CEP85L and the tyrosine kinase (TK) domain of ROS1. Exons are numbered, with untranslated regions depicted in corresponding lighter shades. (D) Break-apart FISH demonstrates rearrangement of ROS1 in angiosarcoma and epithelioid hemangioendothelioma. Co-localizing red and green signals are indicative of normal chr 6 (left panel). AS1 exhibits loss of red signal with multiple green signals, indicative of amplification of rearranged ROS1. An epithelioid hemangioendothelioma specimen (EHE10) also exhibits loss of red signal, indicative of unbalanced rearrangement of ROS1. (E) Increased ROS1 expression in angiosarcoma compared to other sarcoma subtypes. Heatmap depicts genes selectively overexpressed in angiosarcoma, identified by supervised analysis. Genes are ordered by rank value of their t-statistic scores. Mean-centered gene expression ratios are depicted by a log2 pseudocolor scale (ratio-fold change indicated). AS: angiosarcoma, DTF: desmoid-type fibromatosis, GCTTS: giant cell tumor-tendon sheath, HPC: hemangiopericytoma, PVNS: pigmented villonodular synovitis, SFT: solitary fibrous tumor, SS: synovial sarcoma, LMS: leiomyosarcoma. *** P = 4.26 × 10⁻²⁸ (Student’s t-test).

Figure 2.3. Discovery of APIP/SLC1A2 in colon cancer. (A) Array CGH heatmap displaying genomic breakpoints disrupting SLC1A2 in the SNU-C1 colon cancer cell line and the SNU-16 gastric cancer cell line. SNU-16 is known to harbor CD44/SLC1A2 and its array CGH profile is depicted for comparison. Unsmoothed log2 ratios are displayed. (B) Paired-end RNA-seq uncovers APIP/SLC1A2 in SNU-C1. A subset of paired-end reads mapping to APIP/SLC1A2 as well as the gene fusion structure are displayed (left panel). The structure of the known gastric cancer gene fusion CD44/SLC1A2 is depicted for comparison (right panel). An internal start codon within exon 2 of SLC1A2 is predicted to initiate translation in both rearrangements. Inset: experimental validation of APIP/SLC1A2 by RT-PCR with primers flanking the gene fusion junction. (C, D) Gene expression profiling depicts high-level expression of APIP in normal colon (C) and overexpression of SLC1A2 in SNU-C1 (D). Mean-centered gene expression ratios are depicted by a log2 pseudocolor scale and ranked in descending order from left to right.

Figure 2.4. Identification and characterization of novel RAF1 gene fusions in pancreatic cancer and anaplastic astrocytoma. (A) Array CGH heatmaps displaying intragenic RAF1 genomic breakpoints identified in the PL5 pancreatic cancer cell line (left panel) and the D-538MG anaplastic astrocytoma cell line (right panel). Unsmoothed log2 ratios are displayed. (B) Identification of ATG7/RAF1 (left) and BCL6/RAF1 (right) in PL5 and D-538MG cells, respectively, by paired-end RNA-seq. A subset of the paired-end reads supporting each gene fusion is displayed. Both gene fusions are in-frame and retain the RAF1 serine/threonine kinase (STK) domain. (C) Experimental validation of gene fusions by RT-PCR, using primers flanking the respective gene fusion junction. (D) Western blotting verifies knockdown of ATG7/RAF1 in PL5 following transfection of a RAF1-targeting siRNA pool. ATG7/RAF1 protein levels were monitored using an anti-RAF1 antibody, with anti-GAPDH providing a loading control. (E) Decreased cell proliferation and (F) invasion rates of PL5 following transfection of a RAF1-targeting siRNA pool, compared to transfection of a non-targeting control (NTC) siRNA pool. ** P<0.01 (two-sided Student’s t-test). (G) Break-apart FISH demonstrates rearrangement of BRAF in a pancreatic cancer case from the TMA, as evidenced by physical separation of the red and green probes (arrows) flanking BRAF (single interphase nucleus shown).

Figure 2.5. Discovery and characterization of EWSR1/CREM in melanoma. (A) Array CGH heatmap displaying intragenic EWSR1 breakpoints identified in the SH-4 and CHL-1 melanoma cell lines. (B) Paired-end RNA-seq identification of EWSR1/CREM in CHL-1. Paired-end reads supporting the rearrangement are depicted along with the predicted gene fusion structure. CREM contributes a basic leucine zipper motif (ZIP), while EWSR1 contributes the EWS Activation Domain (EAD). (C) RT-PCR verification of EWSR1/CREM in CHL-1. (D) Quantitative RT-PCR using primers flanking the gene fusion junction verifies EWSR1/CREM knockdown following transfection of an siRNA pool targeting the 3’ end of CREM. (E, F, G) Transfection of CHL-1 with CREM-targeting siRNA pool results in (E) decreased cell proliferation, (F) decreased invasion, and (G) a higher fraction of senescent cells, compared to non-targeting control (NTC). ** P<0.01 (two-sided Student’s t-test).

Figure 2.6. Identification and characterization of FAM133B/CDK6 in J.RT3-T3.5. (A) Heatmap depicting rearrangement of CDK6 in J.RT3-T3.5 (Jurkat derivative). (B) Discovery of the FAM133B/CDK6 rearrangement by paired-end RNA-seq. The fusion junction was confirmed by RT-PCR (not shown) and Sanger sequencing. (C) Gene expression profiling reveals high-level expression of CDK6 in J.RT3-T3.5 compared to other leukemia cell lines. Note that array probes mapped to the portion of CDK6 retained in the fusion. (D) Jurkat demonstrates marked sensitivity to the CDK4/6 inhibitor PD0332991 (IC50 = 0.27 µM). K562, which expresses only wildtype CDK6, is used as a negative control cell line and shows minimal sensitivity to PD0332991 (IC50 = 5.9 µM).

Figure 2.7. DBA discovery of recurrent rearrangements of CLTC and VMP1 across diverse cancer types. (A) Heatmap depicting focal deletions between CLTC and VMP1 in the breast cancer cell lines BT-549 and HCC1954. (B) Discovery of the recurrent CLTC/VMP1 rearrangement in BT-549 (left panel) and HCC1954 (right panel) by paired-end RNA-seq. (C) RT-PCR verification of CLTC/VMP1 fusion in BT-549 and HCC1954. (D) Heatmap depicting focal deletions disrupting CLTC, PTRH2 and/or VMP1 in various cancer types (see legend). (E) Exon microarray profiling of the renal cell carcinoma line RXF393 also revealed an expression breakpoint within CLTC. *** P < 10⁻⁹ (Student’s t-test).

Figure 2.8. Discovery of new cell line models for the known rearrangements, EGFRvIII and FIP1L1/PDGFRA. (A) Heatmap depicting genomic breakpoints within EGFR in the glioblastoma cell lines, CAS-1 and DKMG. (B) Identification of EGFRvIII in DKMG cells by paired-end RNA-seq. Paired-end reads supporting the rearrangement are depicted. (C) Verification of EGFRvIII expression by RT-PCR (top panel) and Western blotting (bottom panel) in DKMG. RT-PCR was done using primers flanking the exon 1/exon 8 junction of EGFRvIII, and Western blotting was done using an antibody specific to the EGFRvIII isoform. Control samples include U87 glioblastoma cells without EGFR rearrangement, U87-vIII cells engineered to express exogenous EGFRvIII, and A431 epidermoid carcinoma cells with EGFR amplification. (D) RBA identification of expression-level breakpoint within PDGFRA in SUPT13 T-ALL cells. *** P < 10⁻¹¹ (Student’s t-test). (E) RNA-seq identification of FIP1L1/PDGFRA. (F) RT-PCR validation of FIP1L1/PDGFRA expression in SUPT13. (G) SUPT13 cells are sensitive to imatinib (IC50 = 0.036 µM). K562 is a positive control CML cell line harboring BCR/ABL1 with known sensitivity to imatinib (IC50 = 0.18 µM).

SUPPORTING INFORMATION LEGENDS

Figure 2.S1. Datasets and cancer types included for breakpoint analysis. Pie charts of cancer type representation for (A) the 92 exon microarray profiles included in RBA, and (B) the 882 aCGH profiles included in DBA. Cancer types are indicated in descending order of sample size, clockwise from 12 o’clock.

Figure 2.S2. RBA for discovery of gene fusions. (A) Depiction of the walking t-test algorithm, illustrated for NOTCH1 in SUPT-1 cells (known to carry a TCRB/NOTCH1 rearrangement). At each exon-exon junction along the transcript, a Student’s t-test is performed comparing the expression levels (green line, above) of exons proximal and distal to that junction. P-values are plotted (blue line, below) and a positive hit is recorded if a P-value drops below a significance threshold defined by Bonferroni adjustment (red dashed line). The minimum P-value corresponds to the predicted breakpoint for the gene fusion. (B) Distribution of walking t-statistics for all samples analyzed by RBA. Note that known gene fusions (red arrows) tend to have “outlier” P-values compared to most transcripts. (C) Distribution of the 54 candidate rearrangements nominated by RBA across cancer types.
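The walking t-test described in panel (A) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline used in this work: the function names (`student_t_pvalue`, `walking_t_test`), the pooled-variance form of the test, the two-sided P-value, the requirement of at least two exons on each side of a junction, and the per-transcript Bonferroni threshold of 0.05 divided by the number of junctions tested are all choices made for illustration. The exon values are synthetic.

```python
from math import exp, lgamma, log, sqrt
from statistics import mean, variance


def _betacf(a, b, x, max_iter=200, eps=3e-14):
    """Continued fraction for the incomplete beta function (modified Lentz method)."""
    tiny = 1e-300
    c = 1.0
    d = 1.0 - (a + b) * x / (a + 1.0)
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        # Even and odd coefficients of the continued fraction for step m.
        for num in (
            m * (b - m) * x / ((a + 2 * m - 1.0) * (a + 2 * m)),
            -(a + m) * (a + b + m) * x / ((a + 2 * m) * (a + 2 * m + 1.0)),
        ):
            d = 1.0 + num * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + num / c
            c = c if abs(c) > tiny else tiny
            delta = d * c
            h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def student_t_pvalue(t, df):
    """Two-sided P-value of Student's t with df degrees of freedom."""
    x = df / (df + t * t)
    a, b = df / 2.0, 0.5
    if x >= 1.0:
        return 1.0
    bt = exp(lgamma(a + b) - lgamma(a) - lgamma(b) + a * log(x) + b * log(1.0 - x))
    if x < (a + 1.0) / (a + b + 2.0):
        return bt * _betacf(a, b, x) / a
    return 1.0 - bt * _betacf(b, a, 1.0 - x) / b


def walking_t_test(exon_levels, min_flank=2):
    """Test every exon-exon junction; return (junction, t, p) tuples.

    Junction j splits the transcript into proximal exons [0, j) and
    distal exons [j, n). min_flank is an assumed minimum group size.
    """
    results = []
    n = len(exon_levels)
    for j in range(min_flank, n - min_flank + 1):
        prox, dist = exon_levels[:j], exon_levels[j:]
        n1, n2 = len(prox), len(dist)
        df = n1 + n2 - 2
        pooled = ((n1 - 1) * variance(prox) + (n2 - 1) * variance(dist)) / df
        if pooled == 0:
            continue  # no variance on either side; skipped in this sketch
        t = (mean(dist) - mean(prox)) / sqrt(pooled * (1.0 / n1 + 1.0 / n2))
        results.append((j, t, student_t_pvalue(t, df)))
    return results


# Synthetic transcript: expression steps up after the fifth exon.
levels = [1.0, 1.2, 0.9, 1.1, 1.0, 3.0, 3.2, 2.9, 3.1, 3.0]
tests = walking_t_test(levels)
best = min(tests, key=lambda r: r[2])   # junction with the minimum P-value
threshold = 0.05 / len(tests)           # assumed Bonferroni adjustment
is_hit = best[2] < threshold
```

For this synthetic profile the minimum P-value falls at the junction after the fifth exon, which is where a predicted breakpoint would be reported under the rule described in panel (A).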

Figure 2.S3. DBA pipeline for gene fusion discovery. (A) DBA pipeline. Fused lasso (FDR 1%) is used initially to call copy number alterations (CNAs). We found that fused lasso tends to overcall transitions (breakpoints) in copy number status. Thus, we applied a custom method, termed “copy number smoothing”, to identify well-defined CNAs and to better determine their upper and lower boundaries. Breakpoints are then screened for those disrupting Cancer Gene Census genes. In this depiction, a breakpoint disrupting PDGFRA corresponds to the FIP1L1/PDGFRA rearrangement in the EOL-1 leukemia cell line. (B) Distribution of the 144 intragenic breakpoints identified by DBA across cancer types.
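The screening step that follows CNA calling can be illustrated with a short Python sketch. It is hypothetical throughout: the `Gene` and `Transition` records, the `fusion_role` annotation, and the gene names and coordinates are invented for illustration, and the fused lasso and copy number smoothing steps are not reproduced. The direction check encodes one plausible reading of the directional filter, namely that the portion of the gene retained in the fusion (the 3’ end of a 3’ partner, the 5’ end of a 5’ partner) lies on the side of the transition with the higher copy number.

```python
from dataclasses import dataclass


@dataclass
class Gene:
    name: str
    chrom: str
    start: int
    end: int
    strand: str        # "+" or "-"
    fusion_role: str   # hypothetical census annotation: "5p", "3p", or "" if absent


@dataclass
class Transition:
    chrom: str
    position: int
    left_log2: float   # copy number log2 ratio on the lower-coordinate side
    right_log2: float  # copy number log2 ratio on the higher-coordinate side


def retained_side_is_higher(gene, tr):
    """Assumed direction filter: the retained gene portion has the higher copy number."""
    three_prime_on_right = gene.strand == "+"
    right_higher = tr.right_log2 > tr.left_log2
    if gene.fusion_role == "3p":
        return right_higher == three_prime_on_right
    if gene.fusion_role == "5p":
        return right_higher != three_prime_on_right
    return False


def candidate_breakpoints(genes, transitions):
    """Keep transitions that fall inside a census gene and pass the direction check."""
    hits = []
    for tr in transitions:
        for g in genes:
            intragenic = g.chrom == tr.chrom and g.start < tr.position < g.end
            if intragenic and g.fusion_role and retained_side_is_higher(g, tr):
                hits.append((g.name, tr.position))
    return hits


# Invented genes and transitions, for illustration only.
genes = [
    Gene("GENE_A", "chr1", 100, 200, "+", "3p"),
    Gene("GENE_B", "chr1", 300, 400, "-", "3p"),
    Gene("GENE_C", "chr1", 500, 600, "+", ""),
]
transitions = [
    Transition("chr1", 150, 0.0, 1.0),  # inside GENE_A, gain on the 3' side
    Transition("chr1", 350, 0.0, 1.0),  # inside GENE_B, gain on the 5' side
    Transition("chr1", 550, 0.0, 1.0),  # inside GENE_C, which has no census role
    Transition("chr1", 250, 0.0, 1.0),  # intergenic
]
hits = candidate_breakpoints(genes, transitions)
```

Only the first transition survives all three checks in this example; the others are dropped because the orientation is inconsistent, the gene carries no census annotation, or the transition is intergenic.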

Figure 2.S4. RBA rediscovery of known gene fusions in various cancers. Exonic expression breakpoints representing known gene fusions including (A) BCR/ABL1 in K562 (CML), (B) NPM1/ALK in SUDHL-1 (ALCL), (C) FIP1L1/PDGFRA in EOL-1 (eosinophilic leukemia), (D) CCDC6/RET in TPC-1 (thyroid cancer), (E) NUP214/ABL1 in ALL-SIL, (F) EWSR1/FLI1 in SKES-1 (Ewing sarcoma).

Figure 2.S5. DBA rediscovery of known gene fusions in various cancers. Heatmaps depicting identified intragenic breakpoints disrupting (A) FLI1 in four Ewing’s sarcoma cell lines (EWSR1/FLI1), (B) ABL1 in seven CML (BCR/ABL1) and T-ALL (NUP214/ABL1) cell lines, and (C) ROS1 in glioblastoma cell line U-118MG (GOPC/ROS1). Samples without rearrangement are also depicted for comparison.

Table 2.S1. Candidate rearrangements nominated by RBA.

Table 2.S2. Candidate rearrangements nominated by DBA.

Table 2.S3. Sarcoma subtypes included on TMA.

Table 2.S4. Effect of filtering parameters on DBA analysis of bone cancer cell lines.

Table 2.S5. RT-PCR primers (for validation of candidate fusions).

Table 2.1. Validated gene fusions and rearrangements

Gene fusion (a)     Sample       Type       Tissue type                   Discovery method  No. supporting reads
ABL1/CBFB (b)       A172         Cell line  GBM                           DBA               30
APIP/SLC1A2         SNU-C1       Cell line  Colon cancer                  DBA               57
ATG7/RAF1           PL5          Cell line  Pancreatic cancer             DBA               14
BCL6/RAF1           D-538MG      Cell line  Anaplastic astrocytoma        DBA               39
CEP85L/ROS1         AS1          Tumor      Angiosarcoma                  RBA               NA
CLTC/VMP1           BT-549       Cell line  Breast cancer                 DBA               16
CLTC/VMP1           HCC1954      Cell line  Breast cancer                 DBA               95
EGFR/PPARGC1A (b)   A431         Cell line  Skin squamous cell carcinoma  DBA               46
EGFRvIII            DKMG         Cell line  GBM                           DBA               16
EWSR1/CREM          CHL-1        Cell line  Melanoma                      DBA               120
FAM133B/CDK6        J.RT3-T3.5   Cell line  T-ALL                         DBA               30
FIP1L1/PDGFRA       SUPT13       Cell line  T-ALL                         RBA               13

(a) Gene fusions initially nominated by breakpoint analysis and subsequently validated by paired-end RNA-seq (or 5’ RACE for CEP85L/ROS1) and RT-PCR.
(b) ABL and EGFR locus rearrangements were previously reported in the respective cell lines [96-98]; however, associated fusion transcripts were not identified.

Figure 2.1. Breakpoint analysis for discovering novel cancer gene rearrangements

[Workflow schematic; graphical content not reproduced. Recoverable labels follow.]

RBA (exon array datasets): 76 public and 16 in-house array profiles. Walking Student's t-test for all annotated genes (~258 genes/sample); Cancer census filter and breakpoint direction filter (~3.0 genes/sample); Bonferroni correction (~0.6 genes/sample).

DBA (array CGH datasets): 812 public and 70 in-house array profiles. Fused lasso (FDR 1%) and copy number smoothing (~66.5 CNAs/sample); filter for CNAs disrupting annotated genes (~30.2 genes/sample); Cancer census filter (~0.43 genes/sample); CNA direction filter (~0.14 genes/sample).

Example: LOUCY among leukemia cell lines; chr9 copy number heatmap (loss/gain) and NUP214 expression level versus NUP214 exon position; SET/NUP214 fusion structure.

Gene fusion candidate validation: RACE, RNA-seq, FISH, RT-PCR.

Figure 2.2. Discovery and characterization of CEP85L/ROS1 in angiosarcoma.

[Figure panels; graphical content not reproduced. Recoverable labels: (A) log ratio versus ROS1 exon position for AS1; (B) RT-PCR lanes: no template, AS1 (500 bp marker); (C) structures of CEP85L, ROS1 and CEP85L/ROS1 on chr6, with CC and TK domains; (D) FISH panels: normal tissue, AS1, EHE10; (E) heatmap columns AS, DTF, GCTTS, HPC, PVNS, SFT, SS, LMS, with the ROS1 row indicated.]

Figure 2.3. Discovery of APIP/SLC1A2 in colon cancer.

[Figure panels; graphical content not reproduced. Recoverable labels: (A) chr11 array CGH heatmap around SLC1A2 for colon cancer cell lines, with SNU-C1 and SNU-16 indicated; (B) paired-end reads and structures of APIP/SLC1A2 (SNU-C1) and CD44/SLC1A2 (SNU-16), with ATG start codons marked, and RT-PCR lanes: no template, SNU-C1; (C) APIP expression across normal tissues; (D) SLC1A2 expression across colon cancer cell lines.]

73

Figure 2.4. Identification and characterization of novel RAF1 gene fusions in pancreatic cancer and anaplastic astrocytoma.

[Figure 2.4 image not reproduced. Panels A-G include: chr3 copy number heatmaps highlighting RAF1 in PL5 (pancreatic cancer cell lines) and D538MG (CNS cancer cell lines); paired-end reads and exon structures of ATG7/RAF1 (PL5) and BCL6/RAF1 (D538MG) with the STK domain marked; RT-PCR gels (no template, PL5; no template, D538MG; 400bp marker); Western blot of ATG7/RAF1 (117 kDa) and GAPDH (37 kDa) in PL5 treated with NTC or RAF1 siRNA pool; cell growth (OD value) versus time (days); and relative invasion (%) for RAF1 versus NTC siRNA pool.]

Figure 2.5. Discovery and characterization of EWSR1/CREM in melanoma.

[Figure 2.5 image not reproduced. Panels A-G include: chr22 copy number heatmap of skin cancer cell lines highlighting EWSR1 in SH-4 and CHL-1; paired-end reads and exon structure of EWSR1/CREM in CHL-1 with the EAD and ZIP domains marked; RT-PCR gel (no template, CHL-1; 400bp marker); EWSR1/CREM transcript level (% GAPDH); cell growth (OD value) versus time (days); relative invasion; and relative senescence in CHL-1 treated with CREM or NTC siRNA pool.]

Figure 2.6. Identification and characterization of FAM133B/CDK6 in J.RT3-T3.5.

[Figure 2.6 image not reproduced. Panels A-D include: chr7 copy number heatmap of hematological cell lines highlighting CDK6 in J.RT3-T3.5; paired-end reads and exon structure of FAM133B/CDK6; CDK6 expression across hematological cell lines; and % viability of Jurkat and K562 versus [PD0332991] (μM).]

Figure 2.7. DBA discovery of recurrent rearrangements of CLTC and VMP1 across diverse cancer types

[Figure 2.7 image not reproduced. Panels A-E include: chr17 copy number heatmap of breast cancer cell lines highlighting CLTC, PTRH2, and VMP1 in BT-549 and HCC1954; paired-end reads and exon structures of CLTC/VMP1 in BT549 and HCC1954; RT-PCR gels (no template, BT549; no template, HCC1954; 300bp marker); CLTC exon-level log2 ratios in RXF393 (x-axis: CLTC exon position); and gain/loss across the CLTC-PTRH2-VMP1 locus in SF268, ACN, NCI-H226, A101D, DEL, BHY, SCC-4, B2-17, LOXIMVI, CGTH-W-1, BFTC-909, HT-1080, SN12C, NCI-H838, CAL-33, RXF393, and 639-V, grouped by tissue (central nervous system, upper aerodigestive, autonomic ganglia, thyroid, lung, kidney, skin, soft tissue, hematological, bladder).]

Figure 2.8. Discovery of new cell line models for the known rearrangements, EGFRvIII and FIP1L1/PDGFRA

[Figure 2.8 image not reproduced. Panels A-G include: chr7 copy number heatmap of CNS cancer cell lines highlighting EGFR in CAS-1 and DKMG; paired-end reads and exon structure of EGFRvIII; RT-PCR gel for EGFRvIII (ladder, H2O, U87, U87-vIII, DKMG; 200bp marker); Western blot for EGFR, EGFRvIII, and actin in A431, U87, U87-vIII, and DKMG; PDGFRA exon-level log2 ratios in SUPT13 (x-axis: PDGFRA exon position); paired-end reads and exon structure of FIP1L1/PDGFRA in SUPT13; RT-PCR gel (no template, SUPT13; 300bp marker); and % viability of K562 and SUPT13 versus [Imatinib] (µM).]

Figure 2.S1. Datasets and cancer types included for breakpoint analysis.

[Figure 2.S1 charts not reproduced. Cancer types labeled in each panel:]

A. Exon microarray datasets: Hematological, Skin, Lung, Kidney, Soft tissue, Colon, Ovary, Prostate, Breast, CNS, Bone, Thyroid.

B. Array CGH datasets: Lung, Hematological, Pancreas, CNS, Breast, Skin, Colon, Autonomic ganglia, Bone, Kidney, Upper aerodigestive, Esophagus, Ovary, Stomach, Urinary tract, Soft tissue, Cervix, Thyroid, Endometrium, Liver, Pleura, Biliary tract, Prostate, Testis, Vulva, Adrenal gland, Placenta, Eye, Small intestine.

Figure 2.S2. RBA for discovery of gene fusions.

[Figure 2.S2 image not reproduced. Panels A-C include: (A) NOTCH1 exon-level log2 ratios in SUPT-1 and the corresponding walking t-test log10 P-value at each exon position, with the TCRB/NOTCH1 exon structure; (B) frequency distribution of the walking t-statistic; (C) number of candidate fusions by cancer type (skin, CNS, lung, bone, breast, ovary, kidney, colon, prostate, thyroid, soft tissue, hematological).]

Figure 2.S3. DBA pipeline for gene fusion discovery.

[Figure 2.S3 image not reproduced. Panels A-B include: (A) the DBA pipeline: fused lasso CNA calls (FDR 0.01), CNA smoothing, and a filter for breakpoints in Cancer Census genes, illustrated with PDGFRA; (B) number of candidate breakpoints by cancer type.]

Figure 2.S4. RBA rediscovery of known gene fusions in various cancers.

[Figure 2.S4 image not reproduced. Each panel plots exon-level log2 ratios against exon position for the rearranged gene(s), with the fusion exon structure and chromosomes shown below:]

A. K562: BCR/ABL1 (chr22, chr9); BCR and ABL1 exon position; BCR P=4.06E-11.
B. SUDHL-1: NPM1/ALK (chr5, chr2); ALK exon position; P=1.56E-13.
C. EOL-1: FIP1L1/PDGFRA (chr4); FIP1L1 and PDGFRA exon position; P=2.18E-5 and P=5.35E-6.
D. TPC-1: CCDC6/RET (chr10); RET exon position; P=1.82E-8.
E. ALL-SIL: NUP214/ABL1 (chr9); NUP214 and ABL1 exon position; NUP214 P=3.22E-9.
F. SKES-1: EWSR1/FLI1 (chr22, chr11); FLI1 exon position; P=3.45E-3.

Figure 2.S5. DBA rediscovery of known gene fusions in various cancers.

[Figure 2.S5 image not reproduced. Copy number heatmap panels:]

A. FLI1 rearrangements: chr11, bone cancer cell lines, FLI1 locus.
B. ABL1 rearrangements: chr9, hematological cancer cell lines, ABL1 locus.
C. GOPC/ROS1: chr6, CNS cancer cell lines, ROS1 locus.

Table 2.S1. Candidate rearrangements nominated by RBA.

Sample^a | Subtype | Gene symbol (accession) | P-value^b | P-value rank^b | Direction | Predicted breakpoint | Also nominated by DBA | Gene fusion^c | Prioritization rationale

Bone
SK-ES-1 | Ewing's sarcoma | WHSC1 (NM_133330) | 5.58E-06 | 41 | 3' | Ex3-Ex4 | | |
SK-ES-1 | Ewing's sarcoma | FLI1 (NM_002017) | 3.45E-03 | 54 | 3' | Ex4-Ex5 | | EWSR1/FLI1 |

Breast
BT-549 | Adenocarcinoma | ROS1 (NM_002944) | 2.39E-07 | 19 | 3' | Ex17-Ex18 | | |
HS578T | Adenocarcinoma | ROS1 (NM_002944) | 1.10E-05 | 48 | 3' | Ex18-Ex19 | | |
MCF7 | Adenocarcinoma | USP6 (NM_004505) | 1.29E-10 | 9 | 3' | Ex14-Ex15 | | | Top-10 P-value rank; MCF7 previously profiled by RNA Seq with no USP6 fusions reported (Robinson DR, et al. (2011) Nat Med 17: 1646-51)
MDA-MB-231 | Adenocarcinoma | MYB (NM_001161657) | 3.39E-06 | 33 | 5' | Ex4-Ex5 | | |
MDA-MB-231 | Adenocarcinoma | WHSC1 (NM_133330) | 7.34E-06 | 45 | 3' | Ex3-Ex4 | | |
T47D | Adenocarcinoma | RUNX1 (NM_001754) | 9.83E-06 | 46 | 5' | Ex6-Ex7 | | |

CNS
SF-295 | Glioblastoma | JAK2 (NM_004972) | 3.97E-06 | 35 | 3' | Ex6-Ex7 | | |
SF-295 | Glioblastoma | ROS1 (NM_002944) | 1.87E-06 | 24 | 3' | Ex16-Ex17 | | |
U251 | Glioblastoma | ROS1 (NM_002944) | 3.72E-06 | 34 | 3' | Ex15-Ex16 | | |

Hematological
ALL-SIL | T-ALL | NUP214 (NM_005085) | 3.22E-09 | 13 | 5' | Ex32-Ex33 | | NUP214/ABL1 |
DU-528 | T-ALL | JAK2 (NM_004972) | 1.52E-11 | 6 | 3' | Ex8-Ex9 | | | Top-10 P-value rank; cell line not publicly available
EOL-1 | Eosinophilic leukemia | USP6 (NM_004505) | 3.34E-06 | 32 | 3' | Ex14-Ex15 | | |
EOL-1 | Eosinophilic leukemia | PDGFRA (NM_006206) | 5.35E-06 | 39 | 3' | Ex11-Ex12 | Yes | FIP1L1/PDGFRA |
JURKAT | T-ALL | NOTCH1 (NM_017617) | 1.17E-05 | 50 | 3' | Ex25-Ex26 | | |
K-562 | CML | BCR (NM_004327) | 4.06E-11 | 8 | 5' | Ex14-Ex15 | | BCR/ABL1 | Top-10 P-value rank; known gene fusion
K-562 | CML | MLL (NM_005933) | 5.53E-06 | 40 | 5' | Ex30-Ex31 | | |
LOUCY | T-ALL | NUP214 (NM_005085) | 3.60E-12 | 5 | 3' | Ex17-Ex18 | Yes | SET/NUP214 | Top-10 P-value rank; known gene fusion
LOUCY | T-ALL | NUP98 (NM_016320) | 1.87E-09 | 12 | 5' | Ex21-Ex22 | | |
MOLT-4 | T-ALL | WHSC1 (NM_133330) | 1.18E-05 | 51 | 3' | Ex3-Ex4 | | |
MOLT13 | T-ALL | WHSC1 (NM_133330) | 4.86E-07 | 21 | 3' | Ex3-Ex4 | | |
MOLT4 | T-ALL | NOTCH1 (NM_017617) | 1.40E-05 | 53 | 3' | Ex2-Ex3 | | |
RPMI8402 | T-ALL | NTRK3 (NM_001012338) | 3.22E-06 | 31 | 3' | Ex13-Ex14 | | |
SU-DHL-1 | ALCL | ALK (NM_004304) | 1.56E-13 | 3 | 3' | Ex19-Ex20 | | NPM1/ALK | Top-10 P-value rank; known gene fusion
SU-DHL-1 | ALCL | USP6 (NM_004505) | 9.99E-13 | 4 | 3' | Ex14-Ex15 | | | Top-10 P-value rank; same breakpoint as MCF7 (no USP6 fusions reported)
SU-DHL-1 | ALCL | FUS (NM_004960) | 4.93E-06 | 38 | 5' | Ex13-Ex14 | | |
SUPT1 | T-ALL | NOTCH1 (NM_017617) | 1.50E-23 | 2 | 3' | Ex25-Ex26 | Yes | TCR-β/NOTCH1 | Top-10 P-value rank; known gene fusion
SUPT13 | T-ALL | PDGFRA (NM_006206) | 3.39E-11 | 7 | 3' | Ex11-Ex12 | | FIP1L1/PDGFRA | Top-10 P-value rank; breakpoint aligns to exons disrupted in known PDGFRA gene fusions (Cools J, et al. (2004) Blood 103: 2802-5)
TALL1 | T-ALL | USP6 (NM_004505) | 2.42E-06 | 26 | 3' | Ex14-Ex15 | | |

Kidney
CAKI-1 | Renal cell carcinoma | NUP214 (NM_005085) | 5.68E-06 | 42 | 5' | Ex30-Ex31 | | |
RXF-393 | Renal cell carcinoma | CLTC (NM_004859) | 4.11E-10 | 11 | 5' | Ex14-Ex15 | Yes | |

Large Intestine
HT29 | Colon | TCF3 (NM_001136139) | 3.13E-06 | 30 | 5' | Ex2-Ex4 | | |
KM12 | Colon | LPP (NM_005578) | 2.27E-06 | 25 | 5' | Ex2-Ex3 | | |

Lung
HOP-62 | NSCLC | ROS1 (NM_002944) | 1.29E-05 | 52 | 3' | Ex20-Ex21 | | |
NCI-H322M | NSCLC | RUNX1 (NM_001754) | 5.86E-06 | 43 | 5' | Ex6-Ex7 | | |

Ovary
NCI-ADR-RES | Ovarian | NUP98 (NM_016320) | 3.42E-07 | 20 | 5' | Ex15-Ex16 | | |
OVCAR-5 | Ovarian | NUP98 (NM_016320) | 4.29E-06 | 36 | 5' | Ex13-Ex14 | | |
SK-OV-3 | Ovarian | NUP98 (NM_016320) | 3.33E-10 | 10 | 5' | Ex31-Ex32 | | | Top-10 P-value rank; prioritize for future studies

Prostate
DU-145(DTP) | Prostate | NUP98 (NM_016320) | 1.14E-05 | 49 | 5' | Ex31-Ex32 | | |
DU145(ATCC) | Prostate | NUP98 (NM_016320) | 2.90E-06 | 28 | 5' | Ex31-Ex32 | | |
DU145(ATCC) | Prostate | NUP214 (NM_005085) | 5.95E-08 | 15 | 5' | Ex29-Ex30 | | |
PC-3 | Prostate | JAK2 (NM_004972) | 2.83E-06 | 27 | 3' | Ex2-Ex3 | | |

Skin
LOXIMVI | Melanoma | JAK2 (NM_004972) | 1.35E-07 | 18 | 3' | Ex2-Ex3 | | |
SK-MEL-28 | Melanoma | USP6 (NM_004505) | 1.74E-06 | 23 | 3' | Ex14-Ex15 | | |
SK-MEL-5 | Melanoma | NOTCH1 (NM_017617) | 2.93E-06 | 29 | 3' | Ex30-Ex31 | | |
SK-MEL-5 | Melanoma | USP6 (NM_004505) | 4.87E-06 | 37 | 3' | Ex14-Ex15 | | |
UACC-257 | Melanoma | USP6 (NM_004505) | 6.50E-07 | 22 | 3' | Ex14-Ex15 | | |

Soft tissue
AS1 | Angiosarcoma | ROS1 (NM_002944) | 4.26E-28 | 1 | 3' | Ex34-Ex35 | | CEP85L/ROS1 | Top-10 P-value rank; breakpoint corresponds to known ROS1 gene fusions (Charest A, et al. (2003) Genes Chromosomes Cancer 37: 58-71)
AS2 | Angiosarcoma | PAX5 (NM_016734) | 6.99E-06 | 44 | 5' | Ex3-Ex4 | | |
AS2 | Angiosarcoma | NOTCH1 (NM_017617) | 1.03E-05 | 47 | 3' | Ex6-Ex7 | | |

Thyroid
TPC-1 | Thyroid | RET (NM_020630) | 1.82E-08 | 14 | 3' | Ex11-Ex12 | | CCDC6/RET |
TPC-1 | Thyroid | NUP214 (NM_005085) | 6.04E-08 | 16 | 3' | Ex29-Ex30 | | |
TPC-1 | Thyroid | NTRK3 (NM_001012338) | 1.31E-07 | 17 | 3' | Ex15-Ex16 | | |

^a Highlighted rows indicate samples selected for gene fusion validation.
^b P-value determined from walking t-test and used to rank candidates for prioritization.
^c EWSR1/FLI1 in SK-ES-1 did not reach statistical significance, but the exon breakpoint could still be observed.

Table 2.S2. Candidate rearrangements nominated by DBA.

Sample^a | Gene symbol | CNA type | CNA log2 ratio | CNA nucleotide position | Validated gene fusion | Also nominated by RBA | Prioritization rationale | Citations

Autonomic Ganglia
GOTO | EWSR1 | Amplification | 0.4359 | chr22:27954518-28011216 | | | |
GOTO-P3 | EWSR1 | Amplification | 0.4766 | chr22:27954518-28011216 | | | |
IMR.5 | ALK | Amplification | 0.6678 | chr2:29551887-29779579 | | | |
MC.IXC | FLI1 | Deletion | -0.5093 | chr11:98526711-128171111 | | | Known; subclone of SK-N-MC, which expresses EWSR1/FLI1 | Riggi N, et al. (2008) Cancer Res 68: 2176-85; Ban J, et al. (2008) Cancer Res 68: 7100-9
NB1 | ALK | Amplification | 2.0436 | chr2:29217307-29511409 | | | Known; intragenic ALK rearrangement (ALK(del2-3)) in NB1 | Okubo J, et al. (2012) Oncogene 31: 4667-76
NB1 | ALK | Amplification | 2.3462 | chr2:29511409-29670304 | | | Known; intragenic ALK rearrangement (ALK(del2-3)) in NB1 | Okubo J, et al. (2012) Oncogene 31: 4667-76
NB1 | ALK | Amplification | 1.8593 | chr2:29670304-29928515 | | | Known; intragenic ALK rearrangement (ALK(del2-3)) in NB1 | Okubo J, et al. (2012) Oncogene 31: 4667-76
NB69 | PAX5 | Deletion | -0.3926 | chr9:36587-36876339 | | | |
NBsusSR | ERG | Amplification | 0.4769 | chr21:38511776-38692234 | | | |
NBsusSR | EWSR1 | Amplification | 0.3731 | chr22:27852769-28011216 | | | |

Bone
143B | MET | Amplification | 0.3650 | chr7:116183123-116351181 | | | Known; TPR/MET in HOS, 143B | McAllister RM, et al. (1971) Cancer 27: 397-402; Cooper CS, et al. (1984) Cancer Res 44: 1-10; Cooper CS, et al. (1984) Nature 311: 29-33
CAL.72 | RAF1 | Amplification | 0.4912 | chr3:12359542-12644925 | | | |
ES6 | FLI1 | Deletion | -1.9071 | chr11:127566306-128142185 | | | Known; EWSR1/FLI1 in Ewing's sarcoma | Delattre O, et al. (1994) N Engl J Med 331: 294-299
ES6 | EWSR1 | Amplification | 0.4376 | chr22:27286648-28009327 | | | Known; EWSR1/FLI1 in Ewing's sarcoma | Delattre O, et al. (1994) N Engl J Med 331: 294-299
ES6 | JAK2 | Deletion | -0.3074 | chr9:36587-5071754 | | | |
EW.12 | FLI1 | Amplification | 0.4949 | chr11:128148763-134449982 | | | Known; EWSR1/FLI1 in Ewing's sarcoma | Delattre O, et al. (1994) N Engl J Med 331: 294-299
EW.12 | EWSR1 | Amplification | 0.4822 | chr22:21442535-28011216 | | | Known; EWSR1/FLI1 in Ewing's sarcoma | Delattre O, et al. (1994) N Engl J Med 331: 294-299
EW.22 | FLI1 | Amplification | 0.4917 | chr11:128175030-134449982 | | | Known; EWSR1/FLI1 in Ewing's sarcoma | Delattre O, et al. (1994) N Engl J Med 331: 294-299
EW.22 | EWSR1 | Amplification | 0.4602 | chr22:14435171-28016590 | | | Known; EWSR1/FLI1 in Ewing's sarcoma | Delattre O, et al. (1994) N Engl J Med 331: 294-299
EW.7 | FLI1 | Deletion | -0.6530 | chr11:127894352-128178531 | | | Known; EWSR1/FLI1 in Ewing's sarcoma | Delattre O, et al. (1994) N Engl J Med 331: 294-299
HOS | MET | Amplification | 0.4720 | chr7:116162572-116349524 | | | Known; TPR/MET in HOS, 143B | McAllister RM, et al. (1971) Cancer 27: 397-402; Cooper CS, et al. (1984) Cancer Res 44: 1-10; Cooper CS, et al. (1984) Nature 311: 29-33

Breast
BT549 | CLTC | Deletion | -0.7745 | chr17:55118960-55252629 | CLTC/VMP1 | | Focal deletion of CLTC-PTRH2-VMP1 locus; BT549 and HCC1954 chosen for validation |
CAL.85.1 | MAML2 | Amplification | 0.4205 | chr11:94282499-95571475 | | | |
HCC1143 | RUNX1 | Deletion | -0.7788 | chr21:35009648-35119798 | | | Previously profiled by RNA-seq; no RUNX1 gene fusions reported | Robinson DR, et al. (2011) Nat Med 17: 1646-51
HCC1143 | RUNX1 | Deletion | -1.3996 | chr21:35119798-35141436 | | | Previously profiled by RNA-seq; no RUNX1 gene fusions reported | Robinson DR, et al. (2011) Nat Med 17: 1646-51
HCC1937 | LPP | Amplification | 0.4984 | chr3:188117009-189451940 | | | Previously profiled by RNA-seq; no LPP gene fusions reported | Robinson DR, et al. (2011) Nat Med 17: 1646-51
HCC1954 | CLTC | Deletion | -0.3826 | chr17:55117004-55246725 | CLTC/VMP1 | | Focal deletion of CLTC-PTRH2-VMP1 locus; BT549 and HCC1954 chosen for validation |
HCC38 | BRAF | Deletion | -0.5260 | chr7:140157590-158812469 | | | Previously profiled by RNA-seq; no BRAF gene fusions reported | Robinson DR, et al. (2011) Nat Med 17: 1646-51
MDA.MB.157 | DEK | Deletion | -0.6717 | chr6:17842830-18334668 | | | Previously profiled by RNA-seq; no DEK gene fusions reported | Robinson DR, et al. (2011) Nat Med 17: 1646-51
MDA.MB.468 | RANBP17 | Deletion | -0.6397 | chr5:170238650-180612986 | | | Previously profiled by RNA-seq; no RANBP17 gene fusions reported | Robinson DR, et al. (2011) Nat Med 17: 1646-51
NCI.ADR.RES | DEK | Amplification | 0.7390 | chr6:18347518-18417625 | | | |

Cervix
SKG.IIIa | PICALM | Amplification | 0.7651 | chr11:81836787-85435237 | | | |

CNS
A172 | ABL1 | Deletion | -0.3833 | chr9:132461540-132646479 | ABL1/CBFB | | Focal homozygous deletion |
A172 | ABL1 | Deletion | -1.7927 | chr9:132646618-132695822 | ABL1/CBFB | | Focal homozygous deletion |
A172 | ABL1 | Deletion | -0.7470 | chr9:132702804-132725197 | ABL1/CBFB | | Focal homozygous deletion |
CAS-1 | EGFR | Amplification | 0.7527 | chr7:55018825-55081596 | | | Focal amplification; similar appearance to known EGFRvIII rearrangements | McLendon R, et al. (2008) Nature 455: 1061-8
D.397MG | BRAF | Deletion | -0.3137 | chr7:140142672-140344336 | | | |
D.538MG | RAF1 | Amplification | 0.3486 | chr3:12469278-12643003 | BCL6/RAF1 | | Focal CNA; breakpoint aligns near known RAF1 breakpoints | Palanisamy N, et al. (2010) Nat Med 16: 793-8; Jones DT, et al. (2009) Oncogene 28: 2119-23
D.538MG | RAF1 | Deletion | -0.5275 | chr3:12644320-25809370 | BCL6/RAF1 | | Focal CNA; breakpoint aligns near known RAF1 breakpoints | Palanisamy N, et al. (2010) Nat Med 16: 793-8; Jones DT, et al. (2009) Oncogene 28: 2119-23
DK.MG | RET | Deletion | -0.3980 | chr10:42750630-42932615 | | | |
DK.MG | LPP | Deletion | -0.6360 | chr3:189578925-190208779 | | | |
DK.MG | EGFR | Amplification | 1.7157 | chr7:54952612-55065281 | EGFRvIII | | Focal amplification; similar appearance to known EGFRvIII rearrangements | McLendon R, et al. (2008) Nature 455: 1061-8
DK.MG | EGFR | Amplification | 1.0354 | chr7:55065281-55097175 | EGFRvIII | | Focal amplification; similar appearance to known EGFRvIII rearrangements | McLendon R, et al. (2008) Nature 455: 1061-8
DK.MG | EGFR | Amplification | 1.2968 | chr7:55195667-55283723 | EGFRvIII | | Focal amplification; similar appearance to known EGFRvIII rearrangements | McLendon R, et al. (2008) Nature 455: 1061-8
GAMG | ETV5 | Amplification | 0.4129 | chr3:178759242-187276126 | | | |
ONS.76 | MLL | Deletion | -0.3651 | chr11:117860412-134449982 | None detected | | Similar appearance to known MLL rearrangements |
SNB19 | PDGFRA | Amplification | 1.0335 | chr4:54815302-55104572 | | | Known; KDR/PDGFRA and PDGFRA intragenic rearrangements in glioblastoma | Ozawa T, et al. (2010) Genes Dev 24: 2205-18
U.118.MG | ROS1 | Deletion | -2.0142 | chr6:117768216-117999797 | | | Known; GOPC/ROS1 in U.118.MG | Charest A, et al. (2003) Genes Chromosomes Cancer 37: 58-71

Endometrium
EFE.184 | CRTC3 | Deletion | -0.3889 | chr15:88961204-100238113 | | | |
ESS.1 | LPP | Amplification | 0.3083 | chr3:164109254-189991971 | | | |

Esophagus
KYSE.270 | CTNNB1 | Deletion | -1.5361 | chr3:41242026-41244520 | | | CTNNB1 expressed at low levels in KYSE.270 | Barretina J, et al. (2012) Nature 483: 603-7
KYSE.270 | CTNNB1 | Deletion | -1.2161 | chr3:41244520-41254695 | | | CTNNB1 expressed at low levels in KYSE.270 | Barretina J, et al. (2012) Nature 483: 603-7
KYSE.270 | CTNNB1 | Deletion | -0.8051 | chr3:41254695-41261235 | | | CTNNB1 expressed at low levels in KYSE.270 | Barretina J, et al. (2012) Nature 483: 603-7
KYSE.70 | LPP | Amplification | 0.6345 | chr3:181228422-189563151 | | | |
KYSE.70 | LPP | Amplification | 0.3306 | chr3:189563590-199318795 | | | |

Hematological
DEL | CLTC | Deletion | -0.3111 | chr17:55071492-55268305 | | | Focal deletion of CLTC-PTRH2-VMP1 locus; BT549 and HCC1954 chosen for validation |
BE.13 | ABL1 | Deletion | -0.4648 | chr9:70174866-132715651 | | | Known; NUP214/ABL1 in BE.13 | Graux C, et al. (2004) Nat Genet 36: 1084-9
BE.13 | NUP214 | Deletion | -0.4491 | chr9:133098229-140211203 | | | Known; NUP214/ABL1 in BE.13 | Graux C, et al. (2004) Nat Genet 36: 1084-9
BV.173 | FGFR1 | Amplification | 0.8678 | chr8:35802422-38438506 | | | |
BV.173 | ABL1 | Deletion | -0.5767 | chr9:89230294-132595308 | | | Known; BCR/ABL1 in BV.173 | Quentmeier H, et al. (2011) J Hematol Oncol 4: 6
EM.2 | ABL1 | Deletion | -0.3903 | chr9:105481741-132650644 | | | Known; BCR/ABL1 in EM.2 | Mahon FX, et al. (2000) Blood 96: 1070-9
EoL.1.cell | PDGFRA | Deletion | -0.5309 | chr4:53991138-54836050 | | Yes | Known; FIP1L1/PDGFRA in EoL.1 | Cools J, et al. (2004) Blood 103: 2802-5
GA.10.Clone.20 | THRAP3 | Deletion | -0.3018 | chr1:36527109-37719442 | | | |
GR.ST | ATF1 | Deletion | -0.5099 | chr12:49221576-49483622 | | | |
J.RT3.T3.5 | ALK | Deletion | -0.5404 | chr2:29627990-51634467 | | | ALK not expressed in Jurkat | Mourali J, et al. (2006) Mol Cell Biol 26: 6209-22
J.RT3.T3.5 | CDK6 | Amplification | 0.3677 | chr7:91976519-92151313 | FAM133B/CDK6 | | Focal amplification; CDK6 highly expressed in J.RT3.T3.5 | Barretina J, et al. (2012) Nature 483: 603-7
K.562 | ABL1 | Amplification | 1.4180 | chr9:132600000-133056090 | | | Known; BCR/ABL1 in K562 | Mahon FX, et al. (2000) Blood 96: 1070-9
K.562 | NUP214 | Amplification | 1.4180 | chr9:132600000-133056090 | | | |
KARPAS.45 | TCF12 | Amplification | 0.5808 | chr15:55077032-55351986 | | | |
KG-1 | FGFR1 | Amplification | 1.0574 | chr8:38225662-38393531 | | | Known; FGFR1OP2/FGFR1 in KG-1 | Gu TL, et al. (2006) Blood 108: 4202-4
KU812 | SUZ12 | Deletion | -0.6937 | chr17:25018304-27330572 | | | |
L.428 | MYB | Amplification | 0.6380 | chr6:135548690-135767725 | | | MYB expressed at low levels in L.428 | Barretina J, et al. (2012) Nature 483: 603-7
LAMA.84 | ABL1 | Deletion | -0.3145 | chr9:23369719-132664964 | | | Known; BCR/ABL1 in LAMA.84 | Mahon FX, et al. (2000) Blood 96: 1070-9
LOUCY | NUP214 | Deletion | -0.5021 | chr9:133056090-133135875 | | Yes | Known; SET/NUP214 in LOUCY | Van Vlierberghe P, et al. (2008) Blood 111: 4668-80
LOUCY | SET | Deletion | -0.5021 | chr9:130497893-133012678 | | | Known; SET/NUP214 in LOUCY | Van Vlierberghe P, et al. (2008) Blood 111: 4668-80
LP.1 | CREBBP | Amplification | 0.4341 | chr16:3530082-3815405 | | | |
MEG.01 | ABL1 | Amplification | 0.3803 | chr9:132610748-133032089 | | | Known; BCR/ABL1 in MEG.01 | Kiyoi H, et al. (2007) Clin Cancer Res 13: 4575-82
MEG.01 | NUP214 | Amplification | 0.3803 | chr9:130497893-133012678 | | | |
MEG.01 | NUP214 | Amplification | 0.5988 | chr9:132610748-133032089 | | | |
ML.2 | MLL | Deletion | -0.5167 | chr11:117859620-13449982 | | | Known; MLL/AF6 in ML.2 | Wang QF, et al. (2011) Blood 117: 6895-905
MONO.MAC.6 | RUNX1 | Amplification | 0.3156 | chr21:35241117-46921373 | | | |
NKM-1 | MLL | Amplification | 0.4749 | chr11:117820847-117863699 | | | |
OCI-AML2 | MLL | Amplification | 0.3555 | chr11:117801275-117860950 | | | Known | Andersson A, et al. (2005) Leukemia 19: 1042-50
P12.ICHIKAWA | LMO2 | Deletion | -0.5829 | chr11:33865574-36614919 | | | Known; LMO2 rearrangement Del(11)(p13p13) in P12.ICHIKAWA | Chen S, et al. (2011) Leukemia 25: 1632-5
REH | RUNX1 | Amplification | 0.3190 | chr21:35241117-46921373 | | | Known; ETV6/RUNX1 in REH | Andersson A, et al. (2005) Leukemia 19: 1042-50
RL | BCL2 | Deletion | -0.3956 | chr18:58841278-58974534 | | | Known | Yao R, et al. (2002) Clin Chem 48: 1344-51
SUPT1 | NOTCH1 | Deletion | -0.5536 | chr9:138526312-140145695 | | Yes | Known; TCRB/NOTCH1 in SUPT1 | Ellisen LW, et al. (1991) Cell 66: 649-61
X697 | PBX1 | Amplification | 0.3294 | chr1:163029080-247190999 | | | Known; TCF3/PBX1 in X697 | Kamps MP, et al. (1991) Genes Dev 5: 358-68

Kidney
RXF-393 | CLTC | Deletion | -0.4418 | chr17:55106004-55246725 | | Yes | Focal deletion of CLTC-PTRH2-VMP1 locus; BT549 and HCC1954 chosen for validation |

Large Intestine
COLO.678 | ETV6 | Deletion | -0.5689 | chr12:11703867-12343209 | | | ETV6 expressed at low levels in COLO.678 | Barretina J, et al. (2012) Nature 483: 603-7
RKO | PAX3 | Deletion | -0.4381 | chr2:222078657-222845096 | | | |
SNU-C1 | SLC1A2 | Amplification | 1.4926 | chr11:34876877-35320461 | APIP/SLC1A2 | | Focal amplification; similar appearance to NCI-SNU-16, which carries CD44/SLC1A2 | Tao J, et al. (2011) Sci Transl Med 3: 77ra30
T84 | CTNNB1 | Amplification | 0.4177 | chr3:35293567-41236841 | | | CTNNB1 expressed at low levels in T84 | Barretina J, et al. (2012) Nature 483: 603-7

Lung
NCI-H838 | CLTC | Deletion | -0.5780 | chr17:55078992-55273152 | | | Focal deletion of CLTC-PTRH2-VMP1 locus; BT549 and HCC1954 chosen for validation |
NCI-H226 | CLTC | Deletion | -0.3706 | chr17:55119944-55273152 | | | Focal deletion of CLTC-PTRH2-VMP1 locus; BT549 and HCC1954 chosen for validation |
A427 | LPP | Amplification | 0.5481 | chr3:187010801-189534553 | | | |
COR.L51 | ETV6 | Deletion | -0.5023 | chr12:3734152-11853840 | | | |
COR.L51 | ETV6 | Deletion | -1.5387 | chr12:11853840-11978174 | | | |
COR.L88 | PRDM16 | Amplification | 0.4488 | chr1:3204592-5866725 | | | PRDM16 expressed at low levels in COR.L88 | Barretina J, et al. (2012) Nature 483: 603-7
DMS.153 | RAF1 | Deletion | -0.3945 | chr3:12657751-86218050 | | | |
DMS.79 | NUP98 | Deletion | -0.4063 | chr11:188510-3734968 | | | NUP98 expressed at low levels in DMS.79 | Barretina J, et al. (2012) Nature 483: 603-7
EPLC.272H | MSI2 | Amplification | 0.3350 | chr17:52534327-52808799 | | | |
MS.1 | MSI2 | Amplification | 0.4939 | chr17:52061319-52851092 | | | |
NCI.H1105 | TCEA1 | Deletion | -0.4099 | chr8:42931756-55072873 | | | |
NCI.H1623 | PLAG1 | Amplification | 0.7089 | chr8:57010099-57247481 | | | |
NCI.H1694 | PBX1 | Amplification | 1.6443 | chr1:163065715-163135873 | | | |
NCI.H1838 | PBX1 | Amplification | 0.4214 | chr1:162962818-167200611 | | | PBX1 expressed at low levels in NCI.H1838 | Barretina J, et al. (2012) Nature 483: 603-7
NCI.H1926 | ETV6 | Amplification | 0.3141 | chr12:11764555-11900727 | | | |
NCI.H2029 | PRDM16 | Deletion | -2.2461 | chr1:3098377-3117398 | | | |
NCI.H2196 | CDK6 | Amplification | 0.3503 | chr7:91110415-92188963 | | | CDK6 expressed at low levels in NCI.H2196 | Barretina J, et al. (2012) Nature 483: 603-7
NCI.H322M | CDK6 | Amplification | 0.6133 | chr7:90611132-92146953 | | | |
NCI.H378 | NUP214 | Amplification | 0.3496 | chr9:130585714-133055733 | | | |
NCI.H510A | NUP214 | Amplification | 0.4125 | chr9:133029163-133495215 | | | |
NCI.H889 | CREB1 | Deletion | -0.7898 | chr2:207466616-208135923 | | | CREB1 expressed at low levels in NCI.H889 | Barretina J, et al. (2012) Nature 483: 603-7

Ovary
OVCAR.5 | MSI2 | Amplification | 0.3973 | chr17:46888388-52763999 | | | |

Pancreas
247 | SLC1A2 | Deletion | -1.3468 | chr11:35337365-36195363 | | | |
302 | ALK | Amplification | 0.6930 | chr2:26497716-29687253 | | | |
302 | ALK | Amplification | 0.9585 | chr2:29700433-29918184 | | | |
PANC2.03 | THRAP3 | Amplification | 0.4472 | chr1:36293439-36509163 | | | |
PL5 | RAF1 | Deletion | -0.8113 | chr3:12604063-58219893 | ATG7/RAF1 | | Breakpoint aligns near known RAF1 breakpoints; RAF1 highly expressed in PL5 | Palanisamy N, et al. (2010) Nat Med 16: 793-8; Jones DT, et al. (2009) Oncogene 28: 2119-23; Shain AH, et al. (2012) Proc Natl Acad Sci U S A 109: E252-9
YAPC | CCND3 | Amplification | 1.3575 | chr6:41826893-42067763 | | | |

Placenta
JAR | ETV6 | Amplification | 0.3285 | chr12:11527249-11721671 | | | |
JAR | ETV6 | Amplification | 0.9471 | chr12:11724365-13302674 | | | |

Pleura
NCI.H28 | CTNNB1 | Deletion | -1.9217 | chr3:41230215-41254106 | | | CTNNB1 expressed at low levels in NCI.H28 | Barretina J, et al. (2012) Nature 483: 603-7
NCI.H28 | CTNNB1 | Deletion | -2.0271 | chr3:41254366-41598226 | | | CTNNB1 expressed at low levels in NCI.H28 | Barretina J, et al. (2012) Nature 483: 603-7

Skin
LOXIMVI | CLTC | Deletion | -0.3800 | chr17:55071638-55272633 | | | Focal deletion of CLTC-PTRH2-VMP1 locus; BT549 and HCC1954 chosen for validation |
A101D | CLTC | Deletion | -0.3354 | chr17:55111672-55270138 | | | Focal deletion of CLTC-PTRH2-VMP1 locus; BT549 and HCC1954 chosen for validation |
A4.Fuk | ARNT | Amplification | 0.8054 | chr1:149007336-149067788 | | | |
A4.Fuk | ETV6 | Amplification | 0.3321 | chr12:8022225-11721803 | | | |
A4.Fuk | ETV6 | Amplification | 0.3152 | chr12:11877048-16966807 | | | |
A431 | EGFR | Amplification | 1.3666 | chr7:54941256-55188661 | EGFR/PPARGC1A | | Focal amplification |
A431 | EGFR | Amplification | 1.0531 | chr7:55188661-55578571 | EGFR/PPARGC1A | | Focal amplification |
CHL.1 | EWSR1 | Deletion | -0.7572 | chr22:28018632-28044186 | EWSR1/CREM | | Focal deletion; similar appearance to known EWSR1 gene fusions |
COLO.792 | ERG | Deletion | -0.4138 | chr21:38941206-46921373 | | | |
GAK | TCF3 | Deletion | -0.5473 | chr19:816374-1577143 | | | |
MMAC.SF | MET | Amplification | 0.4284 | chr7:116158477-116300300 | | | |
SH.4 | EWSR1 | Deletion | -0.3520 | chr22:28018632-33140861 | None detected | | Focal deletion; similar appearance to known EWSR1 gene fusions |
SK.MEL.2 | MN1 | Amplification | 0.6574 | chr22:25316959-26488649 | | | |
SK.MEL.2 | MN1 | Deletion | -0.3554 | chr22:26495334-36640620 | | | |

Soft Tissue
SJRH30 | PAX3 | Amplification | 0.8295 | chr2:222789370-222947681 | | | Known; PAX3-FKHR in rhabdomyosarcoma | Galili N, et al. (1993) Nat Genet 5: 230-5; Shapiro DN, et al. (1993) Cancer Res 53: 5108-12

Stomach
MKN28 | PRDM16 | Amplification | 0.9052 | chr1:3284282-4323756 | | | |
NCI-SNU-16 | SLC1A2 | Amplification | 1.8737 | chr11:34857699-35369623 | | | Known; CD44/SLC1A2 in NCI-SNU-16 | Tao J, et al. (2011) Sci Transl Med 3: 77ra30
NCI.N87 | BRAF | Deletion | -0.5381 | chr7:140096654-148851307 | | | Known; BRAF gene fusions in gastric adenocarcinoma | Palanisamy N, et al. (2010) Nat Med 16: 793-8
NCI.N87 | JAZF1 | Deletion | -0.5068 | chr7:27833876-28041697 | | | JAZF1 expressed at low levels in NCI-N87 | Barretina J, et al. (2012) Nature 483: 603-7

^a Highlighted rows indicate samples selected for gene fusion validation.

Table 2.S3. Sarcoma subtypes included on TMA.

Diagnosis | Class^a | Number of cases^b
Aneurysmal bone cyst | stt | 1
Angioma | stt | 20
Angiomatosis | stt | 1
Angiomyolipoma | stt | 4
Angiosarcoma | scr | 47
AV malformation | stt | 2
Chondrosarcoma | scr | 2
Enchondroma | stt | 1
Fibrous dysplasia | stt | 4
Fibrous histiocytoma of bone | stt | 1
Giant cell tumor of bone | scr | 3
Glomangioma | stt | 1
Gloms | stt | 5
Gloms tumor | stt | 5
Granular cell tumor | stt | 5
Hemangioendothelioma | stt | 22
Hemangioma | stt | 27
Hemangiopericytoma | stt | 3
Inflammatory myofibroblastic tumor | stt | 1
Inflammatory pseudotumor | stt | 2
Intimal sarcoma | scr | 5
Kaposi sarcoma | scr | 5
Malignant peripheral nerve sheath tumor | scr | 3
Maxima | stt | 3
Neuroblastoma | scr | 2
Neurofibroma | stt | 6
Non-ossifying fibro | stt | 3
Ossifying fibro | stt | 1
Pigmented villonodular synovitis | stt | 4
Pleomorphic sarcoma, soft tissue | scr | 32
Rhabdomyosarcoma | scr | 7
Schwannoma | stt | 10
Solitary fibrous tumor, spindle cell | stt | 18
Synovial sarcoma | scr | 10
Tenosynovial giant cell tumor | stt | 13
Vascular malformation, soft tissue | stt | 1

^a scr (sarcoma); stt (soft tissue tumor).
^b Each case represented by a duplicate core.

Table 2.S4. Effect of filtering parameters on DBA analysis of bone cancer cell lines^a.

EWSR1 breakpoints
Copy-number smoothing log2 ratio threshold | True-positive called EWSR1 breakpoints in 20 Ewing's sarcoma lines | False-positive called EWSR1 breakpoints in 12 osteosarcoma lines | Cell lines with nominated EWSR1 breakpoints^b
No copy-number smoothing | 7 | 5 | CADO-ES1, CAL72, ES6, EW11, EW12, EW18, EW22, EW24, Hu09, NOS1, Saos-2, X143B
0.15 | 4 | 0 | CADO.ES1, ES6, EW.12, EW.22
0.3 | 3 | 0 | ES6, EW.12, EW.22
0.6 | 0 | 0 |
1.2 | 0 | 0 |

FLI1 breakpoints
Copy-number smoothing log2 ratio threshold | True-positive called FLI1 breakpoints in 20 Ewing's sarcoma lines | False-positive called FLI1 breakpoints in 12 osteosarcoma lines | Cell lines with nominated FLI1 breakpoints^b
No copy-number smoothing | 12 | 4 | ES5, ES6, ES7, EW-1, EW-7, EW-11, EW-12, EW-13, EW-22, EW-24, MHH.ES.1, NY, Saos.2, SJSA.1, SK.PN.DW, U.2.OS
0.15 | 6 | 0 | ES5, ES6, ES7, EW.7, EW.12, EW.22
0.3 | 4 | 0 | ES6, EW.7, EW.12, EW.22
0.6 | 2 | 0 | ES6, EW.7
1.2 | 1 | 0 | ES6

Cancer Gene Census genes called rearranged
Copy-number smoothing log2 ratio threshold | Total called rearranged Cancer Gene Census genes | Total called rearranged Cancer Gene Census genes per cell line
0.15 | 95 | 2.88
0.3 | 11 | 0.33
0.6 | 2 | 0.06
1.2 | 1 | 0.03

^a 20 Ewing's sarcoma cell lines, 12 osteosarcoma cell lines, and 1 chondrosarcoma cell line.
^b Ewing's sarcoma cell lines in bold.
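The trade-off in Table 2.S4, where a higher smoothing threshold removes false-positive breakpoints but eventually also true ones, can be illustrated with a minimal sketch. It assumes that copy number smoothing merges adjacent segments whose log2 ratios differ by less than the threshold, and that a gene is called rearranged when a remaining segment boundary falls inside it. Both assumptions, the function names `smooth_segments` and `breakpoints_in_gene`, and the example coordinates are hypothetical and may differ from the pipeline actually used.

```python
def smooth_segments(segments, threshold):
    """Merge adjacent segments whose log2 ratios differ by less than threshold.

    segments: list of (start, end, log2_ratio) tuples for one chromosome,
    sorted by position and contiguous (each start equals the previous end).
    Merged segments take the length-weighted mean log2 ratio.
    """
    merged = []
    for start, end, ratio in segments:
        if merged and abs(ratio - merged[-1][2]) < threshold:
            p_start, p_end, p_ratio = merged[-1]
            p_len, cur_len = p_end - p_start, end - start
            new_ratio = (p_ratio * p_len + ratio * cur_len) / (p_len + cur_len)
            merged[-1] = (p_start, end, new_ratio)
        else:
            merged.append((start, end, ratio))
    return merged


def breakpoints_in_gene(segments, gene_start, gene_end):
    """Return segment boundaries that fall strictly inside the gene interval."""
    boundaries = [end for (_start, end, _ratio) in segments[:-1]]
    return [b for b in boundaries if gene_start < b < gene_end]


# Hypothetical chromosome with a small step at 1000 and a large step at 2000.
example_segments = [(0, 1000, 0.0), (1000, 2000, 0.2), (2000, 3000, 1.5)]
smoothed = smooth_segments(example_segments, threshold=0.3)
```

With these example values, the small 0.2 step at position 1000 disappears after smoothing at a threshold of 0.3, while the 1.5 transition at position 2000 survives; a sufficiently large threshold would remove both.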

Table 2.S5. RT-PCR primers (for validation of candidate fusions).

Gene Fusion Sample Forward primer Reverse primer ABL1/CBFB A-172 GAAATCCACCAAGCCTTTGA CTCCAGACAGCCCATACCAT APIP/SLC1A2 SNU-C1 CATGTCTGGCTGTGATGCTC GGTGAGCAGCAGATTCTTCC ATG7/RAF1 PL5 GGGGATTTCTTTCACGGTTT TCCACGAGGCCTAATTTTGT BCL6/RAF1 D-538MG AACCTGAAAACCCACACTCG TCCACGAGGCCTAATTTTGT C6ORF204/ROS1 AS1 GAAAGATGCCTTGAAGATGGA TAAGCACTGTCACCCCTTCC CLTC/TMEM49 HCC1954 AAGCTCATCTTTGGGCAGAA TCAAACATCCAGGACAACCA CLTC/TMEM49 BT549 ACTCCTGCAGTGGTTTTTGC TCAAACATCCAGGACAACCA EGFR/PPARGC1A A431 CAACACCCTGGTCTGGAAGT AATCCGTCTTCATCCACAGG EGFRvIII DKMG GGGGAGCAGCGATGCGACCCT ACTTGCGGACGCCGTCTTCCT EWSR1/CREM CHL-1 AGTCACTGCACCTCCATCCT TTTGCGTGTTGCTTCTTCTG JURKAT, FAM133B/CDK6 J-RT3-T3-5 CAATAGCAATGGCGAGATCA ACCTCGGAGAAGCTGAAACA FIP1L1/PDGFRA SUPT-13 CCTGGTGCTGATCTTTCTGA TGTTCCTTCAACCACCTTCC

CHAPTER 3

TRANSCRIPTOME SEQUENCING UNCOVERS RECURRENT TAOK1 REARRANGEMENTS IN BREAST CANCER

This chapter corresponds to a manuscript in preparation with the following authors: Craig P. Giacomini, A. Hunter Shain, Sushama Varma, Andrew D. Forster, Nicole Clarke, Kelli D. Montgomery, Steven Sun, Shirley Zhu, Matt van de Rijn, Robert B. West, Jonathan R. Pollack.

ABSTRACT

Breast cancer is a leading cause of cancer mortality in women, accounting for approximately 40,000 deaths annually in the United States. With the recent discovery of TMPRSS2/ETS in prostate cancer and EML4/ALK in lung cancer, major efforts are ongoing to identify novel gene rearrangements in other epithelial malignancies. Next generation sequencing has enabled the discovery of such alterations in breast, prostate, gastric, and lung cancer. However, few recurrent alterations have been discovered, and these studies have primarily searched for gene-to-gene fusions. Here we develop a computational pipeline to discover novel gene rearrangements, including gene-to-gene fusions, intragenic rearrangements, and gene-intergenic rearrangements, from transcriptome sequencing (RNA Seq) data. Applying this pipeline to a collection of 36 breast cancer cell lines, we discover 348 alterations, including recurrent fusions of the sterile 20 (STE20)-like kinase TAOK1. Notably, we demonstrate TAOK1 rearrangements in breast cancer cell lines, primary tumors, and ductal carcinoma in situ (DCIS) specimens, suggesting that these rearrangements occur early in carcinogenesis. Functional studies suggest that TAOK1 fusions encode potent oncoproteins, which may serve as therapeutic targets. These findings underscore that recurrent gene rearrangements play major roles in subsets of epithelial cancers and highlight the importance of TAOK1 rearrangements in breast carcinogenesis.

INTRODUCTION

Breast cancer is a leading cause of cancer death for women in the United States. Histopathological and molecular features are frequently used for prognostication and to determine appropriate therapeutic regimens for patients. For example, estrogen receptor (ER) or progesterone receptor (PR) positivity predicts response to selective estrogen receptor modulators such as tamoxifen, while overexpression of ERBB2 predicts responsiveness to the monoclonal antibody trastuzumab[1-3].

From a genetic perspective, breast cancers are frequently characterized by complex karyotypes, with amplifications, deletions, and chromosomal rearrangements. Such alterations can serve as prognostic and diagnostic markers, and many are known to disrupt key cancer genes that drive carcinogenesis[4-7]. However, many of these alterations remain uncharacterized, and a central aim in breast cancer research is to identify the underlying oncogenes and tumor suppressors that are targeted by these events. Discovery and characterization of these cancer genes is critical for furthering our understanding of breast cancer and for developing novel treatments.

The recent identification of TMPRSS2/ETS in the majority of prostate cancers and EML4/ALK in lung cancer has sparked major ongoing efforts to discover and characterize novel gene fusions in epithelial malignancies[8-17]. Many of these studies have used genomic technologies including next generation DNA and RNA sequencing as well as high-density microarray-based approaches to discover large numbers of complex somatic rearrangements in prostate, breast, colon, gastric, and lung cancer. However, the driving recurrent gene fusions discovered in these studies occur at comparatively low frequencies, and several groups have been unable to discover any recurrent alterations. Indeed, Robinson et al. recently reported oncogenic Notch-family and MAST kinase rearrangements in 2% and 1% of breast cancers, respectively, while Berger et al. discovered many singleton gene fusions in melanoma with unclear impact on cancer pathogenesis[16,17].

Screening for other forms of gene rearrangements may facilitate the discovery of recurrent alterations. For example, the majority of the aforementioned studies have searched specifically for gene-to-gene fusions. However, intragenic and other types of gene rearrangements play critical roles in cancer pathogenesis. Indeed, internal tandem duplications occurring in exons 14 and 15 of the FMS-like tyrosine kinase 3 (FLT3) result in constitutively active tyrosine kinase signaling, which drives oncogenesis in 15-35% of acute myelogenous leukemia (AML)[18-25]. Similarly, 20-30% of glioblastoma multiforme harbor an oncogenic rearrangement of the epidermal growth factor receptor (EGFR), EGFRvIII, which occurs through an in-frame deletion of exons 2-7[26-29]. In addition to these intragenic rearrangements, fusions involving intergenic regions of the genome have occasionally been described. For example, Tomlins et al. discovered two prostate cancer specimens harboring rearrangements of the ETS transcription factor ETV1 to the genomic locus 14q13.3-14q21.1. This region was demonstrated to be prostate-specific and coordinately regulated by androgen, thereby driving overexpression of ETV1[30].

We hypothesized that novel recurrent gene rearrangements exist in breast cancer and should be discoverable through RNA sequencing (RNA Seq). We develop a computational approach to screen RNA Seq data for intragenic rearrangements, gene-to-intergenic fusions (gene-intergenic fusions), and gene-to-gene fusions (gene fusions). Applying this approach to a series of 36 breast cancer cell lines, we discover and characterize several novel alterations, including recurrent oncogenic fusions of the TAOK1 serine/threonine kinase. Further characterization of the novel alterations discovered through our approach will enhance our understanding of the molecular mechanisms governing breast carcinogenesis.

RESULTS

Genomic Datasets

To discover novel rearrangements driving breast carcinogenesis, we performed paired-end RNA sequencing (36 bp reads) on a series of 36 breast cancer cell lines. We chose paired-end sequencing because studies have demonstrated that it is a more sensitive method for gene fusion discovery than single-read sequencing[31]. These cell lines were chosen because they are well established, are frequently used as in vitro models of breast cancer, and span a mix of molecular subtypes: 11 basal-A, 6 basal-B (claudin-low), 17 luminal, and 2 unclassified lines[2]. In addition, we profiled normal primary human mammary epithelial cells (HMEC) as a negative control. An average of 16 million high-quality read pairs was generated per sample (Table 3.S1). We also analyzed publicly available microarray-based comparative genomic hybridization (array CGH) data from the Wellcome Trust Sanger Institute[32], comprising many of these same cell lines, to integrate transcriptome discoveries with copy number alterations (CNAs).

Gene Rearrangement Discovery

We developed a computational pipeline to mine these RNA-Seq data for novel intragenic and gene-intergenic rearrangements as well as gene fusions. To discover gene-intergenic fusions and gene fusions, our pipeline screened for paired reads mapping to distinct genomic loci. At least one locus was required to correspond to an annotated gene to facilitate downstream filtering steps. For each candidate rearrangement, non-mapping reads were separately queried to identify “chimeric” reads spanning potential junctions between the two genomic regions (Figure 3.1A). To discover intragenic alterations that create exonic internal tandem duplications, our approach screened for paired reads that map within the same gene but imply a disrupted numerical order of exons. In parallel, our method searched for supporting “chimeric” reads that span potentially disrupted exon junctions (Figure 3.1B). We then applied a series of filters based on the structural distribution and number of supporting reads to enrich for true positive candidate rearrangements (see Materials and Methods). As a proof of concept, we rediscovered several known gene fusions in the breast cancer cell line MCF7, which has been extensively characterized in other studies[33,34] (Table 3.S2). As expected, no gene rearrangements were discovered in HMEC (Table 3.S2-4).

Our computational pipeline nominated a total of 348 rearrangements across the 36 breast cancer cell lines (~10 per cell line), including 244 gene fusions, 31 intragenic rearrangements, and 73 gene-intergenic rearrangements (Table 3.S2-4). Many of these alterations were notable for their potential roles in breast carcinogenesis. For example, the basal-B cell line MDA436 was found to harbor an in-frame internal tandem duplication of EGFR exon 20. This exon corresponds to a portion of the tyrosine kinase domain and has been found to harbor activating mutations in lung cancers[35]. Intragenic rearrangements including internal tandem duplications create oncogenic EGFR variants in a subset of glioblastomas[27,36,37], suggesting that this rearrangement may represent a novel mechanism of EGFR activation in breast cancer.

Our findings also included recurrent rearrangements of CDH1 (E-Cadherin), such as CDH1/CDH3 in UACC893 and CDH3/CDH1 in HCC2218. These gene fusions are associated with focal homozygous deletions disrupting CDH1 in the corresponding array CGH profiles (data not shown). CDH1 encodes a calcium-dependent cell-cell adhesion glycoprotein, and mutations in this tumor suppressor gene are correlated with the development of various malignancies including gastric, breast, and ovarian cancers[38-40]. Our findings suggest that rearrangement represents another mechanism for inactivation of this tumor suppressor. Also notable was the discovery of the gene fusion TMEM123/MMP20 in the breast cancer cell line HCC38. MMP20 is a member of the matrix metalloproteinase family, which includes several genes with demonstrated roles in tumor invasion and metastasis[41-43].
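The exon-order screen for intragenic rearrangements described in this section can be illustrated with a minimal Python sketch. It is not the pipeline used in this study (which relied on custom C# scripts). The record layout, the names MatePair and exon_order_candidates, and the assumption that each mate pair has already been oriented 5' to 3' relative to the transcript are all hypothetical. The sketch flags pairs whose upstream mate lies in a higher-numbered exon than its downstream mate and keeps events supported by at least two distinct pairs. Duplications confined to a single exon would need a position-level check, which is omitted here.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Tuple


class MatePair(NamedTuple):
    """Hypothetical record for one mate pair mapped within a single gene."""
    gene: str              # gene that both mates map to
    upstream_exon: int     # exon number of the mate lying 5' in the sequenced fragment
    downstream_exon: int   # exon number of the mate lying 3' in the sequenced fragment
    upstream_start: int    # mapping start of the upstream mate
    downstream_start: int  # mapping start of the downstream mate


def exon_order_candidates(
    pairs: Iterable[MatePair], min_pairs: int = 2
) -> Dict[Tuple[str, int, int], int]:
    """Count distinct mate pairs implying a later exon spliced upstream of an earlier one."""
    support = defaultdict(set)
    for p in pairs:
        # In an intact transcript the upstream mate never sits in a later exon.
        if p.upstream_exon > p.downstream_exon:
            key = (p.gene, p.upstream_exon, p.downstream_exon)
            # A set collapses duplicate pairs with identical mapping positions.
            support[key].add((p.upstream_start, p.downstream_start))
    return {k: len(v) for k, v in support.items() if len(v) >= min_pairs}
```

Keying the support on the exon pair means that two different duplication events in the same gene are counted separately rather than pooled.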

Discovery and characterization of TAOK1 rearrangements in breast cancer

To prioritize rearrangements for further characterization, we focused our attention on recurrent alterations. Nearly all of the ~350 candidate rearrangements were singletons, present in only one cell line and therefore of unclear impact in breast cancer. In fact, our pipeline nominated only one recurrent gene fusion, CLTC/VMP1, which we had previously characterized[44]. A small number of individual genes were involved in more than one distinct gene rearrangement. The majority of these alterations, including CDH1/CDH3 and CDH3/CDH1, contain disrupted open reading frames or lack important functional domains, suggesting potential roles in tumor suppressor inactivation.

Among the list of candidate gene-intergenic fusions, we noted that two cell lines (~5.5% of samples), ZR75-30 and HCC1419, harbored novel amplified rearrangements of the TAOK1 kinase (Figure 3.2, Table 3.S4). Specifically, these alterations juxtapose the first 12-13 exons of TAOK1 (wild-type TAOK1 contains 20 exons) to intergenic sequences on chromosome 17. The open reading frames of these alterations encode C-terminally truncated isoforms of TAOK1 that retain the N-terminal serine/threonine kinase domain. We speculated that TAOK1 rearrangement leads to constitutively active kinase signaling through amplification or loss of auto-inhibitory C-terminal sequences. Oncogenic kinases have proven to be excellent therapeutic targets[45], and so we chose to further characterize these novel rearrangements.

TAOK1 (also called PSK2) is short for “thousand and one amino acid kinase 1”; the protein comprises 1,001 amino acids and is one of three TAOK family kinases[46]. TAOKs encode serine/threonine kinases and belong to a family of approximately 30 “sterile 20-like kinases”. Little is known about the function of TAOK1. Some studies suggest that it functions as a regulator of apoptosis and various MAP kinases[46-48]. Another study suggested a role in mitotic spindle checkpoint signaling[49], although these findings have been refuted[50]. Notably, TAOK1 has not been implicated as an oncogene in breast cancer. A recent publication identified a rearrangement of TAOK1 in a single breast cancer cell line; however, the authors did not functionally characterize the fusion, and in fact speculated that it was an inactivating event[51].

As a first step, we sought to determine whether TAOK1 rearrangements occur in primary breast malignancies and are therefore not simply cell culture artifacts.
To this end, we developed a “break-apart” fluorescence in situ hybridization (FISH) assay (Figure 3.3A). In preliminary studies, we identified TAOK1 amplified rearrangements in primary breast tumors as well as ductal carcinoma in situ (DCIS) cases, suggesting that these alterations occur early in carcinogenesis (Figure 3.3B). We also analyzed publicly available breast cancer array CGH data[52,53] and uncovered similar TAOK1 rearrangements in 5 of 169 samples (3%) (Figure 3.3C). The finding of recurrent TAOK1 rearrangements supports a potential oncogenic “driver” role in breast carcinogenesis.

To determine whether the identified TAOK1 rearrangements were also expressed at the protein level, we performed Western blots using an antibody that recognizes autophosphorylated TAOK (Ser181). We observed prominent protein bands of the predicted size (~50-55 kD) in both the HCC1419 and ZR75-30 lines, as well as less prominent nearby bands in breast cancer lines without TAOK1 rearrangements (Figure 3.4A). We hypothesize that these other bands represent proteolytic fragments of native TAOK kinases (possibly active). Notably, transfection of HCC1419 cells with small interfering RNAs (siRNAs) targeting the 5’ end of TAOK1 resulted in decreased expression of the presumed rearrangement-specific band. In contrast, its expression did not diminish upon transfection with an siRNA pool targeting the 3’ end of TAOK1, which is not present in the rearrangement (Figure 3.4B). We subsequently created a Flag-tagged TAOK1 rearrangement construct cloned from HCC1419 cells. Overexpression of this construct resulted in a co-migrating band, further supporting the presumptive assignment. Notably, the fusion construct exhibited autophosphorylation, suggesting that these alterations encode functional, enzymatically active kinases. As a control, we engineered a kinase-dead construct (K57A) by site-directed mutagenesis, which did not exhibit autophosphorylation (Figure 3.4C).

We next sought to characterize the oncogenic potential of TAOK1 rearrangements in breast cancer using RNAi knockdown. Specifically, transfection of HCC1419 with individual siRNAs targeting the 5’ end of TAOK1 (i.e. the portion retained in the fusion) led to reduced expression of the rearrangement and resulted in significantly decreased cell growth/viability.
In contrast, an siRNA pool targeting the 3’ end of TAOK1 (absent in the fusion) specifically diminished expression of wild-type but not rearranged TAOK1 and had no effect on cell growth. Notably, HCC1419 cells transfected with rearrangement-targeting siRNAs appeared flattened and enlarged, morphological features of cellular senescence. To substantiate this observation, we stained for senescence-associated β-galactosidase and observed significantly increased numbers of senescent cells (Figure 3.5).

DISCUSSION

Here, we describe the development and implementation of a pipeline for cancer gene rearrangement discovery from RNA Seq data, which was applied to a collection of 36 breast cancer cell lines. We discovered ~350 novel rearrangements including gene fusions as well as intragenic and gene-intergenic rearrangements. Many of these findings are novel alterations in breast cancer, and several represent potential oncogenes or tumor suppressors.

This analysis uncovered novel recurrent rearrangements of the sterile 20 (STE20)-like kinase TAOK1, which encode C-terminally truncated proteins that retain an active serine/threonine kinase domain. We demonstrate that these alterations are present in both breast cancer cell lines and primary clinical specimens. Notably, we identify TAOK1 rearrangements in several DCIS samples, suggesting that rearrangement occurs early in the development of breast cancer. While further studies are needed to assess the frequency of TAOK1 rearrangements, our preliminary findings suggest that a small fraction of breast cancers harbor these fusions. Many recently discovered gene fusions are present at low frequencies in other epithelial malignancies. For example, RAF kinase rearrangements are estimated to occur in 1-3% of prostate, gastric, and pancreatic cancers, and EML4/ALK is predicted to occur in 3-5% of non-small cell lung cancers[10,13]. Our findings support the notion that recurrent gene fusions have key roles in small subsets of carcinoma subtypes.

Little is known about the function of TAOK1, and it has not been implicated as an oncogene in breast carcinogenesis. Schulte et al. identified a distinct rearrangement, TAOK1/PCGF2, in ZR75-30 cells. We also detect this alteration, which includes only the first exon and 5’ UTR of TAOK1 fused to exon 3 of PCGF2[51]. However, the authors did not report the TAOK1 rearrangement characterized in the present study. Furthermore, the authors suggested that TAOK1 is a tumor suppressor gene inactivated by rearrangement. In contrast, our findings strongly suggest that TAOK1 fusions encode enzymatically active oncogenic serine/threonine kinases. Specifically, we demonstrate that knockdown of TAOK1 rearrangement expression in HCC1419 significantly diminished cell growth and increased cellular senescence. Oncogenic kinase gene fusions including EML4/ALK, CD74/ROS1, BCR/ABL1, and FIP1L1/PDGFRA have proven to be outstanding therapeutic targets for small molecule inhibitors[54-57]. Development of such inhibitors may improve survival of breast cancer patients harboring TAOK1 fusions.

While our findings strongly implicate TAOK1 as a driver of breast carcinogenesis, further studies are needed to fully assess its role. For example, its oncogenic contribution in ZR75-30 cells remains to be explored. In addition, overexpression studies of TAOK1 fusion constructs in non-tumorigenic breast cell lines are needed to assess for gain of cancer phenotypes. We hypothesize that TAOK1 may possess C-terminal inhibitory domains, and that rearrangement activates kinase signaling through removal of these regions. Studies to characterize the location and structure of these domains would contribute to our understanding of TAOK1 biology. Finally, studies to characterize the molecular pathways and substrates of TAOK1 kinase signaling would uncover its mechanism of transformation.

In summary, we discovered several novel rearrangements in breast cancer, including recurrent fusions of TAOK1 in a subset of cases. Our findings suggest that TAOK1 fusions encode oncogenic kinases, which may serve as therapeutic targets for the development of small molecule inhibitors. Recent high-impact gene fusion discoveries including TMPRSS2/ETS, EML4/ALK, CD44/SLC1A2, and others all involve rearrangements juxtaposing two distinct genes[13-15].
Furthermore, rearrangements such as CTNNB1/PLAG1 or TMPRSS2/ETS involve fusions of 3’ open reading frames to 5’ non-coding sequences. Specifically, in these alterations, the proto-oncogene (PLAG1 or ETS family members) remains mostly intact, but the genomic rearrangement places a new promoter and 5’-UTR upstream of the coding sequence, leading to overexpression of the proto-oncogene. In contrast, TAOK1 fusions are unique in that they involve rearrangements that fuse a 5’ open reading frame to 3’ non-coding sequences within the genome. Notably, several studies have performed next generation sequencing on HCC1419 and ZR75-30 without discovering these alterations, because these studies limited their analysis to the identification of gene-to-gene fusions[16,51]. Our findings highlight the importance of screening next generation sequencing data for additional types of rearrangements, such as intragenic and gene-intergenic alterations. More broadly, they underscore the importance of gene rearrangements in carcinogenesis and support the notion that recurrent rearrangements play major roles in subsets of epithelial cancers. Further characterization of the newly discovered alterations in this study is critical for improving our understanding of breast cancer pathogenesis and may yield novel therapeutic strategies for patients.

MATERIALS AND METHODS

Cell lines

Cell lines were obtained from the American Type Culture Collection except for HMEC, which was obtained from Clonetics/Lonza. All lines were propagated in RPMI-1640 supplemented with 10% fetal bovine serum (FBS). Cells were harvested at 80% confluency. Total RNA and genomic DNA were isolated using the AllPrep DNA/RNA Mini Kit (Qiagen).

Paired-end library preparation for Illumina sequencing

Paired-end transcriptome sequencing (RNA-seq) was performed as described[44]. Briefly, adapter-ligated cDNA libraries were prepared using the mRNA Seq-8 Sample Prep Kit (Illumina). Specifically, mRNA was isolated and purified from 1 to 10 µg of total RNA using Sera-Mag Magnetic Oligo(dT) Beads, fragmented at 94°C in a fragmentation buffer, and converted to single-stranded cDNA using SuperScript II reverse transcriptase (Invitrogen). Subsequently, second-strand cDNA synthesis was performed using E. coli DNA polymerase I (Invitrogen). Double-stranded cDNA was end repaired using T4 DNA polymerase and T4 polynucleotide kinase, and then monoadenylated using Klenow DNA polymerase I. Adapter sequences were ligated to library molecules using T4 DNA ligase. Library fragments were then size selected (300-400 bp) on a 2% agarose gel and purified using the QIAquick Gel Extraction Kit (Qiagen). Purified cDNA fragments were enriched with 15 PCR cycles using Phusion DNA Polymerase and corresponding buffers. Libraries were electrophoresed once again and then gel purified using the QIAquick Minelute Gel Purification Kit (Qiagen). Adapter-ligated cDNA libraries were quantified with the Agilent DNA 1000 kit on the Agilent 2100 Bioanalyzer. Libraries were sequenced on the Genome Analyzer II instrument (Illumina).

Gene rearrangement discovery

As a first step, paired reads were mapped to the human genome (hg18) and the RefSeq transcriptome, allowing up to 2 mismatches, using Efficient Large-Scale Alignment of Nucleotide Databases (ELAND). A custom C# script was then used to identify mate pairs with discordant genomic mapping positions. In parallel, non-mapping reads were aligned to all annotated RefSeq exons using VMATCH, a pattern-matching program[59]. Reads were required to align to an exon boundary over at least 6 nucleotides. A Perl script then screened these results for reads mapping to the exon boundaries of two separate genes. These “chimeric reads” were integrated with the previously identified discordantly mapping mate pairs to nominate candidate rearrangements. At least two distinct mate pairs were required for prioritization.

A series of filters was then applied to eliminate false positive candidates. Specifically, a custom C# script was designed to score gene fusion nominations on a 5-point scale using the following criteria. First, as library fragments were size selected at 300-400 base pairs, we required the average calculated distance between mate-paired reads to be between 150 and 300 nucleotides (1 point). In addition, the standard deviations of mapping positions for all reads 5’ and 3’ to the predicted fusion junction were separately calculated. We required each of these standard deviation values to be between 35 and 70 nucleotides (1 point each). Lastly, if one or more chimeric reads supported the existence of a particular gene fusion, we required the predicted junction from the paired reads to be within one exon of the junction suggested by the chimeric reads (2 points). These criteria were chosen based on characteristics of known gene fusions (primarily from MCF7 cells) rediscovered in our dataset. Candidate gene rearrangements were nominated if they scored 2 points or higher. This pipeline was used to discover both gene fusions and gene-intergenic rearrangements.

To discover intragenic rearrangements, reads mapped to the RefSeq transcriptome were screened using a custom C# script for paired reads mapping within the same gene but disrupting the numerical exon order. A separate C# script was then used to screen non-mapping reads for those aligning to potentially disrupted exon junctions. A minimum of two distinct mate pairs was required for candidate nominations, and the same 5-point scale outlined above was used to prioritize rearrangements.
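The 5-point scoring scheme above can be made concrete with a short Python sketch. It is an illustration under stated assumptions, not the custom C# script used in this study. The Candidate fields and the function names score and nominate are hypothetical, the interval bounds are treated as inclusive, and the population standard deviation is used because the text does not specify which estimator was applied.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Optional, Sequence


@dataclass
class Candidate:
    """Hypothetical summary of the reads supporting one candidate rearrangement."""
    insert_sizes: Sequence[float]  # calculated distance between mates, one per distinct pair
    positions_5p: Sequence[int]    # mapping positions of reads 5' of the predicted junction
    positions_3p: Sequence[int]    # mapping positions of reads 3' of the predicted junction
    # Exon distance between the paired-read junction and the chimeric-read junction;
    # None when no chimeric read supports the candidate.
    exon_offset_from_chimeric: Optional[int] = None


def score(c: Candidate) -> int:
    """Score a candidate on the 5-point scale described in the text."""
    points = 0
    # Criterion 1: mean mate-pair distance between 150 and 300 nt (1 point).
    if c.insert_sizes and 150 <= mean(c.insert_sizes) <= 300:
        points += 1
    # Criteria 2 and 3: spread of mapping positions on each side, 35 to 70 nt (1 point each).
    for positions in (c.positions_5p, c.positions_3p):
        if positions and 35 <= pstdev(positions) <= 70:
            points += 1
    # Criterion 4: paired-read junction within one exon of the chimeric-read junction (2 points).
    if c.exon_offset_from_chimeric is not None and abs(c.exon_offset_from_chimeric) <= 1:
        points += 2
    return points


def nominate(c: Candidate, min_pairs: int = 2, min_score: int = 2) -> bool:
    """Apply the two-distinct-mate-pair requirement and the 2-point threshold."""
    return len(c.insert_sizes) >= min_pairs and score(c) >= min_score
```

Because the chimeric-read criterion alone is worth 2 points, a candidate with a concordant chimeric junction reaches the nomination threshold even when it fails the three distribution criteria, provided it has at least two distinct mate pairs.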

Microarray datasets

Array CGH profiles from 54 breast cancer cell lines were obtained from the Wellcome Trust Sanger Institute’s Cancer Genome Project[32]. These cell lines were profiled on Affymetrix SNP 6.0 microarrays containing 1.8 million genetic markers, including more than 946,000 probes for the detection of copy number variation. In addition, we analyzed publicly available high-density array CGH datasets: GSE16619[52], which includes 115 breast cancer tumors (Affymetrix SNP 5.0 microarrays), and GSE17907, which includes 54 ERBB2-amplified breast cancers (Agilent 244K microarrays)[53]. Affymetrix Genotyping Console software was used to extract probeset intensities from raw data files using the regional GC correction configuration for Copy Number/LOH analysis and default settings. Intensities were normalized against a HapMap 270 normal reference dataset, and log2 ratios were analyzed for genomic breakpoints. We also obtained an exon microarray dataset comprising 41 breast cancer cell lines profiled on Affymetrix Human Exon 1.0ST arrays to screen for exonic expression breakpoints disrupting TAOK1 transcripts (GSE16732)[58].

Cell proliferation and senescence assays

Cell viability/proliferation was quantified as described, using a colorimetric assay based on cleavage of the tetrazolium salt WST-1 (Roche). Briefly, 10% WST-1 reagent was added to cells at 1, 3, and 5 days post siRNA transfection and then incubated at 37°C for 30 minutes. Absorbance was measured at 450 nm with reference to 650 nm using a Spectra Max 190 plate reader (Molecular Devices). Senescence was assessed 72 hours post transfection using the Senescence β-Galactosidase Staining Kit (Cell Signaling) according to the manufacturer’s instructions. Cells were washed with 1x PBS and then treated with a fixative solution. Cells were then stained for β-galactosidase and counted. All assays were performed in biological triplicate, and mean values together with standard deviations are reported. All experiments were reproduced at least once.
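As a worked illustration of how the WST-1 readout can be summarized, the Python sketch below subtracts the 650 nm reference absorbance from the 450 nm signal and reports the mean and standard deviation across triplicate wells. It rests on two assumptions not stated in the text: that "with reference to 650 nm" means a simple subtraction, and that the reported standard deviation is the sample (n-1) estimate. The function names are hypothetical.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def corrected_absorbance(a450: float, a650: float) -> float:
    """Reference-corrected absorbance, assuming simple subtraction of the 650 nm reading."""
    return a450 - a650


def summarize_replicates(wells: Sequence[Tuple[float, float]]) -> Tuple[float, float]:
    """Return (mean, sample SD) of corrected absorbance over (A450, A650) replicate wells."""
    values = [corrected_absorbance(a450, a650) for a450, a650 in wells]
    return mean(values), stdev(values)


# Invented readings for three replicate wells, for demonstration only.
example_mean, example_sd = summarize_replicates([(1.20, 0.10), (1.30, 0.10), (1.10, 0.10)])
```

The same summary would be computed separately for each siRNA condition at days 1, 3, and 5 to compare growth between conditions.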

RT-PCR validation of fusions

Specimen RNAs were reverse transcribed using SuperScript III reverse transcriptase with random hexamers (Invitrogen). The following primer sequences were used for RT-PCR gene fusion validation. For HCC1419, forward and reverse primer sequences were ATGATGGAGGGAGACCACAC and CACAAGCCATAGGCGATCTT, respectively. For ZR75-30, ATGATGGAGGGAGACCACAC and GAAGTGCAGTGACAGGACCA were used as the forward and reverse primer sequences, respectively. PCR reactions were resolved on 1% agarose TAE gels, and bands were purified and Sanger-sequenced to verify predicted fusion junctions.

Break-apart FISH assays

Probe labeling and FISH were performed using Vysis/Abbott Molecular reagents and protocols. Locus-specific BACs encompassing TAOK1 (RP11-412K13 telomeric, RP11-218J24 centromeric) were labeled with Cy5-dUTP (telomeric probes) or Cy3-dUTP (centromeric probes). Chromosomal locations of BACs were first validated using normal metaphase slides. Fluorescently labeled probes interrogating TAOK1 were hybridized to tissue microarrays (TMAs) containing several invasive ductal carcinoma and DCIS tumor specimens. Slides were counterstained with DAPI and imaged using an Olympus BX51 fluorescence microscope with Applied Imaging Ariol 3.0 software. Rearrangement was defined by physical separation of the red and green FISH signals or loss of the red FISH signal (with or without amplification of the green signal) in at least 25% of tumor nuclei.

siRNA transfections

ON-TARGETplus siRNAs targeting various regions of TAOK1, as well as a non-targeting control siRNA pool (ON-TARGETplus siCONTROL Non-targeting Pool), were obtained from Dharmacon. Cell lines were seeded at a density of 75,000-150,000 cells per well of a 6-well plate and transfected using Lipofectamine 2000 reagent (Invitrogen). Cells were transfected with a final concentration of 25 nM siRNA for 16 hours in Opti-MEM (GIBCO), which was subsequently replaced with complete growth media (RPMI-1640 with 10% FBS).
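Returning to the break-apart FISH assay, its scoring criterion (signal separation or loss of the red signal in at least 25% of tumor nuclei) can be expressed as a small Python sketch. The per-nucleus labels, the names SignalPattern and is_rearranged, and the choice to pool split and red-loss nuclei into a single fraction are assumptions made for illustration. The text does not state how the two patterns were combined in practice.

```python
from enum import Enum
from typing import Iterable


class SignalPattern(Enum):
    """Hypothetical per-nucleus readout of the two-color break-apart probe set."""
    INTACT = "intact"      # red and green signals co-localized
    SPLIT = "split"        # red and green signals physically separated
    RED_LOSS = "red_loss"  # red signal lost, with or without green amplification


def is_rearranged(nuclei: Iterable[SignalPattern], min_fraction: float = 0.25) -> bool:
    """Call a specimen rearranged when enough scored nuclei show a split or red-loss pattern."""
    patterns = list(nuclei)
    if not patterns:
        raise ValueError("at least one scored tumor nucleus is required")
    abnormal = sum(p in (SignalPattern.SPLIT, SignalPattern.RED_LOSS) for p in patterns)
    return abnormal / len(patterns) >= min_fraction
```

A specimen with 20 split and 5 red-loss nuclei among 100 scored nuclei sits exactly at the threshold and is called rearranged under the inclusive comparison used here.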

Q-RT-PCR and Western blots

Q-RT-PCR was performed using Assay-on-Demand TaqMan probes and reagents (Applied Biosystems). A custom primer set encompassing the TAOK1/chr17p gene fusion junction was designed to interrogate expression of the gene fusion in HCC1419. A separate primer set aligning to the 3’ end of TAOK1 (not present in the fusion) was used to interrogate expression of the native, full-length TAOK1 transcript. Western blots were performed on whole cell lysates, using a phospho-TAOK antibody (1:500; Epitomics) that recognizes autophosphorylated TAOK (Ser181).

ACKNOWLEDGEMENTS

The authors would like to thank the members of the Pollack lab for helpful discussion.

REFERENCES

1. Bergamaschi A, Kim YH, Wang P, Sorlie T, Hernandez-Boussard T, et al. (2006) Distinct patterns of DNA copy number alteration are associated with different clinicopathological features and gene-expression subtypes of breast cancer. Genes Chromosomes Cancer 45: 1033-1040. 2. Kao J, Salari K, Bocanegra M, Choi YL, Girard L, et al. (2009) Molecular profiling of breast cancer cell lines defines relevant tumor models and provides a resource for cancer gene discovery. PLoS One 4: e6146. 3. Subramaniam DS, Isaacs C (2005) Utilizing prognostic and predictive factors in breast cancer. Curr Treat Options Oncol 6: 147-159. 4. Bocanegra M, Bergamaschi A, Kim YH, Miller MA, Rajput AB, et al. (2010) Focal amplification and oncogene dependency of GAB2 in breast cancer. Oncogene 29: 774-779. 5. Climent J, Garcia JL, Mao JH, Arsuaga J, Perez-Losada J (2007) Characterization of breast cancer by array comparative genomic hybridization. Biochem Cell Biol 85: 497-508. 6. Ong CC, Jubb AM, Haverty PM, Zhou W, Tran V, et al. (2011) Targeting p21- activated kinase 1 (PAK1) to induce apoptosis of tumor cells. Proc Natl Acad Sci U S A 108: 7177-7182. 7. Savelyeva L, Schwab M (2001) Amplification of oncogenes revisited: from expression profiling to clinical application. Cancer Lett 167: 115-123. 8. Kohno T, Ichikawa H, Totoki Y, Yasuda K, Hiramoto M, et al. (2012) KIF5B-RET fusions in lung adenocarcinoma. Nat Med 18: 375-377. 9. Maher CA, Kumar-Sinha C, Cao X, Kalyana-Sundaram S, Han B, et al. (2009) Transcriptome sequencing to detect gene fusions in cancer. Nature 458: 97- 101. 10. Palanisamy N, Ateeq B, Kalyana-Sundaram S, Pflueger D, Ramnarayanan K, et al. (2010) Rearrangements of the RAF kinase pathway in prostate cancer, gastric cancer and melanoma. Nat Med 16: 793-798.

114

11. Rikova K, Guo A, Zeng Q, Possemato A, Yu J, et al. (2007) Global survey of phosphotyrosine signaling identifies oncogenic kinases in lung cancer. Cell 131: 1190-1203. 12. Seshagiri S, Stawiski EW, Durinck S, Modrusan Z, Storm EE, et al. (2012) Recurrent R-spondin fusions in colon cancer. Nature 488: 660-664. 13. Soda M, Choi YL, Enomoto M, Takada S, Yamashita Y, et al. (2007) Identification of the transforming EML4-ALK fusion gene in non-small-cell lung cancer. Nature 448: 561-566. 14. Tao J, Deng NT, Ramnarayanan K, Huang B, Oh HK, et al. (2011) CD44- SLC1A2 gene fusions in gastric cancer. Sci Transl Med 3: 77ra30. 15. Tomlins SA, Rhodes DR, Perner S, Dhanasekaran SM, Mehra R, et al. (2005) Recurrent fusion of TMPRSS2 and ETS transcription factor genes in prostate cancer. Science 310: 644-648. 16. Robinson DR, Kalyana-Sundaram S, Wu YM, Shankar S, Cao X, et al. (2011) Functionally recurrent rearrangements of the MAST kinase and Notch gene families in breast cancer. Nat Med 17: 1646-1651. 17. Berger MF, Levin JZ, Vijayendran K, Sivachenko A, Adiconis X, et al. (2010) Integrative analysis of the melanoma transcriptome. Genome Res 20: 413-427. 18. Kiyoi H, Towatari M, Yokota S, Hamaguchi M, Ohno R, et al. (1998) Internal tandem duplication of the FLT3 gene is a novel modality of elongation mutation which causes constitutive activation of the product. Leukemia 12: 1333-1337. 19. Nakao M, Yokota S, Iwai T, Kaneko H, Horiike S, et al. (1996) Internal tandem duplication of the flt3 gene found in acute myeloid leukemia. Leukemia 10: 1911-1918. 20. Kiyoi H, Naoe T, Yokota S, Nakao M, Minami S, et al. (1997) Internal tandem duplication of FLT3 associated with leukocytosis in acute promyelocytic leukemia. Leukemia Study Group of the Ministry of Health and Welfare (Kohseisho). Leukemia 11: 1447-1452.


21. Stirewalt DL, Kopecky KJ, Meshinchi S, Appelbaum FR, Slovak ML, et al. (2001) FLT3, RAS, and TP53 mutations in elderly patients with acute myeloid leukemia. Blood 97: 3589-3595.
22. Meshinchi S, Woods WG, Stirewalt DL, Sweetser DA, Buckley JD, et al. (2001) Prevalence and prognostic significance of Flt3 internal tandem duplication in pediatric acute myeloid leukemia. Blood 97: 89-94.
23. Schnittger S, Schoch C, Dugas M, Kern W, Staib P, et al. (2002) Analysis of FLT3 length mutations in 1003 patients with acute myeloid leukemia: correlation to cytogenetics, FAB subtype, and prognosis in the AMLCG study and usefulness as a marker for the detection of minimal residual disease. Blood 100: 59-66.
24. Thiede C, Steudel C, Mohr B, Schaich M, Schakel U, et al. (2002) Analysis of FLT3-activating mutations in 979 patients with acute myelogenous leukemia: association with FAB subtypes and identification of subgroups with poor prognosis. Blood 99: 4326-4335.
25. Abu-Duhier FM, Goodeve AC, Wilson GA, Gari MA, Peake IR, et al. (2000) FLT3 internal tandem duplication mutations in adult acute myeloid leukaemia define a high-risk group. Br J Haematol 111: 190-195.
26. Verhaak RG, Hoadley KA, Purdom E, Wang V, Qi Y, et al. (2010) Integrated genomic analysis identifies clinically relevant subtypes of glioblastoma characterized by abnormalities in PDGFRA, IDH1, EGFR, and NF1. Cancer Cell 17: 98-110.
27. Humphrey PA, Wong AJ, Vogelstein B, Friedman HS, Werner MH, et al. (1988) Amplification and expression of the epidermal growth factor receptor gene in human glioma xenografts. Cancer Res 48: 2231-2238.
28. Yamazaki H, Fukui Y, Ueyama Y, Tamaoki N, Kawamoto T, et al. (1988) Amplification of the structurally and functionally altered epidermal growth factor receptor gene (c-erbB) in human brain tumors. Mol Cell Biol 8: 1816-1820.
29. Watanabe K, Tachibana O, Sata K, Yonekawa Y, Kleihues P, et al. (1996) Overexpression of the EGF receptor and p53 mutations are mutually exclusive in the evolution of primary and secondary glioblastomas. Brain Pathol 6: 217-223; discussion 223-224.
30. Tomlins SA, Laxman B, Dhanasekaran SM, Helgeson BE, Cao X, et al. (2007) Distinct classes of chromosomal rearrangements create oncogenic ETS gene fusions in prostate cancer. Nature 448: 595-599.
31. Maher CA, Palanisamy N, Brenner JC, Cao X, Kalyana-Sundaram S, et al. (2009) Chimeric transcript discovery by paired-end transcriptome sequencing. Proc Natl Acad Sci U S A 106: 12353-12358.
32. Bignell GR, Greenman CD, Davies H, Butler AP, Edkins S, et al. (2010) Signatures of mutation and selection in the cancer genome. Nature 463: 893-898.
33. Hampton OA, Den Hollander P, Miller CA, Delgado DA, Li J, et al. (2009) A sequence-level map of chromosomal breakpoints in the MCF-7 breast cancer cell line yields insights into the evolution of a cancer genome. Genome Res 19: 167-177.
34. Edgren H, Murumagi A, Kangaspeska S, Nicorici D, Hongisto V, et al. (2011) Identification of fusion genes in breast cancer by paired-end RNA-sequencing. Genome Biol 12: R6.
35. Lynch TJ, Bell DW, Sordella R, Gurubhagavatula S, Okimoto RA, et al. (2004) Activating mutations in the epidermal growth factor receptor underlying responsiveness of non-small-cell lung cancer to gefitinib. N Engl J Med 350: 2129-2139.
36. Fenstermaker RA, Ciesielski MJ, Castiglia GJ (1998) Tandem duplication of the epidermal growth factor receptor tyrosine kinase and calcium internalization domains in A-172 glioma cells. Oncogene 16: 3435-3443.
37. Ciesielski MJ, Fenstermaker RA (2000) Oncogenic epidermal growth factor receptor mutants with tandem duplication: gene structure and effects on receptor function. Oncogene 19: 810-820.
38. Berx G, Becker KF, Hofler H, van Roy F (1998) Mutations of the human E-cadherin (CDH1) gene. Hum Mutat 12: 226-237.


39. Berx G, Cleton-Jansen AM, Nollet F, de Leeuw WJ, van de Vijver M, et al. (1995) E-cadherin is a tumour/invasion suppressor gene mutated in human lobular breast cancers. EMBO J 14: 6107-6115.
40. Risinger JI, Berchuck A, Kohler MF, Boyd J (1994) Mutations of the E-cadherin gene in human gynecologic cancers. Nat Genet 7: 98-102.
41. Nelson AR, Fingleton B, Rothenberg ML, Matrisian LM (2000) Matrix metalloproteinases: biologic activity and clinical implications. J Clin Oncol 18: 1135-1149.
42. Adachi Y, Yamamoto H, Itoh F, Hinoda Y, Okada Y, et al. (1999) Contribution of matrilysin (MMP-7) to the metastatic pathway of human colorectal cancers. Gut 45: 252-258.
43. Wagenaar-Miller RA, Gorden L, Matrisian LM (2004) Matrix metalloproteinases in colorectal cancer: is it worth talking about? Cancer Metastasis Rev 23: 119-135.
44. Giacomini CP, Sun S, Varma S, Shain AH, Giacomini MM, et al. (2013) Breakpoint analysis uncovers novel gene fusions spanning multiple cancer types. PLoS Genetics (In press).
45. Sawyers CL (2002) Rational therapeutic intervention in cancer: kinases as drug targets. Curr Opin Genet Dev 12: 111-115.
46. Hutchison M, Berman KS, Cobb MH (1998) Isolation of TAO1, a protein kinase that activates MEKs in stress-activated protein kinase cascades. J Biol Chem 273: 28625-28632.
47. Zihni C, Mitsopoulos C, Tavares IA, Ridley AJ, Morris JD (2006) Prostate-derived sterile 20-like kinase 2 (PSK2) regulates apoptotic morphology via C-Jun N-terminal kinase and Rho kinase-1. J Biol Chem 281: 7317-7323.
48. Raman M, Earnest S, Zhang K, Zhao Y, Cobb MH (2007) TAO kinases mediate activation of p38 in response to DNA damage. EMBO J 26: 2005-2014.
49. Draviam VM, Stegmeier F, Nalepa G, Sowa ME, Chen J, et al. (2007) A functional genomic screen identifies a role for TAO1 kinase in spindle-checkpoint signalling. Nat Cell Biol 9: 556-564.


50. Westhorpe FG, Diez MA, Gurden MD, Tighe A, Taylor SS (2010) Re-evaluating the role of Tao1 in the spindle checkpoint. Chromosoma 119: 371-379.
51. Schulte I, Batty EM, Pole JC, Blood KA, Mo S, et al. (2012) Structural analysis of the genome of breast cancer cell line ZR-75-30 identifies twelve expressed fusion genes. BMC Genomics 13: 719.
52. Kadota M, Sato M, Duncan B, Ooshima A, Yang HH, et al. (2009) Identification of novel gene amplifications in breast cancer and coexistence of gene amplification with an activating mutation of PIK3CA. Cancer Res 69: 7357-7365.
53. Sircoulomb F, Bekhouche I, Finetti P, Adelaide J, Ben Hamida A, et al. (2010) Genome profiling of ERBB2-amplified breast cancers. BMC Cancer 10: 539.
54. Cools J, DeAngelo DJ, Gotlib J, Stover EH, Legare RD, et al. (2003) A tyrosine kinase created by fusion of the PDGFRA and FIP1L1 genes as a therapeutic target of imatinib in idiopathic hypereosinophilic syndrome. N Engl J Med 348: 1201-1214.
55. Druker BJ, Sawyers CL, Kantarjian H, Resta DJ, Reese SF, et al. (2001) Activity of a specific inhibitor of the BCR-ABL tyrosine kinase in the blast crisis of chronic myeloid leukemia and acute lymphoblastic leukemia with the Philadelphia chromosome. N Engl J Med 344: 1038-1042.
56. Kwak EL, Bang YJ, Camidge DR, Shaw AT, Solomon B, et al. (2010) Anaplastic lymphoma kinase inhibition in non-small-cell lung cancer. N Engl J Med 363: 1693-1703.
57. Bergethon K, Shaw AT, Ou SH, Katayama R, Lovly CM, et al. (2012) ROS1 rearrangements define a unique molecular class of lung cancers. J Clin Oncol 30: 863-870.
58. Riaz M, Elstrodt F, Hollestelle A, Dehghan A, Klijn JG, et al. (2009) Low-risk susceptibility alleles in 40 human breast cancer cell lines. BMC Cancer 9: 236.
59. Abouelhoda MI, Kurtz S, Ohlebusch E (2004) Replacing suffix trees with enhanced suffix arrays. Journal of Discrete Algorithms 2: 53-86.


FIGURE LEGENDS

Figure 3.1. RNA Seq gene rearrangement discovery pipeline. (A) Identification of fusion genes. Sequence reads from 36 breast cancer cell lines are aligned to the RefSeq transcriptome and the human genome, and paired reads mapping to two distinct genomic locations are then identified. Non-mapping reads are separately queried for supporting “chimeric” reads that span potential junctions between the two genomic loci. In the schematic, paired reads are illustrated; the black rectangles represent chimeric reads. Other filtering criteria involve the number and structural distribution of supporting reads. The schematic depicts discovery of a gene-to-gene fusion, but the same algorithm is used to discover genes rearranged to intergenic genomic regions. (B) Schematic depicting the approach for discovering intragenic rearrangements (internal tandem duplications). RNA Seq reads were mapped to the RefSeq transcriptome and subsequently screened for paired reads mapping within the same gene but disrupting the numerical exon order. Non-mapping reads are separately queried to identify reads spanning potential disrupted exon junctions.
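The paired-read screen summarized in this legend can be illustrated with a minimal Python sketch. It is not the pipeline used in this work: the `Mate` record, the category names, the assumption that the two mates are already ordered 5' to 3' along the transcript, and the `min_support` threshold are all hypothetical, and the chimeric-read query and scoring steps are omitted.

```python
from collections import Counter
from typing import Dict, Iterable, NamedTuple, Optional, Tuple


class Mate(NamedTuple):
    """One aligned mate of a read pair (hypothetical record layout)."""
    chrom: str
    gene: Optional[str]  # None when the mate aligns to an intergenic region
    exon: Optional[int]  # 1-based exon number within `gene`, else None


def classify_pair(upstream: Mate, downstream: Mate) -> str:
    """Classify a pair whose mates are assumed ordered 5'->3' along the transcript."""
    if upstream.gene is None and downstream.gene is None:
        return "not-candidate"  # neither mate touches an annotated gene
    if upstream.gene is None or downstream.gene is None:
        return "gene-intergenic"
    if upstream.gene != downstream.gene:
        return "gene-gene"
    # Same gene: a candidate only if the exon order runs backwards (e.g. exon 3 -> exon 2).
    if (upstream.exon is not None and downstream.exon is not None
            and upstream.exon > downstream.exon):
        return "intragenic"
    return "not-candidate"


def nominate(pairs: Iterable[Tuple[Mate, Mate]],
             min_support: int = 2) -> Dict[Tuple[str, str], int]:
    """Count supporting pairs per candidate and keep those with enough support."""
    counts: Counter = Counter()
    for upstream, downstream in pairs:
        kind = classify_pair(upstream, downstream)
        if kind == "not-candidate":
            continue
        label = f"{upstream.gene or upstream.chrom}/{downstream.gene or downstream.chrom}"
        counts[(kind, label)] += 1
    return {key: n for key, n in counts.items() if n >= min_support}
```

Keying the tally on both the category and the partner label keeps gene-to-gene, gene-intergenic, and intragenic candidates separate, mirroring the three candidate tables (Tables 3.S2 to 3.S4).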

Figure 3.2. TAOK1 rearrangement discovery in breast cancer. (A) Schematic structure of wildtype TAOK1. Coding exons are in red and untranslated regions are colored pink. The serine/threonine kinase domain is indicated by the green rectangle. (B) Identification of TAOK1 rearrangement in the HCC1419 breast cancer cell line. The asterisk (*) indicates the location of a newly introduced stop codon. The rearrangement is predicted to encode a truncated TAOK1 comprising the N-terminal portion of the protein, which retains the kinase domain. (C) A separate TAOK1 rearrangement with similar structure was discovered in ZR75-30 breast cancer cells. This rearrangement also encodes a truncated, N-terminal portion of TAOK1 with a preserved kinase domain. A subset of the paired-end RNA Seq reads supporting each rearrangement is displayed. (D) Experimental validation of TAOK1 rearrangements by RT-PCR, using primers flanking the respective gene fusion junction. (E) Array CGH heatmaps displaying intragenic TAOK1 genomic breakpoints identified in HCC1419 and ZR75-30. The 5’ end of TAOK1, including the kinase domain, is amplified (indicated by red in the heatmap). (F) Exon microarray data indicate that TAOK1 5’ exons are highly expressed relative to 3’ exons, consistent with amplified rearrangement of the 5’ portion of TAOK1.
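The exon-level pattern described in panel (F), in which 5’ exons are expressed at a higher level than 3’ exons, can be located with a simple scan over exon positions. The Python sketch below is illustrative only: the function name is hypothetical, the difference-of-means criterion is an assumption rather than the statistic used in this work, and the example values are invented.

```python
from statistics import mean
from typing import Sequence, Tuple


def best_exon_breakpoint(expr: Sequence[float]) -> Tuple[int, float]:
    """Return (k, delta) for the split after exon k (1-based) that maximizes
    mean(expression of exons 1..k) - mean(expression of exons k+1..n)."""
    if len(expr) < 2:
        raise ValueError("need at least two exons to place a breakpoint")
    best_k, best_delta = 1, mean(expr[:1]) - mean(expr[1:])
    for k in range(2, len(expr)):
        delta = mean(expr[:k]) - mean(expr[k:])
        if delta > best_delta:  # strict '>' keeps the most 5' split on ties
            best_k, best_delta = k, delta
    return best_k, best_delta


# Invented per-exon values for a 20-exon gene with a transition after exon 12.
example = [1.5] * 12 + [-1.0] * 8
print(best_exon_breakpoint(example))
```

A large positive `delta` at a single position is the kind of transition that would flag a candidate intragenic breakpoint for follow-up; a real analysis would also need a significance criterion and handling of noisy probes.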

Figure 3.3. TAOK1 rearrangements in primary breast cancer. (A) Schematic of the “break-apart” FISH assay. Green and red FISH probes flank TAOK1. In normal interphase nuclei, green and red signals co-localize. In samples with TAOK1 rearrangement alone, separation of red and green signals will be observed. In samples with amplified rearrangement of TAOK1, separation of red and green will again be observed together with clusters of green signals. (B) FISH analysis of a ductal carcinoma in situ (DCIS) specimen showing amplified rearrangement of TAOK1. One TAOK1 allele is intact (black arrowhead), while the other exhibits the pattern of amplified rearrangement (white arrowhead). (C) Heatmap of publicly available array CGH data demonstrating five primary breast cancer specimens with TAOK1 amplified rearrangement analogous to the breakpoints observed in HCC1419 and ZR75-30.

Figure 3.4. TAOK1 rearrangements encode an active kinase. (A) Western blot of breast cancer cell lines reveals protein bands corresponding to TAOK1 rearrangements (pTAOK1*; green asterisk) in HCC1419 and ZR75-30. Other cell lines without TAOK1 fusions have similarly sized bands that may represent proteolytic degradation products of full-length TAOKs. The phospho-TAOK antibody also detects native, wildtype pTAOK1, 2, and 3. (B) Western blot depicting siRNA knockdown of TAOK1 rearrangements. TAOK1 #1 and #5 represent individual siRNAs targeting the 5’ end of TAOK1 (present in the fusions), and corresponding lanes show diminished expression of TAOK1 rearrangements (pTAOK1*; yellow asterisk) and native TAOK1 (pink asterisk). An siRNA pool targeting the 3’ end of TAOK1 (not present in the rearrangements) shows diminished expression of native TAOK1 (pink asterisk) but not TAOK1 fusions. (C) Overexpression of a Flag-tagged TAOK1 fusion construct (cloned from HCC1419) demonstrates a band co-migrating with the native pTAOK1* fusion. As the antibody detects autophosphorylated TAOK1 (at S181), the presence of this band also indicates that the rearrangement encodes an active kinase. Overexpression of a kinase-dead TAOK1 fusion construct (Flag-TAOK1*(KD)) did not produce a corresponding band.

Figure 3.5. TAOK1 fusion knockdown inhibits cell growth. (A) Schematic depicting siRNA target sequence locations at the TAOK1 5’ (siRNAs #1 and #5) or 3’ (3’ siRNA pool) ends. (B) Q-RT-PCR verification of siRNA-mediated knockdown of native TAOK1 (left, using PCR primers at the TAOK1 3’ end) or TAOK1* fusion (right, using PCR primers spanning the fusion junction). (C) Knockdown of the TAOK1* fusion, but not of native TAOK1 alone, leads to a significant decrease in cell growth/viability. (D) Knockdown of the TAOK1* fusion results in increased cellular senescence. * P<0.05; ** P<0.01 (two-sided Student’s t-test).

Table 3.S1. Breast cancer cell lines.

Table 3.S2. Candidate gene fusions.

Table 3.S3. Candidate intragenic rearrangements.

Table 3.S4. Candidate gene-intergenic rearrangements.


Figure 3.1. RNA Seq gene rearrangement discovery pipeline.

[Figure. (A) Workflow: paired-end RNA Seq reads from 36 breast cancer cell lines; map mate pairs against the RefSeq transcriptome and human genome; identify mate pairs mapping to separate genes/genomic loci (paired reads spanning Gene X and Gene Y); screen non-mapping reads for supporting chimeric reads. (B) Workflow: paired-end RNA Seq reads from 36 breast cancer cell lines; map mate pairs against the RefSeq transcriptome; identify mate pairs mapping within the same gene but disrupting exon order (Gene X, exons 1 2 3 2 3 4); screen non-mapping reads for supporting chimeric reads.]


Figure 3.2. TAOK1 rearrangement discovery in breast cancer.

[Figure. (A) Full-length TAOK1 on chr17 (exons 1 to 20, STK domain). (B) HCC1419: paired-end reads joining TAOK1 (exons 1 to 12, STK domain, stop codon marked *) to 17p11.2. (C) ZR75-30: paired-end reads joining TAOK1 (exons 1 to 13, STK domain, stop codon marked *) to 17q21.33. (D) RT-PCR gel for HCC1419 and ZR75-30 (200 bp marker). (E) chr17 array CGH heatmap of breast cancer cell lines at TAOK1, with ZR75-30 and HCC1419 indicated. (F) ZR75-30 exon microarray plot; x-axis: Exon Position (1 to 20); y-axis: Expression (-2.0 to 2.0).]


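Figure 3.3. TAOK1 rearrangements in primary breast cancer.

[Figure. (A) Break-apart FISH schematic: 5' and 3' FISH probes flanking TAOK1 (exons 1 to 20), with expected signal patterns for a normal nucleus, rearrangement, and amplified rearrangement. (B) FISH image. (C) chr17 array CGH heatmap of breast tumors at TAOK1.]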


Figure 3.4. TAOK1 rearrangements encode an active kinase.

[Figure. Western blots with 150, 100, 75, and 50 kD markers; bands labeled pTAOK1/2/3, pTAOK1/2/3(c)?, and pTAOK1*. (A) Lanes: HCC1419, ZR75-30, BT474, SKBR3. (B) HCC1419, siRNA lanes: TAOK1 #1, NTC, TAOK1 #5, NTC, TAOK1 3' pool, NTC. (C) HCC1419, lanes: Flag, Flag-TAOK1*, Flag-TAOK1*(KD).]


Figure 3.5. TAOK1 fusion knockdown inhibits cell growth.

[Figure. (A) siRNA target locations: siRNA #1 and siRNA #5 on both the TAOK1* fusion (exons 1 to 12 joined to 17p11.2) and native TAOK1 (exons 1 to 20); 3’ siRNA pool on native TAOK1 only. (B) Bar charts for HCC1419; y-axis: Relative expression (% GAPDH); left: TAOK1 expression (3' Q-RT-PCR probe); right: TAOK1* expression (fusion junction Q-RT-PCR probe); conditions: NTC siRNA pool, TAOK1 siRNA #1, TAOK1 siRNA #5, TAOK1 3' siRNA pool. (C) Growth curves; x-axis: Time (days 1, 3, 5, 7); y-axis: Cell growth (OD value); NTC siRNA pool compared with TAOK1 siRNA #1, TAOK1 siRNA #5, and the 3’ TAOK1 siRNA pool. (D) Bar chart; y-axis: Relative Senescence; NTC siRNA pool compared with TAOK1 siRNA #1, #5 pool.]


Table 3.S1. Breast cancer cell lines.

Cell line | Number of high quality mate pairs | Breast cancer subtype
HCC1569 | 18,908,362 | Basal A
HCC1599 | 1,755,574 | Basal A
HCC1937 | 21,341,427 | Basal A
HCC1954 | 21,527,481 | Basal A
HCC2157 | 21,372,824 | Basal A
HCC1187 | 20,872,880 | Basal A
HCC70 | 5,103,464 | Basal A
BT20 | 18,196,182 | Basal A
HCC1143 | 16,831,243 | Basal A
HCC3153 | 17,846,186 | Basal A
MDA468 | 18,182,490 | Basal A
HCC1395 | 20,394,584 | Basal B
HCC38 | 20,659,436 | Basal B
HS578t | 13,626,818 | Basal B
CAL51 | 13,270,508 | Basal B
MDA436 | 17,086,650 | Basal B
BT549 | 7,451,847 | Basal B
BT474 | 13,405,988 | Luminal
HCC1419 | 18,034,737 | Luminal
MCF7 | 17,641,976 | Luminal
SUM44 | 8,833,618 | Luminal
UACC812 | 16,293,328 | Luminal
ZR-75-30 | 18,389,999 | Luminal
HCC2185 | 11,039,590 | Luminal
HCC2218 | 17,838,259 | Luminal
MDA134 | 16,303,669 | Luminal
MDA361 | 15,722,765 | Luminal
SKBR3 | 11,912,117 | Luminal
UACC893 | 11,691,160 | Luminal
HCC1428 | 17,531,910 | Luminal
CAMA1 | 20,087,597 | Luminal
HCC1500 | 17,620,804 | Luminal
EFM19 | 17,117,606 | Luminal
SUM52 | 10,979,240 | Luminal
HMEC | 16,966,305 | Normal tissue
HCC1806 | 21,550,742 | Unknown
DU4475 | 11,502,195 | Unknown


Table 3.S2. Candidate gene fusions.

Gene Fusion | Cell Line | Score | 5' chromosome | 3' chromosome | Number supporting paired reads
BCAS4/BCAS3 | MCF7 | 3 | chr20 | chr17 | 921
PLEC1/C8orf38 | HCC1419 | 5 | chr8 | chr8 | 302
PRMT1/BCL2L12 | HCC38 | 4 | chr19 | chr19 | 294
RARA/KIAA0195 | UACC893 | 5 | chr17 | chr17 | 189
DDX5/DEPDC6 | ZR-75-30 | 5 | chr17 | chr8 | 175
TATDN1/GSDMB | SKBR3 | 3 | chr8 | chr17 | 165
USP32/CCDC49 | ZR-75-30 | 5 | chr17 | chr17 | 158
HDGF/S100A10 | UACC812 | 5 | chr1 | chr1 | 146
VPS18/ZFYVE19 | HCC1419 | 4 | chr15 | chr15 | 114
RFT1/UQCRC2 | HCC1806 | 5 | chr3 | chr16 | 109
CTAGE5/SIP1 | HCC1187 | 5 | chr14 | chr14 | 82
RPS6KB1/SNF8 | BT474 | 5 | chr17 | chr17 | 88
WIPF2/ERBB2 | UACC812 | 3 | chr17 | chr17 | 93
C6orf106/SPDEF | HCC1954 | 5 | chr6 | chr6 | 78
GRB7/PGAP3 | HCC2218 | 3 | chr17 | chr17 | 78
PPP1R12B/SNX27 | UACC812 | 5 | chr1 | chr1 | 76
SLC9A7/ALDH7A1 | HCC1500 | 5 | chrX | chr5 | 52
DCAF12/UNC13B | HCC2185 | 5 | chr9 | chr9 | 73
EIF3K/CYP39A1 | HCC1395 | 5 | chr19 | chr6 | 59
SEMA3F/GNAI2 | HCC1937 | 4 | chr3 | chr3 | 65
CTNND1/SMTNL1 | HCC1187 | 5 | chr11 | chr11 | 60
RORA/ANXA2 | HCC3153 | 4 | chr15 | chr15 | 12
DNAJC3/CLDN10 | HCC3153 | 5 | chr13 | chr13 | 65
ARFGEF2/SULF2 | MCF7 | 5 | chr20 | chr20 | 59
BOP1/LMF1 | HCC1569 | 5 | chr8 | chr16 | 62
PHF20L1/SAMD12 | HCC1954 | 3 | chr8 | chr8 | 64
TAX1BP1/AHCY | HCC1806 | 3 | chr7 | chr20 | 56
PRPF3/SRGAP2 | UACC812 | 5 | chr1 | chr1 | 5
BCAS3/HOXB9 | ZR-75-30 | 5 | chr17 | chr17 | 45
SPAG9/NGFR | HCC1428 | 5 | chr17 | chr17 | 50
GRB7/PPP1R1B | UACC893 | 5 | chr17 | chr17 | 49
ACACA/STAC2 | BT474 | 5 | chr17 | chr17 | 48
PUM1/TRERF1 | HCC1187 | 4 | chr1 | chr6 | 45
VAPB/IKZF3 | BT474 | 4 | chr20 | chr17 | 43
CCNE2/FAM82B | HCC1419 | 3 | chr8 | chr8 | 35
C18orf45/HM13 | HCC1143 | 2 | chr18 | chr20 | 44
VAV2/TRUB2 | HCC1419 | 4 | chr9 | chr9 | 39
POLA2/CAPN1 | HCC1806 | 5 | chr11 | chr11 | 42
FGFR2/ACADSB | SUM52 | 3 | chr10 | chr10 | 38
FAM102A/CIZ1 | BT474 | 5 | chr9 | chr9 | 36
PHF12/LRRC37B2 | UACC812 | 4 | chr17 | chr17 | 17
MED1/ACSF2 | BT474 | 4 | chr17 | chr17 | 36
TMEM104/CRKRS | MDA361 | 5 | chr17 | chr17 | 35
DCAF8/COPA | MDA468 | 4 | chr1 | chr1 | 32
PDE7A/ARMC1 | UACC893 | 3 | chr8 | chr8 | 29
TRPS1/LASP1 | ZR-75-30 | 5 | chr8 | chr17 | 31
STX16/RAE1 | BT474 | 4 | chr20 | chr20 | 30
MED1/STXBP4 | BT474 | 5 | chr17 | chr17 | 33
GCN1L1/MSI1 | MCF7 | 3 | chr12 | chr12 | 30
WWC1/ADRBK2 | HCC3153 | 5 | chr5 | chr22 | 30
ZNF251/TSHZ2 | HCC1419 | 5 | chr8 | chr20 | 29
ZMYND8/CEP250 | BT474 | 4 | chr20 | chr20 | 29
MTAP/PCDH7 | HCC38 | 5 | chr9 | chr4 | 25
GALNT7/ORC4L | HCC1954 | 3 | chr4 | chr2 | 28
BAG4/ASH2L | HCC1569 | 5 | chr8 | chr8 | 26
AGPAT5/MCPH1 | HCC1187 | 3 | chr8 | chr8 | 27
GSDMC/PVT1 | HCC1954 | 5 | chr8 | chr8 | 26
PTPN1/CDS2 | HCC2185 | 3 | chr20 | chr20 | 28
RASA2/ACPL2 | HCC2157 | 5 | chr3 | chr3 | 33
PREX1/CPNE1 | SKBR3 | 3 | chr20 | chr20 | 24
RAPGEF5/PPFIA3 | SUM52 | 2 | chr7 | chr19 | 25
EZR/OSTCL | MDA361 | 5 | chr6 | chr6 | 25
CRK/YWHAE | HCC3153 | 5 | chr17 | chr17 | 5
CDH1/CDH3 | UACC893 | 5 | chr16 | chr16 | 23
ARID1A/WDTC1 | UACC812 | 5 | chr1 | chr1 | 23
SRGAP2/PRPF3 | UACC812 | 5 | chr1 | chr1 | 5
UNC13B/PRSS3 | HCC2185 | 5 | chr9 | chr9 | 18
DEPDC1B/ELOVL7 | MCF7 | 5 | chr5 | chr5 | 22
RAB7A/LRCH3 | HCC1395 | 5 | chr3 | chr3 | 22
STRADB/NOP58 | HCC1954 | 5 | chr2 | chr2 | 22
SUPT4H1/CCDC46 | MDA361 | 2 | chr17 | chr17 | 20
INTS1/PRKAR1B | HCC1954 | 5 | chr7 | chr7 | 22
RANBP1/C22orf25 | HCC2157 | 5 | chr22 | chr22 | 15
DNMT1/KEAP1 | HCC2157 | 5 | chr19 | chr19 | 25
SMURF2/CCDC46 | HCC1806 | 3 | chr17 | chr17 | 21
TMEM123/MMP20 | HCC38 | 3 | chr11 | chr11 | 21
PPM1D/TRAPPC9 | ZR-75-30 | 5 | chr17 | chr8 | 22
XRCC6/C6orf125 | HCC38 | 4 | chr22 | chr6 | 5
PRLR/AGXT2 | HCC2157 | 4 | chr5 | chr5 | 23
RPS6KB1/DIAPH3 | MCF7 | 5 | chr17 | chr13 | 14
SLC37A1/ABCG1 | HCC1428 | 5 | chr21 | chr21 | 20
WNK1/CWC22 | HCC1806 | 5 | chr12 | chr2 | 19
CDH3/CDH1 | HCC2218 | 5 | chr16 | chr16 | 20
UBR5/SLC25A32 | MDA468 | 3 | chr8 | chr8 | 18
PLA2R1/RBMS1 | HCC1395 | 4 | chr2 | chr2 | 18
TANC2/CA4 | MCF7 | 4 | chr17 | chr17 | 18
KLHDC2/SNTB1 | SKBR3 | 5 | chr14 | chr8 | 17
ESR1/C6orf97 | HCC1428 | 3 | chr6 | chr6 | 18
NGFR/SPAG9 | HCC1428 | 5 | chr17 | chr17 | 18
ACTB/RNF216 | BT20 | 3 | chr7 | chr7 | 5
ARIH2/SOS1 | BT549 | 4 | chr3 | chr2 | 16
GPX4/CNN2 | HCC1569 | 5 | chr19 | chr19 | 15
MYO9B/FCHO1 | MCF7 | 5 | chr19 | chr19 | 16
CCDC85C/SETD3 | SKBR3 | 5 | chr14 | chr14 | 17
PPFIBP1/C12orf70 | HCC1187 | 4 | chr12 | chr12 | 16
CHKA/PPFIA1 | CAMA1 | 4 | chr11 | chr11 | 17
DOCK5/CDCA2 | UACC893 | 5 | chr8 | chr8 | 16
HMGXB3/PPARGC1B | HCC38 | 4 | chr5 | chr5 | 15
POFUT1/TM9SF4 | HCC1187 | 5 | chr20 | chr20 | 15
TAOK1/PCGF2 | ZR-75-30 | 3 | chr17 | chr17 | 15
ACBD6/RRP15 | HCC38 | 5 | chr1 | chr1 | 16
BTG3/C21orf91 | HCC1569 | 5 | chr21 | chr21 | 14
FOSL2/BRE | HCC1395 | 5 | chr2 | chr2 | 14
JDP2/FAM164C | HCC1569 | 3 | chr14 | chr14 | 14
GSK3B/GPR156 | UACC893 | 4 | chr3 | chr3 | 13
SMARCA4/CARM1 | MCF7 | 5 | chr19 | chr19 | 13
C16orf62/IQCK | MCF7 | 5 | chr16 | chr16 | 14
LDLRAD3/TCP11L1 | HCC1806 | 5 | chr11 | chr11 | 13
KLK5/CDH23 | HCC1187 | 4 | chr19 | chr10 | 13
TULP4/PBX1 | SUM44 | 5 | chr6 | chr1 | 14
EHMT1/COL5A1 | CAMA1 | 5 | chr9 | chr9 | 12
CDYL/CDKAL1 | MDA361 | 5 | chr6 | chr6 | 12
PTPRJ/LPXN | HCC1806 | 4 | chr11 | chr11 | 11
SYTL2/PICALM | MCF7 | 5 | chr11 | chr11 | 12
BCAS3/REG4 | MCF7 | 5 | chr17 | chr1 | 12
PTPRA/HHLA1 | HCC38 | 5 | chr20 | chr8 | 11
MFSD3/HEATR7A | HCC1395 | 5 | chr8 | chr8 | 11
GOLGB1/ILDR1 | BT20 | 4 | chr3 | chr3 | 11
NFIC/FZR1 | HCC3153 | 5 | chr19 | chr19 | 11
LUC7L2/TBX4 | SUM52 | 5 | chr7 | chr17 | 13
KLHDC10/DYNC1I1 | SUM52 | 3 | chr7 | chr7 | 12
ST7/PRKAG2 | CAMA1 | 5 | chr7 | chr7 | 10
CYTSA/SUSD2 | HCC3153 | 5 | chr22 | chr22 | 3
FITM2/UQCC | BT474 | 4 | chr20 | chr20 | 10
FBXL20/CRKRS | UACC893 | 5 | chr17 | chr17 | 10
STARD3/TAC4 | HCC1419 | 3 | chr17 | chr17 | 10
PIP4K2B/RAD51C | BT474 | 3 | chr17 | chr17 | 10
SCAPER/TM6SF1 | HCC38 | 5 | chr15 | chr15 | 10
RNF111/TCF12 | HCC38 | 5 | chr15 | chr15 | 10
PAPOLA/AK7 | MCF7 | 5 | chr14 | chr14 | 10
BCAS3/TMEM71 | ZR-75-30 | 5 | chr17 | chr8 | 9
MSH2/KCNK12 | HCC1569 | 3 | chr2 | chr2 | 9
NARS2/RLN3 | HCC3153 | 5 | chr11 | chr19 | 9
TMEM104/RAB37 | HCC1395 | 5 | chr17 | chr17 | 9
USP32/NLK | HCC2218 | 5 | chr17 | chr17 | 9
RNF121/SFRS2IP | HCC1937 | 5 | chr11 | chr12 | 9
HNRNPUL2/AHNAK | HCC1395 | 3 | chr11 | chr11 | 9
NFIA/EHF | HCC1937 | 4 | chr1 | chr11 | 10
PDE2A/NDUFC2 | SUM44 | 3 | chr11 | chr11 | 9
C1orf66/CTSS | UACC812 | 4 | chr1 | chr1 | 9
FHL2/TGFBRAP1 | HCC3153 | 5 | chr2 | chr2 | 10
FEM1B/KCNG3 | UACC893 | 4 | chr15 | chr2 | 7
INPP5A/PWWP2B | HCC1954 | 3 | chr10 | chr10 | 8
ZFYVE9/USP33 | EFM19 | 3 | chr1 | chr1 | 8
FAM49B/TTLL11 | HCC2185 | 2 | chr8 | chr9 | 7
GNA12/CARD11 | HCC1395 | 3 | chr7 | chr7 | 6
LIMCH1/UCHL1 | HCC38 | 3 | chr4 | chr4 | 7
GLB1/CMTM7 | BT474 | 5 | chr3 | chr3 | 7
GALNT13/ARID5A | HCC3153 | 5 | chr2 | chr2 | 7
CENPO/RAB10 | ZR-75-30 | 4 | chr2 | chr2 | 8
AKT3/P2RX5 | UACC812 | 5 | chr1 | chr17 | 7
SUZ12P/MYO18A | UACC812 | 4 | chr17 | chr17 | 4
AHCYL1/RAD51C | MCF7 | 5 | chr1 | chr17 | 7
C8orf38/NCOR1 | HCC1419 | 5 | chr8 | chr17 | 7
TMX2/CTNND1 | HCC1500 | 3 | chr11 | chr11 | 7
E2F8/CSRP3 | HCC1937 | 4 | chr11 | chr11 | 7
GALNTL4/SBF2 | HCC1806 | 4 | chr11 | chr11 | 7
SEC22B/NOTCH2 | HCC1187 | 4 | chr1 | chr1 | 7
SHB/LOC158376 | HCC2185 | 3 | chr9 | chr9 | 6
PSD3/CSGALNACT1 | HCC1569 | 4 | chr8 | chr8 | 6
ANK1/ZMAT4 | MDA134 | 3 | chr8 | chr8 | 5
UVRAG/ZMAT4 | SUM44 | 5 | chr11 | chr8 | 6
MLL5/LHFPL3 | HCC1187 | 5 | chr7 | chr7 | 6
ZFAND2A/C7orf50 | UACC893 | 3 | chr7 | chr7 | 6
MYO6/SENP6 | MCF7 | 5 | chr6 | chr6 | 6
EFTUD2/KIF18B | HCC1395 | 4 | chr17 | chr17 | 6
POLDIP2/BRIP1 | HCC2218 | 4 | chr17 | chr17 | 6
INTS2/RPS6KB1 | HCC2218 | 5 | chr17 | chr17 | 5
MLLT6/PLXDC1 | UACC812 | 3 | chr17 | chr17 | 7
CHD6/SKAP1 | HCC1419 | 5 | chr20 | chr17 | 7
RBM23/PSMB5 | HCC38 | 4 | chr14 | chr14 | 6
ARID1A/MAST2 | MDA468 | 5 | chr1 | chr1 | 6
NUDC/ZDHHC18 | HCC2157 | 5 | chr1 | chr1 | 6
ARHGEF10L/PADI6 | HCC1806 | 4 | chr1 | chr1 | 6
ZNF362/ROR1 | HCC1428 | 3 | chr1 | chr1 | 7
CDK5RAP2/MEGF9 | HCC1428 | 4 | chr9 | chr9 | 5
ZNF79/RALGPS1 | HCC1419 | 3 | chr9 | chr9 | 5
EIF3H/RAD21 | MCF7 | 3 | chr8 | chr8 | 5
PHACTR2/C6orf94 | DU4475 | 4 | chr6 | chr6 | 5
ADAMTS19/SLC27A6 | MCF7 | 4 | chr5 | chr5 | 5
MORC3/DOPEY2 | BT20 | 4 | chr21 | chr21 | 5
DHX35/ITCH | SKBR3 | 4 | chr20 | chr20 | 5
TCF7L1/KCMF1 | HCC3153 | 3 | chr2 | chr2 | 5
MED1/GSDMB | HCC38 | 5 | chr17 | chr17 | 5
LIMA1/USP22 | BT20 | 5 | chr12 | chr17 | 5
CHKA/CASC3 | UACC893 | 5 | chr11 | chr17 | 5
TAF4/BRIP1 | MCF7 | 4 | chr20 | chr17 | 5
CRLF3/CHD9 | EFM19 | 5 | chr17 | chr16 | 4
TRAF3/ZNF839 | HCC1419 | 2 | chr14 | chr14 | 5
LAMP1/MCF2L | BT474 | 4 | chr13 | chr13 | 5
PPP1R12A/MGAT4C | BT474 | 3 | chr12 | chr12 | 5
SAMD12/MRPL13 | SKBR3 | 5 | chr8 | chr8 | 4
UTRN/RSPH9 | HCC38 | 5 | chr6 | chr6 | 4
SLC26A6/PRKAR2A | HCC38 | 5 | chr3 | chr3 | 4
RAD51C/ATXN7 | MCF7 | 3 | chr17 | chr3 | 4
PRKCE/MBOAT2 | HCC38 | 3 | chr2 | chr2 | 4
ITGB6/RBMS1 | UACC893 | 5 | chr2 | chr2 | 4
MED1/IKZF3 | UACC893 | 3 | chr17 | chr17 | 4
NQO1/ERBB2 | HCC1954 | 2 | chr16 | chr17 | 4
SPRED1/BUB1B | HCC38 | 5 | chr15 | chr15 | 4
PDE8A/SLC28A1 | HCC1428 | 5 | chr15 | chr15 | 4
ERO1L/FERMT2 | HCC1395 | 3 | chr14 | chr14 | 4
PCCA/TM9SF2 | MDA436 | 3 | chr13 | chr13 | 4
IFFO2/UBR4 | HCC38 | 3 | chr1 | chr1 | 4
RNF187/OBSCN | HCC1428 | 5 | chr1 | chr1 | 4
C1orf159/OPRD1 | HCC1428 | 2 | chr1 | chr1 | 4
NOTCH1/GABBR2 | BT20 | 2 | chr9 | chr9 | 3
IQGAP1/FBXO32 | HCC1143 | 3 | chr15 | chr8 | 5
TULP4/TMEM181 | HCC3153 | 2 | chr6 | chr6 | 3
SYNJ2/ZDHHC14 | HCC2185 | 2 | chr6 | chr6 | 3
TBL1XR1/RGS17 | MCF7 | 4 | chr3 | chr6 | 2
PPP2R2C/ADD1 | MDA361 | 3 | chr4 | chr4 | 3
ZNF236/ZNF516 | HCC2157 | 4 | chr18 | chr18 | 4
INTS2/TMEM49 | HCC2218 | 5 | chr17 | chr17 | 3
PARD6B/BCAS3 | MCF7 | 4 | chr20 | chr17 | 3
ATXN7L3/FAM171A2 | MCF7 | 3 | chr17 | chr17 | 3
WNK1/USP31 | HCC1806 | 4 | chr12 | chr16 | 3
RIN3/SLC24A4 | HCC2157 | 3 | chr14 | chr14 | 3
ALDOA/GANAB | HCC1395 | 2 | chr16 | chr11 | 3
EEF2/CFL1 | CAMA1 | 3 | chr19 | chr11 | 2
THRAP3/EIF2C3 | HCC2157 | 5 | chr1 | chr1 | 3
TCHP/UBR4 | HCC1569 | 4 | chr12 | chr1 | 3
VPS13D/IFFO2 | HCC1428 | 3 | chr1 | chr1 | 3
CXorf15/SYAP1 | MCF7 | 3 | chrX | chrX | 2
PPM1D/RAD54B | ZR-75-30 | 3 | chr17 | chr8 | 2
POP1/MATN2 | MCF7 | 3 | chr8 | chr8 | 2
REEP5/MCC | HCC1806 | 5 | chr5 | chr5 | 2
USP4/RHOA | HCC1395 | 2 | chr3 | chr3 | 2
FITM2/C20orf111 | HCC1428 | 3 | chr20 | chr20 | 2
EIF3H/FAM65C | HCC1419 | 4 | chr8 | chr20 | 2
TTLL4/CYP27A1 | HCC1954 | 4 | chr2 | chr2 | 2
EIF2AK3/PRKD3 | UACC893 | 5 | chr2 | chr2 | 2
EIF3K/ACTN4 | HCC3153 | 4 | chr19 | chr19 | 2
LASP1/CSF3 | UACC812 | 3 | chr17 | chr17 | 2
ACTN4/CDH13 | UACC893 | 4 | chr19 | chr16 | 2
FLNA/PKM2 | BT474 | 2 | chrX | chr15 | 2
PLEKHO2/ANKDD1A | BT20 | 3 | chr15 | chr15 | 2
HOXC10/FLJ12825 | HCC1419 | 3 | chr12 | chr12 | 2
SCARB1/UBC | HCC1806 | 5 | chr12 | chr12 | 2
SPATS2/LOC100286844 | HCC2185 | 3 | chr12 | chr12 | 2
VPS33A/CLIP1 | HCC1419 | 3 | chr12 | chr12 | 2
ITGAV/CNTN5 | Hs578t | 2 | chr2 | chr11 | 2
ERBB2IP/NLN | HCC1143 | 3 | chr5 | chr5 | 2
ST20/MTHFS | DU4475 | 3 | chr15 | chr15 | 2


Table 3.S3. Candidate intragenic rearrangements.

Intragenic rearrangement | Cell line | Score | Chromosome | Number supporting paired-reads
AGPAT6 | HCC38 | 4 | chr8 | 17
ALDOA | DU4475 | 2 | chr16 | 2
ARPC1A | HCC1187 | 4 | chr7 | 2
CDK5RAP2 | HCC1806 | 5 | chr9 | 2
CEP97 | MDA436 | 5 | chr3 | 14
CYTSB | HCC1143 | 2 | chr17 | 4
DTNB | HCC38 | 5 | chr2 | 35
DYNC2LI1 | HCC38 | 4 | chr2 | 2
EGFR | MDA436 | 5 | chr7 | 12
EPHB4 | HCC1395 | 5 | chr7 | 8
FAM53B | MDA436 | 5 | chr10 | 8
FBXO22 | HCC38 | 3 | chr15 | 3
FNIP1 | MDA436 | 4 | chr5 | 2
IL10RB | MDA436 | 3 | chr21 | 4
LOC286467 | HCC1428 | 2 | chrX | 2
MAML2 | HCC3153 | 5 | chr11 | 5
MTR | MDA436 | 5 | chr1 | 10
NIPA1 | MDA436 | 5 | chr15 | 7
NUP160 | MDA436 | 5 | chr11 | 44
PCMT1 | HCC38 | 5 | chr6 | 111
PHF21A | MDA436 | 3 | chr11 | 9
PKP4 | BT20 | 2 | chr2 | 2
PNPT1 | HCC38 | 5 | chr2 | 20
RCL1 | MDA436 | 5 | chr9 | 4
REPS1 | HCC38 | 5 | chr6 | 14
RPTOR | MDA436 | 5 | chr17 | 31
SBNO1 | HCC3153 | 5 | chr12 | 36
SEC24B | HCC3153 | 5 | chr4 | 16
SIN3A | BT20 | 3 | chr15 | 2
SRPK2 | HCC1395 | 5 | chr7 | 13
TRAF3IP2 | MDA436 | 4 | chr6 | 5


Table 3.S4. Candidate gene-intergenic rearrangements.

Gene-intergenic fusions | Cell Line | Score | 5' chromosome | 3' chromosome | Number supporting paired-reads
DDX5/chr8_120950001-121000000 | ZR7530 | 4 | chr17 | chr8 | 997
EIF3E/chr5_116850001-116900000 | HCC1954 | 4 | chr8 | chr5 | 244
MYO18A/chr17_44900001-44950000 | ZR7530 | 4 | chr17 | chr17 | 153
chr3_195250001-195300000/OPA1 | HCC1937 | 4 | chr3 | chr3 | 126
chr1_245150001-245200000/NAAA | BT474 | 4 | chr1 | chr4 | 81
LOC388796/chr8_52950001-53000000 | HCC38 | 4 | chr20 | chr8 | 59
ATP9A/chr20_46000001-46050000 | MCF7 | 4 | chr20 | chr20 | 42
LOC388796/chr8_53000001-53050000 | HCC38 | 4 | chr20 | chr8 | 40
NUP160/chr11_47850001-47900000 | MDA436 | 4 | chr11 | chr11 | 38
PAK1/chr8_21350001-21400000 | MDA134 | 4 | chr11 | chr8 | 39
MGLL/chr3_129050001-129100000 | BT20 | 4 | chr3 | chr3 | 32
chr1_28400001-28450000/SLC9A7 | HCC1500 | 4 | chr1 | chrX | 31
DTNB/chr2_25750001-25800000 | HCC1395 | 4 | chr2 | chr2 | 28
BIRC6/chr8_123150001-123200000 | HCC1419 | 4 | chr2 | chr8 | 25
EGFR/chr7_54600001-54650000 | MDA468 | 4 | chr7 | chr7 | 25
TAOK1/chr17_16450001-16500000 | HCC1419 | 4 | chr17 | chr17 | 26
TAOK1/chr17_46750001-46800000 | ZR7530 | 4 | chr17 | chr17 | 21
ACSL3/chr2_223300001-223350000 | HCC1395 | 4 | chr2 | chr2 | 22
chr6_25250001-25300000/ACOT13 | HCC2218 | 4 | chr6 | chr6 | 20
FAM104A/chr17_22000001-22050000 | EFM19 | 4 | chr17 | chr17 | 21
TPD52L2/chr17_58300001-58350000 | MCF7 | 4 | chr20 | chr17 | 17
GALNTL4/chr11_39250001-39300000 | HCC1143 | 4 | chr11 | chr11 | 17
NUPL2/chr7_23150001-23200000 | MDA436 | 4 | chr7 | chr7 | 16
GRLF1/chr19_49750001-49800000 | HCC38 | 4 | chr19 | chr19 | 16
HNRNPR/chr1_23550001-23600000 | MDA436 | 4 | chr1 | chr1 | 15
CTPS/chr1_41150001-41200000 | HCC2157 | 4 | chr1 | chr1 | 19
APP/chr21_39000001-39050000 | HCC1954 | 4 | chr21 | chr21 | 15
JAK1/chr1_24450001-24500000 | HCC1428 | 4 | chr1 | chr1 | 15
ZNFX1/chr8_118350001-118400000 | HCC2185 | 4 | chr20 | chr8 | 13
TPX2/chr20_29650001-29700000 | HCC1806 | 4 | chr20 | chr20 | 13
ARFRP1/chr8_90200001-90250000 | HCC1419 | 4 | chr20 | chr8 | 12
RAD21/chr5_2550001-2600000 | HCC1954 | 4 | chr8 | chr5 | 12
chr14_37100001-37150000/FOXA1 | HCC1954 | 4 | chr14 | chr14 | 11
KRTCAP2/chr2_45050001-45100000 | HCC2157 | 4 | chr1 | chr2 | 14
PVT1/chr8_128450001-128500000 | HCC1569 | 4 | chr8 | chr8 | 11
TXNL4A/chr18_74050001-74100000 | HCC2185 | 4 | chr18 | chr18 | 10
MEF2A/chr15_96450001-96500000 | HCC38 | 4 | chr15 | chr15 | 10
SRA1/chr18_36950001-37000000 | MDA436 | 3 | chr5 | chr18 | 23
ANO1/chr8_38650001-38700000 | MDA134 | 3 | chr11 | chr8 | 207
SH3PXD2B/chr5_174900001-174950000 | BT20 | 3 | chr5 | chr5 | 120
ATXN7/chr1_106550001-106600000 | MCF7 | 3 | chr3 | chr1 | 112
SETD5/chr3_9350001-9400000 | HCC1143 | 3 | chr3 | chr3 | 95
NSD1/chr8_130400001-130450000 | HCC1954 | 3 | chr5 | chr8 | 66
C7orf50/chr7_9750001-9800000 | HCC38 | 3 | chr7 | chr7 | 54
SLC2A1/chr1_43200001-43250000 | HCC2157 | 3 | chr1 | chr1 | 40
PCGF2/chr17_44800001-44850000 | ZR7530 | 3 | chr17 | chr17 | 33
GOSR1/chr1_113650001-113700000 | UACC812 | 3 | chr17 | chr1 | 25
ASAP2/chr2_9100001-9150000 | HCC1806 | 3 | chr2 | chr2 | 25
GGA1/chr22_36150001-36200000 | HCC1569 | 3 | chr22 | chr22 | 23
KIAA0100/chr8_144350001-144400000 | ZR7530 | 3 | chr17 | chr8 | 21
LIMCH1/chr4_40950001-41000000 | HCC1569 | 3 | chr4 | chr4 | 21
FAM117A/chr8_102100001-102150000 | ZR7530 | 3 | chr17 | chr8 | 21
chr15_39050001-39100000/PSMB6 | HCC1419 | 3 | chr15 | chr17 | 20
PALB2/chr16_23550001-23600000 | BT20 | 3 | chr16 | chr16 | 20
DDX42/chr20_44800001-44850000 | MCF7 | 3 | chr17 | chr20 | 18
GRLF1/chr19_49800001-49850000 | HCC38 | 3 | chr19 | chr19 | 17
PVT1/chr8_117450001-117500000 | HCC2157 | 3 | chr8 | chr8 | 18
ADAM9/chr8_37000001-37050000 | HCC38 | 3 | chr8 | chr8 | 15
REV3L/chr6_115450001-115500000 | HCC2185 | 3 | chr6 | chr6 | 15
TFRC/chr8_118800001-118850000 | HCC2185 | 3 | chr3 | chr8 | 15
OSBP/chr11_59150001-59200000 | HCC38 | 3 | chr11 | chr11 | 13
NAGK/chr2_71100001-71150000 | MDA436 | 3 | chr2 | chr2 | 13
MLL3/chr3_136650001-136700000 | MDA436 | 3 | chr7 | chr3 | 12
PVT1/chr8_128300001-128350000 | HCC1569 | 3 | chr8 | chr8 | 12
FLJ43663/chr7_51400001-51450000 | MDA436 | 3 | chr7 | chr7 | 11
PAXIP1/chr7_154700001-154750000 | MDA436 | 3 | chr7 | chr7 | 11
TBC1D16/chr2_45000001-45050000 | HCC1395 | 3 | chr17 | chr2 | 11
UIMC1/chr8_128100001-128150000 | HCC1954 | 3 | chr5 | chr8 | 11
PPIF/chr10_80750001-80800000 | MDA436 | 2 | chr10 | chr10 | 55
CUGBP1/chr11_47550001-47600000 | MDA436 | 2 | chr11 | chr11 | 27
chr2_242550001-242600000/MLPH | HCC1419 | 2 | chr2 | chr2 | 17
C20orf199/chr17_44650001-44700000 | EFM19 | 2 | chr20 | chr17 | 13
SYCP2/chr20_37600001-37650000 | UACC812 | 2 | chr20 | chr20 | 10
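The intergenic partners in Table 3.S4 are named by genomic windows whose coordinates are consistent with fixed, 1-based 50 kb bins (for example, chr8_120950001-121000000). The text does not state how these labels were assigned, so the Python sketch below is an assumption-based illustration of how a mapped position could be converted into such a label; the function names and the example position are hypothetical.

```python
BIN_SIZE = 50_000  # assumed window width, inferred from the labels in Table 3.S4


def intergenic_bin_label(chrom: str, pos: int, bin_size: int = BIN_SIZE) -> str:
    """Label the 1-based, closed window of `bin_size` bases that contains `pos`."""
    if pos < 1:
        raise ValueError("genomic positions are assumed to be 1-based")
    start = (pos - 1) // bin_size * bin_size + 1
    end = start + bin_size - 1
    return f"{chrom}_{start}-{end}"


def gene_intergenic_label(gene: str, chrom: str, pos: int,
                          gene_is_5prime: bool = True) -> str:
    """Join a gene symbol and an intergenic window in 5'/3' order."""
    window = intergenic_bin_label(chrom, pos)
    return f"{gene}/{window}" if gene_is_5prime else f"{window}/{gene}"


# Hypothetical mate position on chr8; any position in the window gives the same label.
print(intergenic_bin_label("chr8", 120_975_000))
```

Binning intergenic alignments in this way lets read pairs that land near one another be tallied as support for a single candidate, at the cost of splitting support when a cluster of reads straddles a window boundary.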


CHAPTER 4 CONCLUSIONS AND FUTURE DIRECTIONS


The major goal of my dissertation work has been to discover and characterize novel gene rearrangements in cancer using genomic technologies. To this end, I designed and implemented several original computational approaches for gene fusion discovery, including both microarray and sequencing-based methods. I subsequently used these tools to discover a large number of novel alterations in various cancer types. Many of these alterations are the first gene fusions reported to date in the corresponding cancer type, and I demonstrated recurrence and oncogenic dependency for several rearrangements. Although these gene fusions appear to be relatively rare, many represent potentially druggable targets with therapeutic implications for patients. Furthermore, as several of these gene fusions occur in common malignancies such as breast and colon cancer, taken together we would expect these alterations to characterize a large number of cancer cases. In summary, I have described robust approaches for gene fusion discovery, and my results highlight a more widespread role of fusion genes in cancer pathogenesis. After having completed these projects, several possibilities exist for future studies. Our breakpoint analysis pipeline described in Chapter 2 was designed to search for novel rearrangements of known cancer genes. While this approach yielded many exciting discoveries, I believe our pipeline can be extended to identify rearrangements involving genes not previously implicated in cancer pathogenesis. The pipeline has already been run to include all known, annotated genes, and over 25,000 candidates were nominated. Predicting which of these candidates result in oncogenic driver gene fusions as opposed to passenger alterations represents a major challenge. However, several approaches could be attempted to make this task more manageable. 
For example, recurrence suggests functional importance, so focusing on recurrent breakpoints would likely greatly reduce the number of candidate alterations. In addition, the Cancer Cell Line Encyclopedia (CCLE)[1] contains gene expression profiling data for the majority of cell lines analyzed by our method. These data could be integrated with our breakpoint analysis results to filter for candidate rearrangements whose partner genes are also overexpressed. Indeed, many of our discoveries including APIP/SLC1A2, FAM133B/CDK6, and CEP85L/ROS1 exhibit high-level expression of the 3’ partner gene. Lastly, we have noted that several known gene fusions are associated with focal copy number alterations, and we could further filter our candidate list by focusing on such rearrangements. While challenging, this approach has the potential to discover truly novel oncogenic rearrangements, as opposed to novel rearrangements of known oncogenes.

The novel TAOK1 rearrangements discovered in breast cancer (Chapter 3) are unique in that they involve fusions to intergenic regions of the genome. Many studies using next generation sequencing to discover gene rearrangements have limited their analysis to the identification of gene-to-gene fusions[2-5]. As a result, several studies that have performed next generation sequencing on the HCC1419 and ZR75-30 breast cancer cell lines, which both harbor TAOK1 fusions, have missed these alterations[4,5]. I hypothesize that other gene-intergenic fusions remain to be discovered in publicly available sequencing datasets. My RNA-Seq gene rearrangement discovery pipeline searches not only for gene-to-gene fusions but also for these gene-intergenic rearrangements and for intragenic internal tandem duplications. As a future direction, I would like to apply my approach to mine these public datasets.

As a separate future direction, I would like to extend my computational pipelines to allow discovery of other types of rearrangements. For example, recent studies have implicated non-coding regions of the genome, including microRNAs and long non-coding RNAs (lncRNAs), in cancer pathogenesis[6-11]. Perhaps these sequences are involved in pathogenic rearrangements in certain cancer types. To our knowledge, no such alterations have been reported, and likely few if any studies have searched specifically for such fusions. Mining genomic datasets including next generation sequencing and array CGH data may identify such a novel class of rearrangements.
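The candidate-filtering strategy proposed earlier in this chapter (recurrence, overexpression of the partner gene, and association with a focal copy number alteration) can be illustrated with a minimal sketch. This is not the pipeline from Chapter 2: the `Candidate` record, its field names, the z-score representation of expression, and both thresholds are hypothetical choices made only for illustration, and requiring all three criteria at once is just one of several ways the filters could be combined.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One nominated breakpoint. All fields are hypothetical illustrations."""
    gene: str              # gene containing the breakpoint
    sample: str            # cell line or specimen identifier
    expr_z: float          # expression z-score of the partner gene (assumed input)
    focal_cn_change: bool  # breakpoint coincides with a focal copy number alteration


def prioritize(candidates, min_samples=2, min_expr_z=2.0):
    """Keep candidates that are recurrent, overexpressed, and focally altered.

    min_samples and min_expr_z are arbitrary illustrative thresholds.
    """
    # Count distinct samples per gene so that repeated calls within one
    # sample are not mistaken for recurrence.
    samples_per_gene = Counter(
        gene for gene, _sample in {(c.gene, c.sample) for c in candidates}
    )
    return [
        c for c in candidates
        if samples_per_gene[c.gene] >= min_samples
        and c.expr_z >= min_expr_z
        and c.focal_cn_change
    ]


# Made-up records with placeholder gene and sample names.
example = [
    Candidate("GENE_A", "line1", 3.1, True),
    Candidate("GENE_A", "line2", 2.5, True),
    Candidate("GENE_B", "line3", 4.0, True),   # not recurrent
    Candidate("GENE_C", "line1", 3.0, False),  # no focal copy number change
    Candidate("GENE_C", "line4", 3.0, False),
]
shortlist = prioritize(example)
```

In practice each criterion might be better treated as a ranking score than as a hard cutoff, since a strict intersection could discard a true driver that fails a single filter.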
My fusion discovery pipelines could also be extended to screen for additional types of intragenic rearrangements. Currently, the only intragenic rearrangements my method searches for are internal tandem duplications in RNA-Seq data. Such alterations, including internal tandem duplications involving FLT3, have been reported to play major roles in the pathogenesis of certain cancers[12,13]. Another type of intragenic rearrangement involves internal deletions of specific exon sequences. For example, glioblastoma is frequently characterized by the EGFRvIII rearrangement, which is created by an in-frame deletion of exons 2 through 7 of the epidermal growth factor receptor (EGFR)[14]. This alteration has a distinct appearance on array CGH profiles. In particular, the 5’ and 3’ ends of EGFR are amplified, but the middle region appears to be either deleted or present at normal copy number levels. It would be very interesting to search for this same array CGH pattern disrupting other genes. In addition, the high resolution of new array CGH platforms and next generation DNA sequencing would allow for the discovery of additional novel intragenic deletions. Correlating such findings with RNA-Seq data to screen for expression of the corresponding rearrangement would help to prioritize candidates.

Gene rearrangements play major roles in the pathogenesis of various malignancies, and I believe there is significant opportunity for further discovery of novel alterations using genomic approaches. Thousands of microarray and next generation sequencing datasets are publicly available, and additional datasets are being rapidly generated. Mining these data in new and creative ways represents a major challenge in the field of cancer genomics. I hope the methods I have developed in this dissertation can be used and extended to discover additional pathogenic rearrangements from these data. I especially hope that our findings are ultimately translated to novel therapeutic options for patients suffering from these devastating diseases.
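As an illustration of the proposed search for EGFRvIII-like array CGH patterns in other genes, the following minimal sketch tests whether the probes along a single gene show gained 5’ and 3’ ends flanking a middle run that is not gained. It is an assumption-laden toy rather than a validated caller: the function name, the log2 ratio threshold of 0.5, the minimum run lengths, and the use of unsegmented per-probe values are all hypothetical, and real data would likely require segmentation and noise modeling first.

```python
def has_internal_deletion_pattern(log2_ratios, gain=0.5, min_flank=2, min_middle=2):
    """Return True if gained 5' and 3' flanks surround a non-gained middle.

    log2_ratios: per-probe array CGH log2 ratios for one gene, ordered 5' to 3'.
    gain, min_flank, min_middle: illustrative thresholds (assumptions).
    """
    n = len(log2_ratios)
    # Try every split into a 5' flank [0:i], a middle [i:j], and a 3' flank [j:n].
    for i in range(min_flank, n - min_flank - min_middle + 1):
        for j in range(i + min_middle, n - min_flank + 1):
            five, middle, three = log2_ratios[:i], log2_ratios[i:j], log2_ratios[j:]
            if min(five) >= gain and min(three) >= gain and max(middle) < gain:
                return True
    return False


# Made-up profiles: amplified ends with a normal-copy middle, and a uniform gain.
internal_deletion_like = [1.2, 1.1, 1.3, 0.0, -0.1, 0.1, 1.0, 1.2]
uniformly_amplified = [1.0] * 8

flag_a = has_internal_deletion_pattern(internal_deletion_like)
flag_b = has_internal_deletion_pattern(uniformly_amplified)
```

The exhaustive scan over split points is adequate for the handful of probes that typically cover one gene; a genome-wide screen would more plausibly operate on segmented copy number calls.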

REFERENCES

1. Barretina J, Caponigro G, Stransky N, Venkatesan K, Margolin AA, et al. (2012) The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature 483: 603-607.

2. Maher CA, Kumar-Sinha C, Cao X, Kalyana-Sundaram S, Han B, et al. (2009) Transcriptome sequencing to detect gene fusions in cancer. Nature 458: 97-101.

3. Palanisamy N, Ateeq B, Kalyana-Sundaram S, Pflueger D, Ramnarayanan K, et al. (2010) Rearrangements of the RAF kinase pathway in prostate cancer, gastric cancer and melanoma. Nat Med 16: 793-798.

4. Robinson DR, Kalyana-Sundaram S, Wu YM, Shankar S, Cao X, et al. (2011) Functionally recurrent rearrangements of the MAST kinase and Notch gene families in breast cancer. Nat Med 17: 1646-1651.

5. Schulte I, Batty EM, Pole JC, Blood KA, Mo S, et al. (2012) Structural analysis of the genome of breast cancer cell line ZR-75-30 identifies twelve expressed fusion genes. BMC Genomics 13: 719.

6. Tsai MC, Spitale RC, Chang HY (2011) Long intergenic noncoding RNAs: new links in cancer progression. Cancer Res 71: 3-7.

7. Calin GA, Liu CG, Ferracin M, Hyslop T, Spizzo R, et al. (2007) Ultraconserved regions encoding ncRNAs are altered in human leukemias and carcinomas. Cancer Cell 12: 215-229.

8. Gupta RA, Shah N, Wang KC, Kim J, Horlings HM, et al. (2010) Long non-coding RNA HOTAIR reprograms chromatin state to promote cancer metastasis. Nature 464: 1071-1076.

9. Medina PP, Nolde M, Slack FJ (2010) OncomiR addiction in an in vivo model of microRNA-21-induced pre-B-cell lymphoma. Nature 467: 86-90.

10. Png KJ, Halberg N, Yoshida M, Tavazoie SF (2012) A microRNA regulon that mediates endothelial recruitment and metastasis by cancer cells. Nature 481: 190-194.

11. Gee HE, Camps C, Buffa FM, Colella S, Sheldon H, et al. (2008) MicroRNA-10b and breast cancer metastasis. Nature 455: E8-9; author reply E9.

12. Kiyoi H, Towatari M, Yokota S, Hamaguchi M, Ohno R, et al. (1998) Internal tandem duplication of the FLT3 gene is a novel modality of elongation mutation which causes constitutive activation of the product. Leukemia 12: 1333-1337.

13. Nakao M, Yokota S, Iwai T, Kaneko H, Horiike S, et al. (1996) Internal tandem duplication of the flt3 gene found in acute myeloid leukemia. Leukemia 10: 1911-1918.

14. Del Vecchio CA, Giacomini CP, Vogel H, Jensen KC, Florio T, et al. (2012) EGFRvIII gene rearrangement is an early event in glioblastoma tumorigenesis and expression defines a hierarchy modulated by epigenetic mechanisms. Oncogene.
