Improved Detection of Gene Fusions by Applying Statistical Methods Reveals Oncogenic RNA Cancer Drivers Roozbeh Dehghannasiria, Donald E
Total Page:16
File Type:pdf, Size:1020Kb
Improved detection of gene fusions by applying statistical methods reveals oncogenic RNA cancer drivers Roozbeh Dehghannasiria, Donald E. Freemana,b, Milos Jordanskic, Gillian L. Hsieha, Ana Damljanovicd, Erik Lehnertd, and Julia Salzmana,b,e,1 aDepartment of Biochemistry, Stanford University, Stanford, CA 94305; bDepartment of Biomedical Data Science, Stanford University, Stanford, CA 94305; cDepartment of Computer Science, University of Belgrade, 11000 Belgrade, Serbia; dSeven Bridges Genomics Inc., Cambridge, MA 02142; and eStanford Cancer Institute, Stanford University, Stanford, CA 94305 Edited by Ali Torkamani, The Scripps Research Institute, La Jolla, CA, and accepted by Editorial Board Member Peter K. Vogt June 14, 2019 (received for review January 10, 2019) The extent to which gene fusions function as drivers of can- of multiple algorithms and filtering lists of fusions using manual cer remains a critical open question. Current algorithms do not approaches (13–15). These approaches lead to what third-party sufficiently identify false-positive fusions arising during library reviews agree is imprecise fusion discovery and bias against dis- preparation, sequencing, and alignment. Here, we introduce Data- covering novel oncogenes (15–17). This suboptimal performance Enriched Efficient PrEcise STatistical fusion detection (DEEPEST), becomes more problematic when fusion detection is deployed on an algorithm that uses statistical modeling to minimize false- large cancer sequencing datasets that contain thousands or tens positives while increasing the sensitivity of fusion detection. In of thousands of samples. In such scenarios, precise fusion detec- 9,946 tumor RNA-sequencing datasets from The Cancer Genome tion must overcome the problem of multiple hypothesis testing: Atlas (TCGA) across 33 tumor types, DEEPEST identifies 31,007 each algorithm is testing for fusions thousands of times, a regime fusions, 30% more than identified by other methods, while known to introduce FPs. To overcome these problems, the field calling 10-fold fewer false-positive fusions in nontransformed has turned to consensus-based approaches, where multiple algo- human tissues. We leverage the increased precision of DEEPEST rithms are run in parallel (10), and a metacaller allows “voting” to discover fundamental cancer biology. Namely, 888 candi- to produce the final list of fusions. This is also unsatisfactory, as date oncogenes are identified based on overrepresentation in it introduces FNs. DEEPEST calls, and 1,078 previously unreported fusions involv- Both shortcomings in the ascertainment of fusions by existing ing long intergenic noncoding RNAs, demonstrating a previously algorithms and using recurrence alone to assess fusions’ function unappreciated prevalence and potential for function. DEEPEST also reveals a high enrichment for fusions involving oncogenes Significance in cancers, including ovarian cancer, which has had minimal treat- ment advances in recent decades, finding that more than 50% of Gene fusions are tumor-specific genomic aberrations and are tumors harbor gene fusions predicted to be oncogenic. Specific among the most powerful biomarkers and drug targets in trans- protein domains are enriched in DEEPEST calls, indicating a global lational cancer biology. The advent of RNA-sequencing tech- selection for fusion functionality: kinase domains are nearly 2-fold nologies over the last decade has provided a unique opportu- more enriched in DEEPEST calls than expected by chance, as are nity for detecting novel fusions via deploying computational domains involved in (anaerobic) metabolism and DNA binding. algorithms on public sequencing databases. However, pre- The statistical algorithms, population-level analytic framework, cise fusion detection algorithms are still out of reach. We and the biological conclusions of DEEPEST call for increased atten- develop Data-Enriched Efficient PrEcise STatistical fusion detec- tion to gene fusions as drivers of cancer and for future research tion (DEEPEST), a highly specific and efficient statistical pipeline into using fusions for targeted therapy. specially designed for mining massive sequencing databases gene fusion j cancer genomics j bioinformatics j and apply it to all 33 tumor types and 10,500 samples in The pan-cancer analysis j TCGA Cancer Genome Atlas database. We systematically profile the landscape of detected fusions via classic statistical models and ene fusions are known to drive some cancers and can be identify several signatures of selection for fusions in tumors. Ghighly specific and personalized therapeutic targets; some Author contributions: R.D., D.E.F., and J.S. designed research; R.D., D.E.F., and J.S. of the most famous fusions are the BCR–ABL1 fusion in chronic performed research; R.D., D.E.F., M.J., G.L.H., A.D., E.L., and J.S. contributed new myelogenous leukemia (CML), the EML4–ALK fusion in non- reagents/analytic tools; R.D., D.E.F., and J.S. analyzed data; and R.D. and J.S. wrote the small lung cell carcinoma, TMPRSS2–ERG in prostate cancer, paper.y and FGFR3–TACC3 in a variety of cancers including glioblas- The authors declare no conflict of interest.y toma multiforme (1–4). Since fusions are generally absent in This article is a PNAS Direct Submission. A.T. is a guest editor invited by the Editorial healthy tissues, they are among the most clinically relevant events Board.y in cancer to direct targeted therapy and to be used as effective Published under the PNAS license.y diagnostic tools in early detection strategies using RNA or pro- Data deposition: DEEPEST workflow, with all needed softwares preinstalled, have been teins; moreover, as they are truly specific to cancer, they have deposited in GitHub, https://github.com/salzmanlab/DEEPEST-Fusion. Also, a publicly- promising potential as neo-antigens (5–7). available online tool with web interface is available for the DEEPEST algorithm on the Cancer Genomics Cloud platform, https://cgc.sbgenomics.com/public/apps#jordanski. Because of this, clinicians and large sequencing consortia have milos/deepest-fusion/deepest-fusion/. All custom scripts used to generate the figures made major efforts to identify fusions expressed in tumors via have been deposited in GitHub, https://github.com/salzmanlab/DEEPEST-Fusion/tree/ screening massive cancer sequencing datasets (8–12). However, master/custom scripts.y these attempts are limited by critical roadblocks: current algo- 1 To whom correspondence may be addressed. Email: [email protected] rithms suffer from high false-positive (FP) rates and unknown This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10. false-negative (FN) rates. Thus, ad hoc choices have been made 1073/pnas.1900391116/-/DCSupplemental.y in calling and analyzing fusions including taking the consensus Published online July 15, 2019. 15524–15533 j PNAS j July 30, 2019 j vol. 116 j no. 31 www.pnas.org/cgi/doi/10.1073/pnas.1900391116 Downloaded by guest on September 26, 2021 have limited the use of fusions to discover new cancer biology. As nature of fusion expression consistent with the existence of one of many examples, a recent study of more than 400 pancre- under-appreciated drivers of human cancer, including selection atic cancers found no recurrent gene fusions, raising the question for rare or private gene fusions with implications from basic of whether this is due to high FN rates or whether this means biology to the clinic. that fusions are not drivers in the disease (18). Recurrence of fusions is currently one of the only standards in the field used Results to assess the functionality of fusions, but the most frequently DEEPEST Is a Statistical Algorithm for Gene Fusion Discovery in expressed fusions may not be the most carcinogenic (19); on the Massive Public Databases. We engineered a statistical algorithm, other hand, there may still be many undiscovered gene fusions DEEPEST, to discover and estimate the prevalence of gene that drive cancer. fusions in massive numbers of datasets. Here, we have applied Thus, the critical question “Are gene fusions underappreci- DEEPEST to ∼10,000 datasets, but in principle, DEEPEST can ated drivers of cancer?” is still unanswered. In this paper, we be applied to 100,000, 1 million, or more samples. DEEPEST first provide an algorithm that has significant advance in preci- includes key innovations such as controlling FPs arising from sion for unbiased fusion detection at exon boundaries in mas- analysis of massive RNA-Seq datasets for fusion discovery, a sive genomics datasets. The algorithm, Data-Enriched Efficient problem conceptually analogous to multiple hypothesis testing PrEcise STatistical fusion detection (DEEPEST), is a second- via P values, which cannot be solved by direct application of com- generation fusion algorithm with significant computational and mon false-discovery rate (FDR)-controlling procedures, which algorithmic advance over our previously developed MACHETE rely on the assumption of a uniform distribution of P values (Mismatched Alignment Chimera Tracking Engine) algorithm under the null hypothesis. (20). A key innovation in DEEPEST is its statistical test of fusion The DEEPEST pipeline contains 2 main computational steps: prevalence across populations, which can identify FPs in a global 1) junction nomination component which is run on a subset of all unbiased manner. samples to be analyzed, called “the discovery set”; and 2) statis- The precision and efficient implementation of DEEPEST tical testing of nominated junctions