Improved detection of fusions by applying statistical methods reveals oncogenic RNA cancer drivers Roozbeh Dehghannasiria, Donald E. Freemana,b, Milos Jordanskic, Gillian L. Hsieha, Ana Damljanovicd, Erik Lehnertd, and Julia Salzmana,b,e,1

aDepartment of Biochemistry, Stanford University, Stanford, CA 94305; bDepartment of Biomedical Data Science, Stanford University, Stanford, CA 94305; cDepartment of Computer Science, University of Belgrade, 11000 Belgrade, Serbia; dSeven Bridges Genomics Inc., Cambridge, MA 02142; and eStanford Cancer Institute, Stanford University, Stanford, CA 94305

Edited by Ali Torkamani, The Scripps Research Institute, La Jolla, CA, and accepted by Editorial Board Member Peter K. Vogt June 14, 2019 (received for review January 10, 2019)

The extent to which gene fusions function as drivers of can- of multiple algorithms and filtering lists of fusions using manual cer remains a critical open question. Current algorithms do not approaches (13–15). These approaches lead to what third-party sufficiently identify false-positive fusions arising during library reviews agree is imprecise fusion discovery and bias against dis- preparation, sequencing, and alignment. Here, we introduce Data- covering novel oncogenes (15–17). This suboptimal performance Enriched Efficient PrEcise STatistical fusion detection (DEEPEST), becomes more problematic when fusion detection is deployed on an algorithm that uses statistical modeling to minimize false- large cancer sequencing datasets that contain thousands or tens positives while increasing the sensitivity of fusion detection. In of thousands of samples. In such scenarios, precise fusion detec- 9,946 tumor RNA-sequencing datasets from The Cancer Genome tion must overcome the problem of multiple hypothesis testing: Atlas (TCGA) across 33 tumor types, DEEPEST identifies 31,007 each algorithm is testing for fusions thousands of times, a regime fusions, 30% more than identified by other methods, while known to introduce FPs. To overcome these problems, the field calling 10-fold fewer false-positive fusions in nontransformed has turned to consensus-based approaches, where multiple algo- human tissues. We leverage the increased precision of DEEPEST rithms are run in parallel (10), and a metacaller allows “voting” to discover fundamental cancer biology. Namely, 888 candi- to produce the final list of fusions. This is also unsatisfactory, as date oncogenes are identified based on overrepresentation in it introduces FNs. DEEPEST calls, and 1,078 previously unreported fusions involv- Both shortcomings in the ascertainment of fusions by existing ing long intergenic noncoding RNAs, demonstrating a previously algorithms and using recurrence alone to assess fusions’ function unappreciated prevalence and potential for function. DEEPEST also reveals a high enrichment for fusions involving oncogenes Significance in cancers, including ovarian cancer, which has had minimal treat- ment advances in recent decades, finding that more than 50% of Gene fusions are tumor-specific genomic aberrations and are tumors harbor gene fusions predicted to be oncogenic. Specific among the most powerful biomarkers and drug targets in trans- domains are enriched in DEEPEST calls, indicating a global lational cancer biology. The advent of RNA-sequencing tech- selection for fusion functionality: kinase domains are nearly 2-fold nologies over the last decade has provided a unique opportu- more enriched in DEEPEST calls than expected by chance, as are nity for detecting novel fusions via deploying computational domains involved in (anaerobic) metabolism and DNA binding. algorithms on public sequencing databases. However, pre- The statistical algorithms, population-level analytic framework, cise fusion detection algorithms are still out of reach. We and the biological conclusions of DEEPEST call for increased atten- develop Data-Enriched Efficient PrEcise STatistical fusion detec- tion to gene fusions as drivers of cancer and for future research tion (DEEPEST), a highly specific and efficient statistical pipeline into using fusions for targeted therapy. specially designed for mining massive sequencing databases gene fusion | cancer genomics | bioinformatics | and apply it to all 33 tumor types and 10,500 samples in The pan-cancer analysis | TCGA Cancer Genome Atlas database. We systematically profile the landscape of detected fusions via classic statistical models and ene fusions are known to drive some cancers and can be identify several signatures of selection for fusions in tumors. Ghighly specific and personalized therapeutic targets; some Author contributions: R.D., D.E.F., and J.S. designed research; R.D., D.E.F., and J.S. of the most famous fusions are the BCR–ABL1 fusion in chronic performed research; R.D., D.E.F., M.J., G.L.H., A.D., E.L., and J.S. contributed new myelogenous leukemia (CML), the EML4–ALK fusion in non- reagents/analytic tools; R.D., D.E.F., and J.S. analyzed data; and R.D. and J.S. wrote the small lung cell carcinoma, TMPRSS2–ERG in prostate cancer, paper.y and FGFR3–TACC3 in a variety of cancers including glioblas- The authors declare no conflict of interest.y toma multiforme (1–4). Since fusions are generally absent in This article is a PNAS Direct Submission. A.T. is a guest editor invited by the Editorial healthy tissues, they are among the most clinically relevant events Board.y in cancer to direct targeted therapy and to be used as effective Published under the PNAS license.y diagnostic tools in early detection strategies using RNA or pro- Data deposition: DEEPEST workflow, with all needed softwares preinstalled, have been teins; moreover, as they are truly specific to cancer, they have deposited in GitHub, https://github.com/salzmanlab/DEEPEST-Fusion. Also, a publicly- promising potential as neo-antigens (5–7). available online tool with web interface is available for the DEEPEST algorithm on the Cancer Genomics Cloud platform, https://cgc.sbgenomics.com/public/apps#jordanski. Because of this, clinicians and large sequencing consortia have milos/deepest-fusion/deepest-fusion/. All custom scripts used to generate the figures made major efforts to identify fusions expressed in tumors via have been deposited in GitHub, https://github.com/salzmanlab/DEEPEST-Fusion/tree/ screening massive cancer sequencing datasets (8–12). However, master/custom scripts.y these attempts are limited by critical roadblocks: current algo- 1 To whom correspondence may be addressed. Email: [email protected] rithms suffer from high false-positive (FP) rates and unknown This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10. false-negative (FN) rates. Thus, ad hoc choices have been made 1073/pnas.1900391116/-/DCSupplemental.y in calling and analyzing fusions including taking the consensus Published online July 15, 2019.

15524–15533 | PNAS | July 30, 2019 | vol. 116 | no. 31 www.pnas.org/cgi/doi/10.1073/pnas.1900391116 Downloaded by guest on September 26, 2021 Downloaded by guest on September 26, 2021 egansr tal. et Dehghannasiri ttsia nlsst h nieTG aaae eel sig- a reveals database, TCGA entire the important with to have conjunction analysis in results statistical applied Its DEEPEST, dockerized implications: Methods). a biological in and run easily (Materials be container can repro- and is available, algorithm publicly The ducible, datasets. RNA-Seq massive tumors in detection cancer pathway. kinase ovarian a in 36% the and gene, involving COSMIC fusions of known have screened a we involving half fusions tumors bulk gene ovarian fusions, express the from detectable of 94.6% data have that (RNA-Seq) find and apply RNA-sequencing tumors We pri- algorithms. to available or been currently have DEEPEST rare in fusions biases the of to and is due cancers, development missed ovarian hypothesis the in fusions testable permit driver vate a mutations TP53 instability, that genome TP53 Because create lacked pathways. mutations recombination now homologous and until mutations in TP53 defects has universal nearly beyond tumors drivers explanatory cystadenocarcinoma candidate serous identifying ian for basis fusions. druggable a and fusions. driver provide and gene oncogenes These of tests selection fusions. statistical or for in signal The domains exclusively significant a protein present reveal domains for analyses protein enrichment of and Can- pairs kinase In (22), Mutations as Somatic (COSMIC) such Of Catalog cer families in fusions gene curated gene those of or recurrent genes enrichment rarely genes, by of partner expression rates and fusion expected of formulated the selection have calculating nonneutral We for TCGA. tests as statistical such tumor massive databases, rare across or selection transcriptome of private signature reported a rare exhibit whether fusions of algorithms gene very published analysis or in statistical rates a private FP prevent high functional cancer, the potential However, a considered (21). be drivers drives to beginning fusion are fusions a available gene evidence that only and support the initiation, been to tumor historically during has event recurrence selective this a of hallmark a ered that biology studies. previous cancer by detected in been not implications had important fusions discovers potentially DEEPEST with fusions, (TCGA). known Atlas Genome of Cancer recovery The Beyond of 9,946 types across tumor samples, including 33 adjacent) entire (tumor datasets, the normal 575 RNA-sequencing and 10,521 samples in tumor GRCh38) of on fusions cohort (based expressed boundaries a for exon screen annotated unbiased at an occurring conduct to us allowed global a in FPs manner. fusion identify unbiased of can test which statistical populations, its across is prevalence DEEPEST algorithm in innovation Engine) key A Tracking (20). Chimera MACHETE Alignment developed (Mismatched previously and our computational second- over advance significant a algorithmic with is algorithm (DEEPEST), fusion mas- detection generation in fusion Efficient Data-Enriched boundaries STatistical algorithm, exon PrEcise The preci- at we datasets. in detection genomics paper, advance fusion sive this significant unbiased In has for that unanswered. sion algorithm still an is provide cancer?” first of drivers ated fusions gene undiscovered many be frequently cancer. the still drive on most that may (19); the carcinogenic used there most field but hand, the other the be fusions, not in of may of standards fusions functionality Recurrence expressed only means the (18). the this disease assess of whether the to one or in currently rates drivers is FN not fusions high are to fusions due that is pancre- question this 400 the than whether raising fusions, more of gene of recurrent study no found recent cancers a As atic biology. examples, cancer many new discover of to one fusions of use the limited have nsmay EPS sa dac nacrc o fusion for accuracy in advance an is DEEPEST summary, In ovar- of fraction large a findings, our of one illustrate To consid- been has fusions gene of recurrence frequent While DEEPEST of implementation efficient and precision The underappreci- fusions gene “Are question critical the Thus, uinb B e eaieb AHT,wihrequires which MACHETE, by for negative positive yet SBT be looking by could by fusion merely samples searching a sequences, by Since fusion-junctional sensitivity SBT). perfect at as has (such preva- approach SBT estimated based the the string-query with a using consistent lence statistically MACHETE than is running checks KNIFE) by specificity found step (or better fusions this of have Intuitively, prevalence the (20). to whether algorithm shown published other already MACHETE, any beyond further been fusions step has This of which SBT. identification using thou- FP of queries the tens rapid here decreases by set, samples, test the of arbitrarily in sands an samples with added along of set number discovery large they the test: in statistical tested secondary efficiently a are to nomina- algorithms subjected junction are discovery the component by running tion nominated by Fusions datasets. on massive testing, brought on multiple challenge dis- from fusion major resulting any FPs to a identify applied be to principle algorithm in covery can and modular can is and step that datasets specific method whether genomic a query of rapidly (25), composition (SBT) sequence tree the bloom indexes sequence level the sequence via orthogonal queries on based approaches statistical this orous For (24). 3.0 fusions. ChimerDB known in curated of clinicians fusions set used where a we makes purpose, settings of which identification clinical fusions precise MACHETE, to desire cancer of portable standard step easily improve- gold nomination DEEPEST further junction including the is by in MACHETE sensitivity Another of over (CWL). ment DEEPEST Language in Workflow Common advance repro- the efficient using contain- dockerized an ers, on based into workflow available pipeline which publicly ducible engineering, MACHETE computational the extensive restructures on relies step in This Appendix abundance (SI gene fusions and FP composition generating sequence the junction models that of profiles effect read-alignment using of 1). by model database statistical (Fig. null initial a the /strands from MB different nominated detect 1 are on fusions to than Putative or and farther (20) other being MB, algorithm each exons 1 from MACHETE partner of with the on distance junctions are on chimeric the exons based within whose method and genes, a distinct as 2 same defined between (23), the event junctions splicing chimeric detect a desired. to if method data RNA-Seq explorer) of discovery fraction the a as be samples could all set used this have but we set, paper, “the this samples, statis- In analyzed 2) set.” all and on test junctions set”; nominated discovery of “the testing all called tical of subset analyzed, a be on to run samples is which component nomination of junction 1) distribution uniform a hypothesis. of null the assumption under which the procedures, on (FDR)-controlling rely rate false-discovery a testing mon discovery, hypothesis fusion multiple from for to via arising analogous datasets FPs conceptually RNA-Seq DEEPEST problem controlling massive samples. as more of such or analysis innovations million, 1 key 100,000, includes applied to have gene applied we of be Here, prevalence datasets. of to the DEEPEST in numbers estimate massive Discovery and in Fusion fusions discover Gene to for DEEPEST, Algorithm Databases. Statistical Public Massive a Is DEEPEST basic from Results implications with fusions of clinic. gene the existence private to biology the or selection including with rare cancer, for consistent human of expression drivers under-appreciated fusion of nature nSe ,tesaitclrfieetse,DEETue rig- uses DEEPEST step, refinement statistical the 2, Step In isoform novel and (known KNIFE running includes 1 Step steps: computational main 2 contains pipeline DEEPEST The P aus hc antb ovdb ietapiaino com- of application direct by solved be cannot which values, 000dtst,bti rnil,DEETcan DEEPEST principle, in but datasets, ∼10,000 PNAS | eegnee ttsia algorithm, statistical a engineered We uy3,2019 30, July mr peri h ops This corpus. the in appear k-mers and | aeil n Methods). and Materials o.116 vol. | o 31 no. P | values 15525

BIOPHYSICS AND COMPUTATIONAL BIOLOGY A

B

Fig. 1. Origin and identification of FP from running DEEPEST on thousands of samples. (A) DEEPEST uses all reads, including those censored by other algorithms, to generate an empirical P value for each candidate fusion. SBTs, together with further statistical modeling, are used to identify FP arising from testing on multiple samples, some of which are reported by other algorithms (SI Appendix, Fig. S4A). The first black arrow shows the motivation for designing the SBT step. (B) cDNA or mapping artifacts result in the inclusion of exon–exon junctions from all combinations of exons within a fixed genomic radius of X1 with all exons in the radius of Y3. Some such exon junctions will include degenerate sequences that cannot be mapped uniquely, and thus DEEPEST blinds itself to detection of fusions containing such highly degenerate sequences (for example, due to Alu exonization) or with polyA stretches at the 50 end.

discordant reads to nominate fusions (20). For a fusion to be Because simulations can only model errors with known called by DEEPEST, it should have an SBT detection frequency sources, it is common for algorithms to perform differently that is statistically consistent with the estimated prevalence by on real and simulated data; for example, simulated data do the junction nomination component and additionally pass statis- not model reverse transcriptase template switching or chimeras tical filtering such as a test for repetitive sequences near exon arising from ligation or PCR artifacts. Thus, in addition to eval- boundaries (Fig. 1 and SI Appendix). uating the performance of DEEPEST on simulated data, we DEEPEST does not require human guidance, is fully auto- performed a thorough computational study of DEEPEST per- mated, and can be applied to any paired-end RNA-Seq database formance on real data. To evaluate the FP rate of DEEPEST by leveraging the massive computational power of cloud plat- on real data, we applied it to several hundred normal datasets, forms. A web-based user-friendly version of the pipeline has including Genotype-Tissue Expression (GTEx) (30) and TCGA been implemented on the Seven Bridges Cancer Genomics normal samples. Notably, DEEPEST calls 80% fewer fusions Cloud (CGC) (26), which allows a user to run the workflow either in GTEx samples than does STAR-Fusion (28) (SI Appendix, by uploading RNA-Seq data or using RNA databases already Fig. S2A), an algorithm used in a recent pan-cancer TCGA available on CGC. (Currently, the average cost of running the analysis (10). In addition, DEEPEST reports fewer fusions (509 workflow for a single TCGA sample on the cloud is roughly $3.) fusions) on TCGA normal samples compared with the 3,128 Moreover, most parts are portable as they are dockerized and calls in the same samples by TumorFusions (8) (SI Appendix, can be easily exported to many platforms using a description Fig. S2B), which is a TCGA fusion list based on PRADA given by the CWL (27). (29). This provides evidence that, unlike other algorithms, DEEPEST retains the specificity seen in simulations in real DEEPEST Improves Sensitivity and Specificity of Fusion Detection. tissue samples. We first evaluated DEEPEST FP and FN rates on fusion positive We ran DEEPEST on the entire TCGA corpus: 9,946 benchmarking datasets used by third parties to assess the perfor- tumor samples across all 33 tumor types. DEEPEST detects mance of 14 state-of-the-art algorithms (28). On each dataset, 31,007 fusions across TCGA. Consistent with what is known DEEPEST has 100% positive-predictive value (PPV), the ratio about tumor type-specific gene fusion expression, DEEPEST of the number of true positive calls to the total number of calls, reports the highest abundance of fusions in sarcoma (SARC), higher than all 14 other state-of-the-art algorithms (SI Appendix, uterine corpus endometrial carcinoma (UCEC), and esophageal Fig. S1), DEEPEST has a comparable, although numerically carcinoma (ESCA) tumor types and the fewest number of higher, PPV than PRADA (Pipeline for RNA-Sequencing Data detected fusions in thyroid carcinoma (THCA), testicular Analysis) (29), which is the next best algorithm. For this analy- germ cell tumors (TGCT), and uveal melanoma (UVM). sis, we only applied the first component of DEEPEST, which is We provide the description of tumor types in SI Appendix, based on MACHETE, as the SBT refinement step utilizes the Fig. S5. statistical power across a large cohort of samples, which is not While calling significantly fewer fusions in normal samples, the case for simulated datasets. DEEPEST identifies significantly more fusions in TCGA tumor

15526 | www.pnas.org/cgi/doi/10.1073/pnas.1900391116 Dehghannasiri et al. Downloaded by guest on September 26, 2021 Downloaded by guest on September 26, 2021 n ihn1M,2%aeo h aecrmsm n strand and chromosome strand same the and on are chromosome 22% same MB, 1 the within part- from and both transcribed have genes fusions of are of ner function apart, 24% a pro- Around as MB and characteristics. fusions 1 chromosomes extreme DEEPEST-called of a different than distribution on be farther the are file to are or fusion” strands, that opposite “extreme exons on an joins define that the circular we transcrip- fusion on into or (31), example, variation transcribed (circRNA) for DNA and RNA splicing, local posttranscriptional genome to or due reference tional be the could strand in same MB 1 than is data. high real more DEEPEST identifies on and that fusions data confidence real implies (SI and simulated 10 this on ref. specific Together, more and S2D). TumorFusions by ovarian com- Fig. only Appendix, SARC) (ESCA, found and fusions instability can- with adenocarcinoma, genomic in better pared stomach high enriched a [OV], are have is carcinoma fusions to DEEPEST DEEPEST-only known by data. cers used real for modeling fit the that suggesting egansr tal. et Dehghannasiri (C type. tumor each 2. Fig. S2 Fig. Appendix, (SI fewer datasets substantially normal and real TumorFusions) 10 in ref. by calls in fusions fusions 23,624 19,846 with and compared fusions, on more much (29,820 detects rates) fusions DEEPEST considered, cancer FP are studies real 3 higher S2 in between Fig. of sensitive , Appendix more (SI cost is datasets more the DEEPEST is (at datasets, that simulated might sensitivity STAR-Fusion algorithms on fusion better some based exhibit While is sam- data. simulated same latter in the the sensitive of 10), surveys recent (8, most ples 2 with compared samples C B A eas uin ewe xn htaecoe oec other each to closer are that exons between fusions Because h ubro eurn uin for fusions recurrent of number The (B) fusions. detected the in exons partner the of position relative The (A) fusions. detected of landscape The uin ihtems ies uo types. tumor diverse most the with Fusions (D) type. tumor each for fusion recurrent most The ) C and .We ape shared samples When D). A D and B), .W further We S2C). Fig. (https://fusionhub.persistent.co.in/ through them Appendix , portal queried (SI FusionHub by and 10) fusions only DEEPEST-only ref. and found investigated TCGA TumorFusions in are in on fusions fusions fusions (4,402 studies 5,860 fewer algorithms other fusion Far the of previous S2 C). one Fig. by Appendix, detected (SI been not had (BLCA) carcinoma urothelial 2D). samples), (Fig. bladder tumor in of acute DHRS2–GSTM4 8% in samples, and PML–RARA (14 samples), (LAML) leukemia tumor myeloid of (PRAD) well- 36.3% adenocarcinoma the samples, of have prostate (182 exceptions fusions in with most TMPRSS2–ERG fusions type, gene known tumor recurrent single of exam- levels a low types) for to cancer Restricted cancers, (10 2C). diverse (Fig. FGFR3–TACC3 in and detected 2B). MRPS16–CFAP70 (Fig. are type ple, tumor fusions a within gene tumors 2 Many least at in with called fusions), crosses strand are 23% interchromosomal 2A). are MB, (Fig. 31% and fusions 1 strands, opposite least on being at genes by separated but nw ohv rvn noei ciiisi h prostate the in activities oncogenic driving (32). have cancer (LncRNA) to RNA noncoding S1 known list long (Dataset a in this SCHLAP1, fusions involving present in PRAD recurrent Included not unreported S1). were and previously that (Dataset pairs) 157 found database gene and are fusion database (i.e., other fusion fusions any other distinct any in 9,272 present are they rud4%o EPS’ 107fsos(216fusions) (12,196 fusions 31,007 DEEPEST’s of 41% Around recurrent distinct (512 fusions recurrent 1,486 finds DEEPEST ,icuigarcretfso for fusion recurrent a including S3), Fig. Appendix, SI PNAS | uy3,2019 30, July | o.116 vol. | o 31 no. oseif see to ) | 15527

BIOPHYSICS AND COMPUTATIONAL BIOLOGY Statistical Analysis of Fusion Prevalence in Large Data Queries correlation: 0.497, 0.38, and 0.31, respectively; Spearman’s cor- Improves Precision of Calls. The SBT refinement step is a critical relation: 0.637, 0.596, and 0.54, respectively; Fig. 3A). We innovation of the DEEPEST pipeline and substantially improved found that there is a significant correlation between fusion its specificity by removing fusion calls likely to be FPs. Such abundance and TP53 mutation frequency as 2 orthogonal mea- fusions passed the first component of DEEPEST due to mul- sures of genomic instability. DEEPEST calls more fusions in tiple hypothesis testing but were flagged by the statistical SBT tumors with high TP53 mutation rates in less cytogenetically refinement step (SI Appendix, Fig. S4A). complex tumors, while retaining tight control of FPs in other The main power of the refinement step is it is agnostic to samples. gene name or ontology (such as pseudogenes, paralogs, synony- mous genes, and duplicated genes) used by conventional filters, DEEPEST Identifies a Positive Selection on Fusions Containing Known and which can lead to true positives being removed from lists Oncogenes. Gene fusions could arise from random reassort- (e.g., RUNX1–RUNX1T1). Furthermore, most fusions (89%) ment of DNA sequences due to deficiencies in the structural removed during the SBT refinement step would have passed such integrity of DNA in tumors with no functional impact, or filters (SI Appendix, Fig. S4B). they can be driving events, such as for BCR–ABL1 in chronic To further evaluate the performance of the SBT-based refine- myelogenous leukemia (1). In our first global computational ment component, we extracted 2 groups of fusions from the tests of whether fusions are passengers or drivers, we tested fusions called by the junction nomination component: 1) likely whether DEEPEST-called fusions are enriched in genes known FPs (LFPs) that are fusions with SBT hits in the GTEx sam- to play oncogenic roles. Since functional gene ontologies are ples and 2) likely true positives (LTPs) that contain fusions not used by DEEPEST, this analysis provides an independent shared between the DEEPEST first component, ref. 10, and test of whether fusions are enriched for genes in known can- TumorFusions. Note that LFP fusions are defined on the basis of cer pathways. For each tumor type, we tested for enrichment GTEx data, whereas the SBT refinement step does not “touch” of the 719 genes present in the COSMIC Cancer Gene Census GTEx data, meaning that the LFP fusion set can be used as a database (22). test set for the specificity of the SBT refinement step. A fusion Under the null hypothesis of random pairing of genes in is removed by the SBT refinement step if its 2-sided binomial fusions, we compared the observed to expected fraction of sam- P value is less than 0.05, an arbitrary and conventional statis- ples containing at least one fusion with a COSMIC gene by tical threshold. To evaluate the precision of the SBT step, we conditioning on the total number of fusions detected for a tumor treated each 2-sided P value as a continuous measure and strat- type, where tumor types with more detected fusions are expected ified the P values for LFP and LTP sets (SI Appendix, Fig. S4C). The P values of LFP fusions (with GTEx SBT hits) are signifi- cantly smaller than those of LTP fusions (Mann–Whitney U test, P < 2.2e−16), implying that the majority of them are filtered out A by the SBT statistical refinement step. On the other hand, LTPs Pearson’s r = 0.4967, p = 0.003 possess higher 2-sided P values, which indicates that they would 8 SARC Spearman’s rho = 0.6370, p < 10-4 pass the refinement step. In summary, while the statistical refine- ment step improves DEEPEST calls, the approach outlined is ESCA 6 UCS OV also a general methodology for increasing the precision of RNA STAD variant calls on massive datasets. PRAD BRCA 4 UCEC ACC GBM BLCA DEEPEST Identifies LncRNAs Are Prevalent in Fusions. A major cate- CHOL SKCM LUAD LUSC 2 CESC HNSC gory of fusions identified by DEEPEST are fusions that involve KIRP MESO LIHC PCPG KIRC COAD DLBC LGG READ LncRNAs, which have been overlooked by other methods due to THYM LAML PAAD Average number fusions of Average 0UVM TGCT KICH bioinformatic or heuristic filters. Fusions involving well-studied THCA (and thus named) LncRNAs have been appreciated to have 020406080 oncogenic potential (21, 33, 34). Due to their functions in reg- TP53 mutation frequency (%) ulating cellular homeostasis, fusions that involve LncRNA have the potential to contribute or drive oncogenic phenotypes (34, B 35). Genome-wide pan-cancer analyses have previously not pro- filed those with the LINC annotation. We found that fusions involving LncRNAs (as annotated by the Ensembl database 0.7 release 89) are abundant in tumors (10% of fusion calls involve 0.6 Detected LncRNAs and 20% of tumors are found to have at least one 0.5 Expected fusion involving a LncRNA) (Dataset S4). A large fraction 0.4 of LncRNA fusions (∼ 30% or 994 fusions) involves LINC RNAs, which have been overlooked by previous methods due to 0.3 their biases and heuristic filters for discarding “uncharacterized” 0.2 genes. 0.1 0.0 Prevalence of Fusions Found by DEEPEST Is Correlated with Genome OV LGG ACC UCS UVM LIHC GBM KIRP KIRC KICH Fraction of samples w/ COSMIC fusions samples w/ of Fraction BLCA STAD LUSC LUAD DLBC LAML TGCT ESCA THCA PAAD CESC

Instability. BRCA SARC CHOL

As another computational test of whether DEEPEST UCEC READ PRAD PCPG HNSC THYM SKCM COAD MESO maintains high precision in a variety of solid tumors that have more complex cytogenetics than LAML, we tested whether Fig. 3. Association of fusions with genomic instability. (A) DEEPEST fusions the abundance of detected fusions per cancer is correlated are significantly correlated with the TP53 mutation frequency. (B) Detected fusions are highly enriched in genes cataloged by COSMIC. For all tumor with the mutation rate of TP53, which is an orthogonal mea- types, except for uterine carcinosarcoma (UCS), UCEC, pheochromocy- sure of a tumor’s genome instability (36). The average number toma and paraganglioma (PCPG), UVM, TGCT, kidney chromophobe (KICH), of fusions per sample identified by DEEPEST has a higher thymoma (THYM), and adrenocortical carcinoma (ACC), the observed frac- (and significant) correlation with TP53 mutation rate across tion significantly exceeds the expected fraction based on the null hypothesis tumor types compared with ref. 10 and TumorFusions (Pearson of random pairing (Bonferroni-corrected FDR < 0.05).

15528 | www.pnas.org/cgi/doi/10.1073/pnas.1900391116 Dehghannasiri et al. Downloaded by guest on September 26, 2021 Downloaded by guest on September 26, 2021 en(M1 rsn n8tmrtps stems prevalent most the is types, fusion tumor pro- 8 a vacuolar in a RPS6KB1–VMP1, and present (41) (VMP1) tumors; kinase tein protein diverse ribosomal the in between for selected 40). 39, TMPRSS2–ERG, (4, DHRS2–GSTM4 recov- including and FGFR3–TACC3, fusions PML–RARA, analysis recurrent this known testing, several hypothesis. ers hypothesis null other the multiple of under thousands for observed and Controlling be this to making observe unlikely 4), to highly expect (Fig. fusions would fusions we such and 5 times, only 2 only observed be would least detected at for of called number be to the expected have given are c them to words, of expected other many how are In fusions, boxes balls? many more how or pairs), gene possible into statis- thrown in in are of problem out familiar probability worked a if the to tics: was mapped fusions which We gene recurrent Methods). of observing and expres- theory (Materials fusion the model 38 where chance, a ref. distribution by fits null arises a fusions sion by recurrent selection rare neutral of of prevalence the whether as one cohort. even TCGA survey, the population-level as a fusion large in rare expected fusions, in is participating expression high could when genes observing oncogenes human as of without fraction function even moderate a tumors If recurrence. in of levels for in selected bases strongly protein-coding of be (∼ muta- number genome point the transcribed possible by the of bounded is number order an that ana- the than tions than we more greater fusions, samples gene magnitude potential the of million (in 625 to genome up are the in with genes quadratically lyzed, scales of fusions mutations. number gene point sam- the potential of (the space (37). of fusions sample number genomes the gene The exceeds cancer possible greatly in of space) number mutations ple out total point grew the argument on However, This focused role. driving work a of plays fusion Than a More that in dence Selection Tumors. a TCGA Shows of Fusions 11% Rare of Analysis Statistical role. tumor driving currently a are various play fusions in to where considered OV, fusions not as gene such cancers on for including evidence pressure types, strong enriched selection is statistically positive this Together a are rate. genes background the COSMIC above studied, we FDR 3B samples than (Bonferroni-corrected less (Fig. and 90.7% types SARC tumor than of for other fraction 45% for null is 40% the fusions chance, COSMIC by a with expected 50%, samples than exceeds greater partner COSMIC much a rate containing fusions with ples in genes COSMIC for enriched be no types. will tumor tumor is fusions other there Most that thus bias and genes. priori fusions, COSMIC a gene recurrent as prevalent therefore lack annotated types and are studied, intensively partners includ- been fusions, their have where disease drivers, a known is ing LAML and fusions, enriched highly kinase are for tumors are THCA (which genes), factors COSMIC transcription as cataloged of family ETS the involve PRAD egansr tal. et Dehghannasiri fraction; expected vs fraction: change is (4.9-fold expected genes vs COSMIC change (3-fold for PRAD (Materi- enrichment in fusions largest COSMIC The with Methods). samples and of als ratio higher a have to S3 fraction; expected vs change fold samples? hsaayi eel vdneta eurn uin are fusions recurrent that evidence reveals analysis This selection neutral under expected fusions prevalent most The for test statistical a formalized we effect, this for account To sam- of fraction the OV, and UCS, ESCA, SARC, PRAD, In .Ti sepce eas h otfeun eefsosin fusions gene frequent most the because expected is This ). k ∼ al crepnigt h ubro bevdfusions) observed of number the to (corresponding balls 2 000 22, n ee eeepesd.Ti en htthere that means This expressed). were genes oe crepnigt h oa ubrof number total the to (corresponding boxes uinrcrec scniee ob evi- be to considered is recurrence Fusion 30 × P 10 < 6 .Teeoe uin could fusions Therefore, ). P 1e and < −6 1e < Fg 3B (Fig. ) aae S3 Dataset −6 .5 ftetumor the of 0.05) ,adLM (5.6- LAML and ), P < 1e and −6 .I more In ). ,THCA ), Dataset c infiatnmeso 5 of numbers significant 3 one of fusions) coincidence with 31,007 to the (i.e., box map one pairs We in fusion samples. of all number across detected total the possible boxes” to all in spond to “balls correspond the boxes used 3 where we above, cohort, distribution TCGA null the in DEEPEST ensont c socgns u nig alfrfurther for call have findings (51) our BCAR4 oncogenes, as and act (50), to LINC00511 shown (49), been PVT1 as such most the have AC020637.1 5 and AP005135.1 the function: unknown have LINC00511 as unknown 3 and RNAs most of AC025165.3, noncoding RNAs AC134511.1, 31 noncoding found function: PVT1, suppressors also genes. tumor We fused well-known (48). significantly tar- and RAD51B a (47), as 5 as therapy such emerging distinct cancer kinase of in cyclin-dependent numbers get a highest CDK12, the include with 5C). part- genes (Fig. (39 partners) Other (61 development 3 C1QTNF3–AMACR gene immune promiscuous the innate and ners), regulating most gene UVRAG The a and CPM, 5C). (45) TP53 (Fig. regulates (46) negatively TCGA which of such MDM2, suppressors 0.5% as tumor in and PVT1, oncogenes, include or known partners (HER2), tumors recurrent ERBB2 (44); significant 52 signaling highly in receptor Other detected cases. FGF are in fusions 5 critical FRS2 is significant expected that be most protein would docking The genes chance. such fusions in no by found expected when are be respectively, genes would partners, 48 5 partners fused and significantly 6 190 as S2); than (Dataset more chance with by genes 110 only 5 3 recurrent 864 reports DEEPEST 5 (Fig. whether test more To much chance. be by could expected be partners 3 Tumors. under would yet fusions TCGA than private, partners, of prevalent be fusion 30% oncogenic could Than as selection More serve in could Fused genes Are Tis- and Nonneoplastic from sue Tumors Distinguish Genes Fused Recurrently and 4 (Fig. fusions recurrent have (P (1,181) chance tumors by higher expected at observed than that are DEEPEST rates 14% studies by Globally, found 2C). (1,486) previous (Fig. fusions 43) the by (42, of role driving findings a entire have supports fusions the these and across cohort TMPRSS2–ERG, after TCGA fusion, gene detected (Benjamini–Yekutieli 0.01). null level the at in control expected FDR CI 99% upper than the more and expectation occur that fusions rent 4. Fig. 0 0 0 0 h ubro infiatyfsd5 fused significantly of number The

.Wiesm fteennoigRNAs noncoding these of some While ). S2 (Dataset partners Number of fusions r5 or rsetvl 5 (respectively ates ohhaving both partners, 0.1 10 10 10 10 0 1 4 2 3 0 0 ttsia nlsso eurn uin.Osre ubro recur- of number Observed fusions. recurrent of analysis Statistical A ates n CR,PT,adnnoigRA of RNAs noncoding and PVT1, BCAR4, and partners; ate ee r vrersne nfsosfudby found fusions in overrepresented are genes partner 050 and 0 rs.5 (resp. B). c itnt5 distinct 0 ates(xrse ee)adblscorre- balls and genes) (expressed partners ) PNAS 0 Number of occurrences ate n algnswt statistically with genes call and partner ) 0 n 3 and 0 | 0 P n 3 and rs.3 (resp. uy3,2019 30, July ausof values 0 x ate ee ihmr hn12 than more with genes partner ie ssgicnl ihrta the than higher significantly is times 0 < ates“infiatyfused” “significantly partners 0 ate ee en paired being genes partner ) 0 0 0 150 100 atesad38recurrent 378 and partners 1e ate eei R2 a FRS2, is gene partner 0 | −6 <1e n 3 and o.116 vol. ;mr hn1.%of 11.9% than more ); 0 −5 Expected CI 99% Upper fusions Recurrent ate ee are genes partner 0 atesi large: is partners ,when 5B), (Fig. | aae S1). Dataset o 31 no. 0 partners fmany If | c 15529 balls 182

BIOPHYSICS AND COMPUTATIONAL BIOLOGY 80

AB C 5' partners

Significantly

⎩ 104 5’ significantly fused genes 3' partners

fused gene A 3’ significantly fused genes 60 3 ⎨ 10 Upper 99% CI

102 Expected 40 ⎧ ⎧ 10 k distinct 1 20 genes ⎨ 0.1

Λορεμ ιπσυμ Number partners of

Gene A fused with k genes of Number ⎩ distinct 3’ partners 0 0 TG

fusions 2102030 40 50 60 65 NF1 CPM MSI2 PVT1 FRS2 NAV3 VMP1 MDM2 CDK12 CPSF6 BCAS3 ERBB2 PPFIA1 Number of partners CNOT2 UVRAG FBXL20 SHANK2 TMPRSS2 D ephrin receptor signaling pathway E histone lysine methylation adherens junction organization C1QTNF3−AMACR TCGA tumor (extreme) regulation of dendritic spine development 40 TCGA tumor (non-extreme) peptidyl−threonine phosphorylation 30 GTEx (extreme) ERBB signaling pathway GTEx (non-extreme) peptidyl−threonine modification 20 intracellular steroid hormone receptor signaling pathway 10 actin cytoskeleton reorganization samples % of 0 androgen receptor signaling pathway 7 10 15 20 25 30 012345 Fold enrichement Significantly fused gene

Fig. 5. (A) Significantly fused genes: those paired with multiple partners (counted by counting distinct fusion pairs). (B) Significantly fused genes are observed at rates higher than expected by chance; genes with k > 10 partners would be unlikely to occur by random assortment of genes into fusions (FDR- corrected P < 0.01). (C) Genes with the highest number of distinct gene partners. (D) GO enrichment analysis of significantly fused genes shows enrichment in pathways known to regulate cancer growth. (E) Tumors are highly enriched for extreme fusions involving significantly fused genes.

investigation into the potential driving roles of other significantly (12.1% of samples). Notably, the 2 fusions detected in GTEx are fused noncoding RNAs. PVT1–MYC and FRS2–CPSF6, both fusions are “nonextreme,” While the list of significantly fused genes includes well-known splicing detected between 2 genes transcribed in the same orien- cancer genes as described above, only 15.5% of them (193 genes) tation with promoters < 200 kB from each other, events which are currently annotated as COSMIC genes, i.e., 888 significantly could arise from somatic or germline variation or transcriptional fused genes are previously unreported candidate oncogenes, readthrough. calling for further functional investigation of these genes. This Fusions involving significantly fused genes in TCGA and is an enrichment for COSMIC annotation but is not exhaustive: GTEx samples have distinguishing structural features. The large significantly fused genes are a large class of potential oncogenes majority of fusions in tumors arise from extreme rearrangements and tumor suppressors that function through gene rearrange- (SI Appendix, Fig. S6) regardless of the number of partners a ment rather than gain or loss of function through point mutation. significantly fused gene contains. More than 90% of fusions in To functionally annotate significantly fused partner genes, we TCGA that involve significantly fused genes with at least 23 part- carried out (GO) enrichment analysis and found ners are extreme, whereas no such GTEx fusions are extreme (SI the highest enrichment (Binomial test, Bonferroni-corrected Appendix, Fig. S6). This again implies a tumor-specific selection FDR < 0.05) in cancer pathways such as androgen signaling for extreme fusions, which increase the complexity of partners pathway, ERBB signaling pathway, and ephrin receptor signaling available to significantly fused genes. Together, analysis of rare pathway (Fig. 5D and Dataset S2). recurrent gene fusions and recurrent 30 and 50 partners identify To further support the role of significantly fused genes in can- hundreds of candidate oncogenes, which constitute a significant cer, we evaluated the rate that such genes are detected in TCGA fraction of gene fusions. tumor and GTEx samples as a function of 1) the number of DEEPEST finds higher enrichment of significantly fused genes partners and 2) the nature of the rearrangement underlying the in more tumors compared with other TCGA fusions lists (SI fusion: “extreme” events that bring together 2 exons that are Appendix, Fig. S7), calling significantly fused genes with >10 farther than 1 MB apart, on opposite strands, or on different partners in ∼ 50%; and with >20 partners in ∼ 70% more sam- chromosomes in the reference, and all other events are “nonex- ples compared with recent studies (>10 partners: DEEPEST: treme” (Fig. 5E). Nonextreme events could arise through small 3,705; ref. 10: 2,570; TumorFusions: 2,787 samples; >20 part- scale genomic duplication or transcriptional readthrough cou- ners: DEEPEST: 1,479; ref. 10: 823; TumorFusions: 958 sam- pled to “back-splicing” to generate circRNA (20, 23). Globally, ples). While 7.6% of DEEPEST fusions have a gene with >20 DEEPEST detects fusions including significantly fused genes partners, only 4.8% and 5.6% of fusions in ref. 10 and Tumor- (>10 partners) at a much higher rate in TCGA tumors (7,050 Fusions, respectively, have such genes. FRS2 is found to have fusions in ∼ 34% of samples) than in GTEx controls (29 such the highest number of partners in all 3 lists; however, DEEPEST fusions in ∼ 9% of samples), despite GTEx samples being identifies 65 30 partners, which is larger than other 2 lists: 41 and sequenced at an average depth of 50 million reads, roughly 52 partners by ref. 10 and TumorFusions, respectively. similar depth to tumor samples. The deviation between the fraction of such fusions in TCGA Lung Adenocarcinoma and Serous OV Have High Statistical Enrich- versus GTEx increases with the number of partners of the sig- ment for Kinase Fusions. The most common genetic lesions in OV nificantly fused gene such that among those with at least 23 and lung adenocarcinoma (LUAD) is TP53 mutation, present partners, only 2 fusions are detected in GTEx (0.7% of samples), in 85.8% of OV and 52.12% of LUAD cases (cBioPortal; while 1,845 such fusions are detected in 1,202 TCGA tumors retrieved 2018 Nov 19) (52), although there is a debate in the

15530 | www.pnas.org/cgi/doi/10.1073/pnas.1900391116 Dehghannasiri et al. Downloaded by guest on September 26, 2021 Downloaded by guest on September 26, 2021 rtis h othgl nihddmisaeAT fusion are in domains enriched enriched statistically the highly are in most that prevalence a domains The its identified 120 . analysis to domain of This in proteome protein set included proteome. each reference domains fusion which the protein DEEPEST-called at the rate in the on occurs compared selection Proteome. we Cancer is fusions, the there Rewire if to test Fusions for Selection Positive of samples; (15.7% (CESC) adenocarcinoma endocervical and carcinoma samples; of (16% (13.3% (HNSC) THCA include: fusions samples; kinase of cancers of Other enrichment fusions. high in genes with of pairing random of assumption egansr tal. et Dehghannasiri all in enriched cellular domains of protein addi- enrichment the in identifies transcripts. in fusion analysis fusions organization GO kinase DNA (B) and of rates. metabolism enrichment overall high high other of many significant tion and OV, have LUAD, types CHOL, THCA, tumor that reveals fusions kinase taining 6. Fig. SWISNF the in example for found motif binding DNA a S3 test, mial tumor these (Binomial in tumors ovarian fusions test, of kinase 37% of that predicts role supporting DEEPEST driving chance, types. and by for expected selection be a sig- would statistically than is higher fusions kinase nificantly fusion of in rate shortcomings The of sensitivity. because detection We missed cancers, been (54). these have of might fraction unknown which gener- some could driving yet for OV responsible in as fusions instability ate genome are that OV, hypothesis In events the (53). tested However, driving cancers cause underestimate. explanatory to an sufficient the not is are prevalence mutations TP53 this that literature B Fraction of samples with kinase fusions A ,art ihrta htwudb xetdbsdo h null the on based expected be would what than higher rate a ), respiratory electrontransportchain anaerobic electrontransportchain

0.0 0.1 0.2 0.3 0.4 P THYM UVM nlsso h rcino ape con- samples of fraction the of Analysis (A) analysis. domain Protein < . . . . 0.4 0.3 0.2 0.1 0.0 P KICH 1e chromosome condensation P KIRC DNA conformationchange THCA chromosome organization KIRP 7.7e = COAD −5 < P electron transportchain DNA metabolicprocess TGCT organelle organization n 5 fln dncrioatmr (Bino- tumors adenocarcinoma lung of 25% and ) < PCPG 1e anaerobic respiration Expected fraction of samples with kinase fusions PAAD CESC DLBC LAML −5 1e HNSC −5 READ MESO −6 oti iaefsos(i.6A (Fig. fusions kinase contain ) DNA packaging Fg 6A (Fig. ) LGG CHOL ,ha n eksumu elcarcinoma cell squamous neck and head ), LIHC GBM LUSC P SKCM LUAD < ACC and 1e BLCA 05101520 −6 PRAD aae S3 Dataset ,adcria qaoscell squamous cervical and ), UCEC Fold enrichment Fold BRCA STAD ). UCS OV ESCA and Dataset y=x hook, SARC 25 To raiain n N eaoimo raiain(i.6B (Fig. organization or metabolism and DNA and and condensation chromosome organization, Benjamini–Yekutieli- transport, test, electron (anaerobic) processes pack- per- (Binomial biological R we FDR domains corrected overrepresented dcGOR proteins, these identified the fusions among and using in (55) analysis enriched age enrichment domains the GO refer- 120 with formed the compared the than fusions ize (P in frequency proteome remodeling, enriched reference higher lipid 1.8-fold are times in domains (P 15 involved proteome domain at ence nucle- a some present in Per1 all present and domain oporins, a NUP84-NUP100, complex, eotn hmhv ihrannetdo otiilF rate fusions FP those nontrivial making filters, or ontological nontested studies or heuristic a sequencing, using either even high-throughput have by them reporting identified been have modifying by tuned be scoring. can on threshold DEEPEST sen- the of between tradeoff specificity and the and and scientific score rate sitivity potential discovery statistical of the as but a utility algorithms Such clinical abso- other fusion. the in the than unavailable rather supporting is support, counts statistical read of lute basis prioritize Further, to the used positives. on be filters can true fusions that score known DEEPEST statistical a of unguided assigns detection DEEPEST The sacrificed FP not algorithms. lower without have other significantly datasets than has RNA-Seq rates DEEPEST large-scale filtering. in human-guided fusions gene detect oncogenic novel successful. partially of and only However, been discovery biology 58). have fusions the basic (57, gene for inform biomarkers methodologies or can unbiased targets which therapeutic oncogenes, provide novel discov- The the (56). of promised risk has ery cancer sequencing and high-throughput of mutations advent inherited mod- linked statistical with that discovered eling were oncogenes first the of Some pro- reference Discussion the in function found tumor-specific for not potential include ( their also pairs implies which but domain Pkinase teome, FGFR3–TACC3 pairs other in domain well- 9,500 the I-set–TACC include as pairs and such These TACC fusions proteome. do reference driving that the known pairs in domain protein exist have fusion not that proteins the in result in fusions domains intact couple that fusions protein. such 282 of only tion null conservative whereas closed-form, observed, (P a distribution under were chance domains by expected 2 were with genes, fusions parental only 1-domain with 681 considers fusions 3,388 analysis of Out This domain. tated form. 5 problem closed whose our in fusions for computed domains distribution be in-frame null 2 can pessimistic” test containing “most a proteins the formulated we fusion where distribution, of null selection the model for we how to tive LAML in detected ( RUNX1–RUNX1T1 samples protein fusion in- frame the (Bonferroni-corrected in RUNT–zf-MYND and background RUNT–TAFH, NHR2–RUNT, above enriched FDR the against are domain 226 pair domains; pairs domain between pairing each random of of probability null frequency observed the pared aae S4 Dataset lhuhmn ieydiigaddugbegn fusions gene druggable and driving likely many Although to algorithm statistical reproducible unified, a is DEEPEST DEEPEST all of 17% enrichment, above the to addition In sensi- be could pairs domain protein of enrichment Because com- we fusions, in enriched pairs domain of set the find To aae S4 Dataset < .5,aogtehgeterce oanpisare pairs domain enriched highest the among 0.05), aae S4 Dataset ). ). < < 0 1e n 3 and .5:teerce oan eeivle in involved were domains enriched the 0.05): PNAS −5 ). << ,srn vdnefrselec- for evidence strong Appendix ), (SI ) 0 << 1e | aetgnscnanol n anno- one only contain genes parent uy3,2019 30, July −10 1e .Trsn kinase Tyrosine ). S4 (Dataset ) −10 .T ucinlycharacter- functionally To ). | o.116 vol. | o 31 no. | 15531 Tyr–

BIOPHYSICS AND COMPUTATIONAL BIOLOGY unreliable for clinical use. Similar problems also limit the sen- For intuition on why this step is important, consider the scheme in Fig. 1: sitivity in screens of massive datasets to discover fusions, novel a candidate exon-exon junction sequence that could be generated from oncogenes, or signatures of evolutionary advantage for rare or sequencing error results in a read that has been generated by a single private gene fusions. To illustrate DEEPEST’s potential clini- gene being more similar in sequence to a read that spans a fusion between cal contribution, we integrated DEEPEST-detected fusions with 2 homologous genes. There is a difference between MACHETE and SBT, which will lead to both FPs and FNs by the SBT step: SBTs will not con- a recent curated precision oncology knowledge base (OncoKB) sider the alignment profile of all reads aligning to a junction (as MACHETE (59), with drugs stratified by evidence that they interact with spe- does), including reads with errors or evidence of other artifacts, as such cific proteins and found druggable fusions in 327 tumors (3.3%) reads that would have mismatches with the query sequence and are con- (SI Appendix, Fig. S8). However, the current list of druggable sequently censored by the SBT (these reads would be FN by the SBT). The fusions is biased toward genes that have classically been consid- same censoring leads the SBT, like other algorithms, to have a high FP rate ered oncogenes, and mainly kinases. By increasing the precision due to: 1) FP intrinsic to the Bloom filters used in the SBT; 2) even if the of fusion calls, the analysis in this paper opens the door for bloom filter itself has a null FP rate, SBT may falsely identify a putative further functional and clinical studies to better identify drug fusion due to events such as described above (and depicted in Fig. 1). As targets and repurpose already validated drugs that could target another example of a FP from reason 2 above, if a single artifact (e.g., a lig- ation artifact between 2 highly expressed genes) in a single sample passes fusions. MACHETE statistical threshold in the discovery step, it will be included as The DEEPEST algorithm improves detection of gene fusions a query sequence for the SBT step, and the SBT could detect it at a high that have been missed by other algorithms’ lists of “high con- frequency because the statistical models used by MACHETE are not used by fidence” gene fusions. Analysis of these gene fusions uncovers the SBT (Fig. 1). Testing for the consistency of the rate of each sequence fundamental cancer biology. First, we find evidence that gene being detected in the discovery set with its prevalence estimated by SBT fusions are more prevalent than previously thought in a variety of controls for the multiple testing bias leading to increased FPs from the tumors including high grade serous ovarian cancers. Second, our SBT (Fig. 1). computational analysis suggests the hypothesis that gene fusions We built SBT index files for all TCGA samples across various TCGA projects involving kinases and perhaps other genes are a contributing using SBT default parameters (k-mer index size 20 and a minimum count of 3 for adding a k-mer to a bloom filter). The 40mer flanking the fusion driver of these cancers. Further, fusions in ovarian and most junction (20 nucleotides on the 50 side and 20 nucleotides on the 30 side) is tumor types are under selection to include gene families that are retrieved for each fusion nominated by the junction nomination component known to drive cancers, such as kinases and genes annotated as and each TCGA tumor type is queried for the fasta file containing all 40mers COSMIC. for the fusions called by the first component. For the sensitivity threshold, DEEPEST allows for rigorous and unbiased quantification of which determines the required fraction of k-mers in the query sequence that gene fusions at annotated exonic boundaries and for tests of should be found for a hit, we used a more stringent value of 0.9 (instead of whether partners in gene fusions that may be rare or private default value 0.8) to improve the specificity of DEEPEST. After querying, the are present at greater frequencies than would be expected due detection frequencies of each fusion junction by the first component and to chance. The results in this paper establish statistical evidence SBT are compared, and if they are statistically consistent, the fusion could pass the SBT refinement step; otherwise, it would be discarded. Technical that gene fusions and the partner genes involved in fusions are details of the statistical framework, postprocessing of DEEPEST output files, under a much greater selective pressure than previously appreci- and SBT query step are provided in SI Appendix. ated: under a highly stringent definition of an enriched partner, more than 10 to 20% of all TCGA tumors profiled harbor a gene Null Probability for Recurrent Fusions. For g genes, there are g(g − 1) pos- fusion including a gene under selection by tumors to be involved sible fusions. If n fusions are detected by DEEPEST across all tumors, let X in gene fusions that require large scale genomic rearrangement. be the number of recurrent fusions. In our analysis, there are g = 22, 000 Future work profiling normal cohorts will distinguish whether different gene names in DEEPEST report files and we have reported n = fusions including these genes that found in GTEx arise due to 31, 007 fusions. The probability that no fusion is recurrent can be com- puted using the Poisson approximation for the birthday coincidences prob- variation at the DNA sequence, transcriptional, or posttranscrip- n(n−1) −λ lem with λ = 2g(g−1) , Prob(X = 0) ≈ e = 0.451. As shown in Fig. 4, the tional levels. Significantly, this discovery required comprehensive expected value of the number of recurrent fusions (with a frequency of at statistical analysis of rare gene fusions, using large numbers of least 2) is 5. samples to increase power to detect selection for gene fusion expression by tumors. Calculations for the Expected Number of Recurrent 50 and 30 Partners. As a Together, the results in this paper lead to a model that test of the likelihood of observing our results, we use a statistical model 0 0 fusions may be lesions like point mutations, present across of the probability of observing as many or more recurrent 5 and 3 part- tumors rather than tumor-defining, and suggests that by focus- ners under the assumption that the genes in each fusion pair are randomly chosen from all expressed genes. To do this, we use the generalized birth- ing on one tumor type to detect recurrence, and by relying on day model from ref. 38. First, we consider all g = 22, 000 expressed genes classical metrics for recurrence and selection, some important as boxes, which represent each potential 50 partner gene. Next, we con- cancer biology is lost. Further, the computational evidence in this sider the distribution of the number of distinct 30 partners for each gene paper suggests rare fusions are drivers of a substantial fraction if 30 partner genes, which we take to be numbered balls, were thrown at of tumors. random into boxes. When the first ball arrives in the box j1, this represents 0 that the first observed fusion on our list has gene j1 on its 5 side. At the Materials and Methods end of this process, we have thrown n = 31, 007 fusions (balls) into g boxes. For a given c, the number of balls occupying a single box, we can calculate An Enhanced Statistical Fusion Detection Framework for Large-Scale Genomics. the probability of having Xg,c boxes with at least c balls. The distribution We used the discovery set to generate a list of fusions passing MACHETE c of X has been shown to be a Poisson distribution Po( t ) (ref. 38, theorem statistical bar (Fig. 1). We then queried all datasets for any fusions found in g,c c! 2.2), where t = n . We perform the following calculations to find signif- any discovery set (Fig. 1) and estimated the incidence of each fusion with g1−1/c SBTs (25). SBTs are data structures developed to quickly query many files icant recurrent 50 gene partners. For each c, we find the expected number of data of short-read sequences from RNA-Seq data (and other data) for a and the 99% upper CI of the number of boxes (50 genes) that have at least particular sequence. These structures build on the concept of Bloom filters. c balls (distinct 30 partners) according to the null distribution. For statisti- SI Appendix contains technical details about the methodology used. Next, cal analysis, a significance level of 0.01 was considered. Moreover, since we we used standard binomial CIs to test for consistency of the rate that fusions are testing multiple hypotheses in our analysis, we adopt the Benjamini– were present in the samples used in MACHETE discovery step and the rate Hochberg–Yekutieli FDR control procedure (60) and correct the significance that they were found in the SBT. Fusion sequences that were more prevalent value for each c. For each c, we construct the CI at level (1 − corrected sig- across the entire dataset that is statistically compatible with the predicted nificance level) (Fig. 5B). Similarly, we can find the expectation and upper CI prevalence from the discovery set were excluded from the final list of fusions (after Benjamini–Hochberg–Yekutieli correction) for each number of recur- (Fig. 1). rent 30 genes that have at least c 50 gene partners (Fig. 5B). We provide

15532 | www.pnas.org/cgi/doi/10.1073/pnas.1900391116 Dehghannasiri et al. Downloaded by guest on September 26, 2021 Downloaded by guest on September 26, 2021 0 .Lonsdale J. 30. Torres-Garc W. 29. Haas https://github. B. GitHub. DEEPEST-Fusion. 28. Salzman, J. Geno- Jordanski, M. Cancer Dehghannasiri, sequencing App. R. DEEPEST-Fusion 27. short-read Salzman, of J. thousands Dehghannasiri, of R. search Jordanski, Fast M. 26. Kingsford, C. Solomon, B. 25. 4 .Lee M. 24. Szabo L. 23. 2 .A Forbes A. S. 22. fusions gene oncogenic understanding and Discovering Babu, M. M. Latysheva, S. N. 21. fusion cancer of database A Fusioncancer: Dong, D. Wu, Z. Liu, J. Wu, kinase N. of Wang, Y. landscape The 13. Lengauer, C. Kim, Yoshihara L. J. K. Schalm, 12. S. Cerami, E. Stransky, N. 11. Gao Q. 10. 0 .Hsieh G. 20. Liu S. 15. Abate F. 14. 9 .R Saram R. O. fusion 19. the for methods of assessment Bailey Comparative P. Li, H. 18. Qin, F. Vo, D. A. Kumar, S. 17. Carrara M. 16. egansr tal. et Dehghannasiri ftePisndsrbto Po( distribution Poisson the of bea e.2.As,apbil vial nieto ihwbitraeis interface web with avail- 26. tool ref. are online at fusions Cloud available Genomics of publicly Cancer a the analysis on Also, for available 27. used ref. scripts at custom able all and preinstalled, Availability. Software least at iha least at with of table a .B le-aaai .Bauy .W alsn .A iso,E aso,Goa analysis Global Larsson, E. Nilsson, A. J. Karlsson, W. J. Bhadury, J. Alaei-Mahabadi, B. 9. cancer. Hu to X. immunogenomics of 8. Applications Mardis, R. E. Liu, S. X. cancer: against 7. vaccines gene-fusion vectored of rationale The gene Holst, P. personalized Ragonnaud, for E. pipeline A INTEGRATE-neo: 6. Maher, A. C. Mardis, R. E. Zhang, J. 5. .D Singh D. 4. Tomlins A. S. 3. Soda M. 2. .D .Hnefr,Amnt hoooei ua hoi rnlctcleukemia. granulocytic chronic human in chromosome minute A Hungerford, A. D. 1. 8 (2013). 585 matics 2017). March (24 120295 page BioRxiv, 2019. January 8 Deposited com/salzmanlab/DEEPEST-Fusion. 2019. May 13 Deposited deepest-fusion/. https://cgc.sbgenomics.com/public/apps#jordanski.milos/deepest-fusion/ Cloud. mics experiments. mining. data literature and transcriptome iseseicidcino iclrRAdrn ua ea development. fetal human during RNA Biol. circular of induction tissue-specific cancer. human (2016). approaches. computational intensive data through Res. Acids Nucleic data. RNA-seq from derived genes fusions. transcript cancer. in fusions in cancers. expression human gene of on impact their and cancers. alterations human genomic diverse structural somatic of fusions. (2017). evidence. latest and strategies Evolving discovery. neoantigen fusion ihafvrbeprognosis. favorable a with data. RNA-seq paired-end Res. in Acids methods performing top combine to meta-caller model. fusion accurate (2012). on based discovery transcripts Science Neoplasia cancer. lung cell Nature data. RNA-Seq from detection transcripts Bioinform. tissues?BMC normal in chimeras transcription-induced Science 2 (2015). 126 16, opeesv vlaino uintasrp eeto loihsada and algorithms detection transcript fusion of evaluation Comprehensive al., et c uoFsos nitgaiersuc o acrascae transcript cancer-associated for resource integrative An TumorFusions: al., et 2422 (2014). 2224–2226 30, rvrfsosadteripiain ntedvlpetadtreatment and development the in implications their and fusions Driver al., et 75 (2016). 47–52 531, 5 trfso:Fs n cuaefso rncitdtcinfo RNA-seq. from detection transcript fusion accurate and Fast Star-fusion: al., et 2113 (2012). 1231–1235 337, 4719 (1960). 1497–1499 132, hmrb30 nehne aaaefrfso ee rmcancer from genes fusion for database enhanced An 3.0: Chimerdb al., et uli cd Res. Acids Nucleic dnicto ftetasomn M4AKfso eei non-small- in gene fusion EML4–ALK transforming the of Identification al., et rnfrigfsoso GRadTC ee nhmnglioblastoma. human in genes TACC and FGFR of fusions Transforming al., et 0 P eoi nlssietf oeua utpso acetccancer. pancreatic of subtypes molecular identify analyses Genomic al., et ttsial ae piigdtcinrvasnua nihetand enrichment neural reveals detection splicing based Statistically al., et elrpots nRASqdt nlssfaeokfrchimeric for framework analysis data RNA-Seq An Bellerophontes: al., et ttsia loihsipoeacrc fgn uindetection. fusion gene of accuracy improve algorithms Statistical al., et 4–4 (2015). e47–e47 44, 7–8 (2008). 177–188 10, (3 h eoyeTsu xrsin(Tx project. (GTEx) Expression Genotype-Tissue The al., et ausfrec bevto ftenme frcretgenes recurrent of number the of observation each for values aki c ¨ omc xlrn h ol’ nweg fsmtcmttosin mutations somatic of knowledge world’s the Exploring Cosmic: al., et tt fatfso-ne loihsaesial odetect to suitable are algorithms fusion-finder art of State al., et 0 a.Biotechnol. Nat. atesi i.5b h oml 1 formula the by 5 Fig. in partners stecmltv itiuinfunction distribution cumulative the is ·) F( where partners), ) h adcp n hrpui eeac fcancer-associated of relevance therapeutic and landscape The al., et oeo h MRS-R eefso npott cancer. prostate in fusion gene TMPRSS2-ERG the of Role al., et ´ ıa uli cd Res. Acids Nucleic MRS:R uinietfisasbru fpott cancers prostate of subgroup a identifies fusion TMPRSS2:ERG al., et Nature rd:Ppln o N eunigdt analysis. data sequencing RNA for Pipeline Prada: al., et a.Commun. Nat. 16e2 (2017). e126–e126 45, Oncogene elRep. Cell EPS oko,i hc l eddsfwrsare softwares needed all which in workflow, DEEPEST rc al cd Sci.U.S.A. Acad. Natl. Proc. 6–6 (2007). 561–566 448, 14–14 (2017). D1144–D1149 46, ln acrRes. Cancer Clin. 2–3 (2018). 227–238 23, 8545 (2015). 4845–4854 34, Bioinformatics 0–0 (2016). 300–302 34, t c 86(2014). 4846 5, c ! 85D1 (2014). D805–D811 43, ). S2 (Dataset ) ig.Pathol. Diagn. hr d.Vaccin. Adv. Ther. c.Rep. Sci. uli cd Res. Acids Nucleic 5–5 (2017). 555–557 33, 3530 (2008). 3395–3400 14, 3 (2015). 131 10, 19 (2016). 21597 6, 36–37 (2016). 13768–13773 113, uli cd Res. Acids Nucleic − Bioinformatics F(#3 34 (2013). 33–47 1, 74D8 (2016). D784–D789 45, 0 (5 a.Genet. Nat. 2(2013). S2 14, 0 Cell ateswith partners ) 2114–2121 28, 4487–4503 44, 600–612 168, Bioinfor- Genome 580– 45, Nucleic 0 .Bnaii .Yktei h oto ftefledsoeyrt nmlil testing multiple in rate discovery false the of control The Yekutieli, D. Benjamini, Y. 60. Chakravarty D. 59. Lawrence S. M. 58. domain protein Cibulskis and retinoblastoma. of K. ontologies study 57. Statistical analysing cancer: and for Mutation Knudson, package G. A. R 56. An dcGOR: Fang, H. 55. Bowtell D. D. 54. Martincorena I. 53. Cerami E. 52. Xing Z. 51. Sun C.-C. 50. for cancer. target Guan and Y. therapeutic instability 49. emerging genetic family, An gene CDK12: RAD51 The Kemp, Thacker, J. J. C. 48. Grandori, C. Lui, L. Y. G. 47. Mdm2. by stability p53 of Regulation Liang Vousden, C. H. K. 46. Jones, N. S. Kubbutat, G. H. M. docking- the 45. for role Critical Schlessinger, J. Lax, I. Kouhara, H. Gotoh, N. Hadari, R. Y. 44. problem. Kakizuka A. birthday generalized and 39. a for law Causes limit Poisson revisited. cancer: A liaisons Henze, N. and Dangerous 38. Chromothripsis translocations: Chromosome Jackson, Rowley, P. S. D. J. Kaidi, 37. A. long Forment, of dissection V. experimental J. and 36. classification Functional Mendell, T. cancer. J. in lncRNAs Kopp, of F. role emerging 35. The circuitry. Huarte, signaling M. Wiring cancer: 34. in RNA noncoding Long Yang, L. Lin, C. 33. Prensner pre- the R. are J. RNAs Circular 32. Brown, O. P. Lacayo, N. Wang, L. P. Gawad, C. Salzman, J. 31. 3 .E Blum E. A. 43. Inaki K. 42. Cai C. 41. Luo W. 40. h I lu rdt oe io,acmoeto h I i aato Data Big NIH the of component from a credits Pilot, of Model use Program. Credits the Knowledge from Cloud Program benefited NIH Scholars research Biology the This Systems CA180993. Molec- Cancer Evolutionary R25 by & supported Grant Computational is in R.D. Fellow Biology. R01 Sloan MCB- P. Fellowship. ular Alfred Family Award Grant an Baxter also is Program Sciences a is J.S. and Development J.S. Medical Fellowship, manuscript. Career McCormick–Gabilan General a the Early 1552196, of on Faculty Institute feedback NSF GM116847, for National laboratory by J.S.’s supported of members and ACKNOWLEDGMENTS. ne dependency. under Oncol. types. tumour samples. cancer heterogeneous U.S.A. Sci. Acad. annotations. cancer. ovarian serous skin. human normal in mutations data, genomics cancer (2012). multidimensional exploring signals. chemokine p57. suppressing and EZH2 to Acids binding by cancer lung small-cell cancer. breast and (2005). 125–135 cancer. UVRAG. protein Nature U.S.A. Sci. Acad. FRS2α protein rarα fuses leukemia (1998). 333–336 39, Cancer shattering. chromosome of consequences RNAs. noncoding Biol. complex. SWI/SNF the (2013). antagonizes and types. cancer cell prostate diverse in genes human of hundreds One PLoS from isoform transcript dominant spaeladenocarcinomas. esophageal cancer. breast cancer. prostate resistance. therapeutic (2009). and oncogenesis sarcoma Ewing’s (1991). 8–0 (2018). 287–301 28, 35(2016). e385 5, i-9 niistmrporsinb agtn P6B nhuman in RPS6KB1 targeting by progression tumor inhibits miR-195 al., et –6(2017). 1–16 1, .Ci.Pathol. Clin. J. 4–5 (2001). 245–250 1, 9–0 (1997). 299–303 387, uohgcadtmu upesratvt fanvlBeclin1-binding novel a of activity suppressor tumour and Autophagic al., et mlfiaino v1cnrbtst h ahpyilg fovarian of pathophysiology the to contributes pvt1 of Amplification al., et nRAdrcscoeaieeieei euaindwsra of downstream regulation epigenetic cooperative directs lncRNA al., et SM samcoaelt-otiigESFItre novdin involved target EWS/FLI microsatellite-containing a is GSTM4 al., et rncitoa osqecso eoi tutrlaertosin aberrations structural genomic of consequences Transcriptional al., et ogitrei ocdn N 01 csa nocgn nnon– in oncogene an as acts 00511 RNA noncoding intergenic Long al., et 373(2012). e30733 7, n eunigietfistasrpinlyval eefsosin fusions gene viable transcriptionally identifies sequencing Rna al., et h Bocne eoispra:A pnpafr for platform open An portal: genomics cancer cBio The al., et hoooa rnlcto (51)i ua ct promyelocytic acute human in t(15;17) translocation Chromosomal al., et estv eeto fsmtcpitmttosi mueand impure in mutations point somatic of detection Sensitive al., et LSCmu.Biol. Comput. PLoS tal. et eoeRes. Genome Nature nFFrcpo-eitdsga rndcinpathways. transduction signal receptor-mediated FGF in h ognnoigRAShA1pooe aggressive promotes SChLAP1 RNA noncoding long The al., et noB rcso nooykoldebase. knowledge oncology precision A OncoKB: al., et icvr n auainaayi fcne ee cos21 across genes cancer of analysis saturation and Discovery al., et ln acrRes. Cancer Clin. 2–2 (1971). 820–823 68, Biol. Cell Nat. (2001). 8578–8583 98, ihbre n evsv oiieslcino somatic of selection positive pervasive and burden High al., et Cell ehnigoaincne I euigmraiyfo high-grade from mortality Reducing II: cancer ovarian Rethinking , ln acrRes. Cancer Clin. Cell n.Stat. Ann. ihanvlpttv rncito atr PML. factor, transcription putative novel a with 9–0 (2014). 495–501 505, 9–0 (2018). 393–407 172, 5–6 (2018). 957–962 71, a.Rv Cancer Rev. Nat. PNAS 1012 (2014). 1110–1125 159, etakSee rad o sfldiscussions, useful for Artandi Steven thank We 7–8 (2011). 676–687 21, 1518 (2001). 1165–1188 29, 8–9 (2006). 688–698 8, acrRes. Cancer | a.Biotechnol. Nat. 1099(2014). e1003929 10, 9243 (2015). 4922–4934 21, Science uy3,2019 30, July 7555 (2007). 5745–5755 13, 6–7 (2015). 668–679 15, 8–8 (2015). 880–886 348, a.Rv Cancer Rev. Nat. 6853 (2016). 5628–5633 76, 1–1 (2013). 213–219 31, | a.Med. Nat. o.116 vol. acrDiscov. Cancer a.Genet. Nat. Oncogene 6–7 (2012). 663–670 12, 2316 (2015). 1253–1261 21, o.Therapy-Nucleic Mol. | o 31 no. tt rbb Lett. Probab. Stat. acrLett. Cancer Cell 1392–1398 45, 4126–4132 28, 663–674 66, 401–404 2, C Precis. JCO rnsCell Trends rc Natl. Proc. rc Natl. Proc. a.Rev. Nat. | 15533 219,

BIOPHYSICS AND COMPUTATIONAL BIOLOGY