Improved detection of gene fusions by applying statistical methods reveals oncogenic RNA cancer drivers

Roozbeh Dehghannasiri, Donald E. Freeman, Milos Jordanski, Gillian L. Hsieh, Ana Damljanovic, Erik Lehnert, and Julia Salzman

Department of Biomedical Data Science, Stanford University, Stanford, CA 94305; Department of Biochemistry, Stanford University, Stanford, CA 94305; Department of Computer Science, University of Belgrade, 11000 Belgrade, Serbia; Seven Bridges Genomics Inc., Cambridge, MA 02142; and Stanford Cancer Institute, Stanford University, Stanford, CA 94305

Edited by Ali Torkamani, The Scripps Research Institute, La Jolla, CA, and accepted by Editorial Board Member Peter K. Vogt June 14, 2019 (received for review January 10, 2019)

The extent to which gene fusions function as drivers of cancer remains a critical open question. Current algorithms do not sufficiently identify false-positive fusions arising during library preparation, sequencing, and alignment. Here, we introduce Data-Enriched Efficient PrEcise STatistical fusion detection (DEEPEST), an algorithm that uses statistical modeling to minimize false-positives while increasing the sensitivity of fusion detection. In 9,946 tumor RNA-sequencing datasets from The Cancer Genome Atlas (TCGA) across 33 tumor types, DEEPEST identifies 31,007 fusions, 30% more than identified by other methods, while calling 10-fold fewer false-positive fusions in nontransformed human tissues. We leverage the increased precision of DEEPEST to discover fundamental cancer biology. Namely, 888 candidate oncogenes are identified based on overrepresentation in DEEPEST calls, and 1,078 previously unreported fusions involving long intergenic noncoding RNAs, demonstrating a previously unappreciated prevalence and potential for function. DEEPEST also reveals a high enrichment for fusions involving oncogenes in cancers, including ovarian cancer, which has had minimal treatment advances in recent decades, finding that more than 50% of tumors harbor gene fusions predicted to be oncogenic. Specific protein domains are enriched in DEEPEST calls, indicating a global selection for fusion functionality: kinase domains are nearly 2-fold more enriched in DEEPEST calls than expected by chance, as are domains involved in (anaerobic) metabolism and DNA binding. The statistical algorithms, population-level analytic framework, and the biological conclusions of DEEPEST call for increased attention to gene fusions as drivers of cancer and for future research into using fusions for targeted therapy.

gene fusion | cancer genomics | bioinformatics | pan-cancer analysis | TCGA

Significance

Gene fusions are tumor-specific genomic aberrations and are among the most powerful biomarkers and drug targets in translational cancer biology. The advent of RNA-sequencing technologies over the last decade has provided a unique opportunity for detecting novel fusions via deploying computational algorithms on public sequencing databases. However, precise fusion detection algorithms are still out of reach. We develop Data-Enriched Efficient PrEcise STatistical fusion detection (DEEPEST), a highly specific and efficient statistical pipeline specially designed for mining massive sequencing databases and apply it to all 33 tumor types and 10,500 samples in The Cancer Genome Atlas database. We systematically profile the landscape of detected fusions via classic statistical models and identify several signatures of selection for fusions in tumors.

Author contributions: R.D., D.E.F., and J.S. designed research; R.D., D.E.F., and J.S. performed research; R.D., D.E.F., M.J., G.L.H., A.D., E.L., and J.S. contributed new reagents/analytic tools; R.D., D.E.F., and J.S. analyzed data; and R.D. and J.S. wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission. A.T. is a guest editor invited by the Editorial Board.

Published under the PNAS license.

Data deposition: DEEPEST workflow, with all needed softwares preinstalled, have been deposited in GitHub, https://github.com/salzmanlab/DEEPEST-Fusion. Also, a publicly-available online tool with web interface is available for the DEEPEST algorithm on the Cancer Genomics Cloud platform, https://cgc.sbgenomics.com/public/apps#jordanski.milos/deepest-fusion/deepest-fusion/. All custom scripts used to generate the figures have been deposited in GitHub, https://github.com/salzmanlab/DEEPEST-Fusion/tree/master/custom scripts.

1 To whom correspondence may be addressed. Email: [email protected]

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1900391116/-/DCSupplemental.

www.pnas.org/cgi/doi/10.1073/pnas.1900391116

Gene fusions are known to drive some cancers and can be highly specific and personalized therapeutic targets; some of the most famous fusions are the BCR–ABL1 fusion in chronic myelogenous leukemia (CML), the EML4–ALK fusion in non-small lung carcinoma, TMPRSS2–ERG in prostate cancer, and FGFR3–TACC3 in a variety of cancers, including glioblastoma multiforme (1–4). Since fusions are generally absent in healthy tissues, they are among the most clinically relevant events in cancer to direct targeted therapy and to be used as effective diagnostic tools in early detection strategies using RNA or proteins; moreover, as they are truly specific to cancer, they have promising potential as neo-antigens (5–7).

Because of this, clinicians and large sequencing consortia have made major efforts to identify fusions expressed in tumors via screening massive cancer sequencing datasets (8–12). However, these attempts are limited by critical roadblocks: current algorithms suffer from high false-positive (FP) rates and unknown false-negative (FN) rates. Thus, ad hoc choices have been made in calling and analyzing fusions, including taking the consensus of multiple algorithms and filtering lists of fusions using manual approaches (13–15). These approaches lead to what third-party reviews agree is imprecise fusion discovery and bias against discovering novel oncogenes (15–17). This suboptimal performance becomes more problematic when fusion detection is deployed on large cancer sequencing datasets that contain thousands or tens of thousands of samples. In such scenarios, precise fusion detection must overcome the problem of multiple hypothesis testing: each algorithm is testing for fusions thousands of times, a regime known to introduce FPs. To overcome these problems, the field has turned to consensus-based approaches, where multiple algorithms are run in parallel (10), and a metacaller allows “voting” to produce the final list of fusions. This is also unsatisfactory, as it introduces FNs.

Both shortcomings in the ascertainment of fusions by existing algorithms and using recurrence alone to assess fusions’ function

BIOPHYSICS AND COMPUTATIONAL BIOLOGY

have limited the use of fusions to discover new cancer biology. As one of many examples, a recent study of more than 400 pancreatic cancers found no recurrent gene fusions, raising the question of whether this is due to high FN rates or whether this means that fusions are not drivers in the disease (18). Recurrence of fusions is currently one of the only standards in the field used to assess the functionality of fusions, but the most frequently expressed fusions may not be the most carcinogenic (19); on the other hand, there may still be many undiscovered gene fusions that drive cancer.

Thus, the critical question “Are gene fusions underappreciated drivers of cancer?” is still unanswered. In this paper, we first provide an algorithm that offers a significant advance in precision for unbiased fusion detection at exon boundaries in massive genomics datasets. The algorithm, Data-Enriched Efficient PrEcise STatistical fusion detection (DEEPEST), is a second-generation fusion algorithm with significant computational and algorithmic advances over our previously developed MACHETE (Mismatched Alignment Chimera Tracking Engine) algorithm (20). A key innovation in DEEPEST is its statistical test of fusion prevalence across populations, which can identify FPs in a global unbiased manner.

The precision and efficient implementation of DEEPEST allowed us to conduct an unbiased screen for expressed fusions occurring at annotated exon boundaries (based on GRCh38) in a cohort of 10,521 RNA-sequencing datasets, including 9,946 tumor samples and 575 normal (tumor adjacent) samples, across the entire 33 tumor types of The Cancer Genome Atlas (TCGA). Beyond recovery of known fusions, DEEPEST discovers fusions with potentially important implications in cancer biology that had not been detected by previous studies.

While frequent recurrence of gene fusions has been considered a hallmark of a selective event during tumor initiation, and this recurrence has historically been the only evidence available to support that a fusion drives a cancer, private or very rare gene fusions are beginning to be considered potential functional drivers (21). However, the high FP rates in published algorithms prevent a statistical analysis of whether reported private or rare gene fusions exhibit a signature of selection across massive tumor databases, such as TCGA. We have formulated statistical tests for nonneutral selection of fusion expression by calculating the expected rates of rarely recurrent gene fusions and partner genes, enrichment of gene families such as kinase genes or those curated in Catalog Of Somatic Mutations In Cancer (COSMIC) (22), and enrichment for protein domains or pairs of protein domains present exclusively in fusions. These analyses reveal a significant signal for selection of gene fusions. The statistical tests provide a basis for identifying candidate oncogenes and driver and druggable fusions.

To illustrate one of our findings, a large fraction of ovarian serous cystadenocarcinoma tumors has until now lacked explanatory drivers beyond nearly universal TP53 mutations and defects in pathways. Because TP53 mutations create genome instability, a testable hypothesis is that TP53 mutations permit the development of rare or private driver fusions in ovarian cancers, and the fusions have been missed due to biases in currently available algorithms. We apply DEEPEST to RNA-sequencing (RNA-Seq) data from bulk tumors and find that 94.6% of the ovarian tumors we screened have detectable fusions, half of the tumors express gene fusions involving a known COSMIC gene, and 36% have fusions involving genes in a kinase pathway.

In summary, DEEPEST is an advance in accuracy for fusion detection in massive RNA-Seq datasets. The algorithm is reproducible, publicly available, and can be easily run in a dockerized container (Materials and Methods). Its results have important biological implications: DEEPEST, applied in conjunction with statistical analysis to the entire TCGA database, reveals a signature of fusion expression consistent with the existence of under-appreciated drivers of human cancer, including selection for rare or private gene fusions with implications from basic biology to the clinic.

Results

DEEPEST Is a Statistical Algorithm for Gene Fusion Discovery in Massive Public Databases. We engineered a statistical algorithm, DEEPEST, to discover and estimate the prevalence of gene fusions in massive numbers of datasets. Here, we have applied DEEPEST to ∼10,000 datasets, but in principle, DEEPEST can be applied to 100,000, 1 million, or more samples. DEEPEST includes key innovations such as controlling FPs arising from analysis of massive RNA-Seq datasets for fusion discovery, a problem conceptually analogous to multiple hypothesis testing via P values, which cannot be solved by direct application of common false-discovery rate (FDR)-controlling procedures, which rely on the assumption of a uniform distribution of P values under the null hypothesis.

The DEEPEST pipeline contains 2 main computational steps: 1) a junction nomination component, which is run on a subset of all samples to be analyzed, called “the discovery set”; and 2) statistical testing of nominated junctions on all analyzed samples, “the test set.” In this paper, we have used all samples as the discovery set, but this set could be a fraction of RNA-Seq data if desired.

Step 1 includes running the KNIFE (known and novel isoform explorer) method to detect chimeric junctions (23), defined as a splicing event between 2 distinct genes, whose exons are on the same chromosome and strand and within the distance of 1 MB, and a method based on the MACHETE algorithm (20) to detect chimeric junctions with partner exons being farther than 1 MB from each other or on different chromosomes/strands (Fig. 1). Putative fusions are nominated from the initial database by using a null statistical model of read-alignment profiles that models the effect of junction sequence composition and gene abundance in generating FP fusions (SI Appendix and Materials and Methods). This step relies on extensive computational engineering, which restructures the MACHETE pipeline into an efficient reproducible publicly available workflow based on dockerized containers, using the Common Workflow Language (CWL). Another advance in DEEPEST over MACHETE is further improvement of sensitivity by including gold standard cancer fusions in the junction nomination step of MACHETE, which makes DEEPEST easily portable to clinical settings where clinicians desire precise identification of a set of known fusions. For this purpose, we used fusions curated in ChimerDB 3.0 (24).

In Step 2, the statistical refinement step, DEEPEST uses rigorous statistical approaches based on orthogonal sequence level queries via the sequence bloom tree (SBT) (25), a method that indexes the sequence composition of genomic datasets and can rapidly query whether specific k-mers appear in the corpus. This step is modular and can in principle be applied to any fusion discovery algorithm to identify FPs resulting from multiple testing, a major challenge brought on by running discovery algorithms on massive datasets. Fusions nominated by the junction nomination component are subjected to a secondary statistical test: they are efficiently tested in the discovery set along with an arbitrarily large number of added samples in the test set, here tens of thousands of samples, by rapid queries using SBT. This step further decreases the FP identification of fusions beyond MACHETE, which has been already shown to have better specificity than any other published algorithm (20). Intuitively, this step checks whether the prevalence of fusions found by running MACHETE (or KNIFE) is statistically consistent with the estimated prevalence using a string-query based approach (such as SBT). Since the SBT has perfect sensitivity by searching merely by looking at fusion-junctional sequences, samples could be positive for a fusion by SBT yet negative by MACHETE, which requires

discordant reads to nominate fusions (20). For a fusion to be called by DEEPEST, it should have an SBT detection frequency that is statistically consistent with the estimated prevalence by the junction nomination component and additionally pass statistical filtering such as a test for repetitive sequences near exon boundaries (Fig. 1 and SI Appendix).

DEEPEST does not require human guidance, is fully automated, and can be applied to any paired-end RNA-Seq database by leveraging the massive computational power of cloud platforms. A web-based user-friendly version of the pipeline has been implemented on the Seven Bridges Cancer Genomics Cloud (CGC) (26), which allows a user to run the workflow either by uploading RNA-Seq data or using RNA databases already available on CGC. (Currently, the average cost of running the workflow for a single TCGA sample on the cloud is roughly $3.) Moreover, most parts are portable, as they are dockerized and can be easily exported to many platforms using a description given by the CWL (27).

Fig. 1. Origin and identification of FP from running DEEPEST on thousands of samples. (A) DEEPEST uses all reads, including those censored by other algorithms, to generate an empirical P value for each candidate fusion. SBTs, together with further statistical modeling, are used to identify FP arising from testing on multiple samples, some of which are reported by other algorithms (SI Appendix, Fig. S4A). The first black arrow shows the motivation for designing the SBT step. (B) cDNA or mapping artifacts result in the inclusion of exon–exon junctions from all combinations of exons within a fixed genomic radius of X1 to all exons in the radius of Y3. Some such junctions will include highly degenerate sequences (for example, due to Alu exonization) or polyA stretches at the 5′ end that cannot be mapped uniquely, and thus DEEPEST blinds itself to the detection of fusions containing such junctions.

DEEPEST Improves Sensitivity and Specificity of Fusion Detection. We first evaluated DEEPEST FP and FN rates on fusion positive benchmarking datasets used by third parties to assess the performance of 14 state-of-the-art algorithms (28). On each dataset, DEEPEST has 100% positive-predictive value (PPV), the ratio of the number of true positive calls to the total number of calls, higher than all other 14 state-of-the-art algorithms (SI Appendix, Fig. S1). Although comparable, DEEPEST has a numerically higher PPV than PRADA (Pipeline for RNA-Sequencing Data Analysis) (29), which is the next best algorithm. For this analysis, we applied only the first component of DEEPEST, which is based on MACHETE, as the SBT refinement step utilizes the statistical power across a large cohort of samples, which is not the case for simulated datasets.

Because simulations can only model errors with known sources, it is common for algorithms to perform differently on real and simulated data; for example, simulated data do not model reverse transcriptase template switching or chimeras arising from ligation or PCR artifacts. Thus, in addition to evaluating the performance of DEEPEST on simulated data, we performed a thorough computational study of DEEPEST performance on real data. To evaluate the FP rate of DEEPEST on real data, we applied it to several hundred normal datasets, including Genotype-Tissue Expression (GTEx) (30) and TCGA normal samples. Notably, DEEPEST calls 80% fewer fusions in GTEx samples (SI Appendix, Fig. S2A) than does STAR-Fusion (28), an algorithm used in a recent pan-cancer analysis (10). In addition, DEEPEST reports fewer calls on TCGA normal samples (509 fusions) compared with the 3,128 fusions reported in the same samples (SI Appendix, Fig. S2B) by TumorFusions (8), which is a TCGA fusion list based on PRADA (29). This provides evidence that, unlike other algorithms, DEEPEST retains the specificity seen in simulations in real tissue samples.

We ran DEEPEST on the entire TCGA corpus: 9,946 tumor samples across all 33 tumor types. DEEPEST reports 31,007 fusions across TCGA. Consistent with what is known about tumor type-specific gene fusion expression, DEEPEST detects the highest abundance of fusions in sarcoma (SARC), uterine corpus endometrial carcinoma (UCEC), and esophageal carcinoma (ESCA) tumor types and the fewest number of detected fusions in thyroid carcinoma (THCA), testicular germ cell tumors (TGCT), and uveal melanoma (UVM) (SI Appendix, Fig. S5). We provide the description of tumor types in SI Appendix.

While calling significantly fewer fusions in normal samples, DEEPEST identifies significantly more fusions in TCGA tumor

samples compared with 2 most recent surveys of the same samples (8, 10), the latter is based on STAR-Fusion that is more sensitive in simulated data. While some fusion algorithms might exhibit better sensitivity (at the cost of higher FP rates) on simulated datasets, DEEPEST is more sensitive in real cancer datasets (SI Appendix, Fig. S2 C and D). When samples shared between 3 studies are considered, DEEPEST detects much more fusions (29,820 fusions, compared with 23,624 fusions in ref. 10 and 19,846 fusions by TumorFusions) and substantially fewer calls in real normal datasets (SI Appendix, Fig. S2 A and B), suggesting that the modeling used by DEEPEST is a better fit for real data. DEEPEST-only fusions are enriched in cancers known to have high genomic instability (ESCA, ovarian carcinoma [OV], stomach adenocarcinoma, and SARC) compared with fusions found only by TumorFusions and ref. 10 (SI Appendix, Fig. S2D). Together, this implies that DEEPEST is more specific on simulated and real data and identifies more high confidence fusions on real data.

Because fusions between exons that are closer to each other than 1 MB in the reference genome and transcribed on the same strand could be due to local DNA variation or transcriptional or posttranscriptional splicing, for example, into circular RNA (circRNA) (31), we define an “extreme fusion” to be a fusion that joins exons that are farther than 1 MB apart, are on opposite strands, or are on different chromosomes and profile the distribution of DEEPEST-called fusions as a function of extreme characteristics. Around 24% of fusions have both partner genes transcribed from the same chromosome and strand and within 1 MB, 22% are on the same chromosome and strand but separated by at least 1 MB, 23% are strand crosses with genes being on opposite strands, and 31% are interchromosomal fusions (Fig. 2A).

DEEPEST finds 1,486 recurrent fusions (512 distinct recurrent fusions), called in at least 2 tumors within a tumor type (Fig. 2B). Many gene fusions are detected in diverse cancers, for example, MRPS16–CFAP70 and FGFR3–TACC3 (10 cancer types) (Fig. 2C). Restricted to a single tumor type, most fusions have low levels of recurrent gene fusions with exceptions of the well-known TMPRSS2–ERG in prostate adenocarcinoma (PRAD) (182 samples, 36.3% of tumor samples), PML–RARA in acute myeloid leukemia (LAML) (14 samples, 8% of tumor samples), and DHRS2–GSTM4 in bladder urothelial carcinoma (BLCA) (Fig. 2D).

Around 41% of DEEPEST’s 31,007 fusions (12,196 fusions) had not been detected by previous fusion studies on TCGA (SI Appendix, Fig. S2C). Far fewer fusions are found only by one of the other algorithms (4,402 fusions in TumorFusions and 5,860 fusions in ref. 10) (SI Appendix, Fig. S2C). We further investigated DEEPEST-only fusions and queried them through FusionHub portal (https://fusionhub.persistent.co.in/) to see if they are present in any other fusion database and found that 9,272 distinct fusions (i.e., gene pairs) were not present in any other fusion database (Dataset S1). Included in this list are 157 previously unreported recurrent fusions (Dataset S1 and SI Appendix, Fig. S3), including a recurrent fusion for PRAD involving SCHLAP1, a long noncoding RNA (LncRNA) known to have driving oncogenic activities in the prostate cancer (32).

Fig. 2. The landscape of detected fusions. (A) The relative position of the partner exons in the detected fusions. (B) The number of recurrent fusions for each tumor type. (C) The most recurrent fusion for each tumor type. (D) Fusions with the most diverse tumor types.
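The recurrence counts in panel B rest on the definition used in the text: a fusion is recurrent when it is called in at least 2 tumors within a tumor type. A minimal sketch of that counting rule follows; the input layout (tuples of tumor type, sample identifier, and fusion name) and the function name are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def recurrent_fusions(calls, min_samples=2):
    """Return {(tumor_type, fusion): n_samples} for recurrent fusions.

    `calls` is an iterable of (tumor_type, sample_id, fusion) tuples.
    Repeated calls of the same fusion in one sample are counted once.
    """
    samples = defaultdict(set)
    for tumor_type, sample_id, fusion in calls:
        samples[(tumor_type, fusion)].add(sample_id)
    return {
        key: len(ids) for key, ids in samples.items() if len(ids) >= min_samples
    }


# Toy input with made-up sample identifiers.
toy_calls = [
    ("PRAD", "s1", "TMPRSS2-ERG"),
    ("PRAD", "s2", "TMPRSS2-ERG"),
    ("PRAD", "s2", "TMPRSS2-ERG"),   # duplicate call in the same sample
    ("BLCA", "s3", "FGFR3-TACC3"),
    ("GBM", "s4", "FGFR3-TACC3"),    # same fusion, different tumor type
]
print(recurrent_fusions(toy_calls))
```

Because recurrence is counted within a tumor type, a fusion seen once in each of two tumor types is not reported as recurrent under this rule.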

Statistical Analysis of Prevalence of Fusion Calls in Large Data Queries Improves Precision of Calls. The SBT refinement step is a critical innovation of the DEEPEST pipeline and substantially improved its specificity by removing fusion calls likely to be FPs. Such fusions passed the first component of DEEPEST but were flagged by the statistical SBT refinement step due to multiple hypothesis testing (SI Appendix, Fig. S4A). The main power of the refinement step is that it is agnostic to gene name or ontology (such as pseudogenes, paralogs, synonymous genes, and duplicated genes) used by conventional filters, which can lead to true positives being removed from lists (e.g., RUNX1–RUNX1T1). Furthermore, most fusions (89%) removed during the SBT refinement step would have passed such filters (SI Appendix, Fig. S4B).

To further evaluate the performance of the SBT-based refinement component, we extracted 2 groups of fusions called by the junction nomination component: 1) likely FPs (LFPs) that are fusions with SBT hits in the GTEx samples and 2) likely true positives (LTPs) that are fusions shared between the DEEPEST first component, ref. 10, and TumorFusions. Note that LFP fusions are defined on the basis of GTEx data, whereas the SBT refinement step does not “touch” GTEx data, meaning that the LFP set can be used as a test set for the specificity of the SBT refinement step. A fusion is removed by the SBT refinement step if its 2-sided binomial P value is less than 0.05, an arbitrary and conventional statistical threshold. To evaluate the precision of the SBT step, we treated each 2-sided P value as a continuous measure and stratified the P values for LFP and LTP sets (SI Appendix, Fig. S4C). The P values of LFP fusions (with GTEx SBT hits) are significantly smaller than those of LTP fusions (Mann–Whitney U test, P < 2.2e−16), implying that the majority of them are filtered out by the SBT statistical refinement step. On the other hand, LTPs possess higher 2-sided P values, which indicates that they would pass the refinement step. In summary, while the statistical refinement step improves DEEPEST calls, the approach outlined is also a general methodology for increasing the precision of RNA variant calls on massive datasets.

DEEPEST Identifies LncRNAs Are Prevalent in Fusions. A major category of fusions identified by DEEPEST are fusions that involve LncRNAs, which have been overlooked by other methods due to bioinformatic or heuristic filters. Fusions involving well-studied (and thus named) LncRNAs have been appreciated to have oncogenic potential (21, 33, 34). Due to their functions in regulating cellular homeostasis, fusions that involve LncRNA have the potential to contribute to or drive oncogenic phenotypes (34, 35). Genome-wide pan-cancer analyses have not previously profiled those involving LncRNAs (as annotated by the Ensembl database release 89) with the LINC annotation. We found that LncRNAs are abundant in the fusions found in tumors (10% of fusion calls involve a LncRNA) and that 20% of tumors are found to have at least one fusion involving a LncRNA (Dataset S4). A large fraction of LncRNA fusions (∼30% or 994 fusions) involves LINC RNAs, which have been overlooked by previous methods due to their biases and heuristic filters for discarding “uncharacterized” genes.

Prevalence of Fusions Found by DEEPEST Is Correlated with Genome Instability. As another computational test of whether DEEPEST maintains high precision in a variety of solid tumors that are cytogenetically more complex than LAML, we tested whether the abundance of detected fusions per tumor is correlated with the mutation rate of TP53, which is an orthogonal measure of a tumor’s genome instability (36). The average number of fusions per sample identified by DEEPEST has a higher (and significant) correlation with TP53 mutation rate across tumor types compared with ref. 10 and TumorFusions (Pearson correlation: 0.497, 0.38, and 0.31, respectively; Spearman’s correlation: 0.637, 0.596, and 0.54, respectively; Fig. 3A). We found that there is a significant correlation between 2 orthogonal measures of genomic instability: fusion abundance and TP53 mutation frequency. DEEPEST calls more fusions in tumors with high TP53 mutation rates while retaining tight control of FPs in other less cytogenetically complex tumors.

Fig. 3. Association of fusions with genomic instability. (A) DEEPEST fusions are significantly correlated with the TP53 mutation frequency. (B) Detected fusions are highly enriched in genes cataloged by COSMIC. For all tumor types, except for uterine carcinosarcoma (UCS), UVM, TGCT, thymoma (THYM), UCEC, pheochromocytoma and paraganglioma (PCPG), kidney chromophobe (KICH), and adrenocortical carcinoma (ACC), the observed fraction significantly exceeds the expected fraction based on the null hypothesis of random pairing (Bonferroni-corrected FDR < 0.05). [Panel A axes: TP53 mutation frequency (%) vs. average number of fusions, by tumor type; Spearman’s rho = 0.6370, p < 10−4; Pearson’s r = 0.4967, p = 0.003. Panel B: fraction of samples w/ COSMIC fusions, expected vs. detected, by tumor type.]

DEEPEST Identifies a Positive Selection on Fusions Containing Known Oncogenes. Gene fusions could arise from random reassortment of DNA sequences due to deficiencies in the structural integrity of DNA in tumors with no functional impact, or they can be driving events, such as BCR–ABL1 in chronic myelogenous leukemia (1). In our first global computational tests of whether fusions are passengers or drivers, we tested whether DEEPEST-called fusions are enriched in genes known to play oncogenic roles. Since gene ontologies are not used by DEEPEST, this analysis provides an independent test of whether fusions are enriched for genes in known cancer pathways. For each tumor type, we tested for enrichment of the 719 genes present in the COSMIC Cancer Gene Census database (22). Under the null hypothesis of random pairing of genes in fusions, we compared the observed to expected fraction of samples containing at least one fusion with a COSMIC gene by conditioning on the total number of fusions detected for a tumor type, where tumor types with more detected fusions are expected

to have a higher ratio of samples with COSMIC fusions (Materials and Methods). The largest enrichment for COSMIC genes is in PRAD (3-fold change vs expected fraction; P < 1e−6), THCA (4.9-fold change vs expected fraction; P < 1e−6), and LAML (5.6-fold change vs expected fraction; P < 1e−6) (Fig. 3B and Dataset S3). This is expected because the most frequent gene fusions in PRAD involve the ETS family of transcription factors (which are cataloged as COSMIC genes), THCA tumors are highly enriched for kinase fusions, and LAML is a disease where fusions, including known drivers, have been intensively studied, and therefore their partners are annotated as COSMIC genes. Most tumor types lack prevalent recurrent gene fusions, and thus there is no a priori bias that fusions will be enriched for COSMIC genes in other tumor types.

In PRAD, SARC, ESCA, UCS, and OV, the fraction of samples with fusions containing a COSMIC partner exceeds 50%, a rate much greater than expected by chance, the null fraction of samples with COSMIC fusions is 45% for SARC and less than 40% for other tumor types (Fig. 3B and Dataset S3). In more than 90.7% (Bonferroni-corrected FDR < 0.05) of the tumor samples we studied, COSMIC genes are statistically enriched above the background rate. Together this is strong evidence for a positive selection pressure on gene fusions in various tumor types, including cancers such as OV, where fusions are currently not considered to play a driving role.

Statistical Analysis of Rare Fusions Shows a Selection in More Than 11% of TCGA Tumors. Fusion recurrence is considered to be evidence that a fusion plays a driving role. This argument grew out of work focused on point mutations in cancer (37). However, the total number of possible gene fusions (the sample space) greatly exceeds the sample space of point mutations. The number of potential gene fusions scales quadratically with the number of genes in the genome (in the samples we analyzed, ∼22,000 genes were expressed). This means that there are up to 625 million potential gene fusions, more than an order of magnitude greater than the number of possible point mutations that is bounded by the number of protein-coding bases in the transcribed genome (∼30 × 10^6). Therefore, fusions could be strongly selected for in tumors even without observing high levels of recurrence. If a moderate fraction of human genes could function as oncogenes when participating in fusions, rare fusion expression is expected in a population-level survey, even one as large as the TCGA cohort.

To account for this effect, we formalized a statistical test for whether the prevalence of rare recurrent fusions fits a model of neutral selection by a null distribution where fusion expression arises by chance, the theory of which was worked out in ref. 38 (Materials and Methods). We mapped the probability of observing recurrent gene fusions to a familiar problem in statistics: if k balls (corresponding to the number of observed fusions) are thrown into n boxes (corresponding to the total number of possible gene pairs), how many boxes are expected to have c or more balls? In other words, given the number of detected fusions, how many of them are expected to be called for at least c samples?

The most prevalent fusions expected under neutral selection would be observed only 2 times, and we would expect to observe only 5 such fusions (Fig. 4), making this and thousands of other fusions highly unlikely to be observed under the null hypothesis.

Fig. 4. Statistical analysis of recurrent fusions. Observed number of recurrent fusions that occur more than x times is significantly higher than the expectation and the upper 99% CI expected in the null (Benjamini–Yekutieli FDR control at level 0.01). [Axes: number of occurrences vs. number of fusions; series: recurrent fusions, upper 99% CI, expected.]

detected gene fusion, after TMPRSS2–ERG, across the entire TCGA cohort and supports findings by previous studies that these fusions have a driving role (42, 43) (Fig. 2C). Globally, 14% of the fusions (1,486) found by DEEPEST are observed at higher rates than expected by chance (P < 1e−6); more than 11.9% of tumors (1,181) have recurrent fusions (Fig. 4 and Dataset S1).

Recurrently Fused Genes Distinguish Tumors from Nonneoplastic Tissue and Are Fused in More Than 30% of TCGA Tumors. If many genes could serve as oncogenic fusion partners, fusions under selection could be private, yet partners could be much more prevalent than would be expected by chance. To test whether 3′ or 5′ partner genes are overrepresented in fusions found by DEEPEST in the TCGA cohort, we used the “balls in boxes” null distribution above, where boxes correspond to all possible 3′ (respectively 5′) partners (expressed genes) and balls correspond to the total number of fusion pairs (i.e., 31,007 fusions) detected across all samples. We map the coincidence of c balls in one box to c distinct 5′ (resp. 3′) partner genes being paired with one 3′ (resp. 5′) partner and call genes with statistically significant numbers of 5′ and 3′ partners “significantly fused” (Fig. 5 A and B).

The number of significantly fused 5′ and 3′ partners is large: DEEPEST reports 864 recurrent 5′ partners and 378 recurrent 3′ partners, both having P values of <1e−5 (Fig. 5B), when only 110 genes with more than 6 partners would be expected by chance (Dataset S2); 190 and 48 genes are found in fusions as significantly fused 5′ and 3′ partner genes with more than 12 partners, respectively, when no such genes would be expected by chance. The most significant 5′ partner gene is FRS2, a docking protein that is critical in FGF receptor signaling (44); FRS2 fusions are detected in 52 tumors or in 0.5% of TCGA cases. Other highly significant recurrent partners include PVT1, ERBB2 (HER2), known oncogenes, and tumor suppressors such as MDM2, which negatively regulates TP53 (45) and UVRAG (46) (Fig. 5C). The most promiscuous 3′ partner genes are CPM, a gene regulating innate immune development (39 partners), and the gene C1QTNF3–AMACR (61 partners) (Fig. 5C). Other genes with the highest numbers of distinct 5′ partners include CDK12, a cyclin-dependent kinase emerging as a target in cancer therapy (47), and well-known tumor suppressors such as RAD51B (48).
We also found 31 noncoding RNAs as Controlling for multiple hypothesis testing, this analysis recov- significantly fused genes. PVT1, noncoding RNAs of unknown ers several known recurrent fusions including TMPRSS2–ERG, function: AC134511.1, AC025165.3, and LINC00511 have the PML–RARA, FGFR3–TACC3, and DHRS2–GSTM4 (4, 39, 40). most 30 partners; and BCAR4, PVT1, and noncoding RNAs of This analysis reveals evidence that recurrent fusions are unknown function: AP005135.1 and AC020637.1 have the most selected for in diverse tumors; RPS6KB1–VMP1, a fusion 50 partners (Dataset S2). While some of these noncoding RNAs between the ribosomal protein kinase (41) and a vacuolar pro- such as PVT1 (49), LINC00511 (50), and BCAR4 (51) have tein (VMP1) present in 8 tumor types, is the most prevalent been shown to act as oncogenes, our findings call for further

investigation into the potential driving roles of other significantly fused noncoding RNAs.

While the list of significantly fused genes includes well-known cancer genes as described above, only 15.5% of them (193 genes) are currently annotated as COSMIC genes, i.e., 888 significantly fused genes are previously unreported candidate oncogenes, calling for further functional investigation of these genes. This is an enrichment for COSMIC annotation but is not exhaustive: significantly fused genes are a large class of potential oncogenes and tumor suppressors that gain or lose function through gene rearrangement rather than through point mutation.

To functionally annotate significantly fused partner genes, we carried out gene ontology (GO) enrichment analysis and found the highest enrichment (Binomial test, Bonferroni-corrected FDR < 0.05) in cancer pathways such as the androgen signaling pathway, ERBB signaling pathway, and ephrin receptor signaling pathway (Fig. 5D and Dataset S2).

Fig. 5. (A) Significantly fused genes: those paired with multiple partners (counted by counting distinct fusion pairs). (B) Significantly fused genes are observed at rates higher than expected by chance; genes with >10 partners would be unlikely to occur by random assortment of genes into fusions (FDR-corrected P < 0.01). (C) Genes with the highest number of distinct gene partners. (D) GO enrichment analysis of significantly fused genes shows enrichment in pathways known to regulate cancer growth. (E) Tumors are highly enriched for extreme fusions involving significantly fused genes.

To further support the role of significantly fused genes in cancer, we evaluated the rate that such genes are detected in TCGA tumor and GTEx samples as a function of 1) the number of partners and 2) the nature of the rearrangement underlying the fusion: “extreme” events are those that bring together 2 exons that are farther than 1 MB apart in the reference, on opposite strands, or on different chromosomes; all other events are “nonextreme” (Fig. 5E). Nonextreme events could arise through small scale genomic duplication or transcriptional readthrough coupled to “back-splicing” to generate circRNA (20, 23).

Globally, DEEPEST detects fusions including significantly fused genes (>10 partners) at a much higher rate in TCGA tumors (7,050 fusions in ∼34% of samples) than in GTEx controls (29 such fusions in ∼9% of samples), despite GTEx samples being sequenced at an average depth of 50 million reads, similar depth to tumor samples.

The deviation between the fraction of such fusions in TCGA versus GTEx roughly increases with the number of partners of the significantly fused gene: among those with at least 23 partners, only 2 such fusions are detected in GTEx (0.7% of samples), while 1,845 such fusions are detected in 1,202 TCGA tumors (12.1% of samples). Notably, the 2 fusions detected in GTEx, PVT1–MYC and FRS2–CPSF6, are both “nonextreme”: the detected fusions are splicing between 2 genes transcribed in the same orientation with promoters <200 kB from each other, events which could arise from somatic or germline variation or transcriptional readthrough.

Fusions involving significantly fused genes in TCGA and GTEx samples have distinguishing structural features. The large majority of fusions in tumors arise from extreme rearrangements (SI Appendix, Fig. S6) regardless of the number of partners a significantly fused gene contains. More than 90% of TCGA fusions that involve significantly fused genes with at least 23 partners are extreme, whereas no such GTEx fusions are extreme (SI Appendix, Fig. S6). This again implies a tumor-specific selection for extreme fusions, which increase the complexity of partners available to significantly fused genes. Together, analysis of rare recurrent gene fusions and recurrent 5′ and 3′ partners identifies hundreds of candidate oncogenes, which constitute a significant fraction of gene fusions.

DEEPEST finds higher enrichment of significantly fused genes in tumors compared with other TCGA fusion lists (SI Appendix, Fig. S7), calling ∼50% and ∼70% more samples with fusions in significantly fused genes with >10 and >20 partners, respectively, compared with recent studies (>10 partners: DEEPEST: 3,705; ref. 10: 2,570; TumorFusions: 2,787 samples; >20 partners: DEEPEST: 1,479; ref. 10: 823; TumorFusions: 958 samples). While 7.6% of DEEPEST fusions have a gene with >20 partners, only 4.8% and 5.6% of fusions in ref. 10 and TumorFusions, respectively, have such genes. FRS2 is the gene with the highest number of 3′ partners in all 3 lists; however, DEEPEST identifies 65 3′ partners, which is larger than the other 2 lists: 41 and 52 partners found by ref. 10 and TumorFusions, respectively.

Lung Adenocarcinoma and Serous OV Have High Statistical Enrichment for Kinase Fusions. The most common genetic lesion in OV and lung adenocarcinoma (LUAD) is TP53 mutation, present in 85.8% of OV and 52.12% of LUAD cases (cBioPortal; retrieved 2018 Nov 19) (52), although there is a debate in the

literature that this prevalence is an underestimate. However, TP53 mutations are not sufficient to cause cancers (53). In OV, the explanatory driving events are as yet unknown (54). We tested the hypothesis that genome instability in OV could generate fusions responsible for driving some fraction of these cancers, which might have been missed because of shortcomings in fusion detection sensitivity. The rate of kinase fusions is statistically significantly higher than would be expected by chance, supporting a selection for and driving role of kinase fusions in these tumor types. DEEPEST predicts that 37% of ovarian tumors (Binomial test, P < 1e−5) and 25% of lung adenocarcinoma tumors (Binomial test, P < 1e−5) contain kinase fusions (Fig. 6A and Dataset S3), a rate higher than what would be expected based on the null assumption of random pairing of genes in fusions. Other cancers with high enrichment of kinase fusions include THCA (13.3% of samples; P < 1e−6), head and neck squamous cell carcinoma (HNSC) (16% of samples; P < 1e−6), and cervical squamous cell carcinoma and endocervical adenocarcinoma (CESC) (15.7% of samples; P = 7.7e−5) (Fig. 6A and Dataset S3).

Positive Selection for Fusions to Rewire the Cancer Proteome. To test if there is selection on the protein domains included in fusions, we compared the rate at which each protein domain occurs in the reference proteome to its prevalence in the DEEPEST-called fusion proteome. This analysis identified a set of 120 domains that are statistically enriched in fusion proteins. The most highly enriched domains are AT hook, a DNA binding motif found for example in the SWI/SNF complex; NUP84-NUP100, a domain present in some nucleoporins; and Per1, a domain involved in lipid remodeling, all present at 15 times higher frequency than the reference proteome (P << 1e−10) (Dataset S4). Tyrosine kinase domains are 1.8-fold enriched in fusions compared with the reference proteome (P << 1e−10). To functionally characterize the 120 domains enriched in fusion proteins, we performed GO enrichment analysis using the dcGOR R package (55) and identified overrepresented biological processes among these domains (Binomial test, Benjamini–Yekutieli-corrected FDR < 0.05): the enriched domains were involved in (anaerobic) electron transport, chromosome condensation and organization, and DNA metabolism or organization (Fig. 6B and Dataset S4).

To find the set of domain pairs enriched in fusions, we compared the observed frequency of each domain pair against the null probability of random pairing between domains; 226 domain pairs are enriched above background (Bonferroni-corrected FDR < 0.05). Among the highest enriched domain pairs are NHR2–RUNT, RUNT–TAFH, and RUNT–zf-MYND in the in-frame RUNX1–RUNX1T1 detected in LAML samples (Dataset S4).

Because enrichment of protein domain pairs could be sensitive to how we model the null distribution, we formulated a test for selection of fusion proteins containing 2 in-frame domains where the “most pessimistic” null distribution for our problem can be computed in closed form. This analysis considers only fusions whose 5′ and 3′ parent genes contain only one annotated domain. Out of 3,388 fusions with 1-domain parental genes, 681 fusions with 2 domains were observed, whereas only 282 were expected by chance under a closed-form, conservative null distribution (P < 1e−5) (SI Appendix), strong evidence for selection of such fusions that couple intact domains in the fusion protein.

In addition to the above enrichment, 17% of all DEEPEST fusions result in proteins that have protein domain pairs that do not exist in the reference proteome. These pairs include well-known driving fusions such as the domain pairs Pkinase_Tyr–TACC and I-set–TACC in FGFR3–TACC3 but also include 9,500 other domain pairs not found in the reference proteome, which implies their potential for tumor-specific function (Dataset S4).

Fig. 6. Protein domain analysis. (A) Analysis of the fraction of samples containing kinase fusions reveals that THCA, CHOL, LUAD, OV, and many other tumor types have significantly high enrichment of kinase fusions in addition to high overall rates. Axes: expected fraction of samples with kinase fusions (x) vs. fraction of samples with kinase fusions (y), with the line y = x. (B) GO analysis identifies enrichment of cellular metabolism and DNA organization in the protein domains enriched in all fusion transcripts.

Discussion

Some of the first oncogenes were discovered with statistical modeling that linked inherited mutations and cancer risk (56). The advent of high-throughput sequencing has promised the discovery of novel oncogenes, which can inform basic biology and provide therapeutic targets or biomarkers (57, 58). However, unbiased methodologies for the discovery of novel oncogenic gene fusions have been only partially successful.

DEEPEST is a unified, reproducible statistical algorithm to detect gene fusions in large-scale RNA-Seq datasets without human-guided filtering. DEEPEST has significantly lower FP rates than other algorithms. The unguided DEEPEST filters have not sacrificed detection of known true positives. Further, DEEPEST assigns a statistical score that can be used to prioritize fusions on the basis of statistical support, rather than the absolute read counts supporting the fusion. Such a statistical score is unavailable in other algorithms but is of potential scientific and clinical utility, as the discovery rate and the tradeoff between sensitivity and specificity of DEEPEST can be tuned by modifying the threshold on scoring.

Although many likely driving and druggable gene fusions have been identified by high-throughput sequencing, studies reporting them have either a nontested or nontrivial FP rate even using heuristic or ontological filters, making those fusions

unreliable for clinical use. Similar problems also limit the sensitivity of screens of massive datasets to discover novel oncogenes, or signatures of evolutionary advantage for rare or private gene fusions. To illustrate DEEPEST's potential clinical contribution, we integrated DEEPEST-detected fusions with a recent curated precision oncology knowledge base (OncoKB) (59), with drugs stratified by evidence that they interact with specific proteins, and found druggable fusions in 327 tumors (3.3%) (SI Appendix, Fig. S8). However, the current list of druggable fusions is biased toward genes that have been classically considered oncogenes, and mainly kinases. By increasing the precision of fusion calls, the analysis in this paper opens the door for further functional and clinical studies to better identify drug targets and repurpose already validated drugs that could target fusions.

The DEEPEST algorithm improves detection of gene fusions that have been missed by other algorithms' lists of “high confidence” gene fusions. Analysis of these fusions uncovers fundamental cancer biology. First, we find evidence that gene fusions are more prevalent than previously thought in a variety of tumors, including high grade serous ovarian cancers. Second, our computational analysis suggests the hypothesis that gene fusions involving kinases, and perhaps other genes, are drivers of these cancers. Further, fusions in ovarian cancers and most other tumor types are under a selection to include gene families that are known to drive cancers, such as kinases and genes annotated as COSMIC.

DEEPEST allows for rigorous and unbiased quantification of gene fusions at annotated exonic boundaries and for tests of whether partners in gene fusions are present at greater frequencies than would be expected due to chance. The results in this paper establish statistical evidence that rare or private gene fusions and the partner genes involved in fusions are under a much greater selective pressure than previously appreciated: under a highly stringent definition of an enriched fusion partner, more than 10 to 20% of all TCGA tumors profiled harbor a gene fusion including a gene under selection by tumors to be involved in gene fusions. Future work that requires large scale profiling of normal cohorts will distinguish whether the fusions including these genes that are found in GTEx arise due to variation at the DNA sequence, transcriptional, or posttranscriptional levels. Significantly, this discovery required comprehensive statistical analysis of rare gene fusions, using large numbers of samples to increase power to detect selection for gene fusion expression by tumors.

Together, the results in this paper lead to a model that gene fusions may be tumor-defining lesions rather than, like point mutations, present across tumors, and suggest that by focusing on one tumor type and relying on classical metrics for recurrence to detect selection, some important cancer biology is lost. Further, the computational evidence in this paper suggests that rare fusions are drivers of a substantial fraction of tumors.

Materials and Methods

An Enhanced Statistical Fusion Detection Framework for Large-Scale Genomics. We used the discovery set (Fig. 1) to generate a list of fusions passing the MACHETE statistical bar in any sample of the discovery set (Fig. 1). We then queried all datasets for any fusions found in the discovery set by SBTs (25). SBTs are data structures developed to quickly query many files of short-read sequences from RNA-Seq data (and other data) for a particular sequence. These structures build on the concept of Bloom filters. We used standard binomial CIs to test for consistency between the rate at which fusions were found by MACHETE in the samples that were used in the discovery step and the incidence of each fusion sequence estimated with the SBT across the entire dataset. Fusion sequences that were more prevalent in the SBT than is statistically compatible with the predicted prevalence rate from the discovery set were excluded from the final list of fusions (Fig. 1). SI Appendix contains technical details about the methodology used.

For intuition on why this step is important, consider the scheme in Fig. 1: a sequencing error in a read that has been generated from a single gene could result in a sequence that is more similar to a candidate exon-exon junction sequence that spans a fusion between 2 homologous genes. There is a difference between MACHETE and the SBT: MACHETE considers the alignment profile of all reads aligning to a junction, including reads with errors or mismatches with the junction, whereas SBTs will not. Consequently, reads that would have been censored by MACHETE as evidence of artifacts are not censored by the SBT for the same query sequence, which will lead to both FPs and FNs by the SBT step. The SBT, like other algorithms, has a high FP rate, due to 1) FPs intrinsic to the Bloom filters used in the SBT (the Bloom filter itself has an FP rate) and 2) events such as described above (another example of an FP is a ligation artifact between 2 highly expressed genes). As depicted in Fig. 1, if such an artifact passes the MACHETE statistical threshold in even a single sample in the discovery step, it will be included as a query sequence for the SBT step, and the SBT could falsely detect it at a high frequency because the statistical models used by MACHETE are not being used by the SBT (Fig. 1). Testing for the consistency of the rate at which each sequence is detected in the discovery set with its prevalence estimated by the SBT controls for the multiple testing bias leading to increased FPs (Fig. 1).

We built SBT index files for all TCGA samples across various TCGA projects using default parameters of SBT (k-mer index size 20 and a minimum count of 3 for adding a k-mer to a Bloom filter). The 40mer flanking the fusion junction (20 nucleotides on the 5′ side and 20 nucleotides on the 3′ side) is retrieved for each fusion nominated by the junction nomination component, and each TCGA tumor type is queried for the fasta file containing all the 40mers for the fusions called by the first component. For the sensitivity threshold, which determines the required fraction of k-mers in the query sequence that should be found for a hit, we used a more stringent value of 0.9 (instead of the default value 0.8) to improve the specificity of DEEPEST. After SBT querying, the detection frequencies of each fusion by the first component and by the SBT are compared; if they are statistically consistent, the fusion passes the SBT refinement step; otherwise, it is discarded. Technical details of the statistical framework, the SBT query step, and postprocessing of DEEPEST output files are provided in SI Appendix.

Null Probability for Recurrent Fusions. If n fusions are detected by DEEPEST across all tumors, let X be the number of recurrent fusions. If g is the number of different gene names, there are g(g − 1) possible fusions. In our analysis, g = 22,000 and we have reported 31,007 fusions in the DEEPEST report files. The probability that no fusion is recurrent can be computed using the Poisson approximation for the birthday problem with coincidences: Prob(X = 0) ≈ e^{−λ} = 0.451, where λ = n(n − 1)/(2g(g − 1)). As shown in Fig. 4, the expected value of the number of recurrent fusions (with a frequency of at least 2) is 5.

Calculations for the Expected Number of Recurrent 5′ and 3′ Partners. To test the likelihood of observing our results, we use a statistical model of the probability of observing as many or more recurrent 5′ partners under the assumption that the genes in each fusion pair are chosen randomly from all expressed genes. To do this, we use the generalized birthday model from ref. 38. First, we consider all potential 5′ genes (there are g = 22,000 expressed genes) as boxes, each of which represents a 5′ partner gene. Next, we consider the distribution of the number of distinct 3′ partners for each gene if 3′ partner genes, which we take to be numbered balls, were thrown at random into boxes. When the first ball arrives in box j_1, this represents that the first observed fusion on our list has gene j_1 on its 5′ side and the gene represented by the ball on its 3′ side. At the end of this process, we have thrown k = 31,007 fusions (balls) into n = 22,000 boxes.

We perform the following calculations to find significant recurrent 5′ partner genes. For each c, the number of balls occupying a single box, we can calculate t_c = k/n^{1−1/c}. The distribution of X_{g,c}, the number of boxes (5′ genes) that have at least c balls (distinct 3′ partners), has been shown to be a Poisson distribution Po(t_c^c/c!) (ref. 38, theorem 2.2). For each c, we find the expected number of boxes with at least c balls (5′ genes that have at least c partners) and the upper 99% CI of the number of boxes according to the null distribution. For statistical analysis, a significance level of 0.01 was considered. Moreover, since we are testing multiple hypotheses in our analysis, we adopt the Benjamini–Hochberg–Yekutieli FDR control procedure (60) and correct the significance level for each value of c. For each c, we construct the CI at level (1 − corrected significance level) (Fig. 5B). Similarly, we can find the expectation and upper CI (after Benjamini–Hochberg–Yekutieli correction) for each number of recurrent 3′ gene partners (Fig. 5B). We provide

a table of P values for each observation of the number of recurrent genes with at least c partners in Fig. 5 by the formula 1 − F(# 3′ (5′) partners with at least c 5′ (3′) partners), where F(·) is the cumulative distribution function of the Poisson distribution Po(t^c/c!) (Dataset S2).

Software Availability. The DEEPEST workflow, in which all needed software is preinstalled, and all custom scripts used for analysis of fusions are available at ref. 27. Also, a publicly available online tool with a web interface is available on the Cancer Genomics Cloud at ref. 26.

ACKNOWLEDGMENTS. We thank Steven Artandi for useful discussions, and members of J.S.'s laboratory for feedback on the manuscript. J.S. is supported by National Institute of General Medical Sciences Grant R01 GM116847, NSF Faculty Early Career Development Program Award MCB-1552196, a McCormick–Gabilan Fellowship, and a Baxter Family Fellowship. J.S. is also an Alfred P. Sloan Fellow in Computational & Evolutionary Molecular Biology. R.D. is supported by Cancer Systems Biology Scholars Program Grant R25 CA180993. This research benefited from the use of credits from the NIH Cloud Credits Model Pilot, a component of the NIH Big Data to Knowledge Program.
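The “balls in boxes” null described in Materials and Methods can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names (`poisson_rate`, `poisson_cdf`, `null_p_value`) are hypothetical, and the sketch assumes only what the text states, namely that the count of boxes holding at least c balls is approximated by Po(t^c/c!) with t = k/n^(1−1/c), and that a P value is taken as 1 − F(observed). It is not expected to reproduce the published counts, which depend on inputs and corrections described in SI Appendix.

```python
import math


def poisson_rate(k: int, n: int, c: int) -> float:
    """Poisson mean t**c / c! with t = k / n**(1 - 1/c).

    k: number of balls (fusions), n: number of boxes, c: coincidence size.
    """
    t = k / n ** (1.0 - 1.0 / c)
    return t ** c / math.factorial(c)


def poisson_cdf(x: int, lam: float) -> float:
    """F(x) = P(X <= x) for X ~ Poisson(lam).

    Terms are evaluated in log space so that large means do not
    underflow to zero before the sum reaches the bulk of the mass.
    """
    if x < 0:
        return 0.0
    if lam == 0.0:
        return 1.0
    log_lam = math.log(lam)
    total = 0.0
    for i in range(x + 1):
        total += math.exp(-lam + i * log_lam - math.lgamma(i + 1))
    return min(total, 1.0)


def null_p_value(observed: int, k: int, n: int, c: int) -> float:
    """1 - F(observed) under the Poisson null.

    Double precision limits this to P values above roughly 1e-16;
    smaller values are returned as 0.0.
    """
    return max(0.0, 1.0 - poisson_cdf(observed, poisson_rate(k, n, c)))
```

As a sanity check on the approximation, the classical birthday problem (k = 23 people, n = 365 days, c = 2) gives a mean of 529/730 and a probability of no shared birthday of about 0.48. For the partner analysis in the text one would call, for example, `null_p_value(observed, k=31007, n=22000, c=7)` with the observed number of genes having at least 7 partners.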

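The multiple-testing correction named in Materials and Methods (Benjamini–Hochberg–Yekutieli FDR control) can be illustrated with the standard Benjamini–Yekutieli step-up rule. The sketch below assumes that this standard rule is the variant intended; the function name `benjamini_yekutieli` and the example P values are hypothetical and are not taken from the paper, whose exact per-c correction is described in SI Appendix.

```python
def benjamini_yekutieli(pvalues, alpha=0.01):
    """Return a list of booleans: True where the hypothesis is rejected.

    Standard BY step-up rule: sort the m P values, find the largest
    rank i with p_(i) <= i * alpha / (m * c_m), where
    c_m = 1 + 1/2 + ... + 1/m, and reject every hypothesis whose
    rank is at most i.
    """
    m = len(pvalues)
    if m == 0:
        return []
    c_m = sum(1.0 / i for i in range(1, m + 1))
    order = sorted(range(m), key=lambda idx: pvalues[idx])
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank * alpha / (m * c_m):
            max_rank = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected
```

The harmonic factor c_m makes the rule valid under arbitrary dependence between tests, at the cost of being more conservative than plain Benjamini–Hochberg. A call such as `benjamini_yekutieli([0.0001, 0.2, 0.0004, 0.03], alpha=0.05)` rejects the first and third hypotheses only.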
1. D. A. Hungerford, A minute chromosome in human chronic granulocytic leukemia. Science 132, 1497–1499 (1960).
2. M. Soda et al., Identification of the transforming EML4–ALK fusion gene in non-small-cell lung cancer. Nature 448, 561–566 (2007).
3. S. A. Tomlins et al., Role of the TMPRSS2-ERG gene fusion in prostate cancer. Neoplasia 10, 177–188 (2008).
4. D. Singh et al., Transforming fusions of FGFR and TACC genes in human glioblastoma. Science 337, 1231–1235 (2012).
5. J. Zhang, E. R. Mardis, C. A. Maher, INTEGRATE-neo: A pipeline for personalized gene fusion neoantigen discovery. Bioinformatics 33, 555–557 (2017).
6. E. Ragonnaud, P. Holst, The rationale of vectored gene-fusion vaccines against cancer: Evolving strategies and latest evidence. Ther. Adv. Vaccin. 1, 33–47 (2013).
7. X. S. Liu, E. R. Mardis, Applications of immunogenomics to cancer. Cell 168, 600–612 (2017).
8. X. Hu et al., TumorFusions: An integrative resource for cancer-associated transcript fusions. Nucleic Acids Res. 46, D1144–D1149 (2017).
9. B. Alaei-Mahabadi, J. Bhadury, J. W. Karlsson, J. A. Nilsson, E. Larsson, Global analysis of somatic structural genomic alterations and their impact on gene expression in diverse human cancers. Proc. Natl. Acad. Sci. U.S.A. 113, 13768–13773 (2016).
10. Q. Gao et al., Driver fusions and their implications in the development and treatment of human cancers. Cell Rep. 23, 227–238 (2018).
11. N. Stransky, E. Cerami, S. Schalm, J. L. Kim, C. Lengauer, The landscape of kinase fusions in cancer. Nat. Commun. 5, 4846 (2014).
12. K. Yoshihara et al., The landscape and therapeutic relevance of cancer-associated transcript fusions. Oncogene 34, 4845–4854 (2015).
13. Y. Wang, N. Wu, J. Liu, Z. Wu, D. Dong, FusionCancer: A database of cancer fusion genes derived from RNA-seq data. Diagn. Pathol. 10, 131 (2015).
14. F. Abate et al., Bellerophontes: An RNA-Seq data analysis framework for chimeric transcripts discovery based on accurate fusion model. Bioinformatics 28, 2114–2121 (2012).
15. S. Liu et al., Comprehensive evaluation of fusion transcript detection algorithms and a meta-caller to combine top performing methods in paired-end RNA-seq data. Nucleic Acids Res. 44, e47–e47 (2015).
16. M. Carrara et al., State of art fusion-finder algorithms are suitable to detect transcription-induced chimeras in normal tissues? BMC Bioinform. 14, S2 (2013).
17. S. Kumar, A. D. Vo, F. Qin, H. Li, Comparative assessment of methods for the fusion transcripts detection from RNA-Seq data. Sci. Rep. 6, 21597 (2016).
18. P. Bailey et al., Genomic analyses identify molecular subtypes of pancreatic cancer. Nature 531, 47–52 (2016).
19. O. R. Saramäki et al., TMPRSS2:ERG fusion identifies a subgroup of prostate cancers with a favorable prognosis. Clin. Cancer Res. 14, 3395–3400 (2008).
20. G. Hsieh et al., Statistical algorithms improve accuracy of gene fusion detection. Nucleic Acids Res. 45, e126–e126 (2017).
21. N. S. Latysheva, M. M. Babu, Discovering and understanding oncogenic gene fusions through data intensive computational approaches. Nucleic Acids Res. 44, 4487–4503 (2016).
22. S. A. Forbes et al., COSMIC: Exploring the world's knowledge of somatic mutations in human cancer. Nucleic Acids Res. 43, D805–D811 (2014).
23. L. Szabo et al., Statistically based splicing detection reveals neural enrichment and tissue-specific induction of circular RNA during human fetal development. Genome Biol. 16, 126 (2015).
24. M. Lee et al., ChimerDB 3.0: An enhanced database for fusion genes from cancer transcriptome and literature data mining. Nucleic Acids Res. 45, D784–D789 (2016).
25. B. Solomon, C. Kingsford, Fast search of thousands of short-read sequencing experiments. Nat. Biotechnol. 34, 300–302 (2016).
26. M. Jordanski, R. Dehghannasiri, J. Salzman, DEEPEST-Fusion App. Cancer Genomics Cloud. https://cgc.sbgenomics.com/public/apps#jordanski.milos/deepest-fusion/deepest-fusion/. Deposited 13 May 2019.
27. R. Dehghannasiri, M. Jordanski, J. Salzman, DEEPEST-Fusion. GitHub. https://github.com/salzmanlab/DEEPEST-Fusion. Deposited 8 January 2019.
31. J. Salzman, C. Gawad, P. L. Wang, N. Lacayo, P. O. Brown, Circular RNAs are the predominant transcript isoform from hundreds of human genes in diverse cell types. PLoS One 7, e30733 (2012).
32. J. R. Prensner et al., The long noncoding RNA SChLAP1 promotes aggressive prostate cancer and antagonizes the SWI/SNF complex. Nat. Genet. 45, 1392–1398 (2013).
33. C. Lin, L. Yang, Long noncoding RNA in cancer: Wiring signaling circuitry. Trends Cell Biol. 28, 287–301 (2018).
34. M. Huarte, The emerging role of lncRNAs in cancer. Nat. Med. 21, 1253–1261 (2015).
35. F. Kopp, J. T. Mendell, Functional classification and experimental dissection of long noncoding RNAs. Cell 172, 393–407 (2018).
36. J. V. Forment, A. Kaidi, S. P. Jackson, Chromothripsis and cancer: Causes and consequences of chromosome shattering. Nat. Rev. Cancer 12, 663–670 (2012).
37. J. D. Rowley, Chromosome translocations: Dangerous liaisons revisited. Nat. Rev. Cancer 1, 245–250 (2001).
38. N. Henze, A Poisson limit law for a generalized birthday problem. Stat. Probab. Lett. 39, 333–336 (1998).
39. A. Kakizuka et al., Chromosomal translocation t(15;17) in human acute promyelocytic leukemia fuses RARα with a novel putative transcription factor, PML. Cell 66, 663–674 (1991).
40. W. Luo et al., GSTM4 is a microsatellite-containing EWS/FLI target involved in Ewing's sarcoma oncogenesis and therapeutic resistance. Oncogene 28, 4126–4132 (2009).
41. C. Cai et al., miR-195 inhibits tumor progression by targeting RPS6KB1 in human prostate cancer. Clin. Cancer Res. 21, 4922–4934 (2015).
42. K. Inaki et al., Transcriptional consequences of genomic structural aberrations in breast cancer. Genome Res. 21, 676–687 (2011).
43. A. E. Blum et al., RNA sequencing identifies transcriptionally viable gene fusions in esophageal adenocarcinomas. Cancer Res. 76, 5628–5633 (2016).
44. Y. R. Hadari, N. Gotoh, H. Kouhara, I. Lax, J. Schlessinger, Critical role for the docking-protein FRS2α in FGF receptor-mediated signal pathways. Proc. Natl. Acad. Sci. U.S.A. 98, 8578–8583 (2001).
45. M. H. G. Kubbutat, S. N. Jones, K. H. Vousden, Regulation of p53 stability by Mdm2. Nature 387, 299–303 (1997).
46. C. Liang et al., Autophagic and tumour suppressor activity of a novel Beclin1-binding protein UVRAG. Nat. Cell Biol. 8, 688–698 (2006).
47. G. Y. L. Lui, C. Grandori, C. J. Kemp, CDK12: An emerging therapeutic target for cancer. J. Clin. Pathol. 71, 957–962 (2018).
48. J. Thacker, The RAD51 gene family, genetic instability and cancer. Cancer Lett. 219, 125–135 (2005).
49. Y. Guan et al., Amplification of PVT1 contributes to the pathophysiology of ovarian and breast cancer. Clin. Cancer Res. 13, 5745–5755 (2007).
50. C.-C. Sun et al., Long intergenic noncoding RNA 00511 acts as an oncogene in non–small-cell lung cancer by binding to EZH2 and suppressing p57. Mol. Therapy-Nucleic Acids 5, e385 (2016).
51. Z. Xing et al., lncRNA directs cooperative epigenetic regulation downstream of chemokine signals. Cell 159, 1110–1125 (2014).
52. E. Cerami et al., The cBio cancer genomics portal: An open platform for exploring multidimensional cancer genomics data. Cancer Discov. 2, 401–404 (2012).
53. I. Martincorena et al., High burden and pervasive positive selection of somatic mutations in normal human skin. Science 348, 880–886 (2015).
54. D. D. Bowtell et al., Rethinking ovarian cancer II: Reducing mortality from high-grade serous ovarian cancer. Nat. Rev. Cancer 15, 668–679 (2015).
55. H. Fang, dcGOR: An R package for analysing ontologies and protein domain annotations. PLoS Comput. Biol. 10, e1003929 (2014).
56. A. G. Knudson, Mutation and cancer: Statistical study of retinoblastoma. Proc. Natl. Acad. Sci. U.S.A. 68, 820–823 (1971).
57. K. Cibulskis et al., Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nat.
Biotechnol. 31, 213–219 (2013). 28. B. Haas et al., Star-fusion: Fast and accurate fusion transcript detection from RNA-seq. 58. M. S. Lawrence et al., Discovery and saturation analysis of cancer genes across 21 BioRxiv, page 120295 (24 March 2017). tumour types. Nature 505, 495–501 (2014). 29. W. Torres-Garc´ıa et al., Prada: Pipeline for RNA sequencing data analysis. Bioinfor- 59. D. Chakravarty et al., OncoKB: A precision oncology knowledge base. JCO Precis. matics 30, 2224–2226 (2014). Oncol. 1, 1–16 (2017). 30. J. Lonsdale et al., The Genotype-Tissue Expression (GTEx) project. Nat. Genet. 45, 580– 60. Y. Benjamini, D. Yekutieli, The control of the false discovery rate in multiple testing 585 (2013). under dependency. Ann. Stat. 29, 1165–1188 (2001).
