Genome-wide colocalization of RNA–DNA interactions and fusion RNA pairs

Zhangming Yana,b,1, Norman Huanga,b,1, Weixin Wua,b,1, Weizhong Chena,b, Yiqun Jianga,b, Jingyao Chena,b, Xuerui Huangc, Xingzhao Wena,b, Jie Xud, Qiushi Jind, Kang Zhange, Zhen Chenf, Shu Chiena,b,g,2, and Sheng Zhonga,b,2

aDepartment of Bioengineering, University of California, San Diego, La Jolla, CA 92093; bInstitute of Engineering in Medicine, University of California, San Diego, La Jolla, CA 92093; cDivision of Biological Sciences, University of California, San Diego, La Jolla, CA 92093; dDepartment of Biochemistry and Molecular Biology, College of Medicine, The Pennsylvania State University, Hershey, PA 17033; eDepartment of Ophthalmology, University of California, San Diego, La Jolla, CA 92093; fDepartment of Diabetes Complications and Metabolism, Beckman Research Institute, City of Hope, Duarte, CA 91010; and gDepartment of Medicine, University of California, San Diego, La Jolla, CA 92093

Contributed by Shu Chien, December 19, 2018 (sent for review November 20, 2018; reviewed by Ning Jiang and Guo-Cheng Yuan)

Fusion transcripts are used as biomarkers in companion diag- input cells by 100-fold to ∼5 million cells. We used iMARGI noses. Although more than 15,000 fusion RNAs have been iden- to map RNA–DNA interactions in human embryonic kidney tified from diverse cancer types, few common features have (HEK) and human foreskin fibroblast (HFF) cells. The detected been reported. Here, we compared 16,410 fusion transcripts RNA–DNA interactions often appeared on the pairs detected in cancer (from a published cohort of 9,966 tumor involved in cancer-derived fusion transcripts. The widespread samples of 33 cancer types) with genome-wide RNA–DNA inter- RNA–DNA interactions on the gene pairs involved in fusion actions mapped in two normal, noncancerous cell types [using transcripts suggest a model wherein the RNA of gene 1 by inter- iMARGI, an enhanced version of the mapping of RNA–genome acting with the genomic sequence of gene 2 is poised for being interactions (MARGI) assay]. Among the top 10 most signifi- spliced into gene 2’s nascent transcript and thus creating a fusion cant RNA–DNA interactions in normal cells, 5 colocalized with transcript. Consistent with this model, we identified an RNA the gene pairs that formed fusion RNAs in cancer. Furthermore, fusion in a new cancer sample that does not involve the creation throughout the genome, the frequency of a gene pair to exhibit of a fusion gene. RNA–DNA interactions is positively correlated with the probabil- ity of this gene pair to present documented fusion transcripts Results in cancer. To test whether RNA–DNA interactions in normal cells Characteristics of Genome-Wide RNA–DNA Interaction Maps. To sys- are predictive of fusion RNAs, we analyzed these in a valida- tematically characterize caRNAs and their genomic interaction tion cohort of 96 lung cancer samples using RNA sequencing locations, we developed iMARGI, an enhanced version of the (RNA-seq). Thirty-seven of 42 fusion transcripts in the validation cohort were found to exhibit RNA–DNA interactions in normal Significance cells. Finally, by combining RNA-seq, single-molecule RNA FISH, and DNA FISH, we detected a cancer sample with EML4-ALK fusion RNA without forming the EML4-ALK fusion gene. Collec- It remains formidable to predict what unreported RNA pairs tively, these data suggest an RNA-poise model, where spatial can form new fusion transcripts. By systematic mapping proximity of RNA and DNA could poise for the creation of fusion of chromatin-associated RNAs and their respective genomic transcripts. interaction loci, we obtained genome-wide RNA–DNA inter- action maps from two noncancerous cell types. The gene pairs involved in RNA–DNA interactions in these normal cells exhib- fusion transcripts | RNA–DNA interactions | RNA-poise model ited strong overlap with those with cancer-derived fusion transcripts. These data suggest an RNA-poise model, where usion transcripts are associated with diverse cancer types and the spatial proximity of one gene’s transcripts and the other Fhave been proposed as diagnostic biomarkers (1–3). Com- gene’s genomic sequence poises for the creation of fusion panion tests and targeted therapies have been developed to transcripts. We validated this model with 96 additional lung identify and treat fusion-gene defined cancer subtypes (2, 4). cancer samples. One of these additional samples exhibited Efforts of detection of fusion transcripts have primarily relied fusion transcripts without a corresponding fusion gene, sug- on analyses of RNA sequencing (RNA-seq) data (1, 3, 5–8). A gesting that genome recombination is not a required step of recent study analyzed 9,966 RNA-seq datasets across 33 cancer the RNA-poise model. types from The Cancer Genome Atlas (TCGA) and identified more than 15,000 fusion transcripts (4). Author contributions: Z.Y., N.H., W.W., S.C., and S.Z. designed research; Z.Y., N.H., Despite the large number of gene pairs in identified fusion W.W., W.C., Y.J., J.C., J.X., Q.J., K.Z., Z.C., and S.Z. performed research; Z.Y., N.H., transcripts, it remains formidable to predict what unreported W.W., W.C., and S.Z. contributed new reagents/analytic tools; Z.Y., N.H., W.C., X.H., X.W., and S.Z. analyzed data; and Z.Y., N.H., W.W., Z.C., S.C., and S.Z. wrote the pair of may form a new fusion transcript. Recent analyses paper.y could not identify any distinct feature of fusion RNA form- Reviewers: N.J., University of Texas at Austin; and G.-C.Y., Dana Farber Cancer Institute.y ing gene pairs (9). Here, we report a characteristic pattern of the 2D distribution of the genomic locations of the gene pairs Conflict of interest statement: S.Z. is a cofounder of Genemo, Inc.y involving RNA–DNA interactions that provides insights into This open access article is distributed under Creative Commons Attribution License 4.0 (CC BY).y understanding the creation of fusion transcripts. Data deposition: The data reported in this paper have been deposited in the Gene Chromatin-associated RNAs (caRNAs) provide an additional Expression Omnibus (GEO) database, https://www.ncbi.nlm.nih.gov/geo (accession no. layer of epigenomic information in parallel to DNA and histone GSE122690).y modifications (10). The recently developed mapping of RNA– 1 Z.Y., N.H., and W.W. contributed equally to this work.y genome interactions (MARGI) technology enabled identifica- 2 To whom correspondence may be addressed. Email: [email protected] or szhong@ tion of diverse caRNAs and the respective genomic interacting ucsd.edu.y locations of each caRNA (6). In this work, we developed an This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10. improved MARGI experimental pipeline called iMARGI. Com- 1073/pnas.1819788116/-/DCSupplemental.y pared with MARGI, iMARGI reduced the required amount of Published online February 4, 2019.

3328–3337 | PNAS | February 19, 2019 | vol. 116 | no. 8 www.pnas.org/cgi/doi/10.1073/pnas.1819788116 Downloaded by guest on September 26, 2021 Downloaded by guest on September 26, 2021 N n rw n h N n clm)fo ml bu)t ag rd cl,nraie nec row. each in pairs. normalized read al. scale, proximal et (red) of Yan large removal to to after mapped (blue) cells were small HFF from bars) and (column) HEK (blue end in ends DNA pairs DNA the read the and intrachromosomal (row) and and end inter- 11, RNA of chromosomal Ratios on (D) gene ( C pairs. MALAT1 read bars). (E the million (blue M: to gene pairs. mapped read KTN1 were interaction the bars) near (red ends 14 RNA the where lines), (horizonal 1. Fig. B A 100% C D eta fa N–N neato arxi E el.Tenmeso MRIra ar r lte ihrsett h apdpstoso the of positions mapped the to respect with plotted are pairs read iMARGI of numbers The cells. HEK in matrix interaction RNA–DNA an of Heatmap ) 25% 50% 75% 0% iMARGI iMARGI 0 Gene HFF HEK fragmentation RNA/DNA chr11 neapeo necrmsmlra pairs read interchromosomal of example An B) ( procedure. experimental iMARGI of view Schematic (A) data. and method iMARGI of Overview 83.3% (13.7M) 37.6% rxmldsa inter distal proximal Remote interactions E HFF HEK (3.8M) 10.4% 16.7% 65.45 mb ne intra inter Adenylated linkerligation RNA/DNA endprepare MALAT1 MALAT1 82.8% (18.9M) 52% In situ 65.5 mb 17.2% A T

rprin fpoia,dsa,aditrhoooa edpisi olcino ai RNA–DNA valid of collection a in pairs read interchromosomal and distal, proximal, of Proportions ) RNA E chr22 chr21 chr20 chr19 chr18 chr17 chr16 chr15 chr14 chr13 chr12 chr11 chr10 chrY chrX chr3 chr2 chr1 chr9 chr8 chr7 chr6 chr5 chr4 65.55 mb proximity ligation RNA-DNA

chr1 65.6 mb chr2 −1

chr3 chr14 Reverse transcription Chimera pull-down Reverse crosslinking chr4 cDNA 55.5 mb

chr5 KTN1 chr6

PNAS chr7

chr8 DNA | 55.6 mb Linearization Circularization eray1,2019 19, February chr9 In vitro

chr10 BamHI chr11 chr12 chr13

| chr14 55.7 mb o.116 vol. chr15 Paired-end sequencing chr16 RNA end Read 1

chr17 0.02 0.01 0

| chr18

o 8 no. chr19 chr20 chr21 DNA end Read 2

| chr22

3329 chrX chrY

SYSTEMS BIOLOGY MARGI assay (10). The main difference between iMARGI and Second, we compared by the numbers of discovered caR- MARGI is that iMARGI carries out the ligation steps in situ NAs. Under the smallest possible threshold (1 valid read pair), (Fig. 1A), whereas MARGI performs these ligation steps on iMARGI revealed a similar amount of caRNAs to that of streptavidin beads. We applied iMARGI to HEK and HFF cells pxMARGI, with mRNA, lincRNA, pseudogene RNA, and anti- to yield 361.2 million and 355.2 million 2 × 100-bp paired-end sense RNA as the most abundant types of caRNAs (SI Appendix, sequencing read pairs, respectively. These resulted in 36.3 mil- Fig. S2). diMARGI revealed severalfold fewer caRNAs in every lion (HEK) and 17.8 million (HFF) valid RNA–DNA interaction RNA type (SI Appendix, Fig. S2), consistent with its many fewer read pairs, in which ∼35%, 10%, and 55% were proximal, dis- usable read pairs. tal, and interchromosomal interactions, respectively (Fig. 1C and Third, we compared by the “RNA attachment level” (10) on SI Appendix, Fig. S1A). The proximal and distal interactions every genomic segment. We segmented the genome into 100-kb were defined as the intrachromosomal interactions where the bins and counted the number of read pairs with the DNA end RNA end and DNA end were mapped to within and beyond mapped to each bin. iMARGI and pxMARGI exhibited a strong 200 kb, respectively. Following Sridhar et al. (10), we removed correlation (Pearson correlation = 0.88, P value <2.2×10−16; SI proximal read pairs from further analysis because proximal inter- Appendix, Fig. S3A). The correlation increased with the bin size actions likely represent interactions between nascent transcripts (SI Appendix, Fig. S3 B and C). diMARGI data exhibited much and their neighboring genomic sequences. Hereafter, we refer weaker correlation to iMARGI data (SI Appendix, Fig. S3 D–F), to the union of distal and interchromosomal interactions as likely also due to its very small amount of usable read pairs. remote interactions. The rest of this paper deals only with remote interactions. The Most Significant RNA–DNA Interactions Colocalized with the Among the remote RNA–DNA interactions, both cell types Gene Pairs Forming Fusion Transcripts in Cancer. We set off to exhibited an ∼1:5 ratio of intra- and interchromosomal inter- identify the most significant distal RNA–DNA interactions actions (Fig. 1D). The 2D map of RNA–DNA interactions from the iMARGI data. Excluding extremely abundant noncod- exhibited more interactions near the diagonal line (Fig. 1E and ing RNAs, such as XIST, the top gene pair with the largest SI Appendix, Fig. S1D). Within each chromosome, the number amount of interchromosomal and distal iMARGI read pairs of iMARGI read pairs negatively correlated with their genomic in HEK cells was FHIT-PTPRG (Fig. 2B and SI Appendix, distances (Fig. 2A, red and blue circles). Fig. S4A). Investigating this gene pair, we found the report- ing of FHIT-PTPRG fusion transcripts from kidney, liver, head Comparison of iMARGI with MARGI. We compared iMARGI with and neck, lung, and prostate cancers (7). The second largest the MARGI technology previously described (10). iMARGI amount of interchromosomal and distal iMARGI read pairs requires only ∼5 million cells to start the experiment, whereas was from GPC5-GPC6 (Fig. 2B and SI Appendix, Fig. S4B). MARGI requires ∼500 million cells. MARGI has two tech- Fusion transcripts from this second-ranked gene pair were nical variations called pxMARGI and diMARGI, which differ reported from liver and prostate cancers (7). Notably, 5 of by the degree of chromatin fragmentation (10). We compared the top 10 gene pairs were reported as fusion transcripts in the iMARGI, pxMARGI, and diMARGI datasets generated cancers (1, 7). These findings led us to systematically analyze from HEK293T cells. These datasets had roughly comparable the relationship between RNA–DNA interactions and fusion amounts of raw read pairs (SI Appendix, Table S1). transcripts. First, we compared the distribution of the read pairs. iMARGI and pxMARGI produced similar amounts of valid interchro- Nonuniform Distribution of the RNA Pairs Contributing to Fusion mosomal (∼19 million) and distal (1–4 million) read pairs Transcripts. We asked whether there is any global characteristic (SI Appendix, Table S1). diMARGI generated many fewer of genome-wide distribution of the RNA pairs that contribute valid interchromosomal (∼0.5 million) and distal (∼26,000) to fusion transcripts. To this end, we subjected the previously read pairs. reported 16,410 fusion transcripts that were derived from 9,966

AB108 Futra 3 10 HEK

HFF 105 102

FHIT-PTPRG 1 GPC5-GPC6 10 102 Number of gene pairs Number of interacted 0

genomic bins or Futra pairs genomic bins or Futra 10 105 106 107 108 0 200 400 600 Genomic distance between genomic bins Number of iMARGI read pairs or Futra pairs (bp)

Fig. 2. Summary of iMARGI data. (A) The number of genomic bin pairs with 10 or more iMARGI read pairs (y axis) is plotted against the genomic distance between the pair of genomic bins (x axis) in HEK cells (red circles) and HFF cells (blue circles). For comparison, the number of fusion transcript-contributing RNA (Futra) pairs (y axis) derived from 9,966 cancer samples is plotted against the genomic distance (x axis) between the genes involved in the fusion (purple circles). (B) Histogram of the number of iMARGI read pairs of every gene pair (x axis) across all of the intrachromosomal gene pairs with genomic distance >200 kb in HEK cells. Arrow: the gene pairs with the largest and the second largest number of iMARGI read pairs.

3330 | www.pnas.org/cgi/doi/10.1073/pnas.1819788116 Yan et al. Downloaded by guest on September 26, 2021 Downloaded by guest on September 26, 2021 nrw n oun.Tenmeso ur ar fcrepniggnmcpstosaesoni le(ml)t e lre oo cee Bin scheme. color (large) red to (small) blue a in shown are mul- positions in al. and genomic et red) Yan corresponding pairs, Futra of (nonrecurring pairs sample cancer Futra one of only numbers in appeared The Mb. (C that 10 columns. green). pairs size: Futra and pairs, of Futra rows numbers the (recurring on of samples chart cancer Pie (B) tiple samples. cancer 9,966 the 3. Fig. pairs Futra together, Taken circles). purple chromosomal 2A, their (Fig. with correlated intrachromo- distance of negatively number pairs The Futra pairs. somal Futra the between 3C chromo- distances large (Figs. or domains within somal suggesting pairs map, fusion gene the of of gene enrichment line of diagonal density the Higher on interchromo- C). S5 appeared of Fig. pairs amounts Appendix , (SI largest pairs the gene to somal contribute intrachro- 19 11, of and 1, amounts Chromosomes 17, S5B). largest 12, Fig. the Appendix, (SI harbored pairs gene 17 mosomal Chromo- and 12, respectively. 1, interchromosomal, somes and 27.91, intra- = were ratio pairs (odds pairs value gene intrachromoso- interchromosomal more with than uniform, mal not was distribution 3C 2D (Fig. The map” “fusion a call we which transcripts fusion in pairs gene recurrent 12). (11, of con- data scarcity These 3B). the (Fig. pairs”). analyzed in firmed samples only 9,966 occurred (“Futra the pairs of Futra sample of pairs 1 15,144) of (14,482 RNA 95% than pairs transcript-contributing More gene these fusion to refer we to as per Hereafter corresponded transcripts pairs. transcripts RNA fusion fusion unique further 16,410 15,144 two to The were 3A). project there (Fig. TCGA sample average, from On types (4). cancer analysis 33 across samples B Number of samples A evsaie h rqec fFtapisi Dheatmap, 2D a in pairs Futra of frequency the visualized We 1000 1500 2000 500 <2.2 0 itiuino h ubro eetdFtapisi ahsml across sample each in pairs Futra detected of number the of Distribution (A) types. cancer 33 from derived pairs Futra 15,144 the of Overview Number ofFutrapairsinasample 102030400 Recurrence × n More thanone One 10 − (14482) 16 95.6% , (662) χ 4.4% 2 et.Attlo ,9 n ,5 Futra 6,253 and 8,891 of total A test). .W unie h relative the quantified We 4A). and and i.S5A). Fig. Appendix, SI C chr22 chr21 chr20 chr19 chr18 chr17 chr16 chr15 chr14 chr13 chr12 chr11 chr10 h eoi itiuino ur ar.Gnmccodntso l hoooe r ordered are chromosomes all of coordinates Genomic pairs. Futra of distribution genomic The ) chrY chrX chr9 chr8 chr7 chr6 chr5 chr4 chr3 chr2 chr1

chr1

chr2 P

chr3 4 (Fig. similarities pronounced observed and that and with S1D) pairs 1E Futra Fig. (Figs. of interactions distribution RNA–DNA 2D of the visual- of a out comparison carried ized We Interac- interactions. RNA–DNA RNA–DNA genome-wide and Pairs Futra tions. of global Colocalization different Genome-Wide exhibited interactions. (TADs) genome pairs of domains those Futra from are associated characteristics together, distribution topologically which Taken of lengths, (15). to sizes in 1/10th typical chromosome the approximately a from of 1/3rd ranged domains chromosomal domains 3C intrachromosomal chromosomal large (Figs. The in enrichment 14). exhibited pairs (13, read Futra interchromosomal capture-derived less conformation were whereas chromosomal pairs interchromosomal, of were (6,253 15% pairs percent than Futra and Forty-one of Pairs 15,144) interactions. of Futra genome of with correlate Locations Genomic Interactions. Genome the Between to Differences preference and pairs distances. intrachromosomal genomic smaller of characterized enrichment genome, the by in distribution nonuniform exhibited

chr4 S6 Fig. Appendix, SI eakdt htetn ur ar a oniewith coincide may pairs Futra extent what to asked We

chr5 n 4A and

chr6 PNAS

chr7 and

chr8 may pairs Futra extent what to asked We | .Teeenriched These S6A). Fig. Appendix, SI Number ofFutrapairs eray1,2019 19, February chr9 Futra 34 of set a example, For A–C). chr10 chr11 chr12 chr13 3C and |

o.116 vol. chr14 chr15 and

chr16 256 16 0 02 times ∼10–20

| chr17 IAppendix, SI

o 8 no. chr18 chr19 chr20

| chr21 A–C

3331 chr22 chrX chrY

SYSTEMS BIOLOGY A Fusion map B RNA-DNA interaction in HEKC RNA-DNA interaction in HFF 0 0 0

50 50 50 chr9 chr9 chr9

100 100 100

139 139 139 (M) 0 50 100 139 (M) 050100139(M) 050100139 chr9 chr9 chr9 Number of Futra pairs Number of read pairs Number of read pairs 0246 050100150 200 0204060 80

D chr9

25 M 30 M 35 M 40 M 45 M

Genes

Futra pairs

RNA iMARGI read pairs DNA

Genes

25 M 30 M 35 M 40 M 45 M

chr9

Fig. 4. Comparison of Futra pairs and iMARGI read pairs. (A) Distribution of intrachromosomal Futra pairs on chromosome 9. Genomic coordinates are plotted from top to bottom (rows) and from left to right (columns). (B and C) RNA–DNA interaction matrices in HEK cells (B) and HFF cells (C). The numbers of iMARGI read pairs are plotted with respect to the mapped positions of the RNA end (rows) and the DNA end (columns) on chromosome 9 from small (blue) to large (red) scale. (D) Detailed view of a 30-Mb region (yellow boxes in A–C). This genomic region and the contained genes are plotted twice in the top and the bottom lanes. In the track of Futra pairs, each blue line linking a gene from the top lane to a gene at the bottom indicates a Futra pair. In the track of iMARGI read pairs in HEK cells, each green line indicates an iMARGI read pair with the RNA end mapped to the genomic position in the top lane and the DNA end mapped to the bottom lane.

pairs was enriched in an ∼7-Mb region on chromosome 9 (chr9) 14.1, P value <2.2 × 10−16, χ2 test). Among the 8,891 intra- (32–39 Mb, Fig. 4 A and D). This Futra-pair–enriched region chromosomal Futra pairs, 7,427 (83.5%) overlapped with RNA– colocalized with a chromosomal region enriched in RNA–DNA DNA interactions in either HEK or HFF cells (odds ratio = interactions (Fig. 4 B–D). Such colocalizations were observed 35.44, P value <2.2 × 10−16, χ2 test). These data pointed to in multiscale analyses using different resolutions, including 10- a common feature of cancer-derived Futra pairs, which is their Mb (SI Appendix, Fig. S6 A–C), 1-Mb (Fig. 4 A–C), and 100-kb colocalization with RNA–DNA interactions in normal cells. (Fig. 4D) resolutions, as well as at the resolution of individual fusion pairs and read pairs (SI Appendix, Fig. S7). Four fusion Cancer-Derived Futra Pairs That Colocalize with RNA–DNA Interac- transcripts, KMT2C-AUTS2, KMT2C-CALN1, KMT2C-CLIP2, tion in Normal Cells Do Not Form Fusion Transcripts in Normal Cells. and KMT2C-GTF2IRD, were formed between the KMT2C A model that may explain the colocalization of RNA–DNA mRNA near the 152-Mb location of (chr7: interactions and Futra pairs is that RNA–DNA interactions in 152,000,000) and four mRNAs that were approximately 80 Mb the normal cells poise for creation of fusion transcripts. Rec- away (chr7: 66,000,000–78,000,000) (Futra pairs; SI Appendix, ognizing that this model cannot be tested by perturbation due Fig. S7). Correspondingly, a total of 73 RNA–DNA iMARGI to the very small likelihood for a fusion transcript to occur read pairs were mapped to KMT2C and the four fusion partners in a cancer sample, we carried out two other tests. First, we in HEK cells (iMARGI; SI Appendix, Fig. S7). tested whether the cancer-derived Futra pairs were detectable We quantified the overlaps between Futra pairs and iMARGI- in normal cells. We reanalyzed the merged RNA-seq datasets identified RNA–DNA interactions. Among the 6,253 interchro- of more than 75 million 2 × 100-bp paired-end read pairs mosomal Futra pairs, 3,014 (48.2%) overlapped with RNA– from HEK293T cells (16) and ran STAR-Fusion (17) on these DNA interactions in either HEK or HFF cells (odds ratio = datasets, which reported a total of 8 Futra pairs. None of the

3332 | www.pnas.org/cgi/doi/10.1073/pnas.1819788116 Yan et al. Downloaded by guest on September 26, 2021 Downloaded by guest on September 26, 2021 ooa ur ar htoelpwt N–N neatosi ohHKadHF(elwgen el r I5:IK,LPOBL,FCGBP:MT-RNR2, LPP:OSBPL6, LIN52:PI4KA, are cells al. (yellow-green) et HFF Yan and HEK interchro- both 10 in The MALAT1:TNFRSF10B. interactions and #: RNA–DNA COL1A1:MALAT1, COL1A2:MALAT1, MTOR-AS1:RERE. with FLNA:MALAT1, KTN1:MALAT1, CEP70:GSK3B, and overlap CUX1:TRRAP, ETV6:TTC3, KMT2B:MALAT1, LPP:PPM1L, that LRCH1:RP11-29G8.3, NAV3:RP1-34H18.1, ACTN4:ERCC2, pairs FRS2:NUP107, LMO7:LRCH1, Futra RP11-557H15.4:SGK1, cells. RP1-148H17.1:TOP1, mosomal ALK:EML4, (pink) CHST11:NTN4, are HFF KCTD1:SS18, cells and (yellow-green) NIPBL:WDR70, (green) HFF ( y and HEK transcripts HEK in fusion (D both detected interactions samples. of cancer RNA–DNA Number 69 to (B) the pairs (column). from sample detected each pairs Futra of interchromosomal bar) blue (dark pairs read 5. Fig. C analyzed targeted also and extracted We was cohorts. RNA TCGA (H2228). the line of cell NSCLC part a not patients from were samples cancer valida- who lung a new analyzed 96 we comprising end, cohort fusion this tion Fusion To of cancer. predictive of in are formation cells transcript Predictive normal in Are interactions Samples. RNA–DNA Cells Cancer New Normal in Transcripts in Interactions cells. RNA–DNA normal the transcripts in present fusion not the are cancer cells, with in normal found colocalized line in pairs cell interactions sug- Futra data RNA–DNA cancer-derived NSCLC these although a together, that, Taken in S8). gest Fig. transcripts Appendix, fusion (SI (H2228) Appendix detected 6 (SI assays Fig. cells both (see HEK293T fusion cells in EML4-ALK detected ALK HFF transcripts analysis PCR the and quantitative nor HEK and PCR in RNA were locus EML4 there genomic and between (18), interactions (NSCLC) specifically reported carcinoma were RNA–DNA we lung which cell addition, transcripts, nonsmall In fusion in cells. EML4-ALK for HEK293T tested in data detected RNA-seq TCGA were from pairs Futra 15,144 derived previously A B Number of fusion

h ubro N-e edpis(ih lebr n h nqeymapped uniquely the and bar) blue (light pairs read RNA-seq of number The (A) samples. cancer lung new the from detected transcripts Fusion transcripts Number of read pairs 0 1 2 3 4 5 x10 Number of Futra pairs 0 2 4 6 10 15 20 0 5

H2228 H2228 6 Type ofFutrapairs 1 1 3 3

nr inter intra 10 10 23 11 11 13 13 14 14 15 15 17 17 18 18 19 19 19 21 21 22 22 23

23 the whether tested we Next, 24 24 25 25 26 26 27 27

DE 28 28 30 30 whereas S8), Fig. , 31 31 32 32 nr-hoooa Inter-chromosomal Intra-chromosomal 33 33 35 35 Neither A). 4 36 36 38 38 39 39 4 15 h 5itahoooa ur ar htoelpwt N–N neatosin interactions RNA–DNA with overlap that pairs Futra intrachromosomal 15 The ∗: 40 40

and 41 41

* 42 42 43

E 43

E ( interchromosomal and (D) intrachromosomal these of Intersections ) 44 44

45 45 the of Nineteen interactions. RNA–DNA with 106.51, colocalized = pairs ratio predictive (odds are cancer cells 10 in normal RNA–DNA pairs assayed that Futra already idea of the the in supporting in cells, interactions interactions normal RNA–DNA assayed with the colocalized (88.1%) 37 cohort, TCGA the across 3B). pairs (Fig. Futra expected samples between recurring is of pairs fraction samples small TCGA Futra the and from recurring samples cancer of doc- additional previously amount not these were small that The transcripts samples, fusion umented. new cancer 40 TCGA as 9,966 fusions, well the as from FRS2-NUP107 reported also tran- and were fusion which EML4-ALK 42 5 of included (Fig. total samples a transcripts 69 reported these it was from (17) and scripts STAR-Fusion dataset 5A). this uniquely (Fig. to million sample applied 3.9 per average pairs sufficient a on read produced yield aligned yielded samples not and 69 did other library 27 the sequencing Pan- whereas samples, sequencing, RNA 96 for TruSight these RNA Illumina’s Of with Panel. out Cancer carried was RNA-seq 46 46 47 47 − validation the from detected pairs Futra 42 these Among 48 48 16 ,

49 49 χ

50 50 2 51 51 Futra intrachromosomal only whether asked We test). 52 52 Total Futrapairs HFF iMARGI Overlapped with HEK iMARGI Overlapped with 53 53

xs nec ape(oun) (C (columns). sample each in axis) 55 55 56

PNAS 56 57 57 58 58 59 59 | 60 60 eray1,2019 19, February 61 61 62 62 64 64 73 73 74 74 75 75 76 76 Uniquely mappedreadpairs Total readpairs

77 77 B

4 78 78 and | 79 79

o.116 vol. 80 80 ubr fita and intra- of Numbers ) 81 81 fusion 42 These C). 82

10 82 P 1 83 83 84 # 84 value | 87 87 o 8 no. 88 88 89 89 <2.2 4 90 90 | Futra ) 91 91

3333 92 92

95 95 × 96 96

SYSTEMS BIOLOGY 42 (45%) detected Futra pairs were interchromosomal (Fig. 5C), is a fusion-susceptible pair (Fig. 6A), EML4-ALK fusion tran- comparable to the proportion (41%) of interchromosomal Futra scripts are detected in one of our new tumor samples (sample no. pairs detected from TCGA samples. Eighty-three percent (19 of 44) (Fig. 6B), and there is an FDA-approved diagnosis kit (Vysis 23) of intrachromosomal and 95% (18 of 19) of interchromoso- ALK Break Apart FISH) based on DNA FISH detection of the mal Futra pairs overlapped with RNA–DNA interactions (Fig. 5 EML4-ALK fusion gene. Break Apart assays were performed D and E), suggesting that the colocalization of RNA–DNA inter- by Knight Diagnostic Laboratories at the Oregon Health & actions and Futra pairs was not restricted to intrachromosomal Science University according to standardized protocols. We sub- interactions. Taken together, the colocalization of Futra pairs jected the remaining tissue from sample no. 44 to DNA FISH and RNA–DNA interactions, the lack of cancer-derived fusion analysis. None of our eight attempts yielded any DNA FISH sig- transcripts in normal cells, and the predictability of additional nal in the remaining tissue from either control or ALK probes. Futra pairs in new cancer samples support the model where We therefore could not ascertain whether there was genome RNA–DNA interactions in normal cells poise for creation of rearrangement in the only sample with detectable EML4-ALK fusion transcripts in cancers. Hereafter, we refer to this model as fusion transcripts. the RNA-poise model. We call the gene pairs with RNA–DNA To identify other cancer samples that express EML4-ALK interactions in normal cells fusion-susceptible pairs. fusion transcripts, we reanalyzed our collection of 96 lung can- cer samples with FuseFISH, a single-molecule fluorescence in RNA–DNA Interaction Between EML4 and ALK Correlates with an situ hybridization (sm-FISH)–based method for the detection of RNA Fusion Without Fusion Gene in Tumor. We tested whether fusion transcripts (19, 20). We carried out quantum dot-labeled genome rearrangement is a prerequisite step for the creation sm-FISH (21) by labeling EML4 and ALK transcripts with quan- of fusion transcripts from fusion-susceptible pairs by choosing tum dots at 705 nm and 605 nm, respectively (SI Appendix, EML4-ALK fusion transcripts for this test because EML4-ALK Fig. S9). The FISH probes were designed to hybridize to the

A chr2 chr2 29.2 mb 29.4 mb 29.6 mb 29.8 mb 42.1 mb 42.3 mb

29.3 mb 29.5 mb 29.7 mb 29.9 mb 42.2 mb 42.4 mb

Gene ALK EML4

H2228 H2228 RNAseq: RNAseq read pairs # 44

HEK

iMARGI HFF

B Positive outcomeNegative outcome No measurement

RNA-seq

FuseFISH

DNA FISH * 6 2 5 7 8 1 3 4 9 94 67 93 73 74 75 76 83 85 87 88 96 20 29 34 47 66 92 10 56 60 68 69 70 78 80 81 82 84 95 16 17 30 38 39 43 46 48 63 65 72 86 89 91 11 12 13 24 26 36 51 53 54 55 57 59 79 61 62 64 71 77 90 14 15 18 19 21 22 23 25 27 28 32 33 35 37 40 42 45 49 50 52 58 31 41 44 HEK H2228 C EML4ALK MergeD BreakApart FISH

Fig. 6. RNA fusion and DNA break. (A, Middle tracks) RNA-seq read pairs (purple bars) aligned to ALK (Left) and EML4 (Right). Each pair of paired-end reads is linked by a horizontal line. (A, Lower tracks) iMARGI read pairs aligned to the two genes. Red bars: RNA end. Blue bars: DNA end. Thin gray lines: pairing information of iMARGI read pairs. (B) Positive (red) and negative (black) outcomes of detection of EML4-ALK fusion transcripts based on RNA-seq, FuseFISH, and DNA FISH (ALK Break Apart FISH, a DNA rearrangement test) in each cancer sample (column). White box: no measurement. ∗: Detected a partial deletion of ALK gene without rearrangement. (C) FuseFISH images of the no. 37 cancer sample from the EML4 and ALK channels. Arrows: colocalized FISH signals indicating fusion RNA. (Scale bar: 2 µm.) (D) A representative image of ALK Break Apart FISH, with colocalized red and yellow signals that indicate integral ALK gene without rearrangement.

3334 | www.pnas.org/cgi/doi/10.1073/pnas.1819788116 Yan et al. Downloaded by guest on September 26, 2021 Downloaded by guest on September 26, 2021 xetfrnihoiggns(1,tecacsfrtoRNA two for however, chances space; the 3D in (11), other RNA genes two each neighboring to to for only close except happen are can that transsplicing not molecules that splicing does is the were in model Transsplicing gap error transcripts theoretical cre- A (24). separate rearrangement. was genome two (transsplicing) involve that where together gene errors, is spliced process fusion splicing less-recognized pro- RNA a The better-recognized by rearrangement. of genome The transcription by processes. ated through two is by cess created Errors. be Splicing can for Allows Model RNA-Poise The indepen- RNAs warrants which fusion limited, investigation. validated remains future rearrangement of genome amount of sheer dent the that recognize date, we to However, fusion rearrangement. inde- observed transcripts genome all fusion of of for pendent occurrence account the not suggest and do the transcripts with genes consistent fusion are fig- results that bars, published notion These Pass 1). ref. (High of reads S1A ure WGS corresponding had only transcripts removed, whole-genome were data low-quality is (WGS) when this sequencing However, because rear- 1). overestimate genome ref. an of to S1A likely figure attributable bars, were Pass (Low by data rangement created RNA-seq transcripts to cancer fusion paid the from of i.e., Indeed, attention lack (2). coin, The research genes the (11). fusion the of case side to rare other attributable a the often be likely not not is although may reports literature, example, the an in Such seen cor- gene. the without fusion transcript fusion responding a contained that Tran- sample cancer Fusion Rearrangement-Independent scripts. Genome of Abundance the DNA. Discussion words, the of fusion other alterations of require In not creation does pairs. the model RNA-poise for fusion-susceptible step from prerequisite rear- transcripts a genome not that suggest is data rangement These gene. fusion an having EML4-ALK without transcripts fusion EML4-ALK expressed 37 no. ALK integral 6 exhibited 6B, (Fig. 37, (Fig. genes no. genes sample including fusion samples, ALK-related 5 ALK of other the sign of no deletion but partial gene, a RNA-seq exhibited by analyses) transcripts sample FuseFISH fusion 1 and EML4-ALK and for attempts failed negative ALK- four 56, 57) from exhibited (no. (no. signals samples FISH sample DNA 1 ALK 7 57, generate specifically, to Vysis these More 56, using genes. of 51, fusion analysis None related 18, recombination FISH. (nos. DNA Apart to samples Break together 65) selected it and randomly subjected 63, we other genes, 6 fusion with ALK-related not had did 6 37 which which 37 (Fig. no. 44 no. data sample no. RNA-seq and sample RNA-seq yield by including RNA- analyzed samples, also yield 2 was not EML4-ALK in did detected transcripts analysis that FuseFISH included fusion 18 The samples 6B). and (Fig. 57 data data seq RNA-seq These yielded analysis. that FuseFISH 39 for tissues ing IAppendix (SI cells the HEK293T with in consistent transcript S8 S9), fusion Fig. signals Fig. a Appendix, such colocalized (SI of zero cells lack 19 average contrast, from In cell on (23). per exhibited S8) H2228 Fig. cells 22 Appendix, HEK293T of (SI total transcripts colocalized a fusion S9 12 ALK in Fig. of detected , Appendix average were (SI an cells cell test, per control signals positive sm-FISH tran- a ALK by and In detected EML4 scripts. were targeting transcripts signals sm-FISH fusion colocalized Follow- 20), EML4-ALK the (22). (19, date of literature to prior variants identified ing been 28 have all that transcripts among fusion shared exons consensus a tal. et Yan norcleto f9 uo ape,ol 7hdremain- had 57 only samples, tumor 96 of collection our In ). u N IHadDAFS nlssrvae a revealed analyses FISH DNA and FISH RNA Our B and .Tkntgte,teln acrsample cancer lung the together, Taken D). 66%o uintasrpsderived transcripts fusion of ∼36–65% htwr nw oepesEML4- express to known were that ) B and .T etwehrsample whether test To C). 04%o fusion of ∼30–45% uintranscripts Fusion ∗ .The ). , h CIBorjc aaae(ceso o.SRR2992206–SRR2992208 merged. were nos. datasets three (accession The (16). PRJNA305831) database no. project BioProject under NCBI the Data. RNA-Seq Public were (GRCh38.84) analyses. 84 Annotations. data release all annotation throughout Gene used gene Ensembl and and hg38/GRCh38 Genome Reference Methods of and Materials creation or transsplicing either genes. by fusion transcripts could fusion transsplic- 7) by Fig. create model, only (confinement-poise transcripts submodel pro- other fusion The create the ing. could (targeting-poise on 7) submodel be depending Fig. One can submodels, model, interaction. model two RNA–DNA of RNA-poise of union cess the a Thus, as (25). regarded genes thus and fusion sequences genomic create genome the close of in spatially chances the genes of the two rearrangement enhance of could proximity model spatial confinement splic- the RNA for addition, allow In thus spatial errors. gene and provide ing molecules one interactions confinement, RNA two RNA–DNA (RNA of between of gene proximity transcripts means another Both nascent of 7). the sequence Fig. bring genomic sequences the could genomic to the space of 3D proximity spatial in the Second, targeting, 7). (RNA molecules Fig. sequences, tethering genomic by mediated specific be Interactions. target could which can RNA–DNA caRNA two the least by at First, by means. created Model be could RNA-Poise interactions RNA–DNA Remote the Down Breaking tran- 2’s gene during cotranscriptional perform 1 to transsplicing. gene opportunity the of for cotranscriptional. allows transcripts scription are of gene events Fur- availability splicing to transsplicing. The of of close majority possibility the spatially the thermore, for transcripts sequence allowing genomic nascent transcripts, 2’s 1’s 2’s gene on gene transcripts positions 1’s gene of stallation per- to difficult created errors. are remains transcripts splicing fusion by it which in Therefore, process to biophysical small. a locations ceive are chromosomal space in distant meet from transcribed molecules ga ro ntergt,wihsbeunl rdcsfso RNA. fusion produces subsequently which fusion right), gene the facilitate on also arrow arrows), may purple (gray sequences (gray genomic 2, errors genes of splicing two proximity (RNA enhance the the gene whereas of could proximity cases another spatial Both to or confinement). targeting) proximity (RNA (RNA spatial tethering to exhibit due can bar) bar) red 1, 7. Fig. Fusion RNAbiogenesis RNA-DNA interaction RNA-poise model h N-os oe lsti hoeia a.Teprein- The gap. theoretical this fills model RNA-poise The Pol II Gene 2 Gene 1 N-os oe.I hsmdl h rncit foegn (RNA gene one of transcripts the model, this In model. RNA-poise RNA 2 RNA 1 Spliceosome PNAS E23 N-e aaeswr onoddfrom downloaded were datasets RNA-seq HEK293T RNA 1 agtn-os Confinement-poise Targeting-poise | RNA 1 N agtn RNAConfinement RNA Targeting rn-piigDNA re-arrangement Trans-splicing eray1,2019 19, February RNA 2 Fusion RNA ua eoeassembly genome Human | o.116 vol. | o 8 no. RNA 1 | 3335

SYSTEMS BIOLOGY TCGA Derived Fusion Transcripts. The TCGA RNA-seq–derived fusion tran- struction design, read pairs were filtered out if the 50-most two bases of scripts were downloaded from the Tumor Fusion Gene Data Portal their DNA end (Read 2) were not CT. In addition, the first two bases of (www.tumorfusions.org) (26). Tier 1 and tier 2 fusion transcripts were used the RNA end (Read 1) were removed as they are random nucleotides. Then, in our analyses. Genomic coordinates were converted to hg38 by liftOver. the cleaned read pairs were mapped to the (hg38), using Following Davidson et al. (8), Futra pairs within 200 kb on hg38 were bwa mem (version 0.7.17) with parameters “-SP5M” (32). Finally, pairtools removed. The data of Futra pairs were mainly processed using R (27) with (version v0.2.0, https://github.com/mirnylab/pairtools) and in-house scripts Bioconductor packages GenomicRanges (28) and InteractionSet (29). were used to parse, deduplicate, and filter the mapped read pairs. The valid read pairs that were mapped to genomic locations within 200 kb of each Visualization of Futra Pairs. Heatmaps of the count matrix were plotted other were defined as proximal interactions, which were excluded from our using Bioconductor package ComplexHeatmap (30). Genomic plots of Futra analysis. GenomicRanges (28) and InteractionSet (29) were used for further pairs were created with GIVE (31). analysis of iMARGI data. Visualization of iMARGI read pairs. Heatmaps of the count matrix were Constructing iMARGI Sequencing Libraries. plotted using Bioconductor package ComplexHeatmap (30). Genomic plots Nuclei preparation and chromatin digestion. Approximately 5 × 106 cells of iMARGI read pairs were created with Bioconductor package Gviz (33) and were used for the construction of an iMARGI sequencing library. Cells were GIVE (31). cross-linked in 1% formaldehyde at room temperature (RT) for 10 min with Intersection of iMARGI read pairs and Futra pairs. An iMARGI read pair was rotation. The cross-linking reaction was quenched with glycine at 0.2 M regarded as overlapping with a Futra pair when the RNA end was strand- concentration and incubated at RT for 10 min. Cells were pelleted, washed specifically mapped to the gene body of one gene in the Futra pair and the using 1× PBS, and aliquoted into ∼5 × 106 in each tube. To prepare nuclei, DNA end was mapped to the gene body ±100 kb flanking regions of the cross-linked cells were incubated in 1 mL of cell lysis buffer (10 mM Tris·HCl, other gene in the Futra pair. pH 7.5, 10 mM NaCl, 0.2% Nonidet P-40, 1× protease inhibitor) on ice for 15 min and homogenized with dounce homogenizer pestle A for 20 strokes FuseFISH Analysis. on ice. Nuclei were pelleted and weighed to estimate the pellet volume Probe design. Oligonucleotide probes were designed to hybridize to exons (10 mg of nuclei pellet was estimated to be about 10 µL). The nuclei pel- 2–6 of EML4 RNA and exons 20–23 of ALK RNA. These exons were chosen let was incubated with SDS buffer (0.5× Cutsmart buffer, 0.5% SDS) at because they were present in all of the observed variations of EML4-ALK 1:3 vol ratio and 62 ◦C for 10 min with mixing and immediate quenching fusion genes. These probes were 35–40 nt in length, with similar GC contents with a final 1% of Triton X-100. To digest chromatin, the washed nuclei and melting temperatures. pellet was resuspended in an AluI chromatin digestion mix [2.3 units/µL Conjugation of quantum dots to oligonucleotide probes. Oligos were mod- 0 AluI (NEB), with 0.3 unit/µL RNasinPlus (Promega) and 1× Cutsmart buffer] ified on the 5 end with a primary amino group and a spacer of 30 and incubated at 37 ◦C overnight with mixing. After chromatin digestion, carbons to minimize steric hindrance of probe–RNA hybridization. These 1 µL of RNase I (1:10 diluted in 1× PBS) was directly added to the reaction probes were conjugated with quantum dots through the amino group mixture and incubated at 37 ◦C for 3 min to fragment RNA. Nuclei were using EDC reaction (34). The probes were subsequently purified with pelleted and washed twice using PNK wash buffer (20 mM Tris·HCl, pH 7.5, 0.2 µm membrane filtration and 100,000 molecular weight cutoff (MWCO). 10 mM MgCl2). The retentate of the 100,000 MWCO was subjected to dynabeads MyOne Ligations. To prepare the RNA and DNA ends for linker ligation, nuclei SILANE purification to remove any remaining unconjugated probes. A were incubated in 200 µL RNA 30-end dephosphorylation reaction mix subsequent 0.2-µm membrane filtration was used to remove any final [0.5 unit/µL T4 PNK (NEB), 0.4 unit/µL RNasinPlus, 1× PNK phosphatase aggregates. buffer, pH 6.5] at 37 ◦C for 30 min with mixing. Nuclei were washed twice Hybridization of adherent cell lines. Probe hybridization in H2228 cells with PNK buffer, resuspended in 200 µL DNA A-tailing mix [0.3 unit/µL was carried out as previously described (19, 21). Briefly, probes were Klenow Fragment (30 → 50 exo-) (NEB), 0.1 mM dATP, 0.1% Triton X-100, added to the hybridization solution and incubated with the cells at ◦ 1× NEB buffer 2] and incubated at 37 ◦C for 30 min with mixing. The same 37 C overnight. Cells were washed and resuspended in 1× PBS for linker sequence as described in the previous MARGI paper was used (10). For imaging. in situ RNA-linker ligation, nuclei were resuspended in 200 µL ligation mix Hybridization of tissue samples. Probe hybridization in tissues was carried [38 µL adenylated and annealed linker, 10 units/µL T4 RNA ligase 2- out as previously described (35). Briefly, a hybridization solution with probes truncated KQ (NEB), 1× T4 RNA ligase reaction buffer, 20% PEG 8000, 0.1% was added to the surface of the parafilm to form droplets. A tissue slice (5– Triton X-100, 0.4 unit/µL RNasinPlus] and incubated at 22 ◦C for 6 h and 10 µm in thickness) fixed on a glass coverslip was gently placed over the ◦ then 16 ◦C overnight with mixing. After ligation, the nuclei were washed hybridization solution. The mixture was incubated at 37 C overnight. The five times with PNK buffer to remove excess free linker. For in situ RNA– tissue was subsequently washed with wash buffer and resuspended in 1× DNA proximity ligation, nuclei were resuspended in 2 mL of proximity PBS for imaging. ligation mixture [4 units/µL T4 DNA ligase (NEB), 1× DNA ligase reaction Imaging and analysis. Cells or tissues were imaged in 1× PBS through wide- buffer, 0.1% Triton X-100, 1 mg/mL BSA (NEB), 0.5 unit/µL RNasinPlus] and field fluorescence imaging using an Olympus IX83 inverted microscope at incubated at 16 ◦C overnight. 60× oil immersion objective (N.A. = 1.4). Image processing was carried out Library construction. To reverse cross-linking, nuclei were washed twice as previously described (36). Briefly, single transcripts were detected using with 1× PBS, resuspended in 250 µL of extraction buffer [1 mg/mL Pro- an automated thresholding algorithm that searches for robust thresholds, teinase K (NEB), 50 mM Tris·HCl, pH 7.5, 1% SDS, 1 mM EDTA, 100 mM where counts do not change within a range. Fusion transcripts were deter- NaCl] and incubated at 65 ◦C for 3 h. DNA and RNA were extracted by mined by searching for colocalization of detection transcripts by overlap adding an equal volume of phenol:chloroform:isoamyl alcohol (pH 7.9, between predicted centers within a radius. Ambion) followed by ethanol precipitation. The subsequent steps including removal of biotin from nonproximity ligated linkers, pulldown of RNA–DNA RNA Sequencing and Analysis. RNA was extracted with Trizol from an chimera, reverse transcription of RNA, DNA denaturation, circularization, H2228 cell line and lung cancer tissue samples of the approximate size oligo annealing and BamHI (NEB) digestion, and sequencing library gener- 3 mm×3 mm× 30 µm per sample. RNA sequencing was carried out ation were performed as previously described (10). iMARGI libraries were using the TruSight RNA Pan-Cancer Panel (Illumina) following the man- subsequently subjected to paired-end 100-cycle sequencing on an Illumina ufacterer’s protocol. All of the RNA-seq data, including HEK293T pub- HiSeq 4000. The circularization and library construction strategy can phase lic data, the H2228 cell line, and lung cancer sample sequencing data, RNA and DNA ends into Read 1 and Read 2 as shown Fig. 1A, which is were mapped to the human genome (hg38) using STAR (v2.5.4b) with the same as with MARGI library configuration (10). Since AluI restriction default parameters (37). Fusion transcripts were called using STAR-Fusion enzyme recognizes “AGCT” sequence and leaves “CT” at the 50 end of the (v0.8.0) (17), requiring both numbers of supporting discordant read pair cut, we expect the first two bases of DNA end (Read 2) to be enriched and junction-spanning read larger than zero and the sum of them larger with CT. than 2.

ACKNOWLEDGMENTS. Male hTert-immortalized human foreskin fibroblasts Analysis of iMARGI Sequencing Data. (HFF-hTert-clone 6) are a cell line of the 4D Nucleome Tier 1 cells, provided Mapping iMARGI read pairs. The detailed iMARGI data-processing meth- by the Job Dekker laboratory (https://www.4dnucleome.org/cell-lines.html). ods can be found in our GitHub repository (https://github.com/Zhong-Lab- This work is funded by NIH Grants R00HL122368 (to Z.C.), R01HL108735 (to UCSD/iMARGI methods). Briefly, they include three main steps. First, the S.C.), R01HL106579 (to S.C.), R01HL121365 (to S.C.), DP1HD087990 (to S.Z.), read pairs were cleaned by in-house scripts. According to the library con- and NIH 4D Nucleome U01CA200147 (to S.C. and S.Z.).

3336 | www.pnas.org/cgi/doi/10.1073/pnas.1819788116 Yan et al. Downloaded by guest on September 26, 2021 Downloaded by guest on September 26, 2021 4 a eWre J ta.(02 out4-e aaaayi osre o regulatory for screen to analysis data 4C-seq Robust (2012) al. et HJ, in-nucleus Werken versus de in-solution van using results 14. hi-C of Comparison (2015) al. whole-transcriptome et T, by Nagano detected 13. transcripts fusion Recurrent (2015) cancer. al. breast et in transcripts J, fusion Kim read-through 12. Recurrent vivo. (2014) in al. interactions et RNA-chromatin KE, of Varley mapping 11. Systematic (2017) al. et B, Sridhar 10. 5 io R ta.(02 oooia oan nmmaingnmsietfidby identified genomes mammalian in domains Topological (2012) al. et JR, Dixon 15. 8 edelJ,e l 21)Dtcino nw n oe L uintranscripts fusion ALK novel and known of Detection (2017) from al. detection et transcript fusion JA, accurate Vendrell and Fast 18. the Star-fusion: (2017) of al. role et B, the Haas highlights 17. transcriptomics Comparative (2017) al. et JW, Wynne 16. a tal. et Yan .LiJ ta.(05 uintasrp oisaemn eoi etrswt non-fusion with features genomic many share loci transcriptome- transcript sensitivity Fusion (2015) High al. et Jaffa: analysis. J, (2015) Lai data A 9. sequencing Oshlack RNA IJ, cancer- for Majewski of NM, Pipeline relevance Davidson Prada: therapeutic 8. (2014) al. and et landscape W, The Torres-Garcia kinase (2015) of 7. landscape al. The et (2014) promis- C K, A Lengauer Yoshihara JL, genes: Kim 6. Fusion S, (2018) Schalm E, J Cerami Zhang N, Stransky M, 5. Xing H, gene Cheng of R, Landscape Theobard (2015) X, AM Dai Chinnaiyan 4. S, Kalyana-Sundaram C, Kumar-Sinha 3. .MresF oaso ,Foeo ,Mtla 21)Teeegn opeiyof complexity emerging The (2015) and F development Mitelman the T, in Fioretos B, implications Johansson their F, and Mertens fusions 2. Driver (2018) al. et Q, Gao 1. N interactions. DNA ligation. samples. cancer breast primary 54:681–691. 120 of sequencing Treat Res Cancer Breast Biol Curr loci. detection. gene fusion focused Bioinformatics fusions. transcript associated cancer. in fusions cancer. against 160. combating tool ing find. shall ye and Seq cancers: epithelial in fusions nlsso hoai interactions. chromatin of analysis cancer. in fusions gene cancers. human of treatment nln acrptet sn etgnrto eunigapproaches. sequencing next-generation using patients 12510. cancer lung in 2017. 24, March posted Preprint, ebolavirus. bioRxiv:120295. to RNA-seq. response host the in factor transcription 91:e01174-17. 1 activator M Genomics BMC eoeBiol Genome 27:610–612. 30:2224–2226. a Commun Nat a Methods Nat 16:1021. 16:175. a e Cancer Rev Nat 146:287–297. Oncogene elRep Cell eoeMed Genome 5:4846. 9:969–972. Nature ici ipy caRvCancer Rev Acta Biophys Biochim 23:227–238e3. 15:371–381. 34:4845–4854. 485:376–380. 7:43. eoeMed Genome ee hoooe Cancer Chromosomes Genes 7:129. c Rep Sci 1869:149– Virol J 7: 7 oi ,e l 21)SA:UtaatuieslRAsqaligner. RNA-seq Imag- universal (2008) Ultrafast STAR: S (2013) Tyagi al. et A, A, Oudenaarden in Dobin van counting 37. SA, Rifkin and P, detection Bogaard mRNA den van Single-molecule A, (2013) Raj polymer-coated 36. al. using expression et gene of A, visualization Lyubimova situ 35. In (2009) al. et Y, Choi 34. bioconductor. and with Gviz using contigs data genomic assembly Visualizing and (2016) R sequences Ivanek F, clone Hahne for reads, browsers 33. sequence genome Portable Aligning GIVE: (2013) (2018) H S Zhong Li A, 32. Zheng Q, correlations Wu and Z, patterns Yan reveal X, interactions: heatmaps Cao Complex genomic 31. (2016) for M Schlesner Infrastructure R, Eils (2016) Z, E Gu Ing-Simmons 30. M, ranges. Perry genomic AT, annotating and Lun computing 29. for Software (2013) al. et cancer-associated M, Lawrence for 28. resource integrative (2018) Team An Core Tumorfusions: R (2018) 27. al. rearrangements. et genomic human X, for Hu Mechanisms (2008) 26. of JR trans-splicing Lupski mimics F, fusion gene Zhang neoplastic W, A Gu (2008) J 25. Sklar G, Mor J, Wang and cancer H, lung Li cell non-small 24. in rearrangement EML4-ALK high-resolution. (2009) al. at et genetics MP, Martelli cancer Somatic 23. Cosmic: vivo (2017) in al. Single- structure et RNA imaging: SA, and Forbes FISH interactome RNA-RNA Fusion 22. Mapping (2014) (2016) al. M et Batish TC, in Nguyen S, fusions Tyagi 21. gene W, transcribed Ruezinsky of FB, detection Robust Markey FuseFISH: 20. (2014) al. et S, Semrau 19. 15–21. probes. labeled singly multiple using 5:877–879. molecules mRNA individual ing tissue. mammalian conjugates. quantum-dot-DNA 335–351. pp 1418, Vol York), New Biology Molecular in Methods Genomics. tistical 2013. 26, May posted Preprint, arXiv:1303.3997v2. BWA-MEM. websites. personal data. genomic multidimensional in F1000Res experiments. related and ChIA-PET Hi-C, for classes Bioconductor Biol Comput PLoS 3.5.1. Version Vienna), Computing, Statistical for Foundation fusions. transcript Pathogenetics cells. human normal in RNAs tissues. lung non-tumor Res Acids MARIO. by situ. in transcripts fusion gene of detection molecule cells. single 45:D777–D783. a Commun Nat elRep Cell 1:4. 9:e1003118. uli cd Res Acids Nucleic eoeBiol Genome a Protoc Nat 6:18–23. PNAS :ALnug n niomn o ttsia Computing Statistical for Environment and Language A R: mJPathol J Am 7:12023. | Science 8:1743–1758. Small eray1,2019 19, February 19:92. Bioinformatics 46:D1144–D1149. 174:661–670. 321:1357–1361. 5:2085–2091. d Math eds , 32:2847–2849. | LSOne PLoS o.116 vol. ,DvsS(uaaPress, (Humana S Davis E, e ´ 9:e93488. | Bioinformatics o 8 no. a Methods Nat 5:950. | Nucleic 3337 Sta- 29: (R

SYSTEMS BIOLOGY