<<

Global analysis of somatic structural genomic alterations and their impact on expression in diverse human cancers

Babak Alaei-Mahabadia, Joydeep Bhaduryb, Joakim W. Karlssona, Jonas A. Nilssonb, and Erik Larssona,1

aDepartment of Medical and Cell Biology, Institute of Biomedicine, The Sahlgrenska Academy, University of Gothenburg, SE-405 30 Gothenburg, Sweden; and bDepartment of Surgery, Sahlgrenska Cancer Center, Institute of Clinical Sciences, University of Gothenburg, SE-405 30 Gothenburg, Sweden

Edited by Mary-Claire King, University of Washington, Seattle, WA, and approved October 21, 2016 (received for review April 19, 2016) Tumor are mosaics of somatic structural variants (SVs) segments (14). Several factors complicate the analysis, in particular that may contribute to the activation of oncogenes or inactivation mappability issues due to repetitive sequence regions (15). Indeed, of tumor suppressors, for example, by altering gene copy number it has become clear that the results produced by different methods amplitude. However, there are multiple other ways in which SVs are not consistent, and some studies have intersected multiple ap- can modulate transcription, but the general impact of such events proaches to provide a presumed high-confidence set of predictions on tumor transcriptional output has not been systematically de- (16, 17). Adding to the challenges is the difficulty of assessing termined. Here we use whole- sequencing data to map SVs performance: True positive sets have thus far been obtained across 600 tumors and 18 cancers, and investigate the relationship through simulated genomic sequences (18), but this will not re- between SVs, copy number alterations (CNAs), and mRNA expression. flect the true complexity of cancer genomes. Improved bench- We find that 34% of CNA breakpoints can be clarified structurally marking is thus desirable before SVs can be comprehensively and that most amplifications are due to tandem duplications. We studied with WGS in human cancers. observe frequent swapping of strong and weak promoters in the Here, enabled by the availability of a large cancer cohort sub- context of gene fusions, and find that this has a measurable global jected to WGS by The Cancer Genome Atlas (TCGA) consor- impact on mRNA levels. Interestingly, several long noncoding RNAs tium, we map somatic SVs in 600 tumors across 18 different BIOPHYSICS AND were strongly activated by this mechanism. Additionally, SVs were

cancer types and determine relationships between SVs, CNAs, COMPUTATIONAL BIOLOGY confirmed in telomere reverse transcriptase (TERT) upstream regions and mRNA expression. We compare SV breakpoints with those in several cancers, associated with elevated TERT mRNA levels. We from microarray-based CNA data to evaluate and optimize tools also highlight high-confidence gene fusions supported by both and parameters for SV detection, which additionally gives insights genomic and transcriptomic evidence, including a previously unde- into the structural basis of CNAs. We next investigate relation- scribed paired box 8 (PAX8)–nuclear factor, erythroid 2 like 2 ships between SVs and changes in levels, both in NFE2L2 ( ) fusion in thyroid carcinoma. In summary, we combine SV, the context of gene fusions as well as SVs affecting putative reg- CNA, and expression data to provide insights into the structural basis ulatory regions near TSSs. Finally, we combine breakpoints de- of CNAs as well as the impact of SVs on gene expression in tumors. termined from WGS and RNA-sequencing (RNA-seq) data to detect high-confidence fusion across this large cohort. cancer genomics | somatic structural variation | gene fusion | gene expression Results CNA-Based Evaluation of SV Detection Methodology. We first noted omatic structural variants (SVs) in genomic DNA can lead to large discrepancies, for example in terms of SV types, when Schanges in gene copy number, which may contribute to tumor development by altering gene expression levels (1). Several Significance studies have tried to systematically investigate the effect of copy number alterations (CNAs) on tumor transcriptomes (2, 3), but Structural changes in can alter the expression and this can only explain part of the expression changes observed in function of genes in tumors, an important driving mechanism in tumors. Notably, there are several other mechanisms by which some tumors. Whole-genome sequencing makes it possible to SVs can alter tumor transcriptional output, including creation of detect such events on a genome-wide scale, but comprehensive fusion genes. These may produce chimeric mRNAs encoding investigations are still missing. Here, enabled by a massive novel oncogenic (4, 5) or, alternatively, fusion mRNAs amount of whole-genome sequencing data generated by The that maintain their -coding sequence while being tran- ′ Cancer Genome Atlas consortium, we map somatic structural scriptionally induced or repressed due to swapping of 5 ends, changes in 600 tumors of diverse origins. At a global level, we including the (6). Additionally, SVs can affect mRNA find that such events often contribute to altered gene expres- levels by bringing distant regulatory elements into the proximity sion in human cancer, and also highlight specific events that may of transcription start sites (TSSs) without structurally altering the have functional roles during tumor development. mRNA (7). More systematic investigations of the impact of SVs on gene expression levels in tumors are, however, lacking, in part Author contributions: B.A.-M., J.A.N., and E.L. designed research; B.A.-M., J.B., and J.W.K. due to the fact that mapping of SVs is considerably more chal- analyzed data; and B.A.-M. and E.L. wrote the paper. lenging compared with determining CNA profiles. Additionally, The authors declare no conflict of interest. although CNAs arise as a result of SVs, the two have generally This article is a PNAS Direct Submission. been studied in isolation and the general structural basis of so- Freely available online through the PNAS open access option. matic CNAs in tumors is poorly investigated. Data deposition: The sequence reported in this paper has been deposited in the Gene Recent studies have started to use whole-genome sequencing Expression Omnibus (GEO) database, www.ncbi.nlm.nih.gov/geo (accession no. (WGS) to identify SVs in tumors, and multiple tools have been GSE79756). published (8–12). These are based on the idea that breakpoints 1To whom correspondence should be addressed. Email: [email protected]. should reveal themselves on the basis of discordant read pairs (13) This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10. or split reads mapping partially to both ends of fused chromosomal 1073/pnas.1606220113/-/DCSupplemental.

www.pnas.org/cgi/doi/10.1073/pnas.1606220113 PNAS Early Edition | 1of6 Downloaded by guest on September 24, 2021 applying several available tools for SV detection to WGS data and 125 per sample, respectively). Few SVs were found in copy from TCGA tumor/normal pairs (Fig. 1A), and therefore sought number stable cancers such as thyroid carcinoma (THCA) (26) to carefully evaluate multiple options. A general problem in (∼8 per sample on average). Similarly, clear cell and chromo- assessing SV results is the lack of true positive data for com- phobe kidney carcinomas (KIRC and KICH) and low-grade gli- parison (19). To partially overcome this problem, we made use of oma (LGG) tumors exhibited a small number of SVs (Fig. 2A). orthogonal results in the form of CNA profiles from the Affy- We next investigated the relationship between somatic CNA metrix SNP6 microarray platform (SI Methods). CNAs arise as a and SV breakpoints. On average, 34% of CNA breakpoints consequence of somatic SVs in the genome, and segmented (segments with absolute log2 amplitude ≥0.3, SNP6 platform) CNA data therefore provide a true positive subset. Although this detected across all 600 tumors were confirmed in the SV data. subset is incomplete, an ideal experimental and computational Overlaps were reduced in CNA stable tumors such as KICH, pipeline should essentially be able to detect all CNA breakpoints. which are dominated by arm-level events and therefore may lack We applied four available tools, BreakDancer (20), SVDetect breakpoints detectable by WGS (Fig. 2B). Absolute numbers of (21), DELLY (22), and Meerkat (11), to high-coverage WGS CNA and SV breakpoints were strongly correlated across tumors data from 200 TCGA tumors of diverse origins, and quantified (Pearson’s r = 0.81; Fig. 2C). However, a small number of overlaps between SV and CNA breakpoints (<15-kb tolerance). samples had elevated CNAs not reflected in the SV data (Fig. To reduce the computational complexity, only a single chromo- 2C), again explainable by the predominance of arm-level changes some (chr) was considered in this analysis (chr8). To also address in these tumors (Fig. S2). These results further support the the specificity of the methods, we investigated overlaps with a set validity of the SV data and show that both approaches are ef- of randomized CNA breakpoints that maintained a similar po- fective for quantifying overall chromosomal instability. SI Methods sitional distribution across the genome ( ). SVDetect The overlap between CNA and SV data at the level of com- had the highest sensitivity score (35%), followed by Meerkat plete events (for example, both breakpoints for an amplification B (29%) (Fig. 1 ). However, Meerkat performed considerably better segment) was considerably lower (∼10%) compared with B than SVDetect in terms of specificity (Fig. 1 ). Additionally, breakpoint-level overlaps. In many cases, this was explained in Meerkat performed favorably compared with the other methods the SV data by a more complex structural basis than immediately when applied to 11 complete tumor genomes of diverse origins (Fig. evident from CNA profiles. Examples include apparent closely S1), and therefore we chose Meerkat for SV detection. spaced large copy number deletion segments, revealed by the SV data to be due to a large (for example, whole-) A Map of Somatic SVs Across 600 Tumor Genomes. We next used deletion event superimposed on several small-amplitude–neu- high-coverage WGS data from TCGA to map somatic SVs in 600 tralizing tandem duplications (Fig. 2D). Similarly, apparent tumors across 18 cancer types (Dataset S1), and combined these clustered copy number amplification segments could sometimes results with somatic CNA (Affymetrix SNP6 microarrays) and be explained by a larger duplication in combination with several expression (RNA-seq) data from the same samples. In total, E 51,446 somatic SVs were identified (on average 87 events per small deletions (Fig. 2 ). These results emphasize that care sample). Of these, 36,382 were intrachromosomal (13,745 dele- needs to be taken when drawing conclusions about structural chromosomal changes based on CNA data alone. tions, 13,335 tandem duplications, and 9,302 inversions), whereas ≤− 14,953 were interchromosomal translocations. The number of SVs Of all CNA deletion segments (log2 amplitude 0.3) with a clear correspondence in the SV data, we found that 97% were and the relative distribution between SV classes varied consider- F ably within and across cancer types (Fig. 2A). categorized as deletions in the SV analysis (Fig. 2 ). Likewise, Confirming previous data (10, 23–25), breast (BRCA), ovarian 90% of amplified regions were due to tandem duplication, (OV), and lung squamous (LUSC) tumors had a large number of whereas only 5% each overlapped with deletions or inversions. SVs (>100 SVs per sample on average). Likewise, this number was high in uterine (UCEC) and stomach (STAD) tumors (107 Global Impact of Somatic SVs on Tumor mRNA Levels. For somatic SVs to be functional in cancer, they need to have an influence on transcription in some way, by altering mRNA structures or levels. We found that a large fraction (56%) of SV breakpoints were in A B genic regions, and that on average 112 genes per sample over- lapped with at least one SV breakpoint, showing that SVs have Observed Randomized considerable potential to alter tumor transcriptional output. A known mechanism for gene activation by SVs is the swapping of 30 strong and weak promoters in the context of gene fusions. This is 20 described for many individual genes, both with and without al- teration of the protein-coding region (6, 27, 28), but it remains SvDetect 10 BreakDancer unclear to what extent such events have a measurable global influence on tumor expression profiles.

Delly BDancer SVDetect Delly Meerkat Meerkat 0 Based on the SV data, we identified local rearrangements Deletion 1 predicted to result in a fusion gene involving a strong promoter Inversion Intra-chromosomal 2 juxtaposed to a normally weakly expressed gene. We thus con- Insertion 3 sidered only gene pairs where the 5′ partner was expressed at a Other 4 considerably higher average level within a given cancer type Translocation: Inter-chromosomal (greater than fourfold difference; SI Methods). We found that CNAs detected by SV tools (%) 5 these criteria were often predictive of strong induction of the 3′ Fig. 1. Evaluating SV detection tools. (A) Four SV detection tools were fusion partner: In 62/313 cases (20%), we observed induction of applied to 11 WGS tumor/normal pairs. The distribution of SV classes is the 3′ partner above threefold in the structurally affected tumor shown for each tool. SVDetect identifies two classes of SVs: inter- and compared with other tumors of the same cancer type (Fig. 3A). intrachromosomal translocations. The “other” class for Meerkat represents tandem duplications. (B) The fraction of CNA breakpoints overlapping with This should be compared with only 8% and 3% for fusions in- SV breakpoints was used as a proxy to determine the sensitivity of each tool volving incorrect gene orientation or a weakly expressed pro- (red bars). Randomized CNA breakpoints were used to assess the specificity moter, respectively (Fig. 3A; P = 1.1e-5 compared with incorrect of each tool (blue bars). orientation, P = 2.7e-7 compared with weak promoters, and

2of6 | www.pnas.org/cgi/doi/10.1073/pnas.1606220113 Alaei-Mahabadi et al. Downloaded by guest on September 24, 2021 A 800 Deletion Inversion 600 Tandem duplication Inter chromosomal translocation 400

Number of SVs 200

0

OV B GBM KICH LGG KIRC BRCA UCEC STAD LUSC BLCA LUAD SKCM READ CESC COAD HNSC PRAD THCA 80 TCGA-A7-A13D: chr11 < 1% 2% 70 C Mostly arm level events D >3 F 3 Chromothripsis effected 2

60 10 CNA Deletion Both 0 50 2 SV 10 Deletions 40 97% 30 1 10 Tandem duplications 20 TCGA-D7-6822: chr7 5% 0 E 5% 5% 10 10 >3

0 Number of CNAs 2 Amplification -1 CNA 0 10 -1 OV 0 1 2 3 LGG GBM KICH KIRC SV BLCA STAD LUAD LUSC 10 10 10 10 10 THCA BRCA READ CESC PRAD HNSC UCEC COAD SKCM

Amplifications 90%

CNA breakpoints detected by Meerkat (%) Number of SVs High CN instability Low Deletions

Fig. 2. Overview of detected somatic SVs and overlaps with CNAs across 18 cancer types. (A) Samples are grouped by cancer type and ordered by the number of detected SVs. (B) The fraction of CNA breakpoints (Affymetrix SNP6 data) detected by the SV detection pipeline is shown for each tumor. Red lines indicate the median in each cancer type. Cancer types are ordered by the degree of copy number instability (average CNA segments). Segments involving breakpoints in centromeres and telomeres (within 2 Mb) were filtered out for this analysis, as these may not be assessable by SV callers. (C) Scatter plot showing cor- “ ” relation between the number of CNA and SV breakpoints in each sample. Mostly arm level indicates that more than half of all chromosome arms have arm- BIOPHYSICS AND level CNAs. “Chromothripsis effected” indicates at least one chromosome arm with more than 30 copy number segments. (D and E) Two examples of complex COMPUTATIONAL BIOLOGY relationships between CNAs and SVs. Blue represents deletions, and red represents amplifications. The linear copy number amplitude is indicated for the CNA data. (F) Structural basis of copy number amplifications and deletions. Only CNA segments where both breakpoints could be matched to those of a single SV event were considered (15-kb tolerance). Pie chart slices are color coded similar to Fig. 2A.

P = 1.3e-6 compared with randomly picked cases; two-sided (Fig. 3B; P = 0.069 and P = 0.0099, respectively, compared with Kolmogorov–Smirnov test). the control sets). Although it remains to be determined to what Among the 62 cases showing induced expression were several extent the observed events are under positive selection, the re- confirmatory observations (Fig. S3), including prominent acti- sults suggest upstream SVs may often contribute to transcrip- vation of ERG in prostate cancer (29) as well as RET (30) and tional activation of genes in human tumors. ALK (31) in THCA, by fusion with strong promoters from Notably, among top cases showing increased expression in re- TMPRSS2, CCDC6, and STRN, respectively. Notable was strong lation to upstream SVs (40 cases above fourfold) was telomere (8.1-fold relative mean) mRNA induction of protein kinase C reverse transcriptase (TERT) in two different cancers, KICH and beta (PRKCB) through fusion with USP7, a highly expressed KIRC (Fig. S5). The data from KICH, where TERT upstream SVs deubiquitinase, in colorectal carcinoma (COAD). PRKCB is a were found in two samples with high TERT mRNA levels, confirm protein kinase linked to tumor growth and survival (32) that is recently published results (7) (Fig. 3C). In KIRC, the sample with druggable by the small-molecule inhibitor enzastaurin (33). Al- the highest TERT expression also harbored an upstream SV, though fusions involving PRKCB were recently detected in fi- whereas an additional affected sample lacked mRNA elevation in brous histiocytoma (34), they have not been described in COAD. this cancer (Fig. 3D). Although not considered in the global screen Additionally, among the induced transcripts were several long due to simultaneous copy number gain, TERT upstream SVs were noncoding RNAs (lncRNAs), including the intergenic RP11- found also in one rectal adenocarcinoma (READ) and one mel- 521D12.1, strongly (39.9-fold) induced through fusion with ASAP2 anoma (SKCM) sample (Fig. 3 E and F), the latter notably lacking in BRCA. Although the functional relevance remains to be TERT promoter mutations. These results further support a role established, this establishes promoter swapping as a mechanism for upstream SVs in activation of TERT in cancer. for activation of lncRNAs in cancer. A few recent studies show that structural rearrangements can Improved Gene Fusion Detection by Combined DNA/RNA Analysis. contribute to transcriptional activation through shuffling of cis- Recent studies have relied on RNA-seq data to detect fusion regulatory elements near the TSS of genes, without altering genes in comprehensive tumor materials (37, 38). However, mRNA structure (35, 36). This has been described in individual similar to SV analysis, RNA-seq–based fusion detection is prone cases, but the impact on a more global scale remains unknown. to both false positives and negatives (39), for example due to To investigate this, we identified cases of genes being affected by read-mapping issues or misinterpreted germ-line events (40). upstream SVs (−10 to 0 kb from the TSS). We next considered Therefore, we explored whether fusion gene detection can be the relative expression levels for these genes in affected samples improved by combining RNA and DNA data. compared with unaffected samples of the same cancer type while We applied FusionCatcher (41) to identify fusion transcripts avoiding confounding effects from copy number changes and in 431/600 samples with available paired-end RNA-seq data. In chromothripsis (SI Methods and Fig. S4). Cases with SVs in a total, 9,657 fusions (on average 22.4 per sample) were reported, region −110 to 100 kb upstream, or randomly picked cases, were of which 1,467 were in-frame (Dataset S2). In parallel, we an- considered as controls. We found that SV-affected genes/cases notated all genomic SV breakpoints and identified 9,854 pre- showed a weak trend toward overall increased mRNA levels dicted gene fusions, of which 740 were inversions, 1,240 were

Alaei-Mahabadi et al. PNAS Early Edition | 3of6 Downloaded by guest on September 24, 2021 A B Novel Mechanism for Nuclear Factor, Erythroid 2 Like 2 Activation in 1 1 Cancer. Notable among the high-confidence fusions was paired Correct orientation 0.9 0.9 box 8 (PAX8) fusing with nuclear factor, erythroid 2 like 2 0.8 0.8 NFE2L2 Incorrect orientation ( ) in one THCA tumor, revealed by the SV data to be 0.7 0.7 due to tandem duplication on (Fig. 5A). This 0.6 Weaker promoter 0.6 NFE2L2 Randomized (median) event was associated with potent transcriptional acti- 0.5 0.5 vation (Fig. 5B) as well as loss of the KEAP1 interaction domain 0.4 0.4 in NFE2L2 (Fig. 5A). NFE2L2 (NRF2) is a 0.3 0.3 Cumulative frequency Cumulative frequency that coordinates the cellular response to , which is 0.2 0.2 0-10 kb upstream TSS 0.1 0.1 100-110 kb upstream TSS activated in some tumors by mutations that disrupt interaction Randomized (median) KEAP1 NFE2L2 0 0 with its inhibitor (43). However, fusions in- -10-8 -6 -4 -2 026 4 -4 -3 -1-2 0 1 2 3 4 KEAP1 Altered vs. WT (log mRNA change) Altered vs. WT (log mRNA change) volving loss of the interaction domain have not been 2 2 described, and the result is thus suggestive of a novel mechanism CDTERT TERT E TERT FTERT KICH KIRC READ SKCM for NFE2L2 activation in cancer. ) 1 2 5 2 P = 0.23 2 P = 0.50 To further evaluate the PAX8–NFE2L2 fusion, we investi- -1 P = 0.038 0 P = 0.86 0 gated the expression of known target genes downstream of 0 -2 -2 -3 NFE2L2. We found that known NFE2L2 targets were strongly -4 induced in the tumor that carried the PAX8–NFE2L2 fusion,

mRNA level (log -4 -5 -1 -0.50 0.5 1 -1 -0.50 0.5 1 -1 -0.50 0.5 1 -1 -0.50 0.5 1 NFE2L2 Copy number amplitude WT Promoter proximal SV Promoter mutation supporting its function as a potent activator of the transcriptional program (Fig. 5C). By considering an expanded Fig. 3. Impact of SVs on gene expression levels in tumors. (A)Genefusionevents set of THCA tumors with available exome sequencing data, we leading to swapping of strong and weak promoters have a global influence on identified one tumor with a KEAP1 mutation, likewise associated mRNA expression levels in tumors. Cumulative distribution function (CDF) plot with strong induction of NFE2L2 target mRNAs (Fig. 5C). This showing log2-transformed expression differences for genes predicted in the SV data NFE2L2 to be activated due to fusion with a stronger promoter, comparing the SV-affected further supports that activation may contribute to tu- sample with the mean of other samples in the same cancer type. Control analyses morigenesis in some thyroid carcinomas. involving fusions not predicted to cause activation of the 3′ partner (incorrect strand We next synthesized the PAX8–NFE2L2 fusion transcript and orientation or weakly expressed fusion partner) are shown by dashed lines, in ad- cloned it into a lentiviral vector (Fig. S7), and used this to dition to a randomized set (randomly picked sample within the relevant cancer type, overexpress it in the murine prostate adenocarcinoma cell line median of 100 randomizations). Inversions and deletions were used for this analysis. TRAMP-C1 (SI Methods). We used high-coverage RNA se- (B) Global influence of promoter-proximal SVs on mRNA levels. CDF plot showing > − quencing ( 30 million reads per sample) to compare the expres- expression differences for genes harboring an upstream SV event ( 10 to 0 kb), PAX8–NFE2L2 comparing the affected sample with the mean of other samples in the same cancer sion profiles of cells expressing the fusion type. Genes with SVs in a region farther upstream (−110 to 100 kb) and randomized transcript with cells transduced with an empty vector. We observed controls are shown. (C) Expression vs. copy number change for TERT in KICH. a prominent transcriptional response dominated by inductions (534 Samples with promoter-proximal SV breakpoints are marked in orange. P values induced and 215 repressed genes at q < 0.01; Fig. 5D). were calculated using the Wilcoxon rank-sum test comparing the expression of the altered tumors with other samples. (D–F)SameasC for KIRC, READ, and SKCM, respectively. Samples with TERT promoter mutations are indicated in purple. A B deletions, 2,511 were duplications, and 5,363 were interchro- A 1 mosomal translocations (Fig. 4 ). Of these, 5,209 (52%) in- 35 0.9 WGS Translocations 0.8 RNA-Seq volved genes fused in the correct orientation, thus having the Both 2,884 gene fusions 0.7 potential to produce a fusion transcript. We found that only 299 0.6 Tandem duplications unique pairs of genes were detected in both DNA and RNA 39 0.5 1,279 gene fusions 0.4 data, of which 97 were in-frame according to FusionCatcher 0.3 Inversions CGC enrichment (Fig. S6). In this high-confidence set with dual DNA/RNA sup- 9 428 gene fusions 0.2 0.1 port, there were 78 genes recurrently involved in fusions in more Deletions 14 0 than one sample (Table S1). RNA-seq transcripts 9,657 fusion 1,467 In-frame 619 gene fusions 100 200 300 400 500 Gene rank Next, we evaluated the two approaches, either alone or in C combination, by investigating the enrichment of known cancer BUB1B drivers in the Cancer Gene Census (CGC) database among the EIF2AK4 BUB1B EIF2AK4 top recurrently fused genes (42). We found that neither method (5’ partner) (3’ partner) Kinase domain -4-3-2-1 012345 alone showed notable CGC enrichment, although the SV-based New fusion transcript mRNA level (log RPKM) analysis performed slightly better (Fig. 4B). However, combining 2 both approaches resulted in a notable CGC enrichment among Fig. 4. Combined DNA/RNA approach improves detection of gene fusions. top-scoring genes, with 14 of the 78 recurrently fused genes ∼ (A) Venn diagram (not proportional) showing the number of fusions de- (18%) being listed in the CGC (random expectation 2.5%). tected by analysis of RNA-seq data and in the genomic SV data, as well as This suggests that DNA and RNA approaches to fusion gene overlaps between the two. (B) Enrichment of known cancer genes (from the detection complement each other, such that a combination of Cancer Gene Census) among the top recurrently fused genes identified by both provides more relevant results. analysis of RNA-seq data, genomic SVs, or a combined DNA/RNA approach. In the set of high-confidence detections with dual DNA/RNA The graphs show the cumulative fraction of known cancer genes as a support (Fig. S6) were several known functional fusions, in- function of gene rank. Rank was determined for individual genes based on cluding TMPRSS2–ERG in 6/20 prostate adenocarcinoma tu- the frequency of recurrence, taking into account all fusions in which a given mors (PRAD) and RET–CCDC6 in THCA (2/34 tumors). gene was predicted to participate. (C) A fusion between EIF2AK4 and BUB1B EIF2AK4 BUB1B in one PRAD tumor, due to a deletion in chromosome 15. The deleted region Results also included fused to , a mitotic is shown in blue, whereas nonaltered regions are shown in gray. mRNA checkpoint kinase, by a deletion in one PRAD tumor, with levels [represented as reads per kilobase of transcript per million mapped strong transcriptional induction of BUB1B in this sample reads (RPKM)] for both fusion partners across all PRADs are shown, with the (Fig. 4C). affected sample indicated in red.

4of6 | www.pnas.org/cgi/doi/10.1073/pnas.1606220113 Alaei-Mahabadi et al. Downloaded by guest on September 24, 2021 A C VS. Discussion NFE2L2 PAX8 KEAP1 binding Transcription In the present study, we used an integrative workflow to analyze (3’ partner) (5’ partner) factor structural rearrangements in 600 tumors from 18 different cancer Paired 10 RPKM) 2

P = 6e-08 types. We rely on coanalysis of somatic CNA and SV data to P =1.5e-22 P = 7.8e-13 P = 3.4e-13 P = 4.5e-13 5 optimize tools and parameters for SV detection, and to provide New fusion transcript insights into the general structural basis of CNAs in tumors. 0 Further coanalysis with RNA-seq data provided insights into the WT PAX8-NFE2L2 fused B Altered NFE2L2 KEAP1 mutated overall influence of SVs on gene expression in tumors. mRNA level (log -5 Surprisingly, although there was a clear correlation between

PAX8 GCLC GCLM NQO1 NFE2L2 the number of CNA breakpoints and SV breakpoints seen in Fused TXNRD1 5678 9 10 D NFE2L2 target genes each tumor, only 34% of copy number breakpoints had corre- WT mRNA level (log RPKM) 2 10-150 spondence in the SVs detected by WGS. It is likely that this E partly can be explained by limited sensitivity in detecting SV METABOLISM breakpoints, for example due to insufficient read coverage in ARYLHYDROCARBON 10-100 TRANSCRIPTIONAL ACTIVATION BY NRF2 some regions. However, another contributing factor could be OXIDATIVE STRESS P false positives or positional uncertainties arising from the cir- ID SIGNALING 10-50 SIGNAL TRANSDUCTION OF S1P RECEPTOR cular binary segmentation algorithm used for detecting CNA METAPATHWAY BIOTRANSFORMATION q = 0.01 breakpoints in array-based copy number data. IRON UPTAKE AND TRANSPORT 0 TERT is a subunit of telomerase essential for telomere 0123 -Log10(P ) -4 -2 20 10864 FWER PAX8-NFE2L2 transduced vs. control maintenance and long-term proliferation potential of cancer F cells. Both somatic promoter mutations (46–48) and copy num- (log2 RNA change) 0.8 Induced G ber gains are known to contribute to TERT activation in tumors. 0.6 4 P = 0.022 Recent data from KICH suggest that also SVs in the promoter of repressed mGclc 0.4 mGclm TERT could be important (7). Our results extend this to addi- 3 P = 0.007 0.2 tional cancers, thus more firmly establishing regulatory SVs as a GSTA2 GSTA1 GCLM GCLC SLC7A11 HMOX1 KEAP1 MAF CEBPB NFE2L2 MAPK8 AIMP2 PIK3CA EPHB2 PRKCA TERT

Enrichment score 0.0 2 mechanism for mRNA induction. Earlier studies have emphasized large discrepancies between

1 BIOPHYSICS AND 0 5,000 10,000 15,000 20,000 different callers for detecting fusions in RNA-seq data and that COMPUTATIONAL BIOLOGY

Rank in ordered dataset Mock + Control + PAX8-NFE2L2 false positive rates are high (39). This should contribute to the Relative expression to mUb 0 limited overlap observed here between DNA- and RNA-based Fig. 5. Activation of the NFE2L2 pathwaybyaPAX8–NFE2L2 fusion gene in thyroid fusion predictions. Additionally, read-through transcription carcinoma. (A) PAX8 fused to the transcription factor NFE2L2 as a result of tandem events may produce fusion transcripts involving neighboring duplication on chromosome 2. Nonaltered regions are shown in gray, and the tan- genes, which have no correspondence in the SV data, and some dem duplication is indicated in red. Major domains are indicated. The 5′ end of PAX8 replaces the inactivating KEAP1 interaction domain in NFE2L2.(B) Fusion with PAX8 is genomic fusion may not be expressed, further reducing overlaps. associated with strong transcriptional induction. The plot shows mRNA levels for both Although both DNA and RNA data have limitations when it fusion partners in all thyroid carcinoma samples, with the affected tumor indicated in comes to detecting fusions, we found that a combination of the red. (C) Strong activation of NFE2L2 targets in the tumor expressing the PAX8– two data types reduced the noise inherent to each approach. This NFE2L2 fusion transcript. mRNA levels of NFE2L2 and four canonical target genes is in agreement with recent studies that combine DNA and RNA (GCLC, GCLM, TXNRD1,andNQO1) are shown across 572 samples with available evidence to improve detection of gene fusions (49, 50). exome data from TCGA. The PAX8–NFE2L2–expressing tumor, as well as one addi- BUB1B encodes a mitotic checkpoint kinase also known as tional tumor identified as having a predicted NFE2L2-activating mutation in KEAP1,is BUBR1 that ensures proper chromosome segregation by slowing ’ indicatedingreen.P values were calculated using Student s t test, comparing the down mitosis (51). Hemizygous loss in mice results in increased expression of the two altered tumors with the remaining samples. (D)Transcriptome profiling of TRAMP-C1 murine prostate carcinoma cells transduced with a lentivirus susceptibility to carcinogen-induced colon cancer development – (52) and accelerated aging (53). On the other hand, elevated expressing the PAX8 NFE2L2 fusion transcript or a control lentivirus. Volcano plot BUB1B showing 534 genes significantly induced and 215 genes repressed at q < 0.01. (E) expression of has also been associated with cancer pro- Bub1b Enriched gene sets (Pnominal < 0.01) calculated by GSEA. The familywise error rate gression (54), and overexpression of in mice results in tu- (FWER) calculated by GSEA is shown on the x axis, and the dashed line represents morigenesis (55). Fusions involving this gene have not been = PFWER 0.05. WikiPathway (57) gene sets were used for this analysis. (F) Enrich- previously described, and the functional relevance of the ment plot for the transcriptional activation by NRF2 gene set. (G) Altered ex- EIF2AK4–BUB1B fusion found here deserves further investigation. pression of NFE2L2 target genes Gclc and Gclm was confirmed using RT-qPCR. NFE2L2 or KEAP1 mutations, indicative of activation of the Expression levels are shown relative to (mUb). P values were determined NFE2L2/NRF2 pathway that coordinates the cellular response to using Student’s t test. Bars indicate the mean, and error bars indicate SD. oxidative stress, are common in some cancers, including lung ade- nocarcinoma, but are uncommon in THCA. However, a recent study established that the NFE2L2 pathway is frequently altered in THCA To identify affected pathways, we next performed gene set through other mechanisms, including KEAP1 promoter hyper- enrichment analysis using GSEA (44). Whereas no gene sets methylation (56). Our results in THCA suggest an additional way of were significantly down-regulated, four sets were enriched activating NFE2L2, whereby a fusion gene is formed that lacks the among induced genes (adjusted P < 0.05; Fig. 5E). Notably, inhibitory KEAP1 interaction domain. This further establishes that these included experimentally determined targets of NFE2L2 NFE2L2 activation is important in THCA, and suggests a novel (“transcriptional activation by NRF2”; Fig. 5F), as well as “glu- mechanism for activation of NFE2L2 in cancer. tathione metabolism,”“oxidative stress,” and “arylhydrocarbon In summary, by integrative analysis of genomic and tran- receptor” signaling, all of which are known to be regulated by scriptomic data from a large tumor cohort, we provide multiple NFE2L2 (43, 45). Induction of the canonical NFE2L2 targets insights into structural rearrangements and their effects on gene expression in human cancer. Gclc and Gclm was confirmed by quantitative RT-PCR (2.4- and 3.1-fold, respectively; P < 0.022), further showing that the PAX8– Methods NFE2L2 NFE2L2 fusion protein is a capable activator of target WGS data from 600 TCGA tumors were analyzed for somatic SVs using Meerkat mRNAs (Fig. 5G). (11). Gene expression levels were quantified as described previously (47). Fusions

Alaei-Mahabadi et al. PNAS Early Edition | 5of6 Downloaded by guest on September 24, 2021 transcripts were identified using FusionCatcher (41) with default parameters. A Foundation for Strategic Research (to E.L.), Swedish Medical Research Coun- detailed description of the methods used in the study is provided in SI Methods. cil (to E.L. and J.A.N.), Swedish Cancer Society (to E.L. and J.A.N.), Åke Wiberg Foundation (to E.L.), Lars Erik Lundberg Foundation for Research and Educa- ACKNOWLEDGMENTS. The results published here are in whole or in part tion (to E.L.), Region Västra Götaland (ALF Grant to J.A.N.), Adlerbertska based upon data generated by The Cancer Genome Atlas pilot project estab- Forskningsstiftelsen (to J.B.), Assar Gabrielsson Foundation (to J.B.), W&M lished by the NCI and NHGRI. Information about TCGA and the investigators Lundgren Foundation (to J.B.), and Sahlgrenska Universitetssjukhusets Stiftelser and institutions that constitute TCGA Research Network can be found at (to J.B.). The computations were in part performed on computing resources https://cancergenome.nih.gov. This work was supported by grants from the provided by the SNIC through the Uppsala Multidisciplinary Center for Ad- Knut and Alice Wallenberg Foundation (to E.L. and J.A.N.), Swedish vanced Computational Science (UPPMAX) under Project b2012108.

1. Podlaha O, Riester M, De S, Michor F (2012) Evolution of the cancer genome. Trends 33. Graff JR, et al. (2005) The protein kinase Cbeta-selective inhibitor, enzastaurin Genet 28(4):155–163. (LY317615.HCl), suppresses signaling through the AKT pathway, induces apoptosis, 2. Zack TI, et al. (2013) Pan-cancer patterns of somatic copy number alteration. Nat and suppresses growth of human colon cancer and glioblastoma xenografts. Cancer Genet 45(10):1134–1140. Res 65(16):7462–7469. 3. Santarius T, Shipley J, Brewer D, Stratton MR, Cooper CS (2010) A census of amplified 34. Płaszczyca A, et al. (2014) Fusions involving protein kinase C and membrane-associ- and overexpressed human cancer genes. Nat Rev Cancer 10(1):59–64. ated proteins in benign fibrous histiocytoma. Int J Biochem Cell Biol 53:475–481. 4. Delattre O, et al. (1992) Gene fusion with an ETS DNA-binding domain caused by 35. Northcott PA, et al. (2014) Enhancer hijacking activates GFI1 family oncogenes in chromosome translocation in human tumours. Nature 359(6391):162–165. medulloblastoma. Nature 511(7510):428–434. 5. Sorensen PH, Triche TJ (1996) Gene fusions encoding chimaeric transcription factors in 36. Affer M, et al. (2014) Promiscuous rearrangements hijack enhancers but solid tumours. Semin Cancer Biol 7(1):3–14. mostly super-enhancers to dysregulate MYC expression in multiple myeloma. 6. Oliveira AM, et al. (2005) Aneurysmal bone cyst variant translocations upregulate Leukemia 28(8):1725–1735. USP6 transcription by promoter swapping with the ZNF9, COL1A1, TRAP150, and 37. Stransky N, Cerami E, Schalm S, Kim JL, Lengauer C (2014) The landscape of kinase OMD genes. Oncogene 24(21):3419–3426. fusions in cancer. Nat Commun 5:4846. 7. Davis CF, et al.; Cancer Genome Atlas Research Network (2014) The somatic genomic 38. Edgren H, et al. (2011) Identification of fusion genes in breast cancer by paired-end landscape of chromophobe renal cell carcinoma. Cancer Cell 26(3):319–330. RNA-sequencing. Genome Biol 12(1):R6. 8. Berger MF, et al. (2011) The genomic complexity of primary human prostate cancer. 39. Carrara M, et al. (2013) State-of-the-art fusion-finder algorithms sensitivity and Nature 470(7333):214–220. specificity. BioMed Res Int 2013:340620. 9. Hillmer AM, et al. (2011) Comprehensive long-span paired-end-tag mapping reveals 40. Liu S, et al. (2016) Comprehensive evaluation of fusion transcript detection algorithms characteristic patterns of structural variations in epithelial cancer genomes. Genome and a meta-caller to combine top performing methods in paired-end RNA-seq data. Res 21(5):665–675. Nucleic Acids Res 44(5):e47. 10. Stephens PJ, et al. (2009) Complex landscapes of somatic rearrangement in human 41. Nicorici D, et al. (2014) FusionCatcher—A tool for finding somatic fusion genes in breast cancer genomes. Nature 462(7276):1005–1010. paired-end RNA-sequencing data. bioRxiv:011650. 11. Yang L, et al. (2013) Diverse mechanisms of somatic structural variations in human 42. Futreal PA, et al. (2004) A census of human cancer genes. Nat Rev Cancer 4(3):177–183. cancer genomes. Cell 153(4):919–929. 43. Jaramillo MC, Zhang DD (2013) The emerging role of the Nrf2-Keap1 signaling 12. Quinlan AR, Hall IM (2012) Characterizing complex structural variation in germline pathway in cancer. Genes Dev 27(20):2179–2191. and somatic genomes. Trends Genet 28(1):43–53. 44. Subramanian A, et al. (2005) Gene set enrichment analysis: A knowledge-based ap- 13. Korbel JO, et al. (2007) Paired-end mapping reveals extensive structural variation in proach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA the . Science 318(5849):420–426. 102(43):15545–15550. 14. Suzuki S, Yasuda T, Shiraishi Y, Miyano S, Nagasaki M (2011) ClipCrop: A tool for 45. Shin S, et al. (2007) NRF2 modulates aryl hydrocarbon receptor signaling: Influence on detecting structural variations with single-base resolution using soft-clipping in- adipogenesis. Mol Cell Biol 27(20):7188–7197. formation. BMC Bioinformatics 12(Suppl 14):S7. 46. Horn S, et al. (2013) TERT promoter mutations in familial and sporadic melanoma. 15. Cao H, et al. (2014) Rapid detection of structural variation in a human genome using Science 339(6122):959–961. nanochannel-based genome mapping technology. Gigascience 3(1):34. 47. Fredriksson NJ, Ny L, Nilsson JA, Larsson E (2014) Systematic analysis of noncoding 16. Mohiyuddin M, et al. (2015) MetaSV: An accurate and integrative structural-variant somatic mutations and gene expression alterations across 14 tumor types. Nat Genet caller for next generation sequencing. Bioinformatics 31(16):2741–2744. 46(12):1258–1263. 17. Wong K, Keane TM, Stalker J, Adams DJ (2010) Enhanced structural variant and 48. Huang FW, et al. (2013) Highly recurrent TERT promoter mutations in human mela- breakpoint detection using SVMerge by integration of multiple detection methods noma. Science 339(6122):957–959. and local assembly. Genome Biol 11(12):R128. 49. Chen K, et al. (2013) BreakTrans: Uncovering the genomic architecture of gene fu- 18. Qin M, et al. (2015) SCNVSim: Somatic copy number variation and structure variation sions. Genome Biol 14(8):R87. simulator. BMC Bioinformatics 16:66. 50. Zhang J, et al. (2016) INTEGRATE: Gene fusion discovery using whole genome and 19. Liu B, et al. (2015) Structural variation discovery in the cancer genome using next gener- transcriptome data. Genome Res 26(1):108–118. ation sequencing: Computational solutions and perspectives. Oncotarget 6(8):5477–5489. 51. Janssen A, Medema RH (2013) Genetic instability: Tipping the balance. Oncogene 20. Chen K, et al. (2009) BreakDancer: An algorithm for high-resolution mapping of 32(38):4459–4470. genomic structural variation. Nat Methods 6(9):677–681. 52. Dai W, et al. (2004) Slippage of mitotic arrest and enhanced tumor development in 21. Zeitouni B, et al. (2010) SVDetect: A tool to identify genomic structural variations mice with BubR1 haploinsufficiency. Cancer Res 64(2):440–445. from paired-end and mate-pair sequencing data. Bioinformatics 26(15):1895–1896. 53. Baker DJ, et al. (2004) BubR1 insufficiency causes early onset of aging-associated 22. Rausch T, et al. (2012) DELLY: Structural variant discovery by integrated paired-end phenotypes and infertility in mice. Nat Genet 36(7):744–749. and split-read analysis. Bioinformatics 28(18):i333–i339. 54. Ding Y, et al. (2013) Cancer-specific requirement for BUB1B/BUBR1 in human brain 23. Banerji S, et al. (2012) Sequence analysis of mutations and translocations across breast tumor isolates and genetically transformed cells. Cancer Discov 3(2):198–211. cancer subtypes. Nature 486(7403):405–409. 55. Ricke RM, Jeganathan KB, van Deursen JM (2011) Bub1 overexpression induces an- 24. Patch AM, et al.; Australian Ovarian Cancer Study Group (2015) Whole-genome euploidy and tumor formation through Aurora B kinase hyperactivation. J Cell Biol characterization of chemoresistant ovarian cancer. Nature 521(7553):489–494. 193(6):1049–1064. 25. Cancer Genome Atlas Research Network (2012) Comprehensive genomic character- 56. Martinez VD, et al. (2013) Frequent concerted genetic mechanisms disrupt multiple ization of squamous cell lung cancers. Nature 489(7417):519–525. components of the NRF2 inhibitor KEAP1/CUL3/RBX1 E3-ubiquitin ligase complex in 26. Cancer Genome Atlas Research Network (2014) Integrated genomic characterization thyroid cancer. Mol Cancer 12(1):124. of papillary thyroid carcinoma. Cell 159(3):676–690. 57. Kutmon M, et al. (2016) WikiPathways: Capturing the full diversity of pathway 27. Nord KH, et al. (2014) GRM1 is upregulated through gene fusion and promoter knowledge. Nucleic Acids Res 44(D1):D488–D494. swapping in chondromyxoid fibroma. Nat Genet 46(5):474–477. 58. Karolchik D, et al. (2004) The UCSC Table Browser data retrieval tool. Nucleic Acids Res 28. Tomlins SA, et al. (2005) Recurrent fusion of TMPRSS2 and ETS transcription factor 32(Database issue):D493–D496. genes in prostate cancer. Science 310(5748):644–648. 59. Harrow J, et al. (2012) GENCODE: The reference human genome annotation for The 29. Kumar-Sinha C, Tomlins SA, Chinnaiyan AM (2008) Recurrent gene fusions in prostate ENCODE Project. Genome Res 22(9):1760–1774. cancer. Nat Rev Cancer 8(7):497–511. 60. Bhadury J, et al. (2014) BET and HDAC inhibitors induce similar genes and biological 30. Celestino R, et al. (2012) Survey of 548 oncogenic fusion transcripts in thyroid tumors effects and synergize to kill in Myc-induced murine lymphoma. Proc Natl Acad Sci USA supports the importance of the already established thyroid fusions genes. Genes 111(26):E2721–E2730. Chromosomes Cancer 51(12):1154–1164. 61. Dobin A, et al. (2013) STAR: Ultrafast universal RNA-seq aligner. Bioinformatics 29(1):15–21. 31. Pérot G, et al. (2014) Identification of a recurrent STRN/ALK fusion in thyroid carci- 62. Anders S, Pyl PT, Huber W (2015) HTSeq—A Python framework to work with high- nomas. PLoS One 9(1):e87170. throughput sequencing data. Bioinformatics 31(2):166–169. 32. Surdez D, et al. (2012) Targeting the EWSR1-FLI1 oncogene-induced protein kinase 63. Love MI, Huber W, Anders S (2014) Moderated estimation of fold change and dis- PKC-β abolishes Ewing sarcoma growth. Cancer Res 72(17):4494–4503. persion for RNA-seq data with DESeq2. Genome Biol 15(12):550.

6of6 | www.pnas.org/cgi/doi/10.1073/pnas.1606220113 Alaei-Mahabadi et al. Downloaded by guest on September 24, 2021