Global Analysis of Somatic Structural Genomic Alterations and Their Impact on Gene Expression in Diverse Human Cancers
Total Page:16
File Type:pdf, Size:1020Kb
Global analysis of somatic structural genomic alterations and their impact on gene expression in diverse human cancers Babak Alaei-Mahabadia, Joydeep Bhaduryb, Joakim W. Karlssona, Jonas A. Nilssonb, and Erik Larssona,1 aDepartment of Medical Biochemistry and Cell Biology, Institute of Biomedicine, The Sahlgrenska Academy, University of Gothenburg, SE-405 30 Gothenburg, Sweden; and bDepartment of Surgery, Sahlgrenska Cancer Center, Institute of Clinical Sciences, University of Gothenburg, SE-405 30 Gothenburg, Sweden Edited by Mary-Claire King, University of Washington, Seattle, WA, and approved October 21, 2016 (received for review April 19, 2016) Tumor genomes are mosaics of somatic structural variants (SVs) segments (14). Several factors complicate the analysis, in particular that may contribute to the activation of oncogenes or inactivation mappability issues due to repetitive sequence regions (15). Indeed, of tumor suppressors, for example, by altering gene copy number it has become clear that the results produced by different methods amplitude. However, there are multiple other ways in which SVs are not consistent, and some studies have intersected multiple ap- can modulate transcription, but the general impact of such events proaches to provide a presumed high-confidence set of predictions on tumor transcriptional output has not been systematically de- (16, 17). Adding to the challenges is the difficulty of assessing termined. Here we use whole-genome sequencing data to map SVs performance: True positive sets have thus far been obtained across 600 tumors and 18 cancers, and investigate the relationship through simulated genomic sequences (18), but this will not re- between SVs, copy number alterations (CNAs), and mRNA expression. flect the true complexity of cancer genomes. Improved bench- We find that 34% of CNA breakpoints can be clarified structurally marking is thus desirable before SVs can be comprehensively and that most amplifications are due to tandem duplications. We studied with WGS in human cancers. observe frequent swapping of strong and weak promoters in the Here, enabled by the availability of a large cancer cohort sub- context of gene fusions, and find that this has a measurable global jected to WGS by The Cancer Genome Atlas (TCGA) consor- impact on mRNA levels. Interestingly, several long noncoding RNAs tium, we map somatic SVs in 600 tumors across 18 different BIOPHYSICS AND were strongly activated by this mechanism. Additionally, SVs were cancer types and determine relationships between SVs, CNAs, COMPUTATIONAL BIOLOGY confirmed in telomere reverse transcriptase (TERT) upstream regions and mRNA expression. We compare SV breakpoints with those in several cancers, associated with elevated TERT mRNA levels. We from microarray-based CNA data to evaluate and optimize tools also highlight high-confidence gene fusions supported by both and parameters for SV detection, which additionally gives insights genomic and transcriptomic evidence, including a previously unde- into the structural basis of CNAs. We next investigate relation- scribed paired box 8 (PAX8)–nuclear factor, erythroid 2 like 2 ships between SVs and changes in gene expression levels, both in NFE2L2 ( ) fusion in thyroid carcinoma. In summary, we combine SV, the context of gene fusions as well as SVs affecting putative reg- CNA, and expression data to provide insights into the structural basis ulatory regions near TSSs. Finally, we combine breakpoints de- of CNAs as well as the impact of SVs on gene expression in tumors. termined from WGS and RNA-sequencing (RNA-seq) data to detect high-confidence fusion genes across this large cohort. cancer genomics | somatic structural variation | gene fusion | gene expression Results CNA-Based Evaluation of SV Detection Methodology. We first noted omatic structural variants (SVs) in genomic DNA can lead to large discrepancies, for example in terms of SV types, when Schanges in gene copy number, which may contribute to tumor development by altering gene expression levels (1). Several Significance studies have tried to systematically investigate the effect of copy number alterations (CNAs) on tumor transcriptomes (2, 3), but Structural changes in chromosomes can alter the expression and this can only explain part of the expression changes observed in function of genes in tumors, an important driving mechanism in tumors. Notably, there are several other mechanisms by which some tumors. Whole-genome sequencing makes it possible to SVs can alter tumor transcriptional output, including creation of detect such events on a genome-wide scale, but comprehensive fusion genes. These may produce chimeric mRNAs encoding investigations are still missing. Here, enabled by a massive novel oncogenic proteins (4, 5) or, alternatively, fusion mRNAs amount of whole-genome sequencing data generated by The that maintain their protein-coding sequence while being tran- ′ Cancer Genome Atlas consortium, we map somatic structural scriptionally induced or repressed due to swapping of 5 ends, changes in 600 tumors of diverse origins. At a global level, we including the promoter (6). Additionally, SVs can affect mRNA find that such events often contribute to altered gene expres- levels by bringing distant regulatory elements into the proximity sion in human cancer, and also highlight specific events that may of transcription start sites (TSSs) without structurally altering the have functional roles during tumor development. mRNA (7). More systematic investigations of the impact of SVs on gene expression levels in tumors are, however, lacking, in part Author contributions: B.A.-M., J.A.N., and E.L. designed research; B.A.-M., J.B., and J.W.K. due to the fact that mapping of SVs is considerably more chal- analyzed data; and B.A.-M. and E.L. wrote the paper. lenging compared with determining CNA profiles. Additionally, The authors declare no conflict of interest. although CNAs arise as a result of SVs, the two have generally This article is a PNAS Direct Submission. been studied in isolation and the general structural basis of so- Freely available online through the PNAS open access option. matic CNAs in tumors is poorly investigated. Data deposition: The sequence reported in this paper has been deposited in the Gene Recent studies have started to use whole-genome sequencing Expression Omnibus (GEO) database, www.ncbi.nlm.nih.gov/geo (accession no. (WGS) to identify SVs in tumors, and multiple tools have been GSE79756). published (8–12). These are based on the idea that breakpoints 1To whom correspondence should be addressed. Email: [email protected]. should reveal themselves on the basis of discordant read pairs (13) This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10. or split reads mapping partially to both ends of fused chromosomal 1073/pnas.1606220113/-/DCSupplemental. www.pnas.org/cgi/doi/10.1073/pnas.1606220113 PNAS Early Edition | 1of6 Downloaded by guest on September 24, 2021 applying several available tools for SV detection to WGS data and 125 per sample, respectively). Few SVs were found in copy from TCGA tumor/normal pairs (Fig. 1A), and therefore sought number stable cancers such as thyroid carcinoma (THCA) (26) to carefully evaluate multiple options. A general problem in (∼8 per sample on average). Similarly, clear cell and chromo- assessing SV results is the lack of true positive data for com- phobe kidney carcinomas (KIRC and KICH) and low-grade gli- parison (19). To partially overcome this problem, we made use of oma (LGG) tumors exhibited a small number of SVs (Fig. 2A). orthogonal results in the form of CNA profiles from the Affy- We next investigated the relationship between somatic CNA metrix SNP6 microarray platform (SI Methods). CNAs arise as a and SV breakpoints. On average, 34% of CNA breakpoints consequence of somatic SVs in the genome, and segmented (segments with absolute log2 amplitude ≥0.3, SNP6 platform) CNA data therefore provide a true positive subset. Although this detected across all 600 tumors were confirmed in the SV data. subset is incomplete, an ideal experimental and computational Overlaps were reduced in CNA stable tumors such as KICH, pipeline should essentially be able to detect all CNA breakpoints. which are dominated by arm-level events and therefore may lack We applied four available tools, BreakDancer (20), SVDetect breakpoints detectable by WGS (Fig. 2B). Absolute numbers of (21), DELLY (22), and Meerkat (11), to high-coverage WGS CNA and SV breakpoints were strongly correlated across tumors data from 200 TCGA tumors of diverse origins, and quantified (Pearson’s r = 0.81; Fig. 2C). However, a small number of overlaps between SV and CNA breakpoints (<15-kb tolerance). samples had elevated CNAs not reflected in the SV data (Fig. To reduce the computational complexity, only a single chromo- 2C), again explainable by the predominance of arm-level changes some (chr) was considered in this analysis (chr8). To also address in these tumors (Fig. S2). These results further support the the specificity of the methods, we investigated overlaps with a set validity of the SV data and show that both approaches are ef- of randomized CNA breakpoints that maintained a similar po- fective for quantifying overall chromosomal instability. SI Methods sitional distribution across the genome ( ). SVDetect The overlap between CNA and SV data at the level of com- had the highest sensitivity score (35%), followed by Meerkat plete events (for example, both breakpoints for an