Genome Sequencing and RNA-Motif Analysis Reveal Novel Damaging Noncoding Mutations in Human Tumors Babita Singh1, Juan L
Total Page:16
File Type:pdf, Size:1020Kb
Published OnlineFirst March 28, 2018; DOI: 10.1158/1541-7786.MCR-17-0601 Genomics Molecular Cancer Research Genome Sequencing and RNA-Motif Analysis Reveal Novel Damaging Noncoding Mutations in Human Tumors Babita Singh1, Juan L. Trincado1, PJ Tatlow2, Stephen R. Piccolo2,3, and Eduardo Eyras1,4 Abstract A major challenge in cancer research is to determine the motifs in introns, HNRPLL motifs at 50 UTRs, as well as 50 and 30 biological and clinical significance of somatic mutations in splice-site motifs, among others, with specific mutational pat- noncoding regions. This has been studied in terms of recurrence, terns that disrupt the motif and impact RNA processing. MIRA functional impact, and association to individual regulatory facilitates the integrative analysis of multiple genome sites that sites, but the combinatorial contribution of mutations to com- operate collectively through common RBPs and aids in the mon RNA regulatory motifs has not been explored. Therefore, interpretation of noncoding variants in cancer. MIRA is avail- we developed a new method, MIRA (mutation identification for able at https://github.com/comprna/mira. RNA alterations), to perform an unbiased and comprehensive study of significantly mutated regions (SMR) affecting binding Implications: The study of recurrent cancer mutations on poten- sites for RNA-binding proteins (RBP) in cancer. Extracting tial RBP binding sites reveals new alterations in introns, un- signals related to RNA-related selection processes and using translated regions, and long noncoding RNAs that impact RNA RNA sequencing (RNA-seq) data from the same specimens, we processing and provide a new layer of insight that can aid in identified alterations in RNA expression and splicing linked to the interpretation of noncoding variants in cancer genomes. mutations on RBP binding sites. We found SRSF10 and MBNL1 Mol Cancer Res; 1–13. Ó2018 AACR. Introduction ment of potential functional impacts (4, 9–11); (ii) recurrence in combination with sequence conservation or polymorphism data Cancer arises from genetic and epigenetic alterations that (12, 13); or (iii) the enrichment of mutations with respect to interfere with essential mechanisms of the normal life cycle of specific mutational backgrounds (5, 14, 15); and some of them cells such as DNA repair, replication control, and cell death (1). combine such approaches (5). However, these methods have so The search for driver mutations, which confer a selective advan- far been restricted to individual positions rather than combining tage to cancer cells, is generally performed in terms of the impact the contributions from multiple functionally equivalent regula- on protein sequences (2). However, systematic studies of cancer tory sites. Additionally, the impact on RNA processing measured genomes have highlighted mutational processes outside of from the same samples has not been evaluated. protein-coding regions (3–5) and tumorigenic mutations at non- RNA molecules are bound by multiple RNA binding proteins coding regions have been described, like those in the TERT (RBP) with specific roles in RNA processing, including splicing, promoter (6, 7). However, a major challenge remains to more stability, localization, and translation, which are critical for accurately and comprehensively determine the significance and the proper control of gene expression (16). Multiple experimental potential pathogenic involvement of somatic variants in regions approaches have established that RBPs generally interact with that do not code for proteins (8). Current methods to detect driver RNAs through short motifs of 4 to 7 nucleotides (17–19). These mutations in noncoding regions are based on (i) the recurrence of motifs occur anywhere along the precursor RNA molecule (pre- mutations in predefined regions in combination with measure- mRNA), including introns, protein coding regions, untranslated 50 and 30 regions, as well as in short and long noncoding RNAs (20, 21). Mutations on RNA regulatory sequences can impact RNA 1Department of Experimental and Health Sciences, Pompeu Fabra University processing and lead to disease (22). However, studies carried out (UPF), Barcelona, Spain. 2Department of Biology, Brigham Young University, so far have mainly focused on sequences around splice sites (23) Provo, Utah. 3Department of Biomedical Informatics, University of Utah, Salt 4 or in protein-coding regions (24). In vitro screenings of sequence Lake City, Utah. Catalan Institution for Research and Advanced Studies variants in exons have revealed that more than 50% of nucleotide (ICREA), Barcelona, Spain. substitutions can induce splicing changes (25, 26), with similar Note: Supplementary data for this article are available at Molecular Cancer effects for synonymous and nonsynonymous sites (26). Because Research Online (http://mcr.aacrjournals.org/). RBP binding motifs are widespread along gene loci, and somatic Corresponding Author: Eduardo Eyras, Pompeu Fabra University, PRBB mutations may occur anywhere along the genome, it is possible Dr. Aiguader 88, 08003, Barcelona, Spain. Phone: 34-93-316-0502; Fax: 34- that mutations in other genic regions could impact RNA proces- 93-316-0550; E-mail: [email protected] sing and contribute to the tumor phenotype. Mutations and doi: 10.1158/1541-7786.MCR-17-0601 expression alterations in RBP genes have an impact on specific Ó2018 American Association for Cancer Research. cellular programs in cancer, but it is not known whether www.aacrjournals.org OF1 Downloaded from mcr.aacrjournals.org on September 30, 2021. © 2018 American Association for Cancer Research. Published OnlineFirst March 28, 2018; DOI: 10.1158/1541-7786.MCR-17-0601 Singh et al. mutations in RBP binding sites along gene loci are frequent in follows: For each base a, we calculated the rate of mutations in a cancer, could damage RNA processing, and contribute to onco- locus R(a) ¼ m(a)/n(a), where n(a) is the number of a bases in the genic mechanisms. gene and m(a) is the number of those bases that are mutated. The To understand the effects of somatic mutations on RNA proces- expected mutation count is then calculated using the nucleotide sing in cancer at a global level we have developed a new method, counts in the window and the mutation rate per nucleotide. For MIRA, to carry out a comprehensive study of somatic mutation instance, for the 7-mer window AACTGCAG, the expected count patterns along genes that operate collectively through interacting was calculated as E ¼ 3R(A) þ 2R(C) þ 2R(G) þ R(T). This was with common RBPs. Compared with other existing approaches to compared with the actual number of mutations, n, observed in detect relevant mutations in noncoding regions, our study pro- that window to define an NB score: NB-score ¼ log2(n / E) per vides several novelties and advantages: (i) we searched exhaus- window. We discarded windows corresponding to single-nucle- tively along gene loci, hence increasing the potential to uncover otide repeats (e.g., AAAAAAA) and kept windows with NB-score > deep intronic pathological mutations; (ii) we studied the enrich- 6 (Supplementary Fig. S2). Further, we kept 7-mer windows ment of a large compendium of potential RNA regulatory motifs, that overlapped any of the three intronic or exonic bases around allowing us to identify potentially novel mechanisms affecting exon–intron boundaries with 1 or more mutations as long as the RNA processing in cancer; (iii) we showed that multiple mutated NB-score was greater than 6. genomic loci potentially interact with common RBPs, suggesting Significant 7-mer windows were clustered according to geno- novel cancer-related selection mechanisms; and (iv) unlike pre- mic overlap and classified according to the genic region in the vious methods, we used RNA sequencing data from the same same strand on which they fell (Supplementary Tables S2–S4): 50 samples to measure the impact on RNA processing. Our study or 30 untranslated regions (5UTR/3UTR), coding sequence (CDS), uncovered multiple mutated sites associated with common RBPs exon in short or long noncoding RNA (EXON), 50/30 splice-site that impact RNA processing with potential implications in cancer (5SS/3SS), or intron (INTRON). To unambiguously assign each and revealed a new layer of insight to aid in the interpretation of 7-mer window, we prioritized regions as follows: 5SS/3SS > CDS > noncoding variants in cancer genomes. 5UTR/3UTR > EXON > INTRON, i.e., if a window overlapped a splice site, it was classified as such; else, if it overlapped a CDS, it fi fi Materials and Methods was classi ed as CDS, etc. No signi cant window overlapped start or stop codons. To each SMR we assigned the average NB score and Detection of significantly mutated regions (SMR) a corrected P value using the Simes approach (27): we ranked We used the Gencode gene annotations (v19), excluding pseu- the P values of the n overlapping windows in increasing order fi dogenes. To de ne gene loci unequivocally, we clustered tran- pi,I¼ 1,2,...n and calculated ps ¼ min{np1 /1,np2 /2,np3 /3,...., scripts that shared a splice site on the same strand, and considered npn /n}, where p1 was the lowest and pn the highest P values in the fi a gene to be the genomic locus and the strand de ned by those cluster. Each SMR cluster was then assigned the P value ps. Using transcripts. We used somatic mutations from whole genome other window lengths (k ¼ 6, 8, 9) we observed that k ¼ 7 provides sequencing for 505 tumor samples from 14 tumor types (ref. 10; an optimal trading off between the number of regions covered and Supplementary Table S1): bladder carcinoma (BLCA; 21 sam- the clustering of those regions (Supplementary Table S2). Code ples), breast carcinoma (BRCA; 96 samples), colorectal carcinoma for this analysis is available at https://github.com/comprna/mira. (42 samples), glioblastoma multiforme (GBM; 27 samples), head and neck squamous carcinoma (HNSC; 27 samples), kidney Comparison with expression, replication timing and chromophobe (KICH; 15 samples), kidney renal carcinoma with other methods (KIRC; 29 samples), low grade glioma (LGG; 18 samples), lung Data for replication time were obtained from ref.