Mort et al. Genome Biology 2014, 15:R19
http://genomebiology.com/2014/15/1/R19

SOFTWARE Open Access

MutPred Splice: machine learning-based prediction of exonic variants that disrupt splicing

Matthew Mort1*, Timothy Sterne-Weiler4,5, Biao Li2, Edward V Ball1, David N Cooper1, Predrag Radivojac3, Jeremy R Sanford4 and Sean D Mooney2*

* Correspondence: [email protected]; [email protected]
1 Institute of Medical Genetics, School of Medicine, Cardiff University, Cardiff CF14 4XN, UK
2 Buck Institute for Research on Aging, Novato, CA 94945, USA
Full list of author information is available at the end of the article

© 2014 Mort et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Abstract

We have developed a novel machine-learning approach, MutPred Splice, for the identification of coding region substitutions that disrupt pre-mRNA splicing. Applying MutPred Splice to human disease-causing exonic mutations suggests that 16% of mutations causing inherited disease and 10 to 14% of somatic mutations in cancer may disrupt pre-mRNA splicing. For inherited disease, the main mechanism responsible for the splicing defect is splice site loss, whereas for cancer the predominant mechanism of splicing disruption is predicted to be exon skipping via loss of exonic splicing enhancers or gain of exonic splicing silencer elements. MutPred Splice is available at http://mutdb.org/mutpredsplice.

Introduction

In case-control studies, the search for disease-causing variants is typically focused on those single base substitutions that bring about a direct change in the primary sequence of a protein (that is, missense variants), the consequence of which may be structural or functional changes to the protein product. Indeed, missense mutations are currently the most frequently encountered type of human gene mutation causing genetic disease [1]. The underlying assumption has generally been that it is the nonsynonymous changes in the genetic code that are likely to represent the cause of pathogenicity in most cases. However, there is an increasing awareness of the role of aberrant posttranscriptional gene regulation in the etiology of inherited disease.

With the widespread adoption of next generation sequencing (NGS), resulting in a veritable avalanche of DNA sequence data, it is increasingly important to be able to prioritize those variants with a potential functional effect. In order to identify deleterious or disease-causing missense variants, numerous bioinformatic tools have been developed, including SIFT [2], PolyPhen2 [3], PMUT [4], LS-SNP [5], SNAP [6], SNPs3D [7], MutPred [8] and Condel [9] among others. However, the majority of these methods only consider the direct impact of the missense variant at the protein level and automatically disregard same-sense variants as being ‘neutral’ with respect to functional significance. Although this may well be the case in many instances, same-sense mutations can still alter the landscape of cis-acting elements involved in posttranscriptional gene regulation, such as those involved in pre-mRNA splicing [10-12]. It is clear from the global degeneracy of the 5′ and 3′ splice site consensus motifs that auxiliary cis-acting elements must play a crucial role in exon recognition [13]. To date, a considerable number of exonic splicing regulatory (ESR) and intronic splicing regulatory (ISR) elements have been identified [14-19]. Generally these are classified as either enhancers (exonic splicing enhancers (ESEs)/intronic splicing enhancers (ISEs)) or silencers (exonic splicing silencers (ESSs)/intronic splicing silencers (ISS)), which strengthen and repress, respectively, recognition of adjacent splice sites by the splicing machinery. This distinction may be to some extent artificial in so far as an ESE can act as an ESS and vice versa depending upon the sequence context and the trans-acting factor bound to it [16,20]. These trans-acting factors include members of the serine/arginine-rich family of proteins (SR proteins) typically known to bind to splicing enhancers and the heterogeneous nuclear ribonucleoprotein family of complexes (hnRNPs), which are thought to bind splicing silencers. However, it is clear that our knowledge of the cooperative and antagonistic elements that regulate pre-mRNA splicing in a context-dependent manner is still very limited [21].

The functional consequences of a splice-altering variant (SAV) may also vary quite dramatically; thus, splicing events that alter the reading frame can introduce premature termination codons that may then trigger transcript degradation through nonsense-mediated decay. Alternatively, an aberrant splicing event may maintain the open reading frame but lead instead to a dysfunctional protein lacking an important functional domain. Even a splice-altering variant that produces only a small proportion of aberrant transcripts could still serve to alter the gene expression level [21].

Up to approximately 14% of all reported disease-causing nucleotide substitutions (coding and non-coding) listed in the Human Gene Mutation Database [1] (11,953 mutations; HGMD Pro 2013.4) are thought to disrupt pre-mRNA splicing whereas 1 to 2% of missense mutations have been reported to disrupt pre-mRNA splicing (HGMD Pro 2013.4). Previous studies have, however, found that the actual proportion of disease-causing missense mutations that disrupt pre-mRNA splicing could be rather higher [22-25]. The difference between the observed and predicted frequencies of disease-causing splicing mutations may be due in part to the frequent failure to perform routine in vitro analysis (for example, a hybrid minigene splicing assay [26]), so the impact of a given missense mutation on the splicing phenotype is generally unknown. The likely high frequency of exonic variants that disrupt pre-mRNA splicing implies that the potential impact upon splicing should not be neglected when assessing the functional significance of newly detected coding sequence variants. Coding sequence variants that disrupt splicing may not only cause disease [22] but may in some cases also modulate disease severity [27,28] or play a role in complex disease [29]. The identification of disease-causing mutations that disrupt pre-mRNA splicing will also become increasingly important as new therapeutic treatment options become available that have the potential to rectify the underlying splicing defect [30,31].

Current bioinformatic tools designed to assess the impact of genetic variation on splicing employ different approaches but typically focus on specific aspects of spli- [...] not been optimized to deal with single base substitutions, and require the wild-type and mutant sequences to be analyzed separately with the user having to compute any difference in predicted splicing regulatory elements. Tools that are designed specifically to handle single base substitutions include Spliceman, Skippy and Human Splice Finder (HSF). In most cases, as each tool focuses on specific aspects of the splicing code, there is often a need to recruit multiple programs [37] before any general conclusions can be drawn.

An exome screen will typically identify >20,000 exonic variants [38]. This volume of data ensures that high-throughput in silico methods are an essential part of the toolset required to prioritize candidate functional variants from the growing avalanche of sequencing data now being generated by NGS. NGS data analysis normally involves applying multiple filters to the data in order to prioritize candidate functional variants. When applying NGS filters, it is important to remember that same-sense variants may alter pre-mRNA splicing via a number of different mechanisms. Hence, a naïve NGS filter that only considers variants within the splice site consensus as candidate splicing-sensitive variants would not identify same-sense variants that caused exon skipping via a change in ESR elements.

Currently, several general areas need to be improved in relation to the identification of genetic variation responsible for aberrant pre-mRNA splicing. Firstly, although the consensus splice site sequences are well defined, the auxiliary splicing elements and their interactions with splice sites are not well understood. Secondly, there is an urgent need for larger unbiased datasets of experimentally characterized variants that alter splicing and have been quantitatively assessed with respect to the mRNA splicing phenotype. This would provide better training data for new models and provide new datasets to benchmark the performance of different tools (both new and existing). Thirdly, there is an urgent need for new bioinformatic tools suitable for use in a high-throughput NGS setting. These tools promise to be invaluable for the comprehensive evaluation of the impact of a given variant on mRNA processing (that is, not just in terms of splice site disruption). It would also be beneficial if the specific consequences for the splicing phenotype (that is, multiple exon skipping,