Mouse SNP Miner: an Annotated Database of Mouse Functional Single Nucleotide Polymorphisms Eli Reuveni*1, Vasily E Ramensky2 and Cornelius Gross1
Total Page:16
File Type:pdf, Size:1020Kb
BMC Genomics BioMed Central Database Open Access Mouse SNP Miner: an annotated database of mouse functional single nucleotide polymorphisms Eli Reuveni*1, Vasily E Ramensky2 and Cornelius Gross1 Address: 1Mouse Biology Unit, EMBL, Via Ramarini 32, 00016 Monterotondo, Italy and 2Engelhardt Institute of Molecular Biology, Vavilova 32, 119991 Moscow, Russia Email: Eli Reuveni* - [email protected]; Vasily E Ramensky - [email protected]; Cornelius Gross - [email protected] * Corresponding author Published: 21 January 2007 Received: 4 August 2006 Accepted: 21 January 2007 BMC Genomics 2007, 8:24 doi:10.1186/1471-2164-8-24 This article is available from: http://www.biomedcentral.com/1471-2164/8/24 © 2007 Reuveni et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Abstract Background: The mapping of quantitative trait loci in rat and mouse has been extremely successful in identifying chromosomal regions associated with human disease-related phenotypes. However, identifying the specific phenotype-causing DNA sequence variations within a quantitative trait locus has been much more difficult. The recent availability of genomic sequence from several mouse inbred strains (including C57BL/6J, 129X1/SvJ, 129S1/SvImJ, A/J, and DBA/2J) has made it possible to catalog DNA sequence differences within a quantitative trait locus derived from crosses between these strains. However, even for well-defined quantitative trait loci (<10 Mb) the identification of candidate functional DNA sequence changes remains challenging due to the high density of sequence variation between strains. Description: To help identify functional DNA sequence variations within quantitative trait loci we have used the Ensembl annotated genome sequence to compile a database of mouse single nucleotide polymorphisms (SNPs) that are predicted to cause missense, nonsense, frameshift, or splice site mutations (available at http://bioinfo.embl.it/SnpApplet/). For missense mutations we have used the PolyPhen and PANTHER algorithms to predict whether amino acid changes are likely to disrupt protein function. Conclusion: We have developed a database of mouse SNPs predicted to cause missense, nonsense, frameshift, and splice-site mutations. Our analysis revealed that 20% and 14% of missense SNPs are likely to be deleterious according to PolyPhen and PANTHER, respectively, and 6% are considered deleterious by both algorithms. The database also provides gene expression and functional annotations from the Symatlas, Gene Ontology, and OMIM databases to further assess candidate phenotype-causing mutations. To demonstrate its utility, we show that Mouse SNP Miner successfully finds a previously identified candidate SNP in the taste receptor, Tas1r3, that underlies sucrose preference in the C57BL/6J strain. We also use Mouse SNP Miner to derive a list of candidate phenotype-causing mutations within a previously uncharacterized QTL for response to morphine in the 129/Sv strain. Page 1 of 10 (page number not for citation purposes) BMC Genomics 2007, 8:24 http://www.biomedcentral.com/1471-2164/8/24 Background coding sequence mutations underlying QTL coupled with The laboratory mouse is a powerful model for studying the much greater reliability with which such mutations the genetic determinants of human disease-related pheno- can be identified at the present time suggests that search- types. One way to study genetic modifiers of such pheno- ing for functional coding mutations is a worthwhile initial types is to take advantage of genetic variations existing approach for QTL analysis. between mouse inbred strains to map chromosomal regions, or loci, that are associated with quantitative traits, Several authors have used the Celera Discovery System so-called quantitative trait loci (QTL). Over 2,000 QTL are [12] database to identify putative missense and nonsense listed in the Mouse Genome Informatics database [1], and SNPs within QTL [13-16]. Beginning with Mouse Build many of these are relevant to human disease. 126, December 2006, the public SNP repository, dbSNP [17], has incorporated all Celera sequence reads from However, the identification of causal functional DNA pol- 129X1/SvJ, 129S1/SvImJ, A/J, and DBA/2J and thus pro- ymorphisms underlying QTL has remained problematic, vides free access to genomic sequence from several inbred with such variations having been convincingly identified strains. Both the Ensembl and NCBI genome portals offer for no more than 9 QTL [2]. This difficulty stems largely search tools that allow comparison of SNPs between from the considerable size of typical QTL (10–50 Mb) and mouse inbred strains, called TranscriptSNPView [18] and the large number of sequence variations that lie within SNPviewer [19], respectively. SNP calls from dbSNP are such a region. In most cases, gene discovery for QTL is car- mapped onto the Ensembl or NCBI annotated mouse ried out by congenic mapping to narrow down the QTL genome to make predictions about the functional conse- region to less than 5 Mb, followed by gene expression and quence of each SNP. We chose to use genome annotations coding sequence analysis of each gene in the interval to from Ensembl because it provides open access to its narrow down a list of candidate genes. Finally, the most MySQL database and associated API interface [20]. promising candidates are tested by genetic complementa- Although these databases allow searching of SNPs accord- tion or epistasis in transgenic animals. Several methods ing to specific criteria (e.g., chromosome location, pair- are helping to speed the process of narrowing down QTL wise strain comparison, etc.), neither of them provides to <5 Mb, including mapping in chromosome substitu- additional annotations for SNPs, such as deleterious/non- tion strains [3], heterogeneous stocks [4], and advanced deleterious for missense mutations, gene expression data, intercrosses [5]. The construction of large panels of or relevance to human disease. recombinant inbred strains (>500 lines) are in progress and promise to facilitate rapid high-resolution mapping To address this need, we sought to build a database of [6]. mouse inbred strain SNPs predicted to cause functional changes in protein sequence, including missense, non- Recently, the nearly complete genome sequences of four sense, and splice site mutations, that could be easily mouse inbred strains (129X1/SvJ, 129S1/SvImJ, A/J, DBA/ searched by strain, type, chromosomal location and func- 2J; [7]) joined the public reference sequence (C57BL/6J; tional consequence. In addition, we applied the bioinfor- [8]) in open-access genome databases. Complete knowl- matics algorithms PolyPhen [21] and PANTHER [22] to edge of the DNA sequence variation between strains predict whether missense mutations among these SNPs allows for a systematic search of all candidate sequence were likely to disrupt protein function. PolyPhen analysis polymorphisms within a QTL for putative functional pol- of human SNPs showed that 17% of a selected set of 9,165 ymorphisms [9-11]. When coupled with gene annotation, human non-synonymous coding SNPs are 'possibly dam- this information can be used to identify polymorphisms aging' and 13% are 'probably damaging' to protein func- likely to cause changes in gene function and thus likely to tion. These data confirmed earlier reports suggesting that contribute to the QTL. Single nucleotide polymorphisms only a small fraction of missense mutations interfere with (SNPs) that cause nonsense or missense mutations can be protein function. We reasoned that a similar analysis of reliably identified and often are responsible for severe mouse missense SNPs could help to narrow down candi- changes in protein function. Among 9 identified causal date causal SNPs within QTL. The PANTHER algorithm QTL mutations, 2 are missense mutations, 2 are non-sense offers a complementary functional assessment of putative mutations, and 1 is a frameshift mutation demonstrating missense SNPs [22] and a comparison of PolyPhen and that these classes of overt coding sequence mutations con- PANTHER predictions for human missense SNPs demon- tribute to QTL phenotypes, at least for QTL of large effect strated a close correlation between these predictions and size [2]. It is unclear in this point what fraction of the empirical assessments of protein activity [23]. The recent >2000 reported mouse QTL are caused by overt coding deposition of Celera sequence reads from the 129X1/SvJ, sequence mutations, and it is likely that functional non- 129S1/SvImJ, A/J, and DBA/2J mouse inbred strains into coding mutations contribute significantly to these traits. the public domain has made it possible for us to develop However, the success of finding at least some functional a database of mouse SNPs that incorporates both Page 2 of 10 (page number not for citation purposes) BMC Genomics 2007, 8:24 http://www.biomedcentral.com/1471-2164/8/24 Polyphen and PANTHER missense SNP functional predic- data include both categorical, 'absent'/'present' calls, as tions as well as other gene function annotations. well as quantitative Affymetrix values for a large set of embryonic, neonatal and adult mouse tissues. A sche- Construction and content matic of the database is shown in Figure 1. To assemble mouse SNPs for our database, we used Perl