Goodswen et al. BMC Bioinformatics 2010, 11:311
http://www.biomedcentral.com/1471-2105/11/311

DATABASE    Open Access

FunctSNP: an R package to link SNPs to functional knowledge and dbAutoMaker: a suite of Perl scripts to build SNP databases

Stephen J Goodswen1,2, Cedric Gondro2, Nathan S Watson-Haigh1 and Haja N Kadarmideen*1

Abstract

Background: Whole genome association studies using highly dense single nucleotide polymorphisms (SNPs) are a set of methods to identify DNA markers associated with variation in a particular complex trait of interest. One of the main outcomes from these studies is a subset of statistically significant SNPs. Finding the potential biological functions of such SNPs can be an important step towards further use in human and agricultural populations (e.g., for identifying genes related to susceptibility to complex diseases or genes playing key roles in development or performance). The current challenge is that the information holding the clues to SNP functions is distributed across many different databases. Efficient bioinformatics tools are therefore needed to seamlessly integrate up-to-date functional information on SNPs. Many web services have arisen to meet the challenge but most work only within the framework of human medical research. Although we acknowledge the importance of human research, we identify there is a need for SNP annotation tools for other organisms.

Description: We introduce an R package called FunctSNP, which is the user interface to custom built species-specific databases. The local relational databases contain SNP data together with functional annotations extracted from online resources. FunctSNP provides a unified bioinformatics resource to link SNPs with functional knowledge (e.g., genes, pathways, ontologies). We also introduce dbAutoMaker, a suite of Perl scripts, which can be scheduled to run periodically to automatically create/update the customised SNP databases.
We illustrate the use of FunctSNP with a livestock example, but the approach and software tools presented here can also be applied to humans and other organisms.

Conclusions: Finding the potential functional significance of SNPs is important when further using the outcomes from whole genome association studies. FunctSNP is unique in that it is the only R package that links SNPs to functional annotation. FunctSNP interfaces to local customised SNP databases, which can be built for any species contained in the National Center for Biotechnology Information dbSNP database.

* Correspondence: [email protected]
1 CSIRO Livestock Industries, Davies Laboratory, University Drive, Townsville, QLD 4810, Australia
Full list of author information is available at the end of the article

© 2010 Goodswen et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background

Whole genome association studies (WGAS) are a set of methods to study common genetic variation across the entire genome of organisms with the purpose of identifying associations between genetic polymorphisms and observable phenotypic variation in traits. Single nucleotide polymorphisms (SNPs) are the most common genetic variant. Two advances have essentially made WGAS possible. The first is the huge increase in the availability of SNPs: for example, the dbSNP database housed by the National Center for Biotechnology Information (NCBI) contains millions of SNPs for many model species. The second is the falling cost of, and rapid improvements in, SNP genotyping technology. One of the main outcomes from WGAS is a subset of SNPs at a specified level of statistical significance (p-values) and false discovery rates (FDR). These SNPs are currently used in two conventional, albeit alternative, approaches depending on the requirements of the researcher: 1) a molecular biologist usually searches for candidate genes in the region of the significant SNP and then does functional experiments, which could include targeted sequencing; 2) a geneticist uses significant SNPs as DNA markers for predicting disease risks, phenotypic performance or breeding value in individuals and may use this information in associated trait selection (marker-assisted selection/gene-assisted selection).

One of the major challenges for a researcher using WGAS is minimising false positive rates while maintaining the power to identify true positive associations [1]. WGAS relies on linkage disequilibrium (LD) between DNA markers and causal variants. SNPs located near each other on a chromosome are said to be in LD if they tend to be inherited together more often than can be expected by chance [1]. Unlike the SNP chip design for humans, which had input from the HapMap project to assist in its SNP selection, the tag SNP selection for most other species is not strictly based on LD knowledge. For many species the LD patterns are still to be conclusively determined and therefore the tag SNPs for the genotyping SNP chip are selected to be evenly distributed [2].

In WGAS, association does not imply causation. The significant SNPs derived from WGAS can be generally classified into 4 types: 1) a causal SNP that contributes to variation in the complex trait; 2) a SNP with no known biological effect but in LD with an untagged SNP (not genotyped) that contributes to variation in the complex trait; 3) a SNP with an association only, with no known biological effect or linkage to a causal SNP; 4) a SNP not associated with the complex trait (a false positive). Since the researcher cannot distinguish the type of a significant SNP following WGAS, there is a potential to perform functional experiments based on a false-positive SNP and/or use a SNP as a marker that is neither the causal variant nor in LD with the causal variant.

Currently, the only approach to determine if a WGAS derived significant SNP is a true positive and in LD with the causal variant is to replicate the association findings in an independent population of adequate size [1]. Hence, a need exists for SNP annotation tools that can assist a researcher in distinguishing the type of significant SNP. Such a tool would first need to collect and integrate all existing biological functional information from online resources. It would then need to provide an interface to access, analyse, and present the information so that the researcher may make informed judgments as to whether a SNP could be a causal variant (type 1) or is in LD with a causal variant (type 2).

Table 1 shows a list of the main current bioinformatics tools for SNP annotation. The tools are web-based and predominantly work within the framework of human medical research. In May 2009, dbSNP held genetic polymorphisms for 45 organisms, 15 of which had sequenced genomes. We have developed a unified bioinformatics tool, called FunctSNP, to address the need to provide SNP annotation for other species as well as humans. FunctSNP is an R package which provides a set of functions to query species-specific local relational databases within the widely used statistical programming language R (http://www.r-project.org/).

For many of the bioinformatics tools, third party applications are required for upstream and downstream research analyses, which in most cases require data format changes. The R environment has a global data currency of vectors and data frames so there are few data conversion concerns between R packages. For instance, there are two R packages called GenABEL [3] and SNPassoc [4], which perform genome wide association studies. FunctSNP ideally complements these R packages by linking their output to SNP annotation. In addition, there are other R packages called GSA [5], GoSim [6], GeneNet [7], and snp.plotter [8] for further downstream analysis to provide advanced algorithms and coloured graphs.

A primary goal of FunctSNP is to help screen and select the WGAS significant SNPs which are more likely to be reliable as DNA markers. WGAS makes a direct association between SNP variation and phenotype variation, whereas in reality SNP variation affects phenotype variation through the effects it has on multiple intervening biological pathways and networks [9]. FunctSNP aims to link WGAS derived SNPs to these biological pathways and gene networks. The FunctSNP functions within R allow screening of SNPs by: 1) determining their physical location with respect to genes (as a general rule, the closer a significant SNP is to the potential causal variant the more likely they will be in LD); 2) finding evidence for a functional role of the significant SNP. It is well known that over 90% of significant SNPs are not located on genes. Therefore, if the significant SNP is not located on a gene, FunctSNP functions can search in close proximity to the significant SNP for a known SNP (not genotyped) that is on a gene, within a biological pathway, and with a biological function known to be associated with the trait of interest [10]. Figure 1 shows an example, through a schematic diagram, of the process of finding the potential function of significant SNPs.

Construction and content

The biological information required for SNP annotation is publicly available in online databases. Although most of these online resources provide cross-links to other databases, it is not straightforward to extract the relevant research data in one query. Furthermore, even though the databases contain well-annotated data, finding and managing the data of interest among the large assortment of data can be a challenge. Our solution to superfluous and scattered data was to link SNPs with functional knowledge within species-specific local relational databases.
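
To make the idea of a local relational database concrete, the proximity search described in the Background (starting from a significant SNP that is not on a gene, look nearby for a known SNP that is on a gene and in a pathway) can be sketched as a single SQL join. This is only an illustration under stated assumptions: FunctSNP itself is an R package, whereas the sketch uses Python's standard sqlite3 module purely as a stand-in; the table names (snp, gene, gene_pathway), the column names, the demo identifiers and the function nearby_annotated_snps are all hypothetical and are not taken from FunctSNP or dbAutoMaker.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory database (hypothetical schema, made-up rows)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, symbol TEXT);
        CREATE TABLE snp (
            snp_id  TEXT PRIMARY KEY,
            chrom   TEXT,
            pos     INTEGER,
            gene_id INTEGER REFERENCES gene(gene_id)  -- NULL when not on a gene
        );
        CREATE TABLE gene_pathway (gene_id INTEGER, pathway TEXT);
    """)
    conn.executemany("INSERT INTO gene VALUES (?, ?)",
                     [(1, "GENE_A"), (2, "GENE_B")])
    conn.executemany("INSERT INTO snp VALUES (?, ?, ?, ?)", [
        ("snp_sig", "1", 100000, None),      # significant SNP, not on a gene
        ("snp_near", "1", 104000, 1),        # known SNP on GENE_A, 4 kb away
        ("snp_far", "1", 900000, 2),         # same chromosome, far away
        ("snp_other_chrom", "2", 100500, 2),  # close position, other chromosome
    ])
    conn.executemany("INSERT INTO gene_pathway VALUES (?, ?)",
                     [(1, "pathway_X"), (2, "pathway_Y")])
    return conn


def nearby_annotated_snps(conn, snp_id, window_bp):
    """Return SNPs on genes within window_bp of snp_id, nearest first.

    Each row is (snp_id, distance_bp, gene_symbol, pathway); pathway is
    None when the gene has no pathway record.
    """
    query = """
        SELECT n.snp_id, ABS(n.pos - s.pos) AS distance, g.symbol, p.pathway
        FROM snp AS s
        JOIN snp AS n ON n.chrom = s.chrom AND n.snp_id != s.snp_id
        JOIN gene AS g ON g.gene_id = n.gene_id
        LEFT JOIN gene_pathway AS p ON p.gene_id = g.gene_id
        WHERE s.snp_id = ? AND ABS(n.pos - s.pos) <= ?
        ORDER BY distance, n.snp_id
    """
    return conn.execute(query, (snp_id, window_bp)).fetchall()


conn = build_demo_db()
hits = nearby_annotated_snps(conn, "snp_sig", 50000)
print(hits)  # [('snp_near', 4000, 'GENE_A', 'pathway_X')]
```

Keeping SNP positions, gene membership and pathway membership in separate tables means one join answers a question that would otherwise require visiting several online resources. The 50 kb window is an arbitrary value chosen for the demo, not a recommendation from the paper.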