Online Resources for SNP Analysis

A Review and Route Map

Christopher Phillips*

Abstract

The major online single nucleotide polymorphism (SNP) databases freely available as research tools for genetic analysis are explained, reviewed, and compared. An outline is given of the search strategies that can be used with the most extensive current SNP databases, National Center for Biotechnology Information (NCBI) dbSNP and HapMap, to help the user secure the most appropriate data for the needs of clinical genetics and population genetics research. A range of online tools that can be useful in designing SNP genotyping assays are also detailed.

Index Entries: Single nucleotide polymorphism, SNP; genotyping; variation; polymorphism; online databases; linkage disequilibrium; haplotype; haplotype block; genome; HapMap; clinical genetics; population genetics.

*Author to whom all correspondence and reprint requests should be addressed. The Spanish National Genotyping Centre CeGen, Santiago node, Genomic Medicine Group, University of Santiago de Compostela, Galicia, Spain. E-mail: [email protected].

Molecular Biotechnology © 2006 Humana Press Inc. All rights of any nature whatsoever reserved. ISSN: 1073–6085/Online ISSN: 1559–0305/2006/35:1/065–098/$30.00. Volume 35, 2007.

1. Introduction

The validated set of human single nucleotide polymorphisms (SNPs) is one of the most valuable resources to have come from the human genome mapping project (HGP). This dataset was largely unforeseen in the initial phases of sequencing, but as several genome equivalents were generated from different individuals during the project, the identification of SNPs became a by-product of growing importance. The discovery of new SNP sites by resequencing and the collation of detailed information about each locus has continued to develop both in scope and in depth since the simultaneous publication of draft sequence and first full SNP map 5 yr ago (1,2). Added to this, our knowledge of SNPs has expanded in parallel with an improved understanding of the structure and function of the genome and its gene content. SNPs provide the most complete and densely spaced system of genome landmarks available. Above all, this enables researchers to improve the resolution of linkage maps, in particular in the precise mapping and study of genes contributing to complex disorders, where the contributory effect of individual genes is small or where genetics is compounded by multiple locus interaction and environment. As well as enabling greatly enhanced mapping, SNPs provide a new and fascinating layer of detail to our understanding of human variability and what this tells us about disease susceptibility, populations, and evolution. Because SNPs have low recurrent mutation rates compared to other polymorphic markers, they provide the most stable and reliable indicators of the evolutionary history of populations. Last, but not least, SNPs are the very stuff of genetic variation, as a substitution in or near a transcribed region can change an amino acid sequence, the control of transcription events, or the splicing pattern for the resulting RNA products. Such changes arise from loci commonly termed coding SNPs, promoter-site SNPs, and splice-site SNPs, respectively. Therefore, it can be argued that SNPs form the principal part of the variability affecting each stage of gene expression and can justifiably be said to influence the transcriptome and proteome, as well as the genome itself. As such, SNPs are more than just ubiquitous marker points: they are a core genome feature that will ultimately help to explain one of the enduring mysteries of human genetics, namely why comparable numbers of genes in species at opposite ends of the evolutionary scale can create such profoundly different levels of complexity.

In the same spirit as that prevailing in the database management of HGP draft and final nucleotide sequence, SNP data arising from public genome analysis have been open source from the beginning, freely available on websites accessible to any researcher. Furthermore, the largest database, dbSNP, accepts submissions directly from the scientific community, allowing the rapid dissemination of newly discovered SNPs. In fact, all the SNP databases described in this review are based on open access and in this sense are a shared scientific resource. Added to this, during 2006 the bulk of privately held SNP data will become available to the whole research community. This means that one of the original precepts of HGP, that free access to the best possible information acts to accelerate scientific progress, applies equally to the overwhelming majority of human SNP data. Unhindered access to information is now considered to be of such primary importance that a system for data release has been formalized in the recommended framework for international community resource projects (http://www.welcome.ac.uk/doc_WTD003208.HTML). This framework secures, for all scientists, rapid dissemination, the highest standards of database management, and the end use of data without restriction.

At the same time that genomic data started to become available, the fields of bioinformatics and database management were rapidly developing to keep pace with the scale of the information generated and the complexity of the searches required. These developments have meant that searching for specific SNPs among a total of 9 million or more human loci can now be achieved fairly easily, provided the researcher begins with clearly determined criteria for the markers required for a study. To this end, I can firmly recommend, as starting points, two excellent textbooks: Human Molecular Genetics and Human Evolutionary Genetics (3,4), which thoroughly cover the fields of genetic analysis and population genetics, respectively, these forming the two principal applications for SNP analysis. Both books provide, in nonscientific parlance, "one-stop shops" for understanding the extent and specific characteristics of SNPs needed for planning studies of inherited human disease (referred to as clinical genetics from this point on) or population genetics. I also consider that it is important for scientists in one area of speciality to be better acquainted with the other, so the use of both books is recommended. Careful project planning is an essential step, in addition to the scrutiny of appropriate reporting publications, before any database search starts in earnest. The major steps of project design that involve online resources can therefore be visits to PubMed (for the best current reports in the field), HapMap (for the review of SNP candidates in, or close to, genes or genome regions of interest), dbSNP (for the characterization of the chosen SNPs), and finally the use of various web tools for optimizing the genotyping assay design. This is not intended to be a prescribed route: many of the sites outlined in this review have viable alternatives that can act, above all, to supplement the data obtained from the primary sources described here. In addition, particular areas of clinical genetics such as cancer studies have specialized databases that are not covered in this review but clearly serve their field with a more direct focus of information.

This review was originally written for a third potential application of SNP analysis, that of forensic identification, commonly termed DNA profiling. SNPs offer the potential to greatly reduce amplified product size compared to existing profiling loci and could therefore provide much improved results with the highly degraded DNA commonly encountered in forensic analysis or disaster victim identification. Forensic profiling requires a specific set of search criteria to find SNPs with the necessary properties for this field, and readers interested in the forensic application of SNP analysis can find more specific guidelines in the original article along with several specialized reviews (5–7). It is interesting to note that forensic SNP sets may find applications in genetic studies where measurement of stratification and admixture ratios could be achieved using a discrimination and geographic origin marker panel. These equally specialized areas of SNP use in genetic analysis are discussed in the overview of SNP biology.

Therefore, the purpose of this review is twofold: to compare SNP databases and to detail the methods that can be used to check the characteristics of SNP markers of most relevance to a proposed study. Emphasis has been placed on ways to narrow the selection of SNPs (or any other supporting data) to a manageable size, an increasingly important strategy given the huge scale of information available.

[...]ers (potentially as many as 1 million), despite the majority of commercially held Celera loci being common to both public and private databases. Some idea of how much supplementary data may eventually be released for public access can be obtained from the time taken by the National Center for Biotechnology Information (NCBI) to scrutinize, validate, and reorganize the Celera SNPs. This has already occupied 6 months of 2005 and is likely to go beyond a full year of data processing on a large scale. This underlines the continuing growth in importance of SNPs, both to provide sufficient coverage for the analysis of all parts of the genome and to help fully understand the complex mechanisms of gene expression.

2. The Biology of SNPs: A Brief Overview

SNPs, as base modifications, represent changes to the ancestral DNA sequence comprising a genome. Because the cellular mechanisms for correcting a base mismatch are extremely effective, it is necessary to understand how these substitution events progress from a single sequence change that is promptly edited back to
