

Online Resources for SNP Analysis: A Review and Route Map

Christopher Phillips

Abstract

The major online single nucleotide polymorphism (SNP) databases freely available as research tools for genetic analysis are explained, reviewed, and compared. An outline is given of the search strategies that can be used with the most extensive current SNP databases, National Centre for Biotechnology Information (NCBI) dbSNP and HapMap, to help the user secure the most appropriate data for the needs of clinical genetics and population genetics research. A range of online tools that can be useful in designing SNP assays is also detailed.

Index Entries: Single nucleotide polymorphism, SNP; genotyping; variation; polymorphism; online databases; haplotype block; genome; HapMap; clinical genetics; population genetics.

1. Introduction

The validated set of single nucleotide polymorphisms (SNPs) is one of the most valuable resources to have come from the human genome mapping project (HGP). This dataset was largely unforeseen in the initial phases of sequencing but as several genome equivalents were generated from different individuals during the project, the identification of SNPs became a by-product of growing importance. The discovery of new SNP sites by resequencing and the collation of detailed information about each locus has continued to develop both in scope and in depth since the simultaneous publication of draft sequence and first full SNP map 5 yr ago (1,2). Added to this, our knowledge of SNPs has expanded in parallel to an improved understanding of the structure and function of the genome and of its content. SNPs provide the most complete and densely spaced system of genome landmarks available. Above all, this enables researchers to improve the resolution of linkage maps, in particular in the precise mapping and study of genes contributing to complex disorders, where the contributory effect of individual genes is small or where genetics is compounded by multiple locus interaction and environment. As well as enabling greatly enhanced mapping, SNPs provide a new and fascinating layer of detail to our understanding of human variability and what this tells us about disease susceptibility, populations, and evolution. Because SNPs have low recurrent mutation rates compared to other polymorphic markers they provide the most stable and reliable indicators of the evolutionary history of populations. Last, but not least, SNPs are the very stuff of genetic variation as a substitution in or near a transcribed region can change an amino acid sequence, the control of transcription events, or the splicing pattern for the resulting RNA products. Such changes arise from

*Author to whom all correspondence and reprint requests should be addressed. The Spanish National Genotyping Centre CeGen, Santiago node, Genomic Medicine Group, University of Santiago de Compostela, Galicia, Spain. E-mail: [email protected].

Molecular Biotechnology © 2006 Humana Press Inc. All rights of any nature whatsoever reserved. ISSN: 1073–6085/Online ISSN: 1559–0305/2006/35:1/065–098/$30.00

MOLECULAR BIOTECHNOLOGY 65 Volume 35, 2007


loci commonly termed coding SNPs, promoter-site SNPs, and splice-site SNPs, respectively. Therefore, it can be argued that SNPs form the principal part of the variability affecting each stage of gene expression and can justifiably be said to influence the transcriptome and proteome, as well as the genome itself. As such, SNPs are more than just ubiquitous marker points: they are a core genome feature that will ultimately help to explain one of the enduring mysteries of human genetics, namely why comparable numbers of genes in organisms at opposite ends of the evolutionary scale can create such profoundly different levels of complexity.

In the same spirit as that prevailing in the database management of HGP draft and final nucleotide sequence, SNP data arising from public genome analysis have been open source from the beginning, freely available on websites accessible to any researcher. Furthermore, the largest database, dbSNP, accepts submissions directly from the scientific community, allowing the rapid dissemination of newly discovered SNPs. In fact all the SNP databases described in this review are based on open access and in this sense are a shared scientific resource. Added to this, during 2006 the bulk of privately held SNP data will become available to the whole research community. This means one of the original precepts of HGP, that free access to the best possible information acts to accelerate scientific progress, equally applies to the overriding majority of human SNP data. Unhindered access to information is considered to be of such primary importance now that a system for data release has been formalized in the recommended framework for international community resource projects (http://www.welcome.ac.uk/doc_WTD003208.HTML). This framework secures, for all scientists, rapid dissemination, highest standards of database management, and the end use of data without restriction.

At the same time that genomic data started to become available, the fields of bioinformatics and database management were rapidly developing to keep pace with the scale of the information generated and the complexity of the searches required. These developments have meant that searching for specific SNPs among a total of 9 million or more human loci can now be achieved fairly easily, provided the researcher begins with clearly determined criteria for the markers required for a study. To this end, I can firmly recommend, as starting points, two excellent textbooks: Human Molecular Genetics and Human Evolutionary Genetics (3,4), which thoroughly cover the fields of genetic analysis and population genetics, respectively, these forming the two principal applications for SNP analysis. Both books provide, in nonscientific parlance, “one-stop shops” for understanding the extent and specific characteristics of SNPs needed for planning studies of inherited human disease (referred to as clinical genetics from this point on) or population genetics. I also consider that it is important for scientists in one area of speciality to be better acquainted with the other, so the use of both books is recommended.

Careful project planning is an essential step, in addition to the scrutiny of appropriate reporting publications, before any database search starts in earnest. Therefore, the major steps of project design involving the access of online resources can be visits to PubMed (for the best current reports in the field), HapMap (for the review of SNP candidates in, or close to, genes or genome regions of interest), dbSNP (for the characterization of the chosen SNPs), and finally use of various web tools for optimizing the genotyping assay design. This is not intended to be a prescribed route; many of the sites outlined in this review have viable alternatives that can act, above all, to supplement the data obtained from the primary sources described here. In addition, particular areas of clinical genetics such as cancer studies have specialized databases that are not covered in this review but clearly serve their field with a more direct focus of information.

This review was originally written for a third potential application of SNP analysis, that of forensic identification, commonly termed DNA profiling. SNPs offer the potential to greatly reduce amplified product size compared to existing profiling loci and therefore could provide


much improved results with the highly degraded DNA commonly encountered in forensic analysis or disaster victim identification.

Forensic profiling requires a specific set of search criteria to find SNPs with the necessary properties for this field and readers interested in the forensic application of SNP analysis can find more specific guidelines in the original article along with several specialized reviews (5–7). It is interesting to note that forensic SNP sets may find applications in genetic studies where measurement of stratification and admixture ratios could be achieved using a discrimination and geographic origin marker panel. These equally specialized areas of SNP use in genetic analysis are discussed in the overview of SNP biology.

Therefore, the purpose of this review is twofold: to compare SNP databases and to detail the methods that can be used to check the characteristics of SNP markers of most relevance to a proposed study. Emphasis has been placed on ways to narrow the selection of SNPs (or any other supporting data) to a manageable size, an increasingly important strategy given the huge scale of information available. In the final section several online resources such as Basic Local Alignment Search Tool (BLAST) and RepeatMasker, which provide invaluable tools for the design of robust and reliable SNP genotyping assays, are discussed as a supplementary resource to the purely genetic databases.

Finally, during the preparation of this article wholesale changes to the availability of a substantial proportion of private SNP data were taking place. Although the review was not intended to cover searching private SNP databases (in particular, the 4 million SNPs comprising the Celera database known as Celera Discovery System, CDS), all the information previously held by Celera on these SNPs is in the process of transfer to dbSNP. This means the extent of freely accessible SNP data is certain to expand still more in the coming year. Since Celera SNPs were discovered using different approaches to the public SNP mapping initiative, the expansion should provide a considerable number of completely new markers (potentially as many as 1 million), despite the majority of commercially held Celera loci being common to both public and private databases. Some idea of how much supplementary data may eventually be released for public access can be obtained from the time taken by the National Centre for Biotechnology Information (NCBI) to scrutinize, validate, and reorganize the Celera SNPs. This has already occupied 6 months of 2005 and is likely to go beyond a full year of data processing on a large scale. This underlines the continuing growth in importance of SNPs, both to provide sufficient coverage for the analysis of all parts of the genome and to help fully understand the complex mechanisms of gene expression.

2. The Biology of SNPs: A Brief Overview

SNPs, as base modifications, represent changes to the ancestral DNA sequence comprising a genome. Because the cellular mechanisms for correcting a base mismatch are extremely effective, it is necessary to understand how these substitution events progress from a single sequence change that is promptly edited back to the correct base, to become allelic (i.e., the variant base is polymorphic with a minor allele frequency greater than 1%). Substitutions, once reaching this frequency level, become true single nucleotide polymorphisms, a self-explanatory term, but note that the term SNP also encompasses single base insertions and deletions (commonly termed indels). Two processes create substitution-based SNPs: incorrect base incorporation during DNA replication and in situ chemical modification of a base. The first of these two processes, generating new SNPs from nucleotide misincorporation at DNA synthesis, is an extremely rare event given the fidelity of DNA polymerase enzymes and the elaborate proofreading mechanisms in place to check physical alignments in fresh sequence copies. The frequency of misincorporation has been estimated to be approx 10^-9 to 10^-10 per nucleotide (8) and this event alone is not sufficiently common to account for the huge number of SNPs with detectable minor allele frequencies observed in all organisms studied so far. This is an important


point because a feature of SNPs often cited in favor of their use in both population genetics and association studies is the long-term stability of substitutions that have become fixed in a sequence, stemming directly from a very low recurrent mutation rate. Consequently, the alternative process of in situ modification of bases must account for the majority of SNPs generated. Although a range of factors external to the cell can effect a chemical transformation of a base, most notably high-energy radiation, the usual repair processes are, again, far too effective at correcting changes to account for observed SNP frequencies. The real clue to the most widespread SNP generation event comes from the fact that C-T and A-G SNPs far outnumber all other types of substitution. Furthermore, the rate of substitution observed within CG dinucleotides (usually termed CpG) is an order of magnitude higher than for all other dinucleotide motifs. CpG dinucleotides have two critical characteristics. First, they are the target for methylation: a universal process of base modification affecting 75% of CpG dinucleotides that plays a central role in the control of gene expression. Second, they exhibit dyad symmetry, that is, the complementary base sequence is also CG, therefore methylation affects both strands. Once cytosine is methylated to become 5-methylcytosine it can undergo deamination to form a stable thymine base surviving on the uncorrected strand. In short, CpG can become TpG or CpA with equal likelihood and in half of these cases the original strand is corrected by the repair machinery rather than the deaminated strand, leading to a stabilized substitution event. In fact the mutability of CpG is such that these motifs only occur at 20% of the frequency predicted by normal base composition. When CpG motifs do occur at high frequency, in particular in the transcription control regions at the 5′ ends of genes, the tendency is for CpG to escape the destabilizing process of methylation, forming the characteristic CpG islands that can often act as telltale signatures for spotting putative genes. All this adequately accounts for the bulk of SNPs generated throughout the human genome and explains the overall high frequency of SNPs compared to other sequence changes such as indels (found at only ~10% of the frequency of substitution SNPs). It also predicts that the great majority of SNPs comprise C-T or A-G substitutions, which is indeed the case. Last, there is no reason why a substitution event cannot occur more than once at a given base position, however rare such an event is predicted to be, and I have successfully located many nonbinary SNPs in online databases with the aim of developing sets of these loci for mixture detection in forensic analysis (9).

Although the mechanism behind the generation of SNP variation had been well characterized, it was not until the completion of the HGP that the distribution, density, and diversity of SNPs could be viewed on a scale large enough to assess how adequate, or otherwise, the marker coverage was going to be. If any extensive gaps in SNP distribution occurred in the genome, then SNPs could not be universally applied for the fine mapping of every gene candidate, or used to track a complete set of genes when large numbers of these acted with small additive effect in multigenic traits. The International SNP Map Working Group described in 2001 (2) the first systematic study of the whole set of SNP variation as it stood at that time, following comprehensive sequence comparisons between the multiple donors used for the HGP sequence compilation. The analysis of the 1.42 million markers in this first map provided a detailed picture of the characteristics of human SNPs and allowed the appropriate planning of the future development of SNP maps, ultimately giving rise to the work of The SNP Consortium Allele Frequency Project and the HapMap initiative. The most important finding with a bearing on the usefulness of SNPs as linkage markers was that the distribution of SNPs is constant throughout the autosomal chromosome set. In addition, the vast majority of the genome was seen to contain SNPs at suitably high density, with only 4% of sequence showing frequencies less than one SNP per 80 kb, much of this comprising incomplete SNP mapping at the time. This was an important finding, not just for the future prospects of SNP-based linkage analysis but as a contrast to the studies of gene density from HGP data that were


Fig. 1. SNP and Gene density plots for two autosomes of comparable size: chromosomes 18 and 19.

revealing considerable disparities between chromosomes. This is shown by the contrasting numbers of genes observed on chromosomes 18 and 19, a pair with almost identical lengths but at opposite ends of the gene density scale. Figure 1 illustrates the gene distributions alongside the SNP density plots for both chromosomes. In line with distribution and density, patterns of SNP variability, measured as nucleotide diversity, were also remarkably consistent between the autosomes. All showed nucleotide diversities within 10% of the mean, except for the high and low value outliers of chromosomes 15 and 21, respectively, and the major histocompatibility complex (MHC) region of chromosome 6. The X and Y chromosomes differ from autosomes in both effective population size and mutation rate and the lower observed densities and levels of


heterozygosity for nonautosomal SNPs matched predictions based on the tendency of these chromosomes to show reduced diversity. Although SNPs are binary loci with a much lower average heterozygosity than any other polymorphisms, it was evident from the HGP map that their balanced distribution and levels of variability made the use of genome-wide high-density SNP mapping a viable prospect.

Although a vast number of well-characterized SNPs can provide a wealth of marker sets for association studies, the actual level of polymorphism shown by individual SNPs is limited because many closely neighboring loci are bound together in near-complete linkage as chromosome segments termed haplotype blocks (alternatively linkage disequilibrium or LD blocks). Haplotype blocks occur because the chromosome segments have shared ancestry, so that particular combinations of SNP alleles in close proximity comprising the haplotype are changed only slowly by recombination or accumulated mutation. Human evolutionary history has the unusual characteristics of involving very small population sizes and a relatively short total period of development of the species. As a result the human genome is particularly amenable to haplotype block analysis because the math dictates that the rate of erosion of linkage disequilibrium is slow (~10^-8 per base, per generation) and the number of generations since the disease variant mutation is relatively small (~10^4 to 10^5). Using haplotype blocks as the basis for SNP analysis can have significant advantages in association studies, as genomic regions can be tested without the need to first pinpoint the location of the functional variants of the trait. However, haplotype block structure adds an additional layer of complexity because there are no guarantees that blocks are consistent in size or distribution, coincidental in position to the genes of interest, or even very close (10). In fact, initial analysis using patterns of association between SNP pairs revealed a nonrandom distribution of linkage disequilibrium in the genome, although the blocks themselves consistently showed much reduced SNP variability with only a limited number of allele combinations per block. This latter characteristic had the potential to erode the power of SNPs as association markers, as the normal and variant gene might both share the same associated haplotype in a great many cases. For this reason, there is much interest in allele frequency differences between the major population groups, as it provides a chance to choose the best study group for a particular set of blocks. Haplotype blocks are clearly inconsistent with the established model of genetic distance measurement based on a predictable recombination rate and it is generally agreed that they represent areas of extremely low recombination bounded by smaller segments where recombination is much more frequent, so-called hotspots (11). For this reason an important phase of study after the SNP map publication was the analysis of haplotype block diversity and structure. Several studies examining chromosome-wide LD distribution found that blocks can span relatively large distances (12–14). Two studies obtained comparable block length values from different chromosomes: Daly et al. (14) reported a size range between 3 and 92 kb on 5q, whereas De La Vega et al. (15) found a wider range of block size for chromosomes 6, 21, and 22 (5 to 300 kb), but with an average size of just 26 kb and 18 kb in Europeans and Africans, respectively. The longest block discovered at this time was 800 kb (11), but blocks longer than 100 kb are normally expected to be rare (representing only 2% to 3% of the total found on chromosomes 6, 21, and 22) (15). Many of the principles applied to haplotype block mapping and underlying the HapMap approach still do not meet with universal support and the main arguments are more fully discussed by Wall and Pritchard (16). However, the outcome that would suit most researchers would be to be sure that a SNP at a given distance from a gene variant could adequately mark the variants underlying the disease and therefore act as a tag for comparing case subjects to control subjects. This is the principle of positional cloning, where progressively finer mapping can focus on the positions of contributing loci in the absence of clear information about gene action. Such an approach can potentially reduce the genotyping efforts needed to examine complex disorders to


the point where a very small number of tagging SNPs might enable the tracking of the multiple variant genes creating complex disorders. This can be further simplified by attempting analysis on populations with a history of reduced variability due to small founder numbers, bottlenecks, reproductive isolation, or endogamy, hence the widespread interest in isolates such as island populations or small, culturally coherent groups like the Anabaptist sects of North America. At present, population genetics and association studies start to share a considerable amount of common ground. Furthermore, much can be learned from differences in the frequency of, and levels of susceptibility to, common diseases among the five major population groups. These are broadly based on the continental boundaries of Africa, Europe, Asia, the Americas, and Oceania (Pacific island groups). The HapMap initiative (of which more later) was set up to apply the principles outlined to delineate haplotype blocks in the genome. To develop the exploration of population differences, phase I of HapMap SNP analysis has genotyped all loci in 90 subjects each from Africa and North America (of European descent) plus 45 subjects each from China and Japan.

To conclude this section mention should be made of two important aspects of association studies using SNPs where a prior understanding of the characteristics of the study population is particularly important: admixture mapping and stratification effects. Admixture mapping (often termed MALD, for mapping by admixture linkage disequilibrium) exploits the differences in haplotype block boundaries between admixing populations, previously isolated, to gain higher levels of association (17). This is because recombination over the limited number of generations since admixture has less opportunity to disrupt the associations between SNPs and genes shown to different degrees by each contributing population. In particular, African Americans are an informative population for association studies with an estimated level of admixture with Europeans of about 20%, although the level has a broad range from 4% to 30% depending on the US region (18). Stratification effects occur when the trait studied and the genetic variation as a whole are both clustered into strata within a population that is presumed to be homogeneous for both. As a result, the overall within-population variation approaches levels seen between groups of study and control subjects because the groups do not share identical ancestries. Stratification can potentially create spurious associations between the traits studied and a set of SNPs chosen to map them (19). An example of this effect is shown by the SNP rs182549, which shows a strong allele frequency gradient (cline) from northern to southern Europe as does the trait studied, adult stature. The association between trait and variation in this case was entirely the result of an identical distribution of variation and was lost when individuals were rematched on the basis of geographic latitude within Europe (20). Interestingly, forensic discrimination SNPs can make effective measures of stratification, as such loci are required to be neutral, freely assorting, and highly polymorphic, characteristics not found in the association study SNPs. Once again these aspects highlight the importance of a thorough understanding of human population structure and history in the design and interpretation of clinical genetics studies.

3. NCBI: PubMed, Entrez, Boolean Principles, and Databases of Relevance to SNP Analysis (http://www.ncbi.nlm.nih.gov/)

Any search for information relating to genetics and medicine should begin at the homepage of the National Centre for Biotechnology Information (NCBI) website. NCBI is one of the institutes of the US National Institutes of Health (NIH) and has been the principal worldwide repository of genomic data for the past 18 years. The nucleotide sequence variation database housed at NCBI is known as dbSNP and comprises the largest collection of SNPs available with the most comprehensive set of supporting data for each locus. The following sections detail the structure and use of dbSNP specifically but the importance of NCBI outside of dbSNP is that genetics research involving SNPs should always be set in the context of supporting information. In particular, it is important to gather, at the same time as SNP data, information about genes, phenotype, [...] (both chemical and [...]), expression dynamics, and the contextual sequence surrounding the SNP. Given the extent of the biological databases at NCBI it is of no surprise that this institute has built the most extensive collections of gene (Gene), inherited disorder (OMIM), protein (Protein), gene expression (GEO), and nucleotide (GenBank) databases in the past 8 yr. Together with the Santa Cruz genome database (http://genome.ucsc.edu/), NCBI has managed the content of each of the draft versions of the human sequence since 1990 and now keeps the reference sequence, completed in April 2003. This comprises 2.9 billion bases (99% coverage of gene-containing DNA) with an error rate of 1 in 10,000 bases. To most biologists NCBI will already be familiar in the guise of Medline and PubMed: two bibliographical databases that collate all the principal citations from biomedical journals (~5000 journals in total). Medline was the main source of data for PubMed but has largely been supplanted; the two are still distinguished by the fact that PubMed has a broader scope by including articles predating the Medline selection and by containing certain “out of scope” content (i.e., not biomedical). The NCBI bibliographic databases are by default predominantly text oriented. This means they work by matching text recognized in the query submission to text in the data records, and to work efficiently the system needs to regulate vocabulary. The database of words used to index PubMed is MeSH (Medical Subject Headings) and this can be searched itself using the search menu in the top left of each NCBI homepage. For association study research, checking MeSH is a useful step to help clarify terminology before a search begins and to review in brief possible related areas of study to the disease of interest. For example, the query Repetitive Strain Injury gives the three specific medical terms used to describe varieties of this condition with the year of introduction of each. For clarification, at the top of the page are the alternative terms, named suggestions, that are also used in PubMed for text matching with the query term (e.g., repetition strain injury).

The importance of prior experience in the use of PubMed is that both PubMed and dbSNP are components of a unified NCBI database retrieval system termed Entrez. Therefore, familiarity with bibliographic search strategies that can produce a manageable list of published articles from the 12 million available at NCBI can help the user to develop the same principles for SNP searching. NCBI uses Entrez as a standardized query interface for all the major databases it manages. Therefore, using Entrez has two clear benefits for the user. First, the query system is the same for each Entrez database and second, the searches can be made global to return data that is then seen to be interlinked between many databases. Data returned from multiple sources are linked by cross-referenced hyperlinks termed linkouts (i.e., two-way connections between each database). To work efficiently and to delineate a search correctly Entrez relies on an understanding of Boolean terms and combinations of parameters, termed fields, that define the required characteristic from a piece of data. For instance, searching NCBI using just the search term diabetes gives a quarter of a million literature citations alone and equally daunting numbers from the other databases. When diabetes is used with the operator “AND” in combination with the term chromosome 2, the returned PubMed citations drop to a much more realistic 140 articles, summarizing several studies, among others, analyzing the interleukin 1 gene cluster on chromosome 2 implicated in diabetes susceptibility. Not surprisingly the number of genes listed in EntrezGene drops in similar fashion from 1114 to 3, again acting to define a specific genome feature before any follow-up searches have begun. This simple predefinition of search terms can be particularly useful for the process of designing a new genetic study when it is important to check that the work to be undertaken is both manageable and has enough leads to instigate database research in earnest, or equally important, has not already been achieved elsewhere. Therefore in the initial stages the PubMed and OMIM text-based databases (termed literature databases in NCBI) are as important as the genetic content databases (termed molecular).
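The diabetes example above can be sketched in code. The following Python fragment is only an illustration: it assumes NCBI's public E-utilities `esearch` address and its `db`, `term`, and `retmax` parameters, none of which are described in this review (which covers the Entrez web pages), and the helper names `combine_with_and` and `build_esearch_url` are hypothetical. It builds the same AND query for two Entrez databases without sending any request.

```python
from urllib.parse import urlencode

# Assumption: NCBI's public E-utilities "esearch" address. The review itself
# describes only the Entrez web interface, not this programmatic route.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def combine_with_and(*terms):
    """Join search terms with the Boolean operator AND (hypothetical helper)."""
    return " AND ".join(terms)


def build_esearch_url(database, query, max_records=20):
    """Return a search URL for one Entrez database; no request is sent."""
    params = {"db": database, "term": query, "retmax": max_records}
    return ESEARCH_URL + "?" + urlencode(params)


# The same narrowed query is reused across databases, as Entrez allows.
query = combine_with_and("diabetes", "chromosome 2")
pubmed_url = build_esearch_url("pubmed", query)
gene_url = build_esearch_url("gene", query)

print(query)
print(pubmed_url)
print(gene_url)
```

Because the query string is built once and passed to each database name, the sketch mirrors the point made above that the Entrez query system is the same for every database it serves.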


Fig. 2. Boolean operators. OR applies to all items in A or B, AND to items found in both A and B, and NOT to items in A not found in B.
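The relationships in Fig. 2 can be mimicked with ordinary set operations. This is a minimal sketch, not part of any NCBI tool: the record identifiers and the two result sets A and B are invented for illustration.

```python
# Hypothetical record identifiers returned by two separate searches.
returns_a = {"rec1", "rec2", "rec3", "rec4"}  # records matching term A
returns_b = {"rec3", "rec4", "rec5"}          # records matching term B

a_or_b = returns_a | returns_b   # OR: all items in A or B (union)
a_and_b = returns_a & returns_b  # AND: items found in both A and B (intersection)
a_not_b = returns_a - returns_b  # NOT: items in A not found in B (difference)

print("A OR B: ", sorted(a_or_b))
print("A AND B:", sorted(a_and_b))
print("A NOT B:", sorted(a_not_b))
```

OR can only enlarge the list of returns, whereas AND and NOT can only shrink it, which is why AND is the usual choice for narrowing a search.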

Boolean terms govern the rules used in all database searching by applying the principles of logic that define the relationship between a set of inclusive or exclusive terms, ruled by the three operators: AND, OR, and NOT. The logic is summarized in Venn diagram form in Fig. 2 and each operator can be described as follows:
• OR (often termed union) is inclusive, therefore it returns all database entries that contain at least one of the provided search terms.
• AND (often termed intersection) is exclusive, therefore it only returns database entries that contain all the provided search terms.
• NOT (often termed difference) excludes from the returns all database entries with the provided search terms.

Nearly all Entrez database queries use the operator AND to narrow down a search using multiple combinations of an extensive array of fields, each set being tailored to the content of the database. This approach is sometimes known as a relational search, as it looks for relations or links between the search terms found in each data record. If search terms are to be confined to a specific field Entrez rules require that these are described using predefined tags termed field tags and set in square brackets. For example, entering “short [au]” returns publications in PubMed with Short as author and the list will not include studies of short tandem repeat loci (unless Short is an author!). Descriptions and details of the whole PubMed field tag list are outlined at (http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=help .box.pubmedhelp.Box_1_Search_Field_D). Entrez searches default to searching all fields in the absence of field tags and to the operator AND, not OR, if there are spaces between fields, as is common practice elsewhere. For most of the molecular databases the text of a tagged field must be in the correct format, termed syntax, to be automatically read by the NCBI search engine and this can initially be a source of frustration until Entrez formatting becomes second nature. To begin using Entrez it is better to use a menu of fields for a single database search from the limits tab that appears on each database homepage. These take two forms depending on the type of information held in the database: additional query description boxes, or tick-boxes and query description boxes. An example of the first might be a PubMed search that can work with one of 29 limits such as author or text word. These still require an entry in the query box but some assistance is given for combining these with extra intersections by limiting certain fields at the same time using a small number of choices in seven additional areas such as text language or human/animal subject matter. The date range for publication or adoption to PubMed (Entrez Date) gives the most useful field limit in combination with text word as it ensures the returns concentrate on the most recent publications. Overall, the underlying theme is to allow simplified queries that still enable manageable numbers of returns from the largest databases. The second form of limits using tick-boxes is used for many of the molecular databases, notably SNP (termed EntrezSNP in this guise) where a clear level of categorization is possible with the data. Therefore, for example, ticking one of the 15 IUPAC substitution codes (e.g., Y to denote C/T substitutions [listed in

MOLECULAR BIOTECHNOLOGY Volume 35, 2007


Table 1] and sometimes termed IUPack) starts the process of concentrating search terms down to a specific set of data to provide focus. The SNP field list is one of the most extensive and the use of terms is outlined in greater detail in the dbSNP section.

Table 1
IUPAC Codes Used With the [ALLELE] Tag Denoting SNP Base Substitutions

Code   Substitution      Code   Substitution
M      A or C            V      A or C or G
R      A or G            H      A or C or T
W      A or T            D      A or G or T
S      C or G            B      C or G or T
Y      C or T            N      A or C or G or T (or indeterminate base)
K      G or T

A, C, G, and T can be used individually to select all SNPs exhibiting that base as an allele.

Three modifiers of operator function can be used in Entrez:

• Ranging: setting a range for a value in a field (e.g., SNP heterozygosity) using a colon (:) between the lower and upper limits for the value.
• Parentheses: combining related terms together as logical groups and forcing the order of operation for the search process or performing a common operation on a group of terms. In the first case this sets the order of searching to the bracketed terms first so that the next operation is performed on the results of the first operation. In the latter case brackets are commonly used to group together NOT items combined using OR. Entrez syntax uses curved parentheses for grouping and square brackets for field tags. Brackets are helpful in ensuring descriptions lacking clear definition such as “learning disorders” are replaced by a broadly based set of more specific terms that in combination keep the focus but prevent false exclusions from returns. In place of the above query term, using (dyslexia OR attention deficit hyperactivity disorder) together ensures the combined returns with either term are available for the next operation in the query. The example used in PubMed help is apt because it illustrates that use of parentheses can mirror the logic of a sentence: so “find articles on the effects of heat and humidity on multiple sclerosis” takes the form: (heat OR humidity) AND multiple sclerosis.
• Wild card: using an asterisk (*) in place of missing text allows a partial entry to be used as a query term (e.g., using BRC* will find both BRCA1 and BRCA2). NCBI does not generally use adjacency searching in the molecular databases. Adjacency searching is based on the proximity operator NEAR, routinely used by web search engines like Google. A notable exception is the alternative text terms named suggestions that are handled by MeSH in PubMed. Using adjacency searches tends to lack focus for the majority of data in NCBI, as information is usually clearly and unequivocally categorized.

It is possible to use Boolean terms to combine individual Entrez searches performed at different times as an alternative to using parentheses, enabling more opportunity to monitor the number of returns with different search term combinations. This uses the clipboard and history tabs. The clipboard is a workspace for holding up to 500 items manually selected from search returns. Note (before losing work) that the contents are cleared after 8 hours of inactivity. History lists the database search activity as numbers prefixed by a hash (#). Because these records are, again, cleared after 8 hours of inactivity it is worth getting into the habit of transferring long multistep search records into the personal folders available as “My NCBI” (http://www.ncbi.nlm.nih.gov/entrez/login.fcgi?call=so.SignOn..Login&callpath=QueryExt.CubbyQuery..ShowAll&db=pubmed). Previous searches can be combined as hash fields using Boolean operators (e.g., #1 AND #2 gives an intersection of the first two searches from the current active session). It is also possible to use hash fields together with normal fields, thus helping to build a stepwise record of the search process as it is modified to reduce return numbers in small stages. Finally, it is logical to fix the values of certain fields in Entrez to filter down the number



of returns. For instance, choosing human as the organism is wise as much SNP data are now held for the mouse genome. These options for EntrezSNP searching are discussed in more detail in Subheading 7.

Several key NCBI databases have an important place in clinical genetics research design or in support of dbSNP searches. These include GenBank, Gene, Online Mendelian Inheritance in Man (widely referred to as OMIM), and UniSTS. This is just a fraction of the total collection and represents the most useful databases for adding to data already obtained from dbSNP, PubMed, Map Viewer, and MeSH. What follows is a brief outline of the structure and use of the four databases that can be accessed in combination with dbSNP searches.

3.1. GenBank

GenBank comprises the nucleotide sequence database of NCBI. This simple description belies the scale of the information held: a collection of sequences comprising 59 gigabases of data from more than 130,000 species (spanned by 17 different genetic codes), which is updated daily. The database organization involved is equally complex but the front end is simple enough if the user needs just a sequence segment, the coordinates, and the amino acid translation sequence. The example sequence file on the GenBank homepage illustrates a standard sequence report with the fields that can be used in searches. GenBank is part of Entrez, has its own specialized fields, and is a subset database under the “umbrella” group of EntrezNucleotide. This comprises sequence subsets for expressed sequence tag data (dbEST), genome survey sequence data (dbGSS), and Core Nucleotide, the subset of interest to most users, containing genomic sequence data. This allows joint searches in Entrez and individual searches in BLAST, avoiding cross-referencing when it is not needed. To add more complexity and another name to keep in mind, there is also the RefSeq database. This can be thought of as the reference sequence set for the key study organisms (3244 in early 2006) with integration, meaning the included sequences can be optimally compared, as can the annotation, the process of characterizing a gene from the base sequence. RefSeq is not part of Entrez but the protein sequence portion can be searched in Entrez as EntrezProtein. For most users interested in SNP searches the main contact with nucleotide databases is in the process of designing genotyping assays. Therefore it is normally necessary to check flanking sequence for quality or presence of clustering SNPs and to check potential primer designs using BLAST. There is not a great need in most cases to go deeper than this.

3.2. Gene

Gene comprises the NCBI gene directory previously known as LocusLink. Maintaining this database is particularly challenging as the gene landscape is constantly changing in so many aspects. Definitions of function, interactions, association with a trait, activity, and many other characteristics are regularly revised and this must be collated with equal regularity. Gene works with unique identifiers assigned to three types of gene entities: genes with defining sequences, genes with known map positions, and genes inferred from phenotypic information. These gene identifiers are tracked, and new information is added when available. The scope of Gene, encompassing all organisms, supersedes LocusLink, which was centered solely on human gene data. Searches generally begin with an identifier in the form of a letter/number combination standardized by HUGO (http://www.gene.ucl.ac.uk/nomenclature/). The term does not have case sensitivity but care is needed to avoid spaces (treated as an AND operator and usually failing to find the target). The list includes all the species with genes matching the query combinations but here case becomes an important point of distinction. Query “ABCC1” returns ABCC1 in humans and Abcc1 or ABCC1 in other mammal species, all essentially the same gene, but also abcC1 in Dictyostelium sp., which is a different gene. Each linkout in the list gives a single report page starting with a two-line header including the full name and a unique



geneID number that can be used in Entrez by itself. This is followed by sections: summary information; graphic summary of transcription structure; graphic summary of genomic context with a linkout to MapViewer; bibliography; general gene information; general protein information; RefSeq sequences; related sequences and additional links. This is comprehensive enough to provide most of the search directions needed to explore the context of the gene as well as the likely critical characteristics that can help the researcher assess its status as a candidate for study. Invariably, the most useful section is the bibliography, a comprehensive catalog of relevant studies of the gene that allows an easy check of current interest in the gene in the context of a particular disease. Reference to the gene symbol denotes the name written as upper case letter/number combinations; gene ID denotes the 5-digit number.

3.3. OMIM

Online Mendelian Inheritance in Man or OMIM is the NCBI phenotype database. In this role it catalogs traits, diseases, and disorders, although not always with reference to a gene if no association has currently been described. The list of entries in early 2006 reveals the paucity of understanding of the genetic basis for disease. Of a total of 16,612 entries only 384 list a gene with a known sequence and phenotype (unique OMIM number prefixed with +). Another 2229 describe a phenotype with a suspected Mendelian inheritance pattern (no prefix), 1502 with a well-described phenotype lacking a described molecular basis (%), and 1862 with a known molecular basis (#). This leaves the remaining 10,635 entries listed as merely genes with known sequence (*), therefore the majority of OMIM data consists of genes lacking known phenotypes. Each entry has a unique number often used in the literature to denote a condition and usable as a search term throughout NCBI. OMIM compensates for the lack of concrete data by being very readable and informative about gene function. The noticeable difference in the character of the database is because OMIM is hosted by NCBI and developed independently at Johns Hopkins University. Using OMIM as the basis for a literature search is usually a fruitful approach: the text acts to review the quality of the associations suggested by studies in the area of interest under headings cloning, gene function and structure, mapping, molecular genetics, and animal models. The references at the end of the OMIM report are a selected list of the most relevant studies to provide the clearest direction for further investigation. As an example that highlights the difficulties of collating very extensive phenotype data with genetic data there is no mention in an exhaustive report for gene ABCC11 of the effect of coding SNP: rs17822931 on human earwax viscosity (21). The OMIM mapping section is particularly helpful as a starting point when planning an association study with SNPs or seeing whether further linkage analysis is needed as a preliminary stage of study.

3.4. UniSTS

UniSTS comprises the NCBI database of linkage markers termed Sequence Tagged Sites (STS) and leads on from the above point about linkage analysis. UniSTS content encompasses polymorphic loci other than SNPs available for linkage analysis (predominantly short tandem repeat sites) and can be searched using gene identifiers or chromosome position. The return page gives related information and helpfully recommends polymerase chain reaction (PCR) primers to simplify the development of linkage marker typing if this is required. The primer pair information is used to match alternative names for linkage markers, therefore, for example, “D2S2300” will retrieve the marker named in the database as “AFM261YB1.” The easiest way to combine STS markers and SNPs to fully cover an area of interest with suitable linkage markers is to use the Between Markers search option in the dbSNP homepage (http://www.ncbi.nlm.nih.gov/SNP/index.html). This provides two query windows for inserting the STS IDs and a list of SNPs is returned that spans the distance in between.
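
The idea behind a between-markers query can be sketched in a few lines of Python. This is only an illustration of the principle, not the dbSNP implementation: the function name `snps_between`, the SNP labels, and all positions are hypothetical.

```python
def snps_between(snp_positions, marker_a, marker_b):
    """Return SNP names lying between two marker positions, ordered by position.

    snp_positions maps a SNP name to a chromosome coordinate; the two markers
    may be given in either order. All values here are hypothetical.
    """
    low, high = sorted((marker_a, marker_b))
    inside = [(pos, name) for name, pos in snp_positions.items() if low <= pos <= high]
    return [name for pos, name in sorted(inside)]


# Hypothetical SNPs on one chromosome and two hypothetical STS marker positions.
example_snps = {"snp_A": 1500, "snp_B": 900, "snp_C": 2600, "snp_D": 2000}
print(snps_between(example_snps, 2100, 1000))
```

With these made-up coordinates only snp_A and snp_D fall inside the interval, and they are returned in map order.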



4. The dbSNP Database (http://www.ncbi.nlm.nih.gov/SNP/index.html)

NCBI dbSNP is the principal database of SNP information generated from the HGP and the simultaneously published first SNP map. dbSNP has continued to collate all the data from various SNP validation initiatives that have followed since, including output from The SNP Consortium (principally the Allele Frequency Project), The Perlegen SNP genotyping initiative, and HapMap. The SNP data are regularly updated in synchrony with genome rebuilds, ensuring the highest quality of SNP locus mapping and scrutiny. In early 2006, the total dataset amounted to 40.6 million different SNP loci, with just under half of this number validated by genotyping a sample set to confirm polymorphism. The human content comprised 10,430,750 SNP clusters (i.e., rs-numbers) of which 4,236,590 were sited in genes. A total of 35 organisms have individual SNP databases, 12 of these from completed genomes. The Chimpanzee dataset is likely to be of growing importance and currently comprises 1.54 million SNP clusters with just more than a third of these in genes. With this pace of data building it is a good idea to subscribe to the dbSNP-announce automatic e-mail update system to keep updated on developments (http://www.ncbi.nlm.nih.gov/mailman/listinfo/dbsnp-announce). As well as reporting the release of each new build, announcing newly added features, and outlining corrections or discovered problems with past or present builds, there is an archive for referencing possible problems with, or qualifications to, previously obtained search data (http://www.ncbi.nlm.nih.gov/mailman/pipermail/dbsnp-announce/). The rapid growth of human SNP data in dbSNP during the past 5 years is shown in Fig. 3.

Fig. 3. Growth in SNP numbers since dbSNP began cataloging loci in 2000. Plot shows the cumulative number of unique SNP loci (black line) and the proportion of these SNPs validated by genotyping (gray line). Assimilation of loci into dbSNP is fast but proper assessment of the polymorphism much slower.

Any reference to a SNP locus within NCBI (and elsewhere such as the scientific literature and alternative SNP databases) uses a unique identifier comprising a number prefixed with rs. All rs-numbers are listed in NCBI as linkouts, which will return a standard format summary page for the SNP termed the Cluster Report, which lists a full set of key parameters for the locus. This page can be thought of as the SNP locus homepage and from this standard point of reference different routes can be followed to database entries elsewhere in NCBI with content related to the SNP, such as Gene, OMIM, or GenBank. Of particular use is the link to Map Viewer (see Subheading 7.), which plots the SNP’s chromosome position in relation to a variety of other genome features acting as landmarks for the locus. Interlinking in both directions is now standard practice, therefore clicking an rs-number in any other major SNP database outside of NCBI will connect to the dbSNP cluster report. This means that it is important to be familiar with the page layout and to know the limitations that exist for the user with the way data are presented. The cluster report layout was revamped at the start of 2006 and after the summary header of four lines each for locus and allele information, the detail sections currently comprise submission, fasta, geneview, map, diversity, and validation.

4.1. Submission

Submissions for the SNP are listed, with the reports used to validate the locus marked by an icon. Each ss number linkout leads to a detailed breakdown of the genotyping performed and these in turn linkout to lists of genotypes with sample IDs and detailed population descriptions if these require checking. The sample details allow use of consensus controls as genotyping standards, for example, submission ss2316529 for SNP rs1490413 lists sample CEPH1331.01 as an A-G heterozygote.

4.2. Fasta

Fasta lists the flanking sequence around the substitution site. Minor problems can occur with recovery of sequence from this section, requiring care. The amount of sequence can vary from tens to hundreds of bases, whereas the SNP can sometimes be found very close to the start or the end of the listed sequence (Fig. 4). In these cases recovery of sufficient sequence involves visits elsewhere, either to Entrez Nucleotide or the more user-friendly Santa Cruz genome assembly. Inconveniently the NCBI linkout to Santa Cruz has recently been removed, requiring a “manual trip” to the gateway page http://genome.ucsc.edu/cgi-bin/hgGateway and entry of the rs-number. In the Santa Cruz map browser, click on the blue rs-number linkout under the various genome elements in the map view to get the summary page and then click on “View DNA for this feature” and choose the width of flank. It may be wise to choose 0 and 100 bases (i.e., upstream/downstream), for example, followed by 100 and 0, to keep track of the substitution site, as this is shown as a normal base and therefore its position can be difficult to locate. Note that Santa Cruz uses the term simple nucleotide polymorphisms. Back in fasta, sequence is arranged in 10-base blocks using different typesets: upper case/lower case and black/green. Upper case denotes normal, unique genomic sequence, whereas lower case is used for sequence identified by RepeatMasker (detailed in Subheading 11.) as low complexity or repetitive element sequence. Green is used to denote sequence used by the submitter lab during SNP identification. A common problem is a lack of consistency in the direction of the displayed sequence in fasta and Entrez Nucleotide Sequence Viewer. Therefore, users wishing to check sequence more carefully will need to be prepared to use sequence inversion macros in Excel or in stand-alone programs to “flip” between different strands in different locations. Helpfully, Sequence Viewer allows a “view on minus strand” option. One final sequence tracking problem is that clustering SNPs are occasionally given the IUPAC code (e.g., CTAYGGA) within the sequence in fasta and this is easily overlooked, although it appears to be uncommon and largely ad hoc in occurrence.

4.3. GeneView

For SNPs in genes the GeneView section outlines the gene context of the substitution with color codes for synonymous, nonsynonymous, and intronic SNPs (pale green, red, and yellow, respectively), and includes the different amino acid residues and their position in the protein sequence plus linkouts to the nucleotide reference assembly for the coding regions of the gene.
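
The strand “flip” mentioned in the Fasta subheading does not need a spreadsheet macro. The following is a minimal Python sketch, assuming the standard IUPAC nucleotide codes listed in Table 1; the function name `reverse_complement` is an illustrative choice and not part of any NCBI tool. Case is preserved, so the upper/lower case distinction used in the fasta section for unique and repeat-masked sequence survives the conversion.

```python
# Complement of each IUPAC code: an ambiguity code maps to the code
# that stands for the complementary set of bases (e.g., Y = C/T -> R = G/A).
COMPLEMENT = {
    "A": "T", "C": "G", "G": "C", "T": "A",
    "M": "K", "K": "M", "R": "Y", "Y": "R",
    "W": "W", "S": "S",
    "V": "B", "B": "V", "H": "D", "D": "H",
    "N": "N",
}


def reverse_complement(seq):
    """Return the opposite-strand sequence, keeping upper/lower case per base."""
    flipped = []
    for base in reversed(seq):
        comp = COMPLEMENT[base.upper()]
        flipped.append(comp if base.isupper() else comp.lower())
    return "".join(flipped)


print(reverse_complement("CTAYGGA"))  # prints TCCRTAG
```

An embedded ambiguity code such as the Y in CTAYGGA is carried across as R on the opposite strand, which makes a clustering SNP a little harder to overlook after flipping.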



4.4. Map

The Map section lists the NCBI and CDS positions where both exist, such as reference and Celera, plus linkouts to the contigs used in the assembly for each genome. This section also has linkouts to Map Viewer and to neighbor SNP details. The position and density of neighbor SNPs may be different between NCBI and CDS, therefore in assay design it is important to track all of these to be sure of clean primer-binding site sequence free from interfering substitutions nearby. This is especially true of primer extension chemistries that need ~20 bases of SNP-free sequence immediately adjacent to the target substitution. If SNPs close together are in association it follows that a particular allele may carry a neighbor at identical frequency and always drop out of the assay, producing an apparently monomorphic locus. It is important to note that this problem occurred in ~6% of cases during the HapMap phase I SNP genotyping (22). For a high density of neighbor SNPs it may be too problematic to find a clean sequence but the SNPs can be more easily tracked in Santa Cruz, which shows by far the clearest graphic arrangement of SNP grouping close to the study SNP. Unfortunately it requires a longhand process of clicking each linkout and obtaining the positions to construct a fully annotated sequence around the assayed substitution site.

4.5. Diversity (replaces Variation)

The Diversity section, originally termed Variation, has been the source of problems for several years and is now being completely revised to incorporate the detailed population data coming from HapMap. Until recently this section had been potentially misleading in the way it summarized allele validation information, because it used average allele frequencies and heterozygosity based on merged data from all submitting laboratories. For example, a submitter’s estimate of 0.2 minor allele frequency based on 100 individuals combined with another of 0.5 based on 20 individuals was summarized as 0.25 because dbSNP used total chromosome counts from all submissions to obtain the average values, that is, (0.2 × 200 + 0.5 × 40)/240 chromosomes. Not only can large allele frequency differences arise from combining samples from different population groups but dbSNP previously made no distinction between random population samples and samples of individuals with particular conditions. This is illustrated by SNP rs2075745 where a C allele only occurs in subjects with type II diabetes. Despite this, a frequency estimate for C of 0.476 was given for many years in the cluster report, although all other populations tested to date exhibit an A/T substitution at this SNP. In this case the misleading frequency resulted from a comparatively large sample of 200 diabetic subjects tested by one submitting laboratory and this skewed the estimates. Since late 2005, dbSNP has begun the process of incorporating the detailed, population-based breakdown of frequency estimates as it now appears. This gives a vast improvement in clarity for a critical SNP characteristic and this is obviously being extended to all submitter estimates, not just HapMap, as the previous example of rs2075745 is now unambiguous in presentation. Although the averaged figures are retained, it is now straightforward to interpret these with reference to the population studies made by the submitting laboratories. Note, however, that alleles are listed in base-alphabetical order in dbSNP and do not use the HapMap convention of reference allele frequency first. This may seem to put undue emphasis on detail but a look at SNP rs176000, an example of a base inversion between two submitting laboratories compounded by three-allele variation, indicates there is still scope for confusion in this section.

4.6. Validation

Validation briefly summarizes the Mendelian status, PCR performance, and allele frequency distribution quality indicators of the SNP. In addition to SNPs, data are held for other polymorphisms loosely defined as simple. These include small-scale multibase insertions or deletions (alternatively termed deletion/insertion polymorphisms, indels, or DIPs), microsatellite repeat variation (also termed short tandem repeats or STRs), and retroposable element insertions. Because dbSNP



is an open database, there is a straightforward framework for receiving and checking SNP data sent in from submitting laboratories. The SNP locus information is either reported as a new discovery (rare for human SNPs now) or collated into an existing reference SNP set. Clearly the latter case, where different laboratories routinely report identical SNPs, requires careful scrutiny of the flanking sequence to check for previous submission to NCBI and unique location in the genome. Submission criteria are very effective at detecting nonunique SNPs: the checking process requires a minimum 25-bp context sequence each side of the substitution for the detection assay and uses a minimum total context sequence of 100 bp to position the SNP uniquely in the genome or otherwise. The proportion of nonunique SNPs is, however, small (about 5%) and they are more common in pericentromeric areas, therefore the use of loci from these chromosome regions generally needs more care to ensure that the SNP is unique and is flanked by a relatively small proportion of low complexity sequence (sequence containing portions of intra- or interchromosomal repeats, polybase, or short tandem repeats). When the context sequence can be uniquely positioned and the SNP is identified as previously observed, dbSNP will place the submission into the reference SNP group, hence the use of the familiar rs-number denoting the refSNP. The full group of submissions for the same locus is termed a cluster in NCBI, hence the single summary page for each SNP is termed the reference SNP cluster report. Submitted SNP and reference SNP details are distinguished by using multiple ID numbers prefixed with ss and a single rs-number, respectively, at each cluster report. The current ratio between submissions and clusters is approximately 2.6:1 (27.2 million to 10.4 million), therefore multiple institutions have independently confirmed the majority of NCBI SNPs. Clicking on an ss number (also termed the accession number) allows the scrutiny of the quality of the submission including statistical analysis of the individual genotypes to ensure they are in predicted ratio from the assumption of Hardy-Weinberg equilibrium. As outlined previously, these detailed genotype listings can then be used to check the reliability of study assays by retyping the same controls.

5. The HapMap Project (http://www.hapmap.org/downloads/nature02168.pdf)

The international HapMap project was launched in late October 2002 with the stated aim of determining the haplotype structure of the human genome. This has broadened in scope slightly to encompass, in their own words, “all common human sequence variation, providing information needed as a guide to genetic studies of clinical phenotypes.” What this means is that the mapping of haplotype blocks using set SNP positions has been extended to include any SNP landmark that could be equally useful. The full program of the HapMap project is ambitious in scale, in some senses approaching that of HGP, but it remains simple in concept: to begin by genotyping at least one SNP per 5 kb of sequence (just more than 1 million markers) in 269 individuals taken from 4 populations located in three continents and to conclude by consolidating SNP number and annotation so that a limited set of SNPs can be confidently assigned as markers that tag each haplotype block. Haplotype blocks create a large amount of redundancy in the use of SNPs to measure association, as the average haplotype can contain a considerable number of SNP markers that share exactly the same frequency and the same recent ancestry with nearby gene variants, therefore using more than one SNP per block to track the genes of interest by association does not necessarily add any more value to the study. This is, of course, a simplistic argument that assumes that blocks are easy to define and haplotype diversity is limited, but it emphasizes one of the tenets behind HapMap planning: to reduce the genotyping effort required for a clinical genetics study without losing any quality in the association values obtained from using a small subset of the SNP variation available. Because the total genotype analysis needed far exceeds anything accomplished before, it was appropriate to start with manageable aims and build on these. Therefore, the phase I goals were to characterize and map



haplotype blocks; to collate haplotype diversity in each population; and to define every coding SNP (this last aim spans phase I and phase II). The four populations sampled were Yoruba in Ibadan, Nigeria (referenced in the HapMap data as YRI), Japanese in Tokyo (JPT), Han Chinese in Beijing (CHB in HapMap but HCB in dbSNP), and CEPH Utah residents with ancestry in Northern and Western Europe (CEU). The YRI and CEU samples comprised trios that ultimately allowed very precise analysis of haplotype phase, that is, whether the alleles of heterozygous SNPs reside on one chromosome or the other. Ten ENCODE (Encyclopaedia of DNA Elements) regions were analyzed with a 1—fold increase in SNP density to compare data quality between the HapMap genome coverage and a more complete SNP catalog. Sequencing of 48 subjects from each population for these regions has spanned the two phases and also acts as a test bed for low frequency SNPs. In contrast to the initial focus on 1 million SNPs in phase I, HapMap is now producing genotypes for the phase II goals of expanding the number of SNPs in dbSNP with adequate multiple population validation from 2.6 million to 9.2 million. Phase II also includes extended sample numbers in the four study populations, a broadening of study populations, and a focus on more detailed analysis of the ENCODE regions.

Three years after initiating the project, the group reported the phase I findings in October 2005 (22). The investigators highlight the fact that HapMap is intended to concentrate internationally coordinated resources on the characterization and understanding of the variant part of the genome sequence as a natural extension of the work of HGP in establishing the invariant sequence shared by all individuals. To summarize an extensive report:

1. SNP loci have proved to be highly correlated with their immediate neighbors.
2. Analyses to date show the generality of haplotype block structure and recombination hotspots in the human genome.
3. The redundancy of proximal SNP sets should yield efficiencies in association studies from the use of a catalog of tagging SNPs and coding SNPs.
4. The SNP data generated so far offer a means to study genomic variation without recourse to wholesale resequencing.
5. The findings have increased understanding and characterization of the human genome (notably the mapping of deletion ), events in the recent past and fine-scale recombination organization.
6. dbSNP has collected the vast majority of common SNP variability in the human genome; when SNPs have not been listed they show tight correlation to loci that are in dbSNP. SNP discovery using PCR-based sequencing is biased against low frequency SNPs (i.e., those with minor allele frequencies <0.05).

Two further points emerge from the report. First, it is increasingly clear that HapMap data, founded as it is on the analysis of four populations, represents a valuable resource for population genetics analysis (23). Clear signals of natural selection have been found from the HapMap data in a number of genes that are not obvious candidates for adaptive responses in the immediate evolutionary past. Furthermore, a comparison of haplotype-based selection detection tests with classic methods that use individual loci indicates that the former approach can be more sensitive in detecting recent positive selection (notably in the analysis of G6PD and TNFSF5 genes) (24). Extension of HapMap genotyping to new population samples will bolster the dataset further and help to pinpoint which SNP loci are appropriate choices for more extensive sampling of human populations. Second, the report’s conclusions include a demand for rigor in association studies through multiple replications and enlarged sample sizes. Furthermore, the investigators emphasize the need for an unbiased approach to the reporting and interpretation of association study results, regardless of outcome. Because the common diseases are almost all complex diseases it is worth taking heed of this advice as the investigators outline that such diseases require very careful control of environmental influences including lifestyle and behavior, of adequate clinical characterization of phenotype, and of sufficient replication of studies. All of these factors are

MOLECULAR BIOTECHNOLOGY Volume 35, 2007


important to control correctly if the precision at the genetic level is to be fully exploited to understand complex disease.

Since phase I was completed, HapMap has become an essential complement to dbSNP for checking the haplotype positions of a chromosome region as defined by values measuring linkage disequilibrium between SNP pairs. In addition, access to high-quality allele frequency estimates for 1 million SNP markers represents a significant move forward for the validation status of a large proportion of human SNPs. This becomes particularly useful as a means to check one's own allele frequency estimates obtained from the subjects of a study, allowing greatly improved quality control of the user's own genotyping assays. This can be taken one step further by including CEPH -positive controls in an assay and referencing the genotypes obtained to the individually listed results in HapMap. The HapMap database contains structured access to all the genotypes generated in the form of SNP report pages, together with detailed maps of haplotype structure in the form of annotated LD plots using pairwise comparisons of SNPs in the chromosome interval using a stand-alone graphic browser (25). Finally, an equivalent approach to Entrez exists in HapMap, termed SNPmart, for filtering down the datasets to downloads of manageable size based on similar principles.

The relationship between the two principal SNP databases of HapMap and dbSNP is in the process of change and consolidation. They continue to be very closely interlinked but a number of differences should be emphasized. First, the process of dbSNP to provide haplotype map details independently of HapMap is progressing slowly and does not match the quality and depth of the other components of dbSNP. For example, in Map Viewer a haplotype map option exists (termed dbSNP haplotype), but this is only available for  and takes the form of block positions and reference numbers listed as hyperlinks to reports from Perlegen. These proceed to detail the haplotype SNP allele composition, for example, SNPs rs2822549 and rs2822550 link to a block named B002180 with three haplotypes outlined in base color-coded plots. However, this gives the impression of work on hold because HapMap has begun data release. It seems that dbSNP is unlikely to provide an adequate basis for detailed haplotype mapping in its own right until sufficient data have been obtained. This process has started with the collection of haplotype data from publications (http://www.ncbi.nlm.nih.gov/SNP/hap/dbSNP_haplotype_intro.html). Second, dbSNP is still in the process of updating the Variation section of each SNP cluster report page to encompass the vast quantity of allele frequency data generated by HapMap and other contributors. HapMap provides a full list of dbSNP loci across the whole genome as link-outs to dbSNP cluster reports for each locus, with HapMap validated loci uppermost and all others below. The genome map also makes reference to the NCBI Entrez Gene report page for each gene but a third point of difference is that occasional differences in gene characterization are evident. For example, SNP rs11779952 is placed in gene SLC39A4, a solute carrier, by HapMap and in NFKBIL2, a nuclear factor kappa B inhibitor, by dbSNP. The disparity in the affiliation of SNP and gene in this case seems to have been caused by a difference in gene position between NCBI and Celera genome builds. Although it is difficult to know how this ambiguity has occurred, this example serves to illustrate the effect of different approaches that may be taken between HapMap and NCBI to the curation of data rather than the management of data. These two roles are quite distinct, curation in this context describing the interpretive activity required by large collections of data that need characterization to be properly cataloged (much like museum contents). Although the majority of NCBI database information can be described objectively, certain data must be characterized with the available knowledge and this can introduce disparities between databases run by different organizations. As this example suggests, a principal source of curation ambiguities is gene description, more specifically termed gene annotation, that is, the characterization of a gene and


its function based on the interpretation of descriptive data. This is one step beyond management of information and requires interpretive judgments to be made from the data. For instance, is a new putative gene model recently placed in the database a real gene that has not yet been fully described or a pseudogene? Can the gene function be defined in terms of existing pathways? Is the ontology (defined as a set of terms that describe equivalence of function, role in a process, or sequence between a set of genes or proteins) a close match to a well-defined gene or different enough to suggest a lack of relatedness? Ambiguities in data curation will continue to complicate matters when researchers routinely collect and compare information from more than one database. Last, any user familiar with dbSNP will have noticed the differences in public and Celera sequence position for SNP locations in the Integrated Maps section, when both databases hold the same locus. Because the genome sequences were assembled in different ways and problems like sequence inversions can occasionally arise, it makes sense for dbSNP to list both until there is a full consensus sequence assembly. This is not principally a problem of curation but of differences in the chromosome coordinates obtained by each sequencing project.

6. Finding SNPs for Clinical Genetics: Using HapMap and SNPbrowser®

Faced with the task of locating the genetic component of a disease, researchers will either begin whole genome linkage analysis and focus on the chromosome regions showing the clearest linkage signals or will have an idea of candidates from studies performed already. The strategy that best starts the SNP analysis proper is to concentrate the initial efforts of locus selection on coding and tagging SNPs as the information content from these loci can give the strongest pointers toward the genes underlying the disease of interest. Consulting the HapMap and SNPbrowser databases can form the principal part of selecting SNPs for this part of the study and any further fine mapping analysis of candidate regions.

Applied Biosystems (AB) SNPbrowser is a stand-alone database of 5 million loci comprising a compilation of public and private (i.e., CDS based) SNPs. The marker set is essentially a data dump direct to the user's PC for use offline, with an easy and intuitive front-end in the form of an annotated map. The advantage of this database is that it shows haplotype block information in a much easier to read format than HapMap and in a window independent of a web browser. The block maps are based on Celera's own pairwise analysis of 160,000 SNPs (termed the backbone validated SNPs) so it complements, as well as clarifies, HapMap haplotype block map annotations. SNPbrowser can be downloaded from the AB website (http://marketing.appliedbiosystems.com/mk/get/snpb_landing?isource=fr_E_RD_www_allsnps_com_snpbrowser) and once launched, can be configured to suit the user's needs in terms of haplotype block annotation displayed, SNP type, population studied, and extent of region shown. The first choice to make is the haplotype block map display. The options are to use HapMap or Celera and to display all study populations or just a single one. The Celera maps are constructed from the analysis of 45 individuals each from white (American European) and African American, plus a smaller number of validated SNPs (hence less reliable map definitions) in Chinese and Japanese. These clearly mirror the HapMap study populations, but offer an opportunity to compare HapMap Africans (YRI) with African Americans and explore the effect of admixture and the potential for MALD analysis, outlined under Subheading 2. The match to a HapMap style of presentation continues with the SNP information window, an example of which is given in Fig. 5. This needs to be unlocked: "View menu," "SNP details," "Show (ctrl + D)," then click the padlock icon upper right of the pop-up window. This additional window shows the Celera allele frequencies in identical pie charts making comparisons easy and works with a mouse-over allowing the map to remain uncluttered when SNP density is high. The NCBI button is a linkout if the SNP carries an rs-number, and it soon becomes


Fig. 4. Typical fasta sequence report. The SNP detection assay only confirmed the substitution site because in this example, computational contig comparison techniques were used.

Fig. 5. SNPbrowser map view for the gene CAPG (see Fig. 6). Top four color bars show haplotype block distribution in African American, European, Chinese, and Japanese (top to bottom). The gene bar is color-coded black to gray to denote the power based on the gene variant frequency, haplotype frequencies, and block positions (upper right scale). The pop-up box lower right gives SNP details linked to a mouse-over system for the dark and light bars positioning each SNP locus on the haplotype block bars above.

evident in any gene that a large proportion of the SNPs displayed are Celera only. The idea of SNPbrowser is to provide a shopping list for loci that can be genotyped with AB's proprietary Taqman® and SNPlex® technologies, but there is no commitment in downloading and using the browser other than e-mail registration. This same approach applies to the Taqman genotyping assay database described under Subheading 9. What is available to any user is a commercially orientated database with a significant amount of private SNPs, but each of these allows important information to be obtained about the position and variability of the Celera markers ahead of their incorporation into dbSNP. The color-coded haplotype block positions from the four study populations are displayed above the gene bar as a combination or individually. Certain caveats apply here: the block edges, although graphically sharp, are tentative and any block definition used can only produce fuzzy boundaries at best. Subsets of SNPs, Taqman assay SNPs, SNPlex assay SNPs, All SNPs, and Coding SNPs only, can be selected with buttons on the lower left of the map. Finally, of most interest, over and above the private SNP data obtained, is the chance to review the power of the gene shown with a green–black scale ( in purple–pink scale to match) and to see a measure of linkage disequilibrium in LDU (i.e., linkage disequilibrium units), an arbitrary scale based on r² and D' calculations between sets of SNP pairs. Connecting lines link the SNP's true position on the normal kilobase scale to the LDU scale, therefore these merge in cases of SNPs sharing the same haplotype block. The two values of power and LDU, novel to the SNPbrowser map, are related because the power scale is a summary value based on the block distributions across the gene and is intended to provide a means of comparing different parts of a gene or different genes. The scale ranges from less than 0.5 (black) to five values between 0.5 and 1.0 (dark to pale green) and is intended to help predict the power of a block-based approach using different SNP combinations along with the effect of sample sizes and the minor disease/trait allele frequency. SNPbrowser amounts to a succinct and visually clean map browser that is an informative complement to HapMap browsing. The linkouts to dbSNP and to the AB Taqman genotyping summary pages allow quick follow-up of SNPs of interest and the SNP lists (shopping lists) can be saved or exported. Many researchers use SNPbrowser as the first snapshot of a gene and this does not necessarily mean avoiding the commercial pipeline attached, as Taqman has been a mainstay of SNP genotyping for medical genetics studies for many years. A set of poster presentations explaining the genetics in more detail can be downloaded from the AB website (http://docs.appliedbiosystems.com/pebiodocs/00112824.pdf, http://docs.appliedbiosystems.com/pebiodocs/00114486.pdf, http://docs.appliedbiosystems.com/pebiodocs/00112823.pdf).

HapMap browsing is initiated in a similar way to SNPbrowser, usually starting with a single gene name to locate a manageable segment of chromosome and view the genome landscape (Fig. 6). Begin by placing a landmark (SNP or gene) or chromosome region limits (the range in full bp numbers) in the genome browser page (Data linkout top right of the homepage, then the Generic Genome Browser linkout). Often HapMap will offer a choice of locations if the description is not a complete match to the data and the list gives full details to allow review of the user's own information. An excellent set of PowerPoint guides are listed in the tutorial page (http://www.hapmap.org/tutorials.html.en). The basics of genome browsing HapMap data are outlined by Lincoln Stein and explain the steps involved in finding SNPs in or near the gene (termed region of interest or ROI), viewing patterns of LD, selecting tag SNPs, and if required, downloading the SNP dataset generated by a search. The layout of the main HapMap view has already been detailed under Subheading 5, and the best strategy is for users to explore a region and become familiar with the acquisition of data with a level of depth that suits the particular purposes of the research stage. The other tutorial presentations from Michael Boehnke, Mark Daly, Augustine Kong, and Toshihiro Tanaka cover, in much more depth than is possible in this review, the detailed aspects of association study design. One word of caution: using the scrolling arrows to browse a large gene at 5-kb scale or less can take a very long time and it will soon be noticed that even over long stretches of sequence the pie chart distributions start to look very familiar. At this stage it is worth cross-referencing to the SNPbrowser haplotype block definitions to determine whether the block is very long itself or whether the review of SNP data can fruitfully be focused on recommended tag SNPs for the region.
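The pairwise measures that underlie the LD displays discussed above, D' and r², can be illustrated with a minimal sketch. It uses the standard textbook definitions for two biallelic SNPs and is not taken from SNPbrowser or HapMap; the function name ld_measures and its arguments are hypothetical, and the inputs are assumed to be phased haplotype and allele frequencies.

```python
def ld_measures(p_ab, p_a, p_b):
    """Return (D, D_prime, r_squared) for two biallelic SNPs.

    p_ab: frequency of the haplotype carrying allele A at SNP 1 and B at SNP 2
    p_a:  frequency of allele A at SNP 1
    p_b:  frequency of allele B at SNP 2
    """
    # Raw disequilibrium: observed haplotype frequency minus the
    # frequency expected if the two loci were independent.
    d = p_ab - p_a * p_b

    # D' rescales |D| by its maximum attainable value given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0

    # r^2 is the squared correlation between the alleles at the two loci.
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r_squared = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r_squared


if __name__ == "__main__":
    # Illustrative frequencies only, not data from any database.
    print(ld_measures(0.3, 0.3, 0.6))
```

With these illustrative frequencies D' reaches its maximum while r² stays well below 1, which is one reason the two measures are usually inspected together when judging whether one SNP can act as a tag for another.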


7. Undirected SNP Searches: EntrezSNP and Map Browsing

There are two ways to examine a group of SNPs in the absence of clear directions to a specific chromosome region: direct queries to EntrezSNP and genome map browsing. At present, there is little need to find SNPs without associated landmarks such as the position of a linkage signal or a candidate gene, but there can be particular reasons that a group of SNPs needs to be collected and studied with information other than position. An example of one such situation was the collection of SNPs for forensic analysis when it was important to find SNPs with appropriate levels of variability or substitution type (5,26). An EntrezSNP query is the best approach for obtaining a list of candidate SNPs with combinations of specific characteristics. The major drawback of EntrezSNP is that two criteria, linkage and flanking sequence quality, cannot be used as part of a query and the latter characteristic has a direct bearing on assay success in nearly all SNP genotyping methodologies. However, any SNP in dbSNP or HapMap suitable for study will have been validated by a submitting laboratory using one of the genotyping techniques. By default this tends to prevent SNPs that would be impossible to genotype from being returned from a query if validation status is used as a fixed field. The appropriate fixed fields would be "organism" [ORGN] or [TAX_ID] prefixed with human and "map weight" (the number of times a SNP maps to the genome) using the term 1[MPWT] (ensures all SNPs are unique). The "validation" term by frequency [VALIDATION] is used to ensure SNPs have been validated by repeat genotyping rather than by contig comparison and sequencing of pooled donor DNA. After this the list of search terms combined by operators can be taken from a tickbox list of limits (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?CMD=Limits&DB=snp) or applied by the user who is familiar with the syntax and specific values required. The most important search field tags are listed in Table 2. As an example "all nonsynonymous, A/T SNPs in segment 1 to 1.5 Mb of chromosome 1, true SNPs only" would comprise coding nonsynon[FUNC] AND W[ALLELE] AND 1[CHR] AND 1000000:1500000[CHRPOS] AND snp[SNP_CLASS] AND human[ORGN] AND by frequency[VALIDATION] AND 1[MPWT]. EntrezSNP returns a list of SNPs that qualify with a graphic summary for each, shown in Figure 7, giving all the information necessary for a rapid scan of a large number of loci in one go. It is possible to sort the list in an alternative way to the default sort order of descending rs-number by selecting from the drop down menu (sort) to resort the list, for example, heterozygosity or map position. Using map position will list from q-arm telomere up to p-arm telomere but note that the first loci are unplaced SNPs. Ticking each SNP allows a list to be exported to a text-holding webpage, a file, or the clipboard (Send to drop down menu).

As an alternative to EntrezSNP, map browsing offers an intuitive way to review large numbers of SNPs in one session. Exploring a chromosome segment as a map gives the best way to scrutinize the position and characteristics of nearby genome features of importance: transcripts, genes, or clustering SNPs (neighbor SNPs in NCBI). Furthermore, the features around each SNP can be scrutinized easily through a series of linkouts embedded into the map view to the dbSNP cluster report page and Gene reports or other supporting databases. Both dbSNP and HapMap have a map-based system at the core of their SNP databases. HapMap Genome Browser (click Browse Project Data on left hand column of homepage) offers, above all, comprehensive SNP allele frequency data for all 1 million SNPs given in a succinct but clearly arranged pie chart graphic together with the position of any coincidental gene locus (these linkout to NCBI Gene) plus a chromosome scale. The reference allele is blue in each pie chart and the rs-number linkouts to the HapMap version of the cluster report with all the validation data for the four populations plus details of the assay used. At the top of the main map is the summary chromosome view aligned with SNP and gene density plots. In combination with a %GC plot these are a useful supplement to the map view given by the


Table 2
Important EntrezSNP Search Field Tags

Description | Tag | Search field used | Example
Observed alleles | [ALLELE] | IUPAC allele code (Table 3) | R[ALLELE]: find SNPs with A/G substitutions
Chromosome | [CHR] | number / X, Y | 21[CHR] OR 22[CHR]: find SNPs on chromosomes 21 and 22
Base position | [BPOS] | ranged number (used with AND & [CHR]) | 18000:28000[BPOS] AND Y[CHR]: find SNPs in 10-kb section of Y-chromosome
Heterozygosity | [HET] | ranged number | 30:50[HET]: find SNPs with heterozygosity value in range 30%–50%
Function Class | [FUNC] | locus region, etc. (8 in total) | coding nonsynon[FUNC]
Build | [CBID] | number | 125[CBID]: search build 125
Gene location | [GENE] | gene symbol | CAPG[GENE]: search for SNPs in actin capping protein, -like gene
Genotyping method | [METHOD] | description as listed in page below (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=snp - METHOD) | hybridize[METHOD]: search for SNPs found by chip hybridization
Map weight | [HIT], [MPWT] | number: 1 = once, 2 = twice, 3 = 3–9 times | NOT (2[HIT] OR 3[HIT]): exclude SNPs mapping twice or more in genome (NB better to avoid using NOT 2[HIT] to include CDS SNPs)
Population | [POP] | description as listed in page below (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=snp - POPULATION) | pacific[POP]: search for SNPs genotyped in Australasian & Oceanian samples
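The worked query in the text can be assembled programmatically from value and tag pairs, as in the following minimal sketch. The helper name entrez_snp_query is hypothetical, the field tags are simply those quoted in this review (they are not checked against any current NCBI interface), and the code only builds the query string without contacting a server.

```python
def entrez_snp_query(terms):
    """Join (value, tag) pairs into one AND-combined EntrezSNP-style query string."""
    return " AND ".join(f"{value}[{tag}]" for value, tag in terms)


# Terms for the review's example: nonsynonymous A/T SNPs between 1 and 1.5 Mb
# on chromosome 1, true SNPs only, human, validated by frequency, mapping once.
example_terms = [
    ("coding nonsynon", "FUNC"),
    ("W", "ALLELE"),                # IUPAC code W = A or T
    ("1", "CHR"),
    ("1000000:1500000", "CHRPOS"),  # ranged value written as start:end
    ("snp", "SNP_CLASS"),
    ("human", "ORGN"),
    ("by frequency", "VALIDATION"),
    ("1", "MPWT"),                  # map weight 1: SNP maps once in the genome
]

query = entrez_snp_query(example_terms)

if __name__ == "__main__":
    print(query)
```

Keeping the terms as a list makes it straightforward to swap one fixed field (for example the chromosome or the position range) while leaving the organism, validation, and map weight terms unchanged.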

NCBI browser. The map elements, termed tracks, can be configured in a variety of ways with an emphasis, as would be expected, on SNP distribution and alignment with linkage disequilibrium measurements. Of most interest to users of HapMap comparing data to NCBI, is the process of annotating the default map view with LD and haplotype block information. Three LD measures can be plotted: D', r², and LOD, and haplotype block structures can be viewed as phased haplotypes (i.e., alleles are assigned to the most likely shared chromosome strand to denote the haplotype). These require a more detailed outline than is possible in this review but the extensive help pages contain the descriptions and relevant publications. To obtain this additional map annotation download the plug-in (a java applet) and go to the reports and analysis drop down menu, right uppermost. Choose "Annotate LD Plot," click "configure" (with a variety of arrangements possible), then "configure" again, then "go". The same with "Annotate Phased Haplotype Display" from the same menu where configure just gives options to choose populations. The LD plots comprise red and gray scale pairwise block patterns. The plot gives marker-to-marker LD values, where the genotyped SNP are denoted as ticks and the marker pairwise information is plotted as boxes


Fig. 6. Typical HapMap genome browser window for the gene CAPG (see also Fig. 5), with 15 SNPs genotyped by HapMap, shown in approximate position as pie-charts and below these, 25 SNPs shown as triangles (exact position) denoting loci not genotyped, but present in dbSNP. Each set of rs-numbers linkout to HapMap SNP reports and dbSNP cluster reports, respectively. The gene structure of CAPG is outlined as a line diagram at the base.

between these ticks. Phased haplotypes are given as blue and yellow blocks and unless the diversity is high it is the clearest way to view haplotype block boundaries in any current database.

Fig. 7. Example locus return from EntrezSNP for rs2075745 showing annotation. The rs-number linkouts to the RefSNP cluster report, this human SNP occurs on , maps once (underline), is sited in a locus (L), exhibits 46% heterozygosity (scale), and has been validated by genotyping (V).

Map Viewer is the NCBI map browsing tool allowing the simultaneous search and display of all the NCBI genomic information by chromosomal position. The interface provides a graphical overview of several databases in combination with user-controlled map arrangements. The map combinations and elements are arranged by clicking on the maps and options button mid-left and allows any of 48 different maps to be aligned in


Fig. 8. Symbols used to annotate the NCBI Map Viewer variation map.

the same segment view. The master map contains the linkouts to the matching database with SNPs placed in a map termed Variation in the sequence maps list. Densely packed SNPs are merged together and listed as "6 variations," etc, but only if the map scale is not fine enough to allow adequate spacing of the scale used and for these grouped SNPs, the linkouts are lost. Often it is better to start with a whole  view before concentrating on one area and this is achieved by using the genome view (http://www.ncbi.nlm.nih.gov/mapview/map_search.cgi?taxid=9606&query=). A landmark will be positioned on the relevant chromosome with a small red line; in most cases SNP queries will result in two marks for multiple positions due to the CDS coordinates being different. Multiple landmarks can be displayed together in a single view by using the OR operator in the search window. The individual whole chromosome view is obtained by clicking on the red number under the chromosome in the genome map. It soon becomes clear that dbSNP offers a much broader range of genome landmarks in the displayed region compared to HapMap and usefully all SNPs in the database have summary details summarized as icons placed against the map position and rs linkout shown in Figure 8. The smallest scale definable using the zoom scale box is 100,000th of total chromosome length (equivalent to 10 kb of chromosome 1). Table 3 details the color-coding system used by Map Viewer to signify the annotation status of gene loci placed on the Genes map. NCBI labels all likely genes as "predicted gene models" until confirmed to have a role or be involved in a specific pathway.

8. Ensembl (http://www.ensembl.org/index.html)

Readers who routinely use Ensembl as their database of choice for genome browsing may be wondering why it has not been covered in this review up until now. Ensembl is one of the principal genome data repositories with extensive databases and search tools stemming from the integral role of The Sanger Centre and EMBL in HGP and the position of Ensembl in providing open-source software frameworks for data access and storage. Ensembl continues to provide the most up to date releases of annotated genome data and the widest range of species with genome analysis available. In the area of human SNP data and supporting information the content very closely mirrors that found in dbSNP and therefore there is a choice of approaches, NCBI or Ensembl, and each can be used to access largely the same data, using comparable frameworks for searching and cross-referencing information for a study. There is little to choose between the two except a broader range of data in NCBI and closer integration to bibliographic data. For this reason


Table 3
Color-Coding System Used for Annotation of the Genes Map

Color | Gene model evidence used
Blue | Confirmed gene model based on alignment of mRNA or mRNA/ESTs
Green | EST evidence only
Brown | Predicted gene model (using Gnomon program) plus EST alignment evidence
Tan | Predicted gene model only
Orange | Conflicting evidence: discrepancy between mRNA sequence and model

EST, expressed sequence tag.

the researcher interested in just human SNP data can easily use the most familiar search environment and obtain data of comparable depth and scope from each database collection with visits to PubMed and OMIM when required. The features in Ensembl that can offer more detail and broader coverage are outlined.

Because Ensembl is both a collection of genome data and a software system that can be adapted for its organization it is important to highlight aspects of Ensembl content that complement the NCBI databases. First, Ensembl has had a pivotal role in the complex task of gene curation and has pioneered automated gene annotation techniques. Most genome content is now generated in such quantity that automated curation has become a necessity. In contrast, the primary importance of the human and mouse gene annotation process has meant that this has been completed manually using expert scrutiny, thereby enhancing gene recognition algorithms in the process. The VEGA (Vertebrate Genome Annotation) database homepage lists linkouts to human, mouse, dog, and zebrafish genome browsers (26). VEGA's mission statement is to provide high quality, frequently updated, manual annotation of vertebrate finished genome sequence (27). Without doubt this process will extend to chimpanzee, pufferfish, and agricultural species as the annotation process becomes more streamlined and benefits from established knowledge of gene character in vertebrates. Users interested in obtaining the best data relating to human gene architecture are encouraged to visit the linkouts from the VEGA human homepage (http://vega.sanger.ac.uk/Homo_sapiens/index.html) explaining involvement of VEGA in the CCDS and MHC Haplotype Projects and the Wellcome HAVANA group's involvement in the analysis of ENCODE regions and the CORF projects (respectively, http://vega.sanger.ac.uk/info/data/ccds.html, http://vega.sanger.ac.uk/info/data/Homo_sapiens.html, http://vega.sanger.ac.uk/info/data/encode.html, and http://vega.sanger.ac.uk/info/data/corf.html).

Second, Ensembl has close ties, through EMBL-EBI, to the high-quality protein sequence database of swissprot/uniprot (http://www.ebi.ac.uk/swissprot/). This comprises manually annotated protein sequences with content that has been closely integrated with the gene annotation pipeline, therefore the two data sets, gene and protein, have considerable synergy within Ensembl. This makes swissprot/uniprot a better choice for detailed protein analysis than NCBI Protein, although, as with SNP and gene data, the content is shared and cross-referenced to a large degree.

Third, Ensembl provides a gene-oriented search system in MartView (http://www.ensembl.org/Multi/martview), a system that, like HapMap, uses the biomart engine. Initiating a search of the human assembly with VEGA genes as the start point allows the interrogation of more than 22,100 genes with a full range of filters. This is potentially the best way to collate a set of candidate genes on the basis of function and role in processes, such as a pathways approach to studying the relationship of gene . OMIM may suggest candidates that can be extended to include all genes that share similar properties in terms of function. This can help to ensure that a search in


the initial phases is not too restricted in focus. Fourth, Ensembl allows access to an HGP sequencing trace repository at Trace Server (http://trace.ensembl.org/). Although this lists single-pass data, making it unrealistic to use this resource for SNP analysis, the breadth of data (1 million traces from 735 species) makes this a valuable resource for the analysis of sequence outside of the mainstream study species.

9. Online Resources for Population Genetic Studies Using SNPs

Although the focus of clinical genetics database searching centers on the location of SNPs with reference to genes and haplotype structure, population genetic studies place more emphasis on the distribution of allele and haplotype frequencies. This can still involve the scrutiny of SNP position in the gene landscape as the effects of strong positive selection can leave a characteristic signature in the distribution of SNP variability around the selected gene. The discovery of haplotype variability reduction from selective sweeps, as this effect is described, has led to some exciting recent studies of genes hitherto not suspected to be the subject of selective pressure (28,29). The need for detailed and reliable frequency data in this field means HapMap and dbSNP hold centre stage as the two largest allele frequency datasets available. However, searching the data with the aim of exploring frequency distribution differences between populations is not easy in either database. Furthermore, frequency searches are limited in both by the 0.5 minor allele frequency ceiling. This is the limit to frequency-filtered searches that sets a frequency range for the minor allele between 0 and 0.5 for each class (in this case population). Although this is, of course, logical as the minor frequency cannot be more than 0.5, it prevents searches of SNPs where one population shows a minor allele frequency of, say, 0.2 for allele C and this is present at frequency 0.6 in another population, properly showing a minor frequency of 0.4 but for the other allele. This is not a major drawback as workarounds exist, but it prevents an easy search for loci showing the biggest frequency contrasts and these have proved to be among the most interesting SNPs. The problem of a maximum value of 0.5 for each allele applies to all the other databases and their frequency search systems.

A popular program for processing multiple databases with frequency filtered searches is Frequency Finder (https://mapgenetics.nimh.nih.gov/frequencyfinder/index.jsp) described as a frequency data acquisition tool for mining multiple public databases (30). This acts as a web portal or a stand-alone program, although both require a data upload in the form of a SNP list as a line delimited text file or alternatively as SNP identifiers or chromosome coordinates placed in a query box. Frequency Finder returns a table of rs-numbers, major and minor allele frequencies, and the data source. Although uploading a file of rs-numbers is a slightly clumsy way of initiating a search these days, the system is comprehensive in scope as it will locate and list frequency data that does not overlap between the database sources used (TSC, dbSNP, Celera, ALFRED, and HGVBase; discussed later). As dbSNP broadens the extent of its data collection from other sources, such as Celera and HapMap, this has become less important than it was previously. For example, the original report of the system in 2004 detected from a whole genome query (246,097 SNPs with data) that 5% of SNPs were unique to TSC, and 16% unique to AOD. The HapMap frequency data accessed was confined in early 2006 (v2.1) to European data only.

The proportion of SNPs described previously as unique to Celera are held in a publicly accessible subset of the CDS database known as Assays-on-Demand (AOD) or Taqman SNP Genotyping Assays. When Celera and Applied Biosystems merged their interests to become Applera, a focus of much development was the combination of Taqman real-time PCR assay technology and the extensive SNP data in CDS, leading to a system described by the company as knowledge-based genotyping. With this system the frequency data for a SNP was as important as the availability of off-the-shelf Taqman assays to genotype the locus. The map browser used to access this data, SNPbrowser, has already been described; AOD is the equivalent of


Entrez, a search system front end that permits searches based on frequency (among other criteria). The importance of AOD compared to the vast array of SNPs that were available privately as CDS was the quality of validation. Celera used a polymorphism discovery resource (PDR) of six individuals to indicate whether a SNP detected by the private genome assembly was a valid SNP or not. This is an insufficient sample to provide reliable allele frequency estimates and means the CDS SNP details listed in AOD lack precision. In contrast, AOD data is based on pilot Taqman analysis of 45 individuals each from four populations as detailed under Subheading 6 for SNPbrowser. The AOD dataset has since been used for many important population genetics SNP studies that have extended the analysis to a wider range of populations by using the available Taqman assays on a broader base of well-defined populations. These studies provide valuable insights into population variability with the aim of improving the selection of subjects for MALD analysis and to gain a better understanding of human population dynamics. Similarly, I used AOD to begin the process of collecting loci for forensic SNP assays that can suggest a geographic origin for a sample of unknown donor (31).

The AOD search page (https://products.appliedbiosystems.com/ab/en/US/adirect/ab?cmd=ABGTKeywordSearch&catID=600769) allows access to HapMap, JSNP, and DME (drug metabolizing enzyme) data, in addition to AOD, using keyword searches with gene symbol, gene name, public accession number, biological process, or molecular function. With or without keywords, limits can then be set for intergenic, intragenic, or genic location plus SNP type and effect if genic, followed by the frequency filters for the four AOD or HapMap populations. The query returns a page listing the SNPs with linkouts to dbSNP and Gene plus two decimal place frequency estimates and chromosome location. If the user has an “hCV number” used to catalog Celera SNP data, then it is possible to gain information about the locus if it has been validated for AOD, and usefully to obtain the equivalent rs-number if one exists. One advantage that will become immediately obvious to the researcher is that this provides the easiest allele frequency search system for the 1 million HapMap SNPs despite the previously stated caveat that 0.5 is the frequency limit per allele. The simplicity of this approach for searching HapMap allele frequency data is that it gives a list that can be cross-checked with AOD estimates if applicable (both are listed even if one is searched) and ready access then to dbSNP and Gene. One additional point of reference if the SNP has been validated by HapMap and AOD: the African allele frequency estimates in AOD are based on an African-American sample, therefore by comparison with HapMap Africans it is possible to gain an insight into admixture levels in the AOD study population. Before HapMap these were the most reliable SNP allele frequency estimates available anywhere and were accurate enough to allow a detailed study of allele frequency-derived haplotype block definitions based solely on AOD SNPs showing tandem arrays of identical minor allele frequencies (32). The ability to search on frequency alone continues to make this database a powerful population genetics tool despite the recent incorporation of linkouts to AOD in the revamped dbSNP cluster reports.

Finally, as mentioned, recent interest in selection in the very recent past and its role in reducing haplotype diversity in the vicinity of the selected gene has led to a useful web tool, Haplotter (http://hg-wen.uchicago.edu/selection/haplotter.htm) to detect and study this effect. Based on an in-depth analysis of the HapMap phase I data release (29) it is likely to promote potentially interesting further study of genes that previously may not have been considered obvious candidates for positive selection. The tool scans HapMap data for signatures of low haplotype diversity and unusually long haplotype blocks that result from the rapid increase in frequency of the selected gene variation and bordering chromosome regions. The signatures are defined by the equivocal measure of contrast between the haplotypes and the surrounding genome landscape because the ancestral allele (positive contrasts as the allele increases in


frequency) can be the subject of selection as well as the variant allele (negative contrasts). Haplotter can work from gene identifiers or a single SNP landmark (slower and more varied in coverage). The program returns plots of iHS, the measurement of contrast to surroundings plus the selection signature or population diversity measures H, D, and FST followed by a table of adjacent genes that are colored light blue when they show significant evidence of selection effects. The major advantage of this tool is that it allows an unbiased approach to finding regions with indications of recent selection pressure, so in use it is likely to reveal interesting and surprising candidates for more detailed study. In addition, it could focus studies on the phenotypes such loci exhibit with consequences for our understanding of the differences in susceptibility to disease between population groups. The disadvantage is that it appears to lack sensitivity to gene variation and associated SNPs that have reached, or are very close to, fixation (i.e., where a different allele is fixed, at a frequency close to 1, in different populations). Examples of genes, where coding SNPs are close to fixation, that fail to yield a detectable signal with Haplotter are FY (conferring resistance to malarial infection in African populations) and MATP (part of a depigmentation pathway in European populations). In contrast, LCT (creating hypolactasia in Europeans) showing balanced heterozygosity levels reveals one of the strongest selection signals of all, although this may also relate to how recently the selective sweeps have occurred at these loci.

10. Other SNP Databases and Resources

10.1. The SNP Consortium
(http://snp.cshl.org/)

The SNP Consortium (TSC) is run by the Cold Spring Harbor Laboratory on behalf of a private/public partnership of 17 organizations. This database comprises 1.8 million loci, all of which are listed in dbSNP. Both databases are fully cross-referenced with linkouts, but TSC uses a different SNP locus identification system. TSC SNPs have been chosen for study specifically because of their proximity to genes, as a principal goal of the consortium was to construct the first high density SNP linkage map: the Allele Frequency Project. This project created the most significant feature of the TSC database for SNP research: detailed genotype frequency data for 55,000 loci from European (termed Caucasian), African, and Chinese population samples. This resource formed the core SNP validation data available to the public along with AOD data, before the initiation of the HapMap project. The findings of the Allele Frequency Project provide a detailed analysis of the nature of SNP variability in the genome and differences between the study populations (35). One word of warning is needed about the interpretation of allele frequencies generated by pooled DNA techniques, which account for a proportion of the loci detailed in TSC. This technique is generally inaccurate and almost wholly so when the minor allele frequency is below 10%. An example is rs994174 with minor allele frequency estimates of 0, 1, and 0 for European, Asian, and African samples, respectively, using pooled DNA, whereas the same populations give estimates from repeat genotyping of 0.67, 0.58, and 0.24 (CEU, CHB, and YRI in HapMap).

10.2. HGVbase (formerly HGBase)
(http://hgvbase.cgb.ki.se/)

The Human Genome Variability Database comprises nearly 9 million entries concentrated on human genome variants including SNPs, Indels, and STRs. HGVbase uses its own system of locus identification: a nine-digit number prefixed with SNP (if applicable to the locus). The database is in the process of adaptation to give much greater emphasis on phenotype/genotype collation so it will lose a large part of its focus on the cataloging of SNPs but gain increased importance as a means to link SNP variability to phenotype. In the website’s own description: “sequence variations are presented with details of how they are physically and functionally related to the closest neighboring gene.” This will make HGVbase an essential complement to NCBI OMIM and Gene for the analysis of SNP variation and its effect on the expression of traits. The citation list returned


from a query can provide a streamlined approach to starting a literature search of studies of gene variation resulting from coding SNPs. In addition, the strength of HGVbase has been in the emphasis on listing low frequency variants and new mutations that are on the periphery of mainstream SNP content elsewhere. Last, I thoroughly recommend the indispensable list of linkouts maintained here to no less than 46 other online SNP-related databases and resource sites (http://hgvbase.cgb.ki.se/cgi-bin/main.pl?page=databases_.htm). In large part the list covers much of this section of the review as a source list of specialized databases, each worth exploring but dependent on the particular aspect of SNP variability of interest to the user.

10.3. PolyPhen
(http://tux.embl-heidelberg.de/ramensky/)

PolyPhen is a tool that provides Polymorphism Phenotyping: predicting the possible impact of an amino acid sequence change on the properties of a protein. It will not work with SNP data input directly but holds a nonsynonymous SNP database comprising 50,919 SNPs taken from dbSNP build 121 (34). The predictions on the effect of these SNPs make interesting reading: 9502 unknown, 27,991 benign, 7905 possibly damaging, and 5525 probably damaging, therefore 32.4% of SNPs with a known effect appeared to be detrimental to the protein. This data subset is the easiest way to use this tool and rs-numbers can be input as queries directly for comparison against the nonsynonymous SNP collection (http://genetics.bwh.harvard.edu/pph/data/index.html).

10.4. ALFRED
(http://alfred.med.yale.edu/alfred/index.asp)

The ALlele FREquency Database is an extensive collection of frequency reports for polymorphic markers comprising 1501 loci, 475 populations, and 41,980 frequency tables. This forms a useful population analysis tool, in particular the map function, which provides an intuitive search interface based on geographic region. Unfortunately the SNP data held is currently patchy (only 841 rs-numbers), but this situation is certain to change. The rs-numbers that have been collated in ALFRED are matched to loci (either other polymorphic markers or genes) in “Summaries” then “Sites with dbSNP rs #.”

11. Web-Based SNP Assay Design Tools

11.1. NCBI BLAST
(http://www.ncbi.nlm.nih.gov/blast/)

BLAST is a tool for calculating sequence similarity that accesses the NCBI GenBank databases (35). Typically the SNP assay design process will query Nucleotide BLAST in two ways.

1. Finding a location for a submitted sequence, effectively the query being: does the submitted sequence exist in a GenBank database?
2. Checking for coincidental similarity in a sequence, normally a PCR primer, the query being: what is the degree of specificity of the submitted sequence?

There are three BLAST programs available for sequence comparison; the BLAST guide (http://www.ncbi.nlm.nih.gov/BLAST/producttable.shtml) can be consulted for the correct program choice. However, the alignment comparisons required for each of these queries are provided by MegaBLAST and standard BLAST (blastn), respectively. MegaBLAST is designed for long sequences and for a certain degree of mismatch, whereas blastn is designed to give a list of sequences in order of similarity. A third option, “Search for short and near exact matches” is recommended for sequence specificity checks with less than 20 bases. A BLAST query returns a three-part report: (1) a header with query sequence information plus summarizing graphic overview, (2) a set of single-line matching sequence descriptions, and (3) the matching alignments themselves. There are two statistics that annotate the returns from blastn: the bit score and E-value. The graphic overview shows the query sequence as a numbered red bar and below this the database hits as colored bars aligned to the query. The colors and proximity to the query represent the alignment scores from red (highest) through to black (lowest) and uppermost to lowest bars. The single-line descriptions give both a bit score indicating the goodness of fit of


each matched sequence and an expect value (E-value). The bit score is calculated from a formula that takes into account all matches and gaps; the higher the score, the better the alignment (http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html). The E-value summarizes the statistical significance of the alignment, reflecting both the size of the database used to prepare the alignments and the score system used; the lower the E-value the more significant the hit. For example, a value of 0.05 equates to 5 in 100 or 1 in 20, approximating the probability of a match this good arising by chance alone. Overall, the routine use of BLAST to check primer sequence specificity should be a procedure familiar to all genetics researchers. Ensembl has a BLAST site (http://www.sanger.ac.uk/cgi-bin/blast/submitblast/hgp) and sequence alignment tool, SSAHA (http://www.sanger.ac.uk/Software/analysis/SSAHA/) that provides an alternative to NCBI.

11.2. RepeatMasker
(http://repeatmasker.genome.washington.edu/cgi-bin/RepeatMasker)

RepeatMasker is a tool for screening submitted DNA sequences against a broadly based library of repetitive elements (36,37). A masked query sequence is returned that can be used for database searches plus a table annotating the regions of repetitive, low-complexity DNA. Users can expect about 50% of human sequence to be masked with this program. This system is used by NCBI for classifying the flanking sequence of dbSNP entries and shown in the fasta section of a cluster report. RepeatMasker tends to be quite aggressive in its annotation of sequence, therefore when designing primers for SNP genotyping assays, it can be safer to include masked sequence and then check the resulting designs for specificity in BLAST.

11.3. Primer3
(http://www-genome.wi.mit.edu/cgi-bin/primer/primer3_www.cgi)

Primer3 is a well-established and popular primer design program and sequence analysis tool (38). The flanking sequence for a SNP is submitted and various PCR parameters such as amplicon size range and optimum Tm can be prescribed by the user before a list of suggested primer sequences is returned in order of optimum predicted performance in PCR. Despite this simple interface, an array of presets exists for the PCR conditions, along with the possibility of annotating the submitted sequence to direct the primer design process in useful ways. Primer3 provides particularly versatile secondary structure detection subroutines that can screen primer designs for such structures. This is an essential step as these can reduce the efficiency of PCR or even prevent obtaining SNP genotypes from an assay, particularly in large multiplex designs.

11.4. Santa Cruz In Silico PCR
(http://genome.ucsc.edu/cgi-bin/hgPcr)

In-silico PCR is a beautifully simple idea run by the Santa Cruz genome site for checking the specificity of the primer designs developed for the capture PCR in a SNP genotyping assay. When the forward and reverse primer sequences are inserted into the query boxes these are compared to the human sequence assembly (or 27 other species as options) and the sequence interval between the primer pair is returned to confirm that the correct segment is targeted. Each primer sequence is listed in fasta format as capitalized bases and the interval in lower case. The simplicity and ease of use of this web tool makes it a worthwhile alternative to waiting in the BLAST queue.

12. Concluding Remarks

This review is intended to provide some initial directions in which to point the mouse to ensure that a research project using SNP analysis is properly designed and framed. It is important to fully review as much of the relevant genetic data as possible and to consolidate the research aims on the basis of information gathered. Luckily, this has never been easier and the data never so extensive and detailed in content. However, it is appropriate to conclude with a last cautionary note. At any one time in the laboratory where I work, as many as four or five scientists of a team of 20 will be reviewing informational web pages from


PubMed, dbSNP, HapMap, or Ensembl. Computer-based research can often seem like the major part of the work, but users should resist the temptation to replace benchwork with a disproportionate amount of time following up the work of others or collating information about SNPs without investigating these loci for themselves. An often used dictum these days is “dry work should not be a substitute for wet work.” Although usually describing the tendency to place undue emphasis on sequence analysis compared to investigations of cellular processes in live material, the phrase could equally well apply to excessive time spent searching online databases at the expense of generating original data for oneself in pursuit of the research aims. Only in the field of population genetics is there now the real possibility to perform primary research based solely on online data. The SNP genotype information generated by the HapMap phase I data release has opened up an interesting phase of SNP research as the exciting work of Voight et al. (29) has shown by finding unforeseen signatures of recent selection in human populations. This is an uncommon instance: online investigation is still just the start of the work in all other cases of genetic research.

References

1. International Human Genome Sequencing Consortium (2001) Initial sequencing and analysis of the human genome. Nature 409, 860–921.
2. Sachidanandam, R., et al. (2001) A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature 409, 928–933.
3. Read, A. and Strachan, T. (2003) Human Molecular Genetics 3. Garland Science.
4. Jobling, M. A., Hurles, M. E., and Tyler-Smith, C. (2003) Human Evolutionary Genetics. Garland Science.
5. Phillips, C., Lareu, M., et al. (2004) Selecting SNPs for forensic applications, in Progress in Forensic Genetics 10 (Doutremepuich, C. and Morling, N., eds.). Elsevier, Amsterdam.
6. Phillips, C. (2005) Using online databases for developing SNP markers of forensic interest. Methods Mol. Biol. 297, 83–105.
7. Sobrino, B., Brion, M., and Carracedo, A. (2005) SNPs in forensic genetics: a review of SNP typing methodologies. Forensic Sci. Int. 154, 181–194.
8. Nachman, M. W. and Crowell, S. L. (2000) Estimate of the mutation rate per nucleotide in humans. Genetics 156, 297–304.
9. Phillips, C., Lareu, M., et al. (2004) Non binary single nucleotide polymorphism markers, in Progress in Forensic Genetics 10 (Doutremepuich, C. and Morling, N., eds.). Elsevier, Amsterdam.
10. Dawson, E., et al. (2002) A first generation linkage disequilibrium map of human chromosome 22. Nature 418, 544–548.
11. Patil, N., et al. (2001) Blocks of limited haplotype diversity revealed by high resolution scanning of human chromosome 21. Science 294, 1669–1670.
12. Gabriel, S. B., et al. (2002) The structure of haplotype blocks in the human genome. Science 296, 2225–2229.
13. Phillips, M. S., et al. (2003) Chromosome-wide distribution of haplotype blocks and the role of recombination hot spots. Nat. Genet. 33, 382–387.
14. Daly, M., Rioux, J. D., Schaffer, D. F., Hudson, T. J., and Lander, E. S. (2001) High resolution haplotype structure in the human genome. Nat. Genet. 29, 229–232.
15. De La Vega, F. M., et al. (2003) Selection of single nucleotide polymorphisms for a whole-genome linkage disequilibrium mapping set. CSH Genome Sequencing & Biology Meeting, Cold Spring Harbor, NY.
16. Wall, J. D. and Pritchard, J. K. (2003) Haplotype blocks and linkage disequilibrium in the human genome. Nat. Rev. Genet. 4, 587–597.
17. Patterson, N., Hattangadi, N., Lane, B., et al. (2004) Methods for high-density admixture mapping of disease genes. Am. J. Hum. Genet. 74, 979–1000.
18. Reed, T. E. (1969) Caucasian genes in American Negroes. Science 165, 762–768.
19. Hinds, D. A., Stokowski, R. P., Patil, N., et al. (2004) Matching strategies for genetic association studies in structured populations. Am. J. Hum. Genet. 74, 317–325.
20. Campbell, C. D., Ogburn, E. L., Lunetta, K. L., et al. (2005) Demonstrating stratification in a European American population. Nat. Genet. 37, 868–872.
21. Yoshiura, K., et al. (2006) A SNP in the ABCC11 gene is the determinant of human earwax type. Nat. Genet. 38, 324–330.
22. Altshuler, D., Brooks, L. D., Chakravarti, A., Collins, F. S., Daly, M. J., Donnelly, P.; International HapMap Consortium. (2005) A haplotype map of the human genome. Nature 437, 1299–1320.
23. McVean, G., Spencer, C. C., and Chaix, R. (2005) Perspectives on human genetic variation from the HapMap Project. PLoS Genet. 1, e54.
24. Sabeti, P. C., et al. (2002) Detecting recent positive selection in the human genome from haplotype structure. Nature 419, 832–837.


25. Thorisson, G. A., Smith, A. V., Krishnan, L., and Stein, L. D. (2005) The International HapMap Project Web site. Genome Res. 15, 1592–1593.
26. Loveland, J. (2005) VEGA, the genome browser with a difference. Brief Bioinform. 6, 189–193.
27. Ashurst, J. L., Chen, C. K., Gilbert, J. G., et al. (2005) The Vertebrate Genome Annotation (Vega) database. Nucleic Acids Res. 1, D459–465.
28. Lao, O., Duijn, K., Kersbergen, P., Knijff, P., and Kayser, M. (2006) Proportioning whole-genome single-nucleotide-polymorphism diversity for the identification of geographic population structure and genetic ancestry. Am. J. Hum. Genet. 78, 680–690.
29. Voight, B. F., Kudaravalli, S., Wen, X., and Pritchard, J. K. (2006) A map of recent positive selection in the human genome. PLoS Biol. 4, e72.
30. Nguyen, T. H., Liu, C., Gershon, E. S., and McMahon, F. J. (2004) Frequency Finder: a multi-source web application for collection of public allele frequencies of SNP markers. Bioinformatics 20, 439–444.
31. Phillips, C., Lareu, M., et al. (2004) Population-specific single nucleotide polymorphism, in Progress in Forensic Genetics 10 (Doutremepuich, C. and Morling, N., eds.). Elsevier, Amsterdam.
32. Costas, J., Salas, A., Phillips, C., and Carracedo, A. (2005) Human genome-wide screen of haplotype-like blocks of reduced diversity. Gene 11, 219–225.
33. Miller, R. D., et al. (2005) High-density single-nucleotide polymorphism maps of the human genome. Genomics 86, 117–126.
34. Ramensky, V., Bork, P., and Sunyaev, S. (2002) Human non-synonymous SNPs: server and survey. Nucleic Acids Res. 30, 3894–3900.
35. Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J. (1990) Basic local alignment search tool. J. Mol. Biol. 215, 403–410.
36. Jurka, J., Klonowski, P., Dagman, V., and Pelton, P. (1996) CENSOR: a program for identification and elimination of repetitive elements from DNA sequences. Comput. Chem. 20, 119–121.
37. Bedell, J. A., Korf, I., and Gish, W. (2000) MaskerAid: a performance enhancement to RepeatMasker. Bioinformatics 16, 1040–1041.
38. Rozen, S. and Skaletsky, H. J. (2000) Primer3 on the WWW for general users and for biologist programmers. Methods Mol. Biol. 132, 365–386.

MOLECULAR BIOTECHNOLOGY Volume 35, 2007
