A Comparison of Cataloged Variation Between International Hapmap

Research and applications

A comparison of cataloged variation between International HapMap Consortium and 1000 Genomes Project data

Carrie C Buchanan,1,2 Eric S Torstenson,1 William S Bush,1 Marylyn D Ritchie2

1 Center for Human Genetics Research, Vanderbilt University, Nashville, Tennessee, USA
2 Department of Biochemistry and Molecular Biology, Pennsylvania State University, University Park, Pennsylvania, USA

Correspondence to Dr Marylyn D Ritchie, Pennsylvania State University, Department of Biochemistry and Molecular Biology, 512 Wartik, University Park, PA 16802, USA; [email protected]

Received 21 October 2011
Accepted 27 December 2011

This paper is freely available online under the BMJ Journals unlocked scheme, see http://jamia.bmj.com/site/about/unlocked.xhtml

J Am Med Inform Assoc 2012;19:289-294. doi:10.1136/amiajnl-2011-000652

ABSTRACT

Background Since publication of the human genome in 2003, geneticists have been interested in risk variant associations to resolve the etiology of traits and complex diseases. The International HapMap Consortium undertook an effort to catalog all common variation across the genome (variants with a minor allele frequency (MAF) of at least 5% in one or more ethnic groups). HapMap along with advances in genotyping technology led to genome-wide association studies which have identified common variants associated with many traits and diseases. In 2008 the 1000 Genomes Project aimed to sequence 2500 individuals and identify rare variants and 99% of variants with a MAF of <1%.

Methods To determine whether the 1000 Genomes Project includes all the variants in HapMap, we examined the overlap between single nucleotide polymorphisms (SNPs) genotyped in the two resources using merged phase II/III HapMap data and low coverage pilot data from 1000 Genomes.

Results Comparison of the two data sets showed that approximately 72% of HapMap SNPs were also found in 1000 Genomes Project pilot data. After filtering out HapMap variants with a MAF of <5% (separately for each population), 99% of HapMap SNPs were found in 1000 Genomes data.

Conclusions Not all variants cataloged in HapMap are also cataloged in 1000 Genomes. This could affect decisions about which resource to use for SNP queries, rare variant validation, or imputation. Both the HapMap and 1000 Genomes Project databases are useful resources for human genetics, but it is important to understand the assumptions made and filtering strategies employed by these projects.

INTRODUCTION

The field of human genetics has rapidly developed in the past few decades. The desire for precise genomic mapping has encouraged the development of association studies, from genome-wide linkage studies to both low and high throughput single nucleotide polymorphism (SNP) genotyping and, most recently, high throughput DNA sequencing. At each stage of progression, the researcher has been better able to narrow disease susceptibility genetic regions and/or identify causal variants associated with disease. The first genetic linkage map was published in 1987 and based on restriction fragment length polymorphisms (RFLPs).1 RFLPs are DNA polymorphisms that disrupt (by either creation or destruction) restriction endonuclease recognition sequences. In this first map, only 393 bi-allelic RFLPs were used. Second generation linkage maps were based on microsatellites, which are short tandem repeated DNA sequences present throughout the genome. The first published study using microsatellites included 814 polymorphic markers.2 The third generation linkage maps were created using SNPs. These high density maps were developed by the International HapMap Consortium (haplotype mapping). HapMap aimed to compare genetic sequences of different individuals and identify chromosomal regions where genetic variants were shared. These variations quickly became the core around which genome-wide association studies (GWAS) were built. Researchers believed that these variations among individuals could explain the heritability of common disease. After several years of moderately successful GWAS, a group of researchers decided that a more in-depth look at variation, including rare variation, was necessary to explain additional disease heritability. The 1000 Genomes Project aimed to sequence 2500 individuals and gather information on variants down to 1% allele frequency with the goal of providing a more extensive catalog of variation to the scientific community.

In addition to cataloging human variation, both databases serve many other purposes. For example, GWAS were possible because of the linkage disequilibrium information calculated from the SNPs in HapMap. Published sequencing studies are often filtered by variants in 1000 Genomes to reduce the number of variants used in association tests, since the individuals in 1000 Genomes are presumably healthy controls and thus variants detected in these data are unlikely of importance for disease. Both of these resources have enhanced the study design and analysis pipelines for common and rare variant association studies.

The HapMap project was launched in October 2002 on the heels of the completion of the human genome sequence. The project was designed to build a database of common sequence variation, to determine allele frequencies, and to empirically determine the linkage disequilibrium relationships across the genome. To date, there are three phases of HapMap. The details are listed in table 1.3 4

Table 1 HapMap details

Phase      No. of SNPs genotyped  Targeted SNPs                                                 Populations studied
Phase I    1 million              Prioritized coding SNPs to attain 1 SNP for each 5 kb region  CEU, YRI, CHB, JPT
Phase II   3 million              Prioritized non-synonymous SNPs in coding regions             CEU, YRI, CHB, JPT
Phase III  1.4 million            Prioritized rare variants                                     CEU, YRI, CHB, JPT, ASW, CHB, GIH, LWK, MXL, MKK, TSI

SNP, single nucleotide polymorphism.

In 2008, the HapMap project catalog contained 3.5 million commonly occurring genetic variants across several populations. The allele frequencies and correlation patterns were critical for the development and success of GWAS. However, to expand the investigation of causal variants to include rare variation, more research was required. Using sequencing technology, researchers are able to identify novel or rare variants. Sequencing enables scientists to pinpoint functional variants from association studies, improve the knowledge available to researchers interested in evolutionary biology, and may lay the foundation for predicting disease susceptibility and drug response. The 1000 Genomes consortium materialized to address these needs, primarily by providing a sequence reference database. Their aim has been to 'provide a deep characterization of human genome sequence variation as a foundation for investigating the relationship between genotype and phenotype.' The pilot phase of the project, which included three subprojects, provided the first data release. The subprojects were planned to achieve their aims through evaluation of sequencing technology and to develop analytical pipelines for alignment, quality control, data management, and statistical analysis.5 Details about the pilot data from the 1000 Genomes Project are shown in table 2. Review of the pilot data shows that the project successfully catalogs the vast majority of common variation. Durbin et al reported that over 95% of the currently accessible variants found in any individual were present in the pilot data.6

To date (August 2011), the full 1000 Genomes Project data includes SNP calls, exome alignments, and genotypes for 1185 individuals. The end goal is to sequence approximately 2500 de-identified subjects from 25 populations worldwide using next-generation sequencing technology. In the 'low-coverage' full project data, the current coverage estimate is 7.7× (±4.2) and includes 15 world populations.6 In this analysis, we downloaded an earlier release of the full project data (released October 2010) which included 629 individuals from 15 world populations (see table 3).

Table 3 Details for 1000 Genomes Project full project data (sequence index 2010.08.04)

Continental groups  Ethnicity breakdown                          Total
AFR                 78 YRI+67 LWK+24 ASW+5 PUR                   174
EUR                 90 CEU+92 TSI+43 GBR+36 FIN+17 MXL+5 PUR     283
ASN                 68 CHB+25 CHS+84 JPT+17 MXL                  194
Total number of unique individuals                               629

AFR, African; ASN, Asian; EUR, European.

[...] which database to reference, one might base one's decision on the newest release, total number of variants, or even ethnicities included. For example, if a researcher's interest was rare variants, he/she might automatically assume that data from the 1000 Genomes Project would be the variation catalog of choice. The 1000 Genomes Project aimed to provide characterization of over 95% of variants in accessible genomic regions that have an allele frequency of 1% or higher.6 In the previous example, presuming that the 1000 Genomes Project data included more rare variants than HapMap would be a correct assumption; the 1000 Genomes Project pilot data do indeed capture more rare variation than HapMap. However, not all rare variants found in HapMap have been found in the 1000 Genomes Project catalog. Therefore, if one is interested in rare variants, it might be beneficial to investigate both resources. An example of this is shown in figure 1, which shows a screen shot from the NCBI browser (taken in October 2010) of a region on chromosome 7. It lists the known variants by chromosome position, rs id, functional change, alleles, and many other identifying characteristics. It also includes a validation column which provides links and details about the validation status of each given variant. Of particular interest is variant rs2072413. It was validated in HapMap but was not sequenced by the 1000 Genomes Project (at this time, only pilot data from 1000 Genomes were available on NCBI). If this SNP was of interest, it would be important to consider the HapMap data, including rare variants. This was somewhat surprising, and we felt that it might be pertinent to determine how pervasive the differences were between HapMap and 1000 Genomes Project data.
