A Comprehensive Evaluation of SNP Genotype Imputation

Hum Genet (2009) 125:163–171 DOI 10.1007/s00439-008-0606-5 ORIGINALINVESTIGATION A comprehensive evaluation of SNP genotype imputation Michael Nothnagel Æ David Ellinghaus Æ Stefan Schreiber Æ Michael Krawczak Æ Andre Franke Received: 7 October 2008 / Accepted: 5 December 2008 / Published online: 17 December 2008 Ó Springer-Verlag 2008 Abstract Genome-wide association studies have contrib- in a northern European population is powerful and reliable, uted significantly to the genetic dissection of complex even in highly variable genomic regions such as the extended diseases. In order to increase the power of existing marker MHC on chromosome 6p21. However, while genotype sets even further, methods have been proposed to predict predictions were found to be highly accurate with all four individual genotypes at un-typed loci from other marker sets programs, the number of SNPs for which imputation was by imputation, usually employing HapMap data as a refer- actually carried out (‘imputation efficacy’) varied substan- ence. Although various imputation algorithms have been tially. BEAGLE, IMPUTE, and MACH yielded nearly used in practice already, a comprehensive evaluation and identical trade-offs between imputation accuracy and effi- comparison of these approaches, using genome-wide SNP cacy whereas PLINK performed consistently poorer. We data from one and the same population is still lacking. We nevertheless recommend either MACH or BEAGLE for therefore investigated four publicly available programs for practical use because these two programs are more user- genotype imputation (BEAGLE, IMPUTE, MACH, and friendly and generally require less memory than IMPUTE. PLINK) using data from 449 German individuals genotyped in our laboratory for three genome-wide SNP sets [Affymetrix 5.0 (500 k), Affymetrix 6.0 (1,000 k), and Illumina 550 k]. We observed that HapMap-based imputation Introduction Imputation of single nucleotide polymorphism (SNP) M. Nothnagel and D. Ellinghaus contributed equally to the genotypes has been proposed as a powerful means to manuscript. include genetic markers into large-scale disease association M. Krawczak and A. Franke shared senior authorship. studies without a need to actually genotype them (Marchini Electronic supplementary material The online version of this et al. 2007; Servin and Stephens 2007). In fact, multi-locus article (doi:10.1007/s00439-008-0606-5) contains supplementary analyses using a combination of imputed and observed material, which is available to authorized users. genotypes appear to facilitate the detection of rare causative variants (population frequency 5%) that would otherwise M. Nothnagel M. Krawczak \ Institute of MedicalÁ Informatics and Statistics, be overlooked (Browning and Browning 2008). The Christian-Albrechts University, Kiel, Germany underlying computations are usually based upon those 90–120 SNP haplotypes that are provided for each of four D. Ellinghaus S. Schreiber A. Franke (&) exemplary populations by the International HapMap project Institute of ClinicalÁ MolecularÁ Biology, Christian-Albrechts University, Campus Kiel, House 6, (The International HapMap Consortium 2003, 2005). Arnold-Heller-Str. 3, 24105 Kiel, Germany Although imputed genotypes have already been used in e-mail: [email protected] practice (Wellcome Trust Case Control Consortium 2007), an independent genome-wide validation of the approach and S. Schreiber M. Krawczak PopGen Biobank,Á Christian-Albrechts University, a comprehensive comparison of the available algorithms are Kiel, Germany still lacking. For example, Marchini et al. (2007) based the 123 164 Hum Genet (2009) 125:163–171 benchmarking of their algorithm, implemented in the the Illumina HumanHap550 Bead array (550 k), respec- computer program IMPUTE, on only 10,180 coding SNPs, tively. Genotyping was performed by Affymetrix (Santa and a recent genome-wide investigation of imputation per- Clara, CA, USA) and Illumina (San Diego, CA, USA), formance (Anderson et al. 2008) exclusively used IMPUTE. respectively, as a commercial service. Further genotyping Pei et al. (2008) compared five programs using both simu- details are given in the Supplementary material. Prior to lated and HapMap CEU phasing data, but only small inclusion, individual samples and SNPs were subjected to exemplary regions were analyzed and only imputation rigorous quality control. All sample-wise call rates were accuracy (and not efficacy) was studied. We therefore found to be [95% for the three array types. The average assessed systematically and at a genome-wide level how call rate per sample was 99.8% for Affymetrix 5.0, 99.4% well HapMap-based imputation with one of four publicly for Affymetrix 6.0, and 99.9% for Illumina 550 k. Indi- available programs, namely BEAGLE (Browning and vidual SNPs were required to have a call rate [95% in the Browning 2007), IMPUTE (Marchini et al. 2007), MACH German samples, a minor allele frequency[1%, and had to (Li and Abecasis 2006), and PLINK (Purcell et al. 2007), be in Hardy–Weinberg equilibrium (p [ 0.01). Annotation would have performed in our own collection of composite files from the Affymetrix and Illumina arrays were used to genotypes, for three genome-wide SNP sets, of 449 healthy code SNPs on the forward strand in order to match the blood donors of German descent. release 22 CEU phasing data from HapMap (Frazer et al. Our analysis was based upon genotypes generated by 2007). Strand orientation was checked automatically by all means of Affymetrix 5.0 (500 k), Affymetrix 6.0 (1,000 k) imputation programs, except for SNPs with A/T and C/G and Illumina 550 k SNP arrays, respectively. Since the same alleles, and no errors were reported. All pairs of individuals DNA samples were genotyped with all three arrays, and had average identity-by-state (IBS) values within the since the arrays contained only partially overlapping marker threefold inter-quartile range of the array-wide IBS distri- sets, extensive genome-wide benchmarking became possi- bution (Tukey’s outlier criterion), thus minimizing the ble through a comparison of the imputed and observed likelihood that any individual included in our study rep- genotypes derived with the different arrays. Prior to inclu- resented a close relative of another individual, or was of sion, individual samples and SNPs were subjected to different ethnic origin. Data were quality-controlled using rigorous quality control to minimize the effects of geno- PLINK v1.02 (Purcell et al. 2007); the total SNP numbers typing errors on imputation accuracy. Since genotype before and after quality control are given in Table 1. The imputation requires similar patterns of linkage disequilib- genotype concordance rate in the overlapping marker sets rium (LD) in both the study and the reference population, we exceeded 99.7% for all array pairs (Table 1) and the cor- also tested whether the haplotype frequencies in HapMap responding allele frequency distributions were found to be were representative of those observed in our own sample. virtually identical in Q–Q plots (data not shown), rendering platform-specific genotyping errors negligible. Data from the three array types could be matched unambiguously as Materials and methods belonging to the same individual by IBS values[0.985 for the overlapping markers. Samples and reference population Representativeness of HapMap data DNA samples of 241 male and 208 female unrelated individuals were obtained from the PopGen biobank Estimates of the two linkage disequilibrium (LD) measures 2 (Krawczak et al. 2006). The blood donors, their parents, r and |D0| and of the marker allele frequencies in HapMap and grandparents were all born in Germany. Written, were obtained from the HapMap web site (http://www. informed consent was obtained from all study participants hapmap.org/downloads/ld_data/2006-06/) (NCBI build and all protocols were approved by the institutional ethics 36). HaploView 4.0 (Barrett et al. 2005) was employed committee. We used the HapMap CEU samples (Frazer with default options to assess LD in the German samples. et al. 2007; The International HapMap Consortium 2003), comprising CEPH Utah residents of northern and western Imputation protocols European ancestry, as a reference population for imputation. The imputation performance of four publicly available computer programs was assessed using the release 22 CEU Genotyping and quality control phasing data from HapMap as a reference. All analyses were confined to those autosomal SNPs that were present Genotypes were generated with Affymetrix Genome-Wide in HapMap and that passed quality control in the German Human SNP Arrays 5.0 (500 k) and 6.0 (1,000 k) and with samples. Using genotypes from one array type (‘imputation 123 Hum Genet (2009) 125:163–171 165 Table 1 SNP sets used for imputation benchmarking Array type No. SNPs Array type Affymetrix 5.0 Affymetrix 6.0 Illumina 550 k Affymetrix 5.0 358,391 (80.8% of 443,816) 331,176 (99.8%) 70,716 (99.9%) Affymetrix 6.0 656,391 (73.1% of 934,968) 24,185 (95.9% of 25,215) 135,395 (99.7%) 260,448 (80.5% of 323,215) Illumina 550 k 514,883 (91.7% of 561,474) 279,869 (97. 3% of 287,675) 452,227 (86.8% of 520,996) 435,759 (98.1% of 444,167) 371,854 (98.0% of 379,488) No. SNPs number of autosomal SNPs that passed quality control and had a minor allele frequency (MAF) C1% in the German samples. The total number of SNPs on each array and the percentage included in the study are given in parentheses. Upper right half: number of overlapping SNPs and, in parentheses, average genotype

A Comprehensive Evaluation of SNP Genotype Imputation

Impact of Imputation Methods on the Amount of Genetic Variation Captured by a Single-Nucleotide Polymorphism Panel in Soybeans A

Imputation-Based Assessment of Next Generation Rare Exome Variant Arrays

Allele Imputation and Haplotype Determination from Databases Composed of Nuclear Families by Nathan Medina-Rodríguez and Ángelo Santana

Identity-By-Descent Estimation with Population- and Pedigree-Based Imputation in Admixed Family Data Mohamad Saad

Genotype Imputation Via Matrix Completion

Impact of Imputation Methods on the Amount of Genetic Variation Captured by a Single-Nucleotide Polymorphism Panel in Soybeans A

References for Haplotype Imputation in the Big Data

Accurate Genotype Imputation in Multiparental Populations

1000 Genomes-Based Imputation Identifies Novel and Refined

The Use of Genomic Data and Imputation Methods in Dairy Cattle Breeding

Imputation Accuracy from Low to Moderate Density Single Nucleotide Polymorphism Chips in a Thai Multibreed Dairy Cattle Population

Odyssey: a Semi-Automated Pipeline for Phasing, Imputation, and Analysis of Genome-Wide Genetic Data Ryan J