Fast and Accurate Genotype Imputation in Genome-Wide Association Studies Through Pre-Phasing
Total Page:16
File Type:pdf, Size:1020Kb
TECHNICAL REPORTS Fast and accurate genotype imputation in genome-wide association studies through pre-phasing Bryan Howie1,6, Christian Fuchsberger2,6, Matthew Stephens1,3, Jonathan Marchini4,5 & Gonçalo R Abecasis2 The 1000 Genomes Project and disease-specific sequencing reference data sets. For example, whereas HapMap Phase 2 included efforts are producing large collections of haplotypes that 210 unrelated individuals typed at ~4 million SNPs, the Phase 1 variant can be used as reference panels for genotype imputation in call set from the 1000 Genomes Project (released in March 2012) genome-wide association studies (GWAS). However, imputing includes 1,092 individuals typed at >38 million polymorphic sites. The from large reference panels with existing methods imposes a next phases of this project will extend this data set to over 2,000 indi- high computational burden. We introduce a strategy called viduals typed at an even greater number of sites, and other sequencing ‘pre-phasing’ that maintains the accuracy of leading methods efforts are also producing large genetic variation resources. while reducing computational costs. We first statistically These developments can provide immediate benefits to GWAS estimate the haplotypes for each individual within the GWAS through imputation: a more complete catalog of variants increases the sample (pre-phasing) and then impute missing genotypes into chances that causal or trait-associated variants will be imputed, and these estimated haplotypes. This reduces the computational reference panels with more haplotypes increase imputation accuracy cost because (i) the GWAS samples must be phased only once, and power for downstream association analysis, especially for variants whereas standard methods would implicitly repeat phasing with low allele frequencies4,5. In contrast, many existing genotype with each reference panel update, and (ii) it is much faster to imputation methods require substantial computing power when used match a phased GWAS haplotype to one reference haplotype with large reference data sets. This problem is compounded by the fact than to match two unphased GWAS genotypes to a pair of that reference collections are now regularly improved and expanded, reference haplotypes. We implemented our approach in the such that investigators might benefit from imputing their samples MaCH and IMPUTE2 frameworks, and we tested it on data multiple times over the course of a study. sets from the Wellcome Trust Case Control Consortium 2 Here, we propose a practical solution that maintains imputation (WTCCC2), the Genetic Association Information Network accuracy while greatly reducing computational costs. Our approach © 2012 Nature America, Inc. All rights reserved. America, Inc. © 2012 Nature (GAIN), the Women’s Health Initiative (WHI) and the 1000 is motivated by the observation that imputation methods spend much Genomes Project. This strategy will be particularly valuable for of their time accounting for the unknown phase of GWAS genotypes. repeated imputation as reference panels evolve. Some methods do this through analytical calculations that integrate npg over all possible phase configurations for each study individual, Genotype imputation is a key step in the analysis of GWAS. The whereas other methods average the imputation probabilities across approach works by finding haplotype segments that are shared between multiple haplotypes sampled by a phasing algorithm4. Both approaches study individuals, who are typically genotyped on a commercial have limitations. Analytical phase integration becomes computation- array with 300,000–2,500,000 SNPs, and a reference panel of more ally expensive as reference panels grow and is only possible when the densely typed individuals, such as those provided by the International study individuals are treated independently, which sacrifices linkage HapMap Project1,2 and the 1000 Genomes Project3 or obtained by disequilibrium (LD) information in the GWAS data. Sampling-based sequencing a subset of study individuals. Imputation methods can methods can scale better with reference panel size and capture LD accurately estimate genotypes or genotype probabilities at markers information to improve imputation accuracy6, but they may still have that have not been directly examined in a GWAS. Imputed geno- nontrivial computational costs because of the need to sample and types are now routinely used to increase the power of GWAS analyses, impute into several haplotype configurations for each individual. to guide fine-mapping efforts and to facilitate the meta-analysis of In our new approach, we first statistically estimate the haplotypes studies genotyped on different marker sets4,5. underlying the GWAS genotypes (in pre-phasing) and then impute The maturation of high-throughput genotyping and sequencing into these haplotypes, as if they were correct; a schematic of a tradi- technologies has led to a rapid increase in the size of publicly available tional workflow and the more efficient workflow proposed here are 1Department of Human Genetics, University of Chicago, Chicago, Illinois, USA. 2Department of Biostatistics, University of Michigan, Ann Arbor, Michigan, USA. 3Department of Statistics, University of Chicago, Chicago, Illinois, USA. 4Department of Statistics, University of Oxford, Oxford, UK. 5Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, UK. 6These authors contributed equally to this work. Correspondence should be addressed to G.R.A. ([email protected]), J.M. ([email protected]) or M.S. ([email protected]). Received 13 September 2011; accepted 13 June 2012; published online 22 July 2012; doi:10.1038/ng.2354 NATURE GENETICS VOLUME 44 | NUMBER 8 | AUGUST 2012 955 T ECHNICAL REPORTS Traditional imputation Pre-phasing imputation Haplotype estimation These results show that pre-phasing can greatly speed up the imputa- 1000 1000 tion process, but the accuracy of imputation with this shortcut may 1000 Genomes 1000 Genomes depend on how well the GWAS haplotypes were estimated. The accu- Genomes Phase I Genomes Phase I HapMap 2 Pilot haplotypes HapMap 2 Pilot haplotypes racy of computationally estimated haplotypes depends on a number haplotypes haplotypes haplotypes haplotypes of factors, including marker density, the relatedness of the sampled individuals, sample size and demography12,13. In founder popula- GWAS GWAS Phased GWAS 14 genotypes genotypes haplotypes tions , long-range haplotypes can be estimated very accurately, even with modest sample sizes9. For example, by comparing the results of population- and trio-based phasing in Finnish samples from the Imputed Imputed Finland–United States Investigation of Non-Insulin–Dependent GWAS Imputed GWAS Imputed 2,15 genotypes GWAS Imputed genotypes GWAS Imputed Diabetes Mellitus Genetics (FUSION) study of type 2 diabetes , genotypes GWAS genotypes GWAS 16 genotypes genotypes we estimated that population phasing produces <1 switch error per 5.5 Mb of DNA. These results are highly accurate due to the large number of genotyped individuals (>2,000) and the fact that Finland is a founder population in which long haplotypes are shared by seem- Figure 1 Imputation schematic. Each box represents a genetic data set ingly unrelated individuals. In more diverse populations, haplotype and each arrow represents an analysis step. The sizes of the boxes reflect estimation may often be less accurate. For example, the distance the relative numbers of genotypes they contain, and the widths of the arrows reflect the relative computational costs of the analyses. Given between switches in European GWAS data sets is typically in the 17 a single GWAS data set (red box), successively larger reference panels range of 0.6–1.4 Mb . (blue boxes) lead to larger and more accurate imputed data sets (orange boxes). The computational cost of imputation is much lower when using Genotype imputation accuracy in diverse populations pre-phased GWAS haplotypes (purple box) than when using traditional Here, we evaluate whether pre-phased haplotypes can be used to imputation approaches (left). accurately impute missing genotypes in three GWAS data sets sam- pled from diverse populations: the WTCCC2 data described above, a European-American case-control data set from a psoriasis study shown (Fig. 1). Imputing into pre-phased haplotypes is known to by GAIN18 and a set of data for African Americans from WHI19. be fast, and it is highly accurate when the haplotypes are estimated In each data set, we masked and imputed a subset of the genotyped through genotyped family members7,8 or long segments of recent SNPs (details in Table 2). We measured imputation accuracy at shared ancestry9. These two phasing techniques cannot be used on these SNPs as the average squared correlation (mean R2) between unrelated individuals from outbred populations (a common study masked array genotypes and imputed allele dosages (posterior design in GWAS), which means many data sets can only be phased mean genotypes). by statistical algorithms that yield lower quality (but still reason- We used the GAIN data set to compare a well-benchmarked able) haplotypes. The central aims of this work are (i) to show that imputation method (MaCH) to a related method that uses pre- the GWAS haplotypes estimated by existing algorithms can produce phasing (minimac). Encouragingly, both methods produced results accurate imputation and (ii) to quantify the efficiency gains from pre- with similar accuracy when applied to a common reference panel © 2012