Genotype Imputation for Genome-Wide Association Studies
Total Page:16
File Type:pdf, Size:1020Kb
REVIEWS GENOME-WIDE ASSOCIATION STUDIES Genotype imputation for genome-wide association studies Jonathan Marchini* and Bryan Howie‡ Abstract | In the past few years genome-wide association (GWA) studies have uncovered a large number of convincingly replicated associations for many complex human diseases. Genotype imputation has been used widely in the analysis of GWA studies to boost power, fine-map associations and facilitate the combination of results across studies using meta-analysis. This Review describes the details of several different statistical methods for imputing genotypes, illustrates and discusses the factors that influence imputation performance, and reviews methods that can be used to assess imputation performance and test association at imputed SNPs. Hidden Markov model Genotype imputation is the term used to describe the genotypes based on the reference panel without taking A class of statistical model process of predicting or imputing genotypes that are the phenotype into account. Imputed genotypes at each that can be used to relate an not directly assayed in a sample of individuals. There SNP together with their inherent uncertainty are then observed process across the are several distinct scenarios in which genotype imputa- tested for association with the phenotype of interest in a genome to an underlying, unobserved process of interest. tion is desirable, but the term now most often refers to second stage. The advantage of the two-stage approach Such models have been used the situation in which a reference panel of haplotypes is that different phenotypes can be tested for association to estimate population at a dense set of SNPs is used to impute into a study without the need to redo the imputation. structure and admixture, for sample of individuals that have been genotyped at a sub- This Review provides an overview of the different genotype imputation and set of the SNPs. An overview of this process is given in methods that have been proposed for genotype impu- for mutiple testing. BOX 1. Genotype imputation can be carried out across tation, discusses and illustrates the factors that affect the whole genome as part of a genome-wide association the accuracy of genotype imputation, discusses the use (GWA) study or in a more focused region as part of a quality-control measures on imputed data and methods fine-mapping study. The goal is to predict the genotypes that can be employed in testing for association using at the SNPs that are not directly genotyped in the study imputed genotypes. sample. These ‘in silico’ genotypes can then be used to boost the number of SNPs that can be tested for associa- Genotype imputation methods tion. This increases the power of the study, the ability We assume that we have data at L diallelic autosomal to resolve or fine-map the causal variant and facilitates SNPs and that the two alleles at each SNP have been meta-analysis. BOX 2 discusses these uses of imputation coded 0 and 1. Let H denote a set of N haplotypes at as well as the imputation of untyped variation, human these L SNPs and let G denote the set of genotype data at leukocyte antigen (HLA) alleles, copy number variants the L SNPs in K individuals with Gi = {Gi1,…, GiL} denot- (CNVs), insertion–deletions (indels), sporadic missing ing the genotypes of the ith individual. The individual ∈ data and correction of genotype errors. genotypes are either observed so that Gik {0,1,2} or 1 The HapMap 2 haplotypes have been widely used they are missing so that Gik = missing. The main focus *Department of Statistics, to carry out imputation in studies of samples that have here is in predicting the genotypes of those SNPs that University of Oxford, Oxford, UK. ancestry close to those of the HapMap panels. The CEU have not been genotyped in the study sample at all but ‡Department of Human (Utah residents with northern and western European there are usually sporadic missing genotypes as well. We Genetics, University of ancestry from the CEPH collection), YRI (Yoruba from assume that strand alignment between data sets has been Chicago, Chicago, USA. Ibadan, Nigeria) and JPT + CHB (Japanese from Tokyo, carried out (Supplementary information S1 (box)). Correspondence to J.M. Japan and Chinese from Beijing, China) panels consist of e-mail: (REF. 2) [email protected] 120, 120 and 180 haplotypes, respectively, at a very dense IMPUTE v1. IMPUTE v1 is based on an extension doi:10.1038/nrg2796 set of SNPs across the genome. Most studies have used a of the hidden Markov models (HMMs) originally devel- Published online 2 June 2010 two-stage procedure that starts by imputing the missing oped as part of importance sampling schemes for NATURE REVIEWS | GENETICS VOLUME 11 | JULY 2010 | 499 © 2010 Macmillan Publishers Limited. All rights reserved REVIEWS 3,4 Linkage disequlibrium simulating coalescent trees and for modelling linkage form the genotype vector. The term P(Z|H,ρ) models 5 The statistical association disequilibrium (LD) and estimating recombination rates . how the pair of copied haplotypes changes along the within gametes in a population The method is based on an HMM of each individual’s sequence and is defined by a Markov chain in which of the alleles at two loci. vector of genotypes, Gi, conditional on H, and a set of switching between states depends on an estimate of the Although linkage disequilibrium ρ can be due to linkage, it can parameters. This model can be written as fine-scale recombination map ( ) across the genome. | θ also arise at unlinked loci — for The term P(Gi Z, ) allows each observed genotype example, because of selection P(Gi|H, θθ, ρ ) = Σ P(Gi|Z, ), P(Z|H, ρ) (1) vector to differ through mutation from the genotypes z or non-random mating. determined by the pair of copied haplotypes and is con- θ in which Z = {Z1,…, ZL} with Zj = {Z j 1, Zj2} and Zjk = {1,…, N}. trolled with the mutation parameter . Estimates of the ρ The Zj can be thought of as the pair of haplotypes from fine-scale recombination map ( ) are provided on the reference panel at SNP j that are being copied to the IMPUTE v1 webpage and θ is fixed internally by the Box 1 | How genotype imputation works b Testing association at typed SNPs may not d Reference set of haplotypes, for example, HapMap f Testing association at imputed SNPs may lead to a clear signal boost the signal 0 0 0 0 1 1 1 0 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0 0 1 0 0 1 1 1 0 1 1 1 1 1 0 1 0 0 1 0 0 0 1 0 1 0 0 1 0 1 1 1 0 0 1 1 1 1 1 1 0 1 1 1 0 1 1 0 0 1 1 1 0 1 1 1 0 alue alue -v p 0 0 1 0 1 1 1 0 0 1 1 1 1 1 1 0 10 1 1 1 1 1 0 1 0 0 1 0 0 0 1 0 1 –log –log10 p-v 1 1 1 0 0 1 0 0 1 1 1 0 1 1 1 0 0 0 0 0 1 1 1 0 0 1 1 1 1 1 1 0 1 1 1 0 0 1 0 0 1 1 1 0 1 1 1 0 a Genotype data with missing data at c Each sample is phased and the haplotypes e The reference haplotypes are used to untyped SNPs (grey question marks) are modelled as a mosaic of those in the impute alleles into the samples to create haplotype reference panel imputed genotypes (orange) 0 ? ? ? 1 ? 1 ? 0 1 1 ? ? 1 ? 0 1 1 1 1 1 2 1 0 0 2 2 0 2 2 2 0 1 ? ? ? 1 ? 1 ? 0 2 2 ? ? 2 ? 0 1 ? ? ? 1 ? 1 ? 0 1 1 ? ? 1 ? 0 0 ? ? ? 2 ? 2 ? 0 2 2 ? ? 2 ? 0 . 0 0 1 0 2 2 2 0 0 2 2 2 2 2 2 0 . 1 ? ? ? 2 ? 2 ? 0 2 1 ? ? 2 ? 0 . 1 1 1 1 2 2 2 0 0 2 1 1 2 2 2 0 1 ? ? ? 2 ? 1 ? 1 2 2 ? ? 2 ? 0 1 ? ? ? 1 ? 1 ? 0 1 0 ? ? 1 ? 0 1 1 2 0 2 2 1 0 1 2 2 1 2 2 2 0 2 ? ? ? 2 ? 2 ? 1 2 1 ? ? 2 ? 0 2 2 2 2 2 1 2 0 1 2 1 1 2 2 2 0 1 ? ? ? 1 ? 1 ? 1 2 2 ? ? 2 ? 0 1 ? ? ? 1 ? 1 ? 1 1 1 ? ? 1 ? 0 1 1 1 0 1 2 1 0 1 2 2 1 2 2 2 0 . 1 ? ? ? 2 ? 2 ? 0 2 1 ? ? 2 ? 1 . 1 1 2 1 2 1 2 0 0 2 1 1 1 2 1 1 . 2 ? ? ? 1 ? 1 ? 1 2 1 ? ? 2 ? 1 2 2 2 1 1 1 1 0 1 2 1 0 1 2 1 1 1 ? ? ? 0 ? 0 ? 1 1 1 ? ? 1 ? 0 1 ? ? ? 0 ? 0 ? 2 2 2 ? ? 2 ? 0 1 2 2 0 0 2 0 0 2 2 2 1 2 2 2 0 0 ? ? ? 0 ? 0 ? 1 1 1 ? ? 1 ? 0 In samples of unrelated individuals, the haplotypes of the individuals over Imputation attempts to predict these missing genotypes.Nature Re Algorithmsviews | Genetics differ short stretches of sequence will be related to each other by being identical in their details but all essentially involve phasing each individual in the study by descent (IBD).