Genotype Imputation
Yun Li,1 Cristen Willer,1 Serena Sanna,2 and Gonçalo Abecasis1
ANRV386-GG10-18 ARI 29 July 2009 15:32

Annu. Rev. Genom. Hum. Genet. 2009.10:387-406. Downloaded from www.annualreviews.org Access provided by University of North Carolina - Chapel Hill on 05/08/18. For personal use only.

INTRODUCTION

Identifying and characterizing the genetic variants that affect human traits, ranging from disease susceptibility to variability in personality measures, is one of the central objectives of human genetics. Ultimately, this aim will be achieved by examining the relationship between interesting traits and the whole genome sequences of many individuals. Although whole genome resequencing of thousands of individuals is not yet feasible, geneticists have long recognized that much progress can be made by measuring only a relatively modest number of genetic variants in each individual. This type of “incomplete” information is useful because data about any set of genetic variants in a group of individuals provide useful information about many other unobserved genetic variants in the same individuals.

The idea that data on a modest set of genetic variants measured in a number of related individuals can provide useful information about other genetic variants in those individuals forms the theoretical underpinning of both genetic linkage mapping in pedigrees and haplotype mapping in founder populations (23, 24, 50). These studies typically use <10,000 genetic markers to survey the entire human genome by identifying stretches of chromosome inherited from a common ancestor. The shared stretches usually span several megabases and include thousands of genetic variants. Both approaches have been highly successful in identifying genes responsible for single-gene Mendelian disorders (9). In contrast, they have had only limited success in mapping genes that influence complex traits, although success stories do exist (40, 42, 75, 83).

More recently, technological advances have made genome-wide association studies (GWAS) possible (39, 67, 109). Rather than <10,000 variants, these studies typically genotype 100,000–1,000,000 variants in each of the individuals being studied. Since >10 million common genetic variants are likely to exist (104), even these detailed studies examine only a fraction of all genetic variants. Whereas in traditional genetic linkage and founder haplotype mapping studies geneticists expect to identify long stretches of shared chromosome inherited from a relatively recent common ancestor, in GWAS focusing on apparently unrelated individuals geneticists expect to identify only relatively short stretches of shared chromosomes. Remarkably, genotype imputation can use these short stretches of shared haplotype to estimate with great precision the effects of many variants that are not directly genotyped.

In this review, we first attempt to provide the reader with an intuition for how genotype imputation approaches work and for their theoretical underpinnings. We start with the relatively intuitive setting of imputing missing genotypes for a set of individuals using information on their close relatives. We then examine how genotype imputation works when applied to more distantly related individuals. Next, we survey results of studies that have used genotype imputation to study complex disease susceptibility. We attempt to provide the reader with critical information to assess the merits of genotype imputation-based analyses and to provide guidance to analysts attempting to implement these approaches. Finally, we survey potential uses of imputation-based analyses in the context of whole genome resequencing studies that we believe will soon become commonplace.

GENOTYPE IMPUTATION IN STUDIES OF RELATED INDIVIDUALS

Family samples constitute the most intuitive setting for genotype imputation. Genotypes for a relatively modest number of genetic markers can be used to identify long stretches of haplotype shared between individuals of known relationship. These stretches of shared haplotype (or regions “identical-by-descent”, IBD) are typically used to evaluate the evidence for linkage. Specifically, genetic linkage implies that family members who share a region IBD will be more similar to each other than will family members with the same degree of relatedness who do not share the region IBD. In the context of genotype imputation, we characterize each of these stretches in detail by genotyping additional markers in one or more individuals in the family. Genotypes for these markers can then be propagated to other family members who are only typed at a minimal set of markers.

The approach is illustrated in Figure 1. In the figure, all individuals have been genotyped for a set of genetic markers indicated in red; a subset of individuals in the top two generations has been genotyped at additional markers indicated in black (panel a). Genotypes for the red markers, available in all individuals, can be used to infer the segregation of haplotypes through the family (panel b). Finally, most of the missing genotypes for individuals in the bottom generation can be inferred by comparing the haplotypes they inherited with copies of the same haplotypes that are IBD and present in other individuals in the family (panel c).

The idea that family members share long stretches of haplotype that are IBD underpins nearly all methods of linkage analysis. Furthermore, many early approaches for association analysis in pedigree data implicitly impute missing genotypes by considering the distribution of potential genotypes of each individual jointly with that of other individuals in the same pedigree (35, 45). The extension of this idea to the imputation of missing genotypes (as outlined above) was first described by Burdick and colleagues (12), who coined the term in silico genotyping to describe the idea that computational analyses could be used to replace laboratory-based procedures in the determination of individual genotypes. To illustrate the potential of the approach, they reanalyzed the data of Cheung and colleagues (17). Cheung and colleagues sought to identify genetic variants associated with regulation of gene expression by examining RNA transcript levels and genotype data for individuals in the top two generations of the Centre d’Etude du Polymorphism Humain (CEPH) pedigrees (21). The CEPH pedigrees are three-generation pedigrees with a structure similar to that of the cartoon pedigree in Figure 1. The top two generations of several of these pedigrees were genotyped at more than 830,000 genetic markers in the first phase of the International HapMap Project (103). Using genotypes for approximately 6500 genetic markers genotyped by the SNP consortium in all three generations of the pedigrees (85), Burdick and colleagues imputed genotypes for most of the HapMap Project markers in the third generation of these pedigrees (12). They showed that this imputation-based analysis was more powerful than the original analysis, which examined only directly genotyped markers for each individual.

Several formal statistical descriptions of genotype imputation procedures for association analyses in families have now been published (15, 108), and the procedures to support genotype imputation are implemented in packages such as MERLIN (2, 3) and MENDEL (52, 53). In principle, these procedures can be implemented using the infrastructure of the Lander-Green (48) or Elston-Stewart (29) algorithms, or one of the many other pedigree analysis algorithms, including those that are based on Monte Carlo sampling (38, 96). An important observation from these more formal treatments of the problem is that even when genotypes cannot be imputed with high confidence, partial information about the identity of each of the true underlying genotypes can be productively incorporated in association analysis (15, 108). For example, when genotypes are measured directly, observed allele counts are often used in regression analyses to estimate an additive effect for each marker (1, 8, 34). These observed allele counts indicate the number of copies of the allele of interest (0, 1, or 2) carried by each individual. When genotypes are not measured directly, these discrete counts can be replaced with an expected allele count for each marker (a real number between 0 and 2) (15).

The approach has been successfully used to study several quantitative traits in a sample of closely related individuals from four villages in Sardinia (77). Among study participants, 1412 individuals were genotyped with the Affymetrix 500K mapping array set (which assays ∼500,000 SNPs) and a further 3329 individuals were genotyped with Affymetrix 10K

[Figure 1: cartoon pedigree with observed genotypes (panel a) and inferred haplotypes (panel b).]
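The replacement of discrete allele counts (0, 1, or 2) with an expected allele count, as described in the text, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure implemented in MERLIN or MENDEL: the function names `expected_allele_count` and `additive_effect`, the posterior genotype probabilities, and the toy trait values are all hypothetical, and an ordinary least-squares fit that ignores relatedness among family members stands in for the regression analysis.

```python
from typing import Sequence, Tuple


def expected_allele_count(probs: Sequence[float]) -> float:
    """Expected copies of the allele of interest, given P(0), P(1), P(2)."""
    p0, p1, p2 = probs
    total = p0 + p1 + p2
    if total <= 0:
        raise ValueError("genotype probabilities must sum to a positive value")
    # Normalise defensively, then take the mean of the 0/1/2 distribution.
    return (p1 + 2.0 * p2) / total


def additive_effect(dosages: Sequence[float],
                    trait: Sequence[float]) -> Tuple[float, float]:
    """Least-squares (intercept, slope) for trait ~ expected allele count."""
    n = len(dosages)
    mean_x = sum(dosages) / n
    mean_y = sum(trait) / n
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, trait))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical posterior genotype probabilities: (P(0), P(1), P(2) copies).
# A directly genotyped individual has all probability on one genotype.
posteriors = [
    (1.0, 0.0, 0.0),      # observed: 0 copies
    (0.125, 0.75, 0.125), # imputed with some uncertainty
    (0.0, 0.25, 0.75),    # imputed, most likely 2 copies
    (0.0, 0.0, 1.0),      # observed: 2 copies
]
dosages = [expected_allele_count(p) for p in posteriors]

# Toy trait constructed to be exactly linear in the expected allele count.
trait = [1.0 + 0.5 * d for d in dosages]
intercept, slope = additive_effect(dosages, trait)
print(dosages, intercept, slope)
```

Because the toy trait was constructed as 1.0 + 0.5 × expected allele count, the fitted slope recovers an additive effect of about 0.5 per allele copy. In family data a real analysis would also need to model the correlation between relatives, which this sketch deliberately leaves out.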
