Nature Reviews Genetics | AOP, published online 15 June 2010; doi:10.1038/nrg2813 PROGRESS

GENOME-WIDE ASSOCIATION STUDIES

New approaches to population stratification in genome-wide association studies

Alkes L. Price, Noah A. Zaitlen, David Reich and Nick Patterson

Abstract | Genome-wide association (GWA) studies are an effective approach for identifying genetic variants associated with disease risk. GWA studies can be confounded by population stratification (systematic ancestry differences between cases and controls), which has previously been addressed by methods that infer genetic ancestry. Those methods perform well in data sets in which population structure is the only kind of structure present but are inadequate in data sets that also contain family structure or cryptic relatedness. Here, we review recent progress on methods that correct for stratification while accounting for these additional complexities.

Genome-wide association (GWA) studies have identified hundreds of common variants associated with disease risk or related traits1 (see the National Human Genome Research Institute (NHGRI) Catalog of Published Genome-Wide Association Studies). These studies have overcome the dangers of population stratification, which can produce spurious associations if not properly corrected2,3. However, accounting for population structure is more challenging when family structure or cryptic relatedness is also present, and these limitations have motivated the development of new methods. Spurious associations have occurred primarily at markers with unusual allele frequency differences among subpopulations2,4, so it is crucial that new methods aimed at correcting for stratification are evaluated at unusually differentiated markers.

The prevailing paradigm in recent years has been to use genomic control to measure the extent of inflation due to population stratification or other confounders, and to correct for stratification (if necessary) using methods that infer genetic ancestry, such as structured association or principal components analysis (PCA). A limitation of this strategy is that it fails to account for other types of sample structure, such as family structure or cryptic relatedness5,6. Modelling family structure is a necessity in studies with family-based sample ascertainment, and there is increasing evidence that cryptic relatedness may occur in a wide range of data sets (see below). Family-based association tests offer one potential solution for dealing with family structure. More recently, approaches using mixed models that incorporate the full covariance structure across individuals have been proposed. Below, we review each of these methods, conduct simulations to evaluate their performance, discuss stratification in the specific context of low-frequency or rare variants and conclude with guidelines and recommendations.

Detecting stratification

A widely used approach to evaluate whether confounding due to population stratification exists is to compute the genomic control λ (λGC), which is defined as the median χ² (1 degree of freedom) association statistic across SNPs divided by its theoretical median under the null distribution7–9. A value of λGC ≈ 1 indicates no stratification, whereas λGC > 1 indicates stratification or other confounders, such as family structure or cryptic relatedness (see below), or differential bias10. P–P plots are a standard tool for visualization of test statistics (FIG. 1). Values of λGC < 1.05 are generally considered benign; we note that inflation in λGC is proportional to sample size.

If population stratification exists, it is important to distinguish between subpopulation differences that are due to recent genetic drift and those that arose from more ancient population divergence11. In the case of genetic drift, dividing association statistics by λGC will provide a sufficient correction for stratification. In the case of ancient population divergence, markers with unusual allele frequency differences that lie outside the expected distribution, which could be caused by natural selection, make stratification a much more severe problem, and dividing association statistics by λGC is likely to be inadequate. In the case of family structure or cryptic relatedness, dividing association statistics by λGC will generally produce the approximate null distribution, although a refinement to the method may be needed when there is uncertainty in the estimate of λGC (REF. 12). However, even if the appropriate null distribution is obtained, in general this approach will not maximize power to detect true associations. Other approaches to correcting for stratification, including approaches that also account for family structure and cryptic relatedness, are described below.

Inferring genetic ancestry

Structured association. Methods that explicitly infer genetic ancestry generally provide an effective correction for population stratification in data sets in which population structure is the only type of sample structure. In the structured association approach, samples are assigned to subpopulation clusters (possibly allowing fractional cluster membership) using a model-based clustering program such as STRUCTURE13,14, and association statistics are computed by stratifying by cluster using a program such as STRAT15. The applicability of this approach to large genome-wide data sets has historically been limited by its high computational cost when allowing fractional cluster membership, but faster model-based approaches for inferring population structure have recently been developed (such as the ADMIXTURE software)16. Thus, applying structured association to both infer population structure and compute

NATURE REVIEWS | GENETICS ADVANCE ONLINE PUBLICATION | 1 © 2010 Macmillan Publishers Limited. All rights reserved PROGRESS

[Figure 1: three P–P plot panels (a, b, c); x-axis: Expected (–logP); y-axis: Observed (–logP).]

Figure 1 | P–P plots for the visualization of stratification or other confounders. The figure shows simulated P–P plots under three scenarios for genome-wide scans with no causal markers. a | No stratification: p-values fit the expected distribution. b | Stratification without unusually differentiated markers: p-values exhibit modest genome-wide inflation. c | Stratification with unusually differentiated markers: p-values exhibit modest genome-wide inflation and severe inflation at a small number of markers.
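As a rough illustration of the quantities behind Figure 1, the sketch below computes λGC as the median χ² statistic divided by 0.455 (the approximate null median quoted in the note to Table 1) and builds the expected and observed –log10 P coordinates of a P–P plot. The function names, the use of NumPy and the (i – 0.5)/n plotting positions are assumptions made for this example; they are not taken from any of the software packages cited in this article.

```python
import math

import numpy as np

# Approximate median of a chi-square distribution with 1 degree of freedom
# (the value used in the note to Table 1).
CHI2_1DF_MEDIAN = 0.455


def lambda_gc(chi2_stats):
    """Genomic control lambda: median association statistic / null median."""
    return float(np.median(chi2_stats)) / CHI2_1DF_MEDIAN


def chi2_to_p(chi2_stats):
    """P-values for 1-d.f. chi-square statistics: P(Z^2 > x) = erfc(sqrt(x/2))."""
    return np.array([math.erfc(math.sqrt(x / 2.0)) for x in chi2_stats])


def pp_plot_points(p_values):
    """Return (expected, observed) -log10(P) coordinates for a P-P plot.

    Expected values use the plotting positions (i - 0.5) / n, which is an
    assumption of this sketch rather than a convention stated in the text.
    """
    observed = np.sort(np.asarray(p_values, dtype=float))
    n = observed.size
    expected = (np.arange(1, n + 1) - 0.5) / n
    return -np.log10(expected), -np.log10(observed)


# Hypothetical usage with simulated null statistics (no real data).
rng = np.random.default_rng(0)
null_chi2 = rng.standard_normal(100_000) ** 2
inflated_chi2 = 1.4 * null_chi2  # uniform inflation of every statistic
lam_null = lambda_gc(null_chi2)
lam_inflated = lambda_gc(inflated_chi2)
```

Under these assumptions `lam_null` should be close to 1, and `lam_inflated` is 1.4 times larger by construction, which corresponds to the uniform genome-wide inflation of panel b rather than the marker-specific inflation of panel c.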

association statistics in genome-wide data sets is likely to become a practical approach.

Principal components analysis. PCA is a tool that has been used to infer population structure in genetic data for several decades, long before the era of GWA studies17–20. It should be noted that top principal components do not always reflect population structure: they may reflect family relatedness19, long-range linkage disequilibrium (LD) (due to, for example, inversion polymorphisms4) or assay artefacts10. These effects can often be eliminated by removing related samples, regions of long-range LD or low-quality data, respectively, from the data used to compute principal components. In addition, PCA can highlight effects of differential bias that require additional quality control21.

Using top principal components as covariates corrects for stratification in GWA studies21,22, and this can be done using software such as EIGENSTRAT. Like structured association, PCA will appropriately apply a greater correction to markers with large differences in allele frequency across ancestral populations. Unlike initial implementations of structured association, PCA is computationally tractable in large genome-wide data sets. Related approaches, such as multidimensional scaling (MDS) and genetic matching, have also proven useful23–25 and can be carried out using the PLINK software.

When genome-wide data are not available (for example, in replication studies), structured association or PCA can infer genetic ancestry, and hence correct for stratification, using ancestry-informative markers (AIMs)26. A common misconception is that AIMs should be used to infer genetic ancestry even when genome-wide data are available, but in fact the best ancestry estimates are obtained using a large number of random markers.

A limitation of the above methods is that they do not model family structure or cryptic relatedness. These factors may lead to inflation in test statistics if they are not explicitly modelled because samples that are correlated are assumed to be uncorrelated. Although correcting for genetic ancestry and then dividing by the residual λGC will restore an appropriate null distribution, association statistics that explicitly account for family structure or cryptic relatedness are likely to achieve higher power owing to improved weighting of the data.

Family-based association tests

Family-based studies, in which individuals are ascertained from family pedigrees, offer a unique solution to population stratification. Family-based association tests that focus on within-family information (generalizing the transmission disequilibrium test27) are immune to stratification, as transmitted and untransmitted alleles have the same genetic ancestry28–30, and such tests can be performed using the FBAT and QTDT software. However, fully powered statistics for family-based studies will need to incorporate between-family information, which is still susceptible to stratification. A recent suggestion is to transform between-family information into a rank statistic before combining within-family and between-family information, guaranteeing that both sources of information are immune to stratification31,32. This approach performs favourably compared with previous family-based approaches31,32, but it places an upper bound on the statistical power that can be extracted from the between-family component of the overall signal. This is because the transformed rank statistic cannot be more statistically significant than one divided by the number of samples.

Mixed models

Mixed models, which owe their roots to applications in animal breeding, can model population structure, family structure and cryptic relatedness33,34. The basic approach is to model phenotypes using a mixture of fixed effects and random effects. Fixed effects include the candidate SNP and optional covariates, such as gender or age, whereas random effects are based on a phenotypic covariance matrix, which is modelled as a sum of heritable and non-heritable random variation (BOX 1). Mixed models have historically been a theoretically appealing but computationally intensive approach; however, recent computational advances (such as the EMMAX and TASSEL software) have now made it possible to apply them to GWA studies35,36. Methods that explicitly model population structure, family structure and cryptic relatedness are expected to perform better in the presence of these complexities than methods that do not, and this has now been confirmed35,36. For example, in an analysis of seven Wellcome Trust Case Control Consortium phenotypes, the application of mixed models consistently


yielded values of λGC that were less than 1.01, in contrast to other approaches34.

Population structure: a fixed or random effect? An important and unanswered question is whether population structure should be modelled as part of the set of random effects, together with family structure and cryptic relatedness, or as a separate fixed effect requiring principal component covariates and additional model parameters35,36 (BOX 1). Inclusion in random effects is much simpler, and has been shown to provide a sufficient correction for stratification in data sets from Finland and the UK35.

However, population structure is actually a fixed effect (that is, its effect as a function of genetic ancestry is the same for all samples), and spurious associations might result if it is modelled as a random effect based on overall covariance, particularly in the case of unusually differentiated markers. Modelling population structure as a fixed effect provides a higher level of certainty in correcting for stratification but requires running PCA (or a similar method) to infer the genetic ancestry of each sample36. If family structure is present, inferring genetic ancestry by PCA is a challenge because family relatedness may lead to artefactual principal components19. A possible solution is to compute principal components using SNP loadings inferred from a set of unrelated samples, either by using a different set of samples from those in the disease study or by using an unrelated subset of samples from the disease study37. This is likely to be sufficient when the set of unrelated samples used is very large relative to the magnitude of population structure effects. However, unless sample sizes are very large, principal components computed from external SNP loadings will be biased towards zero owing to statistical noise in the SNP loadings11,38. This motivates further work on PCA in related samples.

Modelling phenotypes as fixed. Mixed models model phenotypes using a fixed set of genotypes. However, as an alternative to mixed models, genotypes can be modelled using a fixed set of phenotypes, a theoretically appealing approach that makes fewer assumptions about phenotypic covariance structure39,40. Simulations in the absence of unusually differentiated markers have shown that using the genotypic covariance matrix to account for both population and family structure can effectively control spurious associations under a variety of settings39; this can be done using the ROADTRIPS software. However, in the case of unusually differentiated markers, normality assumptions (about genotype distributions) underlying the test statistics will be violated, and stratification may lead to confounding unless principal component covariates are used. The question of whether to model random effects only or to include principal component covariates as fixed effects is analogous to the mixed model framework. When viewing phenotypes as fixed, principal component covariates may be essential, as modelling only random effects leads to a uniform correction factor in the absence of missing data39.

Simulations

We carried out two simulations to show the properties of the above methods in correcting for stratification at normally differentiated or unusually differentiated markers in the presence or absence of family structure. We considered a case–control study with two subpopulations, POP1 and POP2, with 300 cases and 200 controls from POP1 and 200 cases and 300 controls from POP2. We simulated 99,900 normally differentiated markers based on FST(POP1,POP2) = 0.01 (REF. 41) and 100 unusually differentiated markers based on allele frequency difference equal to 0.6 with both minor allele frequencies uniformly distributed on [0.0,0.4]21. In simulation 1, all individuals were unrelated. In simulation 2, all individuals from POP1 were unrelated and individuals from POP2 included 80 case–case sibling pairs, 40 case–control sibling pairs and 130 control–control sibling pairs. We computed λGC for each of the following methods: an uncorrected Armitage trend test, EIGENSTRAT21, EMMAX without principal component covariates35, EMMAX with principal component covariates35 and ROADTRIPS39. All principal component runs used only one principal component, but the additional inclusion of random principal components has little effect on results21. Power to detect causal variants may vary between methods, but our focus here was on correcting false-positive associations. We did not simulate the approach described in REF. 31, as this method is completely immune from stratification, ensuring a value of 1.00 in all entries of the table; this approach has appealing properties, but may have reduced power in some instances (see above). We note that the method of REF. 39 with principal component covariates incorporated is an approach of potentially high interest, but it is not currently implemented in the ROADTRIPS software.

The results of the simulations are shown in TABLE 1. EIGENSTRAT is effective in correcting for population stratification at both normally and unusually differentiated markers (simulation 1) but does not control for

Box 1 | Mixed models

Simple linear models
Simple linear models represent the phenotype Y as a function of fixed effects X:

\[ Y = X\beta + \varepsilon \]

Here, X denotes the genotype at the candidate marker in addition to optional covariates, such as gender or age, β denotes coefficients of fixed effects and ε is a normally distributed noise term that accounts for unexplained variation in Y.

Principal components analysis (PCA) addresses the issue of population substructure by including principal component covariates in X to explicitly model the ancestry of each individual. If genotype is not causally related to phenotype but genotype and phenotype are both correlated to ancestry, test statistics will be inflated. Using PCA to explicitly model genetic ancestry removes this confounding effect. However, PCA only accounts for fixed effects of genetic ancestry; it does not account for relatedness between individuals, which may also cause inflation in test statistics.

Linear mixed models
Linear mixed models represent the phenotype Y as a function of fixed effects X plus random effects u:

\[ Y = X\beta + u + \varepsilon, \qquad \mathrm{Var}(u) = \sigma_g^2 K \]

Here, u denotes a component of the overall noise variance u + ε that is distributed according to a kinship matrix K. Thus, u represents the heritable component of random variation and ε represents the non-heritable component of random variation.

The kinship matrix K is defined according to the pairwise genotypic similarity of individuals, so its structure is influenced by population structure, family structure and cryptic relatedness. The parameter σ_g² relates this structure to the phenotype Y. σ_g² captures the extent to which genetically similar individuals are phenotypically similar, thus removing confounding effects.

The optimal formulation of K, the importance of including principal component covariates in fixed effects X and the effects of these choices have not yet been fully explored.
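To make the linear mixed model of Box 1 concrete, the following is a minimal sketch of how a single SNP could be tested under Y = Xβ + u + ε with Var(u) = σ_g² K. It is not the EMMAX or TASSEL implementation. The function names, the normalized-genotype kinship estimate, the grid search over the heritable variance fraction h² = σ_g²/(σ_g² + σ_e²) and the Wald statistic are all assumptions chosen for brevity.

```python
import numpy as np


def kinship_from_genotypes(geno):
    """One simple kinship estimate (an assumption of this sketch):
    K = Z Z^T / M, where Z holds column-standardized genotypes."""
    z = (geno - geno.mean(axis=0)) / geno.std(axis=0)
    return z @ z.T / z.shape[1]


def lmm_wald_test(y, snp, kinship, covariates=None, h2_grid=None):
    """Wald chi-square for the SNP effect in Y = X beta + u + eps.

    Rotating by the eigenvectors of K makes the covariance diagonal, so each
    candidate h2 reduces to a weighted least-squares fit; the h2 with the
    highest profile likelihood is kept.
    """
    y = np.asarray(y, dtype=float)
    n = y.shape[0]
    cols = [np.ones(n), np.asarray(snp, dtype=float)]
    if covariates is not None:
        # e.g. principal component covariates, one column per covariate
        cols.extend(np.asarray(covariates, dtype=float).reshape(n, -1).T)
    X = np.column_stack(cols)

    eigvals, eigvecs = np.linalg.eigh(kinship)
    eigvals = np.clip(eigvals, 0.0, None)  # guard against tiny negatives
    y_rot = eigvecs.T @ y
    X_rot = eigvecs.T @ X

    if h2_grid is None:
        h2_grid = np.linspace(0.0, 0.99, 100)

    best = None
    for h2 in h2_grid:
        d = h2 * eigvals + (1.0 - h2)      # per-sample variance, up to scale
        Xw = X_rot / d[:, None]
        xtx = X_rot.T @ Xw                  # X^T D^-1 X
        beta = np.linalg.solve(xtx, Xw.T @ y_rot)
        resid = y_rot - X_rot @ beta
        sigma2 = np.sum(resid ** 2 / d) / n
        loglik = -0.5 * (n * np.log(2 * np.pi * sigma2) + np.sum(np.log(d)) + n)
        if best is None or loglik > best[0]:
            best = (loglik, h2, beta, sigma2 * np.linalg.inv(xtx))

    _, h2, beta, cov = best
    return {"h2": float(h2), "beta": float(beta[1]),
            "chi2": float(beta[1] ** 2 / cov[1, 1])}
```

Passing principal components through the hypothetical `covariates` argument corresponds to modelling population structure as a fixed effect, whereas omitting them leaves all sample structure to the random effect u. When K is the identity matrix the weights are constant and the estimate reduces to ordinary least squares, which gives a simple sanity check.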


family structure (simulation 2). EMMAX corrects for both family structure and population structure except for a modest residual inflation at unusually differentiated markers, which is completely removed by EMMAX with principal component covariates; if the number of unusually differentiated markers is small, modest inflation at such markers may not be a major concern. ROADTRIPS corrects for family structure but not for population stratification at unusually differentiated markers, although the incorporation of principal component covariates could potentially address this limitation. We note that for each method, dividing association statistics by residual λGC is guaranteed to produce statistics with λGC = 1, but this approach may be inadequate for spurious associations at unusually differentiated markers and/or may not maximize power if family structure (or cryptic relatedness) is not fully modelled.

Low-frequency and rare variants

GWA studies have largely focused on common variants, but because most genetic heritability34 remains unexplained, future work will increasingly focus on variants of low minor-allele frequency (0.5% < MAF < 5%) or rare variants (MAF < 0.5%)42. First, new low-frequency variants will be identified by the 1000 Genomes Project and included in next-generation genotyping arrays. Here, the issues are generally similar to those involving common variants, except that deviation from model specification is more likely; for example, if normality assumptions are violated or the genotypic variance of a SNP varies across subpopulations43. Second, exome resequencing projects will aim to identify genes in which individuals with extreme phenotypes have an aggregate excess or deficiency of rare non-synonymous variants44. Differences in the range of allele frequencies across ancestral populations make stratification a potential concern, but genetic ancestry

Table 1 | Effectiveness of different approaches for correcting for stratification

Method                          Simulation 1,  Simulation 1,  Simulation 2,  Simulation 2,
                                FST = 0.01     Δ = 0.6        FST = 0.01     Δ = 0.6
Armitage trend                  1.40           48.4           1.57           48.3
EIGENSTRAT                      1.00           1.00           1.17           1.14
EMMAX*                          1.00           2.05           1.01           1.62
EMMAX* + principal components   1.00           1.02           1.01           1.01
ROADTRIPS                       1.00           48.4           1.00           48.3

We list the genomic control λ (λGC) of each method for normally differentiated markers (FST = 0.01) and unusually differentiated markers (Δ = 0.6) in simulation 1 and simulation 2. In each case, λGC was computed as the median χ² (1 degree of freedom) statistic (restricting to the subclass of markers tested) divided by 0.455. EIGENSTRAT corrects for population structure (simulation 1), EMMAX and ROADTRIPS correct for family structure and for population structure at normally differentiated markers (FST = 0.01), and EMMAX + principal components corrects for family structure and for population structure at normally or highly differentiated markers (FST = 0.01 or Δ = 0.6). We note that the approach of REF. 31 is immune to all of these confounders, implying a value of λGC = 1.00 for each column of the table. *EMMAX can use either the identity by state (IBS) or Balding–Nichols estimate of the kinship matrix35. Results for IBS are shown in the table, and results for Balding–Nichols are (from left to right) 1.00, 1.91, 1.00 and 1.28 for EMMAX and 1.00, 1.03, 1.00 and 0.99 for EMMAX + principal components.
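The first column of Table 1 can be approximated on a small scale. The sketch below is a simplified stand-in for simulation 1 restricted to normally differentiated markers: it uses the same case and control counts per subpopulation and FST = 0.01, but far fewer markers, no unusually differentiated markers and no family structure. Allele frequencies are drawn from a Balding–Nichols beta model with ancestral frequencies uniform on [0.1, 0.9], which is an assumption of this example. The uncorrected statistic is N times the squared genotype–phenotype correlation, and the corrected statistic regresses the top principal component out of both genotype and phenotype before computing (N – K – 1) times the squared correlation, in the spirit of EIGENSTRAT. All function names are hypothetical.

```python
import numpy as np

CHI2_1DF_MEDIAN = 0.455  # null median used in the note to Table 1


def simulate_stratified_study(n_markers, fst=0.01, seed=0):
    """POP1: 300 cases, 200 controls; POP2: 200 cases, 300 controls."""
    rng = np.random.default_rng(seed)
    pop = np.repeat([0, 1], 500)
    pheno = np.concatenate([np.repeat([1.0, 0.0], [300, 200]),
                            np.repeat([1.0, 0.0], [200, 300])])
    # Balding-Nichols model (assumed here): subpopulation frequencies are
    # beta-distributed around an ancestral frequency.
    p_anc = rng.uniform(0.1, 0.9, size=n_markers)
    a = p_anc * (1 - fst) / fst
    b = (1 - p_anc) * (1 - fst) / fst
    freqs = rng.beta(a, b, size=(2, n_markers))
    geno = rng.binomial(2, freqs[pop]).astype(float)
    return geno, pheno


def squared_correlation(geno, pheno):
    """Squared Pearson correlation of each marker with the phenotype."""
    g = geno - geno.mean(axis=0)
    y = pheno - pheno.mean()
    return (y @ g) ** 2 / ((g ** 2).sum(axis=0) * (y ** 2).sum())


def top_principal_components(geno, k=1):
    """Top-k eigenvectors of the sample covariance of normalized genotypes."""
    z = (geno - geno.mean(axis=0)) / geno.std(axis=0)
    eigvals, eigvecs = np.linalg.eigh(z @ z.T / z.shape[1])
    return eigvecs[:, -k:]  # eigh returns eigenvalues in ascending order


def lambda_gc(chi2):
    return float(np.median(chi2) / CHI2_1DF_MEDIAN)


def compare_corrections(n_markers=8000, seed=0, k=1):
    """Return (lambda_GC uncorrected, lambda_GC after PC adjustment)."""
    geno, pheno = simulate_stratified_study(n_markers, seed=seed)
    n = len(pheno)
    uncorrected = n * squared_correlation(geno, pheno)
    pcs = top_principal_components(geno, k)
    geno_adj = geno - pcs @ (pcs.T @ geno)     # remove ancestry from genotypes
    pheno_adj = pheno - pcs @ (pcs.T @ pheno)  # and from the phenotype
    corrected = (n - k - 1) * squared_correlation(geno_adj, pheno_adj)
    return lambda_gc(uncorrected), lambda_gc(corrected)
```

Calling `compare_corrections()` should give an uncorrected λGC clearly above 1 and a corrected value near 1, qualitatively matching the Armitage trend and EIGENSTRAT rows for simulation 1 at FST = 0.01. Exact values depend on the seed and on the modelling assumptions above, so they are not expected to reproduce the table to two decimal places.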

Glossary

Ancestry-informative markers
Genetic markers ascertained for large differences in allele frequency between subpopulations that are genotyped to infer genetic ancestry in new samples.

Armitage trend test
A standard χ² (1 degree of freedom) association test computed as the number of samples times the squared correlation between genotype and phenotype.

Cryptic relatedness
Sample structure due to distant relatedness among samples with no known family relationships.

Differential bias
Spurious differences in allele frequencies between cases and controls due to differences in sample collection, sample preparation and/or genotyping assay procedures.

Exome resequencing
A study design in which exon capture technologies are used to obtain resequencing data covering all exonic regions for each individual in the study.

Family-based association tests
A class of association tests that uses families with one or more affected children as the subjects rather than unrelated cases or controls. The analysis treats the allele that is transmitted to (one or more) affected children from each parent as a 'case' and the untransmitted alleles as 'controls' to avoid the effects of population structure.

Family structure
Sample structure due to familial relatedness among samples.

FST
A measure of the genetic distance between two populations that describes the proportion of overall genetic variation that is due to differences between populations.

Genetic drift
Random fluctuations in allele frequencies over time due to sampling effects, particularly in small populations.

Genetic heritability
The proportion of the total phenotypic variation in a given characteristic that can be attributed to additive genetic effects. In the broad sense, heritability involves all additive and non-additive genetic variance, whereas in the narrow sense, it involves only additive genetic variance.

Genetic matching
A method of association testing in which cases and controls are matched for genetic ancestry, as inferred by principal components analysis or other methods.

Genomic control
A method for detecting (or detecting and correcting for) stratification based on the genome-wide inflation of association statistics.

Mixed models
A class of models in which phenotypes are modelled using both fixed effects (candidate SNPs and fixed covariates) and random effects (the phenotypic covariance matrix).

Multidimensional scaling
A dimensionality reduction technique, similar to principal components analysis, in which points in a high-dimensional space are projected into a lower-dimensional space while approximately preserving the distance between points.

Population structure
Sample structure due to differences in genetic ancestry among samples.

Principal components analysis
A dimensionality reduction technique used to infer continuous axes of variation in genetic data, often representing genetic ancestry.

Rank statistic
A statistic describing the rank, across markers, of association of each marker. Rank statistics can be transformed into quantiles of a standard normal distribution that can be combined with other statistics.

SNP loadings
The correlations of each SNP to a given principal component in principal components analysis. The principal component coordinates of each sample are proportional to the sum of normalized genotypes weighted by SNP loadings.

Structured association
A method for correcting for stratification in which samples are assigned to subpopulation clusters and evidence of association is stratified by cluster.

Transmission disequilibrium test
A family-based association test involving case–parent trios in which alleles transmitted from parents to children are compared with untransmitted alleles.
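The glossary entry on SNP loadings, together with the suggestion in the text to compute principal components from loadings inferred in unrelated samples, can be sketched as follows. Loadings are taken here to be the right singular vectors of the normalized genotype matrix of a reference set, and new samples are projected using the reference means and standard deviations. The function names and this particular scaling are assumptions of the example and may differ from the conventions of EIGENSTRAT or other software.

```python
import numpy as np


def fit_snp_loadings(ref_geno, k):
    """Estimate SNP loadings for the top-k components from reference samples.

    ref_geno is a (samples x markers) matrix of 0/1/2 genotypes, ideally
    from unrelated individuals.
    """
    mean = ref_geno.mean(axis=0)
    std = ref_geno.std(axis=0)
    if np.any(std == 0):
        raise ValueError("monomorphic markers must be removed first")
    z = (ref_geno - mean) / std
    _, _, vt = np.linalg.svd(z, full_matrices=False)
    return mean, std, vt[:k].T  # loadings have shape (markers, k)


def project_samples(geno, mean, std, loadings):
    """Principal component coordinates: normalized genotypes weighted by
    SNP loadings and summed over markers."""
    return ((geno - mean) / std) @ loadings
```

Projecting the reference samples themselves recovers their own principal component coordinates. For samples outside the reference set, the text notes that coordinates obtained from external loadings are biased towards zero unless the reference sample is very large, and this sketch makes no attempt to correct for that shrinkage.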


can be inferred from genotyping array data 1. McCarthy, M. I. et al. Genome-wide association 29. Abecasis, G. R., Cardon, L. R. & Cookson, W. O. studies for complex traits: consensus, uncertainty A general test of association for quantitative traits in from the same samples, if available, and and challenges. Nature Rev. Genet. 9, 356–369 nuclear families. Am. J. Hum. Genet. 66, 279–292 included as a covariate. Finally, the advent of (2008). (2000). 2. Campbell, C. D. et al. Demonstrating stratification in a 30. Lange, C., DeMeo, D. L. & Laird, N. M. Power and whole-exome or whole-genome resequenc- European American population. Nature Genet. 37, design considerations for a general class of family-based ing raises the question of whether rare vari- 868–872 (2005). association tests: quantitative traits. Am. J. Hum. Genet. 3. Tian, C., Gregersen, P. K. & Seldin, M. F. 71, 1330–1341 (2002). ants can be used to infer genetic ancestry Accounting for ancestry: population substructure 31. Won, S. et al. On the analysis of genome-wide with greater precision, perhaps using differ- and genome-wide association studies. Hum. Mol. association studies in family-based designs: a universal, Genet. 17, R143–R150 (2008). robust analysis approach and an application to four ent methods from those currently applied to 4. Tian, C. et al. Analysis and application of European genome-wide association studies. PLoS Genet. 5, common variants. genetic substructure using 300 K SNP information. e1000741 (2009). PLoS Genet. 4, e4 (2008). 32. Lasky-Su, J. et al. On genome-wide association studies 5. Voight, B. F. & P r i tc h a rd , J . K. Confounding from for family-based designs: an integrative analysis Conclusion cryptic relatedness in case–control association studies. approach combining ascertained family samples with PLoS Genet. 1, e32 (2005). unselected controls. Am. J. Hum. Genet. 86, 573–580 Many different methods of correcting for 6. Weir, B. S., Anderson, A. D. & Hepler, A. B. (2010). 
stratification have been developed, and all of Genetic relatedness analysis: modern data and new 33. Yu, J. et al. A unified mixed-model method for challenges. Nature Rev. Genet. 7, 771–780 (2006). association mapping that accounts for multiple these methods have important advantages. 7. Devlin, B. & Roeder, K. Genomic control for association levels of relatedness. Nature Genet. 38, 203–208 Although mixed models are relatively studies. Biometrics 55, 997–1004 (1999). (2006). 8. Pritchard, J. K. & Rosenberg, N. A. Use of unlinked 34. Visscher, P. M., Hill, W. G. & Wray, N. R. Heritability in new and untested, they seem to offer a genetic markers to detect population stratification the genomics era — concepts and misconceptions. practical and comprehensive approach for in association studies. Am. J. Hum. Genet. 65, Nature Rev. Genet. 9, 255–266 (2008). 220–228 (1999). 35. Kang, H. M. et al. Variance component model to simultaneously addressing confounding due 9. Reich, D. E. & Goldstein, D. B. Detecting association in account for sample structure in genome-wide to population stratification, family structure a case–control study while correcting for population association studies. Nature Genet. 42, 348–354 stratification. Genet. Epidemiol. 20, 4–16 (2001). (2010). and cryptic relatedness. 10. Clayton, D. G. et al. Population structure, differential 36. Zhang, Z. et al. Mixed linear model approach adapted In studies in which stratification is not bias and genomic control in a large-scale, case–control for genome-wide association studies. Nature Genet. association study. Nature Genet. 37, 1243–1246 42, 355–360 (2010). a serious concern, an appealing and simple (2005). 37. Zhu, X., Li, S., Cooper, R. S. & Elston, R. C. approach is to use mixed models without 11. Price, A. L. et al. The impact of divergence time on the A unified association analysis approach for family nature of population structure: an example from and unrelated samples correcting for stratification. 
including principal component covariates. This approach could be applied in studies of populations of homogeneous ancestry, studies of structured populations in which structure is due to very recent genetic drift and studies of any population in which PCA or related methods, applied to either the entire sample or a subset of unrelated samples, indicate that there is no substantial stratification (that is, phenotypes are not highly correlated with any of the top principal components).

For studies that do not meet any of the above criteria, an alternative approach is to use mixed models with principal component covariates. In family-based studies in which the within-family component contributes much of the overall statistical power, the approach of REF. 31 may also prove useful. In data sets that do not contain family structure or cryptic relatedness, simpler association tests (with or without principal component correction, based on the above criteria) will probably be sufficient (REFS 21,23).

Alkes L. Price, Noah A. Zaitlen, David Reich and Nick Patterson are at the Program in Medical and Population Genetics, Broad Institute of Harvard and Massachusetts Institute of Technology, Cambridge, Massachusetts 02142, USA.
Alkes L. Price and Noah A. Zaitlen are also at the Department of Epidemiology and Department of Biostatistics, Harvard School of Public Health, Boston, Massachusetts 02115, USA.
David Reich is also at the Department of Genetics, Harvard Medical School, Boston, Massachusetts 02115, USA.
Correspondence to A.L.P.
e-mail: [email protected]
doi:10.1038/nrg2813
Published online 15 June 2010

Iceland. PLoS Genet. 5, e1000505 (2009).
12. Devlin, B., Bacanu, S. A. & Roeder, K. Genomic control to the extreme. Nature Genet. 36, 1129–1130 (2004); author reply in 36, 1131 (2004).
13. Pritchard, J. K., Stephens, M. & Donnelly, P. Inference of population structure using multilocus genotype data. Genetics 155, 945–959 (2000).
14. Rosenberg, N. A. et al. Genetic structure of human populations. Science 298, 2381–2385 (2002).
15. Pritchard, J. K., Stephens, M., Rosenberg, N. A. & Donnelly, P. Association mapping in structured populations. Am. J. Hum. Genet. 67, 170–181 (2000).
16. Alexander, D. H., Novembre, J. & Lange, K. Fast model-based estimation of ancestry in unrelated individuals. Genome Res. 19, 1655–1664 (2009).
17. Menozzi, P., Piazza, A. & Cavalli-Sforza, L. Synthetic maps of human gene frequencies in Europeans. Science 201, 786–792 (1978).
18. Cavalli-Sforza, L. L., Menozzi, P. & Piazza, A. The History and Geography of Human Genes (Princeton Univ. Press, 1994).
19. Patterson, N., Price, A. L. & Reich, D. Population structure and eigenanalysis. PLoS Genet. 2, e190 (2006).
20. Novembre, J. & Stephens, M. Interpreting principal component analyses of spatial population genetic variation. Nature Genet. 40, 646–649 (2008).
21. Price, A. L. et al. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genet. 38, 904–909 (2006).
22. Zhu, X., Zhang, S., Zhao, H. & Cooper, R. S. Association mapping, using a mixture model for complex traits. Genet. Epidemiol. 23, 181–196 (2002).
23. Purcell, S. et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81, 559–575 (2007).
24. Luca, D. et al. On the use of general control samples for genome-wide association studies: genetic matching highlights causal variants. Am. J. Hum. Genet. 82, 453–463 (2008).
25. Lee, A. B., Luca, D., Klei, L., Devlin, B. & Roeder, K. Discovering genetic ancestry using spectral graph theory. Genet. Epidemiol. 34, 51–59 (2010).
26. Seldin, M. F. & Price, A. L. Application of ancestry informative markers to association studies in European Americans. PLoS Genet. 4, e5 (2008).
27. Spielman, R. S., McGinnis, R. E. & Ewens, W. J. Transmission test for linkage disequilibrium: the insulin gene region and insulin-dependent diabetes mellitus (IDDM). Am. J. Hum. Genet. 52, 506–516 (1993).
28. Laird, N. M. & Lange, C. Family-based designs in the age of large-scale gene-association studies. Nature Rev. Genet. 7, 385–394 (2006).
Am. J. Hum. Genet. 82, 352–365 (2008).
38. Lee, S., Zou, F. & Wright, F. A. Convergence and prediction of principal component scores in high-dimensional settings. Ann. Stat. (in the press).
39. Thornton, T. & McPeek, M. S. ROADTRIPS: case–control association testing with partially or completely unknown population and pedigree structure. Am. J. Hum. Genet. 86, 172–184 (2010).
40. Rakovski, C. S. & Stram, D. O. A kinship-based modification of the Armitage trend test to address hidden population structure and small differential genotyping errors. PLoS ONE 4, e5825 (2009).
41. Holsinger, K. E. & Weir, B. S. Genetics in geographically structured populations: defining, estimating and interpreting FST. Nature Rev. Genet. 10, 639–650 (2009).
42. Manolio, T. A. et al. Finding the missing heritability of complex diseases. Nature 461, 747–753 (2009).
43. Abney, M. & McPeek, M. S. Association testing with principal-components-based correction for population stratification [abstract number 58]. Proc. of the 58th Annual Meeting of The American Soc. of Human Genetics [online], http://www.ashg.org/2008meeting/abstracts/fulltext/ (2008).
44. Cohen, J. C. et al. Multiple rare alleles contribute to low plasma levels of HDL cholesterol. Science 305, 2869–2872 (2004).

Competing interests statement
The authors declare no competing financial interests.

FURTHER INFORMATION
1000 Genomes Project: http://www.1000genomes.org
ADMIXTURE software: http://www.genetics.ucla.edu/software/admixture
EIGENSTRAT, implemented in the EIGENSOFT software: http://www.hsph.harvard.edu/faculty/alkes-price/software
EMMAX software: http://genetics.cs.ucla.edu/emmax
FBAT software: http://biosun1.harvard.edu/~fbat/fbat.htm
Nature Reviews Genetics article series on Genome-wide association studies: http://www.nature.com/nrg/series/gwas/index.html
NHGRI Catalog of Published Genome-Wide Association Studies: http://genome.gov/gwastudies
PLINK software: http://pngu.mgh.harvard.edu/~purcell/plink
QTDT software: http://www.sph.umich.edu/csg/abecasis/QTDT
ROADTRIPS software: http://www.stat.uchicago.edu/~mcpeek/software/index.html
STRUCTURE and STRAT software: http://pritch.bsd.uchicago.edu/software.html
TASSEL software: http://www.maizegenetics.net
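
One of the criteria in the recommendations is that phenotypes are not highly correlated with any of the top principal components. The following is a minimal illustrative sketch of such a check, not the implementation used by EIGENSTRAT or any other software listed here. The function name, the default of 10 principal components, and the assumption of a complete genotype matrix coded as allele counts are ours. The text gives no numerical cutoff for "highly correlated", so none is applied.

```python
import numpy as np

def top_pc_phenotype_correlations(genotypes, phenotype, n_pcs=10):
    """Correlate a phenotype with the top principal components of genotype data.

    genotypes: (n_samples, n_snps) array of allele counts (0, 1, 2), no missing values.
    phenotype: (n_samples,) array, e.g. 0/1 case-control status or a quantitative trait.
    n_pcs: number of top components to examine (10 is an illustrative default).
    Returns one Pearson correlation per principal component.
    """
    g = np.asarray(genotypes, dtype=float)
    y = np.asarray(phenotype, dtype=float)

    # Centre each SNP, drop monomorphic SNPs, and scale to unit variance.
    g = g - g.mean(axis=0)
    sd = g.std(axis=0)
    keep = sd > 0
    g = g[:, keep] / sd[keep]

    # Left singular vectors of the standardized matrix give sample-level PCs,
    # ordered by decreasing singular value.
    u, _, _ = np.linalg.svd(g, full_matrices=False)
    pcs = u[:, :n_pcs]

    # Pearson correlation between the phenotype and each PC.
    y_c = y - y.mean()
    pcs_c = pcs - pcs.mean(axis=0)
    return (pcs_c.T @ y_c) / (np.linalg.norm(pcs_c, axis=0) * np.linalg.norm(y_c))


if __name__ == "__main__":
    # Simulated example: two subpopulations with differing allele frequencies,
    # and all cases drawn from one subpopulation (an extreme form of stratification).
    rng = np.random.default_rng(0)
    pop = np.repeat([0, 1], 100)
    p0 = rng.uniform(0.1, 0.9, size=500)
    p1 = np.clip(p0 + rng.normal(0.0, 0.15, size=500), 0.05, 0.95)
    freqs = np.where(pop[:, None] == 0, p0, p1)
    genotypes = rng.binomial(2, freqs)
    print(np.round(top_pc_phenotype_correlations(genotypes, pop, n_pcs=5), 2))
```

A large absolute correlation with any component would suggest that ancestry and phenotype are confounded, in which case the criterion of no substantial stratification is not met. How large is too large remains a judgment call that this sketch does not settle.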
