Genome-Wide Association Study (GWAS)

James J. Yang January 24, 2018

School of Nursing, University of Michigan
Ann Arbor, Michigan

Outline

1 What are Genome-Wide Association Studies?

2 Linkage versus Association

3 GWAS Data

4 Issues with GWAS

5 Correcting Association Analysis for
    Genomic Control
    Principal Components Analysis
    Linear Mixed Models

6 Genome-wide Significance Level

7 Analysis Protocols

8 Other Studies using GWAS

9 Resources

What are genome-wide association studies

Genome-wide association studies (GWASs) are unbiased genome screens of unrelated individuals and appropriately matched controls or parent-affected child trios to establish whether any genetic variant is associated with a trait. These studies typically focus on associations between single-nucleotide polymorphisms (SNPs) and major diseases. –nature.com

What are genome-wide association studies

A genome-wide association study is an approach that involves rapidly scanning markers across the complete sets of DNA, or genomes, of many people to find genetic variations associated with a particular disease. Once new genetic associations are identified, researchers can use the information to develop better strategies to detect, treat and prevent the disease. Such studies are particularly useful in finding genetic variations that contribute to common, complex diseases, such as asthma, cancer, diabetes, heart disease and mental illnesses. –National Human Genome Research Institute, NIH

Figure 1. The Genomewide Association Study. The genomewide association study is typically based on a case–control design in which single-nucleotide polymorphisms (SNPs) across the human genome are genotyped. Panel A depicts a small locus on chromosome 9, and thus a very small fragment of the genome. In Panel B, the strength of association between each SNP and disease is calculated on the basis of the prevalence of each SNP in cases and controls. In this example, SNPs 1 and 2 on chromosome 9 are associated with disease, with P values of $10^{-12}$ and $10^{-8}$, respectively. The plot in Panel C shows the P values for all genotyped SNPs that have survived a quality-control screen, with each chromosome shown in a different color. The results implicate a locus on chromosome 9, marked by SNPs 1 and 2, which are adjacent to each other (graph at right), and other neighboring SNPs.

Manolio (2010) Genomewide Association Studies and Assessment of the Risk of Disease. NEJM 363(2):166-176.

Linkage vs. Association

                  Linkage                       Association
Data Structures   Pedigree                      Independent samples
                  (within families)             (across families)
Concepts          Recombination events          Linkage disequilibrium
Statistics        Likelihood of recombinants    Statistical correlation

GWAS Data

Genotype Data: about $10^6$ variables
Phenotype Data: about $10^3$ individuals

Single Nucleotide Polymorphism (SNP)

Responsible for 90% of all human genetic variation
A SNP occurs every 100-300 base pairs
Almost 12.8 million SNPs in the NCBI dbSNP database (as of 2008)
Like microsatellites, they are used as markers because
    They occur frequently
    They are stable to genotype
Probe length is 25 bases

Affymetrix GeneChip

Genome-Wide Human SNP Array 6.0: 1.8 million markers (946K probes for copy number variants and 906.6K SNPs)

Genotype Data

For a SNP with two alleles coded as A and B, there are three possible genotypes: {AA, AB, BB}.
Additive coding (B as reference): {AA = 0, AB = 1, BB = 2}

Genotype Data: $Z_{ij}$ = number of copies of the reference allele

         ID
SNP      1    2    3    4   ···   n−2   n−1    n
1        0    1    2    2   ···    2     1     1
2        2    2    0   -1   ···    2     0     1
3        0    1    2    0   ···    2     2     0
...
G−1      2    0    0    1   ···    2    -1     2
G        0    0    2    1   ···    1     2     1

Phenotype Data with Genotype

Phenotype Data:

X        7.32   3.09   5.24   4.57   ···    7.12    3.96    4.72
Y        1      0      1      1      ···    0       0       0
W        0.85  -0.62   0.51   0.67   ···   -0.10   -0.29   -1.56

Genotype Data: $Z_{ij}$ = number of copies of the reference allele

         ID
SNP      1    2    3    4   ···   n−2   n−1    n
1        0    1    2    2   ···    2     1     1
2        2    2    0   -1   ···    2     0     1
3        0    1    2    0   ···    2     2     0
...
G−1      2    0    0    1   ···    2    -1     2
G        0    0    2    1   ···    1     2     1
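The additive coding above can be sketched in a few lines of Python. This is only an illustration: the helper names `additive_code` and `encode_matrix` are hypothetical, and treating -1 as the code for an unreadable genotype call is an assumption suggested by the -1 entries in the example matrix.

```python
# Hypothetical helpers illustrating additive genotype coding (B as reference).
# Assumption: -1 marks a genotype call that could not be read.
MISSING = -1


def additive_code(genotype: str, reference: str = "B") -> int:
    """Return the number of copies of `reference` in a two-letter genotype."""
    if len(genotype) != 2 or any(allele not in "AB" for allele in genotype):
        return MISSING
    return genotype.count(reference)


def encode_matrix(calls):
    """Encode a SNP-by-individual table of genotype calls as Z_ij."""
    return [[additive_code(g) for g in row] for row in calls]


# Two SNPs (rows) typed in four individuals (columns).
calls = [
    ["AA", "AB", "BB", "BB"],  # SNP 1
    ["BB", "BB", "AA", "??"],  # SNP 2, last call failed
]
Z = encode_matrix(calls)
```

Here `Z` reproduces the first four columns of SNP rows 1 and 2 in the example matrix.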

Direct and Indirect Association Study

Figure 1 | Testing SNPs for association by direct and indirect methods. a | A case in which a candidate SNP (red) is directly tested for association with a disease phenotype. For example, this is the strategy used when SNPs are chosen for analysis on the basis of prior knowledge about their possible function, such as missense SNPs that are likely to affect the function of a gene (green rectangle). b | The SNPs to be genotyped (red) are chosen on the basis of linkage disequilibrium (LD) patterns to provide information about as many other SNPs as possible. In this case, the SNP shown in blue is tested for association indirectly, as it is in LD with the other three SNPs. A combination of both strategies is also possible.

Hirschhorn and Daly (2005) Nature Reviews Genetics 6:95-108.

Tests of association: single SNP

Case-control phenotype: the Cochran-Armitage test; logistic regression

\[ \operatorname{logit}(P[Y_i = 1]) = \alpha + \beta X_i \]

Continuous phenotype: linear regression

\[ Y_i = \alpha + \beta X_i + \varepsilon_i \]

Ordered categorical phenotype: multinomial (cumulative) logit regression

\[ \operatorname{logit}(P[Y_i \le j]) = \alpha_j + \beta X_i \]
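As a rough illustration of two of these single-SNP tests, the sketch below fits the linear model by ordinary least squares and computes the Cochran-Armitage trend statistic. It assumes that X_i is the additive genotype code (0, 1, 2) and Y_i the phenotype, uses scores 0, 1, 2 for the trend test, and relies on the identity that the trend statistic equals N·r², where r is the correlation between genotype and case status. The function names are hypothetical.

```python
import math


def _centered_sums(x, y):
    """Return means and centred sums of squares/products for x and y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return mx, my, sxx, syy, sxy


def linear_regression(x, y):
    """Least-squares fit of Y = alpha + beta * X; returns (alpha, beta)."""
    mx, my, sxx, _, sxy = _centered_sums(x, y)
    beta = sxy / sxx
    return my - beta * mx, beta


def trend_test(x, y):
    """Cochran-Armitage trend test for genotype x (0/1/2) and status y (0/1).

    Returns (statistic, p_value); the statistic N * r^2 is referred to a
    chi-square distribution with 1 degree of freedom.
    """
    n = len(x)
    _, _, sxx, syy, sxy = _centered_sums(x, y)
    stat = n * sxy ** 2 / (sxx * syy)
    p_value = math.erfc(math.sqrt(stat / 2))  # chi-square(1) upper tail
    return stat, p_value
```

In a genome-wide scan the same function would be applied to each SNP in turn, giving one statistic and p-value per SNP.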

Issues with GWAS

GWAS tests correlation between phenotype and genotype, but correlation does not imply causation.
Poor reproducibility.
Correlation between marker and trait in GWAS may arise from
    Population structure/population stratification.
    Admixture.
    Ascertainment bias.
Multiple testing and the GWAS significance level.
More variables than samples ($p \gg n$ problem).

Population Structure

Population stratification: the presence of a systematic difference in allele frequencies between subpopulations in a population, possibly due to different ancestry.
Admixture: previously distinct populations begin to interbreed.
Cryptic relatedness: apparently unrelated individuals actually have some unsuspected relatedness.

False positive due to population stratification

Samples from Pop 1                      Samples from Pop 2
              A     a   Total                         A     a   Total
Affected     64    16     80            Affected      4    16     20
Unaffected   16     4     20            Unaffected   16    64     80
Total        80    20    100            Total        20    80    100

Combined Samples
             Allele A   Allele a   Total
Affected         68         32      100
Unaffected       32         68      100
Total           100        100      200

Within each population the odds ratio is 1, but in the combined sample the odds ratio = (68 × 68)/(32 × 32) ≈ 4.5, p-value = $6 \times 10^{-7}$
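The arithmetic behind this example can be checked with a short sketch. It computes the sample odds ratio ad/bc and a Pearson chi-square test without continuity correction for each 2×2 table. The slide does not say which test produced its p-value, so the p-value here need not match it exactly; the function names are hypothetical.

```python
import math


def odds_ratio(a, b, c, d):
    """Sample odds ratio for the 2x2 table [[a, b], [c, d]]."""
    return (a * d) / (b * c)


def chi2_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) and its 1-df p-value."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return stat, math.erfc(math.sqrt(stat / 2))


# Counts as (affected A, affected a, unaffected A, unaffected a).
pop1 = (64, 16, 16, 4)
pop2 = (4, 16, 16, 64)
combined = tuple(x + y for x, y in zip(pop1, pop2))

or_pop1 = odds_ratio(*pop1)
or_pop2 = odds_ratio(*pop2)
or_combined = odds_ratio(*combined)
stat_combined, p_combined = chi2_2x2(*combined)
```

The allele is unrelated to disease within either population, yet pooling the two samples produces a strong association, because both the allele frequency and the disease frequency differ between the populations.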

Figure 1 | Population structure within Europe. a, A statistical summary of genetic data from 1,387 Europeans based on principal component axis one (PC1) and axis two (PC2). Small coloured labels represent individuals and large coloured points represent median PC1 and PC2 values for each country. The inset map provides a key to the labels. The PC axes are rotated to emphasize the similarity to the geographic map of Europe. b, A magnification of the area around Switzerland from a showing differentiation within Switzerland by language. c, Genetic similarity versus geographic distance. Median genetic correlation between pairs of individuals as a function of geographic distance between their respective populations.

Novembre et al. (2008, Nature 456:98-101) showed that the eigenvectors of the SNP covariance matrix reflect population structure.

Population Structure and GWAS

Because of patterns of relatedness, allele frequencies can vary across populations (by more than sampling variation). Phenotypes can also reflect population structure, because

1 they are controlled by many loci (“polygenic” model) that tend to vary with the population structure, and/or

2 they vary with climate, diet or other environmental factors that differ across populations, and/or

3 ascertainment bias: recruitment of phenotypic groups differs across populations.

The result is a genome-wide tendency for genetic associations to reflect these effects rather than a direct causal effect of a SNP.

Confounding in Human Studies

The confounding effect depends on experimental design.
    Small effect needs large study to detect.
    Large population structure effect may mask true effect.
The demographic histories and environments
The pattern of recruitment (ascertainment bias).
(explicit confounders) age, gender, race
(implicit confounders) Cryptic relatedness: most studies of apparently unrelated individuals do include some close relatives.
    Pick one in each “family”
    Mixed models

Correcting Association Analysis for Confounding

Matching (Luca et al. 2008): genetic matching.
Genomic Control (Devlin and Roeder, 1999)
Transmission-Disequilibrium Test (Spielman et al. 1993): parental control.
Structured Association (Pritchard et al. 2000): stratified analysis
Regression Control: covariates adjustment
Principal Components Analysis (PCA) (Price et al. 2006): PCA ⇒ genetic background covariates
Linear Mixed Models (Yu et al. 2006; Kang et al. 2008): model kinship relationship

Genomic Control

Devlin and Roeder (1999, Biometrics 55:997-1004) showed that, under population structure, the distribution of the Cochran-Armitage trend test statistic is inflated by a constant multiplicative factor λ. The idea is to divide each test statistic by λ so that it fits the $\chi^2_1$ distribution.

Genomic Control

Instead of a $\chi^2_1$ distribution under $H_0$, Devlin and Roeder assumed

\[ X_1, \ldots, X_G \sim \lambda \chi^2_1 \]

under $H_0$ with population stratification. λ is estimated by

\[ \hat{\lambda} = \frac{\operatorname{median}(X_1, \ldots, X_G)}{\texttt{qchisq}(0.5, 1)} \]

where qchisq(0.5, 1) = 0.4549 is the median of the theoretical $\chi^2_1$ distribution. The adjusted statistic is

\[ X_i^{\mathrm{adj}} = X_i / \hat{\lambda} \]

and $X_i^{\mathrm{adj}} \sim \chi^2_1$.

Genomic Control

Limitation:

The inflation factor λ is assumed to be the same across the genome, and very few SNPs are assumed to tag causal variants. Neither assumption usually holds.
λ depends on sample size.

Genomic control is now mainly used to measure the problem (whether λ < 1.05), not to remedy it.
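The genomic-control estimate of λ and the adjusted statistics can be sketched as follows. The slides use R's `qchisq(0.5, 1)`; this Python version obtains the same constant from the fact that the median of a $\chi^2_1$ variable is the square of the 75th percentile of a standard normal. The function name `genomic_control` is hypothetical.

```python
from statistics import NormalDist, median

# Median of the chi-square(1) distribution, about 0.4549: if Z is standard
# normal, P(Z^2 <= m) = 0.5 exactly when sqrt(m) is the 75th percentile of Z.
CHI2_1_MEDIAN = NormalDist().inv_cdf(0.75) ** 2


def genomic_control(stats):
    """Return (lambda_hat, adjusted statistics) for 1-df test statistics."""
    lam = median(stats) / CHI2_1_MEDIAN
    adjusted = [x / lam for x in stats]
    return lam, adjusted
```

A value of `lam` close to 1 suggests little inflation; the adjusted statistics have median equal to the theoretical $\chi^2_1$ median by construction.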

QQ plot of observed p-values

QQ plot of adjusted p-values

Structured Association

Fig. 1. Estimated population structure. Each individual is represented by a thin vertical line, which is partitioned into K colored segments that represent the individual’s estimated membership fractions in K clusters. Black lines separate individuals of different populations. Populations are labeled below the figure, with their regional affiliations above it. Ten structure runs at each K produced nearly identical individual membership coefficients, having pairwise similarity coefficients above 0.97, with the exceptions of comparisons involving four runs at K = 3 that separated East Asia instead of Eurasia, and one run at K = 6 that separated Karitiana instead of Kalash. The figure shown for a given K is based on the highest probability run at that K.

Rosenberg et al. (2002) Science 298:2381-2385

Structured Association (STRAT)

Structured Association assumes that the population consists of subpopulations (“islands”).
Individual genotypes are admixed from different subpopulations.
Within each subpopulation, an association between a SNP and the trait is a true association.

Analysis methods:
Step 1: Use Structure and unlinked ancestry-informative markers (AIMs) (≈ 100 SNPs) to estimate the population structure, and then assign individuals to putative subpopulations.
Step 2: Test for association within the subpopulations.

Structured Association (STRAT)

Limitations:

1 Structured Association does not account for pedigree-level relationship.

2 Strata not always known a priori or easily identified

3 Lose power in many scenarios, including no stratification.

4 Structure and similar programs are computationally demanding.

Pritchard et al. (2000) Association mapping in structured populations. Am. J. Hum. Genet. 67:170-181.
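To give a feel for Step 2 (testing within subpopulations), the sketch below pools 2×2 tables across strata with the Mantel-Haenszel odds ratio. This is a stand-in for illustration, not the likelihood-based STRAT method of Pritchard et al., and it assumes the subpopulation of each individual is already known. The strata are the two-population counts from the false-positive example; the function names are hypothetical.

```python
def mantel_haenszel_or(strata):
    """Pooled odds ratio across strata.

    Each stratum is a 2x2 table (a, b, c, d) =
    (affected A, affected a, unaffected A, unaffected a).
    """
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den


def crude_or(strata):
    """Odds ratio after collapsing all strata into a single table."""
    a, b, c, d = (sum(col) for col in zip(*strata))
    return a * d / (b * c)


strata = [(64, 16, 16, 4), (4, 16, 16, 64)]
stratified = mantel_haenszel_or(strata)  # association within subpopulations
collapsed = crude_or(strata)             # association ignoring structure
```

Analysing within strata removes the spurious signal that appears when the two populations are pooled.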

Regression Control

Start with a simple regression with the phenotype as the response variable.
Add ≈ 100 widely spaced SNPs as covariates. Note: these covariates are informative about the underlying relatedness and are used to adjust for its effect.
Variable selection (e.g. LASSO) may be used to avoid overfitting.
Regression control is computationally faster than structured association.
Regression control is more robust to ascertainment bias than genomic control.
Regression control has all the flexibility of regression methods.

Principal Components Analysis (PCA)

The genotype is coded as a matrix $X = \{x_{ij}\}$ where $i = 1, \ldots, G$ and $j = 1, \ldots, n$. In general, $G \approx 10^6$ and $n \approx 10^3$. Let $K = X^T X$ be the $n \times n$ matrix. With X suitably standardized, K measures the average allelic correlation between pairs of individuals, viewed as a kinship coefficient. Applying PCA to K, we derive

$PC_1, PC_2, \ldots, PC_n$.

The top principal components are viewed as continuous axes of variation that reflect subpopulation genetic variation in the sample. Individuals with “similar” values for a particular top principal component will have “similar” ancestry for that axis.

Principal Components Analysis (PCA)

PC1: linear combination of individuals with maximal variance.

The closer the kinship of two individuals, the more similar their PC1 scores tend to be.

If there are only 2 subpopulations, PC1 usually distinguishes them.

Admixed individuals have intermediate PC1 scores.

Given PC1, the next k − 1 PCs distinguish k subpopulations (including admixture).
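A toy version of the PC1 computation can be written with power iteration. In this sketch K is formed as the n × n matrix of inner products between individuals (the columns of the G × n genotype matrix) after centring each SNP. The centring step, the omission of any per-SNP variance scaling, the iteration count and the function name `top_pc` are all assumptions made for illustration. Real analyses use dedicated software on far larger matrices.

```python
def top_pc(X, iters=200):
    """Leading eigenvector of K for a G x n genotype matrix X.

    Rows of X are SNPs and columns are individuals. Each SNP row is
    mean-centred, K[i][j] is the inner product between individuals i and j,
    and power iteration is used to find the top eigenvector (PC1).
    """
    n = len(X[0])
    Xc = [[v - sum(row) / n for v in row] for row in X]
    K = [[sum(r[i] * r[j] for r in Xc) for j in range(n)] for i in range(n)]
    v = [float(i + 1) for i in range(n)]  # deterministic, non-constant start
    for _ in range(iters):
        w = [sum(K[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v


# Five SNPs typed in four individuals; individuals 1-2 and 3-4 come from
# two subpopulations with different allele frequencies.
X = [
    [0, 0, 2, 2],
    [0, 1, 2, 2],
    [2, 2, 0, 0],
    [2, 1, 0, 0],
    [1, 1, 1, 1],
]
pc1 = top_pc(X)
```

In this toy data the PC1 scores of the first two individuals share one sign and those of the last two share the other, which is the separation of two subpopulations described above.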

Principal Components Analysis (PCA)

Similar to regression control, but uses PCs as covariates to avoid overfitting.
Typically 2-15 PCs are used.
In addition to race, PCs are also strongly influenced by patterns of LD.

Price et al. (2006) Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet 38:904-909.
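Using a PC as a covariate can be illustrated with a single-covariate sketch. Rather than fitting a full multiple regression, it removes the PC from both genotype and phenotype and regresses the residuals on each other, which gives the same SNP coefficient as a regression of phenotype on SNP and PC together (the Frisch-Waugh-Lovell result). The use of just one PC, the toy data and the function names are assumptions for illustration.

```python
def _residuals(y, c):
    """Residuals of y after least-squares regression on c (with intercept)."""
    n = len(y)
    my, mc = sum(y) / n, sum(c) / n
    scc = sum((v - mc) ** 2 for v in c)
    b = sum((u - my) * (v - mc) for u, v in zip(y, c)) / scc
    return [(u - my) - b * (v - mc) for u, v in zip(y, c)]


def crude_slope(genotype, phenotype):
    """SNP effect from a regression that ignores population structure."""
    n = len(genotype)
    mg, mp = sum(genotype) / n, sum(phenotype) / n
    sgg = sum((g - mg) ** 2 for g in genotype)
    sgp = sum((g - mg) * (p - mp) for g, p in zip(genotype, phenotype))
    return sgp / sgg


def adjusted_slope(genotype, phenotype, pc):
    """SNP effect after adjusting both variables for one PC covariate."""
    g_res = _residuals(genotype, pc)
    y_res = _residuals(phenotype, pc)
    return sum(g * y for g, y in zip(g_res, y_res)) / sum(g * g for g in g_res)


# Toy data: the phenotype depends only on ancestry (pc), not on the SNP,
# but the SNP's allele frequency also differs with ancestry.
pc = [-1, -1, -1, 1, 1, 1]
genotype = [0, 0, 1, 1, 2, 2]
phenotype = [2.0 * p for p in pc]
```

Here the unadjusted regression reports a sizeable SNP effect, while the PC-adjusted effect is zero.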

QQ plot of observed p-values

QQ plot of adjusted p-values using 10 PCs

HapMap Genetic Substructure

Fig. 3. The principal component analysis on HapMap Phase 3 data, using a pruned set of 9949 SNPs and 1198 founders consisting of 11 populations: (a) the first and second eigenvectors; (b) a linear transformation of coordinate from (a) followed by a translation, assuming three ancestral populations with surrogate samples: CEU, YRI and CHB + JPT. The average positions of three surrogate samples are masked by a red plus sign.

X. Zheng, B.S. Weir (2016) Theoretical Population Biology 107:65-76.

East Asia Genetic Substructure

Figure 1. Principal component analyses of substructure in a diverse set of subjects of East Asian descent. Graphic representation of the first two PCs based on analysis with >200 K SNPs are shown. Color code shows subgroup of subjects for each population group. The subjects included Filipino (FIL), Vietnamese (VIET), Lahu, Dai, Cambodian (CAMB), Han Chinese (CHB), Mongola (MGL), Oroqen (ORQ), Daur, Korean (KOR), Chinese Americans from Taiwan (TWN), Yi, Hezhen (HEZ), Miaozu (MIAO), Naxi, She, Tu, Tujia (TUJ), Xibo, Chinese Americans (CHA), Japanese (JPT), and Yakut (YAK). A, Analyses including the Yakut population group. B, Analysis without Yakut is shown. C, Approximate geographic origin of population group is depicted on a map of East Asia (downloaded from University of Texas Library website). The positions of the HGDP population groups are based on the collection site information [12] and the other population groups were placed based on self-identified country or region of origin. [Note: Yakut are not shown on the map since this population is from Siberia and is a considerable distance north of the depicted region.] D, Shows rotated results of PC1 and PC2 to assist illustration of geographic correspondence of ethnic group locations. doi:10.1371/journal.pone.0003862.g001

Tian et al. (2008) PLoS ONE 3(12):e3862.

ing these SNP subsets with the 200 K SNP set. These results, summarized in Table 3, showed that the 20 K random SNP set and 5 K random SNP set corresponded closely with the >200 K SNP set for the first 4 PCs, with decreased correlations observed for the 1 K random SNP set. The relatively poor performance of the 1 K random sets was more pronounced when more closely related population groups were considered, e.g. Japanese and Korean for PC1: 20 K/200 K r² = 0.82 ± 0.12 (mean ± SD), 5 K/200 K r² = 0.69 ± 0.03, and 1 K/200 K r² = 0.28 ± 0.06. These results suggest that random sets of 5 K SNPs may be necessary for resolving and adjusting for substructure in these EAS populations (see discussion).

East Asian Substructure Ancestry Informative Markers

AIMs that discern population substructure are likely to be useful in candidate gene, chromosomal position based association studies and defining homogeneous subject sets [24]. Since the application of these methods is most applicable to large population groups we restricted our ascertainment to five populations (Han Chinese, Japanese, Korean, Vietnamese and Filipino) (See Methods). To assess the potential usefulness of these AIMs an independent set of samples was used and compared with the same number of random SNPs. For this assessment we included Cambodian and Dai samples since we had limited samples from the Vietnamese and Filipino populations. 3 K AIMs showed close correlation between the 200 K results for the first two PCs (Table 3). A set of the best 1.5 K AIMs also showed close correlation (Figure 5 and Table 3). A reduced set of 750 AIMs showed a fall-off in correlation but was still equivalent to 3 K random SNPs. None of the AIM sets correlated with PC3 or PC4 (r² < 0.01, p < 0.05), however, these PCs distinguished the Dai and Cambodian from the other population groups and these were not included in our AIM selections. Nevertheless, for the common EAS populations these data suggest that the EAS-AIMs (Table S3) will be useful for association studies in the majority of EAS and EAS-American populations.

Eigen-spectrum of 3 distinct populations

Hoffman (2013) Correcting for Population Structure and Kinship Using the Linear Mixed Model: Theory and Extensions. PLOS ONE 8(10): e75707.

Eigen-spectrum of 33 parent-offspring duos

Linear Mixed Models

Early GWAS used independent samples.
Independent samples can still be cryptically related or show population stratification; both can be adjusted for using genomic control or principal component analysis.
When samples are related, another approach is linear mixed models (assuming a univariate continuous phenotype).
For related individuals, the correlations between phenotypes can be attributed to kinship, reflecting genome-wide polygenic effects.
PC adjustment uses just the first few eigenvectors of the kinship matrix K.
Linear mixed models (LMM) model the whole matrix K.

$Y_i = \alpha + \beta X_i + \xi Z_i + \delta_i + \varepsilon_i$

where $\delta \sim N(0, 2\sigma_g^2 K)$ and $\varepsilon \sim N(0, \sigma^2 I)$.

$Y_i$ is the phenotype, $X_i$ is the SNP genotype (usually coded as 0, 1, 2), and $Z_i$ is the covariates.

$\delta_i$ is a random effect corresponding to the polygenic contributions of many loci (small, additive genetic effects distributed across the genome).

$\sigma_g^2$ measures the relative importance of polygenic effects, and is related to narrow-sense heritability via $h^2 = \sigma_g^2 / (\sigma_g^2 + \sigma^2)$.
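A minimal sketch of what the variance components above imply, assuming no inbreeding (so that $K_{ii} = 1/2$ and $\mathrm{Var}(Y_i) = \sigma_g^2 + \sigma^2$). The function names and the numeric values of $\sigma_g^2$ and $\sigma^2$ are hypothetical, chosen only for illustration. Under the model, two distinct individuals with kinship coefficient $K_{ij}$ have phenotypic covariance $2\sigma_g^2 K_{ij}$ after conditioning on the fixed effects, so their correlation is $2 K_{ij} h^2$.

```python
def narrow_sense_heritability(sigma_g2: float, sigma_e2: float) -> float:
    """h^2 = sigma_g^2 / (sigma_g^2 + sigma^2)."""
    return sigma_g2 / (sigma_g2 + sigma_e2)


def phenotypic_correlation(kinship: float, sigma_g2: float, sigma_e2: float) -> float:
    """Correlation between the phenotypes of two distinct individuals.

    Cov(Y_i, Y_j) = 2 * sigma_g^2 * K_ij for i != j, and
    Var(Y_i) = sigma_g^2 + sigma^2 when K_ii = 1/2 (assumed: no inbreeding).
    """
    return 2.0 * kinship * sigma_g2 / (sigma_g2 + sigma_e2)


# Hypothetical variance components, for illustration only.
SIGMA_G2, SIGMA_E2 = 0.6, 0.4

# Standard kinship coefficients for a few relationships.
KINSHIP = {
    "parent-offspring": 0.25,
    "full sibs": 0.25,
    "half sibs": 0.125,
    "first cousins": 0.0625,
    "unrelated": 0.0,
}

h2 = narrow_sense_heritability(SIGMA_G2, SIGMA_E2)
correlations = {
    pair: phenotypic_correlation(k, SIGMA_G2, SIGMA_E2)
    for pair, k in KINSHIP.items()
}

for pair, rho in correlations.items():
    print(f"{pair:>16}: correlation = {rho:.3f}")
```

With these assumed values the full-sib correlation is half of $h^2$, which is the sense in which the whole kinship matrix, and not only its leading eigenvectors, carries information about the phenotype.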

Methods Comparisons

Genomic Control: simple and fast; handles cryptic kinship + population structure; some loss of power, which can be severe under ascertainment bias or polygenic inheritance; can work with ∼ 10² SNPs.

Principal Component Adjustment: uses only first few PCs of kinship matrix, which (usually) measure large-scale population structure; cannot handle cryptic kinship or complex forms of population structure such as family structures; problem of choosing number of PCs to use.
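The genomic control adjustment listed above can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: it uses the median-based inflation factor for 1-df chi-square statistics, floors the factor at 1 (a common convention), and runs on made-up test statistics. The function names are hypothetical.

```python
from statistics import median

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_inflation_factor(chi2_stats):
    """Lambda: observed median statistic over its expected value under the null."""
    return median(chi2_stats) / CHI2_1DF_MEDIAN


def genomic_control(chi2_stats):
    """Divide every statistic by lambda, never inflating (lambda floored at 1)."""
    lam = max(1.0, genomic_inflation_factor(chi2_stats))
    return lam, [x / lam for x in chi2_stats]


# Hypothetical 1-df association statistics, for illustration only.
observed = [0.02, 0.15, 0.40, 0.68, 0.91, 1.30, 2.10, 3.50, 6.80]

lam, corrected = genomic_control(observed)
print(f"lambda = {lam:.3f}")
print(f"median after correction = {median(corrected):.4f}")
```

Because every statistic is shrunk by the same factor, true signals are shrunk too, which is the loss of power noted in the comparison.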


Linear Mixed Models: use whole kinship matrix; adjust for cryptic kinship as well as population structure; computational issues now essentially resolved; doesn’t allow for confounding role of selection; can be affected by ascertainment for binary data, which can invalidate the assumption that phenotype correlation = genotype correlation.

Multiple Testing and GWAS

Up until about 2007, reports of genetic association frequently failed to replicate; the problem is now much reduced but has not entirely gone away. One reason was inadequate criteria for deciding when an association should be regarded as established. As the number of tests increased with improved marker technology, the possibilities for false positives also increased: this is the multiple testing problem. Traditionally a significance level of α = 0.05 has been used in science, which allows on average one false positive per twenty tests under the null hypothesis. This is unacceptable for testing a million SNPs: it could generate 50,000 false positives.

Genome-wide Significance Level
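The arithmetic behind the multiple testing problem can be checked in a few lines. This is a rough sketch: the function names are illustrative, and the family-wise error formula assumes independent tests, which SNPs in linkage disequilibrium are not.

```python
def expected_false_positives(n_tests: int, alpha: float) -> float:
    """Expected number of false positives when every null hypothesis is true."""
    return n_tests * alpha


def fwer_independent(n_tests: int, alpha: float) -> float:
    """Probability of at least one false positive, assuming independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


N_SNPS = 1_000_000
ALPHA = 0.05

false_pos = expected_false_positives(N_SNPS, ALPHA)  # about 50,000
fwer_20 = fwer_independent(20, ALPHA)                # about 0.64 for twenty tests
fwer_gwas = fwer_independent(N_SNPS, ALPHA)          # indistinguishable from 1

print(f"expected false positives for {N_SNPS} SNPs: {false_pos:.0f}")
print(f"FWER for 20 tests: {fwer_20:.3f}")
print(f"FWER for {N_SNPS} tests: {fwer_gwas:.6f}")
```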

Dudbridge and Gusnanto (2008, Gen. Epi. 32(3):227-234) took real GWAS genotype data and extrapolated the p-value threshold to an infinite density of SNPs. They found that adjusted $\alpha_{FWER} = 2 \times 10^{-7}$ for the observed data, but this decreased to $\alpha_{FWER} = 2 \times 10^{-8}$ if SNPs with the same statistical properties were infinitely dense in the genome.

Estimation of Significance Thresholds for Genomewide Association Scans

Frank Dudbridge and Arief Gusnanto
MRC Biostatistics Unit, Institute for Public Health, Cambridge, United Kingdom

The question of what significance threshold is appropriate for genomewide association studies is somewhat unresolved. Previous theoretical suggestions have yet to be validated in practice, whereas permutation testing does not resolve a discrepancy between the genomewide multiplicity of the experiment and the subset of markers actually tested. We used genotypes from the Wellcome Trust Case-Control Consortium to estimate a genomewide significance threshold for the UK Caucasian population. We subsampled the genotypes at increasing densities, using permutation to estimate the nominal P-value for 5% family-wise error. By extrapolating to infinite density, we estimated the genomewide significance threshold to be about $7.2 \times 10^{-8}$. To reduce the computation time, we considered Patterson's eigenvalue estimator of the effective number of tests, but found it to be an order of magnitude too low for multiplicity correction. However, by fitting a Beta distribution to the minimum P-value from permutation replicates, we showed that the effective number is a useful heuristic and suggest that its estimation in this context is an open problem. We conclude that permutation is still needed to obtain genomewide significance thresholds, but with subsampling, extrapolation and estimation of an effective number of tests, the threshold can be standardized for all studies of the same population. Genet. Epidemiol. 32: 227–234, 2008. © 2008 Wiley-Liss, Inc.

Key words: multiple testing; permutation test; Bonferroni; eigenvalue

DOI: 10.1002/gepi.20297

INTRODUCTION

The question of what strength of evidence should be considered significant has yet to be fully resolved in genetic association analysis. On the one hand, multiple testing issues arise in most studies, whether based on candidate genes or genomewide scans, with attendant issues of how to quantify the multiplicity, what error rate to control and which method to use [Manly et al., 2004]. On the other hand, even under strong control of the type-1 error, many associations have not been replicated and are thought to have been false positives. This disappointing aspect can be attributed to the use of traditional thresholds of significance, even after adjustment for multiple testing, which reflect over-optimistic prior belief in the tested hypotheses [Ioannidis, 2005]. In the current period of genomewide association scans, using a dense but incomplete panel of single nucleotide polymorphism (SNP) markers [Barrett and Cardon, 2006], such considerations take on greater importance owing to the high profile and cost of these studies.

Here we discuss some aspects of assessing significance in genomewide scans, and estimate a genomewide significance threshold for the UK Caucasian population using data from the recently completed Wellcome Trust Case-Control Consortium [2007] (WTCCC) study. We allow for a potentially saturated dense marker panel to distinguish the genomewide multiplicity of the experiment from the set of markers actually tested. This approach has a long history in linkage analysis [Morton, 1955; Lander and Kruglyak, 1995] but has generally not been taken up in association mapping. Our approach brings together previous ideas based on the hypothesized extent of multiplicity in the genome, with emerging marker data that allow this multiplicity to be estimated.

The multiple testing problem arises because, if many hypotheses are tested simultaneously, some test statistics will be surprisingly extreme, even if no associations exist. Multiple test procedures are designed to exercise control over the entire set of hypotheses, to prevent study-wide conclusions being drawn that could be attributed to chance alone. The family-wise error rate (FWER) is the

Hoggart et al. (2008, Gen. Epi. 32(2): 179-185) used computer simulations of sequence data, under realistic population genetics models for three large populations. They then estimated $\alpha_{FWER}$ if different classes of SNPs were tested. They found strong dependence of $\alpha_{FWER}$ on many factors, including: population, MAF threshold, choice of statistical test, and numbers of cases and controls.

Genome-Wide Significance for Dense SNP and Resequencing Data

Clive J. Hoggart [1], Taane G. Clark [1,2], Maria De Iorio [1], John C. Whittaker [3], and David J. Balding [1]
[1] Department of Epidemiology and Public Health, Imperial College London, Norfolk Place, London
[2] Current affiliation: Wellcome Trust Centre for Human Genetics, University of Oxford, Roosevelt Drive, Oxford and Wellcome Trust Sanger Institute, Hinxton, Cambridge
[3] Non-communicable Disease Epidemiology Unit, London School of Hygiene and Tropical Medicine, Keppel Street, London

The problem of multiple testing is an important aspect of genome-wide association studies, and will become more important as marker densities increase. The problem has been tackled with permutation and false discovery rate procedures and with Bayes factors, but each approach faces difficulties that we briefly review. In the current context of multiple studies on different genotyping platforms, we argue for the use of truly genome-wide significance thresholds, based on all polymorphisms whether or not typed in the study. We approximate genome-wide significance thresholds in contemporary West African, East Asian and European populations by simulating sequence data, based on all polymorphisms as well as for a range of single nucleotide polymorphism (SNP) selection criteria. Overall we find that significance thresholds vary by a factor of >20 over the SNP selection criteria and statistical tests that we consider and can be highly dependent on sample size. We compare our results for sequence data to those derived by the HapMap Consortium and find notable differences which may be due to the small sample sizes used in the HapMap estimate. Genet. Epidemiol. 32:179–185, 2008. © 2007 Wiley-Liss, Inc.

Key words: genome-wide association studies; multiple testing; statistical significance

DOI: 10.1002/gepi.20292

INTRODUCTION

Given the large number of statistical tests (around $10^6$) in genome-wide association (GWA) studies, and even larger numbers from future studies using whole-genome resequencing technology, it is important, but difficult, to accurately convey the significance of any reported associations. Frequently this problem is tackled by controlling the family-wise error rate (FWER), which is the probability of one or more significant results under the null hypothesis of no association. For a test applied at each of $n$ single nucleotide polymorphisms (SNPs), the simplest way of approximating the per-test significance level $\alpha''$ corresponding to a given FWER ($\alpha$) is to apply a Bonferroni ($\alpha'' = \alpha/n$) or a Šidák ($\alpha'' = 1 - (1 - \alpha)^{1/n}$) correction [Šidák, 1968, 1971]. If the tests are mutually independent the Šidák correction is exact, but in practice the tests are dependent and both corrections can be very conservative.

More accurate approximations to the genome-wide significance level $\alpha''$ are difficult to obtain, because the correlations between the tests depend on many factors, such as variations in linkage disequilibrium (LD) and SNP density, as well as the choice of test statistic(s) and sample size. The concept of an effective number of (independent) tests is appealing [Cheverud, 2001; Nyholt, 2004; The International HapMap Consortium, 2005], but this number is closely related to $\alpha''$ and depends on the same factors [Salyakina et al., 2005; Dudbridge et al., 2006].

False discovery rate procedures [Benjamini and Hochberg, 1995; Storey and Tibshirani, 2003], which control the expected proportion of non-causal SNPs among those declared significant, have been proposed as an alternative to the FWER approach. These have been applied successfully to the analysis of genome-wide gene expression data, for which one typically expects many true positives. However, in the context of GWA studies, the smaller number of positives and the problem of LD between true positives and flanking SNPs, remain problematic [Dudbridge et al., 2006; Yang et al., 2005]. In principle, the use of Bayes factors to measure significance avoids the need for assessing genome-wide error rates. A full Bayesian analysis involves the difficult challenge of identifying a realistic yet tractable alternative model. For example, the alternative model of the Wellcome Trust Case Control
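The Bonferroni and Šidák corrections mentioned in the excerpt above can be compared numerically. This is only a sketch: the function names are illustrative, the figure of one million tests is taken from the excerpt's "around $10^6$", and the Šidák form is written with `log1p`/`expm1` purely to avoid rounding error when the per-test level is tiny.

```python
import math


def bonferroni(fwer: float, n_tests: int) -> float:
    """Per-test level alpha / n."""
    return fwer / n_tests


def sidak(fwer: float, n_tests: int) -> float:
    """Per-test level 1 - (1 - alpha)^(1/n), computed in a numerically stable way."""
    return -math.expm1(math.log1p(-fwer) / n_tests)


FWER = 0.05
N_TESTS = 1_000_000  # assumed number of tests, matching "around 10^6" in the excerpt

bonf_level = bonferroni(FWER, N_TESTS)
sidak_level = sidak(FWER, N_TESTS)

print(f"Bonferroni per-test level: {bonf_level:.3e}")
print(f"Sidak per-test level:      {sidak_level:.3e}")
```

For a 5% family-wise error rate the two per-test levels are close, and the Bonferroni value for a million tests is the familiar $5 \times 10^{-8}$.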

GWAS significance level
A consensus is that $\alpha_{SNP}$ must be $5 \times 10^{-8}$.

Analysis Protocols

Quality Control: Call Rate (Marker and Sample); Minor Allele Frequency; Hardy-Weinberg Equilibrium Test; Sex Inconsistency
Impute genotype using IMPUTE2, MACH or BEAGLE
Impute phenotype data
Relatedness using KING program
Population Stratification
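The marker-level quality control checks in the protocol above can be sketched for a single SNP. Everything here is illustrative: the function names, the genotype coding (0, 1, 2 copies of one allele, `None` for a missing call) and the thresholds (call rate 0.95, MAF 0.01, HWE p-value $10^{-6}$) are assumptions, not values from the slides. The Hardy-Weinberg check uses the 1-df chi-square goodness-of-fit test, although exact tests are often preferred in practice.

```python
import math


def call_rate(genotypes):
    """Fraction of non-missing genotype calls (missing coded as None)."""
    return sum(g is not None for g in genotypes) / len(genotypes)


def minor_allele_frequency(genotypes):
    """Frequency of the rarer allele among called genotypes coded 0/1/2."""
    called = [g for g in genotypes if g is not None]
    p = sum(called) / (2 * len(called))
    return min(p, 1.0 - p)


def hwe_chi2_pvalue(genotypes):
    """P-value of the 1-df chi-square test for Hardy-Weinberg equilibrium."""
    called = [g for g in genotypes if g is not None]
    n = len(called)
    observed = [called.count(k) for k in (0, 1, 2)]
    p = (observed[1] + 2 * observed[2]) / (2 * n)
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 1.0  # monomorphic marker: nothing to test
    expected = [n * q * q, 2 * n * p * q, n * p * p]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2.0))


def passes_qc(genotypes, min_call_rate=0.95, min_maf=0.01, min_hwe_p=1e-6):
    """Apply the three marker filters with hypothetical thresholds."""
    return (
        call_rate(genotypes) >= min_call_rate
        and minor_allele_frequency(genotypes) >= min_maf
        and hwe_chi2_pvalue(genotypes) >= min_hwe_p
    )


# Made-up marker: 49/42/9 genotype counts plus two missing calls.
snp = [0] * 49 + [1] * 42 + [2] * 9 + [None, None]
print(call_rate(snp), minor_allele_frequency(snp), hwe_chi2_pvalue(snp), passes_qc(snp))
```

A marker with only heterozygous calls, for example `[1] * 100`, fails the Hardy-Weinberg filter under these thresholds, which is the kind of pattern genotyping errors can produce.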

Other Studies using GWAS

GWAS in plant and animal breeding
Effect Size and Power
Gene × Environment interaction
Pleiotropic
Polygene
Heredity
Meta-Analysis

199 loci associated with adult height (N = 189 K)

Supplementary Figure 2. 199 loci associated with adult height variation. Karyogram displaying the genome location of the 180 height SNPs identified from the primary meta-analysis (green) and the 19 secondary signals (red) discovered in the conditional analysis to be associated with height. The closest genes to the SNPs (gray) are followed by a MIM (blue) label if the gene underlies a skeletal growth-related Mendelian disorder described in OMIM. The plot was created using Affyrmation (http://genepipe.ngc.sinica.edu.tw/affyrmation/).

Allen et al. (2010) Nature 467: 832-838.

Resources: Texts

Yang, M C K (2007) Introduction to Statistical Methods in Modern Genetics. CRC Press.
Laird, N M and Lange, C (2011) The Fundamentals of Modern Statistical Genetics. Springer.
Lynch, M and Walsh, B (1998) Genetics and Analysis of Quantitative Traits. Oxford University Press.
Siegmund, D and Yakir, B (2007) The Statistics of Gene Mapping. Springer.
Lange, K (2002) Mathematical and Statistical Methods for Genetic Analysis. Springer.
Balding, D J, Bishop, M, and Cannings, C (2007) Handbook of Statistical Genetics (2 volume set), 3rd Edition. Wiley.

Resources: Journal Articles

The Genetic Epidemiology series in the Lancet journals published 7 papers in 2005 reviewing linkage (and association) methods applied in public health.
Nature Reviews Genetics (2008-2013) published 23 papers about best practices for conducting GWAS in humans.
Anderson et al. (2010) Data quality control in genetic case-control association studies. Nature Protocols. 5:1564-1573.
Winkler et al. (2014) Quality control and conduct of genome-wide association meta-analyses. Nature Protocols. 9:1192-1212.
Reed et al. (2015) A guide to genome-wide association analysis and post-analytic interrogation. Statist. Med., 34: 3769-3792.

Resources: Free Software

PLINK: Whole genome association analysis toolset
KING: Kinship-based INference for Gwas
R packages: CRAN Task View: Statistical Genetics; GWASTools; GenABEL
OpenMendel: an open source project implemented in the Julia programming language


Linear Mixed Models:
TASSEL 5.0
Grammar-Gamma ⇒ GenABEL
ROADTRIPS: Case-Control Association Testing with Partially or Completely Unknown Population and Pedigree Structure
FaST-LMM: Factored Spectrally Transformed Linear Mixed Models
EMMA: Efficient Mixed-Model Association
EMMAX: Efficient Mixed-Model Association eXpedited
GEMMA: Genome-wide Efficient Mixed Model Association
GCTA: A tool for Genome-wide COmplex Trait Analysis
