Genome-Wide Association Study (GWAS)
James J. Yang
School of Nursing, University of Michigan
Ann Arbor, Michigan
January 24, 2018

Outline

1 What are Genome-Wide Association Studies?
2 Linkage versus Association
3 GWAS Data
4 Issues with GWAS
5 Correcting Association Analysis for Confounding
    Genomic Control
    Principal Components Analysis
    Linear Mixed Models
6 Genome-wide Significance Level
7 Analysis Protocols
8 Other Studies using GWAS
9 Resources

What are genome-wide association studies

Genome-wide association studies (GWASs) are unbiased genome screens of unrelated individuals and appropriately matched controls or parent-affected child trios to establish whether any genetic variant is associated with a trait. These studies typically focus on associations between single-nucleotide polymorphisms (SNPs) and major diseases.
(Source: nature.com)

A genome-wide association study is an approach that involves rapidly scanning markers across the complete sets of DNA, or genomes, of many people to find genetic variations associated with a particular disease. Once new genetic associations are identified, researchers can use the information to develop better strategies to detect, treat and prevent the disease. Such studies are particularly useful in finding genetic variations that contribute to common, complex diseases, such as asthma, cancer, diabetes, heart disease and mental illnesses.
(Source: National Human Genome Research Institute, NIH)

[Figure from The New England Journal of Medicine. Panel A: SNP1 and SNP2 on chromosome 9 in Person 1, Person 2 and Person 3 (G–C → T–A, A–T → G–C). Panel B: counts of common homozygotes, heterozygotes and variant homozygotes for SNP1 and SNP2 in cases and controls of the initial discovery study (P = 1×10^-12 and P = 1×10^-8). Panel C: −log10 P value by chromosome (1 to 22, X) and by position on chromosome 9, with SNP1 and SNP2 marked.]

Figure 1. The Genomewide Association Study.
The genomewide association study is typically based on a case–control design in which single-nucleotide polymorphisms (SNPs) across the human genome are genotyped. Panel A depicts a small locus on chromosome 9, and thus a very small fragment of the genome. In Panel B, the strength of association between each SNP and disease is calculated on the basis of the prevalence of each SNP in cases and controls. In this example, SNPs 1 and 2 on chromosome 9 are associated with disease, with P values of 10^-12 and 10^-8, respectively. The plot in Panel C shows the P values for all genotyped SNPs that have survived a quality-control screen, with each chromosome shown in a different color. The results implicate a locus on chromosome 9, marked by SNPs 1 and 2, which are adjacent to each other (graph at right), and other neighboring SNPs.

Excerpt from the same page (p. 168):

larger studies investigating the risk of schizophrenia have implicated several variants (both structural variants and SNPs) in the region of the major histocompatibility complex (MHC) and at other loci, associations that have been replicated in independent samples (refs. 32-34).

Generally, associations between SNPs and traits tend to be of modest effect size, with a median odds ratio per copy of the risk allele of 1.33 (ref. 7). Several variants carry odds ratios above 3.00, including some exceeding 12.00. These are of particular interest, since it seems likely that there would have been evolutionary pressure against their selection unless they provided some survival ben-

Manolio (2010). Genomewide Association Studies and Assessment of the Risk of Disease. NEJM. 363(2): 166-176.
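The Figure 1 caption says that the strength of association for each SNP is calculated from how often the SNP occurs in cases and in controls, but it does not name a test. As a minimal sketch, assuming a 2×2 allelic chi-square test with 1 degree of freedom (one common choice, not necessarily the one used for the figure), the calculation for a single SNP could look as follows. The function name `allelic_chi_square` and the genotype counts are hypothetical and chosen only for illustration.

```python
import math


def allelic_chi_square(case_counts, control_counts):
    """Allelic 2x2 chi-square test (1 df) for one SNP.

    Each argument is a tuple of genotype counts (AA, AB, BB).
    Returns (chi-square statistic, P value).
    """
    def allele_counts(genotypes):
        aa, ab, bb = genotypes
        # Each AA carries two A alleles, each BB two B alleles, each AB one of each.
        return 2 * aa + ab, 2 * bb + ab

    a_case, b_case = allele_counts(case_counts)
    a_ctrl, b_ctrl = allele_counts(control_counts)

    n = a_case + b_case + a_ctrl + b_ctrl
    denom = ((a_case + b_case) * (a_ctrl + b_ctrl)
             * (a_case + a_ctrl) * (b_case + b_ctrl))
    if denom == 0:
        raise ValueError("a row or column of the 2x2 allele table is empty")

    # Pearson chi-square for a 2x2 table: n * (ad - bc)^2 / (product of margins).
    chi2 = n * (a_case * b_ctrl - b_case * a_ctrl) ** 2 / denom
    # Upper tail of a chi-square with 1 df: P(Z^2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts (AA, AB, BB) for 50 cases and 50 controls.
chi2, p = allelic_chi_square((20, 20, 10), (10, 20, 20))
print(f"chi2 = {chi2:.2f}, P = {p:.4g}, -log10(P) = {-math.log10(p):.2f}")
```

With these made-up counts the statistic is 8.0. Repeating the calculation for every genotyped SNP and plotting −log10(P) against genomic position gives the kind of plot described for Panel C.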
Linkage vs. Association

                   Linkage                       Association
Data Structures    Pedigree                      Independent samples
                   (within families)             (across families)
Concepts           Recombination events          Linkage disequilibrium
Statistics         Likelihood of recombinants    Statistical correlation

GWAS Data

Genotype Data: about 10^6 variables
Phenotype Data: about 10^3 individuals

Single Nucleotide Polymorphism (SNP)

- Responsible for 90% of all human genetic variations
- A SNP occurs every 100-300 base pairs
- Currently almost 12.8 million SNPs (2008) in the NCBI dbSNP database
- Like microsatellites, they are used as markers because
    - They occur frequently
    - They are stable to genotype
- Probes are 25 bases long

Affymetrix GeneChip Genome-Wide Human SNP Array 6.0: 1.8 million markers (946K probes for copy number variants and 906.6K SNPs)

Genotype Data

For a SNP with two alleles coded as A and B, there are three possible genotypes: {AA, AB, BB}.
Additive (B as reference): {AA = 0, AB = 1, BB = 2}

Genotype Data: Z_ij = number of copies of the reference allele

SNP \ ID    1    2    3    4   ···  n−2  n−1    n
1           0    1    2    2   ···    2    1    1
2           2    2    0   -1   ···    2    0    1
3           0    1    2    0   ···    2    2    0
...
G−1         2    0    0    1   ···    2   -1    2
G           0    0    2    1   ···    1    2    1

Phenotype Data with Genotype

Phenotype data (X, Y, W) and genotype data (Z_ij = number of copies of the reference allele) for the same individuals:

ID             1      2      3      4   ···    n−2    n−1      n
X           7.32   3.09   5.24   4.57   ···   7.12   3.96   4.72
Y              1      0      1      1   ···      0      0      0
W           0.85  -0.62   0.51   0.67   ···  -0.10  -0.29  -1.56
SNP 1          0      1      2      2   ···      2      1      1
SNP 2          2      2      0     -1   ···      2      0      1
SNP 3          0      1      2      0   ···      2      2      0
...
SNP G−1        2      0      0      1   ···      2     -1      2
SNP G          0      0      2      1   ···      1      2      1

REVIEWS

often without the functional scrutiny that is required for a SNP in a non-coding region, and often despite the presence of many nearby variants that might be equally or more strongly associated with disease. Indeed, one of the missense variants that has been shown to be associated with complex disease, the Thr17Ala polymorphism in the gene encoding cytotoxic T-lymphocyte-associated protein 4 (CTLA4), is reliably associated with autoimmune disease only because it is in strong LD with a regulatory polymorphism in a non-coding region, which is more strongly associated with disease and is therefore more likely to be causal (ref. 44).

Nevertheless, some missense variants have been reliably associated with complex disease and, as a group, missense variants are more likely to have functional consequences. Therefore, the genome-wide testing of large collections of missense variants is likely to remain a productive approach. However, given our current lack of knowledge about common disease risk alleles, it remains unclear what fraction of these would be discovered even by a comprehensive survey of missense polymorphisms.

New methods are emerging that might help recognize variants that affect gene function without affecting the encoded amino-acid sequence. By comparing the human and mouse genomes (ref. 74), it was shown that a significant amount of non-coding DNA is highly conserved (ref. 75). This indicates that conserved non-coding regions are often functionally important, a hypothesis that has been supported experimentally (refs. 75–80). Polymorphisms in these non-coding regions could also have an important role in the genetics of biomedical traits. Indeed, a modification of the missense approach to include SNPs in these conserved regions has also been proposed (ref. 71). However, the large number of additional SNPs required would sacrifice the efficiency of the missense approach and would result in studies that are similar in scale to the indirect LD approach.

LD or functional considerations, but will nevertheless achieve some degree of coverage of the genome. However, for some sets of variants, the coverage is so poor that calling them 'genome-wide' is misleading. The least comprehensive of such so-called genome-wide association studies are linkage studies that are converted into association studies by looking for associations between disease and the 400–1,000 microsatellites that are typed in linkage studies. Even under the optimistic assumption that testing a single microsatellite for association completely surveys variation in a surrounding 50-kb block of LD (blocks are on average ~20 kb (ref. 60), so this is also optimistic), such a study would cover 20–50 Mb (1–3% of the genome or less) and cannot truly be considered a genome-wide association study.

A proposed alternative approach is to type a few SNPs in or near the coding region of each gene (refs. 81, 82). This method, like all association approaches, only surveys those variants that have been chosen for genotyping and those variants that are in LD with the chosen variants. Unless the LD patterns of each gene are empirically determined, even missense SNPs might well be missed using this approach, because choosing SNPs on the basis of physical proximity does not guarantee that nearby SNPs will be captured (ref. 52). Furthermore, regulatory variants further away from a gene will almost certainly not be surveyed.

More recently, large collections of many thousands (ref. 83) (for example, the Affymetrix Centurion and ParAllele and MegAllele mapping sets) or over a million SNPs (K. Frazer and D. Cox, personal communication; see also the Perlegen Whole Genome Scanning collection) have been developed, and these can be genotyped at a significantly lower cost per SNP. The degree of coverage has not yet been published for these SNP sets, but they are likely to cover a significant fraction of the genome,
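
The coverage estimate quoted in the excerpt for microsatellite-based scans (400–1,000 markers, each assumed to survey a 50-kb block of LD) can be checked with a short calculation. This is only a sketch: the genome size of about 3,000 Mb is an assumption that the excerpt does not state, and the function name `scan_coverage` is hypothetical.

```python
# Assumption: approximate size of the human genome in Mb (not given in the excerpt).
GENOME_MB = 3000


def scan_coverage(n_markers, block_kb, genome_mb=GENOME_MB):
    """Return (Mb covered, percent of genome covered) for a marker panel.

    Assumes each marker fully surveys one non-overlapping block of block_kb kb.
    """
    covered_mb = n_markers * block_kb / 1000  # kb -> Mb
    percent = 100 * covered_mb / genome_mb
    return covered_mb, percent


for n_markers in (400, 1000):
    for block_kb in (50, 20):  # optimistic 50-kb block vs. the ~20-kb average
        mb, pct = scan_coverage(n_markers, block_kb)
        print(f"{n_markers} markers x {block_kb} kb: {mb:.0f} Mb ({pct:.2f}% of genome)")
```

Under the optimistic 50-kb assumption this reproduces the 20–50 Mb range in the excerpt. With the assumed 3,000 Mb genome that is roughly 0.7% to 1.7%, in line with the excerpt's "1–3% of the genome or less", and the 20-kb average block size gives smaller values still.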