Genome-Wide Association Study (GWAS)

James J. Yang
January 24, 2018
School of Nursing, University of Michigan
Ann Arbor, Michigan

Outline

1 What are Genome-Wide Association Studies?
2 Linkage versus Association
3 GWAS Data
4 Issues with GWAS
5 Correcting Association Analysis for Confounding
    Genomic Control
    Principal Components Analysis
    Linear Mixed Models
6 Genome-wide Significance Level
7 Analysis Protocols
8 Other Studies using GWAS
9 Resources

What are genome-wide association studies

Genome-wide association studies (GWASs) are unbiased genome screens of unrelated individuals and appropriately matched controls or parent-affected child trios to establish whether any genetic variant is associated with a trait. These studies typically focus on associations between single-nucleotide polymorphisms (SNPs) and major diseases.
(Source: nature.com)

What are genome-wide association studies

A genome-wide association study is an approach that involves rapidly scanning markers across the complete sets of DNA, or genomes, of many people to find genetic variations associated with a particular disease. Once new genetic associations are identified, researchers can use the information to develop better strategies to detect, treat and prevent the disease. Such studies are particularly useful in finding genetic variations that contribute to common, complex diseases, such as asthma, cancer, diabetes, heart disease and mental illnesses.
(Source: National Human Genome Research Institute, NIH)

[Figure, reproduced from The New England Journal of Medicine. Panel A: SNP1 and SNP2 on chromosome 9 in Person 1, Person 2 and Person 3 (G–C → T–A and A–T → G–C). Panel B: cases and controls in the initial discovery study for SNP1 (P=1×10^-12) and SNP2 (P=1×10^-8); genotype classes: common homozygote, heterozygote, variant homozygote. Panel C: -log10 P value plotted against chromosome (1 to 22 and X) and against position on chromosome 9, with SNP1 and SNP2 marked.]

Figure 1.
The Genomewide Association Study. The genomewide association study is typically based on a case–control design in which single-nucleotide polymorphisms (SNPs) across the human genome are genotyped. Panel A depicts a small locus on chromosome 9, and thus a very small fragment of the genome. In Panel B, the strength of association between each SNP and disease is calculated on the basis of the prevalence of each SNP in cases and controls. In this example, SNPs 1 and 2 on chromosome 9 are associated with disease, with P values of 10^-12 and 10^-8, respectively. The plot in Panel C shows the P values for all genotyped SNPs that have survived a quality-control screen, with each chromosome shown in a different color. The results implicate a locus on chromosome 9, marked by SNPs 1 and 2, which are adjacent to each other (graph at right), and other neighboring SNPs.

Manolio (2010). Genomewide Association Studies and Assessment of the Risk of Disease. NEJM 363(2): 166-176.

(Adjacent text from the same journal page, N Engl J Med 363;2, July 8, 2010, p. 168:) larger studies investigating the risk of schizophrenia have implicated several variants (both structural variants and SNPs) in the region of the major histocompatibility complex (MHC) and at other loci, associations that have been replicated in independent samples [32-34]. Generally, associations between SNPs and traits tend to be of modest effect size, with a median odds ratio per copy of the risk allele of 1.33 [7]. Several variants carry odds ratios above 3.00, including some exceeding 12.00. These are of particular interest, since it seems likely that there would have been evolutionary pressure against their selection unless they provided some survival ben-
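As a rough illustration of the calculation described for Panel B, the sketch below runs a 1-degree-of-freedom allelic chi-square test on a 2x2 table of allele counts in cases and controls, and reports the odds ratio per copy of the risk allele together with the -log10 P value that Panel C plots. The allele counts, the function name `allelic_test`, and the choice of the allelic chi-square test are all assumptions made for illustration; the figure caption does not say which test was used.

```python
import math


def allelic_test(case_risk, case_other, ctrl_risk, ctrl_other):
    """Allelic chi-square test (1 df) on a 2x2 table of allele counts.

    Returns (odds_ratio, chi2, p_value). All counts are hypothetical inputs.
    """
    a, b, c, d = case_risk, case_other, ctrl_risk, ctrl_other
    n = a + b + c + d
    # Pearson chi-square for a 2x2 table, without continuity correction.
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper tail of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    # Odds of carrying the risk allele in cases relative to controls.
    odds_ratio = (a * d) / (b * c)
    return odds_ratio, chi2, p_value


if __name__ == "__main__":
    # Made-up counts: 1000 cases and 1000 controls, two alleles per person.
    odds_ratio, chi2, p = allelic_test(900, 1100, 750, 1250)
    print(f"odds ratio = {odds_ratio:.2f}, chi2 = {chi2:.1f}, "
          f"-log10(P) = {-math.log10(p):.1f}")
```

In a real scan the same calculation would be repeated for every SNP that passes quality control, and the resulting -log10 P values plotted by chromosome position.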
Linkage vs. Association

                    Linkage                        Association
Data Structures     Pedigree (within families)     Independent Samples (across families)
Concepts            Recombination events           Linkage Disequilibrium
Statistics          Likelihood of recombinants     Statistical correlation

GWAS Data

Genotype Data: about 10^6 variables
Phenotype Data: about 10^3 individuals

Single Nucleotide Polymorphism (SNP)

Responsible for 90% of all human genetic variations
A SNP occurs every 100-300 base pairs
Currently almost 12.8 million SNPs (2008) in the NCBI dbSNP database
Like microsatellites, they are used as markers because
    They occur frequently
    They are stable to genotype
Probe length is 25 bases

Affymetrix GeneChip Genome-Wide Human SNP Array 6.0: 1.8 million markers (946K probes for copy number variants and 906.6K SNPs)

Genotype Data

For a SNP with two alleles coded as A and B, there are three possible genotypes: {AA, AB, BB}.
Additive (B as reference): {AA = 0, AB = 1, BB = 2}

Genotype Data: Z_ij = number of the reference allele

                ID
SNP      1    2    3    4   ...  n-2  n-1    n
1        0    1    2    2   ...    2    1    1
2        2    2    0   -1   ...    2    0    1
3        0    1    2    0   ...    2    2    0
...
G-1      2    0    0    1   ...    2   -1    2
G        0    0    2    1   ...    1    2    1

Phenotype Data with Genotype

Phenotype Data:

X     7.32   3.09   5.24   4.57   ...   7.12   3.96   4.72
Y     1      0      1      1      ...   0      0      0
W     0.85  -0.62   0.51   0.67   ...  -0.10  -0.29  -1.56

Genotype Data: Z_ij = number of the reference allele

                ID
SNP      1    2    3    4   ...  n-2  n-1    n
1        0    1    2    2   ...    2    1    1
2        2    2    0   -1   ...    2    0    1
3        0    1    2    0   ...    2    2    0
...
G-1      2    0    0    1   ...    2   -1    2
G        0    0    2    1   ...    1    2    1

REVIEWS

(Reproduced journal page, left column:) ... often without the functional scrutiny that is required for a SNP in a non-coding region, and often despite the presence of many nearby variants that might be equally or more strongly associated with disease. Indeed, one of the missense variants that has been shown to be associated with complex disease, the Thr17Ala polymorphism in the gene encoding cytotoxic T-lymphocyte-associated protein 4 (CTLA4), is reliably associated with autoimmune disease only because it is in strong LD with a regulatory polymorphism in a non-coding region, which is more strongly associated with disease and is therefore more likely to be causal [44].

Nevertheless, some missense variants have been reliably associated with complex disease and, as a group, missense variants are more likely to have functional consequences. Therefore, the genome-wide testing of large collections of missense variants is likely to remain a productive approach. However, given our current lack of knowledge about common disease risk alleles, it remains unclear what fraction of these would be discovered even by a comprehensive survey of missense polymorphisms.

New methods are emerging that might help recognize variants that affect gene function without affecting the encoded amino-acid sequence. By comparing the human and mouse genomes [74], it was shown that a significant amount of non-coding DNA is highly conserved [75]. This indicates that conserved non-coding regions are often functionally important, a hypothesis that has been supported experimentally [75–80]. Polymorphisms in these non-coding regions could also have an important role in the genetics of biomedical traits. Indeed, a modification of the missense approach to include SNPs in these conserved regions has also been proposed [71]. However, the large number of additional SNPs required would sacrifice the efficiency of the missense approach and would result in studies that are similar in scale to the indirect LD approach.

(Right column:) ... LD or functional considerations, but will nevertheless achieve some degree of coverage of the genome. However, for some sets of variants, the coverage is so poor that calling them 'genome-wide' is misleading. The least comprehensive of such so-called genome-wide association studies are linkage studies that are converted into association studies by looking for associations between disease and the 400–1,000 microsatellites that are typed in linkage studies. Even under the optimistic assumption that testing a single microsatellite for association completely surveys variation in a surrounding 50-kb block of LD (blocks are on average ~20 kb (REF. 60), so this is also optimistic), such a study would cover 20–50 Mb (1–3% of the genome or less) and cannot truly be considered a genome-wide association study.

A proposed alternative approach is to type a few SNPs in or near the coding region of each gene [81,82]. This method, like all association approaches, only surveys those variants that have been chosen for genotyping and those variants that are in LD with the chosen variants. Unless the LD patterns of each gene are empirically determined, even missense SNPs might well be missed using this approach, because choosing SNPs on the basis of physical proximity does not guarantee that nearby SNPs will be captured [52]. Furthermore, regulatory variants further away from a gene will almost certainly not be surveyed.

More recently, large collections of many thousands [83] (for example, the Affymetrix Centurion and ParAllele and MegAllele mapping sets) or over a million SNPs (K. Frazer and D. Cox, personal communication; see also the Perlegen Whole Genome Scanning collection) have been developed, and these can be genotyped at a significantly lower cost per SNP. The degree of coverage has not yet been published for these SNP sets, but they are likely to cover a significant fraction of the genome,
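The additive coding from the Genotype Data slides above (AA = 0, AB = 1, BB = 2) can be sketched in a few lines of Python. This is a minimal illustration rather than the lecture's own code: the helper names `encode` and `snp_summary` are hypothetical, and reading the -1 entries in the slide's table as missing genotype calls is an assumption, since the slides do not define them.

```python
# Additive coding with B as the counted allele, as on the Genotype Data slide.
ADDITIVE = {"AA": 0, "AB": 1, "BA": 1, "BB": 2}
MISSING = -1  # assumption: -1 in the slide's table marks a missing call


def encode(genotype):
    """Map a genotype string to its additive code; anything else is missing."""
    return ADDITIVE.get(genotype, MISSING)


def snp_summary(codes):
    """Return (call_rate, b_allele_frequency) for one SNP, i.e. one row of Z."""
    if not codes:
        raise ValueError("need at least one individual")
    called = [z for z in codes if z != MISSING]
    call_rate = len(called) / len(codes)
    # Each called individual carries two alleles; sum(called) counts B copies.
    freq_b = sum(called) / (2 * len(called)) if called else float("nan")
    return call_rate, freq_b


if __name__ == "__main__":
    # Hypothetical genotype calls for seven individuals at one SNP;
    # "NN" stands in for a failed call.
    calls = ["BB", "BB", "AA", "NN", "BB", "AA", "AB"]
    row = [encode(g) for g in calls]
    print(row)
    print(snp_summary(row))
```

Summaries like the call rate and allele frequency are typical inputs to the quality-control screen mentioned in the Figure 1 caption, although the thresholds used there are not given in the source.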