A Bayesian Method for Detecting and Characterizing Allelic Heterogeneity and Boosting Signals in Genome-Wide Association Studies

Total Page:16

File Type:pdf, Size:1020Kb

A Bayesian Method for Detecting and Characterizing Allelic Heterogeneity and Boosting Signals in Genome-Wide Association Studies Statistical Science 2009, Vol. 24, No. 4, 430–450 DOI: 10.1214/09-STS311 c Institute of Mathematical Statistics, 2009 A Bayesian Method for Detecting and Characterizing Allelic Heterogeneity and Boosting Signals in Genome-Wide Association Studies Zhan Su, Niall Cardin , the Wellcome Trust Case Control Consortium, Peter Donnelly and Jonathan Marchini Abstract. The standard paradigm for the analysis of genome-wide as- sociation studies involves carrying out association tests at both typed and imputed SNPs. These methods will not be optimal for detecting the signal of association at SNPs that are not currently known or in regions where allelic heterogeneity occurs. We propose a novel associa- tion test, complementary to the SNP-based approaches, that attempts to extract further signals of association by explicitly modeling and es- timating both unknown SNPs and allelic heterogeneity at a locus. At each site we estimate the genealogy of the case-control sample by tak- ing advantage of the HapMap haplotypes across the genome. Allelic heterogeneity is modeled by allowing more than one mutation on the branches of the genealogy. Our use of Bayesian methods allows us to as- sess directly the evidence for a causative SNP not well correlated with known SNPs and for allelic heterogeneity at each locus. Using simu- lated data and real data from the WTCCC project, we show that our method (i) produces a significant boost in signal and accurately identi- fies the form of the allelic heterogeneity in regions where it is known to exist, (ii) can suggest new signals that are not found by testing typed or imputed SNPs and (iii) can provide more accurate estimates of effect sizes in regions of association. arXiv:1010.4670v1 [stat.ME] 22 Oct 2010 Key words and phrases: Complex disease, genome-wide association, allelic heterogeneity, Bayesian methods. Zhan Su is Analyst, Department of Statistics, University of Oxford, 1 South Parks Road, Oxford OX1 is University Lecturer in Statistical Genomics, 3TG, UK and Wellcome Trust Centre for Human Department of Statistics, University of Oxford, 1 South Genetics, University of Oxford, Roosevelt Drive, Parks Road, Oxford OX1 3TG, UK and Wellcome Trust Oxford, OX3 7BN, UK. Niall Cardin is Postdoctoral Centre for Human Genetics, University of Oxford, Recearcher, Department of Statistics, University of Roosevelt Drive, Oxford, OX3 7BN, UK e-mail: Oxford, 1 South Parks Road, Oxford OX1 3TG, UK. [email protected]. Full details of Consortium membership are available at www.wtccc.org.uk. Peter Donnelly is Professor, This is an electronic reprint of the original article Wellcome Trust Centre for Human Genetics, University published by the Institute of Mathematical Statistics in of Oxford, Roosevelt Drive, Oxford, OX3 7BN, UK and Statistical Science, 2009, Vol. 24, No. 4, 430–450. This Department of Statistics, University of Oxford, 1 South reprint differs from the original in pagination and Parks Road, Oxford OX1 3TG, UK. Jonathan Marchini typographic detail. 1 2 Z. SU ET AL. 1. INTRODUCTION The method that we present here is applicable to genotype data with missing data and is computa- Over the last two years genome-wide association tionally feasible to analyze thousands of individuals studies have been successful in uncovering novel dis- across the whole genome (it requires approximately ease causing variants [2, 3, 7, 21, 28–30]. All of these the same amount of computational resources as im- studies have proceeded by testing for associations at putation [13]). A novel feature of our method is that SNPs assayed by a commercial genotyping chip and we can take advantage of the HapMap haplotypes to many have also used genotype imputation methods build the genealogical trees at each putative risk lo- [13] to test untyped SNPs, especially when combin- cus. ing studies that used different genotyping chips to We provide an informal description of the method carry out larger meta-analysis studies. first and full technical details of the method are It is possible that signals of association will be given in the Methods section. Our approach pro- missed by these methods and there are several ways ceeds in several stages: in which this could happen. First, the true causal variant, which may be a SNP but could also be an (i) We use a panel of known haplotype variation, Indel or Copy Number Variant (CNV), may not be such as HapMap [6], and an estimate of the fine-scale on the chip or on the typed reference panel and may recombination rate (such as that available from the not be in sufficient Linkage Disequilibrium (LD) with HapMap website), to construct an approximation a single typed or imputed SNP for a signal to be de- to the genealogy of the sample of haplotypes in the tected. If this is the case the variant may be well panel, at each point on a grid of positions across the identified by considering a local haplotype in the re- genome. gion, thus the association may be detected if such (ii) For each such tree, we then, in turn, consider effects are tested for association. Second, it may be putative mutations on each of its branches. Once the case that the causal model of association in the the branch for a mutation is chosen, this will fix region involves more than one SNP. One way to de- the alleles carried by chromosomes at each of the scribe this model would be to say that there is allelic tips of the tree. Assuming such a SNP exists in the heterogeneity in the association signal. If the SNPs population, we use a population genetics model to are in LD then the various haplotypes that consist predict the likely genotypes at this SNP in each case of the causal SNPs may have distinct relative risks. and control individual (we call these the study in- If this is the case then the model might also be de- dividuals). This is perhaps simplest to conceptual- scribed as a haplotype effect model. ize under the simplifying assumption that we had In this paper we investigate a method that is com- haploid data on the study individuals. The popula- plementary to SNP-based association tests that al- tion genetics model allows us to place each haploid lows for these more complex disease models. To go study chromosome (probabilistically) on the tips of beyond testing typed or imputed genetic variants the genealogical tree: each tip contains a single panel we need to construct a model for genetic variation haplotype, and the study haplotype will tend to be that has not been directly observed. We achieve this placed on the tips corresponding to panel haplotypes by modeling the genealogy of the sample of chro- that are locally similar to it. For diploid study data mosomes at each point along the genome and then there is an additional level that, in effect, averages estimating genotypes, in the case-control samples, over likely local phasings of the data. The result, for at SNPs derived by placing mutations on the in- each study individual, is a probability distribution dividual branches of the tree. The genotypes that over the possible genotypes at the putative SNP. are derived from the local genealogies can then be (iii) The next step takes the predicted genotypes associated with the phenotype under study, which with their uncertainties and looks for evidence of we test using Bayesian methods that naturally ac- association with disease status. (Note the need to count for the inherent uncertainty in the location of handle appropriately the uncertainty over the pre- the disease mutation on the genealogy. Some pre- dicted genotypes.) We do this here in a Bayesian vious approaches that have used genealogical trees, framework. For a particular genomic location and have either been applicable only to haplotype data putative SNP, the evidence for association is nat- with no missing data [4, 27] or computationally pro- urally measured by a Bayes factor [13] (BF) that hibitive and thus restricted to small samples [10, 33]. compares a model of association with a null model BAYESIAN METHOD 3 of no association. The uncertainty over the possible We also carried out some simulation studies that branch carrying the mutation is handled by averag- highlight additional features of our approach. First, ing over the Bayes factors for each branch, to give a we examine how well our method does at uncovering single BF summarizing the evidence for the presence allelic heterogeneity where it exists. We show that of a causative mutation at that position. this can be quite a hard problem but our method (iv) To allow for possible allelic heterogeneity, we does have good power to uncover the action of more can extend the analyses above by putting two (or than one causal variant. Second, we consider the more) putative mutations on the genealogical tree problem of estimating the effect size of a causal vari- and predicting genotypes for the pair of SNPs in ant in an associated region in the specific case where study individuals, before fitting disease models with the causal variant is not well tagged by a typed SNP. multiple causative mutations. We would then av- We show that our method is able to provide a more erage over pairs of positions for the mutation in accurate estimate of the effect size in this case. calculating a BF for the strength of evidence for The next section describes the details of our meth- a 2-mutation disease model at that position, com- ods and this is followed by a section on the analy- pared to the null hypothesis, and by comparing the sis of the WTCCC data and simulation studies. We 2-mutation BF with the single mutation BF, one can conclude with a discussion of the results and the assess the relative evidence for allelic heterogeneity, likely applications of our method.
Recommended publications
  • Enthusiasm Mixed with Scepticism About Single-Nucleotide Polymorphism Markers for Dissecting Complex Disorders
    European Journal of Human Genetics (1999) 7, 98–101 t © 1999 Stockton Press All rights reserved 1018–4813/99 $12.00 http://www.stockton-press.co.uk/ejhg MEETING REPORT First International SNP Meeting at Skokloster, Sweden, August 1998 Enthusiasm mixed with scepticism about single-nucleotide polymorphism markers for dissecting complex disorders Ann-Christine Syv˚anen1, Ulf Landegren2, Anders Isaksson2, Ulf Gyllensten2 and Anthony Brookes2 1Department of Medical Sciences; 2Department of Genetics and Pathology; Uppsala University, Uppsala, Sweden Single nucleotide polymorphisms (SNP) are frequent in number of participants, created a relaxed atmosphere our genomes, occurring on average once every thou- that stimulated many informal scientific discussions. sand nucleotides. They are useful as genetic markers Following an introductory lecture from Gert-Jan van because SNPs evolve slowly and because they can be Ommen (Leiden, NL), President of HUGO, there was scored by technically simple methods. Moreover, a a lively discussion on the role of HUGO in the genome great deal of the functional variation that explains project, revolving around its part in linking the diverse human diversity probably results from SNPs. In recent activities in the field, ranging from facilitating mapping years, SNP-based strategies to explain common disease and mutation analysis to nomenclature supervision, have received enormous attention from the scientific providing guidance in genome diversity and population community as well as from pharmaceutical and bio- studies and more generally in the ethical and intellec- technology companies. The enthusiasm stems largely tual property areas. During the two and a half days of from the hope that large-scale association studies using the meeting advances in technology for discovering and SNPs will help identify genes that contribute to scoring SNPs, SNP-based approaches to dissect com- common disorders with complex genetic causation, and plex diseases and the utility of SNPs in the study of allow formulation of improved therapeutic drugs.
    [Show full text]
  • Understanding Rare and Common Diseases in the Context of Human Evolution Lluis Quintana-Murci1,2,3
    Quintana-Murci Genome Biology (2016) 17:225 DOI 10.1186/s13059-016-1093-y REVIEW Open Access Understanding rare and common diseases in the context of human evolution Lluis Quintana-Murci1,2,3 Abstract (e.g., migration, admixture, and changes in population size), and natural selection. The wealth of available genetic information is allowing The plethora of genetic information generated in the the reconstruction of human demographic and adaptive last ten years, thanks largely to the publication of se- history. Demography and purifying selection affect the quencing datasets for both modern human populations purge of rare, deleterious mutations from the human and ancient DNA samples [14–18], is making it possible population, whereas positive and balancing selection to reconstruct the genetic history of our species, and to can increase the frequency of advantageous variants, define the parameters characterizing human demographic improving survival and reproductioninspecific history: expansion out of Africa, the loss of genetic diversity environmental conditions. In this review, I discuss with increasing distance from Africa (i.e., the “serial founder how theoretical and empirical population genetics effect”), demographic expansions over different time scales, studies, using both modern and ancient DNA data, and admixture with ancient hominins [16–21]. These stud- are a powerful tool for obtaining new insight into ies are also revealing the extent to which selection has acted the genetic basis of severe disorders and complex on the human genome, providing insight into the way in disease phenotypes, rare and common, focusing which selection removes deleterious variation and the po- particularly on infectious disease risk. tential of human populations to adapt to the broad range of climatic, nutritional, and pathogenic environments they have occupied [22–28].
    [Show full text]
  • Population Genetics Models of Common Diseases Anna Di Rienzo
    Population genetics models of common diseases Anna Di Rienzo The number and frequency of susceptibility alleles for common This is because the relevant fitness effects are those that diseases are important factors to consider in the efficient acted in ancestral human populations, whereas clinical design of disease association studies. These quantities are the phenotypes are assessed in contemporary populations. results of the joint effects of mutation, genetic drift and In addition, clinical severity does not necessarily imply selection. Hence, population genetics models, informed by effects on reproduction, viability and survival, especially empirical knowledge about patterns of disease variation, can for diseases with late age of onset. be used to make predictions about the allelic architecture of common disease susceptibility and to gain an overall Disease mapping strategies rely — implicitly or explicitly understanding about the evolutionary origins of such diseases. — on population genetics quantities, such as allelic Equilibrium models and empirical studies suggest a role for heterogeneity, allele frequency spectrum and inter-eth- both rare and common variants. In addition, increasing nic differentiation; for example, standard association evidence points to changes in selective pressures on methods are most powerful for detecting common sus- susceptibility genes for common diseases; these findings are ceptibility variants with limited heterogeneity. Likewise, likely to form the basis for further modeling studies. the evidence for association is most likely to be replicated in additional populations if the susceptibility variants are Addresses Department of Human Genetics, University of Chicago, Chicago, the same and similarly common across populations. IL 60637, USA Because these quantities are functions of the underlying population genetics model, modeling studies might help Corresponding author: Di Rienzo, Anna to optimize disease-mapping strategies to the outcomes of ([email protected]) specific evolutionary scenarios.
    [Show full text]
  • The Population Genomics of Adaptive Loss of Function
    Heredity (2021) 126:383–395 https://doi.org/10.1038/s41437-021-00403-2 REVIEW ARTICLE The population genomics of adaptive loss of function 1,2 3 1 4,5 J. Grey Monroe ● John K. McKay ● Detlef Weigel ● Pádraic J. Flood Received: 24 September 2020 / Revised: 28 December 2020 / Accepted: 1 January 2021 / Published online: 11 February 2021 © The Author(s) 2021. This article is published with open access Abstract Discoveries of adaptive gene knockouts and widespread losses of complete genes have in recent years led to a major rethink of the early view that loss-of-function alleles are almost always deleterious. Today, surveys of population genomic diversity are revealing extensive loss-of-function and gene content variation, yet the adaptive significance of much of this variation remains unknown. Here we examine the evolutionary dynamics of adaptive loss of function through the lens of population genomics and consider the challenges and opportunities of studying adaptive loss-of-function alleles using population genetics models. We discuss how the theoretically expected existence of allelic heterogeneity, defined as multiple functionally analogous mutations at the same locus, has proven consistent with empirical evidence and why this impedes both the detection of selection and causal relationships with phenotypes. We then review technical progress towards new functionally explicit population genomic tools and genotype-phenotype methods to overcome these limitations. More 1234567890();,: 1234567890();,: broadly, we discuss how the challenges of studying adaptive loss of function highlight the value of classifying genomic variation in a way consistent with the functional concept of an allele from classical population genetics.
    [Show full text]
  • An Assessment of “Hidden” Heterogeneity Within Electromorphs at Three Enzyme Loci in Deer Mice
    Copyright 0 1982 by the Genetics Society of America AN ASSESSMENT OF “HIDDEN” HETEROGENEITY WITHIN ELECTROMORPHS AT THREE ENZYME LOCI IN DEER MICE CHARLES F. AQUADRO’ AND JOHN C. AVISE Department of Molecular and Population Genetics, University of Georgia, Athens, Georgia 30602 Manuscript received October 9, 1981 Revised copy accepted July 3, 1982 ABSTRACT Allelic heterogeneity within protein electromorphs at three loci was examined in populations of deer mice (Peromyscus moniculatus) collected from five localities across North America. We used a variety of electrophoretic techniques (including several starch and acrylamide conditions, gel-sieving, and isoelectric focusing) plus heat denaturation. Of particular interest was the supernatant glutamate oxalate transaminase system (GOT-1; aspartate aminotransferase-1 of some authors), which under standard electrophoretic conditions had been shown to exhibit basically a two-allele polymorphism throughout the range of maniculatus. The use of all of the above techniques failed to uncover any additional variation for GOT-1 in these populations. Similarly, no new scorable variation was resolved at the essentially monomorphic malate dehydrogenase- 1 locus by additional conditions of electrophoresis. In marked contrast to the results for the above two enzymes, the use of multiple conditions of electropho- resis resolved the 8 standard-condition electromorphs of esterase-1 into a total of 23 variants showing strong geographic differentiation in frequency. These 23 electromorphs were further divided into a total of 35 variants by thermal stability studies. However, the allelic nature of all of the thermal stability esterase variants remains to be documented. The results of this study, taken together with the remarkable geographic heterogeneity for this species in ecology, morphology, karyotype and mitochondrial DNA sequence, suggest that some form of balancing selection may be acting to maintain the GOT-1 poly- morphism.
    [Show full text]
  • Extreme Allelic Heterogeneity at a Caenorhabditis Elegans Beta-Tubulin Locus Explains Natural Resistance to Benzimidazoles
    bioRxiv preprint doi: https://doi.org/10.1101/372623; this version posted July 19, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license. 1 Extreme allelic heterogeneity at a Caenorhabditis elegans beta-tubulin locus 2 explains natural resistance to benzimidazoles 3 4 Steffen R. Hahnel⧺ ,1, Stefan Zdraljevic⧺ ,1,2, Briana C. Rodriguez1, Yuehui Zhao3, Patrick T. McGrath3, and 5 Erik C. Andersen1,2,4 * 6 7 1. DepartMent of Molecular Biosciences, Northwestern University, Evanston, IL 60208, USA 8 2. Interdisciplinary Biological Sciences Program, Northwestern University, Evanston, IL 60208, USA 9 3. School of Biology, Georgia Institute of Technology, Atlanta, Georgia 30332 10 4. Robert H. Lurie Comprehensive Cancer Center of Northwestern University, Chicago, IL 60611, USA 11 ⧺ Equal contribution 12 * Corresponding author 13 14 15 Erik C. Andersen 16 Assistant Professor of Molecular Biosciences 17 Northwestern University 18 Evanston, IL 60208, USA 19 Tel: (847) 467-4382 20 Fax: (847) 491-4461 21 Email: [email protected] 22 23 24 25 26 Steffen R. Hahnel, [email protected], ORCID 0000-0001-8848-0691 27 Stefan Zdraljevic, [email protected], ORCID 0000-0003-2883-4616 28 Briana C. Rodriguez, [email protected] 29 Yuehui Zhao, [email protected] 30 Patrick T. McGrath, [email protected], ORCID 0000-0002-1598-3746 31 Erik C. Andersen, [email protected], ORCID 0000-0003-0229-9651 1 of 52 bioRxiv preprint doi: https://doi.org/10.1101/372623; this version posted July 19, 2018.
    [Show full text]