A Bayesian Method for Detecting and Characterizing Allelic Heterogeneity and Boosting Signals in Genome-Wide Association Studies

Statistical Science 2009, Vol. 24, No. 4, 430–450 DOI: 10.1214/09-STS311 c Institute of Mathematical Statistics, 2009 A Bayesian Method for Detecting and Characterizing Allelic Heterogeneity and Boosting Signals in Genome-Wide Association Studies Zhan Su, Niall Cardin , the Wellcome Trust Case Control Consortium, Peter Donnelly and Jonathan Marchini Abstract. The standard paradigm for the analysis of genome-wide association studies involves carrying out association tests at both typed and imputed SNPs. These methods will not be optimal for detecting the signal of association at SNPs that are not currently known or in regions where allelic heterogeneity occurs. We propose a novel association test, complementary to the SNP-based approaches, that attempts to extract further signals of association by explicitly modeling and estimating both unknown SNPs and allelic heterogeneity at a locus. At each site we estimate the genealogy of the case-control sample by tak- ing advantage of the HapMap haplotypes across the genome. Allelic heterogeneity is modeled by allowing more than one mutation on the branches of the genealogy. Our use of Bayesian methods allows us to assess directly the evidence for a causative SNP not well correlated with known SNPs and for allelic heterogeneity at each locus. Using simu- lated data and real data from the WTCCC project, we show that our method (i) produces a significant boost in signal and accurately identi- fies the form of the allelic heterogeneity in regions where it is known to exist, (ii) can suggest new signals that are not found by testing typed or imputed SNPs and (iii) can provide more accurate estimates of effect sizes in regions of association. arXiv:1010.4670v1 [stat.ME] 22 Oct 2010 Key words and phrases: Complex disease, genome-wide association, allelic heterogeneity, Bayesian methods. Zhan Su is Analyst, Department of Statistics, University of Oxford, 1 South Parks Road, Oxford OX1 is University Lecturer in Statistical Genomics, 3TG, UK and Wellcome Trust Centre for Human Department of Statistics, University of Oxford, 1 South Genetics, University of Oxford, Roosevelt Drive, Parks Road, Oxford OX1 3TG, UK and Wellcome Trust Oxford, OX3 7BN, UK. Niall Cardin is Postdoctoral Centre for Human Genetics, University of Oxford, Recearcher, Department of Statistics, University of Roosevelt Drive, Oxford, OX3 7BN, UK e-mail: Oxford, 1 South Parks Road, Oxford OX1 3TG, UK. [email protected]. Full details of Consortium membership are available at www.wtccc.org.uk. Peter Donnelly is Professor, This is an electronic reprint of the original article Wellcome Trust Centre for Human Genetics, University published by the Institute of Mathematical Statistics in of Oxford, Roosevelt Drive, Oxford, OX3 7BN, UK and Statistical Science, 2009, Vol. 24, No. 4, 430–450. This Department of Statistics, University of Oxford, 1 South reprint differs from the original in pagination and Parks Road, Oxford OX1 3TG, UK. Jonathan Marchini typographic detail. 1 2 Z. SU ET AL. 1. INTRODUCTION The method that we present here is applicable to genotype data with missing data and is computa- Over the last two years genome-wide association tionally feasible to analyze thousands of individuals studies have been successful in uncovering novel dis- across the whole genome (it requires approximately ease causing variants [2, 3, 7, 21, 28–30]. All of these the same amount of computational resources as im- studies have proceeded by testing for associations at putation [13]). A novel feature of our method is that SNPs assayed by a commercial genotyping chip and we can take advantage of the HapMap haplotypes to many have also used genotype imputation methods build the genealogical trees at each putative risk lo- [13] to test untyped SNPs, especially when combin- cus. ing studies that used different genotyping chips to We provide an informal description of the method carry out larger meta-analysis studies. first and full technical details of the method are It is possible that signals of association will be given in the Methods section. Our approach pro- missed by these methods and there are several ways ceeds in several stages: in which this could happen. First, the true causal variant, which may be a SNP but could also be an (i) We use a panel of known haplotype variation, Indel or Copy Number Variant (CNV), may not be such as HapMap [6], and an estimate of the fine-scale on the chip or on the typed reference panel and may recombination rate (such as that available from the not be in sufficient Linkage Disequilibrium (LD) with HapMap website), to construct an approximation a single typed or imputed SNP for a signal to be de- to the genealogy of the sample of haplotypes in the tected. If this is the case the variant may be well panel, at each point on a grid of positions across the identified by considering a local haplotype in the re- genome. gion, thus the association may be detected if such (ii) For each such tree, we then, in turn, consider effects are tested for association. Second, it may be putative mutations on each of its branches. Once the case that the causal model of association in the the branch for a mutation is chosen, this will fix region involves more than one SNP. One way to de- the alleles carried by chromosomes at each of the scribe this model would be to say that there is allelic tips of the tree. Assuming such a SNP exists in the heterogeneity in the association signal. If the SNPs population, we use a population genetics model to are in LD then the various haplotypes that consist predict the likely genotypes at this SNP in each case of the causal SNPs may have distinct relative risks. and control individual (we call these the study in- If this is the case then the model might also be de- dividuals). This is perhaps simplest to conceptual- scribed as a haplotype effect model. ize under the simplifying assumption that we had In this paper we investigate a method that is com- haploid data on the study individuals. The popula- plementary to SNP-based association tests that al- tion genetics model allows us to place each haploid lows for these more complex disease models. To go study chromosome (probabilistically) on the tips of beyond testing typed or imputed genetic variants the genealogical tree: each tip contains a single panel we need to construct a model for genetic variation haplotype, and the study haplotype will tend to be that has not been directly observed. We achieve this placed on the tips corresponding to panel haplotypes by modeling the genealogy of the sample of chro- that are locally similar to it. For diploid study data mosomes at each point along the genome and then there is an additional level that, in effect, averages estimating genotypes, in the case-control samples, over likely local phasings of the data. The result, for at SNPs derived by placing mutations on the in- each study individual, is a probability distribution dividual branches of the tree. The genotypes that over the possible genotypes at the putative SNP. are derived from the local genealogies can then be (iii) The next step takes the predicted genotypes associated with the phenotype under study, which with their uncertainties and looks for evidence of we test using Bayesian methods that naturally ac- association with disease status. (Note the need to count for the inherent uncertainty in the location of handle appropriately the uncertainty over the pre- the disease mutation on the genealogy. Some predicted genotypes.) We do this here in a Bayesian vious approaches that have used genealogical trees, framework. For a particular genomic location and have either been applicable only to haplotype data putative SNP, the evidence for association is nat- with no missing data [4, 27] or computationally pro- urally measured by a Bayes factor [13] (BF) that hibitive and thus restricted to small samples [10, 33]. compares a model of association with a null model BAYESIAN METHOD 3 of no association. The uncertainty over the possible We also carried out some simulation studies that branch carrying the mutation is handled by averag- highlight additional features of our approach. First, ing over the Bayes factors for each branch, to give a we examine how well our method does at uncovering single BF summarizing the evidence for the presence allelic heterogeneity where it exists. We show that of a causative mutation at that position. this can be quite a hard problem but our method (iv) To allow for possible allelic heterogeneity, we does have good power to uncover the action of more can extend the analyses above by putting two (or than one causal variant. Second, we consider the more) putative mutations on the genealogical tree problem of estimating the effect size of a causal vari- and predicting genotypes for the pair of SNPs in ant in an associated region in the specific case where study individuals, before fitting disease models with the causal variant is not well tagged by a typed SNP. multiple causative mutations. We would then av- We show that our method is able to provide a more erage over pairs of positions for the mutation in accurate estimate of the effect size in this case. calculating a BF for the strength of evidence for The next section describes the details of our meth- a 2-mutation disease model at that position, com- ods and this is followed by a section on the analy- pared to the null hypothesis, and by comparing the sis of the WTCCC data and simulation studies. We 2-mutation BF with the single mutation BF, one can conclude with a discussion of the results and the assess the relative evidence for allelic heterogeneity, likely applications of our method.

A Bayesian Method for Detecting and Characterizing Allelic Heterogeneity and Boosting Signals in Genome-Wide Association Studies

Enthusiasm Mixed with Scepticism About Single-Nucleotide Polymorphism Markers for Dissecting Complex Disorders

Understanding Rare and Common Diseases in the Context of Human Evolution Lluis Quintana-Murci1,2,3

Population Genetics Models of Common Diseases Anna Di Rienzo

The Population Genomics of Adaptive Loss of Function

An Assessment of “Hidden” Heterogeneity Within Electromorphs at Three Enzyme Loci in Deer Mice

Extreme Allelic Heterogeneity at a Caenorhabditis Elegans Beta-Tubulin Locus Explains Natural Resistance to Benzimidazoles