A Powerful and Efficient Set Test for Genetic Markers That Handles Confounders Jennifer Listgarten1,*, Christoph Lippert1,*, Eun Yong Kang1, Jing Xiang1, Carl M

Total Page:16

File Type:pdf, Size:1020Kb

A Powerful and Efficient Set Test for Genetic Markers That Handles Confounders Jennifer Listgarten1,*, Christoph Lippert1,*, Eun Yong Kang1, Jing Xiang1, Carl M Category A powerful and efficient set test for genetic markers that handles confounders Jennifer Listgarten1,*, Christoph Lippert1,*, Eun Yong Kang1, Jing Xiang1, Carl M. Kadie1 and David Heckerman1,* 1eScience Group, Microsoft Research, Los Angeles, USA. In press at Bioinformatics. ABSTRACT association between rare variants within a gene and disease (Wu et Motivation: Approaches for testing sets of variants, such as a set of al., 2011; Bansal et al., 2010). As next generation sequencing rap- rare or common variants within a gene or pathway, for association idly becomes the norm, these set-based tests, complementary to with complex traits are important. In particular, set tests allow for single SNP tests, will become increasingly important. However, aggregation of weak signal within a set, can capture interplay among existing methods for testing sets of SNPs do not handle confound- variants, and reduce the burden of multiple hypothesis testing. Until ing such as arises when related individuals or those of diverse eth- now, these approaches did not address confounding by family relat- nic backgrounds are included in the study. Such confounders, when not accounted for, result in loss of power and spurious asso- edness and population structure, a problem that is becoming more ciations (Balding, 2006; A. Price et al., 2010). Yet it is precisely important as larger data sets are used to increase power. these richly structured cohorts which yield the most power for Results: We introduce a new approach for set tests that handles discovery of the genetic underpinnings of complex traits. Moreo- confounders. Our model is based on the linear mixed model and ver, such structure typically presents itself as data cohorts become uses two random effects—one to capture the set association signal larger and larger to enable the discovery of weak signals. and one to capture confounders. We also introduce a computational speedup for two-random-effects models that makes this approach In this paper, we introduce a new, powerful and computationally feasible even for extremely large cohorts. Using this model with both efficient likelihood ratio-based set test that accounts for rich con- the likelihood ratio test and score test, we find that the former yields founding structure. We demonstrate control of type I error as well more power while controlling type I error. Application of our ap- as improved power over the more traditionally used score test. proach to richly structured GAW14 data demonstrates that our Finally, we demonstrate application of our approach to two real method successfully corrects for population structure and family GWAS data sets. Both data sets showed evidence of spurious asso- relatedness, while application of our method to a 15,000 individual ciation due to confounders in an uncorrected analysis, while appli- Crohn’s disease case-control cohort demonstrates that it additionally cation of our set test corrected for confounders and uncovered recovers genes not recoverable by univariate analysis. signal not recovered by univariate analysis. Finally, our test is extremely computationally efficient owing to development of a Availability: A Python-based library implementing our approach is new linear mixed model algorithm also presented herein, which available at http://mscompbio.codeplex.com makes possible, for example, set analysis of the 15,000 individual Contact:{jennl,lippert,heckerma}@microsoft.com Wellcome Trust Case Control Consortium (WTCCC) data. 1 INTRODUCTION Several approaches have been used to jointly test sets of SNPs: post hoc, gene-set enrichment in which univariate P values are Traditional Genome-Wide Association Studies (GWAS) test one aggregated (Holden et al, 2008), operator-based aggregation such single nucleotide polymorphism (SNP) at a time for association as “collapsing” of SNP values (Braun and Buetow, 2011; B. Li and with disease, overlooking interplay between SNPs within a gene or Leal, 2008), multivariate regression, typically penalized pathway, missing weak signal that aggregates in sets of related (Schwender et al., 2011; Malo et al., 2008), and variance compo- SNPs, and incurring a severe penalty for multiple testing. More nent (also called kernel) models such as a linear mixed models recently, sets of SNPs have been tested jointly in a gene-set en- (Wu et al., 2011, 2010; Quon et al., 2013). richment style approach (Holden et al, 2008), and also in seeking Our approach is based on the linear mixed model (LMM) which can equivalently be viewed as a multivariate regression. In partic- *To whom correspondence should be addressed, equal contributions. ular, use of a LMM with a specific form of genetic similarity ma- trix is equivalent to regressing those SNPs used to estimate genetic similarity on the phenotype (Hayes et al., 2009; Listgarten et al., 1 J.Listgarten et al. 2012). If one uses only SNPs to be tested in the similarity matrix as 2 METHODS in Wu et al., 2010, 2011, then one is effectively performing a mul- Let 푁(풗|풖; 횺) denote a multivariate Normal distribution in 풗 with mean 풖 tivariate regression test. However, by also using SNPs that tag and covariance matrix 횺. The log likelihood of a one-variance-component confounders in a separate similarity matrix, our model can addi- linear mixed model in the linear regression view is given by tionally correct for confounders, as has been done in a single-SNP ퟏ test GWAS setting (Kang et al., 2010; Yu et al., 2006; Listgarten LL ≡ log ∫ N (풚|푿휷 + 푽풘; 휎2푰) ⋅ N(풘|ퟎ; 휎2푰) d풘, √풔 푒 푔 et al., 2012; Lippert et al., 2011). Finally, our approach allows one to condition on other causal SNPs, by way of the similarity matrix, for increased power, again, as has been done in single-SNP test where 풚 is a 1 × 푁 vector of phenotype values for 푁 individuals; 휷 is the setting (Atwell et al., 2010; Listgarten et al., 2012; Segura et al., set of the fixed effects of the covariates stored in the design matrix 푿; 푰 is 2 2012). an 푁 × 푁 identity matrix; 휎푒 is the residual variance in the regression; 풘 are the 1 × 푁 random effects for the SNPs stored in the design matrix V 2 The use of LMMs to correct for confounders in genome-wide as- (dimension 푁 × 푠), and N(풘|0; 휎푔 푰) is the distribution for the weight sociation studies (GWAS) is now widely accepted, because this parameters. That is, the random regression weights, 풘 are marginalized 2 approach has been shown capable of correcting for several forms over independent Normal distributions with equal variance 휎푔 . of genetic relatedness such as population structure and family re- latedness (Kang et al., 2010; Astle and Balding, 2009; Yu et al., Equivalently, and more typically, the log likelihood is written with random 2006; A. Price et al., 2010). Independently, the use of LMMs to effects marginalized out, jointly test rare variants has become prevalent (Wu et al., 2010, 2 2 2011). In our new approach, we marry the aforementioned uses of LL = log N(풚|푿휷; 휎푒 푰 + 휎푔 푲), LMMs to perform set tests in the presence of confounders within a single, robust, and well-defined statistical model. where the genetic similarity (called the kernel in some contexts), 푲, is given by 푲= ퟏ 푽푽푻, as is the case, for example, when 푲 is the realized 풔 Because of the aforementioned equivalence, our approach can also relationship matrix (RRM) (B. J. Hayes et al, 2009; Lippert et al, 2011). be viewed as a form of linear regression with two distinct sets of Given this equivalence, the SNPs used to estimate genetic similarity (those covariates. The first set of covariates consists of SNPs that correct in 푽) can be interpreted as a set of covariates in the regression. for confounders (and other causal SNPs), that is, those which pre- In our model, we partition the random effects into two sets: one set of ran- dict race and relatedness, for example. Inclusion of these SNP dom effects, 풘푪 (with design matrix 푽푪), are used to correct for confound- covariates makes the data for individuals independently and identi- ers (and condition on causal SNPs) using 푠 SNPs, while the other set, 풘 , cally distributed (i.e., knowing the value of these SNPs induces a 푐 푺 are used to test the 푠푠 SNPs of interest in the corresponding design matrix, common distribution from which the individuals are drawn). The 푽푺. The log likelihood (in the linear regression view) is then written second set of covariates consists of SNPs for a given set of interest, such as those SNPs belonging to a gene. We call our approach 푽푪 푽푺 2 LL = log ∬ N (풚|푿휷 + 풘푪 + 풘푺; 휎푒 푰) ⋅ N(풘푪|0; 푰) N(풘푺|0; 푰) d풘푪d풘푺, FaST-LMM-Set. √풔풄 √풔풔 2 2 Computing the likelihood for our model—a LMM with two ran- where each set of random effects has a separate variance (휎퐶 and 휎푆 ). Again, dom effects—is, naively, extremely expensive, as it scales cubical- we can equivalently write this in the marginalized form, ly with the number of individuals (e.g., Listgarten et al, 2010). For 휎2 휎2 example, on the 15,000 individual WTCCC data set we analyse, 2 퐶 푻 푆 T 퐿퐿 = log N (풚|푿휷; 휎푒 푰 + 푽푪푽푪 + 푽푺푽푺 ) . currently available algorithms would need to compute and store in 푠푐 푠푠 memory genetic similarity matrices of dimension 15,000 × 15,000 and repeatedly perform cubic operations on them to test just a sin- For convenience, we re-parameterize this as gle set of SNPs—a practically infeasible approach. However, ex- 퐿퐿 = log N 풚|푿휷; 휎2푰 + 휎2[(1 − 휏)푲 + 휏푲 ] , (1) tending our previous work that made LMMs with a single random ( 푒 푔 푪 푺 ) effect linear in the number of individuals (Lippert et al 2011) to the two-variance component model needed here, we bypass this com- where now the covariance matrix, 푲, has been partitioned into two variance putational bottleneck, yielding a new, two-random-effects algo- components: rithm which is linear in the number of individuals.
Recommended publications
  • Identifying Lineage Effects When Controlling for Population Structure Improves Power in Bacterial Association Studies
    Identifying lineage effects when controlling for population structure improves power in bacterial association studies Sarah G Earle1*, Chieh-Hsi Wu1*, Jane Charlesworth1*, Nicole Stoesser1, N Claire Gordon1, Timothy M Walker1, Chris C A Spencer2, Zamin Iqbal2, David A Clifton3, Katie L Hopkins4, Neil Woodford4, E Grace Smith5, Nazir Ismail6, Martin J Llewelyn7, Tim E Peto1, Derrick W Crook1, Gil McVean2, A Sarah Walker1, Daniel J Wilson1,2 Addresses: 1 Nuffield Department of Medicine, University of Oxford, John Radcliffe Hospital, Oxford, OX3 9DU, United Kingdom 2 Wellcome Trust Centre for Human Genetics, University of Oxford, Roosevelt Drive, Oxford, OX3 7BN, United Kingdom 3 Institute of Biomedical Engineering, Department of Engineering Science, University of Oxford, Oxford, United Kingdom 4 Antimicrobial Resistance and Healthcare Associated Infections Reference Unit, Public Health England, London, NW9 5EQ, United Kingdom 5 Public Health England, West Midlands Public Health Laboratory, Heartlands Hospital, Birmingham, B9 5SS, United Kingdom 6 Centre for Tuberculosis, National Institute for Communicable Diseases, Johannesburg, South Africa 7 Department of Infectious Diseases and Microbiology, Royal Sussex County Hospital, Brighton, United Kingdom * These authors contributed equally Corresponding author D.J.W. Email: [email protected] Abstract Bacteria pose unique challenges for genome-wide association studies (GWAS) because of strong structuring into distinct strains and substantial linkage disequilibrium across the genome. While methods developed for human studies can correct for strain structure, this risks considerable loss- of-power because genetic differences between strains often contribute substantial phenotypic variability. Here we propose a new method that captures lineage-level associations even when locus-specific associations cannot be fine-mapped.
    [Show full text]
  • S41467-019-13480-Z OPEN Insights Into Malaria Susceptibility Using Genome- Wide Data on 17,000 Individuals from Africa, Asia and Oceania
    ARTICLE https://doi.org/10.1038/s41467-019-13480-z OPEN Insights into malaria susceptibility using genome- wide data on 17,000 individuals from Africa, Asia and Oceania Malaria Genomic Epidemiology Network The human genetic factors that affect resistance to infectious disease are poorly understood. Here we report a genome-wide association study in 17,000 severe malaria cases and 1234567890():,; population controls from 11 countries, informed by sequencing of family trios and by direct typing of candidate loci in an additional 15,000 samples. We identify five replic- able associations with genome-wide levels of evidence including a newly implicated variant on chromosome 6. Jointly, these variants account for around one-tenth of the heritability of severe malaria, which we estimate as ~23% using genome-wide genotypes. We interrogate available functional data and discover an erythroid-specific transcription start site underlying the known association in ATP2B4, but are unable to identify a likely causal mechanism at the chromosome 6 locus. Previously reported HLA associations do not replicate in these sam- ples. This large dataset will provide a foundation for further research on the genetic deter- minants of malaria resistance in diverse populations. A full list of authors and their affiliations appears at the end of the paper. NATURE COMMUNICATIONS | (2019) 10:5732 | https://doi.org/10.1038/s41467-019-13480-z | www.nature.com/naturecommunications 1 ARTICLE NATURE COMMUNICATIONS | https://doi.org/10.1038/s41467-019-13480-z enome-wide association studies (GWASs) have been very passed our quality control process (Methods). Subsets of these successful in identifying common genetic variants data from The Gambia, Malawi and Kenya7, from Tanzania8 and G 9 underlying chronic non-communicable diseases, but have from selected control samples have been reported previously proved to be more difficult for acute infectious diseases that (Supplementary Table 2).
    [Show full text]
  • View a Copy of This Licence, Visit
    Abegaz et al. BioData Mining (2021) 14:16 https://doi.org/10.1186/s13040-021-00247-w METHODOLOGY Open Access Performance of model-based multifactor dimensionality reduction methods for epistasis detection by controlling population structure Fentaw Abegaz1* , François Van Lishout1, Jestinah M. Mahachie John1, Kridsadakorn Chiachoompu1, Archana Bhardwaj1, Diane Duroux1, Elena S. Gusareva1, Zhi Wei2, Hakon Hakonarson3,4 and Kristel Van Steen1,5 * Correspondence: fentawabegaz@ yahoo.com Abstract 1GIGA-R, Medical Genomics – BIO3, University of Liège, Liège, Belgium Background: In genome-wide association studies the extent and impact of confounding Full list of author information is due to population structure have been well recognized. Inadequate handling of such available at the end of the article confounding is likely to lead to spurious associations, hampering replication, and the identification of causal variants. Several strategies have been developed for protecting associations against confounding, the most popular one is based on Principal Component Analysis. In contrast, the extent and impact of confounding due to population structure in gene-gene interaction association epistasis studies are much less investigated and understood. In particular, the role of nonlinear genetic population substructure in epistasis detection is largely under-investigated, especially outside a regression framework. Methods: To identify causal variants in synergy, to improve interpretability and replicability of epistasis results, we introduce three strategies based on a model-based multifactor dimensionality reduction approach for structured populations, namely MBMDR-PC, MBMDR- PG, and MBMDR-GC. Results: Simulation results comparing the performance of various approaches show that in thepresenceofpopulationstructureMBMDR-PC and MBMDR-PG consistently better control type I error rate at the nominal level than MBMDR-GC.
    [Show full text]
  • Confounding from Cryptic Relatedness in Case-Control Association Studies
    Confounding from Cryptic Relatedness in Case-Control Association Studies Benjamin F. Voight*, Jonathan K. Pritchard Department of Human Genetics, University of Chicago, Chicago, Illinois, United States of America Case-control association studies are widely used in the search for genetic variants that contribute to human diseases. It has long been known that such studies may suffer from high rates of false positives if there is unrecognized population structure. It is perhaps less widely appreciated that so-called ‘‘cryptic relatedness’’ (i.e., kinship among the cases or controls that is not known to the investigator) might also potentially inflate the false positive rate. Until now there has been little work to assess how serious this problem is likely to be in practice. In this paper, we develop a formal model of cryptic relatedness, and study its impact on association studies. We provide simple expressions that predict the extent of confounding due to cryptic relatedness. Surprisingly, these expressions are functions of directly observable parameters. Our analytical results show that, for well-designed studies in outbred populations, the degree of confounding due to cryptic relatedness will usually be negligible. However, in contrast, studies where there is a sampling bias toward collecting relatives may indeed suffer from excessive rates of false positives. Furthermore, cryptic relatedness may be a serious concern in founder populations that have grown rapidly and recently from a small size. As an example, we analyze the impact of excess relatedness among cases for six phenotypes measured in the Hutterite population. Citation: Voight BF, Pritchard JK (2005) Confounding from cryptic relatedness in case-control association studies.
    [Show full text]
  • Population Structure and Cryptic Relatedness in Genetic Association
    Statistical Science 2009, Vol. 24, No. 4, 451–471 DOI: 10.1214/09-STS307 c Institute of Mathematical Statistics, 2009 Population Structure and Cryptic Relatedness in Genetic Association Studies William Astle and David J. Balding1 Abstract. We review the problem of confounding in genetic associa- tion studies, which arises principally because of population structure and cryptic relatedness. Many treatments of the problem consider only a simple “island” model of population structure. We take a broader approach, which views population structure and cryptic relatedness as different aspects of a single confounder: the unobserved pedigree defin- ing the (often distant) relationships among the study subjects. Kinship is therefore a central concept, and we review methods of defining and es- timating kinship coefficients, both pedigree-based and marker-based. In this unified framework we review solutions to the problem of population structure, including family-based study designs, genomic control, struc- tured association, regression control, principal components adjustment and linear mixed models. The last solution makes the most explicit use of the kinships among the study subjects, and has an established role in the analysis of animal and plant breeding studies. Recent computa- tional developments mean that analyses of human genetic association data are beginning to benefit from its powerful tests for association, which protect against population structure and cryptic kinship, as well as intermediate levels of confounding by the pedigree. Key words and phrases: Cryptic relatedness, genomic control, kinship, mixed model, complex disease genetics, ascertainment. 1. CONFOUNDING IN GENETIC arXiv:1010.4681v1 [stat.ME] 22 Oct 2010 EPIDEMIOLOGY 1.1 Association and Linkage William Astle is Research Associate, Centre for Genetic association studies (Clayton, 2007) are Biostatistics, Department of Epidemiology and Public designed to identify genetic loci at which the al- Health, St.
    [Show full text]
  • Population Structure and Cryptic Relatedness in Genetic
    Statistical Science 2009, Vol. 24, No. 4, 451–471 DOI: 10.1214/09-STS307 c Institute of Mathematical Statistics, 2009 Population Structure and Cryptic Relatedness in Genetic Association Studies William Astle and David J. Balding1 Abstract. We review the problem of confounding in genetic associa- tion studies, which arises principally because of population structure and cryptic relatedness. Many treatments of the problem consider only a simple “island” model of population structure. We take a broader approach, which views population structure and cryptic relatedness as different aspects of a single confounder: the unobserved pedigree defin- ing the (often distant) relationships among the study subjects. Kinship is therefore a central concept, and we review methods of defining and es- timating kinship coefficients, both pedigree-based and marker-based. In this unified framework we review solutions to the problem of population structure, including family-based study designs, genomic control, struc- tured association, regression control, principal components adjustment and linear mixed models. The last solution makes the most explicit use of the kinships among the study subjects, and has an established role in the analysis of animal and plant breeding studies. Recent computa- tional developments mean that analyses of human genetic association data are beginning to benefit from its powerful tests for association, which protect against population structure and cryptic kinship, as well as intermediate levels of confounding by the pedigree. Key words and phrases: Cryptic relatedness, genomic control, kinship, mixed model, complex disease genetics, ascertainment. 1. CONFOUNDING IN GENETIC arXiv:1010.4681v1 [stat.ME] 22 Oct 2010 EPIDEMIOLOGY 1.1 Association and Linkage William Astle is Research Associate, Centre for Genetic association studies (Clayton, 2007) are Biostatistics, Department of Epidemiology and Public designed to identify genetic loci at which the al- Health, St.
    [Show full text]
  • New Approaches to Population Stratification in Genome-Wide
    Nature Reviews Genetics | AOP, published online 15 June 2010; doi:10.1038/nrg2813 PROGRESS GENOME-WIDE ASSOCI ATI ON STUDIES considered benign; we note that inflation in λGC is proportional to sample size. New approaches to population If population stratification exists, it is important to distinguish between sub- stratification in genome-wide population differences that are due to recent genetic drift and those that arose from more association studies ancient population divergence11. In the case of genetic drift, dividing association statistics by λGC will provide a sufficient Alkes L. Price, Noah A. Zaitlen, David Reich and Nick Patterson correction for stratification. In the case of Abstract | Genome-wide association (GWA) studies are an effective approach ancient population divergence, markers with for identifying genetic variants associated with disease risk. GWA studies can unusual allele frequency differences that lie outside the expected distribution, which be confounded by population stratification — systematic ancestry differences could be caused by natural selection, make between cases and controls — which has previously been addressed by methods stratification a much more severe problem, that infer genetic ancestry. Those methods perform well in data sets in which and dividing association statistics by λGC is population structure is the only kind of structure present but are inadequate in likely to be inadequate. In the case of fam- data sets that also contain family structure or cryptic relatedness. Here, we ily structure or cryptic relatedness, dividing association statistics by λGC will generally review recent progress on methods that correct for stratification while produce the approximate null distribution, accounting for these additional complexities.
    [Show full text]
  • The Linear Mixed Models in Genome-Wide Association Studies
    Send Orders for Reprints to [email protected] The Open Bioinformatics Journal, 2013, 7, (Suppl-1, M2) 27-33 27 Open Access Genetic Studies: The Linear Mixed Models in Genome-wide Association Studies Gengxin Li1,* and Hongjiang Zhu2 1Department of Mathematics and Statistics, Wright State University, 201 MM, 3640 Colonel Glenn Highway, Dayton, OH 45435-0001 2Division of Biostatistics, Coordinating Center for Clinical Trials, The University of Texas School of Public Health at Houston Abstract: With the availability of high-density genomic data containing millions of single nucleotide polymorphisms and tens or hundreds of thousands of individuals, genetic association study is likely to identify the variants contributing to complex traits in a genome-wide scale. However, genome-wide association studies are confounded by some spurious associations due to not properly interpreting sample structure (containing population structure, family structure and cryptic relatedness). The absence of complete genealogy of population in the genome-wide association studies model greatly motivates the development of new methods to correct the inflation of false positive. In this process, linear mixed model based approaches with the advantage of capturing multilevel relatedness have gained large ground. We summarize current literatures dealing with sample structure, and our review focuses on the following four areas: (i) The approaches handling population structure in genome-wide association studies; (ii) The linear mixed model based approaches in genome-wide association studies; (iii) The performance of linear mixed model based approaches in genome-wide association studies and (iv) The unsolved issues and future work of linear mixed model based approaches. Keywords: Genetic similarity matrix, genome-wide association study (GWAS), linear mixed model (LMM), population stratification, sample structure, single nucleotide polymorphisms (SNPs).
    [Show full text]
  • Population Structure in Genetic Studies: Confounding Factors and Mixed Models
    REVIEW Population structure in genetic studies: Confounding factors and mixed models 1 2 2,3 Jae Hoon Sul , Lana S. MartinID , Eleazar Eskin * 1 Department of Psychiatry and Biobehavioral Sciences, University of California Los Angeles, Los Angeles, California, United States of America, 2 Department of Computer Science, University of California, Los Angeles, California, United States of America, 3 Department of Human Genetics, University of California Los Angeles, Los Angeles, California, United States of America * [email protected] Abstract a1111111111 a1111111111 A genome-wide association study (GWAS) seeks to identify genetic variants that contribute a1111111111 to the development and progression of a specific disease. Over the past 10 years, new a1111111111 approaches using mixed models have emerged to mitigate the deleterious effects of popula- a1111111111 tion structure and relatedness in association studies. However, developing GWAS tech- niques to accurately test for association while correcting for population structure is a computational and statistical challenge. Using laboratory mouse strains as an example, our review characterizes the problem of population structure in association studies and OPEN ACCESS describes how it can cause false positive associations. We then motivate mixed models in Citation: Sul JH, Martin LS, Eskin E (2018) the context of unmodeled factors. Population structure in genetic studies: Confounding factors and mixed models. PLoS Genet 14(12): e1007309. https://doi.org/10.1371/ journal.pgen.1007309 Editor: Gregory S. Barsh, Stanford University School of Medicine, UNITED STATES Introduction Published: December 27, 2018 Genetics studies have identified thousands of variants implicated in dozens of common human diseases [1±5]. These variants are locations in the human genome where genetic con- Copyright: © 2018 Sul et al.
    [Show full text]