A Powerful and Efficient Set Test for Genetic Markers That Handles Confounders Jennifer Listgarten1,*, Christoph Lippert1,*, Eun Yong Kang1, Jing Xiang1, Carl M
Total Page:16
File Type:pdf, Size:1020Kb
Category A powerful and efficient set test for genetic markers that handles confounders Jennifer Listgarten1,*, Christoph Lippert1,*, Eun Yong Kang1, Jing Xiang1, Carl M. Kadie1 and David Heckerman1,* 1eScience Group, Microsoft Research, Los Angeles, USA. In press at Bioinformatics. ABSTRACT association between rare variants within a gene and disease (Wu et Motivation: Approaches for testing sets of variants, such as a set of al., 2011; Bansal et al., 2010). As next generation sequencing rap- rare or common variants within a gene or pathway, for association idly becomes the norm, these set-based tests, complementary to with complex traits are important. In particular, set tests allow for single SNP tests, will become increasingly important. However, aggregation of weak signal within a set, can capture interplay among existing methods for testing sets of SNPs do not handle confound- variants, and reduce the burden of multiple hypothesis testing. Until ing such as arises when related individuals or those of diverse eth- now, these approaches did not address confounding by family relat- nic backgrounds are included in the study. Such confounders, when not accounted for, result in loss of power and spurious asso- edness and population structure, a problem that is becoming more ciations (Balding, 2006; A. Price et al., 2010). Yet it is precisely important as larger data sets are used to increase power. these richly structured cohorts which yield the most power for Results: We introduce a new approach for set tests that handles discovery of the genetic underpinnings of complex traits. Moreo- confounders. Our model is based on the linear mixed model and ver, such structure typically presents itself as data cohorts become uses two random effects—one to capture the set association signal larger and larger to enable the discovery of weak signals. and one to capture confounders. We also introduce a computational speedup for two-random-effects models that makes this approach In this paper, we introduce a new, powerful and computationally feasible even for extremely large cohorts. Using this model with both efficient likelihood ratio-based set test that accounts for rich con- the likelihood ratio test and score test, we find that the former yields founding structure. We demonstrate control of type I error as well more power while controlling type I error. Application of our ap- as improved power over the more traditionally used score test. proach to richly structured GAW14 data demonstrates that our Finally, we demonstrate application of our approach to two real method successfully corrects for population structure and family GWAS data sets. Both data sets showed evidence of spurious asso- relatedness, while application of our method to a 15,000 individual ciation due to confounders in an uncorrected analysis, while appli- Crohn’s disease case-control cohort demonstrates that it additionally cation of our set test corrected for confounders and uncovered recovers genes not recoverable by univariate analysis. signal not recovered by univariate analysis. Finally, our test is extremely computationally efficient owing to development of a Availability: A Python-based library implementing our approach is new linear mixed model algorithm also presented herein, which available at http://mscompbio.codeplex.com makes possible, for example, set analysis of the 15,000 individual Contact:{jennl,lippert,heckerma}@microsoft.com Wellcome Trust Case Control Consortium (WTCCC) data. 1 INTRODUCTION Several approaches have been used to jointly test sets of SNPs: post hoc, gene-set enrichment in which univariate P values are Traditional Genome-Wide Association Studies (GWAS) test one aggregated (Holden et al, 2008), operator-based aggregation such single nucleotide polymorphism (SNP) at a time for association as “collapsing” of SNP values (Braun and Buetow, 2011; B. Li and with disease, overlooking interplay between SNPs within a gene or Leal, 2008), multivariate regression, typically penalized pathway, missing weak signal that aggregates in sets of related (Schwender et al., 2011; Malo et al., 2008), and variance compo- SNPs, and incurring a severe penalty for multiple testing. More nent (also called kernel) models such as a linear mixed models recently, sets of SNPs have been tested jointly in a gene-set en- (Wu et al., 2011, 2010; Quon et al., 2013). richment style approach (Holden et al, 2008), and also in seeking Our approach is based on the linear mixed model (LMM) which can equivalently be viewed as a multivariate regression. In partic- *To whom correspondence should be addressed, equal contributions. ular, use of a LMM with a specific form of genetic similarity ma- trix is equivalent to regressing those SNPs used to estimate genetic similarity on the phenotype (Hayes et al., 2009; Listgarten et al., 1 J.Listgarten et al. 2012). If one uses only SNPs to be tested in the similarity matrix as 2 METHODS in Wu et al., 2010, 2011, then one is effectively performing a mul- Let 푁(풗|풖; 횺) denote a multivariate Normal distribution in 풗 with mean 풖 tivariate regression test. However, by also using SNPs that tag and covariance matrix 횺. The log likelihood of a one-variance-component confounders in a separate similarity matrix, our model can addi- linear mixed model in the linear regression view is given by tionally correct for confounders, as has been done in a single-SNP ퟏ test GWAS setting (Kang et al., 2010; Yu et al., 2006; Listgarten LL ≡ log ∫ N (풚|푿휷 + 푽풘; 휎2푰) ⋅ N(풘|ퟎ; 휎2푰) d풘, √풔 푒 푔 et al., 2012; Lippert et al., 2011). Finally, our approach allows one to condition on other causal SNPs, by way of the similarity matrix, for increased power, again, as has been done in single-SNP test where 풚 is a 1 × 푁 vector of phenotype values for 푁 individuals; 휷 is the setting (Atwell et al., 2010; Listgarten et al., 2012; Segura et al., set of the fixed effects of the covariates stored in the design matrix 푿; 푰 is 2 2012). an 푁 × 푁 identity matrix; 휎푒 is the residual variance in the regression; 풘 are the 1 × 푁 random effects for the SNPs stored in the design matrix V 2 The use of LMMs to correct for confounders in genome-wide as- (dimension 푁 × 푠), and N(풘|0; 휎푔 푰) is the distribution for the weight sociation studies (GWAS) is now widely accepted, because this parameters. That is, the random regression weights, 풘 are marginalized 2 approach has been shown capable of correcting for several forms over independent Normal distributions with equal variance 휎푔 . of genetic relatedness such as population structure and family re- latedness (Kang et al., 2010; Astle and Balding, 2009; Yu et al., Equivalently, and more typically, the log likelihood is written with random 2006; A. Price et al., 2010). Independently, the use of LMMs to effects marginalized out, jointly test rare variants has become prevalent (Wu et al., 2010, 2 2 2011). In our new approach, we marry the aforementioned uses of LL = log N(풚|푿휷; 휎푒 푰 + 휎푔 푲), LMMs to perform set tests in the presence of confounders within a single, robust, and well-defined statistical model. where the genetic similarity (called the kernel in some contexts), 푲, is given by 푲= ퟏ 푽푽푻, as is the case, for example, when 푲 is the realized 풔 Because of the aforementioned equivalence, our approach can also relationship matrix (RRM) (B. J. Hayes et al, 2009; Lippert et al, 2011). be viewed as a form of linear regression with two distinct sets of Given this equivalence, the SNPs used to estimate genetic similarity (those covariates. The first set of covariates consists of SNPs that correct in 푽) can be interpreted as a set of covariates in the regression. for confounders (and other causal SNPs), that is, those which pre- In our model, we partition the random effects into two sets: one set of ran- dict race and relatedness, for example. Inclusion of these SNP dom effects, 풘푪 (with design matrix 푽푪), are used to correct for confound- covariates makes the data for individuals independently and identi- ers (and condition on causal SNPs) using 푠 SNPs, while the other set, 풘 , cally distributed (i.e., knowing the value of these SNPs induces a 푐 푺 are used to test the 푠푠 SNPs of interest in the corresponding design matrix, common distribution from which the individuals are drawn). The 푽푺. The log likelihood (in the linear regression view) is then written second set of covariates consists of SNPs for a given set of interest, such as those SNPs belonging to a gene. We call our approach 푽푪 푽푺 2 LL = log ∬ N (풚|푿휷 + 풘푪 + 풘푺; 휎푒 푰) ⋅ N(풘푪|0; 푰) N(풘푺|0; 푰) d풘푪d풘푺, FaST-LMM-Set. √풔풄 √풔풔 2 2 Computing the likelihood for our model—a LMM with two ran- where each set of random effects has a separate variance (휎퐶 and 휎푆 ). Again, dom effects—is, naively, extremely expensive, as it scales cubical- we can equivalently write this in the marginalized form, ly with the number of individuals (e.g., Listgarten et al, 2010). For 휎2 휎2 example, on the 15,000 individual WTCCC data set we analyse, 2 퐶 푻 푆 T 퐿퐿 = log N (풚|푿휷; 휎푒 푰 + 푽푪푽푪 + 푽푺푽푺 ) . currently available algorithms would need to compute and store in 푠푐 푠푠 memory genetic similarity matrices of dimension 15,000 × 15,000 and repeatedly perform cubic operations on them to test just a sin- For convenience, we re-parameterize this as gle set of SNPs—a practically infeasible approach. However, ex- 퐿퐿 = log N 풚|푿휷; 휎2푰 + 휎2[(1 − 휏)푲 + 휏푲 ] , (1) tending our previous work that made LMMs with a single random ( 푒 푔 푪 푺 ) effect linear in the number of individuals (Lippert et al 2011) to the two-variance component model needed here, we bypass this com- where now the covariance matrix, 푲, has been partitioned into two variance putational bottleneck, yielding a new, two-random-effects algo- components: rithm which is linear in the number of individuals.