| INVESTIGATION
Efficient and Accurate Multiple-Phenotype Regression Method for High Dimensional Data Considering Population Structure
Jong Wha J. Joo,* Eun Yong Kang,† Elin Org,‡ Nick Furlotte,† Brian Parks,‡ Farhad Hormozdiari,† Aldons J. Lusis,‡,§,** and Eleazar Eskin*,†,**,1 *Bioinformatics Interdepartmental Ph.D. Program, †Computer Science Department, ‡Department of Medicine, §Department of Microbiology, Immunology and Molecular Genetics, and **Department of Human Genetics, University of California, Los Angeles, California 90095
ABSTRACT A typical genome-wide association study tests correlation between a single phenotype and each genotype one at a time. However, single-phenotype analysis might miss unmeasured aspects of complex biological networks. Analyzing many phenotypes simultaneously may increase the power to capture these unmeasured aspects and detect more variants. Several multivariate approaches aim to detect variants related to more than one phenotype, but these current approaches do not consider the effects of population structure. As a result, these approaches may result in a significant amount of false positive identifications. Here, we introduce a new methodology, referred to as GAMMA for generalized analysis of molecular variance for mixed-model analysis, which is capable of simultaneously analyzing many phenotypes and correcting for population structure. In a simulated study using data implanted with true genetic effects, GAMMA accurately identifies these true effects without producing false positives induced by population structure. In simulations with this data, GAMMA is an improvement over other methods which either fail to detect true effects or produce many false positive identifications. We further apply our method to genetic studies of yeast and gut microbiome from mice and show that GAMMA identifies several variants that are likely to have true biological mechanisms.
KEYWORDS multivariate analysis; population structure; mixed models
VER the past few years, genome-wide association studies researchers often seek genetic variants that affect the profile of O(GWAS) have been used to find genetic variants that gut microbiota, which encompass 10s of 1000s of species are involved in disease and other traits by testing for corre- (Lockhart et al. 1996; Gygi et al. 1999). Another example is lations between these traits and genetic variants across the when researchers want to detect regulatory hotspots in ex- genome. A typical GWAS examines the correlation of a single pression quantitative trait loci (eQTL) studies. Many genes phenotype and each genotype one at a time. Recently, large are known to be regulated by a small number of genomic amounts of genomic data, including expression data, have regions called trans-regulatory hotspots (Wang et al. 2004; been collected from GWAS cohorts. This data often contains Cervino et al. 2005; Hillebrandt et al. 2005), which strongly thousands of phenotypes per individual. The standard indicate the presence of master regulators of transcription. approach to analyzing this type of data involves performing Moreover, current strategies for analyzing phenotypes inde- a single-phenotype analysis: a GWAS on each phenotype pendently are underpowered. A more powerful approach individually. could capture the unmeasured aspects of complex biological The genomic loci that are of the most interest are the loci networks, such as protein mediators, together with many that simultaneously affect many phenotypes. For example, phenotypes that might otherwise be missed when using an approach that focuses on a single phenotype or a few pheno- types (O’Reilly et al. 2012). Copyright © 2016 by the Genetics Society of America doi: 10.1534/genetics.116.189712 Many multivariate methods have been proposed that are Manuscript received March 29, 2016; accepted for publication September 28, 2016; designed to jointly analyze large numbers of genomic pheno- published Early Online October 20, 2016. 1Corresponding author: 3532-J Boelter Hall, University of California, Los Angeles, CA types. Most of the methods perform some form of data re- 90095-1596. E-mail: [email protected] duction, such as cluster analysis and factor analysis (Alter et al.
Genetics, Vol. 204, 1379–1390 December 2016 1379 2000; Quackenbush 2001). However, these data-reduction tion structure. We take the key principles behind MDMR methods have many issues such as the difficulty of determin- (Nievergelt et al. 2007; Zapala and Schork 2012), which per- ing the number of principal components, doubts about the forms multivariate regression using distance matrices to form generalizability of principal components, etc. (Nievergelt a statistic for testing the effects of covariates on multiple et al. 2007). Aschard et al. (2014) discussed the performance phenotypes. To correct for population structure, we extend of different principal component analysis-based strategies for the statistical procedure of MDMR to incorporate the linear multiple-phenotype analysis and showed that testing only mixed model. the top principal components often have low power, whereas To demonstrate the utility of GAMMA, we use both sim- combining signals across all principal components can have ulated and real data sets and compared our method with greater power in the analysis. Alternatively,Zapala and Schork representative previous approaches. These approaches in- (2012) proposed a way of analyzing high dimensional data clude the standard t-test, one of the standard and the simplest called multivariate distance matrix regression (MDMR) anal- method for GWAS; efficient mixed-model association (EMMA) ysis. MDMR uses a distance matrix whose elements are tested (Kang et al. 2008), a representative single-phenotype analysis for association with independent variables of interest. This method that implements linear mixed model and corrects method is simple and directly applicable to high dimen- for population structure (Lippert et al. 2011; Zhou and sional multiple-phenotype analysis. In addition, users can Stephens 2012); and MDMR (Zapala and Schork 2012), flexibly choose appropriate distance matrices (Webb 2002; a multiple-phenotypes analysis method. In a simulated Wessel and Schork 2006). study, GAMMA corrects for population structure and accu- Each of the previous methods is based on the assumption rately identifies genetic variants associated with pheno- that the phenotypes of the individuals are independently and types. In comparison, the previous approaches we tested, identically distributed (i.i.d.). Unfortunately, as has been which analyze each phenotype individually, do not have shown in GWAS, this assumption is invalid due to a phenom- enough power to detect associations and are not able to enon referred to as population structure. Allele frequencies detect variants. MDMR (Zapala and Schork 2012) predicts are known to vary widely from population to population, many spurious associations produced due to population because each population carries its own unique genetic and structure. We further applied GAMMA to two real data sets. social history. These differences in allele frequencies, along When applied to a yeast data set, GAMMA identified most with the correlation of a phenotype with its populations, may of the regulatory hotspots identified as related to regula- cause spurious correlation between genotypes and pheno- tory elements in a previous study (Joo et al. 2014); while the types and induce spurious associations (Kittles et al. 2002; previous approaches we tested failed to detect those hot- Freedman et al. 2004; Marchini et al. 2004; Campbell et al. spots. When applied to a gut microbiome data set from mice, 2005; Helgason et al. 2005; Reiner et al. 2005; Voight and GAMMA corrected for population structure and identified Pritchard 2005; Berger et al. 2006; Foll and Gaggiotti 2006; regions of the genome that harbor variants responsible for Seldin et al. 2006; Flint and Eskin 2012). These errors poten- taxa abundances. In comparison, the previous methods we tially compound when analyzing multiple phenotypes tested either failed to identify any of the variants in the because biases in test statistics accumulate from each phe- region or produced a significant number of false positives. notype, which is shown in our experiments. Unfortunately, none of the previously discussed multivariate methods are Materials and Methods able to correct for population structure and may cause asignificant number of false positive results. Recently, Linear mixed models multiple-phenotypes analysis methods have been developed For analyzing the ith SNP, we assume the following linear that consider population structure (Korte et al. 2012; Zhou mixed model as the generative model: and Stephens 2014). However, these methods are impractical for cases with large number of phenotypes (.100) since their Y ¼ Xib þ U þ E: (1) computational time scales quadratically with the number of phenotypes considered. Let n be the number of individuals and m be the number of In this article, we propose a method, called GAMMA genes. Here, Y is an n 3 m matrix, where each column vector (generalized analysis of molecular variance for mixed-model yj contains the jth phenotype values; Xi is a vector of length n analysis), which efficiently analyzes large numbers of pheno- with genotypes of the ith SNP; and b is a vector of length m, b types while simultaneously considering population structure. where each entry j contains an effect of the ith SNP on the Recently, the linear mixed model has become a popular jth phenotype. U is an n 3 m matrix, where each column approach for GWAS as it can correct for population structure vector uj contains the effect of population structure of the (Kang et al. 2008, 2010; Lippert et al. 2011; Segura et al. jth phenotype. E is an n 3 m matrix, where each column 2012; Svishcheva et al. 2012; Zhou and Stephens 2012; vector ej contains i.i.d. residual errors of the jth phenotype. Hormozdiari et al. 2015). The linear mixed model incorpo- We assume the random effects, uj and ej; follow multivariate rates genetic similarities between all pairs of individuals, normal distribution, u Nð0; s2 KÞ and e Nð0; s2 IÞ; j gj j ej known as kinship, into their model and corrects for popula- where K is a known n 3 n genetic similarity matrix and I is
1380 J. W. J. Joo et al. Figure 1 The results of different methods applied to a simulated data set. The x-axis shows SNP locations and the y-axis shows log10P-values of associations between each SNP and all the genes. Blue Y shows the location of the true trans-regulatory hotspots. (A) The result of the standard t-test.
(B) The result of EMMA. For (A) and (B), we averaged the log10P-values over all of the genes for each SNP. (C) The result of MDMR. (D) The result of GAMMA. an n 3 n identity matrix with unknown magnitudes s2 and Applying this statistic (Equation 3) to our case, we find the gj s2 ; respectively. following: ej Multiple-phenotypes analysis ¼ 9 ; ¼ð 2 bb Þ9ð 2 bb Þ RSS1 yj yj RSS2 yj Xi j yj Xi j Let us say we are analyzing associations between the ith SNP 9 9 ¼ y ðI 2 H Þy ¼ br br and the jth phenotype. Traditional univariate analysis is j i j j j 2 ¼ 9 2 9ð 2 Þ ¼ 9 based on the following linear model: RSS1 RSS2 yj yj yj I Hi yj yj Hiyj ¼ b9b ; ¼ ; ¼ ¼ b þ : yj yj p1 1 p2 2 (4) yj Xi j ej (2) bb ¼ð 9 Þ21 9 ; ¼ ð 9 Þ21 9 b ¼ 2b ¼ Here, y is a vector of length n with the jth phenotype values, where j Xi Xi Xi yj Hi Xi Xi Xi Xi and rj yj rj j 2 ð 9 Þ21 9 ¼ð 2 Þ : b yj Xi Xi Xi Xi yj I Hi yj Applying Equation 4 to Equa- Xi is a vector of length n with the ith SNP values, j is a value tion 3, we find the following F-statistic: contains an effect of the ith SNP on the jth phenotype, and ej is a vector of length n with i.i.d. residual errors of the jth 9 yb ybj ð2 2 1Þ phenotype. To test associations, we test the null hypothesis F ¼ j : (5) : b ¼ : b 6¼ : ^9^ ð 2 Þ H0 j 0 against the alternative hypothesis HA j 0 We rj rj n 2 can perform an F-test for the analysis by comparing two mod- ¼ ¼ b þ : x2; els, model 1: yj ej and model 2: yj Xi j ej The standard Using the fact that the RSS statistics follow we could F-statistic is given as follows: extend the univariate case into a multivariate case in the following: ðRSS 2 RSS Þ=ðp 2 p Þ F ¼ 1 2 2 1 ; (3) RSS2=ðn 2 p2Þ Y ¼ Xib þ E (6) where RSS1 and RSS2 are the residual sum of squares (RSS) where Y is an n 3 m matrix, where each column vector yj of model 1 and model 2, respectively; and p1 and p2 are the contains the jth phenotype values; b is a vector of length m, b number of parameters in model 1 and model 2, respectively. where each entry j contains an effect of the ith SNP on the
Multiple-Phenotype Regression 1381 Figure 2 An eQTL map of a real yeast data set. P-values are estimated from NICE (Joo et al. 2014). The x-axis corresponds to SNP locations and the y- axis corresponds to the gene locations. The intensity of each point on the map represents the significance of the association. The diagonal band represents the cis effects and the vertical bands represent trans-regulatory hotspots. jth phenotype; and E is an n 3 m matrix, where each column the multivariate case (Equation 6), we can estimate a pseudo- vector ej contains i.i.d. residual errors of the jth phenotype. F-statistic as follows: Here, we assume that the random effect ej follows multivar- ′ 2 ðYb YbÞ=ð 2 Þ iate normal distribution, ej Nð0; s IÞ; where I is an n 3 n tr 2 1 ej ; (7) identity matrix with unknown magnitude s2 : In the multi- b ′b ej trðR RÞ=ðn 2 2Þ variate case, both RSS1 and RSS2 are m 3 m matrices, where j;j Rb ¼ Y 2 Yb ¼ Y 2 ð 9 Þ21 9Y ¼ð 2 ÞY: the diagonal element RSS is RSS for the jth phenotype as where Xi Xi Xi Xi I Hi The rea- computed in the univariate case. Given this, if we take the son why we call this a “pseudo” F-statistic is because it is not trace of this matrix, we obtain a sum of x2 statistics. Thus in guaranteed that we are summing independent x2 statistics,
1382 J. W. J. Joo et al. Figure 3 The results of MDMR and GAMMA applied to a yeast data set. The x-axis corresponds to SNP locations and the y-axis corresponds to
gene locations. The y-axis corresponds to 2log10 of P-values. Blue * above each plot shows puta- tive hotspots that were reported in a previous study (Joo et al. 2014) for the yeast data. (A) The result of MDMR. (B) The result of GAMMA.
and when they are not independent we do not expect that the tive identifications. Recently, the linear mixed model has result is also x2: emergedasapowerfultoolforGWASasitcouldcorrectfor Here we note that the trace of an inner product ma- the population structure. GAMMA incorporates the effect of trix is the same as the trace of an outer product matrix: population structure by assuming a linear mixed model ′ ′ ′ ′ trðYb YbÞ¼trðYbYb Þ and trðRb RbÞ¼trðRb Rb Þ: The advantage of (Equation 1), which has an extra term U accounting for ′ ′ this duality is that we can estimate the trace of YbYb and Rb Rb the effects of population structure, instead of the conven- from the outer product matrix YY9 by using the fact that tional linear model (Equation 6). This is an extension of the b b′ ′ b b ′ YY ¼ HiðYY ÞHi and RR ¼ðI 2 HiÞðYY9ÞðI 2 HiÞ: The outer following widely used linear mixed model for a univariate product matrix YY9 could be obtained from any n 3 n symmet- analysis: ric matrix of distances (Gower 1966; McArdle and Anderson ¼ b þ þ : 2001). Let us say we have a distance matrix D with each ele- yj Xi j uj ej ment d : Let A be a matrix where each element a ¼ð21=2Þd ; ij ij ij Based on the linear mixed model (Equation 1), each pheno- andwecancenterthematrixbytakingGower’scenteredmatrix type follows a multivariate normal distribution with mean and G (Gower 1966; McArdle and Anderson 2001): variance as follows: ! ! 1 1 X G ¼ I 2 119 A I 2 119 (8) y N X b ; ; n n j i j j P ¼ s2 þ s2 where j g K e I is the variance of the jth phenotype. where 1 is a column of 1’s of length n. Then this matrix G is an j j Pb ¼ s^2 þ s^2 ; We compute a covariance matrix, g K e I as de- outer-product matrix and we can generate a pseudo-F-statistic scribed in Implementation, and the alternate model is trans- from a distance matrix as follows: formed by the inverse square root of this matrix as follows: trðH GH Þ=ð2 2 1Þ X 2 = X 2 = i i : (9) b 1 2 b 1 2 b ; s2 : yj N Xi j I tr ðI 2 HiÞGðI 2 HiÞ =ðn 2 2Þ
Thus, to incorporate population structure, we transform Pb 21=2 Pb 21=2 genotypes and phenotypes, Xi ¼ Xi and yj ¼ yj; Correcting for population structure and apply them to Equation 9 to get an alternative pseudo- In GWAS, it is widely known that genetic relatedness, referred F-statistic as follows: to as population structure, complicates analysis by creating . e e e ð 2 Þ spurious associations. The linear model (Equation 6) does tr HiGHi 2 1 h i. ; not account for population structure, and applying the model tr I 2 He Ge I 2 He ðn 2 2Þ to the multiple-phenotypes analysis may induce false posi- i i
Multiple-Phenotype Regression 1383 Figure 4 The results of the standard t-test and EMMA applied to a yeast data set. The x-axis cor- responds to SNP locations and the y-axis corre- sponds to gene locations. The y-axis corresponds
to sum of 2log10 of P-value over the genes. Blue * above each plot shows putative hotspots that were reported in a previous study (Joo et al. 2014) in the yeast data. (A) The result of the standard t-test. (B) The result of EMMA.