Enrichment of Statistical Power for Genome-Wide
Li et al. BMC Biology 2014, 12:73
http://www.biomedcentral.com/1741-7007/12/73

METHODOLOGY ARTICLE Open Access

Enrichment of statistical power for genome-wide association studies

Meng Li1,2, Xiaolei Liu2, Peter Bradbury3, Jianming Yu4, Yuan-Ming Zhang5*, Rory J Todhunter6, Edward S Buckler2,3 and Zhiwu Zhang2,7,8*

Abstract

Background: The inheritance of most human diseases and agriculturally important traits is controlled by many genes with small effects. Identifying these genes, while simultaneously controlling false positives, is challenging. Among available statistical methods, the mixed linear model (MLM) has been the most flexible and powerful for controlling population structure and individual unequal relatedness (kinship), the two common causes of spurious associations. The introduction of the compressed MLM (CMLM) method provided additional opportunities for optimization by adding two new model parameters: grouping algorithms and number of groups.

Results: This study introduces another model parameter to develop an enriched CMLM (ECMLM). The parameter involves algorithms to define kinship between groups (that is, kinship algorithms). The ECMLM calculates kinship using several different algorithms and then chooses the best combination between kinship algorithms and grouping algorithms.

Conclusion: Simulations show that the ECMLM increases statistical power. In some cases, the magnitude of power gained by using ECMLM instead of CMLM is larger than the improvement found by using CMLM instead of MLM.

Keywords: Genome wide association study, population structure, kinship, mixed model, cluster analysis

Background

Genome-wide association studies (GWAS) are widely used in human genetics research to identify genes associated with complex diseases and in agricultural research to identify genes associated with quantitative traits such as yield and productivity [1,2]. The extremely dense genetic markers derived from new genotyping technology, such as genotyping by sequencing, have provided the potential for discovering genes underlying phenotypic diversity through GWAS [3-5]. Several new methods have been proposed for GWAS, such as the multi-locus mixed-model approach [6] and the candidate gene approach [7]. However, automatically including cofactors, as proposed by these new methods, is challenging when the number of predictors is large compared to the number of observations. The model space is usually too large to explore exhaustively, and the maximum number of polymorphisms fitted at a time must be less than the number of individuals [6].

These issues are particularly problematic in recent GWAS because the number of polymorphisms can reach millions, but the number of phenotyped and genotyped individuals is rarely more than hundreds of thousands. For candidate gene studies, prerequisite knowledge is necessary, for example, the location of the candidate genes. In this case, single-locus approaches are a necessary step before further analyses using the multi-locus or candidate gene approaches. Thus, the single-locus approach is still the mainstream method in GWAS.

However, advances in genotyping technology have allowed extremely dense genetic marker mapping, and the associated computing time has become a major concern for genetic research. Simultaneously, using large numbers of markers has also increased concerns about false positives [8-11] and the potential for misleading results in follow-up re-sequencing studies.

* Correspondence: [email protected]; [email protected]
5State Key Laboratory of Crop Genetics and Germplasm Enhancement/National Center for Soybean Improvement, College of Agriculture, Nanjing Agricultural University, Nanjing 210095, China
2Institute for Genomic Diversity, Cornell University, Ithaca, New York 14853, USA
Full list of author information is available at the end of the article

© 2014 Li et al.; licensee BioMed Central Ltd.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

False positives are easy to control, but only at the expense of true positive discovery or statistical power. For example, a stringent association test threshold is an effective way to control the false positive rate, but the number of true positives is reduced at the same time. A desirable solution is to reduce false positives without compromising statistical power. This solution is critical because the inheritance of most human diseases and agriculturally important traits is controlled by many genes, which individually have small effects or rare alleles [12,13].

A number of statistical methods have been developed to eliminate spurious association between phenotypes and testing markers and to increase statistical power in GWAS. One of the causes of spurious association is population structure or stratification. In this case, the population can be partitioned into subpopulations. Then, association tests can be performed within the subpopulations or by using an estimate of population membership as a covariate in a linear or logistic model [14]. A similar method employs a principal components analysis of the genotype matrix. The first few principal components may reflect broad patterns of similarity across individuals [15-17].

Spurious association can also be caused by differences in relatedness between pairs of individuals. This effect can be reduced using a general linear model (GLM) to estimate the proportion of genes identical by descent between any pair of individuals and excluding closely related individuals [18,19]. Alternatively, population structure and unequal relatedness can be simultaneously accounted for in a mixed linear model (MLM). Subpopulation memberships (Q matrix) or principal components (PC) of the marker genotypes are treated as fixed effects, and kinship is used to define the variance and covariance structure of random individual effects [20]. This MLM method outperforms other methods with respect to statistical power. An improved approach, called the compressed MLM (CMLM), has been proposed to cluster individuals into groups by using clustering algorithms such as the un-weighted pair group method with arithmetic mean (UPGMA). The kinship among groups is calculated simply as the average of kinship among individuals. By optimizing grouping (the number of groups and clustering method used for grouping), CMLM improves statistical power for GWAS [21].

This study introduces another parameter for model optimization: defining the relationship among groups in the CMLM. Statistical power is examined through simulations. The results showed that statistical power is further improved through this enriched compressed MLM (ECMLM) method.

Results

Model setup

With Henderson's notation [22], a CMLM for GWAS can be written as follows:

$$y = X\beta + Zu + e \tag{1}$$

where $y$ is a vector of a phenotype; $\beta$ represents unknown fixed effects, including population structure and marker effects; $u$ is a vector of size $s$ (the number of groups) for unknown random polygenic effects following a distribution with mean of zero and covariance matrix of $G = 2K\sigma_a^2$; $K$ is the group kinship matrix with element $k_{ij}$ ($i, j = 1, 2, \ldots, s$) representing the relationship between group $i$ and $j$; and $\sigma_a^2$ is an unknown genetic variance. $X$ and $Z$ are the incidence matrices for $\beta$ and $u$, respectively, and $e$ is a vector of random residual effects that are normally distributed with zero mean and covariance $R = I\sigma_e^2$, where $I$ is the identity matrix and $\sigma_e^2$ is the unknown residual variance.

The likelihood of equation (1) can be expressed as:

$$L\left(y \mid \beta, u, \sigma_a^2, \sigma_e^2, g, s, \phi\right) \tag{2}$$

where $g$ is a clustering algorithm; $s$ is the number of groups; and $\phi$ is the operation used to calculate group kinship ($k_{ij}$) from individual kinship ($\tilde{k}_{ht}$). $\phi$ is the new parameter added in this study. The general formula for the derivation of pair-wise kinship coefficients is as follows:

$$k_{ij} = \phi_{h \in i,\, t \in j}\left(\tilde{k}_{ht}\right) \tag{3}$$

where the operation $\phi$ was defined as the average algorithm in the previous study [21]. Here we extended the operation to a series of algorithms, including average, median, and maximum. This extension created another dimension of parameter space in the MLM (Figure 1). We expected the extended parameter space would lead to a better model fit and result in higher statistical power for GWAS.

Effect of group kinship algorithms in model fit

We examined model fit using three group kinship algorithms (average, median, and maximum) in four species (human, dog, maize, and Arabidopsis) where the UPGMA algorithm is used to cluster individuals into groups. The model fit was reflected by twice negative log likelihood (−2LL). Here, we define the compression level as the average number of individuals in each group. Different compression levels (up to 16 individuals per group on average) were applied. Variations of model fit due to each group kinship algorithm were observed for all species and at some compression levels between 1 and 16 (Figure 2). The average algorithm performed best only for the dog dataset. The maximum algorithm performed best for all other datasets. This finding suggested that optimization on group kinship algorithms is necessary for choosing the best algorithm for a specific dataset.
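To make equation (3) concrete, the following is a minimal sketch of how a group kinship matrix might be derived from an individual kinship matrix under the three operations named in the text (average, median, and maximum). It is an illustration only, not the authors' implementation: the function name `group_kinship`, the `PHI` lookup table, and the toy kinship values are assumptions, and the sketch applies the operation to every pair with h in group i and t in group j, including the within-group case i = j.

```python
from statistics import mean, median

# Candidate operations phi for equation (3). The dictionary and its keys are
# illustrative names, not part of the original study's software.
PHI = {"average": mean, "median": median, "maximum": max}


def group_kinship(kinship, groups, phi="average"):
    """Return the s x s group kinship matrix.

    kinship: n x n list of lists of individual kinship values.
    groups:  length-n list giving the group label of each individual.
    phi:     name of the operation applied to all individual pairs (h, t)
             with h in group i and t in group j.
    """
    op = PHI[phi]
    labels = sorted(set(groups))
    # Map each group label to the indices of its member individuals.
    members = {
        g: [idx for idx, lab in enumerate(groups) if lab == g] for g in labels
    }
    return [
        [op([kinship[h][t] for h in members[i] for t in members[j]])
         for j in labels]
        for i in labels
    ]


# Toy individual kinship for four individuals (made-up values).
K_ind = [
    [1.0, 0.5, 0.1, 0.2],
    [0.5, 1.0, 0.3, 0.0],
    [0.1, 0.3, 1.0, 0.6],
    [0.2, 0.0, 0.6, 1.0],
]
groups = [0, 0, 1, 1]  # individuals 0 and 1 form group 0; 2 and 3 form group 1

for name in PHI:
    print(name, group_kinship(K_ind, groups, name))
```

With two individuals per group, this toy case corresponds to a compression level of 2 under the definition given above. Swapping the `phi` argument is the only change needed to move along the new dimension of the parameter space.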
average group kinship and the UPGMA clustering algorithm. The details are provided in Additional file 1. The three traits were aphid offspring number, visual chlorosis present at 16°C, and vegetative growth rate after vernalization. When the parameter space was expanded by the combinations of clustering methods and group kinship calculations, compression improved the fit for all three traits (Additional file 1: Figure S1).
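The search over the expanded parameter space can be sketched as a simple grid search that keeps the combination with the smallest −2LL. This is a hedged illustration rather than the procedure used in the study: `best_combination` and `toy_neg2ll` are hypothetical names, the scoring function returns made-up numbers in place of a real mixed-model fit, and `"other_method"` is a placeholder for any clustering method besides UPGMA. The number of groups is taken as the number of individuals divided by the compression level, following the definition of compression level as the average number of individuals per group.

```python
from itertools import product


def best_combination(n_individuals, clusterings, kinship_ops,
                     compression_levels, neg2ll):
    """Return (score, clustering, kinship_op, n_groups) with the smallest -2LL.

    neg2ll is a caller-supplied function (clustering, kinship_op, n_groups)
    -> float; here it stands in for fitting the compressed mixed model.
    """
    best = None
    for g, phi, level in product(clusterings, kinship_ops, compression_levels):
        # Compression level = average individuals per group, so s = n / level.
        s = max(1, round(n_individuals / level))
        score = neg2ll(g, phi, s)
        if best is None or score < best[0]:
            best = (score, g, phi, s)
    return best


def toy_neg2ll(g, phi, s):
    # Made-up scores for demonstration only; a real analysis would fit the
    # model for each combination and return its -2 log likelihood.
    base = {"average": 105.0, "median": 104.0, "maximum": 100.0}[phi]
    penalty = 0.0 if g == "upgma" else 1.0
    return base + penalty + abs(s - 50) * 0.1


result = best_combination(
    n_individuals=200,
    clusterings=["upgma", "other_method"],  # "other_method" is a placeholder
    kinship_ops=["average", "median", "maximum"],
    compression_levels=[1, 2, 4, 8, 16],
    neg2ll=toy_neg2ll,
)
print(result)
```

Because the scoring function is passed in, the same loop could wrap any model-fitting routine; the exhaustive search is affordable here only because each dimension has a handful of candidate values.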