Bioinformatics, 33(23), 2017, 3733–3739 doi: 10.1093/bioinformatics/btx511 Advance Access Publication Date: 14 August 2017 Original Paper
Genome analysis Conditional asymptotic inference for the kernel association test Kai Wang
Department of Biostatistics, University of Iowa, Iowa City, IA 52246, USA
Associate Editor: John Hancock
Received on June 7, 2017; revised on July 21, 2017; editorial decision on August 7, 2017; accepted on August 8, 2017
Abstract Motivation: The kernel association test (KAT) is popular in biological studies for its ability to combine weak effects potentially of opposite direction. Its P-value is typically assessed via its (unconditional) asymptotic distribution. However, such an asymptotic distribution is known only for continuous traits and for dichotomous traits. Furthermore, the derived P-values are known to be conservative when sample size is small, especially for the important case of dichotomous traits. One alternative is the permutation test, a widely accepted approximation to the exact finite sample conditional inference. But it is time-consuming to use in practice due to stringent significance crite- ria commonly seen in these analyses. Results: Based on a previous theoretical result a conditional asymptotic distribution for the KAT is introduced. This distribution provides an alternative approximation to the exact distribution of the KAT. An explicit expression of this distribution is provided from which P-values can be easily computed. This method applies to any type of traits. The usefulness of this approach is demon- strated via extensive simulation studies using real genotype data and an analysis of genetic data from the Ocular Hypertension Treatment Study. Numerical results showed that the new method can control the type I error rate and is a bit conservative when compared to the permutation method. Nevertheless the proposed method may be used as a fast screening method. A time-consuming permutation procedure may be conducted at locations that show signals of association. Availability and implementation: An implementation of the proposed method is provided in the R package iGasso. Contact: [email protected]
1 Introduction The test statistic for association, denoted by Q, derived from the kernel machine regression is of the following form: Kernel machine regression (Liu et al., 2007, 2008) is becoming more and more popular in analysis of biological data. It has the ability to Q ¼ ðÞy yb tKyðÞ yb : (1) accumulate weak signals across biological features even though they b are potentially of opposite effects. Initially proposed for pathway Here y is the vector of responses on n subjects. y is the vector of pre- analysis, this method has found a plethora of applications in rare dicted responses using a linear regression (for continuous responses) variants association analysis with sequence data (Lee et al., 2012b; or logistic regression (for dichotomous responses) with potential con- Wu et al., 2011). It has been extended to numerous situations founders as predictors. This regression model is called the null model such as extreme continuous traits (Barnett et al., 2013), survival as it does not involve any of the features to be tested for association. outcomes (Cai et al., 2011; Chen et al., 2013; Lin et al., 2011), The positive definite n n matrix K is called the kernel. It measures family-based association test (Wang et al., 2013), gene–gene and the pair-wise similarities between subjects. Choice of kernel depends gene–environmental interactions (Lin et al., 2013, 2016) and micro- on the application context. In genetic association studies, it is often t biome data (Chen et al., 2016; Zhao et al., 2015). taken as K ¼ GG , where G is a n m matrix of genotype scores at m
VC The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: [email protected] 3733
Downloaded from https://academic.oup.com/bioinformatics/article-abstract/33/23/3733/4082272 by Washington University at St Louis user on 01 May 2018 3734 K.Wang
single nucleotide polymorphisms (SNPs) (Wu et al., 2011). The col- may depend on all the observations of Y: hðÞ¼Yi hðÞYi; ðÞY1; ...; Yn . umns of G are typically weighted. In microbiome data analysis, popu- However, this dependence must be permutation invariant: for a per-
lar kernels include the UniFrac kernel and the Bray-Curtis kennel mutation of ðÞY1; ...; Yn ; hðÞYi does not change. One example is that (Chen and Li, 2013; Chen et al., 2012; Zhao et al., 2015). the rank of a value remains the same in any permutation. Other ex- The exact distribution of the statistic Q is unknown. Asymptotic amples of h and g transformations can be found in Hothorn et al. distribution has been derived for the case of continuous traits and the (2006). case of dichotomous traits (Liu et al., 2007, 2008). Let Z denote the de- Under the null hypothesis of independence between Y and X, sign matrix of the covariates (including the intercept) used in the null Strasser and Weber (1999) were able to derive the mean l and vari- 1 model and P ¼ V VZ ZtVZ ZtV with V ¼ diagðÞybðÞ1 yb for di- ance R of T over the set S of all possible permutations: chotomous traits and V ¼ rb2I for continuous traits where rb2 is an esti- "# ! Xn mated variance of the residuals. Then the asymptotic distribution of Q t l ¼ vec gðÞXi EðÞhðY ÞjS Pis a linear combination of m independent 1-df chi-square distributions: i¼1 2 1=2 1=2 j kjvj ðÞdf ¼ 1 where kj; j ¼ 1; ...; m, are eigenvalues of P KP . ! However, it has been found that the asymptotic distribution for n Xn R ¼ VðÞ hðY ÞjS gðÞ X gðÞX t the Q statistic is conservative for small sample size especially in the n 1 i i i¼1 important case of dichotomous responses. Lee etP al. (2012b) use a 2 ! single chi-square distribution to approximate j kjvj ðÞdf ¼ 1 by 1 Xn Xn matching the first and the second moments. The performance of this VðÞ hðY ÞjS gðÞ X gðÞX t : n 1 i i method was investigated via simulation studies (Lee et al., 2012b). i¼1 i¼1 Chen et al. (2016) found that the small-sample correction Here represents the Kronecker product and method of Lee et al. (2012b) can be anti-conservative. Alternatively, Chen et al. (2016) proposed the following statistic for continuous 1 Xn b t b b2 EðÞhðY ÞjS ¼ hðÞYi ; traits: ðÞy y KyðÞ y =r in order to take into account the uncer- n i¼1 tainty in the estimated residual variance rb2. Based on the iteratively reweighted least squares algorithm for generalized linear models, a 1 Xn similar statistic was proposed for dichotomous traits. VðÞ¼hðÞjY S ½ hðÞ Y EðÞhðÞjY S ½ hðÞ Y EðÞhðÞjY S t: n i i The main challenge in approximating the distribution of Q is that i¼1 its distribution depends on the residual variance of the response which Strasser and Weber (1999) further showed that T follows an asymp- needs to be estimated. However, it would be known in the framework totic multivariate normal distribution with mean l and variance ma- of conditional inference in which statistical inference is conditional on trix R as sample size n goes to infinity. the observed data. Deriving the exact distribution of Q given the These results on T are very useful for conditional inference when data doesn’t seem to be an easy task. Permutation would be a straight- exact conditional distribution is unknown or permutation is too forward way to approximate this distribution but it may be time-consuming to conduct. This conditional inference framework time-consuming. In this article we study the conditional asymptotic encompasses many commonly used tests for independence. The test distribution of Q using the theoretical work by Strasser and Weber statistics of these tests are either scalars (when pq ¼ 1) or assume the (1999). Unlike those literature mentioned above, this approach is not t quadratic form ðÞT l RþðÞT l . Examples of tests that can be limited to situations with continuous traits or dichotomous traits. put into this framework include Kruskal-Wallis test, linear-by-linear In what follows, we first introduce the main results of Strasser and test, trend test, two-sample test, etc. (Hothorn et al., 2006). The R Weber (1999) on permutation. A small sample counterpart of the stat- package coin provides a nice implementation of this theory. In the istic Q is proposed such that the results of Strasser and Weber (1999) next subsection, we show that KAT can be coined into this frame- are applicable. The inference behavior based on the conditional work as well. And we introduce its conditional asymptotic asymptotic distribution of Q is studied using simulated phenotypes distribution. with real genotype data from Genetic Analysis Workshop (GAW) 17. We also show an application to a genetic study for The Ocular Hypertension Treatment Study (OHTS) (Gordon and Kass, 1999). 2.2 Conditional asymptotic distribution of KAT Since K is positive definite, it can be in general written K ¼ GGt by 2 Materials and methods Cholesky decomposition. So we will express the statistic Q as b t t b 2.1 A theoretical result of Strasser and Weber (1999) Q ¼ ðÞy y GG ðÞy y . Define y~ ¼ y yb to be the vector of residuals. Now the ith row We start by introducing the main result by Strasser and Weber (1999) of G, denoted by Gi, corresponds to Xi in the general theory pre- on permutation. Let ðÞYi; Xi ; i ¼ 1; ...; n,denoten independent and sented in the last subsection. Letting g and h be identity transform- identically distributed observations. Both Yi and Xi can be univariate ations such that h Gt ¼ Gt and gðÞ¼y~ y~ where y~ is the ith or multivariate. To test the null hypothesis that the conditional distri- i i i i i component of y~. Clearly, transformation g satisfies the ‘permutation bution of Y given X is identical to the distribution of Y against arbi- invariant’ requirement. We have trary alternatives, Strasser and Weber (1999) proposed a multivariate linear statistic of the following form (Hothorn et al., 2006): Xn t t Rm 1 ! T ¼ Gi y~i ¼ G y~ 2 : Xn i¼1 t pq 1 T ¼ vec gðÞXi hðÞYi 2 R ; i¼1 It is straightforward to calculate that
Xn where g is a transformation of X to a vector of length p and h is a 1 EðÞ¼y~jS y~ ¼ 0: n i transformation of Y to a vector of length q.Theh transformation i¼1
Downloaded from https://academic.oup.com/bioinformatics/article-abstract/33/23/3733/4082272 by Washington University at St Louis user on 01 May 2018 Conditional asymptotic inference for the kernel association test 3735
Table 1. Simulated type I error rate for continuous traits Table 2. Simulated type I error rate for dichotomous traits
Sig. Cond. Sig. SKAT Cond. Gene n Level Chen SKAT Asy. Perm. Gene n Level Chen Adj. Asy. Perm.
ABCC6 50 0.01 0.01017 0.00659 0.00666 0.01039 ABCC6 50 0.01 0.00962 0.01279 0.00705 0.00998 0.001 0.00104 0.00038 0.00039 0.00107 0.001 0.00092 0.00154 0.00044 0.00102 100 0.01 0.01023 0.00904 0.00910 0.01042 100 0.01 0.01028 0.01189 0.00874 0.01053 0.001 0.00093 0.00058 0.00057 0.00105 0.001 0.00091 0.00116 0.00061 0.00105 200 0.01 0.00980 0.00909 0.00907 0.00976 200 0.01 0.01016 0.01064 0.00939 0.01029 0.001 0.00085 0.00069 0.00069 0.00091 0.001 0.00097 0.00119 0.00082 0.00110 ACIN1 50 0.01 0.00970 0.00780 0.00798 0.00984 ACIN1 50 0.01 0.01094 0.01400 0.00873 0.01090 0.001 0.00097 0.00053 0.00054 0.00113 0.001 0.00132 0.00164 0.00069 0.00126 100 0.01 0.01004 0.00890 0.00880 0.01009 100 0.01 0.01050 0.01187 0.00952 0.01057 0.001 0.00102 0.00079 0.00077 0.00113 0.001 0.00110 0.00131 0.00078 0.00112 200 0.01 0.00969 0.00913 0.00916 0.00981 200 0.01 0.01014 0.01066 0.00956 0.01015 0.001 0.00108 0.00097 0.00096 0.00119 0.001 0.00084 0.00094 0.00072 0.00091 AHNAK 50 0.01 0.01013 0.00659 0.00687 0.01018 AHNAK 50 0.01 0.00365 0.01537 0.00188 0.01020 0.001 0.00086 0.00030 0.00030 0.00102 0.001 0.00018 0.00118 0.00003 0.00106 100 0.01 0.00931 0.00681 0.00676 0.00950 100 0.01 0.00642 0.01285 0.00465 0.01069 0.001 0.00104 0.00065 0.00066 0.00127 0.001 0.00028 0.00118 0.00013 0.00103 200 0.01 0.00986 0.00895 0.00897 0.01001 200 0.01 0.00811 0.01110 0.00721 0.01060 0.001 0.00125 0.00099 0.00098 0.00125 0.001 0.00052 0.00133 0.00038 0.00109 AKAP13 50 0.01 0.00986 0.00802 0.00799 0.01004 AKAP13 50 0.01 0.01116 0.01392 0.00938 0.01102 0.001 0.00116 0.00059 0.00062 0.00120 0.001 0.00132 0.00172 0.00070 0.00130 100 0.01 0.00958 0.00871 0.00875 0.00977 100 0.01 0.01045 0.01170 0.00959 0.01036 0.001 0.00102 0.00080 0.00079 0.00108 0.001 0.00117 0.00138 0.00093 0.00113 200 0.01 0.00991 0.00953 0.00952 0.00984 200 0.01 0.00991 0.01068 0.00954 0.00995 0.001 0.00090 0.00078 0.00080 0.00093 0.001 0.00110 0.00120 0.00099 0.00114 ALPK3 50 0.01 0.01011 0.00680 0.00695 0.01013 ALPK3 50 0.01 0.00925 0.01279 0.00613 0.00947 0.001 0.00100 0.00038 0.00039 0.00110 0.001 0.00086 0.00141 0.00032 0.00104 100 0.01 0.00947 0.00802 0.00803 0.00943 100 0.01 0.01003 0.01154 0.00826 0.01022 0.001 0.00104 0.00073 0.00070 0.00117 0.001 0.00091 0.00129 0.00061 0.00112 200 0.01 0.00972 0.00900 0.00896 0.00982 200 0.01 0.00994 0.01025 0.00910 0.01003 0.001 0.00103 0.00086 0.00086 0.00109 0.001 0.00100 0.00121 0.00083 0.00114 ANKRD12 50 0.01 0.00983 0.00751 0.00736 0.00987 ANKRD12 50 0.01 0.00757 0.01362 0.00561 0.01013 0.001 0.00082 0.00040 0.00036 0.00098 0.001 0.00055 0.00150 0.00025 0.00114 100 0.01 0.01012 0.00893 0.00899 0.01041 100 0.01 0.00829 0.01136 0.00713 0.00951 0.001 0.00111 0.00087 0.00086 0.00125 0.001 0.00045 0.00121 0.00027 0.00098 200 0.01 0.01009 0.00929 0.00938 0.01008 200 0.01 0.00953 0.01097 0.00888 0.01023 0.001 0.00105 0.00085 0.00086 0.00112 0.001 0.00089 0.00126 0.00078 0.00114 ANKRD15 50 0.01 0.00962 0.00663 0.00679 0.00960 ANKRD15 50 0.01 0.00943 0.01322 0.00615 0.01008 0.001 0.00095 0.00026 0.00027 0.00106 0.001 0.00087 0.00143 0.00028 0.00110 100 0.01 0.01027 0.00886 0.00881 0.01033 100 0.01 0.00967 0.01149 0.00781 0.00997 0.001 0.00097 0.00068 0.00069 0.00109 0.001 0.00098 0.00134 0.00061 0.00115 200 0.01 0.01026 0.00923 0.00915 0.01016 200 0.01 0.01035 0.01106 0.00927 0.01046 0.001 0.00079 0.00060 0.00059 0.00092 0.001 0.00105 0.00127 0.00074 0.00120 ATP10A 50 0.01 0.01003 0.00594 0.00606 0.01012 ATP10A 50 0.01 0.00996 0.01313 0.00640 0.01000 0.001 0.00085 0.00025 0.00029 0.00097 0.001 0.00094 0.00122 0.00030 0.00098 100 0.01 0.00993 0.00791 0.00800 0.01014 100 0.01 0.01014 0.01156 0.00819 0.01013 0.001 0.00095 0.00060 0.00063 0.00103 0.001 0.00121 0.00148 0.00064 0.00121 200 0.01 0.01000 0.00903 0.00909 0.01005 200 0.01 0.01011 0.01064 0.00911 0.01018 0.001 0.00098 0.00081 0.00078 0.00112 0.001 0.00107 0.00118 0.00080 0.00112 BAIAP3 50 0.01 0.01021 0.00659 0.00667 0.01056 BAIAP3 50 0.01 0.00850 0.01323 0.00580 0.01005 0.001 0.00099 0.00032 0.00029 0.00106 0.001 0.00064 0.00121 0.00030 0.00097 100 0.01 0.00991 0.00878 0.00882 0.00997 100 0.01 0.00930 0.01133 0.00789 0.01027 0.001 0.00098 0.00070 0.00067 0.00110 0.001 0.00079 0.00121 0.00055 0.00109 200 0.01 0.00972 0.00897 0.00897 0.00981 200 0.01 0.00946 0.01029 0.00867 0.00977 0.001 0.00093 0.00078 0.00078 0.00110 0.001 0.00107 0.00128 0.00093 0.00120
Note: ‘Cond. Asy.’ is the proposed method and ‘Perm.’ is permutation test. Note: ‘SKAT Adj’ is the adjusted STAT, ‘Cond. Asy.’ is the proposed method, and ‘Perm.’ is permutation test. Furthermore,
Xn 1 n 1 PROPOSITION 1. The conditional mean of T over all permutations VðÞ¼y~jS y~2 ¼ s2; n i n y~ is l ¼ 0 and its conditional variance matrix is i¼1 P 2 t 1 t t 2 2 2 1 2 R ¼ sy~ G G n G 11 G ¼ ðÞn 1 sy~ SG; where sy~ ¼ ðÞn 1 i y~i is the sample variance of y~i; i ¼ 1; ...; n. In summary, we have the following result.
Downloaded from https://academic.oup.com/bioinformatics/article-abstract/33/23/3733/4082272 by Washington University at St Louis user on 01 May 2018 3736 K.Wang
Table 3. Simulated power for continuous traits
Sig. level 0.01 Sig. level 0.001
Cond. Cond. Gene n Chen SKAT Asy. Perm. Chen SKAT Asy. Perm.
ABCC6 50 0.318 0.279 0.278 0.316 0.083 0.054 0.052 0.082 100 0.695 0.677 0.681 0.694 0.376 0.359 0.359 0.381 200 0.555 0.544 0.545 0.551 0.252 0.243 0.245 0.255 ACIN1 50 0.144 0.139 0.135 0.144 0.039 0.034 0.034 0.041 100 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 200 0.212 0.211 0.206 0.207 0.071 0.070 0.069 0.074 AHNAK 50 0.967 0.955 0.955 0.960 0.767 0.695 0.689 0.684 100 0.999 0.998 0.998 0.999 0.992 0.991 0.989 0.991 200 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 AKAP13 50 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 100 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 200 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 ALPK3 50 0.038 0.031 0.030 0.037 0.008 0.007 0.007 0.008 100 0.324 0.315 0.317 0.325 0.123 0.109 0.105 0.121 200 0.958 0.958 0.956 0.958 0.863 0.857 0.858 0.861 ANKRD12 50 0.033 0.031 0.028 0.032 0.003 0.003 0.003 0.003 100 0.056 0.055 0.054 0.056 0.006 0.006 0.005 0.007 200 0.807 0.801 0.802 0.803 0.580 0.569 0.570 0.576 ANKRD15 50 0.469 0.452 0.454 0.469 0.242 0.196 0.199 0.238 100 0.872 0.868 0.868 0.875 0.636 0.605 0.605 0.639 200 0.307 0.297 0.298 0.308 0.087 0.078 0.078 0.082 ATP10A 50 0.050 0.041 0.040 0.051 0.008 0.004 0.004 0.008 100 0.164 0.153 0.152 0.166 0.043 0.037 0.038 0.042 200 1.000 1.000 1.000 1.000 0.993 0.993 0.993 0.993 BAIAP3 50 0.654 0.621 0.623 0.649 0.381 0.323 0.327 0.383 100 0.137 0.130 0.132 0.134 0.019 0.016 0.017 0.018 200 0.999 0.999 0.999 0.999 0.993 0.993 0.992 0.993
Note: ‘Cond. Asy.’ is the proposed method and ‘Perm.’ is permutation test.
Table 4. Simulated power for dichotomous traits
Sig. level 0.01 Sig. level 0.001
SKAT Cond. SKAT Cond. Gene n Chen Adj. Asy. Perm. Chen Adj. Asy. Perm.
ABCC6 50 0.142 0.159 0.122 0.146 0.039 0.047 0.018 0.035 100 0.359 0.365 0.349 0.359 0.166 0.177 0.152 0.168 200 0.858 0.856 0.850 0.857 0.603 0.608 0.582 0.609 ACIN1 50 0.111 0.121 0.094 0.107 0.027 0.035 0.020 0.027 100 0.997 0.997 0.996 0.996 0.977 0.979 0.974 0.976 200 0.324 0.327 0.322 0.326 0.128 0.133 0.120 0.132 AHNAK 50 0.795 0.845 0.776 0.811 0.529 0.602 0.482 0.555 100 0.988 0.990 0.988 0.989 0.947 0.962 0.944 0.954 200 0.975 0.977 0.973 0.975 0.895 0.907 0.885 0.902 AKAP13 50 0.989 0.990 0.983 0.984 0.920 0.931 0.903 0.920 100 0.998 0.998 0.998 0.998 0.990 0.990 0.989 0.989 200 1.000 1.000 1.000 1.000 0.999 0.999 0.999 0.999 ALPK3 50 0.087 0.104 0.072 0.090 0.017 0.023 0.009 0.021 100 0.195 0.208 0.188 0.197 0.071 0.076 0.062 0.074 200 0.505 0.503 0.500 0.504 0.214 0.218 0.199 0.210 ANKRD12 50 0.054 0.070 0.050 0.059 0.013 0.018 0.010 0.016 100 0.255 0.282 0.249 0.263 0.072 0.092 0.063 0.086 200 0.447 0.457 0.447 0.451 0.205 0.218 0.202 0.212 ANKRD15 50 0.289 0.317 0.265 0.288 0.113 0.138 0.087 0.115 100 0.536 0.556 0.524 0.538 0.318 0.333 0.296 0.313 200 0.729 0.740 0.724 0.737 0.489 0.502 0.478 0.495 ATP10A 50 0.070 0.078 0.054 0.069 0.014 0.018 0.012 0.014 100 0.244 0.251 0.235 0.242 0.107 0.108 0.094 0.104 200 0.927 0.931 0.924 0.928 0.807 0.814 0.800 0.810 BAIAP3 50 0.284 0.312 0.263 0.286 0.115 0.147 0.094 0.124 100 0.202 0.230 0.191 0.220 0.055 0.070 0.042 0.065 200 0.995 0.996 0.996 0.996 0.968 0.970 0.964 0.967
Downloaded from https://academic.oup.com/bioinformatics/article-abstract/33/23/3733/4082272Note: ‘SKAT Adj’ is the adjusted STAT, ‘Cond. Asy.’ is the proposed method, and ‘Perm.’ is permutation test. by Washington University at St Louis user on 01 May 2018 Conditional asymptotic inference for the kernel association test 3737