Bioinformatics, 33(23), 2017, 3733–3739 doi: 10.1093/bioinformatics/btx511 Advance Access Publication Date: 14 August 2017 Original Paper

Genome analysis Conditional asymptotic inference for the kernel association test Kai Wang

Department of Biostatistics, University of Iowa, Iowa City, IA 52246, USA

Associate Editor: John Hancock

Received on June 7, 2017; revised on July 21, 2017; editorial decision on August 7, 2017; accepted on August 8, 2017

Abstract

Motivation: The kernel association test (KAT) is popular in biological studies for its ability to combine weak effects potentially of opposite direction. Its P-value is typically assessed via its (unconditional) asymptotic distribution. However, such an asymptotic distribution is known only for continuous traits and for dichotomous traits. Furthermore, the derived P-values are known to be conservative when sample size is small, especially for the important case of dichotomous traits. One alternative is the permutation test, a widely accepted approximation to the exact finite sample conditional inference. But it is time-consuming to use in practice due to the stringent significance criteria commonly seen in these analyses.

Results: Based on a previous theoretical result, a conditional asymptotic distribution for the KAT is introduced. This distribution provides an alternative approximation to the exact distribution of the KAT. An explicit expression of this distribution is provided from which P-values can be easily computed. This method applies to any type of trait. The usefulness of this approach is demonstrated via extensive simulation studies using real genotype data and an analysis of genetic data from the Ocular Hypertension Treatment Study. Numerical results showed that the new method can control the type I error rate and is a bit conservative when compared to the permutation method. Nevertheless, the proposed method may be used as a fast screening method. A time-consuming permutation procedure may be conducted at locations that show signals of association.

Availability and implementation: An implementation of the proposed method is provided in the R package iGasso.

Contact: [email protected]

1 Introduction

Kernel machine regression (Liu et al., 2007, 2008) is becoming more and more popular in analysis of biological data. It has the ability to accumulate weak signals across biological features even though they are potentially of opposite effects. Initially proposed for pathway analysis, this method has found a plethora of applications in rare variants association analysis with sequence data (Lee et al., 2012b; Wu et al., 2011). It has been extended to numerous situations such as extreme continuous traits (Barnett et al., 2013), survival outcomes (Cai et al., 2011; Chen et al., 2013; Lin et al., 2011), family-based association test (Wang et al., 2013), gene–gene and gene–environmental interactions (Lin et al., 2013, 2016) and microbiome data (Chen et al., 2016; Zhao et al., 2015).

The test statistic for association, denoted by Q, derived from the kernel machine regression is of the following form:

$$Q = (y - \hat{y})^t K (y - \hat{y}). \qquad (1)$$

Here $y$ is the vector of responses on $n$ subjects. $\hat{y}$ is the vector of predicted responses using a linear regression (for continuous responses) or logistic regression (for dichotomous responses) with potential confounders as predictors. This regression model is called the null model as it does not involve any of the features to be tested for association. The positive definite $n \times n$ matrix $K$ is called the kernel. It measures the pair-wise similarities between subjects. Choice of kernel depends on the application context. In genetic association studies, it is often taken as $K = GG^t$, where $G$ is a $n \times m$ matrix of genotype scores at $m$

VC The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: [email protected] 3733

Downloaded from https://academic.oup.com/bioinformatics/article-abstract/33/23/3733/4082272 by Washington University at St Louis user on 01 May 2018 3734 K.Wang

single nucleotide polymorphisms (SNPs) (Wu et al., 2011). The columns of $G$ are typically weighted. In microbiome data analysis, popular kernels include the UniFrac kernel and the Bray-Curtis kernel (Chen and Li, 2013; Chen et al., 2012; Zhao et al., 2015).

The exact distribution of the statistic Q is unknown. Asymptotic distribution has been derived for the case of continuous traits and the case of dichotomous traits (Liu et al., 2007, 2008). Let $Z$ denote the design matrix of the covariates (including the intercept) used in the null model and $P = V - VZ(Z^t V Z)^{-1} Z^t V$ with $V = \mathrm{diag}(\hat{y}(1 - \hat{y}))$ for dichotomous traits and $V = \hat{\sigma}^2 I$ for continuous traits where $\hat{\sigma}^2$ is an estimated variance of the residuals. Then the asymptotic distribution of Q is a linear combination of $m$ independent 1-df chi-square distributions: $\sum_j \lambda_j \chi_j^2(\mathrm{df} = 1)$ where $\lambda_j, j = 1, \ldots, m$, are eigenvalues of $P^{1/2} K P^{1/2}$.

However, it has been found that the asymptotic distribution for the Q statistic is conservative for small sample size especially in the important case of dichotomous responses. Lee et al. (2012b) use a single chi-square distribution to approximate $\sum_j \lambda_j \chi_j^2(\mathrm{df} = 1)$ by matching the first and the second moments. The performance of this method was investigated via simulation studies (Lee et al., 2012b).

Chen et al. (2016) found that the small-sample correction method of Lee et al. (2012b) can be anti-conservative. Alternatively, Chen et al. (2016) proposed the following statistic for continuous traits: $(y - \hat{y})^t K (y - \hat{y}) / \hat{\sigma}^2$ in order to take into account the uncertainty in the estimated residual variance $\hat{\sigma}^2$. Based on the iteratively reweighted least squares algorithm for generalized linear models, a similar statistic was proposed for dichotomous traits.

The main challenge in approximating the distribution of Q is that its distribution depends on the residual variance of the response which needs to be estimated. However, it would be known in the framework of conditional inference in which statistical inference is conditional on the observed data. Deriving the exact distribution of Q given the data does not seem to be an easy task. Permutation would be a straightforward way to approximate this distribution but it may be time-consuming. In this article we study the conditional asymptotic distribution of Q using the theoretical work by Strasser and Weber (1999). Unlike the literature mentioned above, this approach is not limited to situations with continuous traits or dichotomous traits.

In what follows, we first introduce the main results of Strasser and Weber (1999) on permutation. A small sample counterpart of the statistic Q is proposed such that the results of Strasser and Weber (1999) are applicable. The inference behavior based on the conditional asymptotic distribution of Q is studied using simulated phenotypes with real genotype data from Genetic Analysis Workshop (GAW) 17. We also show an application to a genetic study for The Ocular Hypertension Treatment Study (OHTS) (Gordon and Kass, 1999).

2 Materials and methods

2.1 A theoretical result of Strasser and Weber (1999)

We start by introducing the main result by Strasser and Weber (1999) on permutation. Let $(Y_i, X_i), i = 1, \ldots, n$, denote $n$ independent and identically distributed observations. Both $Y_i$ and $X_i$ can be univariate or multivariate. To test the null hypothesis that the conditional distribution of $Y$ given $X$ is identical to the distribution of $Y$ against arbitrary alternatives, Strasser and Weber (1999) proposed a multivariate linear statistic of the following form (Hothorn et al., 2006):

$$T = \mathrm{vec}\left(\sum_{i=1}^{n} g(X_i)\, h(Y_i)^t\right) \in \mathbb{R}^{pq \times 1},$$

where $g$ is a transformation of $X$ to a vector of length $p$ and $h$ is a transformation of $Y$ to a vector of length $q$. The $h$ transformation may depend on all the observations of $Y$: $h(Y_i) = h(Y_i; (Y_1, \ldots, Y_n))$. However, this dependence must be permutation invariant: for a permutation of $(Y_1, \ldots, Y_n)$, $h(Y_i)$ does not change. One example is that the rank of a value remains the same in any permutation. Other examples of $h$ and $g$ transformations can be found in Hothorn et al. (2006).

Under the null hypothesis of independence between $Y$ and $X$, Strasser and Weber (1999) were able to derive the mean $\mu$ and variance $\Sigma$ of $T$ over the set $S$ of all possible permutations:

$$\mu = \mathrm{vec}\left(\left[\sum_{i=1}^{n} g(X_i)\right] E(h(Y) \mid S)^t\right),$$

$$\Sigma = \frac{n}{n-1}\, V(h(Y) \mid S) \otimes \left(\sum_{i=1}^{n} g(X_i) \otimes g(X_i)^t\right) - \frac{1}{n-1}\, V(h(Y) \mid S) \otimes \left(\sum_{i=1}^{n} g(X_i)\right) \otimes \left(\sum_{i=1}^{n} g(X_i)\right)^t.$$

Here $\otimes$ represents the Kronecker product and

$$E(h(Y) \mid S) = \frac{1}{n} \sum_{i=1}^{n} h(Y_i),$$

$$V(h(Y) \mid S) = \frac{1}{n} \sum_{i=1}^{n} \left[h(Y_i) - E(h(Y) \mid S)\right] \left[h(Y_i) - E(h(Y) \mid S)\right]^t.$$

Strasser and Weber (1999) further showed that $T$ follows an asymptotic multivariate normal distribution with mean $\mu$ and variance matrix $\Sigma$ as sample size $n$ goes to infinity.

These results on $T$ are very useful for conditional inference when the exact conditional distribution is unknown or permutation is too time-consuming to conduct. This conditional inference framework encompasses many commonly used tests for independence. The test statistics of these tests are either scalars (when $pq = 1$) or assume the quadratic form $(T - \mu)^t \Sigma^{+} (T - \mu)$. Examples of tests that can be put into this framework include Kruskal-Wallis test, linear-by-linear test, trend test, two-sample test, etc. (Hothorn et al., 2006). The R package coin provides a nice implementation of this theory. In the next subsection, we show that KAT can be coined into this framework as well, and we introduce its conditional asymptotic distribution.

2.2 Conditional asymptotic distribution of KAT

Since $K$ is positive definite, it can be in general written $K = GG^t$ by Cholesky decomposition. So we will express the statistic Q as $Q = (y - \hat{y})^t G G^t (y - \hat{y})$.

Define $\tilde{y} = y - \hat{y}$ to be the vector of residuals. Now the $i$th row of $G$, denoted by $G_i$, corresponds to $X_i$ in the general theory presented in the last subsection, and $\tilde{y}_i$, the $i$th component of $\tilde{y}$, corresponds to $Y_i$. Let $g$ and $h$ be identity transformations such that $g(G_i^t) = G_i^t$ and $h(\tilde{y}_i) = \tilde{y}_i$. Clearly, transformation $h$ satisfies the 'permutation invariant' requirement. We have

$$T = \sum_{i=1}^{n} G_i^t \tilde{y}_i = G^t \tilde{y} \in \mathbb{R}^{m \times 1}.$$

It is straightforward to calculate that

$$E(\tilde{y} \mid S) = \frac{1}{n} \sum_{i=1}^{n} \tilde{y}_i = 0.$$
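The construction above can be sketched numerically. The following is a minimal illustration, not the iGasso implementation: it assumes a continuous trait with a least-squares null model, and the genotype matrix, covariate and helper names (`null_residuals`, `kat_statistic`) are hypothetical stand-ins introduced only for this example. It checks that the quadratic form in Equation (1) with $K = GG^t$ equals $T^t T$ for $T = G^t \tilde{y}$, and that the residuals average to zero when the null model contains an intercept.

```python
import numpy as np

def null_residuals(y, Z):
    """Residuals y - yhat from a least-squares null model (continuous trait)."""
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return y - Z @ coef

def kat_statistic(resid, G):
    """Return T = G^t resid and Q = T^t T = resid^t G G^t resid."""
    T = G.T @ resid
    return T, float(T @ T)

rng = np.random.default_rng(0)
n, m = 60, 8
# Hypothetical genotype scores (0/1/2 minor-allele counts); an assumption for illustration.
G = rng.binomial(2, 0.1, size=(n, m)).astype(float)
# Null-model design matrix: intercept plus one made-up covariate.
Z = np.column_stack([np.ones(n), rng.normal(size=n)])
y = rng.normal(size=n)

resid = null_residuals(y, Z)
T, Q = kat_statistic(resid, G)

# The same statistic written with the kernel K = G G^t, as in Equation (1).
K = G @ G.T
Q_kernel = float(resid @ K @ resid)
```

Working with the $m$-vector $T$ instead of the $n \times n$ kernel is what lets the permutation moments of Strasser and Weber (1999) be applied directly.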


Table 1. Simulated type I error rate for continuous traits

Gene     n    Sig. level  Chen     SKAT     Cond. Asy.  Perm.
ABCC6    50   0.01        0.01017  0.00659  0.00666     0.01039
ABCC6    50   0.001       0.00104  0.00038  0.00039     0.00107
ABCC6    100  0.01        0.01023  0.00904  0.00910     0.01042
ABCC6    100  0.001       0.00093  0.00058  0.00057     0.00105
ABCC6    200  0.01        0.00980  0.00909  0.00907     0.00976
ABCC6    200  0.001       0.00085  0.00069  0.00069     0.00091
ACIN1    50   0.01        0.00970  0.00780  0.00798     0.00984
ACIN1    50   0.001       0.00097  0.00053  0.00054     0.00113
ACIN1    100  0.01        0.01004  0.00890  0.00880     0.01009
ACIN1    100  0.001       0.00102  0.00079  0.00077     0.00113
ACIN1    200  0.01        0.00969  0.00913  0.00916     0.00981
ACIN1    200  0.001       0.00108  0.00097  0.00096     0.00119
AHNAK    50   0.01        0.01013  0.00659  0.00687     0.01018
AHNAK    50   0.001       0.00086  0.00030  0.00030     0.00102
AHNAK    100  0.01        0.00931  0.00681  0.00676     0.00950
AHNAK    100  0.001       0.00104  0.00065  0.00066     0.00127
AHNAK    200  0.01        0.00986  0.00895  0.00897     0.01001
AHNAK    200  0.001       0.00125  0.00099  0.00098     0.00125
AKAP13   50   0.01        0.00986  0.00802  0.00799     0.01004
AKAP13   50   0.001       0.00116  0.00059  0.00062     0.00120
AKAP13   100  0.01        0.00958  0.00871  0.00875     0.00977
AKAP13   100  0.001       0.00102  0.00080  0.00079     0.00108
AKAP13   200  0.01        0.00991  0.00953  0.00952     0.00984
AKAP13   200  0.001       0.00090  0.00078  0.00080     0.00093
ALPK3    50   0.01        0.01011  0.00680  0.00695     0.01013
ALPK3    50   0.001       0.00100  0.00038  0.00039     0.00110
ALPK3    100  0.01        0.00947  0.00802  0.00803     0.00943
ALPK3    100  0.001       0.00104  0.00073  0.00070     0.00117
ALPK3    200  0.01        0.00972  0.00900  0.00896     0.00982
ALPK3    200  0.001       0.00103  0.00086  0.00086     0.00109
ANKRD12  50   0.01        0.00983  0.00751  0.00736     0.00987
ANKRD12  50   0.001       0.00082  0.00040  0.00036     0.00098
ANKRD12  100  0.01        0.01012  0.00893  0.00899     0.01041
ANKRD12  100  0.001       0.00111  0.00087  0.00086     0.00125
ANKRD12  200  0.01        0.01009  0.00929  0.00938     0.01008
ANKRD12  200  0.001       0.00105  0.00085  0.00086     0.00112
ANKRD15  50   0.01        0.00962  0.00663  0.00679     0.00960
ANKRD15  50   0.001       0.00095  0.00026  0.00027     0.00106
ANKRD15  100  0.01        0.01027  0.00886  0.00881     0.01033
ANKRD15  100  0.001       0.00097  0.00068  0.00069     0.00109
ANKRD15  200  0.01        0.01026  0.00923  0.00915     0.01016
ANKRD15  200  0.001       0.00079  0.00060  0.00059     0.00092
ATP10A   50   0.01        0.01003  0.00594  0.00606     0.01012
ATP10A   50   0.001       0.00085  0.00025  0.00029     0.00097
ATP10A   100  0.01        0.00993  0.00791  0.00800     0.01014
ATP10A   100  0.001       0.00095  0.00060  0.00063     0.00103
ATP10A   200  0.01        0.01000  0.00903  0.00909     0.01005
ATP10A   200  0.001       0.00098  0.00081  0.00078     0.00112
BAIAP3   50   0.01        0.01021  0.00659  0.00667     0.01056
BAIAP3   50   0.001       0.00099  0.00032  0.00029     0.00106
BAIAP3   100  0.01        0.00991  0.00878  0.00882     0.00997
BAIAP3   100  0.001       0.00098  0.00070  0.00067     0.00110
BAIAP3   200  0.01        0.00972  0.00897  0.00897     0.00981
BAIAP3   200  0.001       0.00093  0.00078  0.00078     0.00110

Note: ‘Cond. Asy.’ is the proposed method and ‘Perm.’ is permutation test.

Table 2. Simulated type I error rate for dichotomous traits

Gene     n    Sig. level  Chen     SKAT Adj.  Cond. Asy.  Perm.
ABCC6    50   0.01        0.00962  0.01279    0.00705     0.00998
ABCC6    50   0.001       0.00092  0.00154    0.00044     0.00102
ABCC6    100  0.01        0.01028  0.01189    0.00874     0.01053
ABCC6    100  0.001       0.00091  0.00116    0.00061     0.00105
ABCC6    200  0.01        0.01016  0.01064    0.00939     0.01029
ABCC6    200  0.001       0.00097  0.00119    0.00082     0.00110
ACIN1    50   0.01        0.01094  0.01400    0.00873     0.01090
ACIN1    50   0.001       0.00132  0.00164    0.00069     0.00126
ACIN1    100  0.01        0.01050  0.01187    0.00952     0.01057
ACIN1    100  0.001       0.00110  0.00131    0.00078     0.00112
ACIN1    200  0.01        0.01014  0.01066    0.00956     0.01015
ACIN1    200  0.001       0.00084  0.00094    0.00072     0.00091
AHNAK    50   0.01        0.00365  0.01537    0.00188     0.01020
AHNAK    50   0.001       0.00018  0.00118    0.00003     0.00106
AHNAK    100  0.01        0.00642  0.01285    0.00465     0.01069
AHNAK    100  0.001       0.00028  0.00118    0.00013     0.00103
AHNAK    200  0.01        0.00811  0.01110    0.00721     0.01060
AHNAK    200  0.001       0.00052  0.00133    0.00038     0.00109
AKAP13   50   0.01        0.01116  0.01392    0.00938     0.01102
AKAP13   50   0.001       0.00132  0.00172    0.00070     0.00130
AKAP13   100  0.01        0.01045  0.01170    0.00959     0.01036
AKAP13   100  0.001       0.00117  0.00138    0.00093     0.00113
AKAP13   200  0.01        0.00991  0.01068    0.00954     0.00995
AKAP13   200  0.001       0.00110  0.00120    0.00099     0.00114
ALPK3    50   0.01        0.00925  0.01279    0.00613     0.00947
ALPK3    50   0.001       0.00086  0.00141    0.00032     0.00104
ALPK3    100  0.01        0.01003  0.01154    0.00826     0.01022
ALPK3    100  0.001       0.00091  0.00129    0.00061     0.00112
ALPK3    200  0.01        0.00994  0.01025    0.00910     0.01003
ALPK3    200  0.001       0.00100  0.00121    0.00083     0.00114
ANKRD12  50   0.01        0.00757  0.01362    0.00561     0.01013
ANKRD12  50   0.001       0.00055  0.00150    0.00025     0.00114
ANKRD12  100  0.01        0.00829  0.01136    0.00713     0.00951
ANKRD12  100  0.001       0.00045  0.00121    0.00027     0.00098
ANKRD12  200  0.01        0.00953  0.01097    0.00888     0.01023
ANKRD12  200  0.001       0.00089  0.00126    0.00078     0.00114
ANKRD15  50   0.01        0.00943  0.01322    0.00615     0.01008
ANKRD15  50   0.001       0.00087  0.00143    0.00028     0.00110
ANKRD15  100  0.01        0.00967  0.01149    0.00781     0.00997
ANKRD15  100  0.001       0.00098  0.00134    0.00061     0.00115
ANKRD15  200  0.01        0.01035  0.01106    0.00927     0.01046
ANKRD15  200  0.001       0.00105  0.00127    0.00074     0.00120
ATP10A   50   0.01        0.00996  0.01313    0.00640     0.01000
ATP10A   50   0.001       0.00094  0.00122    0.00030     0.00098
ATP10A   100  0.01        0.01014  0.01156    0.00819     0.01013
ATP10A   100  0.001       0.00121  0.00148    0.00064     0.00121
ATP10A   200  0.01        0.01011  0.01064    0.00911     0.01018
ATP10A   200  0.001       0.00107  0.00118    0.00080     0.00112
BAIAP3   50   0.01        0.00850  0.01323    0.00580     0.01005
BAIAP3   50   0.001       0.00064  0.00121    0.00030     0.00097
BAIAP3   100  0.01        0.00930  0.01133    0.00789     0.01027
BAIAP3   100  0.001       0.00079  0.00121    0.00055     0.00109
BAIAP3   200  0.01        0.00946  0.01029    0.00867     0.00977
BAIAP3   200  0.001       0.00107  0.00128    0.00093     0.00120

Note: ‘SKAT Adj.’ is the adjusted SKAT, ‘Cond. Asy.’ is the proposed method, and ‘Perm.’ is permutation test.

Furthermore,

$$V(\tilde{y} \mid S) = \frac{1}{n} \sum_{i=1}^{n} \tilde{y}_i^2 = \frac{n-1}{n}\, s_{\tilde{y}}^2,$$

where $s_{\tilde{y}}^2 = (n-1)^{-1} \sum_i \tilde{y}_i^2$ is the sample variance of $\tilde{y}_i, i = 1, \ldots, n$. In summary, we have the following result.

PROPOSITION 1. The conditional mean of $T$ over all permutations is $\mu = 0$ and its conditional variance matrix is

$$\Sigma = s_{\tilde{y}}^2 \left(G^t G - n^{-1} G^t 1 1^t G\right) = (n-1)\, s_{\tilde{y}}^2\, S_G^2,$$


Table 3. Simulated power for continuous traits

              Sig. level 0.01                  Sig. level 0.001
Gene     n    Chen   SKAT   Cond. Asy.  Perm.  Chen   SKAT   Cond. Asy.  Perm.
ABCC6    50   0.318  0.279  0.278       0.316  0.083  0.054  0.052       0.082
ABCC6    100  0.695  0.677  0.681       0.694  0.376  0.359  0.359       0.381
ABCC6    200  0.555  0.544  0.545       0.551  0.252  0.243  0.245       0.255
ACIN1    50   0.144  0.139  0.135       0.144  0.039  0.034  0.034       0.041
ACIN1    100  1.000  1.000  1.000       1.000  1.000  1.000  1.000       1.000
ACIN1    200  0.212  0.211  0.206       0.207  0.071  0.070  0.069       0.074
AHNAK    50   0.967  0.955  0.955       0.960  0.767  0.695  0.689       0.684
AHNAK    100  0.999  0.998  0.998       0.999  0.992  0.991  0.989       0.991
AHNAK    200  1.000  1.000  1.000       1.000  1.000  1.000  1.000       1.000
AKAP13   50   1.000  1.000  1.000       1.000  1.000  1.000  1.000       1.000
AKAP13   100  1.000  1.000  1.000       1.000  1.000  1.000  1.000       1.000
AKAP13   200  1.000  1.000  1.000       1.000  1.000  1.000  1.000       1.000
ALPK3    50   0.038  0.031  0.030       0.037  0.008  0.007  0.007       0.008
ALPK3    100  0.324  0.315  0.317       0.325  0.123  0.109  0.105       0.121
ALPK3    200  0.958  0.958  0.956       0.958  0.863  0.857  0.858       0.861
ANKRD12  50   0.033  0.031  0.028       0.032  0.003  0.003  0.003       0.003
ANKRD12  100  0.056  0.055  0.054       0.056  0.006  0.006  0.005       0.007
ANKRD12  200  0.807  0.801  0.802       0.803  0.580  0.569  0.570       0.576
ANKRD15  50   0.469  0.452  0.454       0.469  0.242  0.196  0.199       0.238
ANKRD15  100  0.872  0.868  0.868       0.875  0.636  0.605  0.605       0.639
ANKRD15  200  0.307  0.297  0.298       0.308  0.087  0.078  0.078       0.082
ATP10A   50   0.050  0.041  0.040       0.051  0.008  0.004  0.004       0.008
ATP10A   100  0.164  0.153  0.152       0.166  0.043  0.037  0.038       0.042
ATP10A   200  1.000  1.000  1.000       1.000  0.993  0.993  0.993       0.993
BAIAP3   50   0.654  0.621  0.623       0.649  0.381  0.323  0.327       0.383
BAIAP3   100  0.137  0.130  0.132       0.134  0.019  0.016  0.017       0.018
BAIAP3   200  0.999  0.999  0.999       0.999  0.993  0.993  0.992       0.993

Note: ‘Cond. Asy.’ is the proposed method and ‘Perm.’ is permutation test.

Table 4. Simulated power for dichotomous traits

              Sig. level 0.01                       Sig. level 0.001
Gene     n    Chen   SKAT Adj.  Cond. Asy.  Perm.  Chen   SKAT Adj.  Cond. Asy.  Perm.
ABCC6    50   0.142  0.159      0.122       0.146  0.039  0.047      0.018       0.035
ABCC6    100  0.359  0.365      0.349       0.359  0.166  0.177      0.152       0.168
ABCC6    200  0.858  0.856      0.850       0.857  0.603  0.608      0.582       0.609
ACIN1    50   0.111  0.121      0.094       0.107  0.027  0.035      0.020       0.027
ACIN1    100  0.997  0.997      0.996       0.996  0.977  0.979      0.974       0.976
ACIN1    200  0.324  0.327      0.322       0.326  0.128  0.133      0.120       0.132
AHNAK    50   0.795  0.845      0.776       0.811  0.529  0.602      0.482       0.555
AHNAK    100  0.988  0.990      0.988       0.989  0.947  0.962      0.944       0.954
AHNAK    200  0.975  0.977      0.973       0.975  0.895  0.907      0.885       0.902
AKAP13   50   0.989  0.990      0.983       0.984  0.920  0.931      0.903       0.920
AKAP13   100  0.998  0.998      0.998       0.998  0.990  0.990      0.989       0.989
AKAP13   200  1.000  1.000      1.000       1.000  0.999  0.999      0.999       0.999
ALPK3    50   0.087  0.104      0.072       0.090  0.017  0.023      0.009       0.021
ALPK3    100  0.195  0.208      0.188       0.197  0.071  0.076      0.062       0.074
ALPK3    200  0.505  0.503      0.500       0.504  0.214  0.218      0.199       0.210
ANKRD12  50   0.054  0.070      0.050       0.059  0.013  0.018      0.010       0.016
ANKRD12  100  0.255  0.282      0.249       0.263  0.072  0.092      0.063       0.086
ANKRD12  200  0.447  0.457      0.447       0.451  0.205  0.218      0.202       0.212
ANKRD15  50   0.289  0.317      0.265       0.288  0.113  0.138      0.087       0.115
ANKRD15  100  0.536  0.556      0.524       0.538  0.318  0.333      0.296       0.313
ANKRD15  200  0.729  0.740      0.724       0.737  0.489  0.502      0.478       0.495
ATP10A   50   0.070  0.078      0.054       0.069  0.014  0.018      0.012       0.014
ATP10A   100  0.244  0.251      0.235       0.242  0.107  0.108      0.094       0.104
ATP10A   200  0.927  0.931      0.924       0.928  0.807  0.814      0.800       0.810
BAIAP3   50   0.284  0.312      0.263       0.286  0.115  0.147      0.094       0.124
BAIAP3   100  0.202  0.230      0.191       0.220  0.055  0.070      0.042       0.065
BAIAP3   200  0.995  0.996      0.996       0.996  0.968  0.970      0.964       0.967

Note: ‘SKAT Adj.’ is the adjusted SKAT, ‘Cond. Asy.’ is the proposed method, and ‘Perm.’ is permutation test.

where $S_G^2 = (n-1)^{-1}\left(G^t G - n^{-1} G^t 1 1^t G\right)$ is the sample covariance matrix of genotype scores at the $m$ SNPs.

Since $\mathrm{KAT} = T^t T$, the conditional asymptotic distribution is obtained using the main result of Strasser and Weber (1999).

PROPOSITION 2. The conditional asymptotic distribution of KAT is the same as the distribution of a linear combination of $m$ independent 1-df chi-square random variables: $\sum_{j=1}^{m} \lambda_j \chi_j^2(\mathrm{df} = 1)$ where $\lambda_j, j = 1, \ldots, m$, are the eigenvalues of $\Sigma$.

We note that, unlike the asymptotic distribution for Q, this asymptotic distribution has the same form regardless of the type of traits. Whether it is continuous or dichotomous, the coefficients $\lambda_j, j = 1, \ldots, m$, are the eigenvalues of $\Sigma$ and are proportional to the eigenvalues of $S_G^2$.

Fig. 1. Histogram of the average CCT for the Black subjects in OHTS (horizontal axis: average CCT; vertical axis: frequency)

3 Simulation studies

We have performed extensive simulation studies to demonstrate and compare the performance of the proposed conditional inference method, the method of Chen et al. (2016), SKAT (for continuous traits), SKAT with small sample adjustment (for dichotomous traits) and permutation test. While other works (Chen et al., 2016; Lee et al., 2012a; Wu et al., 2010) used simulated sequence data, we use real sequence data that were made available by Genetic Analysis Workshop (GAW) 17 (Almasy et al., 2011).

The GAW 17 mini-exome sequence data were from 697 subjects selected from the 1000 Genome Projects. There were 24 487 SNPs in 3205 genes. Nine largest genes (i.e. ABCC6, ACIN1, AHNAK, AKAP13, ALPK3, ANKRD12, ANKRD15, ATP10A and BAIAP3) in terms of genotyped SNPs are selected. The number of SNPs ranges from 200 to 400. Trait data were simulated in the same way as in the other works (Chen et al., 2016; Lee et al., 2012a; Wu et al., 2010) which are described below.

For each gene, 20% randomly selected SNPs are assumed to be causal. Eighty percent of these causal SNPs are assumed to have positive effect on the trait while the remaining 20% are negative. Similar to Wu et al. (2011), continuous traits were simulated using the following model

$$y = 0.5 x_1 + 0.5 x_2 + \beta_1 g_1 + \cdots + \beta_s g_s, \qquad (2)$$

where $x_1$ follows a standard normal distribution, $x_2$ follows a discrete distribution taking values 0 or 1 with probability 0.5, and $g_1, \ldots, g_s$ were genotype scores at $s$ causal rare variants. Given the minor allele frequency (MAF) of a causal variant, its coefficient $\beta$ is determined by $\beta = 0.5 \log(5)\, |\log_{10}(\mathrm{MAF})|$. For the study of type I error rate, $\beta_1, \ldots, \beta_s$ were set at 0. For the study of power, the magnitude of $\beta_j$ was equal to $|0.2 \log_{10}(\mathrm{MAF})|$.

For the dichotomous trait, its disease status is determined using the following logistic regression model:

$$\mathrm{logit}[\Pr(\mathrm{disease})] = \beta_0 + 0.5 x_1 + 0.5 x_2 + \beta_1 g_1 + \cdots + \beta_s g_s,$$

where $x_1$ and $x_2$ are the same as those for the continuous traits, $\beta_0 = \log(0.01/0.99)$ corresponds to background disease probability of 0.01 and $\exp(\beta_1), \ldots, \exp(\beta_s)$ are the odds ratios for the $s$ disease susceptibility variants, respectively. The number of cases is the same as the number of controls. Similar simulation set up has been used elsewhere (Wang and Elston, 2007; Wu et al., 2011). As in the case of continuous traits, all $\beta$s are set to 0 in the analysis of type I error rate.
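The trait-generating models above can be sketched as follows. This is an illustration under stated assumptions rather than the simulation code used for the tables: the paper draws genotypes from the GAW 17 data, whereas the sketch draws hypothetical genotypes from a binomial distribution; Equation (2) as printed carries no error term, so the standard normal noise added here is an assumption; and the rejection-style sampling used to balance cases and controls is one possible reading of "the number of cases is the same as the number of controls". All function names are made up for this example. Effect sizes follow the stated power setting, with magnitude $|0.2 \log_{10}(\mathrm{MAF})|$, 20% of SNPs causal, and 80% of the causal effects positive.

```python
import numpy as np

def simulate_effects(maf, rng, null=False, causal_frac=0.2, positive_frac=0.8):
    """Effect sizes: 20% causal SNPs, 80% of them positive, |beta| = |0.2 log10(MAF)|."""
    beta = np.zeros(maf.size)
    if null:                      # type I error setting: all effects are zero
        return beta
    s = max(1, int(round(causal_frac * maf.size)))
    causal = rng.choice(maf.size, size=s, replace=False)
    n_pos = int(round(positive_frac * s))
    signs = np.concatenate([np.ones(n_pos), -np.ones(s - n_pos)])
    beta[causal] = signs * np.abs(0.2 * np.log10(maf[causal]))
    return beta

def simulate_continuous(G, beta, rng, noise_sd=1.0):
    """Model (2) plus an assumed N(0, noise_sd^2) error term."""
    n = G.shape[0]
    x1 = rng.normal(size=n)
    x2 = rng.integers(0, 2, size=n).astype(float)
    y = 0.5 * x1 + 0.5 * x2 + G @ beta + noise_sd * rng.normal(size=n)
    return y, np.column_stack([x1, x2])

def simulate_case_control(maf, beta, n, rng, batch=5000):
    """Draw subjects from the logistic model until n/2 cases and n/2 controls are kept."""
    beta0 = np.log(0.01 / 0.99)   # background disease probability of 0.01
    need = {True: n // 2, False: n // 2}
    G_rows, X_rows, d_rows = [], [], []
    while need[True] > 0 or need[False] > 0:
        G = rng.binomial(2, maf, size=(batch, maf.size)).astype(float)
        x1 = rng.normal(size=batch)
        x2 = rng.integers(0, 2, size=batch).astype(float)
        eta = beta0 + 0.5 * x1 + 0.5 * x2 + G @ beta
        d = rng.random(batch) < 1.0 / (1.0 + np.exp(-eta))
        for status in (True, False):
            idx = np.flatnonzero(d == status)[: need[status]]
            G_rows.append(G[idx])
            X_rows.append(np.column_stack([x1[idx], x2[idx]]))
            d_rows.append(np.full(idx.size, int(status)))
            need[status] -= idx.size
    return np.vstack(G_rows), np.vstack(X_rows), np.concatenate(d_rows)

rng = np.random.default_rng(2017)
n, m = 50, 50
maf = rng.uniform(0.005, 0.05, size=m)               # hypothetical rare-variant MAFs
G = rng.binomial(2, maf, size=(n, m)).astype(float)  # stand-in for real genotypes

beta_null = simulate_effects(maf, rng, null=True)
beta = simulate_effects(maf, rng)
y, X = simulate_continuous(G, beta, rng)
G_cc, X_cc, d = simulate_case_control(maf, beta, n, rng)
```

Because the causal SNPs are redrawn on every call to `simulate_effects`, power in such a simulation need not increase monotonically with sample size across scenarios.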

Table 5. A summary of gene-based association P-values ($\times 10^5$) for the 252 Black subjects in the OHTS study

                                    Base-pair position
Chr  Biotype         Gene symbol   Start       Stop        Chen   SKAT   Conditional inference  Permutation
1    Protein coding  ZBTB40        22 451 851  22 531 157  19.52  23.28  23.64                  9.999
17   Protein coding  SENP3         7 561 875   7 571 969   1.897  2.728  2.971                  9.999
17   Protein coding  SENP3-EIF4A1  7 563 287   7 578 715   1.070  1.153  1.215                  0.000
17   Protein coding  EIF4A1        7 572 706   7 579 005   2.965  3.551  3.624                  9.999
17   snoRNA          SNORA48       7 574 713   7 574 847   3.376  4.479  4.733                  0.000
17   miRNA           MIR6779       38 914 979  38 915 042  1.002  1.742  1.631                  9.999
17   Protein coding  KCNH4         42 156 891  42 181 278  0.000  0.000  0.000                  0.000
17   Protein coding  HCRT          42 184 060  42 185 452  0.000  0.000  0.000                  0.000
17   Protein coding  GHDC          42 188 799  42 194 532  0.000  0.000  0.000                  0.000
17   Protein coding  STAT5B        42 199 168  42 276 707  0.983  1.623  1.505                  0.000
17   Protein coding  BECN1         42 810 134  42 833 350  3.126  3.812  3.541                  0.000
17   miRNA           MIR6781       42 823 880  42 823 943  5.550  7.268  6.837                  0.000
17   Protein coding  PSME3         42 824 385  42 843 758  2.936  3.871  3.609                  9.999

Note: Genes are selected if any of the four listed statistics is significant at level 0.0001.


The number of simulation replicates is 100 000 for the analysis of type I error and is 1000 for the analysis of power.

The simulated type I error rates are presented in Table 1 for continuous traits and in Table 2 for dichotomous traits. These type I error rates are under control except that the adjusted SKAT is slightly anti-conservative. The anti-conservativeness of the adjusted SKAT has also been observed in Chen et al. (2016).

Results of the power analysis are presented in Table 3 for continuous traits and in Table 4 for dichotomous traits. All statistics have similar performance. The conditional inference method using the asymptotic distribution is sometimes slightly less powerful than the permutation method and the Chen method. Note that in some genes, the power is not increasing in sample size. This is because the causal SNPs are selected at random within each scenario.

4 Application to OHTS

Primary open angle glaucoma (POAG) is a leading cause of irreversible blindness. Although a genetic basis has been established for a substantial fraction of POAG, no risk alleles of major effect have been identified (Fingert, 2011). The etiology of POAG is likely to be complex. Since POAG is assessed through quantitative measures such as central corneal thickness (CCT), intraocular pressure (IOP), and cup-to-disc ratio, one promising research direction is to map genes underlying these quantitative measures. Indeed, large-scale GWAS have identified genes that affect CCT (Lu et al., 2010; Vitart et al., 2010; Vithana et al., 2011). We report an application of the method of Chen et al. (2016), SKAT with and without small sample correction, and the conditional inference method described in this report to a genome-wide gene-based association study of CCT averaged over both eyes using data from the Ocular Hypertension Treatment Study (OHTS).

OHTS (Gordon and Kass, 1999) is a National Eye Institute-sponsored multi-center, randomized clinical trial. Its goal is to investigate the efficacy of medical treatment in delaying or preventing the onset of POAG in individuals with elevated intraocular pressure. One thousand six hundred and thirty six individuals between 40 and 80 years old were enrolled and 1077 of them were genotyped in a subsequent study. Data for this genetic study are available at the Database of Genotypes and Phenotypes (dbGaP, Study Accession phs000240.v1.p1). It contains 1057 subjects who have both genotype data and baseline phenotype data available. The vast majority of these subjects were non-Hispanic White (752) and Black (252). Unpublished results have identified genetic heterogeneity between whites and blacks. Our focus is a genome-wide association study of CCT on the Black subjects. A histogram of CCT is presented in Figure 1. It is fairly symmetric.

There were 1 051 295 genotyped SNPs. Their HGNC gene symbols were obtained using the R/Bioconductor package biomaRt (version 2.26.1). There are 30 562 autosomal genes. Similar to Lee et al. (2012a), genes that contain fewer than 3 SNPs were excluded from further consideration. This reduces the number of genes to 23 778. Variables age and gender are used as covariates. Gene symbols at which at least one of the four statistics has a P-value less than 0.0001 are listed in Table 5. In this table, the information on biotype and base-pair position is obtained from www.ensemble.org. It is interesting that all of them except one are on chromosome 17. There are three tight gene clusters on chromosome 17 in terms of base-pair location. Cluster 1 consists of genes SENP3, SENP3-EIF4A1, EIF4A1 and SNORA48. Cluster 2 consists of KCNH4, HCRT, GHDC and STAT5B. Cluster 3 consists of BECN1, MIR6781 and PSME3. Cluster 2 is very close to cluster 3. They warrant further investigation although none of them overlaps with ZNF469, COL5A1, COL8A2, AKAP13 and AVGR8, genes for which association with CCT has been reported previously (Lu et al., 2010; Vitart et al., 2010; Vithana et al., 2011).

One reviewer suggested that an example of data analysis where the phenotype is neither continuous nor binary be given. To satisfy this suggestion, the CCT measurement is discretized according to its quartiles. The discretized CCT measurement takes 4 values: 1, 2, 3 and 4. For this modified data, neither the Chen method nor SKAT applies. However, the proposed method and the permutation method are still applicable. Findings from these two methods are summarized

Table 6. A summary of gene-based association P-values ($\times 10^5$) for the 252 Black subjects in the OHTS study

                                    Base-pair position
Chr  Biotype          Gene symbol   Start        Stop         Conditional inference  Permutation
1    Protein coding   ZBTB40        22 451 851   22 531 157   5.640                  9.999
1    Protein coding   ADIPOR1       202 940 823  202 958 572  21.08                  0.000
9    snRNA            RNU6-1039P    110 369 592  110 369 698  27.58                  9.999
9    Protein coding   CNTRL         121 074 863  121 177 610  40.45                  9.999
13   Misc RNA         RN7SL597P     41 121 876   41 122 171   28.56                  9.999
14   TR V_pseudogene  TRAV32        22 185 562   22 186 057   46.41                  9.999
17   snoRNA           SNORA48       7 574 713    7 574 847    34.56                  0.000
17   Protein coding   SENP3-EIF4A1  7 563 287    7 578 715    25.57                  9.999
17   miRNA            MIR6779       38 914 979   38 915 042   5.532                  0.000
17   Protein coding   ARL5C         39 156 894   39 167 484   34.41                  9.999
17   Protein coding   KCNH4         42 156 891   42 181 278   0.000                  0.000
17   Protein coding   HCRT          42 184 060   42 185 452   0.000                  0.000
17   Protein coding   GHDC          42 188 799   42 194 532   0.000                  0.000
17   Protein coding   STAT5B        42 199 168   42 276 707   0.400                  0.000
17   Protein coding   BECN1         42 810 134   42 833 350   25.97                  9.999
17   Protein coding   PSME3         42 824 385   42 843 758   23.85                  9.999
18   LincRNA          LINC01541     71 519 962   71 578 956   17.25                  9.999
19   Protein coding   CACNA1A       13 206 442   13 633 025   71.19                  9.999

Note: The phenotype CCT is discretized according to its quartiles. Genes are selected if any of the two listed statistics is significant at level 0.0001. Methods ‘Chen’ and ‘SKAT’ are not applicable in this situation because the phenotype is neither continuous nor binary.
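The kind of analysis summarized in this table can be sketched with Propositions 1 and 2. The example below is an illustration under stated assumptions, not the iGasso function `KAT.coin`: the data are simulated stand-ins for OHTS, the quartile scores 1 to 4 are regressed on the covariates by least squares to obtain residuals (the null model for the discretized trait is not specified in the text, so this choice is an assumption), and the tail probability of $\sum_j \lambda_j \chi_j^2(\mathrm{df} = 1)$ is approximated by plain Monte Carlo rather than by an exact numerical method. The permutation reference shuffles the residuals, which corresponds to permuting the covariates together with the response.

```python
import numpy as np

def residuals(y, Z):
    """Least-squares residuals of y on the null-model design matrix Z."""
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return y - Z @ coef

def conditional_cov(resid, G):
    """Proposition 1: Sigma = s^2 (G^t G - n^{-1} G^t 1 1^t G) = (n-1) s^2 S_G^2."""
    n = resid.size
    s2 = np.sum(resid ** 2) / (n - 1)       # sample variance of mean-zero residuals
    Gc = G - G.mean(axis=0)                 # column-centred genotype scores
    return s2 * (Gc.T @ Gc)

def mixture_tail(q, lam, rng, draws=100_000):
    """Monte Carlo estimate of Pr(sum_j lam_j * chi2_1 >= q)."""
    lam = lam[lam > 1e-10 * lam.max()]      # drop numerically zero eigenvalues
    sims = rng.chisquare(1, size=(draws, lam.size)) @ lam
    return float(np.mean(sims >= q))

def kat_conditional(y, Z, G, rng):
    """Q = T^t T with its conditional asymptotic P-value (Proposition 2)."""
    r = residuals(y, Z)
    T = G.T @ r
    Q = float(T @ T)
    lam = np.linalg.eigvalsh(conditional_cov(r, G))
    return Q, mixture_tail(Q, lam, rng)

def kat_permutation(y, Z, G, rng, n_perm=2000):
    """Permutation P-value obtained by shuffling residuals against genotypes."""
    r = residuals(y, Z)
    T = G.T @ r
    Q = float(T @ T)
    hits = 0
    for _ in range(n_perm):
        Tp = G.T @ rng.permutation(r)
        hits += int(Tp @ Tp >= Q)
    return (hits + 1) / (n_perm + 1)

rng = np.random.default_rng(1)
n, m = 120, 6
G = rng.binomial(2, 0.2, size=(n, m)).astype(float)    # hypothetical genotypes
Z = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + one covariate
cct = rng.normal(550, 35, size=n)                      # made-up trait values

# Discretize the trait by its quartiles into the scores 1, 2, 3, 4.
cuts = np.quantile(cct, [0.25, 0.5, 0.75])
quart = np.searchsorted(cuts, cct) + 1

Q, p_asy = kat_conditional(quart.astype(float), Z, G, rng)
p_perm = kat_permutation(quart.astype(float), Z, G, rng)
```

Nothing in `kat_conditional` depends on the trait being continuous or binary; only the residual vector enters, which is the sense in which the method applies to any trait type. In practice an exact method for quadratic forms in normal variables would replace the Monte Carlo step when very small P-values are needed.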


in Table 6. This list of findings is slightly longer than that from Table 5. Most of the additional findings are due to the significance of the permutation test. The proposed method generally yields less significant P-values than the permutation test. At the significance level 0.0001 used for this table, both methods are significant at gene ZBTB40 on chromosome 1 and at some other genes on chromosome 17 that are seen in Table 5.

5 Discussion

An approximation method is proposed for the conditional inference of the kernel association test (KAT). This approximation is based on a theoretical result developed by Strasser and Weber (1999) on the asymptotic permutation distribution of a random vector. By properly constructing a vector T and a mapping function h, a small sample counterpart of the KAT is introduced. The current application of this theory focuses on test statistics of the form $(T - E(T))^{t}\,\Sigma^{-1}\,(T - E(T))$, where $\Sigma$ is the covariance matrix of T, and is not applicable to the KAT. The work presented in this report fills this gap.

We note that the proposed method approximates the permutation distribution of the KAT in which the covariates are permuted together with the response y. This is to satisfy the 'permutation invariant' requirement.

A salient feature of the proposed method is that it places no limitations on the type of the response, a benefit of using conditional inference. Regardless of the type of the response, be it continuous, dichotomous, or of some other type, the conditional variance matrix of the vector T assumes the same form. In practice, one may want to use this method as a screening tool, followed by extensive permutations at genes that show signals.

An implementation of the proposed method is provided in the R package iGasso. The function name is KAT.coin.

Funding

Preparation of GAW 17 data was supported, in part, by National Institutes of Health [grant R01 MH059490] and used data from the 1000 Genomes Project (www.1000genomes.org). The GAW 17 was supported by NIH grant R01 GM031575 from the National Institute of General Medical Sciences. The author thanks Dr Jun Chen and Dr Wenan Chen for sharing the R code for their method proposed in Chen et al. (2016). The author also thanks the Associate Editor, Dr Alison Hutchins, and three anonymous reviewers for their useful comments.

Conflict of Interest: none declared.

References

Almasy,L. et al. (2011) Genetic Analysis Workshop 17 mini-exome simulation. BMC Proceedings, 5, S2.
Barnett,I.J. et al. (2013) Detecting rare variant effects using extreme phenotype sampling in sequencing association studies. Genet. Epidemiol., 37, 142–151.
Cai,T. et al. (2011) Kernel machine approach to testing the significance of multiple genetic markers for risk prediction. Biometrics, 67, 975–986.
Chen,H. et al. (2013) Sequence kernel association test for quantitative traits in family samples. Genet. Epidemiol., 37, 196–204.
Chen,J. and Li,H. (2013) Kernel methods for regression analysis of microbiome compositional data. In: Mingxiu Hu, Yi Liu, and Jianchang Lin et al. (eds) Topics in Applied Statistics. Springer, New York, pp. 191–201.
Chen,J. et al. (2012) Associating microbiome composition with environmental covariates using generalized unifrac distances. Bioinformatics, 28, 2106–2113.
Chen,J. et al. (2016) Small sample kernel association tests for human genetic and microbiome association studies. Genet. Epidemiol., 40, 5–19.
Fingert,J. (2011) Primary open-angle glaucoma genes. Eye, 25, 587–595.
Gordon,M.O. and Kass,M.A. (1999) The ocular hypertension treatment study: design and baseline description of the participants. Arch. Ophthalmol., 117, 573–583.
Hothorn,T. et al. (2006) A lego system for conditional inference. Am. Stat., 60, 257–263.
Lee,S. et al. (2012a) Optimal tests for rare variant effects in sequencing association studies. Biostatistics, 13, 762–775.
Lee,S. et al. (2012b) Optimal unified approach for rare-variant association testing with application to small-sample case-control whole-exome sequencing studies. Am. J. Hum. Genet., 91, 224–237.
Lin,X. et al. (2011) Kernel machine snp-set analysis for censored survival outcomes in genome-wide association studies. Genet. Epidemiol., 35, 620–631.
Lin,X. et al. (2013) Test for interactions between a genetic marker set and environment in generalized linear models. Biostatistics, 14, 667–681.
Lin,X. et al. (2016) Test for rare variants by environment interactions in sequencing association studies. Biometrics, 72, 156–164.
Liu,D. et al. (2007) Semiparametric regression of multidimensional genetic pathway data: Least-squares kernel machines and linear mixed models. Biometrics, 63, 1079–1088.
Liu,D. et al. (2008) Estimation and testing for the effect of a genetic pathway on a disease outcome using logistic kernel machine regression via logistic mixed models. BMC Bioinform, 9, 292.
Lu,Y. et al. (2010) Common genetic variants near the brittle cornea syndrome locus znf469 influence the blinding disease risk factor central corneal thickness. PLoS Genet., 6, e1000947.
Strasser,H. and Weber,C. (1999) On the asymptotic theory of permutation statistics. Math. Methods Stat., 2, 220–250.
Vitart,V. et al. (2010) New loci associated with central cornea thickness include col5a1, akap13 and avgr8. Hum. Mol. Genet., 19, 4304–4311.
Vithana,E.N. et al. (2011) Collagen-related genes influence the glaucoma risk factor, central corneal thickness. Hum. Mol. Genet., 20, 649–658.
Wang,T. and Elston,R.C. (2007) Improved power by use of a weighted score test for linkage disequilibrium mapping. Am. J. Hum. Genet., 80, 353–360.
Wang,X. et al. (2013) GEE-based SNP set association test for continuous and discrete traits in family-based association studies. Genet. Epidemiol., 37, 778–786.
Wu,M.C. et al. (2010) Powerful SNP-set analysis for case-control genome-wide association studies. Am. J. Hum. Genet., 86, 929–942.
Wu,M.C. et al. (2011) Rare-variant association testing for sequencing data with the sequence kernel association test. Am. J. Hum. Genet., 89, 82–93.
Zhao,N. et al. (2015) Testing in microbiome-profiling studies with mirkat, the microbiome regression-based kernel association test. Am. J. Hum. Genet., 96, 797–807.
