Modeling Multiple Responses Via Bootstrapping Margins with an Application to Genetic Association Testing

Statistics and Its Interface Volume 9 (2016) 47–56 Modeling multiple responses via bootstrapping margins with an application to genetic association testing Jiwei Zhao and Heping Zhang∗,† complex diseases, particularly mental illness and substance use [33]. Here, comorbidity refers to the occurrence of mul- The need for analysis of multiple responses arises from tiple disorders in the same subject. For example, more than many applications. In behavioral science, for example, co- half of the persons with one substance use disorder suffer morbidity is a common phenomenon where multiple disor- from another form of mental illness [3]. Thus, it is scien- ders occur in the same person. The advantage of jointly ana- tifically important to consider comorbidity in genetic stud- lyzing multiple correlated responses has been examined and ies. From the statistical perspective, [37] conducted compre- documented. Due to the difficulties of modeling multiple re- hensive simulation studies and demonstrated that analyzing sponses, nonparametric tests such as generalized Kendall’s multiple traits together generally improves the statistical Tau have been developed to assess the association between power over single-trait based tests. multiple responses and risk factors. These procedures have While it is important and well motivated to analyze mul- been applied to genomewide association studies of multi- tivariate traits jointly, it also raises theoretical and comple complex traits. Unfortunately, those nonparametric tests putational challenges. [34] proposed a generalized Kendall’s only provide the significance of the association but not the magnitude. We propose a Gaussian copula model with dis- Tau to test the association of any hybrid of dichotomous, crete margins for modeling multivariate binary responses. ordinal or quantitative traits with a genetic marker. Later, This model separates marginal effects from between-trait [17]and[36] extended this method to consider the adjust- correlations. We use a bootstrapping margins approach to ments of covariates. However, the nonparametric tests can constructing Wald’s statistic for the association test. Al- only report the significance of the association but not the though our derivation is based on the fully parametric Gaus- magnitude. Therefore, it is desirable, and the goal of this sian copula framework for simplicity, the underlying as- article, to establish a parametric framework for genetic assumptions to apply our method can be weakened. The boot- sociation studies of multiple traits. strapping margins approach only requires the correct speci- We propose Gaussian copula model [23, 28] with discrete fication of the model margins. Our simulation and real data margins to analyze multivariate binary traits. Our model analysis demonstrate that our proposed method not only in- has several advantages. First, this is a rich class of para- creases power over some existing association tests, but also metric models that includes some commonly-used models provides further insight into genetic association studies of such as the multivariate probit model [2, 1]. Second, our multivariate traits. model makes much weaker assumptions than the multivariate probit model does, and hence is more broadly applica- AMS 2000 subject classifications: Primary 62P10; sec- ble. Third, under the Gaussian copula framework, the model ondary 62G09. components that characterize the marginal effects and the Keywords and phrases: Multiple traits, Marginal ap- correlations among the traits are readily separated. Before proach, Bootstrap, Gaussian copula. we can deliver these useful features, we need to resolve the computational challenge. To this end, we propose to fit this 1. INTRODUCTION model using a two-step semi-parametric approach. In the first step, we compute the maximum marginal likelihood es- The advent of high-throughput genotyping technology timator (MMLE) of association coefficients, say, β˜.Inthe has led to discoveries of numerous disease genes, commonly second step, we estimate the variance of β˜ using the boot- through genomewide association studies (GWAS). Most of strapping technique [11]. This bootstrapping margins ap- the GWAS data are analyzed using a single disease trait at a proach does not assume independence of the traits. Since it time. However, comorbidity often occurs in genetic studies of only requires the correct specification of model margins, this approach is robust to the misspecification of the correlation ∗Corresponding author. †The work is supported by the grant R01 DA016750-09 from the Na- information among the traits, and it can be extended to any tional Institute on Drug Abuse. case as long as the model margins are correctly specified. Although there is a rich literature on modeling multivari- where ui ∈ [0, 1], Φ is the cumulative distribution function ate discrete data through copula structure, including [18], (c.d.f.) of a standard normal distribution, and Φd represents [23], [30], and the references therein, our work has distinc- the c.d.f. of d-dimensional normal vector with mean zero tive features in model feasibility and computational cost, es- and covariance matrix Γ. For instance, [28] used the Gaus- pecially in genetic association testings. As discussed above, sian copula to construct a class of multivariate dispersion our modeling strategy only requires the correct specification models for d-dimensional multivariate data (y1,...,yd)with of the margins, and is more flexible to use. Also, as shown in marginal distributions F1,...,Fd.Thatis, Sections 4 and 5, from the computational perspective, our approach is much faster than multivariate probit model, es- CΦ(F1(y1),...,Fd(yd)|Γ) (1) pecially when the dimension of multivariate discrete data −1 −1 =Φd{Φ (F1(y1)),...,Φ (Fd(yd))|Γ}. is large. This feature is particularly appealing in analyzing high-throughput genotyping data. Motivated by genetic case-control studies of complex dis- Copula has become a useful tool in genetic studies. eases, here we concentrate on the modeling of multiple bi- For example, [19] considered a Gaussian copula variance- nary traits W =(W (1),...,W(L))T . By taking Radon- components method for linkage analysis with nonnormal Nikodym derivative for CΦ(F1(y1),...,FL(yL)|Γ) in (1) quantitative traits. [15] introduced a Gaussian copula based with respect to the counting measure, we can show that approach to modeling the dependence between disease sta- (1) (L) tus and secondary phenotypes in case-control association P (W = w1,...,W = wL) studies. We exploit copula-based methods for further use in (2) 1 1 genetic studies. In GWAS, the need of analyzing millions of j1+···+jL = ··· (−1) CΦ(u1j ,...,uLj |Γ), single nucleotide polymorphisms (SNPs) requires the algo- 1 L j1=0 jL=0 rithm for each single SNP would be extremely fast, which is the main motivation of our work. Our proposal of Gaussian where wl = 0 or 1, ul0 = Fl(wl), ul1 = Fl(wl − 1), and Fl is copula framework guarantees the MMLE is consistent and the c.d.f. for W (l), i.e., asymptotically normal [28]. To account for the ignorance of ⎧ potential correlations among multiple traits, we further pro- ⎨0 s<0 − ≤ pose Bootstrap method correcting for the variance estima- Fl(s)=⎩1 pl 0 s<1 tion. Although our method of using MMLE seems simple for 1 s ≥ 1, Gaussian copula model itself, it works very fast and shows (l) substantial power gain over some nonparametric tests in our where pl = P (W = 1). numerical studies. More importantly, through the analysis This model setting includes many commonly-used mod- of a real data set on comorbidity, our proposed method iden- els. For example, the bivariate probit model has the follow- tifies some significant SNP biomarkers reported in previous ing probability mass function: related studies, illustrating the usefulness of our proposed (1) (2) method. ⎧P (W = w1,W = w2) This paper is organized as follows. We establish our model ⎪ −1 − −1 − | ⎪Φ2(Φ (1 p1), Φ (1 p2) Γ) in Section 2. In Section 3, we describe our two-step semi- ⎪ ⎪ if w1 =0,w2 =0, parametric estimation method for association testing. We ⎪ −1 −1 ⎪1 − p1 − Φ2(Φ (1 − p1), Φ (1 − p2)|Γ) present our simulation studies in Section 4 and the SAGE ⎨ if w1 =0,w2 =1, data analysis in Section 5. We compare our analysis results = ⎪ − − −1 − −1 − | ⎪1 p2 Φ2(Φ (1 p1), Φ (1 p2) Γ) with those based on multivariate probit model and another ⎪ ⎪ if w1 =1,w2 =0, existing nonparametric method [36]. We also provide the ⎪ −1 − −1 − | − ⎩⎪p1 + p2 +Φ2(Φ (1 p1), Φ (1 p2) Γ) 1 estimates and their standard errors for genetic associations, if w =1,w =1. which reveal further scientific details for GWAS. The article 1 2 ends with a discussion in Section 6. Let G denote a variable of interest (e.g., a genetic marker) and X be a p-vector of covariates. To model the marginal 2. THE GAUSSIAN COPULA MODEL effects of G and X on W (l), we consider a generalized linear Copula, a multivariate distribution function with uni- model [GLM, 22], i.e., for each l, formly distributed margins, is a useful tool for modeling T correlated variables. For the general introduction and ap- (3) g(pl)=ηl = αl + βlG + γl X, plication of copula, we refer the readers to [23]. A common { − choice is Gaussian copula, which is constructed from a mul- where g is the link function. The choices include log t/(1 } −1 {− − } tivariate normal distribution using Sklar’s theorem. Specif- t) (logit), Φ (t) (probit), and log log(1 t) (comple- ically, the d-variate Gaussian copula is mentary log-log). Note that different choices of link func- tions, i.e., models for margins, will not affect the Gaussian −1 −1 CΦ(u1,...,ud|Γ) = Φd{Φ (u1),...,Φ (ud)|Γ}, copula correlation structure. Our model is flexible in that 48 J. Zhao and H. Zhang every single W (l) can have its own distinct choice of link marginal likelihood; second, IFM uses Jackknife for variance T T function.

Modeling Multiple Responses Via Bootstrapping Margins with an Application to Genetic Association Testing

Linkage and Association Studies in African

The Tumor Suppressor CIC Directly Regulates MAPK Pathway Genes Via Histone Deacetylation Simon Weissmann1,2, Paul A

Genes Involved and Proteins

Targeting the DEK Oncogene in Head and Neck Squamous Cell Carcinoma: Functional and Transcriptional Consequences

Endocrine System Local Gene Expression

Differential DNA Methylation and Changing Cell-Type Proportions As Fibrotic Stage Progresses in NAFLD

An RNA-Seq-Based Resource for Pain and Sensory Neuroscience Research

Clinical Factors That Influence the Cellular Responses of Saphenous

Identification of Novel Regulatory Genes in Acetaminophen

An Integrative Genomic Analysis of the Longshanks Selection Experiment for Longer Limbs in Mice

Derived Xenografts Highlights the Role of REST in Neuroendocrine Differentiation of Castration- Resistant Prostate Cancer Amilcar Flores-Morales1,2,3, Tobias B

A Chronological Atlas of Natural Selection in the Human Genome During the Past Half-Million Years