Identifying and Exploiting Gene-Pathway Interactions from RNA-Seq Data for Binary Phenotype
Fang Shao¹, Yaqi Wang², Yang Zhao¹ and Sheng Yang¹*

Shao et al. BMC Genetics (2019) 20:36
https://doi.org/10.1186/s12863-019-0739-7
METHODOLOGY ARTICLE, Open Access

Abstract

Background: RNA sequencing (RNA-seq) technology has identified multiple differentially expressed (DE) genes associated with complex diseases; however, these genes explain only a modest part of the variance. The omnigenic model assumes that disease may be driven by genes with indirect relevance to disease and be propagated by functional pathways. Here, we focus on identifying the interactions between external genes and functional pathways, referred to as gene-pathway interactions (GPIs). Specifically, relying on the relationship between the garrote kernel machine (GKM) and the variance component test, and using permutations to obtain the empirical distributions of score statistics, we propose an efficient analysis procedure, Permutation based gEne-pAthway interaction identification in binary phenotype (PEA).

Results: Various simulations show that PEA has well-calibrated type I error rates and higher power than the traditional likelihood ratio test (LRT). In addition, we apply gene set enrichment algorithms and PEA to identify GPIs from a pan-cancer dataset (GSE68086). The identified GPIs and genes may further illuminate the potential etiology of cancers, and some of the external genes and significant pathways are consistent with previous studies.

Conclusions: PEA is an efficient tool for identifying GPIs from RNA-seq data. It can be further extended to identify the interactions between one variable and one functional set of other omics data for binary phenotypes.

Keywords: Gene-pathway interaction, Garrote kernel machine, Variance component testing, Binary phenotype

* Correspondence: [email protected]
¹ Department of Biostatistics, School of Public Health, Nanjing Medical University, 101 Longmian Avenue, Nanjing, Jiangsu, People's Republic of China
Full list of author information is available at the end of the article

© The Author(s). 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Background

RNA sequencing (RNA-seq) technology has identified large numbers of significant genes and provided evidence for the diagnosis and treatment of complex diseases, especially cancers [1, 2]. Most existing statistical methods focus on identifying differentially expressed (DE) genes and on estimating heritability from RNA-seq count data [3–7]. However, because they assume that only a minority of genes are associated with phenotypes, these models inevitably lose the regulation information from DE genes and therefore struggle to elucidate etiology and mechanism [8–10].

Systematic characterization of the biological function of genes is an important step in investigating the molecular mechanisms underlying the identified disease associations. Enrichment analysis methods are based on different ideas: some include only genes participating in pathways, and some consider the regulation between genes in networks [11–14]. Furthermore, the omnigenic model assumes that disease may be driven by genes with indirect relevance to disease and be propagated by functional pathways. These external genes may cause disease by distantly regulating significant pathways, and they may explain most of the heritability [15]. For transcriptome data, a common view is that core gene effects can be understood through their interactions within underlying pathways or with any expressed gene [16, 17]. As a result, identifying the interactions between external genes and significant pathways (GPIs) holds promise for further understanding etiology and improving prediction ability [18].

Considering the potential importance of interactions in defining the genetic architecture of complex traits and the inefficiency of traditional methods in high-dimensional data, emerging statistical methods have been implemented to identify interactions with low computational cost [19]. Different algorithms use different ideas, such as set tests [20, 21] and searching algorithms (exhaustive searching and prioritization based on the gene set) [22, 23]. Owing to a lack of confidence and biological priority, together with high-dimensional search spaces, these methods may lose power.

Moreover, the omnibus test is widely used to identify sets at both single and multiple levels, even across different datasets [24–29]. The joint test is more efficient and scalable because of its low computational cost, its reduced degrees of freedom and the fact that it requires no estimation of variance components. On the other hand, kernel-based methods have been proposed to estimate the association of genetic variants with complex traits [20, 28, 30–32]. A general kernel machine method can account for complex nonlinear gene and interaction effects. Although the application of kernel-based methods in genome-wide association studies (GWASs) has been reported in the literature, our method applies the idea to identify GPIs in transcriptome data [32, 33].

Here, noting the similarity between the mixed model and the kernel function, we develop a statistical test to identify GPIs for binary phenotypes. The model addresses two challenges. First, it tests GPIs in the binary phenotype framework. To do so, we first apply enrichment analyses of RNA-seq data, including gene set enrichment analysis (GSEA), DAVID and MinePath [11–13]. Second, the model is quite similar to the garrote kernel machine (GKM), but the parameter estimation procedure is quite different [20]. We refer to the statistical method as Permutation based gEne-pAthway interaction identification in binary phenotypes (PEA). We provide a method overview of PEA, including parameter estimation and hypothesis testing. Extensive simulations show that, compared to the traditional likelihood ratio test (LRT) for generalized linear models, PEA has higher areas under the curve (AUCs) with controllable type I error rates. In addition, the parameter estimation is more accurate. We apply our method to analyze platelet RNA-seq data from a case-control study (GSE68086) [1]. PEA can also be applied to analyze other interactions in binary phenotypes, such as pathway-environment interactions. PEA is implemented as an Rcpp function, freely available at https://github.com/biostat0903/RNAseq-Data-Analysis.

Methods

Model

We model GPIs in binary phenotypes as follows:

$$
\operatorname{logit}\left(P(y = 1)\right) = X\beta_C + \sum_{j=1}^{P} P_j \beta_{P_j} + G\beta_G + \sum_{j=1}^{P} \left(P_j \cdot G\right) \xi_{G_j} \tag{1}
$$

where y denotes the binary phenotypes y_i (i = 1, …, N); X, an N × m matrix, contains the covariates; P, an N × P matrix (P also denotes the number of pathway genes), contains the expression levels of the genes in a significant pathway, which can be obtained from gene enrichment analyses; G is the expression level of the external gene; and P_j ∙ G denotes the GPIs (Fig. 1). We also let γ = P(y = 1). β_C, β_P, β_G and ξ_G are the coefficients of the covariates, functional pathway genes, external gene and GPIs, respectively.

The LRT based on chi-squared statistics is a traditional method for testing interactions in generalized linear models. The chi-squared statistic is −2 times the logarithm of the ratio of the likelihood of the model without interactions to that of the full model. Unfortunately, because the variables are high-dimensional and have complicated relationships, traditional methods tend to be inefficient.

Garrote kernel machine (GKM)

A kernel function is suitable for capturing complicated relationships, both linear and nonlinear, between genes and phenotypes. Here, we extend the linear GKM to the binary phenotype to identify GPIs, although many other kernel functions can be selected. The kernel function is as follows:

$$
K(Z_k, Z_l) = \left(1 + \delta G_k G_l\right)\left(1 + P_k^{T} P_l\right) \tag{2}
$$

where Z_k = (G_k, P_k) and K(Z_k, Z_l) is the entry of the kernel matrix for the kth and lth individuals. We then test for the effect of the GPIs by considering the null hypothesis H0: δ = 0.

Parameter estimation

With the kernel function, Eq. (1) can be rewritten as a semi-parametric model as follows:

$$
\operatorname{logit}\left(P(y = 1)\right) = X\beta_C + h \tag{3}
$$

where h = (h_1, h_2, …, h_N)^T is an unknown centered smooth function vector. h can be parameterized for different forms of GPIs, such as the Gaussian kernel and the dth polynomial kernel. Because of the similarity between the semi-parametric model and the mixed effect model, h can be treated as random effects following a multivariate normal distribution N(0, τK(δ)). The relationship between the unknown function and the kernel function is as follows:

$$
h_i = h(Z_i) = \sum_{l=1}^{N} \alpha_l K(Z_i, Z_l) = \kappa_i^{T} \alpha \tag{4}
$$

$$
\begin{bmatrix} X^{T} D^{(t)} X & X^{T} D^{(t)} K \\ D^{(t)} X & \tau^{-1} I + D^{(t)} K \end{bmatrix}
\begin{bmatrix} \beta_C^{(t+1)} \\ \alpha^{(t+1)} \end{bmatrix}
= \begin{bmatrix} X^{T} D^{(t)} \cdots \\ \cdots \end{bmatrix}
$$

Fig. 1 The illustration of GPI in binary phenotypes. The red line represents the effect of the significant pathway. The three black dotted lines represent three situations: positive interaction (L1), no interaction (L0) and negative interaction (L2)
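To make the garrote kernel of Eq. (2) and the kernel representation of Eq. (4) concrete, the following is a minimal Python sketch. It is an illustration only and is not the authors' Rcpp implementation; the function names `garrote_kernel` and `kernel_effects`, the toy expression values and the coefficient vector `alpha` are all hypothetical. The sketch builds the kernel matrix for a small sample and shows that setting δ = 0 (the null hypothesis) removes the external gene and leaves only the pathway kernel.

```python
# Illustrative sketch only: names and toy data are hypothetical,
# not taken from the PEA software.

def garrote_kernel(G, P, delta):
    """Kernel matrix with entries (1 + delta*G_k*G_l) * (1 + P_k . P_l), as in Eq. (2).

    G: list of N external-gene expression values.
    P: list of N lists, each holding the pathway-gene expression values.
    delta: garrote parameter; delta = 0 corresponds to no GPI.
    """
    n = len(G)
    K = [[0.0] * n for _ in range(n)]
    for k in range(n):
        for l in range(n):
            pathway_part = 1.0 + sum(a * b for a, b in zip(P[k], P[l]))
            gene_part = 1.0 + delta * G[k] * G[l]
            K[k][l] = gene_part * pathway_part
    return K


def kernel_effects(K, alpha):
    """Eq. (4): h_i = sum over l of alpha_l * K(Z_i, Z_l)."""
    return [sum(a * k_il for a, k_il in zip(alpha, row)) for row in K]


# Toy data: 3 individuals, one external gene, a pathway of 2 genes.
G = [0.5, -1.0, 2.0]
P = [[1.0, 0.0], [0.5, 1.0], [-1.0, 2.0]]

K_null = garrote_kernel(G, P, delta=0.0)  # pathway-only kernel under H0
K_alt = garrote_kernel(G, P, delta=1.0)   # kernel including the interaction
h = kernel_effects(K_alt, alpha=[0.1, -0.2, 0.3])
```

Under δ = 0 the matrix `K_null` depends on the pathway genes alone, which is why testing H0: δ = 0 amounts to testing whether the external gene modifies the pathway effect.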
