Functional Regression Method for Whole Genome Eqtl Epistasis Analysis with Sequencing Data Kelin Xu1,2, Li Jin1 and Momiao Xiong1,3,4*
Total Page:16
File Type:pdf, Size:1020Kb
Xu et al. BMC Genomics (2017) 18:385 DOI 10.1186/s12864-017-3777-4 METHODOLOGYARTICLE Open Access Functional regression method for whole genome eQTL epistasis analysis with sequencing data Kelin Xu1,2, Li Jin1 and Momiao Xiong1,3,4* Abstract Background: Epistasis plays an essential rule in understanding the regulation mechanisms and is an essential component of the genetic architecture of the gene expressions. However, interaction analysis of gene expressions remains fundamentally unexplored due to great computational challenges and data availability. Due to variation in splicing, transcription start sites, polyadenylation sites, post-transcriptional RNA editing across the entire gene, and transcription rates of the cells, RNA-seq measurements generate large expression variability and collectively create the observed position level read count curves. A single number for measuring gene expression which is widely used for microarray measured gene expression analysis is highly unlikely to sufficiently account for large expression variation across the gene. Simultaneously analyzing epistatic architecture using the RNA-seq and whole genome sequencing (WGS) data poses enormous challenges. Methods: We develop a nonlinear functional regression model (FRGM) with functional responses where the position-level read counts within a gene are taken as a function of genomic position, and functional predictors where genotype profiles are viewed as a function of genomic position, for epistasis analysis with RNA-seq data. Instead of testing the interaction of all possible pair-wises SNPs, the FRGM takes a gene as a basic unit for epistasis analysis, which tests for the interaction of all possible pairs of genes and use all the information that can be accessed to collectively test interaction between all possible pairs of SNPs within two genome regions. Results: By large-scale simulations, we demonstrate that the proposed FRGM for epistasis analysis can achieve the correct type 1 error and has higher power to detect the interactions between genes than the existing methods. The proposed methods are applied to the RNA-seq and WGS data from the 1000 Genome Project. The numbers of pairs of significantly interacting genes after Bonferroni correction identified using FRGM, RPKM and DESeq were 16,2361, 260 and 51, respectively, from the 350 European samples. Conclusions: The proposed FRGM for epistasis analysis of RNA-seq can capture isoform and position-level information and will have a broad application. Both simulations and real data analysis highlight the potential for the FRGM to be a good choice of the epistatic analysis with sequencing data. Keywords: Gene-gene interaction, Multivariate functional regression, Functional regression models, RNA-seq, Next-generation sequencing, Association studies, eQTL * Correspondence: [email protected] 1State Key Laboratory of Genetic Engineering and Ministry of Education Key Laboratory of Contemporary Anthropology, Collaborative Innovation Center for Genetics and Development, School of Life Sciences, Fudan University, Shanghai 200438, China 3Department of Biostatistics, Human Genetics Center, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA Full list of author information is available at the end of the article © The Author(s). 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Xu et al. BMC Genomics (2017) 18:385 Page 2 of 19 Background across the entire gene, and transcription rates of the Epistatic effect in gene expression, defined as the depart- various cells. The summary statistic-based epistasis ure from additive effects in a linear model of eQTL ana- analysis of the RNA-seq fails to utilize all transcripts lysis [1], plays an essential role in understanding the information. gene regulation and disease mechanisms [2–4]. The critical barrier in epistasis analysis is to deal with One polymorphism’s effect on expression of a gene de- rare variants. The traditional statistical methods for epis- pends on other polymorphisms present in the genome tasis analysis were originally designed for testing the [5]. Epistasis analysis of gene expressions will substan- interaction between common variants and are difficult tially improve the understanding of the genetic architec- to apply to rare variants due to high type 1 error rates, ture of gene expression and facilitate mechanistic severe multiple testing, prohibitive computational time insights into complex traits [6, 7]. However, eQTL epis- and low power [19]. Whole genome RNA-req eQTL tasis analysis remains fundamentally unexplored due to analysis poses a significant challenge. To meet the chal- large computational challenges and data availability [6]. lenge, we developed a nonlinear functional regression Gene expression is an intermediate phenotype that brid- model (FRGM) with functional responses where the ges the genotype and higher level phenotypes such as dis- position-level read counts within a gene are taken as a eases [8, 9]. Studying the effect of epistasis on the gene function of genomic position, and functional predictors expression could provide a better understanding of the where genotype profiles are viewed as a function of gen- genetic architecture and gene regulation. The importance omic position, for epistasis analysis with RNA-seq data, of detecting the epistatic effect on the gene expression has which allows simultaneous capture of all space informa- been emphasized in many recent studies [10, 11]. How- tion hidden in the RNA-seq data and genetic variation ever, the corresponding methods are relatively rare. The data, but with substantially reduced dimensions. Instead widely used statistical methods for identifying eQTL epis- of testing the interaction of all possible pair-wises SNPs, tasis are designed for microarray expression data where an theFRGMtakesageneasabasicunitforepistasis overall expression of the gene is taken as a quantitative analysis, which tests for the interaction of all possible trait and all methods for QTL epistasis analysis can be pairs of genes and uses all the information that can be used for eQTL epistasis analysis [10, 12]. accessed to collectively test interaction between all Application of next generation sequencing (NGS) possible pairs of SNPs within two genome regions (or techniques to the genetic analysis of gene expression in- genes). The proposed FRGM for epistasis analysis of volves (1) generating millions of short reads of mRNA the RNA-seq can capture isoform and position-level or cDNA which are mapped to the genome and lead to information and will have a broad application. a sequence of read counts at the hundreds of millions of The FRGM for epistasis analysis has several remark- genomic positions [13–16] and (2) generating millions able features. First, the FRGM accounts for the change or 10 millions of genetic variants. RNA-seq counts vary in the position-level read counts, while preserving the greatly across the gene [17]. Count variations can be due intrinsic structure and all the positional-level genetic in- to experimental bias such as fragmentation methods, formation. Second, the FRGM simultaneously utilizes reverse-transcription [16], sequence-specific bias and se- both correlation information among the RNA-seq at dif- quencing technology variation [18]. However, count vari- ferent genomic positions and among all variants in a ation can also be caused by variation in splicing, genomic region. Third, the multicollinearity problems in transcription start sites, polyadenylation sites, post- the FRGM which may be presented in both the RNA- transcriptional RNA editing across the entire gene, and seq and genetic variation are alleviated. Fourth, the transcription rates of the cells [13–16, 18]. RNA-seq FRGM expands both position-level read count function data can be viewed as a function or a curve of the gen- and genotype function in terms of orthogonal eigenfunc- omic position and hence can be taken as a function- tion, which leads to substantial dimension reduction in valued trait. both RNA-seq data and SNP data. The FRGM for epista- Although RNA-seq data are measured as a function, sis analysis of function-valued traits which capture key the widely used methods for genetic studies of the RNA- information in the data is expected to open a new route seq in humans are the same as that for the traditional for epistasis analysis of RNA-seq data. single-valued quantitative traits where a single number To evaluate its performance for epistasis analysis of for overall expression of the gene is taken as a quantita- the RNA-seq, we use large scale simulations to calculate tive trait. These methods use summary statistics to thetypeIerrorratesandevaluatethepowerofthepro- measure or represent gene expressions assayed by NGS posed FRGM for detecting epistasis. To further evaluate its techniques and cannot capture the expression variations performance, the FRGM for epistasis analysis is applied to across the gene due to splicing, transcription