Viewers for Constructive Criticism

Rauschenberger et al. BMC Bioinformatics (2016) 17:118 DOI 10.1186/s12859-016-0961-5 METHODOLOGY ARTICLE Open Access Testing for association between RNA-Seq and high-dimensional data Armin Rauschenberger1, Marianne A. Jonker1, Mark A. van de Wiel1,2 and Renée X. Menezes1* Abstract Background: Testing for association between RNA-Seq and other genomic data is challenging due to high variability of the former and high dimensionality of the latter. Results: Using the negative binomial distribution and a random-effects model, we develop an omnibus test that overcomes both difficulties. It may be conceptualised as a test of overall significance in regression analysis, where the response variable is overdispersed and the number of explanatory variables exceeds the sample size. Conclusions: The proposed test can detect genetic and epigenetic alterations that affect gene expression. It can examine complex regulatory mechanisms of gene expression. The R package globalSeq is available from Bioconductor. Keywords: High-dimensional, Overdispersion, Negative binomial, Global test, Integration, RNA-Seq Background for high dimensionality, reduces the multiple testing bur- Genetic and epigenetic factors contribute to the regula- den, and successfully detects small effects that encompass tion of gene expression. A better understanding of these many covariates. Due to its desirable properties, the global regulatory mechanisms is an important step in the fight test has become a widely used tool in genomics (e.g. [2–4]). against cancer. Of interest are genetic alterations such Currently, gene expression microarrays are being sup- as single nucleotide polymorphisms (SNPs), copy-number planted by high-throughput sequencing. The negative variations (CNVs) and loss of heterozygosity (LOH), as binomial distribution seems to be a sensible choice for well as epigenetic alterations such as DNA methylation, modelling RNA sequencing data [5, 6]. One of its parame- microRNA expression levels and histone modifications. ters describes the dispersion of the variable. If this param- From a statistical perspective, it makes sense to repre- eter is unknown, the negative binomial distribution is not sent the expression of one gene as a response variable that in the exponential family. As the global test from Goeman changes when some covariates are altered. As a starting et al. [1] is limited in its current form to the exponential point, we assume that all covariates come from a single family of distributions, a new test is needed for RNA-Seq genetic or epigenetic molecular profile. Typically, more data. We will provide here such a test. covariates are of interest than there are samples. After proposing a global test for the negative binomial A plethora of methods for the analysis of gene expres- setting, we perform a simulation study, and analyse two sion and covariates has emerged in the last years. Many of publicly available datasets. The first application concen- these methods test each covariate individually, and subse- trates on method validation, overdispersion, and individ- quently correct for multiple testing or rank the covariates ual contributions. The second application concentrates on by significance. An alternative approach is the global test robustness against multicollinearity, the method of con- from Goeman et al. [1]. The global test does not test the trol variables, and the simultaneous analysis of multiple individual but the joint significance of covariates. It allows molecular profiles. Although we focus on RNA-Seq gene expression data, the test developed here is applicable whenever associa- *Correspondence: [email protected] tions between a count variable and large sets of quantita- 1 Department of Epidemiology and Biostatistics, VU University Medical Center, tive or binary variables are of interest. In essence, it can 1007 MB, Amsterdam, The Netherlands Full list of author information is available at the end of the article be applied to any other type of sequencing data, such as © 2016 Rauschenberger et al. Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Rauschenberger et al. BMC Bioinformatics (2016) 17:118 Page 2 of 8 | = μ | = μ + φμ2 ChIP-Seq (chromatin immunoprecipitation), microRNA- that E[ yi ri] i and Var[ yi ri] i i . Its density Seq or meth-Seq (methylation). function is given by 1 1 yi Methods yi + φ φ μ ( ) = 1 i The random-effects model f yi 1 . 1 ( + ) 1 + μiφ + μ The human genome contains several thousand protein- φ yi 1 φ i coding genes. In the following, only one gene is con- Various link functions come into consideration for the sidered at a time. Accordingly, the expression of one negative binomial model. We favour the logarithmic link gene across all samples is our response variable y = in order to relate the negative binomial model directly (y , ..., y )T . If we were interested whether a given sub- 1 n to the Poisson model (see below). As library sizes can set of SNPs affected gene expression, these SNPs would be be unequal, we include the offset log(m /m),wherem our p covariates. The n×p covariate matrix X is potentially i i denotes the library sizes, and m their geometric mean. high-dimensional (p n). Thus the mean function becomes We represent the relationship between the response and mi mi the covariates using the generalised linear model frame- μi = exp α + ri + log = exp(α + ri).(2) work from McCullagh and Nelder [7]: m m ⎛ ⎞ When τ 2 is close to zero, the score test is the most pow- p 2 − erful test of the null hypothesis H0 : τ = 0against E[ y ] = h 1 ⎝α + X β ⎠ , i ij j the alternative hypothesis H : τ 2 > 0 [8]. Here the = 0 j 1 score function is the first derivative of the logarithmic marginal likelihood with respect to τ 2. Intuitively, if the where h−1 is an inverse link function, α is the unknown marginal likelihood reacts sensitively to changes in τ 2 intercept, X is the entry in the ith row and jth column of X, ij close to 0, there is evidence against τ 2 = 0. Using results and β , ..., β are the unknown regression coefficients. 1 p from le Cessie and van Houwelingen [9], we show in the This model holds for all samples i (i = 1, ..., n). Additional file 1 how to calculate the score function. This We are interested in testing the joint significance of all function contains the unknown parameters α and φ,but regression coefficients. This is challenging because the they can be estimated by maximum likelihood. Replacing regression coefficients cannot be estimated by classical the unknown parameters by their estimates leads to the regression methods if there are more covariates than sam- test statistic ples. Goeman et al. [1] took a novel approach for testing β = ...= β = β = ∪...∪β = n n H0 : 1 p 0againstH1 : 1 0 p 0. Rik (yi −ˆμi)(yk −ˆμk) unb = ThedecisivestepfromGoemanetal.[1]wastoassume 2 (1 + φˆμˆ )(1 + φˆμˆ ) β = (β ... β )T i=1 k=1 i k 1, , p to be random, with the expected value (3) n ˆ E[ β] = 0 and the variance-covariance matrix Var[ β] = Rii (μˆ i + yiφμˆ i) 2 2 − , τ I,whereI is the p × p identity matrix and τ ≥ 0. Then ˆ 2 = 2 (1 + φμˆ i) a random-effects model is obtained: i 1 where R is the entry in the ith row and jth column of the p ij −1 × = ( / ) T μˆ = ( / ) (α)ˆ E yi|ri = h (α + ri), ri = Xijβj.(1)n n matrix R 1 p XX ,and 0,i mi m exp j=1 is the estimated mean under the null hypothesis. For sim- plicity we always write μˆ i instead of μˆ 0,i. In the Additional This random-effects model allows to rephrase the null file 1 the test statistic is rewritten in matrix notation. and the alternative hypotheses. Defining the random vec- Statistical hypothesis testing depends on the null dis- T tor r = (r1, ..., rn) , it can be deduced that E[ r] = 0 tribution of the test statistic unb, which is unknown. We and Var[ r] = τ 2XXT . Now the null hypothesis of no asso- will obtain p-values by permuting the response y = T T ciation between the covariate group and the response is (y1, ..., yn) together with the mean μˆ = (μˆ 1, ..., μˆ n) . 2 given by H0 : τ = 0. To construct a score test against the Since this is a one-sided test [10], if the observed 2 one-sided alternative hypothesis H1 : τ > 0, we need to test statistic is larger than most of the test statistics assume a distribution for yi|ri. obtained by permutation, there is evidence against the null hypothesis. The testing procedure As we are not using a parametric form for the null We assume the negative binomial distribution yi|ri ∼ distribution of the test statistic, no adjustments for the NB(μi, φ), where the mean parameter μi depends on estimation of α and φ are necessary. Furthermore, maxi- the sample, but the dispersion parameter φ does not. mum likelihood estimation does not depend on the order T We parametrise the negative binomial distribution such of the elements in y = (y1, ..., yn) . Because neither αˆ Rauschenberger et al. BMC Bioinformatics (2016) 17:118 Page 3 of 8 nor φˆ vary under permutation, computational efficiency of covariates is the average of the individual test statis- can be achieved with these parameters as given.

Load more