<<

Send Orders for Reprints [email protected]

34 The Open Bioinformatics Journal, 2013, 7, (Suppl 1: M3) 34-40 Open Access Statistical Methods for Overdispersion in mRNA-Seq Hui Zhang*, Stanley B. Pounds and Li Tang

Department of Biostatistics, St. Jude Children's Research Hospital, Memphis, TN 38105, USA

Abstract: Recent developments in Next-Generation Sequencing (NGS) technologies have opened doors for ultra high throughput sequencing mRNA (mRNA-seq) of the whole transcriptome. mRNA-seq has enabled researchers to comprehensively search for underlying biological determinants of diseases and ultimately discover novel preventive and therapeutic solutions. Unfortunately, given the complexity of mRNA-seq data, data generation has outgrown current analytical capacity, hindering the pace of research in this area. Thus, there is an urgent need to develop novel statistical methodology that addresses problems related to mRNA-seq data. This review addresses the common challenge of the presence of overdispersion in mRNA count data. We review current methods for modeling overdispersion, such as negative binomial, quasi-likelihood Poisson method, and the two-stage adaptive method; introduce related statistical theories; and discuss their applications to mRNA-seq count data. Keywords: Count response, mRNA-seq, negative binomial theory, over-dispersion, Poisson, quasi-likelihood.

1. INTRODUCTION theoretically unbounded. Thus, methods for binomial responses do not apply. Log-linear models built upon DNA sequencing is a powerful technique that allows Poisson distributions are most popular for analyzing the scientists to obtain key genetic information from biological number of counts. An important mathematical feature of the system of interest. However, traditional sequencing methods is that the equals its . such as capillary electrophoresis based Sanger sequencing However, in mRNA-seq data the variance of the count is have inherent limitations with regard to throughput, often much larger than its mean, and this property is called scalability, speed, resolution and cost. Next Generation overdispersion. When overdispersion occurs, the log-linear Sequencing (NGS) has been developed to address such model does not hold, which results in biased and misleading limitations, and this technology has triggered numerous conclusions. Therefore, alternative analytical strategies are ground-breaking discoveries and evoked a revolution in needed to adequately address over dispersion of mRNA-seq biomedical research [1]. NGS technologies enable data. sequencing the entire genome and transcriptome in a rapid and cost-effective manner, and can be applied to a wide The regular log-linear regression and related goodness- range of biomedical applications such as variant discovery, of-fit tests will first be introduced in Section 2, followed by profiling of histone modifications, identification of overdispersion and its detection in Section 3. In Section 4, transcription factor binding sites, resequencing, and we will review extended Poisson models and the newly chararcterization of the transcriptome. The application of developed two-stage adaptive strategy for overdispersion in NGS to the transcriptome, also known as mRNA-seq, allows mRNA-seq count reads. Finally, future directions will be direct sequencing of a population of transcripts and these discussed. read counts linearly approximate target transcript abundance 2. LOG-LINEAR REGRESSION MODEL FOR COUNT [2]. Therefore, mRNA-seq data can provide rich information DATA on alterative splicing, allele-specific expression, unannotated exons, and differential expression. To assess differential continuous expression measures, such as microarray data, among various groups or treatments In mRNA-seq data analysis, sequenced fragments are and other demographic factors, an ordinary linear regression aligned with a reference sequence and the number of can be written as follows: fragments (typically called counts) mapped to regions of interest is recorded, usually as count data [2]. However, Yijg = xi ! jg = ! jg0 + ! jg1xi1 +…+ ! jgp xip + "ijg , (1) unlike other types of discrete responses, count responses 2 cannot be expressed in the form of several proportions. For 1 ! i ! n j and "ijg ~ N (0,# jg ), count responses, the upper limit is infinite and the range is th where Yijg denotes the expression level of gene g of the i

th *Address correspondence to this author at the Department of Biostatistics, replicate in the j treatment group and ! jg represents St. Jude Children's Research Hospital, 262 Danny Thomas PI, MS 768, corresponding regression coefficients, which quantify group Memphis, TN 38105, USA; Tel: 901-595-6736; Fax: 901--595-8843; x E-mail: [email protected] effects on gene expression levels. An example of i is

1874-0362/13 2013 Bentham Open Statistical Methods for Overdispersion in mRNA-Seq Count Data The Open Bioinformatics Journal, 2013, Volume 7 35

treatment assignment or gender. If xi1 represents a group used to account for replicate-to-replicate variation in indicators, ! stands for difference in expression level technical factors such as errors and of laboratory jg1 operations such as RNA extraction and amplification [7]. between group j and the reference group. Similarly, if xip Technical laboratory effects typically is manifest as differences in the total number of sequence reads generated stands for gender, ! jgp represents the difference in the for each replicate. Thus, the offset term is typically a simple expression level of gene g between males and females. A t- function of the read count data of the replicate. The average, test can be applied to evaluate whether a certain group effect, median, or upper quantile is often used [7]. Mathematically, for example, is equal to a constant C. In the case that ! jg1 the offset term may be absorbed into the intercept term of C = 0 , the test is the familiar one to assess the significance equation 3. Thus, we use the notation of equation 3 of the result, throughout with the understanding that the offset term is often required in practice. 2 ! # & H0:! jg1=C ! 2 ! ! jg1 * C Subsequently, with the log-linear model defined above, ! | ! ," ~ Normal % ! ,V " jg ( ) T = ~ t , jg0 jg1 jg % jg1 g ( 2 dg the variation of a count response can be usually partially $ ' ! explained by a vector of predictors. Vg " jg Without loss of generality, in the rest of this review 2 ! ! except Section 4.3, we will remove j and g by considering where V ! jg is the variance of ! and t represents the g jg1 dg only 1 gene and 1 treatment group.

T-distribution with degree of freedom dg . 2.1. Inference About Model Parameters However, in mRNA-seq data, the expression level is In , maximum-likelihood estimation (MLE) is often measured by count rather than a continuous value. used to estimate the parameters by finding optimal Thus, using the model employed in (1) will likely result in a mathematical values that maximize the misleading conclusion. Although biomedical investigators (or the log-likelihood function). For the log-linear model, the previously transformed their count data and used tools log-likelihood function is appropriate for continuous data [3-5], it is now more suitable n to employ more advanced statistical methods that better l (!) = "{yi #i $ exp(#i ) $ log yi!} characterize the features of count data. i=1 (4) n The Poisson distribution is a widely used model for the T = "{yi xi ! $ exp(xi !) $ log yi!}. count response. It is determined by only one parameter ! , i=1 which is both the mean and variance of the distribution. To locate the numbers maximizing the log-likelihood, we However, this parameter often varies because of need to solve the score function, which sets the first derivative heterogenous underlying characteristics; for example, the of the log-likelihood function with respect to ! as 0: reads for A/T-rich regions are usually low. Thus, this single- parameter model can no longer be applied if heterogeneity is n ! T % high. Therefore, the log-linear regression model, an l (") = yi xi $ exp(xi ")xi . (5) !" #{ } extension of the simple Poisson distribution, was developed i=1 to account for such heterogeneity [6]. As the second-order derivative is negative, the unique solution to the score equation (5) will be the MLE of ! , Let Yijg denote the mRNA-seq count for gene g in the th th ! ! i replicate of the j treatment group and xi = xi1,…, xip denoted as ! . The MLE ! follows an asymptotically ( ) denote a vector of explanatory variables of the i th subject ! # 1 "1 & 1 i n . , that is ! ~ AN % !, I (!)( , where ( ! ! j ) $ n '

I ! is the matrix, defined as Given xi , the response variable Yijg follows a Poisson ( ) distribution with mean ! , % #2 ( ijg [6]. For the log-linear model I (!) = E '" $ l (! | Y)* & #!#! ) Y | x ~ Poisson , 1 i n , (2) ijg i (!ijg ) " " j defined in (2) and (3), whereas the conditional mean ! is linked to a linear 1 n ijg ' ' (6) E #"I (!)%$ = E (&i xi xi ), I (!) = (&i xi xi . combination of predictors by the log function: n i=1 log(! ) = x " = " + " x +…+ " x , (3) In practice, it is usually not feasible to evaluate ijg i jg jg0 jg1 i1 jgp ip E "I (!)$ in a closed form. Thus, the observed version of where ! = (! , ! ,…, ! )" is the parameter vector, # % jg jg0 jg1 jgp ! including interest and nuisance parameters. In the analysis of I (!) with estimated ! replacing ! serves as a natural mRNA-seq data, a replicate-specific offset term is frequently 36 The Open Bioinformatics Journal, 2013, Volume 7 Zhang et al. of I (!) for inference. Then the MLEs can be Given both yi ~ Poisson(!i ) and !i s are all large, obtained from iterative numerical optimization by taking Pearson's statistic approximately follows a chi-square advantage of modern computers. distribution with degrees of freedom of n ! p . Note that the In most mRNA-seq analyses, we are primarily interested asymptotic distribution of Pearson's statistic holds when in testing whether a predictor of interest is associated with !i " # while n is fixed. This is somewhat in comparison the gene expression level, such as cancerous versus normal with the common large sample theory, because most tissue. Group comparisons can be made using Wald, score, asymptotic results typically hold with a large sample size n . or likelihood ratio tests. In mRNA-seq data, !i s may not be large; thus, inference 2.2. Goodness of Fit based on the asymptotic chi-square distribution may be invalid. Departures from the Poisson assumption that mean equals variance are frequently seen even in well-designed 2.2.2. Deviance Statistic and controlled practical studies, especially in mRNA-seq The deviance statistic is defined as twice the difference count data. Therefore, it is important to evaluate whether the between the maximum achievable log-likelihood and the log-linear model fits the data appropriately. In this review, value of the log-likelihood evaluated at the MLEs of the we discuss 2 goodness-of-fit tests for log-linear regression: model, that is, the Pearson's ! 2 test and the deviance test. D(y,!) = 2 $#l (y, y) " l (y,!)&%, 2.2.1. Pearson's ! 2 Statistic ! where y = (y1,…, yn ) . l (y, y) is the log-likelihood Pearson's ! 2 statistic is the sum of normalized squared assuming the model is a perfect fit to the data and is residuals. A residual is defined as the difference between the l ( y,!) observed and model-fitted values of the response variable, the log-likelihood of the model of consideration. For log- and the sum of its squares asymptotically follows a chi- linear regression, the deviance statistic is defined as square distribution under regular conditions. For instance, let + $ ' . n y $ ' $ ' ! # !& - & i ) ! 0 ! ! D(y,!) = 2 yi log * & yi * # i ) , # i = exp & xi ! ) . yi be the count response and ! i = exp x " be the fitted " - 0 % i ( i=1 & ! ) % ( % ( $ ' - % # i ( 0 , / value under the log-linear model (1 ! i ! n) . Since the mean If the Poisson model holds, D(y,!) is approximated by and variance are the same for the Poisson distribution, we a chi-squared random variable with degrees of freedom ! n ! p . When the deviance divided by the degrees of may also estimate the variance as ! i . Subsequently, the freedom is significantly larger than 1, which is a rare case ! yi ! " i under the null hypothesis of good fit with Poisson, it is normalized residual for the i th subject is , and the indicative of lack of fit. ! " i 3. OVERDISPERSION AND ITS DETECTION 2 $ ! ' Although the Poisson model is most widely employed for y " # i & i ) count data analysis, a common violation of it is n % ( Pearson statistic is simply P = . It can be overdispersion, which is seen in mRNA-seq count data. As ! i=1 ! discussed at the beginning of Section 2, the mean and # i proven that the Poisson distribution converges to a normal variance of a Poisson distribution should be the same. This is actually a very stringent restriction, and in many distribution when the mean ! grows unbounded [8]. If we applications, the variance ! 2 = Var (y) often exceeds the ! ignore the sampling variability in ! , we have mean ! = E (y) , causing overdispersion and making it inappropriate to model such data using the Poisson model. ! yi ! " i ! When overdispersion occurs, the standard errors of ~ AN(0,1), if " i # $, 1 % i % n. parameter estimates of the Poisson log-linear model are ! " i artificially deflated, leading to exaggerated biased estimates and false positive findings. In mRNA-seq data, It follows that for fixed n , overdispersion may occur because of correlated gene counts, 2 clustering of subjects and within group heterogeneity. If the $ ! ' yi " # i underlying reason for overdispersion is uncertain, a common n & ) % ( 2 ! (7) approach is to use a parametric model with 1 additional P = ! ~ asymptotic *n" p , if # i + ,forall1 - i - n. i=1 ! parameter to account for variance inflation or use a # i nonparametric robust estimate for the asymptotic distribution where p is the number of parameters, that is the dimension of of model estimate and make an inference based on the β. corrected asymptotic distribution. Statistical Methods for Overdispersion in mRNA-Seq Count Data The Open Bioinformatics Journal, 2013, Volume 7 37

Several methods have been developed to detect modeling methods can be divided into 3 categories: the overdispersion, 4 of which are discussed here. First, maximum likelihood based parametric method, the overdispersion detection could be taken as the goodness-of- maximum quasi-likelihood based nonparametric method and fit test to evaluate how well a Poisson model fits the data. In the two-stage analysis strategy. fact, the 2 goodness-of-fit tests, the deviance and Pearson's 4.1. Parametric Method chi-square statistics, discussed in Section 2, can be used to verify whether overdispersion occurs. However, the deviance As mentioned in Section 2, mRNA-seq reads frequently and Pearson's statistics can be quite different if the mean is show overdispersion because of reasons such as correlated not large enough. Second, Vuong's statistic [9], which tests gene counts, clustering of subjects, and within-group the Poisson model against its nested model such as the heterogeneity. Therefore, it is natural to add a random effect negative binomial model, can be an efficient way to test ri to standard Poisson models to capture additive overdispersion. Third, a recently proposed generalized heterogeneity. Equation (2) can be rewritten as ANOVA [10] detects overdispersion by extending U- statistics with asymptotic theory. Finally, for mRNA-seq Yi | xi , ri ~ Poisson(ri µi ), 1 ! i ! n. data analysis, Auer and Doerge [11] adopted a score test method proposed by Dean and Lawless [12, 13]. They As stated above, if the random effect ri is known, the 2 $ ' count response yi follows standard Poisson distribution. 1 n $ ! ' defined T = & & yi " # i ) " yi ) under the assumption However, when r is unknown or not observed, we assume it 2 ! i=1 i %& % ( () as a random that follows a parametric distribution. of a correct specification of the mean response !i . This Statisticians have shown that it is convenient and reasonable statistic is motivated by assuming a form of extra Poisson to assume that ri has a [8]. Therefore, it 2 is easy to derive the marginal distribution of response , variation, Var(yi ) = !i + " !i , and then testing the null Yi hypothesis of the Poisson model H 0 :! = 0 . When the " yi ! (" + yi )# µi sample size n ! " , the following statistic P(Yi = yi | xi ) = , (9) "+yi yi!! (" ) (µi + #) 2 n *$ ' - , ! / where ! and ! are the parameters of gamma distribution. ! & yi " # i ) " yi i=1 ,% ( / Equation (9) is the probability mass function of a negative T = + . , (8) 1 2 !" n $ ' with mean and variance ! µi 2!& # i ) i % ( !" # " & , which can be rewritten as mean and %1+ ( !i approximately normally distributed under the null hypothesis µi $ µi ' 2 that yi follows a Poisson distribution. If the sample size n is variance !i + " !i to match the test in (8). Unless ! = 0 , fixed, but !i " # for all 1 ! i ! n , T is asymptotically this variance is always larger than the mean ! . Therefore, equivalent to the negative binomial model adds an extra term ! ! 2 to the

2 variance of Poisson model to account for overdispersion. For n $ ! ' this reason, ! is the parameter to control dispersion or the !& yi " # i ) i=1 % ( shape of this distribution. The negative binomial model has T = . 2 n been widely used for sequence count data analyses, such as 1 ! egeR [14-16], DESeq [17] and baySeq [18]. # i n ! i=1 As a parametric model belonging to the exponential Dean and Lawless [12] showed that the limiting family, negative binomial inference can be performed with maximum likelihood. Its log-likelihood is derived as distribution of T2 is a linear combination of chi-squares as ! " # 1 ! i ! n . n i ( ) l (!," ) = # log fNB (yi | xi , !,$ ) 4. METHODS FOR OVERDISPERSED COUNT DATA i=1 n When overdispersion is detected and the appropriateness 1 + "1 % 1 "1 ( . 4 = y! log g x # " log + g x # } of the Poisson model is in serious doubt, the model-based ! 2 i - 1 ( 1i ) ' 1 ( 1i )* 0 5 i=1 3 , & $ ) / 6 asymptotic variance no longer indicates the variability of the MLE and inference based on the likelihood approach may be n , 1 / #1 & ) invalid. A different and more appropriate model can be used +! ." log(1+ "g1 (x1i $)) + log % ( yi + + 1 i=1 - ' " * 0 to fit the data if the underlying cause for overdispersion is known. In previously published sequence count data n % % 1( ( analyses, multiple approaches have been employed to ! log yi!! log # ' * . (10) "&' & $ ) )* address overdispersion [14-18]. These overdispersion i=1 38 The Open Bioinformatics Journal, 2013, Volume 7 Zhang et al.

By maximizing this log-likelihood, we readily obtain the asymptotically normal regardless of the distribution of yi and ! choice of V , as long as the mean function E y | x = MLEs ! , as well as their asymptotic distribution for inference. i ( i i )

!i (xi ") is correct [16]. Further, if yi given xi is modeled By testing the null H :! = 0 , we can evaluate whether 0 parametrically using a member of the , the there is overdispersion in the data. However, since its range is estimating equations in (12) are essentially the same as the score nonnegative, 0 is a boundary point. Thus, the asymptotic equations of the log-likelihood with an appropriate selection of theory of the MLE cannot be applied directly for inference about ! , as 0 is not an interior point in the parameter space. ! Vi , which ! is the same as ! . Statisticians refer to this problem as inference under nonstandard condition. Under certain conditions that are ! With the sandwich estimate, the asymptotic variance of ! satisfied by most applications, including the overdispersion test being discussed here, inference about the boundary point can be is given by based on a modified asymptotic distribution. For example, to n 1 #1 & 1 #2 % ) #1 test the null hypothesis, H 0 :! = 0 , the revised asymptotic (13) !" = B ( $DiVi Var (yi | xi ) Di + B . distribution is an equal mixture of a point mass at 0 and the n ' n i=1 *

! If the conditional distribution of yi given xi follows a positive half of the asymptotic normal distribution of ! under the null hypothesis. Intuitively, the lower half of the asymptotic Poisson with mean !i = exp(xi ") , then ! distribution of ! is “folded” into a point mass at 0, since !"i $ Di = = "i xi , Var (yi | xi ) = "i , B = E ("i xi xi ). (14) negative values of ! are not allowed under the null hypothesis. !# 4.2. Non-Parametric Method It is readily checked that the estimating equation in (12) is identical to the score equations of the log-likelihood of the When overdispersion exists and the assumptions of negative Poisson log-linear regression in (5), if V = Var y | x and the binomial are satisfied, the parametric method and related i ( i i ) maximum likelihood are very efficient to model mRNA-seq asymptotic variance of the estimating equation estimate in (13) counts. However, negative binomial may not always hold for all simplifies to mRNA-seq reads and therefore a robust inference is required. 1 & 1 n ) Robert Wedderburn introduced a quasi-likelihood function in #1 #2 % #1 !" = B ( $DiVi Var (yi | xi ) Di + B 1974 [19]. Despite its properties being similar to that of the log- n ' n i=1 * likelihood function, it only requires the mean and variance of 1 $ 1 n ' 1 the response to follow a certain pattern, without having to !1 !2 # !1 !1 (15) = B & "DiVi Vi Di ) B = I (*). specify any parametric distribution. Quasi-likelihood models n % n i=1 ( n can be fitted using a straightforward extension of the algorithms Then, the estimating equation yields the same inference as for generalized linear models. ! To perform inference, the challenge is to estimate the the MLE. However, as the estimating equation estimate ! and asymptotic variance of estimated parameters. The most popular its associated asymptotic variance ! in (13) are derived method is the sandwich variance estimate, which is derived " based on the estimating equations. Scaled variance is another independent of such distributional models, it still provides a approach. valid inference even when the conditional distribution of yi given is not Poisson. For example, in the presence of 4.2.1. Sandwich Estimate for Asymptotic Variance xi overdispersion, Var y | x is larger than ! , biasing the Let ( i i ) i asymptotic variance in (15) based on the MLE. In contrast, the 1 n #!i %1 & estimating equation based asymptotic variance ! in (13) still E (yi | xi ) = !i (xi " ), Di = , B = DiVi Di , (11) " #" n $ i=1 provides valid inference about Var (!) . where x indicates that the mean is an arbitrary !i ( i ") !i After estimating various parameters in (13) using ! corresponding moments that function of x ! . Let ! be the estimating equation estimate of i 2 # & n ! ! ! 1 ! * ! Var y | x = y i , B = i x x , ! , i.e., ! is the solution of ( i i ) % i ! " ( ) " i i $ ' n i=1 n "1 ! ! ! ! DiVi (yi " #i ) = 0, (12) ! Di = " i xi , V i = " i , i=1 we get the sandwich variance estimate where !i and Di are defined in (11), and Vi is also a function ! of !i . It has been proven that ! is consistent and

Statistical Methods for Overdispersion in mRNA-Seq Count Data The Open Bioinformatics Journal, 2013, Volume 7 39

#1 #1 for overdispersion. If this variance model is incorrectly ! 1 ! % 1 n ! ! ! ! ( ! i i i specified, inference based on this asymptotic variance !" = B ' $ D V Var (yi | xi ) D * B n & n i=1 ) ! estimate !" may not be reliable. *1 2 *1 1 $ 1 n ! ' $ 1 n ! ! ' $ 1 n ! ' # # # (16) 4.3. Two-Stage Analysis = & ! " i xi xi ) & ! " i Var (yi | xi )xi xi ) & ! " i xi xi ) . n % n i=1 ( %& n i=1 () % n i=1 ( Typically, an mRNA-seq dataset can have thousands to millions of genetic features (exons, genes, or other genetic Thus, if overdispersion is detected, we can still use the loci to which sequences are mapped). Some genes may have MLE to estimate ! , but need to switch to the sandwich substantial variation in transcription levels, whereas others, variance estimate in (16) to ensure valid inference. such as housekeeping genes, may have very stable 4.2.2. Scaled Variance transcription levels, with no overdispersion. Therefore, a Two-Stage Poisson Model (TSPM) was proposed to first For the log-linear model, a widely used alternative to examine overdispersion in each individual gene before address overdispersion is to add an additional scale making inferences in order to preserve high statistical power, parameter to inflate the variance of yi . Specifically, this especially for small sample sizes [12]. scaled variance approach assumes the same conditional mean After estimating the offset terms, a nonspecific filtering but a scaled conditional variance of yi given xi as is performed to remove genes with total counts less than 10 following: across all treatment groups (a pre-specified arbitrary cutoff) to satisfy the conditions required by asymptotic theory in ! = exp x " , Var y | x = ! 2! . i ( i ) ( i i ) i Section 3. Then each gene that passes the filtering process, 2 such as gene g , is evaluated first for overdispersion by using If ! = 1, Var (yi | xi ) = !i and the modified variance the Dean and Lawless' method [12], introduced in Section 3, approach reduces to the Poisson model. In the presence of 2 with the hypothesis set as overdispersion, ! > 1 and Var (yi | xi ) > !i , accounting H : = 0 v.s. H : > 0, for overdispersion. 0g ! g 1g ! g Under this scaled-variance approach, we first estimate ! and with the statistic 2 using either the MLE or the estimating equation approach. 2 * *$ ' - - Then, we estimate the scale parameter ! 2 by , , ! ! ! / / & yijg " # ijg ) " yijg + h ijg # ijg , , / / H 2 1 % ( 0 n , + . / 2 ! 1 ! $ ! ' Tg = ~ 01 , (18) 2 ,! / ! = "ri , ri = ! i & yi # ! i ) . 2N i, j ! n i=1 % ( , # ijg / , / +, ./ where ri s are known as the Pearson residuals. By 2 ! ! ! ! substituting Var y | x with ! ! i in the sandwich where h ijg is the corresponding diagonal element of the hat ( i i ) T !1 T ! matrix (i.e. H = X (X X) X and X is the covariate variance estimate !" in (16), we obtain a consistent vector). At the second stage, if the H is rejected, estimate of the asymptotic variance of the estimating 0g equation (or maximum likelihood) estimate of ! : suggesting that gene g has overdispersion, a quasi- likelihood method will be employed. Otherwise, a standard ,1 2 ,1 ! & n " ) & n " " ) & n " ) log-linear model will be fitted which has a greater statistical % % % !" = ( # $ i xi xi + ( # $ $ i xi xi + ( # $ i xi xi + power than the quasi-likelihood method, since the MLE is ' i=1 * '( i=1 *+ ' i=1 * most efficient when the parametric model holds. 2 *1 Recently, Pounds et al. [20] further extended the TSPM ! $ n ! ' # (17) by using concepts of the Assumption Adequacy Averaging = ! & " ! i xi xi ) . % i=1 ( method [21]. Briefly, Dean's test for overdispersion, a test based on a standard Poisson general linear model, and a Alternatively, we can apply the deviance statistic to quasi-likelihood test are applied to the count data of each estimate ! 2 in (17) to obtain a slightly different yet genetic feature. A set of empirical Bayesian probabilities consistent estimate of the asymptotic variance of the [22] is computed for each test and these empirical Bayesian estimating equation/maximum likelihood estimate of ! . probabilities are either combined in a weighted average or used to select the empirically best test for each genetic ! feature. These intermediate results are used to obtain a final However, unlike the sandwich estimate in (16), the !" empirical Bayesian probability that is a type of local false ! discovery rate metric for each genetic feature. The approach estimate is derived based on a particular variance model !" by Pounds et al. performs better than TSPM [20], but it 40 The Open Bioinformatics Journal, 2013, Volume 7 Zhang et al. cannot be applied to a gene with small counts due to its “Stem cell transcriptome profiling via massive-scale mRNA dependence on asymptotic theories for detecting sequencing”, Nat. Methods, vol. 5, pp.613-619, 2008. [4] B. Langmead, K.D. Hansen and J.T. Leek, “Cloud-scale RNA- overdispersion [12]. sequencing differential expression analysis with Myrna”, Genome. Biol., vol. 11, pp. 83, 2010. 5. FUTURE STUDIES [5] M. Ashburner, C.A. Ball, J.A. Blake, D. Botstein, H. Butler, J.M. Cherry, A.P. Davis, K. Dolinski, S.S. Dwight, J.T. Eppig, M.A. Uniformly applying the negative binomial or the quasi- Harris, D.P. Hill, L. Issel-Tarver, A. Kasarskis, S. Lewis, J.C. likelihood Poisson model to mRNA-seq count data addresses Matese, J.E. Richardson, M. Ringwald, G.M. Rubin and G. the overdispersion and fixes the bias resulting from standard Sherlock, “Gene ontology: tool for the unification of biology. The Poisson models. However, it also reduces the analysis power Gene Ontology Consortium”, Nat. Genet., vol. 25, pp.25-29, 2000. [6] P. McCullagh and J.A. Nelder, Genaralized Linear Models. for genes that do not exhibit overdispersion. The two-stage Chapman & Hall/CRC, 1999. analysis strategy seems to address this limitation, but it can [7] J.H. Bullard, E. Purdom, K.D. Hansen and S. Dudoit, “Evaluation be used only when genes have large counts. Therefore, of statistical methods for normalization and differential expression in mRNA-Seq experiments”, BMC Bioinformatics., vol. 11, pp. 94- improvements in statistical methods are warranted, 107, 2010. especially for genes with lower counts. [8] G. Casella and R.L. Berger, , 2nd ed. Thomson Learning: Pacific Grove, CA, 2002. Another challenge is that many mRNA-seq data carry [9] Q.H. Vuong, “Likelihood Ratio Tests for Model Selection and non- excessive zeroes, while statistical models described in this nested Hypotheses”, Econometrica, vol. 57, pp.307-333, 1989. review generally no longer hold. Also, a widely recognized [10] H. Zhang and X.M. Tu, “The Generalized ANOVA -- A classic practical problem is to distinguish between a random zero, song sung with modern lyrics”, In press by Institute of Mathematical Statistics Collection Series, 2013. occurring in a gene with low expression, from a structural [11] P.L. Auer and R. Doerge, “A two-stage Poisson model for testing zero, occurring in an unexpressed gene. The zero-inflated RNA-seq data”, Statis. Appl. Genet. Mol. Biol., vol. 10, pp.1-26, models may be useful when there is an excess of zeroes in 2011. [12] C.B. Dean and J.F. Lawless, “Tests for detecting overdispersion in mRNA-seq data [23]. These models introduce an additional models”, J. Am. Stat. Assoc., vol. 84, pp.467- parameter to model the probability that a gene is 472, 1989. unexpressed, which serves as a mathematical representation [13] C.B. Dean, “Testing for overdispersion in poisson and binomial for structural zeroes whereas random zeroes reflect genes regression models”, J. Am. Stat. Assoc., vol. 87, pp.451-457, 1992. [14] M.D. Robinson and A. Oshlack, “A scaling normalization method with low expressions. for differential expression analysis of RNA-seq data”, Genome. Biol., vol. 11, pp.1-9, 2010. mRNA-seq data have posed numerous statistical [15] M.D. Robinson and G.K. Smyth, “Moderated statistical tests for challenges, and translating the rich information derived from assessing differences in tag abundance”, Bioinformatics, vol. 23, mRNA-seq data into clinical knowledge will be a long- pp. 2881-2887, 2007. standing direction for biomedical and statistical research. [16] M.D. Robinson and G.K. Smyth, “Small-sample estimation of negative binomial dispersion, with applications to SAGE data”, This review aims to aid researchers to analyze regular and Biostatistics, vol. 9, pp.321-332, 2008. overdispersed mRNA-seq count data more appropriately and [17] S. Anders and W. Huber, “Differential expression analysis for direct attention to practical issues in such data. sequence count data”, Genome Biol., vol. 11, article 106, 2010. [18] T.J. Hardcastle and K.A. Kelly, “baySeq: empirical Bayesian CONFLICT OF INTEREST methods for identifying differential expression in sequence count data”, BMC. Bioinformatics, vol. 11, pp. 422, 2010. The author confirms that this article content has no [19] R.W.M. Wedderburn RWM, “Quasi-likelihood functions, conflicts of interest. generalized linear models, and the Gauss---Newton method”, Biometrika, vol. 61, pp.439-447, 1974. ACKNOWLEDGEMENTS [20] S.B. Pounds, C.L. Gao and H. Zhang, “Empirical bayesian selection of hypothesis testing procedures for analysis of sequence We gratefully acknowledge funding from the American count expression data”, Stat. Appl. Genet. Mol. Biol., vol. 11, Lebanese Syrian Associated Charities (ALSAC). article 5, 2012. [21] S.B. Pounds and S.N. Rai, “Assumption adequacy averaging as a REFERENCES concept to develop more robust methods for differential gene expression analysis”, Comput.l Stat. Data Anal., vol. 53, pp.1604- [1] S.C. Schuster, “Next-generation sequencing transforms today's 1612, 2009. biology”, Nat. Methods, vol. 5, pp.16-18, 2008. [22] S.B. Pounds and S. Morris, “Estimating the occurrence of false [2] A. Mortazavi, B.A. Williams, K. McCue, L. Schaeffer, and B. positives and false negatives in Microarray studies by Wold, “Mapping and quantifying mammalian transcriptomes by approximating and partitioning the empirical distribution of p- RNA-Seq”, Nat. Methods., vol. 7, pp. 621-628, 2008. values”, Bioinformatics, vol. 19, pp. 1236-1242, 2003. [3] N. Cloonan, A.R. Forrest, G. Kolle, B.B. Gardiner, G.J. Faulkner, [23] D. Lambert, “Zero-Inflated poisson regression, with an application M.K. Brown, D.F. Taylor, A.L. Steptoe, S. Wani, G. Bethel, A.J. to defects in manufacturing”, Technometrics, vol. 34, pp. 1-14, Robertson, A.C. Perkins, S.J. Bruce, C.C. Lee, S.S. Ranade, H.E. 1992. Peckham, J.M. Manning, K.J. McKernan, and S.M. Grimmond,

Received: August 6, 2013 Revised: September 6, 2013 Accepted: September 15, 2013

© Zhang et al.; Licensee Bentham Open.

This is an open access article licensed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted, non-commercial use, distribution and reproduction in any medium, provided the work is properly cited.