Statistical Methods for Overdispersion in Mrna-Seq Count Data Hui Zhang*, Stanley B
Total Page:16
File Type:pdf, Size:1020Kb
Send Orders for Reprints [email protected] 34 The Open Bioinformatics Journal, 2013, 7, (Suppl 1: M3) 34-40 Open Access Statistical Methods for Overdispersion in mRNA-Seq Count Data Hui Zhang*, Stanley B. Pounds and Li Tang Department of Biostatistics, St. Jude Children's Research Hospital, Memphis, TN 38105, USA Abstract: Recent developments in Next-Generation Sequencing (NGS) technologies have opened doors for ultra high throughput sequencing mRNA (mRNA-seq) of the whole transcriptome. mRNA-seq has enabled researchers to comprehensively search for underlying biological determinants of diseases and ultimately discover novel preventive and therapeutic solutions. Unfortunately, given the complexity of mRNA-seq data, data generation has outgrown current analytical capacity, hindering the pace of research in this area. Thus, there is an urgent need to develop novel statistical methodology that addresses problems related to mRNA-seq data. This review addresses the common challenge of the presence of overdispersion in mRNA count data. We review current methods for modeling overdispersion, such as negative binomial, quasi-likelihood Poisson method, and the two-stage adaptive method; introduce related statistical theories; and discuss their applications to mRNA-seq count data. Keywords: Count response, mRNA-seq, negative binomial theory, over-dispersion, Poisson, quasi-likelihood. 1. INTRODUCTION theoretically unbounded. Thus, methods for binomial responses do not apply. Log-linear models built upon DNA sequencing is a powerful technique that allows Poisson distributions are most popular for analyzing the scientists to obtain key genetic information from biological number of counts. An important mathematical feature of the system of interest. However, traditional sequencing methods Poisson distribution is that the mean equals its variance. such as capillary electrophoresis based Sanger sequencing However, in mRNA-seq data the variance of the count is have inherent limitations with regard to throughput, often much larger than its mean, and this property is called scalability, speed, resolution and cost. Next Generation overdispersion. When overdispersion occurs, the log-linear Sequencing (NGS) has been developed to address such model does not hold, which results in biased and misleading limitations, and this technology has triggered numerous conclusions. Therefore, alternative analytical strategies are ground-breaking discoveries and evoked a revolution in needed to adequately address over dispersion of mRNA-seq biomedical research [1]. NGS technologies enable data. sequencing the entire genome and transcriptome in a rapid and cost-effective manner, and can be applied to a wide The regular log-linear regression and related goodness- range of biomedical applications such as variant discovery, of-fit tests will first be introduced in Section 2, followed by profiling of histone modifications, identification of overdispersion and its detection in Section 3. In Section 4, transcription factor binding sites, resequencing, and we will review extended Poisson models and the newly chararcterization of the transcriptome. The application of developed two-stage adaptive strategy for overdispersion in NGS to the transcriptome, also known as mRNA-seq, allows mRNA-seq count reads. Finally, future directions will be direct sequencing of a population of transcripts and these discussed. read counts linearly approximate target transcript abundance 2. LOG-LINEAR REGRESSION MODEL FOR COUNT [2]. Therefore, mRNA-seq data can provide rich information DATA on alterative splicing, allele-specific expression, unannotated exons, and differential expression. To assess differential continuous expression measures, such as microarray data, among various groups or treatments In mRNA-seq data analysis, sequenced fragments are and other demographic factors, an ordinary linear regression aligned with a reference sequence and the number of can be written as follows: fragments (typically called counts) mapped to regions of interest is recorded, usually as count data [2]. However, Yijg = xi ! jg = ! jg0 + ! jg1xi1 +…+ ! jgp xip + "ijg , (1) unlike other types of discrete responses, count responses 2 cannot be expressed in the form of several proportions. For 1 ! i ! n j and "ijg ~ N (0,# jg ), count responses, the upper limit is infinite and the range is th where Yijg denotes the expression level of gene g of the i th *Address correspondence to this author at the Department of Biostatistics, replicate in the j treatment group and ! jg represents St. Jude Children's Research Hospital, 262 Danny Thomas PI, MS 768, corresponding regression coefficients, which quantify group Memphis, TN 38105, USA; Tel: 901-595-6736; Fax: 901--595-8843; x E-mail: [email protected] effects on gene expression levels. An example of i is 1874-0362/13 2013 Bentham Open Statistical Methods for Overdispersion in mRNA-Seq Count Data The Open Bioinformatics Journal, 2013, Volume 7 35 treatment assignment or gender. If xi1 represents a group used to account for replicate-to-replicate variation in indicators, ! stands for difference in expression level technical factors such as errors and efficiency of laboratory jg1 operations such as RNA extraction and amplification [7]. between group j and the reference group. Similarly, if xip Technical laboratory effects typically is manifest as differences in the total number of sequence reads generated stands for gender, ! jgp represents the difference in the for each replicate. Thus, the offset term is typically a simple expression level of gene g between males and females. A t- function of the read count data of the replicate. The average, test can be applied to evaluate whether a certain group effect, median, or upper quantile is often used [7]. Mathematically, for example, is equal to a constant C. In the case that ! jg1 the offset term may be absorbed into the intercept term of C = 0 , the test is the familiar one to assess the significance equation 3. Thus, we use the notation of equation 3 of the result, throughout with the understanding that the offset term is often required in practice. 2 ! # & H0:! jg1=C ! 2 ! ! jg1 * C Subsequently, with the log-linear model defined above, ! | ! ," ~ Normal % ! ,V " jg ( ) T = ~ t , jg0 jg1 jg % jg1 g ( 2 dg the variation of a count response can be usually partially $ ' ! explained by a vector of predictors. Vg " jg Without loss of generality, in the rest of this review 2 ! ! except Section 4.3, we will remove j and g by considering where Vg ! jg is the variance of ! jg1 and td represents the g only 1 gene and 1 treatment group. T-distribution with degree of freedom dg . 2.1. Inference About Model Parameters However, in mRNA-seq data, the expression level is In statistics, maximum-likelihood estimation (MLE) is often measured by count rather than a continuous value. used to estimate the parameters by finding optimal Thus, using the model employed in (1) will likely result in a mathematical values that maximize the likelihood function misleading conclusion. Although biomedical investigators (or the log-likelihood function). For the log-linear model, the previously transformed their count data and used tools log-likelihood function is appropriate for continuous data [3-5], it is now more suitable n to employ more advanced statistical methods that better l (!) = "{yi #i $ exp(#i ) $ log yi!} characterize the features of count data. i=1 (4) n The Poisson distribution is a widely used model for the T = "{yi xi ! $ exp(xi !) $ log yi!}. count response. It is determined by only one parameter ! , i=1 which is both the mean and variance of the distribution. To locate the numbers maximizing the log-likelihood, we However, this parameter often varies because of need to solve the score function, which sets the first derivative heterogenous underlying characteristics; for example, the of the log-likelihood function with respect to ! as 0: reads for A/T-rich regions are usually low. Thus, this single- parameter model can no longer be applied if heterogeneity is n ! T % high. Therefore, the log-linear regression model, an l (") = yi xi $ exp(xi ")xi . (5) !" #{ } extension of the simple Poisson distribution, was developed i=1 to account for such heterogeneity [6]. As the second-order derivative is negative, the unique solution to the score equation (5) will be the MLE of ! , Let Yijg denote the mRNA-seq count for gene g in the th th ! ! i replicate of the j treatment group and xi = xi1,…, xip denoted as ! . The MLE ! follows an asymptotically ( ) denote a vector of explanatory variables of the i th subject ! # 1 "1 & 1 i n . normal distribution, that is ! ~ AN % !, I (!)( , where ( ! ! j ) $ n ' I ! is the Fisher information matrix, defined as Given xi , the response variable Yijg follows a Poisson ( ) distribution with mean ! , % #2 ( ijg [6]. For the log-linear model I (!) = E '" $ l (! | Y)* & #!#! ) Y | x ~ Poisson , 1 i n , (2) ijg i (!ijg ) " " j defined in (2) and (3), n whereas the conditional mean !ijg is linked to a linear 1 ' ' (6) E #"I (!)%$ = E (&i xi xi ), I (!) = (&i xi xi . combination of predictors by the log function: n i=1 log(! ) = x " = " + " x +…+ " x , (3) In practice, it is usually not feasible to evaluate ijg i jg jg0 jg1 i1 jgp ip E "I (!)$ in a closed form. Thus, the observed version of where ! = (! , ! ,…, ! )" is the parameter vector, # % jg jg0 jg1 jgp ! including interest and nuisance parameters. In the analysis of I (!) with estimated ! replacing ! serves as a natural mRNA-seq data, a replicate-specific offset term is frequently 36 The Open Bioinformatics Journal, 2013, Volume 7 Zhang et al. estimator of I (!) for inference. Then the MLEs can be Given both yi ~ Poisson(!i ) and !i s are all large, obtained from iterative numerical optimization by taking Pearson's statistic approximately follows a chi-square advantage of modern computers. distribution with degrees of freedom of n ! p . Note that the In most mRNA-seq analyses, we are primarily interested asymptotic distribution of Pearson's statistic holds when in testing whether a predictor of interest is associated with !i " # while n is fixed.