voom: precision weights unlock linear model analysis tools for RNA-seq read counts
Charity W Law1,2, Yunshun Chen1,2, Wei Shi1,3 and Gordon K Smyth1,4*
Law et al. Genome Biology 2014, 15:R29
http://genomebiology.com/2014/15/2/R29

METHOD Open Access

*Correspondence: [email protected]
1Bioinformatics Division, The Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, Victoria 3052, Australia
4Department of Mathematics and Statistics, The University of Melbourne, Parkville, Victoria 3010, Australia
Full list of author information is available at the end of the article

© 2014 Law et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Abstract

New normal linear modeling strategies are presented for analyzing read counts from RNA-seq experiments. The voom method estimates the mean-variance relationship of the log-counts, generates a precision weight for each observation and enters these into the limma empirical Bayes analysis pipeline. This opens access for RNA-seq analysts to a large body of methodology developed for microarrays. Simulation studies show that voom performs as well or better than count-based RNA-seq methods even when the data are generated according to the assumptions of the earlier methods. Two case studies illustrate the use of linear modeling and gene set testing methods.

Background

Gene expression profiling is one of the most commonly used genomic techniques in biological research. For most of the past 16 years or more, DNA microarrays have been the premier technology for genome-wide gene expression experiments, and a large body of mature statistical methods and tools has been developed to analyze intensity data from microarrays. This includes methods for differential expression analysis [1-3], random effects [4,5], gene set enrichment [6], gene set testing [7,8] and so on. One popular differential expression pipeline is that provided by the limma software package [9]. The limma pipeline includes linear modeling to analyze complex experiments with multiple treatment factors, quantitative weights to account for variations in precision between different observations, and empirical Bayes statistical methods to borrow strength between genes.

Borrowing information between genes is a crucial feature of the genome-wide statistical methods, as it allows for gene-specific variation while still providing reliable inference with small sample sizes. The normal-based empirical Bayes statistical procedures can adapt to different types of datasets and can provide exact type I error rate control even for experiments with a small number of replicate samples [3].

In the past few years, RNA-seq has emerged as a revolutionary new technology for expression profiling [10]. One common approach to summarize RNA-seq data is to count the number of sequence reads mapping to each gene or genomic feature of interest [11-14]. RNA-seq profiles consist therefore of integer counts, unlike microarrays, which yield intensities that are essentially continuous numerical measurements. A number of early RNA-seq publications applied statistical methods developed for microarrays to analyze RNA-seq read counts. For example, the limma package has been used to analyze log-counts after normalization by sequencing depth [11,15-17].

Later statistical publications argued that RNA-seq data should be analyzed by statistical methods designed specifically for counts. Much interest has focused on the negative binomial (NB) distribution as a model for read counts, and especially on the problem of estimating biological variability for experiments with small numbers of replicates. One approach is to fit a global value or global trend to the NB dispersions [13,18,19], although this has the limitation of not allowing for gene-specific variation. A number of empirical Bayes procedures have been proposed to estimate the gene-wise dispersions [20-22]. Alternatively, Lund et al. [23] proposed that the residual deviances from NB generalized linear models be entered into limma’s empirical Bayes procedure to enable quasi-likelihood testing. Other methods based on over-dispersed Poisson models have also been proposed [24-26].

Unfortunately, the mathematical theory of count distributions is less tractable than that of the normal distribution, and this tends to limit both the performance and the usefulness of the RNA-seq analysis methods. One problem relates to error rate control with small sample sizes. Despite the use of probabilistic distributions, all the statistical methods developed for RNA-seq counts rely on approximations of various kinds. Many rely on the statistical tests that are only asymptotically valid or are theoretically accurate only when the dispersion is small. All the differential expression methods currently available based on the NB distribution treat the estimated dispersions as if they were known parameters, without allowing for the uncertainty of estimation, and this leads to statistical tests that are overly liberal in some situations [27,28]. This is true even of the NB exact test [18], which gives exact type I error rate control when the dispersion is known but which becomes liberal when an imprecise dispersion estimator is inserted for the known value. Quasi-likelihood methods [23] account for uncertainty in the dispersion by using an F-test in place of the usual likelihood ratio test, but this relies on other approximations, in particular that the residual deviances are analogous to residual sums of squares from a normal analysis of variance.

A related issue is the ability to adapt to different types of data with high or low dispersion heterogeneity. None of the empirical Bayes methods based on the NB distribution achieve the same adaptability, robustness or small sample properties as the corresponding methods for microarrays, due to the mathematical intractability of count distributions compared to the normal distribution.

The most serious limitation though is the reduced range of statistical tools associated with count distributions compared to the normal distribution. This is more fundamental than the other problems because it limits the types of analyses that can be done. Much of the statistical methodology that has been developed for microarray data relies on use of the normal distribution. For example, we often find it useful in our own microarray gene expression studies to estimate empirical quality weights to downweight poor quality RNA samples [29], to use random effects to allow for repeated measures on the same experimental units [4,5] or to conduct gene set tests for expression signatures while allowing for inter-gene correlations [7,8]. These techniques broaden the

larger standard deviations than small counts. While a logarithmic transformation counteracts this, it overdoes the adjustment somewhat so that large log-counts now have smaller standard deviations than small log-counts. We explore the idea that it is more important to model the mean-variance relationship correctly than it is to specify the exact probabilistic distribution of the counts. There is a body of theory in the statistical literature showing that correct modeling of the mean-variance relationship inherent in a data generating process is the key to designing statistically powerful methods of analysis [30]. Such variance modeling may in fact take precedence over identifying the exact probability law that the data values follow [31-33]. We therefore take the view that it is crucial to understand the way in which the variability of RNA-seq read counts depends on the size of the counts. Our work is in the spirit of pseudo-likelihoods [32] whereby statistical methods based on the normal distribution are applied after estimating a mean-variance function for the data at hand.

Our approach is to estimate the mean-variance relationship robustly and non-parametrically from the data. We work with log-counts normalized for sequence depth, specifically with log-counts per million (log-cpm). The mean-variance is fitted to the gene-wise standard deviations of the log-cpm as a function of average log-count. We explore two ways to incorporate the mean-variance relationship into the differential expression analysis. The first is to modify limma’s empirical Bayes procedure to incorporate a mean-variance trend. The second method incorporates the mean-variance trend into a precision weight for each individual normalized observation. The normalized log-counts and associated precision weights can then be entered into the limma analysis pipeline, or indeed into any statistical pipeline for microarray data that is precision weight aware. We call the first method limma-trend and the second method voom, an acronym for ‘variance modeling at the observational level’. limma-trend applies the mean-variance relationship at the gene level whereas voom applies it at the level of individual observations.

This article compares the performance of the limma-based pipelines to edgeR [20,34], DESeq