Normalization and Variance Stabilization of Single-Cell RNA-Seq Data Using Regularized Negative Binomial Regression Christoph Hafemeister1* and Rahul Satija1,2*

Hafemeister and Satija Genome Biology (2019) 20:296 https://doi.org/10.1186/s13059-019-1874-1 Method Open Access Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression Christoph Hafemeister1* and Rahul Satija1,2* Abstract Single-cell RNA-seq (scRNA-seq) data exhibits significant cell-to-cell variation due to technical factors, including the number of molecules detected in each cell, which can confound biological heterogeneity with technical effects. To address this, we present a modeling framework for the normalization and variance stabilization of molecular count data from scRNA-seq experiments. We propose that the Pearson residuals from “regularized negative binomial regression,” where cellular sequencing depth is utilized as a covariate in a generalized linear model, successfully remove the influence of technical characteristics from downstream analyses while preserving biological heterogeneity. Importantly, we show that an unconstrained negative binomial model may overfit scRNA-seq data, and overcome this by pooling information across genes with similar abundances to obtain stable parameter estimates. Our procedure omits the need for heuristic steps including pseudocount addition or log-transformation and improves common downstream analytical tasks such as variable gene selection, dimensional reduction, and differential expression. Our approach can be applied to any UMI-based scRNA-seq dataset and is freely available as part of the R package sctransform, with a direct interface to our single-cell toolkit Seurat. Keywords: Single-cell RNA-seq, Normalization Introduction sampling during sequencing also contribute significantly, In the analysis and interpretation of single-cell RNA- necessitating technical correction [4]. These same chal- seq (scRNA-seq) data, effective pre-processing and nor- lenges apply to bulk RNA-seq workflows, but are exac- malization represent key challenges. While unsupervised erbated due to the extreme comparative sparsity of analysis of single-cell data has transformative potential scRNA-seq data [5]. to uncover heterogeneous cell types and states, cell- The primary goal of single-cell normalization is to to-cell variation in technical factors can also confound remove the influence of technical effects in the underlying these results [1, 2]. In particular, the observed sequenc- molecular counts, while preserving true biological vari- ing depth (number of genes or molecules detected per ation. Specifically, we propose that a dataset which has cell) can vary significantly between cells, with variation been processed with an effective normalization workflow in molecular counts potentially spanning an order of should have the following characteristics: magnitude, even within the same cell type [3]. Impor- tantly, while the now widespread use of unique molec- 1 In general, the normalized expression level of a gene ular identifiers (UMI) in scRNA-seq removes technical should not be correlated with the total sequencing variation associated with PCR, differences in cell lysis, depth of a cell. Downstream analytical tasks reverse transcription efficiency, and stochastic molecular (dimensional reduction, differential expression) should also not be influenced by variation in sequencing depth. *Correspondence: [email protected]; [email protected] 2 The variance of a normalized gene (across cells) 1New York Genome Center, 101 6th Ave, New York, NY, 10013 USA 2Center for Genomics and Systems Biology, New York University, 12 Waverly Pl, should primarily reflect biological heterogeneity, New York, NY, 10003 USA independent of gene abundance or sequencing depth. © The Author(s). 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Hafemeister and Satija Genome Biology (2019) 20:296 Page 2 of 15 For example, genes with high variance after with similar abundances, we can regularize parameter normalization should be differentially expressed estimates and obtain reproducible error models. The across cell types, while housekeeping genes should residuals of our “regularized negative binomial regres- exhibit low variance. Additionally, the variance of a sion” represent effectively normalized data values that gene should be similar when considering either are no longer influenced by technical characteristics, deeply sequenced cells, or shallowly sequenced cells. but preserve heterogeneity driven by distinct biological states. Lastly, we demonstrate that these normalized Given its importance, there have been a large num- values enable downstream analyses, such as dimension- ber of diverse methods proposed for the normalization of ality reduction and differential expression testing, where scRNA-seq data [6–11]. In general, these fall into two dis- the results are not confounded by cellular sequencing tinct sets of approaches. The first set aims to identify “size depth. Our procedure is broadly applicable for any UMI- factors” for individual cells, as is commonly performed for based scRNA-seq dataset and is freely available to users bulk RNA-seq [12]. For example, BASiCS [7] infers cell- through the open-source R package sctransform specific normalizing constants using spike-ins, in order (github.com/ChristophH/sctransform), with a direct to distinguish technical noise from biological cell-to-cell interface to our single-cell toolkit Seurat. variability. Scran [8] pools cells with similar library sizes and uses the summed expression values to estimate pool- Results based size factors, which are resolved to cell-based size A single scaling factor does not effectively normalize both factors. By performing a uniform scaling per cell, these lowly and highly expressed genes methods assume that the underlying RNA content is con- Sequencing depth variation across single cells represents stant for all cells in the dataset and that a single scaling a substantial technical confounder in the analysis and factor can be applied for all genes. interpretation of scRNA-seq data. To explore the extent Alternative normalization approaches model molecule of this effect and possible solutions, we examined five counts using probabilistic approaches. For example, ini- UMI datasets from diverse tissues, generated with both tial strategies focused on read-level (instead of UMI-level) plate- and droplet-based protocols. We show results on data and modeled the measurement of each cell as a all datasets in Additional file 1,butfocushereona mixture of two components: a negative binomial (NB) dataset of 33,148 human peripheral blood mononuclear “signal” component and a Poisson “dropout” component cells (PBMC) freely available from 10x Genomics. This [13]. For newer measurements based on UMI, model- dataset is characteristic of current scRNA-seq experi- ing strategies have focused primarily on the use of the ments; we observed a median total count of 1891 UMI/cell NB distribution [14], potentially including an additional and observed 16,809 genes that were detected in at least parameter to model zero-inflation (ZINB). For example, 5cells(Fig.1a, b). As expected, we observed a strong lin- ZINB-WaVE [9]modelscountsasZINBinaspecialvari- ear relationship between unnormalized expression (gene ant of factor analysis. scVI and DCA also use the ZINB UMI count) and cellular sequencing depth. We observed noise model [10, 15], either for normalization and dimen- nearly identical trends (and regression slopes) for genes sionality reduction in Bayesian hierarchical models or for across a wide range of abundance levels, after grouping a denoising autoencoder. These pioneering approaches genes into six equal-width bins based on their mean abun- extend beyond pre-processing and normalization, but rely dance (Fig. 1c), demonstrating that counts from both low- on the accurate estimation of per-gene error models. and high-abundance genes are confounded by sequencing In this manuscript, we present a novel statistical depth and require normalization. approach for the modeling, normalization, and vari- We next tested how the standard normalization ance stabilization of UMI count data for scRNA-seq. approach in popular scRNA-seq packages such as We first show that different groups of genes cannot be Seurat [16–18] and SCANPY [19]compensatesforthis normalized by the same constant factor, representing an effect. In this two-step process (referred to as “log- intrinsic challenge for scaling-factor-based normalization normalization” for brevity), UMI counts are first scaled schemes, regardless of how the factors themselves are by the total sequencing depth (“size factors”) followed by calculated. We instead propose to construct a generalized pseudocount addition and log-transformation. While this linear model (GLM) for each gene with UMI counts as approach mitigated the relationship between sequenc- the response and sequencing depth as the explanatory ing depth and gene expression, we found that genes variable. We explore potential error models for the GLM with different

Normalization and Variance Stabilization of Single-Cell RNA-Seq Data Using Regularized Negative Binomial Regression Christoph Hafemeister1* and Rahul Satija1,2*

Journal of Biology Celebrates Its Fifth Anniversary Biomedcentral

Purifying Selection of Long Dsrna Is the First Line of Defense Against False Activation of Innate Immunity Michal Barak1, Hagit T

Tracking the Popularity and Outcomes of All Biorxiv Preprints

Print Special Issue Flyer

Genome Biology (2015) 16:6 DOI 10.1186/S13059-014-0577-X

Biomed Central Open-Access Research That Covers a Broad Range of Disciplines, and Reaches Influencers and Decision Makers

Citation Characterization and Impact Normalization in Bioinformatics Journals

The Genome of the Sparganosis Tapeworm Spirometra Erinaceieuropaei Isolated from the Biopsy of a Migrating Brain Lesion Bennett Et Al

Voom: Precision Weights Unlock Linear Model Analysis Tools for RNA-Seq Read Counts Charity W Law1,2,Yunshunchen1,2,Weishi1,3 and Gordon K Smyth1,4*

A Framework for Transcriptome-Wide Association Studies in Breast Cancer in Diverse Study Populations Arjun Bhattacharya1, Montserrat García-Closas2,3, Andrew F

Tracking the Popularity and Outcomes of All Biorxiv Preprints

Productivity and Influence in Bioinformatics: a Bibliometric Analysis Using Pubmed Central