Sun et al. Genome Biology (2019) 20:269 https://doi.org/10.1186/s13059-019-1898-6

RESEARCH Open Access

Accuracy, robustness and scalability of dimensionality reduction methods for single-cell RNA-seq analysis

Shiquan Sun1,2, Jiaqiang Zhu2, Ying Ma2 and Xiang Zhou2,3*

* Correspondence: [email protected]
2Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA
3Center for Statistical Genetics, University of Michigan, Ann Arbor, MI 48109, USA
Full list of author information is available at the end of the article

Abstract

Background: Dimensionality reduction is an indispensable analytic component for many areas of single-cell RNA sequencing (scRNA-seq) data analysis. Proper dimensionality reduction can allow for effective noise removal and facilitate many downstream analyses that include cell clustering and lineage reconstruction. Unfortunately, despite the critical importance of dimensionality reduction in scRNA-seq analysis and the vast number of dimensionality reduction methods developed for scRNA-seq studies, few comprehensive comparison studies have been performed to evaluate the effectiveness of different dimensionality reduction methods in scRNA-seq.

Results: We aim to fill this critical knowledge gap by providing a comparative evaluation of a variety of commonly used dimensionality reduction methods for scRNA-seq studies. Specifically, we compare 18 different dimensionality reduction methods on 30 publicly available scRNA-seq datasets that cover a range of sequencing techniques and sample sizes. We evaluate the performance of different dimensionality reduction methods for neighborhood preserving in terms of their ability to recover features of the original expression matrix, and for cell clustering and lineage reconstruction in terms of their accuracy and robustness. We also evaluate the computational scalability of different dimensionality reduction methods by recording their computational cost.

Conclusions: Based on the comprehensive evaluation results, we provide important guidelines for choosing dimensionality reduction methods for scRNA-seq data analysis. We also provide all analysis scripts used in the present study at www.xzlab.org/reproduce.html.

Introduction

Single-cell RNA sequencing (scRNA-seq) is a rapidly growing and widely applied technology [1-3]. By measuring gene expression at a single-cell level, scRNA-seq provides an unprecedented opportunity to investigate the cellular heterogeneity of complex tissues [4-8]. However, despite the popularity of scRNA-seq, analyzing scRNA-seq data remains a challenging task. Specifically, due to the low capture efficiency and low sequencing depth per cell in scRNA-seq data, gene expression measurements obtained from scRNA-seq are noisy: collected scRNA-seq gene measurements are often in the form of low expression counts, and in studies not based on unique molecular identifiers, are also paired with an excessive number of zeros known as dropouts [9]. Subsequently, dimensionality reduction methods that transform the original high-dimensional noisy expression matrix into a low-dimensional subspace with enriched signals have become an important data processing step for scRNA-seq analysis [10]. Proper dimensionality reduction can allow for effective noise removal, facilitate data visualization, and enable efficient and effective downstream analysis of scRNA-seq [11].

Dimensionality reduction is indispensable for many types of scRNA-seq analysis. Because of the importance of dimensionality reduction in scRNA-seq analysis, many dimensionality reduction methods have been developed and are routinely used in scRNA-seq software tools that include, but are not limited to, cell clustering tools [12, 13] and lineage reconstruction tools [14]. Indeed, most commonly used scRNA-seq clustering methods rely on dimensionality reduction as the first analytic step [15].


For example, Seurat applies clustering algorithms directly on a low-dimensional space inferred from principal component analysis (PCA) [16]. CIDR improves clustering by improving PCA through imputation [17]. SC3 combines different ways of PCA for consensus clustering [18]. Besides PCA, other dimensionality reduction techniques are also commonly used for cell clustering. For example, nonnegative matrix factorization (NMF) is used in SOUP [19]. Partial least squares is used in scPLS [20]. Diffusion map is used in destiny [21]. Multidimensional scaling (MDS) is used in ascend [22]. Variational inference autoencoder is used in scVI [23]. In addition to cell clustering, most cell lineage reconstruction and developmental trajectory inference algorithms also rely on dimensionality reduction [14]. For example, TSCAN builds cell lineages using a minimum spanning tree based on a low-dimensional PCA space [24]. Waterfall performs k-means clustering in the PCA space to eventually produce linear trajectories [25]. SLICER uses locally linear embedding (LLE) to project the set of cells into a lower-dimension space for reconstructing complex cellular trajectories [26]. Monocle employs either independent components analysis (ICA) or uniform manifold approximation and projection (UMAP) for dimensionality reduction before building the trajectory [27, 28]. Wishbone combines PCA and diffusion maps to allow for bifurcation trajectories [29].

Besides the generic dimensionality reduction methods mentioned in the above paragraph, many dimensionality reduction methods have also been developed recently that are specifically targeted for modeling scRNA-seq data. These scRNA-seq-specific dimensionality reduction methods can account for either the count nature of scRNA-seq data and/or the dropout events commonly encountered in scRNA-seq studies. For example, ZIFA relies on a zero-inflated normal model to model dropout events [30]. pCMF models both dropout events and the mean-variance dependence resulting from the count nature of scRNA-seq data [31]. ZINB-WaVE incorporates additional gene-level and sample-level covariates for more accurate dimensionality reduction [32]. Finally, several deep learning-based dimensionality reduction methods have recently been developed to enable scalable and effective computation in large-scale scRNA-seq data, including data that are collected by 10X Genomics techniques [33] and/or from large consortium studies such as the Human Cell Atlas (HCA) [34, 35]. Common deep learning-based dimensionality reduction methods for scRNA-seq include Dhaka [36], scScope [37], VASC [38], scvis [39], and DCA [40], to name a few.

With all these different dimensionality reduction methods for scRNA-seq data analysis, one naturally wonders which dimensionality reduction method one would prefer for different types of scRNA-seq analysis. Unfortunately, despite the popularity of the scRNA-seq technique, the critical importance of dimensionality reduction in scRNA-seq analysis, and the vast number of dimensionality reduction methods developed for scRNA-seq studies, few comprehensive comparison studies have been performed to evaluate the effectiveness of different dimensionality reduction methods for practical applications. Here, we aim to fill this critical knowledge gap by providing a comprehensive comparative evaluation of a variety of commonly used dimensionality reduction methods for scRNA-seq studies. Specifically, we compared 18 different dimensionality reduction methods on 30 publicly available scRNA-seq data sets that cover a range of sequencing techniques and sample sizes [12, 14, 41]. We evaluated the performance of different dimensionality reduction methods for neighborhood preserving in terms of their ability to recover features of the original expression matrix, and for cell clustering and lineage reconstruction in terms of their accuracy and robustness using different metrics. We also evaluated the computational scalability of different dimensionality reduction methods by recording their computational time. Together, we hope our results can serve as an important guideline for practitioners to choose dimensionality reduction methods in the field of scRNA-seq analysis.

Results

We evaluated the performance of 18 dimensionality reduction methods (Table 1; Additional file 1: Figure S1) on 30 publicly available scRNA-seq data sets (Additional file 1: Table S1-S2) and 2 simulated data sets. Details of these data sets are provided in "Methods and Materials." Briefly, these data sets cover a wide variety of sequencing techniques that include Smart-Seq2 [1] (8 data sets), Smart-Seq [53] (5 data sets), 10X Genomics [33] (6 data sets), inDrop [54] (1 data set), RamDA-seq [55] (1 data set), sci-RNA-seq3 [28] (1 data set), SMARTer [56] (5 data sets), and others [57] (3 data sets). In addition, these data sets cover a range of sample sizes from a couple of hundred cells to tens of thousands of cells. In each data set, we evaluated the ability of different dimensionality reduction methods in preserving the original features of the expression matrix, and, more importantly, their effectiveness for two important single-cell analytic tasks: cell clustering and lineage inference. In particular, we used 14 real data sets together with 2 simulated data sets for dimensionality reduction method comparison in terms of cell clustering performance. We used another set of 14 real data sets for dimensionality reduction method comparison in terms of trajectory inference. We used yet two additional large-scale scRNA-seq data sets to examine the effectiveness and scalability of different dimensionality reduction methods there. In addition, we measured the computing stability of different dimensionality reduction methods and recorded their computation time.

Table 1 List of compared dimensionality reduction methods. We list standard modeling properties for each of the compared dimensionality reduction methods.

No.  Method         Counts  Zero inflation  Non-linear  Efficient  Language  Year  Reference
1    PCA            No      No              No          Yes        R         1901  [42]
2    ICA            No      No              No          No         R         1994  [43]
3    FA             No      No              No          Yes        R         1952  [44]
4    NMF            No      No              No          Yes        R         1999  [45]
5    Poisson NMF    Yes     No              No          Yes        R         1999  [45]
6    Diffusion Map  No      No              Yes         Yes        R         2005  [46]
7    ZIFA           No      Yes             No          No         Python    2016  [30]
8    ZINB-WaVE      Yes     Yes             No          No         R         2018  [32]
9    GLMPCA         Yes     No              No          No         R         2019  [47]
10   pCMF           Yes     Yes             No          No         R         2019  [31]
11   scScope        No      Yes             Yes         Yes        Python    2019  [37]
12   DCA            Yes     Yes             Yes         Yes        Python    2018  [40]
13   tSNE           No      No              Yes         No         R         2008  [48]
14   MDS            No      No              No          Yes        R         1958  [49]
15   LLE            No      No              Yes         Yes        R         2000  [50]
16   LTSA           No      No              Yes         No         R         2004  [51]
17   Isomap         No      No              Yes         Yes        R         2000  [11]
18   UMAP           No      No              Yes         Yes        Python    2019  [52]

These properties include whether the method models count data (3rd column), whether it accounts for zero inflation (4th column), whether it performs a non-linear projection (5th column), its computation efficiency (6th column), implementation language (7th column), year of publication (8th column), and reference (9th column). FA, factor analysis; PCA, principal component analysis; ICA, independent component analysis; NMF, nonnegative matrix factorization; Poisson NMF, Kullback-Leibler divergence-based NMF; ZIFA, zero-inflated factor analysis; ZINB-WaVE, zero-inflated negative binomial-based wanted variation extraction; pCMF, probabilistic count matrix factorization; DCA, deep count autoencoder network; scScope, scalable deep-learning-based approach; GLMPCA, generalized linear model principal component analysis; MDS, multidimensional scaling; LLE, locally linear embedding; LTSA, local tangent space alignment; UMAP, uniform manifold approximation and projection; tSNE, t-distributed stochastic neighbor embedding.

An overview of the comparison workflow is shown in Fig. 1. Because common tSNE software can only extract a small number of low-dimensional components [48, 58, 59], we only included tSNE results based on two low-dimensional components extracted from the recently developed fast FIt-SNE R package [48] in all figures. All data and analysis scripts for reproducing the results in the paper are available at www.xzlab.org/reproduce.html or https://github.com/xzhoulab/DRComparison.

Performance of dimensionality reduction methods for neighborhood preserving

We first evaluated the performance of different dimensionality reduction methods in terms of preserving the original features of the gene expression matrix. To do so, we applied different dimensionality reduction methods to each of the 30 scRNA-seq data sets (28 real data and 2 simulated data; excluding the two large-scale data sets due to computing concerns) and evaluated the performance of these dimensionality reduction methods based on neighborhood preserving. Neighborhood preserving measures how well the local neighborhood structure in the reduced dimensional space resembles that in the original space by computing a Jaccard index [60] (details in "Methods and Materials"). In the analysis, for each dimensionality reduction method and each scRNA-seq data set, we applied the dimensionality reduction method to extract a fixed number of low-dimensional components (e.g., these are the principal components in the case of PCA). We varied the number of low-dimensional components to examine their influence on local neighborhood preserving. Specifically, for each of the 16 cell clustering data sets, we varied the number of low-dimensional components to be either 2, 6, 14, or 20 when the data contains less than or equal to 300 cells, and we varied the number of low-dimensional components to be either 0.5%, 1%, 2%, or 3% of the total number of cells when the data contains more than 300 cells. For each of the 14 trajectory inference data sets, we varied the number of low-dimensional components to be either 2, 6, 14, or 20 regardless of the number of cells. Finally, we also varied the number of neighborhood cells used in the Jaccard index to be either 10, 20, or 30. The evaluation results based on the Jaccard index of neighborhood preserving are summarized in Additional file 1: Figure S2-S14.
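As a concrete illustration of this metric, the following R sketch computes the average Jaccard index between each cell's k-nearest-neighbor set in the original expression space and in the reduced space. Here, `expr` is a hypothetical cells-by-genes matrix, and the sketch reflects the description of the metric above rather than the exact implementation in the released analysis scripts.

    library(FNN)  # fast k-nearest-neighbor search

    # expr: cells-by-genes matrix (e.g., log2 counts)
    # embed: cells-by-components matrix returned by a dimensionality reduction method
    # k: number of neighborhood cells (10, 20, or 30 in our analysis)
    neighborhood_jaccard <- function(expr, embed, k = 30) {
      nn_orig <- get.knn(expr,  k = k)$nn.index   # neighbors in the original space
      nn_red  <- get.knn(embed, k = k)$nn.index   # neighbors in the reduced space
      jac <- sapply(seq_len(nrow(expr)), function(i) {
        length(intersect(nn_orig[i, ], nn_red[i, ])) /
          length(union(nn_orig[i, ], nn_red[i, ]))
      })
      mean(jac)  # average Jaccard index across cells
    }

    # example: PCA with 20 low-dimensional components
    pcs <- prcomp(expr, rank. = 20)$x
    neighborhood_jaccard(expr, pcs, k = 30)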

Fig. 1 Overview of the evaluation workflow for dimensionality reduction methods. We obtained a total of 30 publicly available scRNA-seq data sets from GEO and the 10X Genomics website. We also simulated two additional data sets. For each of the 32 data sets in turn, we applied 18 dimensionality reduction methods to extract the low-dimensional components. Afterwards, we evaluated the performance of the dimensionality reduction methods by evaluating how effective the low-dimensional components extracted from the dimensionality reduction methods are for downstream analysis. We did so by evaluating two commonly applied downstream analyses: clustering analysis and lineage reconstruction analysis. In the analysis, we varied the number of low-dimensional components extracted from these dimensionality reduction methods. The performance of each dimensionality reduction method is quantified by the Jaccard index for neighborhood preserving, normalized mutual information (NMI) and adjusted Rand index (ARI) for cell clustering analysis, and the Kendall correlation coefficient for trajectory inference. We also recorded the stability of each dimensionality reduction method across data splits and recorded the computation time for each dimensionality reduction method. Through the comprehensive evaluation, we eventually provide practical guidelines for practitioners to choose dimensionality reduction methods for scRNA-seq data analysis

In the cell clustering data sets, we found that pCMF achieves the best performance of neighborhood preserving across all data sets and across all included low-dimensional components (Additional file 1: Figure S2-S7). For example, with 30 neighborhood cells and 0.5% of low-dimensional components, pCMF achieves a Jaccard index of 0.25. Its performance is followed by Poisson NMF (0.16), ZINB-WaVE (0.16), Diffusion Map (0.16), MDS (0.15), and tSNE (0.14), while the remaining two methods, scScope (0.1) and LTSA (0.06), do not fare well. Increasing the number of neighborhood cells increases the absolute value of the Jaccard index but does not influence the relative performance of dimensionality reduction methods (Additional file 1: Figure S7). In addition, the relative performance of most dimensionality reduction methods remains largely similar whether we focus on data sets with unique molecular identifiers (UMI) or data sets without UMI (Additional file 1: Figure S8). However, we do notice two exceptions: the performance of pCMF decreases with increasing number of low-dimensional components in UMI data but increases in non-UMI data, and the performance of scScope is higher in UMI data than in non-UMI data.

In the trajectory inference data sets, pCMF again achieves the best performance of neighborhood preserving across all data sets and across all included low-dimensional components (Additional file 1: Figure S9-S14). Its performance is followed closely by scScope and Poisson NMF. For example, with 30 neighborhood cells and 20 low-dimensional components, the Jaccard indices of pCMF, Poisson NMF, and scScope across all data sets are 0.3, 0.28, and 0.26, respectively. Their performance is followed by ZINB-WaVE (0.19), FA (0.18), ZIFA (0.18), GLMPCA (0.18), and MDS (0.18). In contrast, LTSA again does not fare well across all included low-dimensional components (Additional file 1: Figure S14). Again, increasing the number of neighborhood cells increases the absolute value of the Jaccard index but does not influence the relative performance among dimensionality reduction methods (Additional file 1: Figure S9-S14).

We note that the measurement we used in this subsection, neighborhood preserving, is purely for measuring dimensionality reduction performance in terms of preserving the original gene expression matrix and may not be relevant for the single-cell analytic tasks that are the main focus of the present study: a dimensionality reduction method that preserves the original gene expression matrix may not be effective in extracting useful biological information from the expression matrix that is essential for key downstream single-cell applications.

Preserving the original gene expression matrix is rarely the sole purpose of dimensionality reduction methods for single-cell applications: indeed, the original gene expression matrix (which is the best-preserved matrix of itself) is rarely, if ever, used directly in any downstream single-cell applications including clustering and lineage inference, even though it is computationally easy to do so. Therefore, we will focus our main comparison on the two important downstream single-cell applications described below.

Performance of dimensionality reduction methods for cell clustering

As our main comparison, we first evaluated the performance of different dimensionality reduction methods for cell clustering applications. To do so, we obtained 14 publicly available scRNA-seq data sets and simulated two additional scRNA-seq data sets using the Splatter package (Additional file 1: Table S1). Each of the 14 real scRNA-seq data sets contains known cell clustering information, while each of the 2 simulated data sets contains 4 or 8 known cell types. For each dimensionality reduction method and each data set, we applied dimensionality reduction to extract a fixed number of low-dimensional components (e.g., these are the principal components in the case of PCA). We again varied the number of low-dimensional components as in the previous section to examine their influence on cell clustering analysis. We then applied either the hierarchical clustering method, the k-means clustering method, or the Louvain clustering method [61] to obtain the inferred cluster labels. We used both normalized mutual information (NMI) and adjusted Rand index (ARI) values for comparing the true cell labels and the inferred cell labels obtained by clustering methods based on the low-dimensional components.

Cell clustering with different clustering methods

The evaluation results on dimensionality reduction methods based on clustering analysis using the k-means clustering algorithm are summarized in Fig. 2 (for the NMI criterion) and Additional file 1: Figure S15 (for the ARI criterion). Because the results based on either of the two criteria are similar, we will mainly explain the results based on the NMI criterion in Fig. 2. For easy visualization, we also display the results averaged across data sets in Additional file 1: Figure S16. A few patterns are noticeable. First, as one would expect, clustering accuracy depends on the number of low-dimensional components that are used for clustering. Specifically, accuracy is relatively low when the number of included low-dimensional components is very small (e.g., 2 or 0.5%) and generally increases with the number of included components. In addition, accuracy usually saturates once a sufficient number of components is included, though the saturation number of components can vary across data sets and across methods. For example, the average NMI across all data sets and across all methods are 0.61, 0.66, 0.67, and 0.67 for increasingly large numbers of components, respectively. Second, when conditional on using a low number of components, the scRNA-seq-specific dimensionality reduction method ZINB-WaVE and the generic dimensionality reduction methods ICA and MDS often outperform the other methods. For example, with the lowest number of components, the average NMI across all data sets for MDS, ICA, and ZINB-WaVE are 0.82, 0.77, and 0.76, respectively (Additional file 1: Figure S16A). The performance of MDS, ICA, and ZINB-WaVE is followed by LLE (0.75), Diffusion Map (0.71), ZIFA (0.69), PCA (0.68), FA (0.68), tSNE (0.68), NMF (0.59), and DCA (0.57). The remaining four methods, Poisson NMF (0.42), pCMF (0.41), scScope (0.26), and LTSA (0.12), do not fare well with a low number of components. Third, with increasing number of low-dimensional components, generic methods such as FA, ICA, MDS, and PCA are often comparable with scRNA-seq-specific methods such as ZINB-WaVE. For example, with the highest number of low-dimensional components, the average NMI across all data sets for FA, ICA, PCA, ZINB-WaVE, LLE, and MDS are 0.85, 0.84, 0.83, 0.83, 0.82, and 0.82, respectively. Their performance is followed by ZIFA (0.79), NMF (0.73), and DCA (0.69). The same four methods, pCMF (0.55), Poisson NMF (0.31), scScope (0.31), and LTSA (0.06), again do not fare well with a large number of low-dimensional components (Additional file 1: Figure S16A). The comparable results of generic dimensionality reduction methods with scRNA-seq-specific dimensionality reduction methods at a high number of low-dimensional components are also consistent with some previous observations; for example, the original ZINB-WaVE paper observed that PCA can generally yield comparable results with scRNA-seq-specific dimensionality reduction methods in real data [32].
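To make the clustering evaluation pipeline described above concrete, the following R sketch extracts low-dimensional components with PCA, clusters them with k-means, and scores the inferred labels against the known labels. `expr` and `true_labels` are hypothetical inputs, and the aricode package is one of several R packages that implement NMI and ARI.

    library(aricode)  # provides NMI() and ARI()

    n_cells <- nrow(expr)
    # number of components: 2, 6, 14, or 20 for small data (<= 300 cells),
    # or 0.5%, 1%, 2%, or 3% of the number of cells for larger data
    d <- if (n_cells <= 300) 14 else ceiling(0.01 * n_cells)

    pcs <- prcomp(expr, rank. = d)$x   # extracted low-dimensional components
    km  <- kmeans(pcs, centers = length(unique(true_labels)), nstart = 25)

    NMI(km$cluster, true_labels)  # normalized mutual information
    ARI(km$cluster, true_labels)  # adjusted Rand index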

Fig. 2 Dimensionality reduction method performance evaluated by k-means clustering based on NMI in downstream cell clustering analysis. We compared 18 dimensionality reduction methods (columns), including factor analysis (FA), principal component analysis (PCA), independent component analysis (ICA), Diffusion Map, nonnegative matrix factorization (NMF), Poisson NMF, zero-inflated factor analysis (ZIFA), zero-inflated negative binomial-based wanted variation extraction (ZINB-WaVE), probabilistic count matrix factorization (pCMF), deep count autoencoder network (DCA), scScope, generalized linear model principal component analysis (GLMPCA), multidimensional scaling (MDS), locally linear embedding (LLE), local tangent space alignment (LTSA), Isomap, uniform manifold approximation and projection (UMAP), and t-distributed stochastic neighbor embedding (tSNE). We evaluated their performance on 14 real scRNA-seq data sets (UMI-based data are labeled as purple; non-UMI-based data are labeled as blue) and 2 simulated data sets (rows). The simulated data based on the Kumar data is labeled with #. The performance of each dimensionality reduction method is measured by normalized mutual information (NMI). For each data set, we compared four different numbers of low-dimensional components, equal to 0.5%, 1%, 2%, and 3% of the total number of cells in big data and to 2, 6, 14, and 20 in small data (which are labeled with *). For convenience, we only list 0.5%, 1%, 2%, and 3% on the x-axis. No results for ICA are shown in the table (gray fills) because ICA cannot handle the large number of features in that data. No results for LTSA are shown (gray fills) because an error occurred when we applied the clustering method on the low-dimensional components extracted by LTSA there. Note that, for tSNE, we only extracted two low-dimensional components due to the limitation of the tSNE software

Besides the k-means clustering algorithm, we also used the hierarchical clustering algorithm to evaluate the performance of different dimensionality reduction methods (Additional file 1: Figure S17-S19). In this comparison, we had to exclude one dimensionality reduction method, scScope, as hierarchical clustering does not work on the low-dimensional components extracted from scScope. Consistent with the k-means clustering results, we found that the clustering accuracy measured by hierarchical clustering is relatively low when the number of low-dimensional components is very small (e.g., 2 or 0.5%), but generally increases with the number of included components. In addition, consistent with the k-means clustering results, we found that generic dimensionality reduction methods often yield results comparable to or better than scRNA-seq-specific dimensionality reduction methods (Additional file 1: Figure S17-S19). In particular, with a low number of low-dimensional components, MDS achieves the best performance (Additional file 1: Figure S19). With a moderate or high number of low-dimensional components, two generic dimensionality reduction methods, FA and NMF, often outperform various other dimensionality reduction methods across a range of settings. For example, when the number of low-dimensional components is moderate (6 or 1%), both FA and NMF achieve an average NMI value of 0.80 across data sets (Additional file 1: Figure S19A). In this case, their performance is followed by PCA (0.72), Poisson NMF (0.71), ZINB-WaVE (0.71), Diffusion Map (0.70), LLE (0.70), ICA (0.69), ZIFA (0.68), pCMF (0.65), and DCA (0.63). tSNE (0.31) does not fare well, either because it only extracts two-dimensional components or because it does not pair well with hierarchical clustering. We note, however, that the clustering results obtained by hierarchical clustering are often slightly worse than those obtained by k-means clustering across settings (e.g., Additional file 1: Figure S16 vs Additional file 1: Figure S19), consistent with the fact that many scRNA-seq clustering methods use k-means as a key ingredient [18, 25].
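For the hierarchical clustering variant, the k-means call in the sketch above can be swapped for standard R hierarchical clustering on the same components; the linkage shown here is one common choice, not necessarily the setting used in the study.

    # hierarchical clustering on the extracted low-dimensional components
    hc        <- hclust(dist(pcs), method = "ward.D2")  # Ward linkage as an example
    hc_labels <- cutree(hc, k = length(unique(true_labels)))
    aricode::NMI(hc_labels, true_labels)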

Finally, besides the k-means and hierarchical clustering methods, we also performed clustering analysis based on a community detection algorithm, the Louvain clustering method [61]. Unlike the k-means and hierarchical clustering methods, the Louvain method does not require a predefined number of clusters and can infer the number of clusters in an automatic fashion. Following software recommendations [28, 61], we set the k-nearest neighbor parameter in the Louvain method to be 50 for graph building in the analysis. We measured dimensionality reduction performance again by either average NMI (Additional file 1: Figure S20) or ARI (Additional file 1: Figure S21). Consistent with the k-means clustering results, we found that the clustering accuracy measured by the Louvain method is relatively low when the number of low-dimensional components is very small (e.g., 2 or 0.5%), but generally increases with the number of included components. With a low number of low-dimensional components, ZINB-WaVE (0.72) achieves the best performance (Additional file 1: Figure S20-S22). With a moderate or high number of low-dimensional components, two generic dimensionality reduction methods, FA and MDS, often outperform various other dimensionality reduction methods across a range of settings (Additional file 1: Figure S20-S22). For example, when the number of low-dimensional components is high (6 or 1%), FA achieves an average NMI value of 0.77 across data sets (Additional file 1: Figure S22A). In this case, its performance is followed by NMF (0.76), MDS (0.75), GLMPCA (0.74), LLE (0.74), PCA (0.73), ICA (0.73), ZIFA (0.72), and ZINB-WaVE (0.72). Again consistent with the k-means clustering results, scScope (0.32) and LTSA (0.21) do not fare well. We also note that the clustering results obtained by the Louvain method are often slightly worse than those obtained by k-means clustering and slightly better than those obtained by hierarchical clustering across settings (e.g., Additional file 1: Figure S16 vs Additional file 1: Figure S19 vs Additional file 1: Figure S22).
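A minimal sketch of this graph-based step, reusing the hypothetical `pcs` and `true_labels` from the sketches above, is to build a k-nearest-neighbor graph (k = 50, as above) on the low-dimensional components and run Louvain community detection with igraph; the exact graph construction used in the study may differ.

    library(FNN)
    library(igraph)

    k  <- 50
    nn <- get.knn(pcs, k = k)$nn.index
    # undirected edge list connecting each cell to its k nearest neighbors
    edges <- cbind(rep(seq_len(nrow(pcs)), each = k), as.vector(t(nn)))
    g <- simplify(graph_from_edgelist(edges, directed = FALSE))
    louvain_labels <- as.vector(membership(cluster_louvain(g)))  # no preset cluster number
    aricode::NMI(louvain_labels, true_labels)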
Normalization does not influence the performance of dimensionality reduction methods

While some dimensionality reduction methods (e.g., Poisson NMF, ZINB-WaVE, pCMF, and DCA) directly model count data, many dimensionality reduction methods (e.g., PCA, ICA, FA, NMF, MDS, LLE, LTSA, Isomap, Diffusion Map, UMAP, and tSNE) require normalized data. The performance of dimensionality reduction methods that use normalized data may depend on how the data are normalized. Therefore, we investigated how different normalization approaches impact the performance of the aforementioned dimensionality reduction methods that use normalized data. We examined two alternative data transformation approaches, log2 CPM (count per million; 11 dimensionality reduction methods) and z-score (10 dimensionality reduction methods), in addition to the log2 count we used in the previous results (transformation details are provided in "Methods and Materials"). The evaluation results are summarized in Additional file 1: Figure S23-S30 and are generally insensitive to the transformation approach deployed. For example, with the k-means clustering algorithm, when the number of low-dimensional components is small (1%), PCA achieves an NMI value of 0.82, 0.82, and 0.81 for the log2 count transformation, log2 CPM transformation, and z-score transformation, respectively (Additional file 1: Figure S16A, S26A, and S30A). Similar results hold for the hierarchical clustering algorithm (Additional file 1: Figure S16B, S26B, and S30B) and the Louvain clustering method (Additional file 1: Figure S16C, S26C, and S30C). Therefore, different data transformation approaches do not appear to substantially influence the performance of dimensionality reduction methods.
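The three transformations can be written in a few lines of R. Here, `counts` is a hypothetical genes-by-cells raw count matrix; the exact pseudocount and scaling conventions used in the study are described in "Methods and Materials".

    log2_count <- log2(counts + 1)                    # log2 count transformation

    cpm      <- t(t(counts) / colSums(counts)) * 1e6  # scale each cell to counts per million
    log2_cpm <- log2(cpm + 1)                         # log2 CPM transformation

    zscore <- t(scale(t(log2_count)))                 # z-score each gene across cells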

Performance of dimensionality reduction methods in UMI vs non-UMI-based data sets

scRNA-seq data generated from UMI-based technologies (e.g., 10X Genomics) are often of large scale, come with almost no amplification bias, do not display apparent dropout events, and can be accounted for by over-dispersed Poisson distributions. In contrast, data generated from non-UMI-based techniques (e.g., Smart-Seq2) are often of small scale, have a high capture rate, and come with excessive dropout events. Subsequently, the unwanted variation from these two types of data sets can be quite different. To investigate how different dimensionality reduction methods perform in these two different types of data sets, we grouped the 14 cell clustering data sets into a UMI-based group (7 data sets) and a non-UMI-based group (7 data sets). In the UMI-based data sets, we found that many dimensionality reduction methods perform reasonably well and their performance is relatively stable across a range of included low-dimensional components (Additional file 1: Figure S31A). For example, with the lowest number of low-dimensional components, the average NMI of PCA, ICA, FA, NMF, GLMPCA, ZINB-WaVE, and MDS are 0.73, 0.73, 0.73, 0.73, 0.74, and 0.75, respectively. Their performance remains similar with increasing number of low-dimensional components. However, a few dimensionality reduction methods, including Poisson NMF, pCMF, scScope, and LTSA, all have extremely low performance across settings. In the non-UMI-based data sets, the same set of dimensionality reduction methods perform reasonably well, though their performance can vary with respect to the number of low-dimensional components (Additional file 1: Figure S31B). For example, with a low number of low-dimensional components, five dimensionality reduction methods, MDS, UMAP, ZINB-WaVE, ICA, and tSNE, perform reasonably well. The average NMI of these methods are 0.83, 0.81, 0.80, 0.78, and 0.77, respectively. With increasing number of low-dimensional components, four additional dimensionality reduction methods, PCA, ICA, FA, and ZINB-WaVE, also start to catch up. However, a similar set of dimensionality reduction methods, including GLMPCA, Poisson NMF, scScope, LTSA, and occasionally pCMF, do not perform well in these non-UMI data sets.

Visualization of clustering results

We visualized the cell clustering results in two example data sets: the Kumar data, which is non-UMI based, and the PBMC3k data, which is UMI based. The Kumar data consists of mouse embryonic stem cells cultured in three different media, while the PBMC3k data consists of 11 blood cell types (data details in Additional file 1). Here, we extracted 20 low-dimensional components in the Kumar data and 32 low-dimensional components in the PBMC3k data with different dimensionality reduction methods. We then performed tSNE analysis on these low-dimensional components to extract the two tSNE components for visualization (Additional file 1: Figure S32-S33). Importantly, we found that the tSNE visualization results are not always consistent with clustering performance for different dimensionality reduction methods. For example, in the Kumar data, the low-dimensional spaces constructed by FA, pCMF, and MDS often yield clear clustering visualization with distinct clusters (Additional file 1: Figure S32), consistent with their good performance in clustering (Fig. 2). However, the low-dimensional spaces constructed by PCA, ICA, and ZIFA often do not yield clear clustering visualization (Additional file 1: Figure S32), even though these methods all achieve high cell clustering performance (Fig. 2). Similarly, in the PBMC3k data set, FA and MDS perform well in clustering visualization (Additional file 1: Figure S33), which is consistent with their good performance in clustering analysis (Fig. 2). However, PCA and ICA do not fare well in clustering visualization (Additional file 1: Figure S33), even though both of them achieve high clustering performance (Fig. 2). The inconsistency between cluster visualization and clustering performance highlights the difference in the analytic goals of these two analyses: cluster visualization emphasizes extracting as much information as possible using only the top two-dimensional components, while clustering analysis often requires a much larger number of low-dimensional components to achieve accurate performance. Subsequently, dimensionality reduction methods for data visualization may not fare well for cell clustering, and dimensionality reduction methods for cell clustering may not fare well for data visualization [20].

Rare cell type identification

So far, we have focused on clustering performance in terms of assigning all cells to cell types without distinguishing whether the cells belong to a rare population or a non-rare population. Identifying rare cell populations can be of significant interest in certain applications, and the performance of rare cell type identification may not always be in line with general clustering performance [62, 63]. Here, we examine the effectiveness of different dimensionality reduction methods in facilitating the detection of rare cell populations. To do so, we focused on the PBMC3k data from 10X Genomics [33]. The PBMC3k data were measured on 3205 cells with 11 cell types. We considered the CD34+ cell type (17 cells) as the rare cell population. We paired the rare cell population with either CD19+ B cells (406 cells) or CD4+/CD25 T Reg cells (198 cells) to construct two data sets with different rare cell proportions. We named these two data sets PBMC3k1Rare1 and PBMC3k1Rare2, respectively. We then applied different dimensionality reduction methods to each data set and used the F-measure to quantify the performance of rare cell type detection following [64, 65] (details in "Methods and Materials"). The results are summarized in Additional file 1: Figure S34-S35.

Overall, we found that Isomap achieves the best performance for rare cell type detection across a range of low-dimensional components in both data sets with different rare cell type proportions. As expected, the ability to detect rare cell populations increases with increasing rare cell proportions. In the PBMC3k1Rare1 data, the F-measures by Isomap with four different numbers of low-dimensional components (0.5%, 1%, 2%, and 3%) are 0.74, 0.79, 0.79, and 0.79, respectively (Additional file 1: Figure S34). The performance of Isomap is followed by ZIFA (0.74, 0.74, 0.74, and 0.74) and GLMPCA (0.74, 0.74, 0.73, and 0.74). In the PBMC3k1Rare2 data, the F-measures by Isomap with four different numbers of low-dimensional components (0.5%, 1%, 2%, and 3%) are 0.79, 0.79, 0.79, and 0.79, respectively (Additional file 1: Figure S35). The performance of Isomap is also followed by ZIFA (0.74, 0.74, 0.74, and 0.74) and GLMPCA (0.74, 0.74, 0.74, and 0.74). Among the remaining methods, Poisson NMF, pCMF, scScope, and LTSA do not fare well for rare cell type detection. We note that many dimensionality reduction methods in conjunction with the Louvain clustering method often yield an F-measure of zero when the rare cell type proportion is low (Additional file 1: Figure S34C; PBMC3kRare1, 4.0% CD34+ cells) and only become reasonable with increasingly large rare cell type proportions (Additional file 1: Figure S35C; PBMC3kRare2, 7.9% CD34+ cells). The poor performance of the Louvain clustering method for rare cell type detection is likely because its automatic way of determining the cell cluster number does not fare well in the presence of uneven/unbalanced cell type proportions.
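As an illustration, the F-measure for a rare population can be computed as the best harmonic mean of precision and recall over the inferred clusters; this sketch reflects the general definition used in [64, 65], reusing `km` and `true_labels` from the clustering sketch above as hypothetical inputs.

    # F-measure for rare cell type detection: score every inferred cluster against
    # the rare population and report the best harmonic mean of precision and recall
    rare_f_measure <- function(cluster_labels, is_rare) {
      f <- sapply(unique(cluster_labels), function(cl) {
        detected  <- cluster_labels == cl
        precision <- sum(detected & is_rare) / sum(detected)
        recall    <- sum(detected & is_rare) / sum(is_rare)
        if (precision + recall == 0) 0 else 2 * precision * recall / (precision + recall)
      })
      max(f)
    }

    rare_f_measure(km$cluster, true_labels == "CD34+")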

Stability analysis across data splits

Finally, we investigated the stability and robustness of different dimensionality reduction methods. To do so, we randomly split the Kumar data into two subsets with an equal number of cells for each cell type in the two subsets. We applied each dimensionality reduction method to the two subsets and measured the clustering performance in each subset separately. We repeated the procedure 10 times to capture the potential stochasticity during the data split. We visualized the clustering performance of different dimensionality reduction methods in the two subsets separately. Such visualization allows us to check the effectiveness of dimensionality reduction methods with respect to reduced sample size in the subset, as well as the stability/variability of dimensionality reduction methods across different split replicates (Additional file 1: Figure S36). The results show that six dimensionality reduction methods, PCA, ICA, FA, ZINB-WaVE, MDS, and UMAP, often achieve both accurate clustering performance and highly stable and consistent results across the subsets. The accurate and stable performance of ICA, ZINB-WaVE, MDS, and UMAP is notable even with a relatively small number of low-dimensional components. For example, with a very small number of low-dimensional components, ICA, ZINB-WaVE, MDS, and UMAP achieve an average NMI value of 0.98 across the two subsets, with virtually no performance variability across data splits (Additional file 1: Figure S36).
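The split itself amounts to stratified sampling within each cell type; a minimal sketch, reusing the hypothetical `expr` and `true_labels` from above, is:

    set.seed(1)
    # stratified split: half of the cells of each cell type go into each subset
    idx_a <- unlist(lapply(split(seq_along(true_labels), true_labels),
                           function(i) sample(i, floor(length(i) / 2))))
    subset_a <- expr[idx_a, ]   # apply dimensionality reduction and clustering to each
    subset_b <- expr[-idx_a, ]  # subset separately, then compare NMI across subsets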
Overall, the results suggest that, in terms of downstream clustering analysis accuracy and stability, PCA, FA, NMF, and ICA are preferable across the range of data sets examined here. In addition, scRNA-seq-specific dimensionality reduction methods such as ZINB-WaVE, GLMPCA, and UMAP are also preferable if one is interested in extracting a small number of low-dimensional components, while generic methods such as PCA or FA are also preferred when one is interested in extracting a large number of low-dimensional components.

Performance of dimensionality reduction methods for trajectory inference

We evaluated the performance of different dimensionality reduction methods for lineage inference applications (details in "Methods and Materials"). To do so, we obtained 14 publicly available scRNA-seq data sets, each of which contains known lineage information (Additional file 1: Table S2). The known lineages in all these data are linear, without bifurcation or multifurcation patterns. For each data set, we applied one dimensionality reduction method at a time to extract a fixed number of low-dimensional components. In the process, we varied the number of low-dimensional components from 2, 6, 14, to 20 to examine their influence on downstream analysis. With the extracted low-dimensional components, we applied two commonly used trajectory inference methods: Slingshot [66] and Monocle3 [28, 67]. Slingshot is a clustering-dependent trajectory inference method, which requires additional cell label information. We therefore first used either the k-means clustering algorithm, hierarchical clustering, or the Louvain method to obtain cell type labels, where the number of cell types in the clustering was set to the known truth. Afterwards, we supplied the low-dimensional components and cell type labels to Slingshot to infer the lineage. Monocle3 is a clustering-free trajectory inference method, which only requires the low-dimensional components and the trajectory starting state as inputs. We set the trajectory starting state to the known truth for Monocle3. Following [66], we evaluated the performance of dimensionality reduction methods by the Kendall correlation coefficient (details in "Methods and Materials"), which compares the true lineage and the inferred lineage obtained based on the low-dimensional components. In this comparison, we also excluded one dimensionality reduction method, scScope, which is not compatible with Slingshot. The lineage inference results for the remaining dimensionality reduction methods are summarized in Fig. 3 and Additional file 1: Figure S37-S54.
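A hedged sketch of this pipeline for one data set, using the Bioconductor slingshot package with k-means labels and scoring the inferred pseudotime against the known ordering, is shown below; it reuses the hypothetical `pcs` from earlier sketches, and `n_true_types` and `true_order` (the known linear lineage encoded as a rank per cell) are also hypothetical inputs.

    library(slingshot)

    cl  <- kmeans(pcs, centers = n_true_types, nstart = 25)$cluster
    sds <- slingshot(pcs, clusterLabels = cl)   # infer the lineage on the components
    pt  <- slingPseudotime(sds)[, 1]            # pseudotime along the first lineage

    # compare inferred and true orderings; the direction of a lineage is arbitrary,
    # so the absolute Kendall correlation is reported
    abs(cor(pt, true_order, method = "kendall", use = "complete.obs"))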

Fig. 3 Dimensionality reduction method performance evaluated by Kendall correlation in the downstream trajectory inference analysis. We compared 17 dimensionality reduction methods (columns), including factor analysis (FA), principal component analysis (PCA), independent component analysis (ICA), Diffusion Map, nonnegative matrix factorization (NMF), Poisson NMF, zero-inflated factor analysis (ZIFA), zero-inflated negative binomial-based wanted variation extraction (ZINB-WaVE), probabilistic count matrix factorization (pCMF), deep count autoencoder network (DCA), generalized linear model principal component analysis (GLMPCA), multidimensional scaling (MDS), locally linear embedding (LLE), local tangent space alignment (LTSA), Isomap, uniform manifold approximation and projection (UMAP), and t-distributed stochastic neighbor embedding (tSNE). We evaluated their performance on 14 real scRNA-seq data sets (rows) in terms of lineage inference accuracy. We used Slingshot with k-means as the initial step for lineage inference. The performance of each dimensionality reduction method is measured by Kendall correlation. For each data set, we compared four different numbers of low-dimensional components (2, 6, 14, and 20; four sub-columns under each column). Gray fills in the table represent missing results, where Slingshot returned errors when we supplied the extracted low-dimensional components from the corresponding dimensionality reduction method. Note that, for tSNE, we only extracted two low-dimensional components due to the limitation of the tSNE software

Trajectory inference by Slingshot

We first focused on the comparison results obtained from Slingshot. Different from the clustering results, where accuracy generally increases with increasing number of included low-dimensional components, the lineage tracing results from Slingshot do not show a clear increasing pattern with respect to the number of low-dimensional components, especially when we used k-means clustering as the initial step (Fig. 3 and Additional file 1: Figure S39A). For example, the average Kendall correlations across all data sets and across all methods are 0.35, 0.36, 0.37, and 0.37 for increasingly large numbers of components, respectively. When we used the hierarchical clustering algorithm as the initial step, the lineage tracing results in the case of a small number of low-dimensional components are slightly inferior compared to the results obtained using a large number of low-dimensional components (Additional file 1: Figure S37 and S39B). However, we do note that the lineage tracing results obtained using k-means are better than those obtained using hierarchical clustering as the initial step. In addition, perhaps somewhat surprisingly, the lineage tracing results obtained using the Louvain clustering method are slightly better than the results obtained using k-means clustering (Additional file 1: Figure S38 and S39C), even though the clustering results from k-means are generally better than those from Louvain.

For example, the average Kendall correlations obtained using the Louvain method across all data sets and across all methods are 0.36, 0.38, 0.40, and 0.40 for increasingly large numbers of components, respectively. Therefore, the Louvain method is recommended as the initial step for lineage inference, and a small number of low-dimensional components there is often sufficient for accurate results. When conducting lineage inference based on a low number of components with the Louvain method, we found that four dimensionality reduction methods, PCA, FA, ZINB-WaVE, and UMAP, all perform well for lineage inference across varying numbers of low-dimensional components (Additional file 1: Figure S39C). For example, with the lowest number of components, the average Kendall correlations across data sets for PCA, FA, UMAP, and ZINB-WaVE are 0.44, 0.43, 0.40, and 0.43, respectively. Their performance is followed by ICA (0.37), ZIFA (0.36), tSNE (0.33), and Diffusion Map (0.38), while pCMF (0.26), Poisson NMF (0.26), and LTSA (0.12) do not fare well.

Trajectory inference by Monocle3

We next examined the comparison results based on Monocle3 (Additional file 1: Figure S40-S41). Similar to Slingshot, we found that the lineage tracing results from Monocle3 also do not show a clear increasing pattern with respect to the number of low-dimensional components (Additional file 1: Figure S41). For example, the average Kendall correlations across all data sets and across all methods are 0.37, 0.37, 0.38, and 0.37 for increasingly large numbers of components, respectively. Therefore, similar to Slingshot, we also recommend the use of a small number of low-dimensional components with Monocle3.

In terms of dimensionality reduction method performance, we found that five dimensionality reduction methods, FA, MDS, GLMPCA, ZINB-WaVE, and UMAP, all perform well for lineage inference. Their performance is often followed by NMF and DCA, while Poisson NMF, pCMF, LLE, and LTSA do not fare well. The dimensionality reduction comparison results based on Monocle3 are in line with the recommendations of the Monocle3 software, which uses UMAP as its default dimensionality reduction method [28]. In addition, the set of five top dimensionality reduction methods for Monocle3 is largely consistent with the set of top five dimensionality reduction methods for Slingshot, with only one method differing between the two (GLMPCA in place of PCA). The similarity of top dimensionality reduction methods based on different lineage inference methods suggests that a similar set of dimensionality reduction methods is likely suitable for lineage inference in general.

Visualization of inferred lineages

We visualized the reduced low-dimensional components from different dimensionality reduction methods in one trajectory data set, the ZhangBeta data. The ZhangBeta data consists of expression measurements on mouse pancreatic β cells collected at seven different developmental stages. These seven cell stages include E17.5, P0, P3, P9, P15, P18, and P60. We applied different dimensionality reduction methods to the data to extract the first two-dimensional components. Afterwards, we performed lineage inference and visualization using Monocle3. The inferred tracking paths are shown in Additional file 1: Figure S42. Consistent with the Kendall correlation (Fig. 3), all top dimensionality reduction methods are able to infer the correct lineage path. For example, the trajectories from GLMPCA and UMAP completely match the truth. The trajectories inferred from FA, NMF, or ZINB-WaVE largely match the truth with small bifurcations. In contrast, the trajectories inferred from either Poisson NMF or LTSA display unexpected radical patterns (Additional file 1: Figure S42), again consistent with the poor performance of these two methods in lineage inference.

Normalization does not influence the performance of dimensionality reduction methods

For dimensionality reduction methods that require normalized data, we further examined the influence of different data transformation approaches on their performance (Additional file 1: Figure S43-S53). As in the clustering comparison, we found that different transformations do not influence the performance results for most dimensionality reduction methods in lineage inference. For example, in Slingshot with the k-means clustering algorithm as the initial step, when the number of low-dimensional components is small, UMAP achieves a Kendall correlation of 0.42, 0.43, and 0.40 for the log2 count transformation, log2 CPM transformation, and z-score transformation, respectively (Additional file 1: Figure S39A, S46A, and S50A). Similar results hold for the hierarchical clustering algorithm (Additional file 1: Figure S39B, S46B, and S50B) and the Louvain method (Additional file 1: Figure S39B, S46B, and S50B). However, some notable exceptions exist. For example, with the log2 CPM transformation but not the other transformations, the performance of Diffusion Map increases with increasing number of included components when k-means clustering was used as the initial step: the average Kendall correlations across different low-dimensional components are 0.37, 0.42, 0.44, and 0.47, respectively (Additional file 1: Figure S43 and S46A). As another example, with the z-score transformation but not the other transformations, FA achieves the highest performance among all dimensionality reduction methods across different numbers of low-dimensional components (Additional file 1: Figure S50A). Similarly, in Monocle3, different transformations (log2 count transformation, log2 CPM transformation, and z-score transformation) do not influence the performance of dimensionality reduction methods. For example, with the lowest number of low-dimensional components, UMAP achieves a Kendall correlation of 0.49, 0.47, and 0.47 for the log2 count transformation, log2 CPM transformation, and z-score transformation, respectively (Additional file 1: Figure S41, S53A, and S53B).

Stability analysis across data splits

We also investigated the stability and robustness of different dimensionality reduction methods by data split in the Hayashi data. We applied each dimensionality reduction method to the two subsets and measured the lineage inference performance in the two subsets separately. We again visualized the clustering performance of different dimensionality reduction methods in the two subsets, separately. Such visualization allows us to check the effectiveness of dimensionality reduction methods with respect to reduced sample size in the subset, as well as the stability/variability of dimensionality reduction methods across different split replicates (Additional file 1: Figure S54). The results show that four dimensionality reduction methods, FA, Diffusion Map, ZINB-WaVE, and MDS, often achieve both accurate performance and highly stable and consistent results across the subsets. The accurate and stable performance of these methods is notable even with a relatively small number of low-dimensional components. For example, with a very small number of low-dimensional components, FA, Diffusion Map, ZINB-WaVE, and MDS achieve a Kendall correlation of 0.75, 0.77, 0.77, and 0.78 averaged across the two subsets, respectively, again with virtually no performance variability across data splits (Additional file 1: Figure S54).

Overall, the results suggest that, in terms of downstream lineage inference accuracy and stability, the scRNA-seq non-specific dimensionality reduction methods FA, PCA, and NMF are preferable across the range of data sets examined here. The scRNA-seq-specific dimensionality reduction method ZINB-WaVE as well as the scRNA-seq non-specific dimensionality reduction method NMF are also preferable if one is interested in extracting a small number of low-dimensional components for lineage inference. In addition, the scRNA-seq-specific dimensionality reduction method Diffusion Map and the scRNA-seq non-specific dimensionality reduction method MDS may also be preferable if one is interested in extracting a large number of low-dimensional components for lineage inference.

Large-scale scRNA-seq data applications

Finally, we evaluated the performance of different dimensionality reduction methods in two large-scale scRNA-seq data sets. The first data set is Guo et al. [68], which consists of 12,346 single cells collected through a non-UMI-based sequencing technique. The Guo et al. data contains known cell cluster information and is thus used for dimensionality reduction method comparison based on cell clustering analysis. The second data set is Cao et al. [28], which consists of approximately 2 million single cells collected through a UMI-based sequencing technique. The Cao et al. data contains known lineage information and is thus used for dimensionality reduction method comparison based on trajectory inference. Since many dimensionality reduction methods are not scalable to these large-scale data sets, in addition to applying dimensionality reduction methods to the two data sets directly, we also coupled them with a recently developed sub-sampling procedure, dropClust, to make all dimensionality reduction methods applicable to large data [69] (details in "Methods and Materials"). We focused our comparison in the large-scale data on the k-means clustering method. We also used the log2 count transformation for dimensionality reduction methods that require normalized data.

The comparison results when we directly applied dimensionality reduction methods to the Guo et al. data are shown in Additional file 1: Figure S55. Among the methods that are directly applicable to large-scale data sets, we found that UMAP consistently outperforms the remaining dimensionality reduction methods across a range of low-dimensional components by a large margin. For example, the average NMI of UMAP across different numbers of low-dimensional components (0.5%, 1%, 2%, and 3%) are in the range between 0.60 and 0.61 (Additional file 1: Figure S55A). In contrast, the average NMI for the other methods are in the range of 0.15–0.51. In the case of a small number of low-dimensional components, we found that the performance of both FA and NMF is reasonable and follows right after UMAP. With the sub-sampling procedure, we can scale all dimensionality reduction methods relatively easily to this large-scale data (Additional file 1: Figure S56). As a result, several dimensionality reduction methods, most notably FA, can achieve similar or better performance compared to UMAP. However, we do notice an appreciable performance loss for many dimensionality reduction methods through the sub-sampling procedure. For example, the NMI of UMAP in the sub-sampling-based procedure is only 0.26, representing an approximately 56% performance loss compared to the direct application of UMAP without sub-sampling (Additional file 1: Figure S56 vs Figure S55). Therefore, we caution against the blind use of the sub-sampling procedure and recommend that users carefully examine the performance of dimensionality reduction methods before and after sub-sampling to decide whether the sub-sampling procedure is acceptable for their own applications.

For lineage inference in the Cao et al. data, due to computational constraints, we randomly obtained 10,000 cells from each of the five different developmental stages (i.e., E9.5, E10.5, E11.5, E12.5, and E13.5) and applied different dimensionality reduction methods to analyze the final set of 50,000 cells. Because most dimensionality reduction methods are not scalable even to these 50,000 cells, we only examined the performance of dimensionality reduction methods when paired with the sub-sampling procedure (Additional file 1: Figure S57).
With a small number of low-dimensional components, three dimensionality reduction methods, GLMPCA, DCA, and Isomap, all achieve better performance than the other dimensionality reduction methods. For example, with the lowest number of low-dimensional components, the average absolute Kendall correlations of GLMPCA, DCA, and Isomap are 0.13, 0.28, and 0.17, respectively. In contrast, the average absolute Kendall correlations of the other dimensionality reduction methods are in the range of 0.01–0.12. With a higher number of low-dimensional components, Isomap and UMAP show better performance. For example, with 3% low-dimensional components, the average absolute Kendall correlations of Isomap and UMAP increase to 0.17 and 0.30, respectively. Their performance is followed by Diffusion Map (0.15), ZINB-WaVE (0.14), and LLE (0.12), while the remaining methods are in the range of 0.04–0.07.

Computation time
We recorded and compared computing time for different dimensionality reduction methods on simulated data sets.

Here, we also examined how the computation time for different dimensionality reduction methods varies with respect to the number of low-dimensional components extracted (Fig. 4a) as well as with respect to the number of cells contained in the data (Fig. 4b). Overall, the computational cost of three methods, ZINB-WaVE, ZIFA, and pCMF, is substantially heavier than that of the remaining methods. Their computation time increases substantially with both an increasingly large number of low-dimensional components and an increasingly large number of cells in the data. Specifically, when the sample size equals 500 and the desired number of low-dimensional components equals 22, the computing times for ZINB-WaVE, ZIFA, and pCMF to analyze 10,000 genes are 2.15, 1.33, and 1.95 h, respectively (Fig. 4a). When the sample size increases to 10,000, the computing times for ZINB-WaVE, ZIFA, and pCMF increase to 12.49, 20.50, and 15.95 h, respectively (Fig. 4b). Similarly, when the number of low-dimensional components increases to 52, the computing times for ZINB-WaVE, ZIFA, and pCMF increase to 4.56, 4.27, and 4.62 h, respectively. Besides these three methods, the computing cost of ICA, GLMPCA, and Poisson NMF can also increase noticeably with an increasingly large number of low-dimensional components. The computing cost of ICA, and, to a lesser extent, that of GLMPCA, LLE, LTSA, and Poisson NMF, also increases substantially with an increasingly large number of cells. In contrast, PCA, FA, Diffusion Map, UMAP, and the two deep-learning-based methods (DCA and scScope) are computationally efficient. In particular, the computation times for these six methods are stable and do not show substantial dependence on the sample size or the number of low-dimensional components. Certainly, we expect that the computation time of all dimensionality reduction methods will further increase as the sample size of scRNA-seq data sets increases in magnitude.
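As a concrete illustration of the timing protocol (a single thread, elapsed wall-clock time converted to hours), here is a minimal R sketch with stats::prcomp standing in for any of the 18 compared methods; the simulated matrix mirrors the Fig. 4a setting of 10,000 genes, 500 cells, and 22 components:

```r
# Minimal timing sketch: 10,000 genes x 500 cells, 22 components.
set.seed(1)
counts <- matrix(rpois(10000 * 500, lambda = 1), nrow = 10000)
logc   <- log2(counts + 1)                # simple log2 count transform
t_sec  <- system.time(
  prcomp(t(logc), rank. = 22)             # PCA as a stand-in method
)["elapsed"]
t_sec / 3600                              # report in hours, as in Fig. 4
```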

Fig. 4 The computation time (in hours) for different dimensionality reduction methods. We recorded computing time for 18 dimensionality reduction methods on simulated data sets with a varying number of low-dimensional components and varying sample sizes. Compared dimensionality reduction methods include factor analysis (FA; light green), principal component analysis (PCA; light blue), independent component analysis (ICA; blue), Diffusion Map (pink), nonnegative matrix factorization (NMF; green), Poisson NMF (light orange), zero-inflated factor analysis (ZIFA; light pink), zero-inflated negative binomial-based wanted variation extraction (ZINB-WaVE; orange), probabilistic count matrix factorization (pCMF; light purple), deep count autoencoder network (DCA; yellow), scScope (purple), generalized linear model principal component analysis (GLMPCA; red), multidimensional scaling (MDS; cyan), locally linear embedding (LLE; blue green), local tangent space alignment (LTSA; teal blue), Isomap (gray), uniform manifold approximation and projection (UMAP; brown), and t-distributed stochastic neighbor embedding (tSNE; dark red). a Computation time for different dimensionality reduction methods (y-axis) with respect to an increasing number of low-dimensional components (x-axis). The number of cells is fixed at 500 and the number of genes is fixed at 10,000 in this set of simulations. Three methods (ZINB-WaVE, pCMF, and ZIFA) become noticeably more computationally expensive than the remaining methods with an increasing number of low-dimensional components. b Computation time for different dimensionality reduction methods (y-axis) with respect to an increasing sample size (i.e., the number of cells) in the data. Computing time is recorded on a single thread of an Intel Xeon E5-2683 2.00-GHz processor. The number of low-dimensional components is fixed at 22 in this set of simulations for most methods, except for tSNE, which used two low-dimensional components due to the limitation of the tSNE software. Note that some methods are implemented with parallelization capability (e.g., ZINB-WaVE and pCMF), though we tested them on a single thread for fair comparison across methods. Note that PCA is similar to ICA in a and scScope is similar to several other efficient methods in b; thus, their lines may appear to be missing. Overall, three methods (ZIFA, pCMF, and ZINB-WaVE) become noticeably more computationally expensive than the remaining methods with an increasing number of cells in the data

Overall, in terms of computing time, PCA, FA, Diffusion Map, UMAP, DCA, and scScope are preferable.

Practical guidelines
In summary, our comparison analysis shows that different dimensionality reduction methods can have different merits for different tasks. Consequently, it is not straightforward to identify a single dimensionality reduction method that performs best in all data sets and for all downstream analyses. Instead, we provide a relatively comprehensive practical guideline for choosing dimensionality reduction methods in scRNA-seq analysis in Fig. 5. Our guideline is based on the accuracy and effectiveness of dimensionality reduction methods in terms of the downstream analysis, the robustness and stability of dimensionality reduction methods in terms of replicability and consistency across data splits, as well as their performance in large-scale data applications, data visualization, and computational scalability for large scRNA-seq data sets. Briefly, for cell clustering analysis, PCA, ICA, FA, NMF, and ZINB-WaVE are recommended for small data where computation is not a concern. PCA, ICA, FA, and NMF are also recommended for large data where computation is a concern. For lineage inference analysis, FA, PCA, NMF, UMAP, and ZINB-WaVE are all recommended for small data. A subset of these methods, FA, PCA, NMF, and UMAP, are also recommended for large scRNA-seq data. In addition, for very large scRNA-seq data sets (e.g., > 100,000 samples), DCA and UMAP are perhaps the only feasible approaches for both downstream analyses, with UMAP being the preferred choice. We also recognize that PCA, ICA, FA, and NMF can be useful options in very large data sets when paired with a sub-sampling procedure [69], though care needs to be taken to examine the effectiveness of the sub-sampling procedure itself. Finally, besides these general recommendations, we note that some methods have additional features that are desirable for practitioners. For example, both ZINB-WaVE and GLMPCA can include sample-level and gene-level covariates, thus allowing us to easily control for batch effects or size factors. We provide our detailed recommendations in Fig. 5.
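For instance, a hedged sketch of how a sample-level batch covariate could be supplied to ZINB-WaVE is given below; it assumes a hypothetical SummarizedExperiment `se` of counts with a `batch` column in its colData, and the argument names follow the zinbwave package documentation and should be checked against the installed version:

```r
# Hedged sketch: batch-adjusted low-dimensional components from ZINB-WaVE.
library(zinbwave)
se_fit  <- zinbwave(se, K = 2, X = "~ batch")  # X: sample-level covariates
low_dim <- reducedDim(se_fit)                  # cells x 2, batch-adjusted
```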

Fig. 5 Practical guideline for choosing dimensionality reduction methods in scRNA-seq analysis. Compared dimensionality reduction methods include factor analysis (FA), principal component analysis (PCA), independent component analysis (ICA), Diffusion Map, nonnegative matrix factorization (NMF), Poisson NMF, zero-inflated factor analysis (ZIFA), zero-inflated negative binomial-based wanted variation extraction (ZINB-WaVE), probabilistic count matrix factorization (pCMF), deep count autoencoder network (DCA), scScope, generalized linear model principal component analysis (GLMPCA), multidimensional scaling (MDS), locally linear embedding (LLE), local tangent space alignment (LTSA), Isomap, uniform manifold approximation and projection (UMAP), and t-distributed stochastic neighbor embedding (tSNE). Count-based methods are colored in purple, while non-count-based methods are colored in blue. Methods are ranked by their average performance across the criteria from left to right. Performance is colored and numerically coded: good performance = 2 (sky blue), intermediate performance = 1 (orange), and poor performance = 0 (gray)

Discussion
We have presented a comprehensive comparison of different dimensionality reduction methods for scRNA-seq analysis. We hope the summary of these state-of-the-art dimensionality reduction methods, the detailed comparison results, and the recommendations and guidelines for choosing dimensionality reduction methods can assist researchers in the analysis of their own scRNA-seq data.

In the present study, we have primarily focused on three clustering methods (k-means, hierarchical clustering, and the Louvain method) to evaluate the performance of different dimensionality reduction methods for downstream clustering analysis. We have also primarily focused on two lineage inference methods (Slingshot and Monocle3) to evaluate the performance of different dimensionality reduction methods for downstream lineage inference. In our analysis, we found that the performance of dimensionality reduction methods measured based on different clustering methods is often consistent with each other. Similarly, the performance of dimensionality reduction methods measured based on different lineage inference methods is also consistent with each other. However, it is possible that some dimensionality reduction methods may work well with certain clustering approaches and/or with certain lineage inference approaches. Consequently, future comparative analysis using other clustering methods and other lineage inference methods as comparison criteria may have added benefits. In addition, besides cell clustering and trajectory inference, we note that dimensionality reduction methods are also used for many other analytic tasks in scRNA-seq studies. For example, factor models for dimensionality reduction are an important modeling component for multiple scRNA-seq data set alignment [16] and for integrative analysis of multiple omics data sets [70, 71], as well as for deconvoluting bulk RNA-seq data using cell type-specific gene expression measurements from scRNA-seq [72, 73]. In addition, cell classification in scRNA-seq also relies on a low-dimensional structure inferred from the original scRNA-seq data through dimensionality reduction [74, 75]. Therefore, the comparative results obtained from the present study can provide important insights into these different scRNA-seq analytic tasks. Investigating the performance of dimensionality reduction methods in these different scRNA-seq downstream analyses is an important future research direction.

We mostly focused on evaluating feature extraction methods for dimensionality reduction. Another important category of dimensionality reduction methods is feature selection methods, which aim to select a subset of features/genes directly from the original feature space. Feature selection methods rely on different criteria to select important genes and are also commonly used in the preprocessing step of scRNA-seq data analysis [76]. For example, M3Drop relies on dropout events in scRNA-seq data to identify informative genes [77]. Seurat uses gene expression variance to select highly variable genes [16]. Evaluating the benefits of different methods and criteria for selecting informative genes for different downstream tasks is another important future direction.
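As a brief illustration of these two feature selection strategies (a hedged sketch using API names from the M3Drop and Seurat v2 documentation; `norm_expr` and `counts` are hypothetical expression matrices):

```r
# Dropout-based selection with M3Drop (expects normalized, not log, values).
library(M3Drop)
informative <- M3DropFeatureSelection(norm_expr, mt_method = "fdr",
                                      mt_threshold = 0.01)

# Variance-based highly variable gene selection with Seurat v2.
library(Seurat)
obj <- CreateSeuratObject(raw.data = counts)
obj <- NormalizeData(obj)
obj <- FindVariableGenes(obj)          # HVGs stored in obj@var.genes
```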
We have primarily focused on using the default software settings when applying different dimensionality reduction methods. We note, however, that modifying the software settings for certain methods on certain data types may help improve performance. For example, a recent study shows that the quasi-UMI approach paired with GLMPCA may help improve the performance of GLMPCA on non-UMI data sets [78]. In addition, we have relied on a relatively simple gene filtering step that removes lowly expressed genes. Sophisticated gene filtering approaches prior to running dimensionality reduction may help improve the performance of certain dimensionality reduction methods. In addition, alternative, more stringent gene filtering approaches may result in a smaller subset of genes for performing dimensionality reduction, making it easier to apply some of the slower dimensionality reduction methods to large data sets. Exploring how different software settings and gene filtering procedures influence the performance of different dimensionality reduction methods on different data sets will help us better understand the utility of these methods.

With the advance of scRNA-seq technologies and with increasing collaboration across scientific groups, new consortium projects such as the Human Cell Atlas (HCA) will generate scRNA-seq data sets that contain millions of cells [34]. Data at this scale pose critical computational and statistical challenges to many current dimensionality reduction methods. Many existing dimensionality reduction methods, in particular those that require the computation and memory storage of a covariance or distance matrix among cells, will no longer be applicable there. We have examined a particular sub-sampling strategy to scale all dimensionality reduction methods to large data sets. However, while the sub-sampling strategy is computationally efficient, it unfortunately reduces the performance of many dimensionality reduction methods by a substantial margin. Therefore, new algorithmic innovations and new efficient computational approximations will likely be needed to effectively scale many of the existing dimensionality reduction methods to millions of cells.

Methods and materials
ScRNA-seq data sets
We obtained a total of 30 scRNA-seq data sets from public domains for benchmarking dimensionality reduction methods. All data sets were retrieved from the Gene Expression Omnibus (GEO) database (https://www.ncbi.nlm.nih.gov/geo/) or the 10X Genomics website (https://support.10xgenomics.com/single-cell-gene-expression/datasets). These data sets cover a wide variety of sequencing techniques that include Smart-Seq2 (8 data sets), 10X Genomics (6 data sets), Smart-Seq (5 data sets), inDrop (1 data set), RamDA-seq (1 data set), sci-RNA-seq3 (1 data set), SMARTer (5 data sets), and others (3 data sets).

In addition, these data cover a range of sample sizes from a couple hundred cells to tens of thousands of cells, measured in either human (19 data sets) or mouse (11 data sets). In each data set, we evaluated the effectiveness of different dimensionality reduction methods for one of the two important downstream analysis tasks: cell clustering and lineage inference. In particular, 15 data sets were used for cell clustering evaluation, while another 15 data sets were used for lineage inference evaluation. For cell clustering, we followed the criteria listed in [12, 41] to select these data sets. In particular, the selected data sets need to contain true cell clustering information, which is treated as the ground truth in the comparative analysis. In our case, 11 of the 15 data sets were obtained by mixing cells from different cell types either pre-determined by fluorescence-activated cell sorting (FACS) or cultured under different conditions. Therefore, these 11 studies contain the true cell type labels for all cells. The remaining 4 data sets contain cell labels that were determined in the original study, and we simply treated them as truth, though we do acknowledge that such "true" clustering information may not be accurate. For lineage inference, we followed the criteria listed in [14] to select these data sets. In particular, the selected data sets need to contain true linear lineage information, which is treated as the ground truth in the comparative analysis. In our case, 4 of the 15 data sets were obtained by mixing cells from different cell types pre-determined by FACS. These different cell types are at different developmental stages of a single linear lineage; thus, these 4 studies contain the true lineage information for all cells. The remaining 11 data sets contain cells that were collected at multiple time points during the development process. For these data, we simply treated cells at these different time points as part of a single linear lineage, though we do acknowledge that different cells collected at the same time point may represent different developmental trajectories from an early time point if the cells at the early time point are heterogeneous. In either case, the true lineages in all these 15 data sets are treated as linear, without any bifurcation or multifurcation patterns.

A detailed list of the selected scRNA-seq data sets with corresponding data features is provided in Additional file 1: Tables S1-S2. In each of the above 30 data sets, we removed genes that are expressed in fewer than five cells. For methods modeling normalized data, we transformed the raw count data into continuous data with the normalize function implemented in scater (R package v1.12.0). We then applied log2 transformation on the normalized counts, adding one to avoid log transforming zero values. We simply term this normalization log2 count transformation, though we do acknowledge that such a transformation does take into account cell size factors, etc., through the scater software. In addition to the log2 count transformation, we also explored the utility of two additional data transformations: log2 CPM transformation and z-score transformation. In the log2 CPM transformation, we first computed counts per million reads (CPM) and then performed log2 transformation on the resulting CPM values, adding a constant of one to avoid log transformation of zero quantities. In the z-score transformation, for each gene in turn, we standardized CPM values to achieve a mean of zero and variance of one across cells using the Seurat package (v2.3).
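For reference, the three transformations can be approximated in base R as follows (a simplified sketch: the study used scater's normalize function, which also accounts for size factors, and Seurat for the z-score step; `counts` is a hypothetical genes-by-cells matrix):

```r
# Gene filtering and the three data transformations, simplified in base R.
counts   <- counts[rowSums(counts > 0) >= 5, ]    # keep genes seen in >= 5 cells
log2_cnt <- log2(counts + 1)                      # log2 count (no size factors here)
cpm      <- t(t(counts) / colSums(counts)) * 1e6  # counts per million
log2_cpm <- log2(cpm + 1)                         # log2 CPM transformation
z_score  <- t(scale(t(cpm)))                      # per-gene mean 0, variance 1
```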
Besides the above 30 real scRNA-seq data sets, we also simulated 2 additional scRNA-seq data sets for cell clustering evaluation. In the simulations, we used all 94 cells from one cell type (v6.5 mouse 2i+LIF) in the Kumar data as input. We simulated scRNA-seq data with 500 cells and a known number of cell types, which was set to be either 4 or 8, using the Splatter package (v1.2.0). All parameters used in Splatter (e.g., mean rate, shape, dropout rate) were set to be approximately those estimated from the real data. In the case of 4 cell types, we set the group parameter in Splatter to 4. We set the percentages of cells in each group as 0.1, 0.15, 0.5, and 0.25, respectively. We set the proportions of differentially expressed genes in each group as 0.02, 0.03, 0.05, and 0.1, respectively. In the case of 8 cell types, we set the group/cell type parameter to 8. We set the percentages of cells in each group as 0.12, 0.08, 0.1, 0.05, 0.3, 0.1, 0.2, and 0.05, respectively. We set the proportions of differentially expressed genes in each group as 0.03, 0.03, 0.03, 0.1, 0.05, 0.07, 0.08, and 0.1, respectively.
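A hedged sketch of the 4-cell-type simulation with the Splatter package is shown below (parameter names per the splatter documentation; the exact estimated parameters and seed are omitted, and `kumar_counts` denotes the 94 v6.5 mouse 2i+LIF cells used as input):

```r
# Simulate 500 cells in 4 groups with parameters estimated from real data.
library(splatter)
params <- splatEstimate(kumar_counts)             # estimate from Kumar cells
params <- setParams(params,
                    batchCells = 500,
                    group.prob = c(0.10, 0.15, 0.50, 0.25),  # cells per group
                    de.prob    = c(0.02, 0.03, 0.05, 0.10))  # DE genes per group
sim   <- splatSimulate(params, method = "groups") # a SingleCellExperiment
truth <- colData(sim)$Group                       # known cell type labels
```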

Compared dimensionality reduction methods
Dimensionality reduction methods aim to transform an originally high-dimensional feature space into a low-dimensional representation with a much-reduced number of components. These components are in the form of a linear or non-linear combination of the original features (known as feature extraction dimensionality reduction methods) [79] and, in the extreme case, are themselves a subset of the original features (known as feature selection dimensionality reduction methods) [80]. In the present study, we have collected and compiled a list of 18 popular and widely used dimensionality reduction methods in the field of scRNA-seq analysis. These dimensionality reduction methods include factor analysis (FA; R package psych, v1.8.12), principal component analysis (PCA; R package stats, v3.6.0), independent component analysis (ICA; R package ica, v1.0.2), Diffusion Map (R package destiny, v2.14.0), nonnegative matrix factorization (NMF; R package NNLM, v1.0.0), Kullback-Leibler divergence-based NMF (Poisson NMF; R package NNLM, v1.0.0), zero-inflated factor analysis (ZIFA; Python package ZIFA), zero-inflated negative binomial-based wanted variation extraction (ZINB-WaVE; R package zinbwave, v1.6.0), probabilistic count matrix factorization (pCMF; R package pCMF, v1.0.0), deep count autoencoder network (DCA; Python package dca), a scalable deep-learning-based approach (scScope; Python package scscope), generalized linear model principal component analysis (GLMPCA; R package, available on GitHub), multidimensional scaling (MDS; R package Rdimtools, v0.4.2), locally linear embedding (LLE; R package Rdimtools, v0.4.2), local tangent space alignment (LTSA; R package Rdimtools, v0.4.2), Isomap (R package Rdimtools, v0.4.2), t-distributed stochastic neighbor embedding (tSNE; FIt-SNE, fftRtsne R function), and uniform manifold approximation and projection (UMAP; Python package). One of these methods, tSNE, can only extract a maximum of two or three low-dimensional components [48, 58, 59]. Therefore, we only included tSNE results based on two low-dimensional components extracted from the recently developed fast FIt-SNE R package [48] in all figures. An overview of these 18 dimensionality reduction methods with their corresponding modeling characteristics is provided in Table 1.

Assess the performance of dimensionality reduction methods
We first evaluated the performance of dimensionality reduction methods by neighborhood preserving, which aims to assess whether the reduced-dimensional space resembles the original gene expression matrix. To do so, we first identified the k-nearest neighbors for each single cell in the original space (denoted as a set A) and in the reduced space (denoted as a set B). We set k = 10, 20, or 30 in our study. We then computed the Jaccard index (JI) [60] to measure the neighborhood similarity between the original space and the reduced space: $\mathrm{JI} = \frac{|A \cap B|}{|A \cup B|}$, where $|\cdot|$ denotes the cardinality of a set. We finally obtained the averaged Jaccard index (AJI) across all cells to serve as the measurement for neighborhood preserving.
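For illustration, the AJI can be computed with the following base-R sketch (our own code; `orig` and `reduced` are hypothetical cells-by-features matrices for the original and reduced spaces):

```r
# Average Jaccard index between k-nearest-neighbor sets in two spaces.
average_jaccard <- function(orig, reduced, k = 10) {
  knn_sets <- function(x) {
    d <- as.matrix(dist(x))                  # pairwise distances among cells
    diag(d) <- Inf                           # a cell is not its own neighbor
    apply(d, 1, function(r) order(r)[1:k])   # k nearest neighbors per cell
  }
  A <- knn_sets(orig)
  B <- knn_sets(reduced)
  mean(sapply(seq_len(ncol(A)), function(i)
    length(intersect(A[, i], B[, i])) / length(union(A[, i], B[, i]))))
}
```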
We note, however, that neighborhood preserving is primarily used to measure the effectiveness of pure dimensionality reduction in terms of preserving the original space and may not be relevant for the single-cell analytic tasks that are the main focus of the present study: a dimensionality reduction method that preserves the original gene expression matrix effectively may not be effective in extracting useful biological information from the expression matrix that is essential for key downstream single-cell applications. Preserving the original gene expression matrix is rarely the purpose of dimensionality reduction methods for single-cell applications: indeed, the original gene expression matrix (which is the best-preserved matrix of itself) is rarely, if ever, used directly in any downstream single-cell application, including cell clustering and lineage inference, even though it is computationally easy to do so.

Therefore, more importantly, we also evaluated the performance of dimensionality reduction methods by evaluating how effective the low-dimensional components extracted from dimensionality reduction methods are for downstream single-cell analysis. We evaluated one of the two commonly applied downstream analyses, clustering analysis and lineage reconstruction analysis, in the 32 data sets described above. In the analysis, we varied the number of low-dimensional components extracted from these dimensionality reduction methods. Specifically, for cell clustering data sets, in a data set with less than or equal to 300 cells, we varied the number of low-dimensional components to be either 2, 6, 14, or 20. In a data set with more than 300 cells, we varied the number of low-dimensional components to be either 0.5%, 1%, 2%, or 3% of the total number of cells. For lineage inference data sets, we varied the number of low-dimensional components to be either 2, 6, 14, or 20 for all data sets, since common lineage inference methods prefer a relatively small number of components.

For clustering analysis, after dimensionality reduction with these dimensionality reduction methods, we used three different clustering methods, hierarchical clustering (R function hclust; stats v3.5.3), k-means clustering (R function kmeans; stats v3.6.0), or the Louvain method (R function clusterCells; monocle v2.12.0), to perform clustering on the reduced feature space. k-means clustering is a key ingredient of commonly applied scRNA-seq clustering methods such as SC3 [18] and Waterfall [25]. Hierarchical clustering is a key ingredient of commonly applied scRNA-seq clustering methods such as CIDR [17] and CHETAH [81]. The Louvain method is also a commonly used clustering method in common single-cell analysis software such as Seurat [16] and Monocle [27, 82]. In all these clustering methods, we set the number of clusters k to be the known number of cell types in the data. We compared the cell clusters inferred using the low-dimensional components to the true cell clusters and evaluated clustering accuracy by two criteria: the adjusted Rand index (ARI) [83] and the normalized mutual information (NMI) [84]. The ARI and NMI are defined as:

$$\mathrm{ARI}(P, T) = \frac{\sum_{l,s} \binom{n_{ls}}{2} - \left[\sum_l \binom{a_l}{2} \sum_s \binom{b_s}{2}\right] / \binom{n}{2}}{\frac{1}{2}\left[\sum_l \binom{a_l}{2} + \sum_s \binom{b_s}{2}\right] - \left[\sum_l \binom{a_l}{2} \sum_s \binom{b_s}{2}\right] / \binom{n}{2}}$$

and

$$\mathrm{NMI}(P, T) = \frac{2\,\mathrm{MI}(P, T)}{H(P) + H(T)},$$

where $P = (p_1, p_2, \cdots, p_n)^T$ denotes the inferred cell type cluster labels from clustering analysis and $T = (t_1, t_2, \cdots, t_n)^T$ denotes the known true cell type labels for the $n$ samples in the data; $l$ and $s$ enumerate the clusters, with $l = 1, \cdots, r$ and $s = 1, \cdots, k$, where $r$ and $k$ are the number of inferred cell type clusters and the number of true cell type clusters, respectively; $n_{ls} = \sum_i I(p_i = l) I(t_i = s)$ is the number of cells that belong to cluster $l$ in the inferred labeling and to cluster $s$ in the true labeling, with $I(\cdot)$ being an indicator function; note that $n_{ls}$ is an entry of the contingency table, which effectively measures the number of cells in common between $P$ and $T$; $a_l = \sum_s n_{ls}$ is the sum of the $l$th row of the contingency table; $b_s = \sum_l n_{ls}$ is the sum of the $s$th column of the contingency table; $\binom{\cdot}{\cdot}$ denotes a binomial coefficient; $\mathrm{MI}(P, T) = \sum_l \sum_s \frac{n_{ls}}{n} \log\left(\frac{n_{ls}\, n}{a_l b_s}\right)$ is the mutual information between the two cluster labelings; $H(P) = -\sum_l \frac{a_l}{n} \log\left(\frac{a_l}{n}\right)$ is the entropy function for the inferred cell type labeling; and $H(T) = -\sum_s \frac{b_s}{n} \log\left(\frac{b_s}{n}\right)$ is the entropy function for the true cell type labeling. We used the compare function in the igraph R package (v1.0.0) to compute both the ARI and NMI criteria. For rare cell type identification, we used the F-measure, which is commonly used for quantifying rare cell type identification performance [64, 65]. The F-measure is the harmonic mean of the clustering's precision and recall and is formulated as:

$$F\text{-measure} = \frac{2PR}{P + R},$$

where $P$ represents the precision for identifying the rare cluster, with $P = \frac{TP}{TP + FP}$, and $R$ represents the recall for identifying the rare cluster, with $R = \frac{TP}{TP + FN}$, where $TP$, $FP$, and $FN$ denote true positives, false positives, and false negatives, respectively.
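The study computes both clustering criteria with the compare function in igraph; purely for illustration, the base-R sketch below mirrors the contingency-table formulas above (our own code, not the study's scripts):

```r
# ARI and NMI from the contingency table of inferred (P) vs true (T) labels.
ari_nmi <- function(P, T_true) {
  n  <- length(P)
  ct <- table(P, T_true)                   # n_ls entries
  a  <- rowSums(ct)                        # a_l: row sums
  b  <- colSums(ct)                        # b_s: column sums
  s_n <- sum(choose(ct, 2)); s_a <- sum(choose(a, 2)); s_b <- sum(choose(b, 2))
  expected <- s_a * s_b / choose(n, 2)
  ari <- (s_n - expected) / ((s_a + s_b) / 2 - expected)
  mi  <- sum(ifelse(ct > 0, (ct / n) * log(ct * n / outer(a, b)), 0))
  h   <- function(m) -sum((m / n) * log(m / n))
  c(ARI = ari, NMI = 2 * mi / (h(a) + h(b)))
}
# igraph::compare(P, T_true, method = "adjusted.rand") and method = "nmi"
# (with integer label vectors) should agree with this sketch.
```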
For each data set, we repeated the above procedure five times and report the averaged results to avoid the influence of the stochasticity embedded in some dimensionality reduction methods and/or the clustering algorithm.

While it is straightforward to apply different dimensionality reduction methods to most scRNA-seq data sets, we found that many dimensionality reduction methods are not computationally scalable and cannot be directly applied for clustering analysis in the two large-scale scRNA-seq data sets we examined in the present study. For these non-scalable dimensionality reduction methods, we made use of a recently developed sub-sampling procedure described in dropClust to scale them to large data [69]. In particular, we first applied dropClust to the original large-scale data to infer rare cell populations. We then created a small data set by combining all cells in the rare cell populations along with a subset of cells in the remaining cell populations. The subset of cells in the non-rare populations was obtained through sub-sampling using the structure-preserving sampling procedure (details in [69]). Afterwards, we applied different dimensionality reduction methods to the small data set and performed clustering analysis there. The cells in the small data set are directly assigned their clustering labels after clustering analysis. For each cell that is not in the small data set, we computed the Pearson correlation between the cell and each of the cluster centers inferred in the small data set. We assigned the cell to the cluster with the closest cluster center in the small data set.
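A minimal sketch of this correlation-based label transfer (our simplification of the dropClust-coupled pipeline; `small_expr` and `rest_expr` are hypothetical genes-by-cells matrices for the sub-sampled cells and the held-out cells):

```r
# Assign each held-out cell to the cluster center with the highest
# Pearson correlation among centers estimated in the sub-sampled data.
assign_rest <- function(small_expr, small_cluster, rest_expr) {
  centers <- sapply(sort(unique(small_cluster)), function(cl)
    rowMeans(small_expr[, small_cluster == cl, drop = FALSE]))
  corr <- cor(rest_expr, centers, method = "pearson")   # cells x centers
  max.col(corr)                                         # best cluster per cell
}
```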

For trajectory inference, after dimensionality reduction with these dimensionality reduction methods, we used Slingshot [66] (R package, v1.2.0) and Monocle3 [28] (R package, v0.1.2). Slingshot is the recommended lineage inference method based on a recent comparative study [14], while Monocle3 is one of the most recent lineage inference methods. Slingshot takes two input data: the low-dimensional components extracted from dimensionality reduction methods and a vector of cluster labels predicted by clustering algorithms. Monocle3 also takes two input data: the low-dimensional components extracted by dimensionality reduction methods and a starting state, which corresponds to the beginning of the lineage. For the cluster labels, we used either the k-means, hierarchical clustering, or Louvain algorithm on the extracted low-dimensional components. For the starting state, we supplied the true beginning state of the lineage in the data. After obtaining the two types of input through the slingshot function, we used the getLineages function to fit a minimum spanning tree (MST) to identify the lineage. The final output from Slingshot is an object of class SlingshotDataSet that contains the inferred lineage information.

We follow the original Slingshot paper [66] to evaluate the accuracy of the inferred lineage using the Kendall rank correlation coefficient. To do so, for each data set, we first ranked cells based on their position on the true lineage. We ordered all $m$ cells based on this rank order and denoted the corresponding ranks in ascending order for these cells as $\{x_1, \cdots, x_m\}$, where $x_i \le x_{i+1}$. Note that the true lineage is linear without any bifurcation or multifurcation patterns, while the inferred lineage may contain multiple ending points in addition to the single starting point. Therefore, for each inferred lineage, we examined one trajectory at a time, where each trajectory consists of the starting point and one of the ending points. In each trajectory, we ranked cells based on their position in the trajectory. We denote the corresponding rank order in the inferred trajectory for all $m$ cells as $\{y_1, \cdots, y_m\}$, where we set $y_l$ as missing if the $l$th cell is not included in the inferred trajectory. For each pair of non-missing cells, we labeled the cell pair $(i, j)$ as a concordant pair if their relative rank in the inferred lineage is consistent with their relative rank in the true lineage; that is, either ($x_i \ge x_j$ and $y_i \ge y_j$) or ($x_i < x_j$ and $y_i < y_j$). Otherwise, we labeled the cell pair $(i, j)$ as discordant. We denoted $C$ as the number of concordant pairs, $D$ as the number of discordant pairs, and $U$ as the total number of non-missing cells. The Kendall correlation coefficient is then computed as

$$\tau = \frac{C - D}{U(U - 1)/2}.$$

Afterwards, we obtained the maximum absolute $\tau$ over all these trajectories as the final Kendall correlation score to evaluate the similarity between the inferred lineage and the true lineage. For each data set, we repeated the above procedure five times and report the averaged results to avoid the influence of the stochasticity embedded in some dimensionality reduction methods and/or the lineage inference algorithm. For the large-scale data application to Cao et al., we also applied the sub-sampling approach dropClust to scale different dimensionality reduction methods for lineage inference.
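To make the lineage evaluation concrete, a hedged sketch using Slingshot is given below; `rd` is a hypothetical cells-by-components embedding, `cl` a vector of cluster labels, `start` the known starting cluster, and `true_rank` the rank of each cell on the true linear lineage. Note that the sketch scores cells by curve pseudotime, which approximates the trajectory ranking described above:

```r
# Fit lineages on the embedding, then keep the best |tau| across trajectories.
library(slingshot)
lin  <- getLineages(rd, clusterLabels = cl, start.clus = start)
pst  <- slingPseudotime(getCurves(lin))   # cells x lineages; NA if off-lineage
taus <- apply(pst, 2, function(y) {
  keep <- !is.na(y)                       # drop cells missing from this lineage
  cor(true_rank[keep], rank(y[keep]), method = "kendall")
})
max(abs(taus))                            # final Kendall correlation score
```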
We investigated the stability and robustness of different dimensionality reduction methods in both cell clustering and lineage inference applications through data splitting. Here, we focused on two representative scRNA-seq data sets: the Kumar data set for cell clustering and the Hayashi data set for lineage inference. For each data set, we randomly split the data into two subsets with an equal number of cells of each cell type in the two subsets. We repeated the split procedure 10 times to capture the potential stochasticity during the data split. In each split replicate, we applied different dimensionality reduction methods to analyze each subset separately. We used the k-means clustering algorithm to infer the clustering labels in each subset. We used NMI to measure cell clustering accuracy and Kendall correlation to measure lineage inference accuracy.

Finally, to summarize the performance of the evaluated dimensionality reduction methods across the range of criteria in Fig. 5, we categorize each dimensionality reduction method as "good," "intermediate," or "poor" for each criterion. For UMI- and non-UMI-based data in cell clustering, we evaluated the performance of different dimensionality reduction methods based on 0.5% low-dimensional components in Additional file 1: Figures S31A and S31B: average NMI ≥ 0.73 (good); 0.64 ≤ average NMI < 0.73 (intermediate); average NMI < 0.64 (poor). For Trajectory Inference, we evaluated the performance of different dimensionality reduction methods based on 2 low-dimensional components in Additional file 1: Figure S39A: average Kendall ≥ 0.41 (good); 0.35 ≤ average Kendall < 0.41 (intermediate); average Kendall < 0.35 (poor). For Rare Cell Detection, we evaluated the performance of different dimensionality reduction methods based on 0.5% low-dimensional components in Additional file 1: Figure S35A: F-measure ≥ 0.74 (good); 0.69 ≤ F-measure < 0.74 (intermediate); F-measure < 0.69 (poor). For Neighborhood Preserving, we evaluated the performance of different dimensionality reduction methods based on 0.5% low-dimensional components in Additional file 1: Figure S7A: average Jaccard index ≥ 0.15 (good); 0.12 ≤ average Jaccard index < 0.15 (intermediate); average Jaccard index < 0.12 (poor). For Scalability, we evaluated the performance of different dimensionality reduction methods at a sample size of 10,000 in Fig. 4b: computation time ≤ 0.25 h (good); 0.25 h < computation time < 10 h (intermediate); computation time ≥ 10 h (poor). For Consistency, we evaluated the performance of different dimensionality reduction methods based on the absolute mean value of the difference in average NMI between two splits from Additional file 1: Figures S36 and S54: difference of average NMI ≤ 0.005 (good); 0.005 < difference of average NMI < 0.01 (intermediate); difference of average NMI ≥ 0.01 (poor). For Success Rate, since both scScope and LTSA do not work for most trajectory inference data sets, we set them as poor; NMF, ICA, tSNE, and GLMPCA do not work for some data sets, so we set them as intermediate; the rest of the dimensionality reduction methods are all good.

Supplementary information
Supplementary information accompanies this paper at https://doi.org/10.1186/s13059-019-1898-6.

Additional file 1. Supplementary figures and tables.
Additional file 2. Review history.

Acknowledgements
We would like to thank Angelo Duò for providing some of the single-cell RNA-seq data sets used here.
Peer review information: Yixin Yao was the primary editor on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Review history
The review history is available as Additional file 2.

Authors' contributions
XZ conceived the idea and provided funding support. SS and XZ designed the experiments. SS adapted software, performed simulations, and analyzed real data. JZ and YM collected data sets and helped interpret results. SS and XZ wrote the manuscript with input from all other authors. All authors read and approved the final manuscript.

Funding
This study was supported by the National Institutes of Health (NIH) Grants R01HG009124 and R01GM126553, and the National Science Foundation (NSF) Grant DMS1712933. This project has also been made possible in part by grant number 2018-181314 from the Chan Zuckerberg Initiative DAF, an advised fund of Silicon Valley Community Foundation. S.S. was supported by NIH Grant R01HD088558 (PI Tung), the National Natural Science Foundation of China (Grant No. 61902319), and the Natural Science Foundation of Shaanxi Province (Grant No. 2019JQ127). J.Z. was supported by NIH Grant U01HL137182 (PI Kang).

Availability of data and materials
All source code and data sets used in our experiments have been deposited at www.xzlab.org/reproduce.html or https://github.com/xzhoulab/DRComparison [85].

Ethics approval and consent to participate
No ethics approval was required for this study. All utilized public data sets were generated by other organizations that obtained ethics approval.

Consent for publication
Not applicable.

Competing interests
The authors declare that they have no competing interests.

Author details
1 School of Computer Science, Northwestern Polytechnical University, Xi'an, Shaanxi 710072, People's Republic of China. 2 Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA. 3 Center for Statistical Genetics, University of Michigan, Ann Arbor, MI 48109, USA.

Received: 16 May 2019 Accepted: 22 November 2019

References
1. Picelli S, Faridani OR, Bjorklund AK, Winberg G, Sagasser S, Sandberg R. Full-length RNA-seq from single cells using Smart-seq2. Nat Protoc. 2014;9:171–81.
2. Chen X, Teichmann SA, Meyer KB. From tissues to cell types and back: single-cell gene expression analysis of tissue architecture. Ann Rev Biomed Data Sci. 2018;1:29–51.
3. Ziegenhain C, Vieth B, Parekh S, Reinius B, Guillaumet-Adkins A, Smets M, Leonhardt H, Heyn H, Hellmann I, Enard W. Comparative analysis of single-cell RNA sequencing methods. Mol Cell. 2017;65:631–43.
4. Buettner F, Natarajan KN, Casale FP, Proserpio V, Scialdone A, Theis FJ, Teichmann SA, Marioni JC, Stegle O. Computational analysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells. Nat Biotechnol. 2015;33:155–60.
5. McDavid A, Finak G, Gottardo R. The contribution of cell cycle to heterogeneity in single-cell RNA-seq data. Nat Biotechnol. 2016;34:591–3.
6. Li HP, Courtois ET, Sengupta D, Tan YL, Chen KH, Goh JJL, Kong SL, Chua C, Hon LK, Tan WS, et al. Reference component analysis of single-cell transcriptomes elucidates cellular heterogeneity in human colorectal tumors. Nat Genet. 2017;49:708–18.
7. Patel AP, Tirosh I, Trombetta JJ, Shalek AK, Gillespie SM, Wakimoto H, Cahill DP, Nahed BV, Curry WT, Martuza RL, et al. Single-cell RNA-seq highlights intratumoral heterogeneity in primary glioblastoma. Science. 2014;344:1396–401.
8. Rozenblatt-Rosen O, Stubbington MJT, Regev A, Teichmann SA. The Human Cell Atlas: from vision to reality. Nature. 2017;550:451–3.
9. Stegle O, Teichmann SA, Marioni JC. Computational and analytical challenges in single-cell transcriptomics. Nat Rev Genet. 2015;16:133–45.
10. Altman N, Krzywinski M. The curse(s) of dimensionality. Nat Methods. 2018;15:399–400.
11. Tenenbaum JB, de Silva V, Langford JC. A global geometric framework for nonlinear dimensionality reduction. Science. 2000;290:2319–23.
12. Duo A, Robinson MD, Soneson C. A systematic performance evaluation of clustering methods for single-cell RNA-seq data. F1000Res. 2018;7:1141.
13. Kiselev VY, Andrews TS, Hemberg M. Challenges in unsupervised clustering of single-cell RNA-seq data. Nat Rev Genet. 2019;20:273–82.
14. Saelens W, Cannoodt R, Todorov H, Saeys Y. A comparison of single-cell trajectory inference methods. Nat Biotechnol. 2019;37:547–54.
15. Zappia L, Phipson B, Oshlack A. Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database. PLoS Comput Biol. 2018;14:e1006245.
16. Butler A, Hoffman P, Smibert P, Papalexi E, Satija R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat Biotechnol. 2018;36:411–20.
17. Lin PJ, Troup M, Ho JWK. CIDR: ultrafast and accurate clustering through imputation for single-cell RNA-seq data. Genome Biol. 2017;18:59.
18. Kiselev VY, Kirschner K, Schaub MT, Andrews T, Yiu A, Chandra T, Natarajan KN, Reik W, Barahona M, Green AR, Hemberg M. SC3: consensus clustering of single-cell RNA-seq data. Nat Methods. 2017;14:483–6.
19. Zhu LX, Lei J, Klei L, Devlin B, Roeder K. Semisoft clustering of single-cell data. Proc Natl Acad Sci U S A. 2019;116:466–71.
20. Chen MJ, Zhou X. Controlling for confounding effects in single cell RNA sequencing studies using both control and target genes. Sci Rep. 2017;7:13587.
21. Angerer P, Haghverdi L, Buttner M, Theis FJ, Marr C, Buettner F. destiny: diffusion maps for large-scale single cell data in R. Bioinformatics. 2016;32:1241–3.
22. Senabouth A, Lukowski SW, Hernandez JA, Andersen S, Mei X, Nguyen QH, Powell JE. ascend: R package for analysis of single cell RNA-seq data. GigaScience. 2019;8:giz087.
23. Way GP, Greene CS. Bayesian deep learning for single-cell analysis. Nat Methods. 2018;15:1009–10.
24. Ji ZC, Ji HK. TSCAN: pseudo-time reconstruction and evaluation in single-cell RNA-seq analysis. Nucleic Acids Res. 2016;44:e117.
25. Shin J, Berg DA, Zhu YH, Shin JY, Song J, Bonaguidi MA, Enikolopov G, Nauen DW, Christian KM, Ming GL, Song HJ. Single-cell RNA-Seq with waterfall reveals molecular cascades underlying adult neurogenesis. Cell Stem Cell. 2015;17:360–72.
26. Welch JD, Hartemink AJ, Prins JF. SLICER: inferring branched, nonlinear cellular trajectories from single cell RNA-seq data. Genome Biol. 2016;17:106.
27. Trapnell C, Cacchiarelli D, Grimsby J, Pokharel P, Li SQ, Morse M, Lennon NJ, Livak KJ, Mikkelsen TS, Rinn JL. The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat Biotechnol. 2014;32:381–6.
28. Cao JY, Spielmann M, Qiu XJ, Huang XF, Ibrahim DM, Hill AJ, Zhang F, Mundlos S, Christiansen L, Steemers FJ, et al. The single-cell transcriptional landscape of mammalian organogenesis. Nature. 2019;566:496–501.
29. Setty M, Tadmor MD, Reich-Zeliger S, Ange O, Salame TM, Kathail P, Choi K, Bendall S, Friedman N, Pe'er D. Wishbone identifies bifurcating developmental trajectories from single-cell data. Nat Biotechnol. 2016;34:637–45.
30. Pierson E, Yau C. ZIFA: dimensionality reduction for zero-inflated single-cell gene expression analysis. Genome Biol. 2015;16:241.
31. Durif G, Modolo L, Mold JE, Lambert-Lacroix S, Picard F. Probabilistic count matrix factorization for single cell expression data analysis. Bioinformatics. 2019:btz177.
32. Risso D, Perraudeau F, Gribkova S, Dudoit S, Vert JP. A general and flexible method for signal extraction from single-cell RNA-seq data. Nat Commun. 2018;9:284.
33. Zheng GXY, Terry JM, Belgrader P, Ryvkin P, Bent ZW, Wilson R, Ziraldo SB, Wheeler TD, McDermott GP, Zhu JJ, et al. Massively parallel digital transcriptional profiling of single cells. Nat Commun. 2017;8:14049.
34. Regev A, Teichmann SA, Lander ES, Amit I, Benoist C, Birney E, Bodenmiller B, Campbell P, Carninci P, Clatworthy M, et al. The Human Cell Atlas. Elife. 2017;6:e27041.
35. Adlung L, Amit I. From the Human Cell Atlas to dynamic immune maps in human disease. Nat Rev Immunol. 2018;18:597–8.
36. Rashid S, Shah S, Bar-Joseph Z, Pandya R. Dhaka: variational autoencoder for unmasking tumor heterogeneity from single cell genomic data. Bioinformatics. 2019:btz095. https://doi.org/10.1093/bioinformatics/btz095.
37. Deng Y, Bao F, Dai QH, Wu LF, Altschuler SJ. Scalable analysis of cell-type composition from single-cell transcriptomics using deep recurrent learning. Nat Methods. 2019;16:311–4.
38. Wang DF, Gu J. VASC: dimension reduction and visualization of single-cell RNA-seq data by deep variational autoencoder. Genomics Proteomics Bioinformatics. 2018;16:320–31.
39. Ding JR, Condon A, Shah SP. Interpretable dimensionality reduction of single cell transcriptome data with deep generative models. Nat Commun. 2018;9:2002.
40. Eraslan G, Simon LM, Mircea M, Mueller NS, Theis FJ. Single-cell RNA-seq denoising using a deep count autoencoder. Nat Commun. 2019;10:390.
41. Soneson C, Robinson MD. Bias, robustness and scalability in single-cell differential expression analysis. Nat Methods. 2018;15:255–61.
42. Jolliffe IT. Principal component analysis. New York: Springer; 2002.
43. Stone JV. Independent component analysis: a tutorial introduction. Cambridge: MIT Press; 2014.
44. Bartholomew DJ, Steele F, Galbraith J, Moustaki I. Analysis of multivariate social science data. New York: Taylor & Francis; 2008.

45. Lee DD, Seung HS. Learning the parts of objects by non-negative matrix factorization. Nature. 1999;401:788–91.
46. Coifman RR, Lafon S, Lee AB, Maggioni M, Nadler B, Warner F, Zucker SW. Geometric diffusions as a tool for harmonic analysis and structure definition of data: diffusion maps. Proc Natl Acad Sci U S A. 2005;102:7426–31.
47. Townes FW, Hicks SC, Aryee MJ, Irizarry RA. Feature selection and dimension reduction for single cell RNA-Seq based on a multinomial model. bioRxiv. 2019:574574.
48. Linderman GC, Rachh M, Hoskins JG, Steinerberger S, Kluger Y. Fast interpolation-based t-SNE for improved visualization of single-cell RNA-seq data. Nat Methods. 2019;16:243–5.
49. Mead A. Review of the development of multidimensional-scaling methods. Statistician. 1992;41:27–39.
50. Roweis ST, Saul LK. Nonlinear dimensionality reduction by locally linear embedding. Science. 2000;290:2323–6.
51. Zhang ZY, Zha HY. Principal manifolds and nonlinear dimensionality reduction via tangent space alignment. SIAM J Sci Comput. 2004;26:313–38.
52. Becht E, McInnes L, Healy J, Dutertre CA, Kwok IWH, Ng LG, Ginhoux F, Newell EW. Dimensionality reduction for visualizing single-cell data using UMAP. Nat Biotechnol. 2019;37:38–44.
53. Ramskold D, Luo SJ, Wang YC, Li R, Deng QL, Faridani OR, Daniels GA, Khrebtukova I, Loring JF, Laurent LC, et al. Full-length mRNA-Seq from single-cell levels of RNA and individual circulating tumor cells. Nat Biotechnol. 2012;30:777–82.
54. Klein AM, Mazutis L, Akartuna I, Tallapragada N, Veres A, Li V, Peshkin L, Weitz DA, Kirschner MW. Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell. 2015;161:1187–201.
55. Hayashi T, Ozaki H, Sasagawa Y, Umeda M, Danno H, Nikaido I. Single-cell full-length total RNA sequencing uncovers dynamics of recursive splicing and enhancer RNAs. Nat Commun. 2018;9:619.
56. Verboom K, Everaert C, Bolduc N, Livak KJ, Yigit N, Rombaut D, Anckaert J, Lee S, Veno MT, Kjems J, et al. SMARTer single cell total RNA sequencing. Nucleic Acids Res. 2019;47:e93.
57. Tang FC, Barbacioru C, Wang YZ, Nordman E, Lee C, Xu NL, Wang XH, Bodeau J, Tuch BB, Siddiqui A, et al. mRNA-Seq whole-transcriptome analysis of a single cell. Nat Methods. 2009;6:377–82.
58. van der Maaten L, Hinton G. Visualizing data using t-SNE. J Mach Learn Res. 2008;9:2579–605.
59. van der Maaten L. Accelerating t-SNE using tree-based algorithms. J Mach Learn Res. 2014;15:3221–45.
60. Cooley SM, Hamilton T, Deeds EJ, Ray JCJ. A novel metric reveals previously unrecognized distortion in dimensionality reduction of scRNA-Seq data. bioRxiv. 2019:689851. https://doi.org/10.1101/689851.
61. Blondel VD, Guillaume JL, Lambiotte R, Lefebvre E. Fast unfolding of communities in large networks. J Stat Mech Theory Exp. 2008;10:P10008.
62. Wegmann R, Neri M, Schuierer S, Bilican B, Hartkopf H, Nigsch F, Mapa F, Waldt A, Cuttat R, Salick MR, et al. CellSIUS provides sensitive and specific detection of rare cell populations from complex single-cell RNA-seq data. Genome Biol. 2019;20:142.
63. Jiang L, Chen HD, Pinello L, Yuan GC. GiniClust: detecting rare cell types from single-cell gene expression data with Gini index. Genome Biol. 2016;17:144.
64. Weber LM, Robinson MD. Comparison of clustering methods for high-dimensional single-cell flow and mass cytometry data. Cytometry Part A. 2016;89A:1084–96.
65. Bruggner RV, Bodenmiller B, Dill DL, Tibshirani RJ, Nolan GP. Automated identification of stratifying signatures in cellular subpopulations. Proc Natl Acad Sci U S A. 2014;111:E2770–7.
66. Street K, Risso D, Fletcher RB, Das D, Ngai J, Yosef N, Purdom E, Dudoit S. Slingshot: cell lineage and pseudotime inference for single-cell transcriptomics. BMC Genomics. 2018;19:477.
67. Qiu XJ, Mao Q, Tang Y, Wang L, Chawla R, Pliner HA, Trapnell C. Reversed graph embedding resolves complex single-cell trajectories. Nat Methods. 2017;14:979–82.
68. Guo XY, Zhang YY, Zheng LT, Zheng CH, Song JT, Zhang QM, Kang BX, Liu ZZR, Jin L, Xing R, et al. Global characterization of T cells in non-small-cell lung cancer by single-cell sequencing. Nat Med. 2018;24:978–85.
69. Sinha D, Kumar A, Kumar H, Bandyopadhyay S, Sengupta D. dropClust: efficient clustering of ultra-large scRNA-seq data. Nucleic Acids Res. 2018;46:e36.
70. Stuart T, Satija R. Integrative single-cell analysis. Nat Rev Genet. 2019;20:257–72.
71. Argelaguet R, Velten B, Arnol D, Dietrich S, Zenz T, Marioni JC, Buettner F, Huber W, Stegle O. Multi-omics factor analysis: a framework for unsupervised integration of multi-omics data sets. Mol Syst Biol. 2018;14:e8124.
72. Newman AM, Liu CL, Green MR, Gentles AJ, Feng WG, Xu Y, Hoang CD, Diehn M, Alizadeh AA. Robust enumeration of cell subsets from tissue expression profiles. Nat Methods. 2015;12:453–7.
73. Mohammadi S, Zuckerman N, Goldsmith A, Grama A. A critical survey of deconvolution methods for separating cell types in complex tissues. Proc IEEE. 2017;105:340–66.
74. Ilicic T, Kim JK, Kolodziejczyk AA, Bagger FO, McCarthy DJ, Marioni JC, Teichmann SA. Classification of low quality cells from single-cell RNA-seq data. Genome Biol. 2016;17:29.
75. Wagner F, Yanai I. Moana: a robust and scalable cell type classification framework for single-cell RNA-Seq data. bioRxiv. 2018:456129. https://doi.org/10.1101/456129.
76. Yip SH, Sham PC, Wang J. Evaluation of tools for highly variable gene discovery from single-cell RNA-seq data. Brief Bioinform. 2018;20:1583–9.
77. Andrews TS, Hemberg M. M3Drop: dropout-based feature selection for scRNASeq. Bioinformatics. 2018;35:2865–7.
78. Townes FW, Irizarry RA. Quantile normalization of single-cell RNA-seq read counts without unique molecular identifiers. bioRxiv. 2019:817031. https://doi.org/10.1101/817031.
79. Cunningham JP, Ghahramani Z. Linear dimensionality reduction: survey, insights, and generalizations. J Mach Learn Res. 2015;16:2859–900.
80. Saeys Y, Inza I, Larranaga P. A review of feature selection techniques in bioinformatics. Bioinformatics. 2007;23:2507–17.
81. de Kanter JK, Lijnzaad P, Candelli T, Margaritis T, Holstege FCP. CHETAH: a selective, hierarchical cell type identification method for single-cell RNA sequencing. Nucleic Acids Res. 2019;47:e95.
82. Qiu XJ, Hill A, Packer J, Lin DJ, Ma YA, Trapnell C. Single-cell mRNA quantification and differential analysis with census. Nat Methods. 2017;14:309–15.
83. Hubert L, Arabie P. Comparing partitions. J Classif. 1985;2:193–218.
84. Danon L, Diaz-Guilera A, Duch J, Arenas A. Comparing community structure identification. J Stat Mech Theory Exp. 2005:P09008. https://doi.org/10.1088/1742-5468/2005/09/P09008.
85. Sun S, Zhu J, Ma Y, Zhou X. Accuracy, robustness and scalability of dimensionality reduction methods for single-cell RNA-seq analysis. Analysis code, GitHub repository. 2019. https://github.com/xzhoulab/DRComparison. Accessed 13 Oct 2019.

Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.