A Hybrid Deep Clustering Approach for Robust Cell Type Profiling Using Single-Cell RNA-Seq Data

Downloaded from rnajournal.cshlp.org on October 1, 2021 - Published by Cold Spring Harbor Laboratory Press

Suhas Srinivasan1*, Anastasia Leshchyk2, Nathan T. Johnson3,4 and Dmitry Korkin1,2,5*

1 Data Science Program, Worcester Polytechnic Institute, Worcester, MA, USA
2 Bioinformatics and Computational Biology Program, Worcester Polytechnic Institute, Worcester, MA, USA
3 Laboratory of Systems Pharmacology, Harvard Program in Therapeutic Science, Harvard Medical School, Boston, MA, USA
4 Breast Tumor Immunology Laboratory, Dana-Farber Cancer Institute, Boston, MA, USA
5 Department of Computer Science, Worcester Polytechnic Institute, Worcester, MA, USA

* Correspondence: [email protected]; [email protected]

Abstract

Single-cell RNA sequencing (scRNA-seq) is a recent technology that enables fine-grained discovery of cellular subtypes and specific cell states. Analysis of scRNA-seq data routinely involves machine learning methods, such as feature learning, clustering, and classification, to assist in uncovering novel information from the data. However, current methods are not well suited to deal with the substantial amount of noise created by the experiments or the variation that occurs due to differences among cells of the same type. To address this, we developed a new hybrid approach, Deep Unsupervised Single-cell Clustering (DUSC), which integrates feature generation based on a deep learning architecture, using a new technique to estimate the number of latent features, with a model-based clustering algorithm, to find a compact and informative representation of the single-cell transcriptomic data that generates robust clusters. We also include a technique to estimate an efficient number of latent features in the deep learning model.
Our method outperforms both classical and state-of-the-art feature learning and clustering methods, approaching the accuracy of supervised learning. We applied DUSC to a single-cell transcriptomics dataset obtained from a triple-negative breast cancer tumor to identify potential cancer subclones accentuated by copy-number variation and to investigate the role of clonal heterogeneity. Our method is freely available to the community and will hopefully facilitate our understanding of the cellular atlas of living organisms as well as provide the means to improve patient diagnostics and treatment.

Introduction

Despite centuries of research, our knowledge of the cellular architecture of human tissues and organs is still very limited. Microscopy has conventionally been used as a fundamental method to discover novel cell types and to study cell function and cell differentiation states through staining and image analysis [1]. However, this approach cannot identify heterogeneous subpopulations of cells that look similar but perform different functions. Recent developments in single-cell RNA sequencing (scRNA-seq) have enabled harvesting gene expression data from a wide range of tissue types, cell types, and cell development stages, allowing for a fine-grained discovery of cellular subtypes and specific cell states [2]. Single-cell RNA sequencing data have played a critical role in the recent discoveries of new cell types in the human brain [3], gut [4], lungs [5], and immune system [6], as well as in determining cellular heterogeneity in cancerous tumors, which could help improve prognosis and therapy [7, 8].
Single-cell experiments produce datasets that have three main characteristics of big data: volume (number of samples and number of transcripts per sample), variety (types of tissues and cells), and veracity (missing data, noise, and dropout events) [9]. Recently emerging large initiatives, such as the Human Cell Atlas [10], rely on single-cell sequencing technologies at an unprecedented scale and have generated datasets obtained from hundreds of thousands and even millions of cells. The high numbers of cells, in turn, make it possible to account for data variability due to cellular heterogeneity and different cell-cycle stages. As a result, there is a critical need to automate the processing and analysis of scRNA-seq data. For instance, for the analysis of large transcriptomics datasets, computational methods are frequently employed that find patterns associated with cellular heterogeneity or cellular development and group cells according to these patterns.

If one assumes that all cellular types or stages extractable from a single-cell transcriptomics experiment have been previously identified, it is possible to apply a supervised learning classifier. Supervised learning methods are trained on data extracted from individual cells whose types are known. Previously developed approaches for supervised cell type classification have leveraged data from image-based screens [11] and flow cytometry experiments [12]. There has also been a recent development of supervised classifiers for single-cell transcriptomic data [13], including methods that implement neural networks trained on a combination of transcriptomic data and protein interaction data [79].
While a supervised learning approach is expected to be more accurate in identifying previously observed cellular types, its main disadvantage is its limited capacity to discover new cell types or to identify previously known cell types whose RNA-seq profiles differ from those observed in the training set.

Another popular technique for scRNA-seq data analysis is unsupervised learning, or clustering. In this approach, no training data are provided. Instead, the algorithm looks to uncover intrinsic similarities shared between cells of the same type and not shared between cells of different types [14]. Often, clustering analysis is coupled with a feature learning method to filter out thousands of unimportant features extracted from the scRNA-seq data. In a recent study, Principal Component Analysis (PCA) was applied to gene expression data from scRNA-seq experiments profiling neuronal cells [15]. With the goal of identifying useful gene markers that underlie specific cell types in the dorsal root ganglion of mice, the study discovered 11 distinct cellular clusters. Other approaches have also adopted this strategy of combining a simple but efficient feature learning method with a clustering algorithm to detect groups of cells that could be of different subtypes or at different stages of cellular development [21, 65]. One challenge faced by such an approach is that scRNA-seq data exhibit a complex high-dimensional structure, and such complexity cannot be accurately captured in fewer dimensions by simple linear feature learning methods.

A nonlinear method frequently used in scRNA-seq data analysis for clustering and visualization is t-distributed stochastic neighbor embedding (t-SNE) [16].
While t-SNE can preserve the local clusters, preserving the global hierarchical structure of clusters is often problematic [17]. Furthermore, conventional feature learning methods may not be well suited for scRNA-seq experiments that have a considerable amount of both experimental and biological noise or dropout events [18, 19]. To address this problem, two recent methods have been introduced, pcaReduce [20] and SIMLR [21]. pcaReduce integrates agglomerative hierarchical clustering with PCA to generate a hierarchy in which cluster similarity is measured in subspaces of gradually decreasing dimensionality. The other approach, SIMLR, learns different cell-to-cell distances by analyzing the gene expression matrix; it then performs feature learning, clustering, and visualization. The computational complexity of the denoising technique in SIMLR prevents its application to large datasets. Therefore, a different pipeline is used to handle large data, in which the computed similarity measure is approximated and the diffusion approach that reduces the effects of noise is not used. In addition to the dimension reduction methods, K-means is a popular clustering method used in single-cell transcriptomics analysis. While it is arguably the most popular divisive clustering algorithm, it has several limitations [22, 23].

In this work, we explored the possibility of leveraging an unsupervised deep learning approach [24] to handle the complexities of scRNA-seq data and overcome the above limitations of the current feature learning methods.
It has been theoretically shown that multilayer feed-forward artificial neural networks with an arbitrary squashing function and a sufficient number of hidden units (latent features) are universal approximators [25] capable of performing dimensionality reduction [26]. A recently published method, scVI, implemented unsupervised neural networks to overcome specific problems of the library size and batch
