PAGA: Graph Abstraction Reconciles Clustering with Trajectory Inference Through a Topology Preserving Map of Single Cells F
Total Page:16
File Type:pdf, Size:1020Kb
Wolf et al. Genome Biology (2019) 20:59 https://doi.org/10.1186/s13059-019-1663-x METHOD Open Access PAGA: graph abstraction reconciles clustering with trajectory inference through a topology preserving map of single cells F. Alexander Wolf1 , Fiona K. Hamey2, Mireya Plass3, Jordi Solana3,JoakimS.Dahlin2,4, Berthold Göttgens2, Nikolaus Rajewsky3, Lukas Simon1 and Fabian J. Theis1,5* Abstract Single-cell RNA-seq quantifies biological heterogeneity across both discrete cell types and continuous cell transitions. Partition-based graph abstraction (PAGA) provides an interpretable graph-like map of the arising data manifold, based on estimating connectivity of manifold partitions (https://github.com/theislab/paga). PAGA maps preserve the global topology of data, allow analyzing data at different resolutions, and result in much higher computational efficiency of the typical exploratory data analysis workflow. We demonstrate the method by inferring structure-rich cell maps with consistent topology across four hematopoietic datasets, adult planaria and the zebrafish embryo and benchmark computational performance on one million neurons. Background and processes such as development, dose response, and Single-cell RNA-seq offers unparalleled opportunities for disease progression. Here, we unify both viewpoints. comprehensive molecular profiling of thousands of indi- A central example of dissecting heterogeneity in single- vidual cells, with expected major impacts across a broad cell experiments concerns data that originate from com- range of biomedical research. The resulting datasets are plex cell differentiation processes. However, analyzing often discussed using the term transcriptional landscape. such data using pseudotemporal ordering [2, 5–9]faces However, the algorithmic analysis of cellular hetero- the problem that biological processes are usually incom- geneity and patterns across such landscapes still faces pletely sampled. As a consequence, experimental data do fundamental challenges, for instance, in how to explain not conform with a connected manifold and the mod- cell-to-cell variation. Current computational approaches eling of data as a continuous tree structure, which is attempt to achieve this usually in one of two ways [1]. the basis for existing algorithms, has little meaning. This Clustering assumes that data is composed of biologically problem exists even in clustering-based algorithms for distinct groups such as discrete cell types or states and the inference of tree-like processes [10–12], which make labels these with a discrete variable—the cluster index. By the generally invalid assumption that clusters conform contrast, inferring pseudotemporal orderings or trajecto- with a connected tree-like topology. Moreover, they rely ries of cells [2–4] assumes that data lie on a connected on feature-space based inter-cluster distances, like the manifold and labels cells with a continuous variable—the euclidean distance of cluster means. However, such dis- distance along the manifold. While the former approach tance measures quantify biological similarity of cells only is the basis for most analyses of single-cell data, the latter at a local scale and are fraught with problems when used enables a better interpretation of continuous phenotypes for larger-scale objects like clusters. Efforts for addressing the resulting high non-robustness of tree-fitting to dis- *Correspondence: [email protected] tances between clusters [10]bysampling[11, 12]have 1Helmholtz Center Munich – German Research Center for Environmental only had limited success. Health, Institute of Computational Biology, Neuherberg, Munich, Germany 5Department of Mathematics, Technische Universität München, Munich, Partition-based graph abstraction (PAGA) resolves Germany these fundamental problems by generating graph-like Full list of author information is available at the end of the article © The Author(s). 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Wolf et al. Genome Biology (2019) 20:59 Page 2 of 9 maps of cells that preserve both continuous and discon- node in V corresponds to a cell and each edge in E nected structure in data at multiple resolutions. The data- represents a neighborhood relation (Fig. 1)[3, 15–17]. driven formulation of PAGA allows to robustly recon- However, the complexity of G and noise-related spuri- struct branching gene expression changes across different ous edges make it both hard to trace a putative biological datasets and, for the first time, enabled reconstructing the process from progenitor cells to different fates and to lineage relations of a whole adult animal [13]. Further- decide whether groups of cells are in fact connected or more, we show that PAGA-initialized manifold learning disconnected. Moreover, tracing isolated paths of single algorithms converge faster, produce embeddings that are cells to make statements about a biological process comes more faithful to the global topology of high-dimensional with too little statistical power to achieve an acceptable data, and introduce an entropy-based measure for quan- confidence level. Gaining power by averaging over distri- tifying such faithfulness. Finally, we show how PAGA butions of single-cell paths is hampered by the difficulty abstracts transition graphs, for instance, from RNA veloc- of fitting realistic models for the distribution of these ity and compare to previous trajectory-inference algo- paths. rithms. With this, PAGA provides a graph abstraction We address these problems by developing a statis- method [14] that is suitable for deriving interpretable tical model for the connectivity of groups of cells, abstractions of the noisy kNN-like graphs that are typi- which we typically determine through graph-partitioning cally used to represent the manifolds arising in scRNA-seq [17–19] or alternatively through clustering or experi- data. mental annotation. This allows us to generate a simpler PAGA graph G∗ (Fig. 1) whose nodes correspond to cell Results groups and whose edge weights quantify the connec- PAGA maps discrete disconnected and continuous tivity between groups. Similar to modularity [20], the connected cell-to-cell variation statistical model considers groups as connected if their Both established manifold learning techniques and single- number of inter-edges exceeds a fraction of the number cell data analysis techniques represent data as a neigh- of inter-edges expected under random assignment. The borhood graph of single cells G = (V, E),whereeach connection strength can be interpreted as confidence in Fig. 1 Partition-based graph abstraction generates a topology-preserving map of single cells. High-dimensional gene expression data is represented as a kNN graph by choosing a suitable low-dimensional representation and an associated distance metric for computing neighborhood relations—in most of the paper, we use PCA-based representations and Euclidean distance. The kNN graph is partitioned at a desired resolution where partitions represent groups of connected cells. For this, we usually use the Louvain algorithm, however, partitions can be obtained in any other way, too. A PAGA graph is obtained by associating a node with each partition and connecting each node by weighted edges that represent a statistical measure of connectivity between partitions, which we introduce in the present paper. By discarding spurious edges with low weights, PAGA graphs reveal the denoised topology of the data at a chosen resolution and reveal its connected and disconnected regions. Combining high-confidence paths in the PAGA graph with a random-walk-based distance measure on the single-cell graph, we order cells within each partition according to their distance from a root cell. A PAGA path then averages all single-cell paths that pass through the corresponding groups of cells. This allows to trace gene expression changes along complex trajectories at single-cell resolution Wolf et al. Genome Biology (2019) 20:59 Page 3 of 9 the presence of an actual connection and allows discard- Box 1. Taking a classification view on embedding algo- ing spurious, noise-related connections (Additional file 1: rithms, we quantify how faithful an embedding is to the Note 1). While G represents the connectivity structure global topology of the high-dimensional data by com- of the data at single-cell resolution, the PAGA graph G∗ paring the distributions P and Q of edges in the high- represents the connectivity structure of the data at the dimensional and embedding spaces using a weighted chosen coarser resolution of the partitioning and allows to Kullback-Leibler divergence identify connected and disconnected regions of the data. ( || ) = disc( || ) + overl( || ) KLgeo P Q KLgeo P Q KLgeo P Q Following paths along nodes in G∗ means following an q p − ensemble of single-cell paths that pass through the cor- = de pe + de ( − ) 1 pe p pe log q 1 pe log − , ∈ de qe de 1 qe responding cell groups in G. By averaging over such an e Efc