Robustness and Applicability of Transcription Factor and Pathway Analysis Tools on Single-Cell RNA-Seq Data Christian H
Total Page:16
File Type:pdf, Size:1020Kb
Holland et al. Genome Biology (2020) 21:36 https://doi.org/10.1186/s13059-020-1949-z RESEARCH Open Access Robustness and applicability of transcription factor and pathway analysis tools on single-cell RNA-seq data Christian H. Holland1,2, Jovan Tanevski1,3, Javier Perales-Patón1, Jan Gleixner4,5, Manu P. Kumar6, Elisabetta Mereu7, Brian A. Joughin6,8, Oliver Stegle4,5,9, Douglas A. Lauffenburger6, Holger Heyn7,10, Bence Szalai11 and Julio Saez-Rodriguez1,2* Abstract Background: Many functional analysis tools have been developed to extract functional and mechanistic insight from bulk transcriptome data. With the advent of single-cell RNA sequencing (scRNA-seq), it is in principle possible to do such an analysis for single cells. However, scRNA-seq data has characteristics such as drop-out events and low library sizes. It is thus not clear if functional TF and pathway analysis tools established for bulk sequencing can be applied to scRNA-seq in a meaningful way. Results: To address this question, we perform benchmark studies on simulated and real scRNA-seq data. We include the bulk-RNA tools PROGENy, GO enrichment, and DoRothEA that estimate pathway and transcription factor (TF) activities, respectively, and compare them against the tools SCENIC/AUCell and metaVIPER, designed for scRNA-seq. For the in silico study, we simulate single cells from TF/pathway perturbation bulk RNA-seq experiments. We complement the simulated data with real scRNA-seq data upon CRISPR-mediated knock-out. Our benchmarks on simulated and real data reveal comparable performance to the original bulk data. Additionally, we show that the TF and pathway activities preserve cell type-specific variability by analyzing a mixture sample sequenced with 13 scRNA-seq protocols. We also provide the benchmark data for further use by the community. Conclusions: Our analyses suggest that bulk-based functional analysis tools that use manually curated footprint gene sets can be applied to scRNA-seq data, partially outperforming dedicated single-cell tools. Furthermore, we find that the performance of functional analysis tools is more sensitive to the gene sets than to the statistic used. Keywords: scRNA-seq, Functional analysis, Transcription factor analysis, Pathway analysis, Benchmark Background molecular processes such as the activity of pathways or Gene expression profiles provide a blueprint of the sta- transcription factors (TFs). These functional analysis tus of cells. Thanks to diverse high-throughput tech- tools are broadly used and belong to the standard toolkit niques, such as microarrays and RNA-seq, expression to analyze expression data [1–4]. profiles can be collected relatively easily and are hence Functional analysis tools typically combine prior very common. To extract functional and mechanistic in- knowledge with a statistical method to gain functional formation from these profiles, many tools have been de- and mechanistic insights from omics data. In the case of veloped that can, for example, estimate the status of transcriptomics, prior knowledge is typically rendered as gene sets containing genes belonging to, e.g., the same * Correspondence: [email protected] biological process or to the same Gene Ontology (GO) 1Institute for Computational Biomedicine, Bioquant, Heidelberg University, annotation. The Molecular Signature Database Faculty of Medicine, and Heidelberg University Hospital, Heidelberg, (MSigDB) is one of the largest collections of curated and Germany 2Joint Research Centre for Computational Biomedicine (JRC-COMBINE), RWTH annotated gene sets [5]. Statistical methods are as abun- Aachen University, Faculty of Medicine, Aachen, Germany dant as the different types of gene sets. Among them, Full list of author information is available at the end of the article © The Author(s). 2020 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Holland et al. Genome Biology (2020) 21:36 Page 2 of 19 the most commonly used are over-representation ana- on the same statistical method but relies on multiple lysis (ORA) [6] and Gene Set Enrichment Analysis GRNs such as tissue-specific networks. (GSEA) [7]. Still, there is a growing number of statistical We first benchmarked the tools on simulated single- methods spanning from simple linear models to cell transcriptome profiles. We found that on this in advanced machine learning methods [8, 9]. silico data the footprint-based gene sets from DoRothEA Recent technological advances in single-cell RNA-seq and PROGENy can functionally characterize simulated (scRNA-seq) enable the profiling of gene expression at single cells. We observed that the performance of the the individual cell level [10]. Multiple technologies and different tools is dependent on the used statistical protocols have been developed, and they have experi- method and properties of the data, such as library size. enced a dramatic improvement over recent years. How- We then used real scRNA-seq data upon CRISPR- ever, single-cell data sets have a number of limitations mediated knock-out/knock-down of TFs [20, 21] to as- and biases, including low library size and drop-outs. sess the performance of TF analysis tools. The results of Bulk RNA-seq tools that focus on cell type identification this benchmark further supported our finding that TF and characterization as well as on inferring regulatory analysis tools can provide accurate mechanistic insights networks can be readily applied to scRNA-seq data [11]. into single cells. Finally, we demonstrated the utility of This suggests that functional analysis tools should in the tools for pathway and TF activity estimation on re- principle be applicable to scRNA-seq data as well. How- cently published data profiling a complex sample with ever, it has not been investigated yet whether these limi- 13 different scRNA-seq technologies [22]. Here, we tations could distort and confound the results, rendering showed that summarizing gene expression into TF and the tools not applicable to single-cell data. pathway activities preserves cell-type-specific informa- In this paper, we benchmarked the robustness and tion and leads to biologically interpretable results. Col- applicability of various TF and pathway analysis tools lectively, our results suggest that the bulk- and on simulated and real scRNA-seq data. We focused footprint-based TF and pathway analysis tools DoRo- on three tools for bulk and three tools for scRNA-seq thEA and PROGENy partially outperform the single-cell data. The bulk tools were PROGENy [12], DoRothEA tools SCENIC, AUCell, and metaVIPER. Although on [13], and classical GO enrichment analysis, combining scRNA-seq data DoRothEA and PROGENy were less ac- GO gene sets [14] with GSEA. PROGENy estimates curate than on bulk RNA-seq, we were still able to ex- the activity of 14 signaling pathways by combining tract relevant functional insight from scRNA-seq data. corresponding gene sets with a linear model. DoRo- thEA is a collection of resources of TF’s targets (reg- Results ulons) that can serve as gene sets for TF activity Robustness of bulk-based TF and pathway analysis tools inference. For this study, we coupled DoRothEA with against low gene coverage the method VIPER [15]asitincorporatesthemode Single-cell RNA-seq profiling is hampered by low gene of regulation of each TF-target interaction. Both coverage due to drop-out events [23]. In our first ana- PROGENy’sandDoRothEA’s gene sets are based on lysis, we focused solely on the low gene coverage aspect observing the transcriptomic consequences (the “foot- and whether tools designed for bulk RNA-seq can deal print”) of the processes of interest rather than the with it. Specifically, we aimed to explore how DoRo- genes composing the process as gene sets [16]. This thEA, PROGENy, and GO gene sets combined with approach has been shown to be more accurate and GSEA (GO-GSEA) can handle low gene coverage in gen- informative in inferring the process’s activity [12, 17]. eral, independently of other technical artifacts and char- The tools specifically designed for application on acteristics from scRNA-seq protocols. Thus, we scRNA-seq data that we considered are SCENIC/ conducted this benchmark using bulk transcriptome AUCell [18] and metaVIPER [19]. SCENIC is a com- benchmark data. In these studies, single TFs and path- putational workflow that comprises the construction ways are perturbed experimentally, and the transcrip- of gene regulatory networks (GRNs) from scRNA-seq tome profile is measured before and after the data that are subsequently interrogated to infer TF perturbation. These experiments can be used to bench- activity with the statistical method AUCell. In mark tools for TF/pathway activity estimation, as they addition, we coupled AUCell with the footprint-based should estimate correctly the change in the perturbed gene sets from DoRothEA and PROGENy that we TF or pathway. The use of these datasets allowed us to hereafter refer to as D-AUCell and P-AUCell. Using systematically control the