DEGREE PROJECT IN MEDICAL ENGINEERING, SECOND CYCLE, 30 CREDITS STOCKHOLM, SWEDEN 2021

Review and Analysis of single-cell RNA sequencing cell-type identification and annotation tools

CORENTIN RAOUX

KTH ROYAL INSTITUTE OF TECHNOLOGY SCHOOL OF ENGINEERING SCIENCES IN CHEMISTRY, BIOTECHNOLOGY AND HEALTH


Degree Programme in Medical Engineering
Date: June 9, 2021
Supervisor: Yufei Luo
Examiner: Matilda Larsson
School of Engineering Sciences in Chemistry, Biotechnology and Health
Host company: Servier
Swedish title: Granskning och Analys av enkelcells-RNA-sekvenseringsverktyg för identifiering och annotering av celltyper

© 2021 Corentin RAOUX

Abstract

Single-cell RNA sequencing makes it possible to study gene expression at the level of individual cells. However, one of the main challenges of single-cell RNA-sequencing analysis today is the identification and annotation of cell types. The current method consists of manually checking the expression of the top differentially expressed genes and comparing them with related cell-type markers available in scientific publications. It is therefore time-consuming and labour-intensive. In the last two years, however, numerous automatic cell-type identification and annotation tools based on different strategies have been created. The lack of specific comparisons of these tools in the literature, especially for immuno-oncology and oncology purposes, makes it difficult for laboratories and companies to know objectively which tools are best for annotating cell types. In this project, a review of the current tools and an evaluation of R tools were carried out. Annotation performance, computation time and ease of use were assessed. According to these preliminary results, the best of the selected R tools seem to be ClustifyR (fast and rather precise) and SingleR (precise) among the correlation-based tools, and SingleCellNet (precise and rather fast) and scPred (precise, but many cell types remain unassigned) among the supervised classification tools. Finally, among the marker-based tools, MAESTRO and SCINA are rather robust if they are provided with high-quality markers.

Keywords: Single-cell RNA sequencing, Automatic cell-type annotation, Classification, Benchmark, Evaluation

Sammanfattning

Single-cell RNA sequencing makes it possible to examine gene expression at the level of individual cells. However, one of the current main challenges of single-cell RNA sequencing is the identification of cell types. The current method consists of manually checking the expression of genes using the top differentially expressed genes and comparing them with the related cell-type markers available in scientific publications. Consequently, it is time-consuming and labour-intensive. Despite this, several automatic tools for cell-type identification and annotation, using different strategies, have been designed and developed during the last two years. The lack of specific comparisons of these tools in the literature, especially for immuno-oncological and oncological purposes, has nevertheless made it difficult for laboratories and companies to determine objectively which tools are actually the best for distinguishing cell types. In this project, the current tools were reviewed and the relevant R tools were evaluated. Annotation performance, computation time and ease of use were also assessed. The preliminary results indicate that the best of the selected tools are ClustifyR (fast and fairly accurate) and SingleR (accurate) among the correlation-based tools, and SingleCellNet (accurate and fairly fast) and scPred (accurate, although many cell types remain unassigned) among the supervised classification tools. Finally, among the marker-based tools, MAESTRO and SCINA are powerful if they are provided with high-quality markers.

Keywords: Single-cell RNA sequencing, Automatic annotation of cell types, Classification, Benchmark, Evaluation

Résumé

Single-cell RNA sequencing makes it possible to study gene expression at the level of individual cells. However, one of the main current challenges of single-cell RNA-sequencing analysis is the identification and annotation of cell types. The current method consists of manually checking gene expression using the top differentially expressed genes and comparing them with cell-type-specific markers found in scientific publications. This is therefore time-consuming and laborious. However, over the last two years, a considerable number of automatic cell-type identification and annotation tools using different strategies have been created. But the lack of specific comparisons of these tools in the literature, especially for immuno-oncological and immunological purposes, makes it difficult for laboratories and companies to know objectively which tools are best for annotating cell types. In this project, a review of the current tools and an evaluation of the R tools were carried out. Annotation performance, computation time and ease of use were assessed. Based on these preliminary results, the best of the selected R tools seem to be ClustifyR (fast and rather precise) and SingleR (precise) among the correlation-based tools, and SingleCellNet (precise and rather fast) and scPred (precise, but many cell types remain unannotated) among the supervised classification tools. Finally, among the marker-based tools, MAESTRO and SCINA are rather robust if they are provided with high-quality markers.

Keywords: Single-cell RNA sequencing, Automatic annotation of cell types, Classification, Comparison, Evaluation

Acknowledgments

I would first like to thank Mrs. Yufei Luo for supervising my work and giving me useful advice, as well as the bioinformatics team and the many people at Servier who welcomed me and helped me with this project. I also want to thank Dr. Stefania Giacomello for generously agreeing to review my project and for helping me improve the content of this report. Finally, I thank my group in the HL205X course, supervised by Dr. Carsten Mim, as well as the various people at KTH for their help and advice at different levels of this project.

Stockholm, June 2021
Corentin RAOUX

Contents

1 Introduction
  1.1 Background
  1.2 Challenge
  1.3 Purpose and Goals
  1.4 Delimitations

2 Methods
  2.1 Tools selection and installation
  2.2 Public Datasets Collection
    2.2.1 Test datasets
    2.2.2 Reference datasets
    2.2.3 Simulated dataset
    2.2.4 Data validity
  2.3 Evaluation Design
    2.3.1 Evaluation Criteria
    2.3.2 Evaluation Metrics
    2.3.3 Evaluation Benchmarking Strategies
    2.3.4 Verification of the reliability of the methods

3 Results and Analysis
  3.1 First configuration - Evaluation of the ability to accurately annotate major cell types
    3.1.1 Zhang Smart-Seq2 - Qian Colorectal
    3.1.2 Kim - Qian lung
    3.1.3 Analysis - (Tables 3.1, 3.2, 3.3, 3.4, 3.5 and 3.6)
  3.2 Second configuration - Evaluation of the ability to accurately annotate deeper sub cell types
    3.2.1 Zhang 10X Genomics - Nieto
    3.2.2 Kim - Nieto
    3.2.3 Analysis - (Tables 3.7, 3.8, 3.9, 3.10, 3.11 and 3.12)
  3.3 Computation time and ease to use
    3.3.1 Computation time
    3.3.2 Ease to use

4 Discussion and Conclusions
  4.1 Discussion
  4.2 Conclusions
  4.3 Future work

References

A State of the Art
  A.1 Introduction
  A.2 Example of field of study where scRNA-seq is applied
  A.3 Workflow of scRNA-seq
    A.3.1 Pre-processing
    A.3.2 Data processing and visualization
    A.3.3 Downstream analysis
  A.4 Review and research of the automatic cell type annotation tools
    A.4.1 Challenges of the annotation
    A.4.2 The different types of annotation tools
    A.4.3 Tool summary according to their categories
    A.4.4 Acquired knowledge in literature reviews
  A.5 Conclusion

B Supplement information
  B.1 Methods used in the literature
  B.2 Comments on these methods

C Further evaluation of MAESTRO

List of Figures

1.1 General scheme of the functioning of a tool

2.1 UMAP representation of the Zhang 10X Genomics test dataset with annotated cell types
2.2 UMAP representation of the Zhang Smart-seq2 test dataset with annotated cell types
2.3 UMAP representation of the Kim 10X Genomics test dataset with annotated cell types
2.4 UMAP representation of the Nieto reference dataset with annotated cell types
2.5 UMAP representation of the Qian Lung Cancer reference dataset with annotated cell types
2.6 UMAP representation of the Qian Colorectal Cancer reference dataset with annotated cell types
2.7 Synthesis of the different annotation levels
2.8 4 Pairs of reference and test datasets configurations
2.9 Cell types correlation matrix of the Qian Lung reference dataset
2.10 Cell types correlation matrix of the Qian Colorectal reference dataset
2.11 Cell types correlation matrix of the Nieto reference dataset
2.12 Cell types correlation matrix of the Zhang 10X genomics dataset
2.13 Cell types correlation matrix of the Zhang Smart-seq2 dataset
2.14 Cell types correlation matrix of the Kim dataset

3.1 Time to train classifiers or pre-process data for the tools that require it, according to the 4 pairs of reference-test datasets
3.2 Computation time for the annotation according to the 4 pairs of reference-test datasets

A.1 Pre-processing workflow
A.2 Data processing and visualization workflow
A.3 Before and after correction (UMAP Visualisation)
A.4 Downstream Analysis – Clustering and Annotation
A.5 3 approaches for automatic cell-type annotation of scRNA-seq datasets: annotation by marker gene databases, correlation-based methods, and annotation by supervised classification
A.6 Classification of the tools according to the 3 approaches for automatic cell type annotation
A.7 Selected tools separated by the 3 approaches for automatic cell type annotation

List of Tables

3.1 Precision, recall and F1-score for the first category of tools on the Zhang Smart-Seq2 - Qian Colorectal configuration
3.2 Precision, recall and F1-score for the second category of tools on the Zhang Smart-Seq2 - Qian Colorectal configuration
3.3 Precision, recall and F1-score for the third category of tools on the Zhang Smart-Seq2 - Qian Colorectal configuration
3.4 Precision, recall and F1-score for the first category of tools on the Kim - Qian Lung configuration
3.5 Precision, recall and F1-score for the second category of tools on the Kim - Qian lung configuration
3.6 Precision, recall and F1-score for the third category of tools on the Kim - Qian Lung configuration
3.7 Precision, recall and F1-score for the first category of tools on the Zhang 10X Genomics - Nieto configuration
3.8 Precision, recall and F1-score for the second category of tools on the Zhang 10X Genomics - Nieto configuration
3.9 Precision, recall and F1-score for the third category of tools on the Zhang 10X Genomics - Nieto configuration
3.10 Precision, recall and F1-score for the first category of tools on the Kim - Nieto configuration
3.11 Precision, recall and F1-score for the second category of tools on the Kim - Nieto configuration
3.12 Precision, recall and F1-score for the third category of tools on the Kim - Nieto configuration
3.13 Advantages and Drawbacks of the tools which require a list of markers as prior knowledge (1st category)
3.14 Advantages and Drawbacks of the tools which use correlation with a reference dataset (2nd category)
3.15 Advantages and Drawbacks of the tools which use supervised classification (3rd category)

A.1 Automatic cell annotation or prediction tools selected for further analyses

C.1 Precision, recall and F1-score for TISCH (MAESTRO) on the Zhang Smart-Seq2 dataset
C.2 Precision, recall and F1-score for TISCH (MAESTRO) on the Kim dataset and for the major cell types
C.3 Precision, recall and F1-score for TISCH (MAESTRO) on the Zhang10X dataset
C.4 Precision, recall and F1-score for TISCH (MAESTRO) on the Kim dataset and for the T cell sub types

List of acronyms and abbreviations

ANN Artificial Neural Network

CCA Canonical correlation analysis

cDNA complementary Deoxyribonucleic Acid

kNN k-Nearest Neighbour

mRNA messenger Ribonucleic Acid

NGS Next Generation Sequencing

PBMC Peripheral Blood Mononuclear Cells

PCA Principal component analysis

PCR Polymerase Chain Reaction

QC Quality Control

RNA-seq RNA sequencing

scRNA-seq Single-cell RNA sequencing

SVM Support Vector Machine

t-SNE t-distributed stochastic neighbour embedding

UMAP Uniform Manifold Approximation and Projection

Chapter 1

Introduction

1.1 Background

RNA sequencing (RNA-seq) data are widely used in pharmaceutical companies for drug efficacy studies or molecular target validation through tissue expression. However, the lack of cell-type specificity is a principal limitation of RNA-seq analysis. A newer sequencing technology, single-cell RNA sequencing (scRNA-seq), can measure the gene expression level of individual cells. It is a rapidly expanding field that holds tremendous potential to improve the understanding of biological problems and of the nature and complexity of human disease. With this method, it is possible to study the expression of a tissue by cell type, and so to better understand cell heterogeneity, cell states and interactions between cells, which can in turn improve the study of drug response and efficacy.

1.2 Challenge

One of the main challenges of scRNA-seq analysis today is the identification and annotation of cell types, which is essential in the analysis of sequencing data. The current method consists of manually checking the expression of the top differentially expressed genes in the data and comparing them with related cell-type markers available in scientific publications. This method is therefore cumbersome, complex and labour-intensive. Automatic cell-type identification appears to be a significant solution. In the last two years, taking advantage of the creation of huge gene-marker databases and of statistical and machine learning methods, research groups have created numerous cell-type identification and annotation tools. These tools use different algorithmic strategies and were created for different purposes, which influences their performance depending on the situation in which they are used. Moreover, comparisons of these tools are lacking in the literature: the number of tools is ever-increasing, and the few comparisons that exist are general rather than purpose-specific and cover several "old" tools. It may therefore be difficult for laboratories and companies that use scRNA-seq technologies to know objectively which automatic cell-type annotation tools are best for their specific scientific purposes.

1.3 Purpose and Goals

In immuno-oncology and oncology, cell types can be very specific, so tools may behave differently than in the situations for which they were created. Furthermore, there is no specific evaluation of them in these fields. The purpose of this project is therefore to identify the best-performing cell-type annotation tools for these specific situations. The best tools will be integrated into the scRNA-seq analysis pipeline of the Servier company∗ and should thereby reduce its time consumption and workload and improve its efficiency. The acquired knowledge could also benefit other institutions or laboratories working in similar fields. The work can be reproduced and does not affect any patients, as only public anonymised datasets and publicly available tools were used or evaluated. The goal of this project is to review and evaluate scRNA-seq cell-type annotation tools. This has been divided into the following four sub-tasks:

1. Review most of the existing automatic cell type annotation tools for scRNA-seq data analysis (see State of the Art, Appendix A)

2. Select representative input datasets and the most detailed reference datasets for tools benchmarking (see Figure 1.1)

3. Evaluate the performance of the tools by their precision, recall, computation time and ease of use

4. Select the tools which have the best performance in order to integrate them in the current pipeline of the company.

∗ French pharmaceutical company specialized in oncology, immuno-inflammatory and neurodegenerative diseases and cardio-metabolism

Figure 1.1: General scheme of the functioning of a tool. The tools require a reference dataset, a markers file (a file listing specific genes, called markers, for each cell type) or a trained classifier. More explanations are given in the State of the Art (Appendix A).

1.4 Delimitations

This work was carried out for the HL205X course within the allotted 17 weeks. Given the time limit, only some of the planned evaluations could be carried out. The others will be done after the submission of this report, during the two remaining months of my internship. Hence, only the chosen R tools (see State of the Art, Appendix A) have been evaluated, on some configurations, and presented in this report. The tools have been evaluated for immuno-oncology and immunology purposes, and it is advised not to generalize the results to other situations or purposes. Moreover, as not all the evaluations could be done when this report was written, the presented results have to be understood as preliminary results.

In the appendix, the State of the Art (Appendix A) presents in more detail the scRNA-seq technology, one example of its use and its workflow. It also presents the literature review and the survey of automatic cell-type annotation tools, their characteristics and the ones chosen for the evaluation. Supplementary information (Appendix B) is also given to summarize the general evaluations already done in the literature and to explain the differences with this project. Please read these parts first if you have any questions at this point.

Chapter 2

Methods

2.1 Tools selection and installation

The 27 tools selected for the performance evaluation, and the reasons for choosing them among all the existing methods, are presented in the State of the Art (Appendix A). For all the tools, the latest version available at the beginning of March 2021 was installed. Thirteen of the seventeen R tools selected in the State of the Art (Appendix A) were installed successfully∗. The ten Python tools were installed late in the project for company IT security reasons. Therefore, the evaluation of those tools could not be presented in this report, but it will nevertheless be done after the thesis submission.

Among these thirteen tools, only scTyper [5] could not be evaluated, due to a malfunction in the code, and it was hence abandoned. After analysis, the problem comes from the design of the algorithm: all the given cell-type markers have to be expressed by the cells to be annotated, otherwise the code fails. This makes annotation impossible, because it requires perfect knowledge of all the genes and cell types in the input dataset, which is precisely what the tool is meant to determine. Other tools raised errors that could be solved. Some functions of scClassifR [6] had to be changed because of a lack of robustness (e.g., removing the line max_clf <- gsub(0_0, max_clf); errors when markers have no numbers in the matrix). This made the pre-processing heavier. SCINA [7] checks whether the marker genes are present in the matrix and removes those that are not, but it does not remove the cell types left with no marker genes, which results in errors. An additional function was therefore created to ensure that such situations do not occur.

∗ CellID [1] had to be abandoned due to an R version incompatibility (R>=4.1), and ACTIONet [2], scCancer [3] and SciBet [4] because the C++ compiler on CentOS 7 failed to compile R packages used by those tools.
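The safeguard added around SCINA can be illustrated with a minimal sketch. It is written in Python for illustration only and is not the function used in this project (which was written for the R tools); the function name `filter_markers`, the marker lists and the gene names are hypothetical examples.

```python
def filter_markers(markers, matrix_genes):
    """Keep only marker genes found in the expression matrix,
    then drop any cell type left with no marker at all."""
    present = set(matrix_genes)
    filtered = {}
    for cell_type, genes in markers.items():
        kept = [gene for gene in genes if gene in present]
        if kept:  # a cell type with no remaining marker is removed
            filtered[cell_type] = kept
    return filtered


# Hypothetical marker file and gene list of a count matrix
markers = {
    "T cell": ["CD3D", "CD3E"],
    "B cell": ["CD79A", "MS4A1"],
    "Mast cell": ["TPSAB1"],
}
matrix_genes = ["CD3D", "CD79A", "MS4A1", "EPCAM"]

cleaned = filter_markers(markers, matrix_genes)
print(cleaned)
```

In this toy case the "Mast cell" entry disappears because its only marker is absent from the matrix, which is the situation that would otherwise raise an error.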

2.2 Public Datasets Collection

In order to perform inter-dataset evaluation with real datasets (see Supplementary Information, Appendix B), it was necessary to find publicly available datasets with enough cells and well-annotated cell-type labels. For each dataset, the raw count matrices and the cell-type annotations in the metadata files were collected from the corresponding publication. The extracted cell-type annotations are used as the ground truth for the evaluations. The datasets were separated into test datasets, on which the annotation is evaluated, and reference datasets, whose count matrices, gene markers or cell-type data serve as prior knowledge for the tools (see State of the Art, Appendix A).

2.2.1 Test datasets

Three publicly available human datasets were used as test datasets. Two come from the same publication, Zhang et al. (2020) [8], and one from Kim et al. (2020) [9]. We chose datasets produced by the two most widely used sequencing techniques (10X Genomics and Smart-seq2), since the information in the data can vary between them, and from colorectal and lung cancer, tissues that differ from those usually used in evaluations (see Appendix B). We also opted to have one dataset of sorted pathological and healthy immune cells. This dataset is aimed at testing the annotation depth of the tools, meaning the level of subtype to which a cell can be annotated. The other datasets, with mixed cell populations (normal cells, tumour cells, immune cells), make it possible to check whether the tools can annotate wider cell populations in different environments.

Zhang et al. (2020) [8]

The two datasets from this publication consist of 43,817 sorted immune cells from 10X Genomics sequencing (Figure 2.1) and 10,468 cells from Smart-seq2 sequencing (Figure 2.2) of the colorectal tumour environment. Annotations of major cell types and immune cell subtypes are provided in the metadata files of those datasets.

Figure 2.1: UMAP representation of the Zhang 10X Genomics test dataset with annotated cell types. UMAP is a non-linear dimensionality reduction method used to project multidimensional data into a space that humans can interpret. In this figure and the following ones, the different colors represent different cell types.

Figure 2.2: UMAP representation of the Zhang Smart-seq2 test dataset with annotated cell types

Kim et al. (2020) [9]

The dataset from this publication consists of 203,298 cells from 10X Genomics sequencing (Figure 2.3) of metastatic lung adenocarcinoma tumours from primary lung tissues, pleural fluids, and lymph node or brain metastases. Annotations of major cell types and immune cell subtypes are provided in the metadata files of this dataset.

Figure 2.3: UMAP representation of the Kim 10X Genomics test dataset with annotated cell types

2.2.2 Reference datasets

Three publicly available and well-annotated human datasets were used as reference datasets. One comes from Nieto et al. (2020) [10] and the other two come from the same publication, Qian et al. (2020) [11]. Most of the cells in the reference datasets were sequenced with the 10X Genomics technique (3' library, and also 5' for Qian). The Qian datasets were chosen because their tissues correspond to those of the test datasets and they provide broader prior information on cell types. The Nieto dataset was chosen because it is an exhaustive dataset of sorted immune cells and provides deeper prior information on cell types for assessing annotation depth.

Nieto et al. (2020) [10]

This dataset combines several datasets of sorted immune cells from different tumour environments (13 cancer types, including colorectal cancer and non-small-cell lung cancer) (Figure 2.4). Major immune cell types (e.g., T cells, B cells and macrophages) and minor ones (e.g., proliferating and dendritic cells) are integrated exhaustively. In this project, we used the downsampled dataset of 24,834 cells. Annotations of immune cell subtypes are provided in the metadata files of this dataset.

Figure 2.4: UMAP representation of the Nieto reference dataset with annotated cell types

Qian et al. (2020) [11]

To enable the tools to annotate non-immune cell types, two datasets were used. The first consists of 93,575 cells from 8 lung cancer patients (Figure 2.5) and the second of 44,684 cells from 7 colorectal cancer patients (Figure 2.6). Malignant and normal cells, as well as tissue and immune cells, are present. However, only major cell-type annotations are provided in the metadata files of those datasets.

Figure 2.5: UMAP representation of the Qian Lung Cancer reference dataset with annotated cell types

Figure 2.6: UMAP representation of the Qian Colorectal Cancer reference dataset with annotated cell types

2.2.3 Simulated dataset

Simulated datasets make it possible to avoid some of the biases present in real experimental data such as the previous datasets, and to vary biological parameters, for example the fold change, in order to assess the robustness of the tools in specific situations. The fold change is the ratio of the average expression of a gene in one cluster (or cell) to its average expression in all the other clusters (or cells) combined; it measures the change in the expression level of a gene∗. The tools can be very sensitive to the level of marker expression, and it is important to know their limits. R packages such as splatter [12] or SymSim [13] make it possible to create such simulated data. They can generate simulated scRNA-seq count data from real datasets, with similar cell-type compositions and the same set of differentially expressed genes, while allowing control of the cellular heterogeneity and the generation of discrete subpopulations or continuous trajectories. With those packages it is possible to vary the magnitude of expression of those genes, and hence their log fold change. We will then be able to evaluate the performance of the tools with varying levels of differentially expressed genes and to identify situations where their performance drops.
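The fold change defined above can be illustrated with a short Python sketch. This is only an illustration of the definition, not the code of splatter or SymSim; the function name `log2_fold_change`, the toy expression values and the optional `pseudocount` argument (an assumption added here to avoid dividing by zero when a gene is not expressed outside the cluster) are all hypothetical.

```python
import math


def log2_fold_change(expression, labels, cluster, pseudocount=0.0):
    """log2 of (mean expression in `cluster`) / (mean expression elsewhere).

    `pseudocount` is an illustrative assumption: a small positive value
    avoids a division by zero when the gene is absent outside the cluster.
    """
    inside = [x for x, lab in zip(expression, labels) if lab == cluster]
    outside = [x for x, lab in zip(expression, labels) if lab != cluster]
    mean_in = sum(inside) / len(inside)
    mean_out = sum(outside) / len(outside)
    return math.log2((mean_in + pseudocount) / (mean_out + pseudocount))


# Toy expression of one gene across six cells from two clusters
expression = [8.0, 8.0, 2.0, 2.0, 2.0, 2.0]
labels = ["T", "T", "B", "B", "B", "B"]

lfc = log2_fold_change(expression, labels, "T")
print(lfc)
```

Here the gene is four times more expressed in the "T" cluster than elsewhere, so the log2 fold change is 2. Lowering this value in simulated data is one way to probe how much marker expression a tool needs.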

2.2.4 Data validity

As the annotations provided with the test and reference datasets are used as ground truth for evaluating the performance of the tools, it is essential that these annotations come from expert knowledge, not from any computational method used by the tools under evaluation, and that they are scientifically verified. We therefore checked in each scientific article how the authors annotated the cells, whether they used a tool to annotate, and whether they verified their annotations experimentally. All the datasets were manually annotated by their research teams using the expression of known gene markers [8, 9, 10, 11], and malignant cells were identified with a copy number variation (CNV) estimation method† [8, 9]. Antibodies were also used to distinguish cell types [8, 9, 11], for example anti-CD45 antibodies to distinguish immune cells from the others. Zhang et al. [8] also performed an enrichment analysis and used a logistic regression model trained with elastic net regularization to compare the similarity between cell groups across different reference datasets. Kim et al. [9] assessed the composition of immune cell types after removing epithelial and stromal populations; the annotated cell types were also compared with flow cytometry results. Nieto et al. [10] and Qian et al. [11] performed canonical correlation analysis (CCA) to identify common cellular phenotypes across different datasets. Nieto et al. tested the robustness of the cell-type annotation with a random forest classifier, and Qian et al. validated their annotations at the protein level using the CITE-seq technique with 198 antibodies. We can therefore be rather confident in the veracity of the annotations. Hence, the annotations in the reference datasets are likely to be true, and if a tool annotates some cell types in the test datasets differently, it is more likely that the tool is wrong than that the annotation of the test datasets is in error.

Finally, the right to use all these public datasets was also verified. Those datasets respect all social and ethical concerns, and anonymity was not infringed.

∗ It is generally expressed in log2 (a log2 fold change of 1 corresponds to a twofold higher expression).
† Characterization by exceptionally high amounts of expressed genes specific to malignant cells.

2.3 Evaluation Design

The evaluation was performed in RStudio with R version 4.0.2 on a CentOS 7 system. The RStudio server has 16 Intel(R) Xeon(R) E5-2660 0 @ 2.20 GHz CPUs and a limited memory of 130 GB, shared between several users.

2.3.1 Evaluation Criteria

In this project, we focused on 4 main evaluation criteria:

• General Accuracy on major cell types (1st and 2nd levels of annotation (see Figure 2.7));

• Depth accuracy of immune cell subtype annotation, i.e., the ability to annotate cell subtypes at different levels of depth (3rd level of annotation, see Figure 2.7);

• Computation time;

• User-friendliness: ease of installation, ease of running and quality of the documentation.

Figure 2.7: Synthesis of the different annotation levels

2.3.2 Evaluation Metrics

Precision and recall have been used as annotation evaluation metrics for each cell type in the datasets:

$$\text{precision} = \frac{TP}{TP + FP} \qquad \text{recall} = \frac{TP}{TP + FN}$$

where TP, FP and FN are the True Positives (correctly annotated: agreement between the metadata of the test dataset and the tool's annotation), False Positives (annotated as the cell type but not corresponding to it in the test metadata) and False Negatives (should have been annotated with the corresponding metadata cell type but were not), respectively. This choice is motivated by the fact that these metrics are commonly used in computer science and that the notion of True Negative is not applicable in our situation. The F1-score, which is the harmonic mean of the precision and recall, has also been used for each cell population to summarize the precision and recall in a single metric:

$$F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$

(1 = high and 0 = low precision and recall).

For unassigned cells, we verified that all cell types present in the test datasets are also present in the reference datasets (see Section 2.2), to be sure that unassigned cells are not due to a lack of cell-type information in the reference datasets. Thus, any unassigned cells are counted as False Negatives.
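The metrics above can be sketched in a few lines of Python. This is an illustrative sketch rather than the evaluation code of this project: the function name `per_class_metrics`, the label "unassigned" and the toy labels are hypothetical, and returning 0 when a denominator is zero is an assumption made here for simplicity.

```python
def per_class_metrics(truth, predicted):
    """Return {cell_type: (precision, recall, f1)} for each true cell type.

    A cell left unassigned by a tool never matches its true label, so it
    is counted as a False Negative for that label.
    """
    metrics = {}
    pairs = list(zip(truth, predicted))
    for cell_type in sorted(set(truth)):
        tp = sum(t == cell_type and p == cell_type for t, p in pairs)
        fp = sum(t != cell_type and p == cell_type for t, p in pairs)
        fn = sum(t == cell_type and p != cell_type for t, p in pairs)
        # Assumption: a metric with an empty denominator is reported as 0
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        metrics[cell_type] = (precision, recall, f1)
    return metrics


# Toy ground truth and tool output for six cells
truth = ["T", "T", "T", "T", "B", "B"]
predicted = ["T", "T", "B", "unassigned", "B", "B"]

scores = per_class_metrics(truth, predicted)
for cell_type, (precision, recall, f1) in scores.items():
    print(cell_type, round(precision, 3), round(recall, 3), round(f1, 3))
```

In this toy case, the "T" population has a precision of 1 but a recall of 0.5, because one T cell was mislabelled and one was left unassigned; both are False Negatives for "T".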

2.3.3 Evaluation Benchmarking Strategies

As explained in the State of the Art (see A), the tools are divided into 3 categories: (1) marker-gene-based tools, (2) correlation-based tools and (3) supervised classifiers, which need a labeled reference dataset for training. Even though these 3 categories use different formats of prior knowledge as input, we tried to give them the same references in order to evaluate them on an equal basis. We therefore avoided using the marker gene files or pretrained classifiers provided by the creators of the tools whenever possible. Thus, the tools work with the same knowledge and the differences in annotation are essentially related to the construction of their algorithms.

The first category requires a marker gene file, and not a reference dataset, as prior knowledge. In this case, we created a marker gene file with the markers given in the supplementary materials of the scientific publications of the reference datasets (Qian Colorectal, Lung and Nieto [10, 11]). In this way, this category is evaluated on data coming from the same references as those used by the other categories. This also allows the tools of this category to receive the same number of marker genes per cell population and to be compared more accurately under the same conditions. Nevertheless, it was not possible to do so for scCatch [14], as it does not support any external reference other than its own database (CellMatch). It could nevertheless be evaluated alongside the other tools. We also verified that the test and reference datasets are not already present as prior knowledge in this database, to avoid conflicts.

The second category only needs the reference datasets as prior knowledge and does not require any other information.

The third category requires trained classifiers to be able to annotate. We therefore trained our own classifiers for each tool by randomly separating each reference dataset into a reference train dataset and a reference test dataset. The reference train datasets are used for the training phase. The reference test datasets include 50 cells from each cell type and are used to verify the quality of the training of the classifiers. The train and test reference datasets were exactly the same for all the classifiers, as different samplings could potentially have an effect on the training∗.

For all the tools, all the parameters were set to default values or to the values provided in the given examples, manuals or vignettes. The test datasets are provided as raw count data after cell quality and gene filtering.

Finally, 3 configurations will be evaluated. The first two configurations each use 2 pairs of test - reference datasets for the evaluation (Figure 2.8).

∗ The test reference dataset is a subset of the reference dataset in this case, and it is different from the test datasets that we want to annotate.
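The split of a reference dataset into train and test parts described above can be sketched as follows. This is only an illustration under stated assumptions: the function name, the fixed random seed used to obtain the same split for every classifier, and the error raised for cell types with too few cells are hypothetical choices, not details given in this report.

```python
import random
from collections import defaultdict


def split_reference(cell_labels, n_test_per_type=50, seed=0):
    """Hold out n_test_per_type cells of each cell type.

    cell_labels maps a cell identifier to its cell type.
    Returns (train_ids, test_ids).
    """
    rng = random.Random(seed)  # fixed seed: every classifier sees the same split
    by_type = defaultdict(list)
    for cell_id in sorted(cell_labels):  # sorted so the split is reproducible
        by_type[cell_labels[cell_id]].append(cell_id)

    train_ids, test_ids = [], []
    for cell_type in sorted(by_type):
        cells = by_type[cell_type]
        if len(cells) <= n_test_per_type:
            # Assumed behaviour: refuse to leave a cell type without training cells.
            raise ValueError(f"not enough cells for cell type {cell_type!r}")
        held_out = set(rng.sample(cells, n_test_per_type))
        test_ids.extend(c for c in cells if c in held_out)
        train_ids.extend(c for c in cells if c not in held_out)
    return train_ids, test_ids


# Toy reference with two cell types of 150 cells each.
toy_labels = {f"cell{i:03d}": ("B" if i % 2 else "T") for i in range(300)}
train_ids, test_ids = split_reference(toy_labels)
```

Computing the split once and passing the same identifiers to every tool is one simple way to make sure that sampling differences do not influence the comparison.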

Figure 2.8: 4 Pairs of reference and test datasets configurations

First configuration

Evaluation of the ability to accurately annotate general cell types (immune, stromal, malignant, etc.) in unsorted datasets of colorectal and lung tissues [Zhang Smart-Seq2 (test) - Qian colorectal (reference) (1) and Kim (test) - Qian lung (reference) (2)] (Figure 2.8).

Second configuration

Evaluation of the ability to accurately annotate deeper sub cell types in sorted immune datasets of colorectal and lung tissues [Zhang 10X Genomics (test) - Nieto (reference) (3) and Kim (test) - Nieto (reference) (4)] (Figure 2.8).

Third configuration: Simulated dataset

Evaluation of the ability to handle low fold changes and assessment of the fold change boundaries corresponding to a loss of precision. Due to the limited amount of time allowed by the course HL205X, this configuration will be carried out after the presentation of this report.

2.3.4 Verification of the reliability of the methods

The inter-dataset configuration used here corresponds to a real-case scenario and reflects the performance to be expected when the tools are used in real situations. The simulated dataset configuration will highlight the limits of the tools as the expression of the markers decreases.

Effort was made to ensure that the tools are evaluated on previously unknown datasets, or on datasets the tools were not trained on, in order not to create knowledge biases. We also verified that the reference datasets are independent of the test datasets. Moreover, to avoid annotation biases and potential misclassifications due to the use of different tissues between the reference and test datasets, we selected reference and test datasets that come from the same tissues and paired them together. Finally, we used similar prior information whenever possible to evaluate the tools. The tools can thus be compared fairly both within a tool category and across categories.

The Qian reference datasets can only be used to assess the global annotation of major cell types, as there is no deeper distinction among T cell subtypes. This means that in the first configuration, the ILC, NK, CD8 and CD4 subtypes will be annotated as T cells. In contrast, the Nieto reference dataset will be used to assess the annotation of deeper immune cell types. All the immune subtypes are present in the Nieto reference dataset, so a missing subtype cannot be the cause of problems of deep annotation. All the cell types in the reference are also present in the test datasets, so there is no reason for an incorrect annotation of those cell types. However, some cell types of the test datasets are not present in the reference, for instance oligodendrocytes in the Kim test dataset∗. Hence, those cell types cannot, in theory, be annotated and will not be considered in the evaluation.

In all the reference datasets, there is a high correlation between most of the cell types (Figures 2.9, 2.10 and 2.11). The correlation is higher in the Nieto reference dataset, whose cell types are all immune sub cell types and therefore highly similar, than in the Qian reference datasets, which have broader major cell types. We can therefore expect lower performance for the second configuration because the cell types are less separable. In contrast, the similarity between cell types is lower in the test datasets (Figures 2.12, 2.13 and 2.14). We therefore expect fewer misclassifications in the first configuration because the cell types are less similar and more easily separable.

However, there are possible challenges. The first one is the annotation of NK and T cells, as there is a high correlation between NK and T cell subtypes in the test datasets (Figures 2.12 and 2.14). The second one is the annotation of epithelial and malignant cells in the first configuration, because these cell types are highly similar in the Zhang Smart-seq2 and Kim test datasets and in the Qian Colo reference dataset.

∗ These cells come from the brain cells that were also sequenced in this dataset.
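The label handling described in this section (collapsing subtypes that the reference does not distinguish, and ignoring test cell types that are absent from the reference) could be sketched as below. The mapping, the label strings and the function name are hypothetical and only illustrate the principle; they are not the exact labels of the datasets.

```python
# Hypothetical mapping for the first configuration, where the Qian references
# do not distinguish the ILC, NK, CD8 and CD4 subtypes from T cells.
SUBTYPE_TO_MAJOR = {
    "ILC": "T cells",
    "NK": "T cells",
    "CD8": "T cells",
    "CD4": "T cells",
}


def harmonize(test_labels, predictions, reference_types, mapping=SUBTYPE_TO_MAJOR):
    """Collapse test subtypes and drop cells whose type is absent from the reference."""
    kept_truth, kept_predictions = [], []
    for truth, prediction in zip(test_labels, predictions):
        truth = mapping.get(truth, truth)
        if truth not in reference_types:
            continue  # e.g. oligodendrocytes: not considered in the evaluation
        kept_truth.append(truth)
        kept_predictions.append(prediction)
    return kept_truth, kept_predictions


truth, calls = harmonize(
    ["CD8", "NK", "Oligodendrocytes", "B cells"],
    ["T cells", "T cells", "B cells", "B cells"],
    reference_types={"T cells", "B cells"},
)
```

Applying such a step before computing the metrics keeps the ground truth and the tool output on the same label vocabulary.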

Figure 2.9: Cell types correlation matrix of the Qian Lung reference dataset

Figure 2.10: Cell types correlation matrix of the Qian colorectal reference dataset

Figure 2.11: Cell types correlation matrix of the Nieto reference dataset

Figure 2.12: Cell types correlation matrix of the Zhang 10X Genomics dataset

Figure 2.13: Cell types correlation matrix of the Zhang Smart-seq2 dataset

Figure 2.14: Cell types correlation matrix of the Kim dataset
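This report does not detail how the cell type correlation matrices of Figures 2.9 to 2.14 were computed. One plausible way, shown here purely as an assumption, is to average the expression of the cells of each cell type and to compute the Pearson correlation between these mean profiles. All function names are illustrative.

```python
import math
from collections import defaultdict


def mean_profiles(expression, labels):
    """Average the expression vectors (one per cell) of the cells sharing a label."""
    grouped = defaultdict(list)
    for vector, label in zip(expression, labels):
        grouped[label].append(vector)
    return {
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if one is constant)."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # assumed convention for a constant profile
    return cov / (norm_x * norm_y)


def cell_type_correlation(expression, labels):
    """Return the sorted cell types and their pairwise correlation matrix."""
    profiles = mean_profiles(expression, labels)
    cell_types = sorted(profiles)
    matrix = [[pearson(profiles[a], profiles[b]) for b in cell_types] for a in cell_types]
    return cell_types, matrix


# Toy data: 4 cells x 3 genes, two cell types with opposite expression trends.
toy_expression = [[1, 2, 3], [3, 4, 5], [3, 2, 1], [5, 4, 3]]
toy_types, toy_matrix = cell_type_correlation(toy_expression, ["A", "A", "B", "B"])
```

A high off-diagonal value in such a matrix indicates two cell types with similar average expression, which is the situation described above for the NK and T cell subtypes.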

Chapter 3

Results and Analysis

The methods described in the previous chapter were implemented and executed as described. The results are presented below. As the performance depends entirely on the prior knowledge given as input, the results have to be considered relative to the other tools on the same configuration, and the conclusions cannot be transferred directly to other situations. Finally, SCINA, scClassify and HieRFIT could not be evaluated on the Kim test dataset because they require too much memory and trigger an error. scClassifR was evaluated only on the Zhang Smart-Seq2 (test) - Qian Colorectal (reference) pair because too many computation errors occurred for the other pairs.

3.1 First configuration - Evaluation of the ability to accurately annotate major cell types

3.1.1 Zhang Smart-Seq2 - Qian Colorectal

Table 3.1: Precision, recall and F1-score for the first category of tools on the Zhang Smart-Seq2 - Qian Colorectal configuration

Table 3.2: Precision, recall and F1-score for the second category of tools on the Zhang Smart-Seq2 - Qian Colorectal configuration

Table 3.3: Precision, recall and F1-score for the third category of tools on the Zhang Smart-Seq2 - Qian Colorectal configuration

3.1.2 Kim - Qian lung

Table 3.4: Precision, recall and F1-score for the first category of tools on the Kim - Qian Lung configuration

Table 3.5: Precision, recall and F1-score for the second category of tools on the Kim - Qian Lung configuration

Table 3.6: Precision, recall and F1-score for the third category of tools on the Kim - Qian Lung configuration

3.1.3 Analysis - (Tables 3.1, 3.2, 3.3, 3.4, 3.5 and 3.6)

Several tools provided an overall high precision and recall on both pairs of reference and test datasets: ClustifyR, SingleR, SingleCellNet and scPred∗. CHETAH had good precision but lower recall on some cell types due to its tree construction (many cells are left behind in the nodes and counted as False Negatives). scClassify and HieRFIT had an overall good precision and recall on the Zhang Smart-seq2 (test) - Qian Colorectal (reference) pair†.

For the tools which use marker information, the overall precision and recall are lower. scClassifR confused cells especially among B, epithelial and T cells. Garnett misclassified many cells, which were annotated either as one major cell type or as unknown. There is also a lot of confusion for scCatch. Nevertheless, scCatch can reach good precision when the cell types are recognized in its database. Finally, SCINA and MAESTRO seem more robust but also incorrectly assigned many cell types. This difficulty to annotate correctly seems to be related to the quality of the given markers, which seem to be the most expressed genes per cell type rather than specific marker genes‡. This highlights the high dependence of these tools on the quality of the marker genes. In order to verify that this lower performance is only due to the quality of the marker genes, another evaluation was performed and is presented in section C.

For all the tools, the annotation of epithelial cells and cancer cells was challenging, as expected (see 2.3.4), due to the high similarity between those cell types in the Qian Colo reference dataset and in the Zhang Smart-Seq2 and Kim test datasets.

∗ Even though, for scPred, many cells are unassigned, which decreases the overall recall.
† HieRFIT only annotated some B cells as Myeloid, which decreases the precision of the Myeloid cells and the recall of the B cells.
‡ Many of the genes given as marker genes are in fact provided for several cell types.
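The observation that many marker genes are listed for several cell types can be checked directly on a marker gene file. The sketch below is an illustration only: the function name and the toy marker lists are assumptions, and the genes are placeholders rather than the actual content of the marker files used in this project.

```python
from collections import defaultdict


def marker_specificity(markers):
    """Split markers into shared genes and genes specific to one cell type.

    markers maps a cell type to a list of gene symbols.
    Returns (shared, specific) where shared maps a gene to the cell types
    listing it and specific maps a cell type to its unique genes.
    """
    gene_to_types = defaultdict(set)
    for cell_type, genes in markers.items():
        for gene in genes:
            gene_to_types[gene].add(cell_type)

    shared = {
        gene: sorted(types) for gene, types in gene_to_types.items() if len(types) > 1
    }
    specific = {
        cell_type: sorted(g for g in genes if len(gene_to_types[g]) == 1)
        for cell_type, genes in markers.items()
    }
    return shared, specific


# Toy marker file: LMNA is listed for both cell types and is therefore not specific.
toy_markers = {"T cells": ["CD3E", "LMNA"], "B cells": ["CD79A", "LMNA"]}
shared_genes, specific_genes = marker_specificity(toy_markers)
```

A cell type left with few or no specific genes after such a check is a likely candidate for confusion in the marker-based tools.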

3.2 Second configuration - Evaluation of the ability to accurately annotate deeper sub cell types

3.2.1 Zhang 10X Genomics - Nieto

Table 3.7: Precision, recall and F1-score for the first category of tools on the Zhang 10X Genomics - Nieto configuration

Table 3.8: Precision, recall and F1-score for the second category of tools on the Zhang 10X Genomics - Nieto configuration

Table 3.9: Precision, recall and F1-score for the third category of tools on the Zhang 10X Genomics - Nieto configuration

3.2.2 Kim - Nieto

Table 3.10: Precision, recall and F1-score for the first category of tools on the Kim - Nieto configuration

Table 3.11: Precision, recall and F1-score for the second category of tools on the Kim - Nieto configuration

Table 3.12: Precision, recall and F1-score for the third category of tools on the Kim - Nieto configuration

3.2.3 Analysis - (Tables 3.7, 3.8, 3.9, 3.10, 3.11 and 3.12)

Some tools had only a few misclassifications among the subtypes and good precision and recall for the major lineages (B cells, myeloid and T cells): ClustifyR, SingleR and SingleCellNet. SCINA and MAESTRO also have good precision and recall for the major cell types. Some cells and cell types are unassigned, but the better specification of the marker genes seems to have improved the precision and recall of those tools. Other tools had an overall good precision but lower recall, such as CHETAH (many cells lost in the nodes of the tree construction), scClassify with Pearson or Spearman correlation∗ (many cells annotated as the union of several sub cell types and many unassigned cells) and scPred (due to a high number of unassigned cells). Finally, many cell types are not assigned and many cells are not annotated with Garnett, HieRFIT or scCatch. There are many misclassifications for HieRFIT as well as for Garnett. The rather good results of SCINA and MAESTRO show that the misclassifications of Garnett are not only due to the markers but also to the lack of robustness of its algorithm. scCatch often assigned cells to cell types other than those in the dataset and is not able to discriminate between B and T cell subtypes.

The regular misclassifications are usually among the subtypes of the same lineage (B cells, Myeloid and T cells)†. The main ones are between the T cell subtypes, as the similarity between them is high‡.

∗ scClassify-Spearman had all the cell types assigned, unlike scClassify-Pearson, and had a better mean F1-score for the major cell types. Only the recall of the myeloid cells is lower for scClassify-Spearman compared to scClassify-Pearson.
† Plasma B cells as Follicular B cells, Follicular B cells as proliferative B cells, DC annotated as Macrophages or Monocytes and vice-versa, TAMs SPP1 annotated as Macrophages and Monocytes, Mast cells as DC or Macrophages, ...
‡ NK as CD8, CD8 as Th17 and NK (probably related to cytotoxic activity), CD4 as Naive/proliferative T cells, Treg, Th17 or CD8 and vice-versa, Exhausted CD8 annotated as CD8, ...

3.3 Computation time and ease of use

3.3.1 Computation time

Figure 3.1: Time to train the classifiers or pre-process the data, for the tools that require it, on the 4 pairs of reference-test datasets

Figure 3.2: Computation time for the annotation on the 4 pairs of reference-test datasets

The tools require different amounts of time to carry out the pre-processing and the annotation (Figures 3.1 and 3.2). The differences depend on the number of cells to annotate and on the algorithm construction. The computation time can reach the order of $10^4$ s for huge datasets and for some tools. Moreover, the pre-processing often requires a lot of time, but it only has to be done once. Tools such as SCINA, ClustifyR, scClassify and SingleCellNet have a lower computation time than the other tools.
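The way the computation times were measured is not specified here. As a hedged illustration, the training (or pre-processing) step and the annotation step can be timed separately with a small wrapper such as the one below; `train_classifier` and `annotate` are hypothetical stand-ins for the corresponding functions of a tool.

```python
import time


def timed(step, *args, **kwargs):
    """Run one step and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = step(*args, **kwargs)
    return result, time.perf_counter() - start


# Hypothetical stand-ins for the training and annotation functions of a tool.
def train_classifier(reference_cells):
    return {"n_reference_cells": len(reference_cells)}


def annotate(model, test_cells):
    return ["unassigned" for _ in test_cells]


model, training_seconds = timed(train_classifier, ["ref1", "ref2"])
labels, annotation_seconds = timed(annotate, model, ["cell1", "cell2", "cell3"])
```

Reporting the two durations separately matches the distinction made above: the training or pre-processing cost is paid once, whereas the annotation cost is paid for every test dataset.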

3.3.2 Ease of use

Table 3.13: Advantages and Drawbacks of the tools which require a list of markers as prior knowledge (1st category)

Table 3.14: Advantages and Drawbacks of the tools which use correlation with a reference dataset (2nd category)

Table 3.15: Advantages and Drawbacks of the tools which use supervised classification (3rd category)

Chapter 4

Discussion and Conclusions

4.1 Discussion

First, the present report only presents the preliminary results of a general evaluation: more tools should be evaluated and more configurations should be tested to obtain a better assessment. However, the results already provide useful information on the tools, as intended at the beginning of the project.

The main challenge for the tools was distinguishing highly similar cell types, and it is possible that the authors of the publications from which the ground-truth labels were taken experienced the same problem. Indeed, the discrimination between cell subtypes is also challenging for researchers and requires efficient methods. It is possible that some cells were also misclassified by the researchers. Therefore, even if the validity of the data was verified, these uncertainties have to be taken into account when analysing the results, especially for the second configuration. Nevertheless, the results remain reliable for comparison.

Furthermore, the annotation was challenging for the marker-based tools, as their performance strongly depends on the quality and number of the marker genes provided. The marker genes of the Nieto reference dataset seem to be very specific to each cell type, which enables the tools of the first category to obtain rather good results. In contrast, the genes from the Qian reference datasets seem to be the most expressed genes per cell type and do not only include specific markers of each cell type∗. This results in annotation mistakes for the less robust tools. This is the case, for example, with Garnett, which either assigned most cells to a few cell types or left them unassigned. If better markers had been used, the performance of the tools of the first category could have been better. This is confirmed by the supplementary evaluation presented in section C of the appendix. However, this shows that it is challenging to obtain high-quality marker files, which is why, in their absence, it is preferable to use correlation-based tools.

∗ For example, the gene LMNA, which codes for lamins of the nuclear envelope, is used as a marker for many cell types and is therefore not specific.

4.2 Conclusions

In this project, an evaluation of thirteen selected R tools was carried out. The ability to accurately annotate general cell types (immune, stromal, malignant, etc.) in unsorted datasets of colorectal and lung tissues and the ability to accurately annotate deeper sub cell types in sorted immune datasets of colorectal and lung tissues were assessed, as well as the computation time and the ease of use. According to these preliminary results, the best selected R tools seem to be ClustifyR (fast and rather precise) and SingleR (precise) for the correlation-based tools, and SingleCellNet (precise and rather fast) and scPred (precise but many cell types remain unassigned) for the supervised classification tools. Finally, for the marker-based tools, MAESTRO and SCINA∗ are the most robust. Nevertheless, they need high-quality markers to perform precisely.

4.3 Future work

Due to the limited amount of time, not all the objectives of the evaluation have been met. In this section we focus on some of the remaining steps. First, the third configuration with the simulated datasets will be carried out on the selected R tools. Secondly, all three configurations will be run on the 10 selected Python tools (see A). This will give a good view of the advantages and drawbacks of these tools. This project will continue in the future in order to keep the evaluation updated with new versions of the tools, new tools or new configurations.

∗ SCINA does not support datasets with a high number of cells.

References

[1] A. Cortal, L. Martignetti, E. Six, and A. Rausell, “Cell-id: gene signature extraction and cell identity recognition at individual cell level,” (Preprint) bioRxiv, 2020. doi: https://doi.org/10.1101/2020.07.23.215525

[2] S. Mohammadi, J. Davila-Velderrain, and M. Kellis, “A multiresolution framework to characterize single-cell state landscapes,” Nature communications, vol. 11, no. 1, p. 5399, 2020. doi: https://doi.org/10.1038/s41467-020-18416-6

[3] W. Guo, D. Wang, S. Wang, Y. Shan, C. Liu, and J. Gu, “sccancer: a package for automated processing of single-cell rna-seq data in cancer,” Briefings in Bioinformatics, 2020. doi: https://doi.org/10.1093/bib/bbaa127

[4] C. Li, B. Liu, B. Kang, Z. Liu, Y. Liu, C. Chen, X. Ren, and Z. Zhang, “Scibet as a portable and fast single cell type identifier,” Nature communications, vol. 11, no. 1, p. 1818, 2020. doi: https://doi.org/10.1038/s41467-020-15523-2

[5] J. Choi, H. In Kim, and H. Woo, “sctyper: a comprehensive pipeline for the cell typing analysis of single-cell rna-seq data.” BMC Bioinformatics, vol. 21, p. 342, 2020. doi: https://doi.org/10.1186/s12859-020-03700-5

[6] V. Nguyen and J. Griss, “Scclassifr: Framework to accurately classify cell types in single-cell rna-sequencing data,” (Preprint) bioRxiv, 2020. doi: https://doi.org/10.1101/2020.12.22.424025

[7] Z. Zhang, D. Luo, X. Zhong, J. H. Choi, Y. Ma, S. Wang, E. Mahrt, W. Guo, E. W. Stawiski, Z. Modrusan, S. Seshagiri, P. Kapur, G. C. Hon, J. Brugarolas, and T. Wang, “Scina: A semi-supervised subtyping algorithm of single cells and bulk samples,” Genes, vol. 10, no. 7, p. 531, 2019. doi: https://doi.org/10.3390/genes10070531

[8] L. Zhang, Z. Li, K. M. Skrzypczynska, Q. Fang, W. Zhang, S. A. O’Brien, Y. He, L. Wang, Q. Zhang, A. Kim, R. Gao, J. Orf, T. Wang, D. Sawant, J. Kang, D. Bhatt, D. Lu, C. M. Li, A. S. Rapaport, K. Perez, and X. Yu, “Single-cell analyses inform mechanisms of myeloid-targeted therapies in colon cancer.” Cell, vol. 181, no. 2, p. 442–459.e29, 2020. doi: https://doi.org/10.1016/j.cell.2020.03.048

[9] N. Kim, H. Kim, K. Lee, Y. Hong, J. Cho, J. Choi, J. Lee, Y. L. Suh, B. Ku, H. Eum, S. Choi, Y. Choi, J. Joung, W. Park, H. Jung, J. Sun, S. Lee, J. Ahn, K. Park, M. Ahn, and H. Lee, “Single-cell rna sequencing demonstrates the molecular and cellular reprogramming of metastatic lung adenocarcinoma.” Nature Communications, vol. 11, no. 1, p. 2285, 2020. doi: https://doi.org/10.1038/s41467-020-16164-1

[10] P. Nieto, M. Elosua-Bayes, J. L. Trincado, D. Marchese, R. Massoni-Badosa, M. Salvany, A. Henriques, E. Mereu, C. Moutinho, S. Ruiz, P. Lorden, V. T. Chin, D. Kaczorowski, C. Chan, R. Gallagher, A. Chou, E. Planas-Rigol, C. Rubio-Perez, I. Gut, J. M. Piulats, J. Seoane, J. E. Powell, E. Batlle, and H. Heyn, “A single-cell tumor immune atlas for precision oncology,” (Preprint) bioRxiv, 2020. doi: https://doi.org/10.1101/2020.10.26.354829

[11] J. Qian, S. Olbrecht, B. Boeckx, H. Vos, D. Laoui, E. Etlioglu, E. Wauters, V. Pomella, S. Verbandt, P. Busschaert, A. Bassez, A. Franken, M. V. Bempt, J. Xiong, B. Weynand, Y. van Herck, A. Antoranz, F. M. Bosisio, B. Thienpont, G. Floris, and D. Lambrechts, “A pan-cancer blueprint of the heterogeneous tumor microenvironment revealed by single-cell profiling,” Cell research, vol. 30, no. 9, pp. 745–762, 2020. doi: https://doi.org/10.1038/s41422-020-0355-0

[12] L. Zappia, B. Phipson, and A. Oshlack, “Splatter: simulation of single-cell rna sequencing data.” Genome biology, vol. 18, no. 1, p. 174, 2017. doi: https://doi.org/10.1186/s13059-017-1305-0

[13] X. Zhang, C. Xu, and N. Yosef, “Simulating multiple faceted variability in single cell rna sequencing.” Nature Communications, vol. 10, no. 1, p. 2611, 2019. doi: https://doi.org/10.1038/s41467-019-10500-w

[14] X. Shao, J. Liao, X. Lu, R. Xue, N. Ai, and X. Fan, “sccatch: Automatic annotation on cell types of clusters from single-cell rna sequencing data,” iScience, vol. 23, no. 3, p. 100882, 2020. doi: https://doi.org/10.1016/j.isci.2020.100882

[15] A. Haque, J. Engel, S. A. Teichmann, and T. Lönnberg, “A practical guide to single-cell rna-sequencing for biomedical research and clinical applications,” Genome medicine, vol. 9, no. 1, p. 75, 2017. doi: https://doi.org/10.1186/s13073-017-0467-4

[16] G. C. Yuan, L. Cai, M. Elowitz, T. Enver, G. Fan, G. Guo, R. Irizarry, P. Kharchenko, J. Kim, S. Orkin, J. Quackenbush, A. Saadatpour, T. Schroeder, R. Shivdasani, and I. Tirosh, “Challenges and emerging directions in single-cell analysis,” Genome biology, vol. 18, no. 1, p. 84, 2017. doi: https://doi.org/10.1186/s13059-017-1218-y

[17] M. D. Luecken and F. J. Theis, “Current best practices in single-cell rna-seq analysis: a tutorial,” Molecular systems biology, vol. 15, no. 6, 2019. doi: https://doi.org/10.15252/msb.20188746

[18] F. Tang, C. Barbacioru, Y. Wang, E. Nordman, C. Lee, N. Xu, X. Wang, J. Bodeau, B. B. Tuch, A. Siddiqui, K. Lao, and M. A. Surani, “mrna-seq whole-transcriptome analysis of a single cell,” Nature methods, vol. 6, no. 5, pp. 377–382, 2009. doi: https://doi.org/10.1038/nmeth.1315

[19] F. Pia Caruso, L. Garofano, F. D’Angelo, K. Yu, F. Tang, J. Yuan, J. Zhang, L. Cerulo, S. M. Pagnotta, D. Bedognetti, P. A. Sims, M. Suvà, X.-D. Su, A. Lasorella, A. Iavarone, and M. Ceccarelli, “A map of tumor–host interactions in glioma at single-cell resolution,” GigaScience, vol. 9, no. 10, 2020. doi: https://doi.org/10.1093/gigascience/giaa109

[20] R. Sandberg, “Entering the era of single-cell transcriptomics in biology and medicine,” Nature methods, vol. 11, no. 1, pp. 22–24, 2014. doi: https://doi.org/10.1038/nmeth.2764

[21] A. Saadatpour, S. Lai, G. Guo, and G. C. Yuan, “Single-cell analysis in cancer genomics,” Trends in genetics, vol. 31, no. 10, pp. 576–586, 2015. doi: https://doi.org/10.1016/j.tig.2015.07.003

[22] N. E. Navin, “The first five years of single-cell cancer genomics and beyond,” Genome research, vol. 25, no. 10, p. 1499–1507, 2015. doi: https://doi.org/10.1101/gr.191098.115

[23] O. B. Poirion, X. Zhu, T. Ching, and L. Garmire, “Single-cell transcriptomics bioinformatics and computational challenges,” Frontiers in genetics, vol. 7, no. 163, 2016. doi: https://doi.org/10.3389/fgene.2016.00163

[24] K. Pearson, “Liii. on lines and planes of closest fit to systems of points in space,” The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, vol. 2, no. 11, p. 559–572, 1901. doi: https://doi.org/10.1080/14786440109462720

[25] R. R. Coifman, S. Lafon, A. B. Lee, M. Maggioni, B. Nadler, F. Warner, and S. W. Zucker, “Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps,” Proceedings of the National Academy of Sciences, vol. 102, no. 21, pp. 7426–7431, 2005. doi: 10.1073/pnas.0500334102

[26] L. van der Maaten and G. Hinton, “Visualizing data using t-sne,” Journal of Machine Learning Research, vol. 9, no. 86, pp. 2579–2605, 2008. [Online]. Available: http://jmlr.org/papers/v9/vandermaaten08a.html

[27] L. McInnes, J. Healy, N. Saul, and L. Großberger, “Umap: Uniform manifold approximation and projection,” Journal of Open Source Software, vol. 3, no. 29, p. 861, 2018. doi: 10.21105/joss.00861. [Online]. Available: https://doi.org/10.21105/joss.00861

[28] J. K. de Kanter, P. Lijnzaad, T. Candelli, T. Margaritis, and F. C. P. Holstege, “Chetah: a selective, hierarchical cell type identification method for single-cell rna sequencing,” Nucleic Acids Research, vol. 47, no. 16, p. e95, 2019. doi: https://doi.org/10.1093/nar/gkz543

[29] H. Pliner, J. Shendure, and C. Trapnell, “Supervised classification enables rapid annotation of cell atlases,” Nature Methods, vol. 16, p. 983–986, 2019. doi: https://doi.org/10.1038/s41592-019-0535-3

[30] T. S. Andrews and M. Hemberg, “Identifying cell populations with scrnaseq,” Molecular aspects of medicine, vol. 59, pp. 114–122, 2018. doi: https://doi.org/10.1016/j.mam.2017.07.002

[31] V. Menon, “Clustering single cells: a review of approaches on high-and low-depth single-cell rna-seq data,” Briefings in functional genomics, vol. 17, no. 4, pp. 240–245, 2018. doi: https://doi.org/10.1093/bfgp/elx044

[32] M. N. Bernstein, Z. Ma, M. Gleicher, and C. N. Dewey, “Cello: comprehensive and hierarchical cell type classification of human cells with the cell ontology,” iScience, vol. 24, no. 1, p. 101913, 2021. doi: https://doi.org/10.1016/j.isci.2020.101913

[33] L. Zappia, B. Phipson, and A. Oshlack, “Exploring the single-cell rna-seq analysis landscape with the scrna-tools database,” PLoS computational biology, vol. 14, no. 6, 2018. doi: https://doi.org/10.1371/journal.pcbi.1006245

[34] S. Davis and al., “Community-curated list of software packages and data resources for single-cell, including rna-seq, atac-seq, etc. 23/02/2021,” https://github.com/seandavi/awesome-single-cell, Web 25/02/2021.

[35] F. Ma and M. Pellegrini, “Actinn: automated identification of cell types in single cell rna sequencing,” Bioinformatics, vol. 36, no. 2, p. 533–538, 2020. doi: https://doi.org/10.1093/bioinformatics/btz592

[36] H. A. Ekiz, C. J. Conley, W. Z. Stephens, and R. M. O’Connell, “Cipr: a web-based r/shiny app and r package to annotate cell clusters in single cell rna sequencing experiments,” BMC bioinformatics, vol. 21, no. 1, p. 191, 2020. doi: https://doi.org/10.1186/s12859-020-3538-2

[37] M. Brbić, M. Zitnik, S. Wang, A. O. Pisco, R. B. Altman, S. Darmanis, and J. Leskovec, “Mars: discovering novel cell types across heterogeneous single-cell experiments,” Nature methods, vol. 17, no. 12, p. 1200–1206, 2020. doi: https://doi.org/10.1038/s41592-020-00979-3

[38] R. Fu, A. E. Gillen, R. M. Sheridan, C. Tian, M. Daya, Y. Hao, J. R. Hesselberth, and K. A. Riemondy, “clustifyr: an r package for automated single-cell rna sequencing cluster classification,” F1000Research, vol. 9, p. 223, 2020.

[39] E. Mereu, A. Lafzi, C. Moutinho, C. Ziegenhain, D. J. McCarthy, A. Álvarez Varela, E. Batlle, Sagar, D. Grün, J. K. Lau, S. C. Boutet, C. Sanada, A. Ooi, R. C. Jones, K. Kaihara, C. Brampton, Y. Talaga, Y. Sasagawa, K. Tanaka, T. Hayashi, and H. Heyn, “Benchmarking single-cell rna-sequencing protocols for cell atlas projects,” Nature biotechnology, vol. 38, no. 6, p. 747–755, 2020. doi: https://doi.org/10.1038/s41587-020-0469-4

[40] V. Y. Kiselev, A. Yiu, and M. Hemberg, “scmap: projection of single-cell rna-seq data across data sets,” Nature methods, vol. 15, no. 5, p. 359–362, 2018. doi: https://doi.org/10.1038/nmeth.4644

[41] O. Franzén and J. L. M. Björkegren, “alona: a web server for single-cell rna-seq analysis,” Bioinformatics, vol. 36, no. 12, p. 3910–3912, 2020. doi: https://doi.org/10.1093/bioinformatics/btaa269

[42] X. Shao, H. Yang, X. Zhuang, J. Liao, Y. Yang, P. Yang, J. Cheng, X. Lu, H. Chen, and X. Fan, “Reference-free cell-type annotation for single-cell transcriptomics using deep learning with a weighted graph neural network,” (Preprint) bioRxiv, 2020. doi: https://doi.org/10.1101/2020.05.13.094953

[43] F. Wagner and I. Yanai, “Moana: A robust and scalable cell type classification framework for single-cell rna-seq data.” (Preprint) bioRxiv, 2018. doi: 10.1101/456129

[44] R. Hou, E. Denisenko, and A. R. R. Forrest, “scmatch: a single-cell gene expression profile annotation tool using reference datasets,” Bioinformatics, vol. 35, no. 22, p. 4688–4695, 2019. doi: https://doi.org/10.1093/bioinformatics/btz292

[45] A. Olsson, M. Venkatasubramanian, V. K. Chaudhri, B. J. Aronow, N. Salomonis, H. Singh, and H. L. Grimes, “Single-cell analysis of mixed-lineage states leading to a binary cell fate choice,” Nature, vol. 537, no. 7622, p. 698–702, 2016. doi: https://doi.org/10.1038/nature19348

[46] S. Domanskyi, A. Szedlak, N. Hawkins, and al., “Polled digital cell sorter (p-dcs): Automatic identification of hematological cell types from single cell rna-sequencing clusters,” BMC Bioinformatics, vol. 20, p. 369, 2019. doi: https://doi.org/10.1186/s12859-019-2951-x

[47] F. Zanini, B. A. Berghuis, R. C. Jones, B. Nicolis di Robilant, R. Y. Nong, J. A. Norton, M. F. Clarke, and S. R. Quake, “Northstar enables automatic classification of known and novel cell types from tumor samples,” Scientific reports, vol. 10, no. 1, p. 15251, 2020. doi: https://doi.org/10.1038/s41598-020-71805-1

[48] X. Han, R. Wang, Y. Zhou, L. Fei, H. Sun, S. Lai, A. Saadatpour, Z. Zhou, H. Chen, F. Ye, D. Huang, Y. Xu, W. Huang, M. Jiang, X. Jiang, J. Mao, Y. Chen, C. Lu, J. Xie, Q. Fang, and G. Guo, “Mapping the mouse cell atlas by microwell-seq,” Cell, vol. 172, no. 5, p. 1091–1107, 2018. doi: https://doi.org/10.1016/j.cell.2018.02.001

[49] S. C. Mädler, A. Julien-Laferriere, L. Wyss, M. Phan, A. S. W. Kang, E. Ulrich, R. Schmucki, J. D. Zhang, M. Ebeling, L. Badi, T. Kam-Thong, P. C. Schwalie, and K. Hatje, “Besca, a single-cell transcriptomics analysis toolkit to accelerate translational research,” (Preprint) bioRxiv, 2020. doi: https://doi.org/10.1101/2020.08.11.245795

[50] S. Wang, A. O. Pisco, A. McGeever, and al., “Unifying single-cell annotations based on the cell ontology,” (Preprint) bioRxiv, 2019. doi: 10.1101/810234

[51] J. C. Kimmel and D. R. Kelley, “Semi-supervised adversarial neural networks for single-cell classification,” Genome research. Advance online publication, 2021. doi: https://doi.org/10.1101/gr.268581.120

[52] W. Kong, Y. C. Fu, and S. A. Morris, “Capybara: A computational tool to measure cell identity and fate transitions,” (Preprint) bioRxiv, 2020. doi: https://doi.org/10.1101/2020.02.17.947390

[53] X. Yang, S. Gao, T. Wang, and al., “gcanno: a graph-based single cell type annotation method,” BMC Genomics, vol. 21, p. 823, 2020. doi: https://doi.org/10.1186/s12864-020-07223-4

[54] P. Xie, M. Gao, C. Wang, J. Zhang, P. Noel, C. Yang, D. Von Hoff, H. Han, M. Q. Zhang, and W. Lin, “Superct: a supervised-learning framework for enhanced characterization of single-cell transcriptomic profiles,” Nucleic Acids Research, vol. 47, no. 8, p. e48, 2019. doi: https://doi.org/10.1093/nar/gkz116

[55] J. Alquicira-Hernandez, A. Sathe, H. P. Ji, Q. Nguyen, and J. E. Powell, “scpred: accurate supervised method for cell-type classification from single-cell rna-seq data,” Genome biology, vol. 20, no. 1, p. 264, 2019. doi: https://doi.org/10.1186/s13059-019-1862-5

[56] Y. Lieberman, L. Rokach, and T. Shay, “Castle - classification of single cells by transfer learning: Harnessing the power of publicly available single cell rna sequencing experiments to annotate new experiments.” PloS one, vol. 13, no. 10, p. e0205499, 2018. doi: https://doi.org/10.1371/journal.pone.0205499

[57] Y. Kaymaz, F. Ganglberger, M. Tang, F. Fernandez-Albert, N. Lawless, and T. Sackton, “Hierfit: Hierarchical random forest for information transfer,” (Preprint) bioRxiv, 2020. doi: https://doi.org/10.1101/2020.09.16.300822

[58] X. Zhou, H. Chai, Y. Zeng, et al., “scadapt: Virtual adversarial domain adaptation network for single cell rna-seq data classification across platforms and species,” (Preprint) bioRxiv, 2021. doi: https://doi.org/10.1101/2021.01.18.427083

[59] Y. Cao, X. Wang, and G. Peng, “Scsa: A cell type annotation tool for single-cell rna-seq data,” Frontiers in genetics, vol. 11, p. 490, 2020. doi: https://doi.org/10.3389/fgene.2020.00490

[60] Z. J. Cao, L. Wei, S. Lu, D. C. Yang, and G. Gao, “Searching large-scale scrna-seq databases via unbiased cell embedding with cell blast,” Nature communications, vol. 11, no. 1, p. 3458, 2020. doi: https://doi.org/10.1038/s41467-020-17281-7

[61] L. Michielsen, M. J. T. Reinders, and A. Mahfouz, “Hierarchical progressive learning of cell identities in single-cell data,” (Preprint) bioRxiv, 2020. doi: https://doi.org/10.1101/2020.03.27.010124

[62] L. Chen, Y. Zhai, Q. He, W. Wang, and M. Deng, “Integrating deep supervised, self-supervised and unsupervised learning for single-cell rna- seq clustering and annotation,” Genes, vol. 11, no. 7, p. 792, 2020. doi: https://doi.org/10.3390/genes11070792

[63] A. W. Zhang, C. O’Flanagan, E. A. Chavez, J. Lim, N. Ceglia, A. McPherson, M. Wiens, P. Walters, T. Chan, B. Hewitson, D. Lai, A. Mottok, C. Sarkozy, L. Chong, T. Aoki, X. Wang, A. P. Weng, J. N. McAlpine, S. Aparicio, C. Steidl, and S. P. Shah, “Probabilistic cell-type assignment of single-cell rna- seq for tumor microenvironment profiling,” Nature methods, vol. 16, no. 10, p. 1007–1015, 2019. doi: https://doi.org/10.1038/s41592-019-0529-1

[64] F. K. Hamey and B. Göttgens, “Machine learning predicts putative hematopoietic stem cells within large single-cell transcriptomics data sets,” Experimental hematology, vol. 78, pp. 11–20, 2019. doi: https://doi.org/10.1016/j.exphem.2019.08.009

[65] Y. Chen, T. Lakshmikanth, J. Mikes, and P. Brodin, “Single-cell classification using learned cell phenotypes,” (Preprint) bioRxiv, 2020. doi: https://doi.org/10.1101/2020.07.22.216002

[66] X. Liu, S. J. Gosline, L. T. Pflieger, P. Wallet, A. Iyer, J. Guinney, A. H. Bild, and J. T. Chang, “Knowledge-based classification of fine-grained immune cell types in single-cell rna-seq data with immclassifier,” (Preprint) bioRxiv, 2020. doi: https://doi.org/10.1101/2020.03.23.002758

[67] C. Xu, R. Lopez, E. Mehlman, J. Regier, M. I. Jordan, and N. Yosef, “Probabilistic harmonization and annotation of single-cell transcriptomics data with deep generative models,” Molecular systems biology, vol. 17, no. 1, p. e9620, 2021. doi: https://doi.org/10.15252/msb.20209620

[68] J. Hu, X. Li, G. Hu, et al., “Iterative transfer learning with neural network for clustering and cell type classification in single-cell rna-seq analysis,” Nature Machine Intelligence, vol. 2, p. 607–618, 2020. doi: https://doi.org/10.1038/s42256-020-00233-7

[69] Y. Tan and P. Cahan, “Singlecellnet: A computational tool to classify single cell rna-seq data across platforms and across species,” Cell systems, vol. 9, no. 2, p. 207–213, 2019. doi: https://doi.org/10.1016/j.cels.2019.06.004

[70] S. Mao, Y. Zhang, G. Seelig, and S. Kannan, “Cellmesh: Probabilistic cell- type identification using indexed literature,” (Preprint) bioRxiv, 2020. doi: https://doi.org/10.1101/2020.05.29.124743

[71] M. Goyal, G. Serrano, I. Shomorony, M. Hernaez, and I. Ochoa, “Jind: Joint integration and discrimination for automated single-cell annotation,” (Preprint) bioRxiv, 2020. doi: https://doi.org/10.1101/2020.10.06.327601

[72] Y. Lin, Y. Cao, H. J. Kim, A. Salim, T. P. Speed, D. M. Lin, P. Yang, and J. Yang, “scclassify: sample size estimation and multiscale classification of cells using single and multiple reference,” Molecular systems biology, vol. 16, no. 6, p. e9389, 2020. doi: https://doi.org/10.15252/msb.20199389

[73] D. Aran, A. Looney, L. Liu, et al., “Reference-based analysis of lung single-cell sequencing reveals a transitional profibrotic macrophage,” Nature Immunology, vol. 20, p. 163–172, 2019. doi: https://doi.org/10.1038/s41590-018-0276-y

[74] T. S. Johnson, T. Wang, Z. Huang, C. Y. Yu, Y. Wu, Y. Han, Y. Zhang, K. Huang, and J. Zhang, “Lambda: label ambiguous domain adaptation dataset integration reduces batch effects and improves subtype detection,” Bioinformatics, vol. 35, no. 22, p. 4696–4706, 2019. doi: https://doi.org/10.1093/bioinformatics/btz295

[75] J. B. Kang, A. Nathan, N. Millard, L. Rumker, D. B. Moody, I. Korsunsky, and S. Raychaudhuri, “Efficient and precise single-cell reference atlas mapping with symphony,” (Preprint) bioRxiv, 2020. doi: https://doi.org/10.1101/2020.11.18.389189

[76] C. Wang, D. Sun, X. Huang, et al., “Integrative analyses of single-cell transcriptome and regulome using maestro,” Genome Biology, vol. 21, p. 198, 2020. doi: https://doi.org/10.1186/s13059-020-02116-x

[77] K. Boufea, S. Seth, and N. N. Batada, “scid uses discriminant analysis to identify transcriptionally equivalent cell types across single-cell rna-seq data with batch effect,” iScience, vol. 23, no. 3, p. 100914, 2020. doi: https://doi.org/10.1016/j.isci.2020.100914

[78] D. DeTomaso, M. G. Jones, M. Subramaniam, T. Ashuach, C. J. Ye, and N. Yosef, “Functional interpretation of single cell similarity maps,” Nature communications, vol. 10, no. 1, p. 4376, 2019. doi: https://doi.org/10.1038/s41467-019-12235-0

[79] G. Pasquini, J. E. Rojo Arias, P. Schäfer, and V. Busskamp, “Automated methods for cell type annotation on scrna-seq data,” Computational and Structural Biotechnology Journal, vol. 19, pp. 961–969, 2021. doi: https://doi.org/10.1016/j.csbj.2021.01.015

[80] Q. Huang, Y. Liu, Y. Du, and L. X. Garmire, “Evaluation of cell type annotation r packages on single-cell rna-seq data,” Genomics, proteomics and bioinformatics. Advance online publication, 2020. doi: https://doi.org/10.1016/j.gpb.2020.07.004

[81] X. Zhao, S. Wu, N. Fang, X. Sun, and J. Fan, “Evaluation of single-cell classifiers for single-cell rna sequencing data sets,” Briefings in bioinformatics, vol. 21, no. 5, p. 1581–1595, 2020. doi: https://doi.org/10.1093/bib/bbz096

[82] T. Abdelaal, L. Michielsen, D. Cats, D. Hoogduin, H. Mei, M. Reinders, and A. Mahfouz, “A comparison of automatic cell identification methods for single- cell rna sequencing data,” Genome biology, vol. 20, no. 1, p. 194, 2019. doi: https://doi.org/10.1186/s13059-019-1795-z

[83] T. Stuart, A. Butler, P. Hoffman, C. Hafemeister, E. Papalexi, W. M. Mauck, Y. Hao, M. Stoeckius, P. Smibert, and R. Satija, “Comprehensive integration of single-cell data.” Cell, vol. 177, no. 7, p. 1888–1902.e21, 2019. doi: https://doi.org/10.1016/j.cell.2019.05.031

[84] D. Sun, J. Wang, Y. Han, X. Dong, J. Ge, R. Zheng, X. Shi, B. Wang, Z. Li, P. Ren, L. Sun, Y. Yan, P. Zhang, F. Zhang, T. Li, and C. Wang, “Tisch: a comprehensive web resource enabling interactive single-cell transcriptome visualization of tumor microenvironment.” Nucleic Acids Research, vol. 49, no. 1, pp. 1420–1430, 2021. doi: https://doi.org/10.1093/nar/gkaa1020

Appendix A

State of the Art

A.1 Introduction

scRNA-seq is a rapidly evolving genomic approach for the detection and quantitative analysis of messenger Ribonucleic Acid (mRNA) molecules in a biological sample at the level of individual cells [15, 16]. Previously, standard (bulk) RNA-seq only made it possible to study cells in aggregate, at the tissue level. scRNA-seq was made possible by the optimization of Next Generation Sequencing (NGS) technologies and allows whole-transcriptome profiling of thousands of individual cells from many complex tissues [15, 16]. This technology makes it possible to study cellular differences and gene expression at an unprecedented resolution, and thus to understand the particularities of a cell within its environment [15, 17]. The first scRNA-seq study was published in 2009 [18], and scRNA-seq has since received increasing attention from the genomics, bioinformatics and computational fields [15, 17]. The development of scRNA-seq and bioinformatics approaches has enabled many discoveries and innovations in medicine and biology during the last decade [15, 17]. It has also revealed previously unknown levels of cellular heterogeneity and rare cell populations in different tissues [15]. These techniques have helped, and will continue to help, to better understand the nature and complexity of many human diseases in order to develop more effective therapies [16].

A.2 Example of a field of study where scRNA-seq is applied

Single-cell RNA-sequencing is widely applied to problems in cell biology such as cancer research. Cancer is one of the most common causes of death in the world [3]. Understanding cellular heterogeneity and complex tumor microenvironments is a major challenge for current cancer research [3]. The composition of the tumor microenvironment and the interactions between malignant cells and their microenvironment influence tumor growth and progression, as they can affect the way in which the immune system struggles to eliminate cancer cells and responds to immune therapies [19]. Understanding the tumor–host interaction mechanisms between tumor-resident immune cells and cancer cells, and the causes and consequences of perturbed immune cell functions across cancer types, is therefore essential for the identification of novel immuno-oncology therapeutic targets by pharmaceutical companies [19, 10]. As cancer tissues are often characterized by changes in both cellular composition∗ and gene expression [20], reflected at the genomic, transcriptomic, and proteomic levels [21], scRNA-seq techniques are very useful for analysing such changes at the single-cell level [3, 10, 11]. They have shed new light on the role of tumor microenvironments in disease progression and therapy resistance, cancer treatment failure and disease recurrence [16, 21, 22]. Moreover, they can help to characterize tumor cellular heterogeneity and the evolution of cell differentiation, identify rare cell types and understand their roles, measure mutation rates, and, ultimately, guide diagnosis and treatment [15, 21, 22, 23]. scRNA-seq techniques are constantly evolving and could therefore provide further insights in the future, but they all follow an elaborate workflow.

A.3 Workflow of scRNA-seq

A typical scRNA-seq analysis includes a pre-processing, a data processing and a downstream analysis phase [15, 17].

• Pre-processing (Get the expression matrix from raw data, perform quality control of the sequencing data and of cell quality)

• Data processing (Attenuate technical noise while conserving biological signals, normalize and scale data and carry out cell group clustering)

• Cell- and gene-level downstream statistical analysis (Highlight biological signals, identify subpopulations and the gene modules driving them)

A.3.1 Pre-processing

Multiple steps need to be performed to generate single-cell data from a biological sample (Figure A.1). The input material is generally a biological tissue sample. First, single-cell dissociation is carried out in order to obtain a suspension of single viable cells and get rid of the tissue structure. Single-cell isolation is then performed (1), which is necessary to profile the mRNA in each cell separately. The isolated individual cells are then lysed while preserving mRNA (2), and library construction is performed. Library construction consists in capturing intracellular mRNA (3), reverse-transcribing the mRNA into complementary Deoxyribonucleic Acid (cDNA) molecules (4) and amplifying them, for example with Polymerase Chain Reaction (PCR) (5), before sequencing in order to increase their probability of being measured [15, 17]. Barcodes are used to preserve information on cellular origin, and to distinguish amplified copies of the same mRNA molecule from reads of other mRNA molecules transcribed from the same gene (6). The amplified and tagged cDNA molecules are pooled together (multiplexed) and sequenced by NGS methods (7). Sequencing produces read data, which go through Quality Control (QC), grouping of reads based on their barcodes (demultiplexing), genome alignment, and quantification [15, 17]. At the end of this process, workable raw data are obtained.

∗ For example, infiltrating immune cells

Figure A.1: Pre-processing workflow - Adapted from Haque et al. Genome Medicine (2017) [15]. (1) Isolation of single cells from a tissue sample (2) Single cell lysis while preserving cellular mRNA (3) mRNA molecule capture using poly[T] sequence primers that bind to mRNA poly[A] tails (4) Conversion of poly[T]-primed mRNA into cDNA using reverse transcription (5) cDNA amplification (by PCR) (6) cDNA sequencing library preparation (insertion of nucleotide barcodes to identify each library) (7) Pooling of cDNA sequencing libraries and sequencing (via NGS) (8) Use of bioinformatic tools to perform quality control and to assess technical variability in the scRNA-seq data

A.3.2 Data processing and visualization

The data obtained at the end of sequencing are processed according to the data processing pipeline (Figure A.2). The raw data generated by the sequencing machine are processed to create a matrix of molecular counts (count matrix) if unique molecular identifiers (UMIs) were used in the single-cell library construction protocol, or a matrix of read counts (read matrix) otherwise. The resulting matrix has dimensions number of barcodes (identifiers of cells) x number of transcripts (related to genes) [17]. The purpose of the following steps is to convert these raw data into quantitative biological information by removing the systematic biases due to technical variability without erasing biological variation [21].

Figure A.2: Data processing and visualization workflow - Adapted from Malte D. Luecken & Fabian J. Theis. Molecular Systems Biology (2019) [17].
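The construction of a count matrix from UMI-tagged reads can be illustrated with a minimal sketch. It assumes that each aligned read has already been reduced to a (cell barcode, UMI, gene) triple; this layout, the function name `umi_count_matrix` and the toy reads are illustrative assumptions and do not correspond to the output format of any particular pipeline.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical read layout: (cell barcode, UMI, gene)
Read = Tuple[str, str, str]


def umi_count_matrix(reads: Iterable[Read]) -> Dict[str, Dict[str, int]]:
    """Count distinct UMIs per (barcode, gene) pair."""
    # Amplified copies of one molecule share barcode, UMI and gene,
    # so they collapse to a single entry in the set.
    unique_molecules = set(reads)
    counts: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for barcode, _umi, gene in unique_molecules:
        counts[barcode][gene] += 1
    return {barcode: dict(genes) for barcode, genes in counts.items()}


reads = [
    ("AAAC", "umi1", "CD3D"),
    ("AAAC", "umi1", "CD3D"),  # amplified copy of the same molecule
    ("AAAC", "umi2", "CD3D"),
    ("TTTG", "umi1", "MS4A1"),
]
matrix = umi_count_matrix(reads)
print(matrix["AAAC"]["CD3D"])  # number of molecules, not number of reads
```

Without UMIs, the same loop applied to the raw list of reads (instead of the set) would yield a read matrix.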

Quality control

Quality control is performed to ensure that the data quality is sufficient for downstream analysis. Before analysing the data, it is necessary to make sure that every cellular barcode corresponds to a viable cell. Filtering and thresholding are carried out to remove outlier barcodes that would correspond, for example, to dying cells or doublets [17]. Cells with few expressed genes and very low overall expression can also be filtered out, as they do not carry enough information.
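As an illustration of the thresholding described above, the following sketch filters barcodes on two simple covariates: the number of detected genes and the total count depth. The threshold values, the function name `filter_barcodes` and the use of an upper bound on count depth as a crude proxy for doublets are assumptions chosen for the toy data; in practice the thresholds are tuned per dataset.

```python
from typing import Dict, List

# Hypothetical thresholds for the toy data below.
MIN_GENES = 2     # minimum number of detected genes per barcode
MIN_COUNTS = 5    # minimum total counts per barcode
MAX_COUNTS = 100  # upper bound, used here as a crude doublet filter


def filter_barcodes(counts: Dict[str, List[int]],
                    min_genes: int = MIN_GENES,
                    min_counts: int = MIN_COUNTS,
                    max_counts: int = MAX_COUNTS) -> Dict[str, List[int]]:
    """Keep barcodes whose gene count and count depth pass the thresholds."""
    kept = {}
    for barcode, row in counts.items():
        total = sum(row)                        # count depth
        n_genes = sum(1 for c in row if c > 0)  # number of detected genes
        if n_genes >= min_genes and min_counts <= total <= max_counts:
            kept[barcode] = row
    return kept


toy = {
    "cell_a": [3, 0, 4, 1],      # passes all thresholds
    "cell_b": [0, 0, 1, 0],      # too few genes and counts
    "cell_c": [90, 80, 70, 60],  # count depth above the upper bound
}
print(sorted(filter_barcodes(toy)))
```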

Normalization and data correction

Normalization and data correction are then performed. Normalization attempts to remove the effects of count sampling so that the barcode or gene counts of all cells are on a comparable scale. Data correction∗ targets further technical and biological artefacts such as batch effects (Figure A.3) or cell cycle effects on the transcriptome. Batch effects often occur when cells are handled in distinct groups, where small differences in the environment experienced by the cells can induce small differences in the measured transcriptome† [16, 17].

Figure A.3: Before and after batch effect removal (UMAP visualisation) - Adapted from Malte D. Luecken & Fabian J. Theis. Molecular Systems Biology (2019) [17]. Cells are coloured by sample of origin. Variability between batches (samples) is clearly visible before batch correction.
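The normalization step can be sketched with one common and simple scheme, count depth scaling followed by a log transformation. This particular scheme, the target sum of 10,000 and the function name `normalize_log` are assumptions made for illustration; the text above does not prescribe a specific normalization method, and batch correction is not covered by this sketch.

```python
import math
from typing import List


def normalize_log(counts: List[List[int]],
                  target_sum: float = 10_000.0) -> List[List[float]]:
    """Scale each cell to the same total count, then apply log(1 + x)."""
    normalized = []
    for row in counts:
        depth = sum(row)
        if depth == 0:
            # An empty barcode carries no information; keep it at zero.
            normalized.append([0.0] * len(row))
            continue
        scale = target_sum / depth
        normalized.append([math.log1p(c * scale) for c in row])
    return normalized


# Two cells with the same relative expression but different count depths
# end up with identical normalized profiles.
shallow, deep = normalize_log([[1, 3], [10, 30]])
print(shallow, deep)
```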

Even after filtering out the unexpressed genes in the QC step, the number of expressed genes in a single-cell dataset can reach 15,000, which corresponds to a feature space of 15,000 dimensions. To facilitate the downstream analysis, it is important to reduce both the noise and the dimensionality of the dataset, which also makes the data easier to visualize [17]. Most methods reduce this space by keeping the top expressed genes and summarising them in a few common dimensions that are easier to visualize and interpret [15].

Feature selection

The first step in reducing the dimensionality of scRNA-seq datasets is commonly feature selection. In this step, the dataset is filtered to keep only highly variable genes that are “informative” of the variability in the data [17].
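A minimal sketch of feature selection is given below. It ranks genes by their plain variance across cells, which is a deliberate simplification: published tools typically use more refined criteria than raw variance, and the function name `top_variable_genes` and the toy matrix are hypothetical.

```python
from statistics import pvariance
from typing import List, Sequence


def top_variable_genes(matrix: Sequence[Sequence[float]],
                       gene_names: Sequence[str],
                       n_top: int) -> List[str]:
    """Return the n_top genes with the highest variance across cells.

    `matrix` is cells x genes, so each column holds one gene.
    """
    variances = [pvariance(column) for column in zip(*matrix)]
    ranked = sorted(range(len(gene_names)),
                    key=lambda j: variances[j], reverse=True)
    return [gene_names[j] for j in ranked[:n_top]]


expression = [
    [1.0, 2.0, 0.0],
    [1.0, 2.1, 5.0],
    [1.0, 1.9, 9.0],
]
print(top_variable_genes(expression, ["g1", "g2", "g3"], n_top=2))
```

Here `g1` is constant across cells and is therefore dropped, while `g3` varies the most and is ranked first.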

Dimensionality reduction

∗ also called batch effect removal
† between groups of cells in the same experiment, between experiments carried out in the same laboratory or between datasets from different laboratories

After feature selection, the dimensions can be reduced further by dimensionality reduction algorithms that project the underlying structure of the data into a low-dimensional space while maintaining the biological information [17, 21]. This information can be sufficiently described in a space of much lower dimension than the space of all genes, which serves both to visualize and to summarize it. Visualization is the attempt to optimally describe the dataset in two or three dimensions for plotting, whereas summarization reduces the data to its essential components by finding the inherent dimensionality of the data. Principal component analysis (PCA) [15, 17, 24] and diffusion maps [25] are the two main dimensionality reduction techniques for summarization.∗ The summarized data are then used in the downstream analysis.

Visualisation

For the visualization of the data, non-linear dimensionality reduction methods like t-distributed stochastic neighbour embedding (t-SNE) [26] or Uniform Manifold Approximation and Projection (UMAP) [27] are used to project the data into a space easily understandable by humans.

A.3.3 Downstream analysis

This last step consists in extracting biological insights and analysing the observed biological system. Downstream analysis can be separated into cell-level and gene-level approaches, depending on what researchers are looking for. The cell-level approach focuses on the description of groups of cells and of continuous (differentiation) trajectories, which highlight small changes in gene expression between similar cells (as a snapshot of a continuous and dynamic process [17]). The gene-level approach, for its part, directly investigates the molecular signals in the data in order to understand the context in which gene expression occurs [17]. However, before any biological insight can be extracted, the cells have to be organized into subpopulations by identifying transcriptional similarities between cells using clustering methods [15, 17, 20, 28].

Clustering

Clustering is the step that consists in grouping cells into biologically meaningful subpopulations, often with unsupervised algorithms† (Figure A.4 (a)). Clustering algorithms and community detection methods are two families of methods able to classify cells into subgroups using a similarity score. The similarity between gene expression profiles is assessed with distance metrics, often applied to dimensionality-reduced representations. Clustering algorithms search for minimal distances and dense regions in the reduced expression space in order to assign cells to clusters. In contrast, community detection methods are graph-partitioning algorithms and are therefore based on a graph representation of the data in the reduced expression space. This graph can be obtained using, for example, the k-Nearest Neighbour (kNN) approach, in which each cell is connected to its k most similar cells [17]. Finally, the clusters of cells can be annotated.

∗ PCA is a linear approach that generates reduced dimensions by maximizing the residual variance in each further dimension, whereas diffusion maps are a non-linear data summarization technique where each diffusion component (i.e. diffusion map dimension) highlights the heterogeneity of a different cell population [17]. As PCA is a linear method, it cannot fully capture the nonlinear relationships in the data [21].
† a classical unsupervised machine learning problem
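The kNN graph mentioned above can be sketched as follows, using Euclidean distance in a reduced expression space. To keep the example short, the graph is then partitioned with connected components, which is only a stand-in assumed here for a real community detection algorithm; the function names and the toy coordinates are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def knn_graph(points: Sequence[Sequence[float]], k: int) -> Dict[int, List[int]]:
    """Connect each cell to its k nearest cells (Euclidean distance)."""
    graph = {}
    for i, p in enumerate(points):
        dists = sorted((math.dist(p, q), j)
                       for j, q in enumerate(points) if j != i)
        graph[i] = [j for _, j in dists[:k]]
    return graph


def connected_components(graph: Dict[int, List[int]]) -> Dict[int, int]:
    """Label each node with the index of its connected component.

    Simplified stand-in for graph partitioning, not a community
    detection algorithm in the strict sense.
    """
    undirected = {i: set(nbrs) for i, nbrs in graph.items()}
    for i, nbrs in graph.items():
        for j in nbrs:
            undirected[j].add(i)  # treat every edge as undirected
    labels: Dict[int, int] = {}
    current = 0
    for start in undirected:
        if start in labels:
            continue
        stack = [start]
        while stack:
            node = stack.pop()
            if node in labels:
                continue
            labels[node] = current
            stack.extend(n for n in undirected[node] if n not in labels)
        current += 1
    return labels


# Two well-separated groups of cells in a 2D reduced space.
cells = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
clusters = connected_components(knn_graph(cells, k=2))
print(clusters)
```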

Cell type annotation

One of the main uses of scRNA-seq is to identify and characterise cell populations and to highlight the cellular heterogeneity in the data, for example to understand the different cell populations of the tumour microenvironment [19], as explained in the introduction. It is therefore often necessary to determine which cell populations are present in the data and to annotate them. Cell type annotation has thus become a fundamental task in the analysis of scRNA-seq data, as analysing the transcriptome profiles of individual cells enables the identification and annotation of known cell types and the observation of new ones [10]. Cell type identification and annotation in scRNA-seq studies are currently often done manually after clustering. Finding marker genes∗ in a cell cluster is not a difficult task: tools can easily find them among the differentially expressed genes of each cluster. Nevertheless, it is often necessary to search the literature or databases to link the set of marker genes of each cluster to known cell types [28, 29] and annotate the clusters (Figure A.4 (b)). Such manual cell type identification has several drawbacks. First, it is time-consuming and labour-intensive, and requires an exhaustive review of cluster-specific genes [28, 29]. Second, the method is subjective because of the choice of the clustering method and parameters, and the uncertainty about which marker genes relate to a specific cell type [28]. This subjectivity makes it difficult to transfer annotations between datasets generated by independent groups on related tissues, which results in needless repetition of effort [29]. Finally, knowledge about defined cell types is still growing quickly, which makes the analysis more complex and the annotation labels inherently ad hoc [28, 29]. Therefore, tools or methods able to automatically and accurately identify cell types and annotate them are necessary today, and represent a major challenge in the scRNA-seq pipeline.

In recent years, several computational tools have been developed for these tasks. Automated cell type annotation has become available and offers a vast speedup of the process [17, 31, 32, 6]. However, these tools use different strategies and have different advantages and drawbacks. They are also often specialised for a particular task. Thus, for immuno-oncology or oncology research purposes, it is necessary to survey all the currently available tools, review them and identify those which seem the most effective.

∗ Marker genes are the main genes highly differentially expressed by a cell type and therefore characterize the cluster. They are used to annotate it with a meaningful biological label [17].

Figure A.4: Downstream Analysis – Clustering and Annotation - Adapted from Malte D. Luecken & Fabian J. Theis. Molecular Systems Biology (2019) [17] and T.S. Andrews & M. Hemberg. Molecular Aspects of Medicine 59 (2018) [30]. (a) Clustering process (b) Cluster annotation using gene markers

A.4 Review and research of the automatic cell type annotation tools

scRNA-seq analysis workflows are patchworks of independently developed tools which carry out different tasks at different levels of the pipeline [17]. More and more bioinformatic and computational methods are becoming available. The reasons for this growing number of methods are the various types of data (different tissues, species, healthy/pathological...), the different research purposes and the different algorithmic strategies. This situation makes it difficult to know which analysis workflow is best [15, 17, 33]. Hence, reviewing the vast number of available tools and keeping up with the current state of scRNA-seq analysis has become very challenging [33]. To facilitate the selection of appropriate analysis tools, specialised, continually updated, publicly available databases such as scRNA-tools (www.scRNA-tools.org) [33] or Awesome-single-cell [34] have been created to list all the existing tools and categorise them according to the analysis tasks they perform. These databases were used to find all the existing tools for automatic cell type annotation, in addition to a keyword search in PubMed∗.

Before further analysis, 94 open-source tools were classified as specialised in classification tasks; they appear to be all the tools available for this purpose on 1 February 2021. All these tools were then reviewed to determine whether they truly annotate or predict cell types automatically, whether they are accepted by the scientific community (number of citations in the literature relative to the date of publication), whether they are still being updated, maintained and extended (date of the last commits on GitHub), and whether they are implemented in a consistent programming language, with commonly used input and output formats and a verified algorithmic method. Most of the tools are coded in R and Python† [33] because these languages are free to use and popular across a range of data science fields, and the tools coded in them are available through open-source repositories. Many methods are also shared and described in preprints prior to peer-reviewed publication because they are so recent [33]. A first elimination of tools was carried out according to the above criteria. The remaining tools are presented in Table A.1.

ACTINN [35]         CIPR [36]                 MARS [37]           SCINA [7]
ACTIONet [2]        Clustifyr [38]            matchSCore2 [39]    Scmap [40]
Adobo [41]          DeepSort [42]             Moana [43]          scMatch [44]
AltAnalyze [45]     DigitalCellSorter [46]    Northstar [47]      scMCA [48]
BESCA [49]          Garnett [29]              OnClass [50]        scNym [51]
Capybara [52]       gCAnno [53]               rSuperCT [54]       scPred [55]
CaSTLe [56]         HieRFIT [57]              scAdapt [58]        SCSA [59]
Cell-BLAST [60]     HPL [61]                  scAnCluster [62]    scTHI [19]
Cellassign [63]     hscScore [64]             scCancer [3]        scTyper [5]
Cellgrid [65]       ImmClassifier [66]        scCATCH [14]        scANVI [67]
CellID [1]          ItClust [68]              scClassifR [6]      SingleCellNet [69]
CellMeSH [70]       JIND [71]                 scClassify [72]     SingleR [73]
CellO [32]          LAmbDA [74]               SciBet [4]          Symphony [75]
CHETAH [28]         MAESTRO [76]              scID [77]           VISION [78]

Table A.1: Automatic cell annotation or prediction tools selected for further analyses

∗ In PubMed, the keywords “cell type annotation”, “cell-type prediction” and “classification” were used to find relevant scientific articles presenting reviews or documentation of tools. In Awesome-single-cell, all the tools of the section “Cell type identification and classification” were listed. In scRNA-tools, all the tools of the Classification category that carry out automatic cell type prediction or annotation were listed.
† R (57.4%), Python (38.3%), MATLAB (3.2%), Other (1.1%)

In the following, attention is paid to the ability of the tools to face the different challenges of annotation.

A.4.1 Challenges of the annotation

It is important that the tools can handle sequencing data from different manufacturers and techniques (10X, Smart-seq2, etc.) and can address different research questions (populations of immune cell types or cell types of other tissues). Cell annotation is also especially challenging when it comes to describing novel phenotypes (e.g. cancer-specific cell states), which leads to differences in cell labels between analyses [10]. The RNA expression profile of cancer cells can differ from that of any known cell type and can be unique to the patient. Such cells are difficult to classify simply because their expression profiles do not resemble any known, healthy cell type [28]. The tools should therefore be prepared for such situations. Furthermore, the sensitivity and specificity with which cellular identity is resolved is also a criterion. Cells of the same cell type in different states should be annotated with the same cell type. Nevertheless, different cell types may be grouped together and annotated with a single cell type, or the same cell type may be split into several subgroups that are then assigned to different subtypes. Moreover, it is not always clear what constitutes a cell type: there is no consensus framework for cell types and the features that define them [29]. Nevertheless, the huge diversity of cell types underneath a first level of typing [55] can be handled. For example, T cells form a broad type that contains subtypes such as CD4+ or CD8+ T cells [17], which may be worth labelling further. Hence, the depth to which a tool descends in the hierarchy of cell types is also a considerable advantage. Finally, it is very difficult to create a tool that handles all the problems that scRNA-seq covers today. Tools are thus created to answer specific biological questions, and some may not be suited to immuno-oncology and oncology purposes. A tool created for such a specific purpose is preferable to an overly general one. In summary, a good cell type identification tool should be both sensitive and selective: it should correctly identify as many cells as possible, while leaving cells unclassified when the evidence is insufficient, in order to avoid overclassification [28].

A.4.2 The different types of annotation tools

The tools use different strategies to associate gene expression profiles with a cell type identity for each test cell or group of cells (cluster). Three main methodological approaches can be identified in the literature (Figure A.5) [79]. A first group of methods is based on publicly available databases and ontologies describing cell-type-specific gene markers. The second type of approach uses labelled scRNA-seq datasets in order to find the best correlation between the reference and test datasets. Supervised learning is used as a third alternative: classifiers are trained on a labelled reference and are then able to transfer known cell type labels to unlabelled datasets [79]. The first two types of methods are called prior-knowledge methods because they require prior information as input (a marker gene database or labelled datasets), whereas the last ones are called supervised methods because the knowledge is acquired by the classifier through training on labelled datasets with a supervised machine learning approach.

Figure A.5: Three approaches for automatic cell-type annotation of scRNA-seq datasets: annotation by marker gene databases, correlation-based methods, and annotation by supervised classification - Adapted from Giovanni Pasquini et al. Computational and Structural Biotechnology Journal (2021) [79]. (A) Marker gene database-based annotation uses cell type atlases. Literature- and scRNA-seq analysis-derived gene markers have been compiled into reference cell type hierarchies and marker lists, and are used to annotate cells in the test dataset presenting the same gene markers. (B) Correlation-based methods use various correlation measures to compare gene expression profiles between a reference and a test dataset. Collections of published studies can be used to create references of cell-type gene-expression profiles. Annotation is then carried out by finding the reference cell type that best matches the test cell or cluster. (C) Annotation by supervised classification works with machine learning techniques. A classifier is trained on labelled reference scRNA-seq datasets and is then applied to the test dataset.

Marker-gene-based annotation

To automatically annotate cell types, reference cell type information is always needed at some point. As said previously, the first category of approaches is based on cell type marker genes, because unbiased cell-type-specific marker genes can be easily exploited [79] and avoid working with all the genes. Lists of cell type marker genes drawn from specific studies and the literature, and assembled into very large databases and ontologies [79], are increasingly becoming available [17] and greatly facilitate annotation. Marker genes can come from database-derived or manually curated lists [79]. The prior information is exhaustive in database-derived lists, but the annotation can be uncertain if the test dataset is noisy. Manually curated lists usually cannot annotate all the cell types in the data but are suitable for sophisticated statistical methods [79]. Another technique compares the marker genes of the test dataset with those of a reference dataset via enrichment tests∗ or other methods [17]. Nevertheless, the performance of this approach suffers if markers of cell types present in the test dataset are missing, because the tools cannot assign cell types for which they have no prior information [67]. Moreover, many markers are shared between several cell types, and not all cell types possess a canonical set of marker genes [32], which can also be detrimental to performance.
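The principle of marker-based annotation can be illustrated with a minimal sketch that scores a cluster by the mean expression of each cell type's marker genes. The marker list, the score threshold and the function name `annotate_cluster` are illustrative assumptions; this is not the scoring scheme of any tool listed in Table A.1.

```python
from typing import Dict, List

# Small manually curated marker list, for illustration only.
MARKERS: Dict[str, List[str]] = {
    "T cell": ["CD3D", "CD3E"],
    "B cell": ["CD79A", "MS4A1"],
}


def annotate_cluster(mean_expr: Dict[str, float],
                     markers: Dict[str, List[str]] = MARKERS,
                     min_score: float = 1.0) -> str:
    """Assign the cell type whose markers have the highest mean expression.

    `mean_expr` maps gene names to the mean expression in one cluster.
    Clusters whose best score does not exceed `min_score` stay unassigned.
    """
    best_label, best_score = "unassigned", min_score
    for cell_type, genes in markers.items():
        # Markers absent from the cluster profile contribute zero.
        score = sum(mean_expr.get(g, 0.0) for g in genes) / len(genes)
        if score > best_score:
            best_label, best_score = cell_type, score
    return best_label


print(annotate_cluster({"CD3D": 4.0, "CD3E": 3.0, "CD79A": 0.1}))
print(annotate_cluster({"GENE_X": 5.0}))
```

The second call shows the limitation discussed above: a cluster expressing none of the listed markers cannot receive a label, because the method has no prior information about its cell type.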

Correlation-based annotation

Instead of comparing marker genes, it is possible to compare whole gene expression profiles and thereby find similarities between datasets. This approach is a more straightforward statistical method than scoring the presence of marker genes. Correlation-based methods correlate the test cells with an annotated set of reference datasets and assign to each test cell or cluster the label of the reference cell type with which it correlates best [79]. In other words, the gene expression profiles of a reference dataset can be used directly to annotate a test dataset [79]. The most commonly used correlations are the Spearman and Pearson correlations [79]. The advantage of this approach is the possibility to evaluate both linear and non-linear interactions [79] by working on the whole gene expression [79]. However, this method also requires exhaustive annotated reference datasets as prior knowledge, and its performance depends entirely on the quality of these reference datasets and on the cell types they contain. The selection of features is also important, because the datasets must be closely comparable for the correlations to be meaningful†. Additionally, like the previous category, correlation-based approaches cannot annotate cell identities that are absent from the reference dataset.

∗ Method to identify classes of genes that are over-represented in a large set of genes
† As explained at the beginning, feature selection consists in removing the irrelevant or redundant features (genes) from the data. This step improves the quality and the number of detected genes in each cell that are used to carry out the correlation [79].

Some hybrid approaches between this category and the following one have also been adopted in the literature. They are called semi-supervised learning methods. Like machine learning algorithms, they learn cell type knowledge from annotated reference datasets, but like correlation-based algorithms, they can also take advantage of information in the target dataset to reduce the dependence on the quality of the reference dataset [62] and to harmonize the data [79]. Tools like Onclass, scAdapt, scAnCluster, scVI (scANVI) or scNym use this type of approach.
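For the correlation-based assignment itself, a minimal sketch in plain Python is given below: the query profile is compared with each reference profile by Spearman correlation (computed as the Pearson correlation of the ranks), and the best-correlated reference label is returned. The function names, the two reference profiles and all expression values are made up for the example, and every profile is assumed to list the same genes in the same order.

```python
from math import sqrt


def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation, i.e. the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def annotate_by_correlation(query_profile, reference_profiles):
    """Return the best-correlated reference label and all correlations."""
    correlations = {
        label: spearman(query_profile, profile)
        for label, profile in reference_profiles.items()
    }
    return max(correlations, key=correlations.get), correlations


# Toy profiles over the same five genes (made-up values).
reference_profiles = {
    "T cell": [5.0, 4.0, 0.2, 0.1, 1.0],
    "B cell": [0.1, 0.3, 6.0, 4.5, 1.2],
}
query_cluster = [3.9, 4.4, 0.0, 0.3, 0.8]
best_label, correlations = annotate_by_correlation(query_cluster, reference_profiles)
print(best_label, correlations)
```

Because the best match is always returned, a query whose true cell type is missing from the reference still receives one of the reference labels unless a minimum correlation is imposed.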

Annotation by supervised classification

Finally, supervised classification approaches are an interesting alternative. Handling the high dimensionality of the data and transferring labels from known datasets to unknown datasets are problems that are classically addressed by machine learning methods [79]. Machine learning techniques offer a variety of alternatives, but they all relate to supervised learning [79]. Supervised learning builds a model of the distribution of training labels (cell types) as a function of features (genes), and the model is trained on a previously annotated dataset [79]. The trained model can then assign cell types (labels) to unlabelled datasets according to the expressed genes (features) [79]. Different methods are used, for example Random Forest, Artificial Neural Networks (ANN), Support Vector Machines (SVM), or embedding methods that project cells into a cell type space and infer the cell types closest to the unannotated cells [79]. Nevertheless, this approach also requires an accurately annotated reference dataset for the training phase. Although it can sometimes be an efficient way to handle the intrinsic noise and variability of the data, such as the variability introduced by the sequencer, the sequencing depth and the sample preparation method [79], it also depends heavily on the training dataset. Batch effects or differences in biological factors between datasets can alter the quality of the annotation [58] because the reference dataset does not generalise well [67]. Moreover, like the correlation-based methods, supervised classification methods need reference datasets that cover all the cell types present in the target dataset in order to annotate them [62]. They are limited in their ability to discover novel cell types [62] and are highly sensitive to the chosen training data [32].
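The train-then-predict workflow described above can be sketched with a deliberately simple classifier. The example below uses a nearest-centroid classifier in plain Python, which stands in for the Random Forest, ANN or SVM models mentioned in the text and is not the algorithm of any reviewed tool. The two-gene expression vectors, the labels and the optional `max_distance` rejection threshold are assumptions made for the illustration.

```python
from math import dist


def train_centroids(train_cells, train_labels):
    """Training phase: average the expression vectors of each cell type."""
    grouped = {}
    for vector, label in zip(train_cells, train_labels):
        grouped.setdefault(label, []).append(vector)
    return {
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def predict(centroids, cell, max_distance=None):
    """Prediction phase: the closest centroid gives the label.

    max_distance is a hypothetical rejection threshold: cells further
    than this from every centroid are returned as "unassigned".
    """
    distances = {label: dist(cell, centroid) for label, centroid in centroids.items()}
    best = min(distances, key=distances.get)
    if max_distance is not None and distances[best] > max_distance:
        return "unassigned"
    return best


# Toy annotated reference: two genes per cell (made-up values).
train_cells = [[5.0, 0.0], [4.0, 1.0], [0.0, 6.0], [1.0, 5.0]]
train_labels = ["T cell", "T cell", "B cell", "B cell"]
centroids = train_centroids(train_cells, train_labels)

print(predict(centroids, [4.5, 0.2]))                    # resembles the T cells
print(predict(centroids, [9.0, 9.0]))                    # unseen type, forced label
print(predict(centroids, [9.0, 9.0], max_distance=3.0))  # unseen type, rejected
```

The second call illustrates why such classifiers are limited for novel cell types: without a rejection rule, a cell unlike anything in the training data still receives one of the training labels.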

A.4.3 Tool summary according to their categories

The remaining tools can therefore be classified by category. Figure A.6 represents this classification and summarises the information from the literature.

Figure A.6: Classification of the tools according to the 3 approaches for automatic cell type annotation. The box in dashed line represents tools that use unsupervised learning to group cells and then use either marker-gene-based or correlation-based annotation. The boxes Neural Network and Deep Neural Network were added to separate the artificial neural network methods from the other machine learning methods.

A.4.4 Acquired knowledge in literature reviews

In the literature, only a few reviews and benchmarks, each covering a small number of the tools cited previously, have been published [80, 81, 82]. However, they give good insights for the choice of tools.

Huang et al. [80] carried out a systematic comparison of 8 R package tools on several public scRNA-seq datasets and on simulated data. They concluded that methods such as SingleR [73] and SingleCellNet [69] generally performed well thanks to their higher relative robustness. They also highlighted that methods incorporating prior knowledge (such as Garnett [29] and SCINA [7]) did not improve performance compared with the other methods and seem limited when cell-cell similarity is high.

Zhao et al. [81] evaluated the performance of 9 tools. They concluded that SingleR [73] is the best tool when the cell number is low or when cell types are extremely imbalanced in the reference. Combining multiple tools by ensemble voting also seems to be a good strategy for improving annotation accuracy. Finally, under non-ideal situations∗, tools based on cluster-level similarities have superior performance.

∗ such as small-sized and cell-type-imbalanced reference datasets

Abdelaal et al. [82] benchmarked 22 automatic cell-type annotation tools using 27 publicly available scRNA-seq datasets of different sizes, technologies, species, and levels of complexity (number of cells, overlapping populations, ...). They found that most classifiers perform well on a wide variety of datasets but struggle with complex datasets that contain overlapping populations (immune cells and particularly T cells) or deep annotations (such as subpopulations of CD4+ and CD8+ T cells). Like Huang et al. [80], they noticed that incorporating prior knowledge in the form of marker genes does not improve performance, and that the performance of such tools depends strongly on the input markers. Finally, they showed that SCINA [7], ACTINN [35] and SingleCellNet [69] perform well on complex datasets; nevertheless, SVM seems to be the best supervised classification method.
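The ensemble voting strategy reported by Zhao et al. [81] can be illustrated with a plain majority vote. The exact voting scheme of that study is not described here, so the sketch below is only one possible reading: the tool names, the labels and the `min_agreement` parameter are hypothetical.

```python
from collections import Counter


def majority_vote(predictions, min_agreement=0.5):
    """Combine the labels of several tools, cell by cell.

    predictions:   dict tool_name -> list of labels, same cell order for all
    min_agreement: a label is kept only if more than this fraction of the
                   tools agree on it (hypothetical rule)
    """
    tools = list(predictions)
    n_cells = len(predictions[tools[0]])
    consensus = []
    for i in range(n_cells):
        votes = Counter(predictions[tool][i] for tool in tools)
        label, count = votes.most_common(1)[0]
        agreed = count / len(tools) > min_agreement
        consensus.append(label if agreed else "unassigned")
    return consensus


# Made-up outputs of three hypothetical annotation tools for three cells.
predictions = {
    "tool_a": ["T cell", "B cell", "NK cell"],
    "tool_b": ["T cell", "B cell", "T cell"],
    "tool_c": ["T cell", "Monocyte", "B cell"],
}
consensus = majority_vote(predictions)
print(consensus)
```

With three tools, a label is kept only when at least two of them agree; the third cell, on which all tools disagree, is left unassigned rather than labelled arbitrarily.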

A.5 Conclusion

At the end of this first step of the analysis, a set of tools was chosen for further analysis according to their potential capacity to perform well for the purpose of the project, their input and output data formats, the cell types they can predict (different immune cells, different tissues, cancer cells, etc.) and their algorithms and methods. Although the performance of Seurat [83], based on random forest, and CaSTLe [56], based on XGBoost, was emphasized [80, 81, 82], it was decided not to select them due to the lack of explanation of their implementation.

Figure A.7: Selected tools separated by the 3 approaches for automatic cell type annotation. ACTINN, ACTIONet, CELL-BLAST, CellID, CellO, CHETAH, clustifyr, DeepSort, Garnett, HieRFIT, MAESTRO, MARS, northstar, scANVI, scCancer, scCATCH, scClassifR, scClassify, scMatch, scNym, SciBet, SCINA, scPred, SCSA, scTyper, SingleCellNet and SingleR are the selected tools that are analysed in this thesis.

The selected tools are presented in Figure A.7 according to their categories. Care was also taken to keep at least two tools per type of algorithm or method. They will be tested and compared on simulated datasets as well as on already annotated and published datasets (presented in the core of the thesis).

Appendix B

Supplement information

This section summarises what was done in the previous evaluations available in the literature, for readers interested in more general performance evaluations, and explains why a new evaluation was carried out in this project.

B.1 Methods used in the literature

As explained in the State of the Art (Appendix A), 3 main evaluations of tools have been carried out [82, 80, 81]. All of them used intra-dataset (within a dataset) and inter-dataset (across datasets) experimental configurations to assess accuracy, by transferring reference labels from one dataset to another. The intra-dataset comparison, inherently artificial, provides an ideal scenario that is independent of technical and biological variations across datasets [82]. The inter-dataset comparison, by contrast, is more realistic and corresponds to the practical situation. It is more challenging for the tools, as they have to deal with technical differences and with matching cell type annotations appropriately.

With those configurations, they assessed the sensitivity of the tools in different situations: different dataset sizes (number of cells) [82]; different dataset complexities (number of cell types [82, 80] and sorted datasets [82]); different numbers of cells per cell type (scalability) [82] (in the reference for [81]); different numbers of reference cells [81]; different numbers of input features (feature selection) [82, 80]; different levels of differential expression (similarity among cell types) [80]; different tissues [82, 80, 81]; different sequencing protocols [82, 81]; different annotation levels [82]; rare cell type detection [80]; and the capability to identify novel cell types [81].

Some evaluations also analysed the rejection option [80, 82], which is the capacity to identify unknown cell types that were not seen in the reference or during training. They excluded cell types from the reference and observed how the tools reacted: whether they left these cell types unclassified and how their performance changed on the other cell types.

Finally, they also assessed the computation time with a high number of cells [82], a high number of features [82], a deep annotation level [82] or an increasing number of cells [80, 81], as well as the required memory with an increasing number of cells [80, 81].

To measure the performance of the tools, the 3 evaluations used different metrics: F1-score [82], overall accuracy and recall [82, 80, 81], adjusted Rand index (ARI) [80], V-measure [80], receiver operating characteristic (ROC) curves [81] and area under the ROC curve (AUC) [81]. They also used real public datasets and simulated datasets created with tools such as Splatter [12], depending on whether they wanted to assess performance on realistic or ideal datasets.
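Several of the metrics listed above follow directly from the counts of correct and incorrect labels. As a reminder of how per-cell-type precision, recall and F1-score and the overall accuracy are obtained, here is a short Python sketch using the standard definitions; the function names and the toy labels are invented for the example and do not come from any of the cited evaluations.

```python
def per_class_scores(true_labels, predicted_labels):
    """Return {cell_type: (precision, recall, f1)} for each true cell type."""
    pairs = list(zip(true_labels, predicted_labels))
    scores = {}
    for cell_type in sorted(set(true_labels)):
        tp = sum(t == cell_type and p == cell_type for t, p in pairs)
        fp = sum(t != cell_type and p == cell_type for t, p in pairs)
        fn = sum(t == cell_type and p != cell_type for t, p in pairs)
        # A ratio with an empty denominator is reported as 0.0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[cell_type] = (precision, recall, f1)
    return scores


def accuracy(true_labels, predicted_labels):
    """Overall accuracy: fraction of cells whose label is correct."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Toy example: the hypothetical tool has no "Treg" label at all.
true_labels = ["T cell", "T cell", "B cell", "B cell", "Treg"]
predicted_labels = ["T cell", "B cell", "B cell", "B cell", "T cell"]
scores = per_class_scores(true_labels, predicted_labels)
print(scores)
print(accuracy(true_labels, predicted_labels))
```

A cell type that a tool can never predict, like the Treg label in this toy example, necessarily obtains an F1-score of zero, whatever the quality of the other predictions.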

B.2 Comments on these methods

These three evaluations are very useful and thorough for a general overview of the performance of the tools. However, for a specific purpose, they do not answer all the questions and have some limitations. Each of the three publications evaluates only a specific subset of the full set of classification tools and, as they used different datasets and methods, it is difficult to transfer rankings and conclusions from one publication to another. Moreover, these reviews mostly evaluate tools that existed at the beginning of 2020; they therefore cover neither all the tools available in 2021 nor the changes that the studied tools may have undergone since. The three evaluations were also carried out on the same tissues (human pancreas and peripheral blood mononuclear cells (PBMC)), which are the tissues most often used by tool developers to test their tools. This might have inflated the performance of the tools in the evaluations, and it does not show whether the tools perform well on other tissues. Additionally, most of the tools perform well when cell populations are more separable (high gene differential expression scale) [82, 80, 81], but when the cell populations become less separable (low gene differential expression scale), the accuracy of the tools decreases, to an extent that varies between tools [81]. The tools have to face such a situation in the immuno-oncologic and oncologic context. The cell populations in the samples are specific, and the datasets and the cell types to annotate are therefore different. Indeed, there is a higher number of immune cell types, which results in a higher cell-cell similarity, a high number of cell types overall, and probably tumor cells that present uncommon gene features. The complexity is therefore high, and this very specific situation challenges the tools.
As the performance of tools depends on the complexity of the datasets, as mostly general characteristics were assessed, and as no tool is universally the best in all situations [81], those 3 evaluations do not tell us which tools perform best in the immuno-oncologic and oncologic context. Thus, particular attention was paid to evaluating current tools on the same basis, on human tissues other than PBMC and pancreas, and specifically on immuno-oncologic and oncologic data. Lung and colorectal tissues were chosen as tissues that are not usually used for evaluation. This makes it possible to see how the tools behave on tissues on which they probably perform less well. Furthermore, only inter-dataset comparisons were performed, because we wanted to evaluate the tools in situations similar to those in which they will be used (worst-case scenario) and not in unrealistic ideal situations such as intra-dataset comparisons. Contrary to the 3 evaluations [82, 80, 81], the rejection option and the capability to catch novel cell types were not evaluated. We focused instead on accuracy, stability, robustness and deep annotation (specifically of immune cell subpopulations), which are more important characteristics.

Appendix C

Further evaluation of MAESTRO

Some of the marker-based tools are used in the pipelines of other useful tools. The rather low overall results obtained for the first configuration led us to verify that the use of marker-based tools does not affect the quality of the annotations provided by the tools that rely on them. This is the case, for instance, of the web resource TISCH [84], which uses MAESTRO to annotate its interactive single-cell transcriptome visualization of the tumor microenvironment. As this platform appears useful in the oncology field, it was decided to evaluate its annotations on the same test datasets as previously, in order to know whether we can have confidence in them and hence in the TISCH platform. The annotation pipeline of TISCH was reproduced, and the marker genes used were the same as those used in this tool∗. The results obtained are presented in the tables below.

Table C.1: Precision, recall and F1-score for TISCH (MAESTRO) on the Zhang Smart-Seq2 dataset

∗ i.e. the marker genes of the CIBERSORT database provided with MAESTRO and the marker genes of dendritic cells and T cells provided in the GitHub repository of TISCH

Table C.2: Precision, recall and F1-score for TISCH (MAESTRO) on the Kim dataset and for the major cell types

Table C.3: Precision, recall and F1-score for TISCH (MAESTRO) on the Zhang10X dataset

Table C.4: Precision, recall and F1-score for TISCH (MAESTRO) on the Kim dataset and for the T cell sub types

With the configuration used in TISCH, the enhanced MAESTRO gave better overall results than those presented previously. MAESTRO had a high F1-score for the major lineages, and its performance for immune subtypes is as high as that of the best-performing tools, if not better. The tool is not built for some subtypes, such as Proliferative B, TAMS SPP1 or Treg, which explains the null F1-score for them. The only drawback noticed is that the tool can annotate either immune cell subtypes or major families of cells (immune, stromal, malignant), but not both at once. Thus, if the annotation is performed with the immune subtype configuration, the stromal or malignant cells present in the sample will not be annotated. However, this supplementary evaluation confirmed that marker-based tools can perform well and that their performance is positively correlated with the quality of the provided markers.

For DIVA

{ "Author1": { "name": "Corentin RAOUX"}, "Degree": {"Educational program": "Degree Programme in Medical Engineering"}, "Title": { "Main title": "Review and Analysis of single-cell RNA sequencing cell-type identification and annotation tools", "Language": "eng" }, "Alternative title": { "Main title": "Granskning och Analys av enkelcells-RNA- sekvenseringsverktyg för identifiering och annotering av celltyper", "Language": "swe" }, "Supervisor1": { "name": "Yufei Luo" }, "Examiner": { "name": "Matilda Larsson", "organisation": {"L1": "School of Engineering Sciences in Chemistry, Biotechnology and Health" } }, "Cooperation": { "Partner_name": "Servier"}, "Other information": { "Year": "2021", "Number of pages": "xv,65"} }

TRITA CBH-GRU-2021:084

www.kth.se