Review and Analysis of Single-Cell RNA Sequencing Cell-Type Identification and Annotation Tools

DEGREE PROJECT IN MEDICAL ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2021

CORENTIN RAOUX
KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ENGINEERING SCIENCES IN CHEMISTRY, BIOTECHNOLOGY AND HEALTH

Degree Programme in Medical Engineering
Date: June 9, 2021
Supervisor: Yufei Luo
Examiner: Matilda Larsson
School of Engineering Sciences in Chemistry, Biotechnology and Health
Host company: Servier
Swedish title: Granskning och Analys av enkelcells-RNA-sekvenseringsverktyg för identifiering och annotering av celltyper

© 2021 Corentin RAOUX

Abstract

Single-cell RNA sequencing makes it possible to study gene expression at the level of individual cells. However, one of the main challenges of single-cell RNA-sequencing analysis today is the identification and annotation of cell types. The current method consists of manually checking gene expression, using the top differentially expressed genes and comparing them with related cell-type markers available in scientific publications. It is therefore time-consuming and labour-intensive. In the last two years, numerous automatic cell-type identification and annotation tools using different strategies have been created. Yet the lack of specific comparisons of those tools in the literature, especially for immuno-oncologic and oncologic purposes, makes it difficult for laboratories and companies to know objectively which tools are best for annotating cell types. In this project, a review of the current tools and an evaluation of R tools were carried out.
The annotation performance, the computation time and the ease of use were assessed. Based on these preliminary results, the best of the selected R tools seem to be ClustifyR (fast and rather precise) and SingleR (precise) among the correlation-based tools, and SingleCellNet (precise and rather fast) and scPred (precise, but many cell types remain unassigned) among the supervised classification tools. Finally, among the marker-based tools, MAESTRO and SCINA are rather robust if they are provided with high-quality markers.

Keywords
Single-cell RNA sequencing, Automatic cell-type annotation, Classification, Benchmark, Evaluation

Swedish summary (Sammanfattning)

Single-cell RNA sequencing enables the study of gene expression at the level of individual cells. However, one of the main current challenges for single-cell RNA sequencing is the identification of cell types. The current method consists of manually checking the expression of genes using the top differentially expressed genes and comparing them with the related cell-type markers available in scientific publications. Consequently, it is time-consuming and labour-intensive. Despite this, several automatic tools for cell-type identification and annotation, using different strategies, have been designed and developed over the last two years. The lack of specific comparisons of these tools in the literature, especially for immuno-oncological and oncological purposes, has nevertheless made it difficult for laboratories and companies to determine objectively which tools are actually best for distinguishing cell types. In this project, the current tools were reviewed and the relevant R tools were evaluated. The annotation performance, the computation time and the ease of use were also assessed. The preliminary results indicate that the best of the selected tools are ClustifyR (fast and rather precise) and SingleR (precise) among the correlation-based tools, and SingleCellNet (precise and rather fast) and scPred (precise, although many cell types remain unassigned) among the supervised classification tools. Finally, among the marker-based tools, MAESTRO and SCINA are powerful if they are provided with high-quality markers.

Keywords (Nyckelord)
Single-cell RNA sequencing, Automatic annotation of cell types, Classification, Benchmark, Evaluation

French summary (Résumé)

Single-cell RNA sequencing makes it possible to study gene expression at the level of individual cells. However, one of the main current challenges of single-cell RNA-sequencing analysis is the identification and annotation of cell types. The current method consists of manually checking gene expression using the top differentially expressed genes and comparing them with cell-type-specific markers found in scientific publications. This is therefore time-consuming and laborious. Over the last two years, however, a substantial number of automatic cell-type identification and annotation tools using different strategies have been created. But the lack of specific comparisons of these tools in the literature, especially for immuno-oncological and immunological purposes, makes it difficult for laboratories and companies to know objectively which tool is best for annotating cell types. In this project, a review of the current tools and an evaluation of the R tools were carried out. Annotation performance, computation time and ease of use were evaluated. Following these preliminary results, the best of the selected R tools appear to be ClustifyR (fast and rather precise) and SingleR (precise) among the correlation-based tools, and SingleCellNet (precise and rather fast) and scPred (precise, but many cell types remain unannotated) among the supervised classification tools. Finally, among the marker-based tools, MAESTRO and SCINA are rather robust if they are given high-quality markers.

Keywords (Mots clés)
Single-cell RNA sequencing, Automatic annotation of cell types, Classification, Comparison, Evaluation

Acknowledgments

I would first like to thank Mrs. Yufei Luo for supervising my work and giving me useful advice, as well as the bioinformatics team and the people at Servier who welcomed me and helped me with this project. I also want to thank Stefania Giacomello, PhD, for generously agreeing to review my project and helping me improve the content of this report. Finally, I thank my group in the HL205X course, supervised by Carsten Mim, PhD, as well as the people at KTH for their help and advice at different levels of this project.

Stockholm, June 2021
Corentin RAOUX

Contents

1 Introduction
  1.1 Background
  1.2 Challenge
  1.3 Purpose and Goals
  1.4 Delimitations
2 Methods
  2.1 Tools selection and installation
  2.2 Public Datasets Collection
    2.2.1 Test datasets
    2.2.2 Reference datasets
    2.2.3 Simulated dataset
    2.2.4 Data validity
  2.3 Evaluation Design
    2.3.1 Evaluation Criteria
    2.3.2 Evaluation Metrics
    2.3.3 Evaluation Benchmarking Strategies
    2.3.4 Verification of the reliability of the methods
3 Results and Analysis
  3.1 First configuration - Evaluation of the ability to accurately annotate major cell types
    3.1.1 Zhang Smart-Seq2 - Qian Colorectal
    3.1.2 Kim - Qian lung
    3.1.3 Analysis - (Tables 3.1, 3.2, 3.3, 3.4, 3.5 and 3.6)
  3.2 Second configuration - Evaluation of the ability to accurately annotate deeper sub cell types
    3.2.1 Zhang 10X Genomics - Nieto
    3.2.2 Kim - Nieto
    3.2.3 Analysis - (Tables 3.7, 3.8, 3.9, 3.10, 3.11 and 3.12)
  3.3 Computation time and ease to use
    3.3.1 Computation time
    3.3.2 Ease to use
4 Discussion and Conclusions
  4.1 Discussion
  4.2 Conclusions
  4.3 Future work
References
A State of the Art
  A.1 Introduction
  A.2 Example of field of study where scRNA-seq is applied
  A.3 Workflow of scRNA-seq
    A.3.1 Pre-processing
    A.3.2 Data processing and visualization
    A.3.3 Downstream analysis
  A.4 Review and research of the automatic cell type annotation tools
    A.4.1 Challenges of the annotation
    A.4.2 The different types of annotation tools
    A.4.3 Tool summary according to their categories
    A.4.4 Acquired knowledge in literature reviews
  A.5 Conclusion
B Supplement information
  B.1 Methods used in the literature
  B.2 Comments on these methods
C Further evaluation of MAESTRO

List of Figures

1.1 General scheme of the functioning of a tool
2.1 UMAP representation of the Zhang 10X Genomics test dataset with annotated cell types
2.2 UMAP representation of the Zhang Smart-seq2
