Review and Analysis of Single-Cell RNA Sequencing Cell-Type Identification and Annotation Tools


Degree project in Medical Engineering, second cycle, 30 credits
Stockholm, Sweden 2021

Author: Corentin Raoux
KTH Royal Institute of Technology, School of Engineering Sciences in Chemistry, Biotechnology and Health
Degree Programme in Medical Engineering
Date: June 9, 2021
Supervisor: Yufei Luo
Examiner: Matilda Larsson
Host company: Servier
Swedish title: Granskning och Analys av enkelcells-RNA-sekvenseringsverktyg för identifiering och annotering av celltyper
© 2021 Corentin RAOUX

Abstract

Single-cell RNA sequencing makes it possible to study gene expression at the level of individual cells. However, one of the main challenges of single-cell RNA sequencing analysis today is the identification and annotation of cell types. The current method consists of manually checking gene expression using the top differentially expressed genes and comparing them with related cell-type markers available in scientific publications. It is therefore time-consuming and labour-intensive. Nevertheless, in the last two years, numerous automatic cell-type identification and annotation tools using different strategies have been created. However, the lack of specific comparisons of those tools in the literature, especially for immuno-oncologic and oncologic purposes, makes it difficult for laboratories and companies to know objectively which tools are best for annotating cell types.

In this project, a review of the current tools and an evaluation of R tools were carried out. The annotation performance, the computation time and the ease of use were assessed. Based on these preliminary results, the best of the selected R tools seem to be ClustifyR (fast and rather precise) and SingleR (precise) for the correlation-based tools, and SingleCellNet (precise and rather fast) and scPred (precise, but many cell types remain unassigned) for the supervised classification tools. Finally, for the marker-based tools, MAESTRO and SCINA are rather robust if they are provided with high-quality markers.

Keywords: Single-cell RNA sequencing, Automatic cell types annotation, Classification, Benchmark, Evaluation

Sammanfattning (Swedish abstract)

Single-cell RNA sequencing enables the study of gene expression at the level of individual cells. However, one of the main current challenges of single-cell RNA sequencing is the identification of cell types. The current method consists of manually checking gene expression using the top differentially expressed genes and comparing them with the related cell-type markers available in scientific publications. Consequently, it is time-consuming and labour-intensive. Nevertheless, several automatic cell-type identification and annotation tools using different strategies have been designed and developed over the last two years. However, the lack of specific comparisons of these tools in the literature, especially for immuno-oncological and oncological purposes, has made it difficult for laboratories and companies to determine objectively which tools are actually best for distinguishing cell types. In this project, the current tools were reviewed and the relevant R tools were evaluated. The annotation performance, the computation time and the user-friendliness were also assessed.

The preliminary results indicate that the best of the selected tools are ClustifyR (fast and fairly accurate) and SingleR (accurate) among the correlation-based tools, and SingleCellNet (accurate and fairly fast) and scPred (accurate, although many cell types remain unassigned) among the supervised classification tools. Finally, among the marker-based tools, MAESTRO and SCINA are powerful if they are provided with high-quality markers.

Keywords: Single-cell RNA sequencing, Automatic annotation of cell types, Classification, Benchmark, Evaluation

Résumé (French abstract)

Single-cell RNA sequencing makes it possible to study gene expression at the level of individual cells. However, one of the main current challenges in single-cell RNA sequencing analysis is the identification and annotation of cell types. The current method consists of manually checking gene expression using the top differentially expressed genes and comparing them with cell-type-specific markers found in scientific publications. This is therefore time-consuming and laborious. However, over the last two years, a substantial number of automatic cell-type identification and annotation tools using different strategies have been created. But the lack of specific comparisons of these tools in the literature, especially for immuno-oncological and immunological purposes, makes it difficult for laboratories and companies to know objectively which tools are best for annotating cell types. In this project, a review of the current tools and an evaluation of R tools were carried out. The annotation performance, the computation time and the ease of use were assessed.

Based on these preliminary results, the best of the selected R tools seem to be ClustifyR (fast and rather precise) and SingleR (precise) for the correlation-based tools, and SingleCellNet (precise and rather fast) and scPred (precise, but many cell types remain unannotated) for the supervised classification tools. Finally, for the marker-based tools, MAESTRO and SCINA are rather robust if they are provided with high-quality markers.

Keywords: Single-cell RNA sequencing, Automatic annotation of cell types, Classification, Comparison, Evaluation

Acknowledgments

I would first like to thank Mrs. Yufei Luo for supervising my work and giving me useful advice, as well as the bioinformatics team and the various people at Servier who welcomed me and helped me with this project. I also want to thank Stefania Giacomello, PhD, for generously agreeing to review my project and for helping me improve the content of this report. Finally, I thank my group in the HL205X course, supervised by Carsten Mim, PhD, as well as the various people at KTH for their help and advice at different levels of this project.

Stockholm, June 2021
Corentin RAOUX

Contents

1 Introduction
  1.1 Background
  1.2 Challenge
  1.3 Purpose and Goals
  1.4 Delimitations
2 Methods
  2.1 Tools selection and installation
  2.2 Public Datasets Collection
    2.2.1 Test datasets
    2.2.2 Reference datasets
    2.2.3 Simulated dataset
    2.2.4 Data validity
  2.3 Evaluation Design
    2.3.1 Evaluation Criteria
    2.3.2 Evaluation Metrics
    2.3.3 Evaluation Benchmarking Strategies
    2.3.4 Verification of the reliability of the methods
3 Results and Analysis
  3.1 First configuration - Evaluation of the ability to accurately annotate major cell types
    3.1.1 Zhang Smart-Seq2 - Qian Colorectal
    3.1.2 Kim - Qian lung
    3.1.3 Analysis - (Tables 3.1, 3.2, 3.3, 3.4, 3.5 and 3.6)
  3.2 Second configuration - Evaluation of the ability to accurately annotate deeper sub cell types
    3.2.1 Zhang 10X Genomics - Nieto
    3.2.2 Kim - Nieto
    3.2.3 Analysis - (Tables 3.7, 3.8, 3.9, 3.10, 3.11 and 3.12)
  3.3 Computation time and ease to use
    3.3.1 Computation time
    3.3.2 Ease to use
4 Discussion and Conclusions
  4.1 Discussion
  4.2 Conclusions
  4.3 Future work
References
A State of the Art
  A.1 Introduction
  A.2 Example of field of study where scRNA-seq is applied
  A.3 Workflow of scRNA-seq
    A.3.1 Pre-processing
    A.3.2 Data processing and visualization
    A.3.3 Downstream analysis
  A.4 Review and research of the automatic cell type annotation tools
    A.4.1 Challenges of the annotation
    A.4.2 The different types of annotation tools
    A.4.3 Tool summary according to their categories
    A.4.4 Acquired knowledge in literature reviews
  A.5 Conclusion
B Supplement information
  B.1 Methods used in the literature
  B.2 Comments on these methods
C Further evaluation of MAESTRO

List of Figures

1.1 General scheme of the functioning of a tool
2.1 UMAP representation of the Zhang 10X Genomics test dataset with annotated cell types
2.2 UMAP representation of the Zhang Smart-seq2