Endoscopy Disease Detection Challenge 2020
Total Page:16
File Type:pdf, Size:1020Kb
ENDOSCOPY DISEASE DETECTION CHALLENGE 2020 Sharib Ali1 Noha Ghatwary7 Barbara Braden2 Dominique Lamarque3 Adam Bailey 2 Stefano Realdon4 Renato Cannizzaro 5 Jens Rittscher1 Christian Daul6 James East 2 1 Institute of Biomedical Engineering, Big Data Institute, University of Oxford, Old Road Campus, Oxford, UK 2 Translational Gastroenterology Unit, Experimental Medicine Div., John Radcliffe Hospital, University of Oxford, Oxford, UK 3 Universite´ de Versailles St-Quentin en Yvelines, Hopitalˆ Ambroise Pare,´ France 4 Instituto Onclologico Veneto, IOV-IRCCS, Padova, Italy 5 CRO Centro Riferimento Oncologico IRCCS Aviano Italy 6 CRAN UMR 7039, University of Lorraine, CNRS, Nancy, France 7 University of Lincoln, UK ABSTRACT improve current medical practices and refine health-care Whilst many technologies are built around endoscopy, there systems worldwide. However, well-annotated, representative is a need to have a comprehensive dataset collected from publicly available datasets for disease detection for assessing multiple centers to address the generalization issues with research reproducibility and facilitating standardized com- most deep learning frameworks. What could be more im- parison of methods is still lacking. Many methods to detect portant than disease detection and localization? Through diseased regions in endoscopy have been proposed however our extensive network of clinical and computational experts, these have been primarily focused on the task of polyp we have collected, curated and annotated gastrointestinal detection in the gastrointestinal tract with demonstration on endoscopy video frames. We have released this dataset and datasets acquired from at most a few data centers and single have launched disease detection and segmentation challenge modality imaging, most commonly white light. Here, we EDD20201 to address the limitations and explore new di- present our multi-class, multi-organ and multi-population rections. EDD2020 is a crowd sourcing initiative to test the disease detection and segmentation challenge in clinical feasibility of recent deep learning methods and to promote endoscopy. With this sub-challenge we aim to establish a research for building robust technologies. In this paper, we comprehensive dataset and benchmark existing and novel provide an overview of the EDD2020 dataset, challenge algorithms for disease detection. tasks, evaluation strategies and a short summary of results on Specifically, we aim to assess localization of disease re- test data. A detailed paper will be drafted after the challenge gions using bounding boxes and exact pixel-level segmenta- workshop with more detailed analysis of the results. tion. Clinical applicability by assessing real-time monitoring and offline performance evaluations of algorithms for im- arXiv:2003.03376v1 [eess.IV] 7 Mar 2020 I. INTRODUCTION proved accuracy and better quantitative reporting is required Endoscopy is a widely used clinical procedure for the today. In this challenge, participants are provided with 380 early detection of cancers in hollow-organs such as oesoph- annotated video frames by medical experts and experienced agus, stomach, colon and bladder. During this procedure post doctoral researchers. The dataset is comprehensive and an endoscope is used; a long, thin, rigid or flexible tube consists of endoscopy frames from different gastrointestinal with a light source and camera at the tip to visualize the tract organs and varied endoscopy modalities. It incorporates inside of affected organs on an external screen. Computer- multiple populations with 4 different international centers assisted methods for accurate and temporally consistent and include below mentioned categories: localization and segmentation of diseased region-of-interests enable precise quantification and mapping of lesions from • Organ 1: Colon, associated disease: Polyp, cancer clinical endoscopy videos which is critical for monitoring • Organ 2: Oesophagus, associated disease: Barretts, dys- and surgical planning. Innovations have the potential to plasia and cancer • Organ 3: Stomach, associated disease: Pyloric inflam- 1https://edd2020.grand-challenge.org mation, dysplasia II. PROPOSED TASKS cases. Figure 1 shows label annotations for some frames To rigorously assess the methods for disease detection and in EDD2020 dataset. The bounding box annotations are localization in various organs, participants are requested to provided in PASCAL VOC format [1] and the label masks complete two sub-tasks on the provided dataset: are provided as 5 channel tif images. An open-source VIA annotation tool [2] was used for all annotations. 1. Disease detection and localization: This task is evalu- The test dataset is restricted to be used only for this ated based on the inference results on the test dataset challenge and is made available to the participants of this provided from a subset of the data collected for the challenge only. Methods are evaluated on an online leader- training. board and will be analysed at the end of the competition. 2. Disease area segmentation: This task will be evaluated Codes to help participants getting started have been made based on the results of the test dataset provided from a available online3. subset of the data collected for training 3. Out-of-sample detection: This task will be evaluated on IV. EVALUATION a completely unknown set of images which have not been used in either of training or testing datasets for The challenge problems fall into three distinct categories. other tasks For each there exists already well-defined evaluation metrics used by the wider imaging community which we use for Participants are required to identify, localize and segment evaluation here. Codes related to all evaluation metrics used five classes consisting of: 1) Normal dysplastic Barrett’s Oe- in this challenge are also available online4. sophagus (NDBE), 2) suspicious area (“subtle pre-cancerous lesion”), 3) high-grade dysplasia, 4) adenocarcinoma (“can- IV-A. Detection score cer”), and 5) polyp. Two or more classes can be overlapped, for e.g., Barrett’s can colocalize with low-grade or high- Metrics used for multi-class disease detection: grade dysplastic area. • IoU - intersection over union. This metric measures the overlap between two bounding boxes A and B as the III. DATASET ratio between the overlapped area A \ B over the total The EDD2020 dataset 2 is unique and consists of five area A [ B occupied by the two boxes: classes for both detection and segmentation tasks. All data A \ B are from the gastrointestinal tract that comprise of oesoph- IoU = (1) A [ B agus, stomach, colon and rectum. The data is collected from Ambroise PareHospital´ of Boulogne-Billancourt, Paris, where \, [ denote the intersection and union respec- France; Centro Riferimento Oncologico IRCCS, Aviano, tively. In terms of numbers of true positives (TP), Italy; Istituto Oncologico Veneto, Padova, Italy and John false positives (FP) and false negatives (FN), IoU (aka Radcliffe Hospital, Oxford, UK. All the collected data Jaccard J) can be defined as: were annonymised at pre-collection stage by the primary TP IoU=J = (2) institution and by the EDD2020 challenge organisers before TP + FP + FN releasing the data publicly. No patient information has been • mAP - mean average precision of detected disease class provided or communicated in this process and all data TP with precision (p) defined as p = TP +FP and recall has been acquired through patient consent at primary host TP institutions. (r) as r = TP +FN . This metric measures the ability of Annotations were performed by two clinical experts and an object detector to accurately retrieve all instances two post-doctoral researchers with a background in en- of the ground truth bounding boxes. The higher the doscopy imaging. Due to annotation variability and difficult mAP the better the performance. Average precision to get consensus between experts in some cases of this (AP) is computed as the Area Under Curve (AUC) of dataset we would like to declare that these annotations are the precision-recall curve of detection sampled at all subjective to the annotators involved in the preparation of unique recall values (r1; r2; :::) whenever the maximum this dataset. There were four different iterations of this precision value drops: dataset. The dataset consists of 386 still image frames X AP = f(rn+1 − rn) pinterp(rn+1)g; (3) acquired from multiple videos. The ground truth for training n data consists of 160 manually labeled annotations for non- dysplastic Barrett’s, 88 cases for suspicious precancerous with pinterp(rn+1) = max p(~r). Here, p(rn) denotes r~≥rn+1 lesion, 74 for high-grade dysplasia, 53 cancer and 127 polyp the precision value at a given recall value. This defi- masks with a total of 503 ground truth annotations. The nition ensures monotonically decreasing precision. The same number of bounding boxes also exists for each of these 3https://github.com/sharibox/EDD2020 2http://dx.doi.org/10.21227/f8xg-wb80 4https://github.com/sharibox/EndoCV2020 Fig. 1. Endoscopy disease detection and segmentation dataset (EDD2020). The top two rows consists of mostly polyps, middle two rows consists of mostly suspected and high-grade dysplastic (HGD) cases in stomach, and finally bottom two row represents non-dysplastic Barrett’s oesophagus (NDBE), suspected dysplasia and one case for cancer. mAP is the mean of AP over all disease classes i for the training nor test data of the detection and segmentation N = 5 classes given as tasks. Participants will be ranked based on a score gap of 1 X generalization defined as: mAP = AP (4) N i i devg =j mAPd − mAPg j (7) This definition was popularised in the PASCAL VOC It is worth noting that the highest mAPg with devg ! 0 is challenge [1]. The final mAP (mAPd) was computed required for winning the competition. as an average mAPs for IoU from 0.25 to 0.75 with The deviation score devg can be either positive or nega- a step-size of 0.05 which means an average over 11 tive, however, the absolute difference should be very small IoU levels are used for 5 categories in the competition (ideally, devg = 0). The best algorithm should have high (mAP @[:25 : :05 : :75] ).