Electronics — Article

A Comparative Analysis of Object Detection Metrics with a Companion Open-Source Toolkit

Rafael Padilla *, Wesley L. Passos, Thadeu L. B. Dias, Sergio L. Netto and Eduardo A. B. da Silva

Electrical Engineering Program/Alberto Luiz Coimbra Institute for Post-Graduation and Research in Engineering (PEE/COPPE), PO Box 68504, Rio de Janeiro 21941-972, RJ, Brazil; [email protected] (W.L.P.); [email protected] (T.L.B.D.); [email protected] (S.L.N.); [email protected] (E.A.B.d.S.)
* Correspondence: [email protected]

Abstract: Recent outstanding results of supervised object detection in competitions and challenges are often associated with specific metrics and datasets. The evaluation of such methods applied in different contexts has increased the demand for annotated datasets. Annotation tools represent the location and size of objects in distinct formats, leading to a lack of consensus on the representation. Such a scenario often complicates the comparison of object detection methods. This work alleviates this problem along the following lines: (i) it provides an overview of the most relevant evaluation methods used in object detection competitions, highlighting their peculiarities, differences, and advantages; (ii) it examines the most used annotation formats, showing how different implementations may influence the assessment results; and (iii) it provides a novel open-source toolkit supporting different annotation formats and 15 performance metrics, making it easy for researchers to evaluate the performance of their detection algorithms on most known datasets. In addition, this work proposes a new metric, also included in the toolkit, for evaluating object detection in videos, based on the spatio-temporal overlap between the ground-truth and detected bounding boxes.
Citation: Padilla, R.; Passos, W.L.; Dias, T.L.B.; Netto, S.L.; da Silva, E.A.B. A Comparative Analysis of Object Detection Metrics with a Companion Open-Source Toolkit. Electronics 2021, 10, 279. https://doi.org/10.3390/electronics10030279

Keywords: object-detection metrics; precision; recall; evaluation; automatic assessment; bounding boxes

Academic Editor: Tomasz Trzcinski
Received: 25 December 2020; Accepted: 20 January 2021; Published: 25 January 2021

Copyright: © 2021 by the authors. Licensee MDPI, Basel, Switzerland.

1. Introduction

The human visual system can effectively distinguish objects in different environments and contexts, even under a variety of constraints such as low illumination [1], color differences [2], and occlusions [3,4]. In addition, objects are key to the understanding of a scene's context, which lends paramount importance to the estimation of their precise location and classification. This has led computer vision researchers to explore automatic object detection for decades [5], reaching impressive results, particularly in the last few years [6–9].

Object detection algorithms attempt to locate general occurrences of one or more predefined classes of objects. In a system designed to detect pedestrians, for instance, an algorithm tries to locate all pedestrians that appear within an image or a video [3,10,11]. In the identification task, however, an algorithm tries to recognize a specific instance of a given class of objects. In the pedestrian example, an identification algorithm aims to determine the identity of each pedestrian previously detected.

Initially, real-time object detection applications were limited to only one object type at a time [12], mostly due to hardware limitations.
This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Later on, advancements in object detection techniques led to their increasing adoption in areas that included the manufacturing industry, with optical inspections [13], video surveillance [14], forensics [15,16], medical image analysis [17–19], autonomous vehicles [20], and traffic monitoring [21]. In the last decade, the use of deep neural networks (DNNs) has completely changed the landscape of the computer vision field [22]. DNNs have allowed for drastic improvements in image classification, image segmentation, anomaly detection, optical character recognition (OCR), action recognition, image generation, and object detection [5].

The field of object detection has yielded significant improvements in both efficiency and accuracy. To validate such improvements, new techniques must be assessed against current state-of-the-art approaches, preferably over widely available datasets. However, benchmark datasets and evaluation metrics differ from work to work, often making comparative assessments confusing and misleading. We identified two main reasons for such confusion:

• Detectors often differ in their bounding box representation formats. Boxes can be represented, for instance, by their upper-left corner coordinates (x, y) and their absolute dimensions (width, height) in pixels, or by their relative coordinates (x_rel, y_rel) and dimensions (width_rel, height_rel), with the values normalized by the image size, among other conventions;
• Each performance assessment tool implements a set of different metrics, requiring specific formats for the ground-truth and detected bounding boxes.
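As an illustration of the first point, converting between the absolute and normalized conventions only requires the image dimensions. A minimal sketch follows, assuming the upper-left-corner (x, y, width, height) layout; the function names are illustrative and not taken from any particular annotation tool:

```python
def to_relative(box, img_w, img_h):
    """(x, y, width, height) in pixels -> values normalized by the image size."""
    x, y, w, h = box
    return (x / img_w, y / img_h, w / img_w, h / img_h)

def to_absolute(box_rel, img_w, img_h):
    """Normalized (x_rel, y_rel, width_rel, height_rel) -> pixel values."""
    xr, yr, wr, hr = box_rel
    return (xr * img_w, yr * img_h, wr * img_w, hr * img_h)

# A 200x100-pixel box at (100, 50) in a 400x200 image:
rel = to_relative((100, 50, 200, 100), 400, 200)   # -> (0.25, 0.25, 0.5, 0.5)
```

Note that some annotation formats normalize the coordinates of the box center rather than of the upper-left corner, so the corner convention above is only one of the possibilities.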
Even though many tools have been developed to convert annotated boxes from one format to another, the quality assessment of the final detections still lacks a tool compatible with different bounding box formats and multiple metrics. Our previous work [23] contributed to the research community in this direction by presenting a tool that reads ground-truth and detected bounding boxes in a closed format and evaluates the detections using the average precision (AP) and mean average precision (mAP) metrics, as required in the PASCAL Challenge [24]. In this work, that contribution is significantly expanded by incorporating 13 other metrics, as well as by supporting additional annotation formats in the developed open-source toolbox. The new evaluation tool is available at https://github.com/rafaelpadilla/review_object_detection_metrics. We believe that our work significantly simplifies the task of evaluating object detection algorithms.

This work intends to explain in detail the computation of the most popular metrics used as benchmarks by the research community, particularly in online challenges and competitions, providing their mathematical foundations and a practical example to illustrate their applicability. In order to do so, after a brief contextualization of the object-detection field in Section 2, the most common annotation formats and assessment metrics are examined in Sections 3 and 4, respectively. A numerical example is provided in Section 5, illustrating the previous concepts from a practical perspective. Popular metrics are further addressed in Section 6. In Section 7, object detection in videos is discussed from an integrated spatio-temporal point of view, and a new metric for videos is provided. Section 8 presents an open-source and freely distributed toolkit that implements all discussed concepts in a unified and validated way, as verified in Section 9. Finally, Section 10 concludes the paper by summarizing its main technical contributions.
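Metrics such as AP rest on the intersection over union (IoU) between a detected box and a ground-truth box to decide whether a detection counts as correct; the PASCAL Challenge, for instance, requires an IoU of at least 0.5. A minimal sketch, assuming boxes in absolute (x, y, width, height) format with (x, y) the upper-left corner (the function name is an illustrative choice, not the toolkit's API):

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x, y, width, height)
    in absolute pixels, with (x, y) the upper-left corner."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Corners of the intersection rectangle (empty if the boxes are disjoint).
    ix1, iy1 = max(ax, bx), max(ay, by)
    ix2, iy2 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

# Two 10x10 boxes overlapping in a 5x5 region:
iou((0, 0, 10, 10), (5, 5, 10, 10))   # -> 25 / 175 ~= 0.143
```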
2. An Overview of Selected Works on Object Detection

Back in the mid-1950s and 1960s, the first attempts to recognize simple patterns in images were published [25,26]. These works identified primitive shapes and convex polygons based on contours. In the mid-1980s, more complex shapes started gaining meaning, such as in [27], which described an automated process to construct a three-dimensional geometric description of an airplane. To describe more complex objects, instead of characterizing them by their shapes, automated feature extraction methods were developed. Different methods attempted to find important feature points that, when combined, could describe objects broadly. Robust feature points are represented by distinctive pixels whose neighborhoods describe the same object irrespective of changes in pose, rotation, and illumination. The Harris detector [28] finds such points in the object corners based on local intensity changes. A local search algorithm using gradients was devised in [29] to solve the image registration problem, which was later expanded into a tracking algorithm [30] for identifying important points in videos.

More robust methods were able to identify characteristic pixel points and represent them as feature vectors. The so-called scale-invariant feature transform (SIFT) [31], for instance, applies the difference of Gaussians at several scales coupled with histograms of gradients, yielding characteristic points with features that are robust to scale changes and rotation. Another popular feature detector and descriptor, the speeded-up robust features (SURF) [32], was claimed to be faster and more robust than SIFT; it uses a blob detector based on the Hessian matrix for interest point detection and wavelet responses for feature representations. Feature-point representation methods alone are not able to perform object detection, but can help in