
Master of Science Thesis in Electrical Engineering, Department of Electrical Engineering, Linköping University, 2019

Evaluation of Multiple Object Tracking in Surveillance Video

Master of Science Thesis in Electrical Engineering
Evaluation of Multiple Object Tracking in Surveillance Video
Axel Nyström
LiTH-ISY-EX--19/5245--SE

Supervisors: Anderson Tavares, ISY, Linköping University; Niclas Appleby, National Forensic Centre
Examiner: Michael Felsberg, ISY, Linköping University

Computer Vision Laboratory
Department of Electrical Engineering
Linköping University
SE-581 83 Linköping, Sweden

Copyright © 2019 Axel Nyström

Sammanfattning

Visual multiple object tracking is the process of assigning unique and consistent identities to several objects in a video sequence. A popular method for object tracking is a technique called tracking-by-detection. Tracking-by-detection is a two-stage process in which an object detection algorithm first finds objects in every frame of a video sequence; the found objects are then associated with already tracked objects by a tracking algorithm. One of the main aims of this thesis is to investigate how different object detection algorithms perform on the surveillance video on which the National Forensic Centre wants to use visual object tracking. The thesis also examines the correlation between the performance of the object detection algorithm and the performance of the complete tracking-by-detection system. Finally, it investigates how the use of visual descriptors in the tracking algorithm can affect the accuracy of a tracking-by-detection system.

The results presented in this thesis show that the capability of the object detection algorithm is a strong indicator of how the complete tracking-by-detection system performs. The thesis also shows how the use of visual descriptors in the tracking stage can reduce the number of identity switches and thereby increase the accuracy of the whole system.


Abstract

Multiple object tracking is the process of assigning unique and consistent identities to objects throughout a video sequence. A popular approach to multiple object tracking, and object tracking in general, is to use a method called tracking-by-detection. Tracking-by-detection is a two-stage procedure: an object detection algorithm first detects objects in a frame, and these objects are then associated with already tracked objects by a tracking algorithm. One of the main concerns of this thesis is to investigate how different object detection algorithms perform on surveillance video supplied by the National Forensic Centre. The thesis then goes on to explore how the stand-alone performance of the object detection algorithm correlates with the overall performance of a tracking-by-detection system. Finally, the thesis investigates how the use of visual descriptors in the tracking stage of a tracking-by-detection system affects performance.

Results presented in this thesis suggest that the capacity of the object detection algorithm is highly indicative of the overall performance of the tracking-by-detection system. Further, this thesis also shows how the use of visual descriptors in the tracking stage can reduce the number of identity switches and thereby increase the performance of the whole system.


Acknowledgments

I would like to thank NFC, and specifically Niclas Appleby, for giving me the opportunity to work on this thesis as well as supplying me with adequate hardware. I would also like to thank my supervisor Anderson Tavares and my examiner Michael Felsberg for providing me with great feedback. Finally, a shout-out to my coffee break mates, you made writing this thesis enjoyable.

Linköping, 2019 Axel


Contents

Notation

1 Introduction
  1.1 Background
    1.1.1 Object Detection
    1.1.2 Multiple Object Tracking
  1.2 Problem formulation
  1.3 Limitations

2 Theory and Related Work
  2.1 Tracking-by-Detection
  2.2 Image Classification Networks
  2.3 Object Detection Algorithms
    2.3.1 R-CNN
    2.3.2 Fast R-CNN
    2.3.3 Faster R-CNN
    2.3.4 Region Proposal Networks
    2.3.5 Mask R-CNN
    2.3.6 YOLO
    2.3.7 YOLOv2
    2.3.8 YOLOv3
    2.3.9 Feature Pyramid Network
    2.3.10 Single Shot Detector
    2.3.11 RetinaNet
  2.4 Tracking Algorithms
    2.4.1 SORT
    2.4.2 Deep SORT

3 Method
  3.1 Data Annotation
  3.2 Evaluation
    3.2.1 Classification of Predicted Bounding Boxes
    3.2.2 Object Detection Evaluation
    3.2.3 Object Tracking Evaluation
  3.3 Test Environment
    3.3.1 Testing Object Detection Algorithms
    3.3.2 Testing Object Tracking Algorithms
    3.3.3 Hardware
  3.4 Algorithms and Implementations
    3.4.1 Deep Learning Libraries
    3.4.2 Implementations

4 Results
  4.1 Object Detection Results
  4.2 Object Tracking Results

5 Discussion
  5.1 Object Detection
  5.2 Object Tracking

6 Conclusions
  6.1 Future Work
  6.2 Ethics

Bibliography

A Technical Report of CDIO Project

Notation

Abbreviations

Abbreviation   Meaning
CNN            Convolutional Neural Network
SVM            Support Vector Machine
NFC            National Forensic Centre
RoI            Region of Interest
IoU            Intersection over Union
YOLO           You Only Look Once
SSD            Single Shot Detector
SORT           Simple Online and Real-time Tracker
RPN            Region Proposal Network
FPN            Feature Pyramid Network
CPU            Central Processing Unit
GPU            Graphics Processing Unit
MOT            Multiple Object Tracking
CSV            Comma Separated Value
TP             True Positive
FP             False Positive
TN             True Negative
FN             False Negative


1 Introduction

Video tracking is an area of computer vision that deals with localization of moving objects in video. There are many applications of video tracking in fields such as robotics, sport analysis, and video surveillance. These applications often require multiple objects to be tracked at the same time, which is referred to as multiple object tracking.

A popular approach to object tracking is to use a method called tracking-by-detection. Tracking-by-detection uses an object detection algorithm to detect objects present in a frame. These objects are then tracked by associating objects in the current frame with objects from previous frames using a tracking algorithm. Having a reliable method for object detection is crucial since the tracking algorithm is dependent on objects being detected in each frame. Lately, object detection algorithms based on convolutional neural networks have been able to achieve greater accuracy than traditional object detection methods. This improvement in object detection accuracy has facilitated the use of tracking-by-detection methods for multiple object tracking.

1.1 Background

National Forensic Centre (NFC) is an organization within the Swedish police authority that is responsible for forensics for the Swedish police. The section for technology at NFC handles, among other things, forensic image analysis. NFC wants to investigate how video tracking in surveillance cameras can be used to support criminal investigations and to ease the work of surveillance camera operators.

NFC has previously hosted a student project related to video tracking. That project was part of the course Image and Graphics, Project Course CDIO at Linköping University (see Appendix A). The result of that project was a tracking-by-detection system, which uses a detection algorithm called YOLO [33] to detect objects and a tracking algorithm called Deep SORT [44] to track detected objects. The system is also able to perform person re-identification using an algorithm called AlignedReID [45]. Person re-identification is the practice of identifying the same person across multiple different cameras. This system is relevant because it is used as the test environment throughout this thesis.

1.1.1 Object Detection

Object detection is the process of localizing and classifying objects present in an image. Today's state-of-the-art object detection algorithms typically utilize convolutional neural networks in some way. There are two main categories of such object detection algorithms: single-stage detectors and two-stage detectors. The major difference between the two categories is that two-stage detectors first find regions of interest in the image and then classify the regions separately, whereas single-stage detectors predict bounding boxes and classify objects simultaneously [5]. This generally gives single-stage detectors increased speed at the cost of accuracy when compared to two-stage, region-based detectors.

1.1.2 Multiple Object Tracking

The tracking stage of a tracking-by-detection method can be seen as solving two different tasks. First, future positions of tracked objects are predicted; this is commonly done using methods such as the Kalman filter [19]. Next, objects detected in a new frame are associated with already tracked objects based on the predicted future positions of the already tracked objects. If there are as many detections as there are already tracked objects, this association can be seen as an assignment problem, which can be solved using the Hungarian method [21]. Deep SORT [44], which is used in the system already implemented at NFC, predicts future positions using a Kalman filter and then solves the assignment problem with the Hungarian method. In addition to this, Deep SORT also uses a visual descriptor to improve the accuracy of the tracking. This visual descriptor is a 128-dimensional vector obtained by feeding an object's bounding box into a convolutional neural network. The convolutional neural network has been trained to distinguish pedestrians from each other, which means that it is especially well suited to tracking people.

1.2 Problem formulation

NFC is interested in investigating how the accuracy of their tracking-by-detection system can be improved. Since tracking-by-detection systems are limited by the performance of the detection algorithm, it is relevant to examine how the choice of detection algorithm affects the accuracy and speed of the system. This trade-off between speed and accuracy has been studied in papers such as [17], where several modern object detection algorithms are compared.

Another aspect of the tracking-by-detection method that could affect the overall accuracy is the use of visual descriptors in the tracking stage. Deep SORT [44], which is used in the existing system at NFC, is an extension of the algorithm SORT [4], which does not use any appearance information in the tracking stage. Comparing SORT to Deep SORT would give insights into how tracking performance and, by extension, the whole system's performance is affected by the use of visual descriptors.

All quantitative analyses will be done on the same form of surveillance video data on which NFC plans to use the system. The research questions should also be seen in this context; they do not seek to answer the general case but rather the specific case in which NFC's data is used. The research questions this thesis aims to answer are the following:

• How does the choice of detection algorithm affect a tracking-by-detection method that is used to track people?

• How does the use of visual descriptors in the tracking stage of a tracking-by-detection method affect the accuracy?

1.3 Limitations

The major limitation of this thesis is that NFC does not have any annotated data. Some data is annotated as a part of this thesis in order to be able to quantitatively measure performance. This annotated data is however not nearly enough to train object detection algorithms with and is therefore only used as test data. Instead, pre-trained weights supplied with algorithm implementations are used throughout this thesis.

This thesis also limits the number of tested tracking algorithms to two: SORT and Deep SORT. Other algorithms could have been tested, but SORT and Deep SORT were chosen due to how similar they are, apart from the use of visual descriptors in Deep SORT. Testing more algorithms would also have further broadened the scope of the thesis and made it more time-consuming.

2 Theory and Related Work

This chapter presents the theory, related work, and key concepts that are relevant to this thesis. Theoretical concepts are purposefully presented at a rather abstract level so that this chapter does not become overly lengthy. This means that the thesis will not delve into the basics of computer vision or deep learning. Therefore, it is beneficial for the reader to have a basic understanding of both deep learning and computer vision. If the reader feels the need to brush up on these subjects, a good starting point for deep learning is [12], and for an overview of computer vision [41] is recommended. Further, the reader may also need to read up on some general machine learning concepts, in which case [13] is suggested.

The chapter begins with an introduction to tracking-by-detection and a short presentation of some image classification networks. It then continues with descriptions of different object detection algorithms. The ambition is to present the object detection algorithms in such a way that it is clear how they differ from each other. Thus, details about the training procedure have mostly been omitted unless the training procedure is a central characteristic of the algorithm. Algorithms are presented in sequence if they are related to each other; otherwise, the algorithms are presented in the chronological order in which they were published. The chapter finishes with descriptions of the tracking algorithms used in this thesis.


2.1 Tracking-by-Detection

Multiple object tracking is the task of assigning consistent and unique identities to multiple objects in a video sequence. This thesis examines an object tracking technique called tracking-by-detection. Tracking-by-detection is a two-stage process: an object detection algorithm first detects objects present in a frame; these objects are then associated with already tracked objects by a tracking algorithm [27]. Normally, the object detection algorithm and the tracking algorithm are completely separated from each other and can therefore be analyzed individually.

Object detection is the process of detecting particular classes of objects in an image; examples of classes are people or bags. The aim of an object detection algorithm is to both localize and classify objects belonging to any of the sought-after classes [5]. Thus, for each detected object, an object detection algorithm produces estimates of the position, size and class of the object. The position and size of a detected object are often represented by a bounding box, which is a rectangular box encompassing the object. The extent of a detected object can also be defined by a segmentation mask, which is a pixel-level mask of the object [41].

Object detection has developed significantly in the last decade due to advances in the closely related field of image classification. This progress is owed to breakthroughs [20] in how convolutional neural networks (CNNs) can be utilized to classify images. The object detection algorithms considered in this thesis usually consist of a CNN designed for image classification with additional algorithm-specific structure around the CNN. The CNN is referred to as the backbone of the algorithm and the algorithm-specific structure is called the meta architecture. This thesis will, as is custom, identify object detection algorithms by their meta architecture. CNN-based object detection algorithms can be split into two different groups: single-stage and two-stage detectors [5]. Two-stage detectors first generate possible bounding boxes by segmenting an image into regions of interest; these regions are then separately classified by a CNN in a second stage. Single-stage detectors produce estimates of both bounding boxes and classes in a single forward pass of an image through a CNN. Traditionally, two-stage detectors have achieved higher accuracy at the cost of speed compared to single-stage detectors. However, the recently introduced loss function Focal Loss [24] has made single-stage detectors able to approach two-stage detectors in terms of accuracy. The trade-off between speed and accuracy is a major design choice and has been studied in papers such as [17].

The tracking algorithm in a tracking-by-detection framework is responsible for assigning unique identities to tracked objects and for making object associations between frames. The main focus of this thesis is on object detection algorithms, and only two different tracking algorithms will be considered: SORT [4] and Deep SORT [44]. SORT stands for Simple Online and Realtime Tracking; it is a deliberately simple tracking algorithm that uses a Kalman filter [19] to estimate future positions of objects and makes frame-to-frame associations using the Hungarian method [21]. Deep SORT is an extension of SORT that incorporates appearance information when doing object associations between frames.

2.2 Image Classification Networks

As described in section 2.1, the object detection algorithms studied in this thesis consist of an algorithm-specific meta architecture and a backbone, with backbones being CNNs originally constructed for image classification. This thesis focuses on the meta architectures, and thus only a brief explanation of different backbones will be provided. The following list is a short introduction to the relevant backbones and how they compare to each other in terms of accuracy and complexity. Complexity is measured as the number of floating-point operations (FLOPs) required for a forward pass and is used as an indication of how fast a network is.

• VGG-16: A convolutional neural network with 16 layers that performed well in the 2014 ILSVRC challenge [38]. Forward passing an image with resolution 224 × 224 pixels requires roughly 15 · 10^9 FLOPs [15]. It achieves 71.93% accuracy on the ImageNet validation dataset [15].

• ResNet-50: A residual network with 50 layers; it achieves 77.15% accuracy on the ImageNet validation dataset [15]. Forward passing a 224 × 224 image requires 3.8 · 10^9 FLOPs [15].

• ResNet-101: A residual network with 101 layers, achieving 78.25% accuracy on ImageNet [15]. Forward passing a 224 × 224 image requires 7.6 · 10^9 FLOPs [15].

• Darknet-53: A 53-layer network with residual layers, designed specifically for use in YOLOv3 [33]. On ImageNet its accuracy of 77.2% is similar to ResNet-101's 77.15%. It also requires roughly the same number of operations to perform a single forward pass: forward passing an image with resolution 256 × 256 pixels requires roughly 18.7 · 10^9 FLOPs. Darknet-53 is however significantly faster since it utilizes the GPU more effectively [33].

2.3 Object Detection Algorithms

This section provides a theoretical introduction to the object detection algorithms tested in the experimental part of this thesis. The tested algorithms' predecessors are also explained, since this makes it easier to understand the tested algorithms.

2.3.1 R-CNN

R-CNN, short for Regions with CNN features, is a method for object detection introduced by Girshick et al. in [10]. The system is a pipeline with three main parts: a region proposal method, a convolutional neural network and a set of support vector machines (SVMs). Figure 2.3.1 below shows the interaction of the main parts of R-CNN. First, the region proposal method segments an image into category-independent regions. This generates approximately 2000 regions per image. After segmenting the image, each region is warped to a fixed size to fit the required input size of the CNN. Next, the 2000 warped regions are separately fed through the CNN and a feature vector is extracted for each region. Each feature vector is then classified by a set of linear SVMs, where each SVM is trained to classify one specific class. Finally, given the class predicted by the SVMs, ridge regression is used to improve the predicted shape of the bounding box. When all regions are scored, non-maximum suppression is applied to remove predicted bounding boxes that overlap with predictions with a higher score. R-CNN is not restricted to any specific segmentation method or a specific CNN architecture. In [10], a segmentation method called selective search [43] is used and results are demonstrated for the CNN architectures presented in [20] and [38].

Figure 2.3.1: Schematic of the R-CNN pipeline.

The authors of R-CNN also showed that supervised pre-training on a similar problem is an effective way to initialize the weights of a CNN. In [10], initial weights of the CNN were obtained by pre-training the CNN to perform image classification on data from ILSVRC2013 [37]. The CNN was then fine-tuned to perform object detection by training it on the Pascal VOC 2012 [8] dataset, which is an object detection dataset. This is a form of transfer learning that has been shown to be an effective approach when adapting CNNs to domains where training data is sparse [42].

2.3.2 Fast R-CNN

A major drawback of the R-CNN method is that it is slow. This is primarily due to each region proposal being passed through the CNN separately, which is time consuming. In order to increase the speed, an object detection method called Fast R-CNN was introduced by Girshick in [9]. Fast R-CNN increases the speed of object detection mainly by passing the image through the CNN only once, instead of once for every region, as R-CNN does.

As in R-CNN, a region proposal method first segments the image into category-independent regions, creating regions of interest (RoIs). The whole image is then processed by a CNN that produces convolutional feature maps of the image. Next, for each region proposal, an RoI pooling layer that uses spatial pyramid pooling [14] is applied to the feature maps. This converts each RoI into a fixed-size vector. The feature vector is then processed by fully connected layers that branch out into two different output layers. One of the output layers is a softmax layer that produces estimates for the object classes. The other layer is a bounding box regressor that outputs refined estimates of the bounding boxes for each of the object classes. Figure 2.3.2 shows how an image is processed by Fast R-CNN.

Figure 2.3.2: Schematic of the Fast R-CNN pipeline.

Another advantage of Fast R-CNN is that it can be trained in a single stage, as opposed to R-CNN, which requires its modules to be trained separately. Fast R-CNN accomplishes this by using a single loss function to account for both classification and bounding box regression simultaneously. This loss function enables Fast R-CNN to jointly train classification and bounding box regression, and thus the whole network can, except for the region proposals, be trained end-to-end.
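To make the RoI pooling step concrete, the short sketch below uses torchvision's built-in roi_pool operator. This is an illustration only, not the implementation referenced above or the one used in this thesis, and the feature map size and box coordinates are made-up values.

import torch
from torchvision.ops import roi_pool

# One feature map in a batch: (batch, channels, height, width)
feature_maps = torch.randn(1, 256, 50, 50)

# Region proposals as (batch_index, x1, y1, x2, y2), here given directly in
# feature-map coordinates
rois = torch.tensor([[0, 10.0, 10.0, 30.0, 40.0],
                     [0,  5.0, 20.0, 25.0, 45.0]])

# Every region is max-pooled to a fixed 7 x 7 spatial size regardless of its shape
pooled = roi_pool(feature_maps, rois, output_size=(7, 7), spatial_scale=1.0)
print(pooled.shape)  # torch.Size([2, 256, 7, 7])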

2.3.3 Faster R-CNN

Fast R-CNN improved the speed of object detection by forward passing an image through a CNN only once, instead of forward passing every region of interest in an image. For Fast R-CNN, the bottleneck instead lies in the image segmentation methods, which are usually implemented on the CPU. To resolve this, Ren et al. proposed a method called Faster R-CNN in [35]. Faster R-CNN removes the need for CPU computations by introducing the idea of Region Proposal Networks (RPNs). The RPN is a region proposal method that is described further in section 2.3.4.

For each position, the RPN makes several predictions relative to a fixed number of reference boxes; these reference boxes are called anchors. Anchors can be thought of as suggested bounding boxes for each sliding window location. In [35], the anchors are created at 3 scales and with 3 different aspect ratios, giving a total of 9 different anchors for each location. This means that the RPN produces 9 bounding boxes at every sliding window location, one for each anchor.

Figure 2.3.3: The Faster R-CNN network.

The regions generated by the RPN are then used as region proposals in Fast R-CNN, which was described in section 2.3.2. By using an RPN, Faster R-CNN removes the time-consuming image segmentation that was needed in Fast R-CNN. The speed is further increased by using a single CNN for both the RPN and Fast R-CNN. This also means that Faster R-CNN can be trained end-to-end by first training the RPN to propose regions and then using the region proposals to train Fast R-CNN.

2.3.4 Region Proposal Networks

Region Proposal Networks generate region proposals by sliding a small network over convolutional feature maps. At each location, the small network takes a window of the convolutional feature maps and converts it to a feature vector. This feature vector is then input into two different fully connected layers: one layer performs bounding box regression, and the other is a classification layer that predicts an objectness score. The objectness score is a prediction of how likely it is that the predicted bounding box contains an object compared to just being background.
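As an illustration of the anchor scheme described above, the following sketch generates 9 anchors (3 scales × 3 aspect ratios) for a single sliding-window position. The specific scale and ratio values are assumptions made for the example rather than the exact configuration of [35].

import numpy as np

def generate_anchors(center_x, center_y, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return anchors as (x1, y1, x2, y2), one per scale/ratio combination."""
    anchors = []
    for scale in scales:        # scale is the square root of the anchor area
        for ratio in ratios:    # ratio is interpreted as height / width
            w = scale / np.sqrt(ratio)
            h = scale * np.sqrt(ratio)
            anchors.append([center_x - w / 2, center_y - h / 2,
                            center_x + w / 2, center_y + h / 2])
    return np.array(anchors)

# 3 scales x 3 aspect ratios = 9 anchors for this sliding-window position
print(generate_anchors(300, 200).shape)  # (9, 4)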

2.3.5 Mask R-CNN

Mask R-CNN is an extension of Faster R-CNN that was presented by He et al. in [16]. In addition to object detection, Mask R-CNN is also able to perform object instance segmentation. Segmentation is implemented by adding a third branch to Faster R-CNN; this branch outputs an object mask for each detected object. To improve segmentation, a method called RoIAlign is introduced in order to extract a more precise feature map for each RoI. RoIAlign computes exact values of the feature map using bilinear interpolation instead of quantizing the feature map.

The authors of [16] found that Mask R-CNN achieves a higher average precision than Faster R-CNN in object detection. This was shown to be partially due to the use of RoIAlign and partially due to the multi-task loss used to train Mask R-CNN. Mask R-CNN is trained with a multi-task loss function that simultaneously accounts for classification, bounding box regression, and object segmentation.

2.3.6 YOLO

Redmon et al. introduced a novel approach to object detection in [34] called YOLO, You Only Look Once. Unlike R-CNN and its successors, YOLO does not use any region proposal method and instead uses a single CNN to predict both bounding boxes and classes.

In YOLO, an input image is first split into an S × S grid. Each grid cell is then responsible for predicting B bounding boxes as well as a confidence score for every bounding box. The confidence score is calculated as Pr(Object) · IoU^gt_pred, where Pr(Object) is the predicted probability that the box contains an object and IoU^gt_pred is the estimated intersection over union (IoU) between the predicted box and a ground truth box. For each grid cell, C object class probabilities are also predicted; these probabilities are conditioned on the cell containing an object. The predicted boxes and class probabilities are then combined into a single score for each class and box. Equation 2.3.1 is taken from the introduction of YOLO in [34] and shows how the class predictions and box predictions are combined. As in the original paper [34], Pr(Class_i) is used as a simplified notation for Pr(Class_i, Object).

Pr(Class_i | Object) · Pr(Object) · IoU^gt_pred = Pr(Class_i) · IoU^gt_pred        (2.3.1)

The score accounts both for the probability that the box contains class i, Pr(Class_i), and for how well the predicted box is estimated to fit a ground truth box, IoU^gt_pred. Figure 2.3.4 shows how an image is split into a grid and how the cell with the red dot centered in it predicts two different bounding boxes. The predicted bounding boxes are then combined with class probabilities, which are also obtained from the image grid, to produce the final object detections. The illustration is kept simple so that it is easier to understand; in reality there would be many more objects predicted by the grid.

Figure 2.3.4: Illustration of the grid used in YOLO.

The above-described procedure is realized by a custom CNN architecture inspired by GoogLeNet [40]. The custom CNN consists of 24 convolutional layers with 2 fully connected layers at the end. Each predicted bounding box is defined by 5 values: (x, y, width, height) and a confidence score. This means that the predictions output from the CNN are represented by an S × S × (B · 5 + C) tensor, where S × S is the grid size, B is the number of predicted boxes per grid cell, and C is the number of object classes. In [34] they use S = 7, B = 2 and C = 20, yielding a final prediction with the shape 7 × 7 × 30.
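A rough sketch of how such a prediction tensor can be interpreted is given below; it combines the box confidences with the conditional class probabilities as in equation 2.3.1. The layout of the last dimension is an assumption made for this example, and the tensor itself is random rather than a real network output.

import numpy as np

S, B, C = 7, 2, 20
prediction = np.random.rand(S, S, B * 5 + C)               # stand-in for a network output

boxes       = prediction[..., :B * 5].reshape(S, S, B, 5)  # (x, y, w, h, confidence) per box
class_probs = prediction[..., B * 5:]                      # Pr(Class_i | Object) per cell

# Combine each box confidence with the cell's class probabilities (equation 2.3.1)
confidences  = boxes[..., 4]                               # shape (S, S, B)
class_scores = confidences[..., np.newaxis] * class_probs[:, :, np.newaxis, :]
print(class_scores.shape)  # (7, 7, 2, 20): one score per cell, box and class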

2.3.7 YOLOv2

With the intention of improving YOLO, Redmon et al. proposed a method called YOLOv2 in [32]. YOLOv2 is a modified version of YOLO intended to increase both speed and accuracy.

Similarly to Faster R-CNN, YOLOv2 utilizes anchors when predicting bounding boxes. For each grid cell, YOLOv2 produces bounding boxes by predicting offsets to 5 anchors. Classes are now also predicted for each anchor instead of for each grid cell, and each anchor is also given an objectness score. As in YOLO, classes are predicted on the condition that there is an object, Pr(Class_i | Object). Objectness is calculated as the estimated IoU between the predicted box and an estimated ground truth box, IoU^gt_pred. YOLOv2 also employs a new method to determine anchor sizes; instead of hand-picking the anchors as in Faster R-CNN, YOLOv2 uses k-means clustering on the training data to produce anchors that are better fitted to the data.

In order to increase speed, a CNN architecture called Darknet-19 is introduced in YOLOv2. Darknet-19 is able to achieve higher image classification accuracy than both the widely used VGG-16 [38] and the custom network previously used in YOLO [34]. It manages to do this while only using 5.58 · 10^9 floating-point operations per forward pass, compared to 30.69 · 10^9 operations in VGG-16 and 8.52 · 10^9 operations in the network previously used in YOLO.

2.3.8 YOLOv3

YOLOv3 includes further improvements of YOLOv2, presented by Redmon et al. in [33]. Similarly to the feature pyramid networks described in section 2.3.9, boxes are predicted at 3 different scales in YOLOv3. This increases YOLOv3's ability to detect small objects, something which previous versions of YOLO struggled with.

Inspired by the residual networks presented in [15], Darknet-19 is expanded to include residual layers. This new CNN architecture is called Darknet-53 since it has 53 convolutional layers in total. Compared to Darknet-19, Darknet-53 has higher accuracy but is a bit slower.

2.3.9 Feature Pyramid Network

Classic object detection techniques based on hand-crafted features such as SIFT [26] and HOG [7] often use feature pyramids to detect objects at different scales. However, due to the large amount of memory needed to train a CNN with feature pyramids, methods such as R-CNN, Fast R-CNN and YOLO do not use feature pyramids. This changed when Lin et al. proposed a method in [23] that enables the use of feature pyramids by utilizing the pyramidal feature hierarchies created by CNNs. The proposed method is called Feature Pyramid Network (FPN) and can be applied to any CNN architecture. Examples of algorithms that use FPNs are Faster R-CNN, Mask R-CNN and RetinaNet.

To generate feature pyramids, FPN initially creates two different feature pyramids and then merges them by adding them to each other. To do this, feature maps generated by a CNN are first grouped into stages, where each stage contains all the feature maps that are of the same size. For each stage, the feature maps generated by the deepest layer in that stage are taken to represent a level in one of the feature pyramids. The other feature pyramid is constructed by continuously upsampling the final level of the first pyramid to create a feature pyramid with the same dimensions as the first one. These two feature pyramids are then merged to create the final feature pyramid. The merging is done to combine feature maps with finer detail from the first pyramid with coarser, but semantically stronger, feature maps from the second pyramid.
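A minimal sketch of one such merge step is shown below, assuming PyTorch and illustrative channel counts and sizes; it shows a lateral 1 × 1 projection, the upsampling and the element-wise addition, not the exact architecture of [23].

import torch
import torch.nn as nn
import torch.nn.functional as F

lateral_conv = nn.Conv2d(512, 256, kernel_size=1)   # 1 x 1 projection of the finer stage

coarse_level = torch.randn(1, 256, 13, 13)          # semantically stronger, lower resolution
finer_stage  = torch.randn(1, 512, 26, 26)          # finer detail from an earlier backbone stage

upsampled = F.interpolate(coarse_level, scale_factor=2, mode="nearest")
merged    = lateral_conv(finer_stage) + upsampled   # element-wise addition of the two pyramids
print(merged.shape)  # torch.Size([1, 256, 26, 26])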

2.3.10 Single Shot Detector

The Single Shot Detector (SSD) is a single-stage detector presented by Liu et al. in [25]. SSD adds convolutional layers to an existing CNN in order to produce layers with feature maps of smaller size. By then creating predictions at several different layers, SSD is able to detect objects at multiple scales. At each layer, object detections are produced by applying a number of 3 × 3 kernels to every position of the feature maps. This procedure is illustrated in figure 2.3.6, which shows how SSD produces predictions from multiple different feature maps.

All predictions are made relative to reference bounding boxes, which are called default boxes in [25]; default boxes are analogous to the anchor boxes used in Faster R-CNN. For each position and bounding box, a specific 3 × 3 kernel predicts a single output value denoting either a class score or an offset for the bounding box. This means that the total number of filters applied to a position of a feature map is (C + 4)B, where C is the number of classes and B is the number of default boxes. The total number of outputs for a feature map of size M × N is then (C + 4)BMN.

Figure 2.3.5: Merging of feature pyramids.

2.3.11 RetinaNet

The single-stage approach utilized in detectors such as YOLO and SSD enabled faster object detection. However, these single-stage detectors were not able to achieve the accuracy that two-stage detectors such as Faster R-CNN could offer. Lin et al. found that class imbalance in the training data was the principal cause for this. This class imbalance is a result of the large number of bounding boxes that a single-stage detector processes. The vast majority of these bounding boxes will be easy negatives, i.e. bounding boxes that can easily be classified as not containing an object. This imbalance makes training inefficient and can create models that do not work as intended. To address this, Lin et al. introduced a loss function called Focal Loss [24] and the single-stage object detector RetinaNet that utilizes Focal Loss.

Focal Loss

Focal Loss stems from the cross entropy (CE) loss for binary classification. The cross entropy is described in equation 2.3.2, where p is the predicted probability that an observation belongs to a certain class and y ∈ {0, 1} is the ground truth, which is 1 if the observation belongs to the class and 0 otherwise.

CE(p, y) = −(y log(p) + (1 − y) log(1 − p))        (2.3.2)

Figure 2.3.6: SSD predictions from multiple feature maps.

By defining p_t as in equation 2.3.3, equation 2.3.2 can be rewritten as CE(p, y) = CE(p_t) = −log(p_t).

p_t = { p        if y = 1
      { 1 − p    if y = 0        (2.3.3)

In order to down-weight the effect easy negatives have on the training procedure, a modulating factor (1 − p_t)^γ is introduced. Easy negatives will have p_t ≈ 1 and thus (1 − p_t)^γ ≈ 0. The focusing parameter γ is used to tune the down-weighting effect of the modulating factor; increasing γ reduces the impact easy negatives have during training.

In focal loss, a weighting factor α_t ∈ [0, 1] is also used to balance the importance of negative and positive samples; equation 2.3.4 describes how α_t is defined. Positive samples are samples containing an object and negative samples are samples that do not contain objects.

α_t = { α        for positive samples
      { 1 − α    for negative samples        (2.3.4)

Focal loss combines both the modulating factor and the weighting factor described above, which means that the focal loss function is defined as in equation 2.3.5 below.

FL(p_t) = −α_t (1 − p_t)^γ log(p_t)        (2.3.5)

RetinaNet

To make use of focal loss, Lin et al. developed an object detector called RetinaNet. RetinaNet is a single-stage detector consisting of a backbone network and two smaller subnetworks. The backbone network first generates feature maps at different scales; this is done using an FPN, which was described in section 2.3.9.

RetinaNet also utilizes anchors: at every level of the feature pyramid, each spatial position has 9 anchors. A classification subnetwork then predicts object existence probabilities for each class in each anchor. This classification subnetwork uses focal loss as its loss function. Parallel to the classification subnetwork is a bounding box regression subnetwork that produces bounding box offsets for each anchor. The bounding box regression network is wholly separated from the classification network and does not make use of class probabilities when predicting bounding box offsets.
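To make the loss in equation 2.3.5 concrete, the sketch below implements the binary focal loss; the default values α = 0.25 and γ = 2 follow [24], but the function is an illustration rather than the exact implementation used in RetinaNet.

import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-7):
    """p: predicted probability of the positive class, y: ground truth label (0 or 1)."""
    p_t     = np.where(y == 1, p, 1.0 - p)           # equation 2.3.3
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)   # equation 2.3.4
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t + eps)

# An easy negative (p close to 0 for a background sample) contributes almost nothing,
# while a confidently wrong sample keeps a large loss
print(focal_loss(np.array([0.05, 0.9]), np.array([0, 0])))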

2.4 Tracking Algorithms

The following section presents the theory behind the two tracking algorithms considered in this thesis: SORT and Deep SORT. As described in the limitations, these two are especially suited to answering the second research question without broadening the scope too much.

2.4.1 SORT

Simple Online and Realtime Tracking, SORT, is a tracking algorithm that was introduced by Bewley et al. in [4]. SORT is designed to perform multiple object tracking (MOT) in a tracking-by-detection system. In order to achieve real-time processing, SORT is intentionally kept simple and avoids complex and time-consuming tasks. To compensate for its lack of complexity, SORT instead relies on the more accurate object detections provided by CNN-based object detectors.

For each new frame, SORT first propagates already tracked objects into the current frame. The new positions of these already tracked objects are predicted using a Kalman filter [19] with a linear constant velocity model. Next, an object detection algorithm detects the objects present in the current frame. These detected objects are then compared to the already tracked objects and a cost matrix is created. The entries of this cost matrix are computed from the IoU between each detection and each of the already tracked objects. Detections are then assigned to already tracked objects using the Hungarian method [21]. A new track is created when an object is detected in several consecutive frames while not overlapping with any of the already tracked objects. Figure 2.4.1 below shows how predicted positions are compared to object detections to assign identities in new frames. SORT, as it is used in this thesis, does not have any memory, and a tracked object is lost if SORT fails to detect it in a frame.
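The assignment step can be sketched as follows, using SciPy's linear_sum_assignment (an implementation of the Hungarian method). The cost values and the IoU threshold are made up for illustration, and the Kalman prediction step is omitted.

import numpy as np
from scipy.optimize import linear_sum_assignment

# Cost matrix of (1 - IoU) between Kalman-predicted track boxes (rows) and new
# detections (columns); the values below are made up for illustration
cost = np.array([[0.15, 0.90, 0.95],    # track 0 overlaps detection 0 strongly
                 [0.85, 0.20, 0.97]])   # track 1 overlaps detection 1 strongly

track_idx, det_idx = linear_sum_assignment(cost)   # Hungarian method
for t, d in zip(track_idx, det_idx):
    if 1.0 - cost[t, d] >= 0.3:                    # discard assignments with too little overlap
        print(f"track {t} -> detection {d} (IoU = {1.0 - cost[t, d]:.2f})")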

2.4.2 Deep SORT

Built with the intention of reducing the number of identity switches, Deep SORT incorporates appearance information into the tracking procedure presented in SORT [44]. Similar to SORT, Deep SORT handles state estimation with a Kalman filter. Deep SORT differs from SORT in that it makes use of additional techniques when assigning detections to already tracked objects.

Figure 2.4.1: Object association between frames in SORT.

Deep SORT utilizes two different distance metrics when comparing detections to already tracked objects: the Mahalanobis distance [13] and the cosine distance between appearance descriptors. The Mahalanobis distance measures how the position of a new detection differs from the predicted positions of the already tracked objects, in terms of standard deviations of the tracked objects' state distributions. This metric allows Deep SORT to avoid assigning a new detection to an already existing track where the frame-to-frame motion would be unreasonable. Appearance descriptors are computed by forwarding each bounding box through a CNN that has been pre-trained on a person re-identification dataset. The appearance descriptor of each new detection is then compared to the appearance descriptors of already tracked objects by calculating the cosine distance between descriptors. Tracked objects and their appearance descriptors are also saved for 30 frames after they are lost, so that Deep SORT has the ability to resume tracking identities that have been lost for a number of frames. Using appearance descriptors in this way gives Deep SORT the ability to find a previously tracked object even if it has been occluded for a number of frames.
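The appearance comparison can be illustrated with a small sketch of the cosine distance between two descriptors; the 128-dimensional vectors below are random stand-ins for real CNN embeddings.

import numpy as np

def cosine_distance(a, b):
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return 1.0 - float(np.dot(a, b))

track_descriptor     = np.random.rand(128)   # stored descriptor of a tracked person
detection_descriptor = np.random.rand(128)   # descriptor of a new detection

# Small distances indicate visually similar objects and make an association more likely
print(cosine_distance(track_descriptor, detection_descriptor))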

3 Method

This chapter covers the methodology used to perform the tests and evaluations for the thesis. The chapter begins with a section describing the data annotation method employed to annotate the test data. The second section covers the metrics used to evaluate performance, and the chapter then continues with an overview of the test environment, where different practical aspects of the tracking-by-detection system are explained. Finally, the last section of this chapter presents the algorithms and implementations tested in this thesis.


3.1 Data Annotation

The test data consists of video from two different surveillance cameras. One camera overlooks a platform of an underground train station; the other camera overlooks a stairway leading down to the platform of that same station. The video sequences are 10.6 and 12.8 seconds long respectively; both videos have 10 frames per second, giving a total of 234 frames. Frames in both videos have a resolution of 1280 × 720 pixels.

Ground truth annotations are made following the protocol presented in MOT16 [30], using Microsoft's application Visual Object Tagging Tool [29]. Due to the nature of the test data, only a single class, person, is annotated. There are other objects present in the videos, such as trains and bags, but these objects are not relevant in the context of this thesis. Objects are annotated if it is clear from the current frame alone that the object exists, which means that fully occluded objects are not annotated. If an object is partly occluded, its full extent is estimated and the object is annotated accordingly. Bounding boxes are always fitted as tightly as possible to the annotated object while still containing all of the object. There is a total of 2319 ground truth bounding boxes annotated over the two test videos, distributed over 36 unique identities.

Each person is given a single identity throughout the whole sequence, even if the person is occluded in parts of the sequence. The occlusions that occur are temporary, and no person disappears for more than a couple of seconds. Handling such short-term occlusions is seen as a part of the tracking problem rather than as person re-identification. Therefore, these occlusions were considered relevant for this thesis and accounted for when annotating the data. Figure 3.1.1 below shows an example of what annotated frames look like for both sequences.

Figure 3.1.1: Annotated frames for both sequences.

3.2 Evaluation

Performance is measured according to the framework presented in MOT16 [30] and in the same manner as performance is measured in the MOTChallenge¹. The authors of [30] provide publicly available code² for evaluation; this code is used to calculate the different performance metrics. MOT16 was chosen since it is a compilation of many other metrics, developed in an attempt to standardize the evaluation of multiple object tracking. MOT16 contains a wide array of metrics for the evaluation of multiple object tracking, some of which are quite similar to each other. Hence, metrics that were considered too similar to others have not been included in this thesis.

¹ https://motchallenge.net/
² https://bitbucket.org/amilan/motchallenge-devkit/overview

3.2.1 Classification of Predicted Bounding Boxes

The fundamental performance metric is the classification of bounding boxes. Table 3.2.1 below shows the different classes that a bounding box can be assigned. A predicted bounding box is considered a true positive (TP) if its intersection over union (IoU), or Jaccard index [18], with a ground truth box is larger than 0.5. Equation 3.2.1 shows how the IoU between a predicted box P and a ground truth box G is calculated. False positives (FP) are predicted bounding boxes without a corresponding ground truth box, and false negatives (FN) are ground truth boxes that the algorithm fails to detect. True negatives (TN) are irrelevant in this context since object detection algorithms do not produce any predictions on whether objects are absent. For all performance metrics defined in subsequent sections, TP, FP, and FN are used as shorthand notations for the number of bounding boxes that have been labeled as belonging to each class. GT is also used as a notation for the total number of ground truth bounding boxes that exist.

IoU(P, G) = |P ∩ G| / |P ∪ G| = |P ∩ G| / (|P| + |G| − |P ∩ G|)        (3.2.1)

                           Prediction: Object (positive)      Prediction: Background (negative)
Ground truth: Object       TP (True Positive):                FN (False Negative):
(positive)                 correctly labeled as object        incorrectly labeled as background
Ground truth: Background   FP (False Positive):               TN (True Negative):
(negative)                 incorrectly labeled as object      correctly labeled as background

Table 3.2.1: Classification of bounding boxes.
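The sketch below illustrates how predictions can be labeled as TP or FP against ground truth boxes using the IoU criterion from equation 3.2.1 with a 0.5 threshold. Boxes are given as (x1, y1, x2, y2); the greedy one-to-one matching and the box values are simplifications made for the example and do not reproduce the MOT16 evaluation code.

def iou(p, g):
    ix1, iy1 = max(p[0], g[0]), max(p[1], g[1])
    ix2, iy2 = min(p[2], g[2]), min(p[3], g[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (p[2] - p[0]) * (p[3] - p[1]) + (g[2] - g[0]) * (g[3] - g[1]) - inter
    return inter / union

def classify_predictions(predictions, ground_truths, threshold=0.5):
    """Label each prediction as TP or FP and count unmatched ground truths as FN."""
    unmatched_gt = list(ground_truths)
    tp, fp = 0, 0
    for pred in predictions:
        match = next((g for g in unmatched_gt if iou(pred, g) > threshold), None)
        if match is not None:
            tp += 1
            unmatched_gt.remove(match)   # each ground truth box can be matched only once
        else:
            fp += 1
    return tp, fp, len(unmatched_gt)     # (TP, FP, FN)

print(classify_predictions([(0, 0, 10, 10), (50, 50, 60, 60)], [(1, 1, 11, 11)]))  # (1, 1, 0)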

3.2.2 Object Detection Evaluation

The MOT16 framework includes several metrics that can be used to evaluate object detection algorithms. Below are the metrics that this thesis uses to measure performance:

• Recall [12]: Recall is measured as the ratio between correctly detected objects and the total number of ground truth objects. Thus, the algorithm's recall reflects its ability to find ground truth objects.

  Recall = TP / (TP + FN) · 100        (3.2.2)

• Precision (Prcn) [12]: Precision describes the accuracy of the predicted bounding boxes. It is calculated as the ratio between correctly predicted bounding boxes and the total number of predicted bounding boxes.

  Precision = TP / (TP + FP) · 100        (3.2.3)

• F1 score (F1) [12]: The F1 score combines recall and precision into a single score by calculating the harmonic mean of the two.

  F1 = 2TP / (2TP + FP + FN) · 100        (3.2.4)

• Average Precision (AP) [28]: Average precision is calculated as the area under the precision-recall curve. This precision-recall curve is created by first sorting all predictions in descending order according to their confidence. Starting with the most confident prediction, precision can then be plotted against recall by iteratively calculating cumulative precision and recall at different ranks in the now ordered set of predictions. Figure 3.2.1 and table 3.2.2 show an example of a precision-recall curve for a case with a total of 3 ground truth objects and 5 predicted objects; AP is calculated as the area under the curve (a short code sketch reproducing this example is given at the end of this list).

Rank   Conf    Label   Precision   Recall
1      0.987   TP      1.0         0.33
2      0.934   FP      0.5         0.33
3      0.887   TP      0.67        0.67
4      0.764   FP      0.5         0.67
5      0.564   TP      0.6         1.0

Table 3.2.2: Example predictions

Figure 3.2.1: Precision-recall curve for the example predictions in table 3.2.2 (precision on the vertical axis, recall on the horizontal axis).

• Multiple Object Detection Precision (MODP) [39]: A metric that measures the overlap between predicted bounding boxes and ground truth data. MODP is calculated as the IoU between predicted bounding boxes and ground truth bounding boxes.

  MODP = ( Σ_{k ∈ frames} Σ_{i ∈ objects} IoU(P_i^k, G_i^k) / GT ) · 100        (3.2.5)

• Multiple Object Detection Accuracy (MODA) [39]: MODA measures the accuracy of predictions by looking at missed ground truth boxes and false positives.

  MODA = (1 − (FN + FP) / GT) · 100        (3.2.6)

• Frames Per Second (FPS): FPS is a metric for comparing the speed of object detection algorithms. It is calculated as the ratio between the number of frames processed and the time it takes to run the algorithm.

  FPS = #frames / total runtime        (3.2.7)
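As indicated above, the worked example from table 3.2.2 can be reproduced with a short script. The sketch below computes cumulative precision and recall over the ranked predictions and approximates AP as the area under the sampled curve with the trapezoidal rule; it is an illustration only, the exact AP definition varies between benchmarks, and the MOT16 toolkit is used for the actual evaluation.

import numpy as np

labels   = ["TP", "FP", "TP", "FP", "TP"]   # predictions already sorted by confidence
total_gt = 3                                # number of ground truth objects

tp_cum = np.cumsum([1 if label == "TP" else 0 for label in labels])
fp_cum = np.cumsum([1 if label == "FP" else 0 for label in labels])

precision = tp_cum / (tp_cum + fp_cum)      # cumulative precision at each rank
recall    = tp_cum / total_gt               # cumulative recall at each rank

tp, fp = tp_cum[-1], fp_cum[-1]
fn = total_gt - tp
print("Recall   :", 100 * tp / (tp + fn))               # equation 3.2.2
print("Precision:", 100 * tp / (tp + fp))               # equation 3.2.3
print("F1       :", 100 * 2 * tp / (2 * tp + fp + fn))  # equation 3.2.4
print("AP (approx.):", np.trapz(precision, recall))     # area under the sampled curve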

3.2.3 Object Tracking Evaluation

Tracking is also evaluated in accordance with the MOT16 guidelines. The list below describes the tracking-specific metrics that this thesis considers. Many object detection metrics can also be used to evaluate tracking performance; those metrics are assumed to have the same definition as described in section 3.2.2 unless stated otherwise.

• Identification Recall (IDR) and Identification Precision (IDP) [36]: IDR and IDP are similar to the metrics Recall and Precision for object detection. The metrics will however differ, since objects are considered tracked only if they can be assigned an identity, which will not be the case for all detected objects. Another difference is that inconsistencies in identity assignments will lower the IDTP score. For each ground truth identity, the predicted identity most similar to it is found. Any other identity assigned to that ground truth identity is then considered a mismatch (IDFP) and will be counted as a false positive instead of a true positive.

  IDR = IDTP / (IDTP + IDFN) · 100        (3.2.8)
  IDP = IDTP / (IDTP + IDFP) · 100        (3.2.9)

• IDF1 score (IDF1) [36]: Similar to the F1 score for object detection, IDF1 combines both IDR and IDP into a single score to facilitate comparisons of different trackers.

  IDF1 = 2·IDTP / (2·IDTP + IDFP + IDFN) · 100        (3.2.10)

• Mostly Tracked (MT) [30]: The number of ground truth identities that are tracked for 80% or more of their existence.

• Partly Tracked (PT) [30]: The number of ground truth identities that are tracked between 20% and 80% of their existence.

• Mostly Lost (ML) [30]: The number of ground truth identities that are tracked for less than 20% of their existence.

• Identity Switches (IDs) [30]: The number of identity switches. An identity switch is counted every time an already tracked ground truth identity is assigned a new tracking identity.

• Track Fragmentations (FM) [30]: The number of track fragmentations. A track fragmentation is counted every time a tracked ground truth identity is lost and then found again in a later frame.

• Multiple Object Tracking Accuracy (MOTA) [3]: MOTA combines false negatives, false positives and identity switches into a single score in order to express overall performance with a single value.

  MOTA = (1 − (FN + FP + IDs) / GT) · 100        (3.2.11)

• Multiple Object Tracking Precision (MOTP) [3]: MOTP measures how well correctly predicted bounding boxes (TP_i) fit their respective ground truth boxes (GT_i). This is done by calculating the average overlap between true positives and their corresponding ground truth objects.

  MOTP = ( Σ_i IoU(TP_i, GT_i) / GT ) · 100        (3.2.12)

3.3 Test Environment

A tracking-by-detection system developed in a previous project at NFC, as a part of the course Image and Graphics, Project Course CDIO, is used as the test environment for all tests in this thesis. The system is able to perform tracking-by-detection, person re-identification and object instance segmentation, though only the tracking-by-detection functionality is relevant for this thesis.

As described in section 2.1, the tracking-by-detection module consists of two parts: an object detection algorithm and a tracking algorithm. The object detection algorithm takes a video sequence as input and outputs a CSV file that describes the objects detected in each frame. The CSV file is constructed following the MOT16 guidelines so that the results can be evaluated using the MOT16 protocol; table 3.3.1 shows examples of values that rows in the CSV file can contain. The property id is always set to -1, since object detection algorithms do not assign identities to detected objects. xmin, ymin, width, and height define the bounding box of the detected object, and the confidence with which the detection is made is described by the property conf. x, y, and z are values used to evaluate 3-dimensional object detection within the MOT16 framework; they are not relevant for this thesis and are always set to -1. Only objects of the class person are considered, since that is the only class annotated in the test data.

frame   id   xmin     ymin     width   height   conf    x    y    z
1       -1   699.66   174.56   88.64   253.76   0.978   -1   -1   -1
1       -1   587.2    65.93    49.26   122.81   0.672   -1   -1   -1
2       -1   704.04   175.93   88.33   256.92   0.956   -1   -1   -1

Table 3.3.1: Examples of rows in the CSV file output from the object detection algorithm.

The tracking algorithm then takes as input both the CSV file from the object detection algorithm and the video sequence. It performs multiple object tracking and produces its own CSV file as output. This CSV file has the same format as the CSV file produced by the object detection algorithm, except that each object has now been assigned an identity. Table 3.3.2 shows an example of a few rows of this CSV file; x, y, and z are, as before, static.

frame   id   xmin     ymin     width   height   conf    x    y    z
1       1    699.66   174.56   88.64   253.76   0.978   -1   -1   -1
1       2    587.2    65.93    49.26   122.81   0.672   -1   -1   -1
2       1    704.04   175.93   88.33   256.92   0.956   -1   -1   -1

Table 3.3.2: Examples of rows in the CSV file output from the tracking algorithm.
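Writing detections in this format is straightforward; the sketch below is an illustration with made-up values, following the column order of tables 3.3.1 and 3.3.2.

import csv

detections = [
    # (frame, xmin, ymin, width, height, conf)
    (1, 699.66, 174.56, 88.64, 253.76, 0.978),
    (1, 587.20,  65.93, 49.26, 122.81, 0.672),
]

with open("detections.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for frame, xmin, ymin, width, height, conf in detections:
        # id, x, y and z are set to -1, as described above
        writer.writerow([frame, -1, xmin, ymin, width, height, conf, -1, -1, -1])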

The system in place at NFC currently uses YOLOv3 [33] as the object detection algorithm and Deep SORT [44] as the tracking algorithm. All tests for this thesis are done by replacing either the object detection algorithm or the tracking algorithm; this is possible since the object detection algorithm and the tracking algorithm are completely separated from each other. The CSV files that the system outputs are then used to measure the performance of different object detection and tracking algorithms. Figure 3.3.1 shows the main parts of the tracking-by-detection system and how each part outputs a CSV file.

Figure 3.3.1: Scheme of the tracking-by-detection pipeline with its outputs (video sequence → object detection algorithm → CSV; CSV and video sequence → tracking algorithm → CSV).

3.3.1 Testing Object Detection Algorithms

Object detection algorithms are evaluated in two different ways: as stand-alone object detection algorithms and on how they perform in a tracking-by-detection system. Therefore, performance metrics for object detection and object tracking are both used to evaluate object detection algorithms. An algorithm's performance in object detection is likely highly correlated with how it performs in a tracking-by-detection system. It is however possible that certain characteristics of an object detection algorithm interact well with a specific type of tracker, which is why object detection algorithms are also evaluated in the tracking-by-detection system.

3.3.2 Testing Object Tracking Algorithms

One of the objectives of this thesis is to investigate how the use of visual descriptors in the tracking algorithm affects the performance of a tracking-by-detection system. There is a myriad of different tracking algorithms available, some of which use visual descriptors and some of which do not. This thesis is however mainly concerned with object detection algorithms, and for that reason only two different tracking algorithms are tested: SORT and Deep SORT. These two are especially suited to studying how the use of visual descriptors affects performance, since Deep SORT is an extension of SORT in which the usage of visual descriptors has been incorporated. Both algorithms are evaluated using the performance metrics described in section 3.2. The tracking algorithms are also tested with ground truth object detections as input. The reason for doing this is that it gives an insight into how much error the tracking algorithm introduces and thus provides an upper limit for how much a better object detection algorithm can improve the performance of a tracking-by-detection system.

3.3.3 Hardware

All tests are performed on the same machine in order to be able to make valid speed comparisons of different algorithms. The computer's CPU is an Intel Xeon Silver 4108³ processor with a clock speed of 1.8 GHz. NVIDIA's Quadro P4000⁴ is used as the GPU of the computer.

³ https://ark.intel.com/content/www/us/en/ark/products/123544/intel-xeon-silver-4108-processor-11m-cache-1-80-ghz
⁴ https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/productspage/quadro/quadro-desktop/quadro-pascal-p4000-data-sheet-us-nvidia-704358-r2-web

3.4 Algorithms and Implementations

Implementations of tested algorithms should preferably come from the authors who initially presented the algorithm, so that the implementation stays as true to the cited paper as possible. When possible, such implementations are used for the tests in this thesis. It is however not always feasible to integrate those implementations into the test environment, and therefore some non-original implementations have also been tested.

The system in place at NFC uses pre-trained weights for both YOLOv3 and Deep SORT; these weights are supplied with the implementations. Similarly, this thesis utilizes the pre-trained weights supplied with the implementations of the different object detection algorithms and of Deep SORT. To make comparisons fair, only pre-trained weights trained on the Microsoft COCO [22] dataset are used. Microsoft COCO was chosen since it is an extensive dataset that is often used as a benchmark when comparing algorithms [33][35][16]. Part of the reason why only pre-trained weights are used is the lack of annotated data. The small amount of data that was annotated in this thesis was deemed to be more useful as test data than as training data.

A short script is created for each object detection algorithm in order to integrate it into the tracking-by-detection system. The script feeds the test videos into the object detection algorithm and then converts its output to a CSV file in the format specified in section 3.3. The different scripts have a similar overall structure but differ in the details, since they have to be tailored to fit each implementation.
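As a rough illustration of what such a script can look like, the sketch below feeds a video through a detector and writes one CSV row per detection. The run_detector callable, the use of OpenCV for decoding, and the exact column layout (frame number, a dummy identity, bounding box, confidence) are assumptions made for the example, not the precise format defined in section 3.3.

```python
import csv

import cv2  # assumed available for decoding the test videos


def detections_to_csv(video_path, csv_path, run_detector):
    """Run a detector on every frame of a video and dump its output as CSV.

    `run_detector(frame)` is a placeholder for the implementation-specific call
    that returns (x, y, width, height, confidence) tuples for one frame.
    """
    capture = cv2.VideoCapture(video_path)
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        frame_idx = 0
        while True:
            ok, frame = capture.read()
            if not ok:
                break
            frame_idx += 1
            for (x, y, w, h, conf) in run_detector(frame):
                # -1 as a dummy identity: detections have no ID before tracking.
                writer.writerow([frame_idx, -1, x, y, w, h, conf])
    capture.release()
```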

3.4.1 Deep Learning Libraries

Implementations of different algorithms are built using software libraries for deep learning. The choice of deep learning library could affect both the speed and the performance of an algorithm and is therefore an important aspect when comparing different implementations. The deep learning libraries used for the algorithms tested in this thesis are:

– TensorFlow [1]: A library developed by Google that includes functionality that can be used to create deep learning algorithms such as CNNs.

– PyTorch [31]: Deep learning library developed by Facebook’s AI research group.

³ https://ark.intel.com/content/www/us/en/ark/products/123544/intel-xeon-silver-4108-processor-11m-cache-1-80-ghz
⁴ https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/productspage/quadro/quadro-desktop/quadro-pascal-p4000-data-sheet-us-nvidia-704358-r2-web

– Caffe2⁵: A deep learning framework that has recently been integrated into PyTorch.

– Keras [6]: A high-level deep learning API that can run on top of other li- braries such as TensorFlow.

3.4.2 Implementations

The following list describes the different object detection algorithms and implementations that are evaluated in this thesis:

– Facebook's Detectron⁶ [11]: Detectron is an object detection library developed by Facebook AI Research that includes implementations of several object detection algorithms with different backbones. The library is written in Python, and all tests for this thesis are done with the Caffe2 framework built into PyTorch 1.0. Detectron's implementations of Faster R-CNN, Mask R-CNN and RetinaNet are tested in this thesis. All three algorithms are tested with both ResNet-50 and ResNet-101 as backbone.

– Matterport's Mask R-CNN⁷ [2]: A Keras implementation of Mask R-CNN that runs on TensorFlow 1.13.1 and Keras 2.2.4. The implementation uses a slightly lower learning rate than the 0.02 used in the original paper, and it zero-pads images to resolution 1024 × 1024 instead of dynamically resizing the image as in the original paper [16]. The implementation is written in Python and uses ResNet-101 as backbone.

– Fizyr's RetinaNet⁸: A Keras implementation of RetinaNet running on TensorFlow 1.13.1 and Keras 2.2.4, where ResNet-50 is used as backbone.

– Ayoosh Kathuria's YOLOv3⁹: A PyTorch implementation of YOLOv3 with Darknet-53 as backbone; tests are performed using PyTorch 1.0.

– Pierluigi Ferrari's Single Shot Detector (SSD)¹⁰: A Keras implementation running on top of TensorFlow 1.13.1 with Keras 2.2.4; it uses VGG-16 as backbone for SSD.

Table 3.4.1 shows the configurations of the different object detection algorithms tested as part of this thesis. Some algorithms are tested with multiple image resolutions; when an algorithm is tested with different resolutions it is denoted algorithm:resolution. YOLOv3 tested with images of size 320 × 320 pixels is, for example, called YOLOv3:320.

⁵ https://caffe2.ai/
⁶ https://github.com/facebookresearch/Detectron
⁷ https://github.com/matterport/Mask_RCNN
⁸ https://github.com/fizyr/keras-retinanet
⁹ https://github.com/ayooshkathuria/pytorch-yolo-v3
¹⁰ https://github.com/pierluigiferrari/ssd_keras

Algorithm      Implementation     Backbone
SSD:300        Pierluigi Ferrari  VGG-16
SSD:512        Pierluigi Ferrari  VGG-16
YOLOv3:320     Ayoosh Kathuria    Darknet-53
YOLOv3:416     Ayoosh Kathuria    Darknet-53
YOLOv3:512     Ayoosh Kathuria    Darknet-53
RetinaNet      Fizyr              ResNet-50
Mask R-CNN     Matterport         ResNet-101
Faster R-CNN   Detectron          ResNet-50
Faster R-CNN   Detectron          ResNet-101
Mask R-CNN     Detectron          ResNet-50
Mask R-CNN     Detectron          ResNet-101
RetinaNet      Detectron          ResNet-50
RetinaNet      Detectron          ResNet-101

Table 3.4.1: Table showing the different object detection algorithms tested.

Two different implementations of tracking algorithms are used in this thesis:

– Alex Bewley's SORT¹¹ [4]: A Python implementation of SORT written by the author of the SORT paper [4].

– Nicolai Wojke's Deep SORT¹² [44]: A Python implementation of Deep SORT written by one of the authors of the original paper [44].

When Deep SORT is used in testing, an object has to be detected in two consecutive frames before it is given an identity. Also, if an identity leaves the sequence, Deep SORT saves its position and appearance for 30 frames before it is disregarded. SORT does not have this functionality; it gives detected objects an identity directly and disregards them as soon as they are lost. To remedy this inconsistency, the first two frames in which a previously unseen object is visible are removed from the ground truth CSV file. This ensures that SORT and Deep SORT have the possibility to track the same number of ground truth objects.
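A minimal sketch of this ground-truth preprocessing is shown below. It assumes a MOT-style CSV where the first column is the frame number and the second column is the object identity; the actual file layout used by the system may differ.

```python
import csv
from collections import defaultdict


def drop_first_two_frames(gt_in_path, gt_out_path):
    """Remove the first two frames of every identity from a ground-truth CSV."""
    with open(gt_in_path) as f:
        rows = [row for row in csv.reader(f) if row]

    # Collect the frames in which each identity appears.
    frames_per_id = defaultdict(list)
    for row in rows:
        frames_per_id[row[1]].append(int(row[0]))

    # The two earliest frames of each identity are the ones to discard.
    skip = {obj_id: set(sorted(frames)[:2]) for obj_id, frames in frames_per_id.items()}

    with open(gt_out_path, "w", newline="") as f:
        writer = csv.writer(f)
        for row in rows:
            if int(row[0]) not in skip[row[1]]:
                writer.writerow(row)
```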

¹¹ https://github.com/abewley/sort
¹² https://github.com/nwojke/deep_sort

4 Results

The following chapter will present results for the different object detection and tracking algorithms tested in this thesis. All testing is done on the two video sequences annotated as described in section 3.1, and performance is evaluated using the metrics presented in section 3.2. In order to make the plots easier to read, each configuration of an object detection algorithm is given a unique color. This means that every meta-architecture is first given a general color; YOLOv3 is, for example, yellow. The brightness of the color is then used to indicate either the complexity of the backbone or the image resolution, with brighter colors denoting a less complex backbone or a lower image resolution. Both RetinaNet and Mask R-CNN are tested with two different implementations. To avoid confusion, Detectron's implementations are given a black edge in all two-dimensional plots so that it is easier to distinguish the different implementations from each other.


4.1 Object Detection Results

This section presents results for the different object detection algorithms considered in this thesis. Figure 4.1.1 first shows the average precision achieved by the different object detection algorithms. Average precision is a common metric for comparing object detection algorithms and is therefore first displayed in a sorted graph to give a general overview of how the algorithms compare to each other [33][24][16].

Figure 4.1.3 plots the average precision against the number of frames the algorithm can process per second. This plot is interesting since processing time is a limiting factor for how useful an algorithm is in a surveillance system. Precision and recall are then plotted against each other in figure 4.1.4. The balance between precision and recall is often a design choice and demonstrates central characteristics of an algorithm. Next, figure 4.1.2 displays the F1 score of the different algorithms; this metric also expresses the overall performance of the algorithms.

Last, full results for object detection evaluation using the MOT16 [30] framework are presented in table 4.1.1. This table includes many of the metrics that can be calculated with the publicly available evaluation code¹ for MOT16. The best score for each metric is written in bold font in order to make comparisons easier. The red bars in the cells are used to make it easier to compare algorithms; a larger bar indicates a better score.

Figure 4.1.1: Average precision for different object detection algorithms.

¹ https://bitbucket.org/amilan/motchallenge-devkit/overview

Figure 4.1.2: F1 score for different object detection algorithms.

Figure 4.1.3: Average precision and frames per second.

Figure 4.1.4: Precision and recall plot.

Object Detection Algorithm               AP      Recall  Prcn  F1    TP    FP   FN    MODA  MODP  FPS
SSD:300 (VGG-16)                         0.2706  24.2    97.7  38.8  561   13   1758  23.6  76.7  5.52
SSD:512 (VGG-16)                         0.3578  31.5    94.4  47.2  730   43   1589  29.6  78.3  4.73
YOLOv3:320 (Darknet-53)                  0.6182  63.2    81.2  71.0  1465  339  854   48.6  75.6  22.81
YOLOv3:416 (Darknet-53)                  0.7141  72.1    86.8  78.7  1671  251  648   61.2  78.7  20.82
YOLOv3:512 (Darknet-53)                  0.7185  78.3    89.6  83.5  1815  211  504   69.2  78.7  17.2
RetinaNet (ResNet-50)                    0.6321  66.1    95.2  78.0  1534  78   785   62.8  82.1  4.02
Mask R-CNN (ResNet-101)                  0.7118  79.3    73.6  76.3  1838  658  481   50.9  78.5  1.84
Detectron’s Faster R-CNN (ResNet-50)     0.7881  81.1    77.3  79.2  1881  552  438   57.3  78.5  5.43
Detectron’s Faster R-CNN (ResNet-101)    0.7919  81.2    80.3  80.8  1884  462  435   61.3  79.7  4.32
Detectron’s Mask R-CNN (ResNet-50)       0.7970  82.2    77.5  79.8  1907  555  412   58.3  79.1  2.95
Detectron’s Mask R-CNN (ResNet-101)      0.7936  82.8    80.0  81.3  1919  481  400   62.0  79.8  2.62
Detectron’s RetinaNet (ResNet-50)        0.6327  68.8    95.1  79.8  1595  83   724   65.2  79.8  4.69
Detectron’s RetinaNet (ResNet-101)       0.6315  69.5    94.4  80.1  1612  96   707   65.4  80.7  3.91

Table 4.1.1: Object detection results following the MOT16 protocol.

4.2 Object Tracking Results

Tracking results for SORT and Deep SORT with different object detection algorithms are presented in this section. As described in section 3.3, the tests are performed in a tracking-by-detection system where tracking and detection algorithms are completely separated, which makes it possible to test different combinations of tracking and detection algorithms. Further, ground truth detections are also tested with SORT and Deep SORT in order to give an insight into how much of the error is due to the object detection algorithm and how much of it is due to the tracking algorithm. Tracking performance with ground truth detections should only contain errors introduced by the tracking algorithms and can thereby give an upper limit for how much better the tracking-by-detection system can become by changing object detection algorithm. This upper limit is represented by a brown dashed line in all plots in this section.

Figures 4.2.1 and 4.2.2 first show the IDF1 score for the different object detection algorithms with SORT and Deep SORT respectively. MOTA results are then displayed in figures 4.2.3 and 4.2.4. These plots aim to display the overall performance of each object detector and tracking algorithm. Next, figures 4.2.5 and 4.2.6 plot IDR against IDP for SORT and Deep SORT. This shows how the tracking algorithms balance accuracy and precision with different object detection algorithms. Finally, the full results with many of the metrics obtained using the MOT16 evaluation code² are presented in tables 4.2.1 and 4.2.2.

² https://bitbucket.org/amilan/motchallenge-devkit/overview

Figure 4.2.1: IDF1 score for object detection algorithms with SORT.

Figure 4.2.2: IDF1 score for object detection algorithms with Deep SORT.

Figure 4.2.3: MOTA score for object detection algorithms with SORT.

Figure 4.2.4: MOTA score for object detection algorithms with Deep SORT.

Figure 4.2.5: IDR and IDP score for object detection algorithms with SORT.

Figure 4.2.6: IDR and IDP score for object detection algorithms with Deep SORT.

Object Detection Algorithm               IDF1  IDP   IDR   MT  PT  ML  IDs  FM  MOTA  MOTP
SSD:300 (VGG-16)                         18.7  73.6  10.7  0   11  25  25   41  12.8  77.7
SSD:512 (VGG-16)                         27.2  72.1  16.7  0   16  20  28   44  20.0  78.7
YOLOv3:320 (Darknet-53)                  53.4  67.6  44.2  10  21  5   41   48  48.0  76.5
YOLOv3:416 (Darknet-53)                  60.3  72.1  51.8  12  19  5   36   45  60.2  79.7
YOLOv3:512 (Darknet-53)                  65.2  74.4  58.0  15  18  3   34   49  66.3  79.6
RetinaNet (ResNet-50)                    61.4  81.3  49.3  10  19  7   30   39  56.4  82.2
Mask R-CNN (ResNet-101)                  64.7  69.4  60.5  15  17  4   35   68  55.6  78.7
Detectron’s Faster R-CNN (ResNet-50)     67.1  72.5  62.5  18  15  3   42   53  65.6  78.3
Detectron’s Faster R-CNN (ResNet-101)    64.6  70.6  59.5  14  20  2   38   55  64.6  79.5
Detectron’s Mask R-CNN (ResNet-50)       68.2  73.7  63.4  16  16  4   39   47  65.1  79.5
Detectron’s Mask R-CNN (ResNet-101)      63.9  69.0  59.5  17  14  5   44   60  65.5  79.8
Detectron’s RetinaNet (ResNet-50)        63.9  82.7  52.0  9   20  7   33   46  59.0  80.2
Detectron’s RetinaNet (ResNet-101)       63.6  81.9  51.9  11  19  6   26   42  58.6  80.9
Ground truth detections                  90.6  91.0  90.2  35  1   0   8    8   97.5  93.3

Table 4.2.1: Results for tracking with SORT.

Object Detection Algorithm               IDF1  IDP   IDR   MT  PT  ML  IDs  FM  MOTA  MOTP
SSD:300 (VGG-16)                         34.4  88.2  21.4  2   14  20  11   45  23.2  75.7
SSD:512 (VGG-16)                         38.9  78.1  25.9  1   19  16  12   50  28.2  77.6
YOLOv3:320 (Darknet-53)                  61.5  72.2  53.5  11  21  4   31   54  48.1  75.0
YOLOv3:416 (Darknet-53)                  64.0  72.1  57.6  14  18  4   33   43  60.3  78.1
YOLOv3:512 (Darknet-53)                  74.5  81.9  68.4  17  15  4   25   40  68.6  78.0
RetinaNet (ResNet-50)                    71.2  88.7  59.5  11  19  6   21   33  61.1  80.7
Mask R-CNN (ResNet-101)                  70.0  69.1  71.0  17  16  3   43   71  53.6  77.1
Detectron’s Faster R-CNN (ResNet-50)     67.2  68.1  66.3  17  16  3   44   48  60.1  76.8
Detectron’s Faster R-CNN (ResNet-101)    75.0  76.4  73.7  18  15  3   27   54  62.5  78.6
Detectron’s Mask R-CNN (ResNet-50)       69.0  69.1  68.9  20  13  3   39   57  59.0  78.0
Detectron’s Mask R-CNN (ResNet-101)      73.1  73.6  72.6  20  13  3   17   53  62.5  78.7
Detectron’s RetinaNet (ResNet-50)        71.5  87.7  60.4  11  21  4   20   41  62.8  78.7
Detectron’s RetinaNet (ResNet-101)       70.4  84.4  60.4  13  19  4   20   43  63.2  79.6
Ground truth detections                  91.3  91.7  90.9  35  1   0   5    8   96.3  91.5

Table 4.2.2: Results for tracking with Deep SORT.

5 Discussion

This chapter contains a discussion on the results presented in the previous chapter. As in chapter 4, this chapter first considers object detection performance and then performance of the whole tracking-by-detection system.

5.1 Object Detection

From figure 4.1.1 it is evident that, on this dataset, two-stage detectors generally outperform single-stage detectors in terms of average precision. YOLOv3:512 does however achieve comparable average precision and F1 score, which shows that single-stage detectors are capable of performing on the same level as two-stage detectors. Further, both a more complex backbone and a higher resolution on the input images yield higher precision, with higher image resolution having the slightly larger effect. These results are largely in agreement with observations presented in the introductory papers for the different algorithms [16][33][24]; for Detectron's implementations they are also similar to results presented in their Model Zoo¹ [11].

Another takeaway is that different implementations of the same algorithm are not always equal in performance. This is most obvious for Mask R-CNN, where Matterport's implementation [2] does not perform as well as Detectron's implementation [11]. The reason for this could be that Matterport's implementation, as described in section 3.4, differs slightly from the original implementation in terms of learning rate and image resizing. It could also be that implementations using different deep learning libraries cause some difference in performance, since Matterport's implementation is written in Keras with TensorFlow as backend whereas Detectron is written in Caffe2. All in all, most algorithms achieve roughly the same performance except SSD, which has a significantly lower AP than the rest. This is somewhat expected since SSD, being introduced in 2015, is a few years older than the rest.

¹ https://github.com/facebookresearch/Detectron/blob/master/MODEL_ZOO.md

An unexpected result is that RetinaNet has slightly higher precision when using ResNet-50 rather than the more complex ResNet-101. The difference in precision is however very small, and it is important to remember that the small dataset on which tests are performed will not represent general performance as well as tests on larger datasets such as COCO [22] and Pascal VOC [8]. It could therefore be that RetinaNet happens to work better with ResNet-50 on this small dataset while RetinaNet with ResNet-101 is still better at detecting objects in the general case.

Figure 4.1.3 shows that YOLOv3 is the fastest algorithm by far, while still achieving average precision comparable to that of the two-stage detectors and RetinaNet. This is not wholly unexpected since YOLOv3's speed is its defining characteristic, and it is also in line with results presented in the introduction of YOLOv3 [33]. Detectron's implementations of Mask R-CNN and RetinaNet are both faster than Matterport's Mask R-CNN and Fizyr's RetinaNet. It is difficult to pinpoint an exact cause for this, but it probably has something to do with which deep learning library each implementation utilizes. Both Fizyr's and Matterport's implementations use Keras with TensorFlow as backend while Detectron uses Caffe2 for all its implementations.

The precision and recall plot in figure 4.1.4 shows that Faster R-CNN and Mask R-CNN, compared to other algorithms, favor recall at the cost of precision. RetinaNet produces conservative but precise predictions, and YOLOv3 has found a compromise between the two approaches. F1 scores could probably be improved by tuning the threshold that decides whether to keep a detection, so as to find an optimal balance between precision and recall for each algorithm. Setting the threshold to 0.5 for all algorithms does however still give valid information about how the algorithms balance precision and recall compared to each other.

From the full results presented in table 4.1.1 it is evident that metrics describing precision can be deceiving unless they are viewed as part of a wider context. The best example of this is that SSD:300 has the highest precision and the fewest false positives, even though its average precision is the lowest of all tested algorithms. The table also shows the absolute numbers of TP, FP, and FN, the fundamental quantities used to calculate most other metrics. Just as precision and recall, these numbers also highlight the general characteristics of the algorithms: RetinaNet's conservative approach yields a low number of FP but many FN, the two-stage detectors' prioritization of recall gives fewer FN but more FP, and YOLOv3's compromising approach returns results somewhere in between those of RetinaNet and the two-stage detectors. YOLOv3's balancing of recall and precision appears successful since YOLOv3:512 achieves the highest F1 score of all algorithms. Finally, MODP shows that the spatial accuracy of predicted bounding boxes is somewhat correlated with general precision, with RetinaNet having both the most accurate bounding boxes and the highest precision. It should however be noted that this correlation is not very distinct, since YOLOv3 seems to produce less accurate bounding boxes than the two-stage detectors while also having higher precision.

5.2 Object Tracking

IDF1 scores displayed in figures 4.2.1 and 4.2.2 confirm the hypothesis that an object detection algorithm's general performance is indicative of its performance in a tracking-by-detection system. This is further supported by the MOTA scores plotted in figures 4.2.3 and 4.2.4 where, again, tracking performance is highly correlated with object detection performance.

MOTA results are rather inconclusive since some object detection algorithms perform better with SORT while others perform better with Deep SORT. It seems as if an object detection algorithm's trade-off between precision and recall is, to some extent, correlated with whether the algorithm performs best with SORT or Deep SORT. Algorithms favoring precision over recall, such as RetinaNet and SSD, appear to be the ones benefiting the most from using Deep SORT instead of SORT. A possible explanation could be that a prioritization of recall leads to more false positives, i.e. bounding boxes without corresponding ground truths. Visually, these false bounding boxes are probably more similar to each other than true positives, since the background is quite similar across the whole frame. This could cause Deep SORT's appearance descriptor to match false positives to each other and thereby lower the overall performance of the tracking procedure.

Even though MOTA scores are somewhat ambiguous, IDF1 scores are conclusively in favor of Deep SORT. The reason for this discrepancy between MOTA and IDF1 results is likely that MOTA is not concerned with whether a person is given the same identity throughout the sequence, something which the IDF1 score accounts for. This reasoning is further supported by the full results in tables 4.2.1 and 4.2.2, where it is seen that Deep SORT generates fewer identity switches while also being able to track more identities for most of their lifetime. This shows that the use of appearance descriptors in Deep SORT improves general performance compared to SORT by making Deep SORT better at maintaining consistent identities throughout a sequence. These results are also in agreement with findings in the introduction of Deep SORT [44], where fewer identity switches is presented as one of the main advantages of Deep SORT over SORT.

6 Conclusions

As for the first research question, results presented in this thesis show that the stand-alone performance of an object detection algorithm employed in a tracking-by-detection system is highly correlated with the system's overall tracking performance. This correlation demonstrates the pivotal role of the object detection algorithm in a tracking-by-detection procedure. Further, it is also shown that modern single-stage detectors are able to achieve performance comparable to that of two-stage detectors. In general, single-stage detectors also seem to outdo two-stage detectors in terms of processing time, with YOLOv3 being the unquestionably fastest algorithm.

Continuing with the second research question, this thesis also showcases how the use of visual descriptors in the tracking stage can reduce the number of identity switches, which in turn increases the overall performance of the tracking-by-detection system. The increase in performance is also shown to be correlated with how the object detection algorithm balances precision and recall, with Deep SORT improving the overall system the most when object detections are conservative but precise. This observation, that the usefulness of visual descriptors depends on the object detection algorithm's characteristics, is interesting since it could be of aid when constructing future tracking-by-detection systems.

6.1 Future Work

Seeing as all tests in this thesis are performed on a small and quite specific dataset, an obvious future work is to apply the tests to larger datasets such as those in the MOT16 challenge. More data also means that it would be possible to train the object detection algorithms on data from the same domain as the test data. Another possible extension would be to test more tracking algorithms apart from


SORT and Deep SORT. Testing other tracking algorithms would give further insights into how the use of visual descriptors affects tracking-by-detection in the more general case.

6.2 Ethics

There is no doubt that multiple object tracking can be used with good intentions in areas such as autonomous driving and sports analysis. It would however be remiss not to acknowledge the more sinister purposes object tracking can be used for. While it is a powerful tool with the potential to do good in the hands of a just law enforcement agency, it is also the perfect tool for a totalitarian state wishing to enforce control over its citizens. Some would also argue that the very act of surveilling citizens with this kind of technology is an infringement of their privacy and thereby immoral in itself, regardless of intentions. Whatever our opinions about surveillance might be, we should be aware that even though our intentions with this technology are good, others' intentions might not be.

Bibliography

[1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, 2016. URL https://www.usenix.org/system/files/conference/osdi16/osdi16-abadi.pdf.

[2] W. Abdulla. Mask R-CNN for object detection and instance segmentation on Keras and TensorFlow. https://github.com/matterport/Mask_RCNN, 2017.

[3] K. Bernardin and R. Stiefelhagen. Evaluating multiple object tracking performance: The CLEAR MOT metrics. EURASIP J. Image and Video Processing, 2008, 2008.

[4] A. Bewley, Z. Ge, L. Ott, F. Ramos, and B. Upcroft. Simple online and realtime tracking. In 2016 IEEE International Conference on Image Processing (ICIP), pages 3464–3468, 2016. doi: 10.1109/ICIP.2016.7533003.

[5] K.S. Chahal and K. Dey. A survey of modern object detection literature using deep learning. CoRR, abs/1808.07256, 2018. URL http://arxiv.org/abs/1808.07256.

[6] F. Chollet et al. Keras. https://keras.io, 2015.

[7] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05) - Volume 1, pages 886–893, Washington, DC, USA, 2005. IEEE Computer Society. ISBN 0-7695-2372-2. doi: 10.1109/CVPR.2005.177. URL http://dx.doi.org/10.1109/CVPR.2005.177.

[8] M. Everingham, L. Van Gool, C. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html, 2012.


[9] R. Girshick. Fast R-CNN. CoRR, abs/1504.08083, 2015. URL http:// arxiv.org/abs/1504.08083.

[10] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. CoRR, abs/1311.2524, 2013. URL http://arxiv.org/abs/1311.2524.

[11] R. Girshick, I. Radosavovic, P. Dollár G Gkioxari, and K. He. Detectron. https://github.com/facebookresearch/detectron, 2018.

[12] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.

[13] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learn- ing. Springer Series in . Springer New York Inc., New York, NY, USA, 2001.

[14] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convo- lutional networks for visual recognition. CoRR, abs/1406.4729, 2014. URL http://arxiv.org/abs/1406.4729.

[15] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recog- nition. CoRR, abs/1512.03385, 2015. URL http://arxiv.org/abs/ 1512.03385.

[16] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. CoRR, abs/1703.06870, 2017. URL http://arxiv.org/abs/1703.06870.

[17] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, and K. Murphy. Speed/accuracy trade- offs for modern convolutional object detectors. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3296–3297, July 2017. doi: 10.1109/CVPR.2017.351.

[18] P. Jaccard. Etude de la distribution florale dans une portion des alpes et du jura. Bulletin de la Societe Vaudoise des Naturelles, 37:547–579, 01 1901. doi: 10.5169/seals-266450.

[19] R.E. Kalman. A new approach to linear filtering and prediction problems. Transactions of the ASME–Journal of Basic Engineering, 82(Series D):35–45, 1960.

[20] A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep convolutional neural networks. In Proceedings of the 25th Inter- national Conference on Neural Information Processing Systems - Volume 1, NIPS’12, pages 1097–1105, USA, 2012. Curran Associates Inc. URL http://dl.acm.org/citation.cfm?id=2999134.2999257.

[21] H. W. Kuhn and B. Yaw. The hungarian method for the assignment problem. Naval Res. Logist. Quart, pages 83–97, 1955.

[22] T-Y. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. Lawrence Zitnick. Microsoft COCO: common objects in context. CoRR, abs/1405.0312, 2014. URL http://arxiv.org/ abs/1405.0312.

[23] T-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. CoRR, abs/1612.03144, 2016. URL http://arxiv.org/abs/1612.03144.

[24] T-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. CoRR, abs/1708.02002, 2017. URL http://arxiv.org/ abs/1708.02002.

[25] W. Liu, D. Anguelov, D. Erhan, C Szegedy, S. Reed, C-Y. Fu, and A. Berg. SSD: single shot multibox detector. CoRR, abs/1512.02325, 2015. URL http: //arxiv.org/abs/1512.02325.

[26] D. Lowe. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision, 60(2):91–110, November 2004. ISSN 0920-5691. doi: 10. 1023/B:VISI.0000029664.99615.94. URL https://doi.org/10.1023/ B:VISI.0000029664.99615.94.

[27] W. Luo, X. Zhao, and T-K. Kim. Multiple object tracking: A review. CoRR, abs/1409.7618, 2014. URL http://arxiv.org/abs/1409.7618.

[28] C. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA, 2008. ISBN 0521865719, 9780521865715.

[29] Microsoft. Visual Object Tagging Tool, 2019. URL https://github.com/ Microsoft/VoTT.

[30] A. Milan, L. Leal-Taixé, I. Reid, S. Roth, and K. Schindler. MOT16: A benchmark for multi-object tracking. CoRR, abs/1603.00831, 2016. URL http://arxiv.org/abs/1603.00831.

[31] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. 2017.

[32] J. Redmon and A. Farhadi. YOLO9000: better, faster, stronger. CoRR, abs/1612.08242, 2016. URL http://arxiv.org/abs/1612.08242.

[33] J. Redmon and A. Farhadi. Yolov3: An incremental improvement. arXiv, 2018.

[34] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. CoRR, abs/1506.02640, 2015. URL http://arxiv.org/abs/1506.02640.

[35] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: towards real-time object detection with region proposal networks. CoRR, abs/1506.01497, 2015. URL http://arxiv.org/abs/1506.01497.

[36] E. Ristani, F. Solera, R. Zou, R. Cucchiara, and C. Tomasi. Performance measures and a data set for multi-target, multi-camera tracking. CoRR, abs/1609.01775, 2016. URL http://arxiv.org/abs/1609.01775.

[37] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015. doi: 10.1007/s11263-015-0816-y.

[38] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014. URL http://arxiv.org/abs/1409.1556.

[39] R. Stiefelhagen, K. Bernardin, R. Bowers, J. Garofolo, D. Mostefa, and P. Soundararajan. The CLEAR 2006 evaluation. In CLEAR, 2006.

[40] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. CoRR, abs/1409.4842, 2014. URL http://arxiv.org/abs/1409.4842.

[41] R. Szeliski. Computer Vision: Algorithms and Applications. Springer-Verlag, Berlin, Heidelberg, 1st edition, 2010. ISBN 1848829345, 9781848829343.

[42] C. Tan, F. Sun, T. Kong, W. Zhang, C. Yang, and C. Liu. A survey on deep transfer learning. CoRR, abs/1808.01974, 2018. URL http://arxiv.org/abs/1808.01974.

[43] J. Uijlings, K. Sande, T. Gevers, and A. Smeulders. Selective search for object recognition. International Journal of Computer Vision, 104:154–171, 09 2013. doi: 10.1007/s11263-013-0620-5.

[44] N. Wojke and A. Bewley. Deep cosine metric learning for person re-identification. pages 748–756, 2018. doi: 10.1109/WACV.2018.00087.

[45] X. Zhang, H. Luo, X. Fan, W. Xiang, Y. Sun, Q. Xiao, W. Jiang, C. Zhang, and J. Sun. AlignedReID: Surpassing human-level performance in person re-identification. arXiv preprint arXiv:1711.08184, 2017.

Appendix

A Technical Report of CDIO Project

This appendix consists of the technical report describing the previous project at NFC that was conducted as part of the course Images and Graphics, Project Course CDIO at Linköping University. The tracking-by-detection system used to perform tests in this thesis is also a product of this project.

Object detection and tracking in video from multiple surveillance cameras

TSBB11 CDIO, Technical report

Hanna Hamrell, [email protected] Klara Hellgren, [email protected] Denise Härnström, [email protected] Helena Kihlström, [email protected] Axel Nyström, [email protected] May 9, 2019

Contents

1 Introduction
  1.1 Problem Description
  1.2 Project Overview
  1.3 Client
  1.4 Limitations

2 Related Work
  2.1 Tracking
    2.1.1 Multi-video Tracking
  2.2 Detection
  2.3 Human Parsing

3 Theoretical Background
  3.1 Detection
  3.2 Tracking
    3.2.1 SORT
    3.2.2 SORT with Deep Association Metric
  3.3 Object Re-Identification
  3.4 Human Parsing

4 Method
  4.1 System Overview
  4.2 Detection
  4.3 Tracking
  4.4 Person Re-Identification (Multi-video Matching)
    4.4.1 Re-identification using AlignedReID
    4.4.2 Re-identification using Features from Deep SORT
  4.5 Human Parsing

5 Evaluation and Result
  5.1 Evaluation Data
  5.2 Runtime Comparison
  5.3 Detection
  5.4 Tracking
  5.5 Person Re-identification
    5.5.1 Runtime Comparison
    5.5.2 Re-identification using AlignedReID
    5.5.3 Re-identification using Features from Deep SORT
  5.6 Human Parsing

6 Discussion
  6.1 Detection
  6.2 Tracking
  6.3 Person Re-identification
  6.4 Human Parsing
  6.5 Future Work

7 Conclusions

1 Introduction

This project is about developing an analysis module for surveillance camera videos to minimize manual work when e.g. looking for persons fitting into a specific description or re-identifying a person from one video in other videos.

1.1 Problem Description

The client wanted an automatic analysis module for videos from surveillance cameras. The wishes of the client were that it should be able to detect and track people and e.g. cars or bags in surveillance camera videos. Another wish was that it should be possible to match a person detection in one video with person detections in other videos. A third wish was that all found and tracked objects should be saved in JSON format and that it should be possible to search among the objects based on a description in text form. In addition, the customer required a study of good algorithms for detection and classification as well as tracking. After approval from the customer, the requirements in table 1 were decided upon.

Within the scope of the project, all surveillance cameras are stationary and no camera movements are possible. The surveillance videos of interest are from the Stockholm metro. The positions of the cameras are known.

Table 1: Requirement table.

Requirement nr   Change    Priority   Requirement description
Nr 1             Original  1          Detect people in a video sequence.
Nr 2             Original  1          Assign a unique ID for every person within the scene.
Nr 3             Original  1          Track a detected person in a video sequence.
Nr 4             Original  1          Output movement position in pixel coordinates for every tracked person.
Nr 5             Original  1          Save detected objects in JSON format.
Nr 6             Original  1          Produce a system user manual.
Nr 7             Original  1          Document the time it takes to run the system.
Nr 8             Original  2          Visualize how a person has moved through a video sequence.
Nr 9             Original  2          For a person with a specific object ID in one video sequence, identify persons in other video sequences that are probable to be the same person.
Nr 10            Original  2          Provide the customer with the results of a literature survey of existing algorithms for detection and classification.
Nr 11            Original  2          Provide the customer with the results of a literature survey of existing algorithms for tracking.
Nr 12            Original  3          Detect object attributes such as clothing colors, bags, shoes etc.
Nr 13            Original  3          Pair detected bags with the person carrying them, when they are carried.
Nr 14            Original  3          Find possible person matches in all video sequences based on specific attributes input in text format.
Nr 14            Original  3          For all persons, detect when their faces are shown.
Nr 16            Original  3          For a person with a specific object ID, visualize all images where the face is shown.

1.2 Project Overview

The goal of the project was to develop an analysis module for surveillance video cameras, with focus on the cameras from Stockholm metro stations. The system should be able to detect, classify and track persons and assign the same ID to the same person throughout the sequence where the person is visible. The result from the tracking should be saved in JSON format. The system should also be able to find a selected person from one video sequence in other video sequences. Lastly, the system should be able to detect attributes, such as shirt or jacket color, to make the result searchable based on attributes.

1.3 Client

The client is Nationellt Forensiskt Centrum (NFC). NFC is the department of the Swedish Police in charge of assisting the Police with image expertise in preliminary crime investigations. The Police as a whole have to manually go through huge amounts of surveillance video to solve crimes and are in need of a way to reduce this manual work and find possible culprits faster.

1.4 Limitations

The client asked for the analysis module to be developed under the MIT license and said that it could be developed on either Windows or Linux using open source code. A dataset with surveillance videos from the Stockholm metro was available to use for development of the module.

A limitation of the report is that images from the metro surveillance dataset were not allowed in the report for legal reasons. Therefore, the results that are presented visually in the report are from another dataset than the one for which the system was developed.

2 Related Work

Below, related work in detection, tracking and attribute segmentation is briefly discussed.

2.1 Tracking

There are several ways to do tracking in a video sequence. Especially suited for the sometimes very crowded scenes of metro surveillance videos is multi-object tracking (MOT) using tracking-by-detection. Here, all objects are first detected in each frame using an object detector, and the detections are then associated between frames using object location and appearance. Two examples of traditional methods that have been revisited in a tracking-by-detection scenario are Multiple Hypothesis Tracking (MHT) [1] and the Joint Probabilistic Data Association Filter (JPDAF) [2]. These traditional tracking-by-detection methods are usually very complex and computationally heavy. However, with the better detections enabled by recent object detectors, simpler tracking models can be used.

Two recent open source trackers are SORT and Deep SORT. SORT stands for Simple Online and Realtime Tracking, and the tracker uses Kalman filtering and the Hungarian method to handle motion prediction and data association. SORT is fast and simple and has high precision and accuracy, but it cannot handle occlusion very well [3].

Deep SORT is an extension of SORT which improves the matching procedure and greatly reduces the number of ID switches by using a visual appearance descriptor for each detected object [4].

2.1.1 Multi-video Tracking

Multi-video tracking is the process of finding the same object in different cameras. This can be done by comparing appearance descriptors for the objects. CNNs such as ResNet [5] can be used to extract these descriptors.

An example is the recently introduced method called AlignedReID [6]. AlignedReID extracts both global and local feature vectors. The local feature vectors are aligned when comparing objects so that the model learns to account for differences in camera angles and object size.

2.2 Detection

Three of the currently best performing open source and publicly available detectors are Yolov3 [7], Mask R-CNN [8] and RetinaNet [9]. All of these object detectors take an image or a video frame as input and output the coordinates of the bounding box for each detected object, as well as the object class for each bounding box and the confidence with which the detection is made [7][8][9]. Mask R-CNN also outputs segmentation masks, and it is possible to apply the network to detect instance-specific poses [8].

Comparing the different Average Precision (AP) values using COCO's mean AP metric [10], the currently most accurate detector is RetinaNet, but the other two detectors are not far off (see table 2). The network speeds are not easily comparable only by reading their respective papers. However, Yolov3 is claimed to be faster than RetinaNet [7].

Table 2: AP values using COCO's mean Average Precision for different object detectors [10].

Detector         Backbone          AP    AP50  AP75  APS   APM   APL
Yolov3 [7]       Darknet-53        33.0  57.9  34.4  18.3  35.4  41.9
Mask R-CNN [11]  ResNeXt-101-FPN   37.1  60.0  39.4  16.9  39.9  53.5
RetinaNet [12]   ResNeXt-101-FPN   40.8  61.1  44.1  24.1  44.2  51.2

2.3 Human Parsing

Human parsing can be used, for example, in surveillance for better person identification, and in the fashion and clothing industry. Since there are many different areas with an interest in clothes detection and parsing, previous work on the subject has been done to meet the requirements of those fields. In fashion there has been a demand for clothing segmentation in rather high-quality fashion still images. The Fashionista dataset was created for this purpose [13], and [14] uses a Conditional Random Field model to improve clothes parsing on this dataset.

For video surveillance the parsing has to be done on sequences of images that can be of lower quality and contain more people. The Crowd Instance-level Human Parsing (CIHP) dataset was created to deal with some of the problems with the previous dataset, and includes more images and more instances of persons in one image [15]. The authors of [15] solve the segmentation problem with what they call a Part Grouping Network (PGN), which is based on fully convolutional networks (FCN) and handles both detection and segmentation of human parts.

3 Theoretical Background

This section describes the theoretical background on detection, tracking, attribute segmentation and re-identification.

3.1 Detection

For the detection, Yolov3 [7] was used. Yolov3 is a detector applying a single neural network to a full image. The network predicts bounding boxes and probabilities for different regions of the image, and the bounding boxes are weighted using the predicted probabilities. It predicts detections across three different scales and, for each bounding box, the class it might contain is predicted using multi-label classification [7].

3.2 Tracking

The general aim of a MOT tracker is to associate detections across frames by localizing and identifying all objects of interest. An ideal tracker should provide a constant ID for each of the objects within the scene by keeping track of objects even when detections are missing or are false positives. The MOT problem is challenging since objects can be occluded or temporarily leave the field of view. The appearance of an object can also change within the scene because of scale, rotation and illumination variance.

In surveillance tracking, both performance and speed are of interest. Real-time tracking requires fast online models. In online tracking, only information from current and past frames is presented to the tracker.

3.2.1 SORT

The SORT algorithm keeps track of each object by estimating an object model for every frame. The object model contains current spatial information about object position, scale and bounding box ratio. The object model also contains a motion prediction for the next frame that is estimated using Kalman filtering.

The data association determines which detection belongs to which object. Since objects can enter or leave the scene, be occluded or correspond to false detections, the data association problem can be hard to solve. The SORT algorithm solves the problem by calculating the bounding box similarity between objects and detections. This is done by calculating the bounding box IoU distance as

IoU(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}    (1)

where A is the current bounding box for the previously detected object and B is the current bounding box for the new detection. For the previously detected objects, the current bounding box is estimated using the motion prediction. After calculating the IoU distance, the final assignment problem is solved by using the Hungarian method [3].
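A direct translation of equation 1 into code could look as follows; the (x1, y1, x2, y2) corner representation of the boxes is an assumption, since the report does not specify how boxes are stored.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2) corners."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    intersection = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0
```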

3.2.2 SORT with Deep Association Metric

This algorithm will be referred to as Deep SORT, and it is an extension of the SORT algorithm described in the previous section. SORT is fast and simple and performs very well in terms of precision and accuracy. However, it also delivers a relatively high number of identity switches. This motivates an improvement using descriptors of the visual appearance of the detected objects, which are used when matching detected objects from one frame to another to keep track of identities throughout a video sequence.

The descriptor used in Deep SORT is obtained from a convolutional neural network (CNN) that has been pre-trained on a large re-identification dataset with the purpose of discriminating between pedestrians. This means that the network has been trained to produce descriptor vectors that are far apart for detected pedestrians with different identities and very close together for images of the same person [4]. In deep metric learning methods, the notion of similarity is included directly in the training objective. The feature vectors that are generated for re-identification of the persons in the scene using Deep SORT are based on cosine similarity [16].
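To make the comparison concrete, the cosine distance between two such descriptor vectors can be computed as below. This is a generic sketch of the metric, not the exact code used in the Deep SORT implementation.

```python
import numpy as np


def cosine_distance(a, b):
    """Cosine distance between two appearance descriptors; smaller means more similar."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
```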

The Kalman filtering handles occlusion, but when an object has been occluded for several frames the prediction becomes more uncertain. Hence, the probability mass spreads out in the state space and there is a risk that a track with larger uncertainty is prioritized, because the uncertainty reduces the distance in standard deviations of any detection towards the projected track mean. Deep SORT solves this problem by using a matching cascade that prioritizes more frequently seen objects [4].

As previously stated, the Deep SORT algorithm is an extension of the SORT algorithm. Comparing the cosine distance between the feature vectors is a complement to measuring the IoU distance and the distances between Kalman state estimations. Because of this improvement, Deep SORT obtains better accuracy on standard benchmarks [4].

3.3 Object Re-Identification

Re-identification is the process of recognizing a specific object in different images. In the context of this project it means being able to assign the same ID to people in multiple cameras. A paper [6] from 2018 introduces an algorithm called AlignedReID which the authors claim performs person re-identification better than human annotators. AlignedReID uses a CNN to jointly learn global and local features to represent a person image.

A feature map of size C × H × W is taken from the last convolutional layer of a CNN, for example ResNet50 [5]. A global feature vector is then created by using global max pooling, i.e. pooling with an H × W kernel, which gives a C-dimensional global feature vector. Local feature vectors are then created by first horizontally max pooling the original feature map to create H different local feature maps and then convolving each local feature map with a 1 × 1 kernel to reduce the number of channels from C to c. After this, a person image is represented by a C-dimensional global vector and H different c-dimensional local vectors, where each c-dimensional vector represents a row of the person image.

The local distance between two person images is calculated by finding the alignment of local features which gives the smallest total distance. It is done by first creating the distance matrix D containing elements d_{i,j}. The element d_{i,j} is a normalized distance between local feature vector f_i from one image and local feature vector g_j from another image. The normalizing transformation is done as in equation 2 below.

d_{i,j} = \frac{e^{\|f_i - g_j\|_2} - 1}{e^{\|f_i - g_j\|_2} + 1}, \quad i, j \in \{1, 2, \ldots, H\}    (2)

The local distance between two images is then calculated as the shortest path from (1, 1) to (H, H) in matrix D. The global distance is calculated as the L2 distance between the global feature vectors of the images. Finally, the total distance between two person images is simply the sum of the local and global distances.
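A minimal sketch of this local distance is given below. It assumes, as in the AlignedReID paper, that the path through D may only move down or to the right; f and g hold the H local feature vectors of the two images.

```python
import numpy as np


def local_distance(f, g):
    """Aligned local distance between two stacks of H local feature vectors, shape (H, c)."""
    H = f.shape[0]
    # Pairwise Euclidean distances between local features, normalized as in equation 2.
    diff = np.linalg.norm(f[:, None, :] - g[None, :, :], axis=2)
    D = (np.exp(diff) - 1.0) / (np.exp(diff) + 1.0)

    # Dynamic programming: cheapest path from (1, 1) to (H, H) moving only down or right.
    cost = np.zeros_like(D)
    cost[0, 0] = D[0, 0]
    for i in range(1, H):
        cost[i, 0] = cost[i - 1, 0] + D[i, 0]
        cost[0, i] = cost[0, i - 1] + D[0, i]
    for i in range(1, H):
        for j in range(1, H):
            cost[i, j] = D[i, j] + min(cost[i - 1, j], cost[i, j - 1])
    return cost[H - 1, H - 1]
```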

The procedure above with both global and local features is used during the training stage. However, during the inference stage only the global features are used to compare similarity between person images. The authors of AlignedReID found that only using the global feature during inference worked almost as well as the combined global and local features. In [6] they speculate that the reason for this is that the structure prior of the person image in the learning stage makes the model learn better global features and that the local matching makes the global feature pay more attention to the person instead of the background.

AlignedReID uses TriHard [17] loss as metric learning loss and hard samples are mined using the global distance only. Mutual learning is used during training, details about the mutual learning used can be found in [6] and [18].

3.4 Human Parsing

The Part Grouping Network (PGN) was introduced by [15] as a method to solve the multi-person human parsing problem. In order to solve this task, semantic part segmentation and instance-aware edge detection are performed. Instead of training two networks to solve these tasks separately, PGN uses shared layers to learn common features, and then branches out to solve the separate tasks [15].

The network used to extract the shared feature maps is ResNet-101. The feature maps are extracted from the three last layers and used in the different branches. Two parallel branches are trained, one to assign a semantic part label to every pixel and one to perform edge detection. The last branch uses the output from the first two branches to refine both the edge and segmentation predictions [15].

The edges and parts can then be used to do instance-level human parsing, if the parts are connected to a certain instance of a person, as described in [15].

4 Method

Here, the system construction is presented.

4.1 System Overview

The system is split into two modules: one main module for detection, classification, tracking, and PGN segmentation, and one multi-video module for multi-video matching, i.e. re-identification in other videos. Figure 1 shows an illustration of the system.

Figure 1: Overview of the system.

4.2 Detection

For our system, the object detection is done using an implementation of Yolov3 in PyTorch [19]. The specific Yolov3 model used in the system is trained to detect 80 different kinds of objects, among which 'person', 'handbag' and 'backpack' are included. By default, the final system only forwards detections of object type 'person'. In an effort to speed up the detections, the Tiny Yolov3 model and weights were evaluated.

4.3 Tracking

For the tracking, an open source implementation of the Deep SORT algorithm was used [20], with some adaptations to the system.

This implementation is divided into two steps: feature generation and tracking. A 1×128 feature vector is extracted for every object before the tracking is performed, using the network described in section 3.2.2. This is done for the cut-out bounding box of each detection in each frame, obtained from the detection step described in section 4.2.

After adding feature vectors to all objects, the multiple object tracking is performed. For each pair of consecutive frames, the cosine feature distance, the IoU distance and the Kalman state distance are calculated between all pairs of detections in the two frames. Using motion predictions from the Kalman filter, the tracking bounding boxes are updated. The cosine feature distance and the IoU distance are used in the matching procedure. The IDs from the previous frame are assigned to the detections in the next frame according to the matches that generate the lowest costs. Detections are only matched if their IoU distance is smaller than IoUmax and their cosine feature distance is smaller than cosinemax. Detections with confidences that are too low (< cscore) are disregarded.
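A simplified Python sketch of the association step is given below. It uses the Hungarian algorithm from SciPy and only the cosine and IoU gating described above; the real Deep SORT implementation additionally uses a matching cascade and Mahalanobis gating from the Kalman filter, so this is an illustration rather than the exact procedure:

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def associate(track_feats, det_feats, iou_dist, cosine_max=0.2, iou_max=0.7):
        # track_feats: (T, 128) appearance features of the active tracks
        # det_feats:   (D, 128) appearance features of the current detections
        # iou_dist:    (T, D) IoU distance (1 - IoU) between track and detection boxes
        t = track_feats / np.linalg.norm(track_feats, axis=1, keepdims=True)
        d = det_feats / np.linalg.norm(det_feats, axis=1, keepdims=True)
        cos_dist = 1.0 - t @ d.T                                     # (T, D)

        infeasible = (cos_dist > cosine_max) | (iou_dist > iou_max)  # gating
        cost = np.where(infeasible, 1e5, cos_dist)

        rows, cols = linear_sum_assignment(cost)                     # Hungarian algorithm
        return [(r, c) for r, c in zip(rows, cols) if not infeasible[r, c]]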

Newly created tracks will not be initiated as an object instantly; a track will remain in the initialization phase until enough evidence has been collected. Tracks that have not been associated with a detection for some time will be deleted and removed from the set of active tracks. The initialization and removal of objects are controlled by the nint and Amax parameters.

When the tracking is done, information about each detected object is written to a json file. For every object, the json file contains the object ID and the following information for every frame where the object is visible: frame number, position, and bounding box width and height.
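The exact layout of the json file is not specified here, but a minimal sketch of such an output, with illustrative field names and a hypothetical file name, could look as follows:

    import json

    tracking_output = {
        "objects": [
            {
                "id": 1,
                "frames": [
                    {"frame": 12, "x": 104.0, "y": 87.5, "width": 34.0, "height": 91.0},
                    {"frame": 13, "x": 105.5, "y": 88.0, "width": 34.5, "height": 90.5},
                ],
            },
        ],
    }

    with open("tracking_result.json", "w") as f:   # hypothetical file name
        json.dump(tracking_output, f, indent=2)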

4.4 Person Re-Identification (Multi-video Matching)

The multi-video matching module requires a set of video sequences on which the tracking has been performed. The task is to find a desired person from one video sequence in the other sequences.

4.4.1 Re-identification using AlignedReID

AlignedReID is the main algorithm used for person re-identification. A ResNet50 pre-trained on the Market1501 [21] dataset is used as the CNN; the pre-trained model was taken from [22].

AlignedReID needs person images in order to compare people found in different cameras. Information about the bounding boxes from the tracking was used to extract a person image for each frame in which the person could be seen. This means that there is one person image for every frame in which the person is visible. However, only a single person image per person is wanted, since it is not computationally feasible to compare multiple bounding boxes per person. A single bounding box therefore has to be extracted for each person, and this bounding box should represent the general appearance of the person as closely as possible.

The bounding boxes are extracted in the following way. The bounding boxes are first added to a list and sorted according to size so that the smallest bounding box comes first.

1. If the person is visible in fewer than 10 frames, the median-sized bounding box is simply taken as the final bounding box representing the person.

2. If the person is visible in more than 10 frames, a mean bounding box is calculated from the set of all bounding boxes between the median-sized bounding box and the median of the upper half of the list. All boxes in the set are then compared to the mean bounding box individually, and the bounding box in the set which is most similar to the mean bounding box is taken as the final bounding box representing the person. Figure 2 below shows how the mean bounding box is extracted.
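A minimal Python sketch of this selection procedure is given below; sorting by box area and measuring similarity as the Euclidean distance between box sizes are assumptions of the sketch:

    import numpy as np

    def representative_box(boxes):
        # boxes: list of (width, height) pairs for one person, one per frame.
        boxes = sorted(boxes, key=lambda b: b[0] * b[1])        # smallest first (by area)
        n = len(boxes)
        median_idx = n // 2
        if n < 10:
            return boxes[median_idx]                            # median-sized box

        upper_median_idx = median_idx + (n - median_idx) // 2   # median of the upper half
        subset = np.array(boxes[median_idx:upper_median_idx + 1], dtype=float)
        mean_box = subset.mean(axis=0)                          # the "mean bounding box"
        best = np.argmin(np.linalg.norm(subset - mean_box, axis=1))
        return tuple(subset[best])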

Figure 2: Calculation of mean bounding box

The implementation of AlignedReID which is used works as described in section 3.3. All person images are resized to size (256, 128) before feature maps are extracted using the pre-trained ResNet50. The extracted feature maps have dimensions 2048 × 8 × 4, which means that the global feature vector is a 2048-dimensional vector. A convolution with a 1 × 1 kernel is done on the feature map to reduce the number of channels from 2048 to 512 before the local feature vectors are extracted. This means that there are in total 8 different 512-dimensional local feature vectors per person image.
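The following PyTorch sketch illustrates how global and local feature vectors can be obtained from such a 2048 × 8 × 4 feature map; the pooling types and the exact placement of the 1 × 1 convolution are assumptions and may differ from the implementation in [22]:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    conv1x1 = nn.Conv2d(2048, 512, kernel_size=1)      # channel reduction for local features

    def global_and_local_features(feature_map):
        # feature_map: (N, 2048, 8, 4) output of the ResNet50 backbone
        g = F.adaptive_avg_pool2d(feature_map, 1).flatten(1)   # (N, 2048) global vector
        local = conv1x1(feature_map)                           # (N, 512, 8, 4)
        local = local.mean(dim=3)                              # pool over width -> (N, 512, 8)
        return g, local.permute(0, 2, 1)                       # (N, 8, 512) local vectors

    # g, l = global_and_local_features(torch.randn(1, 2048, 8, 4))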

Unlike in the original AlignedReID paper, both the global and local distances are used in the inference stage when comparing two person images. The reason for this is that no significant difference in computation time could be observed between calculating only the global distance and calculating both the global and local distances.

4.4.2 Re-identification using Features from Deep SORT

The feature vectors that were used in the (frame-by-frame) tracking step were also investigated for use in re-identification between video sequences. For each detected person, the feature vectors throughout the sequence are stored for this purpose.

For a chosen ID in one of the video sequences, the mean feature vector for the person with this ID is calculated. This mean feature vector is then compared to the mean feature vector of every other person in the other video sequences. The IDs with the smallest cosine distance to the one belonging to the desired person are returned as candidates.
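A minimal sketch of this candidate ranking, assuming the per-frame feature vectors are available as NumPy arrays, could look as follows:

    import numpy as np

    def rank_candidates(query_feats, other_people, top_k=5):
        # query_feats:  (M, 128) Deep SORT features of the chosen ID
        # other_people: dict mapping person ID -> (K, 128) features from another sequence
        q = query_feats.mean(axis=0)
        q = q / np.linalg.norm(q)

        scored = []
        for pid, feats in other_people.items():
            m = feats.mean(axis=0)
            m = m / np.linalg.norm(m)
            scored.append((1.0 - float(q @ m), pid))            # cosine distance
        return [pid for _, pid in sorted(scored)[:top_k]]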

To obtain a more reasonable set of candidates, it is possible to use time information to filter out candidates that appear much earlier or later than the desired person. A motivation for this is the nature of the client's data, metro surveillance footage: a person rarely stays within the underground area for longer than a few minutes.

4.5 Human Parsing

The goal of the human parsing was to find attributes of a person (e.g. "person with hat") and to connect each attribute with one or several colors (e.g. "person with red coat"), in order to simplify searching for a specific person. This was done in several steps: the attributes were detected and segmented out, the dominant colors of the pixels belonging to an attribute were calculated, and the color code was mapped to a color name in English (e.g. [255, 0, 0] in RGB space to the word "red"). Since the input to the system is a video sequence and each person may appear in several frames, the image that is used for the attribute detection of each person also needs to be chosen.

For the segmentation of the attributes, the PGN network, trained on the previously described CIHP dataset, was used. Unlike the original use by the authors, who did both human part detection and segmentation in one go, information about where each person is seen in the sequence is already available and can be used. For every detected object, as defined after the tracking, one bounding box image is chosen for attribute detection and segmentation. This image is chosen as described in section 4.4.1.

This means that for every detected object, the bounding box image is used as input to the PGN network, after the same preprocessing as previously described. This outputs label masks of the different human parts of the object. For every part, or attribute, of the object, the two dominant colors of its pixels are calculated with k-means clustering. The dominant colors are mapped to a color name by comparing the Euclidean distance to a predefined set of colors in CIELAB color space. This color space was chosen because differences in color and luminance are more perceptually uniform [23]. Since the predefined set of colors contains a lot of colors with different names, these names were simplified (e.g. "DodgerBlue3" to "Blue").
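The color extraction step can be sketched as follows; the palette below is only a small illustrative subset of the predefined color set, and the library choices (scikit-learn for k-means, scikit-image for the CIELAB conversion) are assumptions rather than necessarily the ones used in the system:

    import numpy as np
    from sklearn.cluster import KMeans
    from skimage.color import rgb2lab

    # Small illustrative palette; the real system uses a much larger set of named colors.
    PALETTE = {"Red": (255, 0, 0), "Green": (0, 128, 0), "Blue": (0, 0, 255),
               "Black": (0, 0, 0), "White": (255, 255, 255), "Grey": (128, 128, 128)}

    def to_lab(rgb):
        return rgb2lab(np.array(rgb, dtype=np.uint8).reshape(1, 1, 3))[0, 0]

    def dominant_color_names(pixels_rgb, n_colors=2):
        # pixels_rgb: (N, 3) uint8 array with the pixels of one attribute mask
        km = KMeans(n_clusters=n_colors, n_init=10).fit(pixels_rgb.astype(float))
        names, labs = zip(*[(name, to_lab(c)) for name, c in PALETTE.items()])

        result = []
        for center in km.cluster_centers_:                      # dominant RGB colors
            lab_c = to_lab(center)
            result.append(names[int(np.argmin([np.linalg.norm(lab_c - p) for p in labs]))])
        return result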

Both the color code and the color name are saved for each attribute belonging to each detected object. This is written to a json file along with all other information about the object.

5 Evaluation and Results

In this section the results of the project will be presented. Since no ground truth is provided, the results will be qualitative, in the form of figures. All figures are generated from the Laboratory sequences.

5.1 Evaluation Data

When developing the system, actual surveillance videos from the Stockholm metro were used. For legal reasons, videos 1-4 in sequence 2 from CVLab's Laboratory sequences1 were used to evaluate the final system, and the results from these final evaluations are what will mainly be presented in this report. Each of the four videos is around two minutes long, with a frame rate of 25 frames per second and a frame size of 360x288 pixels.

5.2 Runtime Comparison

The speed was measured when running the system on a Dell XPS 15 9560 laptop with an NVIDIA GeForce GTX 1050 graphics card. The specified times are for running all four of the videos in the Laboratory Sequences, which contain six persons. The runtimes are shown in table 3.

Table 3: Runtime for different parts of the system when processing the four videos in the Laboratory Sequences.

Module                                    Time spent          Frames per second
Entire module                             68 min and 8 sec    2.89
- Detection and classification module     32 min and 18 sec   6.09
- Entire tracking module                  9 min and 5 sec     21.7
  - Extracting features                   8 min and 17 sec    23.7
  - Deep SORT                             40 sec              295
- PGN segmentation                        26 min and 45 sec   7.35

5.3 Detection

Figure 3 shows a few examples of the bounding boxes generated by the detection module. There are successful detections even under partial occlusion, as in figures 3a and 3c. There are also glitches in the detection, as seen in figure 3b.

When processing the client's video dataset from the Stockholm metro, Yolov3 rarely detected any bags. The few actual bag detections were a few backpacks in random single frames when seen straight from behind. It was generally good at detecting persons, except when they were very far away from the cameras or very close and only partly visible. A few times, strange misinterpretations occurred, such as a train or a whole platform being labeled

1 https://cvlab.epfl.ch/data/data-pom-index-php/

as a person, even at confidence levels as high as 0.85.

(a) Example 1. (b) Example 2. (c) Example 3.

Figure 3: A few examples of detections by Yolov3 in the Laboratory Sequences.

As seen in table 3, it took 32 minutes and 18 seconds to run the detection module of the system, making it the most time-consuming part of the complete module. The effort to speed up the system by using Tiny Yolov3 did make the detection module exceptionally fast, but accurate detections decreased to a minimum, as most persons were classified as airplanes.

5.4 Tracking

The tracking is performed using the tracking parameters given in table 4.

Table 4: Tracking parameters.

Parameter    Value   Description
IoUmax       0.7     Maximum IoU distance.
cosinemax    0.2     Gating threshold for the cosine distance metric.
cscore       0.85    Detection confidence threshold.
nint         3       Number of frames that a track remains in the initialization phase.
Amax         30      Maximum number of missed matches before a track is deleted.

The Deep SORT tracker is fast and effective, and its performance is quite good. Results of Deep SORT are shown in figures 4, 5 and 6b. Even though the objects have similar appearance, the tracker manages to keep them apart, and throughout the sequence there are only a few ID switches.

(a) Frame 1 (b) Frame 2

Figure 4: Two arbitrarily chosen frames from the result of Deep SORT, with bounding boxes and IDs for each person. For each object there are two bounding boxes: the red box is the detection in the current frame and the colored box is the predicted bounding box given by the Kalman filter.

Even though Deep SORT uses both motion and visual appearance, the algorithm struggles when objects are occluded. In the Laboratory Sequences the objects move back and forth in circles, and their motion pattern is irregular and complex. When an object passes another object and is fully occluded, the tracker mostly assigns a new ID to the object when it becomes visible again. Several IDs can therefore be assigned to the same object throughout the sequence. An illustration of this is shown in figure 5.

(a) Frame 1 (b) Frame 2

Figure 5: Tracking result from camera 0, where the same object is assigned different IDs throughout the sequence. For each object there are two bounding boxes: the red box is the detection in the current frame and the colored box is the predicted bounding box given by the Kalman filter.

If cosinemax is increased, the occlusion is handled better and some objects are tracked through almost the whole sequence with the same ID. Feature vectors before and after occlusion are often dissimilar since part of the object can be occluded. A higher cosinemax accepts matches that are more dissimilar and will therefore handle occlusion better. The

drawback is that a higher cosinemax also contributes to more ID switches, since the differences in visual appearance have less impact.

(a) Frame 1 (b) Frame 2

Figure 6: Tracking result from camera 0 using cosinemax = 0.4; the increase in cosinemax causes an ID switch for ID = 5.

5.5 Person Re-identification

In this section, the results from the re-identification module are presented.

5.5.1 Runtime Comparison

The runtimes for the two re-identification submodules, one using AlignedReID and one using Deep SORT features, are shown in table 5. The implementation of AlignedReID requires the system to save images of all bounding boxes; this has to be done after the whole module with detection and tracking has been run. The user can specify whether these bounding box images should be deleted after AlignedReID has been run. Extracting the bounding boxes, i.e. saving an image of every bounding box, takes a while, which is why timing was done both for when the bounding boxes already exist and for when they do not. Deep SORT feature vectors are extracted during the tracking step, which means that there is no need to save bounding boxes when matching with Deep SORT feature vectors.

The runtime for the multi-video matching depends on the total number of bounding boxes that the original bounding box is compared to. Hence, the number of bounding boxes that the original box has been compared to is included in the result.

Table 5: Time spent on re-identification

Submodule                          Time spent      Bounding boxes
AlignedReID with boxes             18 sec          187
AlignedReID without boxes          7 min 56 sec    187
Feature vectors from Deep SORT     6.4 sec         -

5.5.2 Re-identification using AlignedReID

Results from multi-video matching using AlignedReID are shown below. Figure 7 shows the original image and figure 8 shows the matched images in the other cameras.

Figure 7: The desired person in camera 0.

(a) Camera 1. (b) Camera 2. (c) Camera 3.

Figure 8: The top 5 matches in each of the other sequences using AlignedReID.

AlignedReID also performed quite well on the metro dataset. It worked especially well when good bounding boxes could be extracted.

5.5.3 Re-identification using Features from Deep SORT

Below, the result from an execution of the Deep SORT feature-based re-identification on the lab sequences is shown. Figure 9 shows the top 5 matches for the person in figure 7.

(a) Camera 1. (b) Camera 2. (c) Camera 3.

Figure 9: The top 5 matches in each of the other sequences using features from Deep SORT.

Overall, the matching performed fairly well on the lab dataset for different IDs. On the metro dataset, the matching worked very well for persons with colorful clothes, but worse for some of the other persons.

5.6 Human Parsing

Figure 10: Example of output from human parsing module.

Examples of the results from the human parsing and color mapping can be seen in figures 10 and 11. Figure 10 is an example where the parsing with the Part Grouping Network has worked well. The module has also been able to correctly find the color of the upper clothes. In figure 11 the parsing has not worked as well: the outline of the upper clothes has been classified as 'upper-clothes', but the rest has been classified as 'coat'.

Figure 11: Example of output from human parsing module.

6 Discussion

The complete system seems to perform quite well on the lab dataset. Below, a discussion is given on the different parts of the system.

6.1 Detection

Since Yolov3 is probably the fastest detector with competitive performance, we can draw the conclusion that it has been beneficial to use Yolov3 considering system runtime. However, certain accuracy sacrifices might have been made, which in turn might have had a negative effect on the tracking. That is, there is a speed versus accuracy trade-off.

To improve the system, a Yolov3 model that has been trained only on persons, bags and other objects of interest could be used. This would improve the accuracy of the detection part of the system and generate better bag detections.

6.2 Tracking

The Deep SORT tracker is fast, has few ID switches and is easy to use. The main drawback of the algorithm is the occlusion handling. Even though the tracker uses both motion and appearance for the association, it handles occlusion poorly in the Laboratory sequences. The feature vectors before and after occlusion are often too dissimilar to generate a match.

The confidence threshold that was used was larger than the threshold used in the original Deep SORT algorithm. A larger confidence threshold can potentially improve the performance, but the confidence value needs to be changed with care. A higher threshold can improve the performance since detections with low confidence might correspond to erroneously detected objects. However, if the threshold is too large, too many detections will be disregarded and the performance will be even worse, since Deep SORT is a relatively simple tracker that demands detections in most frames.

6.3 Person Re-identification

As shown in table 5, a lot of time can be saved by re-using the feature vectors from Deep SORT for re-identification. However, this comes at the cost of a possibly lower accuracy than when using AlignedReID. This is, of course, only beneficial when the features have already been extracted in the tracking step, a process that is also time consuming, see table 3.

In the lab sequences, the accuracy does not seem to differ much between the two methods. However, on the metro surveillance dataset the AlignedReID implementation performed better. Therefore, it makes sense to prioritize AlignedReID in this implementation, with the Deep SORT feature comparison as a faster alternative. This may however differ between datasets.

If the feature representation used in AlignedReID were already used in the tracking, instead of the current Deep SORT representation, the multi-video re-identification step

would be a lot faster when using AlignedReID. This would, however, require much longer computation time in the tracking step. Even though one feature representation might be better for re-identification within the same sequence while another representation is better for the multi-video case, it would in some cases make sense to use the same representation both in the tracking and in the multi-video matching in order to have a faster system.

6.4 Human Parsing

Overall the parsing gives good results, as long as the image is not too small and does not only contain a small part of a person. If there are several persons in the bounding box, the color output might be misleading, since attributes belonging to different persons will be added to one single detected object. This could be handled by connecting the parts to an instance of a person using both the part segmentation and the edge output from PGN, in a similar manner as in [15].

The mapping of a color code to a color name does not always work well. There might be several reasons for this. First of all, there are a lot of colors in the predefined set of colors; having fewer might yield better results. Secondly, the surveillance videos used are often greyish in nature, and a lot of the colors will come back as grey. This could be handled by comparing the hue of the colors instead, but then the problem is the definition of colors like grey, white and black in that color space.

Since PGN does both segmentation and detection, and can group parts to an instance of a person, it could be used earlier in the system to improve the detections, if speed is not an issue. This could also solve the problem with the parsing when the bounding box images are either too small or only contain a small part of a person, since the parsing would no longer depend on the bounding box output from the tracking module. The whole human parsing module could also be used to improve tracking, both within one video and across several videos.

6.5 Future Work

One might consider trying out another detection network, e.g. Mask R-CNN or RetinaNet. This would require a longer runtime, but it would most likely result in better accuracy.

If the processing speed is not a problem, one interesting extension regarding the tracking and re-identification would be to investigate the performance when using the feature representation from AlignedReID also in the tracking step. This would add significant complexity to the tracking, but it would probably generate more accurate results.

Another improvement that would give better accuracy, but also require a lot more processing, would be to use the attribute segmentation in the detection step. This could also give better feature vectors in the tracking, because the background would not have to be used when extracting features.

By detecting individual clothing items, e.g. hats or jackets, this information could also be used to greatly improve the re-identification module. By knowing the types and colors of every detected person's clothes, one could build a text-based search module. One

could, for example, search for a person with a blue jacket and a white hat, and the search module would quickly find the matching persons in the sequences. Combined with the comparison of the feature representations of the persons, this could be a powerful re-identification module.
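As a rough illustration of the idea, such a text-based search could be a simple filter over the attribute and color information already written to the json files; the field names below are hypothetical:

    def search_people(objects, required):
        # objects:  list of dicts such as {"id": 3, "attributes": {"coat": "Blue", "hat": "White"}}
        # required: dict of attribute -> color, e.g. {"coat": "Blue", "hat": "White"}
        return [obj["id"] for obj in objects
                if all(obj.get("attributes", {}).get(a) == c for a, c in required.items())]

    # search_people(parsed_objects, {"coat": "Blue", "hat": "White"})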

7 Conclusions

This project managed to fulfill most of the specified requirements, with the exception of some of the priority 3 requirements. The performance of the system varies depending on which data is used; however, the final system has an overall decent performance on the dataset provided by the client.

The system runs reasonably fast on a standard modern laptop, but with more powerful hardware, some further improvements could be made, as suggested in the discussion. This would most likely give more accurate results.

The tracking-by-detection approach that was used is heavily dependent on the detections from the detection stage. Yolov3, which was used, worked well, but it is possible that a better result could be achieved with another detector. Better detection and tracking would also lead to better person attribute segmentation and better person re-identification.

Appendix

References

[1] K. F. Chanho, L. A. Ciptadi, and J. Rehg. "Multiple Hypothesis Tracking Revisited". In: (2018). doi: 10.1109/WACV.2018.00087.
[2] S. H. Rezatofighi et al. "Joint Probabilistic Data Association Revisited". In: (2015). doi: 10.1109/ICCV.2015.349.
[3] A. Bewley et al. "Simple online and realtime tracking". In: 2016 IEEE International Conference on Image Processing (ICIP). 2016, pp. 3464-3468. doi: 10.1109/ICIP.2016.7533003.
[4] N. Wojke, A. Bewley, and P. Dietrich. "Simple Online and Realtime Tracking with a Deep Association Metric". In: (2017), pp. 3645-3649. doi: 10.1109/ICIP.2017.8296962.
[5] K. He et al. "Deep Residual Learning for Image Recognition". In: arXiv preprint arXiv:1512.03385 (2015).
[6] X. Zhang et al. "AlignedReID: Surpassing human-level performance in person re-identification". In: arXiv preprint arXiv:1711.08184 (2017).
[7] J. Redmon and A. Farhadi. "YOLOv3: An Incremental Improvement". In: arXiv (2018).
[8] W. Abdulla. Mask R-CNN for object detection and instance segmentation on Keras and TensorFlow. https://github.com/matterport/Mask_RCNN. 2017.
[9] Keras RetinaNet. https://github.com/fizyr/keras-retinanet.
[10] T. Lin et al. "Microsoft COCO: Common Objects in Context". In: CoRR abs/1405.0312 (2014). arXiv: 1405.0312. url: http://arxiv.org/abs/1405.0312.
[11] K. He et al. "Mask R-CNN". In: CoRR abs/1703.06870 (2017). arXiv: 1703.06870. url: http://arxiv.org/abs/1703.06870.
[12] T. Lin et al. "Focal Loss for Dense Object Detection". In: CoRR abs/1708.02002 (2017). arXiv: 1708.02002. url: http://arxiv.org/abs/1708.02002.
[13] K. Yamaguchi et al. "Parsing clothing in fashion photographs". In: June 2012, pp. 3570-3577. isbn: 978-1-4673-1226-4. doi: 10.1109/CVPR.2012.6248101.
[14] E. Simo-Serra et al. "A High Performance CRF Model for Clothes Parsing". In: Proceedings of the Asian Conference on Computer Vision (2014). 2014.
[15] K. Gong et al. "Instance-level Human Parsing via Part Grouping Network". In: ArXiv e-prints (July 2018). arXiv: 1808.00157 [cs.CV].
[16] N. Wojke and A. Bewley. "Deep Cosine Metric Learning for Person Re-identification". In: (2018), pp. 748-756. doi: 10.1109/WACV.2018.00087.
[17] A. Hermans, L. Beyer, and B. Leibe. "In Defense of the Triplet Loss for Person Re-Identification". In: arXiv preprint arXiv:1703.07737 (2017).
[18] Y. Zhang et al. "Deep Mutual Learning". In: CVPR. 2018.
[19] pytorch-yolo3. https://github.com/marvis/pytorch-yolo3.
[20] N. Wojke. Simple Online Realtime Tracking with a Deep Association Metric (deep_sort). https://github.com/nwojke/deep_sort. 2017.

[21] L. Zheng et al. "Scalable Person Re-identification: A Benchmark". In: Computer Vision, IEEE International Conference on. 2015.
[22] H. Luo. AlignedReID+: Dynamically Matching Local Information for Person Re-Identification. https://github.com/michuanhaohao/AlignedReID. 2018.
[23] R. Szeliski. Computer Vision: Algorithms and Applications. Springer, 2010.
