
Master of Science Thesis in Electrical Engineering, Department of Electrical Engineering, Linköping University, 2019

Evaluation of Multiple Object Tracking in Surveillance Video

Master of Science Thesis in Electrical Engineering
Evaluation of Multiple Object Tracking in Surveillance Video
Axel Nyström
LiTH-ISY-EX--19/5245--SE

Supervisors: Anderson Tavares, ISY, Linköping University; Niclas Appleby, National Forensic Centre
Examiner: Michael Felsberg, ISY, Linköping University

Computer Vision Laboratory
Department of Electrical Engineering
Linköping University
SE-581 83 Linköping, Sweden

Copyright © 2019 Axel Nyström

Sammanfattning

Visual multiple object tracking is the process of assigning unique and consistent identities to several objects in a video sequence. A popular method for object tracking is a technique called tracking-by-detection. Tracking-by-detection is a two-stage process in which an object detection algorithm first finds objects in every frame of a video sequence; the found objects are then associated with already tracked objects by a tracking algorithm. One of the main aims of this thesis is to investigate how different object detection algorithms perform on the surveillance video on which the National Forensic Centre wants to use visual object tracking. The thesis also examines the correlation between the performance of the object detection algorithm and the performance of the complete tracking-by-detection system. Finally, it investigates how the use of visual descriptors in the tracking algorithm can affect the accuracy of a tracking-by-detection system.

The results presented in this thesis show that the capability of the object detection algorithm is a strong indicator of how the complete tracking-by-detection system performs. The thesis also shows how the use of visual descriptors in the tracking stage can reduce the number of identity switches and thereby increase the accuracy of the whole system.


Abstract

Multiple object tracking is the process of assigning unique and consistent identities to objects throughout a video sequence. A popular approach to multiple object tracking, and object tracking in general, is to use a method called tracking-by-detection. Tracking-by-detection is a two-stage procedure: an object detection algorithm first detects objects in a frame, and these objects are then associated with already tracked objects by a tracking algorithm. One of the main concerns of this thesis is to investigate how different object detection algorithms perform on surveillance video supplied by the National Forensic Centre. The thesis then goes on to explore how the stand-alone performance of the object detection algorithm correlates with the overall performance of a tracking-by-detection system. Finally, the thesis investigates how the use of visual descriptors in the tracking stage of a tracking-by-detection system affects performance.

Results presented in this thesis suggest that the capacity of the object detection algorithm is highly indicative of the overall performance of the tracking-by-detection system. Further, this thesis also shows how the use of visual descriptors in the tracking stage can reduce the number of identity switches and thereby increase the performance of the whole system.


Acknowledgments

I would like to thank NFC, and specifically Niclas Appleby, for giving me the opportunity to work on this thesis as well as supplying me with adequate hardware. I would also like to thank my supervisor Anderson Tavares and my examiner Michael Felsberg for providing me with great feedback. Finally, a shout-out to my coffee break mates, you made writing this thesis enjoyable.

Linköping, 2019 Axel


Contents

Notation

1 Introduction
  1.1 Background
    1.1.1 Object Detection
    1.1.2 Multiple Object Tracking
  1.2 Problem formulation
  1.3 Limitations

2 Theory and Related Work
  2.1 Tracking-by-Detection
  2.2 Image Classification Networks
  2.3 Object Detection Algorithms
    2.3.1 R-CNN
    2.3.2 Fast R-CNN
    2.3.3 Faster R-CNN
    2.3.4 Region Proposal Networks
    2.3.5 Mask R-CNN
    2.3.6 YOLO
    2.3.7 YOLOv2
    2.3.8 YOLOv3
    2.3.9 Feature Pyramid Network
    2.3.10 Single Shot Detector
    2.3.11 RetinaNet
  2.4 Tracking Algorithms
    2.4.1 SORT
    2.4.2 Deep SORT

3 Method
  3.1 Data Annotation
  3.2 Evaluation
    3.2.1 Classification of Predicted Bounding Boxes
    3.2.2 Object Detection Evaluation
    3.2.3 Object Tracking Evaluation
  3.3 Test Environment
    3.3.1 Testing Object Detection Algorithms
    3.3.2 Testing Object Tracking Algorithms
    3.3.3 Hardware
  3.4 Algorithms and Implementations
    3.4.1 Deep Learning Libraries
    3.4.2 Implementations

4 Results
  4.1 Object Detection Results
  4.2 Object Tracking Results

5 Discussion
  5.1 Object Detection
  5.2 Object Tracking

6 Conclusions
  6.1 Future Work
  6.2 Ethics

Bibliography

A Technical Report of CDIO Project

Notation

Abbreviations

Abbreviation   Meaning
CNN            Convolutional Neural Network
SVM            Support Vector Machine
NFC            National Forensic Centre
RoI            Region of Interest
IoU            Intersection over Union
YOLO           You Only Look Once
SSD            Single Shot Detector
SORT           Simple Online and Real-time Tracker
RPN            Region Proposal Network
FPN            Feature Pyramid Network
CPU            Central Processing Unit
GPU            Graphics Processing Unit
MOT            Multiple Object Tracking
CSV            Comma Separated Value
TP             True Positive
FP             False Positive
TN             True Negative
FN             False Negative


1 Introduction

Video tracking is an area of computer vision that deals with localization of moving objects in video. There are many applications of video tracking in fields such as robotics, sport analysis, and video surveillance. These applications often require multiple objects to be tracked at the same time, which is referred to as multiple object tracking.

A popular approach to object tracking is to use a method called tracking-by-detection. Tracking-by-detection uses an object detection algorithm to detect objects present in a frame. These objects are then tracked by associating objects in the current frame with objects from previous frames using a tracking algorithm. Having a reliable method for object detection is crucial since the tracking algorithm is dependent on objects being detected in each frame. Lately, object detection algorithms based on convolutional neural networks have been able to achieve greater accuracy than traditional object detection methods. This improvement in object detection accuracy has facilitated the use of tracking-by-detection methods for multiple object tracking.

1.1 Background

National Forensic Centre (NFC) is an organization within the Swedish police authority that is responsible for forensics for the Swedish police. The section for technology at NFC handles, among other things, forensic image analysis. NFC wants to investigate how video tracking in surveillance cameras can be used to support criminal investigations and to ease the work of surveillance camera operators.

NFC has previously hosted a student project related to video tracking. That project was part of the course Image and Graphics, Project Course CDIO at Linköping University (see Appendix A). The result of that project was a tracking-by-detection system, which uses a detection algorithm called YOLO [33] to detect objects and a tracking algorithm called Deep SORT [44] to track detected objects. The system is also able to perform person re-identification using an algorithm called AlignedReID [45]. Person re-identification is the practice of identifying the same person across multiple different cameras. This system is relevant because it is used as the test environment throughout this thesis.

1.1.1 Object Detection

Object detection is the process of localizing and classifying objects present in an image. Today's state-of-the-art object detection algorithms typically utilize convolutional neural networks in some way. There are two main categories of such object detection algorithms: single-stage detectors and two-stage detectors. The major difference between the two categories is that two-stage detectors first find regions of interest in the image and then classify the regions separately, whereas single-stage detectors predict bounding boxes and classify objects simultaneously [5]. This generally gives single-stage detectors increased speed at the cost of accuracy when compared to two-stage, region-based detectors.

1.1.2 Multiple Object Tracking

The tracking stage of a tracking-by-detection method can be seen as solving two different tasks. First, future positions of tracked objects are predicted; this is commonly done using methods such as the Kalman filter [19]. Next, objects detected in a new frame are associated with already tracked objects based on the predicted future positions of the already tracked objects. If there are as many detections as there are already tracked objects, this association can be seen as an assignment problem, which can be solved using the Hungarian method [21]. Deep SORT [44], which is used in the system already implemented at NFC, predicts future positions using a Kalman filter and then solves the assignment problem with the Hungarian method. In addition to this, Deep SORT also uses a visual descriptor to improve the accuracy of the tracking. This visual descriptor is a 128-dimensional vector obtained by feeding an object's bounding box into a convolutional neural network. The convolutional neural network has been trained to distinguish pedestrians from each other, which means that it is especially well suited to tracking people.

1.2 Problem formulation

NFC is interested in investigating how the accuracy of their tracking-by-detection system can be improved. Since tracking-by-detection systems are limited by the performance of the detection algorithm, it is relevant to examine how the choice of detection algorithm affects the accuracy and speed of the system. This trade-off between speed and accuracy has been studied in papers such as [17], where several modern object detection algorithms are compared.

Another aspect of the tracking-by-detection method that could affect the overall accuracy is the use of visual descriptors in the tracking stage. Deep SORT [44], which is used in the existing system at NFC, is an extension of the algorithm SORT [4], which does not use any appearance information in the tracking stage. Comparing SORT to Deep SORT would give insights into how tracking performance and, by extension, the whole system's performance is affected by the use of visual descriptors.

All quantitative analyses will be done on the same form of surveillance video data on which NFC plans to use the system. The research questions should also be seen in this context; they do not seek to answer the general case but rather the specific case in which NFC's data is used. The research questions this thesis aims to answer are the following:

• How does the choice of detection algorithm affect a tracking-by-detection method that is used to track people?

• How does the use of visual descriptors in the tracking stage of a tracking-by-detection method affect the accuracy?

1.3 Limitations

The major limitation of this thesis is that NFC does not have any annotated data. Some data is annotated as a part of this thesis in order to be able to quantitatively measure performance. This annotated data is however not nearly enough to train object detection algorithms with and is therefore only used as test data. Instead, pre-trained weights supplied with algorithm implementations are used throughout this thesis.

This thesis also limits the number of tested tracking algorithms to two: SORT and Deep SORT. Other algorithms could have been tested, but SORT and Deep SORT were chosen due to how similar they are, apart from the use of visual descriptors in Deep SORT. Testing more algorithms would also have further broadened the scope of the thesis and made it more time-consuming.

2 Theory and Related Work

This chapter presents the theory, related work, and key concepts that are relevant to this thesis. Theoretical concepts are purposefully presented at a rather abstract level so that this chapter does not become overly lengthy. This means that the thesis will not delve into the basics of computer vision or deep learning. Therefore, it is beneficial for the reader to have a basic understanding of both deep learning and computer vision. If the reader feels the need to brush up on these subjects, a good starting point for deep learning is [12], and for an overview of computer vision [41] is recommended. Further, the reader may also need to read up on some general machine learning concepts, in which case [13] is suggested.

The chapter begins with an introduction to tracking-by-detection and a short presentation of some image classification networks. It then continues with descriptions of different object detection algorithms. The ambition is to present the object detection algorithms in such a way that it is clear how they differ from each other. Thus, details about the training procedure have mostly been omitted unless the training procedure is a central characteristic of the algorithm. Algorithms are presented in sequence if they are related to each other; otherwise, the algorithms are presented in the chronological order in which they were published. The chapter finishes with descriptions of the tracking algorithms used in this thesis.


2.1 Tracking-by-Detection

Multiple object tracking is the task of assigning consistent and unique identities to multiple objects in a video sequence. This thesis examines an object tracking technique called tracking-by-detection. Tracking-by-detection is a two-stage process: an object detection algorithm first detects objects present in a frame; these objects are then associated with already tracked objects by a tracking algorithm [27]. Normally, the object detection algorithm and the tracking algorithm are completely separated from each other and can therefore be analyzed individually.

Object detection is the process of detecting particular classes of objects in an image; examples of classes are people or bags. The aim of an object detection algorithm is to both localize and classify objects belonging to any of the sought-after classes [5]. Thus, for each detected object, an object detection algorithm produces estimates of the position, size and class of the object. The position and size of a detected object are often represented by a bounding box, which is a rectangular box encompassing the object. The extent of a detected object can also be defined by a segmentation mask, which is a pixel-level mask of the object [41].

Object detection has developed significantly in the last decade due to advances in the closely related field of image classification. This progress is owed to breakthroughs [20] in how convolutional neural networks (CNNs) can be utilized to classify images. The object detection algorithms considered in this thesis usually consist of a CNN designed for image classification with additional algorithm-specific structure around the CNN. The CNN is referred to as the backbone of the algorithm and the algorithm-specific structure is called the meta architecture. This thesis will, as is custom, identify object detection algorithms by their meta architecture. CNN-based object detection algorithms can be split into two different groups: single-stage and two-stage detectors [5]. Two-stage detectors first generate possible bounding boxes by segmenting an image into regions of interest; these regions are then separately classified by a CNN in a second stage. Single-stage detectors produce estimates of both bounding boxes and classes in a single forward pass of an image through a CNN. Traditionally, two-stage detectors have achieved higher accuracy at the cost of speed compared to single-stage detectors. However, the recently introduced loss function Focal Loss [24] has made single-stage detectors able to approach two-stage detectors in terms of accuracy. The trade-off between speed and accuracy is a major design choice and has been studied in papers such as [17].

The tracking algorithm in a tracking-by-detection framework is responsible for assigning unique identities to tracked objects and for making object associations between frames. The main focus of this thesis is on object detection algorithms, and only two different tracking algorithms will be considered: SORT [4] and Deep SORT [44]. SORT stands for Simple Online and Realtime Tracking; it is a deliberately simple tracking algorithm that uses a Kalman filter [19] to estimate future positions of objects and makes frame-to-frame associations using the Hungarian method [21]. Deep SORT is an extension of SORT that incorporates appearance information when doing object associations between frames.

2.2 Image Classification Networks

As described in section 2.1, the object detection algorithms studied in this thesis consist of an algorithm-specific meta architecture and a backbone, with backbones being CNNs originally constructed for image classification. This thesis focuses on the meta architectures, and thus only a brief explanation of different backbones will be provided. The following list is a short introduction to the relevant backbones and how they compare to each other in terms of accuracy and complexity. Complexity is measured as the number of floating-point operations (FLOPs) required for a forward pass and is used as an indication of how fast a network is.

• VGG-16: A convolutional neural network with 16 layers that performed well in the 2014 ILSVRC challenge [38]. Forward passing an image with resolution 224 × 224 pixels requires roughly 15 · 10^9 FLOPs [15]. It achieves 71.93% accuracy on the ImageNet validation dataset [15].

• ResNet-50: A residual network with 50 layers; it achieves 77.15% accuracy on the ImageNet validation dataset [15]. Forward passing a 224 × 224 image requires 3.8 · 10^9 FLOPs [15].

• ResNet-101: A residual network with 101 layers, achieving 78.25% accuracy on ImageNet [15]. Forward passing a 224 × 224 image requires 7.6 · 10^9 FLOPs [15].

• Darknet-53: A 53-layer network with residual layers, designed specifically for use in YOLOv3 [33]. On ImageNet its accuracy of 77.2% is similar to ResNet-101's 77.15%. It also requires roughly the same number of operations to perform a single forward pass: forward passing an image with resolution 256 × 256 pixels requires roughly 18.7 · 10^9 FLOPs. Darknet-53 is however significantly faster since it utilizes the GPU more effectively [33].

2.3 Object Detection Algorithms

This section provides a theoretical introduction to the object detection algorithms tested in the experimental part of this thesis. The tested algorithms' predecessors are also explained, since this makes it easier to understand the tested algorithms.

2.3.1 R-CNN

R-CNN, short for Regions with CNN features, is a method for object detection introduced by Girshick et al. in [10]. The system is a pipeline with three main parts: a region proposal method, a convolutional neural network and a set of support vector machines (SVMs). Figure 2.3.1 below shows the interaction of the main parts of R-CNN. First, the region proposal method segments an image into category-independent regions. This generates approximately 2000 regions per image. After segmenting the image, each region is warped to a fixed size to fit the required input size of the CNN. Next, the 2000 warped regions are separately fed through the CNN and a feature vector is extracted for each region. Each feature vector is then classified by a set of linear SVMs, where each SVM is trained to classify one specific class. Finally, given the class predicted by the SVMs, ridge regression is used to improve the predicted shape of the bounding box. When all regions are scored, non-maximum suppression is applied to remove predicted bounding boxes that overlap with predictions with a higher score. R-CNN is not restricted to any specific segmentation method or a specific CNN architecture. In [10], a segmentation method called selective search [43] is used and results are demonstrated for the CNN architectures presented in [20] and [38].

Figure 2.3.1: Schematic of the R-CNN pipeline.

The authors of R-CNN also showed that supervised pre-training on a similar problem is an effective way to initialize the weights of a CNN. In [10], initial weights of the CNN were obtained by pre-training the CNN to perform image classification on data from ILSVRC2013 [37]. The CNN was then fine-tuned to perform object detection by training it on the Pascal VOC 2012 [8] dataset, which is an object detection dataset. This is a form of transfer learning that has been shown to be an effective approach when adapting CNNs to domains where training data is sparse [42].

2.3.2 Fast R-CNN

A major drawback of the R-CNN method is that it is slow. This is primarily due to each region proposal being passed through the CNN separately, which is time consuming. In order to increase the speed, an object detection method called Fast R-CNN was introduced by Girshick in [9]. Fast R-CNN increases the speed of object detection mainly by passing the image through the CNN only once, instead of once for every region, as R-CNN does.

As in R-CNN, a region proposal method first segments the image into category-independent regions, creating regions of interest (RoIs). The whole image is then processed by a CNN that produces convolutional feature maps of the image. Next, for each region proposal, an RoI pooling layer that uses spatial pyramid pooling [14] is applied to the feature maps. This converts each RoI into a fixed-size vector. The feature vector is then processed by fully connected layers that branch out into two different output layers. One of the output layers is a softmax layer that produces estimates for the object classes. The other layer is a bounding box regressor that outputs refined estimates of the bounding boxes for each of the object classes. Figure 2.3.2 shows how an image is processed by Fast R-CNN.

Figure 2.3.2: Schematic of the Fast R-CNN pipeline.

Another advantage of Fast R-CNN is that it can be trained in a single stage, as opposed to R-CNN, which requires its modules to be trained separately. Fast R-CNN accomplishes this by using a single loss function to account for both classification and bounding box regression simultaneously. This loss function enables Fast R-CNN to jointly train classification and bounding box regression, and thus the whole network can, except for the region proposals, be trained end-to-end.
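To make the RoI pooling step concrete, the short sketch below uses torchvision's built-in roi_pool operator. This is an illustration only, not the implementation referenced above or the one used in this thesis, and the feature map size and box coordinates are made-up values.

import torch
from torchvision.ops import roi_pool

# One feature map in a batch: (batch, channels, height, width)
feature_maps = torch.randn(1, 256, 50, 50)

# Region proposals as (batch_index, x1, y1, x2, y2), here given directly in
# feature-map coordinates
rois = torch.tensor([[0, 10.0, 10.0, 30.0, 40.0],
                     [0,  5.0, 20.0, 25.0, 45.0]])

# Every region is max-pooled to a fixed 7 x 7 spatial size regardless of its shape
pooled = roi_pool(feature_maps, rois, output_size=(7, 7), spatial_scale=1.0)
print(pooled.shape)  # torch.Size([2, 256, 7, 7])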

2.3.3 Faster R-CNN

Fast R-CNN improved the speed of object detection by forward passing an image through a CNN only once, instead of forward passing every region of interest in an image. For Fast R-CNN, the bottleneck instead lies in the image segmentation methods, which are usually implemented on the CPU. To resolve this, Ren et al. proposed a method called Faster R-CNN in [35]. Faster R-CNN removes the need for CPU computations by introducing the idea of Region Proposal Networks (RPNs). The RPN is a region proposal method that is described further in section 2.3.4.

For each position, the RPN makes several predictions relative to a fixed number of reference boxes; these reference boxes are called anchors. Anchors can be thought of as suggested bounding boxes for each sliding window location. In [35], the anchors are created at 3 scales and with 3 different aspect ratios, giving a total of 9 different anchors for each location. This means that the RPN produces 9 bounding boxes at every sliding window location, one for each anchor.

Figure 2.3.3: The Faster R-CNN network.

The regions generated by the RPN are then used as region proposals in Fast R-CNN, which was described in section 2.3.2. By using an RPN, Faster R-CNN removes the time-consuming image segmentation that was needed in Fast R-CNN. The speed is further increased by using a single CNN for both the RPN and Fast R-CNN. This also means that Faster R-CNN can be trained end-to-end by first training the RPN to propose regions and then using the region proposals to train Fast R-CNN.

2.3.4 Region Proposal Networks

Region Proposal Networks generate region proposals by sliding a small network over convolutional feature maps. At each location, the small network takes a window of the convolutional feature maps and converts it to a feature vector. This feature vector is then input into two different fully connected layers: one layer performs bounding box regression, and the other is a classification layer that predicts an objectness score. The objectness score is a prediction of how likely it is that the predicted bounding box contains an object compared to just being background.
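As an illustration of the anchor scheme described above, the following sketch generates 9 anchors (3 scales × 3 aspect ratios) for a single sliding-window position. The specific scale and ratio values are assumptions made for the example rather than the exact configuration of [35].

import numpy as np

def generate_anchors(center_x, center_y, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return anchors as (x1, y1, x2, y2), one per scale/ratio combination."""
    anchors = []
    for scale in scales:        # scale is the square root of the anchor area
        for ratio in ratios:    # ratio is interpreted as height / width
            w = scale / np.sqrt(ratio)
            h = scale * np.sqrt(ratio)
            anchors.append([center_x - w / 2, center_y - h / 2,
                            center_x + w / 2, center_y + h / 2])
    return np.array(anchors)

# 3 scales x 3 aspect ratios = 9 anchors for this sliding-window position
print(generate_anchors(300, 200).shape)  # (9, 4)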

2.3.5 Mask R-CNN

Mask R-CNN is an extension of Faster R-CNN that was presented by He et al. in [16]. In addition to object detection, Mask R-CNN is also able to perform object instance segmentation. Segmentation is implemented by adding a third branch to Faster R-CNN; this branch outputs an object mask for each detected object. To improve segmentation, a method called RoIAlign is introduced in order to extract a more precise feature map for each RoI. RoIAlign computes exact values of the feature map using bilinear interpolation instead of quantizing the feature map.

The authors of [16] found that Mask R-CNN achieves a higher average precision than Faster R-CNN in object detection. This was shown to be partially due to the use of RoIAlign and partially due to the multi-task loss used to train Mask R-CNN. Mask R-CNN is trained with a multi-task loss function that simultaneously accounts for classification, bounding box regression, and object segmentation.

2.3.6 YOLO

Redmon et al. introduced a novel approach to object detection in [34] called YOLO, You Only Look Once. Unlike R-CNN and its successors, YOLO does not use any region proposal method and instead uses a single CNN to predict both bounding boxes and classes.

In YOLO, an input image is first split into an S × S grid. Each grid cell is then responsible for predicting B bounding boxes as well as a confidence score for every bounding box. The confidence score is calculated as Pr(Object) · IoU^gt_pred, where Pr(Object) is the predicted probability that the box contains an object and IoU^gt_pred is the estimated intersection over union (IoU) between the predicted box and a ground truth box. For each grid cell, C object class probabilities are also predicted; these probabilities are conditioned on the cell containing an object. The predicted boxes and class probabilities are then combined into a single score for each class and box. Equation 2.3.1 is taken from the introduction of YOLO in [34] and shows how the class predictions and box predictions are combined. As in the original paper [34], Pr(Class_i) is used as a simplified notation for Pr(Class_i, Object).

Pr(Class_i | Object) · Pr(Object) · IoU^gt_pred = Pr(Class_i) · IoU^gt_pred        (2.3.1)

The score accounts both for the probability that the box contains class i, Pr(Class_i), and for how well the predicted box is estimated to fit a ground truth box, IoU^gt_pred. Figure 2.3.4 shows how an image is split into a grid and how the cell with the red dot centered in it predicts two different bounding boxes. The predicted bounding boxes are then combined with class probabilities, which are also obtained from the image grid, to produce the final object detections. The illustration is kept simple so that it is easier to understand; in reality there would be many more objects predicted by the grid.

Figure 2.3.4: Illustration of the grid used in YOLO.

The above-described procedure is realized by a custom CNN architecture inspired by GoogLeNet [40]. The custom CNN consists of 24 convolutional layers with 2 fully connected layers at the end. Each predicted bounding box is defined by 5 values: (x, y, width, height) and a confidence score. This means that the predictions output from the CNN are represented by an S × S × (B · 5 + C) tensor, where S × S is the grid size, B is the number of predicted boxes per grid cell, and C is the number of object classes. In [34] they use S = 7, B = 2 and C = 20, yielding a final prediction with the shape 7 × 7 × 30.
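A rough sketch of how such a prediction tensor can be interpreted is given below; it combines the box confidences with the conditional class probabilities as in equation 2.3.1. The layout of the last dimension is an assumption made for this example, and the tensor itself is random rather than a real network output.

import numpy as np

S, B, C = 7, 2, 20
prediction = np.random.rand(S, S, B * 5 + C)               # stand-in for a network output

boxes       = prediction[..., :B * 5].reshape(S, S, B, 5)  # (x, y, w, h, confidence) per box
class_probs = prediction[..., B * 5:]                      # Pr(Class_i | Object) per cell

# Combine each box confidence with the cell's class probabilities (equation 2.3.1)
confidences  = boxes[..., 4]                               # shape (S, S, B)
class_scores = confidences[..., np.newaxis] * class_probs[:, :, np.newaxis, :]
print(class_scores.shape)  # (7, 7, 2, 20): one score per cell, box and class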

2.3.7 YOLOv2

With the intention of improving YOLO, Redmon et al. proposed a method called YOLOv2 in [32]. YOLOv2 is a modified version of YOLO intended to increase both speed and accuracy.

Similarly to Faster R-CNN, YOLOv2 utilizes anchors when predicting bounding boxes. For each grid cell, YOLOv2 produces bounding boxes by predicting offsets to 5 anchors. Classes are now also predicted for each anchor instead of for each grid cell, and each anchor is also given an objectness score. As in YOLO, classes are predicted on the condition that there is an object, Pr(Class_i | Object). Objectness is calculated as the estimated IoU between the predicted box and an estimated ground truth box, IoU^gt_pred. YOLOv2 also employs a new method to determine anchor sizes; instead of hand-picking the anchors as in Faster R-CNN, YOLOv2 uses k-means clustering on the training data to produce anchors that are better fitted to the data.

In order to increase speed, a CNN architecture called Darknet-19 is introduced in YOLOv2. Darknet-19 is able to achieve higher image classification accuracy than both the widely used VGG-16 [38] and the custom network previously used in YOLO [34]. It manages to do this while only using 5.58 · 10^9 floating-point operations per forward pass, compared to 30.69 · 10^9 operations in VGG-16 and 8.52 · 10^9 operations in the network previously used in YOLO.

2.3.8 YOLOv3

YOLOv3 includes further improvements of YOLOv2, presented by Redmon et al. in [33]. Similarly to the feature pyramid networks described in section 2.3.9, boxes are predicted at 3 different scales in YOLOv3. This increases YOLOv3's ability to detect small objects, something which previous versions of YOLO struggled with.

Inspired by the residual networks presented in [15], Darknet-19 is expanded to include residual layers. This new CNN architecture is called Darknet-53 since it has 53 convolutional layers in total. Compared to Darknet-19, Darknet-53 has higher accuracy but is a bit slower.

2.3.9 Feature Pyramid Network

Classic object detection techniques based on hand-crafted features such as SIFT [26] and HOG [7] often use feature pyramids to detect objects at different scales. However, due to the large amount of memory needed to train a CNN with feature pyramids, methods such as R-CNN, Fast R-CNN and YOLO do not use feature pyramids. This changed when Lin et al. proposed a method in [23] that enables the use of feature pyramids by utilizing the pyramidal feature hierarchies created by CNNs. The proposed method is called Feature Pyramid Network (FPN) and can be applied to any CNN architecture. Examples of algorithms that use FPNs are Faster R-CNN, Mask R-CNN and RetinaNet.

To generate feature pyramids, FPN initially creates two different feature pyramids and then merges them by adding them to each other. To do this, feature maps generated by a CNN are first grouped into stages, where each stage contains all the feature maps that are of the same size. For each stage, the feature maps generated by the deepest layer in that stage are taken to represent a level in one of the feature pyramids. The other feature pyramid is constructed by continuously upsampling the final level of the first pyramid to create a feature pyramid with the same dimensions as the first one. These two feature pyramids are then merged to create the final feature pyramid. The merging is done to combine feature maps with finer detail from the first pyramid with coarser, but semantically stronger, feature maps from the second pyramid.
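A minimal sketch of one such merge step is shown below, assuming PyTorch and illustrative channel counts and sizes; it shows a lateral 1 × 1 projection, the upsampling and the element-wise addition, not the exact architecture of [23].

import torch
import torch.nn as nn
import torch.nn.functional as F

lateral_conv = nn.Conv2d(512, 256, kernel_size=1)   # 1 x 1 projection of the finer stage

coarse_level = torch.randn(1, 256, 13, 13)          # semantically stronger, lower resolution
finer_stage  = torch.randn(1, 512, 26, 26)          # finer detail from an earlier backbone stage

upsampled = F.interpolate(coarse_level, scale_factor=2, mode="nearest")
merged    = lateral_conv(finer_stage) + upsampled   # element-wise addition of the two pyramids
print(merged.shape)  # torch.Size([1, 256, 26, 26])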

2.3.10 Single Shot Detector

The Single Shot Detector (SSD) is a single-stage detector presented by Liu et al. in [25]. SSD adds convolutional layers to an existing CNN in order to produce layers with feature maps of smaller size. By then creating predictions at several different layers, SSD is able to detect objects at multiple scales. At each layer, object detections are produced by applying a number of 3 × 3 kernels to every position of the feature maps. This procedure is illustrated in figure 2.3.6, which shows how SSD produces predictions from multiple different feature maps.

All predictions are made relative to reference bounding boxes, which are called default boxes in [25]; default boxes are analogous to the anchor boxes used in Faster R-CNN. For each position and bounding box, a specific 3 × 3 kernel predicts a single output value denoting either a class score or an offset for the bounding box. This means that the total number of filters applied to a position of a feature map is (C + 4)B, where C is the number of classes and B is the number of default boxes. The total number of outputs for a feature map of size M × N is then (C + 4)BMN.

Figure 2.3.5: Merging of feature pyramids.

2.3.11 RetinaNet

The single-stage approach utilized in detectors such as YOLO and SSD enabled faster object detection. However, these single-stage detectors were not able to achieve the accuracy that two-stage detectors such as Faster R-CNN could offer. Lin et al. found that class imbalance in the training data was the principal cause for this. This class imbalance is a result of the large number of bounding boxes that a single-stage detector processes. The vast majority of these bounding boxes will be easy negatives, i.e. bounding boxes that can easily be classified as not containing an object. This imbalance makes training inefficient and can create models that do not work as intended. To address this, Lin et al. introduced a loss function called Focal Loss [24] and the single-stage object detector RetinaNet that utilizes Focal Loss.

Focal Loss

Focal Loss stems from the cross entropy (CE) loss for binary classification. The cross entropy is described in equation 2.3.2, where p is the predicted probability that an observation belongs to a certain class and y ∈ {0, 1} is the ground truth, which is 1 if the observation belongs to the class and 0 otherwise.

CE(p, y) = −(y log(p) + (1 − y) log(1 − p))        (2.3.2)

Figure 2.3.6: SSD predictions from multiple feature maps.

By defining p_t as in equation 2.3.3, equation 2.3.2 can be rewritten as CE(p, y) = CE(p_t) = −log(p_t).

p_t = { p        if y = 1
      { 1 − p    if y = 0        (2.3.3)

In order to down-weight the effect easy negatives have on the training procedure, a modulating factor (1 − p_t)^γ is introduced. Easy negatives will have p_t ≈ 1 and thus (1 − p_t)^γ ≈ 0. The focusing parameter γ is used to tune the down-weighting effect of the modulating factor; increasing γ reduces the impact easy negatives have during training.

In focal loss, a weighting factor α_t ∈ [0, 1] is also used to balance the importance of negative and positive samples; equation 2.3.4 describes how α_t is defined. Positive samples are samples containing an object and negative samples are samples that do not contain objects.

α_t = { α        for positive samples
      { 1 − α    for negative samples        (2.3.4)

Focal loss combines both the modulating factor and the weighting factor described above, which means that the focal loss function is defined as in equation 2.3.5 below.

FL(p_t) = −α_t (1 − p_t)^γ log(p_t)        (2.3.5)

RetinaNet

To make use of focal loss, Lin et al. developed an object detector called RetinaNet. RetinaNet is a single-stage detector consisting of a backbone network and two smaller subnetworks. The backbone network first generates feature maps at different scales; this is done using an FPN, which was described in section 2.3.9.

RetinaNet also utilizes anchors: at every level of the feature pyramid, each spatial position has 9 anchors. A classification subnetwork then predicts object existence probabilities for each class in each anchor. This classification subnetwork uses focal loss as its loss function. Parallel to the classification subnetwork is a bounding box regression subnetwork that produces bounding box offsets for each anchor. The bounding box regression network is wholly separated from the classification network and does not make use of class probabilities when predicting bounding box offsets.
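To make the loss in equation 2.3.5 concrete, the sketch below implements the binary focal loss; the default values α = 0.25 and γ = 2 follow [24], but the function is an illustration rather than the exact implementation used in RetinaNet.

import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-7):
    """p: predicted probability of the positive class, y: ground truth label (0 or 1)."""
    p_t     = np.where(y == 1, p, 1.0 - p)           # equation 2.3.3
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)   # equation 2.3.4
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t + eps)

# An easy negative (p close to 0 for a background sample) contributes almost nothing,
# while a confidently wrong sample keeps a large loss
print(focal_loss(np.array([0.05, 0.9]), np.array([0, 0])))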

2.4 Tracking Algorithms

The following section presents the theory behind the two tracking algorithms considered in this thesis: SORT and Deep SORT. As described in the limitations, these two are especially suited to answering the second research question without broadening the scope too much.

2.4.1 SORT

Simple Online and Realtime Tracking, SORT, is a tracking algorithm that was introduced by Bewley et al. in [4]. SORT is designed to perform multiple object tracking (MOT) in a tracking-by-detection system. In order to achieve real-time processing, SORT is intentionally kept simple and avoids complex and time-consuming tasks. To compensate for its lack of complexity, SORT instead relies on the more accurate object detections provided by CNN-based object detectors.

For each new frame, SORT first propagates already tracked objects into the current frame. The new positions of these already tracked objects are predicted using a Kalman filter [19] with a linear constant velocity model. Next, an object detection algorithm detects the objects present in the current frame. These detected objects are then compared to the already tracked objects and a cost matrix is created. The entries of this cost matrix are computed from the IoU between each detection and each of the already tracked objects. Detections are then assigned to already tracked objects using the Hungarian method [21]. A new track is created when an object is detected in several consecutive frames while not overlapping with any of the already tracked objects. Figure 2.4.1 below shows how predicted positions are compared to object detections to assign identities in new frames. SORT, as it is used in this thesis, does not have any memory, and a tracked object is lost if SORT fails to detect it in a frame.
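The assignment step can be sketched as follows, using SciPy's linear_sum_assignment (an implementation of the Hungarian method). The cost values and the IoU threshold are made up for illustration, and the Kalman prediction step is omitted.

import numpy as np
from scipy.optimize import linear_sum_assignment

# Cost matrix of (1 - IoU) between Kalman-predicted track boxes (rows) and new
# detections (columns); the values below are made up for illustration
cost = np.array([[0.15, 0.90, 0.95],    # track 0 overlaps detection 0 strongly
                 [0.85, 0.20, 0.97]])   # track 1 overlaps detection 1 strongly

track_idx, det_idx = linear_sum_assignment(cost)   # Hungarian method
for t, d in zip(track_idx, det_idx):
    if 1.0 - cost[t, d] >= 0.3:                    # discard assignments with too little overlap
        print(f"track {t} -> detection {d} (IoU = {1.0 - cost[t, d]:.2f})")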

2.4.2 Deep SORT

Built with the intention of reducing the number of identity switches, Deep SORT incorporates appearance information into the tracking procedure presented in SORT [44]. Similar to SORT, Deep SORT handles state estimation with a Kalman filter. Deep SORT differs from SORT in that it makes use of additional techniques when assigning detections to already tracked objects.

Figure 2.4.1: Object association between frames in SORT.

Deep SORT utilizes two different distance metrics when comparing detections to already tracked objects: the Mahalanobis distance [13] and the cosine distance between appearance descriptors. The Mahalanobis distance measures how the position of a new detection differs from the predicted positions of the already tracked objects, in terms of standard deviations of the tracked objects' state distributions. This metric allows Deep SORT to avoid assigning a new detection to an already existing track where the frame-to-frame motion would be unreasonable. Appearance descriptors are computed by forwarding each bounding box through a CNN that has been pre-trained on a person re-identification dataset. The appearance descriptor of each new detection is then compared to the appearance descriptors of already tracked objects by calculating the cosine distance between descriptors. Tracked objects and their appearance descriptors are also saved for 30 frames after they are lost, so that Deep SORT has the ability to resume tracking identities that have been lost for a number of frames. Using appearance descriptors in this way gives Deep SORT the ability to find a previously tracked object even if it has been occluded for a number of frames.
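The appearance comparison can be illustrated with a small sketch of the cosine distance between two descriptors; the 128-dimensional vectors below are random stand-ins for real CNN embeddings.

import numpy as np

def cosine_distance(a, b):
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return 1.0 - float(np.dot(a, b))

track_descriptor     = np.random.rand(128)   # stored descriptor of a tracked person
detection_descriptor = np.random.rand(128)   # descriptor of a new detection

# Small distances indicate visually similar objects and make an association more likely
print(cosine_distance(track_descriptor, detection_descriptor))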

3 Method

This chapter covers the methodology used to perform the tests and evaluations for the thesis. The chapter begins with a section describing the data annotation method employed to annotate the test data. The second section covers the metrics used to evaluate performance, and the chapter then continues with an overview of the test environment, where different practical aspects of the tracking-by-detection system are explained. Finally, the last section of this chapter presents the algorithms and implementations tested in this thesis.


3.1 Data Annotation

The test data consists of video from two different surveillance cameras. One camera overlooks a platform of an underground train station; the other camera overlooks a stairway leading down to the platform of that same station. The video sequences are 10.6 and 12.8 seconds long respectively; both videos have 10 frames per second, giving a total of 234 frames. Frames in both videos have a resolution of 1280 × 720 pixels.

Ground truth annotations are made following the protocol presented in MOT16 [30], using Microsoft's application Visual Object Tagging Tool [29]. Due to the nature of the test data, only a single class, person, is annotated. There are other objects present in the videos, such as trains and bags, but these objects are not relevant in the context of this thesis. Objects are annotated if it is clear from the current frame alone that the object exists, which means that fully occluded objects are not annotated. If an object is partly occluded, its full extent is estimated and the object is annotated accordingly. Bounding boxes are always fitted as tightly as possible to the annotated object while still containing all of the object. There is a total of 2319 ground truth bounding boxes annotated over the two test videos, distributed over 36 unique identities.

Each person is given a single identity throughout the whole sequence, even if the person is occluded in parts of the sequence. The occlusions that occur are temporary, and no person disappears for more than a couple of seconds. Handling such short-term occlusions is seen as a part of the tracking problem rather than as person re-identification. Therefore, these occlusions were considered relevant for this thesis and accounted for when annotating the data. Figure 3.1.1 below shows an example of what annotated frames look like for both sequences.

Figure 3.1.1: Annotated frames for both sequences.

3.2 Evaluation

Performance is measured according to the framework presented in MOT16 [30] and in the same manner as performance is measured in the MOTChallenge¹. The authors of [30] provide publicly available code² for evaluation; this code is used to calculate the different performance metrics. MOT16 was chosen since it is a compilation of many other metrics, developed in an attempt to standardize the evaluation of multiple object tracking. MOT16 contains a wide array of metrics for the evaluation of multiple object tracking, some of which are quite similar to each other. Hence, metrics that were considered too similar to others have not been included in this thesis.

¹ https://motchallenge.net/
² https://bitbucket.org/amilan/motchallenge-devkit/overview

3.2.1 Classification of Predicted Bounding Boxes

The fundamental performance metric is the classification of bounding boxes. Table 3.2.1 below shows the different classes that a bounding box can be assigned. A predicted bounding box is considered a true positive (TP) if its intersection over union (IoU), or Jaccard index [18], with a ground truth box is larger than 0.5. Equation 3.2.1 shows how the IoU between a predicted box P and a ground truth box G is calculated. False positives (FP) are predicted bounding boxes without a corresponding ground truth box, and false negatives (FN) are ground truth boxes that the algorithm fails to detect. True negatives (TN) are irrelevant in this context since object detection algorithms do not produce any predictions on whether objects are absent. For all performance metrics defined in subsequent sections, TP, FP, and FN are used as shorthand notations for the number of bounding boxes that have been labeled as belonging to each class. GT is also used as a notation for the total number of ground truth bounding boxes that exist.

IoU(P, G) = |P ∩ G| / |P ∪ G| = |P ∩ G| / (|P| + |G| − |P ∩ G|)        (3.2.1)

                           Prediction: Object (positive)      Prediction: Background (negative)
Ground truth: Object       TP (True Positive):                FN (False Negative):
(positive)                 correctly labeled as object        incorrectly labeled as background
Ground truth: Background   FP (False Positive):               TN (True Negative):
(negative)                 incorrectly labeled as object      correctly labeled as background

Table 3.2.1: Classification of bounding boxes.
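The sketch below illustrates how predictions can be labeled as TP or FP against ground truth boxes using the IoU criterion from equation 3.2.1 with a 0.5 threshold. Boxes are given as (x1, y1, x2, y2); the greedy one-to-one matching and the box values are simplifications made for the example and do not reproduce the MOT16 evaluation code.

def iou(p, g):
    ix1, iy1 = max(p[0], g[0]), max(p[1], g[1])
    ix2, iy2 = min(p[2], g[2]), min(p[3], g[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (p[2] - p[0]) * (p[3] - p[1]) + (g[2] - g[0]) * (g[3] - g[1]) - inter
    return inter / union

def classify_predictions(predictions, ground_truths, threshold=0.5):
    """Label each prediction as TP or FP and count unmatched ground truths as FN."""
    unmatched_gt = list(ground_truths)
    tp, fp = 0, 0
    for pred in predictions:
        match = next((g for g in unmatched_gt if iou(pred, g) > threshold), None)
        if match is not None:
            tp += 1
            unmatched_gt.remove(match)   # each ground truth box can be matched only once
        else:
            fp += 1
    return tp, fp, len(unmatched_gt)     # (TP, FP, FN)

print(classify_predictions([(0, 0, 10, 10), (50, 50, 60, 60)], [(1, 1, 11, 11)]))  # (1, 1, 0)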

3.2.2 Object Detection Evaluation

The MOT16 framework includes several metrics that can be used to evaluate object detection algorithms. Below are the metrics that this thesis uses to measure performance:

• Recall [12]: Recall is measured as the ratio between correctly detected objects and the total number of ground truth objects. Thus, the algorithm's recall reflects its ability to find ground truth objects.

  Recall = TP / (TP + FN) · 100        (3.2.2)

• Precision (Prcn) [12]: Precision describes the accuracy of the predicted bounding boxes. It is calculated as the ratio between correctly predicted bounding boxes and the total number of predicted bounding boxes.

  Precision = TP / (TP + FP) · 100        (3.2.3)

• F1 score (F1) [12]: The F1 score combines recall and precision into a single score by calculating the harmonic mean of the two.

  F1 = 2TP / (2TP + FP + FN) · 100        (3.2.4)

• Average Precision (AP) [28]: Average precision is calculated as the area under the precision-recall curve. This precision-recall curve is created by first sorting all predictions in descending order according to their confidence. Starting with the most confident prediction, precision can then be plotted against recall by iteratively calculating cumulative precision and recall at different ranks in the now ordered set of predictions. Figure 3.2.1 and table 3.2.2 show an example of a precision-recall curve for a case with a total of 3 ground truth objects and 5 predicted objects; AP is calculated as the area under the curve (a short code sketch reproducing this example is given at the end of this list).

Rank   Conf    Label   Precision   Recall
1      0.987   TP      1.0         0.33
2      0.934   FP      0.5         0.33
3      0.887   TP      0.67        0.67
4      0.764   FP      0.5         0.67
5      0.564   TP      0.6         1.0

Table 3.2.2: Example predictions

Figure 3.2.1: Precision-recall curve for the example predictions in table 3.2.2 (precision on the vertical axis, recall on the horizontal axis).

• Multiple Object Detection Precision (MODP) [39]: A metric that measures the overlap between predicted bounding boxes and ground truth data. MODP is calculated as the IoU between predicted bounding boxes and ground truth bounding boxes.

  MODP = ( Σ_{k ∈ frames} Σ_{i ∈ objects} IoU(P_i^k, G_i^k) / GT ) · 100        (3.2.5)

• Multiple Object Detection Accuracy (MODA) [39]: MODA measures the accuracy of predictions by looking at missed ground truth boxes and false positives.

  MODA = (1 − (FN + FP) / GT) · 100        (3.2.6)

• Frames Per Second (FPS): FPS is a metric for comparing the speed of object detection algorithms. It is calculated as the ratio between the number of frames processed and the time it takes to run the algorithm.

  FPS = #frames / total runtime        (3.2.7)
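As indicated above, the worked example from table 3.2.2 can be reproduced with a short script. The sketch below computes cumulative precision and recall over the ranked predictions and approximates AP as the area under the sampled curve with the trapezoidal rule; it is an illustration only, the exact AP definition varies between benchmarks, and the MOT16 toolkit is used for the actual evaluation.

import numpy as np

labels   = ["TP", "FP", "TP", "FP", "TP"]   # predictions already sorted by confidence
total_gt = 3                                # number of ground truth objects

tp_cum = np.cumsum([1 if label == "TP" else 0 for label in labels])
fp_cum = np.cumsum([1 if label == "FP" else 0 for label in labels])

precision = tp_cum / (tp_cum + fp_cum)      # cumulative precision at each rank
recall    = tp_cum / total_gt               # cumulative recall at each rank

tp, fp = tp_cum[-1], fp_cum[-1]
fn = total_gt - tp
print("Recall   :", 100 * tp / (tp + fn))               # equation 3.2.2
print("Precision:", 100 * tp / (tp + fp))               # equation 3.2.3
print("F1       :", 100 * 2 * tp / (2 * tp + fp + fn))  # equation 3.2.4
print("AP (approx.):", np.trapz(precision, recall))     # area under the sampled curve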

3.2.3 Object Tracking Evaluation

Tracking is also evaluated in accordance with the MOT16 guidelines. The list below describes the tracking-specific metrics that this thesis considers. Many object detection metrics can also be used to evaluate tracking performance; those metrics are assumed to have the same definition as described in section 3.2.2 unless stated otherwise.

• Identification Recall (IDR) and Identification Precision (IDP) [36]: IDR and IDP are similar to the metrics Recall and Precision for object detection. The metrics will however differ, since objects are considered tracked only if they can be assigned an identity, which will not be the case for all detected objects. Another difference is that inconsistencies in identity assignments will lower the IDTP score. For each ground truth identity, the predicted identity most similar to it is found. Any other identity assigned to that ground truth identity is then considered a mismatch (IDFP) and will be counted as a false positive instead of a true positive.

  IDR = IDTP / (IDTP + IDFN) · 100        (3.2.8)
  IDP = IDTP / (IDTP + IDFP) · 100        (3.2.9)

• IDF1 score (IDF1) [36]: Similar to the F1 score for object detection, IDF1 combines both IDR and IDP into a single score to facilitate comparisons of different trackers.

  IDF1 = 2·IDTP / (2·IDTP + IDFP + IDFN) · 100        (3.2.10)

• Mostly Tracked (MT) [30]: The number of ground truth identities that are tracked for 80% or more of their existence.

• Partly Tracked (PT) [30]: The number of ground truth identities that are tracked between 20% and 80% of their existence.

• Mostly Lost (ML) [30]: The number of ground truth identities that are tracked for less than 20% of their existence.

• Identity Switches (IDs) [30]: The number of identity switches. An identity switch is counted every time an already tracked ground truth identity is assigned a new tracking identity.

• Track Fragmentations (FM) [30]: The number of track fragmentations. A track fragmentation is counted every time a tracked ground truth identity is lost and then found again in a later frame.

• Multiple Object Tracking Accuracy (MOTA) [3]: MOTA combines false negatives, false positives and identity switches into a single score in order to express overall performance with a single value.

  MOTA = (1 − (FN + FP + IDs) / GT) · 100        (3.2.11)

• Multiple Object Tracking Precision (MOTP) [3]: MOTP measures how well correctly predicted bounding boxes (TP_i) fit their respective ground truth boxes (GT_i). This is done by calculating the average overlap between true positives and their corresponding ground truth objects.

  MOTP = ( Σ_i IoU(TP_i, GT_i) / GT ) · 100        (3.2.12)

3.3 Test Environment

A tracking-by-detection system developed in a previous project at NFC, as a part of the course Image and Graphics, Project Course CDIO, is used as the test environment for all tests in this thesis. The system is able to perform tracking-by-detection, person re-identification and object instance segmentation, though only the tracking-by-detection functionality is relevant for this thesis.

As described in section 2.1, the tracking-by-detection module consists of two parts: an object detection algorithm and a tracking algorithm. The object detection algorithm takes a video sequence as input and outputs a CSV file that describes the objects detected in each frame. The CSV file is constructed following the MOT16 guidelines so that the results can be evaluated using the MOT16 protocol; table 3.3.1 shows examples of values that rows in the CSV file can contain. The property id is always set to -1, since object detection algorithms do not assign identities to detected objects. xmin, ymin, width, and height define the bounding box of the detected object, and the confidence with which the detection is made is described by the property conf. x, y, and z are values used to evaluate 3-dimensional object detection within the MOT16 framework; they are not relevant for this thesis and are always set to -1. Only objects of the class person are considered, since that is the only class annotated in the test data.

frame   id   xmin     ymin     width   height   conf    x    y    z
1       -1   699.66   174.56   88.64   253.76   0.978   -1   -1   -1
1       -1   587.2    65.93    49.26   122.81   0.672   -1   -1   -1
2       -1   704.04   175.93   88.33   256.92   0.956   -1   -1   -1

Table 3.3.1: Examples of rows in the CSV file output from the object detection algorithm.

The tracking algorithm then takes as input both the CSV file from the object detection algorithm and the video sequence. It performs multiple object tracking and produces its own CSV file as output. This CSV file has the same format as the CSV file produced by the object detection algorithm, except that each object has now been assigned an identity. Table 3.3.2 shows an example of a few rows of this CSV file; x, y, and z are, as before, static.

frame   id   xmin     ymin     width   height   conf    x    y    z
1       1    699.66   174.56   88.64   253.76   0.978   -1   -1   -1
1       2    587.2    65.93    49.26   122.81   0.672   -1   -1   -1
2       1    704.04   175.93   88.33   256.92   0.956   -1   -1   -1

Table 3.3.2: Examples of rows in the CSV file output from the tracking algorithm.
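Writing detections in this format is straightforward; the sketch below is an illustration with made-up values, following the column order of tables 3.3.1 and 3.3.2.

import csv

detections = [
    # (frame, xmin, ymin, width, height, conf)
    (1, 699.66, 174.56, 88.64, 253.76, 0.978),
    (1, 587.20,  65.93, 49.26, 122.81, 0.672),
]

with open("detections.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for frame, xmin, ymin, width, height, conf in detections:
        # id, x, y and z are set to -1, as described above
        writer.writerow([frame, -1, xmin, ymin, width, height, conf, -1, -1, -1])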

The system in place at NFC currently uses YOLOv3 [33] as the object detection algorithm and Deep SORT [44] as the tracking algorithm. All tests for this thesis are done by replacing either the object detection algorithm or the tracking algorithm; this is possible since the object detection algorithm and the tracking algorithm are completely separated from each other. The CSV files that the system outputs are then used to measure the performance of different object detection and tracking algorithms. Figure 3.3.1 shows the main parts of the tracking-by-detection system and how each part outputs a CSV file.

Figure 3.3.1: Scheme of the tracking-by-detection pipeline with its outputs (video sequence → object detection algorithm → CSV; CSV and video sequence → tracking algorithm → CSV).

3.3.1 Testing Object Detection Algorithms

Object detection algorithms are evaluated in two different ways: as stand-alone object detection algorithms and on how they perform in a tracking-by-detection system. Therefore, performance metrics for object detection and object tracking are both used to evaluate object detection algorithms. An algorithm's performance in object detection is likely highly correlated with how it performs in a tracking-by-detection system. It is however possible that certain characteristics of an object detection algorithm interact well with a specific type of tracker, which is why object detection algorithms are also evaluated in the tracking-by-detection system.

3.3.2 Testing Object Tracking Algorithms

One of the objectives of this thesis is to investigate how the use of visual descriptors in the tracking algorithm affects the performance of a tracking-by-detection system. There is a myriad of different tracking algorithms available, some of which use visual descriptors and some of which do not. This thesis is however mainly concerned with object detection algorithms, and for that reason only two different tracking algorithms are tested: SORT and Deep SORT. These two are especially suited to studying how the use of visual descriptors affects performance, since Deep SORT is an extension of SORT in which the usage of visual descriptors has been incorporated. Both algorithms are evaluated using the performance metrics described in section 3.2. The tracking algorithms are also tested with ground truth object detections as input. The reason for doing this is that it gives an insight into how much error the tracking algorithm introduces and thus provides an upper limit for how much a better object detection algorithm can improve the performance of a tracking-by-detection system.

3.3.3 Hardware

All tests are performed on the same machine in order to be able to make valid speed comparisons of different algorithms. The computer's CPU is an Intel Xeon Silver 4108³ processor with a clock speed of 1.8 GHz. NVIDIA's Quadro P4000⁴ is used as the GPU of the computer.

³ https://ark.intel.com/content/www/us/en/ark/products/123544/intel-xeon-silver-4108-processor-11m-cache-1-80-ghz
⁴ https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/productspage/quadro/quadro-desktop/quadro-pascal-p4000-data-sheet-us-nvidia-704358-r2-web

3.4 Algorithms and Implementations

Implementations of tested algorithms should preferably come from the authors who initially presented the algorithm, so that the implementation stays as true to the cited paper as possible. When possible, such implementations are used for the tests in this thesis. It is however not always feasible to integrate those implementations into the test environment, and therefore some non-original implementations have also been tested.

The system in place at NFC uses pre-trained weights for both YOLOv3 and Deep SORT; these weights are supplied with the implementations. Similarly, this thesis utilizes the pre-trained weights supplied with the implementations of the different object detection algorithms and of Deep SORT. To make comparisons fair, only pre-trained weights trained on the Microsoft COCO [22] dataset are used. Microsoft COCO was chosen since it is an extensive dataset that is often used as a benchmark when comparing algorithms [33][35][16]. Part of the reason why only pre-trained weights are used is the lack of annotated data. The small amount of data that was annotated in this thesis was deemed to be more useful as test data than as training data.

A short script is created for each object detection algorithm in order to integrate it into the tracking-by-detection system. The script feeds the test videos into the object detection algorithm and then converts its output to a CSV file in the format specified in section 3.3. The different scripts have a similar overall structure but differ in the details, since they have to be tailored to fit each implementation.
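As a rough illustration of what such a script can look like, the sketch below feeds a video through a detector and writes one CSV row per detection. The run_detector callable, the use of OpenCV for decoding, and the exact column layout (frame number, a dummy identity, bounding box, confidence) are assumptions made for the example, not the precise format defined in section 3.3.

```python
import csv

import cv2  # assumed available for decoding the test videos


def detections_to_csv(video_path, csv_path, run_detector):
    """Run a detector on every frame of a video and dump its output as CSV.

    `run_detector(frame)` is a placeholder for the implementation-specific call
    that returns (x, y, width, height, confidence) tuples for one frame.
    """
    capture = cv2.VideoCapture(video_path)
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        frame_idx = 0
        while True:
            ok, frame = capture.read()
            if not ok:
                break
            frame_idx += 1
            for (x, y, w, h, conf) in run_detector(frame):
                # -1 as a dummy identity: detections have no ID before tracking.
                writer.writerow([frame_idx, -1, x, y, w, h, conf])
    capture.release()
```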

3.4.1 Deep Learning Libraries

Implementations of different algorithms are built using software libraries for deep learning. The choice of deep learning library could affect both the speed and the performance of an algorithm and is therefore an important aspect when comparing different implementations. The deep learning libraries used for the algorithms tested in this thesis are:

– TensorFlow [1]: A library developed by Google that includes functionality that can be used to create deep learning algorithms such as CNNs.

– PyTorch [31]: Deep learning library developed by Facebook’s AI research group.

³ https://ark.intel.com/content/www/us/en/ark/products/123544/intel-xeon-silver-4108-processor-11m-cache-1-80-ghz
⁴ https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/productspage/quadro/quadro-desktop/quadro-pascal-p4000-data-sheet-us-nvidia-704358-r2-web

– Caffe2⁵: A deep learning framework that has recently been integrated into PyTorch.

– Keras [6]: A high-level deep learning API that can run on top of other li- braries such as TensorFlow.

3.4.2 Implementations

The following list describes the different object detection algorithms and implementations that are evaluated in this thesis:

– Facebook's Detectron⁶ [11]: Detectron is an object detection library developed by Facebook AI Research that includes implementations of several object detection algorithms with different backbones. The library is written in Python, and all tests for this thesis are done with the Caffe2 framework built into PyTorch 1.0. Detectron's implementations of Faster R-CNN, Mask R-CNN and RetinaNet are tested in this thesis. All three algorithms are tested with both ResNet-50 and ResNet-101 as backbone.

– Matterport's Mask R-CNN⁷ [2]: A Keras implementation of Mask R-CNN that runs on TensorFlow 1.13.1 and Keras 2.2.4. The implementation uses a slightly lower learning rate than the 0.02 used in the original paper, and it zero-pads images to resolution 1024 × 1024 instead of dynamically resizing the image as in the original paper [16]. The implementation is written in Python and uses ResNet-101 as backbone.

– Fizyr's RetinaNet⁸: A Keras implementation of RetinaNet running on TensorFlow 1.13.1 and Keras 2.2.4, where ResNet-50 is used as backbone.

– Ayoosh Kathuria's YOLOv3⁹: A PyTorch implementation of YOLOv3 with Darknet-53 as backbone; tests are performed using PyTorch 1.0.

– Pierluigi Ferrari's Single Shot Detector (SSD)¹⁰: A Keras implementation running on top of TensorFlow 1.13.1 with Keras 2.2.4; it uses VGG-16 as backbone for SSD.

Table 3.4.1 shows the configurations of the different object detection algorithms tested as part of this thesis. Some algorithms are tested with multiple image resolutions; when an algorithm is tested with different resolutions it is denoted algorithm:resolution. YOLOv3 tested with images of size 320 × 320 pixels is, for example, called YOLOv3:320.

⁵ https://caffe2.ai/
⁶ https://github.com/facebookresearch/Detectron
⁷ https://github.com/matterport/Mask_RCNN
⁸ https://github.com/fizyr/keras-retinanet
⁹ https://github.com/ayooshkathuria/pytorch-yolo-v3
¹⁰ https://github.com/pierluigiferrari/ssd_keras

Algorithm      Implementation     Backbone
SSD:300        Pierluigi Ferrari  VGG-16
SSD:512        Pierluigi Ferrari  VGG-16
YOLOv3:320     Ayoosh Kathuria    Darknet-53
YOLOv3:416     Ayoosh Kathuria    Darknet-53
YOLOv3:512     Ayoosh Kathuria    Darknet-53
RetinaNet      Fizyr              ResNet-50
Mask R-CNN     Matterport         ResNet-101
Faster R-CNN   Detectron          ResNet-50
Faster R-CNN   Detectron          ResNet-101
Mask R-CNN     Detectron          ResNet-50
Mask R-CNN     Detectron          ResNet-101
RetinaNet      Detectron          ResNet-50
RetinaNet      Detectron          ResNet-101

Table 3.4.1: Table showing the different object detection algorithms tested.

Two different implementations of tracking algorithms are used in this thesis:

– Alex Bewley's SORT¹¹ [4]: A Python implementation of SORT written by the author of the SORT paper [4].

– Nicolai Wojke's Deep SORT¹² [44]: A Python implementation of Deep SORT written by one of the authors of the original paper [44].

When Deep SORT is used in testing, an object has to be detected in two consecutive frames before it is given an identity. Also, if an identity leaves the sequence, Deep SORT saves its position and appearance for 30 frames before it is disregarded. SORT does not have this functionality; it gives detected objects an identity directly and disregards them as soon as they are lost. To remedy this inconsistency, the first two frames in which a previously unseen object is visible are removed from the ground truth CSV file. This ensures that SORT and Deep SORT have the possibility to track the same number of ground truth objects.
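A minimal sketch of this ground-truth preprocessing is shown below. It assumes a MOT-style CSV where the first column is the frame number and the second column is the object identity; the actual file layout used by the system may differ.

```python
import csv
from collections import defaultdict


def drop_first_two_frames(gt_in_path, gt_out_path):
    """Remove the first two frames of every identity from a ground-truth CSV."""
    with open(gt_in_path) as f:
        rows = [row for row in csv.reader(f) if row]

    # Collect the frames in which each identity appears.
    frames_per_id = defaultdict(list)
    for row in rows:
        frames_per_id[row[1]].append(int(row[0]))

    # The two earliest frames of each identity are the ones to discard.
    skip = {obj_id: set(sorted(frames)[:2]) for obj_id, frames in frames_per_id.items()}

    with open(gt_out_path, "w", newline="") as f:
        writer = csv.writer(f)
        for row in rows:
            if int(row[0]) not in skip[row[1]]:
                writer.writerow(row)
```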

¹¹ https://github.com/abewley/sort
¹² https://github.com/nwojke/deep_sort

4 Results

The following chapter will present results for the different object detection and tracking algorithms tested in this thesis. All testing is done on the two video sequences annotated as described in section 3.1, and performance is evaluated using the metrics presented in section 3.2. In order to make the plots easier to read, each configuration of an object detection algorithm is given a unique color. This means that every meta-architecture is first given a general color; YOLOv3 is, for example, yellow. The brightness of the color is then used to indicate either the complexity of the backbone or the image resolution, with brighter colors denoting a less complex backbone or a lower image resolution. Both RetinaNet and Mask R-CNN are tested with two different implementations. To avoid confusion, Detectron's implementations are given a black edge in all two-dimensional plots so that it is easier to distinguish the different implementations from each other.


4.1 Object Detection Results

This section presents results for the different object detection algorithms considered in this thesis. Figure 4.1.1 first shows the average precision achieved by the different object detection algorithms. Average precision is a common metric for comparing object detection algorithms and is therefore first displayed in a sorted graph to give a general overview of how the algorithms compare to each other [33][24][16].

Figure 4.1.3 plots the average precision against the number of frames the algorithm can process per second. This plot is interesting since processing time is a limiting factor for how useful an algorithm is in a surveillance system. Precision and recall are then plotted against each other in figure 4.1.4. The balance between precision and recall is often a design choice and demonstrates central characteristics of an algorithm. Next, figure 4.1.2 displays the F1 score of the different algorithms; this metric also expresses the overall performance of the algorithms.

Last, full results for object detection evaluation using the MOT16 [30] framework are presented in table 4.1.1. This table includes many of the metrics that can be calculated with the publicly available evaluation code¹ for MOT16. The best score for each metric is written in bold font in order to make comparisons easier. The red bars in the cells are used to make it easier to compare algorithms; a larger bar indicates a better score.

Figure 4.1.1: Average precision for different object detection algorithms.

¹ https://bitbucket.org/amilan/motchallenge-devkit/overview

Figure 4.1.2: F1 score for different object detection algorithms.

Figure 4.1.3: Average precision and frames per second.

Figure 4.1.4: Precision and recall plot.

Object Detection Algorithm               AP      Recall  Prcn  F1    TP    FP   FN    MODA  MODP  FPS
SSD:300 (VGG-16)                         0.2706  24.2    97.7  38.8  561   13   1758  23.6  76.7  5.52
SSD:512 (VGG-16)                         0.3578  31.5    94.4  47.2  730   43   1589  29.6  78.3  4.73
YOLOv3:320 (Darknet-53)                  0.6182  63.2    81.2  71.0  1465  339  854   48.6  75.6  22.81
YOLOv3:416 (Darknet-53)                  0.7141  72.1    86.8  78.7  1671  251  648   61.2  78.7  20.82
YOLOv3:512 (Darknet-53)                  0.7185  78.3    89.6  83.5  1815  211  504   69.2  78.7  17.2
RetinaNet (ResNet-50)                    0.6321  66.1    95.2  78.0  1534  78   785   62.8  82.1  4.02
Mask R-CNN (ResNet-101)                  0.7118  79.3    73.6  76.3  1838  658  481   50.9  78.5  1.84
Detectron’s Faster R-CNN (ResNet-50)     0.7881  81.1    77.3  79.2  1881  552  438   57.3  78.5  5.43
Detectron’s Faster R-CNN (ResNet-101)    0.7919  81.2    80.3  80.8  1884  462  435   61.3  79.7  4.32
Detectron’s Mask R-CNN (ResNet-50)       0.7970  82.2    77.5  79.8  1907  555  412   58.3  79.1  2.95
Detectron’s Mask R-CNN (ResNet-101)      0.7936  82.8    80.0  81.3  1919  481  400   62.0  79.8  2.62
Detectron’s RetinaNet (ResNet-50)        0.6327  68.8    95.1  79.8  1595  83   724   65.2  79.8  4.69
Detectron’s RetinaNet (ResNet-101)       0.6315  69.5    94.4  80.1  1612  96   707   65.4  80.7  3.91

Table 4.1.1: Object detection results following the MOT16 protocol.

4.2 Object Tracking Results

Tracking results for SORT and Deep SORT with different object detection algorithms are presented in this section. As described in section 3.3, the tests are performed in a tracking-by-detection system where tracking and detection algorithms are completely separated, which makes it possible to test different combinations of tracking and detection algorithms. Further, ground truth detections are also tested with SORT and Deep SORT in order to give an insight into how much of the error is due to the object detection algorithm and how much of it is due to the tracking algorithm. Tracking performance with ground truth detections should only contain errors introduced by the tracking algorithms and can thereby give an upper limit for how much better the tracking-by-detection system can become by changing object detection algorithm. This upper limit is represented by a brown dashed line in all plots in this section.

Figures 4.2.1 and 4.2.2 first show the IDF1 score for the different object detection algorithms with SORT and Deep SORT respectively. MOTA results are then displayed in figures 4.2.3 and 4.2.4. These plots aim to display the overall performance of each object detector and tracking algorithm. Next, figures 4.2.5 and 4.2.6 plot IDR against IDP for SORT and Deep SORT. This shows how the tracking algorithms balance accuracy and precision with different object detection algorithms. Finally, the full results with many of the metrics obtained using the MOT16 evaluation code² are presented in tables 4.2.1 and 4.2.2.

² https://bitbucket.org/amilan/motchallenge-devkit/overview

Figure 4.2.1: IDF1 score for object detection algorithms with SORT.

Figure 4.2.2: IDF1 score for object detection algorithms with Deep SORT.

Figure 4.2.3: MOTA score for object detection algorithms with SORT.

Figure 4.2.4: MOTA score for object detection algorithms with Deep SORT.

Figure 4.2.5: IDR and IDP score for object detection algorithms with SORT.

Figure 4.2.6: IDR and IDP score for object detection algorithms with Deep SORT.

Object Detection Algorithm               IDF1  IDP   IDR   MT  PT  ML  IDs  FM  MOTA  MOTP
SSD:300 (VGG-16)                         18.7  73.6  10.7  0   11  25  25   41  12.8  77.7
SSD:512 (VGG-16)                         27.2  72.1  16.7  0   16  20  28   44  20.0  78.7
YOLOv3:320 (Darknet-53)                  53.4  67.6  44.2  10  21  5   41   48  48.0  76.5
YOLOv3:416 (Darknet-53)                  60.3  72.1  51.8  12  19  5   36   45  60.2  79.7
YOLOv3:512 (Darknet-53)                  65.2  74.4  58.0  15  18  3   34   49  66.3  79.6
RetinaNet (ResNet-50)                    61.4  81.3  49.3  10  19  7   30   39  56.4  82.2
Mask R-CNN (ResNet-101)                  64.7  69.4  60.5  15  17  4   35   68  55.6  78.7
Detectron’s Faster R-CNN (ResNet-50)     67.1  72.5  62.5  18  15  3   42   53  65.6  78.3
Detectron’s Faster R-CNN (ResNet-101)    64.6  70.6  59.5  14  20  2   38   55  64.6  79.5
Detectron’s Mask R-CNN (ResNet-50)       68.2  73.7  63.4  16  16  4   39   47  65.1  79.5
Detectron’s Mask R-CNN (ResNet-101)      63.9  69.0  59.5  17  14  5   44   60  65.5  79.8
Detectron’s RetinaNet (ResNet-50)        63.9  82.7  52.0  9   20  7   33   46  59.0  80.2
Detectron’s RetinaNet (ResNet-101)       63.6  81.9  51.9  11  19  6   26   42  58.6  80.9
Ground truth detections                  90.6  91.0  90.2  35  1   0   8    8   97.5  93.3

Table 4.2.1: Results for tracking with SORT.

Object Detection Algorithm               IDF1  IDP   IDR   MT  PT  ML  IDs  FM  MOTA  MOTP
SSD:300 (VGG-16)                         34.4  88.2  21.4  2   14  20  11   45  23.2  75.7
SSD:512 (VGG-16)                         38.9  78.1  25.9  1   19  16  12   50  28.2  77.6
YOLOv3:320 (Darknet-53)                  61.5  72.2  53.5  11  21  4   31   54  48.1  75.0
YOLOv3:416 (Darknet-53)                  64.0  72.1  57.6  14  18  4   33   43  60.3  78.1
YOLOv3:512 (Darknet-53)                  74.5  81.9  68.4  17  15  4   25   40  68.6  78.0
RetinaNet (ResNet-50)                    71.2  88.7  59.5  11  19  6   21   33  61.1  80.7
Mask R-CNN (ResNet-101)                  70.0  69.1  71.0  17  16  3   43   71  53.6  77.1
Detectron’s Faster R-CNN (ResNet-50)     67.2  68.1  66.3  17  16  3   44   48  60.1  76.8
Detectron’s Faster R-CNN (ResNet-101)    75.0  76.4  73.7  18  15  3   27   54  62.5  78.6
Detectron’s Mask R-CNN (ResNet-50)       69.0  69.1  68.9  20  13  3   39   57  59.0  78.0
Detectron’s Mask R-CNN (ResNet-101)      73.1  73.6  72.6  20  13  3   17   53  62.5  78.7
Detectron’s RetinaNet (ResNet-50)        71.5  87.7  60.4  11  21  4   20   41  62.8  78.7
Detectron’s RetinaNet (ResNet-101)       70.4  84.4  60.4  13  19  4   20   43  63.2  79.6
Ground truth detections                  91.3  91.7  90.9  35  1   0   5    8   96.3  91.5

Table 4.2.2: Results for tracking with Deep SORT.

5 Discussion

This chapter contains a discussion on the results presented in the previous chapter. As in chapter 4, this chapter first considers object detection performance and then performance of the whole tracking-by-detection system.

5.1 Object Detection

From figure 4.1.1 it is evident that, on this dataset, two-stage detectors generally outperform single-stage detectors in terms of average precision. YOLOv3:512 does however achieve comparable average precision and F1 score, which shows that single-stage detectors are capable of performing on the same level as two-stage detectors. Further, both a more complex backbone and a higher resolution on the input images yield higher precision, with higher image resolution having the slightly larger effect. These results are largely in agreement with observations presented in the introductory papers for the different algorithms [16][33][24]; for Detectron's implementations they are also similar to results presented in their Model Zoo¹ [11].

Another takeaway is that different implementations of the same algorithm are not always equal in performance. This is most obvious for Mask R-CNN, where Matterport's implementation [2] does not perform as well as Detectron's implementation [11]. The reason for this could be that Matterport's implementation, as described in section 3.4, differs slightly from the original implementation in terms of learning rate and image resizing. It could also be that implementations using different deep learning libraries cause some difference in performance, since Matterport's implementation is written in Keras with TensorFlow as backend whereas Detectron is written in Caffe2. All in all, most algorithms achieve roughly the same performance except SSD, which has a significantly lower AP than the rest. This is somewhat expected since SSD, being introduced in 2015, is a few years older than the rest.

¹ https://github.com/facebookresearch/Detectron/blob/master/MODEL_ZOO.md

An unexpected result is that RetinaNet has slightly higher precision when using ResNet-50 rather than the more complex ResNet-101. The difference in precision is however very small, and it is important to remember that the small dataset on which tests are performed will not represent general performance as well as tests on larger datasets such as COCO [22] and Pascal VOC [8]. It could therefore be that RetinaNet happens to work better with ResNet-50 on this small dataset while RetinaNet with ResNet-101 is still better at detecting objects in the general case.

Figure 4.1.3 shows that YOLOv3 is the fastest algorithm by far, while still achieving average precision comparable to that of the two-stage detectors and RetinaNet. This is not wholly unexpected since YOLOv3's speed is its defining characteristic, and it is also in line with results presented in the introduction of YOLOv3 [33]. Detectron's implementations of Mask R-CNN and RetinaNet are both faster than Matterport's Mask R-CNN and Fizyr's RetinaNet. It is difficult to pinpoint an exact cause for this, but it probably has something to do with which deep learning library each implementation utilizes. Both Fizyr's and Matterport's implementations use Keras with TensorFlow as backend while Detectron uses Caffe2 for all its implementations.

The precision and recall plot in figure 4.1.4 shows that Faster R-CNN and Mask R-CNN, compared to other algorithms, favor recall at the cost of precision. RetinaNet produces conservative but precise predictions, and YOLOv3 has found a compromise between the two approaches. F1 scores could probably be improved by tuning the threshold that decides whether to keep a detection, so as to find an optimal balance between precision and recall for each algorithm. Setting the threshold to 0.5 for all algorithms does however still give valid information about how the algorithms balance precision and recall compared to each other.

From the full results presented in table 4.1.1 it is evident that metrics describing precision can be deceiving unless they are viewed as part of a wider context. The best example of this is that SSD:300 has the highest precision and the fewest false positives, even though its average precision is the lowest of all tested algorithms. The table also shows the absolute numbers of TP, FP, and FN, the fundamental quantities used to calculate most other metrics. Just as precision and recall, these numbers also highlight the general characteristics of the algorithms: RetinaNet's conservative approach yields a low number of FP but many FN, the two-stage detectors' prioritization of recall gives fewer FN but more FP, and YOLOv3's compromising approach returns results somewhere in between those of RetinaNet and the two-stage detectors. YOLOv3's balancing of recall and precision appears successful since YOLOv3:512 achieves the highest F1 score of all algorithms. Finally, MODP shows that the spatial accuracy of predicted bounding boxes is somewhat correlated with general precision, with RetinaNet having both the most accurate bounding boxes and the highest precision. It should however be noted that this correlation is not very distinct, since YOLOv3 seems to produce less accurate bounding boxes than the two-stage detectors while also having higher precision.

5.2 Object Tracking

IDF1 scores displayed in figures 4.2.1 and 4.2.2 confirm the hypothesis that an object detection algorithm's general performance is indicative of its performance in a tracking-by-detection system. This is further supported by the MOTA scores plotted in figures 4.2.3 and 4.2.4 where, again, tracking performance is highly correlated with object detection performance.

MOTA results are rather inconclusive since some object detection algorithms perform better with SORT while others perform better with Deep SORT. It seems as if an object detection algorithm's trade-off between precision and recall is, to some extent, correlated with whether the algorithm performs best with SORT or Deep SORT. Algorithms favoring precision over recall, such as RetinaNet and SSD, appear to be the ones benefiting the most from using Deep SORT instead of SORT. A possible explanation could be that a prioritization of recall leads to more false positives, i.e. bounding boxes without corresponding ground truths. Visually, these false bounding boxes are probably more similar to each other than true positives, since the background is quite similar across the whole frame. This could cause Deep SORT's appearance descriptor to match false positives to each other and thereby lower the overall performance of the tracking procedure.

Even though MOTA scores are somewhat ambiguous, IDF1 scores are conclusively in favor of Deep SORT. The reason for this discrepancy between MOTA and IDF1 results is likely that MOTA is not concerned with whether a person is given the same identity throughout the sequence, something which the IDF1 score accounts for. This reasoning is further supported by the full results in tables 4.2.1 and 4.2.2, where it is seen that Deep SORT generates fewer identity switches while also being able to track more identities for most of their lifetime. This shows that the use of appearance descriptors in Deep SORT improves general performance compared to SORT by making Deep SORT better at maintaining consistent identities throughout a sequence. These results are also in agreement with findings in the introduction of Deep SORT [44], where fewer identity switches is presented as one of the main advantages of Deep SORT over SORT.

6 Conclusions

As for the first research question, results presented in this thesis show that the stand-alone performance of an object detection algorithm employed in a tracking-by-detection system is highly correlated with the system's overall tracking performance. This correlation demonstrates the pivotal role of the object detection algorithm in a tracking-by-detection procedure. Further, it is also shown that modern single-stage detectors are able to achieve performance comparable to that of two-stage detectors. In general, single-stage detectors also seem to outdo two-stage detectors in terms of processing time, with YOLOv3 being the unquestionably fastest algorithm.

Continuing with the second research question, this thesis also showcases how the use of visual descriptors in the tracking stage can reduce the number of identity switches, which in turn increases the overall performance of the tracking-by-detection system. The increase in performance is also shown to be correlated with how the object detection algorithm balances precision and recall, with Deep SORT improving the overall system the most when object detections are conservative but precise. This observation, that the usefulness of visual descriptors depends on the object detection algorithm's characteristics, is interesting since it could be of aid when constructing future tracking-by-detection systems.

6.1 Future Work

Seeing as all tests in this thesis are performed on a small and quite specific dataset, an obvious future work is to apply the tests to larger datasets such as those in the MOT16 challenge. More data also means that it would be possible to train the object detection algorithms on data from the same domain as the test data. Another possible extension would be to test more tracking algorithms apart from


SORT and Deep SORT. Testing other tracking algorithms would give further insights into how the use of visual descriptors affects tracking-by-detection in the more general case.

6.2 Ethics

There is no doubt that multiple object tracking can be used with good intentions in areas such as autonomous driving and sports analysis. It would however be remiss not to acknowledge the more sinister purposes object tracking can be used for. While it is a powerful tool with the potential to do good in the hands of a just law enforcement agency, it is also the perfect tool for a totalitarian state wishing to enforce control over its citizens. Some would also argue that the very act of surveilling citizens with this kind of technology is an infringement of their privacy and thereby immoral in itself, regardless of intentions. Whatever our opinions about surveillance might be, we should be aware that even though our intentions with this technology are good, others' intentions might not be.

Bibliography

[1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, 2016. URL https://www.usenix.org/system/files/conference/osdi16/osdi16-abadi.pdf.

[2] W. Abdulla. Mask R-CNN for object detection and instance segmentation on Keras and TensorFlow. https://github.com/matterport/Mask_RCNN, 2017.

[3] K. Bernardin and R. Stiefelhagen. Evaluating multiple object tracking performance: The CLEAR MOT metrics. EURASIP J. Image and Video Processing, 2008, 2008.

[4] A. Bewley, Z. Ge, L. Ott, F. Ramos, and B. Upcroft. Simple online and realtime tracking. In 2016 IEEE International Conference on Image Processing (ICIP), pages 3464–3468, 2016. doi: 10.1109/ICIP.2016.7533003.

[5] K.S. Chahal and K. Dey. A survey of modern object detection literature using deep learning. CoRR, abs/1808.07256, 2018. URL http://arxiv.org/abs/1808.07256.

[6] F. Chollet et al. Keras. https://keras.io, 2015.

[7] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05) - Volume 1, pages 886–893, Washington, DC, USA, 2005. IEEE Computer Society. ISBN 0-7695-2372-2. doi: 10.1109/CVPR.2005.177. URL http://dx.doi.org/10.1109/CVPR.2005.177.

[8] M. Everingham, L. Van Gool, C. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html, 2012.


[9] R. Girshick. Fast R-CNN. CoRR, abs/1504.08083, 2015. URL http:// arxiv.org/abs/1504.08083.

[10] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. CoRR, abs/1311.2524, 2013. URL http://arxiv.org/abs/1311.2524.

[11] R. Girshick, I. Radosavovic, P. Dollár G Gkioxari, and K. He. Detectron. https://github.com/facebookresearch/detectron, 2018.

[12] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.

[13] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learn- ing. Springer Series in . Springer New York Inc., New York, NY, USA, 2001.

[14] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convo- lutional networks for visual recognition. CoRR, abs/1406.4729, 2014. URL http://arxiv.org/abs/1406.4729.

[15] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recog- nition. CoRR, abs/1512.03385, 2015. URL http://arxiv.org/abs/ 1512.03385.

[16] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. CoRR, abs/1703.06870, 2017. URL http://arxiv.org/abs/1703.06870.

[17] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, and K. Murphy. Speed/accuracy trade- offs for modern convolutional object detectors. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3296–3297, July 2017. doi: 10.1109/CVPR.2017.351.

[18] P. Jaccard. Etude de la distribution florale dans une portion des alpes et du jura. Bulletin de la Societe Vaudoise des Naturelles, 37:547–579, 01 1901. doi: 10.5169/seals-266450.

[19] R.E. Kalman. A new approach to linear filtering and prediction problems. Transactions of the ASME–Journal of Basic Engineering, 82(Series D):35–45, 1960.

[20] A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep convolutional neural networks. In Proceedings of the 25th Inter- national Conference on Neural Information Processing Systems - Volume 1, NIPS’12, pages 1097–1105, USA, 2012. Curran Associates Inc. URL http://dl.acm.org/citation.cfm?id=2999134.2999257.

[21] H. W. Kuhn and B. Yaw. The hungarian method for the assignment problem. Naval Res. Logist. Quart, pages 83–97, 1955.

[22] T-Y. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. Lawrence Zitnick. Microsoft COCO: common objects in context. CoRR, abs/1405.0312, 2014. URL http://arxiv.org/ abs/1405.0312.

[23] T-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. CoRR, abs/1612.03144, 2016. URL http://arxiv.org/abs/1612.03144.

[24] T-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. CoRR, abs/1708.02002, 2017. URL http://arxiv.org/ abs/1708.02002.

[25] W. Liu, D. Anguelov, D. Erhan, C Szegedy, S. Reed, C-Y. Fu, and A. Berg. SSD: single shot multibox detector. CoRR, abs/1512.02325, 2015. URL http: //arxiv.org/abs/1512.02325.

[26] D. Lowe. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision, 60(2):91–110, November 2004. ISSN 0920-5691. doi: 10. 1023/B:VISI.0000029664.99615.94. URL https://doi.org/10.1023/ B:VISI.0000029664.99615.94.

[27] W. Luo, X. Zhao, and T-K. Kim. Multiple object tracking: A review. CoRR, abs/1409.7618, 2014. URL http://arxiv.org/abs/1409.7618.

[28] C. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA, 2008. ISBN 0521865719, 9780521865715.

[29] Microsoft. Visual Object Tagging Tool, 2019. URL https://github.com/ Microsoft/VoTT.

[30] A. Milan, L. Leal-Taixé, I. Reid, S. Roth, and K. Schindler. MOT16: A benchmark for multi-object tracking. CoRR, abs/1603.00831, 2016. URL http://arxiv.org/abs/1603.00831.

[31] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. 2017.

[32] J. Redmon and A. Farhadi. YOLO9000: better, faster, stronger. CoRR, abs/1612.08242, 2016. URL http://arxiv.org/abs/1612.08242.

[33] J. Redmon and A. Farhadi. Yolov3: An incremental improvement. arXiv, 2018.

[34] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. CoRR, abs/1506.02640, 2015. URL http://arxiv.org/abs/1506.02640.

[35] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: towards real-time object detection with region proposal networks. CoRR, abs/1506.01497, 2015. URL http://arxiv.org/abs/1506.01497.

[36] E. Ristani, F. Solera, R. Zou, R. Cucchiara, and C. Tomasi. Performance measures and a data set for multi-target, multi-camera tracking. CoRR, abs/1609.01775, 2016. URL http://arxiv.org/abs/1609.01775.

[37] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015. doi: 10.1007/s11263-015-0816-y.

[38] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014. URL http://arxiv.org/abs/1409.1556.

[39] R. Stiefelhagen, K. Bernardin, R. Bowers, J. Garofolo, D. Mostefa, and P. Soundararajan. The CLEAR 2006 evaluation. In CLEAR, 2006.

[40] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. CoRR, abs/1409.4842, 2014. URL http://arxiv.org/abs/1409.4842.

[41] R. Szeliski. Computer Vision: Algorithms and Applications. Springer-Verlag, Berlin, Heidelberg, 1st edition, 2010. ISBN 1848829345, 9781848829343.

[42] C. Tan, F. Sun, T. Kong, W. Zhang, C. Yang, and C. Liu. A survey on deep transfer learning. CoRR, abs/1808.01974, 2018. URL http://arxiv.org/abs/1808.01974.

[43] J. Uijlings, K. Sande, T. Gevers, and A. Smeulders. Selective search for object recognition. International Journal of Computer Vision, 104:154–171, 09 2013. doi: 10.1007/s11263-013-0620-5.

[44] N. Wojke and A. Bewley. Deep cosine metric learning for person re-identification. pages 748–756, 2018. doi: 10.1109/WACV.2018.00087.

[45] X. Zhang, H. Luo, X. Fan, W. Xiang, Y. Sun, Q. Xiao, W. Jiang, C. Zhang, and J. Sun. AlignedReID: Surpassing human-level performance in person re-identification. arXiv preprint arXiv:1711.08184, 2017.

Appendix

A Technical Report of CDIO Project

This appendix consists of the technical report describing the previous project at NFC that was conducted as part of the course Images and Graphics, Project Course CDIO at Linköping University. The tracking-by-detection system used to perform tests in this thesis is also a product of this project.

Object detection and tracking in video from multiple surveillance cameras

TSBB11 CDIO, Technical report

Hanna Hamrell, [email protected] Klara Hellgren, [email protected] Denise Härnström, [email protected] Helena Kihlström, [email protected] Axel Nyström, [email protected] May 9, 2019

Contents

1 Introduction
  1.1 Problem Description
  1.2 Project Overview
  1.3 Client
  1.4 Limitations

2 Related Work
  2.1 Tracking
    2.1.1 Multi-video Tracking
  2.2 Detection
  2.3 Human Parsing

3 Theoretical Background
  3.1 Detection
  3.2 Tracking
    3.2.1 SORT
    3.2.2 SORT with Deep Association Metric
  3.3 Object Re-Identification
  3.4 Human Parsing

4 Method
  4.1 System Overview
  4.2 Detection
  4.3 Tracking
  4.4 Person Re-Identification (Multi-video Matching)
    4.4.1 Re-identification using AlignedReID
    4.4.2 Re-identification using Features from Deep SORT
  4.5 Human Parsing

5 Evaluation and Result
  5.1 Evaluation Data
  5.2 Runtime Comparison
  5.3 Detection
  5.4 Tracking
  5.5 Person Re-identification
    5.5.1 Runtime Comparison
    5.5.2 Re-identification using AlignedReID
    5.5.3 Re-identification using Features from Deep SORT
  5.6 Human Parsing

6 Discussion
  6.1 Detection
  6.2 Tracking
  6.3 Person Re-identification
  6.4 Human Parsing
  6.5 Future Work

7 Conclusions

1 Introduction

This project is about developing an analysis module for surveillance camera videos to minimize manual work when e.g. looking for persons fitting into a specific description or re-identifying a person from one video in other videos.

1.1 Problem Description

The client wanted an automatic analysis module for videos from surveillance cameras. The wishes of the client were that it should be able to detect and track people and e.g. cars or bags in surveillance camera videos. Another wish was that it should be possible to match a person detection in one video with person detections in other videos. A third wish was that all found and tracked objects should be saved in JSON format and that it should be possible to search among the objects based on a description in text form. In addition, the customer required a study of good algorithms for detection and classification as well as tracking. After approval from the customer, the requirements in table 1 were decided upon.

Within the scope of the project, all surveillance cameras are stationary and no camera movements are possible. The surveillance videos of interest are from the Stockholm metro. The positions of the cameras are known.

Table 1: Requirement table.

Requirement nr   Change    Priority   Requirement description
Nr 1             Original  1          Detect people in a video sequence.
Nr 2             Original  1          Assign a unique ID for every person within the scene.
Nr 3             Original  1          Track a detected person in a video sequence.
Nr 4             Original  1          Output movement position in pixel coordinates for every tracked person.
Nr 5             Original  1          Save detected objects in JSON format.
Nr 6             Original  1          Produce a system user manual.
Nr 7             Original  1          Document the time it takes to run the system.
Nr 8             Original  2          Visualize how a person has moved through a video sequence.
Nr 9             Original  2          For a person with a specific object ID in one video sequence, identify persons in other video sequences that are probable to be the same person.
Nr 10            Original  2          Provide the customer with the results of a literature survey of existing algorithms for detection and classification.
Nr 11            Original  2          Provide the customer with the results of a literature survey of existing algorithms for tracking.
Nr 12            Original  3          Detect object attributes such as clothing colors, bags, shoes etc.
Nr 13            Original  3          Pair detected bags with the person carrying them, when they are carried.
Nr 14            Original  3          Find possible person matches in all video sequences based on specific attributes input in text format.
Nr 14            Original  3          For all persons, detect when their faces are shown.
Nr 16            Original  3          For a person with a specific object ID, visualize all images where the face is shown.

1.2 Project Overview

The goal of the project was to develop an analysis module for surveillance video cameras, with focus on the cameras from Stockholm metro stations. The system should be able to detect, classify and track persons and assign the same ID to the same person throughout the sequence where the person is visible. The result from the tracking should be saved in JSON format. The system should also be able to find a selected person from one video sequence in other video sequences. Lastly, the system should be able to detect attributes, such as shirt or jacket color, to make the result searchable based on attributes.

1.3 Client

The client is Nationellt Forensiskt Centrum (NFC). NFC is the department of the Swedish Police in charge of assisting the Police with image expertise in preliminary crime investigations. The Police as a whole have to manually go through huge amounts of surveillance video to solve crimes and are in need of a way to reduce this manual work and find possible culprits faster.

1.4 Limitations

The client asked for the analysis module to be developed under the MIT license and said that it could be developed on either Windows or Linux using open source code. A dataset with surveillance videos from the Stockholm metro was available to use for development of the module.

A limitation of the report is that images from the metro surveillance dataset were not allowed in the report for legal reasons. Therefore, the results that are presented visually in the report are from another dataset than the one for which the system was developed.

2 Related Work

Below, related work in detection, tracking and attribute segmentation is briefly discussed.

2.1 Tracking

There are several ways to do tracking in a video sequence. Especially suited for the sometimes very crowded scenes of metro surveillance videos is multi-object tracking (MOT) using tracking-by-detection. Here, all objects are first detected in each frame using an object detector, and the detections are then associated between frames using object location and appearance. Two examples of traditional methods that have been revisited in a tracking-by-detection scenario are Multiple Hypothesis Tracking (MHT) [1] and the Joint Probabilistic Data Association Filter (JPDAF) [2]. These traditional tracking-by-detection methods are usually very complex and computationally heavy. However, with the better detections enabled by recent object detectors, simpler tracking models can be used.

Two recent open source trackers are SORT and Deep SORT. SORT stands for Simple Online and Realtime Tracking, and the tracker uses Kalman filtering and the Hungarian method to handle motion prediction and data association. SORT is fast and simple and has high precision and accuracy, but it cannot handle occlusion very well [3].

Deep SORT is an extension of SORT which improves the matching procedure and greatly reduces the number of ID switches by using a visual appearance descriptor for each detected object [4].

2.1.1 Multi-video Tracking

Multi-video tracking is the process of finding the same object in different cameras. This can be done by comparing appearance descriptors for the objects. CNNs such as ResNet [5] can be used to extract these descriptors.

An example is the recently introduced method called AlignedReID [6]. AlignedReID extracts both global and local feature vectors. The local feature vectors are aligned when comparing objects so that the model learns to account for differences in camera angles and object size.

2.2 Detection

Three of the currently best performing open source and publicly available detectors are Yolov3 [7], Mask R-CNN [8] and RetinaNet [9]. All of these object detectors take an image or a video frame as input and output the coordinates of the bounding box for each detected object, as well as the object class for each bounding box and the confidence with which the detection is made [7][8][9]. Mask R-CNN also outputs segmentation masks, and it is possible to apply the network to detect instance-specific poses [8].

Comparing the different Average Precision (AP) values using COCO's mean AP metric [10], the currently most accurate detector is RetinaNet, but the other two detectors are not far off (see table 2). The network speeds are not easily comparable only by reading their respective papers. However, Yolov3 is claimed to be faster than RetinaNet [7].

Table 2: AP values using COCO's mean Average Precision for different object detectors [10].

Detector         Backbone          AP    AP50  AP75  APS   APM   APL
Yolov3 [7]       Darknet-53        33.0  57.9  34.4  18.3  35.4  41.9
Mask R-CNN [11]  ResNeXt-101-FPN   37.1  60.0  39.4  16.9  39.9  53.5
RetinaNet [12]   ResNeXt-101-FPN   40.8  61.1  44.1  24.1  44.2  51.2

2.3 Human Parsing

Human parsing can be used, for example, in surveillance for better person identification, and in the fashion and clothing industry. Since there are many different areas with an interest in clothes detection and parsing, previous work on the subject has been done to meet the requirements of those fields. In fashion there has been a demand for clothing segmentation in rather high-quality fashion still images. The Fashionista dataset was created for this purpose [13], and [14] uses a Conditional Random Field model to improve clothes parsing on this dataset.

For video surveillance the parsing has to be done on sequences of images that can be of lower quality and contain more people. The Crowd Instance-level Human Parsing (CIHP) dataset was created to deal with some of the problems with the previous dataset, and includes more images and more instances of persons in one image [15]. The authors of [15] solve the segmentation problem with what they call a Part Grouping Network (PGN), which is based on fully convolutional networks (FCN) and handles both detection and segmentation of human parts.

3 Theoretical Background

This section describes the theoretical background on detection, tracking, attribute segmentation and re-identification.

3.1 Detection

For the detection, Yolov3 [7] was used. Yolov3 is a detector applying a single neural network to a full image. The network predicts bounding boxes and probabilities for different regions of the image, and the bounding boxes are weighted using the predicted probabilities. It predicts detections across three different scales and, for each bounding box, the class it might contain is predicted using multi-label classification [7].

3.2 Tracking

The general aim of a MOT tracker is to associate detections across frames by localizing and identifying all objects of interest. An ideal tracker should provide a constant ID for each of the objects within the scene by keeping track of objects even when detections are missing or are false positives. The MOT problem is challenging since objects can be occluded or temporarily leave the field of view. The appearance of an object can also change within the scene because of scale, rotation and illumination variance.

In surveillance tracking, both performance and speed are of interest. Real-time tracking requires fast online models. In online tracking, only information from current and past frames is presented to the tracker.

3.2.1 SORT

The SORT algorithm keeps track of each object by estimating an object model for every frame. The object model contains current spatial information about object position, scale and bounding box ratio. The object model also contains a motion prediction for the next frame that is estimated using Kalman filtering.

The data association determines which detection belongs to which object. Since objects can enter or leave the scene, be occluded or correspond to false detections, the data association problem can be hard to solve. The SORT algorithm solves the problem by calculating the bounding box similarity between objects and detections. This is done by calculating the bounding box IoU distance as

IoU(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}    (1)

where A is the current bounding box for the previously detected object and B is the current bounding box for the new detection. For the previously detected objects, the current bounding box is estimated using the motion prediction. After calculating the IoU distance, the final assignment problem is solved by using the Hungarian method [3].
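A direct translation of equation 1 into code could look as follows; the (x1, y1, x2, y2) corner representation of the boxes is an assumption, since the report does not specify how boxes are stored.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2) corners."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    intersection = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0
```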

3.2.2 SORT with Deep Association Metric

This algorithm will be referred to as Deep SORT, and it is an extension of the SORT algorithm described in the previous section. SORT is fast and simple and performs very well in terms of precision and accuracy. However, it also delivers a relatively high number of identity switches. This motivates an improvement using descriptors of the visual appearance of the detected objects, which are used when matching detected objects from one frame to another to keep track of identities throughout a video sequence.

The descriptor used in Deep SORT is obtained from a convolutional neural network (CNN) that has been pre-trained on a large re-identification dataset with the purpose of discriminating between pedestrians. This means that the network has been trained to produce descriptor vectors that are far apart for detected pedestrians with different identities and very close together for images of the same person [4]. In deep metric learning methods, the notion of similarity is included directly in the training objective. The feature vectors that are generated for re-identification of the persons in the scene using Deep SORT are based on cosine similarity [16].
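To make the comparison concrete, the cosine distance between two such descriptor vectors can be computed as below. This is a generic sketch of the metric, not the exact code used in the Deep SORT implementation.

```python
import numpy as np


def cosine_distance(a, b):
    """Cosine distance between two appearance descriptors; smaller means more similar."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
```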

The Kalman filtering handles occlusion, but when an object has been occluded for several frames the prediction becomes more uncertain. Hence, the probability mass spreads out in the state space and there is a risk that a track with larger uncertainty is prioritized, because the uncertainty reduces the distance in standard deviations of any detection towards the projected track mean. Deep SORT solves this problem by using a matching cascade that prioritizes more frequently seen objects [4].

As previously stated, the Deep SORT algorithm is an extension of the SORT algorithm. Comparing the cosine distance between the feature vectors is a complement to measuring the IoU distance and the distances between Kalman state estimations. Because of this improvement, Deep SORT obtains better accuracy on standard benchmarks [4].

3.3 Object Re-Identification

Re-identification is the process of recognizing a specific object in different images. In the context of this project it means being able to assign the same ID to people in multiple cameras. A paper [6] from 2018 introduces an algorithm called AlignedReID which the authors claim performs person re-identification better than human annotators. AlignedReID uses a CNN to jointly learn global and local features to represent a person image.

A feature map of size C × H × W is taken from the last convolutional layer of a CNN, for example ResNet50 [5]. A global feature vector is then created by using global max pooling, i.e. pooling with an H × W kernel, which gives a C-dimensional global feature vector. Local feature vectors are then created by first horizontally max pooling the original feature map to create H different local feature maps and then convolving each local feature map with a 1 × 1 kernel to reduce the number of channels from C to c. After this, a person image is represented by a C-dimensional global vector and H different c-dimensional local vectors, where each c-dimensional vector represents a row of the person image.

The local distance between two person images is calculated by finding the alignment of local features which gives the smallest total distance. It is done by first creating the distance matrix D containing elements d_{i,j}. The element d_{i,j} is a normalized distance between local feature vector f_i from one image and local feature vector g_j from another image. The normalizing transformation is done as in equation 2 below.

d_{i,j} = \frac{e^{\|f_i - g_j\|_2} - 1}{e^{\|f_i - g_j\|_2} + 1}, \quad i, j \in \{1, 2, \ldots, H\}    (2)

The local distance between two images is then calculated as the shortest path from (1, 1) to (H, H) in matrix D. The global distance is calculated as the L2 distance between the global feature vectors of the images. Finally, the total distance between two person images is simply the sum of the local and global distances.
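A minimal sketch of this local distance is given below. It assumes, as in the AlignedReID paper, that the path through D may only move down or to the right; f and g hold the H local feature vectors of the two images.

```python
import numpy as np


def local_distance(f, g):
    """Aligned local distance between two stacks of H local feature vectors, shape (H, c)."""
    H = f.shape[0]
    # Pairwise Euclidean distances between local features, normalized as in equation 2.
    diff = np.linalg.norm(f[:, None, :] - g[None, :, :], axis=2)
    D = (np.exp(diff) - 1.0) / (np.exp(diff) + 1.0)

    # Dynamic programming: cheapest path from (1, 1) to (H, H) moving only down or right.
    cost = np.zeros_like(D)
    cost[0, 0] = D[0, 0]
    for i in range(1, H):
        cost[i, 0] = cost[i - 1, 0] + D[i, 0]
        cost[0, i] = cost[0, i - 1] + D[0, i]
    for i in range(1, H):
        for j in range(1, H):
            cost[i, j] = D[i, j] + min(cost[i - 1, j], cost[i, j - 1])
    return cost[H - 1, H - 1]
```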

The procedure above with both global and local features is used during the training stage. However, during the inference stage only the global features are used to compare similarity between person images. The authors of AlignedReID found that only using the global feature during inference worked almost as well as the combined global and local features. In [6] they speculate that the reason for this is that the structure prior of the person image in the learning stage makes the model learn better global features and that the local matching makes the global feature pay more attention to the person instead of the background.

AlignedReID uses TriHard [17] loss as metric learning loss and hard samples are mined using the global distance only. Mutual learning is used during training, details about the mutual learning used can be found in [6] and [18].

3.4 Human Parsing

The Part Grouping Network (PGN) was introduced by [15] as a method to solve the multi-person human parsing problem. In order to solve this task, semantic part segmentation and instance-aware edge detection are performed. Instead of training two networks to solve these tasks separately, PGN uses shared layers to learn common features, and then branches out to solve the separate tasks [15].

The network used to extract the shared feature maps is ResNet-101. The feature maps are extracted from the three last layers and used in the different branches. Two parallel branches are trained, one to assign a semantic part label to every pixel and one to perform edge detection. The last branch uses the output from the first two branches to refine both the edge and segmentation predictions [15].

The edges and parts can then be used to do instance-level human parsing, if the parts are connected to a certain instance of a person, as described in [15].

4 Method

Here, the system construction is presented.

4.1 System Overview

The system is split into two modules: one main module for detection, classification, tracking, and PGN segmentation, and one multi-video module for multi-video matching, i.e. re-identification in other videos. Figure 1 shows an illustration of the system.

Figure 1: Overview of the system.

4.2 Detection

For our system, the object detection is done using an implementation of Yolov3 in PyTorch [19]. The specific Yolov3 model used in the system is trained to detect 80 different kinds of objects, among which 'person', 'handbag' and 'backpack' are included. By default, the final system only forwards detections of object type 'person'. In an effort to speed up the detections, the Tiny Yolov3 model and weights were evaluated.

4.3 Tracking

For the tracking, an open source implementation of the Deep SORT algorithm was used [20], with some adaptations to the system.

This implementation is divided into two steps: feature generation and tracking. A 1×128 feature vector is extracted for every object before the tracking is performed, using the network described in section 3.2.2. This is done for the cut-out bounding box of each detection in each frame, obtained from the detection step described in section 4.2.

After adding feature vectors to all objects, the multiple object tracking is performed. For each pair of consecutive frames, the cosine feature distance, the IoU distance and the Kalman state distance are calculated between all pairs of detections in the two frames. Using motion predictions from the Kalman filter, the tracking bounding boxes are updated. The cosine feature distance and the IoU distance are used in the matching procedure. The IDs from the previous frame are assigned to the detections in the next frame according to the matches that generate the lowest costs. Detections are only matched if their IoU distance is smaller than IoUmax and their cosine feature distance is smaller than cosinemax. Detections with confidences that are too low (< cscore) are disregarded.
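A simplified Python sketch of the association step is given below. It uses the Hungarian algorithm from SciPy and only the cosine and IoU gating described above; the real Deep SORT implementation additionally uses a matching cascade and Mahalanobis gating from the Kalman filter, so this is an illustration rather than the exact procedure:

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def associate(track_feats, det_feats, iou_dist, cosine_max=0.2, iou_max=0.7):
        # track_feats: (T, 128) appearance features of the active tracks
        # det_feats:   (D, 128) appearance features of the current detections
        # iou_dist:    (T, D) IoU distance (1 - IoU) between track and detection boxes
        t = track_feats / np.linalg.norm(track_feats, axis=1, keepdims=True)
        d = det_feats / np.linalg.norm(det_feats, axis=1, keepdims=True)
        cos_dist = 1.0 - t @ d.T                                     # (T, D)

        infeasible = (cos_dist > cosine_max) | (iou_dist > iou_max)  # gating
        cost = np.where(infeasible, 1e5, cos_dist)

        rows, cols = linear_sum_assignment(cost)                     # Hungarian algorithm
        return [(r, c) for r, c in zip(rows, cols) if not infeasible[r, c]]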

Newly created tracks will not be initiated as an object instantly; a track will remain in the initialization phase until enough evidence has been collected. Tracks that have not been associated with a detection for some time will be deleted and removed from the set of active tracks. The initialization and removal of objects are controlled by the nint and Amax parameters.

When the tracking is done, information about each detected object is written to a json file. For every object, the json file contains the object ID and the following information for every frame where the object is visible: frame number, position, and bounding box width and height.
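The exact layout of the json file is not specified here, but a minimal sketch of such an output, with illustrative field names and a hypothetical file name, could look as follows:

    import json

    tracking_output = {
        "objects": [
            {
                "id": 1,
                "frames": [
                    {"frame": 12, "x": 104.0, "y": 87.5, "width": 34.0, "height": 91.0},
                    {"frame": 13, "x": 105.5, "y": 88.0, "width": 34.5, "height": 90.5},
                ],
            },
        ],
    }

    with open("tracking_result.json", "w") as f:   # hypothetical file name
        json.dump(tracking_output, f, indent=2)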

4.4 Person Re-Identification (Multi-video Matching)

The multi-video matching module requires a set of video sequences on which the tracking has been performed. The task is to find a desired person from one video sequence in the other sequences.

4.4.1 Re-identification using AlignedReID

AlignedReID is the main algorithm used for person re-identification. A ResNet50 pre-trained on the Market1501 [21] dataset is used as the CNN; the pre-trained model was taken from [22].

AlignedReID needs person images in order to compare people found in different cameras. Information about the bounding boxes from the tracking was used to extract a person image for each frame in which the person could be seen. This means that there is one person image for every frame in which the person is visible. However, only a single person image per person is wanted, since it is not computationally feasible to compare multiple bounding boxes per person. A single bounding box therefore has to be extracted for each person, and this bounding box should represent the general appearance of the person as closely as possible.

The bounding boxes are extracted in the following way. The bounding boxes are first added to a list and sorted according to size so that the smallest bounding box comes first.

1. If the person is visible in fewer than 10 frames, the median-sized bounding box is simply taken as the final bounding box representing the person.

2. If the person is visible in more than 10 frames, a mean bounding box is calculated from the set of all bounding boxes between the median-sized bounding box and the median of the upper half of the list. All boxes in the set are then compared to the mean bounding box individually, and the bounding box in the set which is most similar to the mean bounding box is taken as the final bounding box representing the person. Figure 2 below shows how the mean bounding box is extracted.
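A minimal Python sketch of this selection procedure is given below; sorting by box area and measuring similarity as the Euclidean distance between box sizes are assumptions of the sketch:

    import numpy as np

    def representative_box(boxes):
        # boxes: list of (width, height) pairs for one person, one per frame.
        boxes = sorted(boxes, key=lambda b: b[0] * b[1])        # smallest first (by area)
        n = len(boxes)
        median_idx = n // 2
        if n < 10:
            return boxes[median_idx]                            # median-sized box

        upper_median_idx = median_idx + (n - median_idx) // 2   # median of the upper half
        subset = np.array(boxes[median_idx:upper_median_idx + 1], dtype=float)
        mean_box = subset.mean(axis=0)                          # the "mean bounding box"
        best = np.argmin(np.linalg.norm(subset - mean_box, axis=1))
        return tuple(subset[best])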

Figure 2: Calculation of mean bounding box

The implementation of AlignedReID which is used works as described in section 3.3. All person images are resized to size (256, 128) before feature maps are extracted using the pre-trained ResNet50. The extracted feature maps have dimensions 2048 × 8 × 4, which means that the global feature vector is a 2048-dimensional vector. A convolution with a 1 × 1 kernel is done on the feature map to reduce the number of channels from 2048 to 512 before the local feature vectors are extracted. This means that there are in total 8 different 512-dimensional local feature vectors per person image.
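The following PyTorch sketch illustrates how global and local feature vectors can be obtained from such a 2048 × 8 × 4 feature map; the pooling types and the exact placement of the 1 × 1 convolution are assumptions and may differ from the implementation in [22]:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    conv1x1 = nn.Conv2d(2048, 512, kernel_size=1)      # channel reduction for local features

    def global_and_local_features(feature_map):
        # feature_map: (N, 2048, 8, 4) output of the ResNet50 backbone
        g = F.adaptive_avg_pool2d(feature_map, 1).flatten(1)   # (N, 2048) global vector
        local = conv1x1(feature_map)                           # (N, 512, 8, 4)
        local = local.mean(dim=3)                              # pool over width -> (N, 512, 8)
        return g, local.permute(0, 2, 1)                       # (N, 8, 512) local vectors

    # g, l = global_and_local_features(torch.randn(1, 2048, 8, 4))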

Unlike in the original AlignedReID paper, both the global and local distances are used in the inference stage when comparing two person images. The reason for this is that no significant difference in computation time could be observed between calculating only the global distance and calculating both the global and local distances.

4.4.2 Re-identification using Features from Deep SORT

The feature vectors that were used in the (frame-by-frame) tracking step were also investigated for use in re-identification between video sequences. For each detected person, the feature vectors throughout the sequence are stored for this purpose.

For a chosen ID in one of the video sequences, the mean feature vector for the person with this ID is calculated. This mean feature vector is then compared to the mean feature vector of every other person in the other video sequences. The IDs with the smallest cosine distance to the one belonging to the desired person are returned as candidates.
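A minimal sketch of this candidate ranking, assuming the per-frame feature vectors are available as NumPy arrays, could look as follows:

    import numpy as np

    def rank_candidates(query_feats, other_people, top_k=5):
        # query_feats:  (M, 128) Deep SORT features of the chosen ID
        # other_people: dict mapping person ID -> (K, 128) features from another sequence
        q = query_feats.mean(axis=0)
        q = q / np.linalg.norm(q)

        scored = []
        for pid, feats in other_people.items():
            m = feats.mean(axis=0)
            m = m / np.linalg.norm(m)
            scored.append((1.0 - float(q @ m), pid))            # cosine distance
        return [pid for _, pid in sorted(scored)[:top_k]]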

To obtain a more reasonable set of candidates, it is possible to use time information to filter out candidates that appear much earlier or later than the desired person. A motivation for this is the nature of the client's data, metro surveillance footage: a person rarely stays within the underground area for longer than a few minutes.

4.5 Human Parsing

The goal of the human parsing was to find attributes of a person (e.g. "person with hat") and to connect each attribute with one or several colors (e.g. "person with red coat"), in order to simplify searching for a specific person. This was done in several steps: the attributes were detected and segmented out, the dominant colors of the pixels belonging to an attribute were calculated, and the color code was mapped to a color name in English (e.g. [255, 0, 0] in RGB space to the word "red"). Since the input to the system is a video sequence and each person may appear in several frames, the image that is used for the attribute detection of each person also needs to be chosen.

For the segmentation of the attributes, the PGN network, trained on the previously described CIHP dataset, was used. Unlike the original use by the authors, who did both human part detection and segmentation in one go, information about where each person is seen in the sequence is already available and can be used. For every detected object, as defined after the tracking, one bounding box image is chosen for attribute detection and segmentation. This image is chosen as described in section 4.4.1.

This means that for every detected object, the bounding box image is used as input to the PGN network, after the same preprocessing as previously described. This outputs label masks of the different human parts of the object. For every part, or attribute, of the object, the two dominant colors of its pixels are calculated with k-means clustering. The dominant colors are mapped to a color name by comparing the Euclidean distance to a predefined set of colors in CIELAB color space. This color space was chosen because differences in color and luminance are more perceptually uniform [23]. Since the predefined set of colors contains a lot of colors with different names, these names were simplified (e.g. "DodgerBlue3" to "Blue").
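The color extraction step can be sketched as follows; the palette below is only a small illustrative subset of the predefined color set, and the library choices (scikit-learn for k-means, scikit-image for the CIELAB conversion) are assumptions rather than necessarily the ones used in the system:

    import numpy as np
    from sklearn.cluster import KMeans
    from skimage.color import rgb2lab

    # Small illustrative palette; the real system uses a much larger set of named colors.
    PALETTE = {"Red": (255, 0, 0), "Green": (0, 128, 0), "Blue": (0, 0, 255),
               "Black": (0, 0, 0), "White": (255, 255, 255), "Grey": (128, 128, 128)}

    def to_lab(rgb):
        return rgb2lab(np.array(rgb, dtype=np.uint8).reshape(1, 1, 3))[0, 0]

    def dominant_color_names(pixels_rgb, n_colors=2):
        # pixels_rgb: (N, 3) uint8 array with the pixels of one attribute mask
        km = KMeans(n_clusters=n_colors, n_init=10).fit(pixels_rgb.astype(float))
        names, labs = zip(*[(name, to_lab(c)) for name, c in PALETTE.items()])

        result = []
        for center in km.cluster_centers_:                      # dominant RGB colors
            lab_c = to_lab(center)
            result.append(names[int(np.argmin([np.linalg.norm(lab_c - p) for p in labs]))])
        return result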

Both the color code and the color name are saved for each attribute belonging to each detected object. This is written to a json file along with all other information about the object.

5 Evaluation and Results

In this section the results of the project will be presented. Since no ground truth is provided, the results will be qualitative, in the form of figures. All figures are generated from the Laboratory sequences.

5.1 Evaluation Data

When developing the system, actual surveillance videos from the Stockholm metro were used. For legal reasons, videos 1-4 in sequence 2 from CVLab's Laboratory sequences1 were used to evaluate the final system, and the results from these final evaluations are what will mainly be presented in this report. Each of the four videos is around two minutes long, with a frame rate of 25 frames per second and a frame size of 360x288 pixels.

5.2 Runtime Comparison

The speed was measured when running the system on a Dell XPS 15 9560 laptop with an NVIDIA GeForce GTX 1050 graphics card. The specified times are for running all four of the videos in the Laboratory Sequences, which contain six persons. The runtimes are shown in table 3.

Table 3: Runtime for different parts of the system when processing the four videos in the Laboratory Sequences.

Module                                    Time spent          Frames per second
Entire module                             68 min and 8 sec    2.89
- Detection and classification module     32 min and 18 sec   6.09
- Entire tracking module                  9 min and 5 sec     21.7
  - Extracting features                   8 min and 17 sec    23.7
  - Deep SORT                             40 sec              295
- PGN segmentation                        26 min and 45 sec   7.35

5.3 Detection

Figure 3 shows a few examples of the bounding boxes generated by the detection module. There are successful detections even under partial occlusion, as in figures 3a and 3c. There are also glitches in the detection, as seen in figure 3b.

When processing the client's video dataset from the Stockholm metro, Yolov3 rarely detected any bags. The few actual bag detections were a few backpacks in random single frames when seen straight from behind. It was generally good at detecting persons, except when they were very far away from the cameras or very close and only partly visible. A few times, strange misinterpretations occurred, such as a train or a whole platform being labeled

1 https://cvlab.epfl.ch/data/data-pom-index-php/

as a person, even at confidence levels as high as 0.85.

(a) Example 1. (b) Example 2. (c) Example 3.

Figure 3: A few examples of detections by Yolov3 in the Laboratory Sequences.

As seen in table 3, it took 32 minutes and 18 seconds to run the detection module of the system, making it the most time-consuming part of the complete module. The effort to speed up the system by using Tiny Yolov3 did make the detection module exceptionally fast, but accurate detections decreased to a minimum, as most persons were classified as airplanes.

5.4 Tracking

The tracking is performed using the tracking parameters given in table 4.

Table 4: Tracking parameters.

Parameter    Value   Description
IoUmax       0.7     Maximum IoU distance.
cosinemax    0.2     Gating threshold for the cosine distance metric.
cscore       0.85    Detection confidence threshold.
nint         3       Number of frames that a track remains in the initialization phase.
Amax         30      Maximum number of missed matches before a track is deleted.

The Deep SORT tracker is fast and effective, and its performance is quite good. Results of Deep SORT are shown in figures 4, 5 and 6b. Even though the objects have similar appearance, the tracker manages to keep them apart, and throughout the sequence there are only a few ID switches.

(a) Frame 1 (b) Frame 2

Figure 4: Two arbitrarily chosen frames from the result of Deep SORT, with bounding boxes and IDs for each person. For each object there are two bounding boxes: the red box is the detection in the current frame and the colored box is the predicted bounding box given by the Kalman filter.

Even though Deep SORT uses both motion and visual appearance, the algorithm struggles when objects are occluded. In the Laboratory Sequences the objects move back and forth in circles, and their motion pattern is irregular and complex. When an object passes another object and is fully occluded, the tracker mostly assigns a new ID to the object when it becomes visible again. Several IDs can therefore be assigned to the same object throughout the sequence. An illustration of this is shown in figure 5.

(a) Frame 1 (b) Frame 2

Figure 5: Tracking result from camera 0, where the same object is assigned different IDs throughout the sequence. For each object there are two bounding boxes: the red box is the detection in the current frame and the colored box is the predicted bounding box given by the Kalman filter.

If cosinemax is increased, the occlusion is handled better and some objects are tracked through almost the whole sequence with the same ID. Feature vectors before and after occlusion are often dissimilar since part of the object can be occluded. A higher cosinemax accepts matches that are more dissimilar and will therefore handle occlusion better. The

drawback is that a higher cosinemax also contributes to more ID switches, since the differences in visual appearance have less impact.

(a) Frame 1 (b) Frame 2

Figure 6: Tracking result from camera 0 using cosinemax = 0.4; the increase in cosinemax causes an ID switch for ID = 5.

5.5 Person Re-identification

In this section, the results from the re-identification module are presented.

5.5.1 Runtime Comparison

The runtimes for the two re-identification submodules, one using AlignedReID and one using Deep SORT features, are shown in table 5. The implementation of AlignedReID requires the system to save images of all bounding boxes; this has to be done after the whole module with detection and tracking has been run. The user can specify whether these bounding box images should be deleted after AlignedReID has been run. Extracting the bounding boxes, i.e. saving an image of every bounding box, takes a while, which is why timing was done both for when the bounding boxes already exist and for when they do not. Deep SORT feature vectors are extracted during the tracking step, which means that there is no need to save bounding boxes when matching with Deep SORT feature vectors.

The runtime for the multi-video matching depends on the total number of bounding boxes that the original bounding box is compared to. Hence, the number of bounding boxes that the original box has been compared to is included in the result.

Table 5: Time spent on re-identification

Submodule                          Time spent      Bounding boxes
AlignedReID with boxes             18 sec          187
AlignedReID without boxes          7 min 56 sec    187
Feature vectors from Deep SORT     6.4 sec         -

5.5.2 Re-identification using AlignedReID

Results from multi-video matching using AlignedReID are shown below. Figure 7 shows the original image and figure 8 shows the matched images in the other cameras.

Figure 7: The desired person in camera 0.

(a) Camera 1. (b) Camera 2. (c) Camera 3.

Figure 8: The top 5 matches in each of the other sequences using AlignedReID.

AlignedReID also performed quite well on the metro dataset. It worked especially well when good bounding boxes could be extracted.

5.5.3 Re-identification using Features from Deep SORT

Below, the result from an execution of the Deep SORT feature-based re-identification on the lab sequences is shown. Figure 9 shows the top 5 matches for the person in figure 7.

(a) Camera 1. (b) Camera 2. (c) Camera 3.

Figure 9: The top 5 matches in each of the other sequences using features from Deep SORT.

Overall, the matching performed fairly well on the lab dataset for different IDs. On the metro dataset, the matching worked very well for persons with colorful clothes, but worse for some of the other persons.

5.6 Human Parsing

Figure 10: Example of output from human parsing module.

Examples of the results from the human parsing and color mapping can be seen in figures 10 and 11. Figure 10 is an example where the parsing with the Part Grouping Network has worked well. The module has also been able to correctly find the color of the upper clothes. In figure 11 the parsing has not worked as well: the outline of the upper clothes has been classified as 'upper-clothes', but the rest has been classified as 'coat'.

Figure 11: Example of output from human parsing module.

6 Discussion

The complete system seems to perform quite well on the lab dataset. Below, a discussion is given on the different parts of the system.

6.1 Detection

Since Yolov3 is probably the fastest detector with competitive performance, we can draw the conclusion that it has been beneficial to use Yolov3 considering system runtime. However, certain accuracy sacrifices might have been made, which in turn might have had a negative effect on the tracking. That is, there is a speed versus accuracy trade-off.

To improve the system, a Yolov3 model that has been trained only on persons, bags and other objects of interest could be used. This would improve the accuracy of the detection part of the system and generate better bag detections.

6.2 Tracking

The Deep SORT tracker is fast, has few ID switches and is easy to use. The main drawback of the algorithm is the occlusion handling. Even though the tracker uses both motion and appearance for the association, it handles occlusion poorly in the Laboratory sequences. The feature vectors before and after occlusion are often too dissimilar to generate a match.

The confidence threshold that was used was larger than the threshold used in the original Deep SORT algorithm. A larger confidence threshold can potentially improve the performance, but the confidence value needs to be changed with care. A higher threshold can improve the performance since detections with low confidence might correspond to erroneously detected objects. However, if the threshold is too large, too many detections will be disregarded and the performance will be even worse, since Deep SORT is a relatively simple tracker that demands detections in most frames.

6.3 Person Re-identification

As shown in table 5, a lot of time can be saved by re-using the feature vectors from Deep SORT for re-identification. However, this comes at the cost of a possibly lower accuracy than when using AlignedReID. This is, of course, only beneficial when the features have already been extracted in the tracking step, a process that is also time consuming, see table 3.

In the lab sequences, the accuracy does not seem to differ much between the two methods. However, on the metro surveillance dataset the AlignedReID implementation performed better. Therefore, it makes sense to prioritize AlignedReID in this implementation, with the Deep SORT feature comparison as a faster alternative. This may however differ between datasets.

If the feature representation used in AlignedReID were already used in the tracking, instead of the current Deep SORT representation, the multi-video re-identification step

would be a lot faster when using AlignedReID. This would, however, require much longer computation time in the tracking step. Even though one feature representation might be better for re-identification within the same sequence while another representation is better for the multi-video case, it would in some cases make sense to use the same representation both in the tracking and in the multi-video matching in order to have a faster system.

6.4 Human Parsing

Overall the parsing gives good results, as long as the image is not too small and does not only contain a small part of a person. If there are several persons in the bounding box, the color output might be misleading, since attributes belonging to different persons will be added to one single detected object. This could be handled by connecting the parts to an instance of a person using both the part segmentation and the edge output from PGN, in a similar manner as in [15].

The mapping of a color code to a color name does not always work well. There might be several reasons for this. First of all, there are a lot of colors in the predefined set of colors; having fewer might yield better results. Secondly, the surveillance videos used are often greyish in nature, and a lot of the colors will come back as grey. This could be handled by comparing the hue of the colors instead, but then the problem is the definition of colors like grey, white and black in that color space.

Since PGN does both segmentation and detection, and can group parts to an instance of a person, it could be used earlier in the system to improve the detections, if speed is not an issue. This could also solve the problem with the parsing when the bounding box images are either too small or only contain a small part of a person, since the parsing would no longer depend on the bounding box output from the tracking module. The whole human parsing module could also be used to improve tracking, both within one video and across several videos.

6.5 Future Work

One might consider trying out another detection network, e.g. Mask R-CNN or RetinaNet. This would require a longer runtime, but it would most likely result in better accuracy.

If the processing speed is not a problem, one interesting extension regarding the tracking and re-identification would be to investigate the performance when using the feature representation from AlignedReID also in the tracking step. This would add significant complexity to the tracking, but it would probably generate more accurate results.

Another improvement that would give better accuracy, but also require a lot more processing, would be to use the attribute segmentation in the detection step. This could also give better feature vectors in the tracking, because the background would not have to be used when extracting features.

By detecting individual clothing items, e.g. hats or jackets, this information could also be used to greatly improve the re-identification module. By knowing the types and colors of every detected person's clothes, one could build a text-based search module. One

could, for example, search for a person with a blue jacket and a white hat, and the search module would quickly find the matching persons in the sequences. Combined with the comparison of the feature representations of the persons, this could be a powerful re-identification module.
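As a rough illustration of the idea, such a text-based search could be a simple filter over the attribute and color information already written to the json files; the field names below are hypothetical:

    def search_people(objects, required):
        # objects:  list of dicts such as {"id": 3, "attributes": {"coat": "Blue", "hat": "White"}}
        # required: dict of attribute -> color, e.g. {"coat": "Blue", "hat": "White"}
        return [obj["id"] for obj in objects
                if all(obj.get("attributes", {}).get(a) == c for a, c in required.items())]

    # search_people(parsed_objects, {"coat": "Blue", "hat": "White"})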

7 Conclusions

This project managed to fulfill most of the specified requirements, with the exception of some of the priority 3 requirements. The performance of the system varies depending on which data is used; however, the final system has an overall decent performance on the dataset provided by the client.

The system runs reasonably fast on a standard modern laptop, but with more powerful hardware, some further improvements could be made, as suggested in the discussion. This would most likely give more accurate results.

The tracking-by-detection approach that was used is heavily dependent on the detections from the detection stage. Yolov3, which was used, worked well, but it is possible that a better result could be achieved with another detector. Better detection and tracking would also lead to better person attribute segmentation and better person re-identification.

Appendix

References

[1] K. F. Chanho, L. A. Ciptadi, and J. Rehg. "Multiple Hypothesis Tracking Revisited". In: (2018). doi: 10.1109/WACV.2018.00087.
[2] S. H. Rezatofighi et al. "Joint Probabilistic Data Association Revisited". In: (2015). doi: 10.1109/ICCV.2015.349.
[3] A. Bewley et al. "Simple online and realtime tracking". In: 2016 IEEE International Conference on Image Processing (ICIP). 2016, pp. 3464-3468. doi: 10.1109/ICIP.2016.7533003.
[4] N. Wojke, A. Bewley, and P. Dietrich. "Simple Online and Realtime Tracking with a Deep Association Metric". In: (2017), pp. 3645-3649. doi: 10.1109/ICIP.2017.8296962.
[5] K. He et al. "Deep Residual Learning for Image Recognition". In: arXiv preprint arXiv:1512.03385 (2015).
[6] X. Zhang et al. "AlignedReID: Surpassing human-level performance in person re-identification". In: arXiv preprint arXiv:1711.08184 (2017).
[7] J. Redmon and A. Farhadi. "YOLOv3: An Incremental Improvement". In: arXiv (2018).
[8] W. Abdulla. Mask R-CNN for object detection and instance segmentation on Keras and TensorFlow. https://github.com/matterport/Mask_RCNN. 2017.
[9] Keras RetinaNet. https://github.com/fizyr/keras-retinanet.
[10] T. Lin et al. "Microsoft COCO: Common Objects in Context". In: CoRR abs/1405.0312 (2014). arXiv: 1405.0312. url: http://arxiv.org/abs/1405.0312.
[11] K. He et al. "Mask R-CNN". In: CoRR abs/1703.06870 (2017). arXiv: 1703.06870. url: http://arxiv.org/abs/1703.06870.
[12] T. Lin et al. "Focal Loss for Dense Object Detection". In: CoRR abs/1708.02002 (2017). arXiv: 1708.02002. url: http://arxiv.org/abs/1708.02002.
[13] K. Yamaguchi et al. "Parsing clothing in fashion photographs". In: June 2012, pp. 3570-3577. isbn: 978-1-4673-1226-4. doi: 10.1109/CVPR.2012.6248101.
[14] E. Simo-Serra et al. "A High Performance CRF Model for Clothes Parsing". In: Proceedings of the Asian Conference on Computer Vision (2014). 2014.
[15] K. Gong et al. "Instance-level Human Parsing via Part Grouping Network". In: ArXiv e-prints (July 2018). arXiv: 1808.00157 [cs.CV].
[16] N. Wojke and A. Bewley. "Deep Cosine Metric Learning for Person Re-identification". In: (2018), pp. 748-756. doi: 10.1109/WACV.2018.00087.
[17] A. Hermans, L. Beyer, and B. Leibe. "In Defense of the Triplet Loss for Person Re-Identification". In: arXiv preprint arXiv:1703.07737 (2017).
[18] Y. Zhang et al. "Deep Mutual Learning". In: CVPR. 2018.
[19] pytorch-yolo3. https://github.com/marvis/pytorch-yolo3.
[20] N. Wojke. Simple Online Realtime Tracking with a Deep Association Metric (deep_sort). https://github.com/nwojke/deep_sort. 2017.

[21] L. Zheng et al. "Scalable Person Re-identification: A Benchmark". In: Computer Vision, IEEE International Conference on. 2015.
[22] H. Luo. AlignedReID+: Dynamically Matching Local Information for Person Re-Identification. https://github.com/michuanhaohao/AlignedReID. 2018.
[23] R. Szeliski. Computer Vision: Algorithms and Applications. Springer, 2010.
