Arxiv:1904.04989V1 [Cs.CV] 10 Apr 2019
Total Page:16
File Type:pdf, Size:1020Kb
FAMNet: Joint Learning of Feature, Affinity and Multi-dimensional Assignment for Online Multiple Object Tracking Peng Chu and Haibin Ling Temple University Philadelphia, PA USA fpchu, [email protected] Abstract other, which results in a complex method design and exten- sive tuning parameters to adapt different target categories Data association-based multiple object tracking (MOT) and tracking scenarios. involves multiple separated modules processed or optimized Recently, deep neural network (DNN) has been investi- differently, which results in complex method design and re- gated intensively to learn the association cost function in a quires non-trivial tuning of parameters. In this paper, we unified architecture combining both feature extraction and present an end-to-end model, named FAMNet, where Fea- affinity metric [10, 26, 42]. Through training, the task and ture extraction, Affinity estimation and Multi-dimensional scenario prior can be automatically adapted by the candi- assignment are refined in a single network. All layers in date representation and estimation metric without manu- FAMNet are designed differentiable thus can be optimized ally tuning the hyper-parameters. However, the associa- jointly to learn the discriminative features and higher-order tion algorithm still stands outside the network, which re- affinity model for robust MOT, which is supervised by the quires dedicated affinity samples to be manually fabricated loss directly from the assignment ground truth. We also from ground truth association for the training process. It integrate single object tracking technique and a dedicated is not guaranteed that training and inference phases share target management scheme into the FAMNet-based tracking the same data distribution; consequently it may lead to the system to further recover false negatives and inhibit noisy degraded generalizability of the trained model. Moreover, target candidates generated by the external detector. The crowded targets, similar appearance and fast motion impose proposed method is evaluated on a diverse set of bench- great ambiguity for the association only considering pairs of marks including MOT2015, MOT2017, KITTI-Car and UA- neighboring frames. Successful association requires global DETRAC, and achieves promising performance on all of optimization across multiple frames, where higher-order them in comparison with state-of-the-arts. discriminative clues such as appearance changes over time and motion context could be included. Learning the robust representation and affinity criteria without the cooperation 1. Introduction from the association procedure in this circumstance is even more complicated. Tracking multiple objects in video is critical for many Our objective in this paper is to formulate an end-to-end arXiv:1904.04989v1 [cs.CV] 10 Apr 2019 applications, ranging from vision-based surveillance to model for MOT: the Feature representation, Affinity model autonomous driving. A current popular framework to and Multi-dimensional assignment (MDA) are refined in a solve multiple object tracking (MOT) uses the tracking-by- single deep network named FAMNet, which is optimized detection strategy where target candidates generated from jointly to learn the task prior. In particular, feature sub- an external detector are associated and connected to form network is used to extract features for candidates on each the target trajectories across frames [1, 14, 21, 34, 37, 45, frame, after which an affinity sub-network estimates the 49]. At the core of tracking-by-detection strategy lies the higher-order affinity for all association hypothesis. With the data association problem which is usually treated as three affinity, the MDA sub-network is to optimize globally and separate parts: feature extraction for candidate representa- obtain the optimal assignments. By all layers in FAMNet tion, affinity metric to evaluate the cost of each association designed differentiable, the feature and affinity sub-network hypothesis and association algorithm to find the optimal can be trained directly referring to the assignment ground association. These parts involve multiple individual data- truth. To realize it, we make the following novelties to the processing steps and are optimized differently from each FAMNet and its based tracking system: 1 Images Sequences Candidates SOT Predictions Feature Forward/Tracking 픽푖푘 퐗 1 Sub-Net MDA Sub-Net 핏 Target 풜 Management Power L1 Feature Affinity Iteration Normalization Sub-Net Sub-Net 퐗 퐾 Detections Layer Layer 휕퐿 휕퐿 Ground Truth 휕푎푗 : 푎푗 휕퐱 푘 퐿 Feature 휕퐿 1 퐾 Assignment Sub-Net Hypothesis 휕퐟 푘 푖푘 Trajectory Backward/Training Generation FAMNet Figure 1. Overview of our FAMNet based tracking system. The sub-networks inside the yellow background consist the FAMNet. Fik is a set of features extracted from each frame as detailed in Sec. 4.1 and L is for the total loss. • We design an affinity sub-network that fuses discrim- while keeping other assignments fixed. In [41], MDA is for- inative higher-order appearance and motion informa- mulated as the rank-1 tensor approximation problem where tion into the affinity estimation. a dedicated power iteration with unite `1 normalization is • We propose an MDA sub-network, in which a modi- proposed to find the optimal solution. Our work is closely fied rank-1 tensor approximation power iteration is de- related to the MDA formulation, especially [41]. signed differentiable and adapted for the deep learning architecture. • We integrate single object tracking into the data Recently, deep learning is explored with increasing pop- association-based MOT. Detections and tracking pre- ularity in MOT with great success. Most recent solutions dictions are merged and selected optimally through rely on it as a powerful discriminative technique [1, 27, 42, MDA to construct the target trajectories. 56]. Tang et al. [43] propose to use DNN based Re-ID tech- • We employ a target management scheme where a ded- niques for affinity estimations. They include lift edges that icated CNN network is used to refine the target bound- connect two candidates spanning multiple frames to capture ing box to eliminate the noised candidates generated the long-term affinity. In [39], recurrent neural networks by external detector. (RNN) and long short-term memory (LSTM) is adapted to model the higher-order discriminative clue. Those methods To show the effectiveness of the proposed approach, it learn the networks in a separate process with the manually is evaluated on the popular multiple pedestrian and ve- fabricated affinity training samples. hicle tracking challenge benchmarks including MOT2015, MOT2017, KITTI-Car and UA-DETRAC. Our results show promising performance in comparison with other published works. Some recent works have gone further to tentatively solve MOT in an entirely end-to-end fashion. Ondruska and Pos- 2. Related Work ner [35] introduce the RNN for the task to estimate the can- didate state. Although this work is demonstrated on the Multiple object tracking (MOT) has been an active re- synthetic sensor data and no explicit data association is ap- search area for decades, and many methods have been inves- plied, it firstly shows the efficacy of using RNN for an end- tigated for this topic. Recently, the most popular framework to-end solution. Milan et al. [31] propose an RNN-LSTM for MOT is the tracking-by-detection. Traditional meth- based online framework to integrate both motion affinity es- ods primarily focus on solving the data association prob- timation and bipartite association into the deep learning net- lem using such as Hungarian algorithm [3, 15, 19], network work. They use LSTM to solve the data association target flow [12, 53, 55] and multiple hypotheses tracking [6, 23] by target at each frame where the constrains in data associ- on various of affinity estimation schemes. Higher-order ation are not explicitly built into the network but learned affinity provides the global and discriminative informa- from training data. For both works, only the occupancy tion that is not available in pairwise association. In or- status of targets are considered, the informative appearance der to utilize it, MOT is usually treated as the MDA prob- clue is not utilized. Different from their methods, we pro- lem. Collins [11] proposes a block ICM-like method for pose an MDA sub-network which handles both the data as- the MDA to incorporate higher-order motion model. The sociation and the constrains, and our affinity fuses both the method iteratively solves bipartite assignments alternatively appearance and motion clue for better discriminability. 3. Overview 3.2. Architecture Overview and Tracking Pipeline In this section, we first formulate the multiple object For each association batch, the FAMNet based tracking tracking (MOT) problem as a multi-dimensional assign- system takes the K +1 image frames and corresponding de- ment (MDA) form, and then provide an overview of our tections provided by an external detector as input. Detection FAMNet-based tracking system (overview in Fig.1). candidates are first used to generate the hypothesis trajecto- ries. Image patches of target candidates together with the 3.1. Problem Formulation trajectory hypothesis are passed into FAMNet to compute Following the notation in [41], the input for MOT is de- the set of local assignments as shown in Fig.1. Inside FAM- (k) K noted by O = fO gk=0, which contains K + 1 target Net, features of candidate patches are extracted through a candidate sets from K + 1 frames. For frame k, O(k) = feature sub-network. The affinity sub-network then calcu- fo(k)gIk is the set of I candidates to be matched or as- lates the affinity for all hypothesis trajectories on those fea- ik ik=1 k (k) tures to form the affinity tensor as described in Sec. 4.1. sociated, where o represents the status of the candidate ik With the affinity tensor, the optimal multi-dimension as- such as its center coordinate on the image frame. signments are estimated by the MDA sub-network as ex- With the input candidate set , MOT is to find a multi- O plained in Sec.