
T-CNN: Tubelets with Convolutional Neural Networks for Object Detection from Videos

Kai Kang*, Hongsheng Li*, Junjie Yan, Xingyu Zeng, Bin Yang, Tong Xiao, Cong Zhang, Zhe Wang, Ruohui Wang, Xiaogang Wang, Member, IEEE, and Wanli Ouyang, Senior Member, IEEE

Abstract—The state-of-the-art performance for object detec- tion has been significantly improved over the past two years. Video (a) Besides the introduction of powerful deep neural networks such Detection as GoogleNet [1] and VGG [2], novel object detection frameworks Confidences such as R-CNN [3] and its successors, Fast R-CNN [4] and Faster R-CNN [5], play an essential role in improving the Video state-of-the-art. Despite their effectiveness on still images, those (b) frameworks are not specifically designed for object detection Detected Red Panda Red Panda Red Panda Red Panda Red Panda Red Panda from videos. Temporal and contextual information of videos are Classes Turtle Turtle not fully investigated and utilized. In this work, we propose a deep learning framework that incorporates temporal and Fig. 1. Limitations of still-image detectors on videos. (a) Detections from contextual information from tubelets obtained in videos, which still-image detectors contain large temporal fluctuations, because they do not dramatically improves the baseline performance of existing still- incorporate temporal consistency and constraints. (b) Still-image detectors image detection frameworks when they are applied to videos. It is may generate false positives solely based the information on single frames, called T-CNN, i.e. tubelets with convolutional neueral networks. while these false positives can be distinguished considering the context The proposed framework won newly introduced object-detection- information of the whole video. from-video (VID) task with provided data in the ImageNet Large- Scale Visual Recognition Challenge 2015 (ILSVRC 2015). Code is publicly available at https://github.com/myfavouritekk/T-CNN. class label in each frame of the videos, while test videos have no extra information pre-assigned, such as user tags. VID has a broad range of applications on video analysis. I.INTRODUCTION Despite their effectiveness on still images, still-image object N the last several years, the performance of object detection detection frameworks are not specifically designed for videos. I has been significantly improved with the success of novel One key element of videos is temporal information, because deep convolutional neural networks (CNN) [1], [2], [6], [7] locations and appearances of objects in videos should be and object detection frameworks [3]–[5], [8]. The state-of- temporally consistent, i.e. the detection results should not the-art frameworks for object detection such as R-CNN [3] have dramatic changes over time in terms of both bounding and its successors [4], [5] extract deep convolutional features box locations and detection confidences. However, if still- from region proposals and classify the proposals into different image object detection frameworks are directly applied to classes. DeepID-Net [8] improved R-CNN by introducing box videos, the detection confidences of an object show dramatic pre-training, cascading on region proposals, deformation layers changes between adjacent frames and large long-term temporal and context representations. Recently, ImageNet introduces a variations, as shown by an example in Fig. 1 (a). new challenge for object detection from videos (VID), which One intuition to improve temporal consistency is to prop- brings object detection into the video domain. 
In this chal- arXiv:1604.02532v4 [cs.CV] 3 Aug 2017 agate detection results to neighbor frames to reduce sudden lenge, an object detection system is required to automatically changes of detection results. If an object exists in a certain annotate every object in 30 classes with its bounding box and frame, the adjacent frames are likely to contain the same Copyright 2017 IEEE. Personal use of this material is permitted. However, object at neighboring locations with similar confidence. In permission to use this material for any other purposes must be obtained from other words, detection results can be propagated to adjacent the IEEE by sending an email to [email protected]. frames according to motion information so as to reduce missed This work is supported in part by SenseTime Group Limited, in part by the General Research Fund through the Research Grants Council of Hong detections. The resulted duplicate boxes can be easily removed Kong under Grants CUHK14213616, CUHK14206114, CUHK14205615, by non-maximum suppression (NMS). CUHK419412, CUHK14203015, CUHK14239816, CUHK14207814, in part Another intuition to improve temporal consistency is to by the Hong Kong Innovation and Technology Support Programme Grant ITS/121/15FX, in part by the China Postdoctoral Science Foundation under impose long-term constraints on the detection results. As Grant 2014M552339, in part by National Natural Science Foundation of China shown in Fig. 1 (a), the detection scores of a sequence of (No. 61371192), and in part by ONR N00014-15-1-2356. bounding boxes of an object have large fluctuations over time. *Kai Kang and Hongsheng Li share co-first authorship. Wanli Ouyang is the corresponding author. ([email protected]) These box sequences, or tubelets, can be generated by tracking Kai Kang, Hongsheng Li, Tong Xiao, Zhe Wang, Ruohui Wang, Xiaogang and spatio-temporal object proposal algorithms [9]. A tubelet Wang, and Wanli Ouyang are with The Chinese University of Hong Kong. can be treated as a unit to apply the long-term constraint. Low Cong Zhang is with Shanghai Jiao Tong University, China. Junjie Yan, Xingyu Zeng are with the SenseTime Group Limited. detection confidence on some positive bounding boxes may Bin Yang is with the Computer Science Department, University of Toronto. result from moving blur, bad poses, or lack of enough training 2 samples under particular poses. Therefore, if most bounding feature maps of the last convolutional layer. In the Faster R- boxes of a tubelet have high confidence detection scores, the CNN pipeline [5], the region proposals were generated by a low confidence scores at certain frames should be increased Region Proposal Network (RPN), and the overall framework to enforce its long-term consistency. can thus be trained in an end-to-end manner. He et al. proposed Besides temporal information, contextual information is a novel Residual Neural Network (ResNet) [6] based on also a key element of videos compared with still images. residual blocks, which enables training very deep networks Although image context information has been investigated with over one hundred layers. Based on the Faster R-CNN [8] and incorporated into still-image detection frameworks, framework, He et al. utilized ResNet to win the detection a video, as a collection of hundreds of images, has much challenges in ImageNet 2015 and COCO 2015. The ResNet richer contextual information. As shown in Fig. 
1 (b), a small has later been applied to many other tasks and proven its amount of frames in a video may have high confidence false effectiveness. Besides region based frameworks, some direct positives on some background objects. Contextual information regression frameworks have also been proposed for object within a single frame is sometimes not enough to distinguish detection. YOLO [12] divided the image into even grids and these false positives. However, considering the majority of simultaneously predicted the bounding boxes and classification high-confidence detection results within a video clip, the false scores. SSD [19] generated multiple anchor boxes for each positives can be treated as outliers and then their detection feature map location so as to predict bounding box regression confidences can be suppressed. and classification scores for bounding boxes with different The contribution of this works is three-folded. 1) We scales and aspect ratios. All these pipelines were for object propose a deep learning framework that extends popular still- detection from still images. When they are directly applied image detection frameworks (R-CNN and Faster R-CNN) to videos in a frame-by-frame manner, they might miss some to solve the problem of general object detection in videos positive samples because objects might not be of their best by incorporating temporal and contextual information from poses at certain frames of videos. tubelets. It is called T-CNN, i.e. tubelets with convolution Object detection in videos. Since the introduction of neural network. 2) Temporal information is effectively in- ImageNet VID dataset in 2015, there has been multiple works corporated into the proposed detection framework by locally that solve the video object detection problem. Han et al. [20] propagating detection results across adjacent frames as well proposed a sequence NMS method to associated still-image as globally revising detection confidences along tubelets gen- detections into sequences and apply the sequence-level NMS erated from tracking algorithms. 3) Contextual information on the results. Weaker class scores are boosted by the detection is utilized to suppress detection scores of low-confidence on the same sequence. Galteri et al. [21] proposed a closed- classes based on all detection results within a video clip. loop framework to use object detection results on the previous This framework is responsible for winning the VID task with frame to feed back to the proposal algorithm to improve provided data and achieving the second place with external window ranking. Kang et al. [22] proposed a tubelet proposal data in ImageNet Large-Scale Visual Recognition Challenge network to efficiently generates hundreds of tubelet proposals 2015 (ILSVRC2015). Code is available at https://github.com/ simultaneously. myfavouritekk/T-CNN. Object localization in videos. There have also been works on object localization and co-localization [23]–[27]. Although such a task seems to be similar, the VID task we focus II.RELATED WORK on is much more challenging. There are crucial differences Object detection from still images. State-of-the-art meth- between the two problems. 1) The (co)localization problem ods for detecting objects of general classes are mainly based only requires localizing one of the ground-truth objects of a on deep [1], [3]–[6], [8], [10]–[18]. Girshick et al. 
[3] known (weakly supervised setting) or unknown (unsupervised proposed a multi-stage pipeline called Regions with Convolu- setting) class in each test frame. In contrast, the VID requires tional Neural Networks (R-CNN) for training deep CNNs to annotating every object from all target classes in each frame. classify region proposals for object detection. It decomposed 2) The CorLoc metric is used for localization problem, which the detection problem into several stages including bounding- is the percentage of test frames in which one of the ground- box proposal, CNN pre-training, CNN fine-tuning, SVM train- truth objects is correctly localized with intersection-over-union ing, and bounding box regression. Such framework showed (IOU) > 0.5. In VID, the mean average precision (Mean good performance and was widely adopted in other works. AP) is used for evaluation, which firstly calculates average Szegedy et al. [1] proposed the GoogLeNet with a 22-layer precisions over all recall rates for each target class and then structure and “inception” modules to replace the CNN in takes the mean value of all the average precisions. 3) The the R-CNN, which won the ILSVRC 2014 object detection object (co)localization datasets [23], [28] contain annotations task. Ouyang et al. [8] proposed a deformation constrained on only a few frames (e.g., 1 frame for each training clip pooling layer and a box pre-training strategy, which achieved in the YouTubeObjects dataset), while VID dataset contains an accuracy of 50.3% on the ILSVRC 2014 test set. To annotations for different classes on every training and test accelerate the training of the R-CNN pipeline, Fast R-CNN frames. The much richer annotations not only enable investi- [4] was proposed, where each image patch was no longer gating supervised deep learning methods which were not pos- wrapped to a fixed size before being fed into CNN. Instead, sible before, but also evaluates algorithm performances more the corresponding features were cropped from the output precisely. With the differences above, the VID task is much 3 more challenging and promising for real-world applications. Still-image object detection. Our still-image object detec- The previous works on object localization cannot be directly tors adopt the DeepID-Net [8] and CRAFT [44] frameworks applied to the VID task, while the VID framework can be and are trained with both ImageNet detection (DET) and video reduced to solve the object localization problem. (VID) training datasets in ILSVRC 2015. DeepID-Net [8] is Image classification. The performance of image classifi- an extension of R-CNN [3] and CRAFT is an extension of cation has been significantly improved during the past few Faster R-CNN [5]. Both of the two frameworks contain the years thanks to the large scale datasets [29] as well as novel steps of object region proposal and region proposal scoring. deep neural networks and methods [1], [2], [6], [30], [31]. The The major difference is that in CRAFT (also Faster R-CNN), models for object detection are commonly pre-trained on the the proposal generation and classification are combined into ImageNet 1000-class classification task. Batch normalization a single end-to-end network. Still-image object detectors are layer was proposed in [7] to reduce the statistical variations applied to individual frames. The results from the two still- among mini batches and accelerate the training process. Si- image object detection frameworks are treated separately for monyan et al. 
proposed a 19-layer neural network with very the remaining components in the proposed T-CNN framework. small 3 × 3 convolution kernels in [2], which was proved Multi-context suppression. This process first sorts all still- effective in other related tasks such as detection [4], [5], action image detection scores within a video in descending orders. recognition [32], and semantic segmentation [33]. The classes with highly ranked detection scores are treated as Visual tracking. Object tracking has been studied for high-confidence classes and the rest as low-confidence ones. decades [34]–[41]. Recently, deep CNNs have been used for The detection scores of low-confidence classes are suppressed object tracking and achieved impressive tracking accuracy. to reduce false positives. Wang et al. [42] proposed to create an object-specific tracker Motion-guided Propagation. In still-image object detec- by online selecting the most influential features from an tion, some objects may be missed in certain frames while ImageNet pre-trained CNN, which outperforms state-of-the- detected in adjacent frames. Motion-guided propagation uses art trackers by a large margin. Nam et al. [43] trained a multi- motion information such as optical flows to locally propagate domain CNN for learning generic representations for tracking detection results to adjacent frames to reduce false negatives. objects. When tracking a new target, a new network is created Temporal tubelet re-scoring. Starting from high- by combining the shared layers in the pre-trained CNN with confidence detections by still-image detectors, we first run a new binary classification layer, which is online updated. tracking algorithms to obtain sequences of bounding boxes, Tracking is apparently different from VID, since it assumes which we call tubelets. Tubelets are then classified into the initial localization of an object in the first frame and it positive and negative samples according to the statistics of does not require predicting class labels. their detection scores. Positive scores are mapped to a higher range while negative ones to a lower range, thus increasing III.METHODS the margins. In this section, we first introduce the VID task setting Model combination. For each of the two groups of propos- (Section III-A) and our overall framework (Section III-B). als from DeepID-Net and CRAFT, their detection results from Then each major component will be introduced in more both tubelet re-scoring and the motion-guided propagation are details. Section III-C describes the settings of our still-image each min-max mapped to [0, 1] and combined by an NMS detectors. Section III-D introduces how to utilize multi-context process with an IOU overlap 0.5 to obtain the final results. information to suppress false positive detections and utilize motion information to reduce false negatives. Global tubelet re-scoring is introduced in Section III-E. C. Still-image object detectors Our still-image object detectors are adopted from DeepID- A. VID task setting Net [8] and CRAFT [44]. The two detectors have different The ImageNet object detection from video (VID) task is region proposal methods, pre-trained models and training similar to the object detection task (DET) in still images. It strategies. contains 30 classes to be detected, which are a subset of 200 1) DeepID-Net: Object region proposals. For DeepID- classes of the DET task. All classes are fully labeled in all Net, the object region proposals are obtained by selective the frames of each video clip. 
For each video clip, algorithms search (SS) [45] and Edge Boxes (EB) [46] with a cascaded need to produce a set of annotations (fi, ci, si, bi) of frame selection process that eliminates easy false positive boxes us- index fi, class label ci, confidence score si and bounding box ing an ImageNet pre-trained AlexNet [30] model. All proposal bi. The evaluation protocol for the VID task is the same as the boxes are then labeled with 200 ImageNet detection class DET task, i.e. we use the conventional mean average precision scores by the pre-trained AlexNet. The boxes whose maximum (Mean AP) on all classes as the evaluation metric. prediction scores of all 200 classes are lower than a threshold are regarded as easy negative samples and are eliminated. B. Framework overview The process removes around 94% of all proposal boxes while The proposed framework is shown in Fig. 2. It consists obtains a recall around 90%. of four main components: 1) still-image detection, 2) multi- Pre-trained models. ILSVRC 2015 has two tracks for context suppression and motion-guided propagation, 3) tem- each task. 1) For the provided data track, one can use data poral tubelet re-scoring, and 4) model combination. and annotations from all ILSVRC 2015 datasets including 4


Fig. 2. Our proposed T-CNN framework. The framework mainly consists of four components. 1) The still-image object detection component generates object region proposals in all the frames in a video clip and assigns each region proposal an initial detection score. 2) The multi-context suppression incorporates context information to suppress false positives and motion-guided propagation component utilizes motion information to propagate detection results to adjacent frames to reduce false negatives. 3) The tubelet re-scoring components utilizes tracking to obtain long bounding box sequences to enforce long-term temporal consistency of their detection scores. 4) The model combination component combines different groups of proposals and different models to generate the final results. classification and localization (CLS), DET, VID and Places2. generated object proposals and the Fast-RCNN stream which 2) For the external data track, one can use additional data further assigns a class (including background) score to each and annotations. For the provided data track, we pretrained proposal. VGG [2] and GoogLeNet [1] with batch normalization (BN) Object region proposals. In this framework, we use the en- [7] using the CLS 1000-class data, while for the external data hanced version of Faster-RCNN by cascade RPN and cascade track, we used the ImageNet 3000-class data. Pre-training is Fast-RCNN. In our cascaded version of RPN, the proposals done at the object-level annotation as in [8] instead of image- generated by the RPN are further fed into a object/background level annotation in R-CNN [3]. Fast-RCNN. We find that it leads to a 93% recall rate with Model finetuning and SVM training. Since the classes in about 100 proposals per image. In our cascade version of the VID are a subset of DET classes, the DET pretained networks Fast-RCNN, we further use a class-wise softmax loss as the and SVM can be directly applied to the VID task, with correct cascaded step. It is utilized for hard negative mining and leads class index mapping. However, due to the mismatch of the to about 2% improvement in mean AP. DET and VID data distributions and the unique statistics Pretrained models. Similar to the DeepID-Net setting, the in videos, the DET-trained models may not be optimal for pretrained models are the VGG and GoogLeNet with batch the VID task. Therefore, we finetuned the networks and re- normalization. We only use the VGG in the RPN step and use trained the 30 SVMs with combination of DET and VID data. both models in the later Fast-RCNN classification step. Different combination configurations are investigated and a Score average. The same greedy searching is conducted for 2 : 1 DET to VID data ratio achieves the best performance model averaging as the DeepID-Net framework. (see Section IV-B). Score average. Multiple CNN and SVM models are trained separately for the DeepID-Net framework, their results are av- D. Multi-context suppression (MCS) and motion-guided prop- eraged to generate the detection scores. Such score averaging agation (MGP) process is conducted in a greedy searching manner. The best Multi-context suppression (MCS). One limitation of di- single model is first chosen. Then for each of the remaining rectly applying still-image object detectors to videos is that models, its detection scores are averaged with those of the they ignore the context information within a video clip. 
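To make the greedy score-averaging procedure described above concrete, the following is a minimal sketch, not the released implementation: models are added one at a time, keeping the addition that most improves validation mean AP, and stopping when no significant improvement remains. The `evaluate_mean_ap` callable and the aligned per-proposal score arrays are hypothetical placeholders.

```python
import numpy as np

def greedy_model_averaging(model_scores, evaluate_mean_ap, min_gain=1e-3):
    """Greedily build an ensemble by averaging per-proposal detection scores.

    model_scores: dict mapping a model name to an np.ndarray of detection
                  scores, aligned across models (one entry per proposal).
    evaluate_mean_ap: hypothetical callable(np.ndarray) -> float that scores
                      a set of averaged detections on the validation set.
    """
    # Start from the single best-performing model.
    best_name = max(model_scores, key=lambda m: evaluate_mean_ap(model_scores[m]))
    chosen = [best_name]
    averaged = np.asarray(model_scores[best_name], dtype=np.float64).copy()
    best_ap = evaluate_mean_ap(averaged)

    while True:
        best_gain, best_pick = 0.0, None
        for name, scores in model_scores.items():
            if name in chosen:
                continue
            # Average the candidate's scores with the current ensemble.
            candidate = (averaged * len(chosen) + np.asarray(scores)) / (len(chosen) + 1)
            gain = evaluate_mean_ap(candidate) - best_ap
            if gain > best_gain:
                best_gain, best_pick = gain, (name, candidate)
        # Stop when no remaining model brings a significant improvement.
        if best_pick is None or best_gain < min_gain:
            break
        chosen.append(best_pick[0])
        averaged = best_pick[1]
        best_ap += best_gain
    return chosen, averaged
```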
The chosen model, and the model with best performance is chosen detection results in each frame of a video should be strongly as the second chosen model. The process repeats until no correlated and we can use such property to suppress false significant improvement is observed. positive detections. We observed that although video snippets 2) CRAFT: CRAFT is an extension of Faster R-CNN. in the VID dataset may contain arbitrary number of classes, It contains the Region Proposal Network (RPN) stream to statistically each video usually contains only a few classes and 5


co-existing classes have correlations. Statistics of all detections within a video can therefore help distinguish false positives.

For example in Fig. 3, in some frames from a video clip, some false positive detections have very large detection scores. Only using the context information within these frames cannot distinguish them from the positive samples. However, considering the detection results on other frames, we can easily determine that the majority of high-confidence detections belong to other classes and these false positive detections are outliers.

Fig. 3. Multi-context suppression. For each video, the classes with top detection confidences are regarded as the high-confidence classes (green arrows) and others are regarded as low-confidence ones (red arrows). The detection scores of high-confidence classes are kept the same, while those of low-confidence ones are decreased to suppress false positives.

For each frame, we have about a few hundred region proposals, each of which has detection scores of 30 classes. For each video clip, we rank all detection scores on all boxes in descending order. The classes whose detection scores are beyond a threshold are regarded as high-confidence classes and the rest as low-confidence classes. The detection scores of the high-confidence classes are kept the same, while those of the low-confidence classes are suppressed by subtracting a certain value. The threshold and the subtracted value are greedily searched on the validation set.

We observed that most videos contain only a small number of classes. We calculated the statistics of the number of classes on each video of the training set: the mean is µ = 1.134 and the standard deviation is σ = 0.356. The probability of a single video clip containing more than two classes is therefore low. MCS selects the top classes based on the ranking of box scores and penalizes the unlikely classes because of the low possibility of a large number of classes appearing in the same video.

Motion-guided propagation (MGP). The multi-context suppression process can significantly reduce false positive detections, but it cannot recover false negatives. The false negatives are typically caused by several reasons: 1) there are no region proposals covering enough area of the objects; 2) due to the bad pose or motion blur of an object, its detection scores are low.

These false negatives can be recovered by adding more detections from adjacent frames. Because adjacent frames are highly correlated, the detection results should also have high correlations both in spatial locations and detection scores. For example, if an object is still or moves at a low speed, it should appear at similar locations in adjacent frames. This inspires us to propagate boxes and their scores from each frame to its adjacent frames to augment detections and reduce false negatives.

If the baseline still-image detector is applied to the videos frame by frame, the false negatives may be caused by missing box proposals or low detection scores. If we assume the probability of a false negative to be p_n, MGP with window w propagates the results of w neighboring frames to the current frame, which on average reduces the mis-detection rate to p_n^w if the optical flow estimation is perfect. Although long-range MGP is limited by the accuracy of optical flow estimation, the mis-detection rate can generally be reduced within small temporal windows, and new false positives are unlikely to be added because of the final NMS operation.

Fig. 4. Motion-guided propagation. Before the propagation, some frames may contain false negatives (e.g. some airplanes are missing in (a)). Motion-guided propagation propagates detections to adjacent frames (e.g. from t = T to t = T − 1 and t = T + 1) according to the mean optical flow vector of each detection bounding box. After propagation, fewer false negatives exist in (b).

We propose a motion-guided approach to propagate detection bounding boxes according to the motion information. For each region proposal, we calculate the mean optical flow vector within its bounding box and propagate the box coordinates, with the same detection scores, to adjacent frames according to the mean flow vectors. An illustrative example is shown in Fig. 4.
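The MCS step above can be summarized by a short sketch. It is a simplified illustration rather than the released code: the array layout and the ranking granularity (all class scores of all boxes) are interpretations, and the default top fraction and penalty are placeholders matching the greedily searched values reported in Section IV-B (0.0003 and 0.4).

```python
import numpy as np

def multi_context_suppression(scores, top_fraction=0.0003, penalty=0.4):
    """Suppress detection scores of low-confidence classes within one video.

    scores: np.ndarray of shape (num_boxes, num_classes) holding the
            still-image detection scores of every region proposal in the
            whole video clip (num_classes = 30 for ImageNet VID).
    Defaults follow the greedily searched values reported in Sec. IV-B;
    both are placeholders in this sketch.
    """
    scores = np.asarray(scores, dtype=np.float64)
    flat = scores.ravel()

    # Rank all class scores of all boxes in the video in descending order.
    num_top = max(1, int(round(top_fraction * flat.size)))
    top_idx = np.argsort(flat)[::-1][:num_top]

    # Classes appearing among the top-ranked detections are treated as
    # high-confidence classes for this video; the rest are low-confidence.
    _, top_classes = np.unravel_index(top_idx, scores.shape)
    high_conf = np.zeros(scores.shape[1], dtype=bool)
    high_conf[np.unique(top_classes)] = True

    # Keep high-confidence classes unchanged; subtract a constant from others.
    suppressed = scores.copy()
    suppressed[:, ~high_conf] -= penalty
    return suppressed
```

Suppressed scores simply drop below the final score threshold, so the corresponding false positives are removed without re-running the detectors.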

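A minimal sketch of one propagation step of MGP follows, assuming dense optical flow between adjacent frames has already been computed (the paper uses Farneback flow from OpenCV; `flow` here is any H x W x 2 array). Each box is shifted by the mean flow vector inside it and copied, with its score unchanged, to the neighboring frame; duplicates are removed afterwards by the usual NMS.

```python
import numpy as np

def propagate_one_step(boxes, scores, flow):
    """Propagate detections from frame t to frame t+1 using dense optical flow.

    boxes:  np.ndarray (N, 4) as (x1, y1, x2, y2) in frame t.
    scores: np.ndarray (N,) detection scores (kept unchanged).
    flow:   np.ndarray (H, W, 2) flow from frame t to t+1, e.g. the output
            of cv2.calcOpticalFlowFarneback on grayscale frames.
    """
    boxes = np.asarray(boxes, dtype=np.float64)
    h, w = flow.shape[:2]
    moved = boxes.copy()
    for i, (x1, y1, x2, y2) in enumerate(boxes):
        # Mean flow vector inside the detection bounding box.
        xs = slice(int(max(0.0, x1)), int(min(float(w), x2)) + 1)
        ys = slice(int(max(0.0, y1)), int(min(float(h), y2)) + 1)
        region = flow[ys, xs]
        if region.size == 0:
            continue
        dx, dy = region[..., 0].mean(), region[..., 1].mean()
        moved[i] = [x1 + dx, y1 + dy, x2 + dx, y2 + dy]
    return moved, np.asarray(scores).copy()
```

Applying this step forward and backward over a window of w frames, and then running NMS on each frame, reduces an independent per-frame miss rate of p_n to roughly p_n^w, as discussed above.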
E. Tubelet re-scoring

MGP generates short dense tubelets at every detection from our still-image detectors. It significantly reduces false negatives but only incorporates short-term temporal constraints and consistency into the final detection results. To enforce long-term temporal consistency of the results, we also need tubelets that span long periods of time. Therefore, we use tracking algorithms to generate long tubelets and associate still-image object detections around tubelets.

As shown in Fig. 2, the tubelet re-scoring includes three sub-steps: 1) high-confidence tracking, 2) spatial max-pooling, and 3) tubelet classification.

High-confidence tracking. For each object class in a video clip, we track high-confidence detection proposals bidirectionally over the temporal dimension. The tracker we choose is from [42], which in our experiments shows robust performance to different object poses and scale changes. The starting bounding boxes of tracking are called "anchors", which are determined as the most confident detections. Starting from an anchor, we track bidirectionally to obtain a complete tubelet. As the tracking is conducted along the temporal dimension, the tracked box may drift to the background or other objects, or may not adapt to the scale and pose changes of the target object. Therefore, we stop the tracking early when the tracking confidence is below a threshold (probability of 0.1 in our experiments) to reduce false positive tubelets. After obtaining a tubelet, a new anchor is selected from the remaining detections to start a new track. Usually, high-confidence detections tend to cluster both spatially and temporally, so directly tracking the next most confident detection tends to result in tubelets with large mutual overlaps on the same object. To reduce the redundancy and cover as many objects as possible, we perform a suppression process similar to NMS: detections that overlap the existing tracks beyond a certain threshold (intersection-over-union, IOU, 0.3 in our experiment) will not be chosen as new anchors. The tracking-suppression process is performed iteratively until the confidence values of all remaining detections are lower than a threshold. For each video clip, such a tracking process is performed for each of the 30 VID classes.

Spatial max-pooling. After tracking, for each class, we have tubelets with high-confidence anchors. A naive approach is to classify each bounding box on the tubelets using still-image object detectors. However, since the boxes from tracked tubelets and those from still-image object detectors have different statistics, when a still-image object detector is applied to a bounding box obtained from tracking, the detection score may not be accurate. In addition, the tracked box locations may not be optimal due to tracking failures. Therefore, the still-image detection scores on the tracked tubelets may not be reliable.

However, the detections spatially close to the tubelets can provide helpful information. The spatial max-pooling process replaces tubelet box proposals with detections of higher confidence from the still-image object detector.

For each tubelet box, we first obtain the detections from still-image object detectors that overlap the box beyond a threshold (IOU 0.5 in our setting). Then only the detection with the maximum detection score is kept and used to replace the tracked bounding box. This process simulates the conventional NMS process in object detection. If the tubelet box is indeed a positive box but has a low detection score, this process can raise its detection score. The higher the overlap threshold, the more confidence is placed on the tubelet box. In the extreme case when IOU = 1 is chosen as the threshold, we fully rely on the tubelet boxes, while their surrounding boxes from still-image object detectors are not considered.

Fig. 5. Tubelet classification. Tubelets obtained from tracking can be classified into positive and negative samples using statistics (e.g. top-k, mean, median) of the detection scores on the tubelets. Based on the statistics on the training set, a 1-D Bayesian classifier is trained to classify the tubelets for re-scoring.

Tubelet classification and rescoring. High-confidence tracking and spatial max-pooling generate long sparse tubelets that become candidates for temporal rescoring. The main idea of temporal rescoring is to classify tubelets into positive and negative samples and map the detection scores into different ranges to increase the score margins.

Since the input only contains the original detection scores, the features for tubelet classification should also be simple. We tried different statistics of tubelet detection scores such as mean, median and top-k (i.e. the k-th largest detection score from a tubelet). A Bayesian classifier is trained to classify the tubelets based on these statistics as shown in Fig. 5, and in our experiments, the top-k feature works best.

After classification, the detection scores of positive samples are min-max mapped to [0.5, 1], while those of negatives are mapped to [0, 0.5]. Thus, the tubelet detection scores are globally changed so that the margins between positive and negative tubelets are increased.

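A compact sketch of tubelet classification and re-scoring with the top-k statistic is given below. The paper trains a 1-D Bayesian classifier on the statistic; here a plain threshold stands in for that classifier, and since the paper does not fully specify whether the min-max mapping is applied per tubelet or jointly per group, this sketch maps each group jointly. Both choices are assumptions of the sketch.

```python
import numpy as np

def topk_statistic(tubelet_scores, k=5):
    """k-th largest detection score of a tubelet (the paper's best feature)."""
    s = np.sort(np.asarray(tubelet_scores, dtype=np.float64))[::-1]
    return s[min(k, len(s)) - 1]

def rescore_tubelets(tubelets, k=5, pos_threshold=0.5):
    """Map positive tubelet scores to [0.5, 1] and negative ones to [0, 0.5].

    tubelets: list of 1-D arrays, each holding the per-frame detection
              scores of one tubelet.
    pos_threshold: plain threshold standing in for the trained 1-D
                   Bayesian classifier (an assumption of this sketch).
    """
    stats = np.array([topk_statistic(t, k) for t in tubelets])
    is_positive = stats >= pos_threshold

    rescored = [np.asarray(t, dtype=np.float64) for t in tubelets]
    for mask, (low, high) in ((is_positive, (0.5, 1.0)),
                              (~is_positive, (0.0, 0.5))):
        idx = np.where(mask)[0]
        if idx.size == 0:
            continue
        # Min-max statistics taken jointly over all tubelets in the group.
        group = np.concatenate([rescored[i] for i in idx])
        lo, span = group.min(), max(group.max() - group.min(), 1e-12)
        for i in idx:
            rescored[i] = low + (high - low) * (rescored[i] - lo) / span
    return rescored, is_positive
```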
Fig. 6. Qualitative results. The bounding boxes are tight to the objects because of the combination of different region proposals. The detection results are consistent across adjacent frames thanks to motion-guided propagation and tracking. False positives are largely eliminated by multi-context suppression. (Different colors are used to mark bounding boxes in the same frame and do not represent tracking results.)
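For completeness, a sketch of the model combination step of Section III-B: the detections of the two frameworks are min-max mapped to [0, 1] and merged by NMS with an IOU overlap of 0.5. The greedy NMS below is the standard variant and the data layout is assumed; this is an illustration, not the authors' implementation.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Standard greedy non-maximum suppression; returns indices to keep."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        if order.size == 1:
            break
        rest = boxes[order[1:]]
        x1 = np.maximum(boxes[i, 0], rest[:, 0])
        y1 = np.maximum(boxes[i, 1], rest[:, 1])
        x2 = np.minimum(boxes[i, 2], rest[:, 2])
        y2 = np.minimum(boxes[i, 3], rest[:, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
        iou = inter / np.maximum(area(boxes[i]) + area(rest) - inter, 1e-10)
        order = order[1:][iou <= iou_thresh]
    return keep

def combine_models(results, iou_thresh=0.5):
    """Min-max map each detector's scores to [0, 1], then merge with NMS.

    results: list of (boxes, scores) pairs, one per detector (e.g. the
             DeepID-Net and CRAFT outputs for one frame and one class);
             placeholder containers for illustration only.
    """
    all_boxes, all_scores = [], []
    for boxes, scores in results:
        s = np.asarray(scores, dtype=np.float64)
        span = max(s.max() - s.min(), 1e-12)
        all_boxes.append(np.asarray(boxes, dtype=np.float64))
        all_scores.append((s - s.min()) / span)
    boxes = np.concatenate(all_boxes)
    scores = np.concatenate(all_scores)
    keep = nms(boxes, scores, iou_thresh)
    return boxes[keep], scores[keep]
```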

IV. EXPERIMENTS

A. Dataset

ImageNet VID Dataset. The proposed framework is evaluated on the ImageNet VID dataset for object detection in videos. 1) The training set contains 3862 fully-annotated video snippets ranging from 6 frames to 5492 frames per snippet. 2) The validation set contains 555 fully-annotated video snippets ranging from 11 frames to 2898 frames per snippet. 3) The test set contains 937 snippets, and the ground truth annotations are not publicly available. Since the official test server is primarily used for competition and has usage limitations, we primarily report the performances on the validation set, as is the common convention for object detection tasks. In the end, test results from the top-ranked teams that participated in ILSVRC 2015 are reported.

YouTubeObjects (YTO) Dataset. In addition to object detection on the ImageNet VID dataset, we also evaluated the object localization performance on the YTO dataset. This dataset contains 10 object classes with 870 training clips and 334 test clips. There are several differences between the ImageNet VID and YTO datasets. 1) The VID dataset is fully annotated for all frames on both the training and test sets, while YTO only contains bounding box annotations on very sparse frames (4,306 out of 521,831 training frames and 1,781 out of 198,321 test frames). 2) The evaluation metric for VID is Mean AP and the system needs to detect every object in every frame, whereas the YTO dataset requires localizing objects of a pre-known class for each video.

B. Parameter Settings

Data configuration. We investigated the ratio of training data combination from the DET and VID training sets, and its influence on the still-image object detector DeepID-Net. The best data configuration is then used for both DeepID-Net and CRAFT. Because the VID training set has many more fully annotated frames than the DET training set, we kept all the DET images and sampled the training frames in VID for different combination ratios in order to train the still-image object detectors.

TABLE I
PERFORMANCES OF THE STILL-IMAGE OBJECT DETECTOR DEEPID-NET SINGLE MODEL BY USING DIFFERENT FINETUNING DATA CONFIGURATIONS ON THE INITIAL VALIDATION SET. THE BASELINE DEEPID-NET OF ONLY USING THE DET TRAINING DATA HAS A MEAN AP OF 49.8.

DET:VID Ratio | 1:0  | 3:1  | 2:1  | 1:1  | 1:3
Mean AP / %   | 49.8 | 56.9 | 58.2 | 57.6 | 57.1

TABLE II
PERFORMANCES OF THE STILL-IMAGE OBJECT DETECTOR CRAFT SINGLE MODEL BY USING DIFFERENT FINETUNING DATA CONFIGURATIONS ON THE FINAL VALIDATION SET.

DET:VID Ratio | 0:1  | 2:1
Mean AP / %   | 61.5 | 63.9

TABLE III
PERFORMANCES OF DIFFERENT DATA CONFIGURATIONS ON THE VALIDATION SET FOR TRAINING SVMS IN DEEPID-NET. BASELINE (THE FIRST COLUMN) ONLY USES DET POSITIVE AND NEGATIVE SAMPLES AND THE RESULT IS A MEAN AP OF 49.8.

DET Positive  |      |      |      |      |      |
VID Positive  |      |      |      |      |      |
DET Negative  |      |      |      |      |      |
VID Negative  |      |      |      |      |      |
mean AP / %   | 49.8 | 47.1 | 35.8 | 51.6 | 52.3 | 53.7

We investigated several training data configurations by fine-tuning a GoogLeNet with BN layers. From TABLE I and II, we can see that the ratio of 2:1 between DET and VID data has the best performance on the still-image detector DeepID-Net and CRAFT single models; therefore, we finetuned all of our models using this data configuration.

In addition to model finetuning, we also investigated the data configurations for training the SVMs in DeepID-Net. The performances are shown in TABLE III, which shows that using positive and negative samples from both DET and VID data leads to the best performance.

Because of the redundancy among video frames, we also

TABLE IV because the multi-context information is used to suppress their PERFORMANCES ON THE VALIDATION SET BY DIFFERENT TEMPORAL scores. WINDOWSIZESOF MGP. Quantitative results The component analysis of our frame- Temporal window size work on both provided-data and additional-data tracks are Methods 1 (baseline) 3 5 7 9 shown in TABLE V. The results are obtained from the Duplicate 71.7 72.1 71.5 - 70.7 Motion-guided 71.7 73.0 73.0 72.6 validation set. From the table, we can see that the still-image object detectors obtain about 65 − 70% Mean AP. Adding temporal and contextual information through MCS, MGP and sampled the video frames by a factor of 2 during testing and tubelet re-scoring significantly improves the results by up to applied the still-image detectors to the remaining frames. The 6.7 percents. As shown in TABLE V, the MGP generally MCS, MGP and re-scoring steps in Section III-D and III-E are has 0.8-2.5% improvement, MCS has 2.2-3.8% improvement, then conducted. The detection boxes on the unsampled frames and tubelet rescoring has about 0.4%-1.2% improvement for are generated by interpolation and MGP. We did not observe different models. The final model combination process further significant performance differences with frame sampling on improves the performance. the validation set. Overall, our framework ranks 1st on the provided-data track To conclude, we sampled VID frames to half the amount of in ILSVRC2015 winning 28 classes out of 30 and 2nd on the DET images and combined the samples to finetune the CNN additonal-data track winning 11 classes. The detailed AP lists models in both DeepID-Net and CRAFT. Positive and negative of the submitted models on the validation set are shown in samples from both DET and VID images are used to train TABLE VI. The final results of our team and other top-ranked SVMs in DeepID-Net. teams on the test data are shown in TABLE VII. Hyperparameter settings. For motion-guided propaga- We also used the same framework (without tubelet re- tions, as described in Section III-D, TABLE IV shows the scoring due to time limit) to participate in ILSVRC2016 VID performances of different propagation window sizes. Com- challenge with provided training data. The results are shown pared to directly duplicating boxes to adjacent frames without in TABLE VIII. The final ensemble model for the challenge changing their locations according to optical flow vectors, consists of 4 models, including 1) a ResNet-101 model, 2) MGP has better performances with the same propagation a ResNet-269 model, 3) a ResNet-269 model with gated bi- windows, which proves that MGP generates detections with directional (GBD) CNN [48] multi-context structures, and 4) more accurate locations. 7 frames (3 frames forward and 3 an Inception-ResNet-v2 model [49]. From the table we can backward) are empirically set as the window size. see that compared to the results in ILSVRC2015, the 4-model In multi-context suppression, classes in the top 0.0003 of all ensemble has about 12.5% (73.8 to 86.3) improvement on the bounding boxes in a video are regarded as high-confidence validation set and 9.0% (67.8 to 76.8) improvement on the classes and the detection scores for both frameworks are sub- test set, thanks to the new architectures such as ResNet, GBD- tracted by 0.4. Those hyperparameters are greedily searched net and Inception-ResNet. The MCS+MGP technique still has in the validation set. about 3.3% improvement given the much improved baseline. Network configurations. 
The models in DeepID-Net Overall, the proposed framework ranked No. 2 in ILSVRC and CRAFT are mainly based on GoogLeNet with batch- 2016 VID challenge with provided training data. normalization layers and VGG models. The techniques of multi-scale [47] and multi-region [13] are used to further B. Results analysis increase the number of models for score averaging in the still-image object detection shown in Fig. 2. The performance MGP in moving-camera sequences. We evaluated the MGP of a baseline DeepID-Net trained on ImageNet DET task technique in moving camera sequences. The validation videos can be increased from 49.8 to 70.7 with all the above- are manually labeled into two subsets based on whether mentioned techniques (data configuration for finetuning, multi- the video contains significant camera motion. In fact, the scale, multi-region, score average, etc.). ImageNet VID dataset mainly consists of online videos, most of which are shot with hand-held cameras. The moving-camera subset contains 401 out of 555 total validation videos. The V. RESULTS mean APs of the DeepID-net model are shown in TABLE IX, A. Results on the ImageNet VID dataset from which we can see the improvement of MGP is consistent Qualitative results. Some qualitative results of our pro- on the moving-camera subset. One interesting observation is posed framework are shown in Fig. 6. From the figure, we can that the MGP has better improvement on the moving-camera see the following characteristics of our proposed framework. 1) subset than the full set. The reason is that usually for moving- The bounding boxes are very tight to the objects, which results camera sequences, the camera moves to focus on the objects from the high-quality bonding box proposals combined from of interest. Therefore, although the background is moving, the Selective Search, Edge Boxes and Region-proposal Networks. objects are relatively stable in the scenes. 2) The detections are consistent across adjacent frames without Time efficiency. The proposed techniques add an limited obvious false negatives thanks to the motion-guided propaga- amount of computation time to the still-image baseline. For tion and tracking. 3) There are no obvious false positives even MGP, the main additional computation is the optical flow though the scenes may be complex (e.g. cases in the third row), generation. We adopted the Gunnar Farneback’s algorithm in 9

TABLE V
PERFORMANCES OF INDIVIDUAL COMPONENTS, FRAMEWORKS AND OUR OVERALL SYSTEM.

Data       | Model          | Still-image | +MGP | +MGP+MCS | +MGP+MCS+Rescoring | Model Combination | Test Set | Rank in ILSVRC2015 | #win
Provided   | CRAFT [44]     | 67.7        | 68.7 | 72.4     | 73.6               | 73.8              | 67.8     | #1                 | 28/30
Provided   | DeepID-net [8] | 65.8        | 68.3 | 72.1     | 72.5               |                   |          |                    |
Additional | CRAFT [44]     | 69.5        | 70.3 | 74.1     | 75.0               | 77.0              | 69.7     | #2                 | 11/30
Additional | DeepID-net [8] | 70.7        | 72.7 | 74.9     | 75.4               |                   |          |                    |

TABLE VI
PERFORMANCES OF OUR FINAL MODELS ON THE VALIDATION SET.

Data       | airplane | antelope | bear  | bicycle | bird  | bus   | car   | cattle | dog   | domestic cat | elephant | fox   | giant panda | hamster | horse | lion
Provided   | 83.70    | 85.70    | 84.40 | 74.50   | 73.80 | 75.70 | 57.10 | 58.70  | 72.30 | 69.20        | 80.20    | 83.40 | 80.50       | 93.10   | 84.20 | 67.80
Additional | 85.90    | 86.90    | 87.80 | 77.90   | 74.70 | 77.50 | 59.00 | 70.90  | 74.40 | 79.60        | 80.40    | 83.90 | 82.40       | 95.80   | 87.80 | 64.10

Data       | lizard | monkey | motorcycle | rabbit | red panda | sheep | snake | squirrel | tiger | train | turtle | watercraft | whale | zebra | mean AP | #win
Provided   | 80.30  | 54.80  | 80.60      | 63.70  | 85.70     | 60.50 | 72.90 | 52.70    | 89.70 | 81.30 | 73.70  | 69.50      | 33.50 | 90.20 | 73.80   | 28/30
Additional | 82.90  | 57.20  | 81.60      | 77.50  | 79.70     | 68.00 | 77.70 | 58.30    | 90.10 | 85.30 | 75.90  | 71.20      | 43.20 | 91.70 | 77.00   | 11/30

TABLE VII TABLE IX PERFORMANCECOMPARISONWITHOTHERTEAMSON ILSVRC2015 VID MGP OF DEEPID-NETONMOVING-CAMERASUBSET. TEST SET WITH PROVIDED DATA (SORTED BY MEAN AP, THEBESTMODEL ISCHOSENFOREACHTEAM). Set still-image +MGP Full validation 70.7 72.7 Rank Team name mean AP #win Moving-camera subset 72.3 74.5 1 CUVideo (Ours) 67.82 28 2 ITLab VID - Inha 51.50 0 3 UIUC-IFP [20] 48.72 0 the dataset with over 1 million frames, the computational cost 4 Trimps-Soushen 46.12 0 is acceptable. 5 1-HKUST 42.11 0 6 HiVision 37.52 0 7 RUC BDAI 35.97 2 C. Localication on YouTubeObjects (YTO) Dataset In addition to the detection results on the ImageNet VID TABLE VIII dataset, we also evaluated our framework on the YouTubeOb- 4-MODELENSEMBLEIN ILSVRC 2016 CHALLENGE. jects dataset for the object localization task. Different from the object detection task, object localization requires the system test Rank in Still-image +MCS +MGP to localize the objects of certain categories on certain video set ILSVRC2016 frames. Since the classes of the YTO dataset is a subset of 83.0 85.9 86.3 76.8 #2 ImageNet VID dataset, the system is directly evaluated on the YTO dataset without any fine-tuning. For each test frame, we only keep the detection result with the maximum detection OpenCV and the speed is about 10 fps. For MCS, the param- score. The evaluation metric is CorLoc [50] as a common eters are greedily searched using the detection scores, which practice on YTO. The CorLoc metric is the percentage of requires little additional computation. For tubelet re-scoring, images in which the system correctly localizes one object of the bottleneck is the single-object tracking. The tracker that the target class with an overlap of IOU > 0.5. we use is a deep neural network tracker of 0.5 fps. Since The localization performances of our system and some of we only generates about 10 tubelets for each video, it takes the state-of-the-art results on the YTO dataset are shown in about 3 days for completing the tracking for both training and TABLE X. The “baseline” network we used for this task is the validation set, which consists of 4417 videos. In total, with a best CRAFT single model finetuned with DET and VID data of 8-GPU server, it takes about 1 week to pre-train a model on ratio 2 : 1. Because only the detection with the highest score in ImageNet classification task, 2 days for fine-tuning on DET each frame is kept, we could only apply MGP post-processing dataset, and 2 days for fine-tuning on the VID dataset. The to the baseline network to improve its performance (denoted test time takes about 4 days in total (1 day for generating as “baseline + MGP”). From the table we can see that our still-image detections, 3 days for incorporating temporal and proposed framework outperforms previous methods with large context information). Considering there are a total of 3862 margin. Even compared to the state-of-the-art method [51], our training video, 555 validation videos and 937 test videos in system still obtains about 15 percents improvement. We also 10

TABLE X
OBJECT LOCALIZATION PERFORMANCES ON THE YTO DATASET.

Method                    | aero | bird | boat | car  | cat  | cow   | dog  | horse | mbike | train | Avg.
Prest et al. [23]         | 51.7 | 17.5 | 34.4 | 34.7 | 22.3 | 17.9  | 13.5 | 26.7  | 41.2  | 25.0  | 28.5
Joulin et al. [25]        | 25.1 | 31.2 | 27.8 | 38.5 | 41.2 | 28.4  | 33.9 | 35.6  | 23.1  | 25.0  | 31.0
Kwak et al. [26]          | 56.5 | 66.4 | 58.0 | 76.8 | 39.9 | 69.3  | 50.4 | 56.3  | 53.0  | 31.0  | 55.7
Kang et al. [51]          | 94.1 | 69.7 | 88.2 | 79.3 | 76.6 | 18.6  | 89.6 | 89.0  | 87.3  | 75.3  | 76.8
Baseline                  | 91.2 | 98.1 | 85.4 | 95.0 | 92.2 | 100.0 | 96.3 | 92.8  | 91.1  | 83.0  | 92.5
Baseline + MGP (window 3) | 91.8 | 98.7 | 85.4 | 95.0 | 92.2 | 100.0 | 95.7 | 93.4  | 93.9  | 84.2  | 93.0
Baseline + MGP (window 5) | 91.8 | 98.1 | 86.1 | 94.2 | 90.1 | 99.3  | 93.9 | 93.4  | 92.7  | 86.1  | 92.6
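The CorLoc numbers in TABLE X can be computed from per-frame outputs with a routine like the one below: a frame counts as correctly localized if the single highest-scoring box of the target class overlaps a ground-truth box of that class with IOU > 0.5. The data containers here are hypothetical placeholders.

```python
def box_iou(a, b):
    """IOU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda t: (t[2] - t[0]) * (t[3] - t[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def corloc(predictions, ground_truths, iou_thresh=0.5):
    """Percentage of test frames where the predicted box hits a ground truth.

    predictions:   list with one predicted box per evaluated frame
                   (the highest-scoring detection of the target class),
                   or None when the frame has no detection.
    ground_truths: list of lists of ground-truth boxes for the same frames.
    """
    correct = 0
    for pred, gts in zip(predictions, ground_truths):
        if pred is not None and any(box_iou(pred, gt) > iou_thresh for gt in gts):
            correct += 1
    return 100.0 * correct / max(len(predictions), 1)
```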

compared with MGP post-processing with window size of 3 [15] D. Mrowca, M. Rohrbach, J. Hoffman, R. Hu, K. Saenko, and T. Darrell, and 5. From the table, we can see that the window size 3 “Spatial semantic regularisation for large scale object detection,” in ICCV, 2015. improves the performance about 0.5 percent, while increasing [16] S. Bell, C. L. Zitnick, K. Bala, and R. Girshick, “Inside-outside net: the window size to 5 causes performance drop. Detecting objects in context with skip pooling and recurrent neural networks,” arXiv preprint arXiv:1512.04143, 2015. [17] J. Yan, Y. Yu, X. Zhu, Z. Lei, and S. Z. Li, “Object detection by labeling ONCLUSION VI.C superpixels,” in CVPR, 2015. In this work, we propose a deep learning framework that [18] K. Wang, L. Lin, W. Zuo, S. Gu, and L. Zhang, “Dictionary pair classifier incorporates temporal and contextual information into object driven convolutional neural networks for object detection,” in CVPR, 2016, pp. 2138–2146. detection in videos. This framework achieved the state-of-the- [19] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and art performance on the ImageNet object detection from video A. C. Berg, “Ssd: Single shot multibox detector,” in ECCV. Springer, task and won the corresponding VID challenge with provided 2016, pp. 21–37. [20] W. Han, P. Khorrami, T. L. Paine, P. Ramachandran, M. Babaeizadeh, data in ILSVRC2015. The component analysis is investigated H. Shi, J. Li, S. Yan, and T. S. Huang, “Seq-nms for video object and discussed in details. Code is publicly available. detection,” arXiv preprint arXiv:1602.08465, 2016. The VID task is still new and under-explored. Our proposed [21] L. Galteri, L. Seidenari, M. Bertini, and A. Del Bimbo, “Spatio-temporal closed-loop object detection,” IEEE Transactions on Image Processing, framework is based on the popular still-image object detection 2017. frameworks and adds important components specifically de- [22] K. Kang, H. Li, T. Xiao, W. Ouyang, J. Yan, X. Liu, and X. Wang, signed for videos. We believe that the knowledge of these “Object detection in videos with tubelet proposal networks,” CVPR, components can be further incorporated in to end-to-end 2017. [23] A. Prest, C. Leistner, J. Civera, C. Schmid, and V. Ferrari, “Learning systems and is our future research direction. object class detectors from weakly annotated video,” CVPR, 2012. [24] A. Papazoglou and V. Ferrari, “Fast Object Segmentation in Uncon- REFERENCES strained Video,” ICCV, 2013. [25] A. Joulin, K. Tang, and L. Fei-Fei, “Efficient Image and Video Co- [1] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, localization with Frank-Wolfe Algorithm,” ECCV, 2014. V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” [26] S. Kwak, M. Cho, I. Laptev, J. Ponce, and C. Schmid, “Unsupervised CVPR, 2015. Object Discovery and Tracking in Video Collections,” ICCV, 2015. [2] K. Simonyan and A. Zisserman, “Very deep convolutional networks for [27] O. Maxime, B. Leon, I. Laptev, and J. Sivic, “Is object localization for large-scale image recognition,” Int’l Conf. Learning Representations, free? – Weakly-supervised learning with convolutional neural networks 2014. ,” in Computer Vision and Pattern Recognition, 2015. [3] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature [28] M. Rubinstein, A. Joulin, J. Kopf, and C. 
Liu, “Unsupervised joint object hierarchies for accurate object detection and semantic segmentation,” discovery and segmentation in internet images,” in CVPR, 2013. CVPR, 2014. [29] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: [4] R. Girshick, “Fast r-,” ICCV, 2015. A large-scale hierarchical image database,” CVPR, 2009. [5] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time [30] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification object detection with region proposal networks,” NIPS, 2015. with deep convolutional neural networks,” NIPS, 2012. [6] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image [31] K. Wang, D. Zhang, Y. Li, R. Zhang, and L. Lin, “Cost-effective active recognition,” in Proceedings of the IEEE Conference on Computer Vision learning for deep image classification,” TCSVT, 2016. and Pattern Recognition, 2016, pp. 770–778. [32] K. Simonyan and A. Zisserman, “Two-Stream Convolutional Networks [7] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep for Action Recognition in Videos,” in Conference on Neural Information network training by reducing internal covariate ,” arXiv preprint Processing Systems, 2014. arXiv:1502.03167, 2015. [8] W. Ouyang, X. Wang, X. Zeng, S. Qiu, P. Luo, Y. Tian, H. Li, S. Yang, [33] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks Z. Wang, C.-C. Loy et al., “DeepID-net: Deformable deep convolutional for semantic segmentation,” in Proceedings of the IEEE Conference on neural networks for object detection,” CVPR, 2015. Computer Vision and Pattern Recognition, 2015, pp. 3431–3440. [9] D. Oneata, J. Revaud, J. Verbeek, and C. Schmid, “Spatio-temporal [34] H. Possegger, T. Mauthner, P. M. Roth, and H. Bischof, “Occlusion Object Detection Proposals,” ECCV, 2014. geodesics for online multi-object tracking,” CVPR, 2014. [10] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. Le- [35] Y. Li, J. Zhu, and S. C. Hoi, “Reliable patch trackers: Robust visual Cun, “Overfeat: Integrated recognition, localization and detection using tracking by exploiting reliable patches,” CVPR, 2015. convolutional networks,” arXiv preprint arXiv:1312.6229, 2013. [36] Z. Hong, Z. Chen, C. Wang, X. Mei, D. Prokhorov, and D. Tao, “Multi- [11] D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov, “Scalable object store tracker (muster): a cognitive psychology inspired approach to detection using deep neural networks,” 2014. object tracking,” CVPR, 2015. [12] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You Only Look [37] E. Maggio, F. Smerladi, and A. Cavallaro, “Adaptive multifeature Once: Unified, Real-Time Object Detection,” in CVPR, 2016. tracking in a particle filtering framework,” TCSVT, 2007. [13] S. Gidaris and N. Komodakis, “Object detection via a multi-region and [38] E. Maggio, M. Taj, and A. Cavallaro, “Efficient multitarget visual semantic segmentation-aware cnn model,” in ICCV, 2015. tracking using random finite sets,” TCSVT, 2008. [14] R. Girshick, F. Iandola, T. Darrell, and J. Malik, “Deformable part [39] A. Cavallaro, O. Steiger, and T. Ebrahimi, “Tracking video objects in models are convolutional neural networks,” in CVPR, 2015. cluttered background,” TCSVT, 2005. 11

[40] J. Yang, D. Schonfeld, and M. Mohamed, “Robust video stabilization Bin Yang received the B.Eng. degree in Electronics based on particle filter tracking of projected camera motion,” TCSVT, and Information Engineering from China Agricul- 2009. tural University in 2014. He is currently pursuing [41] C. Li, L. Lin, W. Zuo, and J. Tang, “Learning patch-based dynamic the M.Sc. degree with the Computer Science Depart- graph for visual tracking.” in AAAI, 2017, pp. 4126–4132. ment, University of Toronto. His research interests [42] L. Wang, W. Ouyang, X. Wang, and H. Lu, “Visual tracking with fully include computer vision and machine learning. convolutional networks,” ICCV, 2015. [43] H. Nam and B. Han, “Learning multi-domain convolutional neural networks for visual tracking,” arXiv:1510.07945, 2015. [44] B. Yang, J. Yan, Z. Lei, and S. Z. Li, “Craft objects from images,” CVPR, 2016. [45] J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeulders, Tong Xiao is a Ph.D. candidate in electronic engi- “Selective search for object recognition,” IJCV, 2013. neering at The Chinese University of Hong Kong, [46] C. L. Zitnick and P. Dollar, “Edge Boxes: Locating Object Proposals advised by Prof. Xiaogang Wang. Tong received from Edges,” ECCV, 2014. Bachelor of Engineering in computer science from [47] X. Zeng, W. Ouyang, and X. Wang, “Window-object relationship guided Tsinghua University. His research interests include representation learning for generic object detections,” arXiv preprint computer vision and deep learning, especially learn- arXiv:1512.02736, 2015. ing better feature representations for human iden- [48] X. Zeng, W. Ouyang, B. Yang, J. Yan, and X. Wang, “Gated bi- tification. He also served as a reviewer of top-tier directional cnn for object detection,” in ECCV. Springer, 2016, pp. computer vision conferences and journals, including 354–369. CVPR, ICCV, ECCV, TIP, TNNLS, and TCSVT. [49] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi, “Inception-v4, inception-resnet and the impact of residual connections on learning,” Cong Zhang received the B.S. degree from Shang- arXiv preprint arXiv:1602.07261, 2016. hai Jiao Tong University, Shanghai in 2009 from [50] T. Deselaers, B. Alexe, and V. Ferrari, “Localizing Objects While Department of Electronic Engineering. He is cur- Learning Their Appearance,” ECCV, 2010. rently pursuing the Ph.D. degree at the Institute of [51] K. Kang, W. Ouyang, H. Li, and X. Wang, “Object Detection from Video Image Communication and Information Processing, Tubelets with Convolutional Neural Networks,” in Computer Vision and Shanghai Jiao Tong University. His current research Pattern Recognition, 2016. interests include crowd understanding, crowd behav- ior detection, and machine learning.

Kai Kang graduated from School of the Gifted Young at University of Science and Technology of Zhe Wang received the BEng from the Depart- China (USTC) in 2013 with a BS (Hons) degree in ment of Optical Engineering of Zhejiang University, Optics. He is currently a PhD candidate in Electronic China, in 2012. He is currently working toward a Engineering at the Chinese University of Hong Kong PhD degree in the Department of Electronic En- (CUHK). His research interests include computer gineering, the Chinese University of Hong Kong. vision, video analysis and deep learning. His research interest is focused on deep learning and its applications, especially on medical imaging, segmentation and detection.

Hongsheng Li received the bachelors degree in au- Ruohui Wang received his B.Eng. degree from tomation from East China University of Science and Department of Electronic Engineering at Tsinghua Technology, and the masters and doctorate degrees University in 2013. He is currently a Ph.D. student in computer science from Lehigh University, Penn- in Department of Information Engineering at the sylvania, in 2006, 2010, and 2012, respectively. He Chinese University of Hong Kong. His research is a research assistant professor in the Department interests include machine learning and distributed of Electronic Engineering at the Chinese University computing. of Hong Kong. His research interests include com- puter vision, medical image analysis and machine learning. Xiaogang Wang received the BS degree from the Special Class for Gifted Young at University of Junjie Yan is the R&D Director at SenseTime Science and Technology of China in electrical en- Group Limited. In SenseTime, he leads the R&D gineering and information science in 2001, and the in detection, tracking, recognition, and video intelli- MPhil degree from Chinese University of Hong gence. He graduated in Jul. 2015 with a PhD degree Kong in 2004. He received the PhD degree in at National Laboratory of Pattern Recognition, Chi- computer science from the Massachusetts Institute of nese Academy of Sciences, where he mainly worked Technology. He is currently an associate professor in on object detection and face recognition. the Department of Electronic Engineering at the Chi- nese University of Hong Kong. His research interests include computer vision and machine learning. Wanli Ouyang received the PhD degree in the Department of Electronic Engineering, The Chinese University of Hong Kong. Since 2017, he is a senior Xingyu Zeng received the PhD degree in the Depart- lecturer at the University of Sydney. His research ment of Electronic Engineering from Chinese Uni- interests include image processing, computer vision versity of Hong Kong in 2016 and the BS degree in and pattern recognition. electronic engineering and information science from the University of Science and Technology, China, in 2011. He is currently working in Sensetime Group Limited. His research interests include computer vision and deep learning.