Mimicking Very Efficient Network for Object Detection
Quanquan Li1, Shengying Jin2, Junjie Yan1
1SenseTime  2Beihang University
[email protected], [email protected], [email protected]

Abstract

Current CNN based object detectors need initialization from pre-trained ImageNet classification models, which is usually time-consuming. In this paper, we present a fully convolutional feature mimic framework to train very efficient CNN based detectors that do not need ImageNet pre-training and achieve performance competitive with the large and slow models. We add supervision from high-level features of the large network during training to help the small network better learn object representations. More specifically, we apply the mimic method to features sampled from the entire feature map and use a transform layer to map features from the small network onto the same dimension as those of the large network. In training the small network, we optimize the similarity between features sampled from the same region on the feature maps of both networks. Extensive experiments are conducted on pedestrian and common object detection tasks using VGG, Inception and ResNet. On both Caltech and Pascal VOC, we show that the modified 2.5× accelerated Inception network achieves performance competitive with the full Inception network. Our faster model runs at 80 FPS for a large 1000×1500 input with only a minor degradation of performance on Caltech.

Method                                    MR−2    Parameters    Test time (ms)
Inception R-FCN                           7.15    2.5M          53.5
Mimic R-FCN (1/2-Inception)               7.31    625K          22.8
1/2-Inception fine-tuned from ImageNet    8.88    625K          22.8

Table 1: Parameters and test time of the large and small models, measured on a TITAN X with 1000×1500 input. The 1/2-Inception model trained by mimicking outperforms the one fine-tuned from an ImageNet pre-trained model. Moreover, it obtains performance similar to the large Inception model with only 1/4 of the parameters and achieves a 2.5× speed-up.

1. Introduction

Object detection is a fundamental problem in image understanding. It aims to determine where in the image the objects are and which category each object belongs to. Many popular object detection methods based on deep convolutional neural networks have been proposed, such as Faster R-CNN [28], R-FCN [6] and SSD [25]. Compared with traditional methods such as DPM [12], these CNN based frameworks achieve good performance on challenging datasets.

Since the pioneering work R-CNN [14], CNN based object detectors have needed a pre-trained ImageNet classification model for initialization to reach the desired performance. According to the experiments in [22], Fast R-CNN [13] with AlexNet trained from scratch gets 40.4% AP on Pascal VOC 2007, while with ImageNet pre-trained AlexNet it gets 56.8% AP. Due to this phenomenon, nearly all modern detection methods can only train networks that have previously been trained on ImageNet and cannot train a network from scratch to achieve comparable results. The result is that we can only use networks designed for the classification task, such as AlexNet [23], ZFNet [35], VGGNet [30] and ResNet [17], which are not necessarily optimal for detection. Due to this limitation, if we want to sweep different network configurations and find a more efficient network, we need to pre-train each model on the ImageNet classification task and then fine-tune it on the detection task. This process is very expensive, considering that training an ImageNet classification model takes several weeks even on multiple GPUs. Moreover, in experiments we find that smaller networks always perform poorly on ImageNet classification, so fine-tuning them on detection also leads to poor detection performance.

In this paper, we want to train more efficient detection networks without ImageNet pre-training. More importantly, we still need to achieve performance competitive with the large ImageNet pre-trained models. The basic idea is that if we already have a network that achieves satisfying performance for detection, that network can be used to supervise the training of other detection networks. The question then becomes how to use a detection network to supervise a more efficient network while keeping its accuracy for detection.

Similar ideas have been used in the standard classification task, such as [18, 2]. However, we find that they do not work well for the more complex detection task. The main problems are how and where to add the supervision from the detection ground-truth and the supervision from a different network.

Our solution for mimicking in object detection comes from an observation about modern CNN based detectors, including Faster R-CNN [28], R-FCN [6], SSD [25] and YOLO [27]: they all compute a feature map and then use different methods to decode detection results from it. In this way, a detector can be divided into a jointly trained feature extractor and a feature decoder, and the differences between the large network and the more efficient network lie in the feature extractor. To this end, our philosophy is that the mimicking supervision should be added on the feature map generated by the feature extractor, while the ground-truth supervision should be added on the final feature decoder. For the mimicking supervision, we define a transformation on top of the feature map generated by the small network, producing a new feature, and we minimize the Euclidean distance between this new feature and the feature generated by the large network.

The ground-truth supervision is the same as in the original detector, such as the joint classification and localization loss in Fast R-CNN. In training, we first extract the feature map of each training image with the large network, and then use these feature maps together with the detection annotations to jointly train the detector with the small network initialized from scratch. One problem is that the feature map is of very high dimension, and we find that directly mimicking the whole feature map does not converge as expected. Since the feature extractor is region or proposal based, we instead sample features from regions to optimize, which leads to satisfying results.

The feature map mimicking technique proposed in this paper can naturally be extended. The first extension is mimicking across scales. In CNN based detection, we need only 1/4 of the computation if we can halve the width and height of the input image; however, this usually leads to a significant performance drop. We show that we can define a simple transformation to up-sample the feature map to the large scale and then mimic the transformed feature map. Another extension is a two-stage mimicking procedure that further improves the performance.

We conduct experiments on Caltech pedestrian detection and Pascal VOC object detection using R-FCN and Faster R-CNN. We show that the mimicked models demonstrate performance on Caltech and Pascal VOC detection tasks superior to that of models fine-tuned from ImageNet pre-trained weights.

2. Related Work

The related work includes recent CNN based object detection, network mimicking and network training, as well as network acceleration.

A seminal CNN based object detection method is R-CNN [14], which uses a fine-tuned CNN to extract features from object proposals and an SVM to classify them. Spatial pyramid pooling [16] and Fast R-CNN [13] extract features on top of a shared feature map to speed up R-CNN. Faster R-CNN [28] further improves on this by predicting region proposals and classifying them on the shared feature map. The very recent R-FCN [6] proposes a position-sensitive score map to share even more computation. The R-CNN series treats object detection as a two-shot problem, consisting of region proposal generation and region classification. Recently, one-shot methods such as YOLO and SSD have been proposed. All these methods need to compute the feature map, which takes most of the computation. The mimicking technique we propose is validated on Faster R-CNN and R-FCN, but it can be naturally extended to SSD, YOLO and other methods based on CNN feature maps.

Network mimicking or distilling are recently introduced model acceleration and compression approaches [18, 2] that aim to train a more compact model which learns from the output of a large model. [29] further improves this method by using deeper student models and hints from the intermediate representations learned by the teacher network. However, to the best of our knowledge, all these mimicking works have only been validated on easy classification tasks [18, 2, 29, 33]. In this paper, we show how to extend mimicking techniques to the more challenging object detection task and how to use them to train more efficient object detectors.

Some works have been proposed to give better initialization or to replace ImageNet pre-training. [22] sets the weights of a network such that all units train at roughly the same rate, to avoid vanishing or exploding gradients. [1] and [8] learn an unsupervised representation from videos, and [8] uses spatial context as the source of the supervision signal for training. These methods perform much better than training randomly from scratch, but a large performance gap to ImageNet pre-trained models remains. The recent work [19] analyzes ImageNet features in detail.

Our work is also related to network acceleration. [7, 20, 24] accelerate single layers of a CNN through linear decomposition, while [38] considers the nonlinear
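To make the feature-level mimic described above concrete, the following is a minimal NumPy sketch of its three ingredients: a per-position transform layer (the 1×1-convolution-style map from student to teacher channel dimension), an L2 loss computed only on features sampled at region locations rather than the whole map, and the nearest-neighbour up-sampling used for the cross-scale extension. All array names, sizes and the choice of nearest-neighbour up-sampling are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: the teacher ("large") network emits a C_t-channel
# feature map; the student ("small") network emits C_s channels on the
# same H x W spatial grid.
H, W, C_t, C_s = 20, 30, 64, 32

teacher_fmap = rng.standard_normal((H, W, C_t))
student_fmap = rng.standard_normal((H, W, C_s))

# Transform layer: a per-position linear map (equivalent to a 1x1
# convolution) taking the student's C_s channels onto the teacher's
# C_t-dimensional feature space.
W_transform = rng.standard_normal((C_s, C_t)) * 0.1

def mimic_loss(student, teacher, W_t, positions):
    """L2 mimic loss over features sampled at region positions.

    Regressing the entire high-dimensional feature map does not
    converge well, so features are compared only at sampled
    proposal/region locations.
    """
    s = student[positions[:, 0], positions[:, 1]] @ W_t  # (N, C_t)
    t = teacher[positions[:, 0], positions[:, 1]]        # (N, C_t)
    return np.sum((s - t) ** 2) / (2.0 * len(positions))

# A handful of spatial locations, standing in for the centres of
# region proposals produced by the proposal network.
positions = np.stack([rng.integers(0, H, 8), rng.integers(0, W, 8)], axis=1)

loss = mimic_loss(student_fmap, teacher_fmap, W_transform, positions)

# Cross-scale extension: a student run on a half-resolution input yields
# a half-size feature map; up-sample it (nearest-neighbour via np.repeat)
# before the transform so it can mimic the full-scale teacher map.
small_fmap = rng.standard_normal((H // 2, W // 2, C_s))
upsampled = np.repeat(np.repeat(small_fmap, 2, axis=0), 2, axis=1)
scale_loss = mimic_loss(upsampled, teacher_fmap, W_transform, positions)
```

In the full system this mimic term would be added, with some weighting, to the detector's usual classification and localization loss; the sketch only shows how the per-region feature distance itself is formed.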