Squeezedet: Uniﬁed, Small, Low Power Fully Convolutional Neural Networks for Real-Time Object Detection for Autonomous Driving

SqueezeDet: Unified, Small, Low Power Fully Convolutional Neural Networks for Real-Time Object Detection for Autonomous Driving Bichen Wu1, Forrest Iandola1,2, Peter H. Jin1, Kurt Keutzer1,2 1UC Berkeley, 2DeepScale [email protected], [email protected], [email protected], [email protected] Abstract annotate. Recent progress in deep learning shows a promis- ing trend that with more and more data that cover all kinds Object detection is a crucial task for autonomous driv- of long-tail scenarios, we can always design more powerful ing. In addition to requiring high accuracy to ensure safety, neural networks with more parameters to digest the data and object detection for autonomous driving also requires real- become more accurate and robust. time inference speed to guarantee prompt vehicle control, While recent research has been primarily focused on im- as well as small model size and energy efficiency to enable proving accuracy, for actual deployment in an autonomous embedded system deployment. vehicle, there are other issues of image object detection that In this work, we propose SqueezeDet, a fully convolu- are equally critical. For autonomous driving some basic re- tional neural network for object detection that aims to si- quirements for image object detectors include the follow- multaneously satisfy all of the above constraints. In our ing: a) Accuracy. More specifically, the detector ideally network we use convolutional layers not only to extract fea- should achieve 100% recall with high precision on objects ture maps, but also as the output layer to compute bound- of interest. b) Speed. The detector should have real-time or ing boxes and class probabilities. The detection pipeline faster inference speed to reduce the latency of the vehicle of our model only contains a single forward pass of a neu- control loop. c) Small model size. As discussed in [18], ral network, thus it is extremely fast. Our model is fully- smaller model size brings benefits of more efficient dis- convolutional, which leads to small model size and bet- tributed training, less communication overhead to export ter energy efficiency. Finally, our experiments show that new models to clients through wireless update, less energy our model is very accurate, achieving state-of-the-art ac- consumption and more feasible embedded system deploy- curacy on the KITTI [10] benchmark. The source code of ment. d) Energy efficiency. Desktop and rack systems SqueezeDet is open-source released. 1 may have the luxury of burning 250W of power for neural network computation, but embedded processors target- 1. Introduction ing automotive market must fit within a much smaller power and energy envelope. While precise figures vary, the new A safe and robust autonomous driving system relies on Xavier2 processor from Nvidia, for example, is targeting a accurate perception of the environment. To be more spe- 20W thermal design point. Processors targeting mobile ap- cific, an autonomous vehicle needs to accurately detect cars, plications have an even smaller power budget and must fit pedestrians, cyclists, road signs, and other objects in real- in the 3W–10W range. Without addressing the problems of time in order to make the right control decisions that ensure a) accuracy, b) speed, c) small model size, and d) energy safety. Moreover, to be economical and widely deployable, efficiency, we won’t be able to truly leverage the power of this object detector must operate on embedded processors deep neural networks for autonomous driving. that dissipate far less power than powerful GPUs (Graph- In this paper, we address the above issues by present- ics Processing Unit) used for benchmarking in typical com- ing SqueezeDet, a fully convolutional neural network for puter vision experiments. object detection. The detection pipeline of SqueezeDet is Object detection is a crucial task for autonomous driving. inspired by [24]: first, we use stacked convolution filters Different autonomous vehicle solutions may have different to extract a high dimensional, low resolution feature map combinations of perception sensors, but image based object for the input image. Then, we use ConvDet, a convolu- detection is almost irreplaceable. Image sensors are inex- tional layer to take the feature map as input and compute pensive compared with others such as LIDAR. Image data a large amount of object bounding boxes and predict their (including video) are much more abundant than, for exam- categories. Finally, we filter these bounding boxes to ob- ple, LIDAR cloud points, and are much easier to collect and 2https://blogs.nvidia.com/blog/2016/09/28/ 1https://github.com/BichenWuUCB/squeezeDet xavier/ 129 tain final detections. The “backbone” convolutional neural ture allows it to amortize more computation across the re- net (CNN) architecture of our network is SqueezeNet [18], gion proposals. which achieves AlexNet level imageNet accuracy with a There have been a number of works that have adapted model size of less than 5MB that can be further compressed the R-CNN approach to address object detection for au- to 0.5MB. After strengthening the SqueezeNet model with tonomous driving. Almost all the top-ranked published additional layers followed by ConvDet, the total model size methods on the KITTI leader board are based on Faster R- is still less than 8MB. The inference speed of our model can CNN. [2] modified the CNN architecture to use shallower reach 57.2 FPS3 (frames per second) with input image res- networks to improve accuracy. [4, 28] on the other hand olution of 1242x375. Benefiting from the small model size focused on generating better region proposals. Most of and activation size, SqueezeDet has a much smaller mem- these methods focused on better accuracy, but to our knowl- ory footprint and requires fewer DRAM accesses, thus it edge, no previous methods have reported real-time infer- consumes only 1.4J of energy per image on a TITAN X ence speeds on KITTI dataset. GPU, which is about 84X less than a Faster R-CNN model Region proposals are a cornerstone in all of the object described in [2]. SqueezeDet is also very accurate. One of detection methods that we have discussed so far. How- our trained SqueezeDet models achieved the best average ever, in YOLO (You Only Look Once) [24], region propo- precision in all three difficulty levels of cyclist detection in sition and classification are integrated into one single stage. the KITTI object detection challenge [10]. Compared with R-CNN and Faster R-CNN based methods, The rest of the paper is organized as follows. We first YOLO’s single stage detection pipeline is extremely fast, review related work in section 2. Then, we introduce our making YOLO the first CNN based, general-purpose object detection pipeline, the ConvDet layer, the training protocol detection model that achieved real-time speed. and network design of SqueezeDet in section 3. In section 4, 2.2. Small CNN models we report our experiments on the KITTI dataset, and dis- cuss accuracy, speed, parameter size of our model. Due to Given a particular accuracy level on a computer vision limited page length, we put energy efficiency discussion in benchmark, it is natural to investigate the question: what is the supplementary material to this paper. We conclude the the smallest model size that can achieve that level of accu- paper in section 5. racy? SqueezeNet [18] was the result of one such investiga- tion. It achieved the same level of accuracy as AlexNet [20] 2. Related Work on ImageNet [7] image classification with less than 5MB of parameters: a reduction of 50x relative to AlexNet. Af- 2.1. CNNs for object detection ter SqueezeNet, several works continued to search for more From 2005 to 2013, various techniques were applied to compact network structures. ENet [23] explored spatial de- advance the accuracy of object detection on datasets such composition of convolutional kernels. Together with other as PASCAL [8]. In most of these years, variants of HOG techniques, ENet achieved SegNet [3] level accuracy for se- (Histogram of Oriented Gradients) + SVM (Support Vector mantic segmentation with 79X less parameters. Recently Machine) [6] or DPM (Deformable Part Models) [9] were MobileNet [17] explored channel-wise decomposition of used to define the state-of-art accuracy on these datasets. convolutional kernels, and was applied to several mobile However, in 2013, Girshick et al. proposed Region-based vision tasks including object detection, fine-grain classifi- Convolutional Neural Networks (R-CNN) [12], which led cation, face attributes and landmark recognition. to substantial gains in object detection accuracy. The R- 2.3. Fully convolutional networks CNN approach begins by identifying region proposals (i.e. regions of interest that are likely to contain objects) and then Fully-convolutional networks (FCN) were popularized classifying these regions using a CNN. One disadvantage of by Long et al., who applied them to the semantic segmen- R-CNN is that it computes the CNN independently on each tation domain [22]. FCN defines a broad class of CNNs, region proposal, leading to time-consuming (≤ 1 fps) and where the output of the final parameterized layer is a grid energy-inefficient (≥ 200 J/frame) computation. To rem- rather than a vector. This is useful in semantic segmen- edy this, Girshick et al. experimented with a number of tation, where each location in the grid corresponds to the strategies to amortize computation across the region propos- predicted class of a pixel. als [13, 19, 11], culminating in Faster R-CNN [25].Another FCN models have been applied in other areas as well. model, R-FCN (Region based Fully Convolutional Net- To address the image classification problem, a CNN needs work), is fully-convolutional and delivers accuracy that is to output a 1-dimensional vector of class probabilities. competitive with R-CNN.

Load more