Perception Systems for Autonomous Vehicles using Energy-Efficient Deep Neural Networks
Forrest Iandola, Ben Landen, Kyle Bertin, Kurt Keutzer and the DeepScale Team

THE FLOW: IMPLEMENTING AUTONOMOUS DRIVING

[Figure: system diagram of an autonomous driving stack. Sensors (camera, ultrasonic, radar, LIDAR) and offline maps feed a real-time perception module, which in turn feeds path planning & actuation.]

Note: the visuals in this deck are an artist’s rendering created to help convey concepts. They should not be judged for accuracy.

What does a car need to see?
Object Detection
[Figure: camera image annotated with detected objects and confidences, e.g. Vehicle (99%), Pedestrian (99%), Cyclist (99%).]

What does a car need to see?
Distance
[Figure: the same scene, with an estimated distance for each detected object, e.g. Pedestrian at 7m, Cyclist at 14m, Vehicles at 10-20m.]

What does a car need to see?
Object Tracking
[Figure: the same scene, with each object assigned a persistent track ID and an age in frames, e.g. ID: 1 (tracked for 135 frames).]

What does a car need to see?
Free Space & Driveable Area
[Figure: the same scene, with the free space / driveable area highlighted on the road surface.]

What does a car need to see?
Lane Recognition
[Figure: the same scene, with lane boundaries marked.]

Today's autonomous cars require a lot of computing hardware!
…and perception is the most computationally-intensive part of the software stack.
Examples: Audi, BMW + Intel, Ford.
https://www.slashgear.com/man-vs-machine-my-rematch-against-audis-new-self-driving-rs-7-21415540/
https://newsroom.intel.com/news-releases/bmw-group-intel-mobileye-will-autonomous-test-vehicles-roads-second-half-2017/
http://cwc.ucsd.edu/content/connected-cars-long-road-autonomous-vehicles

Big computers = expensive cars. As a workaround, companies want people to share autonomous vehicles to amortize hardware costs. But shared autonomous vehicles will likely have some of the same downsides as public transportation.

Will better computer chips make autonomous cars affordable?

Deep Learning Processors have arrived! THE SERVER SIDE
Platform                         Computation   Memory Bandwidth   Computation-to-    Power         Year
                                 (GFLOP/s)     (GB/s)             bandwidth ratio    (TDP Watts)
NVIDIA K20 [1] (32-bit float)    3,500         208 (GDDR5)        17                 225           2012
NVIDIA V100 [2] (16-bit float)   112,000       900 (HBM2)         124 (yikes!)       250           2018
Uh-oh… Processors are improving much faster than Memory.
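The "computation-to-bandwidth ratio" column is just peak compute divided by memory bandwidth: the number of operations the chip must perform per byte fetched to keep its arithmetic units busy. A quick sketch reproducing the table's ratios:

```python
# Ops-per-byte the hardware demands: peak compute (GFLOP/s) / bandwidth (GB/s).
def ops_per_byte(gflops_per_s, gb_per_s):
    return gflops_per_s / gb_per_s

k20  = ops_per_byte(3500,   208)   # NVIDIA K20  (2012)
v100 = ops_per_byte(112000, 900)   # NVIDIA V100 (2018)

print(f"K20:  {k20:.0f} ops/byte")   # ~17
print(f"V100: {v100:.0f} ops/byte")  # ~124
# A DNN layer that performs fewer operations per byte of weights/activations
# than this ratio will be memory-bound on that chip.
```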
[1] https://www.nvidia.com/content/PDF/kepler/Tesla-K20-Passive-BD-06455-001-v05.pdf
[2] http://www.nvidia.com/content/PDF/Volta-Datasheet.pdf (PCIe version)

Deep Learning Processors have arrived! MOBILE PLATFORMS
Device            Cores                  Computation   Memory Bandwidth   Computation-to-    System Power       Year
                                         (GFLOP/s)     (GB/s)             bandwidth ratio    (TDP Watts)
Samsung Galaxy    Arm Mali T-628 GPU     120           12.8 (LPDDR3)      9.3                ~10                2013
Note 3 [1]        (32-bit float)
Huawei P20 [2]    Kirin 970 NPU          1,920         30 (LPDDR4X)       64 (ouch!)         ~10                2018
                  (16-bit float)
NVIDIA Jetson     NVIDIA Tensor Cores    30,000        137                218 (yikes!)       10 to 30           2018
Xavier [3,4]      (8-bit int)                                                                (multiple modes)
[1] https://indico.cern.ch/event/319744/contributions/1698147/attachments/616065/847693/gdb_110215_cesini.pdf
[2] https://www.androidauthority.com/huawei-announces-kirin-970-797788
[3] https://blogs.nvidia.com/blog/2018/01/07/drive-xavier-processor/
[4] https://developer.nvidia.com/jetson-xavier

What will the next generation Deep Learning servers look like?
https://medium.com/@shan.tang.g/a-list-of-chip-ip-for-deep-learning-48d05f1759ae

What will the next generation Deep Learning servers look like? 20 TOP/W COMPUTATION
Platform                         Efficiency   Computation   Memory Bandwidth   Computation-to-    Power         Year
                                 (TOP/s/W)    (TOP/s)       (TB/s)             bandwidth ratio    (TDP Watts)
NVIDIA K20 [1] (32-bit float)    0.015        3.50          0.208 (GDDR5)      17                 225           2012
NVIDIA V100 [2] (16-bit float)   0.45         112           0.900 (HBM2)       124                250           2018
Next-gen: 20 TOP/W (est.) [3]    20           2,500*        1.800 (HBM3)       1,389 (oh no!)     250           2020
[1] https://www.nvidia.com/content/PDF/kepler/Tesla-K20-Passive-BD-06455-001-v05.pdf
[2] http://www.nvidia.com/content/PDF/Volta-Datasheet.pdf (PCIe version)
[3] https://www.eteknix.com/gddr6-hbm3-details-emerge/
* Assuming half the power is spent on computation, and the other half is spent on memory and other devices: 20 TOP/s/W * 250 W * 0.5 = 2,500 TOP/s.

Small Neural Nets to the rescue

squeeze (verb): to make an AI system use less resources using whatever means necessary: less memory footprint, fewer computational operations, less memory bandwidth, less power and energy, and less time.
How?
• New DNN Models
• Application-specific Implementations
• Superior Data and Training
• Differentiated Quantization and Pruning Strategies

Most CV Applications Rely on Only a Few Core CV Capabilities
Image Classification · Object Detection · Semantic Segmentation
And the best accuracy for each of these capabilities is given by Convolutional Neural Nets.

But We Need a Very Different Kind of DNN
VGG16[1] model: - Parameter size: 552 MB - Memory: 93 MB/image - Computation: 15.8 GFLOPs/image
           TitanX        DGX-1         Smartphones        IoT Devices
Compute    11 TFLOP/s    170 TFLOP/s   100's of GFLOP/s   100's of MHz
Power      223 Watts     3.2 KWatts    3 Watts            <1 Watt
Memory     12 GB         128 GB        2-4 GB             <1 GB
[1] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556, 2014.

Speed is More Related to Memory Accesses than Operations
Samsung Exynos M1 (Galaxy S7) Access Times

                 L1 D-Cache (per core)   L2 Cache (shared)   Off-chip DRAM
Size             32 KB                   2 MB                4 GB
Read Latency     4 cycles                22 cycles           ~200 cycles
Read Bandwidth   20.8 GB/s               166.4 GB/s          28.7 GB/s

Energy is More Related to Memory Accesses than Operations (45nm, 0.9V)
Operation                 Relative energy cost (vs. 8-bit INT multiply)
8b INT Mult               1x (baseline)
16b FP Mult               5.5x
32b FP Mult               18.5x
64b Cache Read (32KB)     100x
64b Cache Read (1MB)      500x
DRAM Read                 10,000x
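Using the relative costs above (8-bit multiply = 1 unit, DRAM access ≈ 10,000 units), a back-of-the-envelope estimate for an illustrative layer (the MAC and weight counts below are made up for the example) shows why memory traffic, not arithmetic, dominates the energy budget:

```python
# Relative energy units taken from the chart above.
MULT_8B = 1          # one 8-bit integer multiply
DRAM_ACCESS = 10_000  # one off-chip DRAM access

# Hypothetical layer: 1M multiply-accumulates, 100K weights fetched from DRAM once.
macs = 1_000_000
weights_from_dram = 100_000

compute_energy = macs * MULT_8B
memory_energy = weights_from_dram * DRAM_ACCESS

ratio = memory_energy / compute_energy
print(int(ratio))  # 1000
# Even with 10x more multiplies than weight fetches, DRAM traffic costs
# 1000x the arithmetic. Keeping weights in cache (100-500 units per read)
# or shrinking the model attacks the dominant term.
```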
Mark Horowitz, “Computing’s Energy Problem (and what we can do about it),” ISSCC 2014

10,000 DNN Architectural Configurations Later: SqueezeNet (2016)
AlexNet [1]
SqueezeNet [2]
CNN              Top-5 ImageNet Accuracy   Model Parameters   Model Size
AlexNet [1]      80.3%                     60M                243 MB
SqueezeNet [2]   80.3%                     1.2M               4.8 MB (compresses to 500 KB)
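Much of SqueezeNet's savings come from its "Fire" module: a 1x1 "squeeze" layer shrinks the channel count before a mixed 1x1/3x3 "expand" layer. A parameter-count sketch (the channel sizes below follow the fire2 configuration from the SqueezeNet paper, but treat them as illustrative):

```python
# Parameters in a conv layer = kH * kW * in_channels * out_channels (biases ignored).
def conv_params(k, c_in, c_out):
    return k * k * c_in * c_out

# Plain 3x3 layer: 96 -> 128 channels.
plain = conv_params(3, 96, 128)

# Fire module: squeeze (s=16 1x1 filters), then expand (e1=64 1x1 + e3=64 3x3 filters).
squeeze = conv_params(1, 96, 16)
expand = conv_params(1, 16, 64) + conv_params(3, 16, 64)
fire = squeeze + expand

print(plain, fire, round(plain / fire, 1))  # 110592 11776 9.4
# Squeezing channels before the expensive 3x3 filters cuts parameters ~9x
# for the same output channel count.
```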
[1] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet classification with deep convolutional neural networks." NIPS 2012.
[2] Iandola, Forrest N., et al. "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size." arXiv:1602.07360 (February 2016).

SqueezeNet: Immediate Success in Embedded Vision
• Apple CoreML
• NXP – Embedded Vision Summit
• Qualcomm – Facebook F8

Enabled embedded processor vendors (ARM, NXP, Qualcomm) to demo CNNs. Quickly ported to all the major Deep Learning Frameworks.

SqueezeDet for Object Detection (2017)
[Figure: detection pipeline. Input image -> Conv layers -> feature map -> detection filter -> bounding boxes -> final detections.]

• ~2M model parameters
• 57 FPS
• 1.4 Joules/frame
Best Paper Award: Bichen Wu, Forrest Iandola, Peter H. Jin, and Kurt Keutzer. "SqueezeDet: Unified, small, low power fully convolutional neural networks for real-time object detection for autonomous driving." In Proceedings, CVPR Embedded Computer Vision Workshop, July 2017.

SqueezeSeg: Semantic Segmentation for LIDAR (2018)
SqueezeSegV2: LIDAR point cloud segmentation
• Higher accuracy: v1 [1]: 64.6% -> v2 [2]: 73.2% (+8.6%)
• Better Sim2Real performance: v1 [1]: 30% -> v2 [2]: 57.4% (+27.4%)
• Outperforms v1 trained on real data w/o intensity
[1] Wu, Bichen, et al. "SqueezeSeg: Convolutional neural nets with recurrent CRF for real-time road-object segmentation from 3D LiDAR point cloud." ICRA 2018.
[2] Wu, Bichen, et al. "SqueezeSegV2: Improved Model Structure and Unsupervised Domain Adaptation for Road-Object Segmentation from a LiDAR Point Cloud." arXiv:1809.08495 (2018).

Squeeze Family
SqueezeNet, SqueezeNext, ShiftNet, SqueezeDet, SqueezeSeg-{v1, v2}, DiracDeltaNet, DNASNet: covering Image Classification, Object Detection, and Semantic Segmentation.

Andrew Howard's MobileNets: Efficient On-Device Computer Vision Models
• Designed for efficiency on mobile phones.
• A family of Pareto-optimal models to target the needs of the user.
• V1 is based on Depthwise Separable Convolutions.
• V2 introduces Inverted Residuals and Linear Bottlenecks.
• Supports Classification, Detection, Segmentation, and more.
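The savings from V1's depthwise separable convolutions are easy to derive: a standard conv costs K·K·Cin·Cout multiplies per output position, while a depthwise K×K conv followed by a pointwise 1×1 conv costs K·K·Cin + Cin·Cout, a reduction of roughly 1/Cout + 1/K². A sketch with illustrative channel counts:

```python
def standard_cost(k, c_in, c_out):
    # Multiplies per output pixel for a standard KxK convolution.
    return k * k * c_in * c_out

def separable_cost(k, c_in, c_out):
    # Depthwise KxK (one filter per input channel) + pointwise 1x1.
    return k * k * c_in + c_in * c_out

k, c_in, c_out = 3, 256, 256
ratio = standard_cost(k, c_in, c_out) / separable_cost(k, c_in, c_out)
print(round(ratio, 1))  # 8.7
# For 3x3 kernels the factorization approaches a 9x saving as the
# channel count grows, which is the source of MobileNet's efficiency.
```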
MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications
MobileNetV2: Inverted Residuals and Linear Bottlenecks

DNN Architecture Search (10X) · Model Compression (≥50X)
Slide Credit: Prof. Warren Gross (McGill Univ.)

Anatomy of a convolution layer
[Figure: a convolution layer with a 13x13x384 input, convolved (⨷) with 384 filters of size 3x3x384, producing a 13x13x384 output.]

Filters: Kernel Reduction

[Figure: the same layer with each 3x3 kernel replaced by a 1x1 kernel.]

9x reduction in model parameters

Filters/Channel Reduction
[Figure: the same 3x3 convolution layer, but with the channel count reduced from 384 to 128 on both the input and the output of the layer.]

9x reduction in model parameters (3x fewer input channels and 3x fewer filters: 384/128 = 3, and 3 x 3 = 9)

Model Distillation/Compression
Model Distillation
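In its common form (a minimal sketch of the general idea, not necessarily the exact loss used in the cited mimicking paper), distillation trains the small "student" network to match the large "teacher" network's softened output distribution; real recipes usually also mix in the hard-label loss and a temperature-squared factor:

```python
import math

def softmax(logits, temperature=1.0):
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """Cross-entropy between the teacher's softened distribution and the student's."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher, student))

teacher_logits = [8.0, 2.0, 1.0]   # confident big model (illustrative values)
student_logits = [3.0, 1.5, 0.5]   # smaller model mid-training
loss = distillation_loss(student_logits, teacher_logits)
# A high temperature exposes the teacher's knowledge about relative
# class similarities, not just its argmax label.
```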
Li, et al. "Mimicking Very Efficient Network for Object Detection." CVPR 2017.

Examples of what's on a DNN Architect's Palette
• Spatial Convolution (e.g. 3x3)
• Pointwise Convolution (1x1)
• Depthwise Convolution
• Channel Shuffle
• Shift

The Art of Small Model Design: Small Neural Nets Are Beautiful – ESWeek 2017
The palette of an adept mobile/embedded DNN designer has grown very rich!
• Overall architecture: economize on layers while retaining accuracy
• Layer types:
  • Kernel reduction: 5x5 -> 3x3 -> 1x1
  • Channel reduction: e.g. Fire layer
• Experiment with novel layer types that consume no FLOPs: Shuffle, Shift
• Model distillation: let big models teach smaller ones
• Apply pruning
• Tailor bit precision (aka quantization) to the target processor
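Shuffle and shift are "free" in the sense that they move data rather than multiply it, adding essentially zero parameters and FLOPs. A pure-Python sketch of channel shuffle (the ShuffleNet-style interleaving of grouped channels) on a flat channel list:

```python
def channel_shuffle(channels, groups):
    """Interleave channels across groups: reshape (g, c//g), transpose, flatten.
    Pure data movement: no parameters, no multiplies. (Shift is similar in
    spirit: it moves each channel's feature map spatially instead.)"""
    n = len(channels)
    assert n % groups == 0
    per_group = n // groups
    return [channels[g * per_group + i]
            for i in range(per_group)
            for g in range(groups)]

# 6 channels in 2 groups: [a1,a2,a3, b1,b2,b3] -> [a1,b1, a2,b2, a3,b3]
print(channel_shuffle(["a1", "a2", "a3", "b1", "b2", "b3"], groups=2))
```

After a shuffle, the next grouped convolution sees channels from every group, restoring cross-group information flow without any extra arithmetic.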
Iandola, Forrest, and Kurt Keutzer. "Small neural nets are beautiful: enabling embedded systems with small deep-neural-network architectures." In Proceedings of the Twelfth International Conference on Hardware/Software Codesign and System Synthesis Companion, p. 1. ACM, 2017 (ESWeek 2017). Also arXiv:1710.02759.
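"Tailoring bit precision" from the palette above usually means mapping 32-bit float weights to low-bit integers plus a scale factor. A minimal symmetric 8-bit sketch (linear/uniform quantization; real deployments also calibrate ranges per layer or per channel, which this toy version omits):

```python
def quantize_int8(weights):
    """Symmetric linear quantization of a float list to int8 values plus a scale."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [qi * scale for qi in q]

w = [0.30, -1.27, 0.05, 0.91]        # illustrative float32 weights
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(w, w_hat))
# int8 storage is 4x smaller than float32, and integer multiplies cost far
# less energy than 32-bit float multiplies (see the Horowitz numbers earlier).
```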
Artistic/Engineering Process of Designing a Deep Neural Net
• Manual design:
  • Each iteration to evaluate a point in the design space is very expensive
  • Exploration limited by human imagination

Can we automate this?

DNAS: Differentiable Neural Architecture Search
• Extremely fast: 8 GPUs, 24 hours
• Can search under different conditions, case-by-case
• Optimizes for actual latency
Bichen Wu, Kurt Keutzer, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming Wu, Yuandong Tian, Peter Vajda, Yangqing Jia

DNAS in context (FLOPs to normalize comparison)
[Scatter plot: x-axis = FLOPs (more FLOPs: bad), y-axis = ImageNet top-1 accuracy (higher: good), marker size = search cost; circles denote unknown search cost.]

DNASNet (ours):   Acc: 74.2%, FLOPs: 295M, search cost: 216 GPU-hrs
PNAS [2]:         Acc: 74.2%, FLOPs: 588M, search cost*: 6,000 GPU-hrs
NAS [1]:          Acc: 74.0%, FLOPs: 564M, search cost: 48,000 GPU-hrs
MnasNet [6]:      Acc: 74.0%, FLOPs: 317M, search cost*: 91,000 GPU-hrs
DARTS [3]:        Acc: 73.1%, FLOPs: 595M, search cost: 288 GPU-hrs
MobileNetV2 [4]:  Acc: 71.8%, FLOPs: 300M, search cost unknown
AMC [5]:          Acc: 70.8%, FLOPs: 150M, search cost unknown

* Estimated from the paper description
[1] Zoph, Barret, et al. "Learning transferable architectures for scalable image recognition." arXiv:1707.07012 (2017).
[2] Liu, Chenxi, et al. "Progressive neural architecture search." arXiv:1712.00559 (2017).
[3] Liu, Hanxiao, Karen Simonyan, and Yiming Yang. "DARTS: Differentiable architecture search." arXiv:1806.09055 (2018).
[4] Sandler, Mark, et al. "MobileNetV2: Inverted Residuals and Linear Bottlenecks." CVPR 2018.
[5] He, Yihui, et al. "AMC: AutoML for model compression and acceleration on mobile devices." ECCV 2018.
[6] Tan, Mingxing, et al. "MnasNet: Platform-aware neural architecture search for mobile." arXiv:1807.11626 (2018).

DNAS for device-aware search
• For different target devices, both DNASNets achieve similar accuracy.
• However, per-target DNN optimization was required: each model is fastest on the device it was searched for.

NET            Latency on iPhoneX      Latency on Samsung S8   Top-1 acc
DNAS-iPhoneX   19.84 ms                23.33 ms (20% slower)   73.20%
DNAS-S8        27.53 ms (25% slower)   22.12 ms                73.27%

The Future: Breaking down the wall between DNN Design & Hardware Design
DNN Designers:
• Unaware of arithmetic intensity
• Unaware of floating point vs fixed point costs
• Unaware of memory hierarchy and latency

NN HW Accelerator architects:
• Using outdated models: AlexNet, VGG16
• Using irrelevant datasets: MNIST, CIFAR

Key Takeaways
• Autonomous vehicles currently need thousands (or even hundreds of thousands) of dollars of computing hardware
• Processing is on a trajectory of rapid improvement (in operations-per-Watt)
  • but other aspects of the system (e.g. memory) are improving much more slowly
  • today's neural networks will be choked by slow memory on tomorrow's DNN accelerators (this is already happening and will get worse)
• Designing new (smaller) neural networks helps with all of the following:
  • making full use of next-generation computing platforms
  • reducing the hardware costs in autonomous vehicles
  • enabling lower-cost, larger-scale rollouts of autonomous vehicles