Perception Systems for Autonomous Vehicles using Energy-Efficient Deep Neural Networks

Forrest Iandola, Ben Landen, Kyle Bertin, Kurt Keutzer, and the DeepScale Team

The Flow: Implementing Autonomous Driving

[Diagram: sensors (camera, ultrasonic, radar, LIDAR) and offline maps feed real-time perception, which in turn feeds path planning & actuation.]

What does a car need to see?

Note: the visuals below are an artist's rendering created to help convey concepts. They should not be judged for accuracy.

Object Detection

[Figure: street scene with detected pedestrians, cyclists, and vehicles, each labeled with a class and confidence score, e.g. Pedestrian (99%), Vehicle (100%).]


Distance

[Figure: the same scene with an estimated distance for each detected object, e.g. Pedestrian (99%) at 7m, Vehicle (100%) at 20m.]


Object Tracking

[Figure: the same scene with a persistent track ID and track length for each object, e.g. ID: 1 (135 frames), ID: 8 (60 frames).]


Free Space & Driveable Area

[Figure: the same scene with the free space / drivable area highlighted, in addition to the detections, distances, and track IDs.]


Lane Recognition

[Figure: the same scene with lane boundaries highlighted, in addition to the detections, distances, and track IDs.]

Today's autonomous cars require a lot of computing hardware!
…and perception is the most computationally intensive part of the software stack.

[Photos: autonomous test vehicles from Audi, BMW + Intel, and Ford, each loaded with computing hardware.]
Image credits:
https://www.slashgear.com/man-vs-machine-my-rematch-against-audis-new-self-driving-rs-7-21415540/
https://newsroom.intel.com/news-releases/bmw-group-intel-mobileye-will-autonomous-test-vehicles-roads-second-half-2017/
http://cwc.ucsd.edu/content/connected-cars-long-road-autonomous-vehicles

Big computers = expensive cars. As a workaround, companies want people to share autonomous vehicles to amortize hardware costs. But shared autonomous vehicles will likely have some of the same downsides as public transportation.

Will better computer chips make autonomous cars affordable?

Deep Learning Processors have arrived! THE SERVER SIDE

Platform | Computation (GFLOP/s) | Memory Bandwidth (GB/s) | Computation-to-bandwidth ratio | Power (TDP, Watts) | Year
NVIDIA K20 [1] | 3,500 (32-bit float) | 208 (GDDR5) | 17 | 225 | 2012
NVIDIA V100 [2] | 112,000 (16-bit float) | 900 (HBM2) | 124 (yikes!) | 250 | 2018

Uh-oh… Processors are improving much faster than Memory.
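To make the mismatch concrete, here is a back-of-the-envelope calculation of the computation-to-bandwidth ratio from the table above: the number of operations a chip must perform per byte fetched from memory just to keep its arithmetic units busy (a minimal sketch in Python):

```python
# Computation-to-bandwidth ratio: peak operations per byte of memory
# traffic. A DNN layer whose arithmetic intensity (ops performed per
# byte of weights/activations moved) falls below this ratio is
# memory-bound: the multipliers stall waiting on DRAM.
chips = {
    # name: (peak compute in GFLOP/s, memory bandwidth in GB/s)
    "NVIDIA K20 (2012)":  (3_500, 208),
    "NVIDIA V100 (2018)": (112_000, 900),
}

for name, (gflops, gbps) in chips.items():
    print(f"{name}: {gflops / gbps:.0f} ops/byte")

# K20:  ~17 ops/byte -- many classic DNN layers can keep this busy.
# V100: ~124 ops/byte -- a layer must do ~124 ops per byte moved,
#       a bar that memory-heavy layers (e.g. fully-connected) miss badly.
```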

[1] https://www.nvidia.com/content/PDF/kepler/Tesla-K20-Passive-BD-06455-001-v05.pdf
[2] http://www.nvidia.com/content/PDF/Volta-Datasheet.pdf (PCIe version)

Deep Learning Processors have arrived! MOBILE PLATFORMS

Device | Processor | Computation (GFLOP/s) | Memory Bandwidth (GB/s) | Computation-to-bandwidth ratio | System Power (TDP, Watts) | Year
Samsung Galaxy Note 3 [1] | Arm Mali T-628 GPU | 120 (32-bit float) | 12.8 (LPDDR3) | 9.3 | ~10 | 2013
Huawei P20 [2] | Kirin 970 NPU | 1,920 (16-bit float) | 30 (LPDDR4X) | 64 (ouch!) | ~10 | 2018
NVIDIA Jetson Xavier [3,4] | NVIDIA Tensor Cores | 30,000 (8-bit int) | 137 | 218 (yikes!) | 10 to 30 (multiple modes) | 2018

[1] https://indico.cern.ch/event/319744/contributions/1698147/attachments/616065/847693/gdb_110215_cesini.pdf
[2] https://www.androidauthority.com/huawei-announces-kirin-970-797788
[3] https://blogs.nvidia.com/blog/2018/01/07/drive-xavier-processor/
[4] https://developer.nvidia.com/jetson-xavier

What will the next generation of Deep Learning servers look like?

[Figure: landscape of deep learning chip and IP companies.]
https://medium.com/@shan.tang.g/a-list-of-chip-ip-for-deep-learning-48d05f1759ae

What will the next generation of Deep Learning servers look like? 20 TOP/W computation

Platform | Efficiency (TOP/s/W) | Computation (TOP/s) | Memory Bandwidth (TB/s) | Computation-to-bandwidth ratio | Power (TDP, Watts) | Year
NVIDIA K20 [1] | 0.015 | 3.50 (32-bit float) | 0.208 (GDDR5) | 17 | 225 | 2012
NVIDIA V100 [2] | 0.45 | 112 (16-bit float) | 0.900 (HBM2) | 124 | 250 | 2018
Next-gen (est.) 20 TOP/W [3] | 20 | 2,500* | 1.800 (HBM3) | 1,389 (oh no!) | 250 | 2020

[1] https://www.nvidia.com/content/PDF/kepler/Tesla-K20-Passive-BD-06455-001-v05.pdf
[2] http://www.nvidia.com/content/PDF/Volta-Datasheet.pdf (PCIe version)
[3] https://www.eteknix.com/gddr6-hbm3-details-emerge/
* Assuming half the power is spent on computation, and the other half is spent on memory and other devices: 20 TOP/s/W * 250 W * 0.5 = 2,500 TOP/s.

Small Neural Nets to the rescue

squeeze (verb): to make an AI system use fewer resources using whatever means necessary
• Memory Footprint
• Computational Operations
• Power and Energy
• Bandwidth
• Time

Ways to squeeze:
• New DNN Models
• Application-specific Implementations
• Superior Data and Training
• Differentiated Quantization and Pruning Strategies

Most CV Applications Rely on Only a Few Core CV Capabilities

• Image Classification
• Object Detection
• Semantic Segmentation

And the best accuracy for each of these capabilities is delivered by Convolutional Neural Nets.

But We Need a Very Different Kind of DNN

VGG16 [1] model:
- Parameter size: 552 MB
- Memory: 93 MB/image
- Computation: 15.8 GFLOPs/image

Platform | TitanX | DGX-1 | Smartphones | IoT Devices
Compute | 11 TFLOP/s | 170 TFLOP/s | 100's of GFLOP/s | 100's of MHz
Power | 223 Watts | 3.2 kWatts | 3 Watts | <1 Watt
Memory | 12 GB | 128 GB | 2-4 GB | <1 GB
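A quick sanity check of why a VGG16-class network does not fit the platforms on the right side of this table; the 30 fps camera rate is an assumed target, not a number from the slide:

```python
# VGG16 costs 15.8 GFLOPs per image. At an assumed 30 fps camera rate:
gflops_per_image = 15.8
fps = 30
print(gflops_per_image * fps)  # 474 GFLOP/s sustained

# A TitanX (11 TFLOP/s peak) handles this easily. A smartphone with a
# few hundred GFLOP/s of *peak* compute cannot sustain it, and its
# 2-4 GB of memory must also hold VGG16's 552 MB of parameters
# alongside the OS and other apps.
```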

[1] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556, 2014.

Speed Is More Related to Memory Accesses than Operations

Samsung Exynos M1 (Galaxy S7) Access Times
[Diagram: Exynos M1 cores, each with an L1 cache/TLB, sharing an L2 cache.]

Memory level | L1 D-Cache (per core) | L2 Cache (shared) | Off-chip DRAM
Size | 32 KB | 2 MB | 4 GB
Read Latency | 4 cycles | 22 cycles | ~200 cycles
Read Bandwidth | 20.8 GB/s | 166.4 GB/s | 28.7 GB/s

Energy Is More Related to Memory Accesses than Operations

Energy per operation (45 nm, 0.9 V), relative to an 8-bit integer multiply:
8b INT Mult | 1x
16b FP Mult | 5.5x
32b FP Mult | 18.5x
64b Cache Read (32 KB) | 100x
64b Cache Read (1 MB) | 500x
DRAM Read | 10,000x
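Using only the relative energies above, a rough sketch of why weight reuse matters: if every multiply had to fetch an operand from DRAM, memory energy would dwarf arithmetic energy by four orders of magnitude (the 1M-MAC layer and the reuse factor are illustrative assumptions):

```python
# Relative energy costs from the Horowitz chart above, normalized to
# one 8-bit integer multiply.
MULT_8B = 1
DRAM_READ = 10_000

macs = 1_000_000                 # hypothetical layer: 1M multiply-accumulates
compute = macs * MULT_8B

# Worst case: every MAC fetches one operand from DRAM (no reuse).
no_reuse = macs * (MULT_8B + DRAM_READ)

# Better: each weight is fetched from DRAM once and reused across
# 1,000 MACs (e.g. it stays in cache).
with_reuse = compute + (macs // 1_000) * DRAM_READ

print(no_reuse / compute)    # ~10,001x the arithmetic energy
print(with_reuse / compute)  # ~11x -- arithmetic is now comparable
```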

Mark Horowitz, "Computing's Energy Problem (and what we can do about it)," ISSCC 2014.

10,000 DNN Architectural Configurations Later: SqueezeNet (2016)

[Diagrams: the AlexNet [1] and SqueezeNet [2] architectures.]

CNN | ImageNet Top-5 Accuracy | Model Parameters | Model Size
AlexNet [1] | 80.3% | 60M | 243 MB
SqueezeNet [2] | 80.3% | 1.2M | 4.8 MB (compresses to 500 KB)
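Much of SqueezeNet's parameter savings comes from its Fire module: a 1x1 "squeeze" convolution that cuts the channel count, followed by an "expand" stage that mixes 1x1 and 3x3 filters. A minimal PyTorch sketch, with layer sizes chosen for illustration:

```python
import torch
import torch.nn as nn

class Fire(nn.Module):
    """SqueezeNet Fire module: squeeze with 1x1 convs, then expand with
    a mix of 1x1 and 3x3 convs whose outputs are concatenated."""
    def __init__(self, in_ch, squeeze_ch, expand1x1_ch, expand3x3_ch):
        super().__init__()
        self.squeeze = nn.Conv2d(in_ch, squeeze_ch, kernel_size=1)
        self.expand1x1 = nn.Conv2d(squeeze_ch, expand1x1_ch, kernel_size=1)
        self.expand3x3 = nn.Conv2d(squeeze_ch, expand3x3_ch,
                                   kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.squeeze(x))  # fewer channels -> fewer parameters
        return torch.cat([self.relu(self.expand1x1(x)),
                          self.relu(self.expand3x3(x))], dim=1)

# Example: 96 input channels squeezed to 16, expanded back to 64 + 64.
fire = Fire(96, 16, 64, 64)
out = fire(torch.randn(1, 96, 55, 55))  # shape: (1, 128, 55, 55)
```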

[1] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet classification with deep convolutional neural networks." NIPS 2012.
[2] Iandola, Forrest N., et al. "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size." arXiv:1602.07360 (February 2016).

SqueezeNet: Immediate Success in Embedded Vision

[Photos: SqueezeNet running in Apple CoreML, in NXP's demo at the Embedded Vision Summit, and in Qualcomm's demo at Facebook F8.]

• Enabled embedded processor vendors (ARM, NXP, Qualcomm) to demo CNNs
• Quickly ported to all the major Deep Learning frameworks

SqueezeDet for Object Detection (2017)

[Diagram: input image -> convolutional layers -> feature map -> ConvDet filter -> bounding boxes -> final detections.]

• ~2M model parameters
• 57 FPS
• 1.4 Joules/frame

Best Paper Award: Bichen Wu, Forrest Iandola, Peter H. Jin, and Kurt Keutzer. "SqueezeDet: Unified, small, low power fully convolutional neural networks for real-time object detection for autonomous driving." In Proceedings, CVPR Embedded Vision Workshop, July 2017.

SqueezeSeg: Semantic Segmentation for LIDAR (2018)

SqueezeSegV2: LIDAR point cloud segmentation
• Higher accuracy: v1 [1]: 64.6% -> v2 [2]: 73.2% (+8.6%)
• Better Sim2Real performance: v1 [1]: 30% -> v2 [2]: 57.4% (+27.4%)
• Outperforms v1 trained on real data without intensity

[1] Wu, Bichen, et al. "SqueezeSeg: Convolutional neural nets with recurrent CRF for real-time road-object segmentation from 3D LiDAR point cloud." ICRA 2018.
[2] Wu, Bichen, et al. "SqueezeSegV2: Improved model structure and unsupervised domain adaptation for road-object segmentation from a LiDAR point cloud." arXiv:1809.08495 (2018).

Squeeze Family

[Diagram: the Squeeze family of networks, grouped by the core capability each targets.]
• Image Classification: SqueezeNet, SqueezeNext, ShiftNet, DiracDeltaNet, DNASNet
• Object Detection: SqueezeDet
• Semantic Segmentation: SqueezeSeg-{v1, v2}

Andrew Howard's MobileNets: Efficient On-Device Computer Vision Models

• Designed for efficiency on mobile phones
• Family of Pareto-optimal models to target the needs of the user
• V1 based on Depthwise Separable Convolutions (sketched below)
• V2 introduces Inverted Residuals and Linear Bottlenecks
• Supports Classification, Detection, Segmentation, and more
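The key primitive in MobileNet V1 is the depthwise separable convolution: a 3x3 depthwise convolution (one filter per channel) followed by a 1x1 pointwise convolution that mixes channels, costing roughly 8-9x fewer parameters and operations than a standard 3x3 convolution. A minimal PyTorch sketch; the channel counts are illustrative:

```python
import torch.nn as nn

def depthwise_separable(in_ch, out_ch, stride=1):
    """MobileNet V1 building block: depthwise 3x3 then pointwise 1x1."""
    return nn.Sequential(
        # Depthwise: groups=in_ch applies one 3x3 filter per input channel.
        nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                  padding=1, groups=in_ch, bias=False),
        nn.BatchNorm2d(in_ch),
        nn.ReLU(inplace=True),
        # Pointwise: 1x1 conv recombines channels.
        nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

# Weights: 3*3*in_ch + in_ch*out_ch vs. 3*3*in_ch*out_ch for a standard
# conv. For 256 -> 256 channels: ~68K vs. ~590K parameters (~8.7x fewer).
block = depthwise_separable(256, 256)
```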

Papers: "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications" (V1); "MobileNetV2: Inverted Residuals and Linear Bottlenecks" (V2).

DNN Architecture Search: 10X | Model Compression: ≥50X

Slide credit: Prof. Warren Gross (McGill Univ.)

Anatomy of a Convolution Layer

[Diagram: a 13x13x384 input activation convolved (⨷) with 384 filters, each 3x3x384, producing a 13x13x384 output.]

Filters: Kernel Reduction

[Diagram: the same layer with each 3x3x384 kernel replaced by a 1x1x384 kernel.]
Replacing 3x3 kernels with 1x1 kernels yields a 9x reduction in model parameters.
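The 9x follows directly from a convolution layer's weight count, kernel_h * kernel_w * in_channels * out_channels; a quick check:

```python
def conv_params(kh, kw, in_ch, out_ch):
    """Weight count of a convolution layer (ignoring biases)."""
    return kh * kw * in_ch * out_ch

full = conv_params(3, 3, 384, 384)  # 3x3 kernels: 1,327,104 weights
slim = conv_params(1, 1, 384, 384)  # 1x1 kernels:   147,456 weights
print(full / slim)                  # 9.0 -- exactly the kernel-area ratio
```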

Filters: Channel Reduction

[Diagram: a 1x1 convolution first reduces the 384 input channels to 128, so the 3x3 convolution operates on (and produces) 128 channels instead of 384.]
Reducing the channel count from 384 to 128 yields a 9x reduction in the 3x3 layer's model parameters.
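The same idea in PyTorch, with a 1x1 "reduce" convolution placed in front of the 3x3; the 384 and 128 channel counts are taken from the slide:

```python
import torch.nn as nn

# Original: a 3x3 convolution over 384 channels.
full = nn.Conv2d(384, 384, kernel_size=3, padding=1)  # ~1.33M weights

# Channel-reduced: a 1x1 conv shrinks 384 -> 128 channels, then the
# 3x3 conv operates in the cheaper 128-channel space.
slim = nn.Sequential(
    nn.Conv2d(384, 128, kernel_size=1),             # ~49K weights
    nn.Conv2d(128, 128, kernel_size=3, padding=1),  # ~147K weights
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(full) / count(slim))  # ~6.7x smaller overall;
                                  # the 3x3 layer itself shrinks 9x
```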

Model Distillation/Compression

[Diagram: model distillation, in which a large, accurate teacher network supervises the training of a small student network.]
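The cited work has a student mimic a teacher's feature maps for detection. As a minimal illustration of the general idea, here is a sketch of classic Hinton-style logit distillation; the temperature and loss weighting are conventional example choices, not values from the slide:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.9):
    """Blend a soft loss (match the teacher's softened output
    distribution) with the usual hard cross-entropy on labels."""
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),
        F.softmax(teacher_logits / temperature, dim=1),
        reduction="batchmean",
    ) * (temperature ** 2)  # standard T^2 gradient rescaling
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Usage: the teacher runs frozen; only the small student trains.
# loss = distillation_loss(student(x), teacher(x).detach(), y)
```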

Li, et al. "Mimicking Very Efficient Network for Object Detection." CVPR 2017.

Examples of What's on a DNN Architect's Palette

• Spatial Convolution (e.g. 3x3)
• Pointwise Convolution (1x1)
• Depthwise Convolution
• Channel Shuffle
• Shift

The Art of Small Model Design
Small Neural Nets Are Beautiful – ESWeek 2017

The palette of an adept mobile/embedded DNN designer has grown very rich!
• Overall architecture: economize on layers while retaining accuracy
• Layer types:
  • Kernel reduction: 5x5 -> 3x3 -> 1x1
  • Channel reduction: e.g. the Fire layer
• Experiment with novel layer types that consume no FLOPs: Shuffle, Shift
• Model distillation: let big models teach smaller ones
• Apply pruning (sketched after the reference below)
• Tailor bit precision (aka quantization) to the target processor

Iandola, Forrest, and Kurt Keutzer. "Small neural nets are beautiful: enabling embedded systems with small deep-neural-network architectures." In Proceedings of the Twelfth International Conference on Hardware/Software Codesign and System Synthesis Companion, p. 1. ACM, 2017 (ESWeek 2017). Also arXiv:1710.02759.
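As a small illustration of the last two palette items, here is magnitude-based weight pruning plus a naive symmetric int8 quantization in PyTorch; the 50% sparsity and 8-bit width are arbitrary example choices:

```python
import torch

def magnitude_prune(weight, sparsity=0.5):
    """Zero out the smallest-magnitude weights (here, 50% of them)."""
    k = int(weight.numel() * sparsity)
    threshold = weight.abs().flatten().kthvalue(k).values
    return weight * (weight.abs() > threshold)

def quantize_int8(weight):
    """Naive symmetric per-tensor int8 quantization."""
    scale = weight.abs().max() / 127.0
    q = torch.clamp(torch.round(weight / scale), -127, 127).to(torch.int8)
    return q, scale  # dequantize with: q.float() * scale

w = torch.randn(384, 384, 3, 3)
w_pruned = magnitude_prune(w)       # half the weights become exact zeros
q, scale = quantize_int8(w_pruned)  # 4x smaller than 32-bit float storage
```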

Artistic/Engineering Process of Designing a Deep Neural Net

• Manual design:
  • Each iteration to evaluate a point in the design space is very expensive
  • Exploration is limited by human imagination

Can we automate this?

DNAS: Differentiable Neural Architecture Search
• Extremely fast: 8 GPUs, 24 hours
• Can search for different conditions case-by-case
• Optimizes for actual latency on the target device
Bichen Wu, Kurt Keutzer, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming Wu, Yuandong Tian, Peter Vajda, Yangqing Jia
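The core trick in DNAS is to make the choice among candidate layer operators differentiable: each supernet layer computes a Gumbel-softmax-weighted sum of its candidate ops, so architecture parameters train by gradient descent alongside the weights, with an expected-latency term in the loss. A heavily simplified sketch; the candidate ops and per-op latency numbers are illustrative placeholders:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SearchableLayer(nn.Module):
    """One supernet layer: a soft, differentiable mix of candidate ops."""
    def __init__(self, ch):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv2d(ch, ch, 3, padding=1),             # 3x3 conv
            nn.Conv2d(ch, ch, 1),                        # 1x1 conv
            nn.Conv2d(ch, ch, 3, padding=1, groups=ch),  # depthwise 3x3
        ])
        self.theta = nn.Parameter(torch.zeros(len(self.ops)))  # arch params
        # Hypothetical measured latency of each op on the target device (ms).
        self.register_buffer("latency_ms", torch.tensor([3.0, 1.0, 0.5]))

    def forward(self, x, tau=5.0):
        # Gumbel-softmax: a differentiable, nearly one-hot sample over ops.
        w = F.gumbel_softmax(self.theta, tau=tau)
        out = sum(wi * op(x) for wi, op in zip(w, self.ops))
        expected_latency = (w * self.latency_ms).sum()
        return out, expected_latency

# Training loss (sketch): task loss plus a penalty on the summed expected
# latency, so the search trades accuracy against speed on the target device.
# loss = F.cross_entropy(logits, y) + lam * total_expected_latency
```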

DNAS in Context (FLOPs to normalize the comparison)

[Scatter plot: ImageNet top-1 accuracy (y-axis, higher is good) vs. FLOPs (x-axis, more is bad); mark size indicates search cost; circles indicate unknown search cost.]

Method | Top-1 Accuracy | FLOPs | Search Cost
NAS [1] | 74.0% | 564M | 48,000 GPU-hrs
PNAS [2] | 74.2% | 588M | 6,000 GPU-hrs*
DARTS [3] | 73.1% | 595M | 288 GPU-hrs
MobileNetV2 [4] | 71.8% | 300M | unknown
AMC [5] | 70.8% | 150M | unknown
MnasNet [6] | 74.0% | 317M | 91,000 GPU-hrs*
DNASNet (ours) | 74.2% | 295M | 216 GPU-hrs

* Estimated from the paper description.

[1] Zoph, Barret, et al. "Learning transferable architectures for scalable image recognition." arXiv:1707.07012 (2017).
[2] Liu, Chenxi, et al. "Progressive neural architecture search." arXiv:1712.00559 (2017).
[3] Liu, Hanxiao, Karen Simonyan, and Yiming Yang. "DARTS: Differentiable architecture search." arXiv:1806.09055 (2018).
[4] Sandler, Mark, et al. "MobileNetV2: Inverted Residuals and Linear Bottlenecks." CVPR 2018.
[5] He, Yihui, et al. "AMC: AutoML for model compression and acceleration on mobile devices." ECCV 2018.
[6] Tan, Mingxing, et al. "MnasNet: Platform-aware neural architecture search for mobile." arXiv:1807.11626 (2018).

DNAS for Device-Aware Search

• For different target devices, both DNASNets achieve similar accuracy.
• However, per-target DNN optimization was required: each network is fastest on the device it was searched for.

Net | Latency on iPhone X | Latency on Samsung S8 | Top-1 Accuracy
DNAS-iPhoneX | 19.84 ms | 23.33 ms (20% slower) | 73.20%
DNAS-S8 | 27.53 ms (25% slower) | 22.12 ms | 73.27%

The Future: Breaking Down the Wall between DNN Design & Hardware Design

DNN designers are often unaware of:
• Arithmetic intensity
• Floating-point vs. fixed-point costs
• Memory hierarchy and latency

NN hardware accelerator architects are often:
• Using outdated models: AlexNet, VGG16
• Using irrelevant datasets: MNIST, CIFAR

Key Takeaways

• Autonomous vehicles currently need thousands (or even hundreds of thousands) of dollars of computing hardware

• Processing is on a trajectory of rapid improvement (in operations-per-Watt)
  • but other aspects of the system (e.g. memory) are improving much more slowly
  • today's neural networks will be choked by slow memory on tomorrow's DNN accelerators (this is already happening and will get worse)

• Designing new (smaller) neural networks helps with all of the following:
  • making full use of next-generation computing platforms
  • reducing the hardware costs of autonomous vehicles
  • enabling lower-cost, larger-scale rollouts of autonomous vehicles