Efficient Computing for AI and Robotics: From Hardware Accelerators to Algorithm Design
Vivienne Sze (@eems_mit), Massachusetts Institute of Technology
In collaboration with Luca Carlone, Yu-Hsin Chen, Joel Emer, Sertac Karaman, Tushar Krishna, Peter Li, Fangchang Ma, Amr Suleiman, Diana Wofk, Nellie Wu, Tien-Ju Yang, Zhengdong Zhang
Slides available at https://tinyurl.com/SzeMITDL2020
Vivienne Sze http://sze.mit.edu/ @eems_mit

Processing at the “Edge” instead of the “Cloud”
Communication Privacy Latency
Computing Challenge for Self-Driving Cars
(Feb 2018)
Cameras and radar generate ~6 gigabytes of data every 30 seconds.
Self-driving car prototypes use approximately 2,500 Watts of computing power.
This generates waste heat, and some prototypes need water cooling!
Existing Processors Consume Too Much Power
< 1 Watt (target) vs. > 10 Watts (existing processors)
Transistors Are Not Getting More Efficient
Slowdown of Moore’s Law and Dennard Scaling: general-purpose microprocessors are not getting faster or more efficient.
Need specialized / domain-specific hardware for significant improvements in speed and energy efficiency.
Efficient Computing with Cross-Layer Design
Algorithms, Systems, Architectures, Circuits
Energy Dominated by Data Movement
| Operation            | Energy (pJ) |
| 8b Add               | 0.03        |
| 16b Add              | 0.05        |
| 32b Add              | 0.1         |
| 16b FP Add           | 0.4         |
| 32b FP Add           | 0.9         |
| 8b Multiply          | 0.2         |
| 32b Multiply         | 3.1         |
| 16b FP Multiply      | 1.1         |
| 32b FP Multiply      | 3.7         |
| 32b SRAM Read (8KB)  | 5           |
| 32b DRAM Read        | 640         |

Memory access is orders of magnitude higher energy than compute (relative energy costs span 1 to 10^4).
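To make the gap concrete, a quick back-of-the-envelope sketch using the numbers in the table above (values from [Horowitz, ISSCC 2014]; the helper function is illustrative):

```python
# Energy per operation in picojoules, from the table above (Horowitz, ISSCC 2014).
ENERGY_PJ = {
    "8b add": 0.03, "16b add": 0.05, "32b add": 0.1,
    "16b fp add": 0.4, "32b fp add": 0.9,
    "8b mult": 0.2, "32b mult": 3.1,
    "16b fp mult": 1.1, "32b fp mult": 3.7,
    "32b sram read (8KB)": 5.0, "32b dram read": 640.0,
}

def relative_cost(op, reference="8b add"):
    """Energy of `op` relative to a reference operation."""
    return ENERGY_PJ[op] / ENERGY_PJ[reference]

# One DRAM read costs more than 20,000 8-bit adds,
# and 128x even an on-chip SRAM read.
print(relative_cost("32b dram read"))
print(ENERGY_PJ["32b dram read"] / ENERGY_PJ["32b sram read (8KB)"])
```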
[Horowitz, ISSCC 2014]

Autonomous Navigation Uses a Lot of Data
Semantic Understanding: high frame rate, large resolutions (2 million pixels)
Geometric Understanding: growing map size, data expansion (10x-100x more pixels)
[Pire, RAS 2017]

Visual-Inertial Localization
Determines the location/orientation of a robot from images and an IMU (also used by headsets in Augmented Reality and Virtual Reality)
Image sequence + IMU* → Visual-Inertial Odometry (VIO)** → Localization and Mapping

* Inertial Measurement Unit
** Subset of SLAM (Simultaneous Localization And Mapping)
Localization at Under 25 mW
First chip that performs complete Visual-Inertial Odometry
Front-end for camera: feature detection, tracking, and outlier elimination
Front-End for IMU (pre-integration of accelerometer and gyroscope data)
Back-End Optimization of Pose Graph (Navion)
Consumes 684× and 1582× less energy than mobile and desktop CPUs, respectively
[Joint work with Sertac Karaman (AeroAstro)]
[Zhang, RSS 2017], [Suleiman, VLSI-C 2018]

Key Methods to Reduce Data Size
Navion: fully integrated system, no off-chip processing or storage
[Block diagram of Navion. Vision Frontend (VFE): line buffers for previous/current frames, undistort & rectify (UR), feature detection (FD), feature tracking (FT), and sparse stereo (SS), using fixed-point arithmetic. IMU Frontend (IFE): pre-integration of accelerometer and gyroscope data. Backend (BE): build graph, linearize, Cholesky linear solver, back substitute, marginalize, retract, and Rodrigues operations over a register file and shared memory, using floating-point arithmetic; exploits sparsity in the graph and linear solver and applies low-cost compression to the frame memory.]
Use compression and exploit sparsity to reduce memory down to 854 kB.
Navion Project Website: http://navion.mit.edu
[Suleiman, VLSI-C 2018] (Best Student Paper Award)

Understanding the Environment
Depth Estimation
Semantic Segmentation
State-of-the-art approaches use Deep Neural Networks, which can require several hundred million operations and weights to compute! That is >100x more complex than video compression.
Deep Neural Networks
Deep Neural Networks (DNNs) have become a cornerstone of AI.
Applications: computer vision, speech recognition, game play, medical.
Book on Efficient Processing of DNNs
Part I: Understanding Deep Neural Networks
- Introduction
- Overview of Deep Neural Networks

Part II: Design of Hardware for Processing DNNs
- Key Metrics and Design Objectives
- Kernel Computation
- Designing DNN Accelerators
- Operation Mapping on Specialized Hardware

Part III: Co-Design of DNN Hardware and Algorithms
- Reducing Precision
- Exploiting Sparsity
- Designing Efficient DNN Models
- Advanced Technologies

https://tinyurl.com/EfficientDNNBook
Free download for institutional subscribers

Weighted Sums
Y_j = activation( sum_{i=1}^{3} W_ij × X_i )

Nonlinear activation functions:
- Sigmoid: y = 1 / (1 + e^(-x))
- Rectified Linear Unit (ReLU): y = max(0, x)

(Image source: Caffe tutorial)

[Figure: fully connected network with input layer (X1, X2, X3), hidden layer, and output layer (Y1–Y4); weights W11 … W34]

The key operation is the multiply and accumulate (MAC), which accounts for >90% of computation.
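A minimal sketch of the weighted sum above in plain Python (3 inputs, one output neuron, ReLU activation; the numbers are illustrative):

```python
def relu(x):
    # Rectified Linear Unit: y = max(0, x)
    return max(0.0, x)

def neuron_output(weights, inputs, activation=relu):
    """Y_j = activation(sum_i W_ij * X_i): one MAC per (weight, input) pair."""
    acc = 0.0
    for w, x in zip(weights, inputs):
        acc += w * x          # multiply-and-accumulate (MAC)
    return activation(acc)

# Example: 3 inputs feeding one output neuron.
print(neuron_output([0.5, -1.0, 2.0], [1.0, 2.0, 0.5]))  # relu(0.5 - 2.0 + 1.0) = 0.0
```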
High-Dimensional Convolution in CNN
[Figure: high-dimensional convolution. A filter of size R×S with C channels slides over an input fmap of size H×W with the same C channels (many input channels) to produce an output fmap of size E×F.]
Define Shape for Each Layer
Shape varies across layers:
- H: height of input fmap (activations)
- W: width of input fmap (activations)
- C: number of 2-D input fmaps / filters (channels)
- R: height of 2-D filter (weights)
- S: width of 2-D filter (weights)
- M: number of 2-D output fmaps (channels)
- E: height of output fmap (activations)
- F: width of output fmap (activations)
- N: number of input fmaps / output fmaps (batch size)
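The seven nested loops below sketch how these shape parameters define a CONV layer (a naive reference implementation with unit stride and no padding, purely illustrative):

```python
def conv_layer(ifmaps, filters):
    """ifmaps: [N][C][H][W], filters: [M][C][R][S] -> ofmaps: [N][M][E][F]."""
    N, C, H, W = len(ifmaps), len(ifmaps[0]), len(ifmaps[0][0]), len(ifmaps[0][0][0])
    M, R, S = len(filters), len(filters[0][0]), len(filters[0][0][0])
    E, F = H - R + 1, W - S + 1          # output fmap height/width (stride 1, no padding)
    out = [[[[0.0] * F for _ in range(E)] for _ in range(M)] for _ in range(N)]
    for n in range(N):                    # batch
        for m in range(M):                # output channels
            for e in range(E):            # output rows
                for f in range(F):        # output cols
                    for c in range(C):    # input channels
                        for r in range(R):      # filter rows
                            for s in range(S):  # filter cols
                                out[n][m][e][f] += ifmaps[n][c][e + r][f + s] * filters[m][c][r][s]
    return out
```

For example, a 3×3 single-channel input convolved with a 2×2 all-ones filter yields a 2×2 output of window sums.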
Popular DNN Models
| Metric                 | LeNet-5          | AlexNet                  | VGG-16              | GoogLeNet (v1)      | ResNet-50     | EfficientNet-B4 |
| Top-5 error (ImageNet) | n/a              | 16.4                     | 7.4                 | 6.7                 | 5.3           | 3.7*            |
| Input size             | 28x28            | 227x227                  | 224x224             | 224x224             | 224x224       | 380x380         |
| # of CONV layers       | 2                | 5                        | 16                  | 21 (depth)          | 49            | 96              |
| # of weights (CONV)    | 2.6k             | 2.3M                     | 14.7M               | 6.0M                | 23.5M         | 14M             |
| # of MACs (CONV)       | 283k             | 666M                     | 15.3G               | 1.43G               | 3.86G         | 4.4G            |
| # of FC layers         | 2                | 3                        | 3                   | 1                   | 1             | 65**            |
| # of weights (FC)      | 58k              | 58.6M                    | 124M                | 1M                  | 2M            | 4.9M            |
| # of MACs (FC)         | 58k              | 58.6M                    | 124M                | 1M                  | 2M            | 4.9M            |
| Total weights          | 60k              | 61M                      | 138M                | 7M                  | 25.5M         | 19M             |
| Total MACs             | 341k             | 724M                     | 15.5G               | 1.43G               | 3.9G          | 4.4G            |
| Reference              | Lecun, PIEEE 1998 | Krizhevsky, NeurIPS 2012 | Simonyan, ICLR 2015 | Szegedy, CVPR 2015 | He, CVPR 2016 | Tan, ICML 2019  |
DNN models getting larger and deeper
* Does not include multi-crop and ensemble
** Increase in FC layers due to squeeze-and-excitation layers (much smaller than the FC layers used for classification)

Properties We Can Leverage
• Operations exhibit high parallelism → high throughput possible
• Memory access is the bottleneck: each MAC* requires three memory reads (filter weight, fmap activation, partial sum) and one memory write (updated partial sum)
Worst case: all memory reads/writes are DRAM accesses, at roughly 200x the energy of the MAC itself
• Example: AlexNet has 724M MACs → 724M × 4 = 2896M DRAM accesses required
* multiply-and-accumulate
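A quick check of the 2896M figure, assuming exactly four DRAM accesses per MAC (three reads plus one write, as in the worst case above), with a rough DRAM energy bound that treats writes like reads:

```python
ALEXNET_MACS = 724e6       # total MACs in AlexNet (see table above)
ACCESSES_PER_MAC = 4       # 3 reads (weight, activation, psum) + 1 write (updated psum)
DRAM_READ_PJ = 640         # 32b DRAM read energy in pJ (Horowitz, ISSCC 2014)

dram_accesses = ALEXNET_MACS * ACCESSES_PER_MAC
print(f"{dram_accesses / 1e6:.0f}M DRAM accesses")  # 2896M

# Worst-case DRAM energy alone, treating writes like reads for a rough bound:
print(f"~{dram_accesses * DRAM_READ_PJ / 1e12:.1f} J")  # ~1.9 J per inference
```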
Properties We Can Leverage
• Operations exhibit high parallelism → high throughput possible
• Input data reuse opportunities (up to 500x):
  - Convolutional reuse (activations, weights): CONV layers only (sliding window)
  - Fmap reuse (activations): CONV and FC layers
  - Filter reuse (weights): CONV and FC layers (batch size > 1)
Exploit Data Reuse at Low-Cost Memories
Specialized hardware fetches data from small (< 1 kB), low-cost memory near compute: each PE has a register file (0.5 – 1.0 kB) next to its ALU, with a global buffer and DRAM farther away.

Normalized energy cost* per access:
- ALU: 1x (reference)
- RF (0.5 – 1.0 kB) → ALU: 1x
- NoC (200 – 1000 PEs): PE → ALU: 2x
- Buffer (100 – 500 kB) → ALU: 6x
- DRAM → ALU: 200x

Farther and larger memories consume more power.
* measured from a commercial 65nm process
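As an illustration, one can estimate the total cost of a MAC served entirely from a given memory level using the normalized costs above (a simplification for intuition; real accelerators serve different data types from different levels):

```python
# Normalized energy cost per access, relative to one MAC at the ALU
# (65nm measurements from the slide above).
COST = {"RF": 1, "NoC": 2, "buffer": 6, "DRAM": 200}

def energy_per_mac(level, reads=3, writes=1):
    """Total normalized energy for one MAC: the op itself plus its data accesses,
    assuming all accesses hit the given memory level (a simplification)."""
    return 1 + (reads + writes) * COST[level]

print(energy_per_mac("DRAM"))  # worst case: 801x a bare MAC
print(energy_per_mac("RF"))    # best case: 5x
```

This is why keeping data in the register file next to the PE matters: the same MAC is roughly 160x cheaper than when every operand comes from DRAM.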
Weight Stationary (WS)
[Figure: weights W0 – W7 stay stationary in the PE array; the global buffer streams activations in and partial sums out]
• Minimize weight read energy consumption − maximize convolutional and filter reuse of weights
• Broadcast activations and accumulate partial sums spatially across the PE array
• Examples: TPU [Jouppi, ISCA 2017], NVDLA
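A loop-nest sketch of the weight-stationary idea for a 1-D convolution (illustrative only, not the TPU's or NVDLA's actual implementation): each "PE" pins one weight, activations stream past it, and partial sums accumulate spatially across PEs.

```python
def weight_stationary_conv1d(weights, acts):
    """1-D convolution with each of the S weights held stationary in its own 'PE'."""
    S, W = len(weights), len(acts)
    E = W - S + 1
    psums = [0.0] * E
    # Outer loop over PEs: weights[s] is read ONCE and reused for every output.
    for s, w in enumerate(weights):        # each iteration models one PE
        for e in range(E):                 # activations stream past the fixed weight
            psums[e] += w * acts[e + s]    # partial sums accumulate across PEs
    return psums

print(weight_stationary_conv1d([1, 2], [1, 2, 3, 4]))  # [5, 8, 11]
```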
[Chen, ISCA 2016]

Output Stationary (OS)
[Figure: partial sums P0 – P7 stay stationary in the PE array; the global buffer streams activations and weights in]
• Minimize partial sum R/W energy consumption − maximize local accumulation
• Broadcast/Multicast filter weights and reuse activations spatially across the PE array
• Examples: [Moons, VLSI 2016], [Thinker, VLSI 2017]
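The same 1-D convolution reordered for the output-stationary idea (again a sketch, not the cited chips' implementation): each "PE" owns one output and accumulates its partial sum locally while broadcast weights stream through.

```python
def output_stationary_conv1d(weights, acts):
    """1-D convolution with each output's partial sum held stationary in its 'PE'."""
    S, W = len(weights), len(acts)
    E = W - S + 1
    outputs = []
    for e in range(E):                     # each iteration models one PE owning output e
        acc = 0.0                          # psum never leaves the PE until finished
        for s in range(S):                 # broadcast weights stream through
            acc += weights[s] * acts[e + s]
        outputs.append(acc)
    return outputs

print(output_stationary_conv1d([1, 2], [1, 2, 3, 4]))  # [5, 8, 11]
```

Same arithmetic as the weight-stationary version; only the loop order (and hence which data stays put) changes.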
[Chen, ISCA 2016]

Row Stationary Dataflow
PE 1: filter Row 1 * fmap Row 1 → psum Row 1

• Maximize row convolutional reuse in RF: keep a filter row and an fmap sliding window in the RF
• Maximize row psum accumulation in RF
[Chen, ISCA 2016], selected for IEEE Micro Top Picks

Row Stationary Dataflow
PE 1: Row 1 * Row 1    PE 4: Row 1 * Row 2    PE 7: Row 1 * Row 3
PE 2: Row 2 * Row 2    PE 5: Row 2 * Row 3    PE 8: Row 2 * Row 4
PE 3: Row 3 * Row 3    PE 6: Row 3 * Row 4    PE 9: Row 3 * Row 5
(each entry: filter row * fmap row; each column of PEs accumulates to one output row)
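The 3x3 PE grid above can be sketched as follows (an illustrative model, not the Eyeriss RTL): the PE holding filter row i in column j computes a 1-D convolution of filter row i with fmap row i + j, and each column's partial sums are summed to produce one output row.

```python
def row_conv1d(filt_row, fmap_row):
    """1-D row convolution: what a single row-stationary PE computes in its RF."""
    S, W = len(filt_row), len(fmap_row)
    return [sum(filt_row[s] * fmap_row[e + s] for s in range(S)) for e in range(W - S + 1)]

def row_stationary_conv2d(filt, fmap):
    """2-D convolution via a grid of row-stationary PEs (R filter rows x E output rows)."""
    R, H = len(filt), len(fmap)
    E = H - R + 1
    out = []
    for j in range(E):                       # one PE column per output row
        # The PE holding filter row i slides fmap row i + j through its RF.
        partials = [row_conv1d(filt[i], fmap[i + j]) for i in range(R)]
        # Psums accumulate vertically up the column to form output row j.
        out.append([sum(col) for col in zip(*partials)])
    return out
```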
Optimize for overall energy efficiency instead of for only a certain data type.

[Chen, ISCA 2016], selected for IEEE Micro Top Picks

Dataflow Comparison: CONV Layers
[Chart: normalized energy per MAC, broken down into psums, weights, and activations, for CNN dataflows WS, OSA, OSB, OSC, NLR, and RS; bars range from roughly 1 to 2]
RS optimizes for the best overall energy efficiency
[Chen, ISCA 2016]

Deep Neural Networks at Under 0.3 W: Eyeriss
[Die photo: 4 mm × 4 mm chip with a spatial PE array and on-chip buffer]
[Chen, ISSCC 2016]
Exploits data reuse for a 100x reduction in memory accesses to the global buffer and a 1400x reduction in memory accesses to off-chip DRAM.
Overall >10x energy reduction compared to a mobile GPU (NVIDIA TK1).
Results for AlexNet
Eyeriss Project Website: http://eyeriss.mit.edu
Vivienne Sze http://sze.mit.edu/ @eems_mit [Joint work with Joel Emer] 28 Features: Energy vs. Accuracy Exponential 10000 VGG162 1000 Energy/ AlexNet2 100 Pixel (nJ) 10 Measured in 65nm* Video 4mm 4mm Compression 1 Spatial 1 PE Array HOG 4mm Linear 4mm chip Buffer - 0.1 On 0 20 40 60 80 1 [Suleiman, VLSI 2016] 2 [Chen, ISSCC 2016] Accuracy (Average Precision) * Only feature extraction. Does Measured in on VOC 2007 Dataset not include data, classification 1. DPM v5 [Girshick, 2012] energy, augmentation and ensemble, etc. 2. Fast R-CNN [Girshick, CVPR 2015]
[Suleiman, ISCAS 2017]

Energy-Efficient Processing of DNNs
There has been a significant amount of algorithm and hardware research on energy-efficient processing of DNNs.
V. Sze, Y.-H. Chen, T-J. Yang, J. Emer, “Efficient Processing of Deep Neural Networks: A Tutorial and Survey,” Proceedings of the IEEE, Dec. 2017 http://eyeriss.mit.edu/tutorial.html
We identified various limitations to existing approaches
Design of Efficient DNN Algorithms
Popular efficient DNN algorithm approaches
• Network Pruning • Efficient Network Architectures (channel groups) …