1 Efficient Computing for AI and Robotics: From Hardware Accelerators to Algorithm Design

Vivienne Sze (@eems_mit), Massachusetts Institute of Technology

In collaboration with Luca Carlone, Yu-Hsin Chen, Joel Emer, Sertac Karaman, Tushar Krishna, Peter Li, Fangchang Ma, Amr Suleiman, Diana Wofk, Nellie Wu, Tien-Ju Yang, Zhengdong Zhang

Slides available at https://tinyurl.com/SzeMITDL2020

2 Processing at “Edge” instead of the “Cloud”

Communication, Privacy, Latency

3 Computing Challenge for Self-Driving Cars

(Feb 2018)

Cameras and radar generate ~6 gigabytes of data every 30 seconds.

Self-driving car prototypes use approximately 2,500 Watts of computing power.

Generates wasted heat and some prototypes need water-cooling!

4 Existing Processors Consume Too Much Power

< 1 Watt (target power budget) vs. > 10 Watts (existing processors)

5 Transistors Are Not Getting More Efficient

Slowdown of Moore’s Law and Dennard Scaling: general-purpose microprocessors are not getting faster or more efficient

Need specialized / domain-specific hardware for significant improvements in speed and energy efficiency

6 Efficient Computing with Cross-Layer Design

Algorithms, Systems, Architectures, Circuits

7 Energy Dominated by Data Movement

Operation: Energy (pJ)
  8b Add: 0.03
  16b Add: 0.05
  32b Add: 0.1
  16b FP Add: 0.4
  32b FP Add: 0.9
  8b Multiply: 0.2
  32b Multiply: 3.1
  16b FP Multiply: 1.1
  32b FP Multiply: 3.7
  32b SRAM Read (8KB): 5
  32b DRAM Read: 640

Memory access is orders of magnitude higher energy than compute (relative energy costs span 1 to 10^4).

[Horowitz, ISSCC 2014]

8 Autonomous Navigation Uses a Lot of Data

Semantic Understanding: high frame rate, large resolutions (2 million pixels)
Geometric Understanding: growing map size, data expansion (10x-100x more pixels)

[Pire, RAS 2017]

9 Visual-Inertial Localization

Determines location/orientation of a robot from images and an IMU (also used by headsets in Augmented Reality and Virtual Reality)

[Diagram: image sequence + IMU* → Visual-Inertial Odometry (VIO)** → Localization]

* Inertial Measurement Unit
** Subset of a SLAM algorithm (Simultaneous Localization And Mapping), which additionally performs Mapping

10 Localization at Under 25 mW

Navion: first chip that performs complete Visual-Inertial Odometry
• Front-End for camera (feature detection, tracking, and outlier elimination)
• Front-End for IMU (pre-integration of accelerometer and gyroscope data)
• Back-End optimization of pose graph

Consumes 684× and 1582× less energy than mobile and desktop CPUs, respectively

[Joint work with Sertac Karaman (AeroAstro)]

[Zhang, RSS 2017], [Suleiman, VLSI-C 2018]

11 Key Methods to Reduce Data Size

Navion: fully integrated system – no off-chip processing or storage

[Navion block diagram: Vision Frontend (VFE) with line buffers, undistort & rectify (UR), feature detection (FD), feature tracking (FT), and sparse stereo (SS); IMU Frontend (IFE) with pre-integration of IMU data; Backend (BE) with build graph, linearize, Cholesky linear solver, back substitute, marginal, and retract steps; mix of fixed-point and floating-point arithmetic; exploits sparsity in graph and linear solver; applies low-cost compression to the shared frame memory]

Use compression and exploit sparsity to reduce memory down to 854 kB

Navion Project Website: http://navion.mit.edu

[Suleiman, VLSI-C 2018] Best Student Paper Award

12 Understanding the Environment

Depth Estimation

Semantic Segmentation

State-of-the-art approaches use Deep Neural Networks, which can require several hundred million operations and weights to compute – more than 100x the complexity of video compression!

13 Deep Neural Networks

Deep Neural Networks (DNNs) have become a cornerstone of AI: Vision, Speech Recognition, Game Play, Medical

14 Book on Efficient Processing of DNNs

Part I – Understanding Deep Neural Networks: Introduction; Overview of Deep Neural Networks

Part II – Design of Hardware for Processing DNNs: Key Metrics and Design Objectives; Kernel Computation; Designing DNN Accelerators; Operation Mapping on Specialized Hardware

Part III – Co-Design of DNN Hardware and Algorithms: Reducing Precision; Exploiting Sparsity; Designing Efficient DNN Models; Advanced Technologies

https://tinyurl.com/EfficientDNNBook

Free download for institutional subscribers

15 Weighted Sums

Y_j = activation( Σ_{i=1}^{3} W_ij × X_i )

Nonlinear activation functions:
  Sigmoid: y = 1/(1 + e^(-x))
  Rectified Linear Unit (ReLU): y = max(0, x)

[Network diagram: Input Layer (X1, X2, X3), Hidden Layer, Output Layer (Y1–Y4), with weights W11 ... W34; activation plots shown over the range -1 to 1. Image source: Caffe tutorial]

Key operation is multiply and accumulate (MAC) – accounts for > 90% of computation

16 High-Dimensional Convolution in CNN

[Diagram: a filter (R × S, with C channels) is convolved with an input fmap (H × W, with C channels) to produce an output fmap (E × F) – many input channels (C)]

17 Define Shape for Each Layer

Shape varies across layers:
  H – Height of input fmap (activations)
  W – Width of input fmap (activations)
  C – Number of 2-D input fmaps / filters (channels)
  R – Height of 2-D filter (weights)
  S – Width of 2-D filter (weights)
  M – Number of 2-D output fmaps (channels)
  E – Height of output fmap (activations)
  F – Width of output fmap (activations)
  N – Number of input fmaps / output fmaps (batch size)

These dimensions define the CONV layer loop nest, sketched below.
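To make the notation concrete, here is a minimal sketch of a CONV layer written as a naive loop nest over these dimensions, assuming stride 1 and no padding (so E = H - R + 1 and F = W - S + 1); the sizes are arbitrary toy values, not from any network in this talk.

```python
import numpy as np

# Toy CONV layer shape (arbitrary values for illustration).
N, M, C, H, W, R, S = 2, 4, 3, 8, 8, 3, 3
E, F = H - R + 1, W - S + 1  # output fmap size for stride 1, no padding

ifmap = np.random.rand(N, C, H, W)    # input activations
weights = np.random.rand(M, C, R, S)  # filter weights
ofmap = np.zeros((N, M, E, F))        # output activations

for n in range(N):                          # batch
    for m in range(M):                      # output channels (filters)
        for e in range(E):                  # output fmap rows
            for f in range(F):              # output fmap columns
                for c in range(C):          # input channels
                    for r in range(R):      # filter rows
                        for s in range(S):  # filter columns
                            ofmap[n, m, e, f] += (
                                weights[m, c, r, s] * ifmap[n, c, e + r, f + s]
                            )
```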

18 Popular DNN Models

Metrics                  LeNet-5     AlexNet       VGG-16      GoogLeNet (v1)  ResNet-50   EfficientNet-B4
Top-5 error (ImageNet)   n/a         16.4          7.4         6.7             5.3         3.7*
Input Size               28x28       227x227       224x224     224x224         224x224     380x380
# of CONV Layers         2           5             16          21 (depth)      49          96
# of Weights (CONV)      2.6k        2.3M          14.7M       6.0M            23.5M       14M
# of MACs (CONV)         283k        666M          15.3G       1.43G           3.86G       4.4G
# of FC Layers           2           3             3           1               1           65**
# of Weights (FC)        58k         58.6M         124M        1M              2M           4.9M
# of MACs (FC)           58k         58.6M         124M        1M              2M           4.9M
Total Weights            60k         61M           138M        7M              25.5M       19M
Total MACs               341k        724M          15.5G       1.43G           3.9G        4.4G
Reference                Lecun,      Krizhevsky,   Simonyan,   Szegedy,        He,         Tan,
                         PIEEE 1998  NeurIPS 2012  ICLR 2015   CVPR 2015       CVPR 2016   ICML 2019

DNN models getting larger and deeper

* Does not include multi-crop and ensemble
** Increase in FC layers due to squeeze-and-excitation layers (much smaller than FC layers for classification)

19 Properties We Can Leverage

• Operations exhibit high parallelism → high throughput possible
• Memory access is the bottleneck

[Diagram: each MAC* requires three memory reads (filter weight, fmap activation, partial sum) and one memory write (updated partial sum); a DRAM access costs ~200x the energy of the MAC itself (1x)]

* multiply-and-accumulate

Worst case: all memory reads/writes are DRAM accesses

• Example: AlexNet has 724M MACs → 2896M DRAM accesses required (see the worked check below)
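A quick worked check of the numbers above, assuming each MAC performs three memory reads (filter weight, fmap activation, partial sum) and one write (updated partial sum), all hitting DRAM in the worst case; the energy figure uses the ~640 pJ per 32b DRAM read from the earlier table.

```python
alexnet_macs = 724e6                    # MACs in AlexNet (from the slide)
dram_accesses = alexnet_macs * (3 + 1)  # 3 reads + 1 write per MAC
print(f"{dram_accesses / 1e6:.0f}M DRAM accesses")  # 2896M

dram_read_pj = 640                      # 32b DRAM read, from the energy table
print(f"~{dram_accesses * dram_read_pj * 1e-12:.1f} J just for data movement")
```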

20 Properties We Can Leverage

• Operations exhibit high parallelism → high throughput possible
• Input data reuse opportunities (up to 500x), sketched below:
  – Convolutional reuse (activations, weights): CONV layers only (sliding window)
  – Fmap reuse (activations): CONV and FC layers
  – Filter reuse (weights): CONV and FC layers (batch size > 1)
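A minimal sketch of how large each reuse opportunity is for a single CONV layer, using the shape parameters defined earlier (toy values, stride 1, no padding):

```python
# Toy CONV layer shape (same notation as before).
N, M, C, H, W, R, S = 4, 64, 32, 56, 56, 3, 3
E, F = H - R + 1, W - S + 1

conv_reuse = E * F   # each weight is applied at E x F sliding-window positions
fmap_reuse = M       # each input activation is read by all M filters
filter_reuse = N     # each weight is reused across the N inputs in a batch

print(f"convolutional reuse per weight: {conv_reuse}x")
print(f"fmap reuse per activation:      {fmap_reuse}x")
print(f"filter reuse per weight:        {filter_reuse}x")
```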

21 Exploit Data Reuse at Low-Cost Memories

Specialized hardware with small (< 1 kB) low-cost memory near compute: each PE contains a register file (0.5 – 1.0 kB) and fetches data from it to run a MAC, backed by a global buffer and DRAM.

Normalized Energy Cost*:
  ALU (reference): 1×
  RF (0.5 – 1.0 kB) → ALU: 1×
  NoC (200 – 1000 PEs) → PE: 2×
  Buffer (100 – 500 kB) → ALU: 6×
  DRAM → ALU: 200×

Farther and larger memories consume more power.

* measured from a commercial 65nm process
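Using the normalized costs above, here is a minimal sketch of why serving most accesses from the small local memories matters; the access mix is made up for illustration.

```python
# Normalized energy per access, relative to one MAC in the ALU (from above).
COST = {"alu": 1, "rf": 1, "noc": 2, "buffer": 6, "dram": 200}

def total_energy(macs, access_mix):
    """access_mix: memory level -> number of accesses served at that level."""
    return macs * COST["alu"] + sum(COST[m] * n for m, n in access_mix.items())

macs = 1_000_000
ops = 4 * macs  # 3 reads + 1 write per MAC

worst = total_energy(macs, {"dram": ops})          # every access goes to DRAM
good = total_energy(macs, {"rf": 0.90 * ops,       # most accesses hit the RF
                           "buffer": 0.09 * ops,
                           "dram": 0.01 * ops})
print(f"energy ratio (all-DRAM vs. reuse-friendly): {worst / good:.0f}x")
```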

22 Weight Stationary (WS)

[Diagram: weights (W0 – W7) held stationary in the PEs; activations are broadcast from the global buffer and psums flow back to it]

• Minimize weight read energy consumption − maximize convolutional and filter reuse of weights

• Broadcast activations and accumulate partial sums spatially across the PE array

• Examples: TPU [Jouppi, ISCA 2017], NVDLA
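A minimal behavioral sketch of a 1-D weight-stationary pass (an illustration of the general idea, not any specific chip's implementation): each PE holds one filter weight, input activations are broadcast one per cycle, and each PE accumulates into the output element its weight contributes to.

```python
def ws_conv1d(x, w):
    """1-D convolution with a weight-stationary schedule."""
    num_pes = len(w)                      # one PE per filter weight
    out = [0.0] * (len(x) - num_pes + 1)  # psums, accumulated spatially
    for t, act in enumerate(x):           # cycle t: broadcast activation x[t]
        for k in range(num_pes):          # PE k holds stationary weight w[k]
            i = t - k                     # output index PE k contributes to
            if 0 <= i < len(out):
                out[i] += w[k] * act
    return out

print(ws_conv1d([1, 2, 3, 4, 5], [1, 0, -1]))  # [-2, -2, -2]
```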

[Chen, ISCA 2016]

23 Output Stationary (OS)

[Diagram: partial sums (P0 – P7) held stationary in the PEs; activations and weights flow from the global buffer]

• Minimize partial sum R/W energy consumption − maximize local accumulation

• Broadcast/Multicast filter weights and reuse activations spatially across the PE array

• Examples: [Moons, VLSI 2016], [Thinker, VLSI 2017]

[Chen, ISCA 2016]

24 Row Stationary Dataflow

[Diagram: PE 1 computes filter Row 1 * fmap Row 1 to produce a row of psums]

• Maximize row convolutional reuse in RF – keep a filter row and fmap sliding window in RF
• Maximize row psum accumulation in RF

[Chen, ISCA 2016] Selected for Micro Top Picks

25 Row Stationary Dataflow

[Diagram: 3×3 PE array – each column of PEs accumulates one output row:
  output Row 1 ← PE 1 (filter Row 1 * fmap Row 1) + PE 2 (Row 2 * Row 2) + PE 3 (Row 3 * Row 3)
  output Row 2 ← PE 4 (Row 1 * Row 2) + PE 5 (Row 2 * Row 3) + PE 6 (Row 3 * Row 4)
  output Row 3 ← PE 7 (Row 1 * Row 3) + PE 8 (Row 2 * Row 4) + PE 9 (Row 3 * Row 5)]
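A minimal sketch of the row-stationary primitive in the diagram above: each PE 1-D-convolves one filter row with one fmap row, and the psum rows from vertically aligned PEs are summed to form one output row (here PEs 1–3 producing output row 1; toy values).

```python
def conv1d(fmap_row, filt_row):
    """Per-PE primitive: 1-D convolution of one filter row with one fmap row."""
    k = len(filt_row)
    return [sum(filt_row[j] * fmap_row[i + j] for j in range(k))
            for i in range(len(fmap_row) - k + 1)]

fmap = [[1, 2, 3, 4, 5], [2, 3, 4, 5, 6], [3, 4, 5, 6, 7]]
filt = [[1, 0, -1], [1, 0, -1], [1, 0, -1]]

# PE r computes filter row r * fmap row r; summing the three psum rows
# element-wise yields output row 1.
psum_rows = [conv1d(fmap[r], filt[r]) for r in range(3)]
out_row1 = [sum(vals) for vals in zip(*psum_rows)]
print(out_row1)  # [-6, -6, -6]
```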

Optimize for overall energy efficiency instead of for only a certain data type

[Chen, ISCA 2016] Selected for Micro Top Picks

26 Dataflow Comparison: CONV Layers

[Bar chart: normalized energy/MAC, broken down into psums, weights, and activations, for the CNN dataflows WS, OSA, OSB, OSC, NLR, and RS]

RS optimizes for the best overall energy efficiency

[Chen, ISCA 2016]

27 Deep Neural Networks at Under 0.3 W

[Die photo: Eyeriss, a 4mm × 4mm chip with on-chip buffer and spatial PE array]

[Chen, ISSCC 2016]

Exploits data reuse for 100x reduction in memory accesses from the global buffer and 1400x reduction in memory accesses from off-chip DRAM. Overall > 10x energy reduction compared to a mobile GPU (Nvidia TK1).

Results for AlexNet

Eyeriss Project Website: http://eyeriss.mit.edu

[Joint work with Joel Emer]

28 Features: Energy vs. Accuracy

[Scatter plot: energy/pixel (nJ, log scale 0.1 to 10000) vs. accuracy (average precision, 0–80); HOG¹ sits at low energy with linear accuracy gains, while AlexNet² and VGG16² deliver higher accuracy at exponentially higher energy; video compression shown for reference; measured in 65nm*]

1 [Suleiman, VLSI 2016]  2 [Chen, ISSCC 2016]
Accuracy measured on the VOC 2007 dataset: 1. DPM v5 [Girshick, 2012]  2. Fast R-CNN [Girshick, CVPR 2015]
* Only feature extraction; does not include data, classification energy, augmentation and ensemble, etc.

[Suleiman, ISCAS 2017]

29 Energy-Efficient Processing of DNNs

A significant amount of algorithm and hardware research on energy-efficient processing of DNNs:

V. Sze, Y.-H. Chen, T.-J. Yang, J. Emer, “Efficient Processing of Deep Neural Networks: A Tutorial and Survey,” Proceedings of the IEEE, Dec. 2017. http://eyeriss.mit.edu/tutorial.html

We identified various limitations of existing approaches

30 Design of Efficient DNN Algorithms

Popular efficient DNN algorithm approaches

[Diagram: Network Pruning removes channels/weights from a filter (R × S × C); Efficient Network Architectures decompose a convolutional layer into depth-wise (R × S × 1, G groups) and point-wise (1 × 1 × C) layers]

Examples: SqueezeNet, MobileNet; ... also reduced precision

• Focus on reducing number of MACs and weights (see the MAC-count sketch below)
• Does it translate to energy savings and reduced latency?
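A minimal sketch of the MAC-count arithmetic behind the depth-wise + point-wise decomposition (toy sizes; this is the standard result for separable convolutions, not specific to any one network):

```python
# Toy layer shape: C input channels, M output channels, K x K filters,
# E x F output fmap (stride 1).
C, M, K, E, F = 32, 64, 3, 56, 56

standard = E * F * M * C * K * K  # standard convolution
depthwise = E * F * C * K * K     # one K x K filter per input channel
pointwise = E * F * M * C         # 1 x 1 convolution to mix channels
separable = depthwise + pointwise

print(f"standard:  {standard:>12,} MACs")
print(f"separable: {separable:>12,} MACs  ({standard / separable:.1f}x fewer)")
```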

[Chen*, Yang*, SysML 2018]

31 Number of MACs and Weights Are Not Good Proxies

# of operations (MACs) does not approximate latency well. # of weights alone is not a good metric for energy – all data types should be considered.

[Pie chart: energy breakdown of GoogLeNet – Computation 10%, Input Feature Map 25%, Weights 22%, Output Feature Map 43%]

https://energyestimation.mit.edu/ [Yang, CVPR 2017]
Source: Google (https://ai.googleblog.com/2018/04/introducing-cvpr-2018-on-device-visual.html)

32 Energy-Aware Pruning

Directly target energy and incorporate it into the optimization of DNNs to provide greater energy savings

• Sort layers based on energy and prune layers that consume the most energy first
• Energy-aware pruning reduces AlexNet energy by 3.7x w/ similar accuracy
• Outperforms magnitude-based pruning by 1.7x

[Bar chart: normalized energy of AlexNet (×10^9) – original vs. magnitude-based pruning (DC, 2.1x savings) vs. energy-aware pruning (EAP, 3.7x savings)]

[Yang, CVPR 2017] Pruned models available at http://eyeriss.mit.edu/energy.html

33 NetAdapt: Platform-Aware DNN Adaptation

• Automatically adapt a pretrained DNN to a mobile platform to reach a target latency or energy budget
• Use empirical measurements to guide the optimization (avoid modeling of the tool chain or platform)
• Few hyperparameters to reduce tuning effort
• > 1.7x speed up on MobileNet w/ similar accuracy

[Diagram: NetAdapt takes a pretrained network, a budget (e.g., latency 3.8, energy 10.5), and a platform; it generates network proposals A ... Z, measures each empirically (e.g., latency 15.6 ... 14.3, energy 41 ... 46), and outputs an adapted network. A minimal sketch of the loop appears below.]

Code available at http://netadapt.mit.edu
[In collaboration with Google’s Mobile Vision Team]
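A minimal toy sketch of a NetAdapt-style outer loop; the helper functions here are stand-ins of my own (the real system fine-tunes each proposal and takes empirical, on-platform measurements rather than computing a toy proxy):

```python
import random

def measure_latency(net):        # stand-in for an on-device measurement
    return float(sum(net))       # toy proxy: "latency" = total layer width

def generate_proposals(net):     # one proposal per layer: shrink that layer
    return [net[:i] + [net[i] - 4] + net[i + 1:]
            for i in range(len(net)) if net[i] > 4]

def proposal_accuracy(net):      # stand-in for short-term fine-tune + eval
    return sum(net) / 1000 + random.random() * 0.001

net, budget = [32, 64, 128], 180.0    # layer widths and a latency budget
while measure_latency(net) > budget:
    proposals = generate_proposals(net)
    net = max(proposals, key=proposal_accuracy)  # keep the best proposal

print(net, measure_latency(net))  # adapted network meeting the budget
```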

[Yang, ECCV 2018]

34 FastDepth: Fast Monocular Depth Estimation

[Images: RGB input and depth prediction]

Depth estimation from a single RGB image is desirable due to the relatively low cost and size of monocular cameras.

[Plot: > 10x reduction in runtime compared to prior work]

~40 fps on an iPhone

Models available at http://fastdepth.mit.edu

Configuration: Batch size of one (32-bit float) [Joint work with Sertac Karaman]

[Wofk*, Ma*, ICRA 2019]

35 Many Efficient DNN Design Approaches

[Diagram: Network Pruning; Compact Network Architectures (convolutional layer decomposed into depth-wise and point-wise layers); Reduce Precision – 32-bit float (10100101000000000101000000000100) → 8-bit fixed (01100110) → binary (0); a quantization sketch follows below]

No guarantee that the DNN algorithm designer will use a given approach. Need flexible hardware!
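A minimal sketch of the reduce-precision idea: quantize 32-bit float weights to 8-bit fixed point with a symmetric per-tensor scale, then dequantize to inspect the rounding error (one simple scheme chosen for illustration; many variants exist):

```python
import numpy as np

w = np.random.randn(6).astype(np.float32)   # 32-bit float weights

scale = np.abs(w).max() / 127.0             # map max |w| onto the int8 range
w_q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)  # 8-bit fixed
w_hat = w_q.astype(np.float32) * scale      # dequantized approximation

print(w)
print(w_q)
print("max quantization error:", np.abs(w - w_hat).max())
```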

[Chen*, Yang*, SysML 2018]

36 Existing DNN Architectures

• Specialized DNN hardware often relies on certain properties of the DNN in order to achieve high energy efficiency
• Example: reduce memory access by amortizing across the MAC array

[Diagram: MAC array amortizes reads from the weight memory (weight reuse) and the activation memory (activation reuse)]

37–39 Limitation of Existing DNN Architectures

• Example: reuse and array utilization depend on the # of channels and the feature map/batch size
  – Not efficient across all network architectures (e.g., compact DNNs); for example, when mapping a depth-wise layer (R × S × 1), only one input channel occupies the array
  – Less efficient as the array scales up in size
  – Can be challenging to exploit sparsity

[Diagram: MAC array with # of filters (output channels) along one dimension and # of feature map input channels or batch size along the other, shown for spatial accumulation and for temporal accumulation]

40 Need Flexible Dataflow & Mapping

• Use flexible dataflow (Row Stationary) to exploit reuse in any dimension of DNN to increase energy efficiency and array utilization

Example: Depth-wise layer

[Chen, JETCAS 2019]

41 Need Flexible NoC for Varying Reuse

• When reuse is available, need multicast to exploit spatial data reuse for energy efficiency and high array utilization
• When reuse is not available, need unicast for high BW (e.g., weights for FC layers) for high PE utilization
• An all-to-all network satisfies both, but is too expensive and not scalable

[Diagram: NoC design space, from high bandwidth / low spatial reuse to low bandwidth / high spatial reuse – Unicast Networks, 1D Systolic Networks, 1D Multicast Networks, Broadcast Network, each connecting a global buffer to a PE array]

[Chen, JETCAS 2019]

42 Hierarchical Mesh

[Diagram: hierarchical mesh network – GLB clusters, router clusters, and PE clusters connected by a mesh network, with an all-to-all network within each PE cluster; configurable modes include high bandwidth, high reuse, grouped multicast, and interleaved multicast]

[Chen, JETCAS 2019]

43 Eyeriss v2: Balancing Flexibility and Efficiency

• Uses a flexible hierarchical mesh on-chip network to efficiently support
  – Wide range of filter shapes
  – Different layers
  – Wide range of sparsity
• Scalable architecture

Speed up over Eyeriss v1 scales with the number of PEs:

# of PEs     256     1024    16384
AlexNet      17.9x   71.5x   1086.7x
GoogLeNet    10.4x   37.8x   448.8x
MobileNet    15.7x   57.9x   873.0x

Over an order of magnitude faster and more energy efficient than Eyeriss v1

[Joint work with Joel Emer]

[Chen, JETCAS 2019]

44 Eyexam: Performance Evaluation Framework

[Plot: performance (MAC/cycle) vs. operational intensity (MAC/data) – Step 1: max workload parallelism (depends on DNN model); Step 2: max dataflow parallelism; peak performance = number of PEs (theoretical peak performance)]

A systematic way of understanding the performance limits for DNN hardware as a function of specific characteristics of the DNN model and hardware design

[Chen, arXiv 2019: https://arxiv.org/abs/1807.07928]

[Joint work with Joel Emer]

45 Eyexam: Performance Evaluation Framework

[Plot: roofline – performance (MAC/cycle) vs. operational intensity (MAC/data); in the bandwidth-bounded region the slope = BW to the PEs, and in the compute-bounded region performance is flat at the peak (number of PEs, the theoretical peak performance)]

Based on the Roofline Model [Williams, CACM 2009]
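A minimal sketch of the roofline bound that Eyexam builds on: attainable performance is the smaller of the compute peak (number of PEs) and bandwidth times operational intensity. The numbers below are arbitrary.

```python
def roofline(num_pes, bw, intensity):
    """Attainable MAC/cycle for a given operational intensity (MAC/data)."""
    peak = num_pes             # compute roof: one MAC per PE per cycle
    bw_bound = bw * intensity  # bandwidth roof: data/cycle x MAC/data
    return min(peak, bw_bound)

for intensity in [0.5, 2, 8, 32]:
    perf = roofline(num_pes=256, bw=16, intensity=intensity)
    bound = "BW-bounded" if perf < 256 else "compute-bounded"
    print(f"intensity {intensity:>4}: {perf:>5.0f} MAC/cycle ({bound})")
```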

46–47 Eyexam: Performance Evaluation Framework

[Plot: MAC/cycle vs. MAC/data (workload operational intensity), with successively tighter performance bounds]

Step 1: max workload parallelism
Step 2: max dataflow parallelism (peak performance = number of PEs, the theoretical peak)
Step 3: # of active PEs under a finite PE array size
Step 4: # of active PEs under fixed PE array dimensions
Step 5: # of active PEs under fixed storage capacity (slope = BW to only the active PEs)
Step 6: lower active PE utilization due to insufficient average BW
Step 7: lower active PE utilization due to insufficient instantaneous BW

48 DNN Accelerator Evaluation Tools

• Require a systematic way to
  – Evaluate and compare DNN accelerators
  – Rapidly explore the design space

• Accelergy [Wu, ICCAD 2019] (energy estimator tool)
  – Early stage estimation tool at the architecture level
    • Estimates energy based on architecture level components (e.g., # of PEs, memory size, on-chip network)
  – Evaluates architecture level impact of emerging devices
    • Plug-ins for different technologies

• Timeloop [Parashar, ISPASS 2019] (DNN mapping tool & performance simulator)
  – DNN mapping tool
  – Performance simulator → action counts

[Diagram: an architecture description and a compound component description feed Accelergy; energy estimation plug-ins supply per-action energies; action counts (from Timeloop) complete the energy estimation]

Open-source code available at: http://accelergy.mit.edu

[Joint work with Joel Emer]

49 Accelergy Estimation Validation

• Validation on Eyeriss [Chen, ISSCC 2016]
  – Achieves 95% accuracy compared to post-layout simulations
  – Accurately captures energy breakdown at different granularities

Open-source code available at: http://accelergy.mit.edu

[Wu, ICCAD 2019]

50–53 Accelergy Infrastructure

Open-source code available at: http://accelergy.mit.edu

[Diagram, built up across these slides: an architecture description (a Global Buffer (GLB) and PEs PE0 – PE3) and a compound component description (GLB = SRAM + control + ...; PE = multiplier + adder + ...) feed Accelergy; an energy estimation plug-in supplies per-action energies (e.g., multiplier, 65nm, width 16, multiply: 0.8 pJ); action counts (e.g., PE0, compute, 500) are combined with these to produce per-component energy estimates (e.g., PE0: 1500 pJ). A minimal sketch of this bookkeeping follows below.]
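A minimal sketch of the bookkeeping in the diagram above: total energy is the sum over components of action counts times per-action energy supplied by a plug-in. The multiplier value comes from the diagram; the adder and SRAM values reuse the earlier energy table; all are illustrative rather than Accelergy's actual tables.

```python
# Per-action energies, as an energy-estimation plug-in would supply them (pJ).
per_action_pj = {
    ("multiplier", "multiply"): 0.8,   # from the diagram above
    ("adder", "add"): 0.05,            # 16b add, from the earlier table
    ("glb_sram", "read"): 5.0,         # 32b SRAM read, from the earlier table
}

# Action counts, e.g., as produced by a mapping/performance tool like Timeloop.
action_counts = [
    ("multiplier", "multiply", 500),
    ("adder", "add", 500),
    ("glb_sram", "read", 120),
]

total = sum(per_action_pj[(comp, action)] * count
            for comp, action, count in action_counts)
print(f"estimated energy: {total:.1f} pJ")  # 400 + 25 + 600 = 1025.0 pJ
```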

[Wu, ICCAD 2019]

54 Plug-ins for Fine-Grain Action Energy Estimation

• External energy/area models that accurately reflect the properties of a macro
  – e.g., multiplier with zero-gating

[Bar charts: energy characterizations of the zero-gated multiplier (normalized to idle) – random multiply 23.0, reused multiply 16.8, gated multiply 1.3 (~20x lower); energy consumption (µJ) from PnR simulations (ground truth) vs. Accelergy across PE0 – PE8, PEs that process data of different sparsity]

With the characterization provided in the plug-in, we can capture the energy savings for sparse workloads

[Wu, ICCAD 2019]

55 In-Memory Computing (IMC*)

* a.k.a. Processing In Memory (PIM)

• Reduce data movement by moving compute into memory
• Compute MAC with memory storage element
• Analog compute
  – Activations, weights and/or partial sums are encoded with analog voltage, current, or resistance
  – Increased sensitivity to circuit non-idealities
  – A/D and D/A circuits to interface with digital domain
• Leverage emerging memory device technology

[Circuit diagram: activation is input voltage (V_i), weight is resistor conductance (G_i); each cell contributes I_i = V_i × G_i, and the psum is the output current I = I_1 + I_2 = V_1×G_1 + V_2×G_2; a numeric sketch follows below]

Image source: [Shafiee, ISCA 2016]
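A numeric sketch of the analog MAC in the diagram: each cell contributes current I_i = V_i × G_i (Ohm's law) and the shared column wire sums the currents (Kirchhoff's current law). The values are arbitrary.

```python
activations_V = [0.3, 0.7]   # activations encoded as input voltages (V)
weights_G = [2e-3, 1e-3]     # weights encoded as conductances (S)

# Column current = V1*G1 + V2*G2 -- the psum comes out as a single read.
psum_A = sum(v * g for v, g in zip(activations_V, weights_G))
print(f"psum (column current): {psum_A * 1e3:.2f} mA")  # 1.30 mA
```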

56 In-Memory Computing (IMC)

[Diagram: a memory array with DAC-driven rows (B rows, input activations) and columns of storage elements (A columns); analog logic (mult/add/shift) and ADCs at the periphery read out psums / output activations]

• Implement as matrix-vector multiply
  – Typically, the matrix is composed of stored weights and the vector of input activations
• Reduce weight data movement by moving compute into the memory (weight stationary dataflow)
  – Perform MAC with the storage element or in peripheral circuits
  – Read out partial sums rather than weights → fewer accesses through peripheral circuits
• Increase weight bandwidth
  – Multiple weights accessed in parallel to keep MACs busy (high utilization)
• Increase amount of parallel MACs
  – Storage element can be higher area density than digital MAC
  – Reduce routing capacitance

57 Accelergy for IMC

Open-source code available at: http://accelergy.mit.edu

[Diagram: the same Accelergy infrastructure, extended for IMC – the PE's compound component description now includes a memristor crossbar, ADC, and DAC, and the energy estimation plug-ins supply per-action energies for these components]

[Wu, ISPASS 2020]

58 Accelergy for IMC

Open-source code available at: http://accelergy.mit.edu

Achieves ~95% accuracy compared to Cascade [MICRO 2019]

[Bar chart: energy breakdown across VGG layers 1–8 (energy consumption in µJ), comparing this work (0.037 J total) against Cascade (0.035 J); dominated by the A2D conversion system (66.9% vs. 67.9%), with the D2A conversion system, digital accumulation, PE array, and input buffer making up the remainder (17.7%/17.4%, 12.6%/11.4%, 3.0%/3.1%, 0.01%/N/A)]

[Wu, ISPASS 2020]

59 Accelergy + Timeloop Tutorial

Tutorial material available at http://accelergy.mit.edu/tutorial.html (includes videos and hands-on exercises)

60 Designing DNNs for IMC

• Designing DNNs for IMC may differ from designing DNNs for digital processors
• The highest accuracy DNN on a digital processor may be different on IMC
  – Accuracy drops based on robustness to non-idealities
• Reducing the number of weights is less desirable
  – Since IMC is weight stationary, it may be better to reduce the number of activations
  – IMC tends to have larger arrays → fewer weights may lead to low utilization on IMC

[Diagram: ImageNet accuracy comparison; a filter (R × S × C, M filters) mapped onto the storage elements of an IMC array]

[Yang, IEDM 2019]

61 Book Chapter on In-Memory Computing

Many Design Considerations for In-Memory Computing

• Number of Storage Elements per Weight
• Array Size
• Number of Rows Activated in Parallel
• Number of Columns Activated in Parallel
• Time to Deliver Input
• Time to Compute MAC

Tradeoffs between energy efficiency, throughput, area density, and accuracy reduce the achievable gains over conventional architectures.

Available on the DNN tutorial website: http://eyeriss.mit.edu/tutorial.html

62 Where to Go Next: Planning and Mapping

Robot Exploration

63 Where to Go Next: Planning and Mapping

Robot Exploration: Decide where to go by computing Shannon Mutual Information

[Flowchart: select candidate scan locations → compute Shannon MI and choose the best location → move to that location and scan → update the occupancy map → repeat]

[Images: occupancy map (“Where to scan?”), mutual information map, updated map]

[Joint work with Sertac Karaman]

64 Experimental Results (4x Real Time)

Occupancy map with planned path using RRT* (compute MI on all possible paths)

MI surface

Exploration with a mini race car using motion capture for localization

[Zhang, ICRA 2019]

65 Building a Hardware Accelerator to Compute MI

Motivation: compute MI faster for faster exploration!

Fast Shannon Mutual Information (FSMI) [Zhang, ICRA 2019]: the algorithm is embarrassingly parallel! High throughput should be possible with multiple processing elements (PEs).

[Diagram: sensor beams processed in parallel, one beam per PE (PE 1 ... PE C)]

66 Challenge is Data Delivery to All PEs

Power consumption of memory scales with the number of ports; low-power SRAM is limited to two ports!

[Diagram: a two-port SRAM (Read Port 1, Read Port 2) must feed all PEs (PE 1 ... PE C)]

Data delivery, specifically memory bandwidth, limits the throughput (not compute)

67 Proposed Accelerator Architecture

Increase memory bandwidth (read ports) by partitioning the map storage into multiple banks

[Diagram: the entire map is partitioned into banks (Bank 1 ... Bank B); an arbiter connects the banks to the PEs (PE 1 ... PE C)]

The proposed architecture includes:
1) A memory banking pattern that minimizes memory access conflicts among all PEs
2) An efficient arbiter that quickly identifies and resolves memory access conflicts among all PEs

68 Memory Access Pattern

• Design a fixed banking pattern that minimizes the number of memory access collisions
• Challenge: the memory access pattern depends on the scan location and sensor angle

Memory access pattern at every cycle

• The number denotes the order of memory access in each PE.

• During every cycle, PEs access the map locations in the same column or row.

69 Memory Access Pattern

[Diagrams: the memory access patterns at scan locations A and B differ]

70 Naïve Memory Banking Pattern

[Diagram: vertical banking pattern – each map column is assigned to one of Bank 0 – Bank 7; since PEs read the map along the same row or column every cycle (the numbers denote the access order in each PE), same-column accesses all fall in one bank, causing conflicts]

71 Proposed Memory Banking Pattern

[Diagram: diagonal banking pattern – Bank 0 – Bank 7 are assigned along diagonals of the map; PEs still read the map at the same row or column every cycle, but the diagonal assignment spreads those accesses across banks, reducing conflicts; a minimal sketch follows below]
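A minimal sketch of why the diagonal pattern helps, assuming B banks, bank-by-column for the vertical pattern, and bank = (row + col) mod B as a simple diagonal-style assignment (one instance of the Latin-square idea generalized later); eight PEs read one column in the same cycle.

```python
B = 8  # number of memory banks

def vertical_bank(row, col):
    return col % B            # vertical banking: bank depends on column only

def diagonal_bank(row, col):
    return (row + col) % B    # diagonal banking: banks rotate along each row

accesses = [(row, 3) for row in range(8)]  # 8 PEs read column 3 in one cycle

for name, bank_of in [("vertical", vertical_bank), ("diagonal", diagonal_bank)]:
    banks = [bank_of(r, c) for r, c in accesses]
    worst = max(banks.count(b) for b in set(banks))  # serialization factor
    print(f"{name:8s} banks={banks} worst-case conflicts={worst}")
# vertical: all 8 reads hit bank 3 (fully serialized)
# diagonal: reads spread across all 8 banks (conflict-free this cycle)
```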

[Li, RSS 2019]

72 Experimental Results

[Plot: throughput (MI/s, ×10^4) vs. number of FSMI cores (2–16) for baseline (1 bank); 16 banks, vertical banking, 1x1 packing; 16 banks, diagonal banking, 1x1 packing; 16 banks, diagonal banking, 2x2 packing; and unlimited bandwidth]

Specialized banking, an efficient memory arbiter, and packing multiple values at each address result in throughput within 94% of the theoretical limit (unlimited bandwidth).

Compute MI for an entire map of 20m x 20m at 0.1m resolution in under a second on a ZC706 FPGA (100x faster than a CPU at 10x lower power).

[Li, RSS 2019]

73 Generalize to a Class of Banking Patterns

We rigorously proved that using Latin-square tiles minimizes read conflicts between PEs

Gold Medal at ACM Student Research Competition

74 Summary

• Efficient computing is critical for advancing the progress of AI & autonomous robots → critical step to making AI & autonomy ubiquitous!
• In order to meet computing demands in terms of power and speed, need to redesign computing hardware from the ground up → focus on data movement!
• Specialized hardware creates new opportunities for the co-design of algorithms and hardware → innovation opportunities for the future of AI & robotics!

[Diagram: Algorithms ↔ Hardware co-design]

75 Acknowledgements

Joel Emer, Sertac Karaman

Research conducted in the MIT Energy-Efficient Multimedia Systems Group would not be possible without the support of the following organizations:

76 Low-Energy Autonomy and Navigation (LEAN) Group

Group Website: http://lean.mit.edu

77 Book on Efficient Processing of DNNs

Part I – Understanding Deep Neural Networks: Introduction; Overview of Deep Neural Networks

Part II – Design of Hardware for Processing DNNs: Key Metrics and Design Objectives; Kernel Computation; Designing DNN Accelerators; Operation Mapping on Specialized Hardware

Part III – Co-Design of DNN Hardware and Algorithms: Reducing Precision; Exploiting Sparsity; Designing Efficient DNN Models; Advanced Technologies

https://tinyurl.com/EfficientDNNBook

Free download for institutional subscribers

78 Excerpts of Book

Available on DNN tutorial website http://eyeriss.mit.edu/tutorial.html

79 Additional Resources

Talks and Tutorials Available Online: https://tinyurl.com/SzeMITDL2020

YouTube Channel: EEMS Group – PI: Vivienne Sze

80 References

• Efficient Processing for Deep Neural Networks
  – Project website: http://eyeriss.mit.edu
  – Y.-H. Chen, T.-J. Yang, J. Emer, V. Sze, “Eyeriss v2: A Flexible Accelerator for Emerging Deep Neural Networks on Mobile Devices,” IEEE Journal on Emerging and Selected Topics in Circuits and Systems (JETCAS), Vol. 9, No. 2, pp. 292-308, June 2019.
  – Y.-H. Chen, T. Krishna, J. Emer, V. Sze, “Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks,” IEEE Journal of Solid-State Circuits (JSSC), ISSCC Special Issue, Vol. 52, No. 1, pp. 127-138, January 2017.
  – Y.-H. Chen, J. Emer, V. Sze, “Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks,” International Symposium on Computer Architecture (ISCA), pp. 367-379, June 2016.
  – Y.-H. Chen*, T.-J. Yang*, J. Emer, V. Sze, “Understanding the Limitations of Existing Energy-Efficient Design Approaches for Deep Neural Networks,” SysML Conference, February 2018.
  – V. Sze, Y.-H. Chen, T.-J. Yang, J. Emer, “Efficient Processing of Deep Neural Networks: A Tutorial and Survey,” Proceedings of the IEEE, Vol. 105, No. 12, pp. 2295-2329, December 2017.
  – Y. N. Wu, J. S. Emer, V. Sze, “Accelergy: An Architecture-Level Energy Estimation Methodology for Accelerator Designs,” International Conference on Computer-Aided Design (ICCAD), November 2019. http://accelergy.mit.edu/
  – Y. N. Wu, V. Sze, J. S. Emer, “An Architecture-Level Energy and Area Estimator for Processing-In-Memory Accelerator Designs,” to appear in IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), April 2020.
  – A. Suleiman*, Y.-H. Chen*, J. Emer, V. Sze, “Towards Closing the Energy Gap Between HOG and CNN Features for Embedded Vision,” IEEE International Symposium of Circuits and Systems (ISCAS), Invited Paper, May 2017.
  – Hardware Architecture for Deep Neural Networks tutorial: http://eyeriss.mit.edu/tutorial.html

81 References

• Co-Design of Algorithms and Hardware for Deep Neural Networks
  – T.-J. Yang, Y.-H. Chen, V. Sze, “Designing Energy-Efficient Convolutional Neural Networks using Energy-Aware Pruning,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. Energy estimation tool: http://eyeriss.mit.edu/energy.html
  – T.-J. Yang, A. Howard, B. Chen, X. Zhang, A. Go, V. Sze, H. Adam, “NetAdapt: Platform-Aware Neural Network Adaptation for Mobile Applications,” European Conference on Computer Vision (ECCV), 2018. http://netadapt.mit.edu
  – D. Wofk*, F. Ma*, T.-J. Yang, S. Karaman, V. Sze, “FastDepth: Fast Monocular Depth Estimation on Embedded Systems,” IEEE International Conference on Robotics and Automation (ICRA), May 2019. http://fastdepth.mit.edu/
  – T.-J. Yang, V. Sze, “Design Considerations for Efficient Deep Neural Networks on Processing-in-Memory Accelerators,” IEEE International Electron Devices Meeting (IEDM), Invited Paper, December 2019.

• Low Power Time-of-Flight Imaging
  – J. Noraky, V. Sze, “Low Power Depth Estimation of Rigid Objects for Time-of-Flight Imaging,” IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2019.
  – J. Noraky, V. Sze, “Depth Map Estimation of Dynamic Scenes Using Prior Depth Information,” arXiv, February 2020. https://arxiv.org/abs/2002.00297
  – J. Noraky, V. Sze, “Depth Estimation of Non-Rigid Objects for Time-of-Flight Imaging,” IEEE International Conference on Image Processing (ICIP), October 2018.
  – J. Noraky, V. Sze, “Low Power Depth Estimation for Time-of-Flight Imaging,” IEEE International Conference on Image Processing (ICIP), September 2017.

82 References

• Energy-Efficient Visual-Inertial Localization
  – Project website: http://navion.mit.edu
  – A. Suleiman, Z. Zhang, L. Carlone, S. Karaman, V. Sze, “Navion: A Fully Integrated Energy-Efficient Visual-Inertial Odometry Accelerator for Autonomous Navigation of Nano Drones,” IEEE Symposium on VLSI Circuits (VLSI-Circuits), June 2018.
  – Z. Zhang*, A. Suleiman*, L. Carlone, V. Sze, S. Karaman, “Visual-Inertial Odometry on Chip: An Algorithm-and-Hardware Co-design Approach,” Robotics: Science and Systems (RSS), July 2017.
  – A. Suleiman, Z. Zhang, L. Carlone, S. Karaman, V. Sze, “Navion: A 2mW Fully Integrated Real-Time Visual-Inertial Odometry Accelerator for Autonomous Navigation of Nano Drones,” IEEE Journal of Solid-State Circuits (JSSC), VLSI Symposia Special Issue, Vol. 54, No. 4, pp. 1106-1119, April 2019.

83 References

• Fast Shannon Mutual Information for Robot Exploration
  – Project website: http://lean.mit.edu
  – Z. Zhang, T. Henderson, V. Sze, S. Karaman, “FSMI: Fast Computation of Shannon Mutual Information for Information-Theoretic Mapping,” IEEE International Conference on Robotics and Automation (ICRA), May 2019.
  – P. Li*, Z. Zhang*, S. Karaman, V. Sze, “High-throughput Computation of Shannon Mutual Information on Chip,” Robotics: Science and Systems (RSS), June 2019.
  – Z. Zhang, T. Henderson, S. Karaman, V. Sze, “FSMI: Fast Computation of Shannon Mutual Information for Information-Theoretic Mapping,” to appear in International Journal of Robotics Research (IJRR). http://arxiv.org/abs/1905.02238
  – T. Henderson, V. Sze, S. Karaman, “An Efficient and Continuous Approach to Information-Theoretic Exploration,” IEEE International Conference on Robotics and Automation (ICRA), May 2020.

• Balancing Actuation and Computation
  – Project website: http://lean.mit.edu
  – S. Sudhakar, S. Karaman, V. Sze, “Balancing Actuation and Computing Energy in Motion Planning,” IEEE International Conference on Robotics and Automation (ICRA), May 2020.
