Mobile/Embedded DNN and AI SoCs
Hoi-Jun Yoo, KAIST

Outline
1. Deep Neural Network Processor
   – Mobile DNN Applications
   – Basic CNN Architectures
2. M/E-DNN: Mobile/Embedded Deep Neural Network
   – Requirements of M/E-DNN
   – M/E-DNN Design & Example
3. SoC Applications of M/E-DNN
   – Hybrid CIS, CNNP Processor and DNPU Processor
   – Hybrid Intelligent Systems
   – AR Processor, UI/UX Processor and ADAS Processor

Types of AI SoCs
(Diagram: cloud, edge, and mobile AI plotted against general intelligence, high at the cloud end and low at the mobile end, versus real-time operation requirement, low at the cloud end and high at the mobile end; control and control models flow downward while data and learned models flow upward.)
• Cloud: general AI, scalability, global data sharing
   – Control based on overall conditions
   – Learning with data collected from many edge machines
   – High maintenance and cooling cost
• Edge: stand-alone AI
   – Control based on individual situation
   – Real-time inference and learning
• Mobile: embedded AI, specific AI, UI/UX
   – Low-power operation
   – Real-time autonomous learning
   – Limited data sharing

Emerging Mobile Applications
• The need for embedded vision & AI
   – Security & surveillance
   – Visual perception & analytics
   – ADAS & autonomous cars
   – Augmented reality
   – Drones

Deep Neural Networks
• MLP (Multi-Layer Perceptron)
   – Characteristic: fully connected layers (input – hidden – output)
   – Major application: speech recognition
   – Number of layers: 3~10
• CNN (Convolutional)
   – Characteristic: convolution and pooling layers followed by fully connected layers
   – Major application: vision processing
   – Number of layers: max ~100
• RNN (Recurrent)
   – Characteristic: feedback path and internal state for sequential data
   – Major application: speech recognition, action recognition
   – Number of layers: 3~5

Mobile/Embedded DNN Requirements
(Diagram: a very deep cloud-side network of conv/pool/FC layers next to a much shallower mobile/embedded network.)
• Cloud-based DNN
   – Cloud intelligence, general AI (very deep network)
   – Communication latency
   – High number precision
   – High power / cooling cost
   – Large memory capacity
• Mobile/Embedded DNN
   – On-device intelligence, application-specific AI (specific DNN network)
   – Real-time operation
   – Reduced number precision
   – Low-power consumption
   – Efficient memory architecture

Mobile/Embedded DNN Architecture
[1] Kwanho Kim et al., ISSCC 2008; [2] Seungjin Lee et al., IEEE TNN 2011
• 2D Cell + Shared PE Digital CNN
   – 2D cell array to remove emulation overhead
   – Programmable PE for flexible operation
• Cells: storage and communication
• PEs: processing of cell data
(Diagram: 4x4 cell array C00–C33 with a cell-to-PE read bus, a PE-to-cell write bus, and cell-to-cell communication through a 2D shift register; PEs PE1–PE4 are shared along the rows.)

SRAM-Based Architecture
[8] Kwanho Kim et al., ISSCC 2008
• Digital cellular neural network
• 80x60 shift-register cell array: four 20x60 cell arrays with two 60-VPE arrays, a controller, and a 2KB IMEM
• 120 VPEs shared by the cells (see the functional sketch below)
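The cell-plus-shared-PE scheme above can be captured in a short functional model. The sketch below is a minimal illustration, not the actual VPE microarchitecture: the cell array is treated as a 2D register file whose contents shift to neighbouring cells, and the shared PEs multiply-accumulate each shifted view to produce a 3x3 convolution. The helper names (`shift`, `conv3x3_via_shifts`), the Sobel kernel, and the random frame are assumptions made purely for illustration.

```python
import numpy as np

def shift(cells, dy, dx):
    """Cell-to-cell communication: every cell reads its neighbour at offset (dy, dx),
    with zeros shifted in at the array border (models the 2D shift-register array)."""
    out = np.zeros_like(cells)
    h, w = cells.shape
    src_y = slice(max(dy, 0), h + min(dy, 0))
    src_x = slice(max(dx, 0), w + min(dx, 0))
    dst_y = slice(max(-dy, 0), h + min(-dy, 0))
    dst_x = slice(max(-dx, 0), w + min(-dx, 0))
    out[dst_y, dst_x] = cells[src_y, src_x]
    return out

def conv3x3_via_shifts(cells, kernel):
    """Shared-PE processing: each shift exposes one 3x3 tap to the PEs, which
    multiply-accumulate it into the per-cell partial sums (one pass per tap)."""
    acc = np.zeros_like(cells, dtype=np.float32)
    for ky in range(3):
        for kx in range(3):
            tap = shift(cells, ky - 1, kx - 1)   # neighbour value seen by every cell
            acc += kernel[ky, kx] * tap          # MAC step performed by the shared PEs
    return acc

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frame = rng.random((60, 80), dtype=np.float32)   # 80x60 cell array, one value per cell
    sobel_x = np.array([[-1, 0, 1],
                        [-2, 0, 2],
                        [-1, 0, 1]], dtype=np.float32)
    out = conv3x3_via_shifts(frame, sobel_x)
    print(out.shape)   # (60, 80): same-size feature map with zero-padded borders
```

The model updates every cell in parallel for clarity; the chip described above instead time-shares its 120 VPEs over the 4,800 cells (about 40 cells per VPE).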
SRAM Memory-Based NN Cell
[8] Kwanho Kim et al., ISSCC 2008
(Circuit diagram: each cell pairs a 6T SRAM register file with a shift register; an NMOS-only mux/demux built from dynamic logic selects the north/south/east/west neighbour onto two read buses, with separate write buses and load, shift, equalize, and precharge controls.)
• Dynamic logic-based shift register → small cell area

Chip Summary and Applications
[8] Kwanho Kim et al., ISSCC 2008; [10] Seungjin Lee et al., ISSCC 2010
(Die photo: cell arrays 1–4, PE arrays 1–2, and a controller.)
• Process technology: 0.13μm 8M CMOS
• Area: 4.5 mm2
• Number of cells: 4,800
• Number of PEs: 120
• Operating frequency: 200 MHz
• Peak performance: 24 GOPS (22 GOPS sustained)
• Active power: 84 mW
• Average power (15 fps): 6 mW

KAIST Approach: MC Processor + M/E-DNN
• "Left brain" – multi-core SoC, hard computing
   – CPU / DSP, special-purpose HW, multiple cores
   – Fast / accurate computing performance
• "Right brain" – M/E-DNN, soft computing
   – Deep neural nets, fuzzy logic, learning / evolution
   – Suitable for solving imprecise, uncertain, and ambiguous problems
• Intelligent computing system: multi-core digital processors + DNN soft-computing hardware, connected through a Network-on-Chip (NoC)

Low-Power Face Recognition SoC
[1] Kyeongryeol Bong et al., ISSCC 2017
• Conventional face detector: face detection by a vision processor
• Hybrid face detector: face detection inside the CMOS image sensor
   – Combines analog & digital face detectors
(Block diagram: CMOS image sensor with a 320x240 pixel array, column-amplifier and ADC arrays, 80x20 analog memory, and an analog Haar-like filtering circuit, followed by a digital Haar-like face detector with integral-image and Haar-like filtering units, feeding an ultra-low-power CNN processor.)

Chip & System Implementation
[1] Kyeongryeol Bong et al., ISSCC 2017
• Always-on face detector (FD): 3.3mm x 3.36mm CIS in 65nm
   – 320x240 pixel array, Haar-like filtering circuit & analog memory
• Ultra-low-power CNN processor: 4mm x 4mm CNNP in 65nm
   – 4x4 PE array with local T-SRAM

Chip & System Implementation
[1] Kyeongryeol Bong et al., ISSCC 2017
• The most accurate face recognition SoC
   – 97% accuracy achieved using CNN on the LFW dataset
• 0.62mW face recognition system including imaging

CNN + RNN Deep Neural Network
[7] Dongjoo Shin et al., ISSCC 2017
• CNN: visual feature extraction and recognition on single frames
   – Face recognition, image classification, ...
• RNN: sequential data recognition and generation
   – Translation, speech recognition, ...
• CNN + RNN: CNN-extracted features become the RNN input – the CNN extracts visual features frame by frame and the RNN processes the temporal data across multiple frames (see the sketch below)
• Previous works
   – Optimized for the convolution layer only: [6], [3]
   – Optimized for the FC layer and RNN only: [5]
   [3] B. Moons, SOVC 2016; [5] S. Han, ISCA 2016; [6] Y. Chen, ISSCC 2016
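To make the CNN → RNN hand-off concrete, here is a minimal NumPy sketch (not the DNPU implementation): a stand-in feature extractor reduces each video frame to a feature vector, and a single LSTM cell consumes those per-frame features as a sequence. The function names, dimensions, and random weights are placeholders chosen only for illustration; a real system would use trained CNN and LSTM parameters.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cnn_features(frame, proj):
    """Stand-in for the CNN front end: collapse one frame into a feature vector.
    (A real CNN would apply conv/pool layers; a fixed projection keeps the sketch short.)"""
    return np.tanh(proj @ frame.ravel())

def lstm_step(x, h, c, W, U, b):
    """One LSTM step over a per-frame feature vector x, carrying hidden/cell state (h, c)."""
    H = h.size
    z = W @ x + U @ h + b                                        # all four gates in one matrix multiply
    i, f, o = (sigmoid(z[k * H:(k + 1) * H]) for k in range(3))  # input, forget, output gates
    g = np.tanh(z[3 * H:])                                       # candidate cell update
    c = f * c + i * g
    h = o * np.tanh(c)
    return h, c

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    D, H = 64, 32                                    # feature and hidden sizes (illustrative)
    proj = rng.standard_normal((D, 32 * 32)) * 0.01  # stand-in CNN weights
    W = rng.standard_normal((4 * H, D)) * 0.1        # LSTM input weights
    U = rng.standard_normal((4 * H, H)) * 0.1        # LSTM recurrent weights
    b = np.zeros(4 * H)
    h, c = np.zeros(H), np.zeros(H)
    for _ in range(8):                               # 8 frames of a 32x32-pixel clip
        frame = rng.random((32, 32))
        h, c = lstm_step(cnn_features(frame, proj), h, c, W, U, b)
    print(h.shape)                                   # (32,): temporal summary of the clip
```

The split mirrors the heterogeneous processor described next, which pairs a convolution processor with a separate FC-RNN-LSTM processor.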
Processor Supporting Both CNN & RNN
[7] Dongjoo Shin et al., ISSCC 2017
• Heterogeneous architecture: a top RISC controller with a custom instruction-set-based controller, instruction/data memories, and external interfaces ties together two specialized processors
   – Convolution processor for convolutional layers (CNN): input buffer, weight buffer, four convolution clusters, and an aggregation core
   – FC-RNN-LSTM processor for fully connected layers (CNN) and recurrent neural networks: quantization-table-based matrix multiplier, accumulation registers, and activation-function LUT

DNPU: Chip Implementation
[7] Dongjoo Shin et al., ISSCC 2017
(Die photo: external gateway, top controller, pre-processing core, CNN controller with convolution clusters 0–3 and aggregation core, and FC-LSTM processor.)
• Specifications
   – Process: 65nm 1P8M CMOS
   – Die area: 4mm x 4mm (16mm2)
   – SRAM: 280 KB (CLP), 10 KB (FCRLP)
   – Supply: 0.77V ~ 1.1V
   – Frequency: 50 ~ 200MHz
   – Power: 34.6mW @ 50MHz, 0.77V; 279mW @ 200MHz, 1.1V
   – Energy efficiency (CLP): 2.1 TOPS/W (16b word length) / 8.1 TOPS/W (4b) @ 50MHz, 0.77V; 1.0 TOPS/W (16b) / 3.9 TOPS/W (4b) @ 200MHz, 1.1V
• CNN-RNN processor: 4mm x 4mm in 65nm
• Supply voltage: 0.77V ~ 1.1V, frequency: 50 ~ 200MHz
• Energy efficiency: 8.1 TOPS/W (50MHz, 0.77V, 4-bit word length), 3.9 TOPS/W (200MHz, 1.1V, 4-bit word length)

Pet Robot Demonstration Video
[7] Dongjoo Shin et al., ISSCC 2017

Requirements for AR: High-Throughput Recognition
(Example: recognizing a car in the camera view – Cooper: 98%, Peugeot: 45%, Audi TT: 11% – and overlaying "Mini Cooper, 1600CC, $35,000".)
• Marker-based AR
   – 2D-barcode / 2D-marker
   – Markers on everything? An impossible solution in the real world
• Markerless AR
   – Natural feature extraction
   – >10x higher computation: difficult to run in real time (30fps)

Markerless AR Process

Requirements for AR in HMD: Energy
• Hands-free head-mounted display (HMD)
   – Smaller battery size
   – Complicated markerless AR functionalities
• Devices shown: [a] HP iPAQ, [b] Apple iPhone 5, [c] Google Project Glass

SoC Architecture of K-Glass
[12] Gyeonghoon Kim et al., ISSCC 2014

BONE-AR: Chip Implementation
[12] Gyeonghoon Kim et al., ISSCC 2014
(Die photo: Visual Attention Processor (VAP, 5 cores), Keypoint Detection Processor (KDP, 8 cores), Scale Space Processor (SSP, 2 cores), Rendering Accelerator (RA, 2 cores), Descriptor Generation Processor (DGP, 10 cores), Keypoint Matching Accelerator (KMA, 4 cores), Pose Estimation Processor (PEP, 2 cores), Neural Network Processor, and Mixed-Mode SVM.)
• 4mm x 8mm in 65nm CMOS technology
• 381mW average / 778mW peak power consumption
• 1.22 TOPS peak performance
• 1.57 TOPS/W energy efficiency for 720p video
• The recognition stages of this markerless AR pipeline are sketched in software at the end of the section

Demo: BONE-AR

Gaze User Interface
• Step 1. "Point" the cursor by "gaze"
• Low-power (<50mW [a]) always-on gaze UI
   – Previous work [b]: 100mW
[a] Power consumption of a wireless optical mouse reported in "Mouse Battery Life Comparison Test Report", Microsoft Corp., 2004
[b] D.
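As a software companion to the markerless AR pipeline above (keypoint detection, descriptor generation, keypoint matching, and pose estimation, the stages accelerated by BONE-AR's KDP, DGP, KMA, and PEP), here is a minimal OpenCV sketch. It illustrates the algorithmic flow only, not the BONE-AR hardware: the choice of ORB features, the brute-force matcher, the RANSAC threshold, and the image file names are all assumptions made for the example.

```python
import cv2
import numpy as np

def recognize_object(reference_path, frame_path, min_matches=10):
    """Markerless recognition: detect keypoints, generate descriptors, match them,
    and estimate the object pose (here a planar homography) from the matches."""
    ref = cv2.imread(reference_path, cv2.IMREAD_GRAYSCALE)      # object to recognize
    frame = cv2.imread(frame_path, cv2.IMREAD_GRAYSCALE)        # current camera frame
    if ref is None or frame is None:
        raise FileNotFoundError("could not load the input images")

    orb = cv2.ORB_create(nfeatures=1000)                        # keypoint detection + descriptors
    kp_ref, des_ref = orb.detectAndCompute(ref, None)
    kp_frm, des_frm = orb.detectAndCompute(frame, None)

    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)  # keypoint matching
    matches = sorted(matcher.match(des_ref, des_frm), key=lambda m: m.distance)
    if len(matches) < min_matches:
        return None                                             # object not found in this frame

    src = np.float32([kp_ref[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp_frm[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)        # pose estimation (planar object)
    return H                                                    # maps reference corners into the frame

if __name__ == "__main__":
    # "reference.png" and "frame.png" are placeholder file names for this sketch.
    H = recognize_object("reference.png", "frame.png")
    print("object found" if H is not None else "object not found")
```

In an AR system the returned homography (or a full 6-DoF pose) is what the rendering stage uses to anchor the overlay on the recognized object.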