AI Acceleration – Image Processing As a Use Case
Total Page:16
File Type:pdf, Size:1020Kb
AI acceleration – image processing as a use case Masoud Daneshtalab Associate Professor at MDH www.idt.mdh.se/~md/ Outline • Dark Silicon • Heterogeneouse Computing • Approximation – Deep Neural Networks 2 The glory of Moore’s law Intel 4004 Intel Core i7 980X 2300 transistors 1.17B transistors 740 kHz clock 3.33 GHz clock 10um process 32nm process 10.8 usec/inst 73.4 psec/inst Every 2 Years • Double the number of transistors • Build higher performance general-purpose processors ⁻ Make the transistors available to masses ⁻ Increase performance (1.8×↑) ⁻ Lower the cost of computing (1.8×↓) Semiconductor trends ITRS roadmap for SoC Design Complexity Trens ITRS: International Technology Roadmap for Semiconductors Expected number of processing elements into a System-on-Chip (SoC). 4 Semiconductor trends ITRS roadmap for SoC Design Complexity Trens 5 Semiconductor trends Network-on-Chip (NoC) –based Multi/Many-core Systems Founder: Andreas Olofsson Sponsored by Ericsson AB 6 Dark Silicon Era The catch is powering exponentially increasing number of transistors without melting the chip down. 10 10 Chip Transistor Count 2,200,000,000 109 Chip Power 108 107 106 105 2300 104 103 130 W 102 101 0.5 W 100 1970 1975 1980 1985 1990 1995 2000 2005 2010 2015 If you cannot power them, why bother making them? 7 Dark Silicon Era The catch is powering exponentially increasing number of transistors without melting the chip down. 10 10 Chip Transistor Count 2,200,000,000 109 Chip Power 108 107 Dark Silicon 106 105 (Utilization Wall) 2300 104 Fraction of transistors that need to be 103 powered off at all times due to power 130 W 2 constraints. 10 101 0.5 W 100 Lower utilization, Lower performane 1970 1975 1980 1985 1990 1995 2000 2005 2010 2015 If you cannot power them, why bother making them? 8 Evaluation of processors 9 Even multicores could not help! [Esmaeilzadeh, Blem, St. Amant, Sankaralingam, Burger, ISCA 2011] 10 End to Moore’s law? Source: Shekhar Borkar, Intel Corporation Multicores are likely to be a stopgap Not likely to continue the historical trends Do not overcome the transistor scaling trends The performance gap is significantly large 11 End to Moore’s law? Source: Shekhar Borkar, Intel Corporation • HW/SW specialization and heterogeneity • Approximate computing • New emerging technologies (under development) 12 Where we are now and what’s the trend? ITRS roadmap for SoC Portable Design Complexity Trens ITRS: International Technology Roadmap for Semiconductors SoC Architecture Template. The SOC embodies a highly parallel architecture consisting of • Main processors (grow slowly) • PEs (Processing Engines) (grow faster) This architecture template enables both high . customized processor processing performance and low power . Accelerators (function). consumption by virtue of parallel processing • Peripherals and hardware realization of specific functions. • Memories (proportional to #PEs) • Die size of 49mm2 gradually decreases to 44mm2 13 Example: OMAP (Open Multimedia Applications Platform) 14 Paradigm Shift from Homogeneity to Heterogeneity Heterogenous (Zynq/HSA-like) HW/SW SoC platform Programmable Logic Application Processing Unit GPU ARM Mali ARM Cortex A53 ARM Cortex A53 64KB L2 Cache ARM Cortex A53 ARM Cortex A53 Security 4x32KB I-Chache 4x32KB D-Chache AES SCU DMA Trust Zone 1MB L2 Cache GIC MMU V/T Monitring Real-Time Processing Unit DMA ARM Cortex R5 ARM Cortex R5 128KB TCM GIC 2x32KB I-Chache 2x32KB I-Chache MMU 15 Heterogeneous Computing Programmable Logic Application Processing Unit GPU ARM Mali Transforming von Neumann to heterogeneity ARM Cortex A53 ARM Cortex A53 64KB L2 Cache ARM Cortex A53 ARM Cortex A53 Security 4x32KB I-Chache 4x32KB D-Chache AES SCU DMA Trust Zone 1MB L2 Cache GIC MMU V/T Monitring Real-Time Processing Unit DMA ARM Cortex R5 ARM Cortex R5 128KB TCM GIC 2x32KB I-Chache 2x32KB I-Chache MMU Find critical part of Compile the program Execute on a fast specialized program component & specilized SW/HW processing engine (PE) 16 Approximate computing • Relax the abstraction of near-perfect accuracy in general-purpose computing • Allow errors to happen in the computation Run faster Power efficient • This sounds a bit crazy but a large body of important applications having some amount of errors in the computation, entirely acceptable! Computer vision, multimedia, stream applications Large-scale machine learning Bioinformatics Mining big data Speech and AI 17 Approximate computing (cont.) Embracing error: 18 Approximate computing (cont.) Embracing error: 0% 100% 2% 60% Errors Energy Errors Energy 20% 45% 70% 20% Errors Energy Errors Energy 19 Approximate computing (cont.) Transforming von Neumann to Neural Networks Speed: ~4×↑, Energy: ~10×↓, Quality: 5%↓) [Esmaeilzadeh, and Burger, MICRO 2012] 20 Approximate computing (cont.) Zynq/HSA-like HW/SW SoC platform Programmable Logic Application Processing Unit ARM Cortex A53 ARM Cortex A53 ARM Cortex A53 ARM Cortex A53 4x32KB I-Chache 4x32KB D-Chache SCU DMA 1MB L2 Cache GIC MMU Neuromorphic Application 21 Why Deep Learning (Neuromorphic)? # of papers Neuromorphic computing scales well # of citations increasing complexity of problems. Intel: Nervana and Movidius Google: Tensor Nvidia: Jetson TX1 and TX2 specialized for DNN Microsoft: BrainWave Qualcomm: Zeroth Processors, extending with NVM IBM: TrueNorth Biological and silicon neurons have much better power and Auviz/Xilinx: CNN accelarator space efficiencies than digital computers [MIT & Intel] 22 Computing Platform Potential applications: Bionix Programmable Logic Application Processing Unit GPU ARM Mali ARM Cortex A53 ARM Cortex A53 64KB L2 Cache ARM Cortex A53 ARM Cortex A53 Security 4x32KB I-Chache 4x32KB D-Chache AES SCU DMA Trust Zone 1MB L2 Cache GIC MMU V/T Monitring Real-Time Processing Unit DMA ARM Cortex R5 ARM Cortex R5 128KB TCM GIC Autonomous 2x32KB I-Chache 2x32KB I-Chache MMU UAV/vehicle/robot WSN/IoT/Wearables Wearable System DSP + Memory Preprocessing Mainprocessing Postprocessing ADC Communication ADC (Filtering & (Feature extraction & (Compression & ADC (TX/RX) Segmentation) Classification) Encryption) 23 24 Autonomous vehicles & Technology Used • Vision Camera: – Cameras are the only sensor technology that can capture texture, color and contrast information and the high level of detail captured by cameras allow them to be the leading technology for classification. • make camera sensors indispensable for autonomous systems. – The camera sensor technology play a very large role in autonomous vehicle. Camera will becomes domnant U.S. collision avoidance sensors market, by technology, 2014 - 2025 (USD Million) Some examples of camera in the Advanced driver- assistance systems (ADAS) application: • Adaptive Cruise Control (ACC) • Automatic High Beam Control (AHBC) • Traffic Sign Recognition (TSR) • Lane Keep Systems (LKS) 25 * Real-time data will grow at 1.5 times the rate of Heterogeneous Era overall data creation Heterogenous embedded platform: • We need high-performance embedded computing machines to deal with huge amount of heterogeneous sensor inputs while executing multiple algorithms for autonomy! • Examples: Perception: • (parallel) Heterogenenous Computing – 3-D imaging with multiple lasers (LIDAR). – Edge-Detection Algorithm • Deep Learning – Motion-Detection algorithm – Tracking algorithm 26 Heterogeneous Era (cont.) - HW&SW Integration - Based on legacy code - Map the legacy code to heterogeneous architgectures Code generation Find a critical Compile the program Execute on the HW logic from modeling program component & generate HW logics 27 Heterogeneous Era (cont.) - HW&SW Integration - Based on legacy code - Generting and optimizing deep neural nerwork Code generation Find a critical Compile the program Execute on the smart and from modeling program component , generate & train a NN faster logic 28 Heterogeneous Era (cont.) DeepMaker: Deep Learning Accelerator on Commercial Programmable Devices Frontend Layer Backend Layer DNN Builder Instrumented Source code via multi-objective optimization CPU binary Hardware Training Sets (accuracy, energy, etc.) Multi-core CPU Library constraints GPU Code generator Mapping DNN_send (a); c = filter (a); DNN_rec (&c); FPGA System Hand-optimized constraints HW&SW Templates Heterogeneous System Deep Neural Network Accelerators (DeNNA): (c) Core Micro-Architecture FPGA Device Bufferi Input data Communication and Memory Interface Scheduler Input from Corei+1 Output to Streaming Streaming Streaming Corei-1 scratchpad 0 scratchpad 1 scratchpad m ALU DNN weights Accumulator Ctrl Unit Cluster00 Cluster01 Cluster0m FIFO Core0 Corei output 0 0 Cluster10 Cluster11 Cluster1m Core Unit Buffer ) Activation Activation sigmoid ( Pooling 1 - ) Normalization 1 - n x ( n f Lookup Core Buffer Clustern0 Clustern1 Clusternm Weight buffer Cluster10 (b) Cluster Micro-Architecture (a) DeNNA Architecture DeepMaker • Mohammad will continue with some of the latest results • How we generate optimal models for deep networks • Some results for image processing with industrial dataset 29 DeepMaker Framework Mohammad Loni, Masoud Daneshtalab {mohammad.loni, masoud.daneshtalab}@mdh.se School of Innovation, Design and Engineering Mälardalen University, Sweden 12 Dec. 2018, Västerås Agenda ● Convolutional Neural Networks (CNNs) ● Processing Challenges of CNNs ● DeepMaker Framework ● Classification/Implementation Results ● Conclusion ● References 2 CNN ● CNN is composed of multiple layers running in sequence, where input data is fed to the first layer and output is