The "PUMA" Microarchitecture (PUMA = Programmable Ultra-Efficient Memristor-Based Accelerator)
Accelerating Machine Learning and Neural Networks: Software for Hybrid Memristor-Based Computing
Dejan Milojicic, Distinguished Technologist, Hewlett Packard Labs
Presentation to IEEE-CNSV, May 14, 2019
Co-authors from Hewlett Packard Labs, Penn State, Purdue, UIUC, USP, GaTech, ETH, and others
Materials from IEEE ICRC 2016-2018, IEEE ICDCS 2018, ACM ASPLOS 2016 and 2019, and more

Computing in Memory Revisited: System Software Stack for the Dot Product Engine
Potential Approaches vs. Disruption in the Computing Stack
[Figure: levels of the computing stack (algorithm, language, API, architecture, ISA, microarchitecture, functional unit, hidden logic changes, device) mapped against four levels of disruption, from "More Moore" (no disruption) through architectural changes and Memory-Driven Computing to non-von Neumann computing (total disruption). Source: Tom Conte]

Generalize or Die: ISA and Compiler for Memristor Accelerators
Aayush Ankit, Pedro Bruel, Sai Rahul Chalamalasetti, Izzat El Hajj, Cat Graves, Dejan Milojicic, Geoffrey Ndu, John Paul Strachan. ACM ASPLOS 2019, IEEE ICRC 2018.

Outline
– Introduction to the Dot Product Engine (DPE) memristor accelerator
– Architecture and laboratory demonstrations
– New Instruction Set Architecture (ISA) to facilitate development
– Microarchitecture
– Compiler
– Performance benchmarks
– Business opportunity and next steps
– Summary

Intelligent Edge: Opportunities for Machine Learning / AI
[Chart: Total Addressable Market for edge accelerators ($Billions), growing from about $1.2B in 2020 to $7.2B in 2025]
• Machine learning is part of real products, real revenue, and real growth: over $30 billion is invested annually, reflecting software and hardware investments by startups, established players (Google, Apple, Microsoft, Nvidia, etc.), and countries (China)
• HPE Labs has a significant R&D history and IP in machine learning and cognitive systems; current and past programs are supported by the US government
• Task-specific accelerators are a growing opportunity to enable new applications at the edge

Dot Product Engine (DPE)
• Performs vector-matrix multiplication (VMM) in a single cycle
• Memristors are nanoscale, non-volatile memory cells
• High data density: each memristor can store 1-8 bits
• Enables efficient multiplication in the analog domain: applying an input voltage vector V to a crossbar with conductance matrix G produces output currents I_j = Σ_i G_ij · V_i
• The key advantage is in-memory processing
• Many pathways for future scaling: array size, memristor levels, lower current, 3D layers, DAC/ADC

Ali Shafiee, Anirban Nag, Naveen Muralimanohar, Rajeev Balasubramonian, John Paul Strachan, Miao Hu, R. Stanley Williams, and Vivek Srikumar. 2016. ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars. In Proceedings of ACM ISCA 2016.
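To make the analog VMM concrete, here is a minimal numerical sketch (not HPE code) of a crossbar multiply with quantized cells. The 4-bit cell and the signed-conductance mapping are illustrative assumptions; real designs handle signs and higher precision digitally, e.g. with paired columns and shift-and-add.

```python
import numpy as np

def crossbar_vmm(weights, inputs, bits_per_cell=4):
    """One analog vector-matrix multiply on a simulated memristor crossbar.

    Weights are quantized onto the cell's discrete conductance levels
    (a hypothetical 4-bit cell here, i.e. 16 levels), inputs play the role
    of the applied voltages, and each output current is I_j = sum_i G_ij * V_i.
    Signs are kept on the conductances for simplicity.
    """
    levels = 2 ** bits_per_cell
    w_max = np.abs(weights).max()
    # Snap each weight to the nearest of the cell's conductance levels.
    g = np.round(weights / w_max * (levels - 1)) / (levels - 1) * w_max
    return inputs @ g  # Kirchhoff's current law sums each column

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64))  # one 64x64 array, as in the lab prototype
v = rng.standard_normal(64)
print("max quantization error:", np.abs(v @ w - crossbar_vmm(w, v)).max())
```

Running the sketch shows the error introduced by finite conductance levels, which is why the per-cell precision demonstrated below (>7 bits) matters.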
ISAAC Chip Organization
[Figure: a chip (node) is a grid of tiles connected by an on-chip network to an external I/O interface; each tile contains an eDRAM buffer, output register, shift-and-add, max-pool, and sigmoid units, plus In-Situ Multiply Accumulate (IMA) units built from memristor crossbars with DACs, ADCs, sample-and-hold circuits, and input registers]
Legend: IR – Input Register; OR – Output Register; MP – Max Pool Unit; S+A – Shift and Add; s – Sigmoid Unit; XB – Memristor Crossbar; S+H – Sample and Hold; DAC – Digital-to-Analog Converter; ADC – Analog-to-Digital Converter

Dot Product Engine: Lab Prototype
[Chip image: 128x64, 64x64, 32x32, and 16x16 memristor arrays; each grayscale pixel is an analog memristor]
• Experimental demonstrations of analog vector-matrix multiplication (VMM) in a memristor chip
• More than 7 bits per memristor cell
• Custom control circuitry for in-memory computing
• SDL designing and taping out an integrated memristor-control system

Dot Product Engine – Signal Processing
Dot Product Engine – Single-Layer Neural Network (MNIST)
Successful neural network inference with memristor-based analog computing.

Challenges: Generalize or Die
– Need to broaden the application space for the DPE
– Need to reduce the software development burden for customers
Solutions
– Developed an Instruction Set Architecture (ISA) supporting state-of-the-art neural network types
– Developed a high-performance microarchitecture implementing the ISA
– Developed a compiler to add programming support and increased functionality
– Benchmarked the performance of the resulting end-to-end solution
The ISA: Load, Store, Copy, Set, MVM, ALU (add, sub, mult, div, shift, logical, ReLU, etc.), JMP, BEQ, Halt

Implementing the ISA: the "PUMA" Microarchitecture (PUMA = Programmable Ultra-Efficient Memristor-Based Accelerator)
• Each core contains multiple memristor crossbars performing matrix-vector multiplications; cores also support simpler digital operations through logical, scalar, and vector units, with a three-stage pipeline, an instruction decoder, and an instruction memory
• Within each "tile" are multiple cores plus a shared memory for holding data to/from other tiles and to/from the cores
• Each "node" consists of multiple tiles with an on-chip network, implementing the NN layers
• Cores perform all scalar, vector, and matrix operations
Aayush Ankit, Izzat El Hajj, Sai Rahul Chalamalasetti, Geoffrey Ndu, Martin Foltin, R. Stanley Williams, Paolo Faraboschi, Wen-mei W. Hwu, John Paul Strachan, Kaushik Roy, Dejan S. Milojicic. In Proceedings of the Twenty-Fourth ASPLOS, 2019.

Minimal area and power overhead:

Architecture | Applications Supported             | Tile Area (mm²)      | Tile Power (mW)
ISAAC        | CNN                                | 0.372                | 330
PUMA         | CNN, RNN, LSTM, BM, RBM, MLP, etc. | 0.491 (1.32x larger) | 377.9 (1.14x larger)

The new ISA and microarchitecture in PUMA provide much broader application support at a small area and power cost.
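Before turning to the compiler, a toy interpreter makes the ISA's instruction classes concrete. The operand formats, register/memory sizes, and ALU subset below are illustrative assumptions, not the real PUMA encoding.

```python
import numpy as np

def run(program, xbar_weights, dmem_size=256, nreg=32):
    """Execute a list of (opcode, operands...) tuples on a toy PUMA-like core."""
    reg = [0.0] * nreg          # per-core register file (assumed size)
    mem = [0.0] * dmem_size     # core-local data memory (assumed size)
    pc = 0
    while True:
        op, *args = program[pc]
        pc += 1
        if op == "set":                      # set rd, imm
            rd, imm = args; reg[rd] = imm
        elif op == "load":                   # load rd, addr
            rd, addr = args; reg[rd] = mem[addr]
        elif op == "store":                  # store rs, addr
            rs, addr = args; mem[addr] = reg[rs]
        elif op == "copy":                   # copy rd, rs
            rd, rs = args; reg[rd] = reg[rs]
        elif op == "mvm":                    # mvm dst, src, n: one crossbar VMM
            dst, src, n = args               # over n values in local memory
            mem[dst:dst + n] = (np.array(mem[src:src + n]) @ xbar_weights).tolist()
        elif op == "alu":                    # alu fn, rd, rs, rt
            fn, rd, rs, rt = args
            fns = {"add": lambda a, b: a + b, "sub": lambda a, b: a - b,
                   "mult": lambda a, b: a * b, "relu": lambda a, _: max(a, 0.0)}
            reg[rd] = fns[fn](reg[rs], reg[rt])
        elif op == "beq":                    # beq rs, rt, target
            rs, rt, target = args
            if reg[rs] == reg[rt]:
                pc = target
        elif op == "jmp":                    # jmp target
            pc = args[0]
        elif op == "halt":
            return mem

# A 2x2 "crossbar" holding twice the identity: y = x @ (2I)
w = np.eye(2) * 2.0
prog = [("set", 1, 3.0), ("store", 1, 0), ("store", 1, 1),
        ("mvm", 4, 0, 2), ("halt",)]
print(run(prog, w)[4:6])                     # -> [6.0, 6.0]
```

The point of the MVM instruction is that a whole matrix-vector product is one opcode, while the scalar/vector/control instructions stitch such products into full networks.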
The PUMA Compiler
– As applications grow in size and inter-core/inter-tile dependencies become more complex, a compiler becomes mandatory to maintain programmer productivity
– We designed and implemented a complete compiler that translates high-level programs to PUMA assembly
– The compiler covers: the programming interface, graph partitioning, data communication, tile placement, MVM coalescing, linearization, and memory allocation (a toy sketch of the mapping flow appears after the instruction-usage chart below)

Memristor-Based DPE Programming
Compilation flow: map dot products to crossbars → map non-dot-product operations to tiles → virtual-to-physical tile assignment → build task graph → host code generation
[Figure: graph partitioning and data communication across tiles]

Instruction Usage (importance of different execution units in the PUMA core)
[Chart: per-benchmark instruction mix across inter-tile data transfer, inter-core data transfer, control flow, scalar functional unit, vector functional unit, and the MVM (crossbar) unit, for RBM (V500-H500), BM (V500-H500), RNN (26-93-61), LSTM (26-120-61), MLP (64-150-150-14), and CNN (LeNet-5)]
All ISA components play an important role across various neural networks.
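As a concrete illustration of the mapping flow above, here is a toy sketch of the first two steps: MVM operations are tiled onto fixed-size crossbars and the remaining operations go to tile/core functional units. The crossbar size, the greedy tile assignment, and the layer format are illustrative assumptions, not the actual PUMA compiler algorithms.

```python
XBAR_DIM = 128  # assumed crossbar dimension

def partition(layers):
    """layers: list of (name, kind, rows, cols) tuples.
    Returns (crossbar weight blocks, non-MVM ops)."""
    xbar_ops, other_ops = [], []
    for name, kind, rows, cols in layers:
        if kind == "mvm":
            # Tile large weight matrices into XBAR_DIM x XBAR_DIM blocks;
            # shift-and-add logic would recombine the partial results.
            for r in range(0, rows, XBAR_DIM):
                for c in range(0, cols, XBAR_DIM):
                    xbar_ops.append((name, r // XBAR_DIM, c // XBAR_DIM))
        else:
            other_ops.append(name)  # activations, pooling, elementwise ops
    return xbar_ops, other_ops

def assign_tiles(xbar_ops, xbars_per_tile=8):
    """Greedy virtual-to-physical assignment: fill tiles in graph order."""
    return {op: i // xbars_per_tile for i, op in enumerate(xbar_ops)}

# The MLP (64-150-150-14) from the instruction-usage chart above:
mlp = [("fc1", "mvm", 64, 150), ("relu1", "act", 0, 0),
       ("fc2", "mvm", 150, 150), ("relu2", "act", 0, 0),
       ("fc3", "mvm", 150, 14)]
xbar_ops, other_ops = partition(mlp)
print(len(xbar_ops), "crossbar blocks;", other_ops)
print(assign_tiles(xbar_ops))
```

The real compiler additionally coalesces MVMs, places tiles to minimize data communication, linearizes the task graph, and allocates memory.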
Software Stack
[Diagram: applications run on neural-net frameworks (TensorFlow, CNTK, Caffe2, ...), whose models are converted through ONNX (onnx-tf, onnx-cntk, onnx-caffe, ..., onnx-dpe) into the DPE framework: ONNX interpreter, application optimization, compiler & assembler, driver, and emulator, running on DPE hardware (memristor-based DPE, FPGA, ASIC-1, ASIC-2, ...)]
DPE-SW components:
• ONNX interpreter: converts an ONNX model into input for our compiler/assembler and eventually the hardware
• Application optimization: improves the performance of deployed models (quantization, weights, etc.)
• Compiler and assembler: convert models into executables for our own hardware
• Driver: operating-system support for the hardware running the DPE, e.g. for multiple PCIe boards
• Emulator: used for hardware verification (Brazil) and, via QEMU, as a development platform for software
Summary:
• Develop only the required DPE software components for the user and developer community; leverage the rest
• Focus on inference (hardware, performance, robustness)
• Extensive collaborations to support use cases and demos, while also driving future support, e.g. training

Inference Latency Normalized to PUMA (lower is better)
[Chart: normalized inference latency, log scale, for Intel processors (Haswell, Skylake), NVIDIA accelerators (Kepler, Maxwell, Pascal), and PUMA, across MLP (L4, L5), NMT (L3, L5), BigLSTM, LSTM-2048, WLM, and CNN (Vgg16, Vgg19) workloads]
Our solution is better than Intel processors (10x-10^4x) and NVIDIA accelerators (10x-10^2x), except for one neural network.

Inference Energy Normalized to PUMA (lower is better)
[Chart: normalized inference energy, log scale, for the same platforms and workloads]
Our solution is substantially better than both Intel processors (10^3x-10^6x) and NVIDIA accelerators (10x-10^3x).

Batch Throughput Normalized to Haswell (higher is better)
[Chart: normalized batch throughput, log scale, for Haswell, Skylake, Kepler, Maxwell, Pascal, and PUMA at batch sizes B16-B128, across MLP (L4, L5), NMT (L3, L5), BigLSTM, and LSTM-2048 workloads]
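For readers reconstructing these charts from raw measurements, a minimal sketch of the normalization itself (the values below are arbitrary placeholders, not the measured results from the slides):

```python
def normalize(values, baseline):
    """Normalize per-platform measurements to a chosen baseline platform."""
    base = values[baseline]
    return {platform: v / base for platform, v in values.items()}

# Placeholder latencies in arbitrary units; after normalizing to PUMA,
# lower values are better, and PUMA itself sits at 1.0.
latency = {"Haswell": 120.0, "Pascal": 9.0, "PUMA": 0.8}
print(normalize(latency, baseline="PUMA"))
```

The latency and energy charts use PUMA as the baseline (lower is better), while the throughput chart uses Haswell (higher is better).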