PUMA: A Programmable Ultra-efficient Memristor-based Accelerator for Machine Learning Inference
Aayush Ankit (Purdue University, Hewlett Packard Enterprise), Izzat El Hajj∗ (American University of Beirut), Sai Rahul Chalamalasetti (Hewlett Packard Enterprise), Geoffrey Ndu (Hewlett Packard Enterprise), Martin Foltin (Hewlett Packard Enterprise), R. Stanley Williams (Hewlett Packard Enterprise), Paolo Faraboschi (Hewlett Packard Enterprise), Wen-mei Hwu (University of Illinois at Urbana-Champaign), John Paul Strachan (Hewlett Packard Enterprise), Kaushik Roy (Purdue University), Dejan S Milojicic (Hewlett Packard Enterprise)

∗Work done while at University of Illinois at Urbana-Champaign.

Abstract

Memristor crossbars are circuits capable of performing analog matrix-vector multiplications, overcoming the fundamental energy efficiency limitations of digital logic. They have been shown to be effective in special-purpose accelerators for a limited set of neural network applications.

We present the Programmable Ultra-efficient Memristor-based Accelerator (PUMA), which enhances memristor crossbars with general-purpose execution units to enable the acceleration of a wide variety of Machine Learning (ML) inference workloads. PUMA's microarchitecture techniques, exposed through a specialized Instruction Set Architecture (ISA), retain the efficiency of in-memory computing and analog circuitry without compromising programmability.

We also present the PUMA compiler, which translates high-level code to the PUMA ISA. The compiler partitions the computational graph and optimizes instruction scheduling and register allocation to generate code for large and complex workloads to run on thousands of spatial cores.

We have developed a detailed architecture simulator that incorporates the functionality, timing, and power models of PUMA's components to evaluate performance and energy consumption. A PUMA accelerator running at 1 GHz can reach area and power efficiency of 577 GOPS/s/mm² and 837 GOPS/s/W, respectively. Our evaluation of diverse ML applications from image recognition, machine translation, and language modelling (5M-800M synapses) shows that PUMA achieves up to 2,446× energy and 66× latency improvement for inference compared to state-of-the-art GPUs. Compared to an application-specific memristor-based accelerator, PUMA incurs small energy overheads at similar inference latency and added programmability.

Keywords: memristors, accelerators, machine learning, neural networks

ACM Reference Format:
Aayush Ankit, Izzat El Hajj, Sai Rahul Chalamalasetti, Geoffrey Ndu, Martin Foltin, R. Stanley Williams, Paolo Faraboschi, Wen-mei Hwu, John Paul Strachan, Kaushik Roy, and Dejan S Milojicic. 2019. PUMA: A Programmable Ultra-efficient Memristor-based Accelerator for Machine Learning Inference. In 2019 Architectural Support for Programming Languages and Operating Systems (ASPLOS '19), April 13–17, 2019, Providence, RI, USA. ACM, New York, NY, USA, 17 pages. https://doi.org/10.1145/3297858.3304049

1 Introduction

General-purpose computing systems have benefited from scaling for several decades but are now hitting an energy wall. This trend has led to a growing interest in domain-specific architectures. Machine Learning (ML) workloads in particular have received tremendous attention because of their pervasiveness in many application domains and their high performance demands. Several architectures have been proposed, both digital [20, 21, 36, 61, 72, 89] and mixed digital-analog using memristor crossbars [22, 23, 73, 95, 100].

ML workloads tend to be data-intensive and perform a large number of Matrix Vector Multiplication (MVM) operations. Their execution on digital CMOS hardware is typically characterized by high data movement costs relative to compute [49]. To overcome this limitation, memristor crossbars can store a matrix with high storage density and perform MVM operations with very low energy and latency [5, 13, 52, 87, 98, 116]. Each crosspoint in the crossbar stores a multi-bit value in one memristor device, which enables high storage density [112]. Upon applying an input voltage at the crossbar's rows, we get the MVM result as output current at the crossbar's columns based on Kirchhoff's law. A crossbar thus performs MVM in one computational step – including O(n²) multiplications and additions for an n × n matrix – which typically takes many steps in digital logic. It also combines compute and storage in a single device to alleviate data movement, thereby providing intrinsic suitability for data-intensive workloads [23, 95].
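To make the analog MVM concrete, the following minimal NumPy sketch (our illustration, not part of the paper's artifact) models a crossbar functionally: programmed conductances hold the matrix, row voltages carry the input vector, and the summed column currents are the output, so all O(n²) multiply-accumulates complete in a single step.

import numpy as np

# Functional sketch of an analog crossbar MVM (illustration only; it
# ignores device non-idealities, ADC/DAC conversion, and bit-slicing).
# Each crosspoint stores one matrix element as a conductance G[i, j].
# Driving the rows with voltages V yields, by Kirchhoff's current law,
# column currents I[j] = sum_i V[i] * G[i, j] -- the MVM in one step.
def crossbar_mvm(G, V):
    return V @ G  # column currents: I[j] = sum_i V[i] * G[i, j]

G = np.random.rand(128, 128)  # programmed conductances (constant weights)
V = np.random.rand(128)       # input voltages (variable activations)
I = crossbar_mvm(G, V)        # output currents: the analog MVM result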
Memristor crossbars have been used to build special-purpose accelerators for Convolutional Neural Networks (CNN) and Multi-Layer Perceptrons (MLP) [23, 73, 95], but these designs lack several important features for supporting general ML workloads. First, each design supports one or two types of neural networks, where layers are encoded as state machines. This approach is not scalable to a larger variety of workloads due to increased decoding overhead and complexity. Second, existing accelerators lack the types of operations needed by general ML workloads. For example, Long Short-Term Memory (LSTM) [51] workloads require multiple vector linear and transcendental functions which cannot be executed on crossbars efficiently and are not supported by existing designs. Third, existing designs do not provide flexible data movement and control operations to capture the variety of access and reuse patterns in different workloads. Since crossbars have high write latency [54], they typically store constant data while variable inputs are routed between them in a spatial architecture. This data movement can amount to a significant portion of the total energy consumption, which calls for flexible operations to optimize the data movement.

To address these limitations, we present PUMA, a Programmable Ultra-efficient Memristor-based Accelerator. PUMA is a spatial architecture designed to preserve the storage density of memristor crossbars to enable mapping ML applications using on-chip memory only. It supplements crossbars with an instruction execution pipeline and a specialized ISA that enables compact representation of ML workloads with low decoder complexity. It employs temporal SIMD units and a ROM-Embedded RAM [69] for area-efficient linear and transcendental vector computations. It includes a microarchitecture, ISA, and compiler co-designed to optimize data movement.

A naïve approach to generality is not viable because of the huge disparity in compute and storage density between the two technologies. CMOS digital logic has an order of magnitude higher area requirement than a crossbar for equal output width (∼20×, see Table 3). Moreover, a crossbar's storage density (2-bit cells) is 160 MB/mm², which is at least an order of magnitude higher than SRAM (6T, 1-bit cell) [95]. A 90 mm² PUMA node can store ML models with up to 69 MB of weight data. Note that the PUMA microarchitecture, ISA, and compiler are equally suitable to crossbars made from emerging technologies other than memristors, such as STT-MRAM [94], NOR Flash [46], etc.

We make the following contributions:
• A programmable and highly efficient architecture, exposed by a specialized ISA, for scalable acceleration of a wide variety of ML applications using memristor crossbars.
• A complete compiler which translates high-level code to the PUMA ISA, enabling the execution of complex workloads on thousands of spatial cores.
• A detailed simulator which incorporates functionality, timing, and power models of the architecture.
• An evaluation across ML workloads showing that PUMA can achieve promising performance and energy efficiency compared to state-of-the-art CPUs, GPUs, TPU, and application-specific memristor-based accelerators.
Our simulator and compiler have been open-sourced to enable further research on this class of accelerators.

2 Workload Characterization

This section characterizes different ML inference workloads with a batch size of one. The characteristics are summarized in Table 1. The section's objective is to provide insights on the suitability of memristor crossbars for accelerating ML workloads and to highlight the implications for the proposed architecture.

2.1 Multi-Layer Perceptron (MLP)

MLPs are neural networks used in common classification tasks such as digit recognition, web search, etc. [26, 61]. Each layer is fully-connected and applies a nonlinear function to the weighted sum of outputs from the previous layer. The weighted sum is essentially an MVM operation. Equation 1 shows the computations in a typical MLP (act is a nonlinear function):

O[y] = \mathrm{act}\left( B[y] + \sum_{x=0}^{n-1} I[x] \times W[x][y] \right)    (1)
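For a concrete reading of Equation 1, here is a minimal NumPy sketch of one fully-connected layer (our illustration; ReLU stands in for the unspecified nonlinearity act):

import numpy as np

# One MLP layer per Equation 1: O[y] = act(B[y] + sum_x I[x] * W[x][y]).
# The inner sum is exactly the MVM that a memristor crossbar performs in
# a single analog step; 'act' is whatever nonlinearity the network uses.
def mlp_layer(I, W, B):
    act = lambda v: np.maximum(v, 0.0)  # ReLU, chosen for illustration
    return act(B + I @ W)               # weighted sum (MVM) + bias, then act

I = np.random.rand(128)      # previous layer's outputs, I[x]
W = np.random.rand(128, 64)  # weight matrix, W[x][y]
B = np.random.rand(64)       # bias vector, B[y]
O = mlp_layer(I, W, B)       # this layer's outputs, O[y]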
MLPs are simple, capturing the features common across the ML workloads we discuss: dominance of MVM operations, high