Exploring the Programmability for Deep Learning Processors: from Architecture to Tensorization

Chixiao Chen, Huwan Peng, Xindi Liu, Hongwei Ding and C.-J. Richard Shi
Department of Electrical Engineering, University of Washington, Seattle, WA, 98195
{cxchen2,hwpeng,xindil,cjshi}@uw.edu

ABSTRACT
This paper presents an instruction and Fabric Programmable Neuron Array (iFPNA) architecture, its 28nm CMOS chip prototype, and a compiler for the acceleration of a variety of deep learning neural networks (DNNs) including convolutional neural networks (CNNs), recurrent neural networks (RNNs), and fully connected (FC) networks on chip. The iFPNA architecture combines instruction-level programmability as in an Instruction Set Architecture (ISA) with logic-level reconfigurability as in a Field-Programmable Gate Array (FPGA) in a sliced structure for scalability. Four data flow models, namely weight stationary, input stationary, row stationary, and tunnel stationary, are described as the abstraction of various DNN data and computational dependence. The iFPNA compiler partitions a large-size DNN into smaller networks, each being mapped to, optimized, and code generated for the underlying iFPNA processor using one or a mixture of the four data-flow models. Experimental results have shown that state-of-the-art large-size CNNs, RNNs, and FC networks can be mapped to the iFPNA processor achieving near-ASIC performance.

CCS CONCEPTS
• Computer systems organization → Neural networks; Data flow architectures; • Software and its engineering → Compilers;

KEYWORDS
Deep Learning Processor, Neural Network, Data Flow, Domain-Specific Instruction Set, Tensorization

1 INTRODUCTION
Recent success of artificial intelligence (AI), in particular deep learning algorithms, in various applications, especially in understanding images, videos, natural languages, and human intentions, has driven intensified interest in developing custom-silicon deep learning processors (DLPs). Existing work includes dedicated accelerators for convolutional neural networks (CNNs) [1, 2] and recurrent neural networks (RNNs) [3]. Most progress has been made on the efficient implementation of computing primitives beyond scalar and vector operations, such as matrix-vector and matrix-matrix operations, which are referred to as tensor primitives.

Custom tensorization explores problem-specific computing data flows for maximizing energy efficiency and/or throughput. For example, Eyeriss proposed a row-stationary (RS) data flow to reduce memory access by exploiting feature map reuse [1]. However, row stationary is only effective for convolutional layers with small strides, and shows poor performance on the AlexNet CNN CONV1 layer and on RNNs. Systolic arrays, on the other hand, feature both efficiency and high throughput for matrix-matrix computation [4], but they take less advantage of convolutional reuse and thus require more data transfer bandwidth.

Fixed data flow schemes in deep learning processors limit their coverage of advanced algorithms. A flexible data flow engine is desired. Normally, instruction set architectures (ISAs) are utilized to improve such flexibility. The first deep learning specific ISA was proposed in [5] with both vector and matrix instructions and their execution units. A single instruction multiple data (SIMD) processor in [2] enhances instruction-level parallelism by combining multiple sub-instructions into one. However, none of these ISAs changes the tensor-level data flow through compilation.

To alleviate these issues, this paper presents an architecture that realizes flexible tensorization by a data-flow enhanced ISA. A prototype processor chip has been designed and fabricated in 28nm CMOS. A compiler has been developed to map a given deep learning network to the underlying iFPNA processor. Adaptive tensorization is proposed to achieve the best performance according to the hardware constraints. The prototyped processor and its compiler have been run successfully on a set of AI applications, and demonstrated near-ASIC performance on large-size deep learning networks including AlexNet and a 1024-cell long short-term memory (LSTM) network, a representative modern RNN.

This paper is organized as follows: Section II describes the background and related work. Section III introduces the iFPNA architecture. The four DNN data flow models and the iFPNA compiler are presented in Sections IV and V, respectively. Experimental results are described in Section VI. Section VII concludes the paper.

2 BACKGROUND AND RELATED WORK
All state-of-the-art neural networks, including CNNs, RNNs, logistic regression, and fully connected (FC) networks, follow the basic neuron expression

    ŷ = Act.Fct(W · x + b),    (1)

where ŷ denotes neuron outputs, x stands for feature/image inputs, W is a weight (also called filter) matrix, b is a biased threshold vector, and Act.Fct is a vector of non-linear activation functions. The matrix-vector product in (1) can be unfolded into a number of vector products computed in parallel in different blocks simultaneously. A single-instruction-multiple-data (SIMD) computer architecture is suitable for such a model. The block for vector product computation is usually referred to as a Processing Element (PE). Most state-of-the-art DLPs have an array of PEs.
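As a concrete illustration of how (1) unfolds into independent vector products, the following minimal NumPy sketch (not part of the iFPNA tool chain; the layer sizes are hypothetical) computes one dot product per output neuron, which is the unit of work a single PE would execute while a SIMD array evaluates many such rows in parallel.

import numpy as np

def neuron_layer(W, x, b, act_fct=np.tanh):
    """Evaluate y_hat = Act.Fct(W @ x + b), one vector product per output."""
    y = np.empty(W.shape[0])
    for i, w_row in enumerate(W):          # conceptually, one PE per row of W
        y[i] = np.dot(w_row, x) + b[i]     # vector product plus bias
    return act_fct(y)                      # element-wise activation

# Hypothetical sizes, for illustration only.
W = np.random.randn(128, 256)
x = np.random.randn(256)
b = np.zeros(128)
y_hat = neuron_layer(W, x, b)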
In general, more parallelism requires more data communication, thus higher memory access bandwidth. Data reuse in DNNs decreases off-chip memory access. Three types of reuse are involved in CNNs [6]: input reuse, weight reuse, and convolutional reuse. Among the three, the first two are also deployed for RNNs; FC networks only support input reuse. One critical criterion is how much the mapped data flow decreases repeated memory access without sacrificing PE utilization. The extreme case is that all data are input only once. Various data flows have been proposed to explore data reuse, including input stationary and row stationary [6].

Tensorization maps efficient data flows that exploit these forms of reuse and achieves the best performance under the constrained hardware. Often, a neural network requires hundreds of millions of operations, while the hardware only has tens or hundreds of PEs. Large tensors are therefore partitioned into smaller ones by the compiler and executed by the PEs in a time-multiplexed manner.
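The partitioning step can be pictured with a minimal sketch; this is an illustrative model rather than the paper's actual compiler, and the PE count and matrix sizes are assumptions. The rows of a large weight matrix are split into tiles no larger than the PE array, and the tiles are executed one after another, i.e. time-multiplexed on the same hardware.

import numpy as np

def tiled_matvec(W, x, num_pes=16):
    """Time-multiplex a large W @ x over a fixed number of PEs."""
    n_out = W.shape[0]
    y = np.empty(n_out)
    for start in range(0, n_out, num_pes):    # one round per tile of rows
        tile = W[start:start + num_pes]       # rows mapped onto the PE array
        y[start:start + num_pes] = tile @ x   # all PEs compute in parallel
    return y

# Hypothetical layer far larger than the PE array.
W = np.random.randn(1024, 512)
x = np.random.randn(512)
assert np.allclose(tiled_matvec(W, x), W @ x)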
3 INSTRUCTION AND FABRIC PROGRAMMABLE ARCHITECTURE
This work proposes iFPNA, an instruction-and-fabric programmable neuron array architecture, as shown in Fig. 1(a). It consists of a central controller with dedicated instructions to enhance data reuse, an array of programmable neuron slices supporting diverse neuron computation, and a single-instruction-multiple-data global register file connecting the controller and the slices. As the main computing PE, a neuron slice includes

Figure 1: Proposed instruction-and-fabric programmable DLP architecture. (a) Overall architecture. (b) Parameterized design table (I/O bandwidth B, weight memory (W) size, scratchpad (S) memory size, number of multipliers M). (c) Multi-chip parallelism diagram.

3.1 iFPNA Instruction Set
iFPNA adopts an instruction set architecture (ISA), rather than a finite state machine (FSM), as the controller. An FSM restricts possible upgrades for any data flow not discovered at the time of chip design. In contrast, an instruction set with sufficient flexibility is adaptive to any data flow as long as the compiler, in particular the mapping library, is updated.

In addition to common RISC instructions, three new instructions are introduced to perform data movement and computing execution. They are summarized in Table 1 and explained below.

Table 1: iFPNA Instruction Set.

Instruction (Opcode)     Operand I    Operand II   Operand III
Weight Load (WL)         Length       Src. Addr.   Dst. Addr.
Vector Collect (VC)      Src. Addr.   Dst. Reg.    Sld./Sps.
Computing Execute (EX)   Mode         Dst. Addr.

Weight Load. The iFPNA can invoke a direct memory access (DMA) operation to load weights from the off-chip memory without interrupt, referred to as weight load. Its operands include the memory access length, the starting source address, and the destination address (mapped to an address space among the slices).

Vector Collect. The SIMD register file has four 128-bit registers to store the input features and output computing results. The registers are filled by the vector collect instruction. It can load an input feature from the memory (for CNNs/FCs) or directly copy from another register (for RNNs). Note that a sliding/sparse flag is used to arrange data more efficiently. The sliding operation partially updates the register and shifts the remaining data. The sparse operation