TETRIS: Scalable and Efficient Neural Network Acceleration with 3D Memory
Mingyu Gao  Jing Pu  Xuan Yang  Mark Horowitz  Christos Kozyrakis
Stanford University
{mgao12, jingpu, xuany, horowitz, kozyraki}@stanford.edu

Abstract

The high accuracy of deep neural networks (NNs) has led to the development of NN accelerators that improve performance by two orders of magnitude. However, scaling these accelerators for higher performance with increasingly larger NNs exacerbates the cost and energy overheads of their memory systems, including the on-chip SRAM buffers and the off-chip DRAM channels.

This paper presents the hardware architecture and the software scheduling and partitioning techniques for TETRIS, a scalable NN accelerator using 3D memory. First, we show that the high throughput and low energy characteristics of 3D memory allow us to rebalance the NN accelerator design, using more area for processing elements and less area for SRAM buffers. Second, we move portions of the NN computations close to the DRAM banks to decrease bandwidth pressure and increase performance and energy efficiency. Third, we show that despite the use of small SRAM buffers, the presence of 3D memory simplifies dataflow scheduling for NN computations. We present an analytical scheduling scheme that matches the efficiency of schedules derived through exhaustive search. Finally, we develop a hybrid partitioning scheme that parallelizes the NN computations over multiple accelerators. Overall, we show that TETRIS improves performance by 4.1x and reduces energy by 1.5x over NN accelerators with conventional, low-power DRAM memory systems.

CCS Concepts  • Computer systems organization → Neural networks

Keywords  3D memory, neural networks, acceleration, dataflow scheduling, partitioning

1. Introduction

Modern computer intelligence problems, including voice recognition, object detection, and scene labeling, can be solved with unprecedented accuracy using deep neural networks (NNs) [27]. However, their high computation requirements (500K operations per pixel) and large memory footprints (up to 100s of MBytes) make NNs a challenging workload for programmable computer systems. Hence, recent research has focused on building domain-specific accelerators for NN computations, which mostly use spatial arrays of narrow-datawidth processing elements and achieve 100x improvements in performance and energy efficiency compared to general-purpose platforms [5, 7, 12, 18, 36].

Ideally, we would like to continue scaling the performance and efficiency of NN accelerators in order to achieve real-time performance for increasingly complicated problems. In general, the size of state-of-the-art NNs has been increasing over the years, in terms of both the number of layers and the size of each layer [19, 26, 41]. As a result, the memory system is quickly becoming the bottleneck for NN accelerators. To feed larger numbers of processing elements, the accelerators need to use large on-chip SRAM buffers and multiple DDRx channels, both of which are expensive in terms of power and cost (chip area or pin count).

Advances in through-silicon-via (TSV) technology have enabled 3D memory that stacks a few DRAM dies on top of a logic chip [20, 22, 44]. Compared with conventional 2D DRAM, 3D memory provides an order of magnitude higher bandwidth with up to 5x better energy efficiency by replacing the off-chip traces with hundreds of vertical connections [21].
Hence, 3D memory is an excellent option for the high throughput, low energy requirements of scalable NN accelerators. There are already proposals that place several NN accelerator arrays on the logic layer of a 3D memory stack to improve performance [25].

This paper focuses on improving the performance scalability and energy efficiency of inference tasks on large-scale NNs by going beyond naively combining 3D memory with NN accelerator arrays. At the hardware level, we take advantage of the high bandwidth and low access energy of 3D memory to rebalance the NN accelerator chip, using more area for processing elements and less area for on-chip SRAM buffers. This allows us to achieve both higher performance and better energy efficiency. We also move simple accumulation operations close to the data locations (DRAM banks) in order to reduce memory accesses and improve performance and energy. At the software level, we build an analytical model to quickly derive optimized dataflow schedules, whereas previous NN designs required exhaustive search for the optimal schedules [7, 45]. Finally, we develop an efficient hybrid partitioning scheme for NNs that allows us to exploit the high parallelism inherent in 3D memory stacks.

We combine these hardware and software optimizations in the design of TETRIS, an NN accelerator that uses an eight-die HMC memory stack organized into 16 vaults (vertical channels). Each vault is associated with an array of 14 × 14 NN processing elements and a small SRAM buffer. We demonstrate that a single TETRIS vault improves performance and energy efficiency by 37% and 40% respectively over a larger, 16 × 16 array with a 4x larger buffer and one 2D low-power DDRx channel. Using all 16 vaults in the 3D stack, TETRIS achieves 4.1x and 1.5x performance and energy improvements respectively over the conventional 2D design, even if we scale its memory channels to four. We show that TETRIS improves computational density by optimally dividing area between processing elements and on-chip buffers, and that moving partial computations to the DRAM dies is beneficial. Moreover, our analytically derived dataflow schedules are as efficient as the optimal schedules found through exhaustive search with previous approaches. Finally, the proposed hybrid partitioning scheme improves performance and energy efficiency by more than 10% over simple heuristics as we parallelize NN computations across multiple stacks. Overall, we develop and evaluate all of the elements necessary to make 3D memory a practical substrate for scalable and efficient NN accelerators.

The rest of this paper is organized as follows: Section 2 provides the background on NN acceleration and demonstrates the memory challenges for scaling NN accelerators. Section 3 presents the hardware architecture for TETRIS, and Section 4 develops the software scheduling and partitioning schemes for it. Section 6 evaluates TETRIS against state-of-the-art 2D and previous 3D NN accelerator designs using the methodology in Section 5. Section 7 discusses related work and Section 8 concludes the paper.

2. Background

During an inference task, each layer performs computations on a set of neurons and feeds the results forward to later layers according to the network connections. The first layer accepts the input data (e.g., an image), and the last layer generates the output as an encoded tag or a probability vector for the category of the input (e.g., picture of a dog).

NNs can have different types of layers. In the computer vision domain, convolutional neural networks (CNNs) are mostly composed of convolutional (CONV) layers. The neurons in a CONV layer are organized into a set of 2D feature maps (fmaps). A small number of CONV layers at the front of the CNN work on large fmaps, e.g., 112 × 112 or 56 × 56, followed by a very deep stack of CONV layers processing a large set (up to 1024) of small (e.g., 14 × 14) fmaps per layer. Each CONV layer performs 2D convolutions between input fmaps (ifmaps) and trained filter weights, and then a non-linear activation function, such as ReLU or sigmoid, is applied to the sum of the convolution results to generate the output fmaps (ofmaps). This computation is typically performed on a batch of fmaps for better efficiency. The formal definition is:

$$ O[b][v] = f\left( \sum_{u=0}^{N_i - 1} I[b][u] \ast W[u][v] + B[v] \right) \tag{1} $$
$$ 0 \le v \le N_o - 1, \qquad 0 \le b \le N_b - 1 $$

where O, I, and W are the ofmaps, ifmaps, and filter weights, each of which has four dimensions (2D image, number of fmaps, and batch). "∗" denotes a 2D convolution operation, and N_i, N_o, and N_b are the number of ifmaps, the number of ofmaps, and the batch size, respectively. B is a 1D bias, and f is the non-linear activation function.
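As a concrete reference for Equation 1, the following minimal NumPy/SciPy sketch evaluates a CONV layer. It is our illustration, not the TETRIS implementation: the function name conv_layer, the tensor layouts, the 'valid' (no padding, unit stride) convolution, and the ReLU default are all assumptions, and "∗" is realized as 2D cross-correlation, as is common in CNN implementations.

# A minimal sketch of Equation 1; shapes and defaults are illustrative
# assumptions, not the TETRIS design.
import numpy as np
from scipy.signal import correlate2d

def conv_layer(I, W, B, f=lambda x: np.maximum(x, 0.0)):
    # I: ifmaps,  shape (Nb, Ni, H, W)
    # W: weights, shape (Ni, No, R, S)
    # B: bias,    shape (No,)
    # f: activation function (ReLU by default, per the text).
    Nb, Ni, H, Wi = I.shape
    _, No, R, S = W.shape
    Ho, Wo = H - R + 1, Wi - S + 1      # 'valid' convolution: no padding, stride 1
    O = np.zeros((Nb, No, Ho, Wo))
    for b in range(Nb):                 # batch dimension
        for v in range(No):             # one ofmap per v
            acc = np.zeros((Ho, Wo))
            for u in range(Ni):         # accumulate over all ifmaps
                # CNNs typically implement "*" as 2D cross-correlation.
                acc += correlate2d(I[b, u], W[u, v], mode="valid")
            O[b, v] = f(acc + B[v])     # add bias B[v], then apply f
    return O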
Fully-connected (FC) layers are also widely used in many NNs (e.g., at the end of a CNN). An FC layer operates on lower-dimensional ifmaps and weights, and calculates their inner products to generate ofmaps. Mathematically, the computation can also be framed using Equation 1 by restricting each ofmap size to 1 × 1 (see the short example at the end of this section). Pooling (POOL) and normalization (NORM) layers are also common, but they do not use trained weights and are fast to process in a streaming fashion, so we focus on CONV and FC layers.

Many studies have shown that NNs become more effective as the networks become deeper. For example, ResNet [19], the ILSVRC 2015 winner, has 152 layers. Furthermore, the composition of CONV layers in CNNs becomes modular-
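To make the earlier FC claim concrete, the conv_layer sketch above also computes an FC layer once each filter spans the entire ifmap, so every ofmap degenerates to 1 × 1. The sizes below are hypothetical, chosen only for illustration.

# FC layer via Equation 1: whole-ifmap filters make each ofmap 1 x 1.
Nb, Ni, No = 4, 256, 10               # hypothetical layer sizes
I = np.random.rand(Nb, Ni, 7, 7)      # low-dimensional ifmaps
W = np.random.rand(Ni, No, 7, 7)      # one full-size filter per (u, v) pair
B = np.zeros(No)
O = conv_layer(I, W, B)
assert O.shape == (Nb, No, 1, 1)      # inner products, framed as convolutions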