Wire-Aware Architecture and Dataflow for CNN Accelerators

Wire-Aware Architecture and Dataflow for CNN Accelerators Sumanth Gudaparthi Surya Narayanan Rajeev Balasubramonian University of Utah University of Utah University of Utah Salt Lake City, Utah Salt Lake City, Utah Salt Lake City, Utah [email protected] [email protected] [email protected] Edouard Giacomin Hari Kambalasubramanyam Pierre-Emmanuel Gaillardon University of Utah University of Utah University of Utah Salt Lake City, Utah Salt Lake City, Utah Salt Lake City, Utah [email protected] [email protected] pierre- [email protected] ABSTRACT The 52nd Annual IEEE/ACM International Symposium on Microarchitecture In spite of several recent advancements, data movement in modern (MICRO-52), October 12–16, 2019, Columbus, OH, USA. ACM, New York, NY, USA, 13 pages. https://doi.org/10.1145/3352460.3358316 CNN accelerators remains a significant bottleneck. Architectures like Eyeriss implement large scratchpads within individual processing elements, while architectures like TPU v1 implement large systolic arrays and large monolithic caches. Several data move- 1 INTRODUCTION ments in these prior works are therefore across long wires, and ac- Several neural network accelerators have emerged in recent years, count for much of the energy consumption. In this work, we design e.g., [9, 11, 12, 28, 38, 39]. Many of these accelerators expend sig- a new wire-aware CNN accelerator, WAX, that employs a deep and nificant energy fetching operands from various levels of the mem- distributed memory hierarchy, thus enabling data movement over ory hierarchy. For example, the Eyeriss architecture and its row- short wires in the common case. An array of computational units, stationary dataflow require non-trivial storage for scratchpads and each with a small set of registers, is placed adjacent to a subarray registers per processing element (PE) to maximize reuse [11]. There- of a large cache to form a single tile. Shift operations among these fore, the many intra-PE and inter-PE accesses in Eyeriss require registers allow for high reuse with little wire traversal overhead. data movement across large register files. Many accelerators also This approach optimizes the common case, where register fetches access large monolithic buffers/caches as the next level of their and access to a few-kilobyte buffer can be performed at very low hierarchy, e.g., Eyeriss has a 108 KB global buffer, while Google cost. Operations beyond the tile require traversal over the cache’s TPU v1 has a 24 MB input buffer [24]. Both architectures also im- H-tree interconnect, but represent the uncommon case. For high plement a large grid of systolic PEs, further increasing the wire reuse of operands, we introduce a family of new data mappings lengths between cached data and the many PEs. In this paper, we and dataflows. The best dataflow, WAXFlow-3, achieves a 2× im- re-visit the design of PEs and memory hierarchy for CNN accelera- provement in performance and a 2.6-4.4× reduction in energy, rel- tors, with a focus on reducing these long and frequently traversed ative to Eyeriss. As more WAX tiles are added, performance scales wire lengths. well until 128 tiles. It is well known that data movement is orders of magnitude more expensive than the cost of compute. At 28 nm, a 64-bit floating- CCS CONCEPTS point multiply-add consumes 20 pJ; transmitting the correspond- • Computer systems organization → Neural networks. ing operand bits across the chip length consumes 15× more; access- ing a 1 MB cache consumes 50× more; and fetching those bits from KEYWORDS off-chip LPDDR consumes 500× more [26, 27, 32]. Since this initial comparison from 2011, DNN accelerators have switched to using 8- CNN, DNN, neural networks, accelerator, near memory bit fixed-point [24] or 16-bit flexpoint [29] arithmetic, which helps ACM Reference Format: lower compute energy by an order of magnitude [24]. Recently, Sumanth Gudaparthi, Surya Narayanan, Rajeev Balasubramonian, Edouard technologies like HBM have helped reduce memory energy per bit Giacomin, Hari Kambalasubramanyam, and Pierre-Emmanuel Gaillardon. by an order of magnitude [36]. Meanwhile, on-chip wiring and on- 2019. Wire-Aware Architecture and Dataflow for CNN Accelerators . In chip caches have not benefited much from technology steps [6, 21]. In response to the relative shift in bottlenecks, this work targets Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed low on-chip wire traversal. for profit or commercial advantage and that copies bear this notice and the full cita- We create a new wire aware accelerator WAX, that implements tion on the first page. Copyrights for components of this work owned by others than a deep and distributed memory hierarchy to favor short wires. Such ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or re- publish, to post on servers or to redistribute to lists, requires prior specific permission an approach has also been leveraged in the first designs from the and/or a fee. Request permissions from [email protected]. startup, Graphcore [18]. We implement an array of PEs beside each MICRO-52, October 12–16, 2019, Columbus, OH, USA cache subarray. Each PE is assigned less than a handful of registers. © 2019 Association for Computing Machinery. ACM ISBN 978-1-4503-6938-1/19/10...$15.00 The registers have shift capabilities to implement an efficient ver- https://doi.org/10.1145/3352460.3358316 sion of systolic dataflow. Each PE therefore uses minimal wiring MICRO-52, October 12–16, 2019, Columbus, OH, USA Gudaparthi, et al. to access its few registers, its adjacent register, and a small (few- systolic dataflow. This allows, for example, an input operand to KB) cache subarray. Data movement within this basic WAX tile be multiplied by the many weights in one convolutional kernel, has thus been kept to a minimum. Large layers of CNNs map to and by the weights in multiple kernels. Each MAC is fed by a few several tiles and aggregate the results produced by each tile. To registers. While the MACs are working on one computation, the increase the computational power of the WAX tile, we introduce registers are pre-loaded with operands required by the next com- a novel family of dataflows that perform a large slice of compu- putation (a form of double-buffering). Weights are fetched from tation with high reuse and with data movement largely confined off-chip memory (DDR for TPU v1 and HBM for TPU v2) into a within a tile. We explore how the dataflows can be adapted to re- FIFO. Input/output feature maps are stored in a large 24MB buffer. duce problematic partial sum updates in the subarray. While this What is notable in the TPU design is that there is a monolithic reduces reuse for other data structures and requires more adders, grid of MACs that occupies 24% of the chip’s area [24]. Further, we show that the trade-off is worthwhile. all input and output feature maps are fetched from a 24 MB cache, Our analysis shows that the additional WAX components con- which too occupies 29% of the chip’s area. As a result, most operands tribute 46% of the tile area. Our best design reduces energy by 2.6- must traverse the length or width of the large grid of MACs, as well 4.4×, relative to Eyeriss. WAX also consumes less area and hence as navigate a large H-Tree within the cache. less clock distribution power by eliminating the many large regis- Wire Traversal ter files in Eyeriss. We show that our best dataflow (WAXFlow-3) Our proposed approach is motivated by the premise that short- enables higher overlap of computation with operand loading into wire traversal is far more efficient than long-wire traversal. We subarrays – this leads to higher compute utilization and through- quantify that premise here. put than Eyeriss. As we scale the design to several tiles, the compu- While a large scratchpad or register file in an Eyeriss PE pro- tational throughput increases until 128 tiles. A WAX tile can there- motes a high degree of reuse, it also increases the cost of every fore form the basis for both, an energy-efficient edge device and a scratchpad/register access, it increases the distance to an adjacent throughput/latency-oriented server. PE, and it increases the distance to the global buffer. Figure 1c shows the breakdown of energy in the baseline Eyeriss while ex- ecuting the CONV1 layer of AlexNet [30]. Nearly 43% of the total 2 BACKGROUND energy of Eyeriss is consumed by scratchpads and register files. We first describe two designs, one commercial and one academic, Our hypothesis is that less storage per PE helps shorten distances that highlight the extent of data movement in state-of-the-art ar- and reduce data movement energy, especially if efficient dataflows chitectures. can be constructed for this new hierarchy. We also implement a Eyeriss deeper hierarchy where a few kilo-bytes of the global buffer are Eyeriss [11] uses a monolithic grid of processing elements (PEs). adjacent to the PEs, while the rest of the global buffer is one or Each PE has scratchpads and register files that together store about more hops away. half a kilo-byte of operands. The filter scratchpad has 224 entries To understand the relative energy for these various structures and is implemented in SRAM, while the partial sums and activa- and wire lengths, we summarize some of the key data points here. tions are stored in 24- and 12-entry register files respectively.

Wire-Aware Architecture and Dataflow for CNN Accelerators

DEEP-Hybriddatacloud ASSESSMENT of AVAILABLE TECHNOLOGIES for SUPPORTING ACCELERATORS and HPC, INITIAL DESIGN and IMPLEMENTATION PLAN

Efficient Management of Scratch-Pad Memories in Deep Learning

Efficient Processing of Deep Neural Networks

Accelerate Scientific Deep Learning Models on Heteroge- Neous Computing Platform with FPGA

Exploring the Programmability for Deep Learning Processors: from Architecture to Tensorization Chixiao Chen, Huwan Peng, Xindi Liu, Hongwei Ding and C.-J

Deep Learning Inference in Facebook Data Centers: Characterization, Performance Optimizations and Hardware Implications

Energy-Efficient Neural Network Architectures

Deep Learning in Mobile and Wireless Networking: a Survey Chaoyun Zhang, Paul Patras, and Hamed Haddadi

Packing Sparse Convolutional Neural Networks for Efficient Systolic Array Implementations: Column Combining Under Joint Optimization

The Use of Appropriate Hardware Fabrics to Achieve a Low Latency

Mobile/Embedded DNN and AI Socs

DLIR: an Intermediate Representation for Deep Learning Processors