

Balancing Efficiency and Flexibility for DNN Acceleration via Temporal GPU-Systolic Array Integration

Cong Guo1, Yangjie Zhou1, Jingwen Leng1,∗, Yuhao Zhu2, Zidong Du3, Quan Chen1, Chao Li1, Bin Yao1, Minyi Guo1,∗ 1Shanghai Jiao Tong University, 2University of Rochester, 3Institute of Computing Technology, Chinese Academy of Sciences

∗ Jingwen Leng and Minyi Guo are co-corresponding authors of this paper.

Abstract—The research interest in specialized hardware accelerators for deep neural networks (DNN) has spiked recently owing to their superior performance and efficiency. However, today's DNN accelerators primarily focus on accelerating specific "kernels" such as convolution and matrix multiplication, which are vital but only part of an end-to-end DNN-enabled application. Meaningful speedups over the entire application often require supporting computations that are, while massively parallel, ill-suited to DNN accelerators. Integrating a general-purpose processor such as a CPU or a GPU incurs significant data movement overhead and leads to resource under-utilization on the DNN accelerators. We propose Simultaneous Multi-mode Architecture (SMA), a novel architecture design and execution model that offers general-purpose programmability on DNN accelerators in order to accelerate end-to-end applications. The key to SMA is the temporal integration of the systolic execution model with the GPU-like SIMD execution model. SMA exploits the common components shared between the systolic-array accelerator and the GPU, and provides a lightweight reconfiguration capability to switch between the two modes in-situ. SMA achieves up to 63% performance improvement while consuming 23% less energy than the baseline Volta architecture with TensorCore.

I. INTRODUCTION

Deep learning has revolutionized key domains such as natural-language processing and computer vision [9], for which the GPU has been the enabler. Although still widely used for training and inference, the GPU's performance and energy efficiency have started to plateau. As such, computer architects have started to build specialized hardware accelerators for deep neural network models, which have high computation intensity and regular dataflow patterns. Prime examples include Eyeriss [5], the DianNao family [4], the TPU [8], and TensorCore (TC) [17].

However, existing accelerators are insufficient to provide meaningful speedup for DNN-enabled applications. This is because they focus only on specific "kernels" such as convolution, but ignore the end-to-end application characteristics. As a result, accelerating those fixed kernels leads to insignificant speedup, sometimes even slowdown, at the application level. Specifically, emerging "hybrid" neural network models have started to combine convolution kernels with irregular kernels that are, although massively parallel, ill-suited for specialized accelerators. Besides, applications like autonomous driving [13] adopt both neural network models and algorithms from other domains with different characteristics.

To support end-to-end applications, specialized accelerators must be coupled with flexibility for unsupported computations. The mainstream solutions fall into three categories, none of which is efficient. First, integrating the accelerator with a general-purpose host processor incurs significant data movement overhead and accelerator resource under-utilization during the unsupported computations. Second, systems like the TPU can convert the unsupported operations into compatible operations for native execution, but with severe efficiency loss. Finally, GPU-based systems provide general-purpose programmability for different types of operations, but are less efficient than the specialized accelerators when executing those regular kernels.

We propose the simultaneous multi-mode architecture (SMA), which provides flexibility in accelerating the irregular operations while maintaining efficiency for regular convolution operations. It leads to significant end-to-end application speedup compared to state-of-the-art systems. Our design philosophy starts with the SIMD execution in GPUs and judiciously applies lightweight architectural augmentations to support the systolic execution model, which is proven efficient for regular operations like convolutions and matrix multiplication [10]. Our integration of a specialized accelerator on a general-purpose substrate is similar in spirit to the recent tensor core (TC) in the NVIDIA Volta GPU [20]. However, our architecture differs from the tensor core in two key aspects. First, we temporally integrate the systolic array and the SIMD architecture, while TC does so spatially, which leads to area wastage. Second, we employ a SIMD-friendly systolic-array dataflow, which achieves high data reuse while maximizing the GPU memory access efficiency. In contrast, the dataflow in the tensor core suffers from low data reuse.

The critical challenge in our architecture is to achieve high execution efficiency in the systolic mode and a low runtime reconfiguration overhead between the systolic mode and the SIMD mode. We leverage the GPU's massive parallelism and a fine-grained synchronization scheme, which is able to control the execution of the systolic array with little overhead. The systolic mode eliminates most of the expensive accesses to the register file and shared memory, and significantly improves the matrix multiplication efficiency over the SIMD-only mode. On the other hand, the SIMD mode enables efficient acceleration of hybrid DNN workloads as it preserves programmability.

The contributions of our work are as follows:
• We quantify the execution inefficiencies of running DNNs with irregular operations on a variety of existing hardware platforms and identify their execution bottlenecks.
• We consider various systolic array dataflows and identify a SIMD-friendly one that balances memory access efficiency and data reuse to facilitate its integration on GPUs.
• We propose SMA, an architecture that temporally integrates the SIMD and systolic execution models with little reconfiguration overhead by exploiting their architectural similarity.

II. ANALYSIS OF EXISTING DNN ACCELERATORS

We analyze the performance of executing the emerging hybrid DNN models on existing hardware accelerators, i.e., the TPU [8] and TensorCore (TC) [17]. They are the two commercially available DNN accelerators with highly optimized software stacks. However, our experimental results identify that the GEMM-incompatible computation in the hybrid models is too computationally demanding to run on the hardware accelerators, even when they are coupled with a general-purpose CPU.

Fig. 1: Tensor Core and TPU Efficiency.
Fig. 2: Mask R-CNN (top) and DeepLab (bottom) details.
Fig. 3: TPU vs GPU for Mask R-CNN and DeepLab.

A. DNN Hardware Accelerators

Both the TPU and TC accelerate the GEMM operation, which is the dominant operation in commonly used models such as the convolutional neural network (CNN) [9], the multi-layer perceptron (MLP), and RNN/LSTM [14].

The TPU uses a weight-stationary systolic array [10], whose size is 256×256 in TPU-v1. The systolic array has a significant degree of data reuse, which leads to high-performance and energy-efficient execution of GEMM. In contrast, the TC is still a SIMD architecture and has a limited degree of data reuse. According to the reverse-engineering work [20], it executes the GEMM operation in a dot-product fashion and supports a 4×4×4 GEMM operation, much smaller than the TPU.

We compare the GEMM performance of the TPU and TC in Fig. 1. We use a cloud TPU-v2, which has a total of eight cores; we only use one core, which has a 128×128 systolic array with a peak of 22.5 TFLOPS. The GPU is a Tesla V100, which has 15.7 FP32 and 125 TC peak TFLOPS [17]. Owing to their different peak FLOPS, we use the FLOPS efficiency (achieved FLOPS divided by the peak FLOPS) as the metric for a fair comparison. With a large enough matrix, the TPU achieves almost 100% FLOPS efficiency, while the TC achieves less than 60% efficiency. As previously explained, the TC calculates the GEMM with multiple parallel dot-product operations while the TPU does it in the systolic fashion. As a result, a tensor core only supports a small GEMM operation with limited data reuse and high register bandwidth consumption, which leads to its low FLOPS efficiency.
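
As a concrete reading of Fig. 1, the metric works out as follows (the achieved numbers simply apply the efficiencies above to the quoted peaks; they are shown for illustration rather than as separate measurements):

    FLOPS efficiency = achieved FLOPS / peak FLOPS
    TC on V100:       below 60% of 125 TFLOPS, i.e., under roughly 75 TFLOPS achieved
    One TPU-v2 core:  close to 100% of 22.5 TFLOPS, i.e., roughly 22.5 TFLOPS achieved
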
B. Hybrid Models

We now compare the performance of the two accelerators on hybrid models. Those models, which are the result of fast-evolving deep learning algorithms, can have operations that cannot be executed through GEMM, and therefore present significant challenges for the existing fixed-function accelerators.

Fig. 2 shows two such hybrid DNN models, i.e., Mask R-CNN [6] and DeepLab [3]. Both models target the semantic segmentation task, which aims to classify all the pixels in the input image and is more complicated than image classification. As such, these state-of-the-art models rely on CNNs for feature extraction, but also introduce additional operations to improve the segmentation accuracy. Both models in Fig. 2 have GEMM-compatible CONV and FC layers. But Mask R-CNN has RoIAlign, a bi-linear interpolation that requires many reshape operations, and RegionProposal, a control-flow-intensive non-max suppression (NMS) operation. Those operations are challenging to support with only the GEMM operation. Similarly, DeepLab has the ArgMax and CRF (conditional random field [11]) operations, which are GEMM-incompatible.

Fig. 3 shows the performance comparison and breakdown on the TPU and GPU. The TPU executes Mask R-CNN about 75% slower than the GPU. A closer observation reveals that the GPU is slower than the TPU on the GEMM-compatible kernels (i.e., CONV and FC), but far out-performs the TPU on RoIAlign and RegionProposal. We examine the TPU-version source code for performance debugging and find that it converts the control-flow-intensive NMS operation in RegionProposal into multiple dataflow-based GEMM operations, and converts the RoIAlign operation into multiple average pooling operations for which the TPU has hardware support. As such, the TPU can execute Mask R-CNN without relying on the CPU, i.e., with no data transfer overhead. However, the improper mapping causes severe performance degradation.

DeepLab runs much slower on the TPU than on the GPU owing to the TPU's inability to support the important CRF operation. As such, the TPU transfers the data to the CPU for executing CRF, and we separate the CRF time from the overall execution time. The TPU has higher performance (> 1.6×) than the GPU for the GEMM-compatible kernels, but the data transfer overhead is 1.2× of its GEMM operation time, leading to an overall 2× slowdown compared to the GPU. Also, the performance of CRF is 10× worse on the CPU (with one core).

The results show that over-specialization can severely degrade the performance of incompatible operations, so it is crucial to provide general-purpose programmability for emerging hybrid DNN models. However, the approach of relying on general-purpose cores can cause significant data movement overhead and also fails to exploit the computation resources inside the accelerator.

III. SIMULTANEOUS MULTI-MODE ARCHITECTURE

To balance efficiency and flexibility, we propose the simultaneous multi-mode architecture. SMA integrates a GPU-like SIMD execution mode and a systolic execution mode, and temporally switches between the two modes. The SIMD mode efficiently executes GEMM-incompatible operations, and the specialized mode accelerates GEMM operations. We describe the design principles behind SMA and highlight its key novelty over TC, another architecture instance that can switch between generic SIMD execution and DNN acceleration.

Fig. 4: Comparison of TPU and SMA dataflow.

A. Temporal Integration

The first design principle in SMA is the temporal integration of the general-purpose mode and the specialized mode. In contrast, the TC adopts a spatial integration methodology, which leads to overhead in both area and performance. As explained previously, each TC has multiple dot-product units for executing matrix multiplication, and each dot-product unit has 4 MAC units [20]. The SIMD units in the GPU have the same computation ability but are not used when the TC is active, which is essentially area wastage. Meanwhile, the TC also requires an adder tree for result reduction, which incurs additional area overhead. Spatial integration of the two architectures leads to area inefficiency and resource wastage because the computation of DNN models is usually layer-by-layer, where only one architecture is used when performing the computation for one layer. In contrast, SMA is built on top of the existing SIMD execution units and aims for maximal sharing of hardware structures (i.e., improved area efficiency).

The spatial integration of the TC also leads to its highly decoupled execution model [1]: the SIMD units load data into the register file and the TC relies on an explicit synchronization to receive the data. In addition, the TC only supports a fixed shape (i.e., 4×4×4) of matrix multiplication and does not expose the opportunity of SIMD-accelerator collaboration to hide the data loading latency more aggressively. This decoupled execution model has inherent performance inefficiency. In contrast, the temporal integration in SMA enables such collaboration by imposing zero switching overhead between the SIMD and accelerator modes.

B. Choice of Dataflow

SMA starts with a SIMD substrate and adds a systolic mode for DNN acceleration because the systolic array exploits data reuse and outperforms the dot-product-based TC (Fig. 1). However, the SIMD and systolic architectures favor distinct memory access patterns, which in turn expose a fundamental trade-off between memory access efficiency and data reuse that we must reconcile when integrating the two modes.

The SIMD architecture of the GPU favors coalesced memory accesses, which are supported by the cache system [12]. However, a systolic array incurs memory accesses that cannot be coalesced. Fig. 4 (left) shows the weight-stationary dataflow in the TPU. Loading matrix A and writing the output matrix C require accessing different rows of the respective matrix in the same cycle. Thus, directly implementing the systolic execution model on the SIMD hardware would lead to unsupported memory behaviors. While shared memory (scratchpad) in GPUs supports uncoalesced memory accesses via banking, it has a limited number of banks and thus does not scale well to large (or multiple) systolic arrays.

The TC chooses to address the problem by favoring coalesced memory accesses to maximally utilize the GPU's memory subsystem. More specifically, the TC uses a set of dot-product units to implement GEMM [20]. In that way, all the memory accesses (to matrices A, B, and C) are coalesced in the TC. However, this approach leads to low data reuse and therefore poor GEMM performance, as evident in Fig. 1.

To strike a balance between memory access efficiency and data reuse, we analyze different systolic array dataflows proposed in prior work and identify a SIMD-friendly dataflow called semi-broadcasted weight-stationary [10]. The dataflow, shown in Fig. 4 (right), is similar to the TPU's dataflow. However, instead of passing a matrix A element from top to bottom as in the original design, each A element is now broadcasted to all the PEs in the same column. Every cycle, all the PEs in the same column get the same element from matrix A, perform a MAC operation, and send the data to the corresponding PEs on the right.

This dataflow is more SIMD-friendly and enables seamless integration on a SIMD substrate. Specifically, each element of matrices A and C is reused N times in the N×N array, which is the same as in the weight-stationary systolic execution model and is better than the TC. Meanwhile, accesses to matrices B and C are coalesced, and only accesses to matrix A are uncoalesced.
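
To make the semi-broadcasted weight-stationary dataflow concrete, the following host-side model steps one N×N array cycle by cycle. It is a minimal sketch for intuition only: the PE indexing (the PE in row r, column k holds B[k][r]), the row-major layouts, and the drained pipeline between tiles are our assumptions rather than details specified above.

    #include <vector>

    constexpr int N = 8;   // array dimension: one 8x8 SMA unit

    // C (M x N) += A (M x N) x B (N x N), modeled at the granularity of array cycles.
    void sma_tile(const std::vector<float>& A,   // streamed operand (uncoalesced accesses)
                  const std::vector<float>& B,   // stationary weights, one element per PE
                  std::vector<float>&       C,   // injected at the left edge, coalesced
                  int M)
    {
        std::vector<float> psum(N * N, 0.f);              // pipeline registers between PE columns
        for (int t = 0; t < M + N - 1; ++t) {             // one outer iteration = one cycle
            for (int k = N - 1; k >= 0; --k) {            // right-to-left: read last cycle's psums
                int i = t - k;                            // row of A scheduled on column k this cycle
                if (i < 0 || i >= M) continue;            // column idle while the pipeline fills/drains
                float a = A[i * N + k];                   // broadcast to all N PEs of column k
                for (int r = 0; r < N; ++r) {             // the N PEs of column k (parallel in hardware)
                    float in = (k == 0) ? C[i * N + r]               // C enters at the left edge
                                        : psum[r * N + (k - 1)];     // partial sum from the left neighbour
                    psum[r * N + k] = in + a * B[k * N + r];         // MAC with the stationary weight
                    if (k == N - 1) C[i * N + r] = psum[r * N + k];  // result exits at the right edge
                }
            }
        }
    }

In any given cycle the active columns read A elements from up to N different rows (the uncoalesced accesses noted above), while the C values entering at the left edge and leaving at the right edge are N consecutive elements of one row and can be coalesced.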

IV. SMA IMPLEMENTATION

This section describes the architectural modifications to the baseline GPU that implement SMA. The architectural design challenge is that SMA temporally integrates the SIMD mode and the systolic mode, and temporally reconfigures itself at runtime. Thus, we must minimize the reconfiguration overhead with minimal hardware augmentation. We then explain the instruction control and the SIMD-systolic interaction. The temporal integration is feasible owing to our SIMD-friendly systolic dataflow, which has high enough architectural similarity to the baseline GPU. As such, our design can reuse as many existing resources (i.e., computation, memory, and control) and architectural features as possible.

Fig. 5: The details of SMA: A. systolic controller; B. shared memory and register file; C. PE array.
Fig. 6: GEMM partition and tiling on SMA.

A. SMA Unit Design

The heart of our design is a set of SMA units, each being a systolic array in the specialized mode and reconfigured into conventional SIMD lanes in the general-purpose mode. We use the latest Volta architecture [17] in Tbl. I as our baseline GPU architecture, which has 80 cores (or streaming multiprocessors, SMs). Each SM has 64 CUDA cores (i.e., 64 FP32 units) and 4 TCs (i.e., 256 FP16 units in total). It also has up to 96 KB of shared memory with 32 banks, each bank providing a 32-bit value. The register file provides vector-like access and its size is 256 KB per SM.

TABLE I: Baseline GPU and SMA Configurations.
                   | GPGPU (Baseline Volta)             | SMA (Volta)
SMs                | 80                                 | 80
CUDA Core/SM       | 64 FP32 units                      | 3 8×8 SMA units
Tensor Core/SM     | 4 (256 FP16 units)                 | (merged into the SMA units)
Shared Memory/SM   | 32 banks, configurable up to 96 KB | 32 banks (8 for all SMA units), configurable up to 96 KB
Register File/SM   | 256 KB                             | 256 KB

SMA reuses the same computation resources in each SM (64 CUDA cores and 4 TCs, equivalent to 128 FP32 units in total) and provides three SMA units per SM. Each SMA unit is an 8×8 FP32 systolic array. Fig. 5 shows the microarchitecture of one 8×8 SMA unit. The SMA unit is implemented on top of the baseline SM architecture in the GPU with two key architectural augmentations to support the semi-broadcasted weight-stationary dataflow (Sec. III-B). First, we repurpose the existing operand collector as a local buffer for storing the stationary weight of each PE. Second, we add additional wires to support broadcasting elements of matrix A and communicating partial sums within the array. Overall, the layout of the 8×8 systolic array can be realized with minimal routing changes to the physical layout of the existing SIMD units, as the bottom of Fig. 5(C) shows.

Certain NVIDIA GPUs support precision conversion between FP32 and FP16 [17]. For example, two FP16 MAC units can be grouped into one FP32 MAC unit. If the baseline GPU supports this precision conversion, our SMA can also exploit it, leading to an 8×16 FP16 systolic array instead of the current 8×8 FP32 systolic array. Similarly, our SMA unit can also be built from other data types such as INT8.

B. Instruction Control

We present our asynchronous-instruction-based control mechanism, which can be seamlessly integrated into the existing SIMD pipeline and enables the simultaneous presence of the two distinctive modes. Under the hood, SMA uses the GPU's rich memory resources, abundant parallelism, and fine-grained synchronization to maximize performance.

We propose a new instruction LSMA (Load, Store and Multiply-Accumulate) for the systolic mode. The instruction executes the operation in Eq. 1 and requires four register operands: the addresses of the first elements of matrices A and C, one element value of matrix B, and the height of matrix A. The instruction executes asynchronously with respect to other SIMD instructions, minimizing the interference with the existing pipeline control logic. The threads need to issue an explicit synchronization to access the systolic computation results.

    LSMA: C_out ← A_in × B + C_in    (1)

Once an LSMA instruction is issued, a dedicated systolic controller (Fig. 5) is responsible for controlling the array, with two main roles. First, it has an active mask for controlling the idle or active status of individual PEs. Second, it has multiple address generation units for feeding data to the array, which involves two different kinds of memory accesses (Sec. III-B). For an SMA unit, we assign 8 shared memory banks for loading matrix A with uncoalesced accesses and one register file (RF) bank for storing matrix C with coalesced accesses. In the baseline GPU, an RF bank provides 32 values (32-bit) for a warp, which is enough for the SMA unit that reads 8 values per cycle. The three SMA units can be combined into an 8×24 array to coordinate their memory accesses. Fig. 5(C) shows the register/shared memory accesses for the three SMA units.

The TC has inherent inefficiency owing to its strictly synchronous semantics and fixed matrix shape (Sec. III-B). In contrast, our instruction design overcomes this inefficiency by adopting asynchronous semantics and a flexible shape (K × 8 × 8), which enables the more fine-grained SIMD-systolic collaboration that we describe in the next subsection.
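
The tile-level effect of one LSMA issue follows directly from Eq. 1. The reference below is a host-side sketch under our own layout assumptions (row-major tiles; A and C with height rows and 8 columns; B the 8×8 stationary weight tile). It captures only the arithmetic; on SMA the instruction is issued asynchronously and its results become visible only after the explicit synchronization described above.

    constexpr int NS = 8;   // one SMA unit is an 8x8 systolic array

    // Eq. 1, C_out = A_in x B + C_in, applied to a whole tile. The flexible
    // K x 8 x 8 shape comes from the height operand, the fourth LSMA operand.
    void lsma_reference(const float* A,   // height x 8, streamed from shared memory
                        const float* B,   // 8 x 8 stationary weights (one per PE)
                        float*       C,   // height x 8, accumulated in a register-file bank
                        int          height)
    {
        for (int i = 0; i < height; ++i)        // rows of A streamed cycle by cycle
            for (int j = 0; j < NS; ++j)        // output columns (one per PE row)
                for (int k = 0; k < NS; ++k)    // reduction across the array
                    C[i * NS + j] += A[i * NS + k] * B[k * NS + j];
    }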

C. Algorithm Mapping

We describe the algorithm mapping and optimizations in SMA, most of which run at the software level and leverage the SIMD-systolic collaboration. We also present the new SMA-specific warp synchronization primitive and scheduler.

We implement the GEMM C = αA × B + βC and adopt common parallelization techniques such as partitioning, tiling, and double buffering, as shown in Fig. 6. We divide the computation by the output matrix C on a two-dimensional grid of thread blocks (TBs). This partition avoids inter-TB communication, and each TB is responsible for calculating a sub-matrix Csub, which is stored in the register file for faster access. Owing to the constraints of register file capacity, we choose a sub-matrix size of 128 × 128.

For increased data locality, Asub and Bsub are divided into tiles Atile and Btile of size 128×8. Different tiles work in a double-buffered fashion: in each iteration, each TB uses Atile and Btile to update the sub-matrix Csub. Since each core has an 8×8 weight-stationary systolic array, we divide the 128×8 tile Btile into 16 sub-tiles to run on the systolic array. As such, each systolic array operation computes Atile (128×8) against a Bsubtile (8×8). To maximize concurrency, we use 64 warps (i.e., 2048 threads) per TB, which are divided into two sets for double buffering. The two sets alternate between loading tiles in the SIMD mode and computing the tile in the systolic mode via the LSMA instruction. The warps in the two sets are synchronized through CUDA's recently added fine-grained synchronization primitive, cooperative groups [18].

The challenge of running the double-buffered algorithm on the GPU lies in the architecture's throughput-oriented design, which leads to its greedy-then-oldest (GTO) warp scheduler. The scheduler tries to issue the same set of warps over and over to maximize throughput, which may cause starvation of the double-buffered warps. To overcome this challenge, we add an SMA-specific scheduler that works in a round-robin fashion. The new scheduler works only in the systolic mode and does not affect the original scheduler.
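
The mapping above can be summarized by the sequential host-side sketch below, which mirrors the partition of Fig. 6. The tile sizes follow the text; everything else (row-major layouts, an 8×128 view of Btile, and the explicit copy-in/copy-out that the real kernel avoids by keeping Csub in registers) is our simplification. Parallelism is only annotated in comments: each (bm, bn) iteration is one thread block, the load phase runs in SIMD mode on one 32-warp set while the other set issues the LSMA work, and the two sets swap roles every k0 iteration under a cooperative-groups barrier.

    #include <vector>

    // Functional stand-in for one LSMA issue (see the sketch after Eq. 1).
    void lsma_reference(const float* A, const float* B, float* C, int height);

    constexpr int TM = 128, TN = 128, TK = 8, SUB = 8;   // Csub, tile, and sub-tile sizes

    // C = alpha*A*B + beta*C with A: MxK, B: KxN, C: MxN (row-major);
    // M, N, K are assumed to be multiples of the tile sizes for brevity.
    void sma_gemm(const float* A, const float* B, float* C,
                  int M, int N, int K, float alpha, float beta)
    {
        for (int bm = 0; bm < M; bm += TM)
        for (int bn = 0; bn < N; bn += TN) {
            // One thread block: its 128x128 Csub lives in the register file on SMA.
            std::vector<float> csub(TM * TN, 0.f);

            for (int k0 = 0; k0 < K; k0 += TK) {
                // SIMD mode (one warp set): load the next 128x8 Atile and the
                // matching Btile into the double buffer.
                std::vector<float> atile(TM * TK), btile(TK * TN);
                for (int i = 0; i < TM; ++i)
                    for (int k = 0; k < TK; ++k)
                        atile[i * TK + k] = A[(bm + i) * K + (k0 + k)];
                for (int k = 0; k < TK; ++k)
                    for (int j = 0; j < TN; ++j)
                        btile[k * TN + j] = B[(k0 + k) * N + (bn + j)];

                // Systolic mode (the other warp set): Btile is split into 16
                // 8x8 sub-tiles; each becomes the stationary weights of one
                // (asynchronous) LSMA issue that updates a 128x8 slice of Csub.
                for (int s = 0; s < TN / SUB; ++s) {
                    std::vector<float> bsub(TK * SUB), cslice(TM * SUB);
                    for (int k = 0; k < TK; ++k)
                        for (int j = 0; j < SUB; ++j)
                            bsub[k * SUB + j] = btile[k * TN + s * SUB + j];
                    for (int i = 0; i < TM; ++i)
                        for (int j = 0; j < SUB; ++j)
                            cslice[i * SUB + j] = csub[i * TN + s * SUB + j];
                    lsma_reference(atile.data(), bsub.data(), cslice.data(), TM);
                    for (int i = 0; i < TM; ++i)
                        for (int j = 0; j < SUB; ++j)
                            csub[i * TN + s * SUB + j] = cslice[i * SUB + j];
                }
                // (Here the two warp sets swap roles; a cooperative-groups
                // barrier separates the load and compute phases.)
            }

            // Epilogue (SIMD mode): apply alpha/beta and write Csub back.
            for (int i = 0; i < TM; ++i)
                for (int j = 0; j < TN; ++j) {
                    float value = alpha * csub[i * TN + j] + beta * C[(bm + i) * N + (bn + j)];
                    C[(bm + i) * N + (bn + j)] = value;
                }
        }
    }
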
V. EVALUATION

In this section, we perform a comprehensive performance and energy efficiency evaluation of SMA in different scenarios. For regular DNN models, we compare SMA with its baseline SIMD architecture and demonstrate its efficiency and flexibility for supporting both the regular and hybrid models. In the end, we evaluate SMA's dynamic resource scheduling capability in the context of an autonomous driving application that contains both DNNs and traditional algorithms.

A. Simulation Methodology

For performance simulation of SMA, we modify GPGPU-Sim 4.0 [2] and add the systolic mode to the baseline SIMD architecture. We use GPUWattch [12] and CACTI [21] for energy estimation. We use the regular and hybrid models in Tbl. II. For the GPU-based GEMM implementation, we use NVIDIA's open-sourced and highly optimized CUTLASS library [19]. Specifically, we use a tiling size of 128×128 and modify it to use the systolic mode as detailed in Sec. IV-C. The convolution layers in CNN models are converted to GEMM through im2col (a short sketch follows below).

Area Overhead. SMA has little area overhead over the baseline GPU architecture owing to the reuse of existing structures. The only significant extra logic is the systolic controller, which has 256 B (bytes) of storage (8×8 B for A_in and 24×8 B for C_out) and little extra logic. A modern GPU has a 256 KB register file, 128 KB shared memory, and various computation units per SM. Therefore, we estimate the overhead is less than 0.1%.
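
For reference, a minimal im2col routine of the kind used for this lowering is sketched below (host side, CHW input layout; the layout and parameter names are generic and not tied to CUTLASS). The convolution then reduces to the GEMM Y[M, outH*outW] = W[M, C*K*K] × col[C*K*K, outH*outW].

    #include <vector>

    // Unfold a C x H x W input into a (C*K*K) x (outH*outW) matrix so that a
    // KxK convolution with the given stride and zero padding becomes a GEMM.
    std::vector<float> im2col(const std::vector<float>& x,   // C x H x W, row-major
                              int C, int H, int W, int K, int stride, int pad)
    {
        const int outH = (H + 2 * pad - K) / stride + 1;
        const int outW = (W + 2 * pad - K) / stride + 1;
        std::vector<float> col(static_cast<size_t>(C) * K * K * outH * outW, 0.f);
        for (int c = 0; c < C; ++c)
          for (int ki = 0; ki < K; ++ki)
            for (int kj = 0; kj < K; ++kj) {
              const int row = (c * K + ki) * K + kj;            // row of the col matrix
              for (int oi = 0; oi < outH; ++oi)
                for (int oj = 0; oj < outW; ++oj) {
                  const int ii = oi * stride + ki - pad;        // source pixel
                  const int jj = oj * stride + kj - pad;
                  if (ii < 0 || ii >= H || jj < 0 || jj >= W) continue;   // zero padding
                  col[static_cast<size_t>(row) * outH * outW + oi * outW + oj] =
                      x[(static_cast<size_t>(c) * H + ii) * W + jj];
                }
            }
        return col;
    }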

TABLE II: CNN models used in our evaluation.
Network     | AlexNet | VGG-A | GoogLeNet | Mask R-CNN | DeepLab
Conv Layers | 5       | 8     | 57        | 132        | 108

B. DNN Models

We evaluate SMA's efficiency advantages in terms of performance and area. Specifically, we perform an iso-FLOP comparison of various dataflows, including the SMA, TensorCore, and TPU dataflows. We also perform an iso-area comparison to demonstrate the advantage of our temporal integration.

Fig. 7: Iso-FLOP comparison: SMA vs TC (left) and TPU (right).
Fig. 8: Iso-area comparison for regular and hybrid models.

Iso-FLOP Comparison. We first perform the iso-FLOP comparison for the SMA with the semi-broadcasted weight-stationary dataflow, TensorCore with the dot-product dataflow, and the TPU with the weight-stationary dataflow. Specifically, Fig. 7 (left) shows the square GEMM performance in the case of two SMA units (2-SMA) and four TensorCores (4-TC) per SM, which both have the same 256 FP16 units. The 2-SMA configuration achieves 30% higher performance than 4-TC and over 90% FLOP efficiency (i.e., the ratio to the theoretical peak performance) because it eliminates the RF bandwidth limitations. Fig. 7 (right) shows that the TPU dataflow is 20%-40% slower than the SMA dataflow because the former incurs a large number of shared memory bank conflicts.

Iso-Area Comparison. In the baseline architecture, the SIMD units and TC are spatially integrated, and a DNN model can only leverage one resource for acceleration. In contrast, SMA is based on temporal integration and can use all computation resources. For the iso-area comparison, we estimate that three SMA units (3-SMA) have the same area as one SIMD unit and two TCs, which add up to the area of 384 FP16 units. Fig. 8 (top) compares performance in various cases on the regular and hybrid models. The 2-SMA performance is 22% faster than 4-TC owing to the more efficient dataflow. The temporal integration leads to a 63% faster 3-SMA. We also compare the energy consumption between SMA and TC. As Fig. 8 (bottom) shows, 3-SMA (2-SMA) consumes 23% (12%) less energy than 4-TC on average, where the energy reduction mainly comes from the on-chip memory structures such as the register file and shared memory.

In summary, SMA outperforms the baseline GPU in both performance and energy efficiency for three reasons. First, it reduces memory consumption by reusing inputs and results inside the arrays. Under the same memory/register bandwidth, it performs better than the TC, consumes less memory energy (including shared memory, cache, and RF), and dedicates more energy to the useful computation. Second, SMA adopts a complex control instruction, which mitigates the overhead of instruction fetch/decode. Third, the systolic array requires less thread/warp-level parallelism because of the coordinated double buffering, which reduces cache contention. The vanilla GPU requires many more threads and thread blocks per SM to hide the memory access latency.

C. End-to-end DL Applications

We also evaluate SMA's ability to dynamically allocate resources in the autonomous driving scenario, which includes a mix of CNN and non-CNN algorithms. A prior study shows that it has three major algorithms: detection (DET), tracking (TRA), and localization (LOC) [13]. The tracking runs after the detection, and both are CNN-based. The localization runs independently and is not CNN-based. We choose the representative DeepLab [3], GOTURN [7], and ORB-SLAM [15] for them.

Fig. 9: End-to-end autonomous driving application results.

Fig. 9 (left) shows their results on different platforms. The three algorithms can occupy the entire GPU, so the frame latency equals the sum of each algorithm's latency. The GPU exceeds the 100 ms single-frame latency target owing to the slow CNN performance. The execution on SMA is similar but meets the latency target because of the faster CNN performance. The TC has a similar latency to SMA, but with DET and TRA running sequentially on the TC, and LOC running on the GPU in parallel. However, these results are based on running object detection and tracking on every frame. Prior work has suggested that running the detection only every N (e.g., 4) frames and relying on the tracking for every frame does not impact the final accuracy [23]. This dynamic optimization creates uneven demand for CNN computation, which SMA can accommodate to reduce the frame latency. Fig. 9 (right) shows that, with N = 4, SMA can reduce the frame latency by almost 50%.
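
The frame-skipping schedule just described can be summarized by the host-side loop below. The function names are placeholders for the DeepLab, GOTURN, and ORB-SLAM stages rather than an API defined in this paper; on SMA, the two CNN stages map to the systolic mode and the localization stage to the SIMD mode on the same cores.

    struct Frame { /* camera image and metadata */ };

    void run_detection(const Frame&);      // DET: CNN-based (DeepLab), heavy
    void run_tracking(const Frame&);       // TRA: CNN-based (GOTURN), runs every frame
    void run_localization(const Frame&);   // LOC: non-CNN (ORB-SLAM), independent of DET/TRA

    void process_stream(const Frame* frames, int num_frames, int stride /* e.g., N = 4 */)
    {
        for (int f = 0; f < num_frames; ++f) {
            if (f % stride == 0)
                run_detection(frames[f]);   // refresh detections only every stride-th frame
            run_tracking(frames[f]);        // propagate objects on every frame
            run_localization(frames[f]);    // always runs, concurrently with DET/TRA on SMA
        }
    }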

VI. RELATED WORK

There are many recent works on designing deep learning accelerators [4], [5], [8]. Prior work has also tried programmable acceleration through FPGAs and CGRAs [16], [22]. Those works, together with general-purpose architectures and ASICs, represent different points on the trade-off curve between generality and efficiency. Our work integrates the two extreme points in the same architecture. The recent Volta GPU spatially integrates the TensorCore accelerator [20], while we temporally integrate the SIMD architecture with the systolic array.

VII. CONCLUSION

We develop the simultaneous multi-mode architecture (SMA) with a lightweight integration approach on the GPU to achieve high programmability and energy efficiency. Using the systolic array, SMA significantly improves performance and reduces energy. It also has the same characteristics as the GPU and maintains the GPU's programmability, configurability, and generality for fast-evolving DNN workloads.

Acknowledgement. We thank the anonymous reviewers for their constructive feedback for improving the work. This work was supported by the National Key R&D Program of China (2019YFF0302600), the National Natural Science Foundation of China (NSFC) grants (61702328, 61832006, 61729202, and U1636210), and the CCF-Tencent Open Fund. Any opinions, findings, and conclusions in this paper are those of the authors only and do not necessarily reflect the views of our sponsors.

REFERENCES
[1] J. Appleyard et al. Programming Tensor Cores in CUDA 9, 2019.
[2] A. Bakhoda et al. Analyzing CUDA workloads using a detailed GPU simulator. In ISPASS, 2009.
[3] L.-C. Chen et al. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE TPAMI, 2018.
[4] Y. Chen et al. DianNao family: energy-efficient hardware accelerators for machine learning. COMMUN ACM, 2016.
[5] Y. Chen et al. Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks. In ISCA, 2016.
[6] K. He et al. Mask R-CNN. In ICCV, 2017.
[7] D. Held et al. Learning to Track at 100 FPS with Deep Regression Networks. In ECCV, 2016.
[8] N. P. Jouppi et al. In-Datacenter Performance Analysis of a Tensor Processing Unit. In ISCA, 2017.
[9] A. Krizhevsky et al. ImageNet classification with deep convolutional neural networks. In NeurIPS, 2012.
[10] H. Kung. Why Systolic Architectures? IEEE Computer, 1982.
[11] J. Lafferty et al. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. 2001.
[12] J. Leng et al. GPUWattch: Enabling Energy Optimizations in GPGPUs. In ISCA, 2013.
[13] S.-C. Lin et al. The architectural implications of autonomous driving: Constraints and acceleration. In ASPLOS, 2018.
[14] T. Mikolov et al. Recurrent neural network based language model. In INTERSPEECH, 2010.
[15] R. Mur-Artal et al. ORB-SLAM: a versatile and accurate monocular SLAM system. IEEE Transactions on Robotics, 2015.
[16] T. Nowatzki et al. Pushing the limits of accelerator efficiency while retaining programmability. In HPCA, 2016.
[17] NVIDIA. NVIDIA Volta GPU Architecture Whitepaper. 2017.
[18] NVIDIA. CUDA Toolkit Documentation v10.1. 2019.
[19] NVIDIA. CUTLASS 1.3. https://github.com/NVIDIA/cutlass, 2019.
[20] M. A. Raihan, N. Goli, and T. M. Aamodt. Modeling Deep Learning Accelerator Enabled GPUs. In ISPASS, 2019.
[21] S. Thoziyoor et al. CACTI 5.1. Technical report, HP Labs, 2008.
[22] D. Voitsechov and Y. Etsion. Single-graph Multiple Flows: Energy Efficient Design Alternative for GPGPUs. In ISCA, 2014.
[23] Y. Zhu et al. Euphrates: algorithm-SoC co-design for low-power mobile continuous vision. In ISCA, 2018.