Accelerating Bandwidth-Bound Deep Learning Inference with Main-Memory Accelerators

Benjamin Y. Cho, Jeageun Jung, and Mattan Erez1

1The University of Texas at Austin {bjcho,jeageunjung,mattan.erez}@utexas.edu

Abstract— First, DL inference queries require small-batch execution to DL inference queries play an important role in diverse internet meet tight latency constraints, leading to very tall/skinny or services and a large fraction of datacenter cycles are spent short/fat activation matrices. Such matrices offer lower local- on processing DL inference queries. Specifically, the matrix- matrix multiplication (GEMM) operations of fully-connected ity, increasing the importance of memory bandwidth. Second, MLP layers dominate many inference tasks. We find that the some recommender and language models have billions of GEMM operations for datacenter DL inference tasks are memory parameters (across numerous layers) and it is common for bandwidth bound, contrary to common assumptions: (1) strict multiple models to be colocated on a single node to improve query latency constraints force small-batch operation, which system efficiency and reduce multi-model query latency [16], limits reuse and increases bandwidth demands; and (2) large and colocated models require reading the large weight matrices [20], [33], [42]. As a result, it is common for the larger weight from main memory, again requiring high bandwidth without matrices to reside only in main memory, stressing the memory offering reuse opportunities. We demonstrate the large potential channel when executing on a CPU and often requiring low- of accelerating these small-batch GEMMs with processing in the bandwidth host-device transfers in systems with accelerators. main CPU memory. We develop a novel GEMM execution flow Our experiments demonstrate that these GEMM operations and corresponding memory-side address-generation logic that exploits GEMM locality and enables long-running PIM kernels are in fact bandwidth-bound on both CPU and GPU systems, despite the complex address-mapping functions employed by the and describe how they can be accelerated with processing CPU that would otherwise destroy locality. Our evaluation of in/near main memory (PIM). StepStone variants at the channel, device, and within-device We present StepStone PIM, which is integrated within the PIM levels, along with optimizations that balance parallelism CPU main memory system and solves the dual challenges benefits with data-distribution overheads demonstrate 12× better minimum latency than a CPU and 2.8× greater throughput of utilizing available GEMM locality and sharing data with for strict query latency constraints. End-to-end performance the CPU under its sophisticated XOR-based DRAM address analysis of recent recommendation and language models shows mapping scheme. Hence, StepStone is an appealing datacenter that StepStone PIM outperforms a fast CPU (by up to 16×) solution because it: (1) better utilizes bandwidth within the and prior main-memory acceleration approaches (by up to 2.4× memory system; (2) utilizes locality, enabling high perfor- compared to the best prior approach). mance and efficiency for datacenter DL inference GEMM operations; (3) does not require additional memory devices I.INTRODUCTION or capacity, avoiding the exorbitant cost of additional memory With the evolution of deep learning (DL), artificial - and taking advantage of the already-memory resident matrices; ligence is being widely used in many internet services. We and (4) offloads a low-performance workload from the CPU, describe a new approach for reducing the latency of such DL freeing additional execution capacity for colocated tasks. 
inference tasks by accelerating their fully-connected layers This unique set of StepStone capabilities is, to the best of arXiv:2012.00158v1 [cs.AR] 30 Nov 2020 with a processing in/near memory (PIM) approach. Park et our knowledge, not available in any prior PIM architecture al. [35] report that for important personalized recommendation and research, including in recent work that targets datacenter and natural language DL inference workloads, a large fraction DL inference or processing in main memory. While recent of DL-related data-center cycles (42%) are spent executing work explored PIM-acceleration for datacenter DL inference, fully-connected (FC) layers in Facebook data centers. it focuses on the embedding layers of DL-inference [20], [25] FC layers are executed as matrix-matrix multiplication rather than on the MLP GEMM operations, which require operations (commonly referred to as GEMM kernels) and a different approach for exploiting locality. Prior work that these GEMMs dominate the overall execution time of some considers integrating PIM accelerators within main memory workloads [15], [35]. GEMMs are commonly considered either requires costly data replication to avoid the DRAM compute rather than bandwidth bound based on decades of address mapping challenge [4], [5], [12] or does not offer the scientific-computing and DL training experience. However, we mechanisms to exploit GEMM locality [3], [9], [20], [23]. observe that DL inference GEMMs exhibit two unique traits We choose a straight-forward PIM for that leave them memory-bandwidth bound in many cases, and StepStone that follows recent research trends. Our contribu- thus amenable to PIM acceleration. tions instead lie with four key innovations. The first is the StepStone PIM GEMM parallelization and execution flow II.MOTIVATION AND CHALLENGES that is cognizant of the XOR-based DRAM address mapping Bandwidth-bound GEMMs. Matrix-matrix multiplication that otherwise break GEMM locality. The second contribution (GEMM) is commonly regarded as compute bound. However, accelerates the localization and reduction operations of the we observe that GEMM becomes bandwidth-bound and ex- execution flow without consuming CPU core resources. The hibits low CPU/GPU utilization when both: (1) one of the third contribution enables long-running locality-conserving two input matrices is much larger than the other (e.g., A is PIM GEMM kernels with the new StepStone memory-side large while B is “tall and skinny”) and (2) the large input address generation logic. Long-running kernels relieve PIM matrix is in main memory. While rare in traditional linear pressure on the memory command channel, enabling high- algebra applications, DL inference tasks in datacenters often performance colocated CPU tasks. meet both conditions. The fourth contribution is identifying and exploiting a First, DL inference queries have tight latency constraints new tradeoff opportunity in balancing the performance ben- that require small batches [35]. The corresponding GEMM efits of parallelization across fine-grained PIM units (PIMs) operations in fully-connected layers therefore multiply a large within DRAM with the data-transfer overheads of the lo- weight matrix and a small input matrix. Second, the MLP calization/replication and reduction operations necessary for weights are often only found in main memory because either high parallelization. 
We explore this tradeoff by evaluating the total size of the MLP parameters exceeds capacity channel-, device-, and bank group-level StepStone PIMs. (e.g., in recent language models [7], [21], [37]) and/or multiple To summarize our contributions: models are colocated on a single node [16]. • We identify and demonstrate that small-batch GEMM The resulting matrix sizes (Table I) are executed ineffi- operations of DL datacenter inference workloads are ciently on CPUs and GPUs as shown by the roofline analysis bandwidth bound on CPUs and GPUs, and can hence presented in Figure 1. Each point in the figure corresponds benefit from PIM-acceleration (Section II). to the performance measured on a 2.7 GHz 28-core Intel • We develop the novel StepStone PIM GEMM execution Cascade Lake Xeon CPU or an NVIDIA Titan Xp GPU when flow that is cognizant of the complex CPU address multiplying a memory-resident 1024×4096 matrix by a cache- mapping, thus exploiting GEMM locality and improving resident 4096×N matrix, where N represents the batch size. performance by 35 − 55% over a prior PIM architecture The left-most point for each system is when N = 1 and each that supports complex address mappings [9]. point moving right represents a doubling of N. We observe • We accelerate the localization and reduction operations that all three systems are bandwidth bound for inference- of our new GEMM flow at the CPU appropriate batch sizes (N . 32). Further, for such small to improve performance by up to an additional 40%. batches, GPU performance is lower than the CPU if matrix A • We design the novel memory-side StepStone address is in host memory because of the slow host-device . generator that enables long-running GEMM kernels to We conclude that processing in/near memory (PIM) is minimize command-channel usage, which improves PIM appealing for these GEMM operations of datacenter DL- performance by 5.5× when the CPU executes concurrent inference workloads. memory-intensive tasks. • We identify a new tradeoff opportunity in determining PIM GEMMs with XOR-based DRAM address map- whether to target channel-, device-, or bank group-level ping. We target systems in which main memory is PIM PIMs and show benefits of up to 35% in exploiting it. enabled, implying a shared DRAM address space with the • We present a detailed StepStone PIM evaluation, includ- CPU. The CPU relies on sophisticated XOR-based DRAM ing end-to-end performance analysis and conclude that address mappings to exploit bank and channel parallelism by StepStone is an appealing datacenter solution because of distributing consecutive cache blocks in the physical address its low cost (no additional memory devices or capacity), space (PA) across DRAM banks and channels. As a result, its potential for lower latency and higher throughput, and matrices that are contiguous in the application virtual space its ability to dynamically support the execution of larger- are distributed across PIM units (PIMs) in complex patterns. batch and colocated tasks on the CPU. Combining all our innovative mechanisms, StepStone is able TABLE I: Common DL-inference GEMM dimensions. 
to substantially outperform a CPU when executing GEMM Model Description Weights Batch Size operations on matrices with dimensions typical in datacenter MLP 1024×4096 DL inference workloads: (1) StepStone offers 12× lower BERT MLP 4096×1024 Projection 1024×1024 LM 1-8 [35] minimum GEMM latency for these matrices; (2) 77× higher MLP 1600×6400 throughput under the strictest latency constraints that corre- GPT2 MLP 6400×1600 spond to batch-1 on the CPU but if the CPU is allowed 20% Projection 1600×1600 Bottom MLP 2560×512 additional latency for batch-32 execution, the performance DLRM Bottom MLP 512×32 RM 1-256 [35] benefit drops to 2.8×; and (3) up to 16× lower end-to-end DL (RM3) Top MLP 512×128 inference latency compared to measured CPU performance. Top MLP 128×1

2 100000 packets severely interfere with CPU memory traffic [9]. 10000 The third approach, proposed by Cho et al. [9], aligns long

1000 vector PIM operands in memory such that all kernel operands Data loading 100 overhead follow the same interleaving pattern after the XOR address mapping. In this way, both the CPU and the vector-oriented 10 GPU CPU PIM can the same data. However, this vector-oriented 1 CPU (weight in main memory) PCIe approach cannot exploit the GEMM kernel locality. Vector-

Performance (Gflops/s) Performance 0.1 GPU (weight in device memory) GPU (weight in main memory) oriented execution splits the GEMM into multiple matrix- 0.01 vector (GEMV) operations, requiring a larger number of kernel 0.01 0.1 1 10 100 1000 10000 Operational intensity (Flops/byte) invocations. The straightforward implementation also requires copies across PIM units to ensure all data is local. The stan- Fig. 1: CPU (Intel Xeon Platinum 8280) and GPU (NVIDIA dalone (non main-memory) Newton PIM accelerator [17] also Titan XP) roofline modeling when executing bandwidth-bound follows this approach. We observe that a different execution GEMM operations of a memory-resident 1024 × 4096 weight flow can be used to block both the input and output matrices to matrix with a 4096×N matrix; N is swept from 1 − 1024 in reduce copy overhead. We explain our StepStone PIM GEMM powers of 2 moving from left to right. in the following section.

III.STEPSTONE PIM Effective GEMM execution requires exploiting locality and reuse in matrix blocks, which are challenging to identify. StepStone PIM enables independent GEMM execution with Figure 2 illustrates this challenge for the toy address map- PIMs under any XOR-based memory-system address map- ping of Figure 2a targeting a system with 4 PIM units (one per ping. In StepStone PIM, the weight matrix is partitioned and rank). Addresses refer to elements of the large matrix shown in assigned to PIMs based on the underlying address mapping, Figure 2b, which is laid out row major in contiguous memory maintaining internal contiguity and enabling temporal locality addresses. Logical blocks of the matrix do not form blocks when each PIM unit works on its GEMM blocks. From the within each PIM. For example, element 0e of the weight CPU perspective, the PIMs appear to skip within the address matrix (marked in black) is mapped to PIM0 and is multiplied space and only step on those “stones” (i.e. cache blocks) that by elements p and q from the input tensor to modify elements are physically local to them. x and y of the output tensor. These same elements of the A. StepStone Architecture input tensor are also needed when reading element 5e of The StepStone PIM architecture is, for the most part, a the weight matrix and the same two output-tensor elements standard PIM. The innovation lies in how we map GEMM when reading weights 00, 03, 05, 06, 08, 0b, and 0d. operations and in our unique address-generation algorithm, Utilizing this locality requires the PIMs to correctly map both discussed later in this section. StepStone is comprised between contiguous DRAM addresses within each PIM and of a host-side PIM controller that interfaces with PIM units the corresponding addresses of the input and output tensors. (PIMs) through the memory channel to control their oper- Prior approaches to this challenge fall into one of three cate- ation using memory-mapped PIM-side registers. As shown gories. The first avoids the challenge altogether by maintaining in Figure 3a, PIM units (PIMs) can be integrated with: (1) a copy of the data that is stored in a PIM-friendly layout DRAM itself, e.g., at the bank-group level (StepStone-BG); and not accessed by the CPU [17], [22], [25]. This either (2) with a memory module, e.g., associated with each DRAM duplicates substantial data arrays (possibly > 100GiB) [7], device or buffer chip (StepStone-DV);1 and/or (3) with a [32], [37] or prevents the CPU from assisting with requests memory channel controller (StepStone-CH). We consider all that can tolerate higher response latency [15]. Furthermore, three integration levels. Note that StepStone-BG accounts for a different layout is needed for channel-, device-, and bank device-level timing parameters such as tRCD and tFAW using group-level PIMs. This either forces even more replicas or control logic at the I/O port of each device. prohibits optimizations that dynamically choose the PIM level Each PIM unit (Figure 3b) includes SIMD/vector units, based on input characteristics (e.g., as in the XLM language , interfaces, control logic to execute the model [10]). GEMM kernel command (sub-GEMM to be more precise), The second approach requires the CPU to transfer this and a local-. The pipeline is sufficiently correspondence information to the PIMs for each cache block deep to hide address generation and access latencies (20 stages the PIM processes [3], [23]. The CPU sends special DRAM in our case). 
When N (e.g., the batch dimension) is large, command packets that include operand and opcode informa- performance is bound by the SIMD width. While wide SIMD tion to each PIM unit and all the transactions related to PIM units are desirable, arithmetic performance must be balanced execution are controlled by the host. PIMs are isolated from with area and power constraints. the address mapping concerns, but performance scalability is Following prior work, we aim for 0.15mm2 for each poor because: (1) channel bandwidth for sending PIM com- StepStone-BG unit [14] and 1.2mm2 for StepStone-DV [5]. mand packets saturates rapidly, (2) CPU resources are required for address generation, and (3) the frequent PIM command 1This is a cheap rank-level PIM with no inter-device communication.

3 Input dimension CPU p Input tensor Batch size DRAM row ID DRAM column ID q ✕ CPU

Output dimension Cores PIM ID 0 (RK) 0 3 5 6 8 b d e x y Mem Ctrl Mem Ctrl Tensor 6 5 4 3 2 1 0 2122 24 27 292a 2c 2f Output Weights = PIM 0 PIM 2 50 53 5556 58 5b 5d5e PIM ID 1 (CH) PIM 1 PIM 3 7172 74 77 7980 8c 8f (a) (b) (c) Fig. 2: An example of bandwidth-bound GEMM operation with PIM and a toy XOR-based address mapping: (a) toy XOR- based physical-to-DRAM address mapping where addresses refer to contiguous row-major matrix elements; (b) layout of an 8 × 16 matrix with colors indicating element→PIM unit mapping; (c) example system with rank-level PIMs.

Host B. StepStone GEMM Execution Bank Bank CPU Device Rank Rank Rank Rank 0 GEMM execution starts with the large weight matrix A Cores Bank Bank 0 1 2 3 ② Bank Bank stored contiguously in the virtual and physical address spaces StepStone-DV Device Bank Bank 2 Controller 1 in row-major order. Therefore, A is distributed across memory ③ SS-BG SS-BG I/O ports + PIM ctrl Device devices based on the DRAM address mapping as shown in CPU interface Mem. Access SS-BG SS-BG DRAM 2 PIM ctrl signal Interface Bank Bank Figure 4 (for the Intel Skylake address mapping [36] on

CPU … Copy … Memory Bank Bank Engine StepStone-BG and depicting only the elements of A that map Status update Controller Bank Bank PIM Device Bank Bank to PIM0, which we refer to as partition A0). A0 must be Controller ① StepStone-CH 7 Bank groups multiplied with elements of B and accumulated into elements CPU-side PIM and memory controller DIMM DRAM device of C. To maximize parallelism, we first localize private copies (a) Baseline PIM system. of B and C into each PIM unit, also shown in the figure. Localizing data copies the data into a pre-allocated per PIM- Host memory operations unit memory region. Execution then proceeds with a partial Memory interface dot product across rows of A0 with columns of B (B is shown DRAM Opnd. Ctrl. logic Vector transposed in the figure). Control/Status unit Addr. However, recall the dual challenges of identifying the cor- Gen. Reg. … Scratchpad responding indices in B and C as A0 is advanced while maxi- mizing reuse during execution. We address these challenges Host interface grouping together cacheline-sized memory blocks (“cache (b) StepStone PIM architecture. blocks”) into block groups that follow the same DRAM ad- Fig. 3: Overview of the StepStone PIM System. dress mapping distribution. We note that the grouping depends both on the address mapping and the matrix dimension. Each block group is shaded a different color in Figure 4b. We aim for a 2 : 1 ratio between SIMD area and scratchpad StepStone locality. To maximize reuse, each element of B area and assume additional logic is of comparable size to the should be multiplied with as many elements of A before scratchpad. We estimate functional unit and scratchpad area overwritten in the buffer. We achieve this by executing one and power at the device-level with the values reported for block group at a time: cache blocks within each group across iPIM [14] and at the device and channel levels following the rows reuse elements of C while those along columns reuse methodology of Lym et al. [29]. This analysis yields nominal elements of B. No reuse exists between groups. values of 8-wide SIMD with 8KB scratchpad capacity for The number of groups required to maximize locality is each StepStone-BG unit (4 PIMs per DRAM device), and 32- determined by the number of PIM ID bits that are impacted wide SIMD with 32KB scratchpad capacity per StepStone-DV by addresses within the matrix. For example, the matrix in PIM unit. For StepStone-CH, we keep the same bandwidth Figure 4 is 16×512 4B words and starts at physical address to arithmetic performance ratio as StepStone-DV: 256-wide 0, thus locations within this matrix span the lower 15 address SIMD units and 256KB scratchpad capacity. We consider all bits. Within these bits, bits 7 and 14 determine one bank-group three cases and conclude that StepStone-CH offers the lowest bit (BG0, which is also PIM ID bit 0) and bits 8, 9, 12, and 13 performance and requires the largest die area. affect the channel bit (PIM ID3). The other PIM ID bits are One other component is the replication/reduction unit within the PIM controller, which is used to accelerate the distribution 2We assume that the matrix dimensions are powers of two; matrices with of matrix B and reduction of partial values of C, which are non-power-of-two dimensions are either padded or execution is parti- required for the GEMM execution described below. tioned/serialized into smaller, power-of-two matrices.

4 PIM ID (0): 0 0 0 0 MROW MCOL 1 BG0 Instant & iterative correction BG1 10 9 8 7 6 RK 0 0 0 0 1 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 - 0 +1 CH Group ID (0): 0 0 19 18 14 13 12 Iter. 1 0 0 0 1 0 Instant GP0 Carry correction (1 → 0) GP1 0 0 0 0 0 No check (a) PIM and group IDs under Skylake address mapping. MCOL and MROW are the row and column index values of matrix A. +1

Matrix B (Localized) Shared by 4 PIMs Iter. 2 0 0 1 0 0 (Replication) Instant N correction (01 → 11) Shared by 4 PIMs 0 1 1 0 0 Check & ✕ (Reduction) Complete 0 1 2 3 4 5 6 7 8 9 a b c d e f 0 1 2 3 4 5 6 7 2 Carry forwarding (if M >= 128) 1 8 9 a b c d e f 1011 1213 1415 1617 1819 1a1b 1c1d 1e 1f 20 19 18 17 16 15 14 13 2 1011 1213 1415 1617 1819 1a1b 1c1d 1e 1f 0 0 0 0 0 0 0 1 M 0 1 2 3 4 5 6 7 = 3 8 9 a b c d e f +1 0 1 2 3 4 5 6 7 Execution order 8 9 a b c d e f PIM ID[2] 4 1011 1213 1415 1617 PIM ID[1] 1819 1a1b 1c1d 1e 1f Carry 1011 1213 1415 1617 Group ID[1] 1819 1a1b 1c1d 1e 1f forwarding

K N 0 0 0 1 0 0 0 0 Matrix A Matrix C (Shared) (Locallized) (b) An example of grouping for PIM0 ([M, K, N] = [16, 512, 4]). (c) StepStone address generation mechanisms Each square represents 16 32-bit words (1 cache block). Fig. 4: Overview of GEMM execution with StepStone PIM.

fixed for all locations within this matrix. We further note that Input tensor Outer loop PIM 0 a group spans entire rows to maximize locality. We therefore Group 0 Group 1 a3 b3 c3 d3 e3 f3 g3 h3 a0 a1 a2 a3 e0 e1 e2 e3 exclude address bits associated with matrix columns (MCOL) a2 b2 c2 d2 e2 f2 g2 h2 b0 b1 b2 b3 f0 f1 f2 f3 a1 b1 c1 d1 e1 f1 g1 h1 PIM 8 from defining the group ID bits (GP0 and GP1 in the figure). a0 b0 c0 d0 e0 f0 g0 h0 Group 0 Group 1 Inner loop ✕ e0 e1 e2 e3 a0 a1 a2 a3 Localizing matrices B and C. Matrix B is (partially) repli- f0 f1 f2 f3 b0 b1 b2 b3 0 0 1 1 8 8 9 9 PIM 1 cated and localized to the different PIMs before execution 0 0 1 1 8 8 9 9 Group 0 Group 1 8 8 9 9 0 0 1 1 c0 c1 c2 c3 g0 g1 g2 g3 begins, and the localized partial-result C matrices are reduced 8 8 9 9 0 0 1 1 d0 d1 d2 d3 h0 h1 h2 h3 8 8 9 9 0 0 1 1 PIM 9 after the GEMM. The replication and reduction, along with 8 8 9 9 0 0 1 1 Group 0 Group 1

Middle loop 0 0 1 1 8 8 9 9 g0 g1 g2 g3 c0 c1 c2 c3 data reorganization for spatial locality within each PIM unit 0 0 1 1 8 8 9 9 h0 h1 h2 h3 d0 d1 d2 d3 are handled by the host and accelerated with a simple DMA Weights Reorganization table engine at the PIM controller. The operation of this engine is Fig. 5: Input-matrix Reorganization. illustrated in Figure 5 for localizing matrix B for a portion of matrix A that is distributed across PIMs 0, 1, 8, and 9. Matrix B is again represented transposed in the figure and consecutive A is very large such that not all elements of B and C that elements in each of its rows appear as columns (e.g., a0 - a3). correspond to a group can be buffered within the scratchpad. During localization, the engine reorganizes the input matrix In such cases, to still utilize locality, we further block each for each PIM unit such that accesses are sequential during group. This blocking can be across rows, maximizing reuse its group-based execution. The outer-most loop iterates over of C, and/or across columns. We process blocks of rows first columns of A, localizing the rows of B (appear as columns in because C offers greater reuse as it is both read and written. BT ) needed for each column in the PIMs and block groups The inner-most GEMM call is coarse-grained for the full it maps to. The PIM and group IDs are computed based on StepStone PIM with the mapping-aware AGEN unit, but is the mappings illustrated in Figure 4a. Each cache block of split into multiple dot product operations without this innova- B is read once and then copied to all its relevant PIM-local tive AGEN logic. In this way we can isolate the contributions addresses. Reductions follow a similar execution flow. of our algorithm mapping and hardware mechanism when evaluating StepStone PIM compared to prior PIM architec- C. Overall Execution Flow of StepStone GEMM tures, like Chopim [9] (we denote our StepStone GEMM flow The execution flow of complete GEMM is shown in Algo- on Chopim as enhanced Chopim or eCHO). rithm 1. After localization, the input matrices are all aligned Note that address generation with partitioning is slightly in the DRAM accessible by each PIM unit and execution different than as described for unpartitioned groups execution. proceeds in group order. A slight complication arises when When crossing different column partitions (groups of columns

5 that partition a row into multiple blocks), address generation Algorithm 1: Group-based GEMM with partitioning. must skip over those columns belonging to different partitions. localize(B); This is simple to do and only requires modifying the address- localize(C); generation rules to account for group and partition ID. for rpart in row partitions do D. StepStone Address Generation buffer fill(C); for grp in block groups do Within a single cache block, the address is a simple incre- for cpart in col partitions do ment, but once the value of a bit that determines the PIM ID buffer fill(B); is modified, that contiguous physical address must be skipped if StepStone then until the next physical address that maps to the same PIM StepStone GEMM; unit and block group is reached. A simple iterative approach else if eCHO then of incrementing the address until the address is again within for row in cpart do this same block and PIM ID, the number of iterations required DOT(row); when the number of PIMs is large introduces bubbles into the buffer drain(C); execution pipeline and degrades performance.3 reduce(C); We propose new increment-correct-and-check AGEN logic that skips to the next closest address with the same PIM and group IDs (after the simple increment falls outside the target IDs). We do this by ensuring that those address bits that interesting tradeoff emerges as bandwidth constraints decrease are XORed to determine each ID bit always maintain their with somewhat larger batches. The arithmetic intensity (data parity. We can thus skip incrementing bits that are lower than resuse per operation) in StepStone scales with the batch size the lowest ID-affecting address bit. The AGEN logic iterates (N) up to the SIMD width of each PIM unit. This results over the ID-affecting bits (from LSB to MSB), each time in comparable arithmetic execution times for 1 ≤ N ≤ 16 incrementing the next ID-affecting bit and checking whether in StepStone-BG and for 1 ≤ N ≤ 32 in StepStone-DV the PIM and group IDs match their target values. (though obviously the execution times differ between the two The number of iterations is limited to the number of ID- PIM levels). At the same time, the overheads of localization affecting bits, but can be further reduced with two additional and reduction increase with N and with the number of PIMs rules. The first rule applies for adjacent address bits that (number of block groups). both affect the same ID bit. When the lower of the two is An optimization opportunity emerges for choosing between incremented, the upper must be as well to maintain parity. bank-group and device level PIMs as a result. The best PIM This can be done directly, saving one iteration. The second rule level depends on the matrix dimensions and the address map- applies for chains of contiguous address bits that each affect a ping as these determine the number of block groups, and hence different ID bit. In this case, when the first is incremented, the the localization and reduction overheads. We demonstrate this carry it propagates will have to corrected in multiple iterations tradeoff in Section V and generally find that StepStone-BG is to maintain the parity corresponding to each bit in the chain. best when N ≤ 16. Note that we do not discuss the algorithm Thus, the chain can be skipped with the carry simply directly for choosing the PIM level, but note that a simple heuristic that propagated to the next-higher address bit. 
These rules are estimates execution times and overheads based on available shown in Figure 4c. The top part of the figure illustrates the bandwidth and transferred data volumes works well. first rule of instantly correcting from 01 to 11, and the bottom Small weight matrices. A similar tradeoff between arithmetic part illustrates the second rule of forwarding the carry from performance and localization and reduction overhead exists bit 13 to 17 since bit 14-16 affect different ID bits. when the matrices are relatively small. In such cases, it is E. Optimizations preferable to only use a subset of the available PIMs, trading off lower arithmetic performance for reduced overheads. We Multiple optimizations over the basic flow described above show that this optimization can yield a ∼ 25% performance are possible, including fusing multiple kernel executions for improvement in some cases (Section V-D). Another optimiza- matrices that are not powers of two, balancing parallel ex- tion for relatively small matrix A is that CPU can directly ecution with the overheads of localization and reduction, write to and read from PIM’s scratchpad memory when the and choosing a PIM level for execution (StepStone-BG vs. matrix B and C fits in it. This reduces the time to move data StepStone-DV). For brevity, we only discuss the latter two. between DRAM and the scratchpad memory. Choosing the PIM level. Bank-group level StepStone-BG Optimizing execution to only utilize a subset of the PIMs offers the highest potential performance when the GEMM comes with additional considerations when allocating memory is severely bandwidth bound (very small batches) because it for the large distributed input matrix A. Specifically, A must accesses underutilized bandwidth within a DRAM device. An remain contiguous in yet be mapped to just a subset of the PEs. Because the address mapping and size of the 3We assume that the CPU address mapping is available for PIMs either by matrix is known, it is possible to allocated physical memory reverse engineering, by CPU vendors building the PIMs, or by agreement. frames to satisfy this mapping constraint as long as the PIM

6 ID bits that are ignored (for subsetting) are not affected by TABLE II: Evaluation parameters. virtual-address offset bits. In other words, this is possible with PIM configurations base pages only (e.g., 4KB pages). Enforcing the mapping StepStone-BG 8-width SIMD, 8KB scratchpad (per DRAM device), 1.2GHz constraints to maintain alignment with the PIMs can be done StepStone-DV 32-width SIMD, 32KB scratchpad (per buffer chip), 1.2GHz using the proposed coloring interface introduced by Cho et StepStone-CH 256-width SIMD, 256KB scratchpad (per channel), 1.2GHz al. [9] and by modifying the application’s memory allocator. Address mappings For example, if the goal is to execute on half the PIMs ID = 0 Exynos-like address mapping (modified) of StepStone-BG with the Skylake mapping, we keep BG0 1 Haswell-like address mapping (modified) fixed for the entire physical allocation of A. This is achieved 2 Ivy Bridge-like address mapping (modified) by allocating virtual memory at 32KB granularity rather than 3 -like address mapping (modified) the minimum 4KB granularity. Additionally, we must ensure 4 Skylake address mapping (baseline) that contiguous virtual addresses remain aligned in the DRAM DRAM timing parameters (DDR4-2400R, 4GB, x8) space and therefore must also ensure that the other PIM ID bits tBL=4, tCCDS=4, tCCDL=6, tRTRS=2, tCL=16, tRCD=16, follow a consistent mapping. We do that by coloring those bits tRP=16, tCWL=12, tRAS=39, tRC=55, tRTP=9, tWTRS=3, in a way that the OS-level frame allocator maintains alignment, tWTRL=9, tWR=18, tRRDS=4, tRRDL=6, tFAW=26 as proposed for Chopim [9]. Energy components In-device RD/WR (11.3pJ/b), Off-chip RD/WR(25.7pJ/b)s, IV. METHODOLOGY CH/DV/BG SIMD (11.3,11.3,11.3nJ/op), CH/DV/BG Scratchpad (0.03/0.1/0.3 nJ/access) System configuration. Table II summarizes our system con- DL inference parameters figuration and DRAM timing parameters. Our DRAM model RM3, Bottom MLP (2560-512-32), RM DLRM [32] is based on the DDR4 model of the Ramulator detailed DRAM Top MLP (512-128-1), bsz=4 simulator [24]. We implement StepStone-CH, StepStone- Text classification (WNLI), 24 transformer blocks, DV, and StepStone-BG PIMs inside Ramulator with detailed BERT [21] MLP (1024-4096-1024), seq. length=8, bsz=4 pipeline and timing models. We emulate our memory allocator #attention heads=16 Text generation, 48 transformer blocks, and add an address translation engine into the PIM controller GPT2 [37] LM MLP (1600-6400-1600), seq. length=8, bsz=4 on the CPU (III-A); address translation is infrequent (once per Text generation, 12 transformer blocks, XLM [10] coarse-grained PIM command) because contiguous physical MLP (2048-8192-2048), seq. length=8, bsz=4 regions are allocated for PIM execution. To validate our execution flow, we modify Ramulator to read from and write values to memory and check the final output against pre- activations as matrix B and C, respectively, and we refer to calculated results. We also compare all addresses from the the largest matrix as weight matrix A. StepStone AGEN logic with a pre-generated address trace for each PIM. For all the GEMM operations with StepStone PIM, We demonstrate the benefits of long-running kernels for we assume the input activations reside in the CPU caches and concurrent CPU and PIM execution in a colocation scenario are therefore localized to the active PIMs. In the same way, of the default 1024 × 4096 GEMM with a mix of the mcf, we assume the final GEMM results are reduced by the CPU. 
lbm, omnetpp, and gemsFDTD from SPEC CPU 2017 [34]. We use the XOR-based address mappings described in The CPU applications are modeled with gem5 [6] (4 OOO DRAMA [36], acquired by reverse-engineering Intel and 4GHz cores with fetch/issue width of 8, a 64-deep LSQ, Samsung CPUs. To show the impact of address mapping on and a 224-entry ROB), similarly to [9]. the same DDR4 model, we modify the address mapping of Comparisons. We compare our approach with two existing Exynos, Dual-socket Haswell, Ivy Bridge, and Sandy Bridge PIM approaches, PEI [3] and Chopim [9], which are capable based on the randomized method (PAE) proposed by Liu et of accelerating GEMMs in main memory and leveraging multi- al. [26]. By default we we use Skylake’s address mapping. level PIMs with one data layout. To make a fair comparison, To measure GEMM performance on real machines, we use an we use our baseline PIM system (Figure 3) for all approaches Intel Xeon Platinum 8280 (CPU) with Intel oneDNN [1] and and only vary localization/reduction mechanisms and PIM an NVIDIA TitanXP (GPU) with CUTLASS [2]. kernel granularity. For PEI, each PIM instruction is used to The area and power of the SIMD units are estimated based process one cache block and all the other operands needed on Lym et al. [29] for StepStone-DV and StepStone-CH and for executing the PIM instruction is written to scratchpad iPIM [14] for the SIMD units of StepStone-BG. We use Cacti memory by the CPU. We evaluate two versions of Chopim. 6.5 [31] to estimate the area and power of scratchpad memory. The baseline naive Chopim (nCHO) follows the GEMV map- Workloads. We choose matrix sizes and aspect ratios to ping approach (Section II). We also use our newly-developed clearly show their impact on performance. By default, we use StepStone flow with an “enahnced” Chopim (eCHO). This 1024 × 4096. For end-to-end performance evaluation, we use eCHO configuration exploits locality as well as StepStone the 4 different DL models summarized in Table II. Since our PIM, but requires more frequent kernel calls (Algorithm 1) and goal is to accelerate GEMMs with small batch sizes, we vary does not use accelerated localization and reduction operations. the batch size from 1 to 32. We refer to input and output

7 GEMM Buffer fill (B) Buffer fill (C) Buffer drain (C) Localization Reduction CPU time 100000 2.1E+06 1.8E+06 10000 1.5E+06 1.2E+06 1000 9.0E+05 6.0E+05 100

DRAM DRAM Cycles 3.0E+05 0.0E+00 10 1 4 16 32* 32 1 4 16 32* 32 1 4 16 32 1 4 16 32 CPU (weight in main memory) BG-level PIM DV-level PIM CH-level PIM CPU 1 GPU (weight in device memory) GPU (weight in main memory) Fig. 6: GEMM Latency comparison between different PIM Performance (Gflops/s) 0.1 StepStone-BG (weight in main memory) StepStone-DV (weight in main memory) 0.01 options of StepStone PIM and the CPU. The configurations 0.01 0.1 1 10 100 1000 10000 with relaxed area constraints are labeled with * (i.e. enough Operational intensity (Flops/byte) ALUs and large enough scratchpad memory). Fig. 7: Roofline models for CPU, GPU, and StepStone PIMs; × V. EVALUATION RESULTS measured results are for a 1K 4K weight matrix for varying batch sizes (the left most point of each system is for batch-1 In this section, we demonstrate the throughput and latency and the batch is 2× larger for each point moving to the right). benefits of StepStone over either a CPU or GPU, evaluate the impact of address mapping and scratchpad capacity, and addition to its latency benefits) at all reasonable batch sizes. analyze the tradeoff between arithmetic performance and over- In fact, the CPU and GPU offer an advantage only once the heads as the number of active PIM units (PIMs) is varied. batch is 256 samples or greater. Even if the model fits in GPU A. StepStone PIM Performance Benefits memory, StepStone offers higher performance for batches of 16 samples or less. The gap between the rooflines and sim- We first compare the performance of StepStone PIM to a ulated performance of StepStone stems from the localization 2.7GHz 28-core Intel Xeon Platinum 8280 Cascade Lake CPU and reduction overheads. with a representative 1024 × 4096 weight matrix (Figure 6). We emphasize that StepStone PIM achieves this high per- StepStone PIM offers tremendous benefits for latency and formance benefits without utilizing CPU or GPU compute throughput targets. When considering the minimum latency resource, such as cores or caches. This implies that the overall batch-1 execution, StepStone-BG with a PIM unit per bank system performance can increase even further by colocating group offers far superior latency: 2.8× better than the device- additional tasks on the same node. level StepStone-DV and 12× better than the CPU. Alternatively, we consider maximum throughput under a B. End-to-End Performance latency constraint. When the latency constraint is set to the Figure 8 compares the inference performance of one recom- minimal latency of the CPU (CPU with batch-1), StepStone- mendation system and three language models with different DV offers 77× higher throughput (32× more samples at about PIM approaches—PEI, Chopim, and StepStone PIM—to that 40% less time). If we allow a larger-area PIM with larger of a CPU. For the PIM approaches, we assume the same scratchpads, performance is improved even further to 96×. If PIM system (Figure 3) and that GEMMs can be executed by we relax the latency constraint and allow the CPU 1.2× more either the CPU, device-level (PIM DV), or BG-level PIMs time to complete an inference task, which allows batch-32 on (PIM BG); the best performing option is chosen for each the CPU, the performance benefit of StepStone-DV drops to GEMM. GEMMs are used for FC and projection layers. All 3× (3.5× with a larger scratchpad). While we use the highly- other operations, including concatenation, GELU, softmax, optimized Intel OneDNN library on the CPU, the performance sorting, batched GEMM, and some data reorganization (e.g. 
we observe falls short of the channel-level StepStone-CH, stack) operations are executed on the CPU (CPU Other). which can fully utilize the memory-system bandwidth. Still, We show the performance of two different CPU models: the finer-grained StepStone-DV (which can be implemented measured on the real system (CPU) and idealized CPU in buffer chips) offers substantially better performance and performance (iCPU). We estimate idealized performance with StepStone-BG offers far lower minimum latency. our StepStone-CH, which maximally utilizes memory channel Throughput rooflines. The throughput benefits of StepStone bandwidth. Overall, the measured results show that the CPU are also apparent on the roofline plot presented in Figure 7, performs poorly on small-matrix GEMMs. also for a 1024 × 4096 weight matrix. The plot includes the Naive Chopim (nCHO) executes GEMMs as multiple CPU as above, StepStone-BG and -DV (the maximum of the GEMV operations, which leads to missed locality oppor- two represents the achieved performance with StepStone), and tunities. On the other hand, if Chopim is enhanced with the performance obtained with an NVIDIA Titan Xp (running StepStone block grouping (eCHO) and divides each GEMM CUTLASS) when the model is already present in GPU device into smaller dot-product operations, it benefits from better memory or when it must first be read from CPU main memory. PIM buffer locality and the overhead for buffer fill/drain In the realistic scenario where GPU memory capacity is significantly decreases. However, compared to StepStone PIM, too small to accommodate the full recommender system and eCHO suffers from higher localization/reduction overhead. We language models, StepStone exhibits higher throughput (in evaluate a low-power StepStone PIM mode (STP*), where

8 PIM_DV PIM_BG CPU_GEMM CPU_Other 1.5E+06 6.0E+06 1.2 8.4 3.1 2.8 7.2 1 1.0E+06 4.0E+06 0.8 5.0E+05 2.0E+06 0.6

0.4 DRAM Cycles 0.0E+00 DRAM Cycles 0.0E+00 0.2

0 Naive Naive Naive AGEN AGEN AGEN Naive Naive Naive AGEN AGEN AGEN PEI PEI PEI PEI STP STP STP STP CPU CPU CPU CPU iCPU iCPU iCPU iCPU STP* STP* STP* STP* BG DV CH BG DV CH eCHO eCHO eCHO eCHO nCHO nCHO nCHO nCHO Normalized Normalized ExecutionTime DLRM GPT2 XLM BERT (a) m ✕ k = 1024 ✕ 4096 (b) m ✕ k = 2048 ✕ 8192 Fig. 8: End-to-end performance results for various recommen- Fig. 9: GEMM latency comparison between naive address dation and language models with the CPU and PIMs. generator and the proposed StepStone AGEN. only StepStone-DV is used, and a high-performance mode GEMM Buffer fill (B) Buffer fill (C) Buffer drain (C) Localization Reduction 3.0E+051200000 1.2E+06 All PIMs (STP), which selects the best-performing level per GEMM. 2.5E+051000000 1.0E+06 800000 The execution time of DLRM is dominated by a single 2.0E+05 1/2 PIMs 8.0E+05 600000 1.5E+05 6.0E+05 400000

FC layer (92%) and GEMM execution time is long enough DRAMCycles 1.0E+05 4.0E+05 DRAM Cycles 200000 DRAM Cycles 5.0E+04 2.0E+05 to amortize the localization/reduction overheads. This enables 0 Chopim and StepStone PIM to use BG-level PIMs and benefit 0.0E+00 16 32 0.0E+0016 32 Batch size 16 32 102416✕4096 32 16 324096✕102416 32 from their high memory bandwidth. On the other hand, PEI 512✕2048 2048✕512 1024✕4096 4096✕1024 cannot fully utilize BG-level PIMs due to command bandwidth Fig. 10: Impact of trading off between PIM execution time bottleneck and, consequently, using more PIMs with PEI only and replication/reduction overhead. increases overhead. GPT2 shows a similar trend but the gaps between PEI and StepStone PIM/Chopim are greater due to a larger weight matrix than DLRM. In BERT and XLM, the occur with our proposed AGEN and its latency can always N dimension is the batch size multiplied by the sequence be hidden within the pipeline. The difference in performance length after tensor reshaping, offering more efficient GEMM between the two approaches for this case is apparent for execution. For BERT, N becomes 32 in all FC layers whereas, StepStone-DV with a large weight matrix (Figure 9b), where for XLM, the sequence length starts at 1 and increases by 1 the performance gap is 2.5×. up to the maximum length (8 in our configuration) after each D. Parallelism—Distribution Overhead Tradeoffs iteration. As a result, XLM utilizes BG-level PIMs when N is small and, later, to DV-level PIMs once arithmetic Figure 10 shows the GEMM latency comparison between performance saturates and overheads start to dominate. two cases: (1) when all bank group-level PIMs are used and Overall, when weight matrices are larger and the batch (2) only half of the BG-level PIMs are active. We present dimension is smaller, StepStone PIM outperforms other CPU bank group-level PIMs tradeoff because we already discussed and PIM approaches. Even with somewhat larger batches (e.g., tradeoffs with respect to PIM level. When the weight matrix up to N = 384 for BERT), StepStone PIM outperforms size is small, the fraction of replication and reduction overhead the CPU by splitting a batch into several batch-32 GEMM dominates the entire execution time. If we only activate half operations. For example, StepStone PIM achieves 12× higher of the BG-level PIMs the overheads decrease by at most half performance than the CPU for BERT. Thus, StepStone PIM while arithmetic execution time doubles because parallelism outperforms the CPU until N = 12 × 32 = 384. is cut in half. Still, the tradeoff proves beneficial when the matrices are small (left). On the other hand, as the matrix C. Impact of StepStone AGEN size increases, the fraction of PIM execution time increases Figure 9 shows the performance benefit of StepStone AGEN as well (right). This is because the PIM execution time over the naive approach (explained in Section III-D). Overall, quadruples as each dimension size is doubled, whereas the StepStone AGEN outperforms the naive approach by up to localization/reduction overhead only doubles. Moreover, as the 4×, in particular when the number of active PIMs is larger. input and output matrices (i.e. activations) grow, they exceed Intuitively, the naive approach can find the next cache block in scratchpad capacity. As a result, the fraction of execution time a probability of 1/n, where n is the number active PIMs. For required for buffer fill/drain operations also increases. 
Even StepStone-BG (Figure 9a) there are 16 active PIMs and the though using fewer PIM units does offer better performance performance difference between two approaches is 8×. This for the larger matrix, it still provides a valid tradeoff option is mainly because pairs of cache blocks are contiguous in our because comparable performance is attainable in some cases baseline address mapping, which equates the naive approach while resource usage and power consumption decreases. with our optimized AGEN. However, when a large gap in the mapping exists, the naive approach requires numerous E. Impact of Address Mapping iterations and requires a large number of cycles to generate the Figure 11 shows the execution time of GEMM operations next address. The DRAM burst transfer latency is 4 DRAM when different address mappings and aspect ratios of the cycles and bubbles are introduced any time the next address weight matrices are used. To isolate impact to only DRAM cannot be generated within that time window. This does not address mapping, we set the batch size to 4 such that the

9 GEMM Localizaton Reduction GEMM Buffer fill (B) Buffer fill (C) Buffer drain (C) Localization Reduction 16KB 2.0E+06 6.0E+051800000 128 ✕ 8192 32KB 2.0E+05 5.0E+051600000 1.6E+06 64KB 1.5E+05 8192 ✕ 128 4.0E+051400000 512 ✕ 2048 1.2E+06 1.0E+05 3.0E+051200000 1000000 8.0E+05 5.0E+04 2.0E+05800000 DRAM Cycles DRAM Cycles 4.0E+05 0.0E+00 1.0E+05600000 DRAM Cycles DRAM Cycles 0.0E+00400000 0.0E+00 Address 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4 200000 Batch size 4 8 16 4 8 16 4 8 16 4 8 16 Mapping ID 0 BankGroup-level PIM Device-level PIM Channel-level PIM 41024✕40968 16 40964 ✕10248 16 4 82048✕819216 4 81928 ✕204816 1024✕4096 4096✕1024 2048✕8192 8192✕2048 Fig. 11: Sensitivity to address mapping and aspect ratio of the Fig. 12: GEMM latencies for different matrices and buffer weight matrix (batch size = 4). sizes (StepStone-BG). input and output matrices fit in the scratchpad memory of and interleaving increases the number of row buffer conflicts, all three PIM options. In the bank-group level StepStone- though row-buffer locality remains high. BG, the fraction of localization overhead differs significantly Larger matrices (right) tends to amortize the buffer fill/drain across address mappings when the matrix is short and fat overheads better than smaller matrices (left). Generally, over- (i.e., 128 × 8192). This is because the number of PIMs that head increases with batch size. Interestingly, the overhead with share the same input matrix blocks in address mappings 1 the 2048 × 8192 weight matrix increases at half the rate of and 2 are 2× greater than those with address mappings 3 other matrix configurations. This is because the number of and 4 and 4× greater than those with address mapping 0. block groups with this specific weight matrix dimension is The reason for the low localization overhead with address half that of the other matrix sizes we evaluate. Consequently, mapping 0 is that the combination of the address mapping the working set of the input activation (matrix B) per PIM unit and matrix size interleaves addresses such that matrix columns is also half that of other matrix configurations. As explained remain contiguous within each PIM. In contrast, the tall and in Section III-B, the number of groups is determined by both thin GEMM (i.e., 8192 × 128) suffers from high reduction the address mapping and matrix dimensions. overhead for all address mappings. This is because the CPU G. Impact of Concurrent CPU Access address mappings choose fairly fine-grain interleaving across bank groups and channels to maximize bandwidth. StepStone- We expect StepStone PIM to outperform prior PIM ar- BG is more sensitive to address mapping and aspect ratio chitectures, including Chopim, by enabling longer-running compared to StepStone-DV and -CH, because it distributes GEMM operations that maintain PIM locality. Long-running work across a larger number of units and the relative overhead operations are important when the CPU also executes a of localization and reduction is higher. memory-intensive workload concurrently with the PIMs, as Note that address mappings 2 and 3 for StepStone-CH both the CPU and PIMs contend for limited command channel exhibit higher GEMM execution times because these map- bandwidth. We evaluate this using the same colocation used pings interleave bank groups at a coarser granularity. Hence, by Cho et al. for evaluating Chopim [9], as described in bandwidth cannot be maximized because consecutive memory Section IV. 
While the colocated applications are not DL- accesses are penalized by tCCDL (i.e., column-to-column related, they run readily on gem5 and clearly demonstrate delay for back-to-back accesses within the same bank group is the impact of command channel contention. We isolate the larger than across bank groups). This demonstrates that timing performance benefits to just the StepStone AGEN unit that parameter considerations are also important when deciding the enables long-running kernels by running the same StepStone address mappings for PIM-enabled main memory. In theory, GEMM flow on eCHO and StepStone PIM and reporting the localization and reduction overheads are lower when fewer results corresponding only to GEMM execution. PIMs share the input and output matrix blocks as common StepStone PIM outperforms Chopim when the CPU inten- operands. However, this goal of low sharing cannot be realized sively accesses memory concurrently with PIMs (Figure 13). with a single fixed address mapping because the sharing As the matrix shape changes from short-fat to tall-thin, each of pattern changes with the matrix size and aspect ratio. eCHO kernels accesses fewer cache blocks, resulting in more PIM kernel invocations and greater contention with the CPU for the command channel. As a result, PIM kernel packets are F. Impact of Scratchpad Memory Capacity delayed and the PIMs are underutilized. With BG-level PIM, Figure 12 shows the impact of scratchpad memory capacity the relative performance of Chopim to StepStone PIM is worse on GEMM latency. We analyze StepStone-BG as it has the since even more PIMs are underutilized due to the command most stringent area constraint. We search for an optimal allo- bandwidth bottleneck. This performance gap will increase as cation across the scratchpad partitioning options between input the number of PIMs in each channel increases, increasing the and output buffer allocations (there are only two buffers so importance of mechanisms that enable long-running kernels. the search converges quickly). We find that interleaving buffer fill/drain operations with arithmetic has negligible impact on H. Power and Energy Analysis GEMM performance. The ability to execute entire kernels Figure 14 shows the power and energy consumption per limits the benefits of overlapping data transfer with arithmetic DRAM device of StepStone-BG and StepStone-DV. As N

10 6 than 4 PIMs per channel. Chopim [9] enables coarse-grained 4 PIM kernels under complex address mapping. Though GEMM 2 can be executed with Chopim by multiple GEMV kernel calls, 0 temporal locality cannot be exploited which is crucial for high-performance GEMM operations. Also, Chopim does not [2K,8K] [4K,4K] [8K,2K] [2K,8K] [4K,4K] [8K,2K] [16K,1K] [16K,1K] Speedup over eCHO provide efficient localization and reduction mechanisms, which DV-level PIM BG-level PIM incur high overhead for executing GEMMs on PIMs. NDA [12], Chameleon [5], and MCN DIMM [4] are also based Fig. 13: Speedup of StepStone PIM (STP) over Chopim on conventional DIMM devices and proposes PIM designs to enhanced with StepStone block grouping (eCHO) when con- practically add PEs to main memory. current CPU access exists. The size of matrices is fixed and its aspect ratio is varied. GEMM acceleration with PIM. The Active Memory Cube (AMC) [41] targets GEMM operations with in-memory pro- SIMD Scratchpad DRAM Localization/Reduction cessors. AMC considers address interleaving but interleaving 1.6 1.6 900 1.4 1.4 750 is only allowed within the same local PIM memory. This 1.2 1.2 600 1 1 essentially requires partitioning into CPU and PIM memory 0.8 0.8 450

GEMM acceleration with PIM. The Active Memory Cube (AMC) [41] targets GEMM operations with in-memory processors. AMC considers address interleaving, but interleaving is only allowed within the same local PIM memory. This essentially partitions the address space into CPU and PIM regions, and data must be copied from one region to the other whenever the processing mode changes. This approach suffers the same data-loading problem as discrete accelerators and does not realize the PIM potential of sharing the same memory between the CPU and PIM. Our solution, in contrast, has neither limitation: it works with any XOR-based address mapping and with PIMs at any level of the DRAM hierarchy.

PIM for machine learning. PIM for machine-learning workloads has been widely studied. Much of this research targets convolutional neural networks [8], [11], [13], [18], [19], [22], [28], [39], [40], embedding-table lookups in recommendation models [20], [25], recurrent neural networks [27], and GANs [30], [38]. In contrast, we target the tall-thin/fat-narrow GEMMs of fully-connected layers in DL inference. Newton [17] also targets fully-connected layers, like StepStone PIM. However, Newton operates as a discrete accelerator that cannot benefit from the advantages of main-memory acceleration described in Section II. More importantly, Newton does not avoid weight copies, does not exploit GEMM locality, cannot trade off the overheads of a higher parallelization degree against its performance benefits, cannot selectively execute at different PIM levels or on the CPU to match changing workload characteristics, and does not support the long-running kernels necessary for tolerating concurrent bandwidth-intensive CPU tasks.

VII. CONCLUSION

We identify that the small-batch GEMM operations of DL inference workloads are bandwidth bound on CPUs and GPUs and can benefit from PIM acceleration. We introduce StepStone PIM, which enables independent PIM GEMM execution under complex CPU DRAM address mapping. The main innovation of StepStone PIM is address-mapping cognizant GEMM blocking with matching PIM-side address generation. Our unique AGEN logic improves throughput compared to naive or host-side address generation. We explore PIM designs at three levels of the DRAM hierarchy (channel, chip, and bank-group) and show their tradeoffs with detailed simulation results. We show that activating more PIMs for a GEMM improves arithmetic performance but adds overheads for data localization/replication and reduction. We conclude that StepStone is an appealing datacenter solution because of: (1) its low cost; (2) its potential for lower latency and higher throughput, even when implemented at the buffer-chip level within a DIMM without modifying DRAM devices; and (3) its locality-optimized, high-efficiency GEMM execution that frees CPU resources for other tasks.
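To illustrate why mapping-aware, PIM-side address generation is needed at all, the sketch below is a toy model: the two bank-group bits, the XORed bit positions, and the helper names are our illustrative assumptions, not the actual CPU mapping or the AGEN hardware described in this paper. It shows how an XOR-hashed mapping scatters a contiguous matrix region across bank groups, and how a bank-group-level PIM unit could enumerate just the cache blocks that land in its own partition.

```python
# Minimal sketch of XOR-based DRAM address hashing and a PIM-local block filter.
# The 2-bit bank-group function and the chosen address bits are illustrative
# assumptions; real CPU mappings XOR larger, mapping-specific bit sets.

def bit(addr: int, i: int) -> int:
    return (addr >> i) & 1


def bank_group(addr: int) -> int:
    """Toy XOR-hashed bank-group index for a physical address."""
    bg0 = bit(addr, 6) ^ bit(addr, 13)
    bg1 = bit(addr, 7) ^ bit(addr, 14)
    return (bg1 << 1) | bg0


def local_blocks(base: int, num_blocks: int, my_bg: int, block_bytes: int = 64):
    """Yield the cache-block addresses of a contiguous region that hash to my_bg,
    i.e., the blocks a bank-group-level PIM unit would traverse for its share."""
    for i in range(num_blocks):
        addr = base + i * block_bytes
        if bank_group(addr) == my_bg:
            yield addr


if __name__ == "__main__":
    # A contiguous 16 KiB matrix slice is scattered across bank groups by the hash;
    # each PIM unit walks only the blocks that land in its own bank group.
    mine = list(local_blocks(base=0x0, num_blocks=256, my_bg=2))
    print(f"bank group 2 holds {len(mine)} of 256 blocks; first few: "
          f"{[hex(a) for a in mine[:4]]}")
```

The real design performs this enumeration in hardware rather than in software, but the sketch captures the essential point: consecutive physical addresses do not stay within one PIM partition, so locality-preserving GEMM blocking must be derived from the address mapping itself.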

REFERENCES

[1] "Intel oneDNN (v1.7)," https://github.com/oneapi-src/oneDNN, Intel, 2020.
[2] "NVIDIA CUTLASS (v2.2)," https://github.com/NVIDIA/cutlass, NVIDIA, 2020.
[3] J. Ahn, S. Yoo, O. Mutlu, and K. Choi, "PIM-enabled instructions: A low-overhead, locality-aware processing-in-memory architecture," in 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA). IEEE, 2015, pp. 336–348.
[4] M. Alian, S. W. Min, H. Asgharimoghaddam, A. Dhar, D. Wang, T. Roewer, A. McPadden, O. O'Halloran, D. Chen, J. Xiong, D. Kim, W.-m. Hwu, and N. S. Kim, "Application-transparent near-memory processing architecture with memory channel network," in The 51st Annual IEEE/ACM International Symposium on Microarchitecture, 2018.
[5] H. Asghari-Moghaddam, Y. H. Son, J. H. Ahn, and N. S. Kim, "Chameleon: Versatile and practical near-DRAM acceleration architecture for large memory systems," in 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2016, pp. 1–13.
[6] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti et al., "The gem5 simulator," ACM SIGARCH Computer Architecture News, vol. 39, no. 2, pp. 1–7, 2011.
[7] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., "Language models are few-shot learners," arXiv preprint arXiv:2005.14165, 2020.
[8] P. Chi, S. Li, C. Xu, T. Zhang, J. Zhao, Y. Liu, Y. Wang, and Y. Xie, "PRIME: A novel processing-in-memory architecture for neural network computation in ReRAM-based main memory," ACM SIGARCH Computer Architecture News, vol. 44, no. 3, pp. 27–39, 2016.
[9] B. Y. Cho, Y. Kwon, S. Lym, and M. Erez, "Near-data acceleration with concurrent host access," in 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2020, pp. 818–831.
[10] A. Conneau and G. Lample, "Cross-lingual language model pretraining," in Advances in Neural Information Processing Systems, 2019, pp. 7059–7069.
[11] Q. Deng, Y. Zhang, M. Zhang, and J. Yang, "LAcc: Exploiting lookup table-based fast and accurate vector multiplication in DRAM-based CNN accelerator," in Proceedings of the 56th Annual Design Automation Conference, 2019, pp. 1–6.
[12] A. Farmahini-Farahani, J. H. Ahn, K. Morrow, and N. S. Kim, "NDA: Near-DRAM acceleration architecture leveraging commodity DRAM devices and standard memory modules," in 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2015, pp. 283–295.
[13] M. Gao, J. Pu, X. Yang, M. Horowitz, and C. Kozyrakis, "TETRIS: Scalable and efficient neural network acceleration with 3D memory," in Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, 2017, pp. 751–764.
[14] P. Gu, X. Xie, Y. Ding, G. Chen, W. Zhang, D. Niu, and Y. Xie, "iPIM: Programmable in-memory image processing accelerator using near-bank architecture," in 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2020, pp. 804–817.
[15] U. Gupta, S. Hsia, V. Saraph, X. Wang, B. Reagen, G. Wei, H. S. Lee, D. Brooks, and C. Wu, "DeepRecSys: A system for optimizing end-to-end at-scale neural recommendation inference," in 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), 2020.
[16] U. Gupta, C.-J. Wu, X. Wang, M. Naumov, B. Reagen, D. Brooks, B. Cottel, K. Hazelwood, M. Hempstead, B. Jia et al., "The architectural implications of Facebook's DNN-based personalized recommendation," in 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2020, pp. 488–501.
[17] M. He, C. Song, I. Kim, C. Jeong, S. Kim, I. Park, M. Thottethodi, and T. Vijaykumar, "Newton: A DRAM-maker's accelerator-in-memory (AiM) architecture for machine learning," in 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2020, pp. 372–385.
[18] M. Imani, S. Gupta, Y. Kim, and T. Rosing, "FloatPIM: In-memory acceleration of deep neural network training with high precision," in 2019 ACM/IEEE 46th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2019, pp. 802–815.
[19] B. K. Joardar, B. Li, J. R. Doppa, H. Li, P. P. Pande, and K. Chakrabarty, "REGENT: A heterogeneous ReRAM/GPU-based architecture enabled by NoC for training CNNs," in 2019 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 2019, pp. 522–527.
[20] L. Ke, U. Gupta, B. Y. Cho, D. Brooks, V. Chandra, U. Diril, A. Firoozshahian, K. Hazelwood, B. Jia, H.-H. S. Lee et al., "RecNMP: Accelerating personalized recommendation with near-memory processing," in 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2020, pp. 790–803.
[21] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in Proceedings of NAACL-HLT, 2019, pp. 4171–4186.
[22] D. Kim, J. Kung, S. Chai, S. Yalamanchili, and S. Mukhopadhyay, "Neurocube: A programmable digital neuromorphic architecture with high-density 3D memory," in Proceedings of the 43rd International Symposium on Computer Architecture, 2016, pp. 380–392.
[23] G. Kim, N. Chatterjee, M. O'Connor, and K. Hsieh, "Toward standardized near-data processing with unrestricted data placement for GPUs," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2017, pp. 1–12.
[24] Y. Kim, W. Yang, and O. Mutlu, "Ramulator: A fast and extensible DRAM simulator," IEEE Computer Architecture Letters, vol. 15, no. 1, pp. 45–49, 2016.
[25] Y. Kwon, Y. Lee, and M. Rhu, "TensorDIMM: A practical near-memory processing architecture for embeddings and tensor operations in deep learning," in Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, 2019, pp. 740–753.
[26] Y. Liu, X. Zhao, M. Jahre, Z. Wang, X. Wang, Y. Luo, and L. Eeckhout, "Get out of the valley: Power-efficient address mapping for GPUs," in 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2018, pp. 166–179.
[27] Y. Long, T. Na, and S. Mukhopadhyay, "ReRAM-based processing-in-memory architecture for recurrent neural network acceleration," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 26, no. 12, pp. 2781–2794, 2018.
[28] Y. Long, X. She, and S. Mukhopadhyay, "Design of reliable DNN accelerator with un-reliable ReRAM," in 2019 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 2019, pp. 1769–1774.
[29] S. Lym, A. Behroozi, W. Wen, G. Li, Y. Kwon, and M. Erez, "Mini-batch serialization: CNN training with inter-layer data reuse," Proceedings of Machine Learning and Systems, vol. 1, pp. 264–275, 2019.
[30] H. Mao, M. Song, T. Li, Y. Dai, and J. Shu, "LerGAN: A zero-free, low data movement and PIM-based GAN architecture," in 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2018, pp. 669–681.
[31] N. Muralimanohar, R. Balasubramonian, and N. P. Jouppi, "CACTI 6.0: A tool to model large caches," HP Laboratories, pp. 22–31, 2009.
[32] M. Naumov, D. Mudigere, H. M. Shi, J. Huang, N. Sundaraman, J. Park, X. Wang, U. Gupta, C. Wu, A. G. Azzolini, D. Dzhulgakov, A. Mallevich, I. Cherniavskii, Y. Lu, R. Krishnamoorthi, A. Yu, V. Kondratenko, S. Pereira, X. Chen, W. Chen, V. Rao, B. Jia, L. Xiong, and M. Smelyanskiy, "Deep learning recommendation model for personalization and recommendation systems," CoRR, vol. abs/1906.00091, 2019. [Online]. Available: https://arxiv.org/abs/1906.00091
[33] R. Nishtala, V. Petrucci, P. Carpenter, and M. Sjalander, "Twig: Multi-agent task management for colocated latency-critical cloud services," in 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2020, pp. 167–179.
[34] R. Panda, S. Song, J. Dean, and L. K. John, "Wait of a decade: Did SPEC CPU 2017 broaden the performance horizon?" in 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2018, pp. 271–282.

[35] J. Park, M. Naumov, P. Basu, S. Deng, A. Kalaiah, D. Khudia, J. Law, P. Malani, A. Malevich, S. Nadathur et al., "Deep learning inference in Facebook data centers: Characterization, performance optimizations and hardware implications," arXiv preprint arXiv:1811.09886, 2018.
[36] P. Pessl, D. Gruss, C. Maurice, M. Schwarz, and S. Mangard, "DRAMA: Exploiting DRAM addressing for cross-CPU attacks," in Proceedings of the 25th USENIX Conference on Security Symposium, 2016, pp. 565–581.
[37] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, "Language models are unsupervised multitask learners," OpenAI blog, vol. 1, no. 8, p. 9, 2019.
[38] A. S. Rakin, S. Angizi, Z. He, and D. Fan, "PIM-TGAN: A processing-in-memory accelerator for ternary generative adversarial networks," in 2018 IEEE 36th International Conference on Computer Design (ICCD). IEEE, 2018, pp. 266–273.
[39] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar, "ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars," in 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA). IEEE, 2016, pp. 14–26.
[40] L. Song, X. Qian, H. Li, and Y. Chen, "PipeLayer: A pipelined ReRAM-based accelerator for deep learning," in 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2017, pp. 541–552.
[41] Z. Sura, A. Jacob, T. Chen, B. Rosenburg, O. Sallenave, C. Bertolli, S. Antao, J. Brunheroto, Y. Park, K. O'Brien et al., "Data access optimization in a processing-in-memory system," in Proceedings of the 12th ACM International Conference on Computing Frontiers, 2015, pp. 1–8.
[42] H. Zhu, D. Lo, L. Cheng, R. Govindaraju, P. Ranganathan, and M. Erez, "Kelp: QoS for accelerated machine learning systems," in 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2019, pp. 172–184.
