arXiv:2006.08975v1 [cs.AR] 16 Jun 2020

ZnG: Architecting GPU Multi-Processors with New Flash for Scalable Data Analysis

Jie Zhang and Myoungsoo Jung
Computer Architecture and Memory Systems Laboratory,
Korea Advanced Institute of Science and Technology (KAIST)
http://camelab.org

Abstract—We propose ZnG, a new GPU-SSD integrated architecture, which can maximize the memory capacity in a GPU and address the performance penalties imposed by an SSD. Specifically, ZnG replaces all the GPU internal DRAMs with an ultra-low-latency SSD to maximize the GPU memory capacity. ZnG further removes the performance bottleneck of the SSD by replacing its flash channels with a high-throughput flash network and integrating the SSD firmware in the GPU's MMU to reap the benefits of hardware accelerations. Although the flash arrays within the SSD can deliver high accumulated bandwidth, only a small fraction of such bandwidth can be utilized by the GPU's memory requests due to mismatches of their access granularity. To address this, ZnG employs a large L2 cache and flash registers to buffer the memory requests. Our evaluation results indicate that ZnG can achieve 7.5× higher performance than prior work.

Index Terms—data movement, GPU, SSD, heterogeneous system, MMU, L2 cache, Z-NAND, DRAM

I. INTRODUCTION

Over the past few years, graphics processing units (GPUs) become prevailing to accelerate large-scale data-intensive applications such as graph analysis and bigdata computing [1]–[5], because of the massive computing power brought by the GPU cores. To reap the benefits from GPUs, their large-scale applications are decomposed into multiple GPU kernels, each containing tens or hundreds of thousands of threads. These threads can be simultaneously executed by the massive GPU cores, which exhibits high thread-level parallelism (TLP). While such massive parallel computing drives GPUs to exceed CPUs' performance by upto 100 times, the on-board memory capacity of GPUs is much less than that of the host-side main memory, which cannot accommodate all data sets of the large-scale applications. In practice, as a peripheral device, GPUs cannot deploy an enough number of memory packages to expand the on-board memory space [6], while the capacity of a single memory package is limited due to DRAM cell disturbances, retention time violations, and DRAM scaling issues [7]–[9].

To meet the requirement of such a large memory capacity, a GPU vendor utilizes the underlying NVMe SSD as a swap disk of the GPU DRAM and leverages the memory management unit (MMU) in GPUs to realize memory virtualization [10]. For example, if a block of data requested by a GPU core misses in the GPU memory, the GPU's MMU raises an exception and informs the GPU of a page fault. As both the GPU and the NVMe SSD are peripheral devices, the GPU informs the host to service the page fault, which unfortunately introduces severe data movement overhead.
Specifically, the host first needs to load the target page from the NVMe SSD to the host-side main memory and then copy the same data to the GPU memory. The data movement across different computing domains and various hardware constraints of the NVMe SSD and GPU interfaces (i.e., PCIe bandwidth) limit the performance of memory accesses and significantly increase the latency of servicing page faults, which in turn degrades the overall performance of many applications at the user-level.

To reduce the data movement overhead, a prior study [11], referred to as HybridGPU, proposes to directly replace a GPU's on-board DRAM packages with Z-NAND flash packages, as shown in Figure 1a. Z-NAND, a new type of flash media, achieves 64 times higher capacity than DRAM, while reducing the access latency from hundreds of micro-seconds of conventional flash to a few micro-seconds [12]. However, Z-NAND faces several challenges to service the memory requests of the GPU directly: 1) the minimum access granularity of Z-NAND is a page, which is not compatible with the memory requests; 2) Z-NAND programming (writes) requires the assistance of SSD firmware to manage the address mapping, as Z-NAND forbids in-place updates; and 3) Z-NAND still employs a much longer access latency than DRAM. To address these challenges, HybridGPU employs a customized SSD controller to execute the SSD firmware and a small DRAM buffer to hide the relatively long Z-NAND read/write latency. While this approach can eliminate the data movement overhead by placing the Z-NAND close to the GPU, there is a huge performance disparity between this approach and a traditional GPU memory subsystem. To figure out the main reason behind the per-

Fig. 1: An integrated HybridGPU architecture and performance analysis. (a) HybridGPU design. (b) Accumulated bandwidth (GB/s) of the flash channel, flash read, flash write, SSD engine, DRAM buffer, and GDDR5.
Performance . the 4.80 ed gap a e - formance disparity, we analyze the bandwidths of traditional VIQ`7 V$1 V` .:`VR I:J:$VIVJ %J1 GPU memory subsystem and each component in HybridGPU.  V H.L `1CV IVIQ`7 VHQRV #1$.C7R .`V:RVR The results are shown in Figure 1b. The maximum bandwidth   Q:CVH1J$ %:$V :GCV1:C@V` H.VR%CV %J1 %J1 of HybridGPU’s internal DRAM buffer is 96% lower than that 1 %:$V`:%C 6VH% V    .:JRCV` of the traditional GPU memory subsystem. This is because %J1 %:$V1:C@ the state-of-the-art GPUs employ six memory controllers to L   G%``V`    H:H.V communicate with a dozen of DRAM packages via a 384-bit   %:$V1:C@ data bus [13], while the DRAM buffer is a single package V 1`V VI]Q` %J1  # H:H.V connected to a 32-bit data bus [11]. It is possible to increase the number of DRAM packages in HybridGPU by reducing JV 1Q`@ the number of its Z-NAND packages. However, this solution is undesirable as it can reduce the total memory capacity in Q% V` Q% V` +% * +% * GPU. In addition, I/O bandwidth of the flash channels and data  VI `1 V* 1CC* :``:7 H `CV` processing bandwidth of a SSD controller are much lower than  # V:R* H:H.V those of the traditional GPU memory subsystem, which can .:`VR )J]% * ``RH.1] also become performance bottleneck. This is because the bus structure of the flash channels constrains itself from scaling Fig. 2: GPU internal architecture. up with a higher frequency, while the single SSD controller manner. On the other hand, compared to the GPU interconnect has a limited computing power to translate the addresses of network, a flash channel has a limited number of electrical the memory requests from all the GPU cores. lanes and runs in a relatively low clock frequency, and We propose a new GPU-SSD architecture, ZnG, to maxi- therefore, its bandwidth is much lower than the accumulated mize the memory capacity in the GPU and to address the per- bandwidth of the Z-NAND arrays. ZnG addresses this by formance penalties imposed by the SSD integrated in the GPU. changing the flash network from a bus to a mesh structure. ZnG replaces all the GPU on-board memory modules with • Hardware implementation for zero-overhead address trans- multiple Z-NAND flash packages and directly exposes the Z- lation. The large-scale data analysis applications are typically NAND flash packages to the GPU L2 cache to reap the benefits read-intensive. By leveraging characteristics of the target user of the accumulated flash bandwidth. Specifically, to prevent the applications, ZnG splits the flash address translation into single SSD controller from blocking the services of memory two parts. First, a read-only mapping table is integrated in requests, ZnG removes HybridGPU’s request dispatcher. In- the GPU internal MMU and cached by the GPU TLB. All stead, ZnG directly attaches the underlying flash controller memory requests can directly get their physical addresses to a GPU interconnect network to directly adopt memory when MMU looks up the mapping table to translate their requests from GPU L2 caches. Since the SSD controller also virtual addresses. Second, when there is a memory write, has a limited computing power to perform address translation the data and the updated address mapping information are for the massive number of memory requests, ZnG offloads simultaneously recorded in the flash address decoder and flash the functionality of address translation to the GPU internal arrays. 
A future access to the written data will be remapped MMU and flash address decoder (to hide the computation by the flash address decoder. overhead). To satisfy the accumulated bandwidth of Z-NAND • L2 cache and flash internal register design for maximum flash arrays, ZnG also increases the bandwidth of the flash flash bandwidth. As Z-NAND is directly connected to GPU channels. Lastly, although replacing all GPU DRAM modules L2 cache via flash controllers, ZnG increases the L2 cache with the Z-NAND packages can maximize the accumulated capacity with an emerging non-volatile memory, STT-MRAM, flash bandwidth, such bandwidth can be underutilized without to buffer more number of pages from Z-NAND. ZnG further a DRAM buffer, as Z-NAND has to fetch a whole 4KB flash improves the space utilization of the GPU L2 caches by page to serve a 128B data block. ZnG increases the GPU predicting spatial locality of the fetched pages. As STT- L2 cache sizes to buffer the pages fetched from Z-NAND, MRAM suffers from a long write latency, ZnG constructs which can serve the read requests, while it leverages Z-NAND L2 cache as a read-only cache. To accommodate the write internal cache registers to accommodate the write requests. requests, ZnG increases the number of registers in Z-NAND Our evaluation results indicate that ZnG can improve the flash and configures all the registers within a same flash overall GPU performance by 7.5×, compared to HybridGPU. package as fully-associative cache to accommodate requests. Our contributions of this work can be summarized as follows: • New GPU-SSD architecture to remove performance bottle- II. BACKGROUND neck. The HybridGPU’s request dispatcher can be bottleneck to interact with both the underlying flash controllers and all L2 A. GPU internal architecture cache banks. To address this, ZnG directly attaches multiple Figure 2 shows a representative GPU architecture, in which flash controllers to the GPU interconnect network, such that streaming multiprocessors (SMs), shared L2 caches, memory the memory requests, generated by GPU L2 caches, can controllers and a are connected via be served across different flash controllers in an interleaved a GPU internal network. Within SMs, multiple arithmetic logic mlyasae 2cceadhg-efrac off-chip high-performance and cache L2 shared the interrupt. a the copies to employ then response CPU in The memory GPU CPU. the the to to data interrupt target an sets hand regis hardware fault and CPU-side in page information the fault a Specifically, page employs programs logic. the MMU handling all GPU a fault accommodate applications, as page to multiple addition, of capacity In sets cache memory accesses. working limited walk memory a page of has a number GPU employs the also reduce MMU to GPU walke hundreds [19], cost table accesses cycles page memory of the the Since translation, threads. improv 32 the address To employs with completion. of request associated throughput the request and for the memory waits L1D address and the entry each corresponding issues buffer For walk then the page buffers. It records a state. walk allocates which page machine entry, the state of buffer hardware walk page set the to The miss, a machine TLB cache. and state all walk hardware table for a page page includes resource and walker shared handler table a fault wa is page page walker, MMU table As buffer, page [18]. 
Within SMs, multiple arithmetic logic units (ALUs) are employed to execute a group of 32 threads, called a warp [14], in a lock step. During the execution, the instructions of the 32 threads are fetched and decoded first, and the warp scheduler then schedules an available warp whose instructions have been decoded. Based on the instruction types, the arithmetic instructions are executed in ALUs, while the load/store instructions access the on-chip memory via LD/ST units. The on-chip memory is comprised of an L1D cache and shared memory. Before accessing the L1D cache, the memory requests generated by the thirty-two threads in a warp are issued to the coalescing unit. This logic unit merges the memory requests of 32-bit addresses into fewer but larger requests to improve the bandwidth utilization. Note that the state-of-the-art GPUs integrate a translation lookaside buffer (TLB)/memory management unit (MMU) to support memory virtualization [15]. Before accessing the L1D cache [13], the memory requests need to be translated from virtual addresses to logical addresses via the TLB and MMU [16]. Afterwards, the requests can be served from the L1D cache if they hit; otherwise, they are forwarded to the L2 cache and tracked by the miss status holding registers (MSHR) [17]. Since NVIDIA, AMD and ARM have not documented their GPU memory virtualization designs publicly, we adopt the virtual memory design from academia [18] in this work, including the highly-threaded page table walker shown in Figure 2.
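To make the coalescing step above concrete, the following minimal sketch (not taken from the paper) merges the 32 per-thread addresses of one warp into unique cache-line requests; the 128B line granularity matches the GPU request size quoted later in the paper, while the function names and access patterns are assumptions of this illustration.

```cpp
// Minimal sketch of warp-level memory coalescing: 32 per-thread addresses
// are folded into unique 128B cache-line requests before being issued to
// the L1D cache. The 128B line size is an assumption for illustration.
#include <cstdint>
#include <cstdio>
#include <set>
#include <vector>

constexpr uint64_t kLineSize = 128;   // assumed L1D/L2 line granularity
constexpr int      kWarpSize = 32;

// Returns the distinct cache-line requests generated by one warp access.
std::vector<uint64_t> coalesce(const uint64_t (&thread_addr)[kWarpSize]) {
  std::set<uint64_t> lines;                       // keeps unique, sorted lines
  for (uint64_t addr : thread_addr)
    lines.insert(addr / kLineSize * kLineSize);   // align to the line boundary
  return {lines.begin(), lines.end()};
}

int main() {
  uint64_t addrs[kWarpSize];
  for (int t = 0; t < kWarpSize; t++) addrs[t] = 0x1000 + 4 * t;    // unit stride
  printf("unit-stride -> %zu request(s)\n", coalesce(addrs).size()); // 1 line
  for (int t = 0; t < kWarpSize; t++) addrs[t] = 0x1000 + 128 * t;  // line stride
  printf("128B-stride -> %zu request(s)\n", coalesce(addrs).size()); // 32 lines
  return 0;
}
```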

Fig. 3: Density and power consumption analysis. (a) Memory package capacity (GB) of GDDR5, DDR4, LPDDR4, and Z-NAND. (b) Power consumption (W/GB) of GDDR5, DDR4, LPDDR4, and Z-NAND.
To achieve high data access bandwidth, GPUs usually employ a shared L2 cache and high-performance off-chip memory. The shared L2 cache exhibits a larger cache space than the L1D cache, and it is partitioned into multiple tiles to support parallel data accesses. Once the memory requests miss in the L2 cache, they are forwarded to the memory controllers. To process the massive memory requests, GPUs typically employ multiple memory controllers (i.e., 6∼8 [13], [17]), each connecting to a set of GPU DRAMs to deliver a high bandwidth.

B. The new NAND flash

While GPU DRAM provides a promising throughput, it has two drawbacks: low capacity and high power consumption. Figures 3a and 3b compare GPU DRAM (GDDR5), desktop DRAM (DDR4), mobile DRAM (LPDDR4) and the new NAND flash (Z-NAND) in terms of memory density and power consumption (watt per GB), respectively. One can observe from the figures that GPU DRAM spends much higher power consumption and has much lower memory density compared to the other types of DRAM and Z-NAND. Even though the industry successfully improves the power efficiency of DRAM (i.e., LPDDR4), it is unable to increase the DRAM capacity, which turns out to be 64× lower in memory density than Z-NAND.

Compared to DRAM, Z-NAND exhibits the highest memory density and provides the best power efficiency. In addition, Z-NAND provides a promising performance in terms of both latency and throughput. Figure 4a shows the architectural details of Z-NAND. Z-NAND increases its memory density by piling up flash cells over 48 vertically-stacked layers. Z-NAND further reduces its flash access latency by reorganizing its flash cell technology and micro-architecture. Specifically, Z-NAND leverages the single-level cell (SLC) technologies to shorten the low-level read/program latency and to significantly increase its program/erase (P/E) cycles. Thus, the read and program operations of Z-NAND take 3us and 100us, which are 17× and 14× faster than those of the state-of-the-art TLC V-NAND flash, respectively [12]. Z-NAND's P/E cycles (i.e., 100,000) are also higher than those of V-NAND flash (i.e., 3,000∼10,000). The flash backbone of an SSD that employs Z-NAND can deliver a scalable throughput by fully leveraging its internal parallelism, including the multiple flash channels, packages, dies and planes.

Unlike GPU DRAM, which is byte-addressable, Z-NAND should serve the reads/writes in a page basis. In addition, Z-NAND only allows to write data in an in-order programming manner due to the page-to-page interference/disturbance. Thus, Z-NAND forbids overwrites and follows an erase-before-write rule [12]; a whole flash block needs to be erased before re-programming its flash cells. Because of this erase-before-write rule, modern SSDs employ an embedded CPU and an internal DRAM, referred to as SSD engine, to execute SSD firmware. The main role of the SSD firmware is to translate the logical addresses of incoming I/O requests to the physical addresses of Z-NAND flash (cf., FTL). When the number of clean flash blocks is under a threshold, the SSD firmware performs a clean-up process, called garbage collection (GC). Specifically, it migrates multiple valid pages from fully programmed

HybridGPU (a) SSD internal. (b) GPU-SSD architecture. (c) Performance. (d) Latency breakdown. Fig. 4: SSD internal architecture, GPU-SSD architecture and the throughput of different memory media. block(s) to clean block(s), erases the target block(s), and addition to the SSD engine issue, the network latency (between updates the mapping information in the internal DRAM. L2 cache and SSD engine) is significant, compared to the traditional GPU memory subsystem. This is because the SSD C. Prior work and motivation controller only employs 2∼5 low-power embedded To address the aforementioned challenges of GPU DRAM, cores [21], which are unable to send/receive many memory prior work attempts to increase the GPU memory with SSD requests in parallel. Thus, the SSD controller also becomes [11], [20]. Figures 4b and 1a respectively show the overviews the performance bottleneck of serving memory requests. of two typical solutions: a GPU-SSD system [20] and a Z-NAND access granularity. We perform a simulation-based HybridGPU system [11]. The GPU-SSD system employs a study to analyze the impact of replacing all GPU on-board discrete GPU and SSD as peripheral devices, which are DRAMs with Z-NAND flash packages. For the sake of brevity, connected to CPU via a PCIe interface. When page faults we assume there is no performance penalty introduced by occur in GPU due to the limited memory space, CPU serves the SSD controller. In this study, we configure the GTX580- the page faults by accessing data from the underlying SSD like GPU model [17] by using a GPU simulation framework, and moving the data to GPU memory through GPU software MacSim [22]. The important configuration details are listed framework. In addition, the page faults require redundant data in Table I. Figure 5a shows the performance degradation, copies in the host side due to the user/privilege mode switches. imposed by direct Z-NAND flash accesses, under the execution This, unfortunately, wastes host CPU cycles and reduces of 12 real GPU workloads [23]–[25]. One can observe from data access bandwidth. In contrast, the HybridGPU directly this figure that the performance degradation can be as high integrates the SSD into the GPU, which can eliminate CPU as 28×. This is because the memory access size in GPU intervention and avoid the redundant data copies. While the is 128B, which is much smaller than the minimum access HybridGPU exhibits much better performance than the GPU- granularity of Z-NAND flash (4KB). Therefore, 97% of flash SSD system, memory bandwidth of its embedded SSD module access bandwidth is underutilized when serving the requests. is not comparable to that of GPU memory. Figure 4c compares Workload characteristics. We evaluate the impact of direct the maximum data access throughput of GPU DRAM, desktop Z-NAND accesses on large-scale data analysis applications. DRAM, mobile DRAM, GPU-SSD and HybridGPU systems. Figure 5b shows the number of memory requests that repeat For the GPU-SSD and HybridGPU systems, data are assumed accessing the same pages. Each Z-NAND page, on average, to reside in the SSD. As shown in the figure, GPU DRAM is repeatedly read by 42 times, across all the workloads. This outperforms the GPU-SSD and HybridGPU systems by 80 observation indicates that it is still beneficial to buffer the re- times and 40 times, respectively. This in turn makes the SSD accessed data for future fast accesses. 
In addition to the read performance bottleneck in such systems when executing the operations, we also collect the number of write requests that applications with large-scale data sets. target to the same Z-NAND pages, called write redundancy in this work. The results are shown in Figure 5c. Each Z-NAND III. HIGH LEVEL VIEW page, on average, receive 65 times of write operations. Serving A. Key observations such write requests in flash pages can dramatically shorten the FTL and GPU-SSD interconnection. Figure 4d analyzes lifetime of Z-NAND. Thus, it is essential to have a buffer to multiple components that contribute the memory access la- accommodate write requests. tency and compares the latency breakdown between tradi- B. The high-level view of ZnG tional GPU memory subsystem and HybridGPU. The SSD engine, which performs FTL, accounts for 67% of the total Putting Z-NAND flash close to GPU. Figure 6a shows an memory access latencies. This is because the SSD engine in architectural overview of the proposed ZnG. Compared to commercial SSDs has limited computing power to process HybridGPU, ZnG removes the components of the request the memory requests, which are simultaneously generated dispatcher, the SSD controller and the data buffer (which are by the massive GPU cores. As a result, a large number placed between the GPU L2 cache and the Z-NAND flash). of memory requests can be stalled by the SSD engine. In Instead, the underlying flash network is directly attached to 70 80 180 160 write read

(a) Performance degradation. (b) Read re-accesses. (c) Write redundancy. (d) Access breakdown. Fig. 5: Performance of SSD components, performance degradation, read re-accesses and write redundancy.
(a) High-level view. (b) Execution flow of memory accesses. Fig. 6: High-level view of ZnG and the execution flow of memory accesses. the GPU interconnect network through the flash controllers. which introduces huge Z-NAND access overhead. To address While the flash controllers manage the I/O transactions of the these challenges, ZnG collaborates such two approaches. Our underlying Z-NAND, ZnG integrates a request dispatcher in key observation is that a wide spectrum of the data analysis each flash controller to interact with the GPU interconnect workloads is read-intensive, which generates only a few write network in sending/receiving the request packets. There exist requests to Z-NAND. Thus, we split FTL’s mapping table into two root causes that ZnG does not directly attach Z-NAND a read-only block mapping table and a log page mapping table. packages to the GPU interconnect network. First, Z-NAND To reduce the mapping table size, the block mapping table packages leverage Open NAND Flash Interface (ONFI) [26] records the mapping information of a flash block rather than a for the I/O communication whose frequency and hardware page. This design in turn reduces the mapping table size to 80 (electrical lane) configurations are different from those of KB, which can be placed in the MMU. While read requests the GPU interconnect network. Second, since a bandwidth can leverage the read-only block mapping table to find out capacity of the GPU interconnect network much exceeds total its flash physical address, this mapping table cannot remap bandwidth brought by all the underlying Z-NAND packages, incoming write requests to new Z-NAND pages. To address directly attaching Z-NAND packages to the GPU interconnect this, we implement a log page mapping table in the flash row network can significantly underutilize the network resources. decoder. The MMU can calculate the flash block address of Thus, we employ a mesh structure as the flash network, which the write requests based on the block mapping table. We then can meet the bandwidth requirement of Z-NAND packages forward the write requests to the target flash block. The flash by increasing the frequency and link widths, rather than row decoder remaps the write requests to a new page location leveraging the existing GPU interconnect network. in the flash block. Once all the spaces of Z-NAND are used up, we allocate a GPU helper thread to reclaim flash block(s) Zero-overhead FTL. As the SSD controller is removed from by performing garbage collection, which is inspired by [28]. ZnG, we offload FTL to other hardware components. GPU’s MMU can be a good candidate component to implement the Putting it all together. To make a memory request correctly FTL as all memory requests leverage the MMU to translate access the corresponding Z-NAND, the memory requests, gen- their virtual addresses to memory logical addresses. We can erated by GPU SMs, firstly leverage TLB/MMU to translate achieve a zero-overhead FTL if the MMU directly translates their logical addresses to flash physical addresses ( 1 ) (cf. the virtual address of each memory request to flash physical Figure 6a). GPU L1 and L2 caches are indexed by the flash address. However, MMU does not have a sufficient space physical address. If a memory request misses in the L2 cache to accommodate all the mapping information of FTL. An ( 2 ), the L2 cache sends the memory request to one of the alternative solution is to revise the internal row decoder of each flash controllers ( 3 ). 
The flash controller decodes the physical Z-NAND plane to remap a request’s address to a wordline of address to find the target flash plane and converts the memory Z-NAND flash array, which is inspired by [27]. While this requests into a sequence of flash commands ( 4 ). The target Z- approach can eliminate the FTL overhead, reading a page NAND serves such memory request by using the row decoder requires searching the row decoders of all Z-NAND planes, to activate the wordline and sensing data from the flash array. R].: V   ’ ’   ’ 0’ C@ :$V1JRV6 Q1 Q$GCQH@ : :GCQH@ :$VQ`` V C: . VCVH HVCC  HH C@ Q1RVH            CQ: 1J$ QC%IJRVHQRV`    `Q VH $: V VJ:GCV  C@     $ Q1   VCVH R"#"    $          HH C@ R C:JV ].: V RVHQRV` C:. H `CV` J0V` V` `Q VH `Q$`:II:GCV  R  `C:. VJ:GCV

(a) Mapping tables. (b) Row decoder modification. Fig. 7: Zero-overhead FTL design. C. The optimizations for high throughput MMU/TLB, the shared memory, and the flash row decoder. As explained in Section III-A, simply removing the internal Our MMU implementation adopts a two-level page table based DRAM buffer imposes huge performance degradation in a on a real GPU MMU implementation [15]. The page table GPU-SSD system. To address the degradation, ZnG assigns works as a data block mapping table (DBMT), whose entries GPU L2 cache and Z-NAND internal registers as read and store virtual block number (VBN), logical block number write buffers, respectively, to accommodate the read and write (LBN), physical log block number (PLBN) and physical data requests. Figure 6b shows the high-level view of our proposed block number (PDBN). Among these, VBN is the data block design. Specifically, we increase the L2 cache capacity by address of the user applications in the virtual address space. replacing SRAM with non-volatile memory, in particular STT- LBN is a global memory address, while PDBN and PLBN MRAM to accommodate as many read requests as possible. are Z-NAND’s flash addresses. The physical data block se- While STT-MRAM can increase the L2 cache capacity by 4 quentially stores the read-only flash pages. If memory requests times, its long write latency makes it infeasible to accommo- access read-only data, the requests can locate the positions of date the write requests [29]. We then increase the number of the target data from PDBN by referring their virtual addresses registers in Z-NAND to shelter the write requests. as an index. On the other hand, the write requests are served Read prefetch. L2 cache can better serve the memory requests by physical log blocks. We employ a logical page mapping if it can accurately prefetch the target data blocks from the un- table (LPMT) for each physical log block to record such derlying Z-NAND. We propose a predictor to speculate spatial information. If memory requests access a modified data block, locality of the access pattern, generated by user applications. they need to refer to LPMT to find out the physical location If the user applications access the continuous data blocks, the of the target data. Each LPMT is stored in the row decoder predictor informs the L2 cache to prefetch data blocks. As the of the corresponding log block. We will explain the detailed limited size of L2 cache cannot accommodate all prefetched implementation of LPMT shortly. The TLB is employed to data blocks, we also propose an access monitor to dynamically accelerate the flash address translation. It buffers the entries adjust the data sizes in each prefetch operation. of DBMT, which are frequently inquired by the GPU kernels. Construct fully-associative flash registers. Z-NAND plane Note that the physical log blocks come from SSD’s over- allows to employ a few flash registers (i.e., 8) due to space provisioned space and therefore, the blocks are not accounted constraints. The limited number of flash registers may not in the address space of Z-NAND [30]. Considering the limited be sufficient to accommodate all write requests based on SSD’s over-provisioned space, we group multiple physical data workload execution behaviors. To address this, we propose blocks to share a physical log block. 
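As a rough software model of the translation path just described, the sketch below resolves a virtual page through the block-granularity DBMT and falls back to the per-log-block LPMT only for pages that have been rewritten. The structure names, field widths, and the 384-pages-per-block constant (taken from Table I) are illustrative assumptions, not the hardware layout used by ZnG.

```cpp
// Software model of the split translation: a read-only, block-granularity
// DBMT resolves unmodified pages, while rewritten pages are remapped via the
// per-log-block LPMT (kept in the row decoder in ZnG, modeled here as a map).
#include <cstdint>
#include <optional>
#include <unordered_map>

constexpr uint32_t kPagesPerBlock = 384;   // flash pages per block (Table I)

struct DbmtEntry {            // one data block mapping table entry
  uint64_t vbn;               // virtual block number (application address space)
  uint64_t lbn;               // logical block number (global memory address)
  uint64_t pdbn;              // physical data block holding read-only pages
  uint64_t plbn;              // physical log block absorbing writes
};

struct LogBlock {             // per-log-block page mapping table (LPMT)
  std::unordered_map<uint32_t, uint32_t> lpmt;   // page-in-block -> log page
};

struct FlashAddr { uint64_t block; uint32_t page; };

// Translate a virtual page number into a flash physical address.
std::optional<FlashAddr> translate(
    uint64_t vpn,
    const std::unordered_map<uint64_t, DbmtEntry>& dbmt,
    const std::unordered_map<uint64_t, LogBlock>& logs) {
  const uint64_t vbn = vpn / kPagesPerBlock;
  const uint32_t off = static_cast<uint32_t>(vpn % kPagesPerBlock);
  auto it = dbmt.find(vbn);
  if (it == dbmt.end()) return std::nullopt;       // unmapped block
  const DbmtEntry& e = it->second;
  if (auto lit = logs.find(e.plbn); lit != logs.end()) {
    if (auto pit = lit->second.lpmt.find(off); pit != lit->second.lpmt.end())
      return FlashAddr{e.plbn, pit->second};       // page was rewritten
  }
  return FlashAddr{e.pdbn, off};                   // read-only page: the offset indexes the PDBN
}
```

Because only one DBMT entry is needed per flash block rather than per page, the table stays small enough (80 KB in the text) to live inside the MMU and be cached by the TLB.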
We also create a log block to group the flash registers across all flash planes in the mapping table (LBMT) (in the shared memory) to record such same Z-NAND flash package to serve as a fully-associative mapping information (cf. Figure 7a). cache, such that memory requests can be placed in anywhere While MMU can perform the address translation, it does not within the flash registers. However, this requires all the flash support other essential functionalities of FTL such as the wear- registers to connect to both I/O ports and all flash arrays levelling algorithm and garbage collection. We further imple- in a Z-NAND package, which introduce high interconnection ment these functions in a GPU helper thread. Specifically, cost. To address this issue, we simplify the interconnection by when all the flash pages in a physical log block have been used connecting all the flash registers with a single bus, and only up, the GPU helper thread performs garbage collection, which the buses are connected to the I/O ports and flash planes. We merges the pages of all physical data blocks and the physical also propose a thrashing checker to monitor if there is cache log blocks. It then selects empty physical data blocks based on thrashing in the limited flash registers. If so, ZnG pins a few wear-levelling algorithm to store the merged pages. The GPU L2 cache space to place the excessive dirty pages. helper thread lastly updates the corresponding information in the LBMT and the DBMT. IV. IMPLEMENTATION Overall implementation of LPMT in row decoders. Figure A. Zero-overhead FTL design 7b shows the implementation details of a programmable row Overall implementation of zero-overhead FTL. Figure 7a decoder in Z-NAND. The flash controller converts each mem- shows an overview of the FTL implementation across GPU’s ory request to the corresponding physical log block number, Request number Q% V` V_8  RR`8 :`] L RR`8 1

`V`V H. 10 11 12 13 14 15 16 J%VRQ%J % Q`` V QC%IJ RVHQRV QC%IJ RVHQRV QJ1 Q` Plane ID (a) The design of dynamic read prefetch module. (b) Asymmetric Z-NAND writes. (c) Software I/O permutation. Fig. 8: Dynamic read prefetch design and S/W I/O permutation design. physical data block number and page index, and sends them three components: 1) a predictor to speculate data locality; 2) to the row decoder. To serve a read request, the programmable an extension of L2 cache’s tag array to track the status of data decoder looks up LPMT for the target data. If the target data accesses; and 3) our access monitor to dynamically adjust the hits in LPMT, the programmable decoder activates the cor- granularity of data prefetch. responding wordline based on the page mapping information Predictor table. We design the predictor in L2 cache to of LPMT. Otherwise, the row decoder activates the wordline record the access history of the read requests and speculate based on the page index and the data block number. On the the memory access pattern based on the other hand, to serve a write request, a write operation is (PC) address of each thread, which is inspired by [11]. The performed by selecting a free page in the physical log block key insight behind this design is that the memory requests, to program the new data, and the new mapping information generated from the LD/ST instructions of the same PC address, is recorded in LPMT. As an in-order programming is only exhibits the same access patterns. The implementation of our allowed in Z-NAND, we can use a register to track the next predictor requires extending the memory requests with an extra available free page number in the physical log block. field to record the PC address and warp ID. Our predictor also The programmable decoder contains wordlines as many contains a predictor table, whose entries are indexed by the as those of Z-NAND flash array. Each wordline of the pro- PC addresses. We allocate 512 entries in the predictor table as grammable decoder connects to 2N flash cells and 4N bitlines default, which is the same to [11]. Each entry of the predictor ′ ′ ′ ′ (A0∼AN, B0∼BN, A0∼AN, B0∼BN), where N is the physical table contains a few fields for different warps to store the address length. The page mapping information of the LBPT logical page number that they are accessing. Fundamentally, is programmed in the flash cells of the programming decoder we track the accesses of five representative warps. Each entry by activating the corresponding wordlines and bitlines. The also includes a 4-bit counter to store the number of re-accesses detailed steps of a write operation in programmable decoder to the recorded pages. For example, if the warp 0 generates a are as follows: 1) the programmable decoder activates its memory request based on PC address 0 and the request targets wordline corresponding to the free page; 2) the values of to the same page as what is recorded in the predictor table, the page index, such as “1” and “0”, are converted as a high and counter increases by one. Otherwise, if the request accesses a low voltage, respectively. Such voltage is applied to B0∼BN, page different from the page number (recorded in the predictor ′ ′ while the inverse voltage is applied to B0∼BN. Thus, the table), the counter decreases by one, and the new page number ′ voltages on B and B program the flash cells; 3) for other is filled in the corresponding field of the predictor table. 
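A minimal model of this predictor table is sketched below. The 512 entries, the five tracked warps per entry, the 4-bit re-access counter, and the cutoff threshold of 12 used by the cutoff test described next all follow the text; the hashing and slot-replacement details are assumptions of the sketch.

```cpp
// Simplified PC-indexed read-prefetch predictor: each entry remembers the
// last logical page touched by a few representative warps and a 4-bit
// saturating confidence counter; prefetch is suggested above the cutoff.
#include <array>
#include <cstdint>

struct PredictorEntry {
  std::array<uint64_t, 5> last_page{};   // last logical page per tracked warp slot
  std::array<bool, 5>     valid{};
  uint8_t                 counter = 0;   // 4-bit saturating re-access counter
};

class ReadPrefetchPredictor {
  static constexpr int     kEntries = 512;
  static constexpr uint8_t kMax     = 15;   // 4-bit saturation
  static constexpr uint8_t kCutoff  = 12;   // prefetch only above this confidence
  std::array<PredictorEntry, kEntries> table_{};

 public:
  // Update the predictor on a memory request and report whether a
  // subsequent L2 miss from this PC should trigger a page prefetch.
  bool access(uint64_t pc, uint32_t warp_id, uint64_t logical_page) {
    PredictorEntry& e = table_[(pc >> 2) % kEntries];
    const size_t slot = warp_id % e.last_page.size();
    if (e.valid[slot] && e.last_page[slot] == logical_page) {
      if (e.counter < kMax) e.counter++;   // same page re-accessed: gain confidence
    } else {
      if (e.counter > 0) e.counter--;      // pattern changed: lose confidence
      e.last_page[slot] = logical_page;    // remember the new page
      e.valid[slot] = true;
    }
    return e.counter > kCutoff;            // cutoff test for read prefetch
  }
};
```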
When rows, the “protect” gates are enabled to drive high voltages there is a cache miss, a cutoff test of read prefetch checks the to the drain selectors, which can avoid program disturbance. predictor table by referring to the PC address of the memory The programmable decoder operates as a content addressable request. If the counter value is higher than a threshold (i.e., memory (CAM) to search if the target data exists in a physical 12), we perform a read prefetch. log block. Specifically, the search procedure includes two Extension of L2 cache’s tag array. We extend each entry phases, which are shown in Figure 7b. In phase 1, the clock of L2 cache’s tag array with the fields of an accessed bit signal disables the gates, and the wordlines are charged with and a prefetch bit. These two fields are used to check if the a high voltage. In phase 2, the voltages converted from page prefetched data were early evicted due to the limited L2 cache ′ ′ index are applied to A0∼AN and A0∼AN. If the page index space. Specifically, we use the prefetch bit to identify whether matches with the values, stored in any row, the gates between the data stored in the cache line is filled by a prefetch, while A and A′ are enabled to dicharge the corresponding wordline the accessed bit records if a cache line has been accessed by a of the programmable decoder (which activates a row of the warp. When a cache line is evicted, the prefetch and accessed flash array). Otherwise, the wordlines keep the high voltage bits are checked. If the cache line is filled by a prefetch but to disable the row selection. has not been accessed by a warp, this indicates a read prefetch may introduce L2 cache thrashing. B. Read optimization Access monitor. To avoid early eviction of the prefetched Figure 8a shows an overview of the proposed dynamic read data and improve the utilization of L2 cache, we propose an prefetch to improve L2 cache utilization. Our design includes access monitor to dynamically adjust the access granularity of


(a) Simple H/W permutation. (b) H/W permutation with low hardware cost. Fig. 9: FTL/SSD engine design. data prefetch. If a cache line is evicted, our access monitor On the other hand, our simple hardware solution, FCnet, updates its evict counter and unused counter by referring to builds a fully-connected network to make all flash registers aforementioned prefetch and accessed bits. We calculate a directly connect to the I/O ports and Z-NAND planes (cf. waste ratio of data prefetch by dividing the unused counter Figure 9a). While this fully-connected network can maximize with the evict counter. If the waste ratio is higher than a high the internal parallelism within a Z-NAND package, it needs a threshold, we decrease the access granularity of data prefetch large number of point-to-point wire connections. We optimize by half. If the waste ratio is lower than a low threshold, the hardware solution by connecting the flash registers to I/O we increase the access granularity by 1KB. To determine the ports and Z-NAND planes with a hybrid network. The new optimal threshold configurations, we performed an evaluation solution, called Network-in-Flash (NiF), can reduce hardware by sweeping different values of the high and low thresholds cost and achieve high performance. Figure 9b shows our (cf. Section V-D), and observed that ZnG can achieve the best implementation details of NiF. Our key insight is that either performance by configuring the high and low thresholds as 0.3 an I/O port or a Z-NAND plane can only serve the data from and 0.05, respectively. Thus, we adopt these values by default. a single flash register. Thus, in NiF, all flash registers from C. Write optimization the same flash plane are connected to two buses. A bus is extended to connect the I/O port as a shared I/O path, while We analyze write patterns in different Z-NAND planes another bus is connected to its local Z-NAND plane as a betw-back by executing a multi-application workload [23], shared data path. The control logic can select a flash register [24] and depict the results in Figure 8b. One can observe from to use the I/O path by turning on the switch gate, while it the figure that different Z-NAND planes from different chan- can simultaneously select another flash register to access the nels experience different number of writes. There are two root local Z-NAND plane. In this design, a flash register cannot causes behind these write patterns: 1) SSD controller redirects directly access a remote flash plane. Instead, we assign one the requests of different applications to access different flash flash register from the group of flash registers, referred to as planes, which can help reduce write amplification [31]; 2) data register, in connecting to other groups of flash registers an application may exhibit asymmetric accesses to different via a local network. If a flash register needs to write its data pages. Due to asymmetric writes on Z-NAND planes, a few to a remote Z-NAND plane, such flash register firstly moves flash registers can stay in idle, while other flash registers suffer the data to the remote data register, and then the remote data from a data thrashing issue. To address this challenge, we register evicts the data to the remote Z-NAND plane. The bus may group different flash registers together to serve the write structure is aimed to save costs of building a network inside a requests, such that data can be placed in anywhere of the flash flash package. Although it still needs to migrate data between registers. 
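For clarity, the following sketch models the access monitor introduced earlier in this section (Section IV-B): at each eviction of a prefetched line it updates the evict and unused counters from the prefetch/accessed bits, derives the waste ratio, and adapts the prefetch granularity using the 0.3/0.05 thresholds. The sampling window and the 128B/4KB bounds are assumptions of this illustration, not values from the paper.

```cpp
// Access monitor that adapts the read-prefetch granularity from the waste
// ratio of prefetched-but-unused L2 lines (thresholds follow Section IV-B).
#include <algorithm>
#include <cstdint>

class PrefetchAccessMonitor {
  uint64_t evicted_ = 0;          // prefetched lines evicted from L2
  uint64_t unused_  = 0;          // ... of which were never accessed by a warp
  uint32_t granularity_ = 4096;   // current prefetch size in bytes (one flash page)

 public:
  // Called whenever an L2 line that was filled by a prefetch is evicted.
  void on_evict(bool was_accessed) {
    evicted_++;
    if (!was_accessed) unused_++;
    if (evicted_ < 64) return;    // assumed sampling window before re-deciding
    const double waste = static_cast<double>(unused_) / evicted_;
    if (waste > 0.30 && granularity_ > 128)
      granularity_ /= 2;          // heavy thrashing: shrink the prefetches
    else if (waste < 0.05)
      granularity_ = std::min<uint32_t>(granularity_ + 1024, 4096); // grow by 1KB
    evicted_ = unused_ = 0;       // start a new observation window
  }

  uint32_t granularity() const { return granularity_; }
};
```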
This register grouping can make the flash registers two registers, the data migration in NiF does not occupy the fully utilized. We propose simple software and hardware flash network. It also allows migrating multiple data in local solutions, called SWnet and FCnet, respectively, which are network simultaneously, which can achieve a better degree of shown in Figures 8c and 9a. As shown in Figure 8c, the internal parallelism than SWnet. flash controller in SWnet can directly control a flash register to write its data to the local flash plane 0 ( 1 ). If the flash V. EVALUATION register needs to write its data to the remote flash plane 1, the flash controller leverages a router in the flash network to copy A. Experiment the data to its internal buffer ( 1 ) and redirect the data to a Simulation methodology. We use SimpleSSD [32] and Mac- remote flash register ( 2 ). Once data is available in the remote Sim [22] to model an SSD and GPU, similar to a 800GB flash register, the flash controller coordinates the remote flash ZSSD [12] and NVIDIA GTX580 [17], respectively; Table I register to write such data to the flash plane 1 ( 3 ). Note that explains the details of each configuration. Besides, we increase SWnet does not require any hardware modification on existing the L2 cache size to match with the configuration of the state- flash architectures. However, it needs data migration between of-the-art GPU, NVIDIA GV100 [33]. STT-MRAM that we the two flash registers and consumes flash network bandwidth. simulate can increase the capacity of L2 cache by 4×, but GPU Z-NAND array STT-MRAM L2 cache SM/freq. 16/1.2 GHz channel/package 16/1 size 24MB, shared HybridGPU ZnG-base ZnG-rdopt ZnG-wropt ZnG max warps 80 per core die/plane 8/8 latency R:1-cycle, W:5-cycle 1-cycle, 64-set block/page 1024/384 Flash network 1000 L1 cache 6-way, 48KB interface 800MT/s type mesh LRU, private cell type SLC bus width 8B 100 1-cycle, 6 banks register 2/8 per plane Optane DC PMM L2 cache 1024-set, 8-way I/O ports 2 per package tRCD/tCL 190/8.9ns LRU, 6MB, shared HW-NiF width 8B tRP 763ns 10 TABLE I: System configurations of ZnG.

(GB/s) 1 Workload Suites read ratio Kernel Workload Suites read ratio Kernel betw [23] 0.98 11 gc2 [23] 0.99 10 0.1 bfs1 [23] 0.95 7 sssp3 [23] 0.98 8 bfs2 [23] 0.99 9 deg [23] 1 1 B/W Flasharray pr-gaus gc1-FDTgc2-FDT bfs3-FDTbfs4-backbfs5-backdeg-gram bfs3 [23] 0.88 10 pr [23] 0.99 53 betw-backbfs1-gaus sssp3-grambfs2-gaus bfs6-gaus bfs4 [23] 0.97 12 back [24] 0.57 1 bfs5 [23] 0.99 6 gaus [24] 0.66 3 Fig. 11: Bandwidth analysis of flash arrays. bfs6 [23] 0.97 7 FDT [25] 0.73 1 gc1 [23] 0.98 8 gram [25] 0.75 3 delay to move from the SSD to the GPU. As shown in TABLE II: GPU benchmarks. the figure, HybridGPU outperforms Hetero by 31%, on average, under the workloads, betw-back, bfs1-gaus, its write latency is 5× than SRAM read latency [29]. We bfs2-gaus and bfs6-gaus. Optane improves the per- also configure the network widths of both HW-NiF and the formance by 186%, compared to HybridGPU, in overall. flash network to 8B, which is 8× higher than traditional flash This is because Optane directly replaces DRAM with Op- channel (1B). Lastly, we derive the latency model of Optane tane DC PMM, whose accumulated bandwidth can be upto DC PMM from the evaluation of real devices [34]. 39GB/s, while bandwidth of HybridGPU is limited by its GPU-SSD platforms. We implement seven different GPU- SSD controller and internal memory bandwidth, which is only SSD platforms: (1) Hetero: GPU and SSD are discrete de- 5GB/s. As directly serving read/write requests with Z-NAND vices attached to the host; (2) hybridGPU [11]; (3) Optane: can drastically waste network bandwidth and flash bandwidth, integrating Optane DC PMM in GPU by employing six mem- both ZnG-base and ZnG-rdopt cannot catch up the per- ory controllers to connect six different Optane DC PMM; (4) formance of HybridGPU. By merging all incoming small ZnG-base: the baseline architecture of ZnG, which integrates write requests in flash registers of Z-NAND, ZnG-wropt can the implementation of Section III-B without the read and effectively reduce the number of flash programming opera- write optimizations; (5) ZnG-rdopt: integrating the design tions, which can improve its performance by 2.6×, compared of L2 cache in ZnG-base; (6) ZnG-wropt: integrating the to ZnG-rdopt. Lastly, by effectively buffering data in L2 design of flash registers in ZnG-base; (7) ZnG: putting both cache and flash registers, ZnG can fully utilize its accumulated ZnG-rdopt and ZnG-wropt into ZnG-base. bandwidth, which is much higher than Optane; ZnG can Workloads. We select a large number of applications from a achieve 1.9× higher bandwidth than Optane, on average. graph analysis benchmark [23] and scientific benchmarks [24], Bandwidth of Z-NAND flash arrays. Figure 11 shows the [25]. The details of our evaluated workloads are provided in bandwidth of Z-NAND flash arrays measured from different Table II. We then generate multi-app workloads by co-running GPU-SSD platforms. The average bandwidth of HybridGPU a read-intensive workload and a write-intensive workload. is only 4.2 GB/s due to constraints of limited flash network Concurrently executing multiple applications can stress the bandwidth and bandwidth of the internal DRAM buffer. Al- memory subsystem in the GPU, while the complex access though ZnG-base employs a high-performance flash net- patterns generated by multiple applications can examine a work, only a small amount of its flash array bandwidth is used robustness of the proposed techniques. for the incoming memory requests because of its large access B. 
Performance granularity. As L2 cache can buffer a whole flash page and IPC. Figure 10 shows IPC values of different GPU-SSD its accumulated bandwidth is much higher than the internal platforms under various workload executions, and the values DRAM buffer in HybridGPU, ZnG-rdopt can improve the are normalized to ZnG’s IPCs. Although the GPU in Hetero flash array bandwidth by 2.9×, compared to HybridGPU. can enjoy high bandwidth of its internal GDDR5 DRAM, However, ZnG-rdopt can be blocked by a few small write data initially resides in the external SSD, which takes a long requests, as Z-NAND’s write latency is 33× longer than its read latency. By buffering the write requests in the flash Hetero HybridGPU Optane registers, the flash array bandwidth of ZnG-wropt exceeds ZnG-base ZnG-rdopt ZnG-wropt ZnG that of ZnG-rdopt by 137%, on average. Lastly, ZnG can 1.0 0.8 buffer the small write requests in both the L2 cache and 0.6 flash registers. Therefore, it further increases the flash array 0.4 bandwidth by 167%, compared to ZnG-wropt, on average. 0.2 0.0 C. Effectiveness of Optimizations Normalized IPC Read re-accesses. Figure 12 demonstrates how our proposed pr-gaus gc1-FDTgc2-FDT bfs3-FDTbfs4-backbfs5-back deg-gram betw-backbfs1-gaus sssp3-grambfs2-gaus bfs6-gaus technique can reduce the number of read re-accesses in the Fig. 10: Performance of different GPU architectures. Z-NAND flash array. As STT-MRAM can increase L2 cache 6 SRAM STT-MRAM Dyn-prefetch Redirection Ideal ZnG 1.0 80 5 0.8 4 0.6 60 3 0.4 40 2 0.2 1 0.0 20 -- Normalized IPC Prediction accuracy 0 1app2app3app4app5app6app7app8app 1app2app3app4app5app6app7app8app bfs1-gausgc1-FDTgc2-FDT Read-intensive Write-intensive betw-back sssp3-gram

(a) Performance of multi-applications. (b) Accuracy. Fig. 15: Scalable performance and the predictor accuracy. Fig. 12: The number of read re-accesses in flash arrays.

Write redundancyWrite (a) Varying thresholds. (b) Varying read prefetch methods. pr-gaus gc1-FDTgc2-FDT bfs3-FDTbfs4-backbfs5-backdeg-gram betw-backbfs1-gaus sssp3-grambfs2-gaus bfs6-gaus Fig. 16: Performance analysis of prediction modules. Fig. 13: Write redundancy in flash arrays. reduces the write redundancy to 1.2, on average. capacity to accommodate more read requests, replacing SRAM Network in flash. Figure 14 compares the performance with STT-MRAM in L2 cache can reduce the average number of using different network designs in Z-NAND flash pack- of read re-accesses by 55%. Employing our proposed dynamic ages. Employing hardware-based fully-connected network prefetching technique in L2 cache can further reduce the (HW-FCnet) can achieve 19% higher performance than the number of read re-accesses by 87%. This is because the software-based solution (SWnet). However, the building cost dynamic prefetching technique can better utilize the limited of a fully-connected network in the flash package is not space of L2 cache by only allocating the cache space for affordable. Our NiF solution (HW-NiF) can achieve 98% of data that will be accessed soon. Even though pinning L2 the performance in HW-FCnet, while it can reduce the cost cache space for the excessive write requests (Redirection) of network construction in Z-NAND flash packages. can reduce available L2 cache size to accommodate the read requests, we observe that the number of read re-accesses only D. Sensitive testing increases by 11%, compared to Dyn-prefetch. Scalability. Modern GPUs support co-running multiple small- Write redundancy. Figure 13 shows how our proposed flash scale applications to better utilize its massive cores. We also register design can effectively reduce the write redundancy examine performance behaviors of ZnG in co-running multiple in Z-NAND. The average number of write redundancy in applications, which is shown in Figure 15a. In the figure, baseline is 51 as baseline employs 8 flash registers we also add evaluation results of an ideal configuration for to buffer and merge the massive small write requests. By better comparison. The ideal configuration employs extra large employing NiF (network), the flash registers from different GPU DRAM to accommodate all the data sets. We select planes in the same flash package can be grouped together to a representative read-intensive application (betw) and write- serve incoming write requests. By improving the utilization of intensive application (back) for the examination. One can flash registers, network can reduce such write redundancy observe from the figure that the performance improvement of by 46%. However, write-intensive scientific workloads such both Ideal and ZnG is not proportional to the number of as gaus can generate excessive write requests to the limited increased applications. This is because running a large number flash registers, which makes the flash register thrashing. By of applications in the GPU cores can generate a massive num- redirecting the write requests to the pinned L2 cache space, ber of memory requests, which stress the underlying memory we can mitigate the negative impact of thrashing issue, which system. In practice, Amazon AWS only allows co-running upto four workloads in one GPU. Our evaluation results reveal SWnet HW-FCnet HW-NiF 2.5 1.4 5000 betw back 2.0 1.2 4000 1.0 1.5 0.8 3000 1.0 0.6 2000 0.4 0.5


Normalized IPC Normalized betwback betwback 300 600 900 1200 pr-gaus non-GC GC Time (us) gc1-FDTgc2-FDT bfs3-FDTbfs4-backbfs5-back deg-gram betw-backbfs1-gaus sssp3-grambfs2-gaus bfs6-gaus (a) Performance of GC. (b) Time series analysis. Fig. 14: Performance analysis of flash register designs. Fig. 17: Performance analysis of garbage collection. that ZnG can achieve the performance improvement similar SSD controller and tightly integrated the FTL into the host. to Ideal, when co-running four workloads. Even if we [35] proposes to allocate the flash address mapping table in increase the number of applications to 8, ZnG can still achieve the host-side main memory. Each user application in the host 15% and 6% of the performance improvement, compared can create an I/O thread to update its I/O write information to Ideal, under the execution of read-intensive and write- in the host-side flash address mapping table. However, this intensive applications, respectively. This demonstrates the high approach requires roughly 1GB-sized flash address mapping scalability of ZnG to execute multiple applications. table for 1TB SSD. Implementing this approach in a GPU Read prefetch. Figure 15b shows the prediction accuracy can impose a huge memory cost. In addition, the flash address of our PC-based predictor (cf. Section IV-B). Our predictor translation requires involvement of the user applications rather can achieve the prediction accuracy of 93%, on average, than MMU, which decreases the performance of the user ap- across all the workloads, and we observe that the predic- plication. [37] proposes to integrate the flash address mapping tion accuracy only decreases to 87% in the worst case. table into the MMU’s page table. Their approach is aimed to As shown in Figure 16a, we also evaluate the performance reduce the address translation latency of memory mapped files, with varying high and low thresholds of our access mon- rather than configuring SSD as main memory. In addition, their itor. Our access monitor can achieve the best performance approach cannot reduce the size of flash address mapping table if the high and low thresholds are 0.3 and 0.05, respec- and has to place the table together with the MMU’s page table tively. Figure 16b compares the overall performance when in main memory. In contrast, ZnG integrates FTL in MMU, disabling read prefetch (nopref), enabling read prefetch which is designed towards replacing the GPU memory with with 1KB/4KB data per access (1KBpref/4KBpref), and Z-NAND. Our proposed techniques reduce the mapping table selectively prefetching 4KB data based on data locality specu- size to 80KB for 1TB Z-NAND capacity, which can be placed lation (predict-4KBpref) and our proposed dynamic read in a MMU internal buffer. Thus, ZnG can fully eliminate the prefetch (dyn-pref). 1KBpref and 4KBpref can enable usage of DRAM. In addition, the address translation in ZnG is the read prefetch in the cases of sequential accesses, such automated by TLB/MMU, which eliminates the FTL overhead. that they outperform nopref by 22% and 32%, on average, Multi-app execution. Modern GPUs support parallel ex- respectively. As predict-4KBpref can disable the read ecutions of multiple applications, which can improve the prefetch for random accesses, it achieves better performance utilization of GPU cores [15], [16]. However, unlike traditional than 4KBpref. 
Garbage collection. Figure 17a shows the impact of garbage collection on the performance of the workload mix betw-back. When the GPU helper thread reclaims the data blocks of workload back owing to garbage collection, it flushes the data of back from the L2 cache to the underlying Z-NAND and blocks all memory requests generated by back from accessing the Z-NAND. Thus, such garbage collections can unfortunately reduce the performance of back by 73%, on average. However, the garbage collection procedure that we implement in ZnG does not block other applications such as betw. In fact, we observe that the performance of betw increases by 5%, on average, due to the garbage collections of back. This is because the L2 cache flushes the cache lines occupied by back due to the garbage collection, and the freed cache lines can be used to accommodate the memory requests of betw. Figure 17b shows the time series analysis of the memory requests generated by betw and back during their executions. From 0 us to 1108 us, the number of generated memory requests keeps decreasing for betw because of L2 cache competition between betw and back. As the MMU blocks the memory requests during garbage collection (starting from 1108 us), the number of memory requests generated by back decreases to 0.
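The per-application gating behind this result can be illustrated with the following simplified sketch: while the helper thread reclaims blocks owned by one application, only that application's requests to Z-NAND are stalled. The types and names below are hypothetical and only approximate the mechanism described above.

    // Minimal sketch (illustrative only) of per-application gating during garbage
    // collection: requests from the GC victim are held back; everyone else proceeds.
    // All types and names are our own assumptions, not ZnG's actual implementation.
    #include <cstdint>
    #include <iostream>
    #include <optional>
    #include <queue>

    struct MemRequest { int app_id; uint64_t addr; };

    class GcGate {
    public:
        void begin_gc(int victim_app)  { victim_ = victim_app; }
        void end_gc()                  { victim_.reset(); }

        // The MMU consults the gate before issuing a request to the flash backend.
        bool may_issue(const MemRequest& r) const {
            return !victim_ || r.app_id != *victim_;
        }

    private:
        std::optional<int> victim_;
    };

    int main() {
        GcGate gate;
        std::queue<MemRequest> pending;
        pending.push({0 /*betw*/, 0x1000});
        pending.push({1 /*back*/, 0x2000});

        gate.begin_gc(/*victim_app=*/1);       // helper thread reclaims back's blocks
        while (!pending.empty()) {
            MemRequest r = pending.front(); pending.pop();
            std::cout << "app " << r.app_id << (gate.may_issue(r) ? " issued\n" : " stalled\n");
        }
        gate.end_gc();
    }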

VI. RELATED WORK AND DISCUSSION

FTL integration. Multiple prior studies [35]–[38] have proposed to decouple the functionalities of the FTL from the flash-based SSD controller and tightly integrate the FTL into the host. [35] proposes to allocate the flash address mapping table in the host-side main memory. Each user application in the host can create an I/O thread to update its I/O write information in the host-side flash address mapping table. However, this approach requires a roughly 1GB-sized flash address mapping table for a 1TB SSD. Implementing this approach in a GPU can impose a huge memory cost. In addition, the flash address translation requires the involvement of the user applications rather than the MMU, which decreases the performance of the user application. [37] proposes to integrate the flash address mapping table into the MMU's page table. Their approach aims to reduce the address translation latency of memory-mapped files, rather than configuring the SSD as main memory. In addition, their approach cannot reduce the size of the flash address mapping table and has to place the table together with the MMU's page table in main memory. In contrast, ZnG integrates the FTL into the MMU and is designed towards replacing the GPU memory with Z-NAND. Our proposed techniques reduce the mapping table size to 80KB for 1TB of Z-NAND capacity, so that the table can be placed in an MMU-internal buffer. Thus, ZnG can fully eliminate the usage of DRAM. In addition, the address translation in ZnG is automated by the TLB/MMU, which eliminates the FTL overhead.
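The 1GB and 80KB figures above follow from simple sizing arithmetic, reproduced in the hedged sketch below. It assumes 4KB mapping pages and 4-byte entries for a conventional page-level FTL, and merely derives the mapping granularity that an 80KB table would imply; it does not describe ZnG's actual table layout.

    // Back-of-the-envelope sketch (illustrative only): why a conventional page-level
    // mapping table for a 1TB SSD is on the order of 1GB, and what mapping granularity
    // an 80KB table implies. Entry width (4B) and page size (4KB) are assumptions.
    #include <cstdint>
    #include <iostream>

    int main() {
        const uint64_t capacity   = 1ULL << 40;   // 1TB of flash
        const uint64_t page_size  = 4096;         // mapping unit of a page-level FTL
        const uint64_t entry_size = 4;            // bytes per mapping entry

        uint64_t page_entries = capacity / page_size;
        std::cout << "page-level table: "
                  << (page_entries * entry_size) / (1ULL << 30) << " GB\n";   // ~1 GB

        const uint64_t table_budget = 80ULL * 1024;                // 80KB table, as in ZnG
        uint64_t entries_in_budget  = table_budget / entry_size;   // 20480 entries
        std::cout << "granularity implied by an 80KB table: "
                  << (capacity / entries_in_budget) / (1ULL << 20) << " MB per entry\n"; // ~50 MB
    }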
Multi-app execution. Modern GPUs support parallel executions of multiple applications, which can improve the utilization of the GPU cores [15], [16]. However, unlike traditional computer systems that employ large DRAM to store the contexts of tens or hundreds of applications, a GPU executes only a few applications simultaneously because of its limited resources (i.e., memory capacity). For example, ARM Mali-T604 can execute at most 4 processes, as its MMU supports 4 independent address spaces [15]. While the latest NVIDIA Volta GPU can submit up to 48 threads/clients into the GPU work queue, NVIDIA reports that the maximum performance improvement is limited to 7x [16]. Thus, Amazon AWS allows at most 4 users to execute their workloads in a single GPU [39]. Our evaluation reveals that ZnG can co-run 8 graph analysis applications without significant performance degradation, which is comparable to the state-of-the-art GPUs.

Z-NAND lifetime. ZnG is designed for large-scale data analysis applications such as graph analysis, which are mostly read-intensive. We employ multiple flash registers to accommodate and merge the small write requests, which can significantly reduce the number of write requests that access the underlying Z-NAND. As Z-NAND can endure many more program/erase cycles than 3D V-NAND, this guarantees Z-NAND a long lifetime. Nevertheless, we can also apply different wear-levelling algorithms in our GPU helper thread to further extend the lifetime of Z-NAND. Although Optane DC PMM has a much longer lifetime than Z-NAND, it exhibits a much lower memory density than Z-NAND, and its accumulated bandwidth is lower than that of Z-NAND.

VII. CONCLUSION

In this work, we propose ZnG, a new GPU-SSD architecture, which maximizes the memory capacity in a GPU while addressing the performance penalties imposed by the GPU on-board SSD. Specifically, ZnG replaces all GPU memory with a Z-NAND flash array to maximize the GPU memory capacity. It also removes the SSD controller to directly expose the accumulated bandwidth of the Z-NAND flash array to the GPU cores. Our evaluation indicates that ZnG can achieve 7.5× higher performance than prior work, on average.

ACKNOWLEDGMENT

We thank the anonymous reviewers for their constructive feedback. This research is supported by NRF 2016R1C1B2015312, DOE DE-AC02-05CH11231, KAIST Start-Up Grant (G01190015), and MemRay grant (G01190170).

REFERENCES

[1] H. Jiang, Y. Chen, Z. Qiao, T.-H. Weng, and K.-C. Li, "Scaling up mapreduce-based big data processing on multi-gpu systems," Cluster Computing, vol. 18, no. 1, pp. 369–383, 2015.
[2] C. Napoli, G. Pappalardo, E. Tramontana, and G. Zappalà, "A cloud-distributed gpu architecture for pattern identification in segmented detectors big-data surveys," The Computer Journal, vol. 59, no. 3, pp. 338–352, 2014.
[3] P. Li, Y. Luo, N. Zhang, and Y. Cao, "Heterospark: A heterogeneous cpu/gpu spark platform for machine learning algorithms," in 2015 IEEE International Conference on Networking, Architecture and Storage (NAS), pp. 347–348, IEEE, 2015.
[4] S. Hong, T. Oguntebi, and K. Olukotun, "Efficient parallel graph exploration on multi-core cpu and gpu," in 2011 International Conference on Parallel Architectures and Compilation Techniques, pp. 78–88, IEEE, 2011.
[5] Y. Wang, A. Davidson, Y. Pan, Y. Wu, A. Riffel, and J. D. Owens, "Gunrock: A high-performance graph processing library on the gpu," in ACM SIGPLAN Notices, vol. 51, p. 11, ACM, 2016.
[6] K. team, "Teardown of evga geforce rtx 2080 ti xc ultra," https://xdevs.com/guide/evga_2080tixc/, 2016.
[7] Y. Kim, Architectural techniques to enhance DRAM scaling. PhD thesis, figshare, 2015.
[8] U. Kang, H.-S. Yu, C. Park, H. Zheng, J. Halbert, K. Bains, S. Jang, and J. S. Choi, "Co-architecting controllers and dram to enhance dram process scaling," in The Memory Forum, vol. 14, 2014.
[9] P. J. Nair, D.-H. Kim, and M. K. Qureshi, "Archshield: Architectural framework for assisting dram scaling by tolerating high error rates," in ACM SIGARCH Computer Architecture News, vol. 41, pp. 72–83, ACM, 2013.
[10] AMD, "The world's first gpu to break the terabyte memory barrier," https://www.amd.com/en/products/professional-graphics/radeon-pro-ssg, 2017.
[11] J. Zhang, M. Kwon, H. Kim, H. Kim, and M. Jung, "Flashgpu: Placing new flash next to gpu cores," in Proceedings of the 56th Annual Design Automation Conference 2019, p. 156, ACM, 2019.
[12] S. Koh et al., "Exploring system challenges of ultra-low latency solid state drives," in HotStorage 18, 2018.
[13] C. Nvidia, "Nvidia turing gpu architecture," https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf, 2018.
[14] C. Nvidia, "Nvidia's next generation compute architecture: Fermi," Computer system, 2009.
[15] ARM, "Memory management on embedded graphics processors," https://community.arm.com/developer/tools-software/graphics/b/blog/posts/memory-management-on-embedded-graphics-processors, 2013.
[16] NVIDIA, "Multi-process service," https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf, 2013.
[17] NVIDIA, "Nvidia kepler next generation cuda compute architecture," Computer system, vol. 26, pp. 63–72, 2012.
[18] J. Power, M. D. Hill, and D. A. Wood, "Supporting x86-64 address translation for 100s of gpu lanes," in 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA), pp. 568–578, IEEE, 2014.
[19] A. Bakhoda et al., "Analyzing cuda workloads using a detailed gpu simulator," in ISPASS, IEEE, 2009.
[20] J. Zhang et al., "Nvmmu: A non-volatile memory management unit for heterogeneous gpu-ssd architectures," in PACT, IEEE, 2015.
[21] Marvell, "Marvell nvme ssd controller," https://www.marvell.com/storage/ssd/88ss1084-1100/, 2018.
[22] H. Kim, J. Lee, N. B. Lakshminarayana, J. Sim, J. Lim, and T. Pho, "Macsim: A cpu-gpu heterogeneous simulation framework user guide," Georgia Institute of Technology, 2012.
[23] L. Nai, Y. Xia, I. G. Tanase, H. Kim, and C.-Y. Lin, "Graphbig: understanding graph computing in the context of industrial solutions," in SC '15: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–12, IEEE, 2015.
[24] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron, "Rodinia: A benchmark suite for heterogeneous computing," in 2009 IEEE International Symposium on Workload Characterization (IISWC), pp. 44–54, IEEE, 2009.
[25] L.-N. Pouchet, "Polybench: The polyhedral benchmark suite," http://www.cs.ucla.edu/pouchet/software/polybench, 2012.
[26] O. Workgroup, "Open nand flash interface specification revision 3.0," ONFI Workgroup, Published Mar, vol. 15, p. 288, 2011.
[27] L. Yavits, U. Weiser, and R. Ginosar, "Resistive address decoder," https://www.researchgate.net/publication/313814525_Resistive_Address_Decoder, 2017.
[28] N. Vijaykumar, G. Pekhimenko, A. Jog, A. Bhowmick, R. Ausavarungnirun, C. Das, M. Kandemir, T. C. Mowry, and O. Mutlu, "A case for core-assisted bottleneck acceleration in gpus: enabling flexible data compression with assist warps," ACM SIGARCH Computer Architecture News, vol. 43, no. 3, pp. 41–53, 2016.
[29] J. Zhang, M. Jung, and M. Kandemir, "Fuse: Fusing stt-mram into gpus to alleviate off-chip memory access overheads," in 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 426–439, IEEE, 2019.
[30] C. Park, W. Cheon, J. Kang, K. Roh, W. Cho, and J.-S. Kim, "A reconfigurable ftl (flash translation layer) architecture for nand flash-based applications," ACM Transactions on Embedded Computing Systems (TECS), vol. 7, no. 4, p. 38, 2008.
[31] J.-U. Kang, J. Hyun, H. Maeng, and S. Cho, "The multi-streamed solid-state drive," in 6th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage 14), 2014.
[32] M. Jung, J. Zhang, A. Abulila, M. Kwon, N. Shahidi, J. Shalf, N. S. Kim, and M. Kandemir, "Simplessd: modeling solid state drives for holistic system simulation," IEEE Computer Architecture Letters, vol. 17, no. 1, pp. 37–41, 2017.
[33] Nvidia, "Nvidia tesla v100 gpu architecture," http://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf, 2017.
[34] J. Izraelevitz, J. Yang, L. Zhang, J. Kim, X. Liu, A. Memaripour, Y. J. Soh, Z. Wang, Y. Xu, S. R. Dulloor, et al., "Basic performance measurements of the intel optane dc persistent memory module," arXiv preprint arXiv:1903.05714, 2019.
[35] M. Bjørling, J. González, and P. Bonnet, "Lightnvm: The linux open-channel SSD subsystem," in 15th USENIX Conference on File and Storage Technologies (FAST 17), pp. 359–374, 2017.
[36] H. Qin, D. Feng, W. Tong, J. Liu, and Y. Zhao, "Qblk: Towards fully exploiting the parallelism of open-channel ssds," in 2019 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 1064–1069, IEEE, 2019.
[37] J. Huang, A. Badam, M. K. Qureshi, and K. Schwan, "Unified address translation for memory-mapped ssds with flashmap," in ACM SIGARCH Computer Architecture News, vol. 43, pp. 580–591, ACM, 2015.
[38] A. Abulila, V. S. Mailthody, Z. Qureshi, J. Huang, N. S. Kim, J. Xiong, and W.-m. Hwu, "Flatflash: Exploiting the byte-accessibility of ssds within a unified memory-storage hierarchy," in Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 971–985, 2019.
[39] Amazon, "Amazon elastic graphics," https://aws.amazon.com/ec2/elastic-graphics/, 2016.