Pump up the Volume: Processing Large Data on Gpus with Fast

Pump Up the Volume: Processing Large Data on GPUs with Fast Interconnects Clemens Lutz Sebastian Breß Steffen Zeuch [email protected] [email protected] [email protected] DFKI GmbH TU Berlin DFKI GmbH Berlin, Germany Berlin, Germany Berlin, Germany Tilmann Rabl Volker Markl [email protected] [email protected] HPI, University of Potsdam DFKI GmbH, TU Berlin Potsdam, Germany Berlin, Germany ABSTRACT Memory NVLink 2.0 PCI-e 3.0 200 158.9 GPUs have long been discussed as accelerators for database 150 124.6 102.6 120.7 query processing because of their high processing power and 100 memory bandwidth. However, two main challenges limit the 50 24.7 20.5 0 utility of GPUs for large-scale data processing: (1) the on- Bandwidth (GiB/s) Theoretical Measured board memory capacity is too small to store large data sets, yet (2) the interconnect bandwidth to CPU main-memory is Figure 1: NVLink 2.0 eliminates the GPU’s main- insufficient for ad hoc data transfers. As a result, GPU-based memory access disadvantage compared to the CPU. systems and algorithms run into a transfer bottleneck and ACM Reference Format: do not scale to large data sets. In practice, CPUs process Clemens Lutz, Sebastian Breß, Steffen Zeuch, Tilmann Rabl, and Volker large-scale data faster than GPUs with current technology. Markl. 2020. Pump Up the Volume: Processing Large Data on GPUs In this paper, we investigate how a fast interconnect can with Fast Interconnects. In Proceedings of the 2020 ACM SIGMOD resolve these scalability limitations using the example of International Conference on Management of Data (SIGMOD’20), June NVLink 2.0. NVLink 2.0 is a new interconnect technology 14–19, 2020, Portland, OR, USA. ACM, New York, NY, USA, 17 pages. that links dedicated GPUs to a CPU. The high bandwidth of https://doi.org/10.1145/3318464.3389705 NVLink 2.0 enables us to overcome the transfer bottleneck and to efficiently process large data sets stored in main-mem- 1 INTRODUCTION ory on GPUs. We perform an in-depth analysis of NVLink 2.0 Over the past decade, co-processors such as GPUs, FPGAs, and show how we can scale a no-partitioning hash join be- and ASICs have been gaining adoption in research [17, 35, 38, yond the limits of GPU memory. Our evaluation shows speed- 56] and industry [88] to manage and process large data. De- ups of up to 18× over PCI-e 3.0 and up to 7:3× over an op- spite this growth, GPU-enabled databases occupy a niche [67] timized CPU implementation. Fast GPU interconnects thus in the overall databases market [28]. In contrast, there is enable GPUs to efficiently accelerate query processing. wide-spread adoption in the deep learning [21, 71] and high performance computing domains. For instance, 29% of the Top500 supercomputers support co-processors [92]. Data- base research points out that a data transfer bottleneck is Permission to make digital or hard copies of all or part of this work for the main reason behind the comparatively slow adoption of personal or classroom use is granted without fee provided that copies GPU-enabled databases [31, 100]. are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights The transfer bottleneck exists because current GPU inter- for components of this work owned by others than the author(s) must connects such as PCI-e 3.0 [1] provide significantly lower be honored. Abstracting with credit is permitted. To copy otherwise, or bandwidth than main-memory (i.e., CPU memory). We break republish, to post on servers or to redistribute to lists, requires prior specific down the transfer bottleneck into three fundamental limita- permission and/or a fee. Request permissions from [email protected]. tions for GPU-enabled data processing: SIGMOD’20, June 14–19, 2020, Portland, OR, USA L1: Low interconnect bandwidth. When the database de- © 2020 Copyright held by the owner/author(s). Publication rights licensed to ACM. cides to use the GPU for query processing, it must transfer ACM ISBN 978-1-4503-6735-6/20/06...$15.00 data ad hoc from CPU memory to the GPU. With current https://doi.org/10.1145/3318464.3389705 interconnects, this transfer is slower than processing the SIGMOD’20, June 14–19, 2020, Portland, OR, USA Lutz et al. data on the CPU [13, 31, 100]. Consequently, we can only 16× PCI-e 3.0 3× NVLink 2.0 GPU 16 GB/s 75 GB/s GPU speed up data processing on GPUs by increasing the inter- Core 0 Core N connect bandwidth [16, 26, 51, 89, 95]. Although data com- I/O Hub NPU pression [24, 85] and approximation [81] can reduce transfer Memory Controller 75 GB/s 1× X-Bus 3× X-Bus 75 GB/s volume, their effectiveness varies with the data and query. 64 GB/s 192 GB/s 3× NVLink 2.0 GPU GPU 3× NVLink 2.0 L2: Small GPU memory capacity. To avoid transferring CPU Memory data, GPU-enabled databases cache data in GPU memory [13, Cache-Coherent 38, 50, 83]. However, GPUs have limited on-board GPU mem- Figure 2: Architecture and cache-coherence domains ory capacity (up to 32 GiB). In general, large data sets cannot of GPU interconnects, with their electrical band- be stored in GPU memory. The capacity limitation is inten- widths annotated. sified by database operators that need additional space for intermediate state, e.g., hash tables or sorted arrays. In sum, GPU co-processing does not scale to large data volumes. L3: Coarse-grained cooperation of CPU and GPU. Us- The remainder of the paper is structured as follows. In Sec- ing only a single processor for query execution leaves avail- tion 2, we briefly explain the hash join algorithm and high- able resources unused [16]. However, co-processing on mul- light NVLink 2.0. We present our contributions in Sections tiple, heterogeneous processors inherently leads to execution 3–6. Then, we present our experimental results in Section 7 skew [22, 32], and can even cause slower execution than on and discuss our insights in Section 8. Finally, we review re- a single processor [13]. Thus, CPU and GPU must cooperate lated work in Section 9 and conclude in Section 10. to ensure that the CPU’s execution time is the lower bound. Cooperation requires efficient synchronization between pro- 2 BACKGROUND cessors on shared data structures such as hash tables or In this section, we provide an overview of the no-partitioning B-trees, that is not possible with current interconnects [4]. hash join, and the PCIe 3.0 and NVLink 2.0 interconnects. In this work, we investigate the scalability limitations of GPU co-processing and analyze how a faster interconnect 2.1 No-Partitioning Hash Join helps us to overcome them. A new class of fast interconnects, In this work, we focus on the no-partitioning hash join algo- that currently includes NVLink, Infinity Fabric, and CXL, rithm as proposed by Blanas et al. [10]. The no-partitioning provides high bandwidth, and low latency. In Figure 1, we hash join algorithm is a parallel version of the canonical show that fast interconnects enable the GPU to access CPU hash join [8]. We focus on this algorithm because it is simple memory with the full memory bandwidth. Furthermore, we and well-understood. Loading base relations from CPU mem- propose a new co-processing strategy that takes advantage of ory requires high bandwidth, scaling the hash table beyond the cache-coherence provided by fast interconnects for fine- GPU memory requires low latency, and sharing the hash grained CPU-GPU cooperation. Overall, fast interconnects table between multiple processors requires cache-coherence. integrate GPUs tightly with CPUs and significantly reduce Thus, the no-partitioning hash join is a useful instrument to the data transfer bottleneck. investigate fast GPU interconnects. Our contributions are as follows: The anatomy of a no-partitioning hash join consists of two phases, the build and the probe phase. The build phase (1) We analyze NVLink 2.0 to understand its performance takes as input the smaller of the two join relations, which and new functionality in the context of data manage- we denote as the inner relation R. In the build phase, we ment (Section 3). NVLink 2.0 is one representative of populate the hash table with all tuples in R. After the build the new generation of fast interconnects. phase is complete, we run the probe phase. The probe phase (2) We investigate how fast interconnects allow us to per- reads the second, larger input relation as input. We name form efficient ad hoc data transfers. We experimentally this relation the outer relation S. For each tuple in S, we determine the best data transfer strategy (Section 4). probe the hash table to find matching tuples from R. When (3) We scale queries to large data volumes while consider- executing the hash join in parallel on a system with p cores, ing the new trade-offs of fast interconnects. We usea its time complexity observes O¹1/p¹jRj + jSjºº. no-partitioning hash join as an example (Section 5). (4) We propose a new cooperative and robust co-processing approach that enables CPU-GPU scale-up on a shared, 2.2 GPU Interconnects mutable data structure (Section 6). Discrete GPUs are connected to the system using an inter- (5) We evaluate joins as well as a selection-aggregation connect bus. In Figure 2, we contrast the architectures of query using a fast interconnect (Section 7).

Pump up the Volume: Processing Large Data on Gpus with Fast

GPTPU: Accelerating Applications Using Edge Tensor Processing Units Kuan-Chieh Hsu and Hung-Wei Tseng University of California, Riverside {Khsu037, Htseng}@Ucr.Edu

Amd Filed: February 24, 2009 (Period: December 27, 2008)

Quad-Core Catamount and R&D in Multi-Core Lightweight Kernels

ADVANCED MICRO DEVICES, INC. (Exact Name of Registrant As Specified in Its Charter)

AMD's Early Processor Lines, up to the Hammer Family (Families K8

N51tp/Te AMD PUMA Rev 0.6 2008/Nov/05

AMD Zen Rohin, Vijay, Brandon Outline

AMD Details Next-Generation Platform for Notebook Pcsamd Details Next-Generation Platform for Notebook Pcs 18 May 2007

Intel Corporation, Founded 1968, Is the Largest Microprocessor Company in the World, with the Greatest Overall Share of the Microprocessor Market Worldwide

PUMA: a Programmable Ultra-Efficient Memristor-Based Accelerator for Machine Learning Inference

PUMA” Microarchitecture PUMA = Programmable Ultra-Efficient Memristor-Based Accelerator

AMD ENERGY STAR Server Energy Efficiency Technology