How to Manage High-Bandwidth Memory Automatically
Rathish Das (Stony Brook University), Kunal Agrawal (Washington University in St. Louis), Michael A. Bender (Stony Brook University), Jonathan Berry (Sandia National Laboratories), Benjamin Moseley (Carnegie Mellon University), and Cynthia A. Phillips (Sandia National Laboratories)

ABSTRACT
This paper develops an algorithmic foundation for automated management of the multilevel-memory systems common to new supercomputers. In particular, the High-Bandwidth Memory (HBM) of these systems has a similar latency to that of DRAM and a smaller capacity, but it has much larger bandwidth. Systems equipped with HBM do not fit in classic memory-hierarchy models due to HBM's atypical characteristics.

Unlike caches, which are generally managed automatically by the hardware, programmers of some current HBM-equipped supercomputers can choose to explicitly manage HBM themselves. This process is problem specific and resource intensive. Vendors offer this option because there is no consensus on how to automatically manage HBM to guarantee good performance, or whether this is even possible.

In this paper, we give theoretical support for automatic HBM management by developing simple algorithms that can automatically control HBM and deliver good performance on multicore systems. HBM management is starkly different from traditional caching both in terms of optimization objectives and algorithm development. Since DRAM and HBM have similar latencies, minimizing HBM misses (provably) turns out not to be the right memory-management objective. Instead, we directly focus on minimizing makespan. In addition, while cache-management algorithms must focus on what pages to keep in cache, HBM management requires answering two questions: (1) which pages to keep in HBM and (2) how to use the limited bandwidth from HBM to DRAM. It turns out that the natural approach of using LRU for the first question and FCFS (First-Come-First-Serve) for the second question is provably bad. Instead, we provide a priority-based approach that is simple, efficiently implementable, and O(1)-competitive for makespan when all multicore threads are independent.

CCS CONCEPTS
• Theory of computation → Scheduling algorithms; Packing and covering problems; Online algorithms; Caching and paging algorithms; Parallel algorithms; Shared memory algorithms; • Computer systems organization → Parallel architectures; Multicore architectures.

KEYWORDS
Paging, High-bandwidth memory, Scheduling, Multicore paging, Online algorithms, Approximation algorithms.

ACM Reference Format:
Rathish Das, Kunal Agrawal, Michael A. Bender, Jonathan Berry, Benjamin Moseley, and Cynthia A. Phillips. 2020. How to Manage High-Bandwidth Memory Automatically. In Proceedings of the 32nd ACM Symposium on Parallelism in Algorithms and Architectures (SPAA '20), July 15–17, 2020, Virtual Event, USA. ACM, New York, NY, USA, 13 pages. https://doi.org/10.1145/3350755.3400233
1 INTRODUCTION
Enabled by the recent innovations in 3D die-stacking technology, vendors have begun implementing a new approach for improving memory performance by increasing the bandwidth between on-chip cache and off-package DRAM [34, 39]. The approach is to bond memory directly to the processor package where there can be more parallel connections between the memory and caches, enabling a higher bandwidth than can be achieved using older technologies. Throughout the rest of this paper, we refer to the on-package 3D memory technologies as high-bandwidth memory or HBM.¹

¹ Hardware vendors use various brand names such as High-Bandwidth Memory (HBM), Hybrid Memory Cube (HMC), and MCDRAM for this technology.

The HBM cannot replace DRAM ("main memory") since it is generally about 5 times smaller than DRAM due to constraints such as heat dissipation, as well as economic factors. For example, current HBM sizes range from 16 gigabytes per compute node (the Department of Energy's "Trinity" [26]) to 96 gigabytes per compute node (the Department of Energy's "Summit" [47]), several times smaller than the per-node sizes of DRAM on those systems (96 GB and 512 GB, respectively). HBM therefore augments the existing memory hierarchy by providing memory that can be accessed with up to 5x higher bandwidth than DDR4, today's DRAM technology, when feeding a CPU [1], and up to 20x higher bandwidth when feeding a GPU [47], but with latency similar to DDR4.

As the number of cores on a chip has grown in the past two decades, the relative memory capacity, defined as memory capacity divided by available gigaflops, has decreased by more than 10x [35]. Thus, processors are becoming more starved for data. HBM, with its improved ability to feed processors, provides an opportunity to overcome this bottleneck, if application software can use it.

HBM has a critical limitation besides its constrained size: since it uses the same technology as DRAM, it offers little or no advantage in latency. Therefore, the HBM is not designed to accelerate memory-latency-bound applications, only memory-bandwidth-bound ones.²

How HBM is managed. Intel's Knights Landing processors [46] on Trinity can boot into multiple modes. In "cache mode," the HBM is system-controlled and is integrated into the memory hierarchy as the "last level of cache." In "flat mode," the programmer controls HBM by explicitly copying data in and out of HBM, trying to squeeze as much performance from the system as possible. Hybrid mode splits the HBM into one "flat" piece and one "cache" piece.

This proliferation of HBM modes exists because there is no consensus on how to automatically manage HBM efficiently, or whether this is even possible.

On the other hand, on most systems, on-chip cache is automatically managed by the hardware and works well enough that application programmers can treat the cache hierarchy as a black box. They can also assume sufficient support from libraries of cache-aware and cache-oblivious algorithms. Ideally, we would like a system-controlled HBM to also work well enough to free the programmer from the need to manage it.

However, managing the HBM is not identical to managing caches and introduces complications that do not exist in more traditional memory hierarchies. In particular, cores compete not only for HBM capacity, but also for the more limited channel capacity between HBM and DRAM. HBM does not fit into a standard memory-hierarchy model [12, 30], because in traditional hierarchies, both the latency and bandwidth improve as the levels get smaller. This is not true of HBM.

The question looms: Are there provably good algorithms for automatically controlling HBM?

1.1 Results
Multicore HBM model. We propose a multicore model for HBM that captures the high bandwidth from the cores to HBM and the much lower bandwidth to DRAM. There are p parallel channels connecting p cores to the HBM but only a single channel connecting HBM to DRAM; see Figure 1. This configuration captures the high (on-package) bandwidth between the p cores and HBM and the much lower (off-package) bandwidth between HBM and DRAM.
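To make the model concrete, here is a minimal sketch in Python. It is our illustration, not code from the paper, and the class, field, and method names (MulticoreHBMModel, hbm_capacity, miss, step) are ours: each core has a private channel to the HBM, so HBM hits proceed in parallel, while every HBM miss must queue for the single HBM-to-DRAM channel, which transfers at most one block per time step.

    from collections import deque
    from dataclasses import dataclass, field

    @dataclass
    class MulticoreHBMModel:
        """Toy encoding of the multicore HBM model (names are ours, not the paper's).

        p cores each have a private channel to the HBM, so accesses that hit in
        HBM proceed in parallel.  Accesses that miss must be transferred over the
        single HBM<->DRAM channel, which serves at most one block per time step.
        """
        p: int                                  # number of cores / core-HBM channels
        hbm_capacity: int                       # number of blocks the HBM can hold
        pending: deque = field(default_factory=deque)  # misses waiting for the channel

        def miss(self, core_id: int, block: int) -> None:
            """A core's HBM miss joins the queue for the single DRAM channel."""
            self.pending.append((core_id, block))

        def step(self):
            """Advance one time step: the channel serves at most one waiting miss."""
            return self.pending.popleft() if self.pending else None

Everything the paper studies lies in what this skeleton leaves open: which blocks to keep within hbm_capacity (the replacement question) and in what order to drain pending (the channel-scheduling question).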
In sequential caching, we generally use minimizing cache misses as a performance objective since it is a good proxy for makespan. We might be tempted to use HBM misses as the performance objective here. Surprisingly, it turns out that minimizing HBM misses correlates poorly with makespan for HBM for two reasons (formally proved in Section 7). First, unlike caches, HBM has the same access latency as DRAM. Second, multiple processors are accessing the HBM and potentially contending for the channel between HBM and DRAM, and the amount of contention can have an impact on the makespan, not just the sheer number of HBM misses. Therefore, we focus directly on minimizing makespan. We establish the following:

• Minimizing page faults does not correlate well with minimizing makespan. An algorithm that optimizes the total number of HBM misses may have bad makespan (the time the last thread completes), up to a Θ(p)-factor worse than optimal. This is not true for the standard cache-replacement problem, where the total number of cache misses is the surrogate objective for makespan [19, 28, 44].

• Minimizing makespan is strongly NP-hard.

• Sharing the HBM-to-DRAM channel fairly does not work. We consider how to design block-replacement policies for the HBM coupled with the First-Come-First-Serve (FCFS) algorithm for determining the order of accesses from HBM to DRAM. We show that even though LRU is a very good block-replacement policy, if we use FCFS in the HBM-to-DRAM channel, LRU performs poorly. In particular, with any constant amount of resource augmentation, the makespan of using FCFS with LRU is a factor of Ω(p) away from the optimal policy in the worst case. This negative result establishes that more sophisticated management of the channel between HBM and DRAM is central to designing a good algorithm for the problem. The seemingly fair FCFS policy is bad; the sketch after this list illustrates the effect on a toy workload.

Our main positive results are simple online and offline algorithms for automatically managing HBM. What is interesting about HBM management is how starkly it differs from traditional caching policies, which are well understood in a serial setting but are challenging in multicore settings [31, 38].

• Priority-based mechanism for managing the HBM-to-DRAM channel.
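To see concretely why the order of service on the single HBM-to-DRAM channel, and not just the total number of HBM misses, drives makespan, consider the toy simulation below. It is our sketch, not the paper's Ω(p) lower-bound construction or its priority mechanism; the workload, the favor_thread_zero rule, and all function names are ours. One thread interleaves local work with occasional misses while the other threads issue long miss streams. Both channel policies serve exactly the same misses, yet FCFS repeatedly stalls the low-demand thread behind the streams and finishes later.

    def simulate(threads, policy):
        """Toy simulation of p threads sharing one HBM-to-DRAM channel.

        Each thread is a list of operations: 'C' is one unit of local work out of
        HBM (all runnable threads do these in parallel); 'M' is an HBM miss that
        must be served by the single channel, which serves one miss per time step.
        `policy` picks which waiting thread to serve from a sorted list of
        (arrival_time, thread_id) pairs.  Returns (makespan, total_misses).
        """
        p = len(threads)
        pc = [0] * p                    # index of each thread's next operation
        waiting = {}                    # thread_id -> time its pending miss was issued
        finish = [None] * p
        total_misses = sum(op == 'M' for ops in threads for op in ops)
        now = 0
        while any(f is None for f in finish):
            if waiting:                 # the channel serves at most one miss per step
                served = policy(sorted((t, i) for i, t in waiting.items()))
                del waiting[served]
                pc[served] += 1
            for i in range(p):          # unblocked threads run or issue their next miss
                if finish[i] is not None or i in waiting:
                    continue
                if pc[i] == len(threads[i]):
                    finish[i] = now
                elif threads[i][pc[i]] == 'C':
                    pc[i] += 1
                else:
                    waiting[i] = now
            now += 1
        return max(finish), total_misses

    def fcfs(requests):
        """First-come-first-serve: the oldest pending request wins."""
        return requests[0][1]

    def favor_thread_zero(requests):
        """One illustrative priority rule: serve thread 0 first whenever it is
        waiting, otherwise fall back to FCFS.  (The paper's mechanism is more
        general; this only shows that the channel order matters.)"""
        ids = [i for _, i in requests]
        return 0 if 0 in ids else requests[0][1]

    if __name__ == "__main__":
        p = 16
        # Thread 0 interleaves local work with occasional misses; the other
        # p - 1 threads issue long streams of misses that keep the channel busy.
        threads = [['C', 'M'] * 200] + [['M'] * 30 for _ in range(p - 1)]
        for name, pol in [("FCFS", fcfs), ("priority", favor_thread_zero)]:
            makespan, misses = simulate(threads, pol)
            print(f"{name:>8}: total HBM misses = {misses}, makespan = {makespan}")
        # Both policies serve the same 650 misses, but FCFS repeatedly stalls
        # thread 0 behind the miss streams, so its makespan is noticeably larger.

On this workload the priority schedule's makespan stays close to the channel's busy time (the total miss count), while FCFS is noticeably worse even though the miss totals are identical; the paper's worst-case constructions push this kind of gap to Ω(p).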