HMG: Extending Cache Coherence Protocols Across Modern Hierarchical Multi-GPU Systems
2020 IEEE International Symposium on High Performance Computer Architecture (HPCA)

Xiaowei Ren†‡, Daniel Lustig‡, Evgeny Bolotin‡, Aamer Jaleel‡, Oreste Villa‡, David Nellans‡
†The University of British Columbia, ‡NVIDIA
[email protected], {dlustig, ebolotin, ajaleel, ovilla, [email protected]}

Abstract—Prior work on GPU cache coherence has shown that simple hardware- or software-based protocols can be more than sufficient. However, in recent years, features such as multi-chip modules have added deeper hierarchy and non-uniformity into GPU memory systems. GPU programming models have chosen to expose this non-uniformity directly to the end user through scoped memory consistency models. As a result, there is room to improve upon earlier coherence protocols that were designed only for flat single-GPU hierarchies and/or simpler memory consistency models.

In this paper, we propose HMG, a cache coherence protocol designed for forward-looking multi-GPU systems. HMG strikes a balance between simplicity and performance: it uses a readily-implementable VI-like protocol to track coherence states, but it tracks sharers using a hierarchical scheme optimized for mitigating the bandwidth limitations of inter-GPU links. HMG leverages the novel scoped, non-multi-copy-atomic properties of modern GPU memory models, and it avoids the overheads of invalidation acknowledgments and transient states that were needed to support prior GPU memory models. On a 4-GPU system, HMG improves performance over a software-controlled, bulk invalidation-based coherence mechanism by 26% and over a non-hierarchical hardware cache coherence protocol by 18%, thereby achieving 97% of the performance of an idealized caching system.

Figure 1: Forward-looking multi-GPU system. Each GPU has multiple GPU modules (GPMs). [Diagram: four GPUs connected through NVSwitches; each GPU comprises GPMs with local DRAM.]

I. INTRODUCTION

As the demand for GPU compute continues to grow beyond what a single die can deliver [1–4], GPU vendors are turning to new packaging technologies such as multi-chip modules (MCMs) [5] and new networking technologies such as NVIDIA's NVLink [6] and NVSwitch [7] and AMD's xGMI [8] in order to build ever-larger GPU systems [9–11]. Consequently, as Fig. 1 depicts, modern GPU systems are becoming increasingly hierarchical. However, due to physical limitations, the large bandwidth discrepancy between existing inter-GPU links [6, 8] and on-package integration technologies [12] can contribute to Non-Uniform Memory Access (NUMA) behavior that often bottlenecks performance. Following established principles, GPUs use aggressive caching to recover some of the performance loss created by the NUMA effect [5, 13, 14], and these caches are kept coherent with lightweight coherence protocols implemented in software [5, 13], hardware [14, 15], or a mix of both [16].

GPUs originally assumed that inter-thread synchronization would be coarse-grained and infrequent, and hence they adopted a bulk-synchronous programming (BSP) model for simplicity. This paradigm disallowed any data communication among Cooperative Thread Arrays (CTAs) of active kernels. However, in emerging applications, less-restrictive data sharing patterns and fine-grained synchronization are expected to be more frequent [17–20]. BSP is too rigid and inefficient to support these new sharing patterns.

To extend GPUs into more general-purpose domains, GPU vendors have released precisely-defined scoped memory models [21–24]. These models allow flexible communication and synchronization among threads in the same CTA, the same GPU, or anywhere in the system, usually by requiring programmers to provide scope annotations for synchronization operations. Scopes allow synchronization and coherence to be maintained entirely within a narrow subset of the full-system cache hierarchy, thereby delivering improved performance over system-wide synchronization enforcement. Furthermore, unlike most CPU memory models, these GPU models are now non-multi-copy-atomic: they do not require that memory accesses become visible to all observers at the same logical point in time. As a result, there is room for forward-looking GPU cache coherence protocols to be made even more relaxed, and hence capable of delivering even higher throughput, than protocols proposed in prior work (as outlined in Section III).
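To make scope annotations concrete, the sketch below (our illustration, not code from the paper; the kernel names and flag layout are hypothetical) uses the libcu++ scoped atomics through which CUDA exposes these models. A release at device (.gpu) scope makes the payload visible to threads on the same GPU; extending the guarantee to peer GPUs or the CPU would require cuda::thread_scope_system instead.

```cuda
#include <cassert>
#include <cuda/atomic>

// Scoped producer/consumer handoff. cuda::thread_scope_block, _device,
// and _system correspond to the .cta, .gpu, and .sys scopes, respectively.
__global__ void producer(int* data,
                         cuda::atomic<int, cuda::thread_scope_device>* flag) {
    data[0] = 42;                                // payload write
    flag->store(1, cuda::memory_order_release);  // release at .gpu scope
}

__global__ void consumer(int* data,
                         cuda::atomic<int, cuda::thread_scope_device>* flag) {
    // Acquire at .gpu scope: pairs with the producer's release, so the
    // payload is guaranteed visible here, but only to observers on the
    // same GPU as the producer.
    while (flag->load(cuda::memory_order_acquire) == 0) { }
    assert(data[0] == 42);
}
```

Because the annotation is .gpu scope, coherence for this handoff only needs to be maintained within one GPU's cache hierarchy, which is precisely the property a hierarchical coherence protocol can exploit.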
Previously explored GPU coherence schemes [5, 13–16, 25–27] were well tuned for GPUs and much simpler than CPU protocols, but few have studied how to scale these protocols to larger multi-GPU systems with deeper cache hierarchies. To test their efficiency, we simply apply the existing software and hardware (GPU-VI [15]) coherence protocols to a 4-GPU system in which each GPU consists of 4 separate GPU modules (GPMs). These protocols do not account for architectural hierarchy; we simply extend them as if the system were a flat platform of 16 GPMs. As Fig. 2 shows, in hierarchical multi-GPU systems, existing non-hierarchical software and hardware VI coherence protocols indeed leave plenty of room for improvement; see Section III for details.

Figure 2: Benefits of caching remote GPU data under three different protocols (Non-Hierarchical SW Coherence, Non-Hierarchical HW Coherence, and Idealized Caching w/o Coherence) on a 4-GPU system with 4 GPMs per GPU, all normalized to a baseline which has no such caching. The results show that existing protocols leave room for improvement when extended to multiple GPUs. [Bar chart: normalized speedup for bfs, mst, lstm, snap, CoMD, resnet, nw-16K, HPGMG, AlexNet, overfeat, RNN_FW, MiniAMR, cuSolver, Nekbone, namd2.10, pathfinder, GoogLeNet, MiniContact, RNN_DGRAD, RNN_WGRAD, and the geometric mean (GeoMean).]

We therefore ask the following question: how do we extend existing GPU coherence protocols across multiple GPUs while simultaneously providing high-performance support for flexible fine-grained synchronization within emerging GPU applications, and without a dramatic increase in protocol complexity?

We propose HMG, a hardware-managed cache coherence protocol designed to extend coherence guarantees across forward-looking hierarchical multi-GPU systems with scoped memory consistency models. Unlike prior CPU and GPU protocols that enforce multi-copy-atomicity and/or track ownership within the protocol, HMG eliminates the transient states and extra hardware structures that would otherwise be needed to cover the latency of acquiring write permissions [15, 16]. HMG also filters out unnecessary invalidation acknowledgment messages, since a write can be processed instantly if multi-copy-atomicity is not required. Similarly, unlike prior work that filters coherence traffic by tracking the read-only state of data regions [14, 16], HMG relies on precise but hierarchical tracking of sharers to mitigate the performance impact of bandwidth-limited inter-GPU links without adding unnecessary coherence traffic. As such, HMG is able to scale up to large multi-GPU systems while maintaining the simplicity and implementability that made prior GPU cache coherence protocols popular [15].

Overall, this paper makes the following contributions:

• We study the architectural implications of extending cache coherence protocols to multi-GPU systems. To the best of our knowledge, this is the first study of multi-MCM, multi-GPU hardware coherence under a scoped, non-multi-copy-atomic memory model.

• We show that while it is possible to extend existing software or hardware coherence protocols to multi-GPU systems, performance is very inefficient without mechanisms in place to exploit intra-GPU data locality and mitigate the inter-GPU/inter-module bandwidth constraints.

• We propose HMG, a hardware-managed cache coherence protocol for distributed L2 caches in hierarchical multi-GPU platforms (see the sketch after this list). Our proposal not only mitigates the majority of the inter-GPU bandwidth-related bottlenecks by improving intra-GPU data locality with hierarchical sharer tracking, but also eliminates unnecessary transient states and coherence messages found in previous proposals. HMG delivers 97% of the overall possible performance of an idealized system.
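To make the hierarchical sharer-tracking idea concrete, the following minimal sketch models a VI-style home-directory entry under assumed parameters (4 GPUs with 4 GPMs each, matching the evaluated system). It is our illustration of the general technique, not HMG's exact hardware tables.

```cuda
#include <cstdint>

constexpr int kGPMsPerGPU = 4;   // assumed configuration, matching the
constexpr int kGPUs       = 4;   // 4-GPU x 4-GPM system evaluated above

// VI-like protocol: lines are either Valid or Invalid; no transient states.
enum class State : uint8_t { Invalid, Valid };

// One home-directory entry. Sharers are tracked hierarchically: a bitmask
// of GPMs inside the home GPU, plus one bit per peer GPU, so invalidations
// cross each bandwidth-limited inter-GPU link at most once.
struct DirEntry {
    State   state       = State::Invalid;
    uint8_t gpm_sharers = 0;  // GPMs within the home GPU caching the line
    uint8_t gpu_sharers = 0;  // peer GPUs caching the line
};

// Handle a write from one GPM. Because the memory model is scoped and
// non-multi-copy-atomic, invalidations are fire-and-forget: no
// acknowledgment collection is needed before the write completes.
inline void process_write(DirEntry& e, int writer_gpm) {
    for (int g = 0; g < kGPMsPerGPU; ++g)
        if (((e.gpm_sharers >> g) & 1) && g != writer_gpm) {
            // send INV to local GPM g over high-bandwidth on-package links
        }
    for (int u = 0; u < kGPUs; ++u)
        if ((e.gpu_sharers >> u) & 1) {
            // send a single INV to peer GPU u; its own directory then
            // fans the invalidation out to its local GPMs
        }
    e.state       = State::Valid;
    e.gpm_sharers = uint8_t(1u << writer_gpm);
    e.gpu_sharers = 0;
}
```

The key design point is the per-GPU sharer bit: an invalidation traverses each scarce inter-GPU link once rather than once per GPM, and the absence of acknowledgment collection follows directly from dropping multi-copy-atomicity.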
II. BACKGROUND

To avoid confusion around the term "shared memory," which is used to describe scratchpad memory on NVIDIA GPUs, we use "global memory" to refer to the virtual address space shared by all CPUs and GPUs in a system.

A. Hierarchical Multi-Module GPUs

Because the end of Moore's Law is limiting continued transistor density improvements, future multi-module GPU architectures will consist of a hierarchy in which each GPU is split into GPU modules (GPMs), as depicted in Fig. 1 [28]. Recent work has demonstrated the benefits of creating GPMs in the form of single-package multi-chip modules (MCMs) [5]. Researchers have explored the possibility of presenting a large hierarchical multi-GPU system to users as if it were a single large GPU [13], but mainstream GPU platforms today largely just expose the hierarchy directly to users so that they can manually optimize data and thread placement and migration decisions.

The constrained bandwidth of the inter-GPM/GPU links is the main performance bottleneck in hierarchical GPU architectures. To mitigate this, both MCM-GPU and multi-GPU explorations have scheduled adjacent CTAs to the same GPM/GPU to exploit inter-CTA data locality [5, 13]. These proposals map each memory page to the first GPM/GPU that touches it, to increase the likelihood that caches will capture locality. They also extended a conventional GPU software coherence protocol to the L2 caches, and they showed that it worked well for traditional bulk-synchronous programs. More recently, CARVE [14] proposed to allocate a fraction of local GPU memory as a cache for remote data.

B. Scoped Memory Consistency Models

Scoped memory models let threads limit the set of other threads with which they must synchronize. Synchronization of scope .cta is performed in the L1 cache of each GPU Streaming Multiprocessor (SM); synchronizations of scope .gpu and .sys are processed through the GPU L2 cache and via the memory hierarchy of the whole system, respectively.
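As a rough concrete mapping of these scopes onto existing CUDA primitives (our illustration; the wrapper function name is hypothetical), the standard fence intrinsics target exactly these three levels of the hierarchy:

```cuda
#include <cuda_runtime.h>

// The CUDA fence intrinsics correspond (roughly) to the three scopes
// described above; this device function exists only for illustration.
__device__ void fence_examples() {
    __threadfence_block();   // .cta scope: ordering enforced at the SM's L1
    __threadfence();         // .gpu scope: ordering enforced at the GPU's L2
    __threadfence_system();  // .sys scope: ordering enforced system-wide
}
```

A fence thus only needs to be enforced at the level of the hierarchy its scope names, which is the property HMG's hierarchical design exploits.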