Software Assisted Hardware Coherence for Heterogeneous Processors

Arkaprava Basu Sooraj Puthoor Shuai Che Bradford M. Beckmann

AMD Research, Advanced Micro Devices, Inc. {arkaprava.basu, sooraj.puthoor, brad.beckmann, shuai.che}@amd.com

ABSTRACT

Current trends suggest that future computing platforms will be increasingly heterogeneous. While these heterogeneous processors physically integrate disparate computing elements like CPUs and GPUs on a single chip, their programmability critically depends upon the ability to efficiently support cache coherence and shared virtual memory across tightly-integrated CPUs and GPUs. However, throughput-oriented GPUs easily overwhelm the existing hardware coherence mechanisms that have long kept the cache hierarchies of multi-core CPUs coherent.

This paper proposes a novel solution called Software Assisted Hardware Coherence (SAHC) to scale cache coherence to future heterogeneous processors. We observe that the system software (operating system and runtime) often has semantic knowledge about the sharing patterns of data across the CPU and the GPU. This high-level knowledge can be utilized to effectively provide cache coherence across throughput-oriented GPUs and latency-sensitive CPUs in a heterogeneous processor. SAHC thus proposes a hybrid software-hardware mechanism that judiciously uses hardware coherence only when needed, while using software's knowledge to filter out most of the unnecessary coherence traffic. Our evaluation suggests that SAHC can often eliminate 98-100% of the hardware coherence lookups, resulting in up to a 49% reduction in runtime.

CCS Concepts
• Computer systems organization → Architectures → Parallel architectures → Single instruction, multiple data

Keywords
Cache coherence; Heterogeneous processor; GPGPU; Virtual memory; Operating system.

1. INTRODUCTION

Many current and future processors are becoming increasingly heterogeneous. Large commercial processor manufacturers like AMD®, Intel®, and Qualcomm® ship millions of processors with CPUs and GPUs tightly coupled together. This trend is likely to continue in the near future, with highly-capable, accelerator-like GPUs being tightly integrated inside a processor.

Ease of programming these heterogeneous processors is key to harnessing their full potential. Cache coherence and shared virtual memory are two critical features that greatly enhance the programmability of any system. It is thus no surprise that several major hardware vendors like AMD®, ARM®, and Qualcomm® promise shared virtual memory and cache coherence in their heterogeneous processors [22].

However, realizing the promise of full cache coherence across the CPU and GPU in a heterogeneous processor is challenging. Traditional approaches have either burdened the application programmer to explicitly manage coherence (software cache coherence) or put the onus of maintaining coherence entirely on the hardware (hardware cache coherence). Software cache coherence is more appealing for niche accelerators programmed by ninja programmers, while hardware cache coherence is the norm for more generic and easily programmable CPUs. Unfortunately, neither of these two approaches is readily extensible to heterogeneous processors that should be programmable en masse. Software cache coherence seriously impedes programmability, while hardware-only coherence is hard to scale to meet the demand from throughput-oriented GPUs [15].

Researchers have proposed several designs to address this challenge in the past. For example, GMAC [7] proposes a software-only solution that enables coherence between a CPU and GPU with a software library, but adds significant performance overhead. Cohesion [10] proposes a software-hardware co-design that allows data to be kept coherent either by hardware or by software. However, to achieve high performance, Cohesion requires application programmers to manage coherence, and it adds significant hardware overheads (e.g., a region table). HSC [15] proposes a hardware-only solution that revamps the hardware cache coherence mechanism by maintaining cache coherence at the larger granularity of a region (e.g., 1KB) instead of traditional cache blocks (e.g., 64 bytes). Given that current cache coherence protocols are already hard to verify, the significant changes proposed by HSC will be challenging to adopt.

Ideally, heterogeneous processors would retain the ease of programming enabled by hardware cache coherence, but without adding significant performance overhead and without requiring a re-design of conventional hardware. To this end, we propose software-assisted hardware-managed coherence (SAHC). The key observation is that the system software (operating system, runtime) knows how data is shared across the CPU and the GPU. This knowledge can aid the hardware cache coherence. Furthermore, since memory is allocated by the operating system at the granularity of pages, we observed that this semantic knowledge is captured well at the page granularity (e.g., 4KB). Thus, SAHC piggybacks on virtual memory's page-permission checks to enforce coherence between the CPU and the GPU cache hierarchies. Specifically, SAHC extends the set of page permissions (e.g., read/write/no-execute) to encode page-granular coherence permissions. These page-grain coherence permissions encode whether a given page is currently accessible by the CPU, by the GPU, or by both. The existing virtual memory hardware for enforcing page permissions is then minimally extended to enforce the newly-added page-grain coherence permissions. However, if a page contains data that is frequently shared across the CPU and GPU, then it will trigger an excessive number of costly page-permission changes. To avoid any potential performance degradation under such a scenario, SAHC dynamically falls back to conventional block-granular hardware cache coherence for pages that are frequently shared across the CPU and the GPU. Our analysis (Section 4) shows that only a small fraction of data is frequently shared across the CPU and GPU (e.g., synchronization variables).

SAHC's software-hardware co-design enables several important benefits over the state of the art.
First, SAHC enables traditional block-granular hardware cache coherence to scale to the GPU's memory bandwidth requirements by judiciously using it only when necessary. Thus, unlike hardware-only coherence for heterogeneous processors, SAHC does not require re-designing and re-verifying the hardware cache coherence mechanism. Second, SAHC does not burden application writers to maintain cache coherence, unlike software cache-coherence mechanisms [10]. Instead, SAHC minimally extends the system software (OS and runtime) to manage coherence. Finally, our evaluation shows that SAHC can eliminate 98-100% of all hardware coherence directory lookups and can reduce application runtime by up to 50%.

Our contributions are as follows:
1. We analyze and quantify bottlenecks in the hardware cache coherence in a heterogeneous system.
2. We analyze and characterize cache coherence needs across the CPU and GPU for several applications. This analysis further leads to a classification of CPU-GPU interactions in heterogeneous applications.
3. Guided by this analysis, we propose a novel software-hardware co-designed coherence scheme (SAHC) for emerging heterogeneous systems.
4. Finally, we evaluate and analyze the efficacy of the proposed SAHC scheme in a heterogeneous processor.

2. Background

In this section, we describe our baseline heterogeneous system, which is broadly modeled after commercial heterogeneous systems. We then discuss the basics of virtual memory.

2.1 Baseline Heterogeneous Processor

Figure 1 shows an overview of our baseline system. There are two clusters -- a CPU cluster and a GPU cluster. A CPU cluster can contain any number of CPU cores (two in our experiments), each with private L1 caches. A GPU cluster is made up of several compute units (CUs), each comprising several SIMD execution units. Each SIMD unit is 64 lanes wide and executes instructions in lock-step fashion. Each CU has private L1 caches.

[Figure 1. Baseline system architecture: a CPU cluster (2 CPUs with private L1 caches per core and a shared L2 cache) and a GPU cluster (32 CUs with private L1 caches per CU and a shared L2 cache), connected through a coherence directory and an L3 cache shared by both clusters, backed by DRAM channels.]

Both the CPU and GPU clusters have their own private L2 caches. A block-granular inclusive coherence directory maintains coherence between the CPU and GPU L2 caches, and each directory entry has a two-bit sharing vector (one bit for each cluster). A request to the directory allocates an entry in the miss status handling registers (MSHRs) to keep track of outstanding requests. An MSHR entry is deallocated only when the directory request completes.

Memory accesses from the CPU or GPU that miss in their respective local cache hierarchies look up the coherence directory. A lookup in the coherence directory identifies whether the remote cache hierarchy may contain the requested cache block. The local cache hierarchy of the GPU consists of per-CU L1 caches that share a common L2 cache; the CPU local cache hierarchy consists of per-CPU L1 caches and a shared L2. The GPU's local cache hierarchy is called the remote cache hierarchy for any request originating from the CPU, and vice-versa. If a request misses in the coherence directory, it looks up the unified L3 cache before going off-chip to DRAM.

2.2 Virtual Memory Basics

Software generates a memory access with a virtual address, but the hardware uses a physical address to look up the requested data or instruction. The virtual memory subsystem translates virtual addresses to the corresponding physical addresses. Address translation is performed at the granularity of a page (e.g., 4KB). Each page is also assigned a set of permissions (e.g., read-only, read-write) that controls accesses to the page. A virtual-to-physical address mapping and the associated access permissions are assigned by the OS during memory allocation, or modified later on an explicit request (e.g., via the mprotect() system call in Linux). The hardware is responsible for performing the address translation and enforcing the page permissions on every memory access. The hardware caches recently-used address translation entries in a hardware structure called a Translation Lookaside Buffer (TLB) to speed up translation. The hardware raises a page-fault exception to the OS if the address of an access does not have a valid translation or if it lacks the requisite access permissions. On a page fault, the OS can then either create the necessary translation entry or raise an error (e.g., segmentation fault) to the application. Unfortunately, servicing such page faults is slow, as it involves generating an interrupt to the OS, switching from user to supervisor mode, executing the page fault routine, and potentially performing a TLB shootdown before finally resuming the execution of the application.
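This fast-check/slow-change division of labor is visible even from user space. The following minimal C sketch (Linux-specific, illustration only; error handling omitted) maps one page, narrows its permission with mprotect(), and catches the resulting hardware permission fault as a SIGSEGV whose handler restores access -- the same enforcement path that SAHC reuses for its coherence permissions.

    #include <signal.h>
    #include <stdio.h>
    #include <sys/mman.h>

    static char *page;   /* one 4KB page under test */

    /* Plays the role of a page fault handler in this demo: on a
       permission violation it upgrades the page and lets the
       faulting store retry after the handler returns. */
    static void on_fault(int sig) {
        (void)sig;
        mprotect(page, 4096, PROT_READ | PROT_WRITE);
    }

    int main(void) {
        signal(SIGSEGV, on_fault);
        page = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        page[0] = 1;                      /* permitted: page is writable  */
        mprotect(page, 4096, PROT_READ);  /* OS narrows the permission    */
        page[0] = 2;                      /* hardware check fails: fault, */
                                          /* handler upgrades, store retries */
        printf("after fault-and-retry: %d\n", page[0]);  /* prints 2 */
        return 0;
    }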

3. Motivation and Goal

In this section, we analyze the overheads of cache coherence in the baseline heterogeneous processor (Figure 1).

Overheads of enforcing coherence: To understand the overhead of enforcing cache coherence, we first simulate an ideal cache coherence mechanism. An ideal system maintains cache coherence across CPU and GPU caches, like the baseline system using the block-granular hardware directory, but without any latency for doing so. The ideal cache coherence mechanism can fetch coherence permissions, invalidate blocks in other caches, or send coherence acknowledgements without incurring any delay. Figure 2 shows the performance of the baseline system normalized to that of a system implementing the ideal cache coherence mechanism for the applications described in Section 6.1 (bars for ideal are always 1). We observe that seven of the eleven workloads we studied suffer non-negligible performance overheads due to cache coherence in the baseline architecture, compared to ideal coherence. We expect future GPUs with larger numbers of CUs to suffer even larger performance degradation due to increased coherence activity across the CPU's and GPU's caches. Although not quantified here, coherence lookups and coherence messages add to the energy budget, too.

[Figure 2. Performance overhead of coherence in heterogeneous systems (Baseline): runtime normalized to the ideal system (smaller is better) for bfs, nw, lud, srad, bitonic, kmeans, hotspot, coloring, reduction, histogram, and matrixmul.]

Our analysis attributes the performance loss, illustrated in Figure 2, primarily to queuing delay at the coherence directory. We found that the delay at the coherence directory is caused by contention for MSHRs at the caches and for banks at the directory, and by the latency of servicing coherence probes. In summary, the traditional coherence directory schemes that worked well for CPU designs are not extensible to a heterogeneous system with throughput-oriented accelerators like GPUs.

Goals: SAHC aims to scale traditional block-grain hardware cache coherence to heterogeneous processors without re-designing the coherence mechanisms and without sacrificing ease of programming. More specifically, the goals of this work are:
1. Scale the hardware cache coherence mechanism to the needs of a heterogeneous processor by reducing lookups at the coherence directory.
2. Minimize modifications to conventional hardware cache coherence mechanisms to avoid validation and re-design cost.
3. Avoid application-level cache coherence management to enhance ease of programming.

4. Application Analysis and Observations

In this section, we first analyze the characteristics of heterogeneous applications and examine their communication patterns across the CPU and the GPU. This helps us understand how heterogeneous applications may use the coherence mechanism.

We choose heterogeneous applications from the Rodinia [5] and AMD APP SDK [20] benchmark suites (see Section 6.1) for this study. We characterize when data toggles from the CPU to the GPU and vice-versa. We define a toggle to have occurred when a given piece of data that was most recently accessed by the CPU is next accessed by the GPU, and vice-versa. A toggle suggests the need for the latest data to be communicated across the CPU and the GPU. We characterize the amount and timing of these toggles for different workloads, both qualitatively and quantitatively.

Qualitative Analysis: First, we analyze several heterogeneous applications to understand when and why data needs to be communicated across the CPU and the GPU due to algorithmic needs.

We find three primary communication patterns across the CPU and the GPU in a heterogeneous application (Figure 3). The most common pattern occurs when the CPU initializes a piece of data before the GPU works on that data (Figure 3(a)). Data is initialized either explicitly by the application or implicitly (zeroed) by the OS. The second most common communication pattern (Figure 3(b)) occurs when the CPU processes the results generated by the GPU. We call this pattern post processing. In the third commonly occurring pattern, applications actively share data (Figure 3(c)) across the CPU and the GPU during program execution. We observe that such active sharing occurs either when the GPU iteratively performs computations on a piece of data while the CPU performs a reduction after each iteration, or when there is synchronization across the CPU and the GPU.

[Figure 3. Interaction patterns between CPU and GPU: (a) initialization, (b) post processing, (c) active sharing.]

Table 1. Frequency and classification of toggles of pages (4KB) between CPU and GPU.

                 Toggles per 10K       Toggle pattern breakdown (%)
  Application    coherence directory   Initialization   Post         Active
                 lookups                                Processing   Sharing
  bfs                 67.40                11.10           11.11      77.79
  bitonic              0.56               100                0         0
  coloring            12.06                99.63             0.06      0.32
  histogram           73.93                99.22             0.70      0.16
  hotspot             15.06                99.70             0.03      0.00
  kmeans              25.80                47.64             0.78     52.27
  lud                  3.50                99.91             0.09      0.00
  matrixmul            5.90                85.60            14.40      0.00
  nw                  77.59                75.00            25.00      0.00
  reduction           77.57                99.79             0.21      0.00
  srad                83.69                85.53            14.47      0.00
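To make the toggle metric concrete, the following sketch shows one way a toggle count could be derived from a trace of memory accesses. The trace format and all names here are our own illustration, not the instrumentation used to produce Table 1.

    #include <stdio.h>

    #define PAGE_SHIFT 12            /* 4KB pages, as in the study */
    #define MAX_PAGES  (1 << 20)

    enum device { CPU = 0, GPU = 1, NONE = 2 };

    struct access { unsigned long vaddr; enum device dev; };

    /* last device to touch each page; NONE until the first access */
    static unsigned char last_dev[MAX_PAGES];

    /* Count toggles: a toggle occurs when a page most recently
       accessed by one device is next accessed by the other. */
    static unsigned long count_toggles(const struct access *trace, long n) {
        unsigned long toggles = 0;
        for (long p = 0; p < MAX_PAGES; p++) last_dev[p] = NONE;
        for (long i = 0; i < n; i++) {
            /* simplistic direct-mapped page index for the sketch */
            unsigned long page = (trace[i].vaddr >> PAGE_SHIFT) % MAX_PAGES;
            if (last_dev[page] != NONE && last_dev[page] != trace[i].dev)
                toggles++;
            last_dev[page] = (unsigned char)trace[i].dev;
        }
        return toggles;
    }

    int main(void) {
        struct access trace[] = {            /* tiny example trace     */
            {0x1000, CPU}, {0x1008, CPU},    /* CPU initializes a page */
            {0x1010, GPU},                   /* toggle #1              */
            {0x1020, GPU}, {0x1000, CPU},    /* toggle #2              */
        };
        long n = (long)(sizeof trace / sizeof trace[0]);
        printf("toggles: %lu\n", count_toggles(trace, n));  /* 2 */
        return 0;
    }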

Quantitative Analysis: Next, we quantitatively analyze the aforementioned communication patterns. We perform measurements at the boundary of pages (4KB in our study), since memory buffers are allocated by the OS at the granularity of pages. Specifically, we seek to answer the following two questions: (1) how frequently does a page toggle between the CPU and the GPU, and (2) which of the three communication patterns lead to such behavior? The answer to the first question helps estimate how frequently coherence actions are needed due to the application's communication needs across the CPU and the GPU. The answer to the second question helps determine which communication patterns to optimize.

We measure the frequency of toggles relative to coherence directory lookups in a baseline system with traditional block-grain hardware cache coherence. By normalizing to the number of coherence directory lookups, this analysis provides an estimate of the relative frequency of interactions between the CPU and the GPU compared to the extraneous directory lookups performed in the baseline. The first data column of Table 1 lists the number of toggles between the CPU and GPU for every 10,000 coherence directory lookups in the baseline. The next three columns of the table classify these toggles based on the patterns shown in Figure 3.

The data in Table 1 can be summarized into two important observations:
1. The need to communicate data across the CPU and the GPU (a toggle) arises significantly less often than coherence directory lookups are performed in the baseline. For example, we observe that none of the applications incurs more than 84 toggles per 10,000 directory lookups. This, in turn, suggests that the majority of the coherence directory lookups are extraneous and can be eliminated if semantic knowledge about the communication pattern is utilized.
2. The most common cause of communication between the CPU and the GPU is initialization of data by the CPU before the GPU starts processing it. The second most common reason is post-processing of data on the CPU after computation on the GPU is complete. Two of the eleven applications (bfs and kmeans) show significant active sharing across the CPU and GPU.

5. Design and Implementation

In this section, we first describe our design principles for scaling hardware coherence to the needs of heterogeneous processors. We then detail the design and implementation of the proposed software-assisted hardware coherence (SAHC). Finally, we discuss multiple optional optimizations that can further improve the design in the presence of specific CPU-GPU communication patterns found in a few applications.

5.1 Design Principles

The primary design principle of SAHC is to make use of virtual memory's page-permission-checking mechanism to enforce coherence permissions whenever possible. SAHC dynamically falls back to traditional block-grain hardware coherence only when necessary for better performance (e.g., for actively shared synchronization variables).

Our design principles are guided by the observations made in Section 4 and by how page permissions are enforced in today's hardware. In particular, we observed that communications between the CPU and the GPU are generally infrequent. However, when a GPU (CPU) will access data that was most recently touched by the CPU (GPU) is unknown to both the hardware and the system software. Thus, every memory access needs to be checked to determine whether the latest version of the data it requests may be in the remote cache hierarchy (e.g., in GPU caches for a CPU access), even though our analysis in the previous section shows that such occasions are relatively infrequent.

We further note that page access permissions (e.g., read-write, read-only) are checked and enforced by hardware on every memory access. The hardware detects possible violations of assigned page permissions, while the OS acts on such violations.
Any modification to the assigned page permissions must be performed by the OS. Thus, page permission checks are fast, while page permission modifications are slow. This division of responsibilities between hardware and software has worked well for virtual memory, since page permissions need to be enforced on every memory access, but page permission violations and modifications are infrequent.

The above observations suggest that the current virtual memory subsystem design aligns well with the coherence needs across the CPU and the GPU in a heterogeneous processor. Thus, we propose to minimally extend the existing mechanisms for page access permission checks to also include coherence permission checks. SAHC falls back to traditional block-grain cache coherence for memory regions that frequently toggle between the CPU and GPU, to avoid the overheads of modifying permissions.

5.2 Software-Assisted Hardware Coherence (SAHC)

SAHC has three major components. First, SAHC introduces new page-grain cache coherence permissions in the page table entry (PTE), and the hardware is responsible for enforcing these new permissions along with the already-existing page-grain access permissions. Second, SAHC introduces the policy and the mechanism for assigning the newly introduced page permissions; this part is implemented in the OS's page fault handler. Third, SAHC exploits the runtime's knowledge about work scheduled on the GPU to reduce the frequency of modifications to page-grain coherence permissions. The following sections describe each of these major components in detail.

5.2.1 New Page-grain Coherence Permissions

SAHC extends page permissions to include the notion of cacheability of the blocks belonging to a page by the CPU and/or the GPU in their respective local cache hierarchies. This requires a change in the PTE format to include two additional bits, and an extension to the page fault mechanism, as follows.

SAHC adds three new page permissions that indicate ownership of a page by the CPU, by the GPU, or by both -- CPU_ONLY, GPU_ONLY, and CPU_GPU. A page with CPU_ONLY permission is accessible only by the CPU, and cache blocks in that page are cacheable only in the CPU's local cache hierarchy; the GPU is not permitted to access the contents of the page. Similarly, a page with GPU_ONLY permission is accessible only by the GPU and cacheable only in the GPU's cache hierarchy. A page with CPU_GPU permission can be accessed and cached by both the CPU and the GPU.

These page permission attributes are enforced by the hardware along with the conventional page permission checks, and any violation (e.g., the CPU attempting to access a GPU_ONLY page) generates a page fault to the OS. The OS can then take a corrective action or generate an error, in the same way traditional page access permissions are enforced. Figure 4 presents the access matrix for the newly introduced page permissions. For accesses by the owning cluster of a CPU_ONLY or GPU_ONLY page, coherence between the CPU and GPU caches is enforced efficiently by the page permission check alone, without needing a coherence directory lookup. Accesses by the non-owning cluster generate page permission faults to the OS due to coherence permission violations. Accesses to CPU_GPU pages are managed by the hardware coherence mechanism without software intervention and may need a hardware coherence directory lookup, as in the baseline block-grain coherence protocol.

Figure 4. Page-grain coherence permission access matrix.

  Page attribute   CPU access (read/write)           GPU access (read/write)
  CPU_ONLY         Access can proceed; no h/w        Raise permission fault
                   directory lookup needed.          to the OS.
  GPU_ONLY         Raise permission fault            Access can proceed; no h/w
                   to the OS.                        directory lookup needed.
  CPU_GPU          Access can proceed, but may require a h/w directory lookup
                   on a local cache miss (either cluster).

Thus, SAHC makes use of the page permission checking capability in today's hardware to enforce coherence permissions between the CPU and GPU caches at page granularity. Additionally, it uses the existing permission fault mechanism of the OS to catch any coherence permission violations.
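As an illustration of Section 5.2.1, the sketch below shows one plausible encoding of the new coherence permissions in two spare PTE bits, together with the check of Figure 4 that the TLB/page-walk hardware would perform alongside the conventional permission checks. The bit positions, names, and the use of the fourth encoding for CPU_INIT (introduced later, in Section 5.2.3) are our assumptions, not the actual PTE layout.

    #include <stdint.h>
    #include <stdio.h>

    /* Two new PTE bits encode the page-grain coherence permission
       (bit positions are illustrative assumptions). */
    #define PTE_COH_SHIFT 9u
    #define PTE_COH_MASK  (3ull << PTE_COH_SHIFT)

    enum coh_perm  { CPU_ONLY = 0, GPU_ONLY = 1, CPU_GPU = 2,
                     CPU_INIT = 3 /* optimization, Section 5.2.3 */ };
    enum requester { REQ_CPU, REQ_GPU };

    /* Outcome of the extended permission check (Figure 4). */
    enum coh_check {
        PROCEED_NO_DIRECTORY,    /* owner access: bypass the directory   */
        PROCEED_MAYBE_DIRECTORY, /* CPU_GPU: directory lookup on a miss  */
        COHERENCE_FAULT          /* raise permission fault to the OS     */
    };

    static enum coh_perm coh_perm_of(uint64_t pte) {
        return (enum coh_perm)((pte & PTE_COH_MASK) >> PTE_COH_SHIFT);
    }

    /* Performed by the TLB/page-walk hardware alongside the
       conventional read/write/execute permission checks. */
    static enum coh_check coh_check_access(uint64_t pte, enum requester who) {
        switch (coh_perm_of(pte)) {
        case CPU_ONLY: return who == REQ_CPU ? PROCEED_NO_DIRECTORY
                                             : COHERENCE_FAULT;
        case GPU_ONLY: return who == REQ_GPU ? PROCEED_NO_DIRECTORY
                                             : COHERENCE_FAULT;
        default:       return PROCEED_MAYBE_DIRECTORY;  /* CPU_GPU */
        }
    }

    int main(void) {
        uint64_t pte = (uint64_t)GPU_ONLY << PTE_COH_SHIFT;
        /* CPU touching a GPU_ONLY page must fault (Figure 4). */
        printf("%d\n", coh_check_access(pte, REQ_CPU) == COHERENCE_FAULT);
        return 0;
    }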

5.2.2 Coherence Permission Assignment

SAHC implements the policy for assigning the newly introduced page-grain coherence permissions in the OS. This keeps policy separate from the hardware mechanisms and allows flexibility in changing policies without changes to the hardware.

[Figure 5. State transition diagrams for page permission assignment: (a) the un-optimized diagram over the states Untouched, CPU_ONLY, GPU_ONLY, and CPU_GPU, with transitions on CPU/GPU reads and writes guarded by the "not actively shared" test; (b) the application-transparent optimization, adding the CPU_INIT state, with one diagram in effect before any GPU work is scheduled and another after the first GPU kernel of a program is launched.]

Figure 5(a) depicts the state machine for the assignment of the newly introduced page permission attributes. The first access to an untouched page results in a page fault, and the OS assigns CPU_ONLY (or GPU_ONLY) permission if the access is by the CPU (GPU). Subsequent accesses from the CPU (GPU) to a CPU_ONLY (GPU_ONLY) page proceed without any software intervention. However, if the CPU (GPU) accesses a page with GPU_ONLY (CPU_ONLY) permission, then the hardware generates a permission fault, which is captured by the OS's page fault handler. The modified page fault handler then assigns either CPU_ONLY (GPU_ONLY) permission or CPU_GPU permission to the faulting page. The assignment of CPU_GPU permission depends upon whether the faulting page is deemed actively shared. In the current implementation of SAHC, we deem a page actively shared when it incurs at least three page permission faults due to changes in coherence permissions. We keep count of the number of coherence permission faults per page by adding a 2-bit saturating counter to the PTE. If a page has CPU_GPU permission, then accesses to cache blocks belonging to that page are kept coherent by the traditional block-grain hardware coherence mechanism.

5.2.3 Optimizations

Guided by the analysis of communication patterns between the CPU and the GPU (Section 4), we propose several optional optimizations for SAHC, of two types: application-transparent optimizations and application-apparent optimizations.

Application-transparent optimization: This type of optimization does not require application modification, but it needs updates to the runtime, the OS page fault handler, and the hardware.

The analysis in Section 4 showed that a large fraction of toggles between the CPU and GPU is due to initialization of data on the CPU and subsequent use of that data by the GPU. These toggles can trigger costly page permission faults in SAHC. The goal of this optimization is to eliminate those page permission faults by exploiting the runtime's knowledge of when a program launches its first kernel on the GPU.

Specifically, we introduce another new page permission called CPU_INIT. Figure 5(b) depicts the optimized version of the page permission assignment policy. It has two state diagrams. The first state transition diagram (Figure 5(b), left) remains in effect before any work is scheduled on the GPU by a given program, while the second one (Figure 5(b), right) comes into effect after a kernel is launched. A page is assigned the coherence permission CPU_INIT before any work is scheduled on the GPU by a given program. After the first kernel is scheduled on the GPU by a given heterogeneous program, if the hardware page walker finds a page table entry with access permission CPU_INIT, it updates the page permission to CPU_ONLY or GPU_ONLY, depending on whether the access is from the CPU or from the GPU, and no page fault is generated. Thus, when a page is initialized on the CPU and then accessed by the GPU, no page fault is triggered. The runtime also flushes the CPU cache hierarchy and the TLBs when a program schedules its first kernel on the GPU, to ensure that later accesses from the CPU or GPU get the latest data. This also ensures that a page table walk is always triggered on the first access to a page after any work is scheduled on the GPU.

We observe that the above optimization can be easily implemented because the driver and/or the runtime knows when a heterogeneous program schedules a kernel. SAHC modifies the runtime to set a single-bit register (unset by default) when the very first GPU kernel is launched by a given program, and to reset it whenever a program finishes execution. Note that these actions happen only once during a program's lifetime.
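Combining Sections 5.2.2 and 5.2.3, the OS-side assignment policy of Figure 5(b) could be sketched as follows. All structure and function names are invented for the illustration; a real handler would manipulate hardware PTEs and TLB entries rather than a C struct.

    #include <stdbool.h>

    enum coh_perm  { UNTOUCHED, CPU_INIT, CPU_ONLY, GPU_ONLY, CPU_GPU };
    enum requester { REQ_CPU, REQ_GPU };

    struct pte_info {
        enum coh_perm perm;
        unsigned fault_count : 2;   /* 2-bit saturating fault counter */
    };

    /* Set by the runtime when the program launches its first GPU kernel. */
    static bool first_kernel_launched;

    /* Invoked from the OS page fault handler on a missing translation
       or a coherence permission fault; implements Figure 5(b). */
    void assign_coh_perm(struct pte_info *p, enum requester who) {
        if (p->perm == UNTOUCHED) {          /* first touch of the page  */
            if (!first_kernel_launched && who == REQ_CPU)
                p->perm = CPU_INIT;          /* pre-kernel CPU init      */
            else
                p->perm = (who == REQ_CPU) ? CPU_ONLY : GPU_ONLY;
            return;
        }
        /* A coherence permission fault: the page just toggled. */
        if (p->fault_count < 3)              /* saturating count         */
            p->fault_count++;
        if (p->fault_count >= 3)             /* deemed actively shared   */
            p->perm = CPU_GPU;               /* fall back to block-grain
                                                hardware coherence       */
        else
            p->perm = (who == REQ_CPU) ? CPU_ONLY : GPU_ONLY;
    }
    /* Note: CPU_INIT pages are downgraded to CPU_ONLY or GPU_ONLY by the
       hardware page walker after the first kernel launch, without faulting
       (Figure 5(b), right), so the handler never sees that transition. */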

Application-apparent optimization: The application-apparent optimization proposed in SAHC needs very minimal modifications to application code (often limited to one or two lines of annotation). Driven by our analysis in Section 4, we propose a very simple optional optimization in SAHC that eliminates page faults due to post-processing on the CPU (Figure 3(b) and Table 1). We introduce a new runtime API called gpuWorkFinish(). An application can invoke gpuWorkFinish() to indicate that the given program will not use the GPU for any further processing. The runtime then flushes the GPU cache hierarchy and sets a single-bit register called gpu_work_finish. If this register is set, the hardware does not raise a page fault on a CPU access to a page with GPU_ONLY permission. Thus, this avoids the permission faults that would otherwise have occurred due to post-processing of the GPU's data on the CPU. This register is reset when the program finishes execution. Note that this optimization needs only a single-line code change in an application.
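As a usage illustration, the single-line application change might look like the following host-code sketch in C. The kernel-launch helper is hypothetical; gpuWorkFinish() is the runtime API proposed above.

    /* Host code sketch: the application hints that GPU work is done. */
    extern void launch_reduce_kernel(float *data, int n); /* hypothetical */
    extern void gpuWorkFinish(void);  /* proposed runtime API (Sec. 5.2.3) */

    double compute(float *data, int n) {
        for (int iter = 0; iter < 100; iter++)
            launch_reduce_kernel(data, n);   /* GPU phase */

        gpuWorkFinish();   /* the single-line change: flushes GPU caches
                              and sets gpu_work_finish, so later CPU reads
                              of GPU_ONLY pages do not fault */

        double sum = 0;                      /* CPU post-processing phase */
        for (int i = 0; i < n; i++)
            sum += data[i];
        return sum;
    }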
If simulation for processor temperatures; LU Decomposition (lud); the access permission of the page on which the requested cache Needleman-Wunsch (nw), a global optimization method for DNA Table 2. Simulation Parameters. 2.05 Ideal CPU Clock 2 GHz 1.25 CPU Cores 2 1.2 Baseline CPU L1 Data Cache 64 kB (2-way banked) 1.15 SAHC-T CPU L1 Instruction Cache 64 kB (2-way banked) CPU Shared L2 Cache 2 MB (16-way banked) 1.1 SAHC-A GPU Clock 1 GHz 1.05 Compute Units 32 1 Compute-unit SIMD Width 64 scalar units by 4 SIMDs (Loweris better) GPU L1 Data Cache 32 kB (16-way banked) 0.95

Table 2. Simulation parameters.

  CPU clock                       2 GHz
  CPU cores                       2
  CPU L1 data cache               64 kB (2-way banked)
  CPU L1 instruction cache        64 kB (2-way banked)
  CPU shared L2 cache             2 MB (16-way banked)
  GPU clock                       1 GHz
  Compute units                   32
  Compute-unit SIMD width         64 scalar units by 4 SIMDs
  GPU L1 data cache               32 kB (16-way banked)
  GPU L1 instruction cache        32 kB (8-way banked)
  GPU shared L2 cache             4 MB (64-way banked)
  L3 memory-side cache            16 MB (16-way banked)
  DRAM                            DDR3, 16 channels, 667 MHz
  Peak memory bandwidth           700 GB/s
  Baseline directory              262,144 entries (8-way banked)
  Page permission fault latency   5000 CPU cycles (avg.)

6.2 Results

Performance Comparison: Figure 7 depicts the performance comparison across four configurations: ideal, baseline, SAHC with application-transparent optimizations (SAHC-T), and SAHC with the application-apparent optimization (SAHC-A). The ideal and baseline configurations are as described in Section 3. More specifically, ideal mimics an unrealistic system with no coherence overheads, while baseline simulates a conventional system where coherence is enforced by the hardware coherence directory protocol (Figure 1). SAHC-T implements the page assignment policy depicted in Figure 5(b). SAHC-A adds, on top of SAHC-T, the ability for an application to notify the end of work on the GPU.

[Figure 7. Performance normalized to the ideal system (lower is better) for the ideal, baseline, SAHC-T, and SAHC-A configurations across the eleven workloads.]

We observe that several workloads, like bitonic, matrixmul, and kmeans, can incur significant performance overhead due to cache coherence in the baseline configuration. SAHC-T significantly reduces or almost eliminates the coherence overheads for bitonic, coloring, histogram, and kmeans. However, SAHC-T can potentially hurt performance for bfs, matrixmul, and nw due to costly page faults (detailed in the next subsection). Many of these page faults occur due to post-processing of GPU-produced data on the CPU. SAHC-A eliminates many of these costly page permission faults using an application-provided hint indicating when no further work will be scheduled on the GPU. Note that this requires only a single-line code modification in the application.

Analysis: In this subsection, we analyze the results described above. Specifically, we measure the fraction of coherence directory lookups avoided by SAHC, and also measure the overheads introduced by SAHC.

Table 3. Hardware directory lookup savings and page permission faults under SAHC.

                 Hardware directory        Number of page
  Application    lookups saved (%)         permission faults
                 SAHC-T      SAHC-A        SAHC-T      SAHC-A
  bfs             98.64       99.61          9530          10
  bitonic        100         100                1           1
  coloring        99.44       99.44            9           9
  histogram      100         100              130         129
  hotspot        100         100                1           1
  kmeans          99.01       99.01           617         617
  lud            100         100                1           1
  matrixmul      100         100              259           0
  nw             100         100             4100           0
  reduction      100         100               18          17
  srad           100         100              262           0

The first two result columns of Table 3 show the percentage of hardware coherence directory lookups saved by SAHC-T and SAHC-A, respectively.
We observe that SAHC often saves 99-100% of the coherence directory lookups needed in the baseline architecture. This is expected, as our analysis in Section 4 showed that a very large fraction of directory lookups is unnecessary.

SAHC incurs performance overhead due to the slow page faults triggered by changes in the page-grain coherence permissions. The last two columns of Table 3 show the number of such page permission faults introduced by the two flavors, SAHC-T and SAHC-A. We observe that for applications like bfs and nw, SAHC-T incurs a significant number of page faults, which explains the overheads seen in Figure 7 for these applications. Further, we observe that for many of these applications, SAHC-A eliminates a large number of these page permission faults by eliminating the ones caused by post-processing of data on the CPU.

7. Related Work

Researchers have proposed several ways to enable coherence in emerging heterogeneous systems. One such proposal is Cohesion [10], which implements heterogeneous coherence using a combination of software and hardware techniques, where data is allowed to dynamically migrate between coherence domains owned by the CPU and GPU. The approach is shown to work well for bulk-synchronous data sharing, but the design requires all misses in the local cache hierarchies to query the system-level directory to determine which coherence domain owns the block. More recently, Power et al. [15] proposed using a region-granular directory to avoid unnecessary lookups at the system-level directory. However, their approach always manages coherence at the region granularity, and it requires a significant redesign of today's block-based cache coherence mechanisms. In addition, Singh et al. proposed an intra-GPU coherence protocol that uses timestamps to manage cache block permissions [17]. Their work focused on intra-GPU coherence, whereas SAHC focuses on inter-CPU-GPU coherence.

In addition to heterogeneous coherence solutions that leverage hardware, recent software-only solutions have also been proposed. In particular, PTask [16] introduced a task graph library that manages memory coherence without programmer intervention. Similarly, the "Unified Memory" support in CUDA 6 automatically migrates data between CPU and GPU memories by leveraging system software and existing virtual address translation support [21]. In addition, Asymmetric Distributed Shared Memory (ADSM) and Gelado et al.'s software solution, called GMAC, also provide a logically shared address space between the CPU and GPU [7]. In comparison, SAHC strives to avoid the performance overhead of software-only coherence, while requiring less hardware support than prior hardware-based solutions.

Beyond heterogeneous and GPU coherence proposals, several prior efforts investigated reducing coherence bandwidth in CPU-based systems. For example, JETTY [14] and stream registers [9] proposed filtering out cache snoops using specialized hardware structures. Meanwhile, others have proposed region-based hardware coherence [13], [19], as well as virtual tree coherence [6], subspace snooping [11], in-network coherence filtering [1], and directories based on Bloom filters [12].

Reducing directory resources in CPU-based systems has also been a recent focus of researchers. For instance, spatiotemporal coherence tracking [2] reduces directory size by tracking private data at a region granularity instead of at the cache block granularity. Similarly, multigrain coherence directories opportunistically use region-granular entries to reduce storage overhead [18].

Finally, R-NUCA [8] proposed using OS page classification to achieve better cache hit rates and/or lower cache access latencies in large NUCA caches. R-NUCA uses page classification information to guide replication, migration, and placement of cache blocks according to the dynamic behavior of the data/instructions. Similar to R-NUCA, our proposal makes use of an OS page classification technique, but for a very different purpose: SAHC uses page classification for efficient cache coherence in heterogeneous systems, while R-NUCA strives to better manage NUCA caches.

8. Conclusion

Efficiently supporting cache coherence in emerging heterogeneous architectures without necessitating application management of coherence is challenging. SAHC addresses this challenge by extending the virtual memory subsystem in a novel way to encode coherence permissions in page attributes. SAHC then uses minimal modifications to the system software (OS and runtime) and hardware to enable efficient coherence across CPU and GPU caches. More importantly, unlike previous proposals, SAHC neither needs a revamped cache coherence hardware nor necessitates application management of coherence.

9. ACKNOWLEDGMENTS

The authors thank Ayse Yilmazer for her help with the gem5 simulator. AMD, the AMD Arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. ARM® is a registered trademark of ARM Limited. Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies.

10. REFERENCES

[1] N. Agarwal, L.-S. Peh, and N. K. Jha, "In-Network Coherence Filtering: Snoopy coherence without broadcasts," in 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-42), 2009, pp. 232-243.

[2] M. Alisafaee, "Spatiotemporal Coherence Tracking," in Proceedings of the 45th Annual IEEE/ACM International Symposium on Microarchitecture, 2012, pp. 341-350.

[3] AMD Radeon Graphics Technology, "AMD Graphics Cores Next (GCN) Architecture White Paper," Jun. 2012.

[4] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti, R. Sen, K. Sewell, M. Shoaib, N. Vaish, M. D. Hill, and D. A. Wood, "The gem5 Simulator," SIGARCH Comput. Archit. News, vol. 39, no. 2, pp. 1-7, Aug. 2011.

[5] S. Che, J. W. Sheaffer, M. Boyer, L. G. Szafaryn, L. Wang, and K. Skadron, "A characterization of the Rodinia benchmark suite with comparison to contemporary CMP workloads," in 2010 IEEE International Symposium on Workload Characterization (IISWC), 2010, pp. 1-11.

[6] N. D. Enright Jerger, L.-S. Peh, and M. H. Lipasti, "Virtual Tree Coherence: Leveraging Regions and In-network Multicast Trees for Scalable Cache Coherence," in Proceedings of the 41st Annual IEEE/ACM International Symposium on Microarchitecture, 2008, pp. 35-46.

[7] I. Gelado, J. E. Stone, J. Cabezas, S. Patel, N. Navarro, and W. W. Hwu, "An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems," in Proceedings of the Fifteenth Edition of ASPLOS on Architectural Support for Programming Languages and Operating Systems, 2010, pp. 347-358.

[8] N. Hardavellas, M. Ferdman, B. Falsafi, and A. Ailamaki, "Reactive NUCA: Near-optimal Block Placement and Replication in Distributed Caches," in Proceedings of the 36th Annual International Symposium on Computer Architecture, 2009, pp. 184-195.

[9] N. Jayasena, M. Erez, J. H. Ahn, and W. J. Dally, "Stream register files with indexed access," in Proceedings of the 10th International Symposium on High Performance Computer Architecture (HPCA), 2004, pp. 60-72.

[10] J. H. Kelm, D. R. Johnson, W. Tuohy, S. S. Lumetta, and S. J. Patel, "Cohesion: A Hybrid Memory Model for Accelerators," in Proceedings of the 37th Annual International Symposium on Computer Architecture, 2010, pp. 429-440.

[11] D. Kim, J. Ahn, J. Kim, and J. Huh, "Subspace Snooping: Filtering Snoops with Operating System Support," in Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques, 2010, pp. 111-122.

[12] P. Lotfi-Kamran, M. Ferdman, D. Crisan, and B. Falsafi, "TurboTag: Lookup Filtering to Reduce Coherence Directory Power," in Proceedings of the 16th ACM/IEEE International Symposium on Low Power Electronics and Design, 2010, pp. 377-382.

[13] A. Moshovos, "RegionScout: exploiting coarse grain sharing in snoop-based coherence," in 32nd International Symposium on Computer Architecture (ISCA '05), 2005, pp. 234-245.

[14] A. Moshovos, G. Memik, A. Choudhary, and B. Falsafi, "JETTY: Filtering Snoops for Reduced Energy Consumption in SMP Servers," in Proceedings of the 7th International Symposium on High-Performance Computer Architecture, 2001, p. 85.

[15] J. Power, A. Basu, J. Gu, S. Puthoor, B. M. Beckmann, M. D. Hill, S. K. Reinhardt, and D. A. Wood, "Heterogeneous System Coherence for Integrated CPU-GPU Systems," in Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, 2013, pp. 457-467.

[16] C. J. Rossbach, J. Currey, M. Silberstein, B. Ray, and E. Witchel, "PTask: Operating System Abstractions to Manage GPUs As Compute Devices," in Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles, 2011, pp. 233-248.

[17] I. Singh, A. Shriraman, W. W. L. Fung, M. O'Connor, and T. M. Aamodt, "Cache Coherence for GPU Architectures," IEEE Micro, vol. 34, no. 3, pp. 69-79, May 2014.

[18] J. Zebchuk, B. Falsafi, and A. Moshovos, "Multi-grain Coherence Directories," in Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, 2013, pp. 359-370.

[19] J. Zebchuk, E. Safi, and A. Moshovos, "A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy," in Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture, 2007, pp. 314-327.

[20] "AMD APP SDK." [Online]. Available: http://developer.amd.com/tools-and-sdks/opencl-zone/amd-accelerated-parallel-processing-app-sdk/

[21] "CUDA: Unified Memory." [Online]. Available: http://devblogs.nvidia.com/parallelforall/unified-memory-in-cuda-6/

[22] "HSA Foundation." [Online]. Available: http://www.hsafoundation.com/