A Simple Coherence Scheme for Integrated CPU-GPU Systems

Ardhi W. B. Yudha, Surya University, [email protected]
Reza Pulungan, Universitas Gadjah Mada, [email protected]
Henry Hoffmann, University of Chicago, [email protected]

ABSTRACT

This paper presents a novel approach to accelerate applications running on integrated CPU-GPU systems. Many integrated CPU-GPU systems use cache-coherent shared memory to communicate efficiently and easily. In this pull-based approach, the CPU produces data for the GPU to consume, and the data resides in a shared cache until the GPU accesses it, resulting in long load latency on a first GPU access to a cache line. In this work, we propose a new, push-based coherence mechanism that explicitly exploits the CPU and GPU's producer-consumer relationship by automatically moving data from the CPU to the GPU's last-level cache. The proposed mechanism results in a dramatic reduction of the GPU's L2 cache miss rate in general, and a consequent increase in overall performance. Our experiments show that the proposed scheme can increase performance by up to 37%, with typical improvements in the 5–7% range. We find that even if an application does not benefit from the proposed approach, performance is never worse under this model. While we demonstrate how the proposed scheme can co-exist with traditional cache-coherence mechanisms, we also argue that it could serve as a simpler replacement for existing protocols.

1. INTRODUCTION

The abundance of transistors available on a die has led to new hardware designs where CPUs and GPUs are integrated onto a single chip [10, 18, 34, 47, 48]. This integration presents new opportunities for increased performance and ease of programming, as integrated CPU-GPU designs can communicate through cache-coherent shared memory (CCSM) rather than through explicit data copying from CPU to GPU memory.

In fact, for most integrated CPU-GPU systems, the critical operation is moving data from CPU to GPU, and sending this data via CCSM incurs overhead in the form of coherence messages. The CPU-GPU relationship is different from peer cores in traditional shared-memory systems, however. The CPU is a data producer and the GPU is a consumer. GPU performance depends directly on how fast it can consume the CPU-produced data. This relationship presents an opportunity for further performance gain by exploiting the producer-consumer relationship to eliminate unnecessary cache coherence messages, reduce the GPU miss rate, and improve system performance.

This paper proposes an augmentation of existing CPU-GPU cache coherence mechanisms. Specifically, performance-critical data is homed on the GPU. When the CPU writes to such data, the proposed coherence protocol forwards the data to the GPU, instead of leaving it in the memory close to the CPU. This proposal is an intuitive match for the CPU-to-GPU producer-consumer relationship, as it moves the data to the memory system where it will be consumed, reducing the GPU miss rate and improving overall performance.

While the basic idea of homed memory has appeared in many systems—from commercial products for distributed shared memory [13] and manycores [16] to research designs [21]—the integrated CPU-GPU system provides new challenges and opportunities. Specifically, prior approaches required that users explicitly declare the home core for different data ranges (e.g., in the TILE architecture [16]) or required added hardware complexity to automatically move data to an appropriate home (e.g., [21]). The unique nature of an integrated CPU-GPU system makes it possible to automatically translate existing programs to home memory and thus achieve the performance benefits of homing data on the GPU while requiring no extra work from users and keeping hardware complexity low.

Thus, in this paper we show how to support homed memory on an integrated CPU-GPU system. Our proposed approach includes a source-to-source translation that converts existing GPU programs to use our mechanism. Specifically, the translator changes existing memory allocations to allocate data that will be homed on the GPU in a specific address range. Data in that range cannot be cached on the CPU, only on the GPU. Therefore, when the CPU attempts to write it (to produce data for the GPU's consumption), there will be a write miss. Concurrently, the modified Translation Look-aside Buffer (TLB) will detect the address range and tell the CPU memory system to forward the write miss to the GPU memory system. The GPU then resolves the write miss and the produced data is stored in the GPU memory hierarchy. Then, when the GPU accesses it, the data is already in its local cache, reducing both compulsory misses and load latency on the first data access.

We test the proposed scheme by implementing it in the gem5-gpu simulator and compare it with recent proposals for traditional CCSM between CPU and GPU. Across benchmarks from a range of suites, we find that our proposed approach can achieve performance improvements as high as 37%, with typical improvements in the range of 5–7%. Furthermore, we find that performance under our scheme is never worse than with CCSM. In addition, we compare the proposed scheme with CCSM augmented with a straightforward stride prefetcher and see that the proposed scheme is much more efficient. While the proposed technique can co-exist with CCSM, we also believe it could be a simpler replacement for this approach.

2. RELATED WORK

There have been massive efforts toward tightly integrating CPUs and GPUs on a single die [23, 26, 28, 31, 35]. Prior work has tackled programming models [24, 33, 42] and network design [39], and many have studied the performance impact of the integration [6, 9, 15, 38, 40] and the key characteristics that deliver the best performance [29, 46, 50].

Cache coherence is one of the primary concerns for tightly integrated CPUs and GPUs, including both hardware-based [14, 17, 30, 34, 43, 44] and software-based [42, 51] cache coherence mechanisms. Several projects have investigated the performance of existing coherence mechanisms in this new context [11, 14, 36, 42, 45], while other work has even investigated tight integration without cache coherence [2].

The mechanism investigated in this paper bears some resemblance to prefetching techniques, implemented in both software [20, 27, 52] and hardware. Many prefetching techniques have been proposed, including a technique that assumes that the integrated CPU-GPU system shares the last-level cache (LLC) [49], a dynamically adaptive prefetching strategy to reduce memory latency and improve power efficiency [37], and a hybrid of software and hardware prefetching [22], which combines conventional stride prefetching with a hardware-implemented inter-thread technique, where each thread prefetches for threads located in other warps.

A GPU selective caching strategy that enables unified shared memory without a hardware cache-coherence protocol is proposed in [2]. It involves architectural improvements that address the performance reduction of selective caching: aggressive request coalescing, CPU-side coherent caching for GPU uncacheable requests, and a CPU-GPU interconnect optimization to support variable-sized transfers. On the other hand, heterogeneous system coherence for CPU-GPU systems, which addresses the effect of the huge number of GPU memory requests, is proposed in [34]. That mechanism replaces the standard directory with a region directory and inserts a region buffer into the L2 cache.

While prefetching shares the goal of minimizing load latency, prefetching mechanisms are very different from, and complementary to, the mechanism proposed in this paper.

Perhaps the most similar mechanism is remote store programming (RSP) [16]. RSP is a memory model for multicores that requires less hardware support than CCSM or direct memory access (DMA). RSP sacrifices store latency for low load latency because most programs tolerate store latency better than load latency. Like our proposed approach, RSP homes data on particular cores, so that data can only exist in cache on the home core. Compared to DMA and CCSM, RSP shows dramatic improvements when evaluated for 32 or more processors. The downside is that determining producer-consumer relationships in multi-threaded multicore programs can be quite challenging. Our work adapts the RSP idea to an integrated CPU-GPU environment, to which it is much better suited, as the producer-consumer relationships are well-defined and easily determined.

3. DESIGN AND IMPLEMENTATION

Integrated CPU-GPU systems allow the GPU to be placed on the same die as the CPU. As a consequence, it is now possible for the CPU and GPU to share resources, including cache memory, and the short distance between them allows for a high-speed interconnection network to and from each of the processing cores for exchanging messages. The CPU and GPU perform best on different kinds of workloads, and most heterogeneous workloads offload some portion of the computation to the GPU as a hardware accelerator. Given this division of labor, the CPU usually produces data which is later consumed by the GPU to accelerate the computation. In conventional cache-coherence protocols for such integrated CPU-GPU systems, the data is pulled to the GPU on a load instruction. In this paper, however, we propose a scheme for pushing the data directly to the GPU's cache to reduce load latency.

In realizing such a scheme, several challenges arise. Additional information may be needed at the source-code level about the variables that are to be stored remotely; memory allocation for variables that need to be stored on the GPU, detection of the specific addresses that must be treated differently, and new rules for storing those variables in the caches need to be added. Also, a dedicated interconnection network is helpful to provide fast data delivery and to reduce the complexity of the cache coherence modifications. This section describes our proposal for such a push-based cache coherence scheme, which we call direct store.

3.1 Data Flow

The data flows of traditional cache coherence and of our proposed scheme are different. Figure 1 compares the data movement between the state-of-the-art CCSM used in integrated CPU-GPU systems and the proposed direct store. In the Hammer protocol, all CPU and GPU caches maintain cache coherence, except for the GPU's L1 cache [19]. Although the L1 cache is non-coherent in terms of hardware, by writing through dirty data and flash-invalidating, it retains coherence at the time a kernel starts execution. Data movement in CCSM involves more steps than in direct store, as depicted in Figure 1 (top). In CCSM, when a CPU core issues the store instruction st x, it also copies the data from the register file (RF) to its L1 cache. Sometime afterwards, there are three possibilities: the data remains in the CPU's L1 cache, it is moved to the lower (L2) cache, or it is evicted to DRAM. In direct store, on the other hand, when a CPU core issues the store instruction st x, a copy is sent directly to the GPU's L2 cache, as illustrated by step 1 in Figure 1 (bottom). In case the GPU's L2 cache is full, the system writes the data to DRAM instead.

When the GPU wants to consume data produced by the CPU in a CCSM system, the GPU fetches it from its L1 cache. If it is not in that cache, then the GPU core requests it from the GPU's L2 cache. If it is in the GPU's L2 cache, the GPU still fetches it into the GPU's L1 cache, as illustrated by step b in Figure 1 (top). The worst case is when the GPU's L2 cache does not contain the requested data. The GPU then has to pull it into the GPU's L2 cache. Since there are three possible places the data may reside, it may have to be pulled from any of them, as illustrated by steps c1, c2, and c3 in Figure 1 (top). This CCSM data movement is very different from what we propose in direct store. In our proposal, the GPU only needs to pull the data into its L1 cache, since it is already in the GPU's L2 cache and can be fetched immediately, as illustrated by steps a and b in Figure 1 (bottom). In summary, the main advantage of direct store over CCSM from the data-movement point of view is that the GPU does not need to pull data into its local cache, as it is already there. This approach reduces the message traffic needed for coherence when providing the data to the GPU.

Figure 1: Comparison of data movement between CPU and GPU under CCSM and the proposed direct store. In CCSM (top), data is stored into the CPU's cache and is only pulled to the GPU on a load, inviting a large load latency. In direct store (bottom), data homed to the GPU is immediately written to the GPU's L2 cache; when the GPU loads it, it hits immediately.

The following subsections detail the design of our proposed scheme to achieve this shorter data flow.

3.2 Overall Design

Figure 2 depicts the five components of direct store for integrated CPU-GPU systems. The first is automatic code translation, which modifies a standard program without CUDA memory copy into a program that can run on our proposed scheme. Second, when the program is running, it reserves and allocates a region of memory in the GPU's L2 cache. The third component is address translation, which occurs when a CPU store targets the reserved memory; this address translation is accomplished by added functionality in the TLB. The fourth component is a special cache coherence mechanism, which kicks in when a CPU store targets the reserved memory. The fifth is a new interconnection network that brings data directly to the GPU's L2 cache.

Figure 2: The overall design and the components of direct store on integrated CPU-GPU systems: code translation, memory allocation, address translation, special cache coherence behavior, and a new interconnection network.

Intuitively, direct store increases the latency of CPU stores to memory regions designated for the GPU. In exchange, the data to be loaded by the GPU is moved into its memory hierarchy at creation. Thus, the protocol is designed to decrease GPU load latency (which is hard to hide) in exchange for increased CPU store latency (to which most programs are less sensitive).

3.3 Automatic Code Translation

We develop a code translator to implement our proposed scheme automatically from baseline benchmarks that use no memory copy. The automatic code translator works as follows.

The translator takes as input the application's location. All files within the input directory are scanned. The translator inspects each file in the application directory to identify variables that will be accessed by the GPU. This is done by scanning all variable references in the kernel invocations in all files. The translator searches for the pattern that matches a kernel call, kernel_name<<<...>>>(x1, x2, ..., xn), where x1, x2, ..., xn are the variable names captured by the translator and stored in temporary memory. Once all kernel invocations have been inspected, the translator scans each source code file to determine the amount of memory allocated for each variable and the variable's type, and stores that information in another temporary memory. Therefore, each variable name is associated with the memory size needed and its type. Finally, the translator iterates over all files inside the directory to find the memory declarations of the variables recorded in temporary memory.

The translator searches for memory declarations that use the malloc and cudaMalloc syntax. Whenever such a declaration is found, the translator modifies it to use mmap memory allocation, sets its memory size and starting virtual address, and saves the modification to the associated file. The translator increments the starting virtual address by the memory size needed by the variable, so that no two variables have overlapping address ranges. After all iterations are completed, the application is ready to be compiled in the standard way.

By using this automatic code translator, no effort is required from the programmer to modify the source code to implement the direct store scheme. Therefore, there is no need for the programmer to understand the proposed scheme in detail.
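To make the translator's effect concrete, the sketch below shows a before-and-after fragment under our reading of the description above. The kernel, variable names, and the ds_alloc() helper are illustrative assumptions rather than the translator's actual emitted code; the helper anticipates the mmap-based allocation detailed in Sections 3.4 and 3.5, and it assumes the integrated, shared-address-space setting the paper targets, where a kernel can dereference a host pointer directly.

```cuda
// Hypothetical example of the source-to-source rewrite (not the tool's real output).
#include <cuda_runtime.h>
#include <sys/mman.h>
#include <stddef.h>
#include <stdint.h>

#define DS_BASE 0x80000000UL            /* assumed start of the reserved range */
static uintptr_t ds_next = DS_BASE;

/* Assumed helper standing in for the translator-emitted allocation: map the
 * buffer at a fixed high-order virtual address so CPU stores to it can be
 * recognized by the modified TLB. */
static void *ds_alloc(size_t bytes)
{
    const size_t page = 4096;
    size_t len = (bytes + page - 1) & ~(page - 1);          /* page-align the length */
    void *p = mmap((void *)ds_next, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
    if (p == MAP_FAILED) return NULL;
    ds_next += len;                      /* advance so regions never overlap */
    return p;
}

__global__ void consume(float *x, int n)                    /* kernel argument x is what */
{                                                            /* the translator captures   */
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

void run(int n)
{
    /* Before translation:  float *x = (float *)malloc(n * sizeof(float));      */
    /* After translation, only the allocation changes; the launch is untouched: */
    float *x = (float *)ds_alloc(n * sizeof(float));
    for (int i = 0; i < n; i++) x[i] = (float)i;   /* CPU produces x: stores are pushed */
    consume<<<(n + 255) / 256, 256>>>(x, n);       /* GPU consumes x: loads hit in L2   */
    cudaDeviceSynchronize();
}
```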

3.4 Special Memory Allocation

The proposed method needs OS support for reserving exclusive memory regions. In our implementation, we use mmap, a Unix system call that maps files or devices into the virtual address space of the calling process.

The mmap function is declared as *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset), where addr specifies the starting address of the new mapping. If addr is NULL, the kernel chooses the (page-aligned) address at which to create the mapping; otherwise the kernel takes that address as a hint for the mapping. In our work, the special memory allocation works by setting the argument addr to a high-order address; by also setting flags to MAP_FIXED, we can map the region at exactly the address specified by addr. By allocating high-order addresses for remotely stored data in this way, we give the TLB a signal that data at these addresses needs to be stored remotely in the GPU's L2 cache.

3.5 Translation Look-aside Buffer

Figure 3 illustrates the control flow of direct store on a store targeting the GPU's L2 cache. First, the CPU allocates data that will be read by the GPU with the special memory allocation. Specifically, we reserve bits of the virtual address space to represent data that is homed on the GPU. This special data range can never be cached on the CPU side (so accesses from the CPU will always miss), but it can be cached on the GPU. When CPU stores target this memory, the memory management unit (MMU) translates the virtual address (VA) to a physical address (PA) by consulting the TLB. By definition, this memory will always result in a cache miss on the CPU. On detecting the special memory region's VA, the TLB signals the MMU to forward the store to the GPU (over a dedicated network), which resolves it. When the GPU wants to consume data stored by the CPU, it simply issues a standard load.

Figure 3: The control flow when the CPU issues a store instruction and the roles of the TLB.

Therefore, we need additional functionality in the TLB to implement our proposed scheme. We modify the TLB by adding logic to detect certain regions of virtual addresses. The region of virtual addresses used to distinguish a remote store from an ordinary store is the high-order virtual addresses, which use the memory range of the reserved system virtual address space. The system's virtual address space lies within the range 0x80000000 through 0xFFFFFFFF. When a high-order virtual address is referenced by the program, the TLB performs an address comparison to detect it. To do this, we first choose the address 0x80000000 as the base address of the high-order range. The virtual address of the store is then compared to this base address using standard address comparison. Whenever the virtual address of the store is higher than the base address, the TLB sends a signal to the MMU that is forwarded to the CPU's L1 cache controller, telling it to forward the store to the GPU's L2 cache.
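The following is a minimal behavioral sketch of the address check described above, written as simulator-style C rather than hardware. The constant name, function boundaries, and the two commented-out handler calls are our own placeholders, not part of the design.

```c
#include <stdint.h>
#include <stdbool.h>

#define DS_HOME_BASE 0x80000000UL   /* base of the reserved, GPU-homed range */

/* Behavioral model of the added TLB logic: any store whose virtual address
 * lies at or above the base of the reserved range is flagged as a remote
 * store, so the L1 controller forwards it to the GPU's L2. */
static bool is_remote_store(uintptr_t va)
{
    return va >= DS_HOME_BASE;      /* a single unsigned comparison in hardware */
}

/* Hypothetical hook invoked by an MMU model on every store after translation. */
void on_store(uintptr_t va, uintptr_t pa)
{
    (void)pa;
    if (is_remote_store(va)) {
        /* forward_to_gpu_l2(pa);  -- push over the dedicated network (Section 3.7) */
    } else {
        /* store_to_cpu_l1(pa);    -- ordinary CCSM path */
    }
}
```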
3.6 Cache Coherence Protocol

Since we add a network between the CPU and the GPU's L2 cache, we need only a simple modification to the cache coherence protocol, because the data does not travel through the whole cache hierarchy and therefore does not affect the protocol much. However, for direct store to function correctly, we need to specify to the protocol the store state of the remotely stored data. Figure 4 depicts a modified state-transition diagram of AMD's Hammer cache coherence protocol [1], which consists of five stable states: MM, M, O, S, and I. State MM represents that the node has an exclusive hold on the cache block and the block is potentially locally modified (similar to the conventional M state). State O indicates that the node owns the cache block but has not modified it, and no other node holds the block in exclusive mode, although sharers may exist. State M indicates that the cache block is held in exclusive mode but has not been written to (similar to the conventional E state), and no other node holds a copy of the block; stores are not allowed in state M. State S indicates that the cache line holds the most recent, correct copy of the data, but other processors in the system may hold copies of the data in the shared state; the cache line can be read, but not written to, in state S. State I indicates that the cache line is invalid and does not hold a valid copy of the data.

Figure 4: The modified state-transition diagram of the Hammer protocol (based on [1]).

We modify the protocol for handling a store instruction that targets a remote location by adding new actions, depicted by bold text in the figure. The modification is minimal because CPU-to-GPU stores travel through the new network. When a remote store instruction arrives at the CPU's L1 cache targeting the GPU's L2 cache, the protocol starts from state I and the data is forwarded directly from the CPU's L1 cache to the GPU's L2 cache, without caching it in the L1 cache, through the newly added network that connects them. After the data is sent through the network, the protocol remains in state I.

To cover other possibilities, where the CPU might load data after remotely storing it, we add the ability to perform a remote store from states S, M, and MM. All remote stores that begin in states S, M, or MM always go to state I. This is safe because when the protocol transitions from one of these states to state I, it obtains exclusive permission for the cache block involved. In the GPU's L2 cache, after a store instruction is forwarded there, the block begins in state I and then always proceeds to state MM. This is safe because we make sure that the remotely stored data is deposited only in the GPU's L2 cache, not in any other cache.

The GPU's L2 cache runs exactly the same protocol as the CPU. However, there is a single additional feature that makes it possible for the GPU to accept stores from a remote location. Every time a remote store arrives at the GPU's L2 cache, the block transitions from state I to MM, as illustrated by the blue dashed transition in Figure 4. This transition always starts from state I since, before forwarding the data, the CPU issues a GETX command. The store is issued as a PUTX action, indicating that the store targets the GPU's L2 cache.
3.7 Interconnection Network

When a CPU store instruction targeting the GPU is detected by the TLB, the TLB tells the MMU to direct the store to the GPU's L2 cache. We therefore add a fast network directly connecting the CPU and the GPU's L2 cache. Figure 5 shows the newly added network, indicated by the dotted line. By connecting the CPU and the GPU's L2 cache directly, the data transfer is faster because it bypasses much of the cache hierarchy and eliminates many coherence messages. Since we bypass much of the cache hierarchy and go straight to the GPU's L2 cache, we need only introduce minor modifications to the cache coherence protocol, namely those related to the CPU's L1 cache and the GPU's L2 cache. The newly added interconnection network has exactly the same characteristics as the networks used in many cache-coherence systems to exchange messages.

Figure 5: The topology of the simulated system with a network connecting the CPU and the GPU's LLC (assumed to be L2 in this paper).
3.8 Integrating All Components Together

The first step to run direct store is to modify the application: it should not use any explicit memory copy from CPU to GPU. To begin, data that will be used by the kernel code needs to be identified. The next step is to change the CPU memory allocation of the data that will later be used by the GPU: instead of malloc, these allocations should use the mmap function with high-order addresses. These steps are all accomplished by the automatic code translator.

After modifying the application, we simply run it as is on a standard integrated CPU-GPU system. When the CPU tries to store data that was remotely allocated, the store is detected by the TLB. We have modified the TLB so that it can detect high-order addresses, and when that happens, the TLB sends a different store message to the CPU's L1 cache. When the message arrives at the CPU's L1 cache, the cache recognizes the different store message; then, instead of storing the data in its own cache, it sends the data to the GPU's L2 cache through the new network. The GPU's L2 cache receives the data from the CPU and stores it as if it came from the GPU's L1 cache.

3.9 Implementing Direct Store without Traditional Cache Coherence

Although we have presented this scheme as a complement to existing coherence mechanisms (since we think that would be the most likely use case), it could be adopted as a simpler, stand-alone replacement for communication between CPU and GPU. The proposed scheme is simple to implement because it requires fewer coherence messages than traditional protocols. This reduction in messages comes from the fact that the proposed scheme only allows memory shared between GPU and CPU to exist in the GPU cache. Thus, there will not be multiple sharers, and so there are no transitions from (for example) shared states to exclusive states.

The proposed scheme could easily be attached to systems that use state-of-the-art CCSM. This opens a new possibility for the programmer to use the new approach for only a particular type of data, while the remaining data still uses the traditional approach. The programmer can set large variables to use this approach, since this speeds up data movement, and let the remaining small data use the CCSM system, as sketched below. Thus, the programmer benefits from both the fast CPU-GPU data transfer proposed in this scheme and the ease of use of the CCSM system.

The proposed scheme could also replace the entire CCSM system and thus yield a simpler design with better performance. To do this we only need an interconnection network that directly connects the CPU to the GPU's L2 cache, and to remove the CCSM system's interconnection network. The programmer then structures the code to fit the proposed scheme; the main requirement is to ensure that data used by the GPU is allocated in its L2 cache. Finally, the program is compiled and run in the standard way.
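A minimal sketch of this mixed usage, assuming the hypothetical ds_alloc() helper from the earlier sketch (declared here for brevity): the large, GPU-consumed array is homed on the GPU, while a small CPU-side buffer keeps the ordinary CCSM path.

```c
#include <stdlib.h>

void *ds_alloc(size_t bytes);   /* hypothetical GPU-homed allocation (Section 3.3 sketch) */

void setup(int n)
{
    /* Large producer-consumer array: homed on the GPU, pushed on CPU stores. */
    float *samples = (float *)ds_alloc((size_t)n * sizeof(float));

    /* Small bookkeeping data touched mostly by the CPU: ordinary allocation,
     * kept coherent by the existing CCSM protocol. */
    int *histogram = (int *)calloc(256, sizeof(int));

    (void)samples; (void)histogram;
}
```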
4. EXPERIMENTAL EVALUATION

This section details the platform and the applications we use to evaluate direct store on integrated CPU-GPU systems, as well as the experimental results.

We first describe our simulation infrastructure and the tested applications. We then seek to evaluate the following questions:

• Does the proposed approach improve performance over CCSM? We answer this question by measuring the relative run times for both CCSM and our proposed approach for all applications. We also test the performance gains for small inputs, which fit into the GPU LLC, and large inputs, which exceed the LLC capacity.

• Does the proposed approach reduce cache misses? Here we measure the miss rate reductions for both CCSM and the proposed approach. In addition, we believe the proposed approach should specifically reduce compulsory misses, so we measure the number of compulsory misses for both approaches.

• Does the proposed approach do a better job of reducing misses (and increasing performance) than other simple techniques? To answer this question we compare our proposed approach to a version of CCSM augmented with a simple stride prefetcher.

The results and analysis for all these experiments are included in subsequent sections.

4.1 Simulation Infrastructure

Direct store is evaluated on gem5-gpu [32], a simulator that models tightly-coupled CPU-GPU systems. It merges two popular simulators: gem5 [5] and GPGPU-Sim [3]. Our baseline implementation in gem5-gpu is based on the configuration listed in Table 1.

Table 1: System configuration
CPU    | Cores: 1 | L1D cache: 64KB, 2 ways | L1I cache: 32KB, 2 ways | L2 cache: 2MB, 8 ways
GPU    | SMs: 16 (32 lanes per SM @ 1.4 GHz) | L1 cache: 16KB + 48KB shared memory, 4 ways | L2 cache: 2MB, 16 ways, 4 slices
MEMORY | 2GB, 1 channel, 2 ranks, 8 banks @ 1 GHz

All evaluations are performed using gem5-gpu in syscall-emulation mode. We use the Ruby memory system instead of the classic memory system. We simulate an integrated CPU-GPU system comprising a CPU core and a GPU with 16 Fermi-like SMs. The L1 data and instruction caches and the L2 cache are private to the CPU core. For the GPU, each L1 cache is private to an SM, while the L2 cache is shared by all SMs. The cache line size is 128 bytes across the whole system. The system has 2GB of total memory. The simulator provides full-system coherence between CPU and GPU using the modified Hammer cache coherence protocol.

4.2 Applications

To show a variety of applications running on direct store, we use 4 standard benchmark suites and 4 other individual benchmarks, as shown in Table 2. The table lists each benchmark's code name, the sizes of the small and big inputs, the benchmark suite, and whether the benchmark uses the GPU's shared memory feature. Some of the benchmarks are provided in the gem5-gpu project [32], namely Rodinia [8], Parboil [41], and Pannotia [7]. In addition to the four suites, four other benchmarks are also used: bitonic sort from [12], matrix multiplication and transpose from [25], and Cholesky decomposition from [4]. All benchmarks are used to compare the performance of direct store and CCSM.

Table 2: Benchmarks
Benchmark | Abbr. | Small input | Big input | Unit | Suite | Shared Mem.
Back propagation | BP | 1536 | 10000 | nodes | Rodinia [8] | Yes
Breadth-first search | BF | 4096 | 6000 | nodes | Rodinia | No
Gaussian elimination | GA | 256x256 | 700x700 | matrix | Rodinia | Yes
Hotspot | HT | 64x64 | 512x512 | data points | Rodinia | Yes
K-means | KM | 2000, 34 features | 5000, 34 features | data points | Rodinia | Yes
lavaMD | LV | 2 | 4 | boxes | Rodinia | Yes
LU decomposition | LU | 256x256 | 512x512 | matrix | Rodinia | Yes
Nearest neighbour | NN | 10691 | 42764 | data points | Rodinia | No
Needleman-Wunsch | NW | 160x160 | 320x320 | data points | Rodinia | Yes
Pathfinder | PT | 2500 | 5000 | data points | Rodinia | Yes
Srad | SR | 256x256 | 512x512 | data points | Rodinia | Yes
Stencil | ST | 128x128x32 | 164x164x32 | matrix | Parboil [41] | Yes
Graph coloring | GC | power | delaunay-n15 | graph | Pannotia [7] | No
Floyd-Warshall | FW | 256_16384 | 512_65536 | graph | Pannotia | No
Maximal independent set | MS | power | delaunay-n13 | graph | Pannotia | No
Single-source shortest path | SP | power | delaunay-n13 | graph | Pannotia | No
Blackscholes | BL | 5000 | 10000 | data points | NVIDIA SDK | No
Vector addition | VA | 50000 | 200000 | data points | NVIDIA SDK | No
Bitonic sort | BS | 262144 | 524288 | data points | [12] | No
Matrix multiplication | MM | 256x256 | 900x900 | matrix | [25] | No
Matrix transpose | MT | 32x32 | 1600x1600 | matrix | [25] | No
Cholesky decomposition | CH | 150x150 | 600x600 | matrix | [4] | No

Our code translator assumes that the input program performs no CUDA memory copy. For the Rodinia and Parboil benchmarks in [32], the benchmark codes have already had memory copies from CPU to GPU and vice versa removed. For the other benchmarks, this has to be done manually, by converting the code to eliminate all CUDA memory copy functions and deleting all device variables, because in integrated CPU-GPU systems the GPU can access CPU data without explicitly copying it. Once the source codes contain no CUDA memory copy, they are processed by the automatic translator to produce modified source codes ready to be compiled in the usual way.

4.3 Speedup

We evaluate the effect of remote store by running all benchmarks with two different types of memory access: cache-coherent shared memory (CCSM) and the proposed direct store memory access. We report the speedup obtained by the direct store approach by normalizing its total ticks to the total ticks of CCSM (one possible formulation is sketched at the end of this subsection).

Figure 6 shows the speedup produced by the proposed scheme for small (top) and big (bottom) inputs, ignoring those benchmarks with zero percent speedup for both small and big inputs, namely GA, KM, LV, PT, SR, ST, and MS. For these benchmarks the proposed scheme brings no benefit, but it also does not hurt the application. For small inputs, of the 22 benchmarks, 5 show a speedup of over 10%, 7 show no speedup, and 10 show a modest speedup. NN, BL, VA, MM, and MT are the benchmarks with a speedup above 10%. Almost all of these benchmarks also have a large miss rate reduction, as shown in Figure 7, except for MM and MT, which instead experience a miss rate increase. Moreover, these benchmarks do not use GPU shared memory and therefore make extensive use of the GPU's cache. BP, HT, and GC have large miss rate reductions but their speedup is not high, because they use the shared memory feature and therefore do not involve the GPU's L2 cache much in their computation. The rest are benchmarks with small miss rate reductions followed by small or zero speedup. Note that the proposed approach never decreases performance, and can often provide performance increases.

Figure 6: Speedup of direct store over CCSM for small (top) and big (bottom) inputs.

Figure 6 (bottom) shows that for BP, HT, LU, NW, and FW, when the input is big, the threads are unable to hide memory latency even though the benchmarks use shared memory, and thus direct store impacts performance significantly. For NN, BL, VA, MM, and MT, the speedup for big inputs is smaller than for small inputs. This is because the input is larger than the GPU's L2 cache, and hence the miss rate reduction decreases, and the speedup with it. The extreme case occurs for MM and MT, where the speedup drops to zero following a decrease in miss rate reduction. The rightmost bars in both figures represent the geometric means of all non-zero speedups, namely 7.8% for small inputs and 5.7% for big inputs.

In summary, direct store does improve performance over CCSM. The mean performance improvements are moderate, but they are achieved with no direct intervention from the programmer other than running our code translator on an existing program. Furthermore, while we might expect the performance gains to disappear for very large inputs, direct store programs still retain a performance advantage over CCSM in this case. Finally, we see that converting programs to use direct store never hurts performance compared to CCSM. Given these observations, we conclude that direct store can be a viable replacement for CCSM that requires little programming effort, never hurts performance, and often provides performance gains.
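For concreteness, one plausible formulation of the reported speedup, assuming "total ticks" is the simulator's end-to-end tick count for each run (the exact extraction from the gem5-gpu statistics is not shown in the paper):

```c
#include <stdint.h>

/* Percent speedup of direct store (DS) over CCSM, computed by normalizing
 * the DS run's total ticks to the CCSM run's total ticks. */
double speedup_percent(uint64_t ticks_ccsm, uint64_t ticks_ds)
{
    return 100.0 * ((double)ticks_ccsm / (double)ticks_ds - 1.0);
}

/* Equal tick counts give 0%; a DS run with fewer ticks than the CCSM run
 * gives a positive speedup. */
```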
4.4 GPU's L2 Cache Miss Rate Reduction

We now attempt to break down the performance gains of direct store (DS) over CCSM. Specifically, direct store's performance gains should come from more favorable cache behavior compared to CCSM. We therefore begin by investigating the GPU's L2 miss rate under both direct store and CCSM.

Figure 7 shows the GPU's L2 cache miss rate for all benchmarks, ignoring those benchmarks with a zero L2 cache miss rate under both CCSM and direct store, namely GA, LU, and BS. The red bar is the miss rate for CCSM, while the blue bar is for direct store.

Figure 7: GPU L2 miss rate for (top) small and (bottom) big inputs.

Figure 7 (top) depicts the GPU's L2 cache miss rate for small inputs. We can observe that in most cases the GPU's L2 cache miss rate under CCSM is higher than under direct store. The benchmarks whose miss rate is reduced are BP, BF, HT, KM, LU, NN, NW, SR, GC, FW, MS, SP, BL, VA, and CH. This is as we expect; since data is directed to the GPU's L2 cache when the CPU stores it, the GPU does not need to fetch it again, and this reduces the miss rate.

Nevertheless, some cases need further explanation. For MM and MT, the miss rate under direct store is higher than under CCSM. For both benchmarks, the total misses actually decrease when moving from CCSM to direct store; however, the GPU L2 accesses are also reduced, and this reduction is even larger than the reduction in total misses. The rest of the benchmarks (GA, LV, PT, ST, and BS) show no difference in miss rate between CCSM and direct store. For GA, LV, ST, and BS, the total misses decrease significantly; however, since the total number of accesses to the GPU's L2 cache is extremely large compared to the total misses, the miss rate remains the same. For PT, there is no significant change in the GPU's L2 cache miss rate when direct store is implemented: the total misses and the total accesses to the GPU's L2 cache do not change. This is because in this benchmark the CPU does not store any data that is later used by the GPU.

Figure 7 (bottom) depicts the GPU's L2 cache miss rate for big inputs. The miss rate reduction from CCSM to direct store is similar to the small-input case. The miss rate is reduced for BP, BF, HT, KM, LU, NN, NW, GC, MS, SP, BL, VA, and CH, while the rest show no difference. For SR and FW, when the input size is big, the reduction is canceled out because cache accesses actually increase as the input grows, while the total miss reduction is relatively small compared to the increase in cache accesses. Note again that the rightmost bars in both figures represent the geometric means of all L2 miss rates, namely 9.3% for CCSM versus 7.3% for direct store for small inputs, and 12.5% for CCSM versus 11.1% for direct store for big inputs.

This data confirms that the proposed approach has a positive impact on cache behavior. Specifically, under the direct store approach we expect data to reside in the GPU's L2 cache on the first access, rather than being pulled to the GPU after a first miss.
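The cases above follow directly from how the miss rate is defined; writing it out makes the MM/MT and GA/LV/ST/BS observations explicit (this is just the standard definition, not an additional result):

```latex
\text{L2 miss rate} \;=\; \frac{\text{L2 misses}}{\text{L2 accesses}}
```

A scheme can therefore lower the absolute miss count yet leave the rate unchanged (when misses and accesses shrink proportionally, or accesses dwarf misses, as for GA, LV, ST, and BS) or even raise it (when accesses shrink faster than misses, as for MM and MT).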
4.5 Compulsory Misses

We now dive further into the cache performance. Our hypothesis about direct store is that it should result in more hits on initial accesses. To evaluate this hypothesis, we examine the compulsory misses under CCSM and direct store.

Compulsory misses are misses that must occur because, when an application is executed, it always starts with an empty cache. We measure the number of compulsory misses for each benchmark to show the potential reduction after implementing our proposed scheme. We also report the total miss reduction to show how much of this potential we actually realize. The compulsory miss counts and total miss reductions are shown in Table 3. The first column lists the benchmarks; the next two columns contain the compulsory misses for each benchmark running on the baseline system and the associated total miss reduction for small inputs; the last two columns contain the same quantities for big inputs. The total miss reduction is obtained by subtracting the compulsory misses of a benchmark running on the baseline system from its total misses when running on our proposed scheme.

Table 3: Compulsory misses and total miss reduction
Benchmark | Misses (Small) | Reduction (Small) | Misses (Big) | Reduction (Big)
BP | 1749 | 1637 | 11271 | 10629
BF | 1264 | 1240 | 10233 | 10222
GE | 3744 | 3723 | 14176 | 3723
HS | 401 | 256 | 14968 | 15666
KM | 9160 | 2679 | 19707 | 5908
LV | 487 | 470 | 3851 | 3834
LU | 2119 | 2049 | 8263 | 8193
NN | 1006 | 668 | 4013 | 2672
NW | 10867 | 1616 | 12006 | 7293
PT | 7901 | -1 | 15792 | 0
SR | 12321 | 1984 | 12253 | 2734
ST | 15988 | 8207 | 15238 | 8647
GC | 1404 | 1032 | 15448 | 12702
FW | 4092 | 2048 | 14052 | 8286
MS | 2093 | 1234 | 7725 | 4620
SP | 1616 | 981 | 5479 | 3392
BL | 791 | 470 | 15639 | 9379
VA | 4691 | 3126 | 15484 | 12503
BS | 8196 | 117 | 15977 | 19415
MM | 604 | 16 | 15705 | 49
MT | 70 | 47 | 13585 | 1
CH | 844 | 837 | 21103 | 3105

For small inputs, the total miss reduction is slightly lower than the compulsory miss count in most cases, the exceptions being KM, NW, PT, SR, ST, FW, BS, and MM. This means that in most cases the proposed scheme removes roughly as many misses as there are compulsory misses, as we expect. PT has a near-zero reduction since the application involves many random memory accesses. KM, SR, FW, and NW have small total miss reductions. For BS, MM, and MT, the total miss reduction is not high, but memory accesses are significantly lower. For big inputs, the total miss reduction mostly follows the small-input case, remaining slightly lower than the compulsory misses, except for HS and BS, which have slightly higher total miss reductions than compulsory misses. This shows that our proposed scheme also prevents some conflict misses that would otherwise follow the compulsory misses.

These results provide further evidence that the performance increases observed with direct store come from the proposed scheme (rather than some other external factor). Specifically, if direct store is working as intended, we should see a lower cache miss rate (which we do), and the specific cache misses that should be reduced are compulsory misses. The data in this section confirms that the miss reduction comes from removing compulsory misses. Therefore, we conclude that the direct store approach is not only reducing runtime and increasing performance, it is doing so for the stated reasons: the approach deposits data directly in the GPU cache where it will be used, reducing compulsory misses at the GPU and improving overall performance in many cases.

4.6 Stride Prefetching

The preceding sections show that our approach improves performance and that the performance gains come from the cache behavior we expect. In this section we compare our approach with another simple way to try to achieve the same performance gains.

Prefetching has a successful history of accelerating many workloads on the CPU, and our proposed scheme might seem similar to prefetching in some ways. In the following, we evaluate the use of stride prefetching on the benchmarks to compare its performance with direct store (DS). The results are shown in Figure 8. The y-axis of the charts in the figure gives the speedup in percent. The speedup is obtained by normalizing the GPU total cycles of a benchmark when it runs with prefetching to the GPU total cycles of the benchmark when it runs on the baseline system. Each benchmark is evaluated for small and big inputs.

Figure 8: Speedup achieved by stride prefetching and by direct store for small (top) and big (bottom) inputs.

Figure 8 (top) compares the speedup gained by stride prefetching and by direct store for small inputs. In most cases, stride prefetching hurts performance, by as much as -73% and by -19% on average, except for MT, which has a speedup of 17%. These results are in stark contrast with our proposed scheme, which never degrades system performance: it always gives a positive impact or, at worst, changes nothing. Figure 8 (bottom) compares the speedup gained by stride prefetching and by direct store for big inputs. These results follow the same trend as the small-input case. Moreover, stride prefetching affects system performance even more adversely, by as much as -93% and -10% on average, with no positive results, while our proposed scheme again gives a positive or zero effect. We also conduct experiments with a stride prefetcher of degree 8 to see whether increasing the degree helps; the results show the same performance as stride prefetching with degree 3. From these results, we conclude that direct store has better performance and better stability than stride prefetching in all cases.

These results show that the proposed scheme is not only performant, but also much more performant than another obvious, simple scheme that could easily be implemented. We note that these results, showing that simple prefetching mechanisms hurt performance, are consistent with prior research [37], which showed that prefetching can improve GPU performance, but only if the prefetching scheme is significantly more complicated than simple stride prefetching.
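For reference, the sketch below models a generic stride prefetcher of the kind used as the comparison point (per-PC stride detection with a configurable degree). It is our own simplified model; the exact prefetcher implementation and its simulator configuration are not detailed in the paper.

```c
#include <stdint.h>

#define PF_TABLE_SIZE 256
#define PF_DEGREE     3          /* issue up to 3 prefetches ahead, as in the comparison */

/* One entry per load PC: last address seen and the last observed stride. */
struct pf_entry { uint64_t last_addr; int64_t stride; int confident; };
static struct pf_entry pf_table[PF_TABLE_SIZE];

/* Called on every demand load; fills prefetch_addrs with up to PF_DEGREE
 * predicted addresses and returns how many were generated. */
int stride_prefetch(uint64_t pc, uint64_t addr, uint64_t *prefetch_addrs)
{
    struct pf_entry *e = &pf_table[pc % PF_TABLE_SIZE];
    int64_t stride = (int64_t)(addr - e->last_addr);
    int n = 0;

    if (e->confident && stride == e->stride && stride != 0) {
        for (n = 0; n < PF_DEGREE; n++)          /* same stride seen repeatedly: */
            prefetch_addrs[n] = addr + (uint64_t)((n + 1) * stride);
    }
    e->confident = (stride == e->stride);        /* update the predictor state */
    e->stride    = stride;
    e->last_addr = addr;
    return n;
}
```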

−33 −34 Speedup (%) −50 −43 −44 Acknowledgements Prefetching DS This research is supported by NSF (CCF-1439156, CNS- −66 1526304, CCF-1823032, CNS-1764039). Additional support −73 BP BF GA HT KM LV LU NN NW PT SR ST GC FW MS SP BL VA BS MM MT CH comes from the Proteus project under the DARPA BRASS

16 program and a DOE Early Career award. It is also supported 13 12 10 7 5 5 4 6 6 in part by hibah penelitian dosen Departemen Ilmu Komputer 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 1 0 dan Elektronika Universitas Gadjah Mada dana BPPTNBH −1 −1 −1 −1 −3 −3 −3 −4 −7 −5 −5 tahun anggaran 2019. −16 −16 −16 −20

−30 6. REFERENCES [1] “The gem5 simulator documentation: MOESI hammer,” 2013. −50 [Online]. Available: http://www.m5sim.org/MOESI_hammer

Speedup (%) [2] N. Agarwal, D. Nellans, E. Ebrahimi, T. F. Wenisch, J. Danskin, and S. W. Keckler, “Selective GPU caches to eliminate CPU-GPU HW cache coherence,” in 2016 IEEE International Symposium on High Prefetching DS Performance (HPCA), March 2016, pp. −93 494–506. BP BF GA HT KM LV LU NN NW PT SR ST GC FW MS SP BL VA BS MM MT CH [3] A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt, “Analyzing CUDA workloads using a detailed GPU simulator,” in 2009 Figure 8: Speedup achieved by stride prefetching and by IEEE International Symposium on Performance Analysis of Systems direct store for small (top) and big (bottom) inputs. and Software, April 2009, pp. 163–174. [4] M. Bhatt, “Parallel implementation of Cholesky decomposition using CUDA ,” 2017. [Online]. Available: cases compared to stride prefetching. https://github.com/bhattmansi/Implementation-of-Cholesky- Decomposition-in-GPU-using-CUDA These results show that the proposed scheme is not only [5] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, performant, but also much more performant than one other A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti, R. Sen, obvious simple scheme that could be easily implemented. We K. Sewell, M. Shoaib, N. Vaish, M. D. Hill, and D. A. Wood, “The note that these showing that simple prefetching mechanisms gem5 simulator,” SIGARCH Comput. Archit. News, vol. 39, no. 2, pp. 1–7, Aug. 2011. [Online]. Available: hurt are consistent with prior research [37], which showed http://doi.acm.org/10.1145/2024716.2024718 that prefetching can provide GPU performance, but only if [6] R. Buchty, V. Heuveline, W. Karl, and J.-P. Weiss, “A survey on the prefetching scheme is significantly more complicated than hardware-aware and on multicore simple stride prefetching. processors and accelerators,” Concurr. Comput. : Pract. Exper., vol. 24, no. 7, pp. 663–675, May 2012. [Online]. Available: 4.7 Hardware Overhead http://dx.doi.org/10.1002/cpe.1904 [7] S. Che, B. M. Beckmann, S. K. Reinhardt, and K. Skadron, “Pannotia: To apply direct store in integrated CPU-GPU systems, a Understanding irregular GPGPU graph applications,” in 2013 IEEE small hardware overhead is incurred (all performance over- International Symposium on Workload Characterization (IISWC), Sep. head is included in the above numbers). We add a network 2013, pp. 185–195. that directly connects the CPU’s L1 cache and GPU’s L2 [8] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S. Lee, and cache and a logic in the TLB to detect the incoming remotely K. Skadron, “Rodinia: A benchmark suite for heterogeneous computing,” in 2009 IEEE International Symposium on Workload stored data (Section3). The logic works by comparing store Characterization (IISWC), Oct 2009, pp. 44–54. instructions’ high-order addresses to the baseline address. [9] Y.-k. Choi, J. Cong, Z. Fang, Y. Hao, G. Reinman, and P. Wei, “A This small overhead can be done by wiring to a logic gate. quantitative analysis on of modern CPU-FPGA platforms,” in Proceedings of the 53rd Annual Design Automation Conference, ser. DAC ’16. New York, NY, USA: ACM, 2016, pp. 5. CONCLUSION 109:1–109:6. [Online]. Available: We have proposed direct store, a new memory model for http://doi.acm.org/10.1145/2897937.2897972 integrated CPU-GPU systems. In direct store, when a CPU [10] E. S. Chung, P. A. Milder, J. C. Hoe, and K. 
Mai, “Single-chip heterogeneous computing: Does the future include custom logic, stores data that will be consumed by the GPU, the memory FPGAs, and GPGPUs?” in (MICRO), 2010 43rd system automatically deposits the data in the GPU’s LLC. Annual IEEE/ACM International Symposium on. IEEE, 2010, pp. Our experimental results show that direct store reduces 225–236. the GPU’s LLC miss rate and increases performance at the [11] M. Dashti and A. Fedorova, “Analyzing memory management cost of a small hardware overhead for adding interconnection methods on integrated CPU-GPU systems,” SIGPLAN Not., vol. 52, no. 9, pp. 59–69, Jun. 2017. [Online]. Available: network and a new TLB logic. Additionally, direct store is http://doi.acm.org/10.1145/3156685.3092256 much simpler than CCSM and needs less hardware support.

[12] M. Endler, "Parallel bitonic sort using CUDA," 2011. [Online]. Available: https://gist.github.com/mre/1392067

[13] R. B. Gillett, "Memory channel network for PCI," IEEE Micro, vol. 16, no. 1, pp. 12–18, Feb 1996.

[14] B. A. Hechtman and D. J. Sorin, "Evaluating cache coherent shared virtual memory for heterogeneous multicore chips," in 2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), April 2013, pp. 118–119.

[15] J. Hestness, S. W. Keckler, and D. A. Wood, "GPU computing pipeline inefficiencies and optimization opportunities in heterogeneous CPU-GPU processors," in 2015 IEEE International Symposium on Workload Characterization, Oct 2015, pp. 87–97.

[16] H. Hoffmann, D. Wentzlaff, and A. Agarwal, "Remote store programming," in High Performance Embedded Architectures and Compilers, Y. N. Patt, P. Foglia, E. Duesterwald, P. Faraboschi, and X. Martorell, Eds. Springer Berlin Heidelberg, 2010, pp. 3–17.

[17] D. R. Hower, B. A. Hechtman, B. M. Beckmann, B. R. Gaster, M. D. Hill, S. K. Reinhardt, and D. A. Wood, "Heterogeneous-race-free memory models," in Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS '14. New York, NY, USA: ACM, 2014, pp. 427–440. [Online]. Available: http://doi.acm.org/10.1145/2541940.2541981

[18] J. Y. Jang, H. Wang, E. Kwon, J. W. Lee, and N. S. Kim, "Workload-aware optimal power allocation on single-chip heterogeneous processors," IEEE Transactions on Parallel and Distributed Systems, vol. 27, no. 6, pp. 1838–1851, June 2016.

[19] C. N. Keltcher, K. J. McGrath, A. Ahmed, and P. Conway, "The AMD Opteron processor for multiprocessor servers," IEEE Micro, vol. 23, no. 2, pp. 66–76, 2003.

[20] D. Kim and D. Yeung, "Design and evaluation of compiler algorithms for pre-execution," in Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS X, 2002, pp. 159–170. [Online]. Available: http://doi.acm.org/10.1145/605397.605415

[21] G. Kurian, O. Khan, and S. Devadas, "The locality-aware adaptive cache coherence protocol," in The 40th Annual International Symposium on Computer Architecture, ISCA'13, Tel-Aviv, Israel, June 23-27, 2013, 2013, pp. 523–534.

[22] J. Lee, N. B. Lakshminarayana, H. Kim, and R. Vuduc, "Many-thread aware prefetching mechanisms for GPGPU applications," in 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture, Dec 2010, pp. 213–224.

[23] J. Lee, M. Samadi, Y. Park, and S. Mahlke, "Transparent CPU-GPU collaboration for data-parallel kernels on heterogeneous systems," in Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques, ser. PACT '13. Piscataway, NJ, USA: IEEE Press, 2013, pp. 245–256. [Online]. Available: http://dl.acm.org/citation.cfm?id=2523721.2523756

[24] W. Liu, B. Lewis, X. Zhou, H. Chen, Y. Gao, S. Yan, S. Luo, and B. Saha, "A balanced programming model for emerging heterogeneous multicore systems," in Proceedings of the 2nd USENIX Conference on Hot Topics in Parallelism, ser. HotPar'10. Berkeley, CA, USA: USENIX Association, 2010, pp. 1–6. [Online]. Available: http://dl.acm.org/citation.cfm?id=1863086.1863089

[25] Z. Liu, "Matrix multiplication in CUDA," 2018. [Online]. Available: https://github.com/lzhengchun/matrix-cuda

[26] M. Lodde, J. Flich, and M. E. Acacio, "Heterogeneous NoC design for efficient broadcast-based coherence protocol support," in 2012 IEEE/ACM Sixth International Symposium on Networks-on-Chip, May 2012, pp. 59–66.

[27] C.-K. Luk, "Tolerating memory latency through software-controlled pre-execution in simultaneous multithreading processors," in Proceedings 28th Annual International Symposium on Computer Architecture, June 2001, pp. 40–51.

[28] D. Lustig and M. Martonosi, "Reducing GPU offload latency via fine-grained CPU-GPU synchronization," in Proceedings of the 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA), ser. HPCA '13. Washington, DC, USA: IEEE Computer Society, 2013, pp. 354–365. [Online]. Available: http://dx.doi.org/10.1109/HPCA.2013.6522332

[29] S. Mittal and J. S. Vetter, "A survey of CPU-GPU heterogeneous computing techniques," ACM Comput. Surv., vol. 47, no. 4, pp. 69:1–69:35, Jul. 2015. [Online]. Available: http://doi.acm.org/10.1145/2788396

[30] S. Pei, M.-S. Kim, J.-L. Gaudiot, and N. Xiong, "Fusion coherence: Scalable cache coherence for heterogeneous kilo-core system," in Advanced Computer Architecture, J. Wu, H. Chen, and X. Wang, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2014, pp. 1–15.

[31] B. Pichai, L. Hsu, and A. Bhattacharjee, "Architectural support for address translation on GPUs: Designing memory management units for CPU/GPUs with unified address spaces," in Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS '14. New York, NY, USA: ACM, 2014, pp. 743–758. [Online]. Available: http://doi.acm.org/10.1145/2541940.2541942

[32] J. Power, J. Hestness, M. S. Orr, M. D. Hill, and D. A. Wood, "gem5-gpu: A heterogeneous CPU-GPU simulator," IEEE Computer Architecture Letters, vol. 14, no. 1, pp. 34–36, Jan 2015.

[33] J. Power, M. D. Hill, and D. A. Wood, "Supporting x86-64 address translation for 100s of GPU lanes," in 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA), Feb 2014, pp. 568–578.

[34] J. Power, A. Basu, J. Gu, S. Puthoor, B. M. Beckmann, M. D. Hill, S. K. Reinhardt, and D. A. Wood, "Heterogeneous system coherence for integrated CPU-GPU systems," in Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture. ACM, 2013, pp. 457–467.

[35] B. Saha, X. Zhou, H. Chen, Y. Gao, S. Yan, M. Rajagopalan, J. Fang, P. Zhang, R. Ronen, and A. Mendelson, "Programming model for a heterogeneous x86 platform," in Proceedings of the 30th ACM SIGPLAN Conference on Programming Language Design and Implementation, ser. PLDI '09. New York, NY, USA: ACM, 2009, pp. 431–440. [Online]. Available: http://doi.acm.org/10.1145/1542476.1542525

[36] M. J. Schulte, M. Ignatowski, G. H. Loh, B. M. Beckmann, W. C. Brantley, S. Gurumurthi, N. Jayasena, I. Paul, S. K. Reinhardt, and G. Rodgers, "Achieving exascale capabilities through heterogeneous computing," IEEE Micro, vol. 35, no. 4, pp. 26–36, July 2015.

[37] A. Sethia, G. Dasika, M. Samadi, and S. Mahlke, "APOGEE: Adaptive prefetching on GPUs for energy efficiency," in Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques, Sep. 2013, pp. 73–82.

[38] Y. S. Shao, S. Xi, V. Srinivasan, G.-Y. Wei, and D. Brooks, "Toward cache-friendly hardware accelerators," in HPCA Sensors and Cloud Architectures Workshop (SCAW), 2015, pp. 1–6.

[39] M. Silberstein, B. Ford, I. Keidar, and E. Witchel, "GPUfs: Integrating a file system with GPUs," in Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS '13. New York, NY, USA: ACM, 2013, pp. 485–498. [Online]. Available: http://doi.acm.org/10.1145/2451116.2451169

[40] K. L. Spafford, J. S. Meredith, S. Lee, D. Li, P. C. Roth, and J. S. Vetter, "The tradeoffs of fused memory hierarchies in heterogeneous computing architectures," in Proceedings of the 9th Conference on Computing Frontiers, ser. CF '12. New York, NY, USA: ACM, 2012, pp. 103–112. [Online]. Available: http://doi.acm.org/10.1145/2212908.2212924

[41] J. A. Stratton, C. Rodrigues, I.-J. Sung, N. Obeid, L.-W. Chang, N. Anssari, G. D. Liu, and W.-m. W. Hwu, "Parboil: A revised benchmark suite for scientific and commercial throughput computing," University of Illinois at Urbana-Champaign, Center for Reliable and High-Performance Computing, Tech. Rep., 2012.

[42] T. Suh, D. M. Blough, and H.-H. S. Lee, "Supporting cache coherence in heterogeneous multiprocessor systems," in Proceedings Design, Automation and Test in Europe Conference and Exhibition, vol. 2, Feb 2004, pp. 1150–1155.

[43] T. Suh, H.-H. S. Lee, and D. M. Blough, "Integrating cache coherence protocols for heterogeneous multiprocessor systems, part 2," IEEE Micro, vol. 24, no. 5, pp. 70–78, Sep. 2004.

[44] T. Suh, D. Kim, and H.-H. S. Lee, "Cache coherence support for non-shared bus architecture on heterogeneous MPSoCs," in Proceedings of the 42nd Design Automation Conference, June 2005, pp. 553–558.

[45] J. van Eijndhoven, J. Hoogerbrugge, M. Jayram, P. Stravers, and A. Terechko, Cache-Coherent Heterogeneous Multiprocessing as Basis for Streaming Applications. Dordrecht: Springer Netherlands, 2005, pp. 61–80. [Online]. Available: https://doi.org/10.1007/1-4020-3454-7_3

[46] J. Vesely, A. Basu, M. Oskin, G. H. Loh, and A. Bhattacharjee, "Observations and opportunities in architecting shared virtual memory for heterogeneous systems," in 2016 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), April 2016, pp. 161–171.

[47] H. Wang, V. Sathish, R. Singh, M. J. Schulte, and N. S. Kim, "Workload and power budget partitioning for single-chip heterogeneous processors," in Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques. ACM, 2012, pp. 401–410.

[48] H. Wang, R. Singh, M. J. Schulte, and N. S. Kim, "Memory scheduling towards high-throughput cooperative heterogeneous computing," in Proceedings of the 23rd International Conference on Parallel Architectures and Compilation. ACM, 2014, pp. 331–342.

[49] Y. Yang, P. Xiang, M. Mantor, and H. Zhou, "CPU-assisted GPGPU on fused CPU-GPU architectures," in IEEE International Symposium on High-Performance Comp Architecture, Feb 2012, pp. 1–12.

[50] F. Zhang, J. Zhai, B. He, S. Zhang, and W. Chen, "Understanding co-running behaviors on integrated CPU/GPU architectures," IEEE Transactions on Parallel and Distributed Systems, vol. 28, no. 3, pp. 905–918, March 2017.

[51] X. Zhou, H. Chen, S. Luo, Y. Gao, S. Yan, W. Liu, B. Lewis, and B. Saha, "A case for software managed coherence in many-core processors," in Proceedings of the 2nd USENIX Conference on Hot Topics in Parallelism (HotPar'10), Jun. 2010, pp. 1–6.

[52] C. Zilles and G. Sohi, "Execution-based prediction using speculative slices," in Proceedings of the 28th Annual International Symposium on Computer Architecture, ser. ISCA '01, 2001, pp. 2–13. [Online]. Available: http://doi.acm.org/10.1145/379240.379246