Evaluating the Effect of Last-Level Cache Sharing on Integrated GPU-CPU Systems with Heterogeneous Applications

Víctor García 1,2   Juan Gómez-Luna 3   Thomas Grass 1,2   Alejandro Rico 4   Eduard Ayguadé 1,2   Antonio J. Peña 2
1 Universitat Politècnica de Catalunya   2 Barcelona Supercomputing Center   3 Universidad de Córdoba   4 ARM Inc.

Abstract—Heterogeneous systems are ubiquitous in the field of High-Performance Computing (HPC). Graphics processing units (GPUs) are widely used as accelerators for their enormous computing potential and energy efficiency; furthermore, on-die integration of GPUs and general-purpose cores (CPUs) enables unified virtual address spaces and seamless sharing of data structures, improving programmability and softening the entry barrier for heterogeneous programming. Although on-die GPU integration seems to be the trend among the major microprocessor manufacturers, there are still many open questions regarding the architectural design of these systems. This paper is a step forward towards understanding the effect of on-chip resource sharing between GPU and CPU cores, and in particular, of the impact of last-level cache (LLC) sharing in heterogeneous computations. To this end, we analyze the behavior of a variety of heterogeneous GPU-CPU benchmarks on different cache configurations. We perform an evaluation of the popular Rodinia benchmark suite modified to leverage the unified memory address space. We find such GPGPU workloads to be mostly insensitive to changes in the cache hierarchy due to the limited interaction and data sharing between GPU and CPU. We then evaluate a set of heterogeneous benchmarks specifically designed to take advantage of the fine-grained data sharing and low-overhead synchronization between GPU and CPU cores that these integrated architectures enable. We show how these algorithms are more sensitive to the design of the cache hierarchy, and find that when GPU and CPU share the LLC, execution times are reduced by 25% on average, and energy-to-solution by over 20% for all benchmarks.

I. INTRODUCTION

Resource sharing within heterogeneous GPU-CPU systems has gotten some attention in the literature over the last few years. The ubiquity of these systems integrating GPU and CPU cores on the same die has prompted researchers to analyze the effect of sharing resources such as the last-level cache (LLC) [1], [2], the memory controller [3]–[5], or the network-on-chip [6]. Most of these works start with the premise that GPU and CPU applications exhibit different characteristics (spatial vs. temporal locality) and have different requirements (high bandwidth vs. low latency), and therefore careful management of the shared resources is necessary to guarantee fairness and maximize performance. To evaluate their proposals, the authors use multiprogrammed workloads composed of a mix of GPU and CPU applications running concurrently.

The tight integration of CPU and GPU cores enables a unified virtual address space and hardware-managed coherence, increasing programmer productivity by eliminating the need for explicit memory management. Devices can seamlessly share data structures and perform fine-grained synchronization via atomic operations. In this manner, algorithms can be divided into smaller steps that can be executed on the device they are best suited for (i.e., data-parallel regions on the GPU or serial/low-parallelism regions on the CPU). Using multiprogrammed workloads to evaluate resource sharing can give insights into the challenges of GPU-CPU integration, but we believe it is not representative of future HPC applications that will fully leverage the capabilities of such systems.

This paper analyzes the impact of LLC sharing in heterogeneous computations where GPU and CPU collaborate to reach a solution, leveraging the unified virtual address space and low-overhead synchronization. Our intent is to present guidelines for the cache configuration of future integrated architectures, as well as for applications to best benefit from these. We first evaluate a set of kernels from the Rodinia benchmark suite modified to use the unified virtual address space, and show that on average sharing the LLC can reduce execution time by 13%, mostly due to better resource utilization. Since these benchmarks do not take full advantage of the characteristics of integrated heterogeneous systems, we then analyze a set of heterogeneous benchmarks that use cross-device atomics to synchronize and work together on shared data. We evaluate the effect of LLC sharing on these workloads and find it reduces execution times on average by 25% and up to 53% on the most sensitive benchmark. We also find reductions in energy-to-solution by more than 30% in 9 out of 11 benchmarks. To the best of our knowledge, this is the first effort analyzing the effect of cache sharing in integrated GPU-CPU systems with strongly cooperative heterogeneous applications. The key conclusions from this study are that LLC sharing:

• Has a limited, albeit positive, impact on traditional GPGPU workloads such as those from the Rodinia suite.
• Allows for faster communication between GPU and CPU cores and can greatly improve the performance of computations that rely on fine-grained synchronization.
• Can reduce memory access latencies by keeping shared data closer and avoiding coherency traffic.
• Improves resource utilization by allowing GPU and CPU cores to fully use the available cache.

The rest of the paper is organized as follows. Section II gives some background on heterogeneous computing and presents the motivation behind this work. Section III reviews related work. Section IV discusses the methodology we have used and the benchmarks analyzed. In Section V we present and discuss the experimental results obtained. Section VI presents the conclusions we extract from this work.

II. BACKGROUND AND MOTIVATION

This section serves as an overview of heterogeneous architectures and the memory consistency and cache coherence models followed by GPUs and CPUs. We describe the new computational paradigm these systems make possible and present the motivation that led to the evaluation we present in this work.

A. Heterogeneous Computing

GPUs and other accelerators have traditionally been discrete devices connected to a host machine by a high-bandwidth interconnect such as PCI Express (PCIe). Host and device have different memory address spaces, and data must be copied back and forth between the two memory pools explicitly by the programmer. In this traditional model, the host allocates and copies the data to the device, launches a computation kernel, waits for its completion and copies back the result. This offloading model leverages the characteristics of the GPU to exploit data-parallel regions, but overlooks the capabilities of the general-purpose CPU. Deeply pipelined out-of-order cores can exploit high instruction-level parallelism (ILP), and are better suited for regions of code where GPUs typically struggle, such as those with high control flow or memory divergence.

On-die integration of GPU and CPU cores provides multiple benefits: a shared memory pool avoids explicit data movement and duplication; communication through the network-on-chip instead of a dedicated high-bandwidth interconnect (PCIe) saves energy and decreases latency; reduced latency enables efficient fine-grained data sharing and synchronization, allowing programmers to distribute the computation between devices, each contributing by executing the code regions they are more efficient on. For further details we refer the reader to the literature, where the benefits of integrated heterogeneous architectures have already been explored [7]–[13].

In this work we focus on architectures used in the field of HPC. Most HPC systems are built out of hundreds or thousands of nodes, each composed of general-purpose cores and specialized accelerators. In many cases these accelerators are GPUs, and the computation follows the master-worker scheme described above. While technically it can be considered heterogeneous computing, we believe the trend will continue to be tighter integration and better utilization of all computing resources. Figure 1 shows an example of a heterogeneous system integrating CPU cores and GPU Streaming Multiprocessors (SMs) on the same die. Two different LLC configurations are shown. Configuration a) has separate L3 caches for GPU and CPU; memory requests from one can only go to the other through the directory. Configuration b) has a unified L3 cache that both GPU and CPU can access directly and in equal condition. We evaluate the effect that sharing the LLC as in configuration b) has for heterogeneous computations where GPU and CPU collaborate and share data.

Fig. 1. Heterogeneous architecture with: a) Private LLCs b) Shared LLC.

B. Heterogeneous Programming Models and Architectures

Heterogeneous System Architecture (HSA) [14], [15] is a multi-company effort led by companies such as AMD, ARM and Samsung to create an open standard for heterogeneous computing. The HSA specification allows for integrated GPU-CPU systems sharing data in a unified virtual memory space and synchronizing via locks and barriers like traditional multi-core processors. The OpenCL programming model [16] offers support for Shared Virtual Memory (SVM) since the 2.0 specification. On a system supporting SVM features, the same pointer can be used indistinctly by the CPU and GPU, and coherence is maintained by the hardware as in Symmetric Multi-core Processors (SMP). The OpenCL 2.0 specification provides a series of atomic operations that can be used to guarantee race-free code when sharing data through SVM [16]. These atomic operations allow for fine-grained synchronization among compute units, opening the door for heterogeneous applications that work on shared data structures and coordinate via low-overhead synchronization.

Unified Virtual Addressing (UVA) was introduced in NVIDIA devices with CUDA 4 [17]. UVA allows the GPU to directly access pinned host memory through the high-bandwidth interconnect. This zero-copy memory allows application programmers to use the same pointers in the device and the host, improving programmability. However, on systems with discrete GPUs, the interconnect (usually PCI Express) is optimized for bandwidth and not latency, and therefore it is seldom useful to rely on zero-copy memory. Integrated systems such as the Tegra X1 [18] do not suffer from this problem, and zero-copy memory can be used to avoid data movement and replication. With CUDA 6 NVIDIA introduced Unified Memory (UM). UM builds on top of UVA to further simplify the task of GPU programming. On systems with discrete GPUs, UM has the CUDA driver move data on-demand to and from the device, removing the burden from the programmer. UM is widely believed to be the next step towards a fully unified coherent memory space between CPUs and GPU.

All the major vendors currently provide integrated heterogeneous systems fully supporting the capabilities described above. AMD has a number of Accelerated Processing Units (APUs) in the market with full system coherence between GPU and CPU, and is expected to continue the push for heterogeneous computing. Intel began integrating GPUs on-chip with their Sandy Bridge processors. Both companies' current products tightly integrate GPU and CPU and support the Shared Virtual Memory (SVM) features specified in OpenCL 2.0 [19], [20].
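As an illustration of the shared-pointer programming style these features enable, the following minimal CUDA sketch allocates one mapped (zero-copy) host buffer that the CPU and the GPU access through the same data, with no explicit copies. It is our own example, not code from the cited products or benchmarks; the kernel name and sizes are illustrative assumptions.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *data, int n, float f) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= f;                 // GPU writes directly to host-visible memory
}

int main() {
    cudaSetDeviceFlags(cudaDeviceMapHost);   // allow mapping pinned host memory
    const int n = 1 << 20;

    float *h_data;                           // single allocation, shared by CPU and GPU
    cudaHostAlloc((void **)&h_data, n * sizeof(float), cudaHostAllocMapped);
    for (int i = 0; i < n; ++i) h_data[i] = 1.0f;   // CPU initializes in place

    float *d_data;                           // device alias of the same buffer, no copy
    cudaHostGetDevicePointer((void **)&d_data, h_data, 0);
    scale<<<(n + 255) / 256, 256>>>(d_data, n, 2.0f);
    cudaDeviceSynchronize();

    printf("%f\n", h_data[0]);               // CPU reads the result without cudaMemcpy
    cudaFreeHost(h_data);
    return 0;
}
```

On an integrated, coherent system this style avoids the data movement and replication discussed above; on a discrete GPU the same code would pay PCIe latency on every access.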

NVIDIA integrates general-purpose ARM cores with their graphics units in their Tegra line of products [18]. Mobile and embedded chip vendors have been designing integrated systems-on-chip (SoCs) for years now, and most current mobile chips integrate on the same die one or more general-purpose CPU cores and a GPU [21], [22].

C. Memory Consistency and Cache Coherence

Memory consistency models guarantee a specified ordering of all load and store instructions. In general terms, the semantics of a strict consistency model simplify programmability at the cost of performance. Relaxed consistency models allow compilers and hardware to perform memory reordering, increasing performance. This complicates the task of the programmer, since memory instructions may need to be protected with fences or operated on with atomics.

The x86 ISA follows a variant of Sequential Consistency (SC) [23] called Total Store Order (TSO) [24]. In this strict model, loads following a store (in program order) can be executed before the store if they are to a different memory address. Although there is not much public information describing the memory consistency models followed by GPUs from the major vendors, they have been largely inferred to be relaxed models. One such model is Release Consistency (RC) [25]. RC enables many memory optimizations that maximize throughput, but is strict enough to allow programmers to reason about data race conditions. RC is the consistency model defined in the HSA standard [14], and it is followed in GPUs by vendors such as ARM [26] and AMD [27].

Cache coherence protocols are pivotal to maintaining memory consistency within a system. They guarantee that all sharers of a datum always obtain the latest value written and not stale data. Regardless of the protocol itself (i.e., MESI, MOESI, etc.), x86-based SMPs follow the Read For Ownership (RFO) coherence model. In an RFO machine, cores must obtain a block in an exclusive state before writing to it. This scheme is effective for workloads that exhibit temporal locality and data reuse, where the cost of exclusively requesting blocks and the associated invalidations is amortized over time. GPUs have traditionally exhibited a different memory access behavior, streaming through data with little data reuse. In addition, the high memory traffic generated by the large number of threads running concurrently exerts a high pressure on the memory subsystem, and any additional coherence traffic would only aggravate the problem. Because of this, GPUs implement very simple coherence mechanisms with private write-through write-combining L1 caches that can contain stale data [17], [28].

On the other hand, recent work shows that the choice of consistency model minimally impacts the performance of GPUs [27]. While stricter consistency models and system coherence do not come for free, researchers are already working on solutions to the challenges faced [29]. We believe integrated systems will change the way we understand heterogeneous programming and change the characteristics of heterogeneous workloads. Stricter consistency models across a heterogeneous system will improve programmability and allow programmers to maintain the memory semantics they are used to on traditional SMPs. Therefore, in this work we perform our evaluation of heterogeneous computations on a system that implements a TSO consistency model with RFO coherence across all computing elements. For future work we plan to also evaluate different consistency models and measure their impact.

III. RELATED WORK

Lee and Kim analyzed the impact of LLC sharing between GPU and CPU [1]. They find that the multi-threaded nature of GPUs allows them to hide large off-chip latencies by switching to different threads on memory stalls. In addition, the data-streaming nature of GPU workloads shows little data reuse. Therefore they conclude that caching is barely useful for such workloads, and argue that cache management policies in heterogeneous systems should take this into consideration. They propose TAP, a cache management policy that detects when caching is beneficial for the GPU application, and favors CPU usage of the LLC when it is not. Although the policy could be useful even on the heterogeneous computations we are evaluating, their premise of cache insensitivity on GPUs does not hold for computations where fine-grained synchronization is performed between compute units, as we will show in our evaluation. Mekkat et al. build on the same premise [2]. They use set dueling [30] to measure CPU and GPU sensitivity to caching during time intervals. With this information, they dynamically set a thread-level parallelism (TLP) threshold for each interval. The threshold determines after what amount of TLP the GPU's memory requests start bypassing the LLC. The goal again is to prevent the GPU from taking over most of the LLC space and depriving the cache-sensitive CPU of it.

Other works have explored the challenges of resource sharing within GPU-CPU systems. Ausavarungnirun et al. focus their study on the memory controller [4]. They find the high memory traffic generated by the GPU can interfere with the requests from the CPU, violating fairness and reducing performance. They propose a new application-aware memory scheduling scheme that can efficiently serve both the bursty, bandwidth-intensive GPU workloads and the time-sensitive CPU requests. Kayiran et al. consider the effects on the network-on-chip and memory controller [5]. They monitor memory system congestion and, if necessary, limit the amount of concurrency the GPU is allowed. By reducing the amount of active warps in the GPU, they are able to improve CPU performance in the presence of GPU-CPU interference.

All these works analyze resource sharing within integrated GPU-CPU systems, but they perform their evaluation on multiprogrammed workloads where GPU and CPU are executing different, unrelated benchmarks. This methodology can shed light on some of the problems associated with resource sharing in heterogeneous architectures, but it is not able to provide any insight about the effect such sharing has in heterogeneous computations where GPU and CPU cores collaborate and share data. The goal of our work is to analyze how these heterogeneous algorithms are affected by sharing the LLC.

IV. METHODOLOGY

In this section we present the methodology we follow in our work. First, we discuss how we simulate our different target hardware configurations. Next, we introduce the set of benchmarks we leverage for our evaluation in Section V.

A. Hardware Simulation

We use gem5-gpu [31] to simulate an integrated heterogeneous GPU-CPU system. gem5-gpu is a cycle-level simulator that merges gem5 [32] and GPGPU-Sim [33]. We use gem5's full-system mode running the Linux operating system. The simulator provides full system coherence between GPU and CPU using the MESI coherence protocol and follows the TSO consistency model. For more details regarding the rationale behind our choice of memory model, see Section II-C. Table I lists the configuration parameters of the system. We simulate a four-core CPU and an integrated GPU composed of four Fermi-like Streaming Multiprocessors (SMs) grouped in two clusters of two. Considering that the NVIDIA Tegra X1 is composed of two SMs [18], this configuration is our best guess as to how the next generation of heterogeneous systems will be.

TABLE I: SIMULATION PARAMETERS

CPU
- Cores: 4 @ 2 GHz
- L1D Cache: 64 kB - 4-way - 1 ns lat.
- L1I Cache: 32 kB - 4-way - 1 ns lat.
- L2 Cache: 512 kB - 8-way - 4 ns lat.

GPU
- SMs: 4 - 32 lanes per SM @ 1.4 GHz
- L1 Cache: 16 kB + 48 kB shared mem. - 4-way - 22 ns lat.
- L2 Cache: 512 kB - 16-way - 4 slices - 63 ns lat.

LLC and DRAM
- LLC: 8 MB - 4 banks - 32-way - 10 ns lat.
- DRAM: 4 channels - 2 ranks - 16 banks @ 1200 MHz
- RAS/RCD/CL/RP: 32 / 14.16 / 14.16 / 14.16 ns
- RRD/CCD/WR/WTR: 4.9 / 5 / 15 / 5 ns

Figure 1 shows the topology of the simulated system. The CPU's L1 and L2 caches are private per CPU core. Each GPU SM has a private L1, connected through a crossbar to the shared L2, which is itself attached to the global crossbar. In configuration a) the system has two L3 caches private to GPU and CPU; configuration b) shows a unified L3 that can be used by both. The LLC size listed in Table I refers to the shared configuration; see Section V for details on how it is split for the private LLC configuration. In both cases, the LLC(s) run in the same clock domain as the CPU. This allows us to present a fair comparison by setting the same access latency on both configurations, albeit providing a conservative estimation of the benefits of LLC sharing. All caches are write-back and inclusive with a least recently used (LRU) replacement policy; the cache line size is 128 bytes. The network-on-chip (NoC) is modeled with gem5's detailed Garnet model [34]. Flit size is 16 bytes for all links; data message size is equal to the cache line size and fits within 9 flits (1 header + 8 payload flits); control messages fit within 1 flit. In order to focus on the interactions between GPU and CPU we select a region of interest for all benchmarks, skipping initialization (memory allocation, input file reading, etc.) and clean-up phases. The results shown in Section V correspond only to this region. The power results from Section V-B were obtained with CACTI version 6.5 [35] configured with the parameters shown in Table I.

B. Benchmarks

We evaluate the effect of LLC sharing on two sets of heterogeneous benchmarks. First we look at Rodinia GPU [36], a well-known benchmark suite used to evaluate GPUs. Initially designed to run on discrete GPUs, the benchmarks explicitly copy data to and from the device. We use a modified version of the kernels provided by the gem5-gpu developers that removes data movement and makes use of pointers, leveraging the shared address space. The benchmarks analyzed are listed in Table II. However, Rodinia benchmarks have little interaction between CPU and GPU, as they were developed for heterogeneous systems with discrete GPUs, and a general recommendation for this kind of platform is avoiding memory transfers between CPU and GPU. Another possible set of benchmarks to perform the kind of evaluation we are interested in could be the Valar [37] benchmark suite. Valar is a set of benchmarks for heterogeneous systems that focuses on the interaction of CPU and GPU cores. However, the benchmarks are designed for old AMD APUs that lack new characteristics such as memory coherence and cross-device atomics. We therefore need a set of heterogeneous benchmarks that allows us to explore the possibilities and challenges of new features such as GPU-CPU memory coherence and cross-device atomic operations. To this end, we have prepared our own collection of benchmarks that exploits the most recent features in heterogeneous integrated systems. They present different heterogeneous computation patterns and are summarized in Table III.

Four benchmarks (DSP, DSC, IH, and PTTWAC) deploy concurrent CPU-GPU collaboration patterns. In these benchmarks, the input workload is dynamically distributed among CPU threads and GPU blocks. DSP and DSC utilize an adjacent synchronization scheme, which allows CPU threads and/or GPU blocks working on adjacent input data chunks to synchronize. Each CPU thread or GPU block has an associated flag that is read and written atomically with cross-device atomics. Both DSP and DSC are essentially memory-bound algorithms, as they perform data shifting in memory. DSC deploys reduction and prefix-sum operations in order to calculate the output position of the compacted elements. IH carries out an intensive use of atomic operations on a set of common memory locations (i.e., a histogram). Chunks of image pixels are statically assigned in a cyclic manner to CPU threads and GPU blocks. These update the histogram bins atomically using cross-device atomic additions. PTTWAC performs a partial transposition of a matrix. It works in-place; thus, each matrix element has to be saved (to avoid overwriting) and then shifted to the output location. As each of these elements is assigned to a CPU thread or a GPU block, these need to coordinate through a set of atomically updated flags.
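To make the adjacent synchronization scheme more concrete, the following CUDA sketch (our own illustration, not the benchmark source) shows how a GPU block could wait on the flag of the worker processing the chunk to its left and then publish its own flag, assuming the coherent unified address space of the simulated system. The CPU-side workers would perform equivalent atomic operations on the same flag array.

```cuda
// Adjacent synchronization sketch: flags[] lives in shared (host-visible)
// memory; flags[i] == 1 means worker i has finished its chunk.
// Typically only thread 0 of a block runs these, bracketed by __syncthreads().
__device__ void wait_for_left_neighbor(int *flags, int id) {
    if (id == 0) return;                       // leftmost chunk has no dependency
    while (atomicAdd(&flags[id - 1], 0) == 0)  // atomic read of the neighbor's flag
        ;                                      // spin until it is published
}

__device__ void publish_flag(int *flags, int id) {
    __threadfence_system();                    // make this chunk's stores visible to
                                               // the CPU and other SMs first
    atomicExch(&flags[id], 1);                 // then set our own flag
}
```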

TABLE II: RODINIA BENCHMARKS

Benchmark | Short Name | Dataset
Backprop | RBP | 256K nodes
Breadth-First Search | RBF | 256K nodes
Gaussian | RGA | 512 × 512 matrix
Hotspot | RHP | 512 × 512 data points
LavaMD | RLA | 10 boxes per dimension
LUD | RLU | 2K × 2K matrix
NN | RNN | 1024K data points
NW | RNW | 8K × 8K data points
Particlefilter | RPF | 10K particles
Pathfinder | RPA | 100K × 10K data points
Srad | RSR | 512 × 512 data points

In BFS the computation switches between CPU threads and GPU blocks in a coarse-grained manner. Depending on the amount of work of each iteration of the algorithm, CPU threads or GPU blocks are chosen. CPU and GPU threads share global queues in shared virtual memory. At the end of each iteration, they are globally synchronized using cross-device atomics. LCAS and UCAS are two kernels from the same AMD SDK sample. First, a CPU thread creates an array which represents a linked list to hold the IDs of all GPU threads. Then, in the first kernel (LCAS), each GPU thread inserts its ID into the linked list in a lock-free manner using atomic compare-and-swap (CAS). In the second kernel (UCAS), the GPU threads unlink or delete them one by one atomically using CAS. RANSAC implements a fine-grained switching scheme for this iterative method. One CPU thread computes a mathematical model for each iteration, which is later evaluated by one GPU block. As iterations are independent, several threads and blocks are working at the same time. TQ is a dynamic task queue system, where the work to be processed by the GPU is dynamically identified by the CPU. Several queues are allocated in shared virtual memory. CPU threads and GPU blocks access them by atomically updating three variables per queue that represent the number of enqueued tasks, the number of consumed tasks, and the current number of tasks in the queue. The applicability of this task queue system is illustrated by a real-world kernel: histogram calculation of frames from a video sequence.
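The queue bookkeeping just described can be sketched as follows. This is a hedged illustration under our own naming, not the TQ benchmark code; the CPU producer would perform the symmetric atomic updates when enqueuing, and memory fences are omitted for brevity.

```cuda
// One queue in shared virtual memory, visible to the CPU producer and GPU consumers.
struct TaskQueue {
    int enqueued;   // total tasks inserted so far by the CPU
    int consumed;   // total tasks claimed so far by GPU blocks
    int current;    // tasks currently waiting in the queue
};

// Returns the index of the claimed task, or -1 if the queue is currently empty.
__device__ int try_dequeue(TaskQueue *q) {
    if (atomicSub(&q->current, 1) <= 0) {  // no task available:
        atomicAdd(&q->current, 1);         // undo the decrement and fail
        return -1;
    }
    return atomicAdd(&q->consumed, 1);     // claim the next unconsumed task
}
```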

TABLE III: SUMMARY OF HETEROGENEOUS BENCHMARKS

Benchmark | Short Name | Field | Computation Pattern | Dataset
Breadth-First Search [38] | BFS | Graphs | Coarse-grain switching | NY/NE graphs [39]
DS Padding [40] | DSP | Data manipulation | Concurrent collaboration | 2K × 2K × 256 float
DS Stream Compaction [40] | DSC | Data manipulation | Concurrent collaboration | 1M float
FineGrainSVMCAS link [41] | LCAS | Synthetic benchmark | Fine-grain linked list | 4K elements
FineGrainSVMCAS unlink [41] | UCAS | Synthetic benchmark | Fine-grain linked list | 4K elements
Image Histogram [42] | IH | Image processing | Concurrent collaboration | Random and natural images (1.5M pixels, 256 bins)
PTTWAC Transposition [38] | PTTWAC | Data manipulation | Concurrent collaboration | 197 × 35588 doubles (tile size = 128)
Random Sample Consensus [43] | RANSAC | Image processing | Fine-grain switching | 5922 input vectors
Task Queue System [38] | TQ | Work queue | Producer-consumer | 128 frames

V. EVALUATION

We evaluate the effect of sharing the last-level cache by running a set of heterogeneous benchmarks with two cache configurations. For the private LLC configuration we split the cache by giving 1/8 to the GPU and 7/8 to the CPU. We follow current products from Intel and AMD, where the ratio of GPU-to-CPU cache size is between 1/8 and 1/16 [28], [44]. We evaluated different split ratios from 1/2 to 1/16 and saw similar trends among them. Unfortunately, splitting the LLC in this manner and directly comparing the results would not provide a fair evaluation. The additional cache space available to both CPU and GPU cores in the shared configuration may affect the results if the benchmarks are cache sensitive. To isolate the gains that are caused by faster communication and synchronization from those that are due to better utilization of the available cache space, we run all the benchmarks with an extremely large, 32-way associative LLC of 1 GB total aggregate size. Under this configuration the working set of most benchmarks fits in the private LLC, and therefore the gains cannot be attributed to the extra cache space.

A. Rodinia

We first analyze the Rodinia benchmark suite. As stated in Section IV-B, these benchmarks have minimal interaction between GPU and CPU. The one interaction all benchmarks share is the allocation and initialization of data by the host prior to launching the computation kernel(s). Therefore, on a shared LLC configuration, if the working set of the benchmark fits within the LLC, GPU memory requests will hit in the LLC and avoid an extra hop to the CPU's private LLC with the corresponding coherence traffic.

Figure 2 shows speedup on the shared LLC configuration normalized to private LLCs. Of the 11 benchmarks, 7 show a speedup of over 10% with an 8MB LLC. Of those, RBF and RLU lose all the speedup with a 1GB cache; we can therefore attribute the gains to better resource utilization when sharing the LLC. RBF has a significant amount of branch and memory divergence and is largely constrained by global memory accesses [45]. Our results confirm this and show that the kernel benefits from caching due to data reuse. On the 1GB configuration the GPU is able to fit the whole working set in its private cache hierarchy; since there is no further GPU-CPU interaction after the initial loading of data, there is no performance benefit from sharing the LLC. We also observe a similar behavior in RLU.

The RBP, RPA and RSR speedup is also reduced on the 1GB configuration, but they still obtain 13%, 33% and 13% improvement, respectively. RSR features a loop in the host code calling the GPU kernels for a number of iterations. After each iteration, the CPU performs a reduction with the result matrix, and therefore the benchmark benefits from faster GPU-CPU communication. The speedup is reduced on the 1GB configuration because there is data reuse within the two GPU kernels, and the larger private LLC allows more data to be kept on-chip. On RBP the CPU performs computations on shared data before and after the GPU kernel; the benefit of sharing the LLC is two-fold: the GPU finds the data in the shared LLC at the start of the kernel, and the CPU obtains the result faster by avoiding an extra hop to the private GPU LLC. RPA sees the largest performance improvement although there is no further GPU-CPU communication past the initial loading of data; the gains are thus attributable to the GPU finding the data in the shared LLC at the start of the kernel. Both these benchmarks see non-negligible performance gains despite the limited GPU-CPU interaction. The reason is that the total execution time for both benchmarks is low, and the effect of the initial hits on the host-allocated data is magnified. We chose a small input set in order to run our simulations within a reasonable time-frame. On large computations this benefit would be diminished over time, hence on real hardware with larger input sets, it is likely the gains would be minimal.

Fig. 2. Speedup of Rodinia benchmarks with a shared LLC normalized to a private LLC configuration.
Fig. 3. Cache sensitivity of Rodinia benchmarks. Each column corresponds to the GPU LLC size in a configuration with private caches. Results are normalized to a GPU LLC of size 512 KB.

RNW is the only benchmark where the performance gain of sharing the LLC actually goes up to 27% when increasing the LLC size to 1GB. The reason is the large input set used, with a heap usage of 512MB. The GPU LLC on the private configuration is not large enough to hold all the data; there is data reuse within the kernel, but due to the large working set, it is evicted out of the LLC before it is reused. On the shared configuration, the GPU benefits both from finding the data in the LLC and from being able to keep it there for reuse. In addition, after the GPU kernel completes, the CPU reads back the result matrix, benefiting from faster communication.

RNN experiences a small gain from sharing the LLC because it features some GPU-CPU interaction beyond the initial loading of data. When the GPU finishes computing distances, the CPU reads the final distance vector and searches for the closest value. The benchmark also benefits from the extra cache space, and thus the gains are reduced on the 1GB configuration where the 12MB input set fits in the LLC.

RGA, RHP, RLA, and RPF do not benefit from sharing the LLC. RGA launches a kernel many times to operate on two matrices and a vector. The benchmark benefits from caching due to all the data reuse, but once the matrices and vector are loaded, there is no further interaction with the CPU until the kernel completes. Although the CPU then reads the data and performs a final computation, this is just a small portion of the execution time and thus the benefit is minimal. In RLA, the kernel is optimized to access contiguous memory locations, allowing the GPU to coalesce a large amount of memory accesses and reduce the total memory traffic pushed into the cache hierarchy. This memory access pattern and a high data reuse produce close to a 99% cache hit rate in the GPU L1 caches despite an input set size of 8MB. As a consequence, sharing the LLC provides no benefit beyond the initial loading of data. A similar behavior can be observed in RPF, where the small memory footprint of the kernel allows data to fit within the GPU L1 and L2 caches. RHP performs multiple iterations operating over the same three matrices, showing data reuse with a large reuse distance. With a 1GB LLC the whole working set is able to fit in the cache, but the kernel is mostly cache insensitive and gains little from the increased hit rate. There is no GPU-CPU communication, and the small benefit of initially hitting in the LLC is diminished over the total execution time.

These results show that sharing the LLC does not provide a significant benefit for computations such as the ones found in the Rodinia benchmark suite, with minimal GPU-CPU interaction and data sharing only at kernel boundaries. The geometric mean speedup for all benchmarks is 9% on the 1GB configuration and 13% on the 8MB configuration, and it is mostly due to the extra cache space available to the GPU. In order to measure the sensitivity of the benchmarks to GPU cache size, we run them with different LLC sizes of 4MB, 8MB, 16MB and 1GB on the private LLC configuration. Following the 1/8 ratio of GPU to CPU LLC, the GPU obtains 512KB, 1MB, 2MB, and 128MB respectively. We keep the same access latencies for all configurations in order to provide a fair comparison. Figure 3 shows speedup as we increase cache size, normalized to the 512KB GPU LLC. Confirming our previous findings, we see that RBF, RGA, RLU, RNW and RPA show sensitivity to cache size, obtaining over 20% performance increase with an ideal 128MB cache.

Fig. 4. Speedup of heterogeneous benchmarks with a shared LLC.
Fig. 5. LLC hit rates for private and shared configuration.

RGA and RBF show very high cache sensitivity, with up to 69% and 49% improvement respectively with a more realistic 2MB cache. RLA and RPF, as discussed earlier, make almost no use of the LLC and therefore do not benefit from a larger cache. RHP shows data reuse and sees minor gains with a 128MB LLC that is able to fit the whole working set; with a smaller LLC the number of cache misses increases, but the GPU is able to hide the extra latency and the benchmark is thus mostly cache insensitive. RBP and RSR show some sensitivity to cache size, confirming that the loss of speedup shown in Figure 2 is due to the increased LLC size. RNN sees some improvements up to the 2MB configuration, after which the working set fits within the 16MB cache hierarchy. Although the 2MB of GPU LLC are not enough to hold all the data, the kernel features no data reuse and does not benefit from a larger LLC.

B. Heterogeneous Benchmarks

We run the heterogeneous benchmarks with two CPU worker threads, with the exception of LCAS and UCAS, which use only one. As in Section V-A, we also run the benchmarks with an ideal 1GB LLC to isolate the gains that come from the additional cache space available on the shared configuration. Figure 4 shows the speedup obtained with a shared LLC over the private LLC configuration. Of the 11 benchmarks, 6 show improvements of over 20% with a shared LLC. For BFS we choose two different input graphs; the smaller NY graph has a variable amount of work per iteration, switching often between GPU and CPU computation. The larger NE graph has many iterations with a large amount of work, and therefore executes mostly on the GPU, switching less often between GPU and CPU. The performance gain for BFS-NY is higher than for BFS-NE, achieving as much as a 56% speedup on the 1GB configuration. This is reduced to 11% on BFS-NE with the 1GB LLC because, with limited GPU-CPU communication, the benefit comes mostly from additional cache space.

In order to understand how sharing the LLC affects cache hit rates, we show in Figure 5 the L3 hit rates on both the private and shared configurations. With private LLCs we calculate the aggregated L3 hit rate as (Hits_CPU + Hits_GPU) / (Access_CPU + Access_GPU) × 100. We see that BFS-NE obtains 39% more hits in the private LLC configuration by increasing the size from 8MB to 1GB. The computation is mostly performed by the GPU, where only 1MB of LLC is available on the private 8MB configuration. Being able to use the remaining 7MB provides a significant gain.

The speedup observed in BFS-NY supports the relevance of fast GPU-CPU communication on workloads making extensive use of atomic synchronization. In gem5-gpu, atomics are implemented with read-modify-write (RMW) instructions. An RMW instruction is composed of two parts: an initial load (LD) of the cache block in exclusive state, and a following write (WR) with the new value. Once a thread successfully completes the LD, no other thread can modify that memory location until the WR is finished, guaranteeing atomicity.

Figure 6 shows the average access time to perform the LD part of the RMW on a shared LLC configuration, normalized to private LLCs. We can see how BFS performs the operation 40% and 45% faster with a shared LLC for the NE and NY graphs, respectively. On the other hand, DSC and DSP perform slower RMW LDs with a shared LLC, and nevertheless show speedups of 27% and 42%, respectively. The average time to perform the RMW LD is higher because there are more L1 misses when the exclusive LD is attempted. This is a side-effect of the faster GPU-CPU communication. GPU and CPU cores compete for the cache blocks holding the array of synchronization flags, invalidating each other. The shorter the latency to reach the current owner of the block, the more likely it is for a core to have relinquished ownership of the block by the time it is reused. The reason the benchmarks obtain a speedup with a shared LLC is that the overall memory access time for all accesses is lower. In particular for the GPU, the average latency for all the threads of a warp to complete a load instruction is 65% and 40% lower for DSC and DSP on the 8MB configuration. Figure 5 shows how both benchmarks go from lower than 10% LLC hit rates with a private configuration to above 80% when sharing the LLC. The benchmarks are memory bound, and the reduced memory latency caused by hitting in the shared LLC compensates for the higher miss rate when performing the atomics.

Figure 7 shows the average GPU and CPU instructions per cycle (IPC) with a shared LLC configuration normalized to private LLCs. DSC and DSP achieve up to 30% and 47% higher GPU IPC by sharing the LLC. The more latency-sensitive CPU cores see a large increase of up to 49% on the 1GB configuration when the whole working set fits in the cache.

IH performs a histogram on an input image. We configure the benchmark with 256 bins, which fit within 9 cache blocks (8 if aligned to block size); our intuition was that these blocks would be highly contended and the benchmark would benefit from faster atomic operations.
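The contended update pattern in IH can be illustrated with the following CUDA sketch, our own simplification of the cross-device histogram described above (function and parameter names are assumptions). The CPU threads would run an equivalent loop over their own chunks, using atomic additions on the same 256-entry bin array in shared virtual memory.

```cuda
// Each GPU block processes its assigned pixel chunk and increments the shared
// bin array with atomic read-modify-write operations.
__global__ void histogram_chunk(const unsigned char *pixels, int chunk_begin,
                                int chunk_end, unsigned int *bins /* 256 entries */) {
    for (int i = chunk_begin + blockIdx.x * blockDim.x + threadIdx.x;
         i < chunk_end;
         i += gridDim.x * blockDim.x) {
        atomicAdd(&bins[pixels[i]], 1u);   // contended RMW on bins shared with the CPU
    }
}
```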

Fig. 6. Average latency to perform a RMW LD normalized to private LLC.
Fig. 7. Normalized IPC with a shared 8MB LLC.

Interestingly, we see only a relatively small speedup of 14% and 8% on the 8MB and 1GB configurations, respectively. In Figure 6 we see that sharing the LLC does not reduce the time to perform a RMW LD. The speedup is small because, in the end, the CPU is the bottleneck. The GPU benefits from multiple bins falling on the same cache line, as threads from the same warp can increment multiple bins in a fast manner. That, on the other hand, causes false sharing on the CPU caches. The work is statically partitioned, so the GPU completes its part while the CPU takes 10x longer to finish 1/8 of the image. We observe how the GPU does indeed benefit from sharing the LLC; the average latency for all the threads of a warp to finish a LD operation is reduced by 63% and 59% with a shared LLC. After the GPU finishes computing its part, the CPU remains computing, and eventually all the lines with the bins are loaded into the CPU caches, obtaining no benefit from the shared LLC. Figure 7 clearly depicts this; the IPC of the GPU increases over 2x on the 8MB configuration, while the CPU sees barely any improvement.

One of the consequences of using an image as the input is that we observe less conflict than expected for the cache blocks holding the bins. Images usually have similar adjacent pixels, and it is likely that after obtaining a block in exclusive state to perform the atomic increment, the following pixels require incrementing a bin in the same block. In order to evaluate the shared LLC with a different memory access pattern, we also run the benchmark with a randomized pixel distribution (IH-RAND). This input reduces the number of RMW LD hits, indicating there is more contention for the lines holding the bins. Ultimately, however, the number of cache hits reduced is low, as with 32 of the 256 bins per cache block it is still likely that the next atomic increment falls on a bin in the same block.

PTTWAC performs a partial matrix transposition in-place. The input matrix requires 53 MB of memory, hence not fitting in the cache hierarchy on the 8MB configuration. Sharing an 8MB LLC with such a large input barely increases the L3 hit rate, but provides a 39% speedup. Figure 6 shows that this is due to a significant reduction in the average latency to perform the RMW LD. On the 1GB configuration the latency reduction is even larger, but the speedup is down to 27%. In this case the cache is large enough to fit the matrix, and the benchmark only benefits from faster atomics.

In RANSAC the CPU first performs a fitting stage on a sample of random vectors. When finished, it signals the GPU to proceed with the evaluation stage, where all the outliers are calculated. This process is repeated until a convergence threshold is reached. We see that sharing the LLC improves performance by 12% for both cache sizes without speeding up the atomic operations. Both GPU and CPU threads spin reading the synchronization flag when their counterpart is computing; therefore there is no contention for the block once it is read. The hit rate when performing a RMW LD is 72% for both shared and private LLC configurations, and thus the average access time is already low. The speedup in this case is not produced by faster synchronization, but by sharing the vector array. The memory footprint of the array is small, increasing the L3 hit rate only by 10%. Nevertheless, Figure 7 shows this 10% has a large effect on the latency-sensitive CPU, which achieves over 60% higher IPC with a shared LLC. The GPU also finds the vector array in the LLC on the first iteration, and gains a more modest 17% IPC. The total execution time of this benchmark is low, and thus the impact of initially hitting in the LLC is magnified. As with Rodinia, on a longer-running benchmark the gain would diminish.

LCAS uses a CPU thread to traverse half of a linked list while the GPU threads traverse the other half, inserting in each position an identifier and atomically updating the head of the list. The cache block containing the head is highly contended, causing the atomic operations to be the bottleneck. UCAS traverses the list resetting the identifiers to 0. The difference lies in the order in which they access the elements. Although the data structure containing the identifiers is conceptually a linked list, it is implemented as an array where the first position contains the array index of the next element. On LCAS the CPU inserts identifiers in consecutive array positions and GPU threads update the array position matching their thread identification number. Hence, threads from the same warp update contiguous positions. On UCAS the order in which the elements are accessed is the reverse of the order in which the linked list was updated on LCAS, i.e., the reverse of the order in which the threads were able to perform the atomic operation. This difference causes the observed speedup variation. On UCAS the scattered access pattern causes many blocks to be moved back and forth between GPU and CPU, which is reflected by the near 0% L3 hit rate seen in Figure 5. In this case the data migration latency from GPU to CPU is also an important factor.
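The lock-free head update used by LCAS can be sketched in CUDA as follows. This is our own illustration of the compare-and-swap pattern described above, not the AMD SDK sample; the CPU thread would run the analogous loop with its own atomic CAS on the same shared head and next array.

```cuda
// Insert one identifier at the head of the list. next[i] holds the index of the
// element that follows i; head points to the most recently inserted index.
__device__ void insert_id(int *head, int *next, int my_id) {
    int old_head;
    do {
        old_head = *(volatile int *)head;   // snapshot the current head
        next[my_id] = old_head;             // link our element in front of it
    } while (atomicCAS(head, old_head, my_id) != old_head);  // retry if another
                                                             // thread won the race
}
```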

Fig. 8. Energy-to-solution normalized to private LLC.

The results show that although UCAS achieves on average a lower latency reduction to perform a RMW LD with a shared LLC, the benefit of faster data movement actually renders a higher speedup compared to LCAS.

In TQ the CPU is in charge of inserting 128 frames in several queues. GPU blocks dequeue individual frames and generate their histograms. As the histogram of each frame is only accessed by one GPU block, it will be kept in the L1, ensuring low latency for RMW operations already on the private LLC configuration. Additionally, the number of atomics on the more contended control variables of the queues is very small compared to the amount of atomic operations on the histograms. Thus, the average latency to perform the RMW is not reduced by sharing the LLC. However, the LLC hit rate is higher in the shared configuration, because the GPU blocks will eventually read frames previously cached by the CPU thread when enqueuing them. This explains the 10% speedup on the 8MB configuration. The improvement is much higher on the 1GB configuration (34%) because the larger cache can keep the entire pool of 128 frames (54MB) and the queues.

Figure 8 shows energy-to-solution with an 8MB LLC normalized to private LLCs. All benchmarks show that sharing the LLC decreases energy-to-solution, by over 30% on 9 of the 11. For all benchmarks, static power is reduced due to shorter execution times. The other major reduction comes from lower L3 dynamic power. A shared LLC increases hit rates and avoids the extra requests and coherence traffic caused by a cache miss. This can lead to a significant reduction, as in BFS, DSC, RANSAC, LCAS and UCAS. The third reduction comes from DRAM dynamic power. The results we present in this section are for the benchmark's region of interest; the data has already been allocated and initialized, and in most cases the data is already on-chip, therefore the total number of off-chip accesses is already low. Sharing the LLC improves resource utilization and allows data to stay longer in the hierarchy, reducing off-chip traffic even further. The exceptions are PTTWAC and TQ. Both benchmarks have a working set size far larger than the cache hierarchy, and must still load data from DRAM. The shared LLC minimally reduces this in PTTWAC, and slightly increases it in TQ. The reason is that the frames sometimes evict the queues from the shared LLC, causing off-chip write-backs and subsequent reloads. On the private configuration the queues are able to stay in the CPU's LLC.

VI. CONCLUSIONS

This paper was motivated by the lack of efforts focusing on the effects of resource sharing in heterogeneous computations. We believe the tighter integration of CPU cores with GPUs and other accelerators will change the way we understand heterogeneous computing in the same way the advent of multi-core processors changed how we think about algorithms.

Our results show that the Rodinia benchmark suite, with coarse-grained GPU-CPU communication, experiences an average 13% speedup using an 8MB shared LLC versus a private LLC configuration. We have shown that the gain is mainly due to the extra cache space available to the GPU or due to short execution times. We have additionally analyzed a set of heterogeneous benchmarks featuring stronger CPU and GPU collaboration. These show that computations that leverage the shared virtual address space and fine-grained synchronization achieve an average speedup of 25%, and of up to 53%, with an 8MB shared LLC. In addition, energy-to-solution is reduced for all benchmarks due to lower static power and L3 and DRAM dynamic power consumption.

Our results indicate that sharing the LLC in an integrated GPU-CPU system is desirable for heterogeneous computations. The first benefit we observed is due to the faster synchronization between GPU and CPU. In applications where fine-grained synchronization via atomic operations is used and many actors contend to perform the atomics, accelerating this operation provides considerable speedups. The second benefit is due to data sharing; if GPU and CPU operate on shared data structures, sharing the LLC will often reduce average memory access time and dynamic power. We have observed this effect both with read-only and with exclusive read-write data. The third benefit we observed is due to better utilization of on-chip resources. A cache hierarchy where the LLC is partitioned will often underuse the cache space available, while sharing it guarantees full utilization if needed.

Nevertheless, resource sharing between such disparate computing devices entails additional problems. We have seen an increase in conflict misses, especially with large input sets. In these cases the benefits of sharing the LLC offset the drawbacks. However, we are only focusing on computations that fully leverage the characteristics of integrated heterogeneous architectures. In the last few years researchers have shown and proposed solutions for the challenges of resource sharing with other types of workloads, and further investigation is required if the trend of GPU-CPU integration is to continue.

Summing up, the benefits we have listed encourage a rethinking of heterogeneous computing. In an integrated heterogeneous system, computation can be divided into smaller chunks; each chunk can be executed on the computing device it is best suited for, seamlessly sharing data structures between compute units and synchronizing via fine-grained atomic operations. Sharing on-chip resources such as the last-level cache can provide performance gains if the algorithms fully leverage the capabilities of these integrated systems.

ACKNOWLEDGMENTS

This work has been supported by the Spanish Ministry of Science and Innovation (contract TIN2015-65316-P) and by the BSC/UPC NVIDIA GPU Center of Excellence.

REFERENCES

[1] J. Lee and H. Kim, "TAP: A TLP-aware cache management policy for a CPU-GPU heterogeneous architecture," in IEEE 18th International Symposium on High-Performance Computer Architecture (HPCA), 2012.
[2] V. Mekkat, A. Holey, P.-C. Yew, and A. Zhai, "Managing shared last-level cache in a heterogeneous multicore processor," in International Conference on Parallel Architectures and Compilation Techniques, 2013.
[3] M. K. Jeong, M. Erez, C. Sudanthi, and N. Paver, "A QoS-aware memory controller for dynamically balancing GPU and CPU bandwidth use in an MPSoC," in Design Automation Conference (DAC), 2012, pp. 850–855.
[4] R. Ausavarungnirun, K. K.-W. Chang, L. Subramanian, G. H. Loh, and O. Mutlu, "Staged memory scheduling: Achieving high performance and scalability in heterogeneous systems," in 39th International Symposium on Computer Architecture (ISCA), 2012, pp. 416–427.
[5] O. Kayiran, N. C. Nachiappan, A. Jog, R. Ausavarungnirun, M. T. Kandemir, G. H. Loh, O. Mutlu, and C. R. Das, "Managing GPU concurrency in heterogeneous architectures," in 47th IEEE/ACM International Symposium on Microarchitecture (MICRO), 2014, pp. 114–126.
[6] J. Lee, S. Li, H. Kim, and S. Yalamanchili, "Adaptive virtual channel partitioning for network-on-chip in heterogeneous architectures," ACM Trans. Des. Autom. Electron. Syst., vol. 18, no. 4, pp. 48:1–48:28, 2013.
[7] M. Daga, A. M. Aji, and W. Feng, "On the efficacy of a fused CPU+GPU processor (or APU) for parallel computing," in Symposium on Application Accelerators in High-Performance Computing, 2011.
[8] M. Daga and M. Nutter, "Exploiting coarse-grained parallelism in B+ tree searches on an APU," in SC Companion: High Performance Computing, Networking Storage and Analysis (SCC), 2012.
[9] M. Daga, M. Nutter, and M. Meswani, "Efficient breadth-first search on a heterogeneous processor," in IEEE International Conference on Big Data, Oct. 2014, pp. 373–382.
[10] M. C. Delorme, T. S. Abdelrahman, and C. Zhao, "Parallel radix sort on the AMD Fusion Accelerated Processing Unit," in 42nd International Conference on Parallel Processing, Oct. 2013, pp. 339–348.
[11] J. He, M. Lu, and B. He, "Revisiting co-processing for hash joins on the coupled CPU-GPU architecture," Proc. VLDB Endow., vol. 6, no. 10, pp. 889–900, Aug. 2013.
[12] J. Hestness, S. W. Keckler, and D. A. Wood, "GPU computing pipeline inefficiencies and optimization opportunities in heterogeneous CPU-GPU processors," in IEEE International Symposium on Workload Characterization (IISWC), 2015, pp. 87–97.
[13] W. Liu and B. Vinter, "Speculative segmented sum for sparse matrix-vector multiplication on heterogeneous processors," Parallel Comput., vol. 49, no. C, pp. 179–193, Nov. 2015.
[14] Heterogeneous System Architecture: A Technical Review, AMD, 2012. [Online]. Available: http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2012/10/hsa10.pdf
[15] W. Hwu, Heterogeneous System Architecture. Morgan Kaufmann, 2016.
[16] The OpenCL Specification v2.0, Khronos OpenCL Working Group, 2015. [Online]. Available: https://www.khronos.org/registry/cl/specs/opencl-2.0.pdf
[17] CUDA C Programming Guide, NVIDIA Corporation, 2014. [Online]. Available: http://docs.nvidia.com/cuda/cuda-c-programming-guide/
[18] NVIDIA. (2015) NVIDIA Tegra X1. [Online]. Available: http://www.nvidia.com/object/tegra-x1-processor.html
[19] Compute Cores. Whitepaper, AMD, 2014. [Online]. Available: https://www.amd.com/Documents/Compute_Cores_Whitepaper.pdf
[20] Intel Corporation. (2015) The compute architecture of Intel processor graphics Gen9. [Online]. Available: https://software.intel.com/sites/default/files/managed/c5/9a/The-Compute-Architecture-of-Intel-Processor-Graphics-Gen9-v1d0.pdf
[21] Qualcomm. (2013) Snapdragon S4 processors: System on chip solutions for a new mobile age. Whitepaper. [Online]. Available: https://www.qualcomm.com/documents/snapdragon-s4-processors-system-chip-solutions-new-mobile-age
[22] Exynos 5. Whitepaper, Samsung, 2012. [Online]. Available: http://www.samsung.com/global/business/semiconductor/minisite/Exynos/data/Enjoy_the_Ultimate_WQXGA_Solution_with_Exynos_5_Dual_WP.pdf
[23] L. Lamport, "How to make a multiprocessor computer that correctly executes multiprocess programs," IEEE Trans. Comput., vol. 28, no. 9, pp. 690–691, Sep. 1979.
[24] S. Owens, S. Sarkar, and P. Sewell, "A better x86 memory model: x86-TSO," in 22nd International Conference on Theorem Proving in Higher Order Logics (TPHOLs), 2009, pp. 391–407.
[25] S. V. Adve and K. Gharachorloo, "Shared memory consistency models: A tutorial," Computer, vol. 29, no. 12, pp. 66–76, Dec. 1996.
[26] J. Goodacre and A. N. Sloss, "Parallelism and the ARM instruction set architecture," Computer, vol. 38, no. 7, pp. 42–50, Jul. 2005.
[27] B. A. Hechtman and D. J. Sorin, "Exploring memory consistency for massively-threaded throughput-oriented processors," in 40th International Symposium on Computer Architecture (ISCA), 2013, pp. 201–212.
[28] GCN Architecture. Whitepaper, AMD, 2012. [Online]. Available: https://www.amd.com/Documents/GCN_Architecture_whitepaper.pdf
[29] J. Power, A. Basu, J. Gu, S. Puthoor, B. M. Beckmann, M. D. Hill, S. K. Reinhardt, and D. A. Wood, "Heterogeneous system coherence for integrated CPU-GPU systems," in 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2013, pp. 457–467.
[30] M. K. Qureshi, A. Jaleel, Y. N. Patt, S. C. Steely, and J. Emer, "Adaptive insertion policies for high performance caching," in 34th International Symposium on Computer Architecture (ISCA), 2007, pp. 381–391.
[31] J. Power, J. Hestness, M. S. Orr, M. D. Hill, and D. A. Wood, "gem5-gpu: A heterogeneous CPU-GPU simulator," IEEE Computer Architecture Letters, vol. 14, no. 1, pp. 34–36, Jan. 2015.
[32] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti, R. Sen, K. Sewell, M. Shoaib, N. Vaish, M. D. Hill, and D. A. Wood, "The gem5 simulator," SIGARCH Comput. Archit. News, vol. 39, no. 2, pp. 1–7, Aug. 2011.
[33] A. Bakhoda, G. Yuan, W. Fung, H. Wong, and T. Aamodt, "Analyzing CUDA workloads using a detailed GPU simulator," in International Symposium on Performance Analysis of Systems and Software, 2009.
[34] N. Agarwal, T. Krishna, L. S. Peh, and N. K. Jha, "GARNET: A detailed on-chip network model inside a full-system simulator," in International Symposium on Performance Analysis of Systems and Software, 2009.
[35] N. Muralimanohar and R. Balasubramonian, "CACTI 6.0: A tool to understand large caches," 2007.
[36] S. Che, M. Boyer, J. Meng, D. Tarjan, J. Sheaffer, S.-H. Lee, and K. Skadron, "Rodinia: A benchmark suite for heterogeneous computing," in International Symposium on Workload Characterization, Oct. 2009.
[37] P. Mistry, Y. Ukidave, D. Schaa, and D. Kaeli, "Valar: A benchmark suite to study the dynamic behavior of heterogeneous systems," in Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units (GPGPU-6). ACM, 2013, pp. 54–65.
[38] W.-M. Hwu, Heterogeneous System Architecture: A New Compute Platform Infrastructure. Morgan Kaufmann, 2015.
[39] University of Rome "La Sapienza", "9th DIMACS Implementation Challenge," 2014, http://www.dis.uniroma1.it/challenge9/index.shtml.
[40] J. Gómez-Luna, L.-W. Chang, I.-J. Sung, W.-M. Hwu, and N. Guil, "In-place data sliding algorithms for many-core architectures," in 44th International Conference on Parallel Processing (ICPP), Sep. 2015.
[41] AMD, "AMD accelerated parallel processing (APP) software development kit (SDK) 3.0," http://developer.amd.com/tools-and-sdks/opencl-zone/amd-accelerated-parallel-processing-app-sdk/, 2016.
[42] J. Gómez-Luna, J. M. González-Linares, J. I. Benavides, and N. Guil, "An optimized approach to histogram computation on GPU," Machine Vision and Applications, vol. 24, no. 5, pp. 899–908, 2013.
[43] J. Gómez-Luna, H. Endt, W. Stechele, J. M. González-Linares, J. I. Benavides, and N. Guil, "Egomotion compensation and moving objects detection algorithm on GPU," in Applications, Tools and Techniques on the Road to Exascale Computing, ser. Advances in Parallel Computing, vol. 22. IOS Press, 2011, pp. 183–190.
[44] Intel Corporation. (2013) Products (formerly Haswell). [Online]. Available: http://ark.intel.com/products/codename/42174/Haswell
[45] S. Che, J. W. Sheaffer, M. Boyer, L. G. Szafaryn, L. Wang, and K. Skadron, "A characterization of the Rodinia benchmark suite with comparison to contemporary CMP workloads," in IEEE International Symposium on Workload Characterization (IISWC), Dec. 2010.
