The Tradeoffs of Fused Memory Hierarchies in Heterogeneous Computing Architectures

Kyle Spafford, Jeremy S. Meredith, Seyong Lee, Dong Li, Philip C. Roth, and Jeffrey S. Vetter
Oak Ridge National Laboratory
1 Bethel Valley Road, Oak Ridge, TN 37831

ABSTRACT

With the rise of general purpose computing on graphics processing units (GPGPU), the influence from consumer markets can now be seen across the spectrum of HPC architectures. In fact, many of the high-ranking Top500 HPC systems now include these accelerators. Traditionally, GPUs have connected to the CPU via the PCIe bus, which has proved to be a significant bottleneck for scalable scientific applications. Now, a trend toward tighter integration between CPU and GPU has removed this bottleneck and unified the memory hierarchy for both CPU and GPU cores. We examine the impact of this trend for high performance scientific computing by investigating AMD's new Fusion Accelerated Processing Unit (APU) as a testbed. In particular, we evaluate the tradeoffs in performance, power consumption, and programmability when comparing this unified memory hierarchy with similar, but discrete GPUs.

Categories and Subject Descriptors

B.3.3 [Memory Structures]: Performance Analysis and Design Aids; C.1.3 [Processor Architectures]: Other Architecture Styles—heterogeneous (hybrid) systems

Keywords

APU, hybrid memory, GPGPU, performance analysis, benchmarking

1. INTRODUCTION

1.1 GPUs and Heterogeneity

The demand for flexibility in advanced computer graphics has caused the GPU to evolve from a highly specialized, fixed-function pipeline to a more general architecture. However, in its current form, there are still substantial differences between the GPU and a traditional multi-core CPU. Perhaps the most salient difference is in the memory hierarchy: the GPU shuns high capacity, coherent caches in favor of a much larger number of functional units. This lack of coherent caches in the GPU is not surprising, given the low reuse of graphics data flowing through the frame buffer. Instead, GPUs have used wide memory busses and specialized texturing hardware (which provides a limited set of addressing and interpolation operations) for a high bandwidth, high latency connection to off-chip RAM. Because of these architectural differences, early GPGPU adopters observed that many data-parallel problems in scientific computing exhibited substantial performance improvements when run on a GPU.

Initially, achieving such speedups usually required that all operations be cast as graphics operations, comprising a "heroic" programming effort. This made the initial costs of GPGPU too high for mainstream scientific computing. However, with the advent of programming models like OpenCL and CUDA, which are very similar to standard C/C++, the barriers to entry have decreased and GPGPU now enjoys much wider adoption. Indeed, the low cost of the GPU hardware itself is one of GPGPU's main advantages.
Despite these improvements, most GPUs suffer from performance limitations due to the PCIe bus, and limited productivity due to an increasingly complex memory model. Both of these problems are direct consequences of the hardware architecture. Simply put, any data that moves between the CPU and a discrete GPU must traverse the PCIe bus, which has limited bandwidth (usually at least an order of magnitude less than GPU memory bandwidth). This architecture results in relatively slow transfers, and high latency synchronization between devices that applications should avoid when possible.

The complexity of the memory model for programming accelerated applications has also increased. The OpenCL memory model for a single GPU is already more complicated than the memory model of a traditional multicore. In fact, the OpenCL memory model contains five distinct memory spaces (global, local, constant, image, and private), each with its own coherency policy and optimal access pattern. Moreover, many of the address spaces only implement relaxed consistency, requiring the programmer to perform explicit synchronization. This model is further complicated by the PCIe bus, since the programmer is required to keep CPU and GPU memory consistent via explicit DMA transfers.
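To make this concrete, consider the following OpenCL sketch; the kernel, buffer names, and sizes are illustrative assumptions, not drawn from any benchmark in this paper. The kernel touches the global, local, and private address spaces and needs an explicit barrier because local memory is only made consistent within a work-group at synchronization points, and on a discrete GPU the host must additionally issue explicit transfer commands to keep its copy of the data consistent.

  __kernel void scale_copy(__global const float *in,
                           __global float *out,
                           __local float *tile,
                           float alpha)
  {
      size_t gid = get_global_id(0);
      size_t lid = get_local_id(0);
      tile[lid] = in[gid];             /* stage through __local scratchpad   */
      barrier(CLK_LOCAL_MEM_FENCE);    /* relaxed consistency: explicit sync */
      out[gid] = alpha * tile[lid];    /* temporaries live in __private      */
  }

  /* Host side (context, queue, and buffer setup omitted): explicit transfers
     across PCIe bracket every kernel launch on a discrete GPU. */
  clEnqueueWriteBuffer(queue, d_in, CL_TRUE, 0, bytes, h_in, 0, NULL, NULL);
  clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, &local_size,
                         0, NULL, NULL);
  clEnqueueReadBuffer(queue, d_out, CL_TRUE, 0, bytes, h_out, 0, NULL, NULL);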
In an effort to address these difficulties, system architects and vendors are now focusing on designs which feature much tighter integration between the CPU and GPU, such as AMD's Fusion [4] (studied in this paper) and NVIDIA's Project Denver.

1.2 SoC and Tighter Integration

Another consumer trend that has motivated the design of the APU is the shift towards a system-on-a-chip (SoC). SoC design largely came about in the mobile and embedded spaces due to the desire for reuse of silicon designs, specifically reusing basic system blocks for wireless technologies, specialized media processing units, etc. Tighter integration also offers advantages in energy efficiency by enabling fine-grained dynamic voltage and frequency scaling (DVFS) across multiple system components. In DVFS, the clock speed and voltage of a processing element are raised or lowered by the operating system based on processor utilization and workload, resulting in higher performance under load and lower power consumption when cores are idle. This improved efficiency is increasingly appealing for HPC systems, due to the projections for the power requirements of an exascale computer [12], and the possibility of leveraging design trends in the mobile and embedded markets, where the focus is on longer battery life.

1.3 Tradeoffs

Heterogeneity and SoC-like integration are both evident in the design of AMD's Fusion APU, shown in Figure 1; it replaces the PCIe connection between the CPU and GPU cores with a unified north bridge and two new busses, the Radeon Memory Bus (RMB) and the Fusion Compute Link (FCL), discussed further in Section 2.1. While integrated GPUs have existed for some time in the mobile market, AMD's fused GPU (fGPU) can snoop CPU cache transactions using the FCL, making Fusion the first mass-market architecture to support cache coherency between the CPU and GPU. This capability for cache coherency is the hallmark of a fused heterogeneous architecture.

Fusing these two distinct memory hierarchies results in numerous tradeoffs. For example, traditional GPU architectures support much higher memory bandwidth than the Llano Fusion architecture due to dedicated GDDR memory (see Section 4). This paper explores five such tradeoffs, using Llano as a forward-looking example of tightly-integrated CPU and GPU architectures:

1. In the multi-level hierarchy of fused designs, what is the set of caches that should be kept coherent in order to allow scalability to large core counts?

2. With fixed die space and a fixed transistor budget, how should resources be allocated to improve serial performance (CPU cores) as opposed to parallel performance (GPU cores)?

3. Given a fixed power budget, should the memory be configured for higher capacity (e.g. DDR3) or higher bandwidth (e.g. GDDR5)?

4. Are fused designs more power efficient? Would power be better spent using discrete, specialized components?

5. Given a limited amount of programmer effort, what is the correct level of abstraction to use when programming APUs? Simple abstractions will require advanced runtime systems and middleware, while lower-level abstractions require more application programming effort.

2. CACHE COHERENCY VS. SCALABILITY

The specialization of the GPU memory hierarchy is also evident in its cache configuration, which has traditionally relied on a combination of simple, incoherent SRAM scratchpad memories with a specialized texturing unit. This texturing unit typically contains its own separate cache, targeted to regular access patterns on two dimensional data. It also implements a limited set of functions in hardware, including several addressing computations and interpolation operations that are ubiquitous in graphics applications. Aside from the texture unit, the cache hierarchy of the GPU is remarkably flat compared to the CPU, largely due to the streaming nature of graphics operations. One of the benefits of this flat hierarchy is that it scales very well with the number of processor tiles (e.g. to sixteen hundred cores in Cypress, with only thirty-two kilobytes of scratchpad memory per tile). The disadvantage is that this hierarchy can only support relaxed consistency models for small groups of cores, which places a burden on the programmer that will be discussed more in Section 6.

On the other hand, CPUs have traditionally used multi-level, high capacity caches which are kept consistent using protocols like MESI. Such protocols provide strong consistency models for the programmer, but are much less scalable due to the rapid growth of coherency-related traffic. The key tradeoff, then, is in determining how the CPU and fGPU caches can be coupled to enforce coherency while preserving scalability to a large number of cores.

Figure 1: The Fusion Memory Hierarchy. The solid lines in this figure indicate cache coherent connections, and the dashed lines show lack of coherence. Blue indicates components of a traditional CPU memory hierarchy and red shows components of a traditional GPU hierarchy. For example, the CPU usually accesses System Memory through the L2 cache and the write-combining buffers. Orange indicates novel features and paths in Fusion. The familiar cache hierarchy of the CPUs is connected to the GPU cores by the FCL. The RMB preserves high bandwidth access from the GPU cores to the “Local” memory (optionally storing data in the texture cache). The CPU cores can access this same “Local” memory via the write-combining buffers through the Unified North Bridge.

2.1 Fusion Memory Hierarchy

In Fusion, this coupling is accomplished via the addition of two new busses, the Fusion Compute Link (FCL) and the Radeon Memory Bus (RMB), depicted in Figure 1, as well as a unified north bridge (UNB) which coordinates access to the different logical regions of physical memory. One of the goals of the Fusion memory hierarchy is to allow the CPU and GPU to share data while preserving performance for each processing element's predominant access pattern—the CPU should still support low latency access (optimized with caches) and the GPU should still have high bandwidth access (optimized via the RMB) to contiguous regions of memory. In order to understand the design at a high level, it is important to understand each of the components:

• Physical Memory. In Fusion, the same physical memory is shared between the CPU and GPU. The operating system maintains the partition between system memory (normal, pageable memory) and "local" memory, which is reserved for the GPU. Local memory is conceptually similar to the frame buffer or the onboard RAM on a discrete GPU.

• Traditional CPU Cache Hierarchy. The CPU cores are supplied with a standard cache hierarchy including private L1 data and instruction caches, and a 1MB private L2 cache (4MB total L2 capacity).

• Write Combining Buffers. Each CPU core in Llano has four uncached, write-combining buffers. Ideally, these buffers can provide relatively high write bandwidth (by merging memory transactions) but are typically avoided due to a lack of strict ordering and extremely high read latency. However, in Fusion, the WCBs are utilized when the CPU needs to write into "local" memory. In this case, they exploit the higher write bandwidth and avoid polluting the CPU cache with data that will be primarily used on the GPU cores.

• Fusion Compute Link. The FCL provides a high latency, low bandwidth path from the GPU cores to the CPU cache. It is arbitrated by the UNB, and has the capability to snoop CPU cache transactions, providing for full coherency between the CPU and GPU. Due to its low bandwidth (compared to other memory paths in the system), it should primarily be used for fine-grained data sharing between the CPU and GPU.

• Radeon Memory Bus. The RMB is a much wider bus that connects the GPU cores to the "local" partition of physical memory and mimics the performance of RAM access in a discrete GPU: high latency and high bandwidth. It bypasses all caches (except L1 and texture), and can saturate DRAM bandwidth.

Simply put, CPU-like accesses are supported by traditional caches, GPU access patterns are handled by the RMB, and the FCL enables cache coherency only when it is needed, with most of the pathways enabled through the address translation capabilities of the UNB.
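The sketch below illustrates how these paths can surface to an OpenCL programmer on an APU. The buffer flags are standard OpenCL; the mapping of each flag to a physical partition (RMB-reachable "local" memory versus UNB/FCL-reachable system memory) is specific to AMD's runtime of this era and is stated here as an assumption rather than documented behavior. The context, queue, and sizes are assumed to have been created earlier.

  cl_int err;

  /* Default device buffer: typically placed in the "local" partition,
     which the fGPU streams over the RMB at full DRAM bandwidth. */
  cl_mem d_buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, NULL, &err);

  /* Host-allocated buffer: a candidate for zero-copy sharing, reached by
     the fGPU through the UNB/FCL and suited to fine-grained sharing. */
  cl_mem s_buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                                bytes, NULL, &err);

  /* The CPU touches the shared buffer through a map/unmap pair instead of
     an explicit clEnqueueWriteBuffer copy. */
  float *p = (float *)clEnqueueMapBuffer(queue, s_buf, CL_TRUE, CL_MAP_WRITE,
                                         0, bytes, 0, NULL, NULL, &err);
  for (size_t i = 0; i < n; ++i)
      p[i] = 0.0f;
  clEnqueueUnmapMemObject(queue, s_buf, p, 0, NULL, NULL);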


(a) CPU (b) AMD/ATI Cypress GPU

Figure 2: Block Diagram of Core i7-2600 (Sandy Bridge) and Cypress Architectures. Blue elements indicate memory or cache while red indicates processing elements. The dotted lines show that each SIMD engine in Cypress contains sixteen processors, each with five ALUs, a shared L1, and a shared scratchpad memory.

               Cores   Clock   Peak FLOPS       RAM    Bus Width   Mem. Clock   Bandwidth   TDP
Units          #       MHz     GFLOPS (SP/DP)   GB     Bits        MHz          GB/s        Watts
Core i7-2600   4       3400    108.8/54.4       8      128         1333         21          95
Llano's CPU    4       2900    46.4/23.2        8      128         1866         29.9        100
Llano's fGPU   400     600     480/0            (shared with Llano's CPU)
Redwood        400     775     620/0            0.5    128         1000         64          64
Cypress        1600    825     2640/528         2      256         1150         147.2       225

Table 1: Architecture Specifications. Note that Llano’s CPU and fGPU cores share the same interface to physical memory and have a combined TDP of 100W.
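The single precision peak rates in Table 1 follow directly from the core counts and clocks if one assumes, as is standard for these architectures, that each GPU stream core retires one fused multiply-add (two floating point operations) per cycle. For Llano's fGPU, for example,

  \mathrm{Peak_{SP}} = 400\ \text{cores} \times 2\ \tfrac{\text{FLOP}}{\text{cycle}} \times 0.6\ \text{GHz} = 480\ \text{GFLOPS},

and the same arithmetic reproduces the 620 GFLOPS (400 x 2 x 0.775 GHz) and 2640 GFLOPS (1600 x 2 x 0.825 GHz) entries for Redwood and Cypress.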

3. LATENCY OPTIMIZED VS. THROUGHPUT OPTIMIZED CORES

Another distinction between the traditional CPU core and the GPU core is the marked difference in the allocation of transistors and die space. In CPUs, a substantial portion of these resources has been devoted to optimizing latency in single-threaded programs using caches, complex techniques for instruction-level parallelism including out-of-order execution, and other specialized units like branch predictors. In comparison to CPUs, GPUs have throughput-optimized cores that are much simpler, but GPUs tend to have far more cores. In addition, GPUs can schedule thousands of application threads onto these cores in order to hide memory access latency. For example, contemporary GPUs (e.g. Cypress and Llano's fGPU; see the core counts in Table 1) can have ten to one hundred times as many cores as CPUs. In these throughput-oriented designs, performance is achieved through massive, fine-grained parallelism, and, particularly for HPC, a large number of floating-point units.

The question, then, is how many resources should be devoted to serial performance (latency-optimized) compared to parallel performance (throughput-optimized). For instance, are wide SIMD units on the CPU cores still beneficial? Would those resources be better spent on more GPU cores, which would handle the majority of the floating point workload? This complex tradeoff has serious implications for sustained performance and depends on application characteristics like the fraction of parallel work, the instruction mix, and the requirements for synchronization.

It is difficult to characterize this tradeoff in a way that will generalize well to any application. However, the amount of serial performance sacrificed to obtain parallel performance can be quantified by comparing an APU to a traditional CPU. This loss can then be weighed against increased parallel performance on an application-specific basis.

3.1 Experimental Platform

Comparing a current APU architecture to a traditional CPU and a discrete GPU can help illustrate some of the tradeoffs of fused designs. Specifically, we evaluate the concrete examples (shown in Figure 2) of AMD's A8-3850 Llano APU (with fGPU), the ATI Radeon HD5670 (codenamed Redwood), which is most similar to the fGPU of Llano, and the high-end ATI FirePro v8800 (Cypress). Also, any power or performance measurements involving the Redwood GPU use the Llano as a host system with the fGPU disabled to ensure consistency. On some occasions, we also compare Llano to an Intel Core i7-2600 CPU (Sandy Bridge architecture). This CPU is similar in terms of core count and power envelope, but devotes far fewer resources to graphics operations. Its integrated graphics hardware does not support OpenCL and is not used in any of our tests. Table 1 gives a more detailed listing of the differences among Llano, Redwood, Cypress, and Sandy Bridge. None of the test configurations used ECC memory, which is not supported by these consumer versions of the architectures. For further introduction to GPU architectures in general, we refer the reader to the overview from Owens [19] and the discussion of the AMD architecture by Daga et al. [8].

Unless otherwise specified, all measurements were taken using the AMD APP SDK v2.4, SHOC [9] v1.1.2, GCC v4.4.5, and Scientific Linux v6.1.

3.2 CPU Performance

To evaluate Llano CPU performance, we compared the performance of a simple dense matrix-matrix multiply operation with another contemporary processor with a similar power envelope, the Intel Core i7-2600 CPU (with the Sandy Bridge microarchitecture). This benchmark is intended to show the processor's practical upper limit for computation rate. We also compared the performance of the High Performance Computing Challenge (HPCC) [10] benchmark suite on both processors. Because this benchmark suite measures the performance of several types of operations, it provides a more complete picture of CPU performance than the simple matrix-matrix multiply benchmark. The specifications of the systems used in this comparison are shown in Table 1.
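For reference, a dense matrix multiply measurement of this kind reduces to a timed call into a vendor BLAS (ACML, MKL, or the OpenCL-based APPML all expose similar GEMM entry points). The C sketch below, written against the portable CBLAS interface, shows the shape of such a test; the matrix order, initialization, and timing choices are illustrative assumptions rather than the exact harness behind Figures 3 and 4.

  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>
  #include <cblas.h>   /* link against any CBLAS implementation */

  int main(void)
  {
      const int n = 4096;                      /* matrix order */
      double *A = malloc(sizeof(double) * n * n);
      double *B = malloc(sizeof(double) * n * n);
      double *C = malloc(sizeof(double) * n * n);
      if (!A || !B || !C) return 1;
      for (long i = 0; i < (long)n * n; ++i) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

      struct timespec t0, t1;
      clock_gettime(CLOCK_MONOTONIC, &t0);
      /* C = 1.0 * A * B + 0.0 * C, row-major, no transposes */
      cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                  n, n, n, 1.0, A, n, B, n, 0.0, C, n);
      clock_gettime(CLOCK_MONOTONIC, &t1);

      double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
      printf("DGEMM: %.2f GFLOP/s\n", 2.0 * n * n * (double)n / sec / 1e9);

      free(A); free(B); free(C);
      return 0;
  }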
On the Sandy Bridge system, HyperThreading was disabled to avoid sharing floating point hardware between multiple threads within each core; thus, the Sandy Bridge test system supported four hardware threads total, like the Llano system. Turbo mode was enabled on the Sandy Bridge system, allowing the processor to increase the clock speed of a single core if it was performing a computationally intensive task while other cores on the chip were less busy. Also, the software stack for this system differs slightly from the listing in Section 3.1, with the Sandy Bridge tests using the Intel compiler v2011 SP1.8.274, MPI v4.0.3, and MKL math library v10.3.8, while the Llano was tested using the PGI compiler v11.10, ACML v4.4 and v5.0, and OpenMPI v1.5.4.

Llano and Sandy Bridge performance for the SGEMM and DGEMM benchmarks are shown in Figures 3 and 4, for several threading configurations and a range of matrix sizes. For DGEMM, the Core i7-2600 consistently outperforms the Llano CPU, more than doubling its performance on the largest matrix. No results are shown from the Llano fGPU since it does not support double precision.

In contrast, the Llano's fGPU outperforms the Core i7-2600 on the SGEMM benchmark on large matrices, even when four threads are used. Note that the Llano benchmark used AMD's APPML math library v1.6, which is OpenCL-based.

Figure 3: SGEMM Performance (one, two, and four CPU threads for Sandy Bridge and the OpenCL-based AMD APPML for Llano's fGPU)

Figure 4: DGEMM CPU Performance for one, two, and four threads

Llano and Core i7-2600 performance on the HPCC benchmark suite is summarized in Table 2. The measurements from the table were obtained using four MPI processes (one per core) on each system. On the AMD system, we built HPCC with both the ACML 4.4 and 5.0 math libraries, and we tried explicitly setting the number of OpenMP threads to be used by the ACML library. We found slightly better performance from using ACML 4.4 on this system for the HPL and DGEMM subtests, without specifying the number of OpenMP threads, so the numbers presented in Table 2 reflect this scenario. On the Intel system, we ran HPCC under several scenarios. We tried pinning each of the four HPCC processes to a specific processor core, and we tried explicitly setting the number of OpenMP threads to be used by the MKL math library. With these single node tests, performance did not vary much under these different scenarios, but we did notice slightly better performance for the HPL and DGEMM subtests when pinning processes but not explicitly setting the number of OpenMP threads. Hence the Sandy Bridge numbers presented in Table 2 are for this scenario.

Benchmark            Units     AMD A8-3850   Intel i7-2600
HPL                  TFlop/s   0.0290        0.0949
SingleDGEMM          GFlop/s   11.1037       24.7620
StarDGEMM            GFlop/s   10.9816       25.8956
PTRANS               GB/s      0.4093        2.3594
SingleRandomAccess   GUP/s     0.0363        0.0671
StarRandomAccess     GUP/s     0.0165        0.0253
MPIRandomAccess      GUP/s     0.0490        0.0912
SingleSTREAM Triad   GB/s      5.4585        13.3438
StarSTREAM Triad     GB/s      1.9530        3.5079
SingleFFT            GFlop/s   0.9810        2.8007
StarFFT              GFlop/s   0.6822        1.6154
MPIFFT               GFlop/s   1.6292        3.5563

Table 2: HPCC Performance

Llano's design reflects the perspective that the programmer will use the fGPU for floating point intensive work; thus, the CPU microarchitecture is relatively simple, and the performance penalty for not following this distribution of work is substantial. In contrast, the Core i7-2600 design reflects the perspective that the CPU is the primary computational device in the system. These differing perspectives are evidenced in the HPCC measurements, where the Core i7-2600 outperformed the A8-3850 by a margin larger than would be expected based on clock speed differences alone. Interestingly, the gap was largest for the computationally intensive subtests, but narrowed significantly for subtests like RandomAccess that are more focused on memory hierarchy performance.
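Several of the subtests where the gap narrows are bandwidth bound. STREAM Triad, for example, performs only two floating point operations for roughly 24 bytes of memory traffic per element, so its sustained rate tracks the memory system rather than peak FLOPS. A minimal OpenMP rendition of the loop (an illustration of the access pattern, not the HPCC source code) is:

  /* a[i] = b[i] + scalar * c[i]: two reads and one write per two FLOPs,
     so sustained performance follows memory bandwidth, not peak FLOPS. */
  void stream_triad(double *a, const double *b, const double *c,
                    double scalar, long n)
  {
  #pragma omp parallel for
      for (long i = 0; i < n; ++i)
          a[i] = b[i] + scalar * c[i];
  }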
4. CAPACITY VS. BANDWIDTH

The next tradeoff is the type of physical memory used for the base of the fused memory hierarchy. Traditionally, CPUs have attempted to optimize latency to high capacity memory, like the DDR3 used in the Llano APU. Conversely, GPUs, which have traditionally been concerned with repeatedly streaming a fixed-size frame buffer, focus on achieving maximum bandwidth. In order to achieve this bandwidth, GDDR (graphics double data rate) standards began to diverge from their DDR counterparts to place an emphasis on wider memory busses and higher effective clock speeds. However, given a similar power budget, GDDR tends to have much less capacity than DDR. In a discrete architecture, each type of core can be paired with the most applicable memory, but in a fused architecture, the latency- and throughput-oriented cores must use the same type of physical memory.

In the near term, this is essentially a choice between DDR and GDDR memory. In the future, however, this tradeoff is likely to become much more complex as the configuration of 3D stacked memories and advanced memory controllers may allow for increased flexibility in combining different types of memories in one node.

In our concrete examples, both the Redwood and Cypress discrete GPUs use GDDR5, while Llano's fGPU shares DDR3 with the CPU cores. As shown in Table 1, the gap in capacity is quite large, but the most salient difference is bandwidth. This is in large part due to the fact that GDDR5 is quad-pumped while DDR3 is double-pumped. That is, DDR3 delivers two bits of data per signal line per clock cycle (on the rising and falling edges of the clock) while GDDR5 transfers those two bits and an additional two bits at each midpoint using a second clock that is ninety degrees out of phase with the first. The difference in peak bandwidth illustrates the impact of these architectural differences—with the same bus width and a lower clock speed, Redwood has roughly twice the bandwidth of Llano's fGPU.

Figure 5: Speedup of a discrete Radeon HD5670 (Redwood, GDDR5) GPU and a discrete FirePro v8800 (Cypress, GDDR5), versus the fused GPU with DDR3 in Llano, on a logarithmic scale. Values greater than 1x indicate the discrete GPU outperforming Llano.

4.1 DDR3 vs. GDDR5 Memory

We evaluated how this difference translates to application performance using the Scalable Heterogeneous Computing (SHOC) benchmark suite [9]. SHOC includes synthetic benchmarks designed to measure bandwidth to each of the OpenCL memory spaces as well as real application kernels. Figure 5 shows results from SHOC on the fGPU using DDR3 memory and the two discrete GPUs using GDDR5. Results are on a logarithmic scale and show the speedup of the discrete GPUs over Llano.

The HD5670 is the closest GDDR5-based discrete GPU to Llano's fGPU: the shader architecture is essentially the same, and it has only a 30% increase in clock speed. Performance results from SHOC's MaxFlops benchmark reflect this, showing an approximately 25% improvement in pure floating point performance. Tests limited by global memory bandwidth show a much larger improvement, with the GDDR5 resulting in an approximate 3x speedup. The FirePro v8800 shows improvements of 2x to 8x, commensurate with or exceeding the increase in power consumption.

The lower half of Figure 5 shows much smaller speedups for the discrete GPUs. These lower results are benchmarks with a dependence on PCIe bus transfers, whether as an inherent part of the benchmark or as a GPU-centric benchmark with the CPU-GPU transfer of input and output data included. Although use of architecture-specific flags when creating buffers and special techniques when performing data transfers can allow somewhat higher bandwidths between System and Local memory in Fusion (see Section 2.1), the standard OpenCL mechanism for these transfers performs only slightly better than the discrete v8800 and is somewhat slower than transfer to the discrete HD5670. However, the tighter coupling of the fGPU with the CPU — seen, in part, in the faster Queue Delay performance due to the removal of PCIe latency — results in a net performance increase in the Llano, despite its slower shader clock and reduced memory bandwidth.

4.2 Contention

A performance model may be able to predict the preferred memory type for a given application based on its memory access characteristics. However, in a fused architecture, this tradeoff is complicated by contention effects between the CPU and GPU cores. For example, in the worst case, the throughput-optimized cores may generate enough memory requests to starve the latency-optimized cores, and, at a minimum, their traffic is likely to inject significant, unexpected latency. This type of resource exhaustion has traditionally been difficult to model.

To measure these contention effects on Llano, we ran the SHOC benchmark suite on the fGPU under two conditions. First, we measured performance with the CPU cores idle and then with the CPU cores running a bandwidth-bound kernel which lasted for the duration of the benchmark suite. For the CPU kernel, we used JACOBI, a micro-kernel that solves Laplace equations using Jacobi iteration, available in any numerical analysis text [24]. The tested kernel uses two 12288 x 12288 matrices with one hundred iterations. Our hypothesis was that the increased memory traffic from the Jacobi kernel would saturate memory bandwidth and cause degraded performance in the fGPU. The results in Figure 6 show a penalty ranging from fifteen to twenty-two percent for bandwidth-bound benchmarks and smaller penalties for those that incorporate at least a modest floating point intensity.

We include results from the same experiment using the HD5670 to confirm that this penalty is unique to the fused memory hierarchy. That is, no contention should occur with the HD5670, and while the measurements show some minor noise, the magnitude of that noise is less than one percent.

Figure 6: Performance penalty caused by memory contention between the CPU and the GPU.

The CPU performance under these same scenarios shows the average runtime of the Jacobi benchmark increasing under contention. As seen in Table 3, we observe almost no contention with the HD5670, but a 10% penalty when simultaneously using the fGPU.

                               Average Runtime
No CPU/GPU Contention          22.3 s
Contention with HD5670 GPU     22.5 s
Contention with Llano fGPU     24.5 s

Table 3: Performance of the CPU Jacobi kernel under various contention scenarios.
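The following C/OpenMP sketch shows the kind of Jacobi sweep described above; the exact implementation and data layout used in the experiments may differ. Each update reads four neighbors and writes one value, so with 12288 x 12288 matrices every iteration streams roughly 2.4 GB through the shared DDR3 interface, far exceeding any on-chip cache.

  /* One Jacobi sweep on an n x n grid: each interior point becomes the
     average of its four neighbors, as in the Laplace equation iteration. */
  void jacobi_sweep(const double *u, double *unew, int n)
  {
  #pragma omp parallel for
      for (int i = 1; i < n - 1; ++i)
          for (int j = 1; j < n - 1; ++j)
              unew[i * n + j] = 0.25 * (u[(i - 1) * n + j] + u[(i + 1) * n + j] +
                                        u[i * n + j - 1] + u[i * n + j + 1]);
  }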

5. POWER VS. PERFORMANCE

5.1 GPU Power Usage

Tighter integration of separate components like a GPU and CPU generally results in lower power usage. The reduction in redundant hardware requires less power, and when components are physically closer together, less energy is required for data movement across wires and board traces. This, in turn, can also lead to secondary reductions, such as lower requirements for cooling. We study the power usage of the Llano fGPU and compare it with the discrete Radeon HD5670. We measure whole-system power with an inline AC power meter and record the peak power consumption of the system. During the measurement, the default power saving features (i.e., employing appropriate power gating when possible) are enabled for both Llano and the discrete GPU.

Figure 7 displays system power consumption for specific GPU operations. We first notice that Llano using the fGPU has a lower power draw than the HD5670 when idle, by about 16.7%. We attribute this difference to the extra power consumption of global memory and peripheral circuitry on the HD5670. We further notice that the power difference is increased when the system is performing data transfers between the host and device (labeled 'Host/Dev Transfer' in the figure), floating point operations (labeled 'Compute FLOPS'), and memory intensive operations (labeled 'Device Memory'). The lower power consumption of data transfer in Llano comes from the shorter data path and the elimination of the PCI bus and controller; the lower power consumption of floating point and memory operations comes from a lower shader clock on the Llano fGPU and from the simplified peripheral circuitry on the fGPU. Llano achieves an average power savings of approximately 20% compared to the HD5670 when running the computation kernels in the Level 1 tests from the SHOC benchmark suite.

Figure 7: Full-system maximum power usage when using the integrated GPU in Llano and when using a discrete Radeon HD5670 GPU. The dark and light colors indicate the range of power usage across tests in each category.

5.2 Evaluation

Figure 8 shows the increase in power consumption versus the increase in performance when using the HD5670 compared to Llano for individual SHOC test results which do not depend on CPU-GPU data transfers. The vertical axis shows the best speedup obtained by using the discrete HD5670, and the horizontal axis shows the increase in peak power usage from the HD5670. We first notice that all points within the 2D performance-power plane are on the upper right side of the point (1.0x, 1.0x), which demonstrates better performance of the HD5670 accompanied by greater power consumption.

The results in Figure 8 can be roughly clustered into three groups. The results above y = 2.0x, including DevMemRead, DevMemWrite, MD, and Reduction, show the largest performance speedup with a relatively small power increase. The benchmarks in this group are generally characterized as memory operation intensive. The group on the right side of the line x = 1.3x, including MaxFlops and S3D, represents those with the least speedup and the greatest power consumption. The benchmarks in this group are, in general, characterized as floating point operation intensive. The remaining benchmarks show a wider range of floating point and memory operation intensities; however, they do not approach either the performance improvement or the power increase of the other two groups.

Figure 8: Performance and system power usage of the discrete HD5670 GPU on individual SHOC benchmarks relative to using the integrated GPU in Llano. This set of results does not include CPU-GPU transfers.

Figure 9 shows similar results, this time for SHOC benchmarks which do include CPU-GPU data transfers. Note that some results are included in both Figures 8 and 9, but with different performance; in the latter, the data transfer times to GPU memory are incorporated. The relationship between power and performance is less straightforward in these results; the Llano fGPU performs relatively better once data transfers are included, often outperforming the HD5670, while power usage of the HD5670 is typically 30% higher than the Llano fGPU.

Figure 9: Performance and system power usage of the discrete HD5670 GPU on individual SHOC benchmarks relative to using the integrated GPU in Llano. This set of results does include CPU-GPU transfers.

These results reveal that the high bandwidth of the HD5670 brings a performance benefit for memory intensive applications at the cost of a small increase in power consumption (i.e., the energy-delay product (EDP) is higher), while the tighter integration of Llano's fGPU may result in better energy efficiency for compute-intensive applications or those that require frequent CPU-GPU memory transfers.
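For clarity, the energy-delay product referred to above is the standard metric

  \mathrm{EDP} = E \cdot t = (P_{\mathrm{avg}} \cdot t) \cdot t = P_{\mathrm{avg}}\, t^{2},

where P_avg is average system power and t is the benchmark runtime; this is the textbook definition rather than a formula taken from the measurements. Under this metric, a device that draws roughly 30% more power, as the HD5670 does here, must finish in less than 1/sqrt(1.3), or about 88% of the baseline time (a speedup of roughly 1.14x), before its EDP breaks even.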

6. MODELS VS. RUNTIME SYSTEMS

One of the reasons for the success of GPGPU and, especially, CUDA is the effectiveness of the programming model. CUDA's useful abstractions allowed access to the hardware resources of the GPU without having to learn graphics operations or very low-level device characteristics. For instance, the notion of a grid of thread blocks is easy to understand and hides many details about how the GPU actually schedules operations. The increased complexity of a fused heterogeneous architecture raises new questions for programming models. Which cores should a task run on? How does the programmer indicate that a task requires cache coherency among different cores? These questions begin to capture the next tradeoff, the contention between the desire for high-level abstractions and the complexity of the runtime systems required to support those abstractions.

Consider the problem of scheduling in more detail. Current capabilities in OpenCL allow the programmer to express a task that can run on the CPU or GPU cores. The programmer must then explicitly specify which device to use. The task may perform much better on one type of core, but it is the application's responsibility to track this and submit the task to the appropriate OpenCL device queue. Advanced scheduling features, such as task migration (from one core type to the other), are not yet available. However, the argument can be made that OpenCL is at a low level of abstraction and the appropriate scheduling should be done by the application. At this extreme end of the tradeoff, maximal domain knowledge can be exploited, but the burden on application developers is large.
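Under current OpenCL, that choice is entirely manual. The sketch below (error handling omitted; it is not code from any runtime discussed here) builds one command queue for the CPU device and one for the GPU device of a platform; the application itself must then decide which queue, and therefore which core type, receives each kernel.

  #include <CL/cl.h>

  /* Create one in-order queue per device type; the caller picks the queue
     (and therefore the core type) for every kernel it enqueues. */
  static void make_queues(cl_platform_id platform, cl_context *ctx_out,
                          cl_command_queue *cpu_q, cl_command_queue *gpu_q)
  {
      cl_device_id cpu_dev, gpu_dev;
      clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &cpu_dev, NULL);
      clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &gpu_dev, NULL);

      cl_device_id devs[2] = { cpu_dev, gpu_dev };
      cl_context ctx = clCreateContext(NULL, 2, devs, NULL, NULL, NULL);

      /* The scheduling decision lives in the application, not the runtime. */
      *cpu_q = clCreateCommandQueue(ctx, cpu_dev, 0, NULL);
      *gpu_q = clCreateCommandQueue(ctx, gpu_dev, 0, NULL);
      *ctx_out = ctx;
  }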

On the other hand, investment in a robust runtime system could significantly reduce this burden. Indeed, there have been several successes for these runtimes on CPUs, including Cilk++ [15], with its work-stealing scheduler, and StarPU [2] for runtime scheduling on CPUs and discrete GPUs. These runtimes often outperform static scheduling strategies. It seems reasonable to assume that static scheduling will only become more difficult on the APU, as the scheduler will have to include task affinities for a given type of core, costs of moving memory between pageable and "local" memory partitions, and coscheduling requirements for deciding when to enable cache coherency between CPU and fGPU, which cannot be used in the general case for performance reasons. Also, it remains to be seen if the increased flexibility of the abstractions for an APU could be successfully incorporated into a distributed environment in a model similar to Charm++ [14] or Chapel [3].

Unfortunately, at the time of this writing, no such runtime systems were available for evaluation on the APU. Despite this, we believe the availability of an effective runtime will be an important factor in the success of any APU.

7. RELATED WORK

While the Fusion APU architecture [4] is the first of its kind, the trend toward increased heterogeneity (and specifically GPGPU) was recognized by Owens et al. in their review of heterogeneity in HPC, covering several successes in game physics and biophysics [19]. The trend towards tighter integration is mentioned by Kaeli and Akodes, who document the movement of multicore processors and GPUs into mainstream consumer electronics like tablets and cellular phones [13] and identify the rapidly expanding class of problems which is amenable to GPGPU.
Prior to the release of the APU, several teams evaluated the performance of the ATI "Evergreen" GPU architecture, which includes two of the GPUs studied in this work, Cypress and Redwood. Both of these GPUs heavily influenced the GPU core architecture of Llano. These efforts include a general study of performance and optimization techniques [16], a detailed characterization of the performance of atomic operations [17], and an evaluation of the energy efficiency of the architecture [26]. Others have also studied performance optimization on discrete NVIDIA GPUs [7, 20, 21, 22], some of which use a similar set of kernels to our experiments, including GEMM [1, 11, 25], FFT [18], and molecular dynamics [5, 6]. There has also been an assessment of the Fusion architecture using the low power "Zacate" APU by Daga et al. [8], with a particular emphasis on absolute performance compared to a discrete GPU and the importance of the APU for scientific computing. Our results with the SHOC benchmark suite [9], also used by Daga, confirm the applicability of their findings at a different power scale. SHOC has also been used to study contention effects in previous work by Spafford et al. on NVIDIA-based platforms [23] with multiple GPUs, which explores how PCIe contention becomes more complicated when traffic to the interconnect is considered.

8. CONCLUSIONS

We have identified five important tradeoffs for those designing or programming heterogeneous architectures with fused memory hierarchies. When constrained to a single type of physical memory, the choice between designs which focus on capacity and those that maximize bandwidth has a huge impact on GPU performance. So great, in fact, that this choice largely shaped the energy efficiency results in the last tradeoff, providing a fairly consistent method for grouping the mixed results of Figure 8. When system CPU-GPU data transfers become a more dominant portion of runtime (as seen in Figure 9), the efficiency benefits of tighter coupling do seem to overcome the specialized memory.

In the design of these fused memory hierarchies, the benefit of the tradeoff between cache coherency and scalability remains highly workload-specific and is not yet well understood. The addition of the FCL and the capability for cache coherency in Llano is an important first step. However, it will take time for this capability to make its way into mainstream programming models and scientific applications.

Furthermore, an APU-like design may only be beneficial if the application has a substantial parallel fraction and that fraction can be run on the throughput-oriented cores. The difference in performance of Llano's CPU cores and the Sandy Bridge CPU reflects the costs of not utilizing the appropriate core type or the potential penalty for devoting too many resources to a set of cores which won't be fully used by an application. When moving to a fused heterogeneous platform, an effective performance model and characterization of the instruction mix will be critical for choosing the correct core type for a kernel.

Finally, in the exploration of the tradeoff between simple abstractions and complex runtimes, we have identified a significant need for a robust APU runtime system. New concepts from the APU architecture impose additional complexity on programming models, including kernel core affinity, coherence specification, and coscheduling requirements. While advanced applications may develop custom infrastructure to account for this complexity, the level of effort required to do so precludes many programmers from making this investment.

9. ACKNOWLEDGMENTS

This research is sponsored in part by the Office of Advanced Computing Research, U.S. Department of Energy. The work was performed at Oak Ridge National Laboratory, which is managed by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725.

10. REFERENCES
[1] E. Agullo, J. Demmel, J. Dongarra, B. Hadri, J. Kurzak, J. Langou, H. Ltaief, P. Luszczek, and S. Tomov. Numerical Linear Algebra on Emerging Architectures: The PLASMA and MAGMA Projects. Journal of Physics: Conference Series, 180, 2009.
[2] C. Augonnet, S. Thibault, R. Namyst, and P.-A. Wacrenier. StarPU: A Unified Platform for Task Scheduling on Heterogeneous Multicore Architectures. In Euro-Par 2009 Parallel Processing, volume 5704 of Lecture Notes in Computer Science, pages 863–874. 2009.
[3] B. L. Chamberlain, D. Callahan, and H. P. Zima. Parallel Programmability and the Chapel Language. The International Journal of High Performance Computing Applications, 2007.
[4] N. Brookwood. AMD Fusion Family of APUs: Enabling a Superior, Immersive PC Experience. http://sites.amd.com/us/Documents/48423B_fusion_whitepaper_WEB.pdf, Mar. 2010.
[5] W. M. Brown, A. Kohlmeyer, S. J. Plimpton, and A. N. Tharrington. Implementing Molecular Dynamics on Hybrid High Performance Computers — Particle–Particle Particle–Mesh. Computer Physics Communications, 183(3):449–459, 2012.
[6] W. M. Brown, P. Wang, S. J. Plimpton, and A. N. Tharrington. Implementing Molecular Dynamics on Hybrid High Performance Computers — Short Range Forces. Computer Physics Communications, 182(4):898–911, 2011.
[7] S. Carrillo, J. Siegel, and X. Li. A Control-Structure Splitting Optimization for GPGPU. In Proceedings of the 6th ACM Conference on Computing Frontiers, CF '09, pages 147–150, New York, NY, USA, 2009. ACM.
[8] M. Daga, A. Aji, and W. Feng. On the Efficacy of a Fused CPU+GPU Processor (or APU) for Parallel Computing. In 2011 Symposium on Application Accelerators in High-Performance Computing (SAAHPC), pages 141–149, July 2011.
[9] A. Danalis, G. Marin, C. McCurdy, J. S. Meredith, P. C. Roth, K. Spafford, V. Tipparaju, and J. S. Vetter. The Scalable Heterogeneous Computing (SHOC) Benchmark Suite. In Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, GPGPU '10, pages 63–74, New York, NY, USA, 2010. ACM.
[10] J. J. Dongarra and P. Luszczek. Introduction to the HPCChallenge Benchmark Suite. Technical Report ICL-UT-05-01, Innovative Computing Laboratory, University of Tennessee-Knoxville, 2005.
[11] T. Endo, A. Nukada, S. Matsuoka, and N. Maruyama. Linpack Evaluation on a Supercomputer with Heterogeneous Accelerators. In 2010 IEEE International Symposium on Parallel and Distributed Processing (IPDPS), pages 1–8, 2010.
[12] J. Dongarra, P. Beckman, et al. International Exascale Software Roadmap. International Journal of High Performance Computing Applications, 25(1), 2011.
[13] D. Kaeli and D. Akodes. The Convergence of HPC and Embedded Systems in Our Heterogeneous Computing Future. In 2011 IEEE 29th International Conference on Computer Design (ICCD), pages 9–11, Oct. 2011.
[14] L. V. Kale and G. Zheng. Charm++ and AMPI: Adaptive Runtime Strategies via Migratable Objects. Advanced Computational Infrastructures for Parallel and Distributed Adaptive Applications, pages 265–282, 2009.
[15] C. Leiserson. The Cilk++ Concurrency Platform. The Journal of Supercomputing, 51:244–257, 2010.
[16] M. Daga, T. Scogland, and W. Feng. Performance Characterization and Optimization of Atomic Operations on AMD GPUs. Technical Report TR-11-08, Computer Science, Virginia Tech. Retrieved from http://eprints.cs.vt.edu/archive/00001159/.
[17] M. Elteir, H. Lin, and W. Feng. Performance Characterization and Optimization of Atomic Operations on AMD GPUs. In 2011 IEEE International Conference on Cluster Computing (CLUSTER), pages 234–243, Sept. 2011.
[18] A. Nukada and S. Matsuoka. Auto-tuning 3-D FFT Library for CUDA GPUs. In Proceedings of the Conference on High Performance Computing, Networking, Storage and Analysis, SC '09, pages 30:1–30:10, New York, NY, USA, 2009. ACM.
[19] J. D. Owens, M. Houston, D. Luebke, S. Green, J. E. Stone, and J. C. Phillips. GPU Computing. Proceedings of the IEEE, 96(5):879–899, May 2008.
[20] S. Ryoo, C. I. Rodrigues, S. S. Stone, S. S. Baghsorkhi, S.-Z. Ueng, J. A. Stratton, and W. Hwu. Program Optimization Space Pruning for a Multithreaded GPU. In Proceedings of the 6th Annual IEEE/ACM International Symposium on Code Generation and Optimization, CGO '08, pages 195–204, New York, NY, USA, 2008. ACM.
[21] S. Ryoo, C. I. Rodrigues, S. S. Stone, S. S. Baghsorkhi, S.-Z. Ueng, and W. Hwu. Program Optimization Study on a 128-Core GPU. In Proceedings of the First Workshop on General Purpose Processing on Graphics Processing Units, 2007.
[22] S. Ryoo, C. I. Rodrigues, S. S. Stone, J. A. Stratton, S.-Z. Ueng, S. S. Baghsorkhi, and W. Hwu. Program Optimization Carving for GPU Computing. Journal of Parallel and Distributed Computing, 68(10):1389–1401, 2008. General-Purpose Processing using Graphics Processing Units.
[23] K. Spafford, J. S. Meredith, and J. S. Vetter. Quantifying NUMA and Contention Effects in Multi-GPU Systems. In Proceedings of The Fourth Workshop on General Purpose Processing on Graphics Processing Units. ACM, 2011.
[24] J. Stoer and R. Bulirsch. Introduction to Numerical Analysis. Springer, 2nd edition, 1996.
[25] V. Volkov and J. W. Demmel. Benchmarking GPUs to Tune Dense Linear Algebra. In Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, SC '08, pages –11, Piscataway, NJ, USA, 2008. IEEE Press.
[26] Y. Zhang, Y. Hu, B. Li, and L. Peng. Performance and Power Analysis of ATI GPU: A Statistical Approach. In 2011 6th IEEE International Conference on Networking, Architecture and Storage (NAS), pages 149–158, July 2011.