Lazy Release Consistency for Gpus

Lazy Release Consistency for GPUs Johnathan Alsop† Marc S. Orr‡§ Bradford M. Beckmann§ David A. Wood‡§ †University of Illinois at Urbana–Champaign ‡University of Wisconsin–Madison §AMD Research [email protected] {morr,david}@cs.wisc.edu [email protected] Abstract—The heterogeneous-race-free (HRF) memory and hardware simple. In comparison, CPUs provide relatively model has been embraced by the Heterogeneous System Archi- simple memory models, but use complex and highly tecture (HSA) Foundation and OpenCLTM because it clearly optimized cache coherence protocols that enforce the single- and precisely defines the behavior of current GPUs. However, writer/multiple reader invariant [1]. Specifically, store compared to the simpler SC for DRF memory model, HRF has operations invalidate the target address at every private cache two shortcomings. The first is that HRF requires programmers other than the initiator’s. This complicated CPU approach is a to label atomic memory operations with the correct scope of syn- poor fit for GPUs for several reasons. First, a GPU core, called chronization. This explicit labeling can save significant coher- a compute unit (CU), has thousands of hardware threads, ence overhead when synchronization is local, but it is tedious called work-items. Sending invalidations on every store miss and error-prone. The second shortcoming is that HRF restricts would generate far too much invalidation traffic. Second, important dynamic data sharing patterns like work stealing. Prior work on remote scope promotion (RSP) attempted to re- managing the invalidations requires sophisticated cache solve the second shortcoming. However, RSP further compli- controllers that detract from the GPU’s primary application: cates the memory model and no scalable implementation of RSP graphics. Finally, writer-initiated invalidations often employ has been proposed. For example, we found that the previously inclusive caches, which are a poor fit for GPUs because their proposed RSP implementation actually results in slowdowns of aggregate L1 cache capacity approaches the size of a typical up to 30% on large GPUs, compared to a naïve baseline system GPU last-level cache. that forgoes work stealing and scopes. Meanwhile, DeNovo has For these reasons, GPUs take a different approach to been shown to offer efficient synchronization with an SC for DRF memory model, performing on average 21% better than synchronization. Specifically, they use simple bulk coherence our baseline system, but it introduces additional coherence traf- actions, like cache flushes and invalidates, at the fic to maintain ownership of all modified data. synchronization points in the program. This approach aligns with current memory models, like C++11 [2], where To resolve these deficiencies, we propose to adapt lazy re- programmers clearly identify inter-thread communication by lease consistency—previously only proposed for homogeneous operating on atomic variables. At these synchronization CPU systems—to a heterogeneous system. Our approach, called points, coarse-grain coherence actions, like cache flushes and hLRC, uses a DeNovo-like mechanism to track ownership of invalidates, are sufficient to implement memory models that synchronization variables, lazily performing coherence actions guarantee sequential consistency for data race-free (SC for only when a synchronization variable changes locations. hLRC DRF) programs [3]. allows GPU programmers to use the simpler SC for DRF memory model without tracking ownership for all modified Unfortunately, bulk coherence actions negatively affect data. Our evaluation shows that lazy release consistency pro- performance. Specifically, cache flushes incur long latencies vides robust performance improvement across a set of graph because they require all of the dirty cache blocks in the analysis applications—29% on average versus the baseline sys- initiator’s private caches to be written through the memory tem. hierarchy. Flash invalidations are fast, but degrade cache locality and cause excessive cache misses. Keywords—graphics processing unit (GPU); memory model; lazy release consistency; scope promotion; scoped synchroniza- To solve these problems, modern GPUs support scoped tion; work stealing synchronization [3][4][5]. Scopes takes advantage of the GPU’s hierarchical execution model to limit the cost of bulk I. INTRODUCTION coherence actions. For example, work-items executing on the Architects must carefully consider a plethora of tradeoffs same CU can communicate through the L1 cache without when specifying a new memory model and designing the incurring any cache flushes or invalidates. In contrast, work- hardware that implements it. With the emergence of items executing on different CUs are required to read and heterogeneous computing and high-throughput accelerators, write from the GPU’s monolithic last-level cache. there is an increasing tension to keep both the memory model 978-1-5090-3508-3/16/$31.00 ©2016 IEEE baseline RSP system-scope 1.4 System on Chip (SoC) 1.2 agent-scope 1 GPU wg-scope 0.8 Compute Unit (CU) work-group 0.6 CPU Speedup wavefront CU 0.4 0.2 0 L1 L1 8 CUs 128 CUs Memory Figure 1. RSP scalability on a small and large GPU. L2 Controller While scoped synchronization is successful in mitigating the cost of bulk coherence actions, it leads to a memory model Figure 2. Baseline example GPU. (e.g., SC for HRF [6][7][8]) with two significant registration, thus implementing lazy releases and potentially shortcomings. First, programmers are expected to explicitly reducing coherence traffic. hLRC achieves a speedup of on label atomic memory operations with the correct scope in average 29% on a large GPU with 128 CUs, when compared order to maximize performance, which is tedious and error- to the naïve baseline, and 7% on average compared to prone. Second, scoped synchronization does not use caches DeNovo. Finally, our implementation of hLRC builds off of effectively for important dynamic data sharing patterns like bulk synchronization flush and invalidate actions, which is work stealing. consistent with the current approach to GPU synchronization. To combat this second shortcoming, remote scope promotion [10] was recently proposed, but it is not a panacea. II. GPU CACHES AND SYNCHRONIZATION RSP further complicates the memory model and the initial A. GPU Architecture implementations of RSP, while effective for relatively small The GPU’s massively threaded architecture, depicted in GPUs, do not scale to large GPUs. Specifically, we found that Figure 2, targets highly concurrent applications. Specifically, RSP actually performs worse on a large 128-CU GPU when each GPU core, called a compute unit (CU), executes compared to a naïve baseline that forgoes work stealing and thousands of threads, called work-items, simultaneously. For scopes (Figure 1). example, a CU in the AMD GCN architecture has hardware Meanwhile, to combat the first shortcoming, the recent state for 2,560 work-items [11]. The GPU’s CUs are DeNovo proposal suggested that future GPUs should forgo connected to memory through a hierarchy of caches. scoped synchronization and support the simpler SC for DRF Typically, each CU has a private L1 cache to optimize memory model [8]. However, DeNovo tracks ownership for communication within a CU. The L1 caches tend to be small all written data, requiring additional traffic to request and and optimize for throughput. For example, the L1 cache is 16 revoke ownership registration. Also, when compared to kB in AMD’s GCN architecture [11] and up to 48 kB in current GPU designs, DeNovo’s benefits primarily arise from Nvidia’s Maxwell GPU [4]. To optimize communication locality in written data, which is limited in existing GPU between work-items on different CUs, it is common to compute applications. connect the L1 caches to a GPU-wide non-inclusive L2 cache. In this work, we introduce heterogeneous lazy release GPU work-items (wi) execute within an execution consistency (hLRC) for GPUs. Like DeNovo, our approach hierarchy that mirrors the GPU’s hierarchical design. The first eliminates scopes and enables SC for DRF on GPUs, level of the execution hierarchy is a wavefront, which is a achieving scalable synchronization for data sharing patterns small group of work-items (e.g., 64 on AMD GPUs, 32 on like work stealing. hLRC also uses atomic registration, as NVIDIA GPUs, 4 on Intel GPUs, etc.) that execute in lockstep proposed by Sung and Adve [9], to track exclusive ownership on the GPU’s data-parallel execution units. Wavefronts then of synchronization variables, but not all of stored data like execute in small teams called work-groups. Wavefronts in the DeNovo. hLRC also differs from DeNovo by performing same work-group execute on the same CU, which enables coherence actions when synchronization variables change them to synchronize through the L1 cache. Ultimately, a GPU executes a grid of work-groups. Thus, work-items in a grid Table 1. Simple GPU coherence actions. can communicate through the GPU’s L2. Finally, work-items Coarse-grain flush of all dirty data in the local L1 to the in a grid can communicate externally (e.g., with CPU threads) Flush local L1 next level of the memory hierarchy. through a common level of the memory hierarchy (e.g., the Coarse-grain invalidation of all valid data in the local memory controller). Inv local L1 L1. LD/ST/RMW x Atomic memory access on location x performed at the B. GPU Synchronization L1/L2 L1 or L2 cache Recall that each CU executes thousands of work-items Block a specific operation (op) at a particular cache or Lock op/x concurrently. Thus, to avoid excessive invalidation traffic, all ops on address x within a cache. CU0 CU1 CU0 (wi 0) CU0 (wi 1) CU0 CU1 CU0 CU1 <guarded> <guarded> <guarded> <guarded> ST_rel x ST_rel_wg x ST_rel_agt x ST_rel_wg x LD_acq x LD_acq_wg x LD_acq_agt x LD_acq_rm_agt x <guarded> <guarded> <guarded> <guarded> a. SC for DRF (no scopes) b. SC for HRF (work-group scope) c. SC for HRF (agent scope) d.

Load more