Efficient GPU Synchronization without Scopes: Saying No to Complex Consistency Models Matthew D. Sinclair† Johnathan Alsop† Sarita V. Adve†‡ † University of Illinois at Urbana-Champaign ‡ École Polytechnique Fédérale de Lausanne [email protected]

ABSTRACT
As GPUs have become increasingly general purpose, applications with more general sharing patterns and fine-grained synchronization have started to emerge. Unfortunately, conventional GPU coherence protocols are fairly simplistic, with heavyweight requirements for synchronization accesses. Prior work has tried to resolve these inefficiencies by adding scoped synchronization to conventional GPU coherence protocols, but the resulting memory consistency model, heterogeneous-race-free (HRF), is more complex than the common data-race-free (DRF) model. This work applies the DeNovo coherence protocol to GPUs and compares it with conventional GPU coherence under the DRF and HRF consistency models. The results show that the complexity of the HRF model is neither necessary nor sufficient to obtain high performance. DeNovo with DRF provides a sweet spot in performance, energy, overhead, and memory consistency model complexity.

Specifically, for benchmarks with globally scoped fine-grained synchronization, compared to conventional GPU with HRF (GPU+HRF), DeNovo+DRF provides 28% lower execution time and 51% lower energy on average. For benchmarks with mostly locally scoped fine-grained synchronization, GPU+HRF is slightly better – however, this advantage requires a more complex consistency model and is eliminated with a modest enhancement to DeNovo+DRF. Further, if HRF's complexity is deemed acceptable, then DeNovo+HRF is the best protocol.

Categories and Subject Descriptors
B.3.2 [Hardware]: Memory Structures – Cache memories; C.1.2 [Processor Architectures]: Single-instruction-stream, multiple-data-stream processors (SIMD); I.3.1 [Computer Graphics]: Graphics processors

Keywords
GPGPU, cache coherence, memory consistency models, data-race-free models, synchronization
1. INTRODUCTION
GPUs are highly multithreaded processors optimized for data-parallel execution. Although initially used for graphics applications, GPUs have become more general-purpose and are increasingly used for a wider range of applications. In an ongoing effort to make GPU programming easier, industry has integrated CPUs and GPUs into a single, unified address space [1, 2]. This allows GPU data to be accessed on the CPU and vice versa without an explicit copy. While the ability to access data simultaneously on the CPU and GPU has the potential to make programming easier, GPUs need better support for issues such as coherence, synchronization, and memory consistency.

Previously, GPUs focused on data-parallel, mostly streaming, programs which had little or no sharing or data reuse between Compute Units (CUs). Thus, GPUs used very simple software-driven coherence protocols that assume data-race-freedom, regular data accesses, and mostly coarse-grained synchronization (typically at GPU kernel boundaries). These protocols invalidate the cache at acquires (typically the start of the kernel) and flush (writethrough) all dirty data before the next release (typically the end of the kernel) [3]. The dirty data flushes go to the next level of the memory hierarchy shared between all participating cores and CUs (e.g., a shared L2 cache). Fine-grained synchronization (implemented with atomics) was expected to be infrequent and executed at the next shared level of the hierarchy (i.e., bypassing private caches).

Thus, unlike conventional multicore CPU coherence protocols, conventional GPU-style coherence protocols are very simple, without need for writer-initiated invalidations, ownership requests, downgrade requests, protocol state bits, or directories. Further, although GPU memory consistency models have been slow to be clearly defined [4, 5], GPU coherence implementations were amenable to the familiar data-race-free model widely adopted for multicores today.

However, the rise of general-purpose GPU (GPGPU) computing has made GPUs desirable for applications with more general sharing patterns and fine-grained synchronization [6, 7, 8, 9, 10, 11]. Unfortunately, conventional GPU-style coherence schemes involving full cache invalidates, dirty data flushes, and remote execution at synchronizations are inefficient for these emerging workloads. To overcome these inefficiencies, recent work has proposed associating synchronization accesses with a scope that indicates the level of the memory hierarchy where the synchronization should occur [8, 12]. For example, a synchronization access with a local scope indicates that it synchronizes only the data accessed by the blocks within its own CU (which share the L1 cache). As a result, the synchronization can execute at the CU's L1 cache, without invalidating or flushing data to lower levels of the memory hierarchy (since no other CUs are intended to synchronize through this access). For synchronizations that can be identified as having local scope, this technique can significantly improve performance by eliminating virtually all sources of synchronization overhead.

Although the introduction of scopes is an efficient solution to the problem of fine-grained GPU synchronization, it comes at the cost of programming complexity. Data-race-free is no longer a viable memory consistency model since locally scoped synchronization accesses potentially lead to "synchronization races" that can violate sequential consistency in non-intuitive ways (even for programs deemed to be well synchronized by the data-race-free memory model). Recently, Hower et al. addressed this problem by formalizing a new memory model, heterogeneous-race-free (HRF), to handle scoped synchronization. The Heterogeneous System Architecture (HSA) Foundation [1], a consortium of several industry vendors, and OpenCL 2.0 [13] have recently adopted a model similar to HRF with scoped synchronization.
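To make the notion of a scope concrete, the following CUDA sketch (our own illustration, not code from the paper or its benchmarks) hands a value from a producer thread to a consumer through a flag. The kernel, the variable names, and the use of __threadfence_block() versus __threadfence() as rough stand-ins for locally and globally scoped releases are all assumptions of ours.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__device__ volatile int g_data = 0;   // data location
__device__ volatile int g_flag = 0;   // synchronization flag: 0 = empty, 1 = full

// consumerBlock selects where the consumer runs: 0 = same thread block as the
// producer (communication stays within one CU), 1 = a different thread block.
__global__ void handoff(int consumerBlock) {
  if (blockIdx.x == 0 && threadIdx.x == 0) {            // producer
    g_data = 42;                                        // data write
    if (consumerBlock == 0)
      __threadfence_block();  // analogue of a locally scoped release (visibility within the CU)
    else
      __threadfence();        // analogue of a globally scoped release (device-wide visibility)
    g_flag = 1;                                         // release write
  } else if (blockIdx.x == consumerBlock && threadIdx.x == 1) {   // consumer
    while (g_flag == 0) { }   // acquire read (spin); assumes both blocks are co-resident
    __threadfence();          // order the flag read before the data read
    printf("consumer in block %d read %d\n", blockIdx.x, g_data);  // data read
  }
}

int main() {
  // Pass 0 to keep producer and consumer in the same thread block (same CU),
  // 1 to place the consumer in another thread block (likely another CU).
  handoff<<<2, 2>>>(1);
  cudaDeviceSynchronize();
  return 0;
}
```

A release that only needs CU-level visibility can complete at the CU's L1 without invalidations or flushes to the L2, which is where the performance benefit of local scope comes from; but if the consumer actually runs on a different CU while the producer used only the block-level fence, the pairing is exactly the kind of synchronization race that DRF cannot express and that HRF has to define.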
Although HRF is a very well-defined model, it cannot hide the inherent complexity of using scopes. Intrinsically, scopes are a hardware-inspired mechanism that exposes the memory hierarchy to the programmer. Using memory models to expose a hardware feature is consistent with the past evolution of memory models (e.g., the IBM 370 and total store order (TSO) models essentially expose hardware store buffers), but is discouraging when considering the past confusion generated by such an evolution. Previously, researchers have argued against such a hardware-centric view and proposed more software-centric models such as data-race-free [14]. Although data-race-free is widely adopted, it is still a source of much confusion [15]. Given the subtleties and complexities associated with even these so-called simplest models, we argue that GPU consistency models should not be even more complex than the CPU models.

We therefore ask the question – can we develop coherence protocols for GPUs that are close to the simplicity of conventional GPU protocols but give the performance benefits of scoped synchronization, while enabling a memory model no more complex than data-race-free?

We show that this is possible by considering a recent hardware-software hybrid protocol, DeNovo [16, 17, 18], originally proposed for CPUs. DeNovo does not require writer-initiated invalidations or directories, but does obtain ownership for written data. Specifically, this paper makes the following contributions:

• We identify DeNovo (without regions) as a viable coherence protocol for GPUs. Due to its use of ownership on writes, DeNovo is able to exploit reuse of written data and synchronization variables across synchronization boundaries, without the additional complexity of scopes.

• We compare DeNovo with the DRF consistency model (i.e., no scoped synchronization) to GPU-style coherence with the DRF and HRF consistency models (i.e., without and with scoped synchronization). As expected, GPU+DRF performs poorly for applications with fine-grained synchronization. However, DeNovo+DRF provides a sweet spot in terms of performance, energy, implementation overhead, and memory model complexity. Specifically, we find the following when comparing the performance and energy for DeNovo+DRF and GPU+HRF. For benchmarks with no fine-grained synchronization, DeNovo is comparable to GPU. For microbenchmarks with globally scoped fine-grained synchronization, DeNovo is better than GPU (on average 28% lower execution time and 51% lower energy). For microbenchmarks with mostly locally scoped synchronization, GPU does better than DeNovo – on average 6% lower execution time, with a maximum reduction of 13% (4% and 10% lower, respectively, for energy). However, GPU+HRF's modest benefit for locally scoped synchronization must be weighed against HRF's higher complexity and GPU's much lower performance for globally scoped synchronization.

• For completeness, we also enhance DeNovo+DRF with selective invalidations to avoid invalidating valid, read-only data regions at acquires. The addition of a single software (read-only) region does not add any additional state overhead, but does require the software to convey the read-only region information. This enhanced DeNovo+DRF protocol provides the same performance and energy as GPU+HRF on average.

• For cases where HRF's complexity is deemed acceptable, we develop a version of DeNovo for the HRF memory model. We find that DeNovo+HRF is the best performing protocol – it is either comparable to or better than GPU+HRF for all cases and significantly better for applications with globally scoped synchronization.

Although conventional hardware protocols such as MESI support fine-grained synchronization with DRF, we do not compare to them here because prior research
The key additional has observed that they incur significant complexity (e.g., overhead for DeNovo over GPU-style coherence with writer-initiated invalidations, directory overhead, and HRF is 1 bit per word at the caches (previous work on many transient states leading to cache state overhead) DeNovo uses software regions [16] for selective invalida- and are a poor fit for conventional GPU applications [3, tions; our baseline DeNovo protocol does not use these 19] (also evidenced by HSA’s adoption of HRF). Addi- tionally, the DeNovo project has shown that for CPUs, DeNovo provides comparable or better performance than Invalidation Tracking Different Initiator up-to-date copy scopes? MESI at much less complexity [16, 17, 18]. Conv HW writer ownership yes This work is the first to show that GPUs can support SW reader writethrough yes fine-grained synchronization efficiently without resort- Hybrid reader ownership yes ing to the complexity of the HRF consistency model at modest hardware overhead. DeNovo with DRF pro- Table 1: Classification of protocols covering conven- vides a sweet spot for performance, energy, overhead, tional HW (e.g., MESI), SW (e.g., GPU), and Hybrid and memory model complexity, questioning the recent (e.g., DeNovo) coherence protocols. move towards memory models for GPUs that are more complex than those for CPUs. (1) No stale data: A load hit in a private cache should never see stale data. 2. BACKGROUND: MEMORY CONSISTENCY (2) Locatable up-to-date data: A load miss in a private MODELS cache(s) must know where to get the up-to-date copy. Depending on whether the coherence protocol uses Table 1 classifies three classes of hardware coherence scoped synchronization or not, we assume either data- protocols in terms of how they enforce these require- race-free (DRF) [14] or heterogeneous-race-free (HRF) [8] ments. Modern coherence protocols accomplish the first as our memory consistency model. task through invalidation operations, which may be ini- DRF ensures sequential consistency (SC) to data- tiated by the writer or the reader of the data. The race-free programs. A program is data-race-free if its responsibility for the second task is usually handled by memory accesses are distinguished as data or synchro- the writer, which either registers its ownership (e.g., at a nization, and, for all its SC executions, all pairs of con- directory) or uses writethroughs to keep a shared cache flicting data accesses are ordered by DRF’s happens- up-to-date. The HRF consistency model adds an addi- before relation. The happens-before relation is the ir- tional dimension of whether a protocol can be enhanced reflexive, transitive closure of program order and syn- with scoped synchronization. chronization order, where the latter orders a synchro- Although our taxonomy is by no means comprehen- nization write (release) before a synchronization read sive, it covers the space of protocols commonly used (acquire) if the write occurs before the read. in CPUs and GPUs as well as recent work on hybrid HRF is defined similar to DRF except that each syn- software-hardware protocols. We next describe example chronization access has a scope attribute and HRF’s implementations from each class. Without loss of gener- synchronization order only orders synchronization ac- ality, we assume a two level cache hierarchy with private cesses with the same scope. There are two variants L1 caches and a shared last-level L2 cache. 
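To make the DRF definition above concrete, here is a minimal host-side C++ sketch (our own example, not code from the paper) in which every access is classified as data or synchronization.

```cpp
#include <atomic>
#include <cstdio>
#include <thread>

int payload = 0;               // accessed only by data operations
std::atomic<int> flag{0};      // accessed only by synchronization operations

void producer() {
  payload = 1;                                    // data write
  flag.store(1, std::memory_order_release);       // synchronization write (release)
}

void consumer() {
  while (flag.load(std::memory_order_acquire) == 0) { }  // synchronization read (acquire)
  std::printf("%d\n", payload);                   // data read: guaranteed to print 1
}

int main() {
  std::thread t1(producer), t2(consumer);
  t1.join();
  t2.join();
  return 0;
}
```

The two conflicting accesses to payload are ordered by happens-before (program order on each thread plus the synchronization order from the release to the acquire that reads it), so the program is data-race-free and sequential consistency is guaranteed; dropping the flag accesses would leave the payload accesses unordered and the program racy.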
In a GPU, of HRF: HRF-Direct, which requires all threads that the private L1 caches are shared by thread blocks [20] synchronize to use the same scope, and HRF-Indirect, executing on the corresponding GPU CU. which builds on HRF-Direct by providing extra support Conventional Hardware Protocols used in CPUs for transitive synchronization between different scopes. CPUs conventionally use pure hardware coherence One key issue is the prospect of synchronization races – protocols (e.g., MESI) that rely on writer-initiated in- conflicting synchronization accesses to different scopes validations and ownership tracking. They typically use that are not ordered by HRF’s happens-before. Such a directory to maintain the list of (clean) sharers or races are not allowed by the model and cannot be used the current owner of (dirty) data (at the granularity of to order data accesses. a cache line). If a core issues a write to a line that Common implementations of DRF and HRF enforce it does not own, then it requests ownership from the a program order requirement: an access X must com- directory, sending invalidations to any sharers or the plete before an access Y if X is program ordered before previous owner of the line. For the purpose of invalida- Y and either (1) X is an acquire and Y is a data access, tions and ownership, data and synchronization accesses (2) X is a data access and Y is a release, or (3) X and Y are typically treated uniformly. For the program order are both synchronization. For systems with caches, the constraint described in Section 2, a write is complete underlying coherence protocol governs the program or- when its invalidations reach all sharers or the previous der requirement by defining what it means for an access owner of the line. A read completes when it returns its to complete, as discussed in the next section. value and that value is globally visible. Although such protocols have not been explored with 3. A CLASSIFICATION OF COHERENCE the HRF memory model, it is possible to exploit scoped synchronization with them. However, the added bene- PROTOCOLS fits, are unclear. Furthermore, as discussed in Section 1, The end-goal of a coherence protocol is to ensure that conventional CPU protocols are a poor fit for GPUs and a read returns the correct value from the cache. For are included here primarily for completeness. the DRF and HRF models, this is the value from the Software Protocols used in GPUs last conflicting write as ordered by the happens-before GPUs use simple, primarily software-based coherence relation for the model. Following the observations made mechanisms, without writer-initiated invalidations or for the DeNovo protocol [16, 17], we divide the task of ownership tracking. We first consider the protocols a coherence protocol into the following: without scoped synchronization. GPU protocols use reader-initiated invalidations – an larger conventional line granularity, like sector caches).1 acquire synchronization (e.g., atomic reads or kernel Like GPU protocols, DeNovo invalidates the cache on launches) invalidates the entire cache so future reads do an acquire; however, these invalidations can be selective not return stale values. A write results in a writethough in several ways. 
Our baseline DeNovo protocol exploits to a cache (or memory) shared by all the cores partic- the property that data in registered state is up-to-date ipating in the coherence protocol (the L2 cache with and thus does not need to be invalidated (even if the our assumptions) – for improved performance, these data is accessed globally by multiple CUs). Previous writethroughs are buffered and coalesced until the next DeNovo work has also explored additional optimizations release (or until the buffer is full). Thus, a (correctly such as software regions and touched bits. We explore synchronized) read miss can always obtain the up-to- a simple variant where we identify read-only data re- date copy from the L2 cache. gions and do not invalidate those on acquires (for sim- Since GPU protocols do not have writer-initiated in- plicity, we do not explore more comprehensive regions validations, ownership tracking, or scoped synchroniza- or touched bits). The read-only region is a hardware tion, they perform synchronization accesses at the shared oblivious, program level property and is easier to deter- L2 (more generally, the closest memory shared by all mine than annotating all synchronization accesses with participating cores). For the program order require- (hardware- and schedule-specific) scope information. ment, preceding writes are now considered complete by For synchronization accesses, we use the DeNovoSync0 a release when their writethroughs reach the shared L2 protocol [18] which registers both read and write syn- cache. Synchronization accesses are considered com- chronizations. That is, unless the location is in regis- plete when they are performed at the shared L2 cache. tered state in the L1, it is treated as a miss for both The GPU protocols are simple, do not require proto- (synchronization) reads and writes and requires a reg- col state bits (other than valid bits), and do not incur istration operation. This potentially provides better invalidation and other protocol traffic overheads. How- performance than conventional GPU protocols, which ever, synchronization operations are expensive – the op- perform all synchronization at the L2 (i.e., no synchro- erations are performed at the L2 (or the closest shared nization hits). memory), an acquire invalidates the entire cache, and a DeNovoSync0 serves racy synchronization registra- release must wait until all previous writethroughs reach tions immediately at the registry, in the order in which the shared L2. Scoped synchronizations reduce these they arrive. For an already registered word, the reg- penalties for local scopes. istry forwards a new registration request to the reg- In our two level hierarchy, there are two scopes – pri- istered L1. If the request reaches the L1 before the vate L1 (shared by thread blocks on a CU) and shared L1’s own registration acknowledgment, it is queued at L2 (shared by all cores and CUs). We refer to these as the L1’s MSHR. In a high contention scenario, multi- local and global scopes, respectively. A locally scoped ple racy synchronizations from different cores will form synchronization does not have to invalidate the L1 (on a distributed queue. Multiple synchronization requests an acquire), does not have to wait for writethroughs to from the same CU (from different thread blocks) are reach the L2 (on a release), and is performed locally at coalesced within the CU’s MSHR and all are serviced the L1. 
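The acquire and release behaviors that distinguish these protocol classes can be summarized as a small behavioral sketch. The enum, class, and function names below are our own shorthand for the actions described in this section, not the authors' or the simulator's code, and DeNovo's HRF variant is omitted.

```cpp
#include <cstdio>

enum class Protocol { GPU_DRF, GPU_HRF, DeNovo };
enum class Scope { Local, Global };   // scopes exist only under HRF; DRF configurations use Global

struct L1Cache {
  void self_invalidate()      { std::puts("  invalidate valid data (DeNovo keeps Registered words)"); }
  void flush_store_buffer()   { std::puts("  writethrough buffered dirty data to the shared L2"); }
  void register_dirty_words() { std::puts("  obtain ownership (registration) for dirty words at the L2 registry"); }
};

void on_acquire(Protocol p, Scope s, L1Cache& l1) {
  if (p == Protocol::GPU_HRF && s == Scope::Local) return;  // locally scoped acquire: no invalidation
  l1.self_invalidate();   // GPU protocols drop everything valid; DeNovo's owned words stay valid
}

void on_release(Protocol p, Scope s, L1Cache& l1) {
  if (p == Protocol::GPU_HRF && s == Scope::Local) return;  // locally scoped release: need not reach the L2
  if (p == Protocol::DeNovo) l1.register_dirty_words();     // no bulk data flush needed
  else                       l1.flush_store_buffer();
}

// Where the synchronization access itself executes:
//   GPU with DRF:  always at the shared L2.
//   GPU with HRF:  at the L1 for local scope, at the L2 for global scope.
//   DeNovo (DeNovoSync0): at the L1 once the word is Registered there; otherwise a
//   registration request goes to the L2 registry and may be forwarded or queued.

int main() {
  L1Cache l1;
  std::puts("GPU-DRF release then acquire:");
  on_release(Protocol::GPU_DRF, Scope::Global, l1);
  on_acquire(Protocol::GPU_DRF, Scope::Global, l1);
  std::puts("DeNovo release then acquire:");
  on_release(Protocol::DeNovo, Scope::Global, l1);
  on_acquire(Protocol::DeNovo, Scope::Global, l1);
  return 0;
}
```

The point of the sketch is only the asymmetry: GPU-style protocols pay with bulk invalidations and flushes at every global synchronization, while DeNovo pays with per-word registration and keeps owned words live across it.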
Globally scoped synchronization is similar to before any queued remote request, thereby exploiting synchronization accesses without scopes. locality even under contention. As noted in previous Although scopes reduce the performance penalty, they work, the distributed queue serializes registration ac- complicate the programming model, effectively expos- knowledgments from different CUs – this throttling is ing the memory hierarchy to the programmer. beneficial when the contending synchronizations will be DeNovo: A Hybrid Hardware-Software Protocol unsuccessful (e.g., unsuccessful lock accesses) but can DeNovo is a recent hybrid hardware-software protocol add latency to the critical path if several of these syn- that uses reader-initiated invalidations with hardware chronizations (usually readers) are successful. As dis- tracked ownership. Since there are no writer-initiated cussed in [18], the latter case is uncommon. invalidations, there is no directory needed to track shar- DeNovoSync optimizes DeNovoSync0 by incorporat- ers lists. DeNovo uses the shared L2’s data banks to ing a backoff mechanism on registered reads when there track ownership – either the data bank has the up-to- is too much read-read contention. We do not explore it date copy of the data (no L1 cache owns it) or it keeps for simplicity. the ID of the core that owns the data. DeNovo refers to To enforce the program order requirement, DeNovo the L2 as the registry and the obtaining of ownership considers a data write and a synchronization (read or as registration. DeNovo has three states – Registered, write) complete when it obtains registration. As before, Valid, and Invalid – similar to the Modified, Shared, and data reads are complete when they return their value. Invalid states of the MSI protocol. The key difference DeNovo has not been previously evaluated with scoped with MSI is that DeNovo has precisely these three states synchronization, but can be extended in a natural way. with no transient states, because DeNovo exploits data- Local acquires and releases do not invalidate the cache race-freedom and does not have writer-initiated invali- or flush the store buffer. Additionally, local synchro- dations. A consequence of exploiting data-race-freedom is that the coherence states are stored at word granu- 1This does not preclude byte granularity accesses as dis- larity (although tags and data communication are at a cussed in [16]. None of our benchmarks, however, have byte granularity accesses. nization operations can delay obtaining ownership. cessed. DeNovo coherence, DRF consistency (DeNovo- 4. QUALITATIVE ANALYSIS OF THE PRO- D): The DeNovo coherence protocol with DRF has sev- eral advantages over GPU-D. DeNovo-D’s use of own- TOCOLS ership enables it to provide several of the advantages 4.1 Qualitative Performance Analysis of GPU-H without exposing the memory hierarchy to the programmer. For example, DeNovo-D can reuse We study the GPU and DeNovo protocols, with and written data across synchronization boundaries since it without scopes, as described in Section 3. In order to does not self-invalidate registered data on an acquire. understand the advantages and disadvantages of each With the read-only optimization, this benefit also ex- protocol, Table 2 qualitatively compares coherence pro- tends to read-only data. 
DeNovo-D also sees hits on tocols across several key features that are important for synchronization variables with temporal locality both emerging workloads with fine-grained synchronization: within a thread block and across thread blocks on the exploiting reuse of data across synchronization points same CU. Obtaining ownership also allows DeNovo-D (in L1), avoiding bursty traffic (especially for writes), to avoid bursty writebacks at releases and kernel bound- decreasing network traffic by avoiding overheads like in- aries. Unlike GPU-H, obtaining ownership specifically validations and acknowledgment messages, only trans- provides efficient support for applications with dynamic ferring useful data by decoupling the coherence and sharing and also transfers less data by decoupling the transfer granularity, exploiting reuse of synchronization coherence and transfer granularity. variables (in L1), and efficient support for dynamic shar- Although obtaining ownership usually results in a ing patterns such as work stealing. The coherence pro- higher hit rate, it can sometimes increase miss latency; tocols have different advantages and disadvantages based e.g., an extra hop if the requested word is in a remote on their support for these features: L1 cache or additional serialization for some synchro- GPU coherence, DRF consistency (GPU-D): Con- nization patterns with high contention (Section 3). The ventional GPU protocols with DRF do not require inval- benefits, however, dominate in our results. idation or acknowledgment messages because they self- DeNovo coherence, HRF consistency (DeNovo- invalidate all valid data at all synchronization points H ): Using the HRF memory model with the DeNovo and write through all dirty data to the shared, back- coherence protocol combines all the advantages of own- ing LLC. However, there are also several inefficiencies ership that DeNovo-D enjoys with the advantages of which stem from poor support for fine-grained synchro- local scopes that GPU-H enjoys. nization and not obtaining ownership. Because GPU coherence protocols do not obtain ownership (and don’t 4.2 Protocol Implementation Overheads have writer-initiated invalidations), they must perform Each of these protocols has several sources of imple- synchronization accesses at the LLC, they must flush all mentation overhead: dirty data from the store buffer on releases, and they GPU-D: Since GPU-D does not track ownership, the must self-invalidate the entire cache on acquires. As L1 and L2 caches only need 1 bit (a valid bit) per line a result, GPU-D cannot reuse any data across syn- to track the state of the cache line. GPU coherence also chronization points (e.g., acquires, releases, and ker- needs support for flash invalidating the entire cache on nel boundaries). Flushing the store buffer at releases acquires and buffering writes until a release occurs. and kernel boundaries also causes bursty writethrough GPU-H: GPU coherence with the HRF memory model traffic. GPU coherence protocols also transfer data additionally requires a bit per word in the L1 caches to at a coarse granularity to exploit spatial locality; for keep track of partial cache block writes (3% overhead emerging workloads with fine-grained synchronization compared to GPU-D’s L1 cache). Like GPU-D, GPU- or strided accesses, this can be sub-optimal. 
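As a concrete instance of the reuse and granularity effects just described, consider the following CUDA micro-kernel; it is our own example with hypothetical names, not one of the paper's benchmarks.

```cuda
#include <cuda_runtime.h>

// Each thread updates one strided word, synchronizes globally, then touches the same word again.
__global__ void update_twice(int* table, int* done, int stride, int nblocks) {
  int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;  // fine-grained, strided access
  table[i] += 1;                            // first write to the word
  __threadfence();                          // make the write visible before signaling
  if (threadIdx.x == 0) atomicAdd(done, 1);
  while (atomicAdd(done, 0) < nblocks) { }  // global synchronization; assumes all blocks co-resident
  table[i] += 1;                            // reuse of the word written above
}

int main() {
  const int blocks = 4, threads = 64, stride = 16;
  int *table, *done;
  cudaMalloc(&table, blocks * threads * stride * sizeof(int));
  cudaMemset(table, 0, blocks * threads * stride * sizeof(int));
  cudaMalloc(&done, sizeof(int));
  cudaMemset(done, 0, sizeof(int));
  update_twice<<<blocks, threads>>>(table, done, stride, blocks);
  cudaDeviceSynchronize();
  cudaFree(table);
  cudaFree(done);
  return 0;
}
```

Under GPU coherence with DRF, the release before the signal must push the dirty word through to the L2 and the acquire invalidates the whole L1, so the second access to table[i] misses again; under DeNovo the word became Registered (owned) on the first write, so it survives the acquire and the second access hits in the L1. The large stride also means only one word per cache line is useful, which is where decoupling the coherence and transfer granularity matters.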
Further- H also requires support for flash invalidating the cache more, algorithms with dynamic sharing must synchro- for globally scoped acquires and releases and has an L2 nize at the LLC to prevent stale data from being ac- overhead of 1 valid bit per cache line. cessed. DeNovo-D and DeNovo-H: DeNovo needs per-word GPU coherence, HRF consistency (GPU-H ): Chang- state bits for the DRF and HRF memory models be- ing the memory model from DRF to HRF removes sev- cause DeNovo tracks coherence at the word granularity. eral inefficiencies from GPU coherence protocols while Since DeNovo has 3 coherence states, at the L1 cache retaining the benefit of no invalidation or acknowledg- we need 2 bits per-word (3% overhead over GPU-H ). At ment messages. Although globally scoped synchroniza- the L2, DeNovo needs one valid and one dirty bit per tion accesses have the same behavior as GPU-D, lo- line and one bit per word (3% overhead versus GPU-H ). cally scoped synchronization accesses occur locally and DeNovo-D with read-only optimization (DeNovo- do not require bursty writebacks, self-invalidations, or D+RO): Logically, DeNovo needs an additional bit per flushes, improving support for fine-grained synchroniza- word at the L1 caches to store the read-only informa- tion and allowing data to be reused across synchroniza- tion. However, to avoid incurring additional overhead, tion points. However, scopes do not provide efficient we reuse the extra, unused state from DeNovo’s coher- support for algorithms with dynamic sharing because ence bits. There is some overhead to convey the region programmers must conservatively use a global scope for information from the software to the hardware. We pass these algorithms to prevent stale data from being ac- this information through an opcode bit for memory in- Feature Benefit GD GH DD DH Reuse Written Data Reuse written data across synch points   (if local scope)   Reuse Valid Data Reuse cached valid data across synch points   (if local scope) 2  (if local scope) No Bursty Traffic Avoid bursts of writes   (if local scope)   No Invalidations/ACKs Decreased network traffic     Decoupled Granularity Only transfer useful data     Reuse Synchronization Efficient support for fine-grained synch   (if local scope)   Dynamic Sharing Efficient support for work stealing     Table 2: Comparison of studied coherence protocols.
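Section 5.2 mentions an API for manually inserted region annotations that are conveyed to the hardware through an opcode bit on memory instructions. The (non-runnable) fragment below is purely hypothetical: DENOVO_READ_ONLY is not a real CUDA or DeNovo symbol, and the macro is only a placeholder for however a compiler would select the annotated load opcode.

```cuda
// HYPOTHETICAL annotation: imagine the macro makes the compiler emit the load with the
// extra opcode bit set, marking the accessed words as part of the read-only region so
// they are not self-invalidated at acquires.
#define DENOVO_READ_ONLY(addr) (addr)   // placeholder only; no real effect here

__global__ void scale(const float* coeffs,   // read-only for the whole kernel
                      float* data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    float c = *DENOVO_READ_ONLY(&coeffs[i % 16]);  // annotated load (assumes >= 16 coefficients)
    data[i] *= c;                                  // ordinary data access
  }
}
```

The appeal, as the paper argues, is that "read-only" is a hardware-oblivious, program-level property, unlike a scope, which names a level of the machine's cache hierarchy.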

Figure 1: Baseline heterogeneous architecture [21]. (Block diagram: CPU cores and GPU CUs, each with a private L1 cache, plus a scratchpad per GPU CU, connected through an interconnect to shared L2 cache banks.)

Table 3: Simulated heterogeneous system parameters.
CPU Parameters
  Frequency: 2 GHz
  Cores: 1
GPU Parameters
  Frequency: 700 MHz
  CUs: 15
Memory Hierarchy Parameters
  L1 Size (8 banks, 8-way assoc.): 32 KB
  L2 Size (16 banks, NUCA): 4 MB
  Store Buffer Size: 256 entries
  L1 hit latency: 1 cycle
  Remote L1 hit latency: 35−83 cycles
  L2 hit latency: 29−61 cycles
  Memory latency: 197−261 cycles

GPGPU-Sim v3.2.1 [23] to model the GPU (the GPU structions. is similar to an NVIDIA GTX 480). The simulator also 5. METHODOLOGY uses Garnet [24] to model a 4x4 mesh interconnect with Our work is influenced by previous work on DeN- a GPU CU or a CPU core at each node. We use CUDA ovo [16, 17, 18, 21]. We leverage the project’s exist- 3.1 [20] for the GPU kernels in the applications since ing infrastructure [21] and extend it to support GPU this is the latest version of CUDA that is fully supported synchronization operations based on the DeNovoSync0 in GPGPU-Sim. Table 3 summarizes the common key coherence protocol for multicore CPUs [18]. parameters of our system. For energy modeling, GPU CUs use GPUWattch [25] 5.1 Baseline Heterogeneous Architecture and the NoC energy measurements use McPAT v.1.1 [26] We model a tightly coupled CPU-GPU architecture (our tightly coupled architecture more closely resembles with a unified shared memory address space and co- a multicore system’s NoC than the NoC modeled in herent caches. The system connects all CPU cores and GPUWattch). We do not model the CPU core or CPU GPU Compute Units (CUs) via an interconnection net- L1 energy since the CPU is only functionally simulated work. Like prior work, each CPU core and each GPU and not the focus of this work. CU (analogous to an NVIDIA SM) is on a separate net- We provide an API for manually inserted annotations work node. Each network node has an L1 cache (local for region information (for DeNovo-D+RO) and distin- to the CPU core or GPU CU) and a bank of the shared guishing synchronization instructions and their scope L2 cache (shared by all CPU cores and GPU CUs).3 (for HRF). The GPU nodes also have a scratchpad. Figure 1 illus- trates this baseline system which is similar to our prior 5.3 Configurations work [21]. The coherence protocol, consistency model, We evaluate the following configurations with GPU and write policy depend on the system configuration and DeNovo coherence protocols combined with DRF studied (Section 5.3). and HRF consistency models, using the implementa- tions described in Section 3. The CPU always uses the 5.2 Simulation Environment and Parameters DeNovo coherence protocol. For all configurations we We simulate the above architecture using an inte- assume 256 entry coalescing store buffers next to the grated CPU-GPU simulator built from the Simics full- L1 caches. We also assume support for performing syn- system functional simulator to model the CPUs, the chronization accesses (using atomics) at the L1 and L2. Wisconsin GEMS memory timing simulator [22], and We do not allow relaxed atomics [12] since precise se- 2 mantics for them are under debate and their use is dis- Mitigated by the read-only enhancement. couraged [15, 27]. 3HRF [8] uses a three-level cache hierarchy; we use two lev- els because the GEMS simulation environment (Section 5.2) GPU-D (GD): GD combines the baseline DRF mem- only supports two levels. We believe our results are not ory model (no scopes) with GPU coherence and per- qualitatively affected by the depth of the memory hierarchy. forms all synchronization accesses at the L2 cache. Benchmark Input 5.4.1 Applications without Intra-Kernel Synchroniza- No Synchronization tion Backprop (BP)[28] 32 KB Pathfinder (PF)[28] 10 x 100K matrix We examine 10 applications from modern heteroge- LUD[28] 256x256 matrix neous computing suites such as Rodinia [28, 30] and NW[28] 512x512 matrix Parboil [29]. 
None of these applications use synchro- SGEMM[29] medium nization within the GPU kernel and are also not writ- Stencil (ST)[29] 128x128x4, 4 iters ten to exploit reuse across kernels. These applications Hotspot (HS)[28] 512x512 matrix therefore primarily serve to establish DeNovo as a viable NN[28] 171K records SRAD v2 (SRAD)[28] 256x256 matrix protocol for today’s use cases. The top part of Table 4 LavaMD (LAVA)[28] 2x2x2 matrix summarizes these applications and their input sizes. Global Synchronization FA Mutex (FAM G), 5.4.2 (Micro)Benchmarks with Intra-Kernel Synchro- Sleep Mutex (SLM G), 3 TBs/CU, nization Spin Mutex (SPM G), 100 iters/TB/kernel, Most GPU applications do not use fine-grained syn- Spin Mutex+backoff (SPMBO G), 10 Ld&St/thr/iter Local or Hybrid Synchronization chronization because it is not well supported on current FA Mutex (FAM L), GPUs. Thus, to examine the performance for bench- Sleep Mutex (SLM L), marks with various kinds of synchronization we use a Spin Mutex (SPM L), 3 TBs/CU, set of synchronization primitive microbenchmarks, de- Spin Mutex+backoff (SPMBO L), 100 iters/TB/kernel, veloped by Stuart and Owens [6] – these include mutex Tree Barr+local exch (TBEX LG), 10 Ld&St/thr/iter locks, semaphores, and barriers. We also use the Un- Tree Barr (TB LG), 3 TBs/CU, balanced Tree Search (UTS) benchmark [8], the only Spin Sem (SS L), 100 iters/TB/kernel, benchmark that uses fine-grained synchronization in the Spin Sem+backoff (SSBO L)[6] readers: 10 Ld/thr/iter HRF paper.4 The microbenchmarks include centralized writers: 20 St/thr/iter and decentralized algorithms with a wide range of stall UTS[8] 16K nodes cycles and characteristics. The amount of Table 4: Benchmarks with input sizes. All thread blocks work per thread also varies: the mutex and tree barrier (TBs) in the synchronization microbenchmarks execute algorithms access the same amount of data per thread the critical section or barrier many times. Microbench- while UTS and the semaphores access different amounts marks with local and global scope are denoted with a of data per thread. The bottom part of Table 4 sum- ’ L’ and ’ G’, respectively. marizes the benchmarks and their input sizes. We modified the original synchronization primitive microbenchmarks to perform data accesses in the criti- GPU-H (GH): GH uses GPU coherence and HRF’s cal section such that the mutex microbenchmarks have HRF-Indirect memory model. GH performs locally scoped two versions: one performs local synchronization and synchronization accesses at the L1s and globally scoped accesses unique data per CU while the other uses global synchronization accesses at the L2. synchronization because the same data is accessed by all DeNovo-D (DD): DD uses the DeNovoSync0 coher- thread blocks. We also changed the globally synchro- ence protocol (without regions), a DRF memory model, nized barrier microbenchmark to use local and global and performs all synchronization accesses at the L1 (af- synchronization with a tree barrier: all thread blocks ter registration). on a CU access unique data and join a local barrier DeNovo-D with read-only optimization (DD+RO): before one thread block from each CU joins the global DD+RO augments DD with selective invalidations to barrier. After the global barrier, thread blocks exchange avoid invalidating valid read-only data on acquires. data for the subsequent iteration of the compute phase. DeNovo-H (DH): DH combines DeNovo-D with the We also added a version of the tree barrier where each HRF-Indirect memory model. 
Like GH, local scope syn- CU exchanges data locally before joining the global bar- chronizations always occur at the L1 and do not require rier. Additionally, we changed the semaphores to use a invalidations or flushes. reader-writer format with local synchronization: each CU has one writer thread block and two reader thread blocks. Each reader reads half of the CU’s data. The 5.4 Benchmarks writer shifts the reader thread block’s data to the right Evaluating our configurations is challenging because such that all elements are written except for the first ele- there are very few GPU application benchmarks that ment of each thread block. To ensure that no stale data use fine-grained synchronization. Thus, we use a combi- is accessed, the writers obtain the entire semaphore. nation of application benchmarks and microbenchmarks The working set fits in the L1 cache for all microbench- to cover the space of use cases with (1) no synchroniza- marks except TBEX LG and TB LG, which have larger tion within a GPU kernel, (2) synchronization that re- working sets because they repeatedly exchange data quires global scope, and (3) synchronization with mostly across CUs. local scope. All codes execute GPU kernels on 15 GPU 4RemoteScopes [11] uses several GPU benchmarks with fine- CUs and use a single CPU core. Parallelizing the CPU grained synchronization from Pannotia [10] but these bench- portions is left for future work. marks are not publicly available. BP PF LUD NW SGEMM ST HS NN SRAD LAVA FAM_G SLM_G SPM_G SPMBO_G AVG 103 101 101 101 101 100% 100%

Figure 2: Execution time (a), dynamic energy (b), and network traffic (c) for G* and D*, normalized to D*, for the benchmarks without synchronization (BP, PF, LUD, NW, SGEMM, ST, HS, NN, SRAD, LAVA) and their average. Energy is broken into GPU core+, scratchpad, L1 D$, L2 $, and network components; network traffic into read, registration, WB/WT, and atomic components.

Figure 3: Execution time (a), dynamic energy (b), and network traffic (c) for G* and D*, normalized to G*, for the globally scoped synchronization benchmarks (FAM_G, SLM_G, SPM_G, SPMBO_G) and their average.

Similar to the tree barrier, the UTS benchmark uti- core+,5 scratchpad, L1, L2, and network. Network lizes both local and global synchronization. By perform- traffic is measured in flit crossings and is also divided ing local synchronization accesses, UTS quickly com- into multiple components: data reads, data registra- pletes its work. However, since the tree is unbalanced, tions (writes), writebacks/writethroughs, and atomics. it is likely that some thread blocks will complete before In Figures 2 and 3, we only show the GPU-D and others. To mitigate load imbalance, CU’s push to and DeNovo-D configurations because HRF does not affect pull from a global task queue when their local queues these cases (there is no local synchronization) – we de- become full or empty, respectively. note the systems as G* to indicate that GPU-D and GPU-H obtain the same results and D* to indicate 6. RESULTS that DeNovo-D and DeNovo-H obtain the same re- Figures 2, 3, and 4 show our results for the appli- sults.6 For Figure 4, we show all five configurations, cations without fine-grained synchronization, for mi- denoted as GD, GH, DD, DD+RO, and DH. crobenchmarks with globally scoped fine-grained syn- Overall, compared to the best GPU coherence proto- chronization, and for codes with locally scoped or hy- col (GPU-H ), we find that DeNovo-D is comparable for brid synchronization, respectively. Parts (a)-(c) in each 5GPU core+ includes the instruction cache, constant cache, figure show execution time, energy consumed, and net- register file, SFU, FPU, scheduler, and the core pipeline. work traffic, respectively. Energy is divided into multi- 6We do not show the read-only enhancement here because ple components based on the source of energy: GPU D* is significantly better than G*. applications with no fine-grained synchronization and L1 miss) can be partly mitigated (if needed) using direct better for microbenchmarks that employ synchroniza- cache to cache transfers as enabled by DeNovo [16]. tion with only global scopes (average of 28% lower exe- 6.2.2 Global Synchronization Benchmarks cution time, 51% lower energy). For microbenchmarks with mostly locally scoped synchronization, GPU-H is Figure 3 shows the execution time, energy, and net- better (on average 6% lower execution time and 4% work traffic for the four benchmarks that use only glob- lower energy) than DeNovo-D. This modest benefit of ally scoped fine-grained synchronization. For these bench- GPU-H comes at the cost of a more complex memory marks, HRF has no effect because there are no synchro- model – adding a read-only region enhancement with nizations with local scope. DeNovo-D removes most of this benefit and using HRF The main difference between GPU* and DeNovo* is with DeNovo makes it the best performing protocol. that DeNovo* obtains ownership for written data and global synchronization variables, which gives the follow- 6.1 GPU-D vs. GPU-H ing key benefits for our benchmarks with global syn- Figure 4 shows that when locally scoped synchroniza- chronization. First, once DeNovo* obtains ownership tion can be used, GPU-H can significantly improve per- for a synchronization variable, subsequent accesses from formance over GPU-D, as noted in prior work [8]. On all thread blocks on the same CU incur hits (until an- average GPU-H decreases execution time by 46% and other CU is granted ownership or the variable is evicted energy by 42% for benchmarks that use local synchro- from the cache). These hits reduce average synchroniza- nization. 
There are two main sources of improvement. tion latency and network traffic for DeNovo*. Second, First, the latency of locally scoped acquires is much DeNovo* also benefits because owned data is not in- smaller because they are performed at L1 (which re- validated on an acquire, resulting in data reuse across duces atomic traffic by an average of 94%). Second, synchronization boundaries for all thread blocks on a local acquires do not invalidate the cache and local re- CU. Finally, release operations require getting owner- leases do not flush the store buffer. As a result, data ship for dirty data instead of writing through the data can be reused across local synchronization boundaries. to L2, resulting in less traffic. Since accesses hit more frequently in the L1 cache, ex- As discussed in Section 4.1, obtaining ownership can ecution time, energy, and network traffic improve. On incur overheads relative to GPU coherence in some cases. average, the L1, L2, and network energy components However, for our benchmarks with global synchroniza- decrease by 71% for GPU-H while data (non-atomic) tion, these overheads are compensated by the reuse ef- network traffic decreases by an average of 78%. fects mentioned above. As a result, on average, DeN- ovo* reduces execution time, energy, and network traf- 6.2 DeNovo-D vs. GPU Coherence fic by 28%, 51%, and 81%, respectively, relative to GPU*. 6.2.1 Traditional GPU Applications 6.2.3 Local Synchronization Benchmarks For the ten applications studied that do not use fine- For the microbenchmarks with mostly locally scoped grained synchronization, Figure 2 shows there is gener- synchronization, we focus on comparing DeNovo-D with ally little difference between DeNovo* and GPU*. De- GPU-H since Figure 4 shows that the latter is the best Novo* increases execution time and energy by 0.5% on GPU protocol. average and reduces network traffic by 5% on average. DeNovo-D and GPU-H both increase reuse and syn- For LavaMD, DeNovo* significantly decreases net- chronization efficiency relative to GPU-D for applica- work traffic because LavaMD overflows the store buffer, tions that use fine-grained synchronization, but they do which prevents multiple writes to the same location so in different ways. GPU-H enables data reuse across from being coalesced in GPU*. As a result, each of local synchronization boundaries, and can perform lo- these writes has to be written through separately to cally scoped synchronization operations at L1. There- the L2. Unlike GPU*, after DeNovo* obtains owner- fore, these benefits can only be achieved if the applica- ship to a word, all subsequent writes to that word hit tion can explicitly define locally scoped synchronization and do not need to use the store buffer. points. In contrast, DeNovo enables reuse implicitly For some other applications, obtaining ownership causes because owned data can be reused across any type of DeNovo* to slightly increase network traffic and energy. synchronization point. In addition, DeNovo-D obtains First, multiple writes to the same word may require ownership for all synchronization operations, so even multiple ownership requests if the word is evicted from global synchronization operations can be performed lo- the cache before the last write. GPU* may be able cally. Like the globally scoped benchmarks, obtaining to coalesce these writes in the store buffer and incur a ownership for atomics also improves reuse and locality single writethrough to the L2. 
Second, DeNovo* may for benchmarks like TB LG and TBEX LG that have incur a read or registration miss for a word registered both global and local synchronization. at another core, requiring an extra hop on the network Since GPU-H does not obtain ownership, on a glob- compared to GPU* (which always hits in the L2). In ally scoped release, it must flush and downgrade all our applications, however, these sources of overheads dirty data to the L2. As a result, if the store buffer are minimal and do not affect performance. In general, is too small, then GPU-H may see limited coalescing the first source (obtaining ownership) is not on the criti- of writes to the same location, as described in Sec- cal path for performance and the second source (remote tion 6.2.1. TB LG and TBEX LG exhibit this effect. FAM_L SLM_L SPM_L SPMBO_L SS_L SSBO_L TBEX_LG TB_LG UTS AVG

Figure 4 plots: (a) Execution time, (b) Dynamic energy, and (c) Network traffic for GD, GH, DD, DD+RO, and DH on FAM_L, SLM_L, SPM_L, SPMBO_L, SS_L, SSBO_L, TBEX_LG, TB_LG, UTS, and their average, all normalized to GD. Energy components: GPU core+, scratchpad, L1 D$, L2 $, and N/W; traffic components: read, registration, WB/WT, and atomics.

DD+RO DD+RO DD+RO DD+RO DD+RO DD+RO DD+RO DD+RO DD+RO DD+RO (c) Network traffic Figure 4: All configurations with synchronization benchmarks that use mostly local synchronization, normalized to GD. DeNovo-D also occasionally suffers from full store buffers cal writes and local synchronization operations. As a for these benchmarks, but its cost for flushing is lower – result, compared to DeNovo-D, DeNovo-H reduces ex- each dirty cache line only needs to send an ownership re- ecution time, energy, and network traffic for all appli- quest to L2. Furthermore, once DeNovo-D obtains own- cations with local scope. ership, any additional writes will hit and do not need Although DeNovo-D+RO allows reuse of read-only to use the store buffer, effectively reducing the number data, DeNovo-H ’s additional advantages described above of flushes of a full store buffer. By obtaining ownership also provide it a slight benefit over DeNovo-D+RO in for the data, DeNovo-D is able to exploit more reuse. a few cases. In doing so, DeNovo reduces network traffic and energy Compared to GPU-H, DeNovo-H is able to exploit relative to GPU-H for these applications. more locality because owned data can be reused across Conversely, DeNovo only enables reuse for owned data; any synchronization scope and because registration for i.e., there is no reuse across synchronization boundaries synchronization variables allows global synchronization for read-only data. performance and increases network requests to also be executed locally. traffic with locally scoped synchronization. SS L is also These results show that DeNovo-H is the best config- hurt by the order that the readers and writers enter uration of those studied because it combines the advan- the critical section: many readers enter first, so read- tages of ownership (from DeNovo-D) and scoped syn- write data is invalidated until the writer enters and ob- chronization (from GPU-H ) to minimize synchroniza- tains ownership for it. DeNovo-D also performs slightly tion overhead and maximize data reuse across all syn- worse than GPU-H for UTS because DeNovo-D uses chronization points. However, DeNovo-H significantly global synchronization and must frequently invalidate increases memory model complexity and does not pro- the cache and flush the store buffer. Although owner- vide significantly better results than DeNovo-D+RO, ship mitigates many disadvantages of global synchro- which uses a simpler memory model but has some over- nization, frequent invalidations and store buffer flushes head to identify the read-only data. limit the effectiveness of DeNovo-D. On average, GPU-H shows 6% lower execution time 7. RELATED WORK and 4% lower energy than DeNovo-D, with maximum 7.1 Consistency benefit of 13% and 10% respectively. However, GPU- H ’s advantage comes at the cost of increased memory Previous work on memory consistency models for GPUs model complexity. found that the TSO and relaxed memory models did not significantly outperform SC in a system with MOESI 6.3 DeNovo-D with Selective (RO) Invalidations coherence and writeback caches [31, 32]. However, the work does not measure the coherence overhead of the DeNovo-D’s inability to avoid invalidating read-only studied configurations or evaluate alternative coherence data is a key reason GPU-H outperforms it for the lo- protocols. The DeNovo coherence protocol also has sev- cally scoped microbenchmarks. 
Using the read-only re- eral advantages over an ownership-based MOESI proto- gion enhancement for DeNovo-D, however, removes any col, as discussed in Section 2. performance and energy benefit from GPU-H on aver- age. In some cases, GPU-H is better, but only up to 7.2 Coherence Protocols 7% for execution time and 4% for energy. Although There has also been significant prior work on optimiz- DeNovo-D+RO needs more program information, un- ing coherence protocols for standalone GPUs or CPU- like HRF, this information is hardware agnostic. GPU systems. Table 5 compares DeNovo-D to the most closely related prior work across the key features from 6.4 Applying HRF to DeNovo Table 2: DeNovo-H enjoys the benefits of ownership for data HSC[33]: Heterogeneous System Coherence (HSC) is a accesses and globally scoped synchronization accesses as hierarchical, ownership-based CPU-GPU cache coher- well as the benefits of locally scoped synchronization. ence protocol. HSC provides the same advantages as Reuse in L1 is possible for owned data across global the ownership-based protocols we discussed in Section 2. synchronization points and for all data across local syn- By adding coarse-grained hardware regions to MOESI, chronization points. Local synchronization operations HSC aggregates coherence traffic and reduces MOESI’s are always performed locally, and global synchroniza- network traffic overheads when used with GPUs. How- tion operations are performed locally once ownership is ever, HSC’s coarse regions restrict data layout and the acquired for the synchronization variable. types of communication that can effectively occur. Fur- Compared to DeNovo-D, DeNovo-H provides some thermore, HSC’s coherence protocol is significantly more additional benefits. With DeNovo-D many synchro- complex than DeNovo. nization accesses that would be locally scoped already Stash[21], TemporalCoherence[19], FusionCoher- occur at L1 and much data locality is already exploited ence[34]: Stash uses an extension of DeNovo for CPU- through ownership. However, by explicitly defining lo- GPU systems, but focuses on integrating specialized, cal synchronization accesses, DeNovo-H is able to reuse private memories like scratchpads into the unified ad- read-only data and data that is read multiple times be- dress space. It does not provide support for fine-grained fore it is written across local synchronization points. It synchronization and does not draw any comparisons is also able to delay obtaining ownership for both lo- with conventional GPU style coherence or comment on Feature Benefit HSC Stash[21], Quick Remote DD [33] TC[19], Release[3] Scopes[11] FC[34] Reuse Written Data Reuse written data across synchs      Reuse Valid Data Reuse cached valid data across synchs      No Bursty Traffic Avoid bursts of writes      No Invalidations/ACKs Decreased network traffic      Decoupled Granularity Only transfer useful data    (for STs)  (for STs)  Reuse Synchronization Efficient support for fine-grained synch      Dynamic Sharing Efficient support for work stealing      Table 5: Comparison of DeNovo to other GPU coherence schemes. The read-only region enhancement to DeNovo also allows valid data reuse for read-only data. consistency models. By extending the cache’s coherence features of ownership-based and GPU-style coherence protocol used in this work to support local and global protocols. 
As a result, DeNovo provides efficient sup- GPU synchronization operations, DeNovo-D reuses writ- port for applications with fine-grained synchronization. ten data across synchronization points. We also ex- Furthermore, the DeNovo coherence protocol enables a plore DRF and HRF consistency models, while the stash simple SC-for-DRF memory consistency model. Unlike assumes a DRF consistency model. FusionCoherence HRF, SC-for-DRF does not expose the memory hier- and TemporalCoherence use timestamp-based protocols archy or require programmers to carefully annotate all that utilize self-invalidations and self-downgrades and synchronization accesses with scope information. thus provide many of the same benefits as DeNovo-D. Across 10 CPU-GPU applications, which do not use However, this work does not consider fine-grained syn- fine-grained synchronization or dynamic sharing, DeN- chronization or impact on consistency models. ovo provides comparable performance to a conventional QuickRelease[3]: QuickRelease reduces the overhead GPU coherence protocol. For applications that utilize of synchronization operations in conventional GPU co- fine-grained, globally scoped synchronization, DeNovo herence protocols and allows data to be reused across significantly outperforms a conventional GPU coher- synchronization points. However, QuickRelease requires ence protocol. For applications that utilize fine-grained, broadcast invalidations to ensure that no stale data can locally scoped synchronization, GPU coherence with be accessed. Additionally, QuickRelease does not have HRF modestly outperforms the baseline DeNovo pro- efficient support for algorithms with dynamic sharing. tocol, but at the cost of a more complex consistency RemoteScopes[11]: RemoteScopes improves on Quick- model. Augmenting DeNovo with selective invalida- Release by providing better support for algorithms with tions for read-only regions allows it to obtain the same dynamic sharing. In the common case, dynamically average performance and energy as GPU coherence with shared data synchronizes with a local scope and when HRF. Furthermore, if HRF’s complexity is deemed ac- data is shared, RemoteScopes “promotes” the scope of ceptable, then a (modest) variant of DeNovo under HRF the synchronization access to a larger common scope provides better performance and energy than conven- to synchronize properly. Although RemoteScopes im- tional GPU coherence with HRF. These findings show proves performance for applications with dynamic shar- that DeNovo with DRF provides a sweet spot for perfor- ing, because it does not obtain ownership, it must use mance, energy, hardware overhead, and memory model heavyweight hardware mechanisms to ensure that no complexity – HRF’s complexity is not needed for ef- stale data is accessed. For example, RemoteScopes flushes ficient fine-grained synchronization on GPUs. Mov- the entire cache on acquires and uses broadcast invali- ing forward, we will analyze full-sized applications with dations and acknowledgments to ensure data is flushed. fine-grained synchronization, once they become avail- Overall, while each of the coherence protocols in pre- able, to ensure that DeNovo with DRF provides similar vious work provides some of the same benefits as DeNovo- benefits for them. We will also examine what additional D, none of them provide all of the benefits of DeNovo-D. 
[...]eral advantages over an ownership-based MOESI protocol, as discussed in Section 2.

7.2 Coherence Protocols

There has also been significant prior work on optimizing coherence protocols for standalone GPUs or CPU-GPU systems. Table 5 compares DeNovo-D to the most closely related prior work across the key features from Table 2.

Table 5: Comparison of DeNovo to other GPU coherence schemes (HSC [33]; Stash [21], TC [19], FC [34]; QuickRelease [3]; RemoteScopes [11]; DeNovo-D). The features compared, and the benefit each represents, are: Reuse Written Data (reuse written data across synchronizations), Reuse Valid Data (reuse cached valid data across synchronizations), No Bursty Traffic (avoid bursts of writes), No Invalidations/ACKs (decreased network traffic), Decoupled Granularity (only transfer useful data), Reuse Synchronization (efficient support for fine-grained synchronization), and Dynamic Sharing (efficient support for work stealing). The read-only region enhancement to DeNovo also allows valid data reuse for read-only data.

HSC [33]: Heterogeneous System Coherence (HSC) is a hierarchical, ownership-based CPU-GPU cache coherence protocol. HSC provides the same advantages as the ownership-based protocols we discussed in Section 2. By adding coarse-grained hardware regions to MOESI, HSC aggregates coherence traffic and reduces MOESI's network traffic overheads when used with GPUs. However, HSC's coarse regions restrict data layout and the types of communication that can effectively occur. Furthermore, HSC's coherence protocol is significantly more complex than DeNovo's.

Stash [21], TemporalCoherence [19], FusionCoherence [34]: Stash uses an extension of DeNovo for CPU-GPU systems, but focuses on integrating specialized private memories like scratchpads into the unified address space. It does not provide support for fine-grained synchronization and does not compare against conventional GPU-style coherence or comment on consistency models. By extending the cache coherence protocol used in this work to support local and global GPU synchronization operations, DeNovo-D reuses written data across synchronization points. We also explore both DRF and HRF consistency models, while Stash assumes a DRF consistency model. FusionCoherence and TemporalCoherence use timestamp-based protocols that rely on self-invalidations and self-downgrades and thus provide many of the same benefits as DeNovo-D. However, these protocols do not consider fine-grained synchronization or the impact on consistency models.

QuickRelease [3]: QuickRelease reduces the overhead of synchronization operations in conventional GPU coherence protocols and allows data to be reused across synchronization points. However, QuickRelease requires broadcast invalidations to ensure that no stale data can be accessed. Additionally, QuickRelease does not have efficient support for algorithms with dynamic sharing.

RemoteScopes [11]: RemoteScopes improves on QuickRelease by providing better support for algorithms with dynamic sharing. In the common case, dynamically shared data synchronizes with a local scope; when the data is actually shared, RemoteScopes "promotes" the scope of the synchronization access to a larger common scope so that the participants synchronize properly. Although RemoteScopes improves performance for applications with dynamic sharing, because it does not obtain ownership, it must use heavyweight hardware mechanisms to ensure that no stale data is accessed. For example, RemoteScopes flushes the entire cache on acquires and uses broadcast invalidations and acknowledgments to ensure data is flushed.
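The dynamic-sharing problem that motivates RemoteScopes can be sketched with a work-stealing-style publish, again using plain CUDA fences as a stand-in for scopes. The names (Task, publish_task, may_be_stolen) are hypothetical; this is not RemoteScopes' interface or this paper's benchmark code, and it only illustrates why a scoped model is awkward here, whereas a scope-free DRF model never has to ask the question.

```cuda
// Hypothetical sketch of dynamic sharing under a scoped model. Plain CUDA
// fences/atomics stand in for HRF scopes; all names are illustrative.
struct Task { int payload; };

__device__ void publish_task(Task* slot, int* ready_flag,
                             int value, bool may_be_stolen) {
  slot->payload = value;        // produce the task
  if (may_be_stolen) {
    __threadfence();            // globally scoped release: any CU may consume it
  } else {
    __threadfence_block();      // locally scoped release: same-CU consumers only
  }
  atomicExch(ready_flag, 1);    // publish the task
}
// In real work stealing the producer rarely knows "may_be_stolen" at release
// time, so a scoped model must either always release globally (losing the
// local-scope benefit) or promote the victim's scope when a thief appears, as
// RemoteScopes does with cache flushes and broadcast invalidations. With
// DeNovo+DRF the release is written once, with no scope, and ownership keeps
// it cheap when sharing in fact stays local.
```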
Overall, while each of the coherence protocols in previous work provides some of the same benefits as DeNovo-D, none of them provides all of the benefits of DeNovo-D. Furthermore, none of the above work explores the impact of ownership on consistency models.

8. CONCLUSION

GPGPU applications with more general sharing patterns and fine-grained synchronization have recently emerged. Unfortunately, conventional GPU coherence protocols do not provide efficient support for them. Past work has proposed HRF, which uses scoped synchronization, to address the inefficiencies of conventional GPU coherence protocols. In this work, we choose instead to resolve these inefficiencies by extending DeNovo's software-driven, hardware coherence protocol to GPUs. DeNovo is a hybrid coherence protocol that provides the best features of ownership-based and GPU-style coherence protocols. As a result, DeNovo provides efficient support for applications with fine-grained synchronization. Furthermore, the DeNovo coherence protocol enables a simple SC-for-DRF memory consistency model. Unlike HRF, SC-for-DRF does not expose the memory hierarchy or require programmers to carefully annotate all synchronization accesses with scope information.

Across 10 CPU-GPU applications, which do not use fine-grained synchronization or dynamic sharing, DeNovo provides comparable performance to a conventional GPU coherence protocol. For applications that use fine-grained, globally scoped synchronization, DeNovo significantly outperforms a conventional GPU coherence protocol. For applications that use fine-grained, locally scoped synchronization, GPU coherence with HRF modestly outperforms the baseline DeNovo protocol, but at the cost of a more complex consistency model. Augmenting DeNovo with selective invalidations for read-only regions allows it to obtain the same average performance and energy as GPU coherence with HRF. Furthermore, if HRF's complexity is deemed acceptable, then a (modest) variant of DeNovo under HRF provides better performance and energy than conventional GPU coherence with HRF. These findings show that DeNovo with DRF provides a sweet spot for performance, energy, hardware overhead, and memory model complexity: HRF's complexity is not needed for efficient fine-grained synchronization on GPUs. Moving forward, we will analyze full-sized applications with fine-grained synchronization, once they become available, to ensure that DeNovo with DRF provides similar benefits for them. We will also examine what additional benefits we can obtain by applying further optimizations, such as direct cache-to-cache transfers [16], to DeNovo with DRF.

Acknowledgments

This work was supported in part by a Qualcomm Innovation Fellowship for Sinclair, the Center for Future Architectures Research (C-FAR), one of the six centers of STARnet, a Semiconductor Research Corporation program sponsored by MARCO and DARPA, and the National Science Foundation under grants CCF-1018796 and CCF-1302641. We would also like to thank Brad Beckmann and Mark Hill for their insightful comments that improved the quality of this paper.

9. REFERENCES

[1] "HSA Platform System Architecture Specification." http://www.hsafoundation.com/?ddownload=4944, 2015.
[2] IntelPR, "Intel Discloses Newest Microarchitecture and 14 Nanometer Manufacturing Technical Details," Intel Newsroom, 2014.
[3] B. Hechtman, S. Che, D. Hower, Y. Tian, B. Beckmann, M. Hill, S. Reinhardt, and D. Wood, "QuickRelease: A Throughput-Oriented Approach to Release Consistency on GPUs," in IEEE 20th International Symposium on High Performance Computer Architecture, 2014.
[4] T. Sorensen, J. Alglave, G. Gopalakrishnan, and V. Grover, "ICS: U: Towards Shared Memory Consistency Models for GPUs," in International Conference on Supercomputing, 2013.
[5] J. Alglave, M. Batty, A. F. Donaldson, G. Gopalakrishnan, J. Ketema, D. Poetzl, T. Sorensen, and J. Wickerson, "GPU Concurrency: Weak Behaviours and Programming Assumptions," in Proceedings of the 20th International Conference on Architectural Support for Programming Languages and Operating Systems, 2015.
[6] J. A. Stuart and J. D. Owens, "Efficient Synchronization Primitives for GPUs," CoRR, vol. abs/1110.4623, 2011.
[7] M. Burtscher, R. Nasre, and K. Pingali, "A Quantitative Study of Irregular Programs on GPUs," in IEEE International Symposium on Workload Characterization, 2012.
[8] D. R. Hower, B. A. Hechtman, B. M. Beckmann, B. R. Gaster, M. D. Hill, S. K. Reinhardt, and D. A. Wood, "Heterogeneous-Race-Free Memory Models," in Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems, 2014.
[9] J. Y. Kim and C. Batten, "Accelerating Irregular Algorithms on GPGPUs Using Fine-Grain Hardware Worklists," in 47th Annual IEEE/ACM International Symposium on Microarchitecture, 2014.
[10] S. Che, B. Beckmann, S. Reinhardt, and K. Skadron, "Pannotia: Understanding Irregular GPGPU Graph Applications," in IEEE International Symposium on Workload Characterization, 2013.
[11] M. S. Orr, S. Che, A. Yilmazer, B. M. Beckmann, M. D. Hill, and D. A. Wood, "Synchronization Using Remote-Scope Promotion," in Proceedings of the 20th International Conference on Architectural Support for Programming Languages and Operating Systems, 2015.
[12] B. R. Gaster, D. Hower, and L. Howes, "HRF-Relaxed: Adapting HRF to the Complexities of Industrial Heterogeneous Memory Models," ACM Transactions on Architecture and Code Optimization, vol. 12, April 2015.
[13] L. Howes and A. Munshi, "The OpenCL Specification, Version 2.0." Khronos Group, 2015.
[14] S. Adve and M. Hill, "Weak Ordering – A New Definition," in Proceedings of the 17th Annual International Symposium on Computer Architecture, 1990.
[15] S. V. Adve and H.-J. Boehm, "Memory Models: A Case for Rethinking Parallel Languages and Hardware," Communications of the ACM, pp. 90–101, August 2010.
[16] B. Choi, R. Komuravelli, H. Sung, R. Smolinski, N. Honarmand, S. Adve, V. Adve, N. Carter, and C.-T. Chou, "DeNovo: Rethinking the Memory Hierarchy for Disciplined Parallelism," in Proceedings of the 20th International Conference on Parallel Architectures and Compilation Techniques, 2011.
[17] H. Sung, R. Komuravelli, and S. V. Adve, "DeNovoND: Efficient Hardware Support for Disciplined Non-determinism," in Proceedings of the 18th International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 13–26, 2013.
[18] H. Sung and S. V. Adve, "DeNovoSync: Efficient Support for Arbitrary Synchronization without Writer-Initiated Invalidations," in Proceedings of the 20th International Conference on Architectural Support for Programming Languages and Operating Systems, 2015.
[19] I. Singh, A. Shriraman, W. W. L. Fung, M. O'Connor, and T. M. Aamodt, "Cache Coherence for GPU Architectures," in 19th International Symposium on High Performance Computer Architecture, 2013.
[20] NVIDIA, "CUDA SDK 3.1." http://developer.nvidia.com/object/cuda_3_1_downloads.html.
[21] R. Komuravelli, M. D. Sinclair, J. Alsop, M. Huzaifa, P. Srivastava, M. Kotsifakou, S. V. Adve, and V. S. Adve, "Stash: Have Your Scratchpad and Cache it Too," in Proceedings of the 42nd Annual International Symposium on Computer Architecture, pp. 707–719, 2015.
[22] M. M. K. Martin, D. J. Sorin, B. M. Beckmann, M. R. Marty, M. Xu, A. R. Alameldeen, K. E. Moore, M. D. Hill, and D. A. Wood, "Multifacet's General Execution-driven Multiprocessor Simulator (GEMS) Toolset," SIGARCH Computer Architecture News, 2005.
[23] A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt, "Analyzing CUDA Workloads Using a Detailed GPU Simulator," in IEEE International Symposium on Performance Analysis of Systems and Software, 2009.
[24] N. Agarwal, T. Krishna, L.-S. Peh, and N. Jha, "GARNET: A Detailed On-chip Network Model Inside a Full-system Simulator," in IEEE International Symposium on Performance Analysis of Systems and Software, 2009.
[25] J. Leng, T. Hetherington, A. ElTantawy, S. Gilani, N. S. Kim, T. M. Aamodt, and V. J. Reddi, "GPUWattch: Enabling Energy Optimizations in GPGPUs," in Proceedings of the 40th Annual International Symposium on Computer Architecture, 2013.
[26] S. Li, J.-H. Ahn, R. Strong, J. Brockman, D. Tullsen, and N. Jouppi, "McPAT: An Integrated Power, Area, and Timing Modeling Framework for Multicore and Manycore Architectures," in 42nd Annual IEEE/ACM International Symposium on Microarchitecture, 2009.
[27] H.-J. Boehm and B. Demsky, "Outlawing Ghosts: Avoiding Out-of-thin-air Results," in Proceedings of the Workshop on Memory Systems Performance and Correctness, 2014.
[28] S. Che, M. Boyer, J. Meng, D. Tarjan, J. Sheaffer, S.-H. Lee, and K. Skadron, "Rodinia: A Benchmark Suite for Heterogeneous Computing," in IEEE International Symposium on Workload Characterization, 2009.
[29] J. A. Stratton, C. Rodrigues, I.-J. Sung, N. Obeid, L.-W. Chang, N. Anssari, G. D. Liu, and W. Hwu, "Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing," tech. rep., Department of ECE and CS, University of Illinois at Urbana-Champaign, 2012.
[30] S. Che, J. Sheaffer, M. Boyer, L. Szafaryn, L. Wang, and K. Skadron, "A Characterization of the Rodinia Benchmark Suite with Comparison to Contemporary CMP Workloads," in IEEE International Symposium on Workload Characterization, 2010.
[31] B. Hechtman and D. Sorin, "Evaluating Cache Coherent Shared Virtual Memory for Heterogeneous Multicore Chips," in IEEE International Symposium on Performance Analysis of Systems and Software, 2013.
[32] B. A. Hechtman and D. J. Sorin, "Exploring Memory Consistency for Massively-threaded Throughput-oriented Processors," in Proceedings of the 40th Annual International Symposium on Computer Architecture, 2013.
[33] J. Power, A. Basu, J. Gu, S. Puthoor, B. M. Beckmann, M. D. Hill, S. K. Reinhardt, and D. A. Wood, "Heterogeneous System Coherence for Integrated CPU-GPU Systems," in Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, 2013.
[34] S. Kumar, A. Shriraman, and N. Vedula, "Fusion: Design Tradeoffs in Coherence Cache Hierarchies for Accelerators," in Proceedings of the 42nd Annual International Symposium on Computer Architecture, 2015.