2020 IEEE International Symposium on High Performance Architecture (HPCA)

HMG: Extending Protocols Across Modern Hierarchical Multi-GPU Systems

Xiaowei Ren†‡, Daniel Lustig‡, Evgeny Bolotin‡, Aamer Jaleel‡, Oreste Villa‡, David Nellans‡ †The University of British Columbia ‡NVIDIA [email protected] {dlustig, ebolotin, ajaleel, ovilla, dnellans}@nvidia.com

Abstract—Prior work on GPU cache coherence has shown Multi-GPU System that simple hardware- or software-based protocols can be more than sufficient. However, in recent years, features such as multi- GPU GPU GPU GPU MCM-GPU chip modules have added deeper hierarchy and non-uniformity

into GPU memory systems. GPU programming models have GPM GPM

DRAM DRAM chosen to expose this non-uniformity directly to the end user NVSwitch NVSwitch NVSwitch through scoped memory consistency models. As a result, there

is room to improve upon earlier coherence protocols that were GPM GPM DRAM designed only for flat single-GPU hierarchies and/or simpler DRAM memory consistency models. GPU GPU GPU GPU In this paper, we propose HMG, a cache coherence protocol designed for forward-looking multi-GPU systems. HMG strikes Figure 1: Forward-looking multi-GPU system. Each GPU a balance between simplicity and performance: it uses a readily- implementable VI-like protocol to track coherence states, but has multiple GPU modules (GPMs). it tracks sharers using a hierarchical scheme optimized for mitigating the bandwidth limitations of inter-GPU links. HMG adopted a bulk-synchronous programming model (BSP) for leverages the novel scoped, non-multi-copy-atomic properties simplicity. This paradigm disallowed any data communica- of modern GPU memory models, and it avoids the overheads of invalidation acknowledgments and transient states that tion among Cooperative Arrays (CTAs) of active were needed to support prior GPU memory models. On a kernels. However, in emerging applications, less-restrictive 4-GPU system, HMG improves performance over a software- data sharing patterns and fine-grained synchronization are controlled, bulk invalidation-based coherence mechanism by expected to be more frequent [17–20]. BSP is too rigid and 26% and over a non-hierarchical hardware cache coherence inefficient to support these new sharing patterns. protocol by 18%, thereby achieving 97% of the performance of an idealized caching system. To extend GPUs into more general-purpose domains, GPU vendors have released very precisely-defined scoped memory I.INTRODUCTION models [21–24]. These models allow flexible communication As the demand for GPU compute continues to grow and synchronization among threads in the same CTA, the beyond what a single die can deliver [1–4], GPU vendors same GPU, or anywhere in the system, usually by requiring are turning to new packaging technologies such as multi- programmers to provide scope annotations for synchroniza- chip modules (MCMs) [5] and new networking technolo- tion operations. Scopes allow synchronization and coherence gies such as NVIDIA’s NVLink [6] and NVSwitch [7] to be maintained entirely within a narrow subset of the and AMD’s xGMI [8] in order to build ever-larger GPU full-system cache hierarchy, thereby delivering improved systems [9–11]. Consequently, as Fig. 1 depicts, modern GPU performance over system-wide synchronization enforcement. systems are becoming increasingly hierarchical. However, Furthermore, unlike most CPU memory models, these GPU due to physical limitations, the large bandwidth discrepancy models are now non-multi-copy-atomic: they do not require between existing inter-GPU links [6, 8] and on-package that memory accesses become visible to all observers at the integration technologies [12] can contribute to Non-Uniform same logical point in time. As a result, there is room for Memory Access (NUMA) behavior that often bottlenecks forward-looking GPU cache coherence protocols to be made performance. Following established principles, GPUs use even more relaxed, and hence capable of delivering even aggressive caching to recover some of the performance loss higher throughput, than protocols proposed in prior work (as created by the NUMA effect [5, 13, 14], and these caches outlined in Section III). 
are kept coherent with lightweight coherence protocols that Previously explored GPU coherence schemes [5, 13–16, 25– are implemented in software [5, 13], hardware [14, 15], or a 27] were well tuned for GPUs and much simpler than CPU mix of both [16]. protocols, but few have studied how to scale these protocols GPU originally assumed that inter-thread synchronization to larger multi-GPU systems with deeper cache hierarchies. would be coarse-grained and infrequent, and hence they To test their efficiency, we simply apply the existing software

2378-203X/20/$31.00 ©2020 IEEE 582 DOI 10.1109/HPCA47549.2020.00054 eteeoeaktefloigqeto:hwd eextend we do how question: following the ask therefore We hl iutnosypoiighg-efrac upr for support high-performance providing simultaneously while GPUs. multiple to extended when nepooo eindt xedchrneguarantees coherence extend to designed protocol ence htmd ro P ah oeec rtcl oua [15]. popular protocols coherence cache while GPU systems, prior implementability multi-GPU made and large that simplicity to the up maintaining scale such, nevertheless to As able traffic. is coherence inter-GPU HMG unnecessary bandwidth-limited adding mitigate without of relies to links sharers HMG impact of 16], performance [14, tracking the regions hierarchical but data tracking precise of by on traffic state coherence read-only filters the that Similarly, work required. prior not unlike processed is be multi-copy-atomicity can if invalidation write instantly permis- unnecessary a out write since filters messages, acquiring acknowledgment also of HMG latency 16]. [15, the otherwise sions cover would to that needed structures transient be hardware eliminates extra HMG and/or protocol, states the track within and/or and ownership multi-copy-atomicity CPU enforce that prior protocols Unlike GPU models. consistency memory scoped forward-looking across protocol in increase dramatic a complexity? without GPU and emerging applications, within synchronization GPUs fine-grained flexible multiple across protocols details. coherence for GPU III existing Section see improvement; leave indeed for protocols room shows, coherence plenty 2 VI Fig. hardware non-hierarchical As and existing the GPMs. software systems, 16 if multi-GPU of hierarchical as platform in them flat a extend separate for were simply account 4 system we not of do hierarchy; consists 4- protocols architectural GPU These a (GPMs). each improvement on modules for which GPU, GPU [15] room per in leave GPU-VI GPMs system, protocols protocols 4 existing GPU with that coherence system show hardware 4-GPU results The and a caching. on such protocols no different has three which under baseline data a GPU to remote normalized caching all of Benefits 2: Figure Normalized • contributions: following • the makes paper this Overall, coher- cache hardware-managed a HMG, propose We 0.0 0.5 1.0 1.5 2.0 2.5 eso htwiei spsil oetn existing extend to possible is it while of that study show We first model. memory the a non-multi-copy-atomic under is scoped, coherence this hardware knowledge, multi-GPU our multi-MCM, To of systems. best multi-GPU extending the to of protocols implications coherence cache architectural the study We

overfeat

MiniAMR 3.1 3.1 AlexNet 3.2 H ierarchical CoMD Non-Hierarchical SWCoherence

HPGMG M

ulti- MiniContact G

Usseswith systems PU pathfinder

Nekbone

cuSolver Non-Hierarchical HWCoherence

namd2.10 583 hc sue odsrb cacpdmmr nNVIDIA on memory scratchpad describe to used is which MM)[] eerhr aeepoe h osblt of possibility the explored have Researchers [5]. (MCMs) rnitrdniyipoeet uuemlimdl GPU multi-module future improvement, density transistor rpsl a ahmmr aet h rtGMGUthat GPM/GPU first the to These page 13]. memory [5, each locality map data proposals same inter-CTA the exploit to to CTAs GPM/GPU adjacent Multi- scheduled and GPU have MCM-GPU explorations hierarchical both GPU on this, mitigate bottleneck To performance architectures. main the is thread and data optimize decisions. manually migration to can and GPU directly placement they mainstream hierarchy that but the [13], so expose GPU users just large largely single today a users platforms were to it system if multi-GPU as hierarchical large modules creating a multi-chip presenting of single-package of benefits form the the [28]. 1 demonstrated in Fig. GPMs in has depicted GPU as work each (GPMs), which Recent modules in GPU into hierarchy split a is of consist will architectures GPUs Multi-Module system. Hierarchical a A. in GPUs space and address CPUs virtual all the for by memory” shared “global use we GPUs, resnet • h osrie adit fteitrGMGUlinks inter-GPM/GPU the of bandwidth constrained The continued limiting is Law Moore’s of end the Because memory” “shared term the around confusion avoid To ihheacia hrrtakn,btas eliminates also but tracking, sharer hierarchical with otaeo adaechrnepooost multi- to protocols coherence hardware or software eec rtclfrdsrbtdL ahsi hierar- in caches co- L2 cache distributed hardware-managed for protocol a herence HMG, propose We bandwidth inter-GPU/inter-module locality constraints. the data intra-GPU mitigate exploit and to without place inefficient in very mechanisms is performance systems, GPU vrl osbepromneo nielzdsystem. idealized the an of of 97% performance delivers possible HMG overall proposals. previous in found bandwidth locality inter-GPU data intra-GPU the improving only by of bottlenecks not related majority proposal the Our mitigates platforms. multi-GPU chical neesr rnin ttsadchrnemessages coherence and states transient unnecessary mst

nw-16K 3.4 Idealized Cachingw/oCoherence 3.5 lstm 4.0 I B II.

RNN_FW ACKGROUND

RNN_DGRAD 3.7 3.6 GoogLeNet 4.4

bfs

snap 3.3 3.4 RNN_WGRAD 7.1

GeoMean touches it to increase the likelihood that caches will capture synchronize. Synchronization of scope .cta is performed in locality. They also extended a conventional GPU software the L1 cache of each GPU Streaming Multiprocessor (SM); coherence protocol to the L2 caches, and they showed that synchronizations of scope .gpu and .sys are processed it worked well for traditional bulk-synchronous programs. through the GPU L2 cache and via the memory hierarchy More recently, CARVE [14] proposed to allocate a fraction of the whole system, respectively. of local GPU DRAM as cache for remote GPU data, and enforced coherence using a simple protocol filtered by D. GPU Cache Coherence tracking whether data was private, read-only, or read-write Some GPU protocols advocate for strong memory models, shared. However, as CARVE neither tracked sharers nor used and hence they propose sophisticated cache coherence scopes, CARVE broadcast invalidation messages to all caches protocols capable of delivering good performance [27]. Most for read-write shared data. other GPU protocols enforce variants of release consistency B. Emerging Programs Need Fine-Grained Communication by invalidating possibly stale values in caches when perform- ing acquire operations (implicitly including the start of a Nowadays, many applications contain fine-grained com- kernel), and by flushing dirty data during release operations munication between CTAs of the same kernel and/or of (implicitly including the end of a kernel). Much of the dependent kernels [29–35]. For example, in RNNs, abundant research in the area today proposes optimizations on top inter-CTA communication exists in the neuron connections of these basic principles. We broadly classify this work by between continuous timesteps [36]. In the simulation of whether reads or writes are responsible for invalidating stale molecule or neutron dynamics [34, 35], inter-CTA communi- data. cation is necessary for the movement dependency between Among read-initiated protocols, hLRC [26] elided unnec- different particles and different simulation timesteps. Graph essary cache invalidations and flushes by lazily performing algorithms usually dispatch vertices among multiple CTAs coherence actions when synchronization variables change reg- or kernels that need to exchange their individual update istration. Furthermore, the recent proposals of DeNovo [16] to the graph for the next round of computing until they and VIPS [38] can protect read-only or private data from reach convergence [4, 30]. We provide more details on the invalidation. However, they incur additional overheads and/or workloads we study in this paper in Section VI. All these require software changes to convey region information for applications can benefit from a hierarchical GPU system for read-only data, ownership tracking in word granularity, or higher performance and from the scoped memory model for coarse-grained (memory page level)2 private/shared data efficient inter-CTA synchronization enforcement. classification. C. GPU Memory Model As for write-initiated cache invalidations, previous work Both CUDA and OpenCL originally supported a coarse- has observed that MESI-like coherence protocols are a poor grained bulk-synchronous programming model. Under this fit for GPUs [15, 25]. QuickRelease [25] reduced the overhead paradigm, data sharing between threads of the same CTA of cache flush by enforcing the partial order of writes with a could be performed locally in the scratchpad FIFO. 
However, QuickRelease needs to partition the resources and synchronized using CTA execution barriers; but inter- required by reads and writes; it also broadcasts invalidations CTA synchronization was permitted only between dependent to all remote caches. GPU-VI [15] is a simple yet effective kernel calls (i.e., where data produced by one kernel is hardware cache coherence protocol, but it predated scoped consumed by the following kernels). They could not, with memory models and introduced extra overheads to enforce guaranteed correctness, perform arbitrary communication us- multi-copy-atomicity, which is no longer necessary. Also, ing global memory. While many GPU applications work very GPU-VI was proposed for use within a single GPU, and well under a bulk-synchronous model with rare inter-CTA did not consider the added complexity of having various synchronizations, it quickly becomes a limiting factor for the bandwidth tiers. types of emerging applications described in Section II-B. III.THE NOVEL COHERENCE NEEDSOF To support data sharing more flexibly and more efficiently, MODERN MULTI-GPUSYSTEMS both CUDA and OpenCL have shifted from bulk-synchronous models to more general-purpose scoped memory models [21– To scale coherence across multiple GPUs, the design 24, 37]. By incorporating the notion of scope, these new of HMG not only considers the architectural hierarchy of models allow each thread to communicate with any other modern GPU systems (Fig. 1), but also aggressively takes threads in the same CTA (.cta), the same GPU (.gpu), advantage of the latest scoped memory models (Section II-C). and anywhere in the system (.sys)1. Scope indicates the set Before diving into the details of HMG, we first describe our of threads with which a particular memory access wishes to main insights below.

1We use NVIDIA terminology in this paper. Equivalent scopes in HRF 2GPUs need large pages (e.g., 2MB) to ensure high TLB coverage. Smaller are work-group, device, and system [24]. pages can cause severe TLB bottlenecks [39].

584 ihrsett te hed ntesoei usin The question. in scope the in threads other to respect with idealized an and achieve protocols non-hierarchical what hs opeiisaeaporaefrltnybudCPUs, latency-bound for appropriate are complexities These P eoymdl loifr h eino good a of design the inform also models memory GPU oeec eyepii 2] omnpteni multi-GPU in of pattern nature common A relaxed [21]. this explicit very makes coherence model memory only GPU to NVIDIA and coherence boundaries, require synchronization scopes model at expose only programming explicitly enforced the be do coherent, of kept that part be as models to memory accesses memory GPU all memory CPU require non-scoped models While hierarchy. coherence GPU Models Memory Weak GPU Leveraging B. protocols CPU-like multi-GPUs. such for of and costs unnecessary behavior, the remain memory that relaxed shows more HMG [42]. hence far latency permit GPUs access but memory reduce to cache and cache directory replication HitMe IO page and introduced Skylake memory Intel’s for [44]. support migration OS WildFire Sun’s special example, had For veri- implemented optimizations. prohibitive products aggressive more Industrial in to [43]. resulting added complexity stalls, are fication write coherence states exploit transient the to Many reduce ownership [40–42]. track locality have MESI data coherence and as CPU such, TSO) such As protocols (e.g., requirements. model latency stricter memory much usually stronger CPUs a GPUs, unlike enforce However, CPUs. for GPUs. coherence multiple hierarchical across a extended as being HMG of on capable build to stored protocol therefore GPU addresses We same of GPUs. the range on remote common GPMs shows a multiple access 3 for redundantly Fig. common is Indeed, it links. that remote bandwidth-limited from via data invalidated GPUs reloading of and invalidations between gap the close cannot shows, VI 2 Fig. hardware as fine-grained However, the even invalidations. mitigate cache bulk that of mechanisms impact on focused mainly protocols GPUs Multiple to Coherence Extending GPU. same A. the in GPM addresses another to by destined loads accessed inter-GPU of Percentage 3: Figure ahswl nyapiytecs fcas-rie cache coarse-grained of cost L2 the shared amplify larger only multi-GPUs, will future caches In scenario. caching % of possibly reduced eie h hneo adaeacietr,scoped architecture, hardware of change the Besides cache hierarchical into research much been has There Section in described As peer GPU loads 100% 25% 50% 75% 0%

overfeat MiniAMR AlexNet CoMD HPGMG MiniContact pathfinder Nekbone

II-D cuSolver namd2.10

ro P coherence GPU prior , resnet mst nw-16K lstm RNN_FW RNN_DGRAD GoogLeNet bfs snap RNN_WGRAD Avg 585 ol u infiatyicesdpesr ntecoherence the on pressure increased significantly put would ut-oyaoi eoymdl o Ps[5,recent [45], GPUs for models memory multi-copy-atomic aytasetsae n yuigoto-re execution out-of-order using by and states with transient protocols multi- many coherence enforce sophisticated CPUs using Most multi-copy- copy-atomicity accesses. not delays memory are apparent subsequent an today create for can GPUs share also SM, GPUs Multi-copy-atomicity buffering an atomic. As across thread-private memory. cache only and were L1 with cores it between unit, if allowed as multi- atomic behave speaking, single to Loosely a memory 24]. [21, requires lack requirement the copy-atomicity a formalized since such have of models memory scoped GPU order environments. an multi-GPU is in scope larger narrowest magnitude and of broadest latency/bandwidth are the the scopes between [16], that GPU gap concluded single a has within work unnecessary prior some heavily of while rely efficiency scope; patterns comparative kernels Such the with frequently. and on less single first, GPUs a other other on each on running with kernels synchronize or to CTA GPU for be will applications ah aaiywt yia elcmn oiysc as such policy replacement typical for a contending holds with accesses that capacity DRAM cache cache L2 remote-GPM an and has local caches GPUs (GPM) both L1 module in assume GPU as We Each 4. write-through, today. Fig. and in software-managed shown remain is NHCC for ture Overview Architectural A. of systems notion multi-GPU larger a 1. on with Fig. better NHCC like scales extend it that will evaluations. so we later hierarchy our section, during As next baseline the account. our In into as serve hierarchy will architectural it such, the take it However, not transient acknowledgments. does invalidation all most eliminates and NHCC As states [15], annotations. GPU-VI scope to user-provided compared different the to to according accesses NHCC caches memory protocols, scoped synchronization weak most modern propagates Like for models. optimized memory be GPU can protocol (NHCC) ence altogether. acknowledgments invalidation leveraging states and by transient Instead, eliminates latency. HMG the non-multi-copy-atomicity, hide and to larger ability magnitude protocol’s of order an is GPUs caches, remote L2 to and round time the L1 trip 24 however, the environments, and in multi-GPU states transitions In transient respectively. state 12 coherence and 41 3 and added reduce [15] to GPU-VI example, can stalls, For protocols multi-copy-atomicity. coherence prior tolerate single-GPU Some also that overheads. found latency have the studies hide to speculation and utemr,atog oepirwr a proposed has work prior some although Furthermore, ihlvldarmo u aeiesnl-P architec- single-GPU baseline our of diagram high-level A coher- cache non-hierarchical a how describe now We V B IV. ASELINE C ACHE N C ON OHERENCE -H IERARCHICAL .gpu cp over scope .sys the home node for address A. Other GPMs may cache the GPM0 SMs + L1 $ SMs + L1 $ GPM1 value at A locally, but GPM0 maintains the authoritative copy. In the same figure, address B is cached in GPM1, X X B B even though GPM3 (the home node for B) is not caching L2 $ L2 $

A A DRAM R R DRAM B. Similarly, data can be cached in the home node only, as with address C in our example in Fig. 5. In NHCC, explicit coherence maintenance messages (i.e., cache invalidations) are sent only in two cases: when there X X B B L2 $ L2 $ is read-write sharing between CTAs on different GPMs, and

A A

DRAM DRAM R R when there is a directory capacity eviction. The fact that most memory accesses incur no coherence overhead ensures GPM2 SMs + L1 $ SMs + L1 $ GPM3 that the GPU does not deviate from peak throughput in the GPU common case where data is either read-only or CTA- Figure 4: Future GPUs will consist of multiple GPU modules private. We measure the impact of coherence messages in (GPMs). For example, each GPM might be a chiplet in a Section VII. single package. To explain the basics of NHCC, we track the life of a memory reference as an example. First, a memory access from the SM queries the L1 cache. Upon a L1 miss or write, L2 in GPM0 Home of A Home of C L2 in GPM1 the request is routed to the local GPM L2 cache. If the request misses in the L2 (or writes to L2, again assuming a Cached B Cached A Coherence V:A:[GPM2, GPM3] Cached C write-through policy), the address is checked to determine if Directory Data Cache Data Cache the local GPM is the home node for this reference. If so, the request is routed to local DRAM. Otherwise, the request is routed to the L2 cache of the home node via the inter-GPM links. The request may then hit in the home node L2 cache, Coherence Cached A Cached A or it may get routed through to that particular GPM’s off-chip V:B:[GPM1] Directory Data Cache Data Cache memory. We provide full details below.

L2 in GPM2 Home of B L2 in GPM3 B. Coherence Protocol Flows in Detail Coherence directory entry format is State:Addr:[Sharers] Table I details the full operation of NHCC. In this table, “local” refers to operations issued by the same GPM as the Figure 5: NHCC coherence architecture. The dotted yellow L2 cache which is handling the request. “Remote” requests boxes are the L2 caches from Fig. 4. The shaded gray cache are those originally issued by other GPMs. We walk through lines and directory entries indicate lines for which the GPM the entries in the table below. in question is the home node. Local Loads: When a local load request reaches the local L2 cache partition, if it hits, a reply is sent back to the least-recently-used (LRU). To support hardware inter-GPM requester directly. If the request misses, the next destination coherence, one GPM in the system is chosen by some hash depends on where the data is mapped by the address hash function as the home node for any given physical address. function. If the local L2 cache partition happens to be the The home node always contains the most up-to-date value home node for the address being accessed, the request will be at each memory location. sent to DRAM. Otherwise, the load request will be forwarded Like many protocols, NHCC attaches an individual direc- to the home node. Loads with .gpu or .sys scope must tory to every L2 cache within each GPM. The coherence always miss in the L1 cache and in the non-home L2 caches directory is organized as a traditional set-associative structure. to guarantee forward progress. Each directory entry tracks the identity of all GPM sharers, Local Stores: Depending on L2 design, local stores may along with coherence state. Like GPU-VI [15], each line be stored as dirty data in the L2 cache, or alternatively can be tracked in one of two stable states: Valid and Invalid. they may be written through and stored as clean data in However, unlike GPU-VI, NHCC does not have transient the L2 cache. All stores with scope greater than .cta (i.e., states, and it requires acknowledgments only for release .gpu and .sys) must be written through in order to ensure operations. Non-synchronizing stores (i.e., the vast majority) forward progress. Data which is written back or written do not require acknowledgments in NHCC. through the L2 is sent directly to DRAM if the local L2 We assume a non-inclusive inter-GPM L2 cache architec- cache is the home node for the address in question, or it ture to enable data to be cached freely across the different is relayed to the home node otherwise. If the local GPM GPMs. Fig. 5 shows an example in which GPM0 serves as is the home node and the coherence directory has recorded

586 State Local Ld Local St/Atom Remote Ld Remote St/Atom Replace Dir Entry Invalidation I - - add s to sharers, →V add s to sharers, →V N/A - add s to sharers, forward inv to all sharers V - inv all sharers, →I add s to sharers inv all sharers, →I inv other sharers (HMG only), → I Table I: NHCC and HMG coherence directory transition table. s refers to the sender of the message. any sharers for the address in question, then these sharers itself. The local L2 then collects these acknowledgments and must be notified that the data has been changed. As such, returns a single response to the original requester. a local store triggers an invalidation message being sent to Cache Eviction: Two design options are possible upon each sharer. These invalidations propagate in the background, cache line eviction. First, a clean cache line being evicted off the critical path of any subsequent reads. There are no from an L2 cache in a non-home GPM could send a invalidation acknowledgments. downgrade message to the home node. This allows the home Remote Loads: When a remote load arrives at the local node to delete the remote node as a sharer and will potentially home L2 cache, it either hits in the cache and returns data save an invalidation message from being sent later. However, to the requester, or it misses and forwards the request to this is not required for correctness. The second option is DRAM. The coherence directory also records the ID of the to have valid clean cache lines get silently evicted. This requesting node. If the line is already being tracked, the eliminates the overhead of the downgrade message, but it requesting ID is simply added as an additional sharer. If the triggers an unneeded invalidation message upon eventual line is not being tracked yet, a new entry is allocated in the eviction of the coherence directory entry. Optionally, dirty directory, possibly by evicting another valid entry (discussed cache lines being evicted and written back can use a new further below). message type indicating that the data must be updated but Remote Stores: Remote stores that arrive at a home L2 that the issuing GPM need not be tracked as a sharer going are cached and written through or written back to DRAM, forward. Again, this optimization is not strictly required for depending on the configuration of the L2. Since the requesting correctness, but may be useful in implementations using GPM may also be caching the stored data, the requester is writeback caches. recorded as a sharer. Since the data has been changed, all other sharers should be invalidated. V. HIERARCHICAL MULTI-GPU Atomics and Reductions: Atomic operations must always CACHE COHERENCE be performed at the home node L2. From a coherence Like most prior work, NHCC is designed for single-GPU transition perspective, these operations are treated as stores. scenarios and does not take the hierarchy between intra- and Invalidations: Upon receiving an invalidation request, any inter-GPU connections into account. This becomes a problem local clean copy of the address in question is invalidated. No as we try to extend protocols like NHCC to multiple GPUs, acknowledgment needs to be sent. as inter-GPU bandwidth limitations become a bottleneck. 
Directory Entry Eviction/Replacement: Because the To better exploit intra-GPU data locality, we propose a coherence directory is implemented as a set-associative cache, hierarchical multi-GPU (HMG) cache coherence protocol there may be entry evictions due to capacity and conflict that extends NHCC to be able to take advantage of the misses. To ensure correctness, invalidation messages must be type of locality that Fig. 3 highlights. The HMG protocol sent to all sharers of the entry that is being evicted. As with fundamentally enables multiple cache requests from individ- invalidations triggered by stores, these invalidations propagate ual GPMs to be coalesced and/or cached within a single in the background and do not require acknowledgments to GPU before traversing the lower-bandwidth inter-GPU links, be sent in return. thereby saving bandwidth and energy. Acquire: Acquire operations greater than .cta scope (i.e., .gpu and .sys) invalidate the entire local L1 cache, A. Architectural Overview following software coherence practice. However, they do not HMG is composed of two layers. The first layer is designed propagate past the L1 cache, as L2 coherence in GPMs is for intra-GPU caching, while the second layer is targeted at now maintained using NHCC. optimizing memory request routing in inter-GPU settings. For Release: Release operations trigger a writeback of all the intra-GPU layer, we define a GPU home node for each dirty data to the respective home nodes, if writeback caches given address within each individual MCM-GPU. An MCM- are being used. Releases also ensure completion of any GPU home node manages inter-GPM coherence using NHCC write-through operations and invalidation messages that are described in Section IV. Using the intra-GPU coherence layer, still in flight. Release operations greater than .cta scope data that is cached within a MCM-GPU can be consumed are propagated through the local L2 to all remote L2s to by multiple GPMs on that GPU without consulting a remote ensure that all invalidation messages have arrived at their GPU. destinations. Once this step is complete, each remote L2 We define one of the GPU home nodes for each address to sends back an acknowledgment for the release operation be the system home node. The choice of system home node

587 GPU0 GPU1 same address within GPU0. The L2 cache in GPU0:GPM1 L2 in GPM0 Sys Home of A GPU Home of A L2 in GPM0 is kept coherent with the L2 cache in GPU0:GPM0 using

Cached A Cached A the intra-GPU protocol layer. The L2 cache in GPU1:GPM0 Coherence V:A:[GPU1] Cached B Directory serves as the GPU1 home node for address A, and it is Data Cache Data Cache kept coherent with the L2 cache in GPU1:GPM1 using the

Inter-GPU intra-GPU layer. Both GPU home nodes are kept coherent Inter-GPM Network Inter-GPM Network using the inter-GPU protocol layer. Furthermore, suppose that from the state shown in Fig. 6(a),

Cached B GPU0:GPM0 wants to load address B, and the system home V:B:[GPM0] Data Cache Data Cache node for address B is mapped to GPU1:GPM1. GPU0:GPM1 is the GPU0 home node for B, so the load request propagates L2 in GPM1 GPU Home of B Sys Home of B L2 in GPM1 from GPU0:GPM0 to GPU0:GPM1 (the GPU home node), Cached Data GPU Home Sys Home and then to GPU1:GPM1 (the system home node). When the response is sent back to the requester, GPU0 (but not (a) Before: GPU0:GPM0 is about to load address B GPU0:GPM0 or GPU0:GPM1) is recorded as a sharer by GPU0 GPU1 the directory of the system home node GPU1:GPM1, and L2 in GPM0 Sys Home of A GPU Home of A L2 in GPM0 GPU0:GPM0 is recorded as a sharer by the directory of the

Cached A Cached A GPU0 home node GPU0:GPM1, as shown in Fig. 6(b). Coherence V:A:[GPU1] Cached B Cached B Directory Data Cache Data Cache B. Coherence Protocol Flows in Detail HMG behaves similarly to Table I but adds the single Inter-GPU Inter-GPM Network Inter-GPM Network extra transition shown in Table I. No extra coherence states are added. We highlight the important differences between NHCC and HMG as follows. Cached B Cached B V:B:[GPM0] V:B:[GPU0, GPM0] Loads: Loads progress through the cache hierarchy from Data Cache Data Cache the local L2 cache, to the GPU home node, to the system L2 in GPM1 GPU Home of B Sys Home of B L2 in GPM1 home node. Specifically, loads that miss in the GPM-local Cached Data GPU Home Sys Home L2 cache are routed to the GPU home node, unless the GPM- local L2 cache is already the GPU home node. From there, (b) After: B is cached in the L2 of the GPU0 home node for B as loads that miss in the GPU home node are routed to the well as in the L2 of the original requester system home node, unless the GPU home node is also the Figure 6: Hierarchical coherence in multi-GPU systems. system home node. Loads that miss in the system home node Loads are routed from the requesting GPM to the GPU home are routed to DRAM. node, and then to the system home node, and responses are Non-synchronizing loads (i.e., the vast majority) and loads returned and cached accordingly. with .cta scope can hit in all caches. However, loads with .gpu scope must miss in all caches prior to the GPU home can be made using any NUMA page allocation policy, such node. Loads with .sys scope must also miss in the GPU as first touch page placement, NVIDIA Unified Memory [37], home node; they may only hit in the system home node. static distribution, or any other reasonable heuristic. Among Loads propagating from the GPU home node to the system multiple GPUs, sharers are tracked by the directory using home node do not carry information about the GPM that a hierarchy-aware variant of the NHCC directory design. originally requested the data. Because this information is Specifically, each GPU home node will track any sharers already stored by the GPU home node, it would be redundant among other GPMs in the same GPU. Each system home to store it again in the directory of the system home node. node will track any sharers among other GPUs, but not Instead, invalidations are propagated to sharers hierarchically individual GPMs within these other GPUs. For an M-GPM, as described below. N-GPU system, each directory entry will therefore need to Stores: Stores are routed through a similar hierarchy as track as many as M + N − 2 sharers. they write-through and/or write-back. Specifically, stores The hierarchical caching mechanism of an example two- propagating past the GPM-local L2 cache are routed to the GPU system is shown in Fig. 6. Each GPU is shown with GPU home node (unless the GPM-local L2 is already the only two GPMs for brevity, but the protocol itself can extend GPU home node), and stores propagating past the GPU to an arbitrary number of GPUs, with an arbitrary number home node are routed to the system home node (unless the of GPMs per GPU. In Fig. 6(a), the system home node of GPU home node is already the system home node). Stores address A is the L2 cache residing in GPU0:GPM0. This propagating past the system home node are written to DRAM. particular L2 cache also serves as the GPU home node for the Similar to loads, stores or write-back/write-through operations

588 propagating from the GPU home node to the system home Structure Configuration node carry only the GPU identifier, not the identifier of the Number of GPUs 4 GPM within that GPU. Number of SMs 128 per GPU, 512 in total Number of GPMs 4 per GPU Stores must be written through at least to the home node for GPU frequency 1.3GHz the scope in question: the L1 cache for non-synchronizing Max number of warps 64 per SM and .cta-scoped stores, the GPU home node for .gpu- OS Page Size 2MB L1 data cache 128KB per SM, 128B lines scoped stores, and the system home node for .sys-scoped L2 data cache 12MB per GPU stores. This ensures that synchronization operations will make 128B lines, 16 ways forward progress. L2 coherence directory 12K entries per GPU module each entry covers 4 cache lines Atomics and Reductions: Atomics are always performed Inter-GPM bandwidth 2TB/s per GPU, bi-directional in the home node for the scope in question and they continue Inter-GPU bandwidth 200GB/s per link, bi-directional to be treated as stores for the purposes of coherence protocol Total DRAM bandwidth 1TB/s per GPU Total DRAM capacity 32GB per GPU transitions, just as in NHCC. Once performed at the home node, responses are propagated back to the requester just Table II: Configuration of simulated architecture. as load responses are handled and the result is stored as a dirty line or written through to subsequent levels of the Our simulator GPGPU-Sim 10 cache hierarchy, just as a store would be. For example, the 10 8 result of a .gpu-scoped atomic read-modify-write operation 10 cycles performed in the GPU will be written through to the system 106 4 home node, in systems which configure the GPU home node Sim. 10 to be write-through for stores. 104 106 108 1010 Invalidations: Because sharers are tracked hierarchically, Real HW cycles invalidations sent due to stores and directory evictions must 6 also propagate hierarchically. Invalidations sent from the 10 system or GPU home node to other GPMs in the same GPU 104 are processed and dropped without acknowledgment, just as 102 0 in NHCC. However, in HMG any invalidations received by all Clock (s) 10 a GPU home node from the system home node must also be W 104 106 108 1010 propagated to any and all GPM sharers within the same GPU. Sim. cycles This is the special transition shown in Table I for HMG. Acquire: As before, .cta-scoped acquire operations Figure 7: Simulator correlation vs. a NVIDIA Quadro GV100 invalidate the local L1 cache, but nothing more, as all levels and simulation runtime for our simulator and GPGPU- of L2 cache are being kept hardware-coherent. Sim [46–49]. Release: Release operations trigger writeback of all dirty not present in our suite of workloads. Simulating the system- data, at least to the home node for the scope being released. level effects of fine-grained synchronization, in reasonable They also still ensure completion of any write-through time, without sacrificing fidelity [20, 50] remains an open operations and invalidation messages still in flight to the problem for GPU researchers. .gpu home node for the scope in question. A -scoped release Fig. 7 shows our simulator correlation versus a NVIDIA operation, however, need not flush all write-back operations Quadro-GV100 GPU across a range of targeted microbench- across the inter-GPU network before returning a completion marks, public, and proprietary workloads. Fig. 7 also shows acknowledgment to the original requester. 
the corresponding data for GPGPU-Sim, a widely-used VI.EVALUATION METHODOLOGY academic GPU architecture simulator [51], with simulations capped at running for about one week. Our simulator has To evaluate HMG, we use a proprietary industrial simulator a correlation coefficient of 0.99 and average absolute error to model a multi-GPU system described in Table II. The of 0.13. This compares favorably to GPGPU-Sim (at 0.99 simulator is driven by program traces that record instructions, and 0.045, respectively), as well as other recently reported registers, memory addresses, and CUDA events. All micro- simulator results [52] while being significantly faster, which architectural scheduling, and thus time for execution, is allows us to run forward-looking GPU configurations more dynamic within the simulator and respects functional de- easily. Our simulator inherits the contiguous CTA scheduling pendencies such as work scheduling, barrier synchronization, and first-touch page placement polices from prior work [5, 13] memory access latencies. However, it cannot accurately to maximize data locality in memory. model spin-lock synchronizations in memory. While this type To perform our evaluation, we choose a public subset of of communication is legal on current NVIDIA hardware, it workloads (shown in Table III) [29–35] that have sufficient is not yet widely adopted due to performance overheads and parallelism to fill a 4-GPU system. These benchmarks utilize

589 okod vna eaclrt okod ihmr fine- more with workloads accelerate we as even workloads rtcl,mliGUsseslk i.1bhvsa single GPMs. a more as behaves with that 1 GPU Fig. performance non-hierarchical flat like For for systems caching. multi-GPU bound hardware protocols, upper via achieved loose coherence; be a enforce can as not does serves that compare this caching also idealized We to (HMG). proposed them our protocol and hardware scopes), hierarchical with leverage to coherence extension software hierarchical (conventional hierarchical protocol a software (NHCC), protocol hardware non-hierarchical baseline. comparative historical bulk-synchronous a traditional are providing few a and kernels, dependent mst traditional Specifically, on sharing. grained regress not does This performance patterns. that synchronization ensures inter-kernel and/or scoped 4-GPU a overhead. to non-hierarchical coherence normalized with is without protocols Performance caching software evaluated: GPMs. ideal are 4 and configurations of HMG, Five composed NHCC, data. is implementations, GPU GPU hierarchical remote each of and where caching system, disallows 4-GPU that a system of Performance 8: Figure tlz ne-enlcmuiainb anhn frequent launching by communication inter-kernel utilize eec ihsoe n ukivldto fcce) a caches), of bulk-invalidation and co- scopes non- software with a (conventional herence possibilities: protocol coherence software 4 hierarchical compares and plements Normalized Speedup oeec rtclImplementations: Protocol Coherence 0.0 0.5 1.0 1.5 2.0 2.5 Rodinia Rodinia ML ML ML ML ML ML ML ML Lonestar Lonestar HPC HPC HPC HPC HPC HPC HPC cuSolver Benchmark use al I:Bnhak sdfrevaluation. for used Benchmarks III: Table Non-Hierarchical SWCoherence RNN RNN RNN resnet overfeat lstm GoogLeNet AlexNet snap Nekbone-10 namd2.10 MiniContact MiniAMR-test2 HPGMG CoMD-xyz49 .gpu overfeat pathfinder nw-16K-10 mst-road-fla bfs-road-fla layer2 layer4 layer4 layer4 MiniAMR 3.2 3.2 3.2 3.1 3.1 soe ycrnzto xlcty others explicitly, synchronization -scoped layer1 conv2 conv2 WGRAD FW DGRAD AlexNet

CoMD cuSolver, pathfinder nw-16K RNN RNN RNN resnet overfeat lstm GoogLeNet AlexNet mst bfs snap Nekbone namd2.10 MiniContact MiniAMR HPGMG CoMD cuSolver Abbrev. HPGMG Non-Hierarchical HWCoherence WGRAD FW DGRAD MiniContact

and namd2.10, pathfinder hswr im- work This .9GB 1.49 GB 2.00 MB 38 MB 40 MB 29 GB 3.20 MB 618 MB 710 GB 1.15 MB 812 MB 83 MB 26 GB 3.44 MB 178 MB 72 MB 246 GB 1.80 GB 1.32 MB 313 GB 1.60 Footprint

Nekbone

cuSolver

namd2.10 Hierarchical SWCoherence 590 6MB ohteL ah fteisigS n h GPM-local the and protocol, SM hierarchical issuing the the In of cache. cache L2 scope the L1 for the node both example, home For the question. and caches in SM any issuing in invalidations the cache between bulk coher- trigger software protocols our ence in operations Load-acquire respectively. httebnfiso M r uhmr rnucdin pronounced more much are HMG of benefits shows the 8 that with Fig. benchmarks communication, for thread-to-thread even fine-grained GPUs, here. individual further within them sufficient on elaborate not do single-GPU we hence in and invalidations systems, the cache minimize of caches can penalty bandwidths L2 performance inter-GPM small large non- relatively relatively idealized The and an scheme. to caching close coherent and coherence hardware similarly and perform software generally both benchmarks, most for Analysis Performance HMG. A. of space design the analysis explore sensitivity conduct to we without Then caching overhead. coherence idealized any and protocols, coherence software parameters. these of choices the Section to GPUs. sensitivity and/or performance GPMs other by enables state This together. the lines tracks cache entry four each We of optimization: messages. directory downgrade one sharer model optional the implement not in scope writes. the pending for all node clears home stall question fetch the operations not until Store-release will operations caches. loads subsequent those subsequent from as data GPU, stale same the of GPMs protocol, non-hierarchical of the caches in L2 .sys However, all and GPU. SM issuing issuing the the of cache L1 the invalidate resnet HCadHGbhv codn oScinI n V and IV Section to according behave HMG and NHCC ut-P System: Multi-GPU NHCC, to HMG of performance the compare first We do We write-through. are caches all evaluation, our In igeGUSystem: Single-GPU soe od edntt naiaeL ahsi other in caches L2 invalidate to not need loads -scoped

fdt sindt ahGMt eatvl shared actively be to GPM each to assigned data of mst . . . . 4.0 4.1 3.7 3.5 3.4

I.R VII. nw-16K

lstm HMG Coherence SLSAND ESULTS

RNN_FW hl otaechrnemybe may coherence software While iepirwr,w bev that observe we work, prior Like .gpu

RNN_DGRAD 4.4 4.3 4.4 3.6 3.7 soe od ilinvalidate will loads -scoped D

GoogLeNet Idealized Cachingw/oCoherence ISCUSSION

.sys bfs 12K VII-B . . . . 7.1 7.2 7.0 3.4 3.3 × soe loads -scoped snap 4 ae shows later ×

128B RNN_WGRAD

GeoMean = tr ntutoso oeec ietr vcin onot do evictions directory coherence or instructions store aeasgicn mato efrac fHG hsis This HMG. of performance on impact significant a have penalties. bandwidth larger and from locality. latency suffer inter-GPU data protocols intra-GPU non-hierarchical additional the Meanwhile, the signifi- from protocols benefit generally hierarchical cantly hardware HMG NHCC. and system, and software 4-GPU protocols Both coherence a software In both side). (i.e, outperforms sharing half data right fine-grained the the more have for which especially applications systems, multi-GPU hierarchical deeply by invalidated lines eviction. cache directory of coherence each number Average 10: Figure each data. by shared invalidated on lines request cache of store number Average 9: Figure te plctos h eet fHGoteg h costs. the outweigh HMG of benefits the applications, This of other overhead. per performance cases, higher have such the lines will In explains cache HMG sets. protocol input four hardware the these the of on (in depending granularity entry), invalidations directory the through, frequent at data trigger this experiments, might write HMG simply but software will in operations protocols Store sharing. coherence false to lead can patterns [14]. e.g., alone than type rather sharing dynamically, data sharers highlight classifying tracking observations of These benefit than workloads. the more our no in generally sharers do are two that there evictions invalidations, directory sharer or read-write stores trigger among contains Even of workload data. percentage each shared small of a footprint sharer only memory a typically the is and address there same if the invalidations for trigger only stores because avg cacheline invs by avg cacheline eas rfietebnwdhoeha finvalidation of overhead bandwidth the profile also We access conflicting often fine-grained, workloads’ Graph to due invalidations line cache that show 10 and 9 Fig. each directory eviction invs by each store 0.0 0.5 1.0 1.5 2.0 0.0 0.5 1.0 1.5 2.0

overfeat overfeat MiniAMR MiniAMR 2.9 AlexNet AlexNet CoMD CoMD HPGMG HPGMG MiniContact MiniContact pathfinder pathfinder Nekbone Nekbone 2.1 cuSolver cuSolver mst namd2.10 namd2.10

o xml.Frmost For example. for , resnet resnet mst mst 3.5 nw-16K nw-16K lstm lstm RNN_FW RNN_FW RNN_DGRAD RNN_DGRAD GoogLeNet GoogLeNet 3.4 bfs bfs snap snap RNN_WGRAD RNN_WGRAD Avg Avg 591 ihpromne thg fcec,wt eaieysimple deliver relatively can with efficiency, multi-GPUs high hierarchical at performance, for becomes high that it HMG fact tolerant, that the latency with clear generally Combined small are low line. relatively workloads cache also GPU a GPU is The a and message out. to invalidation sent sharing compared be each read-write prior must of a invalidations with little size when just consistent sharers is as of is there number low This cost since as second. bandwidth data generally per total gigabytes is the few messages that invalidation shows of 11 Fig. messages. estvt tde cosarneo einsaeparameters. space design performed of we range a HMG, across of studies performance sensitivity the and parameters Analysis Sensitivity caching B. inter-GPU that speedup enable. ideal possibly able the can is of HMG 97% model, deliver memory to scoped the to mechanism tuned enforcement specifically coherence By lightweight contexts. a multi-GPU providing hierarchical in even protocols unnecessary, coherence are CPU-like prior complicated confirm that results suggestions our Overall, implementation. hardware iue1:Ttlbnwdhcs fivldto messages. invalidation of cost bandwidth Total 11: Figure bandwidth cost of • • • architectural our between relationship the understand To inv messages (GB/s) 0 1 2 3 4 Ps oee,ormdsl-ie ietre r large are is directories across and modestly-sized invalidations sharers our However, cache enough GPUs. additional track perform to to able forced not hardware- if is shrinks of directory coherence the benefit software over The somewhat coherence size. is managed directory HMG to proposed sensitive our of performance shows, the 14 Fig. between As trade-off coverage/performance. and a power/area presents sizing directory caches. Coherence larger favorable with more become systems only indicating in will grows, HMG capacity of advantage as the increases HMG perfor- by of the restricted mance Conversely, are protocols. capacity invalidation, coherence L2 cache software increased of of overhead benefits in the shown the of is Because performance on 13. size Fig. cache L2 bandwidth. of inter-GPU impact sufficient The to performance due saturate absolute to when begins even option, perform- coherence best the across ing always sweeping multi-GPU is when HMG bottleneck bandwidths, that inter-GPU shows often 12 that Fig. performance. effects cause NUMA main the of are links inter-GPU Bandwidth-limited 15.6 overfeat 19.6 MiniAMR AlexNet CoMD 8.8 HPGMG MiniContact pathfinder Nekbone cuSolver namd2.10 resnet mst nw-16K lstm RNN_FW RNN_DGRAD GoogLeNet bfs snap RNN_WGRAD Avg Non-Hierarchical HW Coherence HMG Coherence Non-Hierarchical HW Coherence HMG Coherence Hierarchical SW Coherence Idealized Caching w/o Coherence Hierarchical SW Coherence Idealized Caching w/o Coherence 2.5 2.0

2.0 1.5 1.5 1.0 1.0 0.5 0.5

0.0 0.0 Normalized Speedup Normalized Speedup 100GB/s 200GB/s 300GB/s 400GB/s 3K entries/GPM 6K entries/GPM 12K entries/GPM Figure 12: Performance sensitivity to inter-GPU bandwidth Figure 14: Performance sensitivity to the coherence directory (baseline is no caching with configurations of Table II). size (baseline is no caching with configurations of Table II).

Non-Hierarchical HW Coherence HMG Coherence Hierarchical SW Coherence Idealized Caching w/o Coherence to record more sharers and cover a larger footprint of shared 2.0 data, but the system performance will likely be more sensitive 1.5 to the link speeds and the actual network topology. As shown

1.0 in Fig. 14, HMG can perform very well even after we reduce

0.5 the coherence directory size by 50%, showing that there is still room to scale HMG to larger systems. We envision our 0.0 Normalized Speedup 6MB/GPU 12MB/GPU 24MB/GPU proposed coherence protocol being applicable for systems that can be comprised by a single NVSwitch-based network within Figure 13: Performance sensitivity to L2 cache size (baseline a single operating system node. Systems significantly larger is no caching with configurations of Table II). than this (e.g., 1024-GPU systems) may be decomposed into enough to successfully capture the locality needed to a hierarchy consisting of hardware-coherent GPU-clusters deliver near-ideal caching performance. which are in turn share data using software mechanisms such • Coarse-grained directory entry tracking granularity (e.g., as MPI or SHMEM [54, 55]. where each entry tracks four cache lines at a time) allows The rise of MCM-GPU-like architectures might seem to directories to be made smaller, but it also introduces a motivate adding scopes in between .cta and .gpu, to risk of false sharing. In order to quantify this impact, minimize the negative effects of coherence. However, our we varied the granularity tracked by each directory single-GPU performance results indicate that our workloads entry while simultaneously adjusting the total number are minimally sensitive to the inter-GPM coherence mecha- of entries in order to keep the total coverage constant. nism due to high inter-GPM bandwidth. As a result, the The results (not pictured) showed minimal sensitivity, performance benefits of introducing a new .gpm scope and we therefore conclude that coarse-grained directory may not outweigh the added programmer burden of using tracking is a useful optimization for HMG. numerous scopes. We expect further exploration of other software-hardware coherence interactions to remain an active C. Hardware Costs area of research as GPU systems continue to grow in size. In our HMG implementation, each directory entry needs to track as many as six sharers: three GPMs in the same VIII.RELATED WORK GPU and three other GPUs. Therefore, a 6 bit vector is required for the sharer list. Because our protocol uses just Memory Consistency: Some previous work enforced two states, Valid and Invalid, only one bit is needed to track sequential consistency (SC) in GPUs [27, 56, 57]. They directory entry state. We assume 48 bits for tag addresses, found that TSO and relaxed memory models could not so each entry in the coherence directory requires 55 bits of significantly outperform SC. However, modern GPU products storage. Every GPM has 12K directory entries, so the total still enforce relaxed memory models, as weak models allow storage cost of the coherence directories is 84KB, which is for more microarchitectural flexibility and arguably better only 2.7% of each GPM’s L2 cache data capacity, a small performance/power/area tradeoffs. price to pay for large performance improvements in future Cache Coherence: We have discussed most of the existing multi-GPUs. GPU coherence protocols in Section II-D. Others have also proposed GPU coherence based on physical or logical D. Discussion timestamps [15, 27]. Without explicit invalidation messages, On-package integration [5, 12, 53] along with off-package cache lines are self-invalidated after timestamps are expired, integration technologies [6–8] enable more and more GPU so they can reduce traffic load and improve performance modules to be integrated in a single systems. However, efficiently. 
However, they considered neither architecture NUMA effects are exacerbated as the number of GPMs, hierarchy nor scoped memory model. In this paper, we GPUs, and non-uniform topologies increase within the system. explore the coherence support for future deeply hierarchical In these situations, HMG’s coherence directory would need GPU systems with scoped memory model enforcement.

Coherence hierarchy has been commonly employed in CPUs [58, 59]. Most hierarchical CPU designs [40–42, 44, 60] have adopted MESI-like coherence, which has been proven to be a poor fit for GPUs [15, 25]. HMG shows that the complexity of such extra states is also unnecessary for hierarchical GPUs. Both DASH [41] and WildFire [44] increased the complexity even further by employing a mixed coherence policy: intra-cluster snoopy coherence combined with inter-cluster directory-based coherence. To implement consistency efficiently, Alpha GS320 [60] separated out commit events so that time-critical replies could bypass inbound requests without violating memory ordering constraints. HMG can achieve almost optimal performance without such overheads.

Shared data synchronization in the unified memory space of heterogeneous systems also requires efficient coherence protocols [61–63]. We expect that HMG would integrate nicely with such schemes thanks to its simple states and clear coherence hierarchy.

GPU Scaling: As Section II-A described, previous work built larger GPU systems with distributed architectures [5, 13, 14]. These designs alleviated the NUMA effect with aggressive caching, whose correctness is enforced by either conventional software coherence or hardware coherence that does not record exact sharers. However, none of them targeted deeply hierarchical systems or scoped memory models.

IX. CONCLUSION

In this paper, we introduce HMG, a novel cache coherence protocol specifically tailored to scale well to hierarchical multi-GPU systems. HMG provides efficient support for the fine-grained synchronization now permitted under recently formalized scoped GPU memory models. We find that, without much added complexity, simple hierarchical extensions and optimizations to existing coherence protocols can take advantage of the relaxations permitted by scoped memory models to achieve 97% of the performance of an ideal caching scheme that has no coherence overhead. Thanks to its cheap hardware implementation and high performance, HMG represents the most practical solution available for extending cache coherence to future hierarchical multi-GPU systems, and thereby for enabling continued performance scaling of applications onto ever-larger GPU-based systems.

X. ACKNOWLEDGMENT

We are grateful to Mieszko Lis and the anonymous reviewers for their insightful feedback.

REFERENCES

[1] Inside HPC, "TOP500 Shows Growing Momentum for Accelerators," https://insidehpc.com/2015/11/top500-shows-growing-momentum-for-accelerators/, 2015, accessed on 2019-08-04.
[2] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, and E. Shelhamer, "cuDNN: Efficient Primitives for Deep Learning," CoRR, vol. abs/1410.0759, October 2014. [Online]. Available: http://arxiv.org/abs/1410.0759
[3] K. Simonyan and A. Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition," CoRR, vol. abs/1409.1556, September 2014. [Online]. Available: http://arxiv.org/abs/1409.1556
[4] P. Harish and P. Narayanan, "Accelerating Large Graph Algorithms on the GPU Using CUDA," in International Conference on High-Performance Computing (HiPC). Springer, 2007, pp. 197–208.
[5] A. Arunkumar, E. Bolotin, B. Cho, U. Milic, E. Ebrahimi, O. Villa, A. Jaleel, C.-J. Wu, and D. Nellans, "MCM-GPU: Multi-Chip-Module GPUs for Continued Performance Scalability," in Proceedings of the 44th International Symposium on Computer Architecture (ISCA). ACM, 2017, pp. 320–332.
[6] NVIDIA, "NVIDIA NVLink: High Speed GPU Interconnect," https://www.nvidia.com/en-us/design-visualization/nvlink-bridges/, accessed on 2019-08-04.
[7] NVIDIA, "NVIDIA NVSwitch: The World's Highest-Bandwidth On-Node Switch," https://images.nvidia.com/content/pdf/nvswitch-technical-overview.pdf, accessed on 2019-08-04.
[8] "AMD's answer to Nvidia's NVLink is xGMI, and it's coming to the new 7nm Vega GPU," https://www.pcgamesn.com/amd-xgmi-vega-20-gpu-nvidia-nvlink, accessed on 2019-08-04.
[9] NVIDIA, "NVIDIA DGX-1: Essential Instrument for AI Research," https://www.nvidia.com/en-us/data-center/dgx-1/, 2017, accessed on 2019-08-04.
[10] ——, "NVIDIA DGX-2: The world's most powerful AI system for the most complex AI challenges," https://www.nvidia.com/en-us/data-center/dgx-2/, 2018, accessed on 2019-08-04.
[11] ——, "NVIDIA HGX-2: Powered by NVIDIA Tesla V100 GPUs and NVSwitch," https://www.nvidia.com/en-us/data-center/hgx/, 2018, accessed on 2019-08-04.
[12] J. W. Poulton, W. J. Dally, X. Chen, J. G. Eyles, T. H. Greer, S. G. Tell, J. M. Wilson, and C. T. Gray, "A 0.54 pJ/b 20 Gb/s Ground-Referenced Single-Ended Short-Reach Serial Link in 28 nm CMOS for Advanced Packaging Applications," IEEE Journal of Solid-State Circuits (JSSC), vol. 48, no. 12, pp. 3206–3218, 2013.
[13] U. Milic, O. Villa, E. Bolotin, A. Arunkumar, E. Ebrahimi, A. Jaleel, A. Ramirez, and D. Nellans, "Beyond the Socket: NUMA-aware GPUs," in Proceedings of the 50th International Symposium on Microarchitecture (MICRO). ACM, 2017, pp. 123–135.
[14] V. Young, A. Jaleel, E. Bolotin, E. Ebrahimi, D. Nellans, and O. Villa, "Combining HW/SW Mechanisms to Improve NUMA Performance of Multi-GPU Systems," in Proceedings of the 51st International Symposium on Microarchitecture (MICRO). IEEE, 2018, pp. 339–351.
[15] I. Singh, A. Shriraman, W. W. Fung, M. O'Connor, and T. M. Aamodt, "Cache Coherence for GPU Architectures," in Proceedings of the 19th International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2013, pp. 578–590.

[16] M. D. Sinclair, J. Alsop, and S. V. Adve, "Efficient GPU Synchronization Without Scopes: Saying No to Complex Consistency Models," in Proceedings of the 48th International Symposium on Microarchitecture (MICRO). ACM, 2015, pp. 647–659.
[17] M. Burtscher, R. Nasre, and K. Pingali, "A Quantitative Study of Irregular Programs on GPUs," in International Symposium on Workload Characterization (IISWC). IEEE, 2012, pp. 141–151.
[18] J. Kim and C. Batten, "Accelerating Irregular Algorithms on GPGPUs Using Fine-Grain Hardware Worklists," in Proceedings of the 47th International Symposium on Microarchitecture (MICRO). IEEE, 2014, pp. 75–87.
[19] S. Che, B. M. Beckmann, S. K. Reinhardt, and K. Skadron, "Pannotia: Understanding Irregular GPGPU Graph Applications," in International Symposium on Workload Characterization (IISWC). IEEE, 2013, pp. 185–195.
[20] M. D. Sinclair, J. Alsop, and S. V. Adve, "HeteroSync: A Benchmark Suite for Fine-Grained Synchronization on Tightly Coupled GPUs," in International Symposium on Workload Characterization (IISWC). IEEE, 2017, pp. 239–249.
[21] D. Lustig, S. Sahasrabuddhe, and O. Giroux, "A Formal Analysis of the NVIDIA PTX Memory Consistency Model," in Proceedings of the 24th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). ACM, 2019, pp. 257–270.
[22] "HSA Platform System Architecture Specification Version 1.2," http://www.hsafoundation.com/?ddownload=5702, accessed on 2019-07-07.
[23] "The OpenCL Specification Version 2.2," https://www.khronos.org/registry/OpenCL/specs/2.2/pdf/OpenCL_API.pdf, accessed on 2019-07-07.
[24] D. R. Hower, B. A. Hechtman, B. M. Beckmann, B. R. Gaster, M. D. Hill, S. K. Reinhardt, and D. A. Wood, "Heterogeneous-Race-Free Memory Models," in Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). ACM, 2014, pp. 427–440.
[25] B. A. Hechtman, S. Che, D. R. Hower, Y. Tian, B. M. Beckmann, M. D. Hill, S. K. Reinhardt, and D. A. Wood, "QuickRelease: A Throughput-oriented Approach to Release Consistency on GPUs," in Proceedings of the 20th International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2014, pp. 189–200.
[26] J. Alsop, M. S. Orr, B. M. Beckmann, and D. A. Wood, "Lazy Release Consistency for GPUs," in Proceedings of the 49th International Symposium on Microarchitecture (MICRO). IEEE, 2016, p. 26.
[27] X. Ren and M. Lis, "Efficient Sequential Consistency in GPUs via Relativistic Cache Coherence," in Proceedings of the 23rd International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2017, pp. 625–636.
[28] W. J. Dally, C. T. Gray, J. Poulton, B. Khailany, J. Wilson, and L. Dennison, "Hardware-Enabled Artificial Intelligence," in Symposium on VLSI Circuits. IEEE, 2018, pp. 3–6.
[29] L. Chien, "How to Avoid Global Synchronization by Domino Scheme," NVIDIA GPU Technology Conference (GTC), 2014.
[30] M. Kulkarni, M. Burtscher, C. Caşcaval, and K. Pingali, "Lonestar: A Suite of Parallel Irregular Programs," in International Symposium on Performance Analysis of Systems and Software (ISPASS). IEEE, 2009, pp. 65–76.
[31] J. Gong, S. Markidis, E. Laure, M. Otten, P. Fischer, and M. Min, "Nekbone Performance on GPUs with OpenACC and CUDA Fortran Implementations," The Journal of Supercomputing, vol. 72, no. 11, pp. 4160–4180, 2016.
[32] D. Li and M. Becchi, "Multiple Pairwise Sequence Alignments with the Needleman-Wunsch Algorithm on GPU," in SC Companion: High Performance Computing, Networking, Storage and Analysis (SCC). IEEE, 2012, pp. 1471–1472.
[33] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron, "Rodinia: A Benchmark Suite for Heterogeneous Computing," in International Symposium on Workload Characterization (IISWC). IEEE, 2009, pp. 44–54.
[34] R. J. Zerr and R. S. Baker, "SNAP: SN (Discrete Ordinates) Application Proxy Description," Los Alamos National Laboratories, Tech. Rep. LAUR-13-21070, 2013.
[35] J. C. Phillips, R. Braun, W. Wang, J. Gumbart, E. Tajkhorshid, E. Villa, C. Chipot, R. D. Skeel, L. Kale, and K. Schulten, "Scalable Molecular Dynamics with NAMD," Journal of Computational Chemistry, vol. 26, no. 16, pp. 1781–1802, 2005.
[36] G. Diamos, S. Sengupta, B. Catanzaro, M. Chrzanowski, A. Coates, E. Elsen, J. Engel, A. Hannun, and S. Satheesh, "Persistent RNNs: Stashing Recurrent Weights On-Chip," in International Conference on Machine Learning (ICML), 2016, pp. 2024–2033.
[37] NVIDIA, "Unified Memory in CUDA 6," https://devblogs.nvidia.com/unified-memory-in-cuda-6/, November 2013, accessed on 2019-08-04.
[38] K. Koukos, A. Ros, E. Hagersten, and S. Kaxiras, "Building Heterogeneous Unified Virtual Memories (UVMs) without the Overhead," ACM Transactions on Architecture and Code Optimization (TACO), vol. 13, no. 1, p. 1, 2016.
[39] R. Ausavarungnirun, J. Landgraf, V. Miller, S. Ghose, J. Gandhi, C. J. Rossbach, and O. Mutlu, "Mosaic: A GPU Memory Manager with Application-Transparent Support for Multiple Page Sizes," in Proceedings of the 50th International Symposium on Microarchitecture (MICRO). ACM, 2017, pp. 136–150.
[40] S.-L. Guo, H.-X. Wang, Y.-B. Xue, C.-M. Li, and D.-S. Wang, "Hierarchical Cache Directory for CMP," Journal of Computer Science and Technology, vol. 25, no. 2, pp. 246–256, 2010.

[41] D. Lenoski, J. Laudon, K. Gharachorloo, A. Gupta, and J. Hennessy, "The Directory-Based Cache Coherence Protocol for the DASH Multiprocessor," in Proceedings of the 17th International Symposium on Computer Architecture (ISCA). IEEE, 1990, pp. 148–159.
[42] D. Mulnix, "Intel Xeon Processor Scalable Family Technical Overview," https://software.intel.com/en-us/articles/intel-xeon-processor-scalable-family-technical-overview, 2017, accessed on 2019-08-04.
[43] D. J. Sorin, M. D. Hill, and D. A. Wood, "A Primer on Memory Consistency and Cache Coherence," Synthesis Lectures on Computer Architecture, vol. 6, no. 3, pp. 1–212, 2011.
[44] E. Hagersten and M. Koster, "WildFire: A Scalable Path for SMPs," in Proceedings of the 5th International Symposium on High Performance Computer Architecture (HPCA). IEEE, 1999, pp. 172–181.
[45] J. Alglave, M. Batty, A. F. Donaldson, G. Gopalakrishnan, J. Ketema, D. Poetzl, T. Sorensen, and J. Wickerson, "GPU Concurrency: Weak Behaviours and Programming Assumptions," in Proceedings of the 20th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). ACM, 2015, pp. 577–591.
[46] A. Jain, M. Khairy, and T. G. Rogers, "A Quantitative Evaluation of Contemporary GPU Simulation Methodology," Proceedings of the ACM on Measurement and Analysis of Computing Systems (SIGMETRICS), p. 35, 2018.
[47] M. Khairy, A. Jain, T. M. Aamodt, and T. G. Rogers, "Exploring Modern GPU Memory System Design Challenges through Accurate Modeling," CoRR, vol. abs/1810.07269, October 2018. [Online]. Available: http://arxiv.org/abs/1810.07269
[48] M. A. Raihan, N. Goli, and T. M. Aamodt, "Modeling Deep Learning Accelerator Enabled GPUs," in International Symposium on Performance Analysis of Systems and Software (ISPASS). IEEE, 2019, pp. 79–92.
[49] J. Lew, D. A. Shah, S. Pati, S. Cattell, M. Zhang, A. Sandhupatla, C. Ng, N. Goli, M. D. Sinclair, T. G. Rogers et al., "Analyzing Machine Learning Workloads Using a Detailed GPU Simulator," in International Symposium on Performance Analysis of Systems and Software (ISPASS). IEEE, 2019, pp. 151–152.
[50] Y. Sun, T. Baruah, S. A. Mojumder, S. Dong, X. Gong, S. Treadway, Y. Bao, S. Hance, C. McCardwell, V. Zhao, H. Barclay, A. K. Ziabari, Z. Chen, R. Ubal, J. L. Abellán, J. Kim, A. Joshi, and D. Kaeli, "MGPUSim: Enabling Multi-GPU Performance Modeling and Optimization," in Proceedings of the 46th International Symposium on Computer Architecture (ISCA). ACM, 2019, pp. 197–209.
[51] A. Bakhoda, G. L. Yuan, W. W. Fung, H. Wong, and T. M. Aamodt, "Analyzing CUDA Workloads Using a Detailed GPU Simulator," in International Symposium on Performance Analysis of Systems and Software (ISPASS). IEEE, 2009, pp. 163–174.
[52] A. Gutierrez, B. Beckmann, A. Dutu, J. Gross, J. Kalamatianos, O. Kayiran, M. Lebeane, M. Poremba, B. Potter, S. Puthoor, M. D. Sinclair, M. Wyse, J. Yin, X. Zhang, A. Jain, and T. G. Rogers, "Lost in Abstraction: Pitfalls of Analyzing GPUs at the Intermediate Language Level," in Proceedings of the 24th International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2018, pp. 141–155.
[53] AMD, "Multi-Chip Module Architecture: The Right Approach for Evolving Workloads," http://developer.amd.com/wordpress/media/2017/11/LE-62006-SB-Latency-170824-Final-1.pdf, August 2017.
[54] Message Passing Interface Forum, "MPI: A Message-Passing Interface Standard, Version 3.1," https://www.mpi-forum.org/docs/mpi-3.1/mpi31-report.pdf, June 2015.
[55] OpenSHMEM Project, "OpenSHMEM Application Programming Interface," http://www.openshmem.org/site/sites/default/site_files/OpenSHMEM-1.4.pdf, December 2017.
[56] B. A. Hechtman and D. J. Sorin, "Exploring Memory Consistency for Massively-Threaded Throughput-Oriented Processors," in Proceedings of the 40th International Symposium on Computer Architecture (ISCA). ACM, 2013, pp. 201–212.
[57] A. Singh, S. Aga, and S. Narayanasamy, "Efficiently Enforcing Strong Memory Ordering in GPUs," in Proceedings of the 48th International Symposium on Microarchitecture (MICRO). ACM, 2015, pp. 699–712.
[58] A. W. Wilson Jr., "Hierarchical Cache/Bus Architecture for Shared Memory Multiprocessors," in Proceedings of the 14th International Symposium on Computer Architecture (ISCA). ACM, 1987, pp. 244–252.
[59] D. A. Wallach, "PHD: A Hierarchical Cache Coherent Protocol," Ph.D. dissertation, Massachusetts Institute of Technology, 1992.
[60] K. Gharachorloo, M. Sharma, S. Steely, and S. Van Doren, "Architecture and Design of AlphaServer GS320," in Proceedings of the 9th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). ACM, 2000, pp. 13–24.
[61] J. Power, A. Basu, J. Gu, S. Puthoor, B. M. Beckmann, M. D. Hill, S. K. Reinhardt, and D. A. Wood, "Heterogeneous System Coherence for Integrated CPU-GPU Systems," in Proceedings of the 46th International Symposium on Microarchitecture (MICRO). ACM, 2013, pp. 457–467.
[62] L. E. Olson, M. D. Hill, and D. A. Wood, "Crossing Guard: Mediating Host-Accelerator Coherence Interactions," in Proceedings of the 22nd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). ACM, 2017, pp. 163–176.
[63] J. Alsop, M. D. Sinclair, and S. V. Adve, "Spandex: A Flexible Interface for Efficient Heterogeneous Coherence," in Proceedings of the 45th International Symposium on Computer Architecture (ISCA). IEEE, 2018, pp. 261–274.
