A Simple Coherence Scheme for Integrated CPU-GPU Systems

Ardhi W. B. Yudha, Surya University, [email protected]
Reza Pulungan, Universitas Gadjah Mada, [email protected]
Henry Hoffmann, University of Chicago, [email protected]

ABSTRACT

This paper presents a novel approach to accelerate applications running on integrated CPU-GPU systems. Many integrated CPU-GPU systems use cache-coherent shared memory to communicate efficiently and easily. In this pull-based approach, the CPU produces data for the GPU to consume, and the data resides in a shared cache until the GPU accesses it, resulting in long load latency on a first GPU access to a cache line. In this work, we propose a new, push-based coherence mechanism that explicitly exploits the CPU and GPU's producer-consumer relationship by automatically moving data from the CPU to the GPU's last-level cache. The proposed mechanism results in a dramatic reduction of the GPU's L2 cache miss rate in general, and a consequent increase in overall performance. Our experiments show that the proposed scheme can increase performance by up to 37%, with typical improvements in the 5–7% range. We find that even if an application does not benefit from the proposed approach, performance is never worse under this model. While we demonstrate how the proposed scheme can co-exist with traditional cache-coherence mechanisms, we also argue that it could serve as a simpler replacement for existing protocols.

1. INTRODUCTION

The abundance of transistors available on a die has led to new hardware designs where CPUs and GPUs are integrated onto a single chip [10, 18, 34, 47, 48]. This integration presents new opportunities for increased performance and ease of programming, as integrated CPU-GPU designs can communicate through cache-coherent shared memory (CCSM) rather than through explicit data copying from CPU to GPU memory.

In fact, for most integrated CPU-GPU systems, the critical operation is moving data from CPU to GPU, and sending this data via CCSM incurs overhead in the form of coherence messages. The CPU-GPU relationship is different from peer cores in traditional shared-memory systems, however. The CPU is a data producer and the GPU is a consumer. GPU performance depends directly on how fast it can consume the CPU-produced data. This relationship presents an opportunity for further performance gain by exploiting the producer-consumer relationship to eliminate unnecessary cache coherence messages, reduce the GPU miss rate, and improve system performance.

This paper proposes an augmentation of existing CPU-GPU cache coherence mechanisms. Specifically, performance-critical data is homed on the GPU. When the CPU writes to such data, the proposed coherence protocol forwards the data to the GPU, instead of leaving it in the memory close to the CPU. This proposal is an intuitive match for the CPU-to-GPU producer-consumer relationship, as it moves the data to the memory system where it will be consumed, reducing the GPU miss rate and improving overall performance.

While the basic idea of homed memory has appeared in many systems—from commercial products for distributed shared memory [13] and manycores [16] to research designs [21]—the integrated CPU-GPU system provides new challenges and opportunities. Specifically, prior approaches required that users explicitly declare the home core for different data ranges (e.g., in the TILE architecture [16]) or required added hardware complexity to automatically move data to an appropriate home (e.g., [21]). The unique nature of an integrated CPU-GPU system makes it possible to automatically translate existing programs to home memory and thus achieve the performance benefits of homing data on the GPU while requiring no extra work from users and keeping hardware complexity low.

Thus, in this paper we show how to support homed memory on an integrated CPU-GPU system. Our proposed approach includes a source-to-source translation that converts existing GPU programs to use our mechanism. Specifically, the translator changes existing memory allocations to allocate data that will be homed on the GPU in a specific address range. Data in that range cannot be cached on the CPU, only on the GPU. Therefore, when the CPU attempts to write it (to produce data for the GPU's consumption), there will be a write miss. Concurrently, the modified Translation Look-aside Buffer (TLB) will detect the address range and tell the CPU memory system to forward the write miss to the GPU memory system. The GPU then resolves the write miss and the produced data is stored in the GPU memory hierarchy. Then, when the GPU accesses it, the data is already in its local cache, reducing both compulsory misses and load latency on the first data access.

We test the proposed scheme by implementing it in the gem5-gpu simulator and compare it with recent proposals for traditional CCSM between CPU and GPU. Across benchmarks from a range of suites, we find that our proposed approach can achieve performance improvements as high as 37%, with typical improvements in the range of 5–7%. Furthermore, we find that performance under our scheme is never worse than with CCSM. In addition, we compare the proposed scheme with CCSM augmented with a straightforward stride prefetcher and see that the proposed scheme is much more efficient. While the proposed technique can co-exist with CCSM, we also believe it could be a simpler replacement for this approach.

2. RELATED WORK

There have been massive efforts toward tightly integrating CPUs and GPUs on a single die [23, 26, 28, 31, 35]. Prior work has tackled programming models [24, 33, 42] and network design [39], and many have studied the performance impact of the integration [6, 9, 15, 38, 40] and the key characteristics that deliver the best performance [29, 46, 50].

Cache coherence is one of the primary concerns for tightly integrated CPUs and GPUs, including both hardware-based [14, 17, 30, 34, 43, 44] and software-based [42, 51] cache coherence mechanisms. Several projects have investigated the performance of existing coherence mechanisms in this new context [11, 14, 36, 42, 45], while other work has even investigated tight integration without cache coherence [2].

The mechanism investigated in this paper bears some resemblance to prefetching techniques, implemented in both software [20, 27, 52] and hardware. Many prefetching techniques have been proposed, including a technique that assumes that the integrated CPU-GPU system shares the last-level cache (LLC) [49], a dynamically adaptive prefetching strategy to reduce memory latency and improve power efficiency [37], and a hybrid of software and hardware prefetching [22], which combines conventional stride prefetching with a hardware-implemented inter-thread technique, where each thread prefetches for threads located in other warps.

A GPU selective caching strategy that enables unified shared memory without a hardware cache-coherence protocol is proposed in [2]. It involves architectural improvements that address the performance reduction of selective caching: aggressive request coalescing, CPU-side coherent caching for GPU uncacheable requests, and a CPU-GPU interconnect optimization to support variable-sized transfers. On the other hand, heterogeneous system coherence for CPU-GPU systems, which addresses the effect of the huge number of GPU memory requests, is proposed in [34]. That mechanism replaces the standard directory with a region directory and inserts a region buffer into the L2 cache.

While prefetching shares the goal of minimizing load latency, prefetching mechanisms are very different from, and complementary to, the mechanism proposed in this paper.

Perhaps the most similar mechanism is remote store programming (RSP) [16]. RSP is a memory model for multicores that requires less hardware support than CCSM or direct memory access (DMA). RSP sacrifices store latency for low load latency because most programs tolerate store latency better than load latency. Like our proposed approach, RSP homes data on particular cores, so that data can only exist in cache on the home core. Compared to DMA and CCSM, RSP shows dramatic improvements when evaluated for 32 or more processors. The downside is that determining producer-consumer relationships in multi-threaded multicore programs can be quite challenging. Our work adapts the RSP idea to an integrated CPU-GPU environment, to which it is much better suited, as the producer-consumer relationships are well-defined and easily determined.

3. DESIGN AND IMPLEMENTATION

Integrated CPU-GPU systems allow the GPU to be placed on the same die as the CPU. As a consequence, it is now possible for the CPU and GPU to share resources, including cache memory, and the short distance between them allows for a high-speed interconnection network to and from each of the processing cores for exchanging messages. The CPU and GPU perform best on different kinds of workloads, and most heterogeneous workloads offload some portion of the computation to the GPU as a hardware accelerator. Given this division of labor, the CPU usually produces data which is later consumed by the GPU to accelerate the computation. In conventional cache-coherence protocols for such integrated CPU-GPU systems, the data is pulled to the GPU on a load instruction. In this paper, however, we propose a scheme for pushing the data directly to the GPU's cache to reduce load latency.

In realizing such a scheme, several challenges arise. Additional information may be needed at the source-code level about the variables that are to be stored remotely; memory allocation for variables that need to be stored on the GPU, detection of the specific addresses that must be treated differently, and new rules for storing those variables in the caches need to be added. Also, a dedicated interconnection network is helpful to provide fast data delivery and to reduce the complexity of the cache coherence modifications. This section describes our proposal for such a push-based cache coherence scheme, which we call direct store.

3.1 Data Flow

The data flows of traditional cache coherence and of our proposed scheme are different. Figure 1 compares the data movement between the state-of-the-art CCSM used in integrated CPU-GPU systems and the proposed direct store. In the Hammer protocol, all CPU and GPU caches maintain cache coherence, except for the GPU's L1 cache [19]. Although the L1 cache is non-coherent in terms of hardware, by writing through dirty data and flash-invalidating, it retains coherence at the time a kernel starts execution. Data movement in CCSM involves more steps than in direct store, as depicted in Figure 1 (top). In CCSM, when a CPU core issues the store instruction st x, it also copies the data from the register file (RF) to its L1 cache. Sometime afterwards, there are three possibilities: the data remains in the CPU's L1 cache, it is moved to the lower (L2) cache, or it is evicted to DRAM. In direct store, on the other hand, when a CPU core issues the store instruction st x, a copy is sent directly to the GPU's L2 cache, as illustrated by step 1 in Figure 1 (bottom). In case the GPU's L2 cache is full, the system writes the data to DRAM instead.

When the GPU wants to consume data produced by the CPU in a CCSM system, the GPU fetches it from its L1 cache. If it is not in that cache, then the GPU core requests it from the GPU's L2 cache. If it is in the GPU's L2 cache, the GPU still fetches it into the GPU's L1 cache, as illustrated by step b in Figure 1 (top). The worst case is when the GPU's L2 cache does not contain the requested data. The GPU then has to pull it into the GPU's L2 cache. Since there are three possible places the data may reside, it may have to be pulled from any of them, as illustrated by steps c1, c2, and c3 in Figure 1 (top). This CCSM data movement is very different from what we propose in direct store. In our proposal, the GPU only needs to pull the data into its L1 cache, since it is already in the GPU's L2 cache and can be fetched immediately, as illustrated by steps a and b in Figure 1 (bottom). In summary, the main advantage of direct store over CCSM from the data-movement point of view is that the GPU does not need to pull data into its local cache, as it is already there. This approach reduces the message traffic needed for coherence when providing the data to the GPU.

Figure 1: Comparison of data movement between CPU and GPU under CCSM and the proposed direct store. In CCSM (top), data is stored into the CPU's cache and is only pulled to the GPU on a load, inviting a large load latency. In direct store (bottom), data homed to the GPU is immediately written to the GPU's L2 cache; when the GPU loads it, it hits immediately.

The following subsections detail the design of our proposed scheme to achieve this shorter data flow.

3.2 Overall Design

Figure 2 depicts the five components of direct store for integrated CPU-GPU systems. The first is automatic code translation, which modifies a standard program without CUDA memory copy into a program that can run on our proposed scheme. Second, when the program is running, it reserves and allocates a region of memory in the GPU's L2 cache. The third component is address translation, which occurs when a CPU store targets the reserved memory; this address translation is accomplished by added functionality in the TLB. The fourth component is a special cache coherence mechanism, which kicks in when a CPU store targets the reserved memory. The fifth is a new interconnection network that brings data directly to the GPU's L2 cache.

Figure 2: The overall design and the components of direct store on integrated CPU-GPU systems: code translation, memory allocation, address translation, special cache coherence behavior, and a new interconnection network.

Intuitively, direct store increases the latency of CPU stores to memory regions designated for the GPU. In exchange, the data to be loaded by the GPU is moved into its memory hierarchy at creation. Thus, the protocol is designed to decrease GPU load latency (which is hard to hide) in exchange for increased CPU store latency (to which most programs are less sensitive).

3.3 Automatic Code Translation

We develop a code translator to implement our proposed scheme automatically from baseline benchmarks that use no memory copy. The automatic code translator works as follows.

The translator takes as input the application's location. All files within the input directory are scanned. The translator inspects each file in the application directory to identify variables that will be accessed by the GPU. This is done by scanning all variable references in the kernel invocations in all files. The translator searches for the pattern that matches a kernel call, kernel_name<<<...>>>(x1, x2, ..., xn), where x1, x2, ..., xn are the variable names captured by the translator and stored in temporary memory. Once all kernel invocations have been inspected, the translator scans each source code file to determine the amount of memory allocated for each variable and the variable's type, and stores that information in another temporary memory. Therefore, each variable name is associated with the memory size needed and its type. Finally, the translator iterates over all files inside the directory to find the memory declarations of the variables recorded in temporary memory.

The translator searches for memory declarations that use the malloc and cudaMalloc syntax. Whenever such a declaration is found, the translator modifies it to use mmap memory allocation, sets its memory size and starting virtual address, and saves the modification to the associated file. The translator increments the starting virtual address by the memory size needed by the variable, so that no two variables have overlapping address ranges. After all iterations are completed, the application is ready to be compiled in the standard way.

By using this automatic code translator, no effort is required from the programmer to modify the source code to implement the direct store scheme. Therefore, there is no need for the programmer to understand the proposed scheme in detail.
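To make the translator's effect concrete, the sketch below shows a before-and-after fragment under our reading of the description above. The kernel, variable names, and the ds_alloc() helper are illustrative assumptions rather than the translator's actual emitted code; the helper anticipates the mmap-based allocation detailed in Sections 3.4 and 3.5, and it assumes the integrated, shared-address-space setting the paper targets, where a kernel can dereference a host pointer directly.

```cuda
// Hypothetical example of the source-to-source rewrite (not the tool's real output).
#include <cuda_runtime.h>
#include <sys/mman.h>
#include <stddef.h>
#include <stdint.h>

#define DS_BASE 0x80000000UL            /* assumed start of the reserved range */
static uintptr_t ds_next = DS_BASE;

/* Assumed helper standing in for the translator-emitted allocation: map the
 * buffer at a fixed high-order virtual address so CPU stores to it can be
 * recognized by the modified TLB. */
static void *ds_alloc(size_t bytes)
{
    const size_t page = 4096;
    size_t len = (bytes + page - 1) & ~(page - 1);          /* page-align the length */
    void *p = mmap((void *)ds_next, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
    if (p == MAP_FAILED) return NULL;
    ds_next += len;                      /* advance so regions never overlap */
    return p;
}

__global__ void consume(float *x, int n)                    /* kernel argument x is what */
{                                                            /* the translator captures   */
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

void run(int n)
{
    /* Before translation:  float *x = (float *)malloc(n * sizeof(float));      */
    /* After translation, only the allocation changes; the launch is untouched: */
    float *x = (float *)ds_alloc(n * sizeof(float));
    for (int i = 0; i < n; i++) x[i] = (float)i;   /* CPU produces x: stores are pushed */
    consume<<<(n + 255) / 256, 256>>>(x, n);       /* GPU consumes x: loads hit in L2   */
    cudaDeviceSynchronize();
}
```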

3.4 Special Memory Allocation

The proposed method needs OS support for reserving exclusive memory regions. In our implementation, we use mmap, a Unix system call that maps files or devices into the virtual address space of the calling process.

The mmap function is declared as *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset), where addr specifies the starting address of the new mapping. If addr is NULL, the kernel chooses the (page-aligned) address at which to create the mapping; otherwise the kernel takes that address as a hint for the mapping. In our work, the special memory allocation works by setting the argument addr to a high-order address; by also setting flags to MAP_FIXED, we can map the region at exactly the address specified by addr. By allocating high-order addresses for remotely stored data in this way, we give the TLB a signal that data at these addresses needs to be stored remotely in the GPU's L2 cache.

3.5 Translation Look-aside Buffer

Figure 3 illustrates the control flow of direct store on a store targeting the GPU's L2 cache. First, the CPU allocates data that will be read by the GPU with the special memory allocation. Specifically, we reserve bits of the virtual address space to represent data that is homed on the GPU. This special data range can never be cached on the CPU side (so accesses from the CPU will always miss), but it can be cached on the GPU. When CPU stores target this memory, the memory management unit (MMU) translates the virtual address (VA) to a physical address (PA) by consulting the TLB. By definition, this memory will always result in a cache miss on the CPU. On detecting the special memory region's VA, the TLB signals the MMU to forward the store to the GPU (over a dedicated network), which resolves it. When the GPU wants to consume data stored by the CPU, it simply issues a standard load.

Figure 3: The control flow when the CPU issues a store instruction and the roles of the TLB.

Therefore, we need additional functionality in the TLB to implement our proposed scheme. We modify the TLB by adding logic to detect certain regions of virtual addresses. The region of virtual addresses used to distinguish a remote store from an ordinary store is the high-order virtual addresses, which use the memory range of the reserved system virtual address space. The system's virtual address space lies within the range 0x80000000 through 0xFFFFFFFF. When a high-order virtual address is referenced by the program, the TLB performs an address comparison to detect it. To do this, we first choose the address 0x80000000 as the base address of the high-order range. The virtual address of the store is then compared to this base address using standard address comparison. Whenever the virtual address of the store is higher than the base address, the TLB sends a signal to the MMU that is forwarded to the CPU's L1 cache controller, telling it to forward the store to the GPU's L2 cache.
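The following is a minimal behavioral sketch of the address check described above, written as simulator-style C rather than hardware. The constant name, function boundaries, and the two commented-out handler calls are our own placeholders, not part of the design.

```c
#include <stdint.h>
#include <stdbool.h>

#define DS_HOME_BASE 0x80000000UL   /* base of the reserved, GPU-homed range */

/* Behavioral model of the added TLB logic: any store whose virtual address
 * lies at or above the base of the reserved range is flagged as a remote
 * store, so the L1 controller forwards it to the GPU's L2. */
static bool is_remote_store(uintptr_t va)
{
    return va >= DS_HOME_BASE;      /* a single unsigned comparison in hardware */
}

/* Hypothetical hook invoked by an MMU model on every store after translation. */
void on_store(uintptr_t va, uintptr_t pa)
{
    (void)pa;
    if (is_remote_store(va)) {
        /* forward_to_gpu_l2(pa);  -- push over the dedicated network (Section 3.7) */
    } else {
        /* store_to_cpu_l1(pa);    -- ordinary CCSM path */
    }
}
```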
3.6 Cache Coherence Protocol

Since we add a network between the CPU and the GPU's L2 cache, we need only a simple modification to the cache coherence protocol, because the data does not travel through the whole cache hierarchy and therefore does not affect the protocol much. However, for direct store to function correctly, we need to specify to the protocol the store state of the remotely stored data. Figure 4 depicts a modified state-transition diagram of AMD's Hammer cache coherence protocol [1], which consists of five stable states: MM, M, O, S, and I. State MM represents that the node has an exclusive hold on the cache block and the block is potentially locally modified (similar to the conventional M state). State O indicates that the node owns the cache block but has not modified it, and no other node holds the block in exclusive mode, although sharers may exist. State M indicates that the cache block is held in exclusive mode but has not been written to (similar to the conventional E state), and no other node holds a copy of the block; stores are not allowed in state M. State S indicates that the cache line holds the most recent, correct copy of the data, but other processors in the system may hold copies of the data in the shared state; the cache line can be read, but not written to, in state S. State I indicates that the cache line is invalid and does not hold a valid copy of the data.

Figure 4: The modified state-transition diagram of the Hammer protocol (based on [1]).

We modify the protocol for handling a store instruction that targets a remote location by adding new actions, depicted by bold text in the figure. The modification is minimal because CPU-to-GPU stores travel through the new network. When a remote store instruction arrives at the CPU's L1 cache targeting the GPU's L2 cache, the protocol starts from state I and the data is forwarded directly from the CPU's L1 cache to the GPU's L2 cache, without caching it in the L1 cache, through the newly added network that connects them. After the data is sent through the network, the protocol remains in state I.

To cover other possibilities, where the CPU might load data after remotely storing it, we add the ability to perform a remote store from states S, M, and MM. All remote stores that begin in states S, M, or MM always go to state I. This is safe because when the protocol transitions from one of these states to state I, it obtains exclusive permission for the cache block involved. In the GPU's L2 cache, after a store instruction is forwarded there, the block begins in state I and then always proceeds to state MM. This is safe because we make sure that the remotely stored data is deposited only in the GPU's L2 cache, not in any other cache.

The GPU's L2 cache runs exactly the same protocol as the CPU. However, there is a single additional feature that makes it possible for the GPU to accept stores from a remote location. Every time a remote store arrives at the GPU's L2 cache, the block transitions from state I to MM, as illustrated by the blue dashed transition in Figure 4. This transition always starts from state I since, before forwarding the data, the CPU issues a GETX command. The store is issued as a PUTX action, indicating that the store targets the GPU's L2 cache.
3.7 Interconnection Network

When a CPU store instruction targeting the GPU is detected by the TLB, the TLB tells the MMU to direct the store to the GPU's L2 cache. We therefore add a fast network directly connecting the CPU and the GPU's L2 cache. Figure 5 shows the newly added network, indicated by the dotted line. By connecting the CPU and the GPU's L2 cache directly, the data transfer is faster because it bypasses much of the cache hierarchy and eliminates many coherence messages. Since we bypass much of the cache hierarchy and go straight to the GPU's L2 cache, we need only introduce minor modifications to the cache coherence protocol, namely those related to the CPU's L1 cache and the GPU's L2 cache. The newly added interconnection network has exactly the same characteristics as the networks used in many cache-coherence systems to exchange messages.

Figure 5: The topology of the simulated system with a network connecting the CPU and the GPU's LLC (assumed to be L2 in this paper).
3.8 Integrating All Components Together

The first step to run direct store is to modify the application: it should not use any explicit memory copy from CPU to GPU. To begin, data that will be used by the kernel code needs to be identified. The next step is to change the CPU memory allocation of the data that will later be used by the GPU: instead of malloc, these allocations should use the mmap function with high-order addresses. These steps are all accomplished by the automatic code translator.

After modifying the application, we simply run it as is on a standard integrated CPU-GPU system. When the CPU tries to store data that was remotely allocated, the store is detected by the TLB. We have modified the TLB so that it can detect high-order addresses, and when that happens, the TLB sends a different store message to the CPU's L1 cache. When the message arrives at the CPU's L1 cache, the cache recognizes the different store message; then, instead of storing the data in its own cache, it sends the data to the GPU's L2 cache through the new network. The GPU's L2 cache receives the data from the CPU and stores it as if it came from the GPU's L1 cache.

3.9 Implementing Direct Store without Traditional Cache Coherence

Although we have presented this scheme as a complement to existing coherence mechanisms (since we think that would be the most likely use case), it could be adopted as a simpler, stand-alone replacement for communication between CPU and GPU. The proposed scheme is simple to implement because it requires fewer coherence messages than traditional protocols. This reduction in messages comes from the fact that the proposed scheme only allows memory shared between GPU and CPU to exist in the GPU cache. Thus, there will not be multiple sharers, and so there are no transitions from (for example) shared states to exclusive states.

The proposed scheme could easily be attached to systems that use state-of-the-art CCSM. This opens a new possibility for the programmer to use the new approach for only a particular type of data, while the remaining data still uses the traditional approach. The programmer can set large variables to use this approach, since this speeds up data movement, and let the remaining small data use the CCSM system, as sketched below. Thus, the programmer benefits from both the fast CPU-GPU data transfer proposed in this scheme and the ease of use of the CCSM system.

The proposed scheme could also replace the entire CCSM system and thus yield a simpler design with better performance. To do this we only need an interconnection network that directly connects the CPU to the GPU's L2 cache, and to remove the CCSM system's interconnection network. The programmer then structures the code to fit the proposed scheme; the main requirement is to ensure that data used by the GPU is allocated in its L2 cache. Finally, the program is compiled and run in the standard way.
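A minimal sketch of this mixed usage, assuming the hypothetical ds_alloc() helper from the earlier sketch (declared here for brevity): the large, GPU-consumed array is homed on the GPU, while a small CPU-side buffer keeps the ordinary CCSM path.

```c
#include <stdlib.h>

void *ds_alloc(size_t bytes);   /* hypothetical GPU-homed allocation (Section 3.3 sketch) */

void setup(int n)
{
    /* Large producer-consumer array: homed on the GPU, pushed on CPU stores. */
    float *samples = (float *)ds_alloc((size_t)n * sizeof(float));

    /* Small bookkeeping data touched mostly by the CPU: ordinary allocation,
     * kept coherent by the existing CCSM protocol. */
    int *histogram = (int *)calloc(256, sizeof(int));

    (void)samples; (void)histogram;
}
```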
4. EXPERIMENTAL EVALUATION

This section details the platform and the applications we use to evaluate direct store on integrated CPU-GPU systems, as well as the experimental results.

We first describe our simulation infrastructure and the tested applications. We then seek to evaluate the following questions:

• Does the proposed approach improve performance over CCSM? We answer this question by measuring the relative run times for both CCSM and our proposed approach for all applications. We also test the performance gains for small inputs, which fit into the GPU LLC, and large inputs, which exceed the LLC capacity.

• Does the proposed approach reduce cache misses? Here we measure the miss rate reductions for both CCSM and the proposed approach. In addition, we believe the proposed approach should specifically reduce compulsory misses, so we measure the number of compulsory misses for both approaches.

• Does the proposed approach do a better job of reducing misses (and increasing performance) than other simple techniques? To answer this question we compare our proposed approach to a version of CCSM augmented with a simple stride prefetcher.

The results and analysis for all these experiments are included in subsequent sections.

4.1 Simulation Infrastructure

Direct store is evaluated on gem5-gpu [32], a simulator that models tightly-coupled CPU-GPU systems. It merges two popular simulators: gem5 [5] and GPGPU-Sim [3]. Our baseline implementation in gem5-gpu is based on the configuration listed in Table 1.

Table 1: System configuration
CPU    | Cores: 1 | L1D cache: 64KB, 2 ways | L1I cache: 32KB, 2 ways | L2 cache: 2MB, 8 ways
GPU    | SMs: 16 (32 lanes per SM @ 1.4 GHz) | L1 cache: 16KB + 48KB shared memory, 4 ways | L2 cache: 2MB, 16 ways, 4 slices
MEMORY | 2GB, 1 channel, 2 ranks, 8 banks @ 1 GHz

All evaluations are performed using gem5-gpu in syscall-emulation mode. We use the Ruby memory system instead of the classic memory system. We simulate an integrated CPU-GPU system comprising a CPU core and a GPU with 16 Fermi-like SMs. The L1 data and instruction caches and the L2 cache are private to the CPU core. For the GPU, each L1 cache is private to an SM, while the L2 cache is shared by all SMs. The cache line size is 128 bytes across the whole system. The system has 2GB of total memory. The simulator provides full-system coherence between CPU and GPU using the modified Hammer cache coherence protocol.

4.2 Applications

To show a variety of applications running on direct store, we use 4 standard benchmark suites and 4 other individual benchmarks, as shown in Table 2. The table lists each benchmark's code name, the sizes of the small and big inputs, the benchmark suite, and whether the benchmark uses the GPU's shared memory feature. Some of the benchmarks are provided in the gem5-gpu project [32], namely Rodinia [8], Parboil [41], and Pannotia [7]. In addition to the four suites, four other benchmarks are also used: bitonic sort from [12], matrix multiplication and transpose from [25], and Cholesky decomposition from [4]. All benchmarks are used to compare the performance of direct store and CCSM.

Table 2: Benchmarks
Benchmark | Abbr. | Small input | Big input | Unit | Suite | Shared Mem.
Back propagation | BP | 1536 | 10000 | nodes | Rodinia [8] | Yes
Breadth-first search | BF | 4096 | 6000 | nodes | Rodinia | No
Gaussian elimination | GA | 256x256 | 700x700 | matrix | Rodinia | Yes
Hotspot | HT | 64x64 | 512x512 | data points | Rodinia | Yes
K-means | KM | 2000, 34 features | 5000, 34 features | data points | Rodinia | Yes
lavaMD | LV | 2 | 4 | boxes | Rodinia | Yes
LU decomposition | LU | 256x256 | 512x512 | matrix | Rodinia | Yes
Nearest neighbour | NN | 10691 | 42764 | data points | Rodinia | No
Needleman-Wunsch | NW | 160x160 | 320x320 | data points | Rodinia | Yes
Pathfinder | PT | 2500 | 5000 | data points | Rodinia | Yes
Srad | SR | 256x256 | 512x512 | data points | Rodinia | Yes
Stencil | ST | 128x128x32 | 164x164x32 | matrix | Parboil [41] | Yes
Graph coloring | GC | power | delaunay-n15 | graph | Pannotia [7] | No
Floyd-Warshall | FW | 256_16384 | 512_65536 | graph | Pannotia | No
Maximal independent set | MS | power | delaunay-n13 | graph | Pannotia | No
Single-source shortest path | SP | power | delaunay-n13 | graph | Pannotia | No
Blackscholes | BL | 5000 | 10000 | data points | NVIDIA SDK | No
Vector addition | VA | 50000 | 200000 | data points | NVIDIA SDK | No
Bitonic sort | BS | 262144 | 524288 | data points | [12] | No
Matrix multiplication | MM | 256x256 | 900x900 | matrix | [25] | No
Matrix transpose | MT | 32x32 | 1600x1600 | matrix | [25] | No
Cholesky decomposition | CH | 150x150 | 600x600 | matrix | [4] | No

Our code translator assumes that the input program performs no CUDA memory copy. For the Rodinia and Parboil benchmarks in [32], the benchmark codes have already had memory copies from CPU to GPU and vice versa removed. For the other benchmarks, this has to be done manually, by converting the code to eliminate all CUDA memory copy functions and deleting all device variables, because in integrated CPU-GPU systems the GPU can access CPU data without explicitly copying it. Once the source codes contain no CUDA memory copy, they are processed by the automatic translator to produce modified source codes ready to be compiled in the usual way.

4.3 Speedup

We evaluate the effect of remote store by running all benchmarks with two different types of memory access: cache-coherent shared memory (CCSM) and the proposed direct store memory access. We report the speedup obtained by the direct store approach by normalizing its total ticks to the total ticks of CCSM (one possible formulation is sketched at the end of this subsection).

Figure 6 shows the speedup produced by the proposed scheme for small (top) and big (bottom) inputs, ignoring those benchmarks with zero percent speedup for both small and big inputs, namely GA, KM, LV, PT, SR, ST, and MS. For these benchmarks the proposed scheme brings no benefit, but it also does not hurt the application. For small inputs, of the 22 benchmarks, 5 show a speedup of over 10%, 7 show no speedup, and 10 show a modest speedup. NN, BL, VA, MM, and MT are the benchmarks with a speedup above 10%. Almost all of these benchmarks also have a large miss rate reduction, as shown in Figure 7, except for MM and MT, which instead experience a miss rate increase. Moreover, these benchmarks do not use GPU shared memory and therefore make extensive use of the GPU's cache. BP, HT, and GC have large miss rate reductions but their speedup is not high, because they use the shared memory feature and therefore do not involve the GPU's L2 cache much in their computation. The rest are benchmarks with small miss rate reductions followed by small or zero speedup. Note that the proposed approach never decreases performance, and can often provide performance increases.

Figure 6: Speedup of direct store over CCSM for small (top) and big (bottom) inputs.

Figure 6 (bottom) shows that for BP, HT, LU, NW, and FW, when the input is big, the threads are unable to hide memory latency even though the benchmarks use shared memory, and thus direct store impacts performance significantly. For NN, BL, VA, MM, and MT, the speedup for big inputs is smaller than for small inputs. This is because the input is larger than the GPU's L2 cache, and hence the miss rate reduction decreases, and the speedup with it. The extreme case occurs for MM and MT, where the speedup drops to zero following a decrease in miss rate reduction. The rightmost bars in both figures represent the geometric means of all non-zero speedups, namely 7.8% for small inputs and 5.7% for big inputs.

In summary, direct store does improve performance over CCSM. The mean performance improvements are moderate, but they are achieved with no direct intervention from the programmer other than running our code translator on an existing program. Furthermore, while we might expect the performance gains to disappear for very large inputs, direct store programs still retain a performance advantage over CCSM in this case. Finally, we see that converting programs to use direct store never hurts performance compared to CCSM. Given these observations, we conclude that direct store can be a viable replacement for CCSM that requires little programming effort, never hurts performance, and often provides performance gains.
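For concreteness, one plausible formulation of the reported speedup, assuming "total ticks" is the simulator's end-to-end tick count for each run (the exact extraction from the gem5-gpu statistics is not shown in the paper):

```c
#include <stdint.h>

/* Percent speedup of direct store (DS) over CCSM, computed by normalizing
 * the DS run's total ticks to the CCSM run's total ticks. */
double speedup_percent(uint64_t ticks_ccsm, uint64_t ticks_ds)
{
    return 100.0 * ((double)ticks_ccsm / (double)ticks_ds - 1.0);
}

/* Equal tick counts give 0%; a DS run with fewer ticks than the CCSM run
 * gives a positive speedup. */
```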
4.4 GPU's L2 Cache Miss Rate Reduction

We now attempt to break down the performance gains of direct store (DS) over CCSM. Specifically, direct store's performance gains should come from more favorable cache behavior compared to CCSM. We therefore begin by investigating the GPU's L2 miss rate under both direct store and CCSM.

Figure 7 shows the GPU's L2 cache miss rate for all benchmarks, ignoring those benchmarks with a zero L2 cache miss rate under both CCSM and direct store, namely GA, LU, and BS. The red bar is the miss rate for CCSM, while the blue bar is for direct store.

Figure 7: GPU L2 miss rate for (top) small and (bottom) big inputs.

Figure 7 (top) depicts the GPU's L2 cache miss rate for small inputs. We can observe that in most cases the GPU's L2 cache miss rate under CCSM is higher than under direct store. The benchmarks whose miss rate is reduced are BP, BF, HT, KM, LU, NN, NW, SR, GC, FW, MS, SP, BL, VA, and CH. This is as we expect; since data is directed to the GPU's L2 cache when the CPU stores it, the GPU does not need to fetch it again, and this reduces the miss rate.

Nevertheless, some cases need further explanation. For MM and MT, the miss rate under direct store is higher than under CCSM. For both benchmarks, the total misses actually decrease when moving from CCSM to direct store; however, the GPU L2 accesses are also reduced, and this reduction is even larger than the reduction in total misses. The rest of the benchmarks (GA, LV, PT, ST, and BS) show no difference in miss rate between CCSM and direct store. For GA, LV, ST, and BS, the total misses decrease significantly; however, since the total number of accesses to the GPU's L2 cache is extremely large compared to the total misses, the miss rate remains the same. For PT, there is no significant change in the GPU's L2 cache miss rate when direct store is implemented: the total misses and the total accesses to the GPU's L2 cache do not change. This is because in this benchmark the CPU does not store any data that is later used by the GPU.

Figure 7 (bottom) depicts the GPU's L2 cache miss rate for big inputs. The miss rate reduction from CCSM to direct store is similar to the small-input case. The miss rate is reduced for BP, BF, HT, KM, LU, NN, NW, GC, MS, SP, BL, VA, and CH, while the rest show no difference. For SR and FW, when the input size is big, the reduction is canceled out because cache accesses actually increase as the input grows, while the total miss reduction is relatively small compared to the increase in cache accesses. Note again that the rightmost bars in both figures represent the geometric means of all L2 miss rates, namely 9.3% for CCSM versus 7.3% for direct store for small inputs, and 12.5% for CCSM versus 11.1% for direct store for big inputs.

This data confirms that the proposed approach has a positive impact on cache behavior. Specifically, under the direct store approach we expect data to reside in the GPU's L2 cache on the first access, rather than being pulled to the GPU after a first miss.
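The cases above follow directly from how the miss rate is defined; writing it out makes the MM/MT and GA/LV/ST/BS observations explicit (this is just the standard definition, not an additional result):

```latex
\text{L2 miss rate} \;=\; \frac{\text{L2 misses}}{\text{L2 accesses}}
```

A scheme can therefore lower the absolute miss count yet leave the rate unchanged (when misses and accesses shrink proportionally, or accesses dwarf misses, as for GA, LV, ST, and BS) or even raise it (when accesses shrink faster than misses, as for MM and MT).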
4.5 Compulsory Misses

We now dive further into the cache performance. Our hypothesis about direct store is that it should result in more hits on initial accesses. To evaluate this hypothesis, we examine the compulsory misses under CCSM and direct store.

Compulsory misses are misses that must occur because, when an application is executed, it always starts with an empty cache. We measure the number of compulsory misses for each benchmark to show the potential reduction after implementing our proposed scheme. We also report the total miss reduction to show how much of this potential we actually realize. The compulsory miss counts and total miss reductions are shown in Table 3. The first column lists the benchmarks; the next two columns contain the compulsory misses for each benchmark running on the baseline system and the associated total miss reduction for small inputs; the last two columns contain the same quantities for big inputs. The total miss reduction is obtained by subtracting the compulsory misses of a benchmark running on the baseline system from its total misses when running on our proposed scheme.

Table 3: Compulsory misses and total miss reduction
Benchmark | Misses (Small) | Reduction (Small) | Misses (Big) | Reduction (Big)
BP | 1749 | 1637 | 11271 | 10629
BF | 1264 | 1240 | 10233 | 10222
GE | 3744 | 3723 | 14176 | 3723
HS | 401 | 256 | 14968 | 15666
KM | 9160 | 2679 | 19707 | 5908
LV | 487 | 470 | 3851 | 3834
LU | 2119 | 2049 | 8263 | 8193
NN | 1006 | 668 | 4013 | 2672
NW | 10867 | 1616 | 12006 | 7293
PT | 7901 | -1 | 15792 | 0
SR | 12321 | 1984 | 12253 | 2734
ST | 15988 | 8207 | 15238 | 8647
GC | 1404 | 1032 | 15448 | 12702
FW | 4092 | 2048 | 14052 | 8286
MS | 2093 | 1234 | 7725 | 4620
SP | 1616 | 981 | 5479 | 3392
BL | 791 | 470 | 15639 | 9379
VA | 4691 | 3126 | 15484 | 12503
BS | 8196 | 117 | 15977 | 19415
MM | 604 | 16 | 15705 | 49
MT | 70 | 47 | 13585 | 1
CH | 844 | 837 | 21103 | 3105

For small inputs, the total miss reduction is slightly lower than the compulsory miss count in most cases, the exceptions being KM, NW, PT, SR, ST, FW, BS, and MM. This means that in most cases the proposed scheme removes roughly as many misses as there are compulsory misses, as we expect. PT has a near-zero reduction since the application involves many random memory accesses. KM, SR, FW, and NW have small total miss reductions. For BS, MM, and MT, the total miss reduction is not high, but memory accesses are significantly lower. For big inputs, the total miss reduction mostly follows the small-input case, remaining slightly lower than the compulsory misses, except for HS and BS, which have slightly higher total miss reductions than compulsory misses. This shows that our proposed scheme also prevents some conflict misses that would otherwise follow the compulsory misses.

These results provide further evidence that the performance increases observed with direct store come from the proposed scheme (rather than some other external factor). Specifically, if direct store is working as intended, we should see a lower cache miss rate (which we do), and the specific cache misses that should be reduced are compulsory misses. The data in this section confirms that the miss reduction comes from removing compulsory misses. Therefore, we conclude that the direct store approach is not only reducing runtime and increasing performance, it is doing so for the stated reasons: the approach deposits data directly in the GPU cache where it will be used, reducing compulsory misses at the GPU and improving overall performance in many cases.

4.6 Stride Prefetching

The preceding sections show that our approach improves performance and that the performance gains come from the cache behavior we expect. In this section we compare our approach with another simple way to try to achieve the same performance gains.

Prefetching has a successful history of accelerating many workloads on the CPU, and our proposed scheme might seem similar to prefetching in some ways. In the following, we evaluate the use of stride prefetching on the benchmarks to compare its performance with direct store (DS). The results are shown in Figure 8. The y-axis of the charts in the figure gives the speedup in percent. The speedup is obtained by normalizing the GPU total cycles of a benchmark when it runs with prefetching to the GPU total cycles of the benchmark when it runs on the baseline system. Each benchmark is evaluated for small and big inputs.

Figure 8: Speedup achieved by stride prefetching and by direct store for small (top) and big (bottom) inputs.

Figure 8 (top) compares the speedup gained by stride prefetching and by direct store for small inputs. In most cases, stride prefetching hurts performance, by as much as -73% and by -19% on average, except for MT, which has a speedup of 17%. These results are in stark contrast with our proposed scheme, which never degrades system performance: it always gives a positive impact or, at worst, changes nothing. Figure 8 (bottom) compares the speedup gained by stride prefetching and by direct store for big inputs. These results follow the same trend as the small-input case. Moreover, stride prefetching affects system performance even more adversely, by as much as -93% and -10% on average, with no positive results, while our proposed scheme again gives a positive or zero effect. We also conduct experiments with a stride prefetcher of degree 8 to see whether increasing the degree helps; the results show the same performance as stride prefetching with degree 3. From these results, we conclude that direct store has better performance and better stability than stride prefetching in all cases.

These results show that the proposed scheme is not only performant, but also much more performant than another obvious, simple scheme that could easily be implemented. We note that these results, showing that simple prefetching mechanisms hurt performance, are consistent with prior research [37], which showed that prefetching can improve GPU performance, but only if the prefetching scheme is significantly more complicated than simple stride prefetching.
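For reference, the sketch below models a generic stride prefetcher of the kind used as the comparison point (per-PC stride detection with a configurable degree). It is our own simplified model; the exact prefetcher implementation and its simulator configuration are not detailed in the paper.

```c
#include <stdint.h>

#define PF_TABLE_SIZE 256
#define PF_DEGREE     3          /* issue up to 3 prefetches ahead, as in the comparison */

/* One entry per load PC: last address seen and the last observed stride. */
struct pf_entry { uint64_t last_addr; int64_t stride; int confident; };
static struct pf_entry pf_table[PF_TABLE_SIZE];

/* Called on every demand load; fills prefetch_addrs with up to PF_DEGREE
 * predicted addresses and returns how many were generated. */
int stride_prefetch(uint64_t pc, uint64_t addr, uint64_t *prefetch_addrs)
{
    struct pf_entry *e = &pf_table[pc % PF_TABLE_SIZE];
    int64_t stride = (int64_t)(addr - e->last_addr);
    int n = 0;

    if (e->confident && stride == e->stride && stride != 0) {
        for (n = 0; n < PF_DEGREE; n++)          /* same stride seen repeatedly: */
            prefetch_addrs[n] = addr + (uint64_t)((n + 1) * stride);
    }
    e->confident = (stride == e->stride);        /* update the predictor state */
    e->stride    = stride;
    e->last_addr = addr;
    return n;
}
```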

−33 −34 Speedup (%) −50 −43 −44 Acknowledgements Prefetching DS This research is supported by NSF (CCF-1439156, CNS- −66 1526304, CCF-1823032, CNS-1764039). Additional support −73 BP BF GA HT KM LV LU NN NW PT SR ST GC FW MS SP BL VA BS MM MT CH comes from the Proteus project under the DARPA BRASS

16 program and a DOE Early Career award. It is also supported 13 12 10 7 5 5 4 6 6 in part by hibah penelitian dosen Departemen Ilmu Komputer 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 1 0 dan Elektronika Universitas Gadjah Mada dana BPPTNBH −1 −1 −1 −1 −3 −3 −3 −4 −7 −5 −5 tahun anggaran 2019. −16 −16 −16 −20

−30 6. REFERENCES [1] “The gem5 simulator documentation: MOESI hammer,” 2013. −50 [Online]. Available: http://www.m5sim.org/MOESI_hammer

Speedup (%) [2] N. Agarwal, D. Nellans, E. Ebrahimi, T. F. Wenisch, J. Danskin, and S. W. Keckler, “Selective GPU caches to eliminate CPU-GPU HW cache coherence,” in 2016 IEEE International Symposium on High Prefetching DS Performance (HPCA), March 2016, pp. −93 494–506. BP BF GA HT KM LV LU NN NW PT SR ST GC FW MS SP BL VA BS MM MT CH [3] A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt, “Analyzing CUDA workloads using a detailed GPU simulator,” in 2009 Figure 8: Speedup achieved by stride prefetching and by IEEE International Symposium on Performance Analysis of Systems direct store for small (top) and big (bottom) inputs. and Software, April 2009, pp. 163–174. [4] M. Bhatt, “Parallel implementation of Cholesky decomposition using CUDA ,” 2017. [Online]. Available: cases compared to stride prefetching. https://github.com/bhattmansi/Implementation-of-Cholesky- Decomposition-in-GPU-using-CUDA These results show that the proposed scheme is not only [5] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, performant, but also much more performant than one other A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti, R. Sen, obvious simple scheme that could be easily implemented. We K. Sewell, M. Shoaib, N. Vaish, M. D. Hill, and D. A. Wood, “The note that these showing that simple prefetching mechanisms gem5 simulator,” SIGARCH Comput. Archit. News, vol. 39, no. 2, pp. 1–7, Aug. 2011. [Online]. Available: hurt are consistent with prior research [37], which showed http://doi.acm.org/10.1145/2024716.2024718 that prefetching can provide GPU performance, but only if [6] R. Buchty, V. Heuveline, W. Karl, and J.-P. Weiss, “A survey on the prefetching scheme is significantly more complicated than hardware-aware and on multicore simple stride prefetching. processors and accelerators,” Concurr. Comput. : Pract. Exper., vol. 24, no. 7, pp. 663–675, May 2012. [Online]. Available: 4.7 Hardware Overhead http://dx.doi.org/10.1002/cpe.1904 [7] S. Che, B. M. Beckmann, S. K. Reinhardt, and K. Skadron, “Pannotia: To apply direct store in integrated CPU-GPU systems, a Understanding irregular GPGPU graph applications,” in 2013 IEEE small hardware overhead is incurred (all performance over- International Symposium on Workload Characterization (IISWC), Sep. head is included in the above numbers). We add a network 2013, pp. 185–195. that directly connects the CPU’s L1 cache and GPU’s L2 [8] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S. Lee, and cache and a logic in the TLB to detect the incoming remotely K. Skadron, “Rodinia: A benchmark suite for heterogeneous computing,” in 2009 IEEE International Symposium on Workload stored data (Section3). The logic works by comparing store Characterization (IISWC), Oct 2009, pp. 44–54. instructions’ high-order addresses to the baseline address. [9] Y.-k. Choi, J. Cong, Z. Fang, Y. Hao, G. Reinman, and P. Wei, “A This small overhead can be done by wiring to a logic gate. quantitative analysis on of modern CPU-FPGA platforms,” in Proceedings of the 53rd Annual Design Automation Conference, ser. DAC ’16. New York, NY, USA: ACM, 2016, pp. 5. CONCLUSION 109:1–109:6. [Online]. Available: We have proposed direct store, a new memory model for http://doi.acm.org/10.1145/2897937.2897972 integrated CPU-GPU systems. In direct store, when a CPU [10] E. S. Chung, P. A. Milder, J. C. Hoe, and K. 
Mai, “Single-chip heterogeneous computing: Does the future include custom logic, stores data that will be consumed by the GPU, the memory FPGAs, and GPGPUs?” in (MICRO), 2010 43rd system automatically deposits the data in the GPU’s LLC. Annual IEEE/ACM International Symposium on. IEEE, 2010, pp. Our experimental results show that direct store reduces 225–236. the GPU’s LLC miss rate and increases performance at the [11] M. Dashti and A. Fedorova, “Analyzing memory management cost of a small hardware overhead for adding interconnection methods on integrated CPU-GPU systems,” SIGPLAN Not., vol. 52, no. 9, pp. 59–69, Jun. 2017. [Online]. Available: network and a new TLB logic. Additionally, direct store is http://doi.acm.org/10.1145/3156685.3092256 much simpler than CCSM and needs less hardware support.

[12] M. Endler, "Parallel bitonic sort using CUDA," 2011. [Online]. Available: https://gist.github.com/mre/1392067

[13] R. B. Gillett, "Memory channel network for PCI," IEEE Micro, vol. 16, no. 1, pp. 12–18, Feb 1996.

[14] B. A. Hechtman and D. J. Sorin, "Evaluating cache coherent shared virtual memory for heterogeneous multicore chips," in 2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), April 2013, pp. 118–119.

[15] J. Hestness, S. W. Keckler, and D. A. Wood, "GPU computing pipeline inefficiencies and optimization opportunities in heterogeneous CPU-GPU processors," in 2015 IEEE International Symposium on Workload Characterization, Oct 2015, pp. 87–97.

[16] H. Hoffmann, D. Wentzlaff, and A. Agarwal, "Remote store programming," in High Performance Embedded Architectures and Compilers, Y. N. Patt, P. Foglia, E. Duesterwald, P. Faraboschi, and X. Martorell, Eds. Springer Berlin Heidelberg, 2010, pp. 3–17.

[17] D. R. Hower, B. A. Hechtman, B. M. Beckmann, B. R. Gaster, M. D. Hill, S. K. Reinhardt, and D. A. Wood, "Heterogeneous-race-free memory models," in Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS '14. New York, NY, USA: ACM, 2014, pp. 427–440. [Online]. Available: http://doi.acm.org/10.1145/2541940.2541981

[18] J. Y. Jang, H. Wang, E. Kwon, J. W. Lee, and N. S. Kim, "Workload-aware optimal power allocation on single-chip heterogeneous processors," IEEE Transactions on Parallel and Distributed Systems, vol. 27, no. 6, pp. 1838–1851, June 2016.

[19] C. N. Keltcher, K. J. McGrath, A. Ahmed, and P. Conway, "The AMD Opteron processor for multiprocessor servers," IEEE Micro, vol. 23, no. 2, pp. 66–76, 2003.

[20] D. Kim and D. Yeung, "Design and evaluation of compiler algorithms for pre-execution," in Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS X, 2002, pp. 159–170. [Online]. Available: http://doi.acm.org/10.1145/605397.605415

[21] G. Kurian, O. Khan, and S. Devadas, "The locality-aware adaptive cache coherence protocol," in The 40th Annual International Symposium on Computer Architecture, ISCA'13, Tel-Aviv, Israel, June 23-27, 2013, 2013, pp. 523–534.

[22] J. Lee, N. B. Lakshminarayana, H. Kim, and R. Vuduc, "Many-thread aware prefetching mechanisms for GPGPU applications," in 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture, Dec 2010, pp. 213–224.

[23] J. Lee, M. Samadi, Y. Park, and S. Mahlke, "Transparent CPU-GPU collaboration for data-parallel kernels on heterogeneous systems," in Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques, ser. PACT '13. Piscataway, NJ, USA: IEEE Press, 2013, pp. 245–256. [Online]. Available: http://dl.acm.org/citation.cfm?id=2523721.2523756

[24] W. Liu, B. Lewis, X. Zhou, H. Chen, Y. Gao, S. Yan, S. Luo, and B. Saha, "A balanced programming model for emerging heterogeneous multicore systems," in Proceedings of the 2nd USENIX Conference on Hot Topics in Parallelism, ser. HotPar'10. Berkeley, CA, USA: USENIX Association, 2010, pp. 1–6. [Online]. Available: http://dl.acm.org/citation.cfm?id=1863086.1863089

[25] Z. Liu, "Matrix multiplication in CUDA," 2018. [Online]. Available: https://github.com/lzhengchun/matrix-cuda

[26] M. Lodde, J. Flich, and M. E. Acacio, "Heterogeneous NoC design for efficient broadcast-based coherence protocol support," in 2012 IEEE/ACM Sixth International Symposium on Networks-on-Chip, May 2012, pp. 59–66.

[27] C.-K. Luk, "Tolerating memory latency through software-controlled pre-execution in simultaneous multithreading processors," in Proceedings 28th Annual International Symposium on Computer Architecture, June 2001, pp. 40–51.

[28] D. Lustig and M. Martonosi, "Reducing GPU offload latency via fine-grained CPU-GPU synchronization," in Proceedings of the 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA), ser. HPCA '13. Washington, DC, USA: IEEE Computer Society, 2013, pp. 354–365. [Online]. Available: http://dx.doi.org/10.1109/HPCA.2013.6522332

[29] S. Mittal and J. S. Vetter, "A survey of CPU-GPU heterogeneous computing techniques," ACM Comput. Surv., vol. 47, no. 4, pp. 69:1–69:35, Jul. 2015. [Online]. Available: http://doi.acm.org/10.1145/2788396

[30] S. Pei, M.-S. Kim, J.-L. Gaudiot, and N. Xiong, "Fusion coherence: Scalable cache coherence for heterogeneous kilo-core system," in Advanced Computer Architecture, J. Wu, H. Chen, and X. Wang, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2014, pp. 1–15.

[31] B. Pichai, L. Hsu, and A. Bhattacharjee, "Architectural support for address translation on GPUs: Designing memory management units for CPU/GPUs with unified address spaces," in Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS '14. New York, NY, USA: ACM, 2014, pp. 743–758. [Online]. Available: http://doi.acm.org/10.1145/2541940.2541942

[32] J. Power, J. Hestness, M. S. Orr, M. D. Hill, and D. A. Wood, "gem5-gpu: A heterogeneous CPU-GPU simulator," IEEE Computer Architecture Letters, vol. 14, no. 1, pp. 34–36, Jan 2015.

[33] J. Power, M. D. Hill, and D. A. Wood, "Supporting x86-64 address translation for 100s of GPU lanes," in 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA), Feb 2014, pp. 568–578.

[34] J. Power, A. Basu, J. Gu, S. Puthoor, B. M. Beckmann, M. D. Hill, S. K. Reinhardt, and D. A. Wood, "Heterogeneous system coherence for integrated CPU-GPU systems," in Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture. ACM, 2013, pp. 457–467.

[35] B. Saha, X. Zhou, H. Chen, Y. Gao, S. Yan, M. Rajagopalan, J. Fang, P. Zhang, R. Ronen, and A. Mendelson, "Programming model for a heterogeneous x86 platform," in Proceedings of the 30th ACM SIGPLAN Conference on Programming Language Design and Implementation, ser. PLDI '09. New York, NY, USA: ACM, 2009, pp. 431–440. [Online]. Available: http://doi.acm.org/10.1145/1542476.1542525

[36] M. J. Schulte, M. Ignatowski, G. H. Loh, B. M. Beckmann, W. C. Brantley, S. Gurumurthi, N. Jayasena, I. Paul, S. K. Reinhardt, and G. Rodgers, "Achieving exascale capabilities through heterogeneous computing," IEEE Micro, vol. 35, no. 4, pp. 26–36, July 2015.

[37] A. Sethia, G. Dasika, M. Samadi, and S. Mahlke, "APOGEE: Adaptive prefetching on GPUs for energy efficiency," in Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques, Sep. 2013, pp. 73–82.

[38] Y. S. Shao, S. Xi, V. Srinivasan, G.-Y. Wei, and D. Brooks, "Toward cache-friendly hardware accelerators," in HPCA Sensors and Cloud Architectures Workshop (SCAW), 2015, pp. 1–6.

[39] M. Silberstein, B. Ford, I. Keidar, and E. Witchel, "GPUfs: Integrating a file system with GPUs," in Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS '13. New York, NY, USA: ACM, 2013, pp. 485–498. [Online]. Available: http://doi.acm.org/10.1145/2451116.2451169

[40] K. L. Spafford, J. S. Meredith, S. Lee, D. Li, P. C. Roth, and J. S. Vetter, "The tradeoffs of fused memory hierarchies in heterogeneous computing architectures," in Proceedings of the 9th Conference on Computing Frontiers, ser. CF '12. New York, NY, USA: ACM, 2012, pp. 103–112. [Online]. Available: http://doi.acm.org/10.1145/2212908.2212924

[41] J. A. Stratton, C. Rodrigues, I.-J. Sung, N. Obeid, L.-W. Chang, N. Anssari, G. D. Liu, and W.-m. W. Hwu, "Parboil: A revised benchmark suite for scientific and commercial throughput computing," University of Illinois at Urbana-Champaign, Center for Reliable and High-Performance Computing, Tech. Rep., 2012.

[42] T. Suh, D. M. Blough, and H.-H. S. Lee, "Supporting cache coherence in heterogeneous multiprocessor systems," in Proceedings Design, Automation and Test in Europe Conference and Exhibition, vol. 2, Feb 2004, pp. 1150–1155.

[43] T. Suh, H.-H. S. Lee, and D. M. Blough, "Integrating cache coherence protocols for heterogeneous multiprocessor systems, part 2," IEEE Micro, vol. 24, no. 5, pp. 70–78, Sep. 2004.

[44] T. Suh, D. Kim, and H.-H. S. Lee, "Cache coherence support for non-shared bus architecture on heterogeneous MPSoCs," in Proceedings of the 42nd Design Automation Conference, June 2005, pp. 553–558.

[45] J. van Eijndhoven, J. Hoogerbrugge, M. Jayram, P. Stravers, and A. Terechko, Cache-Coherent Heterogeneous Multiprocessing as Basis for Streaming Applications. Dordrecht: Springer Netherlands, 2005, pp. 61–80. [Online]. Available: https://doi.org/10.1007/1-4020-3454-7_3

[46] J. Vesely, A. Basu, M. Oskin, G. H. Loh, and A. Bhattacharjee, "Observations and opportunities in architecting shared virtual memory for heterogeneous systems," in 2016 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), April 2016, pp. 161–171.

[47] H. Wang, V. Sathish, R. Singh, M. J. Schulte, and N. S. Kim, "Workload and power budget partitioning for single-chip heterogeneous processors," in Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques. ACM, 2012, pp. 401–410.

[48] H. Wang, R. Singh, M. J. Schulte, and N. S. Kim, "Memory scheduling towards high-throughput cooperative heterogeneous computing," in Proceedings of the 23rd International Conference on Parallel Architectures and Compilation. ACM, 2014, pp. 331–342.

[49] Y. Yang, P. Xiang, M. Mantor, and H. Zhou, "CPU-assisted GPGPU on fused CPU-GPU architectures," in IEEE International Symposium on High-Performance Comp Architecture, Feb 2012, pp. 1–12.

[50] F. Zhang, J. Zhai, B. He, S. Zhang, and W. Chen, "Understanding co-running behaviors on integrated CPU/GPU architectures," IEEE Transactions on Parallel and Distributed Systems, vol. 28, no. 3, pp. 905–918, March 2017.

[51] X. Zhou, H. Chen, S. Luo, Y. Gao, S. Yan, W. Liu, B. Lewis, and B. Saha, "A case for software managed coherence in many-core processors," in Proceedings of the 2nd USENIX Conference on Hot Topics in Parallelism (HotPar'10), Jun. 2010, pp. 1–6.

[52] C. Zilles and G. Sohi, "Execution-based prediction using speculative slices," in Proceedings of the 28th Annual International Symposium on Computer Architecture, ser. ISCA '01, 2001, pp. 2–13. [Online]. Available: http://doi.acm.org/10.1145/379240.379246