Using Criticality of GPU Accesses in Memory Management for CPU-GPU Heterogeneous Multi-core Processors

ABSTRACT this paper, we model such heterogeneous computational sce- narios by simultaneously scheduling SPEC CPU workloads Heterogeneous chip-multiprocessors with CPU and GPU in- on the CPU cores and 3D scene rendering or GPGPU ap- tegrated on the same die allow sharing of critical memory plications on the GPU of a heterogeneous CMP. In these system resources among the CPU and GPU applications and computational scenarios, the working sets of the jobs co- give rise to challenging resource scheduling problems. In scheduled on the CPU and the GPU cores contend for the this paper, we explore memory access scheduling algorithms shared LLC capacity and the DRAM bandwidth causing de- driven by criticality of GPU accesses in such systems. Dif- structive interference. ferent GPU access streams originate from different parts of GPU workload = Parboil LBM (OpenCL) CPU GPU 1 the GPU rendering , which behaves very differently 1C1G 2C1G 4C1G from the typical CPU pipeline requiring new techniques for 0.9 GPU access criticality estimation. We propose a novel queu- 0.8 ing network model to estimate the performance-criticality of 0.7 the GPU access streams. If a GPU application performs be- 0.6 low the quality of service requirement (e.g., frame rate in 0.5 Normalized performance

3D rendering), the memory access scheduler uses the esti- 429.mcf 437,450 429,462 410.bwaves 437.leslie3d450.soplex462.libquantum473.astar482.sphinx3 429,437,462,473410,450,462,482 mated criticality information to accelerate the critical GPU Figure 1: Performance of heterogeneous mixes rela- accesses. Detailed simulations done on a heterogeneous chip- tive to standalone performance on Core i7-4770. multiprocessor model with one GPU and four CPU cores running DirectX, OpenGL, and CPU application mixes show To understand the extent of the CPU-GPU interference, that our proposal improves the GPU performance by 15% on we conduct a set of experiments on an Intel Core i7-4770 pro- average without degrading the CPU performance much. Ex- cessor (Haswell)-based platform. This heterogeneous pro- tensions proposed for the mixes containing GPGPU applica- cessor has four CPU cores and an integrated HD 4600 GPU tions, which do not have any quality of service requirement, sharing an 8 MB LLC and a dual-channel DDR3-1600 16 improve the performance by 7% on average. GB DRAM (25.6 GB/s peak DRAM bandwidth) [15]. We prepare eleven heterogeneous mixes for these experiments. 1 INTRODUCTION Every mix has LBM from the Parboil OpenCL suite [49] as the GPU workload using the long input (3000 time-steps). A heterogeneous chip-multiprocessor (CMP) makes simul- LBM serves as a representative for memory-intensive GPU taneous use of both CPU and GPU cores. Such architec- workloads.1 Seven out of the eleven mixes exercise one CPU tures include AMD’s accelerated processing unit (APU) fam- core, two mixes exercise two CPU cores, and the remain- ily [2, 22, 56] and Intel’s Sandy Bridge, Ivy Bridge, Haswell, ing two mixes exercise all four CPU cores. In all mixes, Broadwell, and Skylake processors [7, 12, 20, 21, 41, 57]. In the GPU workload co-executes with the CPU application(s) this study, we explore heterogeneous CMPs that allow the drawn from the SPEC CPU 2006 suite. All SPEC CPU CPU cores and the GPU to share the on-die interconnect, 2006 applications use the ref inputs. Figure 1 shows, for the last-level (LLC), the memory controllers, and the each mix (identified by the constituent CPU workload on DRAM banks, as found in Intel’s integrated GPUs. The the x-axis), the performance of the CPU and the GPU work- GPU workloads are of two types: massively parallel compu- loads separately relative to the standalone performance of tation exercising only the shader cores (GPGPU) and multi- these workloads. For example, for the first 4C1G mix us- frame 3D animation utilizing the entire rendering pipeline of ing four CPU cores and the GPU, the standalone CPU per- the GPU. Each of these two types of GPU workloads can be formance is the average turn-around time of the four CPU co-executed with general-purpose CPU workloads in a het- applications started together, while the GPU is idle. Sim- erogeneous CMP. These heterogeneous computational sce- ilarly, the standalone GPU performance is the time taken narios are seen in embedded, desktop, workstation, and high- to complete the Parboil LBM application on the integrated end computing platforms employing CPU-GPU MPSoCs. In GPU. When these workloads run together, performance de- grades. As Figure 1 shows, the loss in CPU performance Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that varies from 8% (410.bwaves in the 1C1G group) to 28% (the copies are not made or distributed for profit or commercial advantage first mix in the 4C1G group). The GPU performance degra- and that copies bear this notice and the full citation on the first page. dation ranges from 1% (the first mix in the 1C1G group) to Copyrights for components of this work owned by others than ACM 26% (the first mix in the 4C1G group). The GPU workload must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, degrades more with the increasing number of active CPU requires prior specific permission and/or a fee. Request permissions cores. Similar levels of interference have been reported in from [email protected]. simulation-based studies [1, 18, 23, 29, 32, 38, 42, 47, 48, 55]. CASES’17, Seoul, South Korea © 2017 ACM. 978-x-xxxx-xxxx-x/YY/MM. . . $15.00 1 Experiments done with a larger set of memory-intensive GPU work- DOI: 10.1145/nnnnnnn.nnnnnnn loads show similar trends. In this paper, we attempt to recover some portion of the to application groups has also been proposed [34]. Critical- lost GPU performance by proposing a novel memory access ity estimation of load instructions [50] and criticality-driven scheduler driven by GPU performance feedback. memory access schedulers [10] for CPUs have been explored. Prior proposals have studied specialized memory access The memory access scheduling studies for the discrete schedulers for heterogeneous systems [1, 18, 38, 48, 55]. These GPU parts focus on the shader cores only and do not con- proposals modulate the priority of all GPU or all CPU re- sider the rest of the GPU rendering pipeline. These studies quests in bulk depending on the latency-, bandwidth-, and explore ways to minimize the latency variance among the deadline-sensitivity of the current phase of the CPU and the threads within a warp [3], to accelerate the critical shader GPU workloads. In this paper, we propose a new mechanism cores that do not have enough short-latency warps [19], and to dynamically identify a subset of GPU requests as critical to design a mix of shortest-job-first and FR-FCFS with the and accelerate them by exercising fine-grain control over al- goal of accelerating the less latency-tolerant shader cores [28]. location of memory system resources. Our proposal is mo- Several studies have explored specialized memory access tivated by the key observation that the performance impact schedulers for heterogeneous systems [1, 18, 38, 48, 55]. The of accelerating different GPU accesses is not the same (Sec- staged memory scheduler (SMS) clubs the memory requests tion 3). For identifying the bottleneck GPU accesses, we from each source (CPU or GPU) into source-specific batches model the information flow through the rendering pipeline based on DRAM row locality [1]. Each batch is next sched- of the GPU using a queuing network. We observe that the uled with a probabilistic mix of shortest-batch-first (favor- accesses to the memory system originating from the GPU ing latency-sensitive jobs) and round-robin (enforcing fair- shader cores and the fixed function units have complex inter- ness among bandwidth-sensitive jobs). The dynamic priority dependence, which is not addressed by the existing CPU load scheduler [18] proposed for mobile heterogeneous platforms criticality estimation techniques [10, 50]. To the best of our employs dynamic progress estimation of tile-based deferred knowledge, our proposal is the first to incorporate critical- rendering (TBDR) [40, 43] and offers the GPU accesses equal ity information of fine-grain GPU accesses in the design of priority as the CPU accesses if the GPU lags behind the memory access schedulers for heterogeneous CMPs. target frame rendering time. Also, during the last 10% 3D animation is an important GPU workload. For these of the left time to render a frame, the GPU accesses are workloads, it is sufficient to deliver a minimum acceptable given higher priority than the CPU accesses. The subse- frame rate (usually thirty frames per second) due to the per- quently proposed deadline-aware memory scheduler for het- sistence of human vision; it is a wastage of resources to im- erogeneous systems (DASH) further improves the dynamic prove the GPU performance beyond the required level. Our priority scheme by offering the highest priority to short- proposal includes a highly accurate architecture-independent deadline applications and prioritizing the GPU applications technique to estimate the frame rate of a 3D rendering job at that lag behind the target [55]. The option of statically par- run-time. Our proposal employs the criticality information titioning the physical address space between the CPU and to speed up the GPU workload only if the estimated frame GPU datasets and assigning two independent memory con- rate is below the target level. We suitably extend our pro- trollers to handle accesses to the two datasets has been ex- posal for the GPGPU applications, which do not have any plored [38]. A subsequent study has shown that such static minimum performance requirement. Section 4 discusses our partitioning of memory resources can be sub-optimal [23]. proposal. We summarize our contributions in the following. Insertion and replacement policies to manage the shared 1. We present a novel technique for identifying the critical LLC in the heterogeneous CMPs have been explored [29, 42]. memory accesses sourced by the GPU rendering pipeline. Selective LLC bypass policies for GPU misses arising from 2. We present mechanisms based on accurate frame rate latency-tolerant shader cores have been proposed [32]. estimation to identify the critical phases of 3D animation. 3. We propose DRAM access scheduling mechanisms to par- 3 MOTIVATION tition the DRAM bandwidth between the critical GPU ac- Different types of data are accessed by the programmable cesses, non-critical GPU accesses, and CPU accesses. and the fixed function units in a GPU rendering pipeline. Simulations done with a detailed model of a heterogeneous Examples of such data include vertex data, vertex index CMP (Section 5) having one GPU and four CPU cores run- data, pixel color data, texture sampler data, pixel depth ning mixes of DirectX, OpenGL, and CPU applications show data, hierarchical depth data [11], shader cores’ instruction that our proposal improves the GPU performance by 15% on and data, blitter data, etc.. An access from a data stream average without degrading the CPU performance much (Sec- looks up the internal of the GPU dedicated tion 6). For mixes of CUDA and CPU applications, system to that stream and, on a miss, looks up the LLC shared be- performance improves by 7% on average. tween the GPU and CPU cores. The LLC misses are served by the DRAM. In this section, we demonstrate that the 2 RELATED WORK sensitivity of different types of GPU access streams toward Memory access scheduling has been explored for CPU plat- memory system optimization is not uniform necessitating a forms, discrete GPU parts, and heterogeneous CMPs. The stream-wise criticality measure. studies targeting the CPU platforms have attempted to im- Figure 2 shows the distribution of DRAM read accesses prove the throughput as well as fairness of the threads that across different stream types coming from the GPU for four- share the DRAM system [6, 8–10, 14, 16, 24, 26, 36, 37, 39, teen DirectX and OpenGL workloads. Each workload ren- 44, 51–53]. Profile-guided assignment of DRAM channels ders a multi-frame segment of a popular PC game. These data are collected on a simulated heterogeneous CMP.2 We and depth data may be consumed by the texture sampler for consider the following stream categories: color (C), tex- generating dynamic texture maps and shadow maps, respec- ture sampler (T), depth (Z), blitter (B), and everything else tively [13, 30]. On the other hand, accelerating the color clubbed into the “other” (O) category.3 Figure 2 shows that, stream without improving a bottlenecked texture stream in general, the color, texture, and depth streams constitute may not be helpful because color blending is typically im- the larger share of the DRAM accesses from the GPU; the plemented after shading and texturing. We need to discover actual distribution varies widely across applications. this inter-dependent critical group of streams at run-time.

C T Z B O 3 1 2.5 0.8 2 3DMark06GT13DMark06GT2 COD2 CRYSIS 0.6 1.5 3DMark06HDR13DMark06HDR2 DOOM3 0.4 1 0.2 CT CTZ CTZB CTZBO 3 0 2.5 HL2 L4D NFS COR UT3 COD2 2 CRYSISDOOM3 UT2004

QUAKE4 COMBINED

1.5 INDIVIDUAL GPU performance speedup 3DMark06GT13DMark06GT2 Fraction of GPU accesses to DRAM 3DMark06HDR13DMark06HDR2 1 Figure 2: Distribution of DRAM accesses from 3D HL2 L4D NFS COR UT3 UT2004 QUAKE4 scene rendering workloads. Figure 4: Speedup achieved when a set of streams is

1.4 1.64 1.50 1.57 made to behave ideally. 1.3 1.2 For the GPGPU applications, the shader accesses con- COD2 3DMark06GT2 1.1 3DMark06HDR13DMark06HDR2 CRYSIS DOOM3 stitute the dominant stream. In this paper, each static 1 load/store shader instruction defines a distinct shader access C 3DMark06GT1 1.47 1.91 1.4 T stream. We adopt well-known stall-based techniques used 1.3 Z in the CPU space for identifying the critical shader access 1.2 B streams [27, 31, 35]. More specifically, in each shader core 1.1 O GPU performance speedup 1 we maintain a fully-associative stall table with least-recently- HL2 L4D NFS COR UT3 used (LRU) replacement. Each entry of the table records the UT2004 QUAKE4 program (PC) of a shader instruction. If a shader in- Figure 3: Speedup achieved when each individual struction I stalls at dispatch time due to a pending operand, stream is made to behave ideally. the parent shader instruction P that produces the operand is We evaluate the performance-criticality of each stream by inserted into the stall table, provided P is a load instruction treating all non-compulsory LLC misses from that stream as that has missed in the shader core’s private cache. Subse- hits. Figure 3 quantifies the speedup (ratio of frame rates quently, the accumulated stall cycle count introduced by P with and without this optimization) achieved by accelerat- is tracked in its entry. The left panel of Figure 5 shows the ing each stream in this way. Performance-sensitivity of any distribution of the DRAM accesses sourced by the shader in- particular stream varies across applications. A comparison structions sorted by stall cycle count for six GPGPU work- of Figures 2 and 3 shows that the performance-sensitivity of loads. “TnPC” denotes the top n shader instructions in the streams is not always in proportion to the volumes of this sorted list, while “All’ denotes all load/store shader in- DRAM accesses of the streams within an application. Dif- structions. Top four shader instructions can cover almost all ferent streams exploit the latency hiding capability of the DRAM accesses except for LBM and CFD. GPU by different amounts leading to varying impacts on T1PC T2PC T4PC T8PC T16PC All 1 3 the critical path through the application. 0.9 0.8 2.5 Figure 4 quantifies the performance-criticality of a set of 0.7 0.6 streams by treating all their non-compulsory LLC misses as 0.5 2 0.4 hits. We focus on only a few sets for acceleration, namely, 0.3 1.5 0.2 CT (set of color and texture), CTZ, CTZB, and CTZBO. 0.1

The left bar (“COMBINED”) for each application shows 0 GPU performance speedup 1 the stacked speedup as a new stream is added to the acceler- LBM CFD BFS LBM CFD BFS

REDUCTION FASTWALSH REDUCTION FASTWALSH ated set starting from CT. For comparison, we also show the Fraction of GPU accesses to DRAM accumulated speedup when each stream in a set is individ- BLACKSCHOLES BLACKSCHOLES ually accelerated in the bar “INDIVIDUAL”. We observe Figure 5: Left: distribution of DRAM accesses from that the combined speedup is much higher than the accu- GPGPU workloads. Right: speedup achieved when mulated individual speedup in several applications. This the top n PC streams are made to behave ideally. indicates that there are certain inter-stream performance- The right panel of Figure 5 shows the speedup achieved dependencies that must be accelerated together. For exam- when the load/store accesses sourced by the top n shader ple, a semantic dependence arises from the fact that the color instructions are treated ideally in the LLC. We observe that 2 Our simulation infra-structure is discussed in Section 5. the speedup data correlate well with the DRAM access dis- 3 The blitter is a special fixed function unit used to copy and tribution indicating that pipeline stall-based critical shader color data before it is sampled by the texture sampler. stream identification is a fruitful direction to pursue. 4 GPU ACCESS CRITICALITY has completed more than a threshold, thout[i], number of requests; otherwise Cout[i] is decremented. The peak in- In this section, we present our proposal on identifying and put bandwidths of FE, ROP, and BT are used as thin[FE], managing critical GPU accesses in heterogeneous CMPs. thin[ROP ], and thin[BT ], respectively. Similarly, the peak 4.1 Identifying Critical GPU Accesses output bandwidths determine thout[FE], thout[ROP ], and thout[BT ]. We use the shader’s peak input bandwidth di- 6 In the following, we present the mechanisms for selecting the vided by a constant as thin[SH] as well as thout[SH]. critical accesses in 3D rendering and GPGPU workloads. Critical Stream Selection Algorithm. Using the val- ues of Cin[i] and Cout[i], we first generate three bits for each 4.1.1 3D Scene Rendering Workloads. We represent the unit i: IOccupancy[i], AOccupancy[i], and T hroughput[i]. 3D rendering pipeline as an abstract queuing network of IOccupancy[i] is 1 iff Cin[i] of any instance of unit i is more five units, namely, front-end (FE), depth/stencil test units w−1 than 2 . AOccupancy[i] is 1 iff Cin[i] for all instances of (ZS), shader cores (SH), color blenders and writers (CW), w−1 unit i are more than 2 . T hroughput[i] is 1 iff Cout[i] for and blitters (BT). The texture samplers are attached to the w−1 shader cores. The front-end loads vertex indices and vertex all instances of unit i are more than 2 . In general, if attributes, generates the geometry primitives, and produces T hroughput[i] is 0 and IOccupancy[i] is 1, the unit i is clas- the rasterized fragment quads.4 The ZS unit removes the sified as bottlenecked and all accesses originating from it are hidden surfaces based on a depth/stencil test on the frag- classified as critical. For example, if SH is bottlenecked, all ments. The shader cores run a user-defined parallel shader shader and texture sampler accesses would be marked criti- program on each of the fragments received from the ZS unit. cal. For SH to be bottlenecked, T hroughput[SH] must be 0 The shaded fragment quads are passed on to the CW unit and AOccupancy[SH] must be 1 meaning that throughput for computing the final pixel color. One ZS unit and one is low even though all shader units have enough work to do. CW unit constitute one render output pipeline (ROP). If the If a unit i has multiple instances, e.g., ZS, SH, CW, and depth/stencil test is done before pixel shading, it is known AOccupancy[i] is 0, we identify the unit i as underloaded. as early-Z. In certain situations, the ZS unit may have to be To identify the bottleneck unit(s), we periodically execute invoked after SH and before CW. This is known as late-Z. Algorithm 1. First, this algorithm determines if CW and BT are bottlenecked. Next, it traverses the path FE-ZS-SH- Queuing Model for Rendering Pipeline. We model the CW (if early-Z is enabled) or the path FE-SH-ZS-CW (if inter-dependence between the 3D rendering pipeline units early-Z is disabled) from back to front. During this back-to- using a queuing network shown in Figure 6. The model has front traversal, if the algorithm encounters an underloaded 2n+3 queues, where n is the number of ROPs. The FE, SH, unit U, it examines the unit V in front of U and finds out and BT units have one queue each. Each of the n ZS and whether U is underloaded because V is bottlenecked. CW units has one queue. Processing in the pipeline model can begin at FE or BT. In the first case, information flows 4.1.2 GPGPU Workloads. For the GPGPU workloads, we through FE, ZS, SH, and CW in that order leading to the employ a two-level algorithm invoked periodically for iden- output. This path gets activated during a draw operation. tifying the critical accesses. The first level of the algo- The second path, which connects BT to the output, gets rithm identifies the bottlenecked shader cores. For each activated during the blitting process. shader core, we maintain two saturating counters named InputStall and OutputStall, each of width w bits (w = 8

ZS0 CW0 in our implementation) and initialized to the mid-point i.e., . . 2w−1. In a cycle, if the front-end of a shader core i fails FE . SH . to dispatch any warp due to pending source operands, the . . InputStall[i] counter is incremented by one; otherwise it ZSn CWn is decremented by one. Similarly, in a cycle, if the back- end of the shader core i fails to commit any shader instruc- BT tion, the OutputStall[i] counter is incremented by one; oth- erwise it is decremented by one. For a shader core i, if Figure 6: Queuing network for GPU pipeline. both InputStall[i] and OutputStall[i] are found to be above w−1 Request Flow Monitoring. We monitor the request ar- 2 , the core is classified as bottlenecked. The second level rival and completion rates at each of the units to first identify of the algorithm employs the stall table introduced in Sec- the bottleneck units. For this purpose, we associate two up- tion 3 to identify the critical accesses from the bottlenecked cores. We use a sixteen-entry fully-associative LRU stall ta- down saturating counters Cin[i] and Cout[i] of width w bits to each unit i, where w is a configuration parameter (our ble per shader core. Among the instruction PC’s captured by this table, the top few PC’s covering up to 90% of the evaluations use w = 8).5 All counters are initialized to the w−1 total stall cycles are considered to be generating critical ac- mid-point i.e., 2 . At the end of a cycle, Cin[i] is in- cesses to the memory sub-system. If a load/store instruction cremented if unit i is found to have pending requests the misses in a bottlenecked shader core’s private cache and is count of which is above a threshold thin[i]; otherwise Cin[i] among the top few critical instructions captured by the stall is decremented. Similarly, Cout[i] is incremented, if unit i table, the miss request sent to the LLC is marked critical. 4 A fragment quad is made of four fragments, each with complete information to render a pixel in the render buffer. 6 The constant represents the number of cycles the shader program 5 All algorithm parameters are tuned meticulously. takes to process a fragment. Algorithm 1 Algorithm to find bottleneck units Our frame rate estimation scheme operates in two modes, Inputs: IOccupancy (IO), AOccupancy (AO), namely, learning and prediction modes. The learning mode Throughput (TH) vectors lasts for one full frame. It measures the amount of work Returns: Bottleneck vector in the frame and the rendering time of the frame. In the Initialize Bottleneck vector to zero. prediction mode, our scheme starts producing frame rate if TH[CW] == 0 and IO[CW] == 1 then projections and continuously compares the learning mode Bottleneck[CW] = 1 data with the data from the newly completed frames. If end if the newly observed values differ from the learned values by if TH[BT] == 0 and IO[BT] == 1 then more than a threshold, the hardware discards the learned Bottleneck[BT] = 1 values and back to the learning mode. Unlike prior end if proposals for estimating GPU progress [18, 55], our proposal . Back to front traversal does not assume tile-based deferred rendering and does not if CW is underloaded then require any profile information. if early-Z enabled then if TH[SH] == 0 and AO[SH] == 1 then Learning Mode: Rendering of a frame involves generating Bottleneck[SH] = 1 the color values of all pixels into a buffer commonly known end if as the render target (RT). A single pixel in the RT can if SH is underloaded then get overdrawn multiple times depending on the arrival or- if TH[ZS] == 0 and IO[ZS] == 1 then der and depth of the geometry primitives. This complicates Bottleneck[ZS] = 1 the estimation of the amount of work involved in rendering a end if frame. We divide the RT into equal sized t × t render target if ZS is underloaded then tiles (RTT). We divide the rendering of a frame into ren- Check FE state using Algorithm 2 der target planes (RTP). As shown in Figure 7, each RTP end if represents a batch of updates that cover all tiles of the RT. end if else Render-Target tile if TH[ZS] == 0 and IO[ZS] == 1 then Y Bottleneck[ZS] = 1 end if if ZS is underloaded then if TH[SH] == 0 and AO[SH] == 1 then Bottleneck[SH] = 1 end if if SH is underloaded then X Check FE state using Algorithm 2 end if Render-Target plane end if Figure 7: Render-target plane and tile end if end if We maintain a 64-entry RTP information table in the GPU. For a frame, each entry of this table records three pieces of information about a distinct RTP: (i) total number Algorithm 2 Module to check FE bottleneck of updates to the RTP, (ii) the number of cycles to finish the RTP, and (iii) the number of RTTs in the RTP. Each field is if TH[FE] == 0 and IO[FE] == 1 then four bytes in size. If the number of RTPs in a frame exceeds Bottleneck[FE] = Bottleneck[SH] = Bottleneck[ZS] = 1 end if 64, the last entry of the table is used to accumulate the data for all subsequent RTPs. i Prediction Mode: If the number of RTPs in a frame i is Nrtp i If such a shader instruction is not found in the stall table of and the average number of cycles per RTP is Crtp, then the a bottlenecked core, the request to the LLC is still marked number of estimated cycles Fi required to render frame i is i i i critical, provided the LLC miss rate of GPU accesses is at given by Fi = Crtp × Nrtp. We obtain Nrtp directly from most 80%. In all other cases, the GPU access is marked the data collected in the learning mode assuming that it i non-critical. The non-critical shader accesses that miss in doesn’t change for the current frame i. To compute Crtp the LLC bypass the LLC freeing up space for other blocks. for the frame being rendered currently, let the fraction of the frame that has been rendered so far be λ, the average i number of cycles per RTP seen in the current frame be Ccur, 4.2 Estimating Projected Frame Rate and the average number of cycles per RTP recorded in the i The critical GPU accesses in a 3D scene rendering applica- learning mode be Cavg. Therefore, Crtp can be computed as i i tion are marked critical by our proposal only if the projected Crtp = λ × Ccur + (1 − λ) × Cavg. frame rate is below the target. Such projections need to be generated early in a frame to avoid losing opportunity of 4.3 Scheduling DRAM Accesses improving performance. We present a completely dynamic architecture-independent mechanism for estimating the pro- The CPU and GPU requests that miss in the shared LLC jected frame rate at any point in time during a frame. access the DRAM. Every DRAM access coming from the GPU carries a bit set by our criticality estimation hardware identify such phases, we periodically give the highest prior- specifying if the access is critical. We propose two DRAM ity to all GPU accesses in the DRAM scheduler over a small scheduling policies, namely, the GPU-favoring policy and time-window of 100K GPU cycles. If the GPGPU perfor- the interference mitigation policy (IM policy). In the GPU- mance (shader instructions retired per cycle) improves dur- favoring policy, among the requests to the currently open ing this window compared to the last window, the scheduler row in a bank, the critical GPU accesses are served before continues to offer higher priority to all GPU accesses. considering the rest. When a new row needs to be activated in a bank, the oldest critical GPU access is given priority over the global oldest access. The GPU-favoring policy leads to 5 SIMULATION ENVIRONMENT two performance problems. First, the GPU fills arrive at a faster rate to the LLC causing the CPU blocks to get re- We use a modified version of the Multi2Sim simulator [54] to placed at a faster rate than the baseline. Second, the CPU model the CPU cores of the simulated heterogeneous CMP. requests may starve due to a long burst of critical GPU re- Each dynamically scheduled out-of-order issue core is quests. The IM policy, designed to mitigate these problems, clocked at 4 GHz. We use two GPU simulators, one to ex- has two components, one to mitigate CPU starvation in the ecute the 3D rendering jobs and the other to execute the scheduler (IM-SCHED) and another to handle LLC inter- CUDA applications. The 3D rendering GPU is modeled with ference (IM-LLC). While IM-SCHED is the default policy, a an upgraded version of the Attila GPU simulator [33]. The to IM-LLC takes place on detecting LLC interference. shader throughput of the GPU is one tera-FLOPS (single precision). The GPU model used for CUDA applications is The IM-SCHED component prioritizes CPU accesses over borrowed from the MacSim infra-structure [25]. The shader critical GPU accesses with a certain probability. The prob- throughput of this GPU is 512 GFLOPS (single precision). ability is obtained as follows. The execution is divided into Depending on the type of the GPU workload, one of the equal intervals and at the end of each interval, the fraction two GPU models gets attached to the rest of the CMP. The of CPU requests de-prioritized by younger critical GPU re- DRAM modules are modeled using DRAMSim2 [45]. Ta- quests during the interval is computed. This is used as the ble 1 presents the detailed configuration. CPU prioritization probability for the next interval. Effec- tively, the CPU prioritization probability in an interval is The heterogeneous workloads are prepared by mixing the same as the observed probability that a given CPU access is SPEC CPU 2006 applications with 3D scene rendering jobs de-prioritized by a younger critical GPU access in the last drawn from fourteen popular DirectX 9 and OpenGL game interval. If this probability is more than half, it is capped titles as well as six CUDA applications. The DirectX and to half. This probability exceeds half only in the GPGPU OpenGL API traces are obtained from the Attila simula- applications during 2-6% of all intervals. tor distribution and the 3DMark06 suite [58]. We select thirteen SPEC CPU 2006 applications and partition them For detecting LLC interference, the execution is divided into two groups based on the LLC misses per kilo instruc- into equal intervals and within an interval, the CPU applica- tions (MPKI). The high MPKI group (H-group) contains tions are classified into high (H), medium (M), and low (L) bwaves, lbm, leslie3d, libquantum, mcf, milc, and soplex. intensities based on their LLC miss rates. H category has The low MPKI group (L-group) contains bzip2, gcc, om- more than 70% miss rate, M category has miss rate between netpp, sphinx3, wrf, and zeusmp. Each of the twenty GPU 10% and 70%, and L category has miss rate at most 10%. In workloads (fourteen 3D rendering and six GPGPU) is co- two consecutive intervals, if an application’s state is found executed with three different four-way multi-programmed to change from L to M or L to H which can be due to pos- CPU workload mixes. To do this, we use the applications sible LLC interference, the application enters an emergency from the H-group to prepare twenty four-way H mixes. Simi- mode. The IM-LLC component is activated if there is at larly, we prepare twenty four-way L mixes from the L-group. least one emergency mode application. It schedules requests We also prepare twenty four-way HL mixes, each of which from emergency mode applications as often as critical GPU has two H-group and two L-group applications. Each of accesses. The remaining accesses are assigned lower priority. the twenty GPU workloads is mixed with one CPU mix At the end of an interval, if an emergency mode application each from the H, L, and HL sets. For each GPU workload, is found to go back to the L state, this indicates that the we report the performance averaged (geometric mean) over application benefits from IM-LLC. It continues to stay in the three mixes containing that GPU workload. The multi- the emergency mode. On the other hand, at the end of an frame 3D rendering jobs are detailed in Table 2. The last interval, if an emergency mode application is still in M or H column of this table lists the baseline average frames per sec- state, the application exits the emergency mode because it ond (FPS) achieved by the applications when co-scheduled is not helpful for this application. with the four-way CPU mixes. The CUDA applications are The CPU accesses are given higher priority than the non- shown in Table 3. LBM is drawn from Parboil [49]; CFD critical GPU accesses except in one situation. In certain and BFS from Rodinia 3.0 [4, 5]; FASTWALSH, BLACKSC- phases of the GPGPU workloads, the GPU becomes very HOLES, and REDUCTION from the CUDA SDK 4.2. sensitive to memory system performance. In these phases, The first 200M instructions retired by each CPU core are it is possible to improve the GPU performance by sacrificing used to warm up the caches. After the warm-up, each CPU an equal amount of CPU performance and vice-versa. We application in a mix commits at least 450M dynamic in- decide to maintain the GPU performance in these phases structions [46]. Early-finishing applications continue to run by prioritizing all GPU accesses over the CPU accesses. To until each CPU application commits its representative set of dynamic instructions and the GPU completes its job. The Table 1: Simulation environment Table 3: CUDA application details CPU cache hierarchy Application configuration Per-core iL1 and dL1 caches: 32 KB, 8-way, 2 cycles LBM 120×150 blocks, 120 threads/block Per-core unified L2 cache: 256 KB, 8-way, 3 cycles CFD 759 blocks, 128 threads/block GPU model for 3D scene rendering BFS 1954 blocks, 512 threads/block FASTWALSH 8192 blocks, 256 threads/block Shader cores: 64, 1 GHz, four 4-way SIMD per core BLACKSCHOLES 480 blocks, 128 threads/block Texture samplers: two per shader core, 128 GTexel/s REDUCTION 64 blocks, 256 threads/block ROP: 16, fill rate 64 GPixels/s Texture caches: three-level hierarchy, L0: 2 KB per sampler, shared L1, L2: 64 KB, 384 KB performance metrics used for CPU mix, 3D animation, and Depth caches: two-level hierarchy, CUDA application are respectively weighted speedup, aver- L1: 2 KB per ROP, shared L2: 32 KB age frame rate, and the number of execution cycles. Color caches: two-level hierarchy, L1: 2 KB per ROP, shared L2: 32 KB 5.1 Additional Hardware Overhead Vertex cache: 16 KB, shader instruction cache: 32 KB, The critical stream identification logic needs to maintain the Hierarchical depth cache: 16 KB Cin and Cout counters for the FE, BT, ZS, SH, and CW GPU model for GPGPU units. The 3D rendering GPU models 64 SH units and six- Shader cores: 16, 2 GHz, sixteen SP FLOPS/cycle teen ZS and CW units leading to 98 Cin and Cout coun- Instruction, data cache per core: 4 KB, 32 KB, ters requiring a total of 196 bytes. The GPGPU model has Texture, constant cache per core: 8 KB, 8 KB, sixteen shader cores. Each core maintains one OutputStall Software-managed shared memory per core: 16 KB counter, one InputStall counter, and a sixteen-entry stall ta- ble with each entry being 69 bits (32-bit PC, 32-bit stall cy- Shared LLC and interconnect cles, one valid bit, and four LRU bits) amounting to 2.2 KB Shared LLC: 16 MB, 16-way, lookup latency 10 cycles, for all cores. The frame rate estimation mechanism main- inclusive for CPU blocks, non-inclusive for GPU blocks, tains a 64-entry RTP information table, each entry being 97 two-bit SRRIP policy [17] bits. Overall, the storage overhead of our proposal is only Interconnect: bi-directional ring, single-cycle hop 3.1 KB. Most importantly, none of the additional structures Memory controllers and DRAM are accessed or updated on the critical path of execution. Memory controllers: two on-die single-channel, The structures that are accessed every cycle (such as the C , C , InputStall, and OutputStall counters) are small DDR3-2133, FR-FCFS access scheduling in baseline in out in size and expend energy much smaller than what we save DRAM modules: 14-14-14, 64-bit channels, BL=8, throughout the system (CMP die and DRAM device) by im- open-page policy, one rank/channel, 8 banks/rank, proving performance. The remaining structures are accessed 1 KB row/bank/device, x8 devices less frequently and expend much lower energy.

Table 2: Graphics frame details 6 SIMULATION RESULTS Application, DX/OGL7 Frames Res.8 FPS We evaluate our proposal on a simulated heterogeneous CMP 3DMark06 GT1, DX 670–671 R1 5.9 with four CPU cores and one GPU. Sections 6.1 and 6.2 re- 3DMark06 GT2, DX 500–501 R1 14.0 spectively discuss the results for the mixes containing the 3DMark06 HDR1, DX 600–601 R1 16.7 3D rendering and CUDA workloads. 3DMark06 HDR2, DX 550–551 R1 21.8 Call of Duty 2 (COD2), DX 208–209 R2 19.5 6.1 Mixes with 3D Rendering Workloads Crysis, DX 400–401 R2 6.7 DOOM3, OGL 300–314 R3 80.7 We divide the discussion into evaluation of the several indi- Half Life 2 (HL2), DX 25–33 R3 77.4 vidual components that constitute our proposal. Left for Dead (L4D), DX 601–605 R1 33.6 Critical vs. Non-critical Accesses. We conduct two ex- Need for Speed (NFS), DX 10–17 R1 66.6 periments to understand whether our critical access identifi- Quake4, OGL 300–309 R3 80.5 cation logic is able to mark the critical GPU accesses as such. Chronicles of Riddick 253–267 R1 103.9 In one case, we treat all non-compulsory LLC misses from (COR), OGL the critical accesses as hits. In the other case, we treat all non-compulsory LLC misses from the non-critical accesses as Unreal Tournament 2004 200–217 R3 132.5 hits. Figure 8 shows the improvement in FPS over the base- (UT2004), OGL line in the two cases. Except for L4D, all applications show Unreal Tournament 3 955–956 R1 26.6 much higher FPS improvement when the critical accesses (UT3), DX are treated ideally. These results confirm that our proposal 7 DX=DirectX, OGL=OpenGL is able to identify a subset of the critical accesses correctly. 8 Resolutions: R1=1280×1024, R2=1920×1200, R3=1600×1200 On average, treating the critical accesses ideally offers an FPS improvement of 48%, while favoring the complementary access set offers only 13% improvement. In L4D, our algo- the GPU normalized to the baseline. The right panel shows rithm misclassifies a number of critical blitter accesses. COR the weighted speedup for the corresponding CPU mixes nor- loses performance when the non-critical accesses are treated malized to the baseline. We identify each CPU workload by ideally because some of the non-critical accesses negatively GPUworkloadnameCPU . The GPU-favoring policy improves interfere with the critical ones. the FPS by 18% on average while degrading the weighted speedup of the CPU mixes by 8% on average. The IM policy Critical Non−critical 105 is able to recover most of the lost CPU performance. This 90 policy improves the FPS of the GPU applications by 15% 75 60 on average while performing within 3% of the baseline for 45 the CPU application mixes. The CPU mixes co-scheduled 30 15 with 3DMark06HDR1 perform better than the baseline, on 0 −15 average. The IM policy has the IM-SCHED and IM-LLC

Percent increase in FPS HL2 L4D NFS UT3 COD2 COR components. Compared to the GPU-favoring policy, the CRYSISDOOM3 UT2004 Average QUAKE4 IM-LLC component alone reduces CPU performance loss by 3DMark06GT13DMark06GT2 3DMark06HDR13DMark06HDR2 3% while sacrificing 2% GPU performance. The IM-SCHED Figure 8: Percent improvement in FPS when LLC component alone reduces CPU performance loss by 2% while behaves ideally for critical and non-critical accesses. sacrificing 1% GPU performance. Effects are additive when Figure 9 shows the distribution of the critical color (C), they work together in the IM policy. critical texture (T), critical depth (Z), critical blitter (B), GPU−favoring policy GPU−favoring policy critical other (O), and non-critical (NC) accesses as iden- 1.3 IM policy 1.1 IM policy tified by our algorithm in the aforementioned experiment. 1.25 1.05 The distribution varies widely across the applications with 1.2 1 62% of accesses being identified as critical on average. It is 1.15 0.95 encouraging to note that for most of the applications, the 1.1 0.9 stream that was found to enjoy the largest speedup in Fig- 1.05 0.85 ure 3 is among the dominant critical streams identified by 1 0.8 FPS normalized to baseline L4D UT3 our algorithm. COD2 Normalized weighted speedup CRYSIS GT1CPUGT2CPU L4DCPUUT3CPU GEOMEAN HDR1CPUHDR2CPUCOD2CPU GEOMEAN 3DMark06GT13DMark06GT2 CRYSISCPU C T Z B O NC 3DMark06HDR13DMark06HDR2 1 Figure 11: Left: normalized FPS of GPU ap- 0.8 plications that perform below target FPS. Right: 0.6 0.4 weighted CPU speedup for the mixes. 0.2 0 To further understand the quality of the critical access

HL2 L4D NFS COR UT3 set identified by our algorithm, we conduct two experiments COD2 Fraction of all GPU accesses CRYSISDOOM3 UT2004 Average QUAKE4 with the HL mixes containing the GPU applications with

3DMark06GT13DMark06GT2 3DMark06HDR13DMark06HDR2 lower than 40 FPS. In the first experiment, we evaluate the Figure 9: Distribution of critical accesses. FPS improvement when out of the critical accesses, as iden- tified by our algorithm, a randomly selected 25% or 50% or Frame Rate Estimation. Figure 10 shows the percent er- 75% or 100% population is marked critical. The left panel ror observed in our dynamic frame rate estimation technique. of Figure 12 shows the stacked improvement in FPS as a A positive error means over-estimation and a negative error new quarter of the critical accesses is marked critical. These means under-estimation. Several applications have zero er- results show that all quarters are equally important from ror. The maximum over-estimation error is 6% (UT2004) performance viewpoint. In the second experiment, we ex- and the maximum under-estimation error is 4% (COR). The plore if our criticality estimation algorithm can be replaced average error across all applications is less than 1%. by a simpler random sampling algorithm that marks accesses as critical uniformly at random while maintaining the total 6 4 number of critical accesses from each stream same as our 2 algorithm. The right panel of Figure 12 shows the perfor- 0 −2 mance of this algorithm normalized to our algorithm. On HL2 L4D NFS UT3

Percent error −4 COD2 CRYSISDOOM3 UT2004 Average average, the random sampling technique performs 5% worse QUAKE4 COR than our algorithm. 3DMark06GT13DMark06GT2 3DMark06HDR13DMark06HDR2 Comparison to Related Proposals. We compare our Figure 10: Percent error in frame rate estimation. proposal against staged memory scheduling (SMS) [1], dy- DRAM Scheduling for Critical GPU Accesses. Our namic priority scheduler (DynPrio) [18], and deadline-aware DRAM scheduling proposal employs the access criticality scheduling (DASH) [55]. These proposals were discussed in information for the 3D rendering applications that fail to Section 2. We evaluate two versions of SMS, namely, one meet a target FPS. We set this target to 40 FPS and show with a probability of 0.9 of using shortest-job-first (SMS- the results for the eight applications that deliver frame rate 0.9) and the other with this probability zero (SMS-0) i.e., it below this level (see Table 2). Figure 11 evaluates the GPU- always selects the round-robin policy. DynPrio and DASH favoring and IM policies (Section 4.3) for the mixes contain- make use of our frame rate estimation technique to com- ing these GPU applications. The left panel shows the FPS of pute the time left in a frame. Additionally, we compare Q1 Q2 Q3 Q4 is equipped with an 8 MB shared LLC (as opposed to 16 MB 1.3 1 0.99 considered so far). The GPU applications improve by an 1.25 0.98 impressive 17% over the baseline and the co-scheduled CPU 1.2 0.97 0.96 application mixes perform within 4% of the baseline, on av- 1.15 0.95 0.94 erage. The CPU mixes co-scheduled with 3DMark06HDR1 1.1 0.93 1.05 0.92 and 3DMark06HDR2 outperform the baseline, on average. 0.91

FPS normalized to baseline 1 0.9 Referring back to Figure 11, we observe that for a 16 MB L4D UT3 L4D UT3

COD2 Normalized FPS of random sampling COD2 LLC, the GPU gain is 15% and the CPU mixes perform CRYSIS CRYSIS GEOMEAN GEOMEAN within 3% of the baseline, on average. 3DMark06GT13DMark06GT2 3DMark06GT13DMark06GT2 3DMark06HDR13DMark06HDR2 3DMark06HDR13DMark06HDR2

Figure 12: Left: cumulative performance contribu- 1.3 1.02 IM policy 1.01 IM policy tion of each quarter of the critical accesses. Right: 1.25 1 0.99 performance of random sampling normalized to the 1.2 0.98 0.97 proposed criticality estimation algorithm. 1.15 0.96 0.95 1.1 0.94 0.93 our proposal against HeLM, the state-of-the-art shared LLC 1.05 0.92 0.91 management policy for heterogeneous CMPs [32]. FPS normalized to baseline 1 0.9 Normalized weighted speedup L4D UT3 SMS−0.9 SMS−0 DynPrio DASH HeLM GPU criticality COD2 CRYSIS GT1CPUGT2CPU L4DCPUUT3CPU 1.4 GEOMEAN HDR1CPUHDR2CPUCOD2CPU GEOMEAN CRYSISCPU 3DMark06GT13DMark06GT2 1.2 3DMark06HDR13DMark06HDR2 1 Figure 14: Left: normalized FPS of GPU ap- 0.8 plications that perform below target FPS. Right: 0.6 Normalized FPS GEOMEAN weighted CPU speedup for the mixes. L4D UT3 COD2 1.2 3DMark06GT13DMark06GT23DMark06HDR13DMark06HDR2 CRYSIS 1.1 6.2 Mixes with GPGPU Workloads 1 0.9 Figure 15 evaluates SMS-0.9, SMS-0, HeLM, and our GPU CPU speedup 0.8 GEOMEAN criticality-aware proposal for the heterogeneous mixes con-

GT1CPU GT2CPU HDR1CPU HDR2CPU COD2CPU CRYSISCPUL4DCPU UT3CPU taining CUDA applications when the CMP is equipped with Figure 13: Top: FPS speedup over baseline. Bottom: a 16 MB shared LLC.9 Both SMS-0.9 and SMS-0 degrade weighted CPU speedup for the mixes. GPU performance (left panel) by 4% on average while im- Figure 13 shows the comparison for the heterogeneous proving the CPU performance (right panel) by 7% and 8%, mixes containing the GPU applications that fail to meet the respectively. HeLM improves GPU performance by 6% and target FPS. SMS suffers large losses in FPS (upper panel) CPU performance by 7%, on average. Our proposal im- due to the delay in batch formation. DynPrio fails to observe proves GPU performance by 1% and CPU performance by any overall benefit because it offers express bandwidth to the 14%, on average. Since the GPU performance can be traded GPU application only during the last 10% of a frame time. off for CPU performance and vice-versa, we use the equal- Both DASH and our GPU criticality-aware proposal (IM weight performance metric to understand the overall system policy) improve average FPS by 14%. DASH prioritizes the performance. Both SMS-0.9 and SMS-0 improve the equal- GPU accesses throughout the execution. Such a policy, how- weight metric by 2%, while HeLM improves this metric by ever, hurts the performance of the co-scheduled CPU mixes 6%. Our proposal achieves a 7% improvement in this metric. by 10% on average (lower panel of Figure 13). Our proposal, on the other hand, accelerates only the critical GPU accesses SMS−0.9 SMS−0 HeLM GPU criticality 1.2 1.3 and improves average FPS by the same amount as DASH 1.2 1.81 1.1 while delivering CPU performance within 3% of the baseline. 1.1 1 Both SMS-0.9 and SMS-0 improve CPU mix performance by 1 CPU speedup 8%, while suffering large losses in GPU performance. HeLM GPU speedup 0.9 0.9 LBM CFD BFS improves CPU performance by 6% on average, while degrad- BS−CPU LBM−CPUCFD−CPU FW−CPU BFS−CPU RED−CPUGEOMEAN REDUCTIONGEOMEAN ing GPU performance by 5%. To understand how these pro- FASTWALSH posals fare in terms of combined CPU-GPU system perfor- BLACKSCHOLES mance, we consider a performance metric in which the CPU Figure 15: Left: GPU application speedup. Right: and the GPU performance are weighed equally i.e., over- weighted CPU speedup for the mixes. all speedup is the geometric mean of the FPS speedup and the normalized weighted speedup of the CPU mix [29]. We 7 SUMMARY find that DASH and HeLM improve this performance metric by 1% on average compared to the baseline, while our pro- We have presented a new class of memory access schedulers posal improves this metric by 5%. DynPrio delivers baseline for heterogeneous CMPs. Our proposal dynamically identi- performance, while both SMS-0.9 and SMS-0 degrade the fies the critical GPU accesses and probabilistically prioritizes equal-weight metric by 9%. them in the memory access scheduler. Detailed simulation

Sensitivity to LLC Capacity. Figure 14 summarizes the 9 DynPrio and DASH are left out from this evaluation because these performance of the IM policy when the heterogeneous CMP two proposals are suitable for deadline-specific GPU workloads. studies show that our proposal achieves its goal of offer- [26] Y. Kim et al. Thread Cluster Memory Scheduling: Exploiting ing a bigger share of the shared memory system resources Differences in Memory Access Behavior. In MICRO 2010. [27] N. Kirman et al. Checkpointed Early Load Retirement. In HPCA to the critical GPU accesses. The GPU performance im- 2005. proves by 15% on average for the 3D scene rendering ap- [28] N. B. Lakshminarayana et al. DRAM Scheduling Policy for plications, while the co-scheduled CPU application mixes GPGPU Architectures Based on a Potential Function. In IEEE perform within 3% of the baseline on average. For the het- CAL, 11(2): 33–36, 2012. [29] J. Lee and H. Kim. TAP: A TLP-aware Cache Management Policy erogeneous mixes with GPGPU applications, the CPU ap- for a CPU-GPU Heterogeneous Architecture. In HPCA 2012. plication mixes improve by 14% on average, while the GPU [30] F. D. Luna. Introduction to 3D Game Programming with DirectX performs 1% above the baseline leading to an overall 7% im- 10 . Wordware Publishing Inc.. provement in system performance, measured in terms of a [31] R. Manikantan and R. Govindarajan. Focused Prefetching: Per- formance Oriented Prefetching Based on Commit Stalls. In ICS CPU-GPU equal-weight performance metric. 2008. [32] V. Mekkat et al. Managing Shared Last-level Cache in a Hetero- geneous Multicore . In PACT 2013. REFERENCES [33] V. Moya et al. ATTILA: A Cycle-Level Execution- Driven Simulator for Modern GPU Architectures. [1] R. Ausavarungnirun et al. Staged Memory Scheduling: Achieving In ISPASS 2006. Source and traces available at High Performance and Scalability in Heterogeneous Systems. In http://attila.ac.upc.edu/wiki/index.php/Main Page. ISCA 2012. [34] S. P. Muralidhara et al. Reducing Memory Interference in Multi- [2] D. Bouvier et al. Kabini: An AMD Accelerated Processing Unit core Systems via Application-aware Memory Channel Partitioning. . In IEEE Micro, 34(2):22–33, 2014. In MICRO 2011. [3] N. Chatterjee et al. Managing DRAM Latency Divergence in [35] O. Mutlu et al. Runahead Execution: An Alternative to Very Irregular GPGPU Applications. In SC 2014. Large Instruction Windows for Out-of-Order Processors. In HPCA [4] S. Che et al. Rodinia: A Benchmark Suite for Heterogeneous 2003. Computing. In IISWC 2009. [36] O. Mutlu and T. Moscibroda. Stall-Time Fair Memory Access [5] S. Che et al. A Characterization of the Rodinia Benchmark Suite Scheduling for Chip Multiprocessors. In MICRO 2007. with Comparison to Contemporary CMP Workloads. In IISWC [37] O. Mutlu and T. Moscibroda. Parallelism-aware Batch Scheduling: 2010. Enhancing both Performance and Fairness of Shared DRAM [6] R. Das et al. Application-to-core Mapping Policies to Reduce Systems. In ISCA 2008. Memory System Interference in Multi-core Systems. In HPCA [38] N. C. Nachiappan et al. GemDroid: A Framework to Evaluate 2013. Mobile Platforms. In SIGMETRICS 2014. [7] M. Demler. Iris Pro Takes On Discrete GPUs. In [39] K. J. Nesbit et al. Fair Queuing Memory Systems. In MICRO Report, 2013. 2006. [8] E. Ebrahimi et al. Fairness via Source Throttling: A Configurable [40] T. Olson. Mali 400 MP: A Scalable GPU for Mobile and Embedded and High-performance Fairness Substrate for Multi-core Memory Devices. In HPG 2010. Systems. In ASPLOS 2010. [41] T. Piazza. Intel Processor Graphics. In HPG 2012. [9] E. Ebrahimi et al. Parallel Application Memory Scheduling. In [42] S. Rai and M. Chaudhuri. Exploiting Dynamic Reuse Probability MICRO 2011. to Manage Shared Last-level Caches in CPU-GPU Heterogeneous [10] S. Ghose, H. Lee, and J. F. Martinez. Improving Memory Sched- Processors. In ICS 2016. uling via Processor-side Load Criticality Information. In ISCA [43] M. Ribble. Next-gen Tile-based GPUs. In GDC 2008. 2013. [44] S. Rixner et al. Memory Access Scheduling. In ISCA 2000. [11] N. Greene, M. Kass, and G. Miller. Hierarchical Z-buffer Visibility. [45] P. Rosenfeld, E. Cooper-Balis, and B. Jacob. DRAMSim2: A In SIGGRAPH 1993. Cycle Accurate Memory System Simulator. In IEEE CAL, 10(1): [12] P. Hammarlund et al. Haswell: The Fourth Generation Intel Core 16–19, 2011. Processor. In IEEE Micro, 34(2):6–20, 2014. [46] T. Sherwood et al. Automatically Characterizing Large Scale [13] M. Harris. Dynamic Texturing. Available at Program Behavior. In ASPLOS 2002. http://developer.download.nvidia.com/assets/gamedev/docs/ [47] D. Shingari, A. Arunkumar, and C-J. Wu. Characterization and DynamicTexturing.pdf. Throttling-Based Mitigation of Memory Interference for Hetero- [14] I. Hur and C. Lin. Adaptive History-Based Memory Schedulers. geneous Smartphones. In IISWC 2015. In MICRO 2004. [48] A. Stevens. QoS for High-performance and Power-efficient HD [15] Intel Corporation. Intel Core i7-4770 Processor. Avail- Multimedia. ARM White Paper, 2010. able at http://ark.intel.com/products/75122/Intel-Core-i7-4770- [49] J. A. Stratton et al. Parboil: A Revised Benchmark Suite for Processor-8M-Cache-up-to-3 90-GHz. Scientific and Commercial Throughput Computing. IMPACT [16] E. Ipek et al. Self-Optimizing Memory Controllers: A Reinforce- Technical Report IMPACT-12-01 , 2012. ment Learning Approach. In ISCA 2008. [50] S. Subramaniam et al. Criticality-based Optimizations for Effi- [17] A. Jaleel et al. High Performance Cache Replacement using Re- cient Load Processing. In HPCA 2009. reference Interval Prediction (RRIP). In ISCA 2010. [51] L. Subramanian et al. The Blacklisting Memory Scheduler: [18] M. K. Jeong et al. A QoS-aware for dynamically Achieving High Performance and Fairness at Low Cost. In ICCD balancing GPU and CPU bandwidth use in an MPSoC. In DAC 2014. 2012. [52] L. Subramanian et al. The Application Slowdown Model: Quanti- [19] A. Jog et al. Exploiting Core Criticality for Enhanced GPU fying and Controlling the Impact of Inter-application Interference Performance. In SIGMETRICS 2016. at Shared Caches and Main Memory. In MICRO 2015. [20] D. Kanter. Intel’s Ivy Bridge Graphics Architecture. April 2012. [53] L. Subramanian et al. MISE: Providing Performance Predictabil- Available at http://www.realworldtech.com/ivy-bridge-gpu/. ity and Improving Fairness in Shared Main Memory Systems. In [21] D. Kanter. Intel’s Sandy Bridge Graphics Architecture. August HPCA 2013. 2011. Available at http://www.realworldtech.com/sandy-bridge- [54] R. Ubal et al. Multi2Sim: A Simulation Framework for CPU-GPU gpu/. Computing. In PACT 2012. [22] D. Kanter. AMD Fusion Architecture and Llano. June 2011. Avail- [55] H. Usui et al. DASH: Deadline-Aware High-Performance Memory able at http://www.realworldtech.com/fusion-llano/. Scheduler for Heterogeneous Systems with Hardware Accelerators. [23] O. Kayiran et al. Managing GPU Concurrency in Heterogeneous In ACM TACO, 12(4), 2016. Architectures. In MICRO 2014. [56] J. Walton. The AMD Trinity Review (A10-4600M): A New Hope. [24] Y. Kim et al. ATLAS: A Scalable and High-performance Schedul- 2012. Available at http://www.anandtech.com/show/5831/amd- ing Algorithm for Multiple Memory Controllers. In HPCA 2010. trinity-review-a10-4600m-a-new-hope/. [25] H. Kim et al. MacSim: A CPU-GPU Heteroge- [57] M. Yuffe et al. A Fully Integrated Multi-CPU, GPU, and Memory neous Simulation Framework. 2012. Available at Controller 32 nm Processor. In ISSCC 2011. https://code.google.com/p/macsim/. [58] 3D Mark Benchmark. http://www.3dmark.com/.