Using Criticality of GPU Accesses in Memory Management for CPU-GPU Heterogeneous Multi-Core Processors

Using Criticality of GPU Accesses in Memory Management for CPU-GPU Heterogeneous Multi-core Processors ABSTRACT this paper, we model such heterogeneous computational scenarios by simultaneously scheduling SPEC CPU workloads Heterogeneous chip-multiprocessors with CPU and GPU in- on the CPU cores and 3D scene rendering or GPGPU ap- tegrated on the same die allow sharing of critical memory plications on the GPU of a heterogeneous CMP. In these system resources among the CPU and GPU applications and computational scenarios, the working sets of the jobs co- give rise to challenging resource scheduling problems. In scheduled on the CPU and the GPU cores contend for the this paper, we explore memory access scheduling algorithms shared LLC capacity and the DRAM bandwidth causing de- driven by criticality of GPU accesses in such systems. Dif- structive interference. ferent GPU access streams originate from different parts of GPU workload = Parboil LBM (OpenCL) CPU GPU 1 the GPU rendering pipeline, which behaves very differently 1C1G 2C1G 4C1G from the typical CPU pipeline requiring new techniques for 0.9 GPU access criticality estimation. We propose a novel queu- 0.8 ing network model to estimate the performance-criticality of 0.7 the GPU access streams. If a GPU application performs be- 0.6 low the quality of service requirement (e.g., frame rate in 0.5 Normalized performance 3D rendering), the memory access scheduler uses the esti- 429.mcf 437,450 429,462 410.bwaves 437.leslie3d450.soplex462.libquantum473.astar482.sphinx3 429,437,462,473410,450,462,482 mated criticality information to accelerate the critical GPU Figure 1: Performance of heterogeneous mixes rela- accesses. Detailed simulations done on a heterogeneous chip- tive to standalone performance on Core i7-4770. multiprocessor model with one GPU and four CPU cores running DirectX, OpenGL, and CPU application mixes show To understand the extent of the CPU-GPU interference, that our proposal improves the GPU performance by 15% on we conduct a set of experiments on an Intel Core i7-4770 pro- average without degrading the CPU performance much. Ex- cessor (Haswell)-based platform. This heterogeneous pro- tensions proposed for the mixes containing GPGPU applica- cessor has four CPU cores and an integrated HD 4600 GPU tions, which do not have any quality of service requirement, sharing an 8 MB LLC and a dual-channel DDR3-1600 16 improve the performance by 7% on average. GB DRAM (25.6 GB/s peak DRAM bandwidth) [15]. We prepare eleven heterogeneous mixes for these experiments. 1 INTRODUCTION Every mix has LBM from the Parboil OpenCL suite [49] as the GPU workload using the long input (3000 time-steps). A heterogeneous chip-multiprocessor (CMP) makes simul- LBM serves as a representative for memory-intensive GPU taneous use of both CPU and GPU cores. Such architec- workloads.1 Seven out of the eleven mixes exercise one CPU tures include AMD's accelerated processing unit (APU) fam- core, two mixes exercise two CPU cores, and the remain- ily [2, 22, 56] and Intel's Sandy Bridge, Ivy Bridge, Haswell, ing two mixes exercise all four CPU cores. In all mixes, Broadwell, and Skylake processors [7, 12, 20, 21, 41, 57]. In the GPU workload co-executes with the CPU application(s) this study, we explore heterogeneous CMPs that allow the drawn from the SPEC CPU 2006 suite. All SPEC CPU CPU cores and the GPU to share the on-die interconnect, 2006 applications use the ref inputs. Figure 1 shows, for the last-level cache (LLC), the memory controllers, and the each mix (identified by the constituent CPU workload on DRAM banks, as found in Intel's integrated GPUs. The the x-axis), the performance of the CPU and the GPU work- GPU workloads are of two types: massively parallel compu- loads separately relative to the standalone performance of tation exercising only the shader cores (GPGPU) and multi- these workloads. For example, for the first 4C1G mix us- frame 3D animation utilizing the entire rendering pipeline of ing four CPU cores and the GPU, the standalone CPU per- the GPU. Each of these two types of GPU workloads can be formance is the average turn-around time of the four CPU co-executed with general-purpose CPU workloads in a het- applications started together, while the GPU is idle. Sim- erogeneous CMP. These heterogeneous computational sce- ilarly, the standalone GPU performance is the time taken narios are seen in embedded, desktop, workstation, and high- to complete the Parboil LBM application on the integrated end computing platforms employing CPU-GPU MPSoCs. In GPU. When these workloads run together, performance degrades. As Figure 1 shows, the loss in CPU performance Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that varies from 8% (410.bwaves in the 1C1G group) to 28% (the copies are not made or distributed for profit or commercial advantage first mix in the 4C1G group). The GPU performance degra- and that copies bear this notice and the full citation on the first page. dation ranges from 1% (the first mix in the 1C1G group) to Copyrights for components of this work owned by others than ACM 26% (the first mix in the 4C1G group). The GPU workload must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, degrades more with the increasing number of active CPU requires prior specific permission and/or a fee. Request permissions cores. Similar levels of interference have been reported in from [email protected]. simulation-based studies [1, 18, 23, 29, 32, 38, 42, 47, 48, 55]. CASES'17, Seoul, South Korea © 2017 ACM. 978-x-xxxx-xxxx-x/YY/MM. $15.00 1 Experiments done with a larger set of memory-intensive GPU work- DOI: 10.1145/nnnnnnn.nnnnnnn loads show similar trends. In this paper, we attempt to recover some portion of the to application groups has also been proposed [34]. Critical- lost GPU performance by proposing a novel memory access ity estimation of load instructions [50] and criticality-driven scheduler driven by GPU performance feedback. memory access schedulers [10] for CPUs have been explored. Prior proposals have studied specialized memory access The memory access scheduling studies for the discrete schedulers for heterogeneous systems [1, 18, 38, 48, 55]. These GPU parts focus on the shader cores only and do not con- proposals modulate the priority of all GPU or all CPU re- sider the rest of the GPU rendering pipeline. These studies quests in bulk depending on the latency-, bandwidth-, and explore ways to minimize the latency variance among the deadline-sensitivity of the current phase of the CPU and the threads within a warp [3], to accelerate the critical shader GPU workloads. In this paper, we propose a new mechanism cores that do not have enough short-latency warps [19], and to dynamically identify a subset of GPU requests as critical to design a mix of shortest-job-first and FR-FCFS with the and accelerate them by exercising fine-grain control over al- goal of accelerating the less latency-tolerant shader cores [28]. location of memory system resources. Our proposal is mo- Several studies have explored specialized memory access tivated by the key observation that the performance impact schedulers for heterogeneous systems [1, 18, 38, 48, 55]. The of accelerating different GPU accesses is not the same (Sec- staged memory scheduler (SMS) clubs the memory requests tion 3). For identifying the bottleneck GPU accesses, we from each source (CPU or GPU) into source-specific batches model the information flow through the rendering pipeline based on DRAM row locality [1]. Each batch is next sched- of the GPU using a queuing network. We observe that the uled with a probabilistic mix of shortest-batch-first (favor- accesses to the memory system originating from the GPU ing latency-sensitive jobs) and round-robin (enforcing fair- shader cores and the fixed function units have complex inter- ness among bandwidth-sensitive jobs). The dynamic priority dependence, which is not addressed by the existing CPU load scheduler [18] proposed for mobile heterogeneous platforms criticality estimation techniques [10, 50]. To the best of our employs dynamic progress estimation of tile-based deferred knowledge, our proposal is the first to incorporate critical- rendering (TBDR) [40, 43] and offers the GPU accesses equal ity information of fine-grain GPU accesses in the design of priority as the CPU accesses if the GPU lags behind the memory access schedulers for heterogeneous CMPs. target frame rendering time. Also, during the last 10% 3D animation is an important GPU workload. For these of the left time to render a frame, the GPU accesses are workloads, it is sufficient to deliver a minimum acceptable given higher priority than the CPU accesses. The subse- frame rate (usually thirty frames per second) due to the per- quently proposed deadline-aware memory scheduler for het- sistence of human vision; it is a wastage of resources to im- erogeneous systems (DASH) further improves the dynamic prove the GPU performance beyond the required level. Our priority scheme by offering the highest priority to short- proposal includes a highly accurate architecture-independent deadline applications and prioritizing the GPU applications technique to estimate the frame rate of a 3D rendering job at that lag behind the target [55]. The option of statically par- run-time. Our proposal employs the criticality information titioning the physical address space between the CPU and to speed up the GPU workload only if the estimated frame GPU datasets and assigning two independent memory con- rate is below the target level.

Using Criticality of GPU Accesses in Memory Management for CPU-GPU Heterogeneous Multi-Core Processors

Pipelining and Vector Processing

Diploma Thesis

How Data Hazards Can Be Removed Effectively

Pipelining: Basic Concepts and Approaches

Powerpc 601 RISC Microprocessor Users Manual

Improving UNIX Kernel Performance Using Profile Based Optimization

Analysis of Body Bias Control Using Overhead Conditions for Real Time Systems: a Practical Approach∗

Lecture Topics RAW Hazards Stall Penalty

UM0434 E200z3 Powerpc Core Reference Manual

Instruction Pipelining Review

ECE 361 Computer Architecture Lecture 13: Designing a Pipeline Processor

Pipelined Instruction Executionhazards, Stages