High-Performance and Energy-Efficient Memory Scheduler Design for Heterogeneous Systems
Rachata Ausavarungnirun1 Gabriel H. Loh2 Lavanya Subramanian3,1 Kevin Chang4,1 Onur Mutlu5,1
1Carnegie Mellon University 2AMD Research 3Intel Labs 4Facebook 5ETH Zürich
arXiv:1804.11043v1 [cs.AR] 30 Apr 2018

This paper summarizes the idea of the Staged Memory Scheduler (SMS), which was published at ISCA 2012 [14], and examines the work's significance and future potential. When multiple processor cores (CPUs) and a GPU integrated together on the same chip share the off-chip DRAM, requests from the GPU can heavily interfere with requests from the CPUs, leading to low system performance and starvation of cores. Unfortunately, state-of-the-art memory scheduling algorithms are ineffective at solving this problem due to the very large amount of GPU memory traffic, unless a very large and costly request buffer is employed to provide these algorithms with enough visibility across the global request stream.

Previously-proposed memory controller (MC) designs use a single monolithic structure to perform three main tasks. First, the MC attempts to schedule together requests to the same DRAM row to increase row buffer hit rates. Second, the MC arbitrates among the requesters (CPUs and GPU) to optimize for overall system throughput, average response time, fairness and quality of service. Third, the MC manages the low-level DRAM command scheduling to complete requests while ensuring compliance with all DRAM timing and power constraints.

This paper proposes a fundamentally new approach, called the Staged Memory Scheduler (SMS), which decouples the three primary MC tasks into three significantly simpler structures that together improve system performance and fairness. Our three-stage MC first groups requests based on row buffer locality. This grouping allows the second stage to focus only on inter-application scheduling decisions. These two stages enforce high-level policies regarding performance and fairness, and therefore the last stage can use simple per-bank FIFO queues (i.e., there is no need for further command reordering within each bank) and straightforward logic that deals only with the low-level DRAM commands and timing.

We evaluated the design trade-offs involved and compared SMS against four state-of-the-art MC designs. Our evaluation shows that SMS provides 41.2% performance improvement and 4.8× fairness improvement compared to the best previous state-of-the-art technique, while enabling a design that is significantly less complex and more power-efficient to implement.

Our analysis and proposed scheduler have inspired significant research on (1) predictable and/or deadline-aware memory scheduling [91, 92, 194, 195, 197, 201, 202, 216] and (2) memory scheduling for heterogeneous systems [161, 201, 202, 207].

1. Introduction

As the number of cores continues to increase in modern chip multiprocessor (CMP) systems, the DRAM memory system has become a critical shared resource [139, 145]. Memory requests from multiple cores interfere with each other, and this inter-application interference is a significant impediment to individual application and overall system performance. Various works on application-aware memory scheduling [98, 99, 141, 142] address the problem by making the memory controller aware of application characteristics and appropriately prioritizing memory requests to improve system performance and fairness.

Recent heterogeneous CPU-GPU systems [1, 27, 28, 37, 76, 77, 78, 133, 152, 153, 167, 209] present an additional challenge by introducing integrated graphics processing units (GPUs) on the same die with CPU cores. GPU applications typically demand significantly more memory bandwidth than CPU applications due to the GPU's capability of executing a large number of concurrent threads [1, 2, 3, 4, 13, 23, 27, 37, 38, 40, 62, 68, 77, 78, 133, 149, 150, 151, 152, 153, 154, 167, 176, 178, 179, 188, 189, 199, 200, 206, 209]. GPUs use single-instruction multiple-data (SIMD) pipelines to concurrently execute multiple threads [53]. In a GPU, a group of threads executing the same instruction is called a wavefront or warp, and threads in a warp are executed in lockstep. When a wavefront stalls on a memory instruction, the GPU core hides this memory access latency by switching to another wavefront to avoid stalling the pipeline. Therefore, there can be thousands of outstanding memory requests from across all of the wavefronts. This is fundamentally more memory intensive than CPU memory traffic, where each CPU application has a much smaller number of outstanding requests due to the sequential execution model of CPUs.

Figure 1(a) shows the memory request rates for a representative subset of our GPU applications and the most memory-intensive SPEC2006 (CPU) applications, as measured by memory requests per thousand cycles when each application runs alone on the system. The raw bandwidth demands (i.e., memory request rates) of the GPU applications are often multiple times higher than those of the SPEC benchmarks. Figure 1(b) shows the row buffer hit rates (also called row buffer locality or RBL [134]). The GPU applications show consistently high levels of RBL, whereas the SPEC benchmarks exhibit more variability. The GPU programs have high levels of spatial locality, often due to access patterns related to large sequential memory accesses (e.g., frame buffer updates). Figure 1(c) shows the bank-level parallelism (BLP) [109, 142], which is the average number of parallel memory requests that can be issued to different DRAM banks, for each application, with the GPU programs consistently making use of four banks at the same time.

Figure 1: GPU memory characteristics. (a) Memory intensity, measured by memory requests per thousand cycles, (b) row buffer locality, measured by the fraction of accesses that hit in the row buffer, and (c) bank-level parallelism. Reproduced from [14].
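To make the three metrics of Figure 1 concrete, the sketch below (ours, not from the original paper) computes memory intensity, RBL, and BLP for a single application from a hypothetical request trace of (cycle, bank, row) tuples. The windowed BLP approximation and the synthetic example traces are assumptions made purely for illustration; the paper's evaluation methodology may define and measure these quantities differently.

```python
from collections import defaultdict

def memory_intensity(trace, total_cycles):
    """Memory requests per thousand cycles (the MPKC metric of Figure 1(a))."""
    return 1000.0 * len(trace) / total_cycles

def row_buffer_locality(trace):
    """Fraction of requests that hit in the currently open row of their bank
    (Figure 1(b)), assuming an open-page policy and in-order service."""
    open_row, hits = {}, 0
    for _, bank, row in trace:
        if open_row.get(bank) == row:
            hits += 1
        open_row[bank] = row
    return hits / len(trace) if trace else 0.0

def bank_level_parallelism(trace, window=100):
    """Average number of distinct banks touched per fixed cycle window in which
    the application has at least one request (a rough stand-in for BLP, Figure 1(c))."""
    banks = defaultdict(set)
    for cycle, bank, _ in trace:
        banks[cycle // window].add(bank)
    return sum(len(b) for b in banks.values()) / len(banks) if banks else 0.0

# Synthetic example traces: a GPU-like stream (intense, sequential rows, spread
# over four banks) and a CPU-like stream (sparse, irregular rows).
gpu_trace = [(c, (c // 4) % 4, c // 256) for c in range(0, 4000, 4)]
cpu_trace = [(c, (c // 40) % 2, c) for c in range(0, 4000, 40)]
for name, trace in [("GPU-like", gpu_trace), ("CPU-like", cpu_trace)]:
    print(f"{name}: {memory_intensity(trace, 4000):.0f} MPKC, "
          f"{row_buffer_locality(trace):.2f} RBL, "
          f"{bank_level_parallelism(trace):.1f} BLP")
```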
In addition to the high-intensity memory traffic of GPU applications, there are other properties that distinguish GPU applications from CPU applications. Prior work [99] observed that CPU applications with streaming access patterns typically exhibit high RBL but low BLP, while applications with less uniform access patterns typically have low RBL but high BLP. In contrast, GPU applications have both high RBL and high BLP. The combination of high memory intensity, high RBL and high BLP means that the GPU will cause significant interference to other applications across all banks, especially when using a memory scheduling algorithm that preferentially favors requests that result in row buffer hits (e.g., [173, 220]).

Recent memory scheduling research has focused on memory interference between applications in CPU-only scenarios. These past proposals are built around a single centralized request buffer at each memory controller (MC). The scheduling algorithm implemented in the memory controller analyzes the stream of requests in the centralized request buffer to determine application memory characteristics, decides on a priority for each core, and then enforces these priorities. Observable memory characteristics may include the number of requests that result in row buffer hits, the bank-level parallelism of each core, memory request rates, overall fairness metrics, and other information. Figure 2(a) shows the CPU-only scenario where the request buffer holds requests only from the CPUs. In this case, the memory controller sees a number of requests from the CPUs and has visibility into their memory behavior. On the other hand, when the request buffer is shared between the CPUs and the GPU, as shown in Figure 2(b), the large volume of requests from the GPU occupies a significant fraction of the memory controller's request buffer, thereby limiting the memory controller's visibility of the CPU applications' memory characteristics.

Figure 2: Example of the limited visibility of the memory controller. (a) CPU-only information, (b) Memory controller's visibility, (c) Improved visibility. Adapted from [14].

One approach to increasing the memory controller's visibility across a larger window of memory requests is to increase the size of its request buffer. This allows the memory controller to observe more requests from the CPUs to better characterize their memory behavior, as shown in Figure 2(c). For instance, with a large request buffer, the memory controller can identify and service multiple requests from one CPU core to the same row such that they become row buffer hits; however, with a small request buffer, as shown in Figure 2(b), the memory controller may not even see these requests at the same time because the GPU's requests have occupied the majority of the entries.
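The interaction between a row-hit-favoring policy and a shared request buffer can be illustrated with the following minimal sketch (ours, not an implementation of any cited scheduler). It picks the next request from a single shared buffer, preferring row-buffer hits and breaking ties by age; the request fields and buffer contents are hypothetical. The point is only that a high-RBL GPU stream keeps winning the row-hit check, so an older CPU request to a different row keeps waiting.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Request:
    source: str     # e.g. "CPU0", "CPU1", or "GPU" (hypothetical labels)
    bank: int
    row: int
    arrival: int    # arrival time, used as the age tie-breaker

def pick_next(request_buffer, open_rows) -> Optional[Request]:
    """Serve the oldest row-buffer hit if one exists, else the oldest request."""
    if not request_buffer:
        return None
    hits = [r for r in request_buffer if open_rows.get(r.bank) == r.row]
    chosen = min(hits or request_buffer, key=lambda r: r.arrival)
    request_buffer.remove(chosen)
    open_rows[chosen.bank] = chosen.row   # servicing a request opens its row
    return chosen

# Tiny example: one old CPU request vs. a burst of newer GPU requests that all
# hit the currently open row of bank 0.
buf = [Request("CPU0", bank=0, row=7, arrival=0)] + \
      [Request("GPU", bank=0, row=42, arrival=t) for t in range(1, 6)]
open_rows = {0: 42}
order = [pick_next(buf, open_rows).source for _ in range(6)]
print(order)   # the CPU request is served only after the GPU row hits drain
```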
Unfortunately, very large request buffers impose significant implementation challenges, including the die area for the larger structures and the additional circuit complexity for analyzing so many requests, along with the logic needed for assignment and enforcement of priorities [194, 195]. Therefore, while building a very large, centralized memory controller request buffer could perhaps lead to reasonable memory scheduling decisions, the approach is unattractive due to the resulting area, power, timing and complexity costs.

In this work, we propose the Staged Memory Scheduler (SMS), a decentralized architecture for memory scheduling in the context of integrated multi-core CPU-GPU systems. The key idea in SMS is to decouple the various functional tasks of memory controllers and partition these tasks across several simpler hardware structures which operate in a staged fashion.
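As a rough structural sketch of this decoupling, the skeleton below (ours) separates the three stages summarized above: batch formation based on row-buffer locality, an inter-application batch scheduler, and per-bank FIFOs feeding the low-level DRAM command logic. The stage boundaries follow this paper's description, but the concrete policies shown (when a batch closes, round-robin source arbitration) are simplified placeholders rather than the actual SMS algorithms.

```python
from collections import deque

class BatchFormation:
    """Stage 1: per-source queues that group consecutive requests to the same
    (bank, row) into a batch, so later stages need not reason about row locality."""
    def __init__(self):
        self.current = {}    # source -> open batch (list of (bank, row))
        self.ready = {}      # source -> deque of completed batches

    def insert(self, source, bank, row):
        batch = self.current.setdefault(source, [])
        if batch and batch[-1] != (bank, row):
            self.close(source)                      # new row: close the open batch
            batch = self.current.setdefault(source, [])
        batch.append((bank, row))

    def close(self, source):
        batch = self.current.pop(source, [])
        if batch:
            self.ready.setdefault(source, deque()).append(batch)

class BatchScheduler:
    """Stage 2: picks which source's ready batch to forward next
    (placeholder round-robin; the actual SMS policy is more elaborate)."""
    def __init__(self, sources):
        self.order = deque(sources)

    def pick(self, ready):
        for _ in range(len(self.order)):
            src = self.order[0]
            self.order.rotate(-1)
            if ready.get(src):
                return src, ready[src].popleft()
        return None

class DramCommandScheduler:
    """Stage 3: simple per-bank FIFOs; requests within a bank are issued in
    arrival order, so no further reordering logic is needed here."""
    def __init__(self, num_banks):
        self.bank_fifos = [deque() for _ in range(num_banks)]

    def enqueue(self, batch):
        for bank, row in batch:
            self.bank_fifos[bank].append(row)

# Wiring the stages together for a toy request stream.
stage1 = BatchFormation()
stage2 = BatchScheduler(["CPU0", "GPU"])
stage3 = DramCommandScheduler(4)
for src, bank, row in [("GPU", 0, 5), ("GPU", 0, 5), ("CPU0", 1, 9), ("GPU", 0, 6)]:
    stage1.insert(src, bank, row)
for src in ("CPU0", "GPU"):
    stage1.close(src)                               # flush open batches for the example
while (picked := stage2.pick(stage1.ready)):
    stage3.enqueue(picked[1])
print([list(f) for f in stage3.bank_fifos])
```

The design point this sketch is meant to convey is the one stated in the abstract: because the first two stages already encapsulate the row-locality and inter-application policies, the final stage can remain a set of plain per-bank FIFOs with straightforward DRAM command and timing logic.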