
Optimizing Throughput for Multithreaded Workloads on Memory Constrained CMPs

Major Bhadauria and Sally A. McKee
Computer Systems Lab, Cornell University, Ithaca, NY, USA
[email protected], [email protected]

ABSTRACT

Multi-core designs have become the industry imperative, replacing our reliance on increasingly complicated micro-architectural designs and VLSI improvements to deliver increased performance at lower power budgets. Performance of these multi-core chips will be limited by the DRAM memory system: we demonstrate this by modeling a cycle-accurate DDR2 memory controller with SPLASH-2 workloads. Surprisingly, benchmarks that appear to scale well with the number of processors fail to do so when memory is accurately modeled. We frequently find that the most efficient configuration is not the one with the most threads. By choosing the most efficient number of threads for each benchmark, average energy delay efficiency improves by a factor of 3.39, and performance improves by 19.7%, on average. We also introduce a shadow row of sense amplifiers, an alternative to cached DRAM, to explore potential power/performance impacts. The shadow row works in conjunction with the L2 cache to leverage temporal and spatial locality across memory accesses, thus attaining average and peak speedups of 13% and 43%, respectively, when compared to a state-of-the-art DRAM memory scheduler.

Categories and Subject Descriptors

C.4 [Performance of Systems]: Design Studies; C.1 [Processor Architectures]: Parallel Architectures

General Terms

Design, Performance

Keywords

Performance, Power, Memory, Efficiency, Bandwidth

1. INTRODUCTION

Power and thermal constraints have begun to limit the maximum operating frequency of high performance processors. The cubic increase in power from increases in frequency, and the higher voltages required to attain those frequencies, has reached a plateau. By leveraging increasing die space for more processing cores (creating chip multiprocessors, or CMPs) and larger caches, designers hope that multi-threaded programs can exploit shrinking transistor sizes to deliver equal or higher throughput than their single-threaded, single-core predecessors. The current software paradigm is based on the assumption that multi-threaded programs with little contention for shared data scale (nearly) linearly with the number of processors, yielding power-efficient data throughput. Nonetheless, applications fully exploiting available cores, where each application thread enjoys its own private working set, may still fail to achieve maximum throughput [5]. Bandwidth and memory bottlenecks limit the potential of multi-core processors, just as in the single-core domain [2].

We explore memory effects on a CMP running a multi-threaded workload. We model a memory controller using state-of-the-art scheduling techniques to hide DRAM access latency. Pin limitations affect available memory channels and bandwidth, which can cause applications not to scale well with increasing cores. This limits the maximum number of cores that should be used to achieve the best ratio between efficiency and performance. Academic computer architects rarely model the variable latencies of main memory requests, instead assuming a fixed latency for every memory request [9]. We compare differences in performance between modeling a static memory latency and modeling a memory controller with realistic, variable DRAM latencies for the SPLASH-2 benchmark suite. We specifically examine CMP efficiency, as measured by the energy delay squared product.
Our results demonstrate the insufficiency of assuming static latencies: the behaviors caused by such modeling choices affect not only performance but also the shapes of the performance curves with respect to the number of threads. Additionally, we find that choosing the optimal configuration instead of the one with the most threads can result in significantly higher computing efficiency. This is a direct result of variable DRAM latencies and the bottlenecks they cause for multi-threaded programs. This bottleneck is visible at a lower thread count than in earlier research [25] when the multi-dimensional structure of DRAM is modeled.


To help mitigate these bottlenecks, we introduce a shadow row of DRAM sense amplifiers. The shadow row keeps the previously accessed DRAM page accessible (in read-only mode) without closing the current page, and thus it can be fed to the core when it satisfies the pending memory request. This leverages the memory's temporal and spatial locality characteristics to potentially improve performance. The shadow row can be used in conjunction with normal caching or can be used to implement an alternative non-caching policy, preventing cache pollution by data unlikely to be reused.

The contributions of this work are threefold.

• We extend a DDR2 memory model to include memory optimizations exploiting the shadow row.

• We find significant non-linear performance differences between bandwidth-limited static latency simulations and those using a realistic DRAM model and memory controller.

• We find that the maximum number of threads is rarely the optimal number to use for performance or power.

[Figure 1: Performance for Open Page Mode with Read Priority Normalized to No Priority]

2. THE MULTI-DIMENSIONAL DRAM STRUCTURE

Synchronous dynamic random access memory (SDRAM) is the mainstay for bridging the gap between cache and disk. DRAMs have not increased in operating frequency or bandwidth proportionately with processors. One innovation is double data rate SDRAM (DDR-SDRAM), which transfers data on both the positive and negative edges of the clock. DDR2 SDRAM further increases available memory bandwidth by doubling the internal memory frequency of DRAM chips. However, memory still operates an order of magnitude slower than the core processor.

Main memory (DRAM) is partitioned into ranks, each composed of independently addressable banks, and each bank contains storage arrays addressed by row and column. When the processor requires data from memory, the request is steered to a specific rank and bank based on physical address. All banks on a rank share command and data buses, which can lead to contention when memory requests arrive in bursts.

Banks receive an activate command to charge a bank of sense amplifiers. A data row is fetched from the storage array and copied to the sense amps (a hot row or open page). Banks are sent a read command to access a specific column of the row. Once accesses to the row finish, a precharge command closes it so another can be opened: reading from DRAM is destructive, so the precharge command writes data back from the hot row to DRAM storage before precharging the sense amps. This prevents pipelining accesses to different rows within a bank, but commands can be pipelined to different banks to exploit parallelism within the memory subsystem. Although banks are not pipelined, data can be interleaved across banks to reduce effective latency. For data with high spatial locality, the same row is likely to be accessed consecutively multiple times. Precharge and activate commands are wasted when the same previously accessed row is closed and subsequently reopened. Modern memory controllers usually keep rows open to leverage spatial locality, operating in "open page" mode. Memory controllers can also operate in "closed page" mode, where a row is always closed after it is accessed.
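To make the command sequence concrete, the following sketch models per-bank access latency under both policies. It is our own illustration, not the paper's simulator: the timing constants are placeholders rather than the DDR2 parameters used later, and the class and function names are invented for this example.

```python
# Illustrative per-bank DRAM timing model (placeholder constants, not the
# paper's DDR2 timings). Latency depends on which row the bank has open.
T_PRE, T_ACT, T_CAS = 4, 4, 4      # precharge, activate, column read

class Bank:
    def __init__(self):
        self.open_row = None        # row currently held in the sense amplifiers

    def access(self, row, closed_page=False):
        """Return the command latency for reading one column of `row`."""
        if self.open_row == row:                      # open-page hit: CAS only
            latency = T_CAS
        elif self.open_row is None:                   # bank idle: ACT + CAS
            latency = T_ACT + T_CAS
        else:                                         # row conflict: PRE + ACT + CAS
            latency = T_PRE + T_ACT + T_CAS
        self.open_row = None if closed_page else row  # closed page always precharges
        return latency

bank = Bank()
print([bank.access(r) for r in (7, 7, 7, 3, 7)])      # consecutive hits are cheap
```

Under the closed page policy every access pays the activate cost, which is why a controller that keeps rows open can do much better on spatially local access streams.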
3. EXPLOITING GREATER LOCALITY

To take advantage of the high spatial locality present in the memory access patterns of many parallel scientific applications, we make a simple modification to our DRAM model. We add a second "hot row" of sense amplifiers per DRAM bank, storing the second-most recently accessed data page. When the main row is sent a precharge command, its data are copied into this shadow row while being written back to the bank's storage array.

The shadow row supplies data for read requests, but data are not written to it, since it does not perform write-backs to the main storage array. The shadow row cannot satisfy write requests, but programs are most sensitive to read requests. Since most cores employ a store buffer, this seems a reasonable tradeoff between complexity and performance. We examine the performance improvements of prioritizing reads over writes in satisfying memory requests in a standard DRAM controller (now a common memory controller optimization). Figure 1 shows significant speedups by giving reads priority over writes for some benchmarks. Clock cycles are averaged for multi-threaded configurations, and then normalized to their counterparts that give reads and writes equal priorities. Differences in clock cycles among different threads are negligible. Giving reads priority generally improves performance for all but a few benchmarks. For these, our scheduling can lead to thread starvation by giving newer reads priority over older outstanding writes. Prior research [8] addresses this by having a threshold at which writes increase in priority over reads.

Only one set of sense amplifiers responds to commands at any time (i.e., either the shadow or the hot row); avoiding parallel operation simplifies power issues and data routing. We make minimal modifications to a standard DRAM design, since cost is an issue for commodity DRAM manufacturing. If manufacturing cost restrictions are relaxed, the hot row can perform operations in parallel while the shadow row feeds data to the chip. Our mutual-exclusion restriction results in minimal changes to internal DRAM chip circuitry. Figure 2 illustrates a DRAM chip with modifications (shaded) to show a potential organization for incorporating shadow rows. To further reduce costs and power overhead, shadow row sense amplifiers can be replaced by latches, since the shadow row reads data from the sense amplifiers and not from the (lower-powered) storage array. We introduce no additional buses, adding minimal logic overhead. The shadow row shares the bus with the original hot row. Since the "column address strobe" (CAS) command uses fewer input pins than the "row address strobe" (RAS), the extra pins can be used to indicate whether to read data from the hot row or the shadow row. Effectively utilizing the shadow row thus requires slight memory controller modifications. Note that the shadow row does not require sense amplifiers, and can use simple flip-flop memory elements. Sense amplifiers are large and power hungry, especially compared to digital CMOS memory elements. Thus, the addition of an extra page of buffers results in negligible power and area overhead.

Yamauchi et al. [27] explore a different approach to increasing the effective number of banks without increasing circuit overhead. They pipeline and overlap bank structures (such as row and column decoders and output sense amplifiers) over several data arrays. They implement 32 such banks, while only using the hardware overhead of four banks. They thus trade area costs for increased complexity in the memory controller; the memory controller must ensure banks sharing data structures do not conflict with one another when satisfying memory requests. Increasing the number of banks reduces the queue depth pressure, reducing the chance that reordering can achieve multiple open-page hits. In contrast, we explicitly try to maintain high queue depth pressure, thereby increasing the possibilities of page hits, which reduce latency to below that of the fixed latency case.

[Figure 2: Internal DRAM Circuitry with Shadow Row]
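As a rough sketch of the read path just described (our own simplification, not the authors' circuit or controller logic), each bank keeps one writable hot row plus one read-only shadow row; a precharge copies the hot row into the shadow row, and a later read to either row avoids reopening the page.

```python
# Simplified per-bank shadow-row behavior (illustrative only).
class ShadowedBank:
    def __init__(self):
        self.hot_row = None      # row held by the sense amplifiers
        self.shadow_row = None   # previously open row, readable but never written

    def precharge(self):
        # Closing the hot row copies it into the shadow latches while it is
        # written back to the storage array.
        if self.hot_row is not None:
            self.shadow_row = self.hot_row
        self.hot_row = None

    def activate(self, row):
        self.hot_row = row

    def read(self, row):
        """Return which structure services the read, reopening the row if needed."""
        if row == self.hot_row:
            return "hot-row hit"
        if row == self.shadow_row:
            return "shadow-row hit"        # served read-only; no PRE/ACT issued
        self.precharge()                   # page conflict: close, then reopen
        self.activate(row)
        return "page conflict"

    def write(self, row):
        # Writes always go through the hot row; the shadow row never accepts them.
        if row != self.hot_row:
            self.precharge()
            self.activate(row)

b = ShadowedBank()
b.activate(5); b.precharge(); b.activate(9)
print(b.read(9), "|", b.read(5), "|", b.read(2))   # hot-row hit | shadow-row hit | page conflict
```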


4. EXPERIMENTAL SETUP

We use SESC [18], a cycle-accurate MIPS ISA simulator, to evaluate our workloads. We incorporate a cycle-accurate DRAM memory controller that models transmitting memory commands synchronously with the front side bus (FSB). We assume memory chips of infinite size (really 4 GB), and no page faults. Our workloads are memory-intensive programs from the SPLASH-2 suite, which should scale to sixteen or more cores [25]. We choose programs with large memory footprints exceeding our L2 cache.

Our baseline is an aggressive CMP with four-issue out-of-order cores, with relevant design parameters listed in Table 1. The baseline design incorporates L1 and L2 caches of 32 KB four-way and 1 MB eight-way associativity, respectively, to model realistic silicon die area utilization, reduce program dependence on main memory, and fit large portions of the working set on chip. The load-store queues are an important characteristic of modern architectures, as they prevent stores from causing stalls. The fixed latency used for accessing main memory incorporates the time to perform an activate and a read/write command. The precharge command time is omitted, since the memory request can be satisfied without, or in tandem with, the row being closed. A request takes 300 cycles, which accounts for the time to open a new row, read the data, and transmit it over the FSB. When data are transmitted over the FSB, the transmission is limited by the bandwidth of the memory channel.

For dynamic power consumption, we conservatively assume that cross communication between cores and clock network power is negligible. Based on current technology trends, static power accounts for at least 50% of total core power [12], and increases linearly with the number of cores and execution time.

Our DRAM chips use 800 MHz 4-4-4-14 timings for the CAS, RAS, and RAS precharge commands, with a precharge-to-row-charge delay of 18 memory clock cycles. Our variable latency calculations for the closed and open page mode variants use discrete command timings for the row access strobe (RAS), column access strobe (CAS), and precharge (PRE).

Our memory controller intercepts requests from the L2 cache. The requests are partitioned into discrete DRAM commands. If the request's page is already open, then only a CAS command is sent to memory. Otherwise, PRE and RAS commands are sent as well. Once the CAS command completes, the memory request is satisfied, and the next command is processed. Our controller implements out-of-order memory scheduling and read priority, thus later requests accessing an open page are given preference over earlier requests. Read requests have preference over older write requests when choosing the next request to satisfy. We use memory buffers that are filled and satisfied on a first come, first served (FCFS) basis; however, the FCFS priority of a request can be preempted if either of the aforementioned two conditions is satisfied. We use an infinite queue size for rescheduling accesses, to remove as many configuration-specific bottlenecks from our memory controller as possible.
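The selection policy described above, FCFS order preempted first by open-page hits and then by reads over older writes, might be sketched as follows. The helper names and request fields are hypothetical; the real controller operates on discrete DRAM commands rather than whole requests.

```python
# Illustrative request picker for the open-page controller (not the authors' code).
from collections import namedtuple

Request = namedtuple("Request", "arrival bank row is_read")

def pick_next(queue, open_rows):
    """Choose the next request to issue, given the open row per bank."""
    if not queue:
        return None
    # 1) Oldest request that hits an already-open page goes first.
    hits = [r for r in queue if open_rows.get(r.bank) == r.row]
    if hits:
        return min(hits, key=lambda r: r.arrival)
    # 2) Otherwise reads preempt older writes.
    reads = [r for r in queue if r.is_read]
    if reads:
        return min(reads, key=lambda r: r.arrival)
    # 3) Fall back to plain first come, first served.
    return min(queue, key=lambda r: r.arrival)

queue = [Request(0, bank=0, row=3, is_read=False),
         Request(1, bank=0, row=7, is_read=True),
         Request(2, bank=0, row=5, is_read=True)]
print(pick_next(queue, {0: 5}))   # the open-page hit (row 5) preempts both older requests
```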

We use a 32-bit memory space, with all addresses 32 bits long. Memory addresses are mapped to row address, memory channel, bank, and column address, from most significant bit to least significant bit, respectively. This is found to be the most optimal configuration [24] because we want columns to have the most variability and rows the least. Additionally, our bit mappings spread requests across the most banks, and reduce the number of row conflicts that occur to any given bank.
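One way to picture that mapping is a fixed bit-slice decode of the 32-bit physical address, with the row in the most significant bits and the column in the least significant bits. The field widths in the sketch below are assumptions for illustration; the paper does not specify them.

```python
# Illustrative 32-bit address decode: row | channel | bank | column (MSB to LSB).
# Field widths are assumptions, not taken from the paper.
COL_BITS, BANK_BITS, CHAN_BITS = 11, 3, 1
ROW_BITS = 32 - CHAN_BITS - BANK_BITS - COL_BITS

def decode(addr):
    col     = addr & ((1 << COL_BITS) - 1)
    bank    = (addr >> COL_BITS) & ((1 << BANK_BITS) - 1)
    channel = (addr >> (COL_BITS + BANK_BITS)) & ((1 << CHAN_BITS) - 1)
    row     = (addr >> (COL_BITS + BANK_BITS + CHAN_BITS)) & ((1 << ROW_BITS) - 1)
    return {"row": row, "channel": channel, "bank": bank, "column": col}

# Consecutive addresses differ only in the column bits, so a streaming access
# pattern stays within one open row; only a row change forces precharge/activate.
print(decode(0x12345678))
```

Putting the column in the low-order bits is what lets spatially local accesses become open-page hits, while the bank bits just above spread independent streams across banks.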


For our SPLASH-2 multi-threaded benchmarks, we use a large input set. Since SPLASH-2 is rather dated, we use aggressive inputs to keep the benchmarks competitive in evaluating current performance bottlenecks on modern systems. Input parameters for our programs are outlined in Table 2. The larger input sets offset the initialization and program partitioning overhead, and the program portions that are not parallelized.

Table 1: Base Architectural Parameters
Technology: 70nm
Number of Cores: 1/4/8/16
Execution: Out of Order
Issue/Decode/Commit Width: 4
Instruction Fetch Queue Size: 8
INT/FP ALU Units: 2/2
Physical Registers: 80
LSQ: 40
Branch Mispredict Latency: 2
Branch Type: Hybrid
L1 Icache: 32KB 2-Way Associative, 1-cycle Access, 32B Lines
L1 Dcache: 32KB 4-Way Associative, 2-cycle Access, 32B Lines
MESI Shared L2 Cache: 1024KB 8-Way Associative, 9-cycle Access, 32B Lines
Main Memory: 300 cycles static latency

Table 2: SPLASH-2 Large Inputs
Cholesky: tk29.0
FFT: 20 Million
Lu: 512x512 matrix, 16×16 blocks
Ocean: 258x258 ocean
Radix: 2M keys

5. EVALUATION

We quantitatively measure computing efficiency as the lowest energy delay squared product (ED²) of our configurations for each benchmark. Researchers find this to be a good metric for measuring power-efficient computing [21]. Figure 3 (a) shows the performance of programs as the number of threads increases from one to 16. Assuming a fixed latency for every main memory request, all programs achieve a modest reduction in execution time by increasing threads. However, even with a static latency, not all benchmarks exhibit linear improvement in performance with increasing threads. This is attributed to: a) programs being limited by available bandwidth, and b) contention for shared data, exhibited by locks/barriers and increases in main memory accesses from contention for the shared cache. Researchers find that for a 32 processor run, a 258 × 258 ocean simulation acquires and releases thousands of locks during execution [25]. Other programs such as radix, lu and fft have no locks and few barriers, and are instead possibly limited by bandwidth. The 16 thread ocean run performs only 2% better than the eight thread version. Bandwidth contention plays a large role here, with threads having a large concentration of misses with increasing thread counts. Figure 3 (b) shows that the most efficient configuration for every benchmark is usually the one with the most threads. This illustrates that when assuming a static latency with peak bandwidth (every memory access being a page hit), the SPLASH benchmarks increase in speed with increases in the number of threads and processors. These results suggest that increasing threads is the way to increase performance efficiently, since frequency has plateaued. ED² results indicate that increasing threads leads to improved or equivalent efficiency for most programs. The exception is ocean, in which the 16 thread case has a 2.6% increase in ED² over its eight thread counterpart, likely due to overheads of scaling and increases in cache misses. As shown in later graphs, these results do not accurately reflect the behavior that would be exhibited in a realistic system constrained by accurate memory latencies and bandwidth.

For a single access, a closed page system and the static latency assumption both satisfy a request within the same number of clock cycles. However, once multiple accesses occur, a DRAM controller utilizing the closed page scheme quickly incurs a backlog of requests to open and close rows, often even the same one. This results in a substantial degradation of performance. Figure 4 (a) shows the performance for a system using a closed page memory system normalized to a single thread fixed latency simulation (the ideal case). Configurations ranging from a single thread to 16 threads are graphed. Closed page mode single threaded benchmarks perform worse than the static latency case due to the multi-dimensional structure of DRAM. Clearly the disadvantage of this system is not only performance but also power consumption, as the same rows precharged are subsequently charged again. In closed page mode, almost all of our benchmarks are memory limited after eight or fewer threads. The exception is lu, which improves performance with increases in the number of processors. Closed page simulations show a 50+% difference (degradation) in performance compared to the static latency scenario. The difference is also non-uniform, with all benchmarks showing a degradation in performance with just one thread, but this trend changes with increasing threads depending on benchmark. Hence, no normalization can map results with no cycle-accurate DRAM timing model to simulations where one is used, because of the non-linear structure of DRAM memory. Figure 4 (b) illustrates the ED² inefficiency of closed page mode for four to 16 threads normalized to the single threaded closed page mode configuration. It indicates that the four threaded configuration is often the most efficient, or close to it, for most benchmarks. Lu is the only benchmark that offers strong evidence to the contrary. The differences for cholesky and fft make increasing the number of threads debatable.

Figure 5 (a) graphs open page mode variable latency. Performance is normalized to the static latency of a single threaded configuration. Again, we see non-uniform results, with some single-threaded open page applications showing little difference in performance (cholesky, lu), others showing degradations (fft, ocean), and one showing significant improvements (radix). Open page mode leverages the temporal and spatial locality inherent within programs, but it fails to achieve the performance that static latency with peak bandwidth assumes would be available.
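For reference, the ED² metric used throughout this section can be computed as in the sketch below. The energy and delay numbers are made up for illustration; they are not measured results from the paper.

```python
# Energy-delay-squared (ED^2) per thread count; lower is better.
def ed2(energy_joules, delay_seconds):
    return energy_joules * delay_seconds ** 2

# Hypothetical (energy, delay) pairs for one benchmark at 1/4/8/16 threads.
runs = {1: (10.0, 8.0), 4: (14.0, 3.5), 8: (22.0, 2.9), 16: (40.0, 2.8)}
scores = {n: ed2(e, d) for n, (e, d) in runs.items()}
print(scores)
print("most efficient thread count:", min(scores, key=scores.get))
```

With numbers like these, the 16-thread run is fastest but the four-thread run minimizes ED², which is exactly the kind of gap the following figures quantify.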


[Figure 3: Results for Static Main Memory Latency Normalized to One Thread. (a) Performance; (b) Energy Delay (ED²) Inefficiency (lower is better)]

[Figure 4: Results for Variable Closed Page Latency Normalized to One Thread. (a) Performance; (b) Energy Delay (ED²) Inefficiency (lower is better)]

[Figure 5: Results for Variable Open Page Latency Normalized to One Thread. (a) Normalized Performance; (b) Energy Delay (ED²) Inefficiency (lower is better)]


[Figure 6: Shadow Row Performance Normalized to Open Page Read Priority]

[Figure 8: 16 Bank Performance Normalized to Eight Bank Open Page Read Priority]

This is surprising, since the aggressive out-of-order memory scheduling should ensure that possible open page hits are maximized, but clearly it is insufficient. Examining open page performance trends, we see this issue is exacerbated with increasing threads as memory pressure increases on the memory channel, leading to programs bottlenecking at lower thread counts, not because of limited bandwidth, but because of the latency of accessing a multi-dimensional DRAM structure.

Figure 5 (b) depicts the inefficiency of programs running with four to 16 threads, normalized to their single threaded counterparts (which are not shown). Surprisingly, the maximum number of threads is rarely the most efficient. For radix the single threaded case is the most efficient, while for fft, cholesky and ocean it is the four thread configuration. Always choosing the maximum number of threads for each benchmark yields 339% worse ED² on average than the optimal value, resulting in increased energy and often delay as well.

We implement the shadow row to reduce the effective latency of DRAM. Figure 6 shows speedup with the shadow row, where performance is normalized to a baseline state-of-the-art memory scheduler that reorders memory accesses, giving priority to open page hits and reads. Significant reductions in execution time are achieved with every benchmark except lu, since it is not as memory limited as the other benchmarks. The benchmarks suffer no performance degradation, since shadow row misses do not increase delay over the baseline scheduler. Overall, the shadow row achieves an average speedup of 12.8% and a peak speedup of 43.5%. For some benchmarks, performance gains decrease with increasing threads, since the memory addresses become more sparse. Sparse addresses that span more than the last two hot rows result in an open page miss.

We investigate the underlying causes for the difference in performance between the shadow row and the baseline open page memory controller by examining the percentage of DRAM page conflicts. DRAM page conflicts occur when the memory requested is not on a currently opened DRAM row, requiring it to be closed and the requested memory's corresponding row opened. Figure 7 (a) shows memory page conflicts for the open page DRAM controller normalized to the total number of requests to main memory. Radix shows a significantly high number of DRAM page conflicts, followed by fft, ocean and cholesky, respectively. This indicates that for some benchmarks, performance degradation is due to DRAM page conflicts rather than to limited bus bandwidth. We validate these observations in a later section, where we examine benchmark sensitivity to bandwidth. Figure 7 (b) graphs DRAM page conflicts normalized to the open page DRAM memory controller. There were only negligible differences in DRAM memory accesses between the open page memory controller and the one with the shadow row, so they are not graphed. For most benchmarks, there are significant reductions in DRAM page conflicts compared to the baseline, with cholesky showing the highest reductions in conflict misses. On average, a 30% reduction in conflict misses is achieved from keeping the previous row open. These page conflict reductions result in our shadow row performance improvements. Some performance benefits are masked by the bandwidth bottleneck, which we explore in the next section. For example, radix shows the most performance improvement, but it does not enjoy the same reductions in page misses. This is due to the other benchmarks (cholesky and fft) being simultaneously bottlenecked by memory bandwidth, reducing their performance gains.

We investigate the sensitivity of benchmark performance to the number of banks and to bus bandwidth. Although current chips consist of eight banks, we are interested in the sensitivity to the number of banks as this increases to 16, and the expected performance improvements. Increasing the number of banks reduces contention on a single bank, since memory requests might get spread across more banks. Figure 8 graphs the performance of 16 banks normalized to a baseline memory controller that utilizes read priority and out-of-order memory scheduling in an eight bank configuration. Each 16 bank configuration is normalized to its respective eight bank configuration. Performance increases from the original eight bank configuration for most benchmarks. However, some benchmarks actually degrade in performance. For example, the eight thread fft run increases in delay by 4.8%. This is because fewer accesses are waiting in the queue for reordering, resulting in fewer accesses to the same hot row.
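The conflict statistics behind Figure 7 can be reproduced in spirit by a classifier like the one below (our own sketch, not the paper's instrumentation). It walks a per-bank trace of row numbers and counts the fraction of accesses that need a precharge and activate, with and without a shadow row holding the previously open page.

```python
# Illustrative DRAM page-conflict counter (not the paper's instrumentation).
def conflict_rate(rows, use_shadow=False):
    open_row, shadow, conflicts = None, None, 0
    for row in rows:
        if row == open_row or (use_shadow and row == shadow):
            continue                      # open-page or shadow-row hit
        conflicts += 1                    # must precharge and activate a new row
        shadow, open_row = open_row, row  # the closed row becomes the shadow row
    return conflicts / len(rows)

trace = [3, 3, 7, 3, 7, 9, 3, 9]          # made-up row trace for one bank
print("open page only :", conflict_rate(trace))
print("with shadow row:", conflict_rate(trace, use_shadow=True))
```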


[Figure 7: DRAM Page Conflict Differences Between Open Page and Open Page with Shadow Row. (a) Page Conflicts Normalized to Total DRAM Accesses; (b) Shadow Row Conflicts Normalized to Open Page Conflicts]

[Figure 9: Increasing Bandwidth Performance Normalized to Open Page Read Priority]

This results in less contention but fewer hits to a hot row, whose access latency is even lower than that of the static latency configuration. Comparing Figure 8 to Figure 6 indicates that the shadow row actually performs better for every benchmark and thread configuration except for the radix four and eight thread runs. Additionally, the best performance for cholesky and fft is still the eight thread configuration, and for radix and ocean it is the four thread configuration. Thus, doubling the number of banks fails to adequately change the thread performance curve. In fact, ocean saturates even faster now (at four threads rather than the eight threads of the previous eight bank configuration).

We examine the performance improvements that can be expected by reducing the amount of time data occupies the bus (effectively speeding up the bus). Specifically, we assume we can double the bus speed or widen communication lanes, such that data only occupies the bus for half the number of CPU clock cycles it previously did. We graph the results of applying this optimization to our open-page memory controller, which we use as our baseline, in Figure 9. Some benchmarks (cholesky, fft, ocean) show speedups that increase with the number of threads due to increasing memory pressure. Two benchmarks show no improvements, for two different reasons. Lu was not originally constrained by memory, so increasing bandwidth does not improve its performance. Radix shows negligible improvements at lower thread counts, because it suffers more from page conflicts than from limited bandwidth. This is foreshadowed in the earlier graph, which shows multi-threaded radix having very high DRAM page conflict misses (over 50%). In such a scenario, it is more important to reduce page conflicts than to improve available bandwidth, since the time for satisfying a DRAM conflict is so high. This is also verified by the speedups for radix in Figure 6, where the shadow row significantly improves performance by providing more open page hits than the baseline open-page memory controller.
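A minimal way to model that experiment is to halve the number of core cycles each burst holds the channel, as in the sketch below; the burst size and bytes-per-cycle figures are assumptions for illustration, not the simulated bus parameters.

```python
# Illustrative bus-occupancy model for the doubled-bandwidth experiment.
# Doubling the bus speed (or widening the lanes) halves the CPU cycles a
# transfer occupies the channel; DRAM page-conflict latency is unchanged.
def bus_occupancy_cycles(burst_bytes, bytes_per_cycle, speedup=1):
    return burst_bytes // (bytes_per_cycle * speedup)

BURST = 32   # one 32 B cache line per memory request
print("baseline occupancy:", bus_occupancy_cycles(BURST, 4))             # 8 cycles
print("doubled bandwidth :", bus_occupancy_cycles(BURST, 4, speedup=2))  # 4 cycles
```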

6. RELATED WORK

The literature contains a wealth of research on memory devices and subsystems. We focus on DRAM latency-reduction and bandwidth-enhancing techniques here.

Researchers previously found that the wide gap in performance between memory and processor results in a memory wall [26]. This was the observation that the processor, regardless of operating speed, would be limited by the slower speed of the memory it depends on for data. Previous work has examined this issue for single-threaded benchmarks on single core processors [20]. One way to alleviate the speed gap between memory and processor is efficient DRAM memory scheduling. Examined earlier for streaming benchmarks [16], memory scheduling strategies have also been examined for high performance computing [19], with several scheduling methods culminating from this work. One idea is to schedule outstanding memory accesses that map to the same row consecutively, out of the order in which they arrived at the memory controller. With this strategy, emphasis is given to first processing memory requests that map to the currently open page. Another strategy is to satisfy read accesses out of order over write requests, since programs can usually continue processing with outstanding writes pending.


However, both strategies need some heuristic to ensure outstanding requests do not starve because of the continued promotion of later requests.

Researchers find DRAM does not effectively capture the temporal locality of memory [28]. One solution is to leverage the temporal locality of DRAM data by using cached DRAM, where some cache is stored in the DRAM along with the DRAM arrays. Memory requests are first handled by the cache. A miss in the DRAM cache results in a memory controller on the DRAM forwarding the request onto the main DRAM arrays. We don't investigate this implementation, as it requires a memory controller as well as the implementation of cache on the DRAM. Current trends indicate multiple memory controllers will be integrated on chip, increasing in tandem with the number of cores [7].

Enhanced SDRAM (ESDRAM) is a similar technology to cached DRAM, utilizing an SRAM cache row for each bank between the IO channel and the sense amplifiers [4]. The SRAM cache rows store an entire open page. The row can be closed and another row opened without disrupting the data being read from the SRAM row, effectively keeping the previous row open longer. Additionally, the row can be closed while the data from that row are still available for accessing, pipelining the overhead of the subsequent precharge command. However, unlike the shadow row, it cannot keep two different pages open and feed data from both rows of the same bank to the chip. ESDRAM's cache row functions like a normal open page, and can be written to and read from.

Prior research examined prefetching data to reduce the effects of the memory wall. One idea is to prefetch cache blocks when the DRAM channels are idle [15]. This method does not increase pressure on the DRAM channel, since the prefetching only occurs when the channels are free. Other work dynamically modifies the TLB and cache [1] to improve performance, reducing the effects of the memory wall. Another idea is to remap non-contiguous physical addresses using directives from the application and compiler. Researchers [3] find this improves cache locality and bus utilization, as well as being an effective tool for prefetching from the memory controller.

The adaptive history based (AHB) memory controller reschedules memory accesses, accounting for extended dimensions of the DRAM memory hierarchy [8]. The AHB uses the same bank and row hardware conflict management schemes previously discussed. However, rank, port and channel conflicts are avoided using an adaptive history based algorithm that reschedules memory accesses based on what previous memory accesses are already in the pipeline. This is done to reduce the conflict of switching communication buses, which have significant latencies in their configuration. In the CMP domain, Nesbit et al. [17] recently looked at running serial programs in unison on a CMP. They find that the first come, first served policy that worked in the serial domain is no longer optimal. They apply theoretical techniques used for quality of service in networks to reduce variance in bandwidth utilization and improve system performance. Since our work is orthogonal to the above methods, they could be used in conjunction to achieve higher throughput and efficiency.

Researchers at the University of Maryland examined why it is worthwhile to study DRAM at the system level [9]. This work culminated in the development of DRAMsim, a tool for modeling the effects of variable latency [23]. Currently, DRAMsim does not support partitioning memory requests into discrete memory commands. However, future versions are scheduled to implement discrete memory controller commands, making it a viable possibility for our future research. It can also currently be used by architects to examine the potential performance of their architecture when the DRAM is accurately modeled. Recently, they compared FBDIMM and DDR performance, finding good FBDIMM performance to be dependent on bus utilization and traffic [6]. Additionally, Jaleel et al. [11] [10] find that systems with larger reorder buffers, once modeled accurately with a memory controller, can actually degrade in performance. They find that increasing reorder buffer sizes beyond 128 entries can increase the frequency of replay traps and data cache misses, resulting in performance degradation.

We examine the effects that increasing the number of banks has on performance. Previous research found increasing banks results in an 18% performance improvement [24]; however, their memory controller did not perform any memory reordering. Memory reordering that results in multiple hits to an open page results in lower latency than the static latency case. Although currently there are no high density 16 bank DDR DRAMs being shipped, researchers have shown their feasibility [13].

Another method of reducing the memory bottleneck is incorporating multiple memory controllers on chip, which is outside the scope of this study. The main disadvantages of multiple memory controllers are space limitations on chip and providing the pin count for adequate bandwidth. However, recent advances in photonics have reduced the latency of on-chip networks, making them a potentially feasible alternative in the distant future. With the potential for a large amount of bandwidth and many memory channels without the resulting explosive increases in pin count, using photonics to feed future processors could solve the memory wall problem. Researchers have looked at the potential gains of using photonics on-chip [14], and if there are enough banks, bandwidth contention could become trivial.

7. CONCLUSIONS

We find that the memory wall limits multi-threaded program performance, and this issue is exacerbated with increasing threads. We demonstrate that memory intensive programs are limited by DRAM latencies and bandwidth as the number of cores and program threads increases. Even programs that show performance gains with increasing thread counts and limited bandwidth actually degrade considerably when the multi-dimensional structure of DRAM is accounted for. This bottleneck results in severely limited computing efficiency, and (depending on application) it also increases energy or delay costs. Attempts at using state-of-the-art memory scheduling techniques to hide DRAM latencies are not sufficient to reverse the trend. We find that using the optimal number of threads versus the maximum results in average efficiency improvements of 19.7% with memory intensive programs from the SPLASH-2 benchmarks.

We implement closed and open page mode memory controllers and demonstrate their effects on performance. We introduce a shadow row technique for reducing main memory latency, and attain average delay reductions of 13%. We examine performance sensitivity to DRAM parameters: specifically, increasing the number of banks on a single channel and increasing the number of bus lanes.


Unfortunately, neither of these achieves consistently higher thread performance. Our shadow row achieves competitive performance with other solutions that attempt to reach the maximum theoretical bandwidth of the channel, through reduced conflict misses and without the associated hardware overhead. It specifically alleviates issues that increasing bandwidth does not, chiefly page conflict misses. All results can be further tuned by using a ratio of old requests to new requests and reads to writes.

Future work will examine further ways to reduce the latency of main memory so program throughput can continue to grow in tandem with the number of cores and threads. We will also examine the behavior of SPEC OMP workloads [22] and investigate running several single and multi-threaded programs simultaneously on a CMP.

8. REFERENCES

[1] R. Balasubramonian, D. Albonesi, A. Buyuktosunoglu, and S. Dwarkadas. Dynamic memory hierarchy performance optimization. In Workshop on Solving the Memory Wall Problem, held at the 27th International Symposium on Computer Architecture, June 2000.

[2] D. Burger, J. Goodman, and A. Kägi. Memory bandwidth limitations of future microprocessors. In Proc. 23rd IEEE/ACM International Symposium on Computer Architecture, pages 78–89, May 1996.

[3] J. Carter, W. Hsieh, L. Stoller, M. Swanson, L. Zhang, E. Brunvand, A. Davis, C.-C. Kuo, R. Kuramkote, M. Parker, L. Schaelicke, and T. Tateyama. Impulse: Building a smarter memory controller. In Proc. Fifth Annual Symposium on High Performance Computer Architecture, pages 70–79, Jan. 1999.

[4] V. Cuppu, B. Jacob, B. Davis, and T. Mudge. A performance comparison of contemporary DRAM architectures. In Proc. 26th IEEE/ACM International Symposium on Computer Architecture, pages 222–233, May 1999.

[5] M. Curtis-Maury, K. Singh, S. McKee, F. Blagojevic, D. Nikolopoulos, B. de Supinski, and M. Schulz. Identifying energy-efficient concurrency levels using machine learning. In Proc. 1st International Workshop on Green Computing, Sept. 2007.

[6] B. Ganesh, A. Jaleel, D. Wang, and B. Jacob. Fully-buffered DIMM memory architectures: Understanding mechanisms, overheads and scaling. In Proc. 13th IEEE Symposium on High Performance Computer Architecture, pages 109–120, Feb. 2007.

[7] H. Hidaka, Y. Matsuda, M. Asakura, and K. Fujishima. The cache DRAM architecture: A DRAM with an on-chip cache memory. IEEE Micro, 10(2):14–25, 1990.

[8] I. Hur and C. Lin. Adaptive history-based memory schedulers. In Proc. IEEE/ACM 38th Annual International Symposium on Microarchitecture, pages 343–354, Dec. 2004.

[9] B. Jacob. A case for studying DRAM issues at the system level. IEEE Micro, 23(4):44–56, 2003.

[10] A. Jaleel. The effects of aggressive out-of-order mechanisms on the memory sub-system. Ph.D. dissertation, University of Maryland, July 2005.

[11] A. Jaleel and B. Jacob. Using virtual load/store queues (VLSQs) to reduce the negative effects of reordered memory instructions. In Proc. 11th IEEE Symposium on High Performance Computer Architecture, pages 266–277, Feb. 2005.

[12] N. Kim, T. Austin, D. Baauw, T. Mudge, K. Flautner, J. Hu, M. Irwin, M. Kandemir, and V. Narayanan. Leakage current: Moore's law meets static power. IEEE Computer, 36(12):68–75, Dec. 2003.

[13] T. Kirihata, G. Mueller, B. Ji, G. Frankowsky, J. Ross, H. Terletzki, D. Netis, O. Weinfurtner, D. Hanson, G. Daniel, L.-C. Hsu, D. Sotraska, A. Reith, M. Hug, K. Guay, M. Selz, P. Poechmueller, H. Hoenigschmid, and M. Wordeman. A 390-mm², 16-bank, 1-Gb DDR SDRAM with hybrid bitline architecture. IEEE Journal of Solid-State Circuits, 34(11):1580–1588, 1999.

[14] N. Kirman, M. Kirman, R. Dokania, J. Martinez, A. Apsel, M. Watkins, and D. Albonesi. Leveraging optical technology in future bus-based chip multiprocessors. In Proc. IEEE/ACM 40th Annual International Symposium on Microarchitecture, pages 492–503, Dec. 2006.

[15] W. Lin, S. Reinhardt, and D. Burger. Reducing DRAM latencies with an integrated memory hierarchy design. In Proc. 7th IEEE Symposium on High Performance Computer Architecture, pages 301–312, Jan. 2001.

[16] S. McKee, W. Wulf, J. Aylor, R. Klenke, M. Salinas, S. Hong, and D. Weikle. Dynamic access ordering for streamed computations. IEEE Transactions on Computers, 49(11):1255–1271, Nov. 2000.

[17] K. Nesbit, N. Aggarwal, J. Laudon, and J. Smith. Fair queuing memory systems. In Proc. IEEE/ACM 40th Annual International Symposium on Microarchitecture, pages 208–222, Dec. 2006.

[18] J. Renau. SESC. http://sesc.sourceforge.net/index.html, 2002.

[19] S. Rixner, W. Dally, U. Kapasi, P. Mattson, and J. Owens. Memory access scheduling. In Proc. 27th IEEE/ACM International Symposium on Computer Architecture, pages 128–138, June 2000.

[20] R. Sites. It's the memory, stupid! Microprocessor Report, 10(10):2–3, Aug. 1996.

[21] M. Stan and K. Skadron. Guest editors' introduction: Power-aware computing. IEEE Computer, 36(12):35–38, Dec. 2003.

[22] Standard Performance Evaluation Corporation. SPEC OMP benchmark suite. http://www.specbench.org/hpg/omp2001/, 2001.

[23] D. Wang, B. Ganesh, N. Tuaycharoen, K. Baynes, A. Jaleel, and B. Jacob. DRAMsim: A memory-system simulator. In Computer Architecture News, volume 33, pages 100–107, Sept. 2005.

[24] D. T. Wang. Modern DRAM memory systems: Performance analysis and a high performance, power-constrained DRAM-scheduling algorithm. Ph.D. dissertation, University of Maryland, May 2005.

[25] S. Woo, M. Ohara, E. Torrie, J. Singh, and A. Gupta. The SPLASH-2 programs: Characterization and methodological considerations. In Proc. 22nd IEEE/ACM International Symposium on Computer Architecture, pages 24–36, June 1995.


[26] W. Wulf and S. McKee. Hitting the memory wall: Implications of the obvious. Computer Architecture News, 23(1):20–24, Mar. 1995.

[27] T. Yamauchi, L. Hammond, and K. Olukotun. The hierarchical multi-bank DRAM: A high-performance architecture for memory integrated with processors. In Conference on Advanced Research in VLSI, pages 303–320, Nov. 1997.

[28] Z. Zhang, Z. Zhu, and X. Zhang. Cached DRAM for ILP processor memory access latency reduction. IEEE Micro, 21(4):22–32, July/August 2001.
