
Optimizing Throughput for Multithreaded Workloads on Memory Constrained CMPs

Major Bhadauria and Sally A. McKee
Computer Systems Lab, Cornell University, Ithaca, NY, USA
[email protected], [email protected]

ABSTRACT

Multi-core designs have become the industry imperative, replacing our reliance on increasingly complicated micro-architectural designs and VLSI improvements to deliver increased performance at lower power budgets. Performance of these multi-core chips will be limited by the DRAM memory system: we demonstrate this by modeling a cycle-accurate DDR2 memory controller with SPLASH-2 workloads. Surprisingly, benchmarks that appear to scale well with the number of processors fail to do so when memory is accurately modeled. We frequently find that the most efficient configuration is not the one with the most threads. By choosing the most efficient number of threads for each benchmark, average energy delay efficiency improves by a factor of 3.39, and performance improves by 19.7%, on average. We also introduce a shadow row of sense amplifiers, an alternative to cached DRAM, to explore potential power/performance impacts. The shadow row works in conjunction with the L2 cache to leverage temporal and spatial locality across memory accesses, thus attaining average and peak speedups of 13% and 43%, respectively, when compared to a state-of-the-art DRAM memory scheduler.

Categories and Subject Descriptors

C.4 [Performance of Systems]: Design Studies; C.1 [Processor Architectures]: Parallel Architectures

General Terms

Design, Performance

Keywords

Performance, Power, Memory, Efficiency, Bandwidth

1. INTRODUCTION

Power and thermal constraints have begun to limit the maximum operating frequency of high performance processors. The cubic increase in power from increases in frequency, and the higher voltages required to attain those frequencies, has reached a plateau. By leveraging increasing die space for more processing cores (creating chip multiprocessors, or CMPs) and larger caches, designers hope that multi-threaded programs can exploit shrinking transistor sizes to deliver equal or higher throughput than their single-threaded, single-core predecessors. The current software paradigm is based on the assumption that multi-threaded programs with little contention for shared data scale (nearly) linearly with the number of processors, yielding power-efficient data throughput. Nonetheless, applications fully exploiting available cores, where each application thread enjoys its own private working set, may still fail to achieve maximum throughput [5]. Bandwidth and memory bottlenecks limit the potential of multi-core processors, just as in the single-core domain [2].

We explore memory effects on a CMP running a multi-threaded workload. We model a memory controller using state-of-the-art scheduling techniques to hide DRAM access latency. Pin limitations affect available memory channels and bandwidth, which can cause applications not to scale well with increasing cores. This limits the maximum number of cores that should be used to achieve the best ratio between efficiency and performance. Academic computer architects rarely model the variable latencies of main memory requests, instead assuming a fixed latency for every memory request [9]. We compare differences in performance between modeling a static memory latency and modeling a memory controller with realistic, variable DRAM latencies for the SPLASH-2 benchmark suite. We specifically examine CMP efficiency, as measured by the energy delay squared product.
Our results demonstrate the insufficiency of assuming static latencies: the behaviors caused by such modeling choices affect not only performance but also the shapes of the performance curves with respect to the number of threads. Additionally, we find that choosing the optimal configuration instead of the one with the most threads can result in significantly higher computing efficiency. This is a direct result of variable DRAM latencies and the bottlenecks they cause for multi-threaded programs. This bottleneck is visible at a lower thread count than in earlier research [25] when the multi-dimensional structure of DRAM is modeled.


To help mitigate these bottlenecks, we introduce a shadow row of DRAM sense amplifiers. The shadow row keeps the previously accessed DRAM page accessible (in read-only mode) without closing the current page, and thus it can be fed to the core when it satisfies the pending memory request. This leverages the memory's temporal and spatial locality characteristics to potentially improve performance. The shadow row can be used in conjunction with normal caching or can be used to implement an alternative non-caching policy, preventing cache pollution by data unlikely to be reused.

The contributions of this work are threefold.

• We extend a DDR2 memory model to include memory optimizations exploiting the shadow row.

• We find significant non-linear performance differences between bandwidth-limited static latency simulations and those using a realistic DRAM model and memory controller.

• We find that the maximum number of threads is rarely the optimal number to use for performance or power.

[Figure 1: Performance for Open Page Mode with Read Priority Normalized to No Priority]

2. THE MULTI-DIMENSIONAL DRAM STRUCTURE

Synchronous dynamic random access memory (SDRAM) is the mainstay for bridging the gap between cache and disk. DRAMs have not increased in operating frequency or bandwidth proportionately with processors. One innovation is double data rate SDRAM (DDR-SDRAM), which transfers data on both the positive and negative edges of the clock. DDR2 SDRAM further increases available memory bandwidth by doubling the internal memory frequency of DRAM chips. However, memory still operates an order of magnitude slower than the core processor.

Main memory (DRAM) is partitioned into ranks, each composed of independently addressable banks, and each bank contains storage arrays addressed by row and column. When the processor requires data from memory, the request is steered to a specific rank and bank based on physical address. All banks on a rank share command and data buses, which can lead to contention when memory requests arrive in bursts.

Banks receive an activate command to charge a bank of sense amplifiers. A data row is fetched from the storage array and copied to the sense amps (a hot row or open page). Banks are sent a read command to access a specific column of the row. Once accesses to the row finish, a precharge command closes it so another can be opened: reading from DRAM is destructive, so the precharge command writes data back from the hot row to DRAM storage before precharging the sense amps. This prevents pipelining accesses to different rows within a bank, but commands can be pipelined to different banks to exploit parallelism within the memory subsystem. Although banks are not pipelined, data can be interleaved across banks to reduce effective latency. For data with high spatial locality, the same row is likely to be accessed consecutively multiple times. Precharge and activate commands are wasted when the same previously accessed row is closed and subsequently reopened. Modern memory controllers usually keep rows open to leverage spatial locality, operating in "open page" mode. Memory controllers can also operate in "closed page" mode, where a row is always closed after it is accessed.
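To make the command sequence concrete, the following sketch models per-bank access latency under both policies. It is our own illustration, not the paper's simulator: the timing constants are placeholders rather than the DDR2 parameters used later, and the class and function names are invented for this example.

```python
# Illustrative per-bank DRAM timing model (placeholder constants, not the
# paper's DDR2 timings). Latency depends on which row the bank has open.
T_PRE, T_ACT, T_CAS = 4, 4, 4      # precharge, activate, column read

class Bank:
    def __init__(self):
        self.open_row = None        # row currently held in the sense amplifiers

    def access(self, row, closed_page=False):
        """Return the command latency for reading one column of `row`."""
        if self.open_row == row:                      # open-page hit: CAS only
            latency = T_CAS
        elif self.open_row is None:                   # bank idle: ACT + CAS
            latency = T_ACT + T_CAS
        else:                                         # row conflict: PRE + ACT + CAS
            latency = T_PRE + T_ACT + T_CAS
        self.open_row = None if closed_page else row  # closed page always precharges
        return latency

bank = Bank()
print([bank.access(r) for r in (7, 7, 7, 3, 7)])      # consecutive hits are cheap
```

Under the closed page policy every access pays the activate cost, which is why a controller that keeps rows open can do much better on spatially local access streams.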
3. EXPLOITING GREATER LOCALITY

To take advantage of the high spatial locality present in the memory access patterns of many parallel scientific applications, we make a simple modification to our DRAM model. We add a second "hot row" of sense amplifiers per DRAM bank, storing the second-most recently accessed data page. When the main row is sent a precharge command, its data are copied into this shadow row while being written back to the bank's storage array.

The shadow row supplies data for read requests, but data are not written to it, since it does not perform write-backs to the main storage array. The shadow row cannot satisfy write requests, but programs are most sensitive to read requests. Since most cores employ a store buffer, this seems a reasonable tradeoff between complexity and performance. We examine the performance improvements of prioritizing reads over writes in satisfying memory requests in a standard DRAM controller (now a common memory controller optimization). Figure 1 shows significant speedups by giving reads priority over writes for some benchmarks. Clock cycles are averaged for multi-threaded configurations, and then normalized to their counterparts that give reads and writes equal priorities. Differences in clock cycles among different threads are negligible. Giving reads priority generally improves performance for all but a few benchmarks. For these, our scheduling can lead to thread starvation by giving newer reads priority over older outstanding writes. Prior research [8] addresses this by having a threshold at which writes increase in priority over reads.

Only one set of sense amplifiers responds to commands at any time (i.e., either the shadow or the hot row); avoiding parallel operation simplifies power issues and data routing. We make minimal modifications to a standard DRAM design, since cost is an issue for commodity DRAM manufacturing. If manufacturing cost restrictions are relaxed, the hot row can perform operations in parallel while the shadow row feeds data to the chip. Our mutual-exclusion restriction results in minimal changes to internal DRAM chip circuitry. Figure 2 illustrates a DRAM chip with modifications (shaded) to show a potential organization for incorporating shadow rows. To further reduce costs and power overhead, shadow row sense amplifiers can be replaced by latches, since the shadow row reads data from the sense amplifiers and not from the (lower-powered) storage array. We introduce no additional buses, adding minimal logic overhead. The shadow row shares the bus with the original hot row. Since the "column address strobe" (CAS) command uses fewer input pins than the "row address strobe" (RAS), the extra pins can be used to indicate whether to read data from the hot row or the shadow row. Effectively utilizing the shadow row thus requires slight memory controller modifications. Note that the shadow row does not require sense amplifiers, and can use simple flip-flop memory elements. Sense amplifiers are large and power hungry, especially compared to digital CMOS memory elements. Thus, the addition of an extra page of buffers results in negligible power and area overhead.

Yamauchi et al. [27] explore a different approach to increasing the effective number of banks without increasing circuit overhead. They pipeline and overlap bank structures (such as row and column decoders and output sense amplifiers) over several data arrays. They implement 32 such banks, while only using the hardware overhead of four banks. They thus trade area costs for increased complexity in the memory controller; the memory controller must ensure banks sharing data structures do not conflict with one another when satisfying memory requests. Increasing the number of banks reduces the queue depth pressure, reducing the chance that reordering can achieve multiple open-page hits. In contrast, we explicitly try to maintain high queue depth pressure, thereby increasing the possibilities of page hits, which reduce latency to below that of the fixed latency case.

[Figure 2: Internal DRAM Circuitry with Shadow Row]
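As a rough sketch of the read path just described (our own simplification, not the authors' circuit or controller logic), each bank keeps one writable hot row plus one read-only shadow row; a precharge copies the hot row into the shadow row, and a later read to either row avoids reopening the page.

```python
# Simplified per-bank shadow-row behavior (illustrative only).
class ShadowedBank:
    def __init__(self):
        self.hot_row = None      # row held by the sense amplifiers
        self.shadow_row = None   # previously open row, readable but never written

    def precharge(self):
        # Closing the hot row copies it into the shadow latches while it is
        # written back to the storage array.
        if self.hot_row is not None:
            self.shadow_row = self.hot_row
        self.hot_row = None

    def activate(self, row):
        self.hot_row = row

    def read(self, row):
        """Return which structure services the read, reopening the row if needed."""
        if row == self.hot_row:
            return "hot-row hit"
        if row == self.shadow_row:
            return "shadow-row hit"        # served read-only; no PRE/ACT issued
        self.precharge()                   # page conflict: close, then reopen
        self.activate(row)
        return "page conflict"

    def write(self, row):
        # Writes always go through the hot row; the shadow row never accepts them.
        if row != self.hot_row:
            self.precharge()
            self.activate(row)

b = ShadowedBank()
b.activate(5); b.precharge(); b.activate(9)
print(b.read(9), "|", b.read(5), "|", b.read(2))   # hot-row hit | shadow-row hit | page conflict
```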


4. EXPERIMENTAL SETUP

We use SESC [18], a cycle-accurate MIPS ISA simulator, to evaluate our workloads. We incorporate a cycle-accurate DRAM memory controller that models transmitting memory commands synchronously with the front side bus (FSB). We assume memory chips of infinite size (really 4 GB), and no page faults. Our workloads are memory-intensive programs from the SPLASH-2 suite, which should scale to sixteen or more cores [25]. We choose programs with large memory footprints exceeding our L2 cache.

Our baseline is an aggressive CMP with four-issue out-of-order cores, with relevant design parameters listed in Table 1. The baseline design incorporates L1 and L2 caches of 32 KB four-way and 1 MB eight-way associativity, respectively, to model realistic silicon die area utilization, reduce program dependence on main memory, and fit large portions of the working set on chip. The load-store queues are an important characteristic of modern architectures, as they prevent stores from causing stalls. The fixed latency used for accessing main memory incorporates the time to perform an activate and a read/write command. The precharge command time is omitted, since the memory request can be satisfied without, or in tandem with, the row being closed. A request takes 300 cycles, which accounts for the time to open a new row, read the data, and transmit it over the FSB. When data are transmitted over the FSB, the transmission is limited by the bandwidth of the memory channel.

For dynamic power consumption, we conservatively assume that cross communication between cores and clock network power is negligible. Based on current technology trends, static power accounts for at least 50% of total core power [12], and increases linearly with the number of cores and execution time.

Our DRAM chips use 800 MHz 4-4-4-14 timings for the CAS, RAS, and RAS precharge commands, with a precharge-to-row-charge delay of 18 memory clock cycles. Our variable latency calculations for the closed and open page mode variants use discrete command timings for the row access strobe (RAS), column access strobe (CAS), and precharge (PRE).

Our memory controller intercepts requests from the L2 cache. The requests are partitioned into discrete DRAM commands. If the request's page is already open, then only a CAS command is sent to memory. Otherwise, PRE and RAS commands are sent as well. Once the CAS command completes, the memory request is satisfied, and the next command is processed. Our controller implements out-of-order memory scheduling and read priority, thus later requests accessing an open page are given preference over earlier requests. Read requests have preference over older write requests when choosing the next request to satisfy. We use memory buffers that are filled and satisfied on a first come, first served (FCFS) basis; however, the FCFS priority of a request can be preempted if either of the aforementioned two conditions is satisfied. We use an infinite queue size for rescheduling accesses, to remove as many configuration-specific bottlenecks from our memory controller as possible.
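The selection policy described above, FCFS order preempted first by open-page hits and then by reads over older writes, might be sketched as follows. The helper names and request fields are hypothetical; the real controller operates on discrete DRAM commands rather than whole requests.

```python
# Illustrative request picker for the open-page controller (not the authors' code).
from collections import namedtuple

Request = namedtuple("Request", "arrival bank row is_read")

def pick_next(queue, open_rows):
    """Choose the next request to issue, given the open row per bank."""
    if not queue:
        return None
    # 1) Oldest request that hits an already-open page goes first.
    hits = [r for r in queue if open_rows.get(r.bank) == r.row]
    if hits:
        return min(hits, key=lambda r: r.arrival)
    # 2) Otherwise reads preempt older writes.
    reads = [r for r in queue if r.is_read]
    if reads:
        return min(reads, key=lambda r: r.arrival)
    # 3) Fall back to plain first come, first served.
    return min(queue, key=lambda r: r.arrival)

queue = [Request(0, bank=0, row=3, is_read=False),
         Request(1, bank=0, row=7, is_read=True),
         Request(2, bank=0, row=5, is_read=True)]
print(pick_next(queue, {0: 5}))   # the open-page hit (row 5) preempts both older requests
```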

We use a 32-bit memory space, with all addresses 32 bits long. Memory addresses are mapped to row address, memory channel, bank, and column address, from most significant bit to least significant bit, respectively. This is found to be the most optimal configuration [24] because we want columns to have the most variability and rows the least. Additionally, our bit mappings spread requests across the most banks, and reduce the number of row conflicts that occur to any given bank.
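One way to picture that mapping is a fixed bit-slice decode of the 32-bit physical address, with the row in the most significant bits and the column in the least significant bits. The field widths in the sketch below are assumptions for illustration; the paper does not specify them.

```python
# Illustrative 32-bit address decode: row | channel | bank | column (MSB to LSB).
# Field widths are assumptions, not taken from the paper.
COL_BITS, BANK_BITS, CHAN_BITS = 11, 3, 1
ROW_BITS = 32 - CHAN_BITS - BANK_BITS - COL_BITS

def decode(addr):
    col     = addr & ((1 << COL_BITS) - 1)
    bank    = (addr >> COL_BITS) & ((1 << BANK_BITS) - 1)
    channel = (addr >> (COL_BITS + BANK_BITS)) & ((1 << CHAN_BITS) - 1)
    row     = (addr >> (COL_BITS + BANK_BITS + CHAN_BITS)) & ((1 << ROW_BITS) - 1)
    return {"row": row, "channel": channel, "bank": bank, "column": col}

# Consecutive addresses differ only in the column bits, so a streaming access
# pattern stays within one open row; only a row change forces precharge/activate.
print(decode(0x12345678))
```

Putting the column in the low-order bits is what lets spatially local accesses become open-page hits, while the bank bits just above spread independent streams across banks.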


For our SPLASH-2 multi-threaded benchmarks, we use a large input set. Since SPLASH-2 is rather dated, we use aggressive inputs to keep the benchmarks competitive in evaluating current performance bottlenecks on modern systems. Input parameters for our programs are outlined in Table 2. The larger input sets offset the initialization and program partitioning overhead, and the program portions that are not parallelized.

Table 1: Base Architectural Parameters
Technology: 70nm
Number of Cores: 1/4/8/16
Execution: Out of Order
Issue/Decode/Commit Width: 4
Instruction Fetch Queue Size: 8
INT/FP ALU Units: 2/2
Physical Registers: 80
LSQ: 40
Branch Mispredict Latency: 2
Branch Type: Hybrid
L1 Icache: 32KB 2-Way Associative, 1-cycle Access, 32B Lines
L1 Dcache: 32KB 4-Way Associative, 2-cycle Access, 32B Lines
MESI Shared L2 Cache: 1024KB 8-Way Associative, 9-cycle Access, 32B Lines
Main Memory: 300 cycles static latency

Table 2: SPLASH-2 Large Inputs
Cholesky: tk29.0
FFT: 20 Million
Lu: 512x512 matrix, 16×16 blocks
Ocean: 258x258 ocean
Radix: 2M keys

5. EVALUATION

We quantitatively measure computing efficiency as the lowest energy delay squared product (ED²) of our configurations for each benchmark. Researchers find this to be a good metric for measuring power-efficient computing [21]. Figure 3 (a) shows the performance of programs as the number of threads increases from one to 16. Assuming a fixed latency for every main memory request, all programs achieve a modest reduction in execution time by increasing threads. However, even with a static latency, not all benchmarks exhibit linear improvement in performance with increasing threads. This is attributed to: a) programs being limited by available bandwidth, and b) contention for shared data, exhibited by locks/barriers and increases in main memory accesses from contention for the shared cache. Researchers find that for a 32 processor run, a 258 × 258 ocean simulation acquires and releases thousands of locks during execution [25]. Other programs such as radix, lu and fft have no locks and few barriers, and are instead possibly limited by bandwidth. The 16 thread ocean run performs only 2% better than the eight thread version. Bandwidth contention plays a large role here, with threads having a large concentration of misses with increasing thread counts. Figure 3 (b) shows that the most efficient configuration for every benchmark is usually the one with the most threads. This illustrates that when assuming a static latency with peak bandwidth (every memory access being a page hit), the SPLASH benchmarks increase in speed with increases in the number of threads and processors. These results suggest that increasing threads is the way to increase performance efficiently, since frequency has plateaued. ED² results indicate that increasing threads leads to improved or equivalent efficiency for most programs. The exception is ocean, in which the 16 thread case has a 2.6% increase in ED² over its eight thread counterpart, likely due to overheads of scaling and increases in cache misses. As shown in later graphs, these results do not accurately reflect the behavior that would be exhibited in a realistic system constrained by accurate memory latencies and bandwidth.

For a single access, a closed page system and the static latency assumption both satisfy a request within the same number of clock cycles. However, once multiple accesses occur, a DRAM controller utilizing the closed page scheme quickly incurs a backlog of requests to open and close rows, often even the same one. This results in a substantial degradation of performance. Figure 4 (a) shows the performance for a system using a closed page memory system normalized to a single thread fixed latency simulation (the ideal case). Configurations ranging from a single thread to 16 threads are graphed. Closed page mode single threaded benchmarks perform worse than the static latency case due to the multi-dimensional structure of DRAM. Clearly the disadvantage of this system is not only performance but also power consumption, as the same rows precharged are subsequently charged again. In closed page mode, almost all of our benchmarks are memory limited after eight or fewer threads. The exception is lu, which improves performance with increases in the number of processors. Closed page simulations show a 50+% difference (degradation) in performance compared to the static latency scenario. The difference is also non-uniform, with all benchmarks showing a degradation in performance with just one thread, but this trend changes with increasing threads depending on benchmark. Hence, no normalization can map results with no cycle-accurate DRAM timing model to simulations where one is used, because of the non-linear structure of DRAM memory. Figure 4 (b) illustrates the ED² inefficiency of closed page mode for four to 16 threads normalized to the single threaded closed page mode configuration. It indicates that the four threaded configuration is often the most efficient, or close to it, for most benchmarks. Lu is the only benchmark that offers strong evidence to the contrary. The differences for cholesky and fft make increasing the number of threads debatable.

Figure 5 (a) graphs open page mode variable latency. Performance is normalized to the static latency of a single threaded configuration. Again, we see non-uniform results, with some single-threaded open page applications showing little difference in performance (cholesky, lu), others showing degradations (fft, ocean), and one showing significant improvements (radix). Open page mode leverages the temporal and spatial locality inherent within programs, but it fails to achieve the performance that static latency with peak bandwidth assumes would be available.
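For reference, the ED² metric used throughout this section can be computed as in the sketch below. The energy and delay numbers are made up for illustration; they are not measured results from the paper.

```python
# Energy-delay-squared (ED^2) per thread count; lower is better.
def ed2(energy_joules, delay_seconds):
    return energy_joules * delay_seconds ** 2

# Hypothetical (energy, delay) pairs for one benchmark at 1/4/8/16 threads.
runs = {1: (10.0, 8.0), 4: (14.0, 3.5), 8: (22.0, 2.9), 16: (40.0, 2.8)}
scores = {n: ed2(e, d) for n, (e, d) in runs.items()}
print(scores)
print("most efficient thread count:", min(scores, key=scores.get))
```

With numbers like these, the 16-thread run is fastest but the four-thread run minimizes ED², which is exactly the kind of gap the following figures quantify.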


[Figure 3: Results for Static Main Memory Latency Normalized to One Thread. (a) Performance; (b) Energy Delay (ED²) Inefficiency (lower is better)]

[Figure 4: Results for Variable Closed Page Latency Normalized to One Thread. (a) Performance; (b) Energy Delay (ED²) Inefficiency (lower is better)]

[Figure 5: Results for Variable Open Page Latency Normalized to One Thread. (a) Normalized Performance; (b) Energy Delay (ED²) Inefficiency (lower is better)]


[Figure 6: Shadow Row Performance Normalized to Open Page Read Priority]

[Figure 8: 16 Bank Performance Normalized to Eight Bank Open Page Read Priority]

This is surprising, since the aggressive out-of-order memory scheduling should ensure that possible open page hits are maximized, but clearly it is insufficient. Examining open page performance trends, we see this issue is exacerbated with increasing threads as memory pressure increases on the memory channel, leading to programs bottlenecking at lower thread counts, not because of limited bandwidth, but because of the latency of accessing a multi-dimensional DRAM structure.

Figure 5 (b) depicts the inefficiency of programs running with four to 16 threads, normalized to their single threaded counterparts (which are not shown). Surprisingly, the maximum number of threads is rarely the most efficient. For radix the single threaded case is the most efficient, while for fft, cholesky and ocean it is the four thread configuration. Always choosing the maximum number of threads for each benchmark yields 339% worse ED² on average than the optimal value, resulting in increased energy and often delay as well.

We implement the shadow row to reduce the effective latency of DRAM. Figure 6 shows speedup with the shadow row, where performance is normalized to a baseline state-of-the-art memory scheduler that reorders memory accesses, giving priority to open page hits and reads. Significant reductions in execution time are achieved with every benchmark except lu, since it is not as memory limited as the other benchmarks. The benchmarks suffer no performance degradation, since shadow row misses do not increase delay over the baseline scheduler. Overall, the shadow row achieves an average speedup of 12.8% and a peak speedup of 43.5%. For some benchmarks, performance gains decrease with increasing threads, since the memory addresses become more sparse. Sparse addresses that span more than the last two hot rows result in an open page miss.

We investigate the underlying causes for the difference in performance between the shadow row and the baseline open page memory controller by examining the percentage of DRAM page conflicts. DRAM page conflicts occur when the memory requested is not on a currently opened DRAM row, requiring it to be closed and the requested memory's corresponding row opened. Figure 7 (a) shows memory page conflicts for the open page DRAM controller normalized to the total number of requests to main memory. Radix shows a significantly high number of DRAM page conflicts, followed by fft, ocean and cholesky, respectively. This indicates that for some benchmarks, performance degradation is due to DRAM page conflicts rather than to limited bus bandwidth. We validate these observations in a later section, where we examine benchmark sensitivity to bandwidth. Figure 7 (b) graphs DRAM page conflicts normalized to the open page DRAM memory controller. There were only negligible differences in DRAM memory accesses between the open page memory controller and the one with the shadow row, so they are not graphed. For most benchmarks, there are significant reductions in DRAM page conflicts compared to the baseline, with cholesky showing the highest reductions in conflict misses. On average, a 30% reduction in conflict misses is achieved from keeping the previous row open. These page conflict reductions result in our shadow row performance improvements. Some performance benefits are masked by the bandwidth bottleneck, which we explore in the next section. For example, radix shows the most performance improvement, but it does not enjoy the same reductions in page misses. This is due to the other benchmarks (cholesky and fft) being simultaneously bottlenecked by memory bandwidth, reducing their performance gains.

We investigate the sensitivity of benchmark performance to the number of banks and to bus bandwidth. Although current chips consist of eight banks, we are interested in the sensitivity to the number of banks as this increases to 16, and the expected performance improvements. Increasing the number of banks reduces contention on a single bank, since memory requests might get spread across more banks. Figure 8 graphs the performance of 16 banks normalized to a baseline memory controller that utilizes read priority and out-of-order memory scheduling in an eight bank configuration. Each 16 bank configuration is normalized to its respective eight bank configuration. Performance increases from the original eight bank configuration for most benchmarks. However, some benchmarks actually degrade in performance. For example, the eight thread fft run increases in delay by 4.8%. This is because fewer accesses are waiting in the queue for reordering, resulting in fewer accesses to the same hot row.
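The conflict statistics behind Figure 7 can be reproduced in spirit by a classifier like the one below (our own sketch, not the paper's instrumentation). It walks a per-bank trace of row numbers and counts the fraction of accesses that need a precharge and activate, with and without a shadow row holding the previously open page.

```python
# Illustrative DRAM page-conflict counter (not the paper's instrumentation).
def conflict_rate(rows, use_shadow=False):
    open_row, shadow, conflicts = None, None, 0
    for row in rows:
        if row == open_row or (use_shadow and row == shadow):
            continue                      # open-page or shadow-row hit
        conflicts += 1                    # must precharge and activate a new row
        shadow, open_row = open_row, row  # the closed row becomes the shadow row
    return conflicts / len(rows)

trace = [3, 3, 7, 3, 7, 9, 3, 9]          # made-up row trace for one bank
print("open page only :", conflict_rate(trace))
print("with shadow row:", conflict_rate(trace, use_shadow=True))
```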


[Figure 7: DRAM Page Conflict Differences Between Open Page and Open Page with Shadow Row. (a) Page Conflicts Normalized to Total DRAM Accesses; (b) Shadow Row Conflicts Normalized to Open Page Conflicts]

[Figure 9: Increasing Bandwidth Performance Normalized to Open Page Read Priority]

This results in less contention but fewer hits to a hot row, whose access latency is even lower than that of the static latency configuration. Comparing Figure 8 to Figure 6 indicates that the shadow row actually performs better for every benchmark and thread configuration except for the radix four and eight thread runs. Additionally, the best performance for cholesky and fft is still the eight thread configuration, and for radix and ocean it is the four thread configuration. Thus, doubling the number of banks fails to adequately change the thread performance curve. In fact, ocean saturates even faster now (at four threads rather than the eight threads of the previous eight bank configuration).

We examine the performance improvements that can be expected by reducing the amount of time data occupies the bus (effectively speeding up the bus). Specifically, we assume we can double the bus speed or widen communication lanes, such that data only occupies the bus for half the number of CPU clock cycles it previously did. We graph the results of applying this optimization to our open-page memory controller, which we use as our baseline, in Figure 9. Some benchmarks (cholesky, fft, ocean) show speedups that increase with the number of threads due to increasing memory pressure. Two benchmarks show no improvements, for two different reasons. Lu was not originally constrained by memory, so increasing bandwidth does not improve its performance. Radix shows negligible improvements at lower thread counts, because it suffers more from page conflicts than from limited bandwidth. This is foreshadowed in the earlier graph, which shows multi-threaded radix having very high DRAM page conflict misses (over 50%). In such a scenario, it is more important to reduce page conflicts than to improve available bandwidth, since the time for satisfying a DRAM conflict is so high. This is also verified by the speedups for radix in Figure 6, where the shadow row significantly improves performance by providing more open page hits than the baseline open-page memory controller.
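A minimal way to model that experiment is to halve the number of core cycles each burst holds the channel, as in the sketch below; the burst size and bytes-per-cycle figures are assumptions for illustration, not the simulated bus parameters.

```python
# Illustrative bus-occupancy model for the doubled-bandwidth experiment.
# Doubling the bus speed (or widening the lanes) halves the CPU cycles a
# transfer occupies the channel; DRAM page-conflict latency is unchanged.
def bus_occupancy_cycles(burst_bytes, bytes_per_cycle, speedup=1):
    return burst_bytes // (bytes_per_cycle * speedup)

BURST = 32   # one 32 B cache line per memory request
print("baseline occupancy:", bus_occupancy_cycles(BURST, 4))             # 8 cycles
print("doubled bandwidth :", bus_occupancy_cycles(BURST, 4, speedup=2))  # 4 cycles
```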

6. RELATED WORK

The literature contains a wealth of research on memory devices and subsystems. We focus on DRAM latency-reduction and bandwidth-enhancing techniques here.

Researchers previously found that the wide gap in performance between memory and processor results in a memory wall [26]. This was the observation that the processor, regardless of operating speed, would be limited by the slower speed of the memory it depends on for data. Previous work has examined this issue for single-threaded benchmarks on single core processors [20]. One way to alleviate the speed gap between memory and processor is efficient DRAM memory scheduling. Examined earlier for streaming benchmarks [16], memory scheduling strategies have also been examined for high performance computing [19], with several scheduling methods culminating from this work. One idea is to schedule outstanding memory accesses that map to the same row consecutively, out of the order in which they arrived at the memory controller. With this strategy, emphasis is given to first processing memory requests that map to the currently open page. Another strategy is to satisfy read accesses out of order over write requests, since programs can usually continue processing with outstanding writes pending.


However, both strategies need some heuristic to ensure outstanding requests do not starve because of the continued promotion of later requests.

Researchers find DRAM does not effectively capture the temporal locality of memory [28]. One solution is to leverage the temporal locality of DRAM data by using cached DRAM, where some cache is stored in the DRAM along with the DRAM arrays. Memory requests are first handled by the cache. A miss in the DRAM cache results in a memory controller on the DRAM forwarding the request onto the main DRAM arrays. We don't investigate this implementation, as it requires a memory controller as well as the implementation of cache on the DRAM. Current trends indicate multiple memory controllers will be integrated on chip, increasing in tandem with the number of cores [7].

Enhanced SDRAM (ESDRAM) is a similar technology to cached DRAM, utilizing an SRAM cache row for each bank between the IO channel and the sense amplifiers [4]. The SRAM cache rows store an entire open page. The row can be closed and another row opened without disrupting the data being read from the SRAM row, effectively keeping the previous row open longer. Additionally, the row can be closed while the data from that row are still available for accessing, pipelining the overhead of the subsequent precharge command. However, unlike the shadow row, it cannot keep two different pages open and feed data from both rows of the same bank to the chip. ESDRAM's cache row functions like a normal open page, and can be written to and read from.

Prior research examined prefetching data to reduce the effects of the memory wall. One idea is to prefetch cache blocks when the DRAM channels are idle [15]. This method does not increase pressure on the DRAM channel, since the prefetching only occurs when the channels are free. Other work dynamically modifies the TLB and cache [1] to improve performance, reducing the effects of the memory wall. Another idea is to remap non-contiguous physical addresses using directives from the application and compiler. Researchers [3] find this improves cache locality and bus utilization, as well as being an effective tool for prefetching from the memory controller.

The adaptive history based (AHB) memory controller reschedules memory accesses, accounting for extended dimensions of the DRAM memory hierarchy [8]. The AHB uses the same bank and row hardware conflict management schemes previously discussed. However, rank, port and channel conflicts are avoided using an adaptive history based algorithm that reschedules memory accesses based on what previous memory accesses are already in the pipeline. This is done to reduce the conflict of switching communication buses, which have significant latencies in their configuration. In the CMP domain, Nesbit et al. [17] recently looked at running serial programs in unison on a CMP. They find that the first come, first served policy that worked in the serial domain is no longer optimal. They apply theoretical techniques used for quality of service in networks to reduce variance in bandwidth utilization and improve system performance. Since our work is orthogonal to the above methods, they could be used in conjunction to achieve higher throughput and efficiency.

Researchers at the University of Maryland examined why it is worthwhile to study DRAM at the system level [9]. This work culminated in the development of DRAMsim, a tool for modeling the effects of variable latency [23]. Currently, DRAMsim does not support partitioning memory requests into discrete memory commands. However, future versions are scheduled to implement discrete memory controller commands, making it a viable possibility for our future research. It can also currently be used by architects to examine the potential performance of their architecture when the DRAM is accurately modeled. Recently, they compared FBDIMM and DDR performance, finding good FBDIMM performance to be dependent on bus utilization and traffic [6]. Additionally, Jaleel et al. [11] [10] find that systems with larger reorder buffers, once modeled accurately with a memory controller, can actually degrade in performance. They find that increasing reorder buffer sizes beyond 128 entries can increase the frequency of replay traps and data cache misses, resulting in performance degradation.

We examine the effects that increasing the number of banks has on performance. Previous research found increasing banks results in an 18% performance improvement [24]; however, their memory controller did not perform any memory reordering. Memory reordering that results in multiple hits to an open page results in lower latency than the static latency case. Although currently there are no high density 16 bank DDR DRAMs being shipped, researchers have shown their feasibility [13].

Another method of reducing the memory bottleneck is incorporating multiple memory controllers on chip, which is outside the scope of this study. The main disadvantages of multiple memory controllers are space limitations on chip and providing the pin count for adequate bandwidth. However, recent advances in photonics have reduced the latency of on-chip networks, making them a potentially feasible alternative in the distant future. With the potential for a large amount of bandwidth and many memory channels without the resulting explosive increases in pin count, using photonics to feed future processors could solve the memory wall problem. Researchers have looked at the potential gains of using photonics on-chip [14], and if there are enough banks, bandwidth contention could become trivial.

7. CONCLUSIONS

We find that the memory wall limits multi-threaded program performance, and this issue is exacerbated with increasing threads. We demonstrate that memory intensive programs are limited by DRAM latencies and bandwidth as the number of cores and program threads increases. Even programs that show performance gains with increasing thread counts and limited bandwidth actually degrade considerably when the multi-dimensional structure of DRAM is accounted for. This bottleneck results in severely limited computing efficiency, and (depending on application) it also increases energy or delay costs. Attempts at using state-of-the-art memory scheduling techniques to hide DRAM latencies are not sufficient to reverse the trend. We find that using the optimal number of threads versus the maximum results in average efficiency improvements of 19.7% with memory intensive programs from the SPLASH-2 benchmarks.

We implement closed and open page mode memory controllers and demonstrate their effects on performance. We introduce a shadow row technique for reducing main memory latency, and attain average delay reductions of 13%. We examine performance sensitivity to DRAM parameters: specifically, increasing the number of banks on a single channel and increasing the number of bus lanes.


Unfortunately, neither of these achieves consistently higher thread performance. Our shadow row achieves competitive performance with other solutions that attempt to reach the maximum theoretical bandwidth of the channel, through reduced conflict misses and without the associated hardware overhead. It specifically alleviates issues that increasing bandwidth does not, chiefly page conflict misses. All results can be further tuned by using a ratio of old requests to new requests and reads to writes.

Future work will examine further ways to reduce the latency of main memory so program throughput can continue to grow in tandem with the number of cores and threads. We will also examine the behavior of SPEC OMP workloads [22] and investigate running several single and multi-threaded programs simultaneously on a CMP.

8. REFERENCES

[1] R. Balasubramonian, D. Albonesi, A. Buyuktosunoglu, and S. Dwarkadas. Dynamic memory hierarchy performance optimization. In Workshop on Solving the Memory Wall Problem, held at the 27th International Symposium on Computer Architecture, June 2000.

[2] D. Burger, J. Goodman, and A. Kägi. Memory bandwidth limitations of future microprocessors. In Proc. 23rd IEEE/ACM International Symposium on Computer Architecture, pages 78–89, May 1996.

[3] J. Carter, W. Hsieh, L. Stoller, M. Swanson, L. Zhang, E. Brunvand, A. Davis, C.-C. Kuo, R. Kuramkote, M. Parker, L. Schaelicke, and T. Tateyama. Impulse: Building a smarter memory controller. In Proc. Fifth Annual Symposium on High Performance Computer Architecture, pages 70–79, Jan. 1999.

[4] V. Cuppu, B. Jacob, B. Davis, and T. Mudge. A performance comparison of contemporary DRAM architectures. In Proc. 26th IEEE/ACM International Symposium on Computer Architecture, pages 222–233, May 1999.

[5] M. Curtis-Maury, K. Singh, S. McKee, F. Blagojevic, D. Nikolopoulos, B. de Supinski, and M. Schulz. Identifying energy-efficient concurrency levels using machine learning. In Proc. 1st International Workshop on Green Computing, Sept. 2007.

[6] B. Ganesh, A. Jaleel, D. Wang, and B. Jacob. Fully-buffered DIMM memory architectures: Understanding mechanisms, overheads and scaling. In Proc. 13th IEEE Symposium on High Performance Computer Architecture, pages 109–120, Feb. 2007.

[7] H. Hidaka, Y. Matsuda, M. Asakura, and K. Fujishima. The cache DRAM architecture: A DRAM with an on-chip cache memory. IEEE Micro, 10(2):14–25, 1990.

[8] I. Hur and C. Lin. Adaptive history-based memory schedulers. In Proc. IEEE/ACM 38th Annual International Symposium on Microarchitecture, pages 343–354, Dec. 2004.

[9] B. Jacob. A case for studying DRAM issues at the system level. IEEE Micro, 23(4):44–56, 2003.

[10] A. Jaleel. The effects of aggressive out-of-order mechanisms on the memory sub-system. Ph.D. dissertation, University of Maryland, July 2005.

[11] A. Jaleel and B. Jacob. Using virtual load/store queues (VLSQs) to reduce the negative effects of reordered memory instructions. In Proc. 11th IEEE Symposium on High Performance Computer Architecture, pages 266–277, Feb. 2005.

[12] N. Kim, T. Austin, D. Baauw, T. Mudge, K. Flautner, J. Hu, M. Irwin, M. Kandemir, and V. Narayanan. Leakage current: Moore's law meets static power. IEEE Computer, 36(12):68–75, Dec. 2003.

[13] T. Kirihata, G. Mueller, B. Ji, G. Frankowsky, J. Ross, H. Terletzki, D. Netis, O. Weinfurtner, D. Hanson, G. Daniel, L.-C. Hsu, D. Sotraska, A. Reith, M. Hug, K. Guay, M. Selz, P. Poechmueller, H. Hoenigschmid, and M. Wordeman. A 390-mm², 16-bank, 1-Gb DDR SDRAM with hybrid bitline architecture. IEEE Journal of Solid-State Circuits, 34(11):1580–1588, 1999.

[14] N. Kirman, M. Kirman, R. Dokania, J. Martinez, A. Apsel, M. Watkins, and D. Albonesi. Leveraging optical technology in future bus-based chip multiprocessors. In Proc. IEEE/ACM 40th Annual International Symposium on Microarchitecture, pages 492–503, Dec. 2006.

[15] W. Lin, S. Reinhardt, and D. Burger. Reducing DRAM latencies with an integrated memory hierarchy design. In Proc. 7th IEEE Symposium on High Performance Computer Architecture, pages 301–312, Jan. 2001.

[16] S. McKee, W. Wulf, J. Aylor, R. Klenke, M. Salinas, S. Hong, and D. Weikle. Dynamic access ordering for streamed computations. IEEE Transactions on Computers, 49(11):1255–1271, Nov. 2000.

[17] K. Nesbit, N. Aggarwal, J. Laudon, and J. Smith. Fair queuing memory systems. In Proc. IEEE/ACM 40th Annual International Symposium on Microarchitecture, pages 208–222, Dec. 2006.

[18] J. Renau. SESC. http://sesc.sourceforge.net/index.html, 2002.

[19] S. Rixner, W. Dally, U. Kapasi, P. Mattson, and J. Owens. Memory access scheduling. In Proc. 27th IEEE/ACM International Symposium on Computer Architecture, pages 128–138, June 2000.

[20] R. Sites. It's the memory, stupid! Microprocessor Report, 10(10):2–3, Aug. 1996.

[21] M. Stan and K. Skadron. Guest editors' introduction: Power-aware computing. IEEE Computer, 36(12):35–38, Dec. 2003.

[22] Standard Performance Evaluation Corporation. SPEC OMP benchmark suite. http://www.specbench.org/hpg/omp2001/, 2001.

[23] D. Wang, B. Ganesh, N. Tuaycharoen, K. Baynes, A. Jaleel, and B. Jacob. DRAMsim: A memory-system simulator. In Computer Architecture News, volume 33, pages 100–107, Sept. 2005.

[24] D. T. Wang. Modern DRAM memory systems: Performance analysis and a high performance, power-constrained DRAM-scheduling algorithm. Ph.D. dissertation, University of Maryland, May 2005.

[25] S. Woo, M. Ohara, E. Torrie, J. Singh, and A. Gupta. The SPLASH-2 programs: Characterization and methodological considerations. In Proc. 22nd IEEE/ACM International Symposium on Computer Architecture, pages 24–36, June 1995.


[26] W. Wulf and S. McKee. Hitting the memory wall: Implications of the obvious. Computer Architecture News, 23(1):20–24, Mar. 1995.

[27] T. Yamauchi, L. Hammond, and K. Olukotun. The hierarchical multi-bank DRAM: A high-performance architecture for memory integrated with processors. In Conference on Advanced Research in VLSI, pages 303–320, Nov. 1997.

[28] Z. Zhang, Z. Zhu, and X. Zhang. Cached DRAM for ILP processor memory access latency reduction. IEEE Micro, 21(4):22–32, July/August 2001.
