

CHOP: Integrating DRAM Caches for CMP Platforms

Integrating large DRAM caches is a promising way to address the memory bandwidth wall issue in the many-core era. However, organizing and implementing a large DRAM cache imposes a trade-off between tag space overhead and memory bandwidth consumption. CHOP (Caching Hot Pages) addresses this trade-off through three filter-based DRAM-caching techniques.

Xiaowei Jiang, Intel
Niti Madan, IBM Thomas J. Watson Research Center
Li Zhao, Intel
Mike Upton, Intel
Ravi Iyer, Intel
Srihari Makineni, Intel
Donald Newell, Intel
Yan Solihin, North Carolina State University
Rajeev Balasubramonian, University of Utah

Today's multicore processors already integrate multiple cores on a die, and many-core architectures enable far more small cores for throughput computing. The key challenge in many-core architectures is the memory bandwidth wall:1-3 supplying the memory bandwidth needed to keep all cores running smoothly. In the past, researchers have proposed large dynamic RAM (DRAM) caches to address this challenge.4-7 Such solutions fall into two main categories: DRAM caches with small allocation granularity (64-byte cache lines) and DRAM caches with large allocation granularity (page sizes such as 4-Kbyte or 8-Kbyte cache lines). However, fine-grained DRAM cache allocation comes at the cost of tag array storage overhead. Because the DRAM tag must be stored on die for fast lookup, this overhead implies that the last-level cache (LLC) must shrink to accommodate the DRAM cache tag.8 The alternative, coarse-grained allocation, comes at the cost of significant memory bandwidth consumption and tends to limit performance benefits for applications that don't have high spatial locality across all their pages.

To deal with this dilemma, we propose CHOP (Caching Hot Pages), a methodology comprising three filter-based DRAM-caching techniques that cache only hot (frequently used) pages:

- a filter cache that determines the hot subset of pages to allocate into the DRAM cache,
- a memory-based filter cache that spills and fills the filter state to improve the filter cache's accuracy and reduce its size, and
- an adaptive DRAM-caching technique that switches the DRAM-caching policy on the fly.

By employing coarse-grained allocation, we considerably reduce the on-die tag space needed for the DRAM cache. By allocating only hot pages into the DRAM cache, we avoid the memory bandwidth problem because we don't waste bandwidth on pages with low spatial locality.

DRAM-caching benefits and challenges




Figure 1. Overview of dynamic RAM (DRAM) cache benefits, challenges, and trade-offs: tag array storage overhead (a), misses per instruction (b), memory bandwidth utilization (c), and performance improvement (d).

Researchers have investigated DRAM caches as a solution to the memory bandwidth wall problem.3,6 Such a solution enables 8 to 16 times higher density than traditional static RAM (SRAM) caches,4 and consequently provides far higher cache capacity. The basic DRAM-caching approach is to integrate the tag arrays on die for fast lookup, whereas the DRAM cache data arrays can be either on chip (embedded DRAM, or eDRAM) or off chip (3D stacked or in a multichip package [MCP]).

To understand the benefits and challenges of DRAM caching, we experimented with the Transaction Processing Performance Council C benchmark (TPC-C). Figure 1 shows the DRAM cache benefits (in terms of misses per instruction [MPI] and performance) along with the trade-offs (in terms of tag space overhead and memory bandwidth utilization). The baseline configuration has no DRAM cache. At smaller cache lines, tag space overhead becomes significant. Figure 1a shows that a 128-Mbyte DRAM cache with 64-byte cache lines has a tag space overhead of 6 Mbytes. If we implement this cache, then we must either increase the die size by 6 Mbytes or reduce the LLC space to accommodate the tag space. Either alternative significantly offsets the performance benefits, so we don't consider this a viable approach. The significant tag space overhead at small cache lines inherently promotes the use of large cache lines, which alleviates the tag space problem by more than an order of magnitude (as Figure 1a shows, the 4-Kbyte line size enables a tag space of only a few kilobytes). However, Figure 1b shows that using larger cache lines doesn't significantly reduce MPI. This is because larger cache line sizes alone aren't sufficient for providing spatial locality. More importantly, the memory bandwidth utilization increases significantly (Figure 1c) because the decrease in DRAM cache miss rate (or MPI) isn't directly proportional to the increase in line size. Such an increase in main memory bandwidth requirements saturates the main memory subsystem and therefore immediately and significantly affects this DRAM cache configuration's performance (as shown by the 2-Kbyte and 4-Kbyte line size results in the last two bars of Figure 1d).
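To make the tag-space arithmetic concrete, the following Python sketch computes the on-die tag-array size as a function of line size for a 128-Mbyte DRAM cache. The 24-bit per-entry width (tag plus state bits) is our assumption for illustration; the article reports only the resulting totals, such as the roughly 6-Mbyte overhead at 64-byte lines.

    # Sketch: DRAM cache tag-array overhead versus line size.
    # Assumption (ours): roughly 24 bits of tag plus state per cache line;
    # the article only reports the ~6-Mbyte total for 64-byte lines.

    def tag_array_bytes(cache_bytes: int, line_bytes: int, bits_per_entry: int = 24) -> float:
        """On-die storage needed to tag every line of the DRAM cache."""
        num_lines = cache_bytes // line_bytes
        return num_lines * bits_per_entry / 8

    DRAM_CACHE = 128 * 2**20  # 128-Mbyte DRAM cache

    for line_size in (64, 128, 256, 512, 1024, 2048, 4096):
        overhead = tag_array_bytes(DRAM_CACHE, line_size)
        print(f"{line_size:>5}-byte lines -> {overhead / 2**20:6.3f} Mbytes of tag storage")

Under this assumption, moving from 64-byte to 4-Kbyte lines shrinks the tag array by a factor of 64, which is the order-of-magnitude reduction Figure 1a illustrates.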

Server workload DRAM-caching characterization

Using large cache line sizes (such as 4-Kbyte page sizes) considerably alleviates the tag space overhead. However, the main memory bandwidth requirement increases accordingly and becomes the primary bottleneck. Therefore, at large cache line sizes, the DRAM-caching challenge changes into a memory bandwidth efficiency issue. Saving memory bandwidth requires strictly controlling the amount of cache lines that enter the DRAM cache. However, even when caching only a subset of the working set, we must retain most of the useful DRAM cache hits to provide performance benefits. So, we must carefully choose this subset of the working set.

To understand how we should choose this subset, we performed a detailed offline profiling of workloads to obtain their DRAM cache access information (assuming 4-Kbyte page-size cache lines). Table 1 shows the results for four server workloads. For this profiling, we define hot pages as the topmost-accessed pages that together contribute 80 percent of the total number of accesses. We calculate the hot-page percentage as the number of hot pages divided by the total number of pages for each workload. On average, about 25 percent of the pages are hot pages. The last column in Table 1 also shows the minimum number of accesses to these hot pages: on average, a hot page is accessed at least 79 times. If we can capture such hot pages on the fly and allocate them into the DRAM cache, we can reduce memory traffic by about 75 percent and retain about 80 percent of the DRAM cache hits.

Table 1. Offline hot-page profiling statistics for server workloads.*

Workload    Hot-page percentage    Hot-page minimum no. of accesses
SAP         24.8                   95
Sjas        38.4                   65
Sjbb        30.6                   93
TPC-C       7.2                    64
Average     25.2                   79

*SAP: SAP Standard Application; Sjas (SPECjApps2004): SPEC Java Application Server 2004; Sjbb (SPECjbb2005): SPEC Java Server 2005; TPC-C: Transaction Processing Performance Council C.
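The profiling behind Table 1 can be expressed as a short sketch: rank pages by DRAM cache access count and keep the most-accessed pages until they cover 80 percent of all accesses. The function and variable names below are ours, and the access counts are a toy example rather than measured data.

    # Sketch of the offline hot-page profiling behind Table 1 (names are ours).

    def profile_hot_pages(page_accesses: dict, coverage: float = 0.80):
        """Return (hot-page percentage, minimum accesses among hot pages)."""
        total = sum(page_accesses.values())
        hot_counts, covered = [], 0
        # Walk pages from most to least accessed until 80% of accesses are covered.
        for page, count in sorted(page_accesses.items(), key=lambda kv: kv[1], reverse=True):
            if covered >= coverage * total:
                break
            hot_counts.append(count)
            covered += count
        hot_pct = 100.0 * len(hot_counts) / len(page_accesses)
        return hot_pct, min(hot_counts)

    # Toy example: a few pages dominate the access stream.
    counts = {0: 500, 1: 300, 2: 120, 3: 40, 4: 25, 5: 10, 6: 5}
    print(profile_hot_pages(counts))  # about 29% of pages cover 80% of accesses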
Filter-based DRAM caching

We introduce a small filter cache that profiles the memory access pattern and identifies hot pages: pages that are heavily accessed because of temporal or spatial locality. By enabling a filter cache that identifies hot pages, we can introduce a filter-based DRAM cache that allocates cache lines only for the hot pages. By eliminating allocation of cold pages in the DRAM cache, we can reduce the bandwidth wasted on cache lines that never get touched later.

Figure 2. The filter cache for DRAM caching: architecture overview (a) and filter cache structure (b). (DT: DRAM cache tag array; FC: filter cache; L2C: level-two cache; LLC: last-level cache; LRU: least recently used; Way: cache way of the filter cache.)

Filter cache (CHOP-FC)

Figure 2a shows our basic filter-based DRAM-caching architecture (CHOP-FC), which incorporates a filter cache on die with the DRAM cache tag array (the data array is off die via 3D stacking or MCP). Both the filter cache and the DRAM cache use page-size lines. Each filter cache entry (Figure 2b) includes a tag, LRU (least recently used) bits, and a counter indicating how often the line is touched. The filter cache allocates a line into its tag array the first time an LLC miss occurs on that line. The counter is set to zero for a newly allocated line and is incremented on subsequent LLC misses to that line. Once the counter reaches a threshold, the filter cache considers the line to be a hot page and puts it into the DRAM cache.
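A minimal sketch of the CHOP-FC decision on each LLC miss follows, assuming a simple LRU-managed filter cache. Whether a line leaves the filter cache when it is promoted, and other low-level details, are our simplifications rather than the article's specification.

    # Sketch of CHOP-FC's decision on each LLC miss (structure and names are ours).
    from collections import OrderedDict

    class FilterCache:
        def __init__(self, num_entries: int, threshold: int):
            self.entries = OrderedDict()   # page -> access counter, kept in LRU order
            self.capacity = num_entries
            self.threshold = threshold

        def on_llc_miss(self, page: int) -> str:
            """Return the action CHOP-FC takes for this miss."""
            if page not in self.entries:
                if len(self.entries) >= self.capacity:
                    self.entries.popitem(last=False)   # LRU eviction: counter history is lost
                self.entries[page] = 0                 # newly allocated line starts at zero
            else:
                self.entries[page] += 1                # later misses increment the counter
                self.entries.move_to_end(page)
            if self.entries[page] >= self.threshold:
                del self.entries[page]                 # promoted pages are tracked by the DRAM cache
                return "hot: fetch the whole page into the DRAM cache"
            return "cold: send a regular 64-byte request to memory"

    # Example: with threshold 32, the 33rd miss to the same page promotes it.
    fc = FilterCache(num_entries=64, threshold=32)
    for _ in range(33):
        action = fc.on_llc_miss(page=0x1234)
    print(action)   # -> "hot: fetch the whole page into the DRAM cache"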



To allocate a hot page into the DRAM cache, the filter cache sends a page request to the memory to fetch the page, and the DRAM cache chooses a victim. Our measurements indicated that the LFU (least frequently used) replacement policy performs better than the LRU policy for the DRAM cache. To employ the LFU policy, the DRAM cache also maintains counters, similarly to the way the filter cache does; a line's counter is incremented whenever a hit to that line occurs.

When the DRAM cache replaces a line, it chooses the line with the smallest counter value as the victim. The DRAM cache then returns the victim to the filter cache as a cold page, while writing its data back to memory. The victim's counter is set to half its current value so that the victim has a better chance of becoming hot than other new lines in the filter cache. The processor considers as cold pages any LLC misses that also miss in the filter cache and the DRAM cache (or that hit in the filter cache with a counter below the threshold). For such pages, the LLC sends a regular 64-byte request to the memory. So, the processor doesn't waste memory bandwidth on fetching cold pages into the DRAM cache.
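The DRAM-cache side of the protocol, LFU victim selection plus the counter-halving hand-back to the filter cache, can be sketched as follows; representing the filter cache as a plain page-to-counter map is our simplification.

    # Sketch of the DRAM-cache side: LFU victim selection and hand-back to the
    # filter cache with a halved counter (our simplification of the mechanism).

    class DramCache:
        def __init__(self, num_lines: int):
            self.lines = {}          # page -> hit counter (used by LFU)
            self.capacity = num_lines

        def on_hit(self, page: int):
            self.lines[page] += 1    # LFU bookkeeping: count hits to the line

        def allocate_hot_page(self, page: int, filter_counters: dict):
            """Install a hot page; if full, evict the least-frequently-used line."""
            if len(self.lines) >= self.capacity:
                victim = min(self.lines, key=self.lines.get)
                count = self.lines.pop(victim)
                # Victim data is written back to memory (not modeled here); the victim
                # returns to the filter cache with half its counter, so it ranks above
                # brand-new lines and can become hot again sooner.
                filter_counters[victim] = count // 2
            self.lines[page] = 0

    # Example: a two-line cache evicting its least-frequently-hit line.
    dc = DramCache(num_lines=2)
    fc_counters = {}
    dc.allocate_hot_page(0xA, fc_counters)
    dc.allocate_hot_page(0xB, fc_counters)
    dc.on_hit(0xA)
    dc.allocate_hot_page(0xC, fc_counters)   # evicts 0xB (fewest hits)
    print(fc_counters)                       # victim 0xB (printed in decimal) returns with half its counter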
Memory-based filter cache (CHOP-MFC)

In the baseline filter cache scheme, the filter cache size directly impacts how long a line can stay in it, and a line's counter history is completely lost when the line is evicted. This could reduce the likelihood of a hot page being identified. To deal with this problem, we propose a memory-based filter cache (CHOP-MFC), which spills a counter into memory when its line is replaced and restores it when the line is fetched back again. With a memory backup for counters, we can expect the filter cache to provide more accurate hot-page identification. In addition, this scheme lets us safely reduce the filter cache size significantly.

Storing counters in memory requires allocating extra space in the memory. A 16-Gbyte memory with a 4-Kbyte page size and a counter threshold of 256 (8-bit counters) requires 4 Mbytes of memory space. We propose three options for this memory backup.

In the first option, either the operating system allocates this counter memory space in the main memory, or the firmware or BIOS reserves it without exposing it to the operating system. However, this option requires off-chip memory accesses to look up the counters in the allocated memory region. For smaller filters with potentially higher miss rates, the performance benefit of using a filter cache might not be able to amortize the cost of off-chip counter lookups.

To deal with these problems, the second option is to pin this counter memory space in the DRAM cache. If the DRAM cache size is 128 Mbytes, and only 4 Mbytes of memory space are required for counter backup, then the loss in DRAM cache space is minimal (about 3 percent). To generate the DRAM cache address for retrieving a counter, we simply use a page's frame number to index the counter memory space in the DRAM cache.

Similar to CHOP-FC, whenever an LLC miss occurs, the processor checks the filter cache and increments the corresponding counter. To further reduce counter lookup latency, the processor can prefetch counters upon a lower-level cache miss. If a replacement in the filter cache occurs, the processor updates the corresponding counter in the reserved memory space; this counter writeback can occur in the background. Once the counter in the filter cache reaches the threshold, the filter cache identifies a hot page and installs it into the DRAM cache. (More details on this option are available elsewhere.9)
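The counter-backup sizing and the option-2 lookup reduce to simple arithmetic. The sketch below reproduces the 4-Mbyte example and shows one way a page frame number could index the counter region pinned in the DRAM cache; the base-address handling and names are our assumptions.

    # Sketch: sizing the counter backup and generating an option-2 counter address.

    PAGE_SIZE = 4 * 1024          # 4-Kbyte pages
    MEMORY = 16 * 2**30           # 16-Gbyte main memory
    COUNTER_BITS = 8              # a threshold of 256 fits in an 8-bit counter

    num_pages = MEMORY // PAGE_SIZE
    backup_bytes = num_pages * COUNTER_BITS // 8
    print(backup_bytes / 2**20, "Mbytes of counter backup")   # -> 4.0

    def counter_address(page_frame_number: int, counter_region_base: int) -> int:
        """Option 2: index the counter region pinned in the DRAM cache by PFN."""
        return counter_region_base + page_frame_number * (COUNTER_BITS // 8)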

We further propose a third option for the MFC, in which we distribute the counter memory space by augmenting the page table entry (PTE) of each page. The processor uses the unused bits in the PTEs (18 bits in a 64-bit machine10) to hold the denoted page's counter. Upon a translation look-aside buffer (TLB) miss, the processor's TLB miss handler or page table walk routine brings the counter, along with the other PTE attributes, into the TLB. The processor performs virtual-to-physical page translation at the beginning of the instruction fetch stage (for instruction pages) and the memory access stage (for data pages). Therefore, by the time a filter cache access fires for a load or store instruction, the TLB has already obtained the desired page's counter and can supply it to the filter cache. So, we can significantly reduce the counter fetch operation's overhead because this operation doesn't need to go off chip.

Upon a filter cache eviction, the filter cache must write the replaced counter back to the PTEs. To assist in PTE lookup, the filter cache must store both a page's virtual and physical tag bits. This approach is transparent to the operating system in that the latter doesn't allocate or update the counters in PTEs; rather, the processor TLB and the filter cache alone operate on these counters.

Adaptive filter cache schemes (CHOP-AFC and CHOP-AMFC)

Because different workloads can have distinct behaviors, and even the same workload can have different phases, memory subsystems can be affected differently. To address such effects, we propose two adaptive filter cache schemes, CHOP-AFC and CHOP-AMFC, in which the processor turns the filter cache on and off dynamically on the basis of the memory utilization status. CHOP-AMFC differs from CHOP-AFC in that the former is based on CHOP-MFC, whereas the latter is based on CHOP-FC. We add monitors to track memory traffic. When memory utilization exceeds a certain threshold, the processor turns on the filter cache so that only hot pages are fetched into the DRAM cache; otherwise, the processor turns off the filter cache and brings all pages into the DRAM cache on demand. With the capability of adjusting the DRAM cache allocation policy on the fly, we expect more performance benefits by using the entire available memory bandwidth.
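The adaptive policy amounts to a small control decision layered on top of CHOP-FC (or CHOP-MFC). The sketch below is our phrasing of that decision; the bandwidth monitor itself isn't modeled.

    # Sketch of the CHOP-AFC decision (our phrasing of the policy).

    def filter_is_on(measured_utilization: float, threshold: float) -> bool:
        """Turn the filter on only when memory bandwidth is scarce."""
        return measured_utilization > threshold

    def allocation_policy(measured_utilization: float, threshold: float) -> str:
        if filter_is_on(measured_utilization, threshold):
            # Filter on: only pages the filter cache deems hot enter the DRAM cache.
            return "allocate only hot pages (CHOP-FC/CHOP-MFC behavior)"
        # Filter off: bandwidth is plentiful, so every missed page is brought in.
        return "allocate every missed page into the DRAM cache"

    # With a 0.3 threshold, 93% utilization keeps the filter on; 10% turns it off.
    print(allocation_policy(0.93, 0.3))
    print(allocation_policy(0.10, 0.3))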
Evaluation methodology

For the simulation environment, we used the ManySim simulator to evaluate our filter cache schemes.11 The simulated chip multiprocessor (CMP) architecture consists of 8 cores, each operating at 4 GHz. Each core has an eight-way, 512-Kbyte private level-two (L2) cache, and all cores share a 16-way, 8-Mbyte L3 cache. An on-die interconnect with a bidirectional ring topology connects the L2 and L3 caches. The DRAM cache is 128 Mbytes with an on-die tag array. The filter cache has a coverage of 128 Mbytes. The MFC contains 64 entries with four-way associativity. The memory access latency is 400 cycles with a bandwidth of 12.8 Gbytes per second. For CHOP-MFC, we evaluated the second option, which pins down the counter memory space in the DRAM cache, because this option involves fewer hardware changes.

We chose four key commercial server workloads: TPC-C, SAP Standard Application (SAP), SPEC Java Server 2005 (Sjbb), and SPEC Java Application Server 2004 (Sjas). We collected long bus traces (data and instruction) for these workloads on an Intel Xeon multiprocessor platform with the LLC disabled.
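For quick reference, the simulated platform can be collected in one place; all values come from the paragraphs above, and the field names are ours.

    # Simulated platform, as described in the text (field names are ours).
    SIM_CONFIG = {
        "cores": 8,
        "core_frequency_ghz": 4,
        "l2_per_core": "512 Kbytes, 8-way",
        "shared_l3": "8 Mbytes, 16-way",
        "interconnect": "bidirectional ring",
        "dram_cache": "128 Mbytes, on-die tag array",
        "filter_cache_coverage": "128 Mbytes",
        "mfc_entries": 64,
        "mfc_associativity": 4,
        "memory_latency_cycles": 400,
        "memory_bandwidth_gbytes_per_s": 12.8,
    }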




Figure 3. Speedup ratios for a 4-Kbyte cache line (a) and an 8-Kbyte cache line (b), and memory bandwidth utilization for a 4-Kbyte cache line (c) and an 8-Kbyte cache line (d), of our CHOP (Caching Hot Pages) filter-cache-based schemes and two other schemes. (CHOP-FC: CHOP filter-cache-based DRAM-caching architecture; DRAM: regular DRAM caching; RAND: caching a random subset of LLC misses; Mix: a mixture of all the benchmarks shown [SAP, Sjas, Sjbb, and TPC-C].)

Evaluation results

We evaluated the performance efficacy of the three filter-cache-based DRAM-caching techniques individually, and then we compared their results collectively.

CHOP-FC evaluation

To demonstrate our basic filter cache's effectiveness, we compared CHOP-FC's performance to that of a regular 128-Mbyte DRAM cache and a naive scheme that caches a random portion of the working set. Figure 3a shows the speedup ratios of the basic 128-Mbyte DRAM cache (DRAM), a scheme that caches a random subset of the LLC misses (RAND) into the DRAM cache, and our CHOP-FC scheme, which captures the hot subset of pages. We normalized all results to the base case in which no DRAM cache was applied. For the RAND scheme, we generated a random number for each new LLC miss and compared it to a probability threshold to determine whether the block should be cached into the DRAM cache. We adjusted the probability threshold, and Figure 3 shows the one that leads to the best performance.

The DRAM bar shows that directly adding a 128-Mbyte DRAM cache doesn't improve performance as much as expected. Instead, this addition incurs a slowdown in many cases. With a 4-Kbyte cache line size, DRAM resulted in an average slowdown of 21.9 percent, with a worst case of 58.2 percent for Sjbb. With the 8-Kbyte cache line size, it resulted in an average slowdown of 50.1 percent, with a worst case of 77.1 percent for Sjbb. To understand why this occurred, Figures 3c and 3d present each scheme's memory bandwidth utilization. Using large cache line sizes easily saturates the available memory bandwidth, with an average of 92.8 percent and 98.9 percent for 4-Kbyte and 8-Kbyte line sizes, respectively.

Because allocating cache lines for each LLC miss in the DRAM cache saturates the memory bandwidth, one possible solution is to cache only a subset of those lines. However, the RAND bar proves that this subset must be carefully chosen. On average, RAND shows 20.2 percent and 48.1 percent slowdowns for the 4-Kbyte and 8-Kbyte line sizes, respectively.

Figures 3a and 3b also demonstrate that our CHOP-FC scheme generally outperforms DRAM and RAND. With the 4-Kbyte and 8-Kbyte line sizes, CHOP-FC achieved average speedups of 17.7 percent and 12.8 percent, respectively, using counter threshold 32. The reason is that although hot pages were only 25.24 percent of the LLC-missed portion of the working set, they contributed 80 percent of the entire LLC misses. Caching those hot pages significantly reduces memory bandwidth utilization and thus reduces the associated queueing delays. Compared to RAND, caching hot pages also provides far smaller MPI.


Figure 4. Speedup ratios for a 4-Kbyte cache line (a) and an 8-Kbyte cache line (b), and memory bandwidth utilization for a 4-Kbyte cache line (c) and an 8-Kbyte cache line (d), of the CHOP memory-based filter cache (CHOP-MFC) with various thresholds.

CHOP-MFC evaluation

We compared CHOP-MFC's performance and memory bandwidth utilization to those of a regular DRAM cache. Figure 4 compares the speedup ratios (Figures 4a and 4b) and memory bandwidth utilization (Figures 4c and 4d) of a basic DRAM cache (DRAM) with those of CHOP-MFC under various counter thresholds (MFC-32 for threshold 32, and so on).

Figure 4a shows that CHOP-MFC outperforms DRAM in most cases. For the 4-Kbyte line size, MFC-64 achieved the highest speedup (25.4 percent), followed by MFC-32 (22.5 percent), MFC-128 (18.6 percent), and MFC-256 (12.0 percent). For the 8-Kbyte line size, MFC-128 performed best, with an average speedup of 18.9 percent. The performance improvements are due to the significant reduction in memory bandwidth utilization, as Figures 4c and 4d show. These results demonstrate that a small MFC is sufficient for keeping fresh hot-page candidates: unlike CHOP-FC, replacements in the MFC incur counter writebacks to memory rather than a total loss of counters, so CHOP-MFC achieves higher hot-page identification accuracy. CHOP-MFC with 64 entries had an average miss rate of 4.29 percent, compared to 10.34 percent for CHOP-FC even with 32,000 entries for the 4-Kbyte line size. Thanks to the performance robustness (speedups in all cases) that counter threshold 128 provided, we chose to use it as the default value.

CHOP-AFC evaluation

Figure 5 exhibits the results for CHOP-AFC with memory utilization thresholds varying from 0 to 1. (We show CHOP-AFC's results only, because CHOP-AMFC exhibited similar performance.) For memory bandwidth utilization thresholds less than 0.3, the speedups achieved remained the same (except for TPC-C). For threshold values between 0.3 and 1, the speedups show an increasing trend followed by an immediate decrease (see Figure 5a). On the other hand, Figure 5b shows that for threshold values less than 0.3, the memory bandwidth utilization remained approximately the same (except for an increase in TPC-C). For threshold values greater than 0.3, memory bandwidth utilization tended to increase as the threshold increased.




Figure 5. Speedup ratios (a) and memory bandwidth utilization (b) of the CHOP adaptive filter cache (CHOP-AFC) with various memory bandwidth utilization thresholds.

To understand why this is the case, recall that the AFC scheme adjusts its DRAM cache allocation policy with respect to the available memory bandwidth. Moreover, the average memory bandwidth utilization achieved was 93 percent for DRAM and 32 percent for CHOP-FC (Figure 3c), which essentially denote the upper and lower bounds of the memory bandwidth utilization that the CHOP-AFC scheme can reach. When the threshold was less than 0.3, the measured memory utilization with the filter cache turned on was always larger than the threshold, so the filter cache was always turned on, resulting in an average of 32 percent memory bandwidth utilization. Therefore, with the threshold between 0 and 0.3, CHOP-AFC behaved the same as CHOP-FC. This explains why CHOP-AFC shows a constant speedup and memory bandwidth utilization under those thresholds.

When the threshold was greater than 0.3, owing to the abundance of available memory bandwidth, the filter cache could be turned off for some time, and therefore more pages could go into the DRAM cache. As a result, the useful hits provided by the DRAM cache tended to increase as well. This explains why memory bandwidth utilization kept increasing and performance kept improving at first for threshold values between 0.3 and 1. However, after some point, because of the high memory bandwidth utilization, queueing delay began to dominate and consequently performance decreased. The reason the TPC-C workload shows a different trend for thresholds between 0 and 0.3 is its smaller lower-bound memory bandwidth utilization of 15 percent (see Figure 5b).

Comparison of the three schemes

Compared to the traditional DRAM-caching approach, all our filter-based techniques improve performance by caching only the hot subset of pages. For a 4-Kbyte block size, CHOP-MFC achieved an average speedup of 18.6 percent with only 1 Kbyte of extra on-die storage overhead, whereas CHOP-FC achieved a 17.2 percent speedup with 132 Kbytes of extra on-die storage overhead. Comparing these two schemes, CHOP-MFC uniquely offers larger address space coverage with very small storage overhead, whereas CHOP-FC naturally offers both hotness and timeliness information at a moderate storage overhead and with minimal hardware modifications.

The adaptive filter cache schemes (CHOP-AFC and CHOP-AMFC) typically outperformed CHOP-FC and CHOP-MFC. These schemes intelligently detect available memory bandwidth to dynamically adjust the DRAM-caching coverage policy and thereby improve performance. Dynamically turning the filter cache on and off adapts to the memory bandwidth utilization on the fly. When memory bandwidth is abundant, the adaptive filter cache enlarges coverage, and more blocks are cached into the DRAM cache to produce more useful cache hits. When memory bandwidth is scarce, the adaptive filter cache reduces coverage, and only hot pages are cached to produce a reasonable amount of useful cache hits as often as possible. By setting a proper memory utilization threshold, CHOP-AFC and CHOP-AMFC can use memory bandwidth wisely.

The adaptive filter cache schemes show their performance robustness and guarantee performance improvement by quickly adapting to the bandwidth utilization situation. Possible extensions to these schemes include application to other forms of two-level memory hierarchies, including phase-change memory (PCM) and flash-based memory, and architectural support for exposing hot-page information back to the operating system for scheduling optimizations. MICRO

Acknowledgments

We thank Zhen Fang for providing valuable comments and feedback on this article.

References

1. D. Burger, J.R. Goodman, and A. Kägi, "Memory Bandwidth Limitations of Future Microprocessors," Proc. 23rd Ann. Int'l Symp. Computer Architecture (ISCA 96), ACM Press, 1996, pp. 78-79.
2. F. Liu et al., "Understanding How Off-Chip Memory Bandwidth Partitioning in Chip-Multiprocessors Affects System Performance," Proc. IEEE 16th Int'l Symp. High Performance Computer Architecture (HPCA 10), IEEE Press, 2010, doi:10.1109/HPCA.2010.5416655.
3. B.M. Rogers et al., "Scaling the Bandwidth Wall: Challenges in and Avenues for CMP Scaling," Proc. 36th Ann. Int'l Symp. Computer Architecture (ISCA 09), ACM Press, 2009, pp. 371-382.
4. B. Black et al., "Die Stacking (3D) Microarchitecture," Proc. 39th Int'l Symp. Microarchitecture, IEEE CS Press, 2006, pp. 469-479.
5. R. Iyer, "Performance Implications of Chipset Caches in Web Servers," Proc. IEEE Int'l Symp. Performance Analysis of Systems and Software (ISPASS 03), IEEE CS Press, 2003, pp. 176-185.
6. N. Madan et al., "Optimizing Communication and Capacity in a 3D Stacked Reconfigurable Cache Hierarchy," Proc. IEEE 15th Int'l Symp. High Performance Computer Architecture (HPCA 09), IEEE Press, 2009, pp. 262-274.
7. Z. Zhang, Z. Zhu, and X. Zhang, "Design and Optimization of Large Size and Low Overhead Off-Chip Caches," IEEE Trans. Computers, vol. 53, no. 7, 2004, pp. 843-855.
8. L. Zhao et al., "Exploring DRAM Cache Architectures for CMP Server Platforms," Proc. 25th Int'l Conf. Computer Design (ICCD 07), IEEE Press, 2007, pp. 55-62.
9. X. Jiang et al., "CHOP: Adaptive Filter-Based DRAM Caching for CMP Server Platforms," Proc. IEEE 16th Int'l Symp. High Performance Computer Architecture (HPCA 10), IEEE Press, 2010, doi:10.1109/HPCA.2010.5416642.
10. Intel Architecture Software Developer's Manual, vol. 1, Intel, 2008; http://www.intel.com/design/pentiumii/manuals/243190.htm.
11. L. Zhao et al., "Exploring Large-Scale CMP Architectures Using ManySim," IEEE Micro, vol. 27, no. 4, 2007, pp. 21-33.



Xiaowei Jiang is a research scientist at Intel. His research interests include system-on-chip (SoC) and heterogeneous-core architectures. He has a PhD in computer engineering from North Carolina State University.

Niti Madan is a National Science Foundation and Computing Research Association Computing Innovation Fellow at the IBM Thomas J. Watson Research Center. Her research interests include power- and reliability-aware architectures and the use of emerging technologies such as 3D die stacking. She has a PhD in computer science from the University of Utah.

Li Zhao is a research scientist at Intel. Her research interests include SoC architectures and cache memory systems. She has a PhD in computer science from the University of California, Riverside.

Mike Upton is a principal engineer at Intel. His research interests include graphics processor architectures and many-core architectures. He has a PhD in computer architecture from the University of Michigan, Ann Arbor.

Ravi Iyer is a principal engineer at Intel. His research interests include SoC and heterogeneous-core architectures, cache memory systems, and accelerator architectures. He has a PhD in computer science from Texas A&M University.

Srihari Makineni is a senior researcher at Intel. His research interests include cache and memory subsystems, interconnects, networking, and large-scale chip multiprocessor (CMP) architectures. He has an MS in electrical and computer engineering from Lamar University, Texas.

Donald Newell is the CTO of the Server Product Group at AMD. His research interests include server architectures, cache memory systems, and networking. He has a BS in computer science from the University of Oregon.

Yan Solihin is an associate professor in the Department of Electrical and Computer Engineering at North Carolina State University. His research interests include cache memory systems, performance modeling, security, and reliability. He has a PhD in computer science from the University of Illinois at Urbana-Champaign.

Rajeev Balasubramonian is an associate professor in the School of Computing at the University of Utah. His research interests include memory systems, interconnect design, large-cache design, transactional memory, and reliability. He has a PhD in computer science from the University of Rochester.

Direct questions and comments about this article to Xiaowei Jiang, JF2, Intel, 2111 NE 25th Ave., Hillsboro, OR 97124; [email protected].
