IMPROVING MEMORY HIERARCHY PERFORMANCE WITH DRAM CACHE, RUNAHEAD CACHE MISSES, AND INTELLIGENT ROW-BUFFER PREFETCHES

By

XI TAO

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2016

© 2016 Xi Tao

To my parents

ACKNOWLEDGMENTS

It has been such a long journey since I first arrived in Gainesville. I never dreamt of studying at a place so distant from my hometown, yet I have spent five and a half wonderful years here.

Obtaining a Ph.D. degree is never easy. You constantly feel stressed, at a loss, and sometimes wonder how to continue. Throughout those years, I have been truly grateful for all the help and guidance from my advisor, Dr. Jih-Kwon Peir, who has always been patient and kind. His brilliant suggestions helped me overcome many obstacles, and he spent numerous hours reviewing and revising my papers. Without his help, I cannot imagine sitting here writing this dissertation now.

I also want to thank my Ph.D. committee members: Dr. Shigang Chen, Dr. Prabhat Mishra, Dr. Beverly Sanders, and Dr. Tan Wong. Thank you for your advice and support during my study at the University of Florida. I would also like to thank my labmate Qi Zeng, who provided great suggestions and advice on our collaborative work.

Lastly, I want to give my greatest thanks to my friends here in Gainesville. You have truly made my life colorful. I also want to thank my parents, who have always been there encouraging me and believing in me. I could not have achieved all this without your support!


TABLE OF CONTENTS

page

ACKNOWLEDGMENTS ...... 4

LIST OF TABLES ...... 7

LIST OF FIGURES ...... 8

ABSTRACT ...... 10

CHAPTER

1 INTRODUCTION ...... 12

1.1 DRAM Caches ...... 17
1.2 Runahead Cache Misses ...... 18
1.3 Hashing Fundamentals and Bloom Filter ...... 19
1.4 Intelligent Row Buffer ...... 21

2 PERFORMANCE METHODOLOGY AND WORKLOAD SELECTION ...... 23

2.1 Evaluation Methodology ...... 23
2.2 Workload Selection ...... 25

3 CACHE LOOKASIDE TABLE ...... 26

3.1 Background and Related Work ...... 26
3.2 CLT Overview ...... 29
3.2.1 Stacked Off-die DRAM Cache with On-Die CLT ...... 29
3.2.2 CLT Coverage ...... 31
3.2.3 Comparison of DRAM Cache Methods ...... 32
3.3 CLT Design ...... 37
3.4 Performance Evaluation ...... 41
3.4.1 Difference between Related Proposals ...... 41
3.4.2 Performance Results ...... 43
3.4.3 Sensitivity Study and Future Projection ...... 47
3.4.4 Summary ...... 49

4 RUNAHEAD CACHE MISSES USING BLOOM FILTER ...... 50

4.1 Background and Related Work ...... 50
4.2 Memory Hierarchy and Timing Analysis ...... 51
4.3 Performance Results ...... 57
4.3.1 IPC Comparison ...... 58
4.3.2 Sensitivity Study ...... 60
4.4 Summary ...... 62


5 GUIDED MULTIPLE HASHING ...... 64

5.1 Background ...... 64
5.2 Hashing ...... 66
5.3 Proposed Algorithm ...... 67
5.3.1 The Setup Algorithm ...... 67
5.3.2 The Lookup Algorithm ...... 70
5.3.3 The Update Algorithm ...... 71
5.4 Performance Results ...... 72
5.5 Summary ...... 82

6 INTELLIGENT ROW BUFFER PREFETCHES ...... 83

6.1 Background and Motivation ...... 83
6.2 Hot Row Buffer Design and Results ...... 86
6.3 Performance Evaluation ...... 93
6.4 Conclusion ...... 95

7 SUMMARY ...... 97

LIST OF REFERENCES ...... 100

BIOGRAPHICAL SKETCH ...... 106


LIST OF TABLES

Table        page
2-1 Architecture parameters of processor and memories ...... 24

2-2 MPKI and footprint of the selected benchmarks ...... 25

3-1 Comparison of different DRAM cache designs ...... 33

3-2 Difference between three designs ...... 42

3-3 Comparison of L4 MPKR, L4 occupancy and predictor accuracy ...... 46

4-1 False-positive rates of 12 benchmarks ...... 59

4-2 Future Conventional DRAM parameters ...... 62

5-1 Notation and Definition ...... 68

5-2 Routing table updates for enhanced 4-ghash ...... 80

6-1 Hit ratio for hybrid scheme of 10 workloads using 64 entries ...... 89

6-2 Prefetch usage for 10 workloads using a simple stream prefetcher ...... 95

6-3 Sensitivity study on prefetch granularity ...... 95


LIST OF FIGURES

Figure       page
1-1 The structure of a memory hierarchy ...... 13

1-2 Memory hierarchy organization with 4-level caches ...... 14

1-3 DRAM internal organization ...... 15

3-1 Memory hierarchy with stacked DRAM cache ...... 30

3-2 Reuse distance curves normalized to the percentage of the maximum distance ...... 32

3-3 Coefficient of variation (CV) of hashing 64K cache-set using different indices ...... 35

3-4 DRAM cache MPKI using sector indexing ...... 36

3-5 CLT design schematics ...... 38

3-6 CLT operations in handling memory requests ...... 39

3-7 CLT speedup with respect to Alloy, TagTables_64, and TagTables_16 ...... 45

3-8 Memory access latency (CPU cycles)...... 45

3-9 IPC change for different CLT coverage...... 48

3-10 Execution cycle change for different sector size in CLT design ...... 49

4-1 Memory latency with / without BFL3 ...... 52

4-2 Cache indexing and hashing for BF ...... 55

4-3 False-positive rates for 6 hashing mechanisms ...... 56

4-4 False-positive rates with m:n = 2:1, 4:1, 8:1, 16:1, and k = 1, 2 ...... 56

4-5 IPC comparisons with/without BF ...... 59

4-6 Average IPC for m:n ratios and hashing functions ...... 61

4-7 Average IPC for different L4 sizes ...... 61

4-8 Average IPC over different DRAM latency ...... 62

5-1 Distribution of keys in buckets of four hashing algorithms...... 66

5-2 A simple d-ghash table with 5 keys, 8 buckets and 2 hash functions...... 68


5-3 Bucket loads for the five hashing schemes...... 73

5-4 Number of bucket accesses per lookup for d-ghash...... 74

5-5 Average number of keys per lookup based on memory usage ratio...... 75

5-6 The average number of non-empty buckets for looking up a key...... 77

5-7 Sensitivity of the number of bucket accesses per lookup...... 78

5-8 Changes in the number of bucket accesses per lookup and rehash percentage...... 79

5-9 Number of bucket accesses per lookup for experiments with five routing tables...... 80

5-10 Experiment with the update trace using enhanced 4-ghash...... 80

6-1 Hot Row pattern of 10 workloads ...... 85

6-2 Hot Row Identification and Update ...... 88

6-3 Results of proposed hybrid scheme ...... 89

6-4 Block column difference within a row for 10 workloads...... 91

6-5 IPC/Row buffer hit Ratio speedup of 10 workloads ...... 93


Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

IMPROVING MEMORY HIERARCHY PERFORMANCE WITH DRAM CACHE, RUNAHEAD CACHE MISSES, AND INTELLIGENT ROW-BUFFER PREFETCHES

By

Xi Tao

December 2016

Chair: Jih-Kwon Peir
Major: Computer Engineering

Large off-die stacked DRAM caches have been proposed to provide higher effective bandwidth and lower average latency to main memory. Designing a large off-die DRAM cache with a conventional block size (64 bytes) requires a large tag array that is impractical to fit on-die. We investigate a novel design called the Cache Lookaside Table (CLT) to reduce the average access latency and to lessen off-die tag array accesses. The proposed CLT exploits memory reference locality and provides a fast alternative tag path that captures most of the DRAM cache requests.

To hide long memory latency and to alleviate the memory bandwidth requirement, a fourth-level cache (L4) is introduced in modern high-performance computing systems. However, increasing the number of cache levels worsens the cache miss penalty, since memory requests go through the levels of the cache hierarchy sequentially. We investigate a new way of using a Bloom Filter (BF) to predict cache misses earlier at a particular cache level. These misses can run ahead to access lower levels of cache and memory to shorten the miss penalty.

Inspired by the usefulness of the Bloom filter in cache accesses, we conduct a fundamental study to find a way to balance the hashing buckets while maintaining a low false-positive rate for the Bloom filter. To broaden the applications, our study is based on the routing and packet-forwarding function at the core of the IP network-layer protocols. We propose a guided multi-hashing approach that achieves near-perfect load balance among hash buckets while limiting the number of buckets to be probed for each key (address) lookup, where each bucket holds one or a few routing entries.

A key challenge in effectively improving system performance lies in maximizing both row-buffer hits and bank-level parallelism while simultaneously providing fairness among different requests. We observed that accesses to each DRAM bank are not equally distributed among different rows for most of the workloads we study. We propose a simple scheme to capture the hot-row pattern and prefetch data from these hot rows. Results demonstrate the effectiveness of the proposed scheme.


CHAPTER 1 INTRODUCTION

Memory hierarchy plays a critical role in designing high-performance processors. It has become increasingly difficult to advance processor performance further due to the memory wall problem. Despite aggressive out-of-order, speculative execution, processors stall waiting for data from memory. Lately, manufacturers have been putting a growing number of cores on a chip to satisfy the increased demand of larger workloads such as data mining and analytics. As the number of cores grows, the pressure on the memory subsystem in terms of capacity and bandwidth increases as well.

Memory hierarchy design takes advantage of memory reference locality and of trade-offs in the capacity and access speed of memory technologies to hide memory latency and to alleviate the memory bandwidth requirement. Between the CPU and main memory, there are multiple levels of cache memory. Close to the CPU are small, fast caches with higher bandwidth that temporarily store the most frequently used data. With increasing levels, the cache capacity becomes larger, but the access speed becomes slower and the bandwidth lower. Based on reference locality, the most recently referenced data can be accessed in the highest level of cache. The large capacity of the lower levels captures recently referenced data that cannot fit into the highest level. Figure 1-1 depicts this memory hierarchy organization.

Conventional cache access goes through a tag path to determine a cache hit or a miss and a data path to access the data in case of a hit. The cache tag and data arrays maintain topological equivalency such that matching the address tag in the tag array determines the location of the block in the data array. These two paths may overlap to permit the data array access to start before the hit position is determined, shortening the cache access time.


Figure 1-1. The structure of a memory hierarchy: as the distance from the processor increases, so does the size and access time, but with decreasing bandwidth.

Modern high-performance multi-core systems generally adopt a 3-level on-die cache architecture, referred to as the L1, L2, and L3 caches, which are placed on the processor die using SRAM technology. The L1 and L2 caches are usually private, meaning each core has its own L1 and L2 caches. The L3 cache is normally shared by all the cores and serves as a connection point to the main memory, which is located off the processor chip with long access latency. Intel Haswell [1], the 4th-generation Core, adopts a 4th-level cache built on embedded DRAM technology to hide main memory latency and to deliver substantial performance improvements for media, graphics, and other high-performance computing applications. A general organization of a multicore system with 4 levels of caches is illustrated in Figure 1-2.


Figure 1-2. Memory hierarchy organization with 4-level caches.

To quantitatively measure the memory performance of the cache hierarchy, we use the average memory access time (AMAT), which describes the average time it takes for the entire hierarchy to return data. For a three-level cache architecture with L1, L2, and L3 caches, the AMAT is calculated as follows:

AMAT = HitTime_L1 + MissRate_L1 × MissPenalty_L1
     = HitTime_L1 + MissRate_L1 × (HitTime_L2 + MissRate_L2 × (HitTime_L3 + MissRate_L3 × MissPenalty_L3))

where HitTime is the access time of the cache at a particular level, MissRate is the percentage of misses at that cache level, and MissPenalty is the time to fetch the block from the next level of the memory hierarchy. Fetching data from the next level may also encounter cache hits or misses, so the miss penalty of the current level is equivalent to the AMAT starting from the next level. To achieve high performance, we want to design a cache hierarchy with a fast hit time, a small miss ratio, and a small miss penalty.
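As an illustration, the following minimal C++ sketch computes the AMAT recursively from per-level hit times and miss rates; the numeric values used in the example are hypothetical and are not the simulated configuration of Table 2-1.

```cpp
#include <cstdio>
#include <vector>

struct Level { double hit_time; double miss_rate; };  // cycles, fraction of accesses that miss

// AMAT of level i = hit_time_i + miss_rate_i * (AMAT of level i+1);
// the final "level" is main memory, modeled here as a flat latency.
double amat(const std::vector<Level>& levels, double memory_latency, std::size_t i = 0) {
    if (i == levels.size()) return memory_latency;
    return levels[i].hit_time + levels[i].miss_rate * amat(levels, memory_latency, i + 1);
}

int main() {
    // Hypothetical three-level hierarchy: L1, L2, L3 (hit time, miss rate).
    std::vector<Level> levels = {{4, 0.10}, {11, 0.40}, {24, 0.30}};
    std::printf("AMAT = %.2f cycles\n", amat(levels, /*memory_latency=*/200.0));
    return 0;
}
```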


Beyond the cache hierarchy, main memory is where the application instructions and data are stored. During program execution, the requested instructions and data are moved from disk to main memory on demand, initiating an I/O activity called a page fault. Main memory is built using dynamic random access memory (DRAM) technology. When the requested instructions and data are not located in the caches, they are accessed from main memory with substantially longer latency. Multiple levels of caches hold the working set of the recently referenced instructions and data in the hope of limiting the need to access them from main memory.

Figure 1-3. DRAM internal organization.

DRAM-based main memory is a multi-level hierarchy of structures. At the highest level, each processor die is connected to one or more DRAM channels. Each channel has a dedicated command, address and data bus. One or more memory modules can be connected to each DRAM channel. Each memory module contains a number of DRAM chips. As the data output width of each DRAM chip is low (typically 8 bits for commodity DRAM), multiple chips are grouped together to form a rank. In other words, a rank is a collection of DRAM chips that together feed the standard 64-bit data bus.

Internally, each chip consists of multiple banks. Each bank consists of many rows of DRAM cells and a row buffer that caches the last accessed row from the bank. Each DRAM cell in the row is identified by the corresponding column address. Reading or writing data to DRAM requires that the entire row first be read into the row buffer. Reads and writes then operate directly on the row buffer. After the operation, the row is closed and the data in the row buffer is written back into the DRAM array. Figure 1-3 shows this topology.

When the memory controller receives an access to a 64-byte cache line, it first decodes the address into the channel, rank, bank, row, and column number. As the data of each 64-byte cache line is split across different chips within the rank, the memory controller maintains a mapping scheme to determine which parts of the cache line are mapped to which chips. Upon receiving the command, each chip accesses the corresponding column of data from its row buffer and transfers it on the data bus. Once the data is transferred, the memory controller assembles the required cache line and sends it back to the processor.
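As a concrete illustration, the sketch below slices a physical address into channel, rank, bank, row, and column fields. The bit widths and the field ordering are assumptions for this example only; real memory controllers use a variety of interleaving and XOR-based mapping schemes.

```cpp
#include <cstdint>
#include <cstdio>

struct DramAddress { uint32_t channel, rank, bank, row, column; };

// Minimal sketch: fields are sliced from low-order to high-order bits after
// dropping the 6-bit offset of a 64-byte cache line. The widths below are
// illustrative assumptions, not the configuration in Table 2-1.
DramAddress decode(uint64_t paddr) {
    uint64_t a = paddr >> 6;                  // drop 64-byte line offset
    DramAddress d{};
    d.channel = a & 0x1;        a >>= 1;      // assume 2 channels
    d.column  = a & 0x1F;       a >>= 5;      // 32 line-sized columns per 2KB row
    d.bank    = a & 0x7;        a >>= 3;      // assume 8 banks per rank
    d.rank    = 0;                            // single rank per channel in this sketch
    d.row     = static_cast<uint32_t>(a);     // remaining bits select the row
    return d;
}

int main() {
    DramAddress d = decode(0x12345678ULL);
    std::printf("ch=%u rank=%u bank=%u row=%u col=%u\n",
                d.channel, d.rank, d.bank, d.row, d.column);
}
```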

All banks on a channel share a common set of command and data buses. Operations on multiple banks may occur in parallel (e.g., opening a row in one bank while reading data from another bank's row buffer) so long as the commands are properly scheduled and all other DRAM timing constraints are obeyed. A memory controller can improve memory system throughput by scheduling requests so that they can be processed in parallel among banks. Meanwhile, DRAM can operate under different page policies. Leaving a row buffer open after every access is called the open-page policy; closing the row buffer after every access is called the close-page policy. Accessing data already loaded in the row buffer, also called a row-buffer hit, incurs a shorter latency than when the corresponding row must first be "opened" from the DRAM array. Therefore, the open-page policy enables more efficient access to the same open row, at the expense of increased access delay to other rows in the same DRAM array. A row-buffer conflict happens when a request targets a different row than the currently opened one, which incurs substantial delay. Close-page policies, on the other hand, can serve row-buffer conflict requests faster.
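The latency difference can be captured with a simple per-bank row-buffer model, sketched below under an open-page policy; the cycle counts are placeholders rather than the DDR timing parameters used in our simulations.

```cpp
#include <cstdint>
#include <unordered_map>

// Minimal open-page row-buffer model: each bank remembers its open row, and a
// request is classified as a row-buffer hit, a conflict (precharge + activate),
// or a cold open.
class OpenPageBanks {
public:
    int access(int bank, uint32_t row) {                     // returns latency in cycles
        auto it = open_row_.find(bank);
        if (it != open_row_.end() && it->second == row)
            return kColumnAccess;                             // row-buffer hit
        int latency = (it == open_row_.end())
                          ? kActivate + kColumnAccess                 // bank was idle
                          : kPrecharge + kActivate + kColumnAccess;   // row-buffer conflict
        open_row_[bank] = row;                                // row stays open (open-page)
        return latency;
    }
private:
    static constexpr int kColumnAccess = 9;   // assumed tCAS-like delay
    static constexpr int kActivate     = 9;   // assumed tRCD-like delay
    static constexpr int kPrecharge    = 9;   // assumed tRP-like delay
    std::unordered_map<int, uint32_t> open_row_;
};

int main() {
    OpenPageBanks banks;
    int cold     = banks.access(0, 100);   // activate + column access
    int hit      = banks.access(0, 100);   // row-buffer hit
    int conflict = banks.access(0, 200);   // precharge + activate + column access
    return (hit < cold && cold < conflict) ? 0 : 1;
}
```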

Our proposed research focuses on various techniques, from caches to memory, for improving the performance of the memory hierarchy on modern multicore systems. The outline of the research topics is given in the following subsections. The performance evaluation methodology and workload selection are given in Chapter 2, followed by detailed descriptions of each research topic in Chapters 3, 4, 5, and 6. Finally, a summary of the proposed research is given in Chapter 7.

1.1 DRAM Caches

Cache capacity is limited by the number of transistors on the processor die. With newer packaging technologies such as silicon interposers (2.5D) [2] or 3D integrated circuit stacking [3], the processor and DRAM can be placed in close proximity, giving processors high-bandwidth, low-latency access to dense memory. Unfortunately, stacked DRAM capacity is currently still insufficient to be used as the system main memory [4] [5]. There have been two approaches to integrating the stacked DRAM: either as the last-level cache [6] [7] [4] [8] [9] [10], or as a part of the main memory [11] [12] [13] [14]. Using stacked DRAM as a part of memory requires extra address mapping and data block swapping between the fast and slow DRAMs [15] [16] [17] [18] [19]. This approach utilizes the entire DRAM capacity, which is essential when the capacities of the stacked and off-chip DRAM are close. However, with tens of GBs of off-chip DRAM in today's personal systems, it is more viable to use the order-of-magnitude smaller stacked DRAM as the last-level cache (L4) to provide fast memory access and to alleviate the off-chip memory bandwidth requirement.

Our first research topic is to investigate fundamental issues and to assess the performance advantage of a large stacked DRAM cache included in the memory hierarchy as the last-level cache. Large off-die stacked DRAM caches have been proposed to provide higher effective bandwidth and lower average latency to main memory. Designing a large off-die DRAM cache with a conventional block size (e.g., 64 bytes) requires a large tag array that is impractical to fit on-die. Placing the large directory in off-die memory prolongs the latency, since a tag access is necessary before the data can be accessed. This additional trip also generates extra off-die traffic.

We investigate a novel design called Cache Lookaside Table (CLT) to reduce the average access latency and to lessen off-die tag array accesses. The basic approach is to cache a small amount of recently referenced tags on-die. An off-die tag access is avoided when a requested block’s tag hits a cached tag. To save on-die space, cached tags are recorded in a large sector for sharing tags with multiple blocks. However, due to the loss of one-to-one physical mapping of the cached tags and the data array, a way pointer is added for each block to indicate its way location. The proposed CLT exploits memory reference locality and provides a fast alternative tag path to capture most of the DRAM cache requests.

Experimental results show that, in comparison with other proposed DRAM caching mechanisms, a small on-die CLT achieves average performance improvements in the range of 4-15%.

1.2 Runahead Cache Misses

To hide long memory latency and to alleviate the memory bandwidth requirement, a fourth-level cache (L4) is introduced in modern high-performance computing systems, as illustrated in Figure 1-2. However, increasing the number of cache levels worsens the cache miss penalty, since memory requests go through the levels of the cache hierarchy sequentially. We investigate a new way of using a Bloom Filter (BF) to predict cache misses earlier at a particular cache level. These misses can run ahead to access lower levels of cache and memory to shorten the miss penalty. One inherent difficulty in using a BF to predict cache misses is that cache contents are dynamically updated through insertions and deletions. We propose a new BF hashing scheme that extends the cache index of the target set to access the BF array. Since the BF index is a superset of the cache index, all blocks hashed to the same BF location are allocated in the same cache set, which simplifies updates to the BF array. When a block is evicted from the cache, the corresponding BF bit is reset only when no other block hashed to this location exists in the cache set.
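A minimal sketch of this counting-free BF maintenance rule is shown below: the BF index extends the cache set index with extra address bits, so the set can be searched on eviction to decide whether the bit may be cleared. The class name, sizes, and the shadow-tag bookkeeping are illustrative assumptions, not the exact design evaluated in Chapter 4.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Every block mapping to a BF bit lives in one known cache set (power-of-two
// sizes assumed), so that set can be searched on eviction.
class L3MissFilter {
public:
    L3MissFilter(std::size_t sets, int extra_bits)
        : sets_(sets), extra_bits_(extra_bits),
          bf_(sets << extra_bits, false), shadow_(sets) {}

    std::size_t bf_index(uint64_t block_addr) const {       // superset of the set index
        return static_cast<std::size_t>(block_addr) & ((sets_ << extra_bits_) - 1);
    }
    bool predict_miss(uint64_t block_addr) const {           // a 0 bit guarantees a miss
        return !bf_[bf_index(block_addr)];
    }
    void on_fill(uint64_t block_addr) {
        bf_[bf_index(block_addr)] = true;
        shadow_[block_addr & (sets_ - 1)].push_back(block_addr);
    }
    void on_evict(uint64_t block_addr) {
        auto& set = shadow_[block_addr & (sets_ - 1)];
        for (auto it = set.begin(); it != set.end(); ++it)
            if (*it == block_addr) { set.erase(it); break; }
        for (uint64_t t : set)                                // another block still maps here?
            if (bf_index(t) == bf_index(block_addr)) return;  // then keep the bit set
        bf_[bf_index(block_addr)] = false;                    // otherwise reset it
    }
private:
    std::size_t sets_;                           // number of cache sets (power of two)
    int extra_bits_;
    std::vector<bool> bf_;
    std::vector<std::vector<uint64_t>> shadow_;  // block addresses per cache set
};

int main() {
    L3MissFilter bf(/*sets=*/1024, /*extra_bits=*/2);         // BF has 4x the sets
    bf.on_fill(0x1234);
    bool maybe_hit = !bf.predict_miss(0x1234);
    bf.on_evict(0x1234);
    return (maybe_hit && bf.predict_miss(0x1234)) ? 0 : 1;
}
```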

Performance evaluation using a set of SPEC2006 benchmarks shows that using a BF for the third-level (L3) cache in a 4-level cache hierarchy to filter and run ahead of L3 misses improves the IPC by 3-21%, with an average improvement of 9.5%.

1.3 Hashing Fundamentals and Bloom Filter

Inspired by the usefulness of the Bloom filter in cache accesses, we conduct a fundamental study to find a way to balance the hashing buckets while maintaining a low false-positive rate for the Bloom filter. To broaden the applications, our study is based on the routing and packet-forwarding function at the core of the IP network-layer protocols. The throughput of a router is constrained by the speed at which the routing table lookup can be performed. Hash-based lookup has been a research focus in this area due to its O(1) average lookup time.

It is well known that hash collision is an inherent problem when a single random hash function is used, causing an uneven distribution of keys among the hash buckets in a nondeterministic fashion. The multiple-hashing technique, on the other hand, uses d independent hash functions to place a key into one of d possible buckets. The criterion for selecting the target bucket for placement is flexible and can be controlled to accomplish a specific objective. One well-known objective of using multiple hash functions is load balancing, i.e., to balance the keys in the buckets [20] [21] [22] [23] [24] [25].
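For illustration, the sketch below places each key into the least-loaded of d candidate buckets (the classic "power of d choices" policy); the hash family, the bucket count, and the mixing constant are placeholder assumptions, and the guided scheme proposed in Chapter 5 differs in how the target bucket is chosen.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <functional>
#include <vector>

// d-choice placement: hash each key with d independent functions and insert it
// into the least-loaded candidate bucket (the load-balancing objective).
class MultiHashTable {
public:
    MultiHashTable(std::size_t buckets, int d) : loads_(buckets, 0), d_(d) {}

    std::size_t insert(uint64_t key) {
        std::size_t best = hash(key, 0);
        for (int i = 1; i < d_; ++i) {
            std::size_t b = hash(key, i);
            if (loads_[b] < loads_[best]) best = b;   // pick the lighter bucket
        }
        ++loads_[best];
        return best;
    }
    int max_load() const { return *std::max_element(loads_.begin(), loads_.end()); }

private:
    std::size_t hash(uint64_t key, int i) const {
        // Placeholder hash family: mix the key with a per-function constant.
        return std::hash<uint64_t>{}(key ^ (0x9E3779B97F4A7C15ULL * (i + 1))) % loads_.size();
    }
    std::vector<int> loads_;   // number of keys per bucket
    int d_;
};

int main() {
    MultiHashTable table(/*buckets=*/1024, /*d=*/4);
    for (uint64_t k = 0; k < 1000; ++k) table.insert(k * 2654435761ULL);
    std::printf("max bucket load = %d\n", table.max_load());
}
```

Note that on lookup all d candidate buckets may need to be probed, which is exactly the cost the guided approach in Chapter 5 aims to reduce.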

Another known objective of multiple hashing sets an opposite criterion: reducing the fill factor of the hash buckets [22] [26]. The fill factor is measured by the ratio of nonempty buckets. Instead of placing a key in the bucket with the smaller number of keys for load balancing, this approach places the key in a bucket that already holds a non-zero number of keys. The objective of this placement is to maximize the number of empty buckets. One potential application is to apply the low fill-factor hashing method to a Bloom filter [22] [26]. With more zeros remaining in the Bloom filter, the critical false-positive rate can be reduced. To create more zeros when establishing the Bloom filter, however, multiple sets of hash functions are needed for different keys, since all the hashed k bits for each key must be set during the setup of the Bloom filter. Therefore, the multiple hashing concept is actually applied to choosing a set of hash functions out of multiple groups to maximize the number of zeros in the Bloom filter after recording k '1's for every key.

With a series of prior multi-hashing developments, including d-random, 2-left, and d-left, we discover that a new guided multi-hashing approach holds the promise of further pushing the envelope of this line of research and delivering significant performance improvement beyond what today's best technology can achieve. Our guided multi-hashing approach achieves near-perfect load balance among hash buckets while limiting the number of buckets to be probed for each key (address) lookup, where each bucket holds one or a few routing entries. Unlike the localized optimization of the prior approaches, we utilize the full information of the multi-hash mapping from keys to hash buckets for global key-to-bucket assignment. We have the dual objectives of lowering the bucket size while increasing the number of empty buckets, which helps to reduce the number of buckets brought from off-chip memory to the network processor for each lookup. We introduce mechanisms to ensure that most lookups require only one bucket to be fetched.

Our simulation results show that, with the same number of hash functions, the guided multiple-hashing schemes are more balanced than d-left and other schemes, while the average number of buckets accessed per lookup is reduced by 20–50%.

1.4 Intelligent Row Buffer

Accessing off-chip memory is a major performance bottleneck in multicore systems. As all of the cores must share the limited off-chip memory bandwidth, a large number of outstanding requests greatly increases contention on the memory data and command buses. Because a bank can only process one command at a time, a large number of requests also increases bank contention, where requests must wait for busy banks to finish servicing other requests.

A key challenge in effectively improving system performance lies in maximizing both row-buffer hits and bank-level parallelism while simultaneously providing fairness among different requests. We observed that accesses to each DRAM bank are not equally distributed among different rows for most of the workloads we study in Chapter 2. Due to spatial locality, some rows tend to be accessed more frequently than other rows over a certain period of time. We call these rows "hot rows". However, if requests to different hot rows in the same bank interleave with each other, there is only a slight chance that those requests result in row-buffer hits. DRAM banks will frequently close opened rows and issue commands to open other rows, causing large queuing delays (time spent waiting for the memory controller to start servicing a request) and DRAM device access delays (due to decreased row-buffer hit rates and bus contention).


We propose a simple scheme to capture the hot-row pattern and prefetch data from these hot rows. The prefetched data will hit in the row buffer, saving access time later. Results show that our design consistently performs better than simple LRU and LFU hot-row schemes.
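As a rough sketch, hot rows can be identified by tracking per-bank row access counts and flagging rows whose counters exceed a threshold; the table organization, threshold, and aging shown below are placeholders, and the hybrid scheme evaluated in Chapter 6 is more elaborate.

```cpp
#include <cstdint>
#include <unordered_map>

// Per-bank hot-row tracker: count recent accesses per row and flag rows whose
// counters cross a threshold as candidates for row-buffer prefetching.
class HotRowTracker {
public:
    explicit HotRowTracker(int threshold = 4) : threshold_(threshold) {}

    // Record an access; returns true if the row is now considered hot.
    bool access(int bank, uint32_t row) {
        return ++counts_[bank][row] >= threshold_;
    }
    // Periodic aging so stale rows eventually stop looking hot.
    void decay() {
        for (auto& bank_entry : counts_)
            for (auto& row_entry : bank_entry.second)
                row_entry.second >>= 1;
    }
private:
    int threshold_;
    std::unordered_map<int, std::unordered_map<uint32_t, int>> counts_;
};

int main() {
    HotRowTracker tracker(/*threshold=*/3);
    bool hot = false;
    for (int i = 0; i < 3; ++i) hot = tracker.access(/*bank=*/0, /*row=*/42);
    tracker.decay();
    return hot ? 0 : 1;   // row 42 becomes hot after three accesses
}
```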


CHAPTER 2 PERFORMANCE METHODOLOGY AND WORKLOAD SELECTION

2.1 Evaluation Methodology

To evaluate the performance advantages of the proposed works in memory hierarchy design, we adopt two cycle-accurate simulation methodologies. The first method is to build and run applications on MARSSx86 [27], an x86-based whole-system simulation environment. MARSSx86 is built on QEMU, a full-system emulation environment in which selected multi-threaded and multi-programmed workloads are compiled and run. The executed instructions and memory requests drive a cycle-accurate multi-core model, which is extended from PTLsim [28]. Memory requests are simulated through multiple levels of the cache hierarchy. In case of a last-level cache miss, the request is issued to the memory, which is modeled using DRAMsim2 [29], a cycle-accurate DDR-based DRAM model.

We develop a memory interface controller, called MICsim, to handle requests from the processors to the off-die DRAM cache and memory. We also develop a callback function between MICsim and the multicore processor model in MARSSx86. When a memory request misses the last-level on-die cache, the request is inserted into a memory request queue. MICsim processes requests from the top of the memory request queue one at a time in every cycle. We model partial hits to the stacked and the conventional DRAMs. Outstanding requests are saved in a pending queue for detecting and holding subsequent requests to the pending blocks. This first evaluation methodology is used in studying the runahead cache miss proposal, since detailed L1, L2, and L3 caches are simulated to understand the effectiveness of bypassing certain cache levels.


Table 2-1. Architecture parameters of processor and memories.
Processor, DRAM cache, and DRAM memory parameters:
Processor:         3.2 GHz, 8 cores, out-of-order
L1 caches:         I/D split, 32KB, MESI, 64B line, 4-way, 2 read ports, 1 write port, latency 4 cycles
L2 cache:          Private, 256KB, 64B line, 8-way, 2 read ports, 1 write port, latency 11 cycles, snooping-bus MESI protocol, 2 cycles per request, split transaction
L3 cache:          Shared, 8MB, 64B line, 16-way, 2 read ports, 2 write ports, latency 24 cycles
L4 DRAM cache:     128MB-256MB, 1.6 GHz, 16-byte bus, Channels/Ranks/Banks: 4/1/16, tCAS-tRCD-tRP: 9-9-9, tRAS-tRC: 36-33
Conventional DRAM: 16GB, 800 MHz, 8-byte bus, Channels/Ranks/Banks: 2/1/8, tCAS-tRCD-tRP: 9-9-9, tRAS-tRC: 36-33, 2KB row buffer, close-page

Although MARSSx86 precisely simulates multicore processors, the simulation time is unbearably long when simulating a large stacked DRAM cache and memory; billions of instructions are required to produce meaningful results. The virtual machine infrastructure in MARSSx86 also limits the physical address space, as pointed out in [30]. In the second method, we adopt the Epoch model [31] for estimating the execution time of different applications. It uses traces generated by the Pin-based Sniper simulator [32], in which the representative regions of applications are simulated and the requests sent to L3 are collected based on the Intel Gainestown configuration with private L1 and L2 caches and a shared L3 that interfaces with main memory. Per-core memory traces generated from Sniper are annotated with Epoch marks to ensure correct dependence tracking for issuing a cadence of memory requests. Each core exploits memory-level parallelism by issuing memory requests up to the Epoch mark. Each memory request is simulated through the cycle-accurate memory hierarchy model, which is the same as in the first method. The processor waits until all requests come back from the memory controller before moving to the next Epoch. In this method, we model a precise on-die shared L3 cache with correct timing and bandwidth considerations. This Epoch simulation model is used in the evaluation of the Cache Lookaside Table proposal, which provides an alternative tag path for a large stacked DRAM cache, as well as in studying intelligent hot-row prefetching for DRAM row buffers. Table 2-1 summarizes the architecture parameters used in our simulation.

2.2 Workload Selection

Table 2-2. MPKI and footprint of the selected benchmarks.
Benchmark      Footprint (MB)   L3 MPKI   L4 MPKI
mcf            9310             74.0      19.2
gcc            477              39.5      1.4
lbm            3222             30.2      15.1
soplex         1693             28.1      21.9
milc           4084             27.5      22.1
libquantum     256              24.1      13.2
omnetpp        259              19.1      0.2
sphinx         78               12.1      0.2
bt             240              11.2      0.4
bwaves         3794             9.2       5.0
leslie3d       599              8.3       4.0
gems           1663             7.7       5.2
zeusmp         3488             6.3       4.5

To conduct our evaluation, we examined all workloads from SPEC CPU2006 and selected 12 applications with high L3 MPKI and large footprints. All workloads are run in multithreaded mode, where each application is replicated 8 times, with the 8 threads running on 8 cores. Therefore, the total memory footprint is roughly 8 times as large as the footprint reported in [33]. Table 2-2 gives basic information for these workloads. In general, we use the first 5 billion instructions to warm up caches, tables, and other data structures in the different methods. We simulate the next billion instructions to collect performance statistics.


CHAPTER 3 CACHE LOOKASIDE TABLE

In this chapter, we present our CLT work on providing an alternative tag path for a large off-chip stacked DRAM cache. We begin by reviewing the current state-of-the-art designs for solving the tag problem and the motivation for using a small on-die space to capture the majority of DRAM cache tags. We then present the detailed design of how the CLT adopts the decoupled sector cache idea to exploit spatial locality as well as save tag space. Finally, we present results to support our work.

3.1 Background and Related Work

Future high-end servers with tens or hundreds of cores demand high memory bandwidth. Recent advances in 3D stacking technology provide a viable solution to the memory wall problem [13] [34] [11] [35]. They offer a promising avenue for low-latency, high-bandwidth interconnect between processor and DRAM dies through silicon vias [3]. However, due to physical space availability, the capacity of this nearby memory is limited and not suitable to serve as the system main memory [4] [5] [36]. One viable approach is to use this nearby DRAM as the last-level cache for fast access and reduced bandwidth to main memory. Intel Haswell [1], a fourth-generation Core, is an example; it has a 128MB L4 cache built on embedded DRAM technology.

Designs of large off-die DRAM caches have gained much interest recently [5] [37] [38] [39] [40] [6] [14] [7] [41]. Researchers have noted the large space requirement as well as the access time and power overheads of implementing a tag array for a large DRAM cache [4] [9]. Considering a common cache block size of 64 bytes, if each tag is 6 bytes, the tag array consumes 24MB and 96MB respectively for 256MB and 1GB DRAM caches. Such a large tag array is impractical to fit on the processor die. If the tag array is instead part of the off-die DRAM cache, it requires an extra trip to access the tags.

There are two general approaches to handling the large tag array of a DRAM cache placed off-die. The first approach is to record large block tags in order to fit the tag array on-die. Zhao et al. [10] explore DRAM caches for CMP platforms. They recognize the tag space overhead and suggest storing all tags of a set in a contiguous DRAM block for fast access. They show that using on-die partial tags and a sector directory achieves the best space-performance tradeoff. However, partial tags are expensive and encounter false positives. CHOP [6] advocates large block sizes to alleviate the tag space overhead. It uses a separate filter cache to detect and cache only hot blocks to reduce the fragmentation problem.

The Footprint cache [39] has large blocks and uses the sector cache idea [42] [43] [44] [45] to reduce the tag space and the memory bandwidth requirement. In addition, it predicts the footprint within a 2KB sector and prefetches those 64-byte blocks based on the footprint history [46]. Data prefetching is orthogonal to the proposed caching methods and is beyond the scope of this work. Nevertheless, it is noteworthy that the Footprint cache loses cache space for blocks that are not part of the footprint. The Unison cache [40] extends the Footprint idea to handle even bigger stacked DRAM caches. It moves the sector tags off-chip and uses way prediction to fetch the tag and the predicted data block in parallel.

The second approach is to keep the conventional block size and use other techniques to alleviate the impact of the extra off-die tag array access. Loh and Hill [4] [47] propose allocating all tags and data blocks of a cache set in a single contiguous DRAM location to improve row-buffer hits. To reduce the miss latency, they use a miss-map directory to identify DRAM cache misses and issue accesses directly to the next lower level of memory. To save space, they record the miss map at the granularity of a large memory segment. However, when a segment is evicted from the miss-map directory, all blocks in the segment must be evicted from the DRAM cache. Sim et al. [5] suggest speculatively issuing requests directly to main memory if the block is predicted not to be in the DRAM cache. However, this requires handling complicated mispredictions.

To reduce the extra off-die tag access, a small amount of recently referenced tags can be cached on-die. An off-die tag array access is avoided when the tag is found on-die. This simple approach faces two problems. First, a tag is cached on-die only when a request misses the on-die cached tags; it does not take advantage of spatial locality in data accesses. Second, caching tags of individual blocks does not save tag space if we want to maintain the same capacity. Even worse, without a one-to-one mapping between the cached tags and the DRAM data array, a location pointer into the data array is needed for each cached block tag. Meza et al. [48] propose to cache the block tags of an entire DRAM cache set as a unit in an on-die directory called Timber. Caching the tags of the entire set as a unit avoids way pointers, and it does not require invalidating blocks in a set when the set of tags is evicted from Timber. However, caching all tags in a set does not save any tag space. Moreover, it follows neither the spatial nor the temporal locality of applications. The ATCache [38] applies the same idea as Timber with additional prefetching for sets of tags. Other works [49] [50] have proposed caching tags to improve cache access latency in the generic setting of a multilevel SRAM cache hierarchy.

The Alloy cache trades the high hit ratio of a set-associative cache for the fast access time of a direct-mapped cache [9]. Besides the lower hit ratio, the Alloy cache relies on a cache-miss predictor [51] to speculatively issue parallel accesses to both cache and memory when a cache miss is predicted, in order to avoid sequential accesses to the off-die cache and memory. TagTables [37] applies the page-walk technique to identify large blocks (pages) located in the cache. It allocates the entire page into the same cache set in order to combine adjacent blocks into a chunk, saving space for the location pointers in the page-walk tables.

Our proposed CLT maintains the conventional block size for the off-die DRAM cache. It differs from other proposed approaches in how it avoids off-die tag array accesses. In contrast to TagTables, the Alloy cache, the Footprint cache, and the Unison cache, the role of the CLT is to provide a fast, alternative on-die tag path that covers a majority of cache requests. With off-die full block tags as the backup, the CLT can be decoupled from the off-die cache without the inherent complexity of the decoupled sector cache [45]. Unlike Timber and the ATCache, the CLT caches on-die tags for bigger sectors, allowing it to capture spatial locality and to save tag space by sharing the sector tags. In addition, unlike the miss-map approach, the CLT can identify both cache hits and cache misses without block invalidation when a sector is replaced from the CLT. Furthermore, unlike partial tags or hit/miss speculation, the proposed CLT maintains precise hit/miss information for all recorded blocks and is non-speculative, bypassing the off-die tag array for a majority of memory requests.

3.2 CLT Overview

3.2.1 Stacked Off-die DRAM Cache with On-Die CLT

Figure 3-1 depicts the block diagram of a CMP system with a nearby off-die stacked DRAM cache and an on-die Cache Lookaside Table (CLT). All requests to the DRAM cache are first filtered through the CLT, which records recently referenced memory sectors. When the sector of a requested block is found in the CLT (a CLT hit, which is different from a DRAM cache hit), the off-die tag directory access is removed from the critical cache access path. Either the stacked data array or the next-level DRAM is accessed, depending on the block hit/miss information. The proposed sector-based CLT records large sector tags to save space and to exploit spatial locality for better coverage. If a small CLT can cover a high percentage of the total requests, we can reduce the average memory access latency as well as the off-die bandwidth requirement without putting the entire cache tag array on-die.

Figure 3-1. Memory hierarchy with stacked DRAM cache.

It is well-known that conventional cache access goes through a tag path to determine a cache hit or a miss and a data path to access the data in case of a hit. The cache tag and data arrays maintain topological equivalency such that matching of the address tag in the tag array determines the location of the block in the data array. The original sector cache [42] records large sector tags to save tag space, but maintains the physical equivalency with the block-based data array where all blocks in a sector are allocated in fixed locations in the data array. It wastes cache space for unused blocks.

The decoupled sector cache [45] allocates requested blocks of a sector in any location of the target set in the data array for better cache performance. However, it faces a few inherent difficulties. First, without physical equivalency, it requires a location pointer into the data array for each block in a sector to locate the block. Second, due to the topological mismatch between the sector-based tag array and the block-based data array, the location pointers are also used to invalidate the remaining valid blocks when a sector is replaced from the sector tag array. To minimize such invalidations, the number of sector tags recorded in the sector tag array needs to be larger than the number of sectors matching the cache size. Third, the decoupled sector cache also requires a backward pointer from each block in the data array to its parent sector in the sector tag array for updating the validity information when the block is replaced from the data array. This double-pointer requirement, along with the enlarged number of sector tags, defeats the purpose of saving tag space and further complicates the sector cache design.

The CLT captures only a portion of recently referenced sectors on-die and relies on off-die full block tags to handle the rest. It avoids the two critical issues that the decoupled sector cache encounters. First, the backward pointer from the blocks in the data array to the parent sector in the CLT can be eliminated. This is because, with only a portion of the sectors recorded in the CLT, the index bits of the large cache sets can be a superset of the index bits of the CLT. Although the missed block and the replaced block in a cache set can be from different sectors, they must be located in the same CLT set. A search of the CLT set can identify the sector to which the evicted block belongs in order to update the validity information.

Second, when a sector is replaced from the CLT, the valid blocks in the sector can remain in the cache as long as all blocks of a sector are allocated to the same cache set. This allocation can be accomplished by using the low-order bits of the sector address as the cache index bits. When the sector is referenced later, a search of the block tags in the target cache set can recover all the valid blocks in the sector. This search is possible because the block tags are maintained in the cache tag array. Without the block tags, the decoupled sector cache must invalidate the remaining blocks when a sector is evicted. The detailed CLT design is given in Section 3.3.
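To make the index relationship concrete, the sketch below derives the CLT set and the cache set from the same address, under illustrative parameters (16 blocks per sector, 4K CLT sets, 64K cache sets); because there are fewer CLT sets than cache sets, the cache index bits contain the CLT index bits.

```cpp
#include <cstdint>
#include <cstdio>

// Illustrative parameters for the bit arithmetic only.
constexpr int kBlockOffsetBits = 6;    // 64-byte blocks
constexpr int kSectorBits      = 4;    // 16 blocks per sector
constexpr int kCltSetBits      = 12;   // 4K CLT sets
constexpr int kCacheSetBits    = 16;   // 64K cache sets

int main() {
    uint64_t paddr       = 0x3FA9D2C40ULL;
    uint64_t block_addr  = paddr >> kBlockOffsetBits;    // drop block offset
    uint64_t sector_addr = block_addr >> kSectorBits;    // drop block ID within sector

    // Both indices come from the low-order bits of the sector address, so every
    // block of a sector maps to one cache set, and the 16 cache index bits
    // contain the 12 CLT index bits as their low-order part.
    uint64_t cache_set = sector_addr & ((1ULL << kCacheSetBits) - 1);
    uint64_t clt_set   = sector_addr & ((1ULL << kCltSetBits) - 1);

    std::printf("cache set = %llu, CLT set = %llu, cache set mod 4096 = %llu\n",
                (unsigned long long)cache_set, (unsigned long long)clt_set,
                (unsigned long long)(cache_set & ((1ULL << kCltSetBits) - 1)));
}
```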

3.2.2 CLT Coverage

We first validate the potential CLT coverage using 12 SPEC CPU2006 workloads. In Figure 3-2, we plot the accumulated reuse distance curves of the 12 workloads with a large (2KB) block size. We want to show that a small portion of recently used large blocks (sectors) can indeed cover the majority of cache accesses. The horizontal axis (logarithmic scale) represents the percentage of the reuse distance with respect to the full stack distance that covers the entire block reference stream, and the vertical axis is the accumulated percentage of the total blocks that can be covered. We can observe that by recording 10% of the most recently referenced blocks, over 90% of the requests can be covered for all workloads except gcc and milc, whose coverages are 82% and 88% respectively. These results support the CLT approach of recording a small portion of recently referenced sectors on-die to provide a fast path for a majority of the DRAM cache requests.

Figure 3-2. Reuse distance curves normalized to the percentage of the maximum distance.
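Coverage curves of this kind can be produced with a standard LRU stack-distance computation over the sector address stream; the sketch below is a simple O(N·D) version for illustration, not the tool used to generate Figure 3-2.

```cpp
#include <cstdint>
#include <list>
#include <vector>

// Compute LRU stack (reuse) distances over a stream of 64-byte block addresses,
// grouped into 2KB sectors (32 blocks). distance = number of distinct sectors
// touched since the previous access to the same sector; first touches are -1.
std::vector<long> reuse_distances(const std::vector<uint64_t>& block_addrs) {
    std::list<uint64_t> lru;                  // most recently used sector in front
    std::vector<long> out;
    out.reserve(block_addrs.size());
    for (uint64_t addr : block_addrs) {
        uint64_t sector = addr >> 5;          // 2KB / 64B = 32 blocks per sector
        long depth = 0;
        auto it = lru.begin();
        for (; it != lru.end(); ++it, ++depth)
            if (*it == sector) break;
        if (it == lru.end()) {
            out.push_back(-1);                // cold (first) access
        } else {
            out.push_back(depth);
            lru.erase(it);
        }
        lru.push_front(sector);
    }
    return out;
}

int main() {
    std::vector<uint64_t> trace = {0, 1, 40, 0, 40, 100};
    std::vector<long> d = reuse_distances(trace);
    return d.size() == trace.size() ? 0 : 1;
}
```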

3.2.3 Comparison of DRAM Cache Methods

There are several key aspects in comparing DRAM cache designs, including the Alloy cache, the ATcache, TagTables, and the proposed CLT, as summarized in Table 3-1. For comparison purposes, we also include an impractical method that keeps all cache block tags on-die. We assume 64-way set-associativity for all caches. Note that the Footprint cache follows the original sector cache design plus prefetching of the blocks in the predicted footprint of each sector. Since prefetching techniques (e.g., a streaming prefetcher) benefit all of the proposed methods and are orthogonal to the cache design, we omit the prefetch aspect in our comparison.

The goal of all proposals is to avoid off-die tag accesses. Therefore, the storage requirement, including on-die SRAM, stacked DRAM, and regular memory, must be considered. First, different methods require different on-die SRAM. Alloy uses a small on-die table to predict DRAM cache misses for issuing parallel accesses to memory. The ATcache caches the tags of recently referenced cache sets. TagTables caches the page-walk tables in the on-die L3 on demand. The CLT records tags of recently referenced sectors along with valid, modify, and way-pointer bits for each block to cover tag accesses for a majority of requests. For a fair comparison, we keep the on-die SRAM size fixed by adjusting the L3 size for all methods. For example, we deduct the CLT size from the L3 cache size in our evaluation.

Table 3-1. Comparison of different DRAM cache designs.
Design             On-die SRAM (constant)     Stacked DRAM (constant)   Tag-data mapping     Entropy on cache set   Cache placement
Block tag on-die   block tag                  data                      1-1 map              block indices          64-way LRU
Alloy              predict table              tag + data                1-1 map              block indices          direct-map
TagTables          cached page-walk tables    data                      1-many, decoupled    sector indices         chunk placement
CLT                CLT (tag + pointer)        tag + data                1-many, decoupled    sector indices         64-way LRU

Next, for the stacked DRAM requirement, Alloy, the ATcache, and the CLT must maintain the tags along with the data blocks in the stacked DRAM, while TagTables does not keep separate block tags. Therefore, we reduce the data array size proportionally for Alloy, the ATcache, and the CLT in our evaluation. Third, it is important to note that TagTables creates page-walk tables in main memory. Since we do not evaluate I/O activity, we do not impose any penalty associated with this extra memory requirement.

Physical mapping of the on-die tags and their data blocks is another key aspect. Alloy has a simple direct-mapped topology without separate on-die tags. The ATcache does not alter the mapping between the on-die tags and their data. TagTables and the CLT share a sector tag among multiple data blocks to save tag space. The CLT records only a portion of the sector tags. As a result, a location pointer for each block in a sector is needed to locate the block in the data array. TagTables also requires the location pointers. It further limits them to four pointers per sector (page) by combining adjacent blocks into physical chunks using more restricted block placement and replacement in the cache data array.

With respect to the fetch bandwidth requirement, all methods fetch 64-byte blocks from the off-die stacked DRAM. However, Alloy needs to fetch the 64-byte data block along with its tag and miscellaneous bits.

Last but not least, the effort to avoid off-die tag accesses could affect cache performance. The first impact is on the entropy of indexing the cache sets. It is well known that using the low-order block address bits to hash to the cache sets provides good entropy in distributing blocks across the entire set of cache sets. However, due to the restricted mapping in the CLT, as well as the need to combine adjacent blocks into chunks in TagTables, these approaches use the low-order bits of the sector address to determine the cache sets. Depending on the sector size, the cache indices are taken from higher-order address bits in comparison with block indices. Using higher-order bits for indexing the cache may adversely impact the entropy of hashing blocks to the cache sets and create more conflicts.


These methods also differ in their cache placement and replacement policies. The ATcache and the CLT maintain 64-way set-associativity in each set with a pseudo-LRU replacement policy decoupled from the topology of the sector tags in the CLT. Alloy uses a direct-mapped design, which may suffer lower hit ratios. TagTables relies on special placement and replacement mechanisms for creating big chunks, since each sector can only record up to four chunks.

Figure 3-3. Coefficient of variation (CV) of hashing 64K cache sets using different indices.

To understand the impact on the entropy of hashing memory requests across cache sets, we show the coefficient of variation (CV) for five different sector-based cache indices in Figure 3-3. The CV is the ratio of the standard deviation to the mean of the number of requests hashed to each cache set. In this simulation, we assume there are 64K sets, hence 16 index bits, and the block size is 64 bytes. In the figure, sector_n indicates that the least-significant index bit position starts log2(n) bits to the left of the least-significant bit of the block address. For a 64-byte block size, the least-significant 6 bits are the block offset, so the index bits start from the 7th bit for sector_1, the 10th bit for sector_8, and so on. In other words, n consecutive blocks are allocated to the same set. We can observe that the CV increases significantly as n increases for all workloads except milc. The significance of the variation among the workloads is sorted from left to right. With large n, this uneven distribution of memory requests across cache sets increases the chance of set conflicts and degrades cache performance. Milc exhibits special indexing behavior: by allocating consecutive blocks to the same set, its CV is actually reduced. However, since milc has the smallest CV overall, the impact should be minimal.
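For reference, the CV of Figure 3-3 can be computed as in the sketch below; the sector size and set count are parameters, and the address stream would come from the workload traces.

```cpp
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

// Coefficient of variation (stddev / mean) of per-set request counts when the
// cache index is taken above the sector bits, so that sector_size consecutive
// blocks map to the same set (sector_size must be a power of two here).
double set_index_cv(const std::vector<uint64_t>& block_addrs,
                    std::size_t num_sets, unsigned sector_size) {
    std::vector<double> counts(num_sets, 0.0);
    unsigned shift = 0;
    while ((1u << shift) < sector_size) ++shift;          // log2(sector_size)
    for (uint64_t b : block_addrs)
        counts[(b >> shift) % num_sets] += 1.0;           // sector-based set index

    double mean = 0.0;
    for (double c : counts) mean += c;
    mean /= static_cast<double>(num_sets);
    double var = 0.0;
    for (double c : counts) var += (c - mean) * (c - mean);
    var /= static_cast<double>(num_sets);
    return std::sqrt(var) / mean;                         // CV = stddev / mean
}

int main() {
    std::vector<uint64_t> trace;
    for (uint64_t i = 0; i < (1u << 20); ++i) trace.push_back(i * 7);  // synthetic stream
    std::printf("sector_1 CV = %.4f, sector_16 CV = %.4f\n",
                set_index_cv(trace, 1u << 16, 1), set_index_cv(trace, 1u << 16, 16));
}
```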

Figure 3-4. DRAM cache MPKI using sector indexing.

Figure 3-4 shows the impact on the MPKI of the sector-based indexing schemes. We simulate a 256MB cache with 64-way set-associativity and 64-byte blocks. The results indeed demonstrate that with a large sector size, allocating all blocks to the same cache set degrades cache performance significantly, especially when the sector size is 64. However, with a moderate sector size (e.g., 16 blocks), the impact is rather manageable. Among the workloads, mcf, sphinx3, omnetpp, gcc, and leslie3d see the worst impact, which is consistent with the CV results in Figure 3-3. These results suggest that a moderate sector size enables a decoupled CLT without compromising much cache performance.

3.3 CLT Design

An example of a 3-way set-associative CLT with 64-byte blocks and 16 blocks per sector is depicted in Figure 3-5. According to the size and set-associativity of the CLT, a few low-order bits in the sector address are used to determine the CLT set. The remaining high-order tag bits are used to match the sector tags recorded in the set. In this example, the address part labeled sect represents the block ID within a sector and is used to look up the recorded block information. Each sector has a valid bit and 16 groups of valid (v), modify (m), and location pointer (way) bits for the 16 blocks. Given that the cache set index is a part of the address, only the cache way is needed in the location pointer. For example, in a 64-way DRAM cache, 6 bits are required to record the way.
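The per-sector CLT entry can be sketched as the following structure; the field widths follow the description above (16 blocks per sector, 64-way cache), and the packing shown is illustrative rather than an exact hardware layout.

```cpp
#include <cstdint>
#include <cstdio>

// One CLT entry: a sector tag shared by 16 blocks, plus per-block state.
// With a 64-way DRAM cache the way pointer needs 6 bits; adding the valid and
// modify bits gives one byte of state per block, i.e. 16 bytes per sector plus
// the sector tag -- consistent with the roughly 20 bytes per sector assumed in
// Section 3.4.1.
struct CltBlockInfo {
    uint8_t valid  : 1;   // block present in the DRAM cache (hit/miss bit)
    uint8_t modify : 1;   // block is dirty
    uint8_t way    : 6;   // way location within the 64-way cache set
};

struct CltEntry {
    uint32_t     sector_tag;     // high-order sector address bits
    bool         sector_valid;   // entry currently holds a valid sector
    CltBlockInfo block[16];      // one record per 64-byte block in the sector
};

static_assert(sizeof(CltBlockInfo) == 1, "one byte of state per block");

int main() {
    std::printf("CLT entry size (this packing): %zu bytes\n", sizeof(CltEntry));
}
```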

Based on whether the sector is valid in the CLT and whether the requested block is located in the DRAM cache, the cache access works as follows.

First, when a sector tag match is found and the target sector is valid (a CLT hit), the 4-bit block ID (sect) selects the corresponding hit/miss (same as a valid bit) and the way pointer for the block. In case of a cache hit, the way pointer is used to access the data block from off-die stacked DRAM data array. The critical off-die tag path is bypassed.

Second, on a CLT hit with the hit/miss bit indicating that the block is not located in the DRAM cache, the request is issued directly to the conventional DRAM (main memory). The off-die tag path is bypassed as well. When the missed block is returned from the conventional DRAM, the block data and tag are stored into the DRAM cache at the LRU position given by the on-die replacement logic. A writeback is necessary in case the evicted block is dirty. (For simplicity we omit the dirty bits in the drawing.) Meanwhile, the CLT is updated by turning off the hit/miss bit for the evicted block and recording a hit and the way location for the new block. Note that the evicted block may not be in the same sector as the missing block. However, they must be located in the same CLT set since the CLT index bits are a subset of the DRAM cache index bits. By matching the cache tag and the remaining cache index bits with the proper sector tag bits, the LRU block's entry in the CLT can be identified.

Figure 3-5. CLT design schematics.

Third, when a request misses the CLT, an off-die tag access is necessary to bring in all the tags in the target cache set for determining the hit/miss status and the way location of the requested block. Depending on whether the requested block is located in the cache, the remaining cache and memory accesses are the same as when the target sector is valid in the CLT. In order to update the CLT with the new sector, the cache tag comparison logic is extended to allow matching of the tags in the target cache set against all other block tags in the sector. For those blocks present in the set, the hit/miss status bits are set and their way pointers are recorded. For the other blocks that are missing, the corresponding hit/miss indicator is recorded as a miss. The new sector tag and its associated hit/miss and location information replace the LRU sector in the CLT. Note that there is no cache invalidation for the evicted sector.
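The three cases above can be summarized in a functional sketch. The code below models the tag arrays as simple maps and ignores timing, replacement details, and writebacks; the names and sizes are illustrative placeholders, not the evaluated design.

```cpp
#include <cstdint>
#include <unordered_map>

// Functional sketch of CLT request handling (cases 1-3 above).
constexpr int kBlocksPerSector = 16;
constexpr int kWays = 64;

struct BlockInfo { bool valid = false; uint8_t way = 0; };
struct Sector    { BlockInfo blk[kBlocksPerSector]; };

struct System {
    std::unordered_map<uint64_t, Sector> clt;           // sector addr -> CLT entry
    std::unordered_map<uint64_t, uint8_t> cache_tags;   // block addr  -> way (off-die tags)

    // Returns true if served from the DRAM cache, false if from main memory.
    bool access(uint64_t block_addr) {
        uint64_t sector = block_addr / kBlocksPerSector;
        unsigned id     = static_cast<unsigned>(block_addr % kBlocksPerSector);

        auto it = clt.find(sector);
        if (it != clt.end()) {                            // CLT hit
            BlockInfo& b = it->second.blk[id];
            if (b.valid) return true;                     // case 1: off-die tags bypassed
            b.valid = true;                               // case 2: miss -> fill from memory
            b.way = static_cast<uint8_t>(block_addr % kWays);   // stand-in for replacement logic
            cache_tags[block_addr] = b.way;
            return false;
        }
        // Case 3: CLT miss -- consult the off-die tags for every block of the
        // sector and install the rebuilt entry (no invalidation on CLT eviction).
        Sector s;
        for (unsigned i = 0; i < kBlocksPerSector; ++i) {
            auto t = cache_tags.find(sector * kBlocksPerSector + i);
            if (t != cache_tags.end()) { s.blk[i].valid = true; s.blk[i].way = t->second; }
        }
        bool hit = s.blk[id].valid;
        if (!hit) {                                       // also fill the requested block
            s.blk[id].valid = true;
            s.blk[id].way = static_cast<uint8_t>(block_addr % kWays);
            cache_tags[block_addr] = s.blk[id].way;
        }
        clt[sector] = s;
        return hit;
    }
};

int main() {
    System sys;
    bool first  = sys.access(1000);   // CLT miss, cache miss -> served from memory
    bool second = sys.access(1000);   // CLT hit, cache hit -> off-die tags bypassed
    return (!first && second) ? 0 : 1;
}
```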

Figure 3-6. CLT operations in handling memory requests. A, B, and C are three sectors, each with four blocks; a few blocks are located in the cache initially, and the circled blocks are moved in due to cache misses. All three sectors are recorded in the same 2-way set-associative CLT set. 'Condition' indicates hit/miss to the CLT and to the cache for each request.

In Figure 3-6, we illustrate the CLT operations in handling a sequence of DRAM cache requests, A0, A2, A3, B1, B2, B3, C1, and C2, where A, B, and C represent three different sectors and each sector has four 64-byte blocks as indicated by the subscript. The least-significant 6 address bits are the block offset, and the next 2 bits define the block ID within a sector. The target sets of both the CLT and the cache are determined by the low-order bits of the sector address, as illustrated in the figure, so that all blocks of a sector are allocated to the same cache set. Since the number of cache blocks is several times larger than the number of sectors in the CLT, the cache index bits are a superset of the sector index bits. In this example, we assume sectors A, B, and C are hashed to the same CLT set, but allocated to different cache sets. Several blocks in A, B, and C are located in the 8-way DRAM cache initially. Note that the blocks marked by a circle are moved into the cache after a miss occurs. For simplicity, we assume the CLT is 2-way set-associative. We also assume sector B is already recorded in the CLT with its sector tag, two valid blocks B1 and B3 with location pointers '001' and '101', and two invalid blocks B0 and B2.

When A0 is issued, it misses the CLT. All tags in the target DRAM cache set where A0 is located are fetched and compared with all the block tags in A. A match is found for the requested A0 and also for A2, while A1 and A3 are missing. The request to the data array to fetch A0 is then issued. The CLT is updated by recording the sector tag for A, setting the valid and way bits for A0 and A2, and marking A1 and A3 invalid. When A2 is processed, it hits the CLT with a valid block indicator, and the location pointer is '100'. Therefore, the data block can be fetched from the correct data array location directly. Next, A3 is also a CLT hit, but the block is invalid. A request is issued to the conventional DRAM to bring in the missing block. According to the on-die pseudo-LRU cache replacement logic, A3 is placed in way 6 to replace the LRU block. The DRAM cache tag array and data array are updated accordingly. Meanwhile, the valid bit and location pointer are updated for A3, and the valid bit of the replaced block is turned off if it was valid.

When B1 comes, it hits the CLT as a valid block, hence B1 can be fetched from the data array directly with the location pointer ‘001’. The MRU/LRU position in the CLT is updated.

Next, B2 is a CLT hit but a cache miss, which is handled the same way as A3. B2 is moved to the DRAM cache set in way 3 afterwards. B3 hits both the CLT and the cache and is treated the same as B1. Next, C1 misses the CLT but hits the cache. It is handled the same as A0, where an off-die fetch to bring in all tags in the target DRAM set is necessary. In addition, to record the new sector C in the CLT, sector A must be evicted to make room for sector C. Blocks A0, A2, and A3 remain valid in the cache. The update of C in the CLT is the same as the update of A when A entered the CLT. Finally, C2 hits both the CLT and the cache and can be handled accordingly.

3.4 Performance Evaluation

3.4.1 Difference between Related Proposals

Table 3-2 summarizes the on-die SRAM space and latency for Alloy, ATcache, TagTables, and CLT, as well as the L3 cache size and the DRAM cache data array size. The MAP-I cache-miss predictor is used in Alloy with a one-cycle access latency and 768 bytes of SRAM space. TagTables takes up L3 space for caching the page-walk tables. By partitioning the TagTables metadata based on address, the metadata is allocated on the same L3 bank that triggers the tag access, so the interconnect latency can be avoided. We use Cacti 6.5 [52] to estimate a latency of 8 cycles for accessing the tag table, which is the same as that used in the TagTables paper [37].

For CLT, the sector tag plus 16 groups of valid, modify, and way pointers account for 20 bytes per sector. With 4K CLT sets of 20 ways each, the total number of sectors is 80K. Therefore, the CLT space is 80K × 20 bytes = 1.6 MB. In addition, we use a 6-level binary tree (63 bits) to implement a pseudo-LRU policy for the 64-way cache. The space requirement is 64K sets × 63 bits ≈ 504 KB. Therefore, the total on-die SRAM for CLT is close to 2MB. We use the same policy to allocate CLT partitions on the same L3 cache bank that triggers the CLT access to avoid interconnect latency. With a smaller 20 bytes of data, the estimated CLT latency is 6 cycles. ATcache requires the on-die Timber, pseudo-LRU logic, and a tag prefetcher.

Since each cache set consists of 64 4-byte tags, Timber has 12 ways and 512 sets for a total of 1.54MB of tag space. In addition, each entry needs one bit for the prefetching logic, which costs 12 × 512 / 8 = 768 B.

CLT only records recently referenced sectors and must fetch all cache tags from the stacked DRAM when a CLT miss occurs. For a 256MB cache with 64-byte blocks and 64-way set-associativity, each set has 64 blocks. Each block has a 30-bit address tag, a valid bit, and a modify bit, for a total of 4 bytes. Therefore, it requires fetching 4 tag blocks (256 bytes in total) on a CLT miss. Note that a 30-bit address tag can accommodate a 52-bit physical address. To overlap tag accesses, the four tag blocks are allocated in different banks of the stacked DRAM. ATcache also requires fetching 4 tag blocks on a miss. It also includes a tag prefetcher as described in [38].
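The space arithmetic above is easy to recompute; the short Python sketch below (variable names ours) reproduces the CLT SRAM budget and the per-miss tag-fetch size from the stated parameters.

```python
# Sketch: recompute the CLT on-die SRAM budget and per-miss tag fetch size.

# CLT entries: 4K sets x 20 ways, 20 bytes per sector entry
clt_entry_bytes = 4 * 1024 * 20 * 20          # 1,638,400 B = 1600 KB
# Pseudo-LRU for the 64-way DRAM cache: 63 bits per set, 64K sets
plru_bytes = 64 * 1024 * 63 // 8              # 516,096 B  ~ 504 KB
total_sram = clt_entry_bytes + plru_bytes     # close to 2 MB

# On a CLT miss: 64 tags per set, 4 bytes each (30-bit tag + valid + modify bits)
tag_fetch_bytes = 64 * 4                      # 256 B = 4 tag blocks of 64 B

print(f"CLT entries: {clt_entry_bytes // 1024} KB, pseudo-LRU: {plru_bytes // 1024} KB, "
      f"total: {total_sram / 2**20:.2f} MB")
print(f"Tag fetch per CLT miss: {tag_fetch_bytes} B ({tag_fetch_bytes // 64} tag blocks)")
```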

Table 3-2. Differences between the four designs (on-die SRAM size, SRAM latency, L3 cache, and stacked DRAM cache data array).
Alloy: 768 bytes SRAM; 1 cycle; 8MB, 16-way L3; 240MB, direct-mapped DRAM cache.
ATcache: 2MB SRAM (1.54MB tags + 768 bytes prefetch + 504KB pseudo-LRU); 6 cycles; 6MB, 12-way L3; 240MB, 64-way, pseudo-LRU DRAM cache.
TagTables: 2MB metadata in L3; 8 cycles (metadata in L3); 8MB, 16-way L3; 256MB, 64-way, chunk placement DRAM cache.
CLT: 2MB SRAM (1600KB tags + 504KB pseudo-LRU); 6 cycles; 6MB, 12-way L3; 240MB, 64-way, pseudo-LRU DRAM cache.

For Alloy, we follow the operations described in [9]. The MAP-I cache-miss predictor is implemented to predict cache misses for parallel accesses to the DRAM cache and memory. Each core has a 256-entry table of 3-bit hit/miss counters. The address of the instruction causing the L3 miss is hashed using folded-xor [53] to index the table of recorded counters. In Alloy, the extra tag fetched along with the data block from the stacked DRAM is charged one burst cycle.

For TagTables, the page-walk tables are dynamically created in main memory during the simulation. We do not charge any penalty for creating the page-walk tables. The tables are cached in L3 on demand. Extra latency occurs when the needed entry of a table is not located in L3; a fetch to main memory is issued to bring back the block with the needed information. We follow the same procedure as in [37] for managing the shared L3 cache that caches the page-walk tables. The page entries recorded at the leaf level are saved at the intermediate level whenever possible to shorten the page walk. We implement the same algorithm for allocating and combining blocks into chunks with the special cache placement and replacement mechanism.

TagTables allocates the 64 blocks of a page into the same set, which hurts the entropy of hashing blocks across the cache sets. In addition, the limit of four chunks per page may create holes (empty frames) in a cache set and underutilize the DRAM cache. Therefore, we also evaluate a TagTables scheme with 16 blocks per page. We reduce the page offset from 6 bits to 4 bits and shift the remaining higher-order bits to the right. As a result, it may encounter a level-4 table with a 2-bit index in the 48-bit physical address format. We keep 4 chunks per entry at the leaf level. The rest of the design and operations stay the same.

3.4.2 Performance Results

In this section, we first compare the speedup of five DRAM cache designs: Alloy, ATcache, TagTables_64, TagTables_16, and CLT. We show the average memory access times for the tag and the data, which contribute to the overall execution time. Multiple factors that impact the memory access time, such as the number of DRAM cache misses, the on-die tag hit/miss ratios for ATcache, TagTables, and CLT, and Alloy's miss-predictor accuracy, are also discussed.

In Figure 3-7, we plot the speedups of CLT with respect to Alloy, ATcache, TagTables_64, and TagTables_16. CLT demonstrates a significant performance advantage over the other four methods. On average, CLT improves 4.3%, 12.8%, 12.9%, and 14.9%, respectively, over Alloy, ATcache, TagTables_64, and TagTables_16. The improvement of CLT over Alloy is rather moderate. CLT performs worse than Alloy for omnetpp due to its sector-indexing. In comparison with ATcache, CLT gains 12-24% speedup for all workloads except mcf and omnetpp. This is because CLT can capture most DRAM cache accesses, and the sector-indexing does not hurt DRAM cache performance much, as shown in Figure 3-4. Both TagTables schemes perform especially poorly for mcf, lbm, and milc. For TagTables_64, the CLT improvements are 44.8%, 70.6%, and 36.4% for these three workloads, while TagTables_64 shows a slight edge over CLT for omnetpp, leslie3d, and zeusmp.

The diverse performance impacts on individual workloads are caused by multiple factors. An overall speedup analysis is further complicated by the exploitation of MLP (memory-level parallelism) in the Epoch model. During the timing simulation, a cadence of memory requests is issued in each Epoch. The latency is dominated by the DRAM cache misses in each Epoch; therefore, the DRAM cache hit latency plays a small role. On the other hand, the hit latency becomes the decisive factor in case there is no cache miss in an Epoch. This mix of performance factors exists even with a precise processor model. In the following, we analyze the important parameters without detailing the MLP factor.

The most decisive performance factor is the average memory access time of the L3 misses. In Figure 3-8, we plot the average access latencies separated by the tag and the data segments where the total access time is dominated by the data latency. In general, these average latencies are consistent with the speedups shown in Figure 3-7. CLT has the shortest average latency followed by Alloy, ATcache and both TagTables. As expected, Alloy has the shortest tag latency since it only pays one-cycle predictor delay. However, in case of a false-positive miss prediction, the tag latency includes a sequential DRAM cache access for fetching the tag.

ATcache has the longest tag latency for two reasons: (1) recording the block tags does not save space, which lowers the Timber hit ratio; and (2) sequential prefetching of set tags generates high traffic since each set of tags occupies 4 blocks. TagTables_16 has longer tag latency than TagTables_64 when accessing the tags through the page-walk tables. When the page size is reduced to 16 blocks, more active pages are requested, causing more L3 misses.

Figure 3-7. CLT speedup with respect to Alloy, ATcache, TagTables_64, and TagTables_16.

Figure 3-8. Memory access latency (CPU cycles), separated into data and tag latency, for CLT, Alloy, ATcache, TagTables_64, and TagTables_16 across the twelve benchmarks.

There are multiple factors contributing to the data latency. In Table 3-3, we analyze three performance parameters. The first and most important parameter is the DRAM cache performance. Based on the trace-driven Epoch model, we measure DRAM cache performance using misses per thousand requests (MPKR), where each request is an L2 miss. Similar to MPKI, MPKR is closely associated with the execution speedup estimation; a higher MPKR causes a longer average access time. In general, ATcache has the lowest MPKR due to its 64-way set-associativity. CLT is close to ATcache for all workloads except mcf and omnetpp. Its moderate sector size (i.e., 16 blocks per sector) does not degrade cache performance much. Alloy suffers a higher MPKR, hence longer data latency, due to its direct-mapped design. TagTables shows a much higher MPKR in comparison with CLT for mcf, gcc, lbm, and milc, hence lower speedups as shown in Figure 3-7. Although omnetpp and sphinx also have a large MPKR gap between CLT and TagTables, their much smaller MPKRs lessen the impact.

The high MPKR of TagTables is due to two reasons: the negative impact of the sector-indexing scheme, and the restricted chunk-based placement and replacement. As observed in the second parameter, the DRAM cache occupancy, the restriction of 4 chunks per page creates empty space in the cache sets. For example, the average occupancy for gcc is only 75% for TagTables_64. In other words, 25% of the cache space is wasted, causing more misses. By reducing the page size from 64 to 16 blocks, we can alleviate both the negative sector-indexing effect and the empty space in the cache data array. However, TagTables_16 encounters more L3 misses for accessing the page-walk tables. Note that Alloy, ATcache, and CLT have 100% cache occupancy.

Table 3-3. Comparison of L4 MPKR, L4 occupancy, and predictor accuracy.
Occupancy / DRAM cache MPKR / Predictor accuracy; per-scheme sub-columns: ATcache TT-64 TT-16 TT-64 TT-16 ATcache TT-64 TT-16 Alloy CLT Alloy CLT
mcf 198 71 228 113 109 91 95 83 64 88 73 8
gcc 484 410 520 441 413 75 84 81 63 95 84 8
lbm 307 275 504 382 278 66 81 99 69 98 93 9
soplex 693 655 668 668 664 98 99 98 61 97 91 9
milc 747 658 722 714 665 99 87 89 55 90 79 7
libquantum 515 456 470 448 465 100 90 99 62 98 94 9
omnetpp 161 71 181 123 100 73 82 79 67 83 69 7
sphinx3 85 7 95 7 7 66 99 96 70 97 92 9
bwaves 429 411 422 411 412 100 100 99 58 98 93 9
leslie3d 438 394 459 402 408 99 99 95 59 98 93 9
gems 447 427 484 442 429 90 92 99 65 98 93 9
zeusmp 564 545 565 563 559 99 100 99 66 97 88 9


The third performance factor is the predictor accuracy, which covers the accuracy of the Alloy miss predictor, the CLT hit ratio, the TagTables tag hit ratio in L3, and the cached-tag hit ratio for ATcache. In general, all schemes show high hit ratios except for ATcache. TagTables_16 has a lower tag hit ratio, which causes a higher average tag access latency.

In summary, mcf, lbm, and milc have the highest MPKRs for TagTables. Together with the wasted cache space and the L3 misses for tags, CLT outperforms TagTables by a large margin. For milc, the large MPKR gap between CLT and Alloy helps CLT outperform Alloy. Alloy outperforms CLT by 11.8% for omnetpp, which has a small MPKR, so the DRAM cache misses play an insignificant role; the difference in memory access latency is due to the high CLT miss rate and longer tag latency. ATcache suffers the most due to its low Timber hit ratios.

3.4.3 Sensitivity Study and Future Projection

In this section, we show the results of two sensitivity studies: CLT coverage and sector size. For CLT coverage, we change the total SRAM space from 2MB to 1MB (low coverage) and 3MB (high coverage) and adjust the L3 size to 7MB and 5MB accordingly. With 1MB of SRAM, the CLT size is reduced to (2K × 13) × 20 bytes = 520 KB with 2K sets, 13 ways, and 20 bytes per entry. The pseudo-LRU still costs 504 KB. On the other hand, with 3MB of SRAM, the CLT increases to (4K × 30) × 20 bytes = 2.4 MB.

In Figure 3-9, we show the change in the total execution cycles for the new coverages with respect to the original CLT, which utilizes 2MB of SRAM space. On average, the low coverage has about a 9% increase in total cycles, while the high coverage shows about a 1% increase. The low coverage option provokes more CLT misses and degrades the performance; the bigger L3 helps little in this case. Mcf, omnetpp, and sphinx3 have 20-26% increases in execution cycles with low CLT coverage, due mainly to the significant increase in CLT misses.


On the other hand, the high coverage relinquishes more L3 space for building a bigger CLT. Since the CLT hit ratio is very high for most workloads, further improvement is limited. The increase in L3 misses hurts the overall performance for most workloads. Omnetpp shows about a 3% cycle reduction due to the improvement of its low CLT hit ratio (79%).

We also study the cycle change for smaller (8 blocks) and bigger (32 blocks) sector sizes. As shown in Figure 3-10, the small and big sector sizes increase the total cycles by about 7% and 3%, respectively. In this study, we keep 2MB of SRAM space for the CLT. We need to adjust the number of sectors recorded in the CLT to utilize the available space, since the number of way pointers changes from 16 to 8 and 32 for the respective sector sizes. For 8-block sectors, although the smaller sector alleviates the sector-indexing effect, the CLT coverage is also reduced since each sector can only record 8 blocks. It also lowers the advantage of exploiting spatial locality, since each CLT miss can only record 8 adjacent blocks in the CLT. The impact of the bigger sector is the opposite. Although the CLT coverage is better with higher spatial locality exploitation, allocating 32 blocks into the same set hurts the cache performance. Among the workloads, mcf and gcc degrade the most with the small sector size, while mcf and omnetpp hurt the most with the large sector size.

Figure 3-9. IPC change for different CLT coverage (percentage of cycle change for lower and higher coverage).


Figure 3-10. Execution cycle change for different sector sizes (8 and 32 blocks) in the CLT design.

3.4.4 Summary

We present a new caching technique to cache a portion of the large tag array for an off-die stacked DRAM cache. Due to its large size, the tag array is impractical to fit on-die; caching a portion of the tags can therefore reduce the need to go off-die twice for each DRAM cache access. In order to reduce the space requirement for the cached tags and to obtain high coverage of DRAM cache accesses, we proposed and evaluated a sector-based Cache Lookaside Table (CLT) to record cache tags on-die. The CLT reduces the space requirement by sharing a sector tag among a number of consecutive cache blocks and uses a location (way) pointer to locate each block in the off-die cache data array. The large sector can also exploit spatial locality for better coverage. In comparison with the Alloy cache, ATcache, and the TagTables approaches, the average improvements are in the range of 4-15%.


CHAPTER 4 RUNAHEAD CACHE MISSES USING BLOOM FILTER

In this chapter, we present our work on using a Bloom Filter to filter out L3 cache misses and issue the requests off-die early. We first introduce related work on Bloom Filter applications as well as cache miss identification. We then present the timing analysis of using the Bloom Filter, followed by the proposed indexing scheme that solves the problem of dynamic updates of the L3 cache contents when using a Bloom Filter. Finally, we present results that demonstrate our idea.

4.1 Background and Related work

Membership queries using a Bloom Filter have been explored in many architecture, database, and network applications [54] [55] [56] [57] [58] [59] [14] [60]. In [60], a cache-miss BF based on partial or partitioned block addresses is proposed to filter cache misses early in the processor pipeline. The early cache-miss filtering helps schedule load-dependent instructions to avoid execution pipeline bubbles. To reduce cache coherence traffic, RegionScout [59] was used to dynamically detect most non-shared regions. A node with a RegionScout filter can determine in advance that a request will miss in all remote nodes, hence the coherence request can be avoided. A vector Bloom Filter [14] was introduced to satisfy quick searches of large MSHRs in the critical execution path without the need for an expensive CAM implementation. A counting Bloom Filter called a summary cache to handle dynamic membership updates is presented in [56]. In this approach, each proxy keeps a summary of its Internet cache directory and uses the summary to check for a potential hit to avoid sending a useless query to other proxies. In [57], a counting Bloom Filter is used as a conflict detector in virtualized Transactional Memory to detect conflicts among all the active transactions. Multiprocessor deterministic replay was introduced in [58], in which a replayer creates an equivalent execution despite the inherent sources of nondeterminism that exist in modern multicore computer systems. They use a write and a read Bloom filter to track the current episode's write and read sets. A good early survey of network applications using Bloom Filters and the mathematical basis behind them is reported in [54]. A Bloom-filter-based semijoin algorithm for distributed database systems is presented in [55]. This algorithm reduces the communication cost of processing a distributed natural join as much as possible with a filter approach.

A closely related work by Loh and Hill [4] suggested that the block residency for the DRAM cache can be recorded in an on-die structure called MissMap. As described in Section 3.1, off-die trips to access the DRAM cache can be avoided if the block recorded in the MissMap indicates a miss. To save space, the MissMap records the block residency information for a large consecutive segment. However, when the segment is evicted from the MissMap directory, all blocks in the segment must be invalidated from the DRAM cache in order to maintain precise residency information. To avoid such invalidations, Sim et al. [19] suggested speculatively issuing requests directly to main memory if the block is predicted not to be in the DRAM cache. However, significant complexity must be dealt with for handling mispredictions. Xun et al. [61] observed the need for counters to filter cache misses. To avoid the counters, they proposed delaying the update of the BF array for evicted blocks. Instead, they trigger a periodic recalibration of the BF array to reconcile its contents with the correct cache content. This delayed-recalibration method increases the chance of false positives and incurs time and power overheads for the recalibration.

4.2 Memory Hierarchy and Timing analysis

Considering a BFk for cache level k, we can adjust the Average Memory Access Time (AMAT) as follows:

AMAT = (1 − BFrate_Lk) × ( HitTime_L1 + Σ_{i=1}^{k−1} MissRate_Li × HitTime_L(i+1) )
     + BFrate_Lk × ( BFtime_Lk + HitTime_L(k+1) + Σ_{j=k+1}^{n} MissRate_Lj × HitTime_L(j+1) )

where BFrate_Lk is the ratio of cache misses filtered by the BFk at level k. When a cache miss is identified, the extra delays of the hit times through levels 1 to k are avoided; only the delay (BFtime) of accessing the BFk is added before accessing cache level k+1 and, on further misses, up to the DRAM memory. This formula also shows that using BFs at multiple levels overlaps the benefits of bypassing higher levels of caches. For example, if both BFi and BFi+1 are implemented and used when the memory address becomes available, BFi+1 can only save the hit time at cache level i+1. In the base memory hierarchy design shown in Figure 1-2, we focus on a new BFL3, since the sector-based L4 tags are on the processor die with a small latency. Furthermore, the large L4 size would require a large BFL4 to filter L4 misses while achieving a small false-positive rate.
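As a quick sanity check of this expression, the minimal Python sketch below evaluates the AMAT with and without an L3 BF for a 4-level hierarchy. The latencies, miss rates, and BF rate used here are illustrative placeholders of our own, not the measured values from the evaluation.

```python
# Minimal sketch of the AMAT expression above (illustrative numbers only).

def amat_no_bf(hit, miss):
    # HitTime_L1 + sum_i MissRate_Li * HitTime_L(i+1), down to DRAM.
    return hit[0] + sum(miss[i] * hit[i + 1] for i in range(len(miss)))

def amat_with_bf(hit, miss, k, bf_rate, bf_time):
    # Split between requests not filtered by the BF at level k and filtered ones.
    not_filtered = hit[0] + sum(miss[i] * hit[i + 1] for i in range(k - 1))
    filtered = bf_time + hit[k] + sum(miss[j] * hit[j + 1] for j in range(k, len(miss)))
    return (1 - bf_rate) * not_filtered + bf_rate * filtered

# Hypothetical hierarchy: L1, L2, L3, L4 (stacked DRAM), then main DRAM.
hit = [4, 12, 36, 60, 150]            # hit times in cycles (placeholders)
miss = [0.10, 0.04, 0.02, 0.01]       # per-level miss contributions (placeholders)
bf_rate = 0.95 * 0.02                 # ~95% of the 2% L3 misses filtered (placeholder)

print(f"AMAT without BF: {amat_no_bf(hit, miss):.2f} cycles")
print(f"AMAT with BFL3:  {amat_with_bf(hit, miss, k=3, bf_rate=bf_rate, bf_time=5):.2f} cycles")
```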

Timing Analysis:

Figure 4-1. Memory latency with and without BFL3. (Without the BF, a request traverses L1, L2, and L3 before the L4 tag lookup and the L4 data or regular DRAM access; with the BF, a filtered L3 miss proceeds to the L4 tag lookup as soon as the address is generated.)

Figure 4-1 illustrates the memory latency of runahead L3 misses. After the memory address is available, the BFL3 is checked. Once a miss is filtered, the on-die L4 tag array is looked up, followed by either an eDRAM or a regular DRAM access, depending on an L4 hit or miss, to fetch the requested data block. Regardless of the filtering result, a memory request always goes through the regular L1, L2, and L3 path. This is necessary for handling cache hits at these levels as well as for identifying any false-positive L3 misses from the BFL3. Even when the request is identified as an L3 miss, the request is still issued to the normal L1, L2, and L3 path; this avoids major changes to the microarchitecture of the cache hierarchy. If the filtered request goes through the normal cache levels and arrives at the memory controller, the early runahead miss will block this request at the controller while waiting for the block to come back from L4 or memory. On the other hand, if the request has not yet arrived at the controller when the block for the runahead miss comes back, the block is inserted into L3 and treated as a prefetch to the L3 cache. Eventually, the request through the normal path will be an L3 hit, shortening the latency.

Formally, a Bloom filter (BF) for representing a set of n elements (cache blocks) from a large universe (memory blocks) consists of an array of m bits, initially all set to 0. The filter uses k independent hash functions h1, ..., hk with range {1, ..., m}, where these hash functions map each element x (block address) in memory to a random number uniformly over the range. When a block enters the cache, the bits hi(x) are set to 1 for 1 ≤ i ≤ k. To check if a block y is in the cache, we check whether all hi(y) are set to 1. If not, then clearly y is not in the cache, hence a cache miss. In practice, however, a BF for cache misses faces two major difficulties. First, it is hard to implement multiple independent, randomized hash functions in hardware. Second, cache contents change dynamically with insertions and deletions; the BF array must be updated accordingly to reflect the content changes in order to maintain the correct BF function.

In Figure 4-2, we illustrate a solution for simplified BF hashing functions and for handling dynamic cache updates. Let us first describe the conventional cache indexing scheme.


In a cache access, the target set is determined by decoding the index, which is located in the low-order bits (a0) of the block address as shown in Figure 4-2. Instead of constructing uniformly distributed hashing functions, we can simply expand the cache indexing scheme to include a few more adjacent tag bits (a1) for indexing the BF array. As in a conventional cache access, a simple address decoder for the BF index determines the hashed BF location.

Based on the study in [54], the false-positive probability is minimized when k = ln2 × (m/n), giving a false-positive rate ≈ (0.6185)^(m/n). Hence, increasing m/n can reduce the false-positive rate. For a cache with 2^p-way set associativity, the total number of cache blocks is n = 2^p × 2^a0 = 2^(a0+p). Furthermore, the BF array size m must be bigger than n to reduce the false-positive rate. Assuming m/n = 2^q, where q is a small positive integer, we have m = 2^(a0+p+q). Therefore, the BF index is a1||a0, where a1 has p + q bits and must be a positive number.
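The formula above can be tabulated directly; the snippet below (ours) prints the theoretically optimal k and the corresponding false-positive rate for several m/n ratios. Note that the proposed design uses at most three simple indices rather than the optimal k, so the measured rates reported later differ from these ideal values.

```python
import math

# Ideal Bloom filter: optimal k = ln2 * (m/n), false-positive rate ~ 0.6185^(m/n).
for ratio in (2, 4, 8, 16):
    k_opt = math.log(2) * ratio
    fp_rate = 0.6185 ** ratio
    print(f"m/n = {ratio:2d}: optimal k ~ {k_opt:4.1f}, false-positive rate ~ {fp_rate:.3%}")
```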

There is a unique advantage to including the cache index (a0) as a part of the BF index. Due to collisions, multiple blocks can be hashed to the same location in the BF array. Since the cache index is a part of the BF index, all blocks hashed to the same BF array location must be located in the same cache set. By comparing the a1 bits of all cache tags in the set with the a1 bits of the replaced block, the BF array location is reset only if the replaced a1 is not found in any other block of the set.

Note that due to spatial locality in memory references, using the low-order block address bits may reduce collisions in the BF array and hence lower the false-positive rate. Moreover, we apply a simple cache index randomization technique by exclusive-ORing the a1 bits with the adjacent high-order a2 bits to further reduce collisions. Consider a BFL3 design for an 8MB L3 cache with 64-byte blocks and 16-way set associativity. The target set is determined by the low-order 13 bits (a0) of the block address, which hash to the 8K sets. The total number of blocks is n = 2^17. Assume that the BF array size m is 8 times larger than n. As a result, the additional index (a1) is 7 bits, and the BF index has 20 bits. For randomizing a1, the 7 higher-order bits (a2) are used. The total number of required address bits is 33, including the 6 offset bits. With a limited physical address, we can have several hashing combinations for the BFL3 using a0, a1, and a2. Note that in this work we set up 8GB for our simulated memory; with a bigger memory and more physical address bits, more hashing options could be explored.

(a) k=1: three BF indices: a1||a0, a2||a0, and (a1 XOR a2)||a0.

(b) k=2: three BF index groups: (a1||a0 and a2||a0), (a1||a0 and (a1 XOR a2)||a0), and (a2||a0 and (a1 XOR a2)||a0).

(c) k=3: one BF index group: (a1||a0, a2||a0, and (a1 XOR a2)||a0).

Figure 4-2. Cache indexing and hashing for the BF. (The block address is divided into tag, cache index (a0), and offset fields; BF index 1 = a1||a0 and BF index 2 = (a1 XOR a2)||a0, where a1 and a2 are adjacent tag bits.)
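To make the indexing scheme concrete, the following Python sketch models a BFL3 with the two indices a1||a0 and (a1 XOR a2)||a0 under the 8MB, 16-way L3 geometry and m:n = 8:1 ratio described above. The class and method names are ours; the cache itself is not modeled, only the BF array and its updates on block insertion and eviction.

```python
class L3MissFilter:
    """Sketch of the proposed BFL3: one m-bit array indexed by a1||a0 and (a1 XOR a2)||a0.

    block_addr below is the block address, i.e., the physical address with the
    6 offset bits already stripped. Geometry follows the example above:
    8MB, 16-way L3 -> 13 index bits (a0); m/n = 8:1 -> 7 extra bits for a1 and a2.
    """
    A0_BITS = 13
    EXTRA_BITS = 7

    def __init__(self):
        self.m = 1 << (self.A0_BITS + self.EXTRA_BITS)
        self.bits = bytearray(self.m)              # one flag per BF location

    def _indices(self, block_addr):
        mask_a0 = (1 << self.A0_BITS) - 1
        mask_x = (1 << self.EXTRA_BITS) - 1
        a0 = block_addr & mask_a0
        a1 = (block_addr >> self.A0_BITS) & mask_x
        a2 = (block_addr >> (self.A0_BITS + self.EXTRA_BITS)) & mask_x
        return ((a1 << self.A0_BITS) | a0,
                ((a1 ^ a2) << self.A0_BITS) | a0)

    def on_fill(self, block_addr):
        """A block is inserted into L3: set both hashed locations."""
        for idx in self._indices(block_addr):
            self.bits[idx] = 1

    def on_evict(self, victim_addr, remaining_blocks_in_set):
        """Reset a location only if no block still in the set maps to it.

        Because a0 is part of every BF index, only blocks of the same cache
        set can share a BF location, so the set's own tags are all we check.
        """
        live = set()
        for blk in remaining_blocks_in_set:
            live.update(self._indices(blk))
        for idx in self._indices(victim_addr):
            if idx not in live:
                self.bits[idx] = 0

    def is_definite_miss(self, block_addr):
        """A zero at either location guarantees the block is not in L3."""
        return any(self.bits[idx] == 0 for idx in self._indices(block_addr))
```

In a runahead check, a True return from is_definite_miss() allows the request to be sent toward the L4 tag lookup immediately, while a False return is treated as a possible hit and the request simply follows the normal L1, L2, and L3 path.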

Figure 4-3 shows the false-positive rates for the different hashing schemes. Note that we only show the results of 6 hashing schemes, since the false-positive rate using the single hashing index (a2||a0) is very high. We simulate 3 m:n ratios, 4:1, 8:1, and 16:1, using memory traces generated from SPEC2006 benchmarks. We first ran the workloads on a whole-system simulation infrastructure based on a cycle-accurate 8-core model along with a memory hierarchy model [29] to collect memory reference traces from misses to the L2 caches. Five billion instructions from 8 cores are collected for each workload. The simulation environment and parameters will be given in Section 3.3.


The false-positive rate is calculated as the number of requests that hit in the filter but actually miss the cache, divided by the total number of misses. Each false-positive point in the figure is the geometric mean of 12 SPEC2006 benchmarks. As can be observed, when k=1, randomization of a1 helps very little in improving the false-positive rate. Two hashing functions with indices a1||a0 and (a1 XOR a2)||a0, as illustrated in Figure 4-3, show the lowest false-positive rates of about 2.3%, 4.8%, and 16.5%, respectively, for m:n ratios of 16:1, 8:1, and 4:1. Three hashing functions cannot further improve the false-positive rate because of insufficient address bits, where the third hashing index is highly correlated with the first two.

Figure 4-3. False-positive rates for the 6 hashing mechanisms (a1; a1^a2; (a1, a2); (a2, a1^a2); (a1, a1^a2); (a1, a2, a1^a2)) with m/n = 4, 8, and 16.

Figure 4-4. False-positive rates of the individual benchmarks with m:n = 2:1, 4:1, 8:1, 16:1, and k = 1, 2.

In Figure 4-4, we show the false-positive rates for the individual SPEC2006 benchmarks. Based on the results in Figure 4-3, we pick two hashing schemes: a1||a0 for k=1, and a1||a0 with (a1 XOR a2)||a0 for k=2. The results show that the m:n ratio plays an important role, as bigger BF arrays reduce the false-positive rate significantly for all benchmarks. The false-positive rates are very high for the small BF array with a ratio of m:n = 2:1. The benefit of multiple hashing functions becomes more evident when m/n is 4 or greater. The false-positive rate behavior is very consistent across all benchmarks. For k=1, the average false-positive rates are 8.7% and 4.3% using BF arrays that have 8 and 16 times more bits than the total number of cache blocks. When k=2, the false-positive rates are reduced to 4.8% and 2.3%, respectively, using BF arrays with 8 and 16 times more entries than the number of cache blocks. These results are used to guide our IPC timing simulation.

4.3 Performance Results

The IPC improvement using a BF for runahead L3 misses is presented in this section. We also compare the improvement with a perfect BF without any false-positive misses. In addition, the sensitivity studies of BF design parameters, the size of the L4 caches, and the latency and bandwidth of the regular DRAM are also presented.

For an 8MB L3 cache with 64-byte blocks, the space overhead of the new BFL3 is 64KB, 128KB, and 256KB, respectively, for m:n = 4:1, 8:1, and 16:1. We use Cacti [21] to estimate the BF latency and get 2, 3, and 3 cycles for the three BF arrays. In addition, we add two more cycles for the wiring delay. For the delayed recalibrations, since we need to read out the last 14 bits (a1 and a2) and perform hierarchical OR operations, we measured using Cacti 6.5 [21] that it takes 3 cycles to recalibrate one set. Four sets can be recalibrated in parallel, and a total of 6K cycles are charged for each recalibration.
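The space and recalibration-overhead figures follow directly from the cache geometry; the short sketch below (ours) recomputes them.

```python
# Reproduce the BFL3 sizing and recalibration figures quoted above.
n_blocks = 8 * 2**20 // 64                 # 8MB L3, 64-byte blocks -> 2^17 blocks
for ratio in (4, 8, 16):
    print(f"m:n = {ratio:2d}:1 -> BF array = {n_blocks * ratio // 8 // 1024} KB")

l3_sets = 8 * 1024                          # 8K sets
recal_cycles = l3_sets // 4 * 3             # 4 sets in parallel, 3 cycles per set
print(f"delayed recalibration: about {recal_cycles} cycles per pass")
```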


4.3.1 IPC Comparison

Figure 4-5 displays the IPCs of the twelve benchmarks with six caching mechanisms: a regular 4-level cache without a BF; a BFL3 to filter and run ahead of L3 misses; three delayed-recalibration designs, d1-BFL3, d2-BFL3, and d3-BFL3, with recalibration periods of 0.5M, 1M, and 2M memory references; and a perfect BFL3 that does not incur false positives. Note that we use the two hashing functions a1||a0 and (a1 XOR a2)||a0 and m:n = 8:1 for the BFL3. The results show that the average IPC improvement is about 10.5% using the BFL3. This improvement is only 1.3% less than using a perfect BFL3, which averages 11.8%. In comparison with the three delayed-BF designs, the improvements are 4.3%, 4.8%, and 3.5%, respectively. A shorter recalibration period has fewer false positives but pays more overhead in recalibrations. In general, for the design using a BFL3, all benchmarks show good IPC improvement. Mcf and sphinx benefit the most, with close to 20% improvement. Other workloads show at least 6% improvement compared against the design without a BF, except bwaves, which has about 4% improvement. The BFL3 design also shows 1.3-8.9% improvement against the delayed-BF designs.

The major impact on the IPC comes from the average memory access latency. In Table 4-1, we list the average memory latency of L3 misses with and without runahead using a BF. The L3 miss latency is measured from the generation of the memory address until the return of the requested data. Note that the measurement does not include the hit latencies of the L1, L2, and L3 caches, since these latencies are basically the same with or without the BF.

We can observe a significant saving in the L3 miss latency when using the BFL3 to run ahead of the misses for all benchmarks. On average, the L3 miss latency is reduced from 154 cycles to 120 cycles, which closely matches the savings of the L1, L2, and L3 hit times minus a 5-cycle penalty for accessing the BF array. In general, the latency results are consistent with the IPC results. The BFs with delayed recalibration (not shown) also have longer latency due to more false positives; in addition, they are charged the recalibration overheads.

Figure 4-5. IPC comparisons with/without BF (without BF, BF, d1-BF, d2-BF, d3-BF, and Perfect-BF).

Table 4-1. L3 miss latency (cycles) and false-positive rates of the 12 benchmarks.
              Latency (cycles)         False-positive rate (%)
              BFL3      w/o BF         BFL3   d1-BFL3   d2-BFL3   d3-BFL3
mcf           105.9     138.2          7      9         12        14
soplex        106.2     141.1          5      10        12        13
lbm           153.3     186            5      8         9         11
leslie3d      158.8     190.4          6      7         9         11
gems          216.9     248.8          4      6         8         9
libquantum    137.6     162            9      12        13        17
milc          142.3     173.9          4      6         9         10
bwaves        128.9     161.5          3      6         8         9
sphinx        69.8      106.4          4      7         11        12
bt            134.6     168.3          4      5         7         9
omnetpp       77.3      112.2          6      10        13        16
gcc           80.6      116.3          3      6         8         9

In Table 4-1, we also show the false-positive rates of the twelve benchmarks measured in the timing simulation. The results range from 3% to 9%, which is consistent with the rates based on simulating long memory traces (Figure 4-4). The small false-positive rate ensures a small impact on the IPC improvement. As shown in Figure 4-5, the IPC improvement of a perfect BF only surpasses the average IPC improvement of a realistic BF by raising it from 10.5% to 11.8%.

We also provide a rough estimate of the power consumption. We measured using Cacti [52] that each BF access takes around 0.013 nJ, which is close to a single L1 cache's dynamic access energy of 0.014 nJ. Since the only power difference is the Bloom Filter access energy, and the number of Bloom Filter accesses is the same as the L1 cache's total number of accesses, the extra power consumption is basically the same as the L1 cache's total dynamic power consumption. On the other hand, using a Bloom Filter speeds up execution by 10.5%, which can be translated into a 10.5% static energy saving. To save even more energy, the Bloom Filter could be accessed only after an L1 cache miss.

4.3.2 Sensitivity Study

The sensitivity study of the IPC impact with respect to the m:n ratio and the number of hashing functions k is shown in Figure 4-6, in which each IPC point is the geometric mean of the 12 benchmarks. Again, the results show that bigger BF arrays reduce the false-positive rate and improve the IPC. The improvement rate is much larger for k=3 than for k=2 and k=1. This is due to the fact that without sufficient entries in the BF array, more hashing functions actually increase the chance of collisions. On the other hand, with bigger BF arrays, more hashing functions can spread each block more randomly and reduce the chance of collision. When the m:n ratio is 2:1, there is insufficient room in the BF array even for k=2, resulting in a slightly lower IPC than that of k=1; k=3 is obviously much worse. However, the IPC of k=3 nearly catches up with the IPC of k=2 when m:n=8:1. When m:n=16:1, the IPCs for the different hashing functions are very close. Given a sufficient BF array size, the false-positive rates are small regardless of the number of hashing functions. Nevertheless, for the large m:n=16:1, we would expect k=3 to have a better IPC than k=2. However, due to the limited address bits, the third BF index is highly correlated with the first two BF indices, resulting in limited improvement in the false-positive rate.

In Figure 4-7, we show the IPC results for four L4 sizes ranging from 64MB to 512MB. In these simulations, we maintain m:n=8:1 and k=2. Regardless of the L4 size, the BFL3 always improves the IPC significantly. As expected, however, a bigger L4 reduces the L4 MPKI and improves the IPC more when using the BFL3. For the four L4 sizes, the IPC improvements are 9.0%, 10.5%, 11.5%, and 12.0%, respectively.

Figure 4-6. Average IPC for different m:n ratios and numbers of hashing functions (k = 1, 2, 3; m/n = 2, 4, 8, 16).

Figure 4-7. Average IPC for different L4 sizes (64MB, 128MB, 256MB, and 512MB), with and without the BF.


Figure 4-8. Average IPC over different DRAM latencies (original, fast, and slow) with 4 and 2 channels, with and without the BF.

Table 4-2. Future conventional DRAM parameters.
Faster DRAM latency:  tCAS-tRCD-tRP: 6-6-6;    tRAS-tRC: 33-30
Slower DRAM latency:  tCAS-tRCD-tRP: 11-15-15; tRAS-tRC: 38-50

Next, the impact of the DRAM latency is simulated. In comparison with the original DRAM latency in Table 2-1, we simulate a fast and a slow DRAM latency as shown in Table 4-2. We also test two DRAM bandwidth configurations, one with 2 channels and the other with 4 channels. The L3 and L4 sizes are 8MB and 128MB, and the BFL3 remains at m:n=8:1, k=2. The results are shown in Figure 4-8. It is interesting to see that higher DRAM bandwidth and faster DRAM latency help the IPC with runahead L3 misses more than without runahead misses. For the fast latency with 4 DRAM channels, the average IPC improvement reaches 12%. On the other hand, for the slow latency with 2 DRAM channels, the average IPC improvement is about 7%.

4.4 Summary

A new Bloom Filter is introduced to filter L3 cache misses, bypassing the L1, L2, and L3 caches to shorten the L3 miss penalty in a 4-level cache hierarchy system. The proposed Bloom Filter applies a simple indexing scheme by decoding the low-order block address to determine the hashed location in the BF array. To provide better hashing randomization, a part of the index bits is XORed with the adjacent higher-order address bits. In addition, with certain combinations of the limited block address bits, multiple index functions can be selected to further reduce the false-positive rate. Results show that the proposed simple hashing scheme lowers the average false-positive rate below 5% for filtering L3 misses and improves the average IPC by 10.5% by running ahead of these misses.

Furthermore, the proposed BF indexing scheme resolves an inherently difficult problem in using the Bloom Filter for identifying L3 cache misses. Due to dynamic updates of the cache content, a counting Bloom Filter is normally necessary to update the BF array to reflect the changes of the cache content. A unique advantage of the proposed BF index is that it contains the cache index, i.e., the BF index bits are a superset of the cache index bits. As a result, the blocks that are hashed to the same BF array location are allocated in the same cache set. By searching the tags in the set when a block is replaced, the corresponding BF bit can be reset correctly. This restricted hashing scheme demonstrates a low false-positive rate and simplifies the BF array updates without using expensive counters.


CHAPTER 5 GUIDED MULTIPLE HASHING

In this chapter, we present our guided multiple hashing work. We begin by introducing the problems of single hashing and multiple hashing. We then use a simple example to illustrate the proposed idea. A detailed algorithm that aims to maximize the number of empty buckets while balancing the keys in the non-empty buckets is given. Finally, we present results that show our improvement over other traditional hashing methods.

5.1 Background

Hash-based lookup has been an important research direction for routing and packet forwarding, which are among the core functions of the IP network-layer protocols. While there are alternative approaches for routing table lookup, such as trie-based solutions, we only focus on hash-based solutions, which have the advantages of simplicity and O(1) average lookup time, whereas trie-based lookup tends to make many more memory accesses.

Single hashing suffers from the collision problem, where multiple keys are hashed to the same bucket and cause an uneven distribution of keys among the buckets. This results in variable delays when looking up keys located in different buckets. For hash-based network routing tables [62] [63] [64] [65], it is critical to perform fast lookups of the next-hop routing information. In today's backbone routers, routing tables are often too big to fit into the on-chip memory of a network processor. As a result, off-chip routing table access becomes the bottleneck for meeting the increasing throughput requirement of the high-speed Internet [66] [67]. Unbalanced hash buckets further worsen the off-chip access. Today's memory technology is more efficient at fetching a contiguous block (such as a cache block) at once than fetching individual data elements separately from off-chip memory. A heavily loaded hash bucket may require two or more memory accesses to fetch all its keys. However, in order to accommodate the most-loaded bucket for a constant lookup delay, fetching a large memory block that can hold the highest number of keys in a bucket increases the critical memory bandwidth requirement, wastes memory space, and lowers the network throughput [63] [65] [68] [69].

Methods have been proposed to handle the hash collision problem by balancing the bucket load, i.e., reducing the maximum number of keys in a bucket among all buckets. One approach is to use multiple hashing, such as d-random [70], which hashes each key to d buckets using d independent hash functions and stores the key in the least-loaded bucket. The 2-left scheme [62] [68] is a special case of d-random where the buckets are partitioned into left and right regions. When inserting a key, a random hash function is applied in each region and the key is allocated to the least-loaded bucket (to the left in case of a tie). The multiple-hashing approach balances the buckets and reduces the fetched bucket size for each key lookup. However, without knowledge of which bucket a key is located in, d-random (d-left) requires probing all d buckets. As the bottleneck lies in the off-chip memory access, accessing multiple buckets slows down the hash table access and degrades the network performance [65] [67].

To remedy probing d buckets, the extended Bloom Filter [64] uses counters and extra pointers to link keys in multiple hashed buckets to avoid lookups of multiple buckets. However, it requires key replication and must handle complex key updates. The recently proposed Deterministic Hashing [65] applies multiple hash functions to an on-chip intermediate index table where the hashed bucket addresses are saved. By properly setting up the bucket addresses in the index table, the hashed buckets can be balanced. This approach incurs space overhead and delays due to the indirect access through the index table. In [69], an improved approach uses an intermediate table to record the hash function IDs instead of the bucket addresses to alleviate the space overhead. In addition, it uses a single hash function for the index table to ease the update complexity. However, with a limited index table and hashing functions, the achievable balance is also limited. In another effort to avoid collisions, the perfect hash function sets a more rigid goal of achieving a one-to-one mapping between keys and buckets. It accomplishes the goal using complex hash functions encoded on-chip with significant space and additional delays [71] [20]. It also requires changes to the encoded hash function upon a hash table update.

5.2 Hashing

We first describe the challenges of a hashing-based information table using a single hash function. We also bring up the motivation for and applications of using a multiple-hashing approach for organizing and accessing a hash table.

Figure 5-1. Distribution of keys in buckets for the four hashing algorithms.

To demonstrate the power of multiple hashing in accomplishing different objectives for the hash table, we compare the simulation results of four hashing schemes: single hashing (single-hash), 2-hash with load balancing (2-left), 4-hash with load balancing (4-left), and 2-hash with maximum zero buckets (2-max-0). We simulate 200,000 randomly generated keys to be hashed to 100,000 buckets. The distribution of keys in buckets is plotted in Figure 5-1. We can observe substantial differences in the key distribution among the four hashing schemes. The maximum number of keys in a bucket reaches ten for single-hash and 2-max-0. Meanwhile, 2-max-0 produces 2.5 times as many empty buckets as single-hash does. 2-left and 4-left are more balanced, with four and three as the maximum numbers of keys in a bucket, respectively. It is easy to see that increasing the number of hash functions from two to four helps improve the balance.
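The contrast above is easy to reproduce qualitatively; the sketch below compares single hashing with 2-left on the same 200,000-key, 100,000-bucket configuration using generic random hashes (it does not reproduce the specific hash functions or the 2-max-0 heuristic used in our experiments).

```python
import random
from collections import Counter

def single_hash(num_keys, num_buckets):
    loads = [0] * num_buckets
    for _ in range(num_keys):
        loads[random.randrange(num_buckets)] += 1
    return loads

def d_left(num_keys, num_buckets, d=2):
    # Buckets split into d regions; each key goes to the least-loaded of its
    # d candidates, with the leftmost region winning ties.
    region = num_buckets // d
    loads = [0] * num_buckets
    for _ in range(num_keys):
        candidates = [i * region + random.randrange(region) for i in range(d)]
        loads[min(candidates, key=lambda b: loads[b])] += 1
    return loads

for name, loads in (("single-hash", single_hash(200_000, 100_000)),
                    ("2-left", d_left(200_000, 100_000))):
    hist = Counter(loads)
    print(f"{name}: max load = {max(loads)}, empty buckets = {hist[0]}")
```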

5.3 Proposed Algorithm

In this section, we describe the detailed algorithms of the guided multiple-hashing scheme, which consists of a setup algorithm, a lookup algorithm, and an update algorithm. Assume we have m buckets B_1, ..., B_m and d independent hash functions H_1, ..., H_d. Each key x is hashed and placed into all d buckets B_Hi(x), 1 ≤ i ≤ d. The set of keys in bucket B_i is denoted by B[i], and the number of keys in bucket B_i is v(B[i]), 1 ≤ i ≤ m. The bucket load Ω_a is defined as the maximum number of keys in any bucket. We define the memory usage ratio as θ = (Ω_a × m)/n to indicate the memory requirement of the hash table. Other terminologies are self-explanatory and are listed in Table 5-1. For better illustration of d-ghash, we use a simple hash table with 5 keys and 8 buckets. All keys are hashed to the buckets using two hash functions, where buckets B_0, ..., B_7 have 1, 0, 1, 2, 3, 0, 1, 2 keys, as indicated by the arrows in Figure 5-2.

5.3.1 The Setup Algorithm

Since the objective is to minimize the bucket load while approaching a single bucket access per lookup, the setup algorithm needs to satisfy two criteria: (1) achieving near-perfect balance, and (2) maximizing the number of c-empty buckets. Recall that a c-empty bucket serves as a multiple-hashing target of one or more keys, but the key(s) are placed into other alternative buckets, which makes access to the c-empty bucket unnecessary.

Table 5-1. Notation and definitions.
Symbol     Meaning
n          Total number of keys
m          Total number of buckets
B[i]       Set of keys in the i-th bucket
v(B[i])    Number of keys in the i-th bucket
s          Indices of the buckets in B sorted in ascending order of v(B[i])
H_i        i-th hash function
Ω_p        Optimal bucket load, ⌈n/m⌉
Ω_a        Achievable bucket load
n_u        Total number of keys in under-loaded buckets (bucket load less than Ω_a)
b_u        Number of under-loaded buckets
θ          Memory usage ratio

Figure 5-2. A simple d-ghash table with 5 keys, 8 buckets, and 2 hash functions. (The shaded bucket is a c-empty bucket. The final key assignment is as illustrated.)

Given n keys and m buckets, a perfect-balance hashing scheme achieves the optimal bucket load Ω_p = ⌈n/m⌉. However, perfect balance may not be achieved under our or other multi-hashing schemes, because even with multiple hashing, some buckets may still probabilistically be under-loaded, i.e., zero or fewer than Ω_p keys are hashed to the bucket, and this translates to some other buckets being squeezed with more keys. Increasing the number of hash functions reduces the under-loaded buckets and helps approach perfect balance. Our simulation shows that with 4 hash functions, the achievable balance is the same as or very close to Ω_p.

The first step in the setup algorithm is to estimate Ω_a, the achievable balance. The idea is to count the number of under-loaded buckets and the number of keys inside them. If the remaining buckets cannot hold the rest of the keys with Ω_a keys in each bucket, we increase Ω_a by one. We then use Ω_a as the benchmark bucket load for key assignment. We sort all buckets in B, resulting in a sorted index array s such that v(B[s(i)]) ≤ v(B[s(i+1)]), 1 ≤ i ≤ m − 1. In the simple example of Figure 5-2, Ω_a = Ω_p = ⌈n/m⌉ = 1.

The next step is key assignment, which consists of two procedures: creating c-empty buckets, and balanced key assignment. For creating c-empty buckets, the procedure removes duplicate keys starting from the most-loaded buckets to maximize their service as companion buckets and reduce the bucket accesses. A key can be safely removed from a bucket if it exists in other bucket(s). The procedure goes through all buckets whose initial load is greater than Ω_a and tries to remove keys from them. In the illustrated example in Figure 5-2, all 3 keys in B_4 are successfully removed and B_4 becomes empty. Next, we check B_3 and B_7, each of which has 2 keys. Note that both K_2 and K_4 in B_3 can be removed if B_3 is emptied first. As a result, K_2 and K_3 cannot be removed from B_7, which exceeds Ω_a. All buckets with a bucket load exceeding Ω_a become targets for reallocation, as described next.

After emptying the buckets, the key assignment procedure assigns each key to a bucket, starting from the least-loaded bucket. Once a key is assigned, its duplicates are removed from the remaining buckets. During the assignment, buckets with more than Ω_a keys are skipped in order to maintain the achievable balance. A re-assignment of the buckets with load greater than Ω_a is necessary after all the buckets are assigned. During the re-assignment, the keys in the overflow buckets are relocated to other buckets when possible. In our experiment, we use Cuckoo Hashing [72] to relocate keys from an overflow bucket to an alternative bucket using the multiple hashing functions. If all alternative buckets are full, an attempt is made to make room in the alternative buckets. For simplicity, however, such attempts stop after r tries, where r can be any heuristic number. A larger r brings better balance at the expense of longer setup time. In the illustrated example, K_2 in B_7 is relocated to B_3 to reduce the bucket load of B_7; hence, the optimal load is achieved.

In case perfect balance is not achievable, Ω_a is incremented by one and the key assignment procedure repeats. It is important to note that the priority of the key assignment is to achieve perfect balance. Therefore, keys that were previously removed from an emptied bucket can be reassigned back in order to accomplish the perfect balance, such that the number of keys is less than or equal to Ω_a in all buckets. It is also important to know that in order to reduce the bucket load, we can decrease the ratio n/m, i.e., increase the number of buckets for a fixed number of keys. However, increasing the number of buckets inflates the memory space requirement, as the memory usage ratio is θ = (Ω_a × m)/n for a constant bucket size that allows efficient fetching of a bucket from off-chip memory.
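The following Python sketch captures the spirit of the setup procedure in a much-simplified form: it caps every bucket at the target load and prefers already-occupied candidate buckets so that the other candidates can remain (c-)empty. The full c-empty creation pass, the sorted assignment order, and the Cuckoo-style re-assignment of overflow buckets are omitted, and all names are ours.

```python
import math

def ghash_setup_sketch(keys, m, hash_funcs):
    """Simplified d-ghash setup: a single greedy pass capped at the optimal load."""
    omega = math.ceil(len(keys) / m)               # target (optimal) bucket load
    buckets = [[] for _ in range(m)]
    overflow = []                                   # keys this greedy pass could not place

    for x in keys:
        cands = [h(x) % m for h in hash_funcs]
        with_room = [b for b in cands if len(buckets[b]) < omega]
        nonempty = [b for b in with_room if buckets[b]]
        if nonempty:
            # Prefer a non-empty candidate (keeps the others c-empty), least loaded first.
            buckets[min(nonempty, key=lambda b: len(buckets[b]))].append(x)
        elif with_room:
            buckets[with_room[0]].append(x)
        else:
            overflow.append(x)                      # the real setup would relocate via Cuckoo hashing

    empty_array = [1 if b else 0 for b in buckets]  # '0' marks an empty bucket
    return buckets, empty_array, overflow
```

The keys left in overflow correspond to the cases that the full algorithm resolves by relocation or by incrementing Ω_a.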

5.3.2 The Lookup Algorithm

In order to speed up the lookup of keys, we introduce a data structure called the empty array, which is a bit array of size m indicating whether each bucket is empty or not. If a bit in the empty array is '0', the corresponding bucket is empty; otherwise, it is not empty. Upon looking up a key x, the bits at indices H_1(x), ..., H_d(x) in the empty array are checked. If only one of the hashed buckets is non-empty, we simply fetch that bucket and thus complete the lookup. If there are two or more non-empty buckets, we access them one by one until we find the key. In the worst case, all d bits are ones and d buckets are examined before we find the key. As discussed above, creating c-empty buckets helps reduce the bucket accesses per lookup, thus alleviating the lookup cost.

To further enhance our algorithm, we introduce another data structure, the target array, to record the hash function ID when a key is hashed to two or more non-empty buckets. To distinguish it from the algorithm described above, we call it the enhanced d-ghash algorithm; the algorithm using only the empty array is called the base d-ghash algorithm. The recorded ID indicates the bucket where the key is most likely located. The empty array has m bits, while the size of the target array varies depending on the number of keys. Suppose m = 200K and we use a 200K-entry target array; then the empty array takes 25KB, and the target array takes 25KB for enhanced 2-ghash and 50KB for enhanced 4-ghash. These two small arrays can be placed on chip for fast access. Multiple keys may collide in the target array. When a collision occurs, the priority of recording the target hashing function is given to the key that hashes to more non-empty buckets. Given a fixed number of keys, we can adjust the number of buckets (m) and hash functions (d) to achieve a specific goal for the bucket size and the number of buckets to be fetched when looking up a key. More discussion is given in Section 5.4.
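A lookup then consults only the on-chip arrays before fetching buckets from off-chip memory. The sketch below follows the procedure described above, with the enhanced variant's target array as an optional argument; t_hash is the single hash function assumed to index the target array (a name of ours).

```python
def ghash_lookup_sketch(x, buckets, empty_array, hash_funcs, target_array=None, t_hash=None):
    """Return (bucket_index or None, number_of_buckets_fetched)."""
    m = len(buckets)
    cands = [h(x) % m for h in hash_funcs]
    probe_order = [b for b in cands if empty_array[b]]     # skip (c-)empty buckets

    # Enhanced d-ghash: when several candidates are non-empty, probe first the
    # bucket whose hash-function ID is recorded in the on-chip target array.
    if target_array is not None and len(probe_order) > 1:
        hinted = cands[target_array[t_hash(x) % len(target_array)]]
        if hinted in probe_order:
            probe_order.remove(hinted)
            probe_order.insert(0, hinted)

    for fetched, b in enumerate(probe_order, start=1):
        if x in buckets[b]:
            return b, fetched
    return None, len(probe_order)
```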

5.3.3 The Update Algorithm

There are three common types of hash table updates: insertion, deletion, and modification. It is straightforward to delete or to modify a key in the hash table. For deletion, the key is probed first by fetching the bucket from off-chip memory. The key is then removed from the bucket before the bucket is written back to memory. If the key is the last one in the bucket, the corresponding bit in the empty array is set to zero. For modification of the associated record of a key, the key and its associated record are fetched, and the new record replaces the old one before the bucket is written back to memory. These two types of updates do not involve modifying the target array.

Key insertion is slightly more complicated. All hashed buckets are probed, and the key is inserted into the least-loaded non-empty bucket whose number of keys is less than Ω_a. If all non-empty buckets are full, the key is inserted into an empty bucket to which it also hashes, and the empty array is updated accordingly. In case all hashed buckets are full, Cuckoo Hashing is applied to make room for the new key, i.e., "rehashing" a key in one of the hashed buckets to another alternative bucket. During key relocations, both the empty and the target arrays are updated accordingly. There are two options in case a key cannot be inserted without breaking the property v(B[i]) ≤ Ω_a, i.e., when all its hashed/rehashed bucket loads are greater than or equal to Ω_a: first, set Ω_a = Ω_a + 1 and insert the key normally; second, initiate an off-line process to re-setup the table. Normally, the probability that a key cannot be inserted is small, and we should use the second option to prevent the bucket size from growing fast. However, if this situation happens very frequently, it implies that most of the buckets are "full", i.e., the average number of keys in the buckets is approaching Ω_a. In this case we should use the first option. By increasing the maximum load by one, all buckets gain one extra space to store another key.
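A sketch of the insertion path just described, leaving out the Cuckoo relocation and the two fallback options (bumping Ω_a or re-running the setup); the names follow the sketches above:

```python
def ghash_insert_sketch(x, buckets, empty_array, hash_funcs, omega):
    """Insert x into the least-loaded non-empty candidate with room, else an empty candidate."""
    m = len(buckets)
    cands = [h(x) % m for h in hash_funcs]
    nonempty = [b for b in cands if empty_array[b] and len(buckets[b]) < omega]
    if nonempty:
        target = min(nonempty, key=lambda b: len(buckets[b]))
    else:
        empties = [b for b in cands if not empty_array[b]]
        if not empties:
            return False            # full algorithm: Cuckoo relocation, then option 1 or 2
        target = empties[0]
        empty_array[target] = 1     # the bucket is no longer empty
    buckets[target].append(x)
    return True
```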

5.4 Performance Results

The performance evaluation is based on simulations of seven hashing schemes: single-hash, 2-left, 4-left, base 2-ghash, enhanced 2-ghash, base 4-ghash, and enhanced 4-ghash. Note that we do not include d-random in the evaluation, because it is outperformed by d-left both in terms of the bucket load and the number of bucket accesses per lookup. We simulate 200,000 randomly generated keys to be hashed into 100,000 to 500,000 buckets. To test the new multiple hashing idea, we adopt the random hash function in [25], which uses a few shift, or, and addition operations on the key to produce multiple hashing results. For relocation, we try to relocate keys in no more than ten buckets to other alternative buckets in the Setup Algorithm and no more than two in the Update Algorithm. We first compare the bucket load and the average number of bucket accesses per lookup by varying n/m. Then we normalize the number of keys per lookup based on the memory usage ratios to understand the memory overhead of the different hashing schemes. In addition, we demonstrate the effectiveness of creating c-empty buckets to reduce the bucket accesses. We also give a sensitivity study on the number of bucket accesses per lookup with respect to the size of the target array. Lastly, we evaluate the robustness of the d-ghash scheme by using two simple probabilistic models.
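For completeness, the sketches above take a list of hash functions; the snippet below shows one generic way to derive such a family from different seeds using only shifts, XORs, and additions. It is an illustration of ours, not the exact hash function of [25].

```python
def make_hash(seed):
    """Illustrative shift/xor/add mixer (not the exact function from [25])."""
    def h(key):
        v = (key + seed) & 0xFFFFFFFF
        v ^= (v << 13) & 0xFFFFFFFF
        v ^= v >> 7
        v = (v + (v << 3)) & 0xFFFFFFFF
        v ^= v >> 17
        return v
    return h

# Four hash functions from four arbitrary seeds.
hash_funcs = [make_hash(s) for s in (0x9E3779B9, 0x85EBCA6B, 0xC2B2AE35, 0x27D4EB2F)]
```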

Figure 5-3. Bucket loads for the five hashing schemes. The enhanced d-ghash and base d-ghash schemes have the same bucket load.

Figure 5-3 displays the bucket loads of the hashing schemes. Note that enhanced d-ghash and base d-ghash have the same bucket load; the only difference between the two is that enhanced d-ghash uses a target array to reduce the number of bucket accesses per lookup. The results show that d-ghash has the lowest bucket load and hence achieves the best balance among the buckets.

This is followed by d-left. More hash functions improve the balance for both d-ghash and d-left.

With 275,000 buckets, 4-ghash accomplishes perfect balance with a bucket load of a single key; no other simulated scheme achieves such balance even with up to 500,000 buckets. 2-ghash performs slightly better than 4-left, as the former needs 150,000 buckets to reduce the bucket load to two keys while the latter requires 175,000 buckets. This result demonstrates the power of d-ghash in balancing the keys compared with d-left. The single-hash scheme is the worst: its bucket load is six even with 500,000 buckets. Note that the bucket load is an integer, but we slightly offset the integer values in the figure so that the curves of the different schemes can be distinguished easily.

Figure 5-4. Number of bucket accesses per lookup for d-ghash.

In Figure 5-4, we evaluate the lookup efficiency of the seven hashing schemes. Single-hash only accesses one bucket per lookup. The d-left scheme looks up a key starting from the left-most bucket; if the key is not found, the next bucket to the right is accessed until the key is located. Since a key is always placed in the left-most bucket to break a tie, the number of bucket accesses per lookup is quite low: 1.68 ∼ 2.36 for 4-left and 1.27 ∼ 1.44 for 2-left. The base 4-ghash and base 2-ghash reduce the number of bucket accesses per lookup to 1.25 ∼ 2.18 and 1.11 ∼ 1.44 respectively, a 5–34% and 0–14% reduction. With a target array of 1.5n entries, the enhanced 4-ghash and enhanced 2-ghash further reduce the number of bucket accesses per lookup to as low as 1.03 ∼ 1.23 and 1.01 ∼ 1.11 respectively, a 38–51% and 21–24% reduction.

It is interesting to see that the number of bucket accesses per lookup for d-ghash does not decrease monotonically as the number of buckets increases. We can observe a sudden jump at m = 125,000 and at m = 275,000 for 4-ghash. This is because the optimal bucket load drops from three to two at m = 125,000 and from two to one at m = 275,000. When the average number of keys per bucket is very close to the optimal bucket load, it is hard to create c-empty buckets, so there are sudden decreases in the number of c-empty buckets at those two points. As a result, 4-ghash experiences more bucket accesses per key lookup. The same reasoning applies to 2-ghash.

Figure 5-5. Average number of keys per lookup based on memory usage ratio.


In order to reduce the bucket load for a fixed number of keys, we can increase the number of buckets. However, increasing the number of buckets inflates the memory space requirement. In Figure 5-5, we plot the average number of keys per lookup against the memory usage ratio, where the average number of keys is the product of the bucket load and the average number of buckets accessed per lookup. The results clearly show the advantage of the d-ghash scheme. Enhanced 4-ghash achieves a single key per bucket with 275,000 buckets, only 37% more than the number of keys. With slightly more than one key per lookup, enhanced 4-ghash requires the least amount of memory to achieve close to one key access per lookup.
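Written out, the metric plotted in Figure 5-5 can be expressed as follows, where Ω푎 is the bucket load and B̄ is the average number of buckets fetched per lookup; treating the memory usage ratio as the total allocated bucket slots relative to the number of keys is an assumption made here only to make the normalization explicit.

\[
\text{keys per lookup} \;=\; \Omega_a \times \bar{B}, \qquad
\text{memory usage ratio} \;\approx\; \frac{m \cdot \Omega_a}{n}.
\]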

Besides the perfect balance, d-ghash creates c-empty buckets to maximize the number of keys hashing to empty buckets. Figure 5-6 shows the effectiveness of the c-empty buckets in reducing bucket accesses. In this figure, the y-axis indicates the average number of non-empty buckets that each key is hashed into. In comparison with d-left, d-ghash reduces the non-empty buckets more significantly, resulting in a smaller number of bucket accesses. For d-left, the number of non-empty buckets decreases as the number of buckets increases, because d-left assigns each key to the least-loaded bucket. D-ghash, on the other hand, creates c-empty buckets by moving keys away from those buckets with more keys hashed into them. As a result, there are fewer non-empty buckets to examine when looking up each key. It is interesting to observe that the ability to create c-empty buckets depends heavily on the optimal bucket load and the ratio of keys to buckets. For example, when the number of buckets is 250,000, the optimal bucket load is 2 for 200,000 keys, which leaves plenty of room to create many c-empty buckets. However, when the number of buckets increases to 275,000, the optimal bucket load drops to 1, which leaves little room for c-empty buckets. Hence, the average number of non-empty buckets that each key hashes into increases.


Moreover, we show a sensitivity study of bucket accesses per lookup with respect to the size of the target array. We vary the size of the target array from n to 2n entries using enhanced 4-ghash; the result is shown in Figure 5-7. As expected, a larger target array reduces collisions, resulting in a smaller number of bucket accesses. We pick 1.5n entries as the target array size in the earlier simulations, which offers the best tradeoff between space overhead and bucket accesses per lookup.

Figure 5-6. The average number of non-empty buckets for looking up a key. This parameter is the same for the enhanced and base d-ghash schemes.

Next, we evaluate the robustness of our scheme. We first set up a table using 200,001 keys, 200,000 buckets, and 300,000 target array entries. The achievable bucket load Ω푎 is 2 in this setting. We simulate two update models: (1) Balanced Update: 33% insertion, 33% deletion, and 33% modification; and (2) Heavy Insertion: 40% insertion, 30% deletion, and 30% modification. We run 600K updates and record the percentage of update operations that trigger a rehash as well as the number of bucket accesses per lookup. The results are presented in Figure 5-8. The top two lines show the number of bucket accesses per lookup under the Heavy Insertion and Balanced Update models respectively; both lines increase. The number of bucket accesses per lookup grows continuously to 1.37 for Heavy Insertion, a 25% increase over the original number, while for Balanced Update the number first rises to 1.25 and then drops to 1.21, a 10% increase in the end. The bottom two lines are the rehash percentages over all update operations. These two lines show clearly that heavier insertion causes more rehashes. For Balanced Update, the rehash percentage stays almost constant at 0.5%; there is a slight increase in the Heavy Insertion rehash percentage. Since the rehash percentages for both models are below 2% and a rehash operation involves keys in no more than two buckets, we believe d-ghash can handle these rehashes without incurring much delay.

Figure 5-7. Sensitivity of the number of bucket accesses per lookup for enhanced 4-ghash with respect to the target array size.

Finally, we apply our algorithm to a real routing table application. We use five routing tables downloaded from Internet backbone routers: as286 (KPN Internet Backbone), as513 (CERN, European Organization for Nuclear Research), as1103 (SURFnet, the Netherlands), as4608 (Asia Pacific Network Information Center, Pty. Ltd.), and as4777 (Asia Pacific Network Information Center) [66], with 276K, 291K, 279K, 283K, and 281K prefixes respectively after removing redundant prefixes.

Figure 5-8. Changes in the number of bucket accesses per lookup and the rehash percentage for two update models using enhanced 4-ghash. The bucket accesses per lookup lines correspond to the left y-axis; the rehash percentage lines correspond to the right y-axis.

To handle the longest prefix matching problem, hash-based lookup adopts controlled prefix expansion [26] along with other techniques [73], [74], [75]; it is observed that there are small numbers of prefixes for most lengths, and they can be dealt with separately, for example using TCAM, while the remaining prefixes are expanded to a limited number of fixed lengths. Lookups are then performed against those lengths. In this experiment, we expand the majority of prefixes (with lengths in the range of [26], [76]) to two lengths: 22 bits and 24 bits. Assuming the small number of prefixes outside this range are handled by TCAM, we perform lookups against lengths 22 and 24. Because there are more prefixes of 24 bits after expansion, we present the results for 24-bit prefix lookup.
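As a concrete illustration of controlled prefix expansion, the Python sketch below expands a shorter IPv4 prefix to a fixed target length by enumerating the longer prefixes it covers. This is a textbook rendering of the idea in [26], not the exact expansion code used in the experiments.

def expand_prefix(prefix_bits, prefix_len, target_len):
    # prefix_bits holds the prefix left-aligned in a 32-bit integer.
    # A prefix of length prefix_len expands into 2**(target_len - prefix_len)
    # prefixes of length target_len.
    assert 0 < prefix_len <= target_len <= 32
    gap = target_len - prefix_len
    base = prefix_bits & ~((1 << (32 - prefix_len)) - 1)   # clear host bits
    return [base | (i << (32 - target_len)) for i in range(1 << gap)]

# Example: expand_prefix(0xC0A81000, 20, 24) expands 192.168.16.0/20 into the
# sixteen /24 prefixes 192.168.16.0/24 through 192.168.31.0/24.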

Figure 5-9. Number of bucket accesses per lookup for experiments with five routing tables.

Figure 5-10. Experiment with the update trace using enhanced 4-ghash. The bucket accesses per lookup line corresponds to the left y-axis; the rehash percentage line corresponds to the right y-axis.

Table 5-2. Routing table updates for enhanced 4-ghash.

Number of buckets    110K       120K    130K    140K    150K
Rehash percentage    re-setup   0.23%   0.13%   0.08%   0.05%


There are 159,444, 159,813, 159,395, 159,173, and 159,376 prefixes of 24 bits from the five routing tables, respectively. We use these prefixes to set up five tables separately and vary the number of buckets from 100K to 250K with a target array of 150K entries. We measure the number of bucket accesses per lookup for the d-ghash and d-left schemes; the results are averaged over the five tables. As shown in Figure 5-9, both base 4-ghash and base 2-ghash perform better than the respective d-left schemes. The maximum reduction of base 4-ghash over 4-left is about 36% at m = 210,000, and of base 2-ghash over 2-left 12% at m = 250,000. The average number of bucket accesses per lookup for enhanced 4-ghash is almost one bucket fewer than for 4-left, a reduction of up to 50%; for enhanced 2-ghash, there is an average 20% reduction over 2-left. We also notice a jump for 4-ghash at m = 220,000 and another for 2-ghash at m = 130,000, due to the change in Ω푎 mentioned before.

In the second experiment, we set up our hash tables with the routing table as286 downloaded on January 1st, 2010 from [66] and use the collected update trace for the whole month of January 2010 to simulate the update process. To keep the experiment simple, we again use only the prefixes of length 24. There are 159,444 24-bit prefixes in the table. The update trace contains 1,460,540 insertions and 1,458,675 deletions for those 24-bit prefixes. We vary the number of buckets from 110K to 150K. For all these settings, the achievable bucket load Ω푎 is 2 for enhanced 4-ghash. We also use a fixed 150K-entry target array.

As shown in Table 5-2, if we use 110K buckets, we need a re-setup of the whole table. With 120K buckets, no re-setup is needed, but 0.23% of all update operations require a rehash, which is about 0.5% of the 1.4 million insertions. As the number of buckets increases, fewer rehashes are needed; with 150K buckets, only about 0.05% of the updates trigger a rehash. We also show the change in lookup efficiency in Figure 5-10 with m = 150K. The update trace has nearly the same number of insertions and deletions, similar to the Balanced Update model used earlier in this section. We can see that the rehash percentage grows gradually to 0.05%. The number of bucket accesses per lookup fluctuates through the update process, ending with a 7% increase.

5.5 Summary

A new guided multiple-hashing method, d-ghash, is introduced in this chapter. Unlike previous approaches, which select the least-loaded bucket to place each key progressively, d-ghash achieves global balance by allocating keys to buckets only after all keys have been placed into buckets d times using d independent hash functions. D-ghash calculates the achievable perfect balance and removes duplicate keys to reach this goal. Meanwhile, d-ghash reduces the number of bucket accesses for looking up a key by creating as many empty buckets as possible without disturbing the balance. Furthermore, d-ghash uses a table to encode the hash function ID of the bucket where a key is located, to guide the lookup and avoid extra bucket accesses. Simulation results show that d-ghash achieves better balance than existing approaches and reduces the number of bucket accesses significantly.


CHAPTER 6 INTELLIGENT ROW BUFFER PREFETCHES

6.1 Background and Motivation

As we discussed in the introduction, accesses to DRAM are distributed across channels, ranks, and eventually banks. Inside each bank, the DRAM arrays are organized into rows. Before a memory location can be read, the entire row containing that location must be opened and read into the row buffer. Leaving a row buffer open after every access (the open-page policy) enables more efficient access to the same open row, at the expense of increased access delay to other rows in the same DRAM array. A request to the opened row is called a row-buffer hit. A row-buffer miss happens when the next request targets a different row, which can cause a long delay.
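The latency gap between the two cases can be sketched with a toy open-page timing model; the timing values below (in DRAM cycles) are placeholders chosen only for illustration, not the parameters used in the simulations of this dissertation.

T_CAS, T_RCD, T_RP = 14, 14, 14      # illustrative column, activate, precharge times

class Bank:
    def __init__(self):
        self.open_row = None                  # row currently in the row buffer

    def access(self, row):
        if self.open_row == row:              # row-buffer hit: column access only
            return T_CAS
        if self.open_row is None:             # row buffer closed: activate + read
            self.open_row = row
            return T_RCD + T_CAS
        self.open_row = row                   # row-buffer conflict: precharge the
        return T_RP + T_RCD + T_CAS           # old row, activate the new one, read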

Row buffer locality has long been observed and utilized in previous proposals, generally following two approaches. The first is to change the DRAM mapping and scheduling in order to improve row-buffer hits. [77] proposes a permutation-based page interleaving scheme to reduce row-buffer conflicts. [78] introduces a Minimalist Open-page memory scheduling policy that captures open-page gains with a relatively small number of page accesses per page activation. They observe that while the commonly used open-page address mapping schemes map each memory page to a sequential region of real memory, which allows linear access sequences to hit in the row buffer, this mapping can cause interference between applications sharing the same DRAM devices [79] and cannot exploit bank-level parallelism. They argue that the number of row-buffer hits is generally small, and that with an adjusted DRAM mapping scheme they can retain row-buffer hits while reducing row-buffer conflicts. However, they need a complex data prefetch engine and a complex scheduler to schedule normal and prefetch requests.

[80] proposes a three-stage memory controller that first groups requests based on row-buffer locality, then focuses on inter-application request scheduling, and lastly issues simple FIFO DRAM commands. They mainly target CPU-GPU systems, where the applications on the two sides can interfere heavily and exhibit very different row-buffer access behavior.

The second approach is to change the DRAM page closure management. [81] first proposes tracking history at a per-DRAM-page granularity and uses a two-level predictor to decide whether to close a row buffer. [82] extends this proposal with a one-level, low-cost access-based predictor (ABP) that closes a row buffer after a specified number of accesses or when a page conflict occurs; they argue that the number of accesses to a given DRAM page is a better indicator for page closure than timer-based policies. [83], [84] propose application-aware page policies that assign different policies to different applications based on memory intensity and locality.

Of all the related works, [85] is the most closely related to ours. [85] proposed the row-based page policy (RBPP), which tracks row addresses and uses them as an indicator to decide whether or not to close the row buffer when the active memory request finishes. They use a few registers to record the most-accessed rows in a bank. For each recorded row address, a counter that is dynamically updated based on the access pattern determines whether or not the row buffer should be closed. They use an LRU scheme to replace old entries. We will show in the results section that using only LRU replacement is not accurate compared with replacement based on the row access count. More specifically, rows that are accessed only once or twice are very common in many workloads [39]; such rows tend to frequently replace entries in the most-accessed row registers (MARR), causing a poor hit ratio. Compared to their design, we use a general approach that adds a learning table to filter out those requests. The comparison results are presented later in this chapter.


When a hot row is identified, [85] proposed modifying the DRAM page policy, which cannot reduce latency when accesses to two hot rows interleave with each other. We argue that by simply caching the hot rows, without modifying the DRAM mapping, we can still harness the latency gain of row-buffer hits while avoiding the complexity of modifying the DRAM.

Figure 6-1. Hot row pattern of 10 workloads. Each panel plots the requested row number against HybridSim cycles for one workload (bt, bwaves, gems, lbm, leslie3d, mg, milc, soplex, swim, and zeusmp).

Figure 6-1 shows a slice of the row-buffer accesses observed when simulating the No-Cache Design mentioned in Chapter 3. The y-axis is the row number of the request address and the x-axis is the simulation cycle; the slice shown spans 200K cycles. It is interesting to observe that 8 out of 10 workloads exhibit a strong pattern in which some rows are accessed much more than others. However, because accesses to different rows are interleaved, we do not observe a high row-buffer hit rate. One way to improve the row-buffer hit rate would be to schedule all requests to the same row together; another is to prefetch blocks inside the hot rows for later use.

6.2 Hot Row Buffer Design and Results

We propose a simple but effective design that utilizes the hot-row pattern observed in the previous section. Two data structures are needed: a Learning Table (LT) and a Hot-Row Buffer (HRB). The LT captures new hot rows based on recently referenced rows, and the HRB buffers the hot rows. Each entry in the LT records the address tag of a row along with a shift register. The size of the LT is n and the width of the shift register is m. The HRB has k rows, and each HRB entry has a 2KB data buffer along with a reference counter that counts the number of references to the row.

When a reference hits a row in the HRB, the respective counter is incremented. When a requested row is not in the HRB, the row enters the LT if it is not already there; if the LT is full, the request is dropped. The m bits of the shift register are initialized to all '0's when the row first enters the LT, and a hit to a row in the LT shifts a '1' into the corresponding shift register.

When the shift-out bit is '1', a hot row is identified. The newly identified row is fetched into the HRB, replacing the row with the smallest reference count. The counter of the new row is initialized to the middle of the counter's range, while the other counters are decremented by one. The row is then dropped from the LT, creating an empty slot. If a request misses the HRB and hits the LT but the shift-out bit is '0', no action is taken. The process continues until a maximum of h hot rows are identified; after the last hot row is inserted into the HRB, the entire LT is wiped out and the process starts over.
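A behavioral sketch of this identification logic is given below in Python. The parameter values, the counter width, and the dictionary-based bookkeeping are assumptions for illustration; the data payload of each HRB entry, the LT wipe after h hot rows, and the LRU/HRB selection described next are omitted.

class HotRowIdentifier:
    def __init__(self, lt_size=16, shift_width=4, hrb_rows=64, ctr_bits=8):
        self.lt_size = lt_size
        self.shift_width = shift_width
        self.hrb_rows = hrb_rows
        self.ctr_max = (1 << ctr_bits) - 1
        self.lt = {}                    # row tag -> shift register (list of bits)
        self.hrb = {}                   # row tag -> reference counter

    def access(self, row):
        if row in self.hrb:                            # HRB hit: bump the counter
            self.hrb[row] = min(self.hrb[row] + 1, self.ctr_max)
            return "hrb_hit"
        if row not in self.lt:
            if len(self.lt) >= self.lt_size:           # LT full: drop the request
                return "dropped"
            self.lt[row] = [0] * self.shift_width      # shift register starts at 0s
            return "lt_insert"
        reg = self.lt[row]
        shifted_out = reg.pop()                        # observe the shift-out bit
        reg.insert(0, 1)                               # shift a '1' in on an LT hit
        if shifted_out == 0:
            return "lt_hit"
        # Shift-out bit is '1': a hot row has been identified.
        del self.lt[row]
        if len(self.hrb) >= self.hrb_rows:             # evict the least-used row
            victim = min(self.hrb, key=self.hrb.get)
            del self.hrb[victim]
        for r in self.hrb:                             # age the other counters
            self.hrb[r] = max(self.hrb[r] - 1, 0)
        self.hrb[row] = self.ctr_max // 2              # start at a middle value
        return "hot_row"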

As for the replacement policy of the HRB, one may think that LRU replacement would be a good choice. We therefore add a competing directory that implements the LRU replacement policy. Periodically, we compare the number of hits under each scheme and select the one with more hits to be applied in the next time period. At the end of a period, the directory recording the losing scheme is updated to the contents of the other, and the hit/miss counters of the two schemes are reset to measure the next period. A dynamic method with a saturating counter is used to switch between the two schemes. Figure 6-2 shows the flow diagram when a new row address arrives; note that the LT restart and request-dropping steps are omitted from the diagram.
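The period-based selection between the two replacement schemes could look like the sketch below; the period length, counter width, and the exact synchronization of the losing directory are assumptions for illustration.

class ReplacementDuel:
    def __init__(self, period=10000, ctr_bits=3):
        self.period = period
        self.ctr_max = (1 << ctr_bits) - 1
        self.ctr = self.ctr_max // 2          # saturating counter, starts midway
        self.hits = {"hrb": 0, "lru": 0}
        self.accesses = 0

    def record(self, hrb_hit, lru_hit):
        self.hits["hrb"] += int(hrb_hit)
        self.hits["lru"] += int(lru_hit)
        self.accesses += 1
        if self.accesses % self.period == 0:
            # Nudge the counter toward the scheme with more hits this period;
            # the losing directory would be copied from the winner here.
            if self.hits["hrb"] > self.hits["lru"]:
                self.ctr = min(self.ctr + 1, self.ctr_max)
            elif self.hits["lru"] > self.hits["hrb"]:
                self.ctr = max(self.ctr - 1, 0)
            self.hits = {"hrb": 0, "lru": 0}

    def use_hrb(self):
        # Apply the HRB (reference-count) replacement when the counter leans
        # toward it; otherwise apply LRU.
        return self.ctr > self.ctr_max // 2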

Figure 6-2. Hot row identification and update. A new row address is first checked against the HRB and then against the LT; when a hot row is identified, both the LRU and HRB directories are updated.

Figure 6-3 shows the detailed hit ratio of different HRB configurations for the 10 workloads when capturing reused blocks. We can observe that the hybrid scheme provides a reasonable hit ratio across all workloads, which matches the hot-row pattern observed in Figure 6-1. The x-axis is the number of rows stored in the directory for each scheme, and the y-axis is the hit ratio of the HRB/LRU directory over all row accesses when that many rows are stored. With more entries recorded, the hit ratio increases noticeably for all workloads. Milc captures more than 90% of row accesses when more than 16 entries are used. Gems and zeusmp perform poorly, with only a slight increase in hit ratio as more entries are added. In general, the LRU scheme performs better than the HRB scheme when enough entries are available; however, when space is limited and only a few entries can be recorded, the HRB scheme performs better.

Table 6-1 summarizes the hit ratio of the 10 workloads when using 64 entries. Seven out of 10 workloads have a hit ratio over 50%, and four workloads have a hit ratio larger than 90%. The geometric mean hit ratio of the 10 workloads reaches 57.7%. We believe this can be well utilized to achieve a reasonable performance gain.

Table 6-1. Hit ratio for the hybrid scheme of 10 workloads using 64 entries.

Workload    Hit ratio (%)
bt          91.2
bwaves      91.3
gems        26.3
lbm         76.7
leslie3d    49.4
mg          96.8
milc        94.9
soplex      53.1
swim        76.1
zeusmp      13.3
Geomean     57.7

Figure 6-3. Results of the proposed hybrid scheme. For each of the 10 workloads (bt, bwaves, gems, lbm, leslie3d, mg, milc, soplex, swim, and zeusmp), each panel plots the hit percentage of the LRU, Hybrid, and HRB directories against the number of hot rows recorded.

Previous proposals have focused only on different ways of modifying row-buffer management in order to reduce row-buffer conflicts. We argue that we can simply cache these hot rows and avoid the complexity of changing the DRAM organization. Of course, caching an entire row not only wastes space when only a few blocks in the row are accessed, but also puts a heavy burden on bandwidth. We can instead cache part of the row: if the cached blocks in the row are accessed, we can start prefetching the remaining blocks in the row. This can be easily implemented with a simple stream prefetcher.

Figure 6-4. Block column difference within a row for 10 workloads (bt, bwaves, gems, lbm, leslie3d, mg, milc, soplex, swim, and zeusmp). Each panel plots the count of requests against the block distance from the first access to the row.

Figure 6-4 shows the column difference inside a row buffer. Upon the first request to a certain row, we record its address as the base address; when later requests access the same row, we record the column difference between the new request and the base. We observe a high regularity in which the following blocks of a row are subsequently accessed. Take lbm as an example: most requests in a row have increasing differences relative to the first access to the row, and as the difference grows, the number of requests decreases. This motivates the use of a simple stream prefetcher in our performance evaluation; a sketch is given below. A strong pattern similar to lbm can be observed for 7 out of 10 workloads. For milc, there is a gap between odd and even differences. For swim and zeusmp, the differences are more scattered, and soplex and zeusmp also show many requests with negative differences. For such patterns, a more advanced prefetching algorithm could be applied.
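The in-row stream prefetcher itself reduces to simple address arithmetic; the sketch below assumes 64B blocks and a 2KB row, and leaves the issuing of the actual prefetch requests abstract.

BLOCK_SIZE = 64        # bytes per cache block (assumed)
ROW_SIZE = 2048        # bytes mapped to one 2KB row-buffer entry (assumed)

def in_row_prefetch(block_addr, degree=4):
    # Return the addresses of the next `degree` blocks in the same row,
    # stopping at the row boundary.
    row_base = block_addr - (block_addr % ROW_SIZE)
    row_end = row_base + ROW_SIZE
    prefetches = []
    addr = block_addr + BLOCK_SIZE
    while len(prefetches) < degree and addr < row_end:
        prefetches.append(addr)
        addr += BLOCK_SIZE
    return prefetches

# Example: in_row_prefetch(0x1000) returns [0x1040, 0x1080, 0x10C0, 0x1100],
# the next four 64B blocks within the same 2KB row.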

6.3 Performance Evaluation

In this section, we present the IPC improvement of hot-row prefetching over the base design without it, along with the row-buffer hit ratio.

Figure 6-5. IPC speedup, row-buffer hit improvement, and cache hit rate improvement (in percent) for the 10 workloads and their geometric mean.

The total cost of the learning table is calculated as follows. Each learning table costs 16 × (0.5 + 3) = 56B per bank (3B records the row address and 0.5B the shift register). The HRB/LRU scheme records 1–64 rows, each with a 1B reference counter, which costs at most 2 × 64 × (1 + 3) = 512B per bank. The total overhead is therefore 568B per bank for recording 64 rows, or about 8.9KB for all 16 banks, which can easily fit into the last-level cache. Once the hot rows are identified, we implement a simple stream prefetcher that prefetches the next 4 blocks inside the hot row.
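For reference, the storage arithmetic above works out as follows (sizes in bytes):

\begin{align*}
\text{LT per bank} &= 16 \times (0.5 + 3) = 56\,\text{B},\\
\text{HRB/LRU directories per bank} &= 2 \times 64 \times (1 + 3) = 512\,\text{B},\\
\text{total per bank} &= 56 + 512 = 568\,\text{B},\\
\text{total for 16 banks} &= 16 \times 568 = 9088\,\text{B} \approx 8.9\,\text{KB}.
\end{align*}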

Figure 6-5 shows the IPC and row-buffer hit improvements of adding a hot-row prefetcher for the 10 workloads. The overall IPC improvement is 9.1%, ranging from 3.1% to 16.4% over individual workloads, and 6 out of 10 workloads see an IPC improvement over 10%. Bt, mg, and milc have the most significant improvements, while zeusmp has the least.

We also observe an average of 6.3% more row-buffer hits, and in general the IPC improvement is consistent with the increase in row-buffer hits. Prefetching the next 4 blocks upon a hot-row access can create 4 additional row-buffer hits, which improves the row-buffer hit rate. If any of these prefetched blocks are later requested, their access time is reduced, lowering the average memory access time. Among the 10 workloads, milc and mg gain more than 10% in row-buffer hits, while gems, soplex, and zeusmp gain less than 5%.

We also show the last-level cache hit rate improvement. The prefetched blocks may become cache hits when they arrive at the last-level cache before being requested. We observe a slight increase in the L3 cache hit rate for all workloads, with an average increase of 3.9%. Note that even cache misses can be served faster because the blocks are prefetched earlier.

Table 6-2 shows the usage of the prefetched blocks. The average usage is about 52%, which means that on average 2 out of the 4 prefetched blocks are requested later. Seven out of 10 workloads have a usage above 50%. Among them, bt, milc, and mg have the highest prefetch usage, which is consistent with their IPC results, while zeusmp, soplex, and gems have the lowest. Useless prefetched blocks consume bandwidth and pollute the cache, which reduces the improvement that prefetching brings.


Table 6-2. Prefetch usage for 10 workloads using a simple stream prefetcher.

Workload    Prefetch hit percentage (%)
bt          80
bwaves      71
gems        32
lbm         62
leslie3d    51
mg          72
milc        83
soplex      33
swim        54
zeusmp      23
Geomean     52

Table 6-3. Sensitivity study on prefetch granularity.

Percentage (%)   IPC speedup   Row-buffer hit improvement   Cache hit rate improvement
prefetch_2       5.4           3.7                          2.1
prefetch_4       9.1           6.3                          3.8
prefetch_6       3.1           8.3                          0.4
prefetch_8       -2.8          10.1                         -2.2

Table 6-3 shows a sensitivity study on the stream prefetch granularity. Compared to prefetching only two blocks on every row access, prefetching 4 blocks improves the row-buffer hit ratio and the IPC by reducing the last-level cache miss rate. However, the improvement shrinks and even turns negative when 6 or more blocks are prefetched: prefetching more blocks on every row access puts a heavy burden on the bandwidth and hurts LLC performance.

6.4 Conclusion

From the results, we conclude that row-buffer accesses exhibit a strong locality pattern. However, because those accesses are interleaved with one another, we tend to see more row-buffer conflicts than row-buffer hits. Based on the hot-row pattern, we can easily identify frequently used rows. We evaluate the proposed scheme for capturing hot rows with traces collected from Marssx86 and DRAMSim2. The learning table filters out requests to rows that are accessed only a few times, and the competition between the LRU and HRB directories yields a hybrid scheme that provides the best hit ratio. Results show that an average of 57.7% of row accesses can be captured using 568B per bank. We also show that the simple LRU replacement used by the previous RBPP scheme is not effective when only a limited number of hot rows can be recorded.

We further implement a simple stream prefetcher to harness the hot-row pattern captured by our learning table design. The results demonstrate that with a simple prefetch-in-row stream prefetcher, we achieve an IPC speedup of 9.1% and a row-buffer hit rate improvement of 6.3%.


CHAPTER 7 SUMMARY

This dissertation proposes four works targeting improved memory hierarchy performance. The proposed ideas can be readily applied to real-world systems, and the evaluations show that they improve system performance significantly. With the increasing demand for high-performance memory systems, these techniques are valuable.

In the first work, we present a new caching technique that caches a portion of the large tag array of an off-die stacked DRAM cache. Due to its size, the tag array is impractical to fit on-die, so caching a portion of the tags reduces the need to go off-die twice for each DRAM cache access. In order to reduce the space requirement for cached tags and to obtain high coverage of DRAM cache accesses, we proposed and evaluated a sector-based Cache Lookaside Table (CLT) to record cache tags on-die. The CLT reduces the space requirement by sharing a sector tag among a number of consecutive cache blocks and uses location (way) pointers to locate the blocks in the off-die cache data array. The large sector also exploits spatial locality for better coverage. In comparison with the Alloy cache, the ATCache, and the TagTables approaches, the average improvement of CLT is in the range of 4–15%.

In the second work, a new Bloom Filter is introduced to filter L3 cache misses so that L1, L2, and L3 caches can be bypassed, shortening the L3 miss penalty in a 4-level cache hierarchy. The proposed Bloom Filter applies a simple indexing scheme that decodes the low-order block address to determine the hashed location in the BF array. To provide better hashing randomization, partial index bits are XORed with the adjacent higher-order address bits. In addition, with certain combinations of the limited block address bits, multiple index functions can be selected to further reduce the false-positive rate. Performance evaluation using SPEC2006 benchmarks on an 8-core system with 4-level caches shows that the proposed simple hashing scheme can lower the average false-positive rate below 5% for filtering L3 misses and improve the average IPC by 10.5% over a system without L3 filtering and runahead. Furthermore, the proposed BF indexing scheme resolves an inherently difficult problem in using the Bloom Filter to identify L3 cache misses: due to dynamic updates of the cache content, a counting Bloom Filter would normally be necessary to keep the BF array in sync with the cache. A unique advantage of the proposed BF index is that it includes the cache index as a superset. As a result, the blocks that are hashed to the same BF array location are allocated in the same cache set; by searching the tags in the set when a block is replaced, the corresponding BF bit can be reset correctly without using expensive counters.

The third work proposes a new guided multiple-hashing method, d-ghash. Unlike previous approaches, which select the least-loaded bucket to place each key progressively, d-ghash achieves global balance by allocating keys to buckets only after all keys have been placed into buckets d times using d independent hash functions. D-ghash calculates the achievable perfect balance and removes duplicate keys to reach this goal. Meanwhile, d-ghash reduces the number of bucket accesses for looking up a key by creating as many empty buckets as possible without disturbing the balance. Furthermore, d-ghash uses a table to encode the hash function ID of the bucket where a key is located, to guide the lookup and avoid extra bucket accesses. Simulation results show that d-ghash achieves better balance than existing approaches and reduces the number of bucket accesses significantly.

The fourth work digs into the details of DRAM row-buffer accesses. By collecting the memory accesses of most of the SPEC CPU workloads, we find that requests to the DRAM rows in each bank are not evenly distributed: some rows in a bank receive more requests than others. We call these more frequently accessed rows "hot rows." Based on the observed hot-row pattern, we propose a simple design that uses a learning table to capture these hot rows. Once a hot row is identified, we sequentially prefetch blocks in that row upon a row access. We evaluate this idea using a simple stream prefetcher, and the results show a 9.1% average IPC improvement over a design without a prefetcher.

The proposed ideas have been verified by the results presented in each individual chapter.


LIST OF REFERENCES

[1] P. Hammarlund, "The Fourth-Generation Intel Core Processor," in MICRO, 2014.

[2] Y. Deng and W. P. Maly, "2.5-dimensional VLSI system integration.," in IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2005.

[3] K. Banerjee, S. Souri, P. Kapur and K. C. Saraswat, "3-D ICs: a novel chip design for improving deep-submicrometer interconnect performance and systems-on-chip integration," in Proceedings of the IEEE, 2001.

[4] G. Loh and M. D. Hill, "Efficiently enabling conventional block sizes for very large die- stacked DRAM caches," in MICRO, 2011.

[5] J. Sim, G. Loh, H. Kim , M. Connor and M. Thottehodi, "A Mostly-Clean DRAM Cache for Effective Hit Speculation and Self-balancing Dispatch," in MICRO, 2012.

[6] X. Jiang and e. al., "CHOP: Adaptive filter-based DRAM caching for CMP server platforms," in HPCA, 2010.

[7] G. Loh, "Extending the Effectiveness of 3D-stacked DRAM Cache with An Adaptive Multi-queue Policy," in MICRO, 2009.

[8] G. Loh and M. Hill, " Supporting very large DRAM caches with compound access scheduling and MissMaps," in MICRO, 2012.

[9] M. K. Qureshi and G. Loh, "Fundamental Latency Trade-offs in Architecting DRAM Caches," in MICRO, 2012.

[10] L. Zhao, R. Iyer, R. Illikkal and D. Newell, "Exploring DRAM cache architectures for CMP server platforms," in ICCD, 2007.

[11] D. Woo, N. Seong, D. Lewis and H. Lee, "An Optimized 3D-stacked Memory Architecture by Exploiting Excessive High-density TSV Bandwidth," in HPCA, 2010.

[12] T. Kgil, S. D'Souza, A. Saidi, N. Binkert, R. Dreslinski, T. Mudge, S. Reinhardt and K. Flautner, "PicoServer: Using 3D stacking technology to enable a compact energy efficient chip multiprocessor," in ASPLOS, 2006.

[13] C. Liu, I. Ganusov and M. Burtscher, "Bridging the Processor-memory Performance Gap with 3D IC Technology," in IEEE Design & Test of Computers, 2005.

[14] G. Loh, "3D-stacked Memory Architectures for Multi-core Processors," in ISCA, 2008.

[15] C. Chou, A. Jaleel and M. K. Qureshi, "A Two-Level Memory Organization with Capacity of Main Memory and Flexibility of Hardware-Managed Cache," in MICRO, 2014.


[16] X. Dong, Y. Xie, N. Muralimanohar and N. P. Jouppi, "Simple but effective heterogeneous main memory with on-chip memory controller support," in SC, 2010.

[17] G. Loh and et al., "Challenges in Heterogeneous Die-Stacked and Off- Chip Memory Systems," in SHAW, 2012.

[18] J. Pawlowski, "Hybrid Memory Cube: Breakthrough DRAM Performance with a Fundamentally Re-Architected DRAM Subsystem," in Hot Chips, 2011.

[19] J. Sim, A. Alameldeen, Z. chishti, C. Wilkerson and H. Kim, "Transparent Hardware Management of Stacked DRAM as Part of Memory," in MICRO, 2014.

[20] F. Botelho, R. Pagh and N. Ziviani, "Simple and space-efficient minimal perfect hash functions," in WADS, 2007.

[21] B. Vocking, "How Asymmetry Helps Load Balancing," in IEEE Symp. on FOCS, 1999.

[22] A. Kirsch and M. Mitzenmacher, "On the Performance of Multiple Choice Hash Tables with Moves on Deletes and Inserts," in Communication, Control, and Computing, 2008.

[23] F. Hao, M. Kodialam and T. V. Lakshman, "Building high accuracy bloom filters using partitioned hashing," in SIGMETRICS, 2007.

[24] B. Bloom, "Space / Time Trade-offs in Hash Coding with Allowable Errors," in Comm. ACM, 1970.

[25] T. Wang. http://burtleburtle.net/bob/hash/integer.html.

[26] V. Srinivasan and G. Varghese, "Fast Address Lookups Using Controlled Prefix Expansion," in ACM Transactions on Computer Systems, 1999.

[27] A. Patel, F. Afram, S. Chen and K. Ghose, "MARSSx86: A Full System Simulator for x86 CPUs," in DAC, 2011.

[28] M. T. Yourst, PTLsim: A Cycle Accurate Full System x86-64 Microarchitectural Simulator, ISPASS, 2007.

[29] J. Stevens, P. Tschirhart, M. Chang, I. Bhati, P. Enns, J. Greensky, Z. Chishti, S. Lu and B. Jacob, "An Integrated Simulation Infrastructure for the Entire Memory Hierarchy: Cache, DRAM, Nonvolatile Memory, and Disk," in ITJ, 2013.

[30] Qemu http://wiki.qemu.org/Main_Page.

[31] Y. Chou, Y. Fahs and S. Abraham, Microarchitecture optimizations for exploiting memory-level parallelism, 2004: ISCA.

[32] T. Carson, H. W. and E. L., Sniper: exploring the level of abstraction for scalable and accurate parallel multi-core simulation, HPCA, 2011.


[33] J. Henning., "SPEC CPU2006 memory footprint," in ACM SIGARCH Computer Architecture News, 2007.

[34] B. Rogers, A. Krishna, G. Bell, K. Vu, X. Jiang and Y. Solihin, "Scaling the Bandwidth Wall: Challenges in and Avenues for CMP Scaling," in ISCA, 2009.

[35] N. Madan, L. Zhao, N. Muralimanohar, A. Udipi, R. Balasubramonian, R. Iyer, S. Makineni and D. Newell, "Optimizing Communication and Capacity in a 3D Stacked Reconfigurable Cache Hierarchy," in HPCA, 2009.

[36] S. Lai, "Current Status of the Phase Change Memory and Its Future," in IEDM, 2003.

[37] S. Franey and M. Lipasti, Tag Tables, HPCA, 2015.

[38] C. Huang and V. Nagarajan, "ATCache: Reducing DRAM cache Latency via a Small SRAM Tag Cache," in PACT, 2014.

[39] D. Jevdjic, S. Volos and B. Falsafi, "Die-stacked DRAM caches for servers: hit ratio, latency, or bandwidth? have it all with footprint cache," in ISCA, 2013.

[40] D. Jevdjic, G. Loh, C. Kaynak and B. Falsafi, "Unison Cache : A Scalable and Effective Die-Stacked DRAM Cache," in MICRO, 2014.

[41] N. Hardavellas, M. Ferdman, B. Falsafi and A. Ailamaki, "Reactive NUCA: Near- optimal block placement and replication in distributed caches," in ISCA, 2009.

[42] J. Liptay, "Structural aspects of the System/360 Model 85, Part II: The cache," in IBM Syst.J., 1968.

[43] S. Przybylski, "The Performance Impact of Block Sizes and Fetch Strategies," in ISCA, 1990.

[44] J. B. Rothman and A. J. Smith, "The Pool of Subsectors Cache Design," in ICS, 1999.

[45] A. Seznec, "Decoupled sectored caches: conciliating low tag implementation cost and low miss ratio," in ISCA, 1994.

[46] S. Somogyi, T. Wenish, A. Ailamaki, B. Falsafi and A. Moshovos, "Spatial memory streaming," in ISCA, 2006.

[47] G. Loh and M. Hill, "Addendum for “Efficiently enabling conventional block sizes for very large die-stacked DRAM caches”," 2011.

[48] J. Meza, J. Chang, H. Yoon, O. Mutlu and P. Ranganathan, "Enabling Efficient and Scalable Hybrid Memories Using Fine-Granularity DRAM Cache Management," in CAL, 2012.

[49] W. Chou, Y. Nain, H. Wei and C. Ma, "Caching tag for a large scale cache computer memory system," in US Patent 5813031, 1998.


[50] T. Wicki, M. Kasinathan and R. Hetherington, "Cache tag caching," in US Patent 6212602, 2001.

[51] M. Qureshi, "Memory access prediction," in US Patent 12700043, 2011.

[52] "Cacti 6.5," http://www.hpl.hp.com/research/cacti.

[53] A. Seznec and P. Michaud, A case for (partially) tagged geometric history length branch prediction, Journal of Instruction Level Parallelism, 2006.

[54] A. Broder and M. Mitzenmacher, "Network applications of Bloom filters: A survey," in Internet Math, 2004.

[55] J. K. Mullin, "Optimal semijoins for distributed database systems," in IEEE Transactions on Software Engineering, 1990.

[56] L. Fan, P. Cao, J. Almeida and A. Broder, "Summary cache: a scalable widearea Web cache sharing protocol," in IEEE Transactions on Networking, 2000.

[57] R. Rajwar, M. Herlihy and K. Lai, "Virtualizing Transactional Memory," in ISCA, 2005.

[58] A. Roth, "Store Vulnerability Window (SVW): Re-Execution Filtering for Enhanced Load Optimization," in ISCA, 2005.

[59] A. Moshovos, "RegionScout: Exploiting Coarse Grain Sharing in Snoop-Based Coherence," in ISCA, 2005.

[60] J. Peir, S. Lai, S. Lu, J. Stark and K. Lai, "Bloom filtering cache misses for accurate data speculation and prefetching," in ICS, 2002.

[61] X. Li, D. Franklin, R. Bianchini and F. T. Chong, "ReDHiP: Recalibrating Deep Hierarchy Prediction for Energy Efficiency," in IPDPS, 2014.

[62] A. Broder and M. Mitzenmacher, "Using Multiple Hash Functions to Improve IP Lookups," in INFOCOM, 2001.

[63] S. Demetriades, S. C. M. Hanna and R. Melhem, "An Efficient Hardware-Based Multi- hash Scheme for High Speed IP Lookup," in HOTI, 2008.

[64] H. Song, S. Dharmapurikar, J. Turner and J. Lockwood, "Fast Hash Table Lookup Using Extended Bloom Filter: An Aid to Network," in SIGCOMM, 2005.

[65] Z. Huang, D. Lin, J.-K. Peir and S. M. I. Alam, "Fast Routing Table Lookup Based on Deterministic Multi-hashing," in ICNP, 2010.

[66] "Routing Information Service," http://www.ripe.net/ris.

[67] C. Hermsmeyer, H. Song, R. Schlenk, R. Gemelli and S. Bunse, "Towards 100G packet processing: Challenges and technologies," in Bell Labs Technical Journal, 2009.


[68] S. Lumetta and M. Mitzenmacher, "Using the Power of Two Choices to Improve Bloom Filter," in Internet Mathematics, 2007.

[69] Z. Huang, J.-K. Peir and S. Chen, "Approximately-Perfect Hashing: Improving Network Throughput through Efficient Off-chip Routing," in INFOCOM, 2011.

[70] Y. Azar, A. Broder, A. Karlin and E. Upfal, "Balanced Allocations," in Theory of Computing, 1994.

[71] R. Sprugnoli, "Perfect hashing functions: a single probe retrieving method for static sets," in ACM Comm., 1977.

[72] F. F. Rodler and R. Pagh, "Cuckoo Hashing," in ESA, 2001.

[73] S. Dharmapurikar, P. Krishnamurthy and D. Taylor, "Longest Prefix Matching Using Bloom Filters," in SIGCOMM, 2003.

[74] B. Chazelle, R. Kilian and A. Tal, "The Bloomier filter: an efficient data structure for static support lookup tables," in ACM SIAM, 2004.

[75] J. Hasan, S. Cadambi, V. Jakkula and S. Chakradhar, "Chisel: A Storage efficient, Collision-free Hash-based Network Processing Architecture," in ISCA, 2006.

[76] M. L. Fredman and J. Komlos, "On the Size of Separating Systems and Families of Perfect Hash Functions," in SIAM. J. on Algebraic and Discrete Methods, 1984.

[77] Z. Zhang et al., "A permutation-based page interleaving scheme to reduce row-buffer conflicts and exploit data locality," in MICRO, 2000.

[78] D. Kaseridis, J. Stuecheli and L. John, "Minimalist Openpage: A DRAM Page-mode Scheduling Policy for the manycore Era," in MICRO, 2011.

[79] T. Moscibroda and O. Mutlu, "Memory Performance Attacks: Denial of Memory Service in Multi-Core Systems," in USENIX, 2007.

[80] R. Ausavarungnirun, K. Chang, L. Subramanian, G. H. Loh and M. O., "Staged memory scheduling: achieving high performance and scalability in heterogeneous systems," in ISCA, 2012.

[81] Y. Xu, A. Agarwal and B. Davis, "Prediction in dynamic sdram controller policies," in SAMOS, 2009.

[82] M. Awasthi, D. W. Nellans, R. Balasubramonian and A. Davis, "Prediction Based DRAM Row-Buffer Management in the Many-Core Era," in PACT, 2011.

[83] M. Jeong, D. Yoon, D. Sunwoo, M. Sullivan, I. Lee and M. Erez, "Balancing DRAM Locality and Parallelism in Shared Memory CMP system," in HPCA, 2012.


[84] M. Xie, D. Tong, Y. Feng, K. Huang and X. Cheng, "Page Policy Control with Memory Partitioning for DRAM Performance and Power Efficiency," in ISLPED, 2013.

[85] X. Shen, F. Song, H. Meng et al., "RBPP: A Row Based DRAM Page Policy for the Many-core Era," in ICPADS, 2014.

[86] A. Jaleel, "Memory Characterization of Workloads Using Instrumentation-Driven Simulation," in VSSAD, 2007.

[87] P. Rosenfeld, E. Cooper-Balis and B. Jacob, "DRAMSim2: A Cycle Accurate Memory System Simulator," in CAL, 2011.

[88] A. J. Smith, "Line (block) size choice for memories," in IEEE transactions on Computers, 1987.

[89] "NAS Parallel Benchmarks," http://www.nas.nasa.gov/publications/npb.html.

[90] A. Brodnik and J. I. Munro, "Membership in Constant Time and Almost-Minimum Space," in SIAM Journal on Computing, 1999.

[91] W. Starke and et al, "The cache and memory subsystems of the IBM POWER8 processor," in IBM J. Res & Dev. Vol.59(1) , 2015.

[92] D. Patterson and J. Hennessy, Computer Architecture: A Quantitative Approach, Morgan Kaufmann, 2011.

[93] D. Kaseridis, J. Stuecheli and L. K. John, Minimalist open-page: a DRAM page-mode scheduling policy for the many-core era, MICRO, 2011.

[94] T. Moscibroda and O. Mutlu, Memory Performance Attacks: Denial of Memory Service in Multi-Core Systems, USENIX, 2007.

[95] R. Ausavarungnirun, K. Chang, L. Subramanian, G. H. Loh and O. Mutlu, Staged memory scheduling: achieving high performance and scalability in heterogeneous systems, ISCA, 2012.

[96] M. Hill, "A case for direct-mapped caches," in IEEE computer, 1988.


BIOGRAPHICAL SKETCH

Xi Tao received his Ph.D. in computer engineering from the University of Florida in the fall of 2016. He received his B.S. degree in Electronic Engineering and Information Science from the University of Science and Technology of China in 2007. His research interests include computer architecture, caches, and Bloom Filter applications.
