The Pennsylvania State University The Graduate School College of Engineering

ARCHITECTURAL TECHNIQUES TO ENABLE RELIABLE AND

HIGH PERFORMANCE MEMORY HIERARCHY IN CHIP

MULTI-PROCESSORS

A Dissertation in Computer Science and Engineering by Amin Jadidi

© 2018 Amin Jadidi

Submitted in Partial Fulfillment of the Requirements for the Degree of

Doctor of Philosophy

August 2018 The dissertation of Amin Jadidi was reviewed and approved∗ by the following:

Chita R. Das Head of the Department of Computer Science and Engineering Dissertation Advisor, Chair of Committee

Mahmut T. Kandemir Professor of Computer Science and Engineering

John Sampson Assistant Professor of Computer Science and Engineering

Prasenjit Mitra Professor of Information Sciences and Technology

∗Signatures are on file in the Graduate School.

ii Abstract

Constant technology scaling has enabled modern computing systems to achieve high degrees of thread-level parallelism, making the design of a highly scalable and dense memory hierarchy a major challenge. During the past few decades, SRAM has been the dominant technology for building on-chip cache hierarchies, while DRAM has been used in main memory to satisfy applications' capacity demands. However, both of these technologies face serious scalability and power consumption problems. While there has been an enormous amount of research addressing the drawbacks of these technologies, researchers have also been considering non-volatile memory technologies to replace SRAM and DRAM in future processors. Among the different non-volatile technologies, Spin-Transfer Torque RAM (STT-RAM) and Phase Change Memory (PCM) are the most promising candidates to replace SRAM and DRAM, respectively. Researchers believe that the memory hierarchy in future computing systems will consist of a hybrid combination of current technologies (i.e., SRAM and DRAM) and non-volatile technologies (e.g., STT-RAM and PCM). While each of these technologies has its own unique features, each also has specific limitations. Therefore, in order to achieve a memory hierarchy that satisfies all the system-level requirements, we need to study each of these memory technologies. In this dissertation, the author proposes several mechanisms to address some of the major issues with each of these technologies. To relieve the wear-out problem in a PCM-based main memory, a compression-based platform is proposed, in which the compression scheme collaborates with wear-leveling and error correction schemes to further extend the memory lifetime. To mitigate the write disturbance problem in PCM, a new write strategy as well as a non-overlapping data layout is proposed to manage the thermal disturbance among adjacent cells.

For the on-chip cache, however, we would like to achieve a scalable, low-latency configuration. To this end, the author proposes a morphable SLC-MLC STT-RAM cache that dynamically trades off larger capacity against lower latency, based on the application's demand. While adopting scalable memory technologies such as STT-RAM improves the performance of cache-sensitive applications, the cache-thrashing problem will still exist in applications with very large data working sets. To address this issue, the author proposes a selective caching mechanism for highly parallel architectures, and also introduces a criticality-aware compressed last-level cache which is capable of holding a larger portion of the data working set while keeping the access latency low.

iv Table of Contents

List of Figures x

List of Tables xvii

Acknowledgments xix

Chapter 1 Introduction 1 1.1 Construction of PCM-based Main Memories ...... 2 1.1.1 Limited Lifetime of PCM-based Main Memories ...... 3 1.1.2 Reliable Write Operations in PCM-based Main Memories . 3 1.2 Construction of STTRAM-based Last-level Caches ...... 4 1.3 Construction of SRAM-based Last-level Caches ...... 5 1.3.1 Managing Cache Conflicts in Highly Parallel Architectures 5 1.3.2 Balancing Capacity and Latency in Compressed Caches . . 6

Chapter 2 Background and Related Work 7 2.1 PCM-based Main Memory Organization ...... 7 2.1.1 Wear-out Faults ...... 8 2.1.2 Write Disturbance Faults ...... 9 2.1.3 Resistance Drift Faults ...... 10 2.2 STT-RAM Last-Level Cache Organization ...... 10 2.3 SRAM-based Last-Level Cache Organization ...... 11 2.3.1 Addressing Conflicts in Last-Level Cache ...... 11 2.3.2 Data Compression in Last-Level Cache ...... 11

v Chapter 3 Extending the Lifetime of PCM-based Main Memory 12 3.1 Introduction ...... 13 3.2 Background and Related Work ...... 16 3.2.1 PCM Basics and Baseline Organization ...... 17 3.2.2 Wear-out in PCM ...... 18 3.2.3 Prior Work on Improving PCM Lifetime ...... 19 3.3 The Proposed Approach: Employing Compression for Robust PCM Design ...... 21 3.3.1 The Proposed Mechanism and its Lifetime Impacts . . . . 23 3.3.1.1 Impact of compression on bit flips ...... 25 3.3.1.2 The need for intra-line wear-leveling ...... 28 3.3.1.3 Interaction with error tolerant schemes ...... 29 3.3.1.4 Using a worn-out block after changes in compression ratio ...... 30 3.3.2 Metadata management ...... 31 3.4 Experimental Setup ...... 32 3.5 Evaluation ...... 35 3.5.1 Memory Lifetime Analysis ...... 35 3.5.1.1 Impact of using compression (Comp) ...... 35 3.5.1.2 Impact of intra-line wear-leveling (Comp+W) . . 37 3.5.1.3 Impact of advanced hard-error tolerance (Comp+WF) 37 3.5.1.4 Summary ...... 38 3.5.1.5 Number of Tolerable errors ...... 38 3.5.2 Performance Overhead Analysis ...... 39 3.5.3 Sensitivity to the Effect of Process Variation ...... 40 3.5.4 Efficiency of Our Design for MLC-based PCMs ...... 40 3.6 Conclusions ...... 41

Chapter 4 Tolerating Write Disturbance Errors in PCM Devices 42 4.1 Introduction ...... 43 4.2 Background and Related Work ...... 45 4.2.1 PCM Basics ...... 46 4.2.2 Fault Models in PCM ...... 48 4.3 Experimental Methodology ...... 53 4.4 The Proposed Approach: Enabling Reliable Write Operations in Super Dense PCM ...... 56 4.4.1 Intra-line Scheme: Preventing Write Disturbance Errors Along the Word-line ...... 56

vi 4.4.2 Inter-line Scheme: Tolerating Write Disturbance Errors Along the Bit-line ...... 61 4.4.3 The Interaction Between the Inter-line and Intra-line Schemes 70 4.5 Conclusion ...... 71

Chapter 5 Performance and Power-Efficient Design of Dense Non-Volatile Cache in CMPs 72 5.1 Introduction ...... 73 5.2 Overview of STT-RAM Technology ...... 76 5.2.1 Single-Level Cell (SLC) Device ...... 76 5.2.2 Multi-Level Cell (MLC) Device ...... 77 5.2.2.1 Two-step Write Operation ...... 78 5.2.2.2 Two-step Read Operation ...... 78 5.2.3 SLC versus MLC: Device-Level Comparison ...... 79 5.3 MLC STT-RAM Cache: The Baseline ...... 80 5.3.1 Stripped Data-to-Cell Mapping ...... 83 5.3.2 Performance Analysis ...... 85 5.3.3 Enhancements for the Stripped MLC Cache ...... 88 5.3.3.1 The Need for Dynamic Associativity ...... 88 5.3.3.2 The Need for a Cache Line Swapping Policy . . . 90 5.3.3.3 Overhead of the Counters ...... 91 5.4 Experimental Methodology ...... 92 5.4.1 Infrastructure ...... 92 5.4.2 Configuration of the Baseline System ...... 93 5.4.3 Workloads ...... 95 5.5 Evaluation Results ...... 96 5.5.1 Performance Analysis ...... 97 5.5.2 Energy Consumption Analysis ...... 98 5.5.3 Lifetime Analysis ...... 98 5.5.3.1 Comparison with Some Prior Works on Reducing Cache Misses ...... 99 5.6 Related Work on STT-RAM-based Caches ...... 101 5.7 Conclusions ...... 102

Chapter 6 Improving the Performance of Cache Hierarchy through Selective Caching 103 6.1 Introduction ...... 104 6.2 Background ...... 106

vii 6.3 Problem Formulation ...... 108 6.3.1 Kernel-Based Analysis ...... 108 6.3.2 Proposed for Selective Caching . . . . . 111 6.4 Dynamic Cache Reconfiguration ...... 113 6.4.1 Proposed Microarchitecture for Run-Time Sampling . . . . 114 6.4.2 Kernel Characterization ...... 117 6.4.3 Determining the Ideal Configuration ...... 118 6.5 Evaluation ...... 120 6.6 Experimental Result ...... 122 6.6.1 Dynamism ...... 122 6.6.2 Performance ...... 124 6.6.3 Sensitivity Study ...... 125 6.6.4 Cache Miss-Rate ...... 126 6.6.5 Comparison with Warp-Throttling Techniques ...... 127 6.6.6 Comparison with Reuse Distance-Based Caching Policies . 128 6.7 Related Work ...... 130 6.8 Conclusion ...... 131

Chapter 7 Improving the Performance of Last-Level Cache through a Criticality-Aware Cache Compression 133 7.1 Introduction ...... 134 7.2 Background and Related Works ...... 136 7.2.1 Baseline Platform ...... 136 7.2.2 Cache Compression ...... 136 7.3 Compression Implications ...... 137 7.3.1 Latency versus Capacity ...... 137 7.4 Criticality-Aware Compression ...... 139 7.4.1 Data Criticality ...... 139 7.4.2 Non-Uniform Compression ...... 141 7.4.3 Relaxing the Decompression Latency ...... 142 7.5 Methodology ...... 143 7.6 Evaluation ...... 144 7.6.1 Compression Ratio ...... 144 7.6.2 Misses-Per-Kilo-Instructions (MPKI) ...... 145 7.6.3 Average Data Access Latency ...... 146 7.6.4 Performance ...... 146 7.7 Conclusion ...... 147

viii Chapter 8 Conclusions and Future Work 148 8.1 Conclusions ...... 148 8.2 Future Research Directions ...... 150 8.2.1 Lifetime-aware Performance Improvement ...... 150 8.2.2 Secure PCM-based Main Memory ...... 150

Bibliography 151

ix List of Figures

1.1 Memory hierarchy in a typical chip multi-processor...... 2

3.1 Distribution of updated bits for consecutive writes to a specific and randomly-chosen 64-byte memory block for the gobmk application. 14 3.2 (a) PCM cell (b) PCM-based DIMM with ECC chip (c) DW circuit 17 3.3 The average compressed data size for BDI, FPC, and best of the two...... 22 3.4 An example of the proposed mechanism...... 24 3.5 Percentage of write-backs that exhibit increased, decreased and untouched bit flips after compression...... 25 3.6 Probability that two consecutive writes to the same block have different sizes after compression...... 27 3.7 Changes in the written data size after compression for three representative memory blocks for bzip2 and hmmer...... 27 3.8 The flow of our mechanism...... 28 3.9 abc...... 30 3.10 Lifetime of different systems, normalized to the baseline system. . 36 3.11 Size of the compressed memory block for different memory addresses. For each memory address, we have considered the size of the largest compressed data block over all the write operations to that address. 36 3.12 The average number of faulty cells in a failed 512-bit memory block. 39 3.13 Lifetime of a Comp+WF system, normalized to the baseline system. (CoV=0.25) ...... 40

4.1 (a) PCM cell (b) PCM-based DIMM with 8 chips (c) Differential write (DW) circuit ...... 46

x 4.2 Write disturbance and vulnerable data patterns. Only one of the bits at row i is updated from 1 to 0 which potentially makes its four adjacent cells susceptible to the write disturbance error. However, only two of the adjacent cells are in the RESET state (red cells), and may experience a disturbance and turn into the SET state (1 binary value)...... 49 4.3 Average number of vulnerable cells in a memory line (i.e., intra-line) after each write operation...... 51 4.4 Cascading effect of the Verify-and-Correct (VnC) scheme. After each write operation, we perform a read to detect and modify all the disturbed cells (yellow cells in the figure). In each iteration of VnC, we RESET the faulty cells (red cells in the figure) which itself can cause more faulty cells along the word-line...... 51 4.5 Average number of vulnerable cells in adjacent memory lines (i.e., inter-line) after each write operation...... 52 4.6 Average number of updated chips in a write operation. Each 64-byte write operation (i.e., a cache line write-back from LLC) is spread over 8 chips within a memory rank (shown in Figure 4.1.b). . . . . 57 4.7 Average number of vulnerable cells in each chip, for each write operation. Each 64-byte write operation (i.e., a cache line write-back from LLC) is spread over 8 chips within a memory rank (shown in Figure 4.1.b)...... 57 4.8 Percentage of extra write operations (at the cell-level) imposed by our proposed modified-DW scheme...... 59 4.9 Exploiting data compression to place adjacent memory lines in a non-overlapping fashion (i.e., alternate left-aligned and right-aligned). The write disturbance errors are contained within the overlapping areas...... 63 4.10 Compression ratio of FPC algorithm [1]...... 63 4.11 Reduction in the number of vulnerable cells along the bit-line through non-overlapping data placement...... 64 4.12 Updating the dividing index used in our non-overlapping data layout to achieve uniform chip wear-out...... 65 4.13 Integrating BCH code in read-intensive addresses...... 66 4.14 Reduction in the number of redundant read operations (i.e., pre-write and post-write read commands)...... 69 4.15 Performance improvement achieved by our proposed inter-line scheme over the SD-PCM technique...... 69

xi 4.16 Performance improvement achieved by integrating our proposed intra-line and inter-line schemes into the system, compared to the memory system supported by the DIN and SD-PCM techniques. . 70

5.1 (a) SLC STT-RAM cell consisting of one access transistor and one MTJ storage carrier (“1T1J”); (b) Binary states of an MTJ: two ferromagnetic layers with anti-parallel (or parallel) direction indicate a logical ‘1’ (or ‘0’) state; (c) Resistance levels for 2-bit STT-RAM: four resistance levels are obtained by combining the states of two MTJs having different threshold current...... 76 5.2 (a) MLC STT-RAM cell with serial MTJs: the soft-domain on top of the hard-domain; (b) MLC STT-RAM cell with parallel MTJs . 77 5.3 (a) Schematic of the read and write access circuits in a 2-bit MLC cache array; (b,c) write operation transition model: first, the MSB is written to the hard-domain and if the LSB differs from MSB a small current is driven to switch its direction; (d) three resistance references in a sense amplifier, each between the two neighboring resistance states...... 79 5.4 2-bit STT-RAM layout and a schematic view of the cache array, read and write circuits. Because of the technology compatibility issues, the STT-RAM last-level cache is built on top of the cores. 82 5.5 An illustration of stacked versus stripped data-to-cell mapping for an 8-bit data array (four 2-bit MLC cells) and 2-bit tag arrays (in SLC). In stacked data-to-cell mapping scheme, data bits of the same cache block (each one has 8 bits) are mapped to 4 independent memory cells (i.e., 2 in each cell). In stripped mapping, each memory cell contains only one bit of each cache block – so, each 8-bit cache block spans over 8 memory cells. For instance, lines ‘B’ and ‘D’ are mapped to hard-domains (MTJ1), whereas lines ‘A’ and ‘C’ use soft-domains (MTJ2). Note that, tag arrays are made of SLC cells. Therefore, both of the data-to-cell mapping schemes have the same tag structure...... 83

xii 5.6 Performance comparison of the SLC- and MLC-based STT-RAM caches in terms of the LLC miss rate and IPC (as a system-level metric) for four workloads from the SPEC-CPU 2006 benchmark suite [2]. SLC-based configuration outperforms MLC-based configuration over the low miss-rate execution phases because of the faster read and write accesses in SLC format. On the other hand, MLC-based cache is more efficient during the high miss-rate phases, thanks to its larger capacity that can hold a larger portion of the working set...... 86 5.7 Comparison of the stripped MLC cache configuration with the stacked MLC configuration and the SLC format, each with the same die area. Stripped MLC configuration outperforms the SLC format in applications with high and medium L3 miss rates, because it increases the effective cache capacity in terms of lines and associativity. It is also better than the conventional stacked MLC cache as it constructs the fast read lines (i.e., FRHE)...... 87 5.8 Distribution of the missed accesses over LLC’s sets for 200 million instructions in four applications from the SPEC-CPU 2006 benchmark suite [2]. We see that, in three out of four applications, there are some sets that have few conflict misses, while some others are very stressed. Such access non-uniformity over different sets requires a fine-grained (i.e., at the set-level) tuning to optimize both the latency and capacity of the STT-RAM cache...... 88 5.9 Percent of memory blocks in cache with read-dominated, write-dominated, and non-dominated properties for a set of workloads from PARSEC-2 and SPEC-CPU 2006 programs...... 91 5.10 Percentage of IPC improvement for the proposed cache architecture with respect to the baselines. The proposed architecture has the capacity advantage of MLCs in applications with high misses (the first 9 programs) and medium misses (the next 13 programs). It also has the SLC access latency in the 8 applications with low misses (at the left side). 97 5.11 Total energy consumption of the cache architectures normalized to the SLC baselines. It shows that the proposed architecture uses the low read and low write energy of FRHE and SRLE lines...... 99 5.12 Lifetime of the cache configurations normalized to the SLC baselines. Our proposed stripped scheme tries to act like SLCs to reach maximum lifetime...... 99

xiii 5.13 Comparison of our proposed cache with V-Way [3], Scavenger [4], and SBC [5] caches in terms of the percentage reduction in cache misses relative to the SLC cache in previous configuration. Note that cache sizes are set to have same die area. This figure shows that our technique is better than its counterparts in miss ratio, especially when the application requires large associativity...... 100

6.1 Target GPGPU architecture and the details of the computation hierarchy in a typical GPGPU application. Each streaming multi-processor (SM) has a private L1D cache. The L2 cache is logically shared but physically distributed among 6 memory channels which are connected to the SMs through an interconnection network (i.e., a crossbar) [6]...... 107 6.2 Impact of caching ratio on the system performance. IPC normalized to the baseline configuration (i.e., 100% caching ratio). X-axis represents the percentage of the memory requests that are allowed to access the L1D cache. Among these kernels, BFS and PVR2 do not achieve their optimal performance under the baseline configuration...... 110 6.3 Impact of caching ratio on the system performance. IPC normalized to the baseline configuration (i.e., 100% caching ratio). X-axis represents the percentage of the memory requests that are allowed to access the L2 cache. Among these kernels, BFS and TRA do not achieve their optimal performance under the baseline configuration. 110 6.4 Impact of the warp synchronization on selective warp caching. . . 112 6.5 Restricting the fraction of cacheable memory requests in the L1D and L2 caches based on the warp and SM granularities, respectively. L1D offers 25% caching ratio by caching 2 warps out of 8 warps. L2 cache has a 50% caching ratio by caching 16 SMs out of 32 SMs. . 113 6.6 Microarchitecture design of the monitoring hardware. In this figure, Set(4*i) captures the miss-rate for baseline configuration. Similarly, Set(4*i+1), Set(4*i+2), and Set(4*i+3) capture the miss-rate for 75%, 50%, and 25% partial caching configurations, respectively. . 115

xiv 6.7 The ideal caching configurations for the L1D and L2 caches determined by our proposed mechanism. Configuration of the L1D is determined after the first sampling phase. However, another sampling needs to be performed to accurately capture the characteristics of the L2 cache based on the new L1D configuration. Some applications (e.g., PVR) can be seen in multiple settings because they have multiple kernels, each of which has a different demand. Since some of the applications consist of many kernels, different kernels of the same application are not individually indexed to keep the figure readable...... 123 6.8 IPC normalized to the baseline configuration where all the threads are allowed to access the cache. The L1, L2, and L1&L2 sections represent the applications which experienced reconfigurations in their L1D (only), L2 (only), and both the L1D and L2 caches, respectively...... 125 6.9 (L2=256KB, L1D=32KB) IPC normalized to the baseline. The L1, L2, and L1&L2 sections represent the cases which experienced reconfigurations in their L1D (only), L2 (only), and both the L1D and L2 caches, respectively...... 126 6.10 (L2=128KB, L1D=16KB) IPC normalized to the baseline. The L1, L2, and L1&L2 sections represent the cases which experienced reconfigurations in their L1D (only), L2 (only), and both the L1D and L2 caches, respectively...... 126 6.11 Cache miss-rate reduction after reconfiguring the cache hierarchy to resolve the cache-thrashing issue. (This figure only contains the kernels from Table 6.3 with cache-thrashing problem in the L1D and/or L2 caches) ...... 127 6.12 Comparing the impact of a throttling scheme with our proposed scheme on resolving cache-thrashing. (This figure only contains the kernels from Table 6.3 with cache-thrashing problem in the L1D and/or L2 caches) ...... 128 6.13 IPC normalized to the baseline configuration where all the threads are allowed to access the cache. The L1, L2, and L1&L2 sections represent the applications which experienced reconfigurations in their L1D (only), L2 (only), and both the L1D and L2 caches, respectively. CBWT (Cache Bypassing Warp Throttling) is the proposed scheme in [7]...... 129

xv 7.1 Typical tiled multi-core architecture. Tiles are interconnected into a 2-D mesh. Each tile contains a core, private L1I and L1D caches, a shared L2 cache bank, and a router for data movement between nodes...... 137 7.2 Impact of compression on data access latency. WS1: working-set that fits in baseline. WS2: extra portion of the working-set that fits in compressed LLC...... 138 7.3 Impact of larger cache capacity versus longer cache access latency on the system performance, in terms of IPC...... 139 7.4 Configuration of the critical load predictor logic...... 140 7.5 Comparison of compression ratio for different schemes: BDI [8], FPC [1], SC2 [9], and Hybrid-Comp...... 145 7.6 Effect of data compression on the number of misses per kilo instruction at the last-level cache...... 145 7.7 Average LLC data access latency normalized to the baseline 4MB LLC...... 146 7.8 Comparison of weighted speedup: BDI–4MB [8], FPC–4MB [1], Hybrid-Comp–4MB, and uncompressed 8MB cache...... 147

xvi List of Tables

1.1 Comparison of different memory technologies [10]...... 2

3.1 The characteristics of BDI and FPC techniques...... 22 3.2 The system specification of our simulated system...... 32 3.3 Characteristics of the evaluated workloads. WPKI refers to the number of L2 write-backs per kilo instructions (per core). CR refers to the compression ratio, defined as the size of the compressed data divided by the original size. We use the best of BDI and FPC for compressing each block. H, M, and L refer to High, Medium, and Low compressibility, respectively...... 34 3.4 Final lifetime (months) in baseline and Comp+WF...... 39

4.1 The specification of our simulated system...... 55 4.2 Characteristics of the evaluated workloads...... 55

5.1 MLC STT-RAM compared to SLC cell model [11]...... 80 5.2 Sequence of transactions when accessing a 2-bit MLC cache with stripped data-to-cell mapping. FRHE (Fast Read High-Energy write) cache lines consist of hard-domains (MTJ2), and SRLE (Slow Read Low-Energy write) cache lines consist of soft-domains (MTJ1). 84 5.3 Main characteristics of our simulated CMPs...... 94 5.4 Evaluated L3 configurations...... 94 5.5 Characteristics of the evaluated workloads...... 95

6.1 Variation in the miss-rate of different L1D and L2 caches. . . . . 115 6.2 Baseline configuration...... 121 6.3 List of GPGPU benchmarks: CS and CI represent Cache-Sensitive and Cache-Insensitive kernels, respectively...... 121

7.1 Cache compression techniques...... 137 7.2 Main characteristics of simulated system...... 143

xvii 7.3 Characteristics of the evaluated workloads for last-level cache. . . . . 144

xviii Acknowledgments

I am deeply thankful to all the people who have provided intellectual contributions and motivational support to this dissertation. I have no words to describe all their help and support. Hence, a simple acknowledgment - but a sincere thanks from the bottom of my heart. Each and every one acknowledged below has taught me something meaningful.

Advisors, Co-advisors and Teachers
Chita R. Das
Mahmut T. Kandemir

Lab Mates
Adwait Jog
Onur Kayiran
Bikash Sharma
Nachiappan C. Nachiappan
Tuba Kesten
Mahshid Sedghi
Ashutosh Pattnaik
Prashanth Thinakaran
Xulong Tang
Tulika Parija
Haibo Zhang
Jashwant Raj Gunasekaran
Anup Sarma
Huaipan Jiang
Jagadish Kotra

xix Family and Friends
Abbas Jadidi (father)
Zahra Rezaei (mother)
Mina Jadidi (sister)
Shima Jadidi (sister)
Rachel Isaacs
Mohammad Arjomand
Morteza Karimzadeh
Azita Ranjbar
Nima Elyasi
Nima Zolghadr
Morteza Ramezani
Farshid Farhat
Diman Tootaghaj
Narges Shahidi
Mohammad K. Tavana

In conclusion, I recognize that this research would not have been possible without the financial assistance provided in part by NSF grants 1526750, 1763681, 1439057, 1439021, 1629129, 1409095, 1626251, 1629915, and a grant from and I would like to express my gratitude to them.

xx Dedication

This dissertation is dedicated to my parents for their endless love, support and encouragement throughout my life.

xxi Chapter 1

Introduction

Figure 1.1 depicts the high-level structure of the memory hierarchy in a typical multi-processor. During the last few decades, SRAM and DRAM have been used as the dominant technologies to build on-chip cache hierarchies and main memories in chip multi-processors (CMPs), respectively. However, both of these technologies face serious scalability and power consumption problems. While researchers have been looking for techniques to resolve those issues, they have also been studying alternative memory technologies that can satisfy the capacity, speed, and power consumption requirements of the next generation of processors. Among the alternatives, non-volatile memory technologies have been the most promising candidates to replace SRAM and DRAM. More precisely, Spin-Transfer Torque RAM (STT-RAM) and Phase Change Memory (PCM) are the leading technologies to be used as scalable alternatives for SRAM and DRAM in future systems [12–14], respectively. Table 1.1 compares different characteristics of these technologies. Since each of these memory technologies has its own unique features, memory designers believe that the memory hierarchy in future computing systems will consist of a hybrid combination of current technologies (i.e., SRAM and DRAM) and non-volatile technologies (e.g., STT-RAM and PCM). Therefore, in order to achieve a memory hierarchy that satisfies all the system-level requirements, we need to study each of these technologies and accordingly address their drawbacks. In the following, we discuss some of the most important concerns regarding these memory technologies and briefly describe our focus throughout this dissertation.

[Figure 1.1: Memory hierarchy in a typical chip multi-processor.]

Table 1.1: Comparison of different memory technologies [10].

  Features         SRAM        DRAM     STT-RAM                             PCM
  Density          Low         High     High                                Very high
  Speed            Very fast   Fast     Fast for read; Slow for write       Slow for read; Very slow for write
  Dynamic Power    Low         Medium   Low for read; Very high for write   Medium for read; High for write
  Leakage Power    High        Medium   Low                                 Low
  Non-volatility   No          No       Yes                                 Yes

1.1 Construction of PCM-based Main Memories

PCM has some advantages and disadvantages over DRAM. It has the advantages of lower power dissipation, higher density, and better scalability. The disadvantages include longer write latency and higher write energy, as well as limited write endurance. While PCM write latency and energy decrease as the cell size decreases, limited write endurance remains the main obstacle to adopting PCM in future commercial systems. Besides, in sub-20nm technologies, PCM faces the write disturbance problem [15], caused by the heat generated during the write operation. Therefore, in order to achieve a reliable PCM-based main memory, both the limited lifetime and write disturbance problems should be resolved.

1.1.1 Limited Lifetime of PCM-based Main Memories

To address the endurance problem in PCM, two categories of techniques have been proposed: (1) hard-error postponement and (2) hard-error tolerance techniques. Postponement techniques attempt to uniformly distribute write operations within the memory in order to achieve uniform wear-out at different levels [16]. On the other hand, error tolerance techniques are employed to keep the system working while there are some faulty cells within the memory lines [17–20]. Most previously proposed PCM-based memories also use write reduction schemes such as buffering [21] or differential writes (DW) [14] to reduce the number of cell writes. Although the DW scheme significantly decreases the number of bit flips for every write operation, the resultant bit-level updates usually have a fairly random pattern.
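To make the differential-write behavior concrete, the following minimal sketch (a behavioral illustration in Python, not the chip-level RMW circuit of [43]) counts which cells a write would actually program under DW; the random scatter of the flipped positions across the 512-bit block is exactly what the compression-based approach described next tries to confine.

```python
# Behavioral sketch of differential writes (DW): only cells whose value
# changes between the old and the new block contents are programmed.
import os

def dw_bit_flips(old_block: bytes, new_block: bytes):
    """Return the bit positions a differential write would program."""
    assert len(old_block) == len(new_block)
    flipped = []
    for byte_idx, (o, n) in enumerate(zip(old_block, new_block)):
        diff = o ^ n                          # bits that differ in this byte
        for bit in range(8):
            if diff & (1 << bit):
                flipped.append(byte_idx * 8 + bit)
    return flipped

old = os.urandom(64)                          # one 64-byte memory block
new = bytearray(old)
new[10] ^= 0xFF                               # update a few words of the block
new[40] ^= 0x0F
flips = dw_bit_flips(old, bytes(new))
print(len(flips), "of 512 cells programmed, at positions", flips)
```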

Dissertation Contribution: In the first part of this dissertation, we exploit data compression to further extend the memory lifetime. While data compression has traditionally been used to achieve larger capacity, we employ compression to limit the window of cell writes within a memory line. We then propose a platform in which data compression collaborates with the error correction and wear-leveling schemes to further extend the lifetime. Our proposed platform improves the memory lifetime by 4.3x on average and achieves an average lifetime of 79 months.

1.1.2 Reliable Write Operations in PCM-based Main Memories

The existing solutions for resolving write disturbance in PCM devices are ineffective. A simple and widely adopted solution is to allocate sufficient inter-cell distance in order to avoid thermal disturbance between the cells [15,22,23]. However, this approach results in significantly less memory capacity compared to what can theoretically be fabricated on a chip. A common approach that specifically addresses the write disturbance problem along the word-lines is to perform a read after the write operation to detect and correct potential errors [24]; this approach is known as the verify-and-correct (VnC) scheme. However, VnC can lead to cascading verification steps and significant performance loss in memory-intensive applications. On the other hand, the common approach to detecting write disturbance errors along the bit-lines is to issue pre-write and post-write read operations to the

adjacent memory lines [25], which is known as the SD-PCM technique. Therefore, for each write operation, the SD-PCM technique imposes four redundant read operations on the two adjacent memory lines, which can lead to considerable performance loss in memory-intensive applications.

Dissertation Contribution: In the second part of this dissertation, we propose a hybrid mechanism that consists of two general-purpose and cost-effective schemes to address the write disturbance problem within each memory line and also between adjacent memory lines. (1) We propose two programming strategies in order to avoid cascading VnC steps. Based on the number of vulnerable cells in each chip, we determine which programming strategy should be used for each write operation. Our proposed programming strategies guarantee the correctness of the written data (i.e., no errors along the word-line) while keeping the performance loss below 1%. (2) We employ data compression to achieve a non-overlapping data layout between adjacent memory lines. Such a non-overlapping layout reduces the need for redundant read operations by 65% on average and improves the system performance by an average of 13%, compared to the SD-PCM scheme.
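The placement idea behind scheme (2) can be pictured with the short sketch below (the 512-bit row width and the compressed sizes are illustrative assumptions): even-indexed lines are stored left-aligned and odd-indexed lines right-aligned, so two vertically adjacent compressed lines overlap only in the middle of the row, and only writes falling in that overlap can disturb cells of the neighboring line.

```python
# Sketch of the non-overlapping layout for compressed memory lines.
# Even lines are left-aligned, odd lines are right-aligned within a row.

LINE_BITS = 512  # assumed physical row width

def occupied_range(line_index: int, compressed_bits: int):
    """Bit range [start, end) a compressed line occupies in its row."""
    if line_index % 2 == 0:                          # left-aligned
        return (0, compressed_bits)
    return (LINE_BITS - compressed_bits, LINE_BITS)  # right-aligned

def overlap(range_a, range_b):
    """Bits where two vertically adjacent lines overlap (disturbance-prone)."""
    start = max(range_a[0], range_b[0])
    end = min(range_a[1], range_b[1])
    return max(0, end - start)

# Two adjacent lines compressed to 300 and 280 bits overlap in only 68 bits,
# so most writes need no pre-/post-write verification reads on the neighbor.
a = occupied_range(6, 300)
b = occupied_range(7, 280)
print(a, b, "overlapping bits:", overlap(a, b))
```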

1.2 Construction of STTRAM-based Last-level Caches

STT-RAM has zero leakage power, offers almost 4× higher density than SRAM, and has a small read access latency and high endurance. While Single-Level Cell (SLC) STT-RAM cache hierarchies are well studied [12, 26–32], little attention has been paid to exploring the potential of Multi-Level Cell (MLC) STT-RAM caches in CMPs. As the new generation of workloads requires larger on-chip cache capacities, the natural next step would be to use MLC STT-RAM, but its advantage in doubling the storage density comes with a number of serious shortcomings in terms of lifetime, performance, and energy consumption.

Dissertation Contribution: To address these issues, in the third part of this dissertation we present a novel cache design based on MLC STT-RAM that exploits the asymmetric nature of MLC STT-RAM to build cache lines with heterogeneous performance; that is, half of the cache lines are read-friendly, while the other half are write-friendly. This asymmetry in read and write latencies is then used by a line migration policy in order to reduce the high latency of the baseline MLC-based cache. Furthermore, in order to extend the lifetime, we propose a mechanism that deactivates some of the ways in underutilized cache sets, which in turn enables us to convert those cache lines from MLC to SLC mode. Such a partial state transition balances the cache capacity against the access latency.

1.3 Construction of SRAM-based Last-level Caches

Workloads in the next generation of computing systems are expected to be highly data-intensive. Processing power is also steadily increasing, and major manufacturers are planning to integrate hundreds of cores on a die. In such multi-core/multi-threaded systems, computer architects employ high-capacity on-chip cache hierarchies to reduce data access latency and main memory bandwidth utilization. However, applications with large data working sets still experience high cache miss rates due to the relatively small cache size, leading to frequent cache conflicts. In the following, we briefly discuss our proposed techniques to mitigate this problem in multi-core/multi-threaded systems.

1.3.1 Managing Cache Conflicts in Highly Parallel Architectures

The memory system in general-purpose graphics processing units (GPGPUs) is very similar to that of modern multi-core processors: there are two levels of on-chip cache, and below them a main memory module. In such platforms, many multi-threaded applications have inherent data locality but exhibit poor cache utilization because of frequent cache conflicts that happen as a result of the huge number of memory accesses issued by thousands of concurrently running threads. A heuristic to resolve this problem is to reduce the number of accesses to the cache by restricting the number of running threads [33, 34]. However, this approach leaves other shared resources, such as the memory bandwidth and computing cores, underutilized.

Dissertation Contribution: Alternatively, in the fourth part of this dissertation, we propose a selective caching mechanism which dynamically adjusts the number of threads that are allowed to access the cache over different execution phases.

The proposed cache management policy effectively controls the access traffic to both the L1D and L2 caches without sacrificing thread-level parallelism.
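As a rough illustration of warp-granularity selective caching (the warp count and the way warps are mapped to the cacheable subset are hypothetical simplifications; the actual policy is driven by the run-time sampling hardware described in Chapter 6), a chosen caching ratio simply determines which warps may allocate in the L1D while requests from the remaining warps bypass it:

```python
# Sketch of selective (partial) caching at warp granularity.
# caching_ratio is the fraction of warps allowed to allocate in the L1D;
# the ratio itself would be chosen at run time, per kernel.

WARPS_PER_SM = 48          # illustrative value

def is_cacheable(warp_id: int, caching_ratio: float) -> bool:
    """Allow the first caching_ratio fraction of warps to use the L1D."""
    allowed = int(WARPS_PER_SM * caching_ratio)
    return (warp_id % WARPS_PER_SM) < allowed

def route_request(warp_id: int, caching_ratio: float) -> str:
    """Route a memory request either through the L1D or around it."""
    return "L1D" if is_cacheable(warp_id, caching_ratio) else "bypass"

# With a 25% caching ratio, 12 of the 48 warps are allowed to use the L1D.
print(sum(is_cacheable(w, 0.25) for w in range(WARPS_PER_SM)), "warps cached")
print("warp 30 ->", route_request(warp_id=30, caching_ratio=0.25))
```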

1.3.2 Balancing Capacity and Latency in Compressed Caches

Data compression has been used as an approach to achieve larger on-chip cache capacities (e.g., [1, 8, 9, 35–37]). In order to improve the system performance in such platforms, it is vital to achieve a fine balance between the extra capacity gained through compression and the extra access latency it imposes. Studying previous compression schemes shows that compression mechanisms always sacrifice one for the other: sophisticated compression schemes are capable of achieving larger capacities, but at the cost of longer latencies, and vice versa.

Dissertation Contribution: In the fifth part of this dissertation, we demonstrate that, in designing a compressed cache, data criticality should be considered as a third design parameter, along with compression ratio and decompression latency. While typical compression schemes decide whether to store a cache block in compressed or uncompressed format based only on the content of the cache block, our proposed mechanism also considers the criticality (i.e., latency sensitivity) of the data blocks.
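The decision made on each LLC fill can be sketched as follows (the predictor interface and its threshold are hypothetical placeholders, and zlib merely stands in for a hardware compressor such as BDI or FPC; Chapter 7 describes the actual critical-load predictor): blocks predicted to be latency-critical stay uncompressed to avoid the decompression delay, while non-critical blocks are compressed to grow the effective capacity.

```python
# Sketch of a criticality-aware compression decision for an LLC fill.
# predict_critical() is a stand-in for the critical-load predictor.
import zlib

def predict_critical(pc: int, stall_history: dict) -> bool:
    """Hypothetical predictor: treat loads whose PC recently stalled the
    pipeline more than a threshold number of times as latency-critical."""
    return stall_history.get(pc, 0) > 2          # illustrative threshold

def store_block(block: bytes, pc: int, stall_history: dict) -> dict:
    if predict_critical(pc, stall_history):
        # Critical data: keep uncompressed to avoid decompression latency.
        return {"compressed": False, "data": block}
    payload = zlib.compress(block)
    # Compress only if it actually saves space.
    if len(payload) < len(block):
        return {"compressed": True, "data": payload}
    return {"compressed": False, "data": block}

entry = store_block(b"\x00" * 64, pc=0x400f3c, stall_history={})
print("compressed:", entry["compressed"], "stored bytes:", len(entry["data"]))
```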

6 Chapter 2

Background and Related Work

In this chapter, we discuss the details of a typical SRAM/STT-RAM last-level cache as well as a PCM-based main memory system [13,38,39]. We also provide an overview of the most recent studies on these memory technologies. The related work is categorized into different sections based on its relevance to each of our proposals in this dissertation.

2.1 PCM-based Main Memory Organization

PCM stores binary data in a chalcogenide material (i.e., Ge2Sb2Te5, or GST for short), in either a high-resistance (RESET) or low-resistance (SET) state. Reading a PCM cell's contents requires applying a low current for a short period of time. The write process, however, depends on the written value: a SET pulse is a low-amplitude, long-duration current, while a RESET pulse is a high-amplitude, short-duration current. In contrast to a SET, a RESET is fast and consumes more energy. The write operation in PCM is the underlying reason for the most common fault models in this memory. In order to achieve a reliable system, each of these fault models should be addressed. In the following, we discuss each of them and review previously proposed solutions.

7 2.1.1 Wear-out Faults

Similar to other non-volatile memories, PCM has the problem of limited write endurance. This problem is primarily due to the heating and cooling process during write operations – after a finite number of writes, the cell loses its programmability. According to a recent ITRS report [40], SLC PCM devices at a 9nm technology node will fail after around 10^7 SET-to-RESET transitions. To cope with the limited endurance in PCM, prior studies take different approaches. We can categorize the relevant studies into two main groups: (i) mechanisms that postpone hard errors, and (ii) mechanisms that correct hard errors.
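As a back-of-the-envelope illustration of why this endurance budget is a concern (all numbers below are assumptions chosen for the example, not results from this dissertation), consider how long an ideally wear-leveled PCM module would last under a sustained stream of line writes:

```python
# Back-of-the-envelope PCM lifetime estimate under ideal wear-leveling.
# All parameters are illustrative assumptions.

endurance = 1e7               # writes per cell before wear-out (ITRS-style figure)
capacity_bytes = 16 * 2**30   # assumed 16 GB module
line_bytes = 64               # write granularity (one cache line)
write_rate = 1e8              # assumed sustained line writes per second

total_line_writes = endurance * (capacity_bytes / line_bytes)
lifetime_seconds = total_line_writes / write_rate
print(f"~{lifetime_seconds / (3600 * 24 * 365):.1f} years with perfect wear-leveling")
# Write-hot, poorly-leveled blocks would fail far sooner, hence the need for
# both postponement and tolerance techniques.
```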

Hard-Error Postponement: One can improve the lifetime of a PCM system by reducing write traffic through exploiting write buffers [21], using encoding schemes [41], or applying compression [42]. Even recently proposed PCM prototypes/chips have an embedded read-modify-write (RMW) circuit [43] (shown in Figure 4.1b) which performs differential writes (DW) to reduce bit flips, improve PCM lifetime, and reduce energy consumption. The PCM memory controller may use wear-leveling algorithms in order to further postpone reaching the PCM lifetime limits. The idea is to uniformly spread the write traffic over all memory blocks, avoiding premature wear-out due to write-intensive memory blocks [16].
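The sketch below conveys only the general wear-leveling idea, periodically shifting the logical-to-physical line mapping so that write-hot addresses do not keep hitting the same cells. It is a simplified rotation, not the specific algorithm of [16], and a real design must also migrate the stored data whenever the mapping changes.

```python
# Sketch of rotation-based wear-leveling: periodically shift the mapping
# from logical lines to physical lines so write-hot addresses move around.

class RotatingWearLeveler:
    def __init__(self, num_lines: int, writes_per_shift: int = 100):
        self.num_lines = num_lines
        self.writes_per_shift = writes_per_shift
        self.offset = 0          # current rotation of the address space
        self.writes = 0

    def physical_line(self, logical_line: int) -> int:
        return (logical_line + self.offset) % self.num_lines

    def on_write(self, logical_line: int) -> int:
        self.writes += 1
        if self.writes % self.writes_per_shift == 0:
            # Shift the mapping (a real design would also move the data).
            self.offset = (self.offset + 1) % self.num_lines
        return self.physical_line(logical_line)

wl = RotatingWearLeveler(num_lines=1 << 20)
# A write-hot logical line lands on different physical lines over time.
print(sorted({wl.on_write(42) for _ in range(1000)}))
```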

Hard-Error Tolerance (Correction): Even if a PCM-based main memory is supported by advanced fault-postponement mechanisms, we cannot achieve perfect wear-leveling due to process variation, meaning that some PCM cells will wear out much faster than others. Therefore, we need mechanisms that keep the system working even if a few cells become faulty within a memory block. To achieve this, SECDED (Single Error Correction, Double Error Detection) schemes can be used. Relying on the error detection property of hard errors in PCMs, several solutions have been proposed in the literature for error correction (e.g., ECP [17], SAFER [18], FREE-p [19], and Aegis [20]).

8 2.1.2 Write Disturbance Faults

When programming a PCM cell, particularly when resetting a cell, we apply a high-amplitude, short-duration current to raise the GST temperature above the melting point. The heat generated by the RESET operation may spread beyond the target cell and reach its neighboring cells, which can consequently disturb the resistivity of those cells. Previous studies have shown that write disturbance becomes a major reliability problem below 20nm, where the inter-cell space is much smaller [15, 23, 44]. In the following, we describe the occurrence of write disturbance errors along the bit-lines and word-lines.

Intra-Line: write disturbance along the word-line: At the circuit level, the Verify-and-Correct (VnC) technique [24] performs an extra read operation after the completion of the write, in order to detect and correct any possible write disturbance errors. The VnC technique RESETs the disturbed cells after the verification process. However, each RESET operation can cause more faulty cells in a cascading fashion. At the architecture level, Jiang et al. [22] propose a Data encoding-based Insulation technique (DIN) that exploits data encoding to reduce the number of vulnerable cells by manipulating the data pattern. Similarly, Tavana and Kaeli [45] propose a partitioning-based encoding scheme to mitigate write disturbance.
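The cascading behavior of VnC can be illustrated with the toy model below. The disturbance rule (a RESET pulse may flip an immediately adjacent cell that is itself in the RESET state into SET) follows the fault model summarized above, while the disturbance probability and the line length are purely illustrative.

```python
# Toy model of Verify-and-Correct (VnC) along one word-line.
import random

DISTURB_PROB = 0.3   # illustrative per-neighbor disturbance probability

def reset_cells(line, positions, rng):
    """RESET the given cells; each RESET may disturb a RESET-state neighbor."""
    for p in positions:
        line[p] = 0
        for q in (p - 1, p + 1):
            if 0 <= q < len(line) and line[q] == 0 and rng.random() < DISTURB_PROB:
                line[q] = 1          # neighbor in RESET state gets disturbed

def vnc_write(line, expected, rng):
    """Keep correcting disturbed cells until the line matches `expected`."""
    rounds = 0
    while True:
        wrong = [i for i, (a, b) in enumerate(zip(line, expected)) if a != b]
        if not wrong:
            return rounds
        reset_cells(line, wrong, rng)    # each correction may disturb again
        rounds += 1

rng = random.Random(1)
expected = [0] * 32                      # the data that was written
line = [0] * 32
line[10] = line[20] = 1                  # cells disturbed by the original write
print("VnC iterations needed:", vnc_write(line, expected, rng))
```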

Inter-Line: write disturbance along the bit-line: A simple yet widely adopted approach to mitigate the inter-line write disturbance problem is to allocate sufficiently large inter-cell space along the bit-line. The major issue with large inter-cell space is considerable capacity loss [15,23]. SD-PCM scheme [25] focuses on this category of write disturbance errors. SD-PCM issues pre-write and post-write read operations to the memory lines adjacent to the line being written. By having the values of the adjacent memory lines, before and after the write operation, SD-PCM can detect and correct any disturbed PCM cells. However, SD-PCM affects the system performance considerably, because for each write operation it issues 4 redundant reads. Decongest [46], on the other hand, detects and remaps write-intensive pages to a disturbance-free part of the memory. While this scheme can reduce the probability of having inter-line errors, it still needs a supporting mechanism to guarantee reliable writes for the remaining memory addresses, and it also leads to quick wear-out of the disturbance-free part of memory.

9 2.1.3 Resistance Drift Faults

In multi-level cell (MLC) PCM, we are capable of storing multiple data bits in a single memory cell. For instance, in a 2-bit MLC device, the resistance range is split into four sections, representing the 00, 01, 10, and 11 binary values. The drift process starts once a PCM cell is programmed. Due to the thermally-affected atomic rearrangement of the amorphous structure, the resistance value of the cell increases over time, and this increase is accelerated as the chip temperature increases. Therefore, the stored value in a cell can change after a long period of idleness. Different techniques, such as periodic scrubbing [47, 48] and data encoding mechanisms [49], have been proposed to prevent the occurrence of such transient faults.
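This drift is commonly modeled with a power law of the form below (included here as general background; the drift exponent is small and depends on the programmed state and temperature, so the range shown is an illustrative assumption). It explains why a cell programmed close to a level boundary can cross into the neighboring level after enough idle time.

```latex
% Common power-law model of PCM resistance drift (illustrative exponent range).
% R(t_0) is the resistance right after programming; the exponent \nu grows
% with the amorphous fraction of the cell and with temperature.
R(t) = R(t_0)\left(\frac{t}{t_0}\right)^{\nu}, \qquad \nu \approx 0.001\text{--}0.1
```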

2.2 STT-RAM Last-Level Cache Organization

Two types of STT-RAM cell prototypes can be realized: Single-Level Cell (SLC) STT-RAM and Multi-Level Cell (MLC) STT-RAM. The SLC STT-RAM cell consists of one Magnetic Tunnel Junction (MTJ) component, which is used to store one bit of information. The MLC STT-RAM device, on the other hand, is typically composed of multiple MTJs and is used to store more than one bit of information in a single cell. Such increased density in MLCs comes at the cost of a linear increase in access latency and energy with respect to the cell storage level. Over the past few years, several device-level and architecture-level optimizations have been proposed that attempt to address the issues of high write latency/energy [12, 26–30] and limited endurance [31, 32] in SLC STT-RAM caches. However, little attention has been paid to exploring the potential of MLC STT-RAM last-level caches in multi-core systems. Regarding MLC STT-RAM, Chen et al. proposed a dense cache architecture using devices with parallel MTJs [28]. Although they use MLCs with parallel MTJs to obtain the lower write power (compared to serial MTJs) suitable for caches [50], a reliability comparison of these two devices shows that parallel devices confront serious challenges in nanometer technologies with large process variations [51]. Besides, some recent proposals have studied the effect of decoupling the bits of an MLC device (STT-RAM [52] or PcRAM [53]) on the performance, energy, and reliability of non-volatile memories.

10 2.3 SRAM-based Last-Level Cache Organization

While SRAM-based caches have been widely adopted in commercial products, we still observe high cache miss rates (due to frequent cache conflicts) in applications with large data working sets. There have been different categories of work that specifically address this issue. Below, we discuss two main groups of those works.

2.3.1 Addressing Conflicts in Last-Level Cache

Some mechanisms determine at compile time which addresses should be cached [54–59]. However, such compile-time approaches are not aware of run-time parameters and cannot detect cache thrashing during different phases of execution. From a different perspective (for GPGPU platforms), [33] and [34] propose thread-throttling schemes to deal with the contention problem by reducing the number of concurrently running warps. However, these mechanisms underutilize other shared resources such as the memory bandwidth and computing cores. Some other works specifically focus on data reuse distance to determine which addresses should be cached [60–64]. In this dissertation, we propose a partial caching mechanism which determines what percentage of the threads should be allowed to access the cache, based on the characteristics of the running kernel.

2.3.2 Data Compression in Last-Level Cache

While larger caches often reduce the number of cache misses, this potential benefit comes at the cost of higher power consumption, longer cache access latencies, and increased chip area. To resolve this issue, some prior works (e.g., [1,8,9,35–37]) use various data compression schemes to achieve larger capacity without suffering all the disadvantages of fabricating larger caches. In order to improve the system performance, it is vital to achieve a fine balance between the extra capacity achieved by data compression and the extra access delay it imposes. In this dissertation, we additionally introduce data criticality as a third parameter that should be considered in designing a compressed LLC.

11 Chapter 3

Extending the Lifetime of PCM-based Main Memory

Limited write endurance is the main obstacle standing in the way of using phase change memory (PCM) in future computing systems. While several wear-leveling and hard-error tolerant techniques have been proposed for improving PCM lifetime, most of these approaches assume that the underlying memory uses a very simple write traffic reduction scheme. In particular, most PCM prototypes/chips are equipped with an embedded circuit to support differential writes (DW) – on a write, only the bits that differ between the old and new data are updated. With DW, the bit-pattern of updates in a memory block is usually random, which limits the opportunity to exploit the resulting bit pattern for lifetime enhancement at an architecture level (e.g., using techniques such as wear-leveling and hard-error tolerance). This work focuses on this inefficiency and proposes a solution based on data compression. Employing compression can improve the lifetime of the PCM memory. Using state-of-the-art compression schemes, the size of the compressed data is usually much smaller than the original data written back to memory from the last-level cache on an eviction. By storing data in a compressed format in the target memory block, first, we limit the number of bit flips to fewer memory cells, enabling more efficient intra-line wear-leveling and error recovery; and second, the unused bits in the memory block can be reused as replacements for faulty bits given the reduced size of the data. It can also happen that for a portion of the

memory blocks, the resulting compressed data is not very small. This can be due to the increased data entropy introduced by compression, in which case the total number of bit flips increases over the baseline system. In this work, we present an approach that provides collaborative operation of data compression, differential writes, wear-leveling, and hard-error tolerant techniques targeting PCM memories. We propose approaches that reap the maximum benefits from compression, while also enjoying the benefits of techniques that reduce the number of high-entropy writes. Using an approach that combines these different solutions, our mechanism tolerates 2.9× more cell failures per memory line and achieves a 4.3× increase in PCM memory lifetime, relative to our baseline state-of-the-art PCM DIMM memory.

3.1 Introduction

During the last three decades, DRAM has been used as the dominant main memory technology in computing systems. However, as we enter the deep nanometer era, DRAM faces serious scalability and power consumption problems. These are well-documented problems [65,66] that have attracted several solutions, ranging from the device level [67] to the architecture level [68]. Researchers from both academia and industry have suggested that phase change memory (PCM) is one of the leading technologies to be used as a scalable alternative for DRAM in future systems [13, 14]. PCM has some advantages and disadvantages over DRAM. It has the advantages of lower power dissipation, higher density, and better scalability. The disadvantages include longer write latency and higher write energy, as well as limited write endurance. While PCM write latency and energy decrease as cell sizes decrease, write endurance remains the main obstacle to adopting PCM in future commercial systems.

The solutions for the PCM endurance problem can be categorized into two groups: 1. hard-error postponement and 2. hard-error tolerant techniques. Postponement techniques can be further divided into two sub-categories: 1. write traffic reduction schemes and 2. wear-leveling schemes. In recent years, several wear-leveling schemes [16] and hard-error tolerant techniques [17–20] have been proposed for PCM memories. However, little attention has been paid to techniques for memory


write reduction. Most previously proposed PCM memories use very simple schemes such as buffering [21] or differential writes (DW) [14] for this purpose. DW is a circuit-level mechanism supported by the PCM chips – each PCM chip has read-modify-write (RMW) logic [43] that reads old data from the target memory block and compares it with new values (bit by bit), in order to only write the bits that need to be updated. Although DW schemes significantly decrease the number of bit flips for every memory write-back, the resultant bit-level updates usually have a fairly random pattern (i.e., bit flips are randomly scattered over the entire memory block). Figure 3.1 illustrates this behavior, showing the number of bit-level updates from consecutive writes to a specific and randomly-chosen 64-byte memory block for the gobmk application – we can see that the update pattern is fairly random in nature.1 Because of this behavior, designers usually look at DW as an orthogonal mechanism, combined with other techniques implemented at higher levels. We believe such a collaborative design between low-level and high-level PCM endurance enhancement techniques is missing. Further, a multi-level approach can also be very useful in future technologies, where PCM lifetime will become even more challenging (due to huge process variations and/or the use of multi-bit PCM memories, which have lower endurance).

[Figure 3.1: Distribution of updated bits for consecutive writes to a specific and randomly-chosen 64-byte memory block for the gobmk application. The y-axis shows the number of bit flips per write (0 to 512); the x-axis shows consecutive writes.]

To this end, we suggest exploiting compression to decrease the size of the written data. Using compression, we limit the bit flips to a smaller window rather than the entire block. We refer to this window as the compression window throughout this study. The compression window confines bit flips to a small number of bits, and its size is variable, depending on the compressibility of the written data.

1 We observed similar bit flip patterns in other applications as well.
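A minimal sketch of the compression-window idea follows (the sliding policy here is a simplified placeholder for the mechanism detailed in Section 3.3): each compressed write-back touches only as many consecutive cells as its compressed size, and periodically sliding the window start spreads those writes over the whole 512-bit line.

```python
# Sketch of the "compression window": a write touches only the compressed
# bits, and the window start can slide to wear-level within the line.

LINE_BITS = 512

class CompressionWindow:
    def __init__(self, slide_period: int = 64):
        self.start = 0                     # current window start in the line
        self.writes = 0
        self.slide_period = slide_period   # illustrative sliding policy

    def cells_written(self, compressed_bits: int):
        """Return the cell indices this compressed write-back occupies."""
        self.writes += 1
        if self.writes % self.slide_period == 0:
            # Slide the window so pressure moves to other cells of the line.
            self.start = (self.start + compressed_bits) % LINE_BITS
        return [(self.start + i) % LINE_BITS for i in range(compressed_bits)]

win = CompressionWindow()
touched = set()
for _ in range(512):
    touched.update(win.cells_written(compressed_bits=160))
# Each individual write programs at most 160 cells, yet over time the
# sliding window spreads the wear across all cells of the line.
print(len(touched), "of", LINE_BITS, "cells used over 512 writes")
```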

Exploiting compression, we hope to improve the lifetime of each memory block from two perspectives:

1. When a bit fails, the memory controller has two options: (i) a large class of prior hard-error tolerant schemes (e.g., SAFER [18] or Aegis [20]) rely on partitioning, in which they try to partition the entire memory block such that each partition has at most one failed bit; finding such partitions is easier when errors are limited to a smaller window. (ii) If the hard-error technique in use cannot save the block, the remaining bits in the target block can be exploited as replacements for the failed bits. In other words, the memory controller can continue writing into the target block by sliding the compression window in order to have enough healthy cells for writing data.

2. In order to postpone fast wear-out in the bits within a compression window of a block, intra-line wear-leveling can be employed, thereby avoiding putting pressure on a limited number of bits, and hence, evenly distributing writes over the entire block. In such a design, wear-leveling within blocks can be implemented by sliding the compression window periodically. We will describe how the memory controller can achieve nearly perfect intra-line wear-leveling without using any dedicated write counters.

The memory controller can thus overcome the aforementioned limits of DW, using compression to extend PCM lifetime by revisiting intra-line wear-leveling and hard-error tolerant techniques. Blindly using compression in PCM memory, however, may affect lifetime negatively. We generally expect that by reducing the size of the written data using compression, the number of bit flips decreases with respect to the baseline PCM. However, this is not always the case, as we will show in this study. For 20% of memory writes, the number of bit flips increases, since compression increases data entropy (i.e., the chance of over-writing a ‘0’ with a ‘1’ or a ‘1’ with a ‘0’). The increase in the number of bit flips leads to both increased energy consumption and decreased lifetime of the PCM memory. Thus, the controller may decide not to compress the selected data if compression leads to an increase in the number of bit flips. Nonetheless, as we will discuss, this is not easy to implement since compression is performed by the memory controller (residing on the CPU chip), while the number of bit flips is determined after applying DW at the chip level. Therefore,

15 it is very costly to have the memory controller perform the compression in current memory organizations. Moreover, the compression-based design optimizations in PCM should take into account this limitation when trying to reduce the bit flips.

The main contributions of this work are as follows:

• We begin by describing the limitations of widely-used DW schemes in PCM memories and discuss how compression can help us overcome these issues. We then qualitatively and quantitatively discuss the benefits and side effects of using compression for lifetime enhancement in PCM memories.

• We propose a novel PCM memory architecture, and corresponding changes at the memory controller, that efficiently leverage compression in collaboration with hard-error tolerant and intra-line wear-leveling schemes. The cost of managing the compression metadata and implementing intra-line wear-leveling in the proposed design is insignificant.

• Using a heuristic approach in our mechanism, we reduce the cost of increased bit flips due to compression, allowing the memory controller to predict which conditions will result in an increased number of bit flips.

• Our extensive simulation-based study shows that the proposed design yields a 90% reduction in uncorrectable errors and a 20× increase in PCM memory lifetime, relative to the baseline state-of-the-art PCM DIMM memory.

Finally, note that our proposed design assumes that any prior compression algorithm (such as [1,8,9,37]), hard-error tolerant technique (such as [17–20]), or inter-line wear-leveling scheme (such as [16]) can be used by the memory controller.

3.2 Background and Related Work

Next, we discuss the details of a typical PCM-based main memory system [13,38, 39], which we use as our baseline configuration. We also provide an overview of the PCM wear-out problem, and discuss prior related work on resolving this issue.

Figure 3.2: (a) PCM cell; (b) PCM-based DIMM with ECC chip; (c) DW circuit.

3.2.1 PCM Basics and Baseline Organization

Figure 4.1a shows a PCM cell structure and its access circuits. PCM stores binary data in a chalcogenide material (i.e., Ge2Sb2Te5, or GST for short), in either a high-resistance (RESET) or low-resistance (SET) state. Reading a PCM cell's contents requires applying a low current for a short period of time to sense its resistivity. The write process, however, depends on the written value: a SET pulse is a low-amplitude, long-duration current, while a RESET pulse is a high-amplitude, short-duration current. In contrast to a SET operation, a RESET is fast, consumes more energy, and significantly contributes to PCM wear-out2. We assume a PCM main memory organized like a traditional DRAM as our baseline (i.e., the PCM has multiple channels, each connected to multiple memory modules, namely Dual In-line Memory Modules, or DIMMs).

2PCM can store either one bit (Single-Level Cell, or SLC) or multiple bits (Multi-Level Cell, or MLC) per cell. Although the proposed approach can be applied to both, we assume an SLC PCM as our baseline, because MLC endurance (10^5–10^6 writes [69]) and performance are worse than those of SLC PCM, which complicates its use as main memory.

Figure 4.1b illustrates a DIMM-based PCM memory using the DDRx configuration. Each DIMM has multiple ranks, and each rank consists of a set of PCM chips that together feed the data bus. In the common case, a rank in a PCM DIMM contains nine 8-bit chips (8 bits of input/output on every clock edge), providing a 72-bit rank. This organization is called an ECC-DIMM, where the ninth chip stores an Error Correcting Code (ECC)3. A rank is partitioned into multiple banks, and different banks can process independent memory requests. As shown in Figure 4.1b, each bank is distributed across all chips in a rank. When the memory controller issues a request for a cache line, all PCM chips in a rank are activated, and each sub-bank contributes a portion of the requested block4. For example, assuming a 64-byte cache line, and since each chip provides 8 bits of information on every clock edge (a total of 72 bits for data and ECC), it takes a burst of eight transfers to move a cache line. Note that an ECC-DIMM only provides additional storage for ECC; on each memory request, the ECC information and the original data are sent to the memory controller on the processor die, where the actual error detection/correction takes place. This leaves the decision of which error protection mechanism to implement to the memory system designer.

3.2.2 Wear-out in PCM

Similar to other non-volatile memories, PCM suffers from limited write endurance. This problem is primarily due to the heating and cooling process during write operations – after a finite number of writes, the cell loses its programmability and becomes “stuck” at either the SET or the RESET state. Stuck-at SET and stuck-at RESET faults occur for different reasons [71]. Stuck-at SET happens because the GST material loses its crystalline quality over time. Experimental device-level studies have shown that a cell with a stuck-at SET fault can be recovered by applying a reverse electric field [71]. Stuck-at RESET, on the other hand, happens because the heating electrode detaches from the GST. Stuck-at RESET is

3Error correction capability is a must in PCM products, due to their high hard-error rate. All PCM-based reliability designs adhere to a 12.5% capacity overhead, the same as ECC-DIMMs in DDRx DRAMs. While we describe our scheme for ECC-based DIMMs only, it can be used in any memory system with other protection mechanisms (e.g., chipkill [70]).
4In this study we use the terms block and line interchangeably.

unrecoverable [71,72]. Prior studies report that stuck-at RESET is the dominant failure mode in PCM, especially in small-geometry cells where the contact between the heating electrode and the GST becomes weak – according to a recent ITRS report [40], SLC PCM devices at a 9nm technology node will fail after around 10^7 SET-to-RESET transitions. Nevertheless, our proposed technique is general, and we assume that either fault model can occur.

3.2.3 Prior Work on Improving PCM Lifetime

To cope with the limited endurance of PCM, prior studies take different approaches, working at different abstraction levels. We can categorize the relevant studies into two main groups: (i) mechanisms that postpone hard errors, and (ii) mechanisms that correct hard errors. We discuss the state-of-the-art prior work in each category.

Hard-Error Postponement: One can improve the lifetime of a PCM system by reducing write traffic, through exploiting write buffers [21], using encoding schemes [41], or applying compression [42]. Even recently proposed PCM prototypes/chips have an embedded read-modify-write (RMW) circuit [43] (shown in Figure 4.1b) which performs differential writes (DW) to reduce bit flips, improve PCM lifetime, and reduce energy consumption. The Flip-N-Write strategy [73] is a similar, but more efficient, approach. On each write, it checks whether writing the original data or its complement produces fewer bit flips relative to the stored data; therefore, at most half of the bits ever have to be written on any write. The PCM memory controller may also use wear-leveling algorithms to further postpone reaching the PCM lifetime limit. The idea is to uniformly spread the write traffic over all memory blocks, avoiding premature wear-out of write-intensive memory blocks. Different wear-leveling techniques have been proposed for PCM main memory, such as Start-and-Gap [16], an inter-line (or inter-page) scheme that is reasonably efficient and imposes negligible cost. In this study, we assume that the baseline PCM system already exploits DW and Start-and-Gap as hard-error postponement schemes, and our approach works on top of them.
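To make the write-reduction step concrete, the following minimal Python sketch (our own illustration, not code from any PCM controller) shows how a differential write counts the cells that actually change, and how a Flip-N-Write-style check decides between storing the data or its complement. Real Flip-N-Write operates at word granularity with one flag bit per word; the sketch uses a whole block for brevity.

```python
def bit_flips(old: int, new: int, width: int = 512) -> int:
    """Number of cells a differential write (DW) would actually program:
    only the bits that differ between stored and incoming data."""
    mask = (1 << width) - 1
    return bin((old ^ new) & mask).count("1")

def flip_n_write(old: int, new: int, width: int = 512):
    """Flip-N-Write-style choice: store `new` or its complement, whichever
    flips fewer cells; the returned flag records whether data is inverted."""
    mask = (1 << width) - 1
    direct = bit_flips(old, new, width)
    inverted = bit_flips(old, (~new) & mask, width)
    if inverted < direct:
        return (~new) & mask, 1
    return new & mask, 0

# Toy 8-bit example: plain DW flips 5 cells, the inverted form only 3.
old, new = 0b1111_0000, 0b0000_0001
print(bit_flips(old, new, width=8))      # -> 5
print(flip_n_write(old, new, width=8))   # -> (0b11111110, 1)
```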

Hard-Error Tolerance (Correction): With all of the above fault-postponement mechanisms, the PCM memory continues to work after a cell becomes faulty. Conventional SECDED (Single Error Correction, Double Error Detection) schemes, which have been widely used in DRAM memories, are not a good choice for PCM for two reasons. First, SECDED is write-intensive (any update to the data requires updating the ECC), so even in the presence of DW, it is likely that an ECC chip fails before a data chip does. Second, in contrast to DRAM's fault model, where faults are random and rare, the number of stuck-at faults in PCM increases over time; thus, PCM may need to correct more than one bit error per block. Relying on the fact that hard errors in PCM are detectable, several solutions have been proposed in the literature for error correction (e.g., ECP [17], SAFER [18], FREE-p [19], and Aegis [20]). We assume that our baseline PCM memory uses ECP [17], due to its simplicity and reasonably high error coverage. We also evaluate the efficiency of our proposal in a system with the SAFER [18] and Aegis [20] schemes, both of which trade design simplicity for stronger correction capabilities. Here, we describe these three mechanisms in more detail.

• ECP [17]: For every faulty bit, ECP keeps one pointer and one replacement bit. Error correction is performed after the read operation, restoring the faulty bit (the location the ECP entry points to) with the replacement bit. With a 12.5% error-correction overhead (the ECC-DIMM in Figure 4.1b), ECP can correct 6 faulty bits per block. We refer to this design as ECP-6 and use it as the default error recovery scheme in our platform.

• SAFER [18]: SAFER dynamically partitions each faulty memory block so that each partition has at most one faulty bit, and then decides whether to store the data in its original or complemented form in order to mask the faulty bit in each partition. With a 12.5% metadata overhead, SAFER deterministically corrects 6 faulty bits, and up to 32 bits probabilistically, in a 64-byte memory block. As mentioned in the original paper, the chance of correcting more than 8 bit failures is very small.

• Aegis [20]: Aegis employs a strategy similar to SAFER (partitioning the memory block and storing data in a form that masks errors), but is able to correct more errors with fewer partitions.

3.3 The Proposed Approach: Employing Compression for Robust PCM Design

As shown in Figure 3.1, when DW is employed in a PCM-based memory, the resulting bit flips (or changed bits) on each write follow a random pattern. The main contribution of this study is to employ compression to reduce this randomness and to show how doing so aids other lifetime enhancement solutions (such as wear-leveling and hard-error tolerance) in delivering longer PCM lifetimes. Indeed, when data is stored in compressed form, the bit flips are limited to a small window of a memory block and the faulty bits are localized, which in turn offers a higher chance of locating errors at a lower cost and enables more intuitive wear-leveling.

Our approach here is general: we assume that the memory controller can use any compression technique. Several techniques for data compression exist throughout the memory-system stack, each targeting a different optimization goal (i.e., lowering access latency, lowering energy consumption, increasing memory bandwidth, or reducing utilized capacity). Most existing compression algorithms for caches and main memory [1,8,9,37] rely on value popularity. Their implementations usually trade off compression ratio (i.e., the size of the compressed data divided by the size of the uncompressed data) against design complexity (i.e., compression/decompression latency).

In this work, without loss of generality, we assume that our baseline PCM memory controller uses two compression techniques: BDI (Base-Delta-Immediate) [8] and FPC (Frequent Pattern Compression) [1]. Table 3.1 provides the key specifications and settings for BDI and FPC. Here, we briefly describe the functionality of these two data compression schemes.

• BDI exploits the fact that the dynamic range of the word values in a memory block is usually small in most applications. BDI stores the value of one word (as the base) and the differences of the other words from that base (as deltas). BDI is a very fast compression scheme (taking 1 CPU cycle) and produces a fairly good compression ratio – the compressed data is between 1 and 40 bytes for a 64-byte data block (a small sketch of the base+delta idea appears after the FPC description below).

Table 3.1: The characteristics of the BDI and FPC techniques.

  Compression technique    FPC [1]                  BDI [8]
  Target values            Frequent data patterns   Narrow data values
  Input chunk size         4 bytes                  64 bytes
  Compressed size          3–8 bits                 1–40 bytes
  Decompression latency    5 clock cycles           1 clock cycle

Figure 3.3: The average compressed data size for BDI, FPC, and the best of the two.

• FPC is based on a set of predefined, frequently occurring data patterns and uses them for compression at the block level. FPC compresses 4-byte data chunks into 3–8 bits. FPC's decompression is reasonably fast (taking 5 CPU cycles).
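As promised above, here is a minimal sketch of the base+delta idea behind BDI, under the simplifying assumptions that a 64-byte block is viewed as eight 8-byte words and that a single base with one fixed delta width is tried; the actual BDI design in [8] evaluates several base/delta size combinations and also handles zero and repeated values.

```python
import struct

def bdi_compress(block: bytes, delta_bytes: int = 2):
    """Try to encode a 64-byte block as one 8-byte base plus eight signed
    deltas of `delta_bytes` each. Returns the compressed size in bytes,
    or None if some word's delta does not fit in the chosen width."""
    assert len(block) == 64
    words = struct.unpack("<8Q", block)          # eight 8-byte words
    base = words[0]
    limit = 1 << (8 * delta_bytes - 1)           # signed delta range
    for w in words:
        if not (-limit <= w - base < limit):
            return None                           # not compressible this way
    return 8 + 8 * delta_bytes                    # base + eight deltas

# Words with a narrow dynamic range compress from 64 bytes down to 24.
block = struct.pack("<8Q", *[0x1000 + i * 3 for i in range(8)])
print(bdi_compress(block))   # -> 24
```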

In our memory model, the memory controller (on the CPU chip) has separate compression units for BDI and FPC that work in parallel on every data write-back. Given the outputs of both units, the memory controller chooses the smaller compressed result for writing to the PCM memory (we discuss metadata management later). On read accesses, the memory controller receives the compressed data (and the corresponding metadata) and delivers it to either the BDI or the FPC decompression logic. Thus, with this implementation, the memory controller always stores the best of the BDI and FPC outputs (in terms of compressed data size) for every memory block. Figure 7.5 shows the average compressed data size for BDI, FPC, and the best of the two (i.e., BDI or FPC, chosen per data block) for a set of SPEC CPU2006 applications. The memory block size is set to 64 bytes in all of our evaluations (Section 4.3 provides details of our evaluation methodology, system configuration, and workload characteristics).
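The selection step itself reduces to picking the smaller of the two outputs. The sketch below assumes hypothetical `bdi_compress` and `fpc_compress` helpers that return a compressed size in bytes (or None when a block is not compressible by that scheme); the 5-bit encoding field described later in Section 3.3.2 would record which encoder and internal format were chosen.

```python
def choose_compressor(block, bdi_compress, fpc_compress):
    """Run both compressors (in hardware they operate in parallel) and keep
    whichever output is smaller; fall back to storing the block uncompressed."""
    candidates = []
    bdi_size = bdi_compress(block)
    if bdi_size is not None:
        candidates.append(("BDI", bdi_size))
    fpc_size = fpc_compress(block)
    if fpc_size is not None:
        candidates.append(("FPC", fpc_size))
    if not candidates:
        return ("RAW", len(block))
    best = min(candidates, key=lambda c: c[1])
    return best if best[1] < len(block) else ("RAW", len(block))
```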

Using the best of BDI and FPC makes better use of the contents of a given memory block, helping to achieve the highest compression ratio. The resulting compressed size varies across applications, as shown in Figure 7.5 – applications such as zeusmp and cactusADM have the highest compression ratios (about 3/64 and 2/64, respectively), while applications such as lbm or leslie3d have the lowest (saving 13 and 19 bytes, respectively, out of 64 bytes). On average, the best of BDI and FPC gives a compression ratio of 0.43.

3.3.1 The Proposed Mechanism and its Lifetime Impacts

Figure 3.4 presents the details of the proposed mechanism for a single memory block. Circled numbers in the text below refer to the operations shown in this figure. The assumed memory block size is B bytes, and ECP-6 is used as the error correction code. As described earlier, ECP-6 can correct up to 6 faulty cells, regardless of their position in the block. We also assume that the memory controller reduces the written data to W bytes when using compression, giving a compression ratio of W/B. The compressed data is then stored in a small part of the block, called the compression window in our design. Clearly, the size of the compression window is determined by the compressibility of the written data, and thus must remain variable. We begin with a simple approach in which the compression window is initially mapped to the W least-significant bytes of the line (1). The memory controller does not shift the window (to either side) for consecutive write-backs5 as long as the number of faulty cells inside the window is within the correction capability of our assumed ECP-6 model (2).

When the number of faulty cells in the currently active compression window exceeds 6 (or, more generally, the maximum correction strength of the selected hard-error tolerant scheme), the controller shifts the compression window toward the higher-order bits and uses the remaining healthy cells as replacements for the faulty ones (3). In other words, by shifting the compression window, the controller ensures that the number of faulty cells inside the window is at most the 6 errors correctable by ECP-6.

5Due to the variable size of the compression window, this means that, in our simplified design, the memory controller does not change the position of the lowest bit for consecutive writes.

Figure 3.4: An example of the proposed mechanism: the basic operation, sliding the write window when the current window has more than 6 errors, and sliding it when the new write request has a different size.

With this mechanism, the number of tolerable faulty bits per memory block is no longer limited to 6 (as in ECP-6). The memory controller keeps writing into a memory block as long as it can find a contiguous region that (i) is at least as large as the compressed write-back, and (ii) contains no more than 6 faulty cells. This means that our scheme has the potential to substantially increase the number of correctable bits without changing the hard-error tolerant scheme.
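The window-placement step can be sketched as follows, assuming the controller knows (from its error-tracking metadata) the positions of the stuck cells in a 512-bit line; this is an illustrative software model, not the exact hardware search.

```python
def find_compression_window(faulty_bits, line_bits=512, window_bits=256,
                            max_correctable=6, step=8):
    """Return the start offset of a contiguous window that can hold the
    compressed data with at most `max_correctable` faulty cells, or None
    if no such window exists (the block is then treated as dead)."""
    faulty = sorted(faulty_bits)
    for start in range(0, line_bits - window_bits + 1, step):
        end = start + window_bits
        if sum(1 for b in faulty if start <= b < end) <= max_correctable:
            return start
    return None

# Nine stuck cells clustered in the low-order bits: the window slides past them.
print(find_compression_window({3, 7, 12, 20, 33, 40, 55, 61, 70},
                              window_bits=256))   # -> 16
```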

3.3.1.1 Impact of compression on bit flips

Using compression in a PCM memory leads to one of two scenarios affecting the number of cells that need to be updated. If the compression ratio is high for a given write, the number of bit flips due to writing the compressed data is substantially lower than when writing the uncompressed data. On the other hand, if the compression ratio of the write-back data is low, in some cases we observe more bit flips in the compressed memory than in the conventional memory, which can erode the energy-efficiency and lifetime benefits of our mechanism. The reason is that most compression algorithms generate fairly random data patterns, which increases bit entropy (i.e., for compressed data, the probability of over-writing a ‘0’ bit with a ‘1’ bit, and vice versa, is generally higher than for uncompressed data). Figure 3.5 shows this behavior for our workload set when using the BEST compression scheme (see Figure 7.5). We report the percentage of write-backs whose bit-flip count increases, decreases, or remains untouched (i.e., within about 5% of the baseline) after compression, relative to the uncompressed data. The block size is 64 bytes. Our results support the above discussion. For most writes in applications with a high compression ratio (e.g., sjeng, milc, and cactusADM), the number of bit flips decreases after compression – thus, compression is not harmful to the lifetime of most memory blocks in these applications. In contrast, most writes in applications with fairly low compression ratios (such as lbm and GemsFDTD) lead to an increase in the number of bit flips.

Figure 3.5: Percentage of write-backs that exhibit increased, decreased, and untouched bit flips after compression.

An exception to this trend is the leslie3d workload, where compression does not affect the number of bit flips, even though its compression ratio is low. Note also that the bzip2 and gcc applications do not follow our expectations: they experience a high number of bit flips despite having good compression ratios. This is due to frequent changes in the size of their compressed data, which we discuss below.

To avoid this shortcoming, we should refrain from writing compressed data whenever it would increase the number of bit flips relative to the baseline. This is not easy to implement in today's DDR-based memory systems: the decision of whether or not to compress is the responsibility of the memory controller, which resides on the CPU chip, while the actual number of bit flips is determined at the chip level by the DW mechanism. Thus, the memory controller does not know the number of bit flips ahead of time. Alternatively, one might suggest that, before writing the data, we read the entire memory block and perform a bit-wise comparison of the new and old data at the memory controller, in order to determine the number of bit flips with and without compression on the CPU chip. This approach has two downsides: (i) it increases memory traffic, as each write must be preceded by a read of the same block, and (ii) we need to switch the bus direction on each write, which takes several cycles in current DDR protocols. In the following, we introduce a heuristic approach that relaxes this problem without imposing these costs. Our heuristic is based on two characteristics of the write patterns in real applications:

1. As discussed earlier, the number of bit flips usually decreases if the compression ratio is high.

2. On the other hand, for a written data block with a low compression ratio, the number of bit flips may or may not increase. We found that the main cause of an increase is consecutive writes to a given memory block whose compressed sizes vary. To better understand this behavior, Figure 3.6 presents the probability that two consecutive writes to the same block have different sizes after compression. From Figures 3.5 and 3.6, we deduce a relationship between an increase in the number of bit flips and the probability of writing different sizes into a block (example applications exhibiting this behavior are bzip2 and gcc). Figure 3.7 provides further insight by showing the

changes in the compressed data sizes of consecutive writes to three blocks in the bzip2 and hmmer programs. We chose these two applications because they have similar compression ratios (Figure 7.5), and the selected blocks are representative of similar blocks in other applications. The size of the compressed data for most blocks in bzip2 varies significantly over time. The behavior is different in hmmer, where the compressed sizes of writes to a given block do not vary much. Although both applications have similar compression ratios, hmmer does not see a large increase in the number of bit flips, whereas bzip2 does.

Figure 3.6: Probability that two consecutive writes to the same block have different sizes after compression.

Figure 3.7: Changes in the written data size after compression, for three representative memory blocks of (a) bzip2 and (b) hmmer.

Figure 3.8: The proposed heuristic for deciding whether to write data in compressed form (compression logic, the SC == “11” check, and the |Old_S − New_S| < Threshold2 comparison).

Based on these behaviors, we derive a heuristic algorithm for controlling the number of bit flips, shown in Figure 3.8. Circled numbers in the text below refer to the steps in this figure. In our design, we keep a small saturating counter per block (shown as SC, 2 bits wide) to track changes in the written data. If the compressed size of the new data is less than a threshold (Threshold1), we always write the data in compressed form (Step 1). Otherwise, if SC is saturated (SC = “11”), the associated block has experienced writes of variable size, so the memory controller writes the data uncompressed to avoid extra bit flips (Step 2). If SC is not saturated, (i) the data is written in compressed form, and (ii) SC is updated to track the write sizes (Step 3) – if the sizes of the new and old data do not differ significantly (i.e., their difference is less than a threshold, Threshold2), SC is decremented to reflect minor or no change in size; otherwise, SC is incremented.
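A compact sketch of this heuristic is shown below. The 2-bit counter SC, Threshold1, and Threshold2 follow the description above; the concrete threshold values are placeholders, since the text does not fix them.

```python
THRESHOLD1 = 32   # bytes: "highly compressible" cut-off (assumed value)
THRESHOLD2 = 8    # bytes: tolerated size change between writes (assumed value)

def should_compress(new_size, old_size, sc):
    """Decide whether to store the new write-back in compressed form and
    return (compress?, updated 2-bit saturating counter SC)."""
    if new_size < THRESHOLD1:                  # Step 1: high compression ratio
        return True, sc
    if sc == 3:                                # Step 2: SC saturated ("11")
        return False, sc                       # sizes keep changing -> store raw
    if abs(new_size - old_size) < THRESHOLD2:  # Step 3: track size stability
        sc = max(sc - 1, 0)
    else:
        sc = min(sc + 1, 3)
    return True, sc
```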

3.3.1.2 The need for intra-line wear-leveling

If the write pattern to a memory block suddenly changes, the naive design proposed above may encounter a serious lifetime problem. To examine this issue in more detail, assume that writes to a specific block occur in two phases. During the first phase, the writes have a high compression ratio and span only a limited portion of the block (under the approach proposed above). Suppose the first phase is long enough for the block to accumulate more than 6 faulty cells (recall that our approach can tolerate many faulty bits as long as the compression ratio remains

high). During the second phase, the write pattern changes6 and the modified data has a low compression ratio. In this scenario, it is very likely that the memory controller cannot find a contiguous compression window with fewer than 6 faulty bits that can hold the new compressed data; our aggressive compression scheme therefore fails. To resolve this problem, we need an intra-line wear-leveling solution that evenly distributes the write pressure over more bits of the same line, avoiding localized wear-out within a memory block. We employ a counter-based mechanism for intra-line wear-leveling, but to reduce the hardware cost, we use a single large counter per memory bank (instead of a per-line counter). On every write to a bank, the memory controller increments the corresponding wear-leveling counter. When the counter saturates, subsequent writes to that bank apply intra-line wear-leveling. Based on our studies, we use 16-bit wear-leveling counters with a step size of 1 byte.
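A sketch of the counter-based rotation follows, assuming one 16-bit counter per bank and the 1-byte step stated above; advancing the window offset exactly when the counter wraps is our simplification of "all new writes then use intra-line wear-leveling".

```python
class BankWearLeveler:
    """Per-bank intra-line wear-leveling: every 2^16 writes to the bank,
    advance the byte offset at which compressed data is placed, so the
    write pressure rotates over the whole 64-byte line."""

    def __init__(self, line_bytes=64, step_bytes=1):
        self.counter = 0            # 16-bit write counter for this bank
        self.offset = 0             # current rotation offset (bytes)
        self.line_bytes = line_bytes
        self.step_bytes = step_bytes

    def on_write(self):
        """Call on every write to the bank; returns the start byte that
        this write's compression window should be rotated to."""
        self.counter = (self.counter + 1) & 0xFFFF
        if self.counter == 0:       # wrapped after 2^16 writes
            self.offset = (self.offset + self.step_bytes) % self.line_bytes
        return self.offset
```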

3.3.1.3 Interaction with error tolerant schemes

One potential benefit of our approach is that we can tolerate many faulty cells without adding any extra overhead for hard-error tolerance. We discussed this benefit for the ECP-6 scheme in Figure 3.4 and described how our design can tolerate more than 6 faulty cells in a single block. We expect our scheme to behave even more efficiently with more advanced hard-error correction schemes such as SAFER and Aegis, which are based on partitioning, as described in Section 3.2. The reason is that, by confining the faulty cells to a small window (the compression window), our design makes it easy for these schemes to partition a memory block such that no partition contains more than a single faulty bit. In other words, these techniques can tolerate many faulty bits in theory, and compression increases the chance of reaching that potential. To examine this in practice, we conducted a set of Monte Carlo simulations for a single memory block. The error correction techniques considered are ECP-6, SAFER-32, and Aegis 17x31. In these experiments, the block size is 64 bytes (512 bits) and the number of errors is varied from 1 to 128 (uniformly distributed over the entire block).

6This can happen, not only due to changes in the behavior of an application, but also due to inter-line wear-leveling. More accurately, when two lines are swapped by inter-line wear-leveling, the newly-written data and old data normally have different sizes.

Figure 3.9: The failure probability of a single block, as a function of the number of faulty bits and the data size after compression, for three error correction schemes: (a) ECP-6, (b) SAFER, and (c) Aegis.

The amount of written data is varied from 1B to 64B (to model different compression ratios). The Monte Carlo results are collected over 100,000 fault injections, and the failure probability (1 − Reliability) is reported as the figure of merit for each error correction scheme. Figure 3.9 presents the results for the three schemes. We make two main observations. First, the higher the compression ratio, the greater the chance of still using a memory block in the presence of higher bit-error rates. Second, the proposed scheme works even more efficiently with more advanced hard-error techniques. For instance, if the compressed data size is 32 bytes, then at a failure probability of 0.5 the average number of tolerable faulty bits for ECP-6, SAFER, and Aegis is 18, 38, and 41, respectively.
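The ECP-6 portion of this Monte Carlo experiment can be reproduced with a few lines of simulation code, sketched below under the assumption that a block survives as long as some contiguous window of the compressed size contains at most 6 faults; SAFER and Aegis would need their own correction models.

```python
import random

def ecp6_block_fails(num_errors, compressed_bytes, line_bits=512, max_fix=6):
    """Inject `num_errors` uniformly distributed stuck bits and report whether
    no window of `compressed_bytes` with <= max_fix faults can be found."""
    faults = random.sample(range(line_bits), num_errors)
    window = compressed_bytes * 8
    for start in range(0, line_bits - window + 1):
        if sum(1 for b in faults if start <= b < start + window) <= max_fix:
            return False            # a usable window exists -> block survives
    return True

def failure_probability(num_errors, compressed_bytes, trials=10_000):
    fails = sum(ecp6_block_fails(num_errors, compressed_bytes)
                for _ in range(trials))
    return fails / trials

# e.g., 32-byte compressed data under 24 uniformly distributed stuck bits
print(failure_probability(24, 32, trials=2_000))
```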

3.3.1.4 Using a worn-out block after changes in compression ratio

Based on the above discussion, the memory controller marks a memory block as a "dead block" if it cannot find a contiguous section with at most 6 faulty cells that fits the compressed (or uncompressed) data. However, because compressed data sizes vary over time, the size of the written data may later decrease enough to fit in an already-dead block. Although beneficial, exploiting this directly would be very costly, as we would have to check the status of dead blocks on every write. Instead, we propose to check this condition (whether the compressed data fits in an already-dead block) only during inter-line wear-leveling. With this approach, we still reap some of the benefit of reusing dead blocks, while imposing little cost.

3.3.2 Metadata management

In our proposed mechanism, we keep three kinds of metadata for each memory line: (i) a pointer to the start of the compressed block (6 bits), (ii) encoding information for the compressed data (5 bits), and (iii) a saturating counter (2 bits). Therefore, each memory block requires 13 bits of metadata, stored at the beginning of the memory line. On a read, after the whole block is read, the pointer bits determine which part of the block is passed to the decompression logic, and the encoding information tells the decompression logic how to regenerate the uncompressed data. In addition, each memory block needs one bit indicating whether the line is compressed. Since one chip (i.e., 64 bits per memory block) is dedicated to error correction, and ECP-6 uses only 61 of those bits, 3 bits in that chip are unused; we use one of them as the compressed-line flag. Overall, the metadata causes no hardware overhead on the memory side. On the other hand, when a write request arrives at the memory controller, we must decide whether the data should be stored in compressed form. As discussed earlier, the saturating counter is kept in memory, even though the decision is made at the controller, and any round trip between the two would add intolerable delay. To resolve this, whenever a memory block is sent to the last-level cache, we append to that message the size of the corresponding compressed data in memory, along with its saturating counter. This adds one byte of overhead per 64 bytes of memory. Since the memory controller and the last-level cache are tightly connected, this modification is practical in terms of cost. With this design, when a cache line is written back to memory, the memory controller already knows the line's saturating counter value and the size of the stored data, and can therefore decide whether the new data should be stored compressed or uncompressed.
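The 13 bits of per-line metadata can be packed as in the sketch below; the field widths follow the description above, while the field ordering is an arbitrary illustrative choice.

```python
def pack_metadata(start_ptr: int, encoding: int, sc: int) -> int:
    """Pack the per-line metadata into 13 bits:
       [12:7] start pointer, [6:2] encoding info, [1:0] saturating counter."""
    assert 0 <= start_ptr < 64 and 0 <= encoding < 32 and 0 <= sc < 4
    return (start_ptr << 7) | (encoding << 2) | sc

def unpack_metadata(bits: int):
    return (bits >> 7) & 0x3F, (bits >> 2) & 0x1F, bits & 0x3

meta = pack_metadata(start_ptr=16, encoding=0b01010, sc=1)
assert unpack_metadata(meta) == (16, 0b01010, 1)
```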

Note that cell failures in the metadata are infrequent, because the metadata is updated less often than the data itself. More precisely, (i) the start pointer is updated every 2^16 writes to the memory bank (2^10 writes to a line, on average), and (ii) the coding bits and the counter are updated only when the size of the compressed data changes, which, as Figure 3.6 shows, happens every 4–5 writes on average.

3.4 Experimental Setup

In this section we present details of our platform setup, assumed PCM fault model, and design space parameters that are used in our evaluations.

Evaluation infrastructure: We use the Gem5 simulator [74] for the performance evaluation of a chip multi-processor. Each core in the target system uses the ARM instruction set architecture, and the memory hierarchy is based on the Ruby model provided in Gem5. We also collect traces of main-memory accesses in Gem5, which are then fed to a lightweight memory simulator for lifetime analysis. The lifetime simulator models the behavior of all lifetime enhancement techniques, including chip-level DW, intra- and inter-line wear-leveling, and hard-error correction codes, as well as the details of our compression-based enhancements.

Table 3.2: The specification of our simulated system.

  Processor
    ISA                      ARM
    CMP                      16-core, out-of-order, 2.5GHz
    Interconnect network     4×4 mesh
  On-chip cache hierarchy
    L1 I-Cache               32KB / 64B / 2-way, private, 2-cycle, write-back
    L1 D-Cache               32KB / 64B / 2-way, private, 2-cycle, write-back
    L2 Cache                 4MB / 64B / 8-way, shared, 20-cycle
    Coherency protocol       Snooping MOESI
  PCM main memory
    Total capacity           4GB (4KB pages, 64B rows)
    Configuration            2 channels, 1 DIMM/channel, 1 rank/DIMM,
                             8 devices/rank, 48ns read, 40ns RESET, 150ns SET
    Endurance                mean 10^7 writes, variance 0.15
    Interface [13]           400MHz; tRCD=60, tCL=5, tWL=4, tCCD=4, tWTR=4,
                             tRTP=3, tRP=60, tRRDact=2, tRRDpre=11 (cycles)
  On-chip memory controller
    Decompression latency    BDI: 1 cycle, FPC: 5 cycles
    Read/write queues        8 / 32 entries per bank

Baseline configuration: The evaluated CMP is a 16-core processor, using the configuration described in Table 4.1. The memory hierarchy has two levels of caching: the L1 caches are private to each core, and the last-level cache (LLC) is a 4MB cache shared among all 16 cores. This capacity is large enough to filter out a large portion of the traffic destined for the PCM-based main memory. Below the caches is a 4GB SLC-based PCM memory, configured with the latency parameters used in NVSim [75]. The main memory has 4 channels, each connected to a separate memory controller on the CPU chip, with one rank per channel, nine ×8 chips per rank (Figure 4.1), and 4 banks per rank. We use the DDR3 standard, with a burst length of eight and timing parameters taken from Lee et al. [13]. Each memory controller uses separate per-bank read and write buffers; the write and read buffers are 32-entry and 8-entry FIFOs, respectively. The baseline system also uses Start-Gap [16] for wear-leveling and ECP-6 [17] for per-line hard-error correction.

Fault model: Lifetime analysis is performed using our trace-driven PCM lifetime simulator. We feed the simulator a trace of memory accesses collected from an application and replay the trace until the PCM lifetime limit is reached. The mean PCM cell endurance is set to 10^7 writes [13], with a variance of 0.15 (based on the model in prior work [17,19]) to capture process variation in the PCM array. The simulator keeps writing into a memory block until the provided error correction mechanism (i.e., ECP-6) can no longer recover from an error, at which point the block encounters its first uncorrectable faulty bit. As in a prior study [17], we assume the system fails when 50% of the memory capacity is worn out.
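A rough model of the per-cell endurance draw used by the lifetime simulator is sketched below. Interpreting the 0.15 figure as the coefficient of variation of a normal distribution is our assumption, consistent with the CoV = 0.25 labeling used later for Figure 3.13.

```python
import random

def draw_cell_endurance(mean=1e7, cov=0.15):
    """Sample a per-cell write-endurance limit: a normal distribution with
    coefficient of variation `cov`, floored at a small positive value."""
    return max(random.gauss(mean, cov * mean), 1.0)

def block_is_dead(cell_writes, cell_limits, correctable=6):
    """Without compression, a block dies once more cells are worn out than
    ECP-6 can repair (the proposed scheme relaxes this, as described above)."""
    worn = sum(w >= lim for w, lim in zip(cell_writes, cell_limits))
    return worn > correctable

limits = [draw_cell_endurance() for _ in range(512)]   # one limit per cell
```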

Workloads: We selected 15 memory-intensive programs from the SPEC CPU2006 suite [2]; each has at least one L2 cache write-back per kilo instructions (WPKI). Each workload runs the same program on all 16 cores of the CMP. For each workload, we fast-forward approximately 2 billion instructions, warm up the caches by running 100 million instructions, and then collect results by simulating the next 100 million instructions on each core. Table 4.2 characterizes each workload by its WPKI and compression ratio (CR); the reported CR values are for the BEST compression scheme in Figure 7.5. We classify our workloads into high compressibility (H),

medium compressibility (M), and low compressibility (L) categories. We call a workload highly compressible if its CR is below 0.3, and of low compressibility if its CR is above 0.7; otherwise, the workload is considered medium compressible.

Evaluated systems: We evaluate and compare the results of four systems with different lifetime enhancements:

• Baseline: The model described in Table 4.1.

• Comp: This system uses our naive data compression scheme, i.e., without intra-line wear-leveling and without the extended hard-error tolerance described above. The compression is the best of BDI and FPC.

• Comp+W: This system uses our naive data compression scheme and the proposed intra-line wear-leveling.

• Comp+WF: This system uses all of our proposed schemes: compression-based writes, intra-line wear-leveling, and the extended hard-error tolerance.

All four systems use DW at the chip level for bit-level write reduction, Start-Gap [16] for inter-line wear-leveling, and ECP-6 [17] as the hard-error tolerant scheme. Throughout our analysis, the reported results for the different systems are normalized to those of the baseline system.

Table 3.3: Characteristics of the evaluated workloads. WPKI refers to the number of L2 write-backs per kilo instructions (per core). CR refers to the compression ratio, defined as the size of the compressed data divided by the original size; we use the best of BDI and FPC for compressing each block. H, M, and L refer to High, Medium, and Low compressibility, respectively.

  Workload        WPKI    CR      Workload        WPKI    CR
  astar (M)       1.04    0.53    bwaves (M)      9.78    0.34
  bzip2 (M)       4.6     0.53    cactusADM (H)   8.09    0.03
  calculix (M)    1.08    0.37    gcc (M)         8.05    0.5
  GemsFDTD (L)    4.15    0.70    gobmk (M)       1.14    0.39
  hmmer (M)       1.9     0.59    leslie3d (L)    8.32    0.70
  lbm (L)         15.6    0.79    mcf (M)         10.35   0.55
  milc (H)        3.4     0.29    sjeng (H)       4.38    0.08
  zeusmp (H)      5.46    0.05

3.5 Evaluation

In this section, we present and discuss the results for the evaluated systems. We begin with a lifetime analysis of the baseline SLC-based PCM configuration, then analyze the performance overhead of the proposed technique, and finally discuss the efficiency of the proposed approach under a higher degree of process variation.

3.5.1 Memory Lifetime Analysis

Figure 3.10 presents lifetime results, with all values normalized to the baseline. In general, we find that the degree of lifetime improvement is highly dependent on the compressibility of the running application. Below, we discuss the behavior of each configuration in detail.

3.5.1.1 Impact of using compression (Comp)

In Figure 3.10, we find that half of the applications experience shorter lifetimes under the Comp configuration. For instance, in bzip2 and gcc, memory lifetime is cut almost in half. The underlying reason for such a dramatic impact is that compression localizes write operations to the least-significant bytes of a memory block, so the least-significant and most-significant bytes (LSBs and MSBs) within a block wear out at wildly different rates. This leads to early failures once an uncompressed block is mapped to an unevenly worn-out memory block. Note that for highly compressible data blocks, such non-uniformity is less detrimental: once the least-significant bytes are worn out, the compressed data can still be placed in another working compression window. This non-uniformity, however, is not tolerable for uncompressed or less-compressible blocks. Figure 3.11 helps explain quantitatively which applications still experience lifetime improvements under the Comp system. During the execution of an application, write-intensive memory addresses receive multiple write operations with different compressed sizes; in Figure 3.11, for each memory address, we show the size of the largest compressed block written to that address. As can be seen, in milc, 80% of the write operations are smaller than 25 bytes.

Figure 3.10: Lifetime of the different systems, normalized to the baseline system.

Figure 3.11: Compressed block size for different memory addresses in (a) gcc and (b) milc. For each memory address, we consider the size of the largest compressed data block over all write operations to that address.

Therefore, even though the Comp configuration produces a non-uniform wear-out pattern within a memory block, 80% of the memory addresses still have a good chance of fitting in those unevenly worn-out blocks. In gcc, by contrast, even though the application achieves a high compression ratio, the compressed sizes are spread uniformly from 25 bytes to 64 bytes, and only 10% of the memory addresses receive write operations smaller than 25 bytes (Figure 3.11a). Therefore, in gcc, only a small percentage of the memory addresses can tolerate the non-uniform wear-out within memory blocks. In general, highly compressible applications

such as zeusmp exhibit data patterns similar to milc, while low-compressibility applications see lifetime degradation. For medium-compressibility applications, the lifetime may decrease or marginally increase, depending both on the compressibility and on the distribution of compressed block sizes, as discussed for Figure 3.11.

3.5.1.2 Impact of intra-line wear-leveling (Comp+W)

To achieve a uniform cell wear-out pattern within a line, Comp+W adds an intra-line wear-leveling scheme for compressed write operations. In contrast to Comp, Comp+W does not hurt PCM lifetime for any of the applications tested, and in fact it prolongs the average lifetime by 3.2X. Under Comp+W, gcc and mcf see considerable improvements (5.5X and 2.8X, respectively) over the Comp configuration. In other words, Comp+W resolves the premature block failures that frequently occur in Comp for less-compressible data blocks.

3.5.1.3 Impact of advanced hard-error tolerance (Comp+WF)

The Comp+WF configuration in Figure 3.10 shows the lifetime improvements achieved by our extended fault tolerance approach; Comp+WF increases the system lifetime by 4.3X on average. Ordinarily, the first time a memory block is unable to store a new value, it is marked as permanently dead. Our Comp+WF approach, however, does not mark failed memory blocks as permanently dead, because such blocks can later be reused by memory addresses with higher data compressibility. As seen in Figure 3.10, not all applications benefit from this approach to the same extent, and Figure 3.11 presents two extreme cases. In milc, 20% of the memory addresses have low compressibility, with compressed sizes larger than 40 bytes. If a block from this group fails on a write operation, it still has a high chance of being useful to one of the remaining 80% of the memory addresses (whose compressed sizes are below 25 bytes). In gcc, on the other hand, we do not observe such compressibility variance.

3.5.1.4 Summary

While the Comp configuration improves PCM lifetime by 35% on average, it can also decrease the lifetime for less-compressible applications. The basic weakness of the Comp configuration is the localization of write operations to the least-significant bytes within a memory block. The Comp+W configuration exploits an intra-line wear-leveling scheme to address this non-uniformity: it does not reduce the lifetime of any of our applications and, on average, prolongs the system lifetime by 3.2X. Finally, the Comp+WF configuration redefines the notion of a permanently dead block and increases the system lifetime by 4.3X on average. Table 3.4 reports the impact of our proposed architecture on the system lifetime in months.

3.5.1.5 Number of Tolerable errors

In the ECP technique, the storage overhead increases linearly with the number of recoverable errors per memory block. Since ECP-6 (capable of correcting 6 errors) introduces less than a 12.5% storage overhead, it can be considered a practical ECP configuration for standard memory systems, assuming one extra chip to hold the ECC. Our proposed architecture, however, tolerates more cell failures per memory block, because compressed data blocks take up less space and thus have a higher chance of fitting in a working section of the block (i.e., a contiguous section containing at most 6 faulty cells). Figure 3.12 shows the average number of tolerable errors per memory block in Comp+WF: our proposed architecture can, on average, tolerate 3X more errors per memory block. Sustaining this level of error tolerance with ECP alone would require roughly 40% more storage, which is hard to justify for many systems. Figure 3.12 also shows the correlation between the compressibility of an application and the number of tolerable errors within a memory block. As we observe, sjeng, milc, and cactusADM, which are categorized as highly compressible, can on average tolerate 25, 32, and 35 faulty cells per block, respectively.
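For reference, the linear ECP cost can be made explicit. Assuming the usual ECP layout for a 512-bit line (a 9-bit pointer plus one replacement bit per corrected error, plus one flag bit), which is consistent with the 61 bits cited for ECP-6 in Section 3.3.2, the storage needed to match roughly 3× more tolerable errors works out as follows.

```latex
% ECP storage for a 512-bit line, assuming a 9-bit pointer + 1 replacement bit
% per corrected error and one extra flag bit (matches 61 bits for ECP-6).
\[
  \mathrm{bits}(e) = 10e + 1, \qquad
  \mathrm{overhead}(e) = \frac{10e + 1}{512}
\]
\[
  \text{ECP-6: } \frac{61}{512} \approx 11.9\%, \qquad
  \text{ECP-20 (}\approx 3\times \text{ more errors): } \frac{201}{512} \approx 39\%
\]
```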

Table 3.4: Final lifetime (months) for the baseline and Comp+WF systems.

  Workload     Baseline   Comp+WF     Workload     Baseline   Comp+WF
  astar        52.1       150.2       bwaves       8.6        23.6
  bzip2        13.4       19.8        cactusADM    9.2        119.6
  calculix     51         159.4       gcc          8.7        36.2
  GemsFDTD     15.6       19.6        gobmk        50.4       131.7
  hmmer        32.1       70.6        leslie3d     8.3        13.5
  lbm          20.7       28.8        mcf          18.7       48
  milc         16         184         sjeng        13.2       50.4
  zeusmp       11.7       128.7       Average      22         79

Figure 3.12: The average number of faulty cells in a failed 512-bit memory block.

3.5.2 Performance Overhead Analysis

In this work, the memory system is augmented with a data compression mechanism. Compression is not used to improve system performance per se (though it can increase the PCM's effective capacity), but rather as a knob to improve the lifetime of the resistive memory. As discussed in Section 4.3, a typical PCM-based memory system has a 32-entry write queue, so data compression can be performed in the background as write requests arrive. Data decompression, however, is on the critical path: read requests to compressed memory blocks experience longer access times, an extra 1 cycle if the block is compressed with BDI and 5 cycles if it is compressed with FPC. Based on our experimental observations, read accesses to compressed data blocks are delayed by about 2% on average. Accounting for all the extra delays caused by decompression, we observe less than a 0.3% performance degradation across the applications studied.

Figure 3.13: Lifetime of the Comp+WF system, normalized to the baseline system (CoV = 0.25).

3.5.3 Sensitivity to the Effect of Process Variation

As reported in previous work [76], process variation affects several physical dimensions of PCM cells, resulting in variability in their electrical characteristics, and this variability becomes even more pronounced with aggressive feature-size scaling. Imperfect wear-leveling exacerbates the problem, given the uneven distribution of write operations. Figure 3.13 shows the impact of our proposed architecture on memory lifetime under higher process variation: for this analysis, our PCM simulator assigns cell lifetimes with a variance of 0.25.

3.5.4 Efficiency of Our Design for MLC-based PCMs

MLC PCM, which has recently been prototyped [69], offers higher density than SLC at the cost of degraded performance and cell endurance. More specifically, recent work [69] has experimentally shown that the MLC PCM endurance limit is 10 to 100 times lower than SLC's, which makes limited endurance an even greater challenge for MLC PCM than for conventional SLC. Making an MLC-based memory last requires optimizations at all levels; in particular, techniques that can correct many faulty bits are highly desirable for these memories, which underscores the importance of our proposed mechanism. In summary, as PCM technology scales and cells become weaker (due to severe process variation and the emergence of MLC devices), the value of our proposed technique, which can correct many faulty bits, only grows.

3.6 Conclusions

Limited write endurance remains the main obstacle to the widespread adoption of PCM memories. Most prior studies concentrate on improving PCM lifetime through wear-leveling and hard-error correction schemes implemented at the architecture level, while paying little attention to lower-level optimizations – current proposals usually rely on the PCM chip's ability to perform differential writes (DW). Although DW is beneficial in reducing write traffic in the memory array, it normally results in random updates of cells across the entire memory block. To reduce this randomness, and to enable combining device-level and architecture-level optimizations, we propose to employ compression. Although beneficial for most applications, blindly using compression may increase data entropy in some memory blocks of certain applications. Motivated by this, we propose a novel compression-based PCM memory system that mitigates the problem of increased bit flipping. Our scheme works alongside other wear-leveling and hard-error tolerance schemes to boost PCM lifetime, and our proposed design, on average, increases PCM memory lifetime by 4.3×, compared to a state-of-the-art PCM DIMM memory.

Chapter 4

Tolerating Write Disturbance Errors in PCM Devices

Constant technology scaling has enabled modern computing systems to achieve high degrees of thread-level parallelism, making the design of a highly scalable and dense main memory subsystem a major challenge, especially for data-intensive workloads. While DRAM has been the dominant main-memory technology for the past three decades, it faces serious scalability and power consumption problems at sub-micron scales. Phase Change Memory (PCM) has been proposed as one of the most promising technologies to replace DRAM in future computing systems, because of its short access latency and high scalability. However, PCM suffers from reliability problems below the 20nm technology node. As the cell size scales down, thermal disturbance between cells is exacerbated: the heat generated during a write operation can spread to adjacent cells and potentially change their values. This phenomenon is known as the write disturbance problem. In this work, we study the impact of write disturbance along the word-line and the bit-line in a PCM-based main memory, and propose two low-overhead mechanisms to address this problem in each dimension. Our proposed schemes are based on a rank-subsetting memory architecture in which each chip can be accessed separately. To mitigate the write disturbance problem along the word-line, we use a combination of differential write (DW) and verify-and-correct (VnC) schemes only if the chance of having cascaded verification steps is

low. Otherwise, we write into all the vulnerable cells within the target chip to avoid the performance loss. To tolerate write disturbance along the bit-line, we use data compression to compact the data and place adjacent memory lines in a non-overlapping fashion. We further use the extra space within a compressed memory line to store a BCH code, in order to protect read-intensive memory lines against write-intensive addresses. Our proposed schemes guarantee reliable write operations in a PCM-based main memory, and improve system performance by 17% on average, compared to the state-of-the-art mechanism in this domain.

4.1 Introduction

Workloads in the next generation of computing systems are expected to be highly data-intensive and to have very large working sets. While DRAM has been the dominant main-memory technology for the past decades, it currently faces serious scalability and power consumption problems [66]. Researchers from both academia and industry have suggested phase change memory (PCM) as the most promising scalable alternative to DRAM in future computing systems [13,14,77], or as a complement to DRAM in a hybrid memory system [78]. PCM has lower power dissipation, higher density, and better scalability than DRAM. However, recent studies have shown that aggressive scaling of PCM in the deep sub-micron regime reduces the inter-cell distance, leading to thermal disturbance between adjacent cells. More precisely, during a write operation, the heat generated by the write current may spread beyond the target cell and disturb the resistance states of its neighbors. This phenomenon is referred to as the write disturbance problem. Write disturbance was reported at the 54nm technology node by Lee et al. [79], and has become a major reliability issue below 20nm [15].

The existing solutions for resolving write disturbance in PCM devices are ineffective. A simple and widely adopted solution is to allocate sufficient inter-cell distance to avoid thermal disturbance between cells [15,22,23]. However, this approach yields significantly less chip capacity than what could theoretically be fabricated on a chip. For instance, in recent PCM prototypes [15,23],

designers introduced 20nm of extra inter-cell space along the word-lines and 40nm along the bit-lines, which yields only about 33% of the chip capacity that could ideally be achieved for the given PCM cell size. A common approach that specifically addresses write disturbance along the word-line is to perform a read after each write operation to detect and correct possible errors [24]. This approach, known as the verify-and-correct (VnC) scheme, may lead to cascading verification steps and significant performance loss.

Another approach is to increase the inter-cell space only along the bit-lines, in order to confine errors to the word-lines [22]. This mechanism then exploits an encoding scheme that reduces the number of vulnerable cells by manipulating the data pattern. However, it still employs VnC to guarantee the correctness of the written data; the encoding only reduces the chance of cascading verification steps. SD-PCM [25], on the other hand, focuses on errors along the bit-lines: it issues pre-write and post-write read operations to the adjacent memory lines in order to detect and correct any potential errors caused by a write. Therefore, for each write operation, SD-PCM imposes four redundant reads to the two adjacent memory lines, which can lead to considerable performance loss.

In this work, we do not increase the inter-cell space along the word-lines or the bit-lines to address the write disturbance problem. Instead, we propose a hybrid mechanism consisting of two general-purpose, cost-effective schemes that address write disturbance both within a memory line and between adjacent memory lines. The goal of our proposed mechanisms is to reduce the probability of write disturbance errors while, at the same time, tolerating as many errors as possible. The main contributions of this study are as follows:

• First, instead of naively adopting a performance-costly VnC scheme, we predict the cases where we would experience cascading VnC steps. For such cases, we write into all the vulnerable cells to avoid any performance loss. This prediction is based on the observation that the number of VnC iterations is correlated with the number of vulnerable cells. Furthermore, we make this decision for each chip separately. Such a chip-based approach is based on the observation that, on each write operation, only a few of the chips within a memory rank are updated, and each of the updated chips exhibits a different number of vulnerable cells. Therefore, for each write operation, our proposed mechanism might use a different programming strategy for each of the chips in a rank. Our proposed programming strategy resolves the write disturbance problem along the word-line and achieves an average of 4% higher performance compared to the state-of-the-art technique in this domain [22].

• Second, we adopt data compression to compact each memory line and place adjacent memory lines in a non-overlapping fashion. Such a data layout reduces the chance of thermal disturbance among vertically adjacent cells (i.e., along the bit-line). We then use the extra space within a compressed memory line to store a BCH code capable of double-error detection and double-error correction. This strong BCH code protects read-intensive memory lines against write-intensive addresses, which are the main source of write disturbance errors along the bit-lines. Compared to a state-of-the-art technique in this domain [25], our proposed mechanism significantly reduces the number of redundant read operations (65% on average), which in turn leads to a 13% performance improvement, on average.

• To the best of our knowledge, our proposed mechanism is the first work that addresses the write disturbance problem along both the word-lines and the bit-lines. Overall, our proposed hybrid mechanism improves system performance (i.e., IPC) by 17% on average, compared to a system where the DIN [22] and SD-PCM [25] schemes are used to resolve the write disturbance problem along the word-lines and bit-lines, respectively.

4.2 Background and Related Work

We begin by describing a PCM cell structure, and the architecture of a typical main memory subsystem. We then review different fault models in PCM, focusing on the write disturbance issue.

Figure 4.1: (a) PCM cell (b) PCM-based DIMM with 8 chips (c) Differential write (DW) circuit.

4.2.1 PCM Basics

Figure 4.1.a shows a PCM cell structure and its read/write access circuit. PCM stores binary data values in a chalcogenide material (i.e., Ge2Sb2Te5, or GST for short), in either a high-resistance (RESET) or low-resistance (SET) state. Reading a PCM cell's content requires applying a low-amplitude current for a short period of time to sense its resistivity. The write process, however, depends on the written value: a RESET pulse is a large-amplitude, short-duration current that raises the temperature of the GST above its melting point, while a SET pulse is a low-amplitude, long-duration current that raises the GST temperature above the crystallization point but below the melting point. Compared to a SET operation, a RESET operation consumes more energy and contributes significantly to PCM wear-out. As we further discuss in the next section, the high-amplitude current used in a RESET operation is the underlying cause of the write disturbance phenomenon.

We assume a PCM-based main memory structure similar to a traditional DRAM as our baseline (i.e., the PCM has multiple channels, where each channel is connected to a Dual In-line Memory Module (DIMM)). Figure 4.1.b illustrates such a DIMM-based PCM memory. Each DIMM has multiple ranks, and each rank consists of a set of 8 PCM chips that together feed the data bus. A rank is partitioned into multiple banks such that each bank can process independent memory requests. As shown in Figure 4.1.b, each bank is distributed across all the chips in a rank. When the memory controller issues a read request (to read a 64-byte cache line), all PCM chips in the rank are activated, and each sub-bank contributes a portion

of the requested block. Similar to recent PCM prototypes, we adopt an embedded read-modify-write (RMW) circuit [43,73] (shown in Figure 4.1.c) which performs differential writes (DW) to reduce the number of cell writes, improve PCM lifetime, and reduce energy consumption. DW performs a read-before-write operation in order to write into the cells only if the stored value and the new value differ. As we further discuss in the next section, if we do not use the DW scheme, we will not experience write disturbance errors along the word-line; however, this comes at the cost of a higher number of cell-level write operations and a shorter lifetime.
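
To make the differential-write behavior concrete, the sketch below (an illustrative model under the bit-level encoding used in this chapter, where SET stores a logical 1 and RESET a logical 0, not the actual RMW circuit) compares the old and new contents of one 64-bit chip word and issues programming pulses only for the bits that change; the positions it skips are exactly the idle cells discussed later.

    #include <bitset>
    #include <cstddef>
    #include <vector>

    // Illustrative model of a differential write (DW) over one 64-bit chip word.
    // Only bits whose stored value changes receive a programming pulse; the
    // unchanged bits are left idle (these are the cells DW never touches).
    struct DWResult {
        std::vector<std::size_t> setPulses;    // 0 -> 1 transitions (SET)
        std::vector<std::size_t> resetPulses;  // 1 -> 0 transitions (RESET)
    };

    DWResult differentialWrite(const std::bitset<64>& oldData,
                               const std::bitset<64>& newData) {
        DWResult r;
        for (std::size_t i = 0; i < 64; ++i) {
            if (oldData[i] == newData[i]) continue;      // idle cell: no pulse issued
            if (newData[i]) r.setPulses.push_back(i);    // program to SET (logical 1)
            else            r.resetPulses.push_back(i);  // program to RESET (logical 0)
        }
        return r;
    }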

Rank Subsetting: In a conventional memory architecture, a read/write access to the memory consists of a RAS command (i.e., Row Address Strobe), followed by 8 CAS commands (i.e., Column Address Strobe). For instance, as shown in Figure 4.1.b, there is one RAS command bus shared by all the chips in the rank, and on each CAS command, 1 byte is read from each chip. Therefore, it takes 8 cycles to read 64 bytes (a cache line) from a rank consisting of 8 chips. In contrast, a rank subsetting architecture is capable of accessing each chip separately, by partitioning a 64-bit rank into smaller sub-ranks (e.g., 8 sub-ranks), each of which can be independently controlled with the same single command/address bus on the channel. In such an architecture, a RAS command applies to only one sub-rank at any time, i.e., it activates only a subset of the PCM chips in the rank and limits the amount of data brought into the row-buffer. Different configurations of rank subsetting have been introduced to reduce the energy consumption and improve the performance of the memory [80,81]. Arjomand et al. [82] exploit the rank subsetting architecture to increase read and write parallelism in PCM-based main memories. In our work, we adopt a rank subsetting architecture similar to that of [82]. However, we are not seeking higher performance or lower power consumption; instead, we exploit the flexibility of rank subsetting to address the write disturbance problem in PCM memories.
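
As a rough illustration of what rank subsetting exposes to the controller, the sketch below decodes a physical address into channel, bank, row, and chip (sub-rank) fields, so that a command can target a single 8-bit-wide chip rather than the full 64-bit rank. The field widths and bit positions are hypothetical, chosen only to be consistent with the byte-to-chip interleaving described in this chapter.

    #include <cstdint>

    // Hypothetical address decomposition for a rank-subsetting PCM organization:
    // each 8-bit-wide chip in the rank acts as an independently addressable
    // sub-rank. Field positions are illustrative, matching the byte-to-chip
    // interleaving of Figure 4.1.b (bytes 0-7 -> chip 0, bytes 8-15 -> chip 1, ...).
    struct SubRankAddr {
        uint32_t chip;     // sub-rank index: which of the 8 chips is targeted
        uint32_t channel;  // memory channel
        uint32_t bank;     // bank within the rank
        uint64_t row;      // row (memory line) within the bank
    };

    SubRankAddr decodeAddress(uint64_t physAddr) {
        SubRankAddr a;
        a.chip    = static_cast<uint32_t>((physAddr >> 3) & 0x7);  // 8-byte chunk -> chip
        a.channel = static_cast<uint32_t>((physAddr >> 6) & 0x3);  // 64B lines across 4 channels
        a.bank    = static_cast<uint32_t>((physAddr >> 8) & 0x3);  // 4 banks per rank
        a.row     = physAddr >> 10;                                // remaining bits select the row
        return a;
    }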

4.2.2 Fault Models in PCM

Three different fault models are known in PCM technology. Here we discuss these fault models as well as the prior studies that address them.

i) Wear-out faults. Similar to other non-volatile memory technologies, PCM suffers from limited write endurance. This problem is primarily due to the repeated heating and cooling during write operations: after a finite number of writes, the cell loses its programmability and becomes permanently "stuck" at either the SET or RESET state [71,72]. Wear-leveling algorithms (e.g., [16,21,43,73,83]) and error-correction techniques (e.g., [17-20,84]) have been specifically proposed to tackle this category of permanent cell failures.

ii) Resistance drift faults. A multi-level cell (MLC) PCM stores multiple data bits in a single memory cell. For instance, in a 2-bit MLC device, the resistance range is split into four sections, representing the binary values 00, 01, 10, and 11. The drift process starts once a PCM cell is programmed: due to the thermally-affected atomic rearrangement of the amorphous structure, the resistance of the cell increases over time, and this increase accelerates as the chip temperature rises. Therefore, the stored value in a cell can change after a long period of time. Techniques such as periodic scrubbing [47,48] and data encoding mechanisms [49] have been proposed to prevent the occurrence of such transient faults over long periods of idleness. In this work, we primarily focus on single-level cell PCM devices, where resistance drift is not a reliability concern.

iii) Write disturbance faults. The write disturbance (WD) problem in PCM devices arises from thermal disturbance between adjacent cells. When programming a PCM cell, particularly when resetting it, we apply a high-amplitude, short-duration current to raise the GST temperature above the melting point. The heat generated by the RESET operation may disseminate beyond the target cell and reach its neighboring cells, which can consequently disturb the resistivity of those cells (only if they are in idle mode; see footnote 1). More precisely, write disturbance can occur only when a cell is under RESET while its neighboring cells are (1) in idle mode and (2) in the RESET state.

Footnote 1: As shown in Figure 4.1.c, PCM devices adopt a differential write (DW) scheme at the cell level, in order to avoid redundant write operations when the new value and the old value stored in the cell are the same [43,73]. Such cells are labeled as idle cells during the write operation. While deactivating the DW scheme would resolve the write disturbance problem along the word-lines, it comes at the cost of a higher number of bit flips, higher power consumption, and shorter lifetime.

Figure 4.2: Write disturbance and vulnerable data patterns. Only one of the bits in row i is updated from 1 to 0, which potentially makes its four adjacent cells susceptible to write disturbance errors. However, only two of the adjacent cells are in the RESET state, and these may experience a disturbance and flip to the SET state (binary value 1).

Figure 4.2 shows a hypothetical write disturbance scenario. As can be seen, the write operation to row i updates only one of the cell values (from 01110 to 01100). Among the four neighboring cells, only two are in the RESET state; these are considered vulnerable cells and can potentially morph to the SET state because of the heat disseminated by the cell under RESET. Such disturbance does not happen during SET operations, because the SET current is about 50% lower than the RESET current and the temperature increase during a SET operation is about four times lower than during a RESET operation. Therefore, the disturbing effect of SET operations is considered negligible [85,86]. Note that, besides the magnitude and width of the RESET current, write disturbance is also heavily affected by the distance between the PCM cells. Technology scaling keeps reducing the inter-cell distance, which makes the write disturbance problem a major scaling bottleneck for the practical deployment of PCM devices. Previous studies have shown that write disturbance has become a major reliability problem below 20nm, where the inter-cell space is much smaller [15,23,44]. In the following, we describe the occurrence of write disturbance errors along the bit-lines and word-lines, and also discuss the most recent studies in this domain.

1) Intra-Line: write disturbance along the word-line

The occurrence of write disturbance errors in a word-line depends on the programming strategy used in the PCM device. On a write operation, if we write into all the cells within the memory line, no write disturbance error can happen, because

the heat generated by the SET/RESET current is higher than the heat disseminated from the neighboring cells. However, to extend the lifetime of the PCM device and to reduce the power consumption, we write into a cell only if the new value and the previously stored value differ. This write scheme is known as differential write (DW). In this work, similar to recent PCM prototypes [43,73], we assume the baseline system is supported by the DW scheme. By using the DW scheme, we leave some of the cells within a word-line in idle mode. Such idle cells are vulnerable to the heat disseminated from adjacent cells and can morph from the RESET state to the SET state. In this study, we refer to this category of write disturbance errors as intra-line errors, as they occur within the memory line on which the write operation is being performed. As reported in Figure 4.3, a write operation results in an average of 18 vulnerabilities in a word-line (our evaluation methodology is discussed in Section 4.3), and this figure rises to an average of 48 in gamess.
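
A minimal sketch of how such intra-line vulnerabilities can be counted under differential write is shown below. It assumes the same convention as before (SET = logical 1, RESET = logical 0) and the definition used in this chapter: a cell is vulnerable if it is left idle, currently stores the RESET state, and a horizontally adjacent cell receives a RESET pulse.

    #include <bitset>
    #include <cstddef>
    #include <vector>

    // Count intra-line vulnerable cells under differential write: a cell is
    // vulnerable if it stays idle (old == new), currently holds the RESET state
    // (logical 0), and a horizontally adjacent cell in the same word-line
    // receives a RESET pulse (a 1 -> 0 transition).
    template <std::size_t N>
    std::vector<std::size_t> intraLineVulnerableCells(const std::bitset<N>& oldData,
                                                      const std::bitset<N>& newData) {
        std::vector<std::size_t> vulnerable;
        for (std::size_t i = 0; i < N; ++i) {
            bool idle       = (oldData[i] == newData[i]);
            bool resetState = !oldData[i];
            if (!idle || !resetState) continue;
            bool leftReset  = (i > 0)     && oldData[i - 1] && !newData[i - 1];
            bool rightReset = (i + 1 < N) && oldData[i + 1] && !newData[i + 1];
            if (leftReset || rightReset) vulnerable.push_back(i);
        }
        return vulnerable;
    }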

Some prior studies specifically target this category of write disturbance errors. At the circuit level, the Verify-and-Correct (VnC) technique [24] performs an extra read operation after the completion of the write, in order to detect and correct any possible write disturbance errors. The VnC technique RESETs the disturbed cells after the verification step. However, each RESET operation can cause more faulty cells in a cascading fashion (shown in Figure 4.4). To bound the latency imposed by the VnC scheme, after a few iterations of verification (typically 5 iterations [22]), the whole memory line is written in order to avoid further performance loss. In other words, even though the DW scheme is used to avoid redundant cell updates, in the case of cascading VnC iterations it is better to update all the cells for the sake of performance. At the architecture level, Jiang et al. [22] propose a Data encoding-based Insulation technique (DIN) that exploits data encoding to reduce the number of intra-line vulnerable cells by manipulating the pattern of the written data. DIN further exploits the VnC scheme (with a 5-step repetition threshold) to verify the correctness of the written data. Similarly, Tavana and Kaeli [45] propose a partitioning-based encoding approach to mitigate write disturbance errors along the word-line. As mentioned earlier, all of these techniques specifically focus on the write disturbance errors along the word-line, and increase the inter-cell space along the bit-line to resolve the write disturbance problem between adjacent memory lines.
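
The control flow of DW combined with VnC can be summarized as in the sketch below (illustrative only; the 5-iteration cap follows DIN [22], and the three callbacks stand in for device-side operations that are not modeled here).

    #include <cstddef>
    #include <functional>
    #include <vector>

    // Control flow of differential write followed by Verify-and-Correct (VnC),
    // with a bounded number of iterations (5, following DIN [22]). The callbacks
    // stand in for device-side steps that are not modeled here: 'verify' reads the
    // line back and returns the indices of disturbed cells, 'resetCells' issues
    // corrective RESET pulses (which may disturb new cells), and 'writeWholeLine'
    // programs every cell to terminate a cascade.
    void verifyAndCorrect(const std::function<std::vector<std::size_t>()>& verify,
                          const std::function<void(const std::vector<std::size_t>&)>& resetCells,
                          const std::function<void()>& writeWholeLine,
                          int maxIterations = 5) {
        for (int iter = 0; iter < maxIterations; ++iter) {
            std::vector<std::size_t> disturbed = verify();
            if (disturbed.empty()) return;   // written data verified correct
            resetCells(disturbed);           // each corrective RESET can cascade
        }
        writeWholeLine();                    // bound the latency of cascading VnC
    }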

Figure 4.3: Average number of vulnerable cells in a memory line (i.e., intra-line) after each write operation.

Figure 4.4: Cascading effect of the Verify-and-Correct (VnC) scheme. After each write operation, a read is performed to detect and correct all the disturbed cells. In each iteration of VnC, the faulty cells are RESET, which itself can cause more faulty cells along the word-line.

Compared to previous work in this domain, our proposed intra-line scheme employs two different programming strategies (built on top of the DW and VnC schemes) to resolve the write disturbance problem along the word-line. On each write operation, our intra-line scheme determines which programming strategy should be used for each of the chips within a rank, and achieves an average of 4% higher performance compared to DIN [22].

2) Inter-Line: write disturbance along the bit-line

A write operation into a memory line can also cause write disturbance errors in adjacent memory lines. On a RESET operation, the cells that are vertically adjacent to the target cells can be affected if they are in the RESET state (shown in Figure 4.2). In this work, we refer to this category of write disturbance errors as inter-line errors. Figure 4.5 shows the average number of vulnerable cells in adjacent memory lines for each write operation (our evaluation methodology is discussed in Section 4.3). As can be seen, a write operation can result in an average of 40 vulnerabilities in the adjacent memory lines. A simple yet widely

adopted approach to mitigate the inter-line write disturbance problem is to allocate sufficiently large inter-cell space along the bit-line. The major issue with large inter-cell space is considerable capacity loss [15,23]. The SD-PCM scheme [25] focuses on this category of write disturbance errors and proposes a mechanism to achieve reliable write operations in PCM devices without the need to increase the inter-cell space along the bit-line. SD-PCM issues pre-write and post-write read operations to the memory lines adjacent to the line being written. By having the values of the adjacent memory lines before and after the write operation, SD-PCM can detect and correct any disturbed PCM cells. Although this approach guarantees the validity of the adjacent memory lines after each write operation, it affects the system performance considerably, because for each write operation SD-PCM issues 4 redundant read operations. This problem becomes very prominent in memory-intensive applications. The Decongest technique [46], on the other hand, detects and remaps write-intensive memory pages to a disturbance-free part of the memory. While this technique reduces the probability of inter-line write disturbance errors, it still needs a supporting mechanism to guarantee reliable write operations for the remaining, less write-intensive memory addresses, and it also leads to quick wear-out of the disturbance-free part of the memory.

Figure 4.5: Average number of vulnerable cells in adjacent memory lines (i.e., inter-line) after each write operation.

Compared to previous work in this domain, our proposed inter-line scheme reduces the probability of write disturbance errors along the bit-line by manipulating the data placement layout within the chips. First, we use data compression to compact the data and place adjacent memory lines in a non-overlapping fashion, in order to decrease the possibility of write disturbance errors between vertically adjacent cells. We further integrate

a BCH code within the compressed memory lines to protect read-intensive memory lines against write-intensive addresses. We also update the memory scheduling algorithm to take advantage of our proposed mechanism, which leads to a significant reduction in the number of redundant read operations (65% fewer than SD-PCM) and an average performance improvement of 13%.

4.3 Experimental Methodology

In this section, we present details of our evaluation setup, design space parameters, and the evaluation metrics used throughout this study.

Evaluation infrastructure: We use the Gem5 simulator [74] to model a chip multi-processor for our experimental studies. Each core in the target system uses the ARM instruction set architecture. The memory hierarchy is based on the Ruby model provided in Gem5. We collect traces of main memory accesses in Gem5, which are then fed to our lightweight PCM-based memory simulator to study the write disturbance problem in PCM and to evaluate our proposed mechanisms. Our in-house PCM simulator is based on the rank subsetting architecture [82] and uses the differential write (DW) scheme [43] for lifetime and energy consumption optimizations. The latencies of all the proposed mechanisms are also integrated into the simulator for accurate evaluations.

Baseline configuration: The evaluated CMP is a 16-core processor, configured based on the parameters reported in Table 4.1. The memory hierarchy has two levels of caching: the L1 caches are private to each core, and the last-level cache (LLC) is an 8MB cache shared by all the cores. This capacity is large enough to filter out a large portion of the traffic destined for the PCM-based main memory. Below the caches is a 4GB SLC-based PCM memory, configured using the latency parameters from NVsim [75]. The main memory has 4 channels, each connected to a separate memory controller on the CPU chip. Each channel consists of one rank, with eight ×8 (8-bit-wide) chips per rank (shown in Figure 4.1.b) and 4 banks per rank. We use the DDR3 standard with timing parameters taken from Lee et al. [13].

Workloads: For our workloads, we selected 15 memory-intensive programs from the SPEC-2006 suite [2]. Each workload runs the same program on all 16 cores of the CMP. For each workload, we fast-forward approximately 2 billion instructions, warm up the caches by running 100 million instructions, and then collect results by simulating the next 100 million instructions on each core. Table 4.2 characterizes each workload with respect to its average number of vulnerable cells along the word-lines and bit-lines. For each application, we fed the trace of the memory accesses to our PCM simulator and logged the number of vulnerable cells along the word-lines and bit-lines for each write operation. As reported in Table 4.2, in order to properly evaluate our proposed mechanisms, we only picked applications that experience a considerable number of vulnerable cells on each write operation. Note that, to stress the system and study the performance of each scheme under extreme conditions, we build our test traces by collecting the memory addresses that receive write operations over the course of execution. These traces are then fed to our PCM simulator to analyze the efficiency of the different schemes tested.

Evaluation and comparison metrics: The probability that a vulnerable cell experiences a transition from the RESET state to the SET state is a function of the inter-cell distance, the magnitude and width of the RESET current, and process variation, among other parameters. Previous studies consider a uniform probability for the occurrence of write disturbance errors across different vulnerable cells. For instance, DIN [22] and Tavana and Kaeli [45] assume that a vulnerable cell along the word-line becomes faulty with a probability of 9.9%, while SD-PCM [25] assumes that a vulnerable cell along the bit-line becomes faulty with a probability of 11.5%. Note that changing any design parameter changes those probabilities. In this work, we therefore use the average number of vulnerable cells as our evaluation metric, so that our studies and comparisons against other works are independent of the low-level design parameters. For lifetime studies, on the other hand, we use the percentage of extra cell-level write operations (imposed by our mechanisms) as our evaluation metric. Note that the differential write scheme used in our baseline system already achieves the minimum number of bit-flips.

Table 4.1: The specification of our simulated system.

Processor
  CMP: ARM ISA, 16-core, out-of-order, 2.5GHz
  Interconnect Network: 4×4 Mesh network
On-chip Cache Hierarchy
  L1 I-Cache: 32KB/64B/2-way, private, 2-cycle, Write-Back
  L1 D-Cache: 32KB/64B/2-way, private, 2-cycle, Write-Back
  L2 Cache: 8MB/64B/8-way, shared, 20-cycle
  Coherence: Snooping MOESI
PCM Main Memory
  Total Capacity: 4GB (4KB pages, 64B rows)
  Configuration: 2 channels, 1 DIMM/channel, 1 rank/DIMM, 8 devices/rank, 48ns read, 40ns RESET, 150ns SET
  Endurance: Mean: 10^7 writes, Variance: 0.15
  Interface [13]: 400MHz, RCD=60 cycles, CL=5 cycles, WL=4 cycles, CCD=4 cycles, WTR=4 cycles, RTP=3 cycles, RP=60 cycles, RRDact=2 cycles, RRDpre=11 cycles
On-chip Memory Controller
  Decompression latency: FPC: 5 cycles
  Read/Write Queue: 8/32 entries per bank

Table 4.2: Characteristics of the evaluated workloads.

Workload    Avg. Intra-Line Vulnerable Cells    Avg. Inter-Line Vulnerable Cells
astar          20.7354      55.2467
bzip2           6.93494     12.3765
calculix       13.9527      28.6065
dealII         26.1787      62.9146
gamess         48.3151     107.714
gcc             1.88422      4.70384
gobmk          12.2668      32.5623
h264ref        20.4383      41.2089
hmmer          24.1349      59.4925
leslie3d       21.4627      51.3507
milc            0.90387      2.92335
omnetpp        24.6085      43.1351
povray          5.92851     14.1548
tonto          20.6123      56.6098
zeusmp         23.1654      26.9903
Mean           18.1015      39.9993

4.4 The Proposed Approach: Enabling Reliable Write Operations in Super Dense PCM

In order to achieve super dense PCM-based main memories, we do not increase the inter-cell space along the word-lines or the bit-lines to address the write disturbance problem. Instead, we propose new architecture-level mechanisms to mitigate this problem at a low cost. In this section, we start our discussion by describing our intra-line mechanism, which achieves reliable write operations along the word-line; its goal is to guarantee the correctness of the written data in a performance-efficient manner. We then explain our inter-line scheme, which first reduces the probability of write disturbance errors along the bit-line by using a non-overlapping data layout, and then recovers faulty cells by employing a strong BCH code for read-intensive addresses.

4.4.1 Intra-line Scheme: Preventing Write Disturbance Errors Along the Word-line

The occurrence of write disturbance errors in a word-line depends on the programming strategy used in the PCM device. Recent PCM prototypes adopt a differential write (DW) scheme to avoid redundant write operations to the cells whose old and new data are equal. However, this programming strategy makes the idle cells vulnerable to write disturbance. In this work, we similarly assume that the baseline system is equipped with the DW scheme to achieve a reasonable lifetime and lower power consumption. Our goal is to guarantee the correctness of the written word-line in a cost-efficient manner. Our proposed intra-line scheme is based on two experimental observations over the studied applications:

• First, on each write operation, only a few of the chips within the rank need to be updated. As discussed in Section 4.2, in a typical main memory architecture, a rank consists of 8 chips and a write operation to the rank (i.e., a 64-byte cache line written back from the LLC) is interleaved over the chips. Since usually only a few of the words in a cache line are updated, only

a few of the chips within the rank need to be written into (see footnote 2). Figure 4.6 reports this distribution for the studied applications. As can be seen, on average, 47% of the time at most 2 of the 8 chips within a memory rank need to be updated.

• Second, on each write operation, most of the chips that need to be updated experience only a small number of vulnerable cells. Figure 4.7 reports this distribution over the studied workloads. As can be seen, on average, 78% of the chips contain fewer than 10 vulnerable cells on each write operation. Note that such variation in the number of vulnerable cells per chip is observed not only across different memory addresses, but also across different chips for write operations to the same memory address.

Footnote 2: Consider a layout where bytes 0-7 are mapped to chip 0, bytes 8-15 are mapped to chip 1, and so forth. In other words, consecutive 8-byte chunks in a cache line are mapped to consecutive chips in a memory rank (Figure 4.1.b).

Figure 4.6: Average number of updated chips per write operation. Each 64-byte write operation (i.e., a cache line write-back from the LLC) is spread over the 8 chips within a memory rank (shown in Figure 4.1.b).

Figure 4.7: Average number of vulnerable cells in each chip, for each write operation. Each 64-byte write operation (i.e., a cache line write-back from the LLC) is spread over the 8 chips within a memory rank (shown in Figure 4.1.b).

These two observations motivate us to treat each individual chip within a memory rank based on its unique characteristics, as opposed to traditional architectures where optimizations are performed at line granularity. Our proposed intra-line scheme is composed of two steps. First, on each LLC write-back, we detect the chips that need to be updated and issue write operations only to those specific chips. Second, for each of the chips that are to be updated, we separately decide which programming strategy should be used. Here, we consider two programming strategies: (1) a combination of the DW and VnC schemes, and (2) a modified-DW scheme which additionally updates all the vulnerable cells.

This decision is made based on the number of vulnerable cells within the target chip. Intuitively, when a chip has a large number of vulnerable cells, the probability of experiencing cascading VnC steps is higher. Our experimental studies confirmed this: we observed a close correlation between the number of vulnerable cells in a chip and the number of verification steps performed by the VnC scheme. In other words, when a chip contains a high number of vulnerable cells, the VnC scheme experiences frequent cascading effects, leading to extra cycles of verification and significant performance loss. Therefore, on each write operation, for the chips where only a few of the cells are vulnerable (in this study, fewer than 10 vulnerable cells), we adopt the first programming strategy (i.e., the combination of the DW and VnC schemes), because the VnC scheme can correct all the disturbed cells without experiencing cascading modification steps. For the chips with many vulnerable cells, we use the second programming strategy (i.e., the modified-DW scheme), which writes into all the vulnerable cells as well. Even though the modified-DW scheme increases the number of cells that need to be written, it avoids long-latency multi-step VnC iterations and minimizes the latency overhead.
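
A minimal sketch of this per-chip decision is given below; the 10-cell threshold is the one used in this study, while the chunk width and function names are illustrative. The vulnerable-cell count would come from the same kind of comparison shown earlier for intra-line vulnerabilities.

    #include <bitset>
    #include <cstddef>

    // Per-chip programming-strategy selection used by the intra-line scheme.
    // Each chip stores an 8-byte (64-bit) slice of the 64-byte memory line; the
    // 10-cell threshold is the one used in this study.
    enum class ChipStrategy {
        Skip,        // no modified cells in this chip: no write issued at all
        DwPlusVnC,   // few vulnerable cells: differential write + verify-and-correct
        ModifiedDW   // many vulnerable cells: additionally program every vulnerable cell
    };

    ChipStrategy selectChipStrategy(const std::bitset<64>& oldChunk,
                                    const std::bitset<64>& newChunk,
                                    std::size_t vulnerableCells,
                                    std::size_t threshold = 10) {
        if (oldChunk == newChunk) return ChipStrategy::Skip;
        return (vulnerableCells < threshold) ? ChipStrategy::DwPlusVnC
                                             : ChipStrategy::ModifiedDW;
    }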

Extra Cell Writes Due to the Modified-DW Scheme: While the DW scheme writes into a cell only if the stored value and the new value differ, our modified-DW scheme writes into the vulnerable cells as well. Therefore, it can potentially increase the number of cell writes and decrease the memory lifetime. Here, we should mention three important factors. First, writing into all the vulnerable cells in a chip does not mean writing into all the cells; as defined in Section 4.2, vulnerable cells are the cells that are (1) in idle mode and

(2) in the RESET state while one of their adjacent cells is being RESET. Second, we use the modified-DW scheme as our programming strategy for only a limited percentage of the updated chips. Third, for the cases where we use our modified-DW scheme, we no longer need to use the VnC scheme. Therefore, while on the one hand we increase the cell writes by writing into all the vulnerable cells, on the other hand we avoid the cell writes imposed by the VnC scheme. Figure 4.8 reports the overhead of our proposed intra-line mechanism in terms of extra cell writes, compared to the baseline configuration equipped with the DW scheme. As can be seen, our intra-line scheme increases the write operations to the cells by only 1.2% on average, with a maximum of 2.5% in bzip2.

Figure 4.8: Percentage of extra write operations (at the cell level) imposed by our proposed modified-DW scheme.

Performance Degradation: The baseline PCM-based main memory employs the DW scheme to reduce the number of write operations to the cells. However, this scheme leaves some of the cells in idle mode, and these cells might experience write disturbance under certain circumstances. To cope with this problem, a typical PCM-based main memory also uses the VnC scheme. VnC performs a read-after-write operation to detect faulty cells and then issues a RESET operation to those specific cells. However, this approach can cause a cascading effect such that each RESET operation may make other cells faulty. In our studies, we observed that such an approach causes about 20% performance (i.e., IPC) loss on average, due to the extra cycles of latency imposed by multi-step VnC iterations. Similar observations regarding the negative impact of the VnC scheme on system performance have been reported by previous studies as well [22,25,45].

DIN [22], on the other hand, first uses an encoding scheme to reduce the probability of write disturbance errors along the word-line, and then exploits the VnC scheme to detect and correct any potentially faulty cells. Since the encoding scheme reduces the chance of facing cascading VnC steps (by reducing the number of vulnerable cells), DIN is capable of reducing the performance loss to about 5% on average. In our intra-line scheme, however, one of our primary goals is to completely avoid using the VnC scheme whenever there is a high chance of cascading VnC iterations. More specifically, by adopting the modified-DW scheme for the chips with a high number of vulnerable cells (i.e., a high probability of cascading VnC steps), we guarantee the correctness of the written data without needing to employ the VnC scheme. Based on our experimental studies, our proposed intra-line scheme causes only 1% performance loss on average (compared to the baseline memory system with no verification mechanism), due to the rare cases where our first programming strategy leads to cascading VnC steps (see footnote 3). Compared to the DIN scheme [22], our proposed intra-line mechanism achieves 4% higher performance, on average.

Finding Updated Chips Within a Rank: In our proposed scheme, we need to determine which chips need to be updated for each write operation. To tackle this, we can take three different approaches:

• Multiple dirty bits for each cache line: At the last-level cache (LLC), we typically use one bit to indicate whether a cache line is dirty or not. By extending the number of dirty bits and allocating one dirty bit to each sub-block, we can determine which sub-blocks are dirty. For instance, in our architecture, where a 64-byte cache line is interleaved over the 8 chips in a rank, we need to allocate one dirty bit to each group of 8 consecutive bytes in the cache line. With this information, the memory controller can detect the chips that need to be written into. However, this approach requires modifications to the LLC architecture. Besides, for every write to the LLC, we need to perform a read-before-write to detect the updated sub-blocks and set their dirty bits accordingly.

Footnote 3: As mentioned earlier, our mechanism is based on the rank subsetting architecture [80,81]. Arjomand et al. [82] exploit this architecture to increase read and write parallelism in a PCM-based main memory. While our proposed mechanism exploits the rank subsetting architecture for reliability purposes, all of our schemes are compatible with the scheduling schemes proposed by [82]. Therefore, one can use those memory scheduling schemes along with our proposed mechanisms to further improve system performance.

• Read-before-write at the memory controller: In this approach, the memory controller (placed on the CPU chip) first reads the target memory address. Then, by comparing the new data with the data stored in the memory, the memory controller determines which chips need to be updated. Although this approach does not need any modifications to the main memory architecture, it can degrade system performance by increasing the memory traffic.

• Read-before-write on the PCM chip: In this approach, we perform the read-before-write operation inside the PCM chip. The DW scheme [43,73] already employs such a mechanism to determine which cells need to be updated. Using this approach, we first perform a read to determine which chips need to be updated, and then issue write operations to those specific chips. After the write completes, the memory controller is notified to schedule the next memory request. Note that this chip-level mechanism is made possible by the rank subsetting architecture [82].

In this work, we employ the third approach because it does not require modifying the LLC architecture and does not increase the traffic between the memory controller and the PCM chip. Besides, by performing the data comparison on the PCM chip, we can determine the number of vulnerable cells in each chip, and accordingly determine which programming strategy should be used.

4.4.2 Inter-line Scheme: Tolerating Write Disturbance Errors Along the Bit-line

While write disturbance errors along the word-line can be detected by simply performing a read-after-write operation, detecting write disturbance errors along the bit-line is not a trivial task. The state-of-the-art technique in this domain, SD-PCM [25], reads the adjacent memory lines before and after each write operation (referred to as pre-write and post-write read operations, respectively). Then, by comparing the pre-write and post-write values stored in each of the adjacent memory lines, SD-PCM can determine whether the write operation has indeed induced faults in any of those lines. If so, SD-PCM rewrites (i.e., RESETs) the faulty

cells. While this approach detects any potential write disturbance errors along the bit-line, it imposes four redundant read operations for each write operation. Consequently, it causes significant performance loss, especially in memory-intensive applications. In our inter-line scheme, our goal is to tackle write disturbance errors along the bit-line in a performance-efficient manner. More precisely, instead of performing four redundant read operations for each write operation, we issue a read to an adjacent memory line only when the write operation could introduce non-recoverable errors in the adjacent line. Our inter-line scheme is composed of two mechanisms. First, we employ data compression to compact the data and place adjacent memory lines in a non-overlapping fashion, in order to avoid write disturbance errors between vertically adjacent cells. Second, we integrate a strong BCH code within each compressed memory line, in order to protect read-intensive memory addresses against write-intensive addresses that induce many write disturbance errors into the system. Below, we discuss the details of these two mechanisms.

A) A Non-overlapping Data Layout

Data compression has been adopted at different levels of the memory hierarchy for different purposes. At the main memory, compression has traditionally been used to achieve larger effective capacity and lower power consumption [1,8,9]. In the domain of PCM-based main memory, compression techniques have been used to reduce power consumption [42] and to extend the lifetime of PCM-based main memories [77]. In this work, we employ data compression to mitigate write disturbance errors along the bit-line. In the baseline memory system, a 64-byte write operation (i.e., a write-back from the LLC) is interleaved over the 8 chips within a memory rank (shown in Figure 4.1.b). By employing data compression, a 64-byte cache line can be stored in a more compact form and occupies less space in the memory. Note that we do not use the extra space gained by compression to store more data in the memory.

Instead, we use data compression to attain some degree of flexibility in data placement. For instance, by compressing a 64-byte cache line into 16 bytes of data, we can place the compressed data on any two of the eight chips in a memory rank. Such flexibility in data placement can be used to avoid write disturbance errors along the bit-line (i.e., between vertically adjacent cells). Figure 4.9 illustrates how we can use data compression to compact the data and place adjacent memory lines in

a non-overlapping manner. As can be seen, by placing adjacent memory lines in an alternate left-aligned and right-aligned layout, we can limit the occurrence of inter-line disturbance errors to the overlapping areas.

Figure 4.9: Exploiting data compression to place adjacent memory lines in a non-overlapping fashion (i.e., alternate left-aligned and right-aligned). The write disturbance errors are contained within the overlapping areas.

Figure 4.10: Compression ratio of the FPC algorithm [1].

The effectiveness of this mechanism is a function of the compression ratio (i.e., the ratio of the average compressed data size to the original data size). For instance, as shown in Figure 4.9, when most of the memory lines are highly compressible, the stored data in adjacent memory lines do not overlap, so no write disturbance errors can occur along the bit-line (e.g., Row i-1 to Row i+1). On the other hand, when the compression ratio is lower, we still encounter write disturbance errors in the overlapping areas (e.g., between Row i+1 and Row i+2). Figure 4.10 reports the average compression ratio for the compression algorithm adopted in this work (i.e., FPC [1]). The average compression ratio of an application gives a rough estimate of the average number of overlapping bytes between two adjacent memory lines. Figure 4.11 reports the reduction in the average number of inter-line vulnerable cells after applying our proposed non-overlapping data layout. As can be seen, by employing data compression and placing the compressed data in a non-overlapping fashion, we can reduce the

average number of inter-line vulnerable cells by 47%, on average. By comparing Figures 4.10 and 4.11, we can see that applications with higher compression ratios experience larger reductions in the number of vulnerable cells along the bit-line.

Figure 4.11: Reduction in the number of vulnerable cells along the bit-line through non-overlapping data placement.

Architectural Support: While any compression algorithm can be used in our mechanism, in this work we adopt the Frequent-Pattern-Compression scheme (FPC [1]) because of its low hardware and latency overheads. FPC uses predefined data patterns to compress data in 4-byte chunks. FPC integrates the meta-data within the compressed data (stored in the first byte of the compressed data), which is later used in the decompression process. For each memory line, we also keep one bit of meta-data to indicate whether the data is stored in compressed format or not. This bit is set on each write operation. On a read request, this bit determines whether the read data should be decompressed to retrieve the original data. By placing the compressed memory lines in an alternate left-aligned, right-aligned layout (shown in Figure 4.9), the chips placed at the two ends could potentially wear out at a much faster rate. In principle, a compressed cache line can be placed anywhere in the corresponding memory line (see footnote 4). For the alternate left-aligned, right-aligned data layout, we consider CHIP0 as our dividing point, meaning that if LINEi is placed on

CHIP0-CHIPn, then LINEi+1 is placed backwards starting from the last chip in the rank.

However, any other chip can be used as the dividing index to achieve a non-overlapping data layout between adjacent memory lines. Figure 4.12 illustrates two data layouts with two different dividing indices.

Footnote 4: For the sake of hardware simplicity, we assume that a compressed cache line can be placed only in a contiguous segment of the memory line.

Figure 4.12: Updating the dividing index used in our non-overlapping data layout to achieve uniform chip wear-out.

At the top is the alternate left-aligned, right-aligned layout (where CHIP0 is the dividing index), and at the bottom CHIP2 is used as the dividing index. By using different dividing indices over the course of program execution, we can distribute write operations uniformly over the different chips. To keep track of this index, we use a global 3-bit pointer (log2(#chips)) that stores the current dividing index. On a read request, we read the whole memory line. Then, based on the dividing index and the memory address (even or odd), we can retrieve the stored data and decompress it. Since the first byte of the compressed data contains the meta-data, we always place it at the dividing index.
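
The placement rule can be summarized by the sketch below, which maps a compressed line to a contiguous span of chips under the alternate left-aligned/right-aligned layout. It is a simplified illustration: sizes are rounded up to whole chips, even/odd row parity selects the alignment, and the wrap-around behavior for non-zero dividing indices is an assumption consistent with Figure 4.12 rather than a specification of the hardware.

    #include <algorithm>
    #include <cstdint>

    // Placement of a compressed memory line inside an 8-chip rank under the
    // alternate left-aligned / right-aligned layout. Even rows grow forward from
    // the dividing index; odd rows grow backward from the chip just before it, so
    // vertically adjacent rows overlap only when both are poorly compressible.
    struct ChipSpan {
        int firstChip;  // inclusive; the span may wrap around the end of the rank
        int lastChip;   // inclusive
    };

    ChipSpan placeCompressedLine(uint64_t rowIndex, int compressedChips,
                                 int dividingIndex, int chipsPerRank = 8) {
        compressedChips = std::max(1, std::min(compressedChips, chipsPerRank));
        ChipSpan span;
        if (rowIndex % 2 == 0) {   // left-aligned: start at the dividing index
            span.firstChip = dividingIndex;
            span.lastChip  = (dividingIndex + compressedChips - 1) % chipsPerRank;
        } else {                   // right-aligned: end just before the dividing index
            span.lastChip  = (dividingIndex + chipsPerRank - 1) % chipsPerRank;
            span.firstChip = (span.lastChip - compressedChips + 1 + chipsPerRank) % chipsPerRank;
        }
        return span;
    }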

Note that a typical PCM-based main memory is supported by a wear-leveling mechanism [16,77], which periodically remaps memory addresses to different physical locations within the memory in order to achieve a uniform write distribution. In our architecture, at the beginning of each remapping iteration, we update the dividing index used in our data placement mechanism. Since the address remapping is performed in sequential order, the index of the last remapped address tells us which memory locations use the new dividing index and which still use the old one (during the remapping phase).
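
A minimal sketch of this bookkeeping during a wear-leveling pass is shown below; the structure and names are illustrative, assuming only that remapping proceeds sequentially through the line indices.

    #include <cstdint>

    // Lazy update of the dividing index during a wear-leveling pass: lines that
    // the sequential remapping has already visited use the new index, the rest
    // still use the old one. Names and bookkeeping are illustrative.
    struct DividingIndexState {
        int oldIndex;           // dividing index of the previous wear-leveling epoch
        int newIndex;           // dividing index chosen for the current epoch
        uint64_t lastRemapped;  // progress pointer of the sequential remapping
    };

    int dividingIndexFor(uint64_t lineIndex, const DividingIndexState& s) {
        return (lineIndex <= s.lastRemapped) ? s.newIndex : s.oldIndex;
    }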

B) Integrating BCH Code with Compression

Different memory addresses typically exhibit widely different characteristics. From the write disturbance point of view, some memory addresses are write-intensive and can potentially induce many errors in their adjacent memory lines. On the other hand, read-intensive memory addresses are rarely written into and can be considered the victims of write disturbance. In this section, we discuss how a BCH code can be integrated into our compression-based scheme in order to further protect read-intensive memory addresses against write-intensive

addresses.

Figure 4.13: Integrating BCH code into read-intensive addresses.

BCH codes (Bose-Chaudhuri-Hocquenghem codes) are a category of error-detecting and error-correcting codes capable of recovering multiple errors within the data. A typical BCH code requires R = E × ⌈log2(N)⌉ + 1 redundant bits to correct E bit errors in N bits of data [87] (a short numeric sketch of this overhead follows the list below). Figure 4.13 illustrates our approach for integrating the BCH code into the data compression mechanism. In our proposed non-overlapping data layout, we employ compression to achieve some degree of flexibility in data placement; using the extra space within a compressed memory line to store the BCH code therefore seems to run against the fundamental idea behind the non-overlapping layout. In our mechanism, we integrate the BCH code only within read-intensive (more specifically, read-intensive and clean) memory addresses, for two major reasons:

• Read-intensive memory addresses, and more specifically the read-intensive addresses that are in the clean state, do not receive write operations from the CPU (see footnote 5). Therefore, even though the BCH code augmented within such memory lines can overlap with the data stored in the adjacent memory lines, their infrequent write operations (occurring on a page allocation or a wear-leveling address remapping [16]) cannot cause a significant write disturbance issue along the bit-line.

• Read-intensive memory addresses are the major victims of the write disturbance problem (along the bit-line) and can accumulate write disturbance errors over a long period of time. By integrating a BCH code into such read-intensive memory lines (if they are compressible), they become able to tolerate a certain number of faulty cells, depending on the strength of the augmented BCH code.

Footnote 5: Some lines might receive write operations from the CPU but still stay in the clean state, because the old value and the new value are the same. Therefore, we update the dirty bit only if an actual update has been performed within the memory line.
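
As a back-of-the-envelope check of the storage overhead (using only the formula cited above, not a full BCH implementation), a double-error-correcting code over a 512-bit (64-byte) memory line needs R = 2 × ⌈log2(512)⌉ + 1 = 19 redundant bits, which fits comfortably in the slack left by even modestly compressible lines:

    #include <cmath>
    #include <cstdio>

    // Redundant bits required by a BCH code correcting E bit errors in N data
    // bits, following R = E * ceil(log2(N)) + 1 [87].
    int bchRedundantBits(int dataBits, int correctableErrors) {
        int m = static_cast<int>(std::ceil(std::log2(static_cast<double>(dataBits))));
        return correctableErrors * m + 1;
    }

    int main() {
        // Double-error correction over a 512-bit (64-byte) line: 2 * 9 + 1 = 19 bits.
        std::printf("R = %d redundant bits\n", bchRedundantBits(512, 2));
        return 0;
    }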

C) Updating the Memory Controller

The question here is how our proposed mechanisms can be used to avoid the redundant pre-write and post-write read operations on each write to a memory line. As discussed by SD-PCM [25], the only way to detect write disturbance errors along the bit-line is to read the content of each adjacent memory line before and after the write operation. However, our goal is to achieve reliable write operations at a performance-efficient cost, and without causing major changes in the memory architecture. In the following, we discuss how the memory scheduling algorithm can be updated to take advantage of our proposed inter-line scheme.

At the memory controller, we keep 3 bits of meta-data for each memory line to track the size of the data stored at each address (at chip granularity, log2(#chips)). On a write operation, the memory controller first compresses the data. Then, by comparing the size of the compressed data with the size of the data stored in the adjacent memory lines, the memory controller can determine whether the write operation will overlap with any of the adjacent memory lines. If there is no overlap with any of the adjacent memory lines, we need not be concerned about the occurrence of write disturbance errors; in applications with high compression ratios, most adjacent memory lines do not overlap. On the other hand, if there is a partial overlap with any of the adjacent lines, there is a chance of causing write disturbance errors in the overlapping areas. In this case, we take different actions based on the status (clean or dirty) of the overlapping adjacent memory line (a simplified sketch of this check is given after the list below):

• Overlapping memory line is in the clean state: If the overlapping line is in the clean state, we simply ignore the fact that the write operation could cause errors in the other line. Note that a clean memory line is augmented with the BCH code, so the BCH code can recover a certain number of errors (depending on its strength). Note also that read-intensive memory addresses receive frequent read requests, and the BCH code corrects any potentially faulty cells upon each read operation; therefore, clean and read-intensive memory lines will not accumulate faults over time. However, clean memory lines with infrequent accesses might accumulate write disturbance errors (especially if they are placed next to a write-intensive memory line with overlapping areas). On a read access to such an idle line, we might encounter a non-recoverable number of errors. In such cases, we need to read the correct data from the next memory level, which can potentially degrade system performance by imposing long delays (although many of the clean, long-idle memory lines will never be accessed by the CPU and will eventually be evicted from the memory). However, based on our observations, the gain from not issuing 4 redundant read operations for each write considerably outweighs the performance loss caused by the rare cases in which we need to recover the correct data from the next memory level. Note also that, in a PCM-based main memory, write operations are scheduled during phases when there are few read requests in the read-queue or when the write-queue is full. To further reduce the chance of accumulating a non-recoverable number of errors in idle memory lines, when the read-queue is empty we issue a read operation (only a pre-write read) to the adjacent overlapping memory line to correct any previously injected write disturbance errors and thereby avoid further error accumulation after the write operation (see footnote 6).

• Overlapping memory line is in the dirty state: If the adjacent overlapping line is dirty, we need to guarantee the correctness of its data after the write operation. For such cases we need to read the data stored in the adjacent line before and after the write operation. We employ a scheduling mechanism similar to SD-PCM [25], but customize it to our rank subsetting architecture: instead of reading the whole memory line, we only need to read the overlapping chips, which reduces the power and performance overhead. Moreover, we do not need to perform the post-write read operations immediately after the write operation; we only need to verify the correctness of the overlapping chips before they are read by the CPU. Therefore, we schedule the post-write read operations only when the target memory bank is idle or when their queue is full. In addition, for the cases where there are write disturbance errors, we do not use the VnC scheme, as it can cause further performance loss. Instead, we write into all the faulty and vulnerable cells within the target chip.

Footnote 6: Prior works employ various techniques to avoid the accumulation or occurrence of different types of errors in different memory technologies. For example, periodic access to PCM memory lines is used to avoid resistance drift errors [47], and probabilistic access to adjacent memory lines is used to avoid the row-hammer effect in DRAM memories [88]. Such techniques can be added to our mechanism to prevent the accumulation of write disturbance errors in memory lines.
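
The scheduling check described above can be summarized by the sketch below. It is an illustrative model, not the controller's actual logic: per-line sizes are assumed to be tracked in whole chips, and the overlap test relies on the alternate-alignment layout, under which two vertically adjacent lines overlap exactly when their sizes sum to more than the number of chips in the rank.

    #include <cstdint>

    // Simplified per-neighbor scheduling check performed on each write. Sizes are
    // tracked per line at chip granularity; under the alternate-alignment layout
    // two vertically adjacent lines overlap exactly when their sizes sum to more
    // than the number of chips in the rank. Names are illustrative.
    enum class AdjacentAction {
        NoRead,        // no overlap, or a clean neighbor protected by its BCH code
        PreWriteRead,  // opportunistic scrub of a clean neighbor (read-queue idle)
        PrePostReads   // dirty neighbor: verify the overlapping chips before/after
    };

    AdjacentAction actionForNeighbor(int newLineChips, int neighborChips,
                                     bool neighborDirty, bool readQueueEmpty,
                                     int chipsPerRank = 8) {
        bool overlap = (newLineChips + neighborChips) > chipsPerRank;
        if (!overlap)      return AdjacentAction::NoRead;
        if (neighborDirty) return AdjacentAction::PrePostReads;
        // Clean neighbor: its embedded BCH code tolerates a bounded number of
        // faults, so it is only scrubbed opportunistically when the read queue is empty.
        return readQueueEmpty ? AdjacentAction::PreWriteRead : AdjacentAction::NoRead;
    }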

Figure 4.14: Reduction in the number of redundant read operations (i.e., pre-write and post-write read commands).

Figure 4.15: Performance improvement achieved by our proposed inter-line scheme over the SD-PCM technique.

While the proposed scheme requires only minor changes in the memory controller, it can significantly reduce the need for redundant read operations. Figure 4.14 reports the impact of our proposed inter-line scheme on the number of redundant read operations, compared to the SD-PCM scheme, which performs four redundant reads for each write operation. As can be seen, our inter-line scheme reduces the average number of redundant read operations by 65%. In terms of performance (instructions per cycle), our proposed inter-line scheme achieves a 13% improvement over SD-PCM, on average. This performance improvement is a function of both the reduction in the number of redundant read operations and the memory sensitivity of each application; that is, a highly memory-intensive application might exhibit a higher performance improvement even though it experiences a smaller reduction in its redundant read operations.

4.4.3 The Interaction Between the Inter-line and Intra-line Schemes

Our proposed inter-line scheme compresses the data to achieve a non-overlapping data layout between adjacent memory lines. Therefore, the effectiveness of our proposed intra-line scheme might be affected once we compress the data: compression changes the data pattern and could potentially affect the functionality of the intra-line scheme. However, since our adopted compression scheme (i.e., FPC [1]) compresses a cache line in 4-byte chunks, when a portion of the original data changes, the effect is contained within the equivalent portion of the compressed data. In short, after data compression we still observe the same two major characteristics within a word-line: (1) only a few of the chips are updated on each write operation, and (2) most of the updated chips contain only a few vulnerable cells. Based on our experimental evaluations, our proposed intra-line scheme achieves the same performance when we integrate the inter-line scheme into the system. Note that all the discussions and reported evaluations of the inter-line scheme already assume that our intra-line scheme is employed for each write.

In short, by integrating our proposed inter-line and intra-line schemes into the system, we achieve reliable write operations along both the bit-lines and the word-lines. Figure 4.16 reports the performance improvement achieved by our hybrid mechanism, compared to a system where the DIN [22] and SD-PCM [25] schemes are used to resolve the write disturbance problem along the word-lines and bit-lines, respectively. As can be seen, our proposed schemes improve system performance by 17% on average.

Figure 4.16: Performance improvement achieved by integrating our proposed intra-line and inter-line schemes into the system, compared to the memory system supported by the DIN and SD-PCM techniques.

4.5 Conclusion

Write disturbance in PCM has become a major obstacle to achieving super-dense memories. Existing techniques either increase the inter-cell space to prevent write disturbance errors or use a VnC scheme to detect and correct the errors. However, these approaches come at the cost of large capacity overhead and significant performance loss, respectively.

To resolve the write disturbance errors along the word-line, we propose our intra-line scheme that is based on two different programming strategies. Based on the number of vulnerable cells in each chip, we determine which programming strategy should be used to guarantee the correctness of the written data in a performance-efficient manner. Our proposed intra-line scheme causes only 1% performance loss on average, whereas the DIN scheme [22] incurs an average performance loss of 5%. To address the write disturbance errors along the bit-line, our proposed inter-line mechanism first uses data compression to achieve a non-overlapping data layout between adjacent memory lines, and then integrates the BCH code within the compressed memory lines to protect clean memory lines against write-intensive addresses. Our proposed inter-line scheme achieves reliable write operations while performing about 13% better than the SD-PCM scheme [25].

Overall, by integrating our proposed inter-line and intra-line schemes into the system, we achieve reliable write operations along both the bit-lines and word-lines, and a performance improvement of 17% on average, compared to the state-of-the-art techniques in this domain.

Chapter 5

Performance and Power-Efficient Design of Dense Non-Volatile Cache in CMPs

In this work, we present a novel cache design based on Multi-Level Cell Spin-Transfer Torque RAM (MLC STT-RAM) that can dynamically adapt the set capacity and associativity to efficiently use the full potential of the MLC STT-RAM technology. We exploit the asymmetric nature of the MLC storage scheme to build cache lines featuring heterogeneous performance; that is, half of the cache lines are read-friendly, while the other half are write-friendly. Furthermore, we propose to opportunistically deactivate cache ways in underutilized sets to convert MLC to Single-Level Cell (SLC) mode, which features overall better performance and lifetime. Our ultimate goal is to build a cache architecture that combines the capacity advantages of MLC and the performance/energy advantages of SLC. Our experimental evaluations show an average improvement of 43% in the total number of conflict misses, 27% in memory access latency, 12% in system performance (i.e., IPC), and 26% in L3 access energy, with a slight degradation in cache lifetime (about 7%) compared to an SLC cache.

5.1 Introduction

The ever-increasing number of cores in chip multiprocessors (CMPs), coupled with the trend toward rising working-set sizes of emerging workloads, stresses the demand for a large, multi-level, on-die cache hierarchy to hide the long latency of off-chip main memory. During the last few decades, SRAM-based cache memories successfully kept pace with this capacity demand through an exponential reduction in cost per bit. However, entering the sub-20 nm technology era, where leakage power becomes dominant, it is very challenging to continue expanding the cache hierarchy while avoiding the power wall. One promising approach to address this problem is to replace SRAM with a non-volatile memory (NVM) technology [26,27]. Recently, various NVMs have been prototyped in the nano-technology regime [11,89] and many of them are expected to be commercially available by the end of the decade [90]. Among such memory technologies, Spin-Transfer Torque RAM (STT-RAM) is the best candidate for use within the processor; STT-RAM has zero leakage power, offers almost 4× higher density than SRAM, and has small read access latency and high endurance.

The storage element in an STT-RAM cell is a Magnetic Tunnel Junction (MTJ), which stores binary data in the form of either a parallel magnetic direction (set) or an anti-parallel magnetic direction (reset). Two types of STT-RAM cell prototypes can be realized: Single-Level Cell (SLC) STT-RAM and Multi-Level Cell (MLC) STT-RAM. The SLC STT-RAM cell consists of one MTJ component which is used to store one bit of information. The MLC STT-RAM device, on the other hand, is typically composed of multiple MTJs, which are connected either serially or in parallel, and are used to store more than one bit of information in a single cell. Such increased density in MLCs comes at the cost of a linear increase in access latency and energy with respect to the cell storage level (i.e., the number of bits stored). For instance, the read (or write) latency and energy consumption of a 2-bit STT-RAM cell is twice that of an SLC STT-RAM device under the same fabrication technology. Furthermore, MLC STT-RAM usually has lower endurance (in terms of write cycles) compared to SLC. In short, SLC and MLC storage elements show two different characteristics: SLC is fast, power-efficient, and has a long lifetime; MLC trades off these metrics for higher density.

Over the past few years, several device-level and architecture-level optimizations have been proposed that attempt to address the issues of high write latency/energy [12,26–30] and limited endurance [31,32] in SLC STT-RAM caches. However, little attention has been paid to exploring the potential of the MLC STT-RAM cache in multi-core systems. Indeed, such an analysis and study is necessary as feature scaling is continuing and employing MLC devices seems to be the only way of increasing cache capacity in a cost-efficient and power-efficient manner. In this study, we focus on design space exploration of the MLC STT-RAM cache and related optimizations, when it is used as the last level cache (LLC) in CMPs. Specifically, this work tries to answer the following research questions:

1. What is the design of an MLC STT-RAM cache? Can we reduce its read and write access latency/energy by employing logic-level optimizations?

2. What are the performance implications of an MLC STT-RAM cache compared to an SLC STT-RAM cache in an iso-area design? What kinds of workloads can benefit from an MLC STT-RAM cache configuration? And what kinds of workloads exhibit performance degradation in a system with an MLC cache?

3. What types of architecture-level optimizations can be employed to further improve the performance-efficiency and energy-efficiency of an MLC STT-RAM cache? After all logic-level and architecture-level optimizations, does the resultant MLC STT-RAM cache perform better than its SLC counterpart for a wide range of workload categories?

This work answers the above questions in detail and introduces the following novel mechanisms to tackle the challenges brought by an MLC STT-RAM cache:

• To reduce the read/write access latency and energy of an MLC-based cache, we introduce a stripped data-to-cell mapping scheme as a logic-level optimization. This design mainly relies on the asymmetric behavior of reading or writing different bits of an MLC; while one bit can be read fast, using the other saves more energy during writes. Instead of storing cache lines next to one another in independent memory cells, this mapping enables data blocks to be “stacked” on top of each other – storing bits of two cache lines in the same cell, one as the MSB and the other as the LSB. With this data layout arrangement, we demonstrate that, for half of the cache lines, the read/write access latency is comparable with an SLC cache; and for the rest of the cache lines, the read/write access energy is in the range of an SLC cache.

• To further improve the performance of the cache, we propose an associativity adjustment scheme. This scheme tries to adjust the associativity degree of each set independently by switching on or off the ways stacked onto others. With this mechanism in place, a cache set acts like a highly-associative cache when it benefits from more associativity and behaves like a low-associativity cache when extra capacity is not useful, thereby reducing both energy consumption and cache access latency. The main feature of this dynamic design is that it can easily determine cache associativity with limited hardware overhead. Indeed, this scheme only requires (1) a dynamic mechanism to detect workload behavior within each cache set and (2) an efficient read- and write-aware inter-set data movement which does not alter any address and data paths in the cache hierarchy.

• We also propose a swapping policy to enhance the performance and energy efficiency of the cache, on top of the other two schemes. When used with a stripped cache design, this scheme tries to place the read-dominant cache blocks in the ways with small read access latency, while it places the write-dominant cache blocks in the ways with small write energy.

Compared to an SLC STT-RAM cache with limited associativity, the proposed design reduces the conflict misses of sets with large working sets. And, compared to an MLC STT-RAM cache with conventional stacked data-to-cell mapping, it improves read performance and write energy by adopting the stripped data-to-cell mapping scheme and also converting MLC cache lines to SLC lines when a set does not need the extra capacity.

5.2 Overview of STT-RAM Technology

5.2.1 Single-Level Cell (SLC) Device

STT-RAM is a scalable generation of Magnetic Random Access Memory (MRAM). Figure 5.1a shows the basic structure of an SLC STT-RAM cell, which is composed of a standard NMOS access transistor and a Magnetic Tunnel Junction (MTJ) as the information carrier – this forms a “1T1J” memory element. Each MTJ consists of two ferromagnetic layers and one oxide barrier layer. One of the ferromagnetic layers (i.e., the reference layer) has a fixed magnetic direction, and the magnetic direction of the other (i.e., the free layer) can be changed by passing a spin-polarized current into the cell. If these two ferromagnetic layers have anti-parallel (or parallel) directions, the MTJ resistance is high (or low), indicating a binary ‘1’ (or ‘0’) state (Figure 5.1b). Compared to conventional MRAM, STT-RAM exhibits superior scalability since the threshold current required to reverse the state (from ‘0’ to ‘1’ or vice versa) decreases as the MTJ size shrinks. Furthermore, the STT-RAM technology has reached a good level of maturity and the fabrication cost of cells in this technology is small – the number of additional mask steps beyond a standard CMOS process is no more than two (less than 3% added cost) [89].

Figure 5.1: (a) SLC STT-RAM cell consisting of one access transistor and one MTJ storage carrier (“1T1J”); (b) Binary states of an MTJ: two ferromagnetic layers with anti-parallel (or parallel) direction indicate a logical ‘1’ (or ‘0’) state; (c) Resistance levels for 2-bit STT-RAM: four resistance levels are obtained by combining the states of two MTJs having different threshold currents.

5.2.2 Multi-Level Cell (MLC) Device

The MLC capability can be realized by modeling four or more resistance levels within one cell. With current VLSI technologies, MLC STT-RAM devices are fabricated using multiple MTJs structured in either parallel or series connections. Figures 5.2a and 5.2b depict these two configurations for a 2-bit MLC device – the 2-bit series STT-RAM is composed of two vertically stacked MTJs, and the 2-bit parallel cell is composed of one common reference layer and two independent free layers which are arranged side-by-side (parallel). Under the same fabrication conditions and process variation, the serial MLC has advantages over the parallel design: (1) it exhibits much lower bit error rates related to read and write disturbance issues [51]; (2) it results in smaller cells; and (3) while it has full compatibility with modern perpendicular STT-RAM technology, fabricating parallel MTJs with this technology is very challenging. We consider serial MTJ-based 2-bit STT-RAM cells in this work, although our ideas apply equally to parallel MTJ-based devices and MLCs with higher bit densities. In the 2-bit serial STT-RAM, the MTJs have different layer thicknesses and areas. This results in different threshold currents (Ic) and resistance variations (∆R) for the two MTJs. Four levels of resistance are obtained by combining the binary states of both MTJs (Figure 5.1c). In such a cell, the layer that requires a small current to switch (MTJ1) is referred to as the soft-domain, and the layer that requires a large current to switch (MTJ2) is referred to as the hard-domain. Assuming 2-bit information, the Least Significant Bit (LSB) and the Most Significant Bit (MSB) are stored in the soft-domain and the hard-domain, respectively.


Figure 5.2: (a) MLC STT-RAM cell with serial MTJs: the soft-domain on top of the hard-domain; (b) MLC STT-RAM cell with parallel MTJs

5.2.2.1 Two-step Write Operation

The circuit schematic for access operations on a 2-bit serial STT-RAM is depicted in Figure 5.3. The two MTJs are accessed by an access transistor controlled by a word-line (WL) signal, and a write current always passes through both MTJs. Accordingly, when the hard-domain is written, the state of the soft-domain is also switched into the same direction, because the current required to switch the hard-domain exceeds the (smaller) threshold current of the soft-domain. In order to write the LSB, the direction and amplitude of the current pulse are defined by the logical values of the MSB and LSB as well as the characteristics of both MTJs. If the desired LSB value is equal to the newly written MSB, a second current pulse is not necessary, since the soft-domain was already switched to the proper state. Otherwise, a second current pulse is required to write the LSB into the soft-domain (Figure 5.3c). This second current pulse must be set between the Ic values of the two MTJs to prevent a bit flip of the hard-domain (Figure 5.3b). Consequently, the number, direction, and amplitude of write current pulses in this two-step write scheme vary depending on the written data.
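As a concrete illustration of this policy, the sketch below (a simplified model with hypothetical pulse descriptors, not the actual write-driver circuit) enumerates how many current pulses each 2-bit data pattern requires:

```cpp
// A minimal sketch of the two-step write decision, assuming amplitude "2" is
// above the hard-domain threshold and "1" lies between the two thresholds.
#include <cstdio>

struct WritePulse {
    int  amplitude;   // 2 = switches both domains, 1 = switches only the soft-domain
    bool high;        // current direction: true writes '1', false writes '0'
};

// Plans the pulses needed to store {msb, lsb}; returns the pulse count (1 or 2).
int planTwoStepWrite(bool msb, bool lsb, WritePulse pulses[2]) {
    // Step 1: a large pulse writes the MSB into the hard-domain. Since it also
    // exceeds the soft-domain's smaller threshold, the soft-domain follows the MSB.
    pulses[0] = {2, msb};
    if (lsb == msb)
        return 1;                 // the soft-domain already holds the desired LSB
    // Step 2: a small pulse flips only the soft-domain to the desired LSB,
    // leaving the hard-domain untouched.
    pulses[1] = {1, lsb};
    return 2;
}

int main() {
    WritePulse p[2];
    for (int msb = 0; msb <= 1; ++msb)
        for (int lsb = 0; lsb <= 1; ++lsb)
            printf("write '%d%d' -> %d pulse(s)\n", msb, lsb,
                   planTwoStepWrite(msb != 0, lsb != 0, p));
    return 0;
}
```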

5.2.2.2 Two-step Read Operation

On a read, the access transistor is turned on and a voltage difference is applied between the bit-line (BL) and source-line (SL) terminals. This voltage difference causes a current to pass through the cell and is small enough to avoid switching the magnetic direction of either MTJ. The value of the current is a function of the MTJs' resistance and is fed into a multi-reference sense amplifier. The sense amplifier unit has three resistance references, each between two neighboring resistance states (Figure 5.3d). The cell content is read through two read cycles in a binary-search fashion. In the first step, the sense amplifier uses R2 as a reference to identify the value of the MSB stored in the hard-domain. In the second step, depending on the MSB value, the reference resistance is switched to R1 or R3, and the LSB is read from the soft-domain.
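The binary-search decode can be summarized with the sketch below; the absolute resistance values and references are illustrative placeholders chosen only to preserve the ordering of the four states (‘00’ < ‘01’ < ‘10’ < ‘11’), not measured device parameters:

```cpp
// A minimal sketch of the two-cycle read: cycle 1 compares against R2 to
// resolve the MSB, cycle 2 compares against R1 or R3 to resolve the LSB.
#include <cstdio>

const double R_STATE[4] = {10.0, 14.0, 18.0, 22.0};  // resistances of '00'..'11'
const double R1 = 12.0, R2 = 16.0, R3 = 20.0;        // the three references

int readCell(double rCell) {
    int msb = (rCell > R2) ? 1 : 0;     // first read cycle (hard-domain)
    double ref = msb ? R3 : R1;         // pick the reference for the second cycle
    int lsb = (rCell > ref) ? 1 : 0;    // second read cycle (soft-domain)
    return (msb << 1) | lsb;
}

int main() {
    for (int s = 0; s < 4; ++s) {
        int v = readCell(R_STATE[s]);
        printf("stored '%d%d' -> read '%d%d'\n", s >> 1, s & 1, v >> 1, v & 1);
    }
    return 0;
}
```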

Figure 5.3: (a) Schematic of the read and write access circuits in a 2-bit MLC cache array; (b, c) write operation transition model: first, the MSB is written to the hard-domain and, if the LSB differs from the MSB, a small current is driven to switch its direction; (d) three resistance references in the sense amplifier, each between two neighboring resistance states.

5.2.3 SLC versus MLC: Device-Level Comparison

Table 5.1 gives the typical latency, energy and lifetime parameters of a 2-bit MLC and an SLC STT-RAM cell. These parameters are taken from state-of-the-art prototypes (e.g., [11]). We can compare SLC and MLC in four ways:

• Cell Area – In STT-RAM, the MTJ size is larger than the access transistor and determines the cell size. As a result, under the same technology constraints, the SLC has the same area as the smaller MTJ of the MLC (MTJ1), i.e., 70×140 nm, while the MLC area of 75×150 nm is slightly larger (due to the larger MTJ2). As such, in an iso-area design, the capacity of the 2-bit MLC cache is less than twice that of the SLC cache.

Table 5.1: MLC STT-RAM compared to the SLC cell model [11].

Parameter       | MLC: LSB (MTJ1) | MLC: MSB (MTJ2) | SLC
Dimensions      | 70×140 nm       | 75×150 nm       | 75×140 nm
Read Latency    | 0.962 ns        | 0.962 ns        | 0.856 ns
Read Energy     | 0.0115 pJ       | 0.0115 pJ       | 0.0112 pJ
Write Latency   | 10 ns           | 10 ns           | 10 ns
Write Energy    | 1.92 pJ         | 3.192 pJ        | 3.192 pJ
Endurance       | 10^10 writes    | 10^10 writes    | 10^12 writes

• Read Operation – Both devices use the same read voltage difference (i.e., -0.1 V [11]), which results in a read latency of 0.856 ns for the SLC cell and a total read latency of 1.9 ns for the MLC cell (about 0.9 ns per bit, which is in the same range as the SLC). The same discussion can be made for energy consumption.

• Write Operation – In order to meet the write performance requirement of the LLC, the write pulse width is set to 10 ns [50] (which is in the range of today's SRAM caches). For the 2-bit MLC, this results in a write energy of 1.92 pJ for MTJ1 (the soft-domain) and 3.192 pJ for MTJ2 (the hard-domain) [91]. Also, writing into an SLC cell consumes the same energy as writing into MTJ2, as both employ the same-sized MTJs.

• Cell Endurance – Due to the need for larger write currents and the two-step write operation in the MLC cell, it generally has lower cell endurance than the SLC (about 100× lower, based on the model in Table 5.1).

Throughout this study, we use the values in this table for performance, energy and endurance analysis of the STT-RAM-based cache designs.

5.3 MLC STT-RAM Cache: The Baseline

We assume a chip multiprocessor (CMP) with an MLC-based STT-RAM last-level cache (LLC). Figure 5.4 shows the architectural model of this system where, due to the CMOS non-compatibility of STT-RAM, the LLC is integrated into the processor die using 3D VLSI¹. This system is a 2-tier 3D chip. At the processor tier, the cores and all lower caches are placed. The STT-RAM cache (i.e., the LLC) is logically shared among all cores while physically structured as a static NUCA and mounted at the top tier of the 3D die. Note that prior studies assume the same integration model for NVM caches in future chip multiprocessors [12,29,93]. The modeled last-level STT-RAM cache has two characteristics, namely, a NUCA structure and serial-lookup access, which are explained below in detail.

NUCA structure – The LLC in our CMP model is a large cache, which has to be structured as a NUCA (i.e., Non-Uniform Cache Architecture) for scalability and high performance. In a NUCA structure, a large cache is divided into multiple banks (one bank per core) connected by an on-chip network for data and address transfer between the banks. NUCA exhibits non-uniform access latencies depending on the physical location of the cache bank being accessed [94]. Two main types of NUCA have been proposed: static NUCA and dynamic NUCA. In static NUCA, a cache line is statically mapped into a specific bank, with the lower bits of the index determining the bank. In dynamic NUCA, on the other hand, any given line can be mapped into several banks based on a placement policy. Although dynamic NUCA can drastically reduce the average data access latency (compared to the static NUCA structure) by putting frequently-accessed data into banks that are closer to the source of the request [95], it is not a good design choice for non-volatile caches, since it significantly increases the cache write traffic (due to frequent movement of cache lines between different cache banks), which can in turn accelerate the wear-out problem. Therefore, we assume that the LLC is built as a static NUCA.

Serial-lookup access – The modeled LLC is a serial-lookup cache. In serial caches, the tag and data arrays are accessed sequentially, saving energy at the expense of increased delay. The serial cache access latency relies heavily on the tag array latency, and consequently, we choose SLC for the tag array to minimize the latency. Figure 5.4b shows the modeled LLC with an SLC STT-RAM tag array and a 2-bit MLC STT-RAM data array. As shown in this figure, STT-RAM has the same peripheral interfaces used in SRAM caches: each bank consists of a number of cache lines, decoders for the set index, read circuits (RDs), write circuits (WRs), and output multiplexers. Unlike SRAM, however, the current sense amplifiers in the STT-RAM read circuit are shared and multiplexed across bit-lines due to their large size compared to the cell array. For the read and write operations, a decoder selects a cache set and connects the selected physical line to the RDs for reading or the WRs for writing.

¹Today, 3D ICs are commercially available [92] and have been receiving immense research interest since the early 2000s. Besides its latency and bandwidth benefits, one of the advantages of 3D ICs over 2D ICs is that they provide a platform to integrate different (non-compatible) technologies on the same die, with less concern about the impact of noise and fabrication cost.

Figure 5.4: 2-bit STT-RAM layout and a schematic view of the cache array, read and write circuits. Because of technology compatibility issues, the STT-RAM last-level cache is built on top of the cores.

5.3.1 Stripped Data-to-Cell Mapping

In the discussed 2-bit MLC model (Section 5.2.3), both bits are assumed to be written or read together, although sequentially. Applied to a cache context, both bits of an MLC cell would normally be mapped to the same cache line, as illustrated in Figure 5.5 as stacked mapping. In this stacked data-to-cell mapping, reading a cache line always takes two read cycles (i.e., 1.9 ns in our 2-bit model), while writing takes at most two write cycles (i.e., 20 ns). In other words, with the stacked mapping, the access latency of an MLC STT-RAM cache is roughly twice that of an SLC cache. The same discussion applies to the energy consumption of each read/write access. As opposed to this design, we propose to exploit the read and write asymmetry of the two MTJs in order to simultaneously optimize overall access latency and energy consumption. As discussed earlier regarding cache accesses in an MLC configuration, we observe a different latency and energy consumption for different bits: for read operations, the MSB can be read from the hard-domain in a single read cycle (0.96 ns for the 2-bit model in Table 5.1), whereas reading the LSB requires a second read cycle.

Figure 5.5: An illustration of stacked versus stripped data-to-cell mapping for an 8-bit data array (four 2-bit MLC cells) and 2-bit tag arrays (in SLC). In the stacked data-to-cell mapping scheme, the data bits of the same cache block (each one has 8 bits) are mapped to 4 independent memory cells (i.e., 2 in each cell). In the stripped mapping, each memory cell contains only one bit of each cache block – so each 8-bit cache block spans 8 memory cells. For instance, lines ‘B’ and ‘D’ are mapped to hard-domains (MTJ2), whereas lines ‘A’ and ‘C’ use soft-domains (MTJ1). Note that the tag arrays are made of SLC cells; therefore, both data-to-cell mapping schemes have the same tag structure.

For write operations, on the other hand, one can write into either the hard-domain or the soft-domain independently, by using a single current pulse. Unlike the soft-domain, writing into the hard-domain has two effects:

1. Writing into the hard-domain may cause the LSB to flip. Thus, each write request to the MSB must be preceded by an LSB read, which takes two read cycles. Following that, both MTJs are written sequentially, that is, first the MSB and then the previously read LSB.

2. Writing into the hard-domain dissipates 1.6 times the energy required for the soft-domain. The larger the write current, the shorter the cell lifetime.

Accordingly, although the hard-domain emulates SLC read access in latency, the cost of writing into the soft-domain is much lower, primarily because only a small current is required to switch its polarity. Based on the different read/write characteristics of both MTJs, we propose a stripped data-to-cell mapping, which groups the hard-domains together to form Fast Read High-Energy write (FRHE) lines, and groups the soft-domains to form Slow Read Low-Energy write (SRLE) lines. The right portion of Figure 5.5 depicts the logical arrangement of the tag and data arrays for the stripped mapping. Within each cache set, half of the cache lines will be mapped to the hard-domains and the other half to the soft-domains. Table 5.2 summarizes the sequence of transactions required to read and write the FRHE or SRLE lines in a stripped MLC-based (2-bit) cache configuration. In addition to its performance efficiency, the stripped data-to-cell mapping provides the opportunity to optimize the energy consumption and lifetime of a 2-bit MLC cache. Before describing these advantages, we first evaluate the performance efficiency of an MLC-based cache when running real workloads.

Table 5.2: Sequence of transactions when accessing a 2-bit MLC cache with stripped data-to-cell mapping. FRHE (Fast Read High-Energy write) cache lines consist of hard-domains (MTJ2), and SRLE (Slow Read Low-Energy write) cache lines consist of soft-domains (MTJ1).

FRHE Read:  (1) Read hard-domains
FRHE Write: (1) Read hard-domains; (2) Read soft-domains; (3) Write hard-domains; (4) Write soft-domains
SRLE Read:  (1) Read hard-domains; (2) Read soft-domains
SRLE Write: (1) Write soft-domains
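Combining Table 5.2 with the per-bit numbers in Table 5.1, the sketch below tallies the cell-level latency of each access type; the array and peripheral overheads that Table 5.4 additionally accounts for are deliberately left out, so these sums are illustrative rather than the reported hit latencies.

```cpp
// A minimal sketch: each access type is a sequence of read cycles ('R') and
// write pulses ('W') over the hard-/soft-domains, per Table 5.2.
#include <cstdio>
#include <string>

const double READ_NS  = 0.962;   // one read cycle (Table 5.1)
const double WRITE_NS = 10.0;    // one write pulse (Table 5.1)

double cellLatency(const std::string& ops) {
    double ns = 0.0;
    for (char op : ops) ns += (op == 'R') ? READ_NS : WRITE_NS;
    return ns;
}

int main() {
    struct { const char* name; const char* ops; } access[] = {
        {"FRHE read",  "R"},       // read hard-domains only
        {"SRLE read",  "RR"},      // read hard-domains, then soft-domains
        {"FRHE write", "RRWW"},    // read both, then rewrite both
        {"SRLE write", "W"},       // a single small pulse
    };
    for (const auto& a : access)
        printf("%-10s : %5.2f ns at the cell level\n", a.name, cellLatency(a.ops));
    return 0;
}
```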

5.3.2 Performance Analysis

Employing an MLC STT-RAM cache has two opposing impacts on the performance of a multi-core system. On the one hand, thanks to its large capacity, it can improve performance by reducing the cache miss rate and, thus, the need for accessing off-chip main memory when employed as an LLC. This is especially true for emerging workloads (such as social networking applications and new database workloads) that usually have very large data working-set sizes. On the other hand, due to its high read and write access latencies, such a cache architecture degrades system performance for workloads with low or negligible miss rates in the on-chip caches. Although the stripped data-to-cell mapping partially addresses the problem of high access latency, the problem still exists for half of the cache lines, which always exhibit the maximum MLC STT-RAM read/write latencies.

To examine the performance efficiency of a 2-bit STT-RAM cache with the stripped data-to-cell mapping, Figure 5.6 presents the LLC miss rate and Instructions-per-Cycle (IPC) of a single-core system for a one-second execution of four applications from SPEC-CPU 2006 [2]. The applications are chosen to cover a wide range of scenarios with low, moderate, and high LLC miss rates. In this experiment, we skip the first 100 million instructions as the initialization stage, and the results for the next 200 million instructions are collected. We assume two LLC configurations: (1) a 256KB SLC-based STT-RAM cache with 8 ways per set, and (2) a 512KB MLC-based STT-RAM cache with 16 ways per set which is also equipped with the proposed stripped data-to-cell mapping. In both configurations, the cache line size is set to 64B. Note that both configurations have almost the same LLC area (i.e., an iso-area analysis).²

One can make two main observations from the results reported in Figure 5.6. First, if an application exhibits a low LLC miss rate during some of its phases or its entire execution, the SLC cache results in better performance (i.e., higher IPC), thanks to its shorter access latency compared to the MLC-based cache. Second, during the phases where the SLC cache's miss rate is high, the MLC cache can increase IPC if it can hold the whole or a major part of the application's working set.

²For this experiment, we used the same simulation platform described in Section 5.4.

Figure 5.6: Performance comparison of the SLC- and MLC-based STT-RAM caches in terms of the LLC miss rate and IPC (as a system-level metric) for four workloads (gcc, xalancbmk, hmmer, and omnetpp) from the SPEC-CPU 2006 benchmark suite [2]. The SLC-based configuration outperforms the MLC-based configuration over the low miss-rate execution phases because of the faster read and write accesses in SLC format. On the other hand, the MLC-based cache is more efficient during the high miss-rate phases, thanks to its larger capacity that can hold a larger portion of the working set.

Indeed, there are cases where the application's memory footprint is very large (even larger than the MLC cache size) or the application has a streaming behavior (i.e., accessing a large set of addresses in a sequential fashion without reuse), during which both the SLC and MLC caches have high miss rates and consequently low IPC values. Note that we have observed the same behavior for the energy consumption of the SLC and MLC caches, which will be discussed in the evaluation section. To generalize our observation to large LLC sizes and evaluate the performance efficiency of the stripped data-to-cell mapping, Figure 5.7 reports the average memory access latency seen by the requests missed in the L2 cache in the evaluated system.

Figure 5.7: Comparison of the stripped MLC cache configuration with the stacked MLC configuration and the SLC format, each with the same die area. The stripped MLC configuration outperforms the SLC format in applications with high and medium L3 miss rates, because it increases the effective cache capacity in terms of lines and associativity. It is also better than the conventional stacked MLC cache as it constructs the fast read lines (i.e., FRHE).

In this figure, we compare the results of the proposed architecture with a 4MB 16-way stripped LLC (L3) against two extreme baselines: (1) a 2MB 8-way SLC cache with nearly the same die area (i.e., a fast cache), and (2) a 4MB 16-way 2-bit cache with stacked data-to-cell mapping (i.e., a dense cache). For this experiment, we use the same evaluation methodology described later in Section 5.4 and the workload sets in Table 5.5. This figure confirms that the stripped MLC lets us have our cake and eat it too: in applications with high and medium LLC miss rates, the performance improvement over the SLC baseline is mainly due to the increase in effective cache capacity; while in applications with a low L3 miss ratio, it reduces the access time of the MLC cache by constructing separate fast read and write lines.

Based on this discussion, one can conclude that the MLC STT-RAM cache is not always beneficial – it can be harmful to the cache latency and energy in certain cases. Therefore, in order to consistently achieve low memory access latency, we need a mechanism to “dynamically” shape-shift the MLC cache to an SLC cache, or vice versa, depending on the applications' dynamic cache requirements.

5.3.3 Enhancements for the Stripped MLC Cache

5.3.3.1 The Need for Dynamic Associativity

It is not beneficial for performance to shape-shift the cache configuration from SLC to MLC (or vice versa) at the granularity of the entire cache. Memory references in general-purpose applications are often non-uniformly distributed across the sets of a set-associative cache. This non-uniformity creates a heavy capacity demand on some sets, which can lead to a high number of local conflict misses, while lines in some other sets can remain underutilized. This fact, which is the basis of our dynamic associativity technique, can be illustrated with some examples. Figure 5.8 plots the absolute number of conflict misses that each set exhibits in an 8-way 512KB MLC-based cache in a single-core system during a one-second execution of four workloads (the same workloads as in Figure 5.6). One can see that, for each workload, there are some sets that have few conflict misses, while some others are very much stressed. Moreover, the number of these opposite-behaving sets varies from one program phase to another through the course of execution.

Figure 5.8: Distribution of the missed accesses over the LLC's sets for 200 million instructions in four applications (gcc, xalancbmk, hmmer, and omnetpp) from the SPEC-CPU 2006 benchmark suite [2]. We see that, in three out of four applications, there are some sets that have few conflict misses, while some others are very stressed. Such access non-uniformity over different sets requires fine-grained (i.e., set-level) tuning to optimize both the latency and capacity of the STT-RAM cache.

This non-uniformity of miss counts across different cache sets indicates that, in the stripped MLC-based cache, high associativity not only brings no benefit to sets with underutilized lines, but can also result in a great degradation in the performance, lifetime, and energy consumption of the cache. As a result, the new cache architecture we propose involves, besides the stripped data-to-cell mapping, an on-demand associativity policy which dynamically modulates the associativity of each set.

To this end, we propose an on-demand associativity adjustment policy which determines the associativity of each set in the stripped MLC-based cache considering its local miss rate. We initialize the associativity to the lowest level, which corresponds to half of the full capacity in our case. Then, the associativity of a set grows over time depending on its dynamic utilization. To mitigate the effects of slow reads and high-energy writes, when a cache line needs to be turned off, an FRHE and SRLE pair is merged into an SLC line, which uses exclusively the soft-domains, while all hard-domains are fixed at the same value (‘0’ or ‘1’)³. As a result, an SLC line will be read in a single cycle since the hard-domains are known and a single resistance reference will be required. This causes an SLC line to feature fast reads and low-energy writes.

We first describe how the decision to grow the associativity is taken. As replacement policies are not ideal, it is clear that the associativity should not be prematurely increased after every miss. To achieve better cache performance, we introduce two saturation counters for each set: a miss counter (Mcnt) and a weight counter (Wcnt). The miss counter captures the number of misses that a set exhibits, and the weight counter prevents the effects of short-term variations in misses and makes a set with a large weight value less likely to increase its associativity. Wcnt reflects the associativity of a set and is initialized to the minimum associativity. In each program epoch (i.e., every 100 million instructions in our settings, after sensitivity analysis), Mcnt is initialized to Wcnt×N and decremented on each miss to the set; N is a design parameter which is discussed in more detail in Section 5.3.3.3. When Mcnt reaches zero, the hardware increases the associativity by one and, on the next miss, the fetched block will be placed into the newly-allocated way.

³In our experiments, the hard-domains are set to logic ‘0’ since bits “00” and “01” have less overlap under severe process variation [51].

In an effort to balance the wear between blocks and maximize lifetime, we introduce a circular pointer (a 3-bit counter for the eight cache-line pairs of our stripped cache) indicating which cache-line pair should be selected for the next increase in associativity. Therefore, cache lines are switched to the MLC mode in a round-robin fashion and writes are distributed among them. At the end of an epoch, Mcnt is compared with SLC-Associativity×N to decide whether the cache set associativity should be reduced. If Mcnt is larger than SLC-Associativity×N, this indicates that the utilization is low enough to reduce the associativity by one for the next epoch. In this case, the replacement policy is triggered and the associativity is reduced by evicting a cache line and converting the corresponding cache-line pair into SLC.
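A minimal sketch of this per-set policy is given below, using the parameters above (N = 256, 8 to 16 ways) and leaving the actual eviction and the FRHE/SRLE merge to the surrounding cache model; for simplicity it grows a set's associativity at most once per epoch, a detail the description above leaves open.

```cpp
// A minimal sketch of the on-demand associativity adjustment for one cache set.
#include <cstdint>

constexpr int MIN_WAYS = 8;      // all cache-line pairs merged into SLC lines
constexpr int MAX_WAYS = 16;     // all pairs split into FRHE/SRLE MLC lines
constexpr int N        = 256;    // design parameter (after sensitivity analysis)

struct SetState {
    int      ways  = MIN_WAYS;        // current associativity (tracked by Wcnt)
    uint16_t mcnt  = MIN_WAYS * N;    // miss down-counter for the current epoch
    uint8_t  rrPtr = 0;               // circular pointer over the 8 cache-line pairs

    void onMiss() {
        if (mcnt > 0 && --mcnt == 0 && ways < MAX_WAYS) {
            ++ways;                       // split one pair back into MLC mode
            rrPtr = (rrPtr + 1) % 8;      // round-robin choice balances wear
        }
    }

    void onEpochEnd() {
        // A large leftover Mcnt means the set saw few misses: shrink by one way
        // (evict a line and merge that FRHE/SRLE pair into an SLC line).
        if (mcnt > MIN_WAYS * N && ways > MIN_WAYS)
            --ways;
        mcnt = static_cast<uint16_t>(ways * N);   // re-arm for the next epoch
    }
};
```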

5.3.3.2 The Need for a Cache Line Swapping Policy

Besides its performance benefits, the stripped mapping provides an opportunity to further enhance the energy efficiency and lifetime of an MLC-based STT-RAM cache. More precisely, fast read lines (i.e., FRHE lines) greatly speed up read operations compared to the stacked mapping. On the other hand, if write-dominated blocks can be directed to low write-energy lines (SRLE), the write energy and cell lifetime can be kept close to those of SLC. To maximize the benefits provided by the stripped data layout, we propose a swapping policy to dynamically promote write-dominated data blocks to SRLE lines and read-dominated ones to FRHE lines. The swapping mechanism in the stripped mapping must be used with care. Looking into the cache traces, we find many blocks that are neither read- nor write-dominated. In Figure 5.9, we report the read/write intensity of all the memory blocks in the LLC for a set of workloads from the multi-threaded PARSEC-2 benchmark programs [96] and the SPEC-CPU 2006 benchmark programs [2]. A memory block is considered write-dominated (or read-dominated) if that memory block is written into (or read from) in 90% of the accesses to that block. On average, 32.7% of the memory blocks are dominated in this sense, and as few as 2% in some programs. In this situation, even utilizing a near-optimal data swap, the remaining non-dominated memory blocks might keep swapping between the FRHE and SRLE lines. This might cause a higher number of contentions and higher energy consumption. Worse, the cache lifetime would be considerably reduced by the write amplification provoked by the swaps.

Figure 5.9: Percentage of memory blocks in the cache with read-dominated, write-dominated, and non-dominated properties for a set of workloads from the PARSEC-2 and SPEC-CPU 2006 programs.

Therefore, to avoid this scenario, a swap policy is introduced. For a read- and write-aware swap, each cache-line pair is associated with a swap counter (Scnt) and a swap weight counter (SWcnt). At each epoch, SWcnt is initialized to ‘1’, increments when a swap happens, and saturates when it reaches M. Scnt is initialized at each epoch to SWcnt×K, and decrements when the pair's FRHE line is written or when its SRLE line is read. If Scnt reaches zero, either the SRLE or the FRHE line will replace the block to be evicted, based on the replacement policy in use: if the victim line is an FRHE line, it is replaced by the SRLE line; otherwise, it is replaced by the FRHE line. Here, M and K are design parameters which are discussed below in more detail. Note that, on a miss, Scnt and SWcnt are reinitialized.
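The counter interplay can be sketched as follows; K and M follow Section 5.3.3.3, the actual data movement between the FRHE line and its SRLE partner is left to the caller, and re-arming Scnt right after a swap is an assumption on our part (the hardware could equally wait for the next epoch or the next miss).

```cpp
// A minimal sketch of the swap counters for one FRHE/SRLE cache-line pair.
#include <cstdint>
#include <algorithm>

constexpr int K = 16;     // design parameter (after sensitivity analysis)
constexpr int M = 256;    // saturation value of the swap weight counter

struct PairSwapState {
    uint16_t swcnt = 1;        // swap weight counter, saturates at M
    uint16_t scnt  = 1 * K;    // swap counter, initialized to SWcnt*K each epoch

    // Called when the pair's FRHE line is written or its SRLE line is read.
    // Returns true when the caller should swap the two blocks of the pair.
    bool onCostlyAccess() {
        if (scnt > 0 && --scnt == 0) {
            swcnt = static_cast<uint16_t>(std::min<int>(swcnt + 1, M));
            scnt  = static_cast<uint16_t>(swcnt * K);   // re-arm (assumption)
            return true;
        }
        return false;
    }

    void onEpochStart() { scnt = static_cast<uint16_t>(swcnt * K); }
    void onMiss()       { swcnt = 1; scnt = static_cast<uint16_t>(swcnt * K); }
};
```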

5.3.3.3 Overhead of the Counters

The main logic overhead of the proposed dynamic associativity and swapping schemes is the saturating counters. Our mechanism uses one miss counter (Mcnt), one weight counter (Wcnt), one swap counter (Scnt), and one swap weight counter (SWcnt) for each set in the set-associative cache.

1. Wcnt reflects the associativity of a set. So, in a 16-way cache (which is the default configuration throughout our evaluation), Wcnt is a 4-bit counter.

2. Mcnt's size is determined by both Wcnt and N (a design parameter). Indeed, as we described earlier, Mcnt is initialized to Wcnt × N at the beginning of each epoch. The value of N should be neither too large nor too small; otherwise, it prevents our algorithm from suitably tracking the changes in the miss rate of the associated set. In our settings, we set N to 256 after sensitivity analysis. Accordingly, the maximum value of Mcnt is 16 × 256, which means Mcnt is a 12-bit counter.

3. The maximum value of SWcnt is M, which is set to 256 (i.e., the saturation value of the counter). Thus, SWcnt is an 8-bit counter.

4. The maximum value of Scnt is the maximum value of SWcnt × K, where K is a design parameter which is set to 16 after sensitivity analysis. Thus, Scnt is a 12-bit counter.

To summarize, our proposed mechanism keeps two 12-bit counters (Mcnt and Scnt), one 8-bit counter (SWcnt), and one 4-bit counter (Wcnt) per set, which amounts to 36 bits (under 5 bytes) per set. As each set holds 16 ways of 64 bytes each, the overhead of our mechanism is very small (< 0.5%).
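As a quick sanity check on the storage overhead (a back-of-the-envelope calculation using the counter widths above and a 16-way set of 64-byte lines):

$$\frac{12 + 12 + 8 + 4}{16 \times 64 \times 8} \;=\; \frac{36}{8192} \;\approx\; 0.44\% \;<\; 0.5\%$$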

5.4 Experimental Methodology

In this section, we describe our simulation platform, the characteristics of the studied workloads, and the design methodology used in our experimental evaluations. Note that, Figures 5.6, 5.7, 5.8, and 5.9 also use the simulation methodology described here.

5.4.1 Infrastructure

We perform a microarchitectural, execution-driven simulation of an out-of-order processor model with Alpha ISA using the Gem5 simulator [74]. The simulated CMP runs at 2.5 GHz frequency. We use McPAT [97] to obtain the timing, area, energy and thermal estimation for the CMP we model, and use CACTI 6.5 [98] for detailed cache area, power and timing models. For STT-RAM LLC, NVSim [75] is used and is parametrized with the cell latency and energy parameters from Table 5.1. We use 45 nm ITRS models, with a High-Performance (HP) process for all the components of the chip except for LLC, which uses a Low-Operating-Power (LOP) process.

5.4.2 Configuration of the Baseline System

We model the 8-core CMP system detailed in Table 5.3. The system has three levels of caches and, because STT-RAM is not compatible with CMOS, it is built as a 2-tier 3D integration. At the processor tier, the cache hierarchy has a private L1 instruction and data cache for each core. Each core also has a private L2 that is kept exclusive of the L1 cache. The STT-RAM L3 cache is logically shared among all the cores while physically structured as a static NUCA and mounted at the top tier of the 3D die [12,29,93]. At 45 nm, the CMP area is estimated to be 60 mm² with a TDP of 77 W at 2.5 GHz and a 1.1 V supply voltage.

Using McPAT [97], the processor layer in our 3D IC (i.e., tier 1) has a 5.1 mm² die area. By assuming an SLC STT-RAM cell size of 14 F² [29], we derive that a 5 MB SLC cache can fit in the same die area in tier 2. Since the area of an STT-RAM cell is dominated by its access transistor and we use an SLC tag array for our MLC data array, an 8 MB MLC STT-RAM cache can fit within the 5.1 mm² area. Table 5.4 summarizes the configuration of the LLC in the proposed system as well as in three reference configurations: a 5 MB SLC cache (fast cache) and an 8 MB 2-bit MLC cache with stacked mapping (dense cache), both with the same die area, and an 8 MB SLC cache (fast-dense cache) with double the area.

The performance of a static STT-RAM NUCA is highly sensitive to write operations, which can block subsequent read requests. To alleviate this inefficiency, each bank has a separate 8-entry Read Queue (RDQ) and 32-entry Write Queue (WRQ) that queue all pending requests. A read request to a line pending in the WRQ is serviced by the WRQ. When a bank is idle and either the RDQ or the WRQ (but not both) is non-empty, the oldest request from that queue is serviced. If both the RDQ and the WRQ are non-empty, then a read request is serviced unless the WRQ is more than 80% full, in which case a write request is serviced. This ensures that read requests are given priority for service in the common case, and write requests eventually have a chance to get serviced.
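A minimal sketch of this per-bank arbitration is shown below, assuming simple FIFO queues and omitting the path that services reads directly from a matching WRQ entry; the type and function names are illustrative.

```cpp
// A minimal sketch of the RDQ/WRQ arbitration described above.
#include <cstdint>
#include <deque>

struct Request { uint64_t addr; bool isWrite; };

struct BankScheduler {
    static constexpr size_t WRQ_CAP = 32;
    std::deque<Request> rdq;   // 8-entry read queue  (capacity check omitted)
    std::deque<Request> wrq;   // 32-entry write queue (capacity check omitted)

    // Picks the next request when the bank becomes idle; false if nothing pending.
    bool next(Request& out) {
        bool wrqPressure = wrq.size() > (WRQ_CAP * 8) / 10;   // WRQ more than 80% full
        if (!rdq.empty() && (wrq.empty() || !wrqPressure)) {
            out = rdq.front(); rdq.pop_front();               // reads have priority
        } else if (!wrq.empty()) {
            out = wrq.front(); wrq.pop_front();               // drain writes when forced
        } else {
            return false;
        }
        return true;
    }
};
```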

Table 5.3: Main characteristics of our simulated CMPs.

Processor Layer (Tier 1)
  Cores: 8 cores, SPARC-III ISA, out-of-order, 2.5 GHz
  L1 cache: 32 kB private, 4-way, 64 B lines, LRU, write-through, 2-port, 1-cycle access time, MSHR: 4 instruction & 32 data
  L2 cache: 2 MB private, NUCA, unified, inclusive, 16-way, 64 B lines, LRU, write-back, 10-cycle, MSHR: 32 (instruction and data)
  Coherency: MOESI directory; 2×4 grid packet-switched NoC; XY routing; 1-cycle router; 1-cycle link
L3 Cache Layer (Tier 2)
  L3 cache: NUCA, 8 banks, shared, inclusive, STT-RAM, 64 B lines, LRU, write-back, 1-port, 32×64 B write buffer, 8×64 B read buffer, 4-cycle L2-to-L3 latency, MSHR: 128 (instruction and data)
Off-Chip Main Memory
  Controller: 4 on-chip, FR-FCFS scheduling policy
  DRAM: DDR3 1333 MHz (MICRON), 8 B data bus, tRP-tRCD-CL: 15-15-15 ns, 8 DRAM banks, 16 kB row buffer per bank, row-buffer hit: 36 ns, row-buffer miss: 66 ns

Table 5.4: Evaluated L3 configurations.

Columns: L3 Configuration | Latency [cycles] | Dynamic Energy [nJ] | Leakage Power [W]

8 MB 8-to-16-way stripped MLC:
  Latency: lookup 3; hard-R hit 3; soft-R hit 5; hard-W hit 19; soft-W hit 42
  Dynamic energy: hard-R 0.34; soft-R 0.38; hard-W 1.93; soft-W 1.28
  Leakage power: 1.52

5 MB 8-way SLC:
  Latency: lookup 1; R hit 3; W hit 19
  Dynamic energy: R 0.32; W 1.29
  Leakage power: 0.156

8 MB 16-way stacked MLC:
  Latency: lookup 3; R hit 5; W hit 37
  Dynamic energy: R 0.64; W 1.58
  Leakage power: 0.152

8 MB 16-way SLC:
  Latency: lookup 2; R hit 3; W hit 19
  Dynamic energy: R 0.32; W 1.29
  Leakage power: 0.217

5.4.3 Workloads

For multi-threaded workloads, we use the complete set of parallel programs in the PARSEC-2 suite [96]. For the multi-program evaluation, we use the SPEC-CPU2006 benchmarks [2]. We classify a benchmark as memory-intensive if its L3 cache Misses Per 1K Instructions (MPKI) is greater than three; otherwise, we refer to it as memory non-intensive. Also, we say a benchmark has cache locality if the number of L3 cache Hits Per 1K Instructions (HPKI) for the benchmark is greater than five. Each benchmark is classified by measuring the hits and misses when running alone in the 8-core system given in Table 5.3. For the multi-program workloads, we use eleven 8-application workloads that are chosen such that each workload consists of at least six memory-intensive applications and two applications with good cache locality. Each application is simulated to completion and the results are taken from the instructions in the parallel region (i.e., the region of interest). Regarding the input sets, we use the Large set for the PARSEC-2 applications and sim-large for the SPEC-CPU2006 workloads. Table 5.5 characterizes the evaluated workloads based on L3 MPKI and HPKI for the SLC reference system. To justify the results, our workloads are classified based on their L3 miss-count and hit-count intensity: considering L3 MPKI, each workload is either high-missed (H) if MPKI is greater than 10, medium-missed (M) if MPKI is between 1 and 10, or low-missed (L) if MPKI is less than 1. A workload is either high-hit (H) if HPKI is greater than 20, medium-hit (M) if HPKI is between 1 and 20, or low-hit (L) if HPKI is less than 1.

Table 5.5: Characteristics of the evaluated workloads.

Multi-Threaded PARSEC-2 Workloads (8 Threads) | MPKI | HPKI
  blackscholes: 0.78 (L), 37.33 (H)
  bodytrack: 0.96 (L), 11.90 (M)
  canneal: 15.19 (H), 27.13 (H)
  dedup: 3.04 (M), 9.072 (M)
  facesim: 10.66 (H), 14.26 (M)
  ferret: 7.80 (M), 23.21 (M)
  fluidanimate: 5.54 (M), 10.51 (M)
  freqmine: 0.51 (L), 7.30 (M)
  raytrace: 0.45 (L), 0.92 (L)
  streamcluster: 0.51 (L), 5.35 (M)
  swaptions: 0.15 (L), 4.34 (M)
  vips: 2.24 (M), 15.69 (M)
  x264: 1.22 (M), 12.92 (M)
8-Application Multi-Programmed Workloads (SPEC-CPU 2006) | MPKI | HPKI
  MP1: 2 copies of (xalancbmk, omnetpp, bzip2, mcf): 20.16 (H), 39.51 (H)
  MP2: 2 copies of (milc, leslie3d, GemsFDTD, lbm): 33.01 (H), 38.69 (H)
  MP3: 2 copies of (mcf, xalancbmk, GemsFDTD, lbm): 24.23 (H), 37.55 (H)
  MP4: 2 copies of (mcf, GemsFDTD, povray, perlbench): 14.89 (H), 23.41 (H)
  MP5: 2 copies of (mcf, xalancbmk, perlbench, gcc): 18.33 (H), 49.41 (H)
  MP6: 2 copies of (GemsFDTD, lbm, povray, namd): 6.99 (M), 11.68 (M)
  MP7: 2 copies of (gromacs, namd, dealII, povray): 1.85 (M), 7.94 (M)
  MP8: 2 copies of (perlbench, gcc, dealII, povray): 5.21 (M), 25.63 (H)
  MP9: 2 copies of (namd, povray, perlbench, gcc): 2.22 (M), 8.87 (M)
  MP10: 2 copies of (milc, soplex, bzip2, mcf): 22.94 (H), 38.89 (H)
  MP11: 2 copies of (parser, gcc, namd, povray): 4.27 (M), 26.39 (H)

5.5 Evaluation Results

Our dynamic stripped cache has a minimum and maximum associativity of 8 ways and 16 ways, respectively. The first and second baselines are the SLC and stacked MLC caches with the same die area and line size, but different associativities: 8 ways for the SLC baseline and 16 ways for the stacked MLC. The last baseline is a cache with SLC devices with the same capacity and associativity as the MLC, but with double the die area. For our dynamic configuration, on the other hand, a read hit or a write hit can be serviced by either MSB lines or LSB lines. The proposed cache organization centers around the use of the SLC configuration in applications with low misses and the MLC configuration in applications with medium and high misses. Ultimately, this mechanism attempts to reduce the miss penalty by the same measures as an MLC cache. Thus, an upper bound on the miss reduction for the proposed mechanism is provided by a stacked MLC cache of the same size. Our cache is expected to approach this upper bound for high-missed and medium-missed applications. This upper bound can result in our scheme outperforming the SLC baseline, as seen in the results. The two other upper bounds for the proposed cache are determined by a read hit at the fast read lines (i.e., FRHE) and a write hit at the low write-energy lines (i.e., SRLE). These second and third upper bounds determine the reduction in access latency and power consumption compared to MLC arrays and are provided by the SLC baseline with double capacity.

5.5.1 Performance Analysis

For programs with high and medium L3 MPKI, we expect a higher effect on the latency when increasing cache associativity. This is also observed in Figure 5.10, which plots the IPC improvement for a system with the proposed cache structure with respect to the studied baselines. For each benchmark, the results are normalized to the SLC baseline for ease of comparison. This figure shows an improvement of up to 29% in IPC of the high-associativity caches (i.e., the MLC cache configurations and the 8 MB SLC cache) with respect to the 5 MB SLC baseline. Our scheme also outperforms the 8 MB stacked 2-bit cache by 10% on average, thanks to being able to construct FRHE and SRLE lines without generally losing the maximum way-associativity requirement of a set. Comparing the results with the 8 MB SLC cache baseline, it can be seen that the performance of the system with the proposed cache structure is within 5% of the maximum performance observed. For applications with a low miss ratio in the LLC, one can see that our cache configuration behaves like the SLC baseline in most applications. Only in some applications is there a slight degradation in overall system performance (up to 4% in blackscholes), which is because of the higher latency of the 8 MB NUCA access circuit compared to the 5 MB SLC configuration. In short, as the last group of bars in Figure 5.10 shows, the cache architecture with dynamic associativity achieves a 10% IPC improvement (on average) across all applications with different miss-ratio behaviors.

Figure 5.10: Percentage of IPC improvement for the proposed cache architecture with respect to the baselines (5MB SLC, 8MB SLC, 8MB stacked MLC, and 8MB stripped MLC, normalized to the SLC baseline). The proposed architecture has the capacity advantage of MLCs in the applications with high misses (the first 9 programs) and medium misses (the next 13 programs). It also retains the SLC access latency in the 8 applications with low misses.

5.5.2 Energy Consumption Analysis

The percentage reduction in total memory energy compared to the baseline STT-RAM cache configurations is shown in Figure 5.11. This evaluation includes the energy consumption of both the STT-RAM LLC and the off-chip main memory, and uses the energy model given in Table 5.4. Generally speaking, compared to the SLC and MLC baselines, the percentage reduction in energy consumption follows the same trend observed for the performance improvement. In other words, we can see that the energy consumption of the proposed scheme is much better than the SLC baseline in all applications with high and medium misses due to the higher hit ratio of the on-chip memory hierarchy. On the other hand, the energy consumption of our scheme is better than the MLC baseline with the stacked data-to-cell mapping, as it constructs lines with low write energy and tries to allocate write-dominated blocks to them. Also, with respect to the SLC baseline cache, the proposed cache architecture results in a 17% reduction in total memory energy on average (and 29% to 46% for programs such as canneal and facesim).

5.5.3 Lifetime Analysis

In this work, we assume that reliable writes into an SLC STT-RAM cell are limited to 10^12 cycles [29], and this is scaled down for 2-bit STT-RAM cells (to 10^10 cycles, as in Table 5.1). For the lifetime evaluation, the memory traces are extracted from the full-system simulator and are fed into a simulation tool. To avoid cache failure on the wear-out of a limited number of cells, each cache line is augmented with an ECC correcting up to 5 faulty bits. After each cache write, the read access circuit is used to read it out, and a cache line (or, equally, a physical way) is assumed to be dead if the read-out data block has more than 5 bit mismatches with the original data. Finally, the simulator keeps track of the write counts on each coding set until a cache set has more than four dead physical ways. We measure this duration and use it to estimate and analyze the lifetime. The lifetime of the complete cache architecture is compared with the 5 MB SLC cache and an 8 MB stacked MLC cache in Figure 5.12. Overall, the proposed scheme provides a lifetime longer than 70% of that of an SLC cache with identical ECC strength.
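The bookkeeping behind this estimate can be sketched as follows; the wear model here is a deliberately coarse placeholder (one pessimistic faulty bit per endurance quantum of writes), since the actual evaluation is trace-driven, and the constants follow Table 5.1 and the ECC strength above.

```cpp
// A minimal sketch of the dead-way / dead-set criterion used for lifetime analysis.
#include <cstdint>
#include <vector>

constexpr uint64_t MLC_ENDURANCE   = 10000000000ULL;  // ~1e10 writes (Table 5.1)
constexpr int      WAYS            = 16;
constexpr int      ECC_CORRECTABLE = 5;               // ECC fixes up to 5 faulty bits
constexpr int      DEAD_WAYS_LIMIT = 4;               // a set dies past 4 dead ways

struct Way { uint64_t writes = 0; int faultyBits = 0; };
struct Set { std::vector<Way> ways = std::vector<Way>(WAYS); };

// Coarse wear model: one (pessimistic) faulty bit every MLC_ENDURANCE writes.
void applyWrite(Way& w) {
    if (++w.writes % MLC_ENDURANCE == 0) ++w.faultyBits;
}

bool wayDead(const Way& w) { return w.faultyBits > ECC_CORRECTABLE; }

bool setDead(const Set& s) {
    int dead = 0;
    for (const Way& w : s.ways) dead += wayDead(w) ? 1 : 0;
    return dead > DEAD_WAYS_LIMIT;
}
```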

Figure 5.11: Total energy consumption of the cache architectures (5MB SLC, 8MB SLC, 8MB stacked MLC, and 8MB stripped MLC) normalized to the SLC baseline. The proposed architecture benefits from the low read energy of FRHE lines and the low write energy of SRLE lines.

5MB SLC 8MB Stacked MLC 8MB Stripped MLC 1.2 1 0.8 0.6 0.4 0.2

Normalized Lifetime 0 cannealfacesimMP-H1MP-H2MP-H3MP-H4MP-H5MP-H6dedupferret fluidanimatevips x264 MP-M1MP-M2MP-M3MP-M4MP-M5blackscholesbodytrackfreqmineraytracestreamclusterswaptionsGmean

Figure 5.12: Lifetime of the cache configurations normalized to the SLC baselines. Our proposed stripped scheme tries to behave like an SLC cache in order to approach the maximum lifetime.

5.5.3.1 Comparison with Some Prior Works on Reducing Cache Misses

Several proposals increase cache associativity by relying on techniques that require heaps [4], hash tables [99], or prediction mechanisms [100]. These techniques may increase the energy and latency of cache hits, and the resulting cache design may be much more complex than conventional cache arrays. This study, on the other hand, focuses on the multi-bit capability of cutting-edge STT-RAM technology and proposes workload-aware per-set associativity regulation using its SLC-to-MLC (and vice versa) shape-shifting property. Here, we compare the proposed MLC STT-RAM cache against state-of-the-art works on cache associativity applied to the same platform (i.e., an STT-RAM cache). We use three schemes for comparison: the V-Way cache [3], the Scavenger cache [4], and the dynamic SBC cache [5]. For a fair analysis, these three approaches are applied to SLC caches with the same die size.


Figure 5.13: Comparison of our proposed cache with the V-Way [3], Scavenger [4], and SBC [5] caches in terms of the percentage reduction in cache misses relative to the SLC cache in the previous configuration. Note that the cache sizes are set to occupy the same die area. The figure shows that our technique achieves a lower miss ratio than its counterparts, especially when the application requires large associativity.

Figure 5.13 compares the LLC miss rates of these schemes for the configuration used previously (Table 5.3). The results are shown as the average reduction in miss rate for the different workload categories of Table 5.5, i.e., workloads with a low LLC miss rate, workloads with a moderate miss rate, and workloads with a high miss rate. The results are normalized to the miss rate of the baseline SLC configuration. We can observe that the results vary between a 12.1% reduction for our proposed solution and an 8.6% reduction for Scavenger, while SBC achieves a 10.8% reduction, better than the 9% obtained by V-Way. More precisely, the proposed cache is the best in the high- and medium-miss categories, and SBC is the best for workloads with low miss rates. We must take into account that the V-Way cache turns misses into hits, while the other three schemes turn them into secondary hits, which suffer the delay of a second access to the tag array. On the other hand, the duplication of the tag-store entries, the addition of one pointer to each entry, and a mux to choose the correct pointer increase the V-Way tag access time by around 39%, while our solution, along with SBC, involves very lightweight structures and thus has a negligible impact on access time.

5.6 Related Work on STT-RAM-based Caches

Owing to serious concerns about SRAM power in the nanometer regime, resistive memories such as STT-RAM have been proposed as a highly scalable, low-leakage alternative for large cache arrays. Compared to competing non-volatile memories (such as ReRAM, PcRAM, and FeRAM), STT-RAM combines the best attributes of fast nanosecond access time, CMOS process compatibility, high density, and better write endurance. Dong et al. give a detailed circuit-level comparison between an SRAM cache and an STT-RAM cache in a single-core setting [27]. Based on the findings of this study, Sun et al. extended the application of STT-RAM to a NUCA cache substrate in CMPs and studied the impact of the costly write operation in STT-RAM on power and performance [101]. To address the slow write speed and high write energy of STT-RAM, many proposals have been made in recent years. Zhou et al. proposed an early write termination scheme that uses the write current as a read bias current and cancels an ongoing write if it is unnecessary [102]. Alternative approaches are SRAM/STT-RAM hybrid cache hierarchies and related enhancements, such as write buffering [29, 103, 104], data migration [29, 104–108], and data encoding [109–111]. As a cache solution with a uniform technology, previous work proposed to trade off the non-volatility of STT-RAM for write performance and power improvements [30, 112–115]. To ensure data integrity in these architectures, DRAM-style refresh schemes are introduced, which may not scale well for large cache capacities.

Regarding MLC STT-RAM, Chen et al. proposed a dense cache architecture using devices with parallel MTJs [28]. Although they use MLCs with parallel MTJs to obtain the lower write power (compared to series MTJs) suitable for caches [50], a reliability comparison of these two devices shows that parallel devices confront serious challenges in nanometer technologies with large process variations [51]. Finally, some recent proposals studied the effect of decoupling the bits of an MLC device (STT-RAM [52] or PcRAM [53]) on the performance, energy, and reliability of non-volatile memories.

5.7 Conclusions

The emerging technology of SLC STT-RAM has been shown to be a promising candidate for building large last-level caches. The natural next step would be to use MLC STT-RAM, but its advantage of doubling the storage density comes with a number of serious shortcomings in terms of lifetime, performance, and energy consumption. In this work, we have shown that, by operating MLC STT-RAM in SLC mode when the additional density is not required, one can achieve the best of both worlds and improve performance and energy with only a minimal impact on lifetime. This improvement requires more than a naive shut-down of unused ways in the cache (which are thus used in SLC mode instead of MLC mode), and we have shown how one should actively migrate data across physical ways to maximize the benefits of this technique. This work shows that emerging memory technologies can be efficiently accommodated in traditional memory hierarchies, but they require new techniques for the integration to be successful.

Chapter 6

Improving the Performance of Cache Hierarchy through Selective Caching

Emerging general-purpose graphics processing units (GPGPUs) are not only used to accelerate big data analytics in cloud data centers and high-performance computing systems, but are also employed in mobile and wearable devices for efficient execution of multimedia-rich applications and smooth display rendering. GPGPUs use a memory hierarchy very similar to that of modern multi-core processors: they typically have multiple levels of on-chip caches and a DDR-like off-chip main memory. In such massively parallel architectures, caches are expected to reduce the average data access latency by reducing the number of off-chip memory accesses; however, our extensive experimental studies confirm that not all applications utilize the on-chip caches in an efficient manner. Even though GPGPUs are adopted to run a wide range of general-purpose applications, conventional cache management policies are incapable of achieving optimal performance across the different memory characteristics of these applications. This study first investigates the underlying reasons for the inefficiency of common cache management policies in GPGPUs. To address and resolve those issues, we then propose (i) a characterization mechanism to analyze each kernel at run-time and (ii) a selective caching policy to manage the flow of cache accesses. Evaluation results on the studied platform show that

our proposed dynamically reconfigurable cache hierarchy improves the system performance by up to 105% (average of 27%) over a wide range of modern GPGPU applications, which is within 10% of the optimal improvement.

6.1 Introduction

Graphics Processing Units (GPUs) are becoming a major part of every computing system because of their capability to accelerate applications by exploiting high levels of parallelism. They are not only used to accelerate big data analytics in cloud data centers and high-performance computing systems, but are also employed in mobile and wearable devices to achieve an efficient system. The memory system in this class of GPUs, known as general-purpose GPUs (GPGPUs), is very similar to that of modern multi-core processors: recent commercial GPGPUs (e.g., [6, 116, 117]) usually have two levels of cache (private L1 data and L1 instruction caches as well as a shared L2 cache) integrated on the GPU chip, and below them there is a DDR-like off-chip main memory module. The on-chip cache hierarchy, along with abundant thread-level parallelism, helps such massively parallel architectures hide long memory access latencies. Considering a GPGPU system with two levels of on-chip cache, we performed a series of experimental studies to evaluate the performance of a conventional cache hierarchy over a wide range of applications. Our experimental studies collectively demonstrate that not all applications efficiently utilize the L1 data (L1D) and L2 caches. In this work, based on the characterized underlying reasons, we categorize such applications into the following groups.

Low data locality. A major class of applications that target GPGPU platforms are streaming applications. These applications usually have weak or no data locality (at least over some phases of execution) and most of their cache accesses miss, adding to the average memory access latency. Note that, for such high-miss-rate applications, the latency caused by going through multiple levels of caches includes not only the latency of tag checks, but also the queuing delay in the cache controllers caused by unavailable MSHRs and full miss queues [118]. It is evident that the performance of this category of applications can be improved by completely bypassing the cache hierarchy for all the memory accesses. Here, the main issue

is to dynamically recognize the streaming behavior of each individual application over different execution phases. Note that, one can employ data prefetching for streaming applications in order to improve the performance of the cache hierarchy (e.g., [119, 120]). In this study we assume the underlying platform does not support hardware-level cache prefetching, as it is not widely adopted in commercial products.

High data locality with too many cache conflicts. Many multi-threaded applications have inherent data locality but exhibit poor cache utilization in GPGPUs. The reason is that such applications suffer from frequent cache conflicts as a result of the huge number of memory accesses issued by thousands of concurrently running threads. This phenomenon occurs in CPUs as well, but it is much more severe in GPGPUs because of the relatively small cache sizes compared to the high degree of thread-level parallelism. A heuristic to resolve this problem is to reduce the number of accesses to the cache by restricting the number of running threads. Along this line, Rogers et al. [33] and Kayiran et al. [34] propose different thread-throttling schemes to adjust the number of running threads based on the L1D cache utilization. From a different perspective, Jadidi et al. [121] use a core shutdown mechanism to control the access traffic to the L2 cache and main memory in memory-intensive applications. In general, reducing the number of running threads can effectively mitigate the cache contention issue; however, it leaves other shared resources, such as the memory bandwidth and computing cores, underutilized. Alternatively, in order to avoid cache conflicts (in both the L1D and L2 caches), we propose a selective caching mechanism which dynamically adjusts the number of threads that are allowed to access the cache over different execution phases. The proposed cache management policy effectively controls the access traffic to both the L1D and L2 caches without sacrificing thread-level parallelism. This mechanism reconfigures different cache levels in a coordinated fashion to optimize the performance of the whole cache hierarchy. We will demonstrate that such a reconfigurable partial-caching architecture can improve the system performance by an average of 27%. Overall, this study makes the following contributions:

• We first demonstrate that, as different kernels execute different parts of the same application, they may exhibit significant variations regarding their cache characteristics. Therefore, in order to achieve a well-tuned platform, we need to characterize and optimize each kernel based on its unique properties.

• We propose a fine-grained CTA-based characterization scheme to analyze the cache utilization at run-time. Based on such characterization, we then categorize kernels into two classes. The first class of kernels does not benefit from caches due to low data locality; these kernels may even lose performance because of the longer memory access delays caused by going through multiple levels of caches. The second class of kernels, although they inherently have high data locality, do not achieve their optimal performance due to high cache miss rates caused by very frequent cache conflicts. The performance of such kernels can be improved by regulating the number of threads that are allowed to access the cache. Note that the characteristics of a kernel could change over different execution phases; therefore, the behavior of the running kernel should be reevaluated over the course of execution.

• Based on the findings of run-time characterization, we propose a cache management policy which regulates the number of accesses to the L1D and L2 caches through a feedback-driven selective caching scheme. To the best of our knowledge, this is the first study that reconfigures both the L1D and L2 caches in a coordinated fashion in order to further improve the utilization of cache hierarchy in a GPGPU platform.

6.2 Background

GPGPU Architecture: Figure 6.1 shows a typical GPU architecture with a two-level cache hierarchy. This architecture consists of multiple Streaming Multiprocessors (SMs), each containing 32 CUDA cores [6]. Each CUDA core can execute a thread in a “Single-Instruction, Multiple-Threads” (SIMT) fashion. This architecture is supported by a large register file that helps hide the memory access latency. Each SM has a private L1D, a read-only texture cache, and a constant cache, along with a low-latency shared memory. The memory requests generated by multiple threads in an SM are coalesced into fewer cache lines and sent to the L1D, which is shared by all CUDA cores in the SM. Misses in the L1D are injected into an on-chip network, which connects the SMs to the memory. Each memory partition includes a slice of the shared L2 and a memory controller [122].

GPGPU Applications: Figure 6.1 also demonstrates the computation hierarchy of a typical GPGPU application. Each GPGPU application consists of one or multiple kernels, each of which is launched once or multiple times during the entire execution of the application. Each kernel consists of a set of parallel threads which are divided into groups of threads called Cooperative Thread Arrays (CTAs). The underlying architecture further divides each CTA into groups of threads called warps, a division that is transparent to the programmer. After a kernel is launched on the GPU, the CTA scheduler schedules the available CTAs associated with the kernel on all the available SMs. The maximum number of CTAs per SM is limited by SM resources (i.e., number of threads, shared memory size, register file size, etc.). The CTA assignment policy is followed by per-core warp scheduling: warps associated with CTAs are scheduled on the assigned cores and get equal priority. Once an SM finishes executing a CTA, the CTA scheduler assigns another CTA to that SM to execute. In common CTA scheduling mechanisms, there is no priority among CTAs, and the process continues until all the CTAs are executed. In Section 6.4, we explain how our fine-grained characterization scheme cooperates with the CTA scheduler at run-time to analyze the behavior of the running kernel. In the rest of this study, the terms kernel and application are used interchangeably.


Figure 6.1: Target GPGPU architecture and the details of the computation hierarchy in a typical GPGPU application. Each streaming multiprocessor (SM) has a private L1D cache. The L2 cache is logically shared but physically distributed among 6 memory channels which are connected to the SMs through an interconnection network (i.e., a crossbar) [6].

6.3 Problem Formulation

In this section, we discuss the underlying reasons for poor cache utilization in some GPGPU applications. Following this analysis, we demonstrate how a partial-caching mechanism can resolve those issues by dynamically reconfiguring the cache hierarchy for each individual kernel.

6.3.1 Kernel-Based Analysis

As discussed in Section 6.2, GPGPU architectures usually employ multiple levels of caches in their memory hierarchy in order to hide the main memory access latency. If the caches are not utilized properly, the computing cores experience long data access latencies, which directly degrades the system performance. Based on our observations, two categories of memory-intensive kernels do not efficiently utilize caches. First, going through multiple levels of caches increases the memory access latency for streaming applications with low data locality. For such applications, not only does the cache hierarchy fail to noticeably reduce the number of accesses to the off-chip main memory, it also increases the overall data access latency. This extra latency is caused by the latency of tag lookups as well as the queuing delay in the cache controllers (i.e., being queued because of an unavailable MSHR entry and/or a full miss queue [118]). Second, the small size of the L1D and L2 caches compared to the enormous number of concurrently running threads can cause cache thrashing, which significantly degrades the performance of cache-sensitive applications. For such applications, the cache hierarchy is expected to considerably reduce the number of off-chip main memory accesses; however, its utilization declines severely once it is flooded by requests from thousands of running threads.

To have a better understanding of the impact of (i) the degree of thread-level parallelism and (ii) data locality on the cache utilization in GPGPUs, we studied the performance of a target GPGPU platform (detailed configuration given in Table 6.2) under different cache setups. In these scenarios, we separately limited the number of accesses to the L1D and L2 caches by labeling the

1Even though this problem is also present in conventional CPUs, it is exacerbated in GPGPUs due to the small cache sizes compared to the number of running threads.

memory requests as cacheable or non-cacheable accesses. To do so, we ran different applications under five caching scenarios for both the L1D and L2 caches. In Figures 6.2 and 6.3, the X% caching ratio (on the x-axis) refers to the configuration in which only X% of the memory requests are allowed to access the cache while the rest of the memory requests are bypassed to the next memory level. We change the value of X from 0% (i.e., no memory requests are cached) to 100% (i.e., all memory requests are cached). Figures 6.2 and 6.3 show the IPC of each caching configuration normalized to the baseline system where all the memory requests are allowed to access the cache. This evaluation reveals different characteristics which can be used to improve the utilization of the cache hierarchy in GPGPU architectures. Note that, in Figure 6.2 we change the caching ratio only in the L1D while the L2 cache operation is not changed, meaning that all the memory requests can access the L2 cache. In Figure 6.3, however, the L1D is untouched and the L2 cache experiences the different configurations.

In Figure 6.2, PVR1 and BFS are examples of two L1D-sensitive kernels. As we increase the fraction of cacheable memory accesses in the L1D, the performance of PVR1 improves linearly. This kernel represents an L1D-sensitive case in which the data working-set of the running threads fits into the L1D. BFS, on the other hand, achieves its optimal performance when only 50% of the running threads are allowed to access the L1D. Indeed, BFS is a cache-sensitive application that has a large data working-set and hence suffers from frequent cache conflicts. Therefore, for BFS, to achieve the optimal performance on this particular platform (detailed configuration in Table 6.2), we have to limit the fraction of memory requests that are allowed to access the L1D. In this evaluation, PVR2 and TRA are two L1D-insensitive kernels. TRA does not exhibit any change in its performance as the ratio of cacheable threads varies, whereas the performance of PVR2 degrades as the fraction of cacheable threads increases. Although both TRA and PVR2 exhibit L1D-insensitive behavior, only for PVR2 do we observe a performance loss as the fraction of cacheable memory requests increases. The reason behind this behavior is that PVR2 generates a considerable number of memory requests over the course of execution, which makes the negative impact of the longer data access latencies (imposed by going through the L1D) on the overall system performance quite noticeable.

2 In Figures 6.2 and 6.3 we report only a few of our studied applications as representatives of the different characteristics observed in our studies. 3 Different kernels from the same application are indexed with different numbers to distinguish them in our discussions.


Figure 6.2: Impact of the caching ratio on the system performance. IPC is normalized to the baseline configuration (i.e., 100% caching ratio). The x-axis represents the percentage of the memory requests that are allowed to access the L1D cache. Among these kernels, BFS and PVR2 do not achieve their optimal performance under the baseline configuration.


Figure 6.3: Impact of the caching ratio on the system performance. IPC is normalized to the baseline configuration (i.e., 100% caching ratio). The x-axis represents the percentage of the memory requests that are allowed to access the L2 cache. Among these kernels, BFS and TRA do not achieve their optimal performance under the baseline configuration.

Figure 6.3 reports similar variations for the L2 cache utilization of the same set of applications. As can be seen, similar to the L1D, different applications/kernels exhibit different L2 demands. TRA and PVR1 exhibit L2-insensitive characteristics. BFS and PVR2, however, exhibit L2-sensitive behavior. Moreover, BFS experiences cache thrashing in the L2 cache, similar to the L1D cache. Analyzing and comparing Figures 6.2 and 6.3 gives us important insights regarding cache utilization:

• Different applications exhibit widely different L1D and L2 cache utilizations.

• Different kernels of the same application might exhibit a large variance in their cache demands. For instance, as shown in Figures 6.2 and 6.3, although PVR1 and PVR2 belong to the same application, they exhibit widely different behaviors in terms of cache utilization for both the L1D and L2.

• A particular kernel does not necessarily exhibit similar demands for the L1D and L2 caches. This inconsistency between the L1D and L2 utilizations can be observed for PVR1, PVR2, and TRA in Figures 6.2 and 6.3.

To sum up, integrating multiple levels of caches in the memory hierarchy of GPGPUs is expected to improve performance. However, the cache hierarchy will not be efficiently utilized if it is only equipped with traditional cache management policies. For cache-sensitive applications, we need to detect and resolve any potential cache-thrashing cases by tuning the fraction of cacheable accesses. For cache-insensitive applications, on the other hand, the best design choice could be to completely bypass the cache. Such run-time optimizations should be made at the kernel level for both the L1D and L2 caches to accurately capture the characteristics of each individual kernel at each cache level.

6.3.2 Proposed Microarchitecture for Selective Caching

In Figures 6.2 and 6.3, we discussed partial caching and its impact on the overall system performance. However, we did not mention (i) how we can achieve a specific caching ratio at run-time and (ii) based on what metrics we should determine cacheable and non-cacheable memory accesses. In this section, we answer these questions and propose a low-overhead microarchitecture to achieve a dynamically tunable caching ratio for both the L1D and L2 caches.

L1D Management: Since the L1D is private within each SM (see Figure 6.1) and the hardware scheduler works at the warp level, we use a warp as the granularity for controlling the caching ratio in the L1D. Besides, within each SM each warp has a unique identification number called the warp-id. Therefore, managing the L1D at the warp level enables us to achieve any caching ratio at run-time and to distinguish cacheable and non-cacheable L1D accesses based on their warp-id. Now, the question is how to determine which warps should be allowed to access the L1D.

111 Note that, warps within a CTA synchronize using barriers but there is no synchronization across threads belonging to different CTAs. Therefore, if we randomly choose cacheable and non-cacheable warps from different CTAs scheduled on an SM, it could cause longer synchronization delays between the warps within a CTA because cacheable warps will be executed in a shorter time. This can reduce the potential gain that can be achieved by our mechanism. For instance, as shown in Figure 6.4.a, assume half of the warps within a CTA are labeled as cacheable (e.g., W-X1 and W-X2 from CTA-X, and W-Y1 and W-Y2 from CTA-Y). Those cacheable warps will finish in a relatively shorter time but they still have to wait for the non-cacheable warps to finish due to the synchronization requirements between those warps (i.e., W-X1 and W-X2 should wait for W-X3 and W-X4, and W-Y1 and W-Y2 should wait for W-Y3 and W-Y4).

In order to resolve this issue, we limit our selection and choose warps from the same CTA (or group of CTAs), as shown in Figure 6.4.b. If we cannot achieve the target caching ratio by selecting the cacheable warps from one specific CTA, we move to the next CTA to mark more warps as cacheable. In such an approach, only one of the CTAs can have both cacheable and non-cacheable warps. Even though this policy favors a CTA (or group of CTAs), it does not cause long synchronization delays for the warps within the same CTA. Note that, once a cacheable warp finishes, we label another available warp as cacheable to efficiently utilize the cache during the execution. A short sketch of this selection policy is shown below, after Figure 6.4.


Figure 6.4: Impact of warp synchronization on selective warp caching: (a) uniform warp selection over all of the available CTAs on the SM; (b) localized warp selection over a limited number of available CTAs on the SM.
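The following sketch illustrates the localized selection policy described above: whole CTAs are marked cacheable first, so at most one CTA ends up with a mix of cacheable and non-cacheable warps. The data structures and function names are illustrative assumptions, not the simulator's actual implementation.

// Illustrative sketch of localized (per-CTA) selection of cacheable warps.
#include <iostream>
#include <set>
#include <vector>

struct Cta {
  int id;
  std::vector<int> warp_ids;  // warps of this CTA currently scheduled on the SM
};

// Mark warps cacheable CTA by CTA until the target ratio is met, so that at most
// one CTA mixes cacheable and non-cacheable warps (Figure 6.4.b).
std::set<int> SelectCacheableWarps(const std::vector<Cta>& ctas, double target_ratio) {
  int total_warps = 0;
  for (const Cta& cta : ctas) total_warps += static_cast<int>(cta.warp_ids.size());
  const int budget = static_cast<int>(total_warps * target_ratio + 0.5);

  std::set<int> cacheable;
  for (const Cta& cta : ctas) {
    for (int warp : cta.warp_ids) {
      if (static_cast<int>(cacheable.size()) >= budget) return cacheable;
      cacheable.insert(warp);
    }
  }
  return cacheable;
}

int main() {
  // Two CTAs with four warps each; a 50% ratio marks all of CTA 0 and none of CTA 1.
  std::vector<Cta> ctas = {{0, {0, 1, 2, 3}}, {1, {4, 5, 6, 7}}};
  for (int warp : SelectCacheableWarps(ctas, 0.5))
    std::cout << "cacheable warp " << warp << "\n";
  return 0;
}

When a cacheable warp retires, the same routine would simply be invoked again so that another warp from the favored CTA (or the next CTA) takes its place.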

Figure 6.5: Restricting the fraction of cacheable memory requests in the L1D and L2 caches based on the warp and SM granularities, respectively. The L1D offers a 25% caching ratio by caching 2 warps out of 8 warps; the L2 cache has a 50% caching ratio by caching 16 SMs out of 32 SMs.

L2 Management: Unlike the L1D, the L2 cache is shared among all the SMs. Therefore, we consider an SM as the granularity for controlling the caching ratio of the L2 cache and use the SM identification number, called the SM-id, to distinguish cacheable and non-cacheable L2 accesses. Unlike warps, there is no dependency among different CTAs running on different SMs. This means that, for instance, if we need to allow only 25% of the SMs to access the L2 cache, we can randomly pick any 8 SMs out of 32 SMs. Figure 6.5 depicts an example in which 25% of the warps in an SM (2 warps out of 8 warps) are allowed to access the L1D, and, for the L2 cache, 50% of the SMs (16 SMs out of 32 SMs) are labeled as cacheable. For each cache level, the rest of the accesses are bypassed to the next level in the memory hierarchy. The per-request decision then reduces to two simple predicates, sketched below.
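A minimal sketch of the resulting tagging predicates follows. The warp-id/SM-id interfaces match the description above, but the contiguous-range mapping from identifiers to the discrete caching ratios is an assumption made for brevity (the actual L1D policy selects warps CTA by CTA, as sketched earlier); any deterministic mapping that yields the target fraction would serve the same purpose.

// Sketch of the per-request caching decisions; the enum and the range-based mapping
// are illustrative assumptions, not the exact hardware encoding.
#include <iostream>

enum class CachingRatio { kBypass = 0, kQuarter = 25, kHalf = 50, kThreeQuarters = 75, kFull = 100 };

// L1D: decided per warp within an SM (e.g., warps 0..k-1 of the SM are cacheable).
bool CacheableInL1D(int warp_id, int warps_per_sm, CachingRatio r) {
  const int cacheable_warps = warps_per_sm * static_cast<int>(r) / 100;
  return (warp_id % warps_per_sm) < cacheable_warps;
}

// L2: decided per SM across the chip (e.g., SMs 0..k-1 are cacheable).
bool CacheableInL2(int sm_id, int num_sms, CachingRatio r) {
  const int cacheable_sms = num_sms * static_cast<int>(r) / 100;
  return (sm_id % num_sms) < cacheable_sms;
}

int main() {
  // Mirrors the Figure 6.5 example: 25% of 8 warps and 50% of 32 SMs are cacheable.
  std::cout << CacheableInL1D(1, 8, CachingRatio::kQuarter) << " "    // warp 1  -> cacheable
            << CacheableInL1D(5, 8, CachingRatio::kQuarter) << " "    // warp 5  -> bypass
            << CacheableInL2(10, 32, CachingRatio::kHalf)   << " "    // SM 10   -> cacheable
            << CacheableInL2(20, 32, CachingRatio::kHalf)   << "\n";  // SM 20   -> bypass
  return 0;
}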

6.4 Dynamic Cache Reconfiguration

Theoretically, if we knew the relationship between the ratio of cacheable requests at each cache level and the GPGPU performance (like the curves in Figures 6.2 and 6.3), we would be able to determine the optimal caching configuration for each cache level. Our dynamic approach gathers similar information at run-time to determine the ideal cache configuration. In this work, similar to Figures 6.2 and 6.3, we assume five discrete cache configurations, starting from the base configuration where all the incoming requests are allowed to be cached (i.e., 100% of the requests are cacheable) down to the full-bypass configuration (i.e., 0% of the

requests are cacheable). The partial caching configurations are assumed to be uniformly distributed and form the 25%, 50%, and 75% caching ratios. Our observations confirm that having only three partial caching configurations effectively resolves the cache-thrashing problem over the studied applications; however, the proposed scheme is scalable and the number of partial caching configurations can be increased.

6.4.1 Proposed Microarchitecture for Run-Time Sampling

In our scheme, we need four pieces of information to determine the ideal L1D and L2 configurations. More precisely, the characterization hardware should provide us with the L1D and L2 miss rates for the baseline configuration, as well as for the three partial caching configurations. Figure 6.6 depicts our proposed hardware for collecting the required statistics at run-time. The studied platform has 32 SMs and 6 memory channels (i.e., memory partitions). Each SM has a private L1D cache, and the L2 cache is logically shared but physically distributed among the 6 memory partitions. Since all of the 32 SMs run CTAs from the same kernel, all of the 32 L1D caches exhibit similar properties; therefore, in order to analyze the L1D characteristics we need to study only one of the L1D caches. Furthermore, because the data is interleaved among the 6 memory partitions, all of the L2 caches exhibit similar behavior, so we need to study only one of them. Table 6.1 reports the difference in the miss rate of different L1D and different L2 caches over the studied applications. Hence, as shown in Figure 6.6, we only evaluate one of the L1D caches (i.e., SM0 in Figure 6.6) and one of the L2 cache partitions (i.e.,

Par0 in Figure 6.6) for our characterization purposes. In order to analyze the characteristics of a kernel at run-time, we use a uniform-set-sampling scheme which is a customized version of the sampling technique proposed in [128]. By employing this technique we can collect the required statistics for the different partial caching configurations in only one iteration. In order to capture the miss-rate for the baseline configuration as well as the miss-rates for the three partial caching configurations, we allow Set(i) to cache all the memory requests, while Set(i+1), Set(i+2), and Set(i+3) are allowed to cache only 75%, 50%, and 25% of the memory requests, respectively. This mechanism uses the warp-id at the L1D cache and the SM-id at the L2 cache to distinguish between memory requests and determine which one to cache and which

one to bypass. As depicted in Figure 6.6, the sets Set(4*i) collectively give us the miss-rate for the baseline cache configuration. Similarly, the sets Set(4*i+1), Set(4*i+2), and Set(4*i+3) give us the miss-rates for the 75%, 50%, and 25% partial caching configurations, respectively. In order to calculate the miss-rate for each caching configuration, the proposed hardware needs to keep track of the number of cache accesses and the number of cache misses for each configuration. Since in this work we only have three partial caching configurations, the hardware needs 4×2 counters (4-byte integers) for each level of cache, for a total of 16 counters or 64 bytes of capacity overhead.

Figure 6.6: Microarchitecture design of the monitoring hardware. In this figure, Set(4*i) captures the miss-rate for the baseline configuration. Similarly, Set(4*i+1), Set(4*i+2), and Set(4*i+3) capture the miss-rates for the 75%, 50%, and 25% partial caching configurations, respectively.
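The bookkeeping behind Figure 6.6 amounts to the 4×2 access/miss counters mentioned above, indexed by the set index modulo 4. The following sketch shows one way this could be organized; the class and function names are illustrative, not taken from the simulator.

// Sketch of the sampling counters: sets 4i, 4i+1, 4i+2, and 4i+3 are wired to the
// 100%, 75%, 50%, and 25% caching configurations, respectively.
#include <array>
#include <cstdint>
#include <iostream>

struct ConfigCounters {
  uint64_t accesses = 0;
  uint64_t misses = 0;
  double MissRate() const { return accesses ? static_cast<double>(misses) / accesses : 0.0; }
};

class SamplingMonitor {
 public:
  // 4 monitored configurations x (accesses, misses) = the 4x2 counters per cache level.
  void Record(int set_index, bool miss) {
    ConfigCounters& c = counters_[set_index % 4];
    ++c.accesses;
    if (miss) ++c.misses;
  }
  // counters_[0] -> 100% (baseline), [1] -> 75%, [2] -> 50%, [3] -> 25%.
  double MissRate(int config) const { return counters_[config].MissRate(); }

 private:
  std::array<ConfigCounters, 4> counters_;
};

int main() {
  SamplingMonitor l1d_monitor;
  // Fabricated accesses, just to exercise the interface.
  for (int i = 0; i < 1000; ++i)
    l1d_monitor.Record(/*set_index=*/i % 64, /*miss=*/(i % 3) == 0);
  for (int cfg = 0; cfg < 4; ++cfg)
    std::cout << "config " << cfg << " miss-rate = " << l1d_monitor.MissRate(cfg) << "\n";
  return 0;
}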

Table 6.1: Variation in the miss-rate of different L1D and L2 caches.

Application     L1D   L2      Application     L1D   L2
SSSP [123]      2%    2%      TRA [124]       1%    1%
BFS [123]       2%    4%      SLA [124]       1%    1%
SP [123]        2%    2%      NN [124]        2%    1%
DMR [123]       1%    1%      PVR [125]       2%    3%
STN [126]       6%    2%      SSC [125]       1%    1%
QTC [126]       2%    4%      MMUL [125]      1%    2%
CONS [124]      2%    4%      SRAD1 [127]     1%    2%
STO [124]       1%    1%      SRAD2 [127]     2%    2%
BLK [124]       1%    1%      STC [127]       5%    2%

Data Consistency: Our proposed scheme may decide to reduce the fraction of cacheable accesses at run-time. This could cause data inconsistency if we bypass an access to a dirty cache line. This problem cannot happen in the L1D cache because of its write-through policy in our architecture [6]. However, for the L2 cache, bypassing accesses to the main memory could cause data inconsistency if the data is in the cache as a dirty line (i.e., updated by the core). One can resolve this problem by writing back all the dirty cache lines at reconfiguration time. However, this could cause considerable performance overhead if the reconfiguration happens multiple times during the course of execution.

To resolve this problem, we propose a gradual reconfiguration process. Our mechanism relies on several properties of GPUs: (1) as discussed in [124], threads within a CTA can synchronize using barriers, but there is no synchronization across threads belonging to different CTAs; (2) CTAs can be distributed to any core (i.e., SM) in any order; and (3) once a CTA is assigned to a core, it cannot be preempted and must use that core for its whole execution. Knowing these properties, we do not write back the L2 dirty cache lines at reconfiguration time. Instead, for the SMs that are marked as non-cacheable, we let the CTAs already scheduled on those SMs access the L2 cache until they finish. However, the newly scheduled CTAs on those SMs bypass the L2 cache from the beginning. Such a gradual state transition satisfies our concerns in terms of write-back overhead and data consistency. In other words, we do not need to write back the dirty cache lines to guarantee data consistency for non-cacheable SMs because the running CTAs on those SMs are allowed to access the cache until they finish, and the data consistency of CTAs newly scheduled on non-cacheable SMs is also preserved because they bypass the L2 cache from the beginning and there is no synchronization or data dependency between those CTAs and any other CTA. Note that, although based on our experimental studies having only three partial caching configurations can effectively resolve the cache-thrashing problem, the proposed characterization mechanism is scalable and the number of partial caching configurations can be increased to achieve finer tuning. The overhead of our proposed mechanism increases with the number of partial caching configurations (i.e., 8 bytes per configuration), as opposed to other cache optimization techniques (e.g., [129]) where the hardware overhead increases linearly with the cache size.
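The gradual transition described above can be captured by a small piece of state per CTA: each CTA latches its SM's L2 policy at launch time and keeps it until it retires, so no dirty lines ever need to be written back at reconfiguration time. The sketch below illustrates this idea with assumed structure and names.

// Sketch of the gradual L2 reconfiguration: a CTA freezes its SM's L2 policy when it is
// launched and keeps it until completion. Names and data structures are illustrative.
#include <iostream>
#include <unordered_map>
#include <vector>

struct CtaState {
  int cta_id;
  bool l2_cacheable;  // policy frozen at launch time
};

class GradualL2Reconfig {
 public:
  explicit GradualL2Reconfig(int num_sms) : sm_l2_cacheable_(num_sms, true) {}

  // Applied at reconfiguration time; only affects CTAs launched afterwards.
  void SetSmPolicy(int sm_id, bool cacheable) { sm_l2_cacheable_[sm_id] = cacheable; }

  void LaunchCta(int sm_id, int cta_id) {
    running_[sm_id].push_back({cta_id, sm_l2_cacheable_[sm_id]});
  }

  // An access from a running CTA obeys the policy captured at its launch.
  bool L2Cacheable(int sm_id, int cta_id) const {
    for (const CtaState& c : running_.at(sm_id))
      if (c.cta_id == cta_id) return c.l2_cacheable;
    return sm_l2_cacheable_[sm_id];
  }

 private:
  std::vector<bool> sm_l2_cacheable_;
  std::unordered_map<int, std::vector<CtaState>> running_;
};

int main() {
  GradualL2Reconfig l2(32);
  l2.LaunchCta(/*sm_id=*/3, /*cta_id=*/7);  // launched while SM 3 is still cacheable
  l2.SetSmPolicy(3, /*cacheable=*/false);   // SM 3 is reconfigured to bypass the L2
  l2.LaunchCta(3, 8);                       // the new CTA bypasses from the beginning
  std::cout << l2.L2Cacheable(3, 7) << " " << l2.L2Cacheable(3, 8) << "\n";  // prints "1 0"
  return 0;
}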

6.4.2 Kernel Characterization

The proposed set-sampling mechanism collects the required statistics of each individual kernel at run-time. The question that arises is: “At what point during the execution is the running kernel in a steady state?” Indeed, once the running kernel reaches a steady state, the collected information can accurately represent the performance of the different caching configurations; otherwise, there might be considerable differences between the sampled behavior and the actual long-term characteristics of the kernel. In our experimental studies over a wide range of GPGPU applications, we observed that using a fixed-size sampling window (in terms of the number of cycles or the number of executed instructions) does not accurately capture the characteristics of the kernel, because the execution times of different kernels vary widely for a fixed window size. Therefore, our goal is to have a dynamic window size for our characterization purposes, to obtain the intrinsic behavior of each individual kernel at run-time. To this end, we propose kernel-based and CTA-based sampling techniques.

Kernel-based sampling: Within a GPGPU application, two basic properties exist. First, most of the kernels of an application are launched multiple times during the course of application execution. Second, different invocations of the same kernel exhibit very similar behaviors [121]. These two common properties motivate us to exploit the first execution of a kernel as the sampling phase, and to use the collected information for the future invocations of that kernel. However, such a kernel-based approach is not applicable in two situations: first, if the kernel is launched only once during the execution of the application; and second, if the kernel does not exhibit a consistent behavior over different invocations. Although these two situations are not very common, we do not use kernel-based sampling as our primary scheme.

CTA-based sampling: As discussed in Section 6.2, each kernel is split into smaller blocks called CTAs. The number of CTAs that can concurrently execute on an SM is limited by the amount of available resources at the SM. Therefore, SMs start executing a kernel with the maximum possible number of concurrently executing CTAs, and whenever a CTA finishes, the CTA scheduler launches another CTA on the available SM. This procedure continues until all the CTAs have been serviced. Because these CTAs run similar code, each of them can represent the memory characteristics of the running kernel. This property motivates us to employ a CTA as a fine yet accurate granularity for the sampling phase. Here, the question then arises: “After the execution of how many CTAs is the running kernel in the steady state?” When a kernel starts executing, the CTA scheduler launches a fixed number of CTAs on the SMs, which is equal to the number of SMs multiplied by the maximum number of CTAs per SM. We denote this number by Max_Concurrent_CTAs, and we consider Max_Concurrent_CTAs as our sampling unit. However, the proposed characterization mechanism continuously analyzes the behavior of the kernel, and if it recognizes noticeable changes in the memory access patterns, it accordingly updates the collected statistics, which can in turn change the cache configuration. Unlike the kernel-based approach, a CTA-based scheme can be exploited for (i) kernels that are launched only once and (ii) kernels with inconsistent behavior over different invocations, and (iii) it also recognizes the behavioral changes within each kernel execution. Therefore, in this work we use this CTA-based sampling scheme for our characterization purposes.
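As a small illustration of this sampling unit, the sketch below computes Max_Concurrent_CTAs and flags the end of the sampling phase once that many CTAs have retired; the class interface and the example resource numbers are assumptions, not the simulator's code.

// Sketch of the CTA-based sampling window: sampling ends once Max_Concurrent_CTAs
// CTAs (the number launched at kernel start) have retired.
#include <iostream>

class CtaSamplingWindow {
 public:
  CtaSamplingWindow(int num_sms, int max_ctas_per_sm)
      : window_size_(num_sms * max_ctas_per_sm) {}

  void OnCtaRetired() { ++retired_ctas_; }
  bool SamplingDone() const { return retired_ctas_ >= window_size_; }
  int WindowSize() const { return window_size_; }

 private:
  int window_size_;
  int retired_ctas_ = 0;
};

int main() {
  // Example: 32 SMs and (assumed) at most 6 concurrent CTAs per SM -> a 192-CTA window.
  CtaSamplingWindow window(/*num_sms=*/32, /*max_ctas_per_sm=*/6);
  std::cout << "sampling window = " << window.WindowSize() << " CTAs\n";
  for (int i = 0; i < 192; ++i) window.OnCtaRetired();
  std::cout << "sampling done = " << window.SamplingDone() << "\n";  // prints 1
  return 0;
}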

6.4.3 Determining the Ideal Configuration

Algorithm 1 represents our strategy for determining the ideal L1D and L2 cache configurations. In this algorithm, MR_i represents the collected cache miss-rate for the case where i percent of the memory requests (i.e., i% of the warps for the L1D cache or i% of the SMs for the L2 cache) are allowed to access the cache. The Step_i variables represent the reduction in cache miss-rate in each step as we move from the 100% to the 25% caching configuration. After the sampling phase, we know the performance of the L1D and L2 caches under the different partial-caching scenarios. This knowledge is then translated into two independent performance curves, similar to those of Figures 6.2 and 6.3, in order to determine the ideal configurations for the L1D and L2 caches.

The reasoning behind our heuristic algorithm can be understood by analyzing Figures 6.2 and 6.3. Supported by our observations, for the kernels with high miss rates in the baseline configuration, we decide to completely bypass the cache if the kernel does not exhibit any considerable reduction in its cache miss-rate as the fraction of cacheable accesses decreases. Indeed, the high miss-rate of such kernels comes from their streaming behavior rather than from cache contention; PVR2 and TRA in Figure 6.2 are examples of this category of kernels. For the kernels in which reducing the fraction of cacheable accesses leads to a noticeable reduction in cache miss-rate, we potentially face a cache-thrashing situation and the proposed algorithm reconfigures the cache to improve its utilization (e.g., BFS in Figures 6.2 and 6.3). In order to find the optimal point on the performance curve, we pick the configuration that gives us the highest miss-rate reduction compared to the previous partial-caching configuration. In other words, we select the configuration with the highest Step_i because that configuration is expected to accommodate the data working-set most efficiently. Also, for the kernels where the cache is already being utilized properly (e.g., PVR1 in Figure 6.2), our proposed algorithm does not make any changes. As shown in Algorithm 1, we use two thresholds for our reconfiguration decisions, which are set by experimental evaluation. Threshold2 simply filters out small performance variations. Threshold1, on the other hand, is used to recognize cache-insensitive kernels. Note that, if our scheme makes a wrong decision based on this threshold (i.e., performs a complete cache bypass for a cache-sensitive kernel), it will immediately degrade the system performance, after which the cache will be reconfigured to the previous configuration. Therefore, Threshold1 is updated dynamically based on the characteristics of the kernel.

Algorithm 1 Pseudo code representing the high-level strategy

// Miss rates collected by the proposed hardware
MR100 ← MissRate.Base;  MR75 ← MissRate.75%;  MR50 ← MissRate.50%;  MR25 ← MissRate.25%;
// Thresholds
Threshold1 = 30%;  Threshold2 = 5%;
// Miss-rate reduction in each configuration step
Step1 ← (MR100 - MR75);  Step2 ← (MR75 - MR50);  Step3 ← (MR50 - MR25);

if (MR100 ≤ Threshold1) then
    if (Max(Step1, Step2, Step3) ≥ Threshold2) then
        Cache.Config ← Max(Step1, Step2, Step3).Config;
    else
        Cache.Config ← BaseConfig;
    end if
else
    if (Max(Step1, Step2, Step3) ≥ Threshold2) then
        Cache.Config ← Max(Step1, Step2, Step3).Config;
    else
        Cache.Config ← Bypass;
    end if
end if

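For concreteness, the decision rule of Algorithm 1 can be transcribed directly into code. The version below is logically equivalent to the pseudocode (the partial configuration with the largest step is taken whenever that step clears Threshold2; otherwise the baseline is kept for low-miss kernels and the cache is bypassed for high-miss ones); the enum and function names are illustrative.

// Direct transcription of Algorithm 1 (illustrative names; thresholds as in the text).
#include <algorithm>
#include <iostream>

enum class CacheConfig { kBypass, kQuarter, kHalf, kThreeQuarters, kBaseline };

// Miss rates are fractions in [0, 1] collected by the sampling hardware.
CacheConfig ChooseConfig(double mr100, double mr75, double mr50, double mr25,
                         double threshold1 = 0.30, double threshold2 = 0.05) {
  const double step1 = mr100 - mr75;  // gain of moving from 100% to 75% caching
  const double step2 = mr75 - mr50;   // gain of moving from 75% to 50% caching
  const double step3 = mr50 - mr25;   // gain of moving from 50% to 25% caching
  const double best_step = std::max({step1, step2, step3});

  const CacheConfig best_partial = (best_step == step1)   ? CacheConfig::kThreeQuarters
                                   : (best_step == step2) ? CacheConfig::kHalf
                                                          : CacheConfig::kQuarter;
  if (best_step >= threshold2) return best_partial;     // a clear thrashing fix exists
  return (mr100 <= threshold1) ? CacheConfig::kBaseline  // low miss-rate: keep the baseline
                               : CacheConfig::kBypass;   // high and insensitive: bypass
}

int main() {
  // BFS-like case: halving the cacheable fraction removes most of the conflict misses.
  std::cout << static_cast<int>(ChooseConfig(0.80, 0.75, 0.40, 0.38)) << "\n";  // 2 (kHalf)
  // Streaming case: uniformly high miss-rate, so the cache is bypassed.
  std::cout << static_cast<int>(ChooseConfig(0.90, 0.89, 0.88, 0.88)) << "\n";  // 0 (kBypass)
  return 0;
}

In the actual mechanism, such a routine would be evaluated once per cache level, fed by the miss rates gathered by the corresponding sampling monitor.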

Multi-Step Reconfiguration: Although the same algorithm is applied to both the L1D and L2 caches, they can end up with two different configurations because they observe different access traffic. Indeed, an L1D reconfiguration affects the access traffic seen by the L2 cache, which can consequently change the behavior of the L2 cache. This issue is more prominent in kernels with a cache-thrashing problem, because a minor change in the access pattern can change the cache performance significantly. Therefore, we do not reconfigure the L1D and L2 caches at the same time. Instead, after the first sampling phase we only reconfigure the L1D cache, if needed. Then, we reevaluate the characteristics of the L2 cache under the new L1D configuration; at this point we reconfigure the L2 cache, if needed.

6.5 Evaluation

Platform: In order to evaluate our proposal, we used GPGPU-Sim v3.2.2 [130], a publicly available cycle-accurate GPGPU simulator. The simulated configuration is similar to the GTX480 model from NVIDIA; its details are listed in Table 6.2. Each SM has a private 16KB L1D cache and a private L1I cache. The SMs are connected to 6 memory channels, and each memory channel is coupled with a 256KB slice of the L2 cache.

Benchmarks: Table 6.3 lists the applications used in our evaluations. We consider a wide range of cache-sensitive applications from various benchmark suites: CUDA SDK [124], Mars [125], Rodinia [127], Parboil [131], Shoc [126], and LonestarGPU [123]. In Table 6.3, CS and CI refer to Cache-Sensitive and Cache-Insensitive characteristics, respectively. Applications that are listed as CS-CI have multiple kernels with different characteristics and/or a single kernel with different characteristics for the L1D and L2 caches. The studied GPGPU applications cover a wide range of behaviors in terms of sensitivity to the L1D and L2 caches, allowing us to properly evaluate the efficiency of the proposed mechanism.

Table 6.2: Baseline configuration.

SM Config.        32 Shader Cores, 1400MHz, SIMT Width = 32
Resources/Core    1536 Threads (48 warps, 32 threads/warp), 48KB Shared Memory, 32684 Registers
Caches/Core       16KB 4-way L1D, 12KB 24-way Texture, 8KB 2-way Constant Cache, 2KB 4-way L1I
L2 Cache          256KB/Memory Partition, 128B Line Size, 8-way, 700MHz
Warp Scheduler    Greedy-then-oldest
Features          Memory Coalescing, Inter-warp Merging, Immediate Post Dominator
Interconnect      Crossbar, 1400MHz, 32B Channel Width
Memory Model      6 GDDR5 Memory Controllers (MCs), 1V, 8 DRAM banks/MC, 4 bank-groups/MC, FR-FCFS scheduling (256 requests/MC)
GDDR5 Timing      tCL = 12, tRP = 12, tRC = 40, tRAS = 28, tCCD = 2, tRCD = 12, tRRD = 6, tCDLR = 5, tWR = 12

Table 6.3: List of GPGPU benchmarks: CS and CI represent Cache-Sensitive and Cache-Insensitive kernels, respectively.

Suite            Application                     Abbr.   Type
Lonestar [123]   Single-Source Shortest Paths    SSSP    CS
Lonestar [123]   Breadth-First Search            BFS     CS
Lonestar [123]   Survey Propagation              SP      CS-CI
Lonestar [123]   Delaunay Mesh Refinement        DMR     CS
Shoc [126]       2D Stencil Computation          STN     CS
Shoc [126]       Quality Transpose Cluster       QTC     CS-CI
SDK [124]        Separable Convolution           CONS    CI
SDK [124]        StoreGPU                        STO     CS
SDK [124]        Blackscholes                    BLK     CI
SDK [124]        Transpose                       TRA     CS-CI
SDK [124]        Scan Large Array                SLA     CS-CI
SDK [124]        Neural Network                  NN      CS
Mars [125]       Page View Rank                  PVR     CS-CI
Mars [125]       Similarity Score                SSC     CS-CI
Mars [125]       Matrix Multiplication           MMUL    CS-CI
Rodinia [127]    SRAD1                           SD1     CS-CI
Rodinia [127]    SRAD2                           SD2     CS-CI
Rodinia [127]    Streamcluster                   STC     CS-CI

6.6 Experimental Results

6.6.1 Dynamism

The proposed selective caching mechanism evaluates the behavior of each kernel at run-time and accordingly determines the ideal caching configuration for the L1D and L2 caches in a multi-step fashion. Figure 6.7 demonstrates how our mechanism reconfigures the cache hierarchy for the studied applications through a two-step process. After the first sampling period, the ideal L1D configuration is determined. As demonstrated in Figure 6.7, since a reconfiguration of the L1D changes the access pattern seen by the L2 cache, a second characterization is required to accurately capture the behavior of the L2 cache under the new L1D configuration. Our scheme then determines the ideal configuration for the L2 cache. Note that the L2 cache can be reconfigured after the first sampling phase if our scheme does not change the configuration of the L1D; however, in Figure 6.7 we show all the L1D and L2 cache reconfigurations in two separate sections for ease of analysis. As can be seen in this figure, for each individual kernel, the L1D and L2 caches might end up with two different configurations because they observe different access traffic. For instance, SSSP suffers from the cache-thrashing problem in the L1D, and our mechanism reduces the fraction of cacheable warps in the L1D to 25% while the L2 cache is kept shared among all the SMs. Similar variations in the L1D and L2 configurations can be seen in other applications as well (e.g., PVC, PVR, SP, etc.). In Figure 6.7, some applications (e.g., PVR) appear in multiple configuration settings because they have multiple kernels, each with a different ideal cache configuration. For instance, as discussed with Figures 6.2 and 6.3, PVR consists of two main kernels: the L1D cache is completely bypassed for PVR2 but is shared among all the warps for PVR1, whereas PVR1 performs best without the L2 cache while the L2 is shared among all of the SMs for PVR2. Note that the configuration of the cache hierarchy might change across different invocations of the same kernel if the kernel exhibits different characteristics over different invocations; however, in Figure 6.7 we show the outcome of our proposed mechanism only for the first invocation of each kernel.


Figure 6.7: The ideal caching configurations for the L1D and L2 caches determined by our proposed mechanism. The configuration of the L1D is determined after the first sampling phase. However, another sampling phase needs to be performed to accurately capture the characteristics of the L2 cache under the new L1D configuration. Some applications (e.g., PVR) appear in multiple settings because they have multiple kernels, each with a different demand. Since some of the applications consist of many kernels, different kernels of the same application are not individually indexed, to keep the figure readable.

6.6.2 Performance

Figure 6.8 demonstrates how our proposed mechanism improves the overall system performance (in terms of IPC) by dynamically reconfiguring the cache hierarchy. This figure is partitioned into four logical sections for ease of evaluation. The first part represents the applications for which our technique does not make any changes in the cache hierarchy: although these applications are cache-sensitive, our proposed scheme finds the baseline configuration already ideal and does not reconfigure it. In fact, the L1D miss rates under the baseline configuration for DMR, NN, and STO equal 0.5%, 3%, and 5%, respectively; similarly, the observed L2 miss rates for DMR, NN, and STO equal 4%, 2%, and 6%, respectively. The second section in Figure 6.8 (labeled L1) represents the applications for which our proposed mechanism reconfigures only the L1D cache because the L2 cache performs efficiently under the base configuration; our proposed selective caching scheme achieves an average of 40% performance improvement for this category of applications. The third section in Figure 6.8 (labeled L2) represents the cases where only the L2 cache is reconfigured; this category of applications experiences an average performance improvement of 13%. The fourth section (labeled L1&L2) includes the applications in which both the L1D and L2 caches are reconfigured by our mechanism, achieving an average of 31% performance improvement. As can be seen, for this category of applications, the coordinated reconfiguration of the L1D and L2 caches outperforms the gain from each individual reconfiguration. Overall, our proposed mechanism improves the system performance by 27% on average over 18 GPGPU applications consisting of more than 45 individual kernels.

Optimal Configuration: The optimal data presented in Figure 6.8 represent the optimal cache hierarchy found by offline analysis of all the different combinations of L1D and L2 configurations. As can be seen, for most of the applications the configuration found by our dynamic mechanism matches the optimal configuration. Note that, in applications with a very small number of CTAs (e.g., BLK), although our run-time mechanism converges to the optimal configuration, the achieved overall gain is less than the optimal improvement because the characterization phase contributes a considerable portion of the whole kernel execution time.


Figure 6.8: IPC normalized to the baseline configuration where all the threads are allowed to access the cache. The L1, L2, and L1&L2 sections represent the applications which experienced reconfigurations in their L1D (only), L2 (only), and both the L1D and L2 caches, respectively.

6.6.3 Sensitivity Study

In this work, our baseline GPGPU architecture is based on the GTX480 Fermi architecture [6]. The baseline cache hierarchy is assumed to have a 16KB private L1D cache for each SM (i.e., 512KB of L1D cache in total) and a 256KB L2 cache slice for each memory partition (i.e., 1.5MB of L2 cache in total). In order to analyze the efficiency of our proposed mechanism under different cache sizes, we first changed the L1D cache size in Figure 6.9 and then the L2 size in Figure 6.10. As we change the size of a cache level, the behavior of that cache as well as the access traffic to the next cache level changes. Therefore, as can be seen in Figures 6.9 and 6.10, our proposed mechanism might make a different adjustment for the same application under different cache settings. Overall, for the configuration reported in Figure 6.9, our proposed mechanism improves the system performance by an average of 25%. For the configuration reported in Figure 6.10, however, we observe higher performance improvements (an average of 33%), since a smaller L2 cache size causes more conflict misses in cache-sensitive applications. Note that GPGPU resources are rapidly growing with each technology generation [132, 133] to provide the resources (e.g., computing cores, caches, and memory bandwidth) demanded for efficient computation. However, the new generation of GPGPU applications is also becoming highly data-intensive with large working-sets. This trend further underscores the importance of our proposed selective caching mechanism, which can utilize the limited-capacity on-chip caches in an efficient manner.


Figure 6.9: (L2=256KB, L1D=32KB) IPC normalized to the baseline. The L1, L2, and L1&L2 sections represent the cases which experienced reconfigurations in their L1D (only), L2 (only), and both the L1D and L2 caches, respectively.


Figure 6.10: (L2=128KB, L1D=16KB) IPC normalized to the baseline. The L1, L2, and L1&L2 sections represent the cases which experienced reconfigurations in their L1D (only), L2 (only), and both the L1D and L2 caches, respectively.

6.6.4 Cache Miss-Rate

Resolving the cache-contention problem leads to a considerable reduction in cache miss-rate. Figure 6.11 shows the kernels for which our mechanism has reduced the L1D and/or L2 cache miss-rates by regulating the fraction of cacheable accesses. Although these applications inherently have high data locality, they experience very low cache hit rates because of insufficient cache capacity for the enormous number of concurrently running threads. Traditional cache management policies are incapable of handling this issue, which in turn leads to frequent cache conflicts. Comparing Figures 6.7 and 6.11 tells us which configuration results in the corresponding hit-rate improvement for each application. For instance, for SSSP, the L1D miss-rate is reduced by 50% by using the 25% partial-caching configuration.


Figure 6.11: Cache miss-rate reduction after reconfiguring the cache hierarchy to resolve the cache-thrashing issue. (This figure only contains the kernels from Table 6.3 with cache-thrashing problem in the L1D and/or L2 caches)

6.6.5 Comparison with Warp-Throttling Techniques

Some prior studies in the GPGPU domain (e.g., [33,34,134]) exploit warp-throttling techniques to resolve different resource contention problems. In general, hardware/software throttling mechanisms try to reduce the memory traffic by decreasing the number of concurrently running threads. For instance, the throttling techniques proposed in [33,34] operate at the warp granularity to control the amount of accesses to the L1D cache. Based on our experimental observations, thread-throttling mechanisms leave other shared resources, such as the memory bandwidth and the computing cores, significantly underutilized. In contrast, our proposed technique does not sacrifice thread-level parallelism to mitigate the cache conflict problem. Instead, we employ a selective caching approach to resolve the cache contention problem while the other shared resources remain actively utilized by all the running threads. Besides the resource underutilization problem, thread-throttling techniques are generally effective only for resolving potential conflicts at the L1D cache. In contrast, our proposed technique works at the SM granularity for the L2 cache, which affects neither the thread-level parallelism nor the L1D performance.

Figure 6.12 demonstrates how our proposed mechanism, which combines a warp-level scheme in the L1D and an SM-level scheme in the L2 cache, outperforms the warp-throttling techniques proposed in [33,34]. This figure only contains the applications with a cache-contention problem, because throttling techniques have no positive impact on other types of applications.


Figure 6.12: Comparing the impact of a throttling scheme with our proposed scheme on resolving cache-thrashing. (This figure only contains the kernels from Table 6.3 with cache-thrashing problem in the L1D and/or L2 caches)

As can be seen in the first section of Figure 6.12 (labeled as L1D), even though warp-throttling can reduce the number of conflicts at the L1D cache, our proposed technique considerably outperforms it because the selective caching policy does not sacrifice thread-level parallelism for the sake of reducing the number of cache conflicts. For the L2 cache, however (labeled as L2 in Figure 6.12), the warp-throttling approach cannot effectively resolve the contention problem because the performance loss caused by underutilizing the cores and the L1D outweighs the potential improvement gained by the L2 optimization.

6.6.6 Comparison with Reuse Distance-Based Caching Policies

Some prior works in the CPU and GPU domains (e.g., [7,129,135–138]) exploit data reuse distance as a metric to determine which data blocks should be kept within the cache, and for how long. However, each of these reuse distance-based techniques uses different heuristics to predict the time and frequency of future accesses to a cache block. Here, we compare our proposed architecture with the state-of-the-art reuse distance-based scheme for a GPGPU platform (i.e., Chen et al. [7]). Besides exploiting data reuse distance to determine which cache blocks should be kept within the cache, they also use a warp-throttling mechanism to resolve any potential congestion in the interconnection network. More precisely, they propose a coordinated cache bypassing and warp-throttling mechanism (named CBWT) to resolve the contention in the L1D cache as well as the congestion in the interconnection network, both of which are caused by the high degree of thread-level parallelism in a typical GPGPU platform.


Figure 6.13: IPC normalized to the baseline configuration where all the threads are allowed to access the cache. The L1, L2, and L1&L2 sections represent the applications which experienced reconfigurations in their L1D (only), L2 (only), and both the L1D and L2 caches, respectively. CBWT (Cache Bypassing Warp Throttling) is the proposed scheme in [7].

Figure 6.13 compares our technique with CBWT in terms of IPC. We can draw the following main conclusions from this comparison: (i) As discussed in [7], this technique resolves the contention at the L1D cache, but its applicability to the L2 cache is not trivial. As such, CBWT does not have any positive impact on applications which only suffer from contention at the L2 cache (e.g., BLK, MMUL, and SLA). (ii) For applications in which the interconnection network is also congested (e.g., BFS, QTC, and SSSP), CBWT outperforms our proposed mechanism by dynamically reducing the number of running threads, which in turn mitigates the congestion in the interconnection network. As a result, for the L1D-sensitive applications, CBWT outperforms our proposed mechanism by 2.1% on average, as CBWT also resolves potential network congestion through run-time warp-throttling. Note that one can also resolve interconnection network congestion and/or memory bandwidth saturation by adopting a core shutdown mechanism (e.g., [121]) or a core-side DVFS scheme (e.g., [134]), which are orthogonal to our proposed cache architecture. Nonetheless, since our proposed architecture is applicable to different cache levels and reconfigures both the L1D and L2 caches in a coordinated fashion, it outperforms CBWT by 10% on average.

6.7 Related Work

Jia et al. [54] propose a compile-time algorithm which analyzes load instructions and determines which of them should be cached in the L1D. From a different perspective, Xie et al. [55] optimize the code by bypassing the L1D cache for load instructions with low data reuse. They further improve that work by adding a dynamic L1D tuning scheme to the compiler-based optimizations [139]. Similarly, some prior studies in the CPU and GPU domains (e.g., [56–59]) aim to keep the data that is most likely to be reused in the cache. In general, such compile-time approaches are not aware of run-time parameters and cannot detect cache-thrashing during different phases of execution. However, they can mitigate the thrashing problem by keeping the data that is most likely to be used in the cache. Note that all of these compile-time techniques are proposed for the L1D cache and cannot easily be extended to optimize the performance of the L2 cache.

Building on the concept of reuse distance, some prior works (e.g., [7,135–138]) propose dynamic heuristic techniques to optimize the performance of the L1D cache. This category of reuse distance-based techniques can effectively detect contention in the L1D cache at run-time; however, applying these techniques to the L2 cache is not trivial, as thoroughly discussed in Section 6.6.6.

Another category of studies proposes new cache management policies to deal with the cache contention problem. Jia et al. [118] propose a dynamic approach for L1D cache bypassing. In this work, if an L1D miss request cannot be assigned the required resources (i.e., a cache line, an MSHR entry, and a miss queue entry), it is sent to memory without being cached. This mechanism does not resolve the cache contention problem; however, it reduces the latency of L1D cache misses. Qureshi et al. [140] propose the dynamic insertion policy (DIP) to mitigate contention at the last-level cache in CMPs. The proposed mechanism determines whether a cache line should be inserted at the LRU or at the MRU position in order to reduce the contention caused by large working-sets. Based on our evaluations, DIP is not effective in GPU platforms. We observed considerable improvement for only two of the studied applications (i.e., 12% improvement in STN, and 9% improvement in TRA) when using DIP for the L2 cache in our platform.

Similar approaches can also be found in [128,141,142], in which the insertion and eviction policies are adapted to optimize cache performance in CMPs for applications with large working-sets. Because of some fundamental differences between CPU and GPU architectures, such techniques are not as effective in GPU platforms.

From a different perspective, [33] and [34] propose thread-throttling techniques to deal with the cache contention problem. Although these approaches succeed in reducing the number of conflicts at the L1D cache by reducing the number of concurrently running warps, they underutilize other shared resources such as the memory bandwidth and computing cores, as discussed in Section 6.6.5. Note that throttling techniques can also be used to resolve congestion in the interconnection network as well as the memory bandwidth saturation problem in memory-hungry applications (e.g., [121,134,143]). Such mechanisms are orthogonal to our work and can be combined with our architecture to further improve the system performance.

Some other prior works particularly focus on the last-level caches in GPUs. Mu et al. [60] propose an L2 cache management technique for GPUs. The proposed technique tries to keep the data that is most likely to be reused in the cache, based on the number of threads bound to each memory request. This mechanism is tightly integrated with the memory scheduler to reduce the average data access latency. Some prior studies in the CPU domain take a similar approach to improve the performance of last-level caches by employing locality-based bypassing [61–64]. These techniques are closely integrated with the memory controller to improve the overall data access latency. In our work, we propose our own run-time sampling methods based on the unique features of GPGPU applications. Further, our selective caching mechanism is applicable to both the L1D and L2 caches, unlike prior studies which are not applicable across different cache levels.

6.8 Conclusion

In this work, we demonstrated that the conventional cache management policies are incapable of efficiently exploiting the on-chip cache hierarchy for different GPGPU applications. In order to resolve this issue, we proposed a selective caching

mechanism which determines the configuration of the L1D and L2 caches based on run-time characterizations. The proposed mechanism improves the system performance by 27% on average over 18 GPGPU applications, which is within 10% of the optimal improvement.

Chapter 7

Improving the Performance of Last-Level Cache through a Criticality-Aware Cache Compression

Cache compression is a promising technique to increase on-chip cache capacity and to decrease off-chip bandwidth usage. While prior compression techniques always trade off compression ratio against decompression latency, they are oblivious to the variation in criticality of different cache blocks. In multi-core processors, the last-level cache (LLC) is logically shared but physically distributed among cores. In this work, we demonstrate that cache blocks within such a non-uniform architecture exhibit different sensitivities to the access latency. Owing to this behavior, we propose a criticality-aware compressed LLC that favors lower latency over higher capacity based on the criticality of the data blocks. Based on our studies on a 16-core processor with a 4MB LLC, our proposed criticality-aware mechanism improves the system performance to a level comparable to that of an 8MB uncompressed LLC.

7.1 Introduction

Despite considerable research in the past three decades leading to multi-fold improvements in cache efficiency, the problem has become more challenging in current and future generations of processors. Workloads in the next generation of computing systems are expected to be highly data-intensive. The available processing power is also steadily increasing, and major manufacturers are planning to integrate hundreds of cores on a die. In such multi-core systems, computer architects employ high-capacity on-chip cache hierarchies to reduce data access latency. The decision of how large to make a given cache involves trade-offs: while larger caches often reduce the number of cache misses, this potential benefit comes at the cost of higher power consumption, longer cache access latencies, and increased chip area. As processors integrate more and more cores, providing sufficient on-chip cache capacity becomes increasingly challenging. Simply scaling cache capacity linearly with the number of cores is not practical because of power and on-chip area limitations.

To resolve this issue, some prior works (e.g., [1, 8, 9, 35–37]) use various data compression schemes to achieve larger capacity without suffering all the disadvantages of fabricating larger caches. The biggest obstacle to adopting cache compression in commercial processors is the decompression latency. Unlike compression, which takes place in the background, decompression is on the critical path and directly affects the system performance. Therefore, in order to improve the system performance, it is vital to achieve a fine balance between the extra capacity achieved by data compression and the extra access delay imposed by it. Studying previous compression schemes shows that compression mechanisms always sacrifice one for the other: sophisticated compression schemes are capable of achieving higher compression ratios (i.e., larger cache capacities) but at the cost of longer decompression latencies (i.e., slower cache accesses), and vice versa. In this work, we demonstrate that in designing a compressed cache, data criticality should be considered as a third design parameter, along with compression ratio and decompression latency. While typical compression schemes decide to store a cache block either in compressed or uncompressed format based solely on the content of the cache block, our proposed mechanism also considers the criticality of the block.

In other words, even if a cache block can be stored in a compressed format, we might decide to store it in an uncompressed or less-compressed format based on its criticality (i.e., its latency sensitivity). Based on our observations, applications exhibit different sensitivities to the access latency of different cache blocks. Such variation in latency sensitivity is partially a function of the underlying architecture, where the LLC is physically distributed among cores, forming a cache structure with non-uniform access latency. Considering this architecture, our proposed mechanism improves the performance of the compressed cache through the following optimizations:

• Considering the facts that (i) compression schemes always offer a trade-off between compression ratio and decompression latency, and (ii) local and remote cache blocks exhibit different sensitivities to the access latency, we propose a hybrid mechanism to balance the extra cache capacity against the imposed decompression latency. To this end, our proposed architecture favors lower latency over higher capacity for local cache blocks by adopting a fast compression scheme (i.e., low-compression-ratio, low-decompression-latency). For remote blocks, however, we use a strong compression scheme (i.e., high-compression-ratio, high-decompression-latency) to prioritize capacity.

• We will further discuss that, in an out-of-order processor, some cache blocks cause long ROB (ReOrder Buffer) stalls which directly degrade the system performance. Such blocks cannot tolerate long decompression latencies, and our scheme categorizes them as critical as well, meaning that they can be compressed only by a fast compression scheme.

• We will also demonstrate that, by judiciously adopting multiple fast compression schemes, we can improve the overall data-type/data-pattern coverage, which in turn provides a compression ratio comparable to that of strong compression schemes while the decompression latency is kept low.

• We will finally illustrate that the decompression process can be pipelined in a specific category of compression schemes. In such schemes, decompression can be performed as the compressed data block traverses the interconnection network. By performing the consecutive stages of decompression over different routers along the traversal path, we can partially overlap the decompression delay with the data traversal delay.

7.2 Background and Related Works

7.2.1 Baseline Platform

As shown in Figure 7.1, large last-level caches (LLCs) in modern multi-core processors are structured as a non-uniform cache architecture (NUCA), where the LLC is logically shared but physically distributed among the cores. More precisely, each cache bank is connected to one core, and data movement between the banks is managed by a network of routers. In this work, we adopt a mesh topology as our network-on-chip (NoC) configuration to have an efficient, low-overhead platform. In such a non-uniform architecture, accesses to a local block experience a latency equal to the cache hit-latency, while remote accesses experience variable latencies depending on the distance between the requesting and target nodes.

7.2.2 Cache Compression

In the context of on-chip caches, some prior works (e.g., [1,8,9,35–37]) use various data compression schemes to achieve larger cache capacity. Table 7.1 summarizes the characteristics of the most well-known cache compression schemes. ZCA [36] and [144] exploit zero values to compress cache blocks. FVC [37] and FPC [1] use frequently repeated data values and data patterns, respectively, to encode data blocks into a compact format. BDI [8] relies on the observation that the words in a cache block are often numerically close to each other, making it possible to encode them as small deltas from a base value. C-Pack [35] and [145] utilize both static patterns and a dynamically updated dictionary to achieve higher coverage. SC2 [9] leverages Huffman-based compression to achieve a higher compression ratio. In general, the ultimate goal of cache compression is to achieve larger capacity while the decompression latency is kept within a reasonable range. However, compression schemes usually sacrifice one for the sake of the other.
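To make the base-plus-delta idea concrete, the following sketch checks whether a 64-byte block can be encoded as one base word plus narrow per-word deltas, in the spirit of BDI [8]; the word size, delta width, and example values are illustrative assumptions rather than the exact encodings of the original design.

```python
import os
import struct

def bdi_compressible(block: bytes, word_size: int = 8, delta_bytes: int = 1) -> bool:
    """Return True if every word in the block can be expressed as
    (base + signed delta) with the delta fitting in `delta_bytes` bytes.
    This mirrors the base-delta idea behind BDI, not its exact encodings."""
    assert len(block) % word_size == 0
    words = [int.from_bytes(block[i:i + word_size], "little")
             for i in range(0, len(block), word_size)]
    base = words[0]
    limit = 1 << (8 * delta_bytes - 1)          # signed range of the delta field
    return all(-limit <= w - base < limit for w in words)

# Pointers into the same region typically compress well under base+delta.
pointers = b"".join(struct.pack("<Q", 0x7FFF_AB00_0000 + 16 * i) for i in range(8))
print(bdi_compressible(pointers))        # True: one 8-byte base + eight 1-byte deltas
print(bdi_compressible(os.urandom(64)))  # almost certainly False: deltas too wide
```

A block that passes this check shrinks from 64 bytes to roughly the base plus one small delta per word, which is why pointer- and counter-heavy data benefits the most from BDI-style compression.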


Figure 7.1: Typical tiled multi-core architecture. Tiles are interconnected into a 2-D mesh. Each tile contains a core, private L1I and L1D caches, a shared L2 cache bank, and a router for data movement between nodes.

Table 7.1: Cache compression techniques.

Compression Scheme   Technical Contribution    Decompression Latency   Compression Ratio
ZCA [36]             Zero Values               1 Cycle                 Low
FVC [37]             Frequent Values           5 Cycles                Modest
BDI [8]              Narrow Values             1 Cycle                 High
FPC [1]              Frequent Patterns         5 Cycles                High
C-Pack [35]          Dynamic Dictionary        8 Cycles                High
SC2 [9]              Statistical Compression   8/14 Cycles             High

7.3 Compression Implications

7.3.1 Latency versus Capacity

Data compression can potentially improve the system performance by reducing the number of off-chip memory accesses, as a larger portion of the working-set (WS) fits in the on-chip cache. However, since the decompression process is on the critical path, the average data access latency increases, which can outweigh the gain from the larger cache capacity. To clarify this trade-off, Figure 7.2 illustrates a hypothetical compression scenario. In this figure, only WS1 fits in the baseline cache configuration, where all data blocks are stored in uncompressed format. By adopting compression, WS2 can also fit in the cache along with WS1. In the baseline system, WS1 elements can be accessed with a latency equal to the cache hit-latency, while the latency of accessing WS2 elements is equal to the off-chip access latency.

After compression, however, compressed lines from WS1 and WS2 both experience a latency equal to the cache hit-latency plus the decompression latency. Therefore, on the one hand, compression can degrade performance by increasing the access latency of WS1 elements; on the other hand, it can potentially improve performance by eliminating the off-chip memory accesses for WS2 elements. To provide a quantitative evaluation, Figure 7.3 illustrates the impact of data compression on the gcc and omnetpp applications under the BDI [8] and FPC [1] compression schemes. Figure 7.3.a shows what percentage of the working-set is covered by WS1 and WS2 under the different compression schemes. Figure 7.3.b shows the performance loss caused by imposing longer access latencies on WS1 elements (1 cycle for BDI and 5 cycles for FPC, as reported in Table 7.1). Figure 7.3.c indicates the performance improvement achieved by eliminating off-chip accesses to WS2 elements. In short, one can expect better performance only if the improvement achieved by optimizing WS2 outweighs the performance degradation caused by slower accesses to WS1. The former is a function of the compression ratio, and the latter of the decompression latency. Therefore, while a high-compression-ratio, high-decompression-latency scheme can be effective for a capacity-sensitive application, it can degrade the performance of a latency-sensitive application (e.g., the FPC scheme for omnetpp in Figure 7.3), and vice versa. Even though one could determine the ideal compression scheme for an application through off-line evaluation, resolving this issue at run-time is more practical and also captures the dynamic characteristics of different phases of execution. We will demonstrate that, because applications often exhibit different sensitivities to the access latency of different cache blocks, the compression mechanism should be aware of such variations in order to achieve a fine-balanced system, where each individual cache block is handled based on its sensitivity to the decompression latency.
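As a back-of-the-envelope view of this trade-off, the snippet below estimates the average LLC access latency before and after compression from the hit latency, the decompression latency, the off-chip latency, and the hit-rate gained by fitting part of WS2 on chip; all of the numbers are hypothetical and serve only to show when the capacity gain outweighs the added decompression delay.

```python
def avg_access_latency(hit_rate, hit_lat, decomp_lat, offchip_lat):
    """Expected LLC access latency: hits pay the hit latency (plus decompression
    when blocks are stored compressed); misses pay the off-chip latency."""
    return hit_rate * (hit_lat + decomp_lat) + (1 - hit_rate) * offchip_lat

# Hypothetical numbers: 10-cycle L2 hit, 200-cycle off-chip access.
baseline = avg_access_latency(hit_rate=0.70, hit_lat=10, decomp_lat=0, offchip_lat=200)
strong   = avg_access_latency(hit_rate=0.78, hit_lat=10, decomp_lat=5, offchip_lat=200)  # FPC-like
fast     = avg_access_latency(hit_rate=0.74, hit_lat=10, decomp_lat=1, offchip_lat=200)  # BDI-like

print(baseline, strong, fast)   # roughly 67, 55.7, and 60.1 cycles under these assumptions
```

Whether the compressed configuration wins thus depends on whether the extra hits it creates (the WS2 term) buy back more cycles than every existing hit now loses to decompression, which is exactly the balance our criticality-aware design tries to strike per block.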

Figure 7.2: Impact of compression on data access latency. WS1: the portion of the working-set that fits in the baseline LLC (baseline: L2 hit latency; compressed: L2 hit latency + decompression latency). WS2: the extra portion of the working-set that fits only in the compressed LLC (baseline: off-chip memory latency; compressed: L2 hit latency + decompression latency).

Figure 7.3: Impact of larger cache capacity versus longer cache access latency on the system performance, in terms of IPC. (a) Size of compressed WS2 as a function of the compression scheme. (b) Negative impact of the longer access latency on WS1. (c) Positive impact of the larger capacity on WS2. (Results shown for gcc and omnetpp under BDI and FPC.)

7.4 Criticality-Aware Compression

7.4.1 Data Criticality

In general, compression schemes value different data blocks equally. Some data blocks can be compressed, and subsequently occupy less cache space, while other blocks occupy the whole block. In this work, on the other hand, we distinguish cache blocks based on their impact on performance. Based on our studies, the computing cores are more sensitive to the access latency of some cache blocks than others. We refer to such cache blocks as critical or latency-sensitive blocks. Considering this definition of criticality, our proposed mechanism distinguishes blocks based on their criticality and guarantees that critical blocks are accessed within a tolerable latency. To this end, we consider two categories of critical blocks.

Local Cache Blocks: In NUCA-structured caches, local cache blocks are accessed with the minimum latency, which is equal to the cache hit-latency, while the access latency of remote cache blocks is variable because the data has to go through the interconnection network. Based on our studies, the computing cores are more sensitive to the imposed decompression latency of local cache blocks, because remote cache blocks already experience longer access latencies compared to local blocks.

For instance, in a 16-core processor structured as a 4 × 4 mesh topology (the detailed configuration is given in Table 7.2), if we assume uniform access traffic to the different LLC banks, the access latency of a local block is 10 cycles, while for a remote cache block it is about 15 cycles on average, assuming the network is empty. For real memory-intensive applications, however, the average access latency to remote cache blocks is longer because of the traffic within the network. Therefore, we need to distinguish local and remote cache blocks, because they experience widely different access latencies, which in turn affects their sensitivity to the decompression latency.
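The 15-cycle figure can be sanity-checked with a small calculation: under uniform traffic on an unloaded 4 × 4 mesh with 2-cycle hops (as in Table 7.2) and a 10-cycle bank hit, the average Manhattan distance to a remote bank is about 2.7 hops. The sketch below reproduces that estimate; it ignores router pipeline depth and contention, so it is only a first-order check.

```python
from itertools import product

HOP_CYCLES, HIT_CYCLES, DIM = 2, 10, 4   # per Table 7.2: 2-cycle hops, 10-cycle bank hit

nodes = list(product(range(DIM), range(DIM)))
remote_hops = [abs(sx - dx) + abs(sy - dy)             # Manhattan distance on the mesh
               for (sx, sy) in nodes
               for (dx, dy) in nodes
               if (sx, sy) != (dx, dy)]

avg_hops = sum(remote_hops) / len(remote_hops)
print(f"average hops to a remote bank: {avg_hops:.2f}")                                   # ~2.67
print(f"average remote access latency: {HIT_CYCLES + HOP_CYCLES * avg_hops:.1f} cycles")  # ~15.3
```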

Long ROB Stalls: Most commercial processors perform out-of-order execution to achieve maximum performance. Even though instructions are executed out of order, they are committed in order. Processors typically adopt a buffer, called the ReOrder Buffer (ROB), to commit executed instructions in an in-order fashion. The ROB stalls when the instruction at its head has not finished yet while many instructions after it have already finished and are ready to commit. Such ROB stalls directly degrade the system performance. The memory request bound to the stalled instruction at the head of the ROB is called a critical load [146, 147]. Ghose et al. [146] demonstrate the importance of critical loads and accordingly propose a memory scheduler which prioritizes critical memory accesses. Kotra et al. [147] observe this issue for cache blocks and accordingly propose a customized NUCA platform for non-volatile LLCs. In our studies, we observe similar behavior for cache blocks, meaning that data blocks which happen to stall the ROB are more sensitive to the access latency and cannot tolerate long decompression latencies.

Figure 7.4: Configuration of the critical load predictor logic. Each CPT entry, indexed by PC, holds a numLoadCount and a robBlockCount; numLoadCount is incremented on each committed load and robBlockCount is incremented when that load stalls the ROB. On a lookup, a block is classified as critical if robBlockCount ≥ Threshold.

In this work, we adopt a mechanism similar to that of [147] to detect critical cache blocks. Figure 7.4 depicts the structure of the critical load predictor used in our work. On each instruction commit, the head of the ROB is used to update the criticality predictor table (CPT). More precisely, the PC of the head is used to index the CPT (only for loads). If it is a hit, numLoadCount is incremented. If this load results in an ROB stall, the robBlockCount is also incremented. If the CPT does not contain an entry with the corresponding PC, a new entry is inserted into the CPT. On a lookup, the CPT is accessed and, based on the value of robBlockCount, we determine whether the target block is critical1.
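A minimal software sketch of this predictor is shown below. The table capacity, eviction policy, and threshold are assumptions chosen for illustration; the hardware table in [147] is organized differently, and only the update/lookup behavior described above is modeled.

```python
class CriticalityPredictor:
    """PC-indexed Criticality Predictor Table (CPT): counts committed loads per PC
    and how many of them stalled at the head of the ROB."""

    def __init__(self, threshold: int = 4, max_entries: int = 256):
        self.threshold = threshold
        self.max_entries = max_entries            # assumed capacity, not from the text
        self.table = {}                           # pc -> [numLoadCount, robBlockCount]

    def update(self, pc: int, rob_stalled: bool) -> None:
        """Called when a load commits at the head of the ROB."""
        if pc not in self.table:
            if len(self.table) >= self.max_entries:
                self.table.pop(next(iter(self.table)))   # simple oldest-entry eviction (assumption)
            self.table[pc] = [0, 0]
        entry = self.table[pc]
        entry[0] += 1                             # numLoadCount++
        if rob_stalled:
            entry[1] += 1                         # robBlockCount++ if the ROB was stalled

    def is_critical(self, pc: int) -> bool:
        """Lookup: the block touched by this load is critical once its
        robBlockCount reaches the threshold."""
        entry = self.table.get(pc)
        return entry is not None and entry[1] >= self.threshold
```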

7.4.2 Non-Uniform Compression

Our goal is to achieve a compressed LLC where critical data blocks are accessed with a latency comparable to that of uncompressed cache blocks, while the achieved LLC capacity is comparable to that of strong compression schemes. In Section 7.4.1, we defined two types of critical data blocks: (i) local cache blocks in the NUCA, and (ii) cache blocks which cause long ROB stalls. The question is how critical and non-critical data blocks should be managed to achieve large cache capacities without suffering from the imposed decompression latency. One could naively propose to compress only the non-critical cache blocks. Although this guarantees fast access to critical blocks, it forgoes opportunities to achieve higher cache capacity. Additionally, based on our experimental observations, even critical cache blocks can tolerate a few cycles of extra delay (i.e., 1-2 cycles) without noticeably affecting the system performance. Considering this observation, we propose a non-uniform architecture which exploits well-known compression schemes in a criticality-aware fashion. To this end, we adopt a fast compression scheme (i.e., low-compression-ratio, low-decompression-latency) for critical cache blocks in order to favor lower latency over higher capacity. On the other hand, for non-critical blocks we use a strong compression scheme (i.e., high-compression-ratio, high-decompression-latency), because achieving larger capacity has higher priority for this category.

1Note that, in this work, we do not need additional per-PC information, viz., LastStallTime, MaxStallTime, and TotalStallTime, because we do not have to rank the loads in terms of criticality.

Even though our hybrid architecture is compatible with any compression scheme, in this work we adopt the BDI [8] and FPC [1] schemes, which are easy to implement in hardware. Since BDI offers a 1-cycle decompression latency, we use it for critical data blocks. FPC, on the other hand, offers a 5-cycle decompression latency and can be used for non-critical data blocks. However, in Section 7.6 we will demonstrate that a hybrid BDI-FPC scheme covers a wider range of data-types/data-patterns, which consequently provides a higher compression ratio. Therefore, for non-critical blocks, instead of just using the FPC scheme, we pick the better of FPC and BDI in order to achieve larger cache capacity, while for critical blocks we always use BDI. Similar hybrid approaches have been exploited in other works for different purposes [77,148].

Recognizing local cache blocks is straightforward, as we can determine the target LLC bank from the address of the cache request. Therefore, on a write operation, we can determine which compression scheme should be used based on the placement of the target cache block in the NUCA. For the second category of critical cache blocks (i.e., those causing ROB stalls), the first time a block is brought into the cache we consider it non-critical. At some point during the program execution, the criticality predictor labels that block as critical, and we change its compression scheme upon the next write operation. A similar adjustment is made if a critical cache block is labeled as non-critical during the course of execution. In this work, similar to previous studies, the tag array is doubled to be able to address twice as many cache blocks as the baseline. Besides, we keep one bit per tag entry to distinguish between BDI- and FPC-compressed blocks.
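The write-path policy can be summarized by the small decision routine below: critical blocks (local in the NUCA, or flagged by the criticality predictor) are always stored with the fast scheme, while non-critical blocks keep whichever of the two formats packs tighter. The compressed sizes are passed in as inputs; this is only a sketch of the selection logic described above, not of the compression engines themselves.

```python
def choose_format(is_local_bank: bool, predicted_critical: bool,
                  bdi_size: int, fpc_size: int):
    """Pick the storage format for a block being written into the compressed LLC.
    bdi_size / fpc_size are the compressed sizes (in bytes) reported by the two engines."""
    if is_local_bank or predicted_critical:
        # Critical blocks must stay cheap to decompress: always BDI (1-cycle decompression).
        return "BDI", bdi_size
    # Non-critical blocks favor capacity: keep whichever scheme packs tighter.
    return ("FPC", fpc_size) if fpc_size < bdi_size else ("BDI", bdi_size)

# A remote, non-critical block that FPC packs into 20 bytes vs. 32 bytes for BDI:
print(choose_format(is_local_bank=False, predicted_critical=False, bdi_size=32, fpc_size=20))
# -> ('FPC', 20)
# The same block in the requester's local bank is treated as critical:
print(choose_format(is_local_bank=True, predicted_critical=False, bdi_size=32, fpc_size=20))
# -> ('BDI', 32)
```

The single per-tag format bit mentioned above is then enough to steer a later read to the right decompressor.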

7.4.3 Relaxing the Decompression Latency

Data traversal in the interconnection network can be exploited as an opportunity to partially hide the decompression latency for remote data accesses. Some prior works (e.g., [149, 150]) exploit compression at the packet level to optimize the performance and power consumption of the NoC. Such techniques compress data at the network interface controller, prior to injection into the network. Unlike those techniques, we do not use compression as a knob to improve the NoC performance. Instead, we use data traversal as an opportunity to partially remove the decompression latency from the critical path.

FPC Pipelined Decompression: The decompression process in FPC takes 5 cycles and is performed in a pipelined fashion: (1) the length of each code is calculated using the prefix tags; (2-3) the starting bit address of each word is computed; (4-5) a parallel decoder produces the uncompressed data from the compressed format using the available information. In a packet-based wormhole-switched network, the decompression pipeline can be designed such that decompression starts as soon as the header flit of the compressed cache block arrives. This flit contains all the necessary information for the first three stages of the decompression pipeline. When the last flit arrives, we can complete the remaining two stages and deliver the data to the processor. This technique effectively reduces the decompression latency of FPC from 5 cycles to 2 cycles (for cache blocks that are at least two hops away) as the cache block traverses the network. Note that, even though our proposed architecture is compatible with any compression scheme, this specific optimization is only applicable to compression schemes with a pipelined decompression process.
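Under simplified timing assumptions (a 2-cycle per-hop latency as in Table 7.2, and one decompression stage completed per cycle), the exposed portion of the FPC latency can be estimated as below; the model is only a sketch of the overlap argument, not of the actual router datapath.

```python
def exposed_decompression_cycles(hops: int, hop_cycles: int = 2,
                                 early_stages: int = 3, late_stages: int = 2) -> int:
    """Simplified timing model: the three header-only FPC stages are spread over the
    routers along the path (hops * hop_cycles cycles are available to hide them),
    while the last two stages need the whole block and remain exposed at the end."""
    hidden = min(early_stages, hops * hop_cycles)     # early work overlapped with traversal
    return (early_stages - hidden) + late_stages      # cycles the requester still waits

for h in (1, 2, 3):
    print(f"{h} hop(s): {exposed_decompression_cycles(h)} exposed decompression cycles")
# 1 hop -> 3 cycles; 2 or more hops -> 2 cycles, versus the full 5-cycle FPC latency.
```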

7.5 Methodology

Infrastructure. We evaluate our proposed mechanism using the GEM5 simulator [74]. The target system is a 16-core processor with a cache hierarchy consisting of private 32KB L1 caches and a shared 4MB L2. The detailed configuration is given in Table 7.2. Workloads. For multi-program workloads, we use SPEC CPU2006 [2]. We fast-forward each workload for 2 billion instructions, warm up the caches by running 200 million instructions, and then simulate the next 200 million instructions.

Table 7.2: Main characteristics of the simulated system.

Processor
  CMP Config.          ALPHA ISA, 16 OoO cores, 2.5GHz, 4×4 mesh, 128 ROB entries
Memory Hierarchy
  L1 Caches            32KB, 4-way, private, 1-cycle, MSHR: (4I, 32D)
  L2 Cache             4MB, 64B lines, 8-way, SNUCA, 10-cycle, MSHR: 32
  Coherency Protocol   Snooping MESI: 4×4 grid packet-switched NoC; XY routing; 2-cycle per-hop latency
  DRAM Memory Config.  16GB, 4 channels, 1 DIMM/channel, 2 ranks/DIMM, 8 devices/rank, FR-FCFS, 667 MHz bus, 8-byte data bus, DDR3 1333 MHz, tRP-tRCD-CL: 15-15-15 ns, 8 DRAM banks, RB hit: 36ns, RB miss: 66ns

143 Table 7.3: Characteristics of the evaluated workloads for last-level cache.

Workload                                   MPKI   BDI    FPC   Hybrid_Comp
MP1: namd, xalancbmk, zeusmp, gcc          2.3    5.2    2.2   7.1
MP2: gcc, GemsFDTD, gromacs, h264ref       0.7    1.5    1.9   2.2
MP3: soplex, sphinx3, tonto, xalancbmk     0.5    1.4    1.7   1.8
MP4: namd, tonto, cactusADM, dealII        1.5    2.0    1.7   2.6
MP5: xalancbmk, gcc, namd, omnetpp         1.0    1.3    1.7   1.8
MP6: xalancbmk, cactusADM, dealII, gcc     2.7    1.8    1.7   2.4
MP7: omnetpp, omnetpp, gcc, gcc            1.6    1.3    1.7   1.9
MP8: bzip, gcc, xalancbmk, zeusmp          8.9    2.5    1.7   2.8
MP9: xalancbmk, xalancbmk, gcc, gcc        2.65   1.3    1.7   1.8
MP10: deal, soplex, namd, bzip2            4.15   1.6    1.5   2.0
MP11: sphinx3, tonto, zeusmp, gcc          10.5   12.8   2.2   16
MP12: bzip, zeusmp, xalancbmk, dealII      10.8   2.4    1.7   2.9
MP13: deal, bzip2, zeusmp, xalancbmk       11.0   2.3    1.8   3.0
MP14: bzip2, cactusADM, dealII, gcc        8.85   1.8    1.6   2.2

7.6 Evaluation

7.6.1 Compression Ratio

Figure 7.5 demonstrates the compression ratio of different compression schemes over the SPEC CPU2006 [2] applications. As discussed in Section 7.2, BDI uses narrow data values to compress a cache block, whereas FPC exploits frequent data patterns. Employing these two schemes in a unified architecture covers a wider range of data types, which considerably improves the average compression ratio. Our proposed architecture uses these two schemes in a hybrid, criticality-aware fashion and achieves a compression ratio comparable to that of SC2 [9]. Note that SC2 uses Huffman-based statistical compression and has a decompression latency of 8/14 cycles (depending on the position of the critical word in the cache block), while our proposed architecture achieves a 1-5 cycle decompression latency (depending on the type of compression scheme and the position of the data in the NUCA). Therefore, our proposed architecture outperforms SC2 in terms of both the compression ratio, owing to the data-type coverage achieved by using multiple compression schemes, and the decompression latency. Note that, in the compressed cache, the tag array is doubled to be able to address twice as many cache lines as the baseline. Therefore, the maximum compression ratio reported in Figure 7.5 is limited to 2. SC2 can outperform our approach (in terms of compression ratio) for some applications if we use larger tag caches. Also, Arelakis et al. [148] propose a hybrid compression mechanism; however, they focus on the content of each cache block (i.e., run-time data-type prediction) rather than its criticality, which makes their work orthogonal to ours.


Figure 7.5: Comparison of compression ratio for different schemes: BDI [8], FPC [1], SC2 [9], and Hybrid-Comp.


Figure 7.6: Effect of data compression on the number of misses per kilo instruction at the last-level cache.

7.6.2 Misses-Per-Kilo-Instructions (MPKI)

MPKI can be used as a metric to analyze the caching efficiency in cache-sensitive applications. As can be seen in Figure 7.6, MPKI is reduced in compressed caches, thanks to the extra cache capacity achieved by compression. Table 7.3 reports the compression ratios of BDI, FPC, and our hybrid architecture for the studied workloads. Comparing the reported compression ratios in Table 7.3 with MPKI in Figure 7.6 demonstrates the direct impact of compression ratio on MPKI. Our proposed architecture reduces MPKI by 25% on average compared to the baseline configuration while it outperforms BDI and FPC schemes by about 10% on average.


Figure 7.7: Average LLC data access latency normalized to the baseline 4MB LLC.

7.6.3 Average Data Access Latency

Figure 7.7 reports the average data access latency in the LLC, normalized to the baseline cache with no compression. Since we have doubled the size of the tag array in the compressed cache, it can at most pack twice as many cache lines as the baseline cache. Therefore, as shown in Figure 7.7, the average data access latency of a double-size LLC (i.e., 8MB) indicates the maximum improvement that can be achieved by compression. In a compressed cache, the decompression latency determines the latency of hit accesses, while the compression ratio affects the hit-rate. Comparing the average data access latency of FPC and BDI demonstrates the importance of the decompression latency. Even though FPC provides a better compression ratio than BDI for about half of the applications, for all the workloads reported in Figure 7.7 FPC results in longer access latencies (in some cases even worse than the baseline) because of its long decompression process. Our proposed architecture, on the other hand, outperforms FPC and BDI in terms of compression ratio and has a decompression latency comparable to that of BDI. Therefore, as can be seen in Figure 7.7, our proposed architecture outperforms both BDI and FPC and achieves an average access latency close to that of a double-size LLC.

7.6.4 Performance

Figure 7.8 demonstrates the impact of the different compression techniques on the system performance in terms of weighted speedup. Latency-sensitive workloads (i.e., workloads with low MPKI in Figure 7.6) do not gain from the extra last-level cache capacity achieved by compression; however, the imposed decompression latency can degrade their performance. For instance, comparing Figures 7.6 and 7.8 illustrates that workloads with low MPKI (e.g., MP1 to MP5) incur performance loss under the FPC cache compression scheme. Our proposed mechanism, however, does not experience any performance loss for such latency-sensitive workloads, as it fundamentally prioritizes latency over capacity. On the other hand, capacity-sensitive workloads (i.e., workloads with high MPKI in Figure 7.6) considerably gain from the extra last-level cache capacity achieved by data compression (e.g., MP10 to MP14 in Figure 7.8). Overall, our proposed mechanism outperforms both FPC and BDI, thanks to its lower average data access latency, and achieves a performance improvement (5.3% on average) comparable to that of an 8MB uncompressed LLC (6% on average).

Figure 7.8: Comparison of weighted speedup: BDI-4MB [8], FPC-4MB [1], Hybrid-Comp-4MB, and uncompressed 8MB cache.

Note that simply adopting a hybrid FPC-BDI scheme yields an average improvement of only 1.4%. By distinguishing critical and non-critical data blocks, we achieve an improvement of 4.1% on average. Finally, by using the decompression relaxing technique, we reach an average improvement of 5.3%.

7.7 Conclusion

In this study, we propose data criticality as a parameter that should be considered in designing compressed caches, along with compression ratio and decompression latency. Based on our studies on a 16-core processor with a 4MB last-level cache, the proposed criticality-aware architecture improves the system performance to a level comparable to that of an 8MB LLC.

Chapter 8

Conclusions and Future Work

8.1 Conclusions

Although SRAM and DRAM have been widely used in the memory hierarchy of chip multi-processors during the past few decades, they both face serious scalability and power consumption problems. Although there have been many circuit- and architecture-level mechanisms to mitigate these issues, researchers have also been considering non-volatile memory technologies to replace SRAM and DRAM in future processors. Among the different non-volatile memory technologies, STT-RAM and PCM are the most promising candidates to replace SRAM and DRAM, respectively. However, each of these technologies has some drawbacks which should be resolved in order to meet all the system-level requirements.

PCM has been considered the most promising technology to replace DRAM. However, PCM has some reliability problems that should be addressed before it can be adopted in commercial products. In Chapter 3 of this dissertation, we modeled and studied wear-out faults in a PCM-based main memory. Accordingly, we proposed a compression-based mechanism to further extend the memory lifetime. By using a platform that combines different solutions (i.e., a wear-leveling scheme, an error-correction mechanism, and a data compression algorithm), our proposed mechanism tolerates 2.9× more cell failures per memory line and achieves a 4.3× increase in lifetime, compared to the state-of-the-art PCM-based main memory.

In Chapter 4 of this dissertation, we specifically addressed the write disturbance problem in PCM devices. We proposed multiple architecture-level techniques to mitigate write disturbance along both the bit-lines and the word-lines in a performance- and capacity-efficient manner. To resolve the write disturbance problem along the word-lines, we proposed a chip-level mechanism that determines which write strategy should be used for each individual chip, based on the number of vulnerable cells in that chip. To mitigate the write disturbance problem along the bit-lines, we exploited a compression-based, non-overlapping data layout between adjacent memory lines to reduce the probability of thermal disturbance among vertically adjacent cells. Further, we integrated BCH code within compressed read-intensive memory addresses to protect them against write-intensive memory addresses. Our proposed schemes guarantee reliable write operations in a PCM-based main memory and improve the system performance by 17% on average, compared to the state-of-the-art system.

In Chapter 5 of this dissertation, we focused on the design space exploration of MLC STT-RAM caches and related optimizations when used as the last-level cache in CMPs. Our proposed architecture uses a striping data layout within each cache block in order to reduce the read and write access latencies. We further proposed a dynamic set-associativity adjustment mechanism to tune the associativity of each cache set based on its run-time demand. Our experimental evaluations show an average improvement of 43% in the total number of conflict misses, 27% in memory access latency, 12% in system performance (i.e., IPC), and 26% in L3 access energy, with a slight degradation in cache lifetime (about 7%) compared to an SLC cache.

In Chapters 6 and 7 of this dissertation, we studied the SRAM-based last-level caches which are widely used in current processors. While SRAM-based LLCs offer fast accesses, they are incapable of achieving high performance for applications with large data working-sets, due to frequent cache conflicts. In Chapter 6, we proposed a selective-caching policy to manage the cache-thrashing issue in highly parallel architectures, and achieved an average performance improvement of 27%. In Chapter 7, on the other hand, we introduced a new parameter (i.e., data criticality) in designing a compressed LLC, in order to balance the extra cache space provided by compression against the latency imposed by data decompression.

8.2 Future Research Directions

While in this dissertation we studied the most important reliability concerns regarding the PCM technology (i.e., limited lifetime and the write disturbance problem), there are other interesting design and research questions that should be investigated. Since different computing systems have widely different system-level requirements, a PCM-based main memory should meet each of those requirements. Such variety in system-level requirements brings up some exciting future research directions in this area: 1) lifetime-aware performance improvement in latency-sensitive applications, and 2) secure PCM-based main memory.

8.2.1 Lifetime-aware Performance Improvement

The write operation in PCM has been the major bottleneck in terms of performance, power consumption, and lifetime. Since different studies address each of these issues individually, the proposed schemes can improve one metric while negatively affecting the others. Achieving an architecture that addresses all of these concerns in a unified fashion is an essential step toward adopting PCM in commercial products.

8.2.2 Secure PCM-based Main Memory

As we begin to utilize non-volatile memory technologies as main memory elements, new challenges emerge. While non-volatility is a desirable feature for saving energy, it creates security concerns because data persists in the memory after the system is turned off. More precisely, the stored data can be retrieved by whoever has access to the memory. Although the data can be secured using encryption, encryption results in excessive bit writes, which drastically reduce the memory lifetime and also increase energy consumption. A typical encryption scheme causes about a 3x-5x increase in the number of bits written to the memory. Therefore, encryption schemes should be modified to manage the extra cell writes in order to achieve a secure memory system without incurring power and lifetime overheads.

Bibliography

[1] Alameldeen, A. R. et al. (2004) “Adaptive Cache Compression for High-Performance Processors,” SIGARCH Comput. Archit. News. URL http://doi.acm.org/10.1145/1028176.1006719

[2] Spradling, C. D. (2007) “SPEC CPU2006 benchmark tools,” Computer Architecture News, 35(1), pp. 130–134.

[3] Qureshi, M., D. Thompson, and Y. Patt (2005) “The V-Way Cache: Demand-Based Associativity via Global Replacement,” in ISCA, pp. 544–555.

[4] Basu, A., N. Kirman, M. Kirman, M. Chaudhuri, and J. Martinez (2007) “Scavenger: A New Last Level Cache Architecture with Global Block Priority,” in MICRO, pp. 421–432.

[5] Rolan, D., B. Fraguela, and R. Doallo (2009) “Adaptive Line Placement with the Set Balancing Cache,” in MICRO, pp. 529–540.

[6] NVIDIA, “NVIDIA’s Next Generation CUDA Compute Architecture: Fermi.” URL http://www.nvidia.com/fermi

[7] Chen, X., L.-W. Chang, C. I. Rodrigues, J. Lv, Z. Wang, and W.-M. Hwu (2014) “Adaptive Cache Management for Energy-Efficient GPU Computing,” in Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-47, IEEE Computer Society, Washington, DC, USA, pp. 343–355. URL http://dx.doi.org/10.1109/MICRO.2014.11

[8] Pekhimenko, G. et al. “Base-delta-immediate Compression: Practical Data Compression for On-chip Caches,” PACT ’12. URL http://doi.acm.org/10.1145/2370816.2370870

[9] Arelakis, A. and P. Stenstrom “SC2: A Statistical Compression Cache Scheme,” ISCA ’14. URL http://dl.acm.org/citation.cfm?id=2665671.2665696

151 [10] Wu, X., J. Li, L. Zhang, E. Speight, R. Rajamony, and Y. Xie (2009) “Hybrid Cache Architecture with Disparate Memory Technologies,” in Proceedings of the 36th Annual International Symposium on Computer Architecture, ISCA ’09, ACM, New York, NY, USA, pp. 34–45. URL http://doi.acm.org/10.1145/1555754.1555761

[11] Ishigaki, T., T. Kawahara, R. Takemura, K. Ono, K. Ito, H. Matsuoka, and H. Ohno (2010) “A Multi-Level-Cell Spin-Transfer Torque Memory with Series-Stacked Magnetotunnel Junctions,” in VLSIT, pp. 47–48.

[12] Mishra, A. K., X. Dong, G. Sun, Y. Xie, N. Vijaykrishnan, and C. R. Das (2011) “Architecting On-Chip Interconnects for Stacked 3D STT-RAM Caches in CMPs,” in ISCA, pp. 69–80.

[13] Lee, B. C. et al. (2009) “Architecting Phase Change Memory As a Scalable Dram Alternative,” in ISCA, pp. 2–13.

[14] Zhou, P., B. Zhao, J. Yang, and Y. Zhang (2009) “A Durable and Energy Efficient Main Memory Using Phase Change Memory Technology,” SIGARCH Comput. Archit. News, 37(3), pp. 14–23. URL http://doi.acm.org/10.1145/1555815.1555759

[15] Choi, Y., I. Song, M. H. Park, H. Chung, S. Chang, B. Cho, J. Kim, Y. Oh, D. Kwon, J. Sunwoo, J. Shin, Y. Rho, C. Lee, M. G. Kang, J. Lee, Y. Kwon, S. Kim, J. Kim, Y. J. Lee, Q. Wang, S. Cha, S. Ahn, H. Horii, J. Lee, K. Kim, H. Joo, K. Lee, Y. T. Lee, J. Yoo, and G. Jeong (2012) “A 20nm 1.8V 8Gb PRAM with 40MB/s program bandwidth,” in 2012 IEEE International Solid-State Circuits Conference, pp. 46–48.

[16] Qureshi, M. K. et al. (2009) “Enhancing Lifetime and Security of PCM- based Main Memory with Start-gap Wear Leveling,” in MICRO, pp. 14–23.

[17] Schechter, S. et al. (2010) “Use ECP, Not ECC, for Hard Failures in Resistive Memories,” in ISCA, pp. 141–152.

[18] Seong, N. H. et al. (2010) “SAFER: Stuck-At-Fault Error Recovery for Memories,” in MICRO, pp. 115–124.

[19] Yoon, D. H. et al. (2011) “FREE-p: Protecting non-volatile memory against both hard and soft errors,” in HPCA, pp. 466–477.

[20] Fan, J. et al. (2013) “Aegis: Partitioning Data Block for Efficient Recovery of Stuck-at-faults in Phase Change Memory,” in MICRO, pp. 433–444.

152 [21] Qureshi, M. K. et al. (2009) “Scalable High Performance Main Memory System Using Phase-change Memory Technology,” in ISCA, pp. 24–33.

[22] Jiang, L., Y. Zhang, and J. Yang (2014) “Mitigating Write Disturbance in Super-Dense Phase Change Memories,” in 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, pp. 216–227.

[23] Ahn, S. J., Y. Song, H. Jeong, B. Kim, Y.-S. Kang, D.-H. Ahn, Y. Kwon, S. W. Nam, G. Jeong, H. Kang, and C. Chung (2011) “Reliability perspectives for high density PRAM manufacturing,” in 2011 International Electron Devices Meeting, pp. 12.6.1–12.6.4.

[24] Cho, W., K. Lee, and H. Kim (2012), “Phase change memory devices and systems, and related programming methods,” US Patent 8,116,127. URL https://www.google.tl/patents/US8116127

[25] Wang, R., L. Jiang, Y. Zhang, and J. Yang (2015) “SD-PCM: Constructing Reliable Super Dense Phase Change Memory Under Write Disturbance,” pp. 19–31. URL http://doi.acm.org/10.1145/2694344.2694352

[26] Wu, X., J. Li, L. Zhang, E. Speight, R. Rajamony, and Y. Xie (2009) “Hybrid Cache Architecture with Disparate Memory Technologies,” in ISCA, pp. 34–45.

[27] Dong, X., X. Wu, G. Sun, Y. Xie, H. Li, and Y. Chen (2008) “Circuit and Microarchitecture Evaluation of 3D Stacking Magnetic RAM (MRAM) as a Universal Memory Replacement,” in DAC, pp. 554–559.

[28] Chen, Y.-T., J. Cong, H. Huang, C. Liu, R. Prabhakar, and G. Reinman (2012) “Static and Dynamic Co-Optimizations for Blocks Mapping in Hybrid Caches,” in ISLPED, pp. 237–242.

[29] Sun, G., X. Dong, Y. Xie, J. Li, and Y. Chen (2009) “A novel architecture of the 3D stacked MRAM L2 cache for CMPs,” in HPCA.

[30] Sun, Z., X. Bi, H. H. Li, W.-F. Wong, Z.-L. Ong, X. Zhu, and W. Wu (2011) “Multi Retention Level STT-RAM Cache Designs with a Dynamic Refresh Scheme,” in MICRO, pp. 329–338.

[31] Wang, J., X. Dong, Y. Xie, and N. P. Jouppi (2013) “i2WAP: Improving non-volatile cache lifetime by reducing inter- and intra-set write variations,” in HPCA, pp. 234–245.

153 [32] Jokar, M. R., M. Arjomand, and H. Sarbazi-Azad (2016) “Sequoia: A High-Endurance NVM-Based Cache Architecture,” IEEE TVLSI, 24(3), pp. 954–967.

[33] Rogers, T. G., M. O’Connor, and T. M. Aamodt (2012) “Cache-Conscious Wavefront Scheduling,” in Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-45, IEEE Computer Society, Washington, DC, USA, pp. 72–83. URL http://dx.doi.org/10.1109/MICRO.2012.16

[34] Kayiran, O., A. Jog, M. T. Kandemir, and C. R. Das (2013) “Neither More nor Less: Optimizing Thread-level Parallelism for GPGPUs,” in Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques, PACT ’13, IEEE Press, Piscataway, NJ, USA, pp. 157–166. URL http://dl.acm.org/citation.cfm?id=2523721.2523745

[35] Chen, X. et al. (2010) “C-pack: A High-performance Microprocessor Cache Compression Algorithm,” IEEE Trans. VLSI. URL http://dx.doi.org/10.1109/TVLSI.2009.2020989

[36] Dusser, J. et al. (2009) “Zero-content Augmented Caches,” in ICS. URL http://doi.acm.org/10.1145/1542275.1542288

[37] Yang, J. et al. (2000) “Frequent Value Compression in Data Caches,” in MICRO. URL http://doi.acm.org/10.1145/360128.360154

[38] Jiang, L. et al. (2012) “FPB: Fine-grained Power Budgeting to Improve Write Throughput of Multi-level Cell Phase Change Memory,” in MICRO.

[39] Hay, A. et al. (2011) “Preventing PCM Banks from Seizing Too Much Power,” in MICRO, pp. 186–195.

[40] ITRS (2015) “More Moore,” . URL http://www.itrs2.net/

[41] Seong, N. H. et al. (2013) “Tri-level-cell Phase Change Memory: Toward an Efficient and Reliable Memory System,” in ISCA, pp. 440–451.

[42] Palangappa, P. M. and K. Mohanram (2016) “CompEx: Compression-expansion coding for energy, latency, and lifetime improvements in MLC/TLC NVM,” in HPCA, pp. 90–101.

[43] Loh, G. et al. (2013), “Memory architecture for read-modify-write operations,” US Patent App. 13/328,393. URL https://www.google.com/patents/US20130159812

[44] Kim, B., Y. Song, S. Ahn, Y. Kang, H. Jeong, D. Ahn, S. Nam, G. Jeong, and C. Chung (2011) “Current status and future prospect of Phase Change Memory,” in 2011 9th IEEE International Conference on ASIC, pp. 279–282.

[45] Tavana, M. K. and D. Kaeli (2017) “Cost-effective write disturbance mitigation techniques for advancing PCM density,” in 2017 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pp. 253–260.

[46] Wang, R., S. Mittal, Y. Zhang, and J. Yang (2017) “Decongest: Accelerating Super-Dense PCM Under Write Disturbance by Hot Page Remapping,” Computer Architecture Letters, 16(2), pp. 107–110. URL https://doi.org/10.1109/LCA.2017.2675883

[47] Awasthi, M., M. Shevgoor, K. Sudan, B. Rajendran, R. Balasubramonian, and V. Srinivasan (2012) “Efficient Scrub Mechanisms for Error-prone Emerging Memories,” in Proceedings of the 2012 IEEE 18th International Symposium on High-Performance Computer Architecture, HPCA ’12, IEEE Computer Society, Washington, DC, USA, pp. 1–12. URL http://dx.doi.org/10.1109/HPCA.2012.6168941

[48] Jalili, M., M. Arjomand, and H. S. Azad (2014) “A Reliable 3D MLC PCM Architecture with Resistance Drift Predictor,” in 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, pp. 204–215.

[49] Papandreou, N., H. Pozidis, T. Mittelholzer, G. F. Close, M. Breitwisch, C. Lam, and E. Eleftheriou (2011) “Drift-Tolerant Multilevel Phase-Change Memory,” in 2011 3rd IEEE International Memory Workshop (IMW), pp. 1–4.

[50] Chen, Y., X. Wang, W. Zhu, H. Li, Z. Sun, G. Sun, and Y. Xie (2010) “Access scheme of Multi-Level Cell Spin-Transfer Torque Random Access Memory and its optimization,” in MWSCAS, pp. 1109–1112.

[51] Zhang, Y., L. Zhang, W. Wen, G. Sun, and Y. Chen (2012) “Multi-Level Cell STT-RAM: Is It Realistic or Just a Dream?” in ICCAD, pp. 526–532.

[52] Jiang, L., B. Zhao, Y. Zhang, and J. Yang (2012) “Constructing Large and Fast Multi-Level Cell STT-MRAM Based Cache for Embedded Processors,” in DAC, pp. 907–912.

[53] Yoon, H., J. Meza, N. Muralimanohar, N. P. Jouppi, and O. Mutlu (2014) “Efficient Data Mapping and Buffering Techniques for Multilevel Cell Phase-Change Memories,” ACM TACO, 11(4), pp. 40:1–40:25.

[54] Jia, W., K. A. Shaw, and M. Martonosi (2012) “Characterizing and Improving the Use of Demand-fetched Caches in GPUs,” in Proceedings of the 26th ACM International Conference on Supercomputing, ICS ’12, ACM, New York, NY, USA, pp. 15–24. URL http://doi.acm.org/10.1145/2304576.2304582

[55] Xie, X., Y. Liang, G. Sun, and D. Chen (2013) “An Efficient Compiler Framework for Cache Bypassing on GPUs,” in Proceedings of the International Conference on Computer-Aided Design, ICCAD ’13, IEEE Press, Piscataway, NJ, USA, pp. 516–523. URL http://dl.acm.org/citation.cfm?id=2561828.2561929

[56] Wu, Y., R. Rakvic, L.-L. Chen, C.-C. Miao, G. Chrysos, and J. Fang (2002) “Compiler Managed Micro-cache Bypassing for High Performance EPIC Processors,” in Proceedings of the 35th Annual ACM/IEEE International Symposium on Microarchitecture, MICRO 35, IEEE Computer Society Press, Los Alamitos, CA, USA, pp. 134–145. URL http://dl.acm.org/citation.cfm?id=774861.774876

[57] Kharbutli, M. and Y. Solihin (2008) “Counter-Based Cache Replacement and Bypassing Algorithms,” IEEE Trans. Comput., 57(4), pp. 433–447. URL http://dx.doi.org/10.1109/TC.2007.70816

[58] Li, A., G. J. van den Braak, A. Kumar, and H. Corporaal (2015) “Adaptive and transparent cache bypassing for GPUs,” in SC15: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–12.

[59] Liu, H., M. Ferdman, J. Huh, and D. Burger (2008) “Cache Bursts: A New Approach for Eliminating Dead Blocks and Increasing Cache Efficiency,” in Proceedings of the 41st Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 41, IEEE Computer Society, Washington, DC, USA, pp. 222–233. URL http://dx.doi.org/10.1109/MICRO.2008.4771793

[60] Mu, S., Y. Deng, Y. Chen, H. Li, J. Pan, W. Zhang, and Z. Wang (2014) “Orchestrating Cache Management and Memory Scheduling for GPGPU Applications,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 22(8), pp. 1803–1814.

[61] Xiang, L., T. Chen, Q. Shi, and W. Hu (2009) “Less Reused Filter: Improving L2 Cache Performance via Filtering Less Reused Lines,” in Proceedings of the 23rd International Conference on Supercomputing, ICS ’09, ACM, New York, NY, USA, pp. 68–79. URL http://doi.acm.org/10.1145/1542275.1542290

[62] Gaur, J., M. Chaudhuri, and S. Subramoney (2011) “Bypass and Insertion Algorithms for Exclusive Last-level Caches,” vol. 39, ACM, New York, NY, USA, pp. 81–92. URL http://doi.acm.org/10.1145/2024723.2000075

[63] Gupta, S., H. Gao, and H. Zhou (2013) “Adaptive Cache Bypassing for Inclusive Last Level Caches,” in 2013 IEEE 27th International Symposium on Parallel and Distributed Processing, pp. 1243–1253.

[64] Mekkat, V., A. Holey, P. C. Yew, and A. Zhai (2013) “Managing shared last-level cache in a heterogeneous multicore processor,” in Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques, pp. 225–234.

[65] Hong, S. (2010) “Memory technology trend and future challenges,” in 2010 International Electron Devices Meeting, pp. 12.4.1–12.4.4.

[66] Schroeder, B., E. Pinheiro, and W.-D. Weber (2009) “DRAM Errors in the Wild: A Large-scale Field Study,” in Proceedings of the Eleventh International Joint Conference on Measurement and Modeling of Computer Systems, SIGMETRICS ’09, ACM, New York, NY, USA, pp. 193–204. URL http://doi.acm.org/10.1145/1555349.1555372

[67] Tech Insights, “Technology Roadmap of DRAM for Three Major Manufacturers: Samsung, SK-Hynix and Micron,” http://www.techinsights.com.

[68] Kim, D. W. and M. Erez (2016) “RelaxFault Memory Repair,” in ISCA, pp. 645–657.

[69] Kang, D. H. et al. (2008) “Two-bit cell operation in diode-switch phase change memory cells with 90nm technology,” in Symposium on VLSI Technology, pp. 98–99.

[70] “AMD Inc. BIOS and Kernel Developers Guide for AMD NPT Family 0Fh Processors.”

[71] Pirovano, A. et al. (2004) “Reliability study of phase-change nonvolatile memories,” IEEE TDMR, pp. 422–427.

[72] Kim, K. and S. J. Ahn (2005) “Reliability investigations for manufacturable high density PRAM,” in IRPS, pp. 157–162.

[73] Cho, S. and H. Lee (2009) “Flip-N-Write: A Simple Deterministic Technique to Improve PRAM Write Performance, Energy and Endurance,” in MICRO, pp. 347–357.

[74] Binkert, N., B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti, R. Sen, K. Sewell, M. Shoaib, N. Vaish, M. D. Hill, and D. A. Wood (2011) “The Gem5 Simulator,” SIGARCH Computer Architecture News, 39(2), pp. 1–7.

[75] Dong, X. et al. (2012) “NVSim: A Circuit-Level Performance, Energy, and Area Model for Emerging Nonvolatile Memory,” IEEE TCAD.

[76] Zhang, W. et al. (2009) “Characterizing and Mitigating the Impact of Process Variations on Phase Change Based Memory Systems,” in MICRO, pp. 2–13.

[77] Jadidi, A., M. Arjomand, M. K. Tavana, D. R. Kaeli, M. T. Kandemir, and C. R. Das (2017) “Exploring the Potential for Collaborative Data Compression and Hard-Error Tolerance in PCM Memories,” in 2017 47th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pp. 85–96.

[78] Dhiman, G., R. Ayoub, and T. Rosing (2009) “PDRAM: A hybrid PRAM and DRAM main memory system,” in 2009 46th ACM/IEEE Design Automation Conference, pp. 664–669.

[79] Lee, S. H., M. S. Kim, G. S. Do, S. G. Kim, H. J. Lee, J. S. Sim, N. G. Park, S. B. Hong, Y. H. Jeon, K. S. Choi, H. C. Park, T. H. Kim, J. U. Lee, H. W. Kim, M. R. Choi, S. Y. Lee, Y. S. Kim, H. J. Kang, J. H. Kim, H. J. Kim, Y. S. Son, B. H. Lee, J. H. Choi, S. C. Kim, J. H. Lee, S. J. Hong, and S. W. Park (2010) “Programming disturbance and cell scaling in phase change memory: For up to 16nm based 4F2 cell,” in 2010 Symposium on VLSI Technology, pp. 199–200.

[80] Udipi, A. N., N. Muralimanohar, N. Chatterjee, R. Balasubramonian, A. Davis, and N. P. Jouppi (2010) “Rethinking DRAM Design and Organization for Energy-constrained Multi-cores,” in Proceedings of the 37th Annual International Symposium on Computer Architecture, ISCA ’10, ACM, New York, NY, USA, pp. 175–186. URL http://doi.acm.org/10.1145/1815961.1815983

[81] Zheng, H., J. Lin, Z. Zhang, E. Gorbatov, H. David, and Z. Zhu (2008) “Mini-rank: Adaptive DRAM architecture for improving memory power efficiency,” in 2008 41st IEEE/ACM International Symposium on Microarchitecture, pp. 210–221.

[82] Arjomand, M., M. T. Kandemir, A. Sivasubramaniam, and C. R. Das (2016) “Boosting Access Parallelism to PCM-Based Main Memory,” in 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), pp. 695–706.

[83] Cintra, M. and N. Linkewitsch (2013) “Characterizing the Impact of Process Variation on Write Endurance Enhancing Techniques for Non-volatile Memory Systems,” in Proceedings of the ACM SIGMETRICS/International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS ’13, ACM, New York, NY, USA, pp. 217–228. URL http://doi.acm.org/10.1145/2465529.2465755

[84] Qureshi, M. K. (2011) “Pay-As-You-Go: Low-overhead Hard-error Correction for Phase Change Memories,” in MICRO, pp. 318–328.

[85] Russo, U., D. Ielmini, A. Redaelli, and A. L. Lacaita (2008) “Modeling of Programming and Read Performance in Phase-Change Memories Part I: Cell Optimization and Scaling,” IEEE Transactions on Electron Devices, 55(2), pp. 506–514.

[86] Russo, U., D. Ielmini, A. Redaelli, and A. L. Lacaita (2008) “Modeling of Programming and Read Performance in Phase-Change Memories Part II: Program Disturb and Mixed-Scaling Approach,” IEEE Transactions on Electron Devices, 55(2), pp. 515–522.

[87] Wilkerson, C., A. R. Alameldeen, Z. Chishti, W. Wu, D. Somasekhar, and S.-l. Lu (2010) “Reducing Cache Power with Low-cost, Multi-bit Error-correcting Codes,” in Proceedings of the 37th Annual International Symposium on Computer Architecture, ISCA ’10, ACM, New York, NY, USA, pp. 83–93. URL http://doi.acm.org/10.1145/1815961.1815973

[88] Kim, Y., R. Daly, J. Kim, C. Fallin, J. H. Lee, D. Lee, C. Wilkerson, K. Lai, and O. Mutlu (2014) “Flipping bits in memory without accessing them: An experimental study of DRAM disturbance errors,” in 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA), pp. 361–372.

[89] Driskill-Smith, A. (2010) “Latest Advances and Future Prospects of STT-RAM,” in NVM Workshop.

[90] Qureshi, M. K., M. M. Franceschini, L. A. Lastras-Montaño, and J. P. Karidis (2010) “Morphable Memory System: A Robust Architecture for Exploiting Multi-level Phase Change Memories,” in ISCA, pp. 153–162.

[91] Wang, M., S. Peng, Y. Zhang, Y. Zhang, Y. Zhang, Q. Zhang, D. Ravelosona, and W. Zhao (2008) “Demonstration of multilevel cell spin transfer switching in MgO magnetic tunnel junctions,” Applied Physics Letters, 93(24), p. 242502.

[92] Pawlowski, J. T. (2011) “Hybrid memory cube (HMC),” in IEEE Hot Chips Symposium, pp. 1–24.

[93] Black, B., M. Annavaram, N. Brekelbaum, J. DeVale, L. Jiang, G. H. Loh, D. McCaule, P. Morrow, D. W. Nelson, D. Pantuso, P. Reed, J. Rupley, S. Shankar, J. Shen, and C. Webb (2006) “Die Stacking (3D) Microarchitecture,” in MICRO, pp. 469–479.

[94] Muralimanohar, N. and R. Balasubramonian (2007) “Interconnect design considerations for large NUCA caches,” in ISCA, pp. 369–380.

[95] Kim, C., D. Burger, and S. W. Keckler (2002) “An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches,” in ASPLOS, pp. 211–222.

[96] Bienia, C. and K. Li (2009) “PARSEC 2.0: A New Benchmark Suite for Chip-Multiprocessors,” in MoBS.

[97] Li, S., J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi (2009) “McPAT: an integrated power, area, and timing modeling framework for multicore and manycore architectures,” in MICRO, pp. 469–480.

[98] Muralimanohar, N. et al. (2007) “Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0,” in MICRO, IEEE Computer Society. URL http://dx.doi.org/10.1109/MICRO.2007.30

[99] Hallnor, E. G. and S. K. Reinhardt (2000) “A fully associative software-managed cache design,” in ISCA, pp. 107–116.

[100] Calder, B., D. Grunwald, and J. Emer (1996) “Predictive sequential associative cache,” in HPCA, pp. 244–253.

[101] Sun Microsystems (2007) ULTRASPARC T2 supplement to the ULTRASPARC architecture 2007, Tech. rep.

[102] Zhou, P., B. Zhao, J. Yang, and Y. Zhang (2009) “Energy reduction for STT-RAM using early write termination,” in ICCAD, pp. 264–268.

[103] Zhan, J., J. Ouyang, F. Ge, J. Zhao, and Y. Xie (2016) “Hybrid Drowsy SRAM and STT-RAM Buffer Designs for Dark-Silicon-Aware NoC,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 24(10), pp. 3041–3054.

[104] Ahn, J., S. Yoo, and K. Choi (2016) “Prediction Hybrid Cache: An Energy-Efficient STT-RAM Cache Architecture,” IEEE Transactions on Computers, 65(3), pp. 940–951.

[105] Jadidi, A., M. Arjomand, and H. Sarbazi-Azad (2011) “High-endurance and Performance-efficient Design of Hybrid Cache Architectures Through Adaptive Line Replacement,” in Proceedings of the 17th IEEE/ACM International Symposium on Low-power Electronics and Design, ISLPED ’11, IEEE Press, Piscataway, NJ, USA, pp. 79–84. URL http://dl.acm.org/citation.cfm?id=2016802.2016827

[106] Wu, X., J. Li, L. Zhang, E. Speight, and Y. Xie (2009) “Power and performance of read-write aware hybrid caches with non-volatile memories,” in DATE, pp. 737–742.

[107] Wang, Z., D. A. Jiménez, C. Xu, G. Sun, and Y. Xie (2014) “Adaptive placement and migration policy for an STT-RAM-based hybrid cache,” in 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA), pp. 13–24.

[108] Imani, M., S. Patil, and T. Rosing (2016) “Low power data-aware STT- RAM based hybrid cache architecture,” in 2016 17th International Symposium on Quality Electronic Design (ISQED), pp. 88–94.

[109] Arjomand, M., A. Jadidi, and H. Sarbazi-Azad (2012) “Relaxing Writes in Non-Volatile Processor Cache using Frequent Value Locality,” in DAC.

[110] Yazdanshenas, S., M. R. Pirbasti, M. Fazeli, and A. Patooghy (2014) “Coding Last Level STT-RAM Cache for High Endurance and Low Power,” IEEE Computer Architecture Letters, 13(2), pp. 73–76.

[111] Liu, L., P. Chi, S. Li, Y. Cheng, and Y. Xie (2017) “Building energy-efficient multi-level cell STT-RAM caches with data compression,” in 2017 22nd Asia and South Pacific Design Automation Conference (ASP-DAC), pp. 751–756.

[112] Smullen, C. W., V. Mohan, A. Nigam, S. Gurumurthi, and M. R. Stan (2011) “Relaxing non-volatility for fast and energy-efficient STT-RAM caches,” in HPCA, pp. 50–61.

[113] Jog, A., A. K. Mishra, C. Xu, Y. Xie, V. Narayanan, R. Iyer, and C. R. Das (2012) “Cache revive: architecting volatile STT-RAM caches for enhanced performance in CMPs,” in DAC, pp. 243–252.

[114] Zhao, H., H. Sun, Q. Yang, T. Min, and N. Zheng (2016) “Exploring the use of volatile STT-RAM for energy efficient video processing,” in 2016 17th International Symposium on Quality Electronic Design (ISQED), pp. 81–87.

[115] Liu, Z., W. Wen, L. Jiang, Y. Jin, and G. Quan (2017) “A statistical STT-RAM retention model for fast memory subsystem designs,” in 2017 22nd Asia and South Pacific Design Automation Conference (ASP-DAC), pp. 720–725.

[116] “Kepler: NVIDIA’s Next Generation CUDA Compute Architecture.” URL http://www.nvidia.com/kepler

[117] “NVIDIA GeForce series GTX280, 8800GTX, 8800GT.” URL http://www.nvidia.com/geforce

[118] Jia, W., K. A. Shaw, and M. Martonosi (2014) “MRPB: Memory request prioritization for massively parallel processors,” in 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA), pp. 272–283.

[119] Jog, A., O. Kayiran, A. K. Mishra, M. T. Kandemir, O. Mutlu, R. Iyer, and C. R. Das (2013) “Orchestrated Scheduling and Prefetching for GPGPUs,” in Proceedings of the 40th Annual International Symposium on Computer Architecture, ISCA ’13, ACM, New York, NY, USA, pp. 332–343. URL http://doi.acm.org/10.1145/2485922.2485951

[120] Sethia, A., G. Dasika, M. Samadi, and S. Mahlke (2013) “APOGEE: Adaptive prefetching on GPUs for energy efficiency,” in Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques, pp. 73–82.

[121] Jadidi, A., M. Arjomand, M. T. Kandemir, and C. R. Das (2017) “Optimizing Energy Consumption in GPUs Through Feedback-driven CTA Scheduling,” in Proceedings of the 25th High Performance Computing Symposium, HPC ’17, Society for Computer Simulation International, San Diego, CA, USA, pp. 12:1–12:12. URL http://dl.acm.org/citation.cfm?id=3108096.3108108

[122] Singh, I., A. Shriraman, W. W. L. Fung, M. O’Connor, and T. M. Aamodt (2014) “Cache Coherence for GPU Architectures,” IEEE Micro, 34(3), pp. 69–79.

[123] Burtscher, M., R. Nasre, and K. Pingali (2012) “A quantitative study of irregular programs on GPUs,” in 2012 IEEE International Symposium on Workload Characterization (IISWC), pp. 141–151.

[124] NVIDIA (2011) “CUDA C/C++ SDK Code Samples.” URL http://developer.nvidia.com/cuda-cc-sdk-code-samples

[125] He, B., W. Fang, Q. Luo, N. K. Govindaraju, and T. Wang (2008) “Mars: A MapReduce Framework on Graphics Processors,” in Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, PACT ’08, ACM, New York, NY, USA, pp. 260–269. URL http://doi.acm.org/10.1145/1454115.1454152

[126] Danalis, A., G. Marin, C. McCurdy, J. S. Meredith, P. C. Roth, K. Spafford, V. Tipparaju, and J. S. Vetter (2010) “The Scalable Heterogeneous Computing (SHOC) Benchmark Suite,” in Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, GPGPU-3, ACM, New York, NY, USA, pp. 63–74. URL http://doi.acm.org/10.1145/1735688.1735702

[127] Che, S., M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S. H. Lee, and K. Skadron (2009) “Rodinia: A benchmark suite for heterogeneous computing,” in 2009 IEEE International Symposium on Workload Characterization (IISWC), pp. 44–54.

[128] Qureshi, M. K. and Y. N. Patt (2006) “Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches,” in Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 39, IEEE Computer Society, Washington, DC, USA, pp. 423–432. URL http://dx.doi.org/10.1109/MICRO.2006.49

[129] Li, C., S. L. Song, H. Dai, A. Sidelnik, S. K. S. Hari, and H. Zhou (2015) “Locality-Driven Dynamic GPU Cache Bypassing,” in Proceedings of the 29th ACM on International Conference on Supercomputing, ICS ’15, ACM, New York, NY, USA, pp. 67–77. URL http://doi.acm.org/10.1145/2751205.2751237

[130] Bakhoda, A., G. L. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt (2009) “Analyzing CUDA workloads using a detailed GPU simulator,” in 2009 IEEE International Symposium on Performance Analysis of Systems and Software, pp. 163–174.

[131] Stratton, J. A. et al. (2012) Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing, Tech. rep.

[132] AMD Radeon R9 290X. URL http://www.amd.com/us/press-releases/Pages/amd-radeon-r9-290x-2013oct24.aspx

[133] NVIDIA GTX 780-Ti. URL http://www.nvidia.com/gtx-700-graphics-cards/gtx-780ti/

[134] Sethia, A. and S. Mahlke (2014) “Equalizer: Dynamic Tuning of GPU Resources for Efficient Execution,” in Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-47, IEEE Computer Society, Washington, DC, USA, pp. 647–658. URL http://dx.doi.org/10.1109/MICRO.2014.16

[135] Keramidas, G., P. Petoumenos, and S. Kaxiras (2007) “Cache replacement based on reuse-distance prediction,” in 2007 25th International Conference on Computer Design, pp. 245–250.

[136] Jaleel, A., K. B. Theobald, S. C. Steely, Jr., and J. Emer (2010) “High Performance Cache Replacement Using Re-reference Interval Prediction (RRIP),” in Proceedings of the 37th Annual International Symposium on Computer Architecture, ISCA ’10, ACM, New York, NY, USA, pp. 60–71. URL http://doi.acm.org/10.1145/1815961.1815971

[137] Duong, N., D. Zhao, T. Kim, R. Cammarota, M. Valero, and A. V. Veidenbaum (2012) “Improving Cache Management Policies Using Dynamic Reuse Distances,” in Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-45, IEEE Computer Society, Washington, DC, USA, pp. 389–400. URL https://doi.org/10.1109/MICRO.2012.43

[138] Tian, Y., S. Puthoor, J. L. Greathouse, B. M. Beckmann, and D. A. Jiménez (2015) “Adaptive GPU Cache Bypassing,” in Proceedings of the 8th Workshop on General Purpose Processing Using GPUs, GPGPU-8, ACM, New York, NY, USA, pp. 25–35. URL http://doi.acm.org/10.1145/2716282.2716283

[139] Xie, X., Y. Liang, Y. Wang, G. Sun, and T. Wang (2015) “Coordinated static and dynamic cache bypassing for GPUs,” in 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA), pp. 76–88.

[140] Qureshi, M. K., A. Jaleel, Y. N. Patt, S. C. Steely, and J. Emer (2007) “Adaptive Insertion Policies for High Performance Caching,” in Proceedings of the 34th Annual International Symposium on Computer Architecture, ISCA ’07, ACM, New York, NY, USA, pp. 381–391. URL http://doi.acm.org/10.1145/1250662.1250709

[141] Qureshi, M. K., A. Jaleel, Y. N. Patt, S. C. Steely, Jr., and J. Emer (2008) “Set-Dueling-Controlled Adaptive Insertion for High-Performance Caching,” IEEE Micro, 28(1), pp. 91–98.

[142] Sridharan, A. and A. Seznec (2016) “Discrete Cache Insertion Policies for Shared Last Level Cache Management on Large Multicores,” in 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 822–831.

[143] Kayiran, O., N. C. Nachiappan, A. Jog, R. Ausavarungnirun, M. T. Kandemir, G. H. Loh, O. Mutlu, and C. R. Das (2014) “Managing GPU Concurrency in Heterogeneous Architectures,” in Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-47, IEEE Computer Society, Washington, DC, USA, pp. 114–126. URL http://dx.doi.org/10.1109/MICRO.2014.62

[144] Arjomand, M. et al. (2011) “A morphable phase change memory architecture considering frequent zero values,” in ICCD, pp. 373–380.

[145] Arjomand, M., A. Jadidi, M. T. Kandemir, and C. R. Das (2017) “Leveraging value locality for efficient design of a hybrid cache in multicore processors,” in ICCAD, pp. 1–8.

[146] Ghose, S. et al. (2013) “Improving Memory Scheduling via Processor-side Load Criticality Information,” in ISCA. URL http://doi.acm.org/10.1145/2485922.2485930

[147] Kotra, J. B. et al. (2016) “Re-NUCA: A Practical NUCA Architecture for ReRAM Based Last-Level Caches,” in IPDPS.

[148] Arelakis, A. et al. (2015) “HyComp: A Hybrid Cache Compression Method for Selection of Data-type-specific Compression Methods,” in MICRO, ACM, pp. 38–49. URL http://doi.acm.org/10.1145/2830772.2830823

[149] Jin, Y. et al. “Adaptive Data Compression for High-performance Low-power On-chip Networks,” MICRO 41. URL http://dx.doi.org/10.1109/MICRO.2008.4771804

[150] Zhou, P. et al. “Frequent Value Compression in Packet-based NoC Architectures,” ASP-DAC ’09. URL http://dl.acm.org/citation.cfm?id=1509633.1509640

Vita

Amin Jadidi

Amin Jadidi is a Ph.D. candidate in the Department of Computer Science and Engineering at Pennsylvania State University. He joined Penn State in 2012 and worked under the supervision of Prof. Chita R. Das and Prof. Mahmut T. Kandemir. His research interests lie in the broad areas of computer architecture and systems, with an emphasis on the design of cache hierarchies and memory systems in chip multi-processors. Before joining Penn State, he completed his Master’s degree at Sharif University of Technology, Iran, in 2011, and his undergraduate studies at the University of Tehran, Iran, in 2009.