The Pennsylvania State University
The Graduate School
College of Engineering

ARCHITECTURAL TECHNIQUES TO ENABLE RELIABLE AND HIGH PERFORMANCE MEMORY HIERARCHY IN CHIP MULTI-PROCESSORS

A Dissertation in Computer Science and Engineering
by Amin Jadidi

© 2018 Amin Jadidi

Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

August 2018

The dissertation of Amin Jadidi was reviewed and approved∗ by the following:

Chita R. Das
Head of the Department of Computer Science and Engineering
Dissertation Advisor, Chair of Committee

Mahmut T. Kandemir
Professor of Computer Science and Engineering

John Sampson
Assistant Professor of Computer Science and Engineering

Prasenjit Mitra
Professor of Information Sciences and Technology

∗Signatures are on file in the Graduate School.

Abstract

Constant technology scaling has enabled modern computing systems to achieve high degrees of thread-level parallelism, making the design of a highly scalable and dense memory hierarchy a major challenge. Over the past few decades, SRAM has been the dominant technology for building on-chip cache hierarchies, while DRAM has been used for main memory to satisfy application demand. However, both technologies face serious scalability and power consumption problems. While there has been extensive research to address the drawbacks of these technologies, researchers have also been considering non-volatile memory technologies to replace SRAM and DRAM in future processors. Among the different non-volatile technologies, Spin-Transfer Torque RAM (STT-RAM) and Phase Change Memory (PCM) are the most promising candidates to replace SRAM and DRAM, respectively. Researchers believe that the memory hierarchy in future computing systems will consist of a hybrid combination of current technologies (i.e., SRAM and DRAM) and non-volatile technologies (e.g., STT-RAM and PCM).
While each of these technologies has its own unique features, each also has specific limitations. Therefore, to achieve a memory hierarchy that satisfies all system-level requirements, we need to study each of these memory technologies. In this dissertation, the author proposes several mechanisms to address some of the major issues with each of these technologies. To relieve the wear-out problem in a PCM-based main memory, a compression-based platform is proposed, where the compression scheme collaborates with wear-leveling and error correction schemes to further extend the memory lifetime. In addition, to mitigate the write disturbance problem in PCM, a new write strategy and a non-overlapping data layout are proposed to manage the thermal disturbance among adjacent cells.

For the on-chip cache, the goal is a scalable, low-latency configuration. To this end, the author proposes a morphable SLC-MLC STT-RAM cache that dynamically trades off larger capacity against lower latency, based on application demand. While adopting scalable memory technologies such as STT-RAM improves the performance of cache-sensitive applications, the cache-thrashing problem will still exist in applications with a very large data working set. To address this issue, the author proposes a selective caching mechanism for highly parallel architectures, and also introduces a criticality-aware compressed last-level cache that can hold a larger portion of the data working set while keeping the access latency low.

Table of Contents

List of Figures
List of Tables
Acknowledgments

Chapter 1 Introduction
  1.1 Construction of PCM-based Main Memories
    1.1.1 Limited Lifetime of PCM-based Main Memories
    1.1.2 Reliable Write Operations in PCM-based Main Memories
  1.2 Construction of STTRAM-based Last-level Caches
  1.3 Construction of SRAM-based Last-level Caches
    1.3.1 Managing Cache Conflicts in Highly Parallel Architectures
    1.3.2 Balancing Capacity and Latency in Compressed Caches

Chapter 2 Background and Related Work
  2.1 PCM-based Main Memory Organization
    2.1.1 Wear-out Faults
    2.1.2 Write Disturbance Faults
    2.1.3 Resistance Drift Faults
  2.2 STT-RAM Last-Level Cache Organization
  2.3 SRAM-based Last-Level Cache Organization
    2.3.1 Addressing Conflicts in Last-Level Cache
    2.3.2 Data Compression in Last-Level Cache

Chapter 3 Extending the Lifetime of PCM-based Main Memory
  3.1 Introduction
  3.2 Background and Related Work
    3.2.1 PCM Basics and Baseline Organization
    3.2.2 Wear-out in PCM
    3.2.3 Prior Work on Improving PCM Lifetime
  3.3 The Proposed Approach: Employing Compression for Robust PCM Design
    3.3.1 The Proposed Mechanism and its Lifetime Impacts
      3.3.1.1 Impact of compression on bit flips
      3.3.1.2 The need for intra-line wear-leveling
      3.3.1.3 Interaction with error tolerant schemes
      3.3.1.4 Using a worn-out block after changes in compression ratio
    3.3.2 Metadata management
  3.4 Experimental Setup
  3.5 Evaluation
    3.5.1 Memory Lifetime Analysis
      3.5.1.1 Impact of using compression (Comp)
      3.5.1.2 Impact of intra-line wear-leveling (Comp+W)
      3.5.1.3 Impact of advanced hard-error tolerance (Comp+WF)
      3.5.1.4 Summary
      3.5.1.5 Number of Tolerable errors
    3.5.2 Performance Overhead Analysis
    3.5.3 Sensitivity to the Effect of Process Variation
    3.5.4 Efficiency of Our Design for MLC-based PCMs
  3.6 Conclusions

Chapter 4 Tolerating Write Disturbance Errors in PCM Devices
  4.1 Introduction
  4.2 Background and Related Work
    4.2.1 PCM Basics
    4.2.2 Fault Models in PCM
  4.3 Experimental Methodology
  4.4 The Proposed Approach: Enabling Reliable Write Operations in Super Dense PCM
    4.4.1 Intra-line Scheme: Preventing Write Disturbance Errors Along the Word-line
    4.4.2 Inter-line Scheme: Tolerating Write Disturbance Errors Along the Bit-line
    4.4.3 The Interaction Between the Inter-line and Intra-line Schemes
  4.5 Conclusion

Chapter 5 Performance and Power-Efficient Design of Dense Non-Volatile Cache in CMPs
  5.1 Introduction
  5.2 Overview of STT-RAM Technology
    5.2.1 Single-Level Cell (SLC) Device
    5.2.2 Multi-Level Cell (MLC) Device
      5.2.2.1 Two-step Write Operation
      5.2.2.2 Two-step Read Operation
    5.2.3 SLC versus MLC: Device-Level Comparison
  5.3 MLC STT-RAM Cache: The Baseline
    5.3.1 Stripped Data-to-Cell Mapping
    5.3.2 Performance Analysis
    5.3.3 Enhancements for the Stripped MLC Cache
      5.3.3.1 The Need for Dynamic Associativity
      5.3.3.2 The Need for a Cache Line Swapping Policy
      5.3.3.3 Overhead of the Counters
  5.4 Experimental Methodology
    5.4.1 Infrastructure
    5.4.2 Configuration of the Baseline System
    5.4.3 Workloads
  5.5 Evaluation Results
    5.5.1 Performance Analysis
    5.5.2 Energy Consumption Analysis
    5.5.3 Lifetime Analysis
      5.5.3.1 Comparison with Some Prior Works on Reducing Cache Misses
  5.6 Related Work on STT-RAM-based Caches
  5.7 Conclusions

Chapter 6 Improving the Performance of Cache Hierarchy through Selective Caching
  6.1 Introduction
  6.2 Background
  6.3 Problem Formulation
    6.3.1 Kernel-Based Analysis
    6.3.2 Proposed Microarchitecture for Selective Caching
  6.4 Dynamic Cache Reconfiguration
    6.4.1 Proposed Microarchitecture for Run-Time Sampling
    6.4.2 Kernel Characterization
    6.4.3 Determining the Ideal Configuration
  6.5 Evaluation
  6.6 Experimental Result
    6.6.1 Dynamism
    6.6.2 Performance
    6.6.3 Sensitivity Study
    6.6.4 Cache Miss-Rate
    6.6.5 Comparison with Warp-Throttling Techniques
    6.6.6 Comparison with Reuse Distance-Based Caching Policies
  6.7 Related Work
  6.8 Conclusion

Chapter 7 Improving the Performance of Last-Level Cache through a Criticality-Aware Cache Compression
  7.1 Introduction
  7.2 Background and Related Works
    7.2.1 Baseline Platform
    7.2.2 Cache Compression
  7.3 Compression Implications
    7.3.1 Latency versus Capacity
  7.4 Criticality-Aware Compression
    7.4.1 Data Criticality
    7.4.2 Non-Uniform Compression
    7.4.3 Relaxing the Decompression Latency
  7.5 Methodology
  7.6 Evaluation
    7.6.1 Compression Ratio
    7.6.2 Misses-Per-Kilo-Instructions (MPKI)
    7.6.3 Average Data Access Latency
    7.6.4 Performance
  7.7 Conclusion

Chapter 8 Conclusions and Future Work
  8.1 Conclusions
  8.2 Future Research Directions
    8.2.1 Lifetime-aware Performance Improvement
    8.2.2 Secure PCM-based Main Memory

Bibliography

List of Figures

1.1 Memory hierarchy in a typical chip multi-processor.
3.1 Distribution of updated bits for consecutive writes to a specific and randomly-chosen 64-byte memory block for the gobmk application.
3.2 (a) PCM cell (b) PCM-based DIMM with ECC chip (c) DW circuit
3.3 The average compressed data size for BDI, FPC, and best of the two.