Opportunistic Memory Systems in Presence of Hardware Variability
Total Page:16
File Type:pdf, Size:1020Kb
UCLA UCLA Electronic Theses and Dissertations Title Opportunistic Memory Systems in Presence of Hardware Variability Permalink https://escholarship.org/uc/item/85c1r1q9 Author Gottscho, Mark William Publication Date 2017 Peer reviewed|Thesis/dissertation eScholarship.org Powered by the California Digital Library University of California UNIVERSITY OF CALIFORNIA Los Angeles Opportunistic Memory Systems in Presence of Hardware Variability A dissertation submitted in partial satisfaction of the requirements for the degree Doctor of Philosophy in Electrical Engineering by Mark William Gottscho 2017 c Copyright by Mark William Gottscho 2017 ABSTRACT OF THE DISSERTATION Opportunistic Memory Systems in Presence of Hardware Variability by Mark William Gottscho Doctor of Philosophy in Electrical Engineering University of California, Los Angeles, 2017 Professor Puneet Gupta, Chair The memory system presents many problems in computer architecture and system design. An important challenge is worsening hardware variability that is caused by nanometer-scale manufac- turing difficulties. Variability particularly affects memory circuits and systems – which are essen- tial in all aspects of computing – by degrading their reliability and energy efficiency. To address this challenge, this dissertation proposes Opportunistic Memory Systems in Presence of Hardware Variability. It describes a suite of techniques to opportunistically exploit memory variability for energy savings and cope with memory errors when they inevitably occur. In Part 1, three complementary projects are described that exploit memory variability for im- proved energy efficiency. First, ViPZonE and DPCS study how to save energy in off-chip DRAM main memory and on-chip SRAM caches, respectively, without significantly impacting perfor- mance or manufacturing cost. ViPZonE is a novel extension to the virtual memory subsystem in Linux that leverages power variation-aware physical address zoning for energy savings. The kernel intelligently allocates lower-power physical memory (when available) to tasks that access data fre- quently to save overall energy. Meanwhile, DPCS is a simple and low-overhead method to perform Dynamic Power/Capacity Scaling of SRAM-based caches. The key idea is that certain memory cells fail to retain data at a given low supply voltage; when full cache capacity is not needed, the voltage is opportunistically reduced and any failing cache blocks are disabled dynamically. The third project in Part 1 is X-Mem: a new extensible memory characterization tool. It is used in a se- ii ries of case studies on a cloud server, including one where the potential benefits of variation-aware DRAM latency tuning are evaluated. Part 2 of the dissertation focuses on ways to opportunistically cope with memory errors when- ever they occur. First, the Performability project studies the impact of corrected errors in mem- ory on the performance of applications. The measurements and models can help improve the availability and performance consistency of cloud server infrastructure. Second, the novel idea of Software-Defined Error-Correcting Codes (SDECCs) is proposed. SDECC opportunistically copes with detected-but-uncorrectable errors in main memory by combining concepts from cod- ing theory with an architecture that allows for heuristic recovery. SDECC leverages available side information about the contents of data in memory to essentially increase the strength of ECC with- out introducing significant hardware overheads. Finally, a methodology is proposed to achieve Virtualization-Free Fault Tolerance (ViFFTo) for embedded scratchpad memories. ViFFTo guards against both hard and soft faults at minimal cost and is suitable for future IoT devices. Together, the six projects in this dissertation comprise a complementary suite of methods for opportunistically exploiting hardware variability for energy savings, while reducing the impact of errors that will inevitably occur. Opportunistic Memory Systems can significantly improve the energy efficiency and reliability of current and future computing systems. There remain several promising directions for future work. iii The dissertation of Mark William Gottscho is approved. Glenn D. Reinman Mani B. Srivastava Lara Dolecek Puneet Gupta, Committee Chair University of California, Los Angeles 2017 iv To my wife, who has inspired me, and to whom I have vowed to expand our horizons. v TABLE OF CONTENTS 1 Introduction ::::::::::::::::::::::::::::::::::::::: 1 1.1 Memory Challenges in Computing . 1 1.2 Hardware Variability is a Key Issue for Memory Systems . 6 1.3 Opportunistic Memory Systems in Presence of Hardware Variability . 10 I Opportunistically Exploiting Memory Variability 14 2 ViPZonE: Saving Energy in DRAM Main Memory with Power Variation-Aware Mem- ory Management :::::::::::::::::::::::::::::::::::::: 15 2.1 Introduction . 17 2.1.1 Related Work . 18 2.1.2 Our Contributions . 19 2.1.3 Chapter Organization . 20 2.2 Background . 20 2.2.1 Memory System Architecture . 21 2.2.2 Main Memory Interleaving . 22 2.2.3 Vanilla Linux Kernel Memory Management . 23 2.3 Motivational Results . 26 2.3.1 Test Methodology . 27 2.3.2 Test Results . 29 2.3.3 Summary of Motivating Results . 33 2.3.4 Exploiting DRAM Power Variation . 34 2.4 Implementation . 35 vi 2.4.1 Target Platform and Assumptions . 37 2.4.2 Back-End: ViPZonE Implementation in the Linux Kernel . 38 2.4.3 Front-End: User API and System Call . 42 2.5 Evaluation . 45 2.5.1 Testbed Setup . 45 2.5.2 Interleaving Analysis . 47 2.5.3 ViPZonE Analysis . 51 2.5.4 What-If: Expected Benefits With Non-VolatileMemories Exhibiting Ultra- Low Idle Power . 52 2.5.5 Summary of Results . 53 2.6 Discussion and Conclusion . 54 3 DPCS: Saving Energy in SRAM Caches with Dynamic Power/Capacity Scaling :: 57 3.1 Introduction . 59 3.2 Related Work . 61 3.2.1 Leakage Reduction . 61 3.2.2 Fault Tolerance . 61 3.2.3 Memory Power/Performance Scaling . 62 3.2.4 Novelty of This Work . 62 3.3 The SRAM Fault Inclusion Property . 63 3.4 A Case for the “Static Power vs. Effective Capacity” Metric . 65 3.4.1 Amdahl’s Law Applied to Fault-Tolerant Voltage-Scalable Caches . 65 3.4.2 Problems With Yield Metric . 66 3.4.3 Proposed Metric . 67 3.5 Power/Capacity Scaling Architecture . 67 vii 3.5.1 Mechanism . 68 3.5.2 Static Policy: SPCS . 75 3.5.3 Dynamic Policies: DPCS . 76 3.6 Modeling and Evaluation Methodology . 78 3.6.1 Modeling Cache Fault Behavior . 78 3.6.2 System and Cache Configurations . 80 3.6.3 Technology Parameters and Modeling Cache Architecture . 80 3.6.4 Simulation Approach . 81 3.7 Analytical Results . 84 3.7.1 Fault Probabilities and Yield . 84 3.7.2 Static Power vs. Effective Capacity . 87 3.7.3 Area Overhead . 88 3.8 Simulation Results . 89 3.8.1 Fault Map Distributions . 89 3.8.2 Architectural Simulation . 91 3.9 Conclusion . 96 4 X-Mem: A New Extensible Memory Characterization Tool used for Case Studies on Memory Performance Variability ::::::::::::::::::::::::::::: 97 4.1 Introduction . 99 4.2 Related Work . 102 4.3 Background . 105 4.3.1 Non-Uniform Memory Access (NUMA) . 106 4.3.2 DDRx DRAM Operation . 107 4.4 X-Mem: Design and Implementation . 109 viii 4.4.1 Access Pattern Diversity . 109 4.4.2 Platform Variability . 113 4.4.3 Metric Flexibility . 115 4.4.4 UI and Benchmark Management . 117 4.5 Experimental Platforms and Tool Validation . 119 4.6 Case Study Evaluations . 124 4.6.1 Case Study 1: Characterization of the Memory Hierarchy for Cloud Sub- scribers . 124 4.6.2 Case Study 2: Cross-Platform Insights for Cloud Subscribers . 129 4.6.3 Case Study 3: Impact of Tuning Platform Configurations for Cloud Providers and Evaluating the Efficacy of Variation-Aware DRAM Timing Optimiza- tions . 134 4.7 Conclusion . 139 II Opportunistically Coping with Memory Errors 140 5 Performability: Measuring and Modeling the Impact of Corrected Memory Errors on Application Performance :::::::::::::::::::::::::::::::: 141 5.1 Introduction . 143 5.2 Related Work . 145 5.3 Background: DRAM Error Management and Reporting . 145 5.4 Measuring the Impact of Memory Errors on Performance . 147 5.4.1 Experimental Methods . 147 5.4.2 Empirical Results using Fault Injection . 149 5.5 Analytical Modeling Assumptions . 155 5.6 Model for Application Thread-Servicing Time (ATST) . 157 ix 5.6.1 Model Derivation . 157 5.6.2 Numeric Computation of ATST . 161 5.6.3 Average ATST . 164 5.6.4 ATST Slowdown . 165 5.6.5 Using the ATST Model to Predict the Impact of Page Retirement on Ap- plication Performance . 166 5.7 ATST-based Analytical Models for Batch Applications . 167 5.7.1 Uniprocessor System . 167 5.7.2 Multiprocessor System . 168 5.7.3 Predicting the Performance of Batch Applications . 170.