UNIVERSITY OF CALIFORNIA SAN DIEGO

Enabling Non-Volatile Memory for Data-intensive Applications

A dissertation submitted in partial satisfaction of the requirements for the degree Doctor of Philosophy

in

Computer Science (Computer Engineering)

by

Xiao Liu

Committee in charge:

Professor Jishen Zhao, Chair
Professor Ryan Kastner
Professor Farinaz Koushanfar
Professor Tajana Rosing
Professor Steve Swanson

2021

Copyright
Xiao Liu, 2021
All rights reserved.

The dissertation of Xiao Liu is approved, and it is acceptable in quality and form for publication on microfilm and electronically.

University of California San Diego

2021

DEDICATION

To my parents.

EPIGRAPH

Oh, let the sun beat down upon my face. With stars to fill my dreams. I am a traveler of both time and space. To be where I have been. Sit with elders of the gentle race. This world has seldom seen. Talk of days for which they sit and wait. All will be revealed.

— Robert A. Plant

TABLE OF CONTENTS

Dissertation Approval Page ...... iii

Dedication ...... iv

Epigraph ...... v

Table of Contents ...... vi

List of Figures ...... ix

List of Tables ...... xii

Acknowledgements ...... xiii

Vita ...... xvi

Abstract of the Dissertation ...... xvii

Chapter 1 Introduction ...... 1
    1.1 Thesis Statement ...... 3
        1.1.1 Enabling NVM for Big Data Applications ...... 3
        1.1.2 Enabling NVM for Neural Network Applications ...... 4
    1.2 Thesis Outline ...... 6

Chapter 2 Background and Motivation ...... 8
    2.1 The Hardware of Non-Volatile Memory ...... 8
    2.2 Data-intensive Applications ...... 10
        2.2.1 Big Data Applications ...... 10
        2.2.2 Neural Network Applications ...... 12
    2.3 NVM DIMM and Persistent Memory ...... 13
        2.3.1 Motivation: Lack of Reliability Support in NVM DIMM ...... 15
    2.4 RRAM-based NN Accelerators ...... 16
        2.4.1 Motivation: Missing Features Demanded by the Neural Network Applications ...... 19

Chapter 3 Persistent Memory Workload Characterization: A Hardware Perspective ...... 21
    3.1 Background and Motivation ...... 23
        3.1.1 Background on Persistent Memory Systems ...... 23
        3.1.2 The Need for Understanding Hardware Behavior ...... 24
        3.1.3 Row Buffer in Memory Devices and Its Impact on NVM Emulation ...... 25
    3.2 Methodology ...... 26
        3.2.1 System Platforms ...... 27
        3.2.2 File Systems ...... 27
        3.2.3 Persistent Memory Workloads ...... 29
    3.3 The Impact of Write Intensity ...... 31
        3.3.1 Impact of Write Intensity on System Performance ...... 31
        3.3.2 Bonnie Results ...... 34
    3.4 The Impact of NVM Access Locality ...... 35
        3.4.1 Buffer Hit Rate ...... 35
        3.4.2 Buffer Size ...... 37
    3.5 Microarchitecture Analysis ...... 38
    3.6 Implications on Persistent Memory System Design ...... 40
    3.7 Conclusion ...... 42

Chapter 4 Binary Star: Coordinated Reliability in Heterogeneous Memory Systems for High Performance and Scalability ...... 44
    4.1 Background and Motivation ...... 47
        4.1.1 DRAM Main Memory ...... 48
        4.1.2 3D DRAM Last Level Cache ...... 49
        4.1.3 NVM Main Memory ...... 49
        4.1.4 Issues with Decoupled Reliability ...... 50
    4.2 Binary Star Design ...... 52
        4.2.1 Coordinated Reliability Scheme ...... 53
        4.2.2 Coordinating Wear Leveling and Consistent Cache Writeback ...... 57
        4.2.3 3D DRAM LLC Error Correction and Error Recovery ...... 59
        4.2.4 Putting it All Together: An Example ...... 60
    4.3 Implementation ...... 61
        4.3.1 3D DRAM LLC Modification ...... 62
        4.3.2 Memory Controller Modifications ...... 62
        4.3.3 System Software Modifications ...... 65
        4.3.4 Binary Star Overheads ...... 66
    4.4 Experimental Methodology ...... 67
    4.5 Results ...... 70
        4.5.1 Reliability ...... 70
        4.5.2 Performance ...... 72
        4.5.3 Impact of the Periodic Forced Writeback Interval ...... 74
    4.6 Conclusion ...... 77

Chapter 5 HR3AM: A Heat Resilient Design for RRAM-based Neuromorphic Computing ...... 78
    5.1 Background and Motivation ...... 80
        5.1.1 Neural Network Data Mapping on RRAM crossbar ...... 80
        5.1.2 Thermal Issues in RRAM Accelerator ...... 81
    5.2 HR3AM Design ...... 83
        5.2.1 HR3AM Overview ...... 83
        5.2.2 Heat-resilient Weight Adjustment ...... 84
        5.2.3 Dynamic Thermal Management ...... 88
    5.3 Evaluation ...... 90
        5.3.1 Methodology ...... 90
        5.3.2 Inference accuracy ...... 91
        5.3.3 Thermal status ...... 92
        5.3.4 Performance and energy ...... 93
    5.4 Conclusion ...... 94

Chapter 6 Mirage: A Highly Paralleled And Flexible RRAM Accelerator ...... 96
    6.1 Background & Motivation ...... 98
        6.1.1 Shared Data Induced Data Dependency ...... 100
        6.1.2 Issues with Existing Hardware Assignments ...... 102
    6.2 Mirage ...... 104
        6.2.1 Finer-grained Parallel RRAM Architecture (FPRA) ...... 105
        6.2.2 Auto Assignment ...... 110
    6.3 Mirage Implementation ...... 113
        6.3.1 The Implementation of FPRA ...... 113
        6.3.2 Auto Assignment Tool Implementation ...... 115
    6.4 Experimental Setup ...... 116
    6.5 Evaluations ...... 118
        6.5.1 Performance ...... 118
        6.5.2 Flexibility ...... 120
        6.5.3 Power Efficiency ...... 123
    6.6 Conclusion ...... 124

Chapter 7 Conclusion ...... 125
    7.1 Summary of the Thesis ...... 125
    7.2 Future Works ...... 126

Bibliography ...... 127

LIST OF FIGURES

Figure 2.1: Overview of RRAM-based DNN accelerator...... 17

Figure 3.1: System platform configuration. ...... 26
Figure 3.2: Throughput of FFSB on four systems. The x-axis represents the read to write ratio. The y-axis represents the throughput. ...... 31
Figure 3.3: Normalized results of Bonnie. The y-axis stands for all the six metrics. The x-axis stands for the normalized number of each metric. ...... 34
Figure 3.4: Filebench throughputs under various memory models. The y-axis represents the throughput. The x-axis represents memory models, which include DRAM, classic NVM with no buffer, and NVM with buffer hit rates under 50% to 90%. ...... 36
Figure 3.5: Filebench throughputs of DRAM and NVMs with different row buffer sizes. The workload has 60% sequential read operations. ...... 37
Figure 3.6: Filebench throughputs of DRAM and NVMs with different row buffer sizes. The workload has 80% sequential read operations. ...... 37
Figure 3.7: Correlation between workload performance and five hardware counter statistics. The x-axis represents different metrics. The y-axis represents the correlation coefficient between the throughput and the metric. ...... 38

Figure 4.1: Reliability and throughput of various LLC and main memory hierarchy configurations normalized to 28nm DRAM with RECC. We define “reliability” as the reciprocal of device failures in time (FIT) [1, 2], i.e., 1/FIT. Reliability data is based on a recent study [1]. ...... 45
Figure 4.2: (a) Multiple reliability mechanisms for a cache line across LLC and main memory. (b) Fraction of clean LLC lines during application execution. We collected data for ten seconds after the benchmark finishes the warm up stage. ...... 50
Figure 4.3: Basic architecture configuration. ...... 53
Figure 4.4: Binary Star overview. (a) Periodic forced writeback and (b) Consistent cache writeback. (c) Binary Star error correction and error recovery mechanisms. A solid filled square represents a checkpointed consistent data block. A striped filled square represents a remapped inconsistent data block. ...... 54
Figure 4.5: An example of running an application on Binary Star. ...... 60
Figure 4.6: Modifications to NVM wear leveling. ...... 63
Figure 4.7: State machine of the Binary Star software daemon. ...... 65
Figure 4.8: Device failure rates of 3D DRAM for traditional RECC [3], RECC with IECC [1], Binary Star, and Chipkill with IECC [1, 4] vs. single cell bit error rate. Vertical lines represent 3D DRAM technology nodes (28nm and sub-20nm). ...... 71
Figure 4.9: System performance of different memory configurations with no ECC and Binary Star. We normalize the throughput of each configuration to the throughput of 3D DRAM LLC with DRAM. ...... 73
Figure 4.10: Performance of Binary Star normalized to the performance of the 3D DRAM+DRAM configuration, as PCM latency is varied as a multiple of its value in Table 4.1. ...... 73
Figure 4.11: Effect of varying the periodic forced writeback interval on (a) throughput, (b) probability of rollbacks (i.e., error rate on a dirty line), and (c) NVM writes. ...... 75

Figure 5.1: Temperature impact on (a) RRAM cell conductance and (b) CNN applications relative inference accuracy. ...... 81
Figure 5.2: Temperature distribution of a single RRAM chip when running VGG16, InceptionV3 and ResNet50. ...... 82
Figure 5.3: Overview of HR3AM structure. ...... 83
Figure 5.4: Bitwidth downgrading hardware in the crossbar array. ...... 86
Figure 5.5: Bitwidth downgrading operation flowchart. ...... 87
Figure 5.6: Tile pairing (a) mechanism and (b) control logic. ...... 88
Figure 5.7: Inference accuracy of four neural network models. ...... 89
Figure 5.8: Comparison of thermal distributions of a RRAM chip when running three networks. ...... 92
Figure 5.9: Comparison of maximum and average temperatures of RRAM chip when running three networks. The x-axis represents the temperature measurement time in milliseconds. ...... 92
Figure 5.10: Normalized (a) throughput and (b) power consumption on HR3AM. ...... 94

Figure 6.1: The throughput of four neural networks. Numbers are normalized to the theoretical maximum throughput of the accelerator. ...... 99
Figure 6.2: Shared data in a convolutional layer with four duplicate kernels. ...... 100
Figure 6.3: An example of data sharing in the first two layers of a DNN model. ...... 101
Figure 6.4: Single inference latency breakdown. ...... 102
Figure 6.5: Comparison between user assignment and adjusted assignment on (a) the number of duplicate kernels in eight convolutional layers, and (b) inference speedup. ...... 103
Figure 6.6: The overview of the Mirage architecture. ...... 105
Figure 6.7: Two kernel batching types: (a) linear batching, and (b) square batching. ...... 106
Figure 6.8: Data sharing aware memory in two layers of a DNN. ...... 108
Figure 6.9: Batched window shifts in the input/output feature maps of a convolutional layer. The layer uses two batched 2×2×64-sized kernels with one stride size. Two batched kernels use a shared input of a 2×3×64-sized window. The numbers indicate the steps and the red arrow represents the direction. ...... 109
Figure 6.10: The work flow of the auto assignment. ...... 110
Figure 6.11: The flow chart of searching algorithm. ...... 112
Figure 6.12: Tile architecture. ...... 114
Figure 6.13: State diagram of the intra-tile data synchronization. ...... 115
Figure 6.14: Speedup of single inference latency with four RRAM-based accelerator configurations. ...... 119
Figure 6.15: Idle ratio of four configurations. ...... 119
Figure 6.16: Comparison of throughputs of three configurations. ...... 120
Figure 6.17: Comparison of Mirage and w/ kernel duplication over (a) speedup, and (b) throughput when the chip size scales from 1× to 32×. ...... 122
Figure 6.18: Speedup of three batching strategies with Mirage. ...... 122
Figure 6.19: Power efficiency of three configurations. ...... 123

LIST OF TABLES

Table 3.1: Profiled software and hardware events. (S) represents software events. The rest are hardware counter events. ...... 28
Table 3.2: Storage workloads descriptions. ...... 29

Table 4.1: Evaluated processor and memory configurations. ...... 69
Table 4.2: Comparison of the evaluated resilience schemes. ...... 72

Table 6.1: Hardware configuration of the Mirage. ...... 118
Table 6.2: Number of duplicate kernels for each layer, and hardware utilization in four VGG configurations. ...... 121

ACKNOWLEDGEMENTS

My PhD career has been a life-changing experience. I am delighted and honored to have met wonderful people through this adventure. I would like to thank my parents, Daping Liu and Yanmin Liu, for their unconditional love, compassion, and support. My parents have always been backing me through all the challenges during the pursuit of the degree. I am deeply indebted to my advisor, Professor Jishen Zhao, for her generous support and endless patience as the chair of my committee. Her guidance has taught me not only what good research is but also what a good researcher is. Her invaluable mentorship has reshaped me from an incompetent novice into a qualified researcher. I would also like to extend my gratitude to Professor Ryan Kastner, Professor Tajana Rosing, Professor Steve Swanson, and Professor Farinaz Koushanfar for their valuable comments and feedback as my committee members. Special thanks to Professor Tajana Rosing and Professor Steve Swanson. Aside from serving on the committee, they provided strong support through collaborations on the projects. I would like to express my appreciation to all my close collaborators: Professor Rachata Ausavarungnirun, Minxuan Zhou, Dr. Shengye Wang, Dr. David Roberts, Dr. Brandon Nesterenko, Professor Onur Mutlu, Professor Qing Yi, Professor Vijaykrishnan Narayanan, Sean Eilert, Ameen Akel, Pankaj Mehra, and Bhaskar Jupudi. Special thanks to Professor Rachata Ausavarungnirun for providing tremendous assistance through multiple of my publications. Thanks to Minxuan Zhou for the introduction to and discussions of PIM-related topics. Thanks to Dr. Shengye Wang for teaching me about robotics and software engineering topics. Thanks to Dr. David Roberts for teaching me about memory error related topics. Thanks to Dr. Brandon Nesterenko for the introduction to HPC-related topics. I would also like to thank my mentors, Dr. Maya Gokhale and Dr. Erich Haratsch, and my colleagues during my summer internships.

I would like to thank all members of the STABLE for their feedback and support of my research. Special thanks to Matheus Ogleari and Mengjie Li, who provided enormous help in my first year at UC Santa Cruz. Thanks to Zixuan Wang and Theodore Michailidis for close collaboration on NVM-related topics. I would also like to thank members of the NVSL, Dr. Jian Yang, Dr. Lu Zhang, Dr. Kunal Korgaonkar, Dr. Amirsaman Memaripour, Professor Joseph Izraelevitz, Morteza Hoseinzadeh, and Juno Kim, for many insightful discussions on NVM-related topics. I also wish to thank other faculty of the CSE department: Professor Bryan Chin, Professor Hadi Esmaeilzadeh, and Professor Yiying Zhang. It was a pleasant experience to teach students and discuss research with them. Last but not least, I want to thank everyone, including all the people mentioned above, who provided assistance and accommodation in the past two years during the pandemic. I never expected my PhD career would encounter such a catastrophe. I feel lucky and deeply thankful to have accomplished the degree at this special time.

Chapter 3 contains material from “Persistent Memory Workload Characterization: A Hardware Perspective.”, by Xiao Liu, Bhaskar Jupudi, Pankaj Mehra, and Jishen Zhao, which appears in the Proceedings of the 2019 IEEE International Symposium on Workload Characterization (IISWC 2019). The dissertation author is the primary investigator and first author of this paper.

Chapter 4 contains material from “Binary Star: Coordinated Reliability in Heterogeneous Memory Systems for High Performance and Scalability.”, by Xiao Liu, David Roberts, Rachata Ausavarungnirun, Onur Mutlu, and Jishen Zhao, which appears in the Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 52). The dissertation author is the primary investigator and first author of this paper.

Chapter 5 contains material from “HR3AM: A Heat Resilient Design for RRAM-based Neuromorphic Computing.”, by Xiao Liu, Minxuan Zhou, Tajana S. Rosing, and Jishen Zhao, which appears in the Proceedings of the 2019 IEEE/ACM International Symposium on Low

Power Electronics and Design (ISLPED 2019). The dissertation author is the primary investigator and first author of this paper.

Chapter 2 and Chapter 6 contain material from “Mirage: A Highly Paralleled And Flexible RRAM Accelerator.”, by Xiao Liu, Minxuan Zhou, Rachata Ausavarungnirun, Sean Eilert, Ameen Akel, Tajana S. Rosing, Vijaykrishnan Narayanan, and Jishen Zhao, which is submitted for publication. The dissertation author is the primary investigator and first author of this paper.

VITA

2015 Bachelor of Engineering, Zhejiang University

2021 Doctor of Philosophy, University of California San Diego

PUBLICATIONS

Xiao Liu, David Roberts, Rachata Ausavarungnirun, Onur Mutlu, and Jishen Zhao. “Binary Star: Coordinated Reliability in Heterogeneous Memory Systems for High Performance and Scalability.” In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 807-820. 2019.

Xiao Liu, Minxuan Zhou, Tajana S. Rosing, and Jishen Zhao. “HR3AM: A Heat Resilient Design for RRAM-based Neuromorphic Computing.” In 2019 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED), pp. 1-6. IEEE, 2019.

Xiao Liu, Bhaskar Jupudi, Pankaj Mehra, and Jishen Zhao. “Persistent Memory Workload Characterization: A Hardware Perspective.” In 2019 IEEE International Symposium on Workload Characterization (IISWC), pp. 249-252. IEEE, 2019.

ABSTRACT OF THE DISSERTATION

Enabling Non-Volatile Memory for Data-intensive Applications

by

Xiao Liu

Doctor of Philosophy in Computer Science (Computer Engineering)

University of California San Diego, 2021

Professor Jishen Zhao, Chair

Emerging Non-Volatile Memory (NVM) technologies are reshaping computer architecture. NVM offers advantages including a byte-addressable interface, low latency, high capacity, and in-memory computing capability. However, data-intensive applications today demand compound features rather than just better performance. For instance, big data applications require high availability and reliability, while neural network applications require scalability and power efficiency. Despite all the advantages of NVM, simply attaching NVM to the memory hierarchy cannot meet these demands. The decoupled reliability schemes among NVM and other devices fail to provide sufficient reliability. The vulnerability to overheating and hardware underutilization limit the performance and scalability of in-memory computing

NVM. Using NVM for data-intensive applications requires redesign and customization. In this thesis, we focus on the architecture designs that enable NVM for data-intensive applications. Our study covers two major types of data-intensive applications – big data applications and neural network applications.

We first conduct a characterization study of persistent memory applications. Persistent memory is implemented over NVM-based main memory and guarantees crash consistency. We explore the performance interaction across applications, persistent memory system software, and hardware components. Based on our characterization results, we provide a set of implications and recommendations for optimizing persistent memory designs.

Second, we propose Binary Star for generic data-intensive applications, which coordinates the reliability schemes and consistent cache writeback between the 3D-stacked DRAM last-level cache and NVM main memory to maintain the reliability of the memory hierarchy. Binary Star significantly reduces the performance and storage overhead of consistent cache writeback by coordinating it with NVM wear leveling.

For neural network applications, our first design explores the thermal effect on one representative NVM – resistive memory (RRAM). We find that heat-induced interference decreases the computational accuracy of the RRAM-based neural network accelerator. We propose HR3AM, a heat resilient design, which improves accuracy and optimizes the thermal distribution. Results show that HR3AM improves classification accuracy and decreases both the maximum and average chip temperatures.

Lastly, we present Mirage to improve parallelism and flexibility for pipeline-enabled RRAM-based accelerators. Mirage is a hardware/software co-design that addresses the data dependency and inflexibility issues of existing accelerators. Our evaluation shows that Mirage achieves low inference latency and high throughput compared to state-of-the-art RRAM-based accelerators.

Chapter 1

Introduction

The rapid growth of chip technology has enabled a variety of new applications over the last decade. Data-intensive applications are almost everywhere: they range from the cloud backends of mobile app services to the self-driving control of autonomous vehicles. Many organizations and individuals fulfill their jobs and render their services through data-intensive applications over cloud-native hardware. The fast response, great scalability, and easy maintenance of modern applications have improved the productivity of various tasks.

Data-intensive applications demand reliability, scalability, and availability from the computing system. First, because of the value of the data and the importance of the service, users demand high reliability from data-intensive applications [5]. Such reliability includes crash consistency against power failures, error correction against memory-related errors, fast reboot after software-generated errors, and the ability to undo unintentional manipulation. Second, data-intensive applications also demand high scalability. As many services have millions of active users, the load size of data-intensive applications can be massive. Scalability requires the application to provide the same quality of service (QoS) to users under different load sizes. The QoS includes critical performance metrics such as response time, tail latency, and maximum bandwidth. Last but not least, data-intensive applications demand high availability. An available data-intensive application

benefits both users and administrators. An available application should be convenient and usable for the user. The less expertise the application requires from the user, the wider the range of users who can use it, and the more availability it provides. From the administrator's perspective, an available application should be simple enough and easy to maintain, upgrade, and evaluate [6].

The emerging Non-Volatile Memory (NVM) could address these demands. Compared to DRAM and SSD, NVM has unique performance characteristics. The byte-addressable feature and low access latency enable NVM to be an alternative for the main memory. The non-volatility provided by NVM makes memory-level crash consistency possible, which is also known as “persistent memory”. These performance characteristics are vastly different from either the fast, low-capacity, and volatile SRAM and DRAM, or the slow, high-capacity, and non-volatile SSD and disk.

The nature of NVM leads to different reliability characteristics. Unlike existing storage media, NVMs use two states of a non-silicon-based material to store a binary digit. Because of this feature, NVM is free of many silicon-based errors that affect DRAM and SSD. With better reliability, NVM is also projected to scale below 10nm more easily than DRAM [7]. NVM also provides in-memory computing capability. In-memory computing NVM is able to conduct computations for neural networks. In-memory computing provides massive parallelism by utilizing the capacity of the NVM. It also reduces the data movement between memory and processors and achieves better energy efficiency.

With the arrival of 3DXpoint [8], the wide use of NVM can be expected in many computational systems within a few years. The system-level support for NVM is fairly sufficient: there are extensions to the instruction sets [9], file systems [10, 11, 12, 13, 14], programming models [15, 16, 17, 18], compilers [19, 20], etc. With these designs, the deployment of NVM can be as convenient as the deployment of DRAM and SSD.

Despite all the advantages of NVM, enabling NVM for data-intensive applications is non-trivial. Data-intensive applications require resilience against system crashes and errors,

which are not supported by plain NVM. Persistent memory applications demand memory-supported crash consistency. Supporting this requires modifications to the memory controller for writeback reordering and forced writeback, along with software support in programming models. Generic big data applications require strong resilience against memory errors. When attaching NVM to the current system, achieving high reliability requires coordination among the reliability schemes of NVM and other types of devices. Neural network applications enabled by in-memory computing NVM also require customized designs to achieve high performance and reliability. The solutions in the traditional memory hierarchy are not applicable to the in-memory architecture. This novel architecture also suffers from issues in reliability (e.g., thermal-related and endurance-related) and performance (e.g., underutilization), which require a re-evaluation of the issues and corresponding designs.

1.1 Thesis Statement

Based on our analysis of the characteristics of NVM and the demands of data-intensive applications, our thesis statement is: utilizing the novel NVM, customized architecture designs can provide the desired features with sufficient performance for various modern applications. In this thesis, we discuss two major types of data-intensive applications: big data applications and neural network applications. Due to the uniqueness of each type of application, there is no unified NVM-based design that benefits both of them. However, all the designs in this thesis follow the same principle: the design is aware of the characteristics of the NVM, and redesigns the existing architecture such that the system benefits from the NVM.

1.1.1 Enabling NVM for Big Data Applications

Because of its similarity to DRAM, applications can use NVM as a part of the main memory to cache data from lower-level storage. We first look into crash consistency

applications – persistent memory applications. Persistent memory applications require data consistency against system failures at any time. We first conduct a characterization study of persistent memory applications. In this study, we analyze the performance and hardware behaviors of persistent memory systems by running various storage workloads with a variety of configurations (e.g., read/write intensity, file size, and number of threads) on traditional and persistent memory file systems. To examine the performance impact of persistent memory systems and workload configurations, we profile the statistics of a variety of software events and hardware performance counters available in modern processors. As a result, we derive implications and recommendations that might benefit persistent memory designs.

Then we consider more general applications – big data applications. These applications require a large amount of memory but also desire resilience against errors in the memory system. We propose Binary Star, a coordinated memory hierarchy reliability scheme with NVM-based main memory for these applications. Binary Star consists of 1) coordinated reliability mechanisms between the 3D DRAM LLC and NVM, and 2) NVM wear leveling with consistent cache writeback. The key insight of our design is that traditional memory hierarchy designs typically strive to separately optimize the reliability of LLCs and main memory with decoupled reliability schemes, yet coordinating reliability schemes across the LLC and main memory enhances the reliability of the overall memory hierarchy, while at the same time reducing the unnecessary ECC, multiversioning, and ordering control overheads that degrade performance. We show that Binary Star provides benefits across various types of workloads, including in-memory databases, in-memory key-value stores, online transaction processing, data mining, scientific computation, image processing, and video encoding.

1.1.2 Enabling NVM for Neural Network Applications

Applications can also use NVM as the media for the novel PIM architecture. The PIM architecture avoids unnecessary data movement between the CPU and memory, and parallelizes

operations in memory, which is a good fit for neural network applications. Under the PIM architecture, NVM is used as the computational media. We specifically look into the resistive RAM (RRAM) based PIM architecture. RRAM-based PIM has been widely used in accelerating neural network (NN) applications.

We first look into the thermal issue of RRAM-based accelerators. We discover that the thermal issue potentially reduces the inference accuracy of neural networks, such that the accuracy loss can be as high as 90%. We introduce a novel heat resilient design, HR3AM, for the RRAM crossbar. HR3AM addresses the issue from two aspects: adapting weights to the reduced conductance and optimizing the chip thermal distribution. We correspondingly propose two mechanisms in HR3AM for these two aspects. The first mechanism, bitwidth downgrading, aims to improve accuracy. It adapts weights to the decreased conductance range by reducing the bitwidth per RRAM cell. The second mechanism, tile pairing, aims to tackle the thermal issue. It optimizes the chip thermal status by matching hot-spot tiles with idle tiles.

Lastly, we look into the underutilization of RRAM-based accelerators. We discover data-dependency-induced pipeline stalls and inflexible hardware assignment in the pipeline design of RRAM-based accelerators. These two issues lead to underutilization of the accelerator, which can cause up to 50% idle time during execution. To resolve these issues, we introduce Mirage, an RRAM-based accelerator that provides 1) high parallelism by resolving the data dependency due to shared data in duplicate kernels, and 2) high flexibility by enabling automatic hardware assignment for general DNN applications. Mirage consists of two primary designs: Finer-grained Parallel RRAM Architecture (FPRA) and Auto Assignment (AA), to deal with the issues raised by shared data and hardware assignment, respectively. FPRA is based on the insight that a proper arrangement of the duplicate kernels and unified buffering of input and output data can resolve the data dependency caused by shared data. FPRA introduces kernel-batching to group the computation of the duplicate kernels of the same layer, and data sharing aware memory to uniformly manage the input and output data of duplicate kernels. AA

utilizes the network parameters and accelerator hardware configuration to automatically generate a hardware assignment. AA follows two rules that we observed from DNN models: 1) layers at the front need to have the same or a higher level of parallelism than layers deeper in the network, and 2) consecutive layers with the same level of parallelism suffer the least amount of stalling caused by data dependency. With a greedy strategy that follows these rules, AA searches for the best assignment for all the layers under the available hardware.

1.2 Thesis Outline

The thesis is organized as follows: Chapter 2 discusses the background of NVM, modern applications, and existing architecture designs for NVM. Chapter 3 presents our study of the characteristics of persistent memory. Chapter 4 introduces the design and evaluation of Binary Star. Binary Star is a coordinated memory hierarchy reliability scheme built on NVM-based heterogeneous memory. Chapter 5 presents the design and evaluation of HR3AM. HR3AM is a heat resilient design for the RRAM-based accelerator. Chapter 6 introduces the design and evaluation of Mirage. Mirage is a software/hardware co-design that provides high parallelism and high flexibility for the RRAM-based accelerator. Chapter 7 summarizes the thesis and discusses potential future research inspired by this thesis.

Acknowledgements

This chapter contains material from “Persistent Memory Workload Characterization: A Hardware Perspective.”, by Xiao Liu, Bhaskar Jupudi, Pankaj Mehra, and Jishen Zhao, which appears in the Proceedings of the 2019 IEEE International Symposium on Workload Characterization (IISWC 2019). The dissertation author is the primary investigator and first author of this paper.

This chapter contains material from “Binary Star: Coordinated Reliability in Heterogeneous Memory Systems for High Performance and Scalability.”, by Xiao Liu, David Roberts, Rachata Ausavarungnirun, Onur Mutlu, and Jishen Zhao, which appears in the Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 52). The dissertation author is the primary investigator and first author of this paper.

This chapter contains material from “HR3AM: A Heat Resilient Design for RRAM-based Neuromorphic Computing.”, by Xiao Liu, Minxuan Zhou, Tajana S. Rosing, and Jishen Zhao, which appears in the Proceedings of the 2019 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED 2019). The dissertation author is the primary investigator and first author of this paper.

This chapter contains material from “Mirage: A Highly Paralleled And Flexible RRAM Accelerator.”, by Xiao Liu, Minxuan Zhou, Rachata Ausavarungnirun, Sean Eilert, Ameen Akel, Tajana S. Rosing, Vijaykrishnan Narayanan, and Jishen Zhao, which is submitted for publication. The dissertation author is the primary investigator and first author of this paper.

Chapter 2

Background and Motivation

In this chapter, we discuss the background and motivation of the dissertation. Section 2.1 provides an overview of the hardware technologies of NVM. Section 2.2 provides an introduction to data-intensive applications. Section 2.3 discusses the NVM DIMM and persistent memory – a crash consistent architecture based on NVM. Section 2.4 introduces the architecture of the RRAM-based NN accelerator, which uses NVM as the in-memory computing media.

2.1 The Hardware of Non-Volatile Memory

Three representative NVMs are phase change memory (PCM), spin-transfer torque magnetic memory (STT-RAM), and resistive memory (RRAM). Although different types of NVMs are based on different underlying physics, all NVMs use a high resistance state and a low resistance state and use electrical stimulus to switch between the two states [21].

Typical PCM uses Ge2Sb2Te5 (GST) chalcogenide, which can be in either a crystalline state or an amorphous state. The crystalline state has a low resistance, while the amorphous state has a high resistance. To set a PCM storage element to the amorphous state, the controller uses a short current pulse to heat the chalcogenide to 650°C and freezes it rapidly [22]. To set the element to the crystalline state, the controller uses a long current pulse to heat the chalcogenide in the

same way but cools it down slowly [22]. PCM is the most popular NVM technology because of its advantages in scalability, cell size, read latency, and write power efficiency [23, 7]. The industry has actively engaged in the development of PCM and proposed many prototypes [24, 25, 26]. It is believed that 3D-Xpoint is a type of PCM [27].

STT-RAM uses a magnetic tunnel junction. The structure of STT-RAM has two layers: a free layer and a reference layer [28]. The reference layer has a fixed magnetic orientation, while the free layer's magnetic orientation decides the resistance of the unit. When both layers have the same orientation, the unit is at low resistance. When they have opposite orientations, the unit is at high resistance. When manipulating the resistance, the transistor releases a large current to alter the magnetic orientation of the free layer [29]. STT-RAM has among the best read/write latency and endurance of all NVMs. However, large cell sizes limit the capacity of STT-RAM [21]. These features make STT-RAM a desirable replacement for SRAM or embedded DRAM [30, 31].

RRAM, or the memristor, shares some similarities with PCM. RRAM uses two metal layers and an oxide layer in between. The state of RRAM depends on the formation and rupture of conductive filaments in the oxide layer. The state of RRAM can be changed by an external voltage. A positive voltage sets the RRAM to the high resistance state, and a negative voltage sets it to the low resistance state [32]. Similar to PCM, RRAM also has a low access latency, small cell size, and low write energy [21].

All three types of NVMs are able to form a one-transistor one-resistor (1T1R) structure, where the transistor is silicon and the resistor is the NVM cell. The 1T1R structure can scale and form an array. Similar to the DRAM array, the controller uses the word line to select a line from the NVM array. Two vertical lines, the source line and the bit line, are used for read and write operations [21]. The 1T1R array forms the basic structure of the NVM chip, which can be integrated into an NVM DIMM (details in Section 2.3).

PCM and RRAM can also compose the “crossbar” architecture, which can be used for

in-memory computing. The crossbar architecture uses a one-selector one-resistor (1S1R) structure. Because the 1S1R structure is smaller than the 1T1R structure, the crossbar array is able to achieve higher density [33]. In-memory computing architectures commonly use the crossbar array structure. We provide details of the in-memory computing NVM architecture in Section 2.4.

2.2 Data-intensive Applications

Data-intensive applications arise with the rapid growth of cloud computing today. They cover a wide range of scenarios: backends of web and mobile services, shared service platforms, machine learning applications, IoT applications, etc. Data-intensive applications intensively involve transferring, generating, and processing data. They are an essential part of many core services. Data-intensive applications require strong scalability, reliability, and availability [34]. Scalability guarantees that the services are accessible to millions of users around the world. Reliability guarantees that any threat to the services (e.g., cyber-attacks, power loss, and system errors) does not harm the availability of the service. Availability provides usability to the users and maintainability to the administrators. In this thesis, we focus on the two most popular types of data-intensive applications: big data applications and neural network applications.

2.2.1 Big Data Applications

Big data applications are applications that frequently access and manipulate a large volume of data. Typical big data applications are data mining, cloud computing, and streaming applications [35, 36]. Big data applications store data in databases and file systems and use caches and buffers to accelerate data access. A big data application processes data in batched or streaming form. The key factors of big data applications are the

complexity and size of the data. Therefore, the memory hierarchy is decisive for the performance of data-intensive applications.

Big data applications demand larger memory capacity [37]. With the rapid growth in the scale of the applications, the system needs to keep increasing the memory capacity to ensure the performance and quality of service of the application. We have seen exponential growth in the amount of data during the past two decades [38, 35]. Applications, such as cloud computing and scientific computing, demand frequent access to terabytes or even petabytes of data [39]. Attaching more block-level devices can satisfy the capacity requirement of the data but cannot provide the same level of performance to the application. Big data applications demand a scalable storage hierarchy: the capacity, latency, and bandwidth on all levels of the storage hierarchy need to match the data sizes.

In addition to memory capacity, big data applications also demand reliability against errors. As main memory-related errors dominate errors in computing systems [2, 40, 41], these errors could potentially corrupt data, interrupt or completely halt processing, and invalidate the generated results. For these applications, NVM can certainly provide a much larger main memory capacity compared to the slowly scaling DRAM. Nevertheless, providing reliability with low overhead requires a better understanding of the architecture.

Persistent memory applications. Certain big data applications require strong consistency of data, such that they have to satisfy the ACID (Atomicity, Consistency, Isolation, Durability) properties. NVM provides memory-level crash consistency that guarantees ACID, also known as “persistent memory”. Persistent memory applications with strong consistency are able to survive system crashes at any time. Persistent memory applications can be implemented with persistent memory libraries [42]. Persistent memory applications can also be block device and database applications running over a persistent memory file system [43, 44, 45, 46, 47]. Due to the fast speed of NVM, persistent memory applications obtain crash consistency with less overhead compared to applications that use lower-level storage. Compared to regular

applications, persistent memory applications frequently access the memory and use fence and flush instructions to force the ordering of data writeback. We provide a discussion of persistent memory designs in Section 2.3.

2.2.2 Neural Network Applications

The deep neural network (DNN) has made significant progress in prediction accuracy over the last decade. The evolution of DNNs broadens the use of neural networks in artificial intelligence, pattern recognition, image classification, and natural language processing. DNNs also boost research in various areas such as self-driving, face recognition, recommendation systems, and image processing.

One representative DNN is the convolutional neural network (CNN). CNNs are widely used in image and pattern recognition. When a CNN is used for image recognition, the input is the pixels of the image on all the channels. The output is a vector in which the value of each dimension represents the probability that the input is identified as the object represented by that dimension. The intermediate data between layers during processing is also referred to as the feature map. A CNN uses four major types of layers: the convolutional layer, pooling layer, ReLU (rectified linear unit) layer, and fully connected layer.

The convolutional layers are the majority of the layers in a CNN. The convolutional layer uses a set of kernels, which are small matrices, to conduct convolution with the input feature map [48]. Three hyperparameters decide how to process the entire image. The depth decides the number of neurons in the depth dimension that are active. The stride decides how many pixels the kernel moves at a time. The padding decides whether the padding should be zero and the spatial size of the padding. The computation of the convolutional layer is essentially multiplication and summation, which take up most of the computation in CNN processing.

The pooling layer is the layer that conducts non-linear downsampling. The purpose of the pooling layer is to reduce the size of the feature map, such that the amount of data and

computation is reduced. Therefore, the pooling layer controls the overfitting of the network model. Max pooling, which generates the highest value from the input window, is commonly used in CNNs. The typical pooling layer sits behind the convolutional layer and in front of the ReLU layer.

The ReLU layer uses an activation function to remove all the negative values. Other typical activation layers use tanh and sigmoid functions. The ReLU layer helps keep the behavior of the model close to linear, which is easier to optimize [49]. Because the ReLU layer prevents the vanishing gradient problem, it is critical to the learning process of the DNN.

The fully connected layers in a CNN are the same as those in a conventional neural network and usually connect to the very end of the CNN. In the fully connected layers, neurons are connected to all the activations of the previous layer. The computation of an activation in the fully connected layer is a matrix multiplication plus a bias offset.
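To make the four layer types concrete, the following C sketch applies one convolutional kernel, a 2×2 max pooling step, a ReLU activation, and a tiny fully connected layer to a toy single-channel feature map. All shapes, weights, and names are illustrative assumptions, not taken from any particular CNN model.

/*
 * Minimal sketch of the four CNN layer types on a toy single-channel input.
 * Shapes and weights are hypothetical; this is not a complete CNN.
 */
#include <stdio.h>

#define IN 6
#define K 3
#define OUT (IN - K + 1)   /* stride 1, no padding */

/* Convolutional layer: one kernel, multiply-and-sum over each window. */
static void conv2d(float in[IN][IN], float k[K][K], float out[OUT][OUT]) {
    for (int i = 0; i < OUT; i++)
        for (int j = 0; j < OUT; j++) {
            float acc = 0.0f;
            for (int ki = 0; ki < K; ki++)
                for (int kj = 0; kj < K; kj++)
                    acc += in[i + ki][j + kj] * k[ki][kj];
            out[i][j] = acc;
        }
}

/* Max-pooling layer: 2x2 non-linear downsampling. */
static void maxpool2x2(float in[OUT][OUT], float out[OUT / 2][OUT / 2]) {
    for (int i = 0; i < OUT / 2; i++)
        for (int j = 0; j < OUT / 2; j++) {
            float m = in[2 * i][2 * j];
            if (in[2 * i + 1][2 * j] > m) m = in[2 * i + 1][2 * j];
            if (in[2 * i][2 * j + 1] > m) m = in[2 * i][2 * j + 1];
            if (in[2 * i + 1][2 * j + 1] > m) m = in[2 * i + 1][2 * j + 1];
            out[i][j] = m;
        }
}

/* ReLU layer: remove all negative values. */
static float relu(float x) { return x > 0.0f ? x : 0.0f; }

int main(void) {
    float in[IN][IN], kernel[K][K] = {{1, 0, -1}, {1, 0, -1}, {1, 0, -1}};
    for (int i = 0; i < IN; i++)
        for (int j = 0; j < IN; j++)
            in[i][j] = (float)(i - j);                 /* toy feature map */

    float conv[OUT][OUT], pooled[OUT / 2][OUT / 2];
    conv2d(in, kernel, conv);
    maxpool2x2(conv, pooled);
    for (int i = 0; i < OUT / 2; i++)                  /* ReLU after pooling, as described above */
        for (int j = 0; j < OUT / 2; j++)
            pooled[i][j] = relu(pooled[i][j]);

    /* Fully connected layer: flatten and do a matrix multiply plus bias. */
    float w = 0.5f, bias = 0.1f, y = bias;
    for (int i = 0; i < OUT / 2; i++)
        for (int j = 0; j < OUT / 2; j++)
            y += w * pooled[i][j];
    printf("class score: %f\n", y);
    return 0;
}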

2.3 NVM DIMM and Persistent Memory

Deploying NVM on a DIMM is a common use of NVM. With an NVM DIMM, NVM can be accessed through the memory bus and form a hybrid or heterogeneous memory with DRAM. The high bandwidth and low latency of the memory bus match the NVM performance. JEDEC recently released the NVDIMM-P standard, which supports persistent memory and interconnection with DDR4 or DDR5 DRAM [50]. Intel's Optane DC Persistent Memory Module (referred to as the 3D-Xpoint DIMM) is an NVM DIMM based on 3D-Xpoint technology. The 3D-Xpoint DIMM can operate in two modes: Memory mode and App Direct mode. The Memory mode uses the 3D-Xpoint DIMM as the main memory and the DRAM DIMM on the same channel as a cache for the 3D-Xpoint DIMM. The App Direct mode allows bypassing the DRAM DIMM and directly accessing the 3D-Xpoint DIMM.

Recent studies provide a complete characterization of 3D-Xpoint DIMM performance [51, 52, 53]. A single 3D-Xpoint DIMM is able to provide 6.6 GB/s read bandwidth

and 2.3 GB/s write bandwidth at maximum. The read latency of the 3D-Xpoint DIMM ranges from 169 ns to 305 ns. The latency of flush instructions ranges from 62 ns to 90 ns. The actual write latency is immeasurable due to the existence of an internal buffer [53].

The reliability of the NVM DIMM is promising. Memory errors can be classified into soft errors and hard errors. Soft errors dominate the memory errors on many systems [2]. Most NVM technologies experience fewer soft errors than DRAM [54, 55]. The hard errors in NVM are mostly related to the limited endurance. The life cycle of NVM is at least 10^4× longer than that of NAND SSD [22, 56]. Products including the 3D-Xpoint DIMM are equipped with wear leveling and over-provisioning to combat hard errors [51].

While NVM provides non-volatility, “persistent memory” provides data persistence. Because of the volatile cache, a persistent memory design must guarantee that the data reaches the NVM in the correct order. Various architectural designs have been proposed in academia to support persistent memory [31, 57, 58, 59]. Intel extends x86 with the clwb and fence instructions to support persistent memory [9]. Intel also supports the Asynchronous DRAM Refresh (ADR) feature in recent Xeon processors [60]. ADR guarantees that all the data that has reached the 3D-Xpoint DIMM will be flushed into the NVM during a power loss.

Two mainstream persistent memory system support solutions are programming libraries and file systems. Even though users can directly apply flush and fence instructions, the engineering effort to transform a non-persistent memory application into a persistent memory application is painstaking. Both programming libraries and file systems ease the use of persistent memory. Various persistent memory programming libraries have been proposed [15, 61, 62, 63, 18]. When using a library, users declare a memory space as the persistent region similar to a standard memory allocation. The libraries manage and allocate the pool of persistent memory space. The libraries also conduct garbage collection to avoid memory space leakage and fragmentation. The user can access the persistent region with the provided persistent access API or atomic sections to specify memory accesses to be persistent. When using the libraries, all the cache flushes and memory

barriers are added to the persistent memory accesses of the application automatically. To ensure data consistency, the libraries generally maintain a log that keeps track of all the modifications to the persistent region. Each time the system restarts, it uses the log to verify all the ongoing modifications to the region and guarantees that the data is not corrupted.

The file system supported persistent memory requires less programming effort than the libraries [64, 11, 65, 12, 13, 66, 14]. Because the traditional file system is designed for non-volatile block devices, many features such as journaling already guarantee data persistence. The persistent memory file system has three major differences compared to the block device file system. First, the persistent memory file system supports direct access (DAX), which allows NVM accesses to bypass the page cache. The performance gap between DRAM and NVM is much smaller than the gap between NVM and SSD, so bypassing the page cache brings a significant increase in performance [67]. Second, the persistent memory file system allows access to NVM through loads and stores with the memory-mapped (mmap) DAX mode. When the NVM is mapped to the memory space under DAX mode, applications can directly access NVM with loads and stores, and bypass all the system caches. This feature can benefit certain applications when all the overhead of the system caches is removed. Third, persistent memory uses 8-byte atomicity. This finer granularity provides more flexibility and higher performance for garbage collection.
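The flush-and-fence discipline described above can be illustrated with a short C sketch that maps a file from a DAX file system and persists a small update with clwb followed by a store fence. The file path, mapping size, and the persist() helper below are hypothetical assumptions for illustration; this is a minimal sketch, not the interface of any particular persistent memory library.

/*
 * Minimal sketch of user-level persistence over a DAX-mapped file, assuming a
 * persistent memory file system mounted with DAX at a hypothetical path.
 * Build with a CLWB-capable compiler, e.g., gcc -mclwb; error handling reduced.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>
#include <immintrin.h>

#define PM_FILE "/mnt/pmem/example.dat"   /* hypothetical DAX-mounted file */
#define PM_SIZE (4UL * 1024 * 1024)
#define CACHELINE 64

/* Write back every cache line covering [addr, addr + len), then fence. */
static void persist(const void *addr, size_t len) {
    const char *p = (const char *)((uintptr_t)addr & ~(uintptr_t)(CACHELINE - 1));
    const char *end = (const char *)addr + len;
    for (; p < end; p += CACHELINE)
        _mm_clwb((const void *)p);   /* clwb: write back without evicting */
    _mm_sfence();                    /* order the write-backs before later stores */
}

int main(void) {
    int fd = open(PM_FILE, O_RDWR);
    if (fd < 0)
        return 1;
    /* MAP_SYNC keeps the mapping synchronous with the NVM media (DAX). */
    char *pm = mmap(NULL, PM_SIZE, PROT_READ | PROT_WRITE,
                    MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
    if (pm == MAP_FAILED)
        return 1;

    const char msg[] = "hello, persistent memory";
    memcpy(pm, msg, sizeof(msg));    /* regular store instructions */
    persist(pm, sizeof(msg));        /* make the update durable */

    munmap(pm, PM_SIZE);
    close(fd);
    return 0;
}

A persistent memory library such as the ones cited above would additionally log the update, so that a crash in the middle of the memcpy cannot leave the persistent region half-written.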

2.3.1 Motivation: Lack of Reliability Support in NVM DIMM

Many features of the NVM DIMM fit the demands of big data applications. The non-volatility and reliability provide resilience against errors. The capacity and performance provide scalability for increased data volume. However, we identify two research opportunities regarding the current reliability support for big data applications.

First, profiling of persistent memory applications from the hardware perspective is lacking. Existing profiling studies of persistent memory applications take the software perspective, which only provides insights based on operation throughput, the latency of software

requests, and counters of software events [68, 69, 51]. Profiling related to hardware components, including the cache hierarchy, TLBs, memory controllers, and buffers, is missing. Because persistent memory applications use forced cache writebacks and memory fences to order data writeback, many behaviors of these hardware components differ from those when the system deals with non-persistent memory applications. Hardware perspective profiling is critical to the performance of persistent memory applications.

Second, existing designs can provide the capacity demanded by big data applications, but not the reliability. The lack of coordination among the reliability schemes of NVM and other devices leads to compromised reliability. Existing reliability schemes, such as error correcting codes, only prevent errors within each device. However, memory errors increase significantly as the technology node scales down [1, 37]. To prevent the increased number of errors, the overhead of reliability schemes is non-negligible [70, 71, 72]. The decoupled reliability schemes ignore the feature of the memory hierarchy that certain errors could be corrected with data copies on other devices. As a result, such decoupled reliability schemes lead to performance overhead and compromised reliability for big data applications.

2.4 RRAM-based NN Accelerators

Several RRAM-based architectural designs for accelerating DNNs have been proposed recently [73, 74, 75, 76, 77, 78]. The most significant difference between RRAM-based accelerators and generic computation units or CMOS-based DNN accelerators is that RRAM-based accelerators conduct computation within RRAM cells using analog signals. RRAM-based accelerators use memristors to form a crossbar structure to perform multiply-accumulate operations within the RRAM itself.

Accelerator architecture. As Figure 2.1 (a) shows, the memristors are placed in a grid such that they are horizontally connected by the word lines and vertically connected by the bit lines.

Figure 2.1: Overview of RRAM-based DNN accelerator. (a) RRAM crossbar array with input voltages v1..vn, cell conductances gi,j, and output currents i1..im. (b) Multiply accumulator (MAC) built from crossbar arrays, DACs, ADCs, a shift-and-add (S+A) unit, and input/output buffers. (c) Tile composed of MACs, eDRAM, and a pooling unit. (d) Chip with tiles interconnected by routers and a controller.

For each memristor, the output current on the word line is the product of the memristor conductance and the input voltage on the bit line (Kirchhoff's law). The currents from all the memristors connected to the same word line are summed up. Hence, the input voltage vector V1×n conducts a matrix multiplication with the memristor conductance matrix Gn×m to produce the output current vector I1×m, i.e., I1×m = V1×n × Gn×m. The size of an RRAM crossbar array is limited by the device yield, device variability, and array parasitics [79]. The size of an RRAM crossbar array can vary from 12×12 [80] to 500×661 [81]. Crossbar arrays require auxiliary hardware components to perform matrix multiplication (as shown in Figure 2.1 (b)). First, they require digital to analog converters (DACs) and analog to

digital converters (ADCs) because the calculation is in analog form. The analog signal is only used within the crossbar arrays, while the digital signal is used everywhere else. Second, data buffers are also required to store input and output data temporarily. Typically, input buffers and output buffers are separate. Third, a shift-and-add (S+A) unit is required to combine the results of multiple crossbar arrays. Since a single memristor can only represent a finite number of bits, the matrix multiplication of high precision numbers (e.g., 32-bit integers) must spread the high and low bits across several crossbar arrays [74]. The shift-and-add (S+A) unit shifts the output according to the bit location of the input and adds it to the outputs generated by other crossbar arrays. Designs such as [77] propose to conduct analog summation to reduce the overhead of the intensive A/D conversions. These crossbar arrays, along with the shared components, form a multiply accumulator (MAC).

To process a neural network, a MAC has additional components to deal with non-matrix-multiplication operations in the network (as shown in Figure 2.1 (c)). First, a pooling unit is required to perform pooling operations for the pooling layers. Second, activation units are required to perform activation functions for the activation layers. Such units, including sigmoid, ReLU, and tanh, are used based on the type of the neural network. Third, embedded DRAM (eDRAM) stores and transfers data for the MACs. A buffer connects all the MACs and the data cache for data forwarding. All the MACs, together with the pooling unit, activation units, and eDRAM, form a tile.

As shown in Figure 2.1 (d), a tile is the highest-level computational component of an RRAM-based accelerator. Tiles are interconnected with routers to form a 2D mesh-connected network-on-chip (NoC). Tiles use routers to forward input and output data to each other. The chip controller is in charge of pre-processing and I/O operations, and connects to the NoC through a router.
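The following C sketch mimics, in digital form, the computation that the crossbar arrays and the S+A unit perform in analog: each crossbar array contributes a matrix-vector product I = V × G over one bit slice of the weights, and the shift-and-add step combines the partial results. The matrix sizes, bit widths, and weight values below are illustrative assumptions, not parameters of any specific accelerator.

/*
 * Digital model of the crossbar MAC: per-slice matrix-vector products
 * combined by shift-and-add. In real hardware the inner products happen
 * in analog on the bit lines; values here are purely illustrative.
 */
#include <stdio.h>

#define N 4              /* rows: input voltages / word lines */
#define M 3              /* columns: output currents / bit lines */
#define BITS_PER_CELL 2  /* bits stored per RRAM cell */
#define SLICES 4         /* crossbar arrays holding the weight bit slices */

/* One crossbar array: I[j] = sum_i V[i] * G[i][j] (the analog summation). */
static void crossbar_mvm(const int V[N], int G[N][M], long I[M]) {
    for (int j = 0; j < M; j++) {
        I[j] = 0;
        for (int i = 0; i < N; i++)
            I[j] += (long)V[i] * G[i][j];
    }
}

int main(void) {
    int V[N] = {1, 2, 3, 4};                                /* input vector */
    int W[N][M] = {{5, 1, 2}, {3, 7, 0}, {4, 2, 6}, {1, 0, 3}}; /* full-precision weights */

    /* Program each slice with BITS_PER_CELL bits of every weight. */
    int G[SLICES][N][M];
    for (int s = 0; s < SLICES; s++)
        for (int i = 0; i < N; i++)
            for (int j = 0; j < M; j++)
                G[s][i][j] = (W[i][j] >> (s * BITS_PER_CELL)) & ((1 << BITS_PER_CELL) - 1);

    long out[M] = {0};
    for (int s = 0; s < SLICES; s++) {
        long partial[M];
        crossbar_mvm(V, G[s], partial);
        for (int j = 0; j < M; j++)
            out[j] += partial[j] << (s * BITS_PER_CELL);    /* shift-and-add (S+A) */
    }
    for (int j = 0; j < M; j++)
        printf("I[%d] = %ld\n", j, out[j]);                 /* equals V x W */
    return 0;
}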

Inference using RRAM-based accelerators. When processing a DNN inference, the crossbar arrays process the convolutional and fully connected layers while the pooling units

process the pooling layers [75, 74]. The RRAM-based accelerator processes the convolutional layer in the weight stationary form [82] by assigning the weights of the kernel as the conductances of the crossbar array (i.e., Gn×m in Figure 2.1 (a)). The three-dimensional kernel is flattened to a single dimension, which fits into a column of the crossbar array. The output on the bit line directly corresponds to the output of the convolution. Weights of a kernel can be stored across the different columns of a crossbar array. Considering the limitations on the size of the crossbar array and the bitwidth of RRAM cells, all the weights of the same kernel are usually spread across multiple crossbar arrays (i.e., a column of crossbar arrays shown in Figure 2.1 (b)). When performing a convolution in the weight stationary form, the accelerator feeds input from the feature map (i.e., V1×n in Figure 2.1 (a)) according to the kernel size and stride size while keeping the kernel weights fixed in the crossbar arrays. When the kernel is mapped to several different arrays, the final outputs are summed using the intermediate outputs generated by these arrays (i.e., S+A in Figure 2.1 (b)).
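The weight-stationary mapping can be sketched as follows: a K×K×C kernel is flattened into a single column vector (the weights held in one crossbar column), and each flattened input window of the feature map plays the role of the voltage vector, so each output pixel is the dot product of the two, mirroring the bit-line current summation. The dimensions and values in this C sketch are hypothetical.

/*
 * Sketch of the weight-stationary mapping: flatten a KxKxC kernel into one
 * crossbar column and slide input windows over the feature map.
 * All sizes and values are illustrative.
 */
#include <stdio.h>

#define K 2                 /* kernel height and width */
#define C 3                 /* input channels */
#define H 4                 /* input feature map height and width */
#define STRIDE 1
#define LEN (K * K * C)     /* flattened kernel length = rows of the crossbar column */

/* Flatten kernel[kh][kw][c] into a single crossbar column (stored weights). */
static void flatten_kernel(float kernel[K][K][C], float column[LEN]) {
    for (int kh = 0; kh < K; kh++)
        for (int kw = 0; kw < K; kw++)
            for (int c = 0; c < C; c++)
                column[(kh * K + kw) * C + c] = kernel[kh][kw][c];
}

/* One output pixel: flatten the input window into the voltage vector and take
 * the dot product with the stored column (the analog bit-line summation). */
static float conv_at(float in[H][H][C], const float column[LEN], int oh, int ow) {
    float acc = 0.0f;
    for (int kh = 0; kh < K; kh++)
        for (int kw = 0; kw < K; kw++)
            for (int c = 0; c < C; c++)
                acc += in[oh * STRIDE + kh][ow * STRIDE + kw][c] *
                       column[(kh * K + kw) * C + c];
    return acc;
}

int main(void) {
    float in[H][H][C], kernel[K][K][C], column[LEN];
    for (int i = 0; i < H; i++)                 /* toy feature map */
        for (int j = 0; j < H; j++)
            for (int c = 0; c < C; c++)
                in[i][j][c] = (float)(i + j + c);
    for (int kh = 0; kh < K; kh++)              /* toy kernel weights */
        for (int kw = 0; kw < K; kw++)
            for (int c = 0; c < C; c++)
                kernel[kh][kw][c] = 0.1f * (float)(kh + kw + c);

    flatten_kernel(kernel, column);
    int out_dim = (H - K) / STRIDE + 1;         /* output feature map size */
    for (int oh = 0; oh < out_dim; oh++)
        for (int ow = 0; ow < out_dim; ow++)
            printf("out[%d][%d] = %.2f\n", oh, ow, conv_at(in, column, oh, ow));
    return 0;
}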

2.4.1 Motivation: Missing Features Demanded by the Neural Network Applications

Despite the advantages of the RRAM-based NN accelerator, two major demands are missing from the existing architecture.

First, existing RRAM hardware lacks resilience against interference. The RRAM hardware faces interference from thermal effects, crosstalk, and write disturbance [21, 83, 84]. These interferences could impact the accuracy of the crossbar multiplication. However, there is no quantification of the impact of such interference on neural network accuracy during the runtime of an RRAM-based neural network. Therefore, the corresponding countermeasures have not been proposed either.

Second, the pipeline design in the existing RRAM architecture is incomplete. For instance, the data forwarding of how feature map data is shared and distributed among kernels belonging to

19 the same layer is unidentified. The data forwarding is critical and can cause pipeline idles when the computation and data transfer are incompatible. The usability of the existing accelerator is also missing. There lacks an approach to assignment the hardware to each layer in the network. For the regular users who have no knowledge of the architecture details, handling the hardware assignment is impractical and can lead to underutilization of the hardware.

Acknowledgments

This chapter contains material from “Mirage: A Highly Paralleled And Flexible RRAM Accelerator.”, by Xiao Liu, Minxuan Zhou, Rachata Ausavarungnirun, Sean Eilert, Ameen Akel, Tajana S. Rosing, Vijaykrishnan Narayanan, and Jishen Zhao, which is submitted for publication. The dissertation author is the primary investigator and first author of this paper.

Chapter 3

Persistent Memory Workload Characterization: A Hardware Perspective

Deploying byte-addressable NVMs on the memory bus enables a new data storage concept called “persistent memory” [85]. Representative NVMs include spin-transfer torque RAM [86], phase-change memory [87], resistive RAM [88], battery-backed DRAM [89, 90], and the 3D XPoint™ memory announced by Intel and Micron [8, 91]. They combine the benefits of both worlds: the byte-addressability and fast load/store interface of memory with the data recoverability of storage. As such, NVM allows applications to directly load and store data in main memory. Due to this game-changing feature, NVMs are poised to radically improve data manipulation performance and energy efficiency. NVMs also motivate the concept of persistent memory, which guarantees that data is recoverable (consistent and durable) through power outages and system crashes; persisting data through the slow storage I/O interface is no longer required. Traditional computer systems manage data persistence through the software mechanisms of file systems and databases [92, 93]. Yet main memory access is manipulated by both virtual memory software and hardware control units, such as CPU caches, buffers, and memory controllers. As

such, to fully exploit the potential of persistent memory, it is beneficial to design system software and applications with exposure to hardware details. However, most previous persistent memory systems are motivated and evaluated by software-based performance profiling and characterization [68, 69, 11, 67]. These profiling schemes provide system performance implications in terms of operation throughput, software request latency, and software events (e.g., context switches, system calls, and thread migrations). But they treat hardware components, such as the CPU cache hierarchy, TLBs, memory controllers, and buffers, as a black box. As a result, very little information about hardware components is available for persistent memory software or software/hardware coordinated designs. In this work, we intend to provide implications for optimizing persistent memory system designs from a hardware perspective. We analyze the performance and hardware behaviors of persistent memory systems by running various storage workloads with a variety of configurations (e.g., read/write intensity, file size, and number of threads) on traditional and persistent memory file systems. To examine the performance impact of various persistent memory system and workload configurations, we profile the statistics of a variety of software events and hardware performance counters available in modern processors. As a result, we find the following implications and recommendations that might benefit persistent memory software designs.
• We identify that persistent memory system performance is more closely correlated with the configurations and behaviors of several particular hardware components, such as last-level cache (LLC) accesses and instruction and data translation look-aside buffers (TLBs), than with others. As such, persistent memory system designers may prioritize optimizing accesses to the hardware components with the largest correlation.
• We show that persistent memory system designs perform differently when running different types and configurations of applications on top. Performance varies significantly across read- and write-intensive workloads, due to the latency gap between DRAM and NVM devices, along with the asymmetric read and write latency of NVMs.

• We show that the recently introduced direct access (DAX) support in Linux substantially enhances file systems to exploit the performance benefit of persistent memory. But the current design is subject to limitations on the flexibility and performance offered to applications.
• Whereas NVM devices do not require row buffers (a set of sense amplifiers and latches that temporarily store data values) as DRAM does, buffers, if present on NVM chips, can significantly mitigate the performance penalty of long latency, low bandwidth, and asymmetric read/write latency in NVM access. As such, system software and applications with high memory access locality can exploit such buffers to optimize system performance.
• We demonstrate that data atomicity support, such as journaling, imposes considerable performance overhead even with the latest persistent memory file system optimization mechanisms. To further optimize performance, persistent memory systems can adapt to different levels of atomicity based on the application’s requirements.

3.1 Background and Motivation

3.1.1 Background on Persistent Memory Systems

Modern computer systems decouple fast data access and data persistence by adopting separate main memory (DRAM) and storage (disks and flash) components. This decades-long model is being enriched by a new tier of memory, “persistent memory”, which offers a single-level, flat data storage model by incorporating the persistence property of storage and the fast load/store access of memory [94, 95, 96, 97]. Persistent memory can be implemented by attaching byte-addressable NVMs – such as spin-transfer torque RAM [98], phase-change memory [87], resistive RAM [88], battery-backed DRAM [89], and 3D XPoint™ memory [8] – directly to the processor-memory bus. With the appearance of NVMs on the market in 2019 [91], persistent memory can radically change the landscape of software and hardware design in future computer systems. As a hybrid of memory and storage, persistent memory systems need to maintain data

persistence (crash consistency) [85] – just like storage systems – to ensure that data is recoverable in the face of system crashes or power outages. Data persistence support can be implemented at various memory system components, such as libraries [61, 99], in-memory data structures [100, 101], operating systems [95], file systems [64, 67, 102, 11, 103, 104], database systems [105, 106], and microarchitectures [31, 107, 108, 57, 109, 110]. Among these, persistent memory file systems are promising solutions due to their ease of use by application programmers and legacy application code [64, 67, 11, 103, 104]. With minor or no modifications, various applications can directly execute on top of persistent memory file systems without being aware of the difference between persistent memory and traditional disk-based storage devices. To offer optimal application performance, persistent memory file systems need to be carefully designed to fully exploit the characteristics of memory and storage systems incorporated with NVM devices [68].

3.1.2 The Need for Understanding Hardware Behavior

Most recent persistent memory file system designs are motivated and evaluated by pure software performance profiling and characterization [11, 104, 61]. Such profiling can provide many insights into system operation. But it remains unclear how well the system designs, along with the applications running on them, utilize the underlying hardware components of computer systems. For instance, traditional disk-based file system designs focus on optimizing the data transfers between main memory and storage components, which are mainly managed by software mechanisms. However, by attaching to the processor-memory bus, persistent memory systems need to manage data transfers between CPU caches and main memory, and most components in this cache-memory hierarchy are managed by hardware. Memory access heavily relies on various hardware components, such as CPU caching, buffering, address translation, memory controllers and interfaces, and memory device architecture. Without understanding the behavior of these components, persistent memory file system designs can be sub-optimal. In this work, we expose

substantial room for improvement in future persistent memory designs once the detailed hardware behaviors are taken into account.

3.1.3 Row Buffer in Memory Devices and Its Impact on NVM Emulation

Row Buffer in Memory Devices. Memory (either DRAM or NVM) is physically organized as two-dimensional arrays (rows and columns) of bitcells. With DRAM, each read or write access requires buffering an entire row of the array in a row buffer structure (a set of sense amplifiers and latches that temporarily store data values). The typical size of a DRAM row buffer can be up to 8KB [111]. Accessing the row buffer has a much shorter latency than accessing the memory array, whereas a miss in the row buffer must be serviced in multiple steps: writing the buffered row back to the memory array, reading the new row to be accessed from the memory array into the row buffer, and then performing the access. As a result, DRAM accesses with high locality – consecutive accesses to the same row (i.e., hits in the row buffer with the “open” row policy [112]) – achieve much better access performance than those that miss in the row buffer. Based on our industrial collaborator’s exploration, NVM devices may not require row buffers due to their discrepant cell devices and sensing mechanisms compared to DRAM. But several memory microarchitecture studies indicate that small buffers in NVM (e.g., 4Kb) – where multiple adjacent columns share one sense amplifier through a multiplexer – can significantly improve memory access performance [113, 114], given that NVMs typically impose much longer access latency and lower bandwidth than DRAM. As such, we investigate the performance impact of buffers on NVM devices in persistent memory systems and demonstrate the substantial benefits of employing them.
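A toy latency model makes the effect of row-buffer locality concrete; the numbers below are illustrative only and do not come from the cited studies.

    def avg_access_latency(hit_rate, t_hit=15, t_array=45):
        """Average access latency with an open-row policy: a hit is served from
        the row buffer, while a miss pays for writing the old row back and
        reading the new row before the access itself is served."""
        t_miss = t_array + t_array + t_hit   # row writeback + row read + access
        return hit_rate * t_hit + (1.0 - hit_rate) * t_miss

    print(avg_access_latency(0.9))   # high locality stays close to t_hit
    print(avg_access_latency(0.5))   # poor locality approaches the miss cost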

Impact on NVM Emulation. As NVM devices were not yet available on the market, most previous persistent memory studies emulate NVM using DRAM [94, 115]. However, the impact of the buffers on NVM devices is not fully considered. For example, PMEP [94] collects an average memory access latency – which is a mix of row-buffer hits and misses – and then

multiplies it by a ratio of NVM/DRAM latency [94]. Yet hits in the buffers on NVM chips can effectively close the latency gap between DRAM and NVM access. Naïvely treating all persistent memory accesses with the NVM latency penalty can lead persistent memory file system designers to consider NVM’s long read and write latency in an over-pessimistic manner. To examine this effect, we take the buffers on NVM devices into account in our studies. In particular, we distinguish the memory access latency of buffer hits and misses, and apply the NVM long read/write latency penalty only to buffer misses.

Figure 3.1: System platform configuration.
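A minimal sketch of this emulation adjustment is shown below (the latencies and the trace format are made up for illustration): only accesses that miss in the NVM buffer receive the NVM read/write penalty, while buffer hits keep a DRAM-like latency.

    def emulated_latency(trace, buffer_size=2048, dram_ns=60,
                         nvm_read_ns=150, nvm_write_ns=500):
        """Estimate average NVM access latency from an address trace.
        trace: list of (address, is_write); a single open buffer is modeled."""
        open_row = None
        total = 0.0
        for addr, is_write in trace:
            row = addr // buffer_size          # buffer-sized region being touched
            if row == open_row:                # buffer hit: DRAM-like latency
                total += dram_ns
            else:                              # buffer miss: pay the NVM penalty
                total += nvm_write_ns if is_write else nvm_read_ns
                open_row = row
        return total / len(trace)

    # Sequential 64B accesses mostly hit the open buffer, so the average
    # latency stays close to DRAM; random accesses would mostly miss.
    seq = [(i * 64, False) for i in range(10000)]
    print(round(emulated_latency(seq), 1))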

3.2 Methodology

To investigate representative file systems and storage workloads, we choose one popular traditional file system and two of the latest persistent memory file system implementations, and exercise them with six widely used storage benchmarks running on a workstation configured with emulated NVM.

3.2.1 System Platforms

To perform our measurements, we use a workstation configured with an Intel Xeon E5-2620 v3 processor (Figure 3.1). The processor contains six 2.4GHz two-way multithreaded cores. Each core has private 32KB 8-way set-associative L1 data and instruction caches and a private 256KB 8-way set-associative L2 cache. All the cores share a 15MB 20-way set-associative last-level cache (LLC). To access hardware performance counters, we use the Linux perf utility [116] to profile various hardware counters (and also software events) as listed in Table 3.1. We run each workload multiple times and calculate the mean and the standard deviation of each software and hardware counter event. The workstation runs the Ubuntu 14.04.6 LTS operating system; we upgrade the kernel to version 4.4.0. The workstation has four DDR4-2133 RDIMMs. We emulate 12GB of NVM on DRAM with various read and write latencies1. To configure the NVM partition, we use the memmap GRUB (boot loader) parameter to mark a specific 12GB range of DRAM as the persistent memory partition, mounted as pmem0. We further ensure that other operating system functions do not use the persistent memory partition. We update the boot loader configuration and recompile the kernel with the parameters that support DAX and NVDIMM enabled.
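For reference, the counters listed in Table 3.1 can be gathered with the perf utility; the sketch below shows one way to collect them programmatically. The event names are common perf aliases but may differ across kernel versions and CPUs (perf list shows the supported set), and the workload command is a placeholder.

    import subprocess

    # A subset of the events in Table 3.1; treat the exact names as examples.
    EVENTS = [
        "L1-dcache-loads", "L1-dcache-load-misses",
        "LLC-loads", "LLC-load-misses", "LLC-stores", "LLC-store-misses",
        "dTLB-loads", "dTLB-load-misses", "iTLB-load-misses",
        "page-faults", "context-switches", "cpu-migrations",
    ]

    def run_with_perf(cmd):
        """Run `cmd` under `perf stat` and return perf's CSV-style counter output."""
        perf_cmd = ["perf", "stat", "-x", ",", "-e", ",".join(EVENTS)] + cmd
        result = subprocess.run(perf_cmd, capture_output=True, text=True)
        return result.stderr        # perf stat writes counter values to stderr

    if __name__ == "__main__":
        print(run_with_perf(["sleep", "1"]))   # placeholder workload command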

3.2.2 File Systems

Our evaluation includes three file systems. Two of them are variants of ext4, and the last is a state-of-the-art persistent memory file system2.

Ext4. Ext4 [93] is a popular file system used in Linux. Ext4 is an extent-based file system with reduced metadata overhead. Ext4 employs journaling to guarantee the persistence of both data and metadata updates. It supports three journaling modes, which provide different levels of

1We also employed 128GB memory to study some of our workloads. However, the shape of the results is consistent with the two different memory sizes. Therefore, we only show our results with 12GB emulated NVM. 2We do not show analysis of Intel persistent memory file system (PMFS), because it is no longer being maintained [10].

atomicity for data and metadata. 1) Data-writeback mode does not perform any data journaling. This mode is intended for optimizing performance, but it cannot recover data across system crashes. 2) Data-ordered mode records only metadata in the journal. This mode provides decent performance (lower than data-writeback mode) and has the advantage of being able to replay the journal and check for inconsistencies. 3) Data-journal mode writes both metadata and data into the journal. This mode ensures that data is recoverable, at the cost of notable performance degradation.

Table 3.1: Profiled software and hardware events. (S) represents software events. The rest are hardware counter events.

Cache Events: L1 data cache loads, L1 data cache load misses, L1 data cache stores, L1 data cache store misses, L1 data cache prefetches, L1 data cache prefetch misses, L1 instruction cache loads, L1 instruction cache load misses, L1 instruction cache prefetches, L1 instruction cache prefetch misses, LLC loads, LLC load misses, LLC stores, LLC store misses, LLC prefetch misses
TLB Events: dTLB loads, dTLB load misses, dTLB stores, dTLB store misses, dTLB prefetches, dTLB prefetch misses, iTLB loads, iTLB load misses
Other Events: page faults (S), context switches (S), CPU migrations (S), branch loads, branch load misses

Ext4-DAX. The latest persistent memory file systems can bypass the page cache allocated in DRAM and directly access NVM via loads/stores by using a technique called Direct Access (DAX) [90]. DAX maps storage components directly into userspace. DAX handles page faults as exceptions and directly copies the pages from NVM to DRAM main memory. Currently, several traditional file systems, including XFS, ext4, and Btrfs, can be configured to use DAX mode. This work focuses on ext4. We leave the investigation of the other two file systems (XFS and Btrfs) as future work. Note that ext4-DAX uses journaling to persist metadata updates only; it does not support file content journaling. To ensure a fair comparison between ext4 and ext4-DAX, we configure both in data-ordered mode, where only metadata is journaled.

Table 3.2: Storage workload descriptions.

Name                               Thread Count   File Count   Total Size (GB)   Behavior
Fileserver (Filebench)             50             10k          5.7               Emulates the basic file access pattern.
Webproxy (Filebench)               100            10k          10                Emulates a webproxy server.
Varmail (Filebench)                16             1000         10                Emulates the access pattern of a simple mail server.
File compression & decompression   1              2            4.6               Conducts zip and unzip operations over a folder.
FFSB                               20             110          11                Emulates multi-threaded application access to many small files under the same folder.
Bonnie                             1              1            5                 Performs a pre-defined set of tests on a single file.

NOVA. NOn-Volatile memory Accelerated log-structured file system (NOVA) is one of the latest persistent memory file systems [11]. It adopts three optimizations for persistent memory. First, NOVA is a log-structured file system (LFS), which naturally logs data updates. Therefore, it only journals metadata to ensure data persistence; the journaling process records only the directory’s log tail pointer and the new inode’s valid bit. Second, NOVA provides optimizations for garbage collection. NOVA contains two types of garbage collection: fast and thorough. Fast garbage collection reclaims a log page if all of its entries are dead. Thorough garbage collection reclaims two pages by transferring their data to a new page if both pages have more than 50% free space. Third, NOVA uses a red-black tree as the free list to improve allocation and deallocation speed. The tree provides effective merging and fast deallocation.
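The two garbage-collection policies can be summarized with a small decision sketch. This is a simplification for illustration only, not NOVA’s actual code; it merely follows the thresholds described above.

    def fast_gc(log_pages):
        """Fast GC: reclaim any log page whose entries are all dead."""
        return [p for p in log_pages if p["live_entries"] > 0]

    def thorough_gc(log_pages, page_capacity):
        """Thorough GC: merge two adjacent pages into one new page when each
        is more than half empty, reclaiming one page's worth of space."""
        kept, i = [], 0
        while i < len(log_pages):
            p = log_pages[i]
            if (i + 1 < len(log_pages)
                    and p["live_entries"] < page_capacity // 2
                    and log_pages[i + 1]["live_entries"] < page_capacity // 2):
                merged = {"live_entries": p["live_entries"]
                                          + log_pages[i + 1]["live_entries"]}
                kept.append(merged)     # copy the live entries of both pages here
                i += 2
            else:
                kept.append(p)
                i += 1
        return kept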

3.2.3 Persistent Memory Workloads

Table 3.2 lists all the workloads used in our experiments. They include three workloads from the Filebench suite [43], a file compression and decompression workload, and two microbenchmarks – FFSB [117] and Bonnie [45]. Before each

experiment, we exercise the DRAM and persistent memory partitions by writing, appending, and deleting files of various sizes. We also experiment with various file sizes in each workload.

Filebench. Filebench [43] is a benchmark suite designed for evaluating file systems. It can generate various workloads to emulate a variety of real-world applications. We generate three different workloads, fileserver, webproxy, and varmail, to emulate the typical file system access patterns of file, web proxy, and email servers.
• Fileserver emulates the basic file server access pattern, including a sequence of creates, deletes, appends, reads, writes, and metadata operations, performed by multiple “user” threads.
• Webproxy emulates a plain webproxy server. We employ 100 threads to perform a mix of create, append, read, and delete operations on 10,000 files with an average size of 1MB. This workload is characterized by a fairly flat namespace hierarchy with a directory width of 1,000,000, i.e., all files are deployed in one large directory. The workload has 100 threads with a 5:1 read/write ratio.
• Varmail emulates the access pattern of a mail server. The operations of this workload include create-append-synchronization, read-append-synchronization, reads, and deletes. We configure a directory with a width of one million. It stores 1000 files with a median file size of 100MB. This workload has 16 threads. The read/write ratio is 1.

File Compression and Decompression. We employ zip and unzip to investigate the performance of file compression and decompression in persistent memory systems. The zip operation compresses archives, while the unzip operation extracts them. We use them to compress and decompress two MPEG-4 files, each with a size of 2.3GB.

FFSB. The Flexible File System Benchmark (FFSB) [117] generates customizable workloads to measure file system performance. It employs pthreads to support multiple groups of threads that access multiple files simultaneously. We use it to generate a set of workloads that perform a mix of read, write, re-write, and append operations. We set up various read/write ratios and random access patterns throughout the experiments. We employ 20 threads to perform

random read and write operations on 110 files; each file is 100MB in size.

Figure 3.2: Throughput of FFSB on four systems. The x-axis represents the read-to-write ratio. The y-axis represents the throughput.

Bonnie. Bonnie64 [45] measures file system performance by invoking the putc() and getc() macro operations on files. In addition, it can create four child processes that execute 4000 seek operations to random locations in a file; for 10% of the seek operations, the block that is read is modified and re-written. Our experiments perform putc(), getc(), and random seek operations over files with a total size of 5GB. putc() imitates a write operation, while getc() imitates a read operation.

3.3 The Impact of Write Intensity

To investigate the impact of storage workloads’ data access behavior on persistent memory performance, we evaluate the throughput of FFSB and Bonnie by varying their write intensity. The two workloads provide us with the flexibility of configuring read/write ratio during their execution.

3.3.1 Impact of Write Intensity on System Performance

Figure 3.2 shows the transaction throughput of FFSB configured with various read/write ratios ranging from 100% reads (R=100%, W=0%) to 100% writes (R=0%, W=100%). Because

the throughput deviation is within 5% across multiple runs, we only show the average throughput in Figure 3.2. In addition to ext4, ext4-DAX, and NOVA, we also study ext4 with direct IO reads and writes by enabling the o_direct flag [118] (later referred to as ext4-o_direct). O_direct has been available in Linux since kernel version 2.4.10. It attempts to minimize I/O caching and fits situations where the application performs its own caching [118]. Like DAX, o_direct bypasses the page cache to exploit the byte-addressability of NVMs. However, the limitation of using the o_direct flag with the open system call is that workloads must issue IO operations whose sizes are multiples of 512 bytes [118]. We make the following major observations from our results. First, the overall throughput decreases as write intensity increases. For a write request, the file system performs several operations, including reserving space, rearranging existing data, and committing new data. These operations typically take longer to process than read request operations. As Figure 3.2 shows, the throughputs of all the file systems decrease as the write intensity increases. However, different file systems show vastly different downtrends. Second, system throughput with NOVA decreases much faster than with the other file systems when the write intensity increases. NOVA performs the best with read-intensive workloads (2.1× the throughput of ext4 and 1.5× the throughput of ext4-DAX when the workload only issues read requests). But ext4-o_direct performs the best with write-intensive workloads (in particular, NOVA achieves only 27% of ext4-o_direct throughput and 39% of ext4-DAX throughput when the workload only issues write requests). As discussed in Section 3.2, ext4-DAX does not support data journaling mode. Therefore, we configure all ext4 variations with metadata journaling only; data is directly updated in place. Thus, the principal cause of this drop is that NOVA supports atomicity of both data and metadata, whereas the ext4 variants only ensure metadata atomicity. NOVA handles a write request in the following way: 1) allocate new data blocks, 2) copy user data, 3) append to the log, 4) update the radix tree, 5) free old blocks by invoking the garbage collector. This results in a substantial performance drop for NOVA compared to the other file systems (which update data in place) on write-intensive workloads. Moreover, due

to FFSB’s preallocation, all the write operations in the FFSB benchmark are rewrites of existing blocks. As a result, NOVA is inferior to ext4-DAX for the write-intensive workloads. To further investigate the journaling overhead, we run the write-intensive workload on ext4 in data-journal mode, which has the same level of write protection as NOVA. Doing so decreases the throughput to 23.6% of ext4 with metadata journaling, which is about 60% of NOVA’s throughput. Because ext4-DAX with data journaling is still under development, we assume that its throughput would decrease similarly to ext4’s, and we therefore estimate it by linear scaling. The results show that the throughput of ext4-DAX with data journaling could be around 180% to 89% of NOVA’s. Even under the same degree of write atomicity, ext4-DAX still has equivalent or better performance than NOVA on the write-intensive workload. Third, the DAX feature provides considerable performance benefits for NVM file systems. As Figure 3.2 shows, ext4-DAX achieves 2× the throughput of ext4 across all the workload configurations. DAX bypasses the page cache, which is a buffer allocated in DRAM. Such buffering can improve the performance of slow block-based storage, but due to NVM’s properties – no seek overhead and DRAM-comparable latency – the page cache is nothing but an unnecessary copy of the data in the hybrid memory system. To bypass the page cache, DAX directly maps the NVM device into userspace. For read-intensive workloads, the performance of ext4-DAX is 3× that of ext4, while for write-intensive workloads it is 1.3×. Because we use DRAM as emulated NVM, DAX could show even more benefit on an actual NVM device, which has longer write latency. Finally, the throughput trend of ext4-o_direct is the most stable. The throughput difference between the 100% write configuration and the 100% read configuration is only 28%, while the differences for ext4, ext4-DAX, and NOVA are 78%, 62%, and 82.7%, respectively. Ext4-o_direct also performs the best among the write-intensive configurations (write ratio above 50%). DAX is theoretically superior to o_direct because it is intentionally optimized for NVM. However, our results show the opposite. A possible reason is that workloads achieve direct loads and stores to the storage only when using mmap with DAX. Without mmap, DAX still requires a copy between memory

and pmem0. Currently, the FFSB workload has no support for memory-mapped file IO, so we use the default setting, and the results show that o_direct performs better. The combination of mmap and DAX could provide the most benefit for NVM file systems. Furthermore, we use the fio [44] benchmark, which adopts mmap, to investigate DAX and o_direct. However, the performance results of the two modes are similar. We leave further evaluation of DAX with mmap as future work.

Figure 3.3: Normalized results of Bonnie. The x-axis shows the six metrics. The y-axis shows each metric normalized to the result on ext4-DAX.

3.3.2 Bonnie Results

Figure 3.3 shows the results of the Bonnie64 workload. We configure the write-to-read ratio as 1:1 throughout the experiments. We show the results for six metrics: putc() throughput, block writes, block create-change-rewrite, getc() throughput, efficient block reads, and effective random seek rate. All metric results are normalized to the results on the ext4-DAX mounted system. We make two observations from the results. First, NOVA reads fast because of its effective random seek rate, not its block read speed. NOVA has the best random seek rate, about 2.5× that of ext4 and 1.8× that of ext4-DAX, while the block read speeds of NOVA and ext4-DAX are similar. Second, NOVA has a faster block write speed while

ext4-DAX has better block create and rewrite speeds. NOVA’s block write speed is 2.1× that of ext4-DAX, while ext4-DAX’s block create speed is 1.2× that of NOVA. As previously discussed, NOVA pays the data journaling overhead while ext4-DAX does not. Even though NOVA’s advantage on writes is larger than ext4-DAX’s advantage on rewrites, Bonnie performs more rewrites than writes and appends. Therefore, NOVA and ext4-DAX have similar putc() throughputs (a difference within 7%).

3.4 The Impact of NVM Access Locality

In this section, we examine the performance impact of adopting hardware buffers in the NVM components of persistent memory systems. As discussed in Section 3.1.3, hits in the row buffer can be a critical factor in the latency model. The row buffer is the cache for memory rows in DRAM; future NVM devices may contain a similar structure, which we simply call a “buffer” in this work. Like the row buffer in DRAM, the buffer acts as a cache for NVM rows. The NVM buffer could differ from the DRAM row buffer because of the different write power requirements [22, 119]. Therefore, we conduct a study to explore the impact of the buffer size on performance. In addition, we also discuss the impact of the buffer hit rate on performance.

3.4.1 Buffer Hit Rate

We present the results of a sensitivity study of the buffer hit rate in Figure 3.4. We estimate performance based on the NVM architecture in [22]. The results include two baselines: DRAM and classic NVM, where classic NVM ignores the existence of the buffer. We study buffer hit rates from 50% to 90%; the row buffer size is set to 2Kb across the experiment. We make the following observations from the results. First, the system throughput increases rapidly as the buffer hit rate increases. Each time the hit rate increases by 10%, NOVA throughput increases by 1.9% while ext4 and ext4-DAX throughput increase by 1.4%. Second, NVM with a 90%

hit rate shows only a minor performance difference from DRAM. We find that the performance difference between DRAM and NVM with a 90% buffer hit rate is within 2% for all the file systems. Certain types of workloads can achieve 90% hit rates; for example, streaming read and write workloads can reach a high buffer hit rate. We estimate the buffer hit rate as follows. We assume the memory access granularity is the cache line size (64B). When accessing a contiguous address space, the row buffer misses once each time the workload moves on to the next buffer-sized chunk of data. Therefore, when the file size is much larger than the buffer size, the buffer hit rate approaches 96.9%. We conduct a further investigation with the zip workload, a typical streaming application, to show that 90% hit rates are common in real workloads. We employ Pin [120] to measure the row buffer locality of the workload, using Pin-measured page locality to estimate the buffer hit rate because of the similarity between the page size and the buffer size. The result shows that the buffer locality is above 98% across the three file systems, and the variance between different systems is less than 0.01%. We can therefore expect that the performance difference between NVM- and DRAM-based systems is insignificant when running these workloads.

Figure 3.4: Filebench throughputs under various memory models. The y-axis represents the throughput. The x-axis represents the memory models, which include DRAM, classic NVM with no buffer, and NVM with buffer hit rates from 50% to 90%.

Figure 3.5: Filebench throughputs of DRAM and NVMs with different row buffer sizes. The workload has 60% sequential read operations.

Figure 3.6: Filebench throughputs of DRAM and NVMs with different row buffer sizes. The workload has 80% sequential read operations.
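The sequential-access estimate of Section 3.4.1 can be reproduced in a few lines, assuming 64B accesses and interpreting the 2Kb buffer as 2048 bytes (the setting that yields the 96.9% figure quoted above).

    def sequential_hit_rate(access_granularity=64, buffer_size=2048):
        """For streaming access, only the first access to each buffer-sized
        region misses, so the hit rate approaches 1 - granularity/buffer_size."""
        accesses_per_buffer = buffer_size // access_granularity
        return 1.0 - 1.0 / accesses_per_buffer

    print(f"{sequential_hit_rate():.1%}")   # ~96.9% for 64B accesses, 2048B buffer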

3.4.2 Buffer Size

We present the results of a sensitivity study of the buffer size in Figure 3.5 and Figure 3.6. We use the same performance model as in the buffer hit rate study and evaluate buffer sizes from 256b to 4Kb. Our evaluation uses two different workload sequential read ratios: 60% (Figure 3.5) and 80% (Figure 3.6). We make two observations from the results. First, the throughput increases with

the row buffer size. When the buffer size increases, random memory accesses have a higher chance of hitting an address range already held in the buffer; therefore, the buffer hit rate increases with the buffer size. The increased buffer hit rate reduces the memory access cycles and results in higher throughput. Second, a shrunken row buffer could hurt NVM performance in the future. Under both the 60% and 80% sequential read tests, the throughput drops by 2% each time the buffer size shrinks by 50%. For future NVM, the buffer size might be 1/2 to 1/8 of the regular DRAM row buffer size [22], which could lead to a 2%–8% performance loss. Future NVM buffer designs could explore eviction and prefetch strategies to compensate for this performance loss.

3.5 Microarchitecture Analysis


Figure 3.7: Correlation between workload performance and five hardware and software counter statistics. The x-axis represents the benchmarks. The y-axis represents the correlation coefficient between the throughput and each metric.

Furthermore, we examine the relationship between persistent memory system performance and various hardware and software event counters. Our experiments across six workloads identify that four hardware counter events and one software event are most closely correlated with persistent memory system throughput. Figure 3.7 plots the statistical correlation between throughput and these five metrics across the six benchmarks. The x-axis represents each benchmark and the y-axis is the correlation coefficient between throughput and each metric. These metrics are usually considered

the primary sources of performance overhead. We make the following observations based on the results. First, the dTLB miss rate shows a strong negative correlation with the throughput across all the benchmarks. This indicates that address translation misses contribute more to the performance overhead than data cache misses; increasing the size of the TLB could provide a performance gain for persistent memory systems. Second, the page fault rate is the least correlated with performance. As Figure 3.7 shows, the correlation between page faults and throughput varies drastically among different workloads. In this work, more page faults do not necessarily lead to bad performance. The reason for this behavior is that all the page faults collected by perf are reported as minor page faults in the DAX-featured file systems: DAX handles a page fault as an exception, and because the NVM has the same level of latency as DRAM, the system does not suffer as much overhead as it does in conventional systems. Last, the LLC load and store miss rates show an insignificant correlation with performance. However, this relates to a limitation of the perf tool. In the current system, all LLC misses visit DRAM because the main memory is pure DRAM, whereas an LLC miss could access NVM in a hybrid main memory. Even though we partition DRAM to simulate hybrid memory in the current system, the perf tool cannot distinguish whether an LLC miss leads to an NVM access or a DRAM access. Therefore, the LLC miss rate measured by perf is not equal to the NVM access rate. The perf tool should be extended to support profiling for a specific type of memory. Based on these observations, we make the following suggestions. For persistent memory file system design, we suggest that the dTLB miss rate be another guideline besides throughput. We also suggest that page faults can be ignored, especially for DAX-supported file systems. Finally, we suggest extending existing hardware profiling tools to support hybrid memory, categorizing the hardware metrics generated by DRAM and NVM accesses separately.
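The correlation coefficients plotted in Figure 3.7 are standard Pearson correlations between per-run throughput and each counter statistic; the short sketch below illustrates the computation with made-up numbers.

    import numpy as np

    # Per-run measurements for one workload (made-up numbers for illustration):
    # throughput and the corresponding hardware/software counter rates.
    throughput      = np.array([9.8, 9.1, 8.4, 7.7, 7.1])   # transactions/s (x1000)
    dtlb_miss_rate  = np.array([0.8, 1.1, 1.5, 1.9, 2.3])   # misses per 1k loads
    page_fault_rate = np.array([3.0, 2.0, 4.0, 2.5, 3.5])   # faults per second

    def correlation(x, y):
        """Pearson correlation coefficient between a counter and throughput."""
        return float(np.corrcoef(x, y)[0, 1])

    print("dTLB miss rate  vs throughput:", round(correlation(dtlb_miss_rate, throughput), 2))
    print("page fault rate vs throughput:", round(correlation(page_fault_rate, throughput), 2))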

3.6 Implications on Persistent Memory System Design

Based on our observations, we provide the following implications and recommendations for persistent memory system designs to reap the full potential of this new data storage component. First, persistent memory system performance highly depends on workload data access patterns. In our study, no file system always wins across the various workloads and configurations. Different workloads have disparate read/write intensities, data access frequencies, and data access granularities. Due to the latency gap between DRAM and NVM devices, along with the asymmetric read and write latency of NVMs, system performance can vary significantly across read- and write-intensive workloads. For example, as shown in Section 3.3.1, the performance of all file systems decreases as workload write intensity increases, but the rate of decrease varies; therefore, the persistent memory file system design with the best performance also changes. Recommendation: Persistent memory system designs need to be aware of the type and configuration (especially the memory access patterns) of the applications running on top of them. Second, DAX substantially enhances file systems to leverage the performance benefit of persistent memory, yet may not be the only option. Our results show that file systems with DAX (ext4-DAX and NOVA) provide much better performance than the traditional ext4 file system by bypassing the page cache. However, the current DAX design has several limitations. 1) DAX needs to be configured as a global parameter of the file system, regardless of the workload running on top of it. This can significantly reduce the flexibility of the file system. 2) As shown in Section 3.3.1, DAX does not always outperform the interfaces and mechanisms already existing in traditional system designs; extensive tuning of system software and applications can be required to best exploit the performance benefit of DAX. Furthermore, DAX may not be the only option for supporting persistent memory in legacy file systems. For example, the Linux o_direct interface (enabling direct IO reads and writes) also allows a workload’s data accesses to bypass the page

cache, but is not susceptible to the two limitations that DAX has. This interface can be triggered by the workload, and it also provides comparable or even better performance than DAX, as we demonstrate. Note that enabling direct IO was not recommended in traditional systems because it would disable all of their optimization mechanisms. However, this can change in emerging persistent memory systems, whose characteristics differ from those of traditional systems. Recommendations: Persistent memory system designs need to provide a flexible interface for workloads to enable/disable DAX. The performance and flexibility of the DAX design can also benefit from automatic tuning of system configurations. Third, journaling or logging of data ensures data atomicity, but imposes substantial performance overhead even with the latest persistent memory system optimizations. As shown in Section 3.2.2, ext4 in data-ordered mode is faster than in data-journal mode, whereas the latter enforces data atomicity and thus allows better data recoverability. Even the latest persistent memory file systems, e.g., NOVA, can suffer from considerable performance overhead by enforcing the atomicity and consistency of data. Recommendation: As different applications may have different data reliability and recoverability requirements, persistent memory system performance can benefit from providing options for multiple levels of atomicity guarantees. That way, persistent memory systems can achieve high performance with sufficient data reliability. Fourth, persistent memory system performance is highly correlated with the configuration and behavior of a small number of hardware components. As shown in Section 3.5, across various workloads, persistent memory system performance is always closely related to the configuration and behavior of certain hardware components, such as last-level cache (LLC) and instruction/data TLB accesses. The rationale is that these hardware components are the most closely coupled with memory access among the components in the processor. Recommendation: In order to fully optimize persistent memory performance, system designers may want to spend most of their effort on optimizing the accesses to these highly correlated hardware components

instead of others. Fifth, buffers on NVM devices can substantially impact persistent memory system performance and the accuracy of NVM emulation. Whereas NVM devices do not require row buffers as DRAM does, the performance of persistent memory systems can substantially benefit from adopting buffers on NVM devices. Such buffers can significantly mitigate the performance penalty of the long latency, low bandwidth, and asymmetric read/write latency of NVM access. In addition, NVM emulation that ignores these buffers can affect the accuracy of results and conclusions about NVM access performance. Recommendations: NVM hardware designs can improve system performance by incorporating a buffer on NVM devices. System software and application designs can leverage such buffers by exploiting locality in data access at the granularity of the buffer size. To accurately emulate NVM access, NVM modeling and emulation tools need to be developed in a way that is aware of such buffers.

3.7 Conclusion

In this dissertation, we study the performance and hardware behavior of persistent memory systems by characterizing various workloads. We conduct experiments over three file systems using nine benchmarks, and we provide several implications based on our observations. First, persistent memory system performance is closely related to hardware behavior: a small number of hardware counter statistics have strong correlations with throughput. Second, systems that support atomic updates of data for persistent memory incur more performance overhead. For example, when ext4 is mounted in data-journal mode, its throughput is much lower than in the other modes. Future NVM file system designs can provide a cache-bypass interface for the workload, thereby better leveraging the underlying NVM; it is evident from our results that ext4-DAX outperformed ext4 for all the workloads. Third, buffers on NVM chips can act as a critical factor to enhance NVM

performance. In contrast to current NVM modeling techniques, taking the buffer hit rate into account results in lower modeled overhead. These insights provide meaningful guidance for optimizing persistent memory system software and hardware designs.

Acknowledgments

This chapter contains material from “Persistent Memory Workload Characterization: A Hardware Perspective.”, by Xiao Liu, Bhaskar Jupudi, Pankaj Mehra, and Jishen Zhao, which appears in the Proceedings of the 2019 IEEE International Symposium on Workload Characteri- zation (IISWC 2019). The dissertation author is the primary investigator and first author of this paper.

Chapter 4

Binary Star: Coordinated Reliability in Heterogeneous Memory Systems for High Performance and Scalability

Modern cloud servers adopt increasingly larger main memory and last-level caches (LLC) to accommodate the large working sets of various in-memory computing [121, 122], big-data analytics [123, 124], deep learning [125], and server virtualization [123] applications. The memory capacity demand requires further process technology scaling of traditional DRAMs, e.g., to sub-20nm technology nodes [126, 1]. Yet, such aggressive technology scaling imposes substantial challenges in maintaining reliable and high-performance memory system operation due to two main reasons. First, resilience schemes to maintain memory reliability can impose high performance overhead in future memory systems. Future sub-20nm DRAMs require stronger resilience techniques, such as in-DRAM error correction codes (IECC) [70, 71, 72, 1, 4, 127] and rank-level error correction codes (RECC) [128, 129, 130], compared to current commodity DRAMs. Figure 4.1 shows that such error correction code (ECC) schemes impose significant performance overhead on the memory hierarchy that employs sub-20nm DRAM as main memory.

44 Second, even with strong resilience schemes, future memory systems do not appear to be as reliable as the state-of-the-art. In fact, due to increased bit error rates (BERs), even when we combine RECC and IECC, sub-20nm-DRAM-based memory systems with 3D-stacked DRAM LLC can have lower reliability than a 28nm-DRAM-based memory system that employs only RECC [1], as seen in Figure 4.1 (when comparing the triangle and the circle).

Figure 4.1: Reliability and throughput of various LLC and main memory hierarchy configurations normalized to 28nm DRAM with RECC. We define “reliability” as the reciprocal of device failures in time (FIT) [1, 2], i.e., 1/FIT. Reliability data is based on a recent study [1].

To address the DRAM technology scaling challenges, next-generation servers will adopt various new memory technologies. For example, Intel’s next-generation Xeon servers and Lenovo’s ThinkSystem SD650 servers support Optane DC PM [131, 132], which demonstrates the practical use of large-capacity NVMs in servers. Intel’s Knights Landing Xeon Phi server processors integrate up to 16GB multi-channel DRAM (MCDRAM) [133], which is a type of 3D-stacked DRAM,1 used as the LLC. 3D DRAM and NVM promise much larger capacity than commodity SRAM-based caches and DRAM-based main memory, respectively. However, reliability and performance issues remain. For example, compared with a 28nm DRAM (represented by the circle in Figure 4.1), a sub-20nm DRAM (the cross in the figure) has both worse performance and reliability. When 3D DRAM LLC is used, the sub-20nm 3D-DRAM-

1We use 3D DRAM and 3D-stacked DRAM interchangeably in this dissertation.

cached DRAM (represented by the triangle in Figure 4.1) outperforms the sub-20nm DRAM (the cross in the figure), but suffers from degraded reliability. We make similar observations when we compare the 28nm 3D-DRAM-cached DRAM (the diamond in Figure 4.1) and the 28nm DRAM (the circle in the figure). Most NVM technologies are less vulnerable to soft errors than DRAM [54, 55]. However, many of these technologies have much lower endurance compared to DRAM [22, 56]. As a result, most NVM-based main memory needs to adopt hard error protection techniques, imposing both performance and storage overheads compared to DRAM-based main memory [3, 134, 56]. NVM also enables persistent memory techniques [110, 135], which allow data structures stored in NVM to be recovered after system failures. However, the performance of persistent memory systems is degraded by the high performance and storage overheads of multiversioning and write-order control mechanisms (e.g., cache flushes and memory barriers) [136, 135, 61].

2A “binary star” is a star system consisting of two stars orbiting around their common barycenter. Our coordinated memory hierarchy reliability design appears like a star system that consists of two “stars” – a 3D DRAM LLC and an NVM main memory.

slower PCM as the main memory, Binary Star’s performance is comparable to the combination of 28nm DRAM main memory and 3D DRAM LLC. We show that Binary Star provides benefits across various types of workloads, including in-memory databases, in-memory key-value stores, online transaction processing, data mining, scientific computation, image processing, and video encoding (Section 4.5). This dissertation makes the following key contributions:
• We identify the inefficiency of traditional reliability schemes that are decoupled and uncoordinated across the memory hierarchy (i.e., between the LLC and main memory).
• We propose “Binary Star”, the first coordinated reliability scheme across 3D DRAM LLC and NVM main memory. Our design leverages consistent cache writeback in NVM to recover errors in the 3D DRAM LLC. The LLC requires only error detection to perform consistent cache writeback. This significantly reduces the performance and storage overhead of consistent cache writeback.
• We develop a new technique that coordinates our consistent cache writeback technique with NVM wear leveling to reduce the performance, storage, and hardware implementation overheads of memory reliability schemes.
• We develop a set of new software-hardware cooperative mechanisms to enable our design. Binary Star is transparent to the applications and thus requires no application code modification.

4.1 Background and Motivation

Existing memory hierarchy designs typically strive to optimize the reliability of individual components (either caches or main memory) in the memory hierarchy [126, 127, 1]. These designs lead to decoupled and uncoordinated reliability across different memory hierarchy levels. As a result, existing designs suffer from high performance overhead and high hardware implementation cost. In this section, we motivate our coordinated reliability mechanism by first discussing

the inefficiency of existing reliability schemes for DRAM in Section 4.1.1, 3D DRAM LLC in Section 4.1.2, and NVM in Section 4.1.3. Then, we illustrate issues with decoupled, uncoordinated reliability schemes in Section 4.1.4.

4.1.1 DRAM Main Memory

DRAM has been used as the major technology for implementing main memory. However, DRAM scaling to sub-20nm technology nodes introduces substantial reliability challenges [1, 126, 137, 138].

DRAM errors. DRAM errors can be roughly classified into transient errors and hard errors. Transient errors are caused by alpha particles and cosmic rays [139], as well as retention time fluctuations [140, 141, 71, 72, 142]. These transient errors are less likely to repeat and can be avoided after rewriting the corresponding erroneous bits. Hard errors are caused by physical defects or failures [129, 143, 2]. These hard errors repeatedly occur in the same memory cells due to permanent faults.

Challenges with process scaling. Process technology scaling can compromise DRAM reliability. Randomly distributed single-cell failures (SCFs) are the primary source of device failures [2]. The frequency of SCFs increases as the DRAM cell size shrinks, because smaller transistors and capacitors are more vulnerable to process variation and manufacturing imperfections [141, 1]. Redundant rows and columns can be used to fix SCFs and keep DRAM device yield high [1]. Recent studies show that in-DRAM ECC (IECC), which integrates ECC engines inside the DRAM device, is a promising method to address DRAM reliability issues [144, 70, 1]. However, IECC leads to significant storage overhead in DRAM chips [4, 70, 127, 1]. Furthermore, IECC needs to be combined with RECC [1] to achieve reliability close to commodity DRAM [2], imposing substantial performance overhead (see Section 4.5).

4.1.2 3D DRAM Last Level Cache

Using large 3D DRAM as an LLC can accommodate the increasing working set size of server applications and reduce the performance overhead of capacity scaling. However, the Through Silicon Vias (TSVs) connecting the layers of 3D DRAM can increase the BER by introducing additional defects [145]. Previous studies adopt two-level error correction/detection schemes to improve 3D DRAM reliability [146, 147, 148, 149]. However, these usually require heavy hardware modifications and significant storage overhead.

4.1.3 NVM Main Memory

To tackle the scalability issues with DRAM technologies, NVM (e.g., PCM [22, 87], STT-RAM [150, 98], and RRAM [88]) introduces a new tier in the memory hierarchy, due to its promising scalable density and cost potential [151, 152, 153]. Many data center server software and hardware suppliers are adopting NVMs in their next-generation designs. Examples include Intel’s Optane DC PM [131], Microsoft’s storage class memory support in Windows OS and in-memory databases [154, 155], Red Hat’s persistent memory support in the Linux kernel [156], and Mellanox’s persistent memory support over the network fabric [157]. NVM errors. Similar to DRAM, NVM is also susceptible to transient and hard errors. However, the sources of these errors are different from DRAM. Alpha particles and cosmic rays are less of an issue with NVM. Instead, NVM transient errors are largely caused by resistance drift [55]. In fact, resistance drift can dominate bit errors in MLC NVMs (e.g., MLC PCM [54]). Resistance drift is a less critical issue with SLC NVMs, as it can be addressed by infrequent scrubbing (e.g., every several seconds) [54, 55]. NVM hard errors are often caused by limited endurance. For example, PCM cells can wear out after 10^7–10^9 write cycles [22, 158]. Certain NVM technologies, such as certain types of PCM, may have much lower transient BER compared to DRAM before the cells wear out [159, 158].


Figure 4.2: (a) Multiple reliability mechanisms for a cache line across LLC and main memory. (b) Fraction of clean LLC lines during application execution. We collected data for ten seconds after the benchmark finishes the warm up stage.

To mitigate transient and hard errors, previous works propose techniques to reduce MLC NVM BER at low performance and storage overheads [160, 55, 54]. SLC NVM main memory can be effectively protected using existing ECC mechanisms, as done in state-of-the-art DRAM [159, 55]. In addition to ECC, scrubbing can be used to address NVM transient errors (resistance drift) [54, 55]. To mitigate hard errors caused by endurance issues, previous studies adopt wear leveling [22, 119, 161], which scatters NVM accesses across the whole device to ensure that all bits in the device wear out in a balanced manner. Hard errors (bits that are already worn out) can be remapped [3, 134], which disables the faulty bits, redirecting accesses to alternative physical locations in the memory.
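Wear leveling can take many forms; the sketch below is a deliberately simplified rotation-based remapping (not any of the specific schemes cited above) that illustrates how periodically shifting the logical-to-physical mapping spreads a write hot spot across the device. A real scheme must also migrate the affected data whenever the mapping changes.

    class RotatingWearLevel:
        """Toy wear leveling: periodically rotate the logical-to-physical line
        mapping so that no single physical line absorbs all hot writes."""
        def __init__(self, num_lines, rotate_every=256):
            self.num_lines = num_lines
            self.rotate_every = rotate_every
            self.offset = 0              # current rotation of the address space
            self.writes = 0
            self.wear = [0] * num_lines  # per-line write counts

        def physical(self, logical_line):
            return (logical_line + self.offset) % self.num_lines

        def write(self, logical_line):
            self.wear[self.physical(logical_line)] += 1
            self.writes += 1
            if self.writes % self.rotate_every == 0:
                # shift the mapping; real designs also copy the displaced data
                self.offset = (self.offset + 1) % self.num_lines

    wl = RotatingWearLevel(num_lines=8)
    for _ in range(4096):
        wl.write(0)                      # a pathological hot spot on one logical line
    print(wl.wear)                       # the writes end up spread over all 8 lines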

4.1.4 Issues with Decoupled Reliability

A modern system that employs a large 3D DRAM as LLC and a high-capacity NVM as main memory typically utilizes two separate uncoordinated reliability schemes for DRAM and NVM [146, 162, 3, 134]. The key issue with such decoupled reliability across the LLC and main

memory is redundant protection across the memory hierarchy: the same piece of data can be repeatedly (and unnecessarily) protected multiple times. Figure 4.2 (a) shows an example. A clean cache line A is protected by ECCs in both the 3D DRAM LLC and main memory (represented by Ac and Am, respectively). In fact, as shown in Figure 4.2 (b) (our methodology, system, and workload configurations are described in Section 4.4), a large portion of the 3D DRAM LLC lines are clean throughout execution time (after warm-up) with an identical copy sitting in main memory, and both copies are protected with ECCs. In this case, carefully enforcing the reliability of only one of the data copies is sufficient. For instance, without ECC in the LLC, we can simply invalidate a clean LLC line Ac in case it has a transient error; the error will be naturally recovered by servicing a cache miss to A with the ECC-protected copy Am from main memory. As such, the

LLC only needs to detect whether or not Ac has an error. However, it is challenging to fully exploit the coordination between cache and main memory. For example, what if cache line Ac is dirty?

Once Ac has errors, we lose the new value of that cache line because the copy in memory (Am) does not have the up-to-date value. Even worse, recovering Ac with Am can lead to inconsistent data structures in the program: Ac may belong to a data structure comprising multiple cache lines.

If we load the old value Am to recover the LLC line, the data structure will become inconsistent: the old value of Am will be inconsistent with the new values of other dirty LLC lines that belong to the same data structure. Section 4.2 discusses our solution to the reliability and consistency problems of both clean and dirty LLC lines. While several prior works handle some SRAM cache and DRAM main memory reliability issues together [163, 164, 165], they are not desirable for the combination of 3D DRAM and NVM. On the one hand, these works are hard to scale to the 3D DRAM size. On the other hand, because these designs require frequently cleaning dirty cache lines [163, 165] or off-loading SRAM cache ECC to the main memory [164, 165], these mechanisms further strain NVM's limited endurance and bandwidth. Instead of coordinating the reliability schemes of caches and main memory, persistent memory designs enhance the reliability of the main memory when a system completely loses data in caches due to a system crash or power failure [31, 166]. Persistent memory allows data in NVM to be accessed via load/store instructions like traditional main memory, yet recoverable across system reboots [64, 96]. As shown in Figure 4.2 (a), persistent memory systems typically maintain an additional copy of data (e.g., Ap, which can be a log entry or a checkpoint), which is used to recover the original data (Am) in NVM main memory if needed after a system crash. This process can be done through controlled data versioning (e.g., by logging [61] or shadow paging [64]), or via write-ordering (e.g., by cache flushes and memory barriers [64]). However, relying on persistent memory mechanisms for error recovery purposes has at least two major disadvantages: first, all these techniques require code modifications, which demand significant software engineering effort [108]; second, applications with no persistence requirements suffer from substantial performance and storage overheads in main memory access [108, 135, 61]. In summary, we need to address the overhead and scalability issues with (i) the decoupled reliability schemes across caches and main memory in state-of-the-art memory systems that combine 3D DRAM and NVM, and (ii) the challenges of exploiting persistent memory techniques in order to efficiently maintain the reliability of the memory hierarchy. Our goal in this work is to design a comprehensive coordinated memory hierarchy reliability scheme that is efficient and effective for memory hierarchies that combine multiple technologies.

4.2 Binary Star Design

To address the aforementioned reliability challenges, we propose Binary Star. Binary Star 1) coordinates reliability schemes across 3D DRAM LLC and NVM main memory, and 2) uses an efficient hardware-based 3D DRAM LLC error correction and a software-based error recovery mechanism. 3D DRAM LLC and NVM technologies allow the memory hierarchy to continue to scale in a cost-efficient manner. Figure 4.3 shows our basic architectural configuration. We assume that higher-level caches are SRAM. The LLC in the processor is a 3D DRAM. We use SLC PCM, which is a well-understood NVM technology, as a representative for NVM main memory to make our design and evaluation concrete. Our design principles can be applied to various NVM technologies. In this section, we describe our design principles and mechanisms, using Figure 4.4 to illustrate them. We leave the description of the implementation of our mechanisms to Section 4.3.

Figure 4.3: Basic architecture configuration: cores with private L1 caches and shared SRAM caches, an on-chip 3D-stacked DRAM LLC (gigabytes), and off-chip NVM main memory (hundreds of gigabytes to terabytes).

4.2.1 Coordinated Reliability Scheme

Our coordinated reliability scheme strives to optimize the reliability of the 3D DRAM LLC and NVM hierarchy as a whole, instead of separately optimizing the reliability of individual components at different hierarchy levels, at low performance and storage cost. Our scheme is enabled by the following key observation:

Observation 1: Errors in the LLC can be corrected by consistent data copies in NVM main memory.


Figure 4.4: Binary Star overview. (a) Periodic forced writeback and (b) Consistent cache writeback. (c) Binary Star error correction and error recovery mechanisms. A solid filled square represents a checkpointed consistent data block. A striped filled square represents a remapped inconsistent data block.

The data stored in the LLC is a subset of the data in main memory. Therefore, as long as the main memory maintains a consistent data copy at a known point in time, errors in the corresponding LLC line can be corrected (or ignored) using the consistent data copy in memory. This observation makes error correction codes in the LLC unnecessary. Instead, Binary Star coordinates 3D DRAM LLC and NVM reliability in the following manner: the LLC adopts only cyclic redundancy check (CRC) codes [167] to ensure errors can be detected; NVM maintains consistent data copies that can be used to correct (or avoid) the detected errors in the LLC. While it is straightforward to implement CRC in the LLC, maintaining consistent data in NVM for error recovery purposes is complicated for two reasons. First, storing and updating multiple versions of data can incur significant storage and performance overhead [31, 108]. Second, LLC writebacks can corrupt consistent versions of data in NVM: for example, a data structure update that spans multiple cache lines can result in multiple dirty cache lines; yet only one or several of the dirty cache lines may be written back to NVM main memory at a given time, leaving the state of the data structure in main memory inconsistent. We need to avoid such inconsistency of data in NVM main memory, in order to use the main memory contents to correct errors that happen in LLC lines. To this end, we propose periodic forced writeback and consistent cache writeback (depicted in Figure 4.4 (a) and (b), respectively).

Periodic forced writeback. During the execution of an application, portions of updated data might remain in dirty cache lines, causing inconsistent data to be present in the NVM. Because dirty cache lines contain all the most recent updates, Binary Star periodically forces the writeback of dirty cache lines from all cache levels into the NVM main memory (e.g., cache lines a and b in Figure 4.4 (a)). These written-back data, along with previously written-back dirty cache lines (e.g., cache line c in the figure), generate a set of consistent data, i.e., a checkpoint of data, in the NVM by ensuring that all updates to data are propagated to the NVM. When taking the checkpoint, Binary Star marks all updated NVM blocks as consistent (i.e., the striped squares in Figure 4.4 (a) turn into solid squares after the checkpoint is taken). To recover to this consistent checkpointed state in the presence of LLC errors, Binary Star maintains a single checkpoint and saves the state of an affected process to ensure that the process is able to recover to the point immediately after the periodic forced writeback. The checkpointed consistent data stays intact in NVM until the end of the next forced writeback period. This checkpointed consistent data in NVM can be used to recover from LLC errors detected by the CRC between two consecutive forced writeback periods. In our evaluation, we perform detailed sensitivity studies (Section 4.5.3) and demonstrate that 30 minutes is the best periodic forced writeback interval for the Binary Star system with our benchmarks.

Consistent cache writeback. To maintain consistency of the checkpointed data between two forced writeback periods, we propose a consistent cache writeback mechanism as shown in Figure 4.4 (b). This mechanism redirects natural LLC writebacks to NVM locations that are different from the locations of the checkpointed data, so that checkpointed data stays consistent.

Application transparency. Maintaining data consistency through persistent memory typically requires support for (i) a programming interface (e.g., a transactional interface), (ii) multi-versioning (e.g., logging or copy-on-write), and (iii) write-order control (e.g., cache flushes and memory barriers). These can impose non-trivial performance and implementation overheads on top of a traditional memory hierarchy [136, 31]. Furthermore, persistent memory systems need to redirect data upon each persistent data commit [136, 110, 61]. Our design does not impose large overheads due to three major reasons. First, the granularity of our consistent cache writeback mechanism is only one cache line. Therefore, we do not require a transactional interface to the application to determine the commit granularity. Second, our consistent cache writeback mechanism is used to update the application's working data, rather than the checkpointed consistent data. Therefore, we do not need to enforce write-order control of consistent cache writeback. Finally, our design only needs to ensure that natural LLC writebacks do not overwrite the original, i.e., checkpointed, copy of consistent data, which is reserved to be used to correct errors in the LLC. As such, we do not redirect each natural LLC writeback to a new NVM location. Rather, in our technique, the LLC writebacks to the same data block can repeatedly overwrite the same NVM block³ that is outside of the checkpointed consistent data. As shown in Figure 4.4 (b), the up-to-date version (e.g., cache lines a'' and b) in NVM is visible to applications, while the checkpointed consistent version (e.g., the original data a) is hidden from applications.

³A block is the minimal granularity of remapping and data consistency ensured in the NVM in our design. The block size is larger than a cache line and depends on the granularity of the wear leveling mechanism used.

4.2.2 Coordinating Wear Leveling and Consistent Cache Writeback

Our consistent cache writeback mechanism requires redirecting a natural LLC writeback to an NVM location that is different from the location of the checkpointed data. However, doing so with a traditional memory system design would introduce extra effort to (i) maintain address remapping with metadata and free data block management and (ii) manage alternative memory regions to store redirected natural LLC writebacks. These can impose substantial performance and NVM storage overheads. Instead of explicitly maintaining data consistency for each cache writeback, Binary Star leverages the remapping techniques that are already employed by existing NVM wear leveling mechanisms. Our method redirects the natural LLC writebacks to memory locations different from the checkpointed data locations during consistent cache writeback. We make the following key observation: Observation 2: NVM wear leveling naturally redirects and maintains the remapping of data updates to alternative memory locations.

Note that it is difficult to leverage NVM wear leveling with regular persistent data commits on a system with persistent memory. To provide crash consistency, persistent memory systems typically require each data commit to be redirected to a new location, e.g., a new log entry [61], a newly allocated shadow copy [64], a new checkpoint [108], or a new version managed by hardware [31, 108]. However, NVM wear leveling redirects data updates only when the original physical memory location is repeatedly overwritten for a given number of times [119]. Therefore, the remapping of data typically happens infrequently. If we would like to exploit wear leveling to provide consistency in persistent memory, wear leveling needs to be synchronized with each persistent data commit. This likely imposes a substantial performance overhead because it leads to prohibitively frequent data remapping and metadata updates.

In contrast, Binary Star's consistent cache writeback mechanism can effectively take advantage of existing NVM wear leveling mechanisms. During the interval between two consecutive occurrences of periodic forced writeback, Binary Star's consistent cache writeback mechanism needs to redirect NVM updates of each data block only once. After the first update to the consistent data block is redirected due to a natural LLC writeback (e.g., the update of cache line a', which is overwritten by the later update a'', as shown in Figure 4.4 (b)), subsequent natural LLC writebacks of the same data block can repeatedly overwrite the same redirected data location (e.g., the updated version a'', as shown in Figure 4.4 (b)). Therefore, Binary Star does not need to synchronize wear leveling with each consistent cache writeback. Instead, Binary Star must trigger a single extra data remapping via the wear leveling mechanism at only the first update to a consistent data block; wear leveling can operate as is during the rest of the execution. By coordinating wear leveling with consistent data writeback, Binary Star enables better utilization of wear leveling. Traditional NVM wear leveling remaps data of frequently accessed NVM blocks to infrequently accessed blocks [119], without being explicitly aware of the consistency of data. Our design exploits this remapping capability by exposing data consistency to the wear leveling mechanism, such that we can guarantee data consistency by using the data remapping capability of wear leveling (as illustrated in Figure 4.4 (b)). In particular, we coordinate NVM wear leveling with our consistent cache writeback mechanism in the following manner:

• Binary Star-triggered wear leveling: When the memory controller receives an LLC writeback (either a natural LLC writeback or a periodic forced writeback) to a checkpointed consistent data block, Binary Star actively triggers wear leveling to remap the LLC writeback to another NVM block. Such remapped updates, along with the original blocks in the current checkpointed consistent data, form the next checkpoint of consistent data. As a result, Binary Star ensures that the current version of the checkpointed consistent data block remains intact.

• Natural wear leveling: During the rest of the time, wear leveling can operate as is. Note that Binary Star does not prevent natural wear leveling from remapping the checkpointed consistent data blocks. In case a checkpointed consistent data block is remapped, Binary Star simply updates the metadata to reflect the new location of the block (Section 4.3).

• NVM updates without remapping: NVM wear leveling is performed periodically [119]. During the interval when wear leveling is not performed, natural LLC writebacks to the blocks outside of the checkpointed consistent data blocks can directly overwrite the same location.

4.2.3 3D DRAM LLC Error Correction and Error Recovery

Our 3D DRAM LLC uses CRC codes that can only detect errors. Binary Star recovers from them using the copies of cache lines that are in the NVM main memory. Based on our coordinated reliability scheme, there can be two types of data copies in NVM for a given block: 1) the checkpointed consistent copy, which is generated by periodic forced writeback and which contains the consistent data, and 2) the latest updated copy that is managed by Binary Star-triggered wear leveling. Once CRC detects an error in a given 3D DRAM LLC cache line during an access to the 3D DRAM LLC, Binary Star performs error correction and recovery via the following steps that we also illustrate in Figure 4.4 (c): First, the LLC controller reads the dirty bit of the erroneous cache line. If the LLC line is clean, Binary Star performs error correction in hardware as follows: 1) the LLC controller invalidates the cache line, 2) the corresponding LLC access turns into an LLC miss, which is serviced as a natural LLC miss without any hardware modifications. The LLC miss generates an access to the corresponding most recently updated data block in the NVM. The corresponding data block could reside in either the checkpointed consistent data block or the remapped data block. If the LLC line is dirty, Binary Star determines that it is an uncorrectable error. If such an error is detected, Binary Star triggers error recovery through the system software by 1) reverting the memory state to the checkpointed consistent version of application data, and 2) rolling back the application to execute from the checkpoint.
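The decision logic above is compact enough to summarize in a short sketch. The following Python fragment is purely illustrative: the objects (llc, daemon) and their methods are hypothetical stand-ins for the LLC controller, memory controller, and Binary Star daemon described in this chapter, not an actual hardware or software interface.

```python
from zlib import crc32  # software stand-in for the hardware CRC-32 engine

def handle_llc_read(llc, daemon, set_idx, way):
    """Sketch of Binary Star's error correction/recovery flow (Section 4.2.3)."""
    line = llc.read_raw(set_idx, way)           # data + stored CRC + dirty bit
    if crc32(line.data) == line.crc:
        return line.data                        # no error detected

    if not line.dirty:
        # Correctable: a clean line has an identical, protected copy in NVM.
        # Invalidate and let a normal LLC miss reload it (from the checkpointed
        # or remapped data block, whichever is most recent).
        llc.invalidate(set_idx, way)
        return llc.service_miss(set_idx, way)

    # Uncorrectable: the dirty line held the only up-to-date copy.
    # Hand off to software, which rolls the process back to the last checkpoint.
    daemon.rollback_to_checkpoint(line.address)
    raise RuntimeError("uncorrectable LLC error: rolled back to last checkpoint")
```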


Figure 4.5: An example of running an application on Binary Star.

4.2.4 Putting it All Together: An Example

With these design components, we demonstrate a simple example to show the detailed operation of Binary Star when running an unmodified application. Figure 4.5 depicts an example of four cases ((a) to (d)) that a running application may encounter when our proposed mechanisms are employed. These four cases are software and hardware operations that conduct consistent cache writeback (a) and periodic forced writeback (b) during normal execution, as well as handling of correctable errors (c) and handling of uncorrectable errors (d), when an error is detected.

• Binary Star hardware conducts consistent cache writeback during normal application execution (a). Consistent cache writeback is transparent and asynchronous to the application. When the LLC has to evict a cache line, the memory controller identifies whether the cache line belongs to a consistent block (1). The cache line is directly written into NVM if it belongs to an inconsistent block; otherwise, wear leveling remaps the consistent block to a new block and writes the cache line into it (2). Section 4.3.2 provides the details.

• Binary Star daemon conducts periodic forced writeback during normal execution (b). The daemon stalls the application (1), saves the process state, and executes the Binary Star cache line writeback (bclwb) instruction, which enables the memory controller to write back all dirty cache lines to main memory without invalidation and mark the corresponding NVM data blocks as consistent (2). This procedure creates a new checkpoint and invalidates the previous one. The daemon resumes application execution once the periodic forced writeback finishes (3).

• A correctable error (c), i.e., an error in a clean LLC line, is transparent to the software. Hardware detects and corrects these errors: the memory controller detects the error using CRC (1), reads the dirty bit and identifies that this is a correctable error (2), invalidates the LLC line (3), and issues a cache miss to retrieve the correct copy of the cache line from NVM (4).

• An uncorrectable error, i.e., an error in a dirty LLC line, requires software and hardware operations to recover the LLC line (d). The LLC controller detects the error using CRC (1), reads the dirty bit and identifies that this is an uncorrectable error (2), and sends a signal to the Binary Star daemon (3). The daemon stalls the application and triggers the rollback procedure (4). The daemon uses the Binary Star data reset (drst) instruction to reclaim the previously checkpointed consistent data blocks (i.e., the checkpoint) in NVM (5). The whole procedure rolls the memory hierarchy back to the consistent checkpointed state and resumes normal application execution (6).

4.3 Implementation

This section discusses our Binary Star implementation, including modifications we make to the 3D DRAM LLC, the memory controller and the system software.

4.3.1 3D DRAM LLC Modification

We replace the ECC in 3D DRAM LLC with a 32-bit CRC (CRC-32) [167] for each 64B LLC line. This design allows Binary Star to detect up to eight-bit errors as well as a portion of nine or more bit errors, and all burst errors up to 32 bits long [167]. Binary Star also modifies the finite state machine in the LLC controller to support the steps required to handle LLC error detection, correction, and rollback as listed in Section 4.2.3.
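As a purely software-level illustration of this detection-only role, the snippet below attaches a 32-bit CRC to a 64-byte line when it is written and checks it when it is read. It uses Python's zlib.crc32, i.e., the standard IEEE 802.3 polynomial; the CRC-32 polynomial actually selected per [167] may differ, but the usage pattern is the same.

```python
import os
import zlib

LINE_SIZE = 64  # bytes per LLC line

def write_line(data):
    """Store a 64B line together with its CRC-32 (detection only, no correction)."""
    assert len(data) == LINE_SIZE
    return data, zlib.crc32(data)

def read_line(data, crc):
    """Raise if the recomputed CRC no longer matches, signaling an LLC line error."""
    if zlib.crc32(data) != crc:
        raise ValueError("CRC mismatch: LLC line error detected")
    return data

line, crc = write_line(os.urandom(LINE_SIZE))
corrupted = bytes([line[0] ^ 0x01]) + line[1:]   # flip a single bit
try:
    read_line(corrupted, crc)
except ValueError as err:
    print(err)   # -> CRC mismatch: LLC line error detected
```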

4.3.2 Memory Controller Modifications

Binary Star modifies the state-of-the-art area- and performance-efficient Segment Swap design [119] for NVM wear leveling. Segment Swap counts the number of writes to each segment, and periodically swaps a hot segment with a cold segment.⁴

Modifications to the wear leveling mechanism. To coordinate the NVM wear leveling mechanism with our consistent cache writeback mechanism, we modify the Segment Swap [119] design as shown in Figure 4.6. In the original design, the mapping table maps a "virtual" segment to a "physical" segment. In our design, we maintain a set of metadata with each segment that includes 1) a consistent bit that indicates whether the segment belongs to the consistent checkpoint, 2) a free bit that indicates whether the segment is free, and 3) a segment number that identifies the segment that stores the corresponding checkpointed consistent data. The metadata is stored in a reserved NVM space similar to previous works [119, 161]. During periodic forced writeback, the memory controller uses the segment numbers to release each segment belonging to the previous checkpointed consistent data (e.g., segment A (1) in Figure 4.6) by setting the free bit and adding it to the free list. After all dirty cache lines are written back, the memory controller also marks the current inconsistent segment as consistent (segment A' (2)) and deletes the stored segment number. During consistent cache writeback, NVM wear leveling performs remapping by selecting a free segment, copying data from the consistent segment (segment B (3)) to it, and marking it as an inconsistent segment (segment B' (4)).

⁴Segment is the granularity of wear leveling. In Binary Star, we set the segment size to 1MB, similarly to the original design [119].


Figure 4.6: Modifications to NVM wear leveling.

Modification to NVM write control. We modify the memory controller’s wear leveling control logic to enable our consistent cache writeback mechanism (Section 4.2.2). When the memory controller performs a cache writeback, the wear leveling control logic reads the consistent bit in the metadata space of NVM to determine the consistency state of a segment that the cache line belongs to. If the cache line belongs to an inconsistent segment, the controller directly writes data into NVM. If the cache line belongs to a consistent segment, the segment is remapped via the wear leveling mechanism. In this case, the controller selects a free segment from the free list. The free list keeps track of all the free segments’ numbers and is stored in the memory controller. The writeback data, along with the original data in the old segment, is moved to the new free segment. The memory controller also changes the entry in the mapping table to the new segment [119].
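To make the write-control path concrete, the toy model below keeps per-segment metadata (consistent bit, free bit, checkpointed-segment number), a free list, and a virtual-to-physical segment mapping, and redirects a writeback only when it would hit a consistent segment. The class and field names are our own simplifications for illustration; they are not the actual controller structures.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class SegmentMeta:
    consistent: bool = False            # segment holds checkpointed consistent data
    free: bool = False                  # segment is available for remapping
    ckpt_segment: Optional[int] = None  # where the checkpointed copy lives, if remapped

@dataclass
class MemoryController:
    nvm: dict = field(default_factory=dict)       # physical segment -> data (toy NVM)
    mapping: dict = field(default_factory=dict)   # "virtual" segment -> "physical" segment
    meta: dict = field(default_factory=dict)      # physical segment -> SegmentMeta
    free_list: list = field(default_factory=list)

    def writeback(self, vseg, data):
        pseg = self.mapping[vseg]
        if not self.meta[pseg].consistent:
            self.nvm[pseg] = data                 # inconsistent segment: overwrite in place
            return
        # Consistent segment: Binary Star-triggered wear leveling remaps the update
        # so the checkpointed copy in `pseg` stays intact.
        new_pseg = self.free_list.pop()
        self.meta[new_pseg] = SegmentMeta(consistent=False, free=False, ckpt_segment=pseg)
        self.nvm[new_pseg] = data
        self.mapping[vseg] = new_pseg             # later writebacks overwrite new_pseg

# Example: only the first writeback to a checkpointed segment triggers a remap.
mc = MemoryController()
mc.nvm[0], mc.mapping[0], mc.meta[0] = "checkpointed A", 0, SegmentMeta(consistent=True)
mc.free_list.append(7)
mc.writeback(0, "A'")    # remapped to segment 7; segment 0 keeps the checkpoint
mc.writeback(0, "A''")   # overwrites segment 7 in place, no further remapping
```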

New instructions. To coordinate between the memory controller and the system software, we add two instructions to 1) support periodic forced writeback, and 2) retrieve and roll back to checkpointed consistent data.

The first instruction is called the Binary Star cache line writeback instruction (bclwb). Binary Star leverages the bclwb instruction to write back all dirty cache lines. Similar to an existing cache line writeback instruction (clwb) [166], bclwb allows Binary Star to write back each dirty cache line to NVM without invalidating the cache line. The bclwb instruction not only performs the same cache line writeback functionality as clwb, but also triggers the memory controller to modify wear leveling metadata to guarantee consistency. During the execution of a bclwb instruction, after data writeback, the memory controller marks the current segment as consistent. The memory controller also tracks the old segment with the stored segment number, frees it by setting the free bit, and adds it to the free list.

The second instruction is called the Binary Star data reset instruction (drst) and is used to retrieve checkpointed consistent data. drst has only one operand: a memory address. drst is used for handling error-triggered rollback. When the memory controller receives a drst command, it checks the consistent bit of the segment associated with the memory address to see whether or not the segment is consistent. The memory controller does nothing if the segment is consistent. Otherwise, the memory controller finds the corresponding consistent segment using the stored consistent segment number. The memory controller then invalidates the current segment by setting the free bit and adding the segment number to the free list.
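Continuing the toy controller model sketched earlier, the semantics of the two instructions can be rendered roughly as follows. This is a behavioral sketch at segment granularity (the real drst operates on a memory address), and the function names are ours, not an ISA definition.

```python
def bclwb_commit(mc):
    """After all dirty lines are written back via bclwb: working segments become the
    new consistent checkpoint, and the previously checkpointed segments are freed."""
    for m in mc.meta.values():
        if not m.consistent and m.ckpt_segment is not None:
            old = m.ckpt_segment
            mc.meta[old].free = True          # release the previous checkpointed segment
            mc.free_list.append(old)
            m.consistent = True               # this segment now holds the checkpoint
            m.ckpt_segment = None

def drst(mc, vseg):
    """Roll one segment back to its checkpointed copy (error-triggered rollback)."""
    pseg = mc.mapping[vseg]
    m = mc.meta[pseg]
    if m.consistent:
        return                                # already consistent: nothing to do
    mc.mapping[vseg] = m.ckpt_segment         # point back to the checkpointed data
    m.free = True                             # reclaim the inconsistent working copy
    mc.free_list.append(pseg)
```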

Reducing latency by controlling NVM writebacks. Because the latency of Binary Star-triggered wear leveling can degrade the performance of a latency-critical application, Binary Star provides two optional optimizations. First, when the NVM bus is idle, Binary Star allows the memory controller to issue preemptive consistent cache writeback to remap a consistent segment in advance. Second, when read requests are delayed by the writebacks, Binary Star allows the memory controller to postpone LLC writebacks, which has been shown to provide performance benefits in a previous study [168].

NVM over-provisioning. NVM preserves a certain amount of physical space specifically for wear leveling purposes [169]. Over-provisioned space guarantees that wear leveling can always find a free segment when the logical space is full. To ensure that Binary Star-triggered wear leveling always finds a free segment, this over-provisioned space has to be large enough to provide a sufficient number of free segments. Our experiments (Section 4.5.3) show that 10% extra space for over-provisioning is sufficient to accommodate a 30-minute periodic forced writeback interval with all of our benchmarks.

4.3.3 System Software Modifications

We implement software support as part of the operating system to leverage the hardware support.

Binary Star daemon. We develop a Binary Star daemon, which is in charge of periodic forced writeback and rollback. Because periodic forced writeback and rollback are infrequent operations, we assume a single daemon can manage all processes on a Binary Star-enabled machine. Figure 4.7 shows the state machine of the daemon. The Binary Star daemon can be implemented as a loadable kernel module.

Figure 4.7: State machine of the Binary Star software daemon.

The Binary Star daemon operates in normal mode until either 1) the periodic forced writeback interval is reached, or 2) a memory error signal is received. To determine if the periodic forced writeback interval is reached, we use a simple timer, which consumes minimal CPU resources (1). When the timer reaches a pre-defined time interval, the daemon sends interrupt signals to stall all the running processes and starts the periodic forced writeback procedure.

The periodic forced writeback procedure consists of five steps. First, the daemon sends an interrupt to each of the executing processes and stalls them (2). Second, a copy of each process control block (PCB) is flushed into the memory, so that the process can be restored (3). Each PCB contains process-related state, such as the process ID, stack, and page table. Instead of introducing extra implementation cost, Binary Star leverages the PCB state to roll back a process to the previously written-back checkpoint in the same way as existing designs [170]. Third, the daemon flushes all the dirty cache lines (including SRAM and 3D DRAM caches) to the memory, by using the bclwb instruction (4). Fourth, when all cache line writebacks are completed, the daemon resumes all the stalled processes (5).

The second key task of the Binary Star daemon is to manage error-triggered rollback. When the daemon receives an uncorrectable memory error signal from the memory controller, it starts the rollback procedure. First, the daemon sends an interrupt to stall all executing processes (6). Second, the daemon identifies the error-affected processes: it inspects the page tables of each process and identifies the processes that triggered the error (7). Third, the daemon retrieves all the PCB data, such that the affected processes' PCBs are rolled back to the checkpointed PCBs (8). Fourth, the daemon retrieves the data of all error-affected processes (9). The daemon rolls back memory data by retrieving the old page table from the PCB and using the drst instruction to fetch previously checkpointed consistent data. The daemon also employs drst to free all the remapped inconsistent segments to avoid memory leaks. Finally, the daemon resumes the execution of all the processes (5).
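The daemon's behavior reduces to a small event loop over the two procedures above. The sketch below is schematic: wait_for_event, stall_all_processes, and the other helpers are placeholders for the OS services described in the text, not real kernel APIs.

```python
import time

CHECKPOINT_INTERVAL = 1800.0  # seconds; the 30-minute interval chosen in Section 4.5.3

def binary_star_daemon(os_services, interval=CHECKPOINT_INTERVAL):
    last_ckpt = time.monotonic()
    while True:
        remaining = max(0.0, interval - (time.monotonic() - last_ckpt))
        event = os_services.wait_for_event(timeout=remaining)

        if event == "memory_error":
            # Rollback path for an uncorrectable (dirty-line) error.
            os_services.stall_all_processes()
            victims = os_services.identify_error_affected_processes()
            os_services.restore_checkpointed_pcbs(victims)
            os_services.drst_restore_data(victims)      # retrieve checkpointed segments
            os_services.resume_all_processes()
        else:
            # Periodic forced writeback: create a new checkpoint.
            os_services.stall_all_processes()
            os_services.save_all_pcbs()
            os_services.bclwb_flush_all_dirty_lines()   # SRAM caches + 3D DRAM LLC
            os_services.resume_all_processes()
            last_ckpt = time.monotonic()
```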

4.3.4 Binary Star Overheads

Overhead in 3D DRAM Cache. The CRC-32 code requires 32 bits per 64-byte cache line, i.e., 6.25% storage overhead, which is 50% lower than the storage overhead of traditional ECC. The area and performance overheads of CRC encoding and decoding engines are similar to those of existing ECC encoding and decoding engines.

Overhead in NVM. Our modification to NVM wear leveling requires three bytes per segment. With a 512GB NVM and a 1MB segment size, this introduces 0.0003% storage overhead. The free segment list takes at most 1.1875MB of additional space. Because NVM typically over-provisions its capacity by roughly 10% to support wear leveling [169], our design reuses the over-provisioning space and, as such, avoids additional storage overhead.
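As a quick sanity check of these storage numbers (our own arithmetic, not part of the original text):

```latex
% CRC-32 per 64-byte (512-bit) LLC line vs. a typical 8-bit-per-64-bit SECDED ECC:
\frac{32}{512} = 6.25\%, \qquad \frac{64}{512} = 12.5\% \;\;(\text{so CRC-32 costs half as much}).
% Wear-leveling metadata: 3 bytes per 1\,MB segment on a 512\,GB NVM:
\frac{3\,\text{B}}{2^{20}\,\text{B}} \approx 2.9\times10^{-6} \approx 0.0003\%.
```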

Software overhead. For software support, Binary Star only requires a loadable module in the operating system. An application requires no modification to utilize Binary Star.

Overhead of periodic forced writeback and uncorrectable error-triggered rollback.

We test periodic forced writeback and rollback using several workloads on an emulated NVM. Typically, the periodic forced writeback takes 50µs to 3.6 seconds. Because Binary Star rollback reuses data hidden by wear leveling and avoids large amounts of data movement, the rollback takes only dozens of microseconds. The Binary Star daemon has negligible performance impact on the operating system and applications because of the low frequency of periodic forced writeback and the very small likelihood of error-triggered rollback (See Section 4.5.3).

4.4 Experimental Methodology

We evaluate both the performance and reliability of Binary Star by comparing it with four baseline memory configurations: 1) DRAM memory with 3D DRAM LLC (3D DRAM+DRAM), 2) DRAM memory without 3D DRAM LLC (DRAM), 3) PCM as main memory with 3D DRAM LLC (3D DRAM+PCM), and 4) PCM as main memory without 3D DRAM LLC (PCM). For all four memory configurations, we use various state-of-the-art ECC mechanisms, which we describe below. With DRAM, we evaluate both 28nm and sub-20nm process technologies. We use Failures In Time (FIT) as our main reliability metric.

Performance simulation. We evaluate the performance of Binary Star modifications using McSimA+ [171], which is a Pin-based [120] cycle-level multi-core simulator. Our infrastructure models the row buffer as well as row buffer hits and row buffer misses. We use the FR-FCFS memory scheduling policy [172, 173] for Binary Star and all our baselines. We configure the simulator to model a multi-core out-of-order processor with 3D DRAM LLC and NVM DIMM (for Binary Star) as described in Table 4.1. The higher-level L1 and L2 caches are SRAM-based. We model the 3D DRAM LLC based on the HMC 2.1 specification [174]. We use PCM timing parameters [22] in our basic configuration for the NVM DIMM. We adopt the state-of-the-art NVM error protection design [3] in our evaluations.

Reliability simulation and modeling. We evaluate the reliability of our design using a combination of reliability simulation and theoretical calculation. We employ FaultSim [175], a configurable memory resilience simulator, to compare the reliability and storage overhead of various memory technologies and reliability schemes. To evaluate the error correction and recovery ability of Binary Star, we calculate the device failure rates using a method proposed in previous work [1]. A device failure occurs when one or more erroneous bits within any 64-bit line of a ×8 device are uncorrectable after the corresponding ECC or correction/recovery mechanisms are applied. We calculate the device failure rate based on the error correction/recovery capability (the number of bits that can be corrected) of each mechanism given the number and distribution of single cell errors [1]. We assume that single cell errors are randomly distributed. We omit column, row, and connection faults for the same reasons discussed in previous studies [1]. We only consider errors that are found after device shipment time [1]. We evaluate various reliability schemes, including no-ECC, RECC only [130], combined RECC (rank-level ECC) and IECC (in-DRAM ECC) [1], combined Chipkill and IECC [1, 4], and Binary Star (which combines CRC-32 and our error correction and recovery techniques described in Section 4.3). For both reliability simulation and theoretical analysis, we adopt PCM and sub-20nm DRAM device reliability parameters published in previous works [159, 1, 175, 2]. We implement 3D DRAM Chipkill with multiple DRAM cubes. For a fair comparison, we use multiple DRAM cubes across all the reliability experiments. However, we note that Binary Star is applicable to both single-cube and multiple-cube 3D DRAM. Because 3D DRAM does not have ranks, RECC on 3D DRAM is the same as ECC on a single cube, as in various previous works [176, 147].

Table 4.1: Evaluated processor and memory configurations.

Processor
  Cores: 4 cores, 3.2 GHz, 2 threads/core, 22nm
  IL1 Cache: 32KB per core, 8-way, 64B cache lines, 1.6ns latency
  DL1 Cache: 32KB per core, 8-way, 64B cache lines, 1.6ns latency
  L2 Cache: 32MB shared, 16-way, 64B cache lines, 4.4ns latency
  LLC: 3D DRAM, 16GB, CRC-32 (for Binary Star), tCAS-tRCD-tRP: 8-8-8, tRAS-tRC: 26-34 [146, 174]
Main Memory
  DRAM: 512GB, DDR4-2133 timing, 2 128-bit channels; ECC schemes: non-ECC, RECC [130], combined RECC and IECC [1], combined Chipkill and IECC [1, 4]
  PCM: 512GB plus 52GB over-provisioning space, 36ns row-buffer hit, 100/300ns read/write row-buffer conflict [22], 2 channels, 128 bits per channel; single error correction, double error detection (SECDED) + NVM remap [3]

Benchmarks. We evaluate our design with various data-intensive workloads. Redis [177] and Memcached [178] are in-memory key-value store systems widely used in web applications. Our Redis benchmark has one million keys in total and simulates the case when Redis is used as an LRU cache; our Memcached benchmark has 100K ops (read/write), with 5% write/update operations. TPC-C [179] is a popular on-line transaction processing (OLTP) benchmark; we run 400K transactions, a mix of reads and updates (40%). YCSB [180] is an open-source specification and program suite for evaluating retrieval and maintenance capabilities of computer programs; we run eight million transactions, randomly performing inserts, updates, reads, and scans. MySQL [46] is one of the most widely used relational database management systems (RDBMSs), which is typically used in online transaction processing; we use a table of 10 million tuples and conduct complex OLTP operations (using sysbench [181]), including point, range, update, delete, and insert queries. We exercise these server workloads using four clients and scale the memory footprint. We also evaluate Binary Star with several memory-intensive benchmarks from the PARSEC 3.0 benchmark suite [182], including facesim, freqmine, fluidanimate, vips, and x264. We configure the dataset to be the same as the capacity of our main memory configuration (512GB) on all the benchmarks. We evaluate application performance using instruction throughput (instructions per cycle) for PARSEC 3.0 [182] benchmarks and operational throughput (the number of executed application-level operations, e.g., inserting or deleting data structure nodes, per second, not including logging) for all other benchmarks.

4.5 Results

This section presents the results of our Binary Star evaluation. Section 4.5.1 shows Binary Star improves reliability over various error correction baselines. Section 4.5.2 shows the performance of Binary Star compared to multiple memory configurations. Section 4.5.3 presents a study of the impact of Binary Star’s periodic forced writeback interval on various metrics.

4.5.1 Reliability

Figure 4.8 shows the projected 3D DRAM device failure rates of different resilience schemes as the single cell bit error rate varies from 10⁻¹⁴ to 10⁻⁴. The x-axis represents the bit error rate (BER) of a single cell and the y-axis is the corresponding device failure rate. We make three major observations. First, for sub-20nm DRAM with a 10⁻⁵ single cell bit error rate [1], Binary Star reduces the device failure rate by 10¹² times compared to IECC with Chipkill [1] and by 10¹⁶ times compared to IECC with RECC [1]. Second, even when the sub-20nm single cell error rate is lower (i.e., 10⁻⁶), Binary Star still provides better reliability than other schemes: Binary Star reduces the device failure rate by 10¹⁶ times compared to IECC with Chipkill, and by 10²² times compared to IECC with RECC. These results indicate that Binary Star is effective for technologies that exhibit lower error rates than those projected in [1]. Third, compared to the commonly-employed RECC DRAM, Binary Star greatly lowers the device failure rate on both 28nm and sub-20nm technology nodes (i.e., at 10⁻¹⁰ and 10⁻¹⁵ single cell BERs). We conclude that Binary Star provides strong error protection on 3D DRAM LLCs, regardless of technology node and single cell error rate.

Figure 4.8: Device failure rates of 3D DRAM for traditional RECC [3], RECC with IECC [1], Binary Star, and Chipkill with IECC [1, 4] vs. single cell bit error rate. Vertical lines represent 3D DRAM technology nodes (28nm and sub-20nm).⁵

Table 4.2 summarizes the FIT and storage cost of five state-of-the-art baselines and Binary Star. We show results when the device single cell BER is 10⁻⁶ (left) and 10⁻⁵ (right). We make three observations. First, Binary Star provides a clear advantage against all state-of-the-art baselines. Compared to IECC+Chipkill, which is the best IECC-based scheme, Binary Star reduces FIT by 14.4×. Compared to the IECC+RECC protected sub-20nm DRAM system, Binary Star reduces FIT by 29.7×. Second, compared to the NVM-based main memory system that uses RECC 3D DRAM cache and PCM, Binary Star experiences 58.5% fewer failures. Third, Binary Star incurs lower space overhead compared to all other state-of-the-art designs, with only 6.25% additional space in 3D DRAM and only 12.79% additional space in PCM, respectively. We conclude that Binary Star is the most efficient reliability mechanism for hybrid emerging technologies of 3D DRAM cache and NVM main memory.

⁵The sub-20nm single bit error rate is based on [1]. The single cell error rates of future sub-20nm products can vary and depend on the specific technologies.

Table 4.2: Comparison of the evaluated resilience schemes.

Scheme                System                     FIT             Storage cost (LLC)   Storage cost (Main memory)
No-ECC [2]            28nm DRAM                  44032-66150     N/A                  0%
RECC [183]            28nm DRAM                  8806-13230      N/A                  12.5%
IECC+RECC [1]         sub-20nm 3D DRAM + DRAM    78211-117912    18.75%               18.75%
IECC+Chipkill [1, 4]  sub-20nm 3D DRAM + DRAM    37949-59518     18.75%               18.75%
RECC [183, 3]         sub-20nm 3D DRAM + PCM     6352-9963       12.5%                12.79%
Binary Star           sub-20nm 3D DRAM + PCM     2637-3968       6.25%                12.79%
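As a cross-check of the reductions quoted above (our own arithmetic, using the lower end of each FIT range in Table 4.2):

```latex
\frac{37949}{2637} \approx 14.4\times \;(\text{vs. IECC+Chipkill}), \qquad
\frac{78211}{2637} \approx 29.7\times \;(\text{vs. IECC+RECC}), \qquad
1 - \frac{2637}{6352} \approx 58.5\% \;(\text{vs. RECC 3D DRAM + PCM}).
```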

4.5.2 Performance

Figure 4.9 shows the performance of three baseline memory configurations, all of which have no ECC, and Binary Star, normalized to the baseline with 3D DRAM cache and DRAM main memory.⁶ We make two observations based on this data. First, Binary Star's performance is very close to the baseline with 3D DRAM. Binary Star performs within 0.2% of a baseline with 3D DRAM and PCM. Binary Star also performs within 1% of the performance of the baseline with 3D DRAM cache and DRAM main memory. Second, we observe that using 3D DRAM as the LLC to PCM main memory improves performance by 10% over the PCM-only baseline. We observe that the performance improvements are significant on workloads that exhibit high LLC locality (e.g., memcached, facesim, vips and x264). In summary, we find that Binary Star provides comparable performance to the baseline with 3D DRAM cache and DRAM main memory but at significantly higher reliability.

⁶We include the CRC encoding and decoding performance overheads based on [1]. We normalize the throughput of each configuration to the throughput of 3D DRAM LLC with DRAM. We find that the error detection and recovery overheads are negligible compared to the CRC encoding and decoding overheads, because errors happen rarely (see Section 4.5.1).


Figure 4.9: System performance of different memory configurations with no ECC and Binary Star. We normalize the throughput of each configuration to the throughput of 3D DRAM LLC with DRAM.

Effect of NVM wear-out. To understand the effect of NVM wear-out-induced latency on performance, Figure 4.10 provides the performance of each workload with different NVM access latencies. In this experiment, we vary the latency of PCM to 0.5×, 1×, 2× and 4× of the default PCM latency provided in Table 4.1. All throughput values are normalized to the highest performance memory configuration (3D DRAM+DRAM). As Figure 4.10 shows, Binary Star retains 99% of the performance of the baseline with a 0.5× PCM latency and its average performance degradation is only 3% with a 2× PCM access latency. In the unrealistic case where the PCM access latency is quadrupled (4×), the average performance degradation of Binary Star is only 5% compared to the baseline.


Figure 4.10: Performance of Binary Star normalized to the performance of the 3D DRAM+DRAM configuration, as PCM latency is varied as a multiple of its value in Table 4.1.

4.5.3 Impact of the Periodic Forced Writeback Interval

Binary Star’s periodic forced writeback interval is critical to throughput, the number of rollbacks triggered by uncorrectable errors, NVM lifetime, and capacity.

Throughput. Figure 4.11 (a) shows the normalized system throughput of different workloads for six different periodic forced writeback intervals, ranging from one second to one hour. We normalize the throughput of each workload to the throughput of the same workload under the 3D DRAM LLC with DRAM configuration. We make two observations. First, throughput decreases when the interval decreases. As the periodic forced writeback interval decreases, both periodic forced writeback and consistent cache writeback happen more often and occupy more NVM bandwidth, leading to a reduction in throughput. Second, the periodic forced writeback interval impacts the efficiency of consistent cache writeback. As the interval decreases, the locality of NVM accesses decreases, such that Binary Star-triggered wear leveling happens more frequently. Therefore, Binary Star-triggered wear leveling overlaps much less with endurance-triggered wear leveling, leading to throughput loss.

Rollback Triggered by Uncorrectable Errors. Figure 4.11 (b) shows the rollback rate for six different intervals. Rollback is triggered by uncorrectable errors that occur when accessing dirty cache lines in 3D DRAM. As such, the results also reflect the error rate on dirty cache lines. As the forced writeback interval decreases, dirty cache lines remain in the LLC for less time. Therefore, the possibility of uncorrectable-error-triggered rollback decreases. When the interval decreases to one second, the likelihood of error-triggered rollback is < 10⁻¹¹. However, the throughput loss is too high in this case, as shown in Figure 4.11 (a). At an 1800-second (30-minute) periodic forced writeback interval, the rollback rate is 10⁻⁹, which is close to the error rate of current DRAM without ECC.

Figure 4.11: Effect of varying the periodic forced writeback interval on (a) throughput, (b) probability of rollbacks (i.e., error rate on a dirty line), and (c) NVM writes.

Endurance. Figure 4.11 (c) shows the normalized number of NVM writes for six different periodic forced writeback intervals, normalized to the NVM writes in the 3D DRAM+PCM baseline for each benchmark. We make three observations. First, the number of writes increases drastically to 17.2× that of the 3D DRAM+PCM baseline when the interval is only one second. Such a short interval leads to a low likelihood of rollback, but can cause up to 94% faster NVM wear-out due to the larger number of writes. Second, when the interval gets larger, the number of NVM writes gradually becomes close to that in the 3D DRAM+PCM baseline. Third, the increase in the interval length has diminishing benefit. We find that increasing the interval from 30 minutes to one hour only leads to a 10% reduction in the number of NVM writes, but forces applications to roll back consistent data from earlier in the past (up to one hour instead of up to 30 minutes) when an uncorrectable error happens, leading to high re-execution overheads.

Lifetime impact of coordinated wear leveling. We use the same methodology as previous work [160] to evaluate the NVM lifetime impact of our design. We determine the end of NVM lifetime as the time when NVM capacity drops below 90% of the original capacity (i.e., when the over-provisioned space is saturated). We make two observations when the periodic forced writeback interval is 30 minutes (1800 seconds). First, Binary Star triggers 20% additional NVM writes, compared to the 3D DRAM+PCM configuration (Figure 4.11 (c)). This leads to an 8.4% NVM lifetime reduction compared with the baseline wear leveling mechanism [119]. As the baseline wear leveling mechanism already offers 10× NVM lifetime improvement on native NVM without wear leveling [119], our design provides 9.2× better lifetime than native NVM, and our coordinated wear leveling mechanism has minimal impact on NVM lifetime. Second, using 3D DRAM as LLC significantly prolongs the lifetime, because it reduces writes to NVM by 90.1% in the NVM baseline and 87.8% in Binary Star, respectively.

Capacity. One disadvantage of consistent cache writeback on NVM is that Binary Star gradually consumes more space over time, because the consistent cache writeback-triggered block remapping consumes free blocks and holds consistent blocks. To free up space for incoming remapping operations, periodic forced writeback releases the space of the previous checkpointed consistent data (as discussed in Section 4.3.2). For the above experiments, we guarantee that all the workloads have sufficient memory space while keeping 10% of the NVM space for wear leveling.

Overall, the periodic forced writeback interval length causes a trade-off between performance, reliability, NVM endurance, and the effective NVM capacity. In this work, we choose 30 minutes (1800s) as the interval, such that periodic forced writeback only incurs less than 1% average throughput loss, with only a 10⁻⁹ chance of rollback and 5 to 7.5 FIT per GB of NVM.

76 4.6 Conclusion

We propose Binary Star, the first coordinated memory hierarchy reliability scheme that achieves both high reliability and high performance with a 3D DRAM last-level cache and nonvolatile main memory. Binary Star coordinates 1) the reliability mechanisms between the last-level cache and main memory, leveraging consistent cache writeback and periodic forced writeback to the main memory to allow the elimination of ECC in the last-level cache, and 2) the wear leveling scheme in main memory with consistent cache writeback, to significantly reduce the performance and storage overhead of consistent cache writeback. As a result, we can eliminate the error correction codes in the last-level cache, while achieving higher reliability than using costly sophisticated ECC mechanisms. Our work rethinks the reliability support in the memory hierarchy in light of next-generation computing systems that may adopt two key new memory technologies, i.e., 3D DRAM caches and byte-addressable nonvolatile memories. We hope this work provides a critical step toward reaping the full reliability advantages of these emerging technologies beyond simply replacing the existing memory technologies.

Acknowledgments

This chapter contains material from “Binary Star: Coordinated Reliability in Heterogeneous Memory Systems for High Performance and Scalability,” by Xiao Liu, David Roberts, Rachata Ausavarungnirun, Onur Mutlu, and Jishen Zhao, which appears in the Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 52). The dissertation author is the primary investigator and first author of this paper.

Chapter 5

HR3AM: A Heat Resilient Design for RRAM-based Neuromorphic Computing

Neuromorphic computing has attracted increasing attention today because the application scenarios of neural networks continue to expand. Convolutional neural networks (CNNs) are widely adopted because of their high prediction accuracy. As network sizes grow, general-purpose hardware platforms, such as CPUs and GPUs, are insufficient for delivering decent performance and power-efficiency for neuromorphic computing applications. There have been a wide variety of neuromorphic computing accelerators, which utilize fast and power-efficient emerging technologies to accelerate large-scale CNNs. Resistive RAM (RRAM), or the memristor, has been widely investigated to serve as a CNN accelerator [74]. RRAM can form a crossbar array to perform matrix multiplication by exploiting the analog characteristics of RRAM cells. Matrix multiplications constitute the majority of operations in convolution, so optimizing matrix multiplications is essential to improving the performance of CNNs. In an RRAM-based CNN accelerator, the weights of the matrix are programmed into the conductance of each RRAM cell in each crossbar. The input data is converted to analog signals and transferred to crossbar bitlines. The output data is generated and converted to digital signals on wordlines. Matrix multiplication achieves higher speed and parallelism on RRAM crossbars than on conventional hardware [184]. RRAM crossbars also have cost and energy efficiency advantages [74]. However, the thermal issues of RRAM crossbars potentially reduce the inference accuracy of neural networks. Prior work discovered that RRAM conductance is sensitive to temperature [185].

The RRAM ON-state conductance (G_ON) and OFF-state conductance (G_OFF) vary as the temperature changes. The conductance range [G_OFF, G_ON] sharply declines for hot cells. This variation drastically decreases weight precision when weights are represented in the form of conductance. When the chip operates continuously, the accumulated heat affects more RRAM cells. This causes many weights to be misrepresented during inference. Eventually, neural networks suffer from a loss of accuracy. Our experiments show that the accuracy loss at high temperature can be as high as 90%. Little research has paid attention to this problem. Most existing works focus on handling device defects and variations in RRAM crossbars [186, 187, 188]. However, because these schemes target permanent errors from process variation, they do not dynamically adjust to the defective cells. Temperature-aware row adjustment (TARA) [189] proposed a heat-resilient design but suffers from several major drawbacks. TARA still needs to be aware of the exact weights and adjusts row order based on each weight’s magnitude, which is not scalable to large-scale networks. Besides, TARA provides no optimization of the thermal behavior itself. As network size expands, the thermal issue gets worse and row adjustment becomes less effective. In this dissertation, we introduce a novel heat-resilient design, HR3AM, for the RRAM crossbar. HR3AM addresses the issue from two aspects: adapting weights to the reduced conductance range and optimizing the chip’s thermal distribution. We correspondingly propose two mechanisms in HR3AM. The first mechanism, bitwidth downgrading, aims to improve accuracy: it adapts weights to the decreased conductance range by reducing the bitwidth per RRAM cell. The second mechanism, tile pairing, aims to tackle the thermal issue: it optimizes the chip’s thermal status by matching hotspot tiles with idle tiles. The contributions of this dissertation include:

• We evaluate the impact of heat on the inference accuracy of several large-scale neural networks. Our results show that the inference accuracy of several well-known CNNs drops to less than 10% of the theoretical accuracy.

• We propose HR3AM, which contains two mechanisms, bitwidth downgrading and tile pairing, to improve inference accuracy and tackle the thermal issue, respectively. Bitwidth downgrading increases the accuracy of weights represented in conductance. Tile pairing relieves the heat on hotspot tiles.

• We evaluate HR3AM with four state-of-the-art CNN models. The results show that our design maintains 87.8% of the theoretical inference accuracy. Our design also achieves a 6.2K decrease in maximum temperature and a 6K decrease in average temperature for the entire chip.

5.1 Background and Motivation

5.1.1 Neural Network Data Mapping on RRAM crossbar

A neural network consists of various types of layers. Most layers in a deep neural network are convolutional layers (e.g., 13 out of 16 in VGG16), which conduct a large number of matrix multiplications. Performing these matrix multiplications on RRAM crossbars can increase parallelism and decrease latency. Encoding a weight value into the RRAM crossbar requires a conversion between the value and the cell conductance, which follows the formula [184], where α = (Gmax − Gmin)/(wmax − wmin) and β = Gmax − α ∗ wmax:

G = α ∗W + β (5.1)

α linearly scales weights to match the conductance range, and β adds an offset to remove negative weight values. The precision of weights is limited by the bitwidth of RRAM cells; a single cell achieves at most seven bits of precision [190]. To achieve higher precision, each weight can be programmed into multiple RRAM cells [74]. The layers of a neural network execute serially, so parallel operations are only possible within a single layer. Each layer of the neural network is therefore mapped to a group of RRAM tiles [74]. Because tiles belonging to different layers are independent, layers can form a pipeline to achieve better throughput [74]. The pipelined RRAM crossbar chip processes multiple inputs concurrently across different layers.
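To make the conversion concrete, the following is a minimal Python sketch of the weight-to-conductance encoding in Equation 5.1, including the quantization to the discrete levels a single cell can hold. The function name, the level count, and the conductance bounds are illustrative assumptions, not the accelerator's actual programming routine.

import numpy as np

def encode_weights(weights, g_min, g_max, n_bits=4):
    """Map a weight matrix to RRAM conductances following Equation 5.1.

    alpha scales weights into [g_min, g_max]; beta removes negative values.
    The conductance is then quantized to the 2**n_bits levels of one cell.
    """
    w_min, w_max = weights.min(), weights.max()
    alpha = (g_max - g_min) / (w_max - w_min)
    beta = g_max - alpha * w_max
    g = alpha * weights + beta                      # Equation 5.1
    # Quantize to the discrete conductance levels of a single cell.
    levels = 2 ** n_bits
    step = (g_max - g_min) / (levels - 1)
    g_quant = g_min + np.round((g - g_min) / step) * step
    return g_quant, alpha, beta

Programming a weight into multiple cells, as mentioned above, simply splits the quantized value across several such encodings.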

5.1.2 Thermal Issues in RRAM Accelerator

RRAM cells natively have ON and OFF states. Each RRAM cell can represent multiple bits by setting intermediate states whose conductance lies between the ON and OFF states [184]. Each state has a unique conductance range that does not overlap with the others. However, temperature changes have a significant impact on the RRAM conductance between the OFF and ON states [191]. Increased temperature leads to a narrower conductance range.

Figure 5.1: Temperature impact on (a) RRAM cell conductance and (b) relative CNN inference accuracy.

Figure 5.1(a) depicts this effect on OFF and ON conductance (GOFF and GON). The

ON state has a weak metallic-like characteristic [191], so GON drops significantly as the temperature rises, while GOFF increases with temperature because of the increased current [191].

The conductance range [GOFF, GON] drops by 50% when the temperature rises from 300K to 400K. The range starts to drop sharply after 330K, which is a common operational temperature for many chips. We also observe an asymmetry in the conductance variation: GON drops rapidly while GOFF increases slowly. The shifting conductance leads to weight misrepresentation in the neural network. The weight conversion (Equation 5.1) between weights and RRAM conductance assumes ideal conditions at room temperature. When the temperature changes, certain weight values fall into the now-unavailable conductance range, and the weight conversion maps these values to the nearest available conductance. These weight values are misrepresented and share a conductance level with others. We modeled this effect and tested it on several large-scale neural networks [192, 193]. Figure 5.1(b) shows the relative inference accuracy of four neural networks from 300K to 400K. To obtain the relative accuracy, we normalize the accuracy produced on RRAM to the software-generated accuracy. We assume the entire chip is affected by the same temperature. The accuracy drops as the temperature rises. Despite minor variance between different network models, the inference accuracy in all experiments eventually drops below 10% at 400K.
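The clamping effect described above can be illustrated with a small sketch. The exact conductance-versus-temperature curve from [185] is not reproduced here; the shrink factor, function name, and level count are illustrative assumptions only.

import numpy as np

def misrepresent(weights, alpha, beta, g_min, g_max, shrink, n_bits=4):
    """Illustrative model of heat-induced weight misrepresentation.

    `shrink` is the fraction by which the usable conductance range has
    narrowed at the current temperature (e.g., 0.5 at 400K per Figure 5.1(a)).
    Weights whose ideal conductance exceeds the shrunken ceiling are clamped
    to the nearest still-available level, so several weights collapse onto
    the same conductance and are misrepresented.
    """
    g_ideal = alpha * weights + beta                 # Equation 5.1 at room temperature
    g_on_hot = g_max - shrink * (g_max - g_min)      # reduced ON-state conductance
    g_clamped = np.minimum(g_ideal, g_on_hot)        # values above the ceiling collapse
    levels = 2 ** n_bits
    step = (g_max - g_min) / (levels - 1)
    q = lambda g: np.round((g - g_min) / step)
    misrepresented = np.mean(q(g_clamped) != q(g_ideal))
    return g_clamped, misrepresented                 # fraction of affected weights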

Figure 5.2: Temperature distribution of a single RRAM chip when running VGG16, InceptionV3, and ResNet50.

When pipelining multiple CNN applications, each application runs in different layers during runtime. Because different CNN layers have various sizes, the amount of data and matrix

multiplication operations conducted by different neural network layers vary. When each layer is bound to certain groups of tiles on the chip, power consumption varies between tiles and changes dynamically over time, which may cause a non-uniform thermal distribution on the RRAM chip. We generate the steady-state temperature distribution of the entire RRAM chip with three CNN models (VGG16, InceptionV3, ResNet50) performing inference on 10000 ImageNet [194] images. As Figure 5.2 shows, the temperature difference can be as high as 17.16K, and the distribution shows several distinctive hot spots. Design-time mechanisms are not sufficient to solve such dynamic issues; instead, identifying hot spot tiles and mitigating their heat at runtime can optimize the thermal status. In the next section, we propose two mechanisms: one dynamically mitigates the negative impact of overheating, and the other reduces hot spots on the RRAM chip.

5.2 HR3AM Design

5.2.1 HR3AM Overview

Figure 5.3: Overview of HR3AM structure.

To handle the inference accuracy loss imposed by thermal issues, we propose HR3AM, a heat resilient design for RRAM-based CNN accelerators. Figure 5.3 shows an overview of HR3AM. When running CNN applications, HR3AM monitors the dynamic thermal distribution of the

RRAM chip. We adopt the temperature sensor design used in commercial processors [195], which provides temperature detection for each crossbar unit. HR3AM provides two heat resilient methods, heat-resilient weight adjustment and dynamic thermal management, to optimize inference accuracy and temperature distribution, respectively. For heat-resilient weight adjustment, we introduce a mechanism called bitwidth downgrading that dynamically changes the bitwidth of RRAM cells. Downgrading the bitwidth of hot RRAM cells trades a small amount of weight precision for avoiding the weight misrepresentation caused by conductance range reduction. For dynamic thermal management, we propose tile pairing, which moves a portion of the operations of hot tiles to pre-allocated idle tiles, reducing the power consumption of hot tiles and releasing heat. The rest of this section discusses the details of these two methods.

5.2.2 Heat-resilient Weight Adjustment

Bitwidth Downgrading. As we discussed in Section 5.1.2, the major cause of accuracy decline is the conductance range drop when the temperature rises. Our evaluation shows that CNN applications lose 0.9% inference accuracy on average for every 1K increase in temperature.

To accommodate the shifting conductance range, we modify α and β of Equation 5.1. Based on the observation that GOFF changes much less than GON, we can assume GOFF^new ≈ GOFF^old. When the temperature rises, the new conductance can be calculated with a new coefficient γ:

Gnew = γ ∗ (α ∗ W + β) (5.2)

where γ ≈ GON^new / GON^old. However, the RRAM conductance range, which is divided into multiple levels to represent all the bit states, is not adjustable. The conductance shrinkage causes a weight value mapped to a high conductance to be misrepresented as a lower conductance. Instead of directly using these misrepresented values, we propose to downgrade the bitwidth such that the converted weight can be properly mapped into the available conductance range. Bitwidth downgrading acts as γ

in Equation 5.2. When both the weight W and the parameter β are shifted right by N bits, the mechanism changes the conversion equation to Gnew = (1/2^N) ∗ (α ∗ W + β), where N represents the number of shifted bits. Our design also adjusts the multiplication result produced with the downgraded weights to ensure output correctness: we directly shift the result back by N bits using the existing shift-and-add units. The new formula for the output voltage when bitwidth downgrading is in effect is VO = VI^T ∗ Gnew ∗ RS ∗ 2^N = (VI^T ∗ RS ∗ Gold / 2^N) ∗ 2^N. The disadvantage of this mechanism is the loss of weight precision. However, bitwidth downgrading changes weights uniformly and linearly, so the ratios between different synapses' weights are unchanged, and calculations using bitwidth downgrading provide better accuracy than those using misrepresented values. As we show in Section 5.3, bitwidth downgrading causes much less inference accuracy loss than temperature-induced conductance changes.
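The arithmetic of Equation 5.2 and the output correction can be verified numerically with the short sketch below. The function and parameter names are illustrative; the read-out resistance RS and the shift amount are stand-ins rather than values from the actual design.

import numpy as np

def downgraded_output(v_in, weights, alpha, beta, r_s, n_shift=1):
    """Numerical sketch of bitwidth downgrading (Equation 5.2 with gamma = 1/2**N).

    The conductances are encoded at 1/2**N of their nominal value so they fit
    in the shrunken range, and the dot-product result is scaled back by 2**N
    in the shift-and-add stage, matching
        V_O = (V_I^T * R_S * G_old / 2**N) * 2**N.
    """
    g_old = alpha * weights + beta                 # nominal encoding (Equation 5.1)
    g_new = g_old / (2 ** n_shift)                 # downgraded encoding (Equation 5.2)
    partial = v_in @ g_new * r_s                   # analog dot product read out via R_S
    return partial * (2 ** n_shift)                # shift-and-add restores the scale

Because the rescaling is uniform, the ratios between synapse contributions are preserved, which is the property the accuracy argument above relies on.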

Hardware Support. Figure 5.4 presents the hardware design for bitwidth downgrading. We add a temperature register, a comparator, a downgrade bit for each crossbar array, and a control circuit. The temperature register stores the most recently sensed temperature of the crossbar array. The comparator checks whether the current temperature of the crossbar array exceeds the threshold. The control signal from the comparator updates the downgrade bit to notify the crossbar to conduct bitwidth downgrading. The weight encoding hardware reprograms weights into the crossbar cells. We slightly modify the weight encoding logic such that the weights shift N bits left before encoding when the downgrade bit is set; therefore the encoded conductance is 1/2^N of the original. The adjustment on the output only requires a minor modification to the shift-and-add unit. The control signal from the downgrade bit forces the shift-and-add unit to shift N bits right, such that the final result expands 2^N times and matches the output without bitwidth downgrading. The implementation cost consists of a temperature register, a comparator, a downgrade bit and its control logic, and a modification to the encoding logic to support bit shifting for each crossbar array. These are either small storage elements or simple logic circuits.

Figure 5.4: Bitwidth downgrading hardware in the crossbar array.

Bitwidth downgrading procedure. Figure 5.5 presents the flowchart of the bitwidth downgrading mechanism on a crossbar array. At step 1, the comparator updates the downgrade bit based on the current temperature. Step 2 checks whether the status of the downgrade bit has changed: if it is unchanged, the crossbar array directly starts multiplication by going to step 3b, while a change leads to step 3a (bitwidth downgrading or restoration). There are two cases in step 3a. When the downgrade bit makes a 0 → 1 change, the crossbar array recomputes the conductance with the N-bit-shifted weights and encodes them to the RRAM cells. When the bit makes a 1 → 0 change, the crossbar array conducts weight restoration: it loads the original weights and encodes them to the RRAM cells. Step 3b conducts the matrix multiplication within the crossbar. Step 4 checks the downgrade bit. A set bit leads to step 5, where the crossbar array shifts the output back by N bits using the shift-and-add unit. The final result is forwarded to the tile buffer.

Figure 5.5: Bitwidth downgrading operation flowchart.

The bitwidth downgrading mechanism only takes effect inside the crossbar array. It requires neither coordination between the crossbar array and the MAC, nor coordination among bitwidth-downgraded crossbar arrays. The downgraded weights are generated temporarily and do not overwrite the original copy. Only a one-bit shift is needed to match the shifted conductance in the worst case (400K) observed in Section 5.1.2; therefore, we set N = 1 across the entire design. We choose a temperature threshold of 330K because we observe a significantly increased accuracy decline rate after 330K in Figure 5.1.

5.2.3 Dynamic Thermal Management

Tile Pairing. As we demonstrate in Section 5.1.2, the thermal distribution is not uniform across the RRAM chip. Temporarily reducing the power consumption of hot spot tiles can improve the thermal status. However, conventional power saving techniques, such as decreasing the frequency, require complicated designs and significantly decrease performance. Therefore, we propose a novel approach, tile pairing, which matches an overheated tile with an idle tile while both paired tiles work in a low power mode. In the low power mode, every other row in a crossbar array is activated, such that only half of the cells on the crossbar array are functioning. Based on [74], a low-power-mode tile consumes only 61.8% of the power of a full-power-mode tile. We pair overheated units at the tile level because pairing at a finer granularity (MAC or crossbar array) leads to excessive overhead for tracking paired units. We identify overheated and cooled-down tiles from the collected temperature statistics. When 80% of the crossbar arrays within an unpaired tile reach the threshold, we identify the tile as overheated. When the percentage of overheated crossbar arrays in a paired tile drops below 40%, we identify the tile as cooled-down. A cooled-down tile unpairs from its partner, and the partner resumes as an idle tile.
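The 80%/40% hysteresis can be summarized by the following sketch. The tile and idle-pool data structures are hypothetical and exist only to illustrate the pairing policy; they are not part of the hardware description.

def update_pairing(tiles, idle_pool, threshold_k=330.0):
    """Sketch of the tile pairing decision using the 80% / 40% hysteresis.

    `tiles` maps a tile id to its per-crossbar-array temperatures plus a
    `paired_with` field; `idle_pool` holds the ids of reserved idle tiles.
    """
    for tid, tile in tiles.items():
        temps = tile["xbar_temps"]
        hot_frac = sum(t >= threshold_k for t in temps) / len(temps)
        if tile["paired_with"] is None:
            # Pair an unpaired tile once 80% of its crossbar arrays exceed the threshold.
            if hot_frac >= 0.8 and idle_pool:
                tile["paired_with"] = idle_pool.pop()
        else:
            # Unpair once the overheated fraction drops below 40%; release the idle tile.
            if hot_frac < 0.4:
                idle_pool.append(tile["paired_with"])
                tile["paired_with"] = None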

Figure 5.6: Tile pairing (a) mechanism and (b) control logic.

Figure 5.7: Inference accuracy of four neural network models.

Figure 5.6(a) demonstrates the pairing mechanism. When an overheated tile pairs with an idle tile, the hot tile is the master tile and the idle tile is the slave tile. Every crossbar array in the master tile pairs with the crossbar array with the same index in the slave tile. Both master and slave tiles use half of their cells to produce the result. The two crossbar arrays use the same weights G and the same input VI. The master tile array uses the even-index columns and the slave tile array uses the odd-index columns. The master tile array produces the output VO^m = {v0, v2, ..., v2N} and the slave tile array produces the output VO^s = {v1, v3, ..., v2N+1}. The final result is the union of the two outputs, VO = VO^m ∪ VO^s, which is consistent with the result generated from an unpaired crossbar. To scatter a matrix multiplication across two tiles, we rearrange the data placement between the master and slave tiles. During pairing, the controller transfers the weight matrix from the weight memory of the master tile to that of the slave tile. During multiplication, the master and the slave tile share the input data; the data stored in eDRAM vaults are transferred through routers. After the calculation, the master tile combines the outputs of the two tiles.
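The even/odd column split and its recombination can be checked with a small sketch. The function name is illustrative; it only demonstrates that interleaving the two halves reproduces the unpaired-crossbar result.

import numpy as np

def paired_outputs(v_in, g):
    """Sketch of tile pairing: the master array drives even-index columns and
    the slave array drives odd-index columns of the same weight matrix, and
    the master merges the two halves back into the unpaired-crossbar result."""
    v_master = v_in @ g[:, 0::2]          # even-index columns on the master tile
    v_slave = v_in @ g[:, 1::2]           # odd-index columns on the slave tile
    v_out = np.empty(g.shape[1])
    v_out[0::2] = v_master                # interleave the two halves
    v_out[1::2] = v_slave
    return v_out

For any input vector v and weight matrix g, paired_outputs(v, g) matches v @ g, which is the consistency property stated above.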

Hardware Support. As Figure 5.6(b) shows, we add control logic to each RRAM crossbar to support the low power mode. We add a pairing bit and a master/slave bit to indicate the status of the tile. The crossbar array acts as normal when the pairing bit is disabled. When the pairing bit is set, the hot tile is marked as the master with the master/slave bit, and the idle unit is marked as the slave with the same bit. On the master tile, the control logic enables only the even-index (2N) columns, and the corresponding S&Hs and ADCs are disabled as well. The slave tile enables only the odd-index (2N+1) columns. The state-of-the-art RRAM architecture [74] does not support dynamic scheduling of tiles: when the pipeline design allocates all the tiles to all the layers, certain tiles are idle during inference but are still bound to a network layer. Therefore, we reserve several tiles as idle tiles and keep them from being assigned to any network layer. The reserved tiles potentially hurt chip performance due to the loss of parallelism; our performance overhead evaluation in Section 5.3.4 shows that throughput decreases by only 2.1% on average.

5.3 Evaluation

5.3.1 Methodology

Our RRAM chip setup is based on a state-of-the-art architecture [74]. The chip consists of four layers of eDRAM and an RRAM layer and operates at 1.2GHz. We build our own RRAM-based neural network accelerator simulator. Our simulator models the RRAM conductance variation between 300K and 400K based on [185], and updates the conductance according to the temperature every 100ms at the crossbar-array granularity. Our simulator obtains chip temperatures dynamically from HotSpot [196]. The thermal characteristics of the HMC architecture are obtained from previous work [197]. Our thermal simulation uses a single RRAM cycle (100ns) as the time step, as described in Section 5.2.1. Our simulator coordinates with the Tensorflow framework [198] such that it can evaluate any Tensorflow-based model on RRAM crossbars. We evaluate our design with a small-scale two-layer neural network for MNIST handwritten digit classification [199] and large-scale neural

networks (VGG16, ResNet50, and InceptionV3) for ImageNet [200] classification. For MNIST, we use a set of 60000 cases for training and 10000 cases for inference. For the ImageNet classification networks, we use pre-trained weights for convenience and use 10000 images from the Large Scale Visual Recognition Challenge [194] for inference. We evaluate the inference accuracy with top-five prediction results. We compare our design with two baselines:

• No heat resilience (Base) implements a plain RRAM chip without any heat resilient design.

• Temperature-Aware Row Adjustment (TARA) implements the scheme proposed by prior work [189].

Our schemes include:

• HR3AM with Bitwidth Downgrading (HR3AM-BD) implements only the proposed bitwidth downgrading.

• HR3AM implements both the proposed bitwidth downgrading and tile pairing mechanisms.

5.3.2 Inference accuracy

Figure 5.7 shows the inference accuracy of four neural network models under various ambient temperatures. We scale the ambient temperature from room temperature (300K) to 360K, where the RRAM chip is significantly affected by poor heat dissipation. We make several observations from the results. First, our design is resistant to the heat-induced temperature increase: HR3AM-BD and HR3AM obtain at least 83.5% and 87.8% relative accuracy, respectively, across all the cases. Second, HR3AM has relatively stable accuracy. TARA's accuracy drops by 40.05% on average when the ambient temperature rises by 60K; at the same time, the inference accuracy of HR3AM-BD drops by only 7.77%, and that of HR3AM by only 4.82%. Third, networks with larger scale and more layers benefit more from HR3AM. VGG16, ResNet50, and InceptionV3 show a larger improvement over the baselines than MNIST. The thermal issue is worse for these networks: they have complex structures and process large input images, which leads to intensive data movement between tiles and eDRAM

and high power consumption. HR3AM provides further accuracy and thermal optimization for these cases. Overall, we improve inference accuracy by 4.8%–58% over the Base and by 4.3%–41.8% over TARA.

5.3.3 Thermal status

Figure 5.8: Comparison of thermal distributions of an RRAM chip when running three networks.

Figure 5.9: Comparison of maximum and average temperatures of the RRAM chip when running three networks. The x-axis represents the temperature measurement time in milliseconds.

We present the thermal status results with three network models: VGG16, ResNet50, and InceptionV3. Because TARA [189] does not optimize the thermal status, we only compare our design with the Base, which has no thermal optimization. We reserve 10% of the tiles as idle tiles for HR3AM. Figure 5.8 presents the comparison between HR3AM and the Base on the steady-state thermal distribution. We make three observations. First, our design significantly reduces the hot spot region, which indicates that it manages to identify overheated tiles and mitigate their heat. Second, idle tiles help mitigate heat and stay at a relatively low temperature; however, these idle tiles form a cold spot, and we could further optimize our design by distributing idle tiles more evenly across the chip. Lastly, the released heat on hot tiles also benefits the inference accuracy: HR3AM has a smaller percentage of crossbar arrays that trigger bitwidth downgrading than the Base, which brings a 2% improvement in inference accuracy compared to HR3AM-BD. To show the effectiveness of HR3AM during chip operation, we measure the maximum and average temperature every 10ms across all the crossbar arrays of the chip, for a total of 100ms during the steady operational state. Figure 5.9 shows the maximum and average temperatures during this time. Our design limits the maximum temperature below 336K and the average temperature below 327K. HR3AM contributes a 6.2K decrease in maximum temperature and a 6K decrease in average temperature.

5.3.4 Performance and energy

We adopt the performance and power model from [74] to estimate HR3AM. We normalize the results of each network running on HR3AM to the same network running on the Base. Figure 5.10(a) shows a 2.1% loss in normalized throughput. Both the weight encoding in bitwidth downgrading and the reserved idle tiles in tile pairing cause this performance degradation. The weight encoding process delays the multiplication, but it only happens when the downgrade bit changes. In tile pairing, HR3AM reserves 10% of the tiles and therefore

loses computational parallelism. However, because these tiles work under the low power mode, Figure 5.10(b) shows a 7.7% average decrease in normalized power.

Figure 5.10: Normalized (a) throughput and (b) power consumption in HR3AM.

5.4 Conclusion

In this work, we investigate how heat impacts the inference accuracy of RRAM-based accelerators. Our evaluation shows that network inference accuracy drops significantly when the temperature increases. We propose HR3AM, a heat resilient design for RRAM-based neural network accelerators. HR3AM consists of two mechanisms: bitwidth downgrading adjusts weights to the reduced RRAM conductance, and tile pairing matches hot spot RRAM tiles with idle tiles. Compared to a state-of-the-art prior work, HR3AM achieves an accuracy improvement of up to 41.8% on four real-world CNN models. Our thermal evaluation shows a 6.2K decrease in the maximum temperature and a 6K decrease in the average temperature.

Acknowledgments

This chapter contains material from “HR3AM: A Heat Resilient Design for RRAM-based Neuromorphic Computing.”, by Xiao Liu, Minxuan Zhou, Tajana S. Rosing, and Jishen Zhao, which appears in the Proceedings of the 2019 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED 2019). The dissertation author is the primary investigator and first author of this paper.

Chapter 6

Mirage: A Highly Paralleled And Flexible RRAM Accelerator

The emergence of NVM provides an alternative for memory devices. One representative NVM – Resistive RAM (RRAM), or memristor, enables not only data storage ability, but also computation ability [201, 79]. This duality blurs the boundary between memory and computation and is also referred to as processing-in-memory (PIM). PIM architecture may achieve higher performance and better power efficiency, compared to the conventional storage-only memory architecture [202, 203, 204]. The RRAM-based PIM architecture has been intensively explored to accelerate deep neural network (DNN) applications in recent years [74, 75, 76]. Multiple RRAM cells form an array structure called “crossbar”, which can conduct a multiply-accumulate operation in a single cycle. This Single Instruction, Multiple Data (SIMD) design allows parallel matrix multiplications and convolutions, which are the fundamental operations in various DNNs. For instance, calculations in fully connected layers and convolutional layers in convolutional neural networks (CNN) are essentially matrix multiplications and convolutions. Prior works demonstrate SIMD architecture can achieve better performance compared to CMOS-based application-specific

integrated circuits (ASIC) accelerators [74, 75, 76], and higher energy efficiency compared to GPUs [73, 76]. However, existing RRAM-based DNN accelerators lack several essential features and compromise performance and flexibility. First, the data dependency resulting from shared data in the existing pipeline design degrades the performance of the accelerator. Based on our analysis of the pipeline that combines inter-layer and intra-layer parallelism [76, 74], we discover that existing designs spend up to 50% of the inference time on idle cycles because of the shared-data-induced dependency. As we demonstrate in Section 6.1, data sharing between duplicate kernels intensifies the data dependencies between different layers, which leads to significant pipeline stalls during execution; idle cycles caused by shared data can take up to 50% of the single-inference latency for certain network models. Second, existing hardware assignments do not fit the pipeline design of the RRAM-based accelerator. The pipeline design requires a mapping between the hardware resources and the different layers in a network, so each network needs a specific hardware mapping in order to execute on the accelerator. This calls for an approach that customizes the hardware assignment to the network layers automatically. On the one hand, a user-generated assignment can squander the performance provided by hardware parallelism. On the other hand, the existing automatically generated assignments [78, 205, 206, 77] cannot be applied to the pipeline structure. To address these issues, we introduce Mirage, an RRAM-based accelerator that provides 1) high parallelism by resolving the data dependency due to shared data in duplicate kernels, and 2) high flexibility by enabling automatic hardware assignment for general DNN applications. Mirage consists of two primary designs, Finer-grained Parallel RRAM Architecture (FPRA) and Auto Assignment (AA), to deal with the issues raised by shared data and hardware assignment, respectively. FPRA is based on the insight that a proper arrangement of the duplicate kernels and unified buffering of input and output data can resolve the data dependency caused by shared data. FPRA introduces kernel batching to group the computation of the duplicate kernels

of the same layer, and data sharing aware memory to uniformly manage the input and output data of duplicate kernels. AA utilizes the network parameters and the accelerator hardware configuration to generate a hardware assignment automatically. AA follows two rules that we observed from DNN models: 1) layers at the front need to have the same or a higher level of parallelism than the layers deeper in the network¹, and 2) consecutive layers with the same level of parallelism suffer the least amount of stall from data dependency. With a greedy strategy that follows these rules, AA searches for the best assignment for all the layers under the available hardware. This dissertation makes the following contributions:

• We discover that shared data induces data dependency, and evaluate the corresponding inefficiency in the pipeline design of RRAM-based NN accelerators.

• We show that manual assignment of layers to the crossbars leads to underutilization of the hardware, while the existing automatic assignments are inefficient when the design is pipelined.

• We propose Mirage, a highly parallel and flexible RRAM-based NN accelerator. Mirage consists of two main features: Finer-grained Parallel RRAM Architecture (FPRA) and Auto Assignment (AA).

• We evaluate Mirage on eight image recognition DNN models. Our results show a 2.0× speedup² and a 2.1× throughput increase against the state-of-the-art baseline.

6.1 Background & Motivation

Existing parallelisms in RRAM-based accelerator. Similar to the CMOS-based NN accelerators [207, 208, 209, 210], RRAM-based accelerators possess intra-layer parallelism. In the intra-layer parallelism, the computations of all the parameters of a kernel are concurrent. The reason is that one crossbar array computes all the weights stored in RRAM cells concurrently

¹ For any layer in the network, "layers at the front" refers to the layers that are closer to the first layer, and "layers deeper in the network" refers to the layers that are further from the first layer. ² Speedup refers to the speedup in single-inference latency.

over a single cycle. RRAM-based accelerators obtain intra-layer parallelism by mapping the kernel parameters to one or several crossbar arrays. RRAM-based accelerators can also duplicate kernel parameters to different groups of crossbar arrays to further improve parallelism, such that different parts of the input feature map are processed concurrently. The pipeline design in RRAM-based accelerators introduces inter-layer parallelism. RRAM-based accelerators can deploy more computational units within the same chip area because of the size advantage of the memristor, and these additional computational units enable the accelerator to process more matrix multiplications in parallel. Therefore, the existing RRAM-based accelerators [74, 76] deploy all the network layers on the accelerator at the same time. Doing so requires the accelerator to allocate tiles for each layer, such that each layer becomes a pipeline stage of inference processing. A layer does not need to wait until the previous layer's output feature map is entirely generated; this pipeline design allows each layer to start computation as soon as the required input pixels are ready [74]. We estimate the theoretical maximum throughput of the RRAM-based accelerator (with the configuration in Table 6.1) to be 6.5×10^3 GOP/s (Giga 16-bit operations per second) for each tile. Figure 6.1 compares the theoretical throughput with the throughput of four RRAM-accelerated neural networks executing on the simulator. The results show a significant performance gap between the theoretical and simulated throughput. Based on our detailed analysis in the following section, we find that most of this performance gap comes from data-dependency-related pipeline idleness.

Figure 6.1: The throughput of four neural networks. Numbers are normalized to the theoretical maximum throughput of the accelerator.

Figure 6.2: Shared data in a convolutional layer with four duplicate kernels.

6.1.1 Shared Data Induced Data Dependency

Shared data between duplicate kernels emerges when pipelining and kernel duplication are applied simultaneously. In kernel duplication, the feature map is evenly divided by the kernels. Different kernels compute different parts of the output feature map such that none of the computation is repeated. However, when the kernel size is larger than 1×1, parts of the input feature map overlap among the duplicate kernels. Figure 6.2 shows the data sharing of a convolutional layer with four duplicate 2×2-sized kernels. We may partition the input feature map differently. However, shared data still exists among duplicate kernels. The shared data does not affect intra-layer parallelism by itself when all the input feature maps are prepared at the start of computation. The pipeline naturally has the data dependencies between consecutive layers. The shared data, however, further aggravates the data dependencies because the shared data blocks multiple kernels in the subsequent layers. Figure 6.3 illustrates an example of shared data induced data dependency in the first two convolutional layers in a DNN model. These two layers both use a 2×2-sized kernel and both kernels are duplicated twice. As the figure shows, all the output feature maps are divided into two halves. In each layer, two duplicate kernels share the middle part of the pixels of the input feature maps. During the

computation, each kernel computes a single output pixel within a compute cycle. Starting at the top-left corner of its input, each kernel moves from left to right to perform computations. The first layer has all of its input feature maps (a) ready at the beginning, so none of the kernels (b, c) in layer 1 are blocked. In layer 2, kernel B (f) is blocked due to the missing input data (output pixel 4 of layer 1) (d); kernel B needs to wait until output pixel 4 from layer 1 is computed. We could change the order of computation in layer 1 to force kernel A to compute the shared data first, but this instead blocks the computation of kernel A (e) in layer 2. With more duplicate kernels and deeper network layers, more shared data exists in the input feature map. This aggravates the data dependency problem, inducing more idleness for the duplicate kernels.

Figure 6.3: An example of data sharing in the first two layers of a DNN model.

Figure 6.4: Single inference latency breakdown.

To quantify the inefficiency of the shared data induced data dependency, we simulate a pipeline-enabled RRAM-based accelerator (configuration details in Section 6.4) and break down the single inference latency, as shown in Figure 6.4. We break down the latency into five categories: computation time of crossbar arrays, idle time resulting from shared data related dependency, communication time (e.g., time for synchronization between tiles and MACs), data movement time (e.g., timing for transferring input and output data), and others. Due to the pipeline, crossbar arrays belonging to different layers have significantly different cycles on each type of operation. We collect the cycles for each type of operation and calculate the breakdown for all the crossbar arrays of each layer. We calculate the overall breakdown with the geometric mean of each layer. We observe that the shared data induced idleness forms a significant portion, contributing to nearly 50% of the overall latency. This can be attributed to pending shared data blocking computations of multiple kernels. Therefore, arranging the sequence of computations on shared data becomes critical to the performance of pipeline-enabled RRAM accelerators.

6.1.2 Issues with Existing Hardware Assignments

The pipeline requires a dedicated assignment that maps each layer to different hardware [74, 76]. To properly execute a network, the accelerator relies on the users' experience

for the assignment. The user needs to estimate the required hardware resources of the network and assign hardware accordingly. To maximize the accelerator's performance, the user also needs to be aware of the pipeline and kernel duplication to fully exploit the parallelism of the accelerator. Prior works [78, 205] provide automatic hardware assignments of the network, but do not distinguish between pipeline and non-pipeline designs. To fully utilize the performance of the pipeline structure, the specific assignment for each layer still needs to be coordinated by the user. Requiring this architectural experience is impractical for a regular user, who expects to execute the network model on the accelerator much as on GPUs or CPUs. A poor user assignment can lead to underutilization of the hardware, as shown by the example in Figure 6.5. We configure the RRAM-based accelerator to execute a VGG-11 [192] DNN model and assume the hardware is sufficient for processing twice the amount of the network parameters. An inexperienced user may simply duplicate the kernels of each layer twice to improve parallelism. With the same hardware, an adjusted assignment reduces the kernel duplication in the last two layers to allow more duplication in the layers at the beginning. This adjustment utilizes the hardware more efficiently: the front layers process larger feature maps and therefore benefit more from having more kernels. Even without shared-data-induced idleness, with the same number of kernels, front layers take longer to process a feature map than the subsequent layers.

Figure 6.5: Comparison between the user assignment and the adjusted assignment on (a) the number of duplicate kernels in eight convolutional layers, and (b) inference speedup.

Figure 6.5(a) compares the number of duplicate kernels in eight convolutional layers between the user assignment and the adjusted assignment. We simulate the performance of these two assignments and compare the speedup in Figure 6.5(b). Since a normal user is oblivious to the intricacies of the data dependency induced in the pipeline and the hardware resources required by each layer, the user assignment results in a significantly slower design than our adjusted assignment. Overall, pipeline designs in RRAM-based accelerators suffer from inefficiency induced by shared data and hardware assignment. Some existing RRAM-based accelerators attempt to provide flexibility and reconfigurability [78, 205] while ignoring the shared data and the pipeline architecture. Other accelerators attempt to address both issues by abandoning the pipeline design [206, 77]; however, these solutions underutilize the on-chip area of RRAM, which leads to lower throughput. Mirage resolves these issues while supporting the pipeline design of the accelerators.

6.2 Mirage

In this section, we introduce Mirage, an architectural design for the RRAM-based DNN accelerator. The design of Mirage aims to provide high parallelism and flexibility for various neural networks. As shown in Figure 6.6, the Mirage design has two major components: Finer-grained Parallel RRAM Architecture (FPRA) and Auto Assignment (AA). FPRA improves the parallelism of the RRAM architecture from two aspects: reducing shared-data-induced idleness in the crossbar arrays, and decreasing the amount of data transferred between tiles. AA improves flexibility by automatically generating a hardware assignment with high hardware utilization and parallelism.

Figure 6.6: The overview of the Mirage architecture.

6.2.1 Finer-grained Parallel RRAM Architecture (FPRA)

To alleviate the shared data induced data dependencies (described in Section 6.1.1) in RRAM-based accelerators, we propose Finer-grained Parallel RRAM Architecture (FPRA). FPRA aims to achieve higher parallelism with finer-grained coordination of data between different kernels and different layers. As Figure 6.6 shows, FPRA is a tile-level design consisting of two mechanisms: kernel batching and data sharing aware memory. The kernel batching groups duplicate kernels and uniformly manages their data forwarding and computation. The data sharing aware memory stores shared input/output data between all batched kernels and manages data for all the kernels together to reduce unnecessary data duplication. We implement the kernel batching in the tile controller and data sharing aware memory in the eDRAM of each tile (details in Section 6.3.1).

Kernel batching. The shared data induced dependency exists because of the partitioned input feature map for each kernel (Section 6.1.1). We observe that enforcing all the kernels of the

same layer to compute simultaneously over continuous shared data can eliminate the shared-data-induced dependency and keep only the data dependency between layers. We refer to this mechanism as kernel batching. When all the layers in the network apply kernel batching, all the kernels belonging to the same layer behave uniformly: either pending for input or computing output. Kernel batching requires several rules to guarantee that the computation follows the network configuration without being repeated on different kernels. First, kernel batching places all the kernels together with one stride size as the minimal gap; two adjacent kernels have to be exactly one stride apart. This ensures the generated output data is continuous within each computation iteration, with no overlap. Second, the duplicate kernels shift in the same direction with the same gap between computations. This ensures the generated output data is continuous across computation iterations. Third, the duplicate kernels may shift multiple strides between computations; depending on how the kernels are batched and the direction in which they move, each kernel may take multiple strides to avoid recomputing data.

Figure 6.7: Two kernel batching types: (a) linear batching, and (b) square batching.

We propose two types of kernel batching: linear batching and square batching. The linear batching places all the kernels in a row (or a column). Figure 6.7 (a) shows an example of the linear batching in a convolutional layer. The kernel size is 2×2×64, and the stride size is 1. The same kernel is duplicated into four replicas in a row with indices A, B, C, and D. The batched

kernels move in a vertical direction with a single stride gap. Linearly batched kernels rely on a rectangular-shaped input to generate one line of output. Square batching places the same number of kernels in consecutive rows and columns; any two adjacent kernels are still only one stride apart. Figure 6.7(b) shows an example of square batching with the same kernel size and stride as the linear batching example. Square-batched kernels rely on a square-shaped input to generate a square-shaped output. The two types of kernel batching fit different neural network layers and different numbers of duplicate kernels. Square batching fits a convolutional layer that connects to a pooling layer: because the pooling layer relies on square-shaped input, if the square-batched kernels produce exactly the same sized output for every computation iteration, the idleness between the convolutional layer and the pooling layer is minimized. However, square batching requires the number of kernel duplicates to be a perfect square, which may not be achievable with the available hardware. Linear batching, in contrast, is feasible with any number of duplicate kernels. Mirage uses linear batching and square batching simultaneously: FPRA provides both types of batching, while AA decides which type is used based on the available hardware and the topology of the neural network. We explore the performance impact of different batching-type strategies in Section 6.5.2.
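The placement rule that adjacent duplicate kernels sit exactly one stride apart can be expressed as a small helper. This is an illustrative sketch only; the function name and its return format are assumptions, not part of the accelerator's configuration interface.

def batched_kernel_origins(num_kernels, stride, mode="linear"):
    """Sketch of kernel batching placement: return the (row, col) origin of
    each duplicate kernel so adjacent kernels are exactly one stride apart.

    "linear" places kernels in a single row; "square" places them in a
    sqrt(num_kernels) x sqrt(num_kernels) block (the count must be a square).
    """
    if mode == "linear":
        return [(0, i * stride) for i in range(num_kernels)]
    side = int(num_kernels ** 0.5)
    assert side * side == num_kernels, "square batching needs a square kernel count"
    return [(r * stride, c * stride) for r in range(side) for c in range(side)]

Per the second and third rules above, all of these origins then shift together in the same direction between computation iterations, possibly by several strides, so no output pixel is recomputed.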

Data sharing aware memory. The kernel batching helps reduce the shared data induced idleness in the pipeline. However, as the batched kernels share more data, the amount of input data used by all the kernels is significantly increased. For example, in a convolutional layer with 64 64×64-sized input feature maps and four kernel replicas, using linear batching can increase the amount of input data by 1.6× compared to no kernel batching. Because the duplicate kernel is not aware of the shared data, the amount of data forwarding between layers can significantly increase. The increased data movement exhausts the bandwidth of NoC, and leads to limited performance increase (Section 6.5.1). To solve the issue, we propose data sharing aware memory to efficiently forward data

between layers and dispatch data to the duplicate kernels. The data sharing aware memory utilizes the memory space provided by the eDRAM of each tile (Section 6.3.1) and buffers part of the input and output feature maps of the layer. As Figure 6.8 shows, the feature map data is transferred via inter-tile data synchronization through the NoC. The output data of the four linearly batched kernels in layer i is stored in the data sharing aware memory and transferred to the memory of layer i+1 together. The data sharing aware memory stores the input feature map for all the kernels as a whole and dispatches the input data to each kernel's crossbar arrays. The dispatch only happens inside the tile (i.e., intra-tile data synchronization), which costs no NoC bandwidth. This is because each kernel is loaded across multiple crossbar arrays, where the crossbar arrays belong to different tiles. In this example, the data sharing aware memory dispatches data a–d to kernel A's crossbar arrays, c–f to kernel B's crossbar arrays, e–h to kernel C's crossbar arrays, and g–j to kernel D's crossbar arrays.

Figure 6.8: Data sharing aware memory in two layers of a DNN.

The data sharing aware memory also manages the execution of the batched kernels. Since data sharing is invisible to the duplicate kernels, the memory utilizes intra-tile data synchronization to

start computation of each kernel. The data sharing aware memory keeps track of each kernel's location and progress, as the duplicate kernels' metadata is stored inside the memory. This provides the corresponding input data for each layer upon receiving the data. As batched kernels move with the same stride and direction, the data sharing aware memory uses a batched window to uniformly keep track of the input and output data. Figure 6.9 shows an example of the batched window shifting over the input and output shared memories. The movement of the batched window guarantees that none of the output is missed or overlapped. In each step of the batched window, the data sharing aware memory dispatches the data within the window to all the kernels. When the computation is complete, the output data is sent to the shared memory. When all the data within the output batched window has been received, the tile forwards the data to the next layer; the data is released when the computation and forwarding are finished. The data sharing aware memory benefits from the kernel batching design as well. The batched kernels always generate continuous output data and consume input data continuously, so the data sharing aware memory only needs to buffer the received data needed for current and future computations. Once the computation completes, the input data that is no longer required can be released. We find that the eDRAM size (e.g., 64KB in Table 6.1) is sufficient for the data sharing aware memory.

Figure 6.9: Batched window shifts in the input/output feature maps of a convolutional layer. The layer uses two batched 2×2×64-sized kernels with one stride size. The two batched kernels use a shared input of a 2×3×64-sized window. The numbers indicate the steps and the red arrow represents the direction.
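The buffer-only-what-is-needed behavior of the batched window can be sketched as follows. The class shape and field names are hypothetical and only illustrate the bookkeeping; they are not the controller's actual data structures.

class BatchedWindow:
    """Hypothetical sketch of the data sharing aware memory's batched window:
    it buffers only the input pixels the batched kernels currently need,
    dispatches them once complete, and releases consumed data after a shift."""

    def __init__(self, window_size, shift):
        self.window_size = window_size      # number of input pixels per iteration
        self.shift = shift                  # pixels consumed (released) per shift
        self.start = 0                      # first buffered input index
        self.received = set()               # input indices received so far

    def ready(self):
        # The window is ready when every pixel inside it has arrived.
        return all(i in self.received
                   for i in range(self.start, self.start + self.window_size))

    def dispatch_and_shift(self):
        window = list(range(self.start, self.start + self.window_size))
        # Release the pixels that no future computation will need, then advance.
        for i in range(self.start, self.start + self.shift):
            self.received.discard(i)
        self.start += self.shift
        return window                       # indices forwarded to the kernels' crossbars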

6.2.2 Auto Assignment

Auto Assignment (AA) is a method to automate the allocation of hardware resources to neural networks for the RRAM-based accelerator pipeline. AA engages at the configuration stage before the network is loaded onto the accelerator. As shown in Figure 6.10, AA takes in the accelerator configuration and the network hyperparameters to generate the hardware assignment as well as an execution model that specifies the order of execution and the data forwarding path.

Figure 6.10: The work flow of the auto assignment.

AA consists of three steps. First, AA extracts key information from the accelerator configurations and network hyperparameters, including the crossbar array size, the number of crossbar arrays, the topology of the network, computation parameters of each layer (e.g., kernel size and strides), etc. Second, AA estimates the hardware resources (e.g., number of crossbar arrays) required for each layer based on the network information. Using a greedy strategy-based Algorithm 1, AA decides the intra-layer parallelism (e.g., number of duplicate kernels) for each layer. Lastly, AA generates the hardware assignment and execution model based on the estimated hardware resources. AA assigns each layer of the network to the specific hardware (e.g., indices of tiles, MAC, and crossbar array) and produces an execution model for each layer’s hardware,

which includes the number of kernels, kernel sizes and locations, the type of kernel batching, and the adjacent layers' hardware indices. The execution model specifies the order of data to be processed by the assigned hardware and the data forwarding between the hardware. The hardware assignment and execution model are loaded to the accelerator during preconfiguration.

Searching intra-layer parallelism. To find the best inter- and intra-layer parallelism for the network, AA first generates a baseline that only has the pipeline for inter-layer parallelism, with no kernel duplication. The baseline is the minimal hardware requirement to execute the network. The extra hardware resources provided beyond the baseline can be used to duplicate kernels in certain layers. We use two rules to assign the number of kernels for each layer:

• Rule 1: Layers at the front should have more, or at least the same number of, kernels as the layers deeper in the network.

• Rule 2: The difference in the number of kernels between two consecutive layers should be minimal.

Rule 1 is based on the observation that layers deeper in the network always rely on the output from the front layers, while in most DNNs the feature map size in the x and y dimensions decreases as the network propagates. When the kernel moves in the x and y dimensions, the layers deeper in the network take fewer or the same number of computation iterations. Hence, having more kernels in the layers deeper in the network does not increase the parallelism of the layer. Rule 2 is based on the observation that two consecutive layers with the same number of kernels suffer the least amount of idleness, with an exception for layers connected through a pooling layer: because the pooling operation reduces the amount of data, the layer before pooling needs more kernels to produce data. Algorithm 1 and its flow chart (Figure 6.11) show how AA searches for the best intra-layer parallelism for all network layers, where φ represents the number of duplicate kernels in each layer and R represents the hardware resources. The algorithm first generates a baseline ( 1 ) and

Algorithm 1 Searching intra-layer parallelism.
Input: NN, available HW resource (Ravl)
Output: φ[N]
1: φ[:] = 1 {Set up the baseline.}
2: φ′ = φ
3: repeat
4:   φ = φ′
5:   L = GenerateCandidateLayers(φ′, NN) {L holds the candidate layer indexes.}
6:   while (Rreq > Ravl) and (L is not empty) do
7:     φ′ = φ
8:     l = PickOneLayer(L)
9:     φ′[l] = φ′[l] × 2
10:    Rreq = Estimate(φ′) {Estimate the required HW.}
11:  end while
12: until L is empty
13: return φ[N]

then adjusts its intra-layer parallelism. The algorithm selects candidate layers for increasing the number of duplicate kernels ( 2 ). The requirement for the candidate layers is that increasing the number of duplicate kernels on these layers still follows the above two rules. The candidate layers include the first layer and the layers with fewer kernels than the previous layer.

Figure 6.11: The flow chart of the searching algorithm.

The algorithm evaluates two layers connected by a pooling layer in a different way: the subsequent layer becomes a candidate only when the front layer has more than k× the kernels of the subsequent layer, where k is the size of the pooling filter. The algorithm then selects one layer from the candidates to increase its number of kernels ( 3 ). The algorithm uses a greedy strategy that always picks the layers deeper in the network first, because layers deeper in the network with a lower level of intra-layer parallelism are the bottleneck of the pipeline, and increasing the kernels of these layers provides more parallelism than doing so for the layers in front. The algorithm evaluates the amount of hardware resources required for such an increase ( 4 ). If the required duplicate kernels fit within the hardware, the algorithm applies the update and starts the next iteration ( 5 ). If not, the algorithm chooses another candidate layer and re-evaluates the required hardware ( 6 ). The algorithm terminates when none of the candidate layers can be used to increase parallelism.
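The greedy loop can be rendered compactly in Python. This is a sketch of the same search, not the tool's implementation; the estimate_hw and generate_candidates helpers are assumed to encapsulate the hardware estimate and the rule checks described above.

def search_parallelism(num_layers, r_avail, estimate_hw, generate_candidates):
    """Greedy search for per-layer kernel duplication (a sketch of Algorithm 1).

    `estimate_hw(phi)` returns the hardware required by assignment `phi`, and
    `generate_candidates(phi)` returns layer indexes whose duplication can be
    doubled without violating Rules 1 and 2 (deeper layers first).
    """
    phi = [1] * num_layers                      # baseline: pipeline only, no duplication
    while True:
        candidates = generate_candidates(phi)
        applied = False
        for layer in candidates:
            trial = list(phi)
            trial[layer] *= 2                   # double the duplicate kernels of one layer
            if estimate_hw(trial) <= r_avail:   # keep the increase only if it fits
                phi = trial
                applied = True
                break
        if not applied:                         # no candidate fits: the search terminates
            return phi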

6.3 Mirage Implementation

6.3.1 The Implementation of FPRA

FPRA Tile Architecture. To support kernel batching and memory sharing, we design the tile-level architecture for FPRA. Figure 6.12 shows the tile architecture of FPRA. This architecture design consists of I/O buffer, shared memory, tile controller, and MACs. The I/O buffer handles the data transfer between tiles. The buffer connects to the router to handle communications with other tiles. The buffer uses two circular FIFO queues to send data and receive data separately. The shared memory stores input/output feature maps and kernel-related metadata. The buffered input/output feature maps are stored separately and shared among all the MAC units. Kernel-related metadata contains hardware assignment data and the execution model. Unlike feature maps data, metadata are written into the memory during the accelerator’s pre-configuration, and cannot be modified during the execution. Metadata is used by the tile controller to perform

kernel batching and manage the data sharing aware memory. The tile controller manages the data forwarding between the rest of the components. The preconfigured metadata provides the controller with information on how MACs are assigned to different kernels and how data is shared between MACs. As mentioned in Section 6.2.1, the controller stores only part of the feature map to save memory space. The tile controller uses intra-tile data synchronization not only to transfer data between the shared memory and the MACs, but also to manage the execution of the MACs in the correct order.

Figure 6.12: Tile architecture.

Intra-tile data synchronization. As shown in Figure 6.13, the intra-tile data synchronization consists of the following steps. At the beginning, the MACs remain idle while the input data within the batched window is still pending ( 1 ). When all the data in the batched window has been received, the tile controller notifies all the MACs and forwards the corresponding data to the MACs belonging to each kernel ( 2 ). The MACs start computation upon receiving their input data ( 3 ). When computation finishes, the output data is forwarded to the shared memory ( 4 ). Once the output data within the output batched window is received, the tile controller sends the data to the other tiles (e.g., the next layer's tiles) through inter-tile data synchronization. To start the next iteration of computation, the controller also shifts the batched windows for both output and input ( 5 ). The tile is reset to pending, and the input memory re-evaluates the accessibility of data

Figure 6.13: State diagram of the intra-tile data synchronization.

within the batched window. All the MACs return to the idle state and wait for the input data ( 1 ).
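The state machine above can be summarized with the following Python sketch. The tile object and its methods are illustrative stand-ins for the tile controller's internal signals, not the hardware implementation.

from enum import Enum, auto

class TileState(Enum):
    WAIT_INPUT = auto()    # (1) MACs idle until the batched input window is full
    DISPATCH = auto()      # (2) controller forwards each kernel's data to its MACs
    COMPUTE = auto()       # (3) MACs compute on the received inputs
    COLLECT = auto()       # (4) outputs are forwarded to the shared memory
    SHIFT = auto()         # (5) outputs sent to the next tiles; windows shifted

def step(state, tile):
    # One transition of the intra-tile synchronization state machine.
    if state is TileState.WAIT_INPUT and tile.input_window_full():
        tile.notify_macs()
        return TileState.DISPATCH
    if state is TileState.DISPATCH:
        tile.forward_inputs_to_macs()
        return TileState.COMPUTE
    if state is TileState.COMPUTE and tile.all_macs_done():
        return TileState.COLLECT
    if state is TileState.COLLECT:
        tile.write_outputs_to_shared_memory()
        return TileState.SHIFT
    if state is TileState.SHIFT:
        tile.send_outputs_to_next_tiles()   # inter-tile data synchronization
        tile.shift_batched_windows()        # advance both input and output windows
        return TileState.WAIT_INPUT         # back to (1) for the next iteration
    return state                            # otherwise keep waiting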

6.3.2 Auto Assignment Tool Implementation

We implement AA as the auto assignment software tool deployed during the preconfiguration step. The tool implements the three major functionalities described in Section 6.2.2. The first functionality is the analysis of the neural network based on the user's description of its structure in the standard ONNX format [211]. ONNX enables neural network models implemented with PyTorch [212] or TensorFlow [198] to be analyzed by the AA tool without any transformation. For the accelerator specification, the current implementation directly uses the simulator hardware configuration. The AA tool estimates the required hardware based on the amount of computation in the neural network. The estimation calculates the required number of MACs from the size of each layer's kernels, the size of each crossbar array, and the number of crossbar arrays in each MAC. We implement the parallelism searching algorithm as described in Section 6.2.2. The time complexity of the algorithm is O(N² log(K)), where N is the number of network layers and K is the amount of hardware resources (i.e., the number of tiles). With common network sizes and hardware specifications, our AA tool executes within a few seconds.
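As an illustration of the hardware estimate, the sketch below counts crossbar arrays and MACs for convolution layers under one plausible weight-to-crossbar mapping; the layer fields, bit widths, and mapping details are assumptions for the example, not the exact AA estimator.

import math

def required_macs(layers, xbar_rows=128, xbar_cols=128, xbars_per_mac=8,
                  bits_per_cell=2, weight_bits=16):
    # Estimate the MACs a network needs before any kernel duplication (phi = 1).
    cells_per_weight = weight_bits // bits_per_cell
    total = 0
    for layer in layers:
        # Rows hold the flattened kernel inputs; columns hold output channels.
        rows = layer["kernel_h"] * layer["kernel_w"] * layer["in_channels"]
        cols = layer["out_channels"] * cells_per_weight
        xbars = math.ceil(rows / xbar_rows) * math.ceil(cols / xbar_cols)
        total += math.ceil(xbars / xbars_per_mac)
    return total

# Example: one 3x3x256 -> 512 convolution layer.
print(required_macs([{"kernel_h": 3, "kernel_w": 3,
                      "in_channels": 256, "out_channels": 512}]))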

Lastly, the AA tool generates the hardware assignment and the execution model based on the estimated hardware resources. The tool assigns the hardware (e.g., tiles and MACs) to each layer. The generated assignment is a list of entries consisting of the index of the hardware component, its corresponding network layer index, the number of duplicate kernels, and the index of each kernel. The execution model for each tile consists of the indices of the layers for inter-layer data synchronization, each MAC's data for intra-tile data synchronization, and the data sharing of duplicate kernels. Our tool generates the assignment and the execution model as YAML files. The simulator imports these files to configure the accelerator.
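The listing below illustrates what one assignment entry and one tile's execution model could look like when serialized to YAML. The field names and values are invented for illustration only (the actual file schema is defined by our tool), and the snippet assumes the PyYAML package is available.

import yaml  # assumes the PyYAML package is installed

# One entry of the hardware assignment: which hardware runs which layer,
# how many duplicate kernels it holds, and the index of each kernel copy.
assignment_entry = {
    "hardware": {"tile": 12, "macs": [0, 1, 2, 3]},
    "layer": 5,
    "duplicate_kernels": 4,
    "kernels": [0, 1, 2, 3],
}

# Per-tile execution model: layers to synchronize with, per-MAC input slices
# for intra-tile synchronization, and groups of duplicate kernels sharing data.
execution_model = {
    "inter_layer_sync": {"producer_layers": [4], "consumer_layers": [6]},
    "intra_tile_sync": {"mac_inputs": {0: "window[0:9]", 1: "window[9:18]"}},
    "shared_duplicates": [[0, 1], [2, 3]],
}

print(yaml.safe_dump({"assignment": [assignment_entry],
                      "execution_model": execution_model}))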

6.4 Experimental Setup

Mirage simulation. We build an in-house cycle-accurate simulator for the RRAM-based accelerator in Python. Our simulator implements the entire architecture of the RRAM-based accelerator and simulates computation and data movement operations at the level of crossbar arrays. The simulator models the pipeline between layers, such that the hardware belonging to each layer has its own status (e.g., cycles and power) and uses message passing for data movement and communication. We adopt and modify the high bandwidth memory (HBM) model in Ramulator [213] to simulate the eDRAM. For the operations in the crossbar, we include the cycle parameters of matrix multiplication, ADC, and S+A based on [74, 214, 215, 216]. The latencies of the basic operations are listed in Table 6.1. To utilize existing DNN models from popular frameworks such as TensorFlow [198] and PyTorch [212], the simulator uses the inference trace generated by the Glow compiler [217]. The Glow compiler is able to process the ONNX [211] format files specified by the DNN models. We modify the compiler to generate a trace of the network at the inference stage. The trace includes the number of computation operations, the input and output data sizes, and the sequence of executions.

Baselines. We include two other baselines for our evaluation: a pipeline-enabled RRAM-based accelerator without kernel duplication (referred to as w/o kernel duplication in Section 6.5), and a pipeline-enabled RRAM-based accelerator with kernel duplication. The implementations of these two baselines are based on the architectures described in [74, 76]. For the kernel duplication-enabled baseline, because the original design does not specify the hardware assignment for each network, the baseline uses the AA-generated hardware assignment for each DNN and runs the corresponding DNN benchmark with that hardware (referred to as w/ kernel duplication opt).

Benchmarks. We select eight representative DNN models for ImageNet classification [200]. These networks include GoogleNet [125], MobileNet [218], ShuffleNet [219], InceptionV3 [220], ResNet [193], and VGG [192]. We use the same topologies and hyperparameters provided in the original publications. As Mirage and the baselines only accelerate the inference operation, we use the pre-trained weights in the torchvision package [221] for the experiments. Because Mirage does not modify the network models or the input data, the results show no compromise in the inference accuracy (both top-one and top-five) of the networks.

Hardware configuration and power model. We configure the RRAM-based accelerator using the parameters in Table 6.1. The NoC uses a mesh topology and minimal adaptive routing. For the scalability evaluation, we scale the chip area by increasing the number of tiles and all related NoC components. To minimize the communication cost, we lay out the tiles to form a rectangle with the minimal length-to-width ratio. We use the statistics from [184] to model the power consumption of the crossbar array. The power and area of the tile controller are validated with SystemVerilog through Synopsys Design Compiler and the Synopsys 32nm PDK [222]. The ADC power and area are based on the SAR ADC model specified in [223, 224]. The DAC power and area are based on [225]. We use the CACTI tool [226] to model the rest of the on-chip components, including the eDRAM and buses, under 32nm. We compare our models to the configurations from [73] and [74] to ensure consistency. The simulator computes the dynamic power based on the executed operations and the energy

consumption for each operation. The static power is considered for the data storage components, including the eDRAM and buffers.

Table 6.1: Hardware configuration of the Mirage.

Component         Parameter          Spec.            Power
Controller        Frequency          1GHz             0.3mW
Tile              Number             168              312mW
Shared eDRAM      Size               64KB             17.66mW
                  Bus-width          256b
                  Tread/Twrite       12ns
I/O buffer        Size               256 entries,     5.3mW
                                     2 queues
MAC               No. per tile       12               24mW
Crossbar array    No. per MAC        8                2.4mW
                  Size               128×128
                  Bit-width          2
                  Tmul/TADC/TS+A     1 cycle
Off-chip memory   Size               16 GB DDR4
                  Read Bandwidth     61 GB/s
                  Write Bandwidth    47 GB/s
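A simplified version of this power accounting is sketched below; the operation types, per-operation energies, and component list are placeholders rather than the calibrated values of our model.

def chip_power_mw(op_counts, energy_per_op_mj, static_power_mw, elapsed_s):
    # Dynamic power: energy of all executed operations divided by elapsed time.
    dynamic_mj = sum(op_counts[op] * energy_per_op_mj[op] for op in op_counts)
    dynamic_mw = dynamic_mj / elapsed_s          # mJ / s = mW
    # Static power: charged only for the data storage components (eDRAM, buffers).
    return dynamic_mw + sum(static_power_mw.values())

# Hypothetical example values.
print(chip_power_mw(op_counts={"xbar_mul": 1_000_000, "adc": 500_000},
                    energy_per_op_mj={"xbar_mul": 1e-9, "adc": 2e-9},
                    static_power_mw={"edram": 17.66, "io_buffer": 5.3},
                    elapsed_s=0.01))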

6.5 Evaluations

This section shows the evaluation results of Mirage in terms of performance (Section 6.5.1), flexibility (Section 6.5.2), and power efficiency (Section 6.5.3).

6.5.1 Performance

Inference latency speedup. Figure 6.14 shows the speedup over the baseline w/o kernel duplication for four configurations. Besides the Mirage configuration, we add a Mirage configuration without data sharing aware memory (referred to as Mirage w/o data sharing aware memory). We make three observations from the result. First, Mirage has the highest speedup among the four configurations: a 3.5× speedup against the w/o kernel duplication baseline and a 2.0× speedup against the w/ kernel duplication baseline on average. Second, Mirage obtains a higher speedup on the networks with more layers. For instance, compared to the w/ kernel duplication baseline, Mirage has a 2.1× speedup on VGG-19 and a 1.9× speedup on VGG-16. We observe the same trend on ResNet, where the speedup is higher on ResNet-50 than on ResNet-18. This indicates that Mirage brings more benefit on deeper networks, as they tend to have a more severe data dependency. Third, data sharing aware memory is essential to the Mirage. Without data sharing aware memory, Mirage only has a 1.1× speedup over the w/ kernel duplication baseline; as Figure 6.15 shows, the idleness of the batched kernels' hardware does not show a significant reduction.

Figure 6.14: Speedup of single inference latency with four RRAM-based accelerator configurations.

Figure 6.15: Idle ratio of four configurations.

Execution idle ratio. Figure 6.15 shows the shared data induced idle ratio of the two Mirage configurations and the two baselines. Because the idle cycles of the crossbar arrays differ significantly between layers, summing the idle cycles of all the crossbar arrays diminishes the idle time of the layers with minimal parallelism. Hence, we calculate the accelerator idle ratio as the geometric mean of each layer's idle ratio, where the idle ratio of each layer is the ratio of its idle cycles to its own execution cycles. We make two more observations from the result. First, Mirage achieves the lowest idle ratio among the configurations; Mirage is able to decrease the shared data induced idleness by 3.5× compared to the w/ kernel duplication baseline. Second, Mirage w/o data sharing aware memory suffers from a high idle ratio. We discover that kernel batching increases the traffic and throttles the NoC bandwidth, which leads to both the high idle ratio in Figure 6.15 and the low speedup in Figure 6.14.
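For clarity, the aggregation used for the reported idle ratio is sketched below: a geometric mean over layers of the per-layer idle-to-execution cycle ratio. The cycle counts in the example are made up.

import math

def accelerator_idle_ratio(layer_cycles):
    # layer_cycles: one (idle_cycles, execution_cycles) pair per layer.
    ratios = [max(idle, 1) / exe for idle, exe in layer_cycles if exe > 0]
    # Geometric mean, computed in log space.
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# Three layers at very different scales: a plain sum of idle cycles would be
# dominated by the largest layer, while the geometric mean weights each layer equally.
print(accelerator_idle_ratio([(10, 100), (2_000, 10_000), (50_000, 1_000_000)]))  # 0.1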


Figure 6.16: Comparison of throughputs of three configurations.

Throughput. Figure 6.16 shows the throughput of three configurations measured in giga 16-bit operations per second (GOP/s). We normalize the results to the w/ kernel duplication baseline. The results are generated with a 16-chip setup. We exclude the w/o kernel duplication baseline due to its underutilization of the hardware resources and significantly lower throughput. We make two observations from the result. First, Mirage manages to achieve a 2.1× increase compared to the w/ kernel duplication baseline. Second, data sharing aware memory is essential to system throughput. Due to the throttled bandwidth, Mirage w/o data sharing aware memory is unable to achieve a notable improvement in throughput.

6.5.2 Flexibility

AA-generated hardware assignments. To show the flexibility of Mirage, we list the number of duplicate kernels along with the hardware utilization ratio in Table 6.2 for four different VGG configurations. We configure the accelerator to have 16 chips for processing the network. The hardware utilization ratio is the ratio of the number of assigned crossbar arrays to the number of overall

Table 6.2: Number of duplicate kernels for each layer, and hardware utilization in four VGG configurations.

Layer        VGG-11   VGG-13   VGG-16   VGG-19
conv3-64       4096     2048     2048     2048
conv3-64          -     2048     2048     2048
conv3-128       512      512      512      512
conv3-128         -      256      256      256
conv3-256       128       64       64       64
conv3-256        64       32       32       32
conv3-256         -        -       32       16
conv3-256         -        -        -       16
conv3-512        16        8        4        2
conv3-512         8        8        2        2
conv3-512         -        -        2        1
conv3-512         -        -        -        1
conv3-512         2        2        1        1
conv3-512         1        2        1        1
conv3-512         -        -        1        1
conv3-512         -        -        -        1
Utilization  99.98%   99.18%   99.83%   99.81%

crossbar arrays. Based on the result, we make two observations. First, AA manages to adjust the hardware assignment to different configurations of the same DNN model. With the same accelerator and the same type of neural network, the generated assignment can differ even when only a few layers change. Adopting these changes manually would be a troublesome and error-prone task for users; the AA-generated assignment follows the two rules and achieves the best performance. Second, the assignment achieves high hardware utilization for all the configurations. When a particular layer's desired hardware is unavailable, AA searches the other candidate layers for the best second choice. This minimizes the unused hardware on an accelerator when the network size varies. The hardware utilization of Mirage is 98.75% on average across all the benchmarked models.

Scaling the chip size. Figure 6.17 shows a sensitivity study of the chip size (from 1× to 32×) for Mirage and the w/ kernel duplication baseline. Both speedup and throughput are normalized to the w/ kernel duplication baseline at the 1× chip size. We make two observations. First, Mirage shows better scalability in both speedup and throughput compared to the baseline with kernel duplication and provides a higher saturation point at a larger chip size. Second, Mirage has better scalability in


Figure 6.17: Comparison of Mirage and w/ kernel duplication over (a) speedup, and (b) throughput when the chip size scales from 1× to 32×.

throughput than in speedup. The inter-layer parallelism benefits from FPRA, which minimizes the gap between different inferences.

Figure 6.18: Speedup of three batching strategies with Mirage.

Two types of kernel batching. Figure 6.18 shows the comparison of three strategies for using the two types of kernel batching. The all linear strategy utilizes linear batching on all the layers. The square max strategy attempts to use square batching wherever possible: it batches the kernels into the maximum possible square and places the squares next to each other. For example, for 2048 kernels, it batches them into two 32×32 square batches and places them side by side. The pooling square strategy utilizes square batching on the layer before a pooling layer whenever possible: it first batches the kernels to the size of the pooling filter and then linearly places the batches side by side. For instance, for 32 kernels on a layer before a 2×2 max-pooling, it batches them into eight 2×2-sized batches and linearly places them in a row.
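A rough Python sketch of the pooling square layout is given below; the grid representation and function name are only illustrative (the square max and all linear layouts follow the same pattern with different square sizes).

def pooling_square_layout(num_kernels, pool_h=2, pool_w=2):
    # Group kernels into pooling-filter-sized squares, then place the squares
    # side by side in a row; returns a grid (list of rows) of kernel indices.
    per_square = pool_h * pool_w
    squares = [list(range(i, min(i + per_square, num_kernels)))
               for i in range(0, num_kernels, per_square)]
    grid = [[] for _ in range(pool_h)]
    for square in squares:
        for r in range(pool_h):
            row = square[r * pool_w:(r + 1) * pool_w]
            grid[r].extend(row + [None] * (pool_w - len(row)))  # pad a short square
    return grid

# 32 kernels before a 2x2 max-pooling: eight 2x2 squares placed in a row,
# i.e., a 2 x 16 grid of kernel indices, matching the example in the text.
for row in pooling_square_layout(32):
    print(row)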

We observe that the pooling square strategy achieves the best performance on average among the three. Although the other two strategies show a minor advantage on three models (e.g., Inception, ResNet-18, and ResNet-50), the pooling square strategy has the highest speedup on most workloads. Hence, Mirage uses the pooling square strategy by default. We only explored these three strategies due to space limitations; other strategies may further increase the performance of Mirage.


Figure 6.19: Power efficiency of three configurations.

6.5.3 Power Efficiency

Figure 6.19 shows the power efficiency results of Mirage and the baselines. The power efficiency is measured in giga 16-bit operations per second per watt (GOPS/W). We normalize the results of the two Mirage configurations to the w/ kernel duplication opt baseline. We make two observations from the result. First, the Mirage without data sharing aware memory configuration only has a 24% increase in power efficiency compared to the baseline. The kernel batching brings more data movement between tiles, and the heavy memory traffic delays the communication between tiles. Second, the Mirage design manages to achieve 2.4× the power efficiency of the baseline. With kernel batching and data sharing aware memory, FPRA is able to reduce the idleness during inference, which leads to more operations per unit of energy.

6.6 Conclusion

We propose Mirage, a flexible and highly parallel RRAM-based accelerator design. Mirage consists of the Finer-grained Parallel RRAM Architecture (FPRA) and Auto Assignment (AA). FPRA applies kernel batching and data sharing aware memory to reduce the data sharing induced idleness in the pipeline. FPRA's kernel batching positions the kernels closer to each other to reduce the data sharing induced data dependencies between kernels, while FPRA's data sharing aware memory efficiently manages data forwarding between layers and duplicate kernels. AA enables the execution of networks of various sizes on the RRAM-based accelerator pipeline design. AA takes the topology and parameters of the DNN model and the specifications of the accelerator as input, and automatically generates a hardware assignment and an execution model for the accelerator. The assignment achieves high utilization of the hardware and provides high performance. The evaluation shows that Mirage achieves a 2.0× speedup and a 2.1× throughput increase compared to the state-of-the-art baseline. Mirage also achieves high power efficiency and good scalability with chip size.

Acknowledgments

This chapter contains material from “Mirage: A Highly Paralleled And Flexible RRAM Accelerator.”, by Xiao Liu, Minxuan Zhou, Rachata Ausavarungnirun, Sean Eilert, Ameen Akel, Tajana S. Rosing, Vijaykrishnan Narayanan, and Jishen Zhao, which is submitted for publication. The dissertation author is the primary investigator and first author of this paper.

Chapter 7

Conclusion

7.1 Summary of the Thesis

This thesis discusses approaches to adopting NVM in computer architecture for modern applications. We propose customized designs for data-intensive and neural network applications. For the data-intensive applications, we first provide a characterization study of persistent memory applications, offering a comprehensive analysis of their performance and hardware behavior. We also propose an error-resilient design for data-intensive applications using NVM-based main memory: we redesign the cache writeback and NVM wear-leveling and coordinate them with the existing resilience schemes to significantly improve the reliability of the memory hierarchy. For the neural network applications, we introduce thermal-aware chip management to prevent accuracy loss in the NVM-based NN accelerator. We also redesign the data forwarding in the pipeline and enable automated hardware assignment of the NVM-based NN accelerator to mitigate the underutilization caused by data dependencies and the lack of flexibility. Our evaluations show that each of the proposed designs is efficient, reliable, and scalable for its corresponding modern application.

7.2 Future Work

The Binary Star design can be applied to shared memory in a distributed system. When the shared memory receives updates from RDMA accesses, the NVM-based shared memory may treat them in the same way as consistent cache writebacks. The node that maintains the NVM-based shared memory then holds a consistent checkpoint: if errors happen within the memory of any other node, that node can recover with the data from the NVM-based shared memory.

The idea of coordinating reliability schemes between NVM and other devices can further extend to block devices. The reliability schemes of NVM and SSD may use a unified logical-to-physical address translation to handle the wear-leveling of both devices together. Such a design needs to coordinate the different block sizes and utilize the different endurance cycles of the two devices. The design would extend the reliability of the storage devices and further enhance the entire memory hierarchy.

Aside from thermal-induced errors, the NVM-based NN accelerator also needs protection against other sources of errors. Other sources of errors in NVM include row hammer [227], diffusion of oxygen vacancies [228], device yield [229], etc. These errors could potentially harm the inference accuracy in the same way as the thermal issue. The mechanisms in HR3AM may be applicable as countermeasures for these other sources of errors as well.

Moreover, NVM-related errors may cause different outcomes in other neural network applications. Applications such as image generation based on generative adversarial networks (GANs) and robotics control based on reinforcement learning do not use accuracy for evaluation. We can quantify the effect of NVM-related errors on these applications with different metrics, and we may propose customized countermeasures based on the sources of the errors and the characteristics of the applications in the future.

Bibliography

[1] S. Cha, S. O, H. Shin, S. Hwang, K. Park, S. J. Jang, J. S. Choi, G. Y. Jin, Y. H. Son, H. Cho, J. H. Ahn, and N. S. Kim, “Defect Analysis and Cost-Effective Resilience Architecture for Future DRAM Devices,” in HPCA, 2017.

[2] V. Sridharan, N. DeBardeleben, S. Blanchard, K. B. Ferreira, J. Stearley, J. Shalf, and S. Gurumurthi, “Memory errors in modern systems: The good, the bad, and the ugly,” in ASPLOS, 2015.

[3] S. Swami, P. M. Palangappa, and K. Mohanram, “ECS: Error-correcting strings for lifetime improvements in nonvolatile memories,” ACM TACO, 2017.

[4] P. J. Nair, V. Sridharan, and M. K. Qureshi, “XED: Exposing on-die error detection information for strong memory reliability,” in ISCA, 2016.

[5] D. Agrawal, A. El Abbadi, S. Antony, and S. Das, “Data management challenges in cloud computing infrastructures,” in Databases in Networked Information Systems, S. Kikuchi, S. Sachdeva, and S. Bhalla, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2010, pp. 1–10.

[6] M. Kleppmann, Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems. O’Reilly Media, 2017. [Online]. Available: https://books.google.com/books?id=zFheDgAAQBAJ

[7] A. Chen, “A review of emerging non-volatile memory (NVM) tech- nologies and applications,” Solid-State Electronics, vol. 125, pp. 25–38, 2016, extended papers selected from ESSDERC 2015. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0038110116300867

[8] Intel and Micron, “Intel and Micron produce breakthrough memory technology,” 2015, http://newsroom.intel.com/community/intel newsroom/.

[9] Intel, “Intel architecture instruction set extensions programming reference,” 2015, https://software.intel.com/sites/default/files/managed/07/b7/319433-023.pdf.

[10] Intel, “Persistent memory file system,” https://github.com/linux-pmfs/pmfs.

[11] J. Xu and S. Swanson, "NOVA: A log-structured file system for hybrid volatile/non-volatile main memories," in 14th USENIX Conference on File and Storage Technologies (FAST 16), 2016, pp. 323–338.

[12] Y. Kwon, H. Fingler, T. Hunt, S. Peter, E. Witchel, and T. Anderson, “Strata: A cross media file system,” in Proceedings of the 26th Symposium on Operating Systems Principles (SOSP). New York, NY, USA: Association for Computing Machinery, 2017, p. 460–477.

[13] J. Yang, J. Izraelevitz, and S. Swanson, “Orion: A distributed file system for non-volatile main memory and rdma-capable networks,” in 17th USENIX Conference on File and Storage Technologies (FAST 19). Boston, MA: USENIX Association, Feb. 2019, pp. 221–234.

[14] T. E. Anderson, M. Canini, J. Kim, D. Kostić, Y. Kwon, S. Peter, W. Reda, H. N. Schuh, and E. Witchel, "Assise: Performance and availability via client-local NVM in a distributed file system," in 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20). USENIX Association, Nov. 2020, pp. 1011–1027.

[15] S. Scargall, Introducing the Persistent Memory Development Kit. Berkeley, CA: Apress, 2020, pp. 63–72.

[16] C. C. Chou, J. Jung, A. L. N. Reddy, P. V. Gratz, and D. Voigt, “vnvml: An efficient user space library for virtualizing and sharing non-volatile memories,” in 2019 35th Symposium on Mass Storage Systems and Technologies (MSST), 2019, pp. 103–115.

[17] T. Shull, J. Huang, and J. Torrellas, “Autopersist: An easy-to-use java nvm framework based on reachability,” in Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation, ser. PLDI 2019. New York, NY, USA: Association for Computing Machinery, 2019, p. 316–332.

[18] L. Zhang and S. Swanson, “Pangolin: A fault-tolerant persistent memory programming library,” in 2019 USENIX Annual Technical Conference (USENIX ATC 19). Renton, WA: USENIX Association, Jul. 2019, pp. 897–912.

[19] Q. Liu, J. Izraelevitz, S. K. Lee, M. L. Scott, S. H. Noh, and C. Jung, “iDO: Compiler- directed failure atomicity for nonvolatile memory,” in 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2018, pp. 258–270.

[20] R. Elkhouly, M. Alshboul, A. Hayashi, Y. Solihin, and K. Kimura, “Compiler-support for critical data persistence in NVM,” ACM Trans. Archit. Code Optim., vol. 16, no. 4, Dec. 2019.

[21] S. Yu and P. Chen, “Emerging memory technologies: Recent trends and prospects,” IEEE Solid-State Circuits Magazine, vol. 8, no. 2, pp. 43–56, 2016.

[22] B. C. Lee, E. Ipek, O. Mutlu, and D. Burger, "Architecting phase change memory as a scalable DRAM alternative," in International Symposium on Computer Architecture, 2009, pp. 2–13.

[23] J. Liang, R. G. D. Jeyasingh, H. Chen, and H.-S. P. Wong, "A 1.4µA reset current phase change memory cell with integrated carbon nanotube electrodes for cross-point memory application," in 2011 Symposium on VLSI Technology - Digest of Technical Papers, 2011, pp. 100–101.

[24] C. Villa, D. Mills, G. Barkley, H. Giduturi, S. Schippers, and D. Vimercati, “A 45nm 1Gb 1.8V phase-change memory,” in 2010 IEEE International Solid-State Circuits Conference - (ISSCC), 2010, pp. 270–271.

[25] H. Chung, B. H. Jeong, B. Min, Y. Choi, B. H. Cho, J. Shin, J. Kim, J. Sunwoo, J. m. Park, Q. Wang, Y. j. Lee, S. Cha, D. Kwon, S. Kim, S. Kim, Y. Rho, M. H. Park, J. Kim, I. Song, S. Jun, J. Lee, K. Kim, K. w. Lim, W. r. Chung, C. Choi, H. Cho, I. Shin, W. Jun, S. Hwang, K. W. Song, K. Lee, S. w. Chang, W. Y. Cho, J. H. Yoo, and Y. H. Jun, “A 58nm 1.8V 1Gb PRAM with 6.4MB/s program BW,” in 2011 IEEE International Solid-State Circuits Conference, 2011, pp. 500–502.

[26] Y. Choi, I. Song, M.-H. Park, H. Chung, S. Chang, B. Cho, J. Kim, Y. Oh, D. Kwon, J. Sunwoo, J. Shin, Y. Rho, C. Lee, M. G. Kang, J. Lee, Y. Kwon, S. Kim, J. Kim, Y.-J. Lee, Q. Wang, S. Cha, S. Ahn, H. Horii, J. Lee, K. Kim, H. Joo, K. Lee, Y.-T. Lee, J. Yoo, and G. Jeong, “A 20nm 1.8 v 8Gb PRAM with 40MB/s program bandwidth,” in IEEE International Solid - State Circuits Conference - (ISSCC), 2012.

[27] K. Bourzac, “Has intel created a universal memory technology? [news],” IEEE Spectrum, vol. 54, no. 5, pp. 9–10, May 2017.

[28] C. W. Smullen, V. Mohan, A. Nigam, S. Gurumurthi, and M. R. Stan, “Relaxing non- volatility for fast and energy-efficient STT-RAM caches,” in 2011 IEEE 17th International Symposium on High Performance Computer Architecture, Feb 2011, pp. 50–61.

[29] E. Kültürsay, M. Kandemir, A. Sivasubramaniam, and O. Mutlu, "Evaluating STT-RAM as an energy-efficient main memory alternative," in 2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), April 2013, pp. 256–267.

[30] S. P. Park, S. Gupta, N. Mojumder, A. Raghunathan, and K. Roy, “Future cache design using STT MRAMs for improved energy efficiency: Devices, circuits and architecture,” in Proceedings of the 49th Annual Design Automation Conference, ser. DAC ’12. New York, NY, USA: Association for Computing Machinery, 2012, p. 492–497. [Online]. Available: https://doi.org/10.1145/2228360.2228447

[31] J. Zhao, S. Li, D. H. Yoon, Y. Xie, and N. P. Jouppi, “Kiln: Closing the performance gap between systems with and without persistence support,” in Proceedings of the 46th International Symposium on Microarchitecture (MICRO), 2013, pp. 421–432.

[32] P. Chi, S. Li, S. H. Kang, and Y. Xie, "Architecture design with STT-RAM: Opportunities and challenges," in 2016 21st Asia and South Pacific Design Automation Conference (ASP-DAC), Jan 2016, pp. 109–114.

[33] L. Xia, T. Tang, W. Huangfu, M. Cheng, X. Yin, B. Li, Y. Wang, and H. Yang, “Switched by input: Power efficient structure for RRAM-Based convolutional neural network,” in 2016 53nd ACM/EDAC/IEEE Design Automation Conference (DAC). IEEE Press, 2016, p. 1–6.

[34] I. Zhang, A. Szekeres, D. V. Aken, I. Ackerman, S. D. Gribble, A. Krishnamurthy, and H. M. Levy, “Customizable and extensible deployment for mobile/cloud applications,” in 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14). Broomfield, CO: USENIX Association, Oct. 2014, pp. 97–112. [Online]. Available: https://www.usenix.org/conference/osdi14/technical-sessions/presentation/zhang

[35] C. Philip Chen and C.-Y. Zhang, “Data-intensive applications, challenges, techniques and technologies: A survey on big data,” Information Sciences, vol. 275, pp. 314–347, 2014.

[36] M. D. Assunção, R. N. Calheiros, S. Bianchi, M. A. Netto, and R. Buyya, "Big data computing and clouds: Trends and future directions," Journal of Parallel and Distributed Computing, vol. 79-80, pp. 3–15, 2015, special issue on Scalable Systems for Big Data Management and Analytics.

[37] O. Mutlu, Main Memory Scaling: Challenges and Solution Directions. New York, NY: Springer New York, 2015, pp. 127–153.

[38] J. Manyika, M. Chui, B. Brown, J. Bughin, R. Dobbs, C. Roxburgh, and A. Hung Byers, Big data: The next frontier for innovation, competition, and productivity. McKinsey Global Institute, 2011.

[39] H. V. Jagadish, J. Gehrke, A. Labrinidis, Y. Papakonstantinou, J. M. Patel, R. Ramakrishnan, and C. Shahabi, “Big data and its technical challenges,” Communications of the ACM, vol. 57, no. 7, pp. 86–94, 2014.

[40] V. Sridharan, J. Stearley, N. DeBardeleben, S. Blanchard, and S. Gurumurthi, “Feng shui of supercomputer memory positional effects in DRAM and SRAM faults,” in SC, 2013.

[41] V. Sridharan and D. Liberty, “A study of DRAM failures in the field,” in SC, 2012.

[42] S. Nalli, S. Haria, M. D. Hill, M. M. Swift, H. Volos, and K. Keeton, “An analysis of persistent memory use with whisper,” in ASPLOS, 2017.

[43] V. Tarasov, E. Zadok, and S. Shepler, “Filebench: A flexible framework for file system benchmarking,” vol. 41, no. 1, 2016, pp. 6–12.

[44] J. Axboe, “Flexible I/O tester,” 2016, https://github.com/axboe/fio.

[45] T. Bray, "Bonnie64," https://code.google.com/archive/p/bonnie-64/.

[46] Oracle, “MySQL,” 2019, https://www.mysql.com/.

[47] M. Hirabayashi, “Tokyo cabinet: a modern implementation of DBM,” 2010. [Online]. Available: http://fallabs.com/tokyocabinet/

[48] K. O’Shea and R. Nash, “An introduction to convolutional neural networks,” arXiv preprint arXiv:1511.08458, 2015.

[49] I. Goodfellow, Y. Bengio, A. Courville, and Y. Bengio, Deep learning. MIT press Cambridge, 2016.

[50] JEDEC, “JEDEC publishes DDR4 NVDIMM-P bus protocol standard,” 2021, https://www.jedec.org/news/pressreleases/jedec-publishes-ddr4-nvdimm-p-bus-protocol- standard.

[51] J. Izraelevitz, J. Yang, L. Zhang, J. Kim, X. Liu, A. Memaripour, Y. J. Soh, Z. Wang, Y. Xu, S. R. Dulloor, J. Zhao, and S. Swanson, “Basic performance measurements of the intel optane dc persistent memory module,” 2019.

[52] J. Yang, J. Kim, M. Hoseinzadeh, J. Izraelevitz, and S. Swanson, “An empirical guide to the behavior and use of scalable persistent memory,” in 18th USENIX Conference on File and Storage Technologies (FAST 20). Santa Clara, CA: USENIX Association, Feb. 2020, pp. 169–182.

[53] Z. Wang, X. Liu, J. Yang, T. Michailidis, S. Swanson, and J. Zhao, “Characterizing and modeling non-volatile memory systems,” in 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2020, pp. 496–508.

[54] M. Awasthi, M. Shevgoor, K. Sudan, B. Rajendran, R. Balasubramonian, and V. Srinivasan, “Efficient scrub mechanisms for error-prone emerging memories,” in HPCA, 2012.

[55] N. H. Seong, S. Yeo, and H.-H. S. Lee, “Tri-level-cell phase change memory: Toward an efficient and reliable memory system,” ISCA, 2013.

[56] S. Schechter, G. H. Loh, K. Strauss, and D. Burger, “Use ECP, not ECC, for hard failures in resistive memories,” in ISCA, 2010.

[57] R.-S. Liu, D.-Y. Shen, C.-L. Yang, S.-C. Yu, and C.-Y. M. Wang, “NVM Duet: Unified working memory and persistent store architecture,” in Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems, 2014, pp. 455–470.

[58] M. A. Ogleari, E. L. Miller, and J. Zhao, “Steal but no force: Efficient Hardware-based Undo+Redo Logging for Persistent Memory Systems,” in HPCA, 2018.

[59] A. Joshi, V. Nagarajan, M. Cintra, and S. Viglas, "DHTM: Durable hardware transactional memory," in 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), 2018, pp. 452–465.

[60] Intel, “Enabling Persistent Memory in the Storage Performance Development Kit (SPDK),” 2019, https://software.intel.com/content/www/us/en/develop/articles/enabling-persistent- memory-in-the-storage-performance-development-kit-spdk.html.

[61] H. Volos, A. J. Tack, and M. M. Swift, “Mnemosyne: Lightweight persistent memory,” in Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS XVI. New York, NY, USA: ACM, 2011, pp. 91–104.

[62] D. R. Chakrabarti, H.-J. Boehm, and K. Bhandari, “Atlas: Leveraging locks for non-volatile memory consistency,” in Proceedings of the 2014 ACM International Conference on Object Oriented Programming Systems Languages & Applications (OOPSLA). New York, NY, USA: Association for Computing Machinery, 2014, p. 433–452.

[63] J. Izraelevitz, T. Kelly, and A. Kolli, “Failure-atomic persistent memory updates via justdo logging,” in Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS ’16. New York, NY, USA: Association for Computing Machinery, 2016, p. 427–442. [Online]. Available: https://doi.org/10.1145/2872362.2872410

[64] J. Condit, E. B. Nightingale, C. Frost, E. Ipek, B. Lee, D. Burger, and D. Coetzee, “Better I/O through byte-addressable, persistent memory,” in Proceedings of the ACM SIGOPS 22Nd Symposium on Operating Systems Principles, ser. SOSP ’09. New York, NY, USA: ACM, 2009, pp. 133–146.

[65] Y. Lu, J. Shu, Y. Chen, and T. Li, “Octopus: an rdma-enabled distributed persistent memory file system,” in 2017 USENIX Annual Technical Conference (USENIX ATC 17). Santa Clara, CA: USENIX Association, Jul. 2017, pp. 773–785. [Online]. Available: https://www.usenix.org/conference/atc17/technical-sessions/presentation/lu

[66] R. Kadekodi, S. K. Lee, S. Kashyap, T. Kim, A. Kolli, and V. Chidambaram, “SplitFS: Reducing software overhead in file systems for persistent memory,” in Proceedings of the 27th ACM Symposium on Operating Systems Principles, ser. SOSP ’19. New York, NY, USA: Association for Computing Machinery, 2019, p. 494–508.

[67] S. R. Dulloor, S. Kumar, A. Keshavamurthy, P. Lantz, D. Reddy, R. Sankaran, and J. Jack- son, “System software for persistent memory,” in Proceedings of the Ninth European Conference on Computer Systems, ser. EuroSys ’14, 2014, pp. 15:1–15:15.

[68] P. Sehgal, S. Basu, K. Srinivasan, and K. Voruganti, “An empirical study of file systems on NVM,” in 31st Symposium on Mass Storage Systems and Technologies (MSST). IEEE, 2015, pp. 1–14.

[69] Y. Zhang and S. Swanson, "A study of application performance with non-volatile main memory," in Mass Storage Systems and Technologies (MSST), 2015 31st Symposium on. IEEE, 2015, pp. 1–10.

[70] M. Patel, J. S. Kim, H. Hassan, and O. Mutlu, “Understanding and Modeling On-Die Error Correction in Modern DRAM: An Experimental Study Using Real Devices,” in DSN, 2019.

[71] S. Khan, D. Lee, Y. Kim, A. R. Alameldeen, C. Wilkerson, and O. Mutlu, “The Efficacy of Error Mitigation Techniques for DRAM Retention Failures: A Comparative Experimental Study,” in SIGMETRICS, 2014.

[72] M. K. Qureshi, D. Kim, S. Khan, P. J. Nair, and O. Mutlu, “AVATAR: A Variable-Retention- Time (VRT) Aware Refresh for DRAM Systems,” in DSN, 2015.

[73] A. Ankit, I. E. Hajj, S. R. Chalamalasetti, G. Ndu, M. Foltin, R. S. Williams, P. Faraboschi, W.-m. W. Hwu, J. P. Strachan, K. Roy, and D. S. Milojicic, “PUMA: A programmable ultra-efficient memristor-based accelerator for machine learning inference,” in Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). New York, NY, USA: ACM, 2019, pp. 715–731.

[74] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar, “ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars,” in Proceedings of the 43rd International Symposium on Computer Architecture (ISCA). Piscataway, NJ, USA: IEEE Press, 2016, pp. 14–26.

[75] P. Chi, S. Li, C. Xu, T. Zhang, J. Zhao, Y. Liu, Y. Wang, and Y. Xie, “PRIME: A novel processing-in-memory architecture for neural network computation in reram-based main memory,” in Proceedings of the 43rd International Symposium on Computer Architecture (ISCA). Piscataway, NJ, USA: IEEE Press, 2016, pp. 27–39.

[76] L. Song, X. Qian, H. Li, and Y. Chen, “PipeLayer: A Pipelined ReRAM-Based Acceler- ator for Deep Learning,” in 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA), Feb 2017, pp. 541–552.

[77] T. Chou, W. Tang, J. Botimer, and Z. Zhang, “CASCADE: Connecting RRAMs to Extend Analog Dataflow In An End-To-End In-Memory Processing Paradigm,” in Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO ’52. New York, NY, USA: Association for Computing Machinery, 2019, p. 114–125.

[78] Y. Ji, Y. Zhang, X. Xie, S. Li, P. Wang, X. Hu, Y. Zhang, and Y. Xie, “FPSA: A full system stack solution for reconfigurable ReRAM-Based NN accelerator architecture,” in Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS ’19. New York, NY, USA: Association for Computing Machinery, 2019, p. 733–747.

[79] S. Yu, "Neuro-inspired computing with emerging nonvolatile memorys," Proceedings of the IEEE, vol. 106, no. 2, pp. 260–285, 2018.

[80] M. Prezioso, F. Merrikh-Bayat, B. Hoskins, G. C. Adam, K. K. Likharev, and D. B. Strukov, "Training and operation of an integrated neuromorphic network based on metal-oxide memristors," Nature, vol. 521, no. 7550, pp. 61–64, 2015.

[81] G. W. Burr, R. M. Shelby, S. Sidler, C. di Nolfo, J. Jang, I. Boybat, R. S. Shenoy, P. Narayanan, K. Virwani, E. U. Giacometti, B. N. Kurdi, and H. Hwang, "Experimental demonstration and tolerancing of a large-scale neural network (165 000 synapses) using phase-change memory as the synaptic weight element," IEEE Transactions on Electron Devices, vol. 62, no. 11, pp. 3498–3507, 2015.

[82] V. Sze, Y. Chen, T. Yang, and J. S. Emer, "Efficient processing of deep neural networks: A tutorial and survey," Proceedings of the IEEE, vol. 105, no. 12, pp. 2295–2329, Dec 2017.

[83] H. Li, H. Y. Chen, Z. Chen, B. Chen, R. Liu, G. Qiu, P. Huang, F. Zhang, Z. Jiang, B. Gao, L. Liu, X. Liu, S. Yu, H.-S. P. Wong, and J. Kang, "Write disturb analyses on half-selected cells of cross-point RRAM arrays," in 2014 IEEE International Reliability Physics Symposium, 2014, pp. MY.3.1–MY.3.4.

[84] S. Li, W. Chen, Y. Luo, J. Hu, P. Gao, J. Ye, K. Kang, H. Chen, E. Li, and W. Y. Yin, "Fully coupled multiphysics simulation of crosstalk effect in bipolar resistive random access memory," IEEE Transactions on Electron Devices, vol. 64, no. 9, pp. 3647–3653, 2017.

[85] P. Mehra and S. Fineberg, "Fast and flexible persistence: the magic potion for fault-tolerance, scalability and performance in online data stores," in Proceedings of the International Parallel and Distributed Processing Symposium, 2004, pp. 206–218.

[86] R. Patel, X. Guo, Q. Guo, E. Ipek, and E. G. Friedman, "Reducing switching latency and energy in STT-MRAM caches with field-assisted writing," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 24, no. 1, pp. 129–138, 2016.

[87] V. Sousa, "Phase change materials engineering for RESET current reduction," in Proceedings of the Memory Workshop, 2012.

[88] C. Cagli, "Characterization and modelling of electrode impact in HfO2-based RRAM," in Proceedings of the Memory Workshop, 2012.

[89] T. C. Bressoud, T. Clark, and T. Kan, "The design and use of persistent memory on the DNCP hardware fault-tolerant platform," in Proceedings of the International Conference on Dependable Systems and Networks, 2001, pp. 487–492.

[90] M. Wilcox, "Add support for NV-DIMMs to ext4," https://lwn.net/Articles/613384/.

[91] Intel, "Intel Optane DC Persistent Memory," 2019, https://www.intel.com/content/www/us/en/architecture-and-technology/optane-dc-persistent-memory.html.

[92] C. Mohan, D. Haderle, B. Lindsay, H. Pirahesh, and P. Schwarz, "ARIES: A transaction recovery method supporting fine-granularity locking and partial rollbacks using write-ahead logging," ACM Trans. Database Syst., vol. 17, no. 1, pp. 94–162, Mar. 1992.

[93] A. Mathur, M. Cao, S. Bhattacharya, A. Dilger, A. Tomas, and L. Vivier, “The new ext4 filesystem: current status and future plans,” in Proceedings of the Linux symposium, vol. 2. Citeseer, 2007, pp. 21–33.

[94] Intel, “A collection of linux persistent memory programming examples,” https://github.com/pmem/linux-examples.

[95] K. Bailey, L. Ceze, S. D. Gribble, and H. M. Levy, “Operating system implications of fast, cheap, non-volatile memory,” in Proceedings of the 13th USENIX Conference on Hot Topics in Operating Systems, 2011, pp. 2–2.

[96] J. Meza, Y. Luo, S. Khan, J. Zhao, Y. Xie, and O. Mutlu, “A case for efficient hard- ware/software cooperative management of storage and memory,” in Proceedings of the 5th Workshop on Energy-Efficient Design (WEED), held in conjunction with the International Symposium on Computer Architecture (ISCA-40), 2013.

[97] B. C. Lee, P. Zhou, J. Yang, Y. Zhang, B. Zhao, E. Ipek, O. Mutlu, and D. Burger, “Phase- change technology and the future of main memory,” IEEE Micro, vol. 30, no. 1, pp. 143–143, Jan. 2010.

[98] W. Zhao, E. Belhaire, Q. Mistral, C. Chappert, V. Javerliac, B. Dieny, and E. Nicolle, “Macro-model of spin-transfer torque based magnetic tunnel junction device for hybrid magnetic-CMOS design,” in Behavioral Modeling and Simulation Workshop, Proceedings of the 2006 IEEE International, Sept 2006, pp. 40–43.

[99] J. Coburn, A. M. Caulfield, A. Akel, L. M. Grupp, R. K. Gupta, R. Jhala, and S. Swanson, “NV-heaps: making persistent objects fast and safe with next-generation, non-volatile mem- ories,” in International Conference on Architectural Support for Programming Languages and Operating Systems, 2011, pp. 105–118.

[100] J. Yang, Q. Wei, C. Chen, C. Wang, K. L. Yong, and B. He, “NV-Tree: Reducing con- sistency cost for NVM-based single level systems,” in Proceedings of the 13th USENIX Conference on File and Storage Technologies (FAST), 2015, pp. 167–181.

[101] S. Venkataraman, N. Tolia, P. Ranganathan, and R. H. Campbell, “Consistent and durable data structures for non-volatile byte-addressable memory,” in Proceedings of the 9th USENIX Conference on File and Stroage Technologies, ser. FAST’11. Berkeley, CA, USA: USENIX Association, 2011, pp. 5–5.

[102] H. Volos, S. Nalli, S. Panneerselvam, V. Varadarajan, P. Saxena, and M. M. Swift, “Aerie: Flexible file-system interfaces to storage-class memory,” in Proceedings of the Ninth European Conference on Computer Systems, ser. EuroSys ’14, 2014, pp. 14:1–14:14.

[103] X. Wu and A. Reddy, "SCMFS: a file system for storage class memory," in Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis. ACM, 2011, p. 39.

[104] J. Ou, J. Shu, and Y. Lu, “A high performance file system for non-volatile main memory,” in Proceedings of the Eleventh European Conference on Computer Systems (EuroSys). New York, NY, USA: ACM, 2016, pp. 12:1–12:16.

[105] J. Arulraj, A. Pavlo, and S. R. Dulloor, “Let’s talk about storage & recovery methods for non-volatile memory database systems,” in Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (SIGMOD). New York, NY, USA: Association for Computing Machinery, 2015, p. 707–722.

[106] O. Kaiyrakhmet, S. Lee, B. Nam, S. H. Noh, and Y. ri Choi, “SLM-DB: Single-level key-value store with persistent memory,” in 17th USENIX Conference on File and Storage Technologies (FAST 19). Boston, MA: USENIX Association, 2019, pp. 191–205.

[107] J. Zhao, O. Mutlu, and Y. Xie, “FIRM: Fair and high-performance memory control for persistent memory systems,” in Proceedings of the 47th International Symposium on Microarchitecture (MICRO-47), 2014.

[108] J. Ren, J. Zhao, S. Khan, J. Choi, Y. Wu, and O. Mutlu, “ThyNVM: Enabling software- transparent crash consistency in persistent memory systems,” in MICRO, 2015.

[109] S. Kannan, A. Gavrilovska, and K. Schwan, “Reducing the cost of persistence for non- volatile heaps in end user devices,” in Proceedings of the International Symposium on High Performance Computer Architecture, 2014, pp. 1–12.

[110] S. Pelley, P. M. Chen, and T. F. Wenisch, “Memory persistency,” in Proceedings of the International Symposium on Computer Architecture, 2014, pp. 1–12.

[111] N. Chatterjee, N. Muralimanohar, R. Balasubramonian, A. Davis, and N. P. Jouppi, “Staged Reads: Mitigating the Impact of DRAM Writes on DRAM Reads,” in International Symposium on High-Performance Computer Architecture, 2012, pp. 1–12.

[112] Y. Kim, V. Seshadri, D. Lee, J. Liu, and O. Mutlu, “A Case for Exploiting Subarray- level Parallelism (SALP) in DRAM,” in Proceedings of the 39th Annual International Symposium on Computer Architecture, 2012, pp. 368–379.

[113] B. C. Lee, E. Ipek, O. Mutlu, and D. Burger, “Phase change memory architecture and the quest for scalability,” Commun. ACM, vol. 53, no. 7, pp. 99–106, Jul. 2010.

[114] H. Yoon, J. Meza, R. Ausavarungnirun, R. A. Harding, and O. Mutlu, “Row buffer locality aware caching policies for hybrid memories,” in International Conference on Computer Design, 2012, pp. 1–8.

[115] D. Sengupta, Q. Wang, H. Volos, L. Cherkasova, J. Li, G. Magalhaes, and K. Schwan, "A framework for emulating non-volatile memory systems with different performance characteristics," in Proceedings of the 6th ACM/SPEC International Conference on Performance Engineering, 2015, pp. 317–320.

[116] A. C. De Melo, “The new linux ’perf’ tools,” in Slides from Linux Kongress, vol. 18, 2010, pp. 1–42.

[117] D. Heger, J. Jacobs, and S. Rao, “Flexible file system benchmark,” https://sourceforge.net/projects/ffsb/.

[118] Linux man-pages project, “open(2) – linux programmer’s manual,” 2016, http://man7.org/linux/man-pages/man2/open.2.html.

[119] P. Zhou, B. Zhao, J. Yang, and Y. Zhang, “A durable and energy efficient main memory using phase change memory technology,” in Proceedings of the 36th Annual International Symposium on Computer Architecture (ISCA). New York, NY, USA: Association for Computing Machinery, 2009, p. 14–23.

[120] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood, “Pin: Building customized program analysis tools with dynamic instrumentation,” in Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation, New York, NY, USA, 2005, pp. 190–200.

[121] The Apache Software Foundation, “Spark,” 2014, http://spark.incubator.apache.org/.

[122] VoltDB, “VoltDB: Smart data fast,” 2014, http://voltdb.com/.

[123] J. Barr, “EC2 in-memory processing update: Instances with 4 to 16 TB of memory and scale-out SAP HANA to 34 TB,” 2017.

[124] SAP, “SAP HANA: an in-memory, column-oriented, relational database management system,” 2014, http://www.saphana.com/.

[125] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in CVPR, 2015.

[126] O. Mutlu, “Memory scaling: A systems architecture perspective,” in IMW, 2013.

[127] T. Oh, H. Chung, J. Park, K. Lee, S. Oh, S. Doo, H. Kim, C. Lee, H. Kim, J. Lee, J. Lee, K. Ha, Y. Choi, Y. Cho, Y. Bae, T. Jang, C. Park, K. Park, S. Jang, and J. S. Choi, "A 3.2 Gbps/pin 8 Gbit 1.0 V LPDDR4 SDRAM With Integrated ECC Engine for Sub-1 V DRAM Core Operation," IEEE Journal of Solid-State Circuits, 2015.

[128] Y. Luo, S. Govindan, B. Sharma, M. Santaniello, J. Meza, A. Kansal, J. Liu, B. Khessib, K. Vaid, and O. Mutlu, “Characterizing application memory error vulnerability to optimize datacenter cost via heterogeneous-reliability memory,” in DSN, 2014.

137 [129] J. Meza, Q. Wu, S. Kumar, and O. Mutlu, “Revisiting memory errors in large-scale production data centers: Analysis and modeling of new trends from the field,” in DSN, 2015.

[130] B. Schroeder, E. Pinheiro, and W.-D. Weber, “DRAM errors in the wild: a large-scale field study,” in SIGMETRICS, 2009.

[131] “Data-centric innovation spotlight series: Big memory breakthrough for your biggest data challenges,” 2019, https://www.intel.com/content/www/us/en/architecture-and- technology/optane-dc-persistent-memory.html.

[132] Intel, “Intel launches Optane DIMMs up to 512GB: Apache Pass is here!” 2018, https://www.anandtech.com/show/12828/intel-launches-optane-dimms-up-to-512gb- apache-pass-is-here.

[133] Intel, “An intro to MCDRAM (high bandwidth memory) on Knights Land- ing,” 2016, https://software.intel.com/en-us/blogs/2016/01/20/an-intro-to-mcdram-high- bandwidth-memory-on-knights-landing.

[134] R. Azevedo, J. D. Davis, K. Strauss, P. Gopalan, M. Manasse, and S. Yekhanin, “Zombie memory: Extending memory lifetime by reviving dead blocks,” in ISCA, 2013.

[135] Intel, “PMDK: Persistent memory development kit,” 2019, https://github.com/pmem/pmdk/.

[136] S. Nalli, S. Haria, M. D. Hill, M. M. Swift, H. Volos, and K. Keeton, “An analysis of persistent memory use with WHISPER,” in ASPLOS, 2017.

[137] Y. Kim, R. Daly, J. Kim, C. Fallin, J. H. Lee, D. Lee, C. Wilkerson, K. Lai, and O. Mutlu, “Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors,” in ISCA, 2014.

[138] O. Mutlu and J. S. Kim, “Rowhammer: A retrospective,” IEEE TCAD, 2019.

[139] A. A. Hwang, I. A. Stefanovici, and B. Schroeder, “Cosmic rays don’t strike twice: understanding the nature of DRAM errors and the implications for system design,” in ASPLOS, 2012.

[140] D. S. Yaney, C. Y. Lu, R. A. Kohler, M. J. Kelly, and J. T. Nelson, “A meta-stable leakage phenomenon in DRAM charge storage - Variable hold time,” in IEDM, 1987.

[141] J. Liu, B. Jaiyen, Y. Kim, C. Wilkerson, and O. Mutlu, “An Experimental Study of Data Retention Behavior in Modern DRAM Devices: Implications for Retention Time Profiling Mechanisms,” in ISCA, 2013.

[142] M. Patel, J. S. Kim, and O. Mutlu, “The Reach Profiler (REAPER): Enabling the Mitigation of DRAM Retention Failures via Profiling at Aggressive Conditions,” in ISCA, 2017.

[143] P. J. Nair, D.-H. Kim, and M. K. Qureshi, "Archshield: Architectural framework for assisting DRAM scaling by tolerating high error rates," in ISCA, 2013.

[144] U. Kang, H.-s. Yu, C. Park, H. Zheng, J. Halbert, K. Bains, S. Jang, and J. S. Choi, “Co- architecting controllers and DRAM to enhance DRAM process scaling,” in The Memory Forum, 2014.

[145] U. Kang, H. Chung, S. Heo, D. Park, H. Lee, J. H. Kim, S. Ahn, S. Cha, J. Ahn, D. Kwon, J. Lee, H. Joo, W. Kim, D. H. Jang, N. S. Kim, J. Choi, T. Chung, J. Yoo, J. S. Choi, C. Kim, and Y. Jun, “8 Gb 3-D DDR3 DRAM Using Through-Silicon-Via Technology,” IEEE Journal of Solid-State Circuits, 2010.

[146] J. Sim, G. H. Loh, V. Sridharan, and M. O’Connor, “Resilient die-stacked DRAM caches,” in ISCA, 2013.

[147] P. J. Nair, D. A. Roberts, and M. K. Qureshi, “Citadel: Efficiently protecting stacked memory from TSV and large granularity failures,” ACM Transactions on Architecture and Code Optimization (TACO), vol. 12, no. 4, pp. 1–24, 2016.

[148] H.-M. Chen, C.-J. Wu, T. Mudge, and C. Chakrabarti, “RATT-ECC: Rate Adaptive Two- Tiered Error Correction Codes for Reliable 3D Die-Stacked Memory,” ACM TACO, 2016.

[149] G. Mappouras, A. Vahid, R. Calderbank, D. R. Hower, and D. J. Sorin, “Jenga: Efficient Fault Tolerance for Stacked DRAM,” in ICCD, 2017.

[150] E. Kültürsay, M. Kandemir, A. Sivasubramaniam, and O. Mutlu, "Evaluating STT-RAM as an energy-efficient main memory alternative," in ISPASS, 2013.

[151] Micron, “3D XPoint Technology: Breakthrough Nonvolatile Memory Technology,” 2016, https://www.micron.com/products/advanced-solutions/3d-xpoint-technology.

[152] K. Deierling, “Persistent memory over fabric: let it out of the box!” in Open Server Summit, 2016.

[153] K. Suzuki and S. Swanson, “The Non-Volatile Memory Technology Database (NVMDB),” 2015, http://nvmdb.ucsd.edu.

[154] N. Christiansen, “Storage class memory support in the Windows OS,” in SNIA NVM Summit, 2016.

[155] C. Diaconu, “Microsoft SQL Hekaton – Towards Large Scale Use of PM for In-memory Databases,” in SNIA NVM Summit, 2016.

[156] J. Moyer, “Persistent memory in Linux,” in SNIA NVM Summit, 2016.

[157] K. Deierling, “Persistent memory over fabric,” in SNIA NVM Summit, 2016.

[158] Z. Zhang, W. Xiao, N. Park, and D. J. Lilja, "Memory module-level testing and error behaviors for phase change memory," in ICCD, 2012.

[159] W. C. Chien, H. Y. Cheng, M. BrightSky, A. Ray, C. W. Yeh, W. Kim, R. Bruce, Y. Zhu, H. Y. Ho, H. L. Lung, and C. Lam, “Reliability study of a 128Mb phase change memory chip implemented with doped Ga-Sb-Ge with extraordinary thermal stability,” in IEDM, 2016.

[160] D. H. Yoon, N. Muralimanohar, J. Chang, P. Ranganathan, N. P. Jouppi, and M. Erez, “FREE-p: Protecting non-volatile memory against both hard and soft errors,” in HPCA, 2011.

[161] M. K. Qureshi, J. Karidis, M. Franceschini, V. Srinivasan, L. Lastras, and B. Abali, “En- hancing lifetime and security of PCM-based main memory with start-gap wear leveling,” in Proceedings of the 42Nd Annual IEEE/ACM International Symposium on Microarchi- tecture, 2009, pp. 14–23.

[162] M. Gupta, V. Sridharan, D. Roberts, A. Prodromou, A. Venkat, D. Tullsen, and R. Gupta, “Reliability-aware data placement for heterogeneous memory architecture,” in HPCA, 2018.

[163] S. Kim, “Area-efficient error protection for caches,” in DATE, 2006.

[164] D. H. Yoon and M. Erez, “Memory mapped ECC: low-cost error protection for last level caches,” in ISCA, 2009.

[165] D. H. Yoon and M. Erez, “Flexible cache error protection using an ECC FIFO,” in SC, 2009.

[166] A. Kolli, J. Rosen, S. Diestelhorst, A. Saidi, S. Pelley, S. Liu, P. M. Chen, and T. F. Wenisch, “Delegated Persist Ordering,” in MICRO, 2016.

[167] P. Koopman and T. Chakravarty, “Cyclic redundancy code (CRC) polynomial selection for embedded networks,” in DSN, 2004.

[168] M. K. Qureshi, M. M. Franceschini, and L. A. Lastras-Montano, “Improving read perfor- mance of phase change memories via write cancellation and write pausing,” in HPCA, 2010.

[169] Intel, “Over-provisioning nand-based intel SSDs for better endurance,” 2018, https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/over- provisioning-nand-based-ssds-better-endurance-whitepaper.pdf.

[170] D. Narayanan and O. Hodson, “Whole-system persistence,” in Proceedings of the Seven- teenth International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS XVII. New York, NY, USA: ACM, 2012, pp. 401–410.

[171] J. H. Ahn, S. Li, O. Seongil, and N. Jouppi, “McSimA+: A manycore simulator with application-level+ simulation and detailed microarchitecture modeling,” in Performance Analysis of Systems and Software (ISPASS), 2013 IEEE International Symposium on, 2013, pp. 74–85.

[172] S. Rixner, W. J. Dally, U. J. Kapasi, P. Mattson, and J. D. Owens, “Memory access scheduling,” in ISCA, 2000.

[173] W. K. Zuravleff and T. Robinson, “Controller for a Synchronous DRAM That Maximizes Throughput by Allowing Memory Requests and Commands to Be Issued Out of Order,” 1997.

[174] Micron, “Hybrid memory cube specification 2.1,” 2014, http://www.hybridmemorycube.org/.

[175] P. J. Nair, D. A. Roberts, and M. K. Qureshi, “FaultSim: A fast, configurable memory-reliability simulator for conventional and 3D-Stacked systems,” ACM TACO, 2015.

[176] H. Jeon, G. H. Loh, and M. Annavaram, “Efficient RAS support for die-stacked DRAM,” in ITC, 2014.

[177] J. L. Carlson, Redis in action. Manning Publications Co., 2013.

[178] B. Fitzpatrick, “Distributed caching with memcached,” Linux Journal, 2004.

[179] Transaction Processing Performance Council, “TPC-C benchmark, revision 5.11,” 2010.

[180] B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears, “Benchmarking cloud serving systems with YCSB,” in SoCC, 2010.

[181] A. Kopytov, “Sysbench manual,” MySQL AB, 2012.

[182] C. Bienia, S. Kumar, J. P. Singh, and K. Li, “The PARSEC benchmark suite: Characterization and architectural implications,” in PACT, 2008.

[183] D. Roberts and P. Nair, “FAULTSIM: A fast, configurable memory-resilience simulator,” in The Memory Forum: In conjunction with ISCA, 2014.

[184] M. Hu, J. P. Strachan, Z. Li, E. M. Grafals, N. Davila, C. Graves, S. Lam, N. Ge, J. J. Yang, and R. S. Williams, “Dot-product engine for neuromorphic computing: Programming 1T1M crossbar to accelerate matrix-vector multiplication,” in 2016 53rd ACM/EDAC/IEEE Design Automation Conference (DAC), 2016, pp. 1–6.

[185] C. Walczyk, D. Walczyk, T. Schroeder, T. Bertaud, M. Sowinska, M. Lukosius, M. Fraschke, D. Wolansky, B. Tillack, E. Miranda, and C. Wenger, “Impact of temperature on the resistive switching behavior of embedded HfO2-based RRAM devices,” IEEE TEC, vol. 58, no. 9, pp. 3124–3131, Sep. 2011.

[186] B. Liu, H. Li, Y. Chen, X. Li, Q. Wu, and T. Huang, “Vortex: Variation-aware Training for Memristor X-bar,” in DAC. New York, NY, USA: ACM, 2015, pp. 15:1–15:6.

[187] C. Liu, M. Hu, J. P. Strachan, and H. Li, “Rescuing memristor-based neuromorphic design with high defects,” in DAC, June 2017, pp. 1–6.

[188] L. Chen, J. Li, Y. Chen, Q. Deng, J. Shen, X. Liang, and L. Jiang, “Accelerator-friendly Neural-network Training: Learning Variations and Defects in RRAM Crossbar,” in DATE. Leuven, Belgium: European Design and Automation Association, 2017, pp. 19–24.

[189] M. V. Beigi and G. Memik, “Thermal-aware Optimizations of ReRAM-based Neuromorphic Computing Systems,” in DAC. New York, NY, USA: ACM, 2018, pp. 39:1–39:6.

[190] F. Alibart, L. Gao, B. D. Hoskins, and D. B. Strukov, “High precision tuning of state for memristive devices by adaptable variation-tolerant algorithm,” Nanotechnology, vol. 23, no. 7, p. 075201, Jan. 2012.

[191] D. Kim, J. Kung, S. Chai, S. Yalamanchili, and S. Mukhopadhyay, “Neurocube: A Programmable Digital Neuromorphic Architecture with High-Density 3D Memory,” in ACM/IEEE ISCA, June 2016, pp. 380–392.

[192] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.

[193] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.

[194] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,” IJCV, vol. 115, no. 3, pp. 211–252, 2015.

[195] E. Rotem, J. Hermerding, A. Cohen, and H. Cain, “Temperature measurement in the Intel(R) Core(TM) Duo Processor,” arXiv preprint arXiv:0709.1861, 2007.

[196] W. Huang, S. Ghosh, S. Velusamy, K. Sankaranarayanan, K. Skadron, and M. R. Stan, “HotSpot: A compact thermal modeling methodology for early-stage VLSI design,” IEEE TVLSI, vol. 14, no. 5, pp. 501–513, 2006.

[197] A. Agrawal, J. Torrellas, and S. Idgunji, “Xylem: Enhancing vertical thermal conduction in 3D processor-memory stacks,” in MICRO. ACM, 2017, pp. 546–559.

[198] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng, “TensorFlow: A system for large-scale machine learning,” in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). Savannah, GA: USENIX Association, Nov. 2016, pp. 265–283.

[199] Y. LeCun, “The MNIST database of handwritten digits,” http://yann.lecun.com/exdb/mnist/, 1998.

[200] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in IEEE CVPR, June 2009, pp. 248–255.

[201] D. Ielmini and H.-S. P. Wong, “In-memory computing with resistive switching devices,” Nature Electronics, vol. 1, no. 6, pp. 333–343, 2018.

[202] J. Ahn, S. Yoo, O. Mutlu, and K. Choi, “PIM-enabled instructions: A low-overhead, locality-aware processing-in-memory architecture,” in 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA), June 2015, pp. 336–348.

[203] S. Li, C. Xu, Q. Zou, J. Zhao, Y. Lu, and Y. Xie, “Pinatubo: A processing-in-memory architecture for bulk bitwise operations in emerging non-volatile memories,” in Proceedings of the 53rd Annual Design Automation Conference, ser. DAC ’16. New York, NY, USA: Association for Computing Machinery, 2016.

[204] A. Boroumand, S. Ghose, Y. Kim, R. Ausavarungnirun, E. Shiu, R. Thakur, D. Kim, A. Kuusela, A. Knies, P. Ranganathan, and O. Mutlu, “Google workloads for consumer devices: Mitigating data movement bottlenecks,” in Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). New York, NY, USA: Association for Computing Machinery, 2018, pp. 316–331.

[205] Y. Ji, Y. Zhang, W. Chen, and Y. Xie, “Bridge the gap between neural networks and neuromorphic hardware with a neural network compiler,” in Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS ’18. New York, NY, USA: Association for Computing Machinery, 2018, pp. 448–460.

[206] X. Qiao, X. Cao, H. Yang, L. Song, and H. Li, “AtomLayer: A universal ReRAM-based CNN accelerator with atomic layer computation,” in Proceedings of the 55th Annual Design Automation Conference, ser. DAC ’18. New York, NY, USA: Association for Computing Machinery, 2018.

[207] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, R. Boyle, P.-l. Cantin, C. Chao, C. Clark, J. Coriell, M. Daley, M. Dau, J. Dean, B. Gelb, T. V. Ghaemmaghami, R. Gottipati, W. Gulland, R. Hagmann, C. R. Ho, D. Hogberg, J. Hu, R. Hundt, D. Hurt, J. Ibarz, A. Jaffey, A. Jaworski, A. Kaplan, H. Khaitan, D. Killebrew, A. Koch, N. Kumar, S. Lacy, J. Laudon, J. Law, D. Le, C. Leary, Z. Liu, K. Lucke, A. Lundin, G. MacKean, A. Maggiore, M. Mahony, K. Miller, R. Nagarajan, R. Narayanaswami, R. Ni, K. Nix, T. Norrie, M. Omernick, N. Penukonda, A. Phelps, J. Ross, M. Ross, A. Salek, E. Samadiani, C. Severn, G. Sizikov, M. Snelham, J. Souter, D. Steinberg, A. Swing, M. Tan, G. Thorson, B. Tian, H. Toma, E. Tuttle, V. Vasudevan, R. Walter, W. Wang, E. Wilcox, and D. H. Yoon, “In-datacenter performance analysis of a tensor processing unit,” in Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA). New York, NY, USA: ACM, 2017, pp. 1–12.

[208] Y. Chen, T. Krishna, J. S. Emer, and V. Sze, “Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks,” IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 127–138, Jan 2017.

[209] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, “DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning,” in Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). New York, NY, USA: ACM, 2014, pp. 269–284.

[210] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, and O. Temam, “DaDianNao: A machine-learning supercomputer,” in Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-47). Washington, DC, USA: IEEE Computer Society, 2014, pp. 609–622.

[211] ONNX Project, “Open neural network exchange,” 2019, https://onnx.ai/.

[212] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, “PyTorch: An Imperative Style, High-Performance Deep Learning Library,” in Advances in Neural Information Processing Systems 32, 2019, pp. 8024–8035.

[213] Y. Kim, W. Yang, and O. Mutlu, “Ramulator: A fast and extensible DRAM simulator,” IEEE Computer Architecture Letters, vol. 15, no. 1, pp. 45–49, 2016.

[214] D. Fujiki, S. Mahlke, and R. Das, “In-memory data parallel processor,” SIGPLAN Not., vol. 53, no. 2, p. 1–14, Mar. 2018.

[215] L. Xia, B. Li, T. Tang, P. Gu, P. Chen, S. Yu, Y. Cao, Y. Wang, Y. Xie, and H. Yang, “MNSIM: Simulation platform for memristor-based neuromorphic computing system,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 37, no. 5, pp. 1009–1022, 2018.

[216] P. Chen, X. Peng, and S. Yu, “Neurosim: A circuit-level macro model for benchmarking neuro-inspired architectures in online learning,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 37, no. 12, pp. 3067–3080, 2018.

[217] N. Rotem, J. Fix, S. Abdulrasool, G. Catron, S. Deng, R. Dzhabarov, N. Gibson, J. Hegeman, M. Lele, R. Levenstein, J. Montgomery, B. Maher, S. Nadathur, J. Olesen, J. Park, A. Rakhov, M. Smelyanskiy, and M. Wang, “Glow: Graph lowering compiler techniques for neural networks,” 2018.

[218] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “MobileNetV2: Inverted Residuals and Linear Bottlenecks,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

[219] N. Ma, X. Zhang, H.-T. Zheng, and J. Sun, “ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design,” in The European Conference on Computer Vision (ECCV), September 2018.

[220] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.

[221] Torch Contributors, “torchvision – PyTorch master documentation,” 2019, https://pytorch.org/docs/stable/torchvision/index.html#torchvision.

[222] Synopsys, “Teaching resources for IC design,” 2020, https://www.synopsys.com/community/university-program/teaching-resources.html.

[223] B. Murmann, “ADC performance survey 1997-2011,” 2011, http://www.stanford.edu/~murmann/adcsurvey.html.

[224] L. Kull, T. Toifl, M. Schmatz, P. A. Francese, C. Menolfi, M. Braendli, M. Kossel, T. Morf, T. M. Andersen, and Y. Leblebici, “A 3.1mW 8b 1.2GS/s single-channel asynchronous SAR ADC with alternate comparators for enhanced speed in 32nm digital SOI CMOS,” in 2013 IEEE International Solid-State Circuits Conference Digest of Technical Papers, Feb 2013, pp. 468–469.

[225] M. Saberi, R. Lotfi, K. Mafinezhad, and W. A. Serdijn, “Analysis of power consumption and linearity in capacitive digital-to-analog converters used in successive approximation ADCs,” IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 58, no. 8, pp. 1736–1748, Aug 2011.

[226] N. Muralimanohar, R. Balasubramonian, and N. P. Jouppi, “CACTI 6.0: A tool to model large caches,” HP laboratories, vol. 27, p. 28, 2009.

[227] M. N. I. Khan and S. Ghosh, “Analysis of row hammer attack on STT-RAM,” in 2018 IEEE 36th International Conference on Computer Design (ICCD), 2018, pp. 75–82.

[228] A. M. S. Tosson, S. Yu, M. H. Anis, and L. Wei, “Analysis of RRAM reliability soft-errors on the performance of RRAM-Based neuromorphic systems,” in 2017 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), 2017, pp. 62–67.

[229] C. Chen, H. Shih, C. Wu, C. Lin, P. Chiu, S. Sheu, and F. T. Chen, “RRAM defect modeling and failure analysis based on march test and a novel squeeze-search scheme,” IEEE Transactions on Computers, vol. 64, no. 1, pp. 180–190, 2015.
